WEBVTT 0:00:01.921 --> 0:00:16.424 Hey welcome to today's lecture, what we today want to look at is how we can make new. 0:00:16.796 --> 0:00:26.458 So until now we have this global system, the encoder and the decoder mostly, and we haven't 0:00:26.458 --> 0:00:29.714 really thought about how long. 0:00:30.170 --> 0:00:42.684 And what we, for example, know is yeah, you can make the systems bigger in different ways. 0:00:42.684 --> 0:00:47.084 We can make them deeper so the. 0:00:47.407 --> 0:00:56.331 And if we have at least enough data that typically helps you make things performance better,. 0:00:56.576 --> 0:01:00.620 But of course leads to problems that we need more resources. 0:01:00.620 --> 0:01:06.587 That is a problem at universities where we have typically limited computation capacities. 0:01:06.587 --> 0:01:11.757 So at some point you have such big models that you cannot train them anymore. 0:01:13.033 --> 0:01:23.792 And also for companies is of course important if it costs you like to generate translation 0:01:23.792 --> 0:01:26.984 just by power consumption. 0:01:27.667 --> 0:01:35.386 So yeah, there's different reasons why you want to do efficient machine translation. 0:01:36.436 --> 0:01:48.338 One reason is there are different ways of how you can improve your machine translation 0:01:48.338 --> 0:01:50.527 system once we. 0:01:50.670 --> 0:01:55.694 There can be different types of data we looked into data crawling, monolingual data. 0:01:55.875 --> 0:01:59.024 All this data and the aim is always. 0:01:59.099 --> 0:02:05.735 Of course, we are not just purely interested in having more data, but the idea why we want 0:02:05.735 --> 0:02:12.299 to have more data is that more data also means that we have better quality because mostly 0:02:12.299 --> 0:02:17.550 we are interested in increasing the quality of the machine translation. 0:02:18.838 --> 0:02:24.892 But there's also other ways of how you can improve the quality of a machine translation. 0:02:25.325 --> 0:02:36.450 And what is, of course, that is where most research is focusing on. 0:02:36.450 --> 0:02:44.467 It means all we want to build better algorithms. 0:02:44.684 --> 0:02:48.199 Course: The other things are normally as good. 0:02:48.199 --> 0:02:54.631 Sometimes it's easier to improve, so often it's easier to just collect more data than 0:02:54.631 --> 0:02:57.473 to invent some great view algorithms. 0:02:57.473 --> 0:03:00.315 But yeah, both of them are important. 0:03:00.920 --> 0:03:09.812 But there is this third thing, especially with neural machine translation, and that means 0:03:09.812 --> 0:03:11.590 we make a bigger. 0:03:11.751 --> 0:03:16.510 Can be, as said, that we have more layers, that we have wider layers. 0:03:16.510 --> 0:03:19.977 The other thing we talked a bit about is ensemble. 0:03:19.977 --> 0:03:24.532 That means we are not building one new machine translation system. 0:03:24.965 --> 0:03:27.505 And we can easily build four. 0:03:27.505 --> 0:03:32.331 What is the typical strategy to build different systems? 0:03:32.331 --> 0:03:33.177 Remember. 0:03:35.795 --> 0:03:40.119 It should be of course a bit different if you have the same. 0:03:40.119 --> 0:03:44.585 If they all predict the same then combining them doesn't help. 0:03:44.585 --> 0:03:48.979 So what is the easiest way if you have to build four systems? 0:03:51.711 --> 0:04:01.747 And the Charleston's will take, but this is the best output of a single system. 
0:04:02.362 --> 0:04:10.165 Mean now, it's really three different systems so that you later can combine them and maybe 0:04:10.165 --> 0:04:11.280 the average. 0:04:11.280 --> 0:04:16.682 Ensembles are typically that the average is all probabilities. 0:04:19.439 --> 0:04:24.227 The idea is to think about neural networks. 0:04:24.227 --> 0:04:29.342 There's one parameter which can easily adjust. 0:04:29.342 --> 0:04:36.525 That's exactly the easiest way to randomize with three different. 0:04:37.017 --> 0:04:43.119 They have the same architecture, so all the hydroparameters are the same, but they are 0:04:43.119 --> 0:04:43.891 different. 0:04:43.891 --> 0:04:46.556 They will have different predictions. 0:04:48.228 --> 0:04:52.572 So, of course, bigger amounts. 0:04:52.572 --> 0:05:05.325 Some of these are a bit the easiest way of improving your quality because you don't really 0:05:05.325 --> 0:05:08.268 have to do anything. 0:05:08.588 --> 0:05:12.588 There is limits on that bigger models only get better. 0:05:12.588 --> 0:05:19.132 If you have enough training data you can't do like a handheld layer and you will not work 0:05:19.132 --> 0:05:24.877 on very small data but with a recent amount of data that is the easiest thing. 0:05:25.305 --> 0:05:33.726 However, they are challenging with making better models, bigger motors, and that is the 0:05:33.726 --> 0:05:34.970 computation. 0:05:35.175 --> 0:05:44.482 So, of course, if you have a bigger model that can mean that you have longer running 0:05:44.482 --> 0:05:49.518 times, if you have models, you have to times. 0:05:51.171 --> 0:05:56.685 Normally you cannot paralyze the different layers because the input to one layer is always 0:05:56.685 --> 0:06:02.442 the output of the previous layer, so you propagate that so it will also increase your runtime. 0:06:02.822 --> 0:06:10.720 Then you have to store all your models in memory. 0:06:10.720 --> 0:06:20.927 If you have double weights you will have: Is more difficult to then do back propagation. 0:06:20.927 --> 0:06:27.680 You have to store in between the activations, so there's not only do you increase the model 0:06:27.680 --> 0:06:31.865 in your memory, but also all these other variables that. 0:06:34.414 --> 0:06:36.734 And so in general it is more expensive. 0:06:37.137 --> 0:06:54.208 And therefore there's good reasons in looking into can we make these models sound more efficient. 0:06:54.134 --> 0:07:00.982 So it's been through the viewer, you can have it okay, have one and one day of training time, 0:07:00.982 --> 0:07:01.274 or. 0:07:01.221 --> 0:07:07.535 Forty thousand euros and then what is the best machine translation system I can get within 0:07:07.535 --> 0:07:08.437 this budget. 0:07:08.969 --> 0:07:19.085 And then, of course, you can make the models bigger, but then you have to train them shorter, 0:07:19.085 --> 0:07:24.251 and then we can make more efficient algorithms. 0:07:25.925 --> 0:07:31.699 If you think about efficiency, there's a bit different scenarios. 0:07:32.312 --> 0:07:43.635 So if you're more of coming from the research community, what you'll be doing is building 0:07:43.635 --> 0:07:47.913 a lot of models in your research. 0:07:48.088 --> 0:07:58.645 So you're having your test set of maybe sentences, calculating the blue score, then another model. 0:07:58.818 --> 0:08:08.911 So what that means is typically you're training on millions of cents, so your training time 0:08:08.911 --> 0:08:14.944 is long, maybe a day, but maybe in other cases a week. 
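A minimal sketch of the seed-based ensembling just described: the same architecture is trained several times with nothing changed but the random seed, and at decoding time the output distributions are averaged. `train_fn` and `next_word_probs` are placeholders for whatever toolkit is used, not a real API.

```python
import numpy as np

def train_ensemble(train_fn, seeds=(1, 2, 3, 4)):
    # train_fn(seed) is assumed to train one model with that seed and return it
    return [train_fn(seed) for seed in seeds]

def ensemble_next_word_probs(models, source, prefix):
    # each model gives a distribution over the vocabulary for the next word;
    # the ensemble simply averages these probabilities
    return np.mean([m.next_word_probs(source, prefix) for m in models], axis=0)

def greedy_ensemble_decode(models, source, eos_id, max_len=100):
    prefix = []
    for _ in range(max_len):
        probs = ensemble_next_word_probs(models, source, prefix)
        prefix.append(int(np.argmax(probs)))
        if prefix[-1] == eos_id:
            break
    return prefix
```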
0:08:15.135 --> 0:08:22.860 The testing is not really the cost efficient, but the training is very costly. 0:08:23.443 --> 0:08:37.830 If you are more thinking of building models for application, the scenario is quite different. 0:08:38.038 --> 0:08:46.603 And then you keep it running, and maybe thousands of customers are using it in translating. 0:08:46.603 --> 0:08:47.720 So in that. 0:08:48.168 --> 0:08:59.577 And we will see that it is not always the same type of challenges you can paralyze some 0:08:59.577 --> 0:09:07.096 things in training, which you cannot paralyze in testing. 0:09:07.347 --> 0:09:14.124 For example, in training you have to do back propagation, so you have to store the activations. 0:09:14.394 --> 0:09:23.901 Therefore, in testing we briefly discussed that we would do it in more detail today in 0:09:23.901 --> 0:09:24.994 training. 0:09:25.265 --> 0:09:36.100 You know they're a target and you can process everything in parallel while in testing. 0:09:36.356 --> 0:09:46.741 So you can only do one word at a time, and so you can less paralyze this. 0:09:46.741 --> 0:09:50.530 Therefore, it's important. 0:09:52.712 --> 0:09:55.347 Is a specific task on this. 0:09:55.347 --> 0:10:03.157 For example, it's the efficiency task where it's about making things as efficient. 0:10:03.123 --> 0:10:09.230 Is possible and they can look at different resources. 0:10:09.230 --> 0:10:14.207 So how much deep fuel run time do you need? 0:10:14.454 --> 0:10:19.366 See how much memory you need or you can have a fixed memory budget and then have to build 0:10:19.366 --> 0:10:20.294 the best system. 0:10:20.500 --> 0:10:29.010 And here is a bit like an example of that, so there's three teams from Edinburgh from 0:10:29.010 --> 0:10:30.989 and they submitted. 0:10:31.131 --> 0:10:36.278 So then, of course, if you want to know the most efficient system you have to do a bit 0:10:36.278 --> 0:10:36.515 of. 0:10:36.776 --> 0:10:44.656 You want to have a better quality or more runtime and there's not the one solution. 0:10:44.656 --> 0:10:46.720 You can improve your. 0:10:46.946 --> 0:10:49.662 And that you see that there are different systems. 0:10:49.909 --> 0:11:06.051 Here is how many words you can do for a second on the clock, and you want to be as talk as 0:11:06.051 --> 0:11:07.824 possible. 0:11:08.068 --> 0:11:08.889 And you see here a bit. 0:11:08.889 --> 0:11:09.984 This is a little bit different. 0:11:11.051 --> 0:11:27.717 You want to be there on the top right corner and you can get a score of something between 0:11:27.717 --> 0:11:29.014 words. 0:11:30.250 --> 0:11:34.161 Two hundred and fifty thousand, then you'll ever come and score zero point three. 0:11:34.834 --> 0:11:41.243 There is, of course, any bit of a decision, but the question is, like how far can you again? 0:11:41.243 --> 0:11:47.789 Some of all these points on this line would be winners because they are somehow most efficient 0:11:47.789 --> 0:11:53.922 in a way that there's no system which achieves the same quality with less computational. 0:11:57.657 --> 0:12:04.131 So there's the one question of which resources are you interested. 0:12:04.131 --> 0:12:07.416 Are you running it on CPU or GPU? 0:12:07.416 --> 0:12:11.668 There's different ways of paralyzing stuff. 0:12:14.654 --> 0:12:20.777 Another dimension is how you process your data. 0:12:20.777 --> 0:12:27.154 There's really the best processing and streaming. 
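Before coming back to batch versus streaming processing: a small sketch of the quality/speed trade-off curve mentioned above, where the "winners" are the systems not dominated by any other system (no other system is both faster and at least as good). The numbers are invented for illustration.

```python
# hypothetical submissions: decoding speed in words per second and a quality score
systems = {
    "big-transformer":      {"words_per_sec": 1_000,  "score": 0.86},
    "deep-enc-shallow-dec": {"words_per_sec": 4_000,  "score": 0.85},
    "small-student":        {"words_per_sec": 20_000, "score": 0.80},
    "badly-tuned-system":   {"words_per_sec": 2_000,  "score": 0.78},
}

def pareto_winners(systems):
    winners = []
    for name, a in systems.items():
        dominated = any(
            b is not a
            and b["words_per_sec"] >= a["words_per_sec"]
            and b["score"] >= a["score"]
            for b in systems.values()
        )
        if not dominated:
            winners.append(name)
    return winners

print(pareto_winners(systems))  # everything except "badly-tuned-system"
```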
0:12:27.647 --> 0:12:34.672 So in batch processing you have the whole document available so you can translate all 0:12:34.672 --> 0:12:39.981 sentences in perimeter and then you're interested in throughput. 0:12:40.000 --> 0:12:43.844 But you can then process, for example, especially in GPS. 0:12:43.844 --> 0:12:49.810 That's interesting, you're not translating one sentence at a time, but you're translating 0:12:49.810 --> 0:12:56.108 one hundred sentences or so in parallel, so you have one more dimension where you can paralyze 0:12:56.108 --> 0:12:57.964 and then be more efficient. 0:12:58.558 --> 0:13:14.863 On the other hand, for example sorts of documents, so we learned that if you do badge processing 0:13:14.863 --> 0:13:16.544 you have. 0:13:16.636 --> 0:13:24.636 Then, of course, it makes sense to sort the sentences in order to have the minimum thing 0:13:24.636 --> 0:13:25.535 attached. 0:13:27.427 --> 0:13:32.150 The other scenario is more the streaming scenario where you do life translation. 0:13:32.512 --> 0:13:40.212 So in that case you can't wait for the whole document to pass, but you have to do. 0:13:40.520 --> 0:13:49.529 And then, for example, that's especially in situations like speech translation, and then 0:13:49.529 --> 0:13:53.781 you're interested in things like latency. 0:13:53.781 --> 0:14:00.361 So how much do you have to wait to get the output of a sentence? 0:14:06.566 --> 0:14:16.956 Finally, there is the thing about the implementation: Today we're mainly looking at different algorithms, 0:14:16.956 --> 0:14:23.678 different models of how you can model them in your machine translation system, but of 0:14:23.678 --> 0:14:29.227 course for the same algorithms there's also different implementations. 0:14:29.489 --> 0:14:38.643 So, for example, for a machine translation this tool could be very fast. 0:14:38.638 --> 0:14:46.615 So they have like coded a lot of the operations very low resource, not low resource, low level 0:14:46.615 --> 0:14:49.973 on the directly on the QDAC kernels in. 0:14:50.110 --> 0:15:00.948 So the same attention network is typically more efficient in that type of algorithm. 0:15:00.880 --> 0:15:02.474 Than in in any other. 0:15:03.323 --> 0:15:13.105 Of course, it might be other disadvantages, so if you're a little worker or have worked 0:15:13.105 --> 0:15:15.106 in the practical. 0:15:15.255 --> 0:15:22.604 Because it's normally easier to understand, easier to change, and so on, but there is again 0:15:22.604 --> 0:15:23.323 a train. 0:15:23.483 --> 0:15:29.440 You have to think about, do you want to include this into my study or comparison or not? 0:15:29.440 --> 0:15:36.468 Should it be like I compare different implementations and I also find the most efficient implementation? 0:15:36.468 --> 0:15:39.145 Or is it only about the pure algorithm? 0:15:42.742 --> 0:15:50.355 Yeah, when building these systems there is a different trade-off to do. 0:15:50.850 --> 0:15:56.555 So there's one of the traders between memory and throughput, so how many words can generate 0:15:56.555 --> 0:15:57.299 per second. 0:15:57.557 --> 0:16:03.351 So typically you can easily like increase your scruple by increasing the batch size. 0:16:03.643 --> 0:16:06.899 So that means you are translating more sentences in parallel. 0:16:07.107 --> 0:16:09.241 And gypsies are very good at that stuff. 0:16:09.349 --> 0:16:15.161 It should translate one sentence or one hundred sentences, not the same time, but its. 
0:16:15.115 --> 0:16:20.784 Rough are very similar because they are at this efficient metrics multiplication so that 0:16:20.784 --> 0:16:24.415 you can do the same operation on all sentences parallel. 0:16:24.415 --> 0:16:30.148 So typically that means if you increase your benchmark you can do more things in parallel 0:16:30.148 --> 0:16:31.995 and you will translate more. 0:16:31.952 --> 0:16:33.370 Second. 0:16:33.653 --> 0:16:43.312 On the other hand, with this advantage, of course you will need higher badge sizes and 0:16:43.312 --> 0:16:44.755 more memory. 0:16:44.965 --> 0:16:56.452 To begin with, the other problem is that you have such big models that you can only translate 0:16:56.452 --> 0:16:59.141 with lower bed sizes. 0:16:59.119 --> 0:17:08.466 If you are running out of memory with translating, one idea to go on that is to decrease your. 0:17:13.453 --> 0:17:24.456 Then there is the thing about quality in Screwport, of course, and before it's like larger models, 0:17:24.456 --> 0:17:28.124 but in generally higher quality. 0:17:28.124 --> 0:17:31.902 The first one is always this way. 0:17:32.092 --> 0:17:38.709 Course: Not always larger model helps you have over fitting at some point, but in generally. 0:17:43.883 --> 0:17:52.901 And with this a bit on this training and testing thing we had before. 0:17:53.113 --> 0:17:58.455 So it wears all the difference between training and testing, and for the encoder and decoder. 0:17:58.798 --> 0:18:06.992 So if we are looking at what mentioned before at training time, we have a source sentence 0:18:06.992 --> 0:18:17.183 here: And how this is processed on a is not the attention here. 0:18:17.183 --> 0:18:21.836 That's a tubical transformer. 0:18:22.162 --> 0:18:31.626 And how we can do that on a is that we can paralyze the ear ever since. 0:18:31.626 --> 0:18:40.422 The first thing to know is: So that is, of course, not in all cases. 0:18:40.422 --> 0:18:49.184 We'll later talk about speech translation where we might want to translate. 0:18:49.389 --> 0:18:56.172 Without the general case in, it's like you have the full sentence you want to translate. 0:18:56.416 --> 0:19:02.053 So the important thing is we are here everything available on the source side. 0:19:03.323 --> 0:19:13.524 And then this was one of the big advantages that you can remember back of transformer. 0:19:13.524 --> 0:19:15.752 There are several. 0:19:16.156 --> 0:19:25.229 But the other one is now that we can calculate the full layer. 0:19:25.645 --> 0:19:29.318 There is no dependency between this and this state or this and this state. 0:19:29.749 --> 0:19:36.662 So we always did like here to calculate the key value and query, and based on that you 0:19:36.662 --> 0:19:37.536 calculate. 0:19:37.937 --> 0:19:46.616 Which means we can do all these calculations here in parallel and in parallel. 0:19:48.028 --> 0:19:55.967 And there, of course, is this very efficiency because again for GPS it's too bigly possible 0:19:55.967 --> 0:20:00.887 to do these things in parallel and one after each other. 0:20:01.421 --> 0:20:10.311 And then we can also for each layer one by one, and then we calculate here the encoder. 0:20:10.790 --> 0:20:21.921 In training now an important thing is that for the decoder we have the full sentence available 0:20:21.921 --> 0:20:28.365 because we know this is the target we should generate. 0:20:29.649 --> 0:20:33.526 We have models now in a different way. 0:20:33.526 --> 0:20:38.297 This hidden state is only on the previous ones. 
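A brief aside on the memory point above: if decoding runs out of GPU memory, the simplest fallback is to halve the batch size and retry, trading throughput for memory. `translate_batch` and the exception type are placeholders for whatever framework is used.

```python
class GpuOutOfMemory(RuntimeError):
    """Placeholder for the framework's out-of-memory error."""

def translate_corpus(sentences, translate_batch, batch_size=64):
    outputs, i = [], 0
    while i < len(sentences):
        batch = sentences[i:i + batch_size]
        try:
            outputs.extend(translate_batch(batch))   # larger batches: higher throughput
            i += batch_size
        except GpuOutOfMemory:
            batch_size = max(1, batch_size // 2)     # smaller batches: less memory
    return outputs
```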
0:20:38.598 --> 0:20:51.887 And the first thing here depends only on this information, so you see if you remember we 0:20:51.887 --> 0:20:56.665 had this masked self-attention. 0:20:56.896 --> 0:21:04.117 So that means, of course, we can only calculate the decoder once the encoder is done, but that's. 0:21:04.444 --> 0:21:06.656 Percent can calculate the end quarter. 0:21:06.656 --> 0:21:08.925 Then we can calculate here the decoder. 0:21:09.569 --> 0:21:25.566 But again in training we have x, y and that is available so we can calculate everything 0:21:25.566 --> 0:21:27.929 in parallel. 0:21:28.368 --> 0:21:40.941 So the interesting thing or advantage of transformer is in training. 0:21:40.941 --> 0:21:46.408 We can do it for the decoder. 0:21:46.866 --> 0:21:54.457 That means you will have more calculations because you can only calculate one layer at 0:21:54.457 --> 0:22:02.310 a time, but for example the length which is too bigly quite long or doesn't really matter 0:22:02.310 --> 0:22:03.270 that much. 0:22:05.665 --> 0:22:10.704 However, in testing this situation is different. 0:22:10.704 --> 0:22:13.276 In testing we only have. 0:22:13.713 --> 0:22:20.622 So this means we start with a sense: We don't know the full sentence yet because we ought 0:22:20.622 --> 0:22:29.063 to regularly generate that so for the encoder we have the same here but for the decoder. 0:22:29.409 --> 0:22:39.598 In this case we only have the first and the second instinct, but only for all states in 0:22:39.598 --> 0:22:40.756 parallel. 0:22:41.101 --> 0:22:51.752 And then we can do the next step for y because we are putting our most probable one. 0:22:51.752 --> 0:22:58.643 We do greedy search or beam search, but you cannot do. 0:23:03.663 --> 0:23:16.838 Yes, so if we are interesting in making things more efficient for testing, which we see, for 0:23:16.838 --> 0:23:22.363 example in the scenario of really our. 0:23:22.642 --> 0:23:34.286 It makes sense that we think about our architecture and that we are currently working on attention 0:23:34.286 --> 0:23:35.933 based models. 0:23:36.096 --> 0:23:44.150 The decoder there is some of the most time spent testing and testing. 0:23:44.150 --> 0:23:47.142 It's similar, but during. 0:23:47.167 --> 0:23:50.248 Nothing about beam search. 0:23:50.248 --> 0:23:59.833 It might be even more complicated because in beam search you have to try different. 0:24:02.762 --> 0:24:15.140 So the question is what can you now do in order to make your model more efficient and 0:24:15.140 --> 0:24:21.905 better in translation in these types of cases? 0:24:24.604 --> 0:24:30.178 And the one thing is to look into the encoded decoder trailer. 0:24:30.690 --> 0:24:43.898 And then until now we typically assume that the depth of the encoder and the depth of the 0:24:43.898 --> 0:24:48.154 decoder is roughly the same. 0:24:48.268 --> 0:24:55.553 So if you haven't thought about it, you just take what is running well. 0:24:55.553 --> 0:24:57.678 You would try to do. 0:24:58.018 --> 0:25:04.148 However, we saw now that there is a quite big challenge and the runtime is a lot longer 0:25:04.148 --> 0:25:04.914 than here. 0:25:05.425 --> 0:25:14.018 The question is also the case for the calculations, or do we have there the same issue that we 0:25:14.018 --> 0:25:21.887 only get the good quality if we are having high and high, so we know that making these 0:25:21.887 --> 0:25:25.415 more depths is increasing our quality. 
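A minimal sketch of the contrast just described, under the usual causal-mask formulation: in training the whole target is known, so masked self-attention lets all decoder positions be computed in one batched call, while at test time the model has to be called once per generated word. `decoder_step` is a placeholder for the real model.

```python
import numpy as np

def causal_mask(length):
    # mask[i, j] is True if position i may attend to position j (only j <= i);
    # with this mask, all target positions can be processed in parallel in training
    return np.tril(np.ones((length, length), dtype=bool))

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]

def greedy_decode(decoder_step, encoder_states, bos_id, eos_id, max_len=100):
    # decoder_step(encoder_states, prefix) returns the next-word distribution;
    # each call needs the previous output, so these calls cannot be parallelized
    prefix = [bos_id]
    while len(prefix) <= max_len and prefix[-1] != eos_id:
        probs = decoder_step(encoder_states, prefix)
        prefix.append(int(np.argmax(probs)))
    return prefix[1:]
```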
0:25:25.425 --> 0:25:31.920 But what we haven't talked about is really important that we increase the depth the same 0:25:31.920 --> 0:25:32.285 way. 0:25:32.552 --> 0:25:41.815 So what we can put instead also do is something like this where you have a deep encoder and 0:25:41.815 --> 0:25:42.923 a shallow. 0:25:43.163 --> 0:25:57.386 So that would be that you, for example, have instead of having layers on the encoder, and 0:25:57.386 --> 0:25:59.757 layers on the. 0:26:00.080 --> 0:26:10.469 So in this case the overall depth from start to end would be similar and so hopefully. 0:26:11.471 --> 0:26:21.662 But we could a lot more things hear parallelized, and hear what is costly at the end during decoding 0:26:21.662 --> 0:26:22.973 the decoder. 0:26:22.973 --> 0:26:29.330 Because that does change in an outer regressive way, there we. 0:26:31.411 --> 0:26:33.727 And that that can be analyzed. 0:26:33.727 --> 0:26:38.734 So here is some examples: Where people have done all this. 0:26:39.019 --> 0:26:55.710 So here it's mainly interested on the orange things, which is auto-regressive about the 0:26:55.710 --> 0:26:57.607 speed up. 0:26:57.717 --> 0:27:15.031 You have the system, so agree is not exactly the same, but it's similar. 0:27:15.055 --> 0:27:23.004 It's always the case if you look at speed up. 0:27:23.004 --> 0:27:31.644 Think they put a speed of so that's the baseline. 0:27:31.771 --> 0:27:35.348 So between and times as fast. 0:27:35.348 --> 0:27:42.621 If you switch from a system to where you have layers in the. 0:27:42.782 --> 0:27:52.309 You see that although you have slightly more parameters, more calculations are also roughly 0:27:52.309 --> 0:28:00.283 the same, but you can speed out because now during testing you can paralyze. 0:28:02.182 --> 0:28:09.754 The other thing is that you're speeding up, but if you look at the performance it's similar, 0:28:09.754 --> 0:28:13.500 so sometimes you improve, sometimes you lose. 0:28:13.500 --> 0:28:20.421 There's a bit of losing English to Romania, but in general the quality is very slow. 0:28:20.680 --> 0:28:30.343 So you see that you can keep a similar performance while improving your speed by just having different. 0:28:30.470 --> 0:28:34.903 And you also see the encoder layers from speed. 0:28:34.903 --> 0:28:38.136 They don't really metal that much. 0:28:38.136 --> 0:28:38.690 Most. 0:28:38.979 --> 0:28:50.319 Because if you compare the 12th system to the 6th system you have a lower performance 0:28:50.319 --> 0:28:57.309 with 6th and colder layers but the speed is similar. 0:28:57.897 --> 0:29:02.233 And see the huge decrease is it maybe due to a lack of data. 0:29:03.743 --> 0:29:11.899 Good idea would say it's not the case. 0:29:11.899 --> 0:29:23.191 Romanian English should have the same number of data. 0:29:24.224 --> 0:29:31.184 Maybe it's just that something in that language. 0:29:31.184 --> 0:29:40.702 If you generate Romanian maybe they need more target dependencies. 0:29:42.882 --> 0:29:46.263 The Wine's the Eye Also Don't Know Any Sex People Want To. 0:29:47.887 --> 0:29:49.034 There could be yeah the. 0:29:49.889 --> 0:29:58.962 As the maybe if you go from like a movie sphere to a hybrid sphere, you can: It's very much 0:29:58.962 --> 0:30:12.492 easier to expand the vocabulary to English, but it must be the vocabulary. 0:30:13.333 --> 0:30:21.147 Have to check, but would assume that in this case the system is not retrained, but it's 0:30:21.147 --> 0:30:22.391 trained with. 
0:30:22.902 --> 0:30:30.213 And that's why I was assuming that they have the same data, but maybe you're right that in this 0:30:30.213 --> 0:30:35.595 paper, for example, the decoder was pre-trained on English. 0:30:36.096 --> 0:30:43.733 But I don't remember exactly if they do something like that, but that could be a good explanation. 0:30:45.325 --> 0:30:52.457 So this is one of the easiest ways to speed up. 0:30:52.457 --> 0:31:01.443 You just switch two hyperparameters and don't have to implement anything. 0:31:02.722 --> 0:31:08.367 Of course, there's other ways of doing that. 0:31:08.367 --> 0:31:11.880 We'll look into two things. 0:31:11.880 --> 0:31:16.521 The other thing is the architecture. 0:31:16.796 --> 0:31:28.154 We are now using self-attention everywhere in the baselines that we are doing. 0:31:28.488 --> 0:31:39.978 However, in translation, on the decoder side, it might not be the best solution. 0:31:39.978 --> 0:31:41.845 There is no rule that it has to be the same. 0:31:42.222 --> 0:31:47.130 So we can use different types of architectures in the encoder and the decoder. 0:31:47.747 --> 0:31:52.475 And there's two ways of what you could do differently, or there's more ways. 0:31:52.912 --> 0:31:54.825 We will look into two today. 0:31:54.825 --> 0:31:58.842 The one is average attention, which is a very simple solution. 0:31:59.419 --> 0:32:01.464 You can do it as the name says. 0:32:01.464 --> 0:32:04.577 It's not really attending anymore. 0:32:04.577 --> 0:32:08.757 It's just like equal attention to everything. 0:32:09.249 --> 0:32:23.422 And the other idea, which is currently done in most systems which are optimized for efficiency, 0:32:23.422 --> 0:32:24.913 is that we keep the transformer encoder. 0:32:25.065 --> 0:32:32.623 But on the decoder side we are then not using transformer or self-attention, but we are using a 0:32:32.623 --> 0:32:39.700 recurrent neural network, because there the disadvantages of recurrent neural networks matter less. 0:32:39.799 --> 0:32:48.353 And the recurrent step is normally easier to calculate because it only depends on 0:32:48.353 --> 0:32:49.684 the input and the previous state. 0:32:51.931 --> 0:33:02.190 So what is the difference in decoding, and why is self-attention maybe not efficient 0:33:02.190 --> 0:33:03.841 for decoding? 0:33:04.204 --> 0:33:14.390 In a recurrent network, if we want to compute the new state, we only have to look at the input and the previous 0:33:14.390 --> 0:33:15.649 state. 0:33:16.136 --> 0:33:19.029 In convolutional networks 0:33:19.029 --> 0:33:19.994 we have a 0:33:19.980 --> 0:33:31.291 dependency on a fixed number of previous states, but those are rarely used for decoding. 0:33:31.291 --> 0:33:39.774 In contrast, in the transformer we have this long dependency: 0:33:40.000 --> 0:33:52.760 y t depends on everything from y one to y t minus one, and that is mainly not very efficient; I mean, 0:33:52.760 --> 0:33:56.053 it's very good for quality because we can model all these dependencies. 0:33:56.276 --> 0:34:03.543 However, the disadvantage is that we also have to do all these calculations, so if we 0:34:03.543 --> 0:34:10.895 view it more from the point of view of efficient calculation, this might not be the best. 0:34:11.471 --> 0:34:20.517 So the question is, can we change our architecture to keep some of the advantages but make things 0:34:20.517 --> 0:34:21.994 more efficient? 0:34:24.284 --> 0:34:31.131 The one idea is what is called average attention, and the interesting thing is that this 0:34:31.131 --> 0:34:32.610 works surprisingly well. 0:34:33.013 --> 0:34:38.917 The only idea is that you change the decoder. 0:34:38.917 --> 0:34:42.646 You're not computing attention weights anymore.
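Sketched below is the core of this equal-weight (average) attention; the verbal walkthrough follows. Each target position simply averages all previous decoder inputs, and a running sum makes every step a constant amount of work. This shows only the cumulative-average part; a full average-attention layer would add further components around it.

```python
import numpy as np

class AverageAttention:
    def __init__(self, dim):
        self.running_sum = np.zeros(dim)   # the "tilde" accumulator from the lecture
        self.t = 0

    def step(self, y_t):
        # y_t: representation of the current target position
        self.t += 1
        self.running_sum += y_t            # g~_t = g~_{t-1} + y_t
        return self.running_sum / self.t   # equal weight 1/t for every previous position

# check: the incremental result equals explicitly averaging the whole history
ys = [np.random.randn(8) for _ in range(5)]
aan = AverageAttention(8)
outputs = [aan.step(y) for y in ys]
assert np.allclose(outputs[-1], np.mean(ys, axis=0))
```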
0:34:42.646 --> 0:34:46.790 The attention weights are all the same. 0:34:47.027 --> 0:35:00.723 So you don't calculate with query and key the different weights, and then you just take 0:35:00.723 --> 0:35:03.058 equal weights. 0:35:03.283 --> 0:35:07.585 So here would be one third from this, one third from this, and one third. 0:35:09.009 --> 0:35:14.719 And while it is sufficient you can now do precalculation and things get more efficient. 0:35:15.195 --> 0:35:18.803 So first go the formula that's maybe not directed here. 0:35:18.979 --> 0:35:38.712 So the difference here is that your new hint stage is the sum of all the hint states, then. 0:35:38.678 --> 0:35:40.844 So here would be with this. 0:35:40.844 --> 0:35:45.022 It would be one third of this plus one third of this. 0:35:46.566 --> 0:35:57.162 But if you calculate it this way, it's not yet being more efficient because you still 0:35:57.162 --> 0:36:01.844 have to sum over here all the hidden. 0:36:04.524 --> 0:36:22.932 But you can not easily speed up these things by having an in between value, which is just 0:36:22.932 --> 0:36:24.568 always. 0:36:25.585 --> 0:36:30.057 If you take this as ten to one, you take this one class this one. 0:36:30.350 --> 0:36:36.739 Because this one then was before this, and this one was this, so in the end. 0:36:37.377 --> 0:36:49.545 So now this one is not the final one in order to get the final one to do the average. 0:36:49.545 --> 0:36:50.111 So. 0:36:50.430 --> 0:37:00.264 But then if you do this calculation with speed up you can do it with a fixed number of steps. 0:37:00.180 --> 0:37:11.300 Instead of the sun which depends on age, so you only have to do calculations to calculate 0:37:11.300 --> 0:37:12.535 this one. 0:37:12.732 --> 0:37:21.718 Can you do the lakes and the lakes? 0:37:21.718 --> 0:37:32.701 For example, light bulb here now takes and. 0:37:32.993 --> 0:37:38.762 That's a very good point and that's why this is now in the image. 0:37:38.762 --> 0:37:44.531 It's not very good so this is the one with tilder and the tilder. 0:37:44.884 --> 0:37:57.895 So this one is just the sum of these two, because this is just this one. 0:37:58.238 --> 0:38:08.956 So the sum of this is exactly as the sum of these, and the sum of these is the sum of here. 0:38:08.956 --> 0:38:15.131 So you only do the sum in here, and the multiplying. 0:38:15.255 --> 0:38:22.145 So what you can mainly do here is you can do it more mathematically. 0:38:22.145 --> 0:38:31.531 You can know this by tea taking out of the sum, and then you can calculate the sum different. 0:38:36.256 --> 0:38:42.443 That maybe looks a bit weird and simple, so we were all talking about this great attention 0:38:42.443 --> 0:38:47.882 that we can focus on different parts, and a bit surprising on this work is now. 0:38:47.882 --> 0:38:53.321 In the end it might also work well without really putting and just doing equal. 0:38:53.954 --> 0:38:56.164 Mean it's not that easy. 0:38:56.376 --> 0:38:58.261 It's like sometimes this is working. 0:38:58.261 --> 0:39:00.451 There's also report weight work that well. 0:39:01.481 --> 0:39:05.848 But I think it's an interesting way and it maybe shows that a lot of. 0:39:05.805 --> 0:39:10.624 Things in the self or in the transformer paper which are more put as like yet. 
0:39:10.624 --> 0:39:15.930 These are some hyperpermetheuss around it, like that you do the layer norm in between, 0:39:15.930 --> 0:39:21.785 and that you do a feat forward before, and things like that, that these are also all important, 0:39:21.785 --> 0:39:25.566 and that the right set up around that is also very important. 0:39:28.969 --> 0:39:38.598 The other thing you can do in the end is not completely different from this one. 0:39:38.598 --> 0:39:42.521 It's just like a very different. 0:39:42.942 --> 0:39:54.338 And that is a recurrent network which also has this type of highway connection that can 0:39:54.338 --> 0:40:01.330 ignore the recurrent unit and directly put the input. 0:40:01.561 --> 0:40:10.770 It's not really adding out, but if you see the hitting step is your input, but what you 0:40:10.770 --> 0:40:15.480 can do is somehow directly go to the output. 0:40:17.077 --> 0:40:28.390 These are the four components of the simple return unit, and the unit is motivated by GIS 0:40:28.390 --> 0:40:33.418 and by LCMs, which we have seen before. 0:40:33.513 --> 0:40:43.633 And that has proven to be very good for iron ends, which allows you to have a gate on your. 0:40:44.164 --> 0:40:48.186 In this thing we have two gates, the reset gate and the forget gate. 0:40:48.768 --> 0:40:57.334 So first we have the general structure which has a cell state. 0:40:57.334 --> 0:41:01.277 Here we have the cell state. 0:41:01.361 --> 0:41:09.661 And then this goes next, and we always get the different cell states over the times that. 0:41:10.030 --> 0:41:11.448 This Is the South Stand. 0:41:11.771 --> 0:41:16.518 How do we now calculate that just assume we have an initial cell safe here? 0:41:17.017 --> 0:41:19.670 But the first thing is we're doing the forget game. 0:41:20.060 --> 0:41:34.774 The forgetting models should the new cell state mainly depend on the previous cell state 0:41:34.774 --> 0:41:40.065 or should it depend on our age. 0:41:40.000 --> 0:41:41.356 Like Add to Them. 0:41:41.621 --> 0:41:42.877 How can we model that? 0:41:44.024 --> 0:41:45.599 First we were at a cocktail. 0:41:45.945 --> 0:41:52.151 The forget gait is depending on minus one. 0:41:52.151 --> 0:41:56.480 You also see here the former. 0:41:57.057 --> 0:42:01.963 So we are multiplying both the cell state and our input. 0:42:01.963 --> 0:42:04.890 With some weights we are getting. 0:42:05.105 --> 0:42:08.472 We are putting some Bay Inspector and then we are doing Sigma Weed on that. 0:42:08.868 --> 0:42:13.452 So in the end we have numbers between zero and one saying for each dimension. 0:42:13.853 --> 0:42:22.041 Like how much if it's near to zero we will mainly use the new input. 0:42:22.041 --> 0:42:31.890 If it's near to one we will keep the input and ignore the input at this dimension. 0:42:33.313 --> 0:42:40.173 And by this motivation we can then create here the new sound state, and here you see 0:42:40.173 --> 0:42:41.141 the formal. 0:42:41.601 --> 0:42:55.048 So you take your foot back gate and multiply it with your class. 0:42:55.048 --> 0:43:00.427 So if my was around then. 0:43:00.800 --> 0:43:07.405 In the other case, when the value was others, that's what you added. 0:43:07.405 --> 0:43:10.946 Then you're adding a transformation. 0:43:11.351 --> 0:43:24.284 So if this value was maybe zero then you're putting most of the information from inputting. 0:43:25.065 --> 0:43:26.947 Is already your element? 0:43:26.947 --> 0:43:30.561 The only question is now based on your element. 
0:43:30.561 --> 0:43:32.067 What is the output? 0:43:33.253 --> 0:43:47.951 And there you have another opportunity, so you can either take the output of the cell or instead 0:43:47.951 --> 0:43:50.957 prefer the input. 0:43:52.612 --> 0:43:58.166 So is the weight matrix also the same for the reset gate and the forget gate, 0:43:58.166 --> 0:43:59.417 or is it a different one? 0:44:00.900 --> 0:44:10.004 Yes, exactly, so the matrices are different, and therefore the two gates can be, and should be, different, 0:44:10.004 --> 0:44:16.323 because sometimes you want to keep information in one place but not in the other. 0:44:16.636 --> 0:44:23.843 So here again we have this vector with values between zero and one which controls how 0:44:23.843 --> 0:44:25.205 the information flows. 0:44:25.505 --> 0:44:36.459 And then the output is calculated similarly to the cell state, but again the input is gated. 0:44:36.536 --> 0:44:45.714 So the reset gate decides whether to output what is currently stored in the cell, or to pass the input through directly. 0:44:46.346 --> 0:44:58.647 So it's not exactly the thing we had before with the residual connections, where we just added 0:44:58.647 --> 0:45:01.293 things up; here we do a gated combination. 0:45:04.224 --> 0:45:08.472 This is the general idea of a simple recurrent unit. 0:45:08.472 --> 0:45:13.125 Then we will now look at how we can make things even more efficient. 0:45:13.125 --> 0:45:17.104 But first, do you have more questions on how it is working? 0:45:23.063 --> 0:45:38.799 Now, these calculations are where things can get more efficient, because at the moment 0:45:38.718 --> 0:45:43.177 each dimension depends on all the other dimensions. 0:45:43.423 --> 0:45:48.904 Because if you do a matrix multiplication with a vector, like for the output vector, each 0:45:48.904 --> 0:45:52.353 dimension of the output vector depends on all the dimensions of the input. 0:45:52.973 --> 0:46:06.561 The cell state is used here, so each dimension 0:46:06.561 --> 0:46:11.340 of the new cell state depends on all dimensions of the previous one. 0:46:11.931 --> 0:46:17.973 That, of course, again makes things less parallelizable, because everything 0:46:17.973 --> 0:46:18.481 depends on everything. 0:46:19.359 --> 0:46:35.122 We can easily change that by replacing the matrix product on the cell state with an element-wise vector product. 0:46:35.295 --> 0:46:51.459 So you multiply element-wise: the first dimension with the first dimension, the second with the second. 0:46:52.032 --> 0:46:53.772 This weight vector is, of course, 0:46:53.772 --> 0:46:59.294 different for the reset gate and for the forget gate. 0:46:59.899 --> 0:47:12.053 Now the first dimension only depends on the first dimension, so you don't have dependencies 0:47:12.053 --> 0:47:16.148 between dimensions any longer. 0:47:18.078 --> 0:47:25.692 Maybe it gets a bit clearer if you look at it in this way, so what we have to do now: 0:47:25.966 --> 0:47:31.911 First, we have to do a matrix multiplication on the input to get the transformed input and the gates. 0:47:32.292 --> 0:47:38.041 And then we only have the element-wise operations, where we take this output, 0:47:38.041 --> 0:47:38.713 we take 0:47:39.179 --> 0:47:42.978 the cell state from t minus one and our original input. 0:47:42.978 --> 0:47:52.748 Here we only have element-wise operations, which can be optimally parallelized. 0:47:53.273 --> 0:48:07.603 So here we can additionally parallelize across the dimensions and don't have dependencies between them. 0:48:09.929 --> 0:48:24.255 Yeah, and the matrix multiplications you can again do in parallel for all x t.
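The sketch below shows exactly this split, in an SRU-style form where the gates depend only on the input x_t (the lecture's variant can additionally let the forget gate look at the previous cell state element-wise): all matrix products are computed for every time step at once, and the remaining recurrence over the cell state is purely element-wise.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_layer(X, W, W_f, b_f, W_r, b_r):
    # X: (T, d) inputs for all time steps; W, W_f, W_r: (d, d); b_f, b_r: (d,)
    X_tilde = X @ W.T              # transformed input, parallel over all t
    F = sigmoid(X @ W_f.T + b_f)   # forget gates, parallel over all t
    R = sigmoid(X @ W_r.T + b_r)   # reset gates, parallel over all t

    c = np.zeros(X.shape[1])
    H = np.zeros_like(X)
    for t in range(X.shape[0]):    # sequential over time, but only element-wise work
        c = F[t] * c + (1.0 - F[t]) * X_tilde[t]
        H[t] = R[t] * c + (1.0 - R[t]) * X[t]   # highway: gate between cell state and input
    return H
```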
0:48:24.544 --> 0:48:33.014 Here you can't do it in parallel, but you only have to do it on each seat, and then you 0:48:33.014 --> 0:48:34.650 can parallelize. 0:48:35.495 --> 0:48:39.190 But this maybe for the dimension. 0:48:39.190 --> 0:48:42.124 Maybe it's also important. 0:48:42.124 --> 0:48:46.037 I don't know if they have tried it. 0:48:46.037 --> 0:48:55.383 I assume it's not only for dimension reduction, but it's hard because you can easily. 0:49:01.001 --> 0:49:08.164 People have even like made the second thing even more easy. 0:49:08.164 --> 0:49:10.313 So there is this. 0:49:10.313 --> 0:49:17.954 This is how we have the highway connections in the transformer. 0:49:17.954 --> 0:49:20.699 Then it's like you do. 0:49:20.780 --> 0:49:24.789 So that is like how things are put together as a transformer. 0:49:25.125 --> 0:49:39.960 And that is a similar and simple recurring neural network where you do exactly the same 0:49:39.960 --> 0:49:44.512 for the so you don't have. 0:49:46.326 --> 0:49:47.503 This type of things. 0:49:49.149 --> 0:50:01.196 And with this we are at the end of how to make efficient architectures before we go to 0:50:01.196 --> 0:50:02.580 the next. 0:50:13.013 --> 0:50:24.424 Between the ink or the trader and the architectures there is a next technique which is used in 0:50:24.424 --> 0:50:28.988 nearly all deburning very successful. 0:50:29.449 --> 0:50:43.463 So the idea is can we extract the knowledge from a large network into a smaller one, but 0:50:43.463 --> 0:50:45.983 it's similarly. 0:50:47.907 --> 0:50:53.217 And the nice thing is that this really works, and it may be very, very surprising. 0:50:53.673 --> 0:51:03.000 So the idea is that we have a large straw model which we train for long, and the question 0:51:03.000 --> 0:51:07.871 is: Can that help us to train a smaller model? 0:51:08.148 --> 0:51:16.296 So can what we refer to as teacher model tell us better to build a small student model than 0:51:16.296 --> 0:51:17.005 before. 0:51:17.257 --> 0:51:27.371 So what we're before in it as a student model, we learn from the data and that is how we train 0:51:27.371 --> 0:51:28.755 our systems. 0:51:29.249 --> 0:51:37.949 The question is: Can we train this small model better if we are not only learning from the 0:51:37.949 --> 0:51:46.649 data, but we are also learning from a large model which has been trained maybe in the same 0:51:46.649 --> 0:51:47.222 data? 0:51:47.667 --> 0:51:55.564 So that you have then in the end a smaller model that is somehow better performing than. 0:51:55.895 --> 0:51:59.828 And maybe that's on the first view. 0:51:59.739 --> 0:52:05.396 Very very surprising because it has seen the same data so it should have learned the same 0:52:05.396 --> 0:52:11.053 so the baseline model trained only on the data and the student teacher knowledge to still 0:52:11.053 --> 0:52:11.682 model it. 0:52:11.682 --> 0:52:17.401 They all have seen only this data because your teacher modeling was also trained typically 0:52:17.401 --> 0:52:19.161 only on this model however. 0:52:20.580 --> 0:52:30.071 It has by now shown that by many ways the model trained in the teacher and analysis framework 0:52:30.071 --> 0:52:32.293 is performing better. 0:52:33.473 --> 0:52:40.971 A bit of an explanation when we see how that works. 0:52:40.971 --> 0:52:46.161 There's different ways of doing it. 0:52:46.161 --> 0:52:47.171 Maybe. 0:52:47.567 --> 0:52:51.501 So how does it work? 
0:52:51.501 --> 0:53:04.802 This is our student network, the normal one, some type of new network. 0:53:04.802 --> 0:53:06.113 We're. 0:53:06.586 --> 0:53:17.050 So we are training the model to predict the same thing as we are doing that by calculating. 0:53:17.437 --> 0:53:23.173 The cross angry loss was defined in a way where saying all the probabilities for the 0:53:23.173 --> 0:53:25.332 correct word should be as high. 0:53:25.745 --> 0:53:32.207 So you are calculating your alphabet probabilities always, and each time step you have an alphabet 0:53:32.207 --> 0:53:33.055 probability. 0:53:33.055 --> 0:53:38.669 What is the most probable in the next word and your training signal is put as much of 0:53:38.669 --> 0:53:43.368 your probability mass to the correct word to the word that is there in. 0:53:43.903 --> 0:53:51.367 And this is the chief by this cross entry loss, which says with some of the all training 0:53:51.367 --> 0:53:58.664 examples of all positions, with some of the full vocabulary, and then this one is this 0:53:58.664 --> 0:54:03.947 one that this current word is the case word in the vocabulary. 0:54:04.204 --> 0:54:11.339 And then we take here the lock for the ability of that, so what we made me do is: We have 0:54:11.339 --> 0:54:27.313 this metric here, so each position of your vocabulary size. 0:54:27.507 --> 0:54:38.656 In the end what you just do is some of these three lock probabilities, and then you want 0:54:38.656 --> 0:54:40.785 to have as much. 0:54:41.041 --> 0:54:54.614 So although this is a thumb over this metric here, in the end of each dimension you. 0:54:54.794 --> 0:55:06.366 So that is a normal cross end to be lost that we have discussed at the very beginning of 0:55:06.366 --> 0:55:07.016 how. 0:55:08.068 --> 0:55:15.132 So what can we do differently in the teacher network? 0:55:15.132 --> 0:55:23.374 We also have a teacher network which is trained on large data. 0:55:24.224 --> 0:55:35.957 And of course this distribution might be better than the one from the small model because it's. 0:55:36.456 --> 0:55:40.941 So in this case we have now the training signal from the teacher network. 0:55:41.441 --> 0:55:46.262 And it's the same way as we had before. 0:55:46.262 --> 0:55:56.507 The only difference is we're training not the ground truths per ability distribution 0:55:56.507 --> 0:55:59.159 year, which is sharp. 0:55:59.299 --> 0:56:11.303 That's also a probability, so this word has a high probability, but have some probability. 0:56:12.612 --> 0:56:19.577 And that is the main difference. 0:56:19.577 --> 0:56:30.341 Typically you do like the interpretation of these. 0:56:33.213 --> 0:56:38.669 Because there's more information contained in the distribution than in the front booth, 0:56:38.669 --> 0:56:44.187 because it encodes more information about the language, because language always has more 0:56:44.187 --> 0:56:47.907 options to put alone, that's the same sentence yes exactly. 0:56:47.907 --> 0:56:53.114 So there's ambiguity in there that is encoded hopefully very well in the complaint. 0:56:53.513 --> 0:56:57.257 Trade you two networks so better than a student network you have in there from your learner. 0:56:57.537 --> 0:57:05.961 So maybe often there's only one correct word, but it might be two or three, and then all 0:57:05.961 --> 0:57:10.505 of these three have a probability distribution. 0:57:10.590 --> 0:57:21.242 And then is the main advantage or one explanation of why it's better to train from the. 
0:57:21.361 --> 0:57:32.652 Of course, it's good to also keep the signal in there because then you can prevent it because 0:57:32.652 --> 0:57:33.493 crazy. 0:57:37.017 --> 0:57:49.466 Any more questions on the first type of knowledge distillation, also distribution changes. 0:57:50.550 --> 0:58:02.202 Coming around again, this would put it a bit different, so this is not a solution to maintenance 0:58:02.202 --> 0:58:04.244 or distribution. 0:58:04.744 --> 0:58:12.680 But don't think it's performing worse than only doing the ground tours because they also. 0:58:13.113 --> 0:58:21.254 So it's more like it's not improving you would assume it's similarly helping you, but. 0:58:21.481 --> 0:58:28.145 Of course, if you now have a teacher, maybe you have no danger on your target to Maine, 0:58:28.145 --> 0:58:28.524 but. 0:58:28.888 --> 0:58:39.895 Then you can use this one which is not the ground truth but helpful to learn better for 0:58:39.895 --> 0:58:42.147 the distribution. 0:58:46.326 --> 0:58:57.012 The second idea is to do sequence level knowledge distillation, so what we have in this case 0:58:57.012 --> 0:59:02.757 is we have looked at each position independently. 0:59:03.423 --> 0:59:05.436 Mean, we do that often. 0:59:05.436 --> 0:59:10.972 We are not generating a lot of sequences, but that has a problem. 0:59:10.972 --> 0:59:13.992 We have this propagation of errors. 0:59:13.992 --> 0:59:16.760 We start with one area and then. 0:59:17.237 --> 0:59:27.419 So if we are doing word-level knowledge dissolution, we are treating each word in the sentence independently. 0:59:28.008 --> 0:59:32.091 So we are not trying to like somewhat model the dependency between. 0:59:32.932 --> 0:59:47.480 We can try to do that by sequence level knowledge dissolution, but the problem is, of course,. 0:59:47.847 --> 0:59:53.478 So we can that for each position we can get a distribution over all the words at this. 0:59:53.793 --> 1:00:05.305 But if we want to have a distribution of all possible target sentences, that's not possible 1:00:05.305 --> 1:00:06.431 because. 1:00:08.508 --> 1:00:15.940 Area, so we can then again do a bit of a heck on that. 1:00:15.940 --> 1:00:23.238 If we can't have a distribution of all sentences, it. 1:00:23.843 --> 1:00:30.764 So what we can't do is you can not use the teacher network and sample different translations. 1:00:31.931 --> 1:00:39.327 And now we can do different ways to train them. 1:00:39.327 --> 1:00:49.343 We can use them as their probability, the easiest one to assume. 1:00:50.050 --> 1:00:56.373 So what that ends to is that we're taking our teacher network, we're generating some 1:00:56.373 --> 1:01:01.135 translations, and these ones we're using as additional trading. 1:01:01.781 --> 1:01:11.382 Then we have mainly done this sequence level because the teacher network takes us. 1:01:11.382 --> 1:01:17.513 These are all probable translations of the sentence. 1:01:26.286 --> 1:01:34.673 And then you can do a bit of a yeah, and you can try to better make a bit of an interpolated 1:01:34.673 --> 1:01:36.206 version of that. 1:01:36.716 --> 1:01:42.802 So what people have also done is like subsequent level interpolations. 1:01:42.802 --> 1:01:52.819 You generate here several translations: But then you don't use all of them. 1:01:52.819 --> 1:02:00.658 You do some metrics on which of these ones. 
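A small sketch of the word-level distillation loss described above: at every target position the student is trained both towards the one-hot reference word and towards the teacher's full output distribution, interpolated with a weight alpha (the shapes and the alpha value are illustrative assumptions). For sequence-level distillation the same reference loss is used, but the target sentences are translations generated by the teacher.

```python
import numpy as np

def word_level_kd_loss(student_log_probs, teacher_probs, reference_ids, alpha=0.5):
    # student_log_probs: (T, V) log-probabilities of the student model
    # teacher_probs:     (T, V) probabilities of the (frozen) teacher model
    # reference_ids:     (T,)   indices of the ground-truth target words
    positions = np.arange(len(reference_ids))
    ce_reference = -student_log_probs[positions, reference_ids].sum()
    ce_teacher = -(teacher_probs * student_log_probs).sum()   # cross-entropy to soft targets
    return alpha * ce_teacher + (1.0 - alpha) * ce_reference
```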
1:02:01.021 --> 1:02:12.056 So it's a bit more training on this brown chose which might be improbable or unreachable 1:02:12.056 --> 1:02:16.520 because we can generate everything. 1:02:16.676 --> 1:02:23.378 And we are giving it an easier solution which is also good quality and training of that. 1:02:23.703 --> 1:02:32.602 So you're not training it on a very difficult solution, but you're training it on an easier 1:02:32.602 --> 1:02:33.570 solution. 1:02:36.356 --> 1:02:38.494 Any More Questions to This. 1:02:40.260 --> 1:02:41.557 Yeah. 1:02:41.461 --> 1:02:44.296 Good. 1:02:43.843 --> 1:03:01.642 Is to look at the vocabulary, so the problem is we have seen that vocabulary calculations 1:03:01.642 --> 1:03:06.784 are often very presuming. 1:03:09.789 --> 1:03:19.805 The thing is that most of the vocabulary is not needed for each sentence, so in each sentence. 1:03:20.280 --> 1:03:28.219 The question is: Can we somehow easily precalculate, which words are probable to occur in the sentence, 1:03:28.219 --> 1:03:30.967 and then only calculate these ones? 1:03:31.691 --> 1:03:34.912 And this can be done so. 1:03:34.912 --> 1:03:43.932 For example, if you have sentenced card, it's probably not happening. 1:03:44.164 --> 1:03:48.701 So what you can try to do is to limit your vocabulary. 1:03:48.701 --> 1:03:51.093 You're considering for each. 1:03:51.151 --> 1:04:04.693 So you're no longer taking the full vocabulary as possible output, but you're restricting. 1:04:06.426 --> 1:04:18.275 That typically works is that we limit it by the most frequent words we always take because 1:04:18.275 --> 1:04:23.613 these are not so easy to align to words. 1:04:23.964 --> 1:04:32.241 To take the most treatment taggin' words and then work that often aligns with one of the 1:04:32.241 --> 1:04:32.985 source. 1:04:33.473 --> 1:04:46.770 So for each source word you calculate the word alignment on your training data, and then 1:04:46.770 --> 1:04:51.700 you calculate which words occur. 1:04:52.352 --> 1:04:57.680 And then for decoding you build this union of maybe the source word list that other. 1:04:59.960 --> 1:05:02.145 Are like for each source work. 1:05:02.145 --> 1:05:08.773 One of the most frequent translations of these source words, for example for each source work 1:05:08.773 --> 1:05:13.003 like in the most frequent ones, and then the most frequent. 1:05:13.193 --> 1:05:24.333 In total, if you have short sentences, you have a lot less words, so in most cases it's 1:05:24.333 --> 1:05:26.232 not more than. 1:05:26.546 --> 1:05:33.957 And so you have dramatically reduced your vocabulary, and thereby can also fax a depot. 1:05:35.495 --> 1:05:43.757 That easy does anybody see what is challenging here and why that might not always need. 1:05:47.687 --> 1:05:54.448 The performance is not why this might not. 1:05:54.448 --> 1:06:01.838 If you implement it, it might not be a strong. 1:06:01.941 --> 1:06:06.053 You have to store this list. 1:06:06.053 --> 1:06:14.135 You have to burn the union and of course your safe time. 1:06:14.554 --> 1:06:21.920 The second thing the vocabulary is used in our last step, so we have the hidden state, 1:06:21.920 --> 1:06:23.868 and then we calculate. 1:06:24.284 --> 1:06:29.610 Now we are not longer calculating them for all output words, but for a subset of them. 1:06:30.430 --> 1:06:35.613 However, this metric multiplication is typically parallelized with the perfect but good. 
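A sketch of the vocabulary selection just described: the output layer is restricted, per sentence, to the union of the globally most frequent target words and, for every source word, its most frequent aligned translations. `frequent_targets` and `alignment_table` would be precomputed from the training data with a word aligner; here they are assumed to be given.

```python
def candidate_vocab(source_words, frequent_targets, alignment_table, per_word=10):
    # frequent_targets: e.g. the few thousand most frequent target words
    # alignment_table:  source word -> target words sorted by alignment frequency
    candidates = set(frequent_targets)
    for word in source_words:
        candidates.update(alignment_table.get(word, [])[:per_word])
    return sorted(candidates)

# The softmax is then computed only over the output-matrix rows belonging to
# these candidates instead of over the full vocabulary.
```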
1:06:35.956 --> 1:06:46.937 But if you not only calculate some of them, if you're not modeling it right, it will take 1:06:46.937 --> 1:06:52.794 as long as before because of the nature of the. 1:06:56.776 --> 1:07:07.997 Here for beam search there's some ideas of course you can go back to greedy search because 1:07:07.997 --> 1:07:10.833 that's more efficient. 1:07:11.651 --> 1:07:18.347 And better quality, and you can buffer some states in between, so how much buffering it's 1:07:18.347 --> 1:07:22.216 again this tradeoff between calculation and memory. 1:07:25.125 --> 1:07:41.236 Then at the end of today what we want to look into is one last type of new machine translation 1:07:41.236 --> 1:07:42.932 approach. 1:07:43.403 --> 1:07:53.621 And the idea is what we've already seen in our first two steps is that this ultra aggressive 1:07:53.621 --> 1:07:57.246 park is taking community coding. 1:07:57.557 --> 1:08:04.461 Can process everything in parallel, but we are always taking the most probable and then. 1:08:05.905 --> 1:08:10.476 The question is: Do we really need to do that? 1:08:10.476 --> 1:08:14.074 Therefore, there is a bunch of work. 1:08:14.074 --> 1:08:16.602 Can we do it differently? 1:08:16.602 --> 1:08:19.616 Can we generate a full target? 1:08:20.160 --> 1:08:29.417 We'll see it's not that easy and there's still an open debate whether this is really faster 1:08:29.417 --> 1:08:31.832 and quality, but think. 1:08:32.712 --> 1:08:45.594 So, as said, what we have done is our encoder decoder where we can process our encoder color, 1:08:45.594 --> 1:08:50.527 and then the output always depends. 1:08:50.410 --> 1:08:54.709 We generate the output and then we have to put it here the wide because then everything 1:08:54.709 --> 1:08:56.565 depends on the purpose of the output. 1:08:56.916 --> 1:09:10.464 This is what is referred to as an outer-regressive model and nearly outs speech generation and 1:09:10.464 --> 1:09:16.739 language generation or works in this outer. 1:09:18.318 --> 1:09:21.132 So the motivation is, can we do that more efficiently? 1:09:21.361 --> 1:09:31.694 And can we somehow process all target words in parallel? 1:09:31.694 --> 1:09:41.302 So instead of doing it one by one, we are inputting. 1:09:45.105 --> 1:09:46.726 So how does it work? 1:09:46.726 --> 1:09:50.587 So let's first have a basic auto regressive mode. 1:09:50.810 --> 1:09:53.551 So the encoder looks as it is before. 1:09:53.551 --> 1:09:58.310 That's maybe not surprising because here we know we can paralyze. 1:09:58.618 --> 1:10:04.592 So we have put in here our ink holder and generated the ink stash, so that's exactly 1:10:04.592 --> 1:10:05.295 the same. 1:10:05.845 --> 1:10:16.229 However, now we need to do one more thing: One challenge is what we had before and that's 1:10:16.229 --> 1:10:26.799 a challenge of natural language generation like machine translation. 1:10:32.672 --> 1:10:38.447 We generate until we generate this out of end of center stock, but if we now generate 1:10:38.447 --> 1:10:44.625 everything at once that's no longer possible, so we cannot generate as long because we only 1:10:44.625 --> 1:10:45.632 generated one. 1:10:46.206 --> 1:10:58.321 So the question is how can we now determine how long the sequence is, and we can also accelerate. 1:11:00.000 --> 1:11:06.384 Yes, but there would be one idea, and there is other work which tries to do that. 
1:11:06.806 --> 1:11:15.702 However, in here there's some work already done before and maybe you remember we had the 1:11:15.702 --> 1:11:20.900 IBM models and there was this concept of fertility. 1:11:21.241 --> 1:11:26.299 The concept of fertility is means like for one saucepan, and how many target pores does 1:11:26.299 --> 1:11:27.104 it translate? 1:11:27.847 --> 1:11:34.805 And exactly that we try to do here, and that means we are calculating like at the top we 1:11:34.805 --> 1:11:36.134 are calculating. 1:11:36.396 --> 1:11:42.045 So it says word is translated into word. 1:11:42.045 --> 1:11:54.171 Word might be translated into words into, so we're trying to predict in how many words. 1:11:55.935 --> 1:12:10.314 And then the end of the anchor, so this is like a length estimation. 1:12:10.314 --> 1:12:15.523 You can do it otherwise. 1:12:16.236 --> 1:12:24.526 You initialize your decoder input and we know it's good with word embeddings so we're trying 1:12:24.526 --> 1:12:28.627 to do the same thing and what people then do. 1:12:28.627 --> 1:12:35.224 They initialize it again with word embedding but in the frequency of the. 1:12:35.315 --> 1:12:36.460 So we have the cartilage. 1:12:36.896 --> 1:12:47.816 So one has two, so twice the is and then one is, so that is then our initialization. 1:12:48.208 --> 1:12:57.151 In other words, if you don't predict fertilities but predict lengths, you can just initialize 1:12:57.151 --> 1:12:57.912 second. 1:12:58.438 --> 1:13:07.788 This often works a bit better, but that's the other. 1:13:07.788 --> 1:13:16.432 Now you have everything in training and testing. 1:13:16.656 --> 1:13:18.621 This is all available at once. 1:13:20.280 --> 1:13:31.752 Then we can generate everything in parallel, so we have the decoder stack, and that is now 1:13:31.752 --> 1:13:33.139 as before. 1:13:35.395 --> 1:13:41.555 And then we're doing the translation predictions here on top of it in order to do. 1:13:43.083 --> 1:13:59.821 And then we are predicting here the target words and once predicted, and that is the basic 1:13:59.821 --> 1:14:00.924 idea. 1:14:01.241 --> 1:14:08.171 Machine translation: Where the idea is, we don't have to do one by one what we're. 1:14:10.210 --> 1:14:13.900 So this looks really, really, really great. 1:14:13.900 --> 1:14:20.358 On the first view there's one challenge with this, and this is the baseline. 1:14:20.358 --> 1:14:27.571 Of course there's some improvements, but in general the quality is often significant. 1:14:28.068 --> 1:14:32.075 So here you see the baseline models. 1:14:32.075 --> 1:14:38.466 You have a loss of ten blue points or something like that. 1:14:38.878 --> 1:14:40.230 So why does it change? 1:14:40.230 --> 1:14:41.640 So why is it happening? 1:14:43.903 --> 1:14:56.250 If you look at the errors there is repetitive tokens, so you have like or things like that. 1:14:56.536 --> 1:15:01.995 Broken senses or influent senses, so that exactly where algebra aggressive models are 1:15:01.995 --> 1:15:04.851 very good, we say that's a bit of a problem. 1:15:04.851 --> 1:15:07.390 They generate very fluid transcription. 1:15:07.387 --> 1:15:10.898 Translation: Sometimes there doesn't have to do anything with the input. 1:15:11.411 --> 1:15:14.047 But generally it really looks always very fluid. 1:15:14.995 --> 1:15:20.865 Here exactly the opposite, so the problem is that we don't have really fluid translation. 1:15:21.421 --> 1:15:26.123 And that is mainly due to the challenge that we have this independent assumption. 
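A sketch of the fertility-based parallel decoding described above: each source position predicts how many target words it produces, the decoder input is built by copying every source embedding that many times, and all target words are then predicted in a single parallel pass with an independent argmax per position. `predict_fertilities` and `decoder` are placeholders.

```python
import numpy as np

def build_decoder_input(source_embeddings, fertilities):
    # source_embeddings: (S, d); fertilities: (S,) non-negative integers,
    # assumed to sum to at least one
    copies = [np.repeat(source_embeddings[i:i + 1], f, axis=0)
              for i, f in enumerate(fertilities) if f > 0]
    return np.concatenate(copies, axis=0)       # (target length, d)

def non_autoregressive_translate(source_embeddings, predict_fertilities, decoder):
    fertilities = predict_fertilities(source_embeddings)    # implicit length prediction
    decoder_input = build_decoder_input(source_embeddings, fertilities)
    log_probs = decoder(decoder_input)                      # one parallel pass over all positions
    return log_probs.argmax(axis=-1)                        # each position decided independently
```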
1:15:26.646 --> 1:15:35.873 So in this case, the probability of the word at the second position is modeled independently of 1:15:35.873 --> 1:15:40.632 the first output position, given only the source, so we don't know what was generated there. 1:15:40.632 --> 1:15:43.740 We are just generating each position on its own. 1:15:43.964 --> 1:15:55.439 You can see it also in a few examples. 1:15:55.439 --> 1:16:03.636 One effect is that you over-penalize shifts. 1:16:04.024 --> 1:16:10.566 And the problem is similar in the following example. 1:16:11.071 --> 1:16:19.900 So you can, for example, translate a sentence one way, or maybe you could also translate it 1:16:19.900 --> 1:16:31.105 with 'feeling down'; but if the first position goes for one variant 1:16:31.105 --> 1:16:34.594 and the second position for the other, you end up with a mixture of both. 1:16:35.075 --> 1:16:42.908 So each position here, and that is one of the main issues, doesn't know what the other positions generate. 1:16:43.243 --> 1:16:53.846 And for example, you can often translate things in two 1:16:53.846 --> 1:16:58.471 ways into German, with a different agreement. 1:16:58.999 --> 1:17:02.058 And then here, where you have to decide which form to use, 1:17:02.162 --> 1:17:05.460 the position doesn't know which word the other positions selected. 1:17:06.086 --> 1:17:14.789 I mean, of course it knows the hidden states, but in the end you only have a probability distribution. 1:17:16.256 --> 1:17:20.026 And that is the important difference to the autoregressive model: 1:17:20.026 --> 1:17:24.335 there you know it, because you have put it in as input; here you don't know it. 1:17:24.335 --> 1:17:29.660 If two words are equally probable here, you don't know which one is selected, and of course that 1:17:29.660 --> 1:17:32.832 determines what the correct continuation should be. 1:17:33.333 --> 1:17:39.554 Yep, that's this shift problem, and we will come back to it in a moment. 1:17:39.554 --> 1:17:39.986 Yes? 1:17:40.840 --> 1:17:44.935 Doesn't this also appear in the autoregressive model, now that we're talking about training? 1:17:46.586 --> 1:17:48.412 The thing is, in the autoregressive model 1:17:48.412 --> 1:17:50.183 you give it the correct previous word during training. 1:17:50.450 --> 1:17:55.827 So if the reference here is 'feeling', then you tell the model: 1:17:55.827 --> 1:17:59.573 the last word was 'feeling', and then it knows what has to come next. 1:17:59.573 --> 1:18:04.044 But here it doesn't know that, because it doesn't get the correct previous word as input. 1:18:04.204 --> 1:18:24.286 Yes, that depends a bit on the setup. 1:18:24.204 --> 1:18:27.973 But in training, of course, you just try to make the correct word the one with the highest probability. 1:18:31.751 --> 1:18:38.181 So what you can do is use things like the CTC loss, which can adjust for this. 1:18:38.181 --> 1:18:42.866 Then you also get this shifted correction: 1:18:42.866 --> 1:18:50.582 if you do this type of correction with the CTC loss, an output that is correct but 1:18:50.930 --> 1:18:58.486 just shifted by one does not get the full penalty. So it's a bit of a different loss, which is mainly used in speech recognition, but 1:19:00.040 --> 1:19:03.412 it can also be used to address this problem. 1:19:04.504 --> 1:19:13.844 The other problem is that non-autoregressively we have this ambiguity that the model cannot disambiguate. 1:19:13.844 --> 1:19:20.515 That's the example from before: if you translate 'thank you', it can be 'Danke' but also 'Vielen Dank', 1:19:20.460 --> 1:19:31.925 and then it might end up mixing them, because the first position learns one variant and the second position the other.
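Coming back to the CTC idea mentioned a moment ago, here is a minimal sketch of how such a loss could be attached to the per-position decoder outputs so that a correct but shifted output is not fully penalized. The vocabulary size, the blank index, and the choice of output length are assumptions for illustration, not the lecture's exact setup; CTC only requires that the number of decoder positions is at least the reference length.

```python
# Sketch: CTC-style training signal for non-autoregressive per-position outputs.
import torch
import torch.nn.functional as F

BLANK = 0
vocab_size, T, S, batch = 100, 12, 5, 1   # T decoder positions, S reference tokens, T >= S

# Pretend these are the per-position decoder scores: shape (T, batch, vocab)
logits = torch.randn(T, batch, vocab_size, requires_grad=True)
log_probs = F.log_softmax(logits, dim=-1)  # CTCLoss expects log-probabilities

# Reference target tokens (must not contain the blank index)
targets = torch.randint(1, vocab_size, (batch, S))
input_lengths = torch.full((batch,), T, dtype=torch.long)
target_lengths = torch.full((batch,), S, dtype=torch.long)

ctc = torch.nn.CTCLoss(blank=BLANK, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # any monotonic alignment of the reference inside the T positions is allowed
print(loss.item())
```

Because the loss marginalizes over all monotonic alignments, an output that merely places the right words one position later still receives credit, which is exactly the relaxation discussed above.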
1:19:32.492 --> 1:19:43.201 In order to prevent that mixing, it would be helpful if for one input there were only one output, because that makes 1:19:43.201 --> 1:19:47.002 it easier for the system to learn. 1:19:47.227 --> 1:19:53.867 It might be that for slightly different inputs you have different outputs, but for the same input there should be one. 1:19:54.714 --> 1:19:57.467 That we can luckily solve very easily. 1:19:59.119 --> 1:19:59.908 And it's done 1:19:59.908 --> 1:20:04.116 with a technique we just learned about, which is called knowledge distillation. 1:20:04.985 --> 1:20:13.398 So what we can do, and the easiest way to improve your non-autoregressive model, is to 1:20:13.398 --> 1:20:16.457 first train an autoregressive model. 1:20:16.457 --> 1:20:22.958 Then you decode your whole training data with this model and train on its outputs. 1:20:23.603 --> 1:20:27.078 The main advantage of that is that this data is more consistent. 1:20:27.407 --> 1:20:33.995 So for the same input you always have the same output. 1:20:33.995 --> 1:20:41.901 You make your training data more consistent and easier to learn. 1:20:42.482 --> 1:20:54.471 So that is an additional advantage of knowledge distillation here: you have 1:20:54.471 --> 1:20:59.156 more consistent training signals. 1:21:04.884 --> 1:21:10.630 There's another way to make things easier at the beginning. 1:21:10.630 --> 1:21:16.467 There is this glancing model, where you work with masks. 1:21:16.756 --> 1:21:26.080 So during training, especially at the beginning, you give the model some of the correct target words as input. 1:21:28.468 --> 1:21:38.407 And there is the idea of predicting K tokens at a time, so the idea is to interpolate between autoregressive and non-autoregressive training. 1:21:40.000 --> 1:21:50.049 Some target positions stay open and you only predict a few of them at once: fully autoregressive corresponds to K 1:21:50.049 --> 1:21:59.174 equals one, so you always have one input and one output, and then you gradually do more in parallel. 1:21:59.699 --> 1:22:05.825 So in that way the model can slowly learn what is a good and what is a bad output. 1:22:08.528 --> 1:22:10.862 That might not sound very efficient, 1:22:10.862 --> 1:22:12.578 but it doesn't cost much extra, 1:22:12.578 --> 1:22:15.323 since you go over your training data several times anyway. 1:22:15.875 --> 1:22:20.655 You can even switch in between. 1:22:20.655 --> 1:22:29.318 There is work on this where you try different schedules. 1:22:31.271 --> 1:22:41.563 There is a whole line of work on that, so this is often done, and it doesn't 1:22:41.563 --> 1:22:46.598 make inference less efficient, but it still helps quality. 1:22:49.389 --> 1:22:57.979 For later reading, here are some examples of how much these things help. 1:22:57.979 --> 1:23:04.958 Maybe one point here is really important. 1:23:05.365 --> 1:23:13.787 Here you see the translation performance and the speed. 1:23:13.787 --> 1:23:24.407 One point that matters is what you compare against. 1:23:24.784 --> 1:23:33.880 So yeah, if you compare to a very weak baseline, a transformer with beam search, 1:23:33.880 --> 1:23:40.522 then that baseline can be ten times slower than the non-autoregressive model. 1:23:40.961 --> 1:23:48.620 If you build a strong baseline, then the speed-up goes down to a few times, depending on the setup, and here you 1:23:48.620 --> 1:23:53.454 see a lot of different speed-ups. 1:23:53.454 --> 1:24:03.261 Generally, it is important to compare against a strong baseline and not a very simple transformer. 1:24:07.407 --> 1:24:20.010 Yeah, and with this, one last thing that you can do to speed things up and also reduce your 1:24:20.010 --> 1:24:25.950 memory is what is called half precision.
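Since half precision was just mentioned, here is a minimal sketch of one common way to use it for inference in PyTorch; the stand-in linear model and the sizes are assumptions, and the training note in the comments reflects the usual practice of mixed precision with loss scaling rather than pure FP16.

```python
# Sketch: half-precision (FP16) inference with a stand-in model.
import torch

model = torch.nn.Linear(512, 512)   # hypothetical stand-in for a full translation model
x = torch.randn(8, 512)

if torch.cuda.is_available():
    # Cast weights and activations to FP16: roughly halves the memory for the weights
    model_fp16 = model.half().cuda()
    with torch.no_grad():
        y = model_fp16(x.half().cuda())
else:
    # Many CPU kernels lack FP16 support, so stay in FP32 as a fallback here
    with torch.no_grad():
        y = model(x)

# For training, mixed precision (torch.cuda.amp.autocast plus GradScaler) is the usual
# choice, since training purely in FP16 can become numerically unstable.
print(y.dtype)  # torch.float16 on GPU, torch.float32 on the CPU fallback
```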
1:24:26.326 --> 1:24:29.139 That is especially useful for decoding; for training, 1:24:29.139 --> 1:24:31.148 it sometimes gets less stable. 1:24:32.592 --> 1:24:45.184 With this we are nearly done. What you should remember is how efficient machine 1:24:45.184 --> 1:24:46.963 translation can be achieved. 1:24:47.007 --> 1:24:51.939 We have, for example, looked at knowledge distillation. 1:24:51.939 --> 1:24:55.991 We have looked at non-autoregressive models. 1:24:55.991 --> 1:24:57.665 And we have seen several other techniques. 1:24:58.898 --> 1:25:02.383 That's it for today, and then only one request: 1:25:02.383 --> 1:25:08.430 if you haven't done so, please fill out the evaluation. 1:25:08.388 --> 1:25:20.127 If you have done so already, thank you; hopefully that also works for the online participants. 1:25:20.320 --> 1:25:29.758 It is the best way to tell us what is good and what is not; not the only one, but the most 1:25:29.758 --> 1:25:30.937 efficient one. 1:25:31.851 --> 1:25:35.871 So thanks to all the students doing it, and with that, thank you.