Okay, again, welcome. So today I'll be giving the lecture. My name is Danni Liu; I'm one of the PhD students here. I work on multilingual machine translation, specifically on how to learn representations that are shared across languages and how to use them to help low-resource languages. So I hope today we can explore multilingual machine translation a little bit together.

Today, we are first going to look at what multilingual machine translation is. Second, we will look in more detail at how we achieve multilingual machine translation and what the techniques there are. Lastly, we are going to look at the current challenges.

Alright, so some definitions. First, what is multilingual machine translation? A multilingual machine translation system is basically a system that is able to handle multiple source languages or multiple target languages. You see here, on the source side, there is some German, Chinese, Spanish, and English.

It is also quite an interesting machine learning challenge. If you consider each translation pair as a different task in machine learning, then a multilingual model is a model that has to specialize in all these different translation directions and try to be good at all of them. So this is basically multi-task learning, with each translation direction being one task.

An interesting question to ask here is: do we get synergy, as in different tasks helping each other, the knowledge of one task helping the other? Or do we get interference - I learn English to German and now I get worse at English to Chinese? This is a very interesting question that we will look into later.

Now, a little bit of context on why we care about multilingual machine translation. Part of it is the sheer number of languages that machine translation models would have to cover. If you consider all the languages in the world, there are, as I read, roughly seven thousand languages. Consider this number: with this many languages out there, how many translation directions are there? To cover N languages, we end up with a quadratic number of directions. This is very bad - quadratic is very bad. This quadratic growth means that for a lot of translation directions, if you consider all of them, we are not going to have any parallel data, as in existing translated data. So this is a very data-scarce situation.
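To make the quadratic growth concrete, here is a tiny calculation; the language counts are just illustrative.

```python
# Number of directed translation pairs among N languages: N * (N - 1).
for n in (10, 100, 7000):
    print(f"{n} languages -> {n * (n - 1):,} translation directions")
# 100 languages already give 9,900 directions; 7,000 give 48,993,000.
```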
We are not going to get parallel data everywhere, and that is especially likely when you have a system that covers, say, ten languages; if this axis actually goes towards the thousands, which is realistic, we are going to end up with a lot of holes in the data.

So now we ask: can we use multilinguality to help these kinds of low-resource directions? A useful concept here is mutual intelligibility; I don't know if you've heard of this. It is when, in linguistic terms, somebody who speaks one language can directly, without learning, understand another language. If you are a German speaker, maybe Dutch or Danish and languages of that kind would be partially directly understandable to you. That is thanks to this mutual intelligibility, which is basically based on language similarity.

And then there is the concept of knowledge sharing. It is quite intuitive: if you are a German speaker and you start learning Dutch or Danish and these Nordic languages, I think you are going to be faster than, say, a native English speaker. So hopefully our model is also able to do this, but we will see later what the real situation is.

So we said multilinguality is good, multilingual translation is nice, and there is a lot of potential. But it has been a long path towards there; I think the efforts started quite some years ago. At first, people started with models with language-specific modules. We talked about the encoder-decoder architecture in the previous lectures, and this separation of the encoder and the decoder gives a natural way to split the modules.

So basically what is going on here is that one encoder is dedicated to each source language and one decoder to each target language. Now, given parallel data for one pair - say German-English data - we just activate this German encoder and this English decoder, and so on for the other pairs. So we are basically training the corresponding parts of the encoders and decoders.

This has some advantages. First, we have a multilingual system, of course. Second, modularity is also an advantage, as in software engineering: we want to decouple things, so if the German encoder is broken, we know where to look. So modularity is an advantage in this case. But again, if we think about scalability, about all the languages out there that we just talked about, this does not scale well. We also talked about sharing knowledge, or sharing representations, across different languages: if we have a separate module for each language, how likely is it that we are sharing much? A minimal sketch of such a modular setup is shown below.
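This is only a rough sketch of the language-specific-module idea, not the exact architecture of any particular paper; the GRU layers and sizes are placeholders, and attention is left out for brevity. The point is that one encoder per source language and one decoder per target language exist, and only the pair matching the current training example is activated.

```python
import torch
import torch.nn as nn

class ModularMT(nn.Module):
    """One embedding/encoder/decoder per language; pairs are activated on demand."""
    def __init__(self, vocab_sizes, hidden=256):
        super().__init__()
        self.emb = nn.ModuleDict({l: nn.Embedding(v, hidden) for l, v in vocab_sizes.items()})
        self.enc = nn.ModuleDict({l: nn.GRU(hidden, hidden, batch_first=True) for l in vocab_sizes})
        self.dec = nn.ModuleDict({l: nn.GRU(hidden, hidden, batch_first=True) for l in vocab_sizes})
        self.out = nn.ModuleDict({l: nn.Linear(hidden, v) for l, v in vocab_sizes.items()})

    def forward(self, src_lang, tgt_lang, src_ids, tgt_ids):
        # Only the modules of this language pair are used (and hence updated).
        _, enc_state = self.enc[src_lang](self.emb[src_lang](src_ids))
        dec_out, _ = self.dec[tgt_lang](self.emb[tgt_lang](tgt_ids), enc_state)
        return self.out[tgt_lang](dec_out)          # logits over the target vocabulary

model = ModularMT({"de": 8000, "en": 8000, "zh": 12000})
logits = model("de", "en", torch.randint(0, 8000, (2, 7)), torch.randint(0, 8000, (2, 5)))
print(logits.shape)                                 # torch.Size([2, 5, 8000])
```

The number of modules grows with the number of languages (or, in a per-direction variant, with the number of pairs), which is exactly the scalability concern raised above.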
So scalability and limited sharing are the potential disadvantages of this modular approach. We said we want knowledge transfer, we want similar languages helping each other. That is a more reachable goal if you have a shared encoder and a shared decoder - basically a fully parameter-shared model for all the translation pairs out there. And there is another gain: if you just have one block of a model for all the translation directions, it is easier to deploy, in the sense that when you are serving a model you don't have a thousand small modules to maintain. So in terms of engineering, these fully parameter-shared models also have advantages, and this is where research has been going in recent years. The rest of the lecture is also going to focus on this kind of model.

So the first type of multilinguality is the many-to-one setup: many source languages are translated into one target language. One use case you can think of here is producing subtitles for international movies in German.

Then, flipping the situation, there are also many configurations where we only have one source language - one-to-many. There are many use cases here as well; think about the lecture translator here that you have seen. Most of the lectures are in German, and now we want to translate them; I think on the user end we only support English, but other languages would also be possible. So in this kind of use case you have one speaker and you want to serve, or expand to, many audiences.

Of course, combining everything, there is the many-to-many situation. You can think of Google Translate: they basically translate between any selected languages. This is also more difficult if you consider the data you need to get, and other concerns; we will cover this later.

But first we are going to start with many-to-one translation. This is the most similar to the bilingual translation situation you saw earlier, but one difference is that we now need a vocabulary, or tokens, that can represent all these different source languages. So we need a joint multilingual vocabulary.

Let's quickly recall what word embeddings do. Basically, we have to get some vector representation for discrete words. When we embed a token, we retrieve the corresponding vector out of this lookup table, and then we feed the sequence of vectors into the encoder in the next steps.
Now, if it is multilingual, you can imagine that the vocabulary suddenly gets very, very big, because of all the languages involved. What is quite useful here are the byte-pair-encoded subwords you talked about earlier. In this case we still limit ourselves to a finite number of vocabulary entries, so we are not exploding the vocabulary table.

So when we learn these kinds of subwords, what happens, basically, is that we look at all the training data. Now think about this: if we do this on a bunch of multilingual data, are there concerns?

[Student:] Maybe we have an imbalanced data set, so we get overly many English merges and vocabulary entries.

Yeah, exactly, thanks. So what we have to pay attention to when learning this multilingual vocabulary is that all the languages are more or less balanced - not that we only learn subwords for English or some bigger languages and then neglect the other languages.

Of course, this is not going to solve everything. Even if we get a perfectly uniform distribution over all the languages out there, that does not mean we end up with a perfect vocabulary. There are also language differences, right? If you consider mostly European languages, there will be many shared subword components, like how you write a certain word being somewhat similar. But then there are other languages with completely different scripts, like Arabic or Cyrillic scripts, or East Asian scripts where the character set alone has tens of thousands of characters. These are individual concerns that one has to think about when building specific systems. But overall, the rule of thumb is that when you build a multilingual tokenizer and vocabulary, the languages should be more or less balanced. There are actually papers showing that the performance of the final system starts to degrade if the data is disproportionate.

Of course, there is currently the trend of using pre-trained models. If you take a pre-trained model from somewhere, then you don't have this concern, apart from making sure that you use the same tokenizer that they used, so that there is no train-test mismatch. We are going to talk about pre-trained models a little bit later as well.

Alright, so now we have a multilingual vocabulary. There are several good things about it, obviously. One thing is that if there are words with the same textual form - as we said, some European languages share some vocabulary - then it is great: we have a first step towards knowledge sharing. For example, the word for pineapple, for some reason, is shared across many languages, including Eastern European ones; in Cyrillic script it is essentially the same word as well.
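Coming back to the earlier point about balancing the languages when learning the joint vocabulary: a common recipe (an assumption of a typical setup here, not necessarily what the systems in this lecture used) is to sample sentences per language with temperature-scaled probabilities and then train one joint subword model, for example with the sentencepiece library.

```python
import random
import sentencepiece as spm

# Sentence counts per language (made-up numbers, just for illustration).
corpus_sizes = {"en": 5_000_000, "de": 2_000_000, "jv": 30_000}

def sampling_probs(sizes, temperature=5.0):
    """p_l proportional to (n_l / N) ** (1/T); a higher T flattens the distribution."""
    total = sum(sizes.values())
    w = {l: (n / total) ** (1.0 / temperature) for l, n in sizes.items()}
    z = sum(w.values())
    return {l: v / z for l, v in w.items()}

probs = sampling_probs(corpus_sizes)
print(probs)  # 'jv' now gets roughly 16% of the mixture instead of its raw ~0.4%

# Build a balanced mixture (files like "train.en" with one sentence per line are assumed),
# then learn a single joint BPE vocabulary over it.
budget = 2_000_000
with open("train.mixed.txt", "w", encoding="utf-8") as out:
    for lang, p in probs.items():
        with open(f"train.{lang}", encoding="utf-8") as f:
            lines = f.readlines()
        for line in random.choices(lines, k=int(budget * p)):  # with replacement: also upsamples
            out.write(line)

spm.SentencePieceTrainer.train(
    input="train.mixed.txt", model_prefix="joint_bpe",
    vocab_size=32000, model_type="bpe", character_coverage=1.0)
```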
On the other hand, shared surface forms also bring ambiguity: the same written word can mean different things in different languages - think of a word like "die", which means something entirely different in German than in English. Then, of course, it is possible to rely on further context, so it is not a huge problem, but it is something to think about.

And when we want to cover more vocabulary entries, we might need to go bigger in the vocabulary size. So there is always sort of a bottleneck as the number of languages increases.

Right, so what is the result? What are these cross-lingual word embeddings actually learning? Normally it is quite hard to inspect them - they are high-dimensional vectors with many dimensions - but researchers have tried to project them down. In this case (it is a little bit small here), for English and French, there are many entries where different morphological forms of the same word end up together - basically a morphological cluster. There are also words from different languages - I think this plot is for English and French - that end up close to each other. So the takeaway from this plot is that we learn a bit of semantic meaning beyond the textual forms.

This looks good and gives us hope. Consider what the baseline is here: the baseline we compare to is a bilingual system without any multilinguality. And this looks good, because if we compare, for many Eastern and Central European languages into English, we see that the many-to-English system actually always gains quite a bit over the bilingual one. But there have also been later investigations into whether this is actually due to multilinguality or not. This is a spoiler - I won't say much about it until the second half - but just remember that this question exists.

Now let's move on to one-to-many translation. Let's recall a normal transformer, or any encoder-decoder setup. We have an encoder that creates a contextual representation of the source sentence. This is more or less the context for generating the target sentence, right? On the target side, we get the first output token, then we feed it back in and get the second decoding step, and so on.

And now we have multiple target languages. Does anybody see a problem with this architecture? Specifically, it is in the decoder: say we have a German sentence encoded, and we now want to generate Spanish.
So the problem is: how does the model know which language to generate? If you just give it a generic start token, there is nowhere we are telling the model which target language we want. So the model can only guess, and this will definitely not run well.

This raises the question: how do we indicate the intended target language to the model? A first idea that people tried is basically to inform the model on the source side: the source sentence is supplemented with a tag like "to Spanish". This is also called target forcing, in the sense that we try to force the model to produce the right target language.

That is one approach. Another approach is based on the idea that the encoder is there to create a contextual representation of the source, so its output should not really have to differ depending on the target language. Out of this motivation, people moved the signaling mechanism to the decoder: they basically replaced the traditional start token, so we are not kick-starting the decoder with a generic start token anymore, but with a language-specific one. So this is another way to achieve the signaling.

But there are still more challenging cases. Sometimes the output does start in the intended language, say German, while the signal is still fresh, but as the generation goes further and further, it drifts. Basically, this information is not strong enough to always enforce the target language, especially in zero-shot conditions - we will look into this later - so we get translations that start out fine and then wander off into some wrong language.

So another technique, actually developed here some years ago, is to inject the language information at every decoding step. Normally, when we do autoregressive decoding, we only feed the previously generated token into the decoder. But if we also add an embedding of the target language on top of that, we have the language information at every step. This has been shown to perform quite a bit better, especially in conditions where the model would otherwise drift off target.

So we have introduced three ways to enforce the target language. And with this, we are going to move on to the more interesting case of many-to-many translation.

Here, let's just consider a system that translates two directions, say German to English and English to French. Now we have two target languages, right? Can you see where we are enforcing the target language here - which technique is used in this case? So here we are enforcing the target language with the language tag when we train this system.
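A minimal sketch of the first signaling option, target forcing with a tag prepended to the source side; the `<2xx>` tag format is only a convention assumed here, and the tags have to be added to the joint vocabulary as special tokens.

```python
def add_target_tag(src_sentence: str, tgt_lang: str) -> str:
    """Prepend a target-language token so the model knows what to generate."""
    return f"<2{tgt_lang}> {src_sentence}"

# The same source can now be paired with different target languages in one model:
print(add_target_tag("Das Haus ist klein.", "es"))   # <2es> Das Haus ist klein.
print(add_target_tag("Das Haus ist klein.", "fr"))   # <2fr> Das Haus ist klein.
```

The other two options move the signal to the decoder instead: a language-specific start token, or a target-language embedding added to the decoder input at every step.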
At inference time we are able to translate English to French, but in addition to this we are also able to do zero-shot inference, that is, to translate a direction that was not seen in training, such as German to French. This is so-called zero-shot translation using a multilingual system.

Of course, we have to achieve several things before this works. First, we must be able to control the output language, otherwise it is of no use. Second, we should also have some kind of language-independent representation. Why is this important? Because if we want to generate French here, the decoder was trained to translate from encoded English; but now we feed it encoded German, so intuitively we need these representations to be similar enough - not so far apart that the decoder cannot use them.

There are several works out there showing that with a standard transformer architecture this language-independent property is not really there by itself, and you need additional approaches in order to enforce it. You can, for example, add an additional training objective that says: the encoder output for a sentence and the encoder output for its translation in another language have to be the same, or as close to each other as possible.

So if we take the encoder output for one language and the encoder output for another language, how can we formulate this as an objective? We can pass both sides of a translation pair through the encoder and require that the resulting embeddings be similar - that is the general direction. One thing to take care of here is that the lengths of the same sentence in German and in English are not necessarily equal. So instead of a word-to-word matching, we can always pool to a fixed-length representation, or there are more advanced techniques that involve some alignment.

This is useful in the sense that, in our experiments, it has been shown to improve zero-shot translation. This is in a data condition with English to Malay, Javanese, and Filipino, so a somewhat lower-resource language family. There we assume that we have parallel data from English to all of them, but none among those languages themselves. The blue bars are a vanilla transformer model, and the purple bars are when we add such a similarity objective. You see that in supervised conditions it does not change much, but in zero-shot conditions there is quite some gain.

So far we have said that zero-shot is doable, and it is even more achievable if we enforce some language-independent representations.
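Here is a minimal sketch of such an auxiliary objective, assuming we already have the encoder outputs for a sentence and its translation; mean pooling over time is one simple way to handle the different lengths, and the MSE loss and its weight are assumptions rather than a fixed recipe.

```python
import torch
import torch.nn.functional as F

def similarity_loss(enc_a, enc_b, mask_a, mask_b):
    """Pull the mean-pooled encoder states of a sentence and its translation together.

    enc_a: (B, T_a, H), enc_b: (B, T_b, H); masks are 1 for real tokens, 0 for padding.
    """
    mask_a, mask_b = mask_a.float(), mask_b.float()
    pooled_a = (enc_a * mask_a.unsqueeze(-1)).sum(1) / mask_a.sum(1, keepdim=True)
    pooled_b = (enc_b * mask_b.unsqueeze(-1)).sum(1) / mask_b.sum(1, keepdim=True)
    return F.mse_loss(pooled_a, pooled_b)

# During training: total_loss = translation_loss + 0.1 * similarity_loss(...)
# (the 0.1 weight is a tunable choice)
enc_de, enc_en = torch.randn(4, 9, 512), torch.randn(4, 7, 512)
mask_de, mask_en = torch.ones(4, 9), torch.ones(4, 7)
print(float(similarity_loss(enc_de, enc_en, mask_de, mask_en)))
```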
However, there is one practical concern - I don't know if you had the same question: if you have two languages without direct parallel data, you could also just translate into English first and then out of English. This kind of approach is called pivoting, as in pivoting over an intermediate language.

It definitely has advantages. If we go over these two steps, every direction was trained with supervised data, so we are always working in a supervised regime. In this case we can expect more robust inference-time behavior. However, there are also disadvantages. At inference we are passing through a model twice, which doubles the inference-time computation. You might think, okay, doubling, so what - but if you are a company like Google running Google Translate and all your live traffic suddenly becomes twice as expensive, this is not something scalable that you want to see, especially in production.

Another problem is information loss: if we go over these two steps, like a chain of kids passing a word to each other, information gets lost along the way. I can give you an example here; it is from a master's thesis done here, on gender preservation. Some languages, like Italian and French, have different word forms depending on the speaker. So if a male speaker says "I feel alienated", the word for "alienated" takes the masculine form, while a female speaker would use the feminine form. Now imagine that we pivot through English: the information is lost - we no longer know the speaker's gender. When we go out into French again, there are different forms depending on the speaker's gender, and we can only guess. So this is one problem.

This is especially the case because English, compared to many other languages, is relatively simple: it does not have gendered word forms like this, and it also does not have many cases, so going through English, a lot of information is lost.

Another issue arises when you are translating between two similar languages. This is the output of a system going from Dutch to German over English - if you read the German, how many of you know German? Good. The problem here is that we are going over English and then from English into German, and something gets mistranslated on the way. However, if we go direct - in this case with zero-shot translation - you see that the word in question (here, "forgive") comes out right; the direct output is better. And we believe this has to do with exploiting the similarity between the two languages.
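For comparison, the pivoting baseline discussed above is conceptually just two supervised translation calls chained together; `translate` here is a hypothetical helper standing in for whatever bilingual systems are available.

```python
def pivot_translate(translate, text, src_lang, tgt_lang, pivot="en"):
    """Translate src -> pivot -> tgt with two supervised systems.

    `translate(text, src_lang, tgt_lang) -> str` is assumed to exist.
    """
    intermediate = translate(text, src_lang, pivot)   # step 1: into the pivot language
    return translate(intermediate, pivot, tgt_lang)   # step 2: out of the pivot language

# Example (not run): Dutch -> German over English.
# pivot_translate(translate, "Ik heb het hem vergeven.", "nl", "de")
# Downsides from the lecture: twice the inference cost, and anything the pivot
# language does not mark (e.g. speaker gender) is lost in step 1.
```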
We also found this quantitatively: the models always do better when translating directly between similar languages, compared to pivoting through English.

So, in this first half, what did we talk about? First, we started with how multilinguality, or multilingual machine translation, can enable knowledge transfer between languages and help in conditions where we don't have much data. Then we looked at three types of multilingual translation: many-to-one, one-to-many, and many-to-many. We talked about a shared vocabulary over different languages and how the resulting cross-lingual word embeddings capture semantic meaning rather than just surface forms. Then we looked at how to signal the target language - how to ask the model to generate a particular language - and finally we looked at zero-shot translation.

Now, before I go into the second half, are there questions about the first part? Okay, good.

In the second half of this lecture we will be looking into challenges - what is still unsolved about multilingual machine translation. There are a couple of aspects to look at: the first is modeling, and the later ones are more about data and engineering.

Okay, so we have talked about this question several times: how does multilinguality help, and where does it help? Here I want to show results of an experiment based on over a hundred languages. Here you can see the data amounts: they use parallel data to and from English, and it is very imbalanced - and this is already log scale. For higher-resource languages like English to French, German, or Spanish, you get over a billion parallel sentences, and when we go more to the right, to the low-resource end of the spectrum, there are languages that maybe many of us have never heard of - Hawaiian, for example; has anyone heard of it? On that end of the spectrum we only have something like thirty thousand sentences. So what this means is that when we train, we have to upsample these languages, otherwise the model would hardly ever see them.

On this graph, the way to read it is that the horizontal line at zero basically indicates the bilingual baselines, because we want to see where multilinguality helps compared to when there is none. So being higher than the zero line means we are gaining over the bilingual models; and along this axis, this side means high-resource and that side low-resource. Yeah, sorry, I think I have somehow removed the x-axis labels here.

Alright, so what happens if we look at many-to-English?
On the low-resource end of the spectrum, by going multilingual we gain a lot over the bilingual systems. Overall, if you consider the average over all of the languages, it is still a gain. We are looking at the green line here - you can ignore the blue line; the green one is the setting with the upsampling we just talked about. So even if you just consider the average, it is still a gain over the bilingual baselines.

However, if we go to the English-to-many system and look at the gains, we only get minor improvements. So why is it that going multilingual isn't really helping universally? Do you have some intuitions?

[Student:] It is easier to understand something than to generate it, if we consider what the model has to generate.

I see it like this: generating is a bit like writing or speaking, while taking in the source side is more like reading. So one is more passive and the other is more active, and I don't know if you have a similar experience, but I think speaking and writing are always a little bit more difficult than passively listening or reading. But this is a very hand-wavy kind of understanding.

In terms of the model, consider what is different on the target side for many-to-English. One difference is that there is a data difference. If you consider a many-to-English system with German-to-English and Spanish-to-English, one thing we have to keep in mind is that the parallel data is not all the same, so on the target side there are different English sentences coming from the different pairs. So the situation rather looks like this: we are also adding more data on the target side for English.

Now, since the target-side data is not identical, how do we do a controlled experiment that removes the multilinguality? What people tried as a control is to keep all the English target data the same as in the setup above. So they take the English side of the other language pairs and then generate synthetic German source data for it. Now we have a bilingual system again, but on the target side we still have the previously enriched English data.

Back to this picture that we have seen before: this mysterious orange line here is basically the result of that control. And somewhat strikingly, and perhaps sadly for believers in multilinguality, this control is also gaining. So what this means is that for many-to-English, the gains are not really because of multilinguality, but just because of the additional English target-side data.
And this means that there is still quite a lot to do if we really want to gain from truly shared knowledge. But it also gives hope, because there are still many things to research in this area.

So we have seen that adding more languages helps, with somewhat of a data side effect - and can it hurt? What if we just keep adding more languages? We have seen this picture for the many-to-English system: comparing to the bilingual baselines, we see that for the high-resource languages we are not doing as well.

So why are we losing here? It has been shown that this performance loss is somewhat related to capacity, in the sense that the model has to learn so much that at some point it has to sacrifice capacity for some of the directions.

So what can we do? To basically grow a bigger brain to tackle this, we can add some dedicated capacity per language. Here is a simplified graph of a transformer architecture - this is the encoder. Additionally, these little colored blocks are the language-specific capacity. They are language-specific in the sense that if you get a Chinese-to-English pair, only the corresponding blocks are used. We also go through language-specific parts that, in this case, consist of a down-projection and an up-projection back.

These are called adapters: something that is plugged into an existing model and adapts it towards a specific task. And they are conditionally activated, in the sense that a different input language activates a different adapter.

This was first proposed by some folks at Google. Does it scale well? Yes, exactly: this is one adapter per translation direction, and that is not going to scale well. So this brought people to try a simpler setup - in this case so-called monolingual adapters, one per language.

Adding these adapters - again we have the low-resource to high-resource axis, the zero line is the bilingual baseline, and the lines are interpolated - the red line is the original multilingual model, and if we put the adapters in, we get the blue line: you see that it regains performance for the high-resource languages. If we scale the adapters up even further, this improves more. So this also shows, from the side, that it really is a capacity bottleneck: if you give the high-resource directions dedicated capacity back, they regain their performance.
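A minimal sketch of such a bottleneck adapter; the exact placement inside the transformer layer, the sizes, and the use of layer norm vary across papers, so treat this as the generic down-projection / nonlinearity / up-projection pattern with a residual connection.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck block plugged into an otherwise frozen transformer layer."""
    def __init__(self, hidden=512, bottleneck=64):
        super().__init__()
        self.norm = nn.LayerNorm(hidden)
        self.down = nn.Linear(hidden, bottleneck)    # down-projection
        self.up = nn.Linear(bottleneck, hidden)      # up-projection back to model size

    def forward(self, x):
        return x + self.up(torch.relu(self.down(self.norm(x))))   # residual connection

# One adapter per language (the "monolingual adapter" variant); the matching one is
# picked depending on the source or target language of the current batch.
adapters = nn.ModuleDict({lang: Adapter() for lang in ["de", "en", "zh"]})
h = torch.randn(2, 7, 512)        # hidden states coming out of a transformer layer
h = adapters["de"](h)             # conditionally activated for German input
print(h.shape)                    # torch.Size([2, 7, 512])
```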
[Question from the audience:] And for the smaller languages it doesn't seem to change much - why is that?

I think in the original multilingual model the smaller languages were not really constrained by capacity. So I guess for the smaller languages the difficulty is more the data rather than the model capacity; in general you always want the amount of data to more or less match your model capacity. Yeah, here I think the bigger challenge for the low-resource ones was the data.

[Question:] You also mentioned it a little bit - are these adapters per language, or how many adapters do we need? And do we have to design them differently so that we learn to share more, for example within a language family?

So, one downside of the adapters we talked about is that there is basically no way to share across the language-specific parts. A more recent approach for this language-specific capacity is so-called routing, where the sharing is learned. Basically, we have these language-specific components and we also have a shared adapter, and the model should learn which path to take. In this case, maybe we could imagine that for the low-resource case we just talked about it makes more sense to go to the shared part, because there is not much language-specific to learn anyway, and it is better to make use of the similarity with other languages. So this architecture is more data-driven, instead of us specifying the sharing prior to training.

So how do we learn this? Basically, in terms of the gate, we want a binary value that routes either to the language-specific or to the shared path. But how do we get a value of zero or one? We can use a sigmoid. However, we don't want to get stuck in the middle - we don't want gate values of around one half. That would also be bad because it would not be the same at training and at test time.

So the question is: how do we force the model to always go to the extremes before the activation? I found this interesting because it sounds like a trick to me. What they do is, prior to going through the sigmoid activation, they add some Gaussian noise. If there is always noise before the activation, then the model is encouraged to preserve the information by pushing the pre-activation values far from zero, so the noise cannot flip the decision - which effectively makes the gate binary. This was a very interesting thing I found while preparing this, so I wanted to share it: you can basically create a binary gate with this technique.

And if you add this language-specific routing - here they also have a parameter that controls how much is shared and how much is language-specific - these are the results of the routing systems compared to the baselines (the red and orange lines). You can see that both for one-to-many and for many-to-one there are quite some gains in both cases.
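The noise trick for the learned routing can be sketched as follows; adding Gaussian noise before the sigmoid during training pushes the model to make the pre-activation large in magnitude so that the gate saturates at 0 or 1. The noise scale and the hard threshold at test time are assumptions of a typical implementation.

```python
import torch
import torch.nn as nn

class NoisyBinaryGate(nn.Module):
    """Learned gate that routes between a language-specific and a shared branch."""
    def __init__(self, noise_std=1.0):
        super().__init__()
        self.logit = nn.Parameter(torch.zeros(1))   # learned pre-activation
        self.noise_std = noise_std

    def forward(self, specific_out, shared_out):
        if self.training:
            # Noise before the sigmoid: only a large |logit| survives as a clear 0/1 decision.
            g = torch.sigmoid(self.logit + self.noise_std * torch.randn_like(self.logit))
        else:
            g = (self.logit > 0).float()            # hard binary routing at test time
        return g * specific_out + (1 - g) * shared_out

gate = NoisyBinaryGate()
specific, shared = torch.randn(2, 7, 512), torch.randn(2, 7, 512)
print(gate(specific, shared).shape)                 # torch.Size([2, 7, 512])
```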
So that is the overall picture, and I just find the idea of the routing quite interesting. It is also getting increasingly used, as there are the so-called mixture-of-experts models, where the model learns where to route the input, so the experts are all conditionally activated depending on the input. But this is not really specific to multilinguality, so I won't talk too much about it.

The takeaways from this part are: first, we talked about the capacity bottleneck, which we can partly compensate for with adapters or other added language-specific capacity, and there is the idea of negative transfer. Open questions are: when we add additional capacity, how can we improve the knowledge sharing? And for the one-to-many directions, which seem rather hopeless for multilinguality, can we actually gain something? These are all open issues in the area.

In the next part, I am going to talk about some data challenges for multilingual models. We talked about multilingual models needing data, but there are these low-resource languages that do not have well-curated parallel data. As an alternative, people resort to crawled data from the Internet, and there is a lot of noise in it. In a paper from last year, they did some manual analyses of several popular crawled datasets, and you see that there are a lot of wrong translations, non-linguistic contents, and pornographic contents. So, as you can imagine - you are what you eat - if you use this kind of data to train a model, you cannot expect clean output.

So there are also many techniques for filtering these noisy datasets. To filter, we can use an additional classifier that is trained to identify which language a sentence is in, and then kick out all the sentences in the wrong language. Another criterion is the length ratio: the assumption is that if two sentences are translations of each other, their lengths should be roughly comparable. Often people use a ratio of, say, three, and eliminate the rest. Another idea, maybe similar to the language classifier, is to have an allowed character set per language: if you are filtering and you see, I don't know, Cyrillic script or Arabic script where it does not belong, then it is maybe a good idea to remove those sentences.

This is not all - there are many other ideas, for example using pre-trained neural networks to compare the representations of the two sides - but this should give you an idea of the basic filtering techniques. Filtering is quite important: we have seen in our experience that if you do it thoroughly, there is quite some gain.
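A sketch of the basic filters just mentioned - length ratio plus a crude script check for a German-English pair; real pipelines usually add a trained language-identification model, which is only hinted at here.

```python
import re

LATIN = re.compile(r"[A-Za-zÄÖÜäöüß]")
CYRILLIC = re.compile(r"[\u0400-\u04FF]")

def keep_pair(src: str, tgt: str, max_ratio: float = 3.0) -> bool:
    """Very simple noise filter for a supposed German-English sentence pair."""
    if not src.strip() or not tgt.strip():
        return False                                  # empty segments
    len_s, len_t = len(src.split()), len(tgt.split())
    if max(len_s, len_t) / max(1, min(len_s, len_t)) > max_ratio:
        return False                                  # suspicious length ratio
    if CYRILLIC.search(src) or CYRILLIC.search(tgt):
        return False                                  # script not allowed for de/en
    if not (LATIN.search(src) and LATIN.search(tgt)):
        return False                                  # no letters at all: non-linguistic content
    return True

print(keep_pair("Das Haus ist klein.", "The house is small."))                # True
print(keep_pair("Haus", "The house is small, and it is also quite cheap."))   # False (ratio)
print(keep_pair("Дом маленький.", "The house is small."))                     # False (script)
```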
Even after web crawling, though, there is still a data scarcity problem. There are many bad things that can happen when there is too little training data. The first is simply low performance: in experiments on many-to-English systems across a range of languages, you really need to get into the region of a lot of data in order to reach the ideal performance. There are also other problems that appear in general when you train a model on very little data, for example results that vary a lot across different training runs.

So one solution to tackle this data scarcity problem is to fine-tune a pre-trained model. Basically, the idea is: you have a pre-trained model that can already do translation, then you fine-tune it on your own training data, and you end up with a more specialized model. Why does pre-training help? One argument is that through pre-training the model has seen much more data and has learned more generalizable representations that can help downstream tasks. So in this case we are basically trying to make use of those more meaningful and generalizable representations.

For machine translation, there are several open-source models out there that can handle many languages. There is, for example, a model covering two hundred languages - that is quite a lot of translation directions. However, one thing to remember is that these models are - how do you call it - a jack of all trades and a master of none, in the sense that their coverage is very good, but if you look at specific translation directions, they might not be as good as dedicated models.

So here are some results comparing random initialization versus fine-tuning a pre-trained model. The third line is the result of fine-tuning a pre-trained model of this kind. If we just look at the second line - the pre-trained model used out of the box - you see that its performance is not great everywhere compared to dedicated models. Here the "X" stands for English, and the first takeaway is that pre-training plus fine-tuning gains when we translate into English.
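A sketch of the pre-train-then-fine-tune recipe with the Hugging Face transformers library; the checkpoint name, learning rate, and the single toy sentence pair are placeholders, and the exact tokenizer keywords can differ between library versions.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "facebook/m2m100_418M"                 # an example multilingual MT checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One tiny fine-tuning step on an in-domain German->English pair.
tokenizer.src_lang, tokenizer.tgt_lang = "de", "en"
batch = tokenizer("Das Haus ist klein.", text_target="The house is small.",
                  return_tensors="pt")
loss = model(**batch).loss                    # cross-entropy computed by the library
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(loss))
```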
The flip side, however, is that we are forgetting: when we do the further training, there is no data anymore for some of the directions, so even if we initialize with the pre-trained model and continue training, the directions we no longer see get lost. This is bad; machine learning people term it catastrophic forgetting, in the sense that if you have a model trained to do some task and you then train it on another task, it forgets the first one. This is also pretty bad - especially bad if you consider that training data actually grows over time; it is not like you have one fixed dataset forever. In practice we do not always train systems from scratch; it is more like you have an existing system and later you want to expand its translation coverage.

So the key question here is: how do we continue training from an existing system without losing what it already knows? There are several approaches. One very simple one is to include a portion of your previous training data, so that the model keeps seeing it. If you have an English-German system and now you want to extend it to English-French, then when you train on English-French you still include a small proportion of your previous English-German data; hopefully the model then does not forget that much about the previously learned German.

Another idea is what we saw earlier: we can add adapters and only train those, while keeping the rest of the model frozen. This means we end up with a generic model that has not been changed at all, plus small added parts. So in this way it is also more modular and more suitable for this incremental kind of learning.
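A sketch of the first idea, replaying a small fraction of the previously used data while training on the new direction; the 10% replay ratio and the toy data are arbitrary illustrative choices.

```python
import random

def continual_batches(new_pairs, old_pairs, replay_ratio=0.1, batch_size=32):
    """Mix a small portion of previously seen data into every batch to limit forgetting.

    new_pairs / old_pairs: lists of (source, target, direction) tuples.
    """
    n_old = max(1, int(batch_size * replay_ratio))
    n_new = batch_size - n_old
    random.shuffle(new_pairs)
    for i in range(0, len(new_pairs), n_new):
        batch = new_pairs[i:i + n_new] + random.sample(old_pairs, n_old)
        random.shuffle(batch)
        yield batch

# Example: extending an en-de system with en-fr data while replaying some en-de.
en_fr = [("hello", "bonjour", "en-fr")] * 100
en_de = [("hello", "hallo", "en-de")] * 1000
for batch in continual_batches(en_fr, en_de):
    pass   # feed each mixed batch to the usual training step
```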
Right, so the takeaways from this part: first, data filtering, because Internet data is very noisy; second, fine-tuning pre-trained models, and how we can or cannot avoid catastrophic forgetting. And of course the open questions include how we can do incremental learning with these multilingual machine translation models.

With this in mind, I would like to briefly cover several engineering challenges that come up with multilingual models. Earlier we briefly mentioned that multilingual sometimes means you have to scale up: you have to make your models bigger just to have the capacity to deal with all the languages. This means model sizes are getting bigger, and sometimes a single GPU is not enough to handle them. Here I want to introduce the ideas of going parallel when scaling up.

The first is so-called data parallelism - I don't know if you have also had this in other machine-learning-related courses. The idea of data parallelism is basically that we train in parallel: we put our model onto several GPUs - the same model is sent to each of them - and when we get the training data, we split it. On each of these GPUs we do the forward and backward pass in parallel; then the GPUs are synchronized and the gradients are aggregated. In effect we are using a bigger batch size, so this is much faster than, for example, processing all those smaller batches one after another.

That does not help, however, if your model itself is too big to fit onto a single GPU, because then you cannot replicate it like this. And honestly, unless you are going for those huge models the industry builds these days, I have never run into a situation where the single model itself does not fit onto one GPU; realistically, what is memory-consuming is more the backward pass and the optimizer states that need to be stored. But still, there are people training gigantic models where they have to go model parallel. This means you have a model consisting of all these orange parts, but it does not fit, so you split it - for example, the first several layers go on one GPU and the next several layers on another. This means that during the forward pass one part has to wait for the other to finish before it can proceed. And this kind of implementation is sometimes a bit architecture-specific.

Right, so there is one more thing about scaling up that I wanted to mention. We also talked about it briefly earlier: we said that when we go multilingual we need a vocabulary that covers all the languages. I can give you some numbers: most of the pre-trained multilingual models use vocabularies of roughly 250 thousand entries, and each embedding vector typically has around 1,024 dimensions. This means the word embedding table alone is vocabulary size times dimension parameters - already on the order of 250 million parameters. This is often one of the largest parts of the machine translation model, and it comes with corresponding memory and compute costs. So one question is: how can we efficiently represent a multilingual vocabulary? Are there better ways than just a huge subword table?

There are many ideas out there that people have tried, maybe not all of them targeted at multilinguality. One is byte-level representation. The idea is that the training data is all stored on computers, so all the characters must be representable as bytes anyway; so the proposal is to use neither subwords nor characters, but bytes instead. Do you see some downsides?

[Student:] There are some languages that are easier to represent than others.

That is definitely true. So think about a sentence of, say, five words: if we split it into characters, how many characters do we have - and each character would then be how many bytes?
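To see why byte-level sequences get long, and unevenly so across scripts, here is a quick check with arbitrary example sentences.

```python
for text in ["The house is small.", "Übersetzung", "Это маленький дом.", "这是一个小房子。"]:
    print(f"{text!r}: {len(text)} characters -> {len(text.encode('utf-8'))} UTF-8 bytes")
# ASCII is 1 byte per character, Cyrillic about 2, Chinese about 3, so byte
# sequences grow much faster for non-Latin scripts.
```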
So it is much more for the model to learn, and it is also a much longer sequence to give to the model.

Visual representation is also quite interesting. Some people argue that we should not have a fixed discrete vocabulary at all anymore; instead we should read the text like OCR does, as images. We will look at one example of this next. And then another idea is whether you can distill the vocabulary, as in learning some more compact representation.

But next I wanted to show you an example of pixel inputs for multilingual machine translation. If you look at the picture, all the characters marked in red are actually not what they appear to be: they come from a different script. If you gave this to the model and let it do its subword tokenization, you would probably get mostly single characters out of it, because in the pre-existing vocabulary there will not be subwords that mix, say, a Latin "H" with characters from another script. So you get characters out of it, which means it is probably going to be more difficult for the model.

So the motivation for pixel inputs is that there is more sharing across languages. This figure basically illustrates an embedding table for subwords: if you have sentences in Latin scripts, like French and English, they take up a certain proportion of this big embedding table, while for Arabic and Chinese it is yet another part that is not joined with the previous one - which is not what we want if we want shared representations across languages. With pixels, on the other hand, there is definitely more sharing.

There is a difference, though, to a standard machine translation pipeline: if we have this rendered text as an image, how do we feed images into a translation model? We still have to tokenize it somehow, so in this case they use an overlapping sliding window over the image, and, since it is visual, some convolution blocks before going into the transformer layers. So here I wanted to show that with these somewhat more specialized architectures we can consume pixels directly.

There is also one downside. If we go with pixels as input representations, what remains challenging? Exactly - as the authors also point out for their experiments, they only consider a single target language, and the target side is not pixel-based. Still, this is, in my opinion, a very interesting step towards more shared representations.
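A rough sketch of how a rendered text line could be turned into a sequence of "visual token" embeddings with an overlapping sliding window and a small convolutional block; the patch size, stride, pooling, and the rendering step itself are all assumptions, and the actual systems use more elaborate encoders.

```python
import torch
import torch.nn as nn

class PixelPatchEmbedder(nn.Module):
    """Slice a rendered text line into overlapping patches and embed each patch."""
    def __init__(self, patch_width=16, stride=8, hidden=512):
        super().__init__()
        self.patch_width, self.stride = patch_width, stride
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)))                    # small conv block per patch
        self.proj = nn.Linear(64 * 4 * 4, hidden)            # to the transformer dimension

    def forward(self, image):                                # image: (B, 1, height, width)
        patches = image.unfold(3, self.patch_width, self.stride)  # (B, 1, H, N, patch_w)
        patches = patches.permute(0, 3, 1, 2, 4)                   # (B, N, 1, H, patch_w)
        b, n = patches.shape[:2]
        feats = self.conv(patches.reshape(b * n, 1, image.size(2), self.patch_width))
        return self.proj(feats.reshape(b, n, -1))            # (B, N, hidden), one vector per window

embed = PixelPatchEmbedder()
line_image = torch.rand(2, 1, 32, 200)      # a batch of rendered sentences, 32 px tall
print(embed(line_image).shape)              # torch.Size([2, 24, 512]) -> fed to the encoder
```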
With these somewhat out-of-the-box approaches covered, I just want to summarize today's lecture. First, I think we saw why multilinguality is cool and that there are several open challenges out there. We also saw several approaches for how to realize and implement a multilingual machine translation system. And lastly, we have seen quite a few open challenges - what is still unsolved. So with this I want to thank you for being here today; I will be up here if you want to ask anything. If you have questions, we can also go through them together now.