Spaces:
Running
Running
WEBVTT | |
0:00:00.860 --> 0:00:04.211 | |
Okay Again Welcome. | |
0:00:04.524 --> 0:00:09.256 | |
So today I'll be doing the lecture. | |
0:00:09.256 --> 0:00:12.279 | |
My name is Danny Liro. | |
0:00:12.279 --> 0:00:16.747 | |
I'm one of the PhD students with. | |
0:00:17.137 --> 0:00:25.942 | |
And specifically how to learn representations | |
that are common across languages and use that | |
0:00:25.942 --> 0:00:29.004 | |
to help low resource languages. | |
0:00:29.689 --> 0:00:39.445 | |
So hope today we can explore a little bit | |
about motoring machine translation and hopefully. | |
0:00:40.100 --> 0:00:50.940 | |
So today what we are going to do first we | |
are going to look at. | |
0:00:52.152 --> 0:01:02.491 | |
Second, we will be looking into more details | |
as in how we achieve modeling or machine translation | |
0:01:02.491 --> 0:01:06.183 | |
and what are the techniques there. | |
0:01:06.183 --> 0:01:12.197 | |
At last, we are going to look at the current | |
challenges. | |
0:01:13.573 --> 0:01:15.976 | |
Alright, so some definitions. | |
0:01:15.976 --> 0:01:19.819 | |
First, what is modeling or machine translation? | |
0:01:21.201 --> 0:01:28.637 | |
So for a multilingual machine translation | |
system, it's basically a system that is able | |
0:01:28.637 --> 0:01:34.279 | |
to handle multiple source languages or multiple | |
target languages. | |
0:01:34.254 --> 0:01:44.798 | |
You see here you've got source on the source | |
side, some German Chinese, Spanish and English. | |
0:01:45.485 --> 0:01:50.615 | |
Physically, it's also a quite interesting | |
machine learning challenge actually. | |
0:01:51.031 --> 0:02:05.528 | |
So if you consider each translation pair as | |
a different task in machine learning, then | |
0:02:05.528 --> 0:02:08.194 | |
a multilingual. | |
0:02:08.628 --> 0:02:17.290 | |
Where it has to specialize in all these different | |
translation directions and try to be good. | |
0:02:17.917 --> 0:02:26.890 | |
So this is basically about multi-task learning, | |
and here when translation direction being one | |
0:02:26.890 --> 0:02:27.462 | |
task. | |
0:02:28.428 --> 0:02:35.096 | |
Interesting question to ask here is like do | |
we get synergy like different tasks helping | |
0:02:35.096 --> 0:02:39.415 | |
each other, the knowledge of one task helping | |
the other? | |
0:02:39.539 --> 0:02:48.156 | |
Or do we get more interference in English | |
to German, and now I get worse at English to | |
0:02:48.156 --> 0:02:49.047 | |
Chinese. | |
0:02:49.629 --> 0:02:55.070 | |
So this is also a very interesting question | |
that we'll look into later. | |
0:02:56.096 --> 0:02:58.605 | |
Now a little bit of context. | |
0:02:59.519 --> 0:03:04.733 | |
We care about multilingual machine translation. | |
0:03:04.733 --> 0:03:10.599 | |
Part of the thing is that machine translation | |
models. | |
0:03:11.291 --> 0:03:22.659 | |
If you consider all the languages in the world, | |
there are a read it here roughly seven thousand | |
0:03:22.659 --> 0:03:23.962 | |
languages. | |
0:03:24.684 --> 0:03:37.764 | |
So consider this number, and if you think | |
about this many languages out there, how many | |
0:03:37.764 --> 0:03:39.548 | |
directions. | |
0:03:40.220 --> 0:03:46.897 | |
So this means to cover end languages. | |
0:03:46.897 --> 0:03:59.374 | |
We're going to end up with a prodretic in | |
square number of directions. | |
0:03:59.779 --> 0:04:02.290 | |
This Is Very Bad, Padre Is Very Bad. | |
0:04:03.203 --> 0:04:14.078 | |
The prosthetic situation going on means that | |
for a lot of translation directions, if you | |
0:04:14.078 --> 0:04:16.278 | |
consider all the. | |
0:04:17.177 --> 0:04:34.950 | |
For many of them we aren't going to have any | |
parallel data as in existing translated data. | |
0:04:35.675 --> 0:04:40.001 | |
So this is a very data scarce situation. | |
0:04:40.001 --> 0:04:49.709 | |
We're not going to get parallel data in blue | |
wear, especially likely when you have a system | |
0:04:49.709 --> 0:04:52.558 | |
that covers tan languages. | |
0:04:52.912 --> 0:05:04.437 | |
If this access actually goes towards thousands | |
that are realistic, we are going to end up | |
0:05:04.437 --> 0:05:06.614 | |
with some holes. | |
0:05:07.667 --> 0:05:15.400 | |
So now we are going to ask: Can we use motel | |
inquality to help this kind of glow resource? | |
0:05:15.875 --> 0:05:22.858 | |
So when useful concept there is mutual intelligibility, | |
don't know if you've heard of this. | |
0:05:23.203 --> 0:05:30.264 | |
Basically isn't linguistic when you say somebody | |
who's speaking one language can directly without | |
0:05:30.264 --> 0:05:33.218 | |
learning understands the other language. | |
0:05:33.218 --> 0:05:39.343 | |
So if you're a German speaker maybe Dutch | |
or Danish and all that kind of stuff would | |
0:05:39.343 --> 0:05:39.631 | |
be. | |
0:05:40.000 --> 0:05:45.990 | |
Useful or like directly understandable partially | |
to you. | |
0:05:46.586 --> 0:05:52.082 | |
That is, thanks to this kind of mutual enthology | |
ability that is basically based on language | |
0:05:52.082 --> 0:05:52.791 | |
similarity. | |
0:05:53.893 --> 0:05:57.105 | |
And then there's knowledge sharing this concept. | |
0:05:57.105 --> 0:06:01.234 | |
I mean, it's quite intuitive, basically a | |
very German speaker. | |
0:06:01.234 --> 0:06:06.805 | |
If you start learning Dutch or Danish and | |
all these Mordic languages, I think you're | |
0:06:06.805 --> 0:06:11.196 | |
going to be faster than just a native English | |
speaker or anything. | |
0:06:11.952 --> 0:06:18.751 | |
So hopefully our model is also able to do | |
this, but we'll see later what the real situation. | |
0:06:19.799 --> 0:06:27.221 | |
So we said multilingual is good multilingual | |
transmission, it's nice and there's a lot of | |
0:06:27.221 --> 0:06:28.210 | |
potentials. | |
0:06:28.969 --> 0:06:32.205 | |
So it's a long path towards there. | |
0:06:32.205 --> 0:06:37.569 | |
Think all the efforts started in so quite | |
some years ago. | |
0:06:37.958 --> 0:06:54.639 | |
At first people started with models with language | |
specific modules. | |
0:06:54.454 --> 0:06:58.747 | |
So we talked about the input of the decoder | |
architecture in the previous lecturer area. | |
0:07:00.100 --> 0:07:06.749 | |
And with this separation of the inputter and | |
the decoder, it gives it a natural way to split | |
0:07:06.749 --> 0:07:07.679 | |
the modules. | |
0:07:09.069 --> 0:07:20.805 | |
So basically what's happening going on here | |
is dedicated to each toes language and dedicated. | |
0:07:21.281 --> 0:07:34.252 | |
Now given parallel data of body good data | |
English German data we just activate this German | |
0:07:34.252 --> 0:07:39.241 | |
inputter and activate this and an. | |
0:07:40.680 --> 0:07:48.236 | |
So now we are training basically like corresponding | |
parts of the encoder decoders. | |
0:07:48.236 --> 0:07:55.278 | |
It has some advantages: First, we have a multilingual | |
system. | |
0:07:55.278 --> 0:08:03.898 | |
Of course, second modularity is also an advantage | |
in software engineering. | |
0:08:03.898 --> 0:08:10.565 | |
We want to decouple things if the German input | |
is broken. | |
0:08:11.011 --> 0:08:19.313 | |
So modularity is advantage in this case, but | |
again if we think about scalability, if we | |
0:08:19.313 --> 0:08:27.521 | |
think about languages out there that we talked | |
about, scalability isn't a great thing. | |
0:08:27.947 --> 0:08:37.016 | |
We also talked about sharing knowledge or | |
sharing representations for different languages. | |
0:08:37.317 --> 0:08:41.968 | |
We have a separate thing for each language. | |
0:08:41.968 --> 0:08:46.513 | |
How likely is it that we are sharing much? | |
0:08:46.513 --> 0:08:52.538 | |
So these are potential disadvantages with | |
this approach. | |
0:08:53.073 --> 0:09:01.181 | |
So yeah we talked about, we want to have knowledge | |
transfer, we want to have similar languages | |
0:09:01.181 --> 0:09:02.888 | |
helping each other. | |
0:09:02.822 --> 0:09:06.095 | |
This is somehow a more reachable goal. | |
0:09:06.095 --> 0:09:13.564 | |
If you have a shared in corner and a shared | |
in physically, a full perimeter shared model | |
0:09:13.564 --> 0:09:21.285 | |
for all the translation pairs out there, and | |
there's also another game, so if you just have | |
0:09:21.285 --> 0:09:21.705 | |
one. | |
0:09:22.582 --> 0:09:26.084 | |
Lock of model for all the translation directions | |
out there. | |
0:09:26.606 --> 0:09:38.966 | |
It's easier to deploy in the sense that if | |
you are serving a model you don't have a thousand | |
0:09:38.966 --> 0:09:42.555 | |
small modules to maintain. | |
0:09:42.762 --> 0:09:52.448 | |
So in terms of engineering somehow these kind | |
of fully primitive shared models have: So this | |
0:09:52.448 --> 0:09:59.819 | |
is also where the parent research has been | |
going towards in recent years. | |
0:10:00.460 --> 0:10:16.614 | |
So the rest of the electro are also going | |
to focus on this kind of model. | |
0:10:17.037 --> 0:10:30.901 | |
So the first type of multilinguali is this | |
kind of many to one abbreviated kind of situation. | |
0:10:30.901 --> 0:10:34.441 | |
Basically what's going. | |
0:10:35.355 --> 0:10:49.804 | |
So one news case that you can think of here | |
is if you're subtitled for international movies | |
0:10:49.804 --> 0:10:51.688 | |
in Germany. | |
0:10:53.073 --> 0:11:02.863 | |
Then flipping the situation there is also | |
many configurations where we only have when | |
0:11:02.863 --> 0:11:04.798 | |
source language. | |
0:11:06.046 --> 0:11:13.716 | |
There's also many use cases like if you think | |
about the lecture translator here you've seen. | |
0:11:14.914 --> 0:11:21.842 | |
So here most of the lecturers are in German | |
and now we want to translate it into. | |
0:11:21.842 --> 0:11:28.432 | |
I think on the user end we only support English | |
but they're also supportable. | |
0:11:28.608 --> 0:11:38.988 | |
So in this kind of used case, if you have | |
one speaker and you want to serve or expand | |
0:11:38.988 --> 0:11:41.281 | |
to many audience,. | |
0:11:42.802 --> 0:11:50.542 | |
But of course, combining everything, there's | |
the many to many situation here. | |
0:11:50.542 --> 0:11:54.015 | |
You can think of Google Translate. | |
0:11:54.015 --> 0:11:58.777 | |
They are doing basically any selected language. | |
0:11:59.159 --> 0:12:03.760 | |
And this is also more difficult. | |
0:12:03.760 --> 0:12:14.774 | |
If you consider the data you need to get and | |
concerns, we'll cover this later. | |
0:12:15.135 --> 0:12:21.034 | |
But first we are going to start with many | |
to one translations. | |
0:12:21.741 --> 0:12:30.436 | |
Say this is the most similar to the bilingual | |
translation situation you saw earlier, but | |
0:12:30.436 --> 0:12:39.423 | |
now one difference is we need a vocabulary | |
or tokens that can represent all these different | |
0:12:39.423 --> 0:12:40.498 | |
languages. | |
0:12:41.301 --> 0:12:44.200 | |
So we need a joint more telecom global vocabulary. | |
0:12:44.924 --> 0:12:48.794 | |
So let's just quickly recall what word embedding | |
is to do. | |
0:12:49.189 --> 0:12:54.561 | |
Basically we need to represent it. | |
0:12:54.561 --> 0:13:04.077 | |
We have to get some vector representation | |
for discrete words. | |
0:13:04.784 --> 0:13:16.911 | |
And when we embed a token, we are retrieving | |
the corresponding vector out of this little. | |
0:13:17.697 --> 0:13:19.625 | |
And then we put it. | |
0:13:19.625 --> 0:13:26.082 | |
We feed a sequence of vectors into the inputter | |
as the next steps. | |
0:13:26.987 --> 0:13:34.973 | |
Now if it's motelingual you can imagine that | |
vocabulary suddenly gets very, very big because | |
0:13:34.973 --> 0:13:36.262 | |
the languages. | |
0:13:37.877 --> 0:13:46.141 | |
So what is quite useful here is the by pair | |
like subwords you talked about by pairing. | |
0:13:46.406 --> 0:13:55.992 | |
So in this case we are still limiting ourselves | |
to a finite number of vocabularies that we | |
0:13:55.992 --> 0:13:59.785 | |
are exploding the vocabulary table. | |
0:14:01.181 --> 0:14:11.631 | |
So when we learn these kinds of subwords, | |
what happens basically? | |
0:14:11.631 --> 0:14:17.015 | |
We look at all the training data. | |
0:14:18.558 --> 0:14:20.856 | |
So think about this. | |
0:14:20.856 --> 0:14:28.077 | |
If we do this now on a bunch of Mozilla data, | |
are there concerns? | |
0:14:30.050 --> 0:14:36.811 | |
Maybe we have an underground status head, | |
so we get over English mergers and nocularities. | |
0:14:37.337 --> 0:14:39.271 | |
Yeah Exactly Thanks. | |
0:14:39.539 --> 0:14:46.602 | |
So what we have to pay attention to here is | |
learn this motilingual vocabulary. | |
0:14:46.602 --> 0:14:52.891 | |
We should pay attention: All the languages | |
are more or less balanced, not that you only | |
0:14:52.891 --> 0:14:58.912 | |
learning words is for for English or some bigger | |
languages, and then neglecting other other | |
0:14:58.912 --> 0:15:00.025 | |
languages, yeah. | |
0:15:01.021 --> 0:15:04.068 | |
Of course, this is not going to solve everything. | |
0:15:04.068 --> 0:15:09.614 | |
Even if we get a perfectly uniform distribution | |
out of all the languages out, there is not | |
0:15:09.614 --> 0:15:13.454 | |
going to mean that we are ending up with a | |
perfect vocabulary. | |
0:15:14.154 --> 0:15:20.068 | |
There are also language differences read, | |
so if you consider more European languages. | |
0:15:20.180 --> 0:15:27.081 | |
There will be many shared subcomponents like | |
how you write a certain word, somewhat similar. | |
0:15:27.267 --> 0:15:34.556 | |
But then there are other languages with completely | |
different scripts like Arabic, Cyrillic scripts | |
0:15:34.556 --> 0:15:40.594 | |
or Eastern Asian scripts where you get a vocabulary | |
like the characters set with. | |
0:15:40.940 --> 0:15:43.531 | |
Tens of thousands of characters. | |
0:15:43.531 --> 0:15:50.362 | |
So these are also individual concerns that | |
one has to think about my building specific | |
0:15:50.362 --> 0:15:51.069 | |
systems. | |
0:15:51.591 --> 0:16:02.660 | |
But overall, the rule of thumb is that when | |
you do a mottling tokenizer vocabulary, there's | |
0:16:02.660 --> 0:16:04.344 | |
more or less. | |
0:16:05.385 --> 0:16:17.566 | |
And there's actually some paper showing that | |
the performance of the final system is going | |
0:16:17.566 --> 0:16:25.280 | |
to start to degrade if you have a disproportionate | |
data. | |
0:16:27.207 --> 0:16:33.186 | |
Of course there is currently the trend of | |
using pre-train models. | |
0:16:33.186 --> 0:16:39.890 | |
If you take a pre-train model somewhere then | |
you don't have this concern. | |
0:16:40.580 --> 0:16:47.810 | |
Making sure that you use the same organizers | |
that they used so that there is no train test | |
0:16:47.810 --> 0:16:48.287 | |
time. | |
0:16:48.888 --> 0:16:53.634 | |
Yeah for a pre-trainer, we're going to talk | |
about a little bit later as well. | |
0:16:54.734 --> 0:16:59.960 | |
Alright: So now where's a Martin Luther vocabulary? | |
0:17:00.920 --> 0:17:04.187 | |
There are several good things, obviously. | |
0:17:04.187 --> 0:17:10.953 | |
So one thing is that if we have words that | |
are in the textful form like we said, there | |
0:17:10.953 --> 0:17:16.242 | |
are European languages that share some vocabulary, | |
then it's great. | |
0:17:16.242 --> 0:17:19.897 | |
Then we have the first step towards knowledge. | |
0:17:20.000 --> 0:17:30.464 | |
For example, the word pineapple for some reason | |
is also in Eastern European languages. | |
0:17:30.464 --> 0:17:34.915 | |
In Cyrillic scripts that's also the. | |
0:17:36.116 --> 0:17:42.054 | |
But however, there is also ambiguity if you've | |
embracing together or dye. | |
0:17:42.054 --> 0:17:46.066 | |
Of course, they mean different things for | |
German. | |
0:17:46.246 --> 0:17:53.276 | |
Then, of course, that's possible to rely on | |
further context. | |
0:17:53.276 --> 0:17:59.154 | |
It's not a problem, it's something to think | |
about. | |
0:18:00.200 --> 0:18:11.061 | |
And when we go higher to cover more vocabulary | |
entries, we might need to go bigger in the | |
0:18:11.061 --> 0:18:13.233 | |
vocabulary count. | |
0:18:13.653 --> 0:18:28.561 | |
So there is always sort of a bottleneck as | |
the number of languages increase. | |
0:18:30.110 --> 0:18:32.836 | |
Right, so what is the result? | |
0:18:32.836 --> 0:18:38.289 | |
What are these crustling over inventings actually | |
learning? | |
0:18:40.160 --> 0:18:44.658 | |
So normally to inspect them it's quite hard. | |
0:18:44.658 --> 0:18:53.853 | |
It's like high dimensional vectors with dimensions, | |
but researchers also try to project it. | |
0:18:54.454 --> 0:19:05.074 | |
So in this case it is a little bit small, | |
but in this case for English and French there | |
0:19:05.074 --> 0:19:07.367 | |
are many injuries. | |
0:19:07.467 --> 0:19:20.014 | |
My example is like different words with the | |
same word in morphological forms. | |
0:19:20.014 --> 0:19:26.126 | |
Basically, it's like a morphological. | |
0:19:26.546 --> 0:19:32.727 | |
There are also words in different languages | |
like think there is research for English and | |
0:19:32.727 --> 0:19:33.282 | |
French. | |
0:19:33.954 --> 0:19:41.508 | |
So the take away from this plot is that somehow | |
we learn a bit of semantic meanings beyond | |
0:19:41.508 --> 0:19:43.086 | |
the textual forms. | |
0:19:45.905 --> 0:19:50.851 | |
But then this looks good and this gives us | |
hope. | |
0:19:52.252 --> 0:20:05.240 | |
That if we consider what is the baseline here, | |
the baseline we compare to is a bilingual system | |
0:20:05.240 --> 0:20:09.164 | |
without any multilinguality. | |
0:20:10.290 --> 0:20:19.176 | |
This looks good because if we compare for | |
many Central European languages, Eastern and | |
0:20:19.176 --> 0:20:28.354 | |
Central European languages to English, we compare: | |
And we see that the Mini Two English has actually | |
0:20:28.354 --> 0:20:30.573 | |
always gained quite a bit over it. | |
0:20:31.751 --> 0:20:38.876 | |
But there is also later investigation on whether | |
it is actually out of mountain linguality or | |
0:20:38.876 --> 0:20:39.254 | |
not. | |
0:20:39.639 --> 0:20:46.692 | |
So this is a spoiler won't tell much about | |
it until the second half, but just remember | |
0:20:46.692 --> 0:20:47.908 | |
there is this. | |
0:20:49.449 --> 0:20:53.601 | |
Now move on to many translations. | |
0:20:53.601 --> 0:21:01.783 | |
Let's recall in a normal transformer or any | |
encoder decoder setup. | |
0:21:02.242 --> 0:21:08.839 | |
We have an inkluder that creates sort of contextual | |
representation for the sort of sentence. | |
0:21:09.949 --> 0:21:17.787 | |
Is more or less the context for generating | |
the target sentence red. | |
0:21:17.787 --> 0:21:28.392 | |
Now on the target side we get the first open, | |
then we feed it again and then get the second | |
0:21:28.392 --> 0:21:29.544 | |
decoding. | |
0:21:31.651 --> 0:21:35.039 | |
And now we have multiple target languages. | |
0:21:35.039 --> 0:21:39.057 | |
Does anybody see a problem with this architecture? | |
0:21:48.268 --> 0:21:57.791 | |
Specifically, it's in the decoder, so now | |
have a German sentiments encoded. | |
0:21:57.791 --> 0:22:01.927 | |
It now want to generate Spanish. | |
0:22:07.367 --> 0:22:11.551 | |
So the problem is how does the model know | |
which language to generate? | |
0:22:12.112 --> 0:22:24.053 | |
If you just give it a generic start token, | |
there is nowhere where we are telling the model. | |
0:22:24.944 --> 0:22:30.277 | |
So that this can only be a guess, and this | |
model will definitely not run well. | |
0:22:32.492 --> 0:22:40.021 | |
So this comes to the question: How do we indicate | |
the one's intended language to the model? | |
0:22:41.441 --> 0:22:52.602 | |
One first idea is what people tried is basically | |
now in a source where not only including the | |
0:22:52.602 --> 0:22:53.552 | |
source. | |
0:22:53.933 --> 0:23:01.172 | |
To Spanish things like this, so basically | |
the source is already informed. | |
0:23:01.172 --> 0:23:12.342 | |
The source sentence is already supplemented | |
with: Now this is also called a target forcing | |
0:23:12.342 --> 0:23:19.248 | |
in the sense that we try to force it to give | |
the right target. | |
0:23:20.080 --> 0:23:24.622 | |
This is one approach. | |
0:23:24.622 --> 0:23:38.044 | |
Another approach is basically based on the | |
idea that if we have. | |
0:23:38.438 --> 0:23:52.177 | |
So if we create a context of our world, the | |
incode output shouldn't really differ. | |
0:23:52.472 --> 0:24:02.397 | |
So out of this motivation people have moved | |
this signaling mechanism. | |
0:24:02.397 --> 0:24:09.911 | |
They basically replaced the traditional start | |
token. | |
0:24:10.330 --> 0:24:17.493 | |
So here we are not kids starting into the | |
generic start talking anymore instead language | |
0:24:17.493 --> 0:24:18.298 | |
specific. | |
0:24:18.938 --> 0:24:21.805 | |
So this is also another way to achieve this. | |
0:24:23.283 --> 0:24:27.714 | |
But there are still more challenging cases. | |
0:24:27.714 --> 0:24:35.570 | |
Sometimes here it can be called as General | |
English or German when it's there. | |
0:24:35.570 --> 0:24:39.700 | |
Later on it goes further and further on. | |
0:24:40.320 --> 0:24:46.752 | |
Basically this information is not strong enough | |
to always enforce the target language, especially | |
0:24:46.752 --> 0:24:48.392 | |
in zero shot conditions. | |
0:24:48.392 --> 0:24:54.168 | |
We'll look into this later so we'll get this | |
kind of target translation into generating | |
0:24:54.168 --> 0:24:57.843 | |
and generating and then going into some wrong | |
language. | |
0:24:59.219 --> 0:25:12.542 | |
So another technique actually developed here | |
some years ago was to inject this language. | |
0:25:12.872 --> 0:25:19.834 | |
So when we are feeding doing the auto-aggressive | |
decoding normally, we only feed the upherb. | |
0:25:20.000 --> 0:25:22.327 | |
Into the depoter. | |
0:25:22.327 --> 0:25:33.704 | |
But if we also add a language embedding for | |
the target language, on top of that we have | |
0:25:33.704 --> 0:25:37.066 | |
the language information. | |
0:25:37.397 --> 0:25:44.335 | |
And this has shown to perform quite a bit | |
better, especially in conditions where the | |
0:25:44.335 --> 0:25:44.906 | |
model. | |
0:25:46.126 --> 0:25:56.040 | |
So yeah, we introduced three ways to enforce | |
the Tardid language: And now with this we're | |
0:25:56.040 --> 0:26:02.607 | |
going to move on to the more interesting case | |
of many too many translations. | |
0:26:03.503 --> 0:26:14.021 | |
Am so here we just consider a system that | |
translates two directions: English to English | |
0:26:14.021 --> 0:26:15.575 | |
and English. | |
0:26:16.676 --> 0:26:21.416 | |
Now we have target languages read. | |
0:26:21.416 --> 0:26:29.541 | |
Can you see where we're enforcing the target | |
language here? | |
0:26:29.541 --> 0:26:33.468 | |
In this case what technique? | |
0:26:34.934 --> 0:26:45.338 | |
So here we are enforcing the characteristic | |
language with the yelling we train this system. | |
0:26:46.526 --> 0:27:00.647 | |
And at the inference time we are able to generate | |
English to French, but in addition to this | |
0:27:00.647 --> 0:27:12.910 | |
we are also able to: We will be able to do | |
zero shot inference that basically translates | |
0:27:12.910 --> 0:27:17.916 | |
a direction that is not seen in training. | |
0:27:19.319 --> 0:27:25.489 | |
So this is so called zero shot translation | |
using a modeling wall system. | |
0:27:26.606 --> 0:27:34.644 | |
Of course, we have to reach several things | |
before we are able to control the language, | |
0:27:34.644 --> 0:27:36.769 | |
otherwise it's no use. | |
0:27:37.317 --> 0:27:51.087 | |
Second, we should also have some kind of language | |
independent representation. | |
0:27:51.731 --> 0:27:53.196 | |
Why is this? | |
0:27:53.196 --> 0:27:55.112 | |
Why is this big? | |
0:27:55.112 --> 0:28:00.633 | |
Because if women drink generally French up | |
here? | |
0:28:00.940 --> 0:28:05.870 | |
It was trained to translate from some English. | |
0:28:07.187 --> 0:28:15.246 | |
But now we use Anchored Germans in the French, | |
so intuitively we need these representations | |
0:28:15.246 --> 0:28:22.429 | |
to be similar enough, not that they are so | |
far attracted that we cannot use this. | |
0:28:25.085 --> 0:28:32.059 | |
So there are several works out there showing | |
that if you do a standard transformer architecture | |
0:28:32.059 --> 0:28:39.107 | |
this language independent property is not really | |
there and you need to add additional approaches | |
0:28:39.107 --> 0:28:40.633 | |
in order to enforce. | |
0:28:41.201 --> 0:28:51.422 | |
So you can, for example, add an additional | |
training objective: That says, we invoked SARSN, | |
0:28:51.422 --> 0:29:00.305 | |
be invoked by German, and the invoked English | |
have to be the same or be as close to each | |
0:29:00.305 --> 0:29:02.201 | |
other as possible. | |
0:29:02.882 --> 0:29:17.576 | |
So if we take the output and the output for | |
another language, how can we formulate this | |
0:29:17.576 --> 0:29:18.745 | |
as an. | |
0:29:20.981 --> 0:29:27.027 | |
We can take the translation to the encoder | |
and whatever you translate. | |
0:29:27.027 --> 0:29:32.817 | |
The embeddings also must be similar and that's | |
the great direction. | |
0:29:33.253 --> 0:29:42.877 | |
So one thing to take care of here is the length | |
for the same sentence in German and English | |
0:29:42.877 --> 0:29:44.969 | |
is not necessarily. | |
0:29:45.305 --> 0:30:00.858 | |
So if we just do a word to word matching, | |
we can always do pulling to a fixed length | |
0:30:00.858 --> 0:30:03.786 | |
representation. | |
0:30:04.004 --> 0:30:08.392 | |
Or there are more advanced techniques that | |
involve some alignments. | |
0:30:08.848 --> 0:30:23.456 | |
So this is useful in the sense that in this | |
part in experiments we have shown it improves | |
0:30:23.456 --> 0:30:27.189 | |
zero shot translation. | |
0:30:27.447 --> 0:30:36.628 | |
This is on the data condition of English to | |
Malay, Java and Filipino, so kind of made to | |
0:30:36.628 --> 0:30:39.722 | |
low resource language family. | |
0:30:40.100 --> 0:30:50.876 | |
And there we assume that we get parallel English | |
to all of them, but among all these. | |
0:30:51.451 --> 0:31:03.592 | |
So the blue bar is a Vanilla Transformer model, | |
and the purple bar is when we add a language. | |
0:31:04.544 --> 0:31:12.547 | |
You see that in supervised conditions it's | |
not changing much, but in zero shots there's | |
0:31:12.547 --> 0:31:13.183 | |
quite. | |
0:31:15.215 --> 0:31:22.649 | |
Yeah, so far we said zero shots is doable | |
and it's even more achievable if we enforce | |
0:31:22.649 --> 0:31:26.366 | |
some language independent representations. | |
0:31:26.366 --> 0:31:29.823 | |
However, there's one practical concern. | |
0:31:29.823 --> 0:31:33.800 | |
Don't know if you also had the same question. | |
0:31:34.514 --> 0:31:39.835 | |
If you have two languages, you don't have | |
direct parallel. | |
0:31:39.835 --> 0:31:43.893 | |
One's into English and one's out of English. | |
0:31:45.685 --> 0:31:52.845 | |
It's actually this kind of approach is called | |
pivoting as in pivoting over an intermediate | |
0:31:52.845 --> 0:31:53.632 | |
language. | |
0:31:55.935 --> 0:32:00.058 | |
Yeah, that it definitely has advantages in | |
the sense that we're going. | |
0:32:00.440 --> 0:32:11.507 | |
Now if we go over these two steps every direction | |
was trained with supervised data so you could | |
0:32:11.507 --> 0:32:18.193 | |
always assume that when we are working with | |
a supervised. | |
0:32:18.718 --> 0:32:26.868 | |
So in this case we can expect more robust | |
inference time behavior. | |
0:32:26.868 --> 0:32:31.613 | |
However, there are also disadvantages. | |
0:32:31.531 --> 0:32:38.860 | |
An inference where passing through the model | |
ties so that's doubling the inference time | |
0:32:38.860 --> 0:32:39.943 | |
computation. | |
0:32:40.500 --> 0:32:47.878 | |
You might think okay doubling then what, but | |
if you consider if your company like Google, | |
0:32:47.878 --> 0:32:54.929 | |
Google Translate and all your life traffic | |
suddenly becomes twice as big, this is not | |
0:32:54.929 --> 0:33:00.422 | |
something scalable that you want to see, especially | |
in production. | |
0:33:01.641 --> 0:33:11.577 | |
A problem with this is making information | |
loss because if we go over these games when | |
0:33:11.577 --> 0:33:20.936 | |
a chain of kids pass the word to each other, | |
in the end it's losing information. | |
0:33:22.082 --> 0:33:24.595 | |
Can give it an example here. | |
0:33:24.595 --> 0:33:27.803 | |
It's also from a master thesis here. | |
0:33:27.803 --> 0:33:30.316 | |
It's on gender preservation. | |
0:33:30.770 --> 0:33:39.863 | |
Basically, some languages like Italian and | |
French have different word forms based on the | |
0:33:39.863 --> 0:33:40.782 | |
speaker. | |
0:33:41.001 --> 0:33:55.987 | |
So if a male person says feel alienated, this | |
word for alienated would be exclusive and a | |
0:33:55.987 --> 0:33:58.484 | |
female person. | |
0:34:00.620 --> 0:34:05.730 | |
Now imagine that we pivot through anguish. | |
0:34:05.730 --> 0:34:08.701 | |
The information is lost. | |
0:34:08.701 --> 0:34:11.910 | |
We don't know what gender. | |
0:34:12.492 --> 0:34:19.626 | |
When we go out into branch again, there are | |
different forms. | |
0:34:19.626 --> 0:34:29.195 | |
Depending on the speaker gender, we can: So | |
this is one problem. | |
0:34:31.871 --> 0:34:44.122 | |
This is especially the case because English | |
compared to many other languages is relatively | |
0:34:44.122 --> 0:34:45.199 | |
simple. | |
0:34:45.205 --> 0:34:53.373 | |
Gendered where it forms like this, it also | |
doesn't have many cases, so going through English | |
0:34:53.373 --> 0:34:56.183 | |
many information would be lost. | |
0:34:57.877 --> 0:35:12.796 | |
And another thing is if you have similar languages | |
that you are translating out of my systems | |
0:35:12.796 --> 0:35:15.494 | |
that translates. | |
0:35:16.496 --> 0:35:24.426 | |
This is the output of going from Dutch to | |
German again. | |
0:35:24.426 --> 0:35:30.231 | |
If you read the German, how many of you? | |
0:35:32.552 --> 0:35:51.679 | |
Good and the problem here is that we are going | |
over English and then the English to German. | |
0:35:51.831 --> 0:36:06.332 | |
However, if we go direct in this case zero | |
shot translation you see that word forgive. | |
0:36:06.546 --> 0:36:09.836 | |
In this case, the outward translation is better. | |
0:36:10.150 --> 0:36:20.335 | |
And we believe this has to do with using the | |
language similarity between the two languages. | |
0:36:20.335 --> 0:36:26.757 | |
There is also quantitative results we found | |
when born in. | |
0:36:27.988 --> 0:36:33.780 | |
The models are always doing better when translating | |
similar languages compared to the. | |
0:36:35.535 --> 0:36:42.093 | |
Yeah, so in this first half what we talked | |
about basically first, we started with how | |
0:36:42.093 --> 0:36:49.719 | |
motilinguality or motilingual machine translation | |
could enable knowledge transfer between languages | |
0:36:49.719 --> 0:36:53.990 | |
and help with conditions where we don't have | |
much data. | |
0:36:55.235 --> 0:37:02.826 | |
Now it looks at three types of multilingual | |
translation, so one is many to one, one to | |
0:37:02.826 --> 0:37:03.350 | |
many. | |
0:37:05.285 --> 0:37:13.397 | |
We got there first about a shared vocabulary | |
based on different languages and how these | |
0:37:13.397 --> 0:37:22.154 | |
cross lingual word embeddings capture semantic | |
meanings rather than just on a text proof form. | |
0:37:25.505 --> 0:37:37.637 | |
Then we looked at how to signal the target | |
language, how to ask for the model to generate, | |
0:37:37.637 --> 0:37:43.636 | |
and then we looked at zero shot translation. | |
0:37:45.325 --> 0:37:58.187 | |
You now before go into the second half are | |
there questions about the first okay good. | |
0:38:00.140 --> 0:38:10.932 | |
In the second half of this lecture we'll be | |
looking into challenges like what is still | |
0:38:10.932 --> 0:38:12.916 | |
unsolved about. | |
0:38:13.113 --> 0:38:18.620 | |
There are some aspects to look at it. | |
0:38:18.620 --> 0:38:26.591 | |
The first is modeling, the second is more | |
engineering. | |
0:38:28.248 --> 0:38:33.002 | |
Okay, so we talked about this question several | |
times. | |
0:38:33.002 --> 0:38:35.644 | |
How does motilinguality help? | |
0:38:35.644 --> 0:38:37.405 | |
Where does it help? | |
0:38:38.298 --> 0:38:45.416 | |
Here want to show results of an experiment | |
based on over a hundred languages. | |
0:38:46.266 --> 0:38:58.603 | |
Here you can see the data amount so they use | |
parallel data to English and it's very. | |
0:38:58.999 --> 0:39:00.514 | |
This is already lock scale. | |
0:39:00.961 --> 0:39:12.982 | |
So for higher resource languages like English | |
to French, German to Spanish you get over billion | |
0:39:12.982 --> 0:39:14.359 | |
sentences. | |
0:39:14.254 --> 0:39:21.003 | |
In parallel, and when we go more to the right | |
to the more low resource spectrum on the other | |
0:39:21.003 --> 0:39:26.519 | |
hand, there are languages that maybe many of | |
us have new and heard of like. | |
0:39:26.466 --> 0:39:29.589 | |
Do You Want to Move Back? | |
0:39:30.570 --> 0:39:33.270 | |
Hawaiian Indians have heard of it. | |
0:39:34.414 --> 0:39:39.497 | |
So on that spectrum we only have like thirty | |
thousand sentences. | |
0:39:40.400 --> 0:39:48.389 | |
So what this means is when we train, we have | |
to up sample these guys. | |
0:39:48.389 --> 0:39:51.585 | |
The model didn't even know. | |
0:39:52.732 --> 0:40:05.777 | |
Yeah, so on this graph on how we read it is | |
this horizontal line and zero is basically | |
0:40:05.777 --> 0:40:07.577 | |
indicating. | |
0:40:07.747 --> 0:40:14.761 | |
Because we want to see where mottling quality | |
helps only compare to what happens when there | |
0:40:14.761 --> 0:40:15.371 | |
is not. | |
0:40:16.356 --> 0:40:29.108 | |
So upper like higher than the zero line it | |
means we're gaining. | |
0:40:29.309 --> 0:40:34.154 | |
The same like for these languages. | |
0:40:34.154 --> 0:40:40.799 | |
This side means we are a high resource for | |
the. | |
0:40:40.981 --> 0:40:46.675 | |
Yeah sorry, think I've somehow removed the | |
the ex-O as he does. | |
0:40:48.008 --> 0:40:58.502 | |
Yeah alright, what happens now if we look | |
at many into English? | |
0:40:58.698 --> 0:41:08.741 | |
On the low resource spectrum by going multilingua | |
we gain a lot over the Palumbo system. | |
0:41:10.010 --> 0:41:16.658 | |
Overall, if you consider the average for all | |
of the languages, it's still again. | |
0:41:17.817 --> 0:41:27.301 | |
Now we're looking at the green line so you | |
can ignore the blue line. | |
0:41:27.301 --> 0:41:32.249 | |
Basically we have to do our sample. | |
0:41:33.753 --> 0:41:41.188 | |
Yeah, so if you just even consider the average, | |
it's still a game form over by link. | |
0:41:42.983 --> 0:41:57.821 | |
However, if we go to the English to many systems | |
looking at the gains, we only get minor improvements. | |
0:41:59.039 --> 0:42:12.160 | |
So why is it the case that Going Mott Lingu | |
isn't really helping universally? | |
0:42:16.016 --> 0:42:18.546 | |
Do you have some intuitions on yeah? | |
0:42:18.698 --> 0:42:38.257 | |
It's easier to understand something that generates | |
if we consider what the model has to generate. | |
0:42:38.718 --> 0:42:40.091 | |
I See It Like. | |
0:42:40.460 --> 0:42:49.769 | |
Generating is a bit like writing or speaking, | |
while inputing on the source side is more like | |
0:42:49.769 --> 0:42:50.670 | |
reading. | |
0:42:50.650 --> 0:42:57.971 | |
So one is more passive and the other is more | |
active and don't know if you have similar experience. | |
0:42:57.971 --> 0:43:05.144 | |
I think speaking and writing is always a little | |
bit more difficult than just passively listening | |
0:43:05.144 --> 0:43:06.032 | |
or reading. | |
0:43:06.032 --> 0:43:09.803 | |
But this is a very pendwavy kind of understanding. | |
0:43:10.390 --> 0:43:11.854 | |
And fed. | |
0:43:12.032 --> 0:43:20.309 | |
In terms of the model, if we consider what | |
is the difference for the target side for many | |
0:43:20.309 --> 0:43:26.703 | |
to English: One difference is that there's | |
a data difference. | |
0:43:27.167 --> 0:43:33.438 | |
So if you just consider a modern English system | |
with German to English and Spanish to English,. | |
0:43:34.975 --> 0:43:44.321 | |
One thing we have to keep in mind is that | |
the parallel data is not all the same, so on | |
0:43:44.321 --> 0:43:49.156 | |
the target side there are different English. | |
0:43:49.769 --> 0:43:54.481 | |
So the situation rather looks like this. | |
0:43:54.481 --> 0:43:59.193 | |
What this means is that we are going to. | |
0:44:00.820 --> 0:44:04.635 | |
We also add more data on the target side for | |
English. | |
0:44:06.967 --> 0:44:18.581 | |
Now since the target side data is not identical, | |
how do we do a controlled experiment to remove | |
0:44:18.581 --> 0:44:21.121 | |
the multilinguality? | |
0:44:24.644 --> 0:44:42.794 | |
So what people tried as a control experiment | |
is to keep all the English same as the above | |
0:44:42.794 --> 0:44:44.205 | |
setup. | |
0:44:44.684 --> 0:44:49.700 | |
So they take the English on English data of | |
the same branch to German. | |
0:44:50.090 --> 0:44:55.533 | |
And then the general synthetic data for Germans. | |
0:44:55.533 --> 0:45:05.864 | |
So now we have a bilingual system again, but | |
on the target side we still have the previously | |
0:45:05.864 --> 0:45:08.419 | |
enriched English data. | |
0:45:10.290 --> 0:45:25.092 | |
Now back to this picture that we've seen before, | |
this mysterious orange line here is basically | |
0:45:25.092 --> 0:45:26.962 | |
the result. | |
0:45:27.907 --> 0:45:36.594 | |
And somewhat struckly and perhaps sadly for | |
believers of multilinguality. | |
0:45:36.594 --> 0:45:39.176 | |
This is also gaining. | |
0:45:41.001 --> 0:45:52.775 | |
So what this means is for the many English | |
is gaining not really because of multilinguality | |
0:45:52.775 --> 0:45:55.463 | |
but just because of. | |
0:45:55.976 --> 0:46:10.650 | |
And this means that there is still quite a | |
lot to do if we really want to gain from just | |
0:46:10.650 --> 0:46:13.618 | |
shared knowledge. | |
0:46:14.514 --> 0:46:27.599 | |
But this also gives hope because there are | |
still many things to research in this area | |
0:46:27.599 --> 0:46:28.360 | |
now. | |
0:46:28.708 --> 0:46:40.984 | |
So we've seen adding more languages helps | |
with somewhat data side effect and can it hurt. | |
0:46:40.984 --> 0:46:45.621 | |
So if we just add more languages. | |
0:46:47.007 --> 0:46:48.408 | |
We've seen this. | |
0:46:48.408 --> 0:46:52.694 | |
This is the picture for the Manitou English | |
system. | |
0:46:53.793 --> 0:47:09.328 | |
Comparing to this valuable face line, we see | |
that for these high resource languages we are | |
0:47:09.328 --> 0:47:12.743 | |
not doing as great. | |
0:47:15.956 --> 0:47:18.664 | |
So why are we losing here? | |
0:47:18.664 --> 0:47:25.285 | |
It's been showing that this performance last | |
is somewhat related. | |
0:47:26.026 --> 0:47:37.373 | |
In the sense that the motto has to learn so | |
much that at some point it has to sacrifice | |
0:47:37.373 --> 0:47:39.308 | |
capacity from. | |
0:47:41.001 --> 0:47:57.081 | |
So what to do to basically grow a bigger brain | |
to tackle this is to add some dedicated capacity | |
0:47:57.081 --> 0:47:59.426 | |
per language. | |
0:48:00.100 --> 0:48:15.600 | |
Here it's like a simplified graph of a transformer | |
architecture, so this is the encoder within | |
0:48:15.600 --> 0:48:16.579 | |
time. | |
0:48:17.357 --> 0:48:27.108 | |
But additionally here these little colorable | |
blouse are now the language-specific capable | |
0:48:27.108 --> 0:48:28.516 | |
of capacity. | |
0:48:29.169 --> 0:48:42.504 | |
There are language specific in the sense that | |
if you get the Chinese to English, the pattern. | |
0:48:43.103 --> 0:48:54.900 | |
We are also going to language specific parts | |
that in this case consists of a down projection. | |
0:48:56.416 --> 0:49:07.177 | |
So this is also called adaptors, something | |
that is plugged into an existing model and | |
0:49:07.177 --> 0:49:11.556 | |
it adapts towards a specific task. | |
0:49:12.232 --> 0:49:22.593 | |
And this is conditionally activated in the | |
sense that if you get a different input sentence. | |
0:49:27.307 --> 0:49:34.173 | |
So this was first proposed in by some folks | |
selling Google. | |
0:49:34.173 --> 0:49:36.690 | |
Does this scale well? | |
0:49:39.619 --> 0:49:56.621 | |
Yes exactly, so this is a translation periscusive | |
cannon adapter, and this is not going to scale | |
0:49:56.621 --> 0:49:57.672 | |
well. | |
0:49:58.959 --> 0:50:13.676 | |
So this also brought people to try some more | |
simple architecture. | |
0:50:16.196 --> 0:50:22.788 | |
Yeah, this is also an alternative, in this | |
case called monolingual adapters. | |
0:50:24.184 --> 0:50:32.097 | |
Any of these adapters so again have this low | |
resource. | |
0:50:32.097 --> 0:50:42.025 | |
The zero line is bilingual baseline, but the | |
lines are interpolated. | |
0:50:43.783 --> 0:50:48.767 | |
The red one is the mottling word original | |
mottling word model. | |
0:50:49.929 --> 0:50:57.582 | |
And if we put the adapters in like a basic | |
virginal adapter that goes to the blue liner,. | |
0:50:58.078 --> 0:51:08.582 | |
You see the lids gaining performance for the | |
high resource languages. | |
0:51:08.582 --> 0:51:16.086 | |
If they even scale a lot, this further increases. | |
0:51:16.556 --> 0:51:22.770 | |
So this is also a side kind of this. | |
0:51:23.103 --> 0:51:27.807 | |
From the side shows that it's really a capacity | |
bottom up. | |
0:51:28.488 --> 0:51:30.590 | |
Like If You Eleanor. | |
0:51:31.151 --> 0:51:34.313 | |
Resource they regain their performance. | |
0:51:38.959 --> 0:51:50.514 | |
For smaller languages, but it's just. | |
0:51:50.770 --> 0:52:03.258 | |
Think in the original modeling, the smaller | |
languages they weren't constrained by capacity. | |
0:52:05.445 --> 0:52:13.412 | |
So guess for the smaller languages, the difficulty | |
is more the data rather than the model capacity. | |
0:52:13.573 --> 0:52:26.597 | |
So in general you always want to have more | |
or less data matching your model capacity. | |
0:52:27.647 --> 0:52:33.255 | |
Yeah, here think the bigger challenge for | |
lower roots was the data. | |
0:52:34.874 --> 0:52:39.397 | |
You also mention it a little bit. | |
0:52:39.397 --> 0:52:46.979 | |
Are these adapters per language or how many | |
adapters do? | |
0:52:47.267 --> 0:52:55.378 | |
And do we have to design them differently | |
so that we learn to share more like a language | |
0:52:55.378 --> 0:52:56.107 | |
family? | |
0:52:56.576 --> 0:53:15.680 | |
So one downside of the adaptor we talked about | |
is that basically there is no way to go over. | |
0:53:16.516 --> 0:53:31.391 | |
So then a recent kind of additional approach | |
for these language specific capacity is so | |
0:53:31.391 --> 0:53:36.124 | |
called routing or learning. | |
0:53:36.256 --> 0:53:42.438 | |
Basically, we have these language specific | |
components. | |
0:53:42.438 --> 0:53:45.923 | |
We also have a shared adapter. | |
0:53:45.923 --> 0:53:52.574 | |
The model should learn: So in this case maybe | |
we could imagine for the lower resource case | |
0:53:52.574 --> 0:53:54.027 | |
that we just talked about. | |
0:53:54.094 --> 0:54:04.838 | |
Sense to go there because there's not much | |
to do with language specific anyway than it's | |
0:54:04.838 --> 0:54:10.270 | |
better to make use of similarity with other. | |
0:54:11.111 --> 0:54:30.493 | |
So this architecture is more data driven instead | |
of what we specify prior to training. | |
0:54:31.871 --> 0:54:33.998 | |
So how do we learn this? | |
0:54:35.095 --> 0:54:49.286 | |
Basically, in terms of the mask, we want to | |
basically have a binary rule that goes either | |
0:54:49.286 --> 0:54:50.548 | |
to the. | |
0:54:51.311 --> 0:54:56.501 | |
But how do we get a valued zero or one mean | |
we can? | |
0:54:56.501 --> 0:54:58.498 | |
We can do a signal. | |
0:54:58.999 --> 0:55:13.376 | |
However, one thing is we don't want to get | |
stuck in the middle, so we don't want black. | |
0:55:14.434 --> 0:55:28.830 | |
It is also bad because it is not going to | |
be the same training and test time by the way. | |
0:55:31.151 --> 0:55:50.483 | |
So here the question is how do we force basically | |
the model to always go there prior to activation? | |
0:55:54.894 --> 0:56:02.463 | |
Found it interesting because it sounds like | |
a trick for me. | |
0:56:02.463 --> 0:56:05.491 | |
This approach has been. | |
0:56:06.026 --> 0:56:15.844 | |
So what they do is prior to going through | |
this activation, and they add some bosom noise. | |
0:56:17.257 --> 0:56:31.610 | |
If there is always noise prior to activation | |
then the model will be encouraged to preserve | |
0:56:31.610 --> 0:56:34.291 | |
the information. | |
0:56:36.356 --> 0:56:44.067 | |
Was a very interesting thing that found out | |
while preparing this, so wanted to share this | |
0:56:44.067 --> 0:56:44.410 | |
as. | |
0:56:44.544 --> 0:56:48.937 | |
So basically you can create a battery gate | |
with this technique. | |
0:56:50.390 --> 0:57:01.668 | |
And if you add these language specific routing: | |
Here they also have some that can control how | |
0:57:01.668 --> 0:57:07.790 | |
much is shared and how much is language specific. | |
0:57:07.727 --> 0:57:16.374 | |
Here the seals are the is the routing with | |
the red and orange lines, so. | |
0:57:16.576 --> 0:57:22.752 | |
So you can see that poor for many and many | |
to one there in both cases quite some games. | |
0:57:23.063 --> 0:57:30.717 | |
So that is the overall picture and just find | |
the idea of the routing quite interesting. | |
0:57:30.991 --> 0:57:32.363 | |
And UM. | |
0:57:32.212 --> 0:57:38.348 | |
It's also getting a bit more increasingly | |
used as there are the so called mixture of | |
0:57:38.348 --> 0:57:39.431 | |
expert models. | |
0:57:39.499 --> 0:57:51.801 | |
The model learns where to route the input | |
so they are all conditionally activated when | |
0:57:51.801 --> 0:57:53.074 | |
you are. | |
0:57:53.213 --> 0:57:59.089 | |
But this is not really something specific | |
to mortal inquality, so won't talk too much | |
0:57:59.089 --> 0:57:59.567 | |
about. | |
0:58:00.620 --> 0:58:02.115 | |
No. | |
0:58:01.761 --> 0:58:09.640 | |
From this parrot is first that we talked about | |
the listing of the capacity bottleneck. | |
0:58:10.570 --> 0:58:19.808 | |
Where we can partly compensate by adapters | |
or adding language specific capacity, there's | |
0:58:19.808 --> 0:58:23.026 | |
the idea of negative transfer. | |
0:58:24.844 --> 0:58:35.915 | |
When we add any additional capacity, how can | |
we improve the knowledge sharing? | |
0:58:38.318 --> 0:58:46.662 | |
Also, for this one too many directions that | |
seem to be hopeless for multilinguality, can | |
0:58:46.662 --> 0:58:47.881 | |
we actually? | |
0:58:49.129 --> 0:58:52.171 | |
Yeah, these are all open things still in the | |
area. | |
0:58:53.673 --> 0:59:04.030 | |
Now next part, I'm going to talk about some | |
data challenges for Model Ewell. | |
0:59:04.030 --> 0:59:07.662 | |
We talk about Model Ewell. | |
0:59:08.488 --> 0:59:14.967 | |
But there are these lower resource languages | |
that don't have well curated parallel data. | |
0:59:16.216 --> 0:59:27.539 | |
When alternative people resort to Pro Data | |
from the Internet, there's a lot of noise. | |
0:59:27.927 --> 0:59:36.244 | |
And in this paper last year they did some | |
manual analyses of several popular cross data | |
0:59:36.244 --> 0:59:36.811 | |
sets. | |
0:59:37.437 --> 0:59:55.262 | |
And you'll see that there are a lot of wrong | |
translations, non-linguistic contents, pornographic | |
0:59:55.262 --> 0:59:57.100 | |
contents. | |
0:59:57.777 --> 1:00:04.661 | |
So as you can imagine, they say what you eat. | |
1:00:04.661 --> 1:00:20.116 | |
If you use this kind of data to train a model, | |
you can: So there are also many techniques | |
1:00:20.116 --> 1:00:28.819 | |
for filtering and filtering these noisy data | |
sets. | |
1:00:29.809 --> 1:00:36.982 | |
So to filter these out we can use an additional | |
classifier that basically are trained to classify | |
1:00:36.982 --> 1:00:43.496 | |
which language to sentences and then kick out | |
all the sentences with the wrong language. | |
1:00:45.105 --> 1:00:49.331 | |
Another thing is the length ratio. | |
1:00:49.331 --> 1:01:00.200 | |
Basically, the assumption there is that if | |
two sentences are translations of each other,. | |
1:01:01.901 --> 1:01:08.718 | |
So often people use maybe a ratio of three | |
and then it eliminates the rest. | |
1:01:09.909 --> 1:01:20.187 | |
Also, the other idea maybe similar to the | |
language classifier is basically to heaven | |
1:01:20.187 --> 1:01:24.540 | |
allowed character set per language. | |
1:01:24.540 --> 1:01:28.289 | |
So if you're trying to filter. | |
1:01:28.568 --> 1:01:34.622 | |
Don't know Cyrillic spribs or Arabic spribs, | |
then it's maybe a good idea to remove them. | |
1:01:35.775 --> 1:01:43.123 | |
This is not all there are many other ideas | |
using some pre-trained neural networks to compare | |
1:01:43.123 --> 1:01:50.629 | |
the representations, but just to give you an | |
idea of what our basic techniques were filtering. | |
1:01:50.991 --> 1:01:53.458 | |
Is quite important. | |
1:01:53.458 --> 1:02:02.465 | |
We have seen in our experience that if you | |
do these thoroughly there is. | |
1:02:03.883 --> 1:02:17.814 | |
So after all, even if we do web crawling, | |
there is still a bit of data scarcity problem. | |
1:02:18.118 --> 1:02:30.760 | |
So there are many bad things that can happen | |
when there's too little training data. | |
1:02:30.760 --> 1:02:35.425 | |
The first is low performances. | |
1:02:35.735 --> 1:02:55.562 | |
So they did it on many English system index | |
languages, all together with here means: So | |
1:02:55.562 --> 1:03:04.079 | |
we really need to get that area of a lot of | |
data in order to get that ideal performance. | |
1:03:04.884 --> 1:03:20.639 | |
There are also many horrible things that can | |
happen in general when you train a model across | |
1:03:20.639 --> 1:03:24.874 | |
different training runs. | |
1:03:26.946 --> 1:03:36.733 | |
So one solution to tackle this problem, the | |
data scarcity problem, is by fine tuning some | |
1:03:36.733 --> 1:03:38.146 | |
pre-trained. | |
1:03:38.979 --> 1:03:46.245 | |
And basically the idea is you've got the pre-trained | |
model that can already do translation. | |
1:03:46.846 --> 1:03:54.214 | |
Then you find units on your own training data | |
and you end up with a more specialized model. | |
1:03:55.155 --> 1:03:59.369 | |
So why does pretraining help? | |
1:03:59.369 --> 1:04:11.448 | |
One argument is that if you do pretraining | |
then the motto has seen over more data and | |
1:04:11.448 --> 1:04:12.713 | |
learned. | |
1:04:13.313 --> 1:04:19.135 | |
Say more generalizable representations that | |
can help more downstream tasks. | |
1:04:19.719 --> 1:04:28.063 | |
So in this case we are basically trying to | |
make use of the more meaningful and generalizable | |
1:04:28.063 --> 1:04:29.499 | |
representation. | |
1:04:30.490 --> 1:04:45.103 | |
So for machine translation there are several | |
open source models out there that can handle | |
1:04:45.103 --> 1:04:46.889 | |
languages. | |
1:04:48.188 --> 1:04:49.912 | |
Two hundred model. | |
1:04:49.912 --> 1:04:53.452 | |
They also cover two hundred languages. | |
1:04:53.452 --> 1:04:57.628 | |
That means that's quite a lot of translation. | |
1:04:57.978 --> 1:05:06.218 | |
However, one thing to remember is that these | |
lados are more like a how do you call them. | |
1:05:06.146 --> 1:05:12.812 | |
Jackson Waltry is a master of none in the | |
sense that they are very good as coverage, | |
1:05:12.812 --> 1:05:20.498 | |
but if you look at specific translation directions | |
they might be not as good as dedicated models. | |
1:05:21.521 --> 1:05:34.170 | |
So here I'm going to have some results by | |
comparing random initialization versus the | |
1:05:34.170 --> 1:05:36.104 | |
first thing. | |
1:05:36.396 --> 1:05:46.420 | |
The third line is the result of basically | |
finding a pre-train model that is one of the | |
1:05:46.420 --> 1:05:47.342 | |
family. | |
1:05:47.947 --> 1:05:51.822 | |
So in this case you could see the. | |
1:05:51.831 --> 1:05:58.374 | |
If we just look at the second line, that is | |
the pre trade model out of the box, you see | |
1:05:58.374 --> 1:06:04.842 | |
that if we just use it out of the box, the | |
performance everywhere isn't super great as | |
1:06:04.842 --> 1:06:06.180 | |
dedicated models. | |
1:06:07.867 --> 1:06:21.167 | |
But then here that ex-here means English: | |
So the first takeaway here is that if we do | |
1:06:21.167 --> 1:06:31.560 | |
pre-train financing again when we do it into | |
English,. | |
1:06:33.433 --> 1:06:40.438 | |
Here is that we are forgetting. | |
1:06:40.438 --> 1:06:50.509 | |
When we do further training there is no data. | |
1:06:50.770 --> 1:07:04.865 | |
So even if we initialize the pre-trained bottle | |
and continue training, if we don't see translation. | |
1:07:05.345 --> 1:07:13.826 | |
So this is bad machine learning people termed | |
it as perfect forgetting in the sense that | |
1:07:13.826 --> 1:07:20.115 | |
if you have a model that is trained to do some | |
task and then you. | |
1:07:20.860 --> 1:07:22.487 | |
This Is Also Pretty Bad. | |
1:07:24.244 --> 1:07:32.341 | |
Is especially bad if you consider training | |
data actually grows over time. | |
1:07:32.341 --> 1:07:35.404 | |
It's not like you have one. | |
1:07:36.336 --> 1:07:46.756 | |
So in practice we do not always train systems | |
from stretch so it's more like you have an | |
1:07:46.756 --> 1:07:54.951 | |
existing system and later we want to expand | |
the translation coverage. | |
1:07:57.277 --> 1:08:08.932 | |
Here and the key question is how do we continue | |
training from an existing system in doing so? | |
1:08:09.909 --> 1:08:12.288 | |
Approaches. | |
1:08:12.288 --> 1:08:27.945 | |
One very simple one is to include a portion | |
of your previous training so that. | |
1:08:28.148 --> 1:08:34.333 | |
So if you consider you have an English German | |
system and now you want to explain it to English | |
1:08:34.333 --> 1:08:34.919 | |
French,. | |
1:08:36.036 --> 1:08:42.308 | |
Like so nice going English, French and English | |
German, so when you train it you still include | |
1:08:42.308 --> 1:08:45.578 | |
a small proportion of your previous German | |
data. | |
1:08:45.578 --> 1:08:51.117 | |
Hopefully your model is not forgetting that | |
much about the previously lent German. | |
1:08:53.073 --> 1:08:58.876 | |
Idea here is what we saw earlier. | |
1:08:58.876 --> 1:09:09.800 | |
We can also add adaptors and only train them | |
while keeping the. | |
1:09:10.170 --> 1:09:26.860 | |
So this means we're going to end up with a | |
generic model that was not anyhow changed. | |
1:09:27.447 --> 1:09:37.972 | |
So in this way it's also more module and more | |
suitable to the incremental learning kind of. | |
1:09:38.758 --> 1:09:49.666 | |
Right in this part, the takeaways guess are | |
first data filtering. | |
1:09:49.666 --> 1:09:55.120 | |
His Internet data is very noisy. | |
1:09:56.496 --> 1:10:05.061 | |
Second, it's about paint tuning pre-fine models | |
and how we can or cannot avoid catastrophic | |
1:10:05.061 --> 1:10:06.179 | |
forgetting. | |
1:10:07.247 --> 1:10:15.866 | |
And of course open questions would include | |
how can we do incremental learning with these | |
1:10:15.866 --> 1:10:19.836 | |
multilingual machine translation models? | |
1:10:20.860 --> 1:10:31.840 | |
So with this in mind would like to briefly | |
cover several engineering challenges when we | |
1:10:31.840 --> 1:10:43.031 | |
talk about: Yeah, earlier we also briefly talked | |
about the motelingual means sometimes you have | |
1:10:43.031 --> 1:10:51.384 | |
to scale up, you have to make your models bigger | |
just to have that capacity to deal with. | |
1:10:52.472 --> 1:10:59.262 | |
This means the model sizes are getting bigger | |
and sometimes having one single is not enough | |
1:10:59.262 --> 1:11:00.073 | |
to handle. | |
1:11:00.400 --> 1:11:08.914 | |
Here wanted to introduce ideas of going parallel | |
and scaling up. | |
1:11:08.914 --> 1:11:12.843 | |
The first is so called model. | |
1:11:14.434 --> 1:11:18.859 | |
Don't know if you also had this in other like | |
maury cue related courses. | |
1:11:20.220 --> 1:11:30.639 | |
Okay, so the idea of data parallel is basically | |
we train in parallel. | |
1:11:30.790 --> 1:11:35.852 | |
We put our model onto several GPS. | |
1:11:35.852 --> 1:11:47.131 | |
We send the same model there and then when | |
we get the training data we split. | |
1:11:48.108 --> 1:11:54.594 | |
So each on each of these we are doing the | |
forward and backward pass in parallel. | |
1:11:55.355 --> 1:12:07.779 | |
Then after we get his gradient all these reviews | |
will be synchronized and the gradients will | |
1:12:07.779 --> 1:12:09.783 | |
be aggregated. | |
1:12:11.691 --> 1:12:27.127 | |
We are having a bigger batch size in effect, | |
so this would be much faster than, for example, | |
1:12:27.127 --> 1:12:31.277 | |
doing all these smaller. | |
1:12:32.772 --> 1:12:45.252 | |
That is, if your model itself is too big to | |
fit onto an energy group, so you cannot split | |
1:12:45.252 --> 1:12:46.084 | |
this. | |
1:12:46.486 --> 1:12:51.958 | |
And honestly, the model itself, unless you're | |
going for those. | |
1:12:51.891 --> 1:12:55.500 | |
Huge models the industry made these days. | |
1:12:55.500 --> 1:13:03.233 | |
I've never run into a situation where the | |
single model itself does not fit into one shape | |
1:13:03.233 --> 1:13:03.748 | |
here. | |
1:13:03.748 --> 1:13:08.474 | |
Realistically, it's more the what is memory | |
consuming. | |
1:13:08.528 --> 1:13:14.871 | |
It is more of the backward cast and the Optimizer | |
states that led me to be stored. | |
1:13:15.555 --> 1:13:22.193 | |
So but still there are people training gigantic | |
models where they have to go model parallel. | |
1:13:22.602 --> 1:13:35.955 | |
This means you have a model consisting of | |
all those orange pets, but it doesn't fit to | |
1:13:35.955 --> 1:13:40.714 | |
split the next several layers. | |
1:13:41.581 --> 1:13:51.787 | |
So this means when you do the forward pass | |
you have to wait and to finish before doing. | |
1:13:52.532 --> 1:14:11.193 | |
And this kind of implementation is sometimes | |
a bit architecture or specific. | |
1:14:12.172 --> 1:14:17.177 | |
Right, so there's one more thing when scaling | |
up. | |
1:14:17.177 --> 1:14:19.179 | |
Want it to mention. | |
1:14:20.080 --> 1:14:25.687 | |
We also talked about it briefly earlier. | |
1:14:25.687 --> 1:14:34.030 | |
We said that when we go to Linguo we need | |
a vocabulary that. | |
1:14:34.614 --> 1:14:40.867 | |
And can give you some numbers. | |
1:14:40.867 --> 1:14:53.575 | |
Most of the pre-trained modeling models here | |
use a vocabulary. | |
1:14:53.933 --> 1:14:58.454 | |
Normally each vector is. | |
1:14:58.454 --> 1:15:10.751 | |
This means just the word embedding table alone | |
is times parameters. | |
1:15:11.011 --> 1:15:18.620 | |
This means just for the embedding table alone | |
it's already taking million parameters of the. | |
1:15:19.859 --> 1:15:28.187 | |
And this is often one of the largest parts | |
of the machine. | |
1:15:28.187 --> 1:15:31.292 | |
This also comes with. | |
1:15:31.651 --> 1:15:43.891 | |
So one question is how can we efficiently | |
represent a multilingual vocabulary? | |
1:15:43.891 --> 1:15:49.003 | |
Are there better ways than just? | |
1:15:50.750 --> 1:16:00.526 | |
There are many out there people tread, maybe | |
not all targeted for mottling wool, but think. | |
1:16:00.840 --> 1:16:03.635 | |
So when is bites level representation? | |
1:16:03.743 --> 1:16:11.973 | |
So the idea there is if we train with data | |
they're all stored on computers, so all their | |
1:16:11.973 --> 1:16:15.579 | |
characters must be reused in by bites. | |
1:16:15.579 --> 1:16:23.716 | |
So they want to then not using subwords, not | |
using characters, but using bites instead. | |
1:16:25.905 --> 1:16:27.693 | |
Do You See Some Downsides? | |
1:16:31.791 --> 1:16:38.245 | |
There are some languages that are easier to | |
represent than others. | |
1:16:38.245 --> 1:16:40.556 | |
That's definitely true. | |
1:16:41.081 --> 1:16:44.981 | |
So if you have a sentence normally of five | |
words,. | |
1:16:46.246 --> 1:16:59.899 | |
You think about if we split it into characters, | |
how many characters we have, and each character | |
1:16:59.899 --> 1:17:04.166 | |
that would be how many bites. | |
1:17:04.424 --> 1:17:15.749 | |
And then it's more to model, it's more for | |
the model to learn, and it's also a bigger | |
1:17:15.749 --> 1:17:19.831 | |
sequence to give to the model. | |
1:17:20.260 --> 1:17:22.038 | |
Yeah. | |
1:17:21.941 --> 1:17:31.232 | |
Visual representation is also quite interesting, | |
so some people argued that we don't want to | |
1:17:31.232 --> 1:17:35.428 | |
have a fixed discrete vocabulary anymore. | |
1:17:35.428 --> 1:17:41.921 | |
Instead, we want to do it like OCR, like reading | |
them as images. | |
1:17:42.942 --> 1:17:54.016 | |
We'll look at one example for this next: Then | |
another idea is how if you can distill the | |
1:17:54.016 --> 1:18:03.966 | |
vocabulary as in learning some more compact | |
representation,. | |
1:18:04.284 --> 1:18:12.554 | |
But next wanted to show you an example of | |
pixel inputs for modeling war machine. | |
1:18:12.852 --> 1:18:29.757 | |
If you look at the picture, all the characters | |
that are marked with red are actually not. | |
1:18:32.772 --> 1:18:48.876 | |
They are actually from a different script | |
for the model and let it do the subword tokenization. | |
1:18:52.852 --> 1:19:04.373 | |
You would get maybe mostly characters out | |
of it because I guess in the pre existing vocabulary | |
1:19:04.373 --> 1:19:07.768 | |
there won't be Latin H and. | |
1:19:07.707 --> 1:19:16.737 | |
So you'll get characters out of it, which | |
means it's probably going to be more difficult | |
1:19:16.737 --> 1:19:18.259 | |
for the model. | |
1:19:20.140 --> 1:19:28.502 | |
Yeah, so the motivation for pixel inputs is | |
that there is more sharing across languages. | |
1:19:30.010 --> 1:19:37.773 | |
Here basically illustrates an embedding table | |
for subwords and saying if you have sentences | |
1:19:37.773 --> 1:19:45.705 | |
in the letter scripts like French and the English | |
then it's going to take certain proportions | |
1:19:45.705 --> 1:19:48.152 | |
of this big embetting table. | |
1:19:48.328 --> 1:19:56.854 | |
While for Arabic and Chinese it's yet again | |
another,. | |
1:19:56.796 --> 1:20:09.037 | |
That is not joined with the previous one if | |
we want to have shared representations for | |
1:20:09.037 --> 1:20:11.992 | |
different languages. | |
1:20:12.692 --> 1:20:18.531 | |
On the other hand, if we're going with pixels, | |
there's definitely more sharing. | |
1:20:22.362 --> 1:20:30.911 | |
There's a difference though to a standard | |
kind of norm machine translation typeline. | |
1:20:32.252 --> 1:20:47.581 | |
If you have this brace then how do we go with | |
images into a translation model? | |
1:20:50.690 --> 1:20:58.684 | |
We still have to tokenize it somehow, so in | |
this case they do an overlapping sliding window. | |
1:20:59.259 --> 1:21:13.636 | |
Since it's more visual, we're using some kind | |
of convolution blocks before going into these | |
1:21:13.636 --> 1:21:14.730 | |
black. | |
1:21:15.035 --> 1:21:25.514 | |
So here wanted to show that if you go with | |
these more specialist architectures we get | |
1:21:25.514 --> 1:21:27.829 | |
pixels and that's. | |
1:21:30.050 --> 1:21:31.310 | |
There's Also One Down the Side. | |
1:21:31.431 --> 1:21:51.380 | |
If we go with pixels and present teachings, | |
what are our challenges? | |
1:21:52.993 --> 1:22:00.001 | |
Exactly so as they beat us others here, also | |
pointing out here for their experiments. | |
1:22:01.061 --> 1:22:08.596 | |
They only consider a one target language, | |
and this is also on their target site. | |
1:22:08.596 --> 1:22:10.643 | |
It's not pixel based. | |
1:22:11.131 --> 1:22:31.033 | |
So this is definitely, in my opinion, very | |
interesting steps towards more shared representations. | |
1:22:31.831 --> 1:22:40.574 | |
Yeah, so with this kind of out of the box | |
approach just wanted to summarize today's lecture. | |
1:22:41.962 --> 1:22:53.158 | |
First think we saw why motelingue is cool, | |
why there are several open challenges out there | |
1:22:53.158 --> 1:22:53.896 | |
that. | |
1:22:55.355 --> 1:23:03.601 | |
We also saw, like several approaches, how | |
to realize implement a modern molecular translation | |
1:23:03.601 --> 1:23:11.058 | |
system, and yeah, lastly, we've seen quite | |
some over challenges on what is unsolved. | |
1:23:11.691 --> 1:23:22.403 | |
Yeah, so with this want to thank you for being | |
here today and I'm up there if you want. | |
1:23:26.106 --> 1:23:29.727 | |
If you have questions, how will we also share | |
with the moment? | |