Okay, again, welcome. So today I'll be giving the lecture. My name is Danni Liu; I'm one of the PhD students here. I work on multilingual machine translation, specifically on how to learn representations that are shared across languages and how to use them to help low-resource languages. So I hope today we can explore multilingual machine translation a little bit together.

Today, we are first going to look at what multilingual machine translation is. Second, we will look in more detail at how we achieve multilingual machine translation and what the techniques there are. Lastly, we are going to look at the current challenges.

Alright, so some definitions. First, what is multilingual machine translation? A multilingual machine translation system is basically a system that is able to handle multiple source languages or multiple target languages. You see here, on the source side, there is some German, Chinese, Spanish, and English.

It is also quite an interesting machine learning challenge. If you consider each translation pair as a different task in machine learning, then a multilingual model is a model that has to specialize in all these different translation directions and try to be good at all of them. So this is basically multi-task learning, with each translation direction being one task.

An interesting question to ask here is: do we get synergy, as in different tasks helping each other, the knowledge of one task helping the other? Or do we get interference - I learn English to German and now I get worse at English to Chinese? This is a very interesting question that we will look into later.

Now, a little bit of context on why we care about multilingual machine translation. Part of it is the sheer number of languages that machine translation models would have to cover. If you consider all the languages in the world, there are, as I read, roughly seven thousand languages. Consider this number: with this many languages out there, how many translation directions are there? To cover N languages, we end up with a quadratic number of directions. This is very bad - quadratic is very bad. This quadratic growth means that for a lot of translation directions, if you consider all of them, we are not going to have any parallel data, as in existing translated data. So this is a very data-scarce situation.
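To make the quadratic growth concrete, here is a tiny calculation; the language counts are just illustrative.

```python
# Number of directed translation pairs among N languages: N * (N - 1).
for n in (10, 100, 7000):
    print(f"{n} languages -> {n * (n - 1):,} translation directions")
# 100 languages already give 9,900 directions; 7,000 give 48,993,000.
```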
We are not going to get parallel data everywhere, and that is especially likely when you have a system that covers, say, ten languages; if this axis actually goes towards the thousands, which is realistic, we are going to end up with a lot of holes in the data.

So now we ask: can we use multilinguality to help these kinds of low-resource directions? A useful concept here is mutual intelligibility; I don't know if you've heard of this. It is when, in linguistic terms, somebody who speaks one language can directly, without learning, understand another language. If you are a German speaker, maybe Dutch or Danish and languages of that kind would be partially directly understandable to you. That is thanks to this mutual intelligibility, which is basically based on language similarity.

And then there is the concept of knowledge sharing. It is quite intuitive: if you are a German speaker and you start learning Dutch or Danish and these Nordic languages, I think you are going to be faster than, say, a native English speaker. So hopefully our model is also able to do this, but we will see later what the real situation is.

So we said multilinguality is good, multilingual translation is nice, and there is a lot of potential. But it has been a long path towards there; I think the efforts started quite some years ago. At first, people started with models with language-specific modules. We talked about the encoder-decoder architecture in the previous lectures, and this separation of the encoder and the decoder gives a natural way to split the modules.

So basically what is going on here is that one encoder is dedicated to each source language and one decoder to each target language. Now, given parallel data for one pair - say German-English data - we just activate this German encoder and this English decoder, and so on for the other pairs. So we are basically training the corresponding parts of the encoders and decoders.

This has some advantages. First, we have a multilingual system, of course. Second, modularity is also an advantage, as in software engineering: we want to decouple things, so if the German encoder is broken, we know where to look. So modularity is an advantage in this case. But again, if we think about scalability, about all the languages out there that we just talked about, this does not scale well. We also talked about sharing knowledge, or sharing representations, across different languages: if we have a separate module for each language, how likely is it that we are sharing much? A minimal sketch of such a modular setup is shown below.
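This is only a rough sketch of the language-specific-module idea, not the exact architecture of any particular paper; the GRU layers and sizes are placeholders, and attention is left out for brevity. The point is that one encoder per source language and one decoder per target language exist, and only the pair matching the current training example is activated.

```python
import torch
import torch.nn as nn

class ModularMT(nn.Module):
    """One embedding/encoder/decoder per language; pairs are activated on demand."""
    def __init__(self, vocab_sizes, hidden=256):
        super().__init__()
        self.emb = nn.ModuleDict({l: nn.Embedding(v, hidden) for l, v in vocab_sizes.items()})
        self.enc = nn.ModuleDict({l: nn.GRU(hidden, hidden, batch_first=True) for l in vocab_sizes})
        self.dec = nn.ModuleDict({l: nn.GRU(hidden, hidden, batch_first=True) for l in vocab_sizes})
        self.out = nn.ModuleDict({l: nn.Linear(hidden, v) for l, v in vocab_sizes.items()})

    def forward(self, src_lang, tgt_lang, src_ids, tgt_ids):
        # Only the modules of this language pair are used (and hence updated).
        _, enc_state = self.enc[src_lang](self.emb[src_lang](src_ids))
        dec_out, _ = self.dec[tgt_lang](self.emb[tgt_lang](tgt_ids), enc_state)
        return self.out[tgt_lang](dec_out)          # logits over the target vocabulary

model = ModularMT({"de": 8000, "en": 8000, "zh": 12000})
logits = model("de", "en", torch.randint(0, 8000, (2, 7)), torch.randint(0, 8000, (2, 5)))
print(logits.shape)                                 # torch.Size([2, 5, 8000])
```

The number of modules grows with the number of languages (or, in a per-direction variant, with the number of pairs), which is exactly the scalability concern raised above.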
So scalability and limited sharing are the potential disadvantages of this modular approach. We said we want knowledge transfer, we want similar languages helping each other. That is a more reachable goal if you have a shared encoder and a shared decoder - basically a fully parameter-shared model for all the translation pairs out there. And there is another gain: if you just have one block of a model for all the translation directions, it is easier to deploy, in the sense that when you are serving a model you don't have a thousand small modules to maintain. So in terms of engineering, these fully parameter-shared models also have advantages, and this is where research has been going in recent years. The rest of the lecture is also going to focus on this kind of model.

So the first type of multilinguality is the many-to-one setup: many source languages are translated into one target language. One use case you can think of here is producing subtitles for international movies in German.

Then, flipping the situation, there are also many configurations where we only have one source language - one-to-many. There are many use cases here as well; think about the lecture translator here that you have seen. Most of the lectures are in German, and now we want to translate them; I think on the user end we only support English, but other languages would also be possible. So in this kind of use case you have one speaker and you want to serve, or expand to, many audiences.

Of course, combining everything, there is the many-to-many situation. You can think of Google Translate: they basically translate between any selected languages. This is also more difficult if you consider the data you need to get, and other concerns; we will cover this later.

But first we are going to start with many-to-one translation. This is the most similar to the bilingual translation situation you saw earlier, but one difference is that we now need a vocabulary, or tokens, that can represent all these different source languages. So we need a joint multilingual vocabulary.

Let's quickly recall what word embeddings do. Basically, we have to get some vector representation for discrete words. When we embed a token, we retrieve the corresponding vector out of this lookup table, and then we feed the sequence of vectors into the encoder in the next steps.
Now, if it is multilingual, you can imagine that the vocabulary suddenly gets very, very big, because of all the languages involved. What is quite useful here are the byte-pair-encoded subwords you talked about earlier. In this case we still limit ourselves to a finite number of vocabulary entries, so we are not exploding the vocabulary table.

So when we learn these kinds of subwords, what happens, basically, is that we look at all the training data. Now think about this: if we do this on a bunch of multilingual data, are there concerns?

[Student:] Maybe we have an imbalanced data set, so we get overly many English merges and vocabulary entries.

Yeah, exactly, thanks. So what we have to pay attention to when learning this multilingual vocabulary is that all the languages are more or less balanced - not that we only learn subwords for English or some bigger languages and then neglect the other languages.

Of course, this is not going to solve everything. Even if we get a perfectly uniform distribution over all the languages out there, that does not mean we end up with a perfect vocabulary. There are also language differences, right? If you consider mostly European languages, there will be many shared subword components, like how you write a certain word being somewhat similar. But then there are other languages with completely different scripts, like Arabic or Cyrillic scripts, or East Asian scripts where the character set alone has tens of thousands of characters. These are individual concerns that one has to think about when building specific systems. But overall, the rule of thumb is that when you build a multilingual tokenizer and vocabulary, the languages should be more or less balanced. There are actually papers showing that the performance of the final system starts to degrade if the data is disproportionate.

Of course, there is currently the trend of using pre-trained models. If you take a pre-trained model from somewhere, then you don't have this concern, apart from making sure that you use the same tokenizer that they used, so that there is no train-test mismatch. We are going to talk about pre-trained models a little bit later as well.

Alright, so now we have a multilingual vocabulary. There are several good things about it, obviously. One thing is that if there are words with the same textual form - as we said, some European languages share some vocabulary - then it is great: we have a first step towards knowledge sharing. For example, the word for pineapple, for some reason, is shared across many languages, including Eastern European ones; in Cyrillic script it is essentially the same word as well.
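Coming back to the earlier point about balancing the languages when learning the joint vocabulary: a common recipe (an assumption of a typical setup here, not necessarily what the systems in this lecture used) is to sample sentences per language with temperature-scaled probabilities and then train one joint subword model, for example with the sentencepiece library.

```python
import random
import sentencepiece as spm

# Sentence counts per language (made-up numbers, just for illustration).
corpus_sizes = {"en": 5_000_000, "de": 2_000_000, "jv": 30_000}

def sampling_probs(sizes, temperature=5.0):
    """p_l proportional to (n_l / N) ** (1/T); a higher T flattens the distribution."""
    total = sum(sizes.values())
    w = {l: (n / total) ** (1.0 / temperature) for l, n in sizes.items()}
    z = sum(w.values())
    return {l: v / z for l, v in w.items()}

probs = sampling_probs(corpus_sizes)
print(probs)  # 'jv' now gets roughly 16% of the mixture instead of its raw ~0.4%

# Build a balanced mixture (files like "train.en" with one sentence per line are assumed),
# then learn a single joint BPE vocabulary over it.
budget = 2_000_000
with open("train.mixed.txt", "w", encoding="utf-8") as out:
    for lang, p in probs.items():
        with open(f"train.{lang}", encoding="utf-8") as f:
            lines = f.readlines()
        for line in random.choices(lines, k=int(budget * p)):  # with replacement: also upsamples
            out.write(line)

spm.SentencePieceTrainer.train(
    input="train.mixed.txt", model_prefix="joint_bpe",
    vocab_size=32000, model_type="bpe", character_coverage=1.0)
```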
On the other hand, shared surface forms also bring ambiguity: the same written word can mean different things in different languages - think of a word like "die", which means something entirely different in German than in English. Then, of course, it is possible to rely on further context, so it is not a huge problem, but it is something to think about.

And when we want to cover more vocabulary entries, we might need to go bigger in the vocabulary size. So there is always sort of a bottleneck as the number of languages increases.

Right, so what is the result? What are these cross-lingual word embeddings actually learning? Normally it is quite hard to inspect them - they are high-dimensional vectors with many dimensions - but researchers have tried to project them down. In this case (it is a little bit small here), for English and French, there are many entries where different morphological forms of the same word end up together - basically a morphological cluster. There are also words from different languages - I think this plot is for English and French - that end up close to each other. So the takeaway from this plot is that we learn a bit of semantic meaning beyond the textual forms.

This looks good and gives us hope. Consider what the baseline is here: the baseline we compare to is a bilingual system without any multilinguality. And this looks good, because if we compare, for many Eastern and Central European languages into English, we see that the many-to-English system actually always gains quite a bit over the bilingual one. But there have also been later investigations into whether this is actually due to multilinguality or not. This is a spoiler - I won't say much about it until the second half - but just remember that this question exists.

Now let's move on to one-to-many translation. Let's recall a normal transformer, or any encoder-decoder setup. We have an encoder that creates a contextual representation of the source sentence. This is more or less the context for generating the target sentence, right? On the target side, we get the first output token, then we feed it back in and get the second decoding step, and so on.

And now we have multiple target languages. Does anybody see a problem with this architecture? Specifically, it is in the decoder: say we have a German sentence encoded, and we now want to generate Spanish.
So the problem is: how does the model know which language to generate? If you just give it a generic start token, there is nowhere we are telling the model which target language we want. So the model can only guess, and this will definitely not run well.

This raises the question: how do we indicate the intended target language to the model? A first idea that people tried is basically to inform the model on the source side: the source sentence is supplemented with a tag like "to Spanish". This is also called target forcing, in the sense that we try to force the model to produce the right target language.

That is one approach. Another approach is based on the idea that the encoder is there to create a contextual representation of the source, so its output should not really have to differ depending on the target language. Out of this motivation, people moved the signaling mechanism to the decoder: they basically replaced the traditional start token, so we are not kick-starting the decoder with a generic start token anymore, but with a language-specific one. So this is another way to achieve the signaling.

But there are still more challenging cases. Sometimes the output does start in the intended language, say German, while the signal is still fresh, but as the generation goes further and further, it drifts. Basically, this information is not strong enough to always enforce the target language, especially in zero-shot conditions - we will look into this later - so we get translations that start out fine and then wander off into some wrong language.

So another technique, actually developed here some years ago, is to inject the language information at every decoding step. Normally, when we do autoregressive decoding, we only feed the previously generated token into the decoder. But if we also add an embedding of the target language on top of that, we have the language information at every step. This has been shown to perform quite a bit better, especially in conditions where the model would otherwise drift off target.

So we have introduced three ways to enforce the target language. And with this, we are going to move on to the more interesting case of many-to-many translation.

Here, let's just consider a system that translates two directions, say German to English and English to French. Now we have two target languages, right? Can you see where we are enforcing the target language here - which technique is used in this case? So here we are enforcing the target language with the language tag when we train this system.
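A minimal sketch of the first signaling option, target forcing with a tag prepended to the source side; the `<2xx>` tag format is only a convention assumed here, and the tags have to be added to the joint vocabulary as special tokens.

```python
def add_target_tag(src_sentence: str, tgt_lang: str) -> str:
    """Prepend a target-language token so the model knows what to generate."""
    return f"<2{tgt_lang}> {src_sentence}"

# The same source can now be paired with different target languages in one model:
print(add_target_tag("Das Haus ist klein.", "es"))   # <2es> Das Haus ist klein.
print(add_target_tag("Das Haus ist klein.", "fr"))   # <2fr> Das Haus ist klein.
```

The other two options move the signal to the decoder instead: a language-specific start token, or a target-language embedding added to the decoder input at every step.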
At inference time we are able to translate English to French, but in addition to this we are also able to do zero-shot inference, that is, to translate a direction that was not seen in training, such as German to French. This is so-called zero-shot translation using a multilingual system.

Of course, we have to achieve several things before this works. First, we must be able to control the output language, otherwise it is of no use. Second, we should also have some kind of language-independent representation. Why is this important? Because if we want to generate French here, the decoder was trained to translate from encoded English; but now we feed it encoded German, so intuitively we need these representations to be similar enough - not so far apart that the decoder cannot use them.

There are several works out there showing that with a standard transformer architecture this language-independent property is not really there by itself, and you need additional approaches in order to enforce it. You can, for example, add an additional training objective that says: the encoder output for a sentence and the encoder output for its translation in another language have to be the same, or as close to each other as possible.

So if we take the encoder output for one language and the encoder output for another language, how can we formulate this as an objective? We can pass both sides of a translation pair through the encoder and require that the resulting embeddings be similar - that is the general direction. One thing to take care of here is that the lengths of the same sentence in German and in English are not necessarily equal. So instead of a word-to-word matching, we can always pool to a fixed-length representation, or there are more advanced techniques that involve some alignment.

This is useful in the sense that, in our experiments, it has been shown to improve zero-shot translation. This is in a data condition with English to Malay, Javanese, and Filipino, so a somewhat lower-resource language family. There we assume that we have parallel data from English to all of them, but none among those languages themselves. The blue bars are a vanilla transformer model, and the purple bars are when we add such a similarity objective. You see that in supervised conditions it does not change much, but in zero-shot conditions there is quite some gain.

So far we have said that zero-shot is doable, and it is even more achievable if we enforce some language-independent representations.
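Here is a minimal sketch of such an auxiliary objective, assuming we already have the encoder outputs for a sentence and its translation; mean pooling over time is one simple way to handle the different lengths, and the MSE loss and its weight are assumptions rather than a fixed recipe.

```python
import torch
import torch.nn.functional as F

def similarity_loss(enc_a, enc_b, mask_a, mask_b):
    """Pull the mean-pooled encoder states of a sentence and its translation together.

    enc_a: (B, T_a, H), enc_b: (B, T_b, H); masks are 1 for real tokens, 0 for padding.
    """
    mask_a, mask_b = mask_a.float(), mask_b.float()
    pooled_a = (enc_a * mask_a.unsqueeze(-1)).sum(1) / mask_a.sum(1, keepdim=True)
    pooled_b = (enc_b * mask_b.unsqueeze(-1)).sum(1) / mask_b.sum(1, keepdim=True)
    return F.mse_loss(pooled_a, pooled_b)

# During training: total_loss = translation_loss + 0.1 * similarity_loss(...)
# (the 0.1 weight is a tunable choice)
enc_de, enc_en = torch.randn(4, 9, 512), torch.randn(4, 7, 512)
mask_de, mask_en = torch.ones(4, 9), torch.ones(4, 7)
print(float(similarity_loss(enc_de, enc_en, mask_de, mask_en)))
```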
However, there is one practical concern - I don't know if you had the same question: if you have two languages without direct parallel data, you could also just translate into English first and then out of English. This kind of approach is called pivoting, as in pivoting over an intermediate language.

It definitely has advantages. If we go over these two steps, every direction was trained with supervised data, so we are always working in a supervised regime. In this case we can expect more robust inference-time behavior. However, there are also disadvantages. At inference we are passing through a model twice, which doubles the inference-time computation. You might think, okay, doubling, so what - but if you are a company like Google running Google Translate and all your live traffic suddenly becomes twice as expensive, this is not something scalable that you want to see, especially in production.

Another problem is information loss: if we go over these two steps, like a chain of kids passing a word to each other, information gets lost along the way. I can give you an example here; it is from a master's thesis done here, on gender preservation. Some languages, like Italian and French, have different word forms depending on the speaker. So if a male speaker says "I feel alienated", the word for "alienated" takes the masculine form, while a female speaker would use the feminine form. Now imagine that we pivot through English: the information is lost - we no longer know the speaker's gender. When we go out into French again, there are different forms depending on the speaker's gender, and we can only guess. So this is one problem.

This is especially the case because English, compared to many other languages, is relatively simple: it does not have gendered word forms like this, and it also does not have many cases, so going through English, a lot of information is lost.

Another issue arises when you are translating between two similar languages. This is the output of a system going from Dutch to German over English - if you read the German, how many of you know German? Good. The problem here is that we are going over English and then from English into German, and something gets mistranslated on the way. However, if we go direct - in this case with zero-shot translation - you see that the word in question (here, "forgive") comes out right; the direct output is better. And we believe this has to do with exploiting the similarity between the two languages.
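For comparison, the pivoting baseline discussed above is conceptually just two supervised translation calls chained together; `translate` here is a hypothetical helper standing in for whatever bilingual systems are available.

```python
def pivot_translate(translate, text, src_lang, tgt_lang, pivot="en"):
    """Translate src -> pivot -> tgt with two supervised systems.

    `translate(text, src_lang, tgt_lang) -> str` is assumed to exist.
    """
    intermediate = translate(text, src_lang, pivot)   # step 1: into the pivot language
    return translate(intermediate, pivot, tgt_lang)   # step 2: out of the pivot language

# Example (not run): Dutch -> German over English.
# pivot_translate(translate, "Ik heb het hem vergeven.", "nl", "de")
# Downsides from the lecture: twice the inference cost, and anything the pivot
# language does not mark (e.g. speaker gender) is lost in step 1.
```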
We also found this quantitatively: the models always do better when translating directly between similar languages, compared to pivoting through English.

So, in this first half, what did we talk about? First, we started with how multilinguality, or multilingual machine translation, can enable knowledge transfer between languages and help in conditions where we don't have much data. Then we looked at three types of multilingual translation: many-to-one, one-to-many, and many-to-many. We talked about a shared vocabulary over different languages and how the resulting cross-lingual word embeddings capture semantic meaning rather than just surface forms. Then we looked at how to signal the target language - how to ask the model to generate a particular language - and finally we looked at zero-shot translation.

Now, before I go into the second half, are there questions about the first part? Okay, good.

In the second half of this lecture we will be looking into challenges - what is still unsolved about multilingual machine translation. There are a couple of aspects to look at: the first is modeling, and the later ones are more about data and engineering.

Okay, so we have talked about this question several times: how does multilinguality help, and where does it help? Here I want to show results of an experiment based on over a hundred languages. Here you can see the data amounts: they use parallel data to and from English, and it is very imbalanced - and this is already log scale. For higher-resource languages like English to French, German, or Spanish, you get over a billion parallel sentences, and when we go more to the right, to the low-resource end of the spectrum, there are languages that maybe many of us have never heard of - Hawaiian, for example; has anyone heard of it? On that end of the spectrum we only have something like thirty thousand sentences. So what this means is that when we train, we have to upsample these languages, otherwise the model would hardly ever see them.

On this graph, the way to read it is that the horizontal line at zero basically indicates the bilingual baselines, because we want to see where multilinguality helps compared to when there is none. So being higher than the zero line means we are gaining over the bilingual models; and along this axis, this side means high-resource and that side low-resource. Yeah, sorry, I think I have somehow removed the x-axis labels here.

Alright, so what happens if we look at many-to-English?
On the low-resource end of the spectrum, by going multilingual we gain a lot over the bilingual systems. Overall, if you consider the average over all of the languages, it is still a gain. We are looking at the green line here - you can ignore the blue line; the green one is the setting with the upsampling we just talked about. So even if you just consider the average, it is still a gain over the bilingual baselines.

However, if we go to the English-to-many system and look at the gains, we only get minor improvements. So why is it that going multilingual isn't really helping universally? Do you have some intuitions?

[Student:] It is easier to understand something than to generate it, if we consider what the model has to generate.

I see it like this: generating is a bit like writing or speaking, while taking in the source side is more like reading. So one is more passive and the other is more active, and I don't know if you have a similar experience, but I think speaking and writing are always a little bit more difficult than passively listening or reading. But this is a very hand-wavy kind of understanding.

In terms of the model, consider what is different on the target side for many-to-English. One difference is that there is a data difference. If you consider a many-to-English system with German-to-English and Spanish-to-English, one thing we have to keep in mind is that the parallel data is not all the same, so on the target side there are different English sentences coming from the different pairs. So the situation rather looks like this: we are also adding more data on the target side for English.

Now, since the target-side data is not identical, how do we do a controlled experiment that removes the multilinguality? What people tried as a control is to keep all the English target data the same as in the setup above. So they take the English side of the other language pairs and then generate synthetic German source data for it. Now we have a bilingual system again, but on the target side we still have the previously enriched English data.

Back to this picture that we have seen before: this mysterious orange line here is basically the result of that control. And somewhat strikingly, and perhaps sadly for believers in multilinguality, this control is also gaining. So what this means is that for many-to-English, the gains are not really because of multilinguality, but just because of the additional English target-side data.
And this means that there is still quite a lot to do if we really want to gain from truly shared knowledge. But it also gives hope, because there are still many things to research in this area.

So we have seen that adding more languages helps, with somewhat of a data side effect - and can it hurt? What if we just keep adding more languages? We have seen this picture for the many-to-English system: comparing to the bilingual baselines, we see that for the high-resource languages we are not doing as well.

So why are we losing here? It has been shown that this performance loss is somewhat related to capacity, in the sense that the model has to learn so much that at some point it has to sacrifice capacity for some of the directions.

So what can we do? To basically grow a bigger brain to tackle this, we can add some dedicated capacity per language. Here is a simplified graph of a transformer architecture - this is the encoder. Additionally, these little colored blocks are the language-specific capacity. They are language-specific in the sense that if you get a Chinese-to-English pair, only the corresponding blocks are used. We also go through language-specific parts that, in this case, consist of a down-projection and an up-projection back.

These are called adapters: something that is plugged into an existing model and adapts it towards a specific task. And they are conditionally activated, in the sense that a different input language activates a different adapter.

This was first proposed by some folks at Google. Does it scale well? Yes, exactly: this is one adapter per translation direction, and that is not going to scale well. So this brought people to try a simpler setup - in this case so-called monolingual adapters, one per language.

Adding these adapters - again we have the low-resource to high-resource axis, the zero line is the bilingual baseline, and the lines are interpolated - the red line is the original multilingual model, and if we put the adapters in, we get the blue line: you see that it regains performance for the high-resource languages. If we scale the adapters up even further, this improves more. So this also shows, from the side, that it really is a capacity bottleneck: if you give the high-resource directions dedicated capacity back, they regain their performance.
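A minimal sketch of such a bottleneck adapter; the exact placement inside the transformer layer, the sizes, and the use of layer norm vary across papers, so treat this as the generic down-projection / nonlinearity / up-projection pattern with a residual connection.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck block plugged into an otherwise frozen transformer layer."""
    def __init__(self, hidden=512, bottleneck=64):
        super().__init__()
        self.norm = nn.LayerNorm(hidden)
        self.down = nn.Linear(hidden, bottleneck)    # down-projection
        self.up = nn.Linear(bottleneck, hidden)      # up-projection back to model size

    def forward(self, x):
        return x + self.up(torch.relu(self.down(self.norm(x))))   # residual connection

# One adapter per language (the "monolingual adapter" variant); the matching one is
# picked depending on the source or target language of the current batch.
adapters = nn.ModuleDict({lang: Adapter() for lang in ["de", "en", "zh"]})
h = torch.randn(2, 7, 512)        # hidden states coming out of a transformer layer
h = adapters["de"](h)             # conditionally activated for German input
print(h.shape)                    # torch.Size([2, 7, 512])
```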
[Question from the audience:] And for the smaller languages it doesn't seem to change much - why is that?

I think in the original multilingual model the smaller languages were not really constrained by capacity. So I guess for the smaller languages the difficulty is more the data rather than the model capacity; in general you always want the amount of data to more or less match your model capacity. Yeah, here I think the bigger challenge for the low-resource ones was the data.

[Question:] You also mentioned it a little bit - are these adapters per language, or how many adapters do we need? And do we have to design them differently so that we learn to share more, for example within a language family?

So, one downside of the adapters we talked about is that there is basically no way to share across the language-specific parts. A more recent approach for this language-specific capacity is so-called routing, where the sharing is learned. Basically, we have these language-specific components and we also have a shared adapter, and the model should learn which path to take. In this case, maybe we could imagine that for the low-resource case we just talked about it makes more sense to go to the shared part, because there is not much language-specific to learn anyway, and it is better to make use of the similarity with other languages. So this architecture is more data-driven, instead of us specifying the sharing prior to training.

So how do we learn this? Basically, in terms of the gate, we want a binary value that routes either to the language-specific or to the shared path. But how do we get a value of zero or one? We can use a sigmoid. However, we don't want to get stuck in the middle - we don't want gate values of around one half. That would also be bad because it would not be the same at training and at test time.

So the question is: how do we force the model to always go to the extremes before the activation? I found this interesting because it sounds like a trick to me. What they do is, prior to going through the sigmoid activation, they add some Gaussian noise. If there is always noise before the activation, then the model is encouraged to preserve the information by pushing the pre-activation values far from zero, so the noise cannot flip the decision - which effectively makes the gate binary. This was a very interesting thing I found while preparing this, so I wanted to share it: you can basically create a binary gate with this technique.

And if you add this language-specific routing - here they also have a parameter that controls how much is shared and how much is language-specific - these are the results of the routing systems compared to the baselines (the red and orange lines). You can see that both for one-to-many and for many-to-one there are quite some gains in both cases.
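The noise trick for the learned routing can be sketched as follows; adding Gaussian noise before the sigmoid during training pushes the model to make the pre-activation large in magnitude so that the gate saturates at 0 or 1. The noise scale and the hard threshold at test time are assumptions of a typical implementation.

```python
import torch
import torch.nn as nn

class NoisyBinaryGate(nn.Module):
    """Learned gate that routes between a language-specific and a shared branch."""
    def __init__(self, noise_std=1.0):
        super().__init__()
        self.logit = nn.Parameter(torch.zeros(1))   # learned pre-activation
        self.noise_std = noise_std

    def forward(self, specific_out, shared_out):
        if self.training:
            # Noise before the sigmoid: only a large |logit| survives as a clear 0/1 decision.
            g = torch.sigmoid(self.logit + self.noise_std * torch.randn_like(self.logit))
        else:
            g = (self.logit > 0).float()            # hard binary routing at test time
        return g * specific_out + (1 - g) * shared_out

gate = NoisyBinaryGate()
specific, shared = torch.randn(2, 7, 512), torch.randn(2, 7, 512)
print(gate(specific, shared).shape)                 # torch.Size([2, 7, 512])
```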
So that is the overall picture, and I just find the idea of the routing quite interesting. It is also getting increasingly used, as there are the so-called mixture-of-experts models, where the model learns where to route the input, so the experts are all conditionally activated depending on the input. But this is not really specific to multilinguality, so I won't talk too much about it.

The takeaways from this part are: first, we talked about the capacity bottleneck, which we can partly compensate for with adapters or other added language-specific capacity, and there is the idea of negative transfer. Open questions are: when we add additional capacity, how can we improve the knowledge sharing? And for the one-to-many directions, which seem rather hopeless for multilinguality, can we actually gain something? These are all open issues in the area.

In the next part, I am going to talk about some data challenges for multilingual models. We talked about multilingual models needing data, but there are these low-resource languages that do not have well-curated parallel data. As an alternative, people resort to crawled data from the Internet, and there is a lot of noise in it. In a paper from last year, they did some manual analyses of several popular crawled datasets, and you see that there are a lot of wrong translations, non-linguistic contents, and pornographic contents. So, as you can imagine - you are what you eat - if you use this kind of data to train a model, you cannot expect clean output.

So there are also many techniques for filtering these noisy datasets. To filter, we can use an additional classifier that is trained to identify which language a sentence is in, and then kick out all the sentences in the wrong language. Another criterion is the length ratio: the assumption is that if two sentences are translations of each other, their lengths should be roughly comparable. Often people use a ratio of, say, three, and eliminate the rest. Another idea, maybe similar to the language classifier, is to have an allowed character set per language: if you are filtering and you see, I don't know, Cyrillic script or Arabic script where it does not belong, then it is maybe a good idea to remove those sentences.

This is not all - there are many other ideas, for example using pre-trained neural networks to compare the representations of the two sides - but this should give you an idea of the basic filtering techniques. Filtering is quite important: we have seen in our experience that if you do it thoroughly, there is quite some gain.
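A sketch of the basic filters just mentioned - length ratio plus a crude script check for a German-English pair; real pipelines usually add a trained language-identification model, which is only hinted at here.

```python
import re

LATIN = re.compile(r"[A-Za-zÄÖÜäöüß]")
CYRILLIC = re.compile(r"[\u0400-\u04FF]")

def keep_pair(src: str, tgt: str, max_ratio: float = 3.0) -> bool:
    """Very simple noise filter for a supposed German-English sentence pair."""
    if not src.strip() or not tgt.strip():
        return False                                  # empty segments
    len_s, len_t = len(src.split()), len(tgt.split())
    if max(len_s, len_t) / max(1, min(len_s, len_t)) > max_ratio:
        return False                                  # suspicious length ratio
    if CYRILLIC.search(src) or CYRILLIC.search(tgt):
        return False                                  # script not allowed for de/en
    if not (LATIN.search(src) and LATIN.search(tgt)):
        return False                                  # no letters at all: non-linguistic content
    return True

print(keep_pair("Das Haus ist klein.", "The house is small."))                # True
print(keep_pair("Haus", "The house is small, and it is also quite cheap."))   # False (ratio)
print(keep_pair("Дом маленький.", "The house is small."))                     # False (script)
```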
Even after web crawling, though, there is still a data scarcity problem. There are many bad things that can happen when there is too little training data. The first is simply low performance: in experiments on many-to-English systems across a range of languages, you really need to get into the region of a lot of data in order to reach the ideal performance. There are also other problems that appear in general when you train a model on very little data, for example results that vary a lot across different training runs.

So one solution to tackle this data scarcity problem is to fine-tune a pre-trained model. Basically, the idea is: you have a pre-trained model that can already do translation, then you fine-tune it on your own training data, and you end up with a more specialized model. Why does pre-training help? One argument is that through pre-training the model has seen much more data and has learned more generalizable representations that can help downstream tasks. So in this case we are basically trying to make use of those more meaningful and generalizable representations.

For machine translation, there are several open-source models out there that can handle many languages. There is, for example, a model covering two hundred languages - that is quite a lot of translation directions. However, one thing to remember is that these models are - how do you call it - a jack of all trades and a master of none, in the sense that their coverage is very good, but if you look at specific translation directions, they might not be as good as dedicated models.

So here are some results comparing random initialization versus fine-tuning a pre-trained model. The third line is the result of fine-tuning a pre-trained model of this kind. If we just look at the second line - the pre-trained model used out of the box - you see that its performance is not great everywhere compared to dedicated models. Here the "X" stands for English, and the first takeaway is that pre-training plus fine-tuning gains when we translate into English.
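A sketch of the pre-train-then-fine-tune recipe with the Hugging Face transformers library; the checkpoint name, learning rate, and the single toy sentence pair are placeholders, and the exact tokenizer keywords can differ between library versions.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

name = "facebook/m2m100_418M"                 # an example multilingual MT checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One tiny fine-tuning step on an in-domain German->English pair.
tokenizer.src_lang, tokenizer.tgt_lang = "de", "en"
batch = tokenizer("Das Haus ist klein.", text_target="The house is small.",
                  return_tensors="pt")
loss = model(**batch).loss                    # cross-entropy computed by the library
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(loss))
```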
The flip side, however, is that we are forgetting: when we do the further training, there is no data anymore for some of the directions, so even if we initialize with the pre-trained model and continue training, the directions we no longer see get lost. This is bad; machine learning people term it catastrophic forgetting, in the sense that if you have a model trained to do some task and you then train it on another task, it forgets the first one. This is also pretty bad - especially bad if you consider that training data actually grows over time; it is not like you have one fixed dataset forever. In practice we do not always train systems from scratch; it is more like you have an existing system and later you want to expand its translation coverage.

So the key question here is: how do we continue training from an existing system without losing what it already knows? There are several approaches. One very simple one is to include a portion of your previous training data, so that the model keeps seeing it. If you have an English-German system and now you want to extend it to English-French, then when you train on English-French you still include a small proportion of your previous English-German data; hopefully the model then does not forget that much about the previously learned German.

Another idea is what we saw earlier: we can add adapters and only train those, while keeping the rest of the model frozen. This means we end up with a generic model that has not been changed at all, plus small added parts. So in this way it is also more modular and more suitable for this incremental kind of learning.
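A sketch of the first idea, replaying a small fraction of the previously used data while training on the new direction; the 10% replay ratio and the toy data are arbitrary illustrative choices.

```python
import random

def continual_batches(new_pairs, old_pairs, replay_ratio=0.1, batch_size=32):
    """Mix a small portion of previously seen data into every batch to limit forgetting.

    new_pairs / old_pairs: lists of (source, target, direction) tuples.
    """
    n_old = max(1, int(batch_size * replay_ratio))
    n_new = batch_size - n_old
    random.shuffle(new_pairs)
    for i in range(0, len(new_pairs), n_new):
        batch = new_pairs[i:i + n_new] + random.sample(old_pairs, n_old)
        random.shuffle(batch)
        yield batch

# Example: extending an en-de system with en-fr data while replaying some en-de.
en_fr = [("hello", "bonjour", "en-fr")] * 100
en_de = [("hello", "hallo", "en-de")] * 1000
for batch in continual_batches(en_fr, en_de):
    pass   # feed each mixed batch to the usual training step
```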
Right, so the takeaways from this part: first, data filtering, because Internet data is very noisy; second, fine-tuning pre-trained models, and how we can or cannot avoid catastrophic forgetting. And of course the open questions include how we can do incremental learning with these multilingual machine translation models.

With this in mind, I would like to briefly cover several engineering challenges that come up with multilingual models. Earlier we briefly mentioned that multilingual sometimes means you have to scale up: you have to make your models bigger just to have the capacity to deal with all the languages. This means model sizes are getting bigger, and sometimes a single GPU is not enough to handle them. Here I want to introduce the ideas of going parallel when scaling up.

The first is so-called data parallelism - I don't know if you have also had this in other machine-learning-related courses. The idea of data parallelism is basically that we train in parallel: we put our model onto several GPUs - the same model is sent to each of them - and when we get the training data, we split it. On each of these GPUs we do the forward and backward pass in parallel; then the GPUs are synchronized and the gradients are aggregated. In effect we are using a bigger batch size, so this is much faster than, for example, processing all those smaller batches one after another.

That does not help, however, if your model itself is too big to fit onto a single GPU, because then you cannot replicate it like this. And honestly, unless you are going for those huge models the industry builds these days, I have never run into a situation where the single model itself does not fit onto one GPU; realistically, what is memory-consuming is more the backward pass and the optimizer states that need to be stored. But still, there are people training gigantic models where they have to go model parallel. This means you have a model consisting of all these orange parts, but it does not fit, so you split it - for example, the first several layers go on one GPU and the next several layers on another. This means that during the forward pass one part has to wait for the other to finish before it can proceed. And this kind of implementation is sometimes a bit architecture-specific.

Right, so there is one more thing about scaling up that I wanted to mention. We also talked about it briefly earlier: we said that when we go multilingual we need a vocabulary that covers all the languages. I can give you some numbers: most of the pre-trained multilingual models use vocabularies of roughly 250 thousand entries, and each embedding vector typically has around 1,024 dimensions. This means the word embedding table alone is vocabulary size times dimension parameters - already on the order of 250 million parameters. This is often one of the largest parts of the machine translation model, and it comes with corresponding memory and compute costs. So one question is: how can we efficiently represent a multilingual vocabulary? Are there better ways than just a huge subword table?

There are many ideas out there that people have tried, maybe not all of them targeted at multilinguality. One is byte-level representation. The idea is that the training data is all stored on computers, so all the characters must be representable as bytes anyway; so the proposal is to use neither subwords nor characters, but bytes instead. Do you see some downsides?

[Student:] There are some languages that are easier to represent than others.

That is definitely true. So think about a sentence of, say, five words: if we split it into characters, how many characters do we have - and each character would then be how many bytes?
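To see why byte-level sequences get long, and unevenly so across scripts, here is a quick check with arbitrary example sentences.

```python
for text in ["The house is small.", "Übersetzung", "Это маленький дом.", "这是一个小房子。"]:
    print(f"{text!r}: {len(text)} characters -> {len(text.encode('utf-8'))} UTF-8 bytes")
# ASCII is 1 byte per character, Cyrillic about 2, Chinese about 3, so byte
# sequences grow much faster for non-Latin scripts.
```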
So it is much more for the model to learn, and it is also a much longer sequence to give to the model.

Visual representation is also quite interesting. Some people argue that we should not have a fixed discrete vocabulary at all anymore; instead we should read the text like OCR does, as images. We will look at one example of this next. And then another idea is whether you can distill the vocabulary, as in learning some more compact representation.

But next I wanted to show you an example of pixel inputs for multilingual machine translation. If you look at the picture, all the characters marked in red are actually not what they appear to be: they come from a different script. If you gave this to the model and let it do its subword tokenization, you would probably get mostly single characters out of it, because in the pre-existing vocabulary there will not be subwords that mix, say, a Latin "H" with characters from another script. So you get characters out of it, which means it is probably going to be more difficult for the model.

So the motivation for pixel inputs is that there is more sharing across languages. This figure basically illustrates an embedding table for subwords: if you have sentences in Latin scripts, like French and English, they take up a certain proportion of this big embedding table, while for Arabic and Chinese it is yet another part that is not joined with the previous one - which is not what we want if we want shared representations across languages. With pixels, on the other hand, there is definitely more sharing.

There is a difference, though, to a standard machine translation pipeline: if we have this rendered text as an image, how do we feed images into a translation model? We still have to tokenize it somehow, so in this case they use an overlapping sliding window over the image, and, since it is visual, some convolution blocks before going into the transformer layers. So here I wanted to show that with these somewhat more specialized architectures we can consume pixels directly.

There is also one downside. If we go with pixels as input representations, what remains challenging? Exactly - as the authors also point out for their experiments, they only consider a single target language, and the target side is not pixel-based. Still, this is, in my opinion, a very interesting step towards more shared representations.
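A rough sketch of how a rendered text line could be turned into a sequence of "visual token" embeddings with an overlapping sliding window and a small convolutional block; the patch size, stride, pooling, and the rendering step itself are all assumptions, and the actual systems use more elaborate encoders.

```python
import torch
import torch.nn as nn

class PixelPatchEmbedder(nn.Module):
    """Slice a rendered text line into overlapping patches and embed each patch."""
    def __init__(self, patch_width=16, stride=8, hidden=512):
        super().__init__()
        self.patch_width, self.stride = patch_width, stride
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)))                    # small conv block per patch
        self.proj = nn.Linear(64 * 4 * 4, hidden)            # to the transformer dimension

    def forward(self, image):                                # image: (B, 1, height, width)
        patches = image.unfold(3, self.patch_width, self.stride)  # (B, 1, H, N, patch_w)
        patches = patches.permute(0, 3, 1, 2, 4)                   # (B, N, 1, H, patch_w)
        b, n = patches.shape[:2]
        feats = self.conv(patches.reshape(b * n, 1, image.size(2), self.patch_width))
        return self.proj(feats.reshape(b, n, -1))            # (B, N, hidden), one vector per window

embed = PixelPatchEmbedder()
line_image = torch.rand(2, 1, 32, 200)      # a batch of rendered sentences, 32 px tall
print(embed(line_image).shape)              # torch.Size([2, 24, 512]) -> fed to the encoder
```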
With these somewhat out-of-the-box approaches covered, I just want to summarize today's lecture. First, I think we saw why multilinguality is cool and that there are several open challenges out there. We also saw several approaches for how to realize and implement a multilingual machine translation system. And lastly, we have seen quite a few open challenges - what is still unsolved. So with this I want to thank you for being here today; I will be up here if you want to ask anything. If you have questions, we can also go through them together now.