WEBVTT
0:00:00.860 --> 0:00:04.211
Okay Again Welcome.
0:00:04.524 --> 0:00:09.256
So today I'll be doing the lecture.
0:00:09.256 --> 0:00:12.279
My name is Danny Liro.
0:00:12.279 --> 0:00:16.747
I'm one of the PhD students with.
0:00:17.137 --> 0:00:25.942
And specifically how to learn representations
that are common across languages and use that
0:00:25.942 --> 0:00:29.004
to help low resource languages.
0:00:29.689 --> 0:00:39.445
So I hope today we can explore a little bit
about multilingual machine translation and hopefully.
0:00:40.100 --> 0:00:50.940
So today what we are going to do first we
are going to look at.
0:00:52.152 --> 0:01:02.491
Second, we will be looking into more details
as in how we achieve multilingual machine translation
0:01:02.491 --> 0:01:06.183
and what are the techniques there.
0:01:06.183 --> 0:01:12.197
At last, we are going to look at the current
challenges.
0:01:13.573 --> 0:01:15.976
Alright, so some definitions.
0:01:15.976 --> 0:01:19.819
First, what is multilingual machine translation?
0:01:21.201 --> 0:01:28.637
So for a multilingual machine translation
system, it's basically a system that is able
0:01:28.637 --> 0:01:34.279
to handle multiple source languages or multiple
target languages.
0:01:34.254 --> 0:01:44.798
You see here you've got, on the source
side, some German, Chinese, Spanish and English.
0:01:45.485 --> 0:01:50.615
Basically, it's also quite an interesting
machine learning challenge, actually.
0:01:51.031 --> 0:02:05.528
So if you consider each translation pair as
a different task in machine learning, then
0:02:05.528 --> 0:02:08.194
a multilingual model is basically a multi-task model.
0:02:08.628 --> 0:02:17.290
It has to specialize in all these different
translation directions and try to be good at all of them.
0:02:17.917 --> 0:02:26.890
So this is basically about multi-task learning,
with each translation direction being one
0:02:26.890 --> 0:02:27.462
task.
0:02:28.428 --> 0:02:35.096
Interesting question to ask here is like do
we get synergy like different tasks helping
0:02:35.096 --> 0:02:39.415
each other, the knowledge of one task helping
the other?
0:02:39.539 --> 0:02:48.156
Or do we get more interference, as in: I added English
to German, and now I get worse at English to
0:02:48.156 --> 0:02:49.047
Chinese.
0:02:49.629 --> 0:02:55.070
So this is also a very interesting question
that we'll look into later.
0:02:56.096 --> 0:02:58.605
Now a little bit of context.
0:02:59.519 --> 0:03:04.733
We care about multilingual machine translation.
0:03:04.733 --> 0:03:10.599
Part of the thing is that machine translation
models.
0:03:11.291 --> 0:03:22.659
If you consider all the languages in the world,
there are, as I read here, roughly seven thousand
0:03:22.659 --> 0:03:23.962
languages.
0:03:24.684 --> 0:03:37.764
So consider this number, and if you think
about this many languages out there, how many
0:03:37.764 --> 0:03:39.548
translation directions are there?
0:03:40.220 --> 0:03:46.897
So this means to cover N languages.
0:03:46.897 --> 0:03:59.374
We're going to end up with a quadratic, N
squared, number of directions.
0:03:59.779 --> 0:04:02.290
This is very bad; quadratic is very bad.
0:04:03.203 --> 0:04:14.078
This quadratic situation means that
for a lot of translation directions, if you
0:04:14.078 --> 0:04:16.278
consider all the languages out there:
0:04:17.177 --> 0:04:34.950
For many of them we aren't going to have any
parallel data as in existing translated data.
0:04:35.675 --> 0:04:40.001
So this is a very data scarce situation.
0:04:40.001 --> 0:04:49.709
We're not going to get parallel data everywhere,
especially when you have a system
0:04:49.709 --> 0:04:52.558
that covers, say, ten languages.
0:04:52.912 --> 0:05:04.437
If this axis actually goes towards the thousands,
which is realistic, we are going to end up
0:05:04.437 --> 0:05:06.614
with some holes.
0:05:07.667 --> 0:05:15.400
So now we are going to ask: can we use multilinguality
to help these kinds of low-resource languages?
0:05:15.875 --> 0:05:22.858
So one useful concept there is mutual intelligibility;
I don't know if you've heard of this.
0:05:23.203 --> 0:05:30.264
Basically, it's a term in linguistics for when somebody
who speaks one language can directly, without
0:05:30.264 --> 0:05:33.218
learning, understand the other language.
0:05:33.218 --> 0:05:39.343
So if you're a German speaker, maybe Dutch
or Danish and that kind of language would
0:05:39.343 --> 0:05:39.631
be.
0:05:40.000 --> 0:05:45.990
useful, or at least partially directly understandable,
to you.
0:05:46.586 --> 0:05:52.082
That is thanks to this kind of mutual intelligibility,
which is basically based on language
0:05:52.082 --> 0:05:52.791
similarity.
0:05:53.893 --> 0:05:57.105
And then there's knowledge sharing this concept.
0:05:57.105 --> 0:06:01.234
I mean, it's quite intuitive: basically, say
you're a German speaker.
0:06:01.234 --> 0:06:06.805
If you start learning Dutch or Danish and
all these Nordic languages, I think you're
0:06:06.805 --> 0:06:11.196
going to be faster than just a native English
speaker or anything.
0:06:11.952 --> 0:06:18.751
So hopefully our model is also able to do
this, but we'll see later what the real situation.
0:06:19.799 --> 0:06:27.221
So we said multilingual is good: multilingual
translation is nice and there's a lot of
0:06:27.221 --> 0:06:28.210
potentials.
0:06:28.969 --> 0:06:32.205
So it's a long path towards there.
0:06:32.205 --> 0:06:37.569
Think all the efforts started in so quite
some years ago.
0:06:37.958 --> 0:06:54.639
At first people started with models with language
specific modules.
0:06:54.454 --> 0:06:58.747
So we talked about the encoder-decoder
architecture in the previous lectures.
0:07:00.100 --> 0:07:06.749
And with this separation of the encoder and
the decoder, it gives a natural way to split
0:07:06.749 --> 0:07:07.679
the modules.
0:07:09.069 --> 0:07:20.805
So basically what's going on here
is a dedicated encoder for each source language and a dedicated decoder for each target language.
0:07:21.281 --> 0:07:34.252
Now, given parallel data, for example
English-German data, we just activate this German
0:07:34.252 --> 0:07:39.241
encoder and activate this English decoder.
0:07:40.680 --> 0:07:48.236
So now we are training basically like corresponding
parts of the encoder decoders.
0:07:48.236 --> 0:07:55.278
It has some advantages: First, we have a multilingual
system.
0:07:55.278 --> 0:08:03.898
Of course, second modularity is also an advantage
in software engineering.
0:08:03.898 --> 0:08:10.565
We want to decouple things, so that if the German encoder
is broken, we can deal with it separately.
0:08:11.011 --> 0:08:19.313
So modularity is advantage in this case, but
again if we think about scalability, if we
0:08:19.313 --> 0:08:27.521
think about languages out there that we talked
about, scalability isn't a great thing.
0:08:27.947 --> 0:08:37.016
We also talked about sharing knowledge or
sharing representations for different languages.
0:08:37.317 --> 0:08:41.968
We have a separate thing for each language.
0:08:41.968 --> 0:08:46.513
How likely is it that we are sharing much?
0:08:46.513 --> 0:08:52.538
So these are potential disadvantages with
this approach.
0:08:53.073 --> 0:09:01.181
So yeah we talked about, we want to have knowledge
transfer, we want to have similar languages
0:09:01.181 --> 0:09:02.888
helping each other.
0:09:02.822 --> 0:09:06.095
This is somehow a more reachable goal.
0:09:06.095 --> 0:09:13.564
If you have a shared encoder and a shared
decoder, basically a fully parameter-shared model
0:09:13.564 --> 0:09:21.285
for all the translation pairs out there. And
there's also another gain: if you just have
0:09:21.285 --> 0:09:21.705
one.
0:09:22.582 --> 0:09:26.084
block of a model for all the translation directions
out there.
0:09:26.606 --> 0:09:38.966
It's easier to deploy in the sense that if
you are serving a model you don't have a thousand
0:09:38.966 --> 0:09:42.555
small modules to maintain.
0:09:42.762 --> 0:09:52.448
So in terms of engineering, these kinds
of fully parameter-shared models have advantages. So this
0:09:52.448 --> 0:09:59.819
is also where the current research has been
going towards in recent years.
0:10:00.460 --> 0:10:16.614
So the rest of the lecture is also going
to focus on this kind of model.
0:10:17.037 --> 0:10:30.901
So the first type of multilinguality is this
kind of many-to-one situation.
0:10:30.901 --> 0:10:34.441
Basically, we translate from many source languages into one target language.
0:10:35.355 --> 0:10:49.804
So one use case that you can think of here
is if you do subtitles for international movies
0:10:49.804 --> 0:10:51.688
in Germany.
0:10:53.073 --> 0:11:02.863
Then, flipping the situation, there are also
many configurations where we only have one
0:11:02.863 --> 0:11:04.798
source language.
0:11:06.046 --> 0:11:13.716
There's also many use cases like if you think
about the lecture translator here you've seen.
0:11:14.914 --> 0:11:21.842
So here most of the lectures are in German
and now we want to translate them into other languages.
0:11:21.842 --> 0:11:28.432
I think on the user end we only support English,
but other languages are also supportable.
0:11:28.608 --> 0:11:38.988
So in this kind of use case, you have
one speaker and you want to serve or expand
0:11:38.988 --> 0:11:41.281
to many audiences.
0:11:42.802 --> 0:11:50.542
But of course, combining everything, there's
the many to many situation here.
0:11:50.542 --> 0:11:54.015
You can think of Google Translate.
0:11:54.015 --> 0:11:58.777
They are doing basically any selected language.
0:11:59.159 --> 0:12:03.760
And this is also more difficult.
0:12:03.760 --> 0:12:14.774
If you consider the data you need to get and
concerns, we'll cover this later.
0:12:15.135 --> 0:12:21.034
But first we are going to start with many
to one translations.
0:12:21.741 --> 0:12:30.436
Say this is the most similar to the bilingual
translation situation you saw earlier, but
0:12:30.436 --> 0:12:39.423
now one difference is we need a vocabulary
or tokens that can represent all these different
0:12:39.423 --> 0:12:40.498
languages.
0:12:41.301 --> 0:12:44.200
So we need a joint multilingual vocabulary.
0:12:44.924 --> 0:12:48.794
So let's just quickly recall what word embeddings
are supposed to do.
0:12:49.189 --> 0:12:54.561
Basically we need to represent it.
0:12:54.561 --> 0:13:04.077
We have to get some vector representation
for discrete words.
0:13:04.784 --> 0:13:16.911
And when we embed a token, we are retrieving
the corresponding vector out of this lookup table.
0:13:17.697 --> 0:13:19.625
And then we put it.
0:13:19.625 --> 0:13:26.082
We feed a sequence of vectors into the encoder
as the next step.
0:13:26.987 --> 0:13:34.973
Now if it's multilingual you can imagine that the
vocabulary suddenly gets very, very big because
0:13:34.973 --> 0:13:36.262
of all the languages.
0:13:37.877 --> 0:13:46.141
So what is quite useful here are subwords,
like the byte pair encoding you talked about earlier.
0:13:46.406 --> 0:13:55.992
So in this case we are still limiting ourselves
to a finite vocabulary, so that we
0:13:55.992 --> 0:13:59.785
are not exploding the vocabulary table.
0:14:01.181 --> 0:14:11.631
So when we learn these kinds of subwords,
what happens basically?
0:14:11.631 --> 0:14:17.015
We look at all the training data.
0:14:18.558 --> 0:14:20.856
So think about this.
0:14:20.856 --> 0:14:28.077
If we do this now on a bunch of multilingual data,
are there concerns?
0:14:30.050 --> 0:14:36.811
Maybe we have an unbalanced dataset,
so we get overly many English merges in the vocabulary.
0:14:37.337 --> 0:14:39.271
Yeah Exactly Thanks.
0:14:39.539 --> 0:14:46.602
So what we have to pay attention to here is
how we learn this multilingual vocabulary.
0:14:46.602 --> 0:14:52.891
We should pay attention that all the languages
are more or less balanced, not that you only
0:14:52.891 --> 0:14:58.912
learn subwords for English or some bigger
languages, and then neglect other
0:14:58.912 --> 0:15:00.025
languages, yeah.
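NOTE
A minimal sketch, not from the lecture, of one common way to balance languages before learning a joint subword vocabulary: sample the tokenizer's training sentences with temperature-smoothed probabilities instead of using the raw (skewed) corpus sizes. The corpus sizes and the temperature value are made-up assumptions.

    import random

    # Hypothetical sentence counts per language (only for illustration).
    corpus_sizes = {"en": 10_000_000, "de": 5_000_000, "ha": 30_000}

    def sampling_probs(sizes, temperature=5.0):
        # Smooth the raw data distribution: p_i proportional to (n_i / N) ** (1 / T).
        total = sum(sizes.values())
        weights = {lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()}
        norm = sum(weights.values())
        return {lang: w / norm for lang, w in weights.items()}

    probs = sampling_probs(corpus_sizes)
    print(probs)  # the low-resource language gets a far larger share than its raw 0.2%

    # Draw sentences for vocabulary learning according to these probabilities,
    # instead of simply concatenating all corpora.
    langs = list(probs)
    sampled_langs = random.choices(langs, weights=[probs[l] for l in langs], k=5)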
0:15:01.021 --> 0:15:04.068
Of course, this is not going to solve everything.
0:15:04.068 --> 0:15:09.614
Even if we get a perfectly uniform distribution
over all the languages out there, it is not
0:15:09.614 --> 0:15:13.454
going to mean that we are ending up with a
perfect vocabulary.
0:15:14.154 --> 0:15:20.068
There are also language differences, right?
So if you consider most European languages:
0:15:20.180 --> 0:15:27.081
there will be many shared subword components, since
how you write a certain word is somewhat similar.
0:15:27.267 --> 0:15:34.556
But then there are other languages with completely
different scripts like Arabic, Cyrillic scripts
0:15:34.556 --> 0:15:40.594
or East Asian scripts where you get a character
set with
0:15:40.940 --> 0:15:43.531
tens of thousands of characters.
0:15:43.531 --> 0:15:50.362
So these are also individual concerns that
one has to think about when building specific
0:15:50.362 --> 0:15:51.069
systems.
0:15:51.591 --> 0:16:02.660
But overall, the rule of thumb is that when
you build a multilingual tokenizer vocabulary, the languages should be
0:16:02.660 --> 0:16:04.344
more or less balanced.
0:16:05.385 --> 0:16:17.566
And there's actually some paper showing that
the performance of the final system is going
0:16:17.566 --> 0:16:25.280
to start to degrade if you have a disproportionate
data.
0:16:27.207 --> 0:16:33.186
Of course there is currently the trend of
using pre-trained models.
0:16:33.186 --> 0:16:39.890
If you take a pre-train model somewhere then
you don't have this concern.
0:16:40.580 --> 0:16:47.810
You just have to make sure that you use the same tokenizers
that they used, so that there is no train-test
0:16:47.810 --> 0:16:48.287
mismatch.
0:16:48.888 --> 0:16:53.634
Yeah, pre-training we're going to talk
about a little bit later as well.
0:16:54.734 --> 0:16:59.960
Alright, so now we have a multilingual vocabulary.
0:17:00.920 --> 0:17:04.187
There are several good things, obviously.
0:17:04.187 --> 0:17:10.953
So one thing is that if we have words that
share the same textual form, like we said there
0:17:10.953 --> 0:17:16.242
are European languages that share some vocabulary,
then it's great.
0:17:16.242 --> 0:17:19.897
Then we have the first step towards knowledge sharing.
0:17:20.000 --> 0:17:30.464
For example, the word for pineapple for some reason
is also similar in Eastern European languages.
0:17:30.464 --> 0:17:34.915
In Cyrillic scripts it's also basically the same word.
0:17:36.116 --> 0:17:42.054
However, there is also ambiguity if we embed
them together: take a word like "die".
0:17:42.054 --> 0:17:46.066
Of course, it means different things in English and
German.
0:17:46.246 --> 0:17:53.276
Then, of course, that's possible to rely on
further context.
0:17:53.276 --> 0:17:59.154
It's not a problem, it's something to think
about.
0:18:00.200 --> 0:18:11.061
And when we go higher to cover more vocabulary
entries, we might need to go bigger in the
0:18:11.061 --> 0:18:13.233
vocabulary count.
0:18:13.653 --> 0:18:28.561
So there is always sort of a bottleneck as
the number of languages increases.
0:18:30.110 --> 0:18:32.836
Right, so what is the result?
0:18:32.836 --> 0:18:38.289
What are these cross-lingual word embeddings actually
learning?
0:18:40.160 --> 0:18:44.658
So normally to inspect them it's quite hard.
0:18:44.658 --> 0:18:53.853
They're high-dimensional vectors with many dimensions,
but researchers also try to project them down.
0:18:54.454 --> 0:19:05.074
So in this case it is a little bit small,
but in this case for English and French there
0:19:05.074 --> 0:19:07.367
are many entries.
0:19:07.467 --> 0:19:20.014
One example is different surface forms of the
same word, in different morphological forms.
0:19:20.014 --> 0:19:26.126
Basically, it's like morphological variants clustering together.
0:19:26.546 --> 0:19:32.727
There are also words from different languages;
I think there is the word for research in English and
0:19:32.727 --> 0:19:33.282
French.
0:19:33.954 --> 0:19:41.508
So the takeaway from this plot is that somehow
we learn a bit of semantic meaning beyond
0:19:41.508 --> 0:19:43.086
the textual forms.
0:19:45.905 --> 0:19:50.851
But then this looks good and this gives us
hope.
0:19:52.252 --> 0:20:05.240
That if we consider what is the baseline here,
the baseline we compare to is a bilingual system
0:20:05.240 --> 0:20:09.164
without any multilinguality.
0:20:10.290 --> 0:20:19.176
This looks good because we compare, for
many Eastern and
0:20:19.176 --> 0:20:28.354
Central European languages to English:
and we see that the many-to-English system has actually
0:20:28.354 --> 0:20:30.573
always gained quite a bit over it.
0:20:31.751 --> 0:20:38.876
But there is also later investigation on whether
this gain actually comes out of multilinguality or
0:20:38.876 --> 0:20:39.254
not.
0:20:39.639 --> 0:20:46.692
So this is a spoiler won't tell much about
it until the second half, but just remember
0:20:46.692 --> 0:20:47.908
there is this.
0:20:49.449 --> 0:20:53.601
Now let's move on to one-to-many translation.
0:20:53.601 --> 0:21:01.783
Let's recall in a normal transformer or any
encoder decoder setup.
0:21:02.242 --> 0:21:08.839
We have an encoder that creates sort of a contextual
representation for the source sentence.
0:21:09.949 --> 0:21:17.787
This is more or less the context for generating
the target sentence, right?
0:21:17.787 --> 0:21:28.392
Now on the target side we get the first token,
then we feed it in again and then get the second
0:21:28.392 --> 0:21:29.544
token while decoding.
0:21:31.651 --> 0:21:35.039
And now we have multiple target languages.
0:21:35.039 --> 0:21:39.057
Does anybody see a problem with this architecture?
0:21:48.268 --> 0:21:57.791
Specifically, it's in the decoder. So now say we
have a German sentence encoded.
0:21:57.791 --> 0:22:01.927
We now want to generate Spanish.
0:22:07.367 --> 0:22:11.551
So the problem is how does the model know
which language to generate?
0:22:12.112 --> 0:22:24.053
If we just give it a generic start token,
there is nowhere we are telling the model which language we want.
0:22:24.944 --> 0:22:30.277
So that this can only be a guess, and this
model will definitely not run well.
0:22:32.492 --> 0:22:40.021
So this comes to the question: how do we indicate
the intended target language to the model?
0:22:41.441 --> 0:22:52.602
One first idea that people tried is basically,
on the source side, to not only include the
0:22:52.602 --> 0:22:53.552
source sentence,
0:22:53.933 --> 0:23:01.172
but also a tag like "to Spanish". So basically
the source is already informed:
0:23:01.172 --> 0:23:12.342
the source sentence is already supplemented
with the intended target language. Now this is also called target forcing
0:23:12.342 --> 0:23:19.248
in the sense that we try to force it to give
the right target.
0:23:20.080 --> 0:23:24.622
This is one approach.
0:23:24.622 --> 0:23:38.044
Another approach is basically based on the
following idea.
0:23:38.438 --> 0:23:52.177
If we encode the source sentence, the
encoder output shouldn't really differ based on the target language.
0:23:52.472 --> 0:24:02.397
So out of this motivation people have moved
this signaling mechanism to the decoder side.
0:24:02.397 --> 0:24:09.911
They basically replaced the traditional start
token.
0:24:10.330 --> 0:24:17.493
So here we are not feeding in the generic
start token anymore, but instead something language
0:24:17.493 --> 0:24:18.298
specific.
0:24:18.938 --> 0:24:21.805
So this is also another way to achieve this.
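NOTE
A small sketch, with made-up tag names, of the two signaling options just described: prepending a target-language tag to the source (target forcing) versus using a language-specific start token on the decoder side.

    def tag_source(src_tokens, tgt_lang):
        # Option 1: target forcing, i.e. prepend a target-language tag to the source.
        return [f"<2{tgt_lang}>"] + src_tokens

    def decoder_start_token(tgt_lang):
        # Option 2: replace the generic start token with a language-specific one.
        return f"<bos_{tgt_lang}>"

    print(tag_source(["Das", "ist", "gut"], "es"))   # ['<2es>', 'Das', 'ist', 'gut']
    print(decoder_start_token("es"))                 # '<bos_es>'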
0:24:23.283 --> 0:24:27.714
But there are still more challenging cases.
0:24:27.714 --> 0:24:35.570
Sometimes the output starts in the correct language,
say English or German, when the tag is there.
0:24:35.570 --> 0:24:39.700
But later on it drifts further and further off.
0:24:40.320 --> 0:24:46.752
Basically this information is not strong enough
to always enforce the target language, especially
0:24:46.752 --> 0:24:48.392
in zero shot conditions.
0:24:48.392 --> 0:24:54.168
We'll look into this later: we get this
kind of off-target translation, where it keeps generating
0:24:54.168 --> 0:24:57.843
and generating and then going into some wrong
language.
0:24:59.219 --> 0:25:12.542
So another technique, actually developed here
some years ago, was to inject this language information during decoding.
0:25:12.872 --> 0:25:19.834
So when we are doing the autoregressive
decoding, normally we only feed the previous output token
0:25:20.000 --> 0:25:22.327
into the decoder.
0:25:22.327 --> 0:25:33.704
But if we also add a language embedding for
the target language, on top of that we have
0:25:33.704 --> 0:25:37.066
the language information at every step.
0:25:37.397 --> 0:25:44.335
And this has been shown to perform quite a bit
better, especially in conditions where the
0:25:44.335 --> 0:25:44.906
model would otherwise go off target.
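NOTE
A minimal PyTorch-style sketch, an assumption of how such a decoder input could be built rather than the exact implementation from the lecture, where a target-language embedding is added to every decoder input embedding.

    import torch
    import torch.nn as nn

    class DecoderInput(nn.Module):
        def __init__(self, vocab_size, num_langs, d_model):
            super().__init__()
            self.tok_emb = nn.Embedding(vocab_size, d_model)
            self.lang_emb = nn.Embedding(num_langs, d_model)

        def forward(self, prev_tokens, tgt_lang_id):
            # prev_tokens: (batch, seq_len), tgt_lang_id: (batch,)
            tok = self.tok_emb(prev_tokens)                 # (batch, seq, d)
            lang = self.lang_emb(tgt_lang_id).unsqueeze(1)  # (batch, 1, d)
            return tok + lang                               # language signal at every step

    emb = DecoderInput(vocab_size=32000, num_langs=8, d_model=512)
    x = emb(torch.randint(0, 32000, (2, 5)), torch.tensor([3, 3]))
    print(x.shape)  # torch.Size([2, 5, 512])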
0:25:46.126 --> 0:25:56.040
So yeah, we introduced three ways to enforce
the target language. And now with this we're
0:25:56.040 --> 0:26:02.607
going to move on to the more interesting case
of many-to-many translation.
0:26:03.503 --> 0:26:14.021
So here we just consider a system that
translates two directions: German to English
0:26:14.021 --> 0:26:15.575
and English to French.
0:26:16.676 --> 0:26:21.416
Now we have multiple target languages, right?
0:26:21.416 --> 0:26:29.541
Can you see where we're enforcing the target
language here?
0:26:29.541 --> 0:26:33.468
In this case what technique?
0:26:34.934 --> 0:26:45.338
So here we are enforcing the target
language with the language tag, and we train this system.
0:26:46.526 --> 0:27:00.647
And at the inference time we are able to generate
English to French, but in addition to this
0:27:00.647 --> 0:27:12.910
we are also able to do something more: we will be able to do
zero shot inference that basically translates
0:27:12.910 --> 0:27:17.916
a direction that is not seen in training.
0:27:19.319 --> 0:27:25.489
So this is so-called zero-shot translation
using a multilingual system.
0:27:26.606 --> 0:27:34.644
Of course, we have to achieve several things:
first, we must be able to control the output language,
0:27:34.644 --> 0:27:36.769
otherwise it's no use.
0:27:37.317 --> 0:27:51.087
Second, we should also have some kind of language
independent representation.
0:27:51.731 --> 0:27:53.196
Why is this?
0:27:53.196 --> 0:27:55.112
Why is this big?
0:27:55.112 --> 0:28:00.633
Because if we want to generate French up
here:
0:28:00.940 --> 0:28:05.870
the decoder was trained to translate from encoded English.
0:28:07.187 --> 0:28:15.246
But now we feed encoded German into the French decoder,
so intuitively we need these representations
0:28:15.246 --> 0:28:22.429
to be similar enough, not so
far apart that we cannot use them interchangeably.
0:28:25.085 --> 0:28:32.059
So there are several works out there showing
that if you do a standard transformer architecture
0:28:32.059 --> 0:28:39.107
this language independent property is not really
there and you need to add additional approaches
0:28:39.107 --> 0:28:40.633
in order to enforce it.
0:28:41.201 --> 0:28:51.422
So you can, for example, add an additional
training objective that says the encoded source sentences,
0:28:51.422 --> 0:29:00.305
say the encoded German and the encoded English,
have to be the same or be as close to each
0:29:00.305 --> 0:29:02.201
other as possible.
0:29:02.882 --> 0:29:17.576
So if we take the encoder output for one language and the output for
another language, how can we formulate this
0:29:17.576 --> 0:29:18.745
as an objective?
0:29:20.981 --> 0:29:27.027
We can feed the translation into the encoder,
and whatever we translate,
0:29:27.027 --> 0:29:32.817
the embeddings must also be similar. And that's
the right direction.
0:29:33.253 --> 0:29:42.877
So one thing to take care of here is the length
for the same sentence in German and English
0:29:42.877 --> 0:29:44.969
is not necessarily the same.
0:29:45.305 --> 0:30:00.858
So instead of a word-to-word matching,
we can always do pooling to a fixed-length
0:30:00.858 --> 0:30:03.786
representation.
0:30:04.004 --> 0:30:08.392
Or there are more advanced techniques that
involve some alignments.
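NOTE
A sketch, under the assumption of mean pooling and an MSE penalty (other pooling and distance choices are possible), of adding a similarity term between the pooled encoder outputs of a parallel sentence pair, to push the encoder towards language-independent representations.

    import torch

    def masked_mean(enc_out, mask):
        # enc_out: (batch, seq, d); mask: (batch, seq) with 1 for real tokens
        mask = mask.unsqueeze(-1).float()
        return (enc_out * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

    def similarity_loss(enc_src, mask_src, enc_tgt, mask_tgt):
        # Pool variable-length sentences to fixed-size vectors, then pull them together.
        return torch.nn.functional.mse_loss(
            masked_mean(enc_src, mask_src), masked_mean(enc_tgt, mask_tgt)
        )

    # total_loss = translation_loss + lambda_sim * similarity_loss(...)
    a, b = torch.randn(2, 6, 512), torch.randn(2, 8, 512)
    ma, mb = torch.ones(2, 6), torch.ones(2, 8)
    print(similarity_loss(a, ma, b, mb))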
0:30:08.848 --> 0:30:23.456
So this is useful in the sense that in this
part in experiments we have shown it improves
0:30:23.456 --> 0:30:27.189
zero shot translation.
0:30:27.447 --> 0:30:36.628
This is on the data condition of English to
Malay, Javanese and Filipino, so a kind of mid-to-
0:30:36.628 --> 0:30:39.722
low resource language family.
0:30:40.100 --> 0:30:50.876
And there we assume that we get parallel data from English
to all of them, but not among these languages themselves.
0:30:51.451 --> 0:31:03.592
So the blue bar is a vanilla Transformer model,
and the purple bar is when we add such a similarity objective.
0:31:04.544 --> 0:31:12.547
You see that in supervised conditions it's
not changing much, but in zero shots there's
0:31:12.547 --> 0:31:13.183
quite a difference.
0:31:15.215 --> 0:31:22.649
Yeah, so far we said zero shots is doable
and it's even more achievable if we enforce
0:31:22.649 --> 0:31:26.366
some language independent representations.
0:31:26.366 --> 0:31:29.823
However, there's one practical concern.
0:31:29.823 --> 0:31:33.800
Don't know if you also had the same question.
0:31:34.514 --> 0:31:39.835
If you have two languages for which you don't have
direct parallel data,
0:31:39.835 --> 0:31:43.893
you could always translate once into English and once out of English.
0:31:45.685 --> 0:31:52.845
This kind of approach is actually called
pivoting, as in pivoting over an intermediate
0:31:52.845 --> 0:31:53.632
language.
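NOTE
Pivoting in a nutshell, as a sketch with a hypothetical translate() function: two supervised hops over English instead of one direct, possibly unseen, direction. The cost is two decoding passes instead of one.

    def pivot_translate(sentence, translate, src="nl", pivot="en", tgt="de"):
        # Two supervised steps: src -> pivot, then pivot -> tgt.
        # 'translate' is a placeholder for whatever MT system is available.
        intermediate = translate(sentence, src_lang=src, tgt_lang=pivot)
        return translate(intermediate, src_lang=pivot, tgt_lang=tgt)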
0:31:55.935 --> 0:32:00.058
Yeah, it definitely has advantages,
in the following sense.
0:32:00.440 --> 0:32:11.507
Now if we go over these two steps every direction
was trained with supervised data so you could
0:32:11.507 --> 0:32:18.193
always assume that we are working with
a supervised direction.
0:32:18.718 --> 0:32:26.868
So in this case we can expect more robust
inference time behavior.
0:32:26.868 --> 0:32:31.613
However, there are also disadvantages.
0:32:31.531 --> 0:32:38.860
At inference we're passing through the model
twice, so that's doubling the inference time
0:32:38.860 --> 0:32:39.943
computation.
0:32:40.500 --> 0:32:47.878
You might think, okay, doubling, so what? But
consider if you're a company like Google, running
0:32:47.878 --> 0:32:54.929
Google Translate, and all your live traffic
suddenly becomes twice as big, this is not
0:32:54.929 --> 0:33:00.422
something scalable that you want to see, especially
in production.
0:33:01.641 --> 0:33:11.577
Another problem with this is information
loss, because if we go over these steps, like when
0:33:11.577 --> 0:33:20.936
a chain of kids pass the word to each other,
in the end it's losing information.
0:33:22.082 --> 0:33:24.595
I can give an example here.
0:33:24.595 --> 0:33:27.803
It's also from a master thesis here.
0:33:27.803 --> 0:33:30.316
It's on gender preservation.
0:33:30.770 --> 0:33:39.863
Basically, some languages like Italian and
French have different word forms based on the
0:33:39.863 --> 0:33:40.782
speaker.
0:33:41.001 --> 0:33:55.987
So if a male person says "I feel alienated", the
word for alienated would be in the masculine form, and for a
0:33:55.987 --> 0:33:58.484
female person in the feminine form.
0:34:00.620 --> 0:34:05.730
Now imagine that we pivot through English.
0:34:05.730 --> 0:34:08.701
The information is lost.
0:34:08.701 --> 0:34:11.910
We don't know the speaker's gender.
0:34:12.492 --> 0:34:19.626
When we go out into French again, there are
different forms.
0:34:19.626 --> 0:34:29.195
Depending on the speaker's gender, we cannot recover the right one. So
this is one problem.
0:34:31.871 --> 0:34:44.122
This is especially the case because English
compared to many other languages is relatively
0:34:44.122 --> 0:34:45.199
simple.
0:34:45.205 --> 0:34:53.373
It doesn't have gendered word forms like this, and it also
doesn't have many cases, so going through English
0:34:53.373 --> 0:34:56.183
a lot of information gets lost.
0:34:57.877 --> 0:35:12.796
And another thing is, if you have similar languages
that you are translating between, here is an example from our systems
0:35:12.796 --> 0:35:15.494
that translate them.
0:35:16.496 --> 0:35:24.426
This is the output of going from Dutch to
German, pivoting over English.
0:35:24.426 --> 0:35:30.231
If you read the German, how many of you would say it's good?
0:35:32.552 --> 0:35:51.679
Good. And the problem here is that we are going
over English and then from English to German.
0:35:51.831 --> 0:36:06.332
However, if we go direct, in this case with zero-
shot translation, you see that this word is translated correctly.
0:36:06.546 --> 0:36:09.836
In this case, the output translation is better.
0:36:10.150 --> 0:36:20.335
And we believe this has to do with using the
language similarity between the two languages.
0:36:20.335 --> 0:36:26.757
There are also quantitative results we found
confirming this:
0:36:27.988 --> 0:36:33.780
the models are always doing better when translating
between similar languages directly, compared to pivoting.
0:36:35.535 --> 0:36:42.093
Yeah, so in this first half what we talked
about basically first, we started with how
0:36:42.093 --> 0:36:49.719
multilinguality or multilingual machine translation
could enable knowledge transfer between languages
0:36:49.719 --> 0:36:53.990
and help with conditions where we don't have
much data.
0:36:55.235 --> 0:37:02.826
Then we looked at three types of multilingual
translation: many to one, one to
0:37:02.826 --> 0:37:03.350
many.
0:37:05.285 --> 0:37:13.397
We got there first about a shared vocabulary
based on different languages and how these
0:37:13.397 --> 0:37:22.154
cross lingual word embeddings capture semantic
meanings rather than just the textual surface form.
0:37:25.505 --> 0:37:37.637
Then we looked at how to signal the target
language, how to ask the model to generate it,
0:37:37.637 --> 0:37:43.636
and then we looked at zero shot translation.
0:37:45.325 --> 0:37:58.187
Now, before we go into the second half: are
there questions about the first part? Okay, good.
0:38:00.140 --> 0:38:10.932
In the second half of this lecture we'll be
looking into challenges like what is still
0:38:10.932 --> 0:38:12.916
unsolved about.
0:38:13.113 --> 0:38:18.620
There are some aspects to look at it.
0:38:18.620 --> 0:38:26.591
The first is modeling, the second is more
engineering.
0:38:28.248 --> 0:38:33.002
Okay, so we talked about this question several
times.
0:38:33.002 --> 0:38:35.644
How does multilinguality help?
0:38:35.644 --> 0:38:37.405
Where does it help?
0:38:38.298 --> 0:38:45.416
Here want to show results of an experiment
based on over a hundred languages.
0:38:46.266 --> 0:38:58.603
Here you can see the data amounts; they use
parallel data to English and it's very imbalanced.
0:38:58.999 --> 0:39:00.514
This is already log scale.
0:39:00.961 --> 0:39:12.982
So for high-resource languages like English
to French, German or Spanish, you get over a billion
0:39:12.982 --> 0:39:14.359
sentences.
0:39:14.254 --> 0:39:21.003
In parallel, and when we go more to the right
to the more low resource spectrum on the other
0:39:21.003 --> 0:39:26.519
hand, there are languages that maybe many of
us have never heard of.
0:39:26.466 --> 0:39:29.589
Do You Want to Move Back?
0:39:30.570 --> 0:39:33.270
Hawaiian Indians have heard of it.
0:39:34.414 --> 0:39:39.497
So on that spectrum we only have like thirty
thousand sentences.
0:39:40.400 --> 0:39:48.389
So what this means is when we train, we have
to upsample these languages.
0:39:48.389 --> 0:39:51.585
Otherwise the model would hardly even see them.
0:39:52.732 --> 0:40:05.777
Yeah, so on this graph, the way we read it is:
this horizontal line at zero is basically
0:40:05.777 --> 0:40:07.577
indicating the bilingual baseline.
0:40:07.747 --> 0:40:14.761
Because we want to see where multilinguality
helps, we compare to what happens when there
0:40:14.761 --> 0:40:15.371
is not.
0:40:16.356 --> 0:40:29.108
So higher than the zero line
means we're gaining from multilinguality.
0:40:29.309 --> 0:40:34.154
The same like for these languages.
0:40:34.154 --> 0:40:40.799
This side means we are a high resource for
the.
0:40:40.981 --> 0:40:46.675
Yeah, sorry, I think I've somehow removed
the x-axis labels.
0:40:48.008 --> 0:40:58.502
Yeah alright, what happens now if we look
at many into English?
0:40:58.698 --> 0:41:08.741
On the low-resource spectrum, by going multilingual
we gain a lot over the bilingual system.
0:41:10.010 --> 0:41:16.658
Overall, if you consider the average over all
of the languages, it's still a gain.
0:41:17.817 --> 0:41:27.301
Now we're looking at the green line so you
can ignore the blue line.
0:41:27.301 --> 0:41:32.249
Basically we have to do our sample.
0:41:33.753 --> 0:41:41.188
Yeah, so even if you just consider the average,
it's still a gain over the bilingual systems.
0:41:42.983 --> 0:41:57.821
However, if we go to the English-to-many systems
and look at the gains, we only get minor improvements.
0:41:59.039 --> 0:42:12.160
So why is it the case that going multilingual
isn't really helping universally?
0:42:16.016 --> 0:42:18.546
Do you have some intuitions on yeah?
0:42:18.698 --> 0:42:38.257
It's easier to understand something than to generate it,
if we consider what the model has to generate.
0:42:38.718 --> 0:42:40.091
I See It Like.
0:42:40.460 --> 0:42:49.769
Generating is a bit like writing or speaking,
while inputting on the source side is more like
0:42:49.769 --> 0:42:50.670
reading.
0:42:50.650 --> 0:42:57.971
So one is more passive and the other is more
active and don't know if you have similar experience.
0:42:57.971 --> 0:43:05.144
I think speaking and writing is always a little
bit more difficult than just passively listening
0:43:05.144 --> 0:43:06.032
or reading.
0:43:06.032 --> 0:43:09.803
But this is a very hand-wavy kind of understanding.
0:43:10.390 --> 0:43:11.854
And fed.
0:43:12.032 --> 0:43:20.309
In terms of the model, if we consider what
is the difference for the target side for many
0:43:20.309 --> 0:43:26.703
to English: One difference is that there's
a data difference.
0:43:27.167 --> 0:43:33.438
So if you just consider a many-to-English system
with German to English and Spanish to English,.
0:43:34.975 --> 0:43:44.321
One thing we have to keep in mind is that
the parallel data is not all the same, so on
0:43:44.321 --> 0:43:49.156
the target side there is different English data.
0:43:49.769 --> 0:43:54.481
So the situation rather looks like this.
0:43:54.481 --> 0:43:59.193
What this means is that we are not only adding source languages.
0:44:00.820 --> 0:44:04.635
We also add more data on the target side for
English.
0:44:06.967 --> 0:44:18.581
Now since the target side data is not identical,
how do we do a controlled experiment to remove
0:44:18.581 --> 0:44:21.121
the multilinguality?
0:44:24.644 --> 0:44:42.794
So what people tried as a control experiment
is to keep all the English data the same as in the above
0:44:42.794 --> 0:44:44.205
setup.
0:44:44.684 --> 0:44:49.700
So they take the English side of the data,
the same English as in the multilingual setup.
0:44:50.090 --> 0:44:55.533
And then they generate synthetic data for the German side.
0:44:55.533 --> 0:45:05.864
So now we have a bilingual system again, but
on the target side we still have the previously
0:45:05.864 --> 0:45:08.419
enriched English data.
0:45:10.290 --> 0:45:25.092
Now back to this picture that we've seen before,
this mysterious orange line here is basically
0:45:25.092 --> 0:45:26.962
the result of this control experiment.
0:45:27.907 --> 0:45:36.594
And somewhat strikingly, and perhaps sadly for
believers of multilinguality:
0:45:36.594 --> 0:45:39.176
This is also gaining.
0:45:41.001 --> 0:45:52.775
So what this means is that the many-to-English setup
is gaining not really because of multilinguality
0:45:52.775 --> 0:45:55.463
but just because of the additional English data.
0:45:55.976 --> 0:46:10.650
And this means that there is still quite a
lot to do if we really want to gain from just
0:46:10.650 --> 0:46:13.618
shared knowledge.
0:46:14.514 --> 0:46:27.599
But this also gives hope because there are
still many things to research in this area
0:46:27.599 --> 0:46:28.360
now.
0:46:28.708 --> 0:46:40.984
So we've seen adding more languages helps,
although somewhat as a data side effect. Can it also hurt?
0:46:40.984 --> 0:46:45.621
So if we just add more languages.
0:46:47.007 --> 0:46:48.408
We've seen this.
0:46:48.408 --> 0:46:52.694
This is the picture for the many-to-English
system.
0:46:53.793 --> 0:47:09.328
Compared to this bilingual baseline, we see
that for these high resource languages we are
0:47:09.328 --> 0:47:12.743
not doing as great.
0:47:15.956 --> 0:47:18.664
So why are we losing here?
0:47:18.664 --> 0:47:25.285
It has been shown that this performance loss
is somewhat related to model capacity.
0:47:26.026 --> 0:47:37.373
In the sense that the model has to learn so
much that at some point it has to sacrifice
0:47:37.373 --> 0:47:39.308
capacity from some directions.
0:47:41.001 --> 0:47:57.081
So what to do? One way to basically grow a bigger brain
to tackle this is to add some dedicated capacity
0:47:57.081 --> 0:47:59.426
per language.
0:48:00.100 --> 0:48:15.600
Here it's like a simplified graph of a transformer
architecture, so this is the encoder within
0:48:15.600 --> 0:48:16.579
time.
0:48:17.357 --> 0:48:27.108
But additionally, here these little colorful
blocks are now the language-specific kind
0:48:27.108 --> 0:48:28.516
of capacity.
0:48:29.169 --> 0:48:42.504
They are language-specific in the sense that
if you get, say, Chinese to English as the input,
0:48:43.103 --> 0:48:54.900
we are also going through the Chinese-specific parts,
which in this case consist of a down projection and an up projection.
0:48:56.416 --> 0:49:07.177
So these are also called adapters: something
that is plugged into an existing model and
0:49:07.177 --> 0:49:11.556
it adapts towards a specific task.
0:49:12.232 --> 0:49:22.593
And this is conditionally activated, in the
sense that a different input language activates a different adapter.
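NOTE
A simplified sketch of an adapter block (bottleneck down/up projection with a residual connection) and per-language selection; the dimensions and language set are illustrative, not the exact setup from the paper discussed here.

    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        def __init__(self, d_model=512, bottleneck=64):
            super().__init__()
            self.down = nn.Linear(d_model, bottleneck)   # down projection
            self.up = nn.Linear(bottleneck, d_model)     # up projection

        def forward(self, x):
            return x + self.up(torch.relu(self.down(x)))  # residual around the bottleneck

    # One adapter per language, plugged into an otherwise shared (often frozen) layer.
    adapters = nn.ModuleDict({lang: Adapter() for lang in ["de", "zh", "es"]})

    hidden = torch.randn(2, 7, 512)       # output of a shared transformer layer
    out = adapters["zh"](hidden)          # conditionally activated by the input language
    print(out.shape)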
0:49:27.307 --> 0:49:34.173
So this was first proposed by some folks
at Google.
0:49:34.173 --> 0:49:36.690
Does this scale well?
0:49:39.619 --> 0:49:56.621
Yes, exactly. So this is a translation-pair-specific
kind of adapter, and this is not going to scale
0:49:56.621 --> 0:49:57.672
well.
0:49:58.959 --> 0:50:13.676
So this also brought people to try some
simpler architectures.
0:50:16.196 --> 0:50:22.788
Yeah, this is also an alternative, in this
case called monolingual adapters.
0:50:24.184 --> 0:50:32.097
Any of these adapters so again have this low
resource.
0:50:32.097 --> 0:50:42.025
The zero line is bilingual baseline, but the
lines are interpolated.
0:50:43.783 --> 0:50:48.767
The red one is the original
multilingual model.
0:50:49.929 --> 0:50:57.582
And if we put the adapters in, like a basic
original adapter, that gives the blue line.
0:50:58.078 --> 0:51:08.582
You see that it's gaining performance for the
high-resource languages.
0:51:08.582 --> 0:51:16.086
If they even scale a lot, this further increases.
0:51:16.556 --> 0:51:22.770
So this is also a side kind of this.
0:51:23.103 --> 0:51:27.807
From this side it shows that it's really a capacity
bottleneck.
0:51:28.488 --> 0:51:30.590
Like If You Eleanor.
0:51:31.151 --> 0:51:34.313
Resource they regain their performance.
0:51:38.959 --> 0:51:50.514
For smaller languages, but it's just.
0:51:50.770 --> 0:52:03.258
I think in the original multilingual model, the smaller
languages weren't constrained by capacity.
0:52:05.445 --> 0:52:13.412
So guess for the smaller languages, the difficulty
is more the data rather than the model capacity.
0:52:13.573 --> 0:52:26.597
So in general you always want to have more
or less data matching your model capacity.
0:52:27.647 --> 0:52:33.255
Yeah, here I think the bigger challenge for
low-resource languages was the data.
0:52:34.874 --> 0:52:39.397
You also mention it a little bit.
0:52:39.397 --> 0:52:46.979
Are these adapters per language, or how many
adapters do we need?
0:52:47.267 --> 0:52:55.378
And do we have to design them differently
so that we learn to share more like a language
0:52:55.378 --> 0:52:56.107
family?
0:52:56.576 --> 0:53:15.680
So one downside of the adapters we talked about
is that basically there is no way to share across languages.
0:53:16.516 --> 0:53:31.391
So then a recent kind of additional approach
for this language-specific capacity is the so-
0:53:31.391 --> 0:53:36.124
called routing, where the sharing is learned.
0:53:36.256 --> 0:53:42.438
Basically, we have these language specific
components.
0:53:42.438 --> 0:53:45.923
We also have a shared adapter.
0:53:45.923 --> 0:53:52.574
The model should learn which one to use. So in this case, maybe
we could imagine for the lower resource case
0:53:52.574 --> 0:53:54.027
that we just talked about.
0:53:54.094 --> 0:54:04.838
it makes sense to go to the shared part, because there's not much
that is language-specific anyway; then it's
0:54:04.838 --> 0:54:10.270
better to make use of similarity with other languages.
0:54:11.111 --> 0:54:30.493
So this architecture is more data-driven, instead
of us specifying the sharing prior to training.
0:54:31.871 --> 0:54:33.998
So how do we learn this?
0:54:35.095 --> 0:54:49.286
Basically, in terms of the mask, we want to
basically have a binary value that routes either
0:54:49.286 --> 0:54:50.548
to the language-specific or to the shared component.
0:54:51.311 --> 0:54:56.501
But how do we get a valued zero or one mean
we can?
0:54:56.501 --> 0:54:58.498
We can do a sigmoid.
0:54:58.999 --> 0:55:13.376
However, one thing is we don't want to get
stuck in the middle, so we don't want values around 0.5.
0:55:14.434 --> 0:55:28.830
That is also bad because it is not going to
be the same at training and test time, by the way.
0:55:31.151 --> 0:55:50.483
So here the question is how do we basically force
the model to always go to one side or the other, prior to the activation.
0:55:54.894 --> 0:56:02.463
Found it interesting because it sounds like
a trick for me.
0:56:02.463 --> 0:56:05.491
This approach has been.
0:56:06.026 --> 0:56:15.844
So what they do is, prior to going through
this activation, they add some Gaussian noise.
0:56:17.257 --> 0:56:31.610
If there is always noise prior to activation
then the model will be encouraged to preserve
0:56:31.610 --> 0:56:34.291
the information by pushing the pre-activation values far from zero.
0:56:36.356 --> 0:56:44.067
This was a very interesting thing that I found out
while preparing this, so I wanted to share it
0:56:44.067 --> 0:56:44.410
as.
0:56:44.544 --> 0:56:48.937
So basically you can create a binary gate
with this technique.
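NOTE
A sketch of the trick described above, with an assumed noise scale: Gaussian noise is added to the gate logit before the sigmoid during training, which pushes the model to keep logits far from zero so the gate saturates; at test time the gate can simply be thresholded.

    import torch

    def gate(logit, training=True, noise_std=1.0):
        if training:
            # Noise before the activation encourages logits far away from 0,
            # so the sigmoid output ends up close to 0 or 1.
            logit = logit + noise_std * torch.randn_like(logit)
            return torch.sigmoid(logit)
        # Test time: hard binary decision, consistent with the saturated training gate.
        return (logit > 0).float()

    g = gate(torch.tensor([3.2, -4.1]))   # values close to 1 and 0 for large-magnitude logits
    # Routed output (conceptually): out = g * language_specific(x) + (1 - g) * shared(x)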
0:56:50.390 --> 0:57:01.668
And then you add this language-specific routing.
Here they also have a parameter that can control how
0:57:01.668 --> 0:57:07.790
much is shared and how much is language specific.
0:57:07.727 --> 0:57:16.374
Here the results for the routing are
the red and orange lines.
0:57:16.576 --> 0:57:22.752
So you can see that for many-to-many and many-
to-one there are in both cases quite some gains.
0:57:23.063 --> 0:57:30.717
So that is the overall picture, and I just find
the idea of the routing quite interesting.
0:57:30.991 --> 0:57:32.363
And UM.
0:57:32.212 --> 0:57:38.348
It's also getting increasingly
used, as there are the so-called mixture-of-
0:57:38.348 --> 0:57:39.431
expert models.
0:57:39.499 --> 0:57:51.801
The model learns where to route the input,
so the experts are conditionally activated depending on
0:57:51.801 --> 0:57:53.074
the input.
0:57:53.213 --> 0:57:59.089
But this is not really something specific
to multilinguality, so I won't talk too much
0:57:59.089 --> 0:57:59.567
about.
0:58:00.620 --> 0:58:02.115
No.
0:58:01.761 --> 0:58:09.640
The takeaways from this part are, first, that we talked about
the issue of the capacity bottleneck.
0:58:10.570 --> 0:58:19.808
Where we can partly compensate by adapters
or adding language specific capacity, there's
0:58:19.808 --> 0:58:23.026
the idea of negative transfer.
0:58:24.844 --> 0:58:35.915
When we add any additional capacity, how can
we improve the knowledge sharing?
0:58:38.318 --> 0:58:46.662
Also, for these one-to-many directions that
seem to be hopeless for multilinguality, can
0:58:46.662 --> 0:58:47.881
we actually make multilinguality help?
0:58:49.129 --> 0:58:52.171
Yeah, these are all open things still in the
area.
0:58:53.673 --> 0:59:04.030
Now next part, I'm going to talk about some
data challenges for multilingual models.
0:59:04.030 --> 0:59:07.662
We talk about multilingual models,
0:59:08.488 --> 0:59:14.967
But there are these lower resource languages
that don't have well curated parallel data.
0:59:16.216 --> 0:59:27.539
As an alternative, people resort to crawled data
from the Internet, but there's a lot of noise.
0:59:27.927 --> 0:59:36.244
And in this paper last year they did some
manual analyses of several popular crawled data
0:59:36.244 --> 0:59:36.811
sets.
0:59:37.437 --> 0:59:55.262
And you'll see that there are a lot of wrong
translations, non-linguistic contents, pornographic
0:59:55.262 --> 0:59:57.100
contents.
0:59:57.777 --> 1:00:04.661
So as you can imagine, it's like they say: you are what you eat.
1:00:04.661 --> 1:00:20.116
If you use this kind of data to train a model,
you can imagine what comes out. So there are also many techniques
1:00:20.116 --> 1:00:28.819
for filtering these noisy data
sets.
1:00:29.809 --> 1:00:36.982
So to filter these out we can use an additional
classifier that is basically trained to classify
1:00:36.982 --> 1:00:43.496
which language the sentences are in, and then kick out
all the sentences with the wrong language.
1:00:45.105 --> 1:00:49.331
Another thing is the length ratio.
1:00:49.331 --> 1:01:00.200
Basically, the assumption there is that if
two sentences are translations of each other, their lengths should be roughly comparable.
1:01:01.901 --> 1:01:08.718
So often people use maybe a ratio of three
and then it eliminates the rest.
1:01:09.909 --> 1:01:20.187
Also, the other idea maybe similar to the
language classifier is basically to have an
1:01:20.187 --> 1:01:24.540
allowed character set per language.
1:01:24.540 --> 1:01:28.289
So if you're filtering, say, German data and you see,
1:01:28.568 --> 1:01:34.622
I don't know, Cyrillic script or Arabic script,
then it's maybe a good idea to remove them.
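NOTE
A sketch of the basic filters just mentioned; `detect_language` stands in for any language-ID classifier (it is not a real library call), and the ratio threshold of 3 follows the rule of thumb above. A character-set check per language could be added in the same style.

    def keep_pair(src, tgt, src_lang, tgt_lang, detect_language, max_ratio=3.0):
        # 1) Language-ID filter: both sides must be in the expected language.
        if detect_language(src) != src_lang or detect_language(tgt) != tgt_lang:
            return False
        # 2) Length-ratio filter: translations should have roughly comparable length.
        ls, lt = max(len(src.split()), 1), max(len(tgt.split()), 1)
        if max(ls, lt) / min(ls, lt) > max_ratio:
            return False
        return True

    # Example with a dummy detector (for illustration only):
    ok = keep_pair("Guten Morgen", "Good morning", "de", "en",
                   detect_language=lambda s: "de" if "Guten" in s else "en")
    print(ok)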
1:01:35.775 --> 1:01:43.123
This is not all; there are many other ideas,
using some pre-trained neural networks to compare
1:01:43.123 --> 1:01:50.629
the representations, but this is just to give you an
idea of what the basic filtering techniques are.
1:01:50.991 --> 1:01:53.458
Filtering is quite important.
1:01:53.458 --> 1:02:02.465
We have seen in our experience that if you
do these thoroughly there is a real difference.
1:02:03.883 --> 1:02:17.814
So after all, even if we do web crawling,
there is still a bit of data scarcity problem.
1:02:18.118 --> 1:02:30.760
So there are many bad things that can happen
when there's too little training data.
1:02:30.760 --> 1:02:35.425
The first is low performances.
1:02:35.735 --> 1:02:55.562
So they did it on many English system index
languages, all together with here means: So
1:02:55.562 --> 1:03:04.079
we really need to get into that area of a lot of
data in order to get the ideal performance.
1:03:04.884 --> 1:03:20.639
There are also many horrible things that can
happen in general when you train a model across
1:03:20.639 --> 1:03:24.874
different training runs.
1:03:26.946 --> 1:03:36.733
So one solution to tackle this problem, the
data scarcity problem, is by fine tuning some
1:03:36.733 --> 1:03:38.146
pre-trained model.
1:03:38.979 --> 1:03:46.245
And basically the idea is you've got the pre-trained
model that can already do translation.
1:03:46.846 --> 1:03:54.214
Then you fine-tune it on your own training data
and you end up with a more specialized model.
1:03:55.155 --> 1:03:59.369
So why does pretraining help?
1:03:59.369 --> 1:04:11.448
One argument is that if you do pretraining
then the model has seen much more data and
1:04:11.448 --> 1:04:12.713
learned.
1:04:13.313 --> 1:04:19.135
Say more generalizable representations that
can help more downstream tasks.
1:04:19.719 --> 1:04:28.063
So in this case we are basically trying to
make use of the more meaningful and generalizable
1:04:28.063 --> 1:04:29.499
representation.
1:04:30.490 --> 1:04:45.103
So for machine translation there are several
open source models out there that can handle
1:04:45.103 --> 1:04:46.889
languages.
1:04:48.188 --> 1:04:49.912
Two hundred model.
1:04:49.912 --> 1:04:53.452
They also cover two hundred languages.
1:04:53.452 --> 1:04:57.628
That means that's quite a lot of translation.
1:04:57.978 --> 1:05:06.218
However, one thing to remember is that these
models are more like a, how do you call it,
1:05:06.146 --> 1:05:12.812
jack of all trades, master of none, in the
sense that they are very good in coverage,
1:05:12.812 --> 1:05:20.498
but if you look at specific translation directions
they might be not as good as dedicated models.
1:05:21.521 --> 1:05:34.170
So here I'm going to have some results by
comparing random initialization versus the
1:05:34.170 --> 1:05:36.104
first thing.
1:05:36.396 --> 1:05:46.420
The third line is the result of basically
finding a pre-train model that is one of the
1:05:46.420 --> 1:05:47.342
family.
1:05:47.947 --> 1:05:51.822
So in this case you could see the.
1:05:51.831 --> 1:05:58.374
If we just look at the second line, that is
the pre-trained model out of the box, you see
1:05:58.374 --> 1:06:04.842
that if we just use it out of the box, the
performance everywhere isn't super great as
1:06:04.842 --> 1:06:06.180
dedicated models.
1:06:07.867 --> 1:06:21.167
But then here that ex-here means English:
So the first takeaway here is that if we do
1:06:21.167 --> 1:06:31.560
pre-train financing again when we do it into
English,.
1:06:33.433 --> 1:06:40.438
Here is that we are forgetting.
1:06:40.438 --> 1:06:50.509
When we do further training there is no data.
1:06:50.770 --> 1:07:04.865
So even if we initialize with the pre-trained model
and continue training, if we don't see a translation direction any more, we forget it.
1:07:05.345 --> 1:07:13.826
So this is bad; machine learning people termed
it catastrophic forgetting, in the sense that
1:07:13.826 --> 1:07:20.115
if you have a model that is trained to do some
task and then you train it on something else, it forgets the original task.
1:07:20.860 --> 1:07:22.487
This is also pretty bad.
1:07:24.244 --> 1:07:32.341
It is especially bad if you consider that training
data actually grows over time.
1:07:32.341 --> 1:07:35.404
It's not like you have one.
1:07:36.336 --> 1:07:46.756
So in practice we do not always train systems
from scratch; it's more like you have an
1:07:46.756 --> 1:07:54.951
existing system and later we want to expand
the translation coverage.
1:07:57.277 --> 1:08:08.932
Here the key question is: how do we continue
training from an existing system without forgetting?
1:08:09.909 --> 1:08:12.288
There are several approaches.
1:08:12.288 --> 1:08:27.945
One very simple one is to include a portion
of your previous training data, so that the model does not forget.
1:08:28.148 --> 1:08:34.333
So if you consider you have an English German
system and now you want to expand it to English
1:08:34.333 --> 1:08:34.919
French,.
1:08:36.036 --> 1:08:42.308
Like, so now you're doing English-French and English-
German; so when you train it you still include
1:08:42.308 --> 1:08:45.578
a small proportion of your previous German
data.
1:08:45.578 --> 1:08:51.117
Hopefully your model is not forgetting that
much about the previously learned German.
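NOTE
A sketch of this replay idea with an assumed 10% mixing rate: when continuing training on the new English-French data, a sampled portion of the old English-German data is mixed back in.

    import random

    def build_continued_training_data(new_pairs, old_pairs, replay_fraction=0.1):
        # Keep all new-direction data plus a small sample of the old direction,
        # so the model is less likely to forget what it already learned.
        n_replay = int(len(old_pairs) * replay_fraction)
        return new_pairs + random.sample(old_pairs, n_replay)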
1:08:53.073 --> 1:08:58.876
Another idea here is what we saw earlier.
1:08:58.876 --> 1:09:09.800
We can also add adapters and only train them
while keeping the rest of the model frozen.
1:09:10.170 --> 1:09:26.860
So this means we're going to end up with a
generic model that was not changed at all.
1:09:27.447 --> 1:09:37.972
So in this way it's also more modular and more
suitable for this incremental learning kind of setup.
1:09:38.758 --> 1:09:49.666
Right, in this part the takeaways, I guess, are
first data filtering.
1:09:49.666 --> 1:09:55.120
since Internet data is very noisy.
1:09:56.496 --> 1:10:05.061
Second, it's about fine-tuning pre-trained models
and how we can or cannot avoid catastrophic
1:10:05.061 --> 1:10:06.179
forgetting.
1:10:07.247 --> 1:10:15.866
And of course open questions would include
how can we do incremental learning with these
1:10:15.866 --> 1:10:19.836
multilingual machine translation models?
1:10:20.860 --> 1:10:31.840
So with this in mind I would like to briefly
cover several engineering challenges when we
1:10:31.840 --> 1:10:43.031
talk about scaling up. Yeah, earlier we also briefly talked
about how multilingual means sometimes you have
1:10:43.031 --> 1:10:51.384
to scale up, you have to make your models bigger
just to have that capacity to deal with.
1:10:52.472 --> 1:10:59.262
This means the model sizes are getting bigger
and sometimes having one single GPU is not enough
1:10:59.262 --> 1:11:00.073
to handle.
1:11:00.400 --> 1:11:08.914
Here I wanted to introduce ideas of going parallel
and scaling up.
1:11:08.914 --> 1:11:12.843
The first is so-called data parallelism.
1:11:14.434 --> 1:11:18.859
I don't know if you also had this in other,
like, machine-learning-related courses.
1:11:20.220 --> 1:11:30.639
Okay, so the idea of data parallel is basically
we train in parallel.
1:11:30.790 --> 1:11:35.852
We put our model onto several GPUs.
1:11:35.852 --> 1:11:47.131
We send the same model to each of them, and then when
we get the training data we split it.
1:11:48.108 --> 1:11:54.594
So on each of these GPUs we are doing the
forward and backward pass in parallel.
1:11:55.355 --> 1:12:07.779
Then after we get the gradients, all these GPUs
will be synchronized and the gradients will
1:12:07.779 --> 1:12:09.783
be aggregated.
1:12:11.691 --> 1:12:27.127
We are having a bigger batch size in effect,
so this would be much faster than, for example,
1:12:27.127 --> 1:12:31.277
doing all these smaller batches sequentially.
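NOTE
A single-process sketch of what data parallelism computes: the batch is split into shards, gradients are computed per shard and then averaged, which matches what one large batch would give. Real setups distribute the shards across GPUs (for example with torch.nn.parallel.DistributedDataParallel); this only illustrates the arithmetic.

    import torch

    model = torch.nn.Linear(10, 1)
    batch_x, batch_y = torch.randn(32, 10), torch.randn(32, 1)

    grads = None
    for x, y in zip(batch_x.chunk(4), batch_y.chunk(4)):   # pretend each shard sits on its own GPU
        loss = torch.nn.functional.mse_loss(model(x), y)
        g = torch.autograd.grad(loss, model.parameters())  # per-replica backward pass
        grads = g if grads is None else [a + b for a, b in zip(grads, g)]

    avg_grads = [g / 4 for g in grads]   # the synchronization / aggregation step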
1:12:32.772 --> 1:12:45.252
Model parallelism, on the other hand, is for when your model itself is too big to
fit onto a single GPU, so you cannot just replicate
1:12:45.252 --> 1:12:46.084
it like this.
1:12:46.486 --> 1:12:51.958
And honestly, the model itself, unless you're
going for those.
1:12:51.891 --> 1:12:55.500
huge models the industry makes these days:
1:12:55.500 --> 1:13:03.233
I've never run into a situation where the
single model itself does not fit onto one GPU
1:13:03.233 --> 1:13:03.748
here.
1:13:03.748 --> 1:13:08.474
Realistically, it's more about what is memory-
consuming.
1:13:08.528 --> 1:13:14.871
It is more the backward pass and the optimizer
states that need to be stored.
1:13:15.555 --> 1:13:22.193
So but still there are people training gigantic
models where they have to go model parallel.
1:13:22.602 --> 1:13:35.955
This means you have a model consisting of
all those orange parts, but it doesn't fit on one GPU, so you
1:13:35.955 --> 1:13:40.714
split off the next several layers onto another GPU.
1:13:41.581 --> 1:13:51.787
So this means when you do the forward pass
you have to wait for one part to finish before doing the next.
1:13:52.532 --> 1:14:11.193
And this kind of implementation is sometimes
a bit architecture-specific.
1:14:12.172 --> 1:14:17.177
Right, so there's one more thing when scaling
up.
1:14:17.177 --> 1:14:19.179
that I wanted to mention.
1:14:20.080 --> 1:14:25.687
We also talked about it briefly earlier.
1:14:25.687 --> 1:14:34.030
We said that when we go multilingual we need
a vocabulary that covers all the languages.
1:14:34.614 --> 1:14:40.867
And I can give you some numbers.
1:14:40.867 --> 1:14:53.575
Most of the pre-trained multilingual models here
use a vocabulary.
1:14:53.933 --> 1:14:58.454
Normally each vector is.
1:14:58.454 --> 1:15:10.751
This means just the word embedding table alone
has vocabulary-size times embedding-dimension parameters.
1:15:11.011 --> 1:15:18.620
This means just for the embedding table alone
it's already taking many millions of parameters of the model.
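NOTE
The exact numbers were lost in the transcript; as an illustration with assumed values (a 250k multilingual vocabulary and 1024-dimensional embeddings, roughly the scale of common pre-trained multilingual models), the embedding table alone is already huge.

    vocab_size = 250_000   # assumed multilingual vocabulary size
    d_model = 1_024        # assumed embedding dimension

    embedding_params = vocab_size * d_model
    print(f"{embedding_params / 1e6:.0f}M parameters just for the embedding table")
    # -> 256M parameters, before counting any transformer layer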
1:15:19.859 --> 1:15:28.187
And this is often one of the largest parts
of the model.
1:15:28.187 --> 1:15:31.292
This also comes with.
1:15:31.651 --> 1:15:43.891
So one question is how can we efficiently
represent a multilingual vocabulary?
1:15:43.891 --> 1:15:49.003
Are there better ways than just?
1:15:50.750 --> 1:16:00.526
There are many ideas out there that people tried, maybe
not all targeted at multilinguality, but I think they are relevant.
1:16:00.840 --> 1:16:03.635
So one is byte-level representation.
1:16:03.743 --> 1:16:11.973
So the idea there is that the data we train with
is all stored on computers, so all the
1:16:11.973 --> 1:16:15.579
characters must be representable as bytes.
1:16:15.579 --> 1:16:23.716
So the idea is then to not use subwords, not
use characters, but use bytes instead.
1:16:25.905 --> 1:16:27.693
Do you see some downsides?
1:16:31.791 --> 1:16:38.245
There are some languages that are easier to
represent than others.
1:16:38.245 --> 1:16:40.556
That's definitely true.
1:16:41.081 --> 1:16:44.981
So if you have a sentence normally of five
words,.
1:16:46.246 --> 1:16:59.899
You think about if we split it into characters,
how many characters we have, and each character
1:16:59.899 --> 1:17:04.166
would then be how many bytes.
1:17:04.424 --> 1:17:15.749
And then it's more to model, it's more for
the model to learn, and it's also a bigger
1:17:15.749 --> 1:17:19.831
sequence to give to the model.
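NOTE
A small illustration of the sequence-length blow-up discussed here; the example sentences are arbitrary.

    sentence = "Das ist ein kurzer Satz"      # 5 words
    print(len(sentence.split()), "words")     # 5
    print(len(sentence), "characters")        # 23
    print(len(sentence.encode("utf-8")), "bytes")  # 23 here, since it's all ASCII

    zh = "你好世界"
    print(len(zh), "characters vs", len(zh.encode("utf-8")), "bytes")  # 4 vs 12: non-Latin scripts need 2-4 bytes per character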
1:17:20.260 --> 1:17:22.038
Yeah.
1:17:21.941 --> 1:17:31.232
Visual representation is also quite interesting,
so some people argued that we don't want to
1:17:31.232 --> 1:17:35.428
have a fixed discrete vocabulary anymore.
1:17:35.428 --> 1:17:41.921
Instead, we want to do it like OCR, like reading
them as images.
1:17:42.942 --> 1:17:54.016
We'll look at one example of this next. Then
another idea is whether you can distill the
1:17:54.016 --> 1:18:03.966
vocabulary as in learning some more compact
representation,.
1:18:04.284 --> 1:18:12.554
But next I wanted to show you an example of
pixel inputs for multilingual machine translation.
1:18:12.852 --> 1:18:29.757
If you look at the picture, all the characters
that are marked with red are actually not what they look like.
1:18:32.772 --> 1:18:48.876
They are actually from a different script.
If you give this to the model and let it do the subword tokenization,
1:18:52.852 --> 1:19:04.373
You would get maybe mostly characters out
of it because I guess in the pre existing vocabulary
1:19:04.373 --> 1:19:07.768
there won't be subwords mixing a Latin H with these other characters.
1:19:07.707 --> 1:19:16.737
So you'll get characters out of it, which
means it's probably going to be more difficult
1:19:16.737 --> 1:19:18.259
for the model.
1:19:20.140 --> 1:19:28.502
Yeah, so the motivation for pixel inputs is
that there is more sharing across languages.
1:19:30.010 --> 1:19:37.773
This figure basically illustrates an embedding table
for subwords, saying that if you have sentences
1:19:37.773 --> 1:19:45.705
in Latin scripts like French and English,
then it's going to take certain proportions
1:19:45.705 --> 1:19:48.152
of this big embedding table.
1:19:48.328 --> 1:19:56.854
While for Arabic and Chinese it's yet again
another part of the table,
1:19:56.796 --> 1:20:09.037
one that is not joined with the previous one, which is bad if
we want to have shared representations for
1:20:09.037 --> 1:20:11.992
different languages.
1:20:12.692 --> 1:20:18.531
On the other hand, if we're going with pixels,
there's definitely more sharing.
1:20:22.362 --> 1:20:30.911
There's a difference though to a standard
kind of normal machine translation pipeline.
1:20:32.252 --> 1:20:47.581
If you have this image, then how do we go with
images into a translation model?
1:20:50.690 --> 1:20:58.684
We still have to tokenize it somehow, so in
this case they do an overlapping sliding window.
1:20:59.259 --> 1:21:13.636
Since it's more visual, they use some kind
of convolution blocks before going into the
1:21:13.636 --> 1:21:14.730
transformer.
1:21:15.035 --> 1:21:25.514
So here I wanted to show that if you go with
these more specialized architectures we can work with
1:21:25.514 --> 1:21:27.829
pixels, and that's possible.
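NOTE
A sketch of the tokenization-free idea with assumed window sizes: the rendered sentence image is cut into overlapping horizontal windows, which play the role of tokens before a small convolutional stem and the transformer.

    import numpy as np

    def sliding_windows(image, window=16, stride=8):
        # image: (height, width) array of a rendered sentence line.
        # Overlapping horizontal slices act as the "tokens".
        h, w = image.shape
        return [image[:, start:start + window]
                for start in range(0, w - window + 1, stride)]

    rendered = np.random.rand(16, 128)      # stand-in for a rendered text line
    patches = sliding_windows(rendered)
    print(len(patches), patches[0].shape)   # 15 patches of shape (16, 16)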
1:21:30.050 --> 1:21:31.310
There's also one downside.
1:21:31.431 --> 1:21:51.380
If we go with pixels and represent sentences like this,
what are the challenges?
1:21:52.993 --> 1:22:00.001
Exactly, so as the authors here are also
pointing out for their experiments:
1:22:01.061 --> 1:22:08.596
they only consider one target language,
and also, on their target side,
1:22:08.596 --> 1:22:10.643
It's not pixel based.
1:22:11.131 --> 1:22:31.033
So these are definitely, in my opinion, very
interesting steps towards more shared representations.
1:22:31.831 --> 1:22:40.574
Yeah, so with this kind of out-of-the-box
approach, I just want to summarize today's lecture.
1:22:41.962 --> 1:22:53.158
First, I think we saw why multilinguality is cool,
why there are several open challenges out there
1:22:53.158 --> 1:22:53.896
that are worth working on.
1:22:55.355 --> 1:23:03.601
We also saw several approaches for how
to realize and implement a multilingual machine translation
1:23:03.601 --> 1:23:11.058
system, and yeah, lastly, we've seen quite
some open challenges on what is unsolved.
1:23:11.691 --> 1:23:22.403
Yeah, so with this I want to thank you for being
here today, and I'm around if you want to ask anything.
1:23:26.106 --> 1:23:29.727
If you have questions, we can also chat for
a moment.