WEBVTT
0:00:01.921 --> 0:00:16.424
Hey, welcome to today's lecture. What we want
to look at today is how we can make neural machine translation more efficient.
0:00:16.796 --> 0:00:26.458
So until now we have built this global system,
mostly the encoder and the decoder, and we haven't
0:00:26.458 --> 0:00:29.714
really thought about how much time and computation it needs.
0:00:30.170 --> 0:00:42.684
And what we, for example, know is yeah, you
can make the systems bigger in different ways.
0:00:42.684 --> 0:00:47.084
We can make them deeper or we can make them wider.
0:00:47.407 --> 0:00:56.331
And if we have at least enough data, that typically
helps to make the performance better.
0:00:56.576 --> 0:01:00.620
But of course that leads to the problem that we need
more resources.
0:01:00.620 --> 0:01:06.587
That is a problem at universities where we
have typically limited computation capacities.
0:01:06.587 --> 0:01:11.757
So at some point you have such big models
that you cannot train them anymore.
0:01:13.033 --> 0:01:23.792
And also for companies it is of course important
what it costs to generate a translation,
0:01:23.792 --> 0:01:26.984
just by power consumption.
0:01:27.667 --> 0:01:35.386
So yeah, there's different reasons why you
want to do efficient machine translation.
0:01:36.436 --> 0:01:48.338
One thing is that there are different ways of
how you can improve your machine translation
0:01:48.338 --> 0:01:50.527
system, as we have seen before.
0:01:50.670 --> 0:01:55.694
There can be different types of data: we looked
into data crawling, monolingual data, and so on.
0:01:55.875 --> 0:01:59.024
All this data, and the aim is always to get more data.
0:01:59.099 --> 0:02:05.735
Of course, we are not just purely interested
in having more data, but the idea why we want
0:02:05.735 --> 0:02:12.299
to have more data is that more data also means
that we have better quality because mostly
0:02:12.299 --> 0:02:17.550
we are interested in increasing the quality
of the machine translation.
0:02:18.838 --> 0:02:24.892
But there's also other ways of how you can
improve the quality of a machine translation.
0:02:25.325 --> 0:02:36.450
And what is, of course, that is where most
research is focusing on.
0:02:36.450 --> 0:02:44.467
That means we want to build better algorithms.
0:02:44.684 --> 0:02:48.199
Of course, the other things are often just as good.
0:02:48.199 --> 0:02:54.631
Sometimes it's easier to improve, so often
it's easier to just collect more data than
0:02:54.631 --> 0:02:57.473
to invent some great new algorithm.
0:02:57.473 --> 0:03:00.315
But yeah, both of them are important.
0:03:00.920 --> 0:03:09.812
But there is this third thing, especially
with neural machine translation, and that means
0:03:09.812 --> 0:03:11.590
we make a bigger model.
0:03:11.751 --> 0:03:16.510
Can be, as said, that we have more layers,
that we have wider layers.
0:03:16.510 --> 0:03:19.977
The other thing we talked a bit about is ensemble.
0:03:19.977 --> 0:03:24.532
That means we are not building only one machine
translation system, but several.
0:03:24.965 --> 0:03:27.505
And we can easily build four.
0:03:27.505 --> 0:03:32.331
What is the typical strategy to build different
systems?
0:03:32.331 --> 0:03:33.177
Remember.
0:03:35.795 --> 0:03:40.119
They should of course be a bit different from
each other.
0:03:40.119 --> 0:03:44.585
If they all predict the same then combining
them doesn't help.
0:03:44.585 --> 0:03:48.979
So what is the easiest way if you have to
build four systems?
0:03:51.711 --> 0:04:01.747
One suggestion was to just take
the best output of a single system.
0:04:02.362 --> 0:04:10.165
I mean really three different systems,
so that you can later combine them and maybe
0:04:10.165 --> 0:04:11.280
average them.
0:04:11.280 --> 0:04:16.682
Ensembles typically work by averaging
all the output probabilities.
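As a minimal sketch of that ensembling step (the model objects and their next_word_probs interface are hypothetical; only the equal-weight averaging is the point):

```python
import numpy as np

def ensemble_next_word_probs(models, src, prefix):
    """Average the next-word distributions of several models over a shared
    vocabulary; the per-model interface here is assumed for illustration."""
    probs = [m.next_word_probs(src, prefix) for m in models]
    return np.mean(probs, axis=0)  # equal-weight average of the distributions
```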
0:04:19.439 --> 0:04:24.227
The idea is to think about neural networks.
0:04:24.227 --> 0:04:29.342
There's one parameter which can easily adjust.
0:04:29.342 --> 0:04:36.525
That parameter is the random seed; that's exactly the easiest way
to get three different systems.
0:04:37.017 --> 0:04:43.119
They have the same architecture, so all the
hyperparameters are the same, but the random initializations are
0:04:43.119 --> 0:04:43.891
different.
0:04:43.891 --> 0:04:46.556
They will have different predictions.
0:04:48.228 --> 0:04:52.572
So, of course, bigger amounts.
0:04:52.572 --> 0:05:05.325
Some of these are a bit the easiest way of
improving your quality because you don't really
0:05:05.325 --> 0:05:08.268
have to do anything.
0:05:08.588 --> 0:05:12.588
There are limits to that: bigger models only
get better
0:05:12.588 --> 0:05:19.132
if you have enough training data. You can't
just add a hundred layers, and it will not work
0:05:19.132 --> 0:05:24.877
on very small data, but with a reasonable amount
of data that is the easiest thing.
0:05:25.305 --> 0:05:33.726
However, there are challenges with making
better models, bigger models, and that is the
0:05:33.726 --> 0:05:34.970
computation.
0:05:35.175 --> 0:05:44.482
So, of course, if you have a bigger model
that can mean that you have longer running
0:05:44.482 --> 0:05:49.518
times, if you have models, you have to times.
0:05:51.171 --> 0:05:56.685
Normally you cannot parallelize across the different
layers because the input to one layer is always
0:05:56.685 --> 0:06:02.442
the output of the previous layer, so you propagate
that so it will also increase your runtime.
0:06:02.822 --> 0:06:10.720
Then you have to store all your models in
memory.
0:06:10.720 --> 0:06:20.927
If you have double the weights, you need double the memory.
It is also more difficult to then do backpropagation.
0:06:20.927 --> 0:06:27.680
You have to store in between the activations,
so there's not only do you increase the model
0:06:27.680 --> 0:06:31.865
in your memory, but also all these other variables
that.
0:06:34.414 --> 0:06:36.734
And so in general it is more expensive.
0:06:37.137 --> 0:06:54.208
And therefore there are good reasons to look
into whether we can make these models more efficient.
0:06:54.134 --> 0:07:00.982
So you can also view it this way: okay, I have
one GPU and one day of training time,
0:07:00.982 --> 0:07:01.274
or.
0:07:01.221 --> 0:07:07.535
Forty thousand euros and then what is the
best machine translation system I can get within
0:07:07.535 --> 0:07:08.437
this budget.
0:07:08.969 --> 0:07:19.085
And then, of course, you can make the models
bigger, but then you have to train them shorter,
0:07:19.085 --> 0:07:24.251
and then we can make more efficient algorithms.
0:07:25.925 --> 0:07:31.699
If you think about efficiency, there's a bit
different scenarios.
0:07:32.312 --> 0:07:43.635
So if you're more of coming from the research
community, what you'll be doing is building
0:07:43.635 --> 0:07:47.913
a lot of models in your research.
0:07:48.088 --> 0:07:58.645
So you're having your test set of maybe sentences,
calculating the blue score, then another model.
0:07:58.818 --> 0:08:08.911
So what that means is that typically you're training
on millions of sentences, so your training time
0:08:08.911 --> 0:08:14.944
is long, maybe a day, but maybe in other cases
a week.
0:08:15.135 --> 0:08:22.860
The testing is not really the cost efficient,
but the training is very costly.
0:08:23.443 --> 0:08:37.830
If you are more thinking of building models
for application, the scenario is quite different.
0:08:38.038 --> 0:08:46.603
And then you keep it running, and maybe thousands
of customers are using it in translating.
0:08:46.603 --> 0:08:47.720
So in that.
0:08:48.168 --> 0:08:59.577
And we will see that it is not always the
same type of challenge: you can parallelize some
0:08:59.577 --> 0:09:07.096
things in training which you cannot parallelize
in testing.
0:09:07.347 --> 0:09:14.124
For example, in training you have to do back
propagation, so you have to store the activations.
0:09:14.394 --> 0:09:23.901
We briefly discussed this before and will do
it in more detail today: in
0:09:23.901 --> 0:09:24.994
training.
0:09:25.265 --> 0:09:36.100
you know the target and you can process
everything in parallel, while in testing
0:09:36.356 --> 0:09:46.741
you can only do one word at a time, and
so you can parallelize less.
0:09:46.741 --> 0:09:50.530
Therefore, it's important.
0:09:52.712 --> 0:09:55.347
There is a specific shared task on this.
0:09:55.347 --> 0:10:03.157
For example, it's the efficiency task where
it's about making things as efficient.
0:10:03.123 --> 0:10:09.230
as possible, and they can look at different
resources.
0:10:09.230 --> 0:10:14.207
So how much GPU runtime do you need?
0:10:14.454 --> 0:10:19.366
See how much memory you need or you can have
a fixed memory budget and then have to build
0:10:19.366 --> 0:10:20.294
the best system.
0:10:20.500 --> 0:10:29.010
And here is a bit like an example of that,
so there's three teams from Edinburgh from
0:10:29.010 --> 0:10:30.989
and they submitted.
0:10:31.131 --> 0:10:36.278
So then, of course, if you want to know the
most efficient system you have to do a bit
0:10:36.278 --> 0:10:36.515
of.
0:10:36.776 --> 0:10:44.656
You want to have a better quality or more
runtime and there's not the one solution.
0:10:44.656 --> 0:10:46.720
You can improve your.
0:10:46.946 --> 0:10:49.662
And that you see that there are different
systems.
0:10:49.909 --> 0:11:06.051
Here you see how many words you can translate per second,
and you want to be as far to the top right as
0:11:06.051 --> 0:11:07.824
possible.
0:11:08.068 --> 0:11:08.889
And you see here a bit.
0:11:08.889 --> 0:11:09.984
This is a little bit different.
0:11:11.051 --> 0:11:27.717
You want to be there on the top right corner
and you can get a score of something between
0:11:27.717 --> 0:11:29.014
words.
0:11:30.250 --> 0:11:34.161
At two hundred and fifty thousand words per second,
you'll get a COMET score of around zero point three.
0:11:34.834 --> 0:11:41.243
There is, of course, any bit of a decision,
but the question is, like how far can you again?
0:11:41.243 --> 0:11:47.789
All these points on this line would
be winners, because they are somehow most efficient
0:11:47.789 --> 0:11:53.922
in a way that there's no system which achieves
the same quality with less computational cost.
0:11:57.657 --> 0:12:04.131
So there's the one question of which resources
are you interested.
0:12:04.131 --> 0:12:07.416
Are you running it on CPU or GPU?
0:12:07.416 --> 0:12:11.668
There are different ways of parallelizing things.
0:12:14.654 --> 0:12:20.777
Another dimension is how you process your
data.
0:12:20.777 --> 0:12:27.154
There's really the best processing and streaming.
0:12:27.647 --> 0:12:34.672
So in batch processing you have the whole
document available so you can translate all
0:12:34.672 --> 0:12:39.981
sentences in parallel and then you're interested
in throughput.
0:12:40.000 --> 0:12:43.844
But you can then process, for example, especially
in GPS.
0:12:43.844 --> 0:12:49.810
That's interesting, you're not translating
one sentence at a time, but you're translating
0:12:49.810 --> 0:12:56.108
one hundred sentences or so in parallel, so
you have one more dimension where you can paralyze
0:12:56.108 --> 0:12:57.964
and then be more efficient.
0:12:58.558 --> 0:13:14.863
On the other hand, you can, for example, sort the document:
we learned that if you do batch processing
0:13:14.863 --> 0:13:16.544
you have padding.
0:13:16.636 --> 0:13:24.636
Then, of course, it makes sense to sort the
sentences by length in order to have the minimum padding
0:13:24.636 --> 0:13:25.535
attached.
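As a minimal sketch of that length-sorted batching (function and parameter names are illustrative):

```python
def make_batches(sentences, batch_size):
    """Sort sentences by length before batching so each batch contains
    similarly long sentences and needs minimal padding."""
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    for start in range(0, len(order), batch_size):
        yield [sentences[i] for i in order[start:start + batch_size]]
```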
0:13:27.427 --> 0:13:32.150
The other scenario is more the streaming scenario
where you do life translation.
0:13:32.512 --> 0:13:40.212
So in that case you can't wait for the whole
document to pass, but you have to do.
0:13:40.520 --> 0:13:49.529
And then, for example, that's especially in
situations like speech translation, and then
0:13:49.529 --> 0:13:53.781
you're interested in things like latency.
0:13:53.781 --> 0:14:00.361
So how much do you have to wait to get the
output of a sentence?
0:14:06.566 --> 0:14:16.956
Finally, there is the thing about the implementation:
Today we're mainly looking at different algorithms,
0:14:16.956 --> 0:14:23.678
different models of how you can model them
in your machine translation system, but of
0:14:23.678 --> 0:14:29.227
course for the same algorithms there's also
different implementations.
0:14:29.489 --> 0:14:38.643
So, for example, for machine translation
there are toolkits which can be very fast.
0:14:38.638 --> 0:14:46.615
So they have coded a lot of the operations
at a very low level,
0:14:46.615 --> 0:14:49.973
directly in CUDA kernels.
0:14:50.110 --> 0:15:00.948
So the same attention network is typically
more efficient in that type of algorithm.
0:15:00.880 --> 0:15:02.474
Than in in any other.
0:15:03.323 --> 0:15:13.105
Of course, it might be other disadvantages,
so if you're a little worker or have worked
0:15:13.105 --> 0:15:15.106
in the practical.
0:15:15.255 --> 0:15:22.604
Because it's normally easier to understand,
easier to change, and so on, but there is again
0:15:22.604 --> 0:15:23.323
a train.
0:15:23.483 --> 0:15:29.440
You have to think about, do you want to include
this into my study or comparison or not?
0:15:29.440 --> 0:15:36.468
Should it be like I compare different implementations
and I also find the most efficient implementation?
0:15:36.468 --> 0:15:39.145
Or is it only about the pure algorithm?
0:15:42.742 --> 0:15:50.355
Yeah, when building these systems there is
a different trade-off to do.
0:15:50.850 --> 0:15:56.555
So one of the trade-offs is between memory
and throughput, so how many words you can generate
0:15:56.555 --> 0:15:57.299
per second.
0:15:57.557 --> 0:16:03.351
So typically you can easily increase
your throughput by increasing the batch size.
0:16:03.643 --> 0:16:06.899
So that means you are translating more sentences
in parallel.
0:16:07.107 --> 0:16:09.241
And GPUs are very good at that stuff.
0:16:09.349 --> 0:16:15.161
Whether you translate one sentence or one hundred
sentences is not the same time, but it's
0:16:15.115 --> 0:16:20.784
roughly similar, because it is this
efficient matrix multiplication, so that
0:16:20.784 --> 0:16:24.415
you can do the same operation on all sentences
parallel.
0:16:24.415 --> 0:16:30.148
So typically that means if you increase your
batch size you can do more things in parallel
0:16:30.148 --> 0:16:31.995
and you will translate more words per
0:16:31.952 --> 0:16:33.370
second.
0:16:33.653 --> 0:16:43.312
On the other hand, the disadvantage is of
course that you will need higher batch sizes and
0:16:43.312 --> 0:16:44.755
more memory.
0:16:44.965 --> 0:16:56.452
To begin with, the other problem is that you
have such big models that you can only translate
0:16:56.452 --> 0:16:59.141
with lower batch sizes.
0:16:59.119 --> 0:17:08.466
If you are running out of memory when translating,
one idea is to decrease your batch size.
0:17:13.453 --> 0:17:24.456
Then there is the trade-off between quality and throughput:
as before, larger models give you
0:17:24.456 --> 0:17:28.124
in general higher quality.
0:17:28.124 --> 0:17:31.902
The first one is always this way.
0:17:32.092 --> 0:17:38.709
Of course, a larger model does not always help;
you can have overfitting at some point, but in general it does.
0:17:43.883 --> 0:17:52.901
And with this a bit on this training and testing
thing we had before.
0:17:53.113 --> 0:17:58.455
So what are the differences between training
and testing, and between the encoder and decoder?
0:17:58.798 --> 0:18:06.992
So if we look at what I mentioned before:
at training time we have a source sentence
0:18:06.992 --> 0:18:17.183
here, and we look at how it is processed;
we won't go through the attention in detail here.
0:18:17.183 --> 0:18:21.836
That's a typical transformer.
0:18:22.162 --> 0:18:31.626
And how we can do that on a GPU is that we can
parallelize it over the whole sentence.
0:18:31.626 --> 0:18:40.422
The first thing to note is that the whole source
sentence is available; that is, of course, not true in all cases.
0:18:40.422 --> 0:18:49.184
We'll later talk about speech translation,
where we might want to translate before the sentence ends.
0:18:49.389 --> 0:18:56.172
But in the general case, you
have the full sentence you want to translate.
0:18:56.416 --> 0:19:02.053
So the important thing is we are here everything
available on the source side.
0:19:03.323 --> 0:19:13.524
And then this was one of the big advantages
that you can remember back of transformer.
0:19:13.524 --> 0:19:15.752
There are several.
0:19:16.156 --> 0:19:25.229
But the other one is now that we can calculate
the full layer.
0:19:25.645 --> 0:19:29.318
There is no dependency between this and this
state or this and this state.
0:19:29.749 --> 0:19:36.662
So we always did like here to calculate the
key value and query, and based on that you
0:19:36.662 --> 0:19:37.536
calculate.
0:19:37.937 --> 0:19:46.616
Which means we can do all these calculations
here in parallel and in parallel.
0:19:48.028 --> 0:19:55.967
And that, of course, is very efficient,
because for GPUs it's typically much better
0:19:55.967 --> 0:20:00.887
to do these things in parallel than one after
each other.
0:20:01.421 --> 0:20:10.311
And then we can also for each layer one by
one, and then we calculate here the encoder.
0:20:10.790 --> 0:20:21.921
In training now an important thing is that
for the decoder we have the full sentence available
0:20:21.921 --> 0:20:28.365
because we know this is the target we should
generate.
0:20:29.649 --> 0:20:33.526
We have models now in a different way.
0:20:33.526 --> 0:20:38.297
This hidden state is only on the previous
ones.
0:20:38.598 --> 0:20:51.887
And the first thing here depends only on this
information, so you see if you remember we
0:20:51.887 --> 0:20:56.665
had this masked self-attention.
0:20:56.896 --> 0:21:04.117
So that means, of course, we can only calculate
the decoder once the encoder is done, but that's fine.
0:21:04.444 --> 0:21:06.656
First we calculate the encoder.
0:21:06.656 --> 0:21:08.925
Then we can calculate here the decoder.
0:21:09.569 --> 0:21:25.566
But again in training we have x, y and that
is available so we can calculate everything
0:21:25.566 --> 0:21:27.929
in parallel.
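As a minimal sketch of that masked self-attention during training (plain NumPy, a single head, shapes are assumptions):

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Causal self-attention over a whole target sequence at once, as used
    with teacher forcing: every position is computed in parallel, but each
    position may only attend to itself and earlier positions.
    Q, K, V: (seq_len, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (seq_len, seq_len)
    future = np.triu(np.ones_like(scores), k=1)     # 1 above the diagonal
    scores = np.where(future == 1, -1e9, scores)    # block future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax per position
    return weights @ V
```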
0:21:28.368 --> 0:21:40.941
So the interesting thing or advantage of transformer
is in training.
0:21:40.941 --> 0:21:46.408
We can do it for the decoder.
0:21:46.866 --> 0:21:54.457
That means you will have more calculations
because you can only calculate one layer at
0:21:54.457 --> 0:22:02.310
a time, but for example the sequence length, which is
typically quite long, doesn't really matter
0:22:02.310 --> 0:22:03.270
that much.
0:22:05.665 --> 0:22:10.704
However, in testing this situation is different.
0:22:10.704 --> 0:22:13.276
In testing we only have the source sentence.
0:22:13.713 --> 0:22:20.622
So this means we start with a source sentence: we don't
know the full target sentence yet because we
0:22:20.622 --> 0:22:29.063
autoregressively generate it, so for the encoder
we have the same situation here, but not for the decoder.
0:22:29.409 --> 0:22:39.598
In this case we only have the first state and then the
second, so we cannot compute all states in
0:22:39.598 --> 0:22:40.756
parallel.
0:22:41.101 --> 0:22:51.752
And then we can only do the next step for y after
we have chosen our most probable previous word.
0:22:51.752 --> 0:22:58.643
We do greedy search or beam search, but you
cannot do it all in parallel.
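As a minimal sketch of that test-time loop (the model interface is hypothetical; only the word-by-word dependency is the point):

```python
def greedy_decode(model, src, max_len=100, eos_id=2):
    """Autoregressive greedy decoding: the encoder runs once over the full
    source, but target words can only be produced one after the other."""
    enc = model.encode(src)                         # parallel over the source
    target = []
    for _ in range(max_len):
        probs = model.next_word_probs(enc, target)  # needs all previous words
        next_word = int(probs.argmax())
        if next_word == eos_id:
            break
        target.append(next_word)
    return target
```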
0:23:03.663 --> 0:23:16.838
Yes, so if we are interested in making things
more efficient for testing, which we need, for
0:23:16.838 --> 0:23:22.363
example, in the scenario of real applications.
0:23:22.642 --> 0:23:34.286
It makes sense that we think about our architecture
and that we are currently working on attention
0:23:34.286 --> 0:23:35.933
based models.
0:23:36.096 --> 0:23:44.150
The decoder is where most of the time
is spent during testing.
0:23:44.150 --> 0:23:47.142
In training it's similar, but during testing it dominates.
0:23:47.167 --> 0:23:50.248
Not to mention beam search.
0:23:50.248 --> 0:23:59.833
It might be even more complicated, because
in beam search you have to try different hypotheses.
0:24:02.762 --> 0:24:15.140
So the question is what can you now do in
order to make your model more efficient and
0:24:15.140 --> 0:24:21.905
better in translation in these types of cases?
0:24:24.604 --> 0:24:30.178
And the one thing is to look into the
encoder-decoder trade-off.
0:24:30.690 --> 0:24:43.898
And then until now we typically assume that
the depth of the encoder and the depth of the
0:24:43.898 --> 0:24:48.154
decoder is roughly the same.
0:24:48.268 --> 0:24:55.553
So if you haven't thought about it, you just
take what is running well.
0:24:55.553 --> 0:24:57.678
You would try to do.
0:24:58.018 --> 0:25:04.148
However, we saw now that there is quite a
big difference, and the decoder runtime is a lot longer
0:25:04.148 --> 0:25:04.914
than the encoder's.
0:25:05.425 --> 0:25:14.018
The question is whether this is also the case for quality:
do we have the same issue there, that we
0:25:14.018 --> 0:25:21.887
only get good quality if both encoder and decoder
are deep? We know that making these models
0:25:21.887 --> 0:25:25.415
deeper is increasing our quality.
0:25:25.425 --> 0:25:31.920
But what we haven't talked about is whether it is
really important that we increase the depth the same
0:25:31.920 --> 0:25:32.285
way.
0:25:32.552 --> 0:25:41.815
So what we can instead do is something
like this, where you have a deep encoder and
0:25:41.815 --> 0:25:42.923
a shallow decoder.
0:25:43.163 --> 0:25:57.386
So that would mean that you, for example,
instead of having the same number of layers on the encoder
0:25:57.386 --> 0:25:59.757
and the decoder, put more layers on the encoder and fewer on the decoder.
0:26:00.080 --> 0:26:10.469
So in this case the overall depth from start
to end would be similar, and so hopefully the quality too.
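As a small illustration of that hyperparameter change (the exact layer counts here are just an assumed example, not the numbers from the slide):

```python
# Two configurations with a similar total depth: the second shifts layers
# to the encoder, whose cost is paid once per sentence, away from the
# decoder, which runs once per generated word at test time.
balanced             = {"encoder_layers": 6,  "decoder_layers": 6}
deep_enc_shallow_dec = {"encoder_layers": 10, "decoder_layers": 2}
```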
0:26:11.471 --> 0:26:21.662
But we could parallelize a lot more things here,
and what is costly in the end during decoding is
0:26:21.662 --> 0:26:22.973
the decoder.
0:26:22.973 --> 0:26:29.330
Because that still runs in an autoregressive
way; there we cannot parallelize.
0:26:31.411 --> 0:26:33.727
And that that can be analyzed.
0:26:33.727 --> 0:26:38.734
So here are some examples where people have
done this.
0:26:39.019 --> 0:26:55.710
So here it's mainly interested on the orange
things, which is auto-regressive about the
0:26:55.710 --> 0:26:57.607
speed up.
0:26:57.717 --> 0:27:15.031
You have the system, so agree is not exactly
the same, but it's similar.
0:27:15.055 --> 0:27:23.004
It's always the case if you look at speed
up.
0:27:23.004 --> 0:27:31.644
Think they put a speed of so that's the baseline.
0:27:31.771 --> 0:27:35.348
So between and times as fast.
0:27:35.348 --> 0:27:42.621
If you switch from a system to where you have
layers in the.
0:27:42.782 --> 0:27:52.309
You see that although you have slightly more
parameters, more calculations are also roughly
0:27:52.309 --> 0:28:00.283
the same, but you get a speed-up because now
during testing you can parallelize more.
0:28:02.182 --> 0:28:09.754
The other thing is that you're speeding up,
but if you look at the performance it's similar,
0:28:09.754 --> 0:28:13.500
so sometimes you improve, sometimes you lose.
0:28:13.500 --> 0:28:20.421
There's a bit of a loss for English to Romanian,
but in general the quality is very similar.
0:28:20.680 --> 0:28:30.343
So you see that you can keep a similar performance
while improving your speed, just by distributing the layers differently.
0:28:30.470 --> 0:28:34.903
And you also see the effect of the encoder layers on speed.
0:28:34.903 --> 0:28:38.136
They don't really matter that much.
0:28:38.136 --> 0:28:38.690
Most.
0:28:38.979 --> 0:28:50.319
Because if you compare the twelve-layer encoder system to
the six-layer one, you have a lower performance
0:28:50.319 --> 0:28:57.309
with six encoder layers, but the speed is
similar.
0:28:57.897 --> 0:29:02.233
And is the huge decrease maybe due
to a lack of data?
0:29:03.743 --> 0:29:11.899
Good idea, but I would say it's not the case.
0:29:11.899 --> 0:29:23.191
Romanian-English should have the same amount
of data.
0:29:24.224 --> 0:29:31.184
Maybe it's just that something in that language.
0:29:31.184 --> 0:29:40.702
If you generate Romanian maybe they need more
target dependencies.
0:29:42.882 --> 0:29:46.263
Why that is, I honestly also don't know; any
ideas?
0:29:47.887 --> 0:29:49.034
There could be yeah the.
0:29:49.889 --> 0:29:58.962
As the maybe if you go from like a movie sphere
to a hybrid sphere, you can: It's very much
0:29:58.962 --> 0:30:12.492
easier to expand the vocabulary to English,
but it must be the vocabulary.
0:30:13.333 --> 0:30:21.147
Have to check, but would assume that in this
case the system is not retrained, but it's
0:30:21.147 --> 0:30:22.391
trained with.
0:30:22.902 --> 0:30:30.213
And that's why I was assuming that they have
the same, but maybe you'll write that in this
0:30:30.213 --> 0:30:35.595
piece, for example, if they were pre-trained,
the decoder English.
0:30:36.096 --> 0:30:43.733
But I don't remember exactly if they do something
like that; that could be a good explanation.
0:30:45.325 --> 0:30:52.457
So this is one of the easiest ways to speed
up.
0:30:52.457 --> 0:31:01.443
You just change two hyperparameters and don't
have to implement anything.
0:31:02.722 --> 0:31:08.367
Of course, there's other ways of doing that.
0:31:08.367 --> 0:31:11.880
We'll look into two things.
0:31:11.880 --> 0:31:16.521
The other thing is the architecture.
0:31:16.796 --> 0:31:28.154
We are now at some of the baselines that we
are doing.
0:31:28.488 --> 0:31:39.978
However, in translation in the decoder side,
it might not be the best solution.
0:31:39.978 --> 0:31:41.845
There is no.
0:31:42.222 --> 0:31:47.130
So we can use different types of architectures
in the encoder and the decoder.
0:31:47.747 --> 0:31:52.475
And there's two ways of what you could do
different, or there's more ways.
0:31:52.912 --> 0:31:54.825
We will look into two todays.
0:31:54.825 --> 0:31:58.842
The one is average attention, which is a very
simple solution.
0:31:59.419 --> 0:32:01.464
You can do as it says.
0:32:01.464 --> 0:32:04.577
It's not really attending anymore.
0:32:04.577 --> 0:32:08.757
It's just like equal attendance to everything.
0:32:09.249 --> 0:32:23.422
And the other idea, which is currently done
in most systems which are optimized to efficiency,
0:32:23.422 --> 0:32:24.913
is we're.
0:32:25.065 --> 0:32:32.623
But on the decoder side we are then not using
transformer or self attention, but we are using
0:32:32.623 --> 0:32:39.700
a recurrent neural network, because the
disadvantage of recurrent neural networks does not matter there.
0:32:39.799 --> 0:32:48.353
And the recurrent unit is normally easier
to calculate at decoding time because it only depends on
0:32:48.353 --> 0:32:49.684
the current input and the previous state.
0:32:51.931 --> 0:33:02.190
So what is the difference during decoding,
and why is attention maybe not the most efficient choice
0:33:02.190 --> 0:33:03.841
for decoding?
0:33:04.204 --> 0:33:14.390
If we want to populate the new state, we only
have to look at the input and the previous
0:33:14.390 --> 0:33:15.649
state, so.
0:33:16.136 --> 0:33:19.029
We are more conditional here networks.
0:33:19.029 --> 0:33:19.994
We have the.
0:33:19.980 --> 0:33:31.291
Dependency to a fixed number of previous ones,
but that's rarely used for decoding.
0:33:31.291 --> 0:33:39.774
In contrast, in the transformer we have this
dependency on all previous states.
0:33:40.000 --> 0:33:52.760
So y t depends on y 1 up to y t minus one, and
that is not very efficient in this sense; I mean,
0:33:52.760 --> 0:33:56.053
it's very good for quality, because you can look at everything.
0:33:56.276 --> 0:34:03.543
However, the disadvantage is that we also
have to do all these calculations, so if we
0:34:03.543 --> 0:34:10.895
look at it more from the point of view of efficient
computation, this might not be the best.
0:34:11.471 --> 0:34:20.517
So the question is, can we change our architecture
to keep some of the advantages but make things
0:34:20.517 --> 0:34:21.994
more efficient?
0:34:24.284 --> 0:34:31.131
The one idea is what is called the average
attention, and the interesting thing is this
0:34:31.131 --> 0:34:32.610
works surprisingly well.
0:34:33.013 --> 0:34:38.917
So the only thing you're changing is in
the decoder:
0:34:38.917 --> 0:34:42.646
You're not doing attention anymore.
0:34:42.646 --> 0:34:46.790
The attention weights are all the same.
0:34:47.027 --> 0:35:00.723
So you don't calculate different weights with
query and key; you just take
0:35:00.723 --> 0:35:03.058
equal weights.
0:35:03.283 --> 0:35:07.585
So here would be one third from this, one
third from this, and one third.
0:35:09.009 --> 0:35:14.719
And while it is sufficient you can now do
precalculation and things get more efficient.
0:35:15.195 --> 0:35:18.803
So first go the formula that's maybe not directed
here.
0:35:18.979 --> 0:35:38.712
So the difference here is that your new hidden
state is the average of all the hidden states up to this position.
0:35:38.678 --> 0:35:40.844
So here would be with this.
0:35:40.844 --> 0:35:45.022
It would be one third of this plus one third
of this.
0:35:46.566 --> 0:35:57.162
But if you calculate it this way, it's not
yet being more efficient because you still
0:35:57.162 --> 0:36:01.844
have to sum over all the previous hidden states.
0:36:04.524 --> 0:36:22.932
But you can now easily speed this up by keeping
an intermediate value, which is just
0:36:22.932 --> 0:36:24.568
the running sum.
0:36:25.585 --> 0:36:30.057
If you take this as ten to one, you take this
one class this one.
0:36:30.350 --> 0:36:36.739
Because this one then was before this, and
this one was this, so in the end.
0:36:37.377 --> 0:36:49.545
So now this one is not the final one in order
to get the final one to do the average.
0:36:49.545 --> 0:36:50.111
So.
0:36:50.430 --> 0:37:00.264
But then if you do this calculation with speed
up you can do it with a fixed number of steps.
0:37:00.180 --> 0:37:11.300
Instead of the sun which depends on age, so
you only have to do calculations to calculate
0:37:11.300 --> 0:37:12.535
this one.
0:37:12.732 --> 0:37:21.718
Can you do the lakes and the lakes?
0:37:21.718 --> 0:37:32.701
For example, light bulb here now takes and.
0:37:32.993 --> 0:37:38.762
That's a very good point, and that's why this
notation in the image
0:37:38.762 --> 0:37:44.531
is not very good: this is the one with the tilde,
and the tilde is just the sum, not yet the average.
0:37:44.884 --> 0:37:57.895
So this one is just the sum of these two,
because this is just this one.
0:37:58.238 --> 0:38:08.956
So the sum of this is exactly as the sum of
these, and the sum of these is the sum of here.
0:38:08.956 --> 0:38:15.131
So you only do the sum in here, and the multiplying.
0:38:15.255 --> 0:38:22.145
So what you mainly do here, a bit more
mathematically, is:
0:38:22.145 --> 0:38:31.531
you take the one over t out of the
sum, and then you can compute the sum incrementally.
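As a minimal sketch of that incremental computation (pure NumPy; names are illustrative):

```python
import numpy as np

def average_attention_states(hidden_states):
    """Average attention: position t uses the uniform average of states
    1..t. A running sum makes every step a constant amount of work instead
    of re-summing from the start."""
    outputs, running_sum = [], np.zeros_like(hidden_states[0])
    for t, h in enumerate(hidden_states, start=1):
        running_sum = running_sum + h     # the "tilde" value: sum so far
        outputs.append(running_sum / t)   # divide by t to get the average
    return outputs
```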
0:38:36.256 --> 0:38:42.443
That maybe looks a bit weird and simple, so
we were all talking about this great attention
0:38:42.443 --> 0:38:47.882
that can focus on different parts, and the
surprising thing about this work is that
0:38:47.882 --> 0:38:53.321
in the end it might also work well without
that, just using equal weights.
0:38:53.954 --> 0:38:56.164
Mean it's not that easy.
0:38:56.376 --> 0:38:58.261
It's like sometimes this is working.
0:38:58.261 --> 0:39:00.451
There's also report weight work that well.
0:39:01.481 --> 0:39:05.848
But I think it's an interesting way and it
maybe shows that a lot of.
0:39:05.805 --> 0:39:10.624
things in the self-attention or the transformer paper
which are presented more as side remarks,
0:39:10.624 --> 0:39:15.930
these hyperparameters around it,
like that you do the layer norm in between,
0:39:15.930 --> 0:39:21.785
and that you do a feed-forward layer before, and
things like that, are also all important,
0:39:21.785 --> 0:39:25.566
and that the right set up around that is also
very important.
0:39:28.969 --> 0:39:38.598
The other thing you can do in the end is not
completely different from this one.
0:39:38.598 --> 0:39:42.521
It's just like a very different.
0:39:42.942 --> 0:39:54.338
And that is a recurrent network which also
has this type of highway connection that can
0:39:54.338 --> 0:40:01.330
ignore the recurrent unit and directly pass
the input through.
0:40:01.561 --> 0:40:10.770
It's not really adding out, but if you see
the hitting step is your input, but what you
0:40:10.770 --> 0:40:15.480
can do is somehow directly go to the output.
0:40:17.077 --> 0:40:28.390
These are the four components of the simple
recurrent unit, and the unit is motivated by GRUs
0:40:28.390 --> 0:40:33.418
and by LSTMs, which we have seen before.
0:40:33.513 --> 0:40:43.633
And gating has proven to be very good for RNNs;
it allows you to have a gate on your states.
0:40:44.164 --> 0:40:48.186
In this thing we have two gates, the reset
gate and the forget gate.
0:40:48.768 --> 0:40:57.334
So first we have the general structure which
has a cell state.
0:40:57.334 --> 0:41:01.277
Here we have the cell state.
0:41:01.361 --> 0:41:09.661
And then this goes next, and we always get
the different cell states over the times that.
0:41:10.030 --> 0:41:11.448
This is the cell state.
0:41:11.771 --> 0:41:16.518
How do we now calculate that? Just assume we
have an initial cell state here.
0:41:17.017 --> 0:41:19.670
The first thing is we're computing the forget
gate.
0:41:20.060 --> 0:41:34.774
The forget gate models whether the new cell
state should mainly depend on the previous cell state
0:41:34.774 --> 0:41:40.065
or whether it should depend on our new input
0:41:40.000 --> 0:41:41.356
that we add.
0:41:41.621 --> 0:41:42.877
How can we model that?
0:41:44.024 --> 0:41:45.599
First we were at a cocktail.
0:41:45.945 --> 0:41:52.151
The forget gate depends on the cell state at t minus one and the input.
0:41:52.151 --> 0:41:56.480
You also see here the formula.
0:41:57.057 --> 0:42:01.963
So we are multiplying both the cell state
and our input.
0:42:01.963 --> 0:42:04.890
With some weights we are getting.
0:42:05.105 --> 0:42:08.472
We are adding a bias vector and then
we are applying a sigmoid to that.
0:42:08.868 --> 0:42:13.452
So in the end we have numbers between zero
and one saying for each dimension.
0:42:13.853 --> 0:42:22.041
Like how much if it's near to zero we will
mainly use the new input.
0:42:22.041 --> 0:42:31.890
If it's near to one we will keep the old cell state
and ignore the input at this dimension.
0:42:33.313 --> 0:42:40.173
And with this motivation we can then create
here the new cell state, and here you see
0:42:40.173 --> 0:42:41.141
the formula.
0:42:41.601 --> 0:42:55.048
So you take your forget gate and multiply
it with your previous cell state.
0:42:55.048 --> 0:43:00.427
So if the gate value was around one, you keep the old state.
0:43:00.800 --> 0:43:07.405
In the other case, when the value was near zero,
it is mostly the input that you add:
0:43:07.405 --> 0:43:10.946
you're adding a transformation of the input.
0:43:11.351 --> 0:43:24.284
So if this value was maybe zero then you're
putting most of the information from inputting.
0:43:25.065 --> 0:43:26.947
Is already your element?
0:43:26.947 --> 0:43:30.561
The only question is now based on your element.
0:43:30.561 --> 0:43:32.067
What is the output?
0:43:33.253 --> 0:43:47.951
And there you have another opportunity so
you can either take the output or instead you
0:43:47.951 --> 0:43:50.957
prefer the input.
0:43:52.612 --> 0:43:58.166
So are the values also the same for the reset
gate and the forget gate?
0:43:58.166 --> 0:43:59.417
Yes, the movie.
0:44:00.900 --> 0:44:10.004
Well, exactly: the matrices are different,
so the values can be different, and that
0:44:10.004 --> 0:44:16.323
should be so, because sometimes you want to keep
different information.
0:44:16.636 --> 0:44:23.843
So here again we have this vector with values
between zero and one, controlling how
0:44:23.843 --> 0:44:25.205
the information flows.
0:44:25.505 --> 0:44:36.459
And then the output is calculated here similar
to a cell stage, but again input is from.
0:44:36.536 --> 0:44:45.714
So either the reset gate decides should give
what is currently stored in there, or.
0:44:46.346 --> 0:44:58.647
So it's not exactly as the thing we had before,
with the residual connections where we added
0:44:58.647 --> 0:45:01.293
them up; here we do a gated combination instead.
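A sketch of the gating equations as described above, in the notation used here (W, v and b are the learned weights, element-wise products written with \odot; this is a reconstruction of the spoken description, not copied from a slide):

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + v_f \odot c_{t-1} + b_f) && \text{(forget gate)}\\
c_t &= f_t \odot c_{t-1} + (1 - f_t) \odot (W x_t) && \text{(cell state)}\\
r_t &= \sigma(W_r x_t + v_r \odot c_{t-1} + b_r) && \text{(reset/output gate)}\\
h_t &= r_t \odot c_t + (1 - r_t) \odot x_t && \text{(gated highway output)}
\end{aligned}
```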
0:45:04.224 --> 0:45:08.472
This is the general idea of a simple recurrent
neural network.
0:45:08.472 --> 0:45:13.125
Then we will now look at how we can make things
even more efficient.
0:45:13.125 --> 0:45:17.104
But first do you have more questions on how
it is working?
0:45:23.063 --> 0:45:38.799
Now these calculations are where things can
get more efficient, because in this form
0:45:38.718 --> 0:45:43.177
each dimension depends on all the other dimensions,
for the second term also.
0:45:43.423 --> 0:45:48.904
Because if you do a matrix multiplication
with a vector like for the output vector, each
0:45:48.904 --> 0:45:52.353
dimension of the output vector depends on all
the input dimensions.
0:45:52.973 --> 0:46:06.561
The cell state here depends because this one
is used here, and somehow the first dimension
0:46:06.561 --> 0:46:11.340
of the cell state only depends.
0:46:11.931 --> 0:46:17.973
In order to make that, of course, is sometimes
again making things less paralyzeable if things
0:46:17.973 --> 0:46:18.481
depend.
0:46:19.359 --> 0:46:35.122
You can easily change that by replacing
the matrix product with an element-wise product with a vector.
0:46:35.295 --> 0:46:51.459
So you do first, just like inside here, you
take like the first dimension, my second dimension.
0:46:52.032 --> 0:46:53.772
Is, of course, narrow.
0:46:53.772 --> 0:46:59.294
This should be reset or this should be because
it should be a different.
0:46:59.899 --> 0:47:12.053
Now the first dimension only depends on the
first dimension, so you don't have dependencies
0:47:12.053 --> 0:47:16.148
any longer between dimensions.
0:47:18.078 --> 0:47:25.692
Maybe it gets a bit clearer if you see about
it in this way, so what we have to do now.
0:47:25.966 --> 0:47:31.911
First, we have to do a metrics multiplication
on to gather and to get the.
0:47:32.292 --> 0:47:38.041
And then we only have the element wise operations
where we take this output.
0:47:38.041 --> 0:47:38.713
We take.
0:47:39.179 --> 0:47:42.978
Minus one and our original.
0:47:42.978 --> 0:47:52.748
Here we only have element-wise operations, which
can be optimally parallelized.
0:47:53.273 --> 0:48:07.603
So here we can additionally parallelize
across the dimensions and don't have that dependency.
0:48:09.929 --> 0:48:24.255
Yeah, but this you can do like in parallel
again for all xts.
0:48:24.544 --> 0:48:33.014
Here you can't do it in parallel over time, but you
only have cheap element-wise operations at each step, and then you
0:48:33.014 --> 0:48:34.650
can parallelize.
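As a minimal sketch of that split between parallel matrix products and a cheap element-wise recurrence (parameter names, shapes and the exact gate wiring are assumptions following the equations above):

```python
import numpy as np

def sru_forward(X, W, Wf, Wr, bf, br, vf, vr):
    """X: (T, d) inputs. The heavy matrix multiplications are done for all
    time steps at once; the sequential loop contains only element-wise work."""
    Xt = X @ W.T        # candidate values,  (T, d)
    Fx = X @ Wf.T + bf  # forget-gate input, (T, d)
    Rx = X @ Wr.T + br  # reset-gate input,  (T, d)

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    c, H = np.zeros(X.shape[1]), np.zeros_like(X)
    for t in range(X.shape[0]):          # cheap, element-wise recurrence
        f = sigmoid(Fx[t] + vf * c)
        c = f * c + (1.0 - f) * Xt[t]
        r = sigmoid(Rx[t] + vr * c)
        H[t] = r * c + (1.0 - r) * X[t]
    return H
```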
0:48:35.495 --> 0:48:39.190
But this maybe for the dimension.
0:48:39.190 --> 0:48:42.124
Maybe it's also important.
0:48:42.124 --> 0:48:46.037
I don't know if they have tried it.
0:48:46.037 --> 0:48:55.383
I assume it's not only for dimension reduction,
but it's hard because you can easily.
0:49:01.001 --> 0:49:08.164
People have even like made the second thing
even more easy.
0:49:08.164 --> 0:49:10.313
So there is this.
0:49:10.313 --> 0:49:17.954
This is how we have the highway connections
in the transformer.
0:49:17.954 --> 0:49:20.699
Then it's like you do.
0:49:20.780 --> 0:49:24.789
So that is like how things are put together
as a transformer.
0:49:25.125 --> 0:49:39.960
And that is a similar and simple recurring
neural network where you do exactly the same
0:49:39.960 --> 0:49:44.512
for the so you don't have.
0:49:46.326 --> 0:49:47.503
This type of things.
0:49:49.149 --> 0:50:01.196
And with this we are at the end of how to
make efficient architectures before we go to
0:50:01.196 --> 0:50:02.580
the next.
0:50:13.013 --> 0:50:24.424
Besides the encoder-decoder trade-off and the architectures,
there is a next technique which is used
0:50:24.424 --> 0:50:28.988
very successfully in nearly all of deep learning.
0:50:29.449 --> 0:50:43.463
So the idea is: can we extract the knowledge
from a large network into a smaller one
0:50:43.463 --> 0:50:45.983
that performs similarly well?
0:50:47.907 --> 0:50:53.217
And the nice thing is that this really works,
and it may be very, very surprising.
0:50:53.673 --> 0:51:03.000
So the idea is that we have a large, strong
model which we train for a long time, and the question
0:51:03.000 --> 0:51:07.871
is: Can that help us to train a smaller model?
0:51:08.148 --> 0:51:16.296
So can what we refer to as the teacher model help
us build a better small student model than
0:51:16.296 --> 0:51:17.005
before.
0:51:17.257 --> 0:51:27.371
So what we're before in it as a student model,
we learn from the data and that is how we train
0:51:27.371 --> 0:51:28.755
our systems.
0:51:29.249 --> 0:51:37.949
The question is: Can we train this small model
better if we are not only learning from the
0:51:37.949 --> 0:51:46.649
data, but we are also learning from a large
model which has been trained maybe in the same
0:51:46.649 --> 0:51:47.222
data?
0:51:47.667 --> 0:51:55.564
So that in the end you have a smaller
model that somehow performs better than before.
0:51:55.895 --> 0:51:59.828
And maybe that's on the first view.
0:51:59.739 --> 0:52:05.396
Very very surprising because it has seen the
same data so it should have learned the same
0:52:05.396 --> 0:52:11.053
so the baseline model trained only on the data
and the student teacher knowledge to still
0:52:11.053 --> 0:52:11.682
model it.
0:52:11.682 --> 0:52:17.401
They all have seen only this data because
your teacher modeling was also trained typically
0:52:17.401 --> 0:52:19.161
only on this model however.
0:52:20.580 --> 0:52:30.071
It has by now been shown in many settings that the
model trained in the teacher-student framework
0:52:30.071 --> 0:52:32.293
is performing better.
0:52:33.473 --> 0:52:40.971
A bit of an explanation when we see how that
works.
0:52:40.971 --> 0:52:46.161
There's different ways of doing it.
0:52:46.161 --> 0:52:47.171
Maybe.
0:52:47.567 --> 0:52:51.501
So how does it work?
0:52:51.501 --> 0:53:04.802
This is our student network, the normal one,
some type of new network.
0:53:04.802 --> 0:53:06.113
We're.
0:53:06.586 --> 0:53:17.050
So we are training the model to predict the
reference translation, and we do that by calculating the cross-entropy loss.
0:53:17.437 --> 0:53:23.173
The cross-entropy loss was defined in a way
that says the probability for the
0:53:23.173 --> 0:53:25.332
correct word should be as high as possible.
0:53:25.745 --> 0:53:32.207
So you are always calculating your output probabilities,
and at each time step you have an output
0:53:32.207 --> 0:53:33.055
probability.
0:53:33.055 --> 0:53:38.669
distribution over the next word,
and your training signal is to put as much of
0:53:38.669 --> 0:53:43.368
your probability mass as possible on the correct word,
the word that is there in the reference.
0:53:43.903 --> 0:53:51.367
And this is achieved by the cross-entropy
loss, which sums over all training
0:53:51.367 --> 0:53:58.664
examples and all positions, and over the
full vocabulary, and then this indicator is
0:53:58.664 --> 0:54:03.947
one if the current word is the k-th word
in the vocabulary.
0:54:04.204 --> 0:54:11.339
And then we take the log probability
of that, so what we mainly do is: we have
0:54:11.339 --> 0:54:27.313
this matrix here, of positions by
vocabulary size.
0:54:27.507 --> 0:54:38.656
In the end what you do is sum these
log probabilities, and then you want
0:54:38.656 --> 0:54:40.785
them to be as high as possible.
0:54:41.041 --> 0:54:54.614
So although this is a thumb over this metric
here, in the end of each dimension you.
0:54:54.794 --> 0:55:06.366
So that is the normal cross-entropy loss that
we have discussed at the very beginning of
0:55:06.366 --> 0:55:07.016
how we train these models.
0:55:08.068 --> 0:55:15.132
So what can we do differently in the teacher
network?
0:55:15.132 --> 0:55:23.374
We also have a teacher network which is trained
on large data.
0:55:24.224 --> 0:55:35.957
And of course this distribution might be better
than the one from the small model, because it's a stronger model.
0:55:36.456 --> 0:55:40.941
So in this case we have now the training signal
from the teacher network.
0:55:41.441 --> 0:55:46.262
And it's the same way as we had before.
0:55:46.262 --> 0:55:56.507
The only difference is that we're training not
towards the ground-truth probability distribution
0:55:56.507 --> 0:55:59.159
here, which is sharp, but towards the teacher's distribution.
0:55:59.299 --> 0:56:11.303
That's also a probability, so this word has
a high probability, but have some probability.
0:56:12.612 --> 0:56:19.577
And that is the main difference.
0:56:19.577 --> 0:56:30.341
Typically you do an interpolation of
these two.
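As a minimal sketch of that interpolated training signal (the array layout and the mixing weight alpha are assumptions):

```python
import numpy as np

def word_level_kd_loss(student_logprobs, teacher_probs, reference_ids, alpha=0.5):
    """Word-level knowledge distillation: interpolate the usual cross-entropy
    against the reference with a cross-entropy against the teacher's soft
    distribution, summed over all positions.
    student_logprobs, teacher_probs: (positions, vocab); reference_ids: (positions,)."""
    pos = np.arange(len(reference_ids))
    ce_reference = -student_logprobs[pos, reference_ids].sum()
    ce_teacher = -(teacher_probs * student_logprobs).sum()
    return alpha * ce_reference + (1.0 - alpha) * ce_teacher
```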
0:56:33.213 --> 0:56:38.669
Because there's more information contained
in the distribution than in the ground truth,
0:56:38.669 --> 0:56:44.187
because it encodes more information about the
language, because language always has more
0:56:44.187 --> 0:56:47.907
options to put alone, that's the same sentence
yes exactly.
0:56:47.907 --> 0:56:53.114
So there's ambiguity in there that is hopefully
encoded very well in the teacher's distribution.
0:56:53.513 --> 0:56:57.257
Trade you two networks so better than a student
network you have in there from your learner.
0:56:57.537 --> 0:57:05.961
So maybe often there's only one correct word,
but it might be two or three, and then all
0:57:05.961 --> 0:57:10.505
of these three have a probability distribution.
0:57:10.590 --> 0:57:21.242
And then is the main advantage or one explanation
of why it's better to train from the.
0:57:21.361 --> 0:57:32.652
Of course, it's good to also keep the ground-truth signal
in there, because then you can prevent
0:57:32.652 --> 0:57:33.493
the student from learning something crazy from the teacher.
0:57:37.017 --> 0:57:49.466
Any more questions on the first type of knowledge
distillation, also distribution changes.
0:57:50.550 --> 0:58:02.202
Coming around again, this would put it a bit
different, so this is not a solution to maintenance
0:58:02.202 --> 0:58:04.244
or distribution.
0:58:04.744 --> 0:58:12.680
But I don't think it performs worse than
only training on the ground truth.
0:58:13.113 --> 0:58:21.254
So it's more like it's not improving you would
assume it's similarly helping you, but.
0:58:21.481 --> 0:58:28.145
Of course, if you now have a teacher, maybe
you have no danger on your target to Maine,
0:58:28.145 --> 0:58:28.524
but.
0:58:28.888 --> 0:58:39.895
Then you can use this one which is not the
ground truth but helpful to learn better for
0:58:39.895 --> 0:58:42.147
the distribution.
0:58:46.326 --> 0:58:57.012
The second idea is to do sequence level knowledge
distillation, so what we have in this case
0:58:57.012 --> 0:59:02.757
is we have looked at each position independently.
0:59:03.423 --> 0:59:05.436
Mean, we do that often.
0:59:05.436 --> 0:59:10.972
We are not generating a lot of sequences,
but that has a problem.
0:59:10.972 --> 0:59:13.992
We have this propagation of errors.
0:59:13.992 --> 0:59:16.760
We start with one error and then it propagates.
0:59:17.237 --> 0:59:27.419
So if we are doing word-level knowledge distillation,
we are treating each word in the sentence independently.
0:59:28.008 --> 0:59:32.091
So we are not trying to like somewhat model
the dependency between.
0:59:32.932 --> 0:59:47.480
We can try to address that with sequence-level knowledge
distillation, but there is, of course, a problem.
0:59:47.847 --> 0:59:53.478
So we can that for each position we can get
a distribution over all the words at this.
0:59:53.793 --> 1:00:05.305
But if we want to have a distribution of all
possible target sentences, that's not possible
1:00:05.305 --> 1:00:06.431
because.
1:00:08.508 --> 1:00:15.940
Area, so we can then again do a bit of a heck
on that.
1:00:15.940 --> 1:00:23.238
If we can't have a distribution of all sentences,
it.
1:00:23.843 --> 1:00:30.764
So what we can do is use the
teacher network and sample or generate different translations.
1:00:31.931 --> 1:00:39.327
And now we can do different ways to train
them.
1:00:39.327 --> 1:00:49.343
We can use them as their probability, the
easiest one to assume.
1:00:50.050 --> 1:00:56.373
So what that ends to is that we're taking
our teacher network, we're generating some
1:00:56.373 --> 1:01:01.135
translations, and these we're using as
additional training data.
1:01:01.781 --> 1:01:11.382
Then we have mainly done this at the sequence level,
because the teacher network tells us:
1:01:11.382 --> 1:01:17.513
These are all probable translations of the
sentence.
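As a minimal sketch of that data generation step (the teacher.translate interface is hypothetical):

```python
def build_sequence_kd_data(teacher, source_sentences, beam_size=5):
    """Sequence-level knowledge distillation: the teacher translates the
    training sources, and the student is then trained on these outputs as
    targets (possibly in addition to the original references)."""
    distilled = []
    for src in source_sentences:
        hyp = teacher.translate(src, beam_size=beam_size)  # best teacher output
        distilled.append((src, hyp))
    return distilled
```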
1:01:26.286 --> 1:01:34.673
And then you can do a bit of a yeah, and you
can try to better make a bit of an interpolated
1:01:34.673 --> 1:01:36.206
version of that.
1:01:36.716 --> 1:01:42.802
So what people have also done is sequence-level
interpolation.
1:01:42.802 --> 1:01:52.819
You generate several translations here, but
then you don't use all of them.
1:01:52.819 --> 1:02:00.658
You use some metric to decide which of them to keep.
1:02:01.021 --> 1:02:12.056
So it's a bit more training on this brown
chose which might be improbable or unreachable
1:02:12.056 --> 1:02:16.520
because we can generate everything.
1:02:16.676 --> 1:02:23.378
And we are giving it an easier solution which
is also good quality and training of that.
1:02:23.703 --> 1:02:32.602
So you're not training it on a very difficult
solution, but you're training it on an easier
1:02:32.602 --> 1:02:33.570
solution.
1:02:36.356 --> 1:02:38.494
Any More Questions to This.
1:02:40.260 --> 1:02:41.557
Yeah.
1:02:41.461 --> 1:02:44.296
Good.
1:02:43.843 --> 1:03:01.642
The next idea is to look at the vocabulary. The problem
is, we have seen that vocabulary calculations
1:03:01.642 --> 1:03:06.784
are often very time-consuming.
1:03:09.789 --> 1:03:19.805
The thing is that most of the vocabulary is
not needed for each sentence, so in each sentence.
1:03:20.280 --> 1:03:28.219
The question is: Can we somehow easily precalculate,
which words are probable to occur in the sentence,
1:03:28.219 --> 1:03:30.967
and then only calculate these ones?
1:03:31.691 --> 1:03:34.912
And this can be done so.
1:03:34.912 --> 1:03:43.932
For example, if you have sentenced card, it's
probably not happening.
1:03:44.164 --> 1:03:48.701
So what you can try to do is to limit your
vocabulary.
1:03:48.701 --> 1:03:51.093
You're considering for each.
1:03:51.151 --> 1:04:04.693
So you're no longer taking the full vocabulary
as possible output, but you're restricting.
1:04:06.426 --> 1:04:18.275
What typically works is that we always include
the most frequent words, because
1:04:18.275 --> 1:04:23.613
these are not so easy to align to source words.
1:04:23.964 --> 1:04:32.241
So we take the most frequent target words, and
then the words that often align with one of the
1:04:32.241 --> 1:04:32.985
source.
1:04:33.473 --> 1:04:46.770
So for each source word you calculate the
word alignment on your training data, and then
1:04:46.770 --> 1:04:51.700
you calculate which words occur.
1:04:52.352 --> 1:04:57.680
And then for decoding you build this union
of maybe the source word list that other.
1:04:59.960 --> 1:05:02.145
Are like for each source work.
1:05:02.145 --> 1:05:08.773
One of the most frequent translations of these
source words, for example for each source work
1:05:08.773 --> 1:05:13.003
like in the most frequent ones, and then the
most frequent.
1:05:13.193 --> 1:05:24.333
In total, if you have short sentences, you
have a lot less words, so in most cases it's
1:05:24.333 --> 1:05:26.232
not more than.
1:05:26.546 --> 1:05:33.957
And so you have dramatically reduced your
vocabulary, and thereby can also speed up decoding.
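As a minimal sketch of that per-sentence vocabulary selection (names and the cutoff are illustrative; translations_per_word would come from word alignments counted on the training data):

```python
def candidate_vocab(source_sentence, translations_per_word, frequent_words, top_k=20):
    """Restrict the output vocabulary for one sentence to the globally
    frequent target words plus the top aligned translations of each source
    word; the output softmax is then computed only over this set."""
    candidates = set(frequent_words)
    for src_word in source_sentence:
        candidates.update(translations_per_word.get(src_word, [])[:top_k])
    return candidates
```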
1:05:35.495 --> 1:05:43.757
That easy does anybody see what is challenging
here and why that might not always need.
1:05:47.687 --> 1:05:54.448
The performance is not why this might not.
1:05:54.448 --> 1:06:01.838
If you implement it, it might not be a strong.
1:06:01.941 --> 1:06:06.053
You have to store this list.
1:06:06.053 --> 1:06:14.135
You have to burn the union and of course your
safe time.
1:06:14.554 --> 1:06:21.920
The second thing the vocabulary is used in
our last step, so we have the hidden state,
1:06:21.920 --> 1:06:23.868
and then we calculate.
1:06:24.284 --> 1:06:29.610
Now we are not longer calculating them for
all output words, but for a subset of them.
1:06:30.430 --> 1:06:35.613
However, this matrix multiplication is typically
parallelized on the GPU very well.
1:06:35.956 --> 1:06:46.937
But if you only calculate some of them and
don't implement it right, it will take
1:06:46.937 --> 1:06:52.794
as long as before, because of the nature of
the parallel hardware.
1:06:56.776 --> 1:07:07.997
Here for beam search there's some ideas of
course you can go back to greedy search because
1:07:07.997 --> 1:07:10.833
that's more efficient.
1:07:11.651 --> 1:07:18.347
And better quality, and you can buffer some
states in between, so how much buffering it's
1:07:18.347 --> 1:07:22.216
again this tradeoff between calculation and
memory.
1:07:25.125 --> 1:07:41.236
Then at the end of today what we want to look
into is one last type of neural machine translation
1:07:41.236 --> 1:07:42.932
approach.
1:07:43.403 --> 1:07:53.621
And the idea is, as we've already seen in
our first steps, that this autoregressive
1:07:53.621 --> 1:07:57.246
part is what takes the time during decoding.
1:07:57.557 --> 1:08:04.461
The encoder can process everything in parallel, but in
the decoder we always take the most probable word and then continue.
1:08:05.905 --> 1:08:10.476
The question is: Do we really need to do that?
1:08:10.476 --> 1:08:14.074
Therefore, there is a bunch of work.
1:08:14.074 --> 1:08:16.602
Can we do it differently?
1:08:16.602 --> 1:08:19.616
Can we generate a full target?
1:08:20.160 --> 1:08:29.417
We'll see it's not that easy and there's still
an open debate whether this is really faster
1:08:29.417 --> 1:08:31.832
and quality, but think.
1:08:32.712 --> 1:08:45.594
So, as said, what we have is our encoder-decoder,
where we can process the encoder in parallel,
1:08:45.594 --> 1:08:50.527
and then each output always depends on the previous ones.
1:08:50.410 --> 1:08:54.709
We generate an output and then we have to
put it in here as the y, because then everything
1:08:54.709 --> 1:08:56.565
depends on the previous output.
1:08:56.916 --> 1:09:10.464
This is what is referred to as an autoregressive
model, and nearly all speech generation and
1:09:10.464 --> 1:09:16.739
language generation works in this autoregressive way.
1:09:18.318 --> 1:09:21.132
So the motivation is, can we do that more
efficiently?
1:09:21.361 --> 1:09:31.694
And can we somehow process all target words
in parallel?
1:09:31.694 --> 1:09:41.302
So instead of doing it one by one, we are
generating all of them at once.
1:09:45.105 --> 1:09:46.726
So how does it work?
1:09:46.726 --> 1:09:50.587
So let's first have a basic auto regressive
mode.
1:09:50.810 --> 1:09:53.551
So the encoder looks as it is before.
1:09:53.551 --> 1:09:58.310
That's maybe not surprising because here we
know we can paralyze.
1:09:58.618 --> 1:10:04.592
So we put in here our input and
generate the encoder states, so that's exactly
1:10:04.592 --> 1:10:05.295
the same.
1:10:05.845 --> 1:10:16.229
However, now we need to do one more thing:
One challenge is what we had before and that's
1:10:16.229 --> 1:10:26.799
a challenge of natural language generation
like machine translation.
1:10:32.672 --> 1:10:38.447
Normally we generate until we produce this
end-of-sentence token, but if we now generate
1:10:38.447 --> 1:10:44.625
everything at once that's no longer possible,
so we cannot generate as long because we only
1:10:44.625 --> 1:10:45.632
generated one.
1:10:46.206 --> 1:10:58.321
So the question is how can we now determine
how long the sequence is, and we can also accelerate.
1:11:00.000 --> 1:11:06.384
Yes, but there would be one idea, and there
is other work which tries to do that.
1:11:06.806 --> 1:11:15.702
However, in here there's some work already
done before and maybe you remember we had the
1:11:15.702 --> 1:11:20.900
IBM models and there was this concept of fertility.
1:11:21.241 --> 1:11:26.299
The concept of fertility means: for
one source word, into how many target words does
1:11:26.299 --> 1:11:27.104
it translate?
1:11:27.847 --> 1:11:34.805
And exactly that we try to do here, and that
means we are calculating like at the top we
1:11:34.805 --> 1:11:36.134
are calculating the fertility of each source word.
1:11:36.396 --> 1:11:42.045
So it says this word is translated into one word.
1:11:42.045 --> 1:11:54.171
That word might be translated into several words,
so we're trying to predict into how many target words each source word translates.
1:11:55.935 --> 1:12:10.314
And then the end of the anchor, so this is
like a length estimation.
1:12:10.314 --> 1:12:15.523
You can do it otherwise.
1:12:16.236 --> 1:12:24.526
You have to initialize your decoder input, and we know
word embeddings work well, so we're trying
1:12:24.526 --> 1:12:28.627
to do the same thing and what people then do.
1:12:28.627 --> 1:12:35.224
They initialize it again with the source word embeddings,
but repeated according to the fertility.
1:12:35.315 --> 1:12:36.460
So we have the cartilage.
1:12:36.896 --> 1:12:47.816
So this one has fertility two, so it appears twice,
and that one once; that is then our initialization.
1:12:48.208 --> 1:12:57.151
In other words, if you don't predict fertilities
but predict the length directly, you can just initialize
1:12:57.151 --> 1:12:57.912
the decoder input in a different way.
1:12:58.438 --> 1:13:07.788
This often works a bit better, but that's
the other option.
1:13:07.788 --> 1:13:16.432
Now you have everything in training and testing.
1:13:16.656 --> 1:13:18.621
This is all available at once.
1:13:20.280 --> 1:13:31.752
Then we can generate everything in parallel,
so we have the decoder stack, and that is now
1:13:31.752 --> 1:13:33.139
as before.
1:13:35.395 --> 1:13:41.555
And then we're doing the translation predictions
here on top of it.
1:13:43.083 --> 1:13:59.821
And then we are predicting here the target
words, all at once, and that is the basic
1:13:59.821 --> 1:14:00.924
idea of non-autoregressive
1:14:01.241 --> 1:14:08.171
machine translation: we don't have to generate
the output one word at a time.
1:14:10.210 --> 1:14:13.900
So this looks really, really, really great.
1:14:13.900 --> 1:14:20.358
On first view, at least. But there's one challenge with
this, and you can see it with the baseline.
1:14:20.358 --> 1:14:27.571
Of course there have been some improvements, but in
general the quality drop is often significant.
1:14:28.068 --> 1:14:32.075
So here you see the baseline models.
1:14:32.075 --> 1:14:38.466
You have a loss of ten BLEU points or something
like that.
1:14:38.878 --> 1:14:40.230
So why does the quality drop?
1:14:40.230 --> 1:14:41.640
So why is it happening?
1:14:43.903 --> 1:14:56.250
If you look at the errors, there are repetitive
tokens, so you get the same word generated twice in a row, or things like that.
1:14:56.536 --> 1:15:01.995
Broken sentences or disfluent sentences, so that
is exactly where autoregressive models are
1:15:01.995 --> 1:15:04.851
very good; we said that's even a bit of a problem there.
1:15:04.851 --> 1:15:07.390
They generate very fluent translations.
1:15:07.387 --> 1:15:10.898
Sometimes the translation doesn't have
anything to do with the input.
1:15:11.411 --> 1:15:14.047
But generally it always looks very fluent.
1:15:14.995 --> 1:15:20.865
Here it is exactly the opposite: the problem
is that we don't get really fluent translations.
1:15:21.421 --> 1:15:26.123
And that is mainly due to the challenge that
we have this independence assumption.
1:15:26.646 --> 1:15:35.873
So in this case, the probability of Y at the
second position is independent of what was
1:15:35.873 --> 1:15:40.632
generated at the first position, so we don't know what was generated there.
1:15:40.632 --> 1:15:43.740
We're just generating each position on its own.
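Written as formulas, the difference is only in the conditioning; this is the standard way the two factorizations are usually stated, with x the source, y the target and T the target length:

% autoregressive: each position is conditioned on everything generated before
P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, x)

% non-autoregressive: given a predicted length, positions are conditionally independent
P(y \mid x) = P(T \mid x) \cdot \prod_{t=1}^{T} P(y_t \mid T, x)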
1:15:43.964 --> 1:15:55.439
You can see it also in a few examples.
1:15:55.439 --> 1:16:03.636
For example, you can over-penalize shifts.
1:16:04.024 --> 1:16:10.566
And while this is already an improvement again,
the problem is also similar to the following example.
1:16:11.071 --> 1:16:19.900
So you can, for example, translate it with 'feeling
down', or maybe you could also translate it
1:16:19.900 --> 1:16:31.105
with another expression. But if the first
position assumes the one translation
1:16:31.105 --> 1:16:34.594
and the second position assumes the other, you get a mix of both.
1:16:35.075 --> 1:16:42.908
So each position here, and that is one of the
main issues, doesn't know what the other positions generate.
1:16:43.243 --> 1:16:53.846
And for example, if you are translating something
into German, you can often translate things in two
1:16:53.846 --> 1:16:58.471
ways, with a different agreement.
1:16:58.999 --> 1:17:02.058
And then here you have to decide which of
the two forms to use.
1:17:02.162 --> 1:17:05.460
The decoder doesn't know which word
it has to select.
1:17:06.086 --> 1:17:14.789
I mean, of course, it knows the hidden state,
but in the end you have a probability distribution.
1:17:16.256 --> 1:17:20.026
And that is the important difference to the
autoregressive model.
1:17:20.026 --> 1:17:24.335
There you know what was selected, because you have put
it in again; here, you don't know that.
1:17:24.335 --> 1:17:29.660
If two options are equally probable, you don't
know which one is selected, and of course that
1:17:29.660 --> 1:17:32.832
determines what the next output should be.
1:17:33.333 --> 1:17:39.554
Yep, and we're going to look at that next time.
1:17:39.554 --> 1:17:39.986
Yes?
1:17:40.840 --> 1:17:44.935
Doesn't this also appear in the autoregressive
model, like when we're talking about training?
1:17:46.586 --> 1:17:48.412
The thing is, in the autoregressive model,
1:17:48.412 --> 1:17:50.183
you give it the correct previous word during training.
1:17:50.450 --> 1:17:55.827
So if you predict here, say, where the reference
is 'feeling', then you tell the model:
1:17:55.827 --> 1:17:59.573
the last one was 'feeling', and then it knows
the next one has to be 'down'.
1:17:59.573 --> 1:18:04.044
But here it doesn't know that, because it doesn't
get the correct previous word as input.
1:18:04.204 --> 1:18:24.286
Yes, that depends a bit on what exactly you do.
1:18:24.204 --> 1:18:27.973
But in training, of course, you just try to
make the correct one the one with the highest probability.
1:18:31.751 --> 1:18:38.181
So what you can do is use things like the CTC loss,
which can adjust for this.
1:18:38.181 --> 1:18:42.866
So then you can also allow for this kind of shifted correction.
1:18:42.866 --> 1:18:50.582
If your output is correct but shifted, with
the CTC loss you don't get the full penalty.
1:18:50.930 --> 1:18:58.486
It is just shifted by one, so it's a bit of a different
loss, which is mainly used in speech recognition.
1:19:00.040 --> 1:19:03.412
It can be used in order to address this problem.
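A minimal sketch of how such a loss can be set up with PyTorch's torch.nn.CTCLoss; the shapes and dummy tensors are only illustrative, not the exact setup from the lecture:

import torch
import torch.nn as nn

vocab_size, blank_id = 100, 0
decoder_len, target_len, batch = 12, 7, 1

# Hypothetical decoder output: (decoder positions, batch, vocab) log-probabilities.
log_probs = torch.randn(decoder_len, batch, vocab_size).log_softmax(dim=-1)
# Reference target tokens (they must not contain the blank symbol).
targets = torch.randint(1, vocab_size, (batch, target_len))

ctc = nn.CTCLoss(blank=blank_id)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((batch,), decoder_len, dtype=torch.long),
           target_lengths=torch.full((batch,), target_len, dtype=torch.long))

# The alignment between decoder positions and reference tokens is marginalized out,
# so an otherwise correct output that is merely shifted is not fully penalized.
print(loss.item())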
1:19:04.504 --> 1:19:13.844
The other problem for the non-autoregressive model is this
ambiguity in the training data that we need to mitigate.
1:19:13.844 --> 1:19:20.515
That's the example from before: if you translate
'thank you', there is more than one possible translation.
1:19:20.460 --> 1:19:31.925
And then it might end up mixing them, because it learns
one variant for the first position and another for the second.
1:19:32.492 --> 1:19:43.201
In order to prevent that, it would be helpful
if for one input there were only one output, as that makes
1:19:43.201 --> 1:19:47.002
it easier for the system to learn.
1:19:47.227 --> 1:19:53.867
It might be that for slightly different inputs
you have different outputs, but for the same input there is always the same output.
1:19:54.714 --> 1:19:57.467
That we can luckily very easily solve.
1:19:59.119 --> 1:19:59.908
And it's done.
1:19:59.908 --> 1:20:04.116
We just learned about the technique for it, which
is called knowledge distillation.
1:20:04.985 --> 1:20:13.398
So what we can do, and the easiest solution
to improve your non-autoregressive model, is to first
1:20:13.398 --> 1:20:16.457
train an autoregressive model.
1:20:16.457 --> 1:20:22.958
Then you decode your whole training data
with this model, and then you train the non-autoregressive model on its output.
1:20:23.603 --> 1:20:27.078
And the main advantage of that is that this
data is more consistent.
1:20:27.407 --> 1:20:33.995
So for the same input you always have the
same output.
1:20:33.995 --> 1:20:41.901
So you make your training data more
consistent, and it becomes easier to learn.
1:20:42.482 --> 1:20:54.471
So there is another advantage of knowledge
distillation and that advantage is you have
1:20:54.471 --> 1:20:59.156
more consistent training signals.
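A minimal sketch of this sequence-level knowledge-distillation recipe; teacher_translate and the student's training call are hypothetical placeholders, not a specific toolkit API:

def distill_training_data(teacher_translate, source_sentences):
    # Replace the human references with the autoregressive teacher's output,
    # so every source sentence maps to exactly one, consistent target.
    return [(src, teacher_translate(src)) for src in source_sentences]

# Usage (illustrative):
#   distilled = distill_training_data(teacher.translate, train_sources)
#   non_autoregressive_student.train_on(distilled)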
1:21:04.884 --> 1:21:10.630
There's another way to make things easier
at the beginning.
1:21:10.630 --> 1:21:16.467
There is this glancing, masked-model idea, where
you work with masks.
1:21:16.756 --> 1:21:26.080
So during training, especially at the beginning,
you already give the model some of the correct target tokens.
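A minimal sketch of this masking idea, assuming a plain token list and a ratio of revealed tokens that is lowered over training; the mask symbol and function name are illustrative:

import random

MASK = "<mask>"

def glance_inputs(target_tokens, keep_ratio):
    # Reveal a fraction of the correct target tokens as decoder input; the model
    # has to predict the masked rest, and the loss is typically computed only
    # on those masked positions.
    k = int(keep_ratio * len(target_tokens))
    revealed = set(random.sample(range(len(target_tokens)), k))
    return [tok if i in revealed else MASK for i, tok in enumerate(target_tokens)]

# Early in training keep_ratio is high (most tokens are given, the task is easy);
# it is lowered until the model has to predict everything in parallel.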
1:21:28.468 --> 1:21:38.407
And there is this 'K tokens at a time' idea, which
interpolates between autoregressive and non-autoregressive training.
1:21:40.000 --> 1:21:50.049
Some target positions stay open, and you always predict
only K of them at once; at first it is autoregressive, with K
1:21:50.049 --> 1:21:59.174
equal to one, so you always have one input
and one output, and then you predict partially in parallel with a larger K.
1:21:59.699 --> 1:22:05.825
So in that way you can slowly learn what is
a good and what is a bad answer.
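One way to picture the 'K tokens at a time' idea is as semi-autoregressive generation; the sketch below uses a hypothetical step_fn that returns k new tokens per call, and in training the same k interpolates between the autoregressive (k = 1) and the fully parallel setting:

def k_tokens_at_a_time(step_fn, encoder_states, target_length, k):
    # Predict k positions per step, conditioned on everything generated so far;
    # k = 1 is autoregressive, k = target_length is fully parallel.
    output = []
    while len(output) < target_length:
        output.extend(step_fn(encoder_states, output, k))  # k new tokens at once
    return output[:target_length]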
1:22:08.528 --> 1:22:10.862
It doesn't sound very efficient.
1:22:10.862 --> 1:22:12.578
But that is fine, because anyway
1:22:12.578 --> 1:22:15.323
you go over your training data several times.
1:22:15.875 --> 1:22:20.655
You can even switch in between.
1:22:20.655 --> 1:22:29.318
There is a whole area of work on this, where you
try to start with the easier setting.
1:22:31.271 --> 1:22:41.563
You have to tune that, and there's a whole body
of work on it; this is often done, and it doesn't
1:22:41.563 --> 1:22:46.598
mean it's less efficient, but it still helps.
1:22:49.389 --> 1:22:57.979
For later maybe here are some examples of
how much things help.
1:22:57.979 --> 1:23:04.958
Maybe one point here that is really important:
1:23:05.365 --> 1:23:13.787
Here's the translation performance and speed.
1:23:13.787 --> 1:23:24.407
One important point is what you compare
against as researchers.
1:23:24.784 --> 1:23:33.880
So yeah, if you compare to one very weak
baseline, a transformer even with beam search,
1:23:33.880 --> 1:23:40.522
then such a baseline is itself ten times slower
than a very strong autoregressive system.
1:23:40.961 --> 1:23:48.620
If you take a strong baseline, then the speed-up
goes down, depending on the setup, and here you
1:23:48.620 --> 1:23:53.454
have a lot of different speed-ups.
1:23:53.454 --> 1:24:03.261
Generally, it matters that you take a strong
baseline and not a very simple transformer.
1:24:07.407 --> 1:24:20.010
Yeah, with this one last thing that you can
do to speed up things and also reduce your
1:24:20.010 --> 1:24:25.950
memory is what is called half precision.
1:24:26.326 --> 1:24:29.139
And it is used especially for decoding; for training it can be an issue.
1:24:29.139 --> 1:24:31.148
Sometimes it also gets less stable.
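A minimal sketch of half-precision decoding in PyTorch; the tiny linear layer only stands in for a real translation model:

import torch

model = torch.nn.Linear(512, 512)   # placeholder for the real network
inputs = torch.randn(1, 512)

if torch.cuda.is_available():
    model = model.half().cuda()     # store weights as 16-bit floats
    inputs = inputs.half().cuda()   # roughly halves the memory footprint

with torch.no_grad():               # decoding only; training in fp16 is often
    output = model(inputs)          # less stable and usually needs loss scaling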
1:24:32.592 --> 1:24:45.184
With this we are nearly at the end, so what
you should remember is how we can do efficient machine
1:24:45.184 --> 1:24:46.963
translation.
1:24:47.007 --> 1:24:51.939
We have, for example, looked at knowledge
distillation.
1:24:51.939 --> 1:24:55.991
We have looked at non-autoregressive models.
1:24:55.991 --> 1:24:57.665
And we have seen several other techniques.
1:24:58.898 --> 1:25:02.383
That's it for today, and then only one request:
1:25:02.383 --> 1:25:08.430
So if you haven't done so, please fill out
the evaluation.
1:25:08.388 --> 1:25:20.127
So if you have done so already, thank you; and
hopefully the online people will do it as well.
1:25:20.320 --> 1:25:29.758
It is a possibility to tell us which things are
good and which are not, not the only one but the most
1:25:29.758 --> 1:25:30.937
efficient.
1:25:31.851 --> 1:25:35.871
So thanks to all the students doing it.
Okay, then thank you.