WEBVTT
0:00:01.721 --> 0:00:05.064
Hey, and welcome to today's lecture.
0:00:06.126 --> 0:00:13.861
What we want to do today is finish what we
did last time, so we started
0:00:13.861 --> 0:00:22.192
looking at the neural machine translation system,
and we have seen most of the components of the sequence
0:00:22.192 --> 0:00:22.787
model.
0:00:22.722 --> 0:00:29.361
What we're still missing is the transformer-based
architecture, so that is mainly the self-attention.
0:00:29.849 --> 0:00:31.958
That is what we want to look at at the beginning of today.
0:00:32.572 --> 0:00:39.315
And then the main part of today's lecture
will be decoding.
0:00:39.315 --> 0:00:43.992
That means we know how to train the model.
0:00:44.624 --> 0:00:47.507
So decoding: which output can the model
0:00:47.667 --> 0:00:53.359
generate, and the idea is how we find
that and what challenges are there.
0:00:53.359 --> 0:00:59.051
Since it's autoregressive, we will see that
it's not as easy as for other tasks.
0:00:59.359 --> 0:01:08.206
While generating the translation step by step,
we might make additional errors.
0:01:09.069 --> 0:01:16.464
But let's start with self-attention, so
what we looked at until now was an RNN-based model.
0:01:16.816 --> 0:01:27.931
And in RNN-based models you always take
the last hidden state, you take your input, and you
0:01:27.931 --> 0:01:31.513
generate a new hidden state.
0:01:31.513 --> 0:01:35.218
This is more or less the standard.
0:01:35.675 --> 0:01:41.088
And one challenge in this is that we always
store all our history in one single hidden
0:01:41.088 --> 0:01:41.523
state.
0:01:41.781 --> 0:01:50.235
We saw that this is a problem when going from
encoder to decoder, and that is why we then
0:01:50.235 --> 0:01:58.031
introduced the attention mechanism, so that
we can look back and see all the parts.
0:01:59.579 --> 0:02:06.059
However, in the decoder we still have this
issue: we are still storing all information
0:02:06.059 --> 0:02:12.394
in one hidden state, and we might do things
like here, where we start to overwrite things
0:02:12.394 --> 0:02:13.486
and we forget.
0:02:14.254 --> 0:02:23.575
So the idea is: can we do something similar
to what we do between encoder and decoder within
0:02:23.575 --> 0:02:24.907
the decoder?
0:02:26.526 --> 0:02:33.732
And the idea is, each time we're generating
here a new hidden state, it will not only depend
0:02:33.732 --> 0:02:40.780
on the previous one, but we will look at the
whole sequence and focus on different parts,
0:02:40.780 --> 0:02:46.165
as we did in attention, in order to generate
our new representation.
0:02:46.206 --> 0:02:53.903
So each time we generate a new representation,
we will look into what is important now to
0:02:53.903 --> 0:02:54.941
understand it.
0:02:55.135 --> 0:03:00.558
You may want to understand what is important for "much".
0:03:00.558 --> 0:03:08.534
You might want to look at "very" and at "like",
so that you see it's very much about liking.
0:03:08.808 --> 0:03:24.076
So the idea is that we are not storing everything
in one state; each time we are looking at the full sequence.
0:03:25.125 --> 0:03:35.160
And that is achieved by no longer being really
recurrent: the hidden states here don't depend
0:03:35.160 --> 0:03:37.086
on the same layer.
0:03:37.086 --> 0:03:42.864
Instead, we are always looking at the previous
layer.
0:03:42.942 --> 0:03:45.510
So we always have all the information from where
we are coming.
0:03:47.147 --> 0:03:51.572
So how does this attention work in detail?
0:03:51.572 --> 0:03:56.107
We start with our initial hidden states.
0:03:56.107 --> 0:04:08.338
So, for example: we had the three
terms already, the query, the key and the value;
0:04:08.338 --> 0:04:12.597
that was motivated by a database lookup.
0:04:12.772 --> 0:04:20.746
We are comparing the query to the keys of all the
other positions, and then we are merging the values.
0:04:21.321 --> 0:04:35.735
Before, there was a difference between the decoder
and the encoder.
0:04:35.775 --> 0:04:41.981
Here you could assume they are all the same, because we are
comparing ourselves to ourselves.
0:04:41.981 --> 0:04:49.489
However, we can make them different by just
learning a linear projection.
0:04:49.529 --> 0:05:01.836
So you learn here some projection based on
what you need in order to ask which question.
0:05:02.062 --> 0:05:11.800
That is: the query is what I want to compare,
the key is what I provide to others for comparison,
0:05:11.800 --> 0:05:13.748
and the value is what I give out.
0:05:14.014 --> 0:05:23.017
This is not hand-defined but learned,
so it's like three linear projections that
0:05:23.017 --> 0:05:26.618
you apply on all of these hidden states.
0:05:26.618 --> 0:05:32.338
That is the first thing, based on your initial
hidden states.
0:05:32.612 --> 0:05:37.249
And now you can do exactly as before: you
can do the attention.
0:05:37.637 --> 0:05:40.023
How did the attention work?
0:05:40.023 --> 0:05:45.390
The first thing is, we are comparing our query
to all the keys.
0:05:45.445 --> 0:05:52.713
And that is now the difference: before, the
query was from the decoder and the keys were
0:05:52.713 --> 0:05:54.253
from the encoder.
0:05:54.253 --> 0:06:02.547
Now it's all from the same side, so we compare
the first hidden state to the keys of all the others.
0:06:02.582 --> 0:06:06.217
We're getting some value here:
0:06:06.217 --> 0:06:12.806
how important is this information to better
understand this position?
0:06:13.974 --> 0:06:19.103
And these are just floating point numbers.
0:06:19.103 --> 0:06:21.668
They are normalized so that they sum to one.
0:06:22.762 --> 0:06:30.160
And that is the first step, so let's do it first
for the first state.
0:06:30.470 --> 0:06:41.937
What we can then do is multiply each value,
as we have done before, with the importance
0:06:41.937 --> 0:06:43.937
of each state.
0:06:45.145 --> 0:06:47.686
And then we have here the new hidden state.
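The computation just described can be sketched in a few lines of NumPy. This is a toy illustration, not the lecture's code; all sizes are made up and the projection matrices are random stand-ins for learned parameters: three projections produce queries, keys, and values from the previous layer's states, each query is compared to every key, the scores are normalized, and the values are merged.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, Wq, Wk, Wv):
    """One self-attention layer: every new state is a weighted sum
    over ALL states of the previous layer (no recurrence)."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv           # three learned projections
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # compare each query to every key
    weights = softmax(scores, axis=-1)         # normalized importance values
    return weights @ V                         # merge the values

rng = np.random.default_rng(0)
d = 8                                          # toy hidden size
H = rng.normal(size=(5, d))                    # 5 positions of the previous layer
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(H, Wq, Wk, Wv)
assert out.shape == (5, d)                     # one new state per position
```

Note that all five output states come out of one matrix multiplication, which is exactly the parallelism mentioned below.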
0:06:48.308 --> 0:06:57.862
You see, now this new hidden state depends
on all the hidden states of the whole sequence
0:06:57.862 --> 0:06:59.686
of the previous layer.
0:06:59.879 --> 0:07:01.739
One important thing:
0:07:01.739 --> 0:07:08.737
this one doesn't really depend on the others in the
same layer, so the hidden states here don't depend on each other.
0:07:09.029 --> 0:07:15.000
So it only depends on the hidden states of
the previous layer, but it depends on all the
0:07:15.000 --> 0:07:18.664
hidden states, and that is of course a big
advantage.
0:07:18.664 --> 0:07:25.111
So on the one hand, information can directly
flow from each hidden state; before, the information
0:07:25.111 --> 0:07:27.214
flow was always a bit limited.
0:07:28.828 --> 0:07:35.100
And the independence is important, so we can
calculate all these hidden states in parallel.
0:07:35.100 --> 0:07:41.371
That's another big advantage of self-attention:
we can calculate all the hidden states
0:07:41.371 --> 0:07:46.815
in one layer in parallel, and therefore it's
ideally designed for GPUs and fast.
0:07:47.587 --> 0:07:50.235
Then we can do the same thing for the second
hidden state.
0:07:50.530 --> 0:08:06.866
And the only difference here is how we calculate
the query.
0:08:07.227 --> 0:08:15.733
Getting these values is different because
we use a different query, and then we get
0:08:15.733 --> 0:08:17.316
our new hidden state.
0:08:18.258 --> 0:08:26.036
[Student question, partly inaudible, about whether
the order of the words matters here, since it seems it might not.]
0:08:26.036 --> 0:08:26.498
[Inaudible.]
0:08:27.127 --> 0:08:33.359
That's a very good question about the
initial setup.
0:08:33.359 --> 0:08:38.503
That is exactly one of the new things in the architecture.
0:08:38.503 --> 0:08:44.042
Maybe at first you would think of it as a very big
disadvantage.
0:08:44.384 --> 0:08:49.804
So this hidden state would be the same if
the word order were different.
0:08:50.650 --> 0:08:59.983
And of course this state is a weighted sum,
so if this state were at a different position,
0:08:59.983 --> 0:09:06.452
except for this correspondence, the word order
is completely lost.
0:09:06.706 --> 0:09:17.133
Therefore, just doing self-attention wouldn't
work at all, because we know word order is important
0:09:17.133 --> 0:09:21.707
and there is a completely different meaning.
0:09:22.262 --> 0:09:26.277
So we introduce the word position again.
0:09:26.277 --> 0:09:33.038
The main idea is to put the position already
into your embeddings.
0:09:33.533 --> 0:09:39.296
Then of course the position is there and you
don't lose it anymore.
0:09:39.296 --> 0:09:46.922
So mainly, if your input representation here
encodes that it is at the second position, your output
0:09:46.922 --> 0:09:48.533
will be different.
0:09:49.049 --> 0:09:54.585
And we'll see how you encode it, but that's essential
in order to get this to work.
0:09:57.137 --> 0:10:08.752
But before we come to the next slide,
one other thing that is typically done is multi-head
0:10:08.752 --> 0:10:10.069
attention.
0:10:10.430 --> 0:10:15.662
And it might be that in order to understand
"much", it might be good that in some way we
0:10:15.662 --> 0:10:19.872
focus on "like", and in some way we focus
on "very", but not equally.
0:10:19.872 --> 0:10:25.345
But maybe, to understand it, we should again look
into these on different dimensions.
0:10:25.905 --> 0:10:31.393
And therefore what we're doing is, we're not just
doing the self-attention once, but we're
0:10:31.393 --> 0:10:35.031
doing it n times, based on your number of multi-head
attentions.
0:10:35.031 --> 0:10:41.299
So in typical examples, the number of heads
people are talking about is like eight. So you're
0:10:41.299 --> 0:10:50.638
doing this process and have different queries
and keys, so you can focus on different things.
0:10:50.790 --> 0:10:52.887
How can you generate eight different ones?
0:10:53.593 --> 0:11:07.595
It's quite easy here: instead of
having one linear projection you can have eight
0:11:07.595 --> 0:11:09.326
different ones.
0:11:09.569 --> 0:11:13.844
And it might be that sometimes you're looking
more into one thing, and sometimes you're looking
0:11:13.844 --> 0:11:14.779
more into the other.
0:11:15.055 --> 0:11:24.751
So that's of course nice with this type of
learned approach, because we can automatically
0:11:24.751 --> 0:11:25.514
learn it.
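A rough sketch of this multi-head idea, with the eight heads mentioned above (sizes and the per-head dimension split are illustrative assumptions, not the lecture's exact setup): each head gets its own learned projections, the heads run independently, and their outputs are concatenated and mixed by one more learned projection.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(H, heads, Wo):
    """Run self-attention once per head with different learned
    projections, then concatenate and mix with an output projection."""
    outs = []
    for Wq, Wk, Wv in heads:                   # each head can focus on something else
        Q, K, V = H @ Wq, H @ Wk, H @ Wv
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
        outs.append(A @ V)
    return np.concatenate(outs, axis=-1) @ Wo

rng = np.random.default_rng(1)
d, h = 16, 8                                   # 8 heads, as mentioned in the lecture
dk = d // h                                    # each head works in a smaller subspace
heads = [tuple(rng.normal(size=(d, dk)) for _ in range(3)) for _ in range(h)]
Wo = rng.normal(size=(d, d))
out = multi_head_attention(rng.normal(size=(4, d)), heads, Wo)
assert out.shape == (4, d)
```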
0:11:29.529 --> 0:11:36.629
And what you correctly said is, it's positionally
independent, so it doesn't really capture the
0:11:36.629 --> 0:11:39.176
order, which should be important.
0:11:39.379 --> 0:11:47.686
So how can we do that? The idea is we are
just encoding it directly into the embedding,
0:11:47.686 --> 0:11:52.024
so into the starting representation.
0:11:52.512 --> 0:11:55.873
How do we get that? We start with our
embeddings.
0:11:55.873 --> 0:11:58.300
Just imagine this is the embedding of "I".
0:11:59.259 --> 0:12:06.169
And then we additionally have this positional
encoding.
0:12:06.169 --> 0:12:10.181
And this positional encoding is just sine curves
0:12:10.670 --> 0:12:19.564
with different wavelengths, so with different
lengths of your signal, as you see here.
0:12:20.160 --> 0:12:37.531
And the number of functions you have is exactly
the number of dimensions you have in your embedding.
0:12:38.118 --> 0:12:51.091
And what you then do is take the first function,
and based on your position you read off a value
0:12:51.091 --> 0:12:51.955
for your word.
0:12:52.212 --> 0:13:02.518
And you see, now if you put it at this position,
of course it will get a different value.
0:13:03.003 --> 0:13:12.347
And thereby in each position a different function
value is combined with the embedding.
0:13:12.347 --> 0:13:19.823
This is then the representation for the word at the first
position.
0:13:20.020 --> 0:13:34.922
If you have it already encoded in the input,
then of course the model is able to keep the
0:13:34.922 --> 0:13:38.605
position information.
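The sine-curve construction just described can be sketched as follows. This is the common sinusoidal formulation; the exact constants (e.g. the 10000 base) follow the original transformer paper and may differ from the lecture's slides.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """One sine/cosine signal per embedding dimension, each with a
    different wavelength; the row for position p is combined with the
    word embedding at position p."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

pe = positional_encoding(50, 16)
# The same word gets a different vector at position 0 and position 1,
# so the order information survives into the attention layers.
assert pe.shape == (50, 16)
assert not np.allclose(pe[0], pe[1])
```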
0:13:38.758 --> 0:13:48.045
But you can also learn your embeddings
in a way that they collaborate optimally
0:13:48.045 --> 0:13:49.786
with these types of encodings.
0:13:51.451 --> 0:13:59.351
Is that somehow clear? Yes, there is a question?
0:14:06.006 --> 0:14:13.630
[Student question, partly inaudible, about telling
the first and the second position apart
0:14:16.576 --> 0:14:17.697
when a function has a long wavelength.]
0:14:17.697 --> 0:14:19.624
[Rest of the question inaudible.]
0:14:21.441 --> 0:14:26.927
It could be an issue, because if you have a
very short wavelength there might be quite
0:14:26.927 --> 0:14:28.011
big differences.
0:14:28.308 --> 0:14:33.577
And it might also be that it then depends,
of course, on what type of word embedding
0:14:33.577 --> 0:14:34.834
you've learned.
0:14:34.834 --> 0:14:37.588
Is the dimension where you have long wavelengths
0:14:37.588 --> 0:14:43.097
important for your embedding or not? That's what
I mean: the model can somehow
0:14:43.097 --> 0:14:47.707
learn that by putting more information into
one of the embedding dimensions.
0:14:48.128 --> 0:14:54.560
So it is incorporated, and I would assume it's learning
it a bit, but I haven't seen
0:14:54.560 --> 0:14:57.409
detailed studies of how different they are.
0:14:58.078 --> 0:15:07.863
It's also a bit difficult, because really measuring
how similar or different a word embedding is isn't that
0:15:07.863 --> 0:15:08.480
easy.
0:15:08.480 --> 0:15:13.115
You can do, of course, the average distance.
0:15:14.114 --> 0:15:21.393
[Student question:] So are the wavelengths learned
by the model, or are they fixed and the
0:15:21.393 --> 0:15:21.986
model adapts?
0:15:24.164 --> 0:15:30.165
I believe they are fixed and the model learns around
them; there's a different way of doing it.
0:15:30.165 --> 0:15:32.985
The other thing you can do is the following.
0:15:33.213 --> 0:15:36.945
So you can learn a second embedding which
says this is position one,
0:15:36.945 --> 0:15:38.628
this is position two, and so on.
0:15:38.628 --> 0:15:42.571
Like for words, you could learn fixed embeddings
and then add them up.
0:15:42.571 --> 0:15:45.094
So then it would be the same thing, just
learned.
0:15:45.094 --> 0:15:46.935
There is one disadvantage of this.
0:15:46.935 --> 0:15:51.403
Does anybody have an idea what could be the
disadvantage of a learned embedding?
0:15:54.955 --> 0:16:00.000
[Student answer, inaudible.]
0:16:00.000 --> 0:16:01.751
[Inaudible.]
0:16:02.502 --> 0:16:08.323
You would only be good at positions you have
seen often, and especially for long sequences
0:16:08.323 --> 0:16:14.016
you might have seen those positions very rarely,
and then it's normally not performing that well,
0:16:14.016 --> 0:16:17.981
while here it can better learn a more general
representation.
0:16:18.298 --> 0:16:22.522
There is another approach, which we won't discuss
here much:
0:16:22.522 --> 0:16:25.964
I guess it is what is called relative attention.
0:16:25.945 --> 0:16:32.570
And in this case you don't learn absolute
positions, but in your calculation of the similarity
0:16:32.570 --> 0:16:39.194
you take the relative distance into account,
and have a different similarity depending on
0:16:39.194 --> 0:16:40.449
how far apart they are.
0:16:40.660 --> 0:16:45.898
And then you don't need to encode it beforehand;
it would rather happen within your comparison.
0:16:46.186 --> 0:16:53.471
So when you compare how similar two things are,
you of course also take the relative position into account.
0:16:55.715 --> 0:17:03.187
[Student question, partly inaudible, about the
multiple ways of combining the position with
0:17:03.187 --> 0:17:03.607
the embedding.]
0:17:17.557 --> 0:17:21.931
The encoder can be bidirectional.
0:17:21.931 --> 0:17:30.679
We have everything from the beginning, so we
can have a model where each state sees the whole input.
0:17:31.111 --> 0:17:36.455
The decoder in training of course also has everything
available, but during inference you always have
0:17:36.455 --> 0:17:41.628
only the past available, so you can only look
into the previous words and not into the future,
0:17:41.628 --> 0:17:46.062
because if you generate word by word, you don't
know what will be there in the future.
0:17:46.866 --> 0:17:53.180
And so we also have to consider this somehow
in the attention; until now we looked more
0:17:53.180 --> 0:17:54.653
at the encoder style.
0:17:54.653 --> 0:17:58.652
So if you look at this type of model, it's
bidirectional.
0:17:58.652 --> 0:18:03.773
So for this hidden state we are looking into
the past and into the future.
0:18:04.404 --> 0:18:14.436
So the question is, can we also do this
unidirectionally, so that you only look into
0:18:14.436 --> 0:18:15.551
the past?
0:18:15.551 --> 0:18:22.573
And the nice thing is, this is even easier
than for RNNs.
0:18:23.123 --> 0:18:29.738
There we would need different parameters
and models, because you have a forward direction
and a backward direction.
0:18:31.211 --> 0:18:35.679
For attention, that is very simple.
0:18:35.679 --> 0:18:39.403
We are doing what is called masking.
0:18:39.403 --> 0:18:45.609
If you want to have a unidirectional model, you
mask these ones out.
0:18:45.845 --> 0:18:54.355
So the first hidden state, it's maybe
only looking at itself.
0:18:54.894 --> 0:19:05.310
The second one looks at the first and the
second, so you're always masking out all values
0:19:05.310 --> 0:19:07.085
in the future.
0:19:07.507 --> 0:19:13.318
And thereby, with the same parameters and
the same model,
0:19:13.318 --> 0:19:15.783
you can have a unidirectional one.
0:19:16.156 --> 0:19:29.895
In the decoder you do the masked self-attention,
where you only look into the past and you don't
0:19:29.895 --> 0:19:30.753
look into the future.
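This masking can be sketched on top of the earlier toy NumPy attention (again an illustration, not the lecture's code): future positions get a score of minus infinity before the normalization, so they receive exactly zero weight.

```python
import numpy as np

def masked_self_attention(H, Wq, Wk, Wv):
    """Decoder-style self-attention: position i may only look at
    positions <= i, enforced by setting future scores to -inf."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    n = scores.shape[0]
    scores = np.where(np.tri(n, dtype=bool), scores, -np.inf)  # mask the future
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = e / e.sum(axis=-1, keepdims=True)
    return A, A @ V

rng = np.random.default_rng(2)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
A, out = masked_self_attention(rng.normal(size=(5, d)), Wq, Wk, Wv)
assert np.allclose(np.triu(A, k=1), 0)   # no weight on future positions
assert np.isclose(A[0, 0], 1.0)          # the first state only sees itself
```

The parameters are the same as in the bidirectional case; only the mask changes, which is why the same model can be made unidirectional.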
0:19:32.212 --> 0:19:36.400
But then we have, of course, only looked at
the decoder itself.
0:19:36.616 --> 0:19:50.903
So the question is: how can we combine encoder
and decoder? And there we can take the decoder and
0:19:50.903 --> 0:19:54.114
just add a second attention.
0:19:54.374 --> 0:20:00.286
And then we're doing the cross-attention, which
attends from the decoder to the encoder.
0:20:00.540 --> 0:20:10.239
So in this one it's again that the queries
are the current state of the decoder, while the keys
0:20:10.239 --> 0:20:22.833
and values are the encoder states. You attend both onto
yourself, to get the meaning on the target side, and to the
encoder, to get the meaning of the source.
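The cross-attention can be sketched the same way as the toy self-attention above; the only change is where the queries and the keys/values come from (all sizes here are illustrative assumptions).

```python
import numpy as np

def cross_attention(dec_H, enc_H, Wq, Wk, Wv):
    """Cross-attention: queries come from the decoder states,
    keys and values from the encoder states."""
    Q = dec_H @ Wq
    K, V = enc_H @ Wk, enc_H @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = e / e.sum(axis=-1, keepdims=True)
    return A @ V                        # one source context per decoder position

rng = np.random.default_rng(3)
d = 8
enc_H = rng.normal(size=(6, d))         # 6 source positions
dec_H = rng.normal(size=(3, d))         # 3 target positions generated so far
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
ctx = cross_attention(dec_H, enc_H, Wq, Wk, Wv)
assert ctx.shape == (3, d)              # decoder length rows, mixed encoder values
```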
0:20:23.423 --> 0:20:25.928
So let's see then the full picture.
0:20:25.928 --> 0:20:33.026
This is now the typical picture of the transformer,
and where you use self-attention.
0:20:33.026 --> 0:20:36.700
So what you have is your word embeddings at the bottom.
0:20:37.217 --> 0:20:43.254
What you then apply is here the positional
encoding. We are then doing the self-attention
0:20:43.254 --> 0:20:46.734
to all the others, and this can be bidirectional.
0:20:47.707 --> 0:20:54.918
You normally do another feed-forward layer,
just to learn additional
0:20:54.918 --> 0:20:55.574
things.
0:20:55.574 --> 0:21:02.785
So you're also having a feed-forward layer
which takes your hidden state and generates
0:21:02.785 --> 0:21:07.128
your new hidden state, because we are making things
deeper.
0:21:07.747 --> 0:21:15.648
Then this blue part you can stack several
times, so you can have many layers.
0:21:16.336 --> 0:21:30.256
In addition, you have these blue arrows, the residual
connections; we talked about this for RNNs: if you are
0:21:30.256 --> 0:21:35.883
back-propagating your error from the top, it can vanish.
0:21:36.436 --> 0:21:48.578
In order to prevent that, we are not really
learning how to transform everything, but instead
0:21:48.578 --> 0:21:51.230
only what we have to change.
0:21:51.671 --> 0:22:00.597
You're calculating what should be changed
with this one.
0:22:00.597 --> 0:22:09.365
The backward pass can skip each layer, and the learning
is just of the change.
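The residual idea can be stated in one line (a toy sketch, not the lecture's code): the block's output is the input plus whatever the sublayer computes, so the sublayer only has to model the change, and the gradient can flow straight through the addition.

```python
import numpy as np

def residual_block(x, sublayer):
    """Residual connection: the layer only learns the *change*;
    the identity path carries the input (and the gradient) through."""
    return x + sublayer(x)

x = np.ones(4)
y = residual_block(x, lambda h: 0.1 * h)   # sublayer outputs a small correction
assert np.allclose(y, 1.1)                  # input preserved plus the learned change
```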
0:22:10.750 --> 0:22:21.632
That is the encoder. Before we go to the decoder:
0:22:21.632 --> 0:22:30.655
do we have any additional questions?
0:22:31.471 --> 0:22:33.220
That's a very good point.
0:22:33.553 --> 0:22:38.709
Yeah, you normally take, at least in
the default architecture, only the
0:22:38.709 --> 0:22:38.996
top
0:22:40.000 --> 0:22:40.388
of the encoder.
0:22:40.388 --> 0:22:42.383
Of course, you can do other things.
0:22:42.383 --> 0:22:45.100
We investigated, for example, the lowest layer:
0:22:45.100 --> 0:22:49.424
the decoder looking at the lowest layer
of the encoder and not at the top.
0:22:49.749 --> 0:23:05.342
You can average, or you can even learn it; theoretically,
what you can also do is attend to all of them.
0:23:05.785 --> 0:23:11.180
You can attend to all possible layers and states.
0:23:11.180 --> 0:23:18.335
But the default thing is that you
only use the top layer.
0:23:20.580 --> 0:23:31.999
In the decoder, what we're doing is firstly
the same positional encoding, and then we're doing
0:23:31.999 --> 0:23:36.419
self-attention on the decoder side.
0:23:37.837 --> 0:23:43.396
Of course, here it's now important that we're doing
the masked self-attention, so that we're only
0:23:43.396 --> 0:23:45.708
attending to the past and not to the future.
0:23:47.287 --> 0:24:02.698
Here you see the difference: in this case
the keys and values are from the encoder, and
0:24:02.698 --> 0:24:03.554
the queries from the decoder.
0:24:03.843 --> 0:24:12.103
You're comparing the query to all the encoder hidden
states, calculating the similarity, and then
0:24:12.103 --> 0:24:13.866
you do the weighted sum.
0:24:14.294 --> 0:24:17.236
And that is added to what is here.
0:24:18.418 --> 0:24:29.778
Then you have a linear layer, and again this
green part is stacked several times, and then
you have the output.
0:24:32.232 --> 0:24:36.987
[Student question:] So each decoder layer,
0:24:36.987 --> 0:24:46.039
every one of those, attends to the last layer,
0:24:46.246 --> 0:24:51.007
always and only to the last or the top layer
of the encoder?
0:24:57.197 --> 0:25:00.127
Good. So that would be it.
0:25:01.501 --> 0:25:12.513
For sequence-to-sequence models we have looked at attention,
and before we go to decoding, do you have any
0:25:12.513 --> 0:25:18.020
more questions on this type of architecture?
0:25:20.480 --> 0:25:30.049
The transformer was first used in machine translation,
but now it's the standard thing for doing nearly
0:25:30.049 --> 0:25:32.490
any type of sequence model.
0:25:33.013 --> 0:25:35.984
Even large language models:
0:25:35.984 --> 0:25:38.531
they are a bit similar.
0:25:38.531 --> 0:25:45.111
They are just throwing away the encoder and
the cross-attention.
0:25:45.505 --> 0:25:59.329
And that is maybe interesting: it was important
to have this attention because you cannot store
0:25:59.329 --> 0:26:01.021
everything in one state.
0:26:01.361 --> 0:26:05.357
The interesting thing with the self-attention is,
now we can attend to everything.
0:26:05.745 --> 0:26:13.403
So you can again go back to your initial model
and have just a simple sequence model, source and then
0:26:13.403 --> 0:26:14.055
target.
0:26:14.694 --> 0:26:24.277
That would be a more language-model style,
or people call it a decoder-only model, where
0:26:24.277 --> 0:26:26.617
you throw this away.
0:26:27.247 --> 0:26:30.327
The nice thing is, because of your self-attention,
0:26:30.327 --> 0:26:34.208
the original problem why you introduced
the attention,
0:26:34.208 --> 0:26:39.691
you don't have that anymore, because not
everything is summarized; each time you
0:26:39.691 --> 0:26:44.866
generate, you're looking back at all the previous
words, the source and the target.
0:26:45.805 --> 0:26:51.734
And there is a lot of work on: is it really
important to have an encoder-decoder model, or
0:26:51.734 --> 0:26:54.800
is a decoder-only model as good?
0:26:54.800 --> 0:27:00.048
But the comparison is not that easy, because
how many parameters do you have?
0:27:00.360 --> 0:27:08.832
So I think the general idea at the moment is,
at least for machine translation, it's normally
0:27:08.832 --> 0:27:17.765
a bit better to have an encoder-decoder model
and not a decoder-only model where you just concatenate
0:27:17.765 --> 0:27:20.252
the source and the target.
0:27:21.581 --> 0:27:24.073
But there is not really a big difference anymore.
0:27:24.244 --> 0:27:29.891
Because this big issue, which we had initially,
that everything is stored in one hidden
0:27:29.891 --> 0:27:31.009
state, is no longer there.
0:27:31.211 --> 0:27:45.046
Of course, the advantage maybe here is that
you give it a bias as to which information belongs
to which language.
0:27:45.285 --> 0:27:53.702
While in a decoder-only model this all is
merged into one thing, and sometimes it is good
0:27:53.702 --> 0:28:02.120
to give models a bit of bias: okay, you should
maybe treat things separately, and you should
0:28:02.120 --> 0:28:03.617
look at them differently.
0:28:04.144 --> 0:28:11.612
And there is one other difference, one other
disadvantage maybe, of a decoder-only model.
0:28:16.396 --> 0:28:19.634
Think about the source sentence and how
it's treated.
0:28:21.061 --> 0:28:33.787
In the encoder-decoder architecture, the encoder can
look at the whole sentence for every state, and that
0:28:33.787 --> 0:28:35.563
causes a difference.
0:28:35.475 --> 0:28:43.178
If you only have a decoder, that has to be
unidirectional, because on the decoder side,
0:28:43.178 --> 0:28:51.239
for the generation, you need it, and so your
input is read state by state, so you don't have
0:28:51.239 --> 0:28:54.463
bidirectional information.
0:28:56.596 --> 0:29:05.551
[Student question, partly inaudible:] The decoder
receives a sequence of embeddings
0:29:05.551 --> 0:29:11.082
with positional encoding and outputs a long vector.
0:29:11.031 --> 0:29:17.148
[I don't understand how the outputs of one layer
are connected as inputs to the next,
0:29:17.097 --> 0:29:20.060
and whether the output encoding is the same at each layer.]
0:29:21.681 --> 0:29:27.438
Okay, it's a very good point: this output
encoding is only done on the top layer.
0:29:27.727 --> 0:29:32.012
So this green part is the only thing repeated.
0:29:32.012 --> 0:29:38.558
You have the word embedding and the position
embedding.
0:29:38.558 --> 0:29:42.961
You have one layer of decoder, which outputs hidden states.
0:29:43.283 --> 0:29:48.245
Then you stack in the second one, the third
one, the fourth one, and then on the top
0:29:48.208 --> 0:29:55.188
layer you put this projection layer, which
takes a one-thousand-dimensional vector and
0:29:55.188 --> 0:30:02.089
generates, based on your vocabulary of maybe
ten thousand, a softmax layer which gives you
0:30:02.089 --> 0:30:04.442
the probability of all words.
0:30:06.066 --> 0:30:22.369
[Student question, partly inaudible, about applying
the mask only to part of the input
0:30:22.262 --> 0:30:27.015
rather than everywhere.]
0:30:27.647 --> 0:30:33.140
Yes, there is work on that; I think we will discuss
it in the lecture on pre-trained models.
0:30:33.493 --> 0:30:39.756
It's called [inaudible], where you exactly do that.
0:30:39.756 --> 0:30:48.588
If you look at the attention as a matrix, it's like
triangular here.
0:30:48.708 --> 0:30:53.018
And here it's a full matrix, so everybody is
attending to each position.
0:30:53.018 --> 0:30:54.694
Here you're only attending to the past.
0:30:54.975 --> 0:31:05.744
Then you can do the one in between, where this
part attends to everything, but the rest does not.
0:31:06.166 --> 0:31:13.961
So you have a bit more that is possible, and
we'll have that in the lecture on pre-trained
0:31:13.961 --> 0:31:14.662
models.
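The three attention patterns discussed here, full, triangular, and the in-between one where only a first block attends bidirectionally, can be written as boolean matrices. This is an illustrative sketch; `prefix_mask` and the prefix length `k` are made-up names, not from the lecture.

```python
import numpy as np

def full_mask(n):
    return np.ones((n, n), dtype=bool)   # encoder: attend everywhere

def causal_mask(n):
    return np.tri(n, dtype=bool)         # decoder: only the past (triangular)

def prefix_mask(n, k):
    """In-between pattern: the first k positions attend to each other
    fully, the remaining positions attend only causally."""
    m = causal_mask(n)
    m[:k, :k] = True
    return m

m = prefix_mask(5, 2)
assert m[0, 1] and m[1, 0]   # inside the prefix: bidirectional
assert not m[2, 3]           # after the prefix: no look into the future
```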
0:31:18.478 --> 0:31:27.440
So we now know how to build a translation
system, but of course we don't want to have
0:31:27.440 --> 0:31:30.774
a translation system just by itself.
0:31:31.251 --> 0:31:40.037
Now, given this model and an input sentence, how
can we generate an output?
0:31:40.037 --> 0:31:49.398
The general idea is still: what we really
want to do is, we start with the model,
0:31:49.398 --> 0:31:53.893
we generate different possible translations,
0:31:54.014 --> 0:31:59.754
we score them with the log probability that we're
getting, so for each input and output pair
0:31:59.754 --> 0:32:05.430
we can calculate the log probability, which
is the sum of the log probabilities for each
0:32:05.430 --> 0:32:09.493
word in there, and then we can find what is
the most probable.
0:32:09.949 --> 0:32:15.410
However, that's a bit complicated, we will
see, because we can't look at all possible translations.
0:32:15.795 --> 0:32:28.842
There is an infinite, or at least very large, number
of possible translations, so we have to do it somehow
0:32:28.842 --> 0:32:31.596
in a more intelligent way.
0:32:32.872 --> 0:32:37.821
So what do we want to do in the rest of
the lecture?
0:32:37.821 --> 0:32:40.295
What is the search problem?
0:32:40.295 --> 0:32:44.713
Then we will look at different search algorithms.
0:32:45.825 --> 0:32:56.636
We will compare model and search errors: there
can be errors of the model, where the model
0:32:56.636 --> 0:33:03.483
is not giving the highest score to the best
translation.
0:33:03.903 --> 0:33:21.069
This is always about searching for the best translation
given one model, which is often also interesting.
0:33:24.004 --> 0:33:29.570 | |
And how do we do the search? | |
0:33:29.570 --> 0:33:41.853 | |
We want to find the translation where the | |
reference is minimal. | |
0:33:42.042 --> 0:33:44.041 | |
So the nice thing is SMT. | |
0:33:44.041 --> 0:33:51.347 | |
It wasn't the case in SMT, but in neural machine | |
translation we can generate any possible translation, so | |
0:33:51.347 --> 0:33:53.808 | |
at least within our vocabulary. | |
0:33:53.808 --> 0:33:58.114 | |
But if we have BPE we can really generate | |
any possible. | |
0:33:58.078 --> 0:34:04.604 | |
translation. In theory we could always minimize | |
that, but yeah, we can't do it that easily because | |
0:34:04.604 --> 0:34:07.734 | |
of course we don't have the reference at hand. | |
0:34:07.747 --> 0:34:10.384 | |
If we have a reference, it's not a problem. | |
0:34:10.384 --> 0:34:13.694 | |
We know what we are searching for, but at test | |
time we don't have it. | |
0:34:14.054 --> 0:34:23.886 | |
So how can we then model this by just finding | |
the translation with the highest probability? | |
0:34:23.886 --> 0:34:29.015 | |
Looking at it, we want to find the translation. | |
0:34:29.169 --> 0:34:32.525 | |
The idea is that our model is a good approximation. | |
0:34:32.525 --> 0:34:34.399 | |
That's how we train it. | |
0:34:34.399 --> 0:34:36.584 | |
What is a good translation? | |
0:34:36.584 --> 0:34:43.687 | |
And if we find translation with the highest | |
probability, this should also give us the best | |
0:34:43.687 --> 0:34:44.702 | |
translation. | |
0:34:45.265 --> 0:34:56.965 | |
And that is then, of course, different from | |
the search error; a model error is that the model | |
0:34:56.965 --> 0:35:02.076 | |
doesn't predict the best translation. | |
0:35:02.622 --> 0:35:08.777 | |
How can we do the basic search? First of all, | |
basic search seems to be very easy: | |
0:35:08.777 --> 0:35:15.003 | |
what we can do is run the forward | |
pass for the whole encoder, and that's how it | |
0:35:15.003 --> 0:35:21.724 | |
starts. The input sentence is known, so you can put | |
in the input sentence and calculate all your states | |
0:35:21.724 --> 0:35:22.573 | |
and hidden representations. | |
0:35:23.083 --> 0:35:35.508 | |
Then you can put in your sentence-start token and | |
you can generate. | |
0:35:35.508 --> 0:35:41.721 | |
Here you have the probability. | |
0:35:41.801 --> 0:35:52.624 | |
A good idea, which we will see later is a typical | |
algorithm, is to guess what you all would do: you | |
0:35:52.624 --> 0:35:54.788 | |
would then select the most probable word. | |
0:35:55.235 --> 0:36:06.265 | |
So if you generate here a probability distribution | |
over all the words in your vocabulary then | |
0:36:06.265 --> 0:36:08.025 | |
you can select from it. | |
0:36:08.688 --> 0:36:13.147 | |
Yeah, this is how autocompletion is done | |
in such systems. | |
0:36:14.794 --> 0:36:19.463 | |
Yeah, this is also why there you have to have | |
a model of possible extending. | |
0:36:19.463 --> 0:36:24.314 | |
It's more of a language model, but then this | |
is one algorithm to do the search. | |
0:36:24.314 --> 0:36:26.801 | |
They maybe have also more advanced ones. | |
0:36:26.801 --> 0:36:32.076 | |
We will see that; so this search in auto- | |
completion should be exactly the same as the | |
0:36:32.076 --> 0:36:33.774 | |
search in machine translation. | |
0:36:34.914 --> 0:36:40.480 | |
So we'll see that this is not optimal; hopefully | |
it's not done that way there, but it is one approach for this | |
0:36:40.480 --> 0:36:41.043 | |
problem. | |
0:36:41.941 --> 0:36:47.437 | |
And what you can do then you can select this | |
word. | |
0:36:47.437 --> 0:36:50.778 | |
This was the best translation. | |
0:36:51.111 --> 0:36:57.675 | |
Because the decoder, of course, in the next | |
step needs to know what the best word | |
0:36:57.675 --> 0:37:02.396 | |
here was: you input it and it generates the next probability | |
distribution. | |
0:37:03.423 --> 0:37:14.608 | |
And then you get a new distribution, and you can | |
do the same thing: select the best word there, | |
0:37:14.608 --> 0:37:15.216 | |
and so on. | |
0:37:15.435 --> 0:37:22.647 | |
So you can continue doing that and always | |
get, hopefully, the best translation in the end. | |
0:37:23.483 --> 0:37:30.839 | |
The first question is, of course, how long | |
are you doing it? | |
0:37:30.839 --> 0:37:33.854 | |
Now we could go forever. | |
0:37:36.476 --> 0:37:52.596 | |
We had this stop token at the input side in training, | |
and we stop once we generate the stop token at the output. | |
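The greedy loop just described can be sketched as follows (a toy example; `step` and its probability table are invented stand-ins for the real decoder, which would condition on the encoder states):

```python
# Toy vocabulary; index 0 is the stop token.
VOCAB = ["<eos>", "das", "ist", "gut"]

def step(prefix):
    # Stand-in for the decoder: returns a probability distribution over
    # VOCAB given the prefix length (a real model uses the actual words).
    table = {
        0: [0.05, 0.60, 0.20, 0.15],
        1: [0.05, 0.10, 0.70, 0.15],
        2: [0.10, 0.05, 0.05, 0.80],
        3: [0.90, 0.02, 0.03, 0.05],
    }
    return table[len(prefix)]

def greedy_decode(max_len=10):
    prefix = []
    while len(prefix) < max_len:
        probs = step(prefix)
        best = max(range(len(VOCAB)), key=lambda i: probs[i])
        if VOCAB[best] == "<eos>":   # the stop token ends decoding
            break
        prefix.append(VOCAB[best])   # feed the chosen word back in
    return prefix

print(greedy_decode())  # ['das', 'ist', 'gut']
```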
0:37:53.974 --> 0:38:07.217 | |
And this is important because if we didn't | |
do that, then we wouldn't know when to stop. | |
0:38:10.930 --> 0:38:16.193 | |
So that seems to be a good idea, but is it | |
really? | |
0:38:16.193 --> 0:38:21.044 | |
Do we find the most probable sentence in this? | |
0:38:23.763 --> 0:38:25.154 | |
[unintelligible remark] | |
0:38:27.547 --> 0:38:41.823 | |
We are always selecting the highest probability | |
one, so it seems to be that this is a very | |
0:38:41.823 --> 0:38:45.902 | |
good solution. Or does anybody see a problem? | |
0:38:46.406 --> 0:38:49.909 | |
Yes, that is actually the problem. | |
0:38:49.909 --> 0:38:56.416 | |
You might do early decisions and you don't | |
have the global view. | |
0:38:56.796 --> 0:39:02.813 | |
And this problem happens because it is an | |
auto-regressive model. | |
0:39:03.223 --> 0:39:13.275 | |
So it happens because yeah, the output we | |
generate is the input in the next step. | |
0:39:13.793 --> 0:39:19.493 | |
And this, of course, is leading to problems. | |
0:39:19.493 --> 0:39:27.474 | |
If we always take the best local decision, it doesn't | |
mean you get the best overall sequence. | |
0:39:27.727 --> 0:39:33.941 | |
It would be different if you have a problem | |
where the output is not influencing your input. | |
0:39:34.294 --> 0:39:44.079 | |
Then this solution would give you the best | |
result, but since the output is influencing | |
0:39:44.079 --> 0:39:47.762 | |
your next input, this is not guaranteed. | |
0:39:48.268 --> 0:39:51.599 | |
Because one question now might be: why do we | |
have this type of model? | |
0:39:51.771 --> 0:39:58.946 | |
So why do we really need to put in here the | |
last target word? | |
0:39:58.946 --> 0:40:06.078 | |
You can also put in: And then always predict | |
the word and the nice thing is then you wouldn't | |
0:40:06.078 --> 0:40:11.846 | |
need to do beams or a difficult search because | |
then the output here wouldn't influence what | |
0:40:11.846 --> 0:40:12.975 | |
is inputted here. | |
0:40:15.435 --> 0:40:20.219 | |
Any idea why that might not be the best idea? | |
0:40:20.219 --> 0:40:24.588 | |
You'd just be translating each word independently. | |
0:40:26.626 --> 0:40:37.815 | |
The second one is right, yes, you're not generating | |
a coherent sentence. | |
0:40:38.058 --> 0:40:48.197 | |
We'll also see that later; it's called non-autoregressive | |
translation, so there is work | |
0:40:48.197 --> 0:40:49.223 | |
on that. | |
0:40:49.529 --> 0:41:02.142 | |
So you might know it roughly because you know | |
it's based on this hidden state, but it can | |
0:41:02.142 --> 0:41:08.588 | |
be that in the end you have your probability. | |
0:41:09.189 --> 0:41:14.633 | |
And then you're not modeling the dependencies | |
between the words within the target sentence. | |
0:41:14.633 --> 0:41:27.547 | |
For example, you can often express things in German | |
in several ways, and then you don't know which one to select. | |
0:41:27.547 --> 0:41:32.156 | |
That influences what you generate later. | |
0:41:33.393 --> 0:41:46.411 | |
Then you try to find a better way not only | |
based on the English sentence and the words | |
0:41:46.411 --> 0:41:48.057 | |
that come. | |
0:41:49.709 --> 0:42:00.954 | |
Yes, that is more like a two-step decoding, | |
but that is, of course, computationally a lot more expensive. | |
0:42:01.181 --> 0:42:15.978 | |
The first thing you can do, which is typically | |
done, is greedy search. | |
0:42:16.176 --> 0:42:32.968 | |
So let's first look at what the problem of greedy search | |
is, to make it a bit more clear. | |
0:42:34.254 --> 0:42:53.163 | |
And now you can extend them and you can extend | |
these and the joint probabilities. | |
0:42:54.334 --> 0:42:59.063 | |
The other thing is the second word. | |
0:42:59.063 --> 0:43:03.397 | |
You can do the second word dusk. | |
0:43:03.397 --> 0:43:07.338 | |
Now you see the problem here. | |
0:43:07.707 --> 0:43:17.507 | |
It is true that these have the highest probability, | |
but for these you have an extension. | |
0:43:18.078 --> 0:43:31.585 | |
So the problem is just because in one position | |
one hypothesis, so you can always call this | |
0:43:31.585 --> 0:43:34.702 | |
partial translation. | |
0:43:34.874 --> 0:43:41.269 | |
The blue one begin is higher, but the green | |
one can be better extended and it will overtake. | |
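In numbers, the overtaking effect described above looks like this (the probabilities are made up for illustration):

```python
# Two-step toy example of the greedy failure mode: the "blue" prefix
# wins step one, but the "green" prefix has a much better extension.
blue = 0.6 * 0.3    # best first word, weak continuation
green = 0.4 * 0.9   # worse first word, strong continuation

# Greedy would have kept only the blue prefix after step one,
# yet the green hypothesis has the higher total probability.
print(green > blue)  # True
```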
0:43:45.525 --> 0:43:54.672 | |
So the problem is if we are doing this greedy | |
search is that we might not end up in really | |
0:43:54.672 --> 0:43:55.275 | |
good. | |
0:43:55.956 --> 0:44:00.916 | |
So the first thing we could not do is like | |
yeah, we can just try. | |
0:44:00.880 --> 0:44:06.049 | |
All combinations that are there, so there | |
is the other direction. | |
0:44:06.049 --> 0:44:13.020 | |
So if the solution to to check the first one | |
is to just try all and it doesn't give us a | |
0:44:13.020 --> 0:44:17.876 | |
good result, maybe what we have to do is just | |
try everything. | |
0:44:18.318 --> 0:44:23.120 | |
The nice thing is if we try everything, we'll | |
definitely find the best translation. | |
0:44:23.463 --> 0:44:26.094 | |
So we won't have a search error. | |
0:44:26.094 --> 0:44:28.167 | |
We'll come to that later. | |
0:44:28.167 --> 0:44:32.472 | |
The interesting thing is our translation performance. | |
0:44:33.353 --> 0:44:37.039 | |
But we will definitely find the most probable | |
translation. | |
0:44:38.598 --> 0:44:44.552 | |
However, it's not really possible because | |
the number of combinations is just too high. | |
0:44:44.764 --> 0:44:57.127 | |
So the number of combinations is your vocabulary | |
size to the power of the length of your sentence. | |
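With illustrative numbers, the size of this search space is:

```python
# Full search space: vocabulary size to the power of the sentence
# length (both numbers are just illustrative).
vocab_size = 10_000
length = 20
combinations = vocab_size ** length

# 10^80 hypotheses: clearly impossible to enumerate.
print(combinations == 10 ** 80)  # True
```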
0:44:57.157 --> 0:45:03.665 | |
With a vocabulary of ten thousand or so, you can imagine that very | |
soon you will have so many possibilities here | |
0:45:03.665 --> 0:45:05.597 | |
that you cannot check all. | |
0:45:06.226 --> 0:45:13.460 | |
So this is not really an option, or an | |
algorithm that you can use in practice for machine | |
0:45:13.460 --> 0:45:14.493 | |
translation. | |
0:45:15.135 --> 0:45:24.657 | |
So maybe we have to do something in between | |
and yeah, not look at all but only look at | |
0:45:24.657 --> 0:45:25.314 | |
some. | |
0:45:26.826 --> 0:45:29.342 | |
And the easiest thing for that is okay. | |
0:45:29.342 --> 0:45:34.877 | |
Just do sampling, so if we don't know what | |
to look at, maybe it's good to randomly pick | |
0:45:34.877 --> 0:45:35.255 | |
some. | |
0:45:35.255 --> 0:45:40.601 | |
That's maybe not a very good algorithm, but | |
the basic idea is: we always randomly select | |
0:45:40.601 --> 0:45:42.865 | |
the word, of course, based on its probability. | |
0:45:43.223 --> 0:45:52.434 | |
We are doing that n times, and then we are | |
looking at which one at the end has the highest probability. | |
0:45:52.672 --> 0:45:59.060 | |
So we are not doing anymore really searching | |
for the best one, but we are more randomly | |
0:45:59.060 --> 0:46:05.158 | |
doing selections with the idea that we always | |
select the best one at the beginning. | |
0:46:05.158 --> 0:46:11.764 | |
So maybe it's better to do random, but of | |
course one important thing is how do we randomly | |
0:46:11.764 --> 0:46:12.344 | |
select? | |
0:46:12.452 --> 0:46:15.756 | |
If we just do uniform distribution, it would | |
be very bad. | |
0:46:15.756 --> 0:46:18.034 | |
You'll only have very bad translations. | |
0:46:18.398 --> 0:46:23.261 | |
Because in each position if you think about | |
it you have ten thousand possibilities. | |
0:46:23.903 --> 0:46:28.729 | |
Most of them are really bad decisions and | |
you shouldn't do that. | |
0:46:28.729 --> 0:46:35.189 | |
There is always only a very small number of good choices, | |
at least compared to the 10,000 translation options. | |
0:46:35.395 --> 0:46:43.826 | |
So if you have the sentence here, this is | |
an English sentence. | |
0:46:43.826 --> 0:46:47.841 | |
You can start with these and. | |
0:46:48.408 --> 0:46:58.345 | |
You're thinking about setting legal documents | |
in a legal document. | |
0:46:58.345 --> 0:47:02.350 | |
You should not change the. | |
0:47:03.603 --> 0:47:11.032 | |
The problem is we have a neural network, we | |
have a black box, so it's anyway a bit random. | |
0:47:12.092 --> 0:47:24.341 | |
It is considered, but you will see that if | |
you make it intelligent for clear sentences, | |
0:47:24.341 --> 0:47:26.986 | |
there is not that. | |
0:47:27.787 --> 0:47:35.600 | |
Is an issue we should consider that this one | |
might lead to more randomness, but it might | |
0:47:35.600 --> 0:47:39.286 | |
also be positive for machine translation. | |
0:47:40.080 --> 0:47:46.395 | |
At least I can't directly think of a good application | |
where it's positive, but if you think | |
0:47:46.395 --> 0:47:52.778 | |
about dialogue systems, for example, whereas | |
the similar architecture is nowadays also used, | |
0:47:52.778 --> 0:47:55.524 | |
you predict what the system should say. | |
0:47:55.695 --> 0:48:00.885 | |
Then you want to have randomness because it's | |
not always saying the same thing. | |
0:48:01.341 --> 0:48:08.370 | |
Machine translation is typically not you want | |
to have consistency, so if you have the same | |
0:48:08.370 --> 0:48:09.606 | |
input normally. | |
0:48:09.889 --> 0:48:14.528 | |
Therefore, sampling is typically not the method used. | |
0:48:14.528 --> 0:48:22.584 | |
There are some things you will later see as | |
a preprocessing step. | |
0:48:23.003 --> 0:48:27.832 | |
But of course it's important how you can make | |
this process not too random. | |
0:48:29.269 --> 0:48:41.619 | |
Therefore, the first thing is don't take a | |
uniform distribution, but we have a very nice | |
0:48:41.619 --> 0:48:43.562 | |
distribution. | |
0:48:43.843 --> 0:48:46.621 | |
So instead of just randomly taking any word, | |
0:48:46.621 --> 0:48:51.328 | |
we are looking at the output distribution and | |
drawing a word from it. | |
0:48:51.731 --> 0:49:03.901 | |
So that means we are taking the word these, | |
we are taking the word does, and all these. | |
0:49:04.444 --> 0:49:06.095 | |
How can you do that? | |
0:49:06.095 --> 0:49:09.948 | |
You randomly draw a number between zero and | |
one. | |
0:49:10.390 --> 0:49:23.686 | |
And then you have ordered your words in some | |
way, and then you take the word where the cumulative | |
0:49:23.686 --> 0:49:26.375 | |
sum of probabilities first exceeds the drawn number. | |
0:49:26.806 --> 0:49:34.981 | |
So the easiest thing is you have zero point | |
five, zero point two five, and zero point two | |
0:49:34.981 --> 0:49:35.526 | |
five. | |
0:49:35.526 --> 0:49:43.428 | |
If your number is smaller than 0.5 you take | |
the first word, below 0.75 the second word, and | |
0:49:43.428 --> 0:49:45.336 | |
if it's higher, the third. | |
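The drawing scheme just described can be sketched like this (toy vocabulary; the 0.5/0.25/0.25 probabilities match the example above, the words themselves are invented):

```python
import random

words = ["die", "das", "ein"]
probs = [0.5, 0.25, 0.25]

def sample_word(u):
    # u is a number in [0, 1); pick the first word whose cumulative
    # probability exceeds it.
    cumulative = 0.0
    for word, p in zip(words, probs):
        cumulative += p
        if u < cumulative:
            return word
    return words[-1]

print(sample_word(0.3))   # 'die'  (0.3 < 0.5)
print(sample_word(0.6))   # 'das'  (0.5 <= 0.6 < 0.75)
print(sample_word(random.random()))  # a random draw from the distribution
```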
0:49:45.845 --> 0:49:57.707 | |
Therefore, you can very easily draw words | |
distributed according to this probability mass | |
0:49:57.707 --> 0:49:59.541 | |
and no longer uniformly. | |
0:49:59.799 --> 0:50:12.479 | |
You can even do that a bit more focused | |
on the important part, if we are not randomly | |
0:50:12.479 --> 0:50:19.494 | |
drawing from all words, but we are looking | |
only at the most probable ones. | |
0:50:21.361 --> 0:50:24.278 | |
Do you have an idea why this is an important | |
step? | |
0:50:24.278 --> 0:50:29.459 | |
Although we say I'm only throwing away the | |
words which have a very low probability, so | |
0:50:29.459 --> 0:50:32.555 | |
anyway the probability of taking them is quite | |
low. | |
0:50:32.555 --> 0:50:35.234 | |
So normally that shouldn't matter that much. | |
0:50:36.256 --> 0:50:38.830 | |
There's ten thousand words. | |
0:50:40.300 --> 0:50:42.074 | |
Of course, there are maybe nine thousand nine hundred of them. | |
0:50:42.074 --> 0:50:44.002 | |
[partly unintelligible audience answer] | |
 | |
0:50:47.867 --> 0:50:55.299 | |
Yes, that's exactly why you do this top-k sampling | |
or so, so that you don't take the lowest. | |
0:50:55.415 --> 0:50:59.694 | |
Probability words, but you only look at the | |
most probable ones and then like. | |
0:50:59.694 --> 0:51:04.632 | |
Of course you have to rescale your probability | |
mass then so that it's still a probability | |
0:51:04.632 --> 0:51:08.417 | |
because now it's a probability distribution | |
over ten thousand words. | |
0:51:08.417 --> 0:51:13.355 | |
If you only take ten of them or so it's no | |
longer a probability distribution, you rescale | |
0:51:13.355 --> 0:51:15.330 | |
them, and then you can still do that. | |
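As a sketch of this rescaling (hypothetical words and probabilities; the point is only the renormalization step after keeping the top k):

```python
# Toy output distribution over four words.
probs = {"gut": 0.5, "schoen": 0.3, "toll": 0.15, "gurke": 0.05}
k = 2

# Keep only the k most probable words ...
top_k = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])

# ... and rescale so the kept probabilities again sum to one.
total = sum(top_k.values())
renormalized = {w: p / total for w, p in top_k.items()}

print(renormalized)  # roughly {'gut': 0.625, 'schoen': 0.375}
```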
0:51:16.756 --> 0:51:20.095 | |
That is what is done in sampling. | |
0:51:20.095 --> 0:51:26.267 | |
It's not the most common thing, but it's done | |
several times. | |
0:51:28.088 --> 0:51:40.625 | |
Then there is beam search, which is somehow the standard | |
if you're doing some type of machine translation. | |
0:51:41.181 --> 0:51:50.162 | |
And the basic idea is that in greedy search we | |
select the most probable word and only continue | |
0:51:50.162 --> 0:51:51.171 | |
with that one. | |
0:51:51.691 --> 0:51:53.970 | |
You can easily generalize this. | |
0:51:53.970 --> 0:52:00.451 | |
We are not only continuing the most probable | |
one, but we are continuing the n most probable ones. | |
0:52:17.697 --> 0:52:26.920 | |
You should say we are sampling how many examples | |
it makes sense to take the one with the highest. | |
0:52:27.127 --> 0:52:33.947 | |
But that is important so that, once you make a mistake, | |
it does not influence the result that much. | |
0:52:39.899 --> 0:52:45.815 | |
So the idea is that we're keeping the n best | |
hypotheses and not only the first best. | |
0:52:46.586 --> 0:52:51.558 | |
And the nice thing is in statistical machine | |
translation. | |
0:52:51.558 --> 0:52:54.473 | |
We have exactly the same problem. | |
0:52:54.473 --> 0:52:57.731 | |
You would do the same thing, however. | |
0:52:57.731 --> 0:53:03.388 | |
Since the model wasn't that strong you needed | |
a quite large beam. | |
0:53:03.984 --> 0:53:18.944 | |
Neural machine translation models are really strong, | |
and you already get a very good performance with a small beam. | |
0:53:19.899 --> 0:53:22.835 | |
So how does it work? | |
0:53:22.835 --> 0:53:35.134 | |
We calculate our probabilities as before, but now | |
we are keeping not one but the n most probable ones. | |
0:53:36.156 --> 0:53:45.163 | |
Having done that, we extend all these hypotheses, and | |
of course there is now a bit difficult because | |
0:53:45.163 --> 0:53:54.073 | |
now we always have to switch what is the input | |
so the search gets more complicated and the | |
0:53:54.073 --> 0:53:55.933 | |
first one is easy. | |
0:53:56.276 --> 0:54:09.816 | |
In this case we have to once put in here these | |
and then somehow delete this one and instead | |
0:54:09.816 --> 0:54:12.759 | |
put that into that. | |
0:54:13.093 --> 0:54:24.318 | |
Otherwise you could only store your current | |
network states here and just continue by going | |
0:54:24.318 --> 0:54:25.428 | |
forward. | |
0:54:26.766 --> 0:54:34.357 | |
So now you have done the first two, and then | |
you have the n best. | |
0:54:34.357 --> 0:54:37.285 | |
Can you now just continue? | |
0:54:39.239 --> 0:54:53.511 | |
Yes, that's very important, otherwise all | |
your beam search doesn't really help because | |
0:54:53.511 --> 0:54:57.120 | |
you would still have exponential growth. | |
0:54:57.317 --> 0:55:06.472 | |
So now you have to do one important step and | |
then reduce again to n. | |
0:55:06.472 --> 0:55:13.822 | |
So in our case to make things easier we have | |
the inputs. | |
0:55:14.014 --> 0:55:19.072 | |
Otherwise you will have two to the power of | |
length possibilities, so it is still exponential. | |
0:55:19.559 --> 0:55:26.637 | |
But by always throwing them away you keep | |
your beam size fixed. | |
0:55:26.637 --> 0:55:31.709 | |
The items now differ in the last position. | |
0:55:32.492 --> 0:55:42.078 | |
They are completely different, but you are | |
always searching what is the best one. | |
0:55:44.564 --> 0:55:50.791 | |
So another way of hearing it is like this, | |
so just imagine you start with the empty sentence. | |
0:55:50.791 --> 0:55:55.296 | |
Then you have three possible extensions: A, | |
B, and end of sentence. | |
0:55:55.296 --> 0:55:59.205 | |
You throw away the worst one, continuing | |
with the best two. | |
0:55:59.699 --> 0:56:13.136 | |
Then you want to stay at two, so in this step | |
you keep the best two, and then you continue. | |
0:56:13.293 --> 0:56:24.924 | |
So you always have this exponential growing | |
tree by destroying most of them away and only | |
0:56:24.924 --> 0:56:26.475 | |
continuing. | |
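A minimal version of this expand-and-prune loop (everything here is a toy stand-in; the `step` table is invented, while a real decoder would score continuations with a neural network conditioned on the source sentence):

```python
from math import log

VOCAB = ["<eos>", "a", "b"]

def step(prefix):
    # Toy next-word distributions per prefix.
    table = {
        (): [0.1, 0.5, 0.4],
        ("a",): [0.3, 0.4, 0.3],
        ("b",): [0.1, 0.1, 0.8],
        ("a", "a"): [0.5, 0.3, 0.2],
        ("a", "b"): [0.9, 0.05, 0.05],
        ("b", "b"): [0.8, 0.1, 0.1],
    }
    return table.get(tuple(prefix), [0.98, 0.01, 0.01])

def beam_search(beam_size=2, max_len=3):
    # Each hypothesis is (prefix, accumulated log probability).
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, p in zip(VOCAB, step(prefix)):
                hyp = (prefix + (word,), score + log(p))
                if word == "<eos>":
                    finished.append(hyp)   # completed translation
                else:
                    candidates.append(hyp)
        # Prune: keep only the beam_size best partial hypotheses.
        beams = sorted(candidates, key=lambda h: h[1], reverse=True)[:beam_size]
    best = max(finished + beams, key=lambda h: h[1])
    return best[0]

print(beam_search())  # ('b', 'b', '<eos>') -- greedy would have chosen 'a' first
```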
0:56:26.806 --> 0:56:42.455 | |
And thereby you can hopefully make fewer errors | |
because in these examples you always see this | |
0:56:42.455 --> 0:56:43.315 | |
one. | |
0:56:43.503 --> 0:56:47.406 | |
So you're preventing some errors, but of course | |
it's not perfect. | |
0:56:47.447 --> 0:56:56.829 | |
You can still do errors because it could be | |
not the second one but the fourth one. | |
0:56:57.017 --> 0:57:03.272 | |
The idea is just that you make fewer | |
errors and prevent some of them. | |
0:57:07.667 --> 0:57:11.191 | |
Then the question is how much does it help? | |
0:57:11.191 --> 0:57:14.074 | |
And here is some examples for that. | |
0:57:14.074 --> 0:57:16.716 | |
So for SMT it was really like this: | |
0:57:16.716 --> 0:57:23.523 | |
Typically, the larger the beam, the larger the | |
search space, and the better the score. | |
0:57:23.763 --> 0:57:27.370 | |
So the larger you get, the bigger your beam, | |
the better you will do. | |
0:57:27.370 --> 0:57:30.023 | |
Typically maybe use something like three hundred. | |
0:57:30.250 --> 0:57:38.777 | |
And it's mainly a trade-off between quality | |
and speed because the larger your beams, the | |
0:57:38.777 --> 0:57:43.184 | |
more time it takes and you want to finish it. | |
0:57:43.184 --> 0:57:49.124 | |
So your quality improvements are getting smaller | |
and smaller. | |
0:57:49.349 --> 0:57:57.164 | |
So the difference between a beam of one and | |
ten is bigger than the difference between a. | |
0:57:58.098 --> 0:58:14.203 | |
And the interesting thing is we're seeing | |
a bit of a different view, and we're seeing | |
0:58:14.203 --> 0:58:16.263 | |
typically. | |
0:58:16.776 --> 0:58:24.376 | |
And then especially if you look at the green | |
ones, this is unnormalized. | |
0:58:24.376 --> 0:58:26.770 | |
You're seeing a sharp drop. | |
0:58:27.207 --> 0:58:32.284 | |
So your translation quality here, measured | |
in BLEU, will go down again. | |
0:58:33.373 --> 0:58:35.663 | |
That is now a question. | |
0:58:35.663 --> 0:58:37.762 | |
Why is that the case? | |
0:58:37.762 --> 0:58:43.678 | |
Why should that be, when we are seeing more | |
and more possible translations? | |
0:58:46.226 --> 0:58:48.743 | |
If we have a bigger stretch and we are going. | |
0:58:52.612 --> 0:58:56.312 | |
I'm going to be using my examples before we | |
also look at the bar. | |
0:58:56.656 --> 0:58:59.194 | |
A good idea. | |
0:59:00.000 --> 0:59:18.521 | |
But it's not everything, because in the end we | |
always select from this list the most probable one. | |
0:59:18.538 --> 0:59:19.382 | |
So this is here. | |
0:59:19.382 --> 0:59:21.170 | |
We don't do any regions to do that. | |
0:59:21.601 --> 0:59:29.287 | |
So the probabilities at the end we always | |
give out the hypothesis with the highest probabilities. | |
0:59:30.250 --> 0:59:33.623 | |
That is always the case. | |
0:59:33.623 --> 0:59:43.338 | |
If you have a smaller beam, the items you look at | |
should be a subset of the items for a larger beam. | |
0:59:44.224 --> 0:59:52.571 | |
So if you increase your beam size you're just | |
looking at more, and you're always taking the | |
0:59:52.571 --> 0:59:54.728 | |
one with the highest probability. | |
0:59:57.737 --> 1:00:07.014 | |
Maybe they are all the probability that they | |
will be comparable to don't really have. | |
1:00:08.388 --> 1:00:14.010 | |
But the probabilities are the same, not that | |
easy. | |
1:00:14.010 --> 1:00:23.931 | |
One morning maybe you will have more examples | |
where we look at some stuff that's not seen | |
1:00:23.931 --> 1:00:26.356 | |
in the training data. | |
1:00:28.428 --> 1:00:36.478 | |
That's mainly the answer why we give a hyperability | |
math we will see, but that is first of all | |
1:00:36.478 --> 1:00:43.087 | |
the biggest issue. So here is the BLEU score, | |
which measures translation quality. | |
1:00:43.883 --> 1:00:48.673 | |
The BLEU score will go down, while the probability of the | |
selected hypothesis only goes up or stays the same | |
1:00:48.673 --> 1:00:49.224 | |
at least. | |
1:00:49.609 --> 1:00:57.971 | |
The problem is if we are searching more, we | |
are finding hypotheses which have a high probability | |
1:00:57.971 --> 1:00:59.193 | |
but a low translation quality. | |
1:00:59.579 --> 1:01:10.375 | |
So we are finding these things which we wouldn't | |
find and we'll see why this is happening. | |
1:01:10.375 --> 1:01:15.714 | |
So somehow we are reducing our search error. | |
1:01:16.336 --> 1:01:25.300 | |
However, we also have a model error, where we | |
don't assign the highest probability to the translation | |
1:01:25.300 --> 1:01:27.942 | |
with the really best quality. | |
1:01:28.548 --> 1:01:31.460 | |
They don't always add up. | |
1:01:31.460 --> 1:01:34.932 | |
Of course somehow they add up. | |
1:01:34.932 --> 1:01:41.653 | |
If your model is worse, then your performance | |
will even go down. | |
1:01:42.202 --> 1:01:49.718 | |
But sometimes it's happening that by increasing | |
search errors we are missing out the really | |
1:01:49.718 --> 1:01:57.969 | |
bad translations which have a high probability | |
and we are only finding the decently good probability | |
1:01:57.969 --> 1:01:58.460 | |
mass. | |
1:01:59.159 --> 1:02:03.859 | |
So they are a bit independent of each other | |
and you can make both types of errors. | |
1:02:04.224 --> 1:02:09.858 | |
That's why, for example, doing exact search | |
will give you the translation with the highest | |
1:02:09.858 --> 1:02:15.245 | |
probability, but there has been work on it | |
that you then even have a lower translation | |
1:02:15.245 --> 1:02:21.436 | |
quality because then you find some random translation | |
which has a very high translation probability | |
1:02:21.436 --> 1:02:22.984 | |
but which is really bad. | |
1:02:23.063 --> 1:02:29.036 | |
Because our model is not perfect and not giving | |
a perfect translation probability everywhere. | |
1:02:31.431 --> 1:02:34.537 | |
So why is this happening? | |
1:02:34.537 --> 1:02:42.301 | |
And one issue with this is the so-called label | |
or length bias. | |
1:02:42.782 --> 1:02:47.115 | |
And we are in each step of decoding. | |
1:02:47.115 --> 1:02:55.312 | |
We are modeling the probability of the next | |
word given the input and the previous words. | |
1:02:55.895 --> 1:03:06.037 | |
So if you have this picture: at each position | |
you have the probability of the next word. | |
1:03:06.446 --> 1:03:16.147 | |
That's what you're modeling, and of course | |
the model is not perfect. | |
1:03:16.576 --> 1:03:22.765 | |
So it can be that if we at one time do a bit of a | |
wrong prediction not for the first one but | |
1:03:22.765 --> 1:03:28.749 | |
maybe for the 5th or 6th thing, then we're | |
giving it an exceptionally high probability we | |
1:03:28.749 --> 1:03:30.178 | |
cannot recover from. | |
1:03:30.230 --> 1:03:34.891 | |
Because this high probability will stay there | |
forever and we just multiply other things to | |
1:03:34.891 --> 1:03:39.910 | |
it, but we cannot like later say all this probability | |
was a bit too high, we shouldn't have done. | |
1:03:41.541 --> 1:03:48.984 | |
And this leads to the following: the longer | |
your translation is, the more often you use | |
1:03:48.984 --> 1:03:51.637 | |
this probability distribution. | |
1:03:52.112 --> 1:04:03.321 | |
The typical example is this one, so you have | |
the probability of the translation. | |
1:04:04.104 --> 1:04:12.608 | |
And this probability is quite low as you see, | |
and maybe there are a lot of other things. | |
1:04:13.053 --> 1:04:25.658 | |
However, it might still be overestimated that | |
it's still a bit too high. | |
1:04:26.066 --> 1:04:33.042 | |
The problem is: if the correct translation | |
is a very long one, its probability mass gets | |
1:04:33.042 --> 1:04:33.545 | |
lower. | |
1:04:34.314 --> 1:04:45.399 | |
Because each time you multiply your probability | |
to it, so your sequence probability gets lower | |
1:04:45.399 --> 1:04:46.683 | |
and lower. | |
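Numerically, this shrinking looks like the following (a constant per-word probability of 0.9 is assumed purely for illustration):

```python
# Each additional word multiplies in another probability < 1,
# so longer sequences end up with much smaller total probability.
p_word = 0.9
short = p_word ** 5    # 5-word translation
long_ = p_word ** 30   # 30-word translation

print(long_ < short)  # True: the long hypothesis loses on raw probability
```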
1:04:48.588 --> 1:04:59.776 | |
And this means that at some point the short hypothesis | |
might overtake, and have the higher probability. | |
1:05:00.180 --> 1:05:09.651 | |
And if you then haven't thrown this hypothesis | |
away at the beginning, but it was in your beam, then | |
1:05:09.651 --> 1:05:14.958 | |
at this point you would select the empty sentence. | |
1:05:15.535 --> 1:05:25.379 | |
So this has happened because this short translation | |
is seen and it's not thrown away. | |
1:05:31.151 --> 1:05:41.256 | |
If you have a very small beam, that can be prevented, | |
but if you have a large beam, this one is in | |
1:05:41.256 --> 1:05:41.986 | |
there. | |
1:05:42.302 --> 1:05:52.029 | |
This in general seems reasonable, preferring shorter | |
translations instead of longer sentences, | |
1:05:52.029 --> 1:05:54.543 | |
at least to some degree. | |
1:05:56.376 --> 1:06:01.561 | |
It's a bit depending on whether the translation | |
should be a bit related to your input. | |
1:06:02.402 --> 1:06:18.053 | |
And since we are always multiplying probabilities, | |
the longer the sequence, the smaller the product | |
1:06:18.053 --> 1:06:18.726 | |
gets. | |
1:06:19.359 --> 1:06:29.340 | |
It's somewhat right for humans too, but | |
the models tend to overestimate because of | |
1:06:29.340 --> 1:06:34.388 | |
this preference of short over long translations. | |
1:06:35.375 --> 1:06:46.474 | |
Then, of course, that means that it's not | |
easy to stay on a computer because eventually | |
1:06:46.474 --> 1:06:48.114 | |
it suggests. | |
1:06:51.571 --> 1:06:59.247 | |
First of all there is another way and that's | |
typically used but you don't have to do really | |
1:06:59.247 --> 1:07:07.089 | |
because this is normally not a second position | |
and if it's like on the 20th position you only | |
1:07:07.089 --> 1:07:09.592 | |
have to have a somewhat smaller beam. | |
1:07:10.030 --> 1:07:17.729 | |
But you are right because these issues get | |
larger, the larger your input is, and then | |
1:07:17.729 --> 1:07:20.235 | |
you might make more errors. | |
1:07:20.235 --> 1:07:27.577 | |
So therefore this is true, but it's not as | |
simple that this one is always in the. | |
1:07:28.408 --> 1:07:45.430 | |
That the translation quality goes down with | |
larger beam sizes, has there been more analysis? | |
1:07:47.507 --> 1:07:51.435 | |
In this work you see it does not. | |
1:07:51.435 --> 1:07:53.027 | |
It does not go down. | |
1:07:53.027 --> 1:08:00.246 | |
That's light green here, but at least you | |
don't see the sharp drop. | |
1:08:00.820 --> 1:08:07.897 | |
So if you do some type of normalization, at | |
least you can assess this probability and limit | |
1:08:07.897 --> 1:08:08.204 | |
it. | |
1:08:15.675 --> 1:08:24.828 | |
There is other reasons why, like initial, | |
it's not only the length, but there can be | |
1:08:24.828 --> 1:08:26.874 | |
other reasons why. | |
1:08:27.067 --> 1:08:37.316 | |
And if you just take it too large, you're | |
looking too often at ways in between, but it's | |
1:08:37.316 --> 1:08:40.195 | |
better to ignore things. | |
1:08:41.101 --> 1:08:44.487 | |
But that's more a hand-wavy argument. | |
1:08:44.487 --> 1:08:47.874 | |
I agree, so I don't know if that's the exact wording. | |
1:08:48.648 --> 1:08:53.223 | |
You need to do the normalization and there | |
are different ways of doing it. | |
1:08:53.223 --> 1:08:54.199 | |
It's mainly OK. | |
1:08:54.199 --> 1:08:59.445 | |
We're just now not taking the translation | |
with the highest probability, but we during | |
1:08:59.445 --> 1:09:04.935 | |
decoding have another feature saying not | |
only take the one with the highest probability | |
1:09:04.935 --> 1:09:08.169 | |
but also prefer translations which are a bit | |
longer. | |
1:09:08.488 --> 1:09:16.933 | |
You can do that differently; one way is to divide | |
by the sentence length. | |
1:09:16.933 --> 1:09:23.109 | |
We take not the highest total but the highest average probability. | |
1:09:23.563 --> 1:09:28.841 | |
Of course, if both are the same length, it | |
doesn't matter, if they all have the same length in | |
1:09:28.841 --> 1:09:34.483 | |
all cases; but if you compare a translation | |
with seven or eight words, there is a difference | |
1:09:34.483 --> 1:09:39.700 | |
if you want to have the one with the highest | |
probability or with the highest average. | |
1:09:41.021 --> 1:09:50.993 | |
So that is the first option; or one can have some reward | |
model: for each word, you add a bit to the score, | |
1:09:50.993 --> 1:09:51.540 | |
and so on. | |
1:09:51.711 --> 1:10:03.258 | |
And then, of course, you have to fine-tune | |
that; there are also more complex ones here. | |
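As an editor's illustration, the scoring variants just described (the raw sum of log-probabilities, length normalization, and a per-word reward) can be sketched in plain Python; the function names and the reward value of 0.2 are illustrative choices, not anything prescribed in the lecture.

```python
def score_raw(logprobs):
    # Sum of token log-probabilities: implicitly favors short
    # hypotheses, since every extra token adds a negative term.
    return sum(logprobs)

def score_length_normalized(logprobs):
    # Average log-probability per token: removes the direct
    # length penalty ("highest average" instead of "highest sum").
    return sum(logprobs) / len(logprobs)

def score_word_reward(logprobs, reward=0.2):
    # Add a small constant bonus per generated word; the reward
    # value is a tuning parameter (0.2 is an arbitrary example).
    return sum(logprobs) + reward * len(logprobs)

short = [-0.7] * 4   # four tokens, each with log-prob -0.7
long_ = [-0.7] * 8   # eight tokens, same per-token quality

assert score_raw(short) > score_raw(long_)  # raw score prefers short
assert abs(score_length_normalized(short)
           - score_length_normalized(long_)) < 1e-9  # normalized: tie
```

With equal per-token quality, the normalized score treats both lengths the same, while the raw sum always favors the shorter hypothesis.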
1:10:03.903 --> 1:10:08.226 | |
So there is different ways of doing that, | |
and of course that's important. | |
1:10:08.428 --> 1:10:11.493 | |
But in all of that, the main idea is OK. | |
1:10:11.493 --> 1:10:18.520 | |
We know of the error that the | |
model seems to prefer short translations. | |
1:10:18.520 --> 1:10:24.799 | |
We circumvent that: we are | |
no longer searching for the best one. | |
1:10:24.764 --> 1:10:30.071 | |
But we're searching for the one best one and | |
some additional constraints, so mainly you | |
1:10:30.071 --> 1:10:32.122 | |
are doing here during decoding. | |
1:10:32.122 --> 1:10:37.428 | |
You're not completely trusting your model, | |
but you're adding some bias or constraints | |
1:10:37.428 --> 1:10:39.599 | |
into what should also be fulfilled. | |
1:10:40.000 --> 1:10:42.543 | |
That can be, for example, that the length | |
should be reasonable. | |
1:10:49.369 --> 1:10:51.071 | |
Any More Questions to That. | |
1:10:56.736 --> 1:11:04.001 | |
The last idea, which recently gets quite a bit | |
more interest, is what is called minimum | |
1:11:04.001 --> 1:11:11.682 | |
Bayes risk decoding, and there is maybe not the | |
one correct translation but there are several | |
1:11:11.682 --> 1:11:13.937 | |
good correct translations. | |
1:11:14.294 --> 1:11:21.731 | |
And the idea is now we don't want to find | |
the one translation, which is maybe the highest | |
1:11:21.731 --> 1:11:22.805 | |
probability. | |
1:11:23.203 --> 1:11:31.707 | |
Instead we are looking at all the translations | |
with high probability, and then | |
1:11:31.707 --> 1:11:39.524 | |
we want to take the one representative out of this | |
set which is most similar to all the other | |
1:11:39.524 --> 1:11:42.187 | |
high-probability translations. | |
1:11:43.643 --> 1:11:46.642 | |
So how does it work? | |
1:11:46.642 --> 1:11:55.638 | |
First, you could imagine that you have reference | |
translations. | |
1:11:55.996 --> 1:12:13.017 | |
You have a set of reference translations, and | |
then what you want to get is an expected quality. | |
1:12:13.073 --> 1:12:28.641 | |
Using a probability distribution, you measure | |
the similarity of reference and hypothesis. | |
1:12:28.748 --> 1:12:31.408 | |
So you have two sets of translation. | |
1:12:31.408 --> 1:12:34.786 | |
You have the human translations of a sentence. | |
1:12:35.675 --> 1:12:39.251 | |
That's of course not realistic, but let's first | |
look at the idea. | |
1:12:39.251 --> 1:12:42.324 | |
Then you have your set of possible translations. | |
1:12:42.622 --> 1:12:52.994 | |
And now you're not saying okay, we have only | |
one human, but we have several humans with | |
1:12:52.994 --> 1:12:56.294 | |
different types of quality. | |
1:12:56.796 --> 1:13:07.798 | |
You have to have two metrics here, the similarity | |
between the automatic translation and the quality | |
1:13:07.798 --> 1:13:09.339 | |
of the human. | |
1:13:10.951 --> 1:13:17.451 | |
Of course, we have the same problem that we | |
don't have the human references, so we have to approximate. | |
1:13:18.058 --> 1:13:29.751 | |
So when we are doing it, instead of estimating | |
the quality based on the human, we use our | |
1:13:29.751 --> 1:13:30.660 | |
model. | |
1:13:31.271 --> 1:13:37.612 | |
So we can't ask humans; instead we take the | |
model probability. | |
1:13:37.612 --> 1:13:40.782 | |
We take this set here first of all. | |
1:13:41.681 --> 1:13:48.755 | |
Then we are comparing each hypothesis to this | |
one, so you have two sets. | |
1:13:48.755 --> 1:13:53.987 | |
Just imagine here you take all possible translations. | |
1:13:53.987 --> 1:13:58.735 | |
Here you take your hypothesis in comparing | |
them. | |
1:13:58.678 --> 1:14:03.798 | |
And then you're estimating the quality | |
based on the outcome. | |
1:14:04.304 --> 1:14:06.874 | |
So the overall idea is okay. | |
1:14:06.874 --> 1:14:14.672 | |
We are not finding the best hypothesis but | |
finding the hypothesis which is most similar | |
1:14:14.672 --> 1:14:17.065 | |
to many good translations. | |
1:14:19.599 --> 1:14:21.826 | |
Why would you do that? | |
1:14:21.826 --> 1:14:25.119 | |
It's a bit like a smoothing idea. | |
1:14:25.119 --> 1:14:28.605 | |
Imagine this is the probability of. | |
1:14:29.529 --> 1:14:36.634 | |
So if you would do beam search or greedy search | |
or anything, if you just take the highest probability | |
1:14:36.634 --> 1:14:39.049 | |
one, you would take this red one. | |
1:14:39.799 --> 1:14:45.686 | |
If it has this type of probability distribution, | |
1:14:45.686 --> 1:14:58.555 | |
then it might be better to take one of these | |
modes, even though it's a bit lower in probability. | |
1:14:58.618 --> 1:15:12.501 | |
So what you're mainly doing is you're doing | |
some smoothing of your probability distribution. | |
1:15:15.935 --> 1:15:17.010 | |
How can you do that? | |
1:15:17.010 --> 1:15:20.131 | |
Of course, we cannot do this comparison against | |
all possible hypotheses. | |
1:15:21.141 --> 1:15:29.472 | |
But what we can do is we just have two sets, | |
and we can even take them to be the same. | |
1:15:29.472 --> 1:15:38.421 | |
So we're having our set of hypotheses and the | |
set of pseudo-references. | |
1:15:39.179 --> 1:15:55.707 | |
And we can just take the same set, so we can | |
just compare the utility of each. | |
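The selection rule just described can be written down as a minimal Python sketch: pick the hypothesis with the highest expected similarity to the pseudo-references. The word-overlap similarity here is only a toy stand-in for a real metric (such as a sentence-level score or a neural metric), and all names and example sentences are illustrative.

```python
def mbr_decode(hypotheses, pseudo_references, similarity, weights=None):
    # Pick the hypothesis with the highest expected similarity
    # (utility) to the set of pseudo-references. With independent
    # sampling the weights are uniform; they could instead be the
    # model probabilities of the pseudo-references.
    if weights is None:
        weights = [1.0 / len(pseudo_references)] * len(pseudo_references)
    best, best_utility = None, float("-inf")
    for h in hypotheses:
        utility = sum(w * similarity(h, y)
                      for w, y in zip(weights, pseudo_references))
        if utility > best_utility:
            best, best_utility = h, utility
    return best

def unigram_overlap(a, b):
    # Toy similarity: Jaccard overlap of word sets, a stand-in
    # for a proper sentence-level or neural metric.
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

# Using the candidate set itself as the pseudo-reference set:
cands = ["he goes home", "he walks home", "she sings"]
choice = mbr_decode(cands, cands, unigram_overlap)
assert choice != "she sings"  # the outlier is never picked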
1:15:56.656 --> 1:16:16.182 | |
And then, of course, the question is how do | |
we measure the quality of the hypothesis? | |
1:16:16.396 --> 1:16:28.148 | |
Of course, you could also take here the probability | |
of the hypothesis given the source, but you can also say | |
1:16:28.148 --> 1:16:30.958 | |
we only take the top ones. | |
1:16:31.211 --> 1:16:39.665 | |
And then we don't want to really rely on | |
how good they are; we have filtered out all the | |
1:16:39.665 --> 1:16:40.659 | |
bad ones. | |
1:16:40.940 --> 1:16:54.657 | |
So that is the first question for minimum | |
Bayes risk decoding: what are your pseudo-references? | |
1:16:55.255 --> 1:17:06.968 | |
So how do you set the quality of all these | |
references here? With independent sampling, | |
1:17:06.968 --> 1:17:10.163 | |
they all get the same weight. | |
1:17:10.750 --> 1:17:12.308 | |
There's also work where you can vary that. | |
1:17:13.453 --> 1:17:17.952 | |
And then the second question you have to do | |
is, of course,. | |
1:17:17.917 --> 1:17:26.190 | |
How do you compare now two hypotheses: you | |
have Y and H, which are both generated | |
1:17:26.190 --> 1:17:34.927 | |
by the system, and you want to find the H which | |
is most similar to all the other translations. | |
1:17:35.335 --> 1:17:41.812 | |
So it's mainly like this formula here, which | |
says how similar H is to all the other Ys. | |
1:17:42.942 --> 1:17:50.127 | |
So you have to again use some type of similarity | |
metric, which says how similar two hypotheses are. | |
1:17:52.172 --> 1:17:53.775 | |
How can you do that? | |
1:17:53.775 --> 1:17:58.355 | |
We luckily knew how to compare a reference | |
to a hypothesis. | |
1:17:58.355 --> 1:18:00.493 | |
We have evaluation metrics. | |
1:18:00.493 --> 1:18:03.700 | |
You can do something like a sentence-level score. | |
1:18:04.044 --> 1:18:13.501 | |
But especially if you're looking into neural models, | |
you could have a stronger metric, so you can use | |
1:18:13.501 --> 1:18:17.836 | |
a neural metric which directly compares the two. | |
1:18:22.842 --> 1:18:29.292 | |
Yes, so that is the main idea of minimum | |
Bayes risk decoding; the important idea you should | |
1:18:29.292 --> 1:18:35.743 | |
keep in mind is that it's doing somehow the | |
smoothing by not taking the highest probability | |
1:18:35.743 --> 1:18:40.510 | |
one, but by comparing like by taking a set | |
of high probability one. | |
1:18:40.640 --> 1:18:45.042 | |
And then looking for the translation, which | |
is most similar to all of that. | |
1:18:45.445 --> 1:18:49.888 | |
And thereby doing a bit more smoothing because | |
you look at this one. | |
1:18:49.888 --> 1:18:55.169 | |
If you have this one, for example, it would | |
be more similar to all of these ones. | |
1:18:55.169 --> 1:19:00.965 | |
But if you take this one, it's higher probability, | |
but it's very dissimilar to all these. | |
1:19:05.445 --> 1:19:17.609 | |
Okay, that is all for decoding; before we finish, | |
we look at the combination of models. | |
1:19:18.678 --> 1:19:20.877 | |
Sort of a set of pseudo-references: | |
1:19:20.877 --> 1:19:24.368 | |
how do you generate that, with what type of search? | |
1:19:24.944 --> 1:19:27.087 | |
For example, you can do beam search. | |
1:19:27.087 --> 1:19:28.825 | |
You can do sampling for that. | |
1:19:28.825 --> 1:19:31.257 | |
Oh yeah, we had mentioned sampling there. | |
1:19:31.257 --> 1:19:34.500 | |
I think somebody was asking what sampling | |
is good for. | |
1:19:34.500 --> 1:19:37.280 | |
So there's, of course, another important issue. | |
1:19:37.280 --> 1:19:40.117 | |
How do you get a good representative set of | |
hypotheses H? | |
1:19:40.620 --> 1:19:47.147 | |
If you do beam search, it might be that you | |
end up with too similar ones, and maybe that's | |
1:19:47.147 --> 1:19:49.274 | |
prevented by doing sampling. | |
1:19:49.274 --> 1:19:55.288 | |
But maybe in sampling you find worse ones, | |
but yet some type of model is helpful. | |
1:19:56.416 --> 1:20:04.863 | |
Which search method is used more for transformer-based | |
translation models? | |
1:20:04.863 --> 1:20:09.848 | |
Nowadays beam search is definitely the standard. | |
1:20:10.130 --> 1:20:13.749 | |
There is work on this. | |
1:20:13.749 --> 1:20:27.283 | |
The problem is that MBR is often a lot | |
more computationally heavy, because you have to sample | |
1:20:27.283 --> 1:20:29.486 | |
many translations. | |
1:20:31.871 --> 1:20:40.946 | |
If you are sampling, could we take the probability | |
of each sample, so prefer the most probable ones, | |
1:20:40.946 --> 1:20:43.003 | |
and weight them | |
1:20:43.623 --> 1:20:46.262 | |
a bit, so that we say okay, they don't have to | |
1:20:46.262 --> 1:20:47.657 | |
be uniform? | |
1:20:48.428 --> 1:20:52.690 | |
Yes, so that is what you can also do. | |
1:20:52.690 --> 1:21:00.092 | |
Instead of taking uniform probability, you | |
could take the model's. | |
1:21:01.041 --> 1:21:14.303 | |
The uniform weighting is a bit more robust, because | |
if you had this one it might be that there are | |
1:21:14.303 --> 1:21:17.810 | |
some crazy exceptions. | |
1:21:17.897 --> 1:21:21.088 | |
And then it would still be robust. | |
1:21:21.088 --> 1:21:28.294 | |
So if you look at this picture, the probability | |
here would be higher. | |
1:21:28.294 --> 1:21:31.794 | |
But yeah, that's a bit of tuning. | |
1:21:33.073 --> 1:21:42.980 | |
In this case, yes, it is like modeling | |
also the uncertainty there. | |
1:21:49.169 --> 1:21:56.265 | |
The last thing: so far we have always considered | |
one model. | |
1:21:56.265 --> 1:22:04.084 | |
It's also sometimes helpful to not only | |
look at one model but at several. | |
1:22:04.384 --> 1:22:10.453 | |
So in general there are many ways how you | |
can create several models, and with neural models it's even | |
1:22:10.453 --> 1:22:17.370 | |
easier: you can just start from three different random | |
initializations, you get three different models, | |
1:22:17.370 --> 1:22:18.428 | |
and typically they differ a bit. | |
1:22:19.019 --> 1:22:27.299 | |
And then the question is, can we combine their | |
strength into one model and use that then? | |
1:22:29.669 --> 1:22:39.281 | |
And that can be done either online, which is | |
called an ensemble, or the more offline thing, | |
1:22:39.281 --> 1:22:41.549 | |
which is called reranking. | |
1:22:42.462 --> 1:22:52.800 | |
So the idea is, for example, an ensemble that | |
you combine different initializations. | |
1:22:52.800 --> 1:23:02.043 | |
Of course, you can also do other things like | |
having different architecture. | |
1:23:02.222 --> 1:23:08.922 | |
But the easiest thing you can always change | |
when generating two models is to use different initializations. | |
1:23:09.209 --> 1:23:24.054 | |
And then the question is how can you combine | |
that? | |
1:23:26.006 --> 1:23:34.245 | |
And the easiest thing, as said, is the ensemble | |
of models. | |
1:23:34.245 --> 1:23:39.488 | |
What you mainly do is work in parallel. | |
1:23:39.488 --> 1:23:43.833 | |
You decode with all of the models. | |
1:23:44.444 --> 1:23:59.084 | |
So each model gives a probability of the output, | |
and you can join them into a joint one by just summing | |
1:23:59.084 --> 1:24:04.126 | |
up over your K models again. | |
1:24:04.084 --> 1:24:10.374 | |
So you still have a probability distribution, | |
but you are not taking only one model's output here, | |
1:24:10.374 --> 1:24:10.719 | |
but all of them. | |
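One decoding step of such an ensemble can be sketched as follows; this is an editor's toy example in Python, where two hand-written dictionaries stand in for the models' softmax outputs over a shared vocabulary.

```python
def ensemble_step(distributions):
    # Combine the next-token distributions of K models into one
    # by averaging the probabilities (uniform weight per model).
    # As noted in the lecture, all models must share the same
    # output vocabulary for this to be well defined.
    k = len(distributions)
    return {tok: sum(d[tok] for d in distributions) / k
            for tok in distributions[0]}

# Two toy "models" over a tiny shared vocabulary:
m1 = {"cat": 0.6, "dog": 0.3, "<eos>": 0.1}
m2 = {"cat": 0.2, "dog": 0.7, "<eos>": 0.1}
joint = ensemble_step([m1, m2])

assert abs(sum(joint.values()) - 1.0) < 1e-9   # still a distribution
assert max(joint, key=joint.get) == "dog"      # 0.5 beats 0.4 for "cat"
```

Note how the ensemble can pick "dog" even though model 1 alone would have picked "cat": the combined distribution reflects both models' evidence.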
1:24:11.491 --> 1:24:20.049 | |
So that's how you can easily combine different | |
models, and the nice thing is it typically | |
1:24:20.049 --> 1:24:20.715 | |
works. | |
1:24:21.141 --> 1:24:27.487 | |
You get additional improvement with only more | |
computation but not more human work. | |
1:24:27.487 --> 1:24:33.753 | |
You just do the same thing four times and you're | |
getting a better performance. | |
1:24:33.793 --> 1:24:41.623 | |
Compared to bigger models, with more layers | |
and so on, the disadvantage is of course that you need | |
1:24:41.623 --> 1:24:46.272 | |
all the models jointly during decoding, at | |
inference time. | |
1:24:46.272 --> 1:24:52.634 | |
There you have to load the models in parallel | |
because you have to do your search. | |
1:24:52.672 --> 1:24:57.557 | |
Normally there are more memory resources for | |
training than you have for inference. | |
1:25:00.000 --> 1:25:12.637 | |
You have to train four models and the decoding | |
speed is also slower because you need to decode | |
1:25:12.637 --> 1:25:14.367 | |
four models. | |
1:25:14.874 --> 1:25:25.670 | |
There is one other very important thing and | |
the models have to be very similar, at least | |
1:25:25.670 --> 1:25:27.368 | |
in some ways. | |
1:25:27.887 --> 1:25:28.506 | |
Of course. | |
1:25:28.506 --> 1:25:34.611 | |
You can only combine them if you have | |
the same vocabulary, because you are just summing. | |
1:25:34.874 --> 1:25:43.110 | |
So just imagine you have two different vocabulary | |
sizes, because you want to compare them, or a character- | |
1:25:43.110 --> 1:25:44.273 | |
based model. | |
1:25:44.724 --> 1:25:53.327 | |
That's at least not easily possible here, because | |
one output here would be a word and for the | |
1:25:53.327 --> 1:25:56.406 | |
other one you would have to sum over its units. | |
1:25:56.636 --> 1:26:07.324 | |
So this ensemble typically only works if you | |
have the same output vocabulary. | |
1:26:07.707 --> 1:26:16.636 | |
Your input can be different, because that is | |
only encoded once. | |
1:26:16.636 --> 1:26:23.752 | |
Your output vocabulary, however, has to be the | |
same, otherwise it doesn't work. | |
1:26:27.507 --> 1:26:41.522 | |
There's even a surprising effect of improving | |
your performance and it's again some kind of | |
1:26:41.522 --> 1:26:43.217 | |
smoothing. | |
1:26:43.483 --> 1:26:52.122 | |
So normally during training what we are doing | |
is we can save the checkpoints after each epoch. | |
1:26:52.412 --> 1:27:01.774 | |
And you have this type of curve where your | |
validation error normally should go down, and | |
1:27:01.774 --> 1:27:09.874 | |
if you do early stopping it means that at the | |
end you select not the last checkpoint but the lowest. | |
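The early-stopping selection, keeping the checkpoint with the lowest validation loss rather than simply the last one, can be sketched in a few lines of Python; the loss values here are made up for illustration.

```python
def select_checkpoint(val_losses):
    # Early stopping: choose the epoch whose validation loss is
    # lowest, not simply the last epoch that was trained.
    return min(range(len(val_losses)), key=val_losses.__getitem__)

# A made-up validation-loss curve over six epochs:
losses = [2.1, 1.7, 1.5, 1.4, 1.45, 1.6]
assert select_checkpoint(losses) == 3   # epoch 3, loss 1.4, not epoch 5
```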
1:27:11.571 --> 1:27:21.467 | |
However, some type of smoothing happens there again. | |
1:27:21.467 --> 1:27:31.157 | |
Sometimes what you can do is take an ensemble of checkpoints. | |
1:27:31.491 --> 1:27:38.798 | |
Each alone is not as good, but you still have four | |
different models, and they give you a little gain. | |
1:27:39.259 --> 1:27:42.212 | |
So. | |
1:27:43.723 --> 1:27:48.340 | |
It's somehow helping you because they're | |
supposed to be somewhat different, you know. | |
1:27:49.489 --> 1:27:53.812 | |
Oh, I didn't do that; so that is a checkpoint | |
ensemble. | |
1:27:53.812 --> 1:27:59.117 | |
There is one interesting thing, which is even | |
faster. | |
1:27:59.419 --> 1:28:12.255 | |
Normally it gives you better performance, | |
because this might again be like a smoothed | |
1:28:12.255 --> 1:28:13.697 | |
ensemble. | |
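A hedged sketch of this faster alternative, averaging the saved checkpoints' parameters into a single model, in plain Python; small lists stand in for weight tensors here (in practice one would average framework tensors, e.g. the entries of PyTorch state dicts).

```python
def average_checkpoints(state_dicts):
    # Element-wise average of the parameters of several saved
    # checkpoints. Unlike an ensemble, the result is one single
    # model, so decoding costs the same as with one checkpoint.
    k = len(state_dicts)
    return {
        name: [sum(vals) / k
               for vals in zip(*(sd[name] for sd in state_dicts))]
        for name in state_dicts[0]
    }

# Toy "checkpoints", each with one flat weight vector:
ckpt_a = {"w": [0.0, 2.0]}
ckpt_b = {"w": [2.0, 4.0]}
avg = average_checkpoints([ckpt_a, ckpt_b])
assert avg["w"] == [1.0, 3.0]
```

This is why it is faster than a checkpoint ensemble: there is only one set of weights to load and one forward pass per decoding step.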
1:28:16.736 --> 1:28:22.364 | |
Of course, there are also some problems with | |
this, as I said. | |
1:28:22.364 --> 1:28:30.022 | |
For example, maybe you want to combine different | |
word representations, such as character-based ones. | |
1:28:30.590 --> 1:28:37.189 | |
Or you want to do right-to-left decoding: you | |
normally decode like "I go home", but then your translation | |
1:28:37.189 --> 1:28:39.613 | |
depends only on the previous words. | |
1:28:39.613 --> 1:28:45.942 | |
If you want to model the future context, you could | |
do the inverse direction and generate the target | |
1:28:45.942 --> 1:28:47.895 | |
sentence from right to left. | |
1:28:48.728 --> 1:28:50.839 | |
But it's not easy to combine these things. | |
1:28:51.571 --> 1:28:56.976 | |
In order to do this, or what is also sometimes | |
interesting, is doing inverse translation. | |
1:28:57.637 --> 1:29:07.841 | |
You can combine these types of models; more | |
on that in the next lecture. | |
1:29:07.841 --> 1:29:13.963 | |
That is something we can still do there. | |
1:29:14.494 --> 1:29:29.593 | |
What you should remember from today is how | |
search works. Do you have any final questions? | |
1:29:33.773 --> 1:29:43.393 | |
Then I wish you a happy holiday for next week; | |
then on Monday there is another practical, | |
1:29:43.393 --> 1:29:50.958 | |
and then Thursday in two weeks we'll have | |
the next lecture. | |