WEBVTT

0:00:01.721 --> 0:00:05.064
Hi, and welcome to today's lecture.

0:00:06.126 --> 0:00:13.861
What we want to do today is finish
what we did last time. We started

0:00:13.861 --> 0:00:22.192
looking at the neural machine translation system,
but we have not yet had all the components of the sequence-to-sequence

0:00:22.192 --> 0:00:22.787
model.

0:00:22.722 --> 0:00:29.361
What we're still missing is the transformer-based
architecture, so mainly the self-attention.

0:00:29.849 --> 0:00:31.958
That we want to look at in the beginning today.

0:00:32.572 --> 0:00:39.315
And then the main part of today's lecture
will be decoding.

0:00:39.315 --> 0:00:43.992
That means we know how to train the model.

0:00:44.624 --> 0:00:47.507
So decoding is about how the trained model can

0:00:47.667 --> 0:00:53.359
be used, and the idea is how we find
that and what challenges there are.

0:00:53.359 --> 0:00:59.051
Since it's autoregressive, we will see that
it's not as easy as for other tasks.

0:00:59.359 --> 0:01:08.206
While generating the translation step by step,
we might make additional errors.

0:01:09.069 --> 0:01:16.464
But let's start with the self-attention. So
what we looked at until now was an RNN-based model.

0:01:16.816 --> 0:01:27.931
And in RNN-based models you always take
the last hidden state, you take your input, and you

0:01:27.931 --> 0:01:31.513
generate a new hidden state.

0:01:31.513 --> 0:01:35.218
This is more like a standard.

0:01:35.675 --> 0:01:41.088
And one challenge in this is that we always
store all our history in one single hidden

0:01:41.088 --> 0:01:41.523
state.

0:01:41.781 --> 0:01:50.235
We saw that this is a problem when going from
encoder to decoder, and that is why we then

0:01:50.235 --> 0:01:58.031
introduced the attention mechanism so that
we can look back and see all the parts.

0:01:59.579 --> 0:02:06.059
However, in the decoder we still have this
issue so we are still storing all information

0:02:06.059 --> 0:02:12.394
in one hidden state, and things can happen
like here, where we start to overwrite things

0:02:12.394 --> 0:02:13.486
and we forget.

0:02:14.254 --> 0:02:23.575
So the idea is, can we do something similar
to what we do between encoder and decoder within

0:02:23.575 --> 0:02:24.907
the decoder?

0:02:26.526 --> 0:02:33.732
And the idea is that each time we're generating
a new hidden state here, it will not only depend

0:02:33.732 --> 0:02:40.780
on the previous one, but we will focus on the
whole sequence and look at different parts

0:02:40.780 --> 0:02:46.165
as we did in attention in order to generate
our new representation.

0:02:46.206 --> 0:02:53.903
So each time we generate a new representation
we will look into what is important now to

0:02:53.903 --> 0:02:54.941
understand.

0:02:55.135 --> 0:03:00.558
To understand 'much', you may want to know what is important.

0:03:00.558 --> 0:03:08.534
You might want to look at 'very' and at 'like',
so that you know it is 'much' about liking.

0:03:08.808 --> 0:03:24.076
So the idea is that we are not storing everything
in one state; each time, we are looking at the full sequence.

0:03:25.125 --> 0:03:35.160
And that is achieved by no longer going really
sequentially, and the hidden states here aren't dependent

0:03:35.160 --> 0:03:37.086
on the others in the same layer.

0:03:37.086 --> 0:03:42.864
But instead we are always looking at the previous
layer.

0:03:42.942 --> 0:03:45.510
Thereby we always have more information
coming in.

0:03:47.147 --> 0:03:51.572
So how does this self-attention work in detail?

0:03:51.572 --> 0:03:56.107
So we start with our initial hidden states.

0:03:56.107 --> 0:04:08.338
Now, we already had the three
terms: the query, the key and the value;

0:04:08.338 --> 0:04:12.597
this was motivated by databases.

0:04:12.772 --> 0:04:20.746
We are comparing the query to the keys of all the
other states, and then we are merging the values.

0:04:21.321 --> 0:04:35.735
There was a difference between the decoder
and the encoder.

0:04:35.775 --> 0:04:41.981
You could assume they are all the same because we are
querying ourselves.

0:04:41.981 --> 0:04:49.489
However, we can make them different by just
learning a linear projection.

0:04:49.529 --> 0:05:01.836
So you learn here some projection based on
what you need in order to ask which question.

0:05:02.062 --> 0:05:11.800
That is the query; the key is what
others can compare against, and

0:05:11.800 --> 0:05:13.748
then there are the values.

0:05:14.014 --> 0:05:23.017
This is not hand-defined but learned,
so it's like three linear projections that

0:05:23.017 --> 0:05:26.618
you apply on all of these hidden states.

0:05:26.618 --> 0:05:32.338
That is the first thing, based on your initial
hidden states.

0:05:32.612 --> 0:05:37.249
And now you can do exactly as before, you
can do the attention.

0:05:37.637 --> 0:05:40.023
How did the attention work?

0:05:40.023 --> 0:05:45.390
The first thing is we are comparing our query
to all the keys.

0:05:45.445 --> 0:05:52.713
And that is now the difference: before, the
query was from the decoder and the keys were

0:05:52.713 --> 0:05:54.253
from the encoder.

0:05:54.253 --> 0:06:02.547
Now it's all from the same sequence, so we compare the query
of the first hidden state to the keys of all the others.

0:06:02.582 --> 0:06:06.217
We're getting some value here:

0:06:06.217 --> 0:06:12.806
how important is this information for a better
understanding?

0:06:13.974 --> 0:06:19.103
And these are just like floating point numbers.

0:06:19.103 --> 0:06:21.668
They are normalized.

0:06:22.762 --> 0:06:30.160
And that is the first step, so let's do it first
for the first query.

0:06:30.470 --> 0:06:41.937
What we can then do is multiply each value
as we have done before with the importance

0:06:41.937 --> 0:06:43.937
of each state.

0:06:45.145 --> 0:06:47.686
And then we have here the new hidden state.

0:06:48.308 --> 0:06:57.862
You see now that this new hidden state depends
on all the hidden states of the whole sequence

0:06:57.862 --> 0:06:59.686
in the previous layer.
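
NOTE
The steps described so far (three projections, comparing a query to all keys,
normalizing, merging the values) can be sketched as follows. This is only a
minimal plain-Python illustration under stated assumptions: the scaling by the
square root of the key size, the identity projection matrices and the toy
hidden states are not from the lecture, and real systems use learned matrices.
```python
import math
def matvec(matrix, vector):
    # Multiply a matrix (given as a list of rows) with a vector.
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]
def softmax(scores):
    # Normalize scores into weights that sum to one.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
def self_attention(states, w_q, w_k, w_v):
    # Project every hidden state into a query, a key and a value.
    queries = [matvec(w_q, h) for h in states]
    keys = [matvec(w_k, h) for h in states]
    values = [matvec(w_v, h) for h in states]
    scale = math.sqrt(len(keys[0]))  # assumption: scaled dot product
    outputs = []
    for q in queries:
        # Compare this query to all keys, then normalize the scores.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # New hidden state: importance-weighted sum of all values.
        size = len(values[0])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(size)])
    return outputs
identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical projections
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up hidden states
new_states = self_attention(states, identity, identity, identity)
```
Each new state only reads the previous layer, so the loop over queries could
run in parallel. Feeding the states in reversed order gives the same outputs in
reversed order, which is the word-order issue raised in the question that follows.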

0:06:59.879 --> 0:07:01.739
One important thing.

0:07:01.739 --> 0:07:08.737
This one doesn't really depend on the others, so the hidden
states here don't depend on each other.

0:07:09.029 --> 0:07:15.000
So it only depends on the hidden state of
the previous layer, but it depends on all the

0:07:15.000 --> 0:07:18.664
hidden states, and that is of course a big
advantage.

0:07:18.664 --> 0:07:25.111
So on the one hand information can directly
flow from each hidden state; before, the information

0:07:25.111 --> 0:07:27.214
flow was always a bit limited.

0:07:28.828 --> 0:07:35.100
And the independence is important so we can
calculate all these hidden states in parallel.

0:07:35.100 --> 0:07:41.371
That's another big advantage of self-attention:
we can calculate all the hidden states

0:07:41.371 --> 0:07:46.815
in one layer in parallel, and therefore it's
well suited for GPUs and fast.

0:07:47.587 --> 0:07:50.235
Then we can do the same thing for the second
hidden state.

0:07:50.530 --> 0:08:06.866
And the only difference here is how we calculate
the query.

0:08:07.227 --> 0:08:15.733
Getting these values is different because
we use a different query, and then we are getting

0:08:15.733 --> 0:08:17.316
our new hidden state.

0:08:18.258 --> 0:08:26.036
Yes, this is the word of words that underneath
this case might, but this is simple.

0:08:26.036 --> 0:08:26.498
Not.

0:08:27.127 --> 0:08:33.359
That's a very good question about
the initial thing.

0:08:33.359 --> 0:08:38.503
That is exactly one property of the architecture.

0:08:38.503 --> 0:08:44.042
Maybe at first you would think of it as a very big
disadvantage.

0:08:44.384 --> 0:08:49.804
So this hidden state would be the same if
the word order were different.

0:08:50.650 --> 0:08:59.983
And of course this estate is a site someone
should like, so if the estate would be here

0:08:59.983 --> 0:09:06.452
except for this correspondence the word order
is completely.

0:09:06.706 --> 0:09:17.133
Therefore, just doing self attention wouldn't
work at all because we know word order is important

0:09:17.133 --> 0:09:21.707
and there is a completely different meaning.

0:09:22.262 --> 0:09:26.277
We introduce the word position again.

0:09:26.277 --> 0:09:33.038
The main idea is if the position is already
in your embeddings.

0:09:33.533 --> 0:09:39.296
Then of course the position is there and you
don't lose it anymore.

0:09:39.296 --> 0:09:46.922
So if your representation of 'like' here
encodes that it is at the second position, then your output

0:09:46.922 --> 0:09:48.533
will be different.

0:09:49.049 --> 0:09:54.585
And that's how you encode it, and that's essential
in order to get this to work.

0:09:57.137 --> 0:10:08.752
But before we are coming to the next slide,
one other thing that is typically done is multi-head

0:10:08.752 --> 0:10:10.069
attention.

0:10:10.430 --> 0:10:15.662
And it might be that in order to understand
'much', it might be good that in some way we

0:10:15.662 --> 0:10:19.872
focus on 'like', and in some way we focus
on 'very', but not equally.

0:10:19.872 --> 0:10:25.345
Maybe, to understand it along
different dimensions, we should look into these.

0:10:25.905 --> 0:10:31.393
And therefore what we're doing is we're not
doing the self-attention just once, but we're

0:10:31.393 --> 0:10:35.031
doing it n times, and that is the multi-head
attention.

0:10:35.031 --> 0:10:41.299
In typical examples, the number of heads
people are talking about is like eight. So you're

0:10:41.299 --> 0:10:50.638
doing this process several times and have different queries
and keys, so you can focus on different things.

0:10:50.790 --> 0:10:52.887
How can you generate eight different things?

0:10:53.593 --> 0:11:07.595
It's quite easy here: instead of
having one linear projection you can have eight

0:11:07.595 --> 0:11:09.326
different ones.

0:11:09.569 --> 0:11:13.844
And it might be that sometimes you're looking
more into one thing, and sometimes you're looking

0:11:13.844 --> 0:11:14.779
more into the other.

0:11:15.055 --> 0:11:24.751
So that's of course nice with this type of
learned approach because we can automatically

0:11:24.751 --> 0:11:25.514
learn.

0:11:29.529 --> 0:11:36.629
And what you correctly said is that it is position independent,
so the order doesn't really matter,

0:11:36.629 --> 0:11:39.176
although the order should be important.

0:11:39.379 --> 0:11:47.686
So how can we do that? The idea is we are
just encoding it directly into the embedding,

0:11:47.686 --> 0:11:52.024
so into the starting representation.

0:11:52.512 --> 0:11:55.873
How do we get that? We start with our
embeddings.

0:11:55.873 --> 0:11:58.300
Just imagine this is the embedding of 'I'.

0:11:59.259 --> 0:12:06.169
And then we additionally have this positional
encoding.

0:12:06.169 --> 0:12:10.181
And this positional encoding is just functions

0:12:10.670 --> 0:12:19.564
with different wavelengths, so with different
lengths of your signal as you see here.

0:12:20.160 --> 0:12:37.531
And the number of functions you have is exactly
the number of dimensions you have in your embedding.

0:12:38.118 --> 0:12:51.091
And what we then do is take the first one,
and based on your position you multiply your

0:12:51.091 --> 0:12:51.955
word.

0:12:52.212 --> 0:13:02.518
And you see now if you put it in this position,
of course it will get a different value.

0:13:03.003 --> 0:13:12.347
And thereby in each position a different function
is multiplied.

0:13:12.347 --> 0:13:19.823
This is the representation at the first
position.

0:13:20.020 --> 0:13:34.922
If you have it in the input already encoded
then of course the model is able to keep the

0:13:34.922 --> 0:13:38.605
position information.
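
NOTE
A minimal sketch of a positional encoding with one function per embedding
dimension, each with a different wavelength. Assumptions, clearly not taken
from the slides: it uses the sine and cosine form and the base 10000 from the
original Transformer paper, and it adds the encoding to the embedding, whereas
the lecture speaks of multiplying. The embedding values are made up.
```python
import math
def positional_encoding(position, dim):
    # One value per embedding dimension; the wavelength grows with the dimension index.
    encoding = []
    for i in range(dim):
        pair_index = i - i % 2  # dimensions 2k and 2k+1 share one wavelength
        angle = position / (10000 ** (pair_index / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding
def add_position(embedding, position):
    # Assumption: additive combination, as in the original Transformer.
    encoding = positional_encoding(position, len(embedding))
    return [e + p for e, p in zip(embedding, encoding)]
embedding_i = [0.5, -0.2, 0.1, 0.3]  # hypothetical embedding of the word 'I'
first = add_position(embedding_i, 0)
second = add_position(embedding_i, 1)
```
The same word gets a different input vector at the first and at the second
position, so the self-attention layers above can keep the position information.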

0:13:38.758 --> 0:13:48.045
But the model can also learn your embeddings
in a way that they work optimally together

0:13:48.045 --> 0:13:49.786
with these types of encodings.

0:13:51.451 --> 0:13:59.351
Is that somehow clear where he is there?

0:14:06.006 --> 0:14:13.630
Am the first position and second position?

0:14:16.576 --> 0:14:17.697
Have a long wait period.

0:14:17.697 --> 0:14:19.624
I'm not going to tell you how to turn the.

0:14:21.441 --> 0:14:26.927
Be completely issued because if you have a
very short wavelength there might be quite

0:14:26.927 --> 0:14:28.011
big differences.

0:14:28.308 --> 0:14:33.577
And it might also be that it then depends,
of course, on what type of word embedding

0:14:33.577 --> 0:14:34.834
you've learned.

0:14:34.834 --> 0:14:37.588
Is the dimension where you have long changes

0:14:37.588 --> 0:14:43.097
important for your embedding or not? So
that's what I mean: the model can somehow

0:14:43.097 --> 0:14:47.707
learn that by putting more information into
one of the embedding dimensions.

0:14:48.128 --> 0:14:54.560
So I would assume it's learning
it a bit, but I haven't seen

0:14:54.560 --> 0:14:57.409
it studied in detail how different.

0:14:58.078 --> 0:15:07.863
It's also a bit difficult because really measuring
how similar or different a word is isn't that

0:15:07.863 --> 0:15:08.480
easy.

0:15:08.480 --> 0:15:13.115
You can do, of course, the average distance.

0:15:14.114 --> 0:15:21.393
So are the wavelengths learned by the model
too, or are there fixed wavelengths that the

0:15:21.393 --> 0:15:21.986
model uses?

0:15:24.164 --> 0:15:30.165
I believe they are fixed and the model learns.
There's a different way of doing it.

0:15:30.165 --> 0:15:32.985
The other thing you can do is you can.

0:15:33.213 --> 0:15:36.945
So you can learn a second embedding which
says this is position one.

0:15:36.945 --> 0:15:38.628
This is position two and so on.

0:15:38.628 --> 0:15:42.571
Like for words, you could learn embeddings
and then add them to the words.

0:15:42.571 --> 0:15:45.094
So then it would do the same
thing.

0:15:45.094 --> 0:15:46.935
There is one disadvantage of this.

0:15:46.935 --> 0:15:51.403
Does anybody have an idea what could be the
disadvantage of a more learned embedding?

0:15:54.955 --> 0:16:00.000
Here maybe extra play this finger and ethnic
stuff that will be an art.

0:16:00.000 --> 0:16:01.751
This will be an art for.

0:16:02.502 --> 0:16:08.323
You would only be good at positions you have
seen often, and especially for long sequences

0:16:08.323 --> 0:16:14.016
you might have seen the positions very rarely,
and then it is normally not performing that well,

0:16:14.016 --> 0:16:17.981
while here it can better learn a more general
representation.

0:16:18.298 --> 0:16:22.522
So there is another thing which we won't discuss
here,

0:16:22.522 --> 0:16:25.964
I guess, which is called relative attention.

0:16:25.945 --> 0:16:32.570
And in this case you don't learn absolute
positions, but in your calculation of the similarity

0:16:32.570 --> 0:16:39.194
you take again the relative distance into account
and have a different similarity depending on

0:16:39.194 --> 0:16:40.449
how far they are.

0:16:40.660 --> 0:16:45.898
And then you don't need to encode it beforehand,
but it would rather happen within your comparison.

0:16:46.186 --> 0:16:53.471
So when you compare how similar things are, you
of course also take the relative position into account.

0:16:55.715 --> 0:17:03.187
Because there are multiple ways to use the
one, to multiply all the embedding, or to use

0:17:03.187 --> 0:17:03.607
all.

0:17:17.557 --> 0:17:21.931
The encoder can be bidirectional.

0:17:21.931 --> 0:17:30.679
We have everything from the beginning so we
can have a model where.

0:17:31.111 --> 0:17:36.455
The decoder in training of course also has everything
available, but during inference you always have

0:17:36.455 --> 0:17:41.628
only the past available, so you can only look
into the previous ones and not into the future,

0:17:41.628 --> 0:17:46.062
because if you generate word by word you don't
know what will be there in the future.

0:17:46.866 --> 0:17:53.180
And so we also have to consider this somehow
in the attention, and until now we looked more

0:17:53.180 --> 0:17:54.653
at the encoder style.

0:17:54.653 --> 0:17:58.652
So if you look at this type of model, it's
bidirectional.

0:17:58.652 --> 0:18:03.773
So for this hidden state we are looking into
the past and into the future.

0:18:04.404 --> 0:18:14.436
So the question is, can we do this
unidirectionally so that you only look into

0:18:14.436 --> 0:18:15.551
the past?

0:18:15.551 --> 0:18:22.573
And the nice thing is, this is even easier
than for RNNs.

0:18:23.123 --> 0:18:29.738
There we would have different types of parameters
and models because you have a forward direction.

0:18:31.211 --> 0:18:35.679
For attention, that is very simple.

0:18:35.679 --> 0:18:39.403
We are doing what is called masking.

0:18:39.403 --> 0:18:45.609
If you want to have a backward model, these
ones.

0:18:45.845 --> 0:18:54.355
So for the first hidden state,
it's maybe only looking at itself.

0:18:54.894 --> 0:19:05.310
For the second it looks at the second and the
third, so you're always masking out all values

0:19:05.310 --> 0:19:07.085
in the future.

0:19:07.507 --> 0:19:13.318
And thereby you can have, with the same parameters,
the same model.

0:19:13.318 --> 0:19:15.783
You can then have a unidirectional model.

0:19:16.156 --> 0:19:29.895
In the decoder you do the masked self-attention,
where you only look into the past and you don't

0:19:29.895 --> 0:19:30.753
look into the future.
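
NOTE
The masking idea can be sketched like this: the similarity scores are computed
as before, but every score that belongs to a future position is replaced by
minus infinity before normalizing, so its weight becomes zero. This is a
plain-Python illustration with made-up scores; the function name is
hypothetical, and only the direction "look into the past" is shown.
```python
import math
def masked_attention_weights(scores):
    # scores[i][j] is the similarity of query i and key j.
    weights = []
    for i, row in enumerate(scores):
        # Mask: positions j > i lie in the future and get minus infinity.
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
scores = [[0.0, 1.0, 2.0], [1.0, 1.0, 5.0], [0.5, 0.5, 0.5]]  # made-up scores
weights = masked_attention_weights(scores)
```
The first position can then only attend to itself, the second to the first two,
and so on, while the parameters stay the same as in the unmasked model.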

0:19:32.212 --> 0:19:36.400
Then we only have, of course, looked onto
itself.

0:19:36.616 --> 0:19:50.903
So the question is: how can we combine encoder
and decoder? We can do a decoder and

0:19:50.903 --> 0:19:54.114
just have a second attention.

0:19:54.374 --> 0:20:00.286
And then we're doing the cross-attention, which
attends from the decoder to the encoder.

0:20:00.540 --> 0:20:10.239
So this time it's again that the query
is the current state of the decoder, while the keys

0:20:10.239 --> 0:20:22.833
and values are from the encoder. You can do both: attend to yourself to get the
meaning on the target side, and attend to the encoder to get the meaning of the source.

0:20:23.423 --> 0:20:25.928
So see then the full picture.

0:20:25.928 --> 0:20:33.026
This is now the typical picture of the transformer
and where you use self attention.

0:20:33.026 --> 0:20:36.700
So what you have is have your power hidden.

0:20:37.217 --> 0:20:43.254
What you then apply here is the positional
encoding. We are then doing the self-attention

0:20:43.254 --> 0:20:46.734
to all the others, and this can be bidirectional.

0:20:47.707 --> 0:20:54.918
You normally do another feed-forward layer,
just to be able to learn additional

0:20:54.918 --> 0:20:55.574
things.

0:20:55.574 --> 0:21:02.785
You just also have a feed-forward layer
which takes your hidden state and generates

0:21:02.785 --> 0:21:07.128
your new hidden state, because we are making things
deeper.

0:21:07.747 --> 0:21:15.648
Then this blue part you can stack over several
times so you can have layers so that.

0:21:16.336 --> 0:21:30.256
In addition there are these blue arrows. We talked
about this for RNNs: if you are now

0:21:30.256 --> 0:21:35.883
backpropagating your error from the top, it can vanish.

0:21:36.436 --> 0:21:48.578
In order to prevent that, we are not really
learning how to transform the input, but instead

0:21:48.578 --> 0:21:51.230
what we have to change.

0:21:51.671 --> 0:22:00.597
You're calculating what should be changed
with this one.

0:22:00.597 --> 0:22:09.365
The backwards clip each layer and the learning
is just.

0:22:10.750 --> 0:22:21.632
That was the encoder. Before we go to the decoder,

0:22:21.632 --> 0:22:30.655
do we have any additional questions?

0:22:31.471 --> 0:22:33.220
That's a very good point.

0:22:33.553 --> 0:22:38.709
Yeah, you normally, at least in
the default architecture, only look at the

0:22:38.709 --> 0:22:38.996
top layer of the

0:22:40.000 --> 0:22:40.388
encoder.

0:22:40.388 --> 0:22:42.383
Of course, you can do other things.

0:22:42.383 --> 0:22:45.100
We investigated, for example, the lowest layer.

0:22:45.100 --> 0:22:49.424
The decoder is looking at the lowest layer
of the encoder and not at the top.

0:22:49.749 --> 0:23:05.342
You can average, or you can even learn it; theoretically,
what you can also do is attend to all.

0:23:05.785 --> 0:23:11.180
You can attend to all possible layers and states.

0:23:11.180 --> 0:23:18.335
But the default is that you
only have the top.

0:23:20.580 --> 0:23:31.999
In the decoder, what we're doing is first
the same positional encoding, then we're doing

0:23:31.999 --> 0:23:36.419
self-attention on the decoder side.

0:23:37.837 --> 0:23:43.396
Of course here it's now important that we're doing
the masked self-attention so that we're only

0:23:43.396 --> 0:23:45.708
attending to the past and not to the future.

0:23:47.287 --> 0:24:02.698
Here you see the difference: in this case
the keys and values are from the encoder and

0:24:02.698 --> 0:24:03.554
the queries are from the decoder.

0:24:03.843 --> 0:24:12.103
You're comparing it to all the encoder hidden
states, calculating the similarity, and then

0:24:12.103 --> 0:24:13.866
you do the weighted sum.

0:24:14.294 --> 0:24:17.236
And that is added to what is here.

0:24:18.418 --> 0:24:29.778
Then you have a linear layer, and again this
green one is stacked several times and then.

0:24:32.232 --> 0:24:36.987
Question, so each code is off.

0:24:36.987 --> 0:24:46.039
Every one of those has the last layer of thing,
so in the.

0:24:46.246 --> 0:24:51.007
All of them attend only to the last, or the top, layer
of the encoder.

0:24:57.197 --> 0:25:00.127
Good, so that would be

0:25:01.501 --> 0:25:12.513
sequence-to-sequence models. We have looked at attention,
and before we come to decoding, do you have any

0:25:12.513 --> 0:25:18.020
more questions on this type of architecture?

0:25:20.480 --> 0:25:30.049
The transformer was first used in machine translation,
but now it's a standard thing for doing nearly

0:25:30.049 --> 0:25:32.490
any type of sequence model.

0:25:33.013 --> 0:25:35.984
Even large language models.

0:25:35.984 --> 0:25:38.531
They are a bit similar.

0:25:38.531 --> 0:25:45.111
They are just throwing away the encoder and
the cross-attention.

0:25:45.505 --> 0:25:59.329
And that is maybe interesting that it's important
to have this attention because you cannot store

0:25:59.329 --> 0:26:01.021
everything.

0:26:01.361 --> 0:26:05.357
The interesting thing with the attention is
now we can attend to everything.

0:26:05.745 --> 0:26:13.403
So you can again go back to your initial model
and have just a simple sequence model over the source and then

0:26:13.403 --> 0:26:14.055
the target.

0:26:14.694 --> 0:26:24.277
That would be a more language-model
style, or what people call a decoder-only model, where

0:26:24.277 --> 0:26:26.617
you throw this away.

0:26:27.247 --> 0:26:30.327
The nice thing is, because of your self-attention,

0:26:30.327 --> 0:26:34.208
the original problem for which you introduced
the attention,

0:26:34.208 --> 0:26:39.691
you don't have that anymore, because not
everything is summarized, but each time you

0:26:39.691 --> 0:26:44.866
generate, you're looking back at all the previous
words, the source and the target.

0:26:45.805 --> 0:26:51.734
And there is a lot of work on whether it is really
important to have an encoder-decoder model or

0:26:51.734 --> 0:26:54.800
whether a decoder-only model is as good.

0:26:54.800 --> 0:27:00.048
But the comparison is not that easy because
how many parameters do you have?

0:27:00.360 --> 0:27:08.832
So I think the general view at the moment is,
at least for machine translation, it's normally

0:27:08.832 --> 0:27:17.765
a bit better to have an encoder-decoder model
and not a decoder-only model where you just concatenate

0:27:17.765 --> 0:27:20.252
the source and the target.

0:27:21.581 --> 0:27:24.073
But there is not really a big difference anymore.

0:27:24.244 --> 0:27:29.891
Because this big issue, which we had initially,
that everything is stored in one hidden

0:27:29.891 --> 0:27:31.009
state, is no longer there.

0:27:31.211 --> 0:27:45.046
Of course, the advantage maybe here is that
you give it a bias at your same language information.

0:27:45.285 --> 0:27:53.702
While in a decoder-only model this all is
merged into one thing, and sometimes it is good

0:27:53.702 --> 0:28:02.120
to give models a bit of bias: okay, you should
maybe treat things separately and you should

0:28:02.120 --> 0:28:03.617
look at them differently.

0:28:04.144 --> 0:28:11.612
And of course there is one other difference, one other
disadvantage, maybe, of a decoder-only one.

0:28:16.396 --> 0:28:19.634
Think about the source sentence and how
it's treated.

0:28:21.061 --> 0:28:33.787
In the encoder-decoder architecture, the encoder can look in both directions
in the sentence for every state, and that causes a little

0:28:33.787 --> 0:28:35.563
difference.

0:28:35.475 --> 0:28:43.178
If you only have a decoder, that has to be
unidirectional, because on the decoder side

0:28:43.178 --> 0:28:51.239
for the generation you need it, and so your
input is read state by state, so you don't have

0:28:51.239 --> 0:28:54.463
bidirectional information.

0:28:56.596 --> 0:29:05.551
Again, it receives a sequence of embeddings
with position encoding.

0:29:05.551 --> 0:29:11.082
The piece is like long vector has output.

0:29:11.031 --> 0:29:17.148
Don't understand how you can set footworks
to this part of each other through inputs.

0:29:17.097 --> 0:29:20.060
Other than cola is the same as the food consume.

0:29:21.681 --> 0:29:27.438
Okay, that's a very good point; so this one-hot
encoding is only done on the top layer.

0:29:27.727 --> 0:29:32.012
So this green one is only repeated.

0:29:32.012 --> 0:29:38.558
You have the word embedding or the position
embedding.

0:29:38.558 --> 0:29:42.961
You have one layer of decoder which.

0:29:43.283 --> 0:29:48.245
Then you stack the second one, the third
one, the fourth one, and then on the top

0:29:48.208 --> 0:29:55.188
layer you put this projection layer, which
takes a one-thousand-dimensional vector and

0:29:55.188 --> 0:30:02.089
generates, based on your vocabulary, maybe a
ten-thousand-dimensional softmax layer, which gives you

0:30:02.089 --> 0:30:04.442
the probability of all words.

0:30:06.066 --> 0:30:22.369
It's a very good part part of the mass tape
ladies, but it wouldn't be for the X-rays.

0:30:22.262 --> 0:30:27.015
Aquarium filters to be like monsoon roding
as they get by the river.

0:30:27.647 --> 0:30:33.140
Yes, there is work on that. I think we will discuss
that in the pre-trained models.

0:30:33.493 --> 0:30:39.756
It's called where you exactly do that.

0:30:39.756 --> 0:30:48.588
If you look at the matrix side, it's like a diagonal
here.

0:30:48.708 --> 0:30:53.018
And this is a full matrix, so here everybody's
attending to each position.

0:30:53.018 --> 0:30:54.694
Here you're only attending to the past.

0:30:54.975 --> 0:31:05.744
Then you can do the previous one where this
one is decoded, not everything but everything.

0:31:06.166 --> 0:31:13.961
So you have a bit more that is possible, and
we'll have that in the lecture on pre-trained

0:31:13.961 --> 0:31:14.662
models.

0:31:18.478 --> 0:31:27.440
So we now know how to build a translation
system, but of course we don't want to have

0:31:27.440 --> 0:31:30.774
a translation system by itself.

0:31:31.251 --> 0:31:40.037
Now, given this model and an input sentence, how
can we generate an output sentence?

0:31:40.037 --> 0:31:49.398
The general idea is still: what we really
want to do is we start with the model.

0:31:49.398 --> 0:31:53.893
We generate different possible translations.

0:31:54.014 --> 0:31:59.754
We score them with the log probability that we're
getting, so for each input and output pair

0:31:59.754 --> 0:32:05.430
we can calculate the log probability, which
is the log of the product of all probabilities for each

0:32:05.430 --> 0:32:09.493
word in there, and then we can find what is
the most probable.
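
NOTE
A small sketch of this scoring step. The log of a product of word
probabilities equals the sum of the individual log probabilities, which is how
it is usually computed to avoid very small numbers. The two candidate
hypotheses and their per-word probabilities are made up for illustration.
```python
import math
def sequence_log_prob(word_probs):
    # Sum of log probabilities = log of the product of the probabilities.
    return sum(math.log(p) for p in word_probs)
# Hypothetical per-word model probabilities for two candidate translations.
candidates = {"hypothesis A": [0.5, 0.4, 0.9], "hypothesis B": [0.6, 0.1, 0.9]}
scores = {name: sequence_log_prob(probs) for name, probs in candidates.items()}
best = max(scores, key=scores.get)
```
With only two candidates we can simply score both and pick the better one; the
search problem discussed next arises because the set of candidates is far too
large to enumerate.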

0:32:09.949 --> 0:32:15.410
However, that's a bit complicated, as we will
see, because we can't look at all possible translations.

0:32:15.795 --> 0:32:28.842
So there is an infinite number of possible
translations, so we have to do it somehow

0:32:28.842 --> 0:32:31.596
more intelligently.

0:32:32.872 --> 0:32:37.821
So what do we want to do today in the rest of
the lecture?

0:32:37.821 --> 0:32:40.295
First, what is the search problem?

0:32:40.295 --> 0:32:44.713
Then we will look at different search algorithms.

0:32:45.825 --> 0:32:56.636
We will compare model and search errors, so there
can be errors on the model where the model

0:32:56.636 --> 0:33:03.483
is not giving the highest score to the best
translation.

0:33:03.903 --> 0:33:21.069
This is always like searching the best translation
out of one model, which is often also interesting.

0:33:24.004 --> 0:33:29.570
And how do we do the search?

0:33:29.570 --> 0:33:41.853
We want to find the translation where the
error with respect to the reference is minimal.

0:33:42.042 --> 0:33:44.041
So the nice thing is, in SMT

0:33:44.041 --> 0:33:51.347
it wasn't the case, but in neural machine translation
we can find any possible translation, so

0:33:51.347 --> 0:33:53.808
at least within our vocabulary.

0:33:53.808 --> 0:33:58.114
But if we have BPE we can really generate
any possible

0:33:58.078 --> 0:34:04.604
translation, and in theory we could always minimize
that, but yeah, we can't do it that easily because

0:34:04.604 --> 0:34:07.734
of course we don't have the reference at hand.

0:34:07.747 --> 0:34:10.384
If we had a reference, it would not be a problem.

0:34:10.384 --> 0:34:13.694
We would know what we are searching for, but we
don't know.

0:34:14.054 --> 0:34:23.886
So how can we then model this by just finding
the translation with the highest probability?

0:34:23.886 --> 0:34:29.015
Looking at it, we want to find the translation.

0:34:29.169 --> 0:34:32.525
The idea is that our model is a good approximation,

0:34:32.525 --> 0:34:34.399
that's how we train it,

0:34:34.399 --> 0:34:36.584
of what a good translation is.

0:34:36.584 --> 0:34:43.687
And if we find the translation with the highest
probability, this should also give us the best

0:34:43.687 --> 0:34:44.702
translation.

0:34:45.265 --> 0:34:56.965
And that is then, of course, the difference
between search error and model error: the model error is that the model

0:34:56.965 --> 0:35:02.076
doesn't predict the best translation.

0:35:02.622 --> 0:35:08.777
How can we do the basic search? First of all,
the basic search seems to be very easy.

0:35:08.777 --> 0:35:15.003
What we can do is the forward
pass for the whole encoder, and that's how it

0:35:15.003 --> 0:35:21.724
starts. The input sentence is known, so you can put in
the input sentence and calculate all your

0:35:21.724 --> 0:35:22.573
hidden states.

0:35:23.083 --> 0:35:35.508
Then you can put in your sentence start and
you can generate.

0:35:35.508 --> 0:35:41.721
Here you have the probability.

0:35:41.801 --> 0:35:52.624
A good idea, and we will see later that it is a typical
algorithm, is I guess what you all would do: you

0:35:52.624 --> 0:35:54.788
would then select the most probable word.

0:35:55.235 --> 0:36:06.265
So if you generate here a probability distribution
over all the words in your vocabulary then

0:36:06.265 --> 0:36:08.025
you can solve.

0:36:08.688 --> 0:36:13.147
Yeah, this is how autocompletion is done
in your system.

0:36:14.794 --> 0:36:19.463
Yeah, this is also why there you have to have
a model of possible continuations.

0:36:19.463 --> 0:36:24.314
It's more of a language model, but then this
is one algorithm to do the search.

0:36:24.314 --> 0:36:26.801
They maybe also have more advanced ones.

0:36:26.801 --> 0:36:32.076
We will see that, so the search in
autocompletion should be exactly the same as the

0:36:32.076 --> 0:36:33.774
search in machine translation.

0:36:34.914 --> 0:36:40.480
So we'll see that this is not optimal, so
hopefully it's not that this way, but for this

0:36:40.480 --> 0:36:41.043
problem.

0:36:41.941 --> 0:36:47.437
And what you can do then you can select this
word.

0:36:47.437 --> 0:36:50.778
This was the best translation.

0:36:51.111 --> 0:36:57.675
Because the decoder, of course, in the next
step now needs to know what the best word

0:36:57.675 --> 0:37:02.396
here is; it takes it as input and generates the next probability
distribution.

0:37:03.423 --> 0:37:14.608
And then you have your new distribution, and you can
do the same thing, there's the best word there,

0:37:14.608 --> 0:37:15.216
and.

0:37:15.435 --> 0:37:22.647
So you can continue doing that and always
hopefully get the best translation in the end.

0:37:23.483 --> 0:37:30.839
The first question is, of course, how long
are you doing it?

0:37:30.839 --> 0:37:33.854
Now we could go forever.

0:37:36.476 --> 0:37:52.596
We had this token at the input and we put
the stop token at the output.

0:37:53.974 --> 0:38:07.217
And this is important because if we wouldn't
do that then we wouldn't have a good idea.
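
NOTE
Editor's sketch of the greedy search described above, assuming a hypothetical toy model: the table TOY, the function toy_model, and the token strings are made up for illustration and are not from the lecture. At each step the most probable word is selected and fed back in, and decoding ends at the stop token or after max_len steps.
```python
from typing import Callable, Dict, List
BOS, EOS = "<s>", "</s>"  # assumed names for the start and stop tokens
def greedy_decode(next_word_probs: Callable[[List[str]], Dict[str, float]],
                  max_len: int = 20) -> List[str]:
    """Always pick the most probable next word until EOS or max_len."""
    prefix = [BOS]
    for _ in range(max_len):
        probs = next_word_probs(prefix)
        best = max(probs, key=probs.get)
        if best == EOS:
            break
        prefix.append(best)
    return prefix[1:]
# Hypothetical toy model: the distribution depends only on the previous word.
TOY = {
    BOS: {"das": 0.6, "dies": 0.4},
    "das": {"ist": 0.9, EOS: 0.1},
    "dies": {"ist": 0.8, EOS: 0.2},
    "ist": {"gut": 0.7, EOS: 0.3},
    "gut": {EOS: 1.0},
}
def toy_model(prefix: List[str]) -> Dict[str, float]:
    return TOY[prefix[-1]]
print(greedy_decode(toy_model))
```
Without the stop token the loop would only end because of max_len, which is why the stop token is needed at the output.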

0:38:10.930 --> 0:38:16.193
So that seems to be a good idea, but is it
really?

0:38:16.193 --> 0:38:21.044
Do we find the most probable sentence in this?

0:38:23.763 --> 0:38:25.154
[Inaudible audience response.]

0:38:27.547 --> 0:38:41.823
We are always selecting the highest probability
one, so it seems to be that this is a very

0:38:41.823 --> 0:38:45.902
good solution. Anybody?

0:38:46.406 --> 0:38:49.909
Yes, that is actually the problem.

0:38:49.909 --> 0:38:56.416
You might make early decisions and you don't
have the global view.

0:38:56.796 --> 0:39:02.813
And this problem happens because it is an
autoregressive model.

0:39:03.223 --> 0:39:13.275
So it happens because yeah, the output we
generate is the input in the next step.

0:39:13.793 --> 0:39:19.493
And this, of course, is leading to problems.

0:39:19.493 --> 0:39:27.474
If we always take the best solution, it doesn't
mean you have.

0:39:27.727 --> 0:39:33.941
It would be different if you have a problem
where the output is not influencing your input.

0:39:34.294 --> 0:39:44.079
Then this solution will give you the best
model, but since the output is influencing

0:39:44.079 --> 0:39:47.762
your next input and the model,.

0:39:48.268 --> 0:39:51.599
Because one question might now be: why do we
have this type of model?

0:39:51.771 --> 0:39:58.946
So why do we really need to put here in the
last source word?

0:39:58.946 --> 0:40:06.078
You can also put in: And then always predict
the word and the nice thing is then you wouldn't

0:40:06.078 --> 0:40:11.846
need to do beam search or a difficult search because
then the output here wouldn't influence what

0:40:11.846 --> 0:40:12.975
is inputted here.

0:40:15.435 --> 0:40:20.219
Any idea why that might not be the best idea?

0:40:20.219 --> 0:40:24.588
You'll just be translating each word and.

0:40:26.626 --> 0:40:37.815
The second one is right, yes, you're not generating
a coherent sentence.

0:40:38.058 --> 0:40:48.197
We'll also see that later; it's called
non-autoregressive translation, so there is work

0:40:48.197 --> 0:40:49.223
on that.

0:40:49.529 --> 0:41:02.142
So you might know it roughly because you know
it's based on this hidden state, but it can

0:41:02.142 --> 0:41:08.588
be that in the end you have your probability.

0:41:09.189 --> 0:41:14.633
And then you're not modeling the dependencies
within the target sentence.

0:41:14.633 --> 0:41:27.547
For example: You can express things in German,
then you don't know which one you really select.

0:41:27.547 --> 0:41:32.156
That influences what you later.

0:41:33.393 --> 0:41:46.411
Then you try to find a better way not only
based on the English sentence and the words

0:41:46.411 --> 0:41:48.057
that come.

0:41:49.709 --> 0:42:00.954
Yes, that is more like a two-step decoding,
but that is, of course, a lot more computation.

0:42:01.181 --> 0:42:15.978
The first thing you can do, which is typically
done, is doing not really search.

0:42:16.176 --> 0:42:32.968
So let's first look at what the problem of greedy search
is, to make it a bit more clear.

0:42:34.254 --> 0:42:53.163
And now you can extend them and you can extend
these and the joint probabilities.

0:42:54.334 --> 0:42:59.063
The other thing is the second word.

0:42:59.063 --> 0:43:03.397
You can do the second word dusk.

0:43:03.397 --> 0:43:07.338
Now you see the problem here.

0:43:07.707 --> 0:43:17.507
It is true that these have the highest probability,
but for these you have an extension.

0:43:18.078 --> 0:43:31.585
So the problem is just because in one position
one hypothesis, so you can always call this

0:43:31.585 --> 0:43:34.702
partial translation.

0:43:34.874 --> 0:43:41.269
The blue one is higher in the beginning, but the green
one can be better extended and it will overtake.
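
NOTE
Editor's sketch of the overtaking effect, assuming made-up numbers: FIRST and SECOND are hypothetical probability tables, not the values from the lecture slide. The "blue" first word is more probable, but the "green" one has a better extension, so the greedy choice ends with a lower joint probability than the best full hypothesis.
```python
from itertools import product
from typing import Tuple
# Hypothetical two-step model: first word, then second word given the first.
FIRST = {"blue": 0.6, "green": 0.4}
SECOND = {"blue": {"x": 0.5, "y": 0.5}, "green": {"x": 0.9, "y": 0.1}}
def joint(first: str, second: str) -> float:
    """Joint probability of a two-word hypothesis."""
    return FIRST[first] * SECOND[first][second]
def greedy() -> Tuple[Tuple[str, str], float]:
    first = max(FIRST, key=FIRST.get)
    second = max(SECOND[first], key=SECOND[first].get)
    return (first, second), joint(first, second)
def exhaustive() -> Tuple[Tuple[str, str], float]:
    best = max(product(FIRST, ["x", "y"]), key=lambda pair: joint(*pair))
    return best, joint(*best)
print("greedy:", greedy())
print("exhaustive:", exhaustive())
```
With these numbers greedy search commits to "blue" and cannot revise that decision later, which is the search error discussed here.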

0:43:45.525 --> 0:43:54.672
So the problem is if we are doing this greedy
search is that we might not end up in really

0:43:54.672 --> 0:43:55.275
good.

0:43:55.956 --> 0:44:00.916
So the first thing we could now do is like,
yeah, we can just try

0:44:00.880 --> 0:44:06.049
all combinations that are there, so there
is the other direction.

0:44:06.049 --> 0:44:13.020
So if the solution to check only the first one
doesn't give us a

0:44:13.020 --> 0:44:17.876
good result, maybe what we have to do is just
try everything.

0:44:18.318 --> 0:44:23.120
The nice thing is if we try everything, we'll
definitely find the best translation.

0:44:23.463 --> 0:44:26.094
So we won't have a search error.

0:44:26.094 --> 0:44:28.167
We'll come to that later.

0:44:28.167 --> 0:44:32.472
The interesting thing is our translation performance.

0:44:33.353 --> 0:44:37.039
But we will definitely find the most probable
translation.

0:44:38.598 --> 0:44:44.552
However, it's not really possible because
the number of combinations is just too high.

0:44:44.764 --> 0:44:57.127
So the number of combinations is your vocabulary
size to the power of the length of your sentence.

0:44:57.157 --> 0:45:03.665
With ten thousand words or so, you can imagine that very
soon you will have so many possibilities here

0:45:03.665 --> 0:45:05.597
that you cannot check all.
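
NOTE
Editor's sketch of why trying everything is infeasible. The vocabulary size of 10,000 is the figure used in the lecture; the sentence length of 20 is an assumed example value. The count covers sequences of exactly that length.
```python
def num_candidates(vocab_size: int, length: int) -> int:
    """Number of distinct word sequences with exactly `length` words."""
    return vocab_size ** length
count = num_candidates(10_000, 20)
print(f"{count:.1e} candidate translations")
```
Even a fast decoder cannot score this many hypotheses, so practical search has to look at only a small subset.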

0:45:06.226 --> 0:45:13.460
So this is not really an implementation or an
algorithm that you can use for applying machine

0:45:13.460 --> 0:45:14.493
translation.

0:45:15.135 --> 0:45:24.657
So maybe we have to do something in between
and yeah, not look at all but only look at

0:45:24.657 --> 0:45:25.314
some.

0:45:26.826 --> 0:45:29.342
And the easiest thing for that is okay.

0:45:29.342 --> 0:45:34.877
Just do sampling, so if we don't know what
to look at, maybe it's good to randomly pick

0:45:34.877 --> 0:45:35.255
some.

0:45:35.255 --> 0:45:40.601
That's not really a very good algorithm, but
the basic idea is that we always randomly select

0:45:40.601 --> 0:45:42.865
the word, of course, based on bits.

0:45:43.223 --> 0:45:52.434
We are doing that n times, and then we are
looking at which one at the end has the highest probability.

0:45:52.672 --> 0:45:59.060
So we are no longer really searching
for the best one, but we are more randomly

0:45:59.060 --> 0:46:05.158
doing selections with the idea that we always
select the best one at the beginning.

0:46:05.158 --> 0:46:11.764
So maybe it's better to do random, but of
course one important thing is how do we randomly

0:46:11.764 --> 0:46:12.344
select?

0:46:12.452 --> 0:46:15.756
If we just do uniform distribution, it would
be very bad.

0:46:15.756 --> 0:46:18.034
You'll only have very bad translations.

0:46:18.398 --> 0:46:23.261
Because in each position if you think about
it you have ten thousand possibilities.

0:46:23.903 --> 0:46:28.729
Most of them are really bad decisions and
you shouldn't do that.

0:46:28.729 --> 0:46:35.189
There is always only a very small number,
at least compared to the 10 000 translation.

0:46:35.395 --> 0:46:43.826
So if you have the sentence here, this is
an English sentence.

0:46:43.826 --> 0:46:47.841
You can start with these and.

0:46:48.408 --> 0:46:58.345
You're thinking about setting legal documents
in a legal document.

0:46:58.345 --> 0:47:02.350
You should not change the.

0:47:03.603 --> 0:47:11.032
The problem is we have a neural network, we
have a black box, so it's anyway a bit random.

0:47:12.092 --> 0:47:24.341
It is considered, but you will see that if
you make it intelligent for clear sentences,

0:47:24.341 --> 0:47:26.986
there is not that.

0:47:27.787 --> 0:47:35.600
Is an issue we should consider that this one
might lead to more randomness, but it might

0:47:35.600 --> 0:47:39.286
also be positive for machine translation.

0:47:40.080 --> 0:47:46.395
At least I can't directly think of a good application
where it's positive, but if you think

0:47:46.395 --> 0:47:52.778
about dialogue systems, for example, where
a similar architecture is nowadays also used,

0:47:52.778 --> 0:47:55.524
you predict what the system should say.

0:47:55.695 --> 0:48:00.885
Then you want to have randomness because it's
not always saying the same thing.

0:48:01.341 --> 0:48:08.370
In machine translation that is typically not the case: you want
to have consistency, so if you have the same

0:48:08.370 --> 0:48:09.606
input normally.

0:48:09.889 --> 0:48:14.528
Therefore, sampling is not a mathieu.

0:48:14.528 --> 0:48:22.584
There are some things you will later see as
a preprocessing step.

0:48:23.003 --> 0:48:27.832
But of course it's important how you can make
this process not too random.

0:48:29.269 --> 0:48:41.619
Therefore, the first thing is don't take a
uniform distribution, but we have a very nice

0:48:41.619 --> 0:48:43.562
distribution.

0:48:43.843 --> 0:48:46.621
So instead of just randomly taking a word,

0:48:46.621 --> 0:48:51.328
we are looking at the output distribution and
now taking a word.

0:48:51.731 --> 0:49:03.901
So that means we are taking the word these,
we are taking the word does, and all these.

0:49:04.444 --> 0:49:06.095
How can you do that?

0:49:06.095 --> 0:49:09.948
You randomly draw a number between zero and
one.

0:49:10.390 --> 0:49:23.686
And then you have ordered your words in some
way, and then you take the first word at which the

0:49:23.686 --> 0:49:26.375
cumulative sum of the probabilities exceeds that number.

0:49:26.806 --> 0:49:34.981
So the easiest thing is you have zero point
five, zero point two five, and zero point two

0:49:34.981 --> 0:49:35.526
five.

0:49:35.526 --> 0:49:43.428
If you have a number smaller than 0.5, you take
the first word; between 0.5 and 0.75, you take the second word; and

0:49:43.428 --> 0:49:45.336
if it's higher than 0.75, the third.

0:49:45.845 --> 0:49:57.707
Therefore, you can very easily get a sample
distributed according to this probability mass

0:49:57.707 --> 0:49:59.541
and no longer.
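
NOTE
Editor's sketch of the drawing procedure just described, using the lecture's example probabilities 0.5, 0.25 and 0.25. The word labels "a", "b", "c" and the optional argument u (a fixed number in place of a random draw, useful for checking) are assumptions for illustration.
```python
import random
from typing import Dict, Optional
def sample_word(probs: Dict[str, float], u: Optional[float] = None) -> str:
    """Return the first word whose cumulative probability exceeds u."""
    if u is None:
        u = random.random()  # uniform number in [0, 1)
    cumulative = 0.0
    for word, p in probs.items():
        cumulative += p
        if u < cumulative:
            return word
    return word  # guard against rounding when u is very close to 1
probs = {"a": 0.5, "b": 0.25, "c": 0.25}
print(sample_word(probs, u=0.3), sample_word(probs, u=0.6), sample_word(probs, u=0.9))
```
Each word is selected with exactly its model probability, because the interval of u values that maps to a word has that length.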

0:49:59.799 --> 0:50:12.479
You can even do that a bit more and
focus more on the important part if we are not randomly

0:50:12.479 --> 0:50:19.494
drawing from all words, but we are looking
only at.

0:50:21.361 --> 0:50:24.278
Do you have an idea why this is an important
step?

0:50:24.278 --> 0:50:29.459
Although we say I'm only throwing away the
words which have a very low probability, so

0:50:29.459 --> 0:50:32.555
anyway the probability of taking them is quite
low.

0:50:32.555 --> 0:50:35.234
So normally that shouldn't matter that much.

0:50:36.256 --> 0:50:38.830
There's ten thousand words.

0:50:40.300 --> 0:50:42.074
[Inaudible audience response.]

0:50:42.074 --> 0:50:44.002
[Inaudible audience response.]

0:50:45.085 --> 0:50:47.425
[Inaudible audience response.]

0:50:47.867 --> 0:50:55.299
Yes, that's exactly why you do this kind of sampling,
so that you don't take the lowest-probability

0:50:55.415 --> 0:50:59.694
words, but you only look at the
most probable ones and then like.

0:50:59.694 --> 0:51:04.632
Of course you have to rescale your probability
mass then so that it's still a probability

0:51:04.632 --> 0:51:08.417
because now it's a probability distribution
over ten thousand words.

0:51:08.417 --> 0:51:13.355
If you only take ten of them or so it's no
longer a probability distribution, you rescale

0:51:13.355 --> 0:51:15.330
them and you can still do that and.
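
NOTE
Editor's sketch of the restriction and rescaling step, a scheme commonly called top-k sampling (the name is not used in the lecture). The function name, the example distribution and k are assumptions for illustration.
```python
from typing import Dict
def top_k_renormalize(probs: Dict[str, float], k: int) -> Dict[str, float]:
    """Keep the k most probable words and rescale them to sum to one."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {word: p / total for word, p in top}
probs = {"a": 0.5, "b": 0.3, "c": 0.1, "d": 0.1}
print(top_k_renormalize(probs, k=2))
```
The rescaled values form a probability distribution again, so the same cumulative-sum drawing can be applied to the reduced word list.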

0:51:16.756 --> 0:51:20.095
That is what is done in sampling.

0:51:20.095 --> 0:51:26.267
It's not the most common thing, but it's done
several times.

0:51:28.088 --> 0:51:40.625
Then there is beam search, which is somehow the standard
if you're doing some type of machine translation.

0:51:41.181 --> 0:51:50.162
And the basic idea is that in greedy search we
select the most probable one and only continue

0:51:50.162 --> 0:51:51.171
with the.

0:51:51.691 --> 0:51:53.970
You can easily generalize this.

0:51:53.970 --> 0:52:00.451
We are not only continuing the most probable
one, but we are continuing the n most probable.

0:52:17.697 --> 0:52:26.920
You should say we are sampling how many examples
it makes sense to take the one with the highest.

0:52:27.127 --> 0:52:33.947
But it is important that once you make a mistake
you might want to not influence that much.

0:52:39.899 --> 0:52:45.815
So the idea is that we're keeping the n best
hypotheses and not only the first best.

0:52:46.586 --> 0:52:51.558
And the nice thing is in statistical machine
translation.

0:52:51.558 --> 0:52:54.473
We have exactly the same problem.

0:52:54.473 --> 0:52:57.731
You would do the same thing, however.

0:52:57.731 --> 0:53:03.388
Since the model wasn't that strong you needed
a quite large beam.

0:53:03.984 --> 0:53:18.944
Neural machine translation models are really strong
and you already get a very good performance.

0:53:19.899 --> 0:53:22.835
So how does it work?

0:53:22.835 --> 0:53:35.134
We again calculate our probabilities, but now
we are not storing only the most probable one.

0:53:36.156 --> 0:53:45.163
Having done that, we extend all these hypotheses, and
of course this is now a bit difficult because

0:53:45.163 --> 0:53:54.073
now we always have to switch what is the input
so the search gets more complicated and the

0:53:54.073 --> 0:53:55.933
first one is easy.

0:53:56.276 --> 0:54:09.816
In this case we have to once put in here these
and then somehow delete this one and instead

0:54:09.816 --> 0:54:12.759
put that into that.

0:54:13.093 --> 0:54:24.318
Otherwise you could only store your current
network states here and just continue by going

0:54:24.318 --> 0:54:25.428
forward.

0:54:26.766 --> 0:54:34.357
So now you have done the first two, and then
you have known the best.

0:54:34.357 --> 0:54:37.285
Can you now just continue?

0:54:39.239 --> 0:54:53.511
Yes, that's very important, otherwise all
your beam search doesn't really help because

0:54:53.511 --> 0:54:57.120
you would still have.

0:54:57.317 --> 0:55:06.472
So now you have to do one important step and
then reduce again to n.

0:55:06.472 --> 0:55:13.822
So in our case to make things easier we have
the inputs.

0:55:14.014 --> 0:55:19.072
Otherwise you will have two to the power of
length possibilities, so it is still exponential.

0:55:19.559 --> 0:55:26.637
But by always throwing them away you keep
your beam fixed.

0:55:26.637 --> 0:55:31.709
The items now differ in the last position.

0:55:32.492 --> 0:55:42.078
They are completely different, but you are
always searching what is the best one.

0:55:44.564 --> 0:55:50.791
So another way of viewing it is like this,
so just imagine you start with the empty sentence.

0:55:50.791 --> 0:55:55.296
Then you have three possible extensions: A,
B, and end of sentence.

0:55:55.296 --> 0:55:59.205
It's throwing away the worst one, continuing
with the two.

0:55:59.699 --> 0:56:13.136
Then you want to stay at two, so in this state
it's either or and then you continue.

0:56:13.293 --> 0:56:24.924
So you always have this exponentially growing
tree, but you are throwing most of them away and only

0:56:24.924 --> 0:56:26.475
continuing.

0:56:26.806 --> 0:56:42.455
And thereby you can hopefully make fewer errors
because in these examples you always see this

0:56:42.455 --> 0:56:43.315
one.

0:56:43.503 --> 0:56:47.406
So you're preventing some errors, but of course
it's not perfect.

0:56:47.447 --> 0:56:56.829
You can still make errors because it could be
not the second one but the fourth one.

0:56:57.017 --> 0:57:03.272
The idea is just that you make, yeah, fewer
errors and prevent that.
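
NOTE
Editor's sketch of beam search as described above: extend every hypothesis in the beam, then prune back to the n best. The toy model toy_next, its probabilities, and the choice to move finished hypotheses to a separate list are assumptions for illustration, not the lecture's implementation. Scores are sums of log-probabilities.
```python
import math
from typing import Callable, Dict, List, Tuple
EOS = "</s>"  # assumed name for the stop token
Hyp = Tuple[float, List[str]]  # (log-probability, words so far)
def beam_search(next_word_probs: Callable[[List[str]], Dict[str, float]],
                beam_size: int = 2, max_len: int = 10) -> Hyp:
    beam: List[Hyp] = [(0.0, [])]
    finished: List[Hyp] = []
    for _ in range(max_len):
        candidates: List[Hyp] = []
        for score, words in beam:  # extend all hypotheses in the beam
            for word, p in next_word_probs(words).items():
                candidates.append((score + math.log(p), words + [word]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for score, words in candidates[:beam_size]:  # reduce again to n
            (finished if words[-1] == EOS else beam).append((score, words))
        if not beam:
            break
    return max(finished + beam, key=lambda c: c[0])
def toy_next(words: List[str]) -> Dict[str, float]:
    """Hypothetical model: 'blue' starts higher, 'green' extends better."""
    if not words:
        return {"blue": 0.6, "green": 0.4}
    if len(words) == 1:
        return {"x": 0.5, "y": 0.5} if words[0] == "blue" else {"x": 0.9, "y": 0.1}
    return {EOS: 1.0}
print(beam_search(toy_next, beam_size=1)[1])
print(beam_search(toy_next, beam_size=2)[1])
```
With a beam of one this behaves like greedy search and stays with "blue"; with a beam of two the "green" hypothesis survives the first step and overtakes.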

0:57:07.667 --> 0:57:11.191
Then the question is how much does it help?

0:57:11.191 --> 0:57:14.074
And here are some examples for that.

0:57:14.074 --> 0:57:16.716
So for SMT it was really like this.

0:57:16.716 --> 0:57:23.523
Typically with a larger beam you have a larger
search space and you have a better score.

0:57:23.763 --> 0:57:27.370
So the larger you get, the bigger your beam,
the better you will be.

0:57:27.370 --> 0:57:30.023
Typically maybe use something like three hundred.

0:57:30.250 --> 0:57:38.777
And it's mainly a trade-off between quality
and speed because the larger your beams, the

0:57:38.777 --> 0:57:43.184
more time it takes and you want to finish it.

0:57:43.184 --> 0:57:49.124
So your quality improvements are getting smaller
and smaller.

0:57:49.349 --> 0:57:57.164
So the difference between a beam of one and
ten is bigger than the difference between a.

0:57:58.098 --> 0:58:14.203
And the interesting thing is we're seeing
a bit of a different view, and we're seeing

0:58:14.203 --> 0:58:16.263
typically.

0:58:16.776 --> 0:58:24.376
And then especially if you look at the green
ones, this is unnormalized.

0:58:24.376 --> 0:58:26.770
You're seeing a sharp.

0:58:27.207 --> 0:58:32.284
So your translation quality here measured
in BLEU will go down again.

0:58:33.373 --> 0:58:35.663
That is now a question.

0:58:35.663 --> 0:58:37.762
Why is that the case?

0:58:37.762 --> 0:58:43.678
Why should it go down when we are seeing more and more possible
translations?

0:58:46.226 --> 0:58:48.743
If we have a bigger stretch and we are going.

0:58:52.612 --> 0:58:56.312
I'm going to be using my examples before we
also look at the bar.

0:58:56.656 --> 0:58:59.194
A good idea.

0:59:00.000 --> 0:59:18.521
But it's not everything because we in the
end always in this list we're selecting.

0:59:18.538 --> 0:59:19.382
So this is here.

0:59:19.382 --> 0:59:21.170
We don't do any regions to do that.

0:59:21.601 --> 0:59:29.287
So the probabilities at the end we always
give out the hypothesis with the highest probabilities.

0:59:30.250 --> 0:59:33.623
That is always the case.

0:59:33.623 --> 0:59:43.338
If you have a beam of this should be a subset
of the items you look at.

0:59:44.224 --> 0:59:52.571
So if you increase your beam you're just
looking at more and you're always taking the

0:59:52.571 --> 0:59:54.728
one with the highest.

0:59:57.737 --> 1:00:07.014
Maybe they are all the probability that they
will be comparable to don't really have.

1:00:08.388 --> 1:00:14.010
But the probabilities are the same, not that
easy.

1:00:14.010 --> 1:00:23.931
One morning maybe you will have more examples
where we look at some stuff that's not seen

1:00:23.931 --> 1:00:26.356
in the trading space.

1:00:28.428 --> 1:00:36.478
That's mainly the answer, why we give a high probability
mass, as we will see, but that is first of all

1:00:36.478 --> 1:00:43.087
the biggest issue, so here is the BLEU score,
so that is somewhat the translation quality.

1:00:43.883 --> 1:00:48.673
This will go down, but the probability of the
highest one only goes up or stays the same

1:00:48.673 --> 1:00:49.224
at least.

1:00:49.609 --> 1:00:57.971
The problem is if we are searching more, we
are finding hypotheses which have a high

1:00:57.971 --> 1:00:59.193
translation probability.

1:00:59.579 --> 1:01:10.375
So we are finding these things which we wouldn't
find and we'll see why this is happening.

1:01:10.375 --> 1:01:15.714
So somehow we are reducing our search error.

1:01:16.336 --> 1:01:25.300
However, we also have a model error and we
don't assign the highest probability to the

1:01:25.300 --> 1:01:27.942
translation with the really best quality.

1:01:28.548 --> 1:01:31.460
They don't always add up.

1:01:31.460 --> 1:01:34.932
Of course somehow they add up.

1:01:34.932 --> 1:01:41.653
If your model is worse, then your performance
will even go down.

1:01:42.202 --> 1:01:49.718
But sometimes it's happening that by increasing
search errors we are missing out the really

1:01:49.718 --> 1:01:57.969
bad translations which have a high probability
and we are only finding the decently good probability

1:01:57.969 --> 1:01:58.460
mass.

1:01:59.159 --> 1:02:03.859
So they are a bit independent of each other
and you can make both types of errors.

1:02:04.224 --> 1:02:09.858
That's why, for example, doing exact search
will give you the translation with the highest

1:02:09.858 --> 1:02:15.245
probability, but there has been work on it
that you then even have a lower translation

1:02:15.245 --> 1:02:21.436
quality because then you find some random translation
which has a very high translation probability

1:02:21.436 --> 1:02:22.984
but which is really bad.

1:02:23.063 --> 1:02:29.036
Because our model is not perfect and is not giving
a perfect translation probability everywhere.

1:02:31.431 --> 1:02:34.537
So why is this happening?

1:02:34.537 --> 1:02:42.301
And one issue with this is the so-called label
or length bias.

1:02:42.782 --> 1:02:47.115
And we are in each step of decoding.

1:02:47.115 --> 1:02:55.312
We are modeling the probability of the next
word given the input and.

1:02:55.895 --> 1:03:06.037
So if you have this picture, you always
have here the probability of the next word.

1:03:06.446 --> 1:03:16.147
That's what you're modeling, and of course
the model is not perfect.

1:03:16.576 --> 1:03:22.765
So it can be that if we at one time make a bit of a
wrong prediction, not for the first one but

1:03:22.765 --> 1:03:28.749
maybe for the 5th or 6th thing, then we're
giving it an exceptionally high probability that we

1:03:28.749 --> 1:03:30.178
cannot recover from.

1:03:30.230 --> 1:03:34.891
Because this high probability will stay there
forever and we just multiply other things to

1:03:34.891 --> 1:03:39.910
it, but we cannot later say, oh, this probability
was a bit too high, we shouldn't have done.

1:03:41.541 --> 1:03:48.984
And this leads to the effect that the longer
your translation is, the more often you use

1:03:48.984 --> 1:03:51.637
this probability distribution.

1:03:52.112 --> 1:04:03.321
The typical example is this one, so you have
the probability of the translation.

1:04:04.104 --> 1:04:12.608
And this probability is quite low as you see,
and maybe there are a lot of other things.

1:04:13.053 --> 1:04:25.658
However, it might still be overestimated that
it's still a bit too high.

1:04:26.066 --> 1:04:33.042
The problem is if the correct translation
is a very long one, then its probability mass gets

1:04:33.042 --> 1:04:33.545
lower.

1:04:34.314 --> 1:04:45.399
Because each time you multiply your probability
to it, so your sequence probability gets lower

1:04:45.399 --> 1:04:46.683
and lower.
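
NOTE
Editor's sketch of the length effect, with made-up numbers: a one-step hypothesis that ends the sentence immediately with probability 0.01, and a 20-word hypothesis where every word gets probability 0.75. Both values are assumptions for illustration. Log-probabilities are summed, which is equivalent to multiplying the probabilities.
```python
import math
from typing import List
def sequence_log_prob(word_probs: List[float]) -> float:
    """Log of the product of the per-step probabilities."""
    return sum(math.log(p) for p in word_probs)
short_hyp = sequence_log_prob([0.01])       # stop right away, very unlikely step
long_hyp = sequence_log_prob([0.75] * 20)   # 20 fairly confident steps
print(short_hyp, long_hyp, short_hyp > long_hyp)
```
Even though every step of the long hypothesis is far more probable than the single step of the short one, the product over 20 steps ends up lower, so a search that sees both would prefer the short output.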

1:04:48.588 --> 1:04:59.776
And this means that at some point you might
get over this, and it might be a lower probability.

1:05:00.180 --> 1:05:09.651
And if you had thrown this hypothesis away at the
beginning, fine, but if it was in your beam, then

1:05:09.651 --> 1:05:14.958
at this point you would select the empty sentence.

1:05:15.535 --> 1:05:25.379
So this has happened because this short translation
is seen and it's not thrown away.

1:05:28.268 --> 1:05:31.121
So,.

1:05:31.151 --> 1:05:41.256
If you have a very small beam that can be prevented,
but if you have a large beam, this one is in

1:05:41.256 --> 1:05:41.986
there.

1:05:42.302 --> 1:05:52.029
This in general seems reasonable that shorter
pronunciations instead of longer sentences

1:05:52.029 --> 1:05:54.543
because non-religious.

1:05:56.376 --> 1:06:01.561
It's a bit depending on whether the translation
should be a bit related to your input.

1:06:02.402 --> 1:06:18.053
And since we are always multiplying things,
the longer the sequences we are getting smaller,

1:06:18.053 --> 1:06:18.726
it.

1:06:19.359 --> 1:06:29.340
It's somewhat right for human main too, but
the models tend to overestimate because of

1:06:29.340 --> 1:06:34.388
this short translation of long translation.

1:06:35.375 --> 1:06:46.474
Then, of course, that means that it's not
easy to stay on a computer because eventually

1:06:46.474 --> 1:06:48.114
it suggests.

1:06:51.571 --> 1:06:59.247
First of all there is another way and that's
typically used but you don't have to do really

1:06:59.247 --> 1:07:07.089
because this is normally not a second position
and if it's like on the 20th position you only

1:07:07.089 --> 1:07:09.592
have to have some beam lower.

1:07:10.030 --> 1:07:17.729
But you are right because these issues get
larger, the larger your input is, and then

1:07:17.729 --> 1:07:20.235
you might make more errors.

1:07:20.235 --> 1:07:27.577
So therefore this is true, but it's not as
simple that this one is always in the.

1:07:28.408 --> 1:07:45.430
That the translation for it goes down with
higher insert sizes has there been more control.

1:07:47.507 --> 1:07:51.435
In this work you see a dozen knocks.

1:07:51.435 --> 1:07:53.027
Knots go down.

1:07:53.027 --> 1:08:00.246
That's light green here, but at least you
don't see the sharp drop.

1:08:00.820 --> 1:08:07.897
So if you do some type of normalization, at
least you can address this problem and limit

1:08:07.897 --> 1:08:08.204
it.

1:08:15.675 --> 1:08:24.828
There are other reasons why, as said initially;
it's not only the length, but there can be

1:08:24.828 --> 1:08:26.874
other reasons why.

1:08:27.067 --> 1:08:37.316
And if you just take it too large, you're
looking too often at ways in between, but it's

1:08:37.316 --> 1:08:40.195
better to ignore things.

1:08:41.101 --> 1:08:44.487
But that's more of a hand-wavy argument.

1:08:44.487 --> 1:08:47.874
Agree so don't know if the exact word.

1:08:48.648 --> 1:08:53.223
You need to do the normalization and there
are different ways of doing it.

1:08:53.223 --> 1:08:54.199
It's mainly OK.

1:08:54.199 --> 1:08:59.445
We're just now not taking the translation
with the highest probability, but during

1:08:59.445 --> 1:09:04.935
decoding we have another feature saying not
only take the one with the highest probability

1:09:04.935 --> 1:09:08.169
but also prefer translations which are a bit
longer.

1:09:08.488 --> 1:09:16.933
You can do that in different ways; one way is to divide
by the sentence length.

1:09:16.933 --> 1:09:23.109
We take not the highest but the highest average.

1:09:23.563 --> 1:09:28.841
Of course, if both are the same lengths, it
doesn't matter if M is the same lengths in

1:09:28.841 --> 1:09:34.483
all cases, but if you compare a translation
with seven or eight words, there is a difference

1:09:34.483 --> 1:09:39.700
if you want to have the one with the highest
probability or with the highest average.

1:09:41.021 --> 1:09:50.993
So that is the first one. You can also have some reward
model for each word, adding a bit to the score,

1:09:50.993 --> 1:09:51.540
and.

1:09:51.711 --> 1:10:03.258
And then, of course, you have to fine-tune
that; there are also more complex ones here.
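
NOTE
Editor's sketch of the two simple corrections mentioned here: dividing the log-probability by the sentence length, and adding a fixed reward per word. The function names, the reward value 0.7 and the scores for the seven-word and eight-word hypotheses are assumptions for illustration.
```python
def length_normalized(log_prob: float, length: int) -> float:
    """Average log-probability per word."""
    return log_prob / max(length, 1)
def with_word_reward(log_prob: float, length: int, reward: float = 0.7) -> float:
    """Add a constant bonus for every generated word."""
    return log_prob + reward * length
seven = (-7.0, 7)  # (log-probability, number of words)
eight = (-7.6, 8)
print(seven[0] > eight[0])                                     # raw score
print(length_normalized(*eight) > length_normalized(*seven))   # per-word average
print(with_word_reward(*eight) > with_word_reward(*seven))     # word reward
```
In this example the raw score prefers the seven-word hypothesis, while both corrected scores prefer the eight-word one; how strong the correction should be is something that has to be tuned.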

1:10:03.903 --> 1:10:08.226
So there are different ways of doing that,
and of course that's important.

1:10:08.428 --> 1:10:11.493
But in all of that, the main idea is OK.

1:10:11.493 --> 1:10:18.520
We know of the error that the
model seems to prefer short translations.

1:10:18.520 --> 1:10:24.799
We circumvent that by OK we are adding we
are no longer searching for the best one.

1:10:24.764 --> 1:10:30.071
But we're searching for the best one under
some additional constraints, so mainly you

1:10:30.071 --> 1:10:32.122
are doing this during decoding.

1:10:32.122 --> 1:10:37.428
You're not completely trusting your model,
but you're adding some biases or constraints

1:10:37.428 --> 1:10:39.599
into what should also be fulfilled.

1:10:40.000 --> 1:10:42.543
That can be, for example, that the length
should be reasonable.

1:10:49.369 --> 1:10:51.071
Any more questions on that?

1:10:56.736 --> 1:11:04.001
The last idea, which has recently gotten quite a bit
more interest, is what is called minimum

1:11:04.001 --> 1:11:11.682
Bayes risk decoding, and there is maybe not the
one correct translation but there are several

1:11:11.682 --> 1:11:13.937
good correct translations.

1:11:14.294 --> 1:11:21.731
And the idea is now we don't want to find
the one translation, which is maybe the highest

1:11:21.731 --> 1:11:22.805
probability.

1:11:23.203 --> 1:11:31.707
Instead we are looking at
all translations with high probability, and then

1:11:31.707 --> 1:11:39.524
we want to take one representative out of this set,
the one which is most similar to all the other

1:11:39.524 --> 1:11:42.187
high-probability translations.

1:11:43.643 --> 1:11:46.642
So how does it work?

1:11:46.642 --> 1:11:55.638
First, you could imagine you have reference
translations.

1:11:55.996 --> 1:12:13.017
You have a set of reference translations and
then what you want to get is you want to have.

1:12:13.073 --> 1:12:28.641
As a probability distribution you measure
the similarity of reference and the hypothesis.

1:12:28.748 --> 1:12:31.408
So you have two sets of translation.

1:12:31.408 --> 1:12:34.786
You have the human translations of a sentence.

1:12:35.675 --> 1:12:39.251
That's of course not realistic, but first
from the idea.

1:12:39.251 --> 1:12:42.324
Then you have your set of possible translations.

1:12:42.622 --> 1:12:52.994
And now you're not saying okay, we have only
one human, but we have several humans with

1:12:52.994 --> 1:12:56.294
different types of quality.

1:12:56.796 --> 1:13:07.798
You have to have two metrics here, the similarity
between the automatic translation and the human one, and the quality

1:13:07.798 --> 1:13:09.339
of the human.

1:13:10.951 --> 1:13:17.451
Of course, we have the same problem that we
don't have the human reference, so we have.

1:13:18.058 --> 1:13:29.751
So when we are doing it, instead of estimating
the quality based on the human, we use our

1:13:29.751 --> 1:13:30.660
model.

1:13:31.271 --> 1:13:37.612
So we can't be like humans, so we take the
model probability.

1:13:37.612 --> 1:13:40.782
We take the set here first of.

1:13:41.681 --> 1:13:48.755
Then we are comparing each hypothesis to this
one, so you have two sets.

1:13:48.755 --> 1:13:53.987
Just imagine here you take all possible translations.

1:13:53.987 --> 1:13:58.735
Here you take your hypotheses and compare
them.

1:13:58.678 --> 1:14:03.798
And then you're estimating the quality
based on the outcome.

1:14:04.304 --> 1:14:06.874
So the overall idea is okay.

1:14:06.874 --> 1:14:14.672
We are not finding the best hypothesis but
finding the hypothesis which is most similar

1:14:14.672 --> 1:14:17.065
to many good translations.

1:14:19.599 --> 1:14:21.826
Why would you do that?

1:14:21.826 --> 1:14:25.119
It's a bit like a smoothing idea.

1:14:25.119 --> 1:14:28.605
Imagine this is the probability of.

1:14:29.529 --> 1:14:36.634
So if you would do beam search or greedy search
or anything, if you just take the highest probability

1:14:36.634 --> 1:14:39.049
one, you would take this red one.

1:14:39.799 --> 1:14:45.686
If it has this type of probability distribution,

1:14:45.686 --> 1:14:58.555
then it might be better to take one of these,
even though it's a bit lower in probability.

1:14:58.618 --> 1:15:12.501
So what you're mainly doing is you're doing
some smoothing of your probability distribution.

1:15:15.935 --> 1:15:17.010
How can you do that?

1:15:17.010 --> 1:15:20.131
Of course, we again cannot compare
to all the hypotheses.

1:15:21.141 --> 1:15:29.472
But what we can do is we have just two sets
and we're just taking them the same.

1:15:29.472 --> 1:15:38.421
So we're having our candidates for the hypothesis
and the set of pseudo-references.

1:15:39.179 --> 1:15:55.707
And we can just take the same set, so we can
just compare the utility of the.

1:15:56.656 --> 1:16:16.182
And then, of course, the question is how do
we measure the quality of the hypothesis?

1:16:16.396 --> 1:16:28.148
Of course, you could also take here the probability
P(y | x), but you can also say

1:16:28.148 --> 1:16:30.958
we only take the top.

1:16:31.211 --> 1:16:39.665
And then we don't really want to rely on
how good they are; we filtered out all the

1:16:39.665 --> 1:16:40.659
bad ones.

1:16:40.940 --> 1:16:54.657
So that is the first question for minimum
Bayes risk: what are your pseudo-references?

1:16:55.255 --> 1:17:06.968
So how do you set the quality of all these
references? Here, in the independent sampling,

1:17:06.968 --> 1:17:10.163
they all have the same.

1:17:10.750 --> 1:17:12.308
There's also work where you can take that.

1:17:13.453 --> 1:17:17.952
And then the second question you have to answer
is, of course:

1:17:17.917 --> 1:17:26.190
How do you compare two hypotheses now? You
have Y and H, which are both generated

1:17:26.190 --> 1:17:34.927
by the system and you want to find the H which
is most similar to all the other translations.

1:17:35.335 --> 1:17:41.812
So it's mainly like this model here, which
says how similar H is to all the other Ys.
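
Written as a formula, the selection rule described here might look as follows. This is only a sketch: H and Y are the two sets named in the lecture, while the symbol u for the similarity (utility) function and the uniform weight 1/|Y| are assumptions of this note, not notation taken from the slides.

```latex
\hat{h} \;=\; \arg\max_{h \in H} \; \frac{1}{|Y|} \sum_{y \in Y} u(h, y)
```

Here u(h, y) stands for any metric that scores how similar hypothesis h is to pseudo-reference y.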

1:17:42.942 --> 1:17:50.127
So you have to again use some type of similarity
metric, which says how similar two possible translations are.

1:17:52.172 --> 1:17:53.775
How can you do that?

1:17:53.775 --> 1:17:58.355
We luckily know how to compare a reference
to a hypothesis.

1:17:58.355 --> 1:18:00.493
We have evaluation metrics.

1:18:00.493 --> 1:18:03.700
You can do something like sentence level.

1:18:04.044 --> 1:18:13.501
But especially if you're looking into neural models
you should have a strong metric, so you can use

1:18:13.501 --> 1:18:17.836
a neural metric which directly compares the two.

1:18:22.842 --> 1:18:29.292
Yes, so that is the main idea of minimum
Bayes risk. The important idea you should

1:18:29.292 --> 1:18:35.743
keep in mind is that it's doing some
smoothing by not taking the highest-probability

1:18:35.743 --> 1:18:40.510
one, but by taking a set
of high-probability ones.

1:18:40.640 --> 1:18:45.042
And then looking for the translation which
is most similar to all of them.

1:18:45.445 --> 1:18:49.888
And thereby doing a bit more smoothing because
you look at this one.

1:18:49.888 --> 1:18:55.169
If you have this one, for example, it would
be more similar to all of these ones.

1:18:55.169 --> 1:19:00.965
But if you take this one, it's higher probability,
but it's very dissimilar to all these.
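
The picture described above can be imitated with a small toy example. This is a minimal sketch, not the lecture's implementation: the function names `overlap_f1` and `mbr_select`, the word-overlap F1 used as the similarity metric, and the example sentences are all assumptions. In practice one would plug in a sentence-level or neural metric, as mentioned earlier.

```python
from collections import Counter


def overlap_f1(hyp: str, ref: str) -> float:
    """Toy similarity: F1 over shared word counts (stand-in for a real metric)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    common = sum((h & r).values())
    if common == 0:
        return 0.0
    precision = common / sum(h.values())
    recall = common / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(hypotheses, pseudo_refs, utility=overlap_f1):
    """Return the hypothesis with the highest average utility
    against all pseudo-references (uniform weights)."""
    def expected_utility(h):
        return sum(utility(h, y) for y in pseudo_refs) / len(pseudo_refs)
    return max(hypotheses, key=expected_utility)


# Imagine the first sample is the single highest-probability output,
# but it is unlike every other sample drawn from the model.
samples = [
    "thanks a lot",
    "i go home now",
    "i go home",
    "i am going home",
    "i go back home",
]

# The same set serves as candidates and as pseudo-references.
best = mbr_select(samples, samples)
print(best)
```

With these toy sentences the isolated first sample loses, because it only matches itself, while a sentence from the cluster of similar translations is selected.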

1:19:05.445 --> 1:19:17.609
Okay, that is all for decoding, before we finish
with the combination of models.

1:19:18.678 --> 1:19:20.877
Sort of set of pseudo-references.

1:19:20.877 --> 1:19:24.368
Do you run a little bit of some type of search
or?

1:19:24.944 --> 1:19:27.087
For example, you can do beam search.

1:19:27.087 --> 1:19:28.825
You can do sampling for that.

1:19:28.825 --> 1:19:31.257
Oh yeah, we had mentioned sampling there.

1:19:31.257 --> 1:19:34.500
I think somebody was asking what sampling
is good for.

1:19:34.500 --> 1:19:37.280
So there's, of course, another important issue.

1:19:37.280 --> 1:19:40.117
How do you get a good representative set of
H?

1:19:40.620 --> 1:19:47.147
If you do beam search, it might be that you
end up with too similar ones, and maybe it's

1:19:47.147 --> 1:19:49.274
prevented by doing sampling.

1:19:49.274 --> 1:19:55.288
But maybe in sampling you find worse ones,
but yet some type of model is helpful.

1:19:56.416 --> 1:20:04.863
Which search method is used more in transformer-based translation
models?

1:20:04.863 --> 1:20:09.848
Nowadays beam search is definitely.

1:20:10.130 --> 1:20:13.749
There is work on this.

1:20:13.749 --> 1:20:27.283
The problem is that MBR is often a lot
heavier because you have to sample

1:20:27.283 --> 1:20:29.486
translations.

1:20:31.871 --> 1:20:40.946
If you are bustling then we take a pen or
a pen for the most possible one.

1:20:40.946 --> 1:20:43.003
Now we put them.

1:20:43.623 --> 1:20:46.262
Bit and then we say okay, you don't have to
be fine.

1:20:46.262 --> 1:20:47.657
I'm going to put it to you.

1:20:48.428 --> 1:20:52.690
Yes, so that is what you can also do.

1:20:52.690 --> 1:21:00.092
Instead of taking a uniform probability, you
could take the model's.

1:21:01.041 --> 1:21:14.303
The uniform is a bit more robust because if
you had this one it might be that there are

1:21:14.303 --> 1:21:17.810
some crazy exceptions.

1:21:17.897 --> 1:21:21.088
And then it would still relax.

1:21:21.088 --> 1:21:28.294
So if you look at this picture, the probability
here would be higher.

1:21:28.294 --> 1:21:31.794
But yeah, that's a bit of tuning.
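
The difference between uniform weights and model-probability weights can be sketched with made-up numbers. Everything here is illustrative: the utility matrix, the probabilities, and the helper names `expected_utilities` and `argmax` are assumptions, chosen so that one pseudo-reference is a high-probability outlier.

```python
def expected_utilities(utility, weights):
    """Weighted average utility of each hypothesis (one row per hypothesis)."""
    total = sum(weights)
    return [sum(w * u for w, u in zip(weights, row)) / total for row in utility]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# utility[i][j]: similarity of hypothesis i to pseudo-reference j (toy numbers).
# Index 0 is an outlier; indices 1 to 3 form a cluster of similar translations.
utility = [
    [1.0, 0.1, 0.1, 0.1],
    [0.1, 1.0, 0.8, 0.7],
    [0.1, 0.8, 1.0, 0.8],
    [0.1, 0.7, 0.8, 1.0],
]

uniform_weights = [1.0, 1.0, 1.0, 1.0]
model_probs = [0.70, 0.10, 0.10, 0.10]  # the outlier gets most of the mass

uniform_choice = argmax(expected_utilities(utility, uniform_weights))
weighted_choice = argmax(expected_utilities(utility, model_probs))
print(uniform_choice, weighted_choice)
```

In this toy setting the uniform weighting picks a hypothesis from the cluster, while weighting by model probability falls back to the outlier, which is one way to read the remark that uniform weights are a bit more robust.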

1:21:33.073 --> 1:21:42.980
In this case, and yes, it is like modeling
also the ants that.

1:21:49.169 --> 1:21:56.265
The last thing is now we always have considered
one model.

1:21:56.265 --> 1:22:04.084
It's also sometimes helpful to not only
look at one model but.

1:22:04.384 --> 1:22:10.453
So in general there are many ways you
can make several models, and with neural models it's even

1:22:10.453 --> 1:22:17.370
easier: you can just start from three different random
initializations, you get three different models,

1:22:17.370 --> 1:22:18.428
and typically.

1:22:19.019 --> 1:22:27.299
And then the question is, can we combine their
strength into one model and use that then?

1:22:29.669 --> 1:22:39.281
And that can be done, and it can be either
online, as an ensemble, or the more offline thing,

1:22:39.281 --> 1:22:41.549
which is called reranking.

1:22:42.462 --> 1:22:52.800
So the idea is, for example, an ensemble that
you combine different initializations.

1:22:52.800 --> 1:23:02.043
Of course, you can also do other things like
having different architectures.

1:23:02.222 --> 1:23:08.922
But the easiest thing you can always change
in generating two models is to have different initializations.

1:23:09.209 --> 1:23:24.054
And then the question is how can you combine
that?

1:23:26.006 --> 1:23:34.245
And the easiest thing, as said, is the model
ensemble.

1:23:34.245 --> 1:23:39.488
What you mainly do is, in parallel,

1:23:39.488 --> 1:23:43.833
you decode with all of the models.

1:23:44.444 --> 1:23:59.084
So each gives you a probability of the output, and you can
combine these into a joint one by just summing

1:23:59.084 --> 1:24:04.126
up over your K models again.

1:24:04.084 --> 1:24:10.374
So you still have a probability distribution,
but you are not taking only one output here,

1:24:10.374 --> 1:24:10.719
but.

1:24:11.491 --> 1:24:20.049
So that's one way you can easily combine different
models, and the nice thing is it typically

1:24:20.049 --> 1:24:20.715
works.
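
One decoding step of such an ensemble could be sketched as below. This is a simplified illustration: the function name `ensemble_step`, the three-word vocabulary, and the probabilities are made up, and the sum over the K models is divided by K so that the result is again a probability distribution.

```python
def ensemble_step(distributions):
    """Average the next-word distributions of K models over a shared vocabulary."""
    k = len(distributions)
    vocab_size = len(distributions[0])
    if any(len(d) != vocab_size for d in distributions):
        raise ValueError("all models must share the same output vocabulary")
    return [sum(d[i] for d in distributions) / k for i in range(vocab_size)]


vocab = ["home", "house", "</s>"]

# Next-word probabilities from three models for the same prefix (toy numbers).
p1 = [0.5, 0.4, 0.1]
p2 = [0.3, 0.6, 0.1]
p3 = [0.6, 0.3, 0.1]

joint = ensemble_step([p1, p2, p3])
best_word = vocab[max(range(len(joint)), key=joint.__getitem__)]
print(joint, best_word)
```

In a real decoder this averaging would happen at every step inside the search, with all models conditioned on the same prefix. The length check mirrors the requirement that the output vocabularies match.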

1:24:21.141 --> 1:24:27.487
You get additional improvement with only more
computation but not more human work.

1:24:27.487 --> 1:24:33.753
You just do the same thing four times and you're
getting a better performance.

1:24:33.793 --> 1:24:41.623
Compared to having more layers and so on, the advantage
over bigger models is of course that you have to have

1:24:41.623 --> 1:24:46.272
the models jointly only during decoding, during
inference.

1:24:46.272 --> 1:24:52.634
There you have to load models in parallel
because you have to do your search.

1:24:52.672 --> 1:24:57.557
Normally there are more memory resources needed for
training than you need for inference.

1:25:00.000 --> 1:25:12.637
You have to train four models and the decoding
speed is also slower because you need to decode

1:25:12.637 --> 1:25:14.367
four models.

1:25:14.874 --> 1:25:25.670
There is one other very important thing:
the models have to be very similar, at least

1:25:25.670 --> 1:25:27.368
in some ways.

1:25:27.887 --> 1:25:28.506
Of course.

1:25:28.506 --> 1:25:34.611
You can only combine them if you have
the same words because you are just.

1:25:34.874 --> 1:25:43.110
So just imagine you have two different sizes
because you want to compare them, or a character-based

1:25:43.110 --> 1:25:44.273
model.

1:25:44.724 --> 1:25:53.327
That's at least not easily possible here because
one's output would be a word here and the

1:25:53.327 --> 1:25:56.406
other one would have to sum over.

1:25:56.636 --> 1:26:07.324
So this ensemble typically only works if you
have the same output vocabulary.

1:26:07.707 --> 1:26:16.636
Your input can be different because that is
only done once and then.

1:26:16.636 --> 1:26:23.752
Your output vocabulary has to be the same
otherwise.

1:26:27.507 --> 1:26:41.522
There's even a surprising effect of improving
your performance and it's again some kind of

1:26:41.522 --> 1:26:43.217
smoothing.

1:26:43.483 --> 1:26:52.122
So normally during training what we are doing
is we can save the checkpoints after each epoch.

1:26:52.412 --> 1:27:01.774
And you have this type of curve where your
error normally should go down, and

1:27:01.774 --> 1:27:09.874
if you do early stopping it means that at the
end you select not the last one but the lowest.

1:27:11.571 --> 1:27:21.467
However, some type of smoothing is there again.

1:27:21.467 --> 1:27:31.157
Sometimes what you can do is take an ensemble.

1:27:31.491 --> 1:27:38.798
That is not as good, but you still have four
different models, and they give you a little.

1:27:39.259 --> 1:27:42.212
So,.

1:27:43.723 --> 1:27:48.340
It's some are helping you, so now they're
supposed to be something different, you know.

1:27:49.489 --> 1:27:53.812
Oh didn't do that, so that is a checkpoint.

1:27:53.812 --> 1:27:59.117
There is one thing interesting, which is even
faster.

1:27:59.419 --> 1:28:12.255
Normally it gives you better performance
because this one might be again like a smooth

1:28:12.255 --> 1:28:13.697
ensemble.
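
The transcript does not name the faster variant at this point. One common reading, and purely an assumption of this note, is to average the parameters of several saved checkpoints into a single model, so that decoding costs the same as for one model. A minimal sketch, with checkpoints represented as plain dictionaries of parameter lists and a hypothetical helper `average_checkpoints`:

```python
def average_checkpoints(checkpoints):
    """Average each named parameter vector element-wise across checkpoints."""
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        vectors = [ckpt[name] for ckpt in checkpoints]
        averaged[name] = [sum(values) / n for values in zip(*vectors)]
    return averaged


# Toy checkpoints from two epochs of the same training run.
checkpoints = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
]

averaged = average_checkpoints(checkpoints)
print(averaged)
```

This only makes sense for checkpoints of the same training run, where the parameters live in the same space. Real checkpoints would hold tensors rather than Python lists.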

1:28:16.736 --> 1:28:22.364
Of course, there are also some problems with
this, as I said.

1:28:22.364 --> 1:28:30.022
For example, maybe you want to use different
word representations, with character and.

1:28:30.590 --> 1:28:37.189
You want to do right-to-left decoding. So you
normally generate like "I go home", but then your translation

1:28:37.189 --> 1:28:39.613
depends only on the previous words.

1:28:39.613 --> 1:28:45.942
If you want to model the future, you could
do the inverse direction and generate the target

1:28:45.942 --> 1:28:47.895
sentence from right to left.

1:28:48.728 --> 1:28:50.839
But it's not easy to combine these things.

1:28:51.571 --> 1:28:56.976
In order to do this, or what is also sometimes
interesting is doing inverse translation.

1:28:57.637 --> 1:29:07.841
You can combine these types of models in the
next lecture.

1:29:07.841 --> 1:29:13.963
That is only a bit which we can do.

1:29:14.494 --> 1:29:29.593
What you should remember for next time is how
search works. Do you have any final questions?

1:29:33.773 --> 1:29:43.393
Then I wish you a happy holiday for next week
and then Monday there is another practical

1:29:43.393 --> 1:29:50.958
and then Thursday in two weeks so we'll have
the next lecture Monday.