retkowski's picture
Add demo
cb71ef5
WEBVTT
0:00:01.721 --> 0:00:05.064
Hey, and then welcome to today's lecture.
0:00:06.126 --> 0:00:13.861
What we want to do today is we will finish
with what we have done last time, so we started
0:00:13.861 --> 0:00:22.192
looking at the new machine translation system,
but we have had all the components of the sequence
0:00:22.192 --> 0:00:22.787
model.
0:00:22.722 --> 0:00:29.361
We're still missing is the transformer based
architecture so that maybe the self attention.
0:00:29.849 --> 0:00:31.958
Then we want to look at the beginning today.
0:00:32.572 --> 0:00:39.315
And then the main part of the day's lecture
will be decoding.
0:00:39.315 --> 0:00:43.992
That means we know how to train the model.
0:00:44.624 --> 0:00:47.507
So decoding sewage all they can be.
0:00:47.667 --> 0:00:53.359
Be useful that and the idea is how we find
that and what challenges are there.
0:00:53.359 --> 0:00:59.051
Since it's unregressive, we will see that
it's not as easy as for other tasks.
0:00:59.359 --> 0:01:08.206
While generating the translation step by step,
we might make additional arrows that lead.
0:01:09.069 --> 0:01:16.464
But let's start with a self attention, so
what we looked at into was an base model.
0:01:16.816 --> 0:01:27.931
And then in our based models you always take
the last new state, you take your input, you
0:01:27.931 --> 0:01:31.513
generate a new hidden state.
0:01:31.513 --> 0:01:35.218
This is more like a standard.
0:01:35.675 --> 0:01:41.088
And one challenge in this is that we always
store all our history in one signal hidden
0:01:41.088 --> 0:01:41.523
stick.
0:01:41.781 --> 0:01:50.235
We saw that this is a problem when going from
encoder to decoder, and that is why we then
0:01:50.235 --> 0:01:58.031
introduced the attention mechanism so that
we can look back and see all the parts.
0:01:59.579 --> 0:02:06.059
However, in the decoder we still have this
issue so we are still storing all information
0:02:06.059 --> 0:02:12.394
in one hidden state and we might do things
like here that we start to overwrite things
0:02:12.394 --> 0:02:13.486
and we forgot.
0:02:14.254 --> 0:02:23.575
So the idea is, can we do something similar
which we do between encoder and decoder within
0:02:23.575 --> 0:02:24.907
the decoder?
0:02:26.526 --> 0:02:33.732
And the idea is each time we're generating
here in New York State, it will not only depend
0:02:33.732 --> 0:02:40.780
on the previous one, but we will focus on the
whole sequence and look at different parts
0:02:40.780 --> 0:02:46.165
as we did in attention in order to generate
our new representation.
0:02:46.206 --> 0:02:53.903
So each time we generate a new representation
we will look into what is important now to
0:02:53.903 --> 0:02:54.941
understand.
0:02:55.135 --> 0:03:00.558
You may want to understand what much is important.
0:03:00.558 --> 0:03:08.534
You might want to look to vary and to like
so that it's much about liking.
0:03:08.808 --> 0:03:24.076
So the idea is that we are not staring everything
in each time we are looking at the full sequence.
0:03:25.125 --> 0:03:35.160
And that is achieved by no longer going really
secret, and the hidden states here aren't dependent
0:03:35.160 --> 0:03:37.086
on the same layer.
0:03:37.086 --> 0:03:42.864
But instead we are always looking at the previous
layer.
0:03:42.942 --> 0:03:45.510
We will always have more information that
we are coming.
0:03:47.147 --> 0:03:51.572
So how does this censor work in detail?
0:03:51.572 --> 0:03:56.107
So we started with our initial mistakes.
0:03:56.107 --> 0:04:08.338
So, for example: Now where we had the three
terms already, the query, the key and the value,
0:04:08.338 --> 0:04:12.597
it was motivated by our database.
0:04:12.772 --> 0:04:20.746
We are comparing it to the keys to all the
other values, and then we are merging the values.
0:04:21.321 --> 0:04:35.735
There was a difference between the decoder
and the encoder.
0:04:35.775 --> 0:04:41.981
You can assume all the same because we are
curving ourselves.
0:04:41.981 --> 0:04:49.489
However, we can make them different but just
learning a linear projection.
0:04:49.529 --> 0:05:01.836
So you learn here some projection based on
what need to do in order to ask which question.
0:05:02.062 --> 0:05:11.800
That is, the query and the key is to what
do want to compare and provide others, and
0:05:11.800 --> 0:05:13.748
which values do.
0:05:14.014 --> 0:05:23.017
This is not like hand defined, but learn,
so it's like three linear projections that
0:05:23.017 --> 0:05:26.618
you apply on all of these hidden.
0:05:26.618 --> 0:05:32.338
That is the first thing based on your initial
hidden.
0:05:32.612 --> 0:05:37.249
And now you can do exactly as before, you
can do the attention.
0:05:37.637 --> 0:05:40.023
How did the attention work?
0:05:40.023 --> 0:05:45.390
The first thing is we are comparing our query
to all the keys.
0:05:45.445 --> 0:05:52.713
And that is now the difference before the
quarry was from the decoder, the keys were
0:05:52.713 --> 0:05:54.253
from the encoder.
0:05:54.253 --> 0:06:02.547
Now it's like all from the same, so we started
the first in state to the keys of all the others.
0:06:02.582 --> 0:06:06.217
We're learning some value here.
0:06:06.217 --> 0:06:12.806
How important are these information to better
understand?
0:06:13.974 --> 0:06:19.103
And these are just like floating point numbers.
0:06:19.103 --> 0:06:21.668
They are normalized so.
0:06:22.762 --> 0:06:30.160
And that is the first step, so let's go first
for the first curve.
0:06:30.470 --> 0:06:41.937
What we can then do is multiply each value
as we have done before with the importance
0:06:41.937 --> 0:06:43.937
of each state.
0:06:45.145 --> 0:06:47.686
And then we have in here the new hit step.
0:06:48.308 --> 0:06:57.862
See now this new hidden status is depending
on all the hidden state of all the sequences
0:06:57.862 --> 0:06:59.686
of the previous.
0:06:59.879 --> 0:07:01.739
One important thing.
0:07:01.739 --> 0:07:08.737
This one doesn't really depend, so the hidden
states here don't depend on the.
0:07:09.029 --> 0:07:15.000
So it only depends on the hidden state of
the previous layer, but it depends on all the
0:07:15.000 --> 0:07:18.664
hidden states, and that is of course a big
advantage.
0:07:18.664 --> 0:07:25.111
So on the one hand information can directly
flow from each hidden state before the information
0:07:25.111 --> 0:07:27.214
flow was always a bit limited.
0:07:28.828 --> 0:07:35.100
And the independence is important so we can
calculate all these in the states in parallel.
0:07:35.100 --> 0:07:41.371
That's another big advantage of self attention
that we can calculate all the hidden states
0:07:41.371 --> 0:07:46.815
in one layer in parallel and therefore it's
the ad designed for GPUs and fast.
0:07:47.587 --> 0:07:50.235
Then we can do the same thing for the second
in the state.
0:07:50.530 --> 0:08:06.866
And the only difference here is how we calculate
what is occurring.
0:08:07.227 --> 0:08:15.733
Getting these values is different because
we use the different query and then getting
0:08:15.733 --> 0:08:17.316
our new hidden.
0:08:18.258 --> 0:08:26.036
Yes, this is the word of words that underneath
this case might, but this is simple.
0:08:26.036 --> 0:08:26.498
Not.
0:08:27.127 --> 0:08:33.359
That's a very good question that is like on
the initial thing.
0:08:33.359 --> 0:08:38.503
That is exactly not one of you in the architecture.
0:08:38.503 --> 0:08:44.042
Maybe first you would think of a very big
disadvantage.
0:08:44.384 --> 0:08:49.804
So this hidden state would be the same if
the movie would be different.
0:08:50.650 --> 0:08:59.983
And of course this estate is a site someone
should like, so if the estate would be here
0:08:59.983 --> 0:09:06.452
except for this correspondence the word order
is completely.
0:09:06.706 --> 0:09:17.133
Therefore, just doing self attention wouldn't
work at all because we know word order is important
0:09:17.133 --> 0:09:21.707
and there is a complete different meaning.
0:09:22.262 --> 0:09:26.277
We introduce the word position again.
0:09:26.277 --> 0:09:33.038
The main idea is if the position is already
in your embeddings.
0:09:33.533 --> 0:09:39.296
Then of course the position is there and you
don't lose it anymore.
0:09:39.296 --> 0:09:46.922
So mainly if your life representation here
encodes at the second position and your output
0:09:46.922 --> 0:09:48.533
will be different.
0:09:49.049 --> 0:09:54.585
And that's how you encode it, but that's essential
in order to get this work.
0:09:57.137 --> 0:10:08.752
But before we are coming to the next slide,
one other thing that is typically done is multi-head
0:10:08.752 --> 0:10:10.069
attention.
0:10:10.430 --> 0:10:15.662
And it might be that in order to understand
much, it might be good that in some way we
0:10:15.662 --> 0:10:19.872
focus on life, and in some way we can focus
on vary, but not equally.
0:10:19.872 --> 0:10:25.345
But maybe it's like to understand again on
different dimensions we should look into these.
0:10:25.905 --> 0:10:31.393
And therefore what we're doing is we're just
doing the self attention at once, but we're
0:10:31.393 --> 0:10:35.031
doing it end times or based on your multi head
attentions.
0:10:35.031 --> 0:10:41.299
So in typical examples, the number of heads
people are talking about is like: So you're
0:10:41.299 --> 0:10:50.638
doing this process and have different queries
and keys so you can focus.
0:10:50.790 --> 0:10:52.887
How can you generate eight different?
0:10:53.593 --> 0:11:07.595
Things it's quite easy here, so instead of
having one linear projection you can have age
0:11:07.595 --> 0:11:09.326
different.
0:11:09.569 --> 0:11:13.844
And it might be that sometimes you're looking
more into one thing, and sometimes you're Looking
0:11:13.844 --> 0:11:14.779
more into the other.
0:11:15.055 --> 0:11:24.751
So that's of course nice with this type of
learned approach because we can automatically
0:11:24.751 --> 0:11:25.514
learn.
0:11:29.529 --> 0:11:36.629
And what you correctly said is its positional
independence, so it doesn't really matter the
0:11:36.629 --> 0:11:39.176
order which should be important.
0:11:39.379 --> 0:11:47.686
So how can we do that and the idea is we are
just encoding it directly into the embedding
0:11:47.686 --> 0:11:52.024
so into the starting so that a representation.
0:11:52.512 --> 0:11:55.873
How do we get that so we started with our
embeddings?
0:11:55.873 --> 0:11:58.300
Just imagine this is embedding of eye.
0:11:59.259 --> 0:12:06.169
And then we are having additionally this positional
encoding.
0:12:06.169 --> 0:12:10.181
In this position, encoding is just.
0:12:10.670 --> 0:12:19.564
With different wavelength, so with different
lengths of your signal as you see here.
0:12:20.160 --> 0:12:37.531
And the number of functions you have is exactly
the number of dimensions you have in your embedded.
0:12:38.118 --> 0:12:51.091
And what will then do is take the first one,
and based on your position you multiply your
0:12:51.091 --> 0:12:51.955
word.
0:12:52.212 --> 0:13:02.518
And you see now if you put it in this position,
of course it will get a different value.
0:13:03.003 --> 0:13:12.347
And thereby in each position a different function
is multiplied.
0:13:12.347 --> 0:13:19.823
This is a representation for at the first
position.
0:13:20.020 --> 0:13:34.922
If you have it in the input already encoded
then of course the model is able to keep the
0:13:34.922 --> 0:13:38.605
position information.
0:13:38.758 --> 0:13:48.045
But your embeddings can also learn your embeddings
in a way that they are optimal collaborating
0:13:48.045 --> 0:13:49.786
with these types.
0:13:51.451 --> 0:13:59.351
Is that somehow clear where he is there?
0:14:06.006 --> 0:14:13.630
Am the first position and second position?
0:14:16.576 --> 0:14:17.697
Have a long wait period.
0:14:17.697 --> 0:14:19.624
I'm not going to tell you how to turn the.
0:14:21.441 --> 0:14:26.927
Be completely issued because if you have a
very short wavelength there might be quite
0:14:26.927 --> 0:14:28.011
big differences.
0:14:28.308 --> 0:14:33.577
And it might also be that then it depends,
of course, like what type of world embedding
0:14:33.577 --> 0:14:34.834
you've learned like.
0:14:34.834 --> 0:14:37.588
Is the dimension where you have long changes?
0:14:37.588 --> 0:14:43.097
Is the report for your embedding or not so
that's what I mean so that the model can somehow
0:14:43.097 --> 0:14:47.707
learn that by putting more information into
one of the embedding dimensions?
0:14:48.128 --> 0:14:54.560
So incorporated and would assume it's learning
it a bit haven't seen.
0:14:54.560 --> 0:14:57.409
Details studied how different.
0:14:58.078 --> 0:15:07.863
It's also a bit difficult because really measuring
how similar or different a world isn't that
0:15:07.863 --> 0:15:08.480
easy.
0:15:08.480 --> 0:15:13.115
You can do, of course, the average distance.
0:15:14.114 --> 0:15:21.393
Them, so are the weight tags not at model
two, or is there fixed weight tags that the
0:15:21.393 --> 0:15:21.986
model.
0:15:24.164 --> 0:15:30.165
To believe they are fixed and the mono learns
there's a different way of doing it.
0:15:30.165 --> 0:15:32.985
The other thing you can do is you can.
0:15:33.213 --> 0:15:36.945
So you can learn the second embedding which
says this is position one.
0:15:36.945 --> 0:15:38.628
This is position two and so on.
0:15:38.628 --> 0:15:42.571
Like for words you could learn fixed embeddings
and then add them upwards.
0:15:42.571 --> 0:15:45.094
So then it would have the same thing it's
done.
0:15:45.094 --> 0:15:46.935
There is one disadvantage of this.
0:15:46.935 --> 0:15:51.403
There is anybody an idea what could be the
disadvantage of a more learned embedding.
0:15:54.955 --> 0:16:00.000
Here maybe extra play this finger and ethnic
stuff that will be an art.
0:16:00.000 --> 0:16:01.751
This will be an art for.
0:16:02.502 --> 0:16:08.323
You would only be good at positions you have
seen often and especially for long sequences.
0:16:08.323 --> 0:16:14.016
You might have seen the positions very rarely
and then normally not performing that well
0:16:14.016 --> 0:16:17.981
while here it can better learn a more general
representation.
0:16:18.298 --> 0:16:22.522
So that is another thing which we won't discuss
here.
0:16:22.522 --> 0:16:25.964
Guess is what is called relative attention.
0:16:25.945 --> 0:16:32.570
And in this case you don't learn absolute
positions, but in your calculation of the similarity
0:16:32.570 --> 0:16:39.194
you take again the relative distance into account
and have a different similarity depending on
0:16:39.194 --> 0:16:40.449
how far they are.
0:16:40.660 --> 0:16:45.898
And then you don't need to encode it beforehand,
but you would more happen within your comparison.
0:16:46.186 --> 0:16:53.471
So when you compare how similar things you
print, of course also take the relative position.
0:16:55.715 --> 0:17:03.187
Because there are multiple ways to use the
one, to multiply all the embedding, or to use
0:17:03.187 --> 0:17:03.607
all.
0:17:17.557 --> 0:17:21.931
The encoder can be bidirectional.
0:17:21.931 --> 0:17:30.679
We have everything from the beginning so we
can have a model where.
0:17:31.111 --> 0:17:36.455
Decoder training of course has also everything
available but during inference you always have
0:17:36.455 --> 0:17:41.628
only the past available so you can only look
into the previous one and not into the future
0:17:41.628 --> 0:17:46.062
because if you generate word by word you don't
know what it will be there in.
0:17:46.866 --> 0:17:53.180
And so we also have to consider this somehow
in the attention, and until now we look more
0:17:53.180 --> 0:17:54.653
at the ecoder style.
0:17:54.653 --> 0:17:58.652
So if you look at this type of model, it's
by direction.
0:17:58.652 --> 0:18:03.773
So for this hill state we are looking into
the past and into the future.
0:18:04.404 --> 0:18:14.436
So the question is, can we have to do this
like unidirectional so that you only look into
0:18:14.436 --> 0:18:15.551
the past?
0:18:15.551 --> 0:18:22.573
And the nice thing is, this is even easier
than for our hands.
0:18:23.123 --> 0:18:29.738
So we would have different types of parameters
and models because you have a forward direction.
0:18:31.211 --> 0:18:35.679
For attention, that is very simple.
0:18:35.679 --> 0:18:39.403
We are doing what is masking.
0:18:39.403 --> 0:18:45.609
If you want to have a backward model, these
ones.
0:18:45.845 --> 0:18:54.355
So on the first hit stage it's been over,
so it's maybe only looking at its health.
0:18:54.894 --> 0:19:05.310
By the second it looks on the second and the
third, so you're always selling all values
0:19:05.310 --> 0:19:07.085
in the future.
0:19:07.507 --> 0:19:13.318
And thereby you can have with the same parameters
the same model.
0:19:13.318 --> 0:19:15.783
You can have then a unique.
0:19:16.156 --> 0:19:29.895
In the decoder you do the masked self attention
where you only look into the past and you don't
0:19:29.895 --> 0:19:30.753
look.
0:19:32.212 --> 0:19:36.400
Then we only have, of course, looked onto
itself.
0:19:36.616 --> 0:19:50.903
So the question: How can we combine forward
and decoder and then we can do a decoder and
0:19:50.903 --> 0:19:54.114
just have a second?
0:19:54.374 --> 0:20:00.286
And then we're doing the cross attention which
attacks from the decoder to the anchoder.
0:20:00.540 --> 0:20:10.239
So in this time it's again that the queries
is a current state of decoder, while the keys
0:20:10.239 --> 0:20:22.833
are: You can do both onto yourself to get the
meaning on the target side and to get the meaning.
0:20:23.423 --> 0:20:25.928
So see then the full picture.
0:20:25.928 --> 0:20:33.026
This is now the typical picture of the transformer
and where you use self attention.
0:20:33.026 --> 0:20:36.700
So what you have is have your power hidden.
0:20:37.217 --> 0:20:43.254
What you then apply is here the position they're
coding: We have then doing the self attention
0:20:43.254 --> 0:20:46.734
to all the others, and this can be bi-directional.
0:20:47.707 --> 0:20:54.918
You normally do another feed forward layer
just like to make things to learn additional
0:20:54.918 --> 0:20:55.574
things.
0:20:55.574 --> 0:21:02.785
You're just having also a feed forward layer
which takes your heel stable and generates
0:21:02.785 --> 0:21:07.128
your heel state because we are making things
deeper.
0:21:07.747 --> 0:21:15.648
Then this blue part you can stack over several
times so you can have layers so that.
0:21:16.336 --> 0:21:30.256
In addition to these blue arrows, so we talked
about this in R&S that if you are now back
0:21:30.256 --> 0:21:35.883
propagating your arrow from the top,.
0:21:36.436 --> 0:21:48.578
In order to prevent that we are not really
learning how to transform that, but instead
0:21:48.578 --> 0:21:51.230
we have to change.
0:21:51.671 --> 0:22:00.597
You're calculating what should be changed
with this one.
0:22:00.597 --> 0:22:09.365
The backwards clip each layer and the learning
is just.
0:22:10.750 --> 0:22:21.632
The encoder before we go to the decoder.
0:22:21.632 --> 0:22:30.655
We have any additional questions.
0:22:31.471 --> 0:22:33.220
That's a Very Good Point.
0:22:33.553 --> 0:22:38.709
Yeah, you normally take always that at least
the default architecture to only look at the
0:22:38.709 --> 0:22:38.996
top.
0:22:40.000 --> 0:22:40.388
Coder.
0:22:40.388 --> 0:22:42.383
Of course, you can do other things.
0:22:42.383 --> 0:22:45.100
We investigated, for example, the lowest layout.
0:22:45.100 --> 0:22:49.424
The decoder is looking at the lowest level
of the incoder and not of the top.
0:22:49.749 --> 0:23:05.342
You can average or you can even learn theoretically
that what you can also do is attending to all.
0:23:05.785 --> 0:23:11.180
Can attend to all possible layers and states.
0:23:11.180 --> 0:23:18.335
But what the default thing is is that you
only have the top.
0:23:20.580 --> 0:23:31.999
The decoder when we're doing is firstly doing
the same position and coding, then we're doing
0:23:31.999 --> 0:23:36.419
self attention in the decoder side.
0:23:37.837 --> 0:23:43.396
Of course here it's not important we're doing
the mask self attention so that we're only
0:23:43.396 --> 0:23:45.708
attending to the past and we're not.
0:23:47.287 --> 0:24:02.698
Here you see the difference, so in this case
the keys and values are from the encoder and
0:24:02.698 --> 0:24:03.554
the.
0:24:03.843 --> 0:24:12.103
You're comparing it to all the counter hidden
states calculating the similarity and then
0:24:12.103 --> 0:24:13.866
you do the weight.
0:24:14.294 --> 0:24:17.236
And that is an edit to what is here.
0:24:18.418 --> 0:24:29.778
Then you have a linen layer and again this
green one is sticked several times and then.
0:24:32.232 --> 0:24:36.987
Question, so each code is off.
0:24:36.987 --> 0:24:46.039
Every one of those has the last layer of thing,
so in the.
0:24:46.246 --> 0:24:51.007
All with and only to the last or the top layer
of the anchor.
0:24:57.197 --> 0:25:00.127
Good So That Would Be.
0:25:01.501 --> 0:25:12.513
To sequence models we have looked at attention
and before we are decoding do you have any
0:25:12.513 --> 0:25:18.020
more questions to this type of architecture.
0:25:20.480 --> 0:25:30.049
Transformer was first used in machine translation,
but now it's a standard thing for doing nearly
0:25:30.049 --> 0:25:32.490
any tie sequence models.
0:25:33.013 --> 0:25:35.984
Even large language models.
0:25:35.984 --> 0:25:38.531
They are a bit similar.
0:25:38.531 --> 0:25:45.111
They are just throwing away the anchor and
cross the tension.
0:25:45.505 --> 0:25:59.329
And that is maybe interesting that it's important
to have this attention because you cannot store
0:25:59.329 --> 0:26:01.021
everything.
0:26:01.361 --> 0:26:05.357
The interesting thing with the attention is
now we can attend to everything.
0:26:05.745 --> 0:26:13.403
So you can again go back to your initial model
and have just a simple sequence model and then
0:26:13.403 --> 0:26:14.055
target.
0:26:14.694 --> 0:26:24.277
There would be a more language model style
or people call it Decoder Only model where
0:26:24.277 --> 0:26:26.617
you throw this away.
0:26:27.247 --> 0:26:30.327
The nice thing is because of your self attention.
0:26:30.327 --> 0:26:34.208
You have the original problem why you introduce
the attention.
0:26:34.208 --> 0:26:39.691
You don't have that anymore because it's not
everything is summarized, but each time you
0:26:39.691 --> 0:26:44.866
generate, you're looking back at all the previous
words, the source and the target.
0:26:45.805 --> 0:26:51.734
And there is a lot of work on is a really
important to have encoded a decoded model or
0:26:51.734 --> 0:26:54.800
is a decoded only model as good if you have.
0:26:54.800 --> 0:27:00.048
But the comparison is not that easy because
how many parameters do you have?
0:27:00.360 --> 0:27:08.832
So think the general idea at the moment is,
at least for machine translation, it's normally
0:27:08.832 --> 0:27:17.765
a bit better to have an encoded decoder model
and not a decoder model where you just concatenate
0:27:17.765 --> 0:27:20.252
the source and the target.
0:27:21.581 --> 0:27:24.073
But there is not really a big difference anymore.
0:27:24.244 --> 0:27:29.891
Because this big issue, which we had initially
with it that everything is stored in the working
0:27:29.891 --> 0:27:31.009
state, is nothing.
0:27:31.211 --> 0:27:45.046
Of course, the advantage maybe here is that
you give it a bias at your same language information.
0:27:45.285 --> 0:27:53.702
While in an encoder only model this all is
merged into one thing and sometimes it is good
0:27:53.702 --> 0:28:02.120
to give models a bit of bias okay you should
maybe treat things separately and you should
0:28:02.120 --> 0:28:03.617
look different.
0:28:04.144 --> 0:28:11.612
And of course one other difference, one other
disadvantage, maybe of an encoder owning one.
0:28:16.396 --> 0:28:19.634
You think about the suicide sentence and how
it's treated.
0:28:21.061 --> 0:28:33.787
Architecture: Anchorer can both be in the
sentence for every state and cause a little
0:28:33.787 --> 0:28:35.563
difference.
0:28:35.475 --> 0:28:43.178
If you only have a decoder that has to be
unidirectional because for the decoder side
0:28:43.178 --> 0:28:51.239
for the generation you need it and so your
input is read state by state so you don't have
0:28:51.239 --> 0:28:54.463
positional bidirection information.
0:28:56.596 --> 0:29:05.551
Again, it receives a sequence of embeddings
with position encoding.
0:29:05.551 --> 0:29:11.082
The piece is like long vector has output.
0:29:11.031 --> 0:29:17.148
Don't understand how you can set footworks
to this part of each other through inputs.
0:29:17.097 --> 0:29:20.060
Other than cola is the same as the food consume.
0:29:21.681 --> 0:29:27.438
Okay, it's very good bye, so this one hand
coding is only done on the top layer.
0:29:27.727 --> 0:29:32.012
So this green one is only repeated.
0:29:32.012 --> 0:29:38.558
You have the word embedding or the position
embedding.
0:29:38.558 --> 0:29:42.961
You have one layer of decoder which.
0:29:43.283 --> 0:29:48.245
Then you stick in the second one, the third
one, the fourth one, and then on the top.
0:29:48.208 --> 0:29:55.188
Layer: You put this projection layer which
takes a one thousand dimensional backtalk and
0:29:55.188 --> 0:30:02.089
generates based on your vocabulary maybe in
ten thousand soft max layer which gives you
0:30:02.089 --> 0:30:04.442
the probability of all words.
0:30:06.066 --> 0:30:22.369
It's a very good part part of the mass tape
ladies, but it wouldn't be for the X-rays.
0:30:22.262 --> 0:30:27.015
Aquarium filters to be like monsoon roding
as they get by the river.
0:30:27.647 --> 0:30:33.140
Yes, there is work on that think we will discuss
that in the pre-trained models.
0:30:33.493 --> 0:30:39.756
It's called where you exactly do that.
0:30:39.756 --> 0:30:48.588
If you have more metric side, it's like diagonal
here.
0:30:48.708 --> 0:30:53.018
And it's a full metric, so here everybody's
attending to each position.
0:30:53.018 --> 0:30:54.694
Here you're only attending.
0:30:54.975 --> 0:31:05.744
Then you can do the previous one where this
one is decoded, not everything but everything.
0:31:06.166 --> 0:31:13.961
So you have a bit more that is possible, and
we'll have that in the lecture on pre-train
0:31:13.961 --> 0:31:14.662
models.
0:31:18.478 --> 0:31:27.440
So we now know how to build a translation
system, but of course we don't want to have
0:31:27.440 --> 0:31:30.774
a translation system by itself.
0:31:31.251 --> 0:31:40.037
Now given this model an input sentence, how
can we generate an output mind?
0:31:40.037 --> 0:31:49.398
The general idea is still: So what we really
want to do is we start with the model.
0:31:49.398 --> 0:31:53.893
We generate different possible translations.
0:31:54.014 --> 0:31:59.754
We score them the lock probability that we're
getting, so for each input and output pair
0:31:59.754 --> 0:32:05.430
we can calculate the lock probability, which
is a product of all probabilities for each
0:32:05.430 --> 0:32:09.493
word in there, and then we can find what is
the most probable.
0:32:09.949 --> 0:32:15.410
However, that's a bit complicated we will
see because we can't look at all possible translations.
0:32:15.795 --> 0:32:28.842
So there is infinite or a number of possible
translations, so we have to do it somehow in
0:32:28.842 --> 0:32:31.596
more intelligence.
0:32:32.872 --> 0:32:37.821
So what we want to do today in the rest of
the lecture?
0:32:37.821 --> 0:32:40.295
What is the search problem?
0:32:40.295 --> 0:32:44.713
Then we will look at different search algorithms.
0:32:45.825 --> 0:32:56.636
Will compare model and search errors, so there
can be errors on the model where the model
0:32:56.636 --> 0:33:03.483
is not giving the highest score to the best
translation.
0:33:03.903 --> 0:33:21.069
This is always like searching the best translation
out of one model, which is often also interesting.
0:33:24.004 --> 0:33:29.570
And how do we do the search?
0:33:29.570 --> 0:33:41.853
We want to find the translation where the
reference is minimal.
0:33:42.042 --> 0:33:44.041
So the nice thing is SMT.
0:33:44.041 --> 0:33:51.347
It wasn't the case, but in neuromachine translation
we can't find any possible translation, so
0:33:51.347 --> 0:33:53.808
at least within our vocabulary.
0:33:53.808 --> 0:33:58.114
But if we have BPE we can really generate
any possible.
0:33:58.078 --> 0:34:04.604
Translation and cereal: We could always minimize
that, but yeah, we can't do it that easy because
0:34:04.604 --> 0:34:07.734
of course we don't have the reference at hand.
0:34:07.747 --> 0:34:10.384
If it has a reference, it's not a problem.
0:34:10.384 --> 0:34:13.694
We know what we are searching for, but we
don't know.
0:34:14.054 --> 0:34:23.886
So how can we then model this by just finding
the translation with the highest probability?
0:34:23.886 --> 0:34:29.015
Looking at it, we want to find the translation.
0:34:29.169 --> 0:34:32.525
Idea is our model is a good approximation.
0:34:32.525 --> 0:34:34.399
That's how we train it.
0:34:34.399 --> 0:34:36.584
What is a good translation?
0:34:36.584 --> 0:34:43.687
And if we find translation with the highest
probability, this should also give us the best
0:34:43.687 --> 0:34:44.702
translation.
0:34:45.265 --> 0:34:56.965
And that is then, of course, the difference
between the search error is that the model
0:34:56.965 --> 0:35:02.076
doesn't predict the best translation.
0:35:02.622 --> 0:35:08.777
How can we do the basic search first of all
in basic search that seems to be very easy
0:35:08.777 --> 0:35:15.003
so what we can do is we can do the forward
pass for the whole encoder and that's how it
0:35:15.003 --> 0:35:21.724
starts the input sentences known you can put
the input sentence and calculate all your estates
0:35:21.724 --> 0:35:22.573
and hidden?
0:35:23.083 --> 0:35:35.508
Then you can put in your sentence start and
you can generate.
0:35:35.508 --> 0:35:41.721
Here you have the probability.
0:35:41.801 --> 0:35:52.624
A good idea we would see later that as a typical
algorithm is guess what you all would do, you
0:35:52.624 --> 0:35:54.788
would then select.
0:35:55.235 --> 0:36:06.265
So if you generate here a probability distribution
over all the words in your vocabulary then
0:36:06.265 --> 0:36:08.025
you can solve.
0:36:08.688 --> 0:36:13.147
Yeah, this is how our auto condition is done
in our system.
0:36:14.794 --> 0:36:19.463
Yeah, this is also why there you have to have
a model of possible extending.
0:36:19.463 --> 0:36:24.314
It's more of a language model, but then this
is one algorithm to do the search.
0:36:24.314 --> 0:36:26.801
They maybe have also more advanced ones.
0:36:26.801 --> 0:36:32.076
We will see that so this search and other
completion should be exactly the same as the
0:36:32.076 --> 0:36:33.774
search machine translation.
0:36:34.914 --> 0:36:40.480
So we'll see that this is not optimal, so
hopefully it's not that this way, but for this
0:36:40.480 --> 0:36:41.043
problem.
0:36:41.941 --> 0:36:47.437
And what you can do then you can select this
word.
0:36:47.437 --> 0:36:50.778
This was the best translation.
0:36:51.111 --> 0:36:57.675
Because the decoder, of course, in the next
step needs not to know what is the best word
0:36:57.675 --> 0:37:02.396
here, it inputs it and generates that flexibility
distribution.
0:37:03.423 --> 0:37:14.608
And then your new distribution, and you can
do the same thing, there's the best word there,
0:37:14.608 --> 0:37:15.216
and.
0:37:15.435 --> 0:37:22.647
So you can continue doing that and always
get the hopefully the best translation in.
0:37:23.483 --> 0:37:30.839
The first question is, of course, how long
are you doing it?
0:37:30.839 --> 0:37:33.854
Now we could go forever.
0:37:36.476 --> 0:37:52.596
We had this token at the input and we put
the stop token at the output.
0:37:53.974 --> 0:38:07.217
And this is important because if we wouldn't
do that then we wouldn't have a good idea.
0:38:10.930 --> 0:38:16.193
So that seems to be a good idea, but is it
really?
0:38:16.193 --> 0:38:21.044
Do we find the most probable sentence in this?
0:38:23.763 --> 0:38:25.154
Or my dear healed proverb,.
0:38:27.547 --> 0:38:41.823
We are always selecting the highest probability
one, so it seems to be that this is a very
0:38:41.823 --> 0:38:45.902
good solution to anybody.
0:38:46.406 --> 0:38:49.909
Yes, that is actually the problem.
0:38:49.909 --> 0:38:56.416
You might do early decisions and you don't
have the global view.
0:38:56.796 --> 0:39:02.813
And this problem happens because it is an
outer regressive model.
0:39:03.223 --> 0:39:13.275
So it happens because yeah, the output we
generate is the input in the next step.
0:39:13.793 --> 0:39:19.493
And this, of course, is leading to problems.
0:39:19.493 --> 0:39:27.474
If we always take the best solution, it doesn't
mean you have.
0:39:27.727 --> 0:39:33.941
It would be different if you have a problem
where the output is not influencing your input.
0:39:34.294 --> 0:39:44.079
Then this solution will give you the best
model, but since the output is influencing
0:39:44.079 --> 0:39:47.762
your next input and the model,.
0:39:48.268 --> 0:39:51.599
Because one question might not be why do we
have this type of model?
0:39:51.771 --> 0:39:58.946
So why do we really need to put here in the
last source word?
0:39:58.946 --> 0:40:06.078
You can also put in: And then always predict
the word and the nice thing is then you wouldn't
0:40:06.078 --> 0:40:11.846
need to do beams or a difficult search because
then the output here wouldn't influence what
0:40:11.846 --> 0:40:12.975
is inputted here.
0:40:15.435 --> 0:40:20.219
Idea whether that might not be the best idea.
0:40:20.219 --> 0:40:24.588
You'll just be translating each word and.
0:40:26.626 --> 0:40:37.815
The second one is right, yes, you're not generating
a Korean sentence.
0:40:38.058 --> 0:40:48.197
We'll also see that later it's called non
auto-progressive translation, so there is work
0:40:48.197 --> 0:40:49.223
on that.
0:40:49.529 --> 0:41:02.142
So you might know it roughly because you know
it's based on this hidden state, but it can
0:41:02.142 --> 0:41:08.588
be that in the end you have your probability.
0:41:09.189 --> 0:41:14.633
And then you're not modeling the dependencies
within a work within the target sentence.
0:41:14.633 --> 0:41:27.547
For example: You can express things in German,
then you don't know which one you really select.
0:41:27.547 --> 0:41:32.156
That influences what you later.
0:41:33.393 --> 0:41:46.411
Then you try to find a better way not only
based on the English sentence and the words
0:41:46.411 --> 0:41:48.057
that come.
0:41:49.709 --> 0:42:00.954
Yes, that is more like a two-step decoding,
but that is, of course, a lot more like computational.
0:42:01.181 --> 0:42:15.978
The first thing you can do, which is typically
done, is doing not really search.
0:42:16.176 --> 0:42:32.968
So first look at what the problem of research
is to make it a bit more clear.
0:42:34.254 --> 0:42:53.163
And now you can extend them and you can extend
these and the joint probabilities.
0:42:54.334 --> 0:42:59.063
The other thing is the second word.
0:42:59.063 --> 0:43:03.397
You can do the second word dusk.
0:43:03.397 --> 0:43:07.338
Now you see the problem here.
0:43:07.707 --> 0:43:17.507
It is true that these have the highest probability,
but for these you have an extension.
0:43:18.078 --> 0:43:31.585
So the problem is just because in one position
one hypothesis, so you can always call this
0:43:31.585 --> 0:43:34.702
partial translation.
0:43:34.874 --> 0:43:41.269
The blue one begin is higher, but the green
one can be better extended and it will overtake.
0:43:45.525 --> 0:43:54.672
So the problem is if we are doing this greedy
search is that we might not end up in really
0:43:54.672 --> 0:43:55.275
good.
0:43:55.956 --> 0:44:00.916
So the first thing we could not do is like
yeah, we can just try.
0:44:00.880 --> 0:44:06.049
All combinations that are there, so there
is the other direction.
0:44:06.049 --> 0:44:13.020
So if the solution to to check the first one
is to just try all and it doesn't give us a
0:44:13.020 --> 0:44:17.876
good result, maybe what we have to do is just
try everything.
0:44:18.318 --> 0:44:23.120
The nice thing is if we try everything, we'll
definitely find the best translation.
0:44:23.463 --> 0:44:26.094
So we won't have a search error.
0:44:26.094 --> 0:44:28.167
We'll come to that later.
0:44:28.167 --> 0:44:32.472
The interesting thing is our translation performance.
0:44:33.353 --> 0:44:37.039
But we will definitely find the most probable
translation.
0:44:38.598 --> 0:44:44.552
However, it's not really possible because
the number of combinations is just too high.
0:44:44.764 --> 0:44:57.127
So the number of congregations is your vocabulary
science times the lengths of your sentences.
0:44:57.157 --> 0:45:03.665
Ten thousand or so you can imagine that very
soon you will have so many possibilities here
0:45:03.665 --> 0:45:05.597
that you cannot check all.
0:45:06.226 --> 0:45:13.460
So this is not really an implication or an
algorithm that you can use for applying machine
0:45:13.460 --> 0:45:14.493
translation.
0:45:15.135 --> 0:45:24.657
So maybe we have to do something in between
and yeah, not look at all but only look at
0:45:24.657 --> 0:45:25.314
some.
0:45:26.826 --> 0:45:29.342
And the easiest thing for that is okay.
0:45:29.342 --> 0:45:34.877
Just do sampling, so if we don't know what
to look at, maybe it's good to randomly pick
0:45:34.877 --> 0:45:35.255
some.
0:45:35.255 --> 0:45:40.601
That's not only a very good algorithm, so
the basic idea will always randomly select
0:45:40.601 --> 0:45:42.865
the word, of course, based on bits.
0:45:43.223 --> 0:45:52.434
We are doing that or times, and then we are
looking which one at the end has the highest.
0:45:52.672 --> 0:45:59.060
So we are not doing anymore really searching
for the best one, but we are more randomly
0:45:59.060 --> 0:46:05.158
doing selections with the idea that we always
select the best one at the beginning.
0:46:05.158 --> 0:46:11.764
So maybe it's better to do random, but of
course one important thing is how do we randomly
0:46:11.764 --> 0:46:12.344
select?
0:46:12.452 --> 0:46:15.756
If we just do uniform distribution, it would
be very bad.
0:46:15.756 --> 0:46:18.034
You'll only have very bad translations.
0:46:18.398 --> 0:46:23.261
Because in each position if you think about
it you have ten thousand possibilities.
0:46:23.903 --> 0:46:28.729
Most of them are really bad decisions and
you shouldn't do that.
0:46:28.729 --> 0:46:35.189
There is always only a very small number,
at least compared to the 10 000 translation.
0:46:35.395 --> 0:46:43.826
So if you have the sentence here, this is
an English sentence.
0:46:43.826 --> 0:46:47.841
You can start with these and.
0:46:48.408 --> 0:46:58.345
You're thinking about setting legal documents
in a legal document.
0:46:58.345 --> 0:47:02.350
You should not change the.
0:47:03.603 --> 0:47:11.032
The problem is we have a neural network, we
have a black box, so it's anyway a bit random.
0:47:12.092 --> 0:47:24.341
It is considered, but you will see that if
you make it intelligent for clear sentences,
0:47:24.341 --> 0:47:26.986
there is not that.
0:47:27.787 --> 0:47:35.600
Is an issue we should consider that this one
might lead to more randomness, but it might
0:47:35.600 --> 0:47:39.286
also be positive for machine translation.
0:47:40.080 --> 0:47:46.395
Least can't directly think of a good implication
where it's positive, but if you most think
0:47:46.395 --> 0:47:52.778
about dialogue systems, for example, whereas
the similar architecture is nowadays also used,
0:47:52.778 --> 0:47:55.524
you predict what the system should say.
0:47:55.695 --> 0:48:00.885
Then you want to have randomness because it's
not always saying the same thing.
0:48:01.341 --> 0:48:08.370
Machine translation is typically not you want
to have consistency, so if you have the same
0:48:08.370 --> 0:48:09.606
input normally.
0:48:09.889 --> 0:48:14.528
Therefore, sampling is not a mathieu.
0:48:14.528 --> 0:48:22.584
There are some things you will later see as
a preprocessing step.
0:48:23.003 --> 0:48:27.832
But of course it's important how you can make
this process not too random.
0:48:29.269 --> 0:48:41.619
Therefore, the first thing is don't take a
uniform distribution, but we have a very nice
0:48:41.619 --> 0:48:43.562
distribution.
0:48:43.843 --> 0:48:46.621
So I'm like randomly taking a word.
0:48:46.621 --> 0:48:51.328
We are looking at output distribution and
now taking a word.
0:48:51.731 --> 0:49:03.901
So that means we are taking the word these,
we are taking the word does, and all these.
0:49:04.444 --> 0:49:06.095
How can you do that?
0:49:06.095 --> 0:49:09.948
You randomly draw a number between zero and
one.
0:49:10.390 --> 0:49:23.686
And then you have ordered your words in some
way, and then you take the words before the
0:49:23.686 --> 0:49:26.375
sum of the words.
0:49:26.806 --> 0:49:34.981
So the easiest thing is you have zero point
five, zero point two five, and zero point two
0:49:34.981 --> 0:49:35.526
five.
0:49:35.526 --> 0:49:43.428
If you have a number smaller than you take
the first word, it takes a second word, and
0:49:43.428 --> 0:49:45.336
if it's higher than.
0:49:45.845 --> 0:49:57.707
Therefore, you can very easily get a distribution
distributed according to this probability mass
0:49:57.707 --> 0:49:59.541
and no longer.
0:49:59.799 --> 0:50:12.479
You can't even do that a bit more and more
focus on the important part if we are not randomly
0:50:12.479 --> 0:50:19.494
drawing from all words, but we are looking
only at.
0:50:21.361 --> 0:50:24.278
You have an idea why this is an important
stamp.
0:50:24.278 --> 0:50:29.459
Although we say I'm only throwing away the
words which have a very low probability, so
0:50:29.459 --> 0:50:32.555
anyway the probability of taking them is quite
low.
0:50:32.555 --> 0:50:35.234
So normally that shouldn't matter that much.
0:50:36.256 --> 0:50:38.830
There's ten thousand words.
0:50:40.300 --> 0:50:42.074
Of course, they admire thousand nine hundred.
0:50:42.074 --> 0:50:44.002
They're going to build a good people steal
it up.
0:50:45.085 --> 0:50:47.425
Hi, I'm Sarah Hauer and I'm Sig Hauer and
We're Professional.
0:50:47.867 --> 0:50:55.299
Yes, that's exactly why you do this most sampling
or so that you don't take the lowest.
0:50:55.415 --> 0:50:59.694
Probability words, but you only look at the
most probable ones and then like.
0:50:59.694 --> 0:51:04.632
Of course you have to rescale your probability
mass then so that it's still a probability
0:51:04.632 --> 0:51:08.417
because now it's a probability distribution
over ten thousand words.
0:51:08.417 --> 0:51:13.355
If you only take ten of them or so it's no
longer a probability distribution, you rescale
0:51:13.355 --> 0:51:15.330
them and you can still do that and.
0:51:16.756 --> 0:51:20.095
That is what is done assembling.
0:51:20.095 --> 0:51:26.267
It's not the most common thing, but it's done
several times.
0:51:28.088 --> 0:51:40.625
Then the search, which is somehow a standard,
and if you're doing some type of machine translation.
0:51:41.181 --> 0:51:50.162
And the basic idea is that in research we
select for the most probable and only continue
0:51:50.162 --> 0:51:51.171
with the.
0:51:51.691 --> 0:51:53.970
You can easily generalize this.
0:51:53.970 --> 0:52:00.451
We are not only continuing the most probable
one, but we are continuing the most probable.
0:52:00.880 --> 0:52:21.376
The.
0:52:17.697 --> 0:52:26.920
You should say we are sampling how many examples
it makes sense to take the one with the highest.
0:52:27.127 --> 0:52:33.947
But that is important that once you do a mistake
you might want to not influence that much.
0:52:39.899 --> 0:52:45.815
So the idea is if we're keeping the end best
hypotheses and not only the first fact.
0:52:46.586 --> 0:52:51.558
And the nice thing is in statistical machine
translation.
0:52:51.558 --> 0:52:54.473
We have exactly the same problem.
0:52:54.473 --> 0:52:57.731
You would do the same thing, however.
0:52:57.731 --> 0:53:03.388
Since the model wasn't that strong you needed
a quite large beam.
0:53:03.984 --> 0:53:18.944
Machine translation models are really strong
and you get already a very good performance.
0:53:19.899 --> 0:53:22.835
So how does it work?
0:53:22.835 --> 0:53:35.134
We can't relate to our capabilities, but now
we are not storing the most probable ones.
0:53:36.156 --> 0:53:45.163
Done that we extend all these hypothesis and
of course there is now a bit difficult because
0:53:45.163 --> 0:53:54.073
now we always have to switch what is the input
so the search gets more complicated and the
0:53:54.073 --> 0:53:55.933
first one is easy.
0:53:56.276 --> 0:54:09.816
In this case we have to once put in here these
and then somehow delete this one and instead
0:54:09.816 --> 0:54:12.759
put that into that.
0:54:13.093 --> 0:54:24.318
Otherwise you could only store your current
network states here and just continue by going
0:54:24.318 --> 0:54:25.428
forward.
0:54:26.766 --> 0:54:34.357
So now you have done the first two, and then
you have known the best.
0:54:34.357 --> 0:54:37.285
Can you now just continue?
0:54:39.239 --> 0:54:53.511
Yes, that's very important, otherwise all
your beam search doesn't really help because
0:54:53.511 --> 0:54:57.120
you would still have.
0:54:57.317 --> 0:55:06.472
So now you have to do one important step and
then reduce again to end.
0:55:06.472 --> 0:55:13.822
So in our case to make things easier we have
the inputs.
0:55:14.014 --> 0:55:19.072
Otherwise you will have two to the power of
length possibilities, so it is still exponential.
0:55:19.559 --> 0:55:26.637
But by always throwing them away you keep
your beans fixed.
0:55:26.637 --> 0:55:31.709
The items now differ in the last position.
0:55:32.492 --> 0:55:42.078
They are completely different, but you are
always searching what is the best one.
0:55:44.564 --> 0:55:50.791
So another way of hearing it is like this,
so just imagine you start with the empty sentence.
0:55:50.791 --> 0:55:55.296
Then you have three possible extensions: A,
B, and end of sentence.
0:55:55.296 --> 0:55:59.205
It's throwing away the worst one, continuing
with the two.
0:55:59.699 --> 0:56:13.136
Then you want to stay too, so in this state
it's either or and then you continue.
0:56:13.293 --> 0:56:24.924
So you always have this exponential growing
tree by destroying most of them away and only
0:56:24.924 --> 0:56:26.475
continuing.
0:56:26.806 --> 0:56:42.455
And thereby you can hopefully do less errors
because in these examples you always see this
0:56:42.455 --> 0:56:43.315
one.
0:56:43.503 --> 0:56:47.406
So you're preventing some errors, but of course
it's not perfect.
0:56:47.447 --> 0:56:56.829
You can still do errors because it could be
not the second one but the fourth one.
0:56:57.017 --> 0:57:03.272
Now just the idea is that you make yeah less
errors and prevent that.
0:57:07.667 --> 0:57:11.191
Then the question is how much does it help?
0:57:11.191 --> 0:57:14.074
And here is some examples for that.
0:57:14.074 --> 0:57:16.716
So for S & T it was really like.
0:57:16.716 --> 0:57:23.523
Typically the larger beam you have a larger
third space and you have a better score.
0:57:23.763 --> 0:57:27.370
So the larger you get, the bigger your emails,
the better you will.
0:57:27.370 --> 0:57:30.023
Typically maybe use something like three hundred.
0:57:30.250 --> 0:57:38.777
And it's mainly a trade-off between quality
and speed because the larger your beams, the
0:57:38.777 --> 0:57:43.184
more time it takes and you want to finish it.
0:57:43.184 --> 0:57:49.124
So your quality improvements are getting smaller
and smaller.
0:57:49.349 --> 0:57:57.164
So the difference between a beam of one and
ten is bigger than the difference between a.
0:57:58.098 --> 0:58:14.203
And the interesting thing is we're seeing
a bit of a different view, and we're seeing
0:58:14.203 --> 0:58:16.263
typically.
0:58:16.776 --> 0:58:24.376
And then especially if you look at the green
ones, this is unnormalized.
0:58:24.376 --> 0:58:26.770
You're seeing a sharp.
0:58:27.207 --> 0:58:32.284
So your translation quality here measured
in blue will go down again.
0:58:33.373 --> 0:58:35.663
That is now a question.
0:58:35.663 --> 0:58:37.762
Why is that the case?
0:58:37.762 --> 0:58:43.678
Why should we are seeing more and more possible
translations?
0:58:46.226 --> 0:58:48.743
If we have a bigger stretch and we are going.
0:58:52.612 --> 0:58:56.312
I'm going to be using my examples before we
also look at the bar.
0:58:56.656 --> 0:58:59.194
A good idea.
0:59:00.000 --> 0:59:18.521
But it's not everything because we in the
end always in this list we're selecting.
0:59:18.538 --> 0:59:19.382
So this is here.
0:59:19.382 --> 0:59:21.170
We don't do any regions to do that.
0:59:21.601 --> 0:59:29.287
So the probabilities at the end we always
give out the hypothesis with the highest probabilities.
0:59:30.250 --> 0:59:33.623
That is always the case.
0:59:33.623 --> 0:59:43.338
If you have a beam of this should be a subset
of the items you look at.
0:59:44.224 --> 0:59:52.571
So if you increase your biomeat you're just
looking at more and you're always taking the
0:59:52.571 --> 0:59:54.728
wine with the highest.
0:59:57.737 --> 1:00:07.014
Maybe they are all the probability that they
will be comparable to don't really have.
1:00:08.388 --> 1:00:14.010
But the probabilities are the same, not that
easy.
1:00:14.010 --> 1:00:23.931
One morning maybe you will have more examples
where we look at some stuff that's not seen
1:00:23.931 --> 1:00:26.356
in the trading space.
1:00:28.428 --> 1:00:36.478
That's mainly the answer why we give a hyperability
math we will see, but that is first of all
1:00:36.478 --> 1:00:43.087
the biggest issues, so here is a blue score,
so that is somewhat translation.
1:00:43.883 --> 1:00:48.673
This will go down by the probability of the
highest one that only goes out where stays
1:00:48.673 --> 1:00:49.224
at least.
1:00:49.609 --> 1:00:57.971
The problem is if we are searching more, we
are finding high processes which have a high
1:00:57.971 --> 1:00:59.193
translation.
1:00:59.579 --> 1:01:10.375
So we are finding these things which we wouldn't
find and we'll see why this is happening.
1:01:10.375 --> 1:01:15.714
So somehow we are reducing our search error.
1:01:16.336 --> 1:01:25.300
However, we also have a model error and we
don't assign the highest probability to translation
1:01:25.300 --> 1:01:27.942
quality to the really best.
1:01:28.548 --> 1:01:31.460
They don't always add up.
1:01:31.460 --> 1:01:34.932
Of course somehow they add up.
1:01:34.932 --> 1:01:41.653
If your bottle is worse then your performance
will even go.
1:01:42.202 --> 1:01:49.718
But sometimes it's happening that by increasing
search errors we are missing out the really
1:01:49.718 --> 1:01:57.969
bad translations which have a high probability
and we are only finding the decently good probability
1:01:57.969 --> 1:01:58.460
mass.
1:01:59.159 --> 1:02:03.859
So they are a bit independent of each other
and you can make those types of arrows.
1:02:04.224 --> 1:02:09.858
That's why, for example, doing exact search
will give you the translation with the highest
1:02:09.858 --> 1:02:15.245
probability, but there has been work on it
that you then even have a lower translation
1:02:15.245 --> 1:02:21.436
quality because then you find some random translation
which has a very high translation probability
1:02:21.436 --> 1:02:22.984
by which I'm really bad.
1:02:23.063 --> 1:02:29.036
Because our model is not perfect and giving
a perfect translation probability over air,.
1:02:31.431 --> 1:02:34.537
So why is this happening?
1:02:34.537 --> 1:02:42.301
And one issue with this is the so called label
or length spiral.
1:02:42.782 --> 1:02:47.115
And we are in each step of decoding.
1:02:47.115 --> 1:02:55.312
We are modeling the probability of the next
word given the input and.
1:02:55.895 --> 1:03:06.037
So if you have this picture, so you always
hear you have the probability of the next word.
1:03:06.446 --> 1:03:16.147
That's that's what your modeling, and of course
the model is not perfect.
1:03:16.576 --> 1:03:22.765
So it can be that if we at one time do a bitter
wrong prediction not for the first one but
1:03:22.765 --> 1:03:28.749
maybe for the 5th or 6th thing, then we're
giving it an exceptional high probability we
1:03:28.749 --> 1:03:30.178
cannot recover from.
1:03:30.230 --> 1:03:34.891
Because this high probability will stay there
forever and we just multiply other things to
1:03:34.891 --> 1:03:39.910
it, but we cannot like later say all this probability
was a bit too high, we shouldn't have done.
1:03:41.541 --> 1:03:48.984
And this leads to that the more the longer
your translation is, the more often you use
1:03:48.984 --> 1:03:51.637
this probability distribution.
1:03:52.112 --> 1:04:03.321
The typical example is this one, so you have
the probability of the translation.
1:04:04.104 --> 1:04:12.608
And this probability is quite low as you see,
and maybe there are a lot of other things.
1:04:13.053 --> 1:04:25.658
However, it might still be overestimated that
it's still a bit too high.
1:04:26.066 --> 1:04:33.042
The problem is if you know the project translation
is a very long one, but probability mask gets
1:04:33.042 --> 1:04:33.545
lower.
1:04:34.314 --> 1:04:45.399
Because each time you multiply your probability
to it, so your sequence probability gets lower
1:04:45.399 --> 1:04:46.683
and lower.
1:04:48.588 --> 1:04:59.776
And this means that at some point you might
get over this, and it might be a lower probability.
1:05:00.180 --> 1:05:09.651
And if you then have this probability at the
beginning away, but it wasn't your beam, then
1:05:09.651 --> 1:05:14.958
at this point you would select the empty sentence.
1:05:15.535 --> 1:05:25.379
So this has happened because this short translation
is seen and it's not thrown away.
1:05:28.268 --> 1:05:31.121
So,.
1:05:31.151 --> 1:05:41.256
If you have a very sore beam that can be prevented,
but if you have a large beam, this one is in
1:05:41.256 --> 1:05:41.986
there.
1:05:42.302 --> 1:05:52.029
This in general seems reasonable that shorter
pronunciations instead of longer sentences
1:05:52.029 --> 1:05:54.543
because non-religious.
1:05:56.376 --> 1:06:01.561
It's a bit depending on whether the translation
should be a bit related to your input.
1:06:02.402 --> 1:06:18.053
And since we are always multiplying things,
the longer the sequences we are getting smaller,
1:06:18.053 --> 1:06:18.726
it.
1:06:19.359 --> 1:06:29.340
It's somewhat right for human main too, but
the models tend to overestimate because of
1:06:29.340 --> 1:06:34.388
this short translation of long translation.
1:06:35.375 --> 1:06:46.474
Then, of course, that means that it's not
easy to stay on a computer because eventually
1:06:46.474 --> 1:06:48.114
it suggests.
1:06:51.571 --> 1:06:59.247
First of all there is another way and that's
typically used but you don't have to do really
1:06:59.247 --> 1:07:07.089
because this is normally not a second position
and if it's like on the 20th position you only
1:07:07.089 --> 1:07:09.592
have to have some bean lower.
1:07:10.030 --> 1:07:17.729
But you are right because these issues get
larger, the larger your input is, and then
1:07:17.729 --> 1:07:20.235
you might make more errors.
1:07:20.235 --> 1:07:27.577
So therefore this is true, but it's not as
simple that this one is always in the.
1:07:28.408 --> 1:07:45.430
That the translation for it goes down with
higher insert sizes has there been more control.
1:07:47.507 --> 1:07:51.435
In this work you see a dozen knocks.
1:07:51.435 --> 1:07:53.027
Knots go down.
1:07:53.027 --> 1:08:00.246
That's light green here, but at least you
don't see the sharp rock.
1:08:00.820 --> 1:08:07.897
So if you do some type of normalization, at
least you can assess this probability and limit
1:08:07.897 --> 1:08:08.204
it.
1:08:15.675 --> 1:08:24.828
There is other reasons why, like initial,
it's not only the length, but there can be
1:08:24.828 --> 1:08:26.874
other reasons why.
1:08:27.067 --> 1:08:37.316
And if you just make the beam too large, you're
looking too often at hypotheses in between that it's
1:08:37.316 --> 1:08:40.195
better to ignore.
1:08:41.101 --> 1:08:44.487
But that's more of a hand-wavy argument.
1:08:44.487 --> 1:08:47.874
I agree, I don't know the exact reason.
1:08:48.648 --> 1:08:53.223
You need to do the normalization and there
are different ways of doing it.
1:08:53.223 --> 1:08:54.199
It's mainly this:
1:08:54.199 --> 1:08:59.445
We're now not simply taking the translation
with the highest probability, but during
1:08:59.445 --> 1:09:04.935
decoding we have another feature saying: not
only take the one with the highest probability
1:09:04.935 --> 1:09:08.169
but also prefer translations which are a bit
longer.
1:09:08.488 --> 1:09:16.933
You can do that, for example, by dividing
by the sentence length.
1:09:16.933 --> 1:09:23.109
Then we take not the highest total probability but the highest average per word.
1:09:23.563 --> 1:09:28.841
Of course, if both are the same length, it
doesn't matter; if everything is the same length in
1:09:28.841 --> 1:09:34.483
all cases, but if you compare a translation
with seven or eight words, there is a difference
1:09:34.483 --> 1:09:39.700
if you want to have the one with the highest
probability or with the highest average.
1:09:41.021 --> 1:09:50.993
So that is the first option; the other is to have some reward
for each word, adding a bit to the score,
1:09:50.993 --> 1:09:51.540
and so on.
1:09:51.711 --> 1:10:03.258
And then, of course, you have to fine-tune that,
and there are also more complex variants.
1:10:03.903 --> 1:10:08.226
So there are different ways of doing that,
and of course that's important.
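(As a rough sketch of the two re-scoring variants just mentioned, dividing by the length versus adding a per-word reward; the function names and the reward value are placeholders, not the exact formulation of any particular toolkit.)

```python
import math

def avg_logprob_score(token_probs):
    """Variant 1: divide the sequence log-probability by the length,
    i.e. pick the hypothesis with the best *average* per-word score."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)

def word_reward_score(token_probs, reward=0.2):
    """Variant 2: add a constant bonus per generated word (reward is a
    tunable bias), so longer hypotheses are penalised less strongly."""
    return sum(math.log(p) for p in token_probs) + reward * len(token_probs)

# Reusing the toy hypotheses from the sketch above:
short_hyp = [0.30, 0.30, 0.30]
long_hyp = [0.70] * 12
print(avg_logprob_score(short_hyp), avg_logprob_score(long_hyp))  # the long one wins now
print(word_reward_score(short_hyp), word_reward_score(long_hyp))  # the long one wins here too
```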
1:10:08.428 --> 1:10:11.493
But in all of that, the main idea is the same.
1:10:11.493 --> 1:10:18.520
We know about the error that the
model seems to prefer short translations.
1:10:18.520 --> 1:10:24.799
We circumvent that by no longer
searching just for the most probable one.
1:10:24.764 --> 1:10:30.071
But we're searching for the best one under
some additional constraints, so mainly you
1:10:30.071 --> 1:10:32.122
are doing this during decoding.
1:10:32.122 --> 1:10:37.428
You're not completely trusting your model,
but you're adding some biases or constraints
1:10:37.428 --> 1:10:39.599
that should also be fulfilled.
1:10:40.000 --> 1:10:42.543
That can be, for example, that the length
should be reasonable.
1:10:49.369 --> 1:10:51.071
Any more questions on that?
1:10:56.736 --> 1:11:04.001
The last idea, which has recently gotten quite a bit
more interest, is what is called minimum
1:11:04.001 --> 1:11:11.682
Bayes risk decoding: there is maybe not the
one correct translation, but there are several
1:11:11.682 --> 1:11:13.937
good correct translations.
1:11:14.294 --> 1:11:21.731
And the idea is that now we don't want to find
the one translation which has maybe the highest
1:11:21.731 --> 1:11:22.805
probability.
1:11:23.203 --> 1:11:31.707
Instead we are looking at all translations
with high probability, and then
1:11:31.707 --> 1:11:39.524
we want to take one representative out of this set
which is most similar to all the other
1:11:39.524 --> 1:11:42.187
high-probability translations.
1:11:43.643 --> 1:11:46.642
So how does it work?
1:11:46.642 --> 1:11:55.638
First, imagine you had reference
translations.
1:11:55.996 --> 1:12:13.017
You have a set of reference translations, and
then what you want to compute is the expected similarity:
1:12:13.073 --> 1:12:28.641
weighted by a probability distribution over the references,
you measure the similarity of reference and hypothesis.
1:12:28.748 --> 1:12:31.408
So you have two sets of translations.
1:12:31.408 --> 1:12:34.786
You have the human translations of a sentence.
1:12:35.675 --> 1:12:39.251
That's of course not realistic, but let's first
follow the idea.
1:12:39.251 --> 1:12:42.324
Then you have your set of possible translations.
1:12:42.622 --> 1:12:52.994
And now you're not saying okay, we have only
one human, but we have several humans with
1:12:52.994 --> 1:12:56.294
different types of quality.
1:12:56.796 --> 1:13:07.798
You have two ingredients here: the similarity
between the automatic translation and the reference, and the quality
1:13:07.798 --> 1:13:09.339
of the human.
1:13:10.951 --> 1:13:17.451
Of course, we have the same problem that we
don't have the human references, so we have to replace them.
1:13:18.058 --> 1:13:29.751
So when we are doing it, instead of estimating
the quality based on the human, we use our
1:13:29.751 --> 1:13:30.660
model.
1:13:31.271 --> 1:13:37.612
So we can't use humans; instead we take the
model probability.
1:13:37.612 --> 1:13:40.782
We first take this set here.
1:13:41.681 --> 1:13:48.755
Then we are comparing each hypothesis to this
one, so you have two sets.
1:13:48.755 --> 1:13:53.987
Just imagine here you take all possible translations.
1:13:53.987 --> 1:13:58.735
Here you take your hypotheses and compare
them.
1:13:58.678 --> 1:14:03.798
And then you estimate the quality
based on the outcome.
1:14:04.304 --> 1:14:06.874
So the overall idea is okay.
1:14:06.874 --> 1:14:14.672
We are not finding the best hypothesis but
finding the hypothesis which is most similar
1:14:14.672 --> 1:14:17.065
to many good translations.
1:14:19.599 --> 1:14:21.826
Why would you do that?
1:14:21.826 --> 1:14:25.119
It's a bit like a smoothing idea.
1:14:25.119 --> 1:14:28.605
Imagine this is the probability distribution over translations.
1:14:29.529 --> 1:14:36.634
So if you would do beam search or greedy search
or anything where you just take the highest probability
1:14:36.634 --> 1:14:39.049
one, you would take this red one.
1:14:39.799 --> 1:14:45.686
But say the model has this type of probability distribution.
1:14:45.686 --> 1:14:58.555
Then it might be better to take one of these
modes, even though it's a bit lower in probability.
1:14:58.618 --> 1:15:12.501
So what you're mainly doing is you're doing
some smoothing of your probability distribution.
1:15:15.935 --> 1:15:17.010
How can you do that?
1:15:17.010 --> 1:15:20.131
Of course, we cannot again compare
against all possible hypotheses.
1:15:21.141 --> 1:15:29.472
But what we can do is take just two sets,
and we can even take them to be the same.
1:15:29.472 --> 1:15:38.421
So we have our set of hypotheses
and the set of pseudo-references.
1:15:39.179 --> 1:15:55.707
And we can just take the same set, so we can
just compare the utility of the hypotheses against each other.
1:15:56.656 --> 1:16:16.182
And then, of course, the question is how do
we measure the quality of the hypothesis?
1:16:16.396 --> 1:16:28.148
Of course, you could also take here the probability
p(y|x) as the weight, but you can also say
1:16:28.148 --> 1:16:30.958
we only take the top ones.
1:16:31.211 --> 1:16:39.665
And if we don't want to really rely on
how good they are, we can just filter out all the
1:16:39.665 --> 1:16:40.659
bad ones.
1:16:40.940 --> 1:16:54.657
So that is the first question for minimum
Bayes risk: what are your pseudo-references?
1:16:55.255 --> 1:17:06.968
So how do you set the quality of all these
pseudo-references? Here, with independent sampling,
1:17:06.968 --> 1:17:10.163
they all have the same weight.
1:17:10.750 --> 1:17:12.308
There is also work where you can weight them differently.
1:17:13.453 --> 1:17:17.952
And then the second question you have to answer
is, of course:
1:17:17.917 --> 1:17:26.190
How do you now compare two hypotheses? So
you have Y and H, which are both generated
1:17:26.190 --> 1:17:34.927
by the system and you want to find the H which
is most similar to all the other translations.
1:17:35.335 --> 1:17:41.812
So it's mainly this term here, which
says how similar H is to all the other Ys.
1:17:42.942 --> 1:17:50.127
So you again have to use some type of similarity
metric, which says how similar two translations are.
1:17:52.172 --> 1:17:53.775
How can you do that?
1:17:53.775 --> 1:17:58.355
Luckily, we know how to compare a reference
to a hypothesis.
1:17:58.355 --> 1:18:00.493
We have evaluation metrics.
1:18:00.493 --> 1:18:03.700
You can do something like a sentence-level metric.
1:18:04.044 --> 1:18:13.501
But especially if you're looking into neural models,
you could have a stronger metric, so you can use
1:18:13.501 --> 1:18:17.836
a neural metric which directly compares the two.
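(A minimal sketch of this selection step, assuming we already have a set of sampled hypotheses that also serve as the pseudo-references with uniform weights; the word-overlap metric below is only a toy stand-in for a sentence-level or neural metric.)

```python
def mbr_select(hypotheses, similarity):
    """Pick the hypothesis with the highest average similarity to all
    other candidates, which act as pseudo-references (uniform weights)."""
    best, best_utility = None, float("-inf")
    for i, h in enumerate(hypotheses):
        others = [y for j, y in enumerate(hypotheses) if j != i]
        utility = sum(similarity(h, y) for y in others) / len(others)
        if utility > best_utility:
            best, best_utility = h, utility
    return best

def overlap(a, b):
    """Toy similarity: word overlap (Jaccard), just for illustration."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

samples = ["we go home now", "we go home", "now we go home", "bananas are yellow"]
print(mbr_select(samples, overlap))  # picks one of the mutually similar candidates
```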
1:18:22.842 --> 1:18:29.292
Yes, so that is the main idea of minimum
Bayes risk decoding, and the important idea you should
1:18:29.292 --> 1:18:35.743
keep in mind is that it does some
smoothing by not taking the highest probability
1:18:35.743 --> 1:18:40.510
one, but by taking a whole set
of high-probability ones.
1:18:40.640 --> 1:18:45.042
And then looking for the translation, which
is most similar to all of that.
1:18:45.445 --> 1:18:49.888
And thereby doing a bit more smoothing because
you look at this one.
1:18:49.888 --> 1:18:55.169
If you have this one, for example, it would
be more similar to all of these ones.
1:18:55.169 --> 1:19:00.965
But if you take this one, it's higher probability,
but it's very dissimilar to all these.
1:19:05.445 --> 1:19:17.609
Okay, that is all for decoding; before we finish,
let's look at the combination of models.
1:19:18.678 --> 1:19:20.877
How do you get the set of pseudo-references?
1:19:20.877 --> 1:19:24.368
Do you run some type of search
for that, or...?
1:19:24.944 --> 1:19:27.087
For example, you can do beam search.
1:19:27.087 --> 1:19:28.825
You can do sampling for that.
1:19:28.825 --> 1:19:31.257
Oh yeah, we had mentioned sampling there.
1:19:31.257 --> 1:19:34.500
I think somebody asked earlier what sampling
is good for.
1:19:34.500 --> 1:19:37.280
So there's, of course, another important issue.
1:19:37.280 --> 1:19:40.117
How do you get a good representative set of
hypotheses H?
1:19:40.620 --> 1:19:47.147
If you do beam search, it might be that you
end up with too similar ones, and maybe that's
1:19:47.147 --> 1:19:49.274
prevented by doing sampling.
1:19:49.274 --> 1:19:55.288
But maybe with sampling you find worse ones,
so some kind of mixture can be helpful.
1:19:56.416 --> 1:20:04.863
Which search method is used more for transformer-based translation
systems?
1:20:04.863 --> 1:20:09.848
Nowadays beam search is definitely the standard, right?
1:20:10.130 --> 1:20:13.749
There is work on this.
1:20:13.749 --> 1:20:27.283
The problem is that MBR is often a lot
more computationally heavy, because you have to sample many
1:20:27.283 --> 1:20:29.486
translations.
1:20:31.871 --> 1:20:40.946
If you are sampling, couldn't we take the probability
for each sampled one,
1:20:40.946 --> 1:20:43.003
and weight them with that?
1:20:43.623 --> 1:20:46.262
A bit, so then we say okay, they don't all have to
count
1:20:46.262 --> 1:20:47.657
the same?
1:20:48.428 --> 1:20:52.690
Yes, so that is what you can also do.
1:20:52.690 --> 1:21:00.092
Instead of taking a uniform probability, you
could take the model's probability.
1:21:01.041 --> 1:21:14.303
The uniform weighting is a bit more robust, because if
you used this one, it might be that there are
1:21:14.303 --> 1:21:17.810
some crazy exceptions.
1:21:17.897 --> 1:21:21.088
And then it would still be robust to that.
1:21:21.088 --> 1:21:28.294
So if you look at this picture, the probability
here would be higher.
1:21:28.294 --> 1:21:31.794
But yeah, that's a bit of tuning.
1:21:33.073 --> 1:21:42.980
In this case, yes, it is also like modeling
the uncertainty.
1:21:49.169 --> 1:21:56.265
The last thing is that so far we have always considered
one model.
1:21:56.265 --> 1:22:04.084
It is also sometimes helpful to not only
look at one model, but at several.
1:22:04.384 --> 1:22:10.453
So in general there are many ways of how you
can build several models, and it's even
1:22:10.453 --> 1:22:17.370
easier: you can just start from three different random
initializations, you get three different models,
1:22:17.370 --> 1:22:18.428
and typically they behave a bit differently.
1:22:19.019 --> 1:22:27.299
And then the question is, can we combine their
strength into one model and use that then?
1:22:29.669 --> 1:22:39.281
And that can be done either online, with an
ensemble, or offline, with what
1:22:39.281 --> 1:22:41.549
is called reranking.
1:22:42.462 --> 1:22:52.800
So the idea of an ensemble is, for example, that
you combine different initializations.
1:22:52.800 --> 1:23:02.043
Of course, you can also do other things like
having different architecture.
1:23:02.222 --> 1:23:08.922
But the easiest thing you can always change
when training several models is to use different initializations.
1:23:09.209 --> 1:23:24.054
And then the question is how can you combine
that?
1:23:26.006 --> 1:23:34.245
And the easiest thing, as said, is the
ensemble.
1:23:34.245 --> 1:23:39.488
What you mainly do is, in parallel,
1:23:39.488 --> 1:23:43.833
you decode with all of the models.
1:23:44.444 --> 1:23:59.084
So each model gives a probability for the output, and you can
join these into one distribution by summing
1:23:59.084 --> 1:24:04.126
up over your K models.
1:24:04.084 --> 1:24:10.374
So you still have a probability distribution,
but you are not taking only one model's output here,
1:24:10.374 --> 1:24:10.719
but the combination of all of them.
1:24:11.491 --> 1:24:20.049
So that's one way you can easily combine different
models, and the nice thing is it typically
1:24:20.049 --> 1:24:20.715
works.
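(A minimal sketch of ensembling at a single decoding step; the models and the next_token_probs interface are invented for illustration, and it assumes all models share the same target vocabulary.)

```python
import numpy as np

def ensemble_next_token_probs(models, prefix):
    """Average the next-token distributions of all K models.
    This only works if every model uses the same target vocabulary,
    because the distributions are combined position by position."""
    dists = [m.next_token_probs(prefix) for m in models]  # each: a vector over the vocabulary
    return np.mean(dists, axis=0)                         # uniform combination weights
```

The beam or greedy search then runs on this averaged distribution exactly as before; the only extra cost is K forward passes per step instead of one.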
1:24:21.141 --> 1:24:27.487
You get an additional improvement with only more
computation, but not more human work.
1:24:27.487 --> 1:24:33.753
You just do the same thing four times and you're
getting a better performance.
1:24:33.793 --> 1:24:41.623
Compared to just having more layers and so on, the difference
to one bigger model is of course that you have to have
1:24:41.623 --> 1:24:46.272
all the models jointly during decoding at
inference time.
1:24:46.272 --> 1:24:52.634
There you have to load the models in parallel
because you have to do your search with all of them.
1:24:52.672 --> 1:24:57.557
Normally there are more memory resources for
training than you need for inference.
1:25:00.000 --> 1:25:12.637
You have to train four models and the decoding
speed is also slower because you need to decode
1:25:12.637 --> 1:25:14.367
with four models.
1:25:14.874 --> 1:25:25.670
There is one other very important thing:
the models have to be very similar, at least
1:25:25.670 --> 1:25:27.368
in some ways.
1:25:27.887 --> 1:25:28.506
Of course.
1:25:28.506 --> 1:25:34.611
You can only combine them this way if you have
the same vocabulary, because you are just summing the distributions.
1:25:34.874 --> 1:25:43.110
So just imagine you have two different vocabulary sizes
because you want to compare them, or a character-
1:25:43.110 --> 1:25:44.273
based model.
1:25:44.724 --> 1:25:53.327
That's at least not easily possible here, because
one model's output would be a word, and the
1:25:53.327 --> 1:25:56.406
other one would have to sum over several tokens.
1:25:56.636 --> 1:26:07.324
So this ensemble typically only works if you
have the same output vocabulary.
1:26:07.707 --> 1:26:16.636
Your input can be different, because that is
only encoded once, but then
1:26:16.636 --> 1:26:23.752
your output vocabulary has to be the same,
otherwise it doesn't work.
1:26:27.507 --> 1:26:41.522
There's even a surprisingly simple way of improving
your performance, and it's again some kind of
1:26:41.522 --> 1:26:43.217
smoothing.
1:26:43.483 --> 1:26:52.122
So normally during training what we are doing
is we can save the checkpoints after each epoch.
1:26:52.412 --> 1:27:01.774
And you have this type of curve where your
error on the validation data normally should go down, and
1:27:01.774 --> 1:27:09.874
if you do early stopping it means that at the
end you select the checkpoint with the lowest error, not the last one.
1:27:11.571 --> 1:27:21.467
However, some type of smoothing helps here again.
1:27:21.467 --> 1:27:31.157
Sometimes what you can do is take an ensemble of the last checkpoints.
1:27:31.491 --> 1:27:38.798
Each of them is not as good, but you still have four
different models, and they give you a little boost.
1:27:39.259 --> 1:27:42.212
So,.
1:27:43.723 --> 1:27:48.340
They somehow help you, even though they're
not supposed to be something very different.
1:27:49.489 --> 1:27:53.812
Oh, I didn't mention that: so that is a checkpoint ensemble.
1:27:53.812 --> 1:27:59.117
There is one more interesting thing, which is even
faster.
1:27:59.419 --> 1:28:12.255
Normally it also gives you better performance,
because this one might again act like a smoothed
1:28:12.255 --> 1:28:13.697
ensemble.
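(If the faster variant meant here is checkpoint averaging, i.e. averaging the weights of the last few checkpoints into a single model instead of running them as an ensemble, a minimal PyTorch-style sketch could look like this; the file names are made up and the checkpoints are assumed to store plain state dicts.)

```python
import torch

def average_checkpoints(paths):
    """Average the parameters of several checkpoints of the same model.
    The result is a single model, so decoding is as fast as with one model."""
    state_dicts = [torch.load(p, map_location="cpu") for p in paths]
    averaged = {}
    for key in state_dicts[0]:
        averaged[key] = sum(sd[key].float() for sd in state_dicts) / len(state_dicts)
    return averaged  # load with model.load_state_dict(averaged)

# e.g. average_checkpoints(["epoch18.pt", "epoch19.pt", "epoch20.pt"])
```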
1:28:16.736 --> 1:28:22.364
Of course, there are also some problems with
this, as said.
1:28:22.364 --> 1:28:30.022
For example, maybe you want to combine models with different
word representations or vocabularies.
1:28:30.590 --> 1:28:37.189
Or you want to do right-to-left decoding: normally you
generate left to right, like "I go home", but then your translation
1:28:37.189 --> 1:28:39.613
depends only on the previous words.
1:28:39.613 --> 1:28:45.942
If you want to model the future context, you could
do the inverse direction and generate the target
1:28:45.942 --> 1:28:47.895
sentence from right to left.
1:28:48.728 --> 1:28:50.839
But it's not easy to combine these things.
1:28:51.571 --> 1:28:56.976
In order to do this, or also for things like
inverse translation models,
1:28:57.637 --> 1:29:07.841
you can combine these types of models with
reranking, which we will see in the next lecture.
1:29:07.841 --> 1:29:13.963
That is a bit of what we will do there.
1:29:14.494 --> 1:29:29.593
What you should remember for next time is how
search works. Do you have any final questions?
1:29:33.773 --> 1:29:43.393
Then I wish you a happy holiday for next week;
after that, on Monday, there is another practical,
1:29:43.393 --> 1:29:50.958
and then Thursday in two weeks we'll have
the next lecture.