WEBVTT
0:00:01.721 --> 0:00:05.064
Hey, and welcome to today's lecture.
0:00:06.126 --> 0:00:13.861
What we want to do today is finish what we
did last time, so we started
0:00:13.861 --> 0:00:22.192
looking at the neural machine translation system,
and we have seen most of the components of the sequence
0:00:22.192 --> 0:00:22.787
model.
0:00:22.722 --> 0:00:29.361
What we're still missing is the transformer-based
architecture, so that is mainly the self-attention.
0:00:29.849 --> 0:00:31.958
That is what we want to look at at the beginning of today.
0:00:32.572 --> 0:00:39.315
And then the main part of today's lecture
will be decoding.
0:00:39.315 --> 0:00:43.992
That means we know how to train the model.
0:00:44.624 --> 0:00:47.507
So decoding: which output can the model
0:00:47.667 --> 0:00:53.359
generate, and the idea is how we find
that and what challenges are there.
0:00:53.359 --> 0:00:59.051
Since it's autoregressive, we will see that
it's not as easy as for other tasks.
0:00:59.359 --> 0:01:08.206
While generating the translation step by step,
we might make additional errors.
0:01:09.069 --> 0:01:16.464
But let's start with self-attention, so
what we looked at until now was an RNN-based model.
0:01:16.816 --> 0:01:27.931
And in RNN-based models you always take
the last hidden state, you take your input, and you
0:01:27.931 --> 0:01:31.513
generate a new hidden state.
0:01:31.513 --> 0:01:35.218
This is more or less the standard.
0:01:35.675 --> 0:01:41.088
And one challenge in this is that we always
store all our history in one single hidden
0:01:41.088 --> 0:01:41.523
state.
0:01:41.781 --> 0:01:50.235
We saw that this is a problem when going from
encoder to decoder, and that is why we then
0:01:50.235 --> 0:01:58.031
introduced the attention mechanism, so that
we can look back and see all the parts.
0:01:59.579 --> 0:02:06.059
However, in the decoder we still have this
issue: we are still storing all information
0:02:06.059 --> 0:02:12.394
in one hidden state, and we might do things
like here, where we start to overwrite things
0:02:12.394 --> 0:02:13.486
and we forget.
0:02:14.254 --> 0:02:23.575
So the idea is: can we do something similar
to what we do between encoder and decoder within
0:02:23.575 --> 0:02:24.907
the decoder?
0:02:26.526 --> 0:02:33.732
And the idea is, each time we're generating
here a new hidden state, it will not only depend
0:02:33.732 --> 0:02:40.780
on the previous one, but we will look at the
whole sequence and focus on different parts,
0:02:40.780 --> 0:02:46.165
as we did in attention, in order to generate
our new representation.
0:02:46.206 --> 0:02:53.903
So each time we generate a new representation,
we will look into what is important now to
0:02:53.903 --> 0:02:54.941
understand it.
0:02:55.135 --> 0:03:00.558
You may want to understand what is important for "much".
0:03:00.558 --> 0:03:08.534
You might want to look at "very" and at "like",
so that you see it's very much about liking.
0:03:08.808 --> 0:03:24.076
So the idea is that we are not storing everything
in one state; each time we are looking at the full sequence.
0:03:25.125 --> 0:03:35.160
And that is achieved by no longer being really
recurrent: the hidden states here don't depend
0:03:35.160 --> 0:03:37.086
on the same layer.
0:03:37.086 --> 0:03:42.864
Instead, we are always looking at the previous
layer.
0:03:42.942 --> 0:03:45.510
So we always have all the information from where
we are coming.
0:03:47.147 --> 0:03:51.572
So how does this attention work in detail?
0:03:51.572 --> 0:03:56.107
We start with our initial hidden states.
0:03:56.107 --> 0:04:08.338
So, for example: we had the three
terms already, the query, the key and the value;
0:04:08.338 --> 0:04:12.597
that was motivated by a database lookup.
0:04:12.772 --> 0:04:20.746
We are comparing the query to the keys of all the
other positions, and then we are merging the values.
0:04:21.321 --> 0:04:35.735
Before, there was a difference between the decoder
and the encoder.
0:04:35.775 --> 0:04:41.981
Here you could assume they are all the same, because we are
comparing ourselves to ourselves.
0:04:41.981 --> 0:04:49.489
However, we can make them different by just
learning a linear projection.
0:04:49.529 --> 0:05:01.836
So you learn here some projection based on
what you need in order to ask which question.
0:05:02.062 --> 0:05:11.800
That is: the query is what I want to compare,
the key is what I provide to others for comparison,
0:05:11.800 --> 0:05:13.748
and the value is what I give out.
0:05:14.014 --> 0:05:23.017
This is not hand-defined but learned,
so it's like three linear projections that
0:05:23.017 --> 0:05:26.618
you apply on all of these hidden states.
0:05:26.618 --> 0:05:32.338
That is the first thing, based on your initial
hidden states.
0:05:32.612 --> 0:05:37.249
And now you can do exactly as before: you
can do the attention.
0:05:37.637 --> 0:05:40.023
How did the attention work?
0:05:40.023 --> 0:05:45.390
The first thing is, we are comparing our query
to all the keys.
0:05:45.445 --> 0:05:52.713
And that is now the difference: before, the
query was from the decoder and the keys were
0:05:52.713 --> 0:05:54.253
from the encoder.
0:05:54.253 --> 0:06:02.547
Now it's all from the same side, so we compare
the first hidden state to the keys of all the others.
0:06:02.582 --> 0:06:06.217
We're getting some value here:
0:06:06.217 --> 0:06:12.806
how important is this information to better
understand this position?
0:06:13.974 --> 0:06:19.103
And these are just floating point numbers.
0:06:19.103 --> 0:06:21.668
They are normalized so that they sum to one.
0:06:22.762 --> 0:06:30.160
And that is the first step, so let's do it first
for the first state.
0:06:30.470 --> 0:06:41.937
What we can then do is multiply each value,
as we have done before, with the importance
0:06:41.937 --> 0:06:43.937
of each state.
0:06:45.145 --> 0:06:47.686
And then we have here the new hidden state.
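The computation just described can be sketched in a few lines of NumPy. This is a toy illustration, not the lecture's code; all sizes are made up and the projection matrices are random stand-ins for learned parameters: three projections produce queries, keys, and values from the previous layer's states, each query is compared to every key, the scores are normalized, and the values are merged.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, Wq, Wk, Wv):
    """One self-attention layer: every new state is a weighted sum
    over ALL states of the previous layer (no recurrence)."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv           # three learned projections
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # compare each query to every key
    weights = softmax(scores, axis=-1)         # normalized importance values
    return weights @ V                         # merge the values

rng = np.random.default_rng(0)
d = 8                                          # toy hidden size
H = rng.normal(size=(5, d))                    # 5 positions of the previous layer
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(H, Wq, Wk, Wv)
assert out.shape == (5, d)                     # one new state per position
```

Note that all five output states come out of one matrix multiplication, which is exactly the parallelism mentioned below.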
0:06:48.308 --> 0:06:57.862
You see, now this new hidden state depends
on all the hidden states of the whole sequence
0:06:57.862 --> 0:06:59.686
of the previous layer.
0:06:59.879 --> 0:07:01.739
One important thing:
0:07:01.739 --> 0:07:08.737
this one doesn't really depend on the others in the
same layer, so the hidden states here don't depend on each other.
0:07:09.029 --> 0:07:15.000
So it only depends on the hidden states of
the previous layer, but it depends on all the
0:07:15.000 --> 0:07:18.664
hidden states, and that is of course a big
advantage.
0:07:18.664 --> 0:07:25.111
So on the one hand, information can directly
flow from each hidden state; before, the information
0:07:25.111 --> 0:07:27.214
flow was always a bit limited.
0:07:28.828 --> 0:07:35.100
And the independence is important, so we can
calculate all these hidden states in parallel.
0:07:35.100 --> 0:07:41.371
That's another big advantage of self-attention:
we can calculate all the hidden states
0:07:41.371 --> 0:07:46.815
in one layer in parallel, and therefore it's
ideally designed for GPUs and fast.
0:07:47.587 --> 0:07:50.235
Then we can do the same thing for the second
hidden state.
0:07:50.530 --> 0:08:06.866
And the only difference here is how we calculate
the query.
0:08:07.227 --> 0:08:15.733
Getting these values is different because
we use a different query, and then we get
0:08:15.733 --> 0:08:17.316
our new hidden state.
0:08:18.258 --> 0:08:26.036
[Student question, partly inaudible, about whether
the order of the words matters here, since it seems it might not.]
0:08:26.036 --> 0:08:26.498
[Inaudible.]
0:08:27.127 --> 0:08:33.359
That's a very good question about the
initial setup.
0:08:33.359 --> 0:08:38.503
That is exactly one of the new things in the architecture.
0:08:38.503 --> 0:08:44.042
Maybe at first you would think of it as a very big
disadvantage.
0:08:44.384 --> 0:08:49.804
So this hidden state would be the same if
the word order were different.
0:08:50.650 --> 0:08:59.983
And of course this state is a weighted sum,
so if this state were at a different position,
0:08:59.983 --> 0:09:06.452
except for this correspondence, the word order
is completely lost.
0:09:06.706 --> 0:09:17.133
Therefore, just doing self-attention wouldn't
work at all, because we know word order is important
0:09:17.133 --> 0:09:21.707
and there is a completely different meaning.
0:09:22.262 --> 0:09:26.277
So we introduce the word position again.
0:09:26.277 --> 0:09:33.038
The main idea is to put the position already
into your embeddings.
0:09:33.533 --> 0:09:39.296
Then of course the position is there and you
don't lose it anymore.
0:09:39.296 --> 0:09:46.922
So mainly, if your input representation here
encodes that it is at the second position, your output
0:09:46.922 --> 0:09:48.533
will be different.
0:09:49.049 --> 0:09:54.585
And we'll see how you encode it, but that's essential
in order to get this to work.
0:09:57.137 --> 0:10:08.752
But before we come to the next slide,
one other thing that is typically done is multi-head
0:10:08.752 --> 0:10:10.069
attention.
0:10:10.430 --> 0:10:15.662
And it might be that in order to understand
"much", it might be good that in some way we
0:10:15.662 --> 0:10:19.872
focus on "like", and in some way we focus
on "very", but not equally.
0:10:19.872 --> 0:10:25.345
But maybe, to understand it, we should again look
into these on different dimensions.
0:10:25.905 --> 0:10:31.393
And therefore what we're doing is, we're not just
doing the self-attention once, but we're
0:10:31.393 --> 0:10:35.031
doing it n times, based on your number of multi-head
attentions.
0:10:35.031 --> 0:10:41.299
So in typical examples, the number of heads
people are talking about is like eight. So you're
0:10:41.299 --> 0:10:50.638
doing this process and have different queries
and keys, so you can focus on different things.
0:10:50.790 --> 0:10:52.887
How can you generate eight different ones?
0:10:53.593 --> 0:11:07.595
It's quite easy here: instead of
having one linear projection you can have eight
0:11:07.595 --> 0:11:09.326
different ones.
0:11:09.569 --> 0:11:13.844
And it might be that sometimes you're looking
more into one thing, and sometimes you're looking
0:11:13.844 --> 0:11:14.779
more into the other.
0:11:15.055 --> 0:11:24.751
So that's of course nice with this type of
learned approach, because we can automatically
0:11:24.751 --> 0:11:25.514
learn it.
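A rough sketch of this multi-head idea, with the eight heads mentioned above (sizes and the per-head dimension split are illustrative assumptions, not the lecture's exact setup): each head gets its own learned projections, the heads run independently, and their outputs are concatenated and mixed by one more learned projection.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(H, heads, Wo):
    """Run self-attention once per head with different learned
    projections, then concatenate and mix with an output projection."""
    outs = []
    for Wq, Wk, Wv in heads:                   # each head can focus on something else
        Q, K, V = H @ Wq, H @ Wk, H @ Wv
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
        outs.append(A @ V)
    return np.concatenate(outs, axis=-1) @ Wo

rng = np.random.default_rng(1)
d, h = 16, 8                                   # 8 heads, as mentioned in the lecture
dk = d // h                                    # each head works in a smaller subspace
heads = [tuple(rng.normal(size=(d, dk)) for _ in range(3)) for _ in range(h)]
Wo = rng.normal(size=(d, d))
out = multi_head_attention(rng.normal(size=(4, d)), heads, Wo)
assert out.shape == (4, d)
```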
0:11:29.529 --> 0:11:36.629
And what you correctly said is, it's positionally
independent, so it doesn't really capture the
0:11:36.629 --> 0:11:39.176
order, which should be important.
0:11:39.379 --> 0:11:47.686
So how can we do that? The idea is we are
just encoding it directly into the embedding,
0:11:47.686 --> 0:11:52.024
so into the starting representation.
0:11:52.512 --> 0:11:55.873
How do we get that? We start with our
embeddings.
0:11:55.873 --> 0:11:58.300
Just imagine this is the embedding of "I".
0:11:59.259 --> 0:12:06.169
And then we additionally have this positional
encoding.
0:12:06.169 --> 0:12:10.181
And this positional encoding is just sine curves
0:12:10.670 --> 0:12:19.564
with different wavelengths, so with different
lengths of your signal, as you see here.
0:12:20.160 --> 0:12:37.531
And the number of functions you have is exactly
the number of dimensions you have in your embedding.
0:12:38.118 --> 0:12:51.091
And what you then do is take the first function,
and based on your position you read off a value
0:12:51.091 --> 0:12:51.955
for your word.
0:12:52.212 --> 0:13:02.518
And you see, now if you put it at this position,
of course it will get a different value.
0:13:03.003 --> 0:13:12.347
And thereby in each position a different function
value is combined with the embedding.
0:13:12.347 --> 0:13:19.823
This is then the representation for the word at the first
position.
0:13:20.020 --> 0:13:34.922
If you have it already encoded in the input,
then of course the model is able to keep the
0:13:34.922 --> 0:13:38.605
position information.
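The sine-curve construction just described can be sketched as follows. This is the common sinusoidal formulation; the exact constants (e.g. the 10000 base) follow the original transformer paper and may differ from the lecture's slides.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """One sine/cosine signal per embedding dimension, each with a
    different wavelength; the row for position p is combined with the
    word embedding at position p."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

pe = positional_encoding(50, 16)
# The same word gets a different vector at position 0 and position 1,
# so the order information survives into the attention layers.
assert pe.shape == (50, 16)
assert not np.allclose(pe[0], pe[1])
```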
0:13:38.758 --> 0:13:48.045
But you can also learn your embeddings
in a way that they collaborate optimally
0:13:48.045 --> 0:13:49.786
with these types of encodings.
0:13:51.451 --> 0:13:59.351
Is that somehow clear? Yes, there is a question?
0:14:06.006 --> 0:14:13.630
[Student question, partly inaudible, about telling
the first and the second position apart
0:14:16.576 --> 0:14:17.697
when a function has a long wavelength.]
0:14:17.697 --> 0:14:19.624
[Rest of the question inaudible.]
0:14:21.441 --> 0:14:26.927
It could be an issue, because if you have a
very short wavelength there might be quite
0:14:26.927 --> 0:14:28.011
big differences.
0:14:28.308 --> 0:14:33.577
And it might also be that it then depends,
of course, on what type of word embedding
0:14:33.577 --> 0:14:34.834
you've learned.
0:14:34.834 --> 0:14:37.588
Is the dimension where you have long wavelengths
0:14:37.588 --> 0:14:43.097
important for your embedding or not? That's what
I mean: the model can somehow
0:14:43.097 --> 0:14:47.707
learn that by putting more information into
one of the embedding dimensions.
0:14:48.128 --> 0:14:54.560
So it is incorporated, and I would assume it's learning
it a bit, but I haven't seen
0:14:54.560 --> 0:14:57.409
detailed studies of how different they are.
0:14:58.078 --> 0:15:07.863
It's also a bit difficult, because really measuring
how similar or different a word embedding is isn't that
0:15:07.863 --> 0:15:08.480
easy.
0:15:08.480 --> 0:15:13.115
You can do, of course, the average distance.
0:15:14.114 --> 0:15:21.393
[Student question:] So are the wavelengths learned
by the model, or are they fixed and the
0:15:21.393 --> 0:15:21.986
model adapts?
0:15:24.164 --> 0:15:30.165
I believe they are fixed and the model learns around
them; there's a different way of doing it.
0:15:30.165 --> 0:15:32.985
The other thing you can do is the following.
0:15:33.213 --> 0:15:36.945
So you can learn a second embedding which
says this is position one,
0:15:36.945 --> 0:15:38.628
this is position two, and so on.
0:15:38.628 --> 0:15:42.571
Like for words, you could learn fixed embeddings
and then add them up.
0:15:42.571 --> 0:15:45.094
So then it would be the same thing, just
learned.
0:15:45.094 --> 0:15:46.935
There is one disadvantage of this.
0:15:46.935 --> 0:15:51.403
Does anybody have an idea what could be the
disadvantage of a learned embedding?
0:15:54.955 --> 0:16:00.000
[Student answer, inaudible.]
0:16:00.000 --> 0:16:01.751
[Inaudible.]
0:16:02.502 --> 0:16:08.323
You would only be good at positions you have
seen often, and especially for long sequences
0:16:08.323 --> 0:16:14.016
you might have seen those positions very rarely,
and then it's normally not performing that well,
0:16:14.016 --> 0:16:17.981
while here it can better learn a more general
representation.
0:16:18.298 --> 0:16:22.522
There is another approach, which we won't discuss
here much:
0:16:22.522 --> 0:16:25.964
I guess it is what is called relative attention.
0:16:25.945 --> 0:16:32.570
And in this case you don't learn absolute
positions, but in your calculation of the similarity
0:16:32.570 --> 0:16:39.194
you take the relative distance into account,
and have a different similarity depending on
0:16:39.194 --> 0:16:40.449
how far apart they are.
0:16:40.660 --> 0:16:45.898
And then you don't need to encode it beforehand;
it would rather happen within your comparison.
0:16:46.186 --> 0:16:53.471
So when you compare how similar two things are,
you of course also take the relative position into account.
0:16:55.715 --> 0:17:03.187
[Student question, partly inaudible, about the
multiple ways of combining the position with
0:17:03.187 --> 0:17:03.607
the embedding.]
0:17:17.557 --> 0:17:21.931
The encoder can be bidirectional.
0:17:21.931 --> 0:17:30.679
We have everything from the beginning, so we
can have a model where each state sees the whole input.
0:17:31.111 --> 0:17:36.455
The decoder in training of course also has everything
available, but during inference you always have
0:17:36.455 --> 0:17:41.628
only the past available, so you can only look
into the previous words and not into the future,
0:17:41.628 --> 0:17:46.062
because if you generate word by word, you don't
know what will be there in the future.
0:17:46.866 --> 0:17:53.180
And so we also have to consider this somehow
in the attention; until now we looked more
0:17:53.180 --> 0:17:54.653
at the encoder style.
0:17:54.653 --> 0:17:58.652
So if you look at this type of model, it's
bidirectional.
0:17:58.652 --> 0:18:03.773
So for this hidden state we are looking into
the past and into the future.
0:18:04.404 --> 0:18:14.436
So the question is, can we also do this
unidirectionally, so that you only look into
0:18:14.436 --> 0:18:15.551
the past?
0:18:15.551 --> 0:18:22.573
And the nice thing is, this is even easier
than for RNNs.
0:18:23.123 --> 0:18:29.738
There we would need different parameters
and models, because you have a forward direction
and a backward direction.
0:18:31.211 --> 0:18:35.679
For attention, that is very simple.
0:18:35.679 --> 0:18:39.403
We are doing what is called masking.
0:18:39.403 --> 0:18:45.609
If you want to have a unidirectional model, you
mask these ones out.
0:18:45.845 --> 0:18:54.355
So the first hidden state, it's maybe
only looking at itself.
0:18:54.894 --> 0:19:05.310
The second one looks at the first and the
second, so you're always masking out all values
0:19:05.310 --> 0:19:07.085
in the future.
0:19:07.507 --> 0:19:13.318
And thereby, with the same parameters and
the same model,
0:19:13.318 --> 0:19:15.783
you can have a unidirectional one.
0:19:16.156 --> 0:19:29.895
In the decoder you do the masked self-attention,
where you only look into the past and you don't
0:19:29.895 --> 0:19:30.753
look into the future.
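This masking can be sketched on top of the earlier toy NumPy attention (again an illustration, not the lecture's code): future positions get a score of minus infinity before the normalization, so they receive exactly zero weight.

```python
import numpy as np

def masked_self_attention(H, Wq, Wk, Wv):
    """Decoder-style self-attention: position i may only look at
    positions <= i, enforced by setting future scores to -inf."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    n = scores.shape[0]
    scores = np.where(np.tri(n, dtype=bool), scores, -np.inf)  # mask the future
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = e / e.sum(axis=-1, keepdims=True)
    return A, A @ V

rng = np.random.default_rng(2)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
A, out = masked_self_attention(rng.normal(size=(5, d)), Wq, Wk, Wv)
assert np.allclose(np.triu(A, k=1), 0)   # no weight on future positions
assert np.isclose(A[0, 0], 1.0)          # the first state only sees itself
```

The parameters are the same as in the bidirectional case; only the mask changes, which is why the same model can be made unidirectional.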
0:19:32.212 --> 0:19:36.400
But then we have, of course, only looked at
the decoder itself.
0:19:36.616 --> 0:19:50.903
So the question is: how can we combine encoder
and decoder? And there we can take the decoder and
0:19:50.903 --> 0:19:54.114
just add a second attention.
0:19:54.374 --> 0:20:00.286
And then we're doing the cross-attention, which
attends from the decoder to the encoder.
0:20:00.540 --> 0:20:10.239
So in this one it's again that the queries
are the current state of the decoder, while the keys
0:20:10.239 --> 0:20:22.833
and values are the encoder states. You attend both onto
yourself, to get the meaning on the target side, and to the
encoder, to get the meaning of the source.
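The cross-attention can be sketched the same way as the toy self-attention above; the only change is where the queries and the keys/values come from (all sizes here are illustrative assumptions).

```python
import numpy as np

def cross_attention(dec_H, enc_H, Wq, Wk, Wv):
    """Cross-attention: queries come from the decoder states,
    keys and values from the encoder states."""
    Q = dec_H @ Wq
    K, V = enc_H @ Wk, enc_H @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = e / e.sum(axis=-1, keepdims=True)
    return A @ V                        # one source context per decoder position

rng = np.random.default_rng(3)
d = 8
enc_H = rng.normal(size=(6, d))         # 6 source positions
dec_H = rng.normal(size=(3, d))         # 3 target positions generated so far
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
ctx = cross_attention(dec_H, enc_H, Wq, Wk, Wv)
assert ctx.shape == (3, d)              # decoder length rows, mixed encoder values
```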
0:20:23.423 --> 0:20:25.928
So let's see then the full picture.
0:20:25.928 --> 0:20:33.026
This is now the typical picture of the transformer,
and where you use self-attention.
0:20:33.026 --> 0:20:36.700
So what you have is your word embeddings at the bottom.
0:20:37.217 --> 0:20:43.254
What you then apply is here the positional
encoding. We are then doing the self-attention
0:20:43.254 --> 0:20:46.734
to all the others, and this can be bidirectional.
0:20:47.707 --> 0:20:54.918
You normally do another feed-forward layer,
just to learn additional
0:20:54.918 --> 0:20:55.574
things.
0:20:55.574 --> 0:21:02.785
So you're also having a feed-forward layer
which takes your hidden state and generates
0:21:02.785 --> 0:21:07.128
your new hidden state, because we are making things
deeper.
0:21:07.747 --> 0:21:15.648
Then this blue part you can stack several
times, so you can have many layers.
0:21:16.336 --> 0:21:30.256
In addition, you have these blue arrows, the residual
connections; we talked about this for RNNs: if you are
0:21:30.256 --> 0:21:35.883
back-propagating your error from the top, it can vanish.
0:21:36.436 --> 0:21:48.578
In order to prevent that, we are not really
learning how to transform everything, but instead
0:21:48.578 --> 0:21:51.230
only what we have to change.
0:21:51.671 --> 0:22:00.597
You're calculating what should be changed
with this one.
0:22:00.597 --> 0:22:09.365
The backward pass can skip each layer, and the learning
is just of the change.
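The residual idea can be stated in one line (a toy sketch, not the lecture's code): the block's output is the input plus whatever the sublayer computes, so the sublayer only has to model the change, and the gradient can flow straight through the addition.

```python
import numpy as np

def residual_block(x, sublayer):
    """Residual connection: the layer only learns the *change*;
    the identity path carries the input (and the gradient) through."""
    return x + sublayer(x)

x = np.ones(4)
y = residual_block(x, lambda h: 0.1 * h)   # sublayer outputs a small correction
assert np.allclose(y, 1.1)                  # input preserved plus the learned change
```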
0:22:10.750 --> 0:22:21.632
That is the encoder. Before we go to the decoder:
0:22:21.632 --> 0:22:30.655
do we have any additional questions?
0:22:31.471 --> 0:22:33.220
That's a very good point.
0:22:33.553 --> 0:22:38.709
Yeah, you normally take, at least in
the default architecture, only the
0:22:38.709 --> 0:22:38.996
top
0:22:40.000 --> 0:22:40.388
of the encoder.
0:22:40.388 --> 0:22:42.383
Of course, you can do other things.
0:22:42.383 --> 0:22:45.100
We investigated, for example, the lowest layer:
0:22:45.100 --> 0:22:49.424
the decoder looking at the lowest layer
of the encoder and not at the top.
0:22:49.749 --> 0:23:05.342
You can average, or you can even learn it; theoretically,
what you can also do is attend to all of them.
0:23:05.785 --> 0:23:11.180
You can attend to all possible layers and states.
0:23:11.180 --> 0:23:18.335
But the default thing is that you
only use the top layer.
0:23:20.580 --> 0:23:31.999
In the decoder, what we're doing is firstly
the same positional encoding, and then we're doing
0:23:31.999 --> 0:23:36.419
self-attention on the decoder side.
0:23:37.837 --> 0:23:43.396
Of course, here it's now important that we're doing
the masked self-attention, so that we're only
0:23:43.396 --> 0:23:45.708
attending to the past and not to the future.
0:23:47.287 --> 0:24:02.698
Here you see the difference: in this case
the keys and values are from the encoder, and
0:24:02.698 --> 0:24:03.554
the queries from the decoder.
0:24:03.843 --> 0:24:12.103
You're comparing the query to all the encoder hidden
states, calculating the similarity, and then
0:24:12.103 --> 0:24:13.866
you do the weighted sum.
0:24:14.294 --> 0:24:17.236
And that is added to what is here.
0:24:18.418 --> 0:24:29.778
Then you have a linear layer, and again this
green part is stacked several times, and then
you have the output.
0:24:32.232 --> 0:24:36.987
[Student question:] So each decoder layer,
0:24:36.987 --> 0:24:46.039
every one of those, attends to the last layer,
0:24:46.246 --> 0:24:51.007
always and only to the last or the top layer
of the encoder?
0:24:57.197 --> 0:25:00.127
Good. So that would be it.
0:25:01.501 --> 0:25:12.513
For sequence-to-sequence models we have looked at attention,
and before we go to decoding, do you have any
0:25:12.513 --> 0:25:18.020
more questions on this type of architecture?
0:25:20.480 --> 0:25:30.049
The transformer was first used in machine translation,
but now it's the standard thing for doing nearly
0:25:30.049 --> 0:25:32.490
any type of sequence model.
0:25:33.013 --> 0:25:35.984
Even large language models:
0:25:35.984 --> 0:25:38.531
they are a bit similar.
0:25:38.531 --> 0:25:45.111
They are just throwing away the encoder and
the cross-attention.
0:25:45.505 --> 0:25:59.329
And that is maybe interesting: it was important
to have this attention because you cannot store
0:25:59.329 --> 0:26:01.021
everything in one state.
0:26:01.361 --> 0:26:05.357
The interesting thing with the self-attention is,
now we can attend to everything.
0:26:05.745 --> 0:26:13.403
So you can again go back to your initial model
and have just a simple sequence model, source and then
0:26:13.403 --> 0:26:14.055
target.
0:26:14.694 --> 0:26:24.277
That would be a more language-model style,
or people call it a decoder-only model, where
0:26:24.277 --> 0:26:26.617
you throw this away.
0:26:27.247 --> 0:26:30.327
The nice thing is, because of your self-attention,
0:26:30.327 --> 0:26:34.208
the original problem why you introduced
the attention,
0:26:34.208 --> 0:26:39.691
you don't have that anymore, because not
everything is summarized; each time you
0:26:39.691 --> 0:26:44.866
generate, you're looking back at all the previous
words, the source and the target.
0:26:45.805 --> 0:26:51.734
And there is a lot of work on: is it really
important to have an encoder-decoder model, or
0:26:51.734 --> 0:26:54.800
is a decoder-only model as good?
0:26:54.800 --> 0:27:00.048
But the comparison is not that easy, because
how many parameters do you have?
0:27:00.360 --> 0:27:08.832
So I think the general idea at the moment is,
at least for machine translation, it's normally
0:27:08.832 --> 0:27:17.765
a bit better to have an encoder-decoder model
and not a decoder-only model where you just concatenate
0:27:17.765 --> 0:27:20.252
the source and the target.
0:27:21.581 --> 0:27:24.073
But there is not really a big difference anymore.
0:27:24.244 --> 0:27:29.891
Because this big issue, which we had initially,
that everything is stored in one hidden
0:27:29.891 --> 0:27:31.009
state, is no longer there.
0:27:31.211 --> 0:27:45.046
Of course, the advantage maybe here is that
you give it a bias as to which information belongs
to which language.
0:27:45.285 --> 0:27:53.702
While in a decoder-only model this all is
merged into one thing, and sometimes it is good
0:27:53.702 --> 0:28:02.120
to give models a bit of bias: okay, you should
maybe treat things separately, and you should
0:28:02.120 --> 0:28:03.617
look at them differently.
0:28:04.144 --> 0:28:11.612
And there is one other difference, one other
disadvantage maybe, of a decoder-only model.
0:28:16.396 --> 0:28:19.634
Think about the source sentence and how
it's treated.
0:28:21.061 --> 0:28:33.787
In the encoder-decoder architecture, the encoder can
look at the whole sentence for every state, and that
0:28:33.787 --> 0:28:35.563
causes a difference.
0:28:35.475 --> 0:28:43.178
If you only have a decoder, that has to be
unidirectional, because on the decoder side,
0:28:43.178 --> 0:28:51.239
for the generation, you need it, and so your
input is read state by state, so you don't have
0:28:51.239 --> 0:28:54.463
bidirectional information.
0:28:56.596 --> 0:29:05.551
[Student question, partly inaudible:] The decoder
receives a sequence of embeddings
0:29:05.551 --> 0:29:11.082
with positional encoding and outputs a long vector.
0:29:11.031 --> 0:29:17.148
[I don't understand how the outputs of one layer
are connected as inputs to the next,
0:29:17.097 --> 0:29:20.060
and whether the output encoding is the same at each layer.]
0:29:21.681 --> 0:29:27.438
Okay, it's a very good point: this output
encoding is only done on the top layer.
0:29:27.727 --> 0:29:32.012
So this green part is the only thing repeated.
0:29:32.012 --> 0:29:38.558
You have the word embedding and the position
embedding.
0:29:38.558 --> 0:29:42.961
You have one layer of decoder, which outputs hidden states.
0:29:43.283 --> 0:29:48.245
Then you stack in the second one, the third
one, the fourth one, and then on the top
0:29:48.208 --> 0:29:55.188
layer you put this projection layer, which
takes a one-thousand-dimensional vector and
0:29:55.188 --> 0:30:02.089
generates, based on your vocabulary of maybe
ten thousand, a softmax layer which gives you
0:30:02.089 --> 0:30:04.442
the probability of all words.
0:30:06.066 --> 0:30:22.369
[Student question, partly inaudible, about applying
the mask only to part of the input
0:30:22.262 --> 0:30:27.015
rather than everywhere.]
0:30:27.647 --> 0:30:33.140
Yes, there is work on that; I think we will discuss
it in the lecture on pre-trained models.
0:30:33.493 --> 0:30:39.756
It's called [inaudible], where you exactly do that.
0:30:39.756 --> 0:30:48.588
If you look at the attention as a matrix, it's like
triangular here.
0:30:48.708 --> 0:30:53.018
And here it's a full matrix, so everybody is
attending to each position.
0:30:53.018 --> 0:30:54.694
Here you're only attending to the past.
0:30:54.975 --> 0:31:05.744
Then you can do the one in between, where this
part attends to everything, but the rest does not.
0:31:06.166 --> 0:31:13.961
So you have a bit more that is possible, and
we'll have that in the lecture on pre-trained
0:31:13.961 --> 0:31:14.662
models.
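The three attention patterns discussed here, full, triangular, and the in-between one where only a first block attends bidirectionally, can be written as boolean matrices. This is an illustrative sketch; `prefix_mask` and the prefix length `k` are made-up names, not from the lecture.

```python
import numpy as np

def full_mask(n):
    return np.ones((n, n), dtype=bool)   # encoder: attend everywhere

def causal_mask(n):
    return np.tri(n, dtype=bool)         # decoder: only the past (triangular)

def prefix_mask(n, k):
    """In-between pattern: the first k positions attend to each other
    fully, the remaining positions attend only causally."""
    m = causal_mask(n)
    m[:k, :k] = True
    return m

m = prefix_mask(5, 2)
assert m[0, 1] and m[1, 0]   # inside the prefix: bidirectional
assert not m[2, 3]           # after the prefix: no look into the future
```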
0:31:18.478 --> 0:31:27.440
So we now know how to build a translation
system, but of course we don't want to have
0:31:27.440 --> 0:31:30.774
a translation system just by itself.
0:31:31.251 --> 0:31:40.037
Now, given this model and an input sentence, how
can we generate an output?
0:31:40.037 --> 0:31:49.398
The general idea is still: what we really
want to do is, we start with the model,
0:31:49.398 --> 0:31:53.893
we generate different possible translations,
0:31:54.014 --> 0:31:59.754
we score them with the log probability that we're
getting, so for each input and output pair
0:31:59.754 --> 0:32:05.430
we can calculate the log probability, which
is the sum of the log probabilities for each
0:32:05.430 --> 0:32:09.493
word in there, and then we can find what is
the most probable.
0:32:09.949 --> 0:32:15.410
However, that's a bit complicated, we will
see, because we can't look at all possible translations.
0:32:15.795 --> 0:32:28.842
There is an infinite, or at least very large, number
of possible translations, so we have to do it somehow
0:32:28.842 --> 0:32:31.596
in a more intelligent way.
0:32:32.872 --> 0:32:37.821
So what do we want to do in the rest of
the lecture?
0:32:37.821 --> 0:32:40.295
What is the search problem?
0:32:40.295 --> 0:32:44.713
Then we will look at different search algorithms.
0:32:45.825 --> 0:32:56.636
We will compare model and search errors: there
can be errors of the model, where the model
0:32:56.636 --> 0:33:03.483
is not giving the highest score to the best
translation.
0:33:03.903 --> 0:33:21.069
This is always about searching for the best translation
given one model, which is often also interesting.
0:33:24.004 --> 0:33:29.570 | |
And how do we do the search? | |
0:33:29.570 --> 0:33:41.853 | |
We want to find the translation where the | |
reference is minimal. | |
0:33:42.042 --> 0:33:44.041 | |
So the nice thing is SMT. | |
0:33:44.041 --> 0:33:51.347 | |
It wasn't the case in SMT, but in neural machine | |
translation we can generate any possible translation, so | |
0:33:51.347 --> 0:33:53.808 | |
at least within our vocabulary. | |
0:33:53.808 --> 0:33:58.114 | |
But if we have BPE we can really generate | |
any possible. | |
0:33:58.078 --> 0:34:04.604 | |
translation. In theory we could always minimize | |
that, but yeah, we can't do it that easily because | |
0:34:04.604 --> 0:34:07.734 | |
of course we don't have the reference at hand. | |
0:34:07.747 --> 0:34:10.384 | |
If we have a reference, it's not a problem. | |
0:34:10.384 --> 0:34:13.694 | |
We know what we are searching for, but at test | |
time we don't have it. | |
0:34:14.054 --> 0:34:23.886 | |
So how can we then model this by just finding | |
the translation with the highest probability? | |
0:34:23.886 --> 0:34:29.015 | |
Looking at it, we want to find the translation. | |
0:34:29.169 --> 0:34:32.525 | |
The idea is that our model is a good approximation. | |
0:34:32.525 --> 0:34:34.399 | |
That's how we train it. | |
0:34:34.399 --> 0:34:36.584 | |
What is a good translation? | |
0:34:36.584 --> 0:34:43.687 | |
And if we find translation with the highest | |
probability, this should also give us the best | |
0:34:43.687 --> 0:34:44.702 | |
translation. | |
0:34:45.265 --> 0:34:56.965 | |
And that is then, of course, different from | |
the search error; a model error is that the model | |
0:34:56.965 --> 0:35:02.076 | |
doesn't predict the best translation. | |
0:35:02.622 --> 0:35:08.777 | |
How can we do the basic search? First of all, | |
basic search seems to be very easy: | |
0:35:08.777 --> 0:35:15.003 | |
what we can do is run the forward | |
pass for the whole encoder, and that's how it | |
0:35:15.003 --> 0:35:21.724 | |
starts. The input sentence is known, so you can put | |
in the input sentence and calculate all your states | |
0:35:21.724 --> 0:35:22.573 | |
and hidden representations. | |
0:35:23.083 --> 0:35:35.508 | |
Then you can put in your sentence-start token and | |
you can generate. | |
0:35:35.508 --> 0:35:41.721 | |
Here you have the probability. | |
0:35:41.801 --> 0:35:52.624 | |
A good idea, which we will see later is a typical | |
algorithm, is to guess what you all would do: you | |
0:35:52.624 --> 0:35:54.788 | |
would then select the most probable word. | |
0:35:55.235 --> 0:36:06.265 | |
So if you generate here a probability distribution | |
over all the words in your vocabulary then | |
0:36:06.265 --> 0:36:08.025 | |
you can select from it. | |
0:36:08.688 --> 0:36:13.147 | |
Yeah, this is how autocompletion is done | |
in such systems. | |
0:36:14.794 --> 0:36:19.463 | |
Yeah, this is also why there you have to have | |
a model of possible extending. | |
0:36:19.463 --> 0:36:24.314 | |
It's more of a language model, but then this | |
is one algorithm to do the search. | |
0:36:24.314 --> 0:36:26.801 | |
They maybe have also more advanced ones. | |
0:36:26.801 --> 0:36:32.076 | |
We will see that; so this search in auto- | |
completion should be exactly the same as the | |
0:36:32.076 --> 0:36:33.774 | |
search in machine translation. | |
0:36:34.914 --> 0:36:40.480 | |
So we'll see that this is not optimal; hopefully | |
it's not done that way there, but it is one approach for this | |
0:36:40.480 --> 0:36:41.043 | |
problem. | |
0:36:41.941 --> 0:36:47.437 | |
And what you can do then you can select this | |
word. | |
0:36:47.437 --> 0:36:50.778 | |
This was the best translation. | |
0:36:51.111 --> 0:36:57.675 | |
Because the decoder, of course, in the next | |
step needs to know what the best word | |
0:36:57.675 --> 0:37:02.396 | |
here was: you input it and it generates the next probability | |
distribution. | |
0:37:03.423 --> 0:37:14.608 | |
And then you get a new distribution, and you can | |
do the same thing: select the best word there, | |
0:37:14.608 --> 0:37:15.216 | |
and so on. | |
0:37:15.435 --> 0:37:22.647 | |
So you can continue doing that and always | |
get, hopefully, the best translation in the end. | |
0:37:23.483 --> 0:37:30.839 | |
The first question is, of course, how long | |
are you doing it? | |
0:37:30.839 --> 0:37:33.854 | |
Now we could go forever. | |
0:37:36.476 --> 0:37:52.596 | |
We had this stop token at the input side in training, | |
and we stop once we generate the stop token at the output. | |
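The greedy loop just described can be sketched as follows (a toy example; `step` and its probability table are invented stand-ins for the real decoder, which would condition on the encoder states):

```python
# Toy vocabulary; index 0 is the stop token.
VOCAB = ["<eos>", "das", "ist", "gut"]

def step(prefix):
    # Stand-in for the decoder: returns a probability distribution over
    # VOCAB given the prefix length (a real model uses the actual words).
    table = {
        0: [0.05, 0.60, 0.20, 0.15],
        1: [0.05, 0.10, 0.70, 0.15],
        2: [0.10, 0.05, 0.05, 0.80],
        3: [0.90, 0.02, 0.03, 0.05],
    }
    return table[len(prefix)]

def greedy_decode(max_len=10):
    prefix = []
    while len(prefix) < max_len:
        probs = step(prefix)
        best = max(range(len(VOCAB)), key=lambda i: probs[i])
        if VOCAB[best] == "<eos>":   # the stop token ends decoding
            break
        prefix.append(VOCAB[best])   # feed the chosen word back in
    return prefix

print(greedy_decode())  # ['das', 'ist', 'gut']
```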
0:37:53.974 --> 0:38:07.217 | |
And this is important because if we didn't | |
do that, then we wouldn't know when to stop. | |
0:38:10.930 --> 0:38:16.193 | |
So that seems to be a good idea, but is it | |
really? | |
0:38:16.193 --> 0:38:21.044 | |
Do we find the most probable sentence in this? | |
0:38:23.763 --> 0:38:25.154 | |
[unintelligible remark] | |
0:38:27.547 --> 0:38:41.823 | |
We are always selecting the highest probability | |
one, so it seems to be that this is a very | |
0:38:41.823 --> 0:38:45.902 | |
good solution. Or does anybody see a problem? | |
0:38:46.406 --> 0:38:49.909 | |
Yes, that is actually the problem. | |
0:38:49.909 --> 0:38:56.416 | |
You might do early decisions and you don't | |
have the global view. | |
0:38:56.796 --> 0:39:02.813 | |
And this problem happens because it is an | |
auto-regressive model. | |
0:39:03.223 --> 0:39:13.275 | |
So it happens because yeah, the output we | |
generate is the input in the next step. | |
0:39:13.793 --> 0:39:19.493 | |
And this, of course, is leading to problems. | |
0:39:19.493 --> 0:39:27.474 | |
If we always take the best local decision, it doesn't | |
mean you get the best overall sequence. | |
0:39:27.727 --> 0:39:33.941 | |
It would be different if you have a problem | |
where the output is not influencing your input. | |
0:39:34.294 --> 0:39:44.079 | |
Then this solution would give you the best | |
result, but since the output is influencing | |
0:39:44.079 --> 0:39:47.762 | |
your next input, this is not guaranteed. | |
0:39:48.268 --> 0:39:51.599 | |
Because one question now might be: why do we | |
have this type of model? | |
0:39:51.771 --> 0:39:58.946 | |
So why do we really need to put in here the | |
last target word? | |
0:39:58.946 --> 0:40:06.078 | |
You can also put in: And then always predict | |
the word and the nice thing is then you wouldn't | |
0:40:06.078 --> 0:40:11.846 | |
need to do beams or a difficult search because | |
then the output here wouldn't influence what | |
0:40:11.846 --> 0:40:12.975 | |
is inputted here. | |
0:40:15.435 --> 0:40:20.219 | |
Any idea why that might not be the best idea? | |
0:40:20.219 --> 0:40:24.588 | |
You'd just be translating each word independently. | |
0:40:26.626 --> 0:40:37.815 | |
The second one is right, yes, you're not generating | |
a coherent sentence. | |
0:40:38.058 --> 0:40:48.197 | |
We'll also see that later; it's called non-autoregressive | |
translation, so there is work | |
0:40:48.197 --> 0:40:49.223 | |
on that. | |
0:40:49.529 --> 0:41:02.142 | |
So you might know it roughly because you know | |
it's based on this hidden state, but it can | |
0:41:02.142 --> 0:41:08.588 | |
be that in the end you have your probability. | |
0:41:09.189 --> 0:41:14.633 | |
And then you're not modeling the dependencies | |
between the words within the target sentence. | |
0:41:14.633 --> 0:41:27.547 | |
For example, you can often express things in German | |
in several ways, and then you don't know which one to select. | |
0:41:27.547 --> 0:41:32.156 | |
That influences what you generate later. | |
0:41:33.393 --> 0:41:46.411 | |
Then you try to find a better way not only | |
based on the English sentence and the words | |
0:41:46.411 --> 0:41:48.057 | |
that come. | |
0:41:49.709 --> 0:42:00.954 | |
Yes, that is more like a two-step decoding, | |
but that is, of course, computationally a lot more expensive. | |
0:42:01.181 --> 0:42:15.978 | |
The first thing you can do, which is typically | |
done, is greedy search. | |
0:42:16.176 --> 0:42:32.968 | |
So let's first look at what the problem of greedy search | |
is, to make it a bit more clear. | |
0:42:34.254 --> 0:42:53.163 | |
And now you can extend them and you can extend | |
these and the joint probabilities. | |
0:42:54.334 --> 0:42:59.063 | |
The other thing is the second word. | |
0:42:59.063 --> 0:43:03.397 | |
You can do the second word dusk. | |
0:43:03.397 --> 0:43:07.338 | |
Now you see the problem here. | |
0:43:07.707 --> 0:43:17.507 | |
It is true that these have the highest probability, | |
but for these you have an extension. | |
0:43:18.078 --> 0:43:31.585 | |
So the problem is just because in one position | |
one hypothesis, so you can always call this | |
0:43:31.585 --> 0:43:34.702 | |
partial translation. | |
0:43:34.874 --> 0:43:41.269 | |
The blue one begin is higher, but the green | |
one can be better extended and it will overtake. | |
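In numbers, the overtaking effect described above looks like this (the probabilities are made up for illustration):

```python
# Two-step toy example of the greedy failure mode: the "blue" prefix
# wins step one, but the "green" prefix has a much better extension.
blue = 0.6 * 0.3    # best first word, weak continuation
green = 0.4 * 0.9   # worse first word, strong continuation

# Greedy would have kept only the blue prefix after step one,
# yet the green hypothesis has the higher total probability.
print(green > blue)  # True
```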
0:43:45.525 --> 0:43:54.672 | |
So the problem is if we are doing this greedy | |
search is that we might not end up in really | |
0:43:54.672 --> 0:43:55.275 | |
good. | |
0:43:55.956 --> 0:44:00.916 | |
So the first thing we could not do is like | |
yeah, we can just try. | |
0:44:00.880 --> 0:44:06.049 | |
All combinations that are there, so there | |
is the other direction. | |
0:44:06.049 --> 0:44:13.020 | |
So if the solution to to check the first one | |
is to just try all and it doesn't give us a | |
0:44:13.020 --> 0:44:17.876 | |
good result, maybe what we have to do is just | |
try everything. | |
0:44:18.318 --> 0:44:23.120 | |
The nice thing is if we try everything, we'll | |
definitely find the best translation. | |
0:44:23.463 --> 0:44:26.094 | |
So we won't have a search error. | |
0:44:26.094 --> 0:44:28.167 | |
We'll come to that later. | |
0:44:28.167 --> 0:44:32.472 | |
The interesting thing is our translation performance. | |
0:44:33.353 --> 0:44:37.039 | |
But we will definitely find the most probable | |
translation. | |
0:44:38.598 --> 0:44:44.552 | |
However, it's not really possible because | |
the number of combinations is just too high. | |
0:44:44.764 --> 0:44:57.127 | |
So the number of combinations is your vocabulary | |
size to the power of the length of your sentence. | |
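With illustrative numbers, the size of this search space is:

```python
# Full search space: vocabulary size to the power of the sentence
# length (both numbers are just illustrative).
vocab_size = 10_000
length = 20
combinations = vocab_size ** length

# 10^80 hypotheses: clearly impossible to enumerate.
print(combinations == 10 ** 80)  # True
```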
0:44:57.157 --> 0:45:03.665 | |
With a vocabulary of ten thousand or so, you can imagine that very | |
soon you will have so many possibilities here | |
0:45:03.665 --> 0:45:05.597 | |
that you cannot check all. | |
0:45:06.226 --> 0:45:13.460 | |
So this is not really an option, or an | |
algorithm that you can use in practice for machine | |
0:45:13.460 --> 0:45:14.493 | |
translation. | |
0:45:15.135 --> 0:45:24.657 | |
So maybe we have to do something in between | |
and yeah, not look at all but only look at | |
0:45:24.657 --> 0:45:25.314 | |
some. | |
0:45:26.826 --> 0:45:29.342 | |
And the easiest thing for that is okay. | |
0:45:29.342 --> 0:45:34.877 | |
Just do sampling, so if we don't know what | |
to look at, maybe it's good to randomly pick | |
0:45:34.877 --> 0:45:35.255 | |
some. | |
0:45:35.255 --> 0:45:40.601 | |
That's maybe not a very good algorithm, but | |
the basic idea is: we always randomly select | |
0:45:40.601 --> 0:45:42.865 | |
the word, of course, based on its probability. | |
0:45:43.223 --> 0:45:52.434 | |
We are doing that n times, and then we are | |
looking at which one at the end has the highest probability. | |
0:45:52.672 --> 0:45:59.060 | |
So we are not doing anymore really searching | |
for the best one, but we are more randomly | |
0:45:59.060 --> 0:46:05.158 | |
doing selections with the idea that we always | |
select the best one at the beginning. | |
0:46:05.158 --> 0:46:11.764 | |
So maybe it's better to do random, but of | |
course one important thing is how do we randomly | |
0:46:11.764 --> 0:46:12.344 | |
select? | |
0:46:12.452 --> 0:46:15.756 | |
If we just do uniform distribution, it would | |
be very bad. | |
0:46:15.756 --> 0:46:18.034 | |
You'll only have very bad translations. | |
0:46:18.398 --> 0:46:23.261 | |
Because in each position if you think about | |
it you have ten thousand possibilities. | |
0:46:23.903 --> 0:46:28.729 | |
Most of them are really bad decisions and | |
you shouldn't do that. | |
0:46:28.729 --> 0:46:35.189 | |
There is always only a very small number of good choices, | |
at least compared to the 10,000 translation options. | |
0:46:35.395 --> 0:46:43.826 | |
So if you have the sentence here, this is | |
an English sentence. | |
0:46:43.826 --> 0:46:47.841 | |
You can start with these and. | |
0:46:48.408 --> 0:46:58.345 | |
You're thinking about setting legal documents | |
in a legal document. | |
0:46:58.345 --> 0:47:02.350 | |
You should not change the. | |
0:47:03.603 --> 0:47:11.032 | |
The problem is we have a neural network, we | |
have a black box, so it's anyway a bit random. | |
0:47:12.092 --> 0:47:24.341 | |
It is considered, but you will see that if | |
you make it intelligent for clear sentences, | |
0:47:24.341 --> 0:47:26.986 | |
there is not that. | |
0:47:27.787 --> 0:47:35.600 | |
Is an issue we should consider that this one | |
might lead to more randomness, but it might | |
0:47:35.600 --> 0:47:39.286 | |
also be positive for machine translation. | |
0:47:40.080 --> 0:47:46.395 | |
At least I can't directly think of a good application | |
where it's positive, but if you think | |
0:47:46.395 --> 0:47:52.778 | |
about dialogue systems, for example, whereas | |
the similar architecture is nowadays also used, | |
0:47:52.778 --> 0:47:55.524 | |
you predict what the system should say. | |
0:47:55.695 --> 0:48:00.885 | |
Then you want to have randomness because it's | |
not always saying the same thing. | |
0:48:01.341 --> 0:48:08.370 | |
Machine translation is typically not you want | |
to have consistency, so if you have the same | |
0:48:08.370 --> 0:48:09.606 | |
input normally. | |
0:48:09.889 --> 0:48:14.528 | |
Therefore, sampling is typically not the method used. | |
0:48:14.528 --> 0:48:22.584 | |
There are some things you will later see as | |
a preprocessing step. | |
0:48:23.003 --> 0:48:27.832 | |
But of course it's important how you can make | |
this process not too random. | |
0:48:29.269 --> 0:48:41.619 | |
Therefore, the first thing is don't take a | |
uniform distribution, but we have a very nice | |
0:48:41.619 --> 0:48:43.562 | |
distribution. | |
0:48:43.843 --> 0:48:46.621 | |
So instead of just randomly taking any word, | |
0:48:46.621 --> 0:48:51.328 | |
we are looking at the output distribution and | |
drawing a word from it. | |
0:48:51.731 --> 0:49:03.901 | |
So that means we are taking the word these, | |
we are taking the word does, and all these. | |
0:49:04.444 --> 0:49:06.095 | |
How can you do that? | |
0:49:06.095 --> 0:49:09.948 | |
You randomly draw a number between zero and | |
one. | |
0:49:10.390 --> 0:49:23.686 | |
And then you have ordered your words in some | |
way, and then you take the word where the cumulative | |
0:49:23.686 --> 0:49:26.375 | |
sum of probabilities first exceeds the drawn number. | |
0:49:26.806 --> 0:49:34.981 | |
So the easiest thing is you have zero point | |
five, zero point two five, and zero point two | |
0:49:34.981 --> 0:49:35.526 | |
five. | |
0:49:35.526 --> 0:49:43.428 | |
If your number is smaller than 0.5 you take | |
the first word, below 0.75 the second word, and | |
0:49:43.428 --> 0:49:45.336 | |
if it's higher, the third. | |
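The drawing scheme just described can be sketched like this (toy vocabulary; the 0.5/0.25/0.25 probabilities match the example above, the words themselves are invented):

```python
import random

words = ["die", "das", "ein"]
probs = [0.5, 0.25, 0.25]

def sample_word(u):
    # u is a number in [0, 1); pick the first word whose cumulative
    # probability exceeds it.
    cumulative = 0.0
    for word, p in zip(words, probs):
        cumulative += p
        if u < cumulative:
            return word
    return words[-1]

print(sample_word(0.3))   # 'die'  (0.3 < 0.5)
print(sample_word(0.6))   # 'das'  (0.5 <= 0.6 < 0.75)
print(sample_word(random.random()))  # a random draw from the distribution
```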
0:49:45.845 --> 0:49:57.707 | |
Therefore, you can very easily draw words | |
distributed according to this probability mass | |
0:49:57.707 --> 0:49:59.541 | |
and no longer uniformly. | |
0:49:59.799 --> 0:50:12.479 | |
You can even do that a bit more focused | |
on the important part, if we are not randomly | |
0:50:12.479 --> 0:50:19.494 | |
drawing from all words, but we are looking | |
only at the most probable ones. | |
0:50:21.361 --> 0:50:24.278 | |
Do you have an idea why this is an important | |
step? | |
0:50:24.278 --> 0:50:29.459 | |
Although we say I'm only throwing away the | |
words which have a very low probability, so | |
0:50:29.459 --> 0:50:32.555 | |
anyway the probability of taking them is quite | |
low. | |
0:50:32.555 --> 0:50:35.234 | |
So normally that shouldn't matter that much. | |
0:50:36.256 --> 0:50:38.830 | |
There's ten thousand words. | |
0:50:40.300 --> 0:50:42.074 | |
Of course, there are maybe nine thousand nine hundred of them. | |
0:50:42.074 --> 0:50:44.002 | |
[partly unintelligible audience answer] | |
 | |
0:50:47.867 --> 0:50:55.299 | |
Yes, that's exactly why you do this top-k sampling | |
or so, so that you don't take the lowest. | |
0:50:55.415 --> 0:50:59.694 | |
Probability words, but you only look at the | |
most probable ones and then like. | |
0:50:59.694 --> 0:51:04.632 | |
Of course you have to rescale your probability | |
mass then so that it's still a probability | |
0:51:04.632 --> 0:51:08.417 | |
because now it's a probability distribution | |
over ten thousand words. | |
0:51:08.417 --> 0:51:13.355 | |
If you only take ten of them or so it's no | |
longer a probability distribution, you rescale | |
0:51:13.355 --> 0:51:15.330 | |
them, and then you can still do that. | |
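As a sketch of this rescaling (hypothetical words and probabilities; the point is only the renormalization step after keeping the top k):

```python
# Toy output distribution over four words.
probs = {"gut": 0.5, "schoen": 0.3, "toll": 0.15, "gurke": 0.05}
k = 2

# Keep only the k most probable words ...
top_k = dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k])

# ... and rescale so the kept probabilities again sum to one.
total = sum(top_k.values())
renormalized = {w: p / total for w, p in top_k.items()}

print(renormalized)  # roughly {'gut': 0.625, 'schoen': 0.375}
```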
0:51:16.756 --> 0:51:20.095 | |
That is what is done in sampling. | |
0:51:20.095 --> 0:51:26.267 | |
It's not the most common thing, but it's done | |
several times. | |
0:51:28.088 --> 0:51:40.625 | |
Then there is beam search, which is somehow the standard | |
if you're doing some type of machine translation. | |
0:51:41.181 --> 0:51:50.162 | |
And the basic idea is that in greedy search we | |
select the most probable word and only continue | |
0:51:50.162 --> 0:51:51.171 | |
with that one. | |
0:51:51.691 --> 0:51:53.970 | |
You can easily generalize this. | |
0:51:53.970 --> 0:52:00.451 | |
We are not only continuing the most probable | |
one, but we are continuing the n most probable ones. | |
0:52:17.697 --> 0:52:26.920 | |
You should say we are sampling how many examples | |
it makes sense to take the one with the highest. | |
0:52:27.127 --> 0:52:33.947 | |
But that is important so that, once you make a mistake, | |
it does not influence the result that much. | |
0:52:39.899 --> 0:52:45.815 | |
So the idea is that we're keeping the n best | |
hypotheses and not only the first best. | |
0:52:46.586 --> 0:52:51.558 | |
And the nice thing is in statistical machine | |
translation. | |
0:52:51.558 --> 0:52:54.473 | |
We have exactly the same problem. | |
0:52:54.473 --> 0:52:57.731 | |
You would do the same thing, however. | |
0:52:57.731 --> 0:53:03.388 | |
Since the model wasn't that strong you needed | |
a quite large beam. | |
0:53:03.984 --> 0:53:18.944 | |
Neural machine translation models are really strong, | |
and you already get a very good performance with a small beam. | |
0:53:19.899 --> 0:53:22.835 | |
So how does it work? | |
0:53:22.835 --> 0:53:35.134 | |
We calculate our probabilities as before, but now | |
we are keeping not one but the n most probable ones. | |
0:53:36.156 --> 0:53:45.163 | |
Having done that, we extend all these hypotheses, and | |
of course there is now a bit difficult because | |
0:53:45.163 --> 0:53:54.073 | |
now we always have to switch what is the input | |
so the search gets more complicated and the | |
0:53:54.073 --> 0:53:55.933 | |
first one is easy. | |
0:53:56.276 --> 0:54:09.816 | |
In this case we have to once put in here these | |
and then somehow delete this one and instead | |
0:54:09.816 --> 0:54:12.759 | |
put that into that. | |
0:54:13.093 --> 0:54:24.318 | |
Otherwise you could only store your current | |
network states here and just continue by going | |
0:54:24.318 --> 0:54:25.428 | |
forward. | |
0:54:26.766 --> 0:54:34.357 | |
So now you have done the first two, and then | |
you have the n best. | |
0:54:34.357 --> 0:54:37.285 | |
Can you now just continue? | |
0:54:39.239 --> 0:54:53.511 | |
Yes, that's very important, otherwise all | |
your beam search doesn't really help because | |
0:54:53.511 --> 0:54:57.120 | |
you would still have exponential growth. | |
0:54:57.317 --> 0:55:06.472 | |
So now you have to do one important step and | |
then reduce again to n. | |
0:55:06.472 --> 0:55:13.822 | |
So in our case to make things easier we have | |
the inputs. | |
0:55:14.014 --> 0:55:19.072 | |
Otherwise you will have two to the power of | |
length possibilities, so it is still exponential. | |
0:55:19.559 --> 0:55:26.637 | |
But by always throwing them away you keep | |
your beam size fixed. | |
0:55:26.637 --> 0:55:31.709 | |
The items now differ in the last position. | |
0:55:32.492 --> 0:55:42.078 | |
They are completely different, but you are | |
always searching what is the best one. | |
0:55:44.564 --> 0:55:50.791 | |
So another way of hearing it is like this, | |
so just imagine you start with the empty sentence. | |
0:55:50.791 --> 0:55:55.296 | |
Then you have three possible extensions: A, | |
B, and end of sentence. | |
0:55:55.296 --> 0:55:59.205 | |
You throw away the worst one, continuing | |
with the best two. | |
0:55:59.699 --> 0:56:13.136 | |
Then you want to stay at two, so in this step | |
you keep the best two, and then you continue. | |
0:56:13.293 --> 0:56:24.924 | |
So you always have this exponential growing | |
tree by destroying most of them away and only | |
0:56:24.924 --> 0:56:26.475 | |
continuing. | |
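A minimal version of this expand-and-prune loop (everything here is a toy stand-in; the `step` table is invented, while a real decoder would score continuations with a neural network conditioned on the source sentence):

```python
from math import log

VOCAB = ["<eos>", "a", "b"]

def step(prefix):
    # Toy next-word distributions per prefix.
    table = {
        (): [0.1, 0.5, 0.4],
        ("a",): [0.3, 0.4, 0.3],
        ("b",): [0.1, 0.1, 0.8],
        ("a", "a"): [0.5, 0.3, 0.2],
        ("a", "b"): [0.9, 0.05, 0.05],
        ("b", "b"): [0.8, 0.1, 0.1],
    }
    return table.get(tuple(prefix), [0.98, 0.01, 0.01])

def beam_search(beam_size=2, max_len=3):
    # Each hypothesis is (prefix, accumulated log probability).
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, p in zip(VOCAB, step(prefix)):
                hyp = (prefix + (word,), score + log(p))
                if word == "<eos>":
                    finished.append(hyp)   # completed translation
                else:
                    candidates.append(hyp)
        # Prune: keep only the beam_size best partial hypotheses.
        beams = sorted(candidates, key=lambda h: h[1], reverse=True)[:beam_size]
    best = max(finished + beams, key=lambda h: h[1])
    return best[0]

print(beam_search())  # ('b', 'b', '<eos>') -- greedy would have chosen 'a' first
```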
0:56:26.806 --> 0:56:42.455 | |
And thereby you can hopefully make fewer errors | |
because in these examples you always see this | |
0:56:42.455 --> 0:56:43.315 | |
one. | |
0:56:43.503 --> 0:56:47.406 | |
So you're preventing some errors, but of course | |
it's not perfect. | |
0:56:47.447 --> 0:56:56.829 | |
You can still do errors because it could be | |
not the second one but the fourth one. | |
0:56:57.017 --> 0:57:03.272 | |
The idea is just that you make fewer | |
errors and prevent some of them. | |
0:57:07.667 --> 0:57:11.191 | |
Then the question is how much does it help? | |
0:57:11.191 --> 0:57:14.074 | |
And here is some examples for that. | |
0:57:14.074 --> 0:57:16.716 | |
So for SMT it was really like this: | |
0:57:16.716 --> 0:57:23.523 | |
Typically, the larger the beam, the larger the | |
search space, and the better the score. | |
0:57:23.763 --> 0:57:27.370 | |
So the larger you get, the bigger your beam, | |
the better you will do. | |
0:57:27.370 --> 0:57:30.023 | |
Typically maybe use something like three hundred. | |
0:57:30.250 --> 0:57:38.777 | |
And it's mainly a trade-off between quality | |
and speed because the larger your beams, the | |
0:57:38.777 --> 0:57:43.184 | |
more time it takes and you want to finish it. | |
0:57:43.184 --> 0:57:49.124 | |
So your quality improvements are getting smaller | |
and smaller. | |
0:57:49.349 --> 0:57:57.164 | |
So the difference between a beam of one and | |
ten is bigger than the difference between a. | |
0:57:58.098 --> 0:58:14.203 | |
And the interesting thing is we're seeing | |
a bit of a different view, and we're seeing | |
0:58:14.203 --> 0:58:16.263 | |
typically. | |
0:58:16.776 --> 0:58:24.376 | |
And then especially if you look at the green | |
ones, this is unnormalized. | |
0:58:24.376 --> 0:58:26.770 | |
You're seeing a sharp drop. | |
0:58:27.207 --> 0:58:32.284 | |
So your translation quality here, measured | |
in BLEU, will go down again. | |
0:58:33.373 --> 0:58:35.663 | |
That is now a question. | |
0:58:35.663 --> 0:58:37.762 | |
Why is that the case? | |
0:58:37.762 --> 0:58:43.678 | |
Why should that be, when we are seeing more | |
and more possible translations? | |
0:58:46.226 --> 0:58:48.743 | |
If we have a bigger stretch and we are going. | |
0:58:52.612 --> 0:58:56.312 | |
I'm going to be using my examples before we | |
also look at the bar. | |
0:58:56.656 --> 0:58:59.194 | |
A good idea. | |
0:59:00.000 --> 0:59:18.521 | |
But it's not everything, because in the end we | |
always select from this list the most probable one. | |
0:59:18.538 --> 0:59:19.382 | |
So this is here. | |
0:59:19.382 --> 0:59:21.170 | |
We don't do any regions to do that. | |
0:59:21.601 --> 0:59:29.287 | |
So the probabilities at the end we always | |
give out the hypothesis with the highest probabilities. | |
0:59:30.250 --> 0:59:33.623 | |
That is always the case. | |
0:59:33.623 --> 0:59:43.338 | |
If you have a smaller beam, the items you look at | |
should be a subset of the items for a larger beam. | |
0:59:44.224 --> 0:59:52.571 | |
So if you increase your beam size you're just | |
looking at more, and you're always taking the | |
0:59:52.571 --> 0:59:54.728 | |
one with the highest probability. | |
0:59:57.737 --> 1:00:07.014 | |
Maybe they are all the probability that they | |
will be comparable to don't really have. | |
1:00:08.388 --> 1:00:14.010 | |
But the probabilities are the same, not that | |
easy. | |
1:00:14.010 --> 1:00:23.931 | |
One morning maybe you will have more examples | |
where we look at some stuff that's not seen | |
1:00:23.931 --> 1:00:26.356 | |
in the training data. | |
1:00:28.428 --> 1:00:36.478 | |
That's mainly the answer why we give a hyperability | |
math we will see, but that is first of all | |
1:00:36.478 --> 1:00:43.087 | |
the biggest issue. So here is the BLEU score, | |
which measures translation quality. | |
1:00:43.883 --> 1:00:48.673 | |
The BLEU score will go down, while the probability of the | |
selected hypothesis only goes up or stays the same | |
1:00:48.673 --> 1:00:49.224 | |
at least. | |
1:00:49.609 --> 1:00:57.971 | |
The problem is if we are searching more, we | |
are finding hypotheses which have a high probability | |
1:00:57.971 --> 1:00:59.193 | |
but a low translation quality. | |
1:00:59.579 --> 1:01:10.375 | |
So we are finding these things which we wouldn't | |
find and we'll see why this is happening. | |
1:01:10.375 --> 1:01:15.714 | |
So somehow we are reducing our search error. | |
1:01:16.336 --> 1:01:25.300 | |
However, we also have a model error, where we | |
don't assign the highest probability to the translation | |
1:01:25.300 --> 1:01:27.942 | |
with the really best quality. | |
1:01:28.548 --> 1:01:31.460 | |
They don't always add up. | |
1:01:31.460 --> 1:01:34.932 | |
Of course somehow they add up. | |
1:01:34.932 --> 1:01:41.653 | |
If your model is worse, then your performance | |
will even go down. | |
1:01:42.202 --> 1:01:49.718 | |
But sometimes it's happening that by increasing | |
search errors we are missing out the really | |
1:01:49.718 --> 1:01:57.969 | |
bad translations which have a high probability | |
and we are only finding the decently good probability | |
1:01:57.969 --> 1:01:58.460 | |
mass. | |
1:01:59.159 --> 1:02:03.859 | |
So they are a bit independent of each other | |
and you can make both types of errors. | |
1:02:04.224 --> 1:02:09.858 | |
That's why, for example, doing exact search | |
will give you the translation with the highest | |
1:02:09.858 --> 1:02:15.245 | |
probability, but there has been work on it | |
that you then even have a lower translation | |
1:02:15.245 --> 1:02:21.436 | |
quality because then you find some random translation | |
which has a very high translation probability | |
1:02:21.436 --> 1:02:22.984 | |
but which is really bad. | |
1:02:23.063 --> 1:02:29.036 | |
Because our model is not perfect and not giving | |
a perfect translation probability everywhere. | |
1:02:31.431 --> 1:02:34.537 | |
So why is this happening? | |
1:02:34.537 --> 1:02:42.301 | |
And one issue with this is the so-called label | |
or length bias. | |
1:02:42.782 --> 1:02:47.115 | |
And we are in each step of decoding. | |
1:02:47.115 --> 1:02:55.312 | |
We are modeling the probability of the next | |
word given the input and the previous words. | |
1:02:55.895 --> 1:03:06.037 | |
So if you have this picture: at each position | |
you have the probability of the next word. | |
1:03:06.446 --> 1:03:16.147 | |
That's what you're modeling, and of course | |
the model is not perfect. | |
1:03:16.576 --> 1:03:22.765 | |
So it can be that if we at one time do a bit of a | |
wrong prediction not for the first one but | |
1:03:22.765 --> 1:03:28.749 | |
maybe for the 5th or 6th thing, then we're | |
giving it an exceptionally high probability we | |
1:03:28.749 --> 1:03:30.178 | |
cannot recover from. | |
1:03:30.230 --> 1:03:34.891 | |
Because this high probability will stay there | |
forever and we just multiply other things to | |
1:03:34.891 --> 1:03:39.910 | |
it, but we cannot like later say all this probability | |
was a bit too high, we shouldn't have done. | |
1:03:41.541 --> 1:03:48.984 | |
And this leads to the following: the longer | |
your translation is, the more often you use | |
1:03:48.984 --> 1:03:51.637 | |
this probability distribution. | |
1:03:52.112 --> 1:04:03.321 | |
The typical example is this one, so you have | |
the probability of the translation. | |
1:04:04.104 --> 1:04:12.608 | |
And this probability is quite low as you see, | |
and maybe there are a lot of other things. | |
1:04:13.053 --> 1:04:25.658 | |
However, it might still be overestimated that | |
it's still a bit too high. | |
1:04:26.066 --> 1:04:33.042 | |
The problem is: if the correct translation | |
is a very long one, its probability mass gets | |
1:04:33.042 --> 1:04:33.545 | |
lower. | |
1:04:34.314 --> 1:04:45.399 | |
Because each time you multiply your probability | |
to it, so your sequence probability gets lower | |
1:04:45.399 --> 1:04:46.683 | |
and lower. | |
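Numerically, this shrinking looks like the following (a constant per-word probability of 0.9 is assumed purely for illustration):

```python
# Each additional word multiplies in another probability < 1,
# so longer sequences end up with much smaller total probability.
p_word = 0.9
short = p_word ** 5    # 5-word translation
long_ = p_word ** 30   # 30-word translation

print(long_ < short)  # True: the long hypothesis loses on raw probability
```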
1:04:48.588 --> 1:04:59.776 | |
And this means that at some point the short hypothesis | |
might overtake, and have the higher probability. | |
1:05:00.180 --> 1:05:09.651 | |
And if you then haven't thrown this hypothesis | |
away at the beginning, but it was in your beam, then | |
1:05:09.651 --> 1:05:14.958 | |
at this point you would select the empty sentence. | |
1:05:15.535 --> 1:05:25.379 | |
So this has happened because this short translation | |
is seen and it's not thrown away. | |
1:05:31.151 --> 1:05:41.256 | |
If you have a very small beam, that can be prevented, | |
but if you have a large beam, this one is in | |
1:05:41.256 --> 1:05:41.986 | |
there. | |
1:05:42.302 --> 1:05:52.029 | |
This in general seems reasonable, preferring shorter | |
translations instead of longer sentences, | |
1:05:52.029 --> 1:05:54.543 | |
at least to some degree. | |
1:05:56.376 --> 1:06:01.561 | |
It's a bit depending on whether the translation | |
should be a bit related to your input. | |
1:06:02.402 --> 1:06:18.053 | |
And since we are always multiplying probabilities, | |
the longer the sequence, the smaller the product | |
1:06:18.053 --> 1:06:18.726 | |
gets. | |
1:06:19.359 --> 1:06:29.340 | |
It's somewhat right for humans too, but | |
the models tend to overestimate because of | |
1:06:29.340 --> 1:06:34.388 | |
this preference of short over long translations. | |
1:06:35.375 --> 1:06:46.474 | |
Then, of course, that means that it's not | |
easy to stay on a computer because eventually | |
1:06:46.474 --> 1:06:48.114 | |
it suggests. | |
1:06:51.571 --> 1:06:59.247 | |
First of all there is another way and that's | |
typically used but you don't have to do really | |
1:06:59.247 --> 1:07:07.089 | |
because this is normally not a second position | |
and if it's like on the 20th position you only | |
1:07:07.089 --> 1:07:09.592 | |
have to have a somewhat smaller beam. | |
1:07:10.030 --> 1:07:17.729 | |
But you are right because these issues get | |
larger, the larger your input is, and then | |
1:07:17.729 --> 1:07:20.235 | |
you might make more errors. | |
1:07:20.235 --> 1:07:27.577 | |
So therefore this is true, but it's not as | |
simple that this one is always in the. | |
1:07:28.408 --> 1:07:45.430 | |
That the translation quality goes down with | |
larger beam sizes, has there been more analysis? | |
1:07:47.507 --> 1:07:51.435 | |
In this work you see it does not. | |
1:07:51.435 --> 1:07:53.027 | |
It does not go down. | |
1:07:53.027 --> 1:08:00.246 | |
That's light green here, but at least you | |
don't see the sharp drop. | |
1:08:00.820 --> 1:08:07.897 | |
So if you do some type of normalization, at | |
least you can assess this probability and limit | |
1:08:07.897 --> 1:08:08.204 | |
it. | |
1:08:15.675 --> 1:08:24.828 | |
There is other reasons why, like initial, | |
it's not only the length, but there can be | |
1:08:24.828 --> 1:08:26.874 | |
other reasons why. | |
1:08:27.067 --> 1:08:37.316 | |
And if you just take it too large, you're | |
looking too often at ways in between, but it's | |
1:08:37.316 --> 1:08:40.195 | |
better to ignore things. | |
1:08:41.101 --> 1:08:44.487 | |
But that's more a hand-wavy argument. | |
1:08:44.487 --> 1:08:47.874 | |
I agree, so I don't know if that's the exact wording. | |
1:08:48.648 --> 1:08:53.223 | |
You need to do the normalization and there | |
are different ways of doing it. | |
1:08:53.223 --> 1:08:54.199 | |
It's mainly OK. | |
1:08:54.199 --> 1:08:59.445 | |
We're just now not taking the translation | |
with the highest probability, but we during | |
1:08:59.445 --> 1:09:04.935 | |
decoding have another feature saying not | |
only take the one with the highest probability | |
1:09:04.935 --> 1:09:08.169 | |
but also prefer translations which are a bit | |
longer. | |
1:09:08.488 --> 1:09:16.933 | |
You can do that differently; one way is to divide | |
by the sentence length. | |
1:09:16.933 --> 1:09:23.109 | |
We take not the highest total but the highest average probability. | |
1:09:23.563 --> 1:09:28.841 | |
Of course, if both are the same length, it | |
doesn't matter, if they all have the same length in | |
1:09:28.841 --> 1:09:34.483 | |
all cases; but if you compare a translation | |
with seven or eight words, there is a difference | |
1:09:34.483 --> 1:09:39.700 | |
if you want to have the one with the highest | |
probability or with the highest average. | |
1:09:41.021 --> 1:09:50.993 | |
So that is the first option; or one can have some reward | |
model: for each word, you add a bit to the score, | |
1:09:50.993 --> 1:09:51.540 | |
and so on. | |
1:09:51.711 --> 1:10:03.258 | |
And then, of course, you have to fine-tune | |
that; there are also more complex ones here. | |
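As an editor's illustration, the scoring variants just described (the raw sum of log-probabilities, length normalization, and a per-word reward) can be sketched in plain Python; the function names and the reward value of 0.2 are illustrative choices, not anything prescribed in the lecture.

```python
def score_raw(logprobs):
    # Sum of token log-probabilities: implicitly favors short
    # hypotheses, since every extra token adds a negative term.
    return sum(logprobs)

def score_length_normalized(logprobs):
    # Average log-probability per token: removes the direct
    # length penalty ("highest average" instead of "highest sum").
    return sum(logprobs) / len(logprobs)

def score_word_reward(logprobs, reward=0.2):
    # Add a small constant bonus per generated word; the reward
    # value is a tuning parameter (0.2 is an arbitrary example).
    return sum(logprobs) + reward * len(logprobs)

short = [-0.7] * 4   # four tokens, each with log-prob -0.7
long_ = [-0.7] * 8   # eight tokens, same per-token quality

assert score_raw(short) > score_raw(long_)  # raw score prefers short
assert abs(score_length_normalized(short)
           - score_length_normalized(long_)) < 1e-9  # normalized: tie
```

With equal per-token quality, the normalized score treats both lengths the same, while the raw sum always favors the shorter hypothesis.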
1:10:03.903 --> 1:10:08.226 | |
So there is different ways of doing that, | |
and of course that's important. | |
1:10:08.428 --> 1:10:11.493 | |
But in all of that, the main idea is OK. | |
1:10:11.493 --> 1:10:18.520 | |
We know of the error that the | |
model seems to prefer short translations. | |
1:10:18.520 --> 1:10:24.799 | |
We circumvent that: we are | |
no longer searching for the best one. | |
1:10:24.764 --> 1:10:30.071 | |
But we're searching for the one best one and | |
some additional constraints, so mainly you | |
1:10:30.071 --> 1:10:32.122 | |
are doing here during decoding. | |
1:10:32.122 --> 1:10:37.428 | |
You're not completely trusting your model, | |
but you're adding some bias or constraints | |
1:10:37.428 --> 1:10:39.599 | |
into what should also be fulfilled. | |
1:10:40.000 --> 1:10:42.543 | |
That can be, for example, that the length | |
should be reasonable. | |
1:10:49.369 --> 1:10:51.071 | |
Any More Questions to That. | |
1:10:56.736 --> 1:11:04.001 | |
The last idea, which recently gets quite a bit | |
more interest, is what is called minimum | |
1:11:04.001 --> 1:11:11.682 | |
Bayes risk decoding, and there is maybe not the | |
one correct translation but there are several | |
1:11:11.682 --> 1:11:13.937 | |
good correct translations. | |
1:11:14.294 --> 1:11:21.731 | |
And the idea is now we don't want to find | |
the one translation, which is maybe the highest | |
1:11:21.731 --> 1:11:22.805 | |
probability. | |
1:11:23.203 --> 1:11:31.707 | |
Instead we are looking at all the translations | |
with high probability, and then | |
1:11:31.707 --> 1:11:39.524 | |
we want to take the one representative out of this | |
set which is most similar to all the other | |
1:11:39.524 --> 1:11:42.187 | |
high-probability translations. | |
1:11:43.643 --> 1:11:46.642 | |
So how does it work? | |
1:11:46.642 --> 1:11:55.638 | |
First, you could imagine that you have reference | |
translations. | |
1:11:55.996 --> 1:12:13.017 | |
You have a set of reference translations, and | |
then what you want to get is an expected quality. | |
1:12:13.073 --> 1:12:28.641 | |
Using a probability distribution, you measure | |
the similarity of reference and hypothesis. | |
1:12:28.748 --> 1:12:31.408 | |
So you have two sets of translation. | |
1:12:31.408 --> 1:12:34.786 | |
You have the human translations of a sentence. | |
1:12:35.675 --> 1:12:39.251 | |
That's of course not realistic, but let's first | |
look at the idea. | |
1:12:39.251 --> 1:12:42.324 | |
Then you have your set of possible translations. | |
1:12:42.622 --> 1:12:52.994 | |
And now you're not saying okay, we have only | |
one human, but we have several humans with | |
1:12:52.994 --> 1:12:56.294 | |
different types of quality. | |
1:12:56.796 --> 1:13:07.798 | |
You have to have two metrics here, the similarity | |
between the automatic translation and the quality | |
1:13:07.798 --> 1:13:09.339 | |
of the human. | |
1:13:10.951 --> 1:13:17.451 | |
Of course, we have the same problem that we | |
don't have the human references, so we have to approximate. | |
1:13:18.058 --> 1:13:29.751 | |
So when we are doing it, instead of estimating | |
the quality based on the human, we use our | |
1:13:29.751 --> 1:13:30.660 | |
model. | |
1:13:31.271 --> 1:13:37.612 | |
So we can't ask humans; instead we take the | |
model probability. | |
1:13:37.612 --> 1:13:40.782 | |
We take this set here first of all. | |
1:13:41.681 --> 1:13:48.755 | |
Then we are comparing each hypothesis to this | |
one, so you have two sets. | |
1:13:48.755 --> 1:13:53.987 | |
Just imagine here you take all possible translations. | |
1:13:53.987 --> 1:13:58.735 | |
Here you take your hypothesis in comparing | |
them. | |
1:13:58.678 --> 1:14:03.798 | |
And then you're estimating the quality | |
based on the outcome. | |
1:14:04.304 --> 1:14:06.874 | |
So the overall idea is okay. | |
1:14:06.874 --> 1:14:14.672 | |
We are not finding the best hypothesis but | |
finding the hypothesis which is most similar | |
1:14:14.672 --> 1:14:17.065 | |
to many good translations. | |
1:14:19.599 --> 1:14:21.826 | |
Why would you do that? | |
1:14:21.826 --> 1:14:25.119 | |
It's a bit like a smoothing idea. | |
1:14:25.119 --> 1:14:28.605 | |
Imagine this is the probability of. | |
1:14:29.529 --> 1:14:36.634 | |
So if you would do beam search or greedy search | |
or anything, if you just take the highest probability | |
1:14:36.634 --> 1:14:39.049 | |
one, you would take this red one. | |
1:14:39.799 --> 1:14:45.686 | |
If it has this type of probability distribution, | |
1:14:45.686 --> 1:14:58.555 | |
then it might be better to take one of these | |
modes, even though it's a bit lower in probability. | |
1:14:58.618 --> 1:15:12.501 | |
So what you're mainly doing is you're doing | |
some smoothing of your probability distribution. | |
1:15:15.935 --> 1:15:17.010 | |
How can you do that? | |
1:15:17.010 --> 1:15:20.131 | |
Of course, we cannot do this comparison against | |
all possible hypotheses. | |
1:15:21.141 --> 1:15:29.472 | |
But what we can do is we just have two sets, | |
and we can even take them to be the same. | |
1:15:29.472 --> 1:15:38.421 | |
So we're having our set of hypotheses and the | |
set of pseudo-references. | |
1:15:39.179 --> 1:15:55.707 | |
And we can just take the same set, so we can | |
just compare the utility of each. | |
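The selection rule just described can be written down as a minimal Python sketch: pick the hypothesis with the highest expected similarity to the pseudo-references. The word-overlap similarity here is only a toy stand-in for a real metric (such as a sentence-level score or a neural metric), and all names and example sentences are illustrative.

```python
def mbr_decode(hypotheses, pseudo_references, similarity, weights=None):
    # Pick the hypothesis with the highest expected similarity
    # (utility) to the set of pseudo-references. With independent
    # sampling the weights are uniform; they could instead be the
    # model probabilities of the pseudo-references.
    if weights is None:
        weights = [1.0 / len(pseudo_references)] * len(pseudo_references)
    best, best_utility = None, float("-inf")
    for h in hypotheses:
        utility = sum(w * similarity(h, y)
                      for w, y in zip(weights, pseudo_references))
        if utility > best_utility:
            best, best_utility = h, utility
    return best

def unigram_overlap(a, b):
    # Toy similarity: Jaccard overlap of word sets, a stand-in
    # for a proper sentence-level or neural metric.
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

# Using the candidate set itself as the pseudo-reference set:
cands = ["he goes home", "he walks home", "she sings"]
choice = mbr_decode(cands, cands, unigram_overlap)
assert choice != "she sings"  # the outlier is never picked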
1:15:56.656 --> 1:16:16.182 | |
And then, of course, the question is how do | |
we measure the quality of the hypothesis? | |
1:16:16.396 --> 1:16:28.148 | |
Of course, you could also take here the probability | |
of the hypothesis given the source, but you can also say | |
1:16:28.148 --> 1:16:30.958 | |
we only take the top ones. | |
1:16:31.211 --> 1:16:39.665 | |
And then we don't want to really rely on | |
how good they are; we have filtered out all the | |
1:16:39.665 --> 1:16:40.659 | |
bad ones. | |
1:16:40.940 --> 1:16:54.657 | |
So that is the first question for minimum | |
Bayes risk decoding: what are your pseudo-references? | |
1:16:55.255 --> 1:17:06.968 | |
So how do you set the quality of all these | |
references here? With independent sampling, | |
1:17:06.968 --> 1:17:10.163 | |
they all get the same weight. | |
1:17:10.750 --> 1:17:12.308 | |
There's also work where you can vary that. | |
1:17:13.453 --> 1:17:17.952 | |
And then the second question you have to do | |
is, of course,. | |
1:17:17.917 --> 1:17:26.190 | |
How do you compare now two hypotheses: you | |
have Y and H, which are both generated | |
1:17:26.190 --> 1:17:34.927 | |
by the system, and you want to find the H which | |
is most similar to all the other translations. | |
1:17:35.335 --> 1:17:41.812 | |
So it's mainly like this formula here, which | |
says how similar H is to all the other Ys. | |
1:17:42.942 --> 1:17:50.127 | |
So you have to again use some type of similarity | |
metric, which says how similar two hypotheses are. | |
1:17:52.172 --> 1:17:53.775 | |
How can you do that? | |
1:17:53.775 --> 1:17:58.355 | |
We luckily knew how to compare a reference | |
to a hypothesis. | |
1:17:58.355 --> 1:18:00.493 | |
We have evaluation metrics. | |
1:18:00.493 --> 1:18:03.700 | |
You can do something like a sentence-level score. | |
1:18:04.044 --> 1:18:13.501 | |
But especially if you're looking into neural models, | |
you could have a stronger metric, so you can use | |
1:18:13.501 --> 1:18:17.836 | |
a neural metric which directly compares the two. | |
1:18:22.842 --> 1:18:29.292 | |
Yes, so that is the main idea of minimum | |
Bayes risk decoding; the important idea you should | |
1:18:29.292 --> 1:18:35.743 | |
keep in mind is that it's doing somehow the | |
smoothing by not taking the highest probability | |
1:18:35.743 --> 1:18:40.510 | |
one, but by comparing like by taking a set | |
of high probability one. | |
1:18:40.640 --> 1:18:45.042 | |
And then looking for the translation, which | |
is most similar to all of that. | |
1:18:45.445 --> 1:18:49.888 | |
And thereby doing a bit more smoothing because | |
you look at this one. | |
1:18:49.888 --> 1:18:55.169 | |
If you have this one, for example, it would | |
be more similar to all of these ones. | |
1:18:55.169 --> 1:19:00.965 | |
But if you take this one, it's higher probability, | |
but it's very dissimilar to all these. | |
1:19:05.445 --> 1:19:17.609 | |
Okay, that is all for decoding; before we finish, | |
we look at the combination of models. | |
1:19:18.678 --> 1:19:20.877 | |
Sort of a set of pseudo-references: | |
1:19:20.877 --> 1:19:24.368 | |
how do you generate that, with what type of search? | |
1:19:24.944 --> 1:19:27.087 | |
For example, you can do beam search. | |
1:19:27.087 --> 1:19:28.825 | |
You can do sampling for that. | |
1:19:28.825 --> 1:19:31.257 | |
Oh yeah, we had mentioned sampling there. | |
1:19:31.257 --> 1:19:34.500 | |
I think somebody was asking what sampling | |
is good for. | |
1:19:34.500 --> 1:19:37.280 | |
So there's, of course, another important issue. | |
1:19:37.280 --> 1:19:40.117 | |
How do you get a good representative set of | |
hypotheses H? | |
1:19:40.620 --> 1:19:47.147 | |
If you do beam search, it might be that you | |
end up with too similar ones, and maybe that's | |
1:19:47.147 --> 1:19:49.274 | |
prevented by doing sampling. | |
1:19:49.274 --> 1:19:55.288 | |
But maybe in sampling you find worse ones, | |
but yet some type of model is helpful. | |
1:19:56.416 --> 1:20:04.863 | |
Which search method is used more for transformer-based | |
translation models? | |
1:20:04.863 --> 1:20:09.848 | |
Nowadays beam search is definitely the standard. | |
1:20:10.130 --> 1:20:13.749 | |
There is work on this. | |
1:20:13.749 --> 1:20:27.283 | |
The problem is that MBR is often a lot | |
more computationally heavy, because you have to sample | |
1:20:27.283 --> 1:20:29.486 | |
many translations. | |
1:20:31.871 --> 1:20:40.946 | |
If you are sampling, could we take the probability | |
of each sample, so prefer the most probable ones, | |
1:20:40.946 --> 1:20:43.003 | |
and weight them | |
1:20:43.623 --> 1:20:46.262 | |
a bit, so that we say okay, they don't have to | |
1:20:46.262 --> 1:20:47.657 | |
be uniform? | |
1:20:48.428 --> 1:20:52.690 | |
Yes, so that is what you can also do. | |
1:20:52.690 --> 1:21:00.092 | |
Instead of taking uniform probability, you | |
could take the model's. | |
1:21:01.041 --> 1:21:14.303 | |
The uniform weighting is a bit more robust, because | |
if you had this one it might be that there are | |
1:21:14.303 --> 1:21:17.810 | |
some crazy exceptions. | |
1:21:17.897 --> 1:21:21.088 | |
And then it would still be robust. | |
1:21:21.088 --> 1:21:28.294 | |
So if you look at this picture, the probability | |
here would be higher. | |
1:21:28.294 --> 1:21:31.794 | |
But yeah, that's a bit of tuning. | |
1:21:33.073 --> 1:21:42.980 | |
In this case, yes, it is like modeling | |
also the uncertainty there. | |
1:21:49.169 --> 1:21:56.265 | |
The last thing: so far we have always considered | |
one model. | |
1:21:56.265 --> 1:22:04.084 | |
It's also sometimes helpful to not only | |
look at one model but at several. | |
1:22:04.384 --> 1:22:10.453 | |
So in general there are many ways how you | |
can create several models, and with neural models it's even | |
1:22:10.453 --> 1:22:17.370 | |
easier: you can just start from three different random | |
initializations, you get three different models, | |
1:22:17.370 --> 1:22:18.428 | |
and typically they differ a bit. | |
1:22:19.019 --> 1:22:27.299 | |
And then the question is, can we combine their | |
strength into one model and use that then? | |
1:22:29.669 --> 1:22:39.281 | |
And that can be done either online, which is | |
called an ensemble, or the more offline thing, | |
1:22:39.281 --> 1:22:41.549 | |
which is called reranking. | |
1:22:42.462 --> 1:22:52.800 | |
So the idea is, for example, an ensemble that | |
you combine different initializations. | |
1:22:52.800 --> 1:23:02.043 | |
Of course, you can also do other things like | |
having different architecture. | |
1:23:02.222 --> 1:23:08.922 | |
But the easiest thing you can always change | |
when generating two models is to use different initializations. | |
1:23:09.209 --> 1:23:24.054 | |
And then the question is how can you combine | |
that? | |
1:23:26.006 --> 1:23:34.245 | |
And the easiest thing, as said, is the ensemble | |
of models. | |
1:23:34.245 --> 1:23:39.488 | |
What you mainly do is work in parallel. | |
1:23:39.488 --> 1:23:43.833 | |
You decode with all of the models. | |
1:23:44.444 --> 1:23:59.084 | |
So each model gives a probability of the output, | |
and you can join them into a joint one by just summing | |
1:23:59.084 --> 1:24:04.126 | |
up over your K models again. | |
1:24:04.084 --> 1:24:10.374 | |
So you still have a probability distribution, | |
but you are not taking only one model's output here, | |
1:24:10.374 --> 1:24:10.719 | |
but all of them. | |
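One decoding step of such an ensemble can be sketched as follows; this is an editor's toy example in Python, where two hand-written dictionaries stand in for the models' softmax outputs over a shared vocabulary.

```python
def ensemble_step(distributions):
    # Combine the next-token distributions of K models into one
    # by averaging the probabilities (uniform weight per model).
    # As noted in the lecture, all models must share the same
    # output vocabulary for this to be well defined.
    k = len(distributions)
    return {tok: sum(d[tok] for d in distributions) / k
            for tok in distributions[0]}

# Two toy "models" over a tiny shared vocabulary:
m1 = {"cat": 0.6, "dog": 0.3, "<eos>": 0.1}
m2 = {"cat": 0.2, "dog": 0.7, "<eos>": 0.1}
joint = ensemble_step([m1, m2])

assert abs(sum(joint.values()) - 1.0) < 1e-9   # still a distribution
assert max(joint, key=joint.get) == "dog"      # 0.5 beats 0.4 for "cat"
```

Note how the ensemble can pick "dog" even though model 1 alone would have picked "cat": the combined distribution reflects both models' evidence.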
1:24:11.491 --> 1:24:20.049 | |
So that's how you can easily combine different | |
models, and the nice thing is it typically | |
1:24:20.049 --> 1:24:20.715 | |
works. | |
1:24:21.141 --> 1:24:27.487 | |
You get additional improvement with only more | |
computation but not more human work. | |
1:24:27.487 --> 1:24:33.753 | |
You just do the same thing four times and you're | |
getting a better performance. | |
1:24:33.793 --> 1:24:41.623 | |
Compared to bigger models, with more layers | |
and so on, the disadvantage is of course that you need | |
1:24:41.623 --> 1:24:46.272 | |
all the models jointly during decoding, at | |
inference time. | |
1:24:46.272 --> 1:24:52.634 | |
There you have to load the models in parallel | |
because you have to do your search. | |
1:24:52.672 --> 1:24:57.557 | |
Normally there are more memory resources for | |
training than you have for inference. | |
1:25:00.000 --> 1:25:12.637 | |
You have to train four models and the decoding | |
speed is also slower because you need to decode | |
1:25:12.637 --> 1:25:14.367 | |
four models. | |
1:25:14.874 --> 1:25:25.670 | |
There is one other very important thing and | |
the models have to be very similar, at least | |
1:25:25.670 --> 1:25:27.368 | |
in some ways. | |
1:25:27.887 --> 1:25:28.506 | |
Of course. | |
1:25:28.506 --> 1:25:34.611 | |
You can only combine them if you have | |
the same vocabulary, because you are just summing. | |
1:25:34.874 --> 1:25:43.110 | |
So just imagine you have two different vocabulary | |
sizes, because you want to compare them, or a character- | |
1:25:43.110 --> 1:25:44.273 | |
based model. | |
1:25:44.724 --> 1:25:53.327 | |
That's at least not easily possible here, because | |
one output here would be a word and for the | |
1:25:53.327 --> 1:25:56.406 | |
other one you would have to sum over its units. | |
1:25:56.636 --> 1:26:07.324 | |
So this ensemble typically only works if you | |
have the same output vocabulary. | |
1:26:07.707 --> 1:26:16.636 | |
Your input can be different, because that is | |
only encoded once. | |
1:26:16.636 --> 1:26:23.752 | |
Your output vocabulary, however, has to be the | |
same, otherwise it doesn't work. | |
1:26:27.507 --> 1:26:41.522 | |
There's even a surprising effect of improving | |
your performance and it's again some kind of | |
1:26:41.522 --> 1:26:43.217 | |
smoothing. | |
1:26:43.483 --> 1:26:52.122 | |
So normally during training what we are doing | |
is we can save the checkpoints after each epoch. | |
1:26:52.412 --> 1:27:01.774 | |
And you have this type of curve where your | |
validation error normally should go down, and | |
1:27:01.774 --> 1:27:09.874 | |
if you do early stopping it means that at the | |
end you select not the last checkpoint but the lowest. | |
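The early-stopping selection, keeping the checkpoint with the lowest validation loss rather than simply the last one, can be sketched in a few lines of Python; the loss values here are made up for illustration.

```python
def select_checkpoint(val_losses):
    # Early stopping: choose the epoch whose validation loss is
    # lowest, not simply the last epoch that was trained.
    return min(range(len(val_losses)), key=val_losses.__getitem__)

# A made-up validation-loss curve over six epochs:
losses = [2.1, 1.7, 1.5, 1.4, 1.45, 1.6]
assert select_checkpoint(losses) == 3   # epoch 3, loss 1.4, not epoch 5
```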
1:27:11.571 --> 1:27:21.467 | |
However, some type of smoothing happens there again. | |
1:27:21.467 --> 1:27:31.157 | |
Sometimes what you can do is take an ensemble of checkpoints. | |
1:27:31.491 --> 1:27:38.798 | |
Each alone is not as good, but you still have four | |
different models, and they give you a little gain. | |
1:27:39.259 --> 1:27:42.212 | |
So. | |
1:27:43.723 --> 1:27:48.340 | |
It's somehow helping you because they're | |
supposed to be somewhat different, you know. | |
1:27:49.489 --> 1:27:53.812 | |
Oh, I didn't do that; so that is a checkpoint | |
ensemble. | |
1:27:53.812 --> 1:27:59.117 | |
There is one interesting thing, which is even | |
faster. | |
1:27:59.419 --> 1:28:12.255 | |
Normally it gives you better performance, | |
because this might again be like a smoothed | |
1:28:12.255 --> 1:28:13.697 | |
ensemble. | |
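A hedged sketch of this faster alternative, averaging the saved checkpoints' parameters into a single model, in plain Python; small lists stand in for weight tensors here (in practice one would average framework tensors, e.g. the entries of PyTorch state dicts).

```python
def average_checkpoints(state_dicts):
    # Element-wise average of the parameters of several saved
    # checkpoints. Unlike an ensemble, the result is one single
    # model, so decoding costs the same as with one checkpoint.
    k = len(state_dicts)
    return {
        name: [sum(vals) / k
               for vals in zip(*(sd[name] for sd in state_dicts))]
        for name in state_dicts[0]
    }

# Toy "checkpoints", each with one flat weight vector:
ckpt_a = {"w": [0.0, 2.0]}
ckpt_b = {"w": [2.0, 4.0]}
avg = average_checkpoints([ckpt_a, ckpt_b])
assert avg["w"] == [1.0, 3.0]
```

This is why it is faster than a checkpoint ensemble: there is only one set of weights to load and one forward pass per decoding step.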
1:28:16.736 --> 1:28:22.364 | |
Of course, there are also some problems with | |
this, as I said. | |
1:28:22.364 --> 1:28:30.022 | |
For example, maybe you want to combine different | |
word representations, such as character-based ones. | |
1:28:30.590 --> 1:28:37.189 | |
Or you want to do right-to-left decoding: you | |
normally decode like "I go home", but then your translation | |
1:28:37.189 --> 1:28:39.613 | |
depends only on the previous words. | |
1:28:39.613 --> 1:28:45.942 | |
If you want to model the future context, you could | |
do the inverse direction and generate the target | |
1:28:45.942 --> 1:28:47.895 | |
sentence from right to left. | |
1:28:48.728 --> 1:28:50.839 | |
But it's not easy to combine these things. | |
1:28:51.571 --> 1:28:56.976 | |
In order to do this, or what is also sometimes | |
interesting, is doing inverse translation. | |
1:28:57.637 --> 1:29:07.841 | |
You can combine these types of models; more | |
on that in the next lecture. | |
1:29:07.841 --> 1:29:13.963 | |
That is something we can still do there. | |
1:29:14.494 --> 1:29:29.593 | |
What you should remember from today is how | |
search works. Do you have any final questions? | |
1:29:33.773 --> 1:29:43.393 | |
Then I wish you a happy holiday for next week; | |
then on Monday there is another practical, | |
1:29:43.393 --> 1:29:50.958 | |
and then Thursday in two weeks we'll have | |
the next lecture. | |