WEBVTT | |
0:00:01.921 --> 0:00:16.424 | |
Hey, welcome to today's lecture. What we want to
look at today is how we can make neural machine translation more efficient.
0:00:16.796 --> 0:00:26.458 | |
So until now we have built the whole system, the
encoder and the decoder mostly, and we haven't
0:00:26.458 --> 0:00:29.714
really thought about how long it takes or how big it is.
0:00:30.170 --> 0:00:42.684 | |
And what we, for example, know is that you
can make the systems bigger in different ways.
0:00:42.684 --> 0:00:47.084
We can make them deeper or wider.
0:00:47.407 --> 0:00:56.331 | |
And if we have at least enough data, that typically
helps: bigger models perform better.
0:00:56.576 --> 0:01:00.620 | |
But of course leads to problems that we need | |
more resources. | |
0:01:00.620 --> 0:01:06.587 | |
That is a problem at universities where we | |
have typically limited computation capacities. | |
0:01:06.587 --> 0:01:11.757 | |
So at some point you have such big models | |
that you cannot train them anymore. | |
0:01:13.033 --> 0:01:23.792 | |
And also for companies is of course important | |
if it costs you like to generate translation | |
0:01:23.792 --> 0:01:26.984 | |
just by power consumption. | |
0:01:27.667 --> 0:01:35.386 | |
So yeah, there's different reasons why you | |
want to do efficient machine translation. | |
0:01:36.436 --> 0:01:48.338 | |
One reason is there are different ways of | |
how you can improve your machine translation | |
0:01:48.338 --> 0:01:50.527 | |
system once we. | |
0:01:50.670 --> 0:01:55.694 | |
There can be different types of data we looked | |
into data crawling, monolingual data. | |
0:01:55.875 --> 0:01:59.024 | |
All this data and the aim is always. | |
0:01:59.099 --> 0:02:05.735 | |
Of course, we are not just purely interested | |
in having more data, but the idea why we want | |
0:02:05.735 --> 0:02:12.299 | |
to have more data is that more data also means | |
that we have better quality because mostly | |
0:02:12.299 --> 0:02:17.550 | |
we are interested in increasing the quality | |
of the machine translation. | |
0:02:18.838 --> 0:02:24.892 | |
But there's also other ways of how you can | |
improve the quality of a machine translation. | |
0:02:25.325 --> 0:02:36.450 | |
And what is, of course, that is where most | |
research is focusing on. | |
0:02:36.450 --> 0:02:44.467 | |
It means all we want to build better algorithms. | |
0:02:44.684 --> 0:02:48.199 | |
Course: The other things are normally as good. | |
0:02:48.199 --> 0:02:54.631 | |
Sometimes it's easier to improve, so often | |
it's easier to just collect more data than | |
0:02:54.631 --> 0:02:57.473 | |
to invent some great view algorithms. | |
0:02:57.473 --> 0:03:00.315 | |
But yeah, both of them are important. | |
0:03:00.920 --> 0:03:09.812 | |
But there is this third thing, especially
with neural machine translation, and that means
0:03:09.812 --> 0:03:11.590
we make a bigger model.
0:03:11.751 --> 0:03:16.510 | |
Can be, as said, that we have more layers, | |
that we have wider layers. | |
0:03:16.510 --> 0:03:19.977 | |
The other thing we talked a bit about is ensemble. | |
0:03:19.977 --> 0:03:24.532 | |
That means we are not building one new machine | |
translation system. | |
0:03:24.965 --> 0:03:27.505 | |
And we can easily build four. | |
0:03:27.505 --> 0:03:32.331 | |
What is the typical strategy to build different | |
systems? | |
0:03:32.331 --> 0:03:33.177 | |
Remember. | |
0:03:35.795 --> 0:03:40.119 | |
It should be of course a bit different if | |
you have the same. | |
0:03:40.119 --> 0:03:44.585 | |
If they all predict the same then combining | |
them doesn't help. | |
0:03:44.585 --> 0:03:48.979 | |
So what is the easiest way if you have to | |
build four systems? | |
0:03:51.711 --> 0:04:01.747 | |
And the Charleston's will take, but this is | |
the best output of a single system. | |
0:04:02.362 --> 0:04:10.165 | |
I mean building really three different systems
so that you later can combine them and maybe
0:04:10.165 --> 0:04:11.280
average them.
0:04:11.280 --> 0:04:16.682
Ensembles typically work by averaging
all the probabilities.
0:04:19.439 --> 0:04:24.227 | |
The idea is to think about neural networks.
0:04:24.227 --> 0:04:29.342
There's one parameter which you can easily adjust: the random seed of the initialization.
0:04:29.342 --> 0:04:36.525
That's exactly the easiest way: you train
with three different random seeds.
0:04:37.017 --> 0:04:43.119 | |
They have the same architecture, so all the
hyperparameters are the same, but they are
0:04:43.119 --> 0:04:43.891
initialized differently.
0:04:43.891 --> 0:04:46.556
They will have different predictions.
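As a small aside, a minimal sketch of what this ensembling by averaging probabilities could look like in code; `next_word_distribution` is a hypothetical interface standing in for whatever your models expose, not something from the lecture.

```python
import torch

def ensemble_next_word_probs(models, src, prefix):
    # Each model was trained with a different random seed; we average
    # their output distributions over the vocabulary for the next word.
    # next_word_distribution() is a hypothetical method returning a
    # probability vector of shape (vocab_size,).
    probs = [m.next_word_distribution(src, prefix) for m in models]
    return torch.stack(probs).mean(dim=0)
```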
0:04:48.228 --> 0:04:52.572 | |
So, of course, bigger amounts. | |
0:04:52.572 --> 0:05:05.325 | |
Some of these are a bit the easiest way of | |
improving your quality because you don't really | |
0:05:05.325 --> 0:05:08.268 | |
have to do anything. | |
0:05:08.588 --> 0:05:12.588 | |
There are limits on that: bigger models only
get better
0:05:12.588 --> 0:05:19.132
if you have enough training data. You can't
just stack more and more layers; it will not work
0:05:19.132 --> 0:05:24.877
on very small data, but with a reasonable amount
of data that is the easiest thing.
0:05:25.305 --> 0:05:33.726 | |
However, there is a challenge with making
better, bigger models, and that is the
0:05:33.726 --> 0:05:34.970
computation.
0:05:35.175 --> 0:05:44.482 | |
So, of course, if you have a bigger model | |
that can mean that you have longer running | |
0:05:44.482 --> 0:05:49.518 | |
times, if you have models, you have to times. | |
0:05:51.171 --> 0:05:56.685 | |
Normally you cannot parallelize across the different
layers because the input to one layer is always
0:05:56.685 --> 0:06:02.442
the output of the previous layer, so you propagate
through them, and that will also increase your runtime.
0:06:02.822 --> 0:06:10.720 | |
Then you have to store the whole model in
memory.
0:06:10.720 --> 0:06:20.927
If you double the weights you will need double the memory.
It is also more difficult to then do back propagation:
0:06:20.927 --> 0:06:27.680
you have to store the activations in between,
so not only do you increase the model
0:06:27.680 --> 0:06:31.865
in your memory, but also all these other variables
that you need during training.
0:06:34.414 --> 0:06:36.734 | |
And so in general it is more expensive. | |
0:06:37.137 --> 0:06:54.208 | |
And therefore there's good reason to look
into whether we can make these models somehow more efficient.
0:06:54.134 --> 0:07:00.982
So you can view it as: okay, I have
one GPU and one day of training time,
0:07:00.982 --> 0:07:01.274
or
0:07:01.221 --> 0:07:07.535
forty thousand euros, and then what is the
best machine translation system I can get within
0:07:07.535 --> 0:07:08.437
this budget.
0:07:08.969 --> 0:07:19.085 | |
And then, of course, you can make the models | |
bigger, but then you have to train them shorter, | |
0:07:19.085 --> 0:07:24.251 | |
and then we can make more efficient algorithms. | |
0:07:25.925 --> 0:07:31.699 | |
If you think about efficiency, there's a bit | |
different scenarios. | |
0:07:32.312 --> 0:07:43.635 | |
So if you're more of coming from the research | |
community, what you'll be doing is building | |
0:07:43.635 --> 0:07:47.913 | |
a lot of models in your research. | |
0:07:48.088 --> 0:07:58.645 | |
So you're having your test set of maybe sentences, | |
calculating the blue score, then another model. | |
0:07:58.818 --> 0:08:08.911 | |
So what that means is typically you're training
on millions of sentences, so your training time
0:08:08.911 --> 0:08:14.944
is long, maybe a day, but maybe in other cases
a week.
0:08:15.135 --> 0:08:22.860 | |
The testing is not really the costly part,
but the training is very costly.
0:08:23.443 --> 0:08:37.830 | |
If you are more thinking of building models | |
for application, the scenario is quite different. | |
0:08:38.038 --> 0:08:46.603 | |
And then you keep it running, and maybe thousands | |
of customers are using it in translating. | |
0:08:46.603 --> 0:08:47.720 | |
So in that. | |
0:08:48.168 --> 0:08:59.577 | |
And we will see that these are not always the
same type of challenges: you can parallelize some
0:08:59.577 --> 0:09:07.096
things in training which you cannot parallelize
in testing.
0:09:07.347 --> 0:09:14.124 | |
For example, in training you have to do back | |
propagation, so you have to store the activations. | |
0:09:14.394 --> 0:09:23.901 | |
In testing we don't need that. We briefly discussed
this before and will do it in more detail today: in
0:09:23.901 --> 0:09:24.994
training
0:09:25.265 --> 0:09:36.100
you know the target and you can process
everything in parallel, while in testing
0:09:36.356 --> 0:09:46.741
you can only do one word at a time, and
so you can parallelize less.
0:09:46.741 --> 0:09:50.530
Therefore, it's important to look at both scenarios.
0:09:52.712 --> 0:09:55.347 | |
There is a specific shared task on this:
0:09:55.347 --> 0:10:03.157
the efficiency task, where
it's about making things as efficient
0:10:03.123 --> 0:10:09.230
as possible, and they look at different
resources.
0:10:09.230 --> 0:10:14.207
So how much GPU run time do you need?
0:10:14.454 --> 0:10:19.366 | |
See how much memory you need or you can have | |
a fixed memory budget and then have to build | |
0:10:19.366 --> 0:10:20.294 | |
the best system. | |
0:10:20.500 --> 0:10:29.010 | |
And here is a bit like an example of that, | |
so there's three teams from Edinburgh from | |
0:10:29.010 --> 0:10:30.989 | |
and they submitted. | |
0:10:31.131 --> 0:10:36.278 | |
So then, of course, if you want to know the | |
most efficient system you have to do a bit | |
0:10:36.278 --> 0:10:36.515 | |
of. | |
0:10:36.776 --> 0:10:44.656 | |
You want to have a better quality or more | |
runtime and there's not the one solution. | |
0:10:44.656 --> 0:10:46.720 | |
You can improve your. | |
0:10:46.946 --> 0:10:49.662 | |
And that you see that there are different | |
systems. | |
0:10:49.909 --> 0:11:06.051 | |
Here is how many words you can do for a second | |
on the clock, and you want to be as talk as | |
0:11:06.051 --> 0:11:07.824 | |
possible. | |
0:11:08.068 --> 0:11:08.889 | |
And you see here a bit. | |
0:11:08.889 --> 0:11:09.984 | |
This is a little bit different. | |
0:11:11.051 --> 0:11:27.717 | |
You want to be there in the top right corner,
and you can see the trade-off between score and
0:11:27.717 --> 0:11:29.014
words per second:
0:11:30.250 --> 0:11:34.161
at two hundred and fifty thousand words per second you'll
get a score of around zero point three.
0:11:34.834 --> 0:11:41.243 | |
There is, of course, any bit of a decision, | |
but the question is, like how far can you again? | |
0:11:41.243 --> 0:11:47.789 | |
Some of all these points on this line would | |
be winners because they are somehow most efficient | |
0:11:47.789 --> 0:11:53.922 | |
in a way that there's no system which achieves | |
the same quality with less computational. | |
0:11:57.657 --> 0:12:04.131 | |
So there's the one question of which resources | |
are you interested. | |
0:12:04.131 --> 0:12:07.416 | |
Are you running it on CPU or GPU? | |
0:12:07.416 --> 0:12:11.668 | |
There are different ways of parallelizing things.
0:12:14.654 --> 0:12:20.777 | |
Another dimension is how you process your
data.
0:12:20.777 --> 0:12:27.154
There's mainly batch processing and streaming.
0:12:27.647 --> 0:12:34.672 | |
So in batch processing you have the whole
document available, so you can translate all
0:12:34.672 --> 0:12:39.981
sentences in parallel, and then you're interested
in throughput.
0:12:40.000 --> 0:12:43.844 | |
You can then process many sentences at once, which, especially
on GPUs,
0:12:43.844 --> 0:12:49.810
is interesting: you're not translating
one sentence at a time, but you're translating
0:12:49.810 --> 0:12:56.108
one hundred sentences or so in parallel, so
you have one more dimension where you can parallelize
0:12:56.108 --> 0:12:57.964
and then be more efficient.
0:12:58.558 --> 0:13:14.863 | |
On the other hand, you can, for example, sort the document,
because we learned that if you do batch processing
0:13:14.863 --> 0:13:16.544
you have padding.
0:13:16.636 --> 0:13:24.636
Then, of course, it makes sense to sort the
sentences by length in order to have the minimum padding
0:13:24.636 --> 0:13:25.535
attached.
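As a small illustration of this point (my sketch, not from the slides), sorting by length before batching could look like this; the tokenised `sentences` and `batch_size` are assumed inputs.

```python
def make_length_sorted_batches(sentences, batch_size):
    # Sort sentence indices by length so each batch contains sentences of
    # similar length and little computation is wasted on padding.
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

# After translating, the outputs are put back into the original document order.
```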
0:13:27.427 --> 0:13:32.150 | |
The other scenario is more the streaming scenario | |
where you do life translation. | |
0:13:32.512 --> 0:13:40.212 | |
So in that case you can't wait for the whole | |
document to pass, but you have to do. | |
0:13:40.520 --> 0:13:49.529 | |
And then, for example, that's especially in | |
situations like speech translation, and then | |
0:13:49.529 --> 0:13:53.781 | |
you're interested in things like latency. | |
0:13:53.781 --> 0:14:00.361 | |
So how much do you have to wait to get the | |
output of a sentence? | |
0:14:06.566 --> 0:14:16.956 | |
Finally, there is the thing about the implementation: | |
Today we're mainly looking at different algorithms, | |
0:14:16.956 --> 0:14:23.678 | |
different models of how you can model them | |
in your machine translation system, but of | |
0:14:23.678 --> 0:14:29.227 | |
course for the same algorithms there's also | |
different implementations. | |
0:14:29.489 --> 0:14:38.643 | |
So, for example, for machine translation
there are toolkits that are very fast.
0:14:38.638 --> 0:14:46.615
They have coded a lot of the operations
not low resource, but low level,
0:14:46.615 --> 0:14:49.973
directly in the CUDA kernels.
0:14:50.110 --> 0:15:00.948
So the same attention network is typically
more efficient in that type of implementation
0:15:00.880 --> 0:15:02.474
than in any other.
0:15:03.323 --> 0:15:13.105 | |
Of course, it might be other disadvantages, | |
so if you're a little worker or have worked | |
0:15:13.105 --> 0:15:15.106 | |
in the practical. | |
0:15:15.255 --> 0:15:22.604 | |
Because it's normally easier to understand, | |
easier to change, and so on, but there is again | |
0:15:22.604 --> 0:15:23.323 | |
a train. | |
0:15:23.483 --> 0:15:29.440 | |
You have to think about, do you want to include | |
this into my study or comparison or not? | |
0:15:29.440 --> 0:15:36.468 | |
Should it be like I compare different implementations | |
and I also find the most efficient implementation? | |
0:15:36.468 --> 0:15:39.145 | |
Or is it only about the pure algorithm? | |
0:15:42.742 --> 0:15:50.355 | |
Yeah, when building these systems there is | |
a different trade-off to do. | |
0:15:50.850 --> 0:15:56.555 | |
So one trade-off is between memory
and throughput, that is how many words you can generate
0:15:56.555 --> 0:15:57.299
per second.
0:15:57.557 --> 0:16:03.351
So typically you can easily increase
your throughput by increasing the batch size.
0:16:03.643 --> 0:16:06.899 | |
So that means you are translating more sentences | |
in parallel. | |
0:16:07.107 --> 0:16:09.241 | |
And GPUs are very good at that stuff.
0:16:09.349 --> 0:16:15.161 | |
Whether it translates one sentence or one hundred
sentences does not take the same time, but it is
0:16:15.115 --> 0:16:20.784
roughly similar, because GPUs are
efficient at matrix multiplication, so that
0:16:20.784 --> 0:16:24.415
you can do the same operation on all sentences in
parallel.
0:16:24.415 --> 0:16:30.148
So typically that means if you increase your
batch size you can do more things in parallel
0:16:30.148 --> 0:16:31.995
and you will translate more words
0:16:31.952 --> 0:16:33.370
per second.
0:16:33.653 --> 0:16:43.312 | |
On the other hand, the disadvantage is of
course that you will need higher batch sizes and
0:16:43.312 --> 0:16:44.755
more memory.
0:16:44.965 --> 0:16:56.452 | |
To begin with, the other problem is that you may
have such big models that you can only translate
0:16:56.452 --> 0:16:59.141
with lower batch sizes.
0:16:59.119 --> 0:17:08.466
If you are running out of memory while translating,
one way to deal with that is to decrease your batch size.
0:17:13.453 --> 0:17:24.456 | |
Then there is the trade-off between quality and throughput:
as said before, larger models mean lower throughput
0:17:24.456 --> 0:17:28.124
but in general higher quality.
0:17:28.124 --> 0:17:31.902
The first part always holds.
0:17:32.092 --> 0:17:38.709
Of course, a larger model does not always help, you can
have overfitting at some point, but in general it does.
0:17:43.883 --> 0:17:52.901 | |
And with this a bit on this training and testing | |
thing we had before. | |
0:17:53.113 --> 0:17:58.455 | |
So it wears all the difference between training | |
and testing, and for the encoder and decoder. | |
0:17:58.798 --> 0:18:06.992 | |
So if we are looking at what mentioned before | |
at training time, we have a source sentence | |
0:18:06.992 --> 0:18:17.183 | |
here: And how this is processed on a is not | |
the attention here. | |
0:18:17.183 --> 0:18:21.836 | |
That's a tubical transformer. | |
0:18:22.162 --> 0:18:31.626 | |
And what we can do here is
parallelize over the source.
0:18:31.626 --> 0:18:40.422
The first thing to note is that we have the full
source sentence available; that is, of course, not true in all cases.
0:18:40.422 --> 0:18:49.184
We'll later talk about speech translation,
where we might want to translate before the input is complete.
0:18:49.389 --> 0:18:56.172 | |
Without the general case in, it's like you | |
have the full sentence you want to translate. | |
0:18:56.416 --> 0:19:02.053 | |
So the important thing is we are here everything | |
available on the source side. | |
0:19:03.323 --> 0:19:13.524 | |
And then this was one of the big advantages | |
that you can remember back of transformer. | |
0:19:13.524 --> 0:19:15.752 | |
There are several. | |
0:19:16.156 --> 0:19:25.229 | |
But the other one is now that we can calculate | |
the full layer. | |
0:19:25.645 --> 0:19:29.318 | |
There is no dependency between this and this | |
state or this and this state. | |
0:19:29.749 --> 0:19:36.662 | |
So we always did like here to calculate the | |
key value and query, and based on that you | |
0:19:36.662 --> 0:19:37.536 | |
calculate. | |
0:19:37.937 --> 0:19:46.616 | |
Which means we can do all these calculations
here in parallel.
0:19:48.028 --> 0:19:55.967
And that, of course, is very efficient,
because again for GPUs it's typically much faster
0:19:55.967 --> 0:20:00.887
to do these things in parallel than one after
each other.
0:20:01.421 --> 0:20:10.311
And then we go through the layers one by
one, and we calculate here the whole encoder.
0:20:10.790 --> 0:20:21.921 | |
In training now an important thing is that | |
for the decoder we have the full sentence available | |
0:20:21.921 --> 0:20:28.365 | |
because we know this is the target we should | |
generate. | |
0:20:29.649 --> 0:20:33.526 | |
We have models now in a different way. | |
0:20:33.526 --> 0:20:38.297 | |
This hidden state is only on the previous | |
ones. | |
0:20:38.598 --> 0:20:51.887 | |
And the first thing here depends only on this | |
information, so you see if you remember we | |
0:20:51.887 --> 0:20:56.665 | |
had this masked self-attention. | |
0:20:56.896 --> 0:21:04.117 | |
So that means, of course, we can only calculate | |
the decoder once the encoder is done, but that's. | |
0:21:04.444 --> 0:21:06.656 | |
Percent can calculate the end quarter. | |
0:21:06.656 --> 0:21:08.925 | |
Then we can calculate here the decoder. | |
0:21:09.569 --> 0:21:25.566 | |
But again in training we have x, y and that | |
is available so we can calculate everything | |
0:21:25.566 --> 0:21:27.929 | |
in parallel. | |
0:21:28.368 --> 0:21:40.941 | |
So the interesting thing, the advantage of the transformer,
is that in training
0:21:40.941 --> 0:21:46.408
we can do this parallelization also for the decoder.
0:21:46.866 --> 0:21:54.457
You still have sequential calculations
because you can only calculate one layer at
0:21:54.457 --> 0:22:02.310
a time, but for example the sentence length, which is
typically quite long, doesn't really matter
0:22:02.310 --> 0:22:03.270
that much.
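A minimal sketch of the masked self-attention that makes this possible (my illustration, not code from the lecture): position i may only attend to positions up to i, so the whole known target can be fed in at once during training.

```python
import torch

def causal_mask(t):
    # True where attention is allowed: lower triangle (no look-ahead).
    return torch.tril(torch.ones(t, t, dtype=torch.bool))

scores = torch.randn(5, 5)                                   # toy attention logits
masked = scores.masked_fill(~causal_mask(5), float("-inf"))  # hide future positions
weights = masked.softmax(dim=-1)                             # row i ignores positions j > i
```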
0:22:05.665 --> 0:22:10.704 | |
However, in testing the situation is different.
0:22:10.704 --> 0:22:13.276
In testing we only have the source.
0:22:13.713 --> 0:22:20.622
So this means we start with a source sentence; we don't
know the full target sentence yet because we have
0:22:20.622 --> 0:22:29.063
to autoregressively generate it, so for the encoder
we have the same situation here, but not for the decoder.
0:22:29.409 --> 0:22:39.598 | |
In this case we first only have the first, then the
second state; we cannot compute all states in
0:22:39.598 --> 0:22:40.756
parallel.
0:22:41.101 --> 0:22:51.752
Only then can we do the next step for y, because
we feed in our most probable word:
0:22:51.752 --> 0:22:58.643
we do greedy search or beam search, but you
cannot do everything at once.
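For contrast with the training case above, a sketch of this word-by-word test-time loop (my illustration; `model.step` is a hypothetical interface):

```python
def greedy_decode(model, src, bos_id, eos_id, max_len=100):
    # At test time the target is unknown, so we generate one word per step,
    # feeding all previously generated words back into the decoder.
    # model.step() is a hypothetical call returning a vocabulary distribution.
    ys = [bos_id]
    for _ in range(max_len):
        probs = model.step(src, ys)
        y = int(probs.argmax())      # greedy: take the most probable word
        ys.append(y)
        if y == eos_id:
            break
    return ys
```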
0:23:03.663 --> 0:23:16.838 | |
Yes, so if we are interesting in making things | |
more efficient for testing, which we see, for | |
0:23:16.838 --> 0:23:22.363 | |
example in the scenario of really our. | |
0:23:22.642 --> 0:23:34.286 | |
It makes sense that we think about our architecture | |
and that we are currently working on attention | |
0:23:34.286 --> 0:23:35.933 | |
based models. | |
0:23:36.096 --> 0:23:44.150 | |
The decoder there is some of the most time | |
spent testing and testing. | |
0:23:44.150 --> 0:23:47.142 | |
It's similar, but during. | |
0:23:47.167 --> 0:23:50.248 | |
Nothing about beam search. | |
0:23:50.248 --> 0:23:59.833 | |
It might be even more complicated because | |
in beam search you have to try different. | |
0:24:02.762 --> 0:24:15.140 | |
So the question is what can you now do in | |
order to make your model more efficient and | |
0:24:15.140 --> 0:24:21.905 | |
better in translation in these types of cases? | |
0:24:24.604 --> 0:24:30.178 | |
And the one thing is to look into the encoder-
decoder trade-off.
0:24:30.690 --> 0:24:43.898 | |
And then until now we typically assume that | |
the depth of the encoder and the depth of the | |
0:24:43.898 --> 0:24:48.154 | |
decoder is roughly the same. | |
0:24:48.268 --> 0:24:55.553 | |
So if you haven't thought about it, you just | |
take what is running well. | |
0:24:55.553 --> 0:24:57.678 | |
You would try to do. | |
0:24:58.018 --> 0:25:04.148 | |
However, we saw now that there is a quite | |
big challenge and the runtime is a lot longer | |
0:25:04.148 --> 0:25:04.914 | |
than here. | |
0:25:05.425 --> 0:25:14.018 | |
The question is also the case for the calculations, | |
or do we have there the same issue that we | |
0:25:14.018 --> 0:25:21.887 | |
only get the good quality if we are having | |
high and high, so we know that making these | |
0:25:21.887 --> 0:25:25.415 | |
more depths is increasing our quality. | |
0:25:25.425 --> 0:25:31.920 | |
But what we haven't talked about is really | |
important that we increase the depth the same | |
0:25:31.920 --> 0:25:32.285 | |
way. | |
0:25:32.552 --> 0:25:41.815 | |
So what we can instead also do is something
like this, where you have a deep encoder and
0:25:41.815 --> 0:25:42.923
a shallow decoder.
0:25:43.163 --> 0:25:57.386
That would mean that, for example, instead of
having the same number of layers on the encoder and
0:25:57.386 --> 0:25:59.757
on the decoder, you put more layers on the encoder and fewer on the decoder.
0:26:00.080 --> 0:26:10.469 | |
So in this case the overall depth from start | |
to end would be similar and so hopefully. | |
0:26:11.471 --> 0:26:21.662 | |
But we can parallelize a lot more things here,
and what is costly at the end during decoding is
0:26:21.662 --> 0:26:22.973
the decoder.
0:26:22.973 --> 0:26:29.330
Because that runs in an autoregressive
way, there we now have fewer layers.
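As a sketch of what such an unbalanced configuration could look like (the layer counts are just example values I chose, not the ones from the experiments discussed next):

```python
import torch.nn as nn

# Deep encoder (parallel over the source) and shallow decoder
# (the autoregressive, expensive part at test time).
deep_shallow = nn.Transformer(
    d_model=512, nhead=8,
    num_encoder_layers=12,  # example value
    num_decoder_layers=1,   # example value
)
```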
0:26:31.411 --> 0:26:33.727 | |
And that that can be analyzed. | |
0:26:33.727 --> 0:26:38.734 | |
So here is some examples: Where people have | |
done all this. | |
0:26:39.019 --> 0:26:55.710 | |
So here it's mainly interested on the orange | |
things, which is auto-regressive about the | |
0:26:55.710 --> 0:26:57.607 | |
speed up. | |
0:26:57.717 --> 0:27:15.031 | |
You have the system, so agree is not exactly | |
the same, but it's similar. | |
0:27:15.055 --> 0:27:23.004 | |
It's always the case if you look at speed | |
up. | |
0:27:23.004 --> 0:27:31.644 | |
Think they put a speed of so that's the baseline. | |
0:27:31.771 --> 0:27:35.348 | |
So between and times as fast. | |
0:27:35.348 --> 0:27:42.621 | |
If you switch from a system to where you have | |
layers in the. | |
0:27:42.782 --> 0:27:52.309 | |
You see that although you have slightly more | |
parameters, more calculations are also roughly | |
0:27:52.309 --> 0:28:00.283 | |
the same, but you can speed out because now | |
during testing you can paralyze. | |
0:28:02.182 --> 0:28:09.754 | |
The other thing is that you're speeding up, | |
but if you look at the performance it's similar, | |
0:28:09.754 --> 0:28:13.500 | |
so sometimes you improve, sometimes you lose. | |
0:28:13.500 --> 0:28:20.421 | |
There's a bit of losing English to Romania, | |
but in general the quality is very slow. | |
0:28:20.680 --> 0:28:30.343 | |
So you see that you can keep a similar performance | |
while improving your speed by just having different. | |
0:28:30.470 --> 0:28:34.903 | |
And you also see that the encoder layers, for speed,
0:28:34.903 --> 0:28:38.136
don't really matter that much.
0:28:38.136 --> 0:28:38.690
Mostly the decoder does.
0:28:38.979 --> 0:28:50.319
Because if you compare the 12-encoder-layer system to
the 6-encoder-layer system you have a lower performance
0:28:50.319 --> 0:28:57.309
with 6 encoder layers, but the speed is
similar.
0:28:57.897 --> 0:29:02.233 | |
And see the huge decrease is it maybe due | |
to a lack of data. | |
0:29:03.743 --> 0:29:11.899 | |
Good idea would say it's not the case. | |
0:29:11.899 --> 0:29:23.191 | |
Romanian English should have the same number | |
of data. | |
0:29:24.224 --> 0:29:31.184 | |
Maybe it's just that something in that language. | |
0:29:31.184 --> 0:29:40.702 | |
If you generate Romanian maybe they need more | |
target dependencies. | |
0:29:42.882 --> 0:29:46.263 | |
The Wine's the Eye Also Don't Know Any Sex | |
People Want To. | |
0:29:47.887 --> 0:29:49.034 | |
There could be yeah the. | |
0:29:49.889 --> 0:29:58.962 | |
As the maybe if you go from like a movie sphere | |
to a hybrid sphere, you can: It's very much | |
0:29:58.962 --> 0:30:12.492 | |
easier to expand the vocabulary to English, | |
but it must be the vocabulary. | |
0:30:13.333 --> 0:30:21.147 | |
Have to check, but would assume that in this | |
case the system is not retrained, but it's | |
0:30:21.147 --> 0:30:22.391 | |
trained with. | |
0:30:22.902 --> 0:30:30.213 | |
And that's why I was assuming that they have | |
the same, but maybe you'll write that in this | |
0:30:30.213 --> 0:30:35.595 | |
piece, for example, if they were pre-trained, | |
the decoder English. | |
0:30:36.096 --> 0:30:43.733 | |
But don't remember exactly if they do something | |
like that, but that could be a good. | |
0:30:45.325 --> 0:30:52.457 | |
So this is one of the easiest ways to speed
up:
0:30:52.457 --> 0:31:01.443
you just change two hyperparameters and don't have to
implement anything.
0:31:02.722 --> 0:31:08.367 | |
Of course, there's other ways of doing that. | |
0:31:08.367 --> 0:31:11.880 | |
We'll look into two things. | |
0:31:11.880 --> 0:31:16.521 | |
The other thing is the architecture. | |
0:31:16.796 --> 0:31:28.154 | |
The self-attention architecture is the baseline we
are using now.
0:31:28.488 --> 0:31:39.978
However, on the decoder side
it might not be the best solution.
0:31:39.978 --> 0:31:41.845
There is no rule that both sides must be the same.
0:31:42.222 --> 0:31:47.130
So we can use different types of architectures
in the encoder and in the decoder.
0:31:47.747 --> 0:31:52.475 | |
And there's two ways of what you could do | |
different, or there's more ways. | |
0:31:52.912 --> 0:31:54.825 | |
We will look into two todays. | |
0:31:54.825 --> 0:31:58.842 | |
The one is average attention, which is a very | |
simple solution. | |
0:31:59.419 --> 0:32:01.464 | |
You can do as it says. | |
0:32:01.464 --> 0:32:04.577 | |
It's not really attending anymore. | |
0:32:04.577 --> 0:32:08.757 | |
It's just like equal attendance to everything. | |
0:32:09.249 --> 0:32:23.422 | |
And the other idea, which is currently done
in most systems that are optimized for efficiency,
0:32:23.422 --> 0:32:24.913
is that we keep the encoder as it is.
0:32:25.065 --> 0:32:32.623
But on the decoder side we are then not using
transformer or self-attention; we are using a
0:32:32.623 --> 0:32:39.700
recurrent neural network, because there the
disadvantages of recurrent neural networks matter less.
0:32:39.799 --> 0:32:48.353
And the recurrence is normally easier
to calculate because it only depends on the input
0:32:48.353 --> 0:32:49.684
and the previous state.
0:32:51.931 --> 0:33:02.190 | |
So what is the difference in decoding,
and why is attention maybe not optimal
0:33:02.190 --> 0:33:03.841
for decoding?
0:33:04.204 --> 0:33:14.390
In an RNN, if we want to compute the new state, we only
have to look at the input and the previous
0:33:14.390 --> 0:33:15.649
state.
0:33:16.136 --> 0:33:19.029 | |
With convolutional networks
0:33:19.029 --> 0:33:19.994
we have a
0:33:19.980 --> 0:33:31.291
dependency on a fixed number of previous states,
but that's rarely used for decoding.
0:33:31.291 --> 0:33:39.774
In contrast, in the transformer we have this large
dependency on all previous states.
0:33:40.000 --> 0:33:52.760 | |
So from t minus one to y t so that is somehow | |
and mainly not very efficient in this way mean | |
0:33:52.760 --> 0:33:56.053 | |
it's very good because. | |
0:33:56.276 --> 0:34:03.543 | |
However, the disadvantage is that we also | |
have to do all these calculations, so if we | |
0:34:03.543 --> 0:34:10.895 | |
more view from the point of view of efficient | |
calculation, this might not be the best. | |
0:34:11.471 --> 0:34:20.517 | |
So the question is, can we change our architecture | |
to keep some of the advantages but make things | |
0:34:20.517 --> 0:34:21.994 | |
more efficient? | |
0:34:24.284 --> 0:34:31.131 | |
The one idea is what is called average
attention, and the interesting thing is that this
0:34:31.131 --> 0:34:32.610
works surprisingly well.
0:34:33.013 --> 0:34:38.917 | |
So the only thing you change is in
the decoder.
0:34:38.917 --> 0:34:42.646
You're not really doing attention anymore:
0:34:42.646 --> 0:34:46.790
the attention weights are all the same.
0:34:47.027 --> 0:35:00.723 | |
So you don't calculate with query and key | |
the different weights, and then you just take | |
0:35:00.723 --> 0:35:03.058 | |
equal weights. | |
0:35:03.283 --> 0:35:07.585 | |
So here would be one third from this, one | |
third from this, and one third. | |
0:35:09.009 --> 0:35:14.719 | |
And while it is sufficient you can now do | |
precalculation and things get more efficient. | |
0:35:15.195 --> 0:35:18.803 | |
So first go the formula that's maybe not directed | |
here. | |
0:35:18.979 --> 0:35:38.712 | |
So the difference here is that your new hidden
state is the sum of all the previous hidden states, divided by their number.
0:35:38.678 --> 0:35:40.844
So here, for this state,
0:35:40.844 --> 0:35:45.022
it would be one third of this plus one third
of this plus one third of this.
0:35:46.566 --> 0:35:57.162 | |
But if you calculate it this way, it's not | |
yet being more efficient because you still | |
0:35:57.162 --> 0:36:01.844 | |
have to sum over here all the hidden. | |
0:36:04.524 --> 0:36:22.932 | |
But you can now easily speed these things up
by keeping an in-between value, which is just
0:36:22.932 --> 0:36:24.568
the running sum.
0:36:25.585 --> 0:36:30.057 | |
If you take this as ten to one, you take this | |
one class this one. | |
0:36:30.350 --> 0:36:36.739 | |
Because this one then was before this, and | |
this one was this, so in the end. | |
0:36:37.377 --> 0:36:49.545 | |
So now this one is not the final one in order | |
to get the final one to do the average. | |
0:36:49.545 --> 0:36:50.111 | |
So. | |
0:36:50.430 --> 0:37:00.264 | |
But if you do the calculation this way, you get the speed-
up: you can do it with a fixed number of operations per step,
0:37:00.180 --> 0:37:11.300
instead of a sum whose length depends on the position, so
you only have to do a constant number of calculations to get
0:37:11.300 --> 0:37:12.535
this one.
0:37:12.732 --> 0:37:21.718 | |
Can you do the lakes and the lakes? | |
0:37:21.718 --> 0:37:32.701 | |
For example, light bulb here now takes and. | |
0:37:32.993 --> 0:37:38.762 | |
That's a very good point and that's why this | |
is now in the image. | |
0:37:38.762 --> 0:37:44.531 | |
It's not very good so this is the one with | |
tilder and the tilder. | |
0:37:44.884 --> 0:37:57.895 | |
So this one is just the sum of these two, | |
because this is just this one. | |
0:37:58.238 --> 0:38:08.956 | |
So the sum of this is exactly as the sum of | |
these, and the sum of these is the sum of here. | |
0:38:08.956 --> 0:38:15.131 | |
So you only do the sum in here, and the multiplying. | |
0:38:15.255 --> 0:38:22.145 | |
So what you mainly do here is, you can
see it more mathematically:
0:38:22.145 --> 0:38:31.531
you can take the one over t out of the
sum, and then you can calculate the sum incrementally.
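A small sketch of this incremental computation (my own illustration of the idea, not code from the paper): keep a running sum of the decoder inputs and divide by the step count to get the uniform average, so each step costs constant work.

```python
import torch

def average_attention_step(running_sum, y_t, t):
    # s_t = s_{t-1} + y_t ; uniform average over the first t states = s_t / t
    running_sum = running_sum + y_t
    return running_sum, running_sum / t

# usage sketch: constant work per step, independent of the sentence length
s = torch.zeros(8)
for t, y_t in enumerate(torch.randn(5, 8), start=1):
    s, avg = average_attention_step(s, y_t, t)
```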
0:38:36.256 --> 0:38:42.443 | |
That maybe looks a bit weird and simple, so | |
we were all talking about this great attention | |
0:38:42.443 --> 0:38:47.882 | |
that we can focus on different parts, and a | |
bit surprising on this work is now. | |
0:38:47.882 --> 0:38:53.321 | |
In the end it might also work well without | |
really putting and just doing equal. | |
0:38:53.954 --> 0:38:56.164 | |
Mean it's not that easy. | |
0:38:56.376 --> 0:38:58.261 | |
It's like sometimes this is working. | |
0:38:58.261 --> 0:39:00.451 | |
There's also report weight work that well. | |
0:39:01.481 --> 0:39:05.848 | |
But I think it's an interesting result, and it
maybe shows that a lot of
0:39:05.805 --> 0:39:10.624
things in the self-attention or in the transformer paper
which are presented more as side details,
0:39:10.624 --> 0:39:15.930
like some hyperparameters around it,
that you do the layer norm in between,
0:39:15.930 --> 0:39:21.785
that you do a feed forward before, and
things like that, that these are also all important,
0:39:21.785 --> 0:39:25.566
and that the right setup around the attention is also
very important.
0:39:28.969 --> 0:39:38.598 | |
The other thing you can do in the end is not | |
completely different from this one. | |
0:39:38.598 --> 0:39:42.521 | |
It's just like a very different. | |
0:39:42.942 --> 0:39:54.338 | |
And that is a recurrent network which also | |
has this type of highway connection that can | |
0:39:54.338 --> 0:40:01.330 | |
ignore the recurrent unit and directly put | |
the input. | |
0:40:01.561 --> 0:40:10.770 | |
It's not really adding out, but if you see | |
the hitting step is your input, but what you | |
0:40:10.770 --> 0:40:15.480 | |
can do is somehow directly go to the output. | |
0:40:17.077 --> 0:40:28.390 | |
These are the four components of the simple | |
return unit, and the unit is motivated by GIS | |
0:40:28.390 --> 0:40:33.418 | |
and by LCMs, which we have seen before. | |
0:40:33.513 --> 0:40:43.633 | |
And that has proven to be very good for iron | |
ends, which allows you to have a gate on your. | |
0:40:44.164 --> 0:40:48.186 | |
In this thing we have two gates, the reset | |
gate and the forget gate. | |
0:40:48.768 --> 0:40:57.334 | |
So first we have the general structure which | |
has a cell state. | |
0:40:57.334 --> 0:41:01.277 | |
Here we have the cell state. | |
0:41:01.361 --> 0:41:09.661 | |
And then this gets passed on, and we get
the different cell states over the time steps.
0:41:10.030 --> 0:41:11.448
This is the cell state.
0:41:11.771 --> 0:41:16.518
How do we now calculate that? Just assume we
have an initial cell state here.
0:41:17.017 --> 0:41:19.670
The first thing is we're computing the forget
gate.
0:41:20.060 --> 0:41:34.774 | |
The forget gate models whether the new cell
state should mainly depend on the previous cell state
0:41:34.774 --> 0:41:40.065
or whether it should depend on our current input
0:41:40.000 --> 0:41:41.356
that we add to it.
0:41:41.621 --> 0:41:42.877 | |
How can we model that? | |
0:41:44.024 --> 0:41:45.599 | |
First we look at the formula.
0:41:45.945 --> 0:41:52.151
The forget gate depends on the cell state at t minus one and the input.
0:41:52.151 --> 0:41:56.480
You also see here the formula.
0:41:57.057 --> 0:42:01.963
So we are multiplying both the cell state
and our input
0:42:01.963 --> 0:42:04.890
with some weight matrices,
0:42:05.105 --> 0:42:08.472
we are adding some bias vector, and then
we are applying a sigmoid to that.
0:42:08.868 --> 0:42:13.452 | |
So in the end we have numbers between zero | |
and one saying for each dimension. | |
0:42:13.853 --> 0:42:22.041 | |
Like how much if it's near to zero we will | |
mainly use the new input. | |
0:42:22.041 --> 0:42:31.890 | |
If it's near to one we will keep the input | |
and ignore the input at this dimension. | |
0:42:33.313 --> 0:42:40.173 | |
And with this motivation we can then create
here the new cell state, and here you see
0:42:40.173 --> 0:42:41.141
the formula.
0:42:41.601 --> 0:42:55.048
So you take your forget gate and multiply
it with your previous cell state.
0:42:55.048 --> 0:43:00.427
So if the gate was around one, you mainly keep the old state.
0:43:00.800 --> 0:43:07.405
In the other case, when the value was near zero,
it is mainly the other term that you add:
0:43:07.405 --> 0:43:10.946
a transformation of the input.
0:43:11.351 --> 0:43:24.284
So if this value was maybe zero then you're
putting in most of the information from the input.
0:43:25.065 --> 0:43:26.947 | |
Is already your element? | |
0:43:26.947 --> 0:43:30.561 | |
The only question is now based on your element. | |
0:43:30.561 --> 0:43:32.067 | |
What is the output? | |
0:43:33.253 --> 0:43:47.951 | |
And there you have another opportunity so | |
you can either take the output or instead you | |
0:43:47.951 --> 0:43:50.957 | |
prefer the input. | |
0:43:52.612 --> 0:43:58.166 | |
So is the value also the same for the recept | |
game and the forget game. | |
0:43:58.166 --> 0:43:59.417 | |
Yes, the movie. | |
0:44:00.900 --> 0:44:10.004 | |
Yes exactly so the matrices are different | |
and therefore it can be and that should be | |
0:44:10.004 --> 0:44:16.323 | |
and maybe there is sometimes you want to have | |
information. | |
0:44:16.636 --> 0:44:23.843 | |
So here again we have this vector with values
between zero and one which controls how
0:44:23.843 --> 0:44:25.205
the information flows.
0:44:25.505 --> 0:44:36.459
And then the output is calculated here similarly
to the cell state, but again from two inputs:
0:44:36.536 --> 0:44:45.714
the reset gate decides whether to output
what is currently stored in the cell, or to pass the input through.
0:44:46.346 --> 0:44:58.647 | |
So it's not exactly as the thing we had before, | |
with the residual connections where we added | |
0:44:58.647 --> 0:45:01.293 | |
up, but here we do. | |
0:45:04.224 --> 0:45:08.472 | |
This is the general idea of a simple recurrent | |
neural network. | |
0:45:08.472 --> 0:45:13.125 | |
Then we will now look at how we can make things | |
even more efficient. | |
0:45:13.125 --> 0:45:17.104 | |
But first do you have more questions on how | |
it is working? | |
0:45:23.063 --> 0:45:38.799 | |
Now these calculations are where things can
get more efficient, because at the moment
0:45:38.718 --> 0:45:43.177
each dimension depends on all the other dimensions,
for the second equation also.
0:45:43.423 --> 0:45:48.904 | |
Because if you do a matrix multiplication
with a vector, like for the output vector, each
0:45:48.904 --> 0:45:52.353
dimension of the output vector depends on all
the dimensions of the input.
0:45:52.973 --> 0:46:06.561
The cell state here has this dependency because this one
is used here, while we would like the first dimension
0:46:06.561 --> 0:46:11.340
of the cell state to only depend on the first dimension.
0:46:11.931 --> 0:46:17.973 | |
That, of course, again makes
things less parallelizable, if everything
0:46:17.973 --> 0:46:18.481
depends on everything.
0:46:19.359 --> 0:46:35.122
You can easily make that different by changing
from the matrix product to an element-wise vector product.
0:46:35.295 --> 0:46:51.459
So you just take, element by element, the
first dimension times the first dimension, the second times the second.
0:46:52.032 --> 0:46:53.772 | |
Is, of course, narrow. | |
0:46:53.772 --> 0:46:59.294 | |
This should be reset or this should be because | |
it should be a different. | |
0:46:59.899 --> 0:47:12.053 | |
Now the first dimension only depends on the | |
first dimension, so you don't have dependencies | |
0:47:12.053 --> 0:47:16.148 | |
any longer between dimensions. | |
0:47:18.078 --> 0:47:25.692 | |
Maybe it gets a bit clearer if you look at
it in this way, so what we have to do now is:
0:47:25.966 --> 0:47:31.911
first, we have to do a matrix multiplication
on the input to get the gates and the transformed input.
0:47:32.292 --> 0:47:38.041
And then we only have the element-wise operations,
where we take this output,
0:47:38.041 --> 0:47:38.713
we take
0:47:39.179 --> 0:47:42.978
c at t minus one and our original input.
0:47:42.978 --> 0:47:52.748
Here we only have element-wise operations, which
can be optimally parallelized.
0:47:53.273 --> 0:48:07.603
So here we can additionally parallelize
across the dimensions and don't have the expensive part in the recurrence.
0:48:09.929 --> 0:48:24.255 | |
Yeah, and this part you can do in parallel
again for all the x t's.
0:48:24.544 --> 0:48:33.014
Here you can't do it in parallel over time, but you
only have cheap element-wise operations at each step, and across the dimensions you
0:48:33.014 --> 0:48:34.650
can parallelize.
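Putting the pieces together, a simple recurrent unit could be sketched like this; it is an illustration following the equations just described, not the exact formulation from any particular paper, and the weight shapes and the tanh are my assumptions.

```python
import torch

def sru_layer(x, W, W_f, b_f, W_r, b_r):
    """x: (T, d) inputs for one sentence; all weight matrices are (d, d)."""
    # Parallel part: matrix multiplications involve only the inputs,
    # so they can be computed for all time steps at once.
    x_tilde = x @ W                       # transformed inputs
    f = torch.sigmoid(x @ W_f + b_f)      # forget gates
    r = torch.sigmoid(x @ W_r + b_r)      # reset / highway gates
    # Sequential part: purely element-wise, no dependency across dimensions.
    c = torch.zeros(x.shape[1])
    outputs = []
    for t in range(x.shape[0]):
        c = f[t] * c + (1 - f[t]) * x_tilde[t]                    # new cell state
        outputs.append(r[t] * torch.tanh(c) + (1 - r[t]) * x[t])  # highway output
    return torch.stack(outputs)
```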
0:48:35.495 --> 0:48:39.190 | |
But this maybe for the dimension. | |
0:48:39.190 --> 0:48:42.124 | |
Maybe it's also important. | |
0:48:42.124 --> 0:48:46.037 | |
I don't know if they have tried it. | |
0:48:46.037 --> 0:48:55.383 | |
I assume it's not only for dimension reduction, | |
but it's hard because you can easily. | |
0:49:01.001 --> 0:49:08.164 | |
People have even made the second equation
even simpler.
0:49:08.164 --> 0:49:10.313
So there is this variant.
0:49:10.313 --> 0:49:17.954
This is how we have the highway or residual connections
in the transformer:
0:49:17.954 --> 0:49:20.699
you just add the input to the output.
0:49:20.780 --> 0:49:24.789
So that is how things are put together
in the transformer.
0:49:25.125 --> 0:49:39.960
And there is a simpler simple recurrent
neural network where you do exactly the same
0:49:39.960 --> 0:49:44.512
for the output, so you don't have
0:49:46.326 --> 0:49:47.503
this extra reset gate.
0:49:49.149 --> 0:50:01.196 | |
And with this we are at the end of how to | |
make efficient architectures before we go to | |
0:50:01.196 --> 0:50:02.580 | |
the next. | |
0:50:13.013 --> 0:50:24.424 | |
Besides the encoder-decoder sizes and the architectures,
there is a next technique, which is used in
0:50:24.424 --> 0:50:28.988
nearly all of deep learning and is very successful: knowledge distillation.
0:50:29.449 --> 0:50:43.463
So the idea is, can we extract the knowledge
from a large network into a smaller one that
0:50:43.463 --> 0:50:45.983
performs similarly well.
0:50:47.907 --> 0:50:53.217 | |
And the nice thing is that this really works, | |
and it may be very, very surprising. | |
0:50:53.673 --> 0:51:03.000 | |
So the idea is that we have a large, strong
model which we train for long, and the question
0:51:03.000 --> 0:51:07.871
is: can that help us to train a smaller model?
0:51:08.148 --> 0:51:16.296 | |
So can what we refer to as teacher model tell | |
us better to build a small student model than | |
0:51:16.296 --> 0:51:17.005 | |
before. | |
0:51:17.257 --> 0:51:27.371 | |
So what we're before in it as a student model, | |
we learn from the data and that is how we train | |
0:51:27.371 --> 0:51:28.755 | |
our systems. | |
0:51:29.249 --> 0:51:37.949 | |
The question is: Can we train this small model | |
better if we are not only learning from the | |
0:51:37.949 --> 0:51:46.649 | |
data, but we are also learning from a large | |
model which has been trained maybe in the same | |
0:51:46.649 --> 0:51:47.222 | |
data? | |
0:51:47.667 --> 0:51:55.564 | |
So that you have then in the end a smaller | |
model that is somehow better performing than. | |
0:51:55.895 --> 0:51:59.828 | |
And maybe that's, at first view,
0:51:59.739 --> 0:52:05.396
very, very surprising, because it has seen the
same data, so it should have learned the same:
0:52:05.396 --> 0:52:11.053
the baseline model trained only on the data
and the student in the teacher-student (knowledge distillation)
0:52:11.053 --> 0:52:11.682
setup,
0:52:11.682 --> 0:52:17.401
they all have seen only this data, because
your teacher model was also typically trained
0:52:17.401 --> 0:52:19.161
only on this data. However:
0:52:20.580 --> 0:52:30.071 | |
It has by now shown that by many ways the | |
model trained in the teacher and analysis framework | |
0:52:30.071 --> 0:52:32.293 | |
is performing better. | |
0:52:33.473 --> 0:52:40.971 | |
A bit of an explanation when we see how that | |
works. | |
0:52:40.971 --> 0:52:46.161 | |
There's different ways of doing it. | |
0:52:46.161 --> 0:52:47.171 | |
Maybe. | |
0:52:47.567 --> 0:52:51.501 | |
So how does it work?
0:52:51.501 --> 0:53:04.802
This is our student network, the normal one,
some type of neural network.
0:53:04.802 --> 0:53:06.113
We train it as usual.
0:53:06.586 --> 0:53:17.050
So we are training the model to predict the
reference translation, and we are doing that by calculating the cross-entropy loss.
0:53:17.437 --> 0:53:23.173 | |
The cross-entropy loss was defined in a way
of saying that the probability of the
0:53:23.173 --> 0:53:25.332
correct word should be as high as possible.
0:53:25.745 --> 0:53:32.207 | |
So you are calculating your alphabet probabilities | |
always, and each time step you have an alphabet | |
0:53:32.207 --> 0:53:33.055 | |
probability. | |
0:53:33.055 --> 0:53:38.669 | |
What is the most probable in the next word | |
and your training signal is put as much of | |
0:53:38.669 --> 0:53:43.368 | |
your probability mass to the correct word to | |
the word that is there in. | |
0:53:43.903 --> 0:53:51.367 | |
And this is achieved by this cross-entropy
loss, which sums over all training
0:53:51.367 --> 0:53:58.664
examples and all positions, and sums over the
full vocabulary, and then this indicator is
0:53:58.664 --> 0:54:03.947
one if the current word is the k-th word
in the vocabulary.
0:54:04.204 --> 0:54:11.339
And then we take here the log probability
of that. So what we mainly do is: we have
0:54:11.339 --> 0:54:27.313
this matrix here, one row per position, each of
vocabulary size.
0:54:27.507 --> 0:54:38.656
In the end what you just do is sum these
(here three) log probabilities, and you want
0:54:38.656 --> 0:54:40.785
them to be as high as possible.
0:54:41.041 --> 0:54:54.614
So although this is a sum over this whole matrix,
in the end at each position only one entry counts.
0:54:54.794 --> 0:55:06.366 | |
So that is a normal cross end to be lost that | |
we have discussed at the very beginning of | |
0:55:06.366 --> 0:55:07.016 | |
how. | |
0:55:08.068 --> 0:55:15.132 | |
So what can we do differently in the teacher | |
network? | |
0:55:15.132 --> 0:55:23.374 | |
We also have a teacher network which is trained | |
on large data. | |
0:55:24.224 --> 0:55:35.957 | |
And of course this distribution might be better | |
than the one from the small model because it's. | |
0:55:36.456 --> 0:55:40.941 | |
So in this case we have now the training signal | |
from the teacher network. | |
0:55:41.441 --> 0:55:46.262 | |
And it's the same way as we had before. | |
0:55:46.262 --> 0:55:56.507 | |
The only difference is that we're training not towards
the ground-truth probability distribution
0:55:56.507 --> 0:55:59.159
here, which is sharp,
0:55:59.299 --> 0:56:11.303
but towards the teacher's distribution: there this word has
a high probability, but other words also have some probability.
0:56:12.612 --> 0:56:19.577 | |
And that is the main difference.
0:56:19.577 --> 0:56:30.341
Typically you use an interpolation of
these two losses.
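A sketch of such an interpolated word-level distillation loss (an illustration of the idea; the mixing weight `alpha` and the tensor shapes are my assumptions):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, gold_ids, alpha=0.5):
    """student_logits: (T, V), teacher_probs: (T, V), gold_ids: (T,)."""
    log_p = F.log_softmax(student_logits, dim=-1)
    ce_gold = F.nll_loss(log_p, gold_ids)                  # hard ground-truth targets
    ce_teacher = -(teacher_probs * log_p).sum(-1).mean()   # soft teacher targets
    return alpha * ce_teacher + (1 - alpha) * ce_gold
```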
0:56:33.213 --> 0:56:38.669 | |
Because there's more information contained | |
in the distribution than in the front booth, | |
0:56:38.669 --> 0:56:44.187 | |
because it encodes more information about the | |
language, because language always has more | |
0:56:44.187 --> 0:56:47.907 | |
options to put alone, that's the same sentence | |
yes exactly. | |
0:56:47.907 --> 0:56:53.114 | |
So there's ambiguity in there that is encoded | |
hopefully very well in the complaint. | |
0:56:53.513 --> 0:56:57.257 | |
Trade you two networks so better than a student | |
network you have in there from your learner. | |
0:56:57.537 --> 0:57:05.961 | |
So maybe often there's only one correct word, | |
but it might be two or three, and then all | |
0:57:05.961 --> 0:57:10.505 | |
of these three have a probability distribution. | |
0:57:10.590 --> 0:57:21.242 | |
And then is the main advantage or one explanation | |
of why it's better to train from the. | |
0:57:21.361 --> 0:57:32.652 | |
Of course, it's good to also keep the signal | |
in there because then you can prevent it because | |
0:57:32.652 --> 0:57:33.493 | |
crazy. | |
0:57:37.017 --> 0:57:49.466 | |
Any more questions on the first type of knowledge | |
distillation, also distribution changes. | |
0:57:50.550 --> 0:58:02.202 | |
Coming around again, this would put it a bit | |
different, so this is not a solution to maintenance | |
0:58:02.202 --> 0:58:04.244 | |
or distribution. | |
0:58:04.744 --> 0:58:12.680 | |
But don't think it's performing worse than | |
only doing the ground tours because they also. | |
0:58:13.113 --> 0:58:21.254 | |
So it's more like it's not improving you would | |
assume it's similarly helping you, but. | |
0:58:21.481 --> 0:58:28.145 | |
Of course, if you now have a teacher, maybe | |
you have no danger on your target to Maine, | |
0:58:28.145 --> 0:58:28.524 | |
but. | |
0:58:28.888 --> 0:58:39.895 | |
Then you can use this one which is not the | |
ground truth but helpful to learn better for | |
0:58:39.895 --> 0:58:42.147 | |
the distribution. | |
0:58:46.326 --> 0:58:57.012 | |
The second idea is to do sequence level knowledge | |
distillation, so what we have in this case | |
0:58:57.012 --> 0:59:02.757 | |
is we have looked at each position independently. | |
0:59:03.423 --> 0:59:05.436 | |
Mean, we do that often. | |
0:59:05.436 --> 0:59:10.972 | |
We are not generating a lot of sequences, | |
but that has a problem. | |
0:59:10.972 --> 0:59:13.992 | |
We have this propagation of errors. | |
0:59:13.992 --> 0:59:16.760 | |
We start with one area and then. | |
0:59:17.237 --> 0:59:27.419 | |
So if we are doing word-level knowledge dissolution, | |
we are treating each word in the sentence independently. | |
0:59:28.008 --> 0:59:32.091 | |
So we are not trying to like somewhat model | |
the dependency between. | |
0:59:32.932 --> 0:59:47.480 | |
We can try to do that by sequence-level knowledge
distillation, but there is, of course, a problem.
0:59:47.847 --> 0:59:53.478
For each position we can get
a distribution over all the words at this position.
0:59:53.793 --> 1:00:05.305
But if we want to have a distribution over all
possible target sentences, that's not possible,
1:00:05.305 --> 1:00:06.431
because there are exponentially many.
1:00:08.508 --> 1:00:15.940 | |
So we can then again do a bit of a hack
on that.
1:00:15.940 --> 1:00:23.238
If we can't have a distribution over all sentences,
we approximate it.
1:00:23.843 --> 1:00:30.764
What we can do is use the
teacher network and sample (or beam-search) different translations.
1:00:31.931 --> 1:00:39.327
And now there are different ways to train
on them.
1:00:39.327 --> 1:00:49.343
We can weight them by their probability, or, the
easiest, just use them as targets.
1:00:50.050 --> 1:00:56.373 | |
So what that ends to is that we're taking | |
our teacher network, we're generating some | |
1:00:56.373 --> 1:01:01.135 | |
translations, and these ones we're using as
additional training data.
1:01:01.781 --> 1:01:11.382 | |
Then we have mainly done this sequence level | |
because the teacher network takes us. | |
1:01:11.382 --> 1:01:17.513 | |
These are all probable translations of the | |
sentence. | |
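In code, sequence-level distillation could be sketched roughly like this (`teacher.translate` is a hypothetical interface standing in for beam search with the teacher model):

```python
def build_distillation_data(teacher, source_sentences, beam_size=5):
    # Let the teacher translate the training sources and use its outputs
    # as the targets that the student is trained on.
    return [(src, teacher.translate(src, beam_size=beam_size))
            for src in source_sentences]
```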
1:01:26.286 --> 1:01:34.673 | |
And then you can do a bit of a yeah, and you | |
can try to better make a bit of an interpolated | |
1:01:34.673 --> 1:01:36.206 | |
version of that. | |
1:01:36.716 --> 1:01:42.802 | |
So what people have also done is like subsequent | |
level interpolations. | |
1:01:42.802 --> 1:01:52.819 | |
You generate here several translations: But | |
then you don't use all of them. | |
1:01:52.819 --> 1:02:00.658 | |
You do some metrics on which of these ones. | |
1:02:01.021 --> 1:02:12.056 | |
So it's a bit more training on this brown | |
chose which might be improbable or unreachable | |
1:02:12.056 --> 1:02:16.520 | |
because we can generate everything. | |
1:02:16.676 --> 1:02:23.378 | |
And we are giving it an easier solution which | |
is also good quality and training of that. | |
1:02:23.703 --> 1:02:32.602 | |
So you're not training it on a very difficult | |
solution, but you're training it on an easier | |
1:02:32.602 --> 1:02:33.570 | |
solution. | |
1:02:36.356 --> 1:02:38.494 | |
Any More Questions to This. | |
1:02:40.260 --> 1:02:41.557 | |
Yeah. | |
1:02:41.461 --> 1:02:44.296 | |
Good. | |
1:02:43.843 --> 1:03:01.642 | |
The next idea is to look at the vocabulary. The problem
is, we have seen that the vocabulary calculations
1:03:01.642 --> 1:03:06.784
are often very time-consuming.
1:03:09.789 --> 1:03:19.805
The thing is that most of the vocabulary is
not needed for each sentence; in each sentence only a few words occur.
1:03:20.280 --> 1:03:28.219 | |
The question is: Can we somehow easily precalculate, | |
which words are probable to occur in the sentence, | |
1:03:28.219 --> 1:03:30.967 | |
and then only calculate these ones? | |
1:03:31.691 --> 1:03:34.912 | |
And this can be done so. | |
1:03:34.912 --> 1:03:43.932 | |
For example, if you have sentenced card, it's | |
probably not happening. | |
1:03:44.164 --> 1:03:48.701 | |
So what you can try to do is to limit the
vocabulary
1:03:48.701 --> 1:03:51.093
you're considering for each sentence.
1:03:51.151 --> 1:04:04.693
So you're no longer taking the full vocabulary
as possible output, but you're restricting it.
1:04:06.426 --> 1:04:18.275 | |
What typically works is that we always take
the most frequent target words, because
1:04:18.275 --> 1:04:23.613
these are not so easy to align to source words,
1:04:23.964 --> 1:04:32.241
so we take the most frequent target words, and
then, in addition, words that often align with one of the
1:04:32.241 --> 1:04:32.985
source words.
1:04:33.473 --> 1:04:46.770 | |
So for each source word you calculate the | |
word alignment on your training data, and then | |
1:04:46.770 --> 1:04:51.700 | |
you calculate which words occur. | |
1:04:52.352 --> 1:04:57.680 | |
And then for decoding you build this union | |
of maybe the source word list that other. | |
1:04:59.960 --> 1:05:02.145 | |
Are like for each source work. | |
1:05:02.145 --> 1:05:08.773 | |
you take the most frequent translations of that source word, | |
1:05:08.773 --> 1:05:13.003 | |
say the top ones, and then you add the overall most frequent target words. | |
1:05:13.193 --> 1:05:24.333 | |
In total, especially for short sentences, you have a lot fewer words, so in most cases the candidate set is | |
1:05:24.333 --> 1:05:26.232 | |
only a small fraction of the full vocabulary. | |
1:05:26.546 --> 1:05:33.957 | |
And so you have dramatically reduced your vocabulary, and thereby you can also speed up the decoding. | |
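A small sketch of how such a per-sentence candidate vocabulary could be assembled, assuming a precomputed alignment-based translation table and a global frequent-word list (both names are illustrative, not from a specific toolkit):

```python
# Sketch: per-sentence vocabulary selection ("lexical shortlist").
# translation_table: dict mapping each source word to its top-K aligned target words,
#                    precomputed from word alignments on the training data.
# frequent_targets:  set of the globally most frequent target words, always included.

def candidate_vocabulary(source_tokens, translation_table, frequent_targets):
    candidates = set(frequent_targets)
    for src_word in source_tokens:
        candidates.update(translation_table.get(src_word, []))
    return sorted(candidates)

# Usage: restrict the output softmax to this set instead of the full vocabulary.
# cand = candidate_vocabulary("ich habe ein auto".split(), translation_table, frequent_targets)
```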
1:05:35.495 --> 1:05:43.757 | |
That sounds easy. Does anybody see what is challenging here and why it might not always help? | |
1:05:47.687 --> 1:05:54.448 | |
The translation performance is not the issue; the question is why, | |
1:05:54.448 --> 1:06:01.838 | |
if you implement it, this might not give a strong speed-up. | |
1:06:01.941 --> 1:06:06.053 | |
You have to store this list. | |
1:06:06.053 --> 1:06:14.135 | |
You have to build the union, and that of course costs some of the time you save. | |
1:06:14.554 --> 1:06:21.920 | |
The second thing: the vocabulary is used in our last step, so we have the hidden state, | |
1:06:21.920 --> 1:06:23.868 | |
and then we calculate the output probabilities. | |
1:06:24.284 --> 1:06:29.610 | |
Now we are no longer calculating them for all output words, but only for a subset of them. | |
1:06:30.430 --> 1:06:35.613 | |
However, this matrix multiplication is typically parallelized perfectly well on the GPU. | |
1:06:35.956 --> 1:06:46.937 | |
But if you only calculate some of them and you don't implement it right, it will take | |
1:06:46.937 --> 1:06:52.794 | |
as long as before, because of the nature of the parallel hardware. | |
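To make this point concrete, here is a minimal PyTorch-style sketch (tensor names are illustrative) of restricting the final projection to the candidate vocabulary, so that the matrix multiplication itself becomes smaller rather than just masking the full output:

```python
import torch

# hidden:        (batch, d_model)        decoder hidden states for the current step
# output_weight: (vocab_size, d_model)   full output projection matrix
# cand_ids:      (n_cand,)               indices of the per-sentence candidate vocabulary

def restricted_logits(hidden, output_weight, cand_ids):
    # Slice the projection matrix first, so we only multiply with n_cand rows
    # instead of the full vocabulary; this is where the actual saving comes from.
    sub_weight = output_weight[cand_ids]   # (n_cand, d_model)
    logits = hidden @ sub_weight.T         # (batch, n_cand)
    return logits  # map argmax positions back through cand_ids afterwards
```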
1:06:56.776 --> 1:07:07.997 | |
For beam search there are also some ideas; of course you can go back to greedy search, because | |
1:07:07.997 --> 1:07:10.833 | |
that's more efficient. | |
1:07:11.651 --> 1:07:18.347 | |
Beam search gives better quality, and you can buffer some states in between; how much you buffer is | |
1:07:18.347 --> 1:07:22.216 | |
again this trade-off between computation and memory. | |
1:07:25.125 --> 1:07:41.236 | |
Then, at the end of today, what we want to look into is one last type of neural machine translation | |
1:07:41.236 --> 1:07:42.932 | |
approach. | |
1:07:43.403 --> 1:07:53.621 | |
And the idea is this: what we've already seen in the first two steps is that this autoregressive | |
1:07:53.621 --> 1:07:57.246 | |
part is limiting the decoding. | |
1:07:57.557 --> 1:08:04.461 | |
The encoder can process everything in parallel, but in the decoder we are always taking the most probable word and then feeding it back in. | |
1:08:05.905 --> 1:08:10.476 | |
The question is: Do we really need to do that? | |
1:08:10.476 --> 1:08:14.074 | |
Therefore, there is a bunch of work. | |
1:08:14.074 --> 1:08:16.602 | |
Can we do it differently? | |
1:08:16.602 --> 1:08:19.616 | |
Can we generate the full target sentence at once? | |
1:08:20.160 --> 1:08:29.417 | |
We'll see it's not that easy and there's still | |
an open debate whether this is really faster | |
1:08:29.417 --> 1:08:31.832 | |
at the same quality, but it is worth looking at. | |
1:08:32.712 --> 1:08:45.594 | |
So, as said, what we have so far is our encoder-decoder, where we can process the encoder in parallel, | |
1:08:45.594 --> 1:08:50.527 | |
and then each output always depends on the previous ones. | |
1:08:50.410 --> 1:08:54.709 | |
We generate one output and then we have to feed it back in, because then everything | |
1:08:54.709 --> 1:08:56.565 | |
depends on the previously generated output. | |
1:08:56.916 --> 1:09:10.464 | |
This is what is referred to as an autoregressive model, and nearly all speech generation and | |
1:09:10.464 --> 1:09:16.739 | |
language generation works in this autoregressive fashion. | |
1:09:18.318 --> 1:09:21.132 | |
So the motivation is, can we do that more | |
efficiently? | |
1:09:21.361 --> 1:09:31.694 | |
And can we somehow process all target words | |
in parallel? | |
1:09:31.694 --> 1:09:41.302 | |
So instead of doing it one by one, we input everything at once. | |
1:09:45.105 --> 1:09:46.726 | |
So how does it work? | |
1:09:46.726 --> 1:09:50.587 | |
So let's first look at a basic non-autoregressive model. | |
1:09:50.810 --> 1:09:53.551 | |
So the encoder looks as it is before. | |
1:09:53.551 --> 1:09:58.310 | |
That's maybe not surprising, because here we know we can parallelize. | |
1:09:58.618 --> 1:10:04.592 | |
So we put in our encoder here and generate the encoder states, so that's exactly | |
1:10:04.592 --> 1:10:05.295 | |
the same. | |
1:10:05.845 --> 1:10:16.229 | |
However, now we need to do one more thing: one challenge, which we had before, is | |
1:10:16.229 --> 1:10:26.799 | |
a general challenge of natural language generation like machine translation. | |
1:10:32.672 --> 1:10:38.447 | |
Normally we generate until we produce the end-of-sentence token, but if we now generate | |
1:10:38.447 --> 1:10:44.625 | |
everything at once, that's no longer possible, so we cannot simply generate until we decide to stop, because we only | |
1:10:44.625 --> 1:10:45.632 | |
generate once. | |
1:10:46.206 --> 1:10:58.321 | |
So the question is how we can now determine how long the sequence is. | |
1:11:00.000 --> 1:11:06.384 | |
Yes, that would be one idea, and there is other work which tries to do that. | |
1:11:06.806 --> 1:11:15.702 | |
However, here there's an idea from work done before; maybe you remember we had the | |
1:11:15.702 --> 1:11:20.900 | |
IBM models and there was this concept of fertility. | |
1:11:21.241 --> 1:11:26.299 | |
The concept of fertility means: for one source word, into how many target words does | |
1:11:26.299 --> 1:11:27.104 | |
it translate? | |
1:11:27.847 --> 1:11:34.805 | |
And exactly that is what we try to do here; that means, on top of the encoder we | |
1:11:34.805 --> 1:11:36.134 | |
are calculating the fertility of each source word. | |
1:11:36.396 --> 1:11:42.045 | |
So it says: this word is translated into one word. | |
1:11:42.045 --> 1:11:54.171 | |
That word might be translated into two words, and so on; we're trying to predict into how many target words each source word translates. | |
1:11:55.935 --> 1:12:10.314 | |
And this is done at the end of the encoder, so this is like a length estimation. | |
1:12:10.314 --> 1:12:15.523 | |
You can also do it differently. | |
1:12:16.236 --> 1:12:24.526 | |
You have to initialize your decoder input, and we know word embeddings work well for that, so we try | |
1:12:24.526 --> 1:12:28.627 | |
to do the same thing here, and what people then do is this: | |
1:12:28.627 --> 1:12:35.224 | |
they initialize it again with word embeddings, but with each source embedding repeated according to its fertility. | |
1:12:35.315 --> 1:12:36.460 | |
So we copy the source embeddings accordingly. | |
1:12:36.896 --> 1:12:47.816 | |
So if one word has fertility two, its embedding appears twice, and a word with fertility one appears once; that is then our initialization. | |
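As a small sketch of this fertility-based initialization (the tensor layout is illustrative, assuming one encoder embedding per source token and one predicted fertility per token):

```python
import torch

def fertility_copy(source_embeddings, fertilities):
    """Build the decoder input by repeating each source embedding
    according to its predicted fertility.

    source_embeddings: (src_len, d_model)
    fertilities:       (src_len,) non-negative integers
    returns:           (sum(fertilities), d_model)
    """
    return torch.repeat_interleave(source_embeddings, fertilities, dim=0)

# Example: fertilities [1, 2, 1] -> the second source embedding is copied twice,
# and the predicted target length is 1 + 2 + 1 = 4.
```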
1:12:48.208 --> 1:12:57.151 | |
Alternatively, if you don't predict fertilities but predict the target length directly, you can just initialize | |
1:12:57.151 --> 1:12:57.912 | |
the decoder input in a simpler way. | |
1:12:58.438 --> 1:13:07.788 | |
This often works a bit better, but both are options. | |
1:13:07.788 --> 1:13:16.432 | |
Now you have everything available, both in training and in testing. | |
1:13:16.656 --> 1:13:18.621 | |
This is all available at once. | |
1:13:20.280 --> 1:13:31.752 | |
Then we can generate everything in parallel, | |
so we have the decoder stack, and that is now | |
1:13:31.752 --> 1:13:33.139 | |
as before. | |
1:13:35.395 --> 1:13:41.555 | |
And then we do the translation predictions on top of it. | |
1:13:43.083 --> 1:13:59.821 | |
And then we predict here all the target words at once, and that is the basic | |
1:13:59.821 --> 1:14:00.924 | |
idea. | |
1:14:01.241 --> 1:14:08.171 | |
This is non-autoregressive machine translation, where the idea is that we don't have to generate word by word, but everything in parallel. | |
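Putting the pieces together, a minimal sketch of the difference in decoding, reusing the fertility_copy helper from above (the model methods such as decoder_step, predict_fertility and decoder are placeholders, not a specific implementation):

```python
# Autoregressive decoding: one target word per step, each step sees the previous outputs.
def decode_autoregressive(model, enc_states, max_len, bos, eos):
    output = [bos]
    for _ in range(max_len):
        logits = model.decoder_step(enc_states, output)   # depends on all previous tokens
        next_token = logits.argmax(-1)
        output.append(next_token)
        if next_token == eos:
            break
    return output[1:]

# Non-autoregressive decoding: one single pass, all positions predicted independently.
def decode_non_autoregressive(model, enc_states, source_embeddings):
    fertilities = model.predict_fertility(enc_states)      # length estimation
    dec_input = fertility_copy(source_embeddings, fertilities)
    logits = model.decoder(enc_states, dec_input)           # (tgt_len, vocab) in one pass
    return logits.argmax(-1)                                # all words at once
```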
1:14:10.210 --> 1:14:13.900 | |
So this looks really, really, really great. | |
1:14:13.900 --> 1:14:20.358 | |
At first view, at least: there is one challenge with this, and here is the comparison to the baseline. | |
1:14:20.358 --> 1:14:27.571 | |
Of course there have been improvements, but in general the quality is often significantly worse. | |
1:14:28.068 --> 1:14:32.075 | |
So here you see the baseline models. | |
1:14:32.075 --> 1:14:38.466 | |
You have a loss of ten BLEU points or something like that. | |
1:14:38.878 --> 1:14:40.230 | |
So why does it change? | |
1:14:40.230 --> 1:14:41.640 | |
So why is it happening? | |
1:14:43.903 --> 1:14:56.250 | |
If you look at the errors, there are repetitive tokens, so you have the same word twice, or things like that. | |
1:14:56.536 --> 1:15:01.995 | |
Broken sentences or disfluent sentences, so exactly the things autoregressive models are | |
1:15:01.995 --> 1:15:04.851 | |
very good at; we said that's even a bit of a problem: | |
1:15:04.851 --> 1:15:07.390 | |
they generate very fluent translations. | |
1:15:07.387 --> 1:15:10.898 | |
Sometimes these don't have anything to do with the input. | |
1:15:11.411 --> 1:15:14.047 | |
But generally the output always looks very fluent. | |
1:15:14.995 --> 1:15:20.865 | |
Here it is exactly the opposite: the problem is that we don't get really fluent translations. | |
1:15:21.421 --> 1:15:26.123 | |
And that is mainly due to the challenge that we have this independence assumption. | |
1:15:26.646 --> 1:15:35.873 | |
So in this case, the probability of the output at the second position is independent of what | |
1:15:35.873 --> 1:15:40.632 | |
was generated at the first position, so we don't know what was generated there. | |
1:15:40.632 --> 1:15:43.740 | |
We're just generating it there. | |
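Written out as formulas (standard notation, not taken from the slides), the difference between the two factorizations is:

```latex
\text{Autoregressive:}\quad P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, x)

\text{Non-autoregressive:}\quad P(y \mid x) = P(T \mid x) \cdot \prod_{t=1}^{T} P(y_t \mid x)
```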
1:15:43.964 --> 1:15:55.439 | |
You can also see it in a few examples. | |
1:15:55.439 --> 1:16:03.636 | |
You can over-penalize shifts. | |
1:16:04.024 --> 1:16:10.566 | |
And this is already an improvement again, but the problem is similar. | |
1:16:11.071 --> 1:16:19.900 | |
So you can, for example, translate a phrase one way, or maybe you could also translate it | |
1:16:19.900 --> 1:16:31.105 | |
differently, say as 'feeling down' versus another phrasing; but if the first position goes for | |
1:16:31.105 --> 1:16:34.594 | |
one variant and the second position for the other, you get a mix. | |
1:16:35.075 --> 1:16:42.908 | |
So each position here, and that is one of the main issues, doesn't know what the other positions generate. | |
1:16:43.243 --> 1:16:53.846 | |
And for example, if you are translating something into German, you can often translate things in two | |
1:16:53.846 --> 1:16:58.471 | |
ways, with different grammatical agreement. | |
1:16:58.999 --> 1:17:02.058 | |
And then, where you have to decide which form to use, | |
1:17:02.162 --> 1:17:05.460 | |
the model doesn't know which word it has to select. | |
1:17:06.086 --> 1:17:14.789 | |
I mean, of course, it knows the hidden state, but in the end you have a probability distribution. | |
1:17:16.256 --> 1:17:20.026 | |
And that is the important difference to the autoregressive model: | |
1:17:20.026 --> 1:17:24.335 | |
there you know what was selected, because you have put it in as input; here, you don't know that. | |
1:17:24.335 --> 1:17:29.660 | |
If two words are equally probable here, you don't know which one was selected, and of course that | |
1:17:29.660 --> 1:17:32.832 | |
determines what should be generated next. | |
1:17:33.333 --> 1:17:39.554 | |
Yep, exactly, and we're going to come back to that. | |
1:17:39.554 --> 1:17:39.986 | |
Yes. | |
1:17:40.840 --> 1:17:44.935 | |
Doesn't this also appear in the autoregressive model, now that we're talking about parallel training? | |
1:17:46.586 --> 1:17:48.412 | |
The thing is, in the autoregressive model, | |
1:17:48.412 --> 1:17:50.183 | |
during training you give it the correct previous word. | |
1:17:50.450 --> 1:17:55.827 | |
So if you predict here, where the reference is 'feeling', then you tell the model: | |
1:17:55.827 --> 1:17:59.573 | |
the last word was 'feeling', and then it knows the next one has to be 'down'. | |
1:17:59.573 --> 1:18:04.044 | |
But here it doesn't know that, because it doesn't get the previous word as input. | |
1:18:04.204 --> 1:18:24.286 | |
Yes, that depends a bit on the setup. | |
1:18:24.204 --> 1:18:27.973 | |
But in training, of course, you just try to make the correct word the highest-scoring one. | |
1:18:31.751 --> 1:18:38.181 | |
So what you can do is use things like the CTC loss, which can adjust for this. | |
1:18:38.181 --> 1:18:42.866 | |
Then you can also allow this kind of shifted correction. | |
1:18:42.866 --> 1:18:50.582 | |
If you're doing this type of correction with the CTC loss, you don't get the full penalty | |
1:18:50.930 --> 1:18:58.486 | |
if the output is just shifted by one; so it's a bit of a different loss, which is mainly used in speech recognition, but | |
1:19:00.040 --> 1:19:03.412 | |
It can be used in order to address this problem. | |
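As a rough illustration of the idea, using the standard CTC loss from PyTorch: treating the non-autoregressive output positions as the "time" axis and letting blanks and alignment absorb shifts is the assumption here, not a full implementation of the method.

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

# log_probs: (T, N, C) log-probabilities over C classes for T output positions, batch N
# targets:   (N, S)    reference token ids (without blanks)
log_probs = torch.randn(10, 2, 50).log_softmax(-1)
targets = torch.randint(1, 50, (2, 7))
input_lengths = torch.full((2,), 10, dtype=torch.long)
target_lengths = torch.full((2,), 7, dtype=torch.long)

# Because CTC marginalizes over all monotonic alignments, an output that is
# merely shifted against the reference is not penalized as hard as with a
# position-wise cross-entropy loss.
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```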
1:19:04.504 --> 1:19:13.844 | |
The other problem is that, non-autoregressively, we often have several possible outputs and the model cannot disambiguate between them. | |
1:19:13.844 --> 1:19:20.515 | |
That's the example from before: if you translate 'thank you' into German, | |
1:19:20.460 --> 1:19:31.925 | |
there are several valid translations, and the model might end up mixing them, because it learns one variant for the first position and another for the second. | |
1:19:32.492 --> 1:19:43.201 | |
In order to prevent that, it would be helpful to have only one output per input; that makes | |
1:19:43.201 --> 1:19:47.002 | |
it already easier for the system to learn. | |
1:19:47.227 --> 1:19:53.867 | |
It might be that for slightly different inputs you have different outputs, but for the same input you always get the same output. | |
1:19:54.714 --> 1:19:57.467 | |
That we can luckily very easily solve. | |
1:19:59.119 --> 1:19:59.908 | |
And it's done. | |
1:19:59.908 --> 1:20:04.116 | |
We just learned the technique for it, which is called knowledge distillation. | |
1:20:04.985 --> 1:20:13.398 | |
So what we can do, and the easiest way to improve your non-autoregressive model, is to | |
1:20:13.398 --> 1:20:16.457 | |
train an autoregressive model. | |
1:20:16.457 --> 1:20:22.958 | |
Then you decode your whole training data with this model, and train the non-autoregressive model on its output. | |
1:20:23.603 --> 1:20:27.078 | |
The main advantage of that is that the data is more consistent. | |
1:20:27.407 --> 1:20:33.995 | |
So for the same input you always have the | |
same output. | |
1:20:33.995 --> 1:20:41.901 | |
So you make your training data more consistent, and it becomes easier to learn. | |
1:20:42.482 --> 1:20:54.471 | |
So this is another advantage of knowledge distillation: you get | |
1:20:54.471 --> 1:20:59.156 | |
more consistent training signals. | |
1:21:04.884 --> 1:21:10.630 | |
There's another way to make things easier at the beginning. | |
1:21:10.630 --> 1:21:16.467 | |
There's this glancing-style model, where you work with masks. | |
1:21:16.756 --> 1:21:26.080 | |
So during training, especially at the beginning, you reveal some correct target words as input. | |
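A compact sketch of such mask-based training, where a fraction of the reference tokens is revealed to the decoder and the loss is only taken on the hidden positions (the reveal schedule and the helper names are illustrative assumptions, not a specific recipe):

```python
import torch

def masked_training_step(model, enc_states, target_ids, mask_token, reveal_ratio):
    """Reveal a fraction of the reference tokens; predict the rest.

    Early in training reveal_ratio is high (easy), later it approaches 0
    (fully parallel, non-autoregressive prediction).
    """
    tgt_len = target_ids.size(0)
    reveal = torch.rand(tgt_len) < reveal_ratio            # which positions are given
    dec_input = torch.where(reveal, target_ids,
                            torch.full_like(target_ids, mask_token))
    logits = model.decoder(enc_states, dec_input)          # one parallel pass
    # Loss only on the positions the model had to predict itself.
    return torch.nn.functional.cross_entropy(logits[~reveal], target_ids[~reveal])
```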
1:21:28.468 --> 1:21:38.407 | |
And there is this 'K tokens at a time' idea, where you gradually move from autoregressive to parallel training. | |
1:21:40.000 --> 1:21:50.049 | |
Some target positions are given and some are open; at first, like in autoregressive training, K | |
1:21:50.049 --> 1:21:59.174 | |
equals one, so you always have one input and one output, and then you predict larger and larger blocks in parallel. | |
1:21:59.699 --> 1:22:05.825 | |
So in that way you can slowly learn what is | |
a good and what is a bad answer. | |
1:22:08.528 --> 1:22:10.862 | |
That doesn't sound very efficient. | |
1:22:10.862 --> 1:22:12.578 | |
But you don't train on each example only once anyway; | |
1:22:12.578 --> 1:22:15.323 | |
you go over your training data several times. | |
1:22:15.875 --> 1:22:20.655 | |
You can even switch in between. | |
1:22:20.655 --> 1:22:29.318 | |
There is work on this, for example on where you should start. | |
1:22:31.271 --> 1:22:41.563 | |
This kind of curriculum is used quite often, and it doesn't | |
1:22:41.563 --> 1:22:46.598 | |
mean training is less efficient, but it still helps. | |
1:22:49.389 --> 1:22:57.979 | |
For later reference, here are some examples of how much these things help. | |
1:22:57.979 --> 1:23:04.958 | |
Maybe one point here: the baseline you compare against is really important. | |
1:23:05.365 --> 1:23:13.787 | |
Here you see translation performance and speed. | |
1:23:13.787 --> 1:23:24.407 | |
One point worth noting is what you compare against across research papers. | |
1:23:24.784 --> 1:23:33.880 | |
So yeah, a very weak baseline, a transformer even with beam search, | |
1:23:33.880 --> 1:23:40.522 | |
can be ten times slower than a very strong autoregressive system. | |
1:23:40.961 --> 1:23:48.620 | |
If you use a strong baseline, the reported speed-up goes down accordingly, and here you | |
1:23:48.620 --> 1:23:53.454 | |
have a lot of different speed ups. | |
1:23:53.454 --> 1:24:03.261 | |
Generally, one should compare against a strong baseline and not against a very simple transformer. | |
1:24:07.407 --> 1:24:20.010 | |
Yeah, and with this, one last thing that you can do to speed things up and also reduce your | |
1:24:20.010 --> 1:24:25.950 | |
memory is what is called half precision. | |
1:24:26.326 --> 1:24:29.139 | |
It is especially useful for decoding; for training it is a bit more of an issue. | |
1:24:29.139 --> 1:24:31.148 | |
Sometimes training also gets less stable. | |
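A minimal sketch of half-precision inference in PyTorch (the `model` and `src_tensor` variables are placeholders for whatever translation model and input batch are being used):

```python
import torch

# `model` and `src_tensor` are assumed to exist already (placeholders).
model.eval().cuda()

# Option 1: store and run the whole model in 16-bit floats.
model_fp16 = model.half()
with torch.no_grad():
    output = model_fp16(src_tensor.cuda())

# Option 2: keep weights in fp32 but run the forward pass with automatic
# mixed precision, which is usually the safer choice during training.
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    output = model(src_tensor.cuda())
```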
1:24:32.592 --> 1:24:45.184 | |
With this we are nearly at the end; so what you should remember is how efficient machine | |
1:24:45.184 --> 1:24:46.963 | |
translation can be done. | |
1:24:47.007 --> 1:24:51.939 | |
We have, for example, looked at knowledge | |
distillation. | |
1:24:51.939 --> 1:24:55.991 | |
We have looked at non auto regressive models. | |
1:24:55.991 --> 1:24:57.665 | |
And we have looked at different other techniques as well. | |
1:24:58.898 --> 1:25:02.383 | |
That's it for today; then only one request: | |
1:25:02.383 --> 1:25:08.430 | |
So if you haven't done so, please fill out | |
the evaluation. | |
1:25:08.388 --> 1:25:20.127 | |
If you have done so already, thank you; and hopefully the online people will do it as well. | |
1:25:20.320 --> 1:25:29.758 | |
It is the way to tell us what things are good and what not; not the only one, but the most | |
1:25:29.758 --> 1:25:30.937 | |
efficient. | |
1:25:31.851 --> 1:25:35.871 | |
So please do it, on behalf of all the students; okay, and then thank you. | |