WEBVTT
0:00:01.921 --> 0:00:16.424
Hey, welcome to today's lecture. What we want
to look at today is how we can make neural machine translation more efficient.
0:00:16.796 --> 0:00:26.458
So until now we have built this global system,
mostly the encoder and the decoder, and we haven't
0:00:26.458 --> 0:00:29.714
really thought about how much time and computation it needs.
0:00:30.170 --> 0:00:42.684
And what we, for example, know is yeah, you
can make the systems bigger in different ways.
0:00:42.684 --> 0:00:47.084
We can make them deeper or we can make them wider.
0:00:47.407 --> 0:00:56.331
And if we have at least enough data, that typically
helps to make the performance better.
0:00:56.576 --> 0:01:00.620
But of course that leads to the problem that we need
more resources.
0:01:00.620 --> 0:01:06.587
That is a problem at universities where we
have typically limited computation capacities.
0:01:06.587 --> 0:01:11.757
So at some point you have such big models
that you cannot train them anymore.
0:01:13.033 --> 0:01:23.792
And also for companies it is of course important
what it costs to generate a translation,
0:01:23.792 --> 0:01:26.984
just by power consumption.
0:01:27.667 --> 0:01:35.386
So yeah, there's different reasons why you
want to do efficient machine translation.
0:01:36.436 --> 0:01:48.338
One thing is that there are different ways of
how you can improve your machine translation
0:01:48.338 --> 0:01:50.527
system, as we have seen before.
0:01:50.670 --> 0:01:55.694
There can be different types of data: we looked
into data crawling, monolingual data, and so on.
0:01:55.875 --> 0:01:59.024
All this data, and the aim is always to get more data.
0:01:59.099 --> 0:02:05.735
Of course, we are not just purely interested
in having more data, but the idea why we want
0:02:05.735 --> 0:02:12.299
to have more data is that more data also means
that we have better quality because mostly
0:02:12.299 --> 0:02:17.550
we are interested in increasing the quality
of the machine translation.
0:02:18.838 --> 0:02:24.892
But there's also other ways of how you can
improve the quality of a machine translation.
0:02:25.325 --> 0:02:36.450
And what is, of course, that is where most
research is focusing on.
0:02:36.450 --> 0:02:44.467
That means we want to build better algorithms.
0:02:44.684 --> 0:02:48.199
Of course, the other things are often just as good.
0:02:48.199 --> 0:02:54.631
Sometimes it's easier to improve, so often
it's easier to just collect more data than
0:02:54.631 --> 0:02:57.473
to invent some great new algorithm.
0:02:57.473 --> 0:03:00.315
But yeah, both of them are important.
0:03:00.920 --> 0:03:09.812
But there is this third thing, especially
with neural machine translation, and that means
0:03:09.812 --> 0:03:11.590
we make a bigger model.
0:03:11.751 --> 0:03:16.510
Can be, as said, that we have more layers,
that we have wider layers.
0:03:16.510 --> 0:03:19.977
The other thing we talked a bit about is ensemble.
0:03:19.977 --> 0:03:24.532
That means we are not building only one machine
translation system, but several.
0:03:24.965 --> 0:03:27.505
And we can easily build four.
0:03:27.505 --> 0:03:32.331
What is the typical strategy to build different
systems?
0:03:32.331 --> 0:03:33.177
Remember.
0:03:35.795 --> 0:03:40.119
They should of course be a bit different from
each other.
0:03:40.119 --> 0:03:44.585
If they all predict the same then combining
them doesn't help.
0:03:44.585 --> 0:03:48.979
So what is the easiest way if you have to
build four systems?
0:03:51.711 --> 0:04:01.747
One suggestion was to just take
the best output of a single system.
0:04:02.362 --> 0:04:10.165
I mean really three different systems,
so that you can later combine them and maybe
0:04:10.165 --> 0:04:11.280
average them.
0:04:11.280 --> 0:04:16.682
Ensembles typically work by averaging
all the output probabilities.
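As a minimal sketch of that ensembling step (the model objects and their next_word_probs interface are hypothetical; only the equal-weight averaging is the point):

```python
import numpy as np

def ensemble_next_word_probs(models, src, prefix):
    """Average the next-word distributions of several models over a shared
    vocabulary; the per-model interface here is assumed for illustration."""
    probs = [m.next_word_probs(src, prefix) for m in models]
    return np.mean(probs, axis=0)  # equal-weight average of the distributions
```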
0:04:19.439 --> 0:04:24.227
The idea is to think about neural networks.
0:04:24.227 --> 0:04:29.342
There's one parameter which can easily adjust.
0:04:29.342 --> 0:04:36.525
That parameter is the random seed; that's exactly the easiest way
to get three different systems.
0:04:37.017 --> 0:04:43.119
They have the same architecture, so all the
hyperparameters are the same, but the random initializations are
0:04:43.119 --> 0:04:43.891
different.
0:04:43.891 --> 0:04:46.556
They will have different predictions.
0:04:48.228 --> 0:04:52.572
So, of course, bigger amounts.
0:04:52.572 --> 0:05:05.325
Some of these are a bit the easiest way of
improving your quality because you don't really
0:05:05.325 --> 0:05:08.268
have to do anything.
0:05:08.588 --> 0:05:12.588
There are limits to that: bigger models only
get better
0:05:12.588 --> 0:05:19.132
if you have enough training data. You can't
just add a hundred layers, and it will not work
0:05:19.132 --> 0:05:24.877
on very small data, but with a reasonable amount
of data that is the easiest thing.
0:05:25.305 --> 0:05:33.726
However, there are challenges with making
better models, bigger models, and that is the
0:05:33.726 --> 0:05:34.970
computation.
0:05:35.175 --> 0:05:44.482
So, of course, if you have a bigger model
that can mean that you have longer running
0:05:44.482 --> 0:05:49.518
times, if you have models, you have to times.
0:05:51.171 --> 0:05:56.685
Normally you cannot parallelize across the different
layers because the input to one layer is always
0:05:56.685 --> 0:06:02.442
the output of the previous layer, so you propagate
that so it will also increase your runtime.
0:06:02.822 --> 0:06:10.720
Then you have to store all your models in
memory.
0:06:10.720 --> 0:06:20.927
If you have double the weights, you need double the memory.
It is also more difficult to then do backpropagation.
0:06:20.927 --> 0:06:27.680
You have to store in between the activations,
so there's not only do you increase the model
0:06:27.680 --> 0:06:31.865
in your memory, but also all these other variables
that.
0:06:34.414 --> 0:06:36.734
And so in general it is more expensive.
0:06:37.137 --> 0:06:54.208
And therefore there are good reasons to look
into whether we can make these models more efficient.
0:06:54.134 --> 0:07:00.982
So you can also view it this way: okay, I have
one GPU and one day of training time,
0:07:00.982 --> 0:07:01.274
or.
0:07:01.221 --> 0:07:07.535
Forty thousand euros and then what is the
best machine translation system I can get within
0:07:07.535 --> 0:07:08.437
this budget.
0:07:08.969 --> 0:07:19.085
And then, of course, you can make the models
bigger, but then you have to train them shorter,
0:07:19.085 --> 0:07:24.251
and then we can make more efficient algorithms.
0:07:25.925 --> 0:07:31.699
If you think about efficiency, there's a bit
different scenarios.
0:07:32.312 --> 0:07:43.635
So if you're more of coming from the research
community, what you'll be doing is building
0:07:43.635 --> 0:07:47.913
a lot of models in your research.
0:07:48.088 --> 0:07:58.645
So you're having your test set of maybe sentences,
calculating the blue score, then another model.
0:07:58.818 --> 0:08:08.911
So what that means is that typically you're training
on millions of sentences, so your training time
0:08:08.911 --> 0:08:14.944
is long, maybe a day, but maybe in other cases
a week.
0:08:15.135 --> 0:08:22.860
The testing is not really the cost efficient,
but the training is very costly.
0:08:23.443 --> 0:08:37.830
If you are more thinking of building models
for application, the scenario is quite different.
0:08:38.038 --> 0:08:46.603
And then you keep it running, and maybe thousands
of customers are using it in translating.
0:08:46.603 --> 0:08:47.720
So in that.
0:08:48.168 --> 0:08:59.577
And we will see that it is not always the
same type of challenge: you can parallelize some
0:08:59.577 --> 0:09:07.096
things in training which you cannot parallelize
in testing.
0:09:07.347 --> 0:09:14.124
For example, in training you have to do back
propagation, so you have to store the activations.
0:09:14.394 --> 0:09:23.901
We briefly discussed this before and will do
it in more detail today: in
0:09:23.901 --> 0:09:24.994
training.
0:09:25.265 --> 0:09:36.100
you know the target and you can process
everything in parallel, while in testing
0:09:36.356 --> 0:09:46.741
you can only do one word at a time, and
so you can parallelize less.
0:09:46.741 --> 0:09:50.530
Therefore, it's important.
0:09:52.712 --> 0:09:55.347
There is a specific shared task on this.
0:09:55.347 --> 0:10:03.157
For example, it's the efficiency task where
it's about making things as efficient.
0:10:03.123 --> 0:10:09.230
as possible, and they can look at different
resources.
0:10:09.230 --> 0:10:14.207
So how much GPU runtime do you need?
0:10:14.454 --> 0:10:19.366
See how much memory you need or you can have
a fixed memory budget and then have to build
0:10:19.366 --> 0:10:20.294
the best system.
0:10:20.500 --> 0:10:29.010
And here is a bit like an example of that,
so there's three teams from Edinburgh from
0:10:29.010 --> 0:10:30.989
and they submitted.
0:10:31.131 --> 0:10:36.278
So then, of course, if you want to know the
most efficient system you have to do a bit
0:10:36.278 --> 0:10:36.515
of.
0:10:36.776 --> 0:10:44.656
You want to have a better quality or more
runtime and there's not the one solution.
0:10:44.656 --> 0:10:46.720
You can improve your.
0:10:46.946 --> 0:10:49.662
And that you see that there are different
systems.
0:10:49.909 --> 0:11:06.051
Here you see how many words you can translate per second,
and you want to be as far to the top right as
0:11:06.051 --> 0:11:07.824
possible.
0:11:08.068 --> 0:11:08.889
And you see here a bit.
0:11:08.889 --> 0:11:09.984
This is a little bit different.
0:11:11.051 --> 0:11:27.717
You want to be there on the top right corner
and you can get a score of something between
0:11:27.717 --> 0:11:29.014
words.
0:11:30.250 --> 0:11:34.161
At two hundred and fifty thousand words per second,
you'll get a COMET score of around zero point three.
0:11:34.834 --> 0:11:41.243
There is, of course, any bit of a decision,
but the question is, like how far can you again?
0:11:41.243 --> 0:11:47.789
All these points on this line would
be winners, because they are somehow most efficient
0:11:47.789 --> 0:11:53.922
in a way that there's no system which achieves
the same quality with less computational cost.
0:11:57.657 --> 0:12:04.131
So there's the one question of which resources
are you interested.
0:12:04.131 --> 0:12:07.416
Are you running it on CPU or GPU?
0:12:07.416 --> 0:12:11.668
There are different ways of parallelizing things.
0:12:14.654 --> 0:12:20.777
Another dimension is how you process your
data.
0:12:20.777 --> 0:12:27.154
There's really the best processing and streaming.
0:12:27.647 --> 0:12:34.672
So in batch processing you have the whole
document available so you can translate all
0:12:34.672 --> 0:12:39.981
sentences in parallel and then you're interested
in throughput.
0:12:40.000 --> 0:12:43.844
But you can then process, for example, especially
in GPS.
0:12:43.844 --> 0:12:49.810
That's interesting, you're not translating
one sentence at a time, but you're translating
0:12:49.810 --> 0:12:56.108
one hundred sentences or so in parallel, so
you have one more dimension where you can paralyze
0:12:56.108 --> 0:12:57.964
and then be more efficient.
0:12:58.558 --> 0:13:14.863
On the other hand, you can, for example, sort the document:
we learned that if you do batch processing
0:13:14.863 --> 0:13:16.544
you have padding.
0:13:16.636 --> 0:13:24.636
Then, of course, it makes sense to sort the
sentences by length in order to have the minimum padding
0:13:24.636 --> 0:13:25.535
attached.
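As a minimal sketch of that length-sorted batching (function and parameter names are illustrative):

```python
def make_batches(sentences, batch_size):
    """Sort sentences by length before batching so each batch contains
    similarly long sentences and needs minimal padding."""
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    for start in range(0, len(order), batch_size):
        yield [sentences[i] for i in order[start:start + batch_size]]
```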
0:13:27.427 --> 0:13:32.150
The other scenario is more the streaming scenario
where you do life translation.
0:13:32.512 --> 0:13:40.212
So in that case you can't wait for the whole
document to pass, but you have to do.
0:13:40.520 --> 0:13:49.529
And then, for example, that's especially in
situations like speech translation, and then
0:13:49.529 --> 0:13:53.781
you're interested in things like latency.
0:13:53.781 --> 0:14:00.361
So how much do you have to wait to get the
output of a sentence?
0:14:06.566 --> 0:14:16.956
Finally, there is the thing about the implementation:
Today we're mainly looking at different algorithms,
0:14:16.956 --> 0:14:23.678
different models of how you can model them
in your machine translation system, but of
0:14:23.678 --> 0:14:29.227
course for the same algorithms there's also
different implementations.
0:14:29.489 --> 0:14:38.643
So, for example, for machine translation
there are toolkits which can be very fast.
0:14:38.638 --> 0:14:46.615
So they have coded a lot of the operations
at a very low level,
0:14:46.615 --> 0:14:49.973
directly in CUDA kernels.
0:14:50.110 --> 0:15:00.948
So the same attention network is typically
more efficient in that type of algorithm.
0:15:00.880 --> 0:15:02.474
Than in in any other.
0:15:03.323 --> 0:15:13.105
Of course, it might be other disadvantages,
so if you're a little worker or have worked
0:15:13.105 --> 0:15:15.106
in the practical.
0:15:15.255 --> 0:15:22.604
Because it's normally easier to understand,
easier to change, and so on, but there is again
0:15:22.604 --> 0:15:23.323
a train.
0:15:23.483 --> 0:15:29.440
You have to think about, do you want to include
this into my study or comparison or not?
0:15:29.440 --> 0:15:36.468
Should it be like I compare different implementations
and I also find the most efficient implementation?
0:15:36.468 --> 0:15:39.145
Or is it only about the pure algorithm?
0:15:42.742 --> 0:15:50.355
Yeah, when building these systems there is
a different trade-off to do.
0:15:50.850 --> 0:15:56.555
So one of the trade-offs is between memory
and throughput, so how many words you can generate
0:15:56.555 --> 0:15:57.299
per second.
0:15:57.557 --> 0:16:03.351
So typically you can easily increase
your throughput by increasing the batch size.
0:16:03.643 --> 0:16:06.899
So that means you are translating more sentences
in parallel.
0:16:07.107 --> 0:16:09.241
And GPUs are very good at that stuff.
0:16:09.349 --> 0:16:15.161
Whether you translate one sentence or one hundred
sentences is not the same time, but it's
0:16:15.115 --> 0:16:20.784
roughly similar, because it is this
efficient matrix multiplication, so that
0:16:20.784 --> 0:16:24.415
you can do the same operation on all sentences
parallel.
0:16:24.415 --> 0:16:30.148
So typically that means if you increase your
batch size you can do more things in parallel
0:16:30.148 --> 0:16:31.995
and you will translate more words per
0:16:31.952 --> 0:16:33.370
second.
0:16:33.653 --> 0:16:43.312
On the other hand, the disadvantage is of
course that you will need higher batch sizes and
0:16:43.312 --> 0:16:44.755
more memory.
0:16:44.965 --> 0:16:56.452
To begin with, the other problem is that you
have such big models that you can only translate
0:16:56.452 --> 0:16:59.141
with lower batch sizes.
0:16:59.119 --> 0:17:08.466
If you are running out of memory when translating,
one idea is to decrease your batch size.
0:17:13.453 --> 0:17:24.456
Then there is the trade-off between quality and throughput:
as before, larger models give you
0:17:24.456 --> 0:17:28.124
in general higher quality.
0:17:28.124 --> 0:17:31.902
The first one is always this way.
0:17:32.092 --> 0:17:38.709
Of course, a larger model does not always help;
you can have overfitting at some point, but in general it does.
0:17:43.883 --> 0:17:52.901
And with this a bit on this training and testing
thing we had before.
0:17:53.113 --> 0:17:58.455
So what are the differences between training
and testing, and between the encoder and decoder?
0:17:58.798 --> 0:18:06.992
So if we look at what I mentioned before:
at training time we have a source sentence
0:18:06.992 --> 0:18:17.183
here, and we look at how it is processed;
we won't go through the attention in detail here.
0:18:17.183 --> 0:18:21.836
That's a typical transformer.
0:18:22.162 --> 0:18:31.626
And how we can do that on a GPU is that we can
parallelize it over the whole sentence.
0:18:31.626 --> 0:18:40.422
The first thing to note is that the whole source
sentence is available; that is, of course, not true in all cases.
0:18:40.422 --> 0:18:49.184
We'll later talk about speech translation,
where we might want to translate before the sentence ends.
0:18:49.389 --> 0:18:56.172
But in the general case, you
have the full sentence you want to translate.
0:18:56.416 --> 0:19:02.053
So the important thing is we are here everything
available on the source side.
0:19:03.323 --> 0:19:13.524
And then this was one of the big advantages
that you can remember back of transformer.
0:19:13.524 --> 0:19:15.752
There are several.
0:19:16.156 --> 0:19:25.229
But the other one is now that we can calculate
the full layer.
0:19:25.645 --> 0:19:29.318
There is no dependency between this and this
state or this and this state.
0:19:29.749 --> 0:19:36.662
So we always did like here to calculate the
key value and query, and based on that you
0:19:36.662 --> 0:19:37.536
calculate.
0:19:37.937 --> 0:19:46.616
Which means we can do all these calculations
here in parallel and in parallel.
0:19:48.028 --> 0:19:55.967
And that, of course, is very efficient,
because for GPUs it's typically much better
0:19:55.967 --> 0:20:00.887
to do these things in parallel than one after
each other.
0:20:01.421 --> 0:20:10.311
And then we can also for each layer one by
one, and then we calculate here the encoder.
0:20:10.790 --> 0:20:21.921
In training now an important thing is that
for the decoder we have the full sentence available
0:20:21.921 --> 0:20:28.365
because we know this is the target we should
generate.
0:20:29.649 --> 0:20:33.526
We have models now in a different way.
0:20:33.526 --> 0:20:38.297
This hidden state is only on the previous
ones.
0:20:38.598 --> 0:20:51.887
And the first thing here depends only on this
information, so you see if you remember we
0:20:51.887 --> 0:20:56.665
had this masked self-attention.
0:20:56.896 --> 0:21:04.117
So that means, of course, we can only calculate
the decoder once the encoder is done, but that's fine.
0:21:04.444 --> 0:21:06.656
First we calculate the encoder.
0:21:06.656 --> 0:21:08.925
Then we can calculate here the decoder.
0:21:09.569 --> 0:21:25.566
But again in training we have x, y and that
is available so we can calculate everything
0:21:25.566 --> 0:21:27.929
in parallel.
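As a minimal sketch of that masked self-attention during training (plain NumPy, a single head, shapes are assumptions):

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Causal self-attention over a whole target sequence at once, as used
    with teacher forcing: every position is computed in parallel, but each
    position may only attend to itself and earlier positions.
    Q, K, V: (seq_len, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (seq_len, seq_len)
    future = np.triu(np.ones_like(scores), k=1)     # 1 above the diagonal
    scores = np.where(future == 1, -1e9, scores)    # block future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax per position
    return weights @ V
```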
0:21:28.368 --> 0:21:40.941
So the interesting thing or advantage of transformer
is in training.
0:21:40.941 --> 0:21:46.408
We can do it for the decoder.
0:21:46.866 --> 0:21:54.457
That means you will have more calculations
because you can only calculate one layer at
0:21:54.457 --> 0:22:02.310
a time, but for example the sequence length, which is
typically quite long, doesn't really matter
0:22:02.310 --> 0:22:03.270
that much.
0:22:05.665 --> 0:22:10.704
However, in testing this situation is different.
0:22:10.704 --> 0:22:13.276
In testing we only have the source sentence.
0:22:13.713 --> 0:22:20.622
So this means we start with a source sentence: we don't
know the full target sentence yet because we
0:22:20.622 --> 0:22:29.063
autoregressively generate it, so for the encoder
we have the same situation here, but not for the decoder.
0:22:29.409 --> 0:22:39.598
In this case we only have the first state and then the
second, so we cannot compute all states in
0:22:39.598 --> 0:22:40.756
parallel.
0:22:41.101 --> 0:22:51.752
And then we can only do the next step for y after
we have chosen our most probable previous word.
0:22:51.752 --> 0:22:58.643
We do greedy search or beam search, but you
cannot do it all in parallel.
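As a minimal sketch of that test-time loop (the model interface is hypothetical; only the word-by-word dependency is the point):

```python
def greedy_decode(model, src, max_len=100, eos_id=2):
    """Autoregressive greedy decoding: the encoder runs once over the full
    source, but target words can only be produced one after the other."""
    enc = model.encode(src)                         # parallel over the source
    target = []
    for _ in range(max_len):
        probs = model.next_word_probs(enc, target)  # needs all previous words
        next_word = int(probs.argmax())
        if next_word == eos_id:
            break
        target.append(next_word)
    return target
```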
0:23:03.663 --> 0:23:16.838
Yes, so if we are interested in making things
more efficient for testing, which we need, for
0:23:16.838 --> 0:23:22.363
example, in the scenario of real applications.
0:23:22.642 --> 0:23:34.286
It makes sense that we think about our architecture
and that we are currently working on attention
0:23:34.286 --> 0:23:35.933
based models.
0:23:36.096 --> 0:23:44.150
The decoder is where most of the time
is spent during testing.
0:23:44.150 --> 0:23:47.142
In training it's similar, but during testing it dominates.
0:23:47.167 --> 0:23:50.248
Not to mention beam search.
0:23:50.248 --> 0:23:59.833
It might be even more complicated, because
in beam search you have to try different hypotheses.
0:24:02.762 --> 0:24:15.140
So the question is what can you now do in
order to make your model more efficient and
0:24:15.140 --> 0:24:21.905
better in translation in these types of cases?
0:24:24.604 --> 0:24:30.178
And the one thing is to look into the
encoder-decoder trade-off.
0:24:30.690 --> 0:24:43.898
And then until now we typically assume that
the depth of the encoder and the depth of the
0:24:43.898 --> 0:24:48.154
decoder is roughly the same.
0:24:48.268 --> 0:24:55.553
So if you haven't thought about it, you just
take what is running well.
0:24:55.553 --> 0:24:57.678
You would try to do.
0:24:58.018 --> 0:25:04.148
However, we saw now that there is quite a
big difference, and the decoder runtime is a lot longer
0:25:04.148 --> 0:25:04.914
than the encoder's.
0:25:05.425 --> 0:25:14.018
The question is whether this is also the case for quality:
do we have the same issue there, that we
0:25:14.018 --> 0:25:21.887
only get good quality if both encoder and decoder
are deep? We know that making these models
0:25:21.887 --> 0:25:25.415
deeper is increasing our quality.
0:25:25.425 --> 0:25:31.920
But what we haven't talked about is whether it is
really important that we increase the depth the same
0:25:31.920 --> 0:25:32.285
way.
0:25:32.552 --> 0:25:41.815
So what we can instead do is something
like this, where you have a deep encoder and
0:25:41.815 --> 0:25:42.923
a shallow decoder.
0:25:43.163 --> 0:25:57.386
So that would mean that you, for example,
instead of having the same number of layers on the encoder
0:25:57.386 --> 0:25:59.757
and the decoder, put more layers on the encoder and fewer on the decoder.
0:26:00.080 --> 0:26:10.469
So in this case the overall depth from start
to end would be similar, and so hopefully the quality too.
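As a small illustration of that hyperparameter change (the exact layer counts here are just an assumed example, not the numbers from the slide):

```python
# Two configurations with a similar total depth: the second shifts layers
# to the encoder, whose cost is paid once per sentence, away from the
# decoder, which runs once per generated word at test time.
balanced             = {"encoder_layers": 6,  "decoder_layers": 6}
deep_enc_shallow_dec = {"encoder_layers": 10, "decoder_layers": 2}
```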
0:26:11.471 --> 0:26:21.662
But we could parallelize a lot more things here,
and what is costly in the end during decoding is
0:26:21.662 --> 0:26:22.973
the decoder.
0:26:22.973 --> 0:26:29.330
Because that still runs in an autoregressive
way; there we cannot parallelize.
0:26:31.411 --> 0:26:33.727
And that that can be analyzed.
0:26:33.727 --> 0:26:38.734
So here are some examples where people have
done this.
0:26:39.019 --> 0:26:55.710
So here it's mainly interested on the orange
things, which is auto-regressive about the
0:26:55.710 --> 0:26:57.607
speed up.
0:26:57.717 --> 0:27:15.031
You have the system, so agree is not exactly
the same, but it's similar.
0:27:15.055 --> 0:27:23.004
It's always the case if you look at speed
up.
0:27:23.004 --> 0:27:31.644
Think they put a speed of so that's the baseline.
0:27:31.771 --> 0:27:35.348
So between and times as fast.
0:27:35.348 --> 0:27:42.621
If you switch from a system to where you have
layers in the.
0:27:42.782 --> 0:27:52.309
You see that although you have slightly more
parameters, more calculations are also roughly
0:27:52.309 --> 0:28:00.283
the same, but you get a speed-up because now
during testing you can parallelize more.
0:28:02.182 --> 0:28:09.754
The other thing is that you're speeding up,
but if you look at the performance it's similar,
0:28:09.754 --> 0:28:13.500
so sometimes you improve, sometimes you lose.
0:28:13.500 --> 0:28:20.421
There's a bit of a loss for English to Romanian,
but in general the quality is very similar.
0:28:20.680 --> 0:28:30.343
So you see that you can keep a similar performance
while improving your speed, just by distributing the layers differently.
0:28:30.470 --> 0:28:34.903
And you also see the effect of the encoder layers on speed.
0:28:34.903 --> 0:28:38.136
They don't really matter that much.
0:28:38.136 --> 0:28:38.690
Most.
0:28:38.979 --> 0:28:50.319
Because if you compare the twelve-layer encoder system to
the six-layer one, you have a lower performance
0:28:50.319 --> 0:28:57.309
with six encoder layers, but the speed is
similar.
0:28:57.897 --> 0:29:02.233
And is the huge decrease maybe due
to a lack of data?
0:29:03.743 --> 0:29:11.899
Good idea, but I would say it's not the case.
0:29:11.899 --> 0:29:23.191
Romanian-English should have the same amount
of data.
0:29:24.224 --> 0:29:31.184
Maybe it's just that something in that language.
0:29:31.184 --> 0:29:40.702
If you generate Romanian maybe they need more
target dependencies.
0:29:42.882 --> 0:29:46.263
Why that is, I honestly also don't know; any
ideas?
0:29:47.887 --> 0:29:49.034
There could be yeah the.
0:29:49.889 --> 0:29:58.962
As the maybe if you go from like a movie sphere
to a hybrid sphere, you can: It's very much
0:29:58.962 --> 0:30:12.492
easier to expand the vocabulary to English,
but it must be the vocabulary.
0:30:13.333 --> 0:30:21.147
Have to check, but would assume that in this
case the system is not retrained, but it's
0:30:21.147 --> 0:30:22.391
trained with.
0:30:22.902 --> 0:30:30.213
And that's why I was assuming that they have
the same, but maybe you'll write that in this
0:30:30.213 --> 0:30:35.595
piece, for example, if they were pre-trained,
the decoder English.
0:30:36.096 --> 0:30:43.733
But I don't remember exactly if they do something
like that; that could be a good explanation.
0:30:45.325 --> 0:30:52.457
So this is one of the easiest ways to speed
up.
0:30:52.457 --> 0:31:01.443
You just change two hyperparameters and don't
have to implement anything.
0:31:02.722 --> 0:31:08.367
Of course, there's other ways of doing that.
0:31:08.367 --> 0:31:11.880
We'll look into two things.
0:31:11.880 --> 0:31:16.521
The other thing is the architecture.
0:31:16.796 --> 0:31:28.154
We are now at some of the baselines that we
are doing.
0:31:28.488 --> 0:31:39.978
However, in translation in the decoder side,
it might not be the best solution.
0:31:39.978 --> 0:31:41.845
There is no.
0:31:42.222 --> 0:31:47.130
So we can use different types of architectures
in the encoder and the decoder.
0:31:47.747 --> 0:31:52.475
And there's two ways of what you could do
different, or there's more ways.
0:31:52.912 --> 0:31:54.825
We will look into two todays.
0:31:54.825 --> 0:31:58.842
The one is average attention, which is a very
simple solution.
0:31:59.419 --> 0:32:01.464
You can do as it says.
0:32:01.464 --> 0:32:04.577
It's not really attending anymore.
0:32:04.577 --> 0:32:08.757
It's just like equal attendance to everything.
0:32:09.249 --> 0:32:23.422
And the other idea, which is currently done
in most systems which are optimized to efficiency,
0:32:23.422 --> 0:32:24.913
is we're.
0:32:25.065 --> 0:32:32.623
But on the decoder side we are then not using
transformer or self attention, but we are using
0:32:32.623 --> 0:32:39.700
a recurrent neural network, because the
disadvantage of recurrent neural networks does not matter there.
0:32:39.799 --> 0:32:48.353
And the recurrent unit is normally easier
to calculate at decoding time because it only depends on
0:32:48.353 --> 0:32:49.684
the current input and the previous state.
0:32:51.931 --> 0:33:02.190
So what is the difference during decoding,
and why is attention maybe not the most efficient choice
0:33:02.190 --> 0:33:03.841
for decoding?
0:33:04.204 --> 0:33:14.390
If we want to populate the new state, we only
have to look at the input and the previous
0:33:14.390 --> 0:33:15.649
state, so.
0:33:16.136 --> 0:33:19.029
We are more conditional here networks.
0:33:19.029 --> 0:33:19.994
We have the.
0:33:19.980 --> 0:33:31.291
Dependency to a fixed number of previous ones,
but that's rarely used for decoding.
0:33:31.291 --> 0:33:39.774
In contrast, in the transformer we have this
dependency on all previous states.
0:33:40.000 --> 0:33:52.760
So y t depends on y 1 up to y t minus one, and
that is not very efficient in this sense; I mean,
0:33:52.760 --> 0:33:56.053
it's very good for quality, because you can look at everything.
0:33:56.276 --> 0:34:03.543
However, the disadvantage is that we also
have to do all these calculations, so if we
0:34:03.543 --> 0:34:10.895
look at it more from the point of view of efficient
computation, this might not be the best.
0:34:11.471 --> 0:34:20.517
So the question is, can we change our architecture
to keep some of the advantages but make things
0:34:20.517 --> 0:34:21.994
more efficient?
0:34:24.284 --> 0:34:31.131
The one idea is what is called the average
attention, and the interesting thing is this
0:34:31.131 --> 0:34:32.610
works surprisingly well.
0:34:33.013 --> 0:34:38.917
So the only thing you're changing is in
the decoder:
0:34:38.917 --> 0:34:42.646
You're not doing attention anymore.
0:34:42.646 --> 0:34:46.790
The attention weights are all the same.
0:34:47.027 --> 0:35:00.723
So you don't calculate different weights with
query and key; you just take
0:35:00.723 --> 0:35:03.058
equal weights.
0:35:03.283 --> 0:35:07.585
So here would be one third from this, one
third from this, and one third.
0:35:09.009 --> 0:35:14.719
And while it is sufficient you can now do
precalculation and things get more efficient.
0:35:15.195 --> 0:35:18.803
So first go the formula that's maybe not directed
here.
0:35:18.979 --> 0:35:38.712
So the difference here is that your new hidden
state is the average of all the hidden states up to this position.
0:35:38.678 --> 0:35:40.844
So here would be with this.
0:35:40.844 --> 0:35:45.022
It would be one third of this plus one third
of this.
0:35:46.566 --> 0:35:57.162
But if you calculate it this way, it's not
yet being more efficient because you still
0:35:57.162 --> 0:36:01.844
have to sum over all the previous hidden states.
0:36:04.524 --> 0:36:22.932
But you can now easily speed this up by keeping
an intermediate value, which is just
0:36:22.932 --> 0:36:24.568
the running sum.
0:36:25.585 --> 0:36:30.057
If you take this as ten to one, you take this
one class this one.
0:36:30.350 --> 0:36:36.739
Because this one then was before this, and
this one was this, so in the end.
0:36:37.377 --> 0:36:49.545
So now this one is not the final one in order
to get the final one to do the average.
0:36:49.545 --> 0:36:50.111
So.
0:36:50.430 --> 0:37:00.264
But then if you do this calculation with speed
up you can do it with a fixed number of steps.
0:37:00.180 --> 0:37:11.300
Instead of the sun which depends on age, so
you only have to do calculations to calculate
0:37:11.300 --> 0:37:12.535
this one.
0:37:12.732 --> 0:37:21.718
Can you do the lakes and the lakes?
0:37:21.718 --> 0:37:32.701
For example, light bulb here now takes and.
0:37:32.993 --> 0:37:38.762
That's a very good point, and that's why this
notation in the image
0:37:38.762 --> 0:37:44.531
is not very good: this is the one with the tilde,
and the tilde is just the sum, not yet the average.
0:37:44.884 --> 0:37:57.895
So this one is just the sum of these two,
because this is just this one.
0:37:58.238 --> 0:38:08.956
So the sum of this is exactly as the sum of
these, and the sum of these is the sum of here.
0:38:08.956 --> 0:38:15.131
So you only do the sum in here, and the multiplying.
0:38:15.255 --> 0:38:22.145
So what you mainly do here, a bit more
mathematically, is:
0:38:22.145 --> 0:38:31.531
you take the one over t out of the
sum, and then you can compute the sum incrementally.
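As a minimal sketch of that incremental computation (pure NumPy; names are illustrative):

```python
import numpy as np

def average_attention_states(hidden_states):
    """Average attention: position t uses the uniform average of states
    1..t. A running sum makes every step a constant amount of work instead
    of re-summing from the start."""
    outputs, running_sum = [], np.zeros_like(hidden_states[0])
    for t, h in enumerate(hidden_states, start=1):
        running_sum = running_sum + h     # the "tilde" value: sum so far
        outputs.append(running_sum / t)   # divide by t to get the average
    return outputs
```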
0:38:36.256 --> 0:38:42.443
That maybe looks a bit weird and simple, so
we were all talking about this great attention
0:38:42.443 --> 0:38:47.882
that can focus on different parts, and the
surprising thing about this work is that
0:38:47.882 --> 0:38:53.321
in the end it might also work well without
that, just using equal weights.
0:38:53.954 --> 0:38:56.164
Mean it's not that easy.
0:38:56.376 --> 0:38:58.261
It's like sometimes this is working.
0:38:58.261 --> 0:39:00.451
There's also report weight work that well.
0:39:01.481 --> 0:39:05.848
But I think it's an interesting way and it
maybe shows that a lot of.
0:39:05.805 --> 0:39:10.624
things in the self-attention or the transformer paper
which are presented more as side remarks,
0:39:10.624 --> 0:39:15.930
these hyperparameters around it,
like that you do the layer norm in between,
0:39:15.930 --> 0:39:21.785
and that you do a feed-forward layer before, and
things like that, are also all important,
0:39:21.785 --> 0:39:25.566
and that the right set up around that is also
very important.
0:39:28.969 --> 0:39:38.598
The other thing you can do in the end is not
completely different from this one.
0:39:38.598 --> 0:39:42.521
It's just like a very different.
0:39:42.942 --> 0:39:54.338
And that is a recurrent network which also
has this type of highway connection that can
0:39:54.338 --> 0:40:01.330
ignore the recurrent unit and directly pass
the input through.
0:40:01.561 --> 0:40:10.770
It's not really adding out, but if you see
the hitting step is your input, but what you
0:40:10.770 --> 0:40:15.480
can do is somehow directly go to the output.
0:40:17.077 --> 0:40:28.390
These are the four components of the simple
recurrent unit, and the unit is motivated by GRUs
0:40:28.390 --> 0:40:33.418
and by LSTMs, which we have seen before.
0:40:33.513 --> 0:40:43.633
And gating has proven to be very good for RNNs;
it allows you to have a gate on your states.
0:40:44.164 --> 0:40:48.186
In this thing we have two gates, the reset
gate and the forget gate.
0:40:48.768 --> 0:40:57.334
So first we have the general structure which
has a cell state.
0:40:57.334 --> 0:41:01.277
Here we have the cell state.
0:41:01.361 --> 0:41:09.661
And then this goes next, and we always get
the different cell states over the times that.
0:41:10.030 --> 0:41:11.448
This is the cell state.
0:41:11.771 --> 0:41:16.518
How do we now calculate that? Just assume we
have an initial cell state here.
0:41:17.017 --> 0:41:19.670
The first thing is we're computing the forget
gate.
0:41:20.060 --> 0:41:34.774
The forget gate models whether the new cell
state should mainly depend on the previous cell state
0:41:34.774 --> 0:41:40.065
or whether it should depend on our new input
0:41:40.000 --> 0:41:41.356
that we add.
0:41:41.621 --> 0:41:42.877
How can we model that?
0:41:44.024 --> 0:41:45.599
First we were at a cocktail.
0:41:45.945 --> 0:41:52.151
The forget gate depends on the cell state at t minus one and the input.
0:41:52.151 --> 0:41:56.480
You also see here the formula.
0:41:57.057 --> 0:42:01.963
So we are multiplying both the cell state
and our input.
0:42:01.963 --> 0:42:04.890
With some weights we are getting.
0:42:05.105 --> 0:42:08.472
We are adding a bias vector and then
we are applying a sigmoid to that.
0:42:08.868 --> 0:42:13.452
So in the end we have numbers between zero
and one saying for each dimension.
0:42:13.853 --> 0:42:22.041
Like how much if it's near to zero we will
mainly use the new input.
0:42:22.041 --> 0:42:31.890
If it's near to one we will keep the old cell state
and ignore the input at this dimension.
0:42:33.313 --> 0:42:40.173
And with this motivation we can then create
here the new cell state, and here you see
0:42:40.173 --> 0:42:41.141
the formula.
0:42:41.601 --> 0:42:55.048
So you take your forget gate and multiply
it with your previous cell state.
0:42:55.048 --> 0:43:00.427
So if the gate value was around one, you keep the old state.
0:43:00.800 --> 0:43:07.405
In the other case, when the value was near zero,
it is mostly the input that you add:
0:43:07.405 --> 0:43:10.946
you're adding a transformation of the input.
0:43:11.351 --> 0:43:24.284
So if this value was maybe zero then you're
putting most of the information from inputting.
0:43:25.065 --> 0:43:26.947
Is already your element?
0:43:26.947 --> 0:43:30.561
The only question is now based on your element.
0:43:30.561 --> 0:43:32.067
What is the output?
0:43:33.253 --> 0:43:47.951
And there you have another opportunity so
you can either take the output or instead you
0:43:47.951 --> 0:43:50.957
prefer the input.
0:43:52.612 --> 0:43:58.166
So are the values also the same for the reset
gate and the forget gate?
0:43:58.166 --> 0:43:59.417
Yes, the movie.
0:44:00.900 --> 0:44:10.004
Well, exactly: the matrices are different,
so the values can be different, and that
0:44:10.004 --> 0:44:16.323
should be so, because sometimes you want to keep
different information.
0:44:16.636 --> 0:44:23.843
So here again we have this vector with values
between zero and one, controlling how
0:44:23.843 --> 0:44:25.205
the information flows.
0:44:25.505 --> 0:44:36.459
And then the output is calculated here similar
to a cell stage, but again input is from.
0:44:36.536 --> 0:44:45.714
So either the reset gate decides should give
what is currently stored in there, or.
0:44:46.346 --> 0:44:58.647
So it's not exactly as the thing we had before,
with the residual connections where we added
0:44:58.647 --> 0:45:01.293
them up; here we do a gated combination instead.
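A sketch of the gating equations as described above, in the notation used here (W, v and b are the learned weights, element-wise products written with \odot; this is a reconstruction of the spoken description, not copied from a slide):

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + v_f \odot c_{t-1} + b_f) && \text{(forget gate)}\\
c_t &= f_t \odot c_{t-1} + (1 - f_t) \odot (W x_t) && \text{(cell state)}\\
r_t &= \sigma(W_r x_t + v_r \odot c_{t-1} + b_r) && \text{(reset/output gate)}\\
h_t &= r_t \odot c_t + (1 - r_t) \odot x_t && \text{(gated highway output)}
\end{aligned}
```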
0:45:04.224 --> 0:45:08.472
This is the general idea of a simple recurrent
neural network.
0:45:08.472 --> 0:45:13.125
Then we will now look at how we can make things
even more efficient.
0:45:13.125 --> 0:45:17.104
But first do you have more questions on how
it is working?
0:45:23.063 --> 0:45:38.799
Now these calculations are where things can
get more efficient, because in this form
0:45:38.718 --> 0:45:43.177
each dimension depends on all the other dimensions,
for the second term also.
0:45:43.423 --> 0:45:48.904
Because if you do a matrix multiplication
with a vector like for the output vector, each
0:45:48.904 --> 0:45:52.353
dimension of the output vector depends on all
the input dimensions.
0:45:52.973 --> 0:46:06.561
The cell state here depends because this one
is used here, and somehow the first dimension
0:46:06.561 --> 0:46:11.340
of the cell state only depends.
0:46:11.931 --> 0:46:17.973
In order to make that, of course, is sometimes
again making things less paralyzeable if things
0:46:17.973 --> 0:46:18.481
depend.
0:46:19.359 --> 0:46:35.122
You can easily change that by replacing
the matrix product with an element-wise product with a vector.
0:46:35.295 --> 0:46:51.459
So you do first, just like inside here, you
take like the first dimension, my second dimension.
0:46:52.032 --> 0:46:53.772
Is, of course, narrow.
0:46:53.772 --> 0:46:59.294
This should be reset or this should be because
it should be a different.
0:46:59.899 --> 0:47:12.053
Now the first dimension only depends on the
first dimension, so you don't have dependencies
0:47:12.053 --> 0:47:16.148
any longer between dimensions.
0:47:18.078 --> 0:47:25.692
Maybe it gets a bit clearer if you see about
it in this way, so what we have to do now.
0:47:25.966 --> 0:47:31.911
First, we have to do a metrics multiplication
on to gather and to get the.
0:47:32.292 --> 0:47:38.041
And then we only have the element wise operations
where we take this output.
0:47:38.041 --> 0:47:38.713
We take.
0:47:39.179 --> 0:47:42.978
Minus one and our original.
0:47:42.978 --> 0:47:52.748
Here we only have element-wise operations, which
can be optimally parallelized.
0:47:53.273 --> 0:48:07.603
So here we can additionally parallelize
across the dimensions and don't have that dependency.
0:48:09.929 --> 0:48:24.255
Yeah, but this you can do like in parallel
again for all xts.
0:48:24.544 --> 0:48:33.014
Here you can't do it in parallel over time, but you
only have cheap element-wise operations at each step, and then you
0:48:33.014 --> 0:48:34.650
can parallelize.
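As a minimal sketch of that split between parallel matrix products and a cheap element-wise recurrence (parameter names, shapes and the exact gate wiring are assumptions following the equations above):

```python
import numpy as np

def sru_forward(X, W, Wf, Wr, bf, br, vf, vr):
    """X: (T, d) inputs. The heavy matrix multiplications are done for all
    time steps at once; the sequential loop contains only element-wise work."""
    Xt = X @ W.T        # candidate values,  (T, d)
    Fx = X @ Wf.T + bf  # forget-gate input, (T, d)
    Rx = X @ Wr.T + br  # reset-gate input,  (T, d)

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    c, H = np.zeros(X.shape[1]), np.zeros_like(X)
    for t in range(X.shape[0]):          # cheap, element-wise recurrence
        f = sigmoid(Fx[t] + vf * c)
        c = f * c + (1.0 - f) * Xt[t]
        r = sigmoid(Rx[t] + vr * c)
        H[t] = r * c + (1.0 - r) * X[t]
    return H
```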
0:48:35.495 --> 0:48:39.190
But this maybe for the dimension.
0:48:39.190 --> 0:48:42.124
Maybe it's also important.
0:48:42.124 --> 0:48:46.037
I don't know if they have tried it.
0:48:46.037 --> 0:48:55.383
I assume it's not only for dimension reduction,
but it's hard because you can easily.
0:49:01.001 --> 0:49:08.164
People have even like made the second thing
even more easy.
0:49:08.164 --> 0:49:10.313
So there is this.
0:49:10.313 --> 0:49:17.954
This is how we have the highway connections
in the transformer.
0:49:17.954 --> 0:49:20.699
Then it's like you do.
0:49:20.780 --> 0:49:24.789
So that is like how things are put together
as a transformer.
0:49:25.125 --> 0:49:39.960
And that is a similar and simple recurring
neural network where you do exactly the same
0:49:39.960 --> 0:49:44.512
for the so you don't have.
0:49:46.326 --> 0:49:47.503
This type of things.
0:49:49.149 --> 0:50:01.196
And with this we are at the end of how to
make efficient architectures before we go to
0:50:01.196 --> 0:50:02.580
the next.
0:50:13.013 --> 0:50:24.424
Besides the encoder-decoder trade-off and the architectures,
there is a next technique which is used
0:50:24.424 --> 0:50:28.988
very successfully in nearly all of deep learning.
0:50:29.449 --> 0:50:43.463
So the idea is: can we extract the knowledge
from a large network into a smaller one
0:50:43.463 --> 0:50:45.983
that performs similarly well?
0:50:47.907 --> 0:50:53.217
And the nice thing is that this really works,
and it may be very, very surprising.
0:50:53.673 --> 0:51:03.000
So the idea is that we have a large, strong
model which we train for a long time, and the question
0:51:03.000 --> 0:51:07.871
is: Can that help us to train a smaller model?
0:51:08.148 --> 0:51:16.296
So can what we refer to as the teacher model help
us build a better small student model than
0:51:16.296 --> 0:51:17.005
before.
0:51:17.257 --> 0:51:27.371
So what we're before in it as a student model,
we learn from the data and that is how we train
0:51:27.371 --> 0:51:28.755
our systems.
0:51:29.249 --> 0:51:37.949
The question is: Can we train this small model
better if we are not only learning from the
0:51:37.949 --> 0:51:46.649
data, but we are also learning from a large
model which has been trained maybe in the same
0:51:46.649 --> 0:51:47.222
data?
0:51:47.667 --> 0:51:55.564
So that in the end you have a smaller
model that somehow performs better than before.
0:51:55.895 --> 0:51:59.828
And maybe that's on the first view.
0:51:59.739 --> 0:52:05.396
Very very surprising because it has seen the
same data so it should have learned the same
0:52:05.396 --> 0:52:11.053
so the baseline model trained only on the data
and the student teacher knowledge to still
0:52:11.053 --> 0:52:11.682
model it.
0:52:11.682 --> 0:52:17.401
They all have seen only this data because
your teacher modeling was also trained typically
0:52:17.401 --> 0:52:19.161
only on this model however.
0:52:20.580 --> 0:52:30.071
It has by now been shown in many settings that the
model trained in the teacher-student framework
0:52:30.071 --> 0:52:32.293
is performing better.
0:52:33.473 --> 0:52:40.971
A bit of an explanation when we see how that
works.
0:52:40.971 --> 0:52:46.161
There's different ways of doing it.
0:52:46.161 --> 0:52:47.171
Maybe.
0:52:47.567 --> 0:52:51.501
So how does it work?
0:52:51.501 --> 0:53:04.802
This is our student network, the normal one,
some type of new network.
0:53:04.802 --> 0:53:06.113
We're.
0:53:06.586 --> 0:53:17.050
So we are training the model to predict the
reference translation, and we do that by calculating the cross-entropy loss.
0:53:17.437 --> 0:53:23.173
The cross-entropy loss was defined in a way
that says the probability for the
0:53:23.173 --> 0:53:25.332
correct word should be as high as possible.
0:53:25.745 --> 0:53:32.207
So you are always calculating your output probabilities,
and at each time step you have an output
0:53:32.207 --> 0:53:33.055
probability.
0:53:33.055 --> 0:53:38.669
distribution over the next word,
and your training signal is to put as much of
0:53:38.669 --> 0:53:43.368
your probability mass as possible on the correct word,
the word that is there in the reference.
0:53:43.903 --> 0:53:51.367
And this is achieved by the cross-entropy
loss, which sums over all training
0:53:51.367 --> 0:53:58.664
examples and all positions, and over the
full vocabulary, and then this indicator is
0:53:58.664 --> 0:54:03.947
one if the current word is the k-th word
in the vocabulary.
0:54:04.204 --> 0:54:11.339
And then we take the log probability
of that, so what we mainly do is: we have
0:54:11.339 --> 0:54:27.313
this matrix here, of positions by
vocabulary size.
0:54:27.507 --> 0:54:38.656
In the end what you do is sum these
log probabilities, and then you want
0:54:38.656 --> 0:54:40.785
them to be as high as possible.
0:54:41.041 --> 0:54:54.614
So although this is a thumb over this metric
here, in the end of each dimension you.
0:54:54.794 --> 0:55:06.366
So that is the normal cross-entropy loss that
we have discussed at the very beginning of
0:55:06.366 --> 0:55:07.016
how we train these models.
0:55:08.068 --> 0:55:15.132
So what can we do differently in the teacher
network?
0:55:15.132 --> 0:55:23.374
We also have a teacher network which is trained
on large data.
0:55:24.224 --> 0:55:35.957
And of course this distribution might be better
than the one from the small model, because it's a stronger model.
0:55:36.456 --> 0:55:40.941
So in this case we have now the training signal
from the teacher network.
0:55:41.441 --> 0:55:46.262
And it's the same way as we had before.
0:55:46.262 --> 0:55:56.507
The only difference is that we're training not
towards the ground-truth probability distribution
0:55:56.507 --> 0:55:59.159
here, which is sharp, but towards the teacher's distribution.
0:55:59.299 --> 0:56:11.303
That's also a probability, so this word has
a high probability, but have some probability.
0:56:12.612 --> 0:56:19.577
And that is the main difference.
0:56:19.577 --> 0:56:30.341
Typically you do an interpolation of
these two.
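As a minimal sketch of that interpolated training signal (the array layout and the mixing weight alpha are assumptions):

```python
import numpy as np

def word_level_kd_loss(student_logprobs, teacher_probs, reference_ids, alpha=0.5):
    """Word-level knowledge distillation: interpolate the usual cross-entropy
    against the reference with a cross-entropy against the teacher's soft
    distribution, summed over all positions.
    student_logprobs, teacher_probs: (positions, vocab); reference_ids: (positions,)."""
    pos = np.arange(len(reference_ids))
    ce_reference = -student_logprobs[pos, reference_ids].sum()
    ce_teacher = -(teacher_probs * student_logprobs).sum()
    return alpha * ce_reference + (1.0 - alpha) * ce_teacher
```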
0:56:33.213 --> 0:56:38.669
Because there's more information contained
in the distribution than in the ground truth,
0:56:38.669 --> 0:56:44.187
because it encodes more information about the
language, because language always has more
0:56:44.187 --> 0:56:47.907
options to put alone, that's the same sentence
yes exactly.
0:56:47.907 --> 0:56:53.114
So there's ambiguity in there that is hopefully
encoded very well in the teacher's distribution.
0:56:53.513 --> 0:56:57.257
Trade you two networks so better than a student
network you have in there from your learner.
0:56:57.537 --> 0:57:05.961
So maybe often there's only one correct word,
but it might be two or three, and then all
0:57:05.961 --> 0:57:10.505
of these three have a probability distribution.
0:57:10.590 --> 0:57:21.242
And then is the main advantage or one explanation
of why it's better to train from the.
0:57:21.361 --> 0:57:32.652
Of course, it's good to also keep the ground-truth signal
in there, because then you can prevent
0:57:32.652 --> 0:57:33.493
the student from learning something crazy from the teacher.
0:57:37.017 --> 0:57:49.466
Any more questions on the first type of knowledge
distillation, also distribution changes.
0:57:50.550 --> 0:58:02.202
Coming around again, this would put it a bit
different, so this is not a solution to maintenance
0:58:02.202 --> 0:58:04.244
or distribution.
0:58:04.744 --> 0:58:12.680
But I don't think it performs worse than
only training on the ground truth.
0:58:13.113 --> 0:58:21.254
So it's more like it's not improving you would
assume it's similarly helping you, but.
0:58:21.481 --> 0:58:28.145
Of course, if you now have a teacher, maybe
you have no danger on your target to Maine,
0:58:28.145 --> 0:58:28.524
but.
0:58:28.888 --> 0:58:39.895
Then you can use this one which is not the
ground truth but helpful to learn better for
0:58:39.895 --> 0:58:42.147
the distribution.
0:58:46.326 --> 0:58:57.012
The second idea is to do sequence level knowledge
distillation, so what we have in this case
0:58:57.012 --> 0:59:02.757
is we have looked at each position independently.
0:59:03.423 --> 0:59:05.436
Mean, we do that often.
0:59:05.436 --> 0:59:10.972
We are not generating a lot of sequences,
but that has a problem.
0:59:10.972 --> 0:59:13.992
We have this propagation of errors.
0:59:13.992 --> 0:59:16.760
We start with one error and then it propagates.
0:59:17.237 --> 0:59:27.419
So if we are doing word-level knowledge distillation,
we are treating each word in the sentence independently.
0:59:28.008 --> 0:59:32.091
So we are not trying to like somewhat model
the dependency between.
0:59:32.932 --> 0:59:47.480
We can try to address that with sequence-level knowledge
distillation, but there is, of course, a problem.
0:59:47.847 --> 0:59:53.478
So we can that for each position we can get
a distribution over all the words at this.
0:59:53.793 --> 1:00:05.305
But if we want to have a distribution of all
possible target sentences, that's not possible
1:00:05.305 --> 1:00:06.431
because.
1:00:08.508 --> 1:00:15.940
Area, so we can then again do a bit of a heck
on that.
1:00:15.940 --> 1:00:23.238
If we can't have a distribution of all sentences,
it.
1:00:23.843 --> 1:00:30.764
So what we can do is use the
teacher network and sample or generate different translations.
1:00:31.931 --> 1:00:39.327
And now we can do different ways to train
them.
1:00:39.327 --> 1:00:49.343
We can use them as their probability, the
easiest one to assume.
1:00:50.050 --> 1:00:56.373
So what that ends to is that we're taking
our teacher network, we're generating some
1:00:56.373 --> 1:01:01.135
translations, and these we're using as
additional training data.
1:01:01.781 --> 1:01:11.382
Then we have mainly done this at the sequence level,
because the teacher network tells us:
1:01:11.382 --> 1:01:17.513
These are all probable translations of the
sentence.
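As a minimal sketch of that data generation step (the teacher.translate interface is hypothetical):

```python
def build_sequence_kd_data(teacher, source_sentences, beam_size=5):
    """Sequence-level knowledge distillation: the teacher translates the
    training sources, and the student is then trained on these outputs as
    targets (possibly in addition to the original references)."""
    distilled = []
    for src in source_sentences:
        hyp = teacher.translate(src, beam_size=beam_size)  # best teacher output
        distilled.append((src, hyp))
    return distilled
```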
1:01:26.286 --> 1:01:34.673
And then you can do a bit of a yeah, and you
can try to better make a bit of an interpolated
1:01:34.673 --> 1:01:36.206
version of that.
1:01:36.716 --> 1:01:42.802
So what people have also done is sequence-level
interpolation.
1:01:42.802 --> 1:01:52.819
You generate several translations here, but
then you don't use all of them.
1:01:52.819 --> 1:02:00.658
You use some metric to decide which of them to keep.
1:02:01.021 --> 1:02:12.056
So it's a bit more training on this brown
chose which might be improbable or unreachable
1:02:12.056 --> 1:02:16.520
because we can generate everything.
1:02:16.676 --> 1:02:23.378
And we are giving it an easier solution which
is also good quality and training of that.
1:02:23.703 --> 1:02:32.602
So you're not training it on a very difficult
solution, but you're training it on an easier
1:02:32.602 --> 1:02:33.570
solution.
1:02:36.356 --> 1:02:38.494
Any More Questions to This.
1:02:40.260 --> 1:02:41.557
Yeah.
1:02:41.461 --> 1:02:44.296
Good.
1:02:43.843 --> 1:03:01.642
The next idea is to look at the vocabulary. The problem
is, we have seen that vocabulary calculations
1:03:01.642 --> 1:03:06.784
are often very time-consuming.
1:03:09.789 --> 1:03:19.805
The thing is that most of the vocabulary is
not needed for each sentence, so in each sentence.
1:03:20.280 --> 1:03:28.219
The question is: Can we somehow easily precalculate,
which words are probable to occur in the sentence,
1:03:28.219 --> 1:03:30.967
and then only calculate these ones?
1:03:31.691 --> 1:03:34.912
And this can be done so.
1:03:34.912 --> 1:03:43.932
For example, if you have sentenced card, it's
probably not happening.
1:03:44.164 --> 1:03:48.701
So what you can try to do is to limit your
vocabulary.
1:03:48.701 --> 1:03:51.093
You're considering for each.
1:03:51.151 --> 1:04:04.693
So you're no longer taking the full vocabulary
as possible output, but you're restricting.
1:04:06.426 --> 1:04:18.275
What typically works is that we always include
the most frequent words, because
1:04:18.275 --> 1:04:23.613
these are not so easy to align to source words.
1:04:23.964 --> 1:04:32.241
So we take the most frequent target words, and
then the words that often align with one of the
1:04:32.241 --> 1:04:32.985
source.
1:04:33.473 --> 1:04:46.770
So for each source word you calculate the
word alignment on your training data, and then
1:04:46.770 --> 1:04:51.700
you calculate which words occur.
1:04:52.352 --> 1:04:57.680
And then for decoding you build this union
of maybe the source word list that other.
1:04:59.960 --> 1:05:02.145
Are like for each source work.
1:05:02.145 --> 1:05:08.773
One of the most frequent translations of these
source words, for example for each source work
1:05:08.773 --> 1:05:13.003
like in the most frequent ones, and then the
most frequent.
1:05:13.193 --> 1:05:24.333
In total, if you have short sentences, you
have a lot less words, so in most cases it's
1:05:24.333 --> 1:05:26.232
not more than.
1:05:26.546 --> 1:05:33.957
And so you have dramatically reduced your
vocabulary, and thereby can also speed up decoding.
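As a minimal sketch of that per-sentence vocabulary selection (names and the cutoff are illustrative; translations_per_word would come from word alignments counted on the training data):

```python
def candidate_vocab(source_sentence, translations_per_word, frequent_words, top_k=20):
    """Restrict the output vocabulary for one sentence to the globally
    frequent target words plus the top aligned translations of each source
    word; the output softmax is then computed only over this set."""
    candidates = set(frequent_words)
    for src_word in source_sentence:
        candidates.update(translations_per_word.get(src_word, [])[:top_k])
    return candidates
```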
1:05:35.495 --> 1:05:43.757
That easy does anybody see what is challenging
here and why that might not always need.
1:05:47.687 --> 1:05:54.448
The performance is not why this might not.
1:05:54.448 --> 1:06:01.838
If you implement it, it might not be a strong.
1:06:01.941 --> 1:06:06.053
You have to store this list.
1:06:06.053 --> 1:06:14.135
You have to burn the union and of course your
safe time.
1:06:14.554 --> 1:06:21.920
The second thing the vocabulary is used in
our last step, so we have the hidden state,
1:06:21.920 --> 1:06:23.868
and then we calculate.
1:06:24.284 --> 1:06:29.610
Now we are not longer calculating them for
all output words, but for a subset of them.
1:06:30.430 --> 1:06:35.613
However, this matrix multiplication is typically
parallelized on the GPU very well.
1:06:35.956 --> 1:06:46.937
But if you only calculate some of them and
don't implement it right, it will take
1:06:46.937 --> 1:06:52.794
as long as before, because of the nature of
the parallel hardware.
1:06:56.776 --> 1:07:07.997
Here for beam search there's some ideas of
course you can go back to greedy search because
1:07:07.997 --> 1:07:10.833
that's more efficient.
1:07:11.651 --> 1:07:18.347
And better quality, and you can buffer some
states in between, so how much buffering it's
1:07:18.347 --> 1:07:22.216
again this tradeoff between calculation and
memory.
1:07:25.125 --> 1:07:41.236
Then at the end of today what we want to look
into is one last type of neural machine translation
1:07:41.236 --> 1:07:42.932
approach.
1:07:43.403 --> 1:07:53.621
And the idea is, as we've already seen in
our first steps, that this autoregressive
1:07:53.621 --> 1:07:57.246
part is what takes the time during decoding.
1:07:57.557 --> 1:08:04.461
The encoder can process everything in parallel, but in
the decoder we always take the most probable word and then continue.
1:08:05.905 --> 1:08:10.476
The question is: Do we really need to do that?
1:08:10.476 --> 1:08:14.074
Therefore, there is a bunch of work.
1:08:14.074 --> 1:08:16.602
Can we do it differently?
1:08:16.602 --> 1:08:19.616
Can we generate a full target?
1:08:20.160 --> 1:08:29.417
We'll see it's not that easy and there's still
an open debate whether this is really faster
1:08:29.417 --> 1:08:31.832
and quality, but think.
1:08:32.712 --> 1:08:45.594
So, as said, what we have is our encoder-decoder,
where we can process the encoder in parallel,
1:08:45.594 --> 1:08:50.527
and then each output always depends on the previous ones.
1:08:50.410 --> 1:08:54.709
We generate an output and then we have to
put it in here as the y, because then everything
1:08:54.709 --> 1:08:56.565
depends on the previous output.
1:08:56.916 --> 1:09:10.464
This is what is referred to as an autoregressive
model, and nearly all speech generation and
1:09:10.464 --> 1:09:16.739
language generation works in this autoregressive way.
1:09:18.318 --> 1:09:21.132
So the motivation is, can we do that more
efficiently?
1:09:21.361 --> 1:09:31.694
And can we somehow process all target words
in parallel?
1:09:31.694 --> 1:09:41.302
So instead of doing it one by one, we are
generating all of them at once.
1:09:45.105 --> 1:09:46.726
So how does it work?
1:09:46.726 --> 1:09:50.587
So let's first have a basic auto regressive
mode.
1:09:50.810 --> 1:09:53.551
So the encoder looks as it is before.
1:09:53.551 --> 1:09:58.310
That's maybe not surprising because here we
know we can paralyze.
1:09:58.618 --> 1:10:04.592
So we put in here our input and
generate the encoder states, so that's exactly
1:10:04.592 --> 1:10:05.295
the same.
1:10:05.845 --> 1:10:16.229
However, now we need to do one more thing:
One challenge is what we had before and that's
1:10:16.229 --> 1:10:26.799
a challenge of natural language generation
like machine translation.
1:10:32.672 --> 1:10:38.447
Normally we generate until we produce this
end-of-sentence token, but if we now generate
1:10:38.447 --> 1:10:44.625
everything at once that's no longer possible,
so we cannot generate as long because we only
1:10:44.625 --> 1:10:45.632
generated one.
1:10:46.206 --> 1:10:58.321
So the question is how can we now determine
how long the sequence is, and we can also accelerate.
1:11:00.000 --> 1:11:06.384
Yes, but there would be one idea, and there
is other work which tries to do that.
1:11:06.806 --> 1:11:15.702
However, in here there's some work already
done before and maybe you remember we had the
1:11:15.702 --> 1:11:20.900
IBM models and there was this concept of fertility.
1:11:21.241 --> 1:11:26.299
The concept of fertility means: for
one source word, into how many target words does
1:11:26.299 --> 1:11:27.104
it translate?
1:11:27.847 --> 1:11:34.805
And exactly that we try to do here, and that
means we are calculating like at the top we
1:11:34.805 --> 1:11:36.134
are calculating the fertility of each source word.
1:11:36.396 --> 1:11:42.045
So it says this word is translated into one word.
1:11:42.045 --> 1:11:54.171
That word might be translated into several words,
so we're trying to predict into how many target words each source word translates.
1:11:55.935 --> 1:12:10.314
And then the end of the anchor, so this is
like a length estimation.
1:12:10.314 --> 1:12:15.523
You can do it otherwise.
1:12:16.236 --> 1:12:24.526
You have to initialize your decoder input, and we know
word embeddings work well, so we're trying
1:12:24.526 --> 1:12:28.627
to do the same thing and what people then do.
1:12:28.627 --> 1:12:35.224
They initialize it again with the source word embeddings,
but repeated according to the fertility.
1:12:35.315 --> 1:12:36.460
So we have the cartilage.
1:12:36.896 --> 1:12:47.816
So this one has fertility two, so it appears twice,
and that one once; that is then our initialization.
1:12:48.208 --> 1:12:57.151
In other words, if you don't predict fertilities
but predict the length directly, you can just initialize
1:12:57.151 --> 1:12:57.912
the decoder input in a different way.
1:12:58.438 --> 1:13:07.788
This often works a bit better, but that's
the other option.
1:13:07.788 --> 1:13:16.432
Now you have everything in training and testing.
1:13:16.656 --> 1:13:18.621
This is all available at once.
1:13:20.280 --> 1:13:31.752
Then we can generate everything in parallel,
so we have the decoder stack, and that is now
1:13:31.752 --> 1:13:33.139
as before.
1:13:35.395 --> 1:13:41.555
And then we're doing the translation predictions
here on top of it.
1:13:43.083 --> 1:13:59.821
And then we are predicting here the target
words, all at once, and that is the basic
1:13:59.821 --> 1:14:00.924
idea of non-autoregressive
1:14:01.241 --> 1:14:08.171
machine translation: we don't have to generate
the output one word at a time.
1:14:10.210 --> 1:14:13.900
So this looks really, really, really great.
1:14:13.900 --> 1:14:20.358
On first view, at least. But there's one challenge with
this, and you can see it with the baseline.
1:14:20.358 --> 1:14:27.571
Of course there have been some improvements, but in
general the quality drop is often significant.
1:14:28.068 --> 1:14:32.075
So here you see the baseline models.
1:14:32.075 --> 1:14:38.466
You have a loss of ten BLEU points or something
like that.
1:14:38.878 --> 1:14:40.230
So why does the quality drop?
1:14:40.230 --> 1:14:41.640
So why is it happening?
1:14:43.903 --> 1:14:56.250
If you look at the errors, there are repetitive
tokens, so you get the same word generated twice in a row, or things like that.
1:14:56.536 --> 1:15:01.995
Broken sentences or disfluent sentences, so that
is exactly where autoregressive models are
1:15:01.995 --> 1:15:04.851
very good; we said that's even a bit of a problem there.
1:15:04.851 --> 1:15:07.390
They generate very fluent translations.
1:15:07.387 --> 1:15:10.898
Sometimes the translation doesn't have
anything to do with the input.
1:15:11.411 --> 1:15:14.047
But generally it always looks very fluent.
1:15:14.995 --> 1:15:20.865
Here it is exactly the opposite: the problem
is that we don't get really fluent translations.
1:15:21.421 --> 1:15:26.123
And that is mainly due to the challenge that
we have this independence assumption.
1:15:26.646 --> 1:15:35.873
So in this case, the probability of Y at the
second position is independent of what was
1:15:35.873 --> 1:15:40.632
generated at the first position, so we don't know what was generated there.
1:15:40.632 --> 1:15:43.740
We're just generating each position on its own.
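Written as formulas, the difference is only in the conditioning; this is the standard way the two factorizations are usually stated, with x the source, y the target and T the target length:

% autoregressive: each position is conditioned on everything generated before
P(y \mid x) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, x)

% non-autoregressive: given a predicted length, positions are conditionally independent
P(y \mid x) = P(T \mid x) \cdot \prod_{t=1}^{T} P(y_t \mid T, x)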
1:15:43.964 --> 1:15:55.439
You can see it also in a few examples.
1:15:55.439 --> 1:16:03.636
For example, you can over-penalize shifts.
1:16:04.024 --> 1:16:10.566
And while this is already an improvement again,
the problem is also similar to the following example.
1:16:11.071 --> 1:16:19.900
So you can, for example, translate it with 'feeling
down', or maybe you could also translate it
1:16:19.900 --> 1:16:31.105
with another expression. But if the first
position assumes the one translation
1:16:31.105 --> 1:16:34.594
and the second position assumes the other, you get a mix of both.
1:16:35.075 --> 1:16:42.908
So each position here, and that is one of the
main issues, doesn't know what the other positions generate.
1:16:43.243 --> 1:16:53.846
And for example, if you are translating something
into German, you can often translate things in two
1:16:53.846 --> 1:16:58.471
ways, with a different agreement.
1:16:58.999 --> 1:17:02.058
And then here you have to decide which of
the two forms to use.
1:17:02.162 --> 1:17:05.460
The decoder doesn't know which word
it has to select.
1:17:06.086 --> 1:17:14.789
I mean, of course, it knows the hidden state,
but in the end you have a probability distribution.
1:17:16.256 --> 1:17:20.026
And that is the important difference to the
autoregressive model.
1:17:20.026 --> 1:17:24.335
There you know what was selected, because you have put
it in again; here, you don't know that.
1:17:24.335 --> 1:17:29.660
If two options are equally probable, you don't
know which one is selected, and of course that
1:17:29.660 --> 1:17:32.832
determines what the next output should be.
1:17:33.333 --> 1:17:39.554
Yep, and we're going to look at that next time.
1:17:39.554 --> 1:17:39.986
Yes?
1:17:40.840 --> 1:17:44.935
Doesn't this also appear in the autoregressive
model, like when we're talking about training?
1:17:46.586 --> 1:17:48.412
The thing is, in the autoregressive model,
1:17:48.412 --> 1:17:50.183
you give it the correct previous word during training.
1:17:50.450 --> 1:17:55.827
So if you predict here, say, where the reference
is 'feeling', then you tell the model:
1:17:55.827 --> 1:17:59.573
the last one was 'feeling', and then it knows
the next one has to be 'down'.
1:17:59.573 --> 1:18:04.044
But here it doesn't know that, because it doesn't
get the correct previous word as input.
1:18:04.204 --> 1:18:24.286
Yes, that depends a bit on what exactly you do.
1:18:24.204 --> 1:18:27.973
But in training, of course, you just try to
make the correct one the one with the highest probability.
1:18:31.751 --> 1:18:38.181
So what you can do is use things like the CTC loss,
which can adjust for this.
1:18:38.181 --> 1:18:42.866
So then you can also allow for this kind of shifted correction.
1:18:42.866 --> 1:18:50.582
If your output is correct but shifted, with
the CTC loss you don't get the full penalty.
1:18:50.930 --> 1:18:58.486
It is just shifted by one, so it's a bit of a different
loss, which is mainly used in speech recognition.
1:19:00.040 --> 1:19:03.412
It can be used in order to address this problem.
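A minimal sketch of how such a loss can be set up with PyTorch's torch.nn.CTCLoss; the shapes and dummy tensors are only illustrative, not the exact setup from the lecture:

import torch
import torch.nn as nn

vocab_size, blank_id = 100, 0
decoder_len, target_len, batch = 12, 7, 1

# Hypothetical decoder output: (decoder positions, batch, vocab) log-probabilities.
log_probs = torch.randn(decoder_len, batch, vocab_size).log_softmax(dim=-1)
# Reference target tokens (they must not contain the blank symbol).
targets = torch.randint(1, vocab_size, (batch, target_len))

ctc = nn.CTCLoss(blank=blank_id)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((batch,), decoder_len, dtype=torch.long),
           target_lengths=torch.full((batch,), target_len, dtype=torch.long))

# The alignment between decoder positions and reference tokens is marginalized out,
# so an otherwise correct output that is merely shifted is not fully penalized.
print(loss.item())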
1:19:04.504 --> 1:19:13.844
The other problem for the non-autoregressive model is this
ambiguity in the training data that we need to mitigate.
1:19:13.844 --> 1:19:20.515
That's the example from before: if you translate
'thank you', there is more than one possible translation.
1:19:20.460 --> 1:19:31.925
And then it might end up mixing them, because it learns
one variant for the first position and another for the second.
1:19:32.492 --> 1:19:43.201
In order to prevent that, it would be helpful
if for one input there were only one output, as that makes
1:19:43.201 --> 1:19:47.002
it easier for the system to learn.
1:19:47.227 --> 1:19:53.867
It might be that for slightly different inputs
you have different outputs, but for the same input there is always the same output.
1:19:54.714 --> 1:19:57.467
That we can luckily very easily solve.
1:19:59.119 --> 1:19:59.908
And it's done.
1:19:59.908 --> 1:20:04.116
We just learned about the technique for it, which
is called knowledge distillation.
1:20:04.985 --> 1:20:13.398
So what we can do, and the easiest solution
to improve your non-autoregressive model, is to first
1:20:13.398 --> 1:20:16.457
train an autoregressive model.
1:20:16.457 --> 1:20:22.958
Then you decode your whole training data
with this model, and then you train the non-autoregressive model on its output.
1:20:23.603 --> 1:20:27.078
And the main advantage of that is that this
data is more consistent.
1:20:27.407 --> 1:20:33.995
So for the same input you always have the
same output.
1:20:33.995 --> 1:20:41.901
So you make your training data more
consistent, and it becomes easier to learn.
1:20:42.482 --> 1:20:54.471
So there is another advantage of knowledge
distillation and that advantage is you have
1:20:54.471 --> 1:20:59.156
more consistent training signals.
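A minimal sketch of this sequence-level knowledge-distillation recipe; teacher_translate and the student's training call are hypothetical placeholders, not a specific toolkit API:

def distill_training_data(teacher_translate, source_sentences):
    # Replace the human references with the autoregressive teacher's output,
    # so every source sentence maps to exactly one, consistent target.
    return [(src, teacher_translate(src)) for src in source_sentences]

# Usage (illustrative):
#   distilled = distill_training_data(teacher.translate, train_sources)
#   non_autoregressive_student.train_on(distilled)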
1:21:04.884 --> 1:21:10.630
There's another way to make things easier
at the beginning.
1:21:10.630 --> 1:21:16.467
There is this glancing, masked-model idea, where
you work with masks.
1:21:16.756 --> 1:21:26.080
So during training, especially at the beginning,
you already give the model some of the correct target tokens.
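A minimal sketch of this masking idea, assuming a plain token list and a ratio of revealed tokens that is lowered over training; the mask symbol and function name are illustrative:

import random

MASK = "<mask>"

def glance_inputs(target_tokens, keep_ratio):
    # Reveal a fraction of the correct target tokens as decoder input; the model
    # has to predict the masked rest, and the loss is typically computed only
    # on those masked positions.
    k = int(keep_ratio * len(target_tokens))
    revealed = set(random.sample(range(len(target_tokens)), k))
    return [tok if i in revealed else MASK for i, tok in enumerate(target_tokens)]

# Early in training keep_ratio is high (most tokens are given, the task is easy);
# it is lowered until the model has to predict everything in parallel.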
1:21:28.468 --> 1:21:38.407
And there is this 'K tokens at a time' idea, which
interpolates between autoregressive and non-autoregressive training.
1:21:40.000 --> 1:21:50.049
Some target positions stay open, and you always predict
only K of them at once; at first it is autoregressive, with K
1:21:50.049 --> 1:21:59.174
equal to one, so you always have one input
and one output, and then you predict partially in parallel with a larger K.
1:21:59.699 --> 1:22:05.825
So in that way you can slowly learn what is
a good and what is a bad answer.
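One way to picture the 'K tokens at a time' idea is as semi-autoregressive generation; the sketch below uses a hypothetical step_fn that returns k new tokens per call, and in training the same k interpolates between the autoregressive (k = 1) and the fully parallel setting:

def k_tokens_at_a_time(step_fn, encoder_states, target_length, k):
    # Predict k positions per step, conditioned on everything generated so far;
    # k = 1 is autoregressive, k = target_length is fully parallel.
    output = []
    while len(output) < target_length:
        output.extend(step_fn(encoder_states, output, k))  # k new tokens at once
    return output[:target_length]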
1:22:08.528 --> 1:22:10.862
It doesn't sound very efficient.
1:22:10.862 --> 1:22:12.578
But that is fine, because anyway
1:22:12.578 --> 1:22:15.323
you go over your training data several times.
1:22:15.875 --> 1:22:20.655
You can even switch in between.
1:22:20.655 --> 1:22:29.318
There is a whole area of work on this, where you
try to start with the easier setting.
1:22:31.271 --> 1:22:41.563
You have to tune that, and there's a whole body
of work on it; this is often done, and it doesn't
1:22:41.563 --> 1:22:46.598
mean it's less efficient, but it still helps.
1:22:49.389 --> 1:22:57.979
For later maybe here are some examples of
how much things help.
1:22:57.979 --> 1:23:04.958
Maybe one point here that is really important:
1:23:05.365 --> 1:23:13.787
Here's the translation performance and speed.
1:23:13.787 --> 1:23:24.407
One important point is what you compare
against as researchers.
1:23:24.784 --> 1:23:33.880
So yeah, if you compare to one very weak
baseline, a transformer even with beam search,
1:23:33.880 --> 1:23:40.522
then such a baseline is itself ten times slower
than a very strong autoregressive system.
1:23:40.961 --> 1:23:48.620
If you take a strong baseline, then the speed-up
goes down, depending on the setup, and here you
1:23:48.620 --> 1:23:53.454
have a lot of different speed-ups.
1:23:53.454 --> 1:24:03.261
Generally, it matters that you take a strong
baseline and not a very simple transformer.
1:24:07.407 --> 1:24:20.010
Yeah, with this one last thing that you can
do to speed up things and also reduce your
1:24:20.010 --> 1:24:25.950
memory is what is called half precision.
1:24:26.326 --> 1:24:29.139
And it is used especially for decoding; for training it can be an issue.
1:24:29.139 --> 1:24:31.148
Sometimes it also gets less stable.
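A minimal sketch of half-precision decoding in PyTorch; the tiny linear layer only stands in for a real translation model:

import torch

model = torch.nn.Linear(512, 512)   # placeholder for the real network
inputs = torch.randn(1, 512)

if torch.cuda.is_available():
    model = model.half().cuda()     # store weights as 16-bit floats
    inputs = inputs.half().cuda()   # roughly halves the memory footprint

with torch.no_grad():               # decoding only; training in fp16 is often
    output = model(inputs)          # less stable and usually needs loss scaling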
1:24:32.592 --> 1:24:45.184
With this we are nearly at the end, so what
you should remember is how we can do efficient machine
1:24:45.184 --> 1:24:46.963
translation.
1:24:47.007 --> 1:24:51.939
We have, for example, looked at knowledge
distillation.
1:24:51.939 --> 1:24:55.991
We have looked at non-autoregressive models.
1:24:55.991 --> 1:24:57.665
And we have seen several other techniques.
1:24:58.898 --> 1:25:02.383
That's it for today, and then only one request:
1:25:02.383 --> 1:25:08.430
So if you haven't done so, please fill out
the evaluation.
1:25:08.388 --> 1:25:20.127
So if you have done so already, thank you; and
hopefully the online people will do it as well.
1:25:20.320 --> 1:25:29.758
It is a possibility to tell us which things are
good and which are not, not the only one but the most
1:25:29.758 --> 1:25:30.937
efficient.
1:25:31.851 --> 1:25:35.871
So thanks to all the students doing it.
Okay, then thank you.