WEBVTT
0:00:03.663 --> 0:00:07.970 | |
Okay, then I should switch back to English, sorry.
0:00:08.528 --> 0:00:18.970 | |
So welcome to today's lecture in the course on machine translation, and today we're planning
0:00:18.970 --> 0:00:20.038 | |
to talk. | |
0:00:20.880 --> 0:00:31.845 | |
which will be a summary of how machine translation was done until around twenty
0:00:32.872 --> 0:00:38.471 | |
fourteen. So this was an approach which was used for quite a long time.
0:00:38.471 --> 0:00:47.070 | |
It was the first approach where at the end | |
the quality was really so good that it was | |
0:00:47.070 --> 0:00:49.969 | |
used as a commercial system. | |
0:00:49.990 --> 0:00:56.482 | |
Or something like that, so the first systems there were using statistical machine translation.
0:00:57.937 --> 0:01:02.706 | |
So when I came into the field this was the | |
main part of the lecture, so there would be | |
0:01:02.706 --> 0:01:07.912 | |
not be just one lecture; more than half of the full course, in quite some detail, would be about statistical
0:01:07.912 --> 0:01:09.063 | |
machine translation. | |
0:01:09.369 --> 0:01:23.381 | |
So what we try to do today is like get the | |
most important things, which I think are
0:01:23.381 --> 0:01:27.408 | |
still very important
0:01:27.267 --> 0:01:31.196 | |
for state-of-the-art systems.
0:01:31.952 --> 0:01:45.240 | |
Then we'll have the presentation about how | |
to evaluate the other part of the machine translation. | |
0:01:45.505 --> 0:01:58.396 | |
The other important thing is the language | |
modeling part; we will explain later how they combine.
0:01:59.539 --> 0:02:04.563 | |
Shortly mentioned this one already. | |
0:02:04.824 --> 0:02:06.025 | |
On Tuesday. | |
0:02:06.246 --> 0:02:21.849 | |
So in a lot of these explanations, how we | |
model the translation process, the notation might be surprising:
0:02:22.082 --> 0:02:27.905 | |
some people later say F is for foreign words; traditionally it came because the first models
0:02:27.905 --> 0:02:32.715 | |
which we'll discuss here are also referred to as the IBM models.
0:02:32.832 --> 0:02:40.043 | |
They were trained on French to English translation | |
directions and that's why they started using | |
0:02:40.043 --> 0:02:44.399 | |
F and E and then this was done for the next | |
twenty years. | |
0:02:44.664 --> 0:02:52.316 | |
So for how we are denoting the source words: we have a big I, typically the
0:02:52.316 --> 0:03:02.701 | |
length of the source sentence, and a small i the position; and similarly for the target side
0:03:02.701 --> 0:03:05.240 | |
the length and the position, analogously.
0:03:05.485 --> 0:03:13.248 | |
Things will get a bit complicated in this | |
way because it is not always clear what is | |
0:03:13.248 --> 0:03:13.704 | |
the source and what is the target.
0:03:14.014 --> 0:03:21.962 | |
See that there is this noisy channel model | |
which switches the direction in your model, | |
0:03:21.962 --> 0:03:25.616 | |
but in the application it's the target. | |
0:03:26.006 --> 0:03:37.077 | |
So that is why if you especially read these | |
papers, it might sometimes be a bit disturbing. | |
0:03:37.437 --> 0:03:40.209 | |
We try to keep it here always:
0:03:40.209 --> 0:03:48.427 | |
the source is F and the target is E, and even if we use a model where it's inverse, we'll keep it this way.
0:03:48.468 --> 0:03:55.138 | |
Don't get disturbed by that, and I think it's | |
possible to understand all that without this | |
0:03:55.138 --> 0:03:55.944 | |
confusion. | |
0:03:55.944 --> 0:04:01.734 | |
But in some of the papers you might get confused | |
because they switch the direction.
0:04:04.944 --> 0:04:17.138 | |
In general, in statistical machine translation, the goal is to model how we do translation.
0:04:17.377 --> 0:04:25.562 | |
But first we are seeing all our possible target | |
sentences as possible translations. | |
0:04:26.726 --> 0:04:37.495 | |
And we are assigning some probability to each combination, so we are modeling the probability of the target sentence given the source sentence.
0:04:39.359 --> 0:04:49.746 | |
And then we are doing a search over all possible | |
things or at least theoretically, and we are | |
0:04:49.746 --> 0:04:56.486 | |
trying to find the translation with the highest | |
probability. | |
0:04:56.936 --> 0:05:05.116 | |
And this general idea is also true for neural machine translation.
0:05:05.116 --> 0:05:07.633 | |
They differ in how this probability is modeled.
0:05:08.088 --> 0:05:10.801 | |
So these were then of course the two big challenges. | |
0:05:11.171 --> 0:05:17.414 | |
On the one hand, how can we estimate this | |
probability? | |
0:05:17.414 --> 0:05:21.615 | |
How probable is it that one sentence is the translation of the other?
0:05:22.262 --> 0:05:32.412 | |
The other challenge is the search, so we cannot, | |
of course, say we want to find the most probable | |
0:05:32.412 --> 0:05:33.759 | |
translation. | |
0:05:33.759 --> 0:05:42.045 | |
We cannot go over all possible English sentences | |
and calculate the probability. | |
0:05:43.103 --> 0:05:45.004 | |
So,. | |
0:05:45.165 --> 0:05:53.423 | |
What we have to do there is some intelligent search, where we look only at promising candidates and
0:05:53.423 --> 0:05:54.268 | |
compare. | |
0:05:54.734 --> 0:05:57.384 | |
That will be done. | |
0:05:57.384 --> 0:06:07.006 | |
This process of finding the best translation is called the decoding process.
0:06:07.247 --> 0:06:09.015 | |
That will be covered later.
0:06:09.015 --> 0:06:11.104 | |
Today we will concentrate on the model.
0:06:11.451 --> 0:06:23.566 | |
The model is trained using data, so in the | |
first step we're having data, we're somehow | |
0:06:23.566 --> 0:06:30.529 | |
having a definition of what the model looks | |
like. | |
0:06:34.034 --> 0:06:42.913 | |
And in statistical machine translation the | |
common model behind it
0:06:42.913 --> 0:06:46.358 | |
is what is referred to as the noisy channel model.
0:06:46.786 --> 0:06:55.475 | |
And this is motivated by the initial idea | |
from Shannon. | |
0:06:55.475 --> 0:07:02.457 | |
We have this that you can think of decoding. | |
0:07:02.722 --> 0:07:10.472 | |
So think of it as we have this text in maybe | |
German. | |
0:07:10.472 --> 0:07:21.147 | |
Originally it was an English text, but somebody used some kind of encoding.
0:07:21.021 --> 0:07:28.579 | |
Our task is to decipher it again, this crazy cipher of expressing things in German, and to recover
0:07:28.579 --> 0:07:31.993 | |
the meaning again, and to do that between the two languages.
0:07:32.452 --> 0:07:35.735 | |
And that is the idea behind this noisy channel model.
0:07:36.236 --> 0:07:47.209 | |
It goes through some type of channel which | |
adds noise to the source and then you receive | |
0:07:47.209 --> 0:07:48.811 | |
the message. | |
0:07:49.429 --> 0:08:00.190 | |
And then the idea is, can we now construct | |
the original message out of these messages | |
0:08:00.190 --> 0:08:05.070 | |
by modeling some of the channels here? | |
0:08:06.726 --> 0:08:15.797 | |
Here you see a bit the idea: there is the source message, which here is English.
0:08:15.797 --> 0:08:22.361 | |
It went through some channel and we received
the message. | |
0:08:22.682 --> 0:08:31.381 | |
If you're now looking at machine translation, your target language is English.
0:08:31.671 --> 0:08:44.388 | |
Here you see now a bit of this where the confusion | |
starts: English as the target language is
0:08:44.388 --> 0:08:47.700 | |
also the source message. | |
0:08:47.927 --> 0:08:48.674 | |
You can see. | |
0:08:48.674 --> 0:08:51.488 | |
There is also a mathematical view of how we model this.
0:08:52.592 --> 0:08:56.888 | |
It's a noisy channel model from a mathematic | |
point of view. | |
0:08:56.997 --> 0:09:00.245 | |
So this is again our general formula. | |
0:09:00.245 --> 0:09:08.623 | |
We are looking for the most probable translation | |
and that is the translation that has the highest | |
0:09:08.623 --> 0:09:09.735 | |
probability. | |
0:09:09.809 --> 0:09:19.467 | |
We are not interested in the probability itself, but we are interested in the target sentence
0:09:19.467 --> 0:09:22.082 | |
E for which this probability is highest.
0:09:23.483 --> 0:09:33.479 | |
Therefore, we can use the definition of conditional probability and Bayes'
0:09:33.479 --> 0:09:42.712 | |
rule, so this probability equals the probability of F given E times the probability of E, divided
0:09:42.712 --> 0:09:44.858 | |
by the probability of F.
0:09:45.525 --> 0:09:48.218 | |
Now you see mathematically this confusion.
0:09:48.218 --> 0:09:54.983 | |
Originally we are interested in the probability | |
of the target sentence given the source sentence.
0:09:55.295 --> 0:10:00.742 | |
And if we are modeling things now, we are | |
looking here at the inverse direction, so the | |
0:10:00.742 --> 0:10:06.499 | |
probability of F given E, that is, the probability of the source sentence given the target sentence,
0:10:06.499 --> 0:10:10.832 | |
times the probability of the target sentence, divided by the probability of the source sentence.
0:10:13.033 --> 0:10:15.353 | |
Why are we doing this? | |
0:10:15.353 --> 0:10:24.333 | |
Maybe, I mean, of course, on the one hand it's motivated by our noisy channel model, by this type
0:10:24.333 --> 0:10:27.058 | |
of how we are modeling it. | |
0:10:27.058 --> 0:10:30.791 | |
The other interesting thing is that. | |
0:10:31.231 --> 0:10:40.019 | |
So we are looking at this probability up there, which we had before, and we can reformulate it so that we can
0:10:40.019 --> 0:10:40.775 | |
remove the denominator.
0:10:41.181 --> 0:10:46.164 | |
If we are searching for the most probable translation, this is fixed.
0:10:46.164 --> 0:10:47.800 | |
This doesn't change. | |
0:10:47.800 --> 0:10:52.550 | |
We have an input, the source sentence, and we cannot change it.
0:10:52.812 --> 0:11:02.780 | |
It is always the same, so we can ignore it in the argmax because the denominator is exactly
0:11:02.780 --> 0:11:03.939 | |
the same. | |
0:11:04.344 --> 0:11:06.683 | |
And then we have P of F given
0:11:06.606 --> 0:11:13.177 | |
E times P of E, and so we are modeling the translation process on the one hand with
0:11:13.177 --> 0:11:19.748 | |
the translation model, which models how probable the sentence F is given E, and on the other
0:11:19.748 --> 0:11:25.958 | |
hand with the language model, which models only how probable this English sentence is.
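As a compact summary of the derivation just described (in the lecture's notation, with f the source and e the target):

$$\hat{e} \;=\; \operatorname*{argmax}_{e} P(e \mid f) \;=\; \operatorname*{argmax}_{e} \frac{P(f \mid e)\,P(e)}{P(f)} \;=\; \operatorname*{argmax}_{e} \underbrace{P(f \mid e)}_{\text{translation model}} \; \underbrace{P(e)}_{\text{language model}}$$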
0:11:26.586 --> 0:11:39.366 | |
That somebody wrote this sentence in this language; from the translation point of view, this is about fluency.
0:11:40.200 --> 0:11:44.416 | |
You should have in German, for example, agreement. | |
0:11:44.416 --> 0:11:50.863 | |
If the agreement is not right, that's probably not said by anybody in German.
0:11:50.863 --> 0:11:58.220 | |
Nobody would say something like 'das schönstes Haus' because it's not according to the German rules.
0:11:58.598 --> 0:12:02.302 | |
So this can be modeled by the language model. | |
0:12:02.542 --> 0:12:09.855 | |
And you have the translation model, which models how things get translated between the languages.
0:12:10.910 --> 0:12:18.775 | |
And here you see our confusion again, and now here is the translation model, which
0:12:18.775 --> 0:12:24.360 | |
is a bit counterintuitive because it is the probability of a source sentence given the
0:12:24.360 --> 0:12:24.868 | |
target. | |
0:12:26.306 --> 0:12:35.094 | |
We have to do that for the Bayes formula, but in the following slides I'll talk again about the other direction.
0:12:35.535 --> 0:12:45.414 | |
Because yeah, that's more intuitive that you | |
model the translation of the target sentence | |
0:12:45.414 --> 0:12:48.377 | |
given the source sentence. | |
0:12:50.930 --> 0:12:55.668 | |
And this is what we want to talk about today. | |
0:12:55.668 --> 0:13:01.023 | |
We later talk about language models how to | |
do that. | |
0:13:00.940 --> 0:13:04.493 | |
And maybe also how to combine them. | |
0:13:04.493 --> 0:13:13.080 | |
But the focus today will be on how we can model this probability, how to generate a
0:13:13.080 --> 0:13:16.535 | |
translation from source to target? | |
0:13:19.960 --> 0:13:24.263 | |
How can we do that and the easiest thing? | |
0:13:24.263 --> 0:13:33.588 | |
Maybe if you think about statistics, you count | |
how many examples you have, how often source and target
0:13:33.588 --> 0:13:39.121 | |
sentences co-occur, and that gives you an estimation.
0:13:40.160 --> 0:13:51.632 | |
However, as in other models, that is not possible, because most sentences you will never
0:13:51.632 --> 0:13:52.780 | |
see, so. | |
0:13:53.333 --> 0:14:06.924 | |
So what we have to do is break up the translation | |
process into smaller models and model each | |
0:14:06.924 --> 0:14:09.555 | |
of the decisions. | |
0:14:09.970 --> 0:14:26.300 | |
So the simple solution, like when you model how you throw a dice, is that you count events and that gives you
0:14:26.300 --> 0:14:29.454 | |
the probability. | |
0:14:29.449 --> 0:14:40.439 | |
But here that does not work in principle, because each event is so rare that most of them have never happened.
0:14:43.063 --> 0:14:48.164 | |
It might be that in all your training data you have never seen this type of sentence.
0:14:49.589 --> 0:14:52.388 | |
How can we do that? | |
0:14:52.388 --> 0:15:04.845 | |
We look in statistical machine translation into two different models, first a generative model
0:15:04.845 --> 0:15:05.825 | |
where we model the translation word by word.
0:15:06.166 --> 0:15:11.736 | |
So the idea was to really model each individual translation between words.
0:15:12.052 --> 0:15:22.598 | |
So you break down the translation of a full | |
sentence into the translation of each individual
0:15:22.598 --> 0:15:23.264 | |
word. | |
0:15:23.264 --> 0:15:31.922 | |
So you say if you have 'the black cat', you translate each word and thereby the full sentence.
0:15:32.932 --> 0:15:38.797 | |
Of course, this has some challenges, any ideas | |
where this type of model could be very challenging. | |
0:15:40.240 --> 0:15:47.396 | |
(Student answer, partly inaudible.)
0:15:47.867 --> 0:15:51.592 | |
Yes, but you could at least use a bit of the | |
context around it. | |
0:15:51.592 --> 0:15:55.491 | |
It will not only depend on the word, but it's | |
already challenging. | |
0:15:55.491 --> 0:15:59.157 | |
You make things very hard, so that's definitely | |
one challenge. | |
0:16:00.500 --> 0:16:07.085 | |
Any other challenge, maybe something we talked about before?
0:16:08.348 --> 0:16:11.483 | |
Yes, they are challenging. | |
0:16:11.483 --> 0:16:21.817 | |
You have to do something like splitting words, but the
problem is that you might introduce errors. | |
0:16:21.841 --> 0:16:23.298 | |
later, and that makes things very complicated.
0:16:25.265 --> 0:16:28.153 | |
Wrong splitting is one of the things that make it very complicated.
0:16:32.032 --> 0:16:35.580 | |
Chinese, for example, and also maybe Japanese.
0:16:35.735 --> 0:16:41.203 | |
In German, yes, especially like these are | |
all right. | |
0:16:41.203 --> 0:16:46.981 | |
The first thing is maybe the one which is | |
most obvious. | |
0:16:46.981 --> 0:16:49.972 | |
It is raining cats and dogs. | |
0:16:51.631 --> 0:17:01.837 | |
If you translate it into German, you don't translate the 'cats' directly; you translate this whole chunk into something else because there is
0:17:01.837 --> 0:17:03.261 | |
not really a word-to-word correspondence.
0:17:03.403 --> 0:17:08.610 | |
Mean, of course, in generally there is this | |
type of alignment, so there is a correspondence | |
0:17:08.610 --> 0:17:11.439 | |
between words in English and the words in German. | |
0:17:11.439 --> 0:17:16.363 | |
However, that's not true for all sentences, | |
so in some sentences you cannot really say | |
0:17:16.363 --> 0:17:18.174 | |
this word translates into that. | |
0:17:18.498 --> 0:17:21.583 | |
But you can only map this whole
phrase. | |
0:17:21.583 --> 0:17:23.482 | |
This maps as a whole into something else.
0:17:23.563 --> 0:17:30.970 | |
If you think about the don't in English, the | |
do is not really clear where that should
0:17:30.970 --> 0:17:31.895 | |
be aligned.
0:17:32.712 --> 0:17:39.079 | |
Then for a long time the most successful approach | |
was this phrase based translation model where | |
0:17:39.079 --> 0:17:45.511 | |
the idea is your basic block is not a single word but a longer phrase, and you try to build translations
0:17:45.511 --> 0:17:46.572 | |
based on these. | |
0:17:48.768 --> 0:17:54.105 | |
But let's start with the word-based model and what you need there.
0:17:54.105 --> 0:18:03.470 | |
There are two main knowledge sources, so on the one hand we have a lexicon which lists
0:18:03.470 --> 0:18:05.786 | |
possible translations. | |
0:18:06.166 --> 0:18:16.084 | |
The main difference between the lexicon in statistical machine translation and a lexicon
0:18:16.084 --> 0:18:17.550 | |
as you know it is the following.
0:18:17.837 --> 0:18:23.590 | |
In a traditional lexicon you know how a word is
translated and mainly it's giving you two or | |
0:18:23.590 --> 0:18:26.367 | |
three translations with an example sentence.
0:18:26.367 --> 0:18:30.136 | |
So in this context it gets translated like that, and so on.
0:18:30.570 --> 0:18:38.822 | |
In order to model that and work with probabilities, what we need in machine translation is these probabilities:
0:18:39.099 --> 0:18:47.962 | |
So if we have the German word 'Wagen', it might be that with a probability of zero point five
0:18:47.962 --> 0:18:51.545 | |
it's translated into 'vehicle'.
0:18:52.792 --> 0:18:58.876 | |
And of course this is not easy to be created | |
by a human.
0:18:58.876 --> 0:19:07.960 | |
If I ask you to give probabilities for how probable 'vehicle' is, that might be hard. So how
0:19:07.960 --> 0:19:12.848 | |
we are doing it is again that the lexicon will be created automatically from a corpus.
0:19:13.333 --> 0:19:18.754 | |
And we're just counting here, so we count how often the word occurs and how often it co-
0:19:18.754 --> 0:19:24.425 | |
occurs with 'vehicle', and then we're taking the ratio and saying in half of the cases on the
0:19:24.425 --> 0:19:26.481 | |
English side there was 'vehicle'.
0:19:26.481 --> 0:19:31.840 | |
So the probability of 'vehicle' given 'Wagen' is something like zero point
0:19:31.840 --> 0:19:32.214 | |
five. | |
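A minimal sketch of this relative-frequency estimation from aligned word counts; the words and numbers are illustrative, not real corpus statistics:

```python
from collections import Counter

# hypothetical aligned word pairs (source_word, target_word) collected from a corpus
aligned_pairs = [("Wagen", "vehicle"), ("Wagen", "car"),
                 ("Haus", "house"), ("Wagen", "vehicle")]

pair_counts = Counter(aligned_pairs)
source_counts = Counter(src for src, _ in aligned_pairs)

def t(target, source):
    """Relative frequency t(target | source) = count(source, target) / count(source)."""
    return pair_counts[(source, target)] / source_counts[source]

print(t("vehicle", "Wagen"))  # 2/3 with these toy counts
```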
0:19:33.793 --> 0:19:46.669 | |
Then we need another concept, and that is
this concept of alignment, and now you can | |
0:19:46.669 --> 0:19:47.578 | |
have different types of alignments.
0:19:47.667 --> 0:19:53.113 | |
Since this is quite complicated, the alignment | |
in general can be complex. | |
0:19:53.113 --> 0:19:55.689 | |
It can be that it's not only one-to-one.
0:19:55.895 --> 0:20:04.283 | |
It can be that two words align on the source or target side, and it's also ambiguous.
0:20:04.283 --> 0:20:13.761 | |
It can be that you say only these two words together are aligned, and other words are
0:20:13.761 --> 0:20:15.504 | |
aligned or not. | |
0:20:15.875 --> 0:20:21.581 | |
For example, should the 'do' be aligned to the 'nicht' in
German? | |
0:20:21.581 --> 0:20:29.301 | |
It's only there because in German there is the negation, so maybe it should be aligned, maybe not.
0:20:30.510 --> 0:20:39.736 | |
However, typically it's formalized, and it's formalized as a function from the target positions to the source positions.
0:20:40.180 --> 0:20:44.051 | |
And that is to make these models get easier | |
and clearer. | |
0:20:44.304 --> 0:20:49.860 | |
What does it mean that you have a function? It means that each
0:20:49.809 --> 0:20:58.700 | |
target word is aligned to only one source word, because a function
0:20:58.700 --> 0:21:00.384 | |
maps each target position to exactly one source position.
0:21:00.384 --> 0:21:05.999 | |
However, a source word can be hit by several target words.
0:21:06.286 --> 0:21:11.332 | |
So you are allowing for one to many alignments, | |
but not for many to one alignment. | |
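As a small illustration of this function view of an alignment (positions and words are hypothetical):

```python
source = ["ich", "besuche", "einen", "freund"]   # source positions 1..4
target = ["I", "visit", "a", "friend"]           # target positions 1..4

# Alignment as a function: each target position maps to exactly one source position.
a = {1: 1, 2: 2, 3: 3, 4: 4}

# One-to-many is allowed in this direction: two target words may point to the same
# source word, but a single target word can never point to two source words.
a_one_to_many = {1: 1, 2: 2, 3: 2, 4: 4}
```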
0:21:11.831 --> 0:21:17.848 | |
That is a bit of a challenge because you assume | |
an alignment should be symmetrical.
0:21:17.848 --> 0:21:24.372 | |
So if you look at a parallel sentence, it | |
should not matter if you look at it from German | |
0:21:24.372 --> 0:21:26.764 | |
to English or English to German. | |
0:21:26.764 --> 0:21:34.352 | |
However, it makes these models possible, and we'll see that for the phrase-based
0:21:34.352 --> 0:21:36.545 | |
models we need these alignments.
0:21:36.836 --> 0:21:41.423 | |
So this alignment was the most important output of the word-based models.
0:21:41.423 --> 0:21:47.763 | |
For the next twenty years you needed the word-based models to generate this type of alignment,
0:21:47.763 --> 0:21:50.798 | |
which is then the first step for the phrase-based
0:21:51.931 --> 0:21:59.642 | |
approach, and there you can then combine them again, like both directions into one, as we'll see.
0:22:00.280 --> 0:22:06.850 | |
This alignment is very important and allows | |
us to do this type of separation. | |
0:22:08.308 --> 0:22:15.786 | |
And the most commonly used word-based models are these models referred to as IBM
0:22:15.786 --> 0:22:25.422 | |
models, and there is a sequence of them with | |
great names. And they were very commonly
0:22:25.422 --> 0:22:26.050 | |
used. | |
0:22:26.246 --> 0:22:31.719 | |
We'll mainly focus on the simple one here | |
and look how this works and then not do all | |
0:22:31.719 --> 0:22:34.138 | |
the details about the further models. | |
0:22:34.138 --> 0:22:38.084 | |
The interesting thing is also that all of | |
them are important. | |
0:22:38.084 --> 0:22:43.366 | |
So if you want to train this alignment, what you normally do is first train IBM model one.
0:22:43.743 --> 0:22:50.940 | |
Then you take that as your initialization to then train IBM model two, and so on.
0:22:50.940 --> 0:22:53.734 | |
The motivation for that is the following.
0:22:53.734 --> 0:23:00.462 | |
The first model is so simple that
you can even find a global optimum, so it gives | |
0:23:00.462 --> 0:23:06.403 | |
you a good starting point for the next one | |
where the optimization in finding the right | |
0:23:06.403 --> 0:23:12.344 | |
model is more difficult, and therefore the default technique was to make your model
0:23:12.344 --> 0:23:13.641 | |
step by step more complex.
0:23:15.195 --> 0:23:27.333 | |
In these models we are breaking down the probability | |
into smaller steps and then we can define: | |
0:23:27.367 --> 0:23:38.981 | |
You see it's now a bit different: it's the probability of the target sentence and one specific alignment, given the source.
0:23:39.299 --> 0:23:42.729 | |
We'll later learn how we can then go from
one alignment to the full set. | |
0:23:43.203 --> 0:23:52.889 | |
So it is the probability of the target sentence and one alignment between the source and target sentence,
0:23:52.889 --> 0:23:56.599 | |
where the alignment is this type of function.
0:23:57.057 --> 0:24:14.347 | |
This is in order to ensure that every target word is aligned to something.
0:24:15.835 --> 0:24:28.148 | |
So first of all you have some epsilon; the epsilon is just a normalization factor so that everything
0:24:28.148 --> 0:24:31.739 | |
somehow sums up to a probability.
0:24:31.631 --> 0:24:37.539 | |
It is divided by the length of the source sentence plus one, to the power of the length of the target sentence.
0:24:37.937 --> 0:24:50.987 | |
And this is somehow the probability of this | |
alignment. | |
0:24:51.131 --> 0:24:53.224 | |
So is this alignment probable or not? | |
0:24:53.224 --> 0:24:55.373 | |
Of course you can have some intuition. | |
0:24:55.373 --> 0:24:58.403 | |
So if there's a lot of crossing, it may be | |
not a good alignment.
0:24:58.403 --> 0:25:03.196 | |
If all of the words align to the same one | |
might be not a good alignment, but generally | |
0:25:03.196 --> 0:25:06.501 | |
it's difficult to really describe what is a | |
good alignment. | |
0:25:07.067 --> 0:25:11.482 | |
Say for the first model that's the most simple | |
thing. | |
0:25:11.482 --> 0:25:18.760 | |
What can be the most simple thing if you think | |
about giving a probability to some event? | |
0:25:21.401 --> 0:25:25.973 | |
Yes exactly, so just take the uniform distribution. | |
0:25:25.973 --> 0:25:33.534 | |
If we don't really know, the easiest thing is to model all alignments as equally probable; of course
0:25:33.534 --> 0:25:38.105 | |
that is not true, but it's giving you a good | |
starting point.
0:25:38.618 --> 0:25:44.519 | |
And so this one is just the number of all possible alignments for this sentence.
0:25:44.644 --> 0:25:53.096 | |
So how many alignments are possible, so the | |
first target word can be allied to all sources | |
0:25:53.096 --> 0:25:53.746 | |
words.
0:25:54.234 --> 0:26:09.743 | |
The second one can also be aligned to all source words, and the third one also, and so on.
0:26:10.850 --> 0:26:13.678 | |
This is the number of alignments. | |
0:26:13.678 --> 0:26:19.002 | |
The second part is to model the probability | |
of the translation. | |
0:26:19.439 --> 0:26:31.596 | |
And there it's now nice to have this function, so now we are making the product over all target positions.
0:26:31.911 --> 0:26:40.068 | |
And we are making a very strong independent | |
assumption because in these models we normally | |
0:26:40.068 --> 0:26:45.715 | |
assume the translation probability of one word | |
is independent. | |
0:26:46.126 --> 0:26:49.800 | |
So how you translate, for example, 'visit' is independent
of all the other parts. | |
0:26:50.290 --> 0:26:52.907 | |
That is very strong and very bad. | |
0:26:52.907 --> 0:26:55.294 | |
Yeah, you should do it better. | |
0:26:55.294 --> 0:27:00.452 | |
We know that it's wrong because how you translate | |
this depends on. | |
0:27:00.452 --> 0:27:05.302 | |
However, it's a first easy solution and again | |
a good starting. | |
0:27:05.966 --> 0:27:14.237 | |
So what you do is that you take a product | |
of all target words and take the translation probability
0:27:14.237 --> 0:27:15.707 | |
of this target word given the source word it is aligned to.
0:27:16.076 --> 0:27:23.901 | |
And because we know that there is always exactly one source word aligned to it, this is well defined.
0:27:24.344 --> 0:27:37.409 | |
So it is, for example, the probability of 'visit' given the source word it is aligned to, and so on.
0:27:38.098 --> 0:27:51.943 | |
So mostly, as we have it here, the probability is epsilon divided by the source length plus one to the power of the target length.
0:27:53.913 --> 0:27:58.401 | |
And then there is a small issue somewhere in the last one.
0:27:58.401 --> 0:28:04.484 | |
There is an error and a switch, so it is the
other way around. | |
0:28:04.985 --> 0:28:07.511 | |
Then you have your translation model. | |
0:28:07.511 --> 0:28:12.498 | |
Let's assume you have your model trained; then this is only about assigning probabilities.
0:28:12.953 --> 0:28:25.466 | |
And then this sentence has the probability | |
of generating I visit a friend given that you | |
0:28:25.466 --> 0:28:31.371 | |
have the source sentence 'Ich besuche einen Freund':
0:28:32.012 --> 0:28:34.498 | |
something times ten to the power of minus five.
0:28:35.155 --> 0:28:36.098 | |
So this is your model. | |
0:28:36.098 --> 0:28:37.738 | |
This is how you're applying your model. | |
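A minimal sketch of how such a model assigns a probability to a target sentence under one given alignment; the lexicon values below are made up for illustration:

```python
EPSILON = 1.0  # normalization constant from the formula

# Hypothetical lexicon t(target_word | source_word); values are illustrative.
t = {("I", "ich"): 0.8, ("visit", "besuche"): 0.6,
     ("a", "einen"): 0.5, ("friend", "freund"): 0.7}

def model1_prob(source, target, alignment):
    """P(target, alignment | source) = EPSILON / (len(source)+1)**len(target)
    times the product of t(e_j | f_a(j)); alignment[j] is the 1-based source
    position for target position j, with 0 standing for the empty word."""
    prob = EPSILON / (len(source) + 1) ** len(target)
    for j, e_word in enumerate(target):
        i = alignment[j]
        f_word = source[i - 1] if i > 0 else "NULL"
        prob *= t.get((e_word, f_word), 1e-6)  # tiny back-off for unseen pairs
    return prob

src = ["ich", "besuche", "einen", "freund"]
tgt = ["I", "visit", "a", "friend"]
print(model1_prob(src, tgt, [1, 2, 3, 4]))
```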
0:28:39.479 --> 0:28:44.220 | |
As we said, it's the most simple model: you assume that all word translations are
0:28:44.204 --> 0:28:46.540 | |
Independent of each other. | |
0:28:46.540 --> 0:28:54.069 | |
You assume that all alignments are equally probable, and then the only thing you need
0:28:54.069 --> 0:29:00.126 | |
for this type of model is to have this lexicon | |
in order to calculate the probabilities.
0:29:00.940 --> 0:29:04.560 | |
And that is, of course, now the training process. | |
0:29:04.560 --> 0:29:08.180 | |
The question is how do we get this type of | |
lexicon?
0:29:09.609 --> 0:29:15.461 | |
But before we look into the training, do you | |
have any questions about the model itself? | |
0:29:21.101 --> 0:29:26.816 | |
The problem in training is that we have incomplete | |
data. | |
0:29:26.816 --> 0:29:32.432 | |
So if you want to count, as I just said, you would need the alignment in order to know what to count.
0:29:33.073 --> 0:29:39.348 | |
However, if you don't have the alignment, | |
on the other hand, if you would have a lexicon | |
0:29:39.348 --> 0:29:44.495 | |
you could maybe generate the alignment by choosing for each word the most probable translation.
0:29:45.225 --> 0:29:55.667 | |
And this is the very common problem that you | |
have this type of incomplete data where you | |
0:29:55.667 --> 0:29:59.656 | |
are missing one type of information.
0:30:00.120 --> 0:30:08.767 | |
And you can model this by considering the | |
alignment as your hidden variable and then | |
0:30:08.767 --> 0:30:17.619 | |
you can use the expectation maximization algorithm | |
in order to generate the alignment. | |
0:30:17.577 --> 0:30:26.801 | |
So the nice thing is that you only need your | |
parallel data, which is aligned on sentence | |
0:30:26.801 --> 0:30:29.392 | |
level, but normally not on the word level.
0:30:29.389 --> 0:30:33.720 | |
Creating word alignments by hand is just a lot of work, as we saw last time.
0:30:33.720 --> 0:30:39.567 | |
Typically what you have is this type of corpus where only sentences are aligned.
0:30:41.561 --> 0:30:50.364 | |
And yeah, the EM algorithm sounds very fancy.
0:30:50.364 --> 0:30:58.605 | |
However, let's again look at it at a little higher level.
0:30:58.838 --> 0:31:05.841 | |
So you're initializing a model by uniform | |
distribution. | |
0:31:05.841 --> 0:31:14.719 | |
You're just saying, for the lexicon, that all words are equally probable.
0:31:15.215 --> 0:31:23.872 | |
And then you apply your model to the data, | |
and that is your expectation step. | |
0:31:23.872 --> 0:31:30.421 | |
So given this initial lexicon, we are now | |
calculating the alignment probabilities.
0:31:30.951 --> 0:31:36.043 | |
So we can now take all our parallel sentences, | |
and of course for each of them check what is the most
0:31:36.043 --> 0:31:36.591 | |
probable alignment.
0:31:38.338 --> 0:31:49.851 | |
And then, of course, at the beginning you see which words are most often aligned.
0:31:50.350 --> 0:31:58.105 | |
Once we have done this expectation step, we | |
can next do the maximization step and based | |
0:31:58.105 --> 0:32:06.036 | |
on this guest alignment, which we have, we | |
can now learn better translation probabilities | |
0:32:06.036 --> 0:32:09.297 | |
by just counting how often words are aligned to each other.
0:32:09.829 --> 0:32:22.289 | |
And then you iterate these steps. We can make this whole process even more stable by not only taking
0:32:22.289 --> 0:32:26.366 | |
the most probable alignment in the
0:32:26.346 --> 0:32:36.839 | |
second step, but instead calculating for all possible alignments the alignment probability
0:32:36.839 --> 0:32:40.009 | |
and weighing the co-occurrence counts.
0:32:40.000 --> 0:32:41.593 | |
Then things are more stable.
0:32:42.942 --> 0:32:49.249 | |
Why could that be very challenging if we do | |
it in general and really calculate all probabilities | |
0:32:49.249 --> 0:32:49.834 | |
for all? | |
0:32:53.673 --> 0:32:55.905 | |
How many alignments are there for a sentence?
0:32:58.498 --> 0:33:03.344 | |
Yes there, we just saw that in the formula | |
if you remember. | |
0:33:03.984 --> 0:33:12.336 | |
This was the formula so it's exponential in | |
the lengths of the target sentence. | |
0:33:12.336 --> 0:33:15.259 | |
So to calculate all of them would
0:33:15.415 --> 0:33:18.500 | |
be very inefficient and not really possible.
0:33:18.500 --> 0:33:25.424 | |
The nice thing is we can again use some type | |
of dynamic programming, so then we can do this | |
0:33:25.424 --> 0:33:27.983 | |
without really calculating audit. | |
0:33:28.948 --> 0:33:40.791 | |
We have in the next five slides or so the most equations in the whole lecture, so don't
0:33:40.791 --> 0:33:41.713 | |
worry. | |
0:33:42.902 --> 0:34:01.427 | |
So we said we have first the expectation step, where it is about calculating the alignment probabilities.
0:34:02.022 --> 0:34:20.253 | |
And we can do this with our initial definition of the model, because of this formula.
0:34:20.160 --> 0:34:25.392 | |
So we can define P of A given E and F as P of E and A given F, divided by P of E given F.
0:34:25.905 --> 0:34:30.562 | |
This is just the normal definition of a conditional | |
probability. | |
0:34:31.231 --> 0:34:37.937 | |
And what we then need to be able to calculate is P of E given F.
0:34:37.937 --> 0:34:41.441 | |
P of E given F is still again quite
0:34:41.982 --> 0:34:56.554 | |
simple: it is the probability of the target sentence given the source sentence, summed over all alignments.
0:34:57.637 --> 0:35:15.047 | |
So let's just see how to calculate this probability.
0:35:15.215 --> 0:35:21.258 | |
So in here we can then put in our original formula.
0:35:21.201 --> 0:35:28.023 | |
We sum over the possible alignments of the first word, and so on, until the sum over
0:35:28.023 --> 0:35:30.030 | |
all possible alignments of the last word.
0:35:29.990 --> 0:35:41.590 | |
And then we have here the probability of the alignment times this product of translation probabilities.
0:35:42.562 --> 0:35:58.857 | |
Now this one is independent of the alignment, | |
so we can put it to the front here. | |
0:35:58.959 --> 0:36:03.537 | |
And now this is where dynamic programming | |
works in. | |
0:36:03.537 --> 0:36:08.556 | |
We can change that and make thereby things | |
a lot easier. | |
0:36:08.668 --> 0:36:21.783 | |
We can reformulate it like this, just as a product over all target positions, and then inside it's the
0:36:21.783 --> 0:36:26.456 | |
sum over all source positions. | |
0:36:27.127 --> 0:36:36.454 | |
Maybe at least the intuition why this is equal | |
is a lot easier if you look into it as graphic. | |
0:36:36.816 --> 0:36:39.041 | |
So what we have here is the table. | |
0:36:39.041 --> 0:36:42.345 | |
We have the target positions and the source positions.
0:36:42.862 --> 0:37:03.643 | |
And we have to sum up all possible paths through that table. The nice thing is that in each of
0:37:03.643 --> 0:37:07.127 | |
these paths the probabilities are independent of each other.
0:37:07.607 --> 0:37:19.678 | |
In order to get the sum over all paths through this table you can use dynamic programming
0:37:19.678 --> 0:37:27.002 | |
and then say this probability is exactly the same as
0:37:26.886 --> 0:37:34.618 | |
the sum of this column times the sum of this column, times the sum of this column, and so on.
0:37:35.255 --> 0:37:41.823 | |
That is the same as if you go through all possible paths here and multiply always the
0:37:41.823 --> 0:37:42.577 | |
elements. | |
0:37:43.923 --> 0:37:54.227 | |
And that is a simplification because now we only have a quadratic number of terms and we don't have
0:37:54.227 --> 0:37:55.029 | |
to go over exponentially many alignments.
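A tiny numeric check of this rearrangement (sum over all alignment functions versus product of column sums), with made-up translation probabilities:

```python
from itertools import product

# hypothetical t(e_j | f_i) values for 2 target words and 3 source positions
t = [[0.2, 0.5, 0.3],   # target word 1
     [0.1, 0.8, 0.1]]   # target word 2

# brute force: sum over all alignment functions a(j) -> i
brute = sum(t[0][a0] * t[1][a1] for a0, a1 in product(range(3), repeat=2))

# factorized: product over target positions of the column sums
factorized = sum(t[0]) * sum(t[1])

print(brute, factorized)  # the two values are equal
```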
0:37:55.355 --> 0:38:12.315 | |
Similar, I guess, to what you may have seen: the same type of algorithm is used for, what is it?
0:38:14.314 --> 0:38:19.926 | |
Yeah, well yeah, so that is the same idea.
But yeah, I think graphically this is visible even if you don't know exactly the math.
0:38:32.472 --> 0:38:49.786 | |
Now we put these both together, so if you really take these two probabilities and put these two formulas
0:38:49.786 --> 0:38:51.750 | |
together,
0:38:51.611 --> 0:38:56.661 | |
things get eliminated and then you get your final formula.
0:38:56.716 --> 0:39:01.148 | |
And that now really makes sense intuitively.
0:39:01.401 --> 0:39:08.301 | |
So the probability of an alignment is the | |
product over all target positions, and then it's
0:39:08.301 --> 0:39:15.124 | |
the probability to translate a word into the word it is aligned to, divided by the sum over
0:39:15.124 --> 0:39:17.915 | |
all the source words in the sentence.
0:39:18.678 --> 0:39:31.773 | |
If you look at this again, it really makes sense.
0:39:31.891 --> 0:39:43.872 | |
So you're looking at how probable it is to | |
translate compared to all the other words. | |
0:39:43.872 --> 0:39:45.404 | |
So you're normalizing.
0:39:45.865 --> 0:39:48.543 | |
So and that gives you the alignment probability. | |
0:39:48.768 --> 0:39:54.949 | |
Somehow it's not only that it's mathematically | |
correct if you look at it this way, it's somehow | |
0:39:54.949 --> 0:39:55.785 | |
intuitively. | |
0:39:55.785 --> 0:39:58.682 | |
So if you would ask how good it is to align,
0:39:58.638 --> 0:40:04.562 | |
say, a source word to 'visit': it should depend on how good the translation
0:40:04.562 --> 0:40:10.620 | |
probability is compared to how good the other words in the sentence are, and how probable it is
0:40:10.620 --> 0:40:12.639 | |
that I align it to them.
0:40:15.655 --> 0:40:26.131 | |
Then you have done the expectation step; the next thing is now the maximization step, so we have
0:40:26.131 --> 0:40:30.344 | |
now the probability of an alignment. | |
0:40:31.451 --> 0:40:37.099 | |
Intuitively, that means how often are words | |
aligned to each other given this alignment,
0:40:37.099 --> 0:40:39.281 | |
or, more in a probabilistic definition:
0:40:39.281 --> 0:40:43.581 | |
What is the expectation value that they are | |
aligned to each other? | |
0:40:43.581 --> 0:40:49.613 | |
So if there are a lot of alignments with high probability in which they're aligned to each other, then the count is high.
0:40:50.050 --> 0:41:07.501 | |
So the count of E given F, given our parallel data, is a sum over all possible alignments.
0:41:07.968 --> 0:41:14.262 | |
That is, this count, and you don't do just | |
count with absolute numbers, but you count | |
0:41:14.262 --> 0:41:14.847 | |
always weighted by the alignment probability.
0:41:15.815 --> 0:41:26.519 | |
And to make that a translation probability you have to normalize it, of course, through the total counts:
0:41:27.487 --> 0:41:30.584 | |
And that's then the whole model. | |
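Written out, the two quantities described above are, in this notation (e the target, f the source, a(j) the source position aligned to target position j):

$$P(a \mid e, f) \;=\; \prod_{j=1}^{l_e} \frac{t\big(e_j \mid f_{a(j)}\big)}{\sum_{i=0}^{l_f} t\big(e_j \mid f_i\big)}$$

$$c(e \mid f) \;=\; \sum_{a} P(a \mid e, f) \sum_{j} \delta(e, e_j)\,\delta\big(f, f_{a(j)}\big), \qquad t(e \mid f) \;=\; \frac{c(e \mid f)}{\sum_{e'} c(e' \mid f)}$$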
0:41:31.111 --> 0:41:39.512 | |
It looks now maybe a bit mathematically complex. | |
0:41:39.512 --> 0:41:47.398 | |
The whole training process is described here. | |
0:41:47.627 --> 0:41:53.809 | |
So you really, really just have to collect | |
these counts and later normalize that. | |
0:41:54.134 --> 0:42:03.812 | |
So we repeat that until convergence; as we said, the EM iteration is done again and again.
0:42:04.204 --> 0:42:15.152 | |
Then you go over all sentence pairs and all words and calculate the alignment probabilities.
0:42:15.355 --> 0:42:17.983 | |
And then you go once again over it
0:42:17.983 --> 0:42:22.522 | |
and collect the counts: the count of E given F, and the total for each F.
0:42:22.702 --> 0:42:35.316 | |
Finally, how probable is it that F is translated into E: you normalize your translation
0:42:35.316 --> 0:42:37.267 | |
probabilities. | |
0:42:38.538 --> 0:42:45.761 | |
So this is the whole training process for this type of model.
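A compact sketch of this training loop for the simplest model on a toy corpus (illustrative sentence pairs, no empty word, a fixed number of iterations instead of a convergence check):

```python
from collections import defaultdict

# Toy parallel corpus (sentence-aligned only); words are illustrative.
corpus = [(["das", "haus"], ["the", "house"]),
          (["das", "buch"], ["the", "book"]),
          (["ein", "buch"], ["a", "book"])]

# Initialization: uniform t(e|f) over the target vocabulary.
e_vocab = {e for _, tgt in corpus for e in tgt}
t = defaultdict(lambda: 1.0 / len(e_vocab))

for _ in range(10):                     # EM iterations
    count = defaultdict(float)          # expected counts c(e|f)
    total = defaultdict(float)          # normalizers per source word f
    for src, tgt in corpus:
        for e in tgt:                   # E-step: soft-assign e to every f
            norm = sum(t[(e, f)] for f in src)
            for f in src:
                p = t[(e, f)] / norm    # posterior that e aligns to f
                count[(e, f)] += p
                total[f] += p
    for (e, f), c in count.items():     # M-step: re-estimate t(e|f)
        t[(e, f)] = c / total[f]

print(round(t[("house", "haus")], 3), round(t[("the", "das")], 3))
```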
0:42:46.166 --> 0:43:00.575 | |
How that then works is shown here a bit, so | |
we have a very simple corpus. | |
0:43:01.221 --> 0:43:12.522 | |
And as we said, you initialize your translation probabilities with all possible translations, so 'das'
0:43:12.522 --> 0:43:16.620 | |
can be aligned to 'the', 'book' and 'house'.
0:43:16.997 --> 0:43:25.867 | |
And the other entries are missing because it only co-occurs with these words, and the others
0:43:25.867 --> 0:43:26.988 | |
are therefore left out.
0:43:27.127 --> 0:43:34.316 | |
Initially your vocabulary is four words, so the initial probabilities are all one fourth:
0:43:34.794 --> 0:43:50.947 | |
And then if you iterate you see that the things | |
which occur often and then get alignments get | |
0:43:50.947 --> 0:43:53.525 | |
more and more probability.
0:43:55.615 --> 0:44:01.506 | |
In reality, of course, you won't get exactly zero probabilities, but you would normally get
0:44:01.506 --> 0:44:02.671 | |
close to zero sometimes.
0:44:03.203 --> 0:44:05.534 | |
But as the probability increases. | |
0:44:05.785 --> 0:44:17.181 | |
The training process also guarantees that
the probability of your training data is always | |
0:44:17.181 --> 0:44:20.122 | |
increased in each iteration.
0:44:21.421 --> 0:44:27.958 | |
You see that the model tries to model your | |
training data and give you at least good models. | |
0:44:30.130 --> 0:44:37.765 | |
Okay, are there any more questions to the | |
training of these type of word-based models? | |
0:44:38.838 --> 0:44:54.790 | |
Initially there are like four words on the source side, so it's just one fourth for the equal distribution?
0:44:55.215 --> 0:45:01.888 | |
So for each target word, the probability of the target word is one out of four target words, so the
0:45:01.888 --> 0:45:03.538 | |
uniform distribution. | |
0:45:07.807 --> 0:45:14.430 | |
However, there are problems with this initial model, and we have already mentioned this at
0:45:14.430 --> 0:45:15.547 | |
the beginning. | |
0:45:15.547 --> 0:45:21.872 | |
There is for example things that yeah you | |
want to allow for reordering but there are | |
0:45:21.872 --> 0:45:27.081 | |
definitely some alignments which should be | |
more probable than others. | |
0:45:27.347 --> 0:45:42.333 | |
So a friend visit should have a lower probability | |
than visit a friend. | |
0:45:42.302 --> 0:45:50.233 | |
The alignment is not always monotone; there is some reordering happening, but if you just mix it
0:45:50.233 --> 0:45:51.782 | |
up crazily, it's not probable.
0:45:52.252 --> 0:46:11.014 | |
You have things like one-to-many alignments, and they are not really modeled.
0:46:11.491 --> 0:46:17.066 | |
But it shouldn't be that you align one word | |
to all the others, and that is, you don't want | |
0:46:17.066 --> 0:46:18.659 | |
this type of probability. | |
0:46:19.199 --> 0:46:27.879 | |
There is also nothing about aligning to null, and nothing about how to deal with unaligned
0:46:27.879 --> 0:46:30.386 | |
words on the source side. | |
0:46:32.272 --> 0:46:45.074 | |
And therefore this was only the initial model in the sequence of IBM models.
0:46:45.325 --> 0:46:47.639 | |
That is the model which we saw.
0:46:47.639 --> 0:46:57.001 | |
They only model the translation probability, | |
so how probable is it to translate one word | |
0:46:57.001 --> 0:46:58.263 | |
to another? | |
0:46:58.678 --> 0:47:05.915 | |
What you could then add is the absolute position. | |
0:47:05.915 --> 0:47:16.481 | |
Yeah, the second word should, for example, more probably align to the second position.
0:47:17.557 --> 0:47:22.767 | |
We add a fertility model that means one word | |
is mostly translated into one word. | |
0:47:23.523 --> 0:47:29.257 | |
For example, we saw it there that should be | |
translated into two words, but most words should | |
0:47:29.257 --> 0:47:32.463 | |
be one to one, and it's even modeled for each | |
word. | |
0:47:32.463 --> 0:47:37.889 | |
So for each source word, how probable is it | |
that it is translated to one, two, three or | |
0:47:37.889 --> 0:47:38.259 | |
more? | |
0:47:40.620 --> 0:47:50.291 | |
Then IBM model four adds relative positions, so it asks: maybe instead of modeling how
0:47:50.291 --> 0:47:55.433 | |
probable is it that you translate from position | |
five to position twenty five? | |
0:47:55.433 --> 0:48:01.367 | |
That is not a very good way, so instead you try to model it with relative positions:
0:48:01.321 --> 0:48:06.472 | |
how probable is it that you are jumping a few steps forward or a few steps back?
0:48:07.287 --> 0:48:15.285 | |
However, this makes things more complex, because what is a jump forward and a jump backward
0:48:15.285 --> 0:48:16.885 | |
is not that easy. | |
0:48:18.318 --> 0:48:30.423 | |
You want to have a model that describes reality, | |
so every sentence that is not possible should | |
0:48:30.423 --> 0:48:37.304 | |
have the probability zero because that cannot | |
happen. | |
0:48:37.837 --> 0:48:48.037 | |
However, with this type of IBM model four such cases still have a positive probability, so it makes
0:48:48.037 --> 0:48:54.251 | |
the model not quite correct, and you cannot easily check it.
0:48:57.457 --> 0:49:09.547 | |
So these models were the first models which | |
tried to directly model the translation, and they were
0:49:09.547 --> 0:49:14.132 | |
the first to do the translation. | |
0:49:14.414 --> 0:49:19.605 | |
So in all of these models, the probability | |
of a word translating into another word is | |
0:49:19.605 --> 0:49:25.339 | |
always independent of all the other translations, | |
and that is a challenge because we know that | |
0:49:25.339 --> 0:49:26.486 | |
this is not right. | |
0:49:26.967 --> 0:49:32.342 | |
And therefore we will come now to then the | |
phrase-based translation models. | |
0:49:35.215 --> 0:49:42.057 | |
However, this word alignment is the very important | |
concept which was used in phrase based. | |
0:49:42.162 --> 0:49:50.559 | |
Even when people use phrase based, they first | |
would always train a word based model not to | |
0:49:50.559 --> 0:49:56.188 | |
get the real translation model but only to get this type
of alignment. | |
0:49:57.497 --> 0:50:01.343 | |
What was the main idea of a phrase based machine | |
translation? | |
0:50:03.223 --> 0:50:08.898 | |
It's not only that things got mathematically | |
a lot more simple here because you don't try | |
0:50:08.898 --> 0:50:13.628 | |
to express the whole translation process, but | |
it's a discriminative model. | |
0:50:13.628 --> 0:50:19.871 | |
So what you only try to model is this translation | |
probability or is this translation more probable | |
0:50:19.871 --> 0:50:20.943 | |
than some other. | |
0:50:24.664 --> 0:50:28.542 | |
The main idea is that the basic units are | |
the phrases.
0:50:28.542 --> 0:50:31.500 | |
That's why it's called phrase-based translation.
0:50:31.500 --> 0:50:35.444 | |
You have to be aware that these are not linguistic | |
phrases. | |
0:50:35.444 --> 0:50:39.124 | |
I guess you have some intuition about what | |
is a phrase. | |
0:50:39.399 --> 0:50:45.547 | |
Some word sequences you would naturally express as a phrase.
0:50:45.547 --> 0:50:58.836 | |
However, you wouldn't say that any word sequence is a very good phrase, because it's not a linguistic unit.
0:50:59.339 --> 0:51:06.529 | |
However, in this machine-learning-motivated setting, phrases are just word sequences.
0:51:07.127 --> 0:51:08.832 | |
So it can be any split. | |
0:51:08.832 --> 0:51:12.455 | |
We don't consider linguistically motivated | |
or not. | |
0:51:12.455 --> 0:51:15.226 | |
It can be any sequence of consecutive words.
0:51:15.335 --> 0:51:16.842 | |
That's the only important thing.
0:51:16.977 --> 0:51:25.955 | |
The phrase is always a thing of consecutive | |
words, and the motivation behind that is keeping things
0:51:25.955 --> 0:51:27.403 | |
computationally feasible.
0:51:27.387 --> 0:51:35.912 | |
People have looked into how you can also use discontinuous phrases, which might be very helpful if you
0:51:35.912 --> 0:51:38.237 | |
think about German verbs like 'haben' plus participle.
0:51:38.237 --> 0:51:40.046 | |
Is this one phrase?
0:51:40.000 --> 0:51:47.068 | |
There's two phrases, although there's many | |
things in between, but in order to make things | |
0:51:47.068 --> 0:51:52.330 | |
still possible and efficient, it's always consecutive words.
0:51:53.313 --> 0:52:05.450 | |
The nice thing is that on the one hand you | |
don't need this word to word correspondence | |
0:52:05.450 --> 0:52:06.706 | |
anymore. | |
0:52:06.906 --> 0:52:17.088 | |
You don't need to invent some type of alignment
that in this case doesn't really make sense. | |
0:52:17.417 --> 0:52:21.710 | |
So you can just learn okay, you have this | |
phrase and this phrase and their translation. | |
0:52:22.862 --> 0:52:25.989 | |
Secondly, we can add a bit of context into | |
that. | |
0:52:26.946 --> 0:52:43.782 | |
You're saying, for example, that a whole expression on one side corresponds to a whole expression on the other side.
0:52:44.404 --> 0:52:51.443 | |
And this was difficult to model in word-based models because they always model the translation word by word.
0:52:52.232 --> 0:52:57.877 | |
Here you can have phrases where you have more | |
context and just jointly translate the phrases, | |
0:52:57.877 --> 0:53:03.703 | |
and if you then have seen the whole expression as a phrase, you can directly use that to generate the translation.
0:53:08.468 --> 0:53:19.781 | |
Okay, so how do we do that? We start, and the start is
0:53:19.781 --> 0:53:21.667 | |
the alignment. | |
0:53:22.022 --> 0:53:35.846 | |
So that is what we get from the word-based model, and we assume we get the alignment from there.
0:53:36.356 --> 0:53:40.786 | |
So that is your starting point. | |
0:53:40.786 --> 0:53:47.846 | |
You have a sentence pair and one most probable alignment.
0:53:48.989 --> 0:54:11.419 | |
The challenge you now have is that these alignments | |
are directed. On the one hand, a source word can be hit
0:54:11.419 --> 0:54:19.977 | |
several times, so one source word can be aligned to several target words. So in this case you see that for
0:54:19.977 --> 0:54:29.594 | |
example 'bisher' is aligned to three words, so
this can be the alignment from English to German, | |
0:54:29.594 --> 0:54:32.833 | |
but it cannot be the alignment in the other direction.
0:54:33.273 --> 0:54:41.024 | |
In order to address this asymmetry and be able to deal with it, what you typically
0:54:41.024 --> 0:54:49.221 | |
then do is: you run it in both directions, and you get different things in both directions.
0:54:54.774 --> 0:55:01.418 | |
In machine translation to do that you just | |
do it in both directions and somehow combine | |
0:55:01.418 --> 0:55:08.363 | |
them, because both will make errors, and the hope is that if you know both you minimize the errors.
0:55:08.648 --> 0:55:20.060 | |
So you would also do it in the other direction | |
and get a different type of alignment, for example
0:55:20.060 --> 0:55:22.822 | |
the one that you now see here.
0:55:23.323 --> 0:55:37.135 | |
So in this way you are having two alignments | |
and the question is now how do you get one alignment
0:55:37.135 --> 0:55:38.605 | |
out of these two.
0:55:38.638 --> 0:55:45.828 | |
There were a lot of different types of heuristics. | |
0:55:45.828 --> 0:55:55.556 | |
They normally start with the intersection, because those are the points both directions agree on, so you should trust them.
0:55:55.996 --> 0:55:59.661 | |
And as your maximum you could take the union.
0:55:59.980 --> 0:56:04.679 | |
If one of the systems says they are not aligned | |
then maybe you should not align them. | |
0:56:05.986 --> 0:56:12.240 | |
The only question they are different is what | |
should I do about things where they don't agree? | |
0:56:12.240 --> 0:56:18.096 | |
So where only one of them aligns them, and then
you have heuristics depending on other words | |
0:56:18.096 --> 0:56:22.288 | |
around it, you can decide should I align them | |
or should I not. | |
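A simplified sketch of such a symmetrization heuristic: keep the intersection, then grow it with union links that are adjacent to an already accepted link. This only roughly approximates the grow-diag-style heuristics used in practice:

```python
def symmetrize(src_to_tgt, tgt_to_src):
    """src_to_tgt, tgt_to_src: sets of (source_pos, target_pos) links from the two directions."""
    alignment = src_to_tgt & tgt_to_src          # intersection: links both models agree on
    union = src_to_tgt | tgt_to_src
    added = True
    while added:                                 # grow with neighboring union links
        added = False
        for (i, j) in sorted(union - alignment):
            neighbors = {(i + di, j + dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)}
            if neighbors & alignment:
                alignment.add((i, j))
                added = True
    return alignment

print(symmetrize({(0, 0), (1, 2), (2, 1)}, {(0, 0), (1, 1), (2, 1)}))
```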
0:56:24.804 --> 0:56:34.728 | |
So that is your first step, and then comes the second step in your model.
0:56:34.728 --> 0:56:41.689 | |
So now you have one alignment for each sentence pair.
0:56:42.042 --> 0:56:47.918 | |
And the idea is that we will now extract all phrase pairs, so combinations of source and
0:56:47.918 --> 0:56:51.858 | |
target phrases, which are consistent with the alignment.
0:56:52.152 --> 0:56:57.980 | |
The idea is that consistency with the alignment means it should be a good example that we can
0:56:57.980 --> 0:56:58.563 | |
extract. | |
0:56:59.459 --> 0:57:14.533 | |
And there are three conditions under which we say a phrase pair is consistent with the alignment.
0:57:14.533 --> 0:57:17.968 | |
The first one is: if a source word is in the phrase pair, all target words aligned to it are also in it.
0:57:18.318 --> 0:57:24.774 | |
So if you add 'bisher', and it's in your phrase,
all the three words 'up', 'till' and 'now' should be in there.
0:57:32.492 --> 0:57:42.328 | |
So 'bisher' and 'up till' would not be a valid phrase pair in this case, but for example 'bisher'
0:57:42.328 --> 0:57:43.433 | |
and 'up till now' would be.
0:57:45.525 --> 0:58:04.090 | |
Does anybody now have already an idea about | |
the second rule that should be there? | |
0:58:05.325 --> 0:58:10.529 | |
Yes, that is exactly the other thing. | |
0:58:10.529 --> 0:58:22.642 | |
If a target word is in the phrase pair, the source words aligned to it are also in it. Then there is one very obvious one.
0:58:22.642 --> 0:58:28.401 | |
If you extract a phrase pair, at least one word in the phrase pair must be aligned.
0:58:29.069 --> 0:58:32.686 | |
And why is this rule needed?
0:58:32.686 --> 0:58:40.026 | |
Because otherwise you could just select any part of the sentence.
0:58:40.380 --> 0:58:47.416 | |
You could take any possible combination of source
and target words for this part, and that of | |
0:58:47.416 --> 0:58:54.222 | |
course is not very helpful because you just | |
have no idea, and therefore it says at least | |
0:58:54.222 --> 0:58:58.735 | |
one source word should be aligned to one target word, to prevent that.
0:58:59.399 --> 0:59:09.615 | |
But still, it means that if you have unaligned words, the more unaligned words there are, the more
0:59:09.615 --> 0:59:10.183 | |
phrase pairs you can extract.
0:59:10.630 --> 0:59:13.088 | |
That's not true for the very extreme case. | |
0:59:13.088 --> 0:59:17.603 | |
If no word is aligned you can extract nothing
because you can never fulfill it. | |
0:59:17.603 --> 0:59:23.376 | |
However, if for example only one word is aligned then you can extract a lot of different possibilities
0:59:23.376 --> 0:59:28.977 | |
because you can start with this word and then | |
add source words or target words or any combination | |
0:59:28.977 --> 0:59:29.606 | |
of source and target words.
0:59:30.410 --> 0:59:37.585 | |
So there was typically a problem that if you have too few aligned words you can extract really
0:59:37.585 --> 0:59:38.319 | |
many phrase pairs.
0:59:38.558 --> 0:59:45.787 | |
If you think about this already here you can | |
extract very, very many phrase pairs from this example:
0:59:45.845 --> 0:59:55.476 | |
So what you can extract is, for example, the phrases we saw, 'up till now' and so on.
0:59:55.476 --> 1:00:00.363 | |
So all of them will be extracted. | |
1:00:00.400 --> 1:00:08.379 | |
In order to limit this you typically have | |
a length limit so you can only extract phrases | |
1:00:08.379 --> 1:00:08.738 | |
up to a maximum length.
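A compact sketch of this extraction over one aligned sentence pair, with the three consistency conditions and a hypothetical maximum phrase length; this is a simplified version of the procedure:

```python
def extract_phrases(src_len, tgt_len, alignment, max_len=4):
    """alignment: set of (source_pos, target_pos) links; returns consistent phrase pairs
    as ((src_start, src_end), (tgt_start, tgt_end)) spans, inclusive."""
    phrases = []
    for s1 in range(src_len):
        for s2 in range(s1, min(s1 + max_len, src_len)):
            # target positions linked to any source word in [s1, s2]
            tgt_points = {j for (i, j) in alignment if s1 <= i <= s2}
            if not tgt_points:          # rule 3: at least one alignment point
                continue
            t1, t2 = min(tgt_points), max(tgt_points)
            if t2 - t1 >= max_len:
                continue
            # rules 1 and 2: no alignment link may leave the phrase box on either side
            consistent = all(s1 <= i <= s2 for (i, j) in alignment if t1 <= j <= t2)
            if consistent:
                phrases.append(((s1, s2), (t1, t2)))
    return phrases

# toy example: 3 source words, 3 target words, monotone alignment
print(extract_phrases(3, 3, {(0, 0), (1, 1), (2, 2)}))
```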
1:00:09.049 --> 1:00:18.328 | |
But still, you have all these phrases extracted and
1:00:18.328 --> 1:00:22.968 | |
you have to think about how to deal with them.
1:00:26.366 --> 1:00:34.966 | |
Now we have the phrases, so the other question | |
is what is a good phrase pair and what is not so good.
1:00:35.255 --> 1:00:39.933 | |
It might be that you sometimes extract one which explains this sentence but is not
1:00:39.933 --> 1:00:44.769 | |
really a good one, because there is an error in there or something special, so it might
1:00:44.769 --> 1:00:47.239 | |
not be a good phrase pair in another situation.
1:00:49.629 --> 1:00:59.752 | |
And therefore the easiest thing is again just | |
count, and if a phrase pair occurs very often | |
1:00:59.752 --> 1:01:03.273 | |
it seems to be a good phrase pair.
1:01:03.743 --> 1:01:05.185 | |
So if we have this one. | |
1:01:05.665 --> 1:01:09.179 | |
And if you have the example 'up till now':
1:01:09.469 --> 1:01:20.759 | |
then you look how often does 'up till now' occur together with 'bisher',
1:01:20.759 --> 1:01:28.533 | |
and how often does 'bisher' occur in total.
1:01:30.090 --> 1:01:36.426 | |
So this is one way of yeah describing the | |
quality of the phrase pair.
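A minimal sketch of this count-and-divide scoring over extracted phrase pairs (counts are illustrative):

```python
from collections import Counter

# hypothetical extracted phrase pairs (source_phrase, target_phrase) from the whole corpus
extracted = [("bisher", "up till now"), ("bisher", "up till now"),
             ("bisher", "so far"), ("bis", "until")]

pair_counts = Counter(extracted)
src_counts = Counter(src for src, _ in extracted)

def phrase_prob(tgt, src):
    """Relative frequency phi(target_phrase | source_phrase) = count(src, tgt) / count(src)."""
    return pair_counts[(src, tgt)] / src_counts[src]

print(phrase_prob("up till now", "bisher"))  # 2/3 with these toy counts
```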
1:01:37.257 --> 1:01:47.456 | |
So one difference is now, and that is the advantage of these discriminative models:
1:01:47.867 --> 1:01:55.442 | |
But instead we are trying to have a lot of | |
features describing how good a phrase pair
1:01:55.442 --> 1:01:55.786 | |
is. | |
1:01:55.786 --> 1:02:04.211 | |
One of these features is this relative frequency. In this model we'll later see how to combine
1:02:04.211 --> 1:02:04.515 | |
it. | |
1:02:04.515 --> 1:02:10.987 | |
The nice thing is we can invent any other | |
type of features and add that and normally | |
1:02:10.987 --> 1:02:14.870 | |
if you have two or three metrics to describe it, that works better than only one.
1:02:15.435 --> 1:02:18.393 | |
And therefore the phrase pairs:
1:02:18.393 --> 1:02:23.220 | |
They were not only like evaluated by one type | |
but by several. | |
1:02:23.763 --> 1:02:36.580 | |
So this could, for example, be a problem | |
because your target phrase here occurs only | |
1:02:36.580 --> 1:02:37.464 | |
once. | |
1:02:38.398 --> 1:02:46.026 | |
It will of course only occur with one | |
source phrase, and that probability will be | |
1:02:46.026 --> 1:02:53.040 | |
one, which might not be a very good estimate | |
because you've only seen it once. | |
1:02:53.533 --> 1:02:58.856 | |
Therefore, we use additional scores to better | |
deal with that, and the first thing we're | |
1:02:58.856 --> 1:02:59.634 | |
doing again is, | |
1:02:59.634 --> 1:03:01.129 | |
yeah, we know it by now: | |
1:03:01.129 --> 1:03:06.692 | |
if you look at it in the one direction, it's | |
helpful to also look into the other direction. | |
1:03:06.692 --> 1:03:12.972 | |
So you also take the inverse probability: | |
you not only take p(e|f), but also p(f|e). | |
1:03:13.693 --> 1:03:19.933 | |
And then in addition you say: maybe for the | |
especially long phrases, they occur rarely, | |
1:03:19.933 --> 1:03:25.898 | |
and then you have very high probabilities, | |
and that might not always be right. | |
1:03:25.898 --> 1:03:32.138 | |
So maybe it's good to also look at the word- | |
based probabilities to represent how good they | |
1:03:32.138 --> 1:03:32.480 | |
are. | |
1:03:32.692 --> 1:03:44.202 | |
So in addition you take the word-based probabilities | |
of this phrase pair as an additional model. | |
1:03:44.704 --> 1:03:52.828 | |
So then you would have in total four different | |
values describing how good the phrase pair is. | |
1:03:52.828 --> 1:04:00.952 | |
It would be the relative frequencies in | |
both directions and the lexical probabilities. | |
1:04:01.361 --> 1:04:08.515 | |
So: four values describing how probable | |
a phrase translation is. | |
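The word-based score mentioned here is usually computed as a lexical weight; the sketch below follows the common definition and uses hypothetical word translation probabilities w, so it is an illustration rather than the exact formula from the lecture. The same computation in the inverse direction plus the two relative frequencies gives the four values.

```python
# Lexical weighting of one phrase pair: for each target word, average the word
# translation probabilities over the source words it is aligned to, then
# multiply over all target words (unaligned words fall back to a NULL entry).
def lexical_weight(src_words, tgt_words, links, w):
    """links: set of (i, j); w: dict mapping (tgt_word, src_word) -> probability."""
    score = 1.0
    for j, e in enumerate(tgt_words):
        aligned = [i for (i, jj) in links if jj == j]
        if aligned:
            score *= sum(w.get((e, src_words[i]), 1e-7) for i in aligned) / len(aligned)
        else:
            score *= w.get((e, "NULL"), 1e-7)
    return score
```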
1:04:11.871 --> 1:04:20.419 | |
Then the next challenge is how we can combine | |
these different types of probabilities into | |
1:04:20.419 --> 1:04:23.458 | |
a global score saying how good the translation is in our | |
1:04:24.424 --> 1:04:36.259 | |
model. But before we are doing that: are there any | |
questions about this phrase extraction and phrase | |
1:04:36.259 --> 1:04:37.546 | |
creation? | |
1:04:40.260 --> 1:04:44.961 | |
And the motivation for that: this was our initial | |
model. | |
1:04:44.961 --> 1:04:52.937 | |
If you remember from the beginning of the lecture, | |
we had the probability as p(f|e) times | |
1:04:52.937 --> 1:04:53.357 | |
p(e). | |
1:04:55.155 --> 1:04:57.051 | |
Now the problem is here. | |
1:04:57.051 --> 1:04:59.100 | |
That is, of course, right. | |
1:04:59.100 --> 1:05:06.231 | |
However, we have made a lot of simplifications, | |
for example that the translation probability is independent | |
1:05:06.231 --> 1:05:08.204 | |
of the other translations. | |
1:05:08.628 --> 1:05:14.609 | |
So therefore our estimates of p(f|e) | |
and p(e) might not be right, and therefore the | |
1:05:14.609 --> 1:05:16.784 | |
combination might not be right. | |
1:05:17.317 --> 1:05:22.499 | |
So it can be that, for example, at the end | |
you have a fluent but not accurate translation. | |
1:05:22.782 --> 1:05:25.909 | |
And then there could be an easy way around | |
it. | |
1:05:26.126 --> 1:05:32.019 | |
If our output is fluent but not accurate, it might | |
be that we put too much weight on the language | |
1:05:32.019 --> 1:05:36.341 | |
model and we are putting too little weight on | |
the translation model. | |
1:05:36.936 --> 1:05:43.016 | |
There we can weight them, so we can say this one | |
counts a bit more strongly, | |
1:05:43.016 --> 1:05:46.305 | |
this one is more important than the other. | |
1:05:48.528 --> 1:05:53.511 | |
And based on that we can extend this idea | |
to the log-linear model. | |
1:05:53.893 --> 1:06:02.164 | |
The log-linear model now says: the translation | |
probability is just a combination of features. | |
1:06:02.082 --> 1:06:14.233 | |
These are the features h, which describe how good | |
this translation is; they depend on e and f, | |
sometimes only on one of them, but in general | |
on both e and f. | |
1:06:14.474 --> 1:06:22.393 | |
Each of these features has a weight saying | |
how well it models the translation, so that, if you're | |
1:06:22.393 --> 1:06:29.968 | |
asking a lot of people about some opinion, it | |
might also be that you weight some opinions more, so | |
1:06:29.968 --> 1:06:34.100 | |
you put more weight on that one and maybe not so | |
much on another. | |
1:06:34.314 --> 1:06:39.239 | |
If someone is saying that, it's maybe a good indication, | |
but you wouldn't trust it that much. | |
1:06:39.559 --> 1:06:41.380 | |
And exactly that you can do for the features too. | |
1:06:41.380 --> 1:06:42.446 | |
You can add any number of them. | |
1:06:43.423 --> 1:07:01.965 | |
It's like: depending on how many you want to | |
have, each of the features gives you a value. | |
1:07:02.102 --> 1:07:12.655 | |
The nice thing is that we can normally ignore | |
the normalization, because we are not interested in the probability | |
1:07:12.655 --> 1:07:13.544 | |
itself. | |
1:07:13.733 --> 1:07:18.640 | |
And again, if that's not normalized, that's | |
fine. | |
1:07:18.640 --> 1:07:23.841 | |
If this value is the highest, the probability | |
is also the highest. | |
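Written out as a formula, what was just said looks roughly like this, with features h_i, weights lambda_i, and a normalization Z(f) that can be ignored when only the best translation is needed:

```latex
p(e \mid f) \;=\; \frac{1}{Z(f)}\,\exp\!\Big(\sum_{i=1}^{n} \lambda_i\, h_i(e,f)\Big),
\qquad
\hat{e} \;=\; \operatorname*{argmax}_{e}\; \sum_{i=1}^{n} \lambda_i\, h_i(e,f).
```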
1:07:26.987 --> 1:07:29.302 | |
How can we do that? | |
1:07:29.302 --> 1:07:34.510 | |
Let's start with two simple things. | |
1:07:34.510 --> 1:07:39.864 | |
Then you have one translation model. | |
1:07:40.000 --> 1:07:43.102 | |
This gives you the probability p(e|f). | |
1:07:43.383 --> 1:07:49.203 | |
Typically, as a feature, you would | |
take the logarithm of this probability, so here minus | |
1:07:49.203 --> 1:07:51.478 | |
9.47. | |
1:07:51.451 --> 1:07:57.846 | |
And the language model, which tells you how | |
fluent the English side is. | |
1:07:57.846 --> 1:08:03.129 | |
How you can calculate this probability we will | |
cover in a future lecture. | |
1:08:03.129 --> 1:08:10.465 | |
As a feature you again take the log of the probability; | |
then you have minus seven, and then you give different | |
1:08:10.465 --> 1:08:11.725 | |
weights to them. | |
1:08:12.292 --> 1:08:19.243 | |
And that means that your probability is one | |
divided by Z times e to the power of this score. | |
1:08:20.840 --> 1:08:38.853 | |
You're not really interested in the probability, | |
so you just calculate the score in the exponent. | |
1:08:40.000 --> 1:08:41.668 | |
And then you take the maximum of that. | |
1:08:42.122 --> 1:08:57.445 | |
You can, for example, try different translations, | |
calculate all their scores and take in the | |
1:08:57.445 --> 1:09:00.905 | |
end the translation with the highest score. | |
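A tiny numeric sketch of this two-feature case; the candidate strings, log-probabilities and weights are made-up illustrative values.

```python
# Combine a translation-model and a language-model log-probability with weights
# and pick the candidate with the highest (unnormalized) score.
candidates = {
    "what we saw up till now": (-9.47, -7.0),    # (log p_TM, log p_LM)
    "what we up till now saw": (-9.47, -12.3),
}
weights = (1.0, 0.5)                             # lambda_TM, lambda_LM

def score(features, lambdas):
    return sum(l * h for l, h in zip(lambdas, features))

best = max(candidates, key=lambda e: score(candidates[e], weights))
print(best, score(candidates[best], weights))
```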
1:09:03.423 --> 1:09:04.661 | |
Why do we do that? | |
1:09:05.986 --> 1:09:10.698 | |
We've done that now for two, but of course | |
you can not only do it with two. | |
1:09:10.698 --> 1:09:16.352 | |
You can do it now with any fixed number; so | |
of course you have to decide in the beginning: | |
1:09:16.352 --> 1:09:21.944 | |
I want to have ten features or something like | |
that, but then you can take all these features. | |
1:09:22.002 --> 1:09:29.378 | |
And yeah, based on them, you calculate your | |
model probability or the model score. | |
1:09:31.031 --> 1:09:40.849 | |
This is a big advantage over the initial | |
1:09:40.580 --> 1:09:47.380 | |
model, because now we can add a lot of features, | |
and that was then the main work in machine translation, | |
in statistical machine translation. | |
1:09:47.647 --> 1:09:57.063 | |
So how can we develop new features, new ways | |
of evaluating them, so that we can hopefully better | |
1:09:57.063 --> 1:10:00.725 | |
describe what a good translation is? | |
1:10:01.001 --> 1:10:16.916 | |
If you have a great new feature, you can calculate | |
it and then see how much better it models the | |
1:10:16.916 --> 1:10:18.969 | |
translations. | |
1:10:21.741 --> 1:10:27.903 | |
There is one challenge which we haven't touched | |
upon yet. | |
1:10:27.903 --> 1:10:33.505 | |
So could you easily build your model if you | |
have these features? | |
1:10:38.999 --> 1:10:43.016 | |
I assumed here something which I just guessed, but | |
which might not be that easy. | |
1:10:49.990 --> 1:10:56.333 | |
The weight for the translation model is this, and | |
the weight for the language model is that. | |
1:10:56.716 --> 1:11:08.030 | |
That's a bit arbitrary, so why should you | |
use exactly these ones? And I guess normally you won't be | |
1:11:08.030 --> 1:11:11.801 | |
able to select them by hand. | |
1:11:11.992 --> 1:11:19.123 | |
So typically we didn't have just these few features | |
in there; having a lot more features is very common. | |
1:11:19.779 --> 1:11:21.711 | |
So how do you select them? | |
1:11:21.711 --> 1:11:24.645 | |
There was a second part of the training. | |
1:11:24.645 --> 1:11:27.507 | |
These models were trained in two steps. | |
1:11:27.507 --> 1:11:32.302 | |
On the one hand, we had the training of the | |
individual components. | |
1:11:32.302 --> 1:11:38.169 | |
We saw now how to build the phrase-based | |
system, how to extract the phrases. | |
1:11:38.738 --> 1:11:46.223 | |
But then, if you have these different components, | |
you need a second training step to learn the optimal weights. | |
1:11:46.926 --> 1:11:51.158 | |
And typically this is referred to as the tuning | |
of the system. | |
1:11:51.431 --> 1:12:07.030 | |
So now if you have different types of models | |
describing what a good translation is you need | |
1:12:07.030 --> 1:12:10.760 | |
to find good weights. | |
1:12:12.312 --> 1:12:14.315 | |
So how can you do it? | |
1:12:14.315 --> 1:12:20.871 | |
The easiest thing is, of course, you can just | |
try different things out. | |
1:12:21.121 --> 1:12:27.496 | |
You can then always select the best hyper | |
scissors. | |
1:12:27.496 --> 1:12:38.089 | |
You can evaluate it with some metrics saying: | |
You can score all your outputs, always select | |
1:12:38.089 --> 1:12:42.543 | |
the best one and then get this translation. | |
1:12:42.983 --> 1:12:45.930 | |
And you can do that for a lot of different | |
possible combinations. | |
1:12:47.067 --> 1:12:59.179 | |
However, the challenge is the complexity: | |
if you have only so many parameters and each of | |
1:12:59.179 --> 1:13:04.166 | |
them has several values to try, the number of combinations explodes. | |
1:13:04.804 --> 1:13:16.895 | |
We won't be able to try all of these possible | |
combinations, so what we have to do is something | |
1:13:16.895 --> 1:13:19.313 | |
more intelligent. | |
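In numbers: with d feature weights and v candidate values tried per weight, a naive grid search needs v^d combinations, and every combination means re-translating and re-scoring the whole tuning set (the concrete v and d below are only an illustration):

```latex
\#\text{combinations} \;=\; v^{\,d}, \qquad \text{e.g. } v = 10,\; d = 15 \;\Rightarrow\; 10^{15}.
```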
1:13:20.540 --> 1:13:34.027 | |
And what has been done there in machine translation | |
is referred to as minimum error rate training. | |
1:13:34.534 --> 1:13:41.743 | |
The search is a very intuitive one: you have | |
all these different parameters, so how do you set them? | |
1:13:42.522 --> 1:13:44.358 | |
And the idea is, okay: | |
1:13:44.358 --> 1:13:52.121 | |
I start with an initial guess and then I optimize | |
one single parameter; that's always easier. | |
1:13:52.121 --> 1:13:54.041 | |
That's then a line search. | |
1:13:54.041 --> 1:13:58.882 | |
So you're searching the best value for the | |
one parameter. | |
1:13:59.759 --> 1:14:04.130 | |
Often visualized with a San Francisco map. | |
1:14:04.130 --> 1:14:13.786 | |
Just imagine if you want to go to the highest | |
spot in San Francisco, you're standing somewhere | |
1:14:13.786 --> 1:14:14.395 | |
here. | |
1:14:14.574 --> 1:14:21.220 | |
Then you switch your dimension, so you are | |
going in this direction, again finding the highest point. | |
1:14:21.661 --> 1:14:33.804 | |
Now you're on a different street, and then again | |
a different one, so you go on like this, and | |
1:14:33.804 --> 1:14:36.736 | |
so you can iterate. | |
1:14:36.977 --> 1:14:56.368 | |
The one problem of course is that you only find a local optimum; | |
if you start at two different positions you may end up in different ones. | |
1:14:56.536 --> 1:15:10.030 | |
So yeah, there is a heuristic in there: typically | |
it's run again with different starting points, and you | |
1:15:10.030 --> 1:15:16.059 | |
check whether you land in different positions. | |
1:15:16.516 --> 1:15:29.585 | |
What is different, or what is the addition | |
of minimum error rate training compared to this standard search? | |
1:15:29.729 --> 1:15:37.806 | |
So the question is, like we said, you can | |
now evaluate different values for one parameter. | |
1:15:38.918 --> 1:15:42.857 | |
And the question is: which values should you | |
try out for this one parameter? | |
1:15:42.857 --> 1:15:47.281 | |
Should you just do zero point one, zero point | |
two, zero point three, or anything? | |
1:15:49.029 --> 1:16:03.880 | |
If you change only one parameter, then you | |
can describe the score of a translation as a linear | |
1:16:03.880 --> 1:16:05.530 | |
function of that parameter. | |
1:16:05.945 --> 1:16:17.258 | |
So this is the one parameter, and | |
if you change the parameter, the score of this hypothesis changes. | |
1:16:17.397 --> 1:16:26.506 | |
The offset of your score is there because the rest, | |
the other feature values, you don't change. | |
1:16:26.826 --> 1:16:30.100 | |
And the feature value gives the steepness | |
of the curve. | |
1:16:30.750 --> 1:16:38.887 | |
And now look at different possible translations. | |
1:16:38.887 --> 1:16:46.692 | |
Therefore, how steeply they go up here differs. | |
1:16:47.247 --> 1:16:59.289 | |
So in this case you look at where the best score | |
is, so which hypothesis wins in which region. | |
1:17:00.300 --> 1:17:10.642 | |
So it's enough to check once here and check | |
once here, because in between the best hypothesis does not change. | |
1:17:11.111 --> 1:17:24.941 | |
And that is the idea in minimum error rate training | |
when you select among the different hypotheses. | |
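The observation behind this, written out: if only the one weight lambda_k is varied, the score of every hypothesis is a linear function of it, so the best-scoring hypothesis can only change at intersection points, and one check per interval is enough.

```latex
\text{score}_e(\lambda_k) \;=\; \underbrace{\sum_{i \neq k} \lambda_i\, h_i(e,f)}_{\text{constant offset}} \;+\; \lambda_k\, h_k(e,f).
```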
1:17:29.309 --> 1:17:34.378 | |
So yeah, the minimum error rate training | |
is a Powell-style search. | |
1:17:34.378 --> 1:17:37.453 | |
Then we do an intelligent step size. | |
1:17:37.453 --> 1:17:39.364 | |
We do random restarts. | |
1:17:39.364 --> 1:17:46.428 | |
Then things are still too slow, because it | |
means we would have to decode a lot of | |
1:17:46.428 --> 1:17:47.009 | |
times. | |
1:17:46.987 --> 1:17:54.460 | |
So what we can do to make things even faster | |
is we are decoding once with the current parameters, | |
1:17:54.460 --> 1:18:01.248 | |
but then we are not generating only the most | |
probable translation, but we are generating | |
1:18:01.248 --> 1:18:05.061 | |
the hundred most probable translations | |
or so. | |
1:18:06.006 --> 1:18:18.338 | |
And then we are optimizing our weights by | |
only looking at these one hundred translations | |
1:18:18.338 --> 1:18:23.725 | |
and finding the optimal values there. | |
1:18:24.564 --> 1:18:39.284 | |
Of course, it might be a problem that at some | |
point you no longer find good translations | |
1:18:39.284 --> 1:18:42.928 | |
inside your n-best list. | |
1:18:43.143 --> 1:18:52.357 | |
So you have to iterate that a few times, but the | |
important thing is you don't have to decode | |
1:18:52.357 --> 1:18:56.382 | |
every time you try new weights. | |
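A simplified sketch of this n-best line search for a single weight and a single sentence; it collects the pairwise intersection points of the linear score functions and evaluates an error value only once per interval. The data layout, the search interval and the error numbers are illustrative assumptions, not the lecture's implementation (a real system would sum the error over the whole tuning set and loop over all weights).

```python
# Each hypothesis in the n-best list is (offset, slope, error):
#   score(lam) = offset + lam * slope, where slope is the feature being tuned
#   and offset is the sum of all other weighted features; error is e.g. 1 - BLEU.
def line_search(nbest, lo=-5.0, hi=5.0):
    candidates = {lo, hi}
    for a_off, a_slope, _ in nbest:              # pairwise intersection points
        for b_off, b_slope, _ in nbest:
            if a_slope != b_slope:
                x = (b_off - a_off) / (a_slope - b_slope)
                if lo < x < hi:
                    candidates.add(x)
    xs = sorted(candidates)
    best_lam, best_err = lo, float("inf")
    for left, right in zip(xs, xs[1:]):          # one check per interval
        lam = (left + right) / 2
        _, _, err = max(nbest, key=lambda h: h[0] + lam * h[1])
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam

nbest = [(-3.0, 1.0, 0.4), (-1.0, -0.5, 0.2), (-2.0, 0.2, 0.1)]  # toy values
print(line_search(nbest))
```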
1:18:57.397 --> 1:19:11.325 | |
This is mainly a speed-up process in order | |
to make things even faster. | |
1:19:15.515 --> 1:19:20.160 | |
Good, then we'll finish with | |
1:19:20.440 --> 1:19:25.289 | |
looking at how you really calculate the | |
scores and everything. | |
1:19:25.289 --> 1:19:32.121 | |
Because what we did look into was single phrase pairs, | |
but a translation of a full sentence doesn't really consist of | |
1:19:32.121 --> 1:19:37.190 | |
only one single phrase; of course you have | |
to combine different ones. | |
1:19:37.637 --> 1:19:40.855 | |
So how does that now really look, and how do | |
we have to do it? | |
1:19:41.361 --> 1:19:48.252 | |
Just think again of the translation we have | |
done before. | |
1:19:48.252 --> 1:19:59.708 | |
The sentence was this one: what is the probability | |
of translating it into 'what we saw up till | |
1:19:59.708 --> 1:20:00.301 | |
now'? | |
1:20:00.301 --> 1:20:03.501 | |
We're doing this by using phrase pairs. | |
1:20:03.883 --> 1:20:07.157 | |
So we're having the phrase pairs: | |
1:20:07.157 --> 1:20:12.911 | |
'was wir' is one phrase pair, 'up till now' is another, | |
and 'gesehen haben' another one. | |
1:20:13.233 --> 1:20:18.970 | |
In addition, that is important because translation | |
is not monotone. | |
1:20:18.970 --> 1:20:26.311 | |
We are not putting the phrase pairs in the same | |
order on the source and | |
1:20:26.311 --> 1:20:31.796 | |
on the target side, but reorder them in order to generate the | |
correct translation. | |
1:20:31.771 --> 1:20:34.030 | |
So we have to shuffle the phrase pairs. | |
1:20:34.294 --> 1:20:39.747 | |
And the blue one is in front on the source | |
side but at the back on the target side. | |
1:20:40.200 --> 1:20:49.709 | |
This reordering makes statistical machine | |
translation really complicated, because if you | |
1:20:49.709 --> 1:20:53.313 | |
could just do this monotonically it would be much simpler. | |
1:20:53.593 --> 1:21:05.288 | |
The problem is, if you would allow all possible | |
combinations of reshuffling them, then again the search explodes. | |
1:21:05.565 --> 1:21:11.508 | |
So you again have to use some type of heuristic | |
for which reorderings you allow and which you don't | |
1:21:11.508 --> 1:21:11.955 | |
allow. | |
1:21:12.472 --> 1:21:27.889 | |
That was relatively challenging since, for | |
example, if you think of German you would | |
1:21:27.889 --> 1:21:32.371 | |
have to allow very long-range reorderings. | |
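The lecture does not spell out which heuristic is meant; one very common choice in phrase-based systems is a distortion limit, sketched here with illustrative names: the next phrase may only start within a fixed distance of where the previously translated source phrase ended. For German, with its long-range verb movement, this limit has to be rather large, which is exactly the difficulty mentioned above.

```python
# Sketch of a distortion-limit check while building the target sentence left to
# right: last_end = source position where the previous phrase ended,
# next_start = source position where the next candidate phrase begins.
def reordering_allowed(last_end, next_start, distortion_limit=6):
    return abs(next_start - (last_end + 1)) <= distortion_limit
```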
1:21:33.033 --> 1:21:52.218 | |
But if we now have this, how do we calculate | |
the translation score, the score of the whole translation? | |
1:21:52.432 --> 1:21:55.792 | |
In the end we sum up the individual scores. | |
1:21:56.036 --> 1:22:08.524 | |
So we said our first feature is the probability | |
of the full sentence. | |
1:22:08.588 --> 1:22:13.932 | |
So we say the translation of each phrase | |
pair is independent of the others, and then | |
1:22:13.932 --> 1:22:19.959 | |
we get the probability of the full sentence as | |
the product of the phrase pair probabilities: p of 'what | |
1:22:19.959 --> 1:22:24.246 | |
we' given its source phrase, times p of 'saw' given its source | |
phrase, times p of 'up till now' given its source phrase. | |
1:22:24.664 --> 1:22:29.379 | |
Now we can use logarithms for the calculation. | |
1:22:29.609 --> 1:22:36.563 | |
We take the logarithm of the first probability. | |
1:22:36.563 --> 1:22:48.153 | |
We get our first score, which says the | |
translation model score is some negative number. | |
1:22:49.970 --> 1:22:56.586 | |
And we're not doing that only once, but we're | |
doing exactly the same with all our translation model scores. | |
1:22:56.957 --> 1:23:03.705 | |
So we said we also have the relative frequency | |
in the inverse direction and the lexical probabilities. | |
1:23:03.843 --> 1:23:06.226 | |
So in the end you'll have four scores. | |
1:23:06.226 --> 1:23:09.097 | |
Here how you combine them is exactly the same. | |
1:23:09.097 --> 1:23:12.824 | |
The only thing is how you look them up for | |
each phrase pair. | |
1:23:12.824 --> 1:23:18.139 | |
We have said in the beginning we are storing | |
four scores describing how good they are. | |
1:23:19.119 --> 1:23:25.415 | |
And these are then the four scores describing | |
how probable the sentence is. | |
1:23:27.427 --> 1:23:31.579 | |
Then we can have more scores. | |
1:23:31.579 --> 1:23:37.806 | |
For example, we can have a distortion model. | |
1:23:37.806 --> 1:23:41.820 | |
How much reordering is done? | |
1:23:41.841 --> 1:23:47.322 | |
There were different types of these; we won't | |
go into detail, but just imagine you have now a | |
1:23:47.322 --> 1:23:47.748 | |
score for that. | |
1:23:48.548 --> 1:23:56.651 | |
Then you have a language model, which scores the | |
sequence 'what we saw up till now'. | |
1:23:56.651 --> 1:24:06.580 | |
How we compute this language model probability | |
we will cover later; and there were even more scores. | |
1:24:06.580 --> 1:24:11.841 | |
So one, for example, was a phrase count score, | |
which just counts how many phrases are used. | |
1:24:12.072 --> 1:24:19.555 | |
In order to learn: is it better to have more | |
short phrases, or should we bias towards having fewer | |
1:24:19.555 --> 1:24:20.564 | |
and longer ones? | |
1:24:20.940 --> 1:24:28.885 | |
You can easily add this by just counting, so the value | |
will just be the number of phrases here, and likewise you can put in a word count, | |
1:24:28.885 --> 1:24:32.217 | |
which says how good it is to translate into more or fewer words. | |
1:24:32.932 --> 1:24:44.887 | |
For the language model, the probability normally | |
gets smaller the longer the sequence is; this count is | |
1:24:44.887 --> 1:24:46.836 | |
there to counteract that. | |
1:24:47.827 --> 1:24:59.717 | |
And then you get your final score by weighting | |
each of the scores we had before, with the weights from the | |
1:24:59.619 --> 1:25:07.339 | |
optimization, and that gives you a final score, | |
maybe of 23.785, | |
1:25:07.339 --> 1:25:13.278 | |
and then you can do that for several possible | |
translations and compare them. | |
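Putting the pieces of this example together as code; the helper names, feature values and weights are illustrative. Each phrase pair contributes its four (log) scores, and the sentence-level features, language model, distortion, phrase count and word count, are added on top, all weighted.

```python
import math

# Total log-linear score of one segmentation of the sentence into phrase pairs.
# phrase_table[(f, e)] holds the four probabilities stored per phrase pair.
def segmentation_score(phrase_pairs, phrase_table, lm_logprob, distortion, weights):
    score = 0.0
    for f, e in phrase_pairs:
        for w, p in zip(weights["phrase"], phrase_table[(f, e)]):
            score += w * math.log(p)
    score += weights["lm"] * lm_logprob
    score += weights["distortion"] * distortion
    score += weights["phrase_penalty"] * len(phrase_pairs)               # phrase count
    score += weights["word_penalty"] * sum(len(e.split()) for _, e in phrase_pairs)
    return score

table = {("bis jetzt", "up till now"): (0.75, 0.6, 0.4, 0.3)}            # toy entry
w = {"phrase": [1.0, 1.0, 0.5, 0.5], "lm": 0.7, "distortion": 0.3,
     "phrase_penalty": -0.2, "word_penalty": -0.1}
print(segmentation_score([("bis jetzt", "up till now")], table, -7.0, -2.0, w))
```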
1:25:14.114 --> 1:25:23.949 | |
One maybe important point here: the | |
score not only depends on the target side, but | |
1:25:23.949 --> 1:25:32.444 | |
it also depends on which phrases you have used, | |
so you could have generated the same output differently. | |
1:25:32.772 --> 1:25:38.076 | |
So you would have the same translation, but | |
you would have a different split into phrases. | |
1:25:38.979 --> 1:25:45.636 | |
And this was normally ignored so you would | |
just look at all of them and then select the | |
1:25:45.636 --> 1:25:52.672 | |
one which has the highest probability and ignore | |
that this translation could be generated by | |
1:25:52.672 --> 1:25:54.790 | |
several splits into phrases. | |
1:25:57.497 --> 1:26:06.097 | |
So to summarize what we looked into today, and | |
what you should hopefully remember: statistical | |
1:26:06.097 --> 1:26:11.440 | |
models of how to generate machine translation | |
output. There were the word-based statistical | |
1:26:11.440 --> 1:26:11.915 | |
models. | |
1:26:11.915 --> 1:26:16.962 | |
There were the IBM models at the beginning, and | |
then we have the phrase-based MT, where | |
1:26:16.962 --> 1:26:22.601 | |
it's about building the translation by putting | |
together these blocks of phrases and combining them. | |
1:26:23.283 --> 1:26:34.771 | |
If you have a model which has several features — | |
not millions, but a manageable number of features. | |
1:26:34.834 --> 1:26:42.007 | |
Then you can combine them with your log-linear | |
model, which allows you to have a variable | |
1:26:42.007 --> 1:26:45.186 | |
number of features and easily combine them. | |
1:26:45.365 --> 1:26:47.920 | |
Yeah, and how much you can trust each of these | |
models. | |
1:26:51.091 --> 1:26:54.584 | |
Do you have any further questions for this | |
topic? | |
1:26:58.378 --> 1:27:08.715 | |
And there will be on Tuesday a lecture by | |
Tuan about evaluation, and then next Thursday | |
1:27:08.715 --> 1:27:12.710 | |
there will be the practical part. | |
1:27:12.993 --> 1:27:21.461 | |
So please come to the practical part here; but | |
you can also do it yourself if you are not | |
1:27:21.461 --> 1:27:22.317 | |
able to attend. | |
1:27:23.503 --> 1:27:26.848 | |
So then please tell us, and we'll have to see | |
how we find a solution for this. | |