WEBVTT
0:00:02.822 --> 0:00:07.880 | |
We look into more linguistic approaches. | |
0:00:07.880 --> 0:00:14.912 | |
We can do machine translation in a more traditional | |
way. | |
0:00:14.912 --> 0:00:21.224 | |
It should be: Translation should be generated | |
this way. | |
0:00:21.224 --> 0:00:27.933 | |
We first analyze the source sentence, what is the meaning or the syntax.
0:00:27.933 --> 0:00:35.185 | |
Then we transfer this information to the target side and then we generate.
0:00:36.556 --> 0:00:42.341 | |
And this was the strong and common used approach | |
for yeah several years. | |
0:00:44.024 --> 0:00:50.839 | |
However, we already saw at the beginning that there are some challenges with that: language is very
0:00:50.839 --> 0:00:57.232 | |
ambiguous, and it's often very difficult to really write hand-coded rules.
0:00:57.232 --> 0:01:05.336 | |
What are the different meanings and we have | |
to do that also with a living language so new | |
0:01:05.336 --> 0:01:06.596 | |
things occur. | |
0:01:07.007 --> 0:01:09.308 | |
And that's why people look into. | |
0:01:09.308 --> 0:01:13.282 | |
Can we maybe do it differently and use machine | |
learning? | |
0:01:13.333 --> 0:01:24.849 | |
So we are no longer writing rules for how to do it; we just give examples and the system learns from them.
0:01:25.045 --> 0:01:34.836 | |
And one important thing then is these examples: | |
how can we learn how to translate one sentence? | |
0:01:35.635 --> 0:01:42.516 | |
And therefore these yeah, the data is now | |
really a very important issue. | |
0:01:42.582 --> 0:01:50.021 | |
And that is what we want to look into today. | |
0:01:50.021 --> 0:01:58.783 | |
What type of data do we use for machine translation? | |
0:01:59.019 --> 0:02:08.674 | |
So the idea in preprocessing is always: can we make the task somehow a bit easier so that
0:02:08.674 --> 0:02:13.180 | |
the MT system will in the end be better?
0:02:13.493 --> 0:02:28.309 | |
So one example could be if it has problems | |
dealing with numbers because they are occurring. | |
0:02:28.648 --> 0:02:35.479 | |
Or think about one problem which might still be there in some systems: think about
0:02:35.479 --> 0:02:36.333 | |
different units.
0:02:36.656 --> 0:02:44.897 | |
So a system might learn that, of course, if there's a number on the German side, in English there should be the same number.
0:02:45.365 --> 0:02:52.270 | |
However, if it's trained on parallel text, it will see that in German there is often km, and in English
0:02:52.270 --> 0:02:54.107 | |
typically miles.
0:02:54.594 --> 0:03:00.607 | |
It might then just translate three hundred and fifty-five miles into three hundred and fifty-five
0:03:00.607 --> 0:03:04.348 | |
kilometers, which of course is not right, and | |
so forth. | |
0:03:04.348 --> 0:03:06.953 | |
So it might make sense to look into this.
0:03:07.067 --> 0:03:13.072 | |
Therefore, the first step when you build your machine translation system is normally to look
0:03:13.072 --> 0:03:19.077 | |
at the data, to check it, to see if there is | |
anything happening which you should address | |
0:03:19.077 --> 0:03:19.887 | |
beforehand. | |
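As a small illustration of such a data check, here is a minimal Python sketch (the file names and the language pair are hypothetical) that flags sentence pairs whose numbers do not match, which is exactly what happens when units like kilometers and miles were converted rather than translated literally:

```python
import re

def numbers(text):
    """Return the sorted list of numbers appearing in a sentence."""
    return sorted(re.findall(r"\d+(?:[.,]\d+)?", text))

def find_suspicious_pairs(src_lines, tgt_lines):
    """Yield sentence pairs whose source and target numbers differ."""
    suspicious = []
    for i, (src, tgt) in enumerate(zip(src_lines, tgt_lines)):
        if numbers(src) != numbers(tgt):
            suspicious.append((i, src.strip(), tgt.strip()))
    return suspicious

if __name__ == "__main__":
    with open("train.de", encoding="utf-8") as f_src, \
         open("train.en", encoding="utf-8") as f_tgt:
        for idx, src, tgt in find_suspicious_pairs(f_src, f_tgt)[:20]:
            print(f"line {idx}: {src}  |||  {tgt}")
```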
0:03:20.360 --> 0:03:29.152 | |
And then the second part is how you represent words, since machine learning normally works on numbers.
0:03:29.109 --> 0:03:35.404 | |
So the question is how do we get out from | |
the words into numbers and I've seen some of | |
0:03:35.404 --> 0:03:35.766 | |
you? | |
0:03:35.766 --> 0:03:42.568 | |
For example, in advance there we have introduced | |
to an algorithm which we also shortly repeat | |
0:03:42.568 --> 0:03:43.075 | |
today. | |
0:03:43.303 --> 0:03:53.842 | |
The subword unit approach, which was first introduced in machine translation and is now used
0:03:53.842 --> 0:04:05.271 | |
everywhere in order to represent words. Now you've learned about morphology, so you know that maybe in
0:04:05.271 --> 0:04:09.270 | |
English it's not that important. | |
0:04:09.429 --> 0:04:22.485 | |
In German you have all these different word forms, and you would have to learn an independent representation for each of them.
0:04:24.024 --> 0:04:26.031 | |
And then, of course, they are more extreme. | |
0:04:27.807 --> 0:04:34.387 | |
So how are we doing? | |
0:04:34.975 --> 0:04:37.099 | |
Machine translation. | |
0:04:37.099 --> 0:04:46.202 | |
So hopefully you remember we had these approaches | |
to machine translation, the rule based. | |
0:04:46.202 --> 0:04:52.473 | |
We had a big block of corpus-based machine translation.
0:04:52.492 --> 0:05:00.443 | |
We will on Thursday have an overview of statistical models and then afterwards concentrate on the neural ones.
0:05:00.680 --> 0:05:08.828 | |
Both of them are corpus-based machine translation, and therefore what is really essential, and what
0:05:08.828 --> 0:05:16.640 | |
we typically use to train a machine translation system, is what we refer to as parallel data.
0:05:16.957 --> 0:05:22.395 | |
We'll talk a lot about parallel corpora or parallel data, and what I mean there is something which you
0:05:22.395 --> 0:05:28.257 | |
might know from the Rosetta Stone or something like that: typically you have one sentence
0:05:28.257 --> 0:05:33.273 | |
in the one language, and then you have aligned to it one sentence in the target language.
0:05:33.833 --> 0:05:38.261 | |
And this is how we train all our alignments. | |
0:05:38.261 --> 0:05:43.181 | |
We'll see today that of course we might not | |
have. | |
0:05:43.723 --> 0:05:51.279 | |
However, this is relatively easy to create, at least for high-quality data.
0:05:51.279 --> 0:06:00.933 | |
We'll also look into data crawling, that means how we can automatically create this parallel data
0:06:00.933 --> 0:06:02.927 | |
from the Internet. | |
0:06:04.144 --> 0:06:13.850 | |
It's not so difficult to learn these alignments | |
if we have some type of dictionary, so which | |
0:06:13.850 --> 0:06:16.981 | |
sentence is aligned to which. | |
0:06:18.718 --> 0:06:25.069 | |
What would, of course, be a lot more difficult is really the word alignment, and that's also
0:06:25.069 --> 0:06:27.476 | |
often no longer possible with good quality.
0:06:27.476 --> 0:06:33.360 | |
We do that automatically in some yes for symbols, | |
but it's definitely more challenging. | |
0:06:33.733 --> 0:06:40.691 | |
For sentence alignment, of course, it's still | |
not always perfect, so there might be that | |
0:06:40.691 --> 0:06:46.085 | |
there are two German sentences and one English sentence or the other way around.
0:06:46.085 --> 0:06:53.511 | |
So there's not always a perfect alignment, but if you look at text, it still works relatively well.
0:06:54.014 --> 0:07:03.862 | |
If we have that, then we can build a machine learning model which tries to map source
0:07:03.862 --> 0:07:06.239 | |
sentences to target sentences.
0:07:06.626 --> 0:07:15.932 | |
So this is the idea behind statistical machine translation and neural machine translation.
0:07:15.932 --> 0:07:27.098 | |
The difference is: Statistical machine translation | |
is typically a whole box of different models | |
0:07:27.098 --> 0:07:30.205 | |
which try to evaluate how good a translation is.
0:07:30.510 --> 0:07:42.798 | |
In neural machine translation, it's all one large neural network where we use the source sentence
0:07:42.798 --> 0:07:43.667 | |
as input.
0:07:44.584 --> 0:07:50.971 | |
And then we can train it by having exactly this mapping from our parallel data.
0:07:54.214 --> 0:08:02.964 | |
So what we want today to look at today is | |
we want to first look at general text data. | |
0:08:03.083 --> 0:08:06.250 | |
So what is text data? | |
0:08:06.250 --> 0:08:09.850 | |
What text data is there? | |
0:08:09.850 --> 0:08:18.202 | |
Why is it challenging so that we have large | |
vocabularies? | |
0:08:18.378 --> 0:08:22.003 | |
It's so that you always have words which you | |
haven't seen. | |
0:08:22.142 --> 0:08:29.053 | |
If you increase your corpus size, normally you will also increase your vocabulary, so you
0:08:29.053 --> 0:08:30.744 | |
always find new words. | |
0:08:31.811 --> 0:08:39.738 | |
Then based on that we'll look into pre-processing. | |
0:08:39.738 --> 0:08:45.333 | |
So how can we pre-process our data? | |
0:08:45.333 --> 0:08:46.421 | |
Maybe. | |
0:08:46.526 --> 0:08:54.788 | |
This is a lot about tokenization, for example, | |
which we heard is not so challenging in European | |
0:08:54.788 --> 0:09:02.534 | |
languages but still important, but might be | |
really difficult in Asian languages where you | |
0:09:02.534 --> 0:09:05.030 | |
don't have space separation. | |
0:09:05.986 --> 0:09:12.161 | |
And this preprocessing typically tries to deal with the extreme cases where you have rarely
0:09:12.161 --> 0:09:13.105 | |
seen things. | |
0:09:13.353 --> 0:09:25.091 | |
If you have seen your words one hundred times, it doesn't really matter if you have
0:09:25.091 --> 0:09:31.221 | |
seen them with or without punctuation or so.
0:09:31.651 --> 0:09:38.578 | |
And then we look into word representation, | |
so what is the best way to represent a word? | |
0:09:38.578 --> 0:09:45.584 | |
And finally, we look into the other type of | |
data we really need for machine translation. | |
0:09:45.725 --> 0:09:56.842 | |
So at first there is the parallel data, and later we can also use purely monolingual data
0:09:56.842 --> 0:10:00.465 | |
to improve machine translation.
0:10:00.660 --> 0:10:03.187 | |
So then the traditional approach was that | |
it was easier. | |
0:10:03.483 --> 0:10:08.697 | |
We have this type of language model which | |
we can train only on the target data to make | |
0:10:08.697 --> 0:10:12.173 | |
the text more fluent. In a neural machine translation model,
0:10:12.173 --> 0:10:18.106 | |
It's partly a bit more complicated to integrate | |
this data but still it's very important especially | |
0:10:18.106 --> 0:10:22.362 | |
if you think about low-resource languages where you have very little data.
0:10:23.603 --> 0:10:26.999 | |
It's harder to get parallel data than you | |
get monolingual data. | |
0:10:27.347 --> 0:10:33.821 | |
Because monolingual data you just have out | |
there not huge amounts for some languages, | |
0:10:33.821 --> 0:10:38.113 | |
but definitely the amount of data is always | |
significant. | |
0:10:40.940 --> 0:10:50.454 | |
When we talk about data, it's also of course | |
important how we use it for machine learning. | |
0:10:50.530 --> 0:11:05.867 | |
And that you hopefully learned in some prior class: typically we separate our data into
0:11:05.867 --> 0:11:17.848 | |
three chunks. The first is the training data: this is really by far the largest, and it grows with the data we get.
0:11:17.848 --> 0:11:21.387 | |
Today we have millions of sentences here.
0:11:22.222 --> 0:11:27.320 | |
Then we have our validation data, and that is to tune some type of hyperparameters.
0:11:27.320 --> 0:11:33.129 | |
So not only you have some things to configure | |
and you don't know what is the right value, | |
0:11:33.129 --> 0:11:39.067 | |
so what you can do is train a model and change | |
these a bit and try to find the best ones on | |
0:11:39.067 --> 0:11:40.164 | |
your validation data.
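A minimal sketch of such a split, assuming the parallel corpus is stored as two plain-text files with one sentence per line (the file names and split sizes are made up); a purely random split is only a baseline, since, as discussed later, the test set should be representative of the use case:

```python
import random

def split_corpus(pairs, valid_size=2000, test_size=2000, seed=42):
    """Split sentence pairs into training, validation and test sets."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    test = pairs[:test_size]
    valid = pairs[test_size:test_size + valid_size]
    train = pairs[test_size + valid_size:]      # by far the largest chunk
    return train, valid, test

with open("corpus.de", encoding="utf-8") as f_de, \
     open("corpus.en", encoding="utf-8") as f_en:
    pairs = list(zip(f_de, f_en))

train, valid, test = split_corpus(pairs)
print(len(train), len(valid), len(test))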
0:11:40.700 --> 0:11:48.531 | |
For a statistical model, for example, the validation data is what you want to use if you have several
0:11:48.531 --> 0:11:54.664 | |
models: you need to know how to combine them, so how much weight should you put on the different
0:11:54.664 --> 0:11:55.186 | |
models? | |
0:11:55.186 --> 0:11:59.301 | |
And if it's like twenty models, it's only twenty parameters.
0:11:59.301 --> 0:12:02.828 | |
It's not that much, so that can still be reliably estimated.
0:12:03.183 --> 0:12:18.964 | |
In neural models there's often the question how long you should train the model before you get
0:12:18.964 --> 0:12:21.322 | |
overfitting. | |
0:12:22.902 --> 0:12:28.679 | |
And then you have your test data, which is | |
finally where you report on your test. | |
0:12:29.009 --> 0:12:33.663 | |
And therefore it's also important that from | |
time to time you get new test data because | |
0:12:33.663 --> 0:12:38.423 | |
if you always run your experiments, test on it, and then do new experiments
0:12:38.423 --> 0:12:43.452 | |
and test again, at some point you have tested so many things on it that you do some type of training
0:12:43.452 --> 0:12:48.373 | |
on your test data again, because you just select the things which are in the end best on your
0:12:48.373 --> 0:12:48.962 | |
test data. | |
0:12:49.009 --> 0:12:54.755 | |
It's important to get a new test data from | |
time to time, for example in important evaluation | |
0:12:54.755 --> 0:12:58.340 | |
campaigns for machine translation and speech | |
translation. | |
0:12:58.618 --> 0:13:07.459 | |
There, every year a new test set is created so we can see if the models really
0:13:07.459 --> 0:13:09.761 | |
get better on new data.
0:13:10.951 --> 0:13:19.629 | |
And of course it is important that this is representative of the use case you are interested in.
0:13:19.879 --> 0:13:36.511 | |
So if you're building a system for translating websites, your test data should also be websites.
0:13:36.816 --> 0:13:39.356 | |
So normally a system is good on some tasks. | |
0:13:40.780 --> 0:13:48.596 | |
Or it should solve everything, and then your test data should be drawn from everything, because if
0:13:48.596 --> 0:13:54.102 | |
you only have a very small subset, you only know that it's good on this subset.
0:13:54.394 --> 0:14:02.714 | |
Therefore, the selection of your test data | |
is really important in order to ensure that | |
0:14:02.714 --> 0:14:05.200 | |
the MT system in the end does what you want.
0:14:05.525 --> 0:14:12.646 | |
Maybe it is the greatest system ever when you have evaluated it on translating the Bible,
0:14:12.646 --> 0:14:21.830 | |
but the use case is to translate some Twitter data, and you can imagine the performance might
0:14:21.830 --> 0:14:22.965 | |
be really different there.
0:14:23.803 --> 0:14:25.471 | |
And finally:
0:14:25.471 --> 0:14:35.478 | |
Of course, in order to have a realistic evaluation, it's important that there's no
0:14:35.478 --> 0:14:39.370 | |
overlap between these data sets, because:
0:14:39.799 --> 0:14:51.615 | |
The danger might be that the system is learning by heart how to translate the sentences from your
0:14:51.615 --> 0:14:53.584 | |
training data. | |
0:14:54.194 --> 0:15:04.430 | |
So the test data should be really different from your training data.
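A small sketch of this overlap check, simply testing whether any test sentence also appears verbatim in the training data (file names are hypothetical):

```python
def load_sentences(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

train = set(load_sentences("train.de"))
test = load_sentences("test.de")

overlap = [s for s in test if s in train]
print(f"{len(overlap)} of {len(test)} test sentences also occur in the training data")
```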
0:15:04.430 --> 0:15:16.811 | |
Therefore, it's important to check that. So what type of data do we have?
0:15:16.811 --> 0:15:24.966 | |
There's a lot of different text data and the | |
nice thing is with digitalization. | |
0:15:25.345 --> 0:15:31.785 | |
You might think there's a large amount with | |
books, but to be honest books and printed things | |
0:15:31.785 --> 0:15:35.524 | |
that's by now a minor percentage of the data | |
we have. | |
0:15:35.815 --> 0:15:39.947 | |
There's like so much data created every day | |
on the Internet. | |
0:15:39.980 --> 0:15:46.223 | |
With social media and all the other types. | |
0:15:46.223 --> 0:15:56.821 | |
This of course is a largest amount of data, | |
more of colloquial language. | |
0:15:56.856 --> 0:16:02.609 | |
It might be more noisy and harder to process, | |
so there is a whole area on how to deal with | |
0:16:02.609 --> 0:16:04.948 | |
more social media and outdoor stuff. | |
0:16:07.347 --> 0:16:20.702 | |
What type of data is there if you think about parallel data? News data, official sites, and so on.
0:16:20.900 --> 0:16:26.629 | |
So the first parallel corpora were things like the European Parliament proceedings or some news
0:16:26.629 --> 0:16:27.069 | |
sites. | |
0:16:27.227 --> 0:16:32.888 | |
Nowadays there's quite a large amount of data | |
crawled from the Internet, but of course if | |
0:16:32.888 --> 0:16:38.613 | |
you crawl parallel data from the Internet, | |
a lot of the data is also like company websites | |
0:16:38.613 --> 0:16:41.884 | |
or so which gets translated into several languages. | |
0:16:45.365 --> 0:17:00.613 | |
Then, of course, there is different levels | |
of text and we have to look at what level we | |
0:17:00.613 --> 0:17:05.118 | |
want to process our data. | |
0:17:05.885 --> 0:17:16.140 | |
It one normally doesn't make sense to work | |
on full sentences because a lot of sentences | |
0:17:16.140 --> 0:17:22.899 | |
have never been seen and you always create | |
new sentences. | |
0:17:23.283 --> 0:17:37.421 | |
So typically what we take as our basic unit is something between words and letters, and that
0:17:37.421 --> 0:17:40.033 | |
is an essential decision.
0:17:40.400 --> 0:17:47.873 | |
So we need some of these atomic blocks or basic blocks which we can't make smaller.
0:17:48.128 --> 0:17:55.987 | |
So if we're building a sentence, for example, | |
you can build it out of something and you can | |
0:17:55.987 --> 0:17:57.268 | |
either decide. | |
0:17:57.268 --> 0:18:01.967 | |
For example, you take words and you spit them | |
further. | |
0:18:03.683 --> 0:18:10.178 | |
Then, of course, the nice thing is not too | |
small and therefore building larger things | |
0:18:10.178 --> 0:18:11.386 | |
like sentences. | |
0:18:11.831 --> 0:18:16.690 | |
So you only have to take your vocabulary and | |
put it somewhere together to get your full | |
0:18:16.690 --> 0:18:17.132 | |
sentence.
0:18:19.659 --> 0:18:27.670 | |
However, if the blocks are too large, they don't occur often enough, and you have more blocks
0:18:27.670 --> 0:18:28.715 | |
that occur only rarely.
0:18:29.249 --> 0:18:34.400 | |
And that's why we can work with smaller blocks like subword units.
0:18:34.714 --> 0:18:38.183 | |
Work with neural models. | |
0:18:38.183 --> 0:18:50.533 | |
Then you can work on letters so you have a | |
system which tries to understand the sentence | |
0:18:50.533 --> 0:18:53.031 | |
letter by letter. | |
0:18:53.313 --> 0:18:57.608 | |
But that is a design decision which you have | |
to take at some point. | |
0:18:57.608 --> 0:19:03.292 | |
On which level do you want to split your text, and what are the basic blocks that you are
0:19:03.292 --> 0:19:04.176 | |
working with? | |
0:19:04.176 --> 0:19:06.955 | |
And that's something we'll look into today. | |
0:19:06.955 --> 0:19:08.471 | |
What possibilities are? | |
0:19:12.572 --> 0:19:14.189 | |
Any question. | |
0:19:17.998 --> 0:19:24.456 | |
Then let's look a bit on what type of data | |
there is in how much data there is to person. | |
0:19:24.824 --> 0:19:34.006 | |
The point is that nowadays, at least for pure text, data is no longer the limit for some languages.
0:19:34.006 --> 0:19:38.959 | |
There is so much data that we cannot even process all of it.
0:19:39.479 --> 0:19:49.384 | |
That is only true for some languages, but | |
there is also interest in other languages and | |
0:19:49.384 --> 0:19:50.622 | |
important. | |
0:19:50.810 --> 0:20:01.483 | |
So if you want to build a system for Sweden | |
or for some dialect in other countries, then | |
0:20:01.483 --> 0:20:02.802 | |
of course. | |
0:20:03.103 --> 0:20:06.888 | |
Otherwise you have this huge amount of hair. | |
0:20:06.888 --> 0:20:11.515 | |
We are often no longer taking about gigabytes | |
or more. | |
0:20:11.891 --> 0:20:35.788 | |
The amount of information that is produced every year is enormous. And this is like all the information
0:20:35.788 --> 0:20:40.661 | |
that is available, so there is really a lot.
0:20:41.001 --> 0:20:44.129 | |
We look at machine translation. | |
0:20:44.129 --> 0:20:53.027 | |
We can see these numbers are really like more | |
than ten years old, but we see this increase | |
0:20:53.027 --> 0:20:58.796 | |
in data size: one billion words of English data we had at that time.
0:20:59.019 --> 0:21:01.955 | |
Then I wore like new shuffle on Google Maps | |
and stuff. | |
0:21:02.382 --> 0:21:05.003 | |
For this one you could train your system on. | |
0:21:05.805 --> 0:21:20.457 | |
And the interesting thing is this one billion | |
words is more than any human typically speaks. | |
0:21:21.001 --> 0:21:25.892 | |
So these systems see by now an order of magnitude more data.
0:21:25.892 --> 0:21:32.465 | |
I think an order of magnitude more data than a human has ever seen in their
0:21:32.465 --> 0:21:33.229 | |
lifetime. | |
0:21:35.175 --> 0:21:41.808 | |
And that is maybe the interesting question: why do they still make errors, given how much data they have
0:21:41.808 --> 0:21:42.637 | |
seen.
0:21:43.103 --> 0:21:48.745 | |
So we are seeing a really impressive result, | |
but in most cases it's not that they're really | |
0:21:48.745 --> 0:21:49.911 | |
better than human. | |
0:21:50.170 --> 0:21:56.852 | |
However, they really have seen more data than any human ever has seen in their lifetime.
0:21:57.197 --> 0:22:01.468 | |
They can just process so much data, so. | |
0:22:01.501 --> 0:22:08.425 | |
The question is, can we make them more efficient so that they can learn similarly well without
0:22:08.425 --> 0:22:09.592 | |
that much data? | |
0:22:09.592 --> 0:22:16.443 | |
And that is essential if we now go to low-resource languages where we might never get that much
0:22:16.443 --> 0:22:21.254 | |
data, and we should also be able to achieve a reasonable performance there.
0:22:23.303 --> 0:22:32.399 | |
On the other hand, this of course links also | |
to one topic which we will cover later: If | |
0:22:32.399 --> 0:22:37.965 | |
you think about this, it's really important | |
that your algorithms are also very efficient | |
0:22:37.965 --> 0:22:41.280 | |
in order to process that much data both in | |
training. | |
0:22:41.280 --> 0:22:46.408 | |
If you have more data, you want to process | |
more data so you can make use of that. | |
0:22:46.466 --> 0:22:54.499 | |
On the other hand, if more and more data is | |
processed, more and more people will use machine | |
0:22:54.499 --> 0:23:06.816 | |
translation to generate translations, and it will be important to handle that efficiently as well. There
0:23:06.816 --> 0:23:07.257 | |
is more and
0:23:07.607 --> 0:23:10.610 | |
more
0:23:10.170 --> 0:23:17.262 | |
data generated every day. Here are just some general numbers on how much data there
0:23:17.262 --> 0:23:17.584 | |
is. | |
0:23:17.584 --> 0:23:24.595 | |
It is said that a lot of the data we produce, at least at the moment, is rich in text, so text
0:23:24.595 --> 0:23:26.046 | |
that is produced. | |
0:23:26.026 --> 0:23:29.748 | |
That is very important in two ways: on the one hand,
0:23:29.748 --> 0:23:33.949 | |
We can use it as training data in some way. | |
0:23:33.873 --> 0:23:40.836 | |
On the other hand, we want to translate some of that because it might not be published in all the languages,
0:23:40.836 --> 0:23:46.039 | |
and so the need for machine translation becomes even more important.
0:23:47.907 --> 0:23:51.547 | |
So what are the challenges with this? | |
0:23:51.831 --> 0:24:01.360 | |
So first of all that seems to be very good | |
news, so there is more and more data, so we | |
0:24:01.360 --> 0:24:10.780 | |
can just wait for three years and have more | |
data, and then our system will be better. | |
0:24:11.011 --> 0:24:22.629 | |
If you see in competitions, the system performance | |
increases. | |
0:24:24.004 --> 0:24:27.190 | |
You see that here are three different systems. The BLEU score is a metric to measure how good an
Blue score is metric to measure how good an | |
MT system is, and we'll talk about evaluation next week, so you'll learn how to evaluate
and the next week so you'll have to evaluate | |
machine translation; there is also a practical session.
0:24:41.581 --> 0:24:45.219 | |
And so. | |
0:24:44.784 --> 0:24:50.960 | |
This axis shows how much of the training data you have: with five percent
0:24:50.960 --> 0:24:56.117 | |
you're significantly worse than with forty percent or eighty percent.
0:24:56.117 --> 0:25:02.021 | |
You're getting better, and you're seeing that this curve maybe does not really
0:25:02.021 --> 0:25:02.971 | |
flatten out.
0:25:02.971 --> 0:25:03.311 | |
But. | |
0:25:03.263 --> 0:25:07.525 | |
Of course, the gains you get are normally | |
smaller and smaller. | |
0:25:07.525 --> 0:25:09.216 | |
the more data you have.
0:25:09.549 --> 0:25:21.432 | |
Your improvements are normally bigger if you add the same amount or even double your
0:25:21.432 --> 0:25:25.657 | |
data when you start small; of course, more data always helps.
0:25:26.526 --> 0:25:34.955 | |
However, you see the clear tendency: if you need to improve your system,
0:25:34.955 --> 0:25:38.935 | |
this is possible by just getting more data.
0:25:39.039 --> 0:25:41.110 | |
But it's not all about the amount of data. It can also be about the domain of the data that you are
It can also be the domain of the day that | |
working with.
0:25:45.865 --> 0:25:55.668 | |
So this was a test of a machine translation system on translating genome data.
0:25:55.668 --> 0:26:02.669 | |
We have a colleague who is working on translating this kind of data.
0:26:02.862 --> 0:26:06.868 | |
Here you see the performance measured in BLEU score. You see one system which was only trained
You see one system which only was trained | |
on genome data, and it only has very little training data.
0:26:12.812 --> 0:26:17.742 | |
That's very, very little for machine translation.
0:26:18.438 --> 0:26:23.927 | |
And compare that to a system which was trained on general news translation data,
0:26:24.104 --> 0:26:34.177 | |
With four point five million sentences so | |
roughly one hundred times as much data you | |
0:26:34.177 --> 0:26:40.458 | |
still see that this system doesn't really work | |
well. | |
0:26:40.820 --> 0:26:50.575 | |
So you see it's not only about data, it's | |
also that the data has to somewhat fit to the | |
0:26:50.575 --> 0:26:51.462 | |
domain. | |
0:26:51.831 --> 0:26:58.069 | |
The more general data you get, the better you have covered all domains.
0:26:58.418 --> 0:27:07.906 | |
But that's very difficult and especially for | |
more specific domains. | |
0:27:07.906 --> 0:27:16.696 | |
It can be really important to get data which | |
fits your domain. | |
0:27:16.716 --> 0:27:18.520 | |
Maybe you could do some prompting or something like that, maybe if you
0:27:18.598 --> 0:27:22.341 | |
tell it: okay, concentrate on this domain, to make it better.
0:27:24.564 --> 0:27:28.201 | |
It's not that easy to prompt it. | |
0:27:28.201 --> 0:27:35.807 | |
You can do the prompting in the more traditional | |
way of fine tuning. | |
0:27:35.807 --> 0:27:44.514 | |
Then, of course, if you select data and later combine it, you can get better.
0:27:44.904 --> 0:27:52.675 | |
But it will always be that this type of similar | |
data is much more important than the general. | |
0:27:52.912 --> 0:28:00.705 | |
So of course it can make your system a lot better if you search for similar data
0:28:00.705 --> 0:28:01.612 | |
and find it.
0:28:02.122 --> 0:28:08.190 | |
We will have a lecture on domain adaptation, where exactly this is the idea: how you can make systems
0:28:08.190 --> 0:28:13.935 | |
in these situations better so you can adapt | |
it to this data but then you still need this | |
0:28:13.935 --> 0:28:14.839 | |
type of data. | |
0:28:15.335 --> 0:28:21.590 | |
And in prompting it might work if you have | |
seen it in your data so it can make the system | |
0:28:21.590 --> 0:28:25.134 | |
aware and tell it to focus more on this type of data.
0:28:25.465 --> 0:28:30.684 | |
But if you haven't had enough of the really | |
specific good matching data, I think it will | |
0:28:30.684 --> 0:28:31.681 | |
still not work well.
0:28:31.681 --> 0:28:37.077 | |
So you need to have this type of data and | |
therefore it's important not only to have general | |
0:28:37.077 --> 0:28:42.120 | |
data but also data, at least in your overall | |
system, which really fits to the domain. | |
0:28:45.966 --> 0:28:53.298 | |
And then the second thing, of course, is you | |
need to have data that has good quality. | |
0:28:53.693 --> 0:29:00.170 | |
In the early stages it might be good to have | |
all the data but later it's especially important | |
0:29:00.170 --> 0:29:06.577 | |
that you have somehow good quality and so that | |
you're learning what you really want to learn | |
0:29:06.577 --> 0:29:09.057 | |
and not learning some wrong things.
0:29:10.370 --> 0:29:21.551 | |
We talked about this with the kilometers and | |
miles, so if you just take in some type of | |
0:29:21.551 --> 0:29:26.253 | |
data and don't look at the quality, you can get exactly such problems.
0:29:26.766 --> 0:29:30.875 | |
But of course, the question here is what is | |
good quality data? | |
0:29:31.331 --> 0:29:35.054 | |
It is not yet that easy to define what is | |
a good quality data. | |
0:29:36.096 --> 0:29:43.961 | |
That doesn't mean it has to be what people generally consider high-quality text, like something written
0:29:43.961 --> 0:29:47.814 | |
by a Nobel Prize winner or something like that. | |
0:29:47.814 --> 0:29:54.074 | |
This is not what we mean by quality here; the most important thing, again, is:
0:29:54.354 --> 0:30:09.181 | |
So if you have Twitter data, high quality | |
data doesn't mean you have now some novels. | |
0:30:09.309 --> 0:30:12.875 | |
as training data, but rather data that is represented similarly.
0:30:12.875 --> 0:30:18.480 | |
One thing that definitely belongs to quality is that the two sides should really be translations of each
0:30:18.480 --> 0:30:18.862 | |
other.
0:30:19.199 --> 0:30:25.556 | |
So especially if you crawl data, you will often find that it's not a direct translation.
0:30:25.805 --> 0:30:28.436 | |
So then, of course, this is not high-quality training data.
0:30:29.449 --> 0:30:39.974 | |
But in general that's a very difficult thing to do, and it's very difficult to define what
0:30:39.974 --> 0:30:41.378 | |
good quality really means.
0:30:41.982 --> 0:30:48.333 | |
And of course one metric is always: the quality of your data is good if your machine translation system gets better.
0:30:48.648 --> 0:30:50.719 | |
So that is like the indirect measure.
0:30:50.991 --> 0:30:52.447 | |
But what can we measure directly?
0:30:52.447 --> 0:30:57.210 | |
Of course, it's difficult to always try a lot of things and evaluate each of them,
0:30:57.210 --> 0:30:59.396 | |
build a full MT system and then check:
0:30:59.396 --> 0:31:00.852 | |
Oh, was this a good idea? | |
0:31:00.852 --> 0:31:01.357 | |
I mean,. | |
0:31:01.581 --> 0:31:19.055 | |
Say you have two tokenizers which split sentences into words, and you want to know which one to apply.
0:31:19.179 --> 0:31:21.652 | |
Now you could maybe argue or your idea could | |
be. | |
0:31:21.841 --> 0:31:30.186 | |
just try it in a small setting very quickly and then take the result, but the problem is there is not
0:31:30.186 --> 0:31:31.448 | |
always such a clear answer.
0:31:31.531 --> 0:31:36.269 | |
Something might work very well for small data.
0:31:36.269 --> 0:31:43.123 | |
It's not for sure that the same effect will happen at large scale.
0:31:43.223 --> 0:31:50.395 | |
This idea really improves on very low resource | |
data if only train on hundred words. | |
0:31:51.271 --> 0:31:58.357 | |
But if you use it for a large data set, it doesn't really matter and your idea doesn't help anymore.
0:31:58.598 --> 0:32:01.172 | |
So that is also a typical thing. | |
0:32:01.172 --> 0:32:05.383 | |
This quality issue is more and more important | |
if you. | |
0:32:06.026 --> 0:32:16.459 | |
But one motivation which you should generally have: you want to represent your data such that you see
0:32:16.459 --> 0:32:17.469 | |
events as often as possible.
0:32:17.677 --> 0:32:21.805 | |
Why is this the case any idea? | |
0:32:21.805 --> 0:32:33.389 | |
Why could this be a motivation, that we try to represent the data in a way that we have
0:32:33.389 --> 0:32:34.587 | |
seen things as often as possible?
0:32:38.338 --> 0:32:50.501 | |
We also want to learn about the context, because maybe sometimes the meaning comes from the context.
0:32:52.612 --> 0:32:54.020 | |
The context is here. | |
0:32:54.020 --> 0:32:56.432 | |
It's more about the learning first. | |
0:32:56.432 --> 0:33:00.990 | |
You can generally learn better if you've seen | |
something more often. | |
0:33:00.990 --> 0:33:06.553 | |
So if you have seen an event only once, it's | |
really hard to learn about the event. | |
0:33:07.107 --> 0:33:15.057 | |
If you have seen an event a hundred times, you are better at estimating it, and maybe that
0:33:15.057 --> 0:33:18.529 | |
includes the context; then you can use the context.
0:33:18.778 --> 0:33:21.331 | |
So, for example, take the word 'house' here.
0:33:21.761 --> 0:33:28.440 | |
If you would just take the data normally you | |
would directly process the data. | |
0:33:28.440 --> 0:33:32.893 | |
In the example above you would have 'house' followed by a dot.
0:33:32.893 --> 0:33:40.085 | |
That's a different word than 'house' on its own, and again different from 'house' with a comma.
0:33:40.520 --> 0:33:48.365 | |
So you want to learn how this translates into | |
house, but you translate an upper case. | |
0:33:48.365 --> 0:33:50.281 | |
How this translates. | |
0:33:50.610 --> 0:33:59.445 | |
You would be learning how to translate each of these variants separately, so you have to learn four different
0:33:59.445 --> 0:34:00.205 | |
things. | |
0:34:00.205 --> 0:34:06.000 | |
Instead, we really want to learn only once how 'house' gets translated.
0:34:06.366 --> 0:34:18.796 | |
And then imagine it would even be ambiguous: it might be that here 'house' could also be translated
0:34:18.678 --> 0:34:22.089 | |
into a different word.
0:34:22.202 --> 0:34:29.512 | |
If it's uppercase, the system might then always translate it into the one word, while if it's lower
0:34:29.512 --> 0:34:34.955 | |
case it is translated into 'house', and that's of course not right.
0:34:34.955 --> 0:34:39.260 | |
We have to use the context to decide what | |
is better. | |
0:34:39.679 --> 0:34:47.086 | |
If you have seen an event several times then | |
you are better able to learn your model and | |
0:34:47.086 --> 0:34:51.414 | |
that doesn't matter what type of learning you | |
have. | |
0:34:52.392 --> 0:34:58.981 | |
I shouldn't say all but for most of these | |
models it's always better to have like seen | |
0:34:58.981 --> 0:35:00.897 | |
an event more often.
0:35:00.920 --> 0:35:11.483 | |
Therefore, if you preprocess your data, you should ask the question how you can represent the data
0:35:11.483 --> 0:35:14.212 | |
in order to have seen each event as often as possible.
0:35:14.514 --> 0:35:17.885 | |
Of course you should not remove that information. | |
0:35:18.078 --> 0:35:25.519 | |
So you could now, of course, just lowercase | |
everything. | |
0:35:25.519 --> 0:35:30.303 | |
Then you've seen things more often. | |
0:35:30.710 --> 0:35:38.443 | |
And that might be an issue because in the | |
final application you want to have real text | |
0:35:38.443 --> 0:35:38.887 | |
and proper casing.
0:35:40.440 --> 0:35:44.003 | |
And finally, it's even more important that it's consistent.
0:35:44.965 --> 0:35:52.630 | |
So it is a problem if things, for example, aren't consistent.
0:35:52.630 --> 0:35:58.762 | |
Say 'I am' is always written together as 'I'm' in the training data,
0:35:58.762 --> 0:36:04.512 | |
and not in the test data; then you have a mismatch.
0:36:04.824 --> 0:36:14.612 | |
Therefore, the most important thing is to do the preprocessing and represent your data in the way that is most consistent,
0:36:14.612 --> 0:36:18.413 | |
because then it's easier to map similar things onto each other.
0:36:18.758 --> 0:36:26.588 | |
If your text is represented very, very differently, then your data will be translated badly.
0:36:26.666 --> 0:36:30.664 | |
So we once had the case. | |
0:36:30.664 --> 0:36:40.420 | |
For example, there was some data where the German characters were written differently.
0:36:40.900 --> 0:36:44.187 | |
And if you read it as a human you see it. | |
0:36:44.187 --> 0:36:49.507 | |
It's even hard to get the difference because | |
it looks very similar. | |
0:36:50.130 --> 0:37:02.997 | |
If you use it in a machine translation system, though, it would not be able to translate anything
0:37:02.997 --> 0:37:08.229 | |
of it, because every word is a different word for the system.
0:37:09.990 --> 0:37:17.736 | |
And on the other hand you should of course not change your training
0:37:17.736 --> 0:37:18.968 | |
data in a way that removes important information,
0:37:18.968 --> 0:37:27.155 | |
for example by removing case information if your task is to generate case information.
0:37:31.191 --> 0:37:41.081 | |
One thing which is a good starting point to look at in order to see the difficulty of your data
0:37:41.081 --> 0:37:42.711 | |
is to compare types and tokens.
0:37:43.103 --> 0:37:45.583 | |
There are types and tokens.
0:37:45.583 --> 0:37:57.983 | |
By types we mean the number of unique words in the corpus, so your vocabulary; tokens are the running words.
0:37:58.298 --> 0:38:08.628 | |
And then you can look at the type token ratio | |
that means a number of types per token. | |
0:38:15.815 --> 0:38:22.381 | |
You always have fewer types than tokens, because every word appears at least once in the corpus, but most
0:38:22.381 --> 0:38:27.081 | |
of them will occur more often, so the token count is the bigger number.
0:38:27.667 --> 0:38:30.548 | |
And of course this changes if you have more | |
data.
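A minimal sketch of counting types and tokens and their ratio, using simple whitespace tokenization (the file name is hypothetical):

```python
from collections import Counter

def type_token_ratio(path):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    tokens = sum(counts.values())   # running words
    types = len(counts)             # distinct words, i.e. the vocabulary
    return types, tokens, types / tokens

types, tokens, ttr = type_token_ratio("corpus.en")
print(f"{types} types, {tokens} tokens, type-token ratio {ttr:.4f}")
```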
0:38:31.191 --> 0:38:38.103 | |
Here is an example from an English Wikipedia. | |
0:38:38.103 --> 0:38:45.015 | |
That means each word occurs a certain number of times on average.
0:38:45.425 --> 0:38:47.058 | |
Of course there's a big difference. | |
0:38:47.058 --> 0:38:51.323 | |
There will be some words which occur one hundred times, but on the other hand most of the words occur
0:38:51.323 --> 0:38:51.777 | |
only once.
0:38:52.252 --> 0:38:55.165 | |
However, you see this ratio goes down. | |
0:38:55.165 --> 0:39:01.812 | |
That's a good thing, so you have seen each | |
word more often and therefore your model gets | |
0:39:01.812 --> 0:39:03.156 | |
typically better. | |
0:39:03.156 --> 0:39:08.683 | |
However, the problem is we always have a lot | |
of words which we have seen only rarely.
0:39:09.749 --> 0:39:15.111 | |
Even here there will be a bunch of words which you have only seen once.
0:39:15.111 --> 0:39:20.472 | |
However, this can give you an indication about | |
the quality of the data. | |
0:39:20.472 --> 0:39:27.323 | |
So you should always, of course, try to achieve a representation of the data where you have a very low type-to-
0:39:27.323 --> 0:39:28.142 | |
token ratio.
0:39:28.808 --> 0:39:39.108 | |
For example, if you compare Simple English Wikipedia and normal Wikipedia, what would be your expectation?
0:39:41.861 --> 0:39:49.842 | |
Yes, exactly; however, it's surprisingly only a little bit lower, but you see that it's
0:39:49.842 --> 0:39:57.579 | |
lower, so we are using fewer words to express the same thing, and therefore the task to produce
0:39:57.579 --> 0:39:59.941 | |
this text is also easier.
0:40:01.221 --> 0:40:07.702 | |
However, as to how many words there are, there is no clear answer.
0:40:07.787 --> 0:40:19.915 | |
So there will be always more words, especially | |
depending on your dataset, how many different | |
0:40:19.915 --> 0:40:22.132 | |
words there are. | |
0:40:22.482 --> 0:40:30.027 | |
So if you have a million tweets, that is around fifty million tokens, and you have six hundred
0:40:30.027 --> 0:40:30.875 | |
thousand different words.
0:40:31.251 --> 0:40:40.299 | |
If you have many times this number of tweets, you also have significantly more tokens, but also more types.
0:40:40.660 --> 0:40:58.590 | |
So especially in things like the social media, | |
of course, there's always different types of | |
0:40:58.590 --> 0:40:59.954 | |
words. | |
0:41:00.040 --> 0:41:04.028 | |
Another example from not social media is here. | |
0:41:04.264 --> 0:41:18.360 | |
So yeah, there is a smaller data set, the Switchboard phone conversations, with two million tokens and
0:41:18.360 --> 0:41:22.697 | |
only twenty thousand different words.
0:41:23.883 --> 0:41:37.221 | |
If you think about Shakespeare, it has even fewer tokens, significantly less than a million,
0:41:37.221 --> 0:41:40.006 | |
but the number of different words is still relatively large.
0:41:40.060 --> 0:41:48.781 | |
On the other hand, there is this Google N-gram corpus, which has a huge number of tokens, and there are always
0:41:48.781 --> 0:41:50.506 | |
new words coming in.
0:41:50.991 --> 0:41:52.841 | |
This is English.
0:41:52.841 --> 0:42:08.103 | |
The nice thing about English is that the vocabulary is relatively small, not tiny, but relatively
0:42:08.103 --> 0:42:09.183 | |
small. | |
0:42:09.409 --> 0:42:14.224 | |
So here you see the TED corpus.
0:42:15.555 --> 0:42:18.144 | |
You all know TED talks.
0:42:18.144 --> 0:42:26.429 | |
They are transcribed and translated, a nice resource for us, though an especially small corpus.
0:42:26.846 --> 0:42:32.702 | |
You can do a lot of experiments with that | |
and you see that the corpus size is relatively
0:42:32.702 --> 0:42:36.782 | |
similar so we have around four million tokens | |
in this corpus. | |
0:42:36.957 --> 0:42:44.464 | |
However, if you look at the vocabulary, English | |
has only about half as many different words
0:42:44.464 --> 0:42:47.045 | |
as German and Dutch and Italian. | |
0:42:47.527 --> 0:42:56.260 | |
So this is partly an influence of compound words, which are more frequent in German, and
0:42:56.260 --> 0:43:02.978 | |
more importantly of all the different morphological forms.
0:43:03.263 --> 0:43:08.170 | |
These all lead to new words, and they need to be somehow represented.
0:43:11.531 --> 0:43:20.278 | |
So to deal with this, the question is how | |
can we normalize the text in order to make | |
0:43:20.278 --> 0:43:22.028 | |
the text easier? | |
0:43:22.028 --> 0:43:25.424 | |
Can we make the task simpler?
0:43:25.424 --> 0:43:29.231 | |
But we need to keep all information. | |
0:43:29.409 --> 0:43:32.239 | |
Here is an example where information does get lost:
0:43:32.239 --> 0:43:35.012 | |
of course you make the task easier if you just lowercase everything.
0:43:35.275 --> 0:43:41.141 | |
You don't have to deal with different cases. | |
0:43:41.141 --> 0:43:42.836 | |
It's easier. | |
0:43:42.836 --> 0:43:52.482 | |
However, information gets lost, and you might need it to generate the correct target text.
0:43:52.832 --> 0:44:00.153 | |
So the question is always: How can we on the | |
one hand simplify the task but keep all the | |
0:44:00.153 --> 0:44:01.223 | |
information? | |
0:44:01.441 --> 0:44:06.639 | |
I say all necessary information because it depends on the task.
0:44:06.639 --> 0:44:11.724 | |
For some tasks it might be fine to remove the casing, for example.
0:44:14.194 --> 0:44:23.463 | |
So the steps we are typically doing are: you segment the running
0:44:23.463 --> 0:44:30.696 | |
text into words, you normalize word forms, and you do segmentation into sentences.
0:44:30.696 --> 0:44:33.955 | |
Also, if you do not have a single sentence per line,
0:44:33.933 --> 0:44:38.739 | |
if this has not already been done at this point, the text is also split into sentence segments.
0:44:39.779 --> 0:44:52.609 | |
So what are we doing there? For European languages, segmentation into words
0:44:52.609 --> 0:44:57.290 | |
It's not that complicated. | |
0:44:57.277 --> 0:45:06.001 | |
You have to somehow handle the joined words, and when handling joined words the most important thing is consistency.
0:45:06.526 --> 0:45:11.331 | |
So in most systems it really doesn't matter | |
much. | |
0:45:11.331 --> 0:45:16.712 | |
if you write 'I'm' together as one word or as two words.
0:45:17.197 --> 0:45:23.511 | |
The nice thing about 'I'm' is that it occurs so often that it doesn't matter if you have both variants,
0:45:23.511 --> 0:45:26.560 | |
as long as they both occur often enough.
0:45:26.560 --> 0:45:32.802 | |
But you'll have some of these cases where they don't occur that often, so you should
0:45:32.802 --> 0:45:35.487 | |
be as consistent as possible.
0:45:36.796 --> 0:45:41.662 | |
But of course things can get more complicated. | |
0:45:41.662 --> 0:45:48.598 | |
If you have 'Finland's capital', do you want to split off the 's or not?
0:45:48.598 --> 0:45:53.256 | |
Do you split 'isn't', or do you even write it out as 'is not'?
0:45:53.433 --> 0:46:00.468 | |
And what about like things with hyphens in | |
the middle and so on? | |
0:46:00.540 --> 0:46:07.729 | |
So not everything is very easy, but it is generally possible to keep things reasonably consistent.
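A minimal sketch of such rule-based tokenization for European languages; real toolkits handle many more special cases (abbreviations, hyphens, URLs), so this only illustrates the idea:

```python
import re

def tokenize(sentence):
    s = sentence.strip()
    s = re.sub(r"([.,!?;:\"()])", r" \1 ", s)   # split off punctuation marks
    s = re.sub(r"(\w)'(\w)", r"\1 '\2", s)      # I'm -> I 'm ; Finland's -> Finland 's
    return s.split()

print(tokenize("I'm going to Finland's capital, aren't you?"))
# ['I', "'m", 'going', 'to', 'Finland', "'s", 'capital', ',', 'aren', "'t", 'you', '?']
```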
0:46:11.791 --> 0:46:25.725 | |
Some of the most challenging things in traditional systems were compounds, and how to deal with
0:46:25.725 --> 0:46:28.481 | |
things like this. | |
0:46:28.668 --> 0:46:32.154 | |
The nice thing is, as said, we will come to that later.
0:46:32.154 --> 0:46:34.501 | |
Nowadays we typically use subword
0:46:35.255 --> 0:46:42.261 | |
units, so we don't have to deal with this in the preprocessing directly, but in the subword
0:46:42.261 --> 0:46:47.804 | |
splitting we're doing it, and then we can learn how to best split these.
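As a preview, here is a minimal sketch of the byte-pair-encoding idea behind these subword units, with a made-up toy vocabulary; the actual algorithm and toolkits are covered later:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count how often each adjacent symbol pair occurs inside words."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge one symbol pair everywhere it occurs."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# toy corpus: words as character sequences plus an end-of-word marker
vocab = {"h o u s e </w>": 5, "h o u s e s </w>": 2, "m o u s e </w>": 3}
for _ in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print("merged", best)
```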
0:46:52.392 --> 0:46:56.974 | |
Things get more complicated
0:46:56.977 --> 0:46:59.934 | |
when we think about non-European languages.
0:46:59.934 --> 0:47:08.707 | |
Because in non-European languages, not all of them, there is no space between the words.
0:47:09.029 --> 0:47:18.752 | |
Nowadays you can also download word segmentation | |
models where you put in the full sentence and | |
0:47:18.752 --> 0:47:22.744 | |
then it gets split into parts.
0:47:22.963 --> 0:47:31.814 | |
And then, of course, there is even the issue that you have different writing systems, sometimes mixed, as in Japanese.
0:47:31.814 --> 0:47:40.385 | |
For example, they have these katakana, hiragana | |
and kanji symbols in there, and you have to | |
0:47:40.385 --> 0:47:42.435 | |
somehow deal with these.
0:47:49.669 --> 0:47:54.560 | |
Then the next thing is that we can do some normalization.
0:47:54.874 --> 0:48:00.376 | |
So the idea is that you map several words | |
onto the same form.
0:48:00.460 --> 0:48:07.877 | |
And that is task-dependent, and the idea is to define something like equivalence classes so
0:48:07.877 --> 0:48:15.546 | |
that words which have the same meaning, where it's not important to keep the difference, get
0:48:15.546 --> 0:48:19.423 | |
mapped onto the same thing in order to make the task easier.
0:48:19.679 --> 0:48:27.023 | |
The most important thing there is casing, and then there is sometimes something
0:48:27.023 --> 0:48:27.508 | |
like word classes.
0:48:28.048 --> 0:48:37.063 | |
For casing you can do two things, and it depends on the task.
0:48:37.063 --> 0:48:44.769 | |
You can lowercase everything, maybe with some exceptions.
0:48:45.045 --> 0:48:47.831 | |
For the target side, it's normally not done.
0:48:48.188 --> 0:48:51.020 | |
Why is it not done? | |
0:48:51.020 --> 0:48:56.542 | |
Why should you only do it for the source side?
0:48:56.542 --> 0:49:07.729 | |
Yes, because you have to generate correct text with proper lower case and upper case.
0:49:08.848 --> 0:49:16.370 | |
Nowadays we typically do true casing on both sides, also on the source side; that means you
0:49:16.370 --> 0:49:17.610 | |
keep the case. | |
0:49:17.610 --> 0:49:24.966 | |
The only thing where people try to work on | |
or sometimes do that is that at the beginning | |
0:49:24.966 --> 0:49:25.628 | |
of the. | |
0:49:25.825 --> 0:49:31.115 | |
For words like this, this is not that important | |
because you will have seen otherwise a lot | |
0:49:31.115 --> 0:49:31.696 | |
of times. | |
0:49:31.696 --> 0:49:36.928 | |
But if you know have rare words, which you | |
only have seen maybe three times, and you have | |
0:49:36.928 --> 0:49:42.334 | |
only seen in the middle of the sentence, and | |
now it occurs at the beginning of the sentence, | |
0:49:42.334 --> 0:49:45.763 | |
which is upper case, then you don't know how | |
to deal with. | |
0:49:46.146 --> 0:49:50.983 | |
So then it might be good to do a true casing. | |
0:49:50.983 --> 0:49:56.241 | |
That means you recase each word at the beginning of a sentence.
0:49:56.576 --> 0:49:59.830 | |
The only question, of course, is how do you | |
recase it? | |
0:49:59.830 --> 0:50:01.961 | |
To which case would you change it, would you always lowercase?
0:50:02.162 --> 0:50:18.918 | |
Just lowercase the first word of the sentence, or do you have a better solution, especially for languages
0:50:18.918 --> 0:50:20.000 | |
other than English, maybe German?
0:50:25.966 --> 0:50:36.648 | |
The fancy solution would be to count how often each variant occurs and decide based on this; the unfancy one
0:50:36.648 --> 0:50:43.147 | |
would be to just lowercase it. I think that's not really good, because most of these words are lowercased anyway.
0:50:43.683 --> 0:50:53.657 | |
Yes, counting is one idea, and it is definitely better, because a word may occur more often in uppercase.
0:50:53.653 --> 0:50:57.934 | |
Otherwise you introduce a lowercase variant only at sentence beginnings, where you have again
0:50:58.338 --> 0:51:03.269 | |
not gained anything. You can make it even a bit better when counting:
0:51:03.269 --> 0:51:09.134 | |
you ignore the first position so that you don't count the sentence beginnings, and yeah,
0:51:09.134 --> 0:51:12.999 | |
that's typically how it's done to do this type | |
of casing. | |
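A minimal sketch of such a count-based truecaser, ignoring the sentence-initial position during counting and recasing only the first word at test time (the toy sentences are made up):

```python
from collections import Counter, defaultdict

def train_truecaser(sentences):
    counts = defaultdict(Counter)
    for sent in sentences:
        tokens = sent.split()
        for tok in tokens[1:]:              # skip position 0: casing there is not informative
            counts[tok.lower()][tok] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def truecase(sentence, model):
    tokens = sentence.split()
    if tokens:
        first = tokens[0]
        tokens[0] = model.get(first.lower(), first)  # recase only the first word
    return " ".join(tokens)

model = train_truecaser(["Das Haus ist groß .", "Wir haben das Haus gekauft ."])
print(truecase("Das haben wir gesehen .", model))   # -> "das haben wir gesehen ."
```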
0:51:13.273 --> 0:51:23.907 | |
And that's the easy approach; you can even use bigram features, so word pairs.
0:51:23.907 --> 0:51:29.651 | |
There are very few words which occur often in both forms.
0:51:29.970 --> 0:51:33.163 | |
It's OK to have those in both forms, because then the model can learn it.
0:51:36.376 --> 0:51:52.305 | |
Another thing about these classes is to use word classes; that was partly done, for example,
0:51:52.305 --> 0:51:55.046 | |
for numbers in earlier systems.
0:51:55.375 --> 0:51:57.214 | |
Take 'ten thousand one hundred books'.
0:51:57.597 --> 0:52:07.397 | |
For an MT system the exact number might not be important, so you can map it to a number class plus 'books'.
0:52:07.847 --> 0:52:16.450 | |
However, you see here already that it's not that easy, because if you have 'one book' you
0:52:16.450 --> 0:52:19.318 | |
suddenly have to deal with singular versus plural.
0:52:20.020 --> 0:52:21.669 | |
Always be careful. | |
0:52:21.669 --> 0:52:28.094 | |
It happens very fast that you ignore some exceptions and make more things worse than better.
0:52:28.488 --> 0:52:37.879 | |
So it's always difficult to decide when to | |
do this and when to better not do it and keep | |
0:52:37.879 --> 0:52:38.724 | |
things as they are.
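A minimal sketch of such a number-class normalization with a small exception list (the placeholder symbol and the exception list are just assumptions):

```python
import re

KEEP = {"0", "1", "one", "two", "zero"}   # hypothetical exceptions, e.g. to preserve 'one book'

def normalize_numbers(tokens, placeholder="<num>"):
    out = []
    for tok in tokens:
        if re.fullmatch(r"\d+(?:[.,]\d+)*", tok) and tok not in KEEP:
            out.append(placeholder)       # map the concrete number to a word class
        else:
            out.append(tok)
    return out

print(normalize_numbers("she sold 10100 books but kept 1 book".split()))
# ['she', 'sold', '<num>', 'books', 'but', 'kept', '1', 'book']
```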
0:52:43.483 --> 0:52:56.202 | |
Then the next step is sentence segmentation, | |
so we are typically working on sentences. | |
0:52:56.476 --> 0:53:11.633 | |
However, with dots things are a bit more complicated, because a dot does not always end a sentence.
0:53:11.731 --> 0:53:20.111 | |
You can even have some type of classifier with features, but generally this
0:53:20.500 --> 0:53:30.731 | |
is not too complicated, so you can have different types of classifiers to do that.
0:53:30.650 --> 0:53:32.537 | |
I Didn't Know It. | |
0:53:33.393 --> 0:53:35.583 | |
It's not a super complicated task. | |
0:53:35.583 --> 0:53:39.461 | |
There are nowadays also a lot of libraries | |
which you can use. | |
0:53:39.699 --> 0:53:45.714 | |
To do that normally if you're doing the normalization | |
beforehand that can be done there so you only | |
0:53:45.714 --> 0:53:51.126 | |
split off the dot if it's the sentence boundary, and otherwise you keep it attached to the word,
0:53:51.126 --> 0:53:54.194 | |
so you can do that a bit jointly with the tokenization.
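A minimal sketch of such rule-based sentence segmentation with a small abbreviation list (the list and the example text are made up); real libraries or a small classifier handle many more cases:

```python
import re

ABBREVIATIONS = {"dr.", "prof.", "e.g.", "i.e.", "etc.", "vs."}

def split_sentences(text):
    sentences, start = [], 0
    # candidate boundaries: ., ! or ? followed by whitespace and an uppercase letter
    for m in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        token_before = text[start:m.end()].split()[-1].lower()
        if token_before in ABBREVIATIONS:
            continue                          # the dot belongs to an abbreviation
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived late. He met Prof. Jones. They talked."))
# ['Dr. Smith arrived late.', 'He met Prof. Jones.', 'They talked.']
```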
0:53:54.634 --> 0:54:06.017 | |
It's something to think about and take care of, because it's where errors happen.
0:54:06.017 --> 0:54:14.712 | |
However, on the one end you can still do it | |
very well. | |
0:54:14.834 --> 0:54:19.740 | |
You will never get data which is perfectly | |
clean and where everything is great. | |
0:54:20.340 --> 0:54:31.020 | |
There's just too much data and it will never | |
happen, so therefore it's important to be aware | |
0:54:31.020 --> 0:54:35.269 | |
of that during the full development. | |
0:54:37.237 --> 0:54:42.369 | |
And one last thing about the preprocessing before we get into the representation:
0:54:42.369 --> 0:54:47.046 | |
if you're working on that, you will become friends with regular expressions.
0:54:47.046 --> 0:54:50.034 | |
That's normally how you do all this matching.
0:54:50.430 --> 0:55:03.811 | |
And if you look into the scripts of how to deal with punctuation marks and stuff like
0:55:03.811 --> 0:55:04.900 | |
that, they are full of regular expressions.
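A small sketch of the kind of regular-expression normalization meant here, unifying quotation marks and dashes and collapsing whitespace:

```python
import re

def normalize_punctuation(text):
    text = re.sub(r"[“”„«»]", '"', text)      # unify double quotation marks
    text = re.sub(r"[‘’‚]", "'", text)        # unify single quotes and apostrophes
    text = re.sub(r"[–—]", "-", text)         # unify dashes
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

print(normalize_punctuation("„Hello“  –  it’s   me"))   # -> "Hello" - it's me
```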
0:55:11.011 --> 0:55:19.025 | |
So now that we have the data, our next step to build the system is to represent our words.
0:55:19.639 --> 0:55:27.650 | |
Before we start with this, are there any more questions about preprocessing,
0:55:27.650 --> 0:55:32.672 | |
where we still work on the pure text?
0:55:33.453 --> 0:55:40.852 | |
The idea is again to make things more simple, because if you think about the capitalized word
0:55:40.852 --> 0:55:48.252 | |
at the beginning of a sentence, it might be that you haven't seen that form of the word; or, for example,
0:55:48.252 --> 0:55:49.619 | |
think of titles. | |
0:55:49.619 --> 0:55:56.153 | |
In newspaper articles there is special casing, so you have then seen the word only in the title before,
0:55:56.153 --> 0:55:58.425 | |
and in the running text you have never seen it.
0:55:58.898 --> 0:56:03.147 | |
But there is always the decision. | |
0:56:03.123 --> 0:56:09.097 | |
Do I gain more because I've seen things more | |
often or do I lose because now I remove information | |
0:56:09.097 --> 0:56:11.252 | |
which helps me to the same degree? | |
0:56:11.571 --> 0:56:21.771 | |
Because if we, for example, do that in German | |
and remove the case, this might be an important | |
0:56:21.771 --> 0:56:22.531 | |
issue. | |
0:56:22.842 --> 0:56:30.648 | |
So there is not the perfect solution, but | |
generally you can get some arrows to make things | |
0:56:30.648 --> 0:56:32.277 | |
look more similar. | |
0:56:35.295 --> 0:56:43.275 | |
What do current products, the state of the art, do; what are the trends, more or
0:56:43.275 --> 0:56:43.813 | |
less. | |
0:56:44.944 --> 0:56:50.193 | |
It is done even less nowadays because models get more powerful, so it's not that important, but be
0:56:50.193 --> 0:56:51.136 | |
a bit careful.
0:56:51.136 --> 0:56:56.326 | |
It's also the evaluation thing because these | |
things which are problematic are happening | |
0:56:56.326 --> 0:56:57.092 | |
very rarely. | |
0:56:57.092 --> 0:57:00.159 | |
If you take average performance, it doesn't | |
matter. | |
0:57:00.340 --> 0:57:06.715 | |
However, in between the system makes these stupid mistakes that don't count on average, but they
0:57:06.715 --> 0:57:08.219 | |
are not really good. | |
0:57:09.089 --> 0:57:15.118 | |
So, to summarize: you do some type of tokenization.
0:57:15.118 --> 0:57:19.911 | |
You can do true casing or not. | |
0:57:19.911 --> 0:57:28.723 | |
Some people nowadays don't do it, but that's | |
still done. | |
0:57:28.948 --> 0:57:34.441 | |
Then it depends a bit on the type of domain.
0:57:34.441 --> 0:57:37.437 | |
Take, for example, software translation.
0:57:37.717 --> 0:57:46.031 | |
So in the text sometimes there is a marker in a menu item indicating the keyboard shortcut.
0:57:46.031 --> 0:57:49.957 | |
A letter is marked as the shortcut key.
0:57:49.957 --> 0:57:57.232 | |
Then you cannot match the word anymore because it's no longer 'file' but 'file' with a marker in it.
0:57:58.018 --> 0:58:09.037 | |
Then you cannot deal with it, so then it might | |
make sense to remove this. | |
0:58:12.032 --> 0:58:17.437 | |
Now the next step is how to match words into | |
numbers. | |
0:58:17.437 --> 0:58:22.142 | |
Machine learning models deal with some digits. | |
0:58:22.342 --> 0:58:27.091 | |
The first idea is to use words as our basic | |
components. | |
0:58:27.247 --> 0:58:40.695 | |
And then you have a large vocabulary where | |
each word gets referenced to an indigenous. | |
0:58:40.900 --> 0:58:49.059 | |
So your sentence 'go home' is then just a sequence | |
of indices, and that is your input. | |
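As a small illustration of this word-to-index mapping (the toy corpus and the resulting indices are made up for the example):

```python
corpus = ["i go home", "he goes home"]

vocab = {}
for sentence in corpus:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))

def encode(sentence):
    # raises KeyError for words never seen in training,
    # which is exactly the unknown-word problem discussed next
    return [vocab[word] for word in sentence.split()]

print(vocab)              # {'i': 0, 'go': 1, 'home': 2, 'he': 3, 'goes': 4}
print(encode("go home"))  # [1, 2]
```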
0:58:52.052 --> 0:59:00.811 | |
So the nice thing is that you have quite short sequences, | |
so you can deal with them efficiently. | |
0:59:00.811 --> 0:59:01.867 | |
However, there are downsides. | |
0:59:01.982 --> 0:59:11.086 | |
With this, you have not really modeled how words | |
are built internally. | |
0:59:11.086 --> 0:59:16.951 | |
Why is this a problem, or can it be a problem? | |
0:59:17.497 --> 0:59:20.741 | |
And there is an easy solution to deal with | |
unknown words. | |
0:59:20.741 --> 0:59:22.698 | |
You just have one special token, the unknown token. | |
0:59:23.123 --> 0:59:25.906 | |
You map maybe some rare words in your training | |
data to it, so the model learns to deal with it. | |
0:59:26.206 --> 0:59:34.938 | |
That works a bit for some problems, but | |
in general it's not good, because you know nothing | |
0:59:34.938 --> 0:59:35.588 | |
about the word. | |
0:59:35.895 --> 0:59:38.770 | |
At least you can deal with it and maybe map | |
it to something. | |
0:59:38.770 --> 0:59:44.269 | |
So an easy solution in machine translation | |
is always: if it's an unknown word, we just | |
0:59:44.269 --> 0:59:49.642 | |
copy it to the target side, because unknown | |
words are often named entities, and in many | |
0:59:49.642 --> 0:59:52.454 | |
languages the best solution is just to keep them unchanged. | |
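A small sketch of this unknown-word handling (the token name &lt;unk&gt; and the copy-through bookkeeping are illustrative assumptions, not a specific toolkit's behavior):

```python
vocab = {"<unk>": 0, "i": 1, "go": 2, "home": 3}

def encode_with_unk(sentence):
    """Map known words to indices; unknown words become <unk>,
    but are remembered so they can be copied to the output later."""
    ids, copied = [], []
    for word in sentence.split():
        if word in vocab:
            ids.append(vocab[word])
        else:
            ids.append(vocab["<unk>"])
            copied.append(word)   # e.g. a named entity like "Karlsruhe"
    return ids, copied

print(encode_with_unk("i go home to Karlsruhe"))
# ([1, 2, 3, 0, 0], ['to', 'Karlsruhe'])
```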
0:59:53.013 --> 1:00:01.203 | |
So that is somewhat of a trick, but yeah, | |
that's of course not a real solution. | |
1:00:01.821 --> 1:00:08.959 | |
It's also a problem when you deal with full | |
words that you have very few examples for | |
1:00:08.959 --> 1:00:09.451 | |
some of them. | |
1:00:09.949 --> 1:00:17.696 | |
And of course, if you've seen a word only once, you | |
can somehow translate it, but we will | |
1:00:17.696 --> 1:00:24.050 | |
learn that in neural networks you represent words | |
with continuous vectors. | |
1:00:24.264 --> 1:00:26.591 | |
If you have seen them only two, three or four times, | |
1:00:26.591 --> 1:00:31.246 | |
they are not really well learned, and you are | |
typically making most errors on words with a | |
1:00:31.246 --> 1:00:31.763 | |
low count. | |
1:00:33.053 --> 1:00:40.543 | |
And you cannot exploit things which | |
are inside the word. | |
1:00:40.543 --> 1:00:50.303 | |
So if you know that 'house' is, say, index one hundred | |
and twelve, and you now see 'houses', you have | |
1:00:50.303 --> 1:00:51.324 | |
no idea what it means. | |
1:00:51.931 --> 1:00:55.533 | |
Of course, that is not really convenient; humans | |
are better here. | |
1:00:55.533 --> 1:00:58.042 | |
They can use the internal information. | |
1:00:58.498 --> 1:01:04.080 | |
So if we have 'houses', you know that it's | |
the plural form of 'house'. | |
1:01:05.285 --> 1:01:16.829 | |
And for the ones who don't know it in advance: | |
you have this nice word here, and can guess. | |
1:01:16.716 --> 1:01:20.454 | |
You won't know the meaning of these words. | |
1:01:20.454 --> 1:01:25.821 | |
However, all of you will know it is the fear | |
of something. | |
1:01:26.686 --> 1:01:39.437 | |
From the ending: '-phobia' is always | |
the fear of something, even if you don't know of what. | |
1:01:39.879 --> 1:01:46.618 | |
So we can split words into parts, and that | |
is helpful to deal with this. | |
1:01:46.618 --> 1:01:49.888 | |
This ending, for example, tells you it is a fear of something. | |
1:01:50.450 --> 1:02:04.022 | |
It's not very important, it doesn't happen | |
very often, and it's also not necessary | |
1:02:04.022 --> 1:02:10.374 | |
for understanding to know everything. | |
1:02:15.115 --> 1:02:18.791 | |
So what can we do instead? | |
1:02:18.791 --> 1:02:29.685 | |
One thing which we could do instead is to | |
represent words at the other extreme: as characters. | |
1:02:29.949 --> 1:02:42.900 | |
So you really go down to single characters: each | |
letter gets its own symbol, and you also need a symbol for the space. | |
1:02:43.203 --> 1:02:55.875 | |
So you now have a representation for each | |
character, which enables you to implicitly learn | |
1:02:55.875 --> 1:03:01.143 | |
morphology, because words which share parts share characters. | |
1:03:01.541 --> 1:03:05.517 | |
And you can then deal with unknown words. | |
1:03:05.517 --> 1:03:10.344 | |
There's still not everything you can process, | |
but much more. | |
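A minimal sketch of such a character-level representation (using '_' as an assumed explicit space symbol; the exact symbol is just an illustration):

```python
def to_characters(sentence):
    """Character-level representation: every letter is a token and the
    space becomes an explicit symbol so words can be recovered later."""
    return [ch if ch != " " else "_" for ch in sentence]

chars = to_characters("he goes")
vocab = {c: i for i, c in enumerate(sorted(set(chars)))}
print(chars)                      # ['h', 'e', '_', 'g', 'o', 'e', 's']
print([vocab[c] for c in chars])  # indices into a very small vocabulary
```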
1:03:11.851 --> 1:03:16.953 | |
So if you go down to the character level, what might | |
still be a problem? | |
1:03:18.598 --> 1:03:24.007 | |
All characters which you haven't seen; | |
that happens nowadays a bit more often | |
1:03:24.007 --> 1:03:25.140 | |
with new emojis. | |
1:03:25.140 --> 1:03:26.020 | |
Those you couldn't handle. | |
1:03:26.020 --> 1:03:31.366 | |
It could also be that you translate, say, | |
between German and English, and then there is | |
1:03:31.366 --> 1:03:35.077 | |
a Japanese or Chinese character that you cannot | |
translate. | |
1:03:35.435 --> 1:03:43.938 | |
But most of the time, all characters that occur | |
have been seen, so this works very well. | |
1:03:44.464 --> 1:03:58.681 | |
This is the first nice thing: you have a | |
very small vocabulary size, and one big part | |
1:03:58.681 --> 1:04:01.987 | |
of the computation in | |
1:04:02.222 --> 1:04:11.960 | |
neural networks depends on the | |
vocabulary size, so if you are efficient there | |
1:04:11.960 --> 1:04:13.382 | |
it's better. | |
1:04:14.914 --> 1:04:26.998 | |
On the other hand, the problem is that you now | |
have very long sequences; if you think about | |
1:04:26.998 --> 1:04:29.985 | |
it, before you had one token per word. | |
1:04:30.410 --> 1:04:43.535 | |
Your computation often depends on your input | |
length, and not only linearly but quadratically, so it grows | |
1:04:43.535 --> 1:04:44.410 | |
even more. | |
1:04:44.504 --> 1:04:49.832 | |
And of course it might also be that you just | |
generally make things more complicated than | |
1:04:49.832 --> 1:04:50.910 | |
they were before. | |
1:04:50.951 --> 1:04:58.679 | |
We said before we want to make things easy, but now, if | |
we really have to analyze each character independently, | |
1:04:58.679 --> 1:05:05.003 | |
we cannot directly learn what 'university' | |
means as one unit, but we have to learn that there | |
1:05:05.185 --> 1:05:12.179 | |
is a 'u' at the beginning, then an 'n', then | |
an 'i', and so on, and that all this together means | |
1:05:12.179 --> 1:05:17.273 | |
'university', but another combination of these | |
letters means something completely different. | |
1:05:17.677 --> 1:05:24.135 | |
So of course you make everything here a lot | |
more complicated than you have on word basis. | |
1:05:24.744 --> 1:05:32.543 | |
Character-based models work very well in conditions | |
with little data, because you have seen each word | |
1:05:32.543 --> 1:05:33.578 | |
only very rarely. | |
1:05:33.578 --> 1:05:38.751 | |
That's not enough to learn from, but you have seen all | |
the letters much more often. | |
1:05:38.751 --> 1:05:44.083 | |
So if you have a scenario with very little data, | |
this is one good option. | |
1:05:46.446 --> 1:05:59.668 | |
The other idea is to split, but not to go to either | |
extreme, so neither taking full words nor taking | |
1:05:59.668 --> 1:06:06.573 | |
only characters, but doing something in between. | |
1:06:07.327 --> 1:06:12.909 | |
And one of these ideas has been used for a | |
long time. | |
1:06:12.909 --> 1:06:17.560 | |
It's called compound splitting; there you only split compounds, like the German word | |
1:06:17.477 --> 1:06:18.424 | |
'Baumstamm'. | |
1:06:18.424 --> 1:06:24.831 | |
You see that 'Baum' and 'Stamm' occur very often, | |
maybe more often than 'Baumstamm'. | |
1:06:24.831 --> 1:06:28.180 | |
Then you split 'Baumstamm' into 'Baum' and 'Stamm' and use | |
these parts. | |
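A toy sketch of frequency-based compound splitting in this spirit (the corpus counts and the geometric-mean scoring are illustrative assumptions in the style of classic compound-splitting work, not necessarily the exact method used here):

```python
# Split a compound only if its parts are (geometrically) more frequent in the
# corpus than the compound itself. The counts below are made up.
counts = {"baum": 500, "stamm": 300, "baumstamm": 20}

def split_compound(word, min_part_len=3):
    best, best_score = [word], counts.get(word, 0)
    for i in range(min_part_len, len(word) - min_part_len + 1):
        left, right = word[:i], word[i:]
        if left in counts and right in counts:
            score = (counts[left] * counts[right]) ** 0.5  # geometric mean
            if score > best_score:
                best, best_score = [left, right], score
    return best

print(split_compound("baumstamm"))  # ['baum', 'stamm']
```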
1:06:29.509 --> 1:06:44.165 | |
But it's not even that easy; it will learn wrong | |
splits. We did that in our old systems, and | |
1:06:44.165 --> 1:06:47.708 | |
there is the word 'asiatisch'. | |
1:06:48.288 --> 1:06:56.137 | |
And the splitter splits it into 'Asia' and 'Tisch', which of course is not a really | |
good way of dealing with it, because it is not semantic. | |
1:06:56.676 --> 1:07:05.869 | |
The good thing is we didn't really have to care that | |
much about it, because the system still learned | |
1:07:05.869 --> 1:07:09.428 | |
what 'Asia' and 'Tisch' together mean. | |
1:07:09.729 --> 1:07:17.452 | |
So you can of course learn all that, but the compound | |
split doesn't really help you to get a deeper | |
1:07:17.452 --> 1:07:18.658 | |
understanding. | |
1:07:21.661 --> 1:07:23.364 | |
The thing, of course, is that it can also go wrong. | |
1:07:23.943 --> 1:07:30.475 | |
Yeah, there was one paper reporting a case where this doesn't | |
work; I think it's called 'Burning | |
1:07:30.475 --> 1:07:30.972 | |
Ducks'. | |
1:07:30.972 --> 1:07:37.503 | |
I think it was because there was a German compound | |
which you could split into two parts, | |
1:07:37.503 --> 1:07:43.254 | |
and sometimes you have to add an 'e' to form | |
the compound; here that produced a wrong split. | |
1:07:43.583 --> 1:07:48.515 | |
So it translated the word into 'burning duck'. | |
1:07:48.888 --> 1:07:56.127 | |
So of course you can introduce some | |
additional errors this way, but in general | |
1:07:56.127 --> 1:07:57.221 | |
it's a good approach. | |
1:07:57.617 --> 1:08:03.306 | |
Of course there is a trade-off with the vocabulary | |
size: you want a small vocabulary | |
1:08:03.306 --> 1:08:08.812 | |
size so you've seen everything more often, but | |
the length of the sequences should not be too | |
1:08:08.812 --> 1:08:13.654 | |
long, because if you split more often you get | |
fewer different types but longer sequences. | |
1:08:16.896 --> 1:08:25.281 | |
The motivation, and the advantage compared | |
to character-based models, is that you can directly | |
1:08:25.281 --> 1:08:33.489 | |
learn the representation for words that occur | |
very often, while still being able to represent | |
1:08:33.489 --> 1:08:35.783 | |
rare words by splitting them into parts. | |
1:08:36.176 --> 1:08:42.973 | |
And while first this was only done for compounds, | |
nowadays there's an algorithm which really | |
1:08:42.973 --> 1:08:49.405 | |
tries to do it for everything; there are | |
different approaches, to be honest, like compound splitting | |
1:08:49.405 --> 1:08:50.209 | |
and so on. | |
1:08:50.209 --> 1:08:56.129 | |
But the most successful one which is commonly | |
used is based on data compression. | |
1:08:56.476 --> 1:08:59.246 | |
And there the idea is okay. | |
1:08:59.246 --> 1:09:06.765 | |
Can we find an encoding so that the text is | |
compressed in the most efficient way? | |
1:09:07.027 --> 1:09:22.917 | |
And the compression algorithm is called | |
byte pair encoding, and this is then also used | |
1:09:22.917 --> 1:09:25.625 | |
for splitting. | |
1:09:26.346 --> 1:09:39.164 | |
And the idea is that we recursively replace the | |
most frequent pair of bytes by a new byte. | |
1:09:39.819 --> 1:09:51.926 | |
For language, you now first split all your | |
words into letters, and then you look at what | |
1:09:51.926 --> 1:09:59.593 | |
is the most frequent bigram, that is, which two letters | |
occur together most often. | |
1:10:00.040 --> 1:10:04.896 | |
And then you replace it and repeat until you | |
have a fixed vocabulary. | |
1:10:04.985 --> 1:10:08.031 | |
So that's a nice thing. | |
1:10:08.031 --> 1:10:16.663 | |
Now you can predefine the vocabulary size with which you want | |
to represent your text | |
1:10:16.936 --> 1:10:28.486 | |
by hand, and then you can represent any text | |
with these symbols; of course, the larger it is, the shorter | |
1:10:28.486 --> 1:10:30.517 | |
your text will be. | |
1:10:32.772 --> 1:10:36.543 | |
So the original idea was something like that. | |
1:10:36.543 --> 1:10:39.411 | |
We have the sequence A, B, A, B, C. | |
1:10:39.411 --> 1:10:45.149 | |
For example, a common bigram is A B, so | |
you can replace it by a new symbol, say D. | |
1:10:45.149 --> 1:10:46.788 | |
Then the text gets shorter. | |
1:10:48.108 --> 1:10:53.615 | |
Then you can repeat this: you get D, D, | |
C and so on, so this is then your compressed text. | |
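A tiny sketch of this pair-replacement compression step (using the toy sequence from above; the new symbol name D is just the example's choice):

```python
from collections import Counter

def compress_once(seq, new_symbol):
    """Replace the most frequent adjacent pair in seq by new_symbol."""
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return seq
    (a, b), _ = pairs.most_common(1)[0]
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

print(compress_once(list("ABABC"), "D"))  # ['D', 'D', 'C']
```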
1:10:54.514 --> 1:11:00.691 | |
Similarly, we can now do it for tokenization. | |
1:11:01.761 --> 1:11:05.436 | |
Let's assume you have these sentences. | |
1:11:05.436 --> 1:11:11.185 | |
'I go', 'he goes', 'she goes', so your vocabulary | |
is I, go, goes, he, she. | |
1:11:11.851 --> 1:11:30.849 | |
And the first thing you're doing is to split | |
your corpus into single characters. | |
1:11:30.810 --> 1:11:34.692 | |
Thereby you should later be able to split it into | |
words again, like splitting sentences into words. | |
1:11:34.692 --> 1:11:38.980 | |
Because now you only have characters, you | |
don't know the word boundaries. | |
1:11:38.980 --> 1:11:44.194 | |
You introduce the word boundaries by having | |
a special symbol at the end of each word, and | |
1:11:44.194 --> 1:11:46.222 | |
then, whenever this symbol occurs, you know | |
1:11:46.222 --> 1:11:48.366 | |
you can split there and start a new word. | |
1:11:48.708 --> 1:11:55.245 | |
So you have the corpus I go, he goes, and | |
she goes, and then you have now here the sequences | |
1:11:55.245 --> 1:11:56.229 | |
of characters. | |
1:11:56.229 --> 1:12:02.625 | |
So this is the character-based representation, | |
and now you calculate the bigram statistics. | |
1:12:02.625 --> 1:12:08.458 | |
So 'i' plus end-of-word occurs one time, 'g' | |
and 'o' occur three times, and so on. | |
1:12:09.189 --> 1:12:18.732 | |
And these are all the others, and now you | |
look which pair is the most frequent. | |
1:12:19.119 --> 1:12:26.046 | |
So then you have learned the first rule: | |
1:12:26.046 --> 1:12:39.235 | |
if you have 'g' and 'o' together, you merge them, and you get these new | |
units: 'go' is now no longer two symbols, but | |
1:12:39.235 --> 1:12:41.738 | |
one single symbol, because you have joined them. | |
1:12:42.402 --> 1:12:51.175 | |
And then you have here the new counts | |
of the bigrams, and so on. | |
1:12:52.092 --> 1:13:01.753 | |
In such a small example you now have a lot of rules | |
which occur the same number of times. | |
1:13:01.753 --> 1:13:09.561 | |
In reality that is happening sometimes but | |
not that often. | |
1:13:10.370 --> 1:13:21.240 | |
Next you merge, for example, a pair with the end-of-word symbol, and this | |
way you go on until you have your vocabulary. | |
1:13:21.601 --> 1:13:38.242 | |
And your vocabulary then consists of these rules, so | |
people speak of the vocabulary as the set of merge rules. | |
1:13:38.658 --> 1:13:43.637 | |
And these are the rules; if you now have a | |
different sentence, something like 'they tell', | |
1:13:44.184 --> 1:13:53.600 | |
then your final output looks something | |
like this: | |
1:13:53.600 --> 1:13:59.250 | |
these two words are represented by these subword units. | |
1:14:00.940 --> 1:14:06.398 | |
And that is your algorithm. | |
1:14:06.398 --> 1:14:18.873 | |
Now you can represent any type of text with | |
a fixed vocabulary. | |
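A compact sketch of the whole procedure on the toy corpus from the example (learning merge rules and applying them to new words; the end-of-word marker '&lt;/w&gt;' and the number of merges are illustrative choices):

```python
from collections import Counter

END = "</w>"  # end-of-word marker so word boundaries can be restored

def merge(symbols, pair):
    """Replace every occurrence of pair by the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge rules from a word-frequency dictionary."""
    corpus = {tuple(w) + (END,): f for w, f in word_freqs.items()}
    rules = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        rules.append(best)
        corpus = {merge(symbols, best): f for symbols, f in corpus.items()}
    return rules

def apply_bpe(word, rules):
    symbols = tuple(word) + (END,)
    for pair in rules:        # apply merges in the order they were learned
        symbols = merge(symbols, pair)
    return symbols

# toy corpus: "i go", "he goes", "she goes"
word_freqs = {"i": 1, "go": 1, "he": 1, "goes": 2, "she": 1}
rules = learn_bpe(word_freqs, num_merges=4)
print(rules)                    # the exact merges depend on tie-breaking
print(apply_bpe("goes", rules))
print(apply_bpe("she", rules))
```

As in the walked-through example, 'g' and 'o' get merged first because that pair is the most frequent; later merges are often ties in such a small corpus.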
1:14:20.400 --> 1:14:23.593 | |
So that is defined at the beginning: | |
1:14:23.593 --> 1:14:27.243 | |
you set how many merges you do, and that is the vocabulary size? | |
1:14:28.408 --> 1:14:35.253 | |
It's nearly correct; in addition there is the number | |
of characters. | |
1:14:35.253 --> 1:14:38.734 | |
So there can be additional entries. | |
1:14:38.878 --> 1:14:49.162 | |
On the one hand, all the right-hand sides | |
of the rules can occur, and then additionally | |
1:14:49.162 --> 1:14:49.721 | |
all the single characters. | |
1:14:49.809 --> 1:14:55.851 | |
In reality it can even happen that your vocabulary | |
is smaller, because it might | |
1:14:55.851 --> 1:15:01.960 | |
happen that, for example, 'go' never occurs | |
on its own at the end, because you always merge | |
1:15:01.960 --> 1:15:06.793 | |
all its occurrences further, so not all right-hand | |
sides actually remain, because | |
1:15:06.746 --> 1:15:11.269 | |
such a rule is never applied alone; afterwards | |
another rule is also applied. | |
1:15:11.531 --> 1:15:15.621 | |
So it is rather an upper bound on your vocabulary | |
size than a fixed number. | |
1:15:20.480 --> 1:15:29.014 | |
Then we come to the last part, which is about | |
parallel data; are there any questions beforehand? | |
1:15:36.436 --> 1:15:38.824 | |
So what is parallel data? | |
1:15:38.824 --> 1:15:47.368 | |
As we said, for machine translation it is really, | |
really important that we are dealing with parallel | |
1:15:47.368 --> 1:15:52.054 | |
data; that means we have aligned input and | |
output. | |
1:15:52.054 --> 1:15:54.626 | |
You have this type of data. | |
1:15:55.015 --> 1:16:01.773 | |
However, in machine translation we have one | |
very big advantage: this data is somewhat naturally | |
1:16:01.773 --> 1:16:07.255 | |
occurring, so you have a lot of parallel data | |
which you can simply collect. | |
1:16:07.255 --> 1:16:13.788 | |
In many other NLP tasks you need to manually annotate | |
your data to generate the aligned data. | |
1:16:14.414 --> 1:16:22.540 | |
We would have to manually create translations, and | |
of course that is very expensive; it's | |
1:16:22.540 --> 1:16:29.281 | |
really expensive to pay for, like, one million | |
sentences to be translated. | |
1:16:29.889 --> 1:16:36.952 | |
The nice thing is that there is normally data | |
available, because other people have already produced | |
1:16:36.952 --> 1:16:37.889 | |
translations. | |
1:16:40.120 --> 1:16:44.672 | |
So this data is out there, and of course we have | |
to collect and process it. | |
1:16:44.672 --> 1:16:51.406 | |
We'll have a full lecture on how to deal with | |
more complex situations. | |
1:16:52.032 --> 1:16:56.645 | |
The idea is really you don't do really much | |
human work. | |
1:16:56.645 --> 1:17:02.825 | |
You really just start the crawler with some | |
initial start pages, and then it collects the data. | |
1:17:03.203 --> 1:17:07.953 | |
But a lot of high-quality parallel data is really | |
targeted at specific scenarios. | |
1:17:07.953 --> 1:17:13.987 | |
So, for example, think of the European Parliament | |
as one website where you can easily extract | |
1:17:13.987 --> 1:17:17.581 | |
this information from, and there you have a | |
large amount of data. | |
1:17:17.937 --> 1:17:22.500 | |
Or, like, we have the TED data, which you can | |
also get from the TED website. | |
1:17:23.783 --> 1:17:33.555 | |
So in general, a parallel corpus is a collection | |
of texts with translations into one or several languages. | |
1:17:34.134 --> 1:17:42.269 | |
And this data is important because normally there is | |
no general MT system, but you work on specific scenarios. | |
1:17:42.222 --> 1:17:46.732 | |
It works especially well if your training | |
and test conditions are similar. | |
1:17:46.732 --> 1:17:50.460 | |
So if the topic is similar, the style or modality | |
is similar. | |
1:17:50.460 --> 1:17:55.391 | |
So if you want to translate speech, it's often | |
better to also train on speech. | |
1:17:55.391 --> 1:17:58.818 | |
If you want to translate text, it's better | |
to train on text. | |
1:17:59.379 --> 1:18:08.457 | |
And there is a lot of these data available | |
nowadays for common languages. | |
1:18:08.457 --> 1:18:12.014 | |
You can normally just start with that. | |
1:18:12.252 --> 1:18:15.298 | |
It's really available. | |
1:18:15.298 --> 1:18:27.350 | |
For example, OPUS is a big website collecting | |
different types of parallel corpora, where you | |
1:18:27.350 --> 1:18:29.601 | |
can select them. | |
1:18:29.529 --> 1:18:33.276 | |
You have this document alignment; we will come | |
to that later. | |
1:18:33.553 --> 1:18:39.248 | |
There are things like comparable data, where | |
you have not full sentences but only some parts | |
1:18:39.248 --> 1:18:40.062 | |
of parallel. | |
1:18:40.220 --> 1:18:48.700 | |
But now, first, let's assume we have an easy task | |
like the European Parliament, where we have the speech | |
1:18:48.700 --> 1:18:55.485 | |
in German and the speech in English and you | |
need to generate parallel data. | |
1:18:55.485 --> 1:18:59.949 | |
That means you have to align the source and target sentences. | |
1:19:00.000 --> 1:19:01.573 | |
And doing this right. | |
1:19:05.905 --> 1:19:08.435 | |
How can we do that? | |
1:19:08.435 --> 1:19:19.315 | |
And that is what people refer to as sentence | |
alignment, so we have parallel documents in two | |
1:19:19.315 --> 1:19:20.707 | |
languages. | |
1:19:22.602 --> 1:19:32.076 | |
You normally cannot do that word | |
by word, because there is no direct correspondence | |
1:19:32.076 --> 1:19:34.158 | |
between the words, but it is | |
1:19:34.074 --> 1:19:39.837 | |
relatively well possible on the sentence level; it | |
will not be perfect, so you sometimes have | |
1:19:39.837 --> 1:19:42.535 | |
two sentences in English and one in German. | |
1:19:42.535 --> 1:19:47.992 | |
Germans like to have these long sentences with | |
sub-clauses and so on, so there you can do | |
1:19:47.992 --> 1:19:51.733 | |
it, but with long sentences it might not be | |
really possible. | |
1:19:55.015 --> 1:19:59.454 | |
And for some data we saw that sentence markers are not | |
there, so it's more complicated. | |
1:19:59.819 --> 1:20:10.090 | |
So how can we formalize this sentence alignment | |
problem? | |
1:20:10.090 --> 1:20:16.756 | |
So we have a set of source sentences. | |
1:20:17.377 --> 1:20:22.167 | |
And machine translation relatively often. | |
1:20:22.167 --> 1:20:32.317 | |
Sometimes source sentences nowadays are and, | |
but traditionally it was and because people | |
1:20:32.317 --> 1:20:34.027 | |
started using. | |
1:20:34.594 --> 1:20:45.625 | |
And then the idea is to find this alignment, | |
where we align segments of source and target. | |
1:20:46.306 --> 1:20:50.421 | |
And of course you want these sequences to | |
be as short as possible. | |
1:20:50.421 --> 1:20:56.400 | |
Of course, an easy solution is: here are all my | |
source sentences and here are all my target sentences. | |
1:20:56.756 --> 1:21:07.558 | |
So you want to have short sequences, typically | |
one sentence or at most two or three sentences, | |
1:21:07.558 --> 1:21:09.340 | |
so that it is really useful. | |
1:21:13.913 --> 1:21:21.479 | |
Then there are different restrictions on | |
this type of alignment: first of all, | |
1:21:21.479 --> 1:21:29.131 | |
it should be a monotone alignment, which | |
means that the segments on the source side should | |
1:21:29.131 --> 1:21:31.218 | |
come one after the other. | |
1:21:31.431 --> 1:21:36.428 | |
So we assume that the document really is | |
monotone and goes the same way in source and target. | |
1:21:36.957 --> 1:21:41.965 | |
Course for a very free translation that might | |
not be valid anymore. | |
1:21:41.965 --> 1:21:49.331 | |
But this algorithm, the first one, the Church | |
and Gale algorithm, is meant for translations | |
1:21:49.331 --> 1:21:51.025 | |
which are very direct. | |
1:21:51.025 --> 1:21:54.708 | |
So each segment should come right after the | |
previous one. | |
1:21:55.115 --> 1:22:04.117 | |
Then we want to cover the full sequence, | |
and of course each segment should start before | |
1:22:04.117 --> 1:22:04.802 | |
it ends. | |
1:22:05.525 --> 1:22:22.654 | |
And then you want to have something like this, | |
where you have one-to-one or two-to-one alignments. | |
1:22:25.525 --> 1:22:41.851 | |
The alignment types are: one-to-one, of course; | |
then sometimes insertions and deletions, where | |
1:22:41.851 --> 1:22:43.858 | |
some information is added or removed. | |
1:22:44.224 --> 1:22:50.412 | |
An insertion can be, for example, an explanation: it can | |
be that some term is known in the one language | |
1:22:50.412 --> 1:22:51.018 | |
but not in the other. | |
1:22:51.111 --> 1:22:53.724 | |
Think of things like Deutschland ticket. | |
1:22:53.724 --> 1:22:58.187 | |
In Germany everybody will by now know what | |
the Deutschland ticket is. | |
1:22:58.187 --> 1:23:03.797 | |
But if you translate it to English it might | |
be important to explain it and other things | |
1:23:03.797 --> 1:23:04.116 | |
are. | |
1:23:04.116 --> 1:23:09.853 | |
So sometimes you have to explain things and | |
then you have more sentences with insertions. | |
1:23:10.410 --> 1:23:15.956 | |
Then you have two-to-one and one-to-two alignments, | |
and that is, for example, in German you have | |
1:23:15.956 --> 1:23:19.616 | |
a lot of sub-clauses, and maybe these are expressed | |
by two sentences in English. | |
1:23:20.580 --> 1:23:37.725 | |
Of course, it might be more complex, but typically | |
we make it simple and only allow for these types | |
1:23:37.725 --> 1:23:40.174 | |
of alignment. | |
1:23:41.301 --> 1:23:56.588 | |
Then it is about finding the alignment, and | |
for that we define a score, where we just take | |
1:23:56.588 --> 1:23:59.575 | |
a general score. | |
1:24:00.000 --> 1:24:04.011 | |
That is what the Gale and Church algorithm does; | |
it scores the matching of one segment pair. | |
1:24:04.011 --> 1:24:09.279 | |
If you have one segment pair, you score it, and the assumption | |
is that the global alignment | |
1:24:09.279 --> 1:24:13.828 | |
is as good as the product of all single steps | |
and then you have two scores. | |
1:24:13.828 --> 1:24:18.558 | |
First of all, you say one-to-one alignments | |
are much more likely than all the others. | |
1:24:19.059 --> 1:24:26.884 | |
And then you have a lexical similarity, which | |
is, for example, based on an initial dictionary, | |
1:24:26.884 --> 1:24:30.713 | |
where you count how many dictionary entries match. | |
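A rough sketch of such a scoring-based monotone sentence aligner (a simplified dictionary-based variant; the priors, the smoothing, and the dynamic program are illustrative assumptions and not the exact Gale and Church formulation, which scores sentence lengths):

```python
import math

# prior preference for alignment types: 1-1 is much more likely than the rest
PRIORS = {(1, 1): 0.89, (1, 2): 0.05, (2, 1): 0.05, (1, 0): 0.005, (0, 1): 0.005}

def lex_score(src_sents, tgt_sents, dictionary):
    """Fraction of source words with a dictionary translation in the target."""
    src = [w for s in src_sents for w in s.lower().split()]
    tgt = set(w for s in tgt_sents for w in s.lower().split())
    if not src:
        return 0.5
    hits = sum(1 for w in src if dictionary.get(w) in tgt)
    return (hits + 1) / (len(src) + 2)  # smoothed so the log is defined

def align(src, tgt, dictionary):
    """Monotone alignment maximizing the sum of log scores (dynamic program)."""
    INF = float("-inf")
    best = [[INF] * (len(tgt) + 1) for _ in range(len(src) + 1)]
    back = [[None] * (len(tgt) + 1) for _ in range(len(src) + 1)]
    best[0][0] = 0.0
    for i in range(len(src) + 1):
        for j in range(len(tgt) + 1):
            if best[i][j] == INF:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > len(src) or nj > len(tgt):
                    continue
                score = best[i][j] + math.log(prior) + math.log(
                    lex_score(src[i:ni], tgt[j:nj], dictionary))
                if score > best[ni][nj]:
                    best[ni][nj] = score
                    back[ni][nj] = (i, j)
    links, (i, j) = [], (len(src), len(tgt))
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        links.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return list(reversed(links))

src = ["ich gehe nach hause", "es regnet"]
tgt = ["i go home", "it is raining"]
dico = {"ich": "i", "gehe": "go", "hause": "home", "es": "it", "regnet": "raining"}
print(align(src, tgt, dico))
# expected: [([0], [0]), ([1], [1])]
```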
1:24:31.091 --> 1:24:35.407 | |
So this is a very simple algorithm. | |
1:24:35.407 --> 1:24:41.881 | |
This is typically what you do as a first step, and | |
then you can refine. | |
1:24:43.303 --> 1:24:54.454 | |
And with this you can get | |
an initial alignment and then obtain better parallel | |
1:24:54.454 --> 1:24:55.223 | |
data. | |
1:24:55.675 --> 1:25:02.369 | |
So it is an optimization problem: based on the | |
scores, you can calculate a score | |
1:25:02.369 --> 1:25:07.541 | |
for each possible alignment and then | |
select the best one. | |
1:25:07.541 --> 1:25:14.386 | |
Of course, you won't try all possibilities | |
out but you can do a good search and then find | |
1:25:14.386 --> 1:25:15.451 | |
the best one. | |
1:25:15.815 --> 1:25:18.726 | |
This can typically be done automatically. | |
1:25:18.726 --> 1:25:25.456 | |
Of course, you should do some checks, like | |
whether the sentences could be aligned well. | |
1:25:26.766 --> 1:25:32.043 | |
Training data is typically | |
aligned this way. | |
1:25:32.043 --> 1:25:35.045 | |
Maybe for test data you would check it manually. | |
1:25:40.000 --> 1:25:47.323 | |
Sorry, I'm a bit late because originally wanted | |
to do a quiz at the end. | |
1:25:47.323 --> 1:25:49.129 | |
Can we do a quiz? | |
1:25:49.429 --> 1:25:51.833 | |
We'll do it somewhere else. | |
1:25:51.833 --> 1:25:56.813 | |
We had a bachelor project about making quiz | |
for lectures. | |
1:25:56.813 --> 1:25:59.217 | |
And I still want to try it. | |
1:25:59.217 --> 1:26:04.197 | |
So let's see I hope in some other lecture | |
we can do that. | |
1:26:04.197 --> 1:26:09.435 | |
Then we can, at the end of the lecture, do | |
a quiz about the content. | |
1:26:09.609 --> 1:26:13.081 | |
All we can do is the practical part; let's | |
see. | |
1:26:13.533 --> 1:26:24.719 | |
And that's it for today. What you should remember is | |
what parallel data is and how we can | |
1:26:25.045 --> 1:26:29.553 | |
create parallel data, and how to generally | |
process data. | |
1:26:29.553 --> 1:26:36.435 | |
How you think about the data is really important | |
when you build systems, and there are different ways to represent words. | |
1:26:36.696 --> 1:26:46.857 | |
The three main options are full words, working directly | |
on the character level, or using subword units. | |
1:26:47.687 --> 1:26:49.634 | |
Is there any question? | |
1:26:52.192 --> 1:26:57.768 | |
Yes: is this alignment thing comparable to dynamic | |
time warping? | |
1:27:00.000 --> 1:27:05.761 | |
It's not directly using dynamic time warping, | |
but the idea is similar, and you can use | |
1:27:05.761 --> 1:27:11.771 | |
this type of similar algorithm; the | |
main thing, and the difficulty, | |
1:27:11.771 --> 1:27:14.807 | |
is to define your loss function | |
here. | |
1:27:14.807 --> 1:27:16.418 | |
What is a good alignment? | |
1:27:16.736 --> 1:27:24.115 | |
But as in dynamic time warping, you | |
have a monotone alignment in there, and you | |
1:27:24.115 --> 1:27:26.150 | |
cannot have reordering. | |
1:27:30.770 --> 1:27:40.121 | |
Alright, then, thanks a lot, and next time we | |
will continue from there. | |