WEBVTT
0:00:02.822 --> 0:00:07.880
We look into more linguistic approaches.
0:00:07.880 --> 0:00:14.912
We can do machine translation in a more traditional
way.
0:00:14.912 --> 0:00:21.224
It should be: Translation should be generated
this way.
0:00:21.224 --> 0:00:27.933
We can analyze first the source sentence, what
is the meaning or the syntax.
0:00:27.933 --> 0:00:35.185
Then we transfer this information to the target
side and then we generate.
0:00:36.556 --> 0:00:42.341
And this was the strong and commonly used approach
for several years.
0:00:44.024 --> 0:00:50.839
However, we saw already at the beginning
some challenges with that: Language is very
0:00:50.839 --> 0:00:57.232
ambiguous and it's often very difficult to really
write hand-coded rules
0:00:57.232 --> 0:01:05.336
for all the different meanings, and we have
to do that also with a living language, so new
0:01:05.336 --> 0:01:06.596
things occur.
0:01:07.007 --> 0:01:09.308
And that's why people look into.
0:01:09.308 --> 0:01:13.282
Can we maybe do it differently and use machine
learning?
0:01:13.333 --> 0:01:24.849
So we are no longer giving rules of how to
do it, but we just give examples and the system learns from them.
0:01:25.045 --> 0:01:34.836
And one important thing then is these examples:
how can we learn how to translate one sentence?
0:01:35.635 --> 0:01:42.516
And therefore the data is now
really a very important issue.
0:01:42.582 --> 0:01:50.021
And that is what we want to look into today.
0:01:50.021 --> 0:01:58.783
What type of data do we use for machine translation?
0:01:59.019 --> 0:02:08.674
So the idea in preprocessing is always: Can
we make the task somehow a bit easier so that
0:02:08.674 --> 0:02:13.180
the MT system will in a way be better?
0:02:13.493 --> 0:02:28.309
So one example could be if it has problems
dealing with numbers because they are occurring.
0:02:28.648 --> 0:02:35.479
Or think about one problem which still
might be there in some systems: think about
0:02:35.479 --> 0:02:36.333
different units.
0:02:36.656 --> 0:02:44.897
So a system might learn, of course, that if
there's a number in German, in English there should be the same number.
0:02:45.365 --> 0:02:52.270
However, if it's in parallel text, it will see
that in German there is often km, and in English
0:02:52.270 --> 0:02:54.107
typically miles.
0:02:54.594 --> 0:03:00.607
It might just translate three hundred and fifty-five
miles into three hundred and fifty-five
0:03:00.607 --> 0:03:04.348
kilometers, which of course is not right, and
so forth.
0:03:04.348 --> 0:03:06.953
So it makes sense to look into this.
0:03:07.067 --> 0:03:13.072
Therefore, the first step when you build your
machine translation system is normally to look
0:03:13.072 --> 0:03:19.077
at the data, to check it, to see if there is
anything happening which you should address
0:03:19.077 --> 0:03:19.887
beforehand.
0:03:20.360 --> 0:03:29.152
And then the second part is how do you represent
words, since machine learning normally works on numbers.
0:03:29.109 --> 0:03:35.404
So the question is how do we get from
the words into numbers, and some of
0:03:35.404 --> 0:03:35.766
you have seen this before.
0:03:35.766 --> 0:03:42.568
For example, in the advanced course we have introduced
an algorithm which we will also shortly repeat
0:03:42.568 --> 0:03:43.075
today.
0:03:43.303 --> 0:03:53.842
The subword unit approach, which was first
introduced in machine translation and is now used
0:03:53.842 --> 0:04:05.271
everywhere in order to represent words. Now you've learned
about morphology, so you know that maybe in
0:04:05.271 --> 0:04:09.270
English it's not that important.
0:04:09.429 --> 0:04:22.485
In German you have all these different word
forms, and it is hard to learn an independent representation for each of them.
0:04:24.024 --> 0:04:26.031
And then, of course, they are more extreme.
0:04:27.807 --> 0:04:34.387
So how are we doing?
0:04:34.975 --> 0:04:37.099
Machine translation.
0:04:37.099 --> 0:04:46.202
So hopefully you remember we had these approaches
to machine translation: the rule-based ones,
0:04:46.202 --> 0:04:52.473
and we had a big block of corpus-based machine
translation.
0:04:52.492 --> 0:05:00.443
We will on Thursday have an overview of statistical
models and then afterwards concentrate on the neural ones.
0:05:00.680 --> 0:05:08.828
Both of them are corpus-based machine translation,
and therefore what is really essential, and what
0:05:08.828 --> 0:05:16.640
we are typically training a machine translation
system on, is what we refer to as parallel data.
0:05:16.957 --> 0:05:22.395
We talk a lot about parallel corpora or parallel data,
and what I mean there is something which you
0:05:22.395 --> 0:05:28.257
might know from the Rosetta Stone or something
like that: typically you have one sentence
0:05:28.257 --> 0:05:33.273
in the one language, and then you have aligned
to it one sentence in the target language.
0:05:33.833 --> 0:05:38.261
And this is how we train all our systems.
0:05:38.261 --> 0:05:43.181
We'll see today that of course we might not
always have this.
0:05:43.723 --> 0:05:51.279
However, this is relatively easy to create,
at least for high-quality data.
0:05:51.279 --> 0:06:00.933
We look into data crawling, so that means how
we can automatically create this parallel data
0:06:00.933 --> 0:06:02.927
from the Internet.
0:06:04.144 --> 0:06:13.850
It's not so difficult to learn these alignments
if we have some type of dictionary, so which
0:06:13.850 --> 0:06:16.981
sentence is aligned to which.
0:06:18.718 --> 0:06:25.069
What would, of course, be a lot more difficult
is really to do word alignment, and that's also
0:06:25.069 --> 0:06:27.476
often no longer that well possible.
0:06:27.476 --> 0:06:33.360
We can do that automatically to some extent,
but it's definitely more challenging.
0:06:33.733 --> 0:06:40.691
For sentence alignment, of course, it's still
not always perfect, so there might be that
0:06:40.691 --> 0:06:46.085
there are two German sentences and one English
sentence or the other way around.
0:06:46.085 --> 0:06:53.511
So there's not always a perfect alignment, but
if you look at text, it still works relatively well.
0:06:54.014 --> 0:07:03.862
If we have that then we can build a machine
learning model which tries to map the source
0:07:03.862 --> 0:07:06.239
sentences to the target sentences.
0:07:06.626 --> 0:07:15.932
So this is the idea behind statistical
machine translation and neural machine translation.
0:07:15.932 --> 0:07:27.098
The difference is: Statistical machine translation
is typically a whole box of different models
0:07:27.098 --> 0:07:30.205
which try to evaluate how good a translation is.
0:07:30.510 --> 0:07:42.798
In neural machine translation, it's all one
large neural network where we use the source sentence as
0:07:42.798 --> 0:07:43.667
input.
0:07:44.584 --> 0:07:50.971
And then we can train it by having exactly
this mapping from parallel data.
0:07:54.214 --> 0:08:02.964
So what we want to look at today is:
we want to first look at general text data.
0:08:03.083 --> 0:08:06.250
So what is text data?
0:08:06.250 --> 0:08:09.850
What text data is there?
0:08:09.850 --> 0:08:18.202
Why is it challenging so that we have large
vocabularies?
0:08:18.378 --> 0:08:22.003
It's so that you always have words which you
haven't seen.
0:08:22.142 --> 0:08:29.053
If you increase your corpus size, normally
you will also increase your vocabulary, so you
0:08:29.053 --> 0:08:30.744
always find new words.
0:08:31.811 --> 0:08:39.738
Then based on that we'll look into pre-processing.
0:08:39.738 --> 0:08:45.333
So how can we pre-process our data?
0:08:45.333 --> 0:08:46.421
Maybe.
0:08:46.526 --> 0:08:54.788
This is a lot about tokenization, for example,
which we heard is not so challenging in European
0:08:54.788 --> 0:09:02.534
languages but still important, but might be
really difficult in Asian languages where you
0:09:02.534 --> 0:09:05.030
don't have space separation.
0:09:05.986 --> 0:09:12.161
And this preprocessing typically tries to
deal with the extreme cases where you have
0:09:12.161 --> 0:09:13.105
rarely seen things.
0:09:13.353 --> 0:09:25.091
If you have seen your words three hundred
times, it doesn't really matter if you have
0:09:25.091 --> 0:09:31.221
seen them with or without punctuation or
so.
0:09:31.651 --> 0:09:38.578
And then we look into word representation,
so what is the best way to represent a word?
0:09:38.578 --> 0:09:45.584
And finally, we look into the other type of
data we really need for machine translation.
0:09:45.725 --> 0:09:56.842
So first there is parallel data, which we can use for many tasks, and
later we can also use purely monolingual data
0:09:56.842 --> 0:10:00.465
to improve machine translation.
0:10:00.660 --> 0:10:03.187
So then the traditional approach was that
it was easier.
0:10:03.483 --> 0:10:08.697
We have this type of language model which
we can train only on the target data to make
0:10:08.697 --> 0:10:12.173
the text more fluent. In a neural machine translation
model
0:10:12.173 --> 0:10:18.106
it's partly a bit more complicated to integrate
this data, but still it's very important, especially
0:10:18.106 --> 0:10:22.362
if you think about low-resource languages where
you have very little data.
0:10:23.603 --> 0:10:26.999
It's harder to get parallel data than you
get monolingual data.
0:10:27.347 --> 0:10:33.821
Because monolingual data you just have out
there, maybe not huge amounts for some languages,
0:10:33.821 --> 0:10:38.113
but definitely the amount of monolingual data is always
significantly larger.
0:10:40.940 --> 0:10:50.454
When we talk about data, it's also of course
important how we use it for machine learning.
0:10:50.530 --> 0:11:05.867
And that you hopefully learned in some prior
class, so typically we separate our data into
0:11:05.867 --> 0:11:17.848
three chunks: training, validation and test data. The training data is really by far the
largest, and this grows with the data we get.
0:11:17.848 --> 0:11:21.387
Today we have here millions of sentences.
0:11:22.222 --> 0:11:27.320
Then we have our validation data and that
is to tune some type of parameters.
0:11:27.320 --> 0:11:33.129
So normally you have some things to configure
and you don't know what is the right value,
0:11:33.129 --> 0:11:39.067
so what you can do is train a model, change
these a bit and try to find the best ones on
0:11:39.067 --> 0:11:40.164
your validation data.
0:11:40.700 --> 0:11:48.531
For a statistical model, for example, validation data
is what you want to use if you have several
0:11:48.531 --> 0:11:54.664
models: you need to know how to combine them, so how
much weight should you put on the different
0:11:54.664 --> 0:11:55.186
models?
0:11:55.186 --> 0:11:59.301
And if it's like twenty models, it's only
twenty parameters.
0:11:59.301 --> 0:12:02.828
It's not that much, so that can still be reliably
estimated.
0:12:03.183 --> 0:12:18.964
In neural models there's often the question how
long you should train the model before you have
0:12:18.964 --> 0:12:21.322
overfitting.
0:12:22.902 --> 0:12:28.679
And then you have your test data, which is
finally where you report your results.
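(As a side note, a minimal Python sketch of such a three-way split of a parallel corpus; the sizes and the helper name are made up for illustration:)

import random

def split_corpus(pairs, n_valid=2000, n_test=2000, seed=42):
    # pairs: list of (source_sentence, target_sentence) tuples
    random.Random(seed).shuffle(pairs)
    test = pairs[:n_test]                   # report final results here
    valid = pairs[n_test:n_test + n_valid]  # tune hyperparameters here
    train = pairs[n_test + n_valid:]        # by far the largest chunk
    return train, valid, test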
0:12:29.009 --> 0:12:33.663
And therefore it's also important that from
time to time you get new test data, because
0:12:33.663 --> 0:12:38.423
if throughout your experiments you always
test on it, and then you do new experiments
0:12:38.423 --> 0:12:43.452
and test again, at some point you have tested
so many times on it that you are doing some type of training
0:12:43.452 --> 0:12:48.373
on your test data again, because you just select
the things which are in the end best on your
0:12:48.373 --> 0:12:48.962
test data.
0:12:49.009 --> 0:12:54.755
It's important to get a new test data from
time to time, for example in important evaluation
0:12:54.755 --> 0:12:58.340
campaigns for machine translation and speech
translation.
0:12:58.618 --> 0:13:07.459
There, for example, every year a new test set
is created so we can see if the model really
0:13:07.459 --> 0:13:09.761
gets better on new data.
0:13:10.951 --> 0:13:19.629
And of course it is important that this is
representative of the use case you are interested in.
0:13:19.879 --> 0:13:36.511
So if you're building a system for translating
websites, this should be on websites.
0:13:36.816 --> 0:13:39.356
So normally a system is good on some tasks.
0:13:40.780 --> 0:13:48.596
If it should solve everything, then your test
data should be drawn from everything, because if
0:13:48.596 --> 0:13:54.102
you only have a very small subset, you only know
that it's good on this subset.
0:13:54.394 --> 0:14:02.714
Therefore, the selection of your test data
is really important in order to ensure that
0:14:02.714 --> 0:14:05.200
the MT system in the end does what you want.
0:14:05.525 --> 0:14:12.646
Maybe it is the greatest system ever when evaluated
on translating the Bible.
0:14:12.646 --> 0:14:21.830
But if the use case is to translate some Twitter
data, you can imagine the performance might
0:14:21.830 --> 0:14:22.965
be really different.
0:14:23.803 --> 0:14:25.471
And importantly:
0:14:25.471 --> 0:14:35.478
Of course, in order to have a realistic
evaluation, it's important that there's no
0:14:35.478 --> 0:14:39.370
overlap between these data sets.
0:14:39.799 --> 0:14:51.615
Because the danger might be that the system is learning by
heart how to translate the sentences from your
0:14:51.615 --> 0:14:53.584
training data.
0:14:54.194 --> 0:15:04.430
So the test data should be really different from
your training data.
0:15:04.430 --> 0:15:16.811
Therefore, it's important to ensure this. So what type
of data do we have?
0:15:16.811 --> 0:15:24.966
There's a lot of different text data, and the
nice thing is that with digitalization a lot of it is available.
0:15:25.345 --> 0:15:31.785
You might think there's a large amount with
books, but to be honest books and printed things
0:15:31.785 --> 0:15:35.524
that's by now a minor percentage of the data
we have.
0:15:35.815 --> 0:15:39.947
There's like so much data created every day
on the Internet.
0:15:39.980 --> 0:15:46.223
With social media and all the other types,
0:15:46.223 --> 0:15:56.821
this of course is the largest amount of data,
more of colloquial language.
0:15:56.856 --> 0:16:02.609
It might be more noisy and harder to process,
so there is a whole area on how to deal with
0:16:02.609 --> 0:16:04.948
social media and similar data.
0:16:07.347 --> 0:16:20.702
What type of data is there if you think about
parallel data? News, official sites and so on.
0:16:20.900 --> 0:16:26.629
So the first parallel corpora were things
like the European Parliament or like some news
0:16:26.629 --> 0:16:27.069
sites.
0:16:27.227 --> 0:16:32.888
Nowadays there's quite a large amount of data
crawled from the Internet, but of course if
0:16:32.888 --> 0:16:38.613
you crawl parallel data from the Internet,
a lot of the data is also like company websites
0:16:38.613 --> 0:16:41.884
or so which gets translated into several languages.
0:16:45.365 --> 0:17:00.613
Then, of course, there are different levels
of text and we have to look at what level we
0:17:00.613 --> 0:17:05.118
want to process our data.
0:17:05.885 --> 0:17:16.140
It normally doesn't make sense to work
on full sentences because a lot of sentences
0:17:16.140 --> 0:17:22.899
have never been seen and you always create
new sentences.
0:17:23.283 --> 0:17:37.421
So typically what we take as our basic unit is words, or
something between words and letters, and that
0:17:37.421 --> 0:17:40.033
is an essential decision.
0:17:40.400 --> 0:17:47.873
So we need some of these atomic blocks or
basic blocks which we can't make smaller.
0:17:48.128 --> 0:17:55.987
So if we're building a sentence, for example,
you can build it out of these blocks, and you can
0:17:55.987 --> 0:17:57.268
decide:
0:17:57.268 --> 0:18:01.967
for example, you take words, or you split them
further.
0:18:03.683 --> 0:18:10.178
Then, of course, the nice thing is if they are not too
small, because then building larger things
0:18:10.178 --> 0:18:11.386
like sentences is easy.
0:18:11.831 --> 0:18:16.690
So you only have to take your vocabulary and
put it somehow together to get your full
0:18:16.690 --> 0:18:17.132
sentence.
0:18:19.659 --> 0:18:27.670
However, if it's too large, these blocks don't
occur often enough, and you have more blocks
0:18:27.670 --> 0:18:28.715
that occur only rarely.
0:18:29.249 --> 0:18:34.400
And that's why we can also work with
smaller blocks like subword blocks.
0:18:34.714 --> 0:18:38.183
Or, when you work with neural models,
0:18:38.183 --> 0:18:50.533
you can even work on letters, so you have a
system which tries to understand the sentence
0:18:50.533 --> 0:18:53.031
letter by letter.
0:18:53.313 --> 0:18:57.608
But that is a design decision which you have
to take at some point.
0:18:57.608 --> 0:19:03.292
On which level do you want to split your text,
and what are the basic blocks that you are
0:19:03.292 --> 0:19:04.176
working with?
0:19:04.176 --> 0:19:06.955
And that's something we'll look into today.
0:19:06.955 --> 0:19:08.471
What possibilities are?
0:19:12.572 --> 0:19:14.189
Any question.
0:19:17.998 --> 0:19:24.456
Then let's look a bit at what type of data
there is and how much data there is to process.
0:19:24.824 --> 0:19:34.006
The thing is that nowadays, at least for pure text,
for some languages
0:19:34.006 --> 0:19:38.959
there is so much data that we cannot process all of it.
0:19:39.479 --> 0:19:49.384
That is only true for some languages, but
there is also interest in other languages, and
0:19:49.384 --> 0:19:50.622
that is important.
0:19:50.810 --> 0:20:01.483
So if you want to build a system for Swedish
or for some dialect in other countries, then
0:20:01.483 --> 0:20:02.802
of course you have much less data.
0:20:03.103 --> 0:20:06.888
Otherwise you have this huge amount of data here.
0:20:06.888 --> 0:20:11.515
We are often no longer talking about gigabytes
but more.
0:20:11.891 --> 0:20:35.788
The general information that is produced every
year is enormous, and this is like all the information
0:20:35.788 --> 0:20:40.661
that is available, so there is really a lot.
0:20:41.001 --> 0:20:44.129
If we look at machine translation,
0:20:44.129 --> 0:20:53.027
we can see these numbers are really more
than ten years old, but we see this increase:
0:20:53.027 --> 0:20:58.796
one billion words we had at that time for
English data.
0:20:59.019 --> 0:21:01.955
Then there were things like the shuffled news
corpora and Google N-grams and stuff.
0:21:02.382 --> 0:21:05.003
For this one you could train your system on.
0:21:05.805 --> 0:21:20.457
And the interesting thing is this one billion
words is more than any human typically speaks.
0:21:21.001 --> 0:21:25.892
So these systems by now see like an order of magnitude
more data.
0:21:25.892 --> 0:21:32.465
I think it is an order of magnitude
more data than a human has ever seen in his
0:21:32.465 --> 0:21:33.229
lifetime.
0:21:35.175 --> 0:21:41.808
And that is maybe the interesting thing: why
does it still not work perfectly, because you see
0:21:41.808 --> 0:21:42.637
how much they have seen.
0:21:43.103 --> 0:21:48.745
So we are seeing really impressive results,
but in most cases it's not that they're really
0:21:48.745 --> 0:21:49.911
better than humans.
0:21:50.170 --> 0:21:56.852
However, they really have seen more data than
any human ever has seen in this lifetime.
0:21:57.197 --> 0:22:01.468
They can just process so much data, so.
0:22:01.501 --> 0:22:08.425
The question is, can we make them more efficient
so that they can learn similarly well without
0:22:08.425 --> 0:22:09.592
that much data?
0:22:09.592 --> 0:22:16.443
And that is essential if we now go to low-resource
languages where we might never get that much
0:22:16.443 --> 0:22:21.254
data, and we should be also able to achieve
a reasonable performance.
0:22:23.303 --> 0:22:32.399
On the other hand, this of course links also
to one topic which we will cover later: If
0:22:32.399 --> 0:22:37.965
you think about this, it's really important
that your algorithms are also very efficient
0:22:37.965 --> 0:22:41.280
in order to process that much data both in
training.
0:22:41.280 --> 0:22:46.408
If you have more data, you want to process
more data so you can make use of that.
0:22:46.466 --> 0:22:54.499
On the other hand, if more and more data is
processed, more and more people will use machine
0:22:54.499 --> 0:23:06.816
translation to generate translations, and it
will be important to do that efficiently as well. And there
0:23:06.816 --> 0:23:07.257
is
0:23:07.607 --> 0:23:10.610
more and
0:23:10.170 --> 0:23:17.262
more data generated every day; here are just
some general numbers on how much data there
0:23:17.262 --> 0:23:17.584
is.
0:23:17.584 --> 0:23:24.595
It says that a lot of the data we produce
at least at the moment is text rich, so text
0:23:24.595 --> 0:23:26.046
that is produced.
0:23:26.026 --> 0:23:29.748
That is very important for us, because either
0:23:29.748 --> 0:23:33.949
we can use it as training data in some way,
0:23:33.873 --> 0:23:40.836
or we want to translate some of that because
it might not be published in all the languages,
0:23:40.836 --> 0:23:46.039
and then the need for machine translation
is even more important.
0:23:47.907 --> 0:23:51.547
So what are the challenges with this?
0:23:51.831 --> 0:24:01.360
So first of all that seems to be very good
news, so there is more and more data, so we
0:24:01.360 --> 0:24:10.780
can just wait for three years and have more
data, and then our system will be better.
0:24:11.011 --> 0:24:22.629
If you see in competitions, the system performance
increases.
0:24:24.004 --> 0:24:27.190
You see that here are three different systems.
0:24:27.190 --> 0:24:34.008
The BLEU score is a metric to measure how good an
MT system is, and we'll talk about evaluation
0:24:34.008 --> 0:24:40.974
next week, so you'll learn how to evaluate
machine translation, and there is also a practical session.
0:24:41.581 --> 0:24:45.219
And so.
0:24:44.784 --> 0:24:50.960
This shows you how much
of the training data you have: with five percent
0:24:50.960 --> 0:24:56.117
you're significantly worse than with
forty percent, and with eighty percent
0:24:56.117 --> 0:25:02.021
you're getting better, and you're seeing
that this curve maybe does not really
0:25:02.021 --> 0:25:02.971
flatten out.
0:25:02.971 --> 0:25:03.311
But.
0:25:03.263 --> 0:25:07.525
Of course, the gains you get are normally
smaller and smaller.
0:25:07.525 --> 0:25:09.216
The more data you have,.
0:25:09.549 --> 0:25:21.432
If your improvements are unnormally better,
if you add the same thing or even double your
0:25:21.432 --> 0:25:25.657
data late, of course more data.
0:25:26.526 --> 0:25:34.955
However, you see the clear tendency: if you
need to improve your system,
0:25:34.955 --> 0:25:38.935
this is possible by just getting more data.
0:25:39.039 --> 0:25:41.110
But it's not all about data.
0:25:41.110 --> 0:25:45.396
It can also be the domain of the data that
you are building the system for.
0:25:45.865 --> 0:25:55.668
So this was a test on machine translation
system on translating genome data.
0:25:55.668 --> 0:26:02.669
We have the like SAI said he's working on
translating.
0:26:02.862 --> 0:26:06.868
Here you see the performance in BLEU score.
0:26:06.868 --> 0:26:12.569
You see one system which only was trained
on genome data, and it only has very little data.
0:26:12.812 --> 0:26:17.742
That's very, very few for machine translation.
0:26:18.438 --> 0:26:23.927
And to compare that to a system which was
generally trained on used translation data.
0:26:24.104 --> 0:26:34.177
With four point five million sentences so
roughly one hundred times as much data you
0:26:34.177 --> 0:26:40.458
still see that this system doesn't really work
well.
0:26:40.820 --> 0:26:50.575
So you see it's not only about data, it's
also that the data has to somewhat fit to the
0:26:50.575 --> 0:26:51.462
domain.
0:26:51.831 --> 0:26:58.069
The more general data you get that you have
covered up all domains.
0:26:58.418 --> 0:27:07.906
But that's very difficult and especially for
more specific domains.
0:27:07.906 --> 0:27:16.696
It can be really important to get data which
fits your domain.
0:27:16.716 --> 0:27:18.520
Maybe you can do some prompting
or something like that, maybe if you
0:27:18.598 --> 0:27:22.341
say okay, concentrate on this domain
to be better.
0:27:24.564 --> 0:27:28.201
It's not that easy to prompt it.
0:27:28.201 --> 0:27:35.807
You can do the prompting in the more traditional
way of fine tuning.
0:27:35.807 --> 0:27:44.514
Then, of course, if you select UIV later combine
this one, you can get better.
0:27:44.904 --> 0:27:52.675
But it will always be that this type of similar
data is much more important than the general.
0:27:52.912 --> 0:28:00.705
So of course you can make the overall system
a lot better if you search for similar data
0:28:00.705 --> 0:28:01.612
and find it.
0:28:02.122 --> 0:28:08.190
We will have a lecture on domain adaptation where
it's exactly the idea how you can make systems
0:28:08.190 --> 0:28:13.935
in these situations better so you can adapt
it to this data but then you still need this
0:28:13.935 --> 0:28:14.839
type of data.
0:28:15.335 --> 0:28:21.590
And in prompting it might work if you have
seen it in your data so it can make the system
0:28:21.590 --> 0:28:25.134
aware and tell it focus more in this type of
data.
0:28:25.465 --> 0:28:30.684
But if you haven't had enough of the really
specific good matching data, I think it will
0:28:30.684 --> 0:28:31.681
always not work.
0:28:31.681 --> 0:28:37.077
So you need to have this type of data and
therefore it's important not only to have general
0:28:37.077 --> 0:28:42.120
data but also data, at least in your overall
system, which really fits to the domain.
0:28:45.966 --> 0:28:53.298
And then the second thing, of course, is you
need to have data that has good quality.
0:28:53.693 --> 0:29:00.170
In the early stages it might be good to have
all the data but later it's especially important
0:29:00.170 --> 0:29:06.577
that you have somehow good quality and so that
you're learning what you really want to learn
0:29:06.577 --> 0:29:09.057
and not learning some strange things.
0:29:10.370 --> 0:29:21.551
We talked about this with the kilometers and
miles, so if you just take in some type of
0:29:21.551 --> 0:29:26.253
data and don't look at the quality,.
0:29:26.766 --> 0:29:30.875
But of course, the question here is what is
good quality data?
0:29:31.331 --> 0:29:35.054
It is not yet that easy to define what is
a good quality data.
0:29:36.096 --> 0:29:43.961
That doesn't mean it has to be what people generally
assume as high quality text or so, like written
0:29:43.961 --> 0:29:47.814
by a Nobel Prize winner or something like that.
0:29:47.814 --> 0:29:54.074
This is not what we mean by this quality,
but again the most important again.
0:29:54.354 --> 0:30:09.181
So if you have Twitter data, high quality
data doesn't mean you have now some novels.
0:30:09.309 --> 0:30:12.875
Test data, but it should also be represented
similarly.
0:30:12.875 --> 0:30:18.480
Don't have, for example, quality definitely
as it should be really translating yourself
0:30:18.480 --> 0:30:18.862
into.
0:30:19.199 --> 0:30:25.556
So especially if you corral data you would
often have that it's not a direct translation.
0:30:25.805 --> 0:30:28.436
So then, of course, this is not high quality
training data.
0:30:29.449 --> 0:30:39.974
But in general that's a very difficult thing
to do, and it's very difficult to define what
0:30:39.974 --> 0:30:41.378
good quality really is.
0:30:41.982 --> 0:30:48.333
And of course one metric is always: the quality
of your data is good if your machine translation system gets better.
0:30:48.648 --> 0:30:50.719
So that is like the indirect.
0:30:50.991 --> 0:30:52.447
Well, what can we motive?
0:30:52.447 --> 0:30:57.210
Of course, it's difficult to always try a
lot of things and evaluate either of them,
0:30:57.210 --> 0:30:59.396
build a full MP system and then check.
0:30:59.396 --> 0:31:00.852
Oh, was this a good idea?
0:31:00.852 --> 0:31:01.357
I mean,.
0:31:01.581 --> 0:31:19.055
Say you have two tokenizers which split sentences
into words, and you want to know which one to apply.
0:31:19.179 --> 0:31:21.652
Now you could maybe argue or your idea could
be.
0:31:21.841 --> 0:31:30.186
Just take it there very fast and then get
the result, but the problem is there is not
0:31:30.186 --> 0:31:31.448
always this.
0:31:31.531 --> 0:31:36.269
One thing that works very well for small data.
0:31:36.269 --> 0:31:43.123
It's not for sure that the same effect will
happen at large scale.
0:31:43.223 --> 0:31:50.395
This idea really improves on very low resource
data if only train on hundred words.
0:31:51.271 --> 0:31:58.357
But if you use it for a large data set, it
doesn't really matter and all your ideas not.
0:31:58.598 --> 0:32:01.172
So that is also a typical thing.
0:32:01.172 --> 0:32:05.383
This quality issue is more and more important
if you.
0:32:06.026 --> 0:32:16.459
But one motivation which generally you should
have: you want to represent your data such that you have
0:32:16.459 --> 0:32:17.469
seen things as many times as possible.
0:32:17.677 --> 0:32:21.805
Why is this the case any idea?
0:32:21.805 --> 0:32:33.389
Why could this be a motivation, that we try
to represent the data in a way that we have
0:32:33.389 --> 0:32:34.587
seen things as many times as possible?
0:32:38.338 --> 0:32:50.501
We also want to learn about the fun text because
maybe sometimes some grows in the fun text.
0:32:52.612 --> 0:32:54.020
The context is here.
0:32:54.020 --> 0:32:56.432
It's more about the learning first.
0:32:56.432 --> 0:33:00.990
You can generally learn better if you've seen
something more often.
0:33:00.990 --> 0:33:06.553
So if you have seen an event only once, it's
really hard to learn about the event.
0:33:07.107 --> 0:33:15.057
If you have seen an event a hundred times,
you are better at estimating it, and maybe that
0:33:15.057 --> 0:33:18.529
includes the context, which you can then use.
0:33:18.778 --> 0:33:21.331
So, for example, if you here have the word
towels.
0:33:21.761 --> 0:33:28.440
If you would just take the data, normally you
would directly process the raw text.
0:33:28.440 --> 0:33:32.893
Then the upper-case house and the house with
the dot
0:33:32.893 --> 0:33:40.085
are different words from the house written this
way, and from the house with the comma.
0:33:40.520 --> 0:33:48.365
So you want to learn how this translates into
house, but you translate an upper case.
0:33:48.365 --> 0:33:50.281
How this translates.
0:33:50.610 --> 0:33:59.445
You were learning how to translate into house
and house, so you have to learn four different
0:33:59.445 --> 0:34:00.205
things.
0:34:00.205 --> 0:34:06.000
Instead, we really want to learn that house
gets into house.
0:34:06.366 --> 0:34:18.796
And then imagine it would be even worse:
it might be, like here, that a house would be translated into
0:34:18.678 --> 0:34:22.089
something else entirely.
0:34:22.202 --> 0:34:29.512
If it's upper case then I always have to
translate it one way, while if it's lower
0:34:29.512 --> 0:34:34.955
case it is translated into house, and that's
of course not right.
0:34:34.955 --> 0:34:39.260
We have to use the context to decide what
is better.
0:34:39.679 --> 0:34:47.086
If you have seen an event several times then
you are better able to learn your model and
0:34:47.086 --> 0:34:51.414
that doesn't matter what type of learning you
have.
0:34:52.392 --> 0:34:58.981
I shouldn't say all but for most of these
models it's always better to have like seen
0:34:58.981 --> 0:35:00.897
an event more often.
0:35:00.920 --> 0:35:11.483
Therefore, if you preprocess data, you
should ask the question how you can represent the data
0:35:11.483 --> 0:35:14.212
in order to have seen things as often as possible.
0:35:14.514 --> 0:35:17.885
Of course you should not remove that information.
0:35:18.078 --> 0:35:25.519
So you could now, of course, just lowercase
everything.
0:35:25.519 --> 0:35:30.303
Then you've seen things more often.
0:35:30.710 --> 0:35:38.443
And that might be an issue because in the
final application you want to have real text
0:35:38.443 --> 0:35:38.887
with proper casing.
0:35:40.440 --> 0:35:44.003
And finally, it is even more important that
it's consistent.
0:35:44.965 --> 0:35:52.630
So this is a problem where things, for example, aren't
consistent.
0:35:52.630 --> 0:35:58.762
Say I am and I'm are both written in the training
data,
0:35:58.762 --> 0:36:04.512
and if the test data is written differently, you have a mismatch.
0:36:04.824 --> 0:36:14.612
Therefore, the most important thing is to do the preprocessing
and represent your data in the way that is most consistent,
0:36:14.612 --> 0:36:18.413
because then it's easier to map similar things onto each other.
0:36:18.758 --> 0:36:26.588
If your text is represented very, very differently
then your data will be badly translated.
0:36:26.666 --> 0:36:30.664
So we once had the case.
0:36:30.664 --> 0:36:40.420
For example, there is some data who wrote
it, but in German.
0:36:40.900 --> 0:36:44.187
And if you read it as a human you see it.
0:36:44.187 --> 0:36:49.507
It's even hard to get the difference because
it looks very similar.
0:36:50.130 --> 0:37:02.997
If you use it for a machine translation system,
it would not be able to translate anything
0:37:02.997 --> 0:37:08.229
of it because it's a different word.
0:37:09.990 --> 0:37:17.736
And on the other hand you should,
of course, not remove significant information from the training
0:37:17.736 --> 0:37:18.968
data thereby,
0:37:18.968 --> 0:37:27.155
for example removing case information, because
if your task is to generate case information you still need it.
0:37:31.191 --> 0:37:41.081
One thing which is a good point to look into,
in order to see the difficulty of your data,
0:37:41.081 --> 0:37:42.711
is to compare types and tokens.
0:37:43.103 --> 0:37:45.583
There are types and tokens.
0:37:45.583 --> 0:37:57.983
With types we mean the number of unique words in the
corpus, so your vocabulary, and the tokens are the running words.
0:37:58.298 --> 0:38:08.628
And then you can look at the type-token ratio,
that means the number of types per token.
0:38:15.815 --> 0:38:22.381
You have fewer types than tokens because every
word appears at least once in the corpus, but most
0:38:22.381 --> 0:38:27.081
of them will occur more often, so the number of tokens
is bigger.
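(A small illustrative Python sketch of the type/token counts just described; the example sentence is made up:)

def type_token_ratio(text):
    tokens = text.split()        # tokens: the running words
    types = set(tokens)          # types: the unique words
    return len(types) / len(tokens)

# type_token_ratio("the house and the dog") == 4 / 5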
0:38:27.667 --> 0:38:30.548
And of course this changes if you have more
data.
0:38:31.191 --> 0:38:38.103
Here is an example from an English Wikipedia.
0:38:38.103 --> 0:38:45.015
That means each word on average occurs that many times.
0:38:45.425 --> 0:38:47.058
Of course there's a big difference.
0:38:47.058 --> 0:38:51.323
There will be some words which occur one hundred
times, but therefore most of the words occur
0:38:51.323 --> 0:38:51.777
only one.
0:38:52.252 --> 0:38:55.165
However, you see this ratio goes down.
0:38:55.165 --> 0:39:01.812
That's a good thing, so you have seen each
word more often and therefore your model gets
0:39:01.812 --> 0:39:03.156
typically better.
0:39:03.156 --> 0:39:08.683
However, the problem is we always have a lot
of words which we have seen.
0:39:09.749 --> 0:39:15.111
Even here there will be a bound of words which
you have only seen once.
0:39:15.111 --> 0:39:20.472
However, this can give you an indication about
the quality of the data.
0:39:20.472 --> 0:39:27.323
So you should always, of course, try to achieve
data where you have a very low type to talk
0:39:27.323 --> 0:39:28.142
and ratio.
0:39:28.808 --> 0:39:39.108
For example, if you compare, simplify and
not only Wikipedia, what would be your expectation?
0:39:41.861 --> 0:39:49.842
Yes, that's exactly, but however it's surprisingly
only a little bit lower, but you see that it's
0:39:49.842 --> 0:39:57.579
lower, so we are using less words to express
the same thing, and therefore the task to produce
0:39:57.579 --> 0:39:59.941
this text is also a gesture.
0:40:01.221 --> 0:40:07.702
However, as how many words are there, there
is no clear definition.
0:40:07.787 --> 0:40:19.915
So there will be always more words, especially
depending on your dataset, how many different
0:40:19.915 --> 0:40:22.132
words there are.
0:40:22.482 --> 0:40:30.027
So if you have million tweets where around
fifty million tokens and you have six hundred
0:40:30.027 --> 0:40:30.875
thousand.
0:40:31.251 --> 0:40:40.299
If you have times this money teen tweeds you
also have significantly more tokens but also.
0:40:40.660 --> 0:40:58.590
So especially in things like the social media,
of course, there's always different types of
0:40:58.590 --> 0:40:59.954
words.
0:41:00.040 --> 0:41:04.028
Another example from not social media is here.
0:41:04.264 --> 0:41:18.360
So yeah, there is a small liter sandwich like
phone conversations, two million tokens, and
0:41:18.360 --> 0:41:22.697
only twenty thousand words.
0:41:23.883 --> 0:41:37.221
If you think about Shakespeare, it has even
less token, significantly less than a million,
0:41:37.221 --> 0:41:40.006
but the number of.
0:41:40.060 --> 0:41:48.781
On the other hand, there is this Google Engron
corpus which has tokens and there is always
0:41:48.781 --> 0:41:50.506
new words coming.
0:41:50.991 --> 0:41:52.841
Is English.
0:41:52.841 --> 0:42:08.103
The nice thing about English is that the vocabulary
is relatively small, too small, but relatively
0:42:08.103 --> 0:42:09.183
small.
0:42:09.409 --> 0:42:14.224
So here you see the Ted Corpus here.
0:42:15.555 --> 0:42:18.144
All know Ted's lectures.
0:42:18.144 --> 0:42:26.429
They are transcribed, translated, not a source
for us, especially small crocus.
0:42:26.846 --> 0:42:32.702
You can do a lot of experiments with that
and you see that the corpus site is relatively
0:42:32.702 --> 0:42:36.782
similar so we have around four million tokens
in this corpus.
0:42:36.957 --> 0:42:44.464
However, if you look at the vocabulary, English
has half as many words in their different words
0:42:44.464 --> 0:42:47.045
as German and Dutch and Italian.
0:42:47.527 --> 0:42:56.260
So this is one influence from positional works
like which are more frequent in German, the
0:42:56.260 --> 0:43:02.978
more important since we have all these different
morphological forms.
0:43:03.263 --> 0:43:08.170
There all leads to new words and they need
to be somewhat expressed in there.
0:43:11.531 --> 0:43:20.278
So to deal with this, the question is how
can we normalize the text in order to make
0:43:20.278 --> 0:43:22.028
the text easier?
0:43:22.028 --> 0:43:25.424
Can we simplify the task easier?
0:43:25.424 --> 0:43:29.231
But we need to keep all information.
0:43:29.409 --> 0:43:32.239
So an example where not all information skipped.
0:43:32.239 --> 0:43:35.012
Of course you make the task easier if you
just.
0:43:35.275 --> 0:43:41.141
You don't have to deal with different cases.
0:43:41.141 --> 0:43:42.836
It's easier.
0:43:42.836 --> 0:43:52.482
However, information gets lost and you might
need to generate the target.
0:43:52.832 --> 0:44:00.153
So the question is always: How can we on the
one hand simplify the task but keep all the
0:44:00.153 --> 0:44:01.223
information?
0:44:01.441 --> 0:44:06.639
Say necessary because it depends on the task.
0:44:06.639 --> 0:44:11.724
For some tasks you might find to remove the.
0:44:14.194 --> 0:44:23.463
So the steps they were typically doing are
that you can the segment and words in a running
0:44:23.463 --> 0:44:30.696
text, so you can normalize word forms and segmentation
into sentences.
0:44:30.696 --> 0:44:33.955
Also, if you have not a single.
0:44:33.933 --> 0:44:38.739
If this is not a redundancy point to segments,
the text is also into segments.
0:44:39.779 --> 0:44:52.609
So what are we doing there for European language
segmentation into words?
0:44:52.609 --> 0:44:57.290
It's not that complicated.
0:44:57.277 --> 0:45:06.001
You have to somehow handle the joint words
and by handling joint words the most important.
0:45:06.526 --> 0:45:11.331
So in most systems it really doesn't matter
much.
0:45:11.331 --> 0:45:16.712
If you write, I'm together as one word or
as two words.
0:45:17.197 --> 0:45:23.511
The nice thing about iron is maybe this is
so often that it doesn't matter if you both
0:45:23.511 --> 0:45:26.560
and if they're both accrued often enough.
0:45:26.560 --> 0:45:32.802
But you'll have some of these cases where
they don't occur there often, so you should
0:45:32.802 --> 0:45:35.487
have more as consistent as possible.
0:45:36.796 --> 0:45:41.662
But of course things can get more complicated.
0:45:41.662 --> 0:45:48.598
If you have Finland capital, do you want to
split the ends or not?
0:45:48.598 --> 0:45:53.256
Isn't you split or do you even write it out?
0:45:53.433 --> 0:46:00.468
And what about like things with hyphens in
the middle and so on?
0:46:00.540 --> 0:46:07.729
So there is not everything is very easy, but
is generally possible to somewhat keep as.
0:46:11.791 --> 0:46:25.725
Sometimes the most challenging and traditional
systems were compounds, or how to deal with
0:46:25.725 --> 0:46:28.481
things like this.
0:46:28.668 --> 0:46:32.154
The nice thing is, as said, will come to the
later.
0:46:32.154 --> 0:46:34.501
Nowadays we typically use subword.
0:46:35.255 --> 0:46:42.261
Unit, so we don't have to deal with this in
the preprocessing directly, but in the subword
0:46:42.261 --> 0:46:47.804
splitting we're doing it, and then we can learn
how to best spit these.
0:46:52.392 --> 0:46:56.974
Things Get More Complicated.
0:46:56.977 --> 0:46:59.934
About non European languages.
0:46:59.934 --> 0:47:08.707
Because in non European languages, not all
of them, there is no space between the words.
0:47:09.029 --> 0:47:18.752
Nowadays you can also download word segmentation
models where you put in the full sentence and
0:47:18.752 --> 0:47:22.744
then it's getting splitted into parts.
0:47:22.963 --> 0:47:31.814
And then, of course, it's even that you have
different writing systems, sometimes in Japanese.
0:47:31.814 --> 0:47:40.385
For example, they have these katakana, hiragana
and kanji symbols in there, and you have to
0:47:40.385 --> 0:47:42.435
some idea with these.
0:47:49.669 --> 0:47:54.560
To the, the next thing is can reduce some
normalization.
0:47:54.874 --> 0:48:00.376
So the idea is that you map several words
onto the same.
0:48:00.460 --> 0:48:07.877
And that is test dependent, and the idea is
to define something like acronym classes so
0:48:07.877 --> 0:48:15.546
that words, which have the same meaning where
it's not in order to have the difference, to
0:48:15.546 --> 0:48:19.423
map onto the same thing in order to make the.
0:48:19.679 --> 0:48:27.023
The most important thing is there about tasing,
and then there is something like sometimes
0:48:27.023 --> 0:48:27.508
word.
0:48:28.048 --> 0:48:37.063
For casing you can do two things and then
depend on the task.
0:48:37.063 --> 0:48:44.769
You can lowercase everything, maybe some exceptions.
0:48:45.045 --> 0:48:47.831
For the target side, it should normally it's
normally not done.
0:48:48.188 --> 0:48:51.020
Why is it not done?
0:48:51.020 --> 0:48:56.542
Why should you only do it for suicide?
0:48:56.542 --> 0:49:07.729
Yes, so you have to generate correct text
instead of lower case and uppercase.
0:49:08.848 --> 0:49:16.370
Nowadays to be always do true casing on both
sides, also on the sewer side, that means you
0:49:16.370 --> 0:49:17.610
keep the case.
0:49:17.610 --> 0:49:24.966
The only thing where people try to work on
or sometimes do that is that at the beginning
0:49:24.966 --> 0:49:25.628
of the.
0:49:25.825 --> 0:49:31.115
For words like this, this is not that important
because you will have seen otherwise a lot
0:49:31.115 --> 0:49:31.696
of times.
0:49:31.696 --> 0:49:36.928
But if you know have rare words, which you
only have seen maybe three times, and you have
0:49:36.928 --> 0:49:42.334
only seen in the middle of the sentence, and
now it occurs at the beginning of the sentence,
0:49:42.334 --> 0:49:45.763
which is upper case, then you don't know how
to deal with.
0:49:46.146 --> 0:49:50.983
So then it might be good to do a true casing.
0:49:50.983 --> 0:49:56.241
That means you recase each word on the beginning.
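(A minimal sketch of such a truecaser in Python, based on the counting idea discussed below: count how often each word occurs in which casing, ignoring the sentence-initial position, and recase the first word accordingly. The details here are illustrative only:)

from collections import Counter

def train_truecaser(sentences):
    counts = {}
    for sent in sentences:
        # skip position 0, where upper case is forced and uninformative
        for word in sent.split()[1:]:
            counts.setdefault(word.lower(), Counter())[word] += 1
    # keep the most frequent surface form of each word
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def truecase(sentence, model):
    words = sentence.split()
    if words:
        words[0] = model.get(words[0].lower(), words[0])
    return " ".join(words)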
0:49:56.576 --> 0:49:59.830
The only question, of course, is how do you
recase it?
0:49:59.830 --> 0:50:01.961
So what case would you always know?
0:50:02.162 --> 0:50:18.918
Word of the senders, or do you have a better
solution, especially not English, maybe German.
0:50:18.918 --> 0:50:20.000
It's.
0:50:25.966 --> 0:50:36.648
The fancy solution would be to count hope
and decide based on this, the unfancy running
0:50:36.648 --> 0:50:43.147
would: Think it's not really good because most
of the cane boards are lower paced.
0:50:43.683 --> 0:50:53.657
That is one idea to count and definitely better
because as a word more often occurs upper case.
0:50:53.653 --> 0:50:57.934
Otherwise you only have a lower case at the
beginning where you have again.
0:50:58.338 --> 0:51:03.269
Haven't gained anything, you can make it even
a bit better when counting.
0:51:03.269 --> 0:51:09.134
You're ignoring the first position so that
you don't count the word beginning and yeah,
0:51:09.134 --> 0:51:12.999
that's typically how it's done to do this type
of casing.
0:51:13.273 --> 0:51:23.907
And that's the easy thing you can't even use
like then bygram teachers who work pairs.
0:51:23.907 --> 0:51:29.651
There's very few words which occur more often.
0:51:29.970 --> 0:51:33.163
It's OK to have them boast because you can
otherwise learn it.
0:51:36.376 --> 0:51:52.305
Another thing about these classes is to use
word classes that were partly done, for example,
0:51:52.305 --> 0:51:55.046
and more often.
0:51:55.375 --> 0:51:57.214
Ten Thousand One Hundred Books.
0:51:57.597 --> 0:52:07.397
And then for an system that might not be important
you can do something at number books.
0:52:07.847 --> 0:52:16.450
However, you see here already that it's not
that easy because if you have one book you
0:52:16.450 --> 0:52:19.318
don't have to do with a pro.
0:52:20.020 --> 0:52:21.669
Always be careful.
0:52:21.669 --> 0:52:28.094
It's very fast to ignore some exceptions and
make more things worse than.
0:52:28.488 --> 0:52:37.879
So it's always difficult to decide when to
do this and when to better not do it and keep
0:52:37.879 --> 0:52:38.724
things.
0:52:43.483 --> 0:52:56.202
Then the next step is sentence segmentation,
so we are typically working on sentences.
0:52:56.476 --> 0:53:11.633
However, dots things are a bit more complicated,
so you can do a bit more.
0:53:11.731 --> 0:53:20.111
You can even have some type of classifier
with features by then generally.
0:53:20.500 --> 0:53:30.731
Is not too complicated, so you can have different
types of classifiers to do that, but in generally.
0:53:30.650 --> 0:53:32.537
I Didn't Know It.
0:53:33.393 --> 0:53:35.583
It's not a super complicated task.
0:53:35.583 --> 0:53:39.461
There are nowadays also a lot of libraries
which you can use.
0:53:39.699 --> 0:53:45.714
To do that normally if you're doing the normalization
beforehand that can be done there so you only
0:53:45.714 --> 0:53:51.126
split up the dot if it's like the sentence
boundary and otherwise you keep it to the word
0:53:51.126 --> 0:53:54.194
so you can do that a bit jointly with the segment.
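(A very rough sentence-splitting sketch in Python along these lines; the abbreviation list is made up, and a real system would use a proper library or classifier:)

import re

ABBREVIATIONS = {"dr.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        # close the sentence after ., ! or ?, unless it is a known abbreviation
        if re.search(r"[.!?]$", token) and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences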
0:53:54.634 --> 0:54:06.017
It's something to think about to care because
it's where arrows happen.
0:54:06.017 --> 0:54:14.712
However, on the one end you can still do it
very well.
0:54:14.834 --> 0:54:19.740
You will never get data which is perfectly
clean and where everything is great.
0:54:20.340 --> 0:54:31.020
There's just too much data and it will never
happen, so therefore it's important to be aware
0:54:31.020 --> 0:54:35.269
of that during the full development.
0:54:37.237 --> 0:54:42.369
And one last thing about the preprocessing,
we'll get into the representation.
0:54:42.369 --> 0:54:47.046
If you're working on that, you'll get a friend
with regular expression.
0:54:47.046 --> 0:54:50.034
That's not only how you do all this matching.
0:54:50.430 --> 0:55:03.811
And if you look into the scripts of how to
deal with pancreation marks and stuff like
0:55:03.811 --> 0:55:04.900
that,.
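(For illustration, a hedged sketch of the kind of regular-expression normalization meant here; the exact rules are made up and real preprocessing scripts are more extensive:)

import re

def normalize_punctuation(text):
    text = text.replace("\u201c", '"').replace("\u201d", '"')  # unify quotation marks
    text = text.replace("\u2019", "'")                          # unify apostrophes
    text = re.sub(r"\s+", " ", text).strip()                    # collapse whitespace
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)                # no space before punctuation
    return text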
0:55:11.011 --> 0:55:19.025
So if we have now the data of our next step
to build, the system is to represent our words.
0:55:19.639 --> 0:55:27.650
Before we start with this, any more questions
about preprocessing.
0:55:27.650 --> 0:55:32.672
While we work on the pure text, I'm sure.
0:55:33.453 --> 0:55:40.852
The idea is again to make things more simple
because if you think about the production mark
0:55:40.852 --> 0:55:48.252
at the beginning of a sentence, it might be
that you haven't seen the word or, for example,
0:55:48.252 --> 0:55:49.619
think of titles.
0:55:49.619 --> 0:55:56.153
In newspaper articles there's: So you then
have seen the word now in the title before,
0:55:56.153 --> 0:55:58.425
and the text you have never seen.
0:55:58.898 --> 0:56:03.147
But there is always the decision.
0:56:03.123 --> 0:56:09.097
Do I gain more because I've seen things more
often or do I lose because now I remove information
0:56:09.097 --> 0:56:11.252
which helps me to the same degree?
0:56:11.571 --> 0:56:21.771
Because if we, for example, do that in German
and remove the case, this might be an important
0:56:21.771 --> 0:56:22.531
issue.
0:56:22.842 --> 0:56:30.648
So there is not the perfect solution, but
generally you can get some arrows to make things
0:56:30.648 --> 0:56:32.277
look more similar.
0:56:35.295 --> 0:56:43.275
What you can do about products like the state
of the area or the trends that are more or
0:56:43.275 --> 0:56:43.813
less.
0:56:44.944 --> 0:56:50.193
It starts even less because models get more
powerful, so it's not that important, but be
0:56:50.193 --> 0:56:51.136
careful partly.
0:56:51.136 --> 0:56:56.326
It's also the evaluation thing because these
things which are problematic are happening
0:56:56.326 --> 0:56:57.092
very rarely.
0:56:57.092 --> 0:57:00.159
If you take average performance, it doesn't
matter.
0:57:00.340 --> 0:57:06.715
However, in between it's doing the stupid
mistakes that don't count on average, but they
0:57:06.715 --> 0:57:08.219
are not really good.
0:57:09.089 --> 0:57:15.118
Done you do some type of tokenization?
0:57:15.118 --> 0:57:19.911
You can do true casing or not.
0:57:19.911 --> 0:57:28.723
Some people nowadays don't do it, but that's
still done.
0:57:28.948 --> 0:57:34.441
Then it depends on who is a bit on the type
of domain.
0:57:34.441 --> 0:57:37.437
Again we have so translation.
0:57:37.717 --> 0:57:46.031
So in the text sometimes there is mark in
the menu, later the shortcut.
0:57:46.031 --> 0:57:49.957
This letter is used for shortcut.
0:57:49.957 --> 0:57:57.232
You cannot mistake the word because it's no
longer a file but.
0:57:58.018 --> 0:58:09.037
Then you cannot deal with it, so then it might
make sense to remove this.
0:58:12.032 --> 0:58:17.437
Now the next step is how to match words into
numbers.
0:58:17.437 --> 0:58:22.142
Machine learning models deal with some digits.
0:58:22.342 --> 0:58:27.091
The first idea is to use words as our basic
components.
0:58:27.247 --> 0:58:40.695
And then you have a large vocabulary where
each word gets referenced to an indigenous.
0:58:40.900 --> 0:58:49.059
So your sentence go home is now and that is
your set.
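(A minimal sketch of such a word-to-index mapping in Python; the vocabulary and sentences are made up:)

def build_vocab(sentences):
    vocab = {"<unk>": 0}                 # reserve an id for unknown words
    for sent in sentences:
        for word in sent.split():
            vocab.setdefault(word, len(vocab))
    return vocab

vocab = build_vocab(["I go home", "you go home"])
encoded = [vocab.get(w, vocab["<unk>"]) for w in "I go home".split()]
# encoded is now a short sequence of integers, e.g. [1, 2, 3]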
0:58:52.052 --> 0:59:00.811
So the nice thing is you have very short sequences
so that you can deal with them.
0:59:00.811 --> 0:59:01.867
However,.
0:59:01.982 --> 0:59:11.086
So you have not really understood how words
are processed.
0:59:11.086 --> 0:59:16.951
Why is this or can that be a problem?
0:59:17.497 --> 0:59:20.741
And there is an easy solution to deal with
unknown words.
0:59:20.741 --> 0:59:22.698
You just have one token, which is.
0:59:23.123 --> 0:59:25.906
Worrying in maybe some railroads in your training
day, do you deal?
0:59:26.206 --> 0:59:34.938
That's working a bit for some province, but
in general it's not good because you know nothing
0:59:34.938 --> 0:59:35.588
about.
0:59:35.895 --> 0:59:38.770
Can at least deal with this and maybe map
it.
0:59:38.770 --> 0:59:44.269
So an easy solution in machine translation
is always if it's an unknown word or we just
0:59:44.269 --> 0:59:49.642
copy it to the target side because unknown
words are often named entities and in many
0:59:49.642 --> 0:59:52.454
languages the good solution is just to keep.
0:59:53.013 --> 1:00:01.203
So that is somehow a trick, trick, but yeah,
that's of course not a good thing.
1:00:01.821 --> 1:00:08.959
It's also a problem if you deal with full
words is that you have very few examples for
1:00:08.959 --> 1:00:09.451
some.
1:00:09.949 --> 1:00:17.696
And of course if you've seen a word once you
can, someone may be translated, but we will
1:00:17.696 --> 1:00:24.050
learn that in your networks you represent words
with continuous vectors.
1:00:24.264 --> 1:00:26.591
You have seen them two, three or four times.
1:00:26.591 --> 1:00:31.246
It is not really well learned, and you are
typically doing most Arabs and words with your
1:00:31.246 --> 1:00:31.763
crow rap.
1:00:33.053 --> 1:00:40.543
And yeah, you cannot deal with things which
are inside the world.
1:00:40.543 --> 1:00:50.303
So if you know that houses set one hundred
and twelve and you see no houses, you have
1:00:50.303 --> 1:00:51.324
no idea.
1:00:51.931 --> 1:00:55.533
Of course, not really convenient, so humans
are better.
1:00:55.533 --> 1:00:58.042
They can use the internal information.
1:00:58.498 --> 1:01:04.080
So if we have houses you'll know that it's
like the bluer form of house.
1:01:05.285 --> 1:01:16.829
And for the ones who weren't in advance, ay,
you have this night worth here and guess.
1:01:16.716 --> 1:01:20.454
Don't know the meaning of these words.
1:01:20.454 --> 1:01:25.821
However, all of you will know is the fear
of something.
1:01:26.686 --> 1:01:39.437
From the ending, the phobia phobia is always
the fear of something, but you don't know how.
1:01:39.879 --> 1:01:46.618
So we can split words into some parts that
is helpful to deal with.
1:01:46.618 --> 1:01:49.888
This, for example, is a fear of.
1:01:50.450 --> 1:02:04.022
It's not very important, it's not how to happen
very often, but yeah, it's also not important
1:02:04.022 --> 1:02:10.374
for understanding that you know everything.
1:02:15.115 --> 1:02:18.791
So what can we do instead?
1:02:18.791 --> 1:02:29.685
One thing which we could do instead is to
represent words by the other extreme.
1:02:29.949 --> 1:02:42.900
So you really do like if you have a person's
eye and a and age, then you need a space symbol.
1:02:43.203 --> 1:02:55.875
So you have now a representation for each
character that enables you to implicitly learn
1:02:55.875 --> 1:03:01.143
morphology because words which have.
1:03:01.541 --> 1:03:05.517
And you can then deal with unknown words.
1:03:05.517 --> 1:03:10.344
There's still not everything you can process,
but.
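(For illustration, a character-level representation could be built like this; the space symbol is one possible convention:)

def to_characters(sentence):
    return ["<space>" if ch == " " else ch for ch in sentence]

# to_characters("go home") -> ['g', 'o', '<space>', 'h', 'o', 'm', 'e']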
1:03:11.851 --> 1:03:16.953
So if you would go on character level, what might
still be a problem?
1:03:18.598 --> 1:03:24.007
So all characters which you haven't seen,
but that's nowadays a little bit more often
1:03:24.007 --> 1:03:25.140
with new emoties.
1:03:25.140 --> 1:03:26.020
You couldn't.
1:03:26.020 --> 1:03:31.366
It could also be that you have translated
from Germany and German, and then there is
1:03:31.366 --> 1:03:35.077
a Japanese character or Chinese that you cannot
translate.
1:03:35.435 --> 1:03:43.938
But most of the time all directions occur
have been seen so that someone works very good.
1:03:44.464 --> 1:03:58.681
This is first a nice thing, so you have a
very small vocabulary size, so one big part
1:03:58.681 --> 1:04:01.987
of the calculation.
1:04:02.222 --> 1:04:11.960
Neural networks is the calculation of the
vocabulary size, so if you are efficient there
1:04:11.960 --> 1:04:13.382
it's better.
1:04:14.914 --> 1:04:26.998
On the other hand, the problem is you have
no very long sequences, so if you think about
1:04:26.998 --> 1:04:29.985
this before you have.
1:04:30.410 --> 1:04:43.535
Your computation often depends on your input
size and not only linear but quadratic going
1:04:43.535 --> 1:04:44.410
more.
1:04:44.504 --> 1:04:49.832
And of course it might also be that you just
generally make things more complicated than
1:04:49.832 --> 1:04:50.910
they were before.
1:04:50.951 --> 1:04:58.679
We said before make things easy, but now if
we really have to analyze each director independently,
1:04:58.679 --> 1:05:05.003
we cannot directly learn that university is
the same, but we have to learn that.
1:05:05.185 --> 1:05:12.179
Is beginning and then there is an I and then
there is an E and then all this together means
1:05:12.179 --> 1:05:17.273
university but another combination of these
letters is a complete.
1:05:17.677 --> 1:05:24.135
So of course you make everything here a lot
more complicated than you have on word basis.
1:05:24.744 --> 1:05:32.543
Character based models work very well in conditions
with few data because you have seen the words
1:05:32.543 --> 1:05:33.578
very rarely.
1:05:33.578 --> 1:05:38.751
It's not good to learn but you have seen all
letters more often.
1:05:38.751 --> 1:05:44.083
So if you have scenarios with very few data
this is like one good.
1:05:46.446 --> 1:05:59.668
The other idea is to split words without going to either extreme, so neither taking full words nor taking
1:05:59.668 --> 1:06:06.573
only characters, but doing something in between.
1:06:07.327 --> 1:06:12.909
And one of these ideas has been used for a long time.
1:06:12.909 --> 1:06:17.560
It's called compound splitting, where we only split compounds, like
1:06:17.477 --> 1:06:18.424
'Baumstamm'.
1:06:18.424 --> 1:06:24.831
You see that 'Baum' and 'Stamm' occur very often, maybe more often than 'Baumstamm'.
1:06:24.831 --> 1:06:28.180
Then you split it into 'Baum' and 'Stamm' and use those.
1:06:29.509 --> 1:06:44.165
But it's not even that easy: it will learn wrong splits. We did that in older systems, and
1:06:44.165 --> 1:06:47.708
there is, for example, a word like 'asiatisch'.
1:06:48.288 --> 1:06:56.137
And it gets split into 'Asia' and 'Tisch', which of course is not a really good way of splitting it, because it is not semantic.
1:06:56.676 --> 1:07:05.869
The good thing is we didn't really care that much about it, because the system simply learned what it means
1:07:05.869 --> 1:07:09.428
if you have 'Asia' and 'Tisch' together.
1:07:09.729 --> 1:07:17.452
So you can of course still learn all that; the compound split just doesn't really help you to get a deeper
1:07:17.452 --> 1:07:18.658
understanding.
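As a rough illustration of the frequency-based compound splitting just described, and of how a non-semantic split like 'Asia' plus 'Tisch' can come out of it, here is a minimal sketch; the function name, the geometric-mean scoring, and the toy frequencies are my own assumptions, not the exact method used in those systems.

# Minimal sketch of frequency-based compound splitting (illustrative only):
# split a word if its parts are seen more often than the whole word.
def split_compound(word, freq, min_len=3):
    best = (freq.get(word, 0), [word])               # score for keeping the word whole
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        if left in freq and right in freq:
            score = (freq[left] * freq[right]) ** 0.5  # geometric mean of part counts
            if score > best[0]:
                best = (score, [left, right])
    return best[1]

# toy corpus counts, lowercased just for this sketch
freq = {"baum": 500, "stamm": 300, "baumstamm": 20, "asia": 40, "tisch": 800}
print(split_compound("baumstamm", freq))   # ['baum', 'stamm']
print(split_compound("asiatisch", freq))   # ['asia', 'tisch'], the non-semantic split discussed above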
1:07:21.661 --> 1:07:23.364
The thing, of course, is that splitting can also go wrong.
1:07:23.943 --> 1:07:30.475
Yeah, there was one paper where they report how this can go wrong; it's called 'Burning
1:07:30.475 --> 1:07:30.972
Ducks.
1:07:30.972 --> 1:07:37.503
I think it was because, for a certain German word, the splitter could split off 'brannte',
1:07:37.503 --> 1:07:43.254
and since you sometimes have to add an 'e' to build compounds, it ended up with 'Ente brannte'.
1:07:43.583 --> 1:07:48.515
So it translated that word into 'burning duck'.
1:07:48.888 --> 1:07:56.127
So of course you can introduce some additional errors that way, but in general
1:07:56.127 --> 1:07:57.221
it's a good approach.
1:07:57.617 --> 1:08:03.306
Of course there is a trade-off here: you want to have a small vocabulary
1:08:03.306 --> 1:08:08.812
size, so that you've seen everything more often, but the length of the sequence should not be too
1:08:08.812 --> 1:08:13.654
long, because if you split more often you get fewer different types but longer sequences.
1:08:16.896 --> 1:08:25.281
The advantage of subword units compared to character-based models is that you can directly
1:08:25.281 --> 1:08:33.489
learn the representation for words that occur very often, while still being able to represent
1:08:33.489 --> 1:08:35.783
rare words by splitting them into parts.
1:08:36.176 --> 1:08:42.973
And while at first this was only done for compounds, nowadays there's an algorithm which really
1:08:42.973 --> 1:08:49.405
tries to do it on everything; there are different ways, to be honest, like linguistic compound splitting
1:08:49.405 --> 1:08:50.209
and so on.
1:08:50.209 --> 1:08:56.129
But the most successful one which is commonly
used is based on data compression.
1:08:56.476 --> 1:08:59.246
And there the idea is:
1:08:59.246 --> 1:09:06.765
can we find an encoding so that the text is compressed in the most efficient way?
1:09:07.027 --> 1:09:22.917
And the compression algorithm is called byte-pair encoding, and this is then also used
1:09:22.917 --> 1:09:25.625
for splitting words.
1:09:26.346 --> 1:09:39.164
And the idea is that we recursively replace the most frequent pair of bytes by a new byte.
1:09:39.819 --> 1:09:51.926
For language, you now first split all your words into letters, and then you look at what
1:09:51.926 --> 1:09:59.593
the most frequent bigram is, that is, which two letters occur together most often.
1:10:00.040 --> 1:10:04.896
And then you replace it and repeat until you have a fixed vocabulary size.
1:10:04.985 --> 1:10:08.031
So that's a nice thing.
1:10:08.031 --> 1:10:16.663
Now you can predefine your vocabulary, that is, how many symbols you want to use to represent your text.
1:10:16.936 --> 1:10:28.486
You set that beforehand, and then you can represent any text with these symbols, and of course the more symbols you allow, the shorter
1:10:28.486 --> 1:10:30.517
your text will be.
1:10:32.772 --> 1:10:36.543
So the original idea was something like that.
1:10:36.543 --> 1:10:39.411
We have the sequence A, B, A, B, C.
1:10:39.411 --> 1:10:45.149
For example, a common bigram is A, B, so you can replace every A, B by a new symbol, say D.
1:10:45.149 --> 1:10:46.788
Then the text gets shorter.
1:10:48.108 --> 1:10:53.615
Then you can merge again, and so on, and this is then your compressed text.
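A tiny sketch of this original compression view, assuming we simply replace the most frequent adjacent pair with a fresh placeholder symbol each round; the placeholder names and the function are illustrative choices of mine.

# Minimal sketch of the original byte-pair compression idea (illustrative only).
from collections import Counter

def compress(seq, n_merges):
    rules = []
    for new_symbol in range(n_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]          # most frequent adjacent pair
        merged = f"<{new_symbol}>"                    # fresh symbol standing for (a, b)
        rules.append(((a, b), merged))
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(merged); i += 2
            else:
                out.append(seq[i]); i += 1
        seq = out
    return seq, rules

print(compress(list("ABABCABD"), 2))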
1:10:54.514 --> 1:11:00.691
Similarly, we can do it now for tokenization.
1:11:01.761 --> 1:11:05.436
Let's assume you have these sentences.
1:11:05.436 --> 1:11:11.185
I go, he goes, she goes; so your vocabulary is go, goes, he, I and she.
1:11:11.851 --> 1:11:30.849
And the first thing you're doing is to split your corpus into single characters.
1:11:30.810 --> 1:11:34.692
And you need to be able to split the text into words again later, like splitting sentences into words.
1:11:34.692 --> 1:11:38.980
Because now you only have characters, you don't know the word boundaries anymore.
1:11:38.980 --> 1:11:44.194
You introduce the word boundaries by having
a special symbol at the end of each word, and
1:11:44.194 --> 1:11:46.222
then, wherever this symbol occurs,
1:11:46.222 --> 1:11:48.366
you can split and get the words back.
1:11:48.708 --> 1:11:55.245
So you have the corpus I go, he goes, and
she goes, and then you have now here the sequences
1:11:55.245 --> 1:11:56.229
of characters.
1:11:56.229 --> 1:12:02.625
So this is then the character-based representation, and now you calculate the bigram statistics.
1:12:02.625 --> 1:12:08.458
So 'I' plus the end-of-word symbol occurs one time, 'g' and 'o' occur together three times, and so on.
1:12:09.189 --> 1:12:18.732
And these are all the others, and now you look which pair is the most common one.
1:12:19.119 --> 1:12:26.046
So then you have now the first merge rule.
1:12:26.046 --> 1:12:39.235
If you have 'g' and 'o' together, you join them into the new symbol 'go': 'go' is now no longer two symbols, but
1:12:39.235 --> 1:12:41.738
one single symbol, because you have joined them.
1:12:42.402 --> 1:12:51.175
And then you have here now the new bigram counts, still with the end-of-word symbol and so on.
1:12:52.092 --> 1:13:01.753
In a small example like this you now have a lot of pairs which occur the same number of times.
1:13:01.753 --> 1:13:09.561
In reality that happens sometimes, but not that often.
1:13:10.370 --> 1:13:21.240
Here, for example, you merge 'he' with the end-of-word symbol, and this way you go on until you have your vocabulary size.
1:13:21.601 --> 1:13:38.242
And your vocabulary is then given by these rules, so people also speak of the merge rules as the vocabulary.
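To make the procedure concrete, here is a minimal sketch of that merge-learning loop on exactly this toy corpus; the '</w>' end-of-word marker, the function name, and the number of merges are illustrative choices in the style of the standard subword BPE algorithm, not necessarily the exact notation from the slides.

# Minimal sketch of learning BPE merge rules on the toy corpus above.
from collections import Counter

def learn_bpe(words, n_merges):
    # each word starts as a tuple of single characters plus an end-of-word marker
    corpus = Counter(tuple(w) + ("</w>",) for w in words)
    rules = []
    for _ in range(n_merges):
        pairs = Counter()
        for word, freq in corpus.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        rules.append(best)
        merged_corpus = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1]); i += 2
                else:
                    out.append(word[i]); i += 1
            merged_corpus[tuple(out)] += freq
        corpus = merged_corpus
    return rules

rules = learn_bpe(["I", "go", "he", "goes", "she", "goes"], n_merges=5)
print(rules)   # ('g', 'o') is learned first, then further merges among the equally frequent pairs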
1:13:38.658 --> 1:13:43.637
And these are the rules, and if you now get a different sentence, something like 'they tell',
1:13:44.184 --> 1:13:53.600
then your final output looks something like this.
1:13:53.600 --> 1:13:59.250
These two words are then represented by these subword units.
1:14:00.940 --> 1:14:06.398
And that is your algorithm.
1:14:06.398 --> 1:14:18.873
Now you can represent any type of text with
a fixed vocabulary.
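Applying the learned rules to new text is then just replaying the merges in the order they were learned; here is a sketch, where the rule list at the bottom is hypothetical, standing in for whatever rules came out of the learning step above.

# Minimal sketch of applying learned BPE merge rules to an unseen sentence.
def apply_bpe(sentence, rules):
    segmented = []
    for word in sentence.split():
        symbols = list(word) + ["</w>"]
        for a, b in rules:                       # replay merges in learned order
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    out.append(a + b); i += 2
                else:
                    out.append(symbols[i]); i += 1
            symbols = out
        segmented.extend(symbols)
    return segmented

# hypothetical merge rules, listed in the order they were learned
rules = [("g", "o"), ("e", "s"), ("es", "</w>"), ("h", "e"), ("go", "es</w>")]
print(apply_bpe("they tell", rules))
print(apply_bpe("she goes", rules))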
1:14:20.400 --> 1:14:23.593
So the vocabulary size, that's defined at the beginning?
1:14:23.593 --> 1:14:27.243
Like how many symbols you end up with, depending on how many merges have been done?
1:14:28.408 --> 1:14:35.253
It's nearly correct: it is the number of characters plus the number of merge rules.
1:14:35.253 --> 1:14:38.734
It can differ a bit, though.
1:14:38.878 --> 1:14:49.162
So on the one hand, all the right-hand sides of the rules can occur, and then additionally
1:14:49.162 --> 1:14:49.721
all the single characters.
1:14:49.809 --> 1:14:55.851
In reality it can even happen that your vocabulary is smaller, because it might
1:14:55.851 --> 1:15:01.960
happen that, for example, 'go' never occurs on its own in the end, because you always merge
1:15:01.960 --> 1:15:06.793
all its occurrences further, so not all right-hand sides really show up, because
1:15:06.746 --> 1:15:11.269
that rule is never applied on its own; afterwards another rule is always applied on top.
1:15:11.531 --> 1:15:15.621
So it's more an upper bound on your vocabulary size than an exact number.
1:15:20.480 --> 1:15:29.014
Then we come to the last part, which is about
parallel data, but we have some questions beforehand.
1:15:36.436 --> 1:15:38.824
So what is parallel data?
1:15:38.824 --> 1:15:47.368
So for machine translation it is really, really important that we are dealing with parallel
1:15:47.368 --> 1:15:52.054
data; that means we have aligned input and output.
1:15:52.054 --> 1:15:54.626
You need this type of data.
1:15:55.015 --> 1:16:01.773
However, in machine translation we have one very big advantage: the data is somewhat naturally
1:16:01.773 --> 1:16:07.255
occurring, so you have a lot of parallel data which you can gather somewhere.
1:16:07.255 --> 1:16:13.788
In many NLP tasks you need to manually annotate your data and generate the aligned data.
1:16:14.414 --> 1:16:22.540
Otherwise we would have to manually create translations, and of course that is very expensive; it's
1:16:22.540 --> 1:16:29.281
really expensive to pay for, like, one million sentences to be translated.
1:16:29.889 --> 1:16:36.952
The nice thing is that there is normally data available, because other people have already done the
1:16:36.952 --> 1:16:37.889
translation.
1:16:40.120 --> 1:16:44.672
So there is this data, and of course you have to process it.
1:16:44.672 --> 1:16:51.406
We'll have a full lecture on how to deal with
more complex situations.
1:16:52.032 --> 1:16:56.645
The idea is really that you don't do much human work.
1:16:56.645 --> 1:17:02.825
You really just start the crawler with some initial start pages, and then it runs on its own.
1:17:03.203 --> 1:17:07.953
But a lot of high-quality parallel data is really targeted at specific scenarios.
1:17:07.953 --> 1:17:13.987
So, for example, think of the European Parliament
as one website where you can easily extract
1:17:13.987 --> 1:17:17.581
this information from, and there you have a large amount of data.
1:17:17.937 --> 1:17:22.500
Or, like, we have the TED data, which you can also get from the TED website.
1:17:23.783 --> 1:17:33.555
So in general a parallel corpus is a collection of texts with translations into one or several other languages.
1:17:34.134 --> 1:17:42.269
And this data is important because normally there is no general MT system; rather, your system works best on what it was trained on.
1:17:42.222 --> 1:17:46.732
It works especially well if your training and test conditions are similar.
1:17:46.732 --> 1:17:50.460
So if the topic is similar, and the style or modality is similar.
1:17:50.460 --> 1:17:55.391
So if you want to translate speech, it's often better to also train on speech.
1:17:55.391 --> 1:17:58.818
If you want to translate text, it's better to train on text.
1:17:59.379 --> 1:18:08.457
And there is a lot of this data available nowadays for common languages.
1:18:08.457 --> 1:18:12.014
You can normally just start with that.
1:18:12.252 --> 1:18:15.298
It's really easily available.
1:18:15.298 --> 1:18:27.350
For example, Opus is a big website collecting different types of parallel corpora, where you
1:18:27.350 --> 1:18:29.601
can select them.
1:18:29.529 --> 1:18:33.276
You have this document alignment; we will come to that later.
1:18:33.553 --> 1:18:39.248
There are things like comparable data, where you don't have fully parallel sentences but only some parts
1:18:39.248 --> 1:18:40.062
that are parallel.
1:18:40.220 --> 1:18:48.700
But now let's first assume we have an easy task like the European Parliament, where we have the speech
1:18:48.700 --> 1:18:55.485
in German and the speech in English and you
need to generate parallel data.
1:18:55.485 --> 1:18:59.949
That means you have to align the source sentences with the target sentences.
1:19:00.000 --> 1:19:01.573
And you need to do this right.
1:19:05.905 --> 1:19:08.435
How can we do that?
1:19:08.435 --> 1:19:19.315
And that is what people refer to as sentence alignment, so we have parallel documents in
1:19:19.315 --> 1:19:20.707
two languages.
1:19:22.602 --> 1:19:32.076
You cannot normally do that word by word, because there is no direct correspondence
1:19:32.076 --> 1:19:34.158
between the words, but it is
1:19:34.074 --> 1:19:39.837
relatively well possible to do it on the sentence level; it will not be perfect, so you sometimes have
1:19:39.837 --> 1:19:42.535
two sentences in English and one in German.
1:19:42.535 --> 1:19:47.992
German likes to have these long sentences with sub-clauses and so on, so there you can do
1:19:47.992 --> 1:19:51.733
it, but with long sentences it might not be
really possible.
1:19:55.015 --> 1:19:59.454
And for some languages we saw that sentence markers aren't there, so it's more complicated.
1:19:59.819 --> 1:20:10.090
So how can we formalize this sentence alignment
problem?
1:20:10.090 --> 1:20:16.756
So we have a set of source sentences.
1:20:17.377 --> 1:20:22.167
This notation is used in machine translation relatively often.
1:20:22.167 --> 1:20:32.317
Sometimes the source sentences are nowadays called S and the target T, but traditionally it was F and E, because people
1:20:32.317 --> 1:20:34.027
started with French-to-English translation.
1:20:34.594 --> 1:20:45.625
And then the idea is to find this alignment, where each alignment links a group of source sentences to a group of target sentences.
1:20:46.306 --> 1:20:50.421
And of course you want these segments to be as short as possible.
1:20:50.421 --> 1:20:56.400
Of course an easy solution would be: here are all my source sentences and here are all my target sentences, and they are aligned as a whole.
1:20:56.756 --> 1:21:07.558
So you want to have short segments there, typically one sentence or at most two or three sentences,
1:21:07.558 --> 1:21:09.340
so that it is really useful.
1:21:13.913 --> 1:21:21.479
Then there are different restrictions on this type of alignment, so first of all
1:21:21.479 --> 1:21:29.131
it should be a monotone alignment; that means that each segment on the source side should
1:21:29.131 --> 1:21:31.218
start after the previous one.
1:21:31.431 --> 1:21:36.428
So we assume that in the document the order is really monotone, and it goes the same way in source and target.
1:21:36.957 --> 1:21:41.965
Of course, for a very free translation that might not be valid anymore.
1:21:41.965 --> 1:21:49.331
But this algorithm, the first one being the Gale and Church algorithm, is meant for translations
1:21:49.331 --> 1:21:51.025
which are very direct.
1:21:51.025 --> 1:21:54.708
So each segment should come right after the previous one.
1:21:55.115 --> 1:22:04.117
Then we want to cover the full sequence, and of course each segment should start before
1:22:04.117 --> 1:22:04.802
it ends.
1:22:05.525 --> 1:22:22.654
And then you want to have something like this, where you have one-to-one alignments or other alignment types.
1:22:25.525 --> 1:22:41.851
The alignment types are: one-to-one, of course, but then sometimes also insertions and deletions, where there
1:22:41.851 --> 1:22:43.858
is some information added or left out.
1:22:44.224 --> 1:22:50.412
An insertion can be, for example, an explanation: it can be that some term is known in the one language
1:22:50.412 --> 1:22:51.018
but not in the other.
1:22:51.111 --> 1:22:53.724
Think of things like Deutschland ticket.
1:22:53.724 --> 1:22:58.187
In Germany everybody will by now know what
the Deutschland ticket is.
1:22:58.187 --> 1:23:03.797
But if you translate it into English it might be important to explain it, and other things
1:23:03.797 --> 1:23:04.116
are similar.
1:23:04.116 --> 1:23:09.853
So sometimes you have to explain things and
then you have more sentences with insertions.
1:23:10.410 --> 1:23:15.956
Then you have two-to-one and one-to-two alignments; that is, for example, in German you have
1:23:15.956 --> 1:23:19.616
a lot of sub-clauses, and these are often expressed by two sentences on the other side.
1:23:20.580 --> 1:23:37.725
Of course, it might be more complex, but typically we make it simple and only allow for these types
1:23:37.725 --> 1:23:40.174
of alignment.
1:23:41.301 --> 1:23:56.588
Then it is about finding the alignment, and for that we try to score each candidate, where we just take
1:23:56.588 --> 1:23:59.575
a general score.
1:24:00.000 --> 1:24:04.011
That is done in the Gale and Church algorithm by scoring the matching of each segment pair.
1:24:04.011 --> 1:24:09.279
So this is one of the global assumptions: the global alignment
1:24:09.279 --> 1:24:13.828
is as good as the product of all the single steps, and then you have two scores per step.
1:24:13.828 --> 1:24:18.558
First of all, you say one-to-one alignments are much more likely than all the others.
1:24:19.059 --> 1:24:26.884
And then you have a lexical similarity, which
is, for example, based on an initial dictionary
1:24:26.884 --> 1:24:30.713
which counts how many dictionary entries match between the two segments.
1:24:31.091 --> 1:24:35.407
So this is a very simple algorithm.
1:24:35.407 --> 1:24:41.881
Typically you apply it as a first step when you want to build parallel data.
1:24:43.303 --> 1:24:54.454
And with this one you can get an initial alignment, from which you can then build better parallel
1:24:54.454 --> 1:24:55.223
data.
1:24:55.675 --> 1:25:02.369
Now, it is an optimization problem: based on the scores, you can calculate
1:25:02.369 --> 1:25:07.541
a score for each possible alignment and then select the best one.
1:25:07.541 --> 1:25:14.386
Of course, you won't try out all possibilities, but you can do an efficient search and then find
1:25:14.386 --> 1:25:15.451
the best one.
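To illustrate that search, here is a very reduced dynamic-programming sketch in the spirit of the Gale and Church algorithm; the prior probabilities, the length-ratio match score, and the example sentences are illustrative stand-ins of mine (the lecture's version uses a dictionary-based lexical similarity instead of sentence lengths).

# Very reduced sketch of monotone sentence alignment by dynamic programming.
import math

# illustrative priors over alignment types (1-1, deletion, insertion, 2-1, 1-2)
PRIOR = {(1, 1): 0.89, (1, 0): 0.01, (0, 1): 0.01, (2, 1): 0.045, (1, 2): 0.045}

def match_score(src_segment, tgt_segment):
    """Crude stand-in score: how similar are the character lengths of both sides?"""
    ls = sum(len(s) for s in src_segment) + 1
    lt = sum(len(t) for t in tgt_segment) + 1
    return min(ls, lt) / max(ls, lt)

def align(src, tgt):
    # best[i][j] = best log-score for aligning src[:i] with tgt[:j]
    best = [[-math.inf] * (len(tgt) + 1) for _ in range(len(src) + 1)]
    back = [[None] * (len(tgt) + 1) for _ in range(len(src) + 1)]
    best[0][0] = 0.0
    for i in range(len(src) + 1):
        for j in range(len(tgt) + 1):
            if best[i][j] == -math.inf:
                continue
            for (di, dj), prior in PRIOR.items():
                ni, nj = i + di, j + dj
                if ni > len(src) or nj > len(tgt):
                    continue
                score = best[i][j] + math.log(prior) + math.log(
                    match_score(src[i:ni], tgt[j:nj]) + 1e-9)
                if score > best[ni][nj]:
                    best[ni][nj] = score
                    back[ni][nj] = (i, j)
    # trace back the best monotone alignment
    alignment, (i, j) = [], (len(src), len(tgt))
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        alignment.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return list(reversed(alignment))

src = ["Das ist ein Satz.", "Er ist lang, weil er einen Nebensatz hat."]
tgt = ["This is a sentence.", "It is long.", "It has a subordinate clause."]
for pair in align(src, tgt):
    print(pair)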
1:25:15.815 --> 1:25:18.726
This can typically be done automatically.
1:25:18.726 --> 1:25:25.456
Of course, you should do some checks, like how many of the sentences could actually be aligned.
1:25:26.766 --> 1:25:32.043
At least for training data it is typically done this way.
1:25:32.043 --> 1:25:35.045
Maybe if you have test data you would check it manually.
1:25:40.000 --> 1:25:47.323
Sorry, I'm a bit over time because I originally wanted to do a quiz at the end.
1:25:47.323 --> 1:25:49.129
Can we still do the quiz?
1:25:49.429 --> 1:25:51.833
We'll do it some other time.
1:25:51.833 --> 1:25:56.813
We had a bachelor project about making quizzes for lectures.
1:25:56.813 --> 1:25:59.217
And I still want to try it.
1:25:59.217 --> 1:26:04.197
So let's see I hope in some other lecture
we can do that.
1:26:04.197 --> 1:26:09.435
Then we can do a quiz about it at the end of the lecture.
1:26:09.609 --> 1:26:13.081
All we can do is the practical thing; let's see.
1:26:13.533 --> 1:26:24.719
And that's it for today. So what you should remember is what parallel data is and how we can
1:26:25.045 --> 1:26:29.553
create parallel data, and how to generally preprocess data.
1:26:29.553 --> 1:26:36.435
Thinking about the data is really important if you build systems, and there are different ways to represent it:
1:26:36.696 --> 1:26:46.857
the three main options are full words, going directly to the character level, or using subword units.
1:26:47.687 --> 1:26:49.634
Are there any questions?
1:26:52.192 --> 1:26:57.768
Yes: is this alignment thing calculated like dynamic time warping?
1:27:00.000 --> 1:27:05.761
It's not directly using dynamic time warping, but the idea is similar, and you can use all
1:27:05.761 --> 1:27:11.771
these types of similar algorithms; the main thing, which is the difficult question,
1:27:11.771 --> 1:27:14.807
is to define your loss function here.
1:27:14.807 --> 1:27:16.418
What is a good alignment?
1:27:16.736 --> 1:27:24.115
But as in dynamic time warping, you have a monotone alignment in there, and you
1:27:24.115 --> 1:27:26.150
cannot have reordering.
1:27:30.770 --> 1:27:40.121
Then thanks a lot, and on Thursday we will continue from there.