WEBVTT
0:00:02.822 --> 0:00:07.880 | |
We look into more linguistic approaches. | |
0:00:07.880 --> 0:00:14.912 | |
We can do machine translation in a more traditional | |
way. | |
0:00:14.912 --> 0:00:21.224 | |
It should be: Translation should be generated | |
this way. | |
0:00:21.224 --> 0:00:27.933 | |
We first analyze the source sentence, what is the meaning or the syntax.
0:00:27.933 --> 0:00:35.185 | |
Then we transfer this information to the target side and then we generate.
0:00:36.556 --> 0:00:42.341 | |
And this was the strong and common used approach | |
for yeah several years. | |
0:00:44.024 --> 0:00:50.839 | |
However, we already saw at the beginning that there are some challenges with that: language is very
0:00:50.839 --> 0:00:57.232 | |
ambiguous, and it's often very difficult to really write hand-coded rules.
0:00:57.232 --> 0:01:05.336 | |
What are the different meanings and we have | |
to do that also with a living language so new | |
0:01:05.336 --> 0:01:06.596 | |
things occur. | |
0:01:07.007 --> 0:01:09.308 | |
And that's why people look into. | |
0:01:09.308 --> 0:01:13.282 | |
Can we maybe do it differently and use machine | |
learning? | |
0:01:13.333 --> 0:01:24.849 | |
So we are no longer writing rules for how to do it; we just give examples and the system learns from them.
0:01:25.045 --> 0:01:34.836 | |
And one important thing then is these examples: | |
how can we learn how to translate one sentence? | |
0:01:35.635 --> 0:01:42.516 | |
And therefore these yeah, the data is now | |
really a very important issue. | |
0:01:42.582 --> 0:01:50.021 | |
And that is what we want to look into today. | |
0:01:50.021 --> 0:01:58.783 | |
What type of data do we use for machine translation? | |
0:01:59.019 --> 0:02:08.674 | |
So the idea in preprocessing is always: can we make the task somehow a bit easier so that
0:02:08.674 --> 0:02:13.180 | |
the MT system will in the end be better?
0:02:13.493 --> 0:02:28.309 | |
So one example could be if it has problems | |
dealing with numbers because they are occurring. | |
0:02:28.648 --> 0:02:35.479 | |
Or think about one problem which might still be there in some systems: think about
0:02:35.479 --> 0:02:36.333 | |
different units.
0:02:36.656 --> 0:02:44.897 | |
So a system might learn that, of course, if there's a number on the German side, in English there should be the same number.
0:02:45.365 --> 0:02:52.270 | |
However, if it's trained on parallel text, it will see that in German there is often km, and in English
0:02:52.270 --> 0:02:54.107 | |
typically miles.
0:02:54.594 --> 0:03:00.607 | |
It might then just translate three hundred and fifty-five miles into three hundred and fifty-five
0:03:00.607 --> 0:03:04.348 | |
kilometers, which of course is not right, and | |
so forth. | |
0:03:04.348 --> 0:03:06.953 | |
So it might make sense to look into this.
0:03:07.067 --> 0:03:13.072 | |
Therefore, the first step when you build your machine translation system is normally to look
0:03:13.072 --> 0:03:19.077 | |
at the data, to check it, to see if there is | |
anything happening which you should address | |
0:03:19.077 --> 0:03:19.887 | |
beforehand. | |
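As a small illustration of such a data check, here is a minimal Python sketch (the file names and the language pair are hypothetical) that flags sentence pairs whose numbers do not match, which is exactly what happens when units like kilometers and miles were converted rather than translated literally:

```python
import re

def numbers(text):
    """Return the sorted list of numbers appearing in a sentence."""
    return sorted(re.findall(r"\d+(?:[.,]\d+)?", text))

def find_suspicious_pairs(src_lines, tgt_lines):
    """Yield sentence pairs whose source and target numbers differ."""
    suspicious = []
    for i, (src, tgt) in enumerate(zip(src_lines, tgt_lines)):
        if numbers(src) != numbers(tgt):
            suspicious.append((i, src.strip(), tgt.strip()))
    return suspicious

if __name__ == "__main__":
    with open("train.de", encoding="utf-8") as f_src, \
         open("train.en", encoding="utf-8") as f_tgt:
        for idx, src, tgt in find_suspicious_pairs(f_src, f_tgt)[:20]:
            print(f"line {idx}: {src}  |||  {tgt}")
```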
0:03:20.360 --> 0:03:29.152 | |
And then the second part is how you represent words, since machine learning normally works on numbers.
0:03:29.109 --> 0:03:35.404 | |
So the question is how do we get out from | |
the words into numbers and I've seen some of | |
0:03:35.404 --> 0:03:35.766 | |
you? | |
0:03:35.766 --> 0:03:42.568 | |
For example, in advance there we have introduced | |
to an algorithm which we also shortly repeat | |
0:03:42.568 --> 0:03:43.075 | |
today. | |
0:03:43.303 --> 0:03:53.842 | |
The subword unit approach, which was first introduced in machine translation and is now used
0:03:53.842 --> 0:04:05.271 | |
everywhere in order to represent words. Now you've learned about morphology, so you know that maybe in
0:04:05.271 --> 0:04:09.270 | |
English it's not that important. | |
0:04:09.429 --> 0:04:22.485 | |
In German you have all these different word forms, and you would have to learn an independent representation for each of them.
0:04:24.024 --> 0:04:26.031 | |
And then, of course, they are more extreme. | |
0:04:27.807 --> 0:04:34.387 | |
So how are we doing? | |
0:04:34.975 --> 0:04:37.099 | |
Machine translation. | |
0:04:37.099 --> 0:04:46.202 | |
So hopefully you remember we had these approaches | |
to machine translation, the rule based. | |
0:04:46.202 --> 0:04:52.473 | |
We had a big block of corpus-based machine translation.
0:04:52.492 --> 0:05:00.443 | |
We will on Thursday have an overview of statistical models and then afterwards concentrate on the neural ones.
0:05:00.680 --> 0:05:08.828 | |
Both of them are corpus-based machine translation, and therefore what is really essential, and what
0:05:08.828 --> 0:05:16.640 | |
we typically use to train a machine translation system, is what we refer to as parallel data.
0:05:16.957 --> 0:05:22.395 | |
We'll talk a lot about parallel corpora or parallel data, and what I mean there is something which you
0:05:22.395 --> 0:05:28.257 | |
might know from the Rosetta Stone or something like that: typically you have one sentence
0:05:28.257 --> 0:05:33.273 | |
in the one language, and then you have aligned to it one sentence in the target language.
0:05:33.833 --> 0:05:38.261 | |
And this is how we train all our alignments. | |
0:05:38.261 --> 0:05:43.181 | |
We'll see today that of course we might not | |
have. | |
0:05:43.723 --> 0:05:51.279 | |
However, this is relatively easy to create, at least for high-quality data.
0:05:51.279 --> 0:06:00.933 | |
We'll also look into data crawling, that means how we can automatically create this parallel data
0:06:00.933 --> 0:06:02.927 | |
from the Internet. | |
0:06:04.144 --> 0:06:13.850 | |
It's not so difficult to learn these alignments | |
if we have some type of dictionary, so which | |
0:06:13.850 --> 0:06:16.981 | |
sentence is aligned to which. | |
0:06:18.718 --> 0:06:25.069 | |
What would, of course, be a lot more difficult is really the word alignment, and that's also
0:06:25.069 --> 0:06:27.476 | |
often no longer possible with good quality.
0:06:27.476 --> 0:06:33.360 | |
We do that automatically in some yes for symbols, | |
but it's definitely more challenging. | |
0:06:33.733 --> 0:06:40.691 | |
For sentence alignment, of course, it's still | |
not always perfect, so there might be that | |
0:06:40.691 --> 0:06:46.085 | |
there are two German sentences and one English sentence or the other way around.
0:06:46.085 --> 0:06:53.511 | |
So there's not always a perfect alignment, but if you look at text, it still works relatively well.
0:06:54.014 --> 0:07:03.862 | |
If we have that, then we can build a machine learning model which tries to map source
0:07:03.862 --> 0:07:06.239 | |
sentences to target sentences.
0:07:06.626 --> 0:07:15.932 | |
So this is the idea behind statistical machine translation and neural machine translation.
0:07:15.932 --> 0:07:27.098 | |
The difference is: Statistical machine translation | |
is typically a whole box of different models | |
0:07:27.098 --> 0:07:30.205 | |
which try to evaluate how good a translation is.
0:07:30.510 --> 0:07:42.798 | |
In neural machine translation, it's all one large neural network where we use the source sentence
0:07:42.798 --> 0:07:43.667 | |
as input.
0:07:44.584 --> 0:07:50.971 | |
And then we can train it by having exactly this mapping from our parallel data.
0:07:54.214 --> 0:08:02.964 | |
So what we want today to look at today is | |
we want to first look at general text data. | |
0:08:03.083 --> 0:08:06.250 | |
So what is text data? | |
0:08:06.250 --> 0:08:09.850 | |
What text data is there? | |
0:08:09.850 --> 0:08:18.202 | |
Why is it challenging so that we have large | |
vocabularies? | |
0:08:18.378 --> 0:08:22.003 | |
It's so that you always have words which you | |
haven't seen. | |
0:08:22.142 --> 0:08:29.053 | |
If you increase your corpus size, normally you will also increase your vocabulary, so you
0:08:29.053 --> 0:08:30.744 | |
always find new words. | |
0:08:31.811 --> 0:08:39.738 | |
Then based on that we'll look into pre-processing. | |
0:08:39.738 --> 0:08:45.333 | |
So how can we pre-process our data? | |
0:08:45.333 --> 0:08:46.421 | |
Maybe. | |
0:08:46.526 --> 0:08:54.788 | |
This is a lot about tokenization, for example, | |
which we heard is not so challenging in European | |
0:08:54.788 --> 0:09:02.534 | |
languages but still important, but might be | |
really difficult in Asian languages where you | |
0:09:02.534 --> 0:09:05.030 | |
don't have space separation. | |
0:09:05.986 --> 0:09:12.161 | |
And this preprocessing typically tries to deal with the extreme cases where you have rarely
0:09:12.161 --> 0:09:13.105 | |
seen things. | |
0:09:13.353 --> 0:09:25.091 | |
If you have seen your words one hundred times, it doesn't really matter if you have
0:09:25.091 --> 0:09:31.221 | |
seen them with or without punctuation or so.
0:09:31.651 --> 0:09:38.578 | |
And then we look into word representation, | |
so what is the best way to represent a word? | |
0:09:38.578 --> 0:09:45.584 | |
And finally, we look into the other type of | |
data we really need for machine translation. | |
0:09:45.725 --> 0:09:56.842 | |
So at first there is the parallel data, and later we can also use purely monolingual data
0:09:56.842 --> 0:10:00.465 | |
to improve machine translation.
0:10:00.660 --> 0:10:03.187 | |
So then the traditional approach was that | |
it was easier. | |
0:10:03.483 --> 0:10:08.697 | |
We have this type of language model which | |
we can train only on the target data to make | |
0:10:08.697 --> 0:10:12.173 | |
the text more fluent. In a neural machine translation model,
0:10:12.173 --> 0:10:18.106 | |
It's partly a bit more complicated to integrate | |
this data but still it's very important especially | |
0:10:18.106 --> 0:10:22.362 | |
if you think about low-resource languages where you have very little data.
0:10:23.603 --> 0:10:26.999 | |
It's harder to get parallel data than you | |
get monolingual data. | |
0:10:27.347 --> 0:10:33.821 | |
Because monolingual data you just have out | |
there not huge amounts for some languages, | |
0:10:33.821 --> 0:10:38.113 | |
but definitely the amount of data is always | |
significant. | |
0:10:40.940 --> 0:10:50.454 | |
When we talk about data, it's also of course | |
important how we use it for machine learning. | |
0:10:50.530 --> 0:11:05.867 | |
And that you hopefully learned in some prior class: typically we separate our data into
0:11:05.867 --> 0:11:17.848 | |
three chunks. The first is the training data: this is really by far the largest, and it grows with the data we get.
0:11:17.848 --> 0:11:21.387 | |
Today we have millions of sentences here.
0:11:22.222 --> 0:11:27.320 | |
Then we have our validation data, and that is to tune some type of hyperparameters.
0:11:27.320 --> 0:11:33.129 | |
So not only you have some things to configure | |
and you don't know what is the right value, | |
0:11:33.129 --> 0:11:39.067 | |
so what you can do is train a model and change | |
these a bit and try to find the best ones on | |
0:11:39.067 --> 0:11:40.164 | |
your validation data.
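A minimal sketch of such a split, assuming the parallel corpus is stored as two plain-text files with one sentence per line (the file names and split sizes are made up); a purely random split is only a baseline, since, as discussed later, the test set should be representative of the use case:

```python
import random

def split_corpus(pairs, valid_size=2000, test_size=2000, seed=42):
    """Split sentence pairs into training, validation and test sets."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    test = pairs[:test_size]
    valid = pairs[test_size:test_size + valid_size]
    train = pairs[test_size + valid_size:]      # by far the largest chunk
    return train, valid, test

with open("corpus.de", encoding="utf-8") as f_de, \
     open("corpus.en", encoding="utf-8") as f_en:
    pairs = list(zip(f_de, f_en))

train, valid, test = split_corpus(pairs)
print(len(train), len(valid), len(test))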
0:11:40.700 --> 0:11:48.531 | |
For a statistical model, for example, the validation data is what you want to use if you have several
0:11:48.531 --> 0:11:54.664 | |
models: you need to know how to combine them, so how much weight should you put on the different
0:11:54.664 --> 0:11:55.186 | |
models? | |
0:11:55.186 --> 0:11:59.301 | |
And if it's like twenty models, it's only twenty parameters.
0:11:59.301 --> 0:12:02.828 | |
It's not that much, so that can still be reliably estimated.
0:12:03.183 --> 0:12:18.964 | |
In neural models there's often the question how long you should train the model before you get
0:12:18.964 --> 0:12:21.322 | |
overfitting. | |
0:12:22.902 --> 0:12:28.679 | |
And then you have your test data, which is | |
finally where you report on your test. | |
0:12:29.009 --> 0:12:33.663 | |
And therefore it's also important that from | |
time to time you get new test data because | |
0:12:33.663 --> 0:12:38.423 | |
if you always run your experiments, test on it, and then do new experiments
0:12:38.423 --> 0:12:43.452 | |
and test again, at some point you have tested so many things on it that you do some type of training
0:12:43.452 --> 0:12:48.373 | |
on your test data again, because you just select the things which are in the end best on your
0:12:48.373 --> 0:12:48.962 | |
test data. | |
0:12:49.009 --> 0:12:54.755 | |
It's important to get a new test data from | |
time to time, for example in important evaluation | |
0:12:54.755 --> 0:12:58.340 | |
campaigns for machine translation and speech | |
translation. | |
0:12:58.618 --> 0:13:07.459 | |
There, every year a new test set is created so we can see if the models really
0:13:07.459 --> 0:13:09.761 | |
get better on new data.
0:13:10.951 --> 0:13:19.629 | |
And of course it is important that this is representative of the use case you are interested in.
0:13:19.879 --> 0:13:36.511 | |
So if you're building a system for translating websites, your test data should also be websites.
0:13:36.816 --> 0:13:39.356 | |
So normally a system is good on some tasks. | |
0:13:40.780 --> 0:13:48.596 | |
Or it should solve everything, and then your test data should be drawn from everything, because if
0:13:48.596 --> 0:13:54.102 | |
you only have a very small subset, you only know that it's good on this subset.
0:13:54.394 --> 0:14:02.714 | |
Therefore, the selection of your test data | |
is really important in order to ensure that | |
0:14:02.714 --> 0:14:05.200 | |
the MT system in the end does what you want.
0:14:05.525 --> 0:14:12.646 | |
Maybe it is the greatest system ever when you have evaluated it on translating the Bible,
0:14:12.646 --> 0:14:21.830 | |
but the use case is to translate some Twitter data, and you can imagine the performance might
0:14:21.830 --> 0:14:22.965 | |
be really different there.
0:14:23.803 --> 0:14:25.471 | |
And finally:
0:14:25.471 --> 0:14:35.478 | |
Of course, in order to have a realistic evaluation, it's important that there's no
0:14:35.478 --> 0:14:39.370 | |
overlap between these data sets, because:
0:14:39.799 --> 0:14:51.615 | |
The danger might be that the system is learning by heart how to translate the sentences from your
0:14:51.615 --> 0:14:53.584 | |
training data. | |
0:14:54.194 --> 0:15:04.430 | |
So the test data should be really different from your training data.
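A small sketch of this overlap check, simply testing whether any test sentence also appears verbatim in the training data (file names are hypothetical):

```python
def load_sentences(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

train = set(load_sentences("train.de"))
test = load_sentences("test.de")

overlap = [s for s in test if s in train]
print(f"{len(overlap)} of {len(test)} test sentences also occur in the training data")
```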
0:15:04.430 --> 0:15:16.811 | |
Therefore, it's important to check that. So what type of data do we have?
0:15:16.811 --> 0:15:24.966 | |
There's a lot of different text data and the | |
nice thing is with digitalization. | |
0:15:25.345 --> 0:15:31.785 | |
You might think there's a large amount with | |
books, but to be honest books and printed things | |
0:15:31.785 --> 0:15:35.524 | |
that's by now a minor percentage of the data | |
we have. | |
0:15:35.815 --> 0:15:39.947 | |
There's like so much data created every day | |
on the Internet. | |
0:15:39.980 --> 0:15:46.223 | |
With social media and all the other types. | |
0:15:46.223 --> 0:15:56.821 | |
This of course is a largest amount of data, | |
more of colloquial language. | |
0:15:56.856 --> 0:16:02.609 | |
It might be more noisy and harder to process, | |
so there is a whole area on how to deal with | |
0:16:02.609 --> 0:16:04.948 | |
more social media and outdoor stuff. | |
0:16:07.347 --> 0:16:20.702 | |
What type of data is there if you think about parallel data? News data, official sites, and so on.
0:16:20.900 --> 0:16:26.629 | |
So the first parallel corpora were things like the European Parliament proceedings or some news
0:16:26.629 --> 0:16:27.069 | |
sites. | |
0:16:27.227 --> 0:16:32.888 | |
Nowadays there's quite a large amount of data | |
crawled from the Internet, but of course if | |
0:16:32.888 --> 0:16:38.613 | |
you crawl parallel data from the Internet, | |
a lot of the data is also like company websites | |
0:16:38.613 --> 0:16:41.884 | |
or so which gets translated into several languages. | |
0:16:45.365 --> 0:17:00.613 | |
Then, of course, there is different levels | |
of text and we have to look at what level we | |
0:17:00.613 --> 0:17:05.118 | |
want to process our data. | |
0:17:05.885 --> 0:17:16.140 | |
It one normally doesn't make sense to work | |
on full sentences because a lot of sentences | |
0:17:16.140 --> 0:17:22.899 | |
have never been seen and you always create | |
new sentences. | |
0:17:23.283 --> 0:17:37.421 | |
So typically what we take as our basic unit is something between words and letters, and that
0:17:37.421 --> 0:17:40.033 | |
is an essential decision.
0:17:40.400 --> 0:17:47.873 | |
So we need some of these atomic blocks or basic blocks which we can't make smaller.
0:17:48.128 --> 0:17:55.987 | |
So if we're building a sentence, for example, | |
you can build it out of something and you can | |
0:17:55.987 --> 0:17:57.268 | |
either decide. | |
0:17:57.268 --> 0:18:01.967 | |
For example, you take words and you spit them | |
further. | |
0:18:03.683 --> 0:18:10.178 | |
Then, of course, the nice thing is not too | |
small and therefore building larger things | |
0:18:10.178 --> 0:18:11.386 | |
like sentences. | |
0:18:11.831 --> 0:18:16.690 | |
So you only have to take your vocabulary and | |
put it somewhere together to get your full | |
0:18:16.690 --> 0:18:17.132 | |
sentence.
0:18:19.659 --> 0:18:27.670 | |
However, if the blocks are too large, they don't occur often enough, and you have more blocks
0:18:27.670 --> 0:18:28.715 | |
that occur only rarely.
0:18:29.249 --> 0:18:34.400 | |
And that's why we can work with smaller blocks like subword units.
0:18:34.714 --> 0:18:38.183 | |
Work with neural models. | |
0:18:38.183 --> 0:18:50.533 | |
Then you can work on letters so you have a | |
system which tries to understand the sentence | |
0:18:50.533 --> 0:18:53.031 | |
letter by letter. | |
0:18:53.313 --> 0:18:57.608 | |
But that is a design decision which you have | |
to take at some point. | |
0:18:57.608 --> 0:19:03.292 | |
On which level do you want to split your text, and what are the basic blocks that you are
0:19:03.292 --> 0:19:04.176 | |
working with? | |
0:19:04.176 --> 0:19:06.955 | |
And that's something we'll look into today. | |
0:19:06.955 --> 0:19:08.471 | |
What possibilities are? | |
0:19:12.572 --> 0:19:14.189 | |
Any question. | |
0:19:17.998 --> 0:19:24.456 | |
Then let's look a bit on what type of data | |
there is in how much data there is to person. | |
0:19:24.824 --> 0:19:34.006 | |
The point is that nowadays, at least for pure text, data is no longer the limit for some languages.
0:19:34.006 --> 0:19:38.959 | |
There is so much data that we cannot even process all of it.
0:19:39.479 --> 0:19:49.384 | |
That is only true for some languages, but | |
there is also interest in other languages and | |
0:19:49.384 --> 0:19:50.622 | |
important. | |
0:19:50.810 --> 0:20:01.483 | |
So if you want to build a system for Sweden | |
or for some dialect in other countries, then | |
0:20:01.483 --> 0:20:02.802 | |
of course. | |
0:20:03.103 --> 0:20:06.888 | |
Otherwise you have this huge amount of hair. | |
0:20:06.888 --> 0:20:11.515 | |
We are often no longer taking about gigabytes | |
or more. | |
0:20:11.891 --> 0:20:35.788 | |
The amount of information that is produced every year is enormous. And this is like all the information
0:20:35.788 --> 0:20:40.661 | |
that is available, so there is really a lot.
0:20:41.001 --> 0:20:44.129 | |
We look at machine translation. | |
0:20:44.129 --> 0:20:53.027 | |
We can see these numbers are really like more | |
than ten years old, but we see this increase | |
0:20:53.027 --> 0:20:58.796 | |
in data size: one billion words of English data we had at that time.
0:20:59.019 --> 0:21:01.955 | |
Then I wore like new shuffle on Google Maps | |
and stuff. | |
0:21:02.382 --> 0:21:05.003 | |
For this one you could train your system on. | |
0:21:05.805 --> 0:21:20.457 | |
And the interesting thing is this one billion | |
words is more than any human typically speaks. | |
0:21:21.001 --> 0:21:25.892 | |
So these systems see by now an order of magnitude more data.
0:21:25.892 --> 0:21:32.465 | |
I think an order of magnitude more data than a human has ever seen in their
0:21:32.465 --> 0:21:33.229 | |
lifetime. | |
0:21:35.175 --> 0:21:41.808 | |
And that is maybe the interesting question: why do they still make errors, given how much data they have
0:21:41.808 --> 0:21:42.637 | |
seen.
0:21:43.103 --> 0:21:48.745 | |
So we are seeing a really impressive result, | |
but in most cases it's not that they're really | |
0:21:48.745 --> 0:21:49.911 | |
better than human. | |
0:21:50.170 --> 0:21:56.852 | |
However, they really have seen more data than any human ever has seen in their lifetime.
0:21:57.197 --> 0:22:01.468 | |
They can just process so much data, so. | |
0:22:01.501 --> 0:22:08.425 | |
The question is, can we make them more efficient so that they can learn similarly well without
0:22:08.425 --> 0:22:09.592 | |
that much data? | |
0:22:09.592 --> 0:22:16.443 | |
And that is essential if we now go to low-resource languages where we might never get that much
0:22:16.443 --> 0:22:21.254 | |
data, and we should also be able to achieve a reasonable performance there.
0:22:23.303 --> 0:22:32.399 | |
On the other hand, this of course links also | |
to one topic which we will cover later: If | |
0:22:32.399 --> 0:22:37.965 | |
you think about this, it's really important | |
that your algorithms are also very efficient | |
0:22:37.965 --> 0:22:41.280 | |
in order to process that much data both in | |
training. | |
0:22:41.280 --> 0:22:46.408 | |
If you have more data, you want to process | |
more data so you can make use of that. | |
0:22:46.466 --> 0:22:54.499 | |
On the other hand, if more and more data is | |
processed, more and more people will use machine | |
0:22:54.499 --> 0:23:06.816 | |
translation to generate translations, and it will be important to handle that efficiently as well. There
0:23:06.816 --> 0:23:07.257 | |
is more and
0:23:07.607 --> 0:23:10.610 | |
more
0:23:10.170 --> 0:23:17.262 | |
data generated every day. Here are just some general numbers on how much data there
0:23:17.262 --> 0:23:17.584 | |
is. | |
0:23:17.584 --> 0:23:24.595 | |
It is said that a lot of the data we produce, at least at the moment, is rich in text, so text
0:23:24.595 --> 0:23:26.046 | |
that is produced. | |
0:23:26.026 --> 0:23:29.748 | |
That is very important in two ways: on the one hand,
0:23:29.748 --> 0:23:33.949 | |
We can use it as training data in some way. | |
0:23:33.873 --> 0:23:40.836 | |
On the other hand, we want to translate some of that because it might not be published in all the languages,
0:23:40.836 --> 0:23:46.039 | |
and so the need for machine translation becomes even more important.
0:23:47.907 --> 0:23:51.547 | |
So what are the challenges with this? | |
0:23:51.831 --> 0:24:01.360 | |
So first of all that seems to be very good | |
news, so there is more and more data, so we | |
0:24:01.360 --> 0:24:10.780 | |
can just wait for three years and have more | |
data, and then our system will be better. | |
0:24:11.011 --> 0:24:22.629 | |
If you see in competitions, the system performance | |
increases. | |
0:24:24.004 --> 0:24:27.190 | |
You see that here are three different systems. The BLEU score is a metric to measure how good an
Blue score is metric to measure how good an | |
MT system is, and we'll talk about evaluation next week, so you'll learn how to evaluate
and the next week so you'll have to evaluate | |
machine translation; there is also a practical session.
0:24:41.581 --> 0:24:45.219 | |
And so. | |
0:24:44.784 --> 0:24:50.960 | |
This axis shows how much of the training data you have: with five percent
0:24:50.960 --> 0:24:56.117 | |
you're significantly worse than with forty percent or eighty percent.
0:24:56.117 --> 0:25:02.021 | |
You're getting better, and you're seeing that this curve maybe does not really
0:25:02.021 --> 0:25:02.971 | |
flatten out.
0:25:02.971 --> 0:25:03.311 | |
But. | |
0:25:03.263 --> 0:25:07.525 | |
Of course, the gains you get are normally | |
smaller and smaller. | |
0:25:07.525 --> 0:25:09.216 | |
the more data you have.
0:25:09.549 --> 0:25:21.432 | |
Your improvements are normally bigger if you add the same amount or even double your
0:25:21.432 --> 0:25:25.657 | |
data when you start small; of course, more data always helps.
0:25:26.526 --> 0:25:34.955 | |
However, you see the clear tendency: if you need to improve your system,
0:25:34.955 --> 0:25:38.935 | |
this is possible by just getting more data.
0:25:39.039 --> 0:25:41.110 | |
But it's not all about the amount of data. It can also be about the domain of the data that you are
It can also be the domain of the day that | |
working with.
0:25:45.865 --> 0:25:55.668 | |
So this was a test of a machine translation system on translating genome data.
0:25:55.668 --> 0:26:02.669 | |
We have a colleague who is working on translating this kind of data.
0:26:02.862 --> 0:26:06.868 | |
Here you see the performance measured in BLEU score. You see one system which was only trained
You see one system which only was trained | |
on genome data, and it only has very little training data.
0:26:12.812 --> 0:26:17.742 | |
That's very, very little for machine translation.
0:26:18.438 --> 0:26:23.927 | |
And compare that to a system which was trained on general news translation data,
0:26:24.104 --> 0:26:34.177 | |
With four point five million sentences so | |
roughly one hundred times as much data you | |
0:26:34.177 --> 0:26:40.458 | |
still see that this system doesn't really work | |
well. | |
0:26:40.820 --> 0:26:50.575 | |
So you see it's not only about data, it's | |
also that the data has to somewhat fit to the | |
0:26:50.575 --> 0:26:51.462 | |
domain. | |
0:26:51.831 --> 0:26:58.069 | |
The more general data you get, the better you have covered all domains.
0:26:58.418 --> 0:27:07.906 | |
But that's very difficult and especially for | |
more specific domains. | |
0:27:07.906 --> 0:27:16.696 | |
It can be really important to get data which | |
fits your domain. | |
0:27:16.716 --> 0:27:18.520 | |
Maybe you could do some prompting or something like that, maybe if you
0:27:18.598 --> 0:27:22.341 | |
tell it: okay, concentrate on this domain, to make it better.
0:27:24.564 --> 0:27:28.201 | |
It's not that easy to prompt it. | |
0:27:28.201 --> 0:27:35.807 | |
You can do the prompting in the more traditional | |
way of fine tuning. | |
0:27:35.807 --> 0:27:44.514 | |
Then, of course, if you select data and later combine it, you can get better.
0:27:44.904 --> 0:27:52.675 | |
But it will always be that this type of similar | |
data is much more important than the general. | |
0:27:52.912 --> 0:28:00.705 | |
So of course it can make your system a lot better if you search for similar data
0:28:00.705 --> 0:28:01.612 | |
and find it.
0:28:02.122 --> 0:28:08.190 | |
We will have a lecture on domain adaptation, where exactly this is the idea: how you can make systems
0:28:08.190 --> 0:28:13.935 | |
in these situations better so you can adapt | |
it to this data but then you still need this | |
0:28:13.935 --> 0:28:14.839 | |
type of data. | |
0:28:15.335 --> 0:28:21.590 | |
And in prompting it might work if you have | |
seen it in your data so it can make the system | |
0:28:21.590 --> 0:28:25.134 | |
aware and tell it to focus more on this type of data.
0:28:25.465 --> 0:28:30.684 | |
But if you haven't had enough of the really | |
specific good matching data, I think it will | |
0:28:30.684 --> 0:28:31.681 | |
still not work well.
0:28:31.681 --> 0:28:37.077 | |
So you need to have this type of data and | |
therefore it's important not only to have general | |
0:28:37.077 --> 0:28:42.120 | |
data but also data, at least in your overall | |
system, which really fits to the domain. | |
0:28:45.966 --> 0:28:53.298 | |
And then the second thing, of course, is you | |
need to have data that has good quality. | |
0:28:53.693 --> 0:29:00.170 | |
In the early stages it might be good to have | |
all the data but later it's especially important | |
0:29:00.170 --> 0:29:06.577 | |
that you have somehow good quality and so that | |
you're learning what you really want to learn | |
0:29:06.577 --> 0:29:09.057 | |
and not learning some wrong things.
0:29:10.370 --> 0:29:21.551 | |
We talked about this with the kilometers and | |
miles, so if you just take in some type of | |
0:29:21.551 --> 0:29:26.253 | |
data and don't look at the quality, you can get exactly such problems.
0:29:26.766 --> 0:29:30.875 | |
But of course, the question here is what is | |
good quality data? | |
0:29:31.331 --> 0:29:35.054 | |
It is not yet that easy to define what is | |
a good quality data. | |
0:29:36.096 --> 0:29:43.961 | |
That doesn't mean it has to be what people generally consider high-quality text, like something written
0:29:43.961 --> 0:29:47.814 | |
by a Nobel Prize winner or something like that. | |
0:29:47.814 --> 0:29:54.074 | |
This is not what we mean by quality here; the most important thing, again, is:
0:29:54.354 --> 0:30:09.181 | |
So if you have Twitter data, high quality | |
data doesn't mean you have now some novels. | |
0:30:09.309 --> 0:30:12.875 | |
as training data, but rather data that is represented similarly.
0:30:12.875 --> 0:30:18.480 | |
One thing that definitely belongs to quality is that the two sides should really be translations of each
0:30:18.480 --> 0:30:18.862 | |
other.
0:30:19.199 --> 0:30:25.556 | |
So especially if you crawl data, you will often find that it's not a direct translation.
0:30:25.805 --> 0:30:28.436 | |
So then, of course, this is not high-quality training data.
0:30:29.449 --> 0:30:39.974 | |
But in general that's a very difficult thing to do, and it's very difficult to define what
0:30:39.974 --> 0:30:41.378 | |
good quality really means.
0:30:41.982 --> 0:30:48.333 | |
And of course one metric is always: the quality of your data is good if your machine translation system gets better.
0:30:48.648 --> 0:30:50.719 | |
So that is like the indirect measure.
0:30:50.991 --> 0:30:52.447 | |
But what can we measure directly?
0:30:52.447 --> 0:30:57.210 | |
Of course, it's difficult to always try a lot of things and evaluate each of them,
0:30:57.210 --> 0:30:59.396 | |
build a full MT system and then check:
0:30:59.396 --> 0:31:00.852 | |
Oh, was this a good idea? | |
0:31:00.852 --> 0:31:01.357 | |
I mean,. | |
0:31:01.581 --> 0:31:19.055 | |
Say you have two tokenizers which split sentences into words, and you want to know which one to apply.
0:31:19.179 --> 0:31:21.652 | |
Now you could maybe argue or your idea could | |
be. | |
0:31:21.841 --> 0:31:30.186 | |
just try it in a small setting very quickly and then take the result, but the problem is there is not
0:31:30.186 --> 0:31:31.448 | |
always such a clear answer.
0:31:31.531 --> 0:31:36.269 | |
Something might work very well for small data.
0:31:36.269 --> 0:31:43.123 | |
It's not for sure that the same effect will happen at large scale.
0:31:43.223 --> 0:31:50.395 | |
This idea really improves on very low resource | |
data if only train on hundred words. | |
0:31:51.271 --> 0:31:58.357 | |
But if you use it for a large data set, it doesn't really matter and your idea doesn't help anymore.
0:31:58.598 --> 0:32:01.172 | |
So that is also a typical thing. | |
0:32:01.172 --> 0:32:05.383 | |
This quality issue is more and more important | |
if you. | |
0:32:06.026 --> 0:32:16.459 | |
But one motivation which you should generally have: you want to represent your data such that you see
0:32:16.459 --> 0:32:17.469 | |
events as often as possible.
0:32:17.677 --> 0:32:21.805 | |
Why is this the case any idea? | |
0:32:21.805 --> 0:32:33.389 | |
Why could this be a motivation, that we try to represent the data in a way that we have
0:32:33.389 --> 0:32:34.587 | |
seen things as often as possible?
0:32:38.338 --> 0:32:50.501 | |
We also want to learn about the context, because maybe sometimes the meaning comes from the context.
0:32:52.612 --> 0:32:54.020 | |
The context is here. | |
0:32:54.020 --> 0:32:56.432 | |
It's more about the learning first. | |
0:32:56.432 --> 0:33:00.990 | |
You can generally learn better if you've seen | |
something more often. | |
0:33:00.990 --> 0:33:06.553 | |
So if you have seen an event only once, it's | |
really hard to learn about the event. | |
0:33:07.107 --> 0:33:15.057 | |
If you have seen an event a hundred times, you are better at estimating it, and maybe that
0:33:15.057 --> 0:33:18.529 | |
includes the context; then you can use the context.
0:33:18.778 --> 0:33:21.331 | |
So, for example, take the word 'house' here.
0:33:21.761 --> 0:33:28.440 | |
If you would just take the data normally you | |
would directly process the data. | |
0:33:28.440 --> 0:33:32.893 | |
In the example above you would have 'house' followed by a dot.
0:33:32.893 --> 0:33:40.085 | |
That's a different word than 'house' on its own, and again different from 'house' with a comma.
0:33:40.520 --> 0:33:48.365 | |
So you want to learn how this translates into | |
house, but you translate an upper case. | |
0:33:48.365 --> 0:33:50.281 | |
How this translates. | |
0:33:50.610 --> 0:33:59.445 | |
You would be learning how to translate each of these variants separately, so you have to learn four different
0:33:59.445 --> 0:34:00.205 | |
things. | |
0:34:00.205 --> 0:34:06.000 | |
Instead, we really want to learn only once how 'house' gets translated.
0:34:06.366 --> 0:34:18.796 | |
And then imagine it would even be ambiguous: it might be that here 'house' could also be translated
0:34:18.678 --> 0:34:22.089 | |
into a different word.
0:34:22.202 --> 0:34:29.512 | |
If it's uppercase, the system might then always translate it into the one word, while if it's lower
0:34:29.512 --> 0:34:34.955 | |
case it is translated into 'house', and that's of course not right.
0:34:34.955 --> 0:34:39.260 | |
We have to use the context to decide what | |
is better. | |
0:34:39.679 --> 0:34:47.086 | |
If you have seen an event several times then | |
you are better able to learn your model and | |
0:34:47.086 --> 0:34:51.414 | |
that doesn't matter what type of learning you | |
have. | |
0:34:52.392 --> 0:34:58.981 | |
I shouldn't say all but for most of these | |
models it's always better to have like seen | |
0:34:58.981 --> 0:35:00.897 | |
an event more often.
0:35:00.920 --> 0:35:11.483 | |
Therefore, if you preprocess your data, you should ask the question how you can represent the data
0:35:11.483 --> 0:35:14.212 | |
in order to have seen each event as often as possible.
0:35:14.514 --> 0:35:17.885 | |
Of course you should not remove that information. | |
0:35:18.078 --> 0:35:25.519 | |
So you could now, of course, just lowercase | |
everything. | |
0:35:25.519 --> 0:35:30.303 | |
Then you've seen things more often. | |
0:35:30.710 --> 0:35:38.443 | |
And that might be an issue because in the | |
final application you want to have real text | |
0:35:38.443 --> 0:35:38.887 | |
and proper casing.
0:35:40.440 --> 0:35:44.003 | |
And finally, it's even more important that it's consistent.
0:35:44.965 --> 0:35:52.630 | |
So it is a problem if things, for example, aren't consistent.
0:35:52.630 --> 0:35:58.762 | |
Say 'I am' is always written together as 'I'm' in the training data,
0:35:58.762 --> 0:36:04.512 | |
and not in the test data; then you have a mismatch.
0:36:04.824 --> 0:36:14.612 | |
Therefore, the most important thing is to do the preprocessing and represent your data in the way that is most consistent,
0:36:14.612 --> 0:36:18.413 | |
because then it's easier to map similar things onto each other.
0:36:18.758 --> 0:36:26.588 | |
If your text is represented very, very differently, then your data will be translated badly.
0:36:26.666 --> 0:36:30.664 | |
So we once had the case. | |
0:36:30.664 --> 0:36:40.420 | |
For example, there was some data where the German characters were written differently.
0:36:40.900 --> 0:36:44.187 | |
And if you read it as a human you see it. | |
0:36:44.187 --> 0:36:49.507 | |
It's even hard to get the difference because | |
it looks very similar. | |
0:36:50.130 --> 0:37:02.997 | |
If you use it in a machine translation system, though, it would not be able to translate anything
0:37:02.997 --> 0:37:08.229 | |
of it, because every word is a different word for the system.
0:37:09.990 --> 0:37:17.736 | |
And on the other hand you should of course not change your training
0:37:17.736 --> 0:37:18.968 | |
data in a way that removes important information,
0:37:18.968 --> 0:37:27.155 | |
for example by removing case information if your task is to generate case information.
0:37:31.191 --> 0:37:41.081 | |
One thing which is a good starting point to look at in order to see the difficulty of your data
0:37:41.081 --> 0:37:42.711 | |
is to compare types and tokens.
0:37:43.103 --> 0:37:45.583 | |
There are types and tokens.
0:37:45.583 --> 0:37:57.983 | |
By types we mean the number of unique words in the corpus, so your vocabulary; tokens are the running words.
0:37:58.298 --> 0:38:08.628 | |
And then you can look at the type token ratio | |
that means a number of types per token. | |
0:38:15.815 --> 0:38:22.381 | |
You always have fewer types than tokens, because every word appears at least once in the corpus, but most
0:38:22.381 --> 0:38:27.081 | |
of them will occur more often, so the token count is the bigger number.
0:38:27.667 --> 0:38:30.548 | |
And of course this changes if you have more | |
data.
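A minimal sketch of counting types and tokens and their ratio, using simple whitespace tokenization (the file name is hypothetical):

```python
from collections import Counter

def type_token_ratio(path):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    tokens = sum(counts.values())   # running words
    types = len(counts)             # distinct words, i.e. the vocabulary
    return types, tokens, types / tokens

types, tokens, ttr = type_token_ratio("corpus.en")
print(f"{types} types, {tokens} tokens, type-token ratio {ttr:.4f}")
```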
0:38:31.191 --> 0:38:38.103 | |
Here is an example from an English Wikipedia. | |
0:38:38.103 --> 0:38:45.015 | |
That means each word occurs a certain number of times on average.
0:38:45.425 --> 0:38:47.058 | |
Of course there's a big difference. | |
0:38:47.058 --> 0:38:51.323 | |
There will be some words which occur one hundred times, but on the other hand most of the words occur
0:38:51.323 --> 0:38:51.777 | |
only once.
0:38:52.252 --> 0:38:55.165 | |
However, you see this ratio goes down. | |
0:38:55.165 --> 0:39:01.812 | |
That's a good thing, so you have seen each | |
word more often and therefore your model gets | |
0:39:01.812 --> 0:39:03.156 | |
typically better. | |
0:39:03.156 --> 0:39:08.683 | |
However, the problem is we always have a lot | |
of words which we have seen only rarely.
0:39:09.749 --> 0:39:15.111 | |
Even here there will be a bunch of words which you have only seen once.
0:39:15.111 --> 0:39:20.472 | |
However, this can give you an indication about | |
the quality of the data. | |
0:39:20.472 --> 0:39:27.323 | |
So you should always, of course, try to achieve a representation of the data where you have a very low type-to-
0:39:27.323 --> 0:39:28.142 | |
token ratio.
0:39:28.808 --> 0:39:39.108 | |
For example, if you compare Simple English Wikipedia and normal Wikipedia, what would be your expectation?
0:39:41.861 --> 0:39:49.842 | |
Yes, exactly; however, it's surprisingly only a little bit lower, but you see that it's
0:39:49.842 --> 0:39:57.579 | |
lower, so we are using fewer words to express the same thing, and therefore the task to produce
0:39:57.579 --> 0:39:59.941 | |
this text is also easier.
0:40:01.221 --> 0:40:07.702 | |
However, as to how many words there are, there is no clear answer.
0:40:07.787 --> 0:40:19.915 | |
So there will be always more words, especially | |
depending on your dataset, how many different | |
0:40:19.915 --> 0:40:22.132 | |
words there are. | |
0:40:22.482 --> 0:40:30.027 | |
So if you have a million tweets, that is around fifty million tokens, and you have six hundred
0:40:30.027 --> 0:40:30.875 | |
thousand different words.
0:40:31.251 --> 0:40:40.299 | |
If you have many times this number of tweets, you also have significantly more tokens, but also more types.
0:40:40.660 --> 0:40:58.590 | |
So especially in things like the social media, | |
of course, there's always different types of | |
0:40:58.590 --> 0:40:59.954 | |
words. | |
0:41:00.040 --> 0:41:04.028 | |
Another example from not social media is here. | |
0:41:04.264 --> 0:41:18.360 | |
So yeah, there is a smaller data set, the Switchboard phone conversations, with two million tokens and
0:41:18.360 --> 0:41:22.697 | |
only twenty thousand different words.
0:41:23.883 --> 0:41:37.221 | |
If you think about Shakespeare, it has even fewer tokens, significantly less than a million,
0:41:37.221 --> 0:41:40.006 | |
but the number of different words is still relatively large.
0:41:40.060 --> 0:41:48.781 | |
On the other hand, there is this Google N-gram corpus, which has a huge number of tokens, and there are always
0:41:48.781 --> 0:41:50.506 | |
new words coming in.
0:41:50.991 --> 0:41:52.841 | |
This is English.
0:41:52.841 --> 0:42:08.103 | |
The nice thing about English is that the vocabulary is relatively small, not tiny, but relatively
0:42:08.103 --> 0:42:09.183 | |
small. | |
0:42:09.409 --> 0:42:14.224 | |
So here you see the TED corpus.
0:42:15.555 --> 0:42:18.144 | |
You all know TED talks.
0:42:18.144 --> 0:42:26.429 | |
They are transcribed and translated, a nice resource for us, though an especially small corpus.
0:42:26.846 --> 0:42:32.702 | |
You can do a lot of experiments with that | |
and you see that the corpus size is relatively
0:42:32.702 --> 0:42:36.782 | |
similar so we have around four million tokens | |
in this corpus. | |
0:42:36.957 --> 0:42:44.464 | |
However, if you look at the vocabulary, English | |
has only about half as many different words
0:42:44.464 --> 0:42:47.045 | |
as German and Dutch and Italian. | |
0:42:47.527 --> 0:42:56.260 | |
So this is partly an influence of compound words, which are more frequent in German, and
0:42:56.260 --> 0:43:02.978 | |
more importantly of all the different morphological forms.
0:43:03.263 --> 0:43:08.170 | |
These all lead to new words, and they need to be somehow represented.
0:43:11.531 --> 0:43:20.278 | |
So to deal with this, the question is how | |
can we normalize the text in order to make | |
0:43:20.278 --> 0:43:22.028 | |
the text easier? | |
0:43:22.028 --> 0:43:25.424 | |
Can we make the task simpler?
0:43:25.424 --> 0:43:29.231 | |
But we need to keep all information. | |
0:43:29.409 --> 0:43:32.239 | |
Here is an example where information does get lost:
0:43:32.239 --> 0:43:35.012 | |
of course you make the task easier if you just lowercase everything.
0:43:35.275 --> 0:43:41.141 | |
You don't have to deal with different cases. | |
0:43:41.141 --> 0:43:42.836 | |
It's easier. | |
0:43:42.836 --> 0:43:52.482 | |
However, information gets lost, and you might need it to generate the correct target text.
0:43:52.832 --> 0:44:00.153 | |
So the question is always: How can we on the | |
one hand simplify the task but keep all the | |
0:44:00.153 --> 0:44:01.223 | |
information? | |
0:44:01.441 --> 0:44:06.639 | |
I say all necessary information because it depends on the task.
0:44:06.639 --> 0:44:11.724 | |
For some tasks it might be fine to remove the casing, for example.
0:44:14.194 --> 0:44:23.463 | |
So the steps we are typically doing are: you segment the running
0:44:23.463 --> 0:44:30.696 | |
text into words, you normalize word forms, and you do segmentation into sentences.
0:44:30.696 --> 0:44:33.955 | |
Also, if you do not have a single sentence per line,
0:44:33.933 --> 0:44:38.739 | |
if this has not already been done at this point, the text is also split into sentence segments.
0:44:39.779 --> 0:44:52.609 | |
So what are we doing there? For European languages, segmentation into words
0:44:52.609 --> 0:44:57.290 | |
It's not that complicated. | |
0:44:57.277 --> 0:45:06.001 | |
You have to somehow handle the joined words, and when handling joined words the most important thing is consistency.
0:45:06.526 --> 0:45:11.331 | |
So in most systems it really doesn't matter | |
much. | |
0:45:11.331 --> 0:45:16.712 | |
if you write 'I'm' together as one word or as two words.
0:45:17.197 --> 0:45:23.511 | |
The nice thing about 'I'm' is that it occurs so often that it doesn't matter if you have both variants,
0:45:23.511 --> 0:45:26.560 | |
as long as they both occur often enough.
0:45:26.560 --> 0:45:32.802 | |
But you'll have some of these cases where they don't occur that often, so you should
0:45:32.802 --> 0:45:35.487 | |
be as consistent as possible.
0:45:36.796 --> 0:45:41.662 | |
But of course things can get more complicated. | |
0:45:41.662 --> 0:45:48.598 | |
If you have 'Finland's capital', do you want to split off the 's or not?
0:45:48.598 --> 0:45:53.256 | |
Do you split 'isn't', or do you even write it out as 'is not'?
0:45:53.433 --> 0:46:00.468 | |
And what about like things with hyphens in | |
the middle and so on? | |
0:46:00.540 --> 0:46:07.729 | |
So not everything is very easy, but it is generally possible to keep things reasonably consistent.
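A minimal sketch of such rule-based tokenization for European languages; real toolkits handle many more special cases (abbreviations, hyphens, URLs), so this only illustrates the idea:

```python
import re

def tokenize(sentence):
    s = sentence.strip()
    s = re.sub(r"([.,!?;:\"()])", r" \1 ", s)   # split off punctuation marks
    s = re.sub(r"(\w)'(\w)", r"\1 '\2", s)      # I'm -> I 'm ; Finland's -> Finland 's
    return s.split()

print(tokenize("I'm going to Finland's capital, aren't you?"))
# ['I', "'m", 'going', 'to', 'Finland', "'s", 'capital', ',', 'aren', "'t", 'you', '?']
```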
0:46:11.791 --> 0:46:25.725 | |
Some of the most challenging things in traditional systems were compounds, and how to deal with
0:46:25.725 --> 0:46:28.481 | |
things like this. | |
0:46:28.668 --> 0:46:32.154 | |
The nice thing is, as said, we will come to that later.
0:46:32.154 --> 0:46:34.501 | |
Nowadays we typically use subword
0:46:35.255 --> 0:46:42.261 | |
units, so we don't have to deal with this in the preprocessing directly, but in the subword
0:46:42.261 --> 0:46:47.804 | |
splitting we're doing it, and then we can learn how to best split these.
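As a preview, here is a minimal sketch of the byte-pair-encoding idea behind these subword units, with a made-up toy vocabulary; the actual algorithm and toolkits are covered later:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count how often each adjacent symbol pair occurs inside words."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge one symbol pair everywhere it occurs."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# toy corpus: words as character sequences plus an end-of-word marker
vocab = {"h o u s e </w>": 5, "h o u s e s </w>": 2, "m o u s e </w>": 3}
for _ in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print("merged", best)
```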
0:46:52.392 --> 0:46:56.974 | |
Things get more complicated
0:46:56.977 --> 0:46:59.934 | |
when we think about non-European languages.
0:46:59.934 --> 0:47:08.707 | |
Because in non-European languages, not all of them, there is no space between the words.
0:47:09.029 --> 0:47:18.752 | |
Nowadays you can also download word segmentation | |
models where you put in the full sentence and | |
0:47:18.752 --> 0:47:22.744 | |
then it gets split into parts.
0:47:22.963 --> 0:47:31.814 | |
And then, of course, there is even the issue that you have different writing systems, sometimes mixed, as in Japanese.
0:47:31.814 --> 0:47:40.385 | |
For example, they have these katakana, hiragana | |
and kanji symbols in there, and you have to | |
0:47:40.385 --> 0:47:42.435 | |
somehow deal with these.
0:47:49.669 --> 0:47:54.560 | |
Then the next thing is that we can do some normalization.
0:47:54.874 --> 0:48:00.376 | |
So the idea is that you map several words | |
onto the same form.
0:48:00.460 --> 0:48:07.877 | |
And that is task-dependent, and the idea is to define something like equivalence classes so
0:48:07.877 --> 0:48:15.546 | |
that words which have the same meaning, where it's not important to keep the difference, get
0:48:15.546 --> 0:48:19.423 | |
mapped onto the same thing in order to make the task easier.
0:48:19.679 --> 0:48:27.023 | |
The most important thing there is casing, and then there is sometimes something
0:48:27.023 --> 0:48:27.508 | |
like word classes.
0:48:28.048 --> 0:48:37.063 | |
For casing you can do two things, and it depends on the task.
0:48:37.063 --> 0:48:44.769 | |
You can lowercase everything, maybe with some exceptions.
0:48:45.045 --> 0:48:47.831 | |
For the target side, it's normally not done.
0:48:48.188 --> 0:48:51.020 | |
Why is it not done? | |
0:48:51.020 --> 0:48:56.542 | |
Why should you only do it for the source side?
0:48:56.542 --> 0:49:07.729 | |
Yes, because you have to generate correct text with proper lower case and upper case.
0:49:08.848 --> 0:49:16.370 | |
Nowadays we typically do true casing on both sides, also on the source side; that means you
0:49:16.370 --> 0:49:17.610 | |
keep the case. | |
0:49:17.610 --> 0:49:24.966 | |
The only thing where people try to work on | |
or sometimes do that is that at the beginning | |
0:49:24.966 --> 0:49:25.628 | |
of the. | |
0:49:25.825 --> 0:49:31.115 | |
For words like this, this is not that important | |
because you will have seen otherwise a lot | |
0:49:31.115 --> 0:49:31.696 | |
of times. | |
0:49:31.696 --> 0:49:36.928 | |
But if you know have rare words, which you | |
only have seen maybe three times, and you have | |
0:49:36.928 --> 0:49:42.334 | |
only seen in the middle of the sentence, and | |
now it occurs at the beginning of the sentence, | |
0:49:42.334 --> 0:49:45.763 | |
which is upper case, then you don't know how | |
to deal with. | |
0:49:46.146 --> 0:49:50.983 | |
So then it might be good to do a true casing. | |
0:49:50.983 --> 0:49:56.241 | |
That means you recase each word at the beginning of a sentence.
0:49:56.576 --> 0:49:59.830 | |
The only question, of course, is how do you | |
recase it? | |
0:49:59.830 --> 0:50:01.961 | |
To which case would you change it, would you always lowercase?
0:50:02.162 --> 0:50:18.918 | |
Just lowercase the first word of the sentence, or do you have a better solution, especially for languages
0:50:18.918 --> 0:50:20.000 | |
other than English, maybe German?
0:50:25.966 --> 0:50:36.648 | |
The fancy solution would be to count how often each variant occurs and decide based on this; the unfancy one
0:50:36.648 --> 0:50:43.147 | |
would be to just lowercase it. I think that's not really good, because most of these words are lowercased anyway.
0:50:43.683 --> 0:50:53.657 | |
Yes, counting is one idea, and it is definitely better, because a word may occur more often in uppercase.
0:50:53.653 --> 0:50:57.934 | |
Otherwise you introduce a lowercase variant only at sentence beginnings, where you have again
0:50:58.338 --> 0:51:03.269 | |
not gained anything. You can make it even a bit better when counting:
0:51:03.269 --> 0:51:09.134 | |
you ignore the first position so that you don't count the sentence beginnings, and yeah,
0:51:09.134 --> 0:51:12.999 | |
that's typically how it's done to do this type | |
of casing. | |
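A minimal sketch of such a count-based truecaser, ignoring the sentence-initial position during counting and recasing only the first word at test time (the toy sentences are made up):

```python
from collections import Counter, defaultdict

def train_truecaser(sentences):
    counts = defaultdict(Counter)
    for sent in sentences:
        tokens = sent.split()
        for tok in tokens[1:]:              # skip position 0: casing there is not informative
            counts[tok.lower()][tok] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def truecase(sentence, model):
    tokens = sentence.split()
    if tokens:
        first = tokens[0]
        tokens[0] = model.get(first.lower(), first)  # recase only the first word
    return " ".join(tokens)

model = train_truecaser(["Das Haus ist groß .", "Wir haben das Haus gekauft ."])
print(truecase("Das haben wir gesehen .", model))   # -> "das haben wir gesehen ."
```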
0:51:13.273 --> 0:51:23.907 | |
And that's the easy approach; you can even use bigram features, so word pairs.
0:51:23.907 --> 0:51:29.651 | |
There are very few words which occur often in both forms.
0:51:29.970 --> 0:51:33.163 | |
It's OK to have those in both forms, because then the model can learn it.
0:51:36.376 --> 0:51:52.305 | |
Another thing about these classes is to use word classes; that was partly done, for example,
0:51:52.305 --> 0:51:55.046 | |
for numbers in earlier systems.
0:51:55.375 --> 0:51:57.214 | |
Take 'ten thousand one hundred books'.
0:51:57.597 --> 0:52:07.397 | |
For an MT system the exact number might not be important, so you can map it to a number class plus 'books'.
0:52:07.847 --> 0:52:16.450 | |
However, you see here already that it's not that easy, because if you have 'one book' you
0:52:16.450 --> 0:52:19.318 | |
suddenly have to deal with singular versus plural.
0:52:20.020 --> 0:52:21.669 | |
Always be careful. | |
0:52:21.669 --> 0:52:28.094 | |
It happens very fast that you ignore some exceptions and make more things worse than better.
0:52:28.488 --> 0:52:37.879 | |
So it's always difficult to decide when to | |
do this and when to better not do it and keep | |
0:52:37.879 --> 0:52:38.724 | |
things as they are.
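A minimal sketch of such a number-class normalization with a small exception list (the placeholder symbol and the exception list are just assumptions):

```python
import re

KEEP = {"0", "1", "one", "two", "zero"}   # hypothetical exceptions, e.g. to preserve 'one book'

def normalize_numbers(tokens, placeholder="<num>"):
    out = []
    for tok in tokens:
        if re.fullmatch(r"\d+(?:[.,]\d+)*", tok) and tok not in KEEP:
            out.append(placeholder)       # map the concrete number to a word class
        else:
            out.append(tok)
    return out

print(normalize_numbers("she sold 10100 books but kept 1 book".split()))
# ['she', 'sold', '<num>', 'books', 'but', 'kept', '1', 'book']
```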
0:52:43.483 --> 0:52:56.202 | |
Then the next step is sentence segmentation, | |
so we are typically working on sentences. | |
0:52:56.476 --> 0:53:11.633 | |
However, with dots things are a bit more complicated, because a dot does not always end a sentence.
0:53:11.731 --> 0:53:20.111 | |
You can even have some type of classifier with features, but generally this
0:53:20.500 --> 0:53:30.731 | |
is not too complicated, so you can have different types of classifiers to do that.
0:53:30.650 --> 0:53:32.537 | |
I Didn't Know It. | |
0:53:33.393 --> 0:53:35.583 | |
It's not a super complicated task. | |
0:53:35.583 --> 0:53:39.461 | |
There are nowadays also a lot of libraries | |
which you can use. | |
0:53:39.699 --> 0:53:45.714 | |
To do that normally if you're doing the normalization | |
beforehand that can be done there so you only | |
0:53:45.714 --> 0:53:51.126 | |
split off the dot if it's the sentence boundary, and otherwise you keep it attached to the word,
0:53:51.126 --> 0:53:54.194 | |
so you can do that a bit jointly with the tokenization.
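A minimal sketch of such rule-based sentence segmentation with a small abbreviation list (the list and the example text are made up); real libraries or a small classifier handle many more cases:

```python
import re

ABBREVIATIONS = {"dr.", "prof.", "e.g.", "i.e.", "etc.", "vs."}

def split_sentences(text):
    sentences, start = [], 0
    # candidate boundaries: ., ! or ? followed by whitespace and an uppercase letter
    for m in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        token_before = text[start:m.end()].split()[-1].lower()
        if token_before in ABBREVIATIONS:
            continue                          # the dot belongs to an abbreviation
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived late. He met Prof. Jones. They talked."))
# ['Dr. Smith arrived late.', 'He met Prof. Jones.', 'They talked.']
```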
0:53:54.634 --> 0:54:06.017 | |
It's something to think about and take care of, because it's where errors happen.
0:54:06.017 --> 0:54:14.712 | |
However, on the one end you can still do it | |
very well. | |
0:54:14.834 --> 0:54:19.740 | |
You will never get data which is perfectly | |
clean and where everything is great. | |
0:54:20.340 --> 0:54:31.020 | |
There's just too much data and it will never | |
happen, so therefore it's important to be aware | |
0:54:31.020 --> 0:54:35.269 | |
of that during the full development. | |
0:54:37.237 --> 0:54:42.369 | |
And one last thing about the preprocessing before we get into the representation:
0:54:42.369 --> 0:54:47.046 | |
if you're working on that, you will become friends with regular expressions.
0:54:47.046 --> 0:54:50.034 | |
That's normally how you do all this matching.
0:54:50.430 --> 0:55:03.811 | |
And if you look into the scripts of how to deal with punctuation marks and stuff like
0:55:03.811 --> 0:55:04.900 | |
that, they are full of regular expressions.
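A small sketch of the kind of regular-expression normalization meant here, unifying quotation marks and dashes and collapsing whitespace:

```python
import re

def normalize_punctuation(text):
    text = re.sub(r"[“”„«»]", '"', text)      # unify double quotation marks
    text = re.sub(r"[‘’‚]", "'", text)        # unify single quotes and apostrophes
    text = re.sub(r"[–—]", "-", text)         # unify dashes
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

print(normalize_punctuation("„Hello“  –  it’s   me"))   # -> "Hello" - it's me
```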
0:55:11.011 --> 0:55:19.025 | |
So now that we have the data, our next step to build the system is to represent our words.
0:55:19.639 --> 0:55:27.650 | |
Before we start with this, are there any more questions about preprocessing,
0:55:27.650 --> 0:55:32.672 | |
where we still work on the pure text?
0:55:33.453 --> 0:55:40.852 | |
The idea is again to make things more simple, because if you think about the capitalized word
0:55:40.852 --> 0:55:48.252 | |
at the beginning of a sentence, it might be that you haven't seen that form of the word; or, for example,
0:55:48.252 --> 0:55:49.619 | |
think of titles. | |
0:55:49.619 --> 0:55:56.153 | |
In newspaper articles there is special casing, so you have then seen the word only in the title before,
0:55:56.153 --> 0:55:58.425 | |
and in the running text you have never seen it.
0:55:58.898 --> 0:56:03.147 | |
But there is always the decision. | |
0:56:03.123 --> 0:56:09.097 | |
Do I gain more because I've seen things more | |
often or do I lose because now I remove information | |
0:56:09.097 --> 0:56:11.252 | |
which helps me to the same degree? | |
0:56:11.571 --> 0:56:21.771 | |
Because if we, for example, do that in German | |
and remove the case, this might be an important | |
0:56:21.771 --> 0:56:22.531 | |
issue. | |
0:56:22.842 --> 0:56:30.648 | |
So there is not the perfect solution, but | |
generally you can get some arrows to make things | |
0:56:30.648 --> 0:56:32.277 | |
look more similar. | |
0:56:35.295 --> 0:56:43.275 | |
What do current products, the state of the art, do; what are the trends, more or
0:56:43.275 --> 0:56:43.813 | |
less. | |
0:56:44.944 --> 0:56:50.193 | |
It is done even less nowadays because models get more powerful, so it's not that important, but be
0:56:50.193 --> 0:56:51.136 | |
a bit careful.
0:56:51.136 --> 0:56:56.326 | |
It's also the evaluation thing because these | |
things which are problematic are happening | |
0:56:56.326 --> 0:56:57.092 | |
very rarely. | |
0:56:57.092 --> 0:57:00.159 | |
If you take average performance, it doesn't | |
matter. | |
0:57:00.340 --> 0:57:06.715 | |
However, in between the system makes these stupid mistakes that don't count on average, but they
0:57:06.715 --> 0:57:08.219 | |
are not really good. | |
0:57:09.089 --> 0:57:15.118 | |
So, to summarize: you do some type of tokenization.
0:57:15.118 --> 0:57:19.911 | |
You can do true casing or not. | |
0:57:19.911 --> 0:57:28.723 | |
Some people nowadays don't do it, but that's | |
still done. | |
0:57:28.948 --> 0:57:34.441 | |
Then it depends a bit on the type of domain.
0:57:34.441 --> 0:57:37.437 | |
Take, for example, software translation.
0:57:37.717 --> 0:57:46.031 | |
So in the text sometimes there is a marker in a menu item indicating the keyboard shortcut.
0:57:46.031 --> 0:57:49.957 | |
A letter is marked as the shortcut key.
0:57:49.957 --> 0:57:57.232 | |
Then you cannot match the word anymore because it's no longer 'file' but 'file' with a marker in it.
0:57:58.018 --> 0:58:09.037 | |
Then you cannot deal with it, so then it might | |
make sense to remove this. | |
0:58:12.032 --> 0:58:17.437 | |
Now the next step is how to match words into | |
numbers. | |
0:58:17.437 --> 0:58:22.142 | |
Machine learning models deal with some digits. | |
0:58:22.342 --> 0:58:27.091 | |
The first idea is to use words as our basic | |
components. | |
0:58:27.247 --> 0:58:40.695 | |
And then you have a large vocabulary where | |
each word gets referenced to an indigenous. | |
0:58:40.900 --> 0:58:49.059 | |
So your sentence 'go home' is then just a sequence | |
of indices, and that is your input. | |
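As a small illustration of this word-to-index mapping (the toy corpus and the resulting indices are made up for the example):

```python
corpus = ["i go home", "he goes home"]

vocab = {}
for sentence in corpus:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))

def encode(sentence):
    # raises KeyError for words never seen in training,
    # which is exactly the unknown-word problem discussed next
    return [vocab[word] for word in sentence.split()]

print(vocab)              # {'i': 0, 'go': 1, 'home': 2, 'he': 3, 'goes': 4}
print(encode("go home"))  # [1, 2]
```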
0:58:52.052 --> 0:59:00.811 | |
So the nice thing is that you have quite short sequences, | |
so you can deal with them efficiently. | |
0:59:00.811 --> 0:59:01.867 | |
However, there are downsides. | |
0:59:01.982 --> 0:59:11.086 | |
With this, you have not really modeled how words | |
are built internally. | |
0:59:11.086 --> 0:59:16.951 | |
Why is this a problem, or can it be a problem? | |
0:59:17.497 --> 0:59:20.741 | |
And there is an easy solution to deal with | |
unknown words. | |
0:59:20.741 --> 0:59:22.698 | |
You just have one special token, the unknown token. | |
0:59:23.123 --> 0:59:25.906 | |
You map maybe some rare words in your training | |
data to it, so the model learns to deal with it. | |
0:59:26.206 --> 0:59:34.938 | |
That works a bit for some problems, but | |
in general it's not good, because you know nothing | |
0:59:34.938 --> 0:59:35.588 | |
about the word. | |
0:59:35.895 --> 0:59:38.770 | |
At least you can deal with it and maybe map | |
it to something. | |
0:59:38.770 --> 0:59:44.269 | |
So an easy solution in machine translation | |
is always: if it's an unknown word, we just | |
0:59:44.269 --> 0:59:49.642 | |
copy it to the target side, because unknown | |
words are often named entities, and in many | |
0:59:49.642 --> 0:59:52.454 | |
languages the best solution is just to keep them unchanged. | |
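A small sketch of this unknown-word handling (the token name &lt;unk&gt; and the copy-through bookkeeping are illustrative assumptions, not a specific toolkit's behavior):

```python
vocab = {"<unk>": 0, "i": 1, "go": 2, "home": 3}

def encode_with_unk(sentence):
    """Map known words to indices; unknown words become <unk>,
    but are remembered so they can be copied to the output later."""
    ids, copied = [], []
    for word in sentence.split():
        if word in vocab:
            ids.append(vocab[word])
        else:
            ids.append(vocab["<unk>"])
            copied.append(word)   # e.g. a named entity like "Karlsruhe"
    return ids, copied

print(encode_with_unk("i go home to Karlsruhe"))
# ([1, 2, 3, 0, 0], ['to', 'Karlsruhe'])
```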
0:59:53.013 --> 1:00:01.203 | |
So that is somewhat of a trick, but yeah, | |
that's of course not a real solution. | |
1:00:01.821 --> 1:00:08.959 | |
It's also a problem when you deal with full | |
words that you have very few examples for | |
1:00:08.959 --> 1:00:09.451 | |
some of them. | |
1:00:09.949 --> 1:00:17.696 | |
And of course, if you've seen a word only once, you | |
can somehow translate it, but we will | |
1:00:17.696 --> 1:00:24.050 | |
learn that in neural networks you represent words | |
with continuous vectors. | |
1:00:24.264 --> 1:00:26.591 | |
If you have seen them only two, three or four times, | |
1:00:26.591 --> 1:00:31.246 | |
they are not really well learned, and you are | |
typically making most errors on words with a | |
1:00:31.246 --> 1:00:31.763 | |
low count. | |
1:00:33.053 --> 1:00:40.543 | |
And you cannot exploit things which | |
are inside the word. | |
1:00:40.543 --> 1:00:50.303 | |
So if you know that 'house' is, say, index one hundred | |
and twelve, and you now see 'houses', you have | |
1:00:50.303 --> 1:00:51.324 | |
no idea what it means. | |
1:00:51.931 --> 1:00:55.533 | |
Of course, that is not really convenient; humans | |
are better here. | |
1:00:55.533 --> 1:00:58.042 | |
They can use the internal information. | |
1:00:58.498 --> 1:01:04.080 | |
So if we have 'houses', you know that it's | |
the plural form of 'house'. | |
1:01:05.285 --> 1:01:16.829 | |
And for the ones who don't know it in advance: | |
you have this nice word here, and can guess. | |
1:01:16.716 --> 1:01:20.454 | |
You won't know the meaning of these words. | |
1:01:20.454 --> 1:01:25.821 | |
However, all of you will know it is the fear | |
of something. | |
1:01:26.686 --> 1:01:39.437 | |
From the ending: '-phobia' is always | |
the fear of something, even if you don't know of what. | |
1:01:39.879 --> 1:01:46.618 | |
So we can split words into parts, and that | |
is helpful to deal with this. | |
1:01:46.618 --> 1:01:49.888 | |
This ending, for example, tells you it is a fear of something. | |
1:01:50.450 --> 1:02:04.022 | |
It's not very important, it doesn't happen | |
very often, and it's also not necessary | |
1:02:04.022 --> 1:02:10.374 | |
for understanding to know everything. | |
1:02:15.115 --> 1:02:18.791 | |
So what can we do instead? | |
1:02:18.791 --> 1:02:29.685 | |
One thing which we could do instead is to | |
represent words at the other extreme: as characters. | |
1:02:29.949 --> 1:02:42.900 | |
So you really go down to single characters: each | |
letter gets its own symbol, and you also need a symbol for the space. | |
1:02:43.203 --> 1:02:55.875 | |
So you now have a representation for each | |
character, which enables you to implicitly learn | |
1:02:55.875 --> 1:03:01.143 | |
morphology, because words which share parts share characters. | |
1:03:01.541 --> 1:03:05.517 | |
And you can then deal with unknown words. | |
1:03:05.517 --> 1:03:10.344 | |
There's still not everything you can process, | |
but much more. | |
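A minimal sketch of such a character-level representation (using '_' as an assumed explicit space symbol; the exact symbol is just an illustration):

```python
def to_characters(sentence):
    """Character-level representation: every letter is a token and the
    space becomes an explicit symbol so words can be recovered later."""
    return [ch if ch != " " else "_" for ch in sentence]

chars = to_characters("he goes")
vocab = {c: i for i, c in enumerate(sorted(set(chars)))}
print(chars)                      # ['h', 'e', '_', 'g', 'o', 'e', 's']
print([vocab[c] for c in chars])  # indices into a very small vocabulary
```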
1:03:11.851 --> 1:03:16.953 | |
So if you go down to the character level, what might | |
still be a problem? | |
1:03:18.598 --> 1:03:24.007 | |
All characters which you haven't seen; | |
that happens nowadays a bit more often | |
1:03:24.007 --> 1:03:25.140 | |
with new emojis. | |
1:03:25.140 --> 1:03:26.020 | |
Those you couldn't handle. | |
1:03:26.020 --> 1:03:31.366 | |
It could also be that you translate, say, | |
between German and English, and then there is | |
1:03:31.366 --> 1:03:35.077 | |
a Japanese or Chinese character that you cannot | |
translate. | |
1:03:35.435 --> 1:03:43.938 | |
But most of the time, all characters that occur | |
have been seen, so this works very well. | |
1:03:44.464 --> 1:03:58.681 | |
This is the first nice thing: you have a | |
very small vocabulary size, and one big part | |
1:03:58.681 --> 1:04:01.987 | |
of the computation in | |
1:04:02.222 --> 1:04:11.960 | |
neural networks depends on the | |
vocabulary size, so if you are efficient there | |
1:04:11.960 --> 1:04:13.382 | |
it's better. | |
1:04:14.914 --> 1:04:26.998 | |
On the other hand, the problem is that you now | |
have very long sequences; if you think about | |
1:04:26.998 --> 1:04:29.985 | |
it, before you had one token per word. | |
1:04:30.410 --> 1:04:43.535 | |
Your computation often depends on your input | |
length, and not only linearly but quadratically, so it grows | |
1:04:43.535 --> 1:04:44.410 | |
even more. | |
1:04:44.504 --> 1:04:49.832 | |
And of course it might also be that you just | |
generally make things more complicated than | |
1:04:49.832 --> 1:04:50.910 | |
they were before. | |
1:04:50.951 --> 1:04:58.679 | |
We said before we want to make things easy, but now, if | |
we really have to analyze each character independently, | |
1:04:58.679 --> 1:05:05.003 | |
we cannot directly learn what 'university' | |
means as one unit, but we have to learn that there | |
1:05:05.185 --> 1:05:12.179 | |
is a 'u' at the beginning, then an 'n', then | |
an 'i', and so on, and that all this together means | |
1:05:12.179 --> 1:05:17.273 | |
'university', but another combination of these | |
letters means something completely different. | |
1:05:17.677 --> 1:05:24.135 | |
So of course you make everything here a lot | |
more complicated than you have on word basis. | |
1:05:24.744 --> 1:05:32.543 | |
Character-based models work very well in conditions | |
with little data, because you have seen each word | |
1:05:32.543 --> 1:05:33.578 | |
only very rarely. | |
1:05:33.578 --> 1:05:38.751 | |
That's not enough to learn from, but you have seen all | |
the letters much more often. | |
1:05:38.751 --> 1:05:44.083 | |
So if you have a scenario with very little data, | |
this is one good option. | |
1:05:46.446 --> 1:05:59.668 | |
The other idea is to split, but not to go to either | |
extreme, so neither taking full words nor taking | |
1:05:59.668 --> 1:06:06.573 | |
only characters, but doing something in between. | |
1:06:07.327 --> 1:06:12.909 | |
And one of these ideas has been used for a | |
long time. | |
1:06:12.909 --> 1:06:17.560 | |
It's called compound splitting; there you only split compounds, like the German word | |
1:06:17.477 --> 1:06:18.424 | |
'Baumstamm'. | |
1:06:18.424 --> 1:06:24.831 | |
You see that 'Baum' and 'Stamm' occur very often, | |
maybe more often than 'Baumstamm'. | |
1:06:24.831 --> 1:06:28.180 | |
Then you split 'Baumstamm' into 'Baum' and 'Stamm' and use | |
these parts. | |
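A toy sketch of frequency-based compound splitting in this spirit (the corpus counts and the geometric-mean scoring are illustrative assumptions in the style of classic compound-splitting work, not necessarily the exact method used here):

```python
# Split a compound only if its parts are (geometrically) more frequent in the
# corpus than the compound itself. The counts below are made up.
counts = {"baum": 500, "stamm": 300, "baumstamm": 20}

def split_compound(word, min_part_len=3):
    best, best_score = [word], counts.get(word, 0)
    for i in range(min_part_len, len(word) - min_part_len + 1):
        left, right = word[:i], word[i:]
        if left in counts and right in counts:
            score = (counts[left] * counts[right]) ** 0.5  # geometric mean
            if score > best_score:
                best, best_score = [left, right], score
    return best

print(split_compound("baumstamm"))  # ['baum', 'stamm']
```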
1:06:29.509 --> 1:06:44.165 | |
But it's not even that easy; it will learn wrong | |
splits. We did that in our old systems, and | |
1:06:44.165 --> 1:06:47.708 | |
there is the word 'asiatisch'. | |
1:06:48.288 --> 1:06:56.137 | |
And the splitter splits it into 'Asia' and 'Tisch', which of course is not a really | |
good way of dealing with it, because it is not semantic. | |
1:06:56.676 --> 1:07:05.869 | |
The good thing is we didn't really have to care that | |
much about it, because the system still learned | |
1:07:05.869 --> 1:07:09.428 | |
what 'Asia' and 'Tisch' together mean. | |
1:07:09.729 --> 1:07:17.452 | |
So you can of course learn all that, but the compound | |
split doesn't really help you to get a deeper | |
1:07:17.452 --> 1:07:18.658 | |
understanding. | |
1:07:21.661 --> 1:07:23.364 | |
The thing, of course, is that it can also go wrong. | |
1:07:23.943 --> 1:07:30.475 | |
Yeah, there was one paper reporting a case where this doesn't | |
work; I think it's called 'Burning | |
1:07:30.475 --> 1:07:30.972 | |
Ducks'. | |
1:07:30.972 --> 1:07:37.503 | |
I think it was because there was a German compound | |
which you could split into two parts, | |
1:07:37.503 --> 1:07:43.254 | |
and sometimes you have to add an 'e' to form | |
the compound; here that produced a wrong split. | |
1:07:43.583 --> 1:07:48.515 | |
So it translated the word into 'burning duck'. | |
1:07:48.888 --> 1:07:56.127 | |
So of course you can introduce some | |
additional errors this way, but in general | |
1:07:56.127 --> 1:07:57.221 | |
it's a good approach. | |
1:07:57.617 --> 1:08:03.306 | |
Of course there is a trade-off with the vocabulary | |
size: you want a small vocabulary | |
1:08:03.306 --> 1:08:08.812 | |
size so you've seen everything more often, but | |
the length of the sequences should not be too | |
1:08:08.812 --> 1:08:13.654 | |
long, because if you split more often you get | |
fewer different types but longer sequences. | |
1:08:16.896 --> 1:08:25.281 | |
The motivation, and the advantage compared | |
to character-based models, is that you can directly | |
1:08:25.281 --> 1:08:33.489 | |
learn the representation for words that occur | |
very often, while still being able to represent | |
1:08:33.489 --> 1:08:35.783 | |
rare words by splitting them into parts. | |
1:08:36.176 --> 1:08:42.973 | |
And while first this was only done for compounds, | |
nowadays there's an algorithm which really | |
1:08:42.973 --> 1:08:49.405 | |
tries to do it for everything; there are | |
different approaches, to be honest, like compound splitting | |
1:08:49.405 --> 1:08:50.209 | |
and so on. | |
1:08:50.209 --> 1:08:56.129 | |
But the most successful one which is commonly | |
used is based on data compression. | |
1:08:56.476 --> 1:08:59.246 | |
And there the idea is okay. | |
1:08:59.246 --> 1:09:06.765 | |
Can we find an encoding so that the text is | |
compressed in the most efficient way? | |
1:09:07.027 --> 1:09:22.917 | |
And the compression algorithm is called | |
byte pair encoding, and this is then also used | |
1:09:22.917 --> 1:09:25.625 | |
for splitting. | |
1:09:26.346 --> 1:09:39.164 | |
And the idea is that we recursively replace the | |
most frequent pair of bytes by a new byte. | |
1:09:39.819 --> 1:09:51.926 | |
For language, you now first split all your | |
words into letters, and then you look at what | |
1:09:51.926 --> 1:09:59.593 | |
is the most frequent bigram, that is, which two letters | |
occur together most often. | |
1:10:00.040 --> 1:10:04.896 | |
And then you replace it and repeat until you | |
have a fixed vocabulary. | |
1:10:04.985 --> 1:10:08.031 | |
So that's a nice thing. | |
1:10:08.031 --> 1:10:16.663 | |
Now you can predefine the vocabulary size with which you want | |
to represent your text | |
1:10:16.936 --> 1:10:28.486 | |
by hand, and then you can represent any text | |
with these symbols; of course, the larger it is, the shorter | |
1:10:28.486 --> 1:10:30.517 | |
your text will be. | |
1:10:32.772 --> 1:10:36.543 | |
So the original idea was something like that. | |
1:10:36.543 --> 1:10:39.411 | |
We have the sequence A, B, A, B, C. | |
1:10:39.411 --> 1:10:45.149 | |
For example, a common bigram is A B, so | |
you can replace it by a new symbol, say D. | |
1:10:45.149 --> 1:10:46.788 | |
Then the text gets shorter. | |
1:10:48.108 --> 1:10:53.615 | |
Then you can repeat this: you get D, D, | |
C and so on, so this is then your compressed text. | |
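A tiny sketch of this pair-replacement compression step (using the toy sequence from above; the new symbol name D is just the example's choice):

```python
from collections import Counter

def compress_once(seq, new_symbol):
    """Replace the most frequent adjacent pair in seq by new_symbol."""
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return seq
    (a, b), _ = pairs.most_common(1)[0]
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

print(compress_once(list("ABABC"), "D"))  # ['D', 'D', 'C']
```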
1:10:54.514 --> 1:11:00.691 | |
Similarly, we can now do it for tokenization. | |
1:11:01.761 --> 1:11:05.436 | |
Let's assume you have these sentences. | |
1:11:05.436 --> 1:11:11.185 | |
'I go', 'he goes', 'she goes', so your vocabulary | |
is I, go, goes, he, she. | |
1:11:11.851 --> 1:11:30.849 | |
And the first thing you're doing is to split | |
your corpus into single characters. | |
1:11:30.810 --> 1:11:34.692 | |
Thereby you should later be able to split it into | |
words again, like splitting sentences into words. | |
1:11:34.692 --> 1:11:38.980 | |
Because now you only have characters, you | |
don't know the word boundaries. | |
1:11:38.980 --> 1:11:44.194 | |
You introduce the word boundaries by having | |
a special symbol at the end of each word, and | |
1:11:44.194 --> 1:11:46.222 | |
then, whenever this symbol occurs, you know | |
1:11:46.222 --> 1:11:48.366 | |
you can split there and start a new word. | |
1:11:48.708 --> 1:11:55.245 | |
So you have the corpus I go, he goes, and | |
she goes, and then you have now here the sequences | |
1:11:55.245 --> 1:11:56.229 | |
of characters. | |
1:11:56.229 --> 1:12:02.625 | |
So this is the character-based representation, | |
and now you calculate the bigram statistics. | |
1:12:02.625 --> 1:12:08.458 | |
So 'i' plus end-of-word occurs one time, 'g' | |
and 'o' occur three times, and so on. | |
1:12:09.189 --> 1:12:18.732 | |
And these are all the others, and now you | |
look which pair is the most frequent. | |
1:12:19.119 --> 1:12:26.046 | |
So then you have learned the first rule: | |
1:12:26.046 --> 1:12:39.235 | |
if you have 'g' and 'o' together, you merge them, and you get these new | |
units: 'go' is now no longer two symbols, but | |
1:12:39.235 --> 1:12:41.738 | |
one single symbol, because you have joined them. | |
1:12:42.402 --> 1:12:51.175 | |
And then you have here the new counts | |
of the bigrams, and so on. | |
1:12:52.092 --> 1:13:01.753 | |
In such a small example you now have a lot of rules | |
which occur the same number of times. | |
1:13:01.753 --> 1:13:09.561 | |
In reality that is happening sometimes but | |
not that often. | |
1:13:10.370 --> 1:13:21.240 | |
Next you merge, for example, a pair with the end-of-word symbol, and this | |
way you go on until you have your vocabulary. | |
1:13:21.601 --> 1:13:38.242 | |
And your vocabulary then consists of these rules, so | |
people speak of the vocabulary as the set of merge rules. | |
1:13:38.658 --> 1:13:43.637 | |
And these are the rules; if you now have a | |
different sentence, something like 'they tell', | |
1:13:44.184 --> 1:13:53.600 | |
then your final output looks something | |
like this: | |
1:13:53.600 --> 1:13:59.250 | |
these two words are represented by these subword units. | |
1:14:00.940 --> 1:14:06.398 | |
And that is your algorithm. | |
1:14:06.398 --> 1:14:18.873 | |
Now you can represent any type of text with | |
a fixed vocabulary. | |
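A compact sketch of the whole procedure on the toy corpus from the example (learning merge rules and applying them to new words; the end-of-word marker '&lt;/w&gt;' and the number of merges are illustrative choices):

```python
from collections import Counter

END = "</w>"  # end-of-word marker so word boundaries can be restored

def merge(symbols, pair):
    """Replace every occurrence of pair by the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge rules from a word-frequency dictionary."""
    corpus = {tuple(w) + (END,): f for w, f in word_freqs.items()}
    rules = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        rules.append(best)
        corpus = {merge(symbols, best): f for symbols, f in corpus.items()}
    return rules

def apply_bpe(word, rules):
    symbols = tuple(word) + (END,)
    for pair in rules:        # apply merges in the order they were learned
        symbols = merge(symbols, pair)
    return symbols

# toy corpus: "i go", "he goes", "she goes"
word_freqs = {"i": 1, "go": 1, "he": 1, "goes": 2, "she": 1}
rules = learn_bpe(word_freqs, num_merges=4)
print(rules)                    # the exact merges depend on tie-breaking
print(apply_bpe("goes", rules))
print(apply_bpe("she", rules))
```

As in the walked-through example, 'g' and 'o' get merged first because that pair is the most frequent; later merges are often ties in such a small corpus.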
1:14:20.400 --> 1:14:23.593 | |
So that is defined at the beginning: | |
1:14:23.593 --> 1:14:27.243 | |
you set how many merges you do, and that is the vocabulary size? | |
1:14:28.408 --> 1:14:35.253 | |
It's nearly correct; in addition there is the number | |
of characters. | |
1:14:35.253 --> 1:14:38.734 | |
So there can be additional entries. | |
1:14:38.878 --> 1:14:49.162 | |
On the one hand, all the right-hand sides | |
of the rules can occur, and then additionally | |
1:14:49.162 --> 1:14:49.721 | |
all the single characters. | |
1:14:49.809 --> 1:14:55.851 | |
In reality it can even happen that your vocabulary | |
is smaller, because it might | |
1:14:55.851 --> 1:15:01.960 | |
happen that, for example, 'go' never occurs | |
on its own at the end, because you always merge | |
1:15:01.960 --> 1:15:06.793 | |
all its occurrences further, so not all right-hand | |
sides actually remain, because | |
1:15:06.746 --> 1:15:11.269 | |
such a rule is never applied alone; afterwards | |
another rule is also applied. | |
1:15:11.531 --> 1:15:15.621 | |
So it is rather an upper bound on your vocabulary | |
size than a fixed number. | |
1:15:20.480 --> 1:15:29.014 | |
Then we come to the last part, which is about | |
parallel data; are there any questions beforehand? | |
1:15:36.436 --> 1:15:38.824 | |
So what is parallel data? | |
1:15:38.824 --> 1:15:47.368 | |
As we said, for machine translation it is really, | |
really important that we are dealing with parallel | |
1:15:47.368 --> 1:15:52.054 | |
data; that means we have aligned input and | |
output. | |
1:15:52.054 --> 1:15:54.626 | |
You have this type of data. | |
1:15:55.015 --> 1:16:01.773 | |
However, in machine translation we have one | |
very big advantage: this data is somewhat naturally | |
1:16:01.773 --> 1:16:07.255 | |
occurring, so you have a lot of parallel data | |
which you can simply collect. | |
1:16:07.255 --> 1:16:13.788 | |
In many other NLP tasks you need to manually annotate | |
your data to generate the aligned data. | |
1:16:14.414 --> 1:16:22.540 | |
We would have to manually create translations, and | |
of course that is very expensive; it's | |
1:16:22.540 --> 1:16:29.281 | |
really expensive to pay for, like, one million | |
sentences to be translated. | |
1:16:29.889 --> 1:16:36.952 | |
The nice thing is that there is normally data | |
available, because other people have already produced | |
1:16:36.952 --> 1:16:37.889 | |
translations. | |
1:16:40.120 --> 1:16:44.672 | |
So this data is out there, and of course we have | |
to collect and process it. | |
1:16:44.672 --> 1:16:51.406 | |
We'll have a full lecture on how to deal with | |
more complex situations. | |
1:16:52.032 --> 1:16:56.645 | |
The idea is really you don't do really much | |
human work. | |
1:16:56.645 --> 1:17:02.825 | |
You really just start the crawler with some | |
initial start pages, and then it collects the data. | |
1:17:03.203 --> 1:17:07.953 | |
But a lot of high-quality parallel data is really | |
targeted at specific scenarios. | |
1:17:07.953 --> 1:17:13.987 | |
So, for example, think of the European Parliament | |
as one website where you can easily extract | |
1:17:13.987 --> 1:17:17.581 | |
this information from, and there you have a | |
large amount of data. | |
1:17:17.937 --> 1:17:22.500 | |
Or, like, we have the TED data, which you can | |
also get from the TED website. | |
1:17:23.783 --> 1:17:33.555 | |
So in general, a parallel corpus is a collection | |
of texts with translations into one or several languages. | |
1:17:34.134 --> 1:17:42.269 | |
And this data is important because normally there is | |
no general MT system, but you work on specific scenarios. | |
1:17:42.222 --> 1:17:46.732 | |
It works especially well if your training | |
and test conditions are similar. | |
1:17:46.732 --> 1:17:50.460 | |
So if the topic is similar, the style or modality | |
is similar. | |
1:17:50.460 --> 1:17:55.391 | |
So if you want to translate speech, it's often | |
better to also train on speech. | |
1:17:55.391 --> 1:17:58.818 | |
If you want to translate text, it's better | |
to train on text. | |
1:17:59.379 --> 1:18:08.457 | |
And there is a lot of these data available | |
nowadays for common languages. | |
1:18:08.457 --> 1:18:12.014 | |
You can normally just start with that. | |
1:18:12.252 --> 1:18:15.298 | |
It's really available. | |
1:18:15.298 --> 1:18:27.350 | |
For example, OPUS is a big website collecting | |
different types of parallel corpora, where you | |
1:18:27.350 --> 1:18:29.601 | |
can select them. | |
1:18:29.529 --> 1:18:33.276 | |
You have this document alignment; we will come | |
to that later. | |
1:18:33.553 --> 1:18:39.248 | |
There are things like comparable data, where | |
you have not full sentences but only some parts | |
1:18:39.248 --> 1:18:40.062 | |
of parallel. | |
1:18:40.220 --> 1:18:48.700 | |
But now, first, let's assume we have an easy task | |
like the European Parliament, where we have the speech | |
1:18:48.700 --> 1:18:55.485 | |
in German and the speech in English and you | |
need to generate parallel data. | |
1:18:55.485 --> 1:18:59.949 | |
That means you have to align the source and target sentences. | |
1:19:00.000 --> 1:19:01.573 | |
And doing this right. | |
1:19:05.905 --> 1:19:08.435 | |
How can we do that? | |
1:19:08.435 --> 1:19:19.315 | |
And that is what people refer to as sentence | |
alignment, so we have parallel documents in two | |
1:19:19.315 --> 1:19:20.707 | |
languages. | |
1:19:22.602 --> 1:19:32.076 | |
You normally cannot do that word | |
by word, because there is no direct correspondence | |
1:19:32.076 --> 1:19:34.158 | |
between the words, but it is | |
1:19:34.074 --> 1:19:39.837 | |
relatively well possible on the sentence level; it | |
will not be perfect, so you sometimes have | |
1:19:39.837 --> 1:19:42.535 | |
two sentences in English and one in German. | |
1:19:42.535 --> 1:19:47.992 | |
Germans like to have these long sentences with | |
sub-clauses and so on, so there you can do | |
1:19:47.992 --> 1:19:51.733 | |
it, but with long sentences it might not be | |
really possible. | |
1:19:55.015 --> 1:19:59.454 | |
And for some data we saw that sentence markers are not | |
there, so it's more complicated. | |
1:19:59.819 --> 1:20:10.090 | |
So how can we formalize this sentence alignment | |
problem? | |
1:20:10.090 --> 1:20:16.756 | |
So we have a set of source sentences. | |
1:20:17.377 --> 1:20:22.167 | |
And machine translation relatively often. | |
1:20:22.167 --> 1:20:32.317 | |
Sometimes source sentences nowadays are and, | |
but traditionally it was and because people | |
1:20:32.317 --> 1:20:34.027 | |
started using. | |
1:20:34.594 --> 1:20:45.625 | |
And then the idea is to find this alignment, | |
where we align segments of source and target. | |
1:20:46.306 --> 1:20:50.421 | |
And of course you want these sequences to | |
be as short as possible. | |
1:20:50.421 --> 1:20:56.400 | |
Of course, an easy solution is: here are all my | |
source sentences and here are all my target sentences. | |
1:20:56.756 --> 1:21:07.558 | |
So you want to have short sequences, typically | |
one sentence or at most two or three sentences, | |
1:21:07.558 --> 1:21:09.340 | |
so that it is really useful. | |
1:21:13.913 --> 1:21:21.479 | |
Then there are different restrictions on | |
this type of alignment: first of all, | |
1:21:21.479 --> 1:21:29.131 | |
it should be a monotone alignment, which | |
means that the segments on the source side should | |
1:21:29.131 --> 1:21:31.218 | |
come one after the other. | |
1:21:31.431 --> 1:21:36.428 | |
So we assume that the document really is | |
monotone and goes the same way in source and target. | |
1:21:36.957 --> 1:21:41.965 | |
Course for a very free translation that might | |
not be valid anymore. | |
1:21:41.965 --> 1:21:49.331 | |
But this algorithm, the first one, the Church | |
and Gale algorithm, is meant for translations | |
1:21:49.331 --> 1:21:51.025 | |
which are very direct. | |
1:21:51.025 --> 1:21:54.708 | |
So each segment should come right after the | |
previous one. | |
1:21:55.115 --> 1:22:04.117 | |
Then we want to cover the full sequence, | |
and of course each segment should start before | |
1:22:04.117 --> 1:22:04.802 | |
it ends. | |
1:22:05.525 --> 1:22:22.654 | |
And then you want to have something like this, | |
where you have one-to-one or two-to-one alignments. | |
1:22:25.525 --> 1:22:41.851 | |
The alignment types are: one-to-one, of course; | |
then sometimes insertions and deletions, where | |
1:22:41.851 --> 1:22:43.858 | |
some information is added or removed. | |
1:22:44.224 --> 1:22:50.412 | |
An insertion can be, for example, an explanation: it can | |
be that some term is known in the one language | |
1:22:50.412 --> 1:22:51.018 | |
but not in the other. | |
1:22:51.111 --> 1:22:53.724 | |
Think of things like Deutschland ticket. | |
1:22:53.724 --> 1:22:58.187 | |
In Germany everybody will by now know what | |
the Deutschland ticket is. | |
1:22:58.187 --> 1:23:03.797 | |
But if you translate it to English it might | |
be important to explain it and other things | |
1:23:03.797 --> 1:23:04.116 | |
are. | |
1:23:04.116 --> 1:23:09.853 | |
So sometimes you have to explain things and | |
then you have more sentences with insertions. | |
1:23:10.410 --> 1:23:15.956 | |
Then you have two-to-one and one-to-two alignments, | |
and that is, for example, in German you have | |
1:23:15.956 --> 1:23:19.616 | |
a lot of sub-clauses, and maybe these are expressed | |
by two sentences in English. | |
1:23:20.580 --> 1:23:37.725 | |
Of course, it might be more complex, but typically | |
we make it simple and only allow for these types | |
1:23:37.725 --> 1:23:40.174 | |
of alignment. | |
1:23:41.301 --> 1:23:56.588 | |
Then it is about finding the alignment, and | |
for that we define a score, where we just take | |
1:23:56.588 --> 1:23:59.575 | |
a general score. | |
1:24:00.000 --> 1:24:04.011 | |
That is what the Gale and Church algorithm does; | |
it scores the matching of one segment pair. | |
1:24:04.011 --> 1:24:09.279 | |
If you have one segment pair, you score it, and the assumption | |
is that the global alignment | |
1:24:09.279 --> 1:24:13.828 | |
is as good as the product of all single steps | |
and then you have two scores. | |
1:24:13.828 --> 1:24:18.558 | |
First of all, you say one-to-one alignments | |
are much more likely than all the others. | |
1:24:19.059 --> 1:24:26.884 | |
And then you have a lexical similarity, which | |
is, for example, based on an initial dictionary, | |
1:24:26.884 --> 1:24:30.713 | |
where you count how many dictionary entries match. | |
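A rough sketch of such a scoring-based monotone sentence aligner (a simplified dictionary-based variant; the priors, the smoothing, and the dynamic program are illustrative assumptions and not the exact Gale and Church formulation, which scores sentence lengths):

```python
import math

# prior preference for alignment types: 1-1 is much more likely than the rest
PRIORS = {(1, 1): 0.89, (1, 2): 0.05, (2, 1): 0.05, (1, 0): 0.005, (0, 1): 0.005}

def lex_score(src_sents, tgt_sents, dictionary):
    """Fraction of source words with a dictionary translation in the target."""
    src = [w for s in src_sents for w in s.lower().split()]
    tgt = set(w for s in tgt_sents for w in s.lower().split())
    if not src:
        return 0.5
    hits = sum(1 for w in src if dictionary.get(w) in tgt)
    return (hits + 1) / (len(src) + 2)  # smoothed so the log is defined

def align(src, tgt, dictionary):
    """Monotone alignment maximizing the sum of log scores (dynamic program)."""
    INF = float("-inf")
    best = [[INF] * (len(tgt) + 1) for _ in range(len(src) + 1)]
    back = [[None] * (len(tgt) + 1) for _ in range(len(src) + 1)]
    best[0][0] = 0.0
    for i in range(len(src) + 1):
        for j in range(len(tgt) + 1):
            if best[i][j] == INF:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > len(src) or nj > len(tgt):
                    continue
                score = best[i][j] + math.log(prior) + math.log(
                    lex_score(src[i:ni], tgt[j:nj], dictionary))
                if score > best[ni][nj]:
                    best[ni][nj] = score
                    back[ni][nj] = (i, j)
    links, (i, j) = [], (len(src), len(tgt))
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        links.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return list(reversed(links))

src = ["ich gehe nach hause", "es regnet"]
tgt = ["i go home", "it is raining"]
dico = {"ich": "i", "gehe": "go", "hause": "home", "es": "it", "regnet": "raining"}
print(align(src, tgt, dico))
# expected: [([0], [0]), ([1], [1])]
```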
1:24:31.091 --> 1:24:35.407 | |
So this is a very simple algorithm. | |
1:24:35.407 --> 1:24:41.881 | |
This is typically what you do as a first step, and | |
then you can refine. | |
1:24:43.303 --> 1:24:54.454 | |
And with this you can get | |
an initial alignment and then obtain better parallel | |
1:24:54.454 --> 1:24:55.223 | |
data. | |
1:24:55.675 --> 1:25:02.369 | |
So it is an optimization problem: based on the | |
scores, you can calculate a score | |
1:25:02.369 --> 1:25:07.541 | |
for each possible alignment and then | |
select the best one. | |
1:25:07.541 --> 1:25:14.386 | |
Of course, you won't try all possibilities | |
out but you can do a good search and then find | |
1:25:14.386 --> 1:25:15.451 | |
the best one. | |
1:25:15.815 --> 1:25:18.726 | |
This can typically be done automatically. | |
1:25:18.726 --> 1:25:25.456 | |
Of course, you should do some checks, like | |
whether the sentences could be aligned well. | |
1:25:26.766 --> 1:25:32.043 | |
Training data is typically | |
aligned this way. | |
1:25:32.043 --> 1:25:35.045 | |
Maybe for test data you would check it manually. | |
1:25:40.000 --> 1:25:47.323 | |
Sorry, I'm a bit late because originally wanted | |
to do a quiz at the end. | |
1:25:47.323 --> 1:25:49.129 | |
Can we do a quiz? | |
1:25:49.429 --> 1:25:51.833 | |
We'll do it somewhere else. | |
1:25:51.833 --> 1:25:56.813 | |
We had a bachelor project about making quiz | |
for lectures. | |
1:25:56.813 --> 1:25:59.217 | |
And I still want to try it. | |
1:25:59.217 --> 1:26:04.197 | |
So let's see I hope in some other lecture | |
we can do that. | |
1:26:04.197 --> 1:26:09.435 | |
Then we can, at the end of the lecture, do | |
a quiz about the content. | |
1:26:09.609 --> 1:26:13.081 | |
All we can do is the practical part; let's | |
see. | |
1:26:13.533 --> 1:26:24.719 | |
And that's it for today. What you should remember is | |
what parallel data is and how we can | |
1:26:25.045 --> 1:26:29.553 | |
create parallel data, and how to generally | |
process data. | |
1:26:29.553 --> 1:26:36.435 | |
How you think about the data is really important | |
when you build systems, and there are different ways to represent words. | |
1:26:36.696 --> 1:26:46.857 | |
The three main options are full words, working directly | |
on the character level, or using subword units. | |
1:26:47.687 --> 1:26:49.634 | |
Is there any question? | |
1:26:52.192 --> 1:26:57.768 | |
Yes: is this alignment thing comparable to dynamic | |
time warping? | |
1:27:00.000 --> 1:27:05.761 | |
It's not directly using dynamic time warping, | |
but the idea is similar, and you can use | |
1:27:05.761 --> 1:27:11.771 | |
this type of similar algorithm; the | |
main thing, and the difficulty, | |
1:27:11.771 --> 1:27:14.807 | |
is to define your loss function | |
here. | |
1:27:14.807 --> 1:27:16.418 | |
What is a good alignment? | |
1:27:16.736 --> 1:27:24.115 | |
But as in dynamic time warping, you | |
have a monotone alignment in there, and you | |
1:27:24.115 --> 1:27:26.150 | |
cannot have reordering. | |
1:27:30.770 --> 1:27:40.121 | |
Alright, then, thanks a lot, and next time we | |
will continue from there. | |