WEBVTT 0:00:02.822 --> 0:00:07.880 We look into more linguistic approaches. 0:00:07.880 --> 0:00:14.912 We can do machine translation in a more traditional way. 0:00:14.912 --> 0:00:21.224 The idea is: translation should be generated this way. 0:00:21.224 --> 0:00:27.933 We can analyze first the source sentence, what is the meaning or the syntax. 0:00:27.933 --> 0:00:35.185 Then we transfer this information to the target side and then we generate. 0:00:36.556 --> 0:00:42.341 And this was the strong and commonly used approach for, yeah, several years. 0:00:44.024 --> 0:00:50.839 However, we saw already at the beginning that there are some challenges with that: language is very 0:00:50.839 --> 0:00:57.232 ambiguous, and it's often very difficult to really write hard-coded rules. 0:00:57.232 --> 0:01:05.336 What are the different meanings? And we have to do that also with a living language, so new 0:01:05.336 --> 0:01:06.596 things occur. 0:01:07.007 --> 0:01:09.308 And that's why people looked into: 0:01:09.308 --> 0:01:13.282 can we maybe do it differently and use machine learning? 0:01:13.333 --> 0:01:24.849 So we are no longer giving rules of how to do it, but we just give examples and the system learns from them. 0:01:25.045 --> 0:01:34.836 And one important thing then is these examples: how can we learn how to translate one sentence? 0:01:35.635 --> 0:01:42.516 And therefore, yeah, the data is now really a very important issue. 0:01:42.582 --> 0:01:50.021 And that is what we want to look into today. 0:01:50.021 --> 0:01:58.783 What type of data do we use for machine translation? 0:01:59.019 --> 0:02:08.674 So the idea in preprocessing is always: can we make the task somehow a bit easier, so that 0:02:08.674 --> 0:02:13.180 the MT system will in the end be better? 0:02:13.493 --> 0:02:28.309 So one example could be if it has problems dealing with numbers, because each specific number occurs only rarely. 0:02:28.648 --> 0:02:35.479 Or think about one problem which might still be there in some systems: think about different 0:02:35.479 --> 0:02:36.333 units. 0:02:36.656 --> 0:02:44.897 So a system might learn that, of course, if there's a number in German, in English there should be the same number. 0:02:45.365 --> 0:02:52.270 However, if it looks at parallel text, it will see that in German there is often km, and in English 0:02:52.270 --> 0:02:54.107 typically there is miles. 0:02:54.594 --> 0:03:00.607 It might then just translate three hundred and fifty-five miles into three hundred and fifty-five 0:03:00.607 --> 0:03:04.348 kilometers, which of course is not right, and so forth. 0:03:04.348 --> 0:03:06.953 So it might make sense to look into this. 0:03:07.067 --> 0:03:13.072 Therefore, the first step when you build your machine translation system is normally to look 0:03:13.072 --> 0:03:19.077 at the data, to check it, to see if there is anything happening which you should address 0:03:19.077 --> 0:03:19.887 beforehand. 0:03:20.360 --> 0:03:29.152 And then the second part is: how do you represent words? Machine learning normally works on numbers. 0:03:29.109 --> 0:03:35.404 So the question is how do we get from the words to numbers, and some of 0:03:35.404 --> 0:03:35.766 you 0:03:35.766 --> 0:03:42.568 have seen, for example in Advanced AI, an algorithm for this, which we also shortly repeat 0:03:42.568 --> 0:03:43.075 today.
0:03:43.303 --> 0:03:53.842 The subword unit approach, which was first introduced in machine translation and is now used 0:03:53.842 --> 0:04:05.271 everywhere in order to represent words: Now you've learned about morphology, so you know that maybe in 0:04:05.271 --> 0:04:09.270 English it's not that important. 0:04:09.429 --> 0:04:22.485 In German you have all these different word forms, and to learn an independent representation for each of them is difficult. 0:04:24.024 --> 0:04:26.031 And then, of course, there are more extreme languages. 0:04:27.807 --> 0:04:34.387 So how are we doing 0:04:34.975 --> 0:04:37.099 machine translation? 0:04:37.099 --> 0:04:46.202 So hopefully you remember we had these approaches to machine translation: the rule-based one, 0:04:46.202 --> 0:04:52.473 and we had a big block of corpus-based machine translation. 0:04:52.492 --> 0:05:00.443 We will on Thursday have an overview of statistical models and then afterwards concentrate on the neural ones. 0:05:00.680 --> 0:05:08.828 Both of them are corpus-based machine translation, and therefore what is really essential while 0:05:08.828 --> 0:05:16.640 we are training a machine translation system is what we refer to as parallel data. 0:05:16.957 --> 0:05:22.395 We talk a lot about parallel corpora or parallel data, and what I mean there is something which you 0:05:22.395 --> 0:05:28.257 might know from the Rosetta Stone or something like that, so typically you have one sentence 0:05:28.257 --> 0:05:33.273 in the one language, and then you have aligned to it one sentence in the target language. 0:05:33.833 --> 0:05:38.261 And this is how we train all our systems: on these alignments. 0:05:38.261 --> 0:05:43.181 We'll see today that of course we might not always have that. 0:05:43.723 --> 0:05:51.279 However, this is relatively easy to create, at least for high-quality data. 0:05:51.279 --> 0:06:00.933 We will look into data crawling, so that means how we can automatically create this parallel data 0:06:00.933 --> 0:06:02.927 from the Internet. 0:06:04.144 --> 0:06:13.850 It's not so difficult to learn these alignments if we have some type of dictionary, so which 0:06:13.850 --> 0:06:16.981 sentence is aligned to which. 0:06:18.718 --> 0:06:25.069 What would, of course, be a lot more difficult is really to do word alignment, and that's also 0:06:25.069 --> 0:06:27.476 often no longer that well possible. 0:06:27.476 --> 0:06:33.360 We do that automatically in some cases, but it's definitely more challenging. 0:06:33.733 --> 0:06:40.691 For sentence alignment, of course, it's still not always perfect, so it might be that 0:06:40.691 --> 0:06:46.085 there are two German sentences and one English sentence or the other way around. 0:06:46.085 --> 0:06:53.511 So there's not always a perfect alignment, but if you look at text, it's still mostly relatively easy. 0:06:54.014 --> 0:07:03.862 If we have that, then we can build a machine learning model which tries to map the source 0:07:03.862 --> 0:07:06.239 sentences to the target sentences. 0:07:06.626 --> 0:07:15.932 So this is the idea behind statistical machine translation and neural machine translation. 0:07:15.932 --> 0:07:27.098 The difference is: statistical machine translation is typically a whole box of different models 0:07:27.098 --> 0:07:30.205 which try to evaluate the translation quality. 0:07:30.510 --> 0:07:42.798 In neural machine translation, it's all one large neural network where we use the source sentence as 0:07:42.798 --> 0:07:43.667 input. 0:07:44.584 --> 0:07:50.971 And then we can train it by having exactly this mapped, or parallel, data.
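To make the sentence-alignment idea concrete, here is a minimal sketch in Python. This is my own illustration of the general dictionary-plus-length idea mentioned above, not the exact algorithm or tool from the lecture; the scoring weights and the tiny dictionary are assumptions.

```python
# Sketch: score candidate sentence pairs by length similarity and bilingual
# dictionary overlap, then pick the best target sentence for each source sentence.

def overlap_score(src_sent, tgt_sent, dictionary):
    """Fraction of source words whose dictionary translation appears in the target."""
    src_words = src_sent.lower().split()
    tgt_words = set(tgt_sent.lower().split())
    hits = sum(1 for w in src_words if dictionary.get(w) in tgt_words)
    return hits / max(len(src_words), 1)

def length_score(src_sent, tgt_sent):
    """Parallel sentences tend to have similar lengths; penalize big differences."""
    ls, lt = len(src_sent.split()), len(tgt_sent.split())
    return min(ls, lt) / max(ls, lt, 1)

def align(src_sents, tgt_sents, dictionary):
    pairs = []
    for i, s in enumerate(src_sents):
        # pick the target sentence with the best combined score (equal weights assumed)
        j, best = max(
            ((j, 0.5 * overlap_score(s, t, dictionary) + 0.5 * length_score(s, t))
             for j, t in enumerate(tgt_sents)),
            key=lambda x: x[1],
        )
        pairs.append((i, j, best))
    return pairs

# toy example with a hypothetical two-entry dictionary
dictionary = {"house": "haus", "green": "grün"}
print(align(["the house is green"], ["das haus ist grün"], dictionary))
```

Real aligners handle 2-to-1 and 0-to-1 sentence mappings as mentioned in the lecture; this sketch only does the easy 1-to-1 case.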
0:07:54.214 --> 0:08:02.964 So what we want to look at today is: we want to first look at general text data. 0:08:03.083 --> 0:08:06.250 So what is text data? 0:08:06.250 --> 0:08:09.850 What text data is there? 0:08:09.850 --> 0:08:18.202 Why is it challenging, namely that we have large vocabularies? 0:08:18.378 --> 0:08:22.003 It's so that you always have words which you haven't seen. 0:08:22.142 --> 0:08:29.053 If you increase your corpus size, normally you will also increase your vocabulary, so you 0:08:29.053 --> 0:08:30.744 always find new words. 0:08:31.811 --> 0:08:39.738 Then, based on that, we'll look into pre-processing. 0:08:39.738 --> 0:08:45.333 So how can we pre-process our data? 0:08:45.333 --> 0:08:46.421 Maybe. 0:08:46.526 --> 0:08:54.788 This is a lot about tokenization, for example, which, as we heard, is not so challenging in European 0:08:54.788 --> 0:09:02.534 languages but still important, but might be really difficult in Asian languages where you 0:09:02.534 --> 0:09:05.030 don't have space separation. 0:09:05.986 --> 0:09:12.161 And this preprocessing typically tries to deal with the extreme cases where you have 0:09:12.161 --> 0:09:13.105 seen things only rarely. 0:09:13.353 --> 0:09:25.091 If you have seen your words one hundred times, it doesn't really matter if you have 0:09:25.091 --> 0:09:31.221 seen them with or without punctuation or so. 0:09:31.651 --> 0:09:38.578 And then we look into word representation, so what is the best way to represent a word? 0:09:38.578 --> 0:09:45.584 And finally, we look into the other type of data we really need for machine translation. 0:09:45.725 --> 0:09:56.842 So first parallel data, which we can use for many tasks, and later we can also use purely monolingual data 0:09:56.842 --> 0:10:00.465 to make machine translation better. 0:10:00.660 --> 0:10:03.187 In the traditional approach that was easier. 0:10:03.483 --> 0:10:08.697 We had this type of language model which we can train only on the target data to make 0:10:08.697 --> 0:10:12.173 the text more fluent. In a neural machine translation model 0:10:12.173 --> 0:10:18.106 it's partly a bit more complicated to integrate this data, but still it's very important, especially 0:10:18.106 --> 0:10:22.362 if you think about low-resource languages where you have very little data. 0:10:23.603 --> 0:10:26.999 It's harder to get parallel data than to get monolingual data. 0:10:27.347 --> 0:10:33.821 Because monolingual data you just have out there, not huge amounts for some languages, 0:10:33.821 --> 0:10:38.113 but definitely the amount of data is always significantly larger. 0:10:40.940 --> 0:10:50.454 When we talk about data, it's also of course important how we use it for machine learning. 0:10:50.530 --> 0:11:05.867 And that you hopefully learned in some prior class, so typically we separate our data into 0:11:05.867 --> 0:11:17.848 three chunks: training, validation and test data. The training data is really by far the largest chunk, and it grows with the data we get. 0:11:17.848 --> 0:11:21.387 Today we have here millions of sentences. 0:11:22.222 --> 0:11:27.320 Then we have our validation data, and that is to tune some type of parameters. 0:11:27.320 --> 0:11:33.129 So normally you have some things to configure and you don't know what is the right value, 0:11:33.129 --> 0:11:39.067 so what you can do is train a model, change these a bit and try to find the best ones on 0:11:39.067 --> 0:11:40.164 your validation data.
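As a concrete illustration of the three chunks just described, here is a minimal sketch (my own code, not from the lecture; the sizes and the 42 seed are arbitrary assumptions):

```python
import random

def split_corpus(pairs, valid_size=1000, test_size=1000, seed=42):
    """Split a parallel corpus, given as (source, target) tuples, into train/valid/test."""
    pairs = pairs[:]                        # don't modify the caller's list
    random.Random(seed).shuffle(pairs)
    test  = pairs[:test_size]
    valid = pairs[test_size:test_size + valid_size]
    train = pairs[test_size + valid_size:]  # by far the largest chunk
    return train, valid, test

# usage: train, valid, test = split_corpus(list(zip(src_lines, tgt_lines)))
```

Note that random shuffling alone does not guarantee the "no overlap" requirement discussed below: near-duplicate sentences can still land in both train and test, so in practice you would additionally check that test sentences do not occur in the training data.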
0:11:40.700 --> 0:11:48.531 For a statistical model, for example, the validation data is what you want to use if you have several 0:11:48.531 --> 0:11:54.664 models: you need to know how to combine them, so how much weight should you put on the different 0:11:54.664 --> 0:11:55.186 models? 0:11:55.186 --> 0:11:59.301 And if it's like twenty models, it's only twenty parameters. 0:11:59.301 --> 0:12:02.828 It's not that much, so that can still be reliably estimated. 0:12:03.183 --> 0:12:18.964 In neural models there's often the question of how long you should train the model before you get 0:12:18.964 --> 0:12:21.322 overfitting. 0:12:22.902 --> 0:12:28.679 And then you have your test data, which is where you finally report your results. 0:12:29.009 --> 0:12:33.663 And therefore it's also important that from time to time you get new test data, because 0:12:33.663 --> 0:12:38.423 if throughout your experiments you always test on it, and then you do new experiments 0:12:38.423 --> 0:12:43.452 and test again, at some point you have tested so many times on it that you do some type of training 0:12:43.452 --> 0:12:48.373 on your test data again, because you just select the things which are in the end best on your 0:12:48.373 --> 0:12:48.962 test data. 0:12:49.009 --> 0:12:54.755 It's important to get new test data from time to time; for example, in important evaluation 0:12:54.755 --> 0:12:58.340 campaigns for machine translation and speech translation 0:12:58.618 --> 0:13:07.459 a new test set is created every year, so we can see if the models really 0:13:07.459 --> 0:13:09.761 get better on new data. 0:13:10.951 --> 0:13:19.629 And of course it is important that this is representative of the use case you are interested in. 0:13:19.879 --> 0:13:36.511 So if you're building a system for translating websites, this should be evaluated on websites. 0:13:36.816 --> 0:13:39.356 So normally a system is good on some tasks. 0:13:40.780 --> 0:13:48.596 Or maybe you claim it can translate everything; then your test data should be drawn from everything, because if 0:13:48.596 --> 0:13:54.102 you only have a very small subset, you only know it's good on this. 0:13:54.394 --> 0:14:02.714 Therefore, the selection of your test data is really important in order to ensure that 0:14:02.714 --> 0:14:05.200 the MT system in the end performs well. 0:14:05.525 --> 0:14:12.646 Maybe yours is the greatest system ever, but you have evaluated it on translating the Bible, 0:14:12.646 --> 0:14:21.830 while the use case is to translate some Twitter data, and you can imagine the performance might 0:14:21.830 --> 0:14:22.965 be really different. 0:14:23.803 --> 0:14:25.471 And finally, 0:14:25.471 --> 0:14:35.478 of course, in order to have a realistic evaluation, it's important that there's no 0:14:35.478 --> 0:14:39.370 overlap between these data sets. 0:14:39.799 --> 0:14:51.615 Because the danger might be that the model is learning by heart how to translate the sentences from your 0:14:51.615 --> 0:14:53.584 training data. 0:14:54.194 --> 0:15:04.430 So the test data should be really different from your training data. 0:15:04.430 --> 0:15:16.811 Therefore, it's important to keep them separate. So what type of data do we have? 0:15:16.811 --> 0:15:24.966 There's a lot of different text data, and the nice thing is, with digitalization, there is more and more of it. 0:15:25.345 --> 0:15:31.785 You might think there's a large amount in books, but to be honest, books and printed things, 0:15:31.785 --> 0:15:35.524 that's by now a minor percentage of the data we have. 0:15:35.815 --> 0:15:39.947 There's like so much data created every day on the Internet.
0:15:39.980 --> 0:15:46.223 With social media and all the other types, 0:15:46.223 --> 0:15:56.821 this of course is the largest amount of data, more of the colloquial language. 0:15:56.856 --> 0:16:02.609 It might be more noisy and harder to process, so there is a whole area on how to deal with 0:16:02.609 --> 0:16:04.948 social media and other noisy text. 0:16:07.347 --> 0:16:20.702 What type of data is there if you think about parallel data? News-type data, official sites. 0:16:20.900 --> 0:16:26.629 So the first parallel corpora were things like the European Parliament or some news 0:16:26.629 --> 0:16:27.069 sites. 0:16:27.227 --> 0:16:32.888 Nowadays there's quite a large amount of data crawled from the Internet, but of course if 0:16:32.888 --> 0:16:38.613 you crawl parallel data from the Internet, a lot of the data is also like company websites 0:16:38.613 --> 0:16:41.884 or so which get translated into several languages. 0:16:45.365 --> 0:17:00.613 Then, of course, there are different levels of text, and we have to look at what level we 0:17:00.613 --> 0:17:05.118 want to process our data. 0:17:05.885 --> 0:17:16.140 It normally doesn't make sense to work on full sentences, because a lot of sentences 0:17:16.140 --> 0:17:22.899 have never been seen and you always create new sentences. 0:17:23.283 --> 0:17:37.421 So typically what we take as our basic unit is something between words and letters, and that 0:17:37.421 --> 0:17:40.033 is an essential decision. 0:17:40.400 --> 0:17:47.873 So we need some of these atomic blocks or basic blocks which we can't make smaller. 0:17:48.128 --> 0:17:55.987 So if we're building a sentence, for example, you build it out of these blocks, and you can 0:17:55.987 --> 0:17:57.268 decide: 0:17:57.268 --> 0:18:01.967 for example, you take words, or you split them further. 0:18:03.683 --> 0:18:10.178 Then, of course, the nice thing is, if they are not too small, building larger things 0:18:10.178 --> 0:18:11.386 like sentences is easy. 0:18:11.831 --> 0:18:16.690 So you only have to take your vocabulary and put it together to get your full 0:18:16.690 --> 0:18:17.132 sentence. 0:18:19.659 --> 0:18:27.670 However, if these blocks are too large, they don't occur often enough, and you have more blocks 0:18:27.670 --> 0:18:28.715 that occur only rarely. 0:18:29.249 --> 0:18:34.400 And that's why we can also work with smaller blocks, like subword blocks. 0:18:34.714 --> 0:18:38.183 Or, when you work with neural models, 0:18:38.183 --> 0:18:50.533 you can work on letters, so you have a system which tries to understand the sentence 0:18:50.533 --> 0:18:53.031 letter by letter. 0:18:53.313 --> 0:18:57.608 But that is a design decision which you have to take at some point: 0:18:57.608 --> 0:19:03.292 on which level do you want to split your text, and what are the basic blocks that you are 0:19:03.292 --> 0:19:04.176 working with? 0:19:04.176 --> 0:19:06.955 And that's something we'll look into today. 0:19:06.955 --> 0:19:08.471 What possibilities there are. 0:19:12.572 --> 0:19:14.189 Any questions? 0:19:17.998 --> 0:19:24.456 Then let's look a bit at what type of data there is and how much data there is to process. 0:19:24.824 --> 0:19:34.006 The thing is that nowadays, at least for pure text, data is no longer the bottleneck for some languages. 0:19:34.006 --> 0:19:38.959 There is so much data that we cannot process all of it. 0:19:39.479 --> 0:19:49.384 But that is only true for some languages; there is also interest in other languages, and there data is 0:19:49.384 --> 0:19:50.622 important.
0:19:50.810 --> 0:20:01.483 So if you want to build a system for Swedish or for some dialect in other countries, then 0:20:01.483 --> 0:20:02.802 of course there is a lot less data. 0:20:03.103 --> 0:20:06.888 Otherwise you have this huge amount of data here. 0:20:06.888 --> 0:20:11.515 We are often no longer talking about gigabytes or more. 0:20:11.891 --> 0:20:35.788 The general information that is produced every year is enormous, 0:20:35.788 --> 0:20:40.661 and this is like all the information that is available, so there is really a lot. 0:20:41.001 --> 0:20:44.129 If we look at machine translation, 0:20:44.129 --> 0:20:53.027 we can see these numbers are really like more than ten years old, but we see this increase: 0:20:53.027 --> 0:20:58.796 one billion words we had at that time for English data. 0:20:59.019 --> 0:21:01.955 That was like news crawled from Google News and stuff. 0:21:02.382 --> 0:21:05.003 On this one you could train your system. 0:21:05.805 --> 0:21:20.457 And the interesting thing is, this one billion words is more than any human typically speaks. 0:21:21.001 --> 0:21:25.892 So these systems, they see by now like a magnitude more data. 0:21:25.892 --> 0:21:32.465 I think it is an order of magnitude more data than a human has ever seen in their 0:21:32.465 --> 0:21:33.229 lifetime. 0:21:35.175 --> 0:21:41.808 And that is maybe the interesting thing: why does it still not work perfectly, given how much 0:21:41.808 --> 0:21:42.637 data they have seen? 0:21:43.103 --> 0:21:48.745 So we are seeing really impressive results, but in most cases it's not that they're really 0:21:48.745 --> 0:21:49.911 better than humans. 0:21:50.170 --> 0:21:56.852 However, they really have seen more data than any human ever has seen in their lifetime. 0:21:57.197 --> 0:22:01.468 They can just process so much data. 0:22:01.501 --> 0:22:08.425 The question is, can we make them more efficient so that they can learn similarly well without 0:22:08.425 --> 0:22:09.592 that much data? 0:22:09.592 --> 0:22:16.443 And that is essential if we now go to low-resource languages, where we might never get that much 0:22:16.443 --> 0:22:21.254 data, and we should also be able to achieve a reasonable performance there. 0:22:23.303 --> 0:22:32.399 On the other hand, this of course links also to one topic which we will cover later: if 0:22:32.399 --> 0:22:37.965 you think about this, it's really important that your algorithms are also very efficient 0:22:37.965 --> 0:22:41.280 in order to process that much data, both in training: 0:22:41.280 --> 0:22:46.408 if you have more data, you want to process more data so you can make use of that. 0:22:46.466 --> 0:22:54.499 On the other hand, if more and more data is processed, more and more people will use machine 0:22:54.499 --> 0:23:06.816 translation to generate translations, and it will be important to do that efficiently as well. 0:23:07.607 --> 0:23:17.262 And there is more data generated every day; here are just some general numbers on how much data there 0:23:17.262 --> 0:23:17.584 is. 0:23:17.584 --> 0:23:24.595 A lot of the data we produce, at least at the moment, is text-rich, so text 0:23:24.595 --> 0:23:26.046 that is produced. 0:23:26.026 --> 0:23:29.748 That is very important in two ways: 0:23:29.748 --> 0:23:33.949 we can use it as training data in some way,
0:23:33.873 --> 0:23:40.836 and we want to translate some of it, because it might not be published in all the languages, 0:23:40.836 --> 0:23:46.039 and so the need for machine translation is even more important. 0:23:47.907 --> 0:23:51.547 So what are the challenges with this? 0:23:51.831 --> 0:24:01.360 So first of all, that seems to be very good news: there is more and more data, so we 0:24:01.360 --> 0:24:10.780 can just wait for three years and have more data, and then our system will be better. 0:24:11.011 --> 0:24:22.629 You see in competitions that the system performance increases. 0:24:24.004 --> 0:24:27.190 You see that here are three different systems. 0:24:27.190 --> 0:24:34.008 BLEU score is a metric to measure how good an MT system is, and we'll talk about evaluation 0:24:34.008 --> 0:24:40.974 next week, so you'll learn how to evaluate machine translation, also in a practical session. 0:24:41.581 --> 0:24:45.219 And so: 0:24:44.784 --> 0:24:50.960 this shows you how much of the training data you have. With five percent 0:24:50.960 --> 0:24:56.117 you're significantly worse than with forty percent, and with eighty percent 0:24:56.117 --> 0:25:02.021 you're getting better again, and you see this curve, which maybe doesn't really 0:25:02.021 --> 0:25:02.971 flatten out. 0:25:02.971 --> 0:25:03.311 But 0:25:03.263 --> 0:25:07.525 of course, the gains you get are normally smaller and smaller 0:25:07.525 --> 0:25:09.216 the more data you have. 0:25:09.549 --> 0:25:21.432 Your improvements are normally only comparable if you double your data; 0:25:21.432 --> 0:25:25.657 adding the same absolute amount later helps less, but of course more data helps. 0:25:26.526 --> 0:25:34.955 However, you see the clear tendency: if you need to improve your system, 0:25:34.955 --> 0:25:38.935 this is possible by just getting more data. 0:25:39.039 --> 0:25:41.110 But it's not all about data. 0:25:41.110 --> 0:25:45.396 It can also be about the domain of the data that you're using. 0:25:45.865 --> 0:25:55.668 So this was a test of a machine translation system on translating genome data. 0:25:55.668 --> 0:26:02.669 As said, there is work on translating this type of data. 0:26:02.862 --> 0:26:06.868 Here you see the performance measured in BLEU score. 0:26:06.868 --> 0:26:12.569 You see one system which was only trained on genome data, and it only has very little data. 0:26:12.812 --> 0:26:17.742 That's very, very little for machine translation. 0:26:18.438 --> 0:26:23.927 And compare that to a system which was trained on general news translation data 0:26:24.104 --> 0:26:34.177 with four point five million sentences, so roughly one hundred times as much data; you 0:26:34.177 --> 0:26:40.458 still see that this general system doesn't really work well here. 0:26:40.820 --> 0:26:50.575 So you see it's not only about the amount of data; the data also has to somewhat fit the 0:26:50.575 --> 0:26:51.462 domain. 0:26:51.831 --> 0:26:58.069 With more general data you might hope that you have covered all domains. 0:26:58.418 --> 0:27:07.906 But that's very difficult, and especially for more specific domains 0:27:07.906 --> 0:27:16.696 it can be really important to get data which fits your domain. 0:27:16.716 --> 0:27:18.520 Maybe you could do some prompting or something like that, 0:27:18.598 --> 0:27:22.341 to say: okay, concentrate on this domain, to be better? 0:27:24.564 --> 0:27:28.201 It's not that easy to prompt it. 0:27:28.201 --> 0:27:35.807 You can do the prompting in the more traditional way of fine-tuning.
0:27:35.807 --> 0:27:44.514 Then, of course, if you select similar data and later combine it, you can get better. 0:27:44.904 --> 0:27:52.675 But it will always be the case that this type of similar data is much more important than the general data. 0:27:52.912 --> 0:28:00.705 So of course it can make a low-resource system a lot better if you search for similar data 0:28:00.705 --> 0:28:01.612 and find it. 0:28:02.122 --> 0:28:08.190 We will have a lecture on domain adaptation, where the idea is exactly how you can make systems 0:28:08.190 --> 0:28:13.935 in these situations better, so you can adapt them to this data, but then you still need this 0:28:13.935 --> 0:28:14.839 type of data. 0:28:15.335 --> 0:28:21.590 And with prompting it might work if you have seen it in your data, so it can make the system 0:28:21.590 --> 0:28:25.134 aware and tell it to focus more on this type of data. 0:28:25.465 --> 0:28:30.684 But if you haven't had enough of the really specific, well-matching data, I think it will 0:28:30.684 --> 0:28:31.681 often not work. 0:28:31.681 --> 0:28:37.077 So you need to have this type of data, and therefore it's important not only to have general 0:28:37.077 --> 0:28:42.120 data but also data, at least in your overall system, which really fits the domain. 0:28:45.966 --> 0:28:53.298 And then the second thing, of course, is you need to have data that has good quality. 0:28:53.693 --> 0:29:00.170 In the early stages it might be good to have all the data, but later it's especially important 0:29:00.170 --> 0:29:06.577 that you have somewhat good quality, so that you're learning what you really want to learn 0:29:06.577 --> 0:29:09.057 and not learning some strange things. 0:29:10.370 --> 0:29:21.551 We talked about this with the kilometers and miles: if you just take in some type of 0:29:21.551 --> 0:29:26.253 data and don't look at the quality, you learn the wrong things. 0:29:26.766 --> 0:29:30.875 But of course, the question here is: what is good quality data? 0:29:31.331 --> 0:29:35.054 It is not that easy to define what good quality data is. 0:29:36.096 --> 0:29:43.961 That doesn't mean it has to be what people generally assume to be high-quality text, like written 0:29:43.961 --> 0:29:47.814 by a Nobel Prize winner or something like that. 0:29:47.814 --> 0:29:54.074 This is not what we mean by this quality; again, the most important thing is that it fits your task. 0:29:54.354 --> 0:30:09.181 So if you want to translate Twitter data, high-quality data doesn't mean you now have some novels as 0:30:09.309 --> 0:30:12.875 training data; it should rather be similar to your 0:30:12.875 --> 0:30:18.480 test data. For quality, you definitely want it to be a real translation, 0:30:18.480 --> 0:30:18.862 though. 0:30:19.199 --> 0:30:25.556 Especially if you crawl data, you will often find that it's not a direct translation. 0:30:25.805 --> 0:30:28.436 And then, of course, this is not high-quality training data. 0:30:29.449 --> 0:30:39.974 But in general that's a very difficult thing, and it's very difficult to define what 0:30:39.974 --> 0:30:41.378 good data really is. 0:30:41.982 --> 0:30:48.333 And of course the final metric is always: the quality of your data is good if your machine translation system gets better. 0:30:48.648 --> 0:30:50.719 So that is like the indirect measure. 0:30:50.991 --> 0:30:52.447 But what can we do? 0:30:52.447 --> 0:30:57.210 Of course, it's difficult to always try a lot of things and evaluate each of them, 0:30:57.210 --> 0:30:59.396 build a full MT system and then check: 0:30:59.396 --> 0:31:00.852 oh, was this a good idea?
0:31:00.852 --> 0:31:01.357 I mean, 0:31:01.581 --> 0:31:19.055 say you have two tokenizers which split sentences into words, and you want to know which one you should really apply. 0:31:19.179 --> 0:31:21.652 Now you could maybe argue, or your idea could be: 0:31:21.841 --> 0:31:30.186 just try it on a small set very fast and then get the result. But the problem is, there is not 0:31:30.186 --> 0:31:31.448 always this clear transfer. 0:31:31.531 --> 0:31:36.269 One thing may work very well for small data, 0:31:36.269 --> 0:31:43.123 but it's not for sure that the same effect will happen at large scale. 0:31:43.223 --> 0:31:50.395 Maybe this idea really improves things on very low-resource data, if you only train on very little data. 0:31:51.271 --> 0:31:58.357 But if you use it for a large data set, it doesn't really matter, and all your gains are gone. 0:31:58.598 --> 0:32:01.172 So that is also a typical thing: 0:32:01.172 --> 0:32:05.383 this quality issue gets more and more important the less data you have. 0:32:06.026 --> 0:32:16.459 One motivation which you should generally have: you want to represent your data in a way that you have seen 0:32:16.459 --> 0:32:17.469 each event as often as possible. 0:32:17.677 --> 0:32:21.805 Why is this the case, any idea? 0:32:21.805 --> 0:32:33.389 Why could this be a motivation, that we try to represent the data in a way that we have 0:32:33.389 --> 0:32:34.587 seen things as often as possible? 0:32:38.338 --> 0:32:50.501 We also want to learn about the context, because maybe some words occur only in certain contexts. 0:32:52.612 --> 0:32:54.020 The context is one aspect. 0:32:54.020 --> 0:32:56.432 It's more about the learning first: 0:32:56.432 --> 0:33:00.990 you can generally learn better if you've seen something more often. 0:33:00.990 --> 0:33:06.553 So if you have seen an event only once, it's really hard to learn about the event. 0:33:07.107 --> 0:33:15.057 If you have seen an event a hundred times, you are better at estimating it, and maybe that 0:33:15.057 --> 0:33:18.529 includes the context, which you can then use. 0:33:18.778 --> 0:33:21.331 So, for example, if you here have the word house. 0:33:21.761 --> 0:33:28.440 If you would just take the data as it is, you would directly process the data. 0:33:28.440 --> 0:33:32.893 Then the upper-case House, or the house with the dot: 0:33:32.893 --> 0:33:40.085 that's a different word than the house written this way, and than the house with the comma. 0:33:40.520 --> 0:33:48.365 So you want to learn how this translates into Haus, but you separately learn how the upper-case one 0:33:48.365 --> 0:33:50.281 translates, and so on. 0:33:50.610 --> 0:33:59.445 You would be learning how to translate each variant of house on its own, so you have to learn four different 0:33:59.445 --> 0:34:00.205 things. 0:34:00.205 --> 0:34:06.000 Instead, we really want to learn once that house gets translated into Haus. 0:34:06.366 --> 0:34:18.796 And then imagine it would even be ambiguous: it might be, like here, that the upper-case house would be translated into 0:34:18.678 --> 0:34:22.089 something completely different. 0:34:22.202 --> 0:34:29.512 If it's upper case, then I would always have to translate it into that other word, while if it's lower 0:34:29.512 --> 0:34:34.955 case, it is translated into Haus, and that's of course not right: 0:34:34.955 --> 0:34:39.260 we have to use the context to decide what is better. 0:34:39.679 --> 0:34:47.086 If you have seen an event several times, then you are better able to learn your model, and 0:34:47.086 --> 0:34:51.414 that doesn't matter what type of learning you have. 0:34:52.392 --> 0:34:58.981 I shouldn't say all, but for most of these models it's always better to have seen 0:34:58.981 --> 0:35:00.897 an event more often.
0:35:00.920 --> 0:35:11.483 Therefore, when you preprocess data, you should ask the question: how can I represent the data 0:35:11.483 --> 0:35:14.212 in order to have seen each event as often as possible? 0:35:14.514 --> 0:35:17.885 Of course, you should not remove necessary information. 0:35:18.078 --> 0:35:25.519 So you could now, of course, just lowercase everything. 0:35:25.519 --> 0:35:30.303 Then you've seen things more often. 0:35:30.710 --> 0:35:38.443 But that might be an issue, because in the final application you want to have real text 0:35:38.443 --> 0:35:38.887 with casing. 0:35:40.440 --> 0:35:44.003 And finally, it's even more important that it's consistent. 0:35:44.965 --> 0:35:52.630 So this is a problem where things, for example, aren't consistent: 0:35:52.630 --> 0:35:58.762 "I'm" is written together in the training data, 0:35:58.762 --> 0:36:04.512 and if it's not in the test data, you have a mismatch. 0:36:04.824 --> 0:36:14.612 Therefore, the most important thing is to do your preprocessing and represent your data in a way that is most consistent, 0:36:14.612 --> 0:36:18.413 because then it's easier to map what is similar. 0:36:18.758 --> 0:36:26.588 If the same text is represented very, very differently, then your data will be translated badly. 0:36:26.666 --> 0:36:30.664 So we once had the case, 0:36:30.664 --> 0:36:40.420 for example, that there was some data where the German text was written with slightly different characters. 0:36:40.900 --> 0:36:44.187 And if you read it as a human, you barely see it. 0:36:44.187 --> 0:36:49.507 It's even hard to spot the difference because it looks very similar. 0:36:50.130 --> 0:37:02.997 But if you use it for a machine translation system, it would not be able to translate anything 0:37:02.997 --> 0:37:08.229 of it, because each word is a different word for the system. 0:37:09.990 --> 0:37:17.736 And on the other hand, you should of course not remove significant information from the training 0:37:17.736 --> 0:37:18.968 data thereby, 0:37:18.968 --> 0:37:27.155 for example removing case information if your task is to generate case information. 0:37:31.191 --> 0:37:41.081 One thing which is a good point to look into in order to see the difficulty of your data 0:37:41.081 --> 0:37:42.711 is to compare types and tokens. 0:37:43.103 --> 0:37:45.583 By types 0:37:45.583 --> 0:37:57.983 we mean the number of unique words in the corpus, so your vocabulary, and the tokens are the running words. 0:37:58.298 --> 0:38:08.628 And then you can look at the type-token ratio, that means the number of types per token. 0:38:15.815 --> 0:38:22.381 You have fewer types than tokens, because every word appears at least once in the corpus, but most 0:38:22.381 --> 0:38:27.081 of them will occur more often, so the token count is bigger. 0:38:27.667 --> 0:38:30.548 And of course this changes if you have more data. 0:38:31.191 --> 0:38:38.103 Here is an example from the English Wikipedia. 0:38:38.103 --> 0:38:45.015 It shows how often each word occurs on average. 0:38:45.425 --> 0:38:47.058 Of course there's a big difference. 0:38:47.058 --> 0:38:51.323 There will be some words which occur one hundred times, but then most of the words occur 0:38:51.323 --> 0:38:51.777 only once. 0:38:52.252 --> 0:38:55.165 However, you see this ratio goes down with more data. 0:38:55.165 --> 0:39:01.812 That's a good thing: you have seen each word more often, and therefore your model typically gets 0:39:01.812 --> 0:39:03.156 better. 0:39:03.156 --> 0:39:08.683 However, the problem is we always have a lot of words which we have seen only rarely. 0:39:09.749 --> 0:39:15.111 Even here there will be a bunch of words which you have only seen once.
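To make the type/token distinction concrete, here is a minimal sketch (my own illustration, not code from the lecture) that computes types, tokens and the type-token ratio for an already tokenized corpus:

```python
from collections import Counter

def type_token_ratio(lines):
    """Count running words (tokens) and unique words (types), return both plus their ratio."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())      # assumes whitespace-tokenized text
    tokens = sum(counts.values())
    types = len(counts)
    return types, tokens, types / tokens

corpus = ["i go", "he goes", "she goes"]
types, tokens, ttr = type_token_ratio(corpus)
print(types, tokens, round(ttr, 2))      # 5 types, 6 tokens, ratio 0.83
```

A lower ratio on the same data means each word has been seen more often on average, which, as discussed above, usually makes learning easier.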
0:39:15.111 --> 0:39:20.472 However, this can give you an indication about the difficulty of the data. 0:39:20.472 --> 0:39:27.323 So you should always, of course, try to achieve a representation where you have a very low type-token 0:39:27.323 --> 0:39:28.142 ratio. 0:39:28.808 --> 0:39:39.108 For example, if you compare Simple Wikipedia and normal Wikipedia, what would be your expectation? 0:39:41.861 --> 0:39:49.842 Yes, exactly; however, surprisingly it's only a little bit lower. But you see that it's 0:39:49.842 --> 0:39:57.579 lower, so we are using fewer words to express the same thing, and therefore the task of producing 0:39:57.579 --> 0:39:59.941 this text is also easier. 0:40:01.221 --> 0:40:07.702 However, as to how many words there are, there is no clear answer. 0:40:07.787 --> 0:40:19.915 So there will always be more words; especially, it depends on your dataset how many different 0:40:19.915 --> 0:40:22.132 words there are. 0:40:22.482 --> 0:40:30.027 So if you have a million tweets, that is around fifty million tokens, and you have six hundred 0:40:30.027 --> 0:40:30.875 thousand different words. 0:40:31.251 --> 0:40:40.299 If you have many times this amount of tweets, you also have significantly more tokens, but also more types. 0:40:40.660 --> 0:40:58.590 So especially in things like social media, of course, there are always new types of 0:40:58.590 --> 0:40:59.954 words. 0:41:00.040 --> 0:41:04.028 Another example, not from social media, is here. 0:41:04.264 --> 0:41:18.360 So yeah, there is a smaller data set of phone conversations (Switchboard): two million tokens, and 0:41:18.360 --> 0:41:22.697 only twenty thousand words. 0:41:23.883 --> 0:41:37.221 If you think about Shakespeare, it has even fewer tokens, significantly less than a million, 0:41:37.221 --> 0:41:40.006 but the number of different words is comparably high. 0:41:40.060 --> 0:41:48.781 On the other hand, there is this Google N-gram corpus, which has a huge number of tokens, and there are always 0:41:48.781 --> 0:41:50.506 new words coming in. 0:41:50.991 --> 0:41:52.841 This is English. 0:41:52.841 --> 0:42:08.103 The nice thing about English is that the vocabulary is relatively small; not really small, but relatively 0:42:08.103 --> 0:42:09.183 small. 0:42:09.409 --> 0:42:14.224 So here you see the TED corpus. 0:42:15.555 --> 0:42:18.144 You all know TED lectures. 0:42:18.144 --> 0:42:26.429 They are transcribed and translated; a nice resource for us, an especially small corpus. 0:42:26.846 --> 0:42:32.702 You can do a lot of experiments with that, and you see that the corpus size is relatively 0:42:32.702 --> 0:42:36.782 similar across languages, so we have around four million tokens in this corpus. 0:42:36.957 --> 0:42:44.464 However, if you look at the vocabulary, English has half as many different words 0:42:44.464 --> 0:42:47.045 as German, Dutch or Italian. 0:42:47.527 --> 0:42:56.260 So this is one influence of compositional words like compounds, which are more frequent in German, and, 0:42:56.260 --> 0:43:02.978 even more important, of all these different morphological forms. 0:43:03.263 --> 0:43:08.170 These all lead to new words, and they need to be somehow represented there. 0:43:11.531 --> 0:43:20.278 So, to deal with this, the question is: how can we normalize the text in order to make 0:43:20.278 --> 0:43:22.028 the task easier? 0:43:22.028 --> 0:43:25.424 Can we simplify the task? 0:43:25.424 --> 0:43:29.231 But we need to keep all the information. 0:43:29.409 --> 0:43:32.239 So here is an example where not all information is kept: 0:43:32.239 --> 0:43:35.012 of course you make the task easier if you just remove the casing.
0:43:35.275 --> 0:43:41.141 You don't have to deal with different cases. 0:43:41.141 --> 0:43:42.836 It's easier. 0:43:42.836 --> 0:43:52.482 However, information gets lost, and you might need it to generate the target side. 0:43:52.832 --> 0:44:00.153 So the question is always: how can we on the one hand simplify the task but keep all the 0:44:00.153 --> 0:44:01.223 necessary information? 0:44:01.441 --> 0:44:06.639 I say "necessary" because it depends on the task. 0:44:06.639 --> 0:44:11.724 For some tasks it might be fine to remove the casing. 0:44:14.194 --> 0:44:23.463 So the steps we are typically doing are: you segment the running text into words, you 0:44:23.463 --> 0:44:30.696 normalize word forms, and you do segmentation into sentences. 0:44:30.696 --> 0:44:33.955 Also, if you do not have a single 0:44:33.933 --> 0:44:38.739 sentence per line already, you segment the text into sentences as well. 0:44:39.779 --> 0:44:52.609 So what are we doing there? For European languages, segmentation into words 0:44:52.609 --> 0:44:57.290 is not that complicated. 0:44:57.277 --> 0:45:06.001 You have to somehow handle joined words, and in handling joined words the most important thing is consistency. 0:45:06.526 --> 0:45:11.331 So in most systems it really doesn't matter much 0:45:11.331 --> 0:45:16.712 if you write "I'm" together as one word or as two words. 0:45:17.197 --> 0:45:23.511 The nice thing about "I'm" is maybe that it occurs so often that it doesn't matter, if both 0:45:23.511 --> 0:45:26.560 variants occur often enough. 0:45:26.560 --> 0:45:32.802 But you'll have some of these cases where they don't occur that often, so you should 0:45:32.802 --> 0:45:35.487 be as consistent as possible. 0:45:36.796 --> 0:45:41.662 But of course things can get more complicated. 0:45:41.662 --> 0:45:48.598 If you have "Finland's", do you want to split off the 's or not? 0:45:48.598 --> 0:45:53.256 Do you just split, or do you even write it out? 0:45:53.433 --> 0:46:00.468 And what about things with hyphens in the middle, and so on? 0:46:00.540 --> 0:46:07.729 So not everything is very easy, but it is generally possible to somewhat keep it consistent. 0:46:11.791 --> 0:46:25.725 Sometimes the most challenging thing in traditional systems were compounds, and how to deal with 0:46:25.725 --> 0:46:28.481 things like this. 0:46:28.668 --> 0:46:32.154 The nice thing is, as said, we will come to that later: 0:46:32.154 --> 0:46:34.501 nowadays we typically use subword 0:46:35.255 --> 0:46:42.261 units, so we don't have to deal with this in the preprocessing directly, but in the subword 0:46:42.261 --> 0:46:47.804 splitting we're doing it, and then we can learn how to best split these. 0:46:52.392 --> 0:46:56.974 Things get more complicated 0:46:56.977 --> 0:46:59.934 with non-European languages, 0:46:59.934 --> 0:47:08.707 because in non-European languages, not all of them, there is no space between the words. 0:47:09.029 --> 0:47:18.752 Nowadays you can also download word segmentation models where you put in the full sentence and 0:47:18.752 --> 0:47:22.744 then it's getting split into parts. 0:47:22.963 --> 0:47:31.814 And then, of course, you sometimes even have different writing systems mixed; in Japanese, 0:47:31.814 --> 0:47:40.385 for example, they have these katakana, hiragana and kanji symbols in there, and you have to 0:47:40.385 --> 0:47:42.435 somehow deal with these. 0:47:49.669 --> 0:47:54.560 Then the next thing is: we can do some normalization.
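Before the normalization part, here is a quick sketch of the rule-based word segmentation just described (my own toy rules, not the lecture's scripts; real tokenizers have many more rules and language-specific exceptions):

```python
import re

def tokenize(sentence):
    """Minimal word tokenizer: detach punctuation and split contractions consistently."""
    sentence = re.sub(r"([.,!?;:()\"])", r" \1 ", sentence)   # detach punctuation marks
    sentence = re.sub(r"(\w)'(\w)", r"\1 '\2", sentence)      # "I'm" -> "I 'm", "Finland's" -> "Finland 's"
    return sentence.split()

print(tokenize("I'm going to Finland's capital."))
# ['I', "'m", 'going', 'to', 'Finland', "'s", 'capital', '.']
```

Whether to split "'s" at all is exactly the design decision discussed above; what matters most is that the same rule is applied consistently to training and test data.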
0:47:54.874 --> 0:48:00.376 So the idea is that you map several words onto the same form. 0:48:00.460 --> 0:48:07.877 And that is task dependent, and the idea is to define something like equivalence classes, so 0:48:07.877 --> 0:48:15.546 that words which have the same meaning, where the difference is not important, are 0:48:15.546 --> 0:48:19.423 mapped onto the same thing in order to make the learning easier. 0:48:19.679 --> 0:48:27.023 The most important thing there is about casing, and then there are sometimes things like word 0:48:27.023 --> 0:48:27.508 classes. 0:48:28.048 --> 0:48:37.063 For casing you can do two things, and it then depends on the task. 0:48:37.063 --> 0:48:44.769 You can lowercase everything, maybe with some exceptions. 0:48:45.045 --> 0:48:47.831 For the target side, it's normally not done. 0:48:48.188 --> 0:48:51.020 Why is it not done? 0:48:51.020 --> 0:48:56.542 Why should you only do it for the source side? 0:48:56.542 --> 0:49:07.729 Yes, exactly: you have to generate correct text, not lowercased or uppercased text. 0:49:08.848 --> 0:49:16.370 Nowadays we typically do true casing on both sides, also on the source side; that means you 0:49:16.370 --> 0:49:17.610 keep the case. 0:49:17.610 --> 0:49:24.966 The only thing where people try to work on, or sometimes do something, is the beginning 0:49:24.966 --> 0:49:25.628 of the sentence. 0:49:25.825 --> 0:49:31.115 For frequent words this is not that important, because you will have seen them a lot 0:49:31.115 --> 0:49:31.696 of times anyway. 0:49:31.696 --> 0:49:36.928 But if you now have rare words, which you have only seen maybe three times, and you have 0:49:36.928 --> 0:49:42.334 only seen them in the middle of the sentence, and now one occurs at the beginning of the sentence, 0:49:42.334 --> 0:49:45.763 which is uppercased, then you don't know how to deal with it. 0:49:46.146 --> 0:49:50.983 So then it might be good to do true casing. 0:49:50.983 --> 0:49:56.241 That means you recase each word at the beginning of the sentence. 0:49:56.576 --> 0:49:59.830 The only question, of course, is: how do you recase it? 0:49:59.830 --> 0:50:01.961 So which case should it get? 0:50:02.162 --> 0:50:18.918 Do you just always take the lowercased word at the beginning of the sentence, or do you have a better solution? Especially not for English; think maybe of German. 0:50:25.966 --> 0:50:36.648 The fancy solution would be to count how often each variant occurs and decide based on this; the unfancy solution would 0:50:36.648 --> 0:50:43.147 be to just always lowercase, which is not really good, even though most of the words are lowercased. 0:50:43.683 --> 0:50:53.657 Counting is the one idea, and it is definitely better, because if a word occurs more often uppercased, you keep it uppercased. 0:50:53.653 --> 0:50:57.934 Otherwise you only introduce a lowercased variant at the sentence beginning, where you have the problem again: 0:50:58.338 --> 0:51:03.269 you haven't gained anything. You can make it even a bit better when counting: 0:51:03.269 --> 0:51:09.134 you ignore the first position, so that you don't count the sentence-initial words, and yeah, 0:51:09.134 --> 0:51:12.999 that's typically how this type of true casing is done. 0:51:13.273 --> 0:51:23.907 And that's the easy version; you could even use bigram statistics over word pairs. 0:51:23.907 --> 0:51:29.651 There are very few words which occur often in both casings, 0:51:29.970 --> 0:51:33.163 and it's OK to have them both, because then you can learn it from the data. 0:51:36.376 --> 0:51:52.305 Another thing about these classes is to use word classes; that was partly done, for example, for numbers, which then occur more often. 0:51:55.375 --> 0:51:57.214 "Ten thousand one hundred books."
0:51:57.597 --> 0:52:07.397 And then, for an MT system, the exact number might not be important, so you can map it to a number class, something like "@num books". 0:52:07.847 --> 0:52:16.450 However, you see here already that it's not that easy, because if you have "one book" you 0:52:16.450 --> 0:52:19.318 have a singular and not a plural. 0:52:20.020 --> 0:52:21.669 So always be careful: 0:52:21.669 --> 0:52:28.094 it happens very fast that you ignore some exceptions and make more things worse than better. 0:52:28.488 --> 0:52:37.879 So it's always difficult to decide when to do this and when to better not do it and keep 0:52:37.879 --> 0:52:38.724 things as they are. 0:52:43.483 --> 0:52:56.202 Then the next step is sentence segmentation; we are typically working on sentences. 0:52:56.476 --> 0:53:11.633 However, with dots, things are a bit more complicated, since a dot also occurs in abbreviations or numbers, so you have to do a bit more. 0:53:11.731 --> 0:53:20.111 You can even have some type of classifier with features, but generally 0:53:20.500 --> 0:53:30.731 it is not too complicated, so you can have different types of classifiers to do that. 0:53:33.393 --> 0:53:35.583 It's not a super complicated task. 0:53:35.583 --> 0:53:39.461 There are nowadays also a lot of libraries which you can use. 0:53:39.699 --> 0:53:45.714 Normally, if you're doing the tokenization beforehand, that can be handled there, so you only 0:53:45.714 --> 0:53:51.126 split off the dot if it's a sentence boundary, and otherwise you keep it attached to the word; 0:53:51.126 --> 0:53:54.194 so you can do that a bit jointly with the segmentation. 0:53:54.634 --> 0:54:06.017 It's something to think about and take care of, because it's where errors happen. 0:54:06.017 --> 0:54:14.712 However, in general you can still do it very well. 0:54:14.834 --> 0:54:19.740 You will never get data which is perfectly clean and where everything is great. 0:54:20.340 --> 0:54:31.020 There's just too much data, and it will never happen, so therefore it's important to be aware 0:54:31.020 --> 0:54:35.269 of that during the full development. 0:54:37.237 --> 0:54:42.369 And one last thing about the preprocessing before we get into the representation: 0:54:42.369 --> 0:54:47.046 if you're working on that, you'll become friends with regular expressions. 0:54:47.046 --> 0:54:50.034 That's how you do a lot of this matching. 0:54:50.430 --> 0:55:03.811 And if you look into the scripts for how to deal with punctuation marks and stuff like 0:55:03.811 --> 0:55:04.900 that, you will see plenty of them. 0:55:11.011 --> 0:55:19.025 So now that we have the data, our next step to build the system is to represent our words. 0:55:19.639 --> 0:55:27.650 Before we start with this, are there any more questions about preprocessing, 0:55:27.650 --> 0:55:32.672 while we work on the pure text? 0:55:33.453 --> 0:55:40.852 The idea is again to make things simpler, because if you think about the capitalized word 0:55:40.852 --> 0:55:48.252 at the beginning of a sentence, it might be that you haven't seen the word that way; or, for example, 0:55:48.252 --> 0:55:49.619 think of titles. 0:55:49.619 --> 0:55:56.153 In newspaper articles, titles are cased differently, so you may have seen the word in the title before 0:55:56.153 --> 0:55:58.425 but never in the running text. 0:55:58.898 --> 0:56:03.147 But there is always the decision: 0:56:03.123 --> 0:56:09.097 do I gain more because I've seen things more often, or do I lose because now I remove information 0:56:09.097 --> 0:56:11.252 which would have helped me to the same degree?
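As a side note, the counting-based true casing and the dot handling described above can both be sketched in a few lines. This is my own toy illustration under the rules stated in the lecture (count casing variants while ignoring the sentence-initial position; treat dots after known abbreviations as non-boundaries), not any production script:

```python
import re
from collections import Counter

def learn_truecase(sentences):
    """Learn the most frequent casing variant per word, skipping sentence-initial position."""
    counts = Counter()
    for sent in sentences:
        for word in sent.split()[1:]:          # ignore position 0, as suggested above
            counts[word] += 1
    best = {}
    for word, freq in counts.items():
        key = word.lower()
        if key not in best or freq > counts[best[key]]:
            best[key] = word
    return best

def truecase(sentence, best):
    words = sentence.split()
    if words:
        words[0] = best.get(words[0].lower(), words[0])
    return " ".join(words)

ABBREV = {"Dr.", "e.g.", "etc."}               # toy abbreviation list, an assumption
def split_sentences(text):
    """Very naive splitter: break after a dot, then glue back splits after abbreviations."""
    parts, out = re.split(r"(?<=\.)\s+", text), []
    for p in parts:
        if out and out[-1].split()[-1] in ABBREV:
            out[-1] += " " + p
        else:
            out.append(p)
    return out

print(split_sentences("Dr. Smith goes home. He is tired."))
# ['Dr. Smith goes home.', 'He is tired.']
```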
0:56:11.571 --> 0:56:21.771 Because if we, for example, do that in German and remove the case information, this might be an important 0:56:21.771 --> 0:56:22.531 issue. 0:56:22.842 --> 0:56:30.648 So there is no perfect solution, but generally you can remove some errors by making things 0:56:30.648 --> 0:56:32.277 look more similar. 0:56:35.295 --> 0:56:43.275 What can you say about the state of the art, or the trend: are these problems more or 0:56:43.275 --> 0:56:43.813 less solved? 0:56:44.944 --> 0:56:50.193 It matters even less because models get more powerful, so it's not that important anymore, but be 0:56:50.193 --> 0:56:51.136 a bit careful: 0:56:51.136 --> 0:56:56.326 it's partly also an evaluation thing, because these things which are problematic happen 0:56:56.326 --> 0:56:57.092 very rarely. 0:56:57.092 --> 0:57:00.159 If you take average performance, it doesn't matter. 0:57:00.340 --> 0:57:06.715 However, in between the system makes these stupid mistakes, which don't count much on average, but they 0:57:06.715 --> 0:57:08.219 are really not good. 0:57:09.089 --> 0:57:15.118 So: you do some type of tokenization; 0:57:15.118 --> 0:57:19.911 you can do true casing or not. 0:57:19.911 --> 0:57:28.723 Some people nowadays don't do it, but it's still done. 0:57:28.948 --> 0:57:34.441 Then it depends a bit on the type of domain. 0:57:34.441 --> 0:57:37.437 For example, we do software translation. 0:57:37.717 --> 0:57:46.031 So in the text there is sometimes a marker in a menu item for the keyboard shortcut: 0:57:46.031 --> 0:57:49.957 a letter is marked as the shortcut. 0:57:49.957 --> 0:57:57.232 Then you cannot match the word, because it's no longer "File" but, for example, "&File". 0:57:58.018 --> 0:58:09.037 Then you cannot deal with it, so it might make sense to remove this. 0:58:12.032 --> 0:58:17.437 Now the next step is how to map words into numbers. 0:58:17.437 --> 0:58:22.142 Machine learning models deal with numbers. 0:58:22.342 --> 0:58:27.091 The first idea is to use words as our basic components. 0:58:27.247 --> 0:58:40.695 And then you have a large vocabulary where each word gets mapped to an index, an integer. 0:58:40.900 --> 0:58:49.059 So your sentence "go home" is now just a sequence of two indices, and that is your input. 0:58:52.052 --> 0:59:00.811 So the nice thing is you have very short sequences, so you can deal with them. 0:59:00.811 --> 0:59:01.867 However, 0:59:01.982 --> 0:59:11.086 the model has not really understood how words are built. 0:59:11.086 --> 0:59:16.951 Why is this, or can this be, a problem? 0:59:17.497 --> 0:59:20.741 Well, there is an easy solution to deal with unknown words: 0:59:20.741 --> 0:59:22.698 you just have one token for all of them. 0:59:23.123 --> 0:59:25.906 You replace maybe some rare words in your training data with it, so the model learns to deal with it. 0:59:26.206 --> 0:59:34.938 That works a bit for some problems, but in general it's not good, because you know nothing 0:59:34.938 --> 0:59:35.588 about the word. 0:59:35.895 --> 0:59:38.770 You can at least deal with it and maybe map it somewhere. 0:59:38.770 --> 0:59:44.269 So an easy solution in machine translation is always, if it's an unknown word, we just 0:59:44.269 --> 0:59:49.642 copy it to the target side, because unknown words are often named entities, and in many 0:59:49.642 --> 0:59:52.454 languages the good solution is just to keep them. 0:59:53.013 --> 1:00:01.203 So that is somehow a trick that works, but yeah, it's of course not a good general solution. 1:00:01.821 --> 1:00:08.959 It's also a problem if you deal with full words that you have very few examples for 1:00:08.959 --> 1:00:09.451 some of them.
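Here is a minimal sketch of this word-to-index mapping with an unknown-word token (my own illustration; the min_count threshold and the "<unk>" spelling are assumptions):

```python
from collections import Counter

UNK = "<unk>"

def build_vocab(sentences, min_count=2):
    """Map every word seen at least min_count times to an integer; everything else becomes <unk>."""
    counts = Counter(w for s in sentences for w in s.split())
    vocab = {UNK: 0}
    for word, freq in counts.most_common():
        if freq >= min_count:
            vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab):
    return [vocab.get(w, vocab[UNK]) for w in sentence.split()]

train = ["i go home", "he goes home", "she goes home"]
vocab = build_vocab(train)
print(encode("he goes to school", vocab))   # unseen/rare words all map to index 0
```

The printout makes the problem discussed above visible: "he", "to" and "school" all collapse to the same index 0, so the model knows nothing about them.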
1:00:09.949 --> 1:00:17.696 And of course if you've seen a word once, you can somehow maybe translate it, but we will 1:00:17.696 --> 1:00:24.050 learn that in neural networks you represent words with continuous vectors, 1:00:24.264 --> 1:00:26.591 and if you have seen them only two, three or four times, 1:00:26.591 --> 1:00:31.246 they are not really well learned, and you typically make most errors on these rare 1:00:31.246 --> 1:00:31.763 words. 1:00:33.053 --> 1:00:40.543 And yeah, you cannot deal with things which are inside the word. 1:00:40.543 --> 1:00:50.303 So if you know that "house" is, say, index one hundred and twelve, and you now see "houses", you have 1:00:50.303 --> 1:00:51.324 no idea that they are related. 1:00:51.931 --> 1:00:55.533 Of course that's not really convenient; humans are better: 1:00:55.533 --> 1:00:58.042 they can use the internal information. 1:00:58.498 --> 1:01:04.080 So if we have "houses", you'll know that it's the plural form of "house". 1:01:05.285 --> 1:01:16.829 And, for example, you have this nice word here, and I guess 1:01:16.716 --> 1:01:20.454 you don't know the meaning of this word. 1:01:20.454 --> 1:01:25.821 However, all of you will know it is the fear of something. 1:01:26.686 --> 1:01:39.437 From the ending: "-phobia" is always the fear of something, even if you don't know of what exactly. 1:01:39.879 --> 1:01:46.618 So we can split words into parts, and that is helpful. 1:01:46.618 --> 1:01:49.888 This one, for example, is a fear of something specific. 1:01:50.450 --> 1:02:04.022 It's not very important, it doesn't happen very often, and it's also not important for understanding 1:02:04.022 --> 1:02:10.374 today's lecture that you know every word. 1:02:15.115 --> 1:02:18.791 So what can we do instead? 1:02:18.791 --> 1:02:29.685 One thing which we could do instead is to represent words by the other extreme: characters. 1:02:29.949 --> 1:02:42.900 So you really go down to the single characters: each letter is a symbol, and then you also need a space symbol. 1:02:43.203 --> 1:02:55.875 So you now have a representation for each character, and that enables you to implicitly learn 1:02:55.875 --> 1:03:01.143 morphology, because words which have the same stem share most of their characters. 1:03:01.541 --> 1:03:05.517 And you can then deal with unknown words. 1:03:05.517 --> 1:03:10.344 There's still not everything you can process, but a lot. 1:03:11.851 --> 1:03:16.953 So if you would go to the character level, what might still be a problem? 1:03:18.598 --> 1:03:24.007 Yes: all characters which you haven't seen; that's nowadays a little bit more common 1:03:24.007 --> 1:03:25.140 with new emojis. 1:03:25.140 --> 1:03:26.020 You couldn't process those. 1:03:26.020 --> 1:03:31.366 It could also be that you translate between English and German, and then there is 1:03:31.366 --> 1:03:35.077 a Japanese or Chinese character that you cannot translate. 1:03:35.435 --> 1:03:43.938 But most of the time, all characters that occur have been seen, so that mostly works very well. 1:03:44.464 --> 1:03:58.681 This is, first, a nice thing: you have a very small vocabulary size, and one big part 1:03:58.681 --> 1:04:01.987 of the calculation in 1:04:02.222 --> 1:04:11.960 neural networks is the output computation over the vocabulary, so if you are efficient there, 1:04:11.960 --> 1:04:13.382 it's better. 1:04:14.914 --> 1:04:26.998 On the other hand, the problem is you now have very long sequences; if you think about the example from 1:04:26.998 --> 1:04:29.985 before, the sequence gets many times longer. 1:04:30.410 --> 1:04:43.535 Your computation often depends on your input size, and not only linearly but quadratically, so it grows 1:04:43.535 --> 1:04:44.410 even more.
1:04:44.504 --> 1:04:49.832 And of course it might also be that you just generally make things more complicated than 1:04:49.832 --> 1:04:50.910 they were before. 1:04:50.951 --> 1:04:58.679 We said before: make things easy. But now, if we really have to analyze each character independently, 1:04:58.679 --> 1:05:05.003 we cannot directly learn that "university" is one unit, but we have to learn that: 1:05:05.185 --> 1:05:12.179 there is a "u" at the beginning, and then there is an "n" and an "i", and all this together means 1:05:12.179 --> 1:05:17.273 "university", but another combination of these letters is something completely different. 1:05:17.677 --> 1:05:24.135 So of course you make everything here a lot more complicated than you had on the word basis. 1:05:24.744 --> 1:05:32.543 Character-based models work very well in conditions with little data, because there you have seen the words 1:05:32.543 --> 1:05:33.578 very rarely. 1:05:33.578 --> 1:05:38.751 That's not good for learning, but you have seen all letters more often. 1:05:38.751 --> 1:05:44.083 So if you have scenarios with very little data, this is one good option. 1:05:46.446 --> 1:05:59.668 The other idea is to not go to either extreme, so neither taking full words nor taking 1:05:59.668 --> 1:06:06.573 only characters, but doing something in between. 1:06:07.327 --> 1:06:12.909 And one of these ideas has been done for a long time. 1:06:12.909 --> 1:06:17.560 It's called compound splitting, where we only split 1:06:17.477 --> 1:06:18.424 compounds like "Baumstamm": 1:06:18.424 --> 1:06:24.831 you see that "Baum" and "Stamm" occur very often, maybe more often than "Baumstamm". 1:06:24.831 --> 1:06:28.180 Then you split it into "Baum" and "Stamm" and use those. 1:06:29.509 --> 1:06:44.165 But it's not so easy; it will also learn wrong splits. We did that in all our systems, and there is the word "asiatisch", 1:06:44.165 --> 1:06:47.708 which got split into "Asia" and "Tisch". 1:06:48.288 --> 1:06:56.137 And "Tisch", of course, is not a really good way of splitting it, because it is non-semantic. 1:06:56.676 --> 1:07:05.869 The good thing is, we didn't really care that much about it, because the system simply learned that 1:07:05.869 --> 1:07:09.428 if you have "Asia" and "Tisch" together, it means "Asian". 1:07:09.729 --> 1:07:17.452 So you can of course learn all that; the compound split just doesn't really help you to get a deeper 1:07:17.452 --> 1:07:18.658 understanding. 1:07:21.661 --> 1:07:23.364 The thing, of course: 1:07:23.943 --> 1:07:30.475 yeah, there was one paper about where this doesn't work; it's called "Burning 1:07:30.475 --> 1:07:30.972 Ducks", 1:07:30.972 --> 1:07:37.503 I think, because, like, a German compound could be split into parts, 1:07:37.503 --> 1:07:43.254 and sometimes you have to add a linking "e" to form the compound. 1:07:43.583 --> 1:07:48.515 That way, "Esperanto" got translated into "burning duck". 1:07:48.888 --> 1:07:56.127 So of course you can introduce some type of additional errors there, but in general it's a good 1:07:56.127 --> 1:07:57.221 idea. 1:07:57.617 --> 1:08:03.306 Of course there is a trade-off between vocabulary size and sequence length: you want to have a lower vocabulary 1:08:03.306 --> 1:08:08.812 size, so you've seen everything more often, but the length of the sequence should not be too 1:08:08.812 --> 1:08:13.654 long, because if you split more often, you get fewer different types, but you have longer sequences.
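One classic, corpus-statistics way to decide such splits (a sketch in the spirit of frequency-based compound splitting; not necessarily the exact method used in the lecture's systems, and the frequencies here are made up) is to accept a split whenever the geometric mean of the parts' frequencies beats the frequency of the whole word:

```python
import math

# Toy corpus frequencies; in practice these come from counting a large corpus.
freq = {"baumstamm": 10, "baum": 500, "stamm": 200,
        "asiatisch": 50, "asia": 300, "tisch": 400}

def best_split(word, freq, min_len=3):
    """Return the two-part split whose geometric-mean frequency beats the whole word, if any."""
    best, best_score = None, freq.get(word, 0)
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        if left in freq and right in freq:
            score = math.sqrt(freq[left] * freq[right])
            if score > best_score:
                best, best_score = (left, right), score
    return best

print(best_split("baumstamm", freq))   # ('baum', 'stamm')
print(best_split("asiatisch", freq))   # ('asia', 'tisch') -- the non-semantic split from the lecture
```

The second printout reproduces exactly the failure mode described above: the statistics happily produce "Asia" + "Tisch" because both parts are frequent, even though the split is linguistically wrong.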
The motivation, and the advantage compared to character-based models, is that you can directly learn a representation for words that occur very often, while still being able to represent rare words by splitting them into smaller units. At first this was only done for compounds, but nowadays there is an algorithm which tries to do it for everything. There are different ways to do this, compound splitting among them, but the most successful one, which is commonly used, is based on data compression. The idea there is: can we find an encoding such that the text is compressed as efficiently as possible? The compression algorithm is called byte-pair encoding, and it is also what we use for splitting. The idea is to recursively replace the most frequent pair of symbols by a new symbol. Applied to language it works like this: you first split all your words into letters, and then you look at which bigram is most frequent, that is, which two symbols occur together most often. Then you replace that pair and repeat until you reach a fixed vocabulary size. That is the nice thing: you can predefine by hand how many symbols you want to represent your text with, and then you can represent any text with these symbols; the more merges you do, the shorter your text becomes.

The original idea was something like this: you have the sequence A B A B C. A common bigram is A B, so you replace A B by a new symbol D, and the text gets shorter. Then you do the next merge, and so on, and this gives you your compressed sequence. Similarly, we can now do it for tokenization. Let's assume you have the sentences 'I go', 'he goes', 'she goes', so your word vocabulary is 'I', 'go', 'he', 'goes', 'she'. The first thing you do is split your corpus into single characters. Since you then only have characters, you no longer know where the word boundaries are, so you introduce them by adding a special symbol at the end of each word; whenever this symbol occurs, you know a word ends there and you can split. So you take the corpus 'I go', 'he goes', 'she goes' and write it as sequences of characters; that is the character-based representation. Now you calculate the bigram statistics: 'I' followed by the end-of-word symbol occurs one time, 'g' followed by 'o' occurs three times, and so on for all the other pairs. Then you look which pair is the most frequent one, and that gives you your first merge rule.
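Before going through the rest of the example step by step, here is a compact sketch of the whole procedure in code, using the toy corpus from the example; the helper names and the tie-breaking among equally frequent pairs are my own choices, so the exact merge order can differ from the slides:

from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merge rules from a list of sentences."""
    # Represent every word as its characters plus an end-of-word marker.
    words = Counter()
    for sentence in corpus:
        for word in sentence.split():
            words[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often the word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)            # most frequent pair
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words
    return merges

def merge_pair(symbols, pair):
    """Replace every occurrence of the pair in the symbol tuple by the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def apply_bpe(word, merges):
    """Segment a new word with the learned rules, applied in learning order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

rules = learn_bpe(["I go", "he goes", "she goes"], num_merges=4)
print(rules)                      # the first learned rule is ('g', 'o'); later ones depend on ties
print(apply_bpe("goes", rules))   # a frequent word is segmented into the merged units
print(apply_bpe("gone", rules))   # an unseen word is still representable from smaller units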
Once you have merged 'g' and 'o', you have these new word representations: 'go' is no longer two symbols but one single symbol, because you have joined them. Then you recompute the pair statistics over the new symbols, count again, and so on. In a small example like this you get many pairs that occur equally often; in reality that happens sometimes, but not that often. Next you might, for example, merge 'go' with the end-of-word symbol, and in this way you continue until you have your vocabulary. Your vocabulary is essentially stored in these merge rules, so people often speak of the vocabulary in terms of the rules. And with these rules, if you now get a different sentence, your final output looks something like this: the words are represented by the subword units that the rules produce. That is the whole algorithm: now you can represent any type of text with a fixed vocabulary.

To the question whether the vocabulary size is simply what you define at the beginning: it is nearly correct that it is the number of characters plus the number of merges. On the one hand, all the right-hand sides of the rules can occur, and additionally all single characters. In reality it can even be a bit smaller, because it might happen that, for example, 'go' never occurs on its own in the final segmentation: whenever that rule is applied, another rule is applied on top of it afterwards, so not all right-hand sides actually show up. So characters plus merges is an upper bound on your vocabulary rather than its exact size.

Then we come to the last part, which is about parallel data, but maybe there are some questions beforehand.

So what is parallel data? We said that for machine translation it is really important that we are dealing with parallel data, which means we have aligned input and output: one sentence in the source language aligned to one sentence in the target language. You need this type of data. However, in machine translation we have one big advantage: this data occurs somewhat naturally, so there is a lot of parallel data that you can somehow gather. In many other NLP tasks you need to manually annotate your data to get aligned data, and creating translations manually is of course very expensive; paying for, say, one million sentences to be translated costs a lot. The nice thing is that such data is normally already available, because other people have translated these texts anyway. So the data is out there, and of course we have to process it; we will have a full lecture on how to deal with the more complex situations, like crawling data from the web. The idea is really that you do not need much human work.
You really just start the crawler with some initial seed pages and let it collect data. But a lot of high-quality parallel data is targeted at specific scenarios. For example, think of the European Parliament: that is one source where you can easily extract this information, and there you have a lot of data. Or we have the TED data, which you can also get from the TED website. So in general, a parallel corpus is a collection of texts together with translations into one or several other languages. This data is important because there is normally no general-purpose MT system; it works especially well if your training and test conditions are similar, so if the topic is similar and the style or modality is similar. If you want to translate speech, it is often better to also train on speech; if you want to translate text, it is better to train on text. Nowadays a lot of this data is available for the common language pairs, so you can normally start from that. It is really easy to get; for example, OPUS is a big website collecting many different types of parallel corpora from which you can select. There is also document alignment, which we will come to later, and there are things like comparable data, where not full sentences are parallel but only some parts.

But first let's assume we have an easy task like the European Parliament, where we have the speeches in German and the speeches in English, and we want to generate parallel data from that. That means we have to align the source sentences with the target sentences, and we want to do this automatically. How can we do that? This is what people refer to as sentence alignment: we have parallel documents in two languages. You normally cannot do this word by word, because there is no direct one-to-one correspondence between the words, but it is relatively feasible on the sentence level. It will not be perfect: sometimes there are two sentences in English for one sentence in German, or the other way round, since German likes long sentences with many subclauses. And for some languages, as we saw, there are not even clear sentence boundary markers, so it gets more complicated.

So how can we formalize this sentence alignment problem? We have a set of source sentences f_1 to f_n and a set of target sentences e_1 to e_m. In machine translation the source and target are nowadays often written as x and y, but traditionally it was f and e, because people started with French-to-English translation. And then the idea is to find an alignment, that is, an ordered set of aligned segments (see the small sketch below).
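To make this concrete, here is a tiny sketch of one possible representation: the documents are lists of sentences, and the alignment is an ordered list of segments, each giving how many consecutive source and target sentences belong together (the sentences are just placeholders):

# The documents: ordered lists of sentences (placeholders instead of real text).
src = ["f1", "f2", "f3"]
tgt = ["e1", "e2"]

# The alignment: segments in document order, given as (source count, target count).
# (1, 1) is a one-to-one segment, (2, 1) aligns two source sentences to one target.
alignment = [(1, 1), (2, 1)]

def covers(alignment, n_src, n_tgt):
    """Monotonicity is built into this representation, since segments are read in
    document order; we only check that they cover both documents completely."""
    return (sum(a for a, _ in alignment) == n_src and
            sum(b for _, b in alignment) == n_tgt)

print(covers(alignment, len(src), len(tgt)))   # True: 1+2 = 3 source, 1+1 = 2 target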
And of course you want these segments to be as short as possible. A trivial solution would be one huge segment, all my source sentences on one side and all my target sentences on the other, but that does not help. So we want short segments, typically one sentence, or at most two or three sentences, so that the alignment is really useful.

Then there are different restrictions on this type of alignment. First of all, it should be a monotone alignment, which means the segments on the source side appear in the same order as their counterparts on the target side; we assume that the document really is monotone and the content flows the same way in source and target. For a very free translation that might no longer hold, but this first algorithm, the Gale and Church algorithm, is meant for translations that are fairly direct. Furthermore, we want to cover the full documents, and each segment has to end before the next one starts.

Then the question is which alignment types we allow. Besides one-to-one alignments, there are sometimes insertions and deletions, where information is added or removed. That can, for example, be an explanation, because some term is known in one language but not in the other. Think of the 'Deutschlandticket': in Germany everybody will by now know what the Deutschlandticket is, but if you translate into English it might be important to explain it, and then you have additional sentences, which are insertions. Then there are two-to-one and one-to-two alignments; for example, a German sentence with many subclauses is often expressed by two sentences in English. Of course it could be even more complex, but typically, to keep it simple, we only allow these types of alignment.

Then it is about finding the best alignment, and for that we score the candidates. In the Gale and Church style of algorithm, the score of the global alignment is the product of the scores of all individual segments, and each segment score combines two things: first, a prior saying that one-to-one alignments are much more likely than all the other types, and second, a lexical similarity, for example based on an initial dictionary, where you count how many dictionary entries match between the two segments. So this is a very simple algorithm; it is typically the first step you take.
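Here is a toy dynamic-programming sketch in the spirit of this scoring scheme; the priors for the segment types and the similarity function are invented for illustration (a simple length comparison stands in for the dictionary-based lexical score), so it only shows the structure of the search, not any particular published implementation:

import math

# Hypothetical log-priors for the allowed segment types (1:1 is strongly preferred).
TYPE_PRIOR = {(1, 1): math.log(0.89), (1, 0): math.log(0.01), (0, 1): math.log(0.01),
              (2, 1): math.log(0.045), (1, 2): math.log(0.045)}

def similarity(src_seg, tgt_seg):
    """Crude stand-in for a lexical score: penalize mismatched character lengths."""
    ls = sum(len(s) for s in src_seg) + 1
    lt = sum(len(t) for t in tgt_seg) + 1
    return -abs(math.log(ls / lt))

def align(src, tgt):
    """Find the best monotone segmentation of two sentence lists."""
    n, m = len(src), len(tgt)
    best = {(0, 0): (0.0, None)}                 # (i, j) -> (score, back-pointer)
    for i in range(n + 1):
        for j in range(m + 1):
            if (i, j) not in best:
                continue
            score, _ = best[(i, j)]
            for (di, dj), prior in TYPE_PRIOR.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s = score + prior + similarity(src[i:ni], tgt[j:nj])
                if (ni, nj) not in best or s > best[(ni, nj)][0]:
                    best[(ni, nj)] = (s, (i, j, di, dj))
    pairs, state = [], (n, m)                    # trace back from the end of both documents
    while best[state][1] is not None:
        i, j, di, dj = best[state][1]
        pairs.append((src[i:i + di], tgt[j:j + dj]))
        state = (i, j)
    return list(reversed(pairs))

print(align(["Ich gehe.", "Er geht, weil er muss."],
            ["I go.", "He goes", "because he has to."]))
# Expected: a 1:1 segment for the first sentence pair and a 1:2 segment for the second.

In practice you would not score with lengths alone but with a proper length model or a dictionary, and you would prune the search for long documents, but the overall structure stays the same.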
And with this you can get an initial alignment, and from that you can then build better parallel data. It is not rule-based; it is an optimization problem: based on the scores you can compute a score for each possible alignment and then select the best one. Of course you will not try out all possibilities, but you can do an efficient search and then find the best one. This can typically be done automatically; of course you should do some checks, for example whether the sentences really align as well as possible. For training data it is typically done this way; for test data you would maybe check manually.

Sorry, I'm a bit over time, because originally I wanted to do a quiz at the end. Can we do a quiz? We'll do it some other time. We had a bachelor project about making quizzes for lectures, and I still want to try it, so let's see; I hope in some other lecture we can do that, and then at the end of the lecture we can do a quiz about it. All we can do now is the practical part, let's see.

So, for today, what you should remember is what parallel data is and how we can create it, how we generally process data, why thinking about the data is really important when you build systems, and the different ways of representing words: the three main options of full words, the character level, or subword units. Is there any question?

To the question whether this alignment is like dynamic time warping: it is not directly using dynamic time warping, but the idea is similar and you can use similar kinds of algorithms. The main difficulty is to define your scoring function, that is, what a good alignment is. But as in dynamic time warping, you have a monotone alignment here, and you cannot have reordering.

Then thanks a lot, and on Thursday we will continue with statistical machine translation.