Spaces:

retkowski
/

ytseg_demo

Running

App Files Files Community

ytseg_demo / demo_data /lectures /Lecture-04-27.04.2023 /English.vtt

retkowski

Add demo

cb71ef5 over 1 year ago

raw

history blame contribute delete

72.6 kB

	WEBVTT

	0:00:03.663 --> 0:00:07.970
	Okay, then I should switch back to English,
	sorry,.

	0:00:08.528 --> 0:00:18.970
	So welcome to today's lecture in the cross
	machine translation and today we're planning

	0:00:18.970 --> 0:00:20.038
	to talk.

	0:00:20.880 --> 0:00:31.845
	Which will be without our summary of power
	translation was done from around till.

	0:00:32.872 --> 0:00:38.471
	Fourteen, so this was an approach which was
	quite long.

	0:00:38.471 --> 0:00:47.070
	It was the first approach where at the end
	the quality was really so good that it was

	0:00:47.070 --> 0:00:49.969
	used as a commercial system.

	0:00:49.990 --> 0:00:56.482
	Or something like that, so the first systems
	there was using the statistical machine translation.

	0:00:57.937 --> 0:01:02.706
	So when I came into the field this was the
	main part of the lecture, so there would be

	0:01:02.706 --> 0:01:07.912
	not be one lecture, but in more detail than
	half of the full course would be about statistical

	0:01:07.912 --> 0:01:09.063
	machine translation.

	0:01:09.369 --> 0:01:23.381
	So what we try to do today is like get the
	most important things, which think our part

	0:01:23.381 --> 0:01:27.408
	is still very important.

	0:01:27.267 --> 0:01:31.196
	Four State of the Art Box.

	0:01:31.952 --> 0:01:45.240
	Then we'll have the presentation about how
	to evaluate the other part of the machine translation.

	0:01:45.505 --> 0:01:58.396
	The other important thing is the language
	modeling part will explain later how they combine.

	0:01:59.539 --> 0:02:04.563
	Shortly mentioned this one already.

	0:02:04.824 --> 0:02:06.025
	On Tuesday.

	0:02:06.246 --> 0:02:21.849
	So in a lot of these explanations, how we
	model translation process, it might be surprising:

	0:02:22.082 --> 0:02:27.905
	Later some people say it's for four eight words
	traditionally came because the first models

	0:02:27.905 --> 0:02:32.715
	which you'll discuss here also when they are
	referred to as the IVM models.

	0:02:32.832 --> 0:02:40.043
	They were trained on French to English translation
	directions and that's why they started using

	0:02:40.043 --> 0:02:44.399
	F and E and then this was done for the next
	twenty years.

	0:02:44.664 --> 0:02:52.316
	So while we are trying to wait, the source
	words is: We have a big eye, typically the

	0:02:52.316 --> 0:03:02.701
	lengths of the sewer sentence in small eye,
	the position, and similarly in the target and

	0:03:02.701 --> 0:03:05.240
	the lengths of small.

	0:03:05.485 --> 0:03:13.248
	Things will get a bit complicated in this
	way because it is not always clear what is

	0:03:13.248 --> 0:03:13.704
	the.

	0:03:14.014 --> 0:03:21.962
	See that there is this noisy channel model
	which switches the direction in your model,

	0:03:21.962 --> 0:03:25.616
	but in the application it's the target.

	0:03:26.006 --> 0:03:37.077
	So that is why if you especially read these
	papers, it might sometimes be a bit disturbing.

	0:03:37.437 --> 0:03:40.209
	Try to keep it here always.

	0:03:40.209 --> 0:03:48.427
	The source is, and even if we use a model
	where it's inverse, we'll keep this way.

	0:03:48.468 --> 0:03:55.138
	Don't get disturbed by that, and I think it's
	possible to understand all that without this

	0:03:55.138 --> 0:03:55.944
	confusion.

	0:03:55.944 --> 0:04:01.734
	But in some of the papers you might get confused
	because they switched to the.

	0:04:04.944 --> 0:04:17.138
	In general, in statistics and machine translation,
	the goal is how we do translation.

	0:04:17.377 --> 0:04:25.562
	But first we are seeing all our possible target
	sentences as possible translations.

	0:04:26.726 --> 0:04:37.495
	And we are assigning some probability to the
	combination, so we are modeling.

	0:04:39.359 --> 0:04:49.746
	And then we are doing a search over all possible
	things or at least theoretically, and we are

	0:04:49.746 --> 0:04:56.486
	trying to find the translation with the highest
	probability.

	0:04:56.936 --> 0:05:05.116
	And this general idea is also true for neuromachine
	translation.

	0:05:05.116 --> 0:05:07.633
	They differ in how.

	0:05:08.088 --> 0:05:10.801
	So these were then of course the two big challenges.

	0:05:11.171 --> 0:05:17.414
	On the one hand, how can we estimate this
	probability?

	0:05:17.414 --> 0:05:21.615
	How is the translation of the other?

	0:05:22.262 --> 0:05:32.412
	The other challenge is the search, so we cannot,
	of course, say we want to find the most probable

	0:05:32.412 --> 0:05:33.759
	translation.

	0:05:33.759 --> 0:05:42.045
	We cannot go over all possible English sentences
	and calculate the probability.

	0:05:43.103 --> 0:05:45.004
	So,.

	0:05:45.165 --> 0:05:53.423
	What we have to do there is some are doing
	intelligent search and look for the ones and

	0:05:53.423 --> 0:05:54.268
	compare.

	0:05:54.734 --> 0:05:57.384
	That will be done.

	0:05:57.384 --> 0:06:07.006
	This process of finding them is called the
	decoding process because.

	0:06:07.247 --> 0:06:09.015
	They will be covered well later.

	0:06:09.015 --> 0:06:11.104
	Today we will concentrate on the mile.

	0:06:11.451 --> 0:06:23.566
	The model is trained using data, so in the
	first step we're having data, we're somehow

	0:06:23.566 --> 0:06:30.529
	having a definition of what the model looks
	like.

	0:06:34.034 --> 0:06:42.913
	And in statistical machine translation the
	common model is behind.

	0:06:42.913 --> 0:06:46.358
	That is what is referred.

	0:06:46.786 --> 0:06:55.475
	And this is motivated by the initial idea
	from Shannon.

	0:06:55.475 --> 0:07:02.457
	We have this that you can think of decoding.

	0:07:02.722 --> 0:07:10.472
	So think of it as we have this text in maybe
	German.

	0:07:10.472 --> 0:07:21.147
	Originally it was an English text, but somebody
	used some nice decoding.

	0:07:21.021 --> 0:07:28.579
	Task is to decipher it again, this crazy cyborg
	expressing things in German, and to decipher

	0:07:28.579 --> 0:07:31.993
	the meaning again and doing that between.

	0:07:32.452 --> 0:07:35.735
	And that is the idea about this noisy channel
	when it.

	0:07:36.236 --> 0:07:47.209
	It goes through some type of channel which
	adds noise to the source and then you receive

	0:07:47.209 --> 0:07:48.811
	the message.

	0:07:49.429 --> 0:08:00.190
	And then the idea is, can we now construct
	the original message out of these messages

	0:08:00.190 --> 0:08:05.070
	by modeling some of the channels here?

	0:08:06.726 --> 0:08:15.797
	There you know to see a bit the surface of
	the source message with English.

	0:08:15.797 --> 0:08:22.361
	It went through some channel and received
	the message.

	0:08:22.682 --> 0:08:31.381
	If you're not looking at machine translation,
	your source language is English.

	0:08:31.671 --> 0:08:44.388
	Here you see now a bit of this where the confusion
	starts while English as a target language is

	0:08:44.388 --> 0:08:47.700
	also the source message.

	0:08:47.927 --> 0:08:48.674
	You can see.

	0:08:48.674 --> 0:08:51.488
	There is also a mathematics of how we model
	the.

	0:08:52.592 --> 0:08:56.888
	It's a noisy channel model from a mathematic
	point of view.

	0:08:56.997 --> 0:09:00.245
	So this is again our general formula.

	0:09:00.245 --> 0:09:08.623
	We are looking for the most probable translation
	and that is the translation that has the highest

	0:09:08.623 --> 0:09:09.735
	probability.

	0:09:09.809 --> 0:09:19.467
	We are not interested in the probability itself,
	but we are interesting in this target sentence

	0:09:19.467 --> 0:09:22.082
	E where this probability.

	0:09:23.483 --> 0:09:33.479
	And: Therefore, we can use them twice definition
	of conditional probability and using the base

	0:09:33.479 --> 0:09:42.712
	rules, so this probability equals the probability
	of f giving any kind of probability of e divided

	0:09:42.712 --> 0:09:44.858
	by the probability of.

	0:09:45.525 --> 0:09:48.218
	Now see mathematically this confusion.

	0:09:48.218 --> 0:09:54.983
	Originally we are interested in the probability
	of the target sentence given the search sentence.

	0:09:55.295 --> 0:10:00.742
	And if we are modeling things now, we are
	looking here at the inverse direction, so the

	0:10:00.742 --> 0:10:06.499
	probability of F given E to the probability
	of the source sentence given the target sentence

	0:10:06.499 --> 0:10:10.832
	is the probability of the target sentence divided
	by the probability.

	0:10:13.033 --> 0:10:15.353
	Why are we doing this?

	0:10:15.353 --> 0:10:24.333
	Maybe I mean, of course, once it's motivated
	by our model, that we were saying this type

	0:10:24.333 --> 0:10:27.058
	of how we are modeling it.

	0:10:27.058 --> 0:10:30.791
	The other interesting thing is that.

	0:10:31.231 --> 0:10:40.019
	So we are looking at this probability up there,
	which we had before we formulate that we can

	0:10:40.019 --> 0:10:40.775
	remove.

	0:10:41.181 --> 0:10:46.164
	If we are searching for the highest translation,
	this is fixed.

	0:10:46.164 --> 0:10:47.800
	This doesn't change.

	0:10:47.800 --> 0:10:52.550
	We have an input, the source sentence, and
	we cannot change.

	0:10:52.812 --> 0:11:02.780
	Is always the same, so we can ignore it in
	the ACMAX because the lower one is exactly

	0:11:02.780 --> 0:11:03.939
	the same.

	0:11:04.344 --> 0:11:06.683
	And then we have p o f.

	0:11:06.606 --> 0:11:13.177
	E times P of E and that is so we are modeling
	the translation process on the one hand with

	0:11:13.177 --> 0:11:19.748
	the translation model which models how probable
	is the sentence F given E and on the other

	0:11:19.748 --> 0:11:25.958
	hand with the language model which models only
	how probable is this English sentence.

	0:11:26.586 --> 0:11:39.366
	That somebody wrote this language or translation
	point of view, this is about fluency.

	0:11:40.200 --> 0:11:44.416
	You should have in German, for example, agreement.

	0:11:44.416 --> 0:11:50.863
	If the agreement is not right, that's properly
	not said by anybody in German.

	0:11:50.863 --> 0:11:58.220
	Nobody would say that's Schönest's house because
	it's not according to the German rules.

	0:11:58.598 --> 0:12:02.302
	So this can be modeled by the language model.

	0:12:02.542 --> 0:12:09.855
	And you have the translation model which models
	housings get translated between the.

	0:12:10.910 --> 0:12:18.775
	And here you see again our confusion again,
	and now here put the translation model: Wage

	0:12:18.775 --> 0:12:24.360
	is a big income counterintuitive because the
	probability of a sewer sentence giving the

	0:12:24.360 --> 0:12:24.868
	target.

	0:12:26.306 --> 0:12:35.094
	Have to do that for the bass farmer, but in
	the following slides I'll talk again about.

	0:12:35.535 --> 0:12:45.414
	Because yeah, that's more intuitive that you
	model the translation of the target sentence

	0:12:45.414 --> 0:12:48.377
	given the source sentence.

	0:12:50.930 --> 0:12:55.668
	And this is what we want to talk about today.

	0:12:55.668 --> 0:13:01.023
	We later talk about language models how to
	do that.

	0:13:00.940 --> 0:13:04.493
	And maybe also how to combine them.

	0:13:04.493 --> 0:13:13.080
	But the focus on today would be how can we
	model this probability to how to generate a

	0:13:13.080 --> 0:13:16.535
	translation from source to target?

	0:13:19.960 --> 0:13:24.263
	How can we do that and the easiest thing?

	0:13:24.263 --> 0:13:33.588
	Maybe if you think about statistics, you count
	how many examples you have, how many target

	0:13:33.588 --> 0:13:39.121
	sentences go occur, and that gives you an estimation.

	0:13:40.160 --> 0:13:51.632
	However, like in another model that is not
	possible because most sentences you will never

	0:13:51.632 --> 0:13:52.780
	see, so.

	0:13:53.333 --> 0:14:06.924
	So what we have to do is break up the translation
	process into smaller models and model each

	0:14:06.924 --> 0:14:09.555
	of the decisions.

	0:14:09.970 --> 0:14:26.300
	So this simple solution with how you throw
	a dice is like you have a and that gives you

	0:14:26.300 --> 0:14:29.454
	the probability.

	0:14:29.449 --> 0:14:40.439
	But here's the principle because each event
	is so rare that most of them never have helped.

	0:14:43.063 --> 0:14:48.164
	Although it might be that in all your training
	data you have never seen this title of set.

	0:14:49.589 --> 0:14:52.388
	How can we do that?

	0:14:52.388 --> 0:15:04.845
	We look in statistical machine translation
	into two different models, a generative model

	0:15:04.845 --> 0:15:05.825
	where.

	0:15:06.166 --> 0:15:11.736
	So the idea was to really model model like
	each individual translation between words.

	0:15:12.052 --> 0:15:22.598
	So you break down the translation of a full
	sentence into the translation of each individual's

	0:15:22.598 --> 0:15:23.264
	word.

	0:15:23.264 --> 0:15:31.922
	So you say if you have the black cat, if you
	translate it, the full sentence.

	0:15:32.932 --> 0:15:38.797
	Of course, this has some challenges, any ideas
	where this type of model could be very challenging.

	0:15:40.240 --> 0:15:47.396
	Vocabularies and videos: Yes, we're going
	to be able to play in the very color.

	0:15:47.867 --> 0:15:51.592
	Yes, but you could at least use a bit of the
	context around it.

	0:15:51.592 --> 0:15:55.491
	It will not only depend on the word, but it's
	already challenging.

	0:15:55.491 --> 0:15:59.157
	You make things very hard, so that's definitely
	one challenge.

	0:16:00.500 --> 0:16:07.085
	One other, what did you talk about that we
	just don't want to say?

	0:16:08.348 --> 0:16:11.483
	Yes, they are challenging.

	0:16:11.483 --> 0:16:21.817
	You have to do something like words, but the
	problem is that you might introduce errors.

	0:16:21.841 --> 0:16:23.298
	Later and makes things very comfortable.

	0:16:25.265 --> 0:16:28.153
	Wrong splitting is the worst things that are
	very complicated.

	0:16:32.032 --> 0:16:35.580
	Saints, for example, and also maybe Japanese
	medicine.

	0:16:35.735 --> 0:16:41.203
	In German, yes, especially like these are
	all right.

	0:16:41.203 --> 0:16:46.981
	The first thing is maybe the one which is
	most obvious.

	0:16:46.981 --> 0:16:49.972
	It is raining cats and dogs.

	0:16:51.631 --> 0:17:01.837
	To German, the cat doesn't translate this
	whole chunk into something because there is

	0:17:01.837 --> 0:17:03.261
	not really.

	0:17:03.403 --> 0:17:08.610
	Mean, of course, in generally there is this
	type of alignment, so there is a correspondence

	0:17:08.610 --> 0:17:11.439
	between words in English and the words in German.

	0:17:11.439 --> 0:17:16.363
	However, that's not true for all sentences,
	so in some sentences you cannot really say

	0:17:16.363 --> 0:17:18.174
	this word translates into that.

	0:17:18.498 --> 0:17:21.583
	But you can only let more locate this whole
	phrase.

	0:17:21.583 --> 0:17:23.482
	This model into something else.

	0:17:23.563 --> 0:17:30.970
	If you think about the don't in English, the
	do is not really clearly where should that

	0:17:30.970 --> 0:17:31.895
	be allied.

	0:17:32.712 --> 0:17:39.079
	Then for a long time the most successful approach
	was this phrase based translation model where

	0:17:39.079 --> 0:17:45.511
	the idea is your block is not a single word
	but a longer phrase if you try to build translations

	0:17:45.511 --> 0:17:46.572
	based on these.

	0:17:48.768 --> 0:17:54.105
	But let's start with a word based and what
	you need.

	0:17:54.105 --> 0:18:03.470
	There is two main knowledge sources, so on
	the one hand we have a lexicon where we translate

	0:18:03.470 --> 0:18:05.786
	possible translations.

	0:18:06.166 --> 0:18:16.084
	The main difference between the lexicon and
	statistical machine translation and lexicon

	0:18:16.084 --> 0:18:17.550
	as you know.

	0:18:17.837 --> 0:18:23.590
	Traditional lexicon: You know how word is
	translated and mainly it's giving you two or

	0:18:23.590 --> 0:18:26.367
	three examples with any example sentence.

	0:18:26.367 --> 0:18:30.136
	So in this context it gets translated like
	that henceon.

	0:18:30.570 --> 0:18:38.822
	In order to model that and work with probabilities
	what we need in a machine translation is these:

	0:18:39.099 --> 0:18:47.962
	So if we have the German word bargain, it sends
	me out with a probability of zero point five.

	0:18:47.962 --> 0:18:51.545
	Maybe it's translated into a vehicle.

	0:18:52.792 --> 0:18:58.876
	And of course this is not easy to be created
	by a shoveman.

	0:18:58.876 --> 0:19:07.960
	If ask you and give probabilities for how
	probable this vehicle is, there might: So how

	0:19:07.960 --> 0:19:12.848
	we are doing is again that the lexicon is automatically
	will be created from a corpus.

	0:19:13.333 --> 0:19:18.754
	And we're just counting here, so we count
	how often does it work, how often does it co

	0:19:18.754 --> 0:19:24.425
	occur with vehicle, and then we're taking the
	ratio and saying in the house of time on the

	0:19:24.425 --> 0:19:26.481
	English side there was vehicles.

	0:19:26.481 --> 0:19:31.840
	There was a probability of vehicles given
	back, and there's something like zero point

	0:19:31.840 --> 0:19:32.214
	five.

	0:19:33.793 --> 0:19:46.669
	That we need another concept, and that is
	this concept of alignment, and now you can

	0:19:46.669 --> 0:19:47.578
	have.

	0:19:47.667 --> 0:19:53.113
	Since this is quite complicated, the alignment
	in general can be complex.

	0:19:53.113 --> 0:19:55.689
	It can be that it's not only like.

	0:19:55.895 --> 0:20:04.283
	It can be that two words of a surrender target
	sign and it's also imbiguous.

	0:20:04.283 --> 0:20:13.761
	It can be that you say all these two words
	only are aligned together and our words are

	0:20:13.761 --> 0:20:15.504
	aligned or not.

	0:20:15.875 --> 0:20:21.581
	Is should the do be aligned to the knot in
	German?

	0:20:21.581 --> 0:20:29.301
	It's only there because in German it's not,
	so it should be aligned.

	0:20:30.510 --> 0:20:39.736
	However, typically it's formalized and it's
	formalized by a function from the target language.

	0:20:40.180 --> 0:20:44.051
	And that is to make these models get easier
	and clearer.

	0:20:44.304 --> 0:20:49.860
	That means what means does it mean that you
	have a fence that means that each.

	0:20:49.809 --> 0:20:58.700
	A sewer's word gives target word and the alliance
	to only one source word because the function

	0:20:58.700 --> 0:21:00.384
	is also directly.

	0:21:00.384 --> 0:21:05.999
	However, a source word can be hit or like
	by signal target.

	0:21:06.286 --> 0:21:11.332
	So you are allowing for one to many alignments,
	but not for many to one alignment.

	0:21:11.831 --> 0:21:17.848
	That is a bit of a challenge because you assume
	a lightning should be symmetrical.

	0:21:17.848 --> 0:21:24.372
	So if you look at a parallel sentence, it
	should not matter if you look at it from German

	0:21:24.372 --> 0:21:26.764
	to English or English to German.

	0:21:26.764 --> 0:21:34.352
	So however, it makes these models: Yea possible
	and we'll like to see yea for the phrase bass

	0:21:34.352 --> 0:21:36.545
	until we need these alignments.

	0:21:36.836 --> 0:21:41.423
	So this alignment was the most important of
	the world based models.

	0:21:41.423 --> 0:21:47.763
	For the next twenty years you need the world
	based models to generate this type of alignment,

	0:21:47.763 --> 0:21:50.798
	which is then the first step for the phrase.

	0:21:51.931 --> 0:21:59.642
	Approach, and there you can then combine them
	again like both directions into one we'll see.

	0:22:00.280 --> 0:22:06.850
	This alignment is very important and allows
	us to do this type of separation.

	0:22:08.308 --> 0:22:15.786
	And yet the most commonly used word based
	models are these models referred to as IBM

	0:22:15.786 --> 0:22:25.422
	models, and there is a sequence of them with
	great names: And they were like yeah very commonly

	0:22:25.422 --> 0:22:26.050
	used.

	0:22:26.246 --> 0:22:31.719
	We'll mainly focus on the simple one here
	and look how this works and then not do all

	0:22:31.719 --> 0:22:34.138
	the details about the further models.

	0:22:34.138 --> 0:22:38.084
	The interesting thing is also that all of
	them are important.

	0:22:38.084 --> 0:22:43.366
	So if you want to train this alignment what
	you normally do is train an IVM model.

	0:22:43.743 --> 0:22:50.940
	Then you take that as your initialization
	to then train the IBM model too and so on.

	0:22:50.940 --> 0:22:53.734
	The motivation for that is yeah.

	0:22:53.734 --> 0:23:00.462
	The first model gives you: Is so simple that
	you can even find a global optimum, so it gives

	0:23:00.462 --> 0:23:06.403
	you a good starting point for the next one
	where the optimization in finding the right

	0:23:06.403 --> 0:23:12.344
	model is more difficult and therefore like
	the defore technique was to make your model

	0:23:12.344 --> 0:23:13.641
	step by step more.

	0:23:15.195 --> 0:23:27.333
	In these models we are breaking down the probability
	into smaller steps and then we can define:

	0:23:27.367 --> 0:23:38.981
	You see it's not a bit different, so it's not
	the curability and one specific alignment given.

	0:23:39.299 --> 0:23:42.729
	We'll let us learn how we can then go from
	one alignment to the full set.

	0:23:43.203 --> 0:23:52.889
	The probability of target sentences and one
	alignment between the source and target sentences

	0:23:52.889 --> 0:23:56.599
	alignment is this type of function.

	0:23:57.057 --> 0:24:14.347
	That every word is aligned in order to ensure
	that every word is aligned.

	0:24:15.835 --> 0:24:28.148
	So first of all you do some epsilon, the epsilon
	is just a normalization factor that everything

	0:24:28.148 --> 0:24:31.739
	is somehow to inferability.

	0:24:31.631 --> 0:24:37.539
	Of source sentences plus one to the power
	of the length of the targets.

	0:24:37.937 --> 0:24:50.987
	And this is somehow the probability of this
	alignment.

	0:24:51.131 --> 0:24:53.224
	So is this alignment probable or not?

	0:24:53.224 --> 0:24:55.373
	Of course you can have some intuition.

	0:24:55.373 --> 0:24:58.403
	So if there's a lot of crossing, it may be
	not a good.

	0:24:58.403 --> 0:25:03.196
	If all of the words align to the same one
	might be not a good alignment, but generally

	0:25:03.196 --> 0:25:06.501
	it's difficult to really describe what is a
	good alignment.

	0:25:07.067 --> 0:25:11.482
	Say for the first model that's the most simple
	thing.

	0:25:11.482 --> 0:25:18.760
	What can be the most simple thing if you think
	about giving a probability to some event?

	0:25:21.401 --> 0:25:25.973
	Yes exactly, so just take the uniform distribution.

	0:25:25.973 --> 0:25:33.534
	If we don't really know the best thing of
	modeling is all equally probable, of course

	0:25:33.534 --> 0:25:38.105
	that is not true, but it's giving you a good
	study.

	0:25:38.618 --> 0:25:44.519
	And so this one is just a number of all possible
	alignments for this sentence.

	0:25:44.644 --> 0:25:53.096
	So how many alignments are possible, so the
	first target word can be allied to all sources

	0:25:53.096 --> 0:25:53.746
	worth.

	0:25:54.234 --> 0:26:09.743
	The second one can also be aligned to all
	source work, and the third one also to source.

	0:26:10.850 --> 0:26:13.678
	This is the number of alignments.

	0:26:13.678 --> 0:26:19.002
	The second part is to model the probability
	of the translation.

	0:26:19.439 --> 0:26:31.596
	And there it's not nice to have this function,
	so now we are making the product over all target.

	0:26:31.911 --> 0:26:40.068
	And we are making a very strong independent
	assumption because in these models we normally

	0:26:40.068 --> 0:26:45.715
	assume the translation probability of one word
	is independent.

	0:26:46.126 --> 0:26:49.800
	So how you translate and visit it is independent
	of all the other parts.

	0:26:50.290 --> 0:26:52.907
	That is very strong and very bad.

	0:26:52.907 --> 0:26:55.294
	Yeah, you should do it better.

	0:26:55.294 --> 0:27:00.452
	We know that it's wrong because how you translate
	this depends on.

	0:27:00.452 --> 0:27:05.302
	However, it's a first easy solution and again
	a good starting.

	0:27:05.966 --> 0:27:14.237
	So what you do is that you take a product
	of all words and take a translation probability

	0:27:14.237 --> 0:27:15.707
	on this target.

	0:27:16.076 --> 0:27:23.901
	And because we know that there is always one
	source word allied to that, so it.

	0:27:24.344 --> 0:27:37.409
	If the probability of visits in the zoo doesn't
	really work, the good here I'm again.

	0:27:38.098 --> 0:27:51.943
	So most only we have it here, so the probability
	is an absolute divided pipe to the power.

	0:27:53.913 --> 0:27:58.401
	And then there is somewhere in the last one.

	0:27:58.401 --> 0:28:04.484
	There is an arrow and switch, so it is the
	other way around.

	0:28:04.985 --> 0:28:07.511
	Then you have your translation model.

	0:28:07.511 --> 0:28:12.498
	Hopefully let's assume you have your water
	train so that's only a signing.

	0:28:12.953 --> 0:28:25.466
	And then this sentence has the probability
	of generating I visit a friend given that you

	0:28:25.466 --> 0:28:31.371
	have the source sentence if Bezukhov I'm.

	0:28:32.012 --> 0:28:34.498
	Time stand to the power of minus five.

	0:28:35.155 --> 0:28:36.098
	So this is your model.

	0:28:36.098 --> 0:28:37.738
	This is how you're applying your model.

	0:28:39.479 --> 0:28:44.220
	As you said, it's the most simple bottle you
	assume that all word translations are.

	0:28:44.204 --> 0:28:46.540
	Independent of each other.

	0:28:46.540 --> 0:28:54.069
	You assume that all alignments are equally
	important, and then the only thing you need

	0:28:54.069 --> 0:29:00.126
	for this type of model is to have this lexicon
	in order to calculate.

	0:29:00.940 --> 0:29:04.560
	And that is, of course, now the training process.

	0:29:04.560 --> 0:29:08.180
	The question is how do we get this type of
	lexic?

	0:29:09.609 --> 0:29:15.461
	But before we look into the training, do you
	have any questions about the model itself?

	0:29:21.101 --> 0:29:26.816
	The problem in training is that we have incomplete
	data.

	0:29:26.816 --> 0:29:32.432
	So if you want to count, I mean said you want
	to count.

	0:29:33.073 --> 0:29:39.348
	However, if you don't have the alignment,
	on the other hand, if you would have a lexicon

	0:29:39.348 --> 0:29:44.495
	you could maybe generate the alignment, which
	is the most probable word.

	0:29:45.225 --> 0:29:55.667
	And this is the very common problem that you
	have this type of incomplete data where you

	0:29:55.667 --> 0:29:59.656
	have not one type of information.

	0:30:00.120 --> 0:30:08.767
	And you can model this by considering the
	alignment as your hidden variable and then

	0:30:08.767 --> 0:30:17.619
	you can use the expectation maximization algorithm
	in order to generate the alignment.

	0:30:17.577 --> 0:30:26.801
	So the nice thing is that you only need your
	parallel data, which is aligned on sentence

	0:30:26.801 --> 0:30:29.392
	level, but you normally.

	0:30:29.389 --> 0:30:33.720
	Is just a lot of work we saw last time.

	0:30:33.720 --> 0:30:39.567
	Typically what you have is this type of corpus
	where.

	0:30:41.561 --> 0:30:50.364
	And yeah, the ERM algorithm sounds very fancy.

	0:30:50.364 --> 0:30:58.605
	However, again look at a little high level.

	0:30:58.838 --> 0:31:05.841
	So you're initializing a model by uniform
	distribution.

	0:31:05.841 --> 0:31:14.719
	You're just saying if have lexicon, if all
	words are equally possible.

	0:31:15.215 --> 0:31:23.872
	And then you apply your model to the data,
	and that is your expectation step.

	0:31:23.872 --> 0:31:30.421
	So given this initial lexicon, we are now
	calculating the.

	0:31:30.951 --> 0:31:36.043
	So we can now take all our parallel sentences,
	and of course ought to check what is the most

	0:31:36.043 --> 0:31:36.591
	probable.

	0:31:38.338 --> 0:31:49.851
	And then, of course, at the beginning maybe
	houses most often in line.

	0:31:50.350 --> 0:31:58.105
	Once we have done this expectation step, we
	can next do the maximization step and based

	0:31:58.105 --> 0:32:06.036
	on this guest alignment, which we have, we
	can now learn better translation probabilities

	0:32:06.036 --> 0:32:09.297
	by just counting how often do words.

	0:32:09.829 --> 0:32:22.289
	And then it's rated these steps: We can make
	this whole process even more stable, only taking

	0:32:22.289 --> 0:32:26.366
	the most probable alignment.

	0:32:26.346 --> 0:32:36.839
	Second step, but in contrast we calculate
	for all possible alignments the alignment probability

	0:32:36.839 --> 0:32:40.009
	and weigh the correcurrence.

	0:32:40.000 --> 0:32:41.593
	Then Things Are Most.

	0:32:42.942 --> 0:32:49.249
	Why could that be very challenging if we do
	it in general and really calculate all probabilities

	0:32:49.249 --> 0:32:49.834
	for all?

	0:32:53.673 --> 0:32:55.905
	How many alignments are there for a Simpson?

	0:32:58.498 --> 0:33:03.344
	Yes there, we just saw that in the formula
	if you remember.

	0:33:03.984 --> 0:33:12.336
	This was the formula so it's exponential in
	the lengths of the target sentence.

	0:33:12.336 --> 0:33:15.259
	It would calculate all the.

	0:33:15.415 --> 0:33:18.500
	Be very inefficient and really possible.

	0:33:18.500 --> 0:33:25.424
	The nice thing is we can again use some type
	of dynamic programming, so then we can do this

	0:33:25.424 --> 0:33:27.983
	without really calculating audit.

	0:33:28.948 --> 0:33:40.791
	We have the next pipe slides or so with the
	most equations in the whole lecture, so don't

	0:33:40.791 --> 0:33:41.713
	worry.

	0:33:42.902 --> 0:34:01.427
	So we said we have first explanation where
	it is about calculating the alignment.

	0:34:02.022 --> 0:34:20.253
	And we can do this with our initial definition
	of because this formula.

	0:34:20.160 --> 0:34:25.392
	So we can define this as and and divided by
	and.

	0:34:25.905 --> 0:34:30.562
	This is just the normal definition of a conditional
	probability.

	0:34:31.231 --> 0:34:37.937
	And what we then need to assume a meter calculate
	is P of E given.

	0:34:37.937 --> 0:34:41.441
	P of E given is still again quiet.

	0:34:41.982 --> 0:34:56.554
	Simple: The probability of the sewer sentence
	given the target sentence is quite intuitive.

	0:34:57.637 --> 0:35:15.047
	So let's just calculate how to calculate the
	probability of a event.

	0:35:15.215 --> 0:35:21.258
	So in here we can then put in our original
	form in our soils.

	0:35:21.201 --> 0:35:28.023
	There are some of the possible alignments
	of the first word, and so until the sum of

	0:35:28.023 --> 0:35:30.030
	all possible alignments.

	0:35:29.990 --> 0:35:41.590
	And then we have the probability here of the
	alignment type, this product of translation.

	0:35:42.562 --> 0:35:58.857
	Now this one is independent of the alignment,
	so we can put it to the front here.

	0:35:58.959 --> 0:36:03.537
	And now this is where dynamic programming
	works in.

	0:36:03.537 --> 0:36:08.556
	We can change that and make thereby things
	a lot easier.

	0:36:08.668 --> 0:36:21.783
	Can reform it like this just as a product
	over all target positions, and then it's the

	0:36:21.783 --> 0:36:26.456
	sum over all source positions.

	0:36:27.127 --> 0:36:36.454
	Maybe at least the intuition why this is equal
	is a lot easier if you look into it as graphic.

	0:36:36.816 --> 0:36:39.041
	So what we have here is the table.

	0:36:39.041 --> 0:36:42.345
	We have the target position and the Swiss
	position.

	0:36:42.862 --> 0:37:03.643
	And we have to sum up all possible passes
	through that: The nice thing is that each of

	0:37:03.643 --> 0:37:07.127
	these passes these probabilities are independent
	of each.

	0:37:07.607 --> 0:37:19.678
	In order to get the sum of all passes through
	this table you can use dynamic programming

	0:37:19.678 --> 0:37:27.002
	and then say oh this probability is exactly
	the same.

	0:37:26.886 --> 0:37:34.618
	Times the sun of this column finds the sum
	of this column, and times the sun of this colun.

	0:37:35.255 --> 0:37:41.823
	That is the same as if you go through all
	possible passes here and multiply always the

	0:37:41.823 --> 0:37:42.577
	elements.

	0:37:43.923 --> 0:37:54.227
	And that is a simplification because now we
	only have quadratic numbers and we don't have

	0:37:54.227 --> 0:37:55.029
	to go.

	0:37:55.355 --> 0:38:12.315
	Similar to guess you may be seen the same
	type of algorithm for what is it?

	0:38:14.314 --> 0:38:19.926
	Yeah, well yeah, so that is the saying.

	0:38:19.926 --> 0:38:31.431
	But yeah, I think graphically this is seeable
	if you don't know exactly the mass.

	0:38:32.472 --> 0:38:49.786
	Now put these both together, so if you really
	want to take a piece of and put these two formulas

	0:38:49.786 --> 0:38:51.750
	together,.

	0:38:51.611 --> 0:38:56.661
	Eliminated and Then You Get Your Final Formula.

	0:38:56.716 --> 0:39:01.148
	And that somehow really makes now really intuitively
	again sense.

	0:39:01.401 --> 0:39:08.301
	So the probability of an alignment is the
	product of all target sentences, and then it's

	0:39:08.301 --> 0:39:15.124
	the probability of to translate a word into
	the word that is aligned to divided by some

	0:39:15.124 --> 0:39:17.915
	of the other words in the sentence.

	0:39:18.678 --> 0:39:31.773
	If you look at this again, it makes real descent.

	0:39:31.891 --> 0:39:43.872
	So you're looking at how probable it is to
	translate compared to all the other words.

	0:39:43.872 --> 0:39:45.404
	So you're.

	0:39:45.865 --> 0:39:48.543
	So and that gives you the alignment probability.

	0:39:48.768 --> 0:39:54.949
	Somehow it's not only that it's mathematically
	correct if you look at it this way, it's somehow

	0:39:54.949 --> 0:39:55.785
	intuitively.

	0:39:55.785 --> 0:39:58.682
	So if you would say how good is it to align?

	0:39:58.638 --> 0:40:04.562
	We had to zoo him to visit, or yet it should
	depend on how good this is the translation

	0:40:04.562 --> 0:40:10.620
	probability compared to how good are the other
	words in the sentence, and how probable is

	0:40:10.620 --> 0:40:12.639
	it that I align them to them.

	0:40:15.655 --> 0:40:26.131
	Then you have the expectations that the next
	thing is now the maximization step, so we have

	0:40:26.131 --> 0:40:30.344
	now the probability of an alignment.

	0:40:31.451 --> 0:40:37.099
	Intuitively, that means how often are words
	aligned to each other giving this alignment

	0:40:37.099 --> 0:40:39.281
	or more in a perverse definition?

	0:40:39.281 --> 0:40:43.581
	What is the expectation value that they are
	aligned to each other?

	0:40:43.581 --> 0:40:49.613
	So if there's a lot of alignments with hyperability
	that they're aligned to each other, then.

	0:40:50.050 --> 0:41:07.501
	So the count of E and given F given our caravan
	data is a sum of all possible alignments.

	0:41:07.968 --> 0:41:14.262
	That is, this count, and you don't do just
	count with absolute numbers, but you count

	0:41:14.262 --> 0:41:14.847
	always.

	0:41:15.815 --> 0:41:26.519
	And to make that translation probability is
	that you have to normalize it, of course, through:

	0:41:27.487 --> 0:41:30.584
	And that's then the whole model.

	0:41:31.111 --> 0:41:39.512
	It looks now maybe a bit mathematically complex.

	0:41:39.512 --> 0:41:47.398
	The whole training process is described here.

	0:41:47.627 --> 0:41:53.809
	So you really, really just have to collect
	these counts and later normalize that.

	0:41:54.134 --> 0:42:03.812
	So repeating that until convergence we have
	said the ear migration is always done again.

	0:42:04.204 --> 0:42:15.152
	Equally, then you go over all sentence pairs
	and all of words and calculate the translation.

	0:42:15.355 --> 0:42:17.983
	And then you go once again over.

	0:42:17.983 --> 0:42:22.522
	It counted this count, count given, and totally
	e-given.

	0:42:22.702 --> 0:42:35.316
	Initially how probable is the E translated
	to something else, and you normalize your translation

	0:42:35.316 --> 0:42:37.267
	probabilities.

	0:42:38.538 --> 0:42:45.761
	So this is an old training process for this
	type of.

	0:42:46.166 --> 0:43:00.575
	How that then works is shown here a bit, so
	we have a very simple corpus.

	0:43:01.221 --> 0:43:12.522
	And as we said, you initialize your translation
	with yes or possible translations, so dusk

	0:43:12.522 --> 0:43:16.620
	can be aligned to the bookhouse.

	0:43:16.997 --> 0:43:25.867
	And the other ones are missing because only
	a curse with and book, and then the others

	0:43:25.867 --> 0:43:26.988
	will soon.

	0:43:27.127 --> 0:43:34.316
	In the initial way your vocabulary is for
	works, so the initial probabilities are all:

	0:43:34.794 --> 0:43:50.947
	And then if you iterate you see that the things
	which occur often and then get alignments get

	0:43:50.947 --> 0:43:53.525
	more and more.

	0:43:55.615 --> 0:44:01.506
	In reality, of course, you won't get like
	zero alignments, but you would normally get

	0:44:01.506 --> 0:44:02.671
	there sometimes.

	0:44:03.203 --> 0:44:05.534
	But as the probability increases.

	0:44:05.785 --> 0:44:17.181
	The training process is also guaranteed that
	the probability of your training data is always

	0:44:17.181 --> 0:44:20.122
	increased in iteration.

	0:44:21.421 --> 0:44:27.958
	You see that the model tries to model your
	training data and give you at least good models.

	0:44:30.130 --> 0:44:37.765
	Okay, are there any more questions to the
	training of these type of word-based models?

	0:44:38.838 --> 0:44:54.790
	Initially there is like forwards in the source
	site, so it's just one force to do equal distribution.

	0:44:55.215 --> 0:45:01.888
	So each target word, the probability of the
	target word, is at four target words, so the

	0:45:01.888 --> 0:45:03.538
	uniform distribution.

	0:45:07.807 --> 0:45:14.430
	However, there is problems with this initial
	order and we have this already mentioned at

	0:45:14.430 --> 0:45:15.547
	the beginning.

	0:45:15.547 --> 0:45:21.872
	There is for example things that yeah you
	want to allow for reordering but there are

	0:45:21.872 --> 0:45:27.081
	definitely some alignments which should be
	more probable than others.

	0:45:27.347 --> 0:45:42.333
	So a friend visit should have a lower probability
	than visit a friend.

	0:45:42.302 --> 0:45:50.233
	It's not always monitoring, there is some
	reordering happening, but if you just mix it

	0:45:50.233 --> 0:45:51.782
	crazy, it's not.

	0:45:52.252 --> 0:46:11.014
	You have slings like one too many alignments
	and they are not really models.

	0:46:11.491 --> 0:46:17.066
	But it shouldn't be that you align one word
	to all the others, and that is, you don't want

	0:46:17.066 --> 0:46:18.659
	this type of probability.

	0:46:19.199 --> 0:46:27.879
	You don't want to align to null, so there's
	nothing about that and how to deal with other

	0:46:27.879 --> 0:46:30.386
	words on the source side.

	0:46:32.272 --> 0:46:45.074
	And therefore this was only like the initial
	model in there.

	0:46:45.325 --> 0:46:47.639
	Models, which we saw.

	0:46:47.639 --> 0:46:57.001
	They only model the translation probability,
	so how probable is it to translate one word

	0:46:57.001 --> 0:46:58.263
	to another?

	0:46:58.678 --> 0:47:05.915
	What you could then add is the absolute position.

	0:47:05.915 --> 0:47:16.481
	Yeah, the second word should more probable
	align to the second position.

	0:47:17.557 --> 0:47:22.767
	We add a fertility model that means one word
	is mostly translated into one word.

	0:47:23.523 --> 0:47:29.257
	For example, we saw it there that should be
	translated into two words, but most words should

	0:47:29.257 --> 0:47:32.463
	be one to one, and it's even modeled for each
	word.

	0:47:32.463 --> 0:47:37.889
	So for each source word, how probable is it
	that it is translated to one, two, three or

	0:47:37.889 --> 0:47:38.259
	more?

	0:47:40.620 --> 0:47:50.291
	Then either one of four acts relative positions,
	so it's asks: Maybe instead of modeling, how

	0:47:50.291 --> 0:47:55.433
	probable is it that you translate from position
	five to position twenty five?

	0:47:55.433 --> 0:48:01.367
	It's not a very good way, but in a relative
	position instead of what you try to model it.

	0:48:01.321 --> 0:48:06.472
	How probable is that you are jumping Swiss
	steps forward or Swiss steps back?

	0:48:07.287 --> 0:48:15.285
	However, this makes sense more complex because
	what is a jump forward and a jump backward

	0:48:15.285 --> 0:48:16.885
	is not that easy.

	0:48:18.318 --> 0:48:30.423
	You want to have a model that describes reality,
	so every sentence that is not possible should

	0:48:30.423 --> 0:48:37.304
	have the probability zero because that cannot
	happen.

	0:48:37.837 --> 0:48:48.037
	However, with this type of IBM model four
	this has a positive probability, so it makes

	0:48:48.037 --> 0:48:54.251
	a sentence more complex and you can easily
	check it.

	0:48:57.457 --> 0:49:09.547
	So these models were the first models which
	tried to directly model and where they are

	0:49:09.547 --> 0:49:14.132
	the first to do the translation.

	0:49:14.414 --> 0:49:19.605
	So in all of these models, the probability
	of a word translating into another word is

	0:49:19.605 --> 0:49:25.339
	always independent of all the other translations,
	and that is a challenge because we know that

	0:49:25.339 --> 0:49:26.486
	this is not right.

	0:49:26.967 --> 0:49:32.342
	And therefore we will come now to then the
	phrase-based translation models.

	0:49:35.215 --> 0:49:42.057
	However, this word alignment is the very important
	concept which was used in phrase based.

	0:49:42.162 --> 0:49:50.559
	Even when people use phrase based, they first
	would always train a word based model not to

	0:49:50.559 --> 0:49:56.188
	get the really model but only to get this type
	of alignment.

	0:49:57.497 --> 0:50:01.343
	What was the main idea of a phrase based machine
	translation?

	0:50:03.223 --> 0:50:08.898
	It's not only that things got mathematically
	a lot more simple here because you don't try

	0:50:08.898 --> 0:50:13.628
	to express the whole translation process, but
	it's a discriminative model.

	0:50:13.628 --> 0:50:19.871
	So what you only try to model is this translation
	probability or is this translation more probable

	0:50:19.871 --> 0:50:20.943
	than some other.

	0:50:24.664 --> 0:50:28.542
	The main idea is that the basic units are
	are the phrases.

	0:50:28.542 --> 0:50:31.500
	That's why it's called phrase phrase phrase.

	0:50:31.500 --> 0:50:35.444
	You have to be aware that these are not linguistic
	phrases.

	0:50:35.444 --> 0:50:39.124
	I guess you have some intuition about what
	is a phrase.

	0:50:39.399 --> 0:50:45.547
	You would express as a phrase.

	0:50:45.547 --> 0:50:58.836
	However, you wouldn't say that is a very good
	phrase because it's.

	0:50:59.339 --> 0:51:06.529
	However, in this machine learning-based motivated
	thing, phrases are just indicative.

	0:51:07.127 --> 0:51:08.832
	So it can be any split.

	0:51:08.832 --> 0:51:12.455
	We don't consider linguistically motivated
	or not.

	0:51:12.455 --> 0:51:15.226
	It can be any sequence of consecutive.

	0:51:15.335 --> 0:51:16.842
	That's the Only Important Thing.

	0:51:16.977 --> 0:51:25.955
	The phrase is always a thing of consecutive
	words, and the motivation behind that is getting

	0:51:25.955 --> 0:51:27.403
	computational.

	0:51:27.387 --> 0:51:35.912
	People have looked into how you can also discontinuous
	phrases, which might be very helpful if you

	0:51:35.912 --> 0:51:38.237
	think about German harbor.

	0:51:38.237 --> 0:51:40.046
	Has this one phrase?

	0:51:40.000 --> 0:51:47.068
	There's two phrases, although there's many
	things in between, but in order to make things

	0:51:47.068 --> 0:51:52.330
	still possible and runner will, it's always
	like consecutive work.

	0:51:53.313 --> 0:52:05.450
	The nice thing is that on the one hand you
	don't need this word to word correspondence

	0:52:05.450 --> 0:52:06.706
	anymore.

	0:52:06.906 --> 0:52:17.088
	You now need to invent some type of alignment
	that in this case doesn't really make sense.

	0:52:17.417 --> 0:52:21.710
	So you can just learn okay, you have this
	phrase and this phrase and their translation.

	0:52:22.862 --> 0:52:25.989
	Secondly, we can add a bit of context into
	that.

	0:52:26.946 --> 0:52:43.782
	You're saying, for example, of Ultimate Customs
	and of My Shift.

	0:52:44.404 --> 0:52:51.443
	And this was difficult to model and work based
	models because they always model the translation.

	0:52:52.232 --> 0:52:57.877
	Here you can have phrases where you have more
	context and just jointly translate the phrases,

	0:52:57.877 --> 0:53:03.703
	and if you then have seen all by the question
	as a phrase you can directly use that to generate.

	0:53:08.468 --> 0:53:19.781
	Okay, before we go into how to do that, then
	we start, so the start is when we start with

	0:53:19.781 --> 0:53:21.667
	the alignment.

	0:53:22.022 --> 0:53:35.846
	So that is what we get from the work based
	model and we are assuming to get the.

	0:53:36.356 --> 0:53:40.786
	So that is your starting point.

	0:53:40.786 --> 0:53:47.846
	You have a certain sentence and one most probable.

	0:53:48.989 --> 0:54:11.419
	The challenge you now have is that these alignments
	are: On the one hand, a source word like hit

	0:54:11.419 --> 0:54:19.977
	several times with one source word can be aligned
	to several: So in this case you see that for

	0:54:19.977 --> 0:54:29.594
	example Bisher is aligned to three words, so
	this can be the alignment from English to German,

	0:54:29.594 --> 0:54:32.833
	but it cannot be the alignment.

	0:54:33.273 --> 0:54:41.024
	In order to address for this inconsistency
	and being able to do that, what you typically

	0:54:41.024 --> 0:54:49.221
	then do is: If you have this inconsistency
	and you get different things in both directions,.

	0:54:54.774 --> 0:55:01.418
	In machine translation to do that you just
	do it in both directions and somehow combine

	0:55:01.418 --> 0:55:08.363
	them because both will do arrows and the hope
	is yeah if you know both things you minimize.

	0:55:08.648 --> 0:55:20.060
	So you would also do it in the other direction
	and get a different type of lineup, for example

	0:55:20.060 --> 0:55:22.822
	that you now have saw.

	0:55:23.323 --> 0:55:37.135
	So in this way you are having two alignments
	and the question is now how do get one alignment

	0:55:37.135 --> 0:55:38.605
	and what?

	0:55:38.638 --> 0:55:45.828
	There were a lot of different types of heuristics.

	0:55:45.828 --> 0:55:55.556
	They normally start with intersection because
	you should trust them.

	0:55:55.996 --> 0:55:59.661
	And your maximum will could take this, the
	union thought,.

	0:55:59.980 --> 0:56:04.679
	If one of the systems says they are not aligned
	then maybe you should not align them.

	0:56:05.986 --> 0:56:12.240
	The only question they are different is what
	should I do about things where they don't agree?

	0:56:12.240 --> 0:56:18.096
	So where only one of them enlines and then
	you have heuristics depending on other words

	0:56:18.096 --> 0:56:22.288
	around it, you can decide should I align them
	or should I not.

	0:56:24.804 --> 0:56:34.728
	So that is your first step and then the second
	step in your model.

	0:56:34.728 --> 0:56:41.689
	So now you have one alignment for the process.

	0:56:42.042 --> 0:56:47.918
	And the idea is that we will now extract all
	phrase pairs to combinations of source and

	0:56:47.918 --> 0:56:51.858
	target phrases where they are consistent within
	alignment.

	0:56:52.152 --> 0:56:57.980
	The idea is a consistence with an alignment
	that should be a good example and that we can

	0:56:57.980 --> 0:56:58.563
	extract.

	0:56:59.459 --> 0:57:14.533
	And there are three conditions where we say
	an alignment has to be consistent.

	0:57:14.533 --> 0:57:17.968
	The first one is.

	0:57:18.318 --> 0:57:24.774
	So if you add bisher, then it's in your phrase.

	0:57:24.774 --> 0:57:32.306
	All the three words up till and now should
	be in there.

	0:57:32.492 --> 0:57:42.328
	So Bisheret Till would not be a valid phrase
	pair in this case, but for example Bisheret

	0:57:42.328 --> 0:57:43.433
	Till now.

	0:57:45.525 --> 0:58:04.090
	Does anybody now have already an idea about
	the second rule that should be there?

	0:58:05.325 --> 0:58:10.529
	Yes, that is exactly the other thing.

	0:58:10.529 --> 0:58:22.642
	If a target verse is in the phrase pair, there
	are also: Then there is one very obvious one.

	0:58:22.642 --> 0:58:28.401
	If you strike a phrase pair, at least one
	word in the phrase.

	0:58:29.069 --> 0:58:32.686
	And this is a knife with working.

	0:58:32.686 --> 0:58:40.026
	However, in reality a captain will select
	some part of the sentence.

	0:58:40.380 --> 0:58:47.416
	You can take any possible combination of sewers
	and target words for this part, and that of

	0:58:47.416 --> 0:58:54.222
	course is not very helpful because you just
	have no idea, and therefore it says at least

	0:58:54.222 --> 0:58:58.735
	one sewer should be aligned to one target word
	to prevent.

	0:58:59.399 --> 0:59:09.615
	But still, it means that if you have normally
	analyzed words, the more analyzed words you

	0:59:09.615 --> 0:59:10.183
	can.

	0:59:10.630 --> 0:59:13.088
	That's not true for the very extreme case.

	0:59:13.088 --> 0:59:17.603
	If no word is a line you can extract nothing
	because you can never fulfill it.

	0:59:17.603 --> 0:59:23.376
	However, if only for example one word is aligned
	then you can align a lot of different possibilities

	0:59:23.376 --> 0:59:28.977
	because you can start with this word and then
	add source words or target words or any combination

	0:59:28.977 --> 0:59:29.606
	of source.

	0:59:30.410 --> 0:59:37.585
	So there was typically a problem that if you
	have too few works in light you can really

	0:59:37.585 --> 0:59:38.319
	extract.

	0:59:38.558 --> 0:59:45.787
	If you think about this already here you can
	extract very, very many phrase pairs from:

	0:59:45.845 --> 0:59:55.476
	So what you can extract is, for example, what
	we saw up and so on.

	0:59:55.476 --> 1:00:00.363
	So all of them will be extracted.

	1:00:00.400 --> 1:00:08.379
	In order to limit this you typically have
	a length limit so you can only extract phrases

	1:00:08.379 --> 1:00:08.738
	up.

	1:00:09.049 --> 1:00:18.328
	But still there these phrases where you have
	all these phrases extracted.

	1:00:18.328 --> 1:00:22.968
	You have to think about how to deal.

	1:00:26.366 --> 1:00:34.966
	Now we have the phrases, so the other question
	is what is a good phrase pair and not so good.

	1:00:35.255 --> 1:00:39.933
	You might be that you sometimes extract one
	which is explaining this sentence but is not

	1:00:39.933 --> 1:00:44.769
	really a good one because there is something
	ever in there or something special so it might

	1:00:44.769 --> 1:00:47.239
	not be a good phase pair in another situation.

	1:00:49.629 --> 1:00:59.752
	And therefore the easiest thing is again just
	count, and if a phrase pair occurs very often

	1:00:59.752 --> 1:01:03.273
	seems to be a good phrase pair.

	1:01:03.743 --> 1:01:05.185
	So if we have this one.

	1:01:05.665 --> 1:01:09.179
	And if you have the exam up till now,.

	1:01:09.469 --> 1:01:20.759
	Then you look how often does up till now to
	this hair occur?

	1:01:20.759 --> 1:01:28.533
	How often does up until now to this hair?

	1:01:30.090 --> 1:01:36.426
	So this is one way of yeah describing the
	quality of the phrase book.

	1:01:37.257 --> 1:01:47.456
	So one difference is now, and that is the
	advantage of these primitive models.

	1:01:47.867 --> 1:01:55.442
	But instead we are trying to have a lot of
	features describing how good a phrase parent

	1:01:55.442 --> 1:01:55.786
	is.

	1:01:55.786 --> 1:02:04.211
	One of these features is this one describing:
	But in this model we'll later see how to combine

	1:02:04.211 --> 1:02:04.515
	it.

	1:02:04.515 --> 1:02:10.987
	The nice thing is we can invent any other
	type of features and add that and normally

	1:02:10.987 --> 1:02:14.870
	if you have two or three metrics to describe
	then.

	1:02:15.435 --> 1:02:18.393
	And therefore the spray spray sprays.

	1:02:18.393 --> 1:02:23.220
	They were not only like evaluated by one type
	but by several.

	1:02:23.763 --> 1:02:36.580
	So this could, for example, have a problem
	because your target phrase here occurs only

	1:02:36.580 --> 1:02:37.464
	once.

	1:02:38.398 --> 1:02:46.026
	It will of course only occur with one other
	source trait, and that probability will be

	1:02:46.026 --> 1:02:53.040
	one which might not be a very good estimation
	because you've only seen it once.

	1:02:53.533 --> 1:02:58.856
	Therefore, we use additional ones to better
	deal with that, and the first thing is we're

	1:02:58.856 --> 1:02:59.634
	doing again.

	1:02:59.634 --> 1:03:01.129
	Yeah, we know it by now.

	1:03:01.129 --> 1:03:06.692
	If you look at it in the one direction, it's
	helpful to us to look into the other direction.

	1:03:06.692 --> 1:03:11.297
	So you take also the inverse probability,
	so you not only take in peer of E.

	1:03:11.297 --> 1:03:11.477
	G.

	1:03:11.477 --> 1:03:11.656
	M.

	1:03:11.656 --> 1:03:12.972
	F., but also peer of.

	1:03:13.693 --> 1:03:19.933
	And then in addition you say maybe for the
	especially prolonged phrases they occur rarely,

	1:03:19.933 --> 1:03:25.898
	and then you have very high probabilities,
	and that might not be always the right one.

	1:03:25.898 --> 1:03:32.138
	So maybe it's good to also look at the word
	based probabilities to represent how good they

	1:03:32.138 --> 1:03:32.480
	are.

	1:03:32.692 --> 1:03:44.202
	So in addition you take the work based probabilities
	of this phrase pair as an additional model.

	1:03:44.704 --> 1:03:52.828
	So then you would have in total four different
	values describing how good the phrase is.

	1:03:52.828 --> 1:04:00.952
	It would be the relatively frequencies in
	both directions and the lexical probabilities.

	1:04:01.361 --> 1:04:08.515
	So four values in describing how probable
	a phrase translation is.

	1:04:11.871 --> 1:04:20.419
	Then the next challenge is how can we combine
	these different types of probabilities into

	1:04:20.419 --> 1:04:23.458
	a global score saying how good?

	1:04:24.424 --> 1:04:36.259
	Model, but before we are doing that give any
	questions to this phrase extraction and phrase

	1:04:36.259 --> 1:04:37.546
	creation.

	1:04:40.260 --> 1:04:44.961
	And the motivation for that this was our initial
	moral.

	1:04:44.961 --> 1:04:52.937
	If you remember from the beginning of a lecture
	we had the probability of like PFO three times

	1:04:52.937 --> 1:04:53.357
	PFO.

	1:04:55.155 --> 1:04:57.051
	Now the problem is here.

	1:04:57.051 --> 1:04:59.100
	That is, of course, right.

	1:04:59.100 --> 1:05:06.231
	However, we have done a lot of simplification
	that the translation probability is independent

	1:05:06.231 --> 1:05:08.204
	of the other translation.

	1:05:08.628 --> 1:05:14.609
	So therefore our estimations of pH give me
	and pH might not be right, and therefore the

	1:05:14.609 --> 1:05:16.784
	combination might not be right.

	1:05:17.317 --> 1:05:22.499
	So it can be that, for example, at the edge
	you have a fluid but not accurate translation.

	1:05:22.782 --> 1:05:25.909
	And Then There's Could Be an Easy Way Around
	It.

	1:05:26.126 --> 1:05:32.019
	If our effluent but not accurate, it might
	be that we put too much effort on the language

	1:05:32.019 --> 1:05:36.341
	model and we are putting too few effort on
	the translation model.

	1:05:36.936 --> 1:05:43.016
	There we can wait a minute so we can do this
	a bit stronger.

	1:05:43.016 --> 1:05:46.305
	This one is more important than.

	1:05:48.528 --> 1:05:53.511
	And based on that we can extend this idea
	to the lacteria mole.

	1:05:53.893 --> 1:06:02.164
	The log linear model now says all the translation
	probabilities is just we have.

	1:06:02.082 --> 1:06:09.230
	Describing how good this translation process
	is, these are the speeches H which depend on

	1:06:09.230 --> 1:06:09.468
	E.

	1:06:09.468 --> 1:06:09.706
	F.

	1:06:09.706 --> 1:06:13.280
	Only one of them, but generally depend on
	E.

	1:06:13.280 --> 1:06:13.518
	E.

	1:06:13.518 --> 1:06:13.757
	E.

	1:06:13.757 --> 1:06:13.995
	N.

	1:06:13.995 --> 1:06:14.233
	F.

	1:06:14.474 --> 1:06:22.393
	Each of these pictures has a weight saying
	yeah how good does it model it so that if you're

	1:06:22.393 --> 1:06:29.968
	asking a lot of people about some opinion it
	might also be waiting some opinion more so

	1:06:29.968 --> 1:06:34.100
	I put more effort on that and he may not be
	so.

	1:06:34.314 --> 1:06:39.239
	If you're saying that it's maybe a good indication,
	yeah, would trust that much.

	1:06:39.559 --> 1:06:41.380
	And exactly you can do that for you too.

	1:06:41.380 --> 1:06:42.446
	You can't add no below.

	1:06:43.423 --> 1:07:01.965
	It's like depending on how many you want to
	have and each of the features gives you value.

	1:07:02.102 --> 1:07:12.655
	The nice thing is that we can normally ignore
	because we are not interested in the probability

	1:07:12.655 --> 1:07:13.544
	itself.

	1:07:13.733 --> 1:07:18.640
	And again, if that's not normalized, that's
	fine.

	1:07:18.640 --> 1:07:23.841
	So if this value is the highest, that's the
	highest.

	1:07:26.987 --> 1:07:29.302
	Can we do that?

	1:07:29.302 --> 1:07:34.510
	Let's start with two simple things.

	1:07:34.510 --> 1:07:39.864
	Then you have one translation model.

	1:07:40.000 --> 1:07:43.102
	Which gives you the peer of eagerness.

	1:07:43.383 --> 1:07:49.203
	It can be typically as a feature it would
	take the liberalism of this ability, so mine

	1:07:49.203 --> 1:07:51.478
	is nine hundred and fourty seven.

	1:07:51.451 --> 1:07:57.846
	And the language model which says you how
	clue in the English side is how you can calculate

	1:07:57.846 --> 1:07:59.028
	the probability.

	1:07:58.979 --> 1:08:03.129
	In some future lectures we'll give you all
	superbology.

	1:08:03.129 --> 1:08:10.465
	You can feature again the luck of the purbology,
	then you have minus seven and then give different

	1:08:10.465 --> 1:08:11.725
	weights to them.

	1:08:12.292 --> 1:08:19.243
	And that means that your probability is one
	divided by said to the power of this.

	1:08:20.840 --> 1:08:38.853
	You're not really interested in the probability,
	so you just calculate on the score to the exponendum.

	1:08:40.000 --> 1:08:41.668
	Maximal Maximal I Think.

	1:08:42.122 --> 1:08:57.445
	You can, for example, try different translations,
	calculate all their scores and take in the

	1:08:57.445 --> 1:09:00.905
	end the translation.

	1:09:03.423 --> 1:09:04.661
	Why to do that.

	1:09:05.986 --> 1:09:10.698
	We've done that now for two, but of course
	you cannot only do it with two.

	1:09:10.698 --> 1:09:16.352
	You can do it now with any fixed number, so
	of course you have to decide in the beginning

	1:09:16.352 --> 1:09:21.944
	I want to have ten features or something like
	that, but you can take all these features.

	1:09:22.002 --> 1:09:29.378
	And yeah, based on them, they calculate your
	model probability or the model score.

	1:09:31.031 --> 1:09:40.849
	A big advantage over the initial.

	1:09:40.580 --> 1:09:45.506
	A model because now we can add a lot of features
	and there was diamond machine translation,

	1:09:45.506 --> 1:09:47.380
	a statistical machine translation.

	1:09:47.647 --> 1:09:57.063
	So how can develop new features, new ways
	of evaluating them so that can hopefully better

	1:09:57.063 --> 1:10:00.725
	describe what is good translation?

	1:10:01.001 --> 1:10:16.916
	If you have a new great feature you can calculate
	these features and then how much better do

	1:10:16.916 --> 1:10:18.969
	they model?

	1:10:21.741 --> 1:10:27.903
	There is one challenge which haven't touched
	upon yet.

	1:10:27.903 --> 1:10:33.505
	So could you easily build your model if you
	have.

	1:10:38.999 --> 1:10:43.016
	Assumed here something which just gazed, but
	which might not be that easy.

	1:10:49.990 --> 1:10:56.333
	The weight for the translation model is and
	the weight for the language model is.

	1:10:56.716 --> 1:11:08.030
	That's a bit arbitrary, so why should you
	use this one and guess normally you won't be

	1:11:08.030 --> 1:11:11.801
	able to select that by hand?

	1:11:11.992 --> 1:11:19.123
	So typically we didn't have like or features
	in there, but features is very common.

	1:11:19.779 --> 1:11:21.711
	So how do you select them?

	1:11:21.711 --> 1:11:24.645
	There was a second part of the training.

	1:11:24.645 --> 1:11:27.507
	These models were trained in two steps.

	1:11:27.507 --> 1:11:32.302
	On the one hand, we had the training of the
	individual components.

	1:11:32.302 --> 1:11:38.169
	We saw that now how to build the phrase based
	system, how to extract the phrases.

	1:11:38.738 --> 1:11:46.223
	But then if you have these different components
	you need a second training to learn the optimal.

	1:11:46.926 --> 1:11:51.158
	And typically this is referred to as the tuning
	of the system.

	1:11:51.431 --> 1:12:07.030
	So now if you have different types of models
	describing what a good translation is you need

	1:12:07.030 --> 1:12:10.760
	to find good weights.

	1:12:12.312 --> 1:12:14.315
	So how can you do it?

	1:12:14.315 --> 1:12:20.871
	The easiest thing is, of course, you can just
	try different things out.

	1:12:21.121 --> 1:12:27.496
	You can then always select the best hyper
	scissors.

	1:12:27.496 --> 1:12:38.089
	You can evaluate it with some metrics saying:
	You can score all your outputs, always select

	1:12:38.089 --> 1:12:42.543
	the best one and then get this translation.

	1:12:42.983 --> 1:12:45.930
	And you can do that for a lot of different
	possible combinations.

	1:12:47.067 --> 1:12:59.179
	However, the challenge is the complexity,
	so if you have only parameters and each of

	1:12:59.179 --> 1:13:04.166
	them has values you try for, then.

	1:13:04.804 --> 1:13:16.895
	We won't be able to try all of these possible
	combinations, so what we have to do is some

	1:13:16.895 --> 1:13:19.313
	more intelligent.

	1:13:20.540 --> 1:13:34.027
	And what has been done there in machine translation
	is referred to as a minimum error rate training.

	1:13:34.534 --> 1:13:41.743
	Whole surge is a very intuitive one, so have
	all these different parameters, so how do.

	1:13:42.522 --> 1:13:44.358
	And the idea is okay.

	1:13:44.358 --> 1:13:52.121
	I start with an initial guess and then I optimize
	one single parameter that's always easier.

	1:13:52.121 --> 1:13:54.041
	That's some or linear.

	1:13:54.041 --> 1:13:58.882
	So you're searching the best value for the
	one parameter.

	1:13:59.759 --> 1:14:04.130
	Often visualized with a San Francisco map.

	1:14:04.130 --> 1:14:13.786
	Just imagine if you want to go to the highest
	spot in San Francisco, you're standing somewhere

	1:14:13.786 --> 1:14:14.395
	here.

	1:14:14.574 --> 1:14:21.220
	You are switching your dimensions so you are
	going in this direction again finding.

	1:14:21.661 --> 1:14:33.804
	Now you're on a different street and this
	one is not a different one so you go in here

	1:14:33.804 --> 1:14:36.736
	so you can interact.

	1:14:36.977 --> 1:14:56.368
	The one thing of course is find a local optimum,
	especially if you start in two different positions.

	1:14:56.536 --> 1:15:10.030
	So yeah, there is a heuristic in there, so
	typically it's done again if you land in different

	1:15:10.030 --> 1:15:16.059
	positions with different starting points.

	1:15:16.516 --> 1:15:29.585
	What is different or what is like the addition
	of arrow rate training compared to the standard?

	1:15:29.729 --> 1:15:37.806
	So the question is, like we said, you can
	now evaluate different values for one parameter.

	1:15:38.918 --> 1:15:42.857
	And the question is: Which values should you
	try out for one parameters?

	1:15:42.857 --> 1:15:47.281
	Should you just do zero point one, zero point
	two, zero point three, or anything?

	1:15:49.029 --> 1:16:03.880
	If you change only one parameter then you
	can define the score of translation as a linear

	1:16:03.880 --> 1:16:05.530
	function.

	1:16:05.945 --> 1:16:17.258
	That this is the one that possesses, and yet
	if you change the parameter, the score of this.

	1:16:17.397 --> 1:16:26.506
	It may depend so your score is there because
	the rest you don't change your feature value.

	1:16:26.826 --> 1:16:30.100
	And the feature value is there for the steepness
	of their purse.

	1:16:30.750 --> 1:16:38.887
	And now look at different possible translations.

	1:16:38.887 --> 1:16:46.692
	Therefore, how they go up here is differently.

	1:16:47.247 --> 1:16:59.289
	So in this case if you look at the minimum
	score so there should be as minimum.

	1:17:00.300 --> 1:17:10.642
	So it's enough to check once a year and check
	once here because if you check here and here.

	1:17:11.111 --> 1:17:24.941
	And that is the idea in minimum air rate training
	when you select different hypotheses.

	1:17:29.309 --> 1:17:34.378
	So in yeah, the minimum air raid training
	is a power search.

	1:17:34.378 --> 1:17:37.453
	Then we do an intelligent step size.

	1:17:37.453 --> 1:17:39.364
	We do random restarts.

	1:17:39.364 --> 1:17:46.428
	Then things are still too slow because it
	might say we would have to decode a lot of

	1:17:46.428 --> 1:17:47.009
	times.

	1:17:46.987 --> 1:17:54.460
	So what we can do to make things even faster
	is we are decoding once with the current parameters,

	1:17:54.460 --> 1:18:01.248
	but then we are not generating only the most
	probable translation, but we are generating

	1:18:01.248 --> 1:18:05.061
	the most probable ten hundred translations
	or so.

	1:18:06.006 --> 1:18:18.338
	And then we are optimizing our weights by
	only looking at this one hundred translation

	1:18:18.338 --> 1:18:23.725
	and finding the optimal values there.

	1:18:24.564 --> 1:18:39.284
	Of course, it might be a problem that at some
	point you have now good ways to find good translations

	1:18:39.284 --> 1:18:42.928
	inside your ambest list.

	1:18:43.143 --> 1:18:52.357
	You have to iterate that sometime, but the
	important thing is you don't have to decode

	1:18:52.357 --> 1:18:56.382
	every time you need weights, but you.

	1:18:57.397 --> 1:19:11.325
	There is mainly a speed up process in order
	to make things more, make things even faster.

	1:19:15.515 --> 1:19:20.160
	Good Then We'll Finish With.

	1:19:20.440 --> 1:19:25.289
	Looking at how do you really calculate the
	scores and everything?

	1:19:25.289 --> 1:19:32.121
	Because what we did look into was a translation
	of a full sentence doesn't really consist of

	1:19:32.121 --> 1:19:37.190
	only one single phrase, but of course you have
	to combine different.

	1:19:37.637 --> 1:19:40.855
	So how does that now really look and how do
	we have to do?

	1:19:41.361 --> 1:19:48.252
	Just think again of the translation we have
	done before.

	1:19:48.252 --> 1:19:59.708
	The sentence must be: What is the probability
	of translating this one into what we saw after

	1:19:59.708 --> 1:20:00.301
	now?

	1:20:00.301 --> 1:20:03.501
	We're doing this by using.

	1:20:03.883 --> 1:20:07.157
	So we're having the phrase pair.

	1:20:07.157 --> 1:20:12.911
	Vasvia is the phrase pair up to now and gazine
	harm into.

	1:20:13.233 --> 1:20:18.970
	In addition, that is important because translation
	is not monotone.

	1:20:18.970 --> 1:20:26.311
	We are not putting phrase pairs in the same
	order as we are doing it on the source and

	1:20:26.311 --> 1:20:31.796
	on the target, but in order to generate the
	correct translation.

	1:20:31.771 --> 1:20:34.030
	So we have to shuffle the phrase pears.

	1:20:34.294 --> 1:20:39.747
	And the blue wand is in front on the search
	side but not on the back of the tag.

	1:20:40.200 --> 1:20:49.709
	This reordering makes a statistic of the machine
	translation really complicated because if you

	1:20:49.709 --> 1:20:53.313
	would just monotonely do this then.

	1:20:53.593 --> 1:21:05.288
	The problem is if you would analyze all possible
	combinations of reshuffling them, then again.

	1:21:05.565 --> 1:21:11.508
	So you again have to use some type of heuristics
	which shuffle you allow and which you don't

	1:21:11.508 --> 1:21:11.955
	allow.

	1:21:12.472 --> 1:21:27.889
	That was relatively challenging since, for
	example, if you think of Germany you would

	1:21:27.889 --> 1:21:32.371
	have to allow very long.

	1:21:33.033 --> 1:21:52.218
	But if we have now this, how do we calculate
	the translation score so the translation score?

	1:21:52.432 --> 1:21:55.792
	That's why we sum up the scores at the end.

	1:21:56.036 --> 1:22:08.524
	So you said our first feature is the probability
	of the full sentence.

	1:22:08.588 --> 1:22:13.932
	So we say, the translation of each phrase
	pair is independent of each other, and then

	1:22:13.932 --> 1:22:19.959
	we can hear the probability of the full sentences,
	fear of what we give, but fear of times, fear

	1:22:19.959 --> 1:22:24.246
	of sobbing because they have time to feel up
	till now is impossible.

	1:22:24.664 --> 1:22:29.379
	Now we can use the loss of logarithmal calculation.

	1:22:29.609 --> 1:22:36.563
	That's logarithm of the first perability.

	1:22:36.563 --> 1:22:48.153
	We'll get our first score, which says the
	translation model is minus.

	1:22:49.970 --> 1:22:56.586
	And that we're not doing only once, but we're
	exactly doing it with all our translation model.

	1:22:56.957 --> 1:23:03.705
	So we said we also have the relative frequency
	and the inverse directions of the.

	1:23:03.843 --> 1:23:06.226
	So in the end you'll have four scores.

	1:23:06.226 --> 1:23:09.097
	Here how you combine them is exactly the same.

	1:23:09.097 --> 1:23:12.824
	The only thing is how you look them up for
	each phrase pair.

	1:23:12.824 --> 1:23:18.139
	We have said in the beginning we are storing
	four scores describing how good they are.

	1:23:19.119 --> 1:23:25.415
	And these are then of force points describing
	how probable the sense.

	1:23:27.427 --> 1:23:31.579
	Then we can have more sports.

	1:23:31.579 --> 1:23:37.806
	For example, we can have a distortion model.

	1:23:37.806 --> 1:23:41.820
	How much reordering is done?

	1:23:41.841 --> 1:23:47.322
	There were different types of ones who won't
	go into detail, but just imagine you have no

	1:23:47.322 --> 1:23:47.748
	score.

	1:23:48.548 --> 1:23:56.651
	Then you have a language model which is the
	sequence of what we saw until now.

	1:23:56.651 --> 1:24:06.580
	How we generate this language model for ability
	will cover: And there weren't even more probabilities.

	1:24:06.580 --> 1:24:11.841
	So one, for example, was a phrase count scarf,
	which just counts how many.

	1:24:12.072 --> 1:24:19.555
	In order to learn is it better to have more
	short phrases or should bias on having fewer

	1:24:19.555 --> 1:24:20.564
	and longer.

	1:24:20.940 --> 1:24:28.885
	Easily add this but just counting so the value
	will be here and like putting in a count like

	1:24:28.885 --> 1:24:32.217
	typically how good is it to translate.

	1:24:32.932 --> 1:24:44.887
	For language model, the probability normally
	gets shorter the longer the sequences in order

	1:24:44.887 --> 1:24:46.836
	to counteract.

	1:24:47.827 --> 1:24:59.717
	And then you get your final score by multi-climbing
	each of the scores we had before.

	1:24:59.619 --> 1:25:07.339
	Optimization and that gives you a final score
	maybe of twenty three point seven eight five

	1:25:07.339 --> 1:25:13.278
	and then you can do that with several possible
	translation tests and.

	1:25:14.114 --> 1:25:23.949
	One may be important point here is so the
	score not only depends on the target side but

	1:25:23.949 --> 1:25:32.444
	it also depends on which phrases you have used
	so you could have generated.

	1:25:32.772 --> 1:25:38.076
	So you would have the same translation, but
	you would have a different split into phrase.

	1:25:38.979 --> 1:25:45.636
	And this was normally ignored so you would
	just look at all of them and then select the

	1:25:45.636 --> 1:25:52.672
	one which has the highest probability and ignore
	that this translation could be generated by

	1:25:52.672 --> 1:25:54.790
	several splits into phrase.

	1:25:57.497 --> 1:26:06.097
	So to summarize what we look into today and
	what you should hopefully remember is: Statistical

	1:26:06.097 --> 1:26:11.440
	models in how to generate machine translation
	output that were the word based statistical

	1:26:11.440 --> 1:26:11.915
	models.

	1:26:11.915 --> 1:26:16.962
	There was IBM models at the beginning and
	then we have the phrase based entity where

	1:26:16.962 --> 1:26:22.601
	it's about building the translation by putting
	together these blocks of phrases and combining.

	1:26:23.283 --> 1:26:34.771
	If you have a water which has several features
	you can't do that with millions but with features.

	1:26:34.834 --> 1:26:42.007
	Then you can combine them with your local
	model, which allows you to have your variable

	1:26:42.007 --> 1:26:45.186
	number of features and easily combine.

	1:26:45.365 --> 1:26:47.920
	Yeah, how much can you trust each of these
	more?

	1:26:51.091 --> 1:26:54.584
	Do you have any further questions for this
	topic?

	1:26:58.378 --> 1:27:08.715
	And there will be on Tuesday a lecture by
	Tuan about evaluation, and then next Thursday

	1:27:08.715 --> 1:27:12.710
	there will be the practical part.

	1:27:12.993 --> 1:27:21.461
	So please bring the practical pot here, but
	you can do something yourself if you are not

	1:27:21.461 --> 1:27:22.317
	able to.

	1:27:23.503 --> 1:27:26.848
	So then please tell us and we'll have to see
	how we find the difference in this.