Spaces:

retkowski
/

ytseg_demo

Running

App Files Files Community

ytseg_demo / demo_data /lectures /Lecture-13-04.07.2023 /English.vtt

retkowski

Add demo

cb71ef5 over 1 year ago

raw

history blame contribute delete

66.3 kB

	WEBVTT

	0:00:01.641 --> 0:00:06.302
	Hey so what again to today's lecture on machine
	translation.

	0:00:07.968 --> 0:00:15.152
	This week we'll have a bit of different focus,
	so last two weeks or so we have looking into.

	0:00:15.655 --> 0:00:28.073
	How we can improve our system by having more
	data, other data sources, or using them to

	0:00:28.073 --> 0:00:30.331
	more efficient.

	0:00:30.590 --> 0:00:38.046
	And we'll have a bit more of that next week
	with the anti-travised and the context.

	0:00:38.338 --> 0:00:47.415
	So that we are shifting from this idea of
	we treat each sentence independently, but treat

	0:00:47.415 --> 0:00:49.129
	the translation.

	0:00:49.129 --> 0:00:58.788
	Because maybe you can remember from the beginning,
	there are phenomenon in machine translation

	0:00:58.788 --> 0:01:02.143
	that you cannot correctly check.

	0:01:03.443 --> 0:01:14.616
	However, today we want to more look into what
	challenges arise, specifically when we're practically

	0:01:14.616 --> 0:01:16.628
	applying machine.

	0:01:17.017 --> 0:01:23.674
	And this block will be a total of four different
	lectures.

	0:01:23.674 --> 0:01:29.542
	What type of biases are in machine translation
	can.

	0:01:29.729 --> 0:01:37.646
	Just then can we try to improve this, but
	of course the first focus can be at least the.

	0:01:37.717 --> 0:01:41.375
	And this, of course, gets more and more important.

	0:01:41.375 --> 0:01:48.333
	The more often you apply this type of technology,
	when it was mainly a basic research tool which

	0:01:48.333 --> 0:01:53.785
	you were using in a research environment, it's
	not directly that important.

	0:01:54.054 --> 0:02:00.370
	But once you apply it to the question, is
	it performed the same for everybody or is it

	0:02:00.370 --> 0:02:04.436
	performance of some people less good than other
	people?

	0:02:04.436 --> 0:02:10.462
	Does it have specific challenges and we are
	seeing that especially in translation?

	0:02:10.710 --> 0:02:13.420
	We have the major challenge.

	0:02:13.420 --> 0:02:20.333
	We have the grammatical gender and this is
	not the same in all languages.

	0:02:20.520 --> 0:02:35.431
	In English, it's not clear if you talk about
	some person, if it's male or female, and so

	0:02:35.431 --> 0:02:39.787
	hopefully you've learned.

	0:02:41.301 --> 0:02:50.034
	Just as a brief view, so based on this one
	aspect of application will then have two other

	0:02:50.034 --> 0:02:57.796
	aspects: On Thursday we'll look into adaptation,
	so how can we adapt to specific situations?

	0:02:58.718 --> 0:03:09.127
	Because we have seen that your systems perform
	well when the test case is similar to the training

	0:03:09.127 --> 0:03:15.181
	case, it's always the case you should get training
	data.

	0:03:16.036 --> 0:03:27.577
	However, in practical applications, it's not
	always possible to collect really the best

	0:03:27.577 --> 0:03:31.642
	fitting data, so in that case.

	0:03:32.092 --> 0:03:39.269
	And then the third larger group of applications
	will then be speech translation.

	0:03:39.269 --> 0:03:42.991
	What do we have to change in our machine?

	0:03:43.323 --> 0:03:53.569
	If we are now not translating text, but if
	we want to translate speech, that will be more

	0:03:53.569 --> 0:03:54.708
	lectures.

	0:04:00.180 --> 0:04:12.173
	So what are we talking about when we are talking
	about bias from a definition point?

	0:04:12.092 --> 0:04:21.799
	Means we are introducing systematic errors
	when testing, and then we encourage the selection

	0:04:21.799 --> 0:04:24.408
	of the specific answers.

	0:04:24.804 --> 0:04:36.862
	The most prominent case, which is analyzed
	most in the research community, is a bias based

	0:04:36.862 --> 0:04:38.320
	on gender.

	0:04:38.320 --> 0:04:43.355
	One example: she works in a hospital.

	0:04:43.523 --> 0:04:50.787
	It is not directly able to assess whether
	this is now a point or a friend.

	0:04:51.251 --> 0:05:07.095
	And although in this one even there is, it's
	possible to ambiguate this based on the context.

	0:05:07.127 --> 0:05:14.391
	However, there is yeah, this relation to learn
	is of course not that easy.

	0:05:14.614 --> 0:05:27.249
	So the system might also learn more like shortcut
	connections, which might be that in your training

	0:05:27.249 --> 0:05:31.798
	data most of the doctors are males.

	0:05:32.232 --> 0:05:41.725
	That is like that was too bigly analyzed and
	biased, and we'll focus on that also in this.

	0:05:41.641 --> 0:05:47.664
	In this lecture, however, of course, the system
	might be a lot of other biases too, which have

	0:05:47.664 --> 0:05:50.326
	been partly investigated in other fields.

	0:05:50.326 --> 0:05:53.496
	But I think machine translation is not that
	much.

	0:05:53.813 --> 0:05:57.637
	For example, it can be based on your originals.

	0:05:57.737 --> 0:06:09.405
	So there is an example for a sentiment analysis
	that's a bit prominent.

	0:06:09.405 --> 0:06:15.076
	A sentiment analysis means you're.

	0:06:15.035 --> 0:06:16.788
	Like you're seeing it in reviews.

	0:06:17.077 --> 0:06:24.045
	And then you can show that with baseline models,
	if the name is Mohammed then the sentiment

	0:06:24.045 --> 0:06:30.786
	in a lot of systems will be more negative than
	if it's like a traditional European name.

	0:06:31.271 --> 0:06:33.924
	Are with foods that is simple.

	0:06:33.924 --> 0:06:36.493
	It's this type of restaurant.

	0:06:36.493 --> 0:06:38.804
	It's positive and another.

	0:06:39.319 --> 0:06:49.510
	You have other aspects, so we have seen this.

	0:06:49.510 --> 0:06:59.480
	We have done some experiments in Vietnamese.

	0:06:59.559 --> 0:07:11.040
	And then, for example, you can analyze that
	if it's like he's Germany will address it more

	0:07:11.040 --> 0:07:18.484
	formal, while if he is North Korean he'll use
	an informal.

	0:07:18.838 --> 0:07:24.923
	So these are also possible types of gender.

	0:07:24.923 --> 0:07:31.009
	However, this is difficult types of biases.

	0:07:31.251 --> 0:07:38.903
	However, especially in translation, the bias
	for gender is the most challenging because

	0:07:38.903 --> 0:07:42.989
	we are treating gender in different languages.

	0:07:45.405 --> 0:07:46.930
	Hi this is challenging.

	0:07:48.148 --> 0:07:54.616
	The reason for that is that there is a translation
	mismatch and we have, I mean, one reason for

	0:07:54.616 --> 0:08:00.140
	that is there's a translation mismatch and
	that's the most challenging situation.

	0:08:00.140 --> 0:08:05.732
	So there is there is different information
	in the Sears language or in the target.

	0:08:06.046 --> 0:08:08.832
	So if we have the English word dot player,.

	0:08:09.029 --> 0:08:12.911
	It's there is no information about the gender
	in there.

	0:08:12.911 --> 0:08:19.082
	However, if you want to translate in German,
	you cannot easily generate a word without a

	0:08:19.082 --> 0:08:20.469
	gender information.

	0:08:20.469 --> 0:08:27.056
	Or man, you can't do something like Shubila
	in, but that sounds a bit weird if you're talking.

	0:08:27.027 --> 0:08:29.006
	About a specific person.

	0:08:29.006 --> 0:08:32.331
	Then you should use the appropriate font.

	0:08:32.692 --> 0:08:44.128
	And so it's most challenging translation as
	always in this situation where you have less

	0:08:44.128 --> 0:08:50.939
	information on the source side but more information.

	0:08:51.911 --> 0:08:57.103
	Similar things like if you think about Japanese,
	for example where there's different formality

	0:08:57.103 --> 0:08:57.540
	levels.

	0:08:57.540 --> 0:09:02.294
	If in German there is no formality or like
	two only or in English there's no formality

	0:09:02.294 --> 0:09:02.677
	level.

	0:09:02.862 --> 0:09:08.139
	And now you have to estimate the formality
	level.

	0:09:08.139 --> 0:09:10.884
	Of course, it takes some.

	0:09:10.884 --> 0:09:13.839
	It's not directly possible.

	0:09:14.094 --> 0:09:20.475
	What nowadays systems are doing is at least
	assess.

	0:09:20.475 --> 0:09:27.470
	This is a situation where don't have enough
	information.

	0:09:27.567 --> 0:09:28.656
	Translation.

	0:09:28.656 --> 0:09:34.938
	So here you have that suggesting it can be
	doctor or doctorate in Spanish.

	0:09:35.115 --> 0:09:37.051
	So that is a possibility.

	0:09:37.051 --> 0:09:41.595
	However, it is of course very, very challenging
	to find out.

	0:09:42.062 --> 0:09:46.130
	Is there two really different meanings, or
	is it not the case?

	0:09:46.326 --> 0:09:47.933
	You can do the big rule base here.

	0:09:47.933 --> 0:09:49.495
	Maybe don't know how they did it.

	0:09:49.990 --> 0:09:57.469
	You can, of course, if you are focusing on
	gender, the source and the target is different,

	0:09:57.469 --> 0:09:57.879
	and.

	0:09:58.118 --> 0:10:05.799
	But if you want to do it more general, it's
	not that easy because there's always.

	0:10:06.166 --> 0:10:18.255
	But it's not clear if these are really different
	or if there's only slight differences.

	0:10:22.142 --> 0:10:36.451
	Between that another reason why there is a
	bias in there is typically the system tries

	0:10:36.451 --> 0:10:41.385
	to always do the most simple.

	0:10:42.262 --> 0:10:54.483
	And also in your training data there are unintended
	shortcuts or clues only in the training data

	0:10:54.483 --> 0:10:59.145
	because you sample them in some way.

	0:10:59.379 --> 0:11:06.257
	This example, if she works in a hospital and
	my friend is a nurse, then it might be that

	0:11:06.257 --> 0:11:07.184
	one friend.

	0:11:08.168 --> 0:11:18.979
	Male and female because it has learned that
	in your trained doctor is a male and a nurse

	0:11:18.979 --> 0:11:20.802
	is doing this.

	0:11:20.880 --> 0:11:29.587
	And of course, if we are doing maximum likelihood
	approximation as we are doing it in general,

	0:11:29.587 --> 0:11:30.962
	we are always.

	0:11:30.951 --> 0:11:43.562
	So that means if in your training data this
	correlation is maybe in the case then your

	0:11:43.562 --> 0:11:48.345
	predictions are always the same.

	0:11:48.345 --> 0:11:50.375
	It typically.

	0:11:55.035 --> 0:12:06.007
	What does it mean, of course, if we are having
	this type of fires and if we are applying?

	0:12:05.925 --> 0:12:14.821
	It might be that the benefit of machine translation
	rice so more and more people can benefit from

	0:12:14.821 --> 0:12:20.631
	the ability to talk to people in different
	languages and so on.

	0:12:20.780 --> 0:12:27.261
	But if you more often use it, problems of
	the system also get more and more important.

	0:12:27.727 --> 0:12:36.984
	And so if we are seeing that these problems
	and people nowadays only start to analyze these

	0:12:36.984 --> 0:12:46.341
	problems partly, also because if it hasn't
	been used, it's not that important if the quality

	0:12:46.341 --> 0:12:47.447
	is so bad.

	0:12:47.627 --> 0:12:51.907
	Version or is mixing it all the time like
	we have seen in old systems.

	0:12:51.907 --> 0:12:52.993
	Then, of course,.

	0:12:53.053 --> 0:12:57.303
	The issue is not that you have biased issues
	that you at first need to create a right view.

	0:12:57.637 --> 0:13:10.604
	So only with the wide application of the good
	quality this becomes important, and then of

	0:13:10.604 --> 0:13:15.359
	course you should look into how.

	0:13:15.355 --> 0:13:23.100
	In order to first get aware of what are the
	challenges, and that is a general idea not

	0:13:23.100 --> 0:13:24.613
	only about bias.

	0:13:24.764 --> 0:13:31.868
	Of course, we have learned about blue scores,
	so how can you evaluate the over quality and

	0:13:31.868 --> 0:13:36.006
	they are very important, either blue or any
	of that.

	0:13:36.006 --> 0:13:40.378
	However, they are somehow giving us a general
	overview.

	0:13:40.560 --> 0:13:58.410
	And if we want to improve our systems, of
	course it's important that we also do more

	0:13:58.410 --> 0:14:00.510
	detailed.

	0:14:00.340 --> 0:14:05.828
	Test sets which are very challenging in order
	to attend to see how good these systems.

	0:14:06.446 --> 0:14:18.674
	Of course, one last reminder to that if you
	do a challenge that says it's typically good

	0:14:18.674 --> 0:14:24.581
	to keep track of your general performance.

	0:14:24.784 --> 0:14:28.648
	You don't want to improve normally then on
	the general quality.

	0:14:28.688 --> 0:14:41.555
	So if you build a system which will mitigate
	some biases then the aim is that if you evaluate

	0:14:41.555 --> 0:14:45.662
	it on the challenging biases.

	0:14:45.745 --> 0:14:53.646
	You don't need to get better because the aggregated
	versions don't really measure that aspect well,

	0:14:53.646 --> 0:14:57.676
	but if you significantly drop in performance
	then.

	0:15:00.000 --> 0:15:19.164
	What are, in generally calms, people report
	about that or why should you care about?

	0:15:19.259 --> 0:15:23.598
	And you're even then amplifying this type
	of stereotypes.

	0:15:23.883 --> 0:15:33.879
	And that is not what you want to achieve with
	using this technology.

	0:15:33.879 --> 0:15:39.384
	It's not working through some groups.

	0:15:39.819 --> 0:15:47.991
	And secondly what is referred to as allocational
	parts.

	0:15:47.991 --> 0:15:54.119
	The system might not perform as well for.

	0:15:54.314 --> 0:16:00.193
	So another example of which we would like
	to see is that sometimes the translation depends

	0:16:00.193 --> 0:16:01.485
	on who is speaking.

	0:16:01.601 --> 0:16:03.463
	So Here You Have It in French.

	0:16:03.723 --> 0:16:16.359
	Not say it, but the word happy or French has
	to be expressed differently, whether it's a

	0:16:16.359 --> 0:16:20.902
	male person or a female person.

	0:16:21.121 --> 0:16:28.917
	It's nearly impossible to guess that or it's
	impossible, so then you always select one.

	0:16:29.189 --> 0:16:37.109
	And of course, since we do greedy search,
	it will always generate the same, so you will

	0:16:37.109 --> 0:16:39.449
	have a worse performance.

	0:16:39.779 --> 0:16:46.826
	And of course not what we want to achieve
	in average.

	0:16:46.826 --> 0:16:54.004
	You might be then good, but you also have
	the ability.

	0:16:54.234 --> 0:17:08.749
	This is a biased problem or an interface problem
	because mean you can say well.

	0:17:09.069 --> 0:17:17.358
	And if you do it, we still have a system that
	generates unusable output.

	0:17:17.358 --> 0:17:24.057
	If you don't tell it what you want to do,
	so in this case.

	0:17:24.244 --> 0:17:27.173
	So in this case it's like if we don't have
	enough information.

	0:17:27.467 --> 0:17:34.629
	So you have to adapt your system in some way
	that can either access the information or output.

	0:17:34.894 --> 0:17:46.144
	But yeah, how you mean there's different ways
	of how to improve over that first thing is

	0:17:46.144 --> 0:17:47.914
	you find out.

	0:17:48.688 --> 0:17:53.826
	Then there is different ways of addressing
	them, and they of course differ.

	0:17:53.826 --> 0:17:57.545
	Isn't the situation where the information's
	available?

	0:17:58.038 --> 0:18:12.057
	That's the first case we have, or is it a
	situation where we don't have the information

	0:18:12.057 --> 0:18:13.332
	either?

	0:18:14.154 --> 0:18:28.787
	Or should give the system maybe the opportunity
	to output those or say don't know this is still

	0:18:28.787 --> 0:18:29.701
	open.

	0:18:29.769 --> 0:18:35.470
	And even if they have enough information,
	need this additional information, but they

	0:18:35.470 --> 0:18:36.543
	are just doing.

	0:18:36.776 --> 0:18:51.132
	Which is a bit based on how we find that there
	is research on that, but it's not that easy

	0:18:51.132 --> 0:18:52.710
	to solve.

	0:18:52.993 --> 0:19:05.291
	But in general, detecting do have enough information
	to do a good translation or are information

	0:19:05.291 --> 0:19:06.433
	missing?

	0:19:09.669 --> 0:19:18.951
	But before we come on how we will address
	it or try to change it, and before we look

	0:19:18.951 --> 0:19:22.992
	at how we can assess it, of course,.

	0:19:23.683 --> 0:19:42.820
	And therefore wanted to do a bit of a review
	on how gender is represented in languages.

	0:19:43.743 --> 0:19:48.920
	Course: You can have more fine grained.

	0:19:48.920 --> 0:20:00.569
	It's not that everything in the group is the
	same, but in general you have a large group.

	0:20:01.381 --> 0:20:08.347
	For example, you even don't say ishi or but
	it's just one word for it written.

	0:20:08.347 --> 0:20:16.107
	Oh, don't know how it's pronounced, so you
	cannot say from a sentence whether it's ishi

	0:20:16.107 --> 0:20:16.724
	or it.

	0:20:17.937 --> 0:20:29.615
	Of course, there are some exceptions for whether
	it's a difference between male and female.

	0:20:29.615 --> 0:20:35.962
	They have different names for brother and
	sister.

	0:20:36.036 --> 0:20:41.772
	So normally you cannot infer whether this
	is a male speaker or speaking about a male

	0:20:41.772 --> 0:20:42.649
	or a female.

	0:20:44.304 --> 0:20:50.153
	Examples for these languages are, for example,
	Finnish and Turkish.

	0:20:50.153 --> 0:21:00.370
	There are more languages, but these are: Then
	we have no nutritional gender languages where

	0:21:00.370 --> 0:21:05.932
	there's some gender information in there, but
	it's.

	0:21:05.905 --> 0:21:08.169
	And this is an example.

	0:21:08.169 --> 0:21:15.149
	This is English, which is in that way a nice
	example because most people.

	0:21:15.415 --> 0:21:20.164
	So you have there some lexicogender and phenomenal
	gender.

	0:21:20.164 --> 0:21:23.303
	I mean mamadeta there she-hee and him.

	0:21:23.643 --> 0:21:31.171
	And very few words are marked like actor and
	actress, but in general most words are not

	0:21:31.171 --> 0:21:39.468
	marked, so it's teacher and lecturer and friend,
	so in all these words the gender is not marked,

	0:21:39.468 --> 0:21:41.607
	and so you cannot infer.

	0:21:42.622 --> 0:21:48.216
	So the initial Turkish sentence here would
	be translated to either he is a good friend

	0:21:48.216 --> 0:21:49.373
	or she is a good.

	0:21:51.571 --> 0:22:05.222
	In this case you would have them gender information
	in there, but of course there's a good friend.

	0:22:07.667 --> 0:22:21.077
	And then finally there is the grammatical
	German languages where each noun has a gender.

	0:22:21.077 --> 0:22:25.295
	That's the case in Spanish.

	0:22:26.186 --> 0:22:34.025
	This is mostly formal, but at least if you're
	talking about a human that also agrees.

	0:22:34.214 --> 0:22:38.209
	Of course, it's like the sun.

	0:22:38.209 --> 0:22:50.463
	There is no clear thing why the sun should
	be female, and in other language it's different.

	0:22:50.390 --> 0:22:56.100
	The matching, and then you also have more
	agreements with this that makes things more

	0:22:56.100 --> 0:22:56.963
	complicated.

	0:22:57.958 --> 0:23:08.571
	Here he is a good friend and the good is also
	depending whether it's male or went up so it's

	0:23:08.571 --> 0:23:17.131
	changing also based on the gender so you have
	a lot of gender information.

	0:23:17.777 --> 0:23:21.364
	Get them, but do you always get them correctly?

	0:23:21.364 --> 0:23:25.099
	It might be that they're in English, for example.

	0:23:28.748 --> 0:23:36.154
	And since this is the case, and you need to
	like often express the gender even though you

	0:23:36.154 --> 0:23:37.059
	might not.

	0:23:37.377 --> 0:23:53.030
	Aware of it or it's not possible, there's
	some ways in German how to mark mutual forms.

	0:23:54.194 --> 0:24:03.025
	But then it's again from the machine learning
	side of view, of course quite challenging because

	0:24:03.025 --> 0:24:05.417
	you only want to use the.

	0:24:05.625 --> 0:24:11.108
	If it's known to the reader you want to use
	the correct, the not mutual form but either

	0:24:11.108 --> 0:24:12.354
	the male or female.

	0:24:13.013 --> 0:24:21.771
	So they are assessing what is known to the
	reader as a challenge which needs to in some

	0:24:21.771 --> 0:24:23.562
	way be addressed.

	0:24:26.506 --> 0:24:30.887
	Here why does that happen?

	0:24:30.887 --> 0:24:42.084
	Three reasons we have that in a bit so one
	is, of course, that your.

	0:24:42.162 --> 0:24:49.003
	Example: If you look at the Europe High Corpus,
	which is an important resource for doing machine

	0:24:49.003 --> 0:24:49.920
	translation.

	0:24:50.010 --> 0:24:59.208
	Then there's only thirty percent of the speakers
	are female, and so if you train a model on

	0:24:59.208 --> 0:25:06.606
	that data, if you're translating to French,
	there will be a male version.

	0:25:06.746 --> 0:25:10.762
	And so you'll just have a lot more like seventy
	percent of your mail for it.

	0:25:10.971 --> 0:25:18.748
	And that will be Yep will make the model therefore
	from this data sub.

	0:25:18.898 --> 0:25:25.882
	And of course this will be in the data for
	a very long time.

	0:25:25.882 --> 0:25:33.668
	So if there's more female speakers in the
	European Parliament, but.

	0:25:33.933 --> 0:25:42.338
	But we are training on historical data, so
	even if there is for a long time, it will not

	0:25:42.338 --> 0:25:43.377
	be in the.

	0:25:46.346 --> 0:25:57.457
	Then besides these preexisting data there
	is of course technical biases which will amplify

	0:25:57.457 --> 0:25:58.800
	this type.

	0:25:59.039 --> 0:26:04.027
	So one we already address, that's for example
	sampling or beam search.

	0:26:04.027 --> 0:26:06.416
	You get the most probable output.

	0:26:06.646 --> 0:26:16.306
	So if there's a bias in your model, it will
	amplify that not only in the case we had before,

	0:26:16.306 --> 0:26:19.423
	and produce the male version.

	0:26:20.040 --> 0:26:32.873
	So if you have the same source sentence like
	am happy and in your training data it will

	0:26:32.873 --> 0:26:38.123
	be male and female if you're doing.

	0:26:38.418 --> 0:26:44.510
	So in that way by doing this type of algorithmic
	design you will have.

	0:26:44.604 --> 0:26:59.970
	Another use case is if you think about a multilingual
	machine translation, for example if you are

	0:26:59.970 --> 0:27:04.360
	now doing a pivot language.

	0:27:04.524 --> 0:27:13.654
	But if you're first trying to English this
	information might get lost and then you translate

	0:27:13.654 --> 0:27:14.832
	to Spanish.

	0:27:15.075 --> 0:27:21.509
	So while in general in this class there is
	not this type of bias there,.

	0:27:22.922 --> 0:27:28.996
	You might introduce it because you might have
	good reasons for doing a modular system because

	0:27:28.996 --> 0:27:31.968
	you don't have enough training data or so on.

	0:27:31.968 --> 0:27:37.589
	It's performing better in average, but of
	course by doing this choice you'll introduce

	0:27:37.589 --> 0:27:40.044
	an additional type of bias into your.

	0:27:45.805 --> 0:27:52.212
	And then there is what people refer to as
	emergent bias, and that is, if you use a system

	0:27:52.212 --> 0:27:58.903
	for a different use case as we see in, generally
	it is the case that is performing worse, but

	0:27:58.903 --> 0:28:02.533
	then of course you can have even more challenging.

	0:28:02.942 --> 0:28:16.196
	So the extreme case would be if you train
	a system only on male speakers, then of course

	0:28:16.196 --> 0:28:22.451
	it will perform worse on female speakers.

	0:28:22.902 --> 0:28:36.287
	So, of course, if you're doing this type of
	problem, if you use a system for a different

	0:28:36.287 --> 0:28:42.152
	situation where it was original, then.

	0:28:44.004 --> 0:28:54.337
	And with this we would then go for type of
	evaluation, but before we are looking at how

	0:28:54.337 --> 0:28:56.333
	we can evaluate.

	0:29:00.740 --> 0:29:12.176
	Before we want to look into how we can improve
	the system, think yeah, maybe at the moment

	0:29:12.176 --> 0:29:13.559
	most work.

	0:29:13.954 --> 0:29:21.659
	And the one thing is the system trying to
	look into stereotypes.

	0:29:21.659 --> 0:29:26.164
	So how does a system use stereotypes?

	0:29:26.466 --> 0:29:29.443
	So if you have the Hungarian sentence,.

	0:29:29.729 --> 0:29:33.805
	Which should be he is an engineer or she is
	an engineer.

	0:29:35.375 --> 0:29:43.173
	And you cannot guess that because we saw that
	he and she is not different in Hungary.

	0:29:43.423 --> 0:29:57.085
	Then you can have a test set where you have
	these type of ailanomal occupations.

	0:29:56.977 --> 0:30:03.862
	You have statistics from how is the distribution
	by gender so you can automatically generate

	0:30:03.862 --> 0:30:04.898
	the sentence.

	0:30:04.985 --> 0:30:21.333
	Then you could put in jobs which are mostly
	done by a man and then you can check how is

	0:30:21.333 --> 0:30:22.448
	your.

	0:30:22.542 --> 0:30:31.315
	That is one type of evaluating stereotypes
	that one of the most famous benchmarks called

	0:30:31.315 --> 0:30:42.306
	vino is exactly: The second type of evaluation
	is about gender preserving.

	0:30:42.342 --> 0:30:51.201
	So that is exactly what we have seen beforehand.

	0:30:51.201 --> 0:31:00.240
	If these information are not in the text itself,.

	0:31:00.320 --> 0:31:01.875
	Gender as a speaker.

	0:31:02.062 --> 0:31:04.450
	And how good does a system do that?

	0:31:04.784 --> 0:31:09.675
	And we'll see there's, for example, one benchmark
	on this.

	0:31:09.675 --> 0:31:16.062
	For example: For Arabic there is one benchmark
	on this foot: Audio because if you're now think

	0:31:16.062 --> 0:31:16.781
	already of the.

	0:31:17.157 --> 0:31:25.257
	From when we're talking about speech translation,
	it might be interesting because in the speech

	0:31:25.257 --> 0:31:32.176
	signal you should have a better guess on whether
	it's a male or a female speaker.

	0:31:32.432 --> 0:31:38.928
	So but mean current systems, mostly you can
	always add, and they will just first transcribe.

	0:31:42.562 --> 0:31:45.370
	Yes, so how do these benchmarks?

	0:31:45.305 --> 0:31:51.356
	Look like that, the first one is here.

	0:31:51.356 --> 0:32:02.837
	There's an occupation test where it looks
	like a simple test set because.

	0:32:03.023 --> 0:32:10.111
	So I've known either hurry him or pronounce
	the name for a long time.

	0:32:10.111 --> 0:32:13.554
	My friend works as an occupation.

	0:32:13.833 --> 0:32:16.771
	So that is like all sentences in that look
	like that.

	0:32:17.257 --> 0:32:28.576
	So in this case you haven't had the biggest
	work in here, which is friends.

	0:32:28.576 --> 0:32:33.342
	So your only checking later is.

	0:32:34.934 --> 0:32:46.981
	This can be inferred from whether it's her
	or her or her, or if it's a proper name, so

	0:32:46.981 --> 0:32:55.013
	can you infer it from the name, and then you
	can compare.

	0:32:55.115 --> 0:33:01.744
	So is this because the job description is
	nearer to friend.

	0:33:01.744 --> 0:33:06.937
	Does the system get disturbed by this type
	of.

	0:33:08.828 --> 0:33:14.753
	And there you can then automatically assess
	yeah this type.

	0:33:14.774 --> 0:33:18.242
	Of course, that's what said at the beginning.

	0:33:18.242 --> 0:33:24.876
	You shouldn't only rely on that because if
	you only rely on it you can easily trick the

	0:33:24.876 --> 0:33:25.479
	system.

	0:33:25.479 --> 0:33:31.887
	So one type of sentence is translated, but
	of course it can give you very important.

	0:33:33.813 --> 0:33:35.309
	Any questions yeah.

	0:33:36.736 --> 0:33:44.553
	Much like the evaluation of stereotype, we
	want the system to agree with stereotypes because

	0:33:44.553 --> 0:33:46.570
	it increases precision.

	0:33:46.786 --> 0:33:47.979
	No, no, no.

	0:33:47.979 --> 0:33:53.149
	In this case, if we say oh yeah, he is an
	engineer.

	0:33:53.149 --> 0:34:01.600
	From the example, it's probably the most likely
	translation, probably in more cases.

	0:34:02.702 --> 0:34:08.611
	Now there is two things, so yeah yeah, so
	there is two ways of evaluating.

	0:34:08.611 --> 0:34:15.623
	The one thing is in this case he's using that
	he's an engineer, but there is conflicting

	0:34:15.623 --> 0:34:19.878
	information that in this case the engineer
	is female.

	0:34:20.380 --> 0:34:21.890
	So anything was.

	0:34:22.342 --> 0:34:29.281
	Information yes, so that is the one in the
	other case.

	0:34:29.281 --> 0:34:38.744
	Typically it's not evaluated in that, but
	in that time you really want it.

	0:34:38.898 --> 0:34:52.732
	That's why most of those cases you have evaluated
	in scenarios where you have context information.

	0:34:53.453 --> 0:34:58.878
	How to deal with the other thing is even more
	challenging to one case where it is the case

	0:34:58.878 --> 0:35:04.243
	is what I said before is when it's about the
	speaker so that the speech translation test.

	0:35:04.584 --> 0:35:17.305
	And there they try to look in a way that can
	you use, so use the audio also as input.

	0:35:18.678 --> 0:35:20.432
	Yeah.

	0:35:20.640 --> 0:35:30.660
	So if we have a reference where she is an
	engineer okay, are there efforts to adjust

	0:35:30.660 --> 0:35:37.497
	the metric so that our transmissions go into
	the correct?

	0:35:37.497 --> 0:35:38.676
	We don't.

	0:35:38.618 --> 0:35:40.389
	Only done for mean this is evaluation.

	0:35:40.389 --> 0:35:42.387
	You are not pushing the model for anything.

	0:35:43.023 --> 0:35:53.458
	But if you want to do it in training, that
	you're not doing it this way.

	0:35:53.458 --> 0:35:58.461
	I'm not aware of any direct model.

	0:35:58.638 --> 0:36:04.146
	Because you have to find out, is it known
	in this scenario or not?

	0:36:05.725 --> 0:36:12.622
	So at least I'm not aware of there's like
	the directive doing training try to assess

	0:36:12.622 --> 0:36:13.514
	more than.

	0:36:13.813 --> 0:36:18.518
	Mean there is data augmentation in the way
	that is done.

	0:36:18.518 --> 0:36:23.966
	Think we'll have that later, so what you can
	do is generate more.

	0:36:24.144 --> 0:36:35.355
	You can do that automatically or there's ways
	of biasing so that you can try to make your

	0:36:35.355 --> 0:36:36.600
	training.

	0:36:36.957 --> 0:36:46.228
	That's typically not done with focusing on
	scenarios where you check before or do have

	0:36:46.228 --> 0:36:47.614
	information.

	0:36:49.990 --> 0:36:58.692
	Mean, but for everyone it's not clear and
	agree with you in this scenario, the normal

	0:36:58.692 --> 0:37:01.222
	evaluation system where.

	0:37:01.341 --> 0:37:07.006
	Maybe you could say it shouldn't do always
	the same but have a distribution like a training

	0:37:07.006 --> 0:37:12.733
	data or something like that because otherwise
	we're amplifying but that current system can't

	0:37:12.733 --> 0:37:15.135
	do current systems can't predict both.

	0:37:15.135 --> 0:37:17.413
	That's why we see all the beginning.

	0:37:17.413 --> 0:37:20.862
	They have this extra interface where they
	then propose.

	0:37:24.784 --> 0:37:33.896
	Another thing is the vino empty system and
	it started from a challenge set for co-reference

	0:37:33.896 --> 0:37:35.084
	resolution.

	0:37:35.084 --> 0:37:43.502
	Co-reference resolution means we have pear
	on him and we need to find out what it's.

	0:37:43.823 --> 0:37:53.620
	So you have the doctor off the nurse to help
	her in the procedure, and now her does not

	0:37:53.620 --> 0:37:55.847
	refer to the nurse.

	0:37:56.556 --> 0:38:10.689
	And there you of course have the same type
	of stewardesses and the same type of buyers

	0:38:10.689 --> 0:38:15.237
	as the machine translation.

	0:38:16.316 --> 0:38:25.165
	And no think that normally yeah mean maybe
	that's also biased.

	0:38:27.687 --> 0:38:37.514
	No, but if you ask somebody, I guess if you
	ask somebody, then I mean syntectically it's

	0:38:37.514 --> 0:38:38.728
	ambiguous.

	0:38:38.918 --> 0:38:50.248
	If you ask somebody to help, then the horror
	has to refer to that.

	0:38:50.248 --> 0:38:54.983
	So it should also help the.

	0:38:56.396 --> 0:38:57.469
	Of the time.

	0:38:57.469 --> 0:39:03.906
	The doctor is female and says please have
	me in the procedure, but the other.

	0:39:04.904 --> 0:39:09.789
	Oh, you mean that it's helping the third person.

	0:39:12.192 --> 0:39:16.140
	Yeah, agree that it could also be yes.

	0:39:16.140 --> 0:39:19.077
	Don't know how easy that is.

	0:39:19.077 --> 0:39:21.102
	Only know the test.

	0:39:21.321 --> 0:39:31.820
	Then guess yeah, then you need a situation
	context where you know the situation, the other

	0:39:31.820 --> 0:39:34.589
	person having problems.

	0:39:36.936 --> 0:39:42.251
	Yeah no yeah that is like here when there
	is additional ambiguity in there.

	0:39:45.465 --> 0:39:48.395
	See that pure text models is not always okay.

	0:39:48.395 --> 0:39:51.134
	How full mean there is a lot of work also.

	0:39:52.472 --> 0:40:00.119
	Will not cover that in the lecture, but there
	are things like multimodal machine translation

	0:40:00.119 --> 0:40:07.109
	where you try to add pictures or something
	like that to have more context, and then.

	0:40:10.370 --> 0:40:23.498
	Yeah, it starts with this, so in order to
	evaluate that what it does is that you translate

	0:40:23.498 --> 0:40:25.229
	the system.

	0:40:25.305 --> 0:40:32.310
	It's doing stereotyping so the doctor is male
	and the nurse is female.

	0:40:32.492 --> 0:40:42.362
	And then you're using word alignment, and
	then you check whether this gender maps with

	0:40:42.362 --> 0:40:52.345
	the annotated gender of there, and that is
	how you evaluate in this type of vino empty.

	0:40:52.832 --> 0:40:59.475
	Mean, as you see, you're only focusing on
	the situation where you can or where the gender

	0:40:59.475 --> 0:41:00.214
	is known.

	0:41:00.214 --> 0:41:06.930
	Why for this one you don't do any evaluation,
	but because nurses can in that case be those

	0:41:06.930 --> 0:41:08.702
	and you cannot, as has.

	0:41:08.728 --> 0:41:19.112
	The benchmarks are at the moment designed
	in a way that you only evaluate things that

	0:41:19.112 --> 0:41:20.440
	are known.

	0:41:23.243 --> 0:41:25.081
	Then yeah, you can have a look.

	0:41:25.081 --> 0:41:28.931
	For example, here what people are looking
	is you can do the first.

	0:41:28.931 --> 0:41:32.149
	Oh well, the currency, how often does it do
	it correct?

	0:41:32.552 --> 0:41:41.551
	And there you see these numbers are a bit
	older.

	0:41:41.551 --> 0:41:51.835
	There's more work on that, but this is the
	first color.

	0:41:51.731 --> 0:42:01.311
	Because they do it like in this test, they
	do it twice, one with him and one with her.

	0:42:01.311 --> 0:42:04.834
	So the chance is fifty percent.

	0:42:05.065 --> 0:42:12.097
	Except somehow here, the one system seems
	to be quite good there that everything.

	0:42:13.433 --> 0:42:30.863
	What you can also do is look at the difference,
	where you need to predict female and the difference.

	0:42:30.850 --> 0:42:40.338
	It's more often correct on the male forms
	than on the female forms, and you see that

	0:42:40.338 --> 0:42:43.575
	it's except for this system.

	0:42:43.603 --> 0:42:53.507
	So would assume that they maybe in this one
	language did some type of method in there.

	0:42:55.515 --> 0:42:57.586
	If you are more often mean there is like.

	0:42:58.178 --> 0:43:01.764
	It's not a lot lower, there's one.

	0:43:01.764 --> 0:43:08.938
	I don't know why, but if you're always to
	the same then it should be.

	0:43:08.938 --> 0:43:14.677
	You seem to be counter intuitive, so maybe
	it's better.

	0:43:15.175 --> 0:43:18.629
	Don't know exactly how yes, but it's, it's
	true.

	0:43:19.019 --> 0:43:20.849
	Mean, there's very few cases.

	0:43:20.849 --> 0:43:22.740
	I also don't know for Russian.

	0:43:22.740 --> 0:43:27.559
	I mean, there is, I think, mainly for Russian
	where you have very low numbers.

	0:43:27.559 --> 0:43:30.183
	I mean, I would say like forty five or so.

	0:43:30.183 --> 0:43:32.989
	There can be more about renting and sampling.

	0:43:32.989 --> 0:43:37.321
	I don't know if they have even more gender
	or if they have a new tool.

	0:43:37.321 --> 0:43:38.419
	I don't think so.

	0:43:40.040 --> 0:43:46.901
	Then you have typically even a stronger bias
	here where you not do the differentiation between

	0:43:46.901 --> 0:43:53.185
	how often is it correct for me and the female,
	but you are distinguishing between the.

	0:43:53.553 --> 0:44:00.503
	So you're here, for you can check for each
	occupation, which is the most important.

	0:44:00.440 --> 0:44:06.182
	A comment one based on statistics, and then
	you take that on the one side and the anti

	0:44:06.182 --> 0:44:12.188
	stereotypically on the other side, and you
	see that not in all cases but in a lot of cases

	0:44:12.188 --> 0:44:16.081
	that null probabilities are even higher than
	on the other.

	0:44:21.061 --> 0:44:24.595
	Ah, I'm telling you there's something.

	0:44:28.668 --> 0:44:32.850
	But it has to be for a doctor.

	0:44:32.850 --> 0:44:39.594
	For example, for a doctor there three don't
	know.

	0:44:40.780 --> 0:44:44.275
	Yeah, but guess here it's mainly imminent
	job description.

	0:44:44.275 --> 0:44:45.104
	So yeah, but.

	0:44:50.050 --> 0:45:01.145
	And then there is the Arabic capital gender
	corpus where it is about more assessing how

	0:45:01.145 --> 0:45:03.289
	strong a singer.

	0:45:03.483 --> 0:45:09.445
	How that is done is the open subtitles.

	0:45:09.445 --> 0:45:18.687
	Corpus is like a corpus of subtitles generated
	by volunteers.

	0:45:18.558 --> 0:45:23.426
	For the Words Like I Mean Myself.

	0:45:23.303 --> 0:45:30.670
	And mine, and then they annotated the Arabic
	sentences, whether here I refer to as a female

	0:45:30.670 --> 0:45:38.198
	and masculine, or whether it's ambiguous, and
	then from the male and female one they generate

	0:45:38.198 --> 0:45:40.040
	types of translations.

	0:45:43.703 --> 0:45:51.921
	And then a bit more different test sets as
	the last one that is referred to as the machine.

	0:45:52.172 --> 0:45:57.926
	Corpus, which is based on these lectures.

	0:45:57.926 --> 0:46:05.462
	In general, this lecture is very important
	because it.

	0:46:05.765 --> 0:46:22.293
	And here is also interesting because you also
	have the obvious signal and it's done in the

	0:46:22.293 --> 0:46:23.564
	worst.

	0:46:23.763 --> 0:46:27.740
	In the first case is where it can only be
	determined based on the speaker.

	0:46:27.968 --> 0:46:30.293
	So something like am a good speaker.

	0:46:30.430 --> 0:46:32.377
	You cannot do that correctly.

	0:46:32.652 --> 0:46:36.970
	However, if you would have the audio signal
	you should have a lot better guests.

	0:46:37.257 --> 0:46:47.812
	So it wasn't evaluated, especially machine
	translation and speech translation system,

	0:46:47.812 --> 0:46:53.335
	which take this into account or, of course,.

	0:46:57.697 --> 0:47:04.265
	The second thing is where you can do it based
	on the context.

	0:47:04.265 --> 0:47:08.714
	In this case we are not using artificial.

	0:47:11.011 --> 0:47:15.550
	Cope from the from the real data, so it's
	not like artificial creative data, but.

	0:47:15.815 --> 0:47:20.939
	Of course, in a lot more work you have to
	somehow find these in the corpus and use them

	0:47:20.939 --> 0:47:21.579
	as a test.

	0:47:21.601 --> 0:47:27.594
	Is something she got together with two of
	her dearest friends, this older woman, and

	0:47:27.594 --> 0:47:34.152
	then, of course, here friends can we get from
	the context, but it might be that some systems

	0:47:34.152 --> 0:47:36.126
	ignore that that should be.

	0:47:36.256 --> 0:47:43.434
	So you have two test sets in there, two types
	of benchmarks, and you want to determine which

	0:47:43.434 --> 0:47:43.820
	one.

	0:47:47.787 --> 0:47:55.801
	Yes, this is how we can evaluate it, so the
	next question is how can we improve our systems

	0:47:55.801 --> 0:48:03.728
	because that's normally how we do evaluation
	and why we do evaluation so before we go into

	0:48:03.728 --> 0:48:04.251
	that?

	0:48:08.508 --> 0:48:22.685
	One idea is to do what is referred to as modeling,
	so the idea is somehow change the model in

	0:48:22.685 --> 0:48:24.495
	a way that.

	0:48:24.965 --> 0:48:38.271
	And yes, one idea is, of course, if we are
	giving him more information, the system doesn't

	0:48:38.271 --> 0:48:44.850
	need to do a guess without this information.

	0:48:44.724 --> 0:48:47.253
	In order to just ambiguate the bias,.

	0:48:47.707 --> 0:48:59.746
	The first thing is you can do that on the
	sentence level, for example, especially if

	0:48:59.746 --> 0:49:03.004
	you have the speakers.

	0:49:03.063 --> 0:49:12.518
	You can annotate the sentence with whether
	a speaker is made or a female, and then you

	0:49:12.518 --> 0:49:25.998
	can: Here we're seeing one thing which is very
	successful in neuromachine translation and

	0:49:25.998 --> 0:49:30.759
	other kinds of neural networks.

	0:49:31.711 --> 0:49:39.546
	However, in neuromachine translation, since
	we have no longer the strong correlation between

	0:49:39.546 --> 0:49:47.043
	input and output, the nice thing is you can
	normally put everything into your input, and

	0:49:47.043 --> 0:49:50.834
	if you have enough data, it's well balanced.

	0:49:51.151 --> 0:50:00.608
	So how you can do it here is you can add the
	token here saying female or male if the speaker

	0:50:00.608 --> 0:50:01.523
	is male.

	0:50:01.881 --> 0:50:07.195
	So, of course, this is no longer for human
	correct translation.

	0:50:07.195 --> 0:50:09.852
	It's like female Madam because.

	0:50:10.090 --> 0:50:22.951
	If you are doing the same thing then the translation
	would not be to translate female but can use

	0:50:22.951 --> 0:50:25.576
	it to disintegrate.

	0:50:25.865 --> 0:50:43.573
	And so this type of tagging is a very commonly
	used method in order to add more information.

	0:50:47.107 --> 0:50:54.047
	So this is first of all a very good thing,
	a very easy one.

	0:50:54.047 --> 0:50:57.633
	You don't have to change your.

	0:50:58.018 --> 0:51:04.581
	For example, has also been done if you think
	about formality in German.

	0:51:04.581 --> 0:51:11.393
	Whether you have to produce or, you can: We'll
	see it on Thursday.

	0:51:11.393 --> 0:51:19.628
	It's a very common approach for domains, so
	you put in the domain beforehand.

	0:51:19.628 --> 0:51:24.589
	This is from a Twitter or something like that.

	0:51:24.904 --> 0:51:36.239
	Of course, it only learns it if it has seen
	it and it dees them out, but in this case you

	0:51:36.239 --> 0:51:38.884
	don't need an equal.

	0:51:39.159 --> 0:51:42.593
	But however, it's still like challenging to
	get this availability.

	0:51:42.983 --> 0:51:55.300
	If you would do that on the first of all,
	of course, it only works if you really have

	0:51:55.300 --> 0:52:02.605
	data from speaking because otherwise it's unclear.

	0:52:02.642 --> 0:52:09.816
	You would only have the text and you would
	not easily see whether it is the mayor or the

	0:52:09.816 --> 0:52:14.895
	female speaker because this information has
	been removed from.

	0:52:16.456 --> 0:52:18.745
	Does anybody of you have an idea of how it
	fits?

	0:52:20.000 --> 0:52:25.480
	Manage that and still get the data of whether
	it's made or not speaking.

	0:52:32.152 --> 0:52:34.270
	Can do a small trick.

	0:52:34.270 --> 0:52:37.834
	We can just look on the target side.

	0:52:37.937 --> 0:52:43.573
	Mean this is, of course, only important if
	in the target side this is the case.

	0:52:44.004 --> 0:52:50.882
	So for your training data you can irritate
	it based on your target site in German you

	0:52:50.882 --> 0:52:51.362
	know.

	0:52:51.362 --> 0:52:58.400
	In German you don't know but in Spanish for
	example you know because different and then

	0:52:58.400 --> 0:53:00.400
	you can use grammatical.

	0:53:00.700 --> 0:53:10.964
	Of course, the test day would still need to
	do that more interface decision.

	0:53:13.954 --> 0:53:18.829
	And: You can, of course, do it even more advanced.

	0:53:18.898 --> 0:53:30.659
	You can even try to add these information
	to each word, so you're not doing it for the

	0:53:30.659 --> 0:53:32.687
	full sentence.

	0:53:32.572 --> 0:53:42.129
	If it's unknown, if it's female or if it's
	male, you know word alignment so you can't

	0:53:42.129 --> 0:53:42.573
	do.

	0:53:42.502 --> 0:53:55.919
	Here then you can do a word alignment, which
	is of course not always perfect, but roughly

	0:53:55.919 --> 0:53:59.348
	then you can annotate.

	0:54:01.401 --> 0:54:14.165
	Now you have these type of inputs where you
	have one information per word, but on the one

	0:54:14.165 --> 0:54:16.718
	end you have the.

	0:54:17.517 --> 0:54:26.019
	This has been used before in other scenarios,
	so you might not put in the gender, but in

	0:54:26.019 --> 0:54:29.745
	general this can be other information.

	0:54:30.090 --> 0:54:39.981
	And people refer to that or have used that
	as a factored translation model, so what you

	0:54:39.981 --> 0:54:42.454
	may do is you factor.

	0:54:42.742 --> 0:54:45.612
	You have the word itself.

	0:54:45.612 --> 0:54:48.591
	You might have the gender.

	0:54:48.591 --> 0:54:55.986
	You could have more information like don't
	know the paddle speech.

	0:54:56.316 --> 0:54:58.564
	And then you have an embedding for each of
	them.

	0:54:59.199 --> 0:55:03.599
	And you congratulate them, and then you have
	years of congratulated a bedding.

	0:55:03.563 --> 0:55:09.947
	Which says okay, this is a female plumber
	or a male plumber or so on.

	0:55:09.947 --> 0:55:18.064
	This has additional information and then you
	can train this factory model where you have

	0:55:18.064 --> 0:55:22.533
	the ability to give the model extra information.

	0:55:23.263 --> 0:55:35.702
	And of course now if you are training this
	way directly you always need to have this information.

	0:55:36.576 --> 0:55:45.396
	So that might not be the best way if you want
	to use a translation system and sometimes don't

	0:55:45.396 --> 0:55:45.959
	have.

	0:55:46.866 --> 0:55:57.987
	So any idea of how you can train it or what
	machine learning technique you can use to deal

	0:55:57.987 --> 0:55:58.720
	with.

	0:56:03.263 --> 0:56:07.475
	Mainly despite it already, many of your things.

	0:56:14.154 --> 0:56:21.521
	Drop out so you sometimes put information
	in there and then you can use dropouts to inputs.

	0:56:21.861 --> 0:56:27.599
	Is sometimes put in this information in there,
	sometimes not, and the system is then able

	0:56:27.599 --> 0:56:28.874
	to deal with those.

	0:56:28.874 --> 0:56:34.803
	If it doesn't have the information, it's doing
	some of the best it can do, but if it has the

	0:56:34.803 --> 0:56:39.202
	information, it can use the information and
	maybe do a more rounded.

	0:56:46.766 --> 0:56:52.831
	So then there is, of course, more ways to
	try to do a moderately biased one.

	0:56:52.993 --> 0:57:01.690
	We will only want to mention here because
	you'll have a full lecture on that next week

	0:57:01.690 --> 0:57:08.188
	and that is referred to where context based
	machine translation.

	0:57:08.728 --> 0:57:10.397
	Good, and in this other ones, but.

	0:57:10.750 --> 0:57:16.830
	If you translate several sentences well, of
	course, there are more situations where you

	0:57:16.830 --> 0:57:17.866
	can dissemble.

	0:57:18.118 --> 0:57:23.996
	Because it might be that the information is
	not in the current sentence, but it's in the

	0:57:23.996 --> 0:57:25.911
	previous sentence or before.

	0:57:26.967 --> 0:57:33.124
	If you have the mean with the speaker maybe
	not, but if it's referring to, you can core

	0:57:33.124 --> 0:57:33.963
	references.

	0:57:34.394 --> 0:57:40.185
	They are often referring to things in the
	previous sentence so you can use them in order

	0:57:40.185 --> 0:57:44.068
	to: And that can be done basically and very
	easy.

	0:57:44.068 --> 0:57:47.437
	You'll see more advanced options, but the
	main.

	0:57:48.108 --> 0:57:58.516
	Mean, no machine translation is a sequence
	to sequence model, which can use any input

	0:57:58.516 --> 0:58:02.993
	sequence to output sequence mapping.

	0:58:02.993 --> 0:58:04.325
	So now at.

	0:58:04.484 --> 0:58:11.281
	So then you can do, for example, five to five
	translations, or also five to one, or so there's.

	0:58:11.811 --> 0:58:19.211
	This is not a method like only dedicated to
	buying, of course, but the hope is.

	0:58:19.139 --> 0:58:25.534
	If you're using this because I mean bias often,
	we have seen that it rises in situations where

	0:58:25.534 --> 0:58:27.756
	we're not having enough context.

	0:58:27.756 --> 0:58:32.940
	So the idea is if we generally increase our
	context, it will also help this.

	0:58:32.932 --> 0:58:42.378
	Of course, it will help other situations where
	you need context to disintegrate.

	0:58:43.603 --> 0:58:45.768
	Get There If You're Saying I'm Going to the
	Bank.

	0:58:46.286 --> 0:58:54.761
	It's not directly from this sentence clear
	whether it's the finance institute or the bank

	0:58:54.761 --> 0:58:59.093
	for sitting, but maybe if you say afterward,.

	0:59:02.322 --> 0:59:11.258
	And then there is in generally a very large
	amount of work on debiasing the word embelling.

	0:59:11.258 --> 0:59:20.097
	So the one I hear like, I mean, I think that
	partly comes from the fact that like a first.

	0:59:21.041 --> 0:59:26.925
	Or that first research was done often on inspecting
	the word embeddings and seeing whether they

	0:59:26.925 --> 0:59:32.503
	are biased or not, and people found out how
	there is some bias in there, and then the idea

	0:59:32.503 --> 0:59:38.326
	is oh, if you remove them from the word embedded
	in already, then maybe your system later will

	0:59:38.326 --> 0:59:39.981
	not have that strong of a.

	0:59:40.520 --> 0:59:44.825
	So how can that work?

	0:59:44.825 --> 0:59:56.369
	Or like maybe first, how do words encounter
	bias in there?

	0:59:56.369 --> 0:59:57.152
	So.

	0:59:57.137 --> 1:00:05.555
	So you can look at the word embedding, and
	then you can compare the distance of the word

	1:00:05.555 --> 1:00:11.053
	compared: And there's like interesting findings.

	1:00:11.053 --> 1:00:18.284
	For example, you have the difference in occupation
	and how similar.

	1:00:18.678 --> 1:00:33.068
	And of course it's not a perfect correlation,
	but you see some type of correlation: jobs

	1:00:33.068 --> 1:00:37.919
	which have a high occupation.

	1:00:37.797 --> 1:00:41.387
	They also are more similar to the word what
	we're going to be talking about.

	1:00:43.023 --> 1:00:50.682
	Maybe a secretary is also a bit difficult,
	but because yeah maybe it's more often.

	1:00:50.610 --> 1:00:52.438
	Done in general by by women.

	1:00:52.438 --> 1:00:58.237
	However, there is a secretary like the Secretary
	of State or so, the German minister, which

	1:00:58.237 --> 1:01:03.406
	I of course know that many so in the statistics
	they are not counting that often.

	1:01:03.543 --> 1:01:11.576
	But in data they of course cook quite often,
	so there's different ways of different meanings.

	1:01:14.154 --> 1:01:23.307
	So how can you not try to remove this type
	of bias?

	1:01:23.307 --> 1:01:32.988
	One way is the idea of hearts, devices and
	embeddings.

	1:01:33.113 --> 1:01:39.354
	So if you remember on word embeddings think
	we have this image that you can do the difference

	1:01:39.354 --> 1:01:44.931
	between man and woman and add this difference
	to king and then look at your screen.

	1:01:45.865 --> 1:01:57.886
	So here's the idea we want to remove this
	gender information from some things which should

	1:01:57.886 --> 1:02:00.132
	not have gender.

	1:02:00.120 --> 1:02:01.386
	The word engineer.

	1:02:01.386 --> 1:02:06.853
	There is no information about the gender in
	that, so you should remove this type.

	1:02:07.347 --> 1:02:16.772
	Of course, you first need to find out where
	these inflammations are and you can.

	1:02:17.037 --> 1:02:23.603
	However, normally if you do the difference
	like the subspace by only one example, it's

	1:02:23.603 --> 1:02:24.659
	not the best.

	1:02:24.924 --> 1:02:31.446
	So you can do the same thing for things like
	brother and sister, man and dad, and then you

	1:02:31.446 --> 1:02:38.398
	can somehow take the average of these differences
	saying this is a vector which maps a male from

	1:02:38.398 --> 1:02:39.831
	to the female form.

	1:02:40.660 --> 1:02:50.455
	And then you can try to neutralize this gender
	information on this dimension.

	1:02:50.490 --> 1:02:57.951
	You can find it's subspace or dimensional.

	1:02:57.951 --> 1:03:08.882
	It would be a line, but now this is dimensional,
	and then you.

	1:03:08.728 --> 1:03:13.104
	Representation: Where you remove this type
	of embellishment.

	1:03:15.595 --> 1:03:18.178
	This is, of course, quite strong of the questions.

	1:03:18.178 --> 1:03:19.090
	How good does it?

	1:03:19.090 --> 1:03:20.711
	Thanks tell them for one other.

	1:03:20.880 --> 1:03:28.256
	But it's an idea we are trying to after learning
	before we are using the Word and Banks for

	1:03:28.256 --> 1:03:29.940
	machine translation.

	1:03:29.940 --> 1:03:37.315
	We are trying to remove the gender information
	from the jobs and then have a representation

	1:03:37.315 --> 1:03:38.678
	which hopefully.

	1:03:40.240 --> 1:03:45.047
	Similar idea is the one of agenda neutral
	glove.

	1:03:45.047 --> 1:03:50.248
	Glove is another technique to learn word embeddings.

	1:03:50.750 --> 1:03:52.870
	Think we discussed one shortly.

	1:03:52.870 --> 1:03:56.182
	It was too back, which was some of the first
	one.

	1:03:56.456 --> 1:04:04.383
	But there are other of course methods how
	you can train word embeddings and glove as

	1:04:04.383 --> 1:04:04.849
	one.

	1:04:04.849 --> 1:04:07.460
	The idea is we're training.

	1:04:07.747 --> 1:04:19.007
	At least this is somehow a bit separated,
	so where you have part of the vector is gender

	1:04:19.007 --> 1:04:20.146
	neutral.

	1:04:20.300 --> 1:04:29.247
	What you need therefore is three sets of words,
	so you have male words and you have words.

	1:04:29.769 --> 1:04:39.071
	And then you're trying to learn some type
	of vector where some dimensions are not.

	1:04:39.179 --> 1:04:51.997
	So the idea is can learn a representation
	where at least know that this part is gender

	1:04:51.997 --> 1:04:56.123
	neutral and the other part.

	1:05:00.760 --> 1:05:03.793
	How can we do that?

	1:05:03.793 --> 1:05:12.435
	How can we change the system to learn anything
	specific?

	1:05:12.435 --> 1:05:20.472
	Nearly in all cases this works by the loss
	function.

	1:05:20.520 --> 1:05:26.206
	And that is more a general approach in machine
	translation.

	1:05:26.206 --> 1:05:30.565
	The general loss function is we are learning.

	1:05:31.111 --> 1:05:33.842
	Here is the same idea.

	1:05:33.842 --> 1:05:44.412
	You have the general loss function in order
	to learn good embeddings and then you try to

	1:05:44.412 --> 1:05:48.687
	introduce additional loss function.

	1:05:48.969 --> 1:05:58.213
	Yes, I think yes, yes, that's the solution,
	and how you make sure that if I have training

	1:05:58.213 --> 1:06:07.149
	for all nurses of email, how do you make sure
	that the algorithm puts it into neutral?

	1:06:07.747 --> 1:06:12.448
	And you need, so this is like for only the
	first learning of word embeddings.

	1:06:12.448 --> 1:06:18.053
	Then the idea is if you have word embeddings
	where the gender is separate and then you train

	1:06:18.053 --> 1:06:23.718
	on top of that machine translation where you
	don't change the embeddings, it should hopefully

	1:06:23.718 --> 1:06:25.225
	be less and less biased.

	1:06:25.865 --> 1:06:33.465
	And in order to train that yes you need additional
	information so these information need to be

	1:06:33.465 --> 1:06:40.904
	hence defined and they can't be general so
	you need to have a list of these are male persons

	1:06:40.904 --> 1:06:44.744
	or males these are nouns for females and these.

	1:06:49.429 --> 1:06:52.575
	So the first step, of course, we still want
	to have good word inventings.

	1:06:54.314 --> 1:07:04.100
	So you have the normal objective function
	of the word embedding.

	1:07:04.100 --> 1:07:09.519
	It's something like the similarity.

	1:07:09.849 --> 1:07:19.751
	How it's exactly derived is not that important
	because we're not interested in love itself,

	1:07:19.751 --> 1:07:23.195
	but you have any loss function.

	1:07:23.195 --> 1:07:26.854
	Of course, you have to keep that.

	1:07:27.167 --> 1:07:37.481
	And then there's three more lost functions
	that you can add: So the one is you take the

	1:07:37.481 --> 1:07:51.341
	average value of all the male words and the
	average word embedding of all the female words.

	1:07:51.731 --> 1:08:00.066
	So the good thing about this is we don't always
	need to have for one word the male and the

	1:08:00.066 --> 1:08:05.837
	female worship, so it's only like we have a
	set of male words.

	1:08:06.946 --> 1:08:21.719
	So this is just saying yeah, we want these
	two should be somehow similar to each other.

	1:08:21.719 --> 1:08:25.413
	It shouldn't be that.

	1:08:30.330 --> 1:08:40.081
	Should be the other one, or think this should
	be it.

	1:08:40.081 --> 1:08:45.969
	This is agenda, the average of.

	1:08:45.945 --> 1:09:01.206
	The average should be the same, but if you're
	looking at the female should be at the other.

	1:09:01.681 --> 1:09:06.959
	This is like on these dimensions, the male
	should be on the one and the female on the

	1:09:06.959 --> 1:09:07.388
	other.

	1:09:07.627 --> 1:09:16.123
	The same yeah, this gender information should
	be there, so you're pushing all the males to

	1:09:16.123 --> 1:09:17.150
	the other.

	1:09:21.541 --> 1:09:23.680
	Then their words should be.

	1:09:23.680 --> 1:09:30.403
	If you have that you see the neutral words,
	they should be in the middle of between the

	1:09:30.403 --> 1:09:32.008
	male and the female.

	1:09:32.012 --> 1:09:48.261
	So you say is the middle point between all
	male and female words and just somehow putting

	1:09:48.261 --> 1:09:51.691
	the neutral words.

	1:09:52.912 --> 1:09:56.563
	And then you're learning them, and then you
	can apply them in different ways.

	1:09:57.057 --> 1:10:03.458
	So you have this a bit in the pre-training
	thing.

	1:10:03.458 --> 1:10:10.372
	You can use the pre-trained inbeddings on
	the output.

	1:10:10.372 --> 1:10:23.117
	All you can use are: And then you can analyze
	what happens instead of training them directly.

	1:10:23.117 --> 1:10:30.504
	If have this additional loss, which tries
	to optimize.

	1:10:32.432 --> 1:10:42.453
	And then it was evaluated exactly on the sentences
	we had at the beginning where it is about know

	1:10:42.453 --> 1:10:44.600
	her for a long time.

	1:10:44.600 --> 1:10:48.690
	My friend works as an accounting cling.

	1:10:48.788 --> 1:10:58.049
	So all these examples are not very difficult
	to translation, but the question is how often

	1:10:58.049 --> 1:10:58.660
	does?

	1:11:01.621 --> 1:11:06.028
	That it's not that complicated as you see
	here, so even the baseline.

	1:11:06.366 --> 1:11:10.772
	If you're doing nothing is working quite well,
	it's most challenging.

	1:11:10.772 --> 1:11:16.436
	It seems overall in the situation where it's
	a name, so for he and him he has learned the

	1:11:16.436 --> 1:11:22.290
	correlation because that's maybe not surprisingly
	because this correlation occurs more often

	1:11:22.290 --> 1:11:23.926
	than with any name there.

	1:11:24.044 --> 1:11:31.749
	If you have a name that you can extract, that
	is talking about Mary, that's female is a lot

	1:11:31.749 --> 1:11:34.177
	harder to extract than this.

	1:11:34.594 --> 1:11:40.495
	So you'll see already in the bass line this
	is yeah, not working, not working.

	1:11:43.403 --> 1:11:47.159
	And for all the other cases it's working very
	well.

	1:11:47.787 --> 1:11:53.921
	Where all the best one is achieved here with
	an arc debiasing both on the encoder, on the.

	1:11:57.077 --> 1:12:09.044
	It makes sense that a hard debasing on the
	decoder doesn't really work because there you

	1:12:09.044 --> 1:12:12.406
	have gender information.

	1:12:14.034 --> 1:12:17.406
	For glove it seems to already work here.

	1:12:17.406 --> 1:12:20.202
	That's maybe surprising and yeah.

	1:12:20.260 --> 1:12:28.263
	So there is no clear else we don't have numbers
	for that doesn't really work well on the other.

	1:12:28.263 --> 1:12:30.513
	So how much do I use then?

	1:12:33.693 --> 1:12:44.720
	Then as a last way of improving that is a
	bit what we had mentioned before.

	1:12:44.720 --> 1:12:48.493
	That is what is referred.

	1:12:48.488 --> 1:12:59.133
	One problem is the bias in the data so you
	can adapt your data so you can just try to

	1:12:59.133 --> 1:13:01.485
	find equal amount.

	1:13:01.561 --> 1:13:11.368
	In your data like you adapt your data and
	then you find your data on the smaller but

	1:13:11.368 --> 1:13:12.868
	you can try.

	1:13:18.298 --> 1:13:19.345
	This is line okay.

	1:13:19.345 --> 1:13:21.605
	We have access to the data to the model.

	1:13:21.605 --> 1:13:23.038
	We can improve our model.

	1:13:24.564 --> 1:13:31.328
	One situation we haven't talked a lot about
	but another situation might also be and that's

	1:13:31.328 --> 1:13:37.942
	even getting more important is oh you want
	to work with a model which you don't have but

	1:13:37.942 --> 1:13:42.476
	you want to improve the model without having
	access so when.

	1:13:42.862 --> 1:13:49.232
	Nowadays there are a lot of companies who
	are not developing their own system but they're

	1:13:49.232 --> 1:13:52.983
	using or something like that or machine translation.

	1:13:53.313 --> 1:13:59.853
	So there is interest that you might not be
	able to find children with models completely.

	1:14:00.080 --> 1:14:09.049
	So the question is, can you do some type of
	black box adaptation of a system that takes

	1:14:09.049 --> 1:14:19.920
	the black box system but tries to improve it
	in some ways through: There's some ways of

	1:14:19.920 --> 1:14:21.340
	doing that.

	1:14:21.340 --> 1:14:30.328
	One is called black box injection and that's
	what is referred to as prompt.

	1:14:30.730 --> 1:14:39.793
	So the problem is if you have sentences you
	don't have information about the speakers.

	1:14:39.793 --> 1:14:43.127
	So how can you put information?

	1:14:43.984 --> 1:14:53.299
	And what we know from a large language model,
	we just prompt them, and you can do that.

	1:14:53.233 --> 1:14:59.545
	Translating directly, I love you, you said
	she said to him, I love you, and then of course

	1:14:59.545 --> 1:15:01.210
	you have to strip away.

	1:15:01.181 --> 1:15:06.629
	I mean, you cannot prevent the model from
	translating that, but you should be able to

	1:15:06.629 --> 1:15:08.974
	see what is the translation of this.

	1:15:08.974 --> 1:15:14.866
	One can strip that away, and now the system
	had hopefully the information that it's somebody

	1:15:14.866 --> 1:15:15.563
	like that.

	1:15:15.563 --> 1:15:17.020
	The speaker is female.

	1:15:18.198 --> 1:15:23.222
	Because you're no longer translating love
	you, but you're translating the sentence she

	1:15:23.222 --> 1:15:24.261
	said to him love.

	1:15:24.744 --> 1:15:37.146
	And so you insert this information as contextual
	information around it and don't have to change

	1:15:37.146 --> 1:15:38.567
	the model.

	1:15:41.861 --> 1:15:56.946
	Last idea is to do what is referred to as
	letters rescoring, so the idea there is you

	1:15:56.946 --> 1:16:01.156
	generate a translation.

	1:16:01.481 --> 1:16:18.547
	And now you have an additional component which
	tries to add possibilities where gender information

	1:16:18.547 --> 1:16:21.133
	might be lost.

	1:16:21.261 --> 1:16:29.687
	It's just a graph in this way, a simplified
	graph where there's always one word between

	1:16:29.687 --> 1:16:31.507
	two notes and you.

	1:16:31.851 --> 1:16:35.212
	So you have something like Zi is an ads or
	a Zi is an ads.

	1:16:35.535 --> 1:16:41.847
	And then you can generate all possible variants.

	1:16:41.847 --> 1:16:49.317
	Then, of course, we're not done because the
	final output.

	1:16:50.530 --> 1:16:56.999
	Then you can re-score the system by a gender
	de-biased model.

	1:16:56.999 --> 1:17:03.468
	So the nice thing is why why don't we directly
	use our model?

	1:17:03.468 --> 1:17:10.354
	The idea is our model, which is only focusing
	on gender devising.

	1:17:10.530 --> 1:17:16.470
	It can be, for example, if it's just trained
	on some synthetical data, it will not be that

	1:17:16.470 --> 1:17:16.862
	well.

	1:17:16.957 --> 1:17:21.456
	But what we can do then is now you can rescore
	the possible translations in here.

	1:17:21.721 --> 1:17:31.090
	And here the cases of course in general structure
	is already done how to translate the words.

	1:17:31.051 --> 1:17:42.226
	Then you're only using the second component
	in order to react for some variants and then

	1:17:42.226 --> 1:17:45.490
	get the best translation.

	1:17:45.925 --> 1:17:58.479
	And: As the last one there is the post processing
	so you can't have it.

	1:17:58.538 --> 1:18:02.830
	Mean this was one way of post-processing was
	to generate the lattice and retranslate it.

	1:18:03.123 --> 1:18:08.407
	But you can also have a processing, for example
	only on the target side where you have additional

	1:18:08.407 --> 1:18:12.236
	components with checks about the gender which
	maybe only knows gender.

	1:18:12.236 --> 1:18:17.089
	So it's not a machine translation component
	but more like a grammatical checker which can

	1:18:17.089 --> 1:18:19.192
	be used as most processing to do that.

	1:18:19.579 --> 1:18:22.926
	Think about it a bit like when you use PPT.

	1:18:22.926 --> 1:18:25.892
	There's also a lot of post processing.

	1:18:25.892 --> 1:18:32.661
	If you use a directive, it would tell you
	how to build a bond, but they have some checks

	1:18:32.661 --> 1:18:35.931
	either before and after to prevent things.

	1:18:36.356 --> 1:18:40.580
	So often there might be an application system.

	1:18:40.580 --> 1:18:44.714
	There might be extra pre and post processing.

	1:18:48.608 --> 1:18:52.589
	And yeah, with this we're at the end of.

	1:18:52.512 --> 1:19:09.359
	To this lecture where we focused on the bias,
	but think a lot of these techniques we have

	1:19:09.359 --> 1:19:11.418
	seen here.

	1:19:11.331 --> 1:19:17.664
	So we saw, on the one hand, we saw that evaluating
	just pure blues first might not always be.

	1:19:17.677 --> 1:19:18.947
	Mean it's very important.

	1:19:20.000 --> 1:19:30.866
	Always do that, but if you want to check and
	some specific things are important, then you

	1:19:30.866 --> 1:19:35.696
	might have to do dedicated evaluations.

	1:19:36.036 --> 1:19:44.296
	It is now translating for the President and
	it is like in German that guess it is not very

	1:19:44.296 --> 1:19:45.476
	appropriate.

	1:19:45.785 --> 1:19:53.591
	So it might be important if characteristics
	of your system are essential to have dedicated

	1:19:53.591 --> 1:19:54.620
	evaluation.

	1:19:55.135 --> 1:20:02.478
	And then if you have that, of course, it might
	be also important to develop delicate techniques.

	1:20:02.862 --> 1:20:10.988
	We have seen today some how to mitigate biases,
	but I hope you see that a lot of these techniques

	1:20:10.988 --> 1:20:13.476
	you can also use to mitigate.

	1:20:13.573 --> 1:20:31.702
	At least related things you can adjust the
	training data you can do for other things.

	1:20:33.253 --> 1:20:36.022
	Before we have been finishing, we have any
	more questions.

	1:20:41.761 --> 1:20:47.218
	Then thanks a lot, and then we will see each
	other again on the first step.