WEBVTT

0:00:01.641 --> 0:00:06.302
Hey, so welcome again to today's lecture on machine
translation.

0:00:07.968 --> 0:00:15.152
This week we'll have a bit of a different focus.
For the last two weeks or so we have been looking into

0:00:15.655 --> 0:00:28.073
how we can improve our system by having more
data, other data sources, or using them

0:00:28.073 --> 0:00:30.331
more efficiently.

0:00:30.590 --> 0:00:38.046
And we'll have a bit more of that next week
with the anti-travised and the context.

0:00:38.338 --> 0:00:47.415
So that we are shifting from this idea of
we treat each sentence independently, but treat

0:00:47.415 --> 0:00:49.129
the translation.

0:00:49.129 --> 0:00:58.788
Because maybe you can remember from the beginning,
there are phenomena in machine translation

0:00:58.788 --> 0:01:02.143
that you cannot correctly check.

0:01:03.443 --> 0:01:14.616
However, today we want to look more into what
challenges arise, specifically when we're practically

0:01:14.616 --> 0:01:16.628
applying machine.

0:01:17.017 --> 0:01:23.674
And this block will be a total of four different
lectures.

0:01:23.674 --> 0:01:29.542
What type of biases are in machine translation
can.

0:01:29.729 --> 0:01:37.646
Just then can we try to improve this, but
of course the first focus can be at least the.

0:01:37.717 --> 0:01:41.375
And this, of course, gets more and more important

0:01:41.375 --> 0:01:48.333
the more often you apply this type of technology.
When it was mainly a basic research tool which

0:01:48.333 --> 0:01:53.785
you were using in a research environment, it's
not directly that important.

0:01:54.054 --> 0:02:00.370
But once you apply it, the question is: does
it perform the same for everybody, or is its

0:02:00.370 --> 0:02:04.436
performance for some people worse than for other
people?

0:02:04.436 --> 0:02:10.462
Does it have specific challenges? And we are
seeing that especially in translation.

0:02:10.710 --> 0:02:13.420
We have the major challenge.

0:02:13.420 --> 0:02:20.333
We have the grammatical gender and this is
not the same in all languages.

0:02:20.520 --> 0:02:35.431
In English, it's not clear if you talk about
some person, if it's male or female, and so

0:02:35.431 --> 0:02:39.787
hopefully you've learned.

0:02:41.301 --> 0:02:50.034
Just as a brief overview: based on this one
aspect of application, we will then have two other

0:02:50.034 --> 0:02:57.796
aspects: On Thursday we'll look into adaptation,
so how can we adapt to specific situations?

0:02:58.718 --> 0:03:09.127
Because we have seen that your systems perform
well when the test case is similar to the training

0:03:09.127 --> 0:03:15.181
case, it's always the case you should get training
data.

0:03:16.036 --> 0:03:27.577
However, in practical applications, it's not
always possible to collect really the best

0:03:27.577 --> 0:03:31.642
fitting data, so in that case.

0:03:32.092 --> 0:03:39.269
And then the third larger group of applications
will then be speech translation.

0:03:39.269 --> 0:03:42.991
What do we have to change in our machine?

0:03:43.323 --> 0:03:53.569
If we are now not translating text, but if
we want to translate speech, that will be more

0:03:53.569 --> 0:03:54.708
lectures.

0:04:00.180 --> 0:04:12.173
So what are we talking about when we are talking
about bias from a definition point of view?

0:04:12.092 --> 0:04:21.799
It means we are introducing systematic errors
when testing, and then we encourage the selection

0:04:21.799 --> 0:04:24.408
of the specific answers.

0:04:24.804 --> 0:04:36.862
The most prominent case, which is analyzed
most in the research community, is a bias based

0:04:36.862 --> 0:04:38.320
on gender.

0:04:38.320 --> 0:04:43.355
One example: she works in a hospital.

0:04:43.523 --> 0:04:50.787
It is not directly able to assess whether
this is now a point or a friend.

0:04:51.251 --> 0:05:07.095
And although in this one even there is, it's
possible to disambiguate this based on the context.

0:05:07.127 --> 0:05:14.391
However, yeah, learning this relation
is of course not that easy.

0:05:14.614 --> 0:05:27.249
So the system might also learn more like shortcut
connections, which might be that in your training

0:05:27.249 --> 0:05:31.798
data most of the doctors are male.

0:05:32.232 --> 0:05:41.725
That is the bias that has been most widely analyzed,
and we'll focus on that also in

0:05:41.641 --> 0:05:47.664
this lecture. However, of course, the system
might have a lot of other biases too, which have

0:05:47.664 --> 0:05:50.326
been partly investigated in other fields.

0:05:50.326 --> 0:05:53.496
But I think in machine translation not that
much.

0:05:53.813 --> 0:05:57.637
For example, it can be based on your origin.

0:05:57.737 --> 0:06:09.405
So there is an example for a sentiment analysis
that's a bit prominent.

0:06:09.405 --> 0:06:15.076
A sentiment analysis means you're.

0:06:15.035 --> 0:06:16.788
Like you're seeing it in reviews.

0:06:17.077 --> 0:06:24.045
And then you can show that with baseline models,
if the name is Mohammed then the sentiment

0:06:24.045 --> 0:06:30.786
in a lot of systems will be more negative than
if it's like a traditional European name.

0:06:31.271 --> 0:06:33.924
Are with foods that is simple.

0:06:33.924 --> 0:06:36.493
It's this type of restaurant.

0:06:36.493 --> 0:06:38.804
It's positive and another.

0:06:39.319 --> 0:06:49.510
You have other aspects, so we have seen this.

0:06:49.510 --> 0:06:59.480
We have done some experiments in Vietnamese.

0:06:59.559 --> 0:07:11.040
And then, for example, you can analyze that
if he's German, it will address him more

0:07:11.040 --> 0:07:18.484
formally, while if he is North Korean it'll use
an informal.

0:07:18.838 --> 0:07:24.923
So these are also possible types of bias.

0:07:24.923 --> 0:07:31.009
However, these are different types of biases.

0:07:31.251 --> 0:07:38.903
However, especially in translation, the bias
for gender is the most challenging because

0:07:38.903 --> 0:07:42.989
gender is treated differently in different languages.

0:07:45.405 --> 0:07:46.930
Why is this challenging?

0:07:48.148 --> 0:07:54.616
The reason for that is that there is a translation
mismatch and we have, I mean, one reason for

0:07:54.616 --> 0:08:00.140
that is there's a translation mismatch and
that's the most challenging situation.

0:08:00.140 --> 0:08:05.732
So there is different information
in the source language and in the target.

0:08:06.046 --> 0:08:08.832
So if we have the English word player,

0:08:09.029 --> 0:08:12.911
there is no information about the gender
in there.

0:08:12.911 --> 0:08:19.082
However, if you want to translate into German,
you cannot easily generate a word without

0:08:19.082 --> 0:08:20.469
gender information.

0:08:20.469 --> 0:08:27.056
Or I mean, you can do something like Spieler*in,
but that sounds a bit weird if you're talking.

0:08:27.027 --> 0:08:29.006
About a specific person.

0:08:29.006 --> 0:08:32.331
Then you should use the appropriate form.

0:08:32.692 --> 0:08:44.128
And so the most challenging translation is
always this situation where you have less

0:08:44.128 --> 0:08:50.939
information on the source side but more information.

0:08:51.911 --> 0:08:57.103
Similar things like if you think about Japanese,
for example where there's different formality

0:08:57.103 --> 0:08:57.540
levels.

0:08:57.540 --> 0:09:02.294
If in German there is no formality or like
two only or in English there's no formality

0:09:02.294 --> 0:09:02.677
level.

0:09:02.862 --> 0:09:08.139
And now you have to estimate the formality
level.

0:09:08.139 --> 0:09:10.884
Of course, it takes some.

0:09:10.884 --> 0:09:13.839
It's not directly possible.

0:09:14.094 --> 0:09:20.475
What nowadays systems are doing is at least
assess.

0:09:20.475 --> 0:09:27.470
This is a situation where there is not enough
information.

0:09:27.567 --> 0:09:28.656
Translation.

0:09:28.656 --> 0:09:34.938
So here you have it suggesting it can be
doctor or doctora in Spanish.

0:09:35.115 --> 0:09:37.051
So that is a possibility.

0:09:37.051 --> 0:09:41.595
However, it is of course very, very challenging
to find out.

0:09:42.062 --> 0:09:46.130
Are there really two different meanings, or
is it not the case?

0:09:46.326 --> 0:09:47.933
You can do this rule-based here.

0:09:47.933 --> 0:09:49.495
Maybe, I don't know how they did it.

0:09:49.990 --> 0:09:57.469
You can, of course, if you are focusing on
gender, the source and the target is different,

0:09:57.469 --> 0:09:57.879
and.

0:09:58.118 --> 0:10:05.799
But if you want to do it more general, it's
not that easy because there's always.

0:10:06.166 --> 0:10:18.255
But it's not clear if these are really different
or if there's only slight differences.

0:10:22.142 --> 0:10:36.451
Besides that, another reason why there is a
bias in there is typically the system tries

0:10:36.451 --> 0:10:41.385
to always do the most simple.

0:10:42.262 --> 0:10:54.483
And also in your training data there are unintended
shortcuts or clues only in the training data

0:10:54.483 --> 0:10:59.145
because you sample them in some way.

0:10:59.379 --> 0:11:06.257
This example, if she works in a hospital and
my friend is a nurse, then it might be that

0:11:06.257 --> 0:11:07.184
one friend.

0:11:08.168 --> 0:11:18.979
Male and female because it has learned that
in your training data a doctor is male and a nurse

0:11:18.979 --> 0:11:20.802
is doing this.

0:11:20.880 --> 0:11:29.587
And of course, if we are doing maximum likelihood
approximation as we are doing it in general,

0:11:29.587 --> 0:11:30.962
we are always.

0:11:30.951 --> 0:11:43.562
So that means if in your training data this
correlation is maybe in the case then your

0:11:43.562 --> 0:11:48.345
predictions are always the same.

0:11:48.345 --> 0:11:50.375
It typically.

0:11:55.035 --> 0:12:06.007
What does it mean, of course, if we are having
this type of bias and if we are applying?

0:12:05.925 --> 0:12:14.821
It might be that the benefit of machine translation
rises, so more and more people can benefit from

0:12:14.821 --> 0:12:20.631
the ability to talk to people in different
languages and so on.

0:12:20.780 --> 0:12:27.261
But if you use it more often, problems of
the system also get more and more important.

0:12:27.727 --> 0:12:36.984
And so if we are seeing that these problems
and people nowadays only start to analyze these

0:12:36.984 --> 0:12:46.341
problems partly, also because if it hasn't
been used, it's not that important if the quality

0:12:46.341 --> 0:12:47.447
is so bad.

0:12:47.627 --> 0:12:51.907
Version or is mixing it all the time like
we have seen in old systems.

0:12:51.907 --> 0:12:52.993
Then, of course,.

0:12:53.053 --> 0:12:57.303
The issue is not that you have biased issues
that you at first need to create a right view.

0:12:57.637 --> 0:13:10.604
So only with the wide application of the good
quality this becomes important, and then of

0:13:10.604 --> 0:13:15.359
course you should look into how.

0:13:15.355 --> 0:13:23.100
In order to first become aware of what the
challenges are, and that is a general idea not

0:13:23.100 --> 0:13:24.613
only about bias.

0:13:24.764 --> 0:13:31.868
Of course, we have learned about BLEU scores,
so how can you evaluate the overall quality, and

0:13:31.868 --> 0:13:36.006
they are very important, either BLEU or any
of that.

0:13:36.006 --> 0:13:40.378
However, they are somehow giving us a general
overview.

0:13:40.560 --> 0:13:58.410
And if we want to improve our systems, of
course it's important that we also do more

0:13:58.410 --> 0:14:00.510
detailed.

0:14:00.340 --> 0:14:05.828
Test sets which are very challenging in order
to see how good these systems.

0:14:06.446 --> 0:14:18.674
Of course, one last reminder on that: if you
use a challenge test set, it's typically good

0:14:18.674 --> 0:14:24.581
to keep track of your general performance.

0:14:24.784 --> 0:14:28.648
You don't want to improve normally then on
the general quality.

0:14:28.688 --> 0:14:41.555
So if you build a system which will mitigate
some biases then the aim is that if you evaluate

0:14:41.555 --> 0:14:45.662
it on the challenging biases.

0:14:45.745 --> 0:14:53.646
You don't need to get better because the aggregated
versions don't really measure that aspect well,

0:14:53.646 --> 0:14:57.676
but if you significantly drop in performance
then.
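
NOTE
A minimal sketch of the point above about reporting general quality
next to a challenge set. The function name, the example scores, and the
0.5-point tolerance are assumptions for illustration, not from the lecture.
```python
# Hypothetical acceptance check for a bias-mitigated system.
# max_drop is an assumed tolerance on the general score (for example BLEU).
def accept_system(base_general, new_general, base_challenge, new_challenge, max_drop=0.5):
    """Accept if the challenge score improves and the general score
    does not drop by more than max_drop."""
    improved_on_challenge = new_challenge > base_challenge
    general_ok = (base_general - new_general) <= max_drop
    return improved_on_challenge and general_ok
print(accept_system(30.0, 29.8, 0.55, 0.70))  # small general drop, better on the challenge set
print(accept_system(30.0, 26.0, 0.55, 0.90))  # large general drop
```
The general score is not expected to rise here; it only serves as a
guard against a significant drop.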

0:15:00.000 --> 0:15:19.164
What are, in general, the harms people report
about that, or why should you care about it?

0:15:19.259 --> 0:15:23.598
And you're even then amplifying this type
of stereotypes.

0:15:23.883 --> 0:15:33.879
And that is not what you want to achieve with
using this technology.

0:15:33.879 --> 0:15:39.384
It's not working through some groups.

0:15:39.819 --> 0:15:47.991
And secondly what is referred to as allocational
harms.

0:15:47.991 --> 0:15:54.119
The system might not perform as well for.

0:15:54.314 --> 0:16:00.193
So another example of which we would like
to see is that sometimes the translation depends

0:16:00.193 --> 0:16:01.485
on who is speaking.

0:16:01.601 --> 0:16:03.463
So here you have it in French.

0:16:03.723 --> 0:16:16.359
I will not say it, but the word happy in French has
to be expressed differently, depending on whether it's a

0:16:16.359 --> 0:16:20.902
male person or a female person.

0:16:21.121 --> 0:16:28.917
It's nearly impossible to guess that or it's
impossible, so then you always select one.

0:16:29.189 --> 0:16:37.109
And of course, since we do greedy search,
it will always generate the same, so you will

0:16:37.109 --> 0:16:39.449
have a worse performance.

0:16:39.779 --> 0:16:46.826
And of course not what we want to achieve
in average.

0:16:46.826 --> 0:16:54.004
You might be then good, but you also have
the ability.

0:16:54.234 --> 0:17:08.749
Is this a bias problem or an interface problem?
Because, I mean, you can say well.

0:17:09.069 --> 0:17:17.358
And if you do it, we still have a system that
generates unusable output.

0:17:17.358 --> 0:17:24.057
If you don't tell it what you want to do,
so in this case.

0:17:24.244 --> 0:17:27.173
So in this case it's like if we don't have
enough information.

0:17:27.467 --> 0:17:34.629
So you have to adapt your system in some way
that can either access the information or output.

0:17:34.894 --> 0:17:46.144
But yeah, how you mean there's different ways
of how to improve over that first thing is

0:17:46.144 --> 0:17:47.914
you find out.

0:17:48.688 --> 0:17:53.826
Then there is different ways of addressing
them, and they of course differ.

0:17:53.826 --> 0:17:57.545
Is it a situation where the information is
available?

0:17:58.038 --> 0:18:12.057
That's the first case we have, or is it a
situation where we don't have the information

0:18:12.057 --> 0:18:13.332
either?

0:18:14.154 --> 0:18:28.787
Or should give the system maybe the opportunity
to output those or say don't know this is still

0:18:28.787 --> 0:18:29.701
open.

0:18:29.769 --> 0:18:35.470
And even if they have enough information,
need this additional information, but they

0:18:35.470 --> 0:18:36.543
are just doing.

0:18:36.776 --> 0:18:51.132
Which is a bit based on how we find that there
is research on that, but it's not that easy

0:18:51.132 --> 0:18:52.710
to solve.

0:18:52.993 --> 0:19:05.291
But in general, detecting: do I have enough information
to do a good translation, or is information

0:19:05.291 --> 0:19:06.433
missing?

0:19:09.669 --> 0:19:18.951
But before we come to how we will address
it or try to change it, and before we look

0:19:18.951 --> 0:19:22.992
at how we can assess it, of course,.

0:19:23.683 --> 0:19:42.820
And therefore I wanted to do a bit of a review
on how gender is represented in languages.

0:19:43.743 --> 0:19:48.920
Course: You can have more fine grained.

0:19:48.920 --> 0:20:00.569
It's not that everything in the group is the
same, but in general you have a large group.

0:20:01.381 --> 0:20:08.347
For example, you don't even say he, she, or it, but
there is just one word for it written.

0:20:08.347 --> 0:20:16.107
Oh, I don't know how it's pronounced, so you
cannot say from a sentence whether it's he, she,

0:20:16.107 --> 0:20:16.724
or it.

0:20:17.937 --> 0:20:29.615
Of course, there are some exceptions for whether
it's a difference between male and female.

0:20:29.615 --> 0:20:35.962
They have different names for brother and
sister.

0:20:36.036 --> 0:20:41.772
So normally you cannot infer whether this
is a male speaker or speaking about a male

0:20:41.772 --> 0:20:42.649
or a female.

0:20:44.304 --> 0:20:50.153
Examples for these languages are, for example,
Finnish and Turkish.

0:20:50.153 --> 0:21:00.370
There are more languages, but these are: Then
we have notional gender languages where

0:21:00.370 --> 0:21:05.932
there's some gender information in there, but
it's.

0:21:05.905 --> 0:21:08.169
And this is an example.

0:21:08.169 --> 0:21:15.149
This is English, which is in that way a nice
example because most people.

0:21:15.415 --> 0:21:20.164
So you have there some lexical gender and pronominal
gender.

0:21:20.164 --> 0:21:23.303
I mean mum and dad, and then she, he, and him.

0:21:23.643 --> 0:21:31.171
And very few words are marked like actor and
actress, but in general most words are not

0:21:31.171 --> 0:21:39.468
marked, so it's teacher and lecturer and friend,
so in all these words the gender is not marked,

0:21:39.468 --> 0:21:41.607
and so you cannot infer.

0:21:42.622 --> 0:21:48.216
So the initial Turkish sentence here would
be translated to either he is a good friend

0:21:48.216 --> 0:21:49.373
or she is a good.

0:21:51.571 --> 0:22:05.222
In this case you would then have gender information
in there, but of course there's a good friend.

0:22:07.667 --> 0:22:21.077
And then finally there are the grammatical
gender languages where each noun has a gender.

0:22:21.077 --> 0:22:25.295
That's the case in Spanish.

0:22:26.186 --> 0:22:34.025
This is mostly formal, but at least if you're
talking about a human that also agrees.

0:22:34.214 --> 0:22:38.209
Of course, it's like the sun.

0:22:38.209 --> 0:22:50.463
There is no clear thing why the sun should
be female, and in other language it's different.

0:22:50.390 --> 0:22:56.100
The matching, and then you also have more
agreements with this that makes things more

0:22:56.100 --> 0:22:56.963
complicated.

0:22:57.958 --> 0:23:08.571
Here he is a good friend and the good is also
depending on whether it's male or female, so it's

0:23:08.571 --> 0:23:17.131
changing also based on the gender so you have
a lot of gender information.

0:23:17.777 --> 0:23:21.364
Get them, but do you always get them correctly?

0:23:21.364 --> 0:23:25.099
It might be that they're in English, for example.

0:23:28.748 --> 0:23:36.154
And since this is the case, and you need to
like often express the gender even though you

0:23:36.154 --> 0:23:37.059
might not.

0:23:37.377 --> 0:23:53.030
Aware of it or it's not possible, there are
some ways in German to mark neutral forms.

0:23:54.194 --> 0:24:03.025
But then it's again from the machine learning
point of view, of course quite challenging because

0:24:03.025 --> 0:24:05.417
you only want to use the.

0:24:05.625 --> 0:24:11.108
If it's known to the reader you want to use
the correct form, not the neutral form but either

0:24:11.108 --> 0:24:12.354
the male or female.

0:24:13.013 --> 0:24:21.771
So assessing what is known to the
reader is a challenge which needs to in some

0:24:21.771 --> 0:24:23.562
way be addressed.

0:24:26.506 --> 0:24:30.887
Here why does that happen?

0:24:30.887 --> 0:24:42.084
Three reasons we have that in a bit so one
is, of course, that your.

0:24:42.162 --> 0:24:49.003
Example: If you look at the Europarl corpus,
which is an important resource for doing machine

0:24:49.003 --> 0:24:49.920
translation.

0:24:50.010 --> 0:24:59.208
There, only thirty percent of the speakers
are female, and so if you train a model on

0:24:59.208 --> 0:25:06.606
that data, if you're translating to French,
there will be a male version.

0:25:06.746 --> 0:25:10.762
And so you'll just have a lot more, like seventy
percent, of the male forms.

0:25:10.971 --> 0:25:18.748
And that will be Yep will make the model therefore
from this data sub.

0:25:18.898 --> 0:25:25.882
And of course this will be in the data for
a very long time.

0:25:25.882 --> 0:25:33.668
So if there's more female speakers in the
European Parliament, but.

0:25:33.933 --> 0:25:42.338
But we are training on historical data, so
even if there is for a long time, it will not

0:25:42.338 --> 0:25:43.377
be in the.

0:25:46.346 --> 0:25:57.457
Then besides these preexisting data there
is of course technical biases which will amplify

0:25:57.457 --> 0:25:58.800
this type.

0:25:59.039 --> 0:26:04.027
So one we already addressed, that's for example
sampling or beam search.

0:26:04.027 --> 0:26:06.416
You get the most probable output.

0:26:06.646 --> 0:26:16.306
So if there's a bias in your model, it will
amplify that not only in the case we had before,

0:26:16.306 --> 0:26:19.423
and produce the male version.

0:26:20.040 --> 0:26:32.873
So if you have the same source sentence like
I am happy and in your training data it will

0:26:32.873 --> 0:26:38.123
be male and female if you're doing.

0:26:38.418 --> 0:26:44.510
So in that way by doing this type of algorithmic
design you will have.
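
NOTE
A minimal sketch of how always taking the most probable output amplifies
a bias in the model. The 70/30 split echoes the speaker statistics
mentioned earlier; treating it as the model's output distribution for one
ambiguous sentence is an assumption made only for this illustration.
```python
import random
# Assumed model distribution over two translations of one ambiguous source sentence.
p = {"male": 0.7, "female": 0.3}
def greedy(dist):
    """Always return the most probable output, as greedy or beam search tends to."""
    return max(dist, key=dist.get)
def sample(dist, rng):
    """Draw an output in proportion to its probability."""
    return rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
rng = random.Random(0)
n = 10_000
greedy_share = sum(greedy(p) == "male" for _ in range(n)) / n
sampled_share = sum(sample(p, rng) == "male" for _ in range(n)) / n
print(greedy_share)   # 1.0: the 70/30 preference becomes 100/0
print(sampled_share)  # close to 0.7
```
Sampling is shown only as a contrast; it keeps the skew of the data but
does not push it to the extreme the way the most-probable choice does.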

0:26:44.604 --> 0:26:59.970
Another use case is if you think about a multilingual
machine translation, for example if you are

0:26:59.970 --> 0:27:04.360
now doing a pivot language.

0:27:04.524 --> 0:27:13.654
But if you're first translating to English, this
information might get lost and then you translate

0:27:13.654 --> 0:27:14.832
to Spanish.

0:27:15.075 --> 0:27:21.509
So while in general in this class there is
not this type of bias there,

0:27:22.922 --> 0:27:28.996
You might introduce it because you might have
good reasons for doing a modular system because

0:27:28.996 --> 0:27:31.968
you don't have enough training data or so on.

0:27:31.968 --> 0:27:37.589
It's performing better on average, but of
course by doing this choice you'll introduce

0:27:37.589 --> 0:27:40.044
an additional type of bias into your.

0:27:45.805 --> 0:27:52.212
And then there is what people refer to as
emergent bias, and that is, if you use a system

0:27:52.212 --> 0:27:58.903
for a different use case. As we have seen, generally
it is the case that it is performing worse, but

0:27:58.903 --> 0:28:02.533
then of course you can have even more challenging.

0:28:02.942 --> 0:28:16.196
So the extreme case would be if you train
a system only on male speakers, then of course

0:28:16.196 --> 0:28:22.451
it will perform worse on female speakers.

0:28:22.902 --> 0:28:36.287
So, of course, if you're doing this type of
problem, if you use a system for a different

0:28:36.287 --> 0:28:42.152
situation where it was original, then.

0:28:44.004 --> 0:28:54.337
And with this we would then go for type of
evaluation, but before we are looking at how

0:28:54.337 --> 0:28:56.333
we can evaluate.

0:29:00.740 --> 0:29:12.176
Before we want to look into how we can improve
the system, think yeah, maybe at the moment

0:29:12.176 --> 0:29:13.559
most work.

0:29:13.954 --> 0:29:21.659
And the one thing is the system trying to
look into stereotypes.

0:29:21.659 --> 0:29:26.164
So how does a system use stereotypes?

0:29:26.466 --> 0:29:29.443
So if you have the Hungarian sentence,

0:29:29.729 --> 0:29:33.805
Which should be he is an engineer or she is
an engineer.

0:29:35.375 --> 0:29:43.173
And you cannot guess that because we saw that
he and she are not different in Hungarian.

0:29:43.423 --> 0:29:57.085
Then you can have a test set where you have
these type of ailanomal occupations.

0:29:56.977 --> 0:30:03.862
You have statistics from how is the distribution
by gender so you can automatically generate

0:30:03.862 --> 0:30:04.898
the sentence.

0:30:04.985 --> 0:30:21.333
Then you could put in jobs which are mostly
done by a man and then you can check how is

0:30:21.333 --> 0:30:22.448
your.

0:30:22.542 --> 0:30:31.315
That is one type of evaluating stereotypes, and
one of the most famous benchmarks, called

0:30:31.315 --> 0:30:42.306
WinoMT, is exactly that. The second type of evaluation
is about gender preserving.

0:30:42.342 --> 0:30:51.201
So that is exactly what we have seen beforehand.

0:30:51.201 --> 0:31:00.240
If this information is not in the text itself,

0:31:00.320 --> 0:31:01.875
Gender as a speaker.

0:31:02.062 --> 0:31:04.450
And how good does a system do that?

0:31:04.784 --> 0:31:09.675
And we'll see there's, for example, one benchmark
on this.

0:31:09.675 --> 0:31:16.062
For example: For Arabic there is one benchmark
on this, and for audio, because if you now think

0:31:16.062 --> 0:31:16.781
already of the.

0:31:17.157 --> 0:31:25.257
From when we're talking about speech translation,
it might be interesting because in the speech

0:31:25.257 --> 0:31:32.176
signal you should have a better guess on whether
it's a male or a female speaker.

0:31:32.432 --> 0:31:38.928
So but mean current systems, mostly you can
always add, and they will just first transcribe.

0:31:42.562 --> 0:31:45.370
Yes, so how do these benchmarks

0:31:45.305 --> 0:31:51.356
look like? The first one is here.

0:31:51.356 --> 0:32:02.837
There's an occupation test where it looks
like a simple test set because.

0:32:03.023 --> 0:32:10.111
So: I've known either her, him, or a proper
name for a long time.

0:32:10.111 --> 0:32:13.554
My friend works as an occupation.

0:32:13.833 --> 0:32:16.771
So that is like all sentences in that look
like that.
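
NOTE
A minimal sketch of how such a template test set could be generated
automatically. The template follows the sentence pattern quoted above;
the occupation list, the helper names, and the article rule are
placeholders chosen for illustration.
```python
from itertools import product
# Template modelled on the example in the lecture; the occupation list is a placeholder.
TEMPLATE = "I've known {pronoun} for a long time, my friend works as {article} {occupation}."
PRONOUNS = {"her": "female", "him": "male"}
OCCUPATIONS = ["engineer", "nurse", "doctor"]
def article_for(noun):
    """Pick 'a' or 'an' with a simple first-letter rule."""
    return "an" if noun[0] in "aeiou" else "a"
def build_test_set():
    """Return (sentence, expected gender of 'friend') pairs."""
    items = []
    for (pronoun, gender), occ in product(PRONOUNS.items(), OCCUPATIONS):
        sentence = TEMPLATE.format(pronoun=pronoun, article=article_for(occ), occupation=occ)
        items.append((sentence, gender))
    return items
for sentence, gender in build_test_set():
    print(gender, "|", sentence)
```
Each sentence carries its expected gender, so the translation of
"friend" can later be checked against it.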

0:32:17.257 --> 0:32:28.576
So in this case you have the ambiguous
word in here, which is friend.

0:32:28.576 --> 0:32:33.342
So your only checking later is.

0:32:34.934 --> 0:32:46.981
This can be inferred from whether it's her
or him, or if it's a proper name, so

0:32:46.981 --> 0:32:55.013
can you infer it from the name, and then you
can compare.

0:32:55.115 --> 0:33:01.744
So is this because the job description is
nearer to friend.

0:33:01.744 --> 0:33:06.937
Does the system get disturbed by this type
of.

0:33:08.828 --> 0:33:14.753
And there you can then automatically assess
yeah this type.

0:33:14.774 --> 0:33:18.242
Of course, that's what I said at the beginning.

0:33:18.242 --> 0:33:24.876
You shouldn't only rely on that because if
you only rely on it you can easily trick the

0:33:24.876 --> 0:33:25.479
system.

0:33:25.479 --> 0:33:31.887
So one type of sentence is translated, but
of course it can give you very important.

0:33:33.813 --> 0:33:35.309
Any questions? Yeah.

0:33:36.736 --> 0:33:44.553
Much like the evaluation of stereotype, we
want the system to agree with stereotypes because

0:33:44.553 --> 0:33:46.570
it increases precision.

0:33:46.786 --> 0:33:47.979
No, no, no.

0:33:47.979 --> 0:33:53.149
In this case, if we say oh yeah, he is an
engineer.

0:33:53.149 --> 0:34:01.600
From the example, it's probably the most likely
translation, probably in more cases.

0:34:02.702 --> 0:34:08.611
Now there are two things, so yeah, so
there are two ways of evaluating.

0:34:08.611 --> 0:34:15.623
The one thing is in this case he's using that
he's an engineer, but there is conflicting

0:34:15.623 --> 0:34:19.878
information that in this case the engineer
is female.

0:34:20.380 --> 0:34:21.890
So anything was.

0:34:22.342 --> 0:34:29.281
Information yes, so that is the one in the
other case.

0:34:29.281 --> 0:34:38.744
Typically it's not evaluated in that, but
in that time you really want it.

0:34:38.898 --> 0:34:52.732
That's why most of those cases you have evaluated
in scenarios where you have context information.

0:34:53.453 --> 0:34:58.878
How to deal with the other thing is even more
challenging to one case where it is the case

0:34:58.878 --> 0:35:04.243
is what I said before is when it's about the
speaker so that the speech translation test.

0:35:04.584 --> 0:35:17.305
And there they try to look in a way that can
you use, so use the audio also as input.

0:35:18.678 --> 0:35:20.432
Yeah.

0:35:20.640 --> 0:35:30.660
So if we have a reference where she is an
engineer okay, are there efforts to adjust

0:35:30.660 --> 0:35:37.497
the metric so that our translations go into
the correct?

0:35:37.497 --> 0:35:38.676
We don't.

0:35:38.618 --> 0:35:40.389
Only done for mean this is evaluation.

0:35:40.389 --> 0:35:42.387
You are not pushing the model for anything.

0:35:43.023 --> 0:35:53.458
But if you want to do it in training, that
you're not doing it this way.

0:35:53.458 --> 0:35:58.461
I'm not aware of any direct model.

0:35:58.638 --> 0:36:04.146
Because you have to find out, is it known
in this scenario or not?

0:36:05.725 --> 0:36:12.622
So at least I'm not aware of there's like
the directive doing training try to assess

0:36:12.622 --> 0:36:13.514
more than.

0:36:13.813 --> 0:36:18.518
I mean, there is data augmentation in the way
that is done.

0:36:18.518 --> 0:36:23.966
I think we'll have that later, so what you can
do is generate more.

0:36:24.144 --> 0:36:35.355
You can do that automatically or there's ways
of biasing so that you can try to make your

0:36:35.355 --> 0:36:36.600
training.

0:36:36.957 --> 0:36:46.228
That's typically not done with focusing on
scenarios where you check before or do have

0:36:46.228 --> 0:36:47.614
information.

0:36:49.990 --> 0:36:58.692
Mean, but for everyone it's not clear and
agree with you in this scenario, the normal

0:36:58.692 --> 0:37:01.222
evaluation system where.

0:37:01.341 --> 0:37:07.006
Maybe you could say it shouldn't always do
the same but have a distribution like the training

0:37:07.006 --> 0:37:12.733
data or something like that, because otherwise
we're amplifying, but that current systems can't

0:37:12.733 --> 0:37:15.135
do; current systems can't predict both.

0:37:15.135 --> 0:37:17.413
That's what we saw at the beginning.

0:37:17.413 --> 0:37:20.862
They have this extra interface where they
then propose.

0:37:24.784 --> 0:37:33.896
Another thing is the WinoMT test set, and it started from a challenge set for co-reference
it started from a challenge set for co-reference

0:37:33.896 --> 0:37:35.084
resolution.

0:37:35.084 --> 0:37:43.502
Co-reference resolution means we have her
or him and we need to find out what it's.

0:37:43.823 --> 0:37:53.620
So you have: the doctor asked the nurse to help
her in the procedure, and now her does not

0:37:53.620 --> 0:37:55.847
refer to the nurse.

0:37:56.556 --> 0:38:10.689
And there you of course have the same type
of stereotypes and the same type of biases

0:38:10.689 --> 0:38:15.237
as the machine translation.

0:38:16.316 --> 0:38:25.165
And no think that normally yeah mean maybe
that's also biased.

0:38:27.687 --> 0:38:37.514
No, but if you ask somebody, I guess if you
ask somebody, then I mean syntactically it's

0:38:37.514 --> 0:38:38.728
ambiguous.

0:38:38.918 --> 0:38:50.248
If you ask somebody to help, then the her
has to refer to that.

0:38:50.248 --> 0:38:54.983
So it should also help the.

0:38:56.396 --> 0:38:57.469
Of the time.

0:38:57.469 --> 0:39:03.906
The doctor is female and says please help
me in the procedure, but the other.

0:39:04.904 --> 0:39:09.789
Oh, you mean that it's helping the third person.

0:39:12.192 --> 0:39:16.140
Yeah, agree that it could also be yes.

0:39:16.140 --> 0:39:19.077
Don't know how easy that is.

0:39:19.077 --> 0:39:21.102
Only know the test.

0:39:21.321 --> 0:39:31.820
Then guess yeah, then you need a situation
context where you know the situation, the other

0:39:31.820 --> 0:39:34.589
person having problems.

0:39:36.936 --> 0:39:42.251
Yeah no yeah that is like here when there
is additional ambiguity in there.

0:39:45.465 --> 0:39:48.395
You see that pure text models are not always okay.

0:39:48.395 --> 0:39:51.134
I mean, there is a lot of work also.

0:39:52.472 --> 0:40:00.119
We will not cover that in the lecture, but there
are things like multimodal machine translation

0:40:00.119 --> 0:40:07.109
where you try to add pictures or something
like that to have more context, and then.

0:40:10.370 --> 0:40:23.498
Yeah, it starts with this, so in order to
evaluate that, what you do is that you translate

0:40:23.498 --> 0:40:25.229
with the system.

0:40:25.305 --> 0:40:32.310
It's doing stereotyping so the doctor is male
and the nurse is female.

0:40:32.492 --> 0:40:42.362
And then you're using word alignment, and
then you check whether this gender matches

0:40:42.362 --> 0:40:52.345
the annotated gender there, and that is
how you evaluate in this type of WinoMT.

0:40:52.832 --> 0:40:59.475
I mean, as you see, you're only focusing on
the situation where you can or where the gender

0:40:59.475 --> 0:41:00.214
is known.

0:41:00.214 --> 0:41:06.930
While for this one you don't do any evaluation,
because the nurse can in that case be both

0:41:06.930 --> 0:41:08.702
and you cannot assess it.

0:41:08.728 --> 0:41:19.112
The benchmarks are at the moment designed
in a way that you only evaluate things that

0:41:19.112 --> 0:41:20.440
are known.

0:41:23.243 --> 0:41:25.081
Then yeah, you can have a look.

0:41:25.081 --> 0:41:28.931
For example, here what people are looking
at is, you can do the first:

0:41:28.931 --> 0:41:32.149
the overall accuracy, how often does it do
it correctly?

0:41:32.552 --> 0:41:41.551
And there you see these numbers are a bit
older.

0:41:41.551 --> 0:41:51.835
There's more work on that, but this is the
first column.

0:41:51.731 --> 0:42:01.311
Because they do it like in this test, they
do it twice, one with him and one with her.

0:42:01.311 --> 0:42:04.834
So the chance is fifty percent.

0:42:05.065 --> 0:42:12.097
Except somehow here, the one system seems
to be quite good there that everything.

0:42:13.433 --> 0:42:30.863
What you can also do is look at the difference
between where you need to predict male and where you need to predict female.

0:42:30.850 --> 0:42:40.338
It's more often correct on the male forms
than on the female forms, and you see that

0:42:40.338 --> 0:42:43.575
it's except for this system.

0:42:43.603 --> 0:42:53.507
So I would assume that they maybe in this one
language did some type of method in there.

0:42:55.515 --> 0:42:57.586
If you are more often mean there is like.

0:42:58.178 --> 0:43:01.764
It's not a lot lower, there's one.

0:43:01.764 --> 0:43:08.938
I don't know why, but if you're always to
the same then it should be.

0:43:08.938 --> 0:43:14.677
You seem to be counter intuitive, so maybe
it's better.

0:43:15.175 --> 0:43:18.629
Don't know exactly how yes, but it's, it's
true.

0:43:19.019 --> 0:43:20.849
I mean, there are very few cases.

0:43:20.849 --> 0:43:22.740
I also don't know for Russian.

0:43:22.740 --> 0:43:27.559
I mean, there is, I think, mainly for Russian
where you have very low numbers.

0:43:27.559 --> 0:43:30.183
I mean, I would say like forty five or so.

0:43:30.183 --> 0:43:32.989
That can be more about random sampling.

0:43:32.989 --> 0:43:37.321
I don't know if they have even more gender
or if they have a new tool.

0:43:37.321 --> 0:43:38.419
I don't think so.

0:43:40.040 --> 0:43:46.901
Then you typically have an even stronger bias
here, where you don't differentiate between

0:43:46.901 --> 0:43:53.185
how often it is correct for male and female,
but you are distinguishing between the.

0:43:53.553 --> 0:44:00.503
So here you can check for each
occupation which gender is the most

0:44:00.440 --> 0:44:06.182
common one based on statistics, and then
you take that on the one side and the

0:44:06.182 --> 0:44:12.188
anti-stereotypical one on the other side, and you
see that not in all cases but in a lot of cases

0:44:12.188 --> 0:44:16.081
the differences are even higher than
on the other.
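
NOTE
A minimal sketch of the three WinoMT-style numbers discussed above: overall
accuracy, the male/female gap, and the gap between stereotypical and
anti-stereotypical cases. The record layout (keys "gold", "pred",
"stereotypical") and the function names are assumptions for illustration,
not the format of the actual benchmark.
```python
def accuracy(records):
    """Share of records whose predicted gender equals the annotated one."""
    if not records:
        return 0.0
    return sum(r["pred"] == r["gold"] for r in records) / len(records)
#
def winomt_scores(records):
    """Return overall accuracy plus the two gaps described in the lecture."""
    male = [r for r in records if r["gold"] == "male"]
    female = [r for r in records if r["gold"] == "female"]
    pro = [r for r in records if r["stereotypical"]]
    anti = [r for r in records if not r["stereotypical"]]
    return {
        "acc": accuracy(records),
        # positive value: the system is more often right on male forms
        "delta_g": accuracy(male) - accuracy(female),
        # positive value: the system is more often right on stereotypical roles
        "delta_s": accuracy(pro) - accuracy(anti),
    }
#
# Toy data, invented for this sketch.
records = [
    {"gold": "male", "pred": "male", "stereotypical": True},
    {"gold": "female", "pred": "male", "stereotypical": False},
    {"gold": "female", "pred": "female", "stereotypical": True},
    {"gold": "male", "pred": "male", "stereotypical": False},
]
scores = winomt_scores(records)
```
In a real evaluation the "pred" field would come from word alignment plus a
morphological analysis of the translated occupation noun.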

0:44:21.061 --> 0:44:24.595
Ah, I'm telling you there's something.

0:44:28.668 --> 0:44:32.850
But it has to be for a doctor.

0:44:32.850 --> 0:44:39.594
For example, for a doctor there three don't
know.

0:44:40.780 --> 0:44:44.275
Yeah, but I guess here it's mainly in the
job description.

0:44:44.275 --> 0:44:45.104
So yeah, but.

0:44:50.050 --> 0:45:01.145
And then there is the Arabic Parallel Gender
Corpus, where it is more about assessing how

0:45:01.145 --> 0:45:03.289
strong a singer.

0:45:03.483 --> 0:45:09.445
How that is done is with the OpenSubtitles
corpus.

0:45:09.445 --> 0:45:18.687
That is a corpus of subtitles generated
by volunteers.

0:45:18.558 --> 0:45:23.426
For words like I, me, myself,

0:45:23.303 --> 0:45:30.670
and mine, they then annotated the Arabic
sentences: whether the "I" here refers to a female

0:45:30.670 --> 0:45:38.198
or a male speaker, or whether it's ambiguous, and
then from the male and female ones they generate

0:45:38.198 --> 0:45:40.040
types of translations.

0:45:43.703 --> 0:45:51.921
And then a slightly different test set, as
the last one, is what is referred to as the MuST-SHE

0:45:52.172 --> 0:45:57.926
corpus, which is based on TED lectures.

0:45:57.926 --> 0:46:05.462
In general, this lecture is very important
because it.

0:46:05.765 --> 0:46:22.293
And here it is also interesting because you also
have the audio signal and it's done in the

0:46:22.293 --> 0:46:23.564
worst.

0:46:23.763 --> 0:46:27.740
The first case is where it can only be
determined based on the speaker.

0:46:27.968 --> 0:46:30.293
So something like "I am a good speaker."

0:46:30.430 --> 0:46:32.377
You cannot do that correctly.

0:46:32.652 --> 0:46:36.970
However, if you would have the audio signal
you should have a much better guess.

0:46:37.257 --> 0:46:47.812
So it was used to evaluate especially machine
translation and speech translation systems

0:46:47.812 --> 0:46:53.335
which take this into account or, of course.

0:46:57.697 --> 0:47:04.265
The second thing is where you can do it based
on the context.

0:47:04.265 --> 0:47:08.714
In this case we are not using artificial.

0:47:11.011 --> 0:47:15.550
Copied from the real data, so it's
not like artificially created data, but.

0:47:15.815 --> 0:47:20.939
Of course, it's a lot more work: you have to
somehow find these in the corpus and use them

0:47:20.939 --> 0:47:21.579
as a test.

0:47:21.601 --> 0:47:27.594
It is something like "she got together with two of
her dearest friends, these older women", and

0:47:27.594 --> 0:47:34.152
then, of course, here the gender of "friends" we can get from
the context, but it might be that some systems

0:47:34.152 --> 0:47:36.126
ignore that that should be.

0:47:36.256 --> 0:47:43.434
So you have two test sets in there, two types
of benchmarks, and you want to determine which

0:47:43.434 --> 0:47:43.820
one.

0:47:47.787 --> 0:47:55.801
Yes, this is how we can evaluate it, so the
next question is how can we improve our systems

0:47:55.801 --> 0:48:03.728
because that's normally how we do evaluation
and why we do evaluation so before we go into

0:48:03.728 --> 0:48:04.251
that?

0:48:08.508 --> 0:48:22.685
One idea is to do what is referred to as modeling,
so the idea is somehow change the model in

0:48:22.685 --> 0:48:24.495
a way that.

0:48:24.965 --> 0:48:38.271
And yes, one idea is, of course, if we are
giving it more information, the system doesn't

0:48:38.271 --> 0:48:44.850
need to do a guess without this information.

0:48:44.724 --> 0:48:47.253
In order to disambiguate the gender.

0:48:47.707 --> 0:48:59.746
The first thing is you can do that on the
sentence level, for example, especially if

0:48:59.746 --> 0:49:03.004
you have the speakers.

0:49:03.063 --> 0:49:12.518
You can annotate the sentence with whether
the speaker is male or female, and then you

0:49:12.518 --> 0:49:25.998
can: Here we're seeing one thing which is very
successful in neural machine translation and

0:49:25.998 --> 0:49:30.759
other kinds of neural networks.

0:49:31.711 --> 0:49:39.546
However, in neural machine translation, since
we have no longer the strong correlation between

0:49:39.546 --> 0:49:47.043
input and output, the nice thing is you can
normally put everything into your input, and

0:49:47.043 --> 0:49:50.834
if you have enough data, it's well balanced.

0:49:51.151 --> 0:50:00.608
So how you can do it here is you can add the
token here saying female or male if the speaker

0:50:00.608 --> 0:50:01.523
is male.

0:50:01.881 --> 0:50:07.195
So, of course, this is no longer for human
correct translation.

0:50:07.195 --> 0:50:09.852
It's like female Madam because.

0:50:10.090 --> 0:50:22.951
If you are doing the same thing then the translation
would not be to translate "female", but it can use

0:50:22.951 --> 0:50:25.576
it to disambiguate.

0:50:25.865 --> 0:50:43.573
And so this type of tagging is a very commonly
used method in order to add more information.
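
NOTE
A minimal sketch of sentence-level tagging as described above: a pseudo-token
carrying the speaker gender is prepended to the source sentence, and the
model itself is left unchanged. The tag strings and the function name are
assumptions chosen for illustration; any token that does not collide with
normal vocabulary would do.
```python
def add_gender_tag(source, speaker_gender=None):
    """Prepend a speaker-gender pseudo-token; leave the sentence as is if unknown."""
    tags = {"female": "<female>", "male": "<male>"}
    tag = tags.get(speaker_gender)
    return f"{tag} {source}" if tag else source
#
tagged = add_gender_tag("I am happy", "female")
untagged = add_gender_tag("I am happy")
```
The same pattern is what the lecture mentions for formality or domain: only
the tag vocabulary changes.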

0:50:47.107 --> 0:50:54.047
So this is first of all a very good thing,
a very easy one.

0:50:54.047 --> 0:50:57.633
You don't have to change your.

0:50:58.018 --> 0:51:04.581
For example, it has also been done if you think
about formality in German.

0:51:04.581 --> 0:51:11.393
Whether you have to produce or, you can: We'll
see it on Thursday.

0:51:11.393 --> 0:51:19.628
It's a very common approach for domains, so
you put in the domain beforehand.

0:51:19.628 --> 0:51:24.589
This is from Twitter or something like that.

0:51:24.904 --> 0:51:36.239
Of course, it only learns it if it has seen
it and it dees them out, but in this case you

0:51:36.239 --> 0:51:38.884
don't need an equal.

0:51:39.159 --> 0:51:42.593
However, it's still challenging to
have this information available.

0:51:42.983 --> 0:51:55.300
If you would do that on the first of all,
of course, it only works if you really have

0:51:55.300 --> 0:52:02.605
data from speaking because otherwise it's unclear.

0:52:02.642 --> 0:52:09.816
You would only have the text and you would
not easily see whether it is the male or the

0:52:09.816 --> 0:52:14.895
female speaker because this information has
been removed from.

0:52:16.456 --> 0:52:18.745
Does anybody have an idea of how we
could

0:52:20.000 --> 0:52:25.480
manage that and still get the data of whether
it's a male or a female speaking?

0:52:32.152 --> 0:52:34.270
We can do a small trick.

0:52:34.270 --> 0:52:37.834
We can just look on the target side.

0:52:37.937 --> 0:52:43.573
I mean, this is, of course, only important if
in the target side this is the case.

0:52:44.004 --> 0:52:50.882
So for your training data you can annotate
it based on your target side; in German you

0:52:50.882 --> 0:52:51.362
know.

0:52:51.362 --> 0:52:58.400
In German you don't know but in Spanish for
example you know because different and then

0:52:58.400 --> 0:53:00.400
you can use grammatical.

0:53:00.700 --> 0:53:10.964
Of course, at test time you would still need to
make that decision through the user interface.

0:53:13.954 --> 0:53:18.829
And you can, of course, do it in an even more advanced way.

0:53:18.898 --> 0:53:30.659
You can even try to add this information
to each word, so you're not doing it for the

0:53:30.659 --> 0:53:32.687
full sentence.

0:53:32.572 --> 0:53:42.129
If it's unknown, if it's female or if it's
male, you know word alignment so you can't

0:53:42.129 --> 0:53:42.573
do.

0:53:42.502 --> 0:53:55.919
Here then you can do a word alignment, which
is of course not always perfect, but roughly

0:53:55.919 --> 0:53:59.348
then you can annotate.

0:54:01.401 --> 0:54:14.165
Now you have this type of input where you
have one information per word, but on the one

0:54:14.165 --> 0:54:16.718
end you have the.

0:54:17.517 --> 0:54:26.019
This has been used before in other scenarios,
so you might not put in the gender, but in

0:54:26.019 --> 0:54:29.745
general this can be other information.

0:54:30.090 --> 0:54:39.981
And people refer to that or have used that
as a factored translation model, so what you

0:54:39.981 --> 0:54:42.454
may do is you factor.

0:54:42.742 --> 0:54:45.612
You have the word itself.

0:54:45.612 --> 0:54:48.591
You might have the gender.

0:54:48.591 --> 0:54:55.986
You could have more information like, I don't
know, the part of speech.

0:54:56.316 --> 0:54:58.564
And then you have an embedding for each of
them.

0:54:59.199 --> 0:55:03.599
And you concatenate them, and then you have
this concatenated embedding.

0:55:03.563 --> 0:55:09.947
Which says okay, this is a female plumber
or a male plumber or so on.

0:55:09.947 --> 0:55:18.064
This has additional information and then you
can train this factored model where you have

0:55:18.064 --> 0:55:22.533
the ability to give the model extra information.
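
NOTE
A minimal sketch of the factored input described above: one embedding per
factor (word, gender), concatenated into a single input vector. The table
sizes, the random initialisation and all names are assumptions for
illustration; in a real system both tables are trained with the model.
```python
import random
#
def make_table(vocab, dim, rng):
    """Build a toy embedding table: one random vector of length dim per entry."""
    return {w: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for w in vocab}
#
rng = random.Random(0)
word_emb = make_table(["plumber", "doctor"], 4, rng)
gender_emb = make_table(["female", "male", "unknown"], 2, rng)
#
def factored_embedding(word, gender):
    """Concatenate the word vector and the gender-factor vector."""
    return word_emb[word] + gender_emb[gender]
#
vec = factored_embedding("plumber", "female")
```
The concatenated vector has length 4 + 2 here; further factors such as part
of speech would simply add another table and another slice.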

0:55:23.263 --> 0:55:35.702
And of course now if you are training this
way directly you always need to have this information.

0:55:36.576 --> 0:55:45.396
So that might not be the best way if you want
to use a translation system and sometimes don't

0:55:45.396 --> 0:55:45.959
have.

0:55:46.866 --> 0:55:57.987
So any idea of how you can train it or what
machine learning technique you can use to deal

0:55:57.987 --> 0:55:58.720
with.

0:56:03.263 --> 0:56:07.475
Mainly despite it already, many of your things.

0:56:14.154 --> 0:56:21.521
Dropout: you sometimes put the information
in there, and so you can apply dropout to the inputs.

0:56:21.861 --> 0:56:27.599
You sometimes put this information in there,
sometimes not, and the system is then able

0:56:27.599 --> 0:56:28.874
to deal with those.

0:56:28.874 --> 0:56:34.803
If it doesn't have the information, it's doing
some of the best it can do, but if it has the

0:56:34.803 --> 0:56:39.202
information, it can use the information and
maybe do a more rounded.
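
NOTE
A minimal sketch of the input dropout idea above, applied to a sentence-level
gender tag: during training the tag is randomly withheld, so the model also
learns to translate when the information is missing. The function name, the
tag format and the drop probability are assumptions for illustration.
```python
import random
#
def tag_with_dropout(source, gender, p_drop, rng):
    """Prepend the gender tag, but withhold it with probability p_drop."""
    if gender is None or rng.random() < p_drop:
        return source
    return f"<{gender}> {source}"
#
rng = random.Random(1)
always_tagged = tag_with_dropout("I am happy", "female", 0.0, rng)
never_tagged = tag_with_dropout("I am happy", "female", 1.0, rng)
```
At test time the tag is added whenever the gender is known and left out
otherwise, which matches both conditions seen in training.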

0:56:46.766 --> 0:56:52.831
So then there are, of course, more ways to
try to get a less biased model.

0:56:52.993 --> 0:57:01.690
One we only want to mention here, because
you'll have a full lecture on that next week,

0:57:01.690 --> 0:57:08.188
and that is what is referred to as context-based
machine translation.

0:57:08.728 --> 0:57:10.397
Good, and in this other ones, but.

0:57:10.750 --> 0:57:16.830
If you translate several sentences well, of
course, there are more situations where you

0:57:16.830 --> 0:57:17.866
can disambiguate.

0:57:18.118 --> 0:57:23.996
Because it might be that the information is
not in the current sentence, but it's in the

0:57:23.996 --> 0:57:25.911
previous sentence or before.

0:57:26.967 --> 0:57:33.124
With the speaker, I mean, maybe
not, but if it's referring to something, you can use

0:57:33.124 --> 0:57:33.963
co-references.

0:57:34.394 --> 0:57:40.185
They are often referring to things in the
previous sentence so you can use them in order

0:57:40.185 --> 0:57:44.068
to: And that can be done basically very
easily.

0:57:44.068 --> 0:57:47.437
You'll see more advanced options, but the
main.

0:57:48.108 --> 0:57:58.516
I mean, neural machine translation is a
sequence-to-sequence model, which can use any input

0:57:58.516 --> 0:58:02.993
sequence to output sequence mapping.

0:58:02.993 --> 0:58:04.325
So now at.

0:58:04.484 --> 0:58:11.281
So then you can do, for example, five to five
translations, or also five to one, or so there's.
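
NOTE
A minimal sketch of the simple context approach above: the current sentence
is concatenated with a few preceding sentences and the whole string is given
to the sequence-to-sequence model. The separator token and the function name
are assumptions for illustration.
```python
def with_context(sentences, index, n_context, sep="<sep>"):
    """Join sentence number index with up to n_context preceding sentences."""
    start = max(0, index - n_context)
    return f" {sep} ".join(sentences[start:index + 1])
#
doc = ["Mary is a doctor.", "She works a lot.", "I like her."]
two_to_one_input = with_context(doc, 2, 1)
first_sentence_input = with_context(doc, 0, 2)
```
With n_context = 4 this gives the five-sentence input mentioned above; whether
the output is five sentences or only the last one is a training-data choice.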

0:58:11.811 --> 0:58:19.211
This is not a method only dedicated to
debiasing, of course, but the hope is.

0:58:19.139 --> 0:58:25.534
If you're using this, because, I mean, bias often,
we have seen, arises in situations where

0:58:25.534 --> 0:58:27.756
we're not having enough context.

0:58:27.756 --> 0:58:32.940
So the idea is if we generally increase our
context, it will also help this.

0:58:32.932 --> 0:58:42.378
Of course, it will help other situations where
you need context to disambiguate.

0:58:43.603 --> 0:58:45.768
If you're saying "I'm going to the
bank",

0:58:46.286 --> 0:58:54.761
it's not directly clear from this sentence
whether it's the financial institution or the bank

0:58:54.761 --> 0:58:59.093
for sitting, but maybe if you say afterward,

0:59:02.322 --> 0:59:11.258
And then there is in general a very large
amount of work on debiasing the word embeddings.

0:59:11.258 --> 0:59:20.097
So the one I hear like, I mean, I think that
partly comes from the fact that like a first.

0:59:21.041 --> 0:59:26.925
Or that first research was done often on inspecting
the word embeddings and seeing whether they

0:59:26.925 --> 0:59:32.503
are biased or not, and people found out that
there is some bias in there, and then the idea

0:59:32.503 --> 0:59:38.326
is: oh, if you remove it from the word embeddings
already, then maybe your system later will

0:59:38.326 --> 0:59:39.981
not have that strong of a.

0:59:40.520 --> 0:59:44.825
So how can that work?

0:59:44.825 --> 0:59:56.369
Or maybe first, how do word embeddings
encode bias in there?

0:59:56.369 --> 0:59:57.152
So.

0:59:57.137 --> 1:00:05.555
So you can look at the word embedding, and
then you can compare the distance of the word

1:00:05.555 --> 1:00:11.053
compared: And there are interesting findings.

1:00:11.053 --> 1:00:18.284
For example, you have the difference in occupation
and how similar.

1:00:18.678 --> 1:00:33.068
And of course it's not a perfect correlation,
but you see some type of correlation: jobs

1:00:33.068 --> 1:00:37.919
which have a high occupation.

1:00:37.797 --> 1:00:41.387
They also are more similar to the word what
we're going to be talking about.

1:00:43.023 --> 1:00:50.682
Maybe a secretary is also a bit difficult,
but because yeah maybe it's more often.

1:00:50.610 --> 1:00:52.438
done in general by women.

1:00:52.438 --> 1:00:58.237
However, there is a secretary like the Secretary
of State or so, the German minister, which

1:00:58.237 --> 1:01:03.406
I of course know that many so in the statistics
they are not counting that often.

1:01:03.543 --> 1:01:11.576
But in data they of course occur quite often,
so there's different ways of different meanings.

1:01:14.154 --> 1:01:23.307
So how can you now try to remove this type
of bias?

1:01:23.307 --> 1:01:32.988
One way is the idea of hard debiasing of
embeddings.

1:01:33.113 --> 1:01:39.354
So if you remember, on word embeddings I think
we had this image that you can take the difference

1:01:39.354 --> 1:01:44.931
between man and woman and add this difference
to king, and then you arrive at queen.

1:01:45.865 --> 1:01:57.886
So here's the idea we want to remove this
gender information from some things which should

1:01:57.886 --> 1:02:00.132
not have gender.

1:02:00.120 --> 1:02:01.386
The word engineer.

1:02:01.386 --> 1:02:06.853
There is no information about the gender in
that, so you should remove this type.

1:02:07.347 --> 1:02:16.772
Of course, you first need to find out where
this information is and you can.

1:02:17.037 --> 1:02:23.603
However, normally if you do the difference
like the subspace by only one example, it's

1:02:23.603 --> 1:02:24.659
not the best.

1:02:24.924 --> 1:02:31.446
So you can do the same thing for things like
brother and sister, mom and dad, and then you

1:02:31.446 --> 1:02:38.398
can somehow take the average of these differences,
saying this is a vector which maps from the male

1:02:38.398 --> 1:02:39.831
to the female form.

1:02:40.660 --> 1:02:50.455
And then you can try to neutralize this gender
information on this dimension.

1:02:50.490 --> 1:02:57.951
You can find it's subspace or dimensional.

1:02:57.951 --> 1:03:08.882
It would be a line, but now this is dimensional,
and then you.

1:03:08.728 --> 1:03:13.104
Representation: Where you remove this type
of information.
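
NOTE
A minimal sketch of the neutralisation step described above, in the
simplified form given in the lecture: the gender direction is the normalised
average of a few male/female difference vectors, and a word such as
"engineer" is made neutral by removing its projection onto that direction.
The three-dimensional toy vectors and all names are assumptions for
illustration, not real embeddings.
```python
import math
#
def gender_direction(pairs, emb):
    """Average the (male - female) difference vectors and normalise the result."""
    diffs = [[a - b for a, b in zip(emb[m], emb[f])] for m, f in pairs]
    mean = [sum(col) / len(diffs) for col in zip(*diffs)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]
#
def neutralize(vec, direction):
    """Remove the component of vec that lies along the gender direction."""
    proj = sum(v * d for v, d in zip(vec, direction))
    return [v - proj * d for v, d in zip(vec, direction)]
#
# Toy embeddings, invented for this sketch.
emb = {
    "man": [1.0, 0.0, 0.2], "woman": [-1.0, 0.0, 0.2],
    "brother": [0.8, 0.5, 0.0], "sister": [-0.8, 0.5, 0.0],
    "engineer": [0.4, 0.3, 0.9],
}
g = gender_direction([("man", "woman"), ("brother", "sister")], emb)
engineer_neutral = neutralize(emb["engineer"], g)
```
After this step the neutralised vector has no component left along the
estimated gender direction, while the other dimensions are untouched.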

1:03:15.595 --> 1:03:18.178
This is, of course, quite strong of the questions.

1:03:18.178 --> 1:03:19.090
How good does it?

1:03:19.090 --> 1:03:20.711
Thanks tell them for one other.

1:03:20.880 --> 1:03:28.256
But the idea is that after learning,
before we are using the word embeddings for

1:03:28.256 --> 1:03:29.940
machine translation.

1:03:29.940 --> 1:03:37.315
We are trying to remove the gender information
from the jobs and then have a representation

1:03:37.315 --> 1:03:38.678
which hopefully.

1:03:40.240 --> 1:03:45.047
A similar idea is the one of gender-neutral
GloVe.

1:03:45.047 --> 1:03:50.248
GloVe is another technique to learn word embeddings.

1:03:50.750 --> 1:03:52.870
I think we discussed one briefly.

1:03:52.870 --> 1:03:56.182
It was word2vec, which was one of the first
ones.

1:03:56.456 --> 1:04:04.383
But there are of course other methods for how
you can train word embeddings, and GloVe is

1:04:04.383 --> 1:04:04.849
one.

1:04:04.849 --> 1:04:07.460
The idea is we're training.

1:04:07.747 --> 1:04:19.007
At least this is somehow a bit separated,
so where you have part of the vector is gender

1:04:19.007 --> 1:04:20.146
neutral.

1:04:20.300 --> 1:04:29.247
What you need for that is three sets of words,
so you have male words and you have words.

1:04:29.769 --> 1:04:39.071
And then you're trying to learn some type
of vector where some dimensions are not.

1:04:39.179 --> 1:04:51.997
So the idea is you can learn a representation
where you at least know that this part is gender

1:04:51.997 --> 1:04:56.123
neutral and the other part.

1:05:00.760 --> 1:05:03.793
How can we do that?

1:05:03.793 --> 1:05:12.435
How can we change the system to learn anything
specific?

1:05:12.435 --> 1:05:20.472
Nearly in all cases this works by the loss
function.

1:05:20.520 --> 1:05:26.206
And that is more a general approach in machine
translation.

1:05:26.206 --> 1:05:30.565
The general loss function is we are learning.

1:05:31.111 --> 1:05:33.842
Here is the same idea.

1:05:33.842 --> 1:05:44.412
You have the general loss function in order
to learn good embeddings and then you try to

1:05:44.412 --> 1:05:48.687
introduce an additional loss function.

1:05:48.969 --> 1:05:58.213
Yes, I think yes, yes, that's the solution,
and how do you make sure that, if in my training

1:05:58.213 --> 1:06:07.149
data all nurses are female, how do you make sure
that the algorithm puts it into neutral?

1:06:07.747 --> 1:06:12.448
And you need, so this is like for only the
first learning of word embeddings.

1:06:12.448 --> 1:06:18.053
Then the idea is if you have word embeddings
where the gender is separate and then you train

1:06:18.053 --> 1:06:23.718
on top of that machine translation where you
don't change the embeddings, it should hopefully

1:06:23.718 --> 1:06:25.225
be less and less biased.

1:06:25.865 --> 1:06:33.465
And in order to train that yes you need additional
information, so this information needs to be

1:06:33.465 --> 1:06:40.904
hand-defined and it can't be general, so
you need to have a list of these are male persons

1:06:40.904 --> 1:06:44.744
or males these are nouns for females and these.

1:06:49.429 --> 1:06:52.575
So the first step, of course, we still want
to have good word embeddings.

1:06:54.314 --> 1:07:04.100
So you have the normal objective function
of the word embedding.

1:07:04.100 --> 1:07:09.519
It's something like the similarity.

1:07:09.849 --> 1:07:19.751
How it's exactly derived is not that important
because we're not interested in GloVe itself,

1:07:19.751 --> 1:07:23.195
but you have any loss function.

1:07:23.195 --> 1:07:26.854
Of course, you have to keep that.

1:07:27.167 --> 1:07:37.481
And then there are three more loss functions
that you can add: So the one is you take the

1:07:37.481 --> 1:07:51.341
average value of all the male words and the
average word embedding of all the female words.

1:07:51.731 --> 1:08:00.066
So the good thing about this is we don't always
need to have for one word the male and the

1:08:00.066 --> 1:08:05.837
female version, so it's only like we have a
set of male words.

1:08:06.946 --> 1:08:21.719
So this is just saying yeah, we want these
two should be somehow similar to each other.

1:08:21.719 --> 1:08:25.413
It shouldn't be that.

1:08:30.330 --> 1:08:40.081
Should be the other one, or think this should
be it.

1:08:40.081 --> 1:08:45.969
This is agenda, the average of.

1:08:45.945 --> 1:09:01.206
The average should be the same, but if you're
looking at the female should be at the other.

1:09:01.681 --> 1:09:06.959
This is like on these dimensions, the male
should be on the one and the female on the

1:09:06.959 --> 1:09:07.388
other.

1:09:07.627 --> 1:09:16.123
The same yeah, this gender information should
be there, so you're pushing all the males to

1:09:16.123 --> 1:09:17.150
the other.

1:09:21.541 --> 1:09:23.680
Then their words should be.

1:09:23.680 --> 1:09:30.403
If you have that you see the neutral words,
they should be in the middle of between the

1:09:30.403 --> 1:09:32.008
male and the female.

1:09:32.012 --> 1:09:48.261
So you say is the middle point between all
male and female words and just somehow putting

1:09:48.261 --> 1:09:51.691
the neutral words.
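
NOTE
A rough sketch of adding auxiliary terms to a base embedding loss, following
the description above: on a dedicated gender dimension, the male and female
word sets are pushed apart, and neutral words are pulled towards the midpoint
between the two group averages. This is a simplified reading of the lecture,
not the exact gender-neutral GloVe objective; the word lists, the weights and
all names are assumptions for illustration.
```python
def mean(values):
    return sum(values) / len(values)
#
def auxiliary_losses(gender_dim, male, female, neutral):
    """Return (separation, neutrality) computed on the gender dimension only."""
    mu_m = mean([gender_dim[w] for w in male])
    mu_f = mean([gender_dim[w] for w in female])
    # Lower is better: a large gap between the group averages is rewarded.
    separation = -abs(mu_m - mu_f)
    midpoint = (mu_m + mu_f) / 2
    # Lower is better: neutral words should sit at the midpoint.
    neutrality = mean([(gender_dim[w] - midpoint) ** 2 for w in neutral])
    return separation, neutrality
#
def total_loss(base_loss, gender_dim, male, female, neutral, lam_sep=1.0, lam_neu=1.0):
    """Base embedding loss plus the two weighted auxiliary terms."""
    separation, neutrality = auxiliary_losses(gender_dim, male, female, neutral)
    return base_loss + lam_sep * separation + lam_neu * neutrality
#
# Toy values of the gender dimension, invented for this sketch.
gender_dim = {"he": 1.0, "father": 1.0, "she": -1.0, "mother": -1.0, "doctor": 0.5}
loss = total_loss(3.0, gender_dim, ["he", "father"], ["she", "mother"], ["doctor"])
```
Note that only set averages are used, so the male and female lists do not
need to contain matching pairs, as pointed out in the lecture.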

1:09:52.912 --> 1:09:56.563
And then you're learning them, and then you
can apply them in different ways.

1:09:57.057 --> 1:10:03.458
So you have this a bit in the pre-training
thing.

1:10:03.458 --> 1:10:10.372
You can use the pre-trained embeddings on
the output.

1:10:10.372 --> 1:10:23.117
Or you can use: And then you can analyze
what happens instead of training them directly.

1:10:23.117 --> 1:10:30.504
If you have this additional loss, which tries
to optimize.

1:10:32.432 --> 1:10:42.453
And then it was evaluated exactly on the sentences
we had at the beginning, where it is about "I've known

1:10:42.453 --> 1:10:44.600
her for a long time,

1:10:44.600 --> 1:10:48.690
my friend works as an accounting clerk."

1:10:48.788 --> 1:10:58.049
So all these examples are not very difficult
to translate, but the question is how often

1:10:58.049 --> 1:10:58.660
does?

1:11:01.621 --> 1:11:06.028
That it's not that complicated as you see
here, so even the baseline.

1:11:06.366 --> 1:11:10.772
If you're doing nothing is working quite well,
it's most challenging.

1:11:10.772 --> 1:11:16.436
It seems overall in the situation where it's
a name, so for he and him it has learned the

1:11:16.436 --> 1:11:22.290
correlation, which is maybe not surprising
because this correlation occurs more often

1:11:22.290 --> 1:11:23.926
than with any name there.

1:11:24.044 --> 1:11:31.749
If you have a name that you can extract, that
is talking about Mary, that's female is a lot

1:11:31.749 --> 1:11:34.177
harder to extract than this.

1:11:34.594 --> 1:11:40.495
So you'll see already in the baseline this
is yeah, not working, not working.

1:11:43.403 --> 1:11:47.159
And for all the other cases it's working very
well.

1:11:47.787 --> 1:11:53.921
Overall the best one is achieved here with
hard debiasing both on the encoder, on the.

1:11:57.077 --> 1:12:09.044
It makes sense that hard debiasing on the
decoder doesn't really work because there you

1:12:09.044 --> 1:12:12.406
have gender information.

1:12:14.034 --> 1:12:17.406
For GloVe it seems to already work here.

1:12:17.406 --> 1:12:20.202
That's maybe surprising and yeah.

1:12:20.260 --> 1:12:28.263
So there is no clear else we don't have numbers
for that doesn't really work well on the other.

1:12:28.263 --> 1:12:30.513
So how much do I use then?

1:12:33.693 --> 1:12:44.720
Then as a last way of improving that is a
bit what we had mentioned before.

1:12:44.720 --> 1:12:48.493
That is what is referred.

1:12:48.488 --> 1:12:59.133
One problem is the bias in the data so you
can adapt your data so you can just try to

1:12:59.133 --> 1:13:01.485
find equal amount.

1:13:01.561 --> 1:13:11.368
In your data, like you adapt your data and
then you fine-tune on the smaller data, but

1:13:11.368 --> 1:13:12.868
you can try.

1:13:18.298 --> 1:13:19.345
This is line okay.

1:13:19.345 --> 1:13:21.605
We have access to the data to the model.

1:13:21.605 --> 1:13:23.038
We can improve our model.

1:13:24.564 --> 1:13:31.328
One situation we haven't talked a lot about
but another situation might also be and that's

1:13:31.328 --> 1:13:37.942
even getting more important is oh you want
to work with a model which you don't have but

1:13:37.942 --> 1:13:42.476
you want to improve the model without having
access so when.

1:13:42.862 --> 1:13:49.232
Nowadays there are a lot of companies who
are not developing their own system but they're

1:13:49.232 --> 1:13:52.983
using or something like that or machine translation.

1:13:53.313 --> 1:13:59.853
So there is the issue that you might not be
able to fine-tune these models completely.

1:14:00.080 --> 1:14:09.049
So the question is, can you do some type of
black box adaptation of a system that takes

1:14:09.049 --> 1:14:19.920
the black box system but tries to improve it
in some ways through: There's some ways of

1:14:19.920 --> 1:14:21.340
doing that.

1:14:21.340 --> 1:14:30.328
One is called black box injection and that's
what is referred to as prompt.

1:14:30.730 --> 1:14:39.793
So the problem is if you have sentences you
don't have information about the speakers.

1:14:39.793 --> 1:14:43.127
So how can you put information?

1:14:43.984 --> 1:14:53.299
And what we know from large language models,
we just prompt them, and you can do that.

1:14:53.233 --> 1:14:59.545
Instead of translating directly "I love you", you translate
"She said to him: I love you", and then of course

1:14:59.545 --> 1:15:01.210
you have to strip that away.

1:15:01.181 --> 1:15:06.629
I mean, you cannot prevent the model from
translating that, but you should be able to

1:15:06.629 --> 1:15:08.974
see what is the translation of this.

1:15:08.974 --> 1:15:14.866
One can strip that away, and now the system
had hopefully the information that it's somebody

1:15:14.866 --> 1:15:15.563
like that.

1:15:15.563 --> 1:15:17.020
The speaker is female.

1:15:18.198 --> 1:15:23.222
Because you're no longer translating "I love
you", but you're translating the sentence "She

1:15:23.222 --> 1:15:24.261
said to him: I love you".

1:15:24.744 --> 1:15:37.146
And so you insert this information as contextual
information around it and don't have to change

1:15:37.146 --> 1:15:38.567
the model.
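
NOTE
A minimal sketch of the black-box injection idea above: the sentence is
wrapped in a short frame that reveals the speaker gender, the unchanged
system translates the whole string, and the translated frame is stripped
again. The frame wording, the function names and the toy lookup table that
stands in for the black-box system are all assumptions for illustration.
```python
PREFIX = {"female": "She said to him:", "male": "He said to her:"}
#
def inject(sentence, speaker_gender):
    """Wrap the sentence in a frame that makes the speaker gender explicit."""
    return f"{PREFIX[speaker_gender]} {sentence}"
#
def strip_prefix(translation, translated_prefix):
    """Remove the translated frame from the front of the output, if present."""
    if translation.startswith(translated_prefix):
        return translation[len(translated_prefix):].strip()
    return translation
#
def toy_black_box(text):
    """Stand-in for an external English-to-Spanish system (hypothetical outputs)."""
    table = {
        "She said to him: I am happy": "Ella le dijo: Estoy contenta",
        "He said to her: I am happy": "Él le dijo: Estoy contento",
    }
    return table[text]
#
raw = toy_black_box(inject("I am happy", "female"))
final = strip_prefix(raw, "Ella le dijo:")
```
In practice the hard part is the stripping step, since the real system's
translation of the frame has to be located reliably in the output.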

1:15:41.861 --> 1:15:56.946
The last idea is to do what is referred to as
lattice rescoring, so the idea there is you

1:15:56.946 --> 1:16:01.156
generate a translation.

1:16:01.481 --> 1:16:18.547
And now you have an additional component which
tries to add possibilities where gender information

1:16:18.547 --> 1:16:21.133
might be lost.

1:16:21.261 --> 1:16:29.687
It's just a graph in this way, a simplified
graph where there's always one word between

1:16:29.687 --> 1:16:31.507
two nodes and you.

1:16:31.851 --> 1:16:35.212
So you have something like "Sie ist ein Arzt" or "Sie ist eine Ärztin".
a Zi is an ads.

1:16:35.535 --> 1:16:41.847
And then you can generate all possible variants.

1:16:41.847 --> 1:16:49.317
Then, of course, we're not done because the
final output.

1:16:50.530 --> 1:16:56.999
Then you can re-score the system by a gender
de-biased model.

1:16:56.999 --> 1:17:03.468
So the nice thing is, why don't we directly
use our model?

1:17:03.468 --> 1:17:10.354
The idea is our model, which is only focusing
on gender debiasing.

1:17:10.530 --> 1:17:16.470
It can be, for example, if it's just trained
on some synthetic data, it will not be that

1:17:16.470 --> 1:17:16.862
good.

1:17:16.957 --> 1:17:21.456
But what we can do then is now you can rescore
the possible translations in here.

1:17:21.721 --> 1:17:31.090
And here, of course, the general structure
is already done, how to translate the words.

1:17:31.051 --> 1:17:42.226
Then you're only using the second component
in order to rescore some variants and then

1:17:42.226 --> 1:17:45.490
get the best translation.
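
NOTE
A minimal sketch of lattice rescoring as described above: the first
translation is expanded into a small lattice with gendered alternatives at
some positions, all paths are enumerated, and a second scorer picks the best
one. The German toy lattice and the hand-written scoring function, which
stands in for a gender-debiased model, are assumptions for illustration.
```python
from itertools import product
#
def expand(slots):
    """Enumerate every path through a lattice with one word per edge."""
    return [" ".join(path) for path in product(*slots)]
#
def toy_score(candidate):
    """Hypothetical scorer: rewards article/noun agreement and subject agreement."""
    tokens = candidate.split()
    score = 0
    if (tokens[2], tokens[3]) in {("ein", "Arzt"), ("eine", "Ärztin")}:
        score += 1
    if tokens[0] == "Sie" and tokens[3] == "Ärztin":
        score += 1
    return score
#
def rescore(candidates, score):
    """Return the highest-scoring candidate."""
    return max(candidates, key=score)
#
slots = [["Sie"], ["ist"], ["ein", "eine"], ["Arzt", "Ärztin"]]
candidates = expand(slots)
best = rescore(candidates, toy_score)
```
The first system still decides the overall structure and word choice; the
second scorer only chooses among the gendered variants.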

1:17:45.925 --> 1:17:58.479
And: As the last one there is the post processing
so you can't have it.

1:17:58.538 --> 1:18:02.830
I mean, this was one way of post-processing: to generate the lattice and retranslate it.
to generate the lattice and retranslate it.

1:18:03.123 --> 1:18:08.407
But you can also have processing, for example
only on the target side where you have additional

1:18:08.407 --> 1:18:12.236
components which check the gender and which
maybe only know about gender.

1:18:12.236 --> 1:18:17.089
So it's not a machine translation component
but more like a grammatical checker which can

1:18:17.089 --> 1:18:19.192
be used as post-processing to do that.
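
NOTE
Editorial sketch of a target-side checker of the kind mentioned above: it knows nothing about
translation, only about gender agreement. The mini-lexicon (NOUN_GENDER, ARTICLE_FOR) is a
hypothetical assumption and covers only nominative indefinite articles in German.
```python
from typing import Dict, List
# Hypothetical mini-lexicon: grammatical gender of a few German nouns (assumption).
NOUN_GENDER: Dict[str, str] = {"Arzt": "m", "Ärztin": "f", "Lehrer": "m", "Lehrerin": "f"}
# For each noun gender, map either indefinite article to the agreeing form.
ARTICLE_FOR: Dict[str, Dict[str, str]] = {
    "m": {"ein": "ein", "eine": "ein"},
    "f": {"ein": "eine", "eine": "eine"},
}
def fix_article_agreement(tokens: List[str]) -> List[str]:
    """Make an indefinite article agree with the known noun directly after it."""
    fixed = list(tokens)
    for i in range(len(fixed) - 1):
        gender = NOUN_GENDER.get(fixed[i + 1])
        if gender and fixed[i] in ARTICLE_FOR[gender]:
            fixed[i] = ARTICLE_FOR[gender][fixed[i]]
    return fixed
corrected = fix_article_agreement("Sie ist ein Ärztin".split())
```
Because it only looks at the target side, such a check can run after any translation system without retraining it.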

1:18:19.579 --> 1:18:22.926
Think about it a bit like when you use GPT.

1:18:22.926 --> 1:18:25.892
There's also a lot of post processing.

1:18:25.892 --> 1:18:32.661
If you use it directly, it would tell you
how to build a bomb, but they have some checks

1:18:32.661 --> 1:18:35.931
either before or after to prevent things.

1:18:36.356 --> 1:18:40.580
So often, in an application system,

1:18:40.580 --> 1:18:44.714
there might be extra pre- and post-processing.

1:18:48.608 --> 1:18:52.589
And yeah, with this we're at the end of

1:18:52.512 --> 1:19:09.359
this lecture, where we focused on bias,
but I think a lot of these techniques we have

1:19:09.359 --> 1:19:11.418
seen here.

1:19:11.331 --> 1:19:17.664
So we saw, on the one hand, that evaluating
just pure BLEU scores might not always be enough.

1:19:17.677 --> 1:19:18.947
I mean, it's very important.

1:19:20.000 --> 1:19:30.866
Always do that, but if some specific things are
important and you want to check them, then you

1:19:30.866 --> 1:19:35.696
might have to do dedicated evaluations.
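
NOTE
Editorial sketch of a dedicated evaluation alongside BLEU: gender accuracy on a small challenge
set. The function name and the (right, wrong) word-pair format are assumptions for illustration,
not an established benchmark format.
```python
from typing import List, Tuple
def gender_accuracy(outputs: List[str], expected: List[Tuple[str, str]]) -> float:
    """Fraction of outputs that contain the correctly gendered word and not the wrong one."""
    correct = 0
    for hyp, (right, wrong) in zip(outputs, expected):
        words = hyp.split()
        if right in words and wrong not in words:
            correct += 1
    return correct / len(outputs) if outputs else 0.0
# Toy system outputs and, per sentence, the (correct, incorrect) gendered form.
outputs = ["Sie ist eine Ärztin", "Er ist eine Lehrerin"]
expected = [("Ärztin", "Arzt"), ("Lehrer", "Lehrerin")]
accuracy = gender_accuracy(outputs, expected)
```
A system can score well on BLEU and still do poorly here, which is why the two are reported separately.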

1:19:36.036 --> 1:19:44.296
Say it is now translating for the President and
it is like that in German; I guess that is not very

1:19:44.296 --> 1:19:45.476
appropriate.

1:19:45.785 --> 1:19:53.591
So if certain characteristics of your system
are essential, it might be important to have dedicated

1:19:53.591 --> 1:19:54.620
evaluation.

1:19:55.135 --> 1:20:02.478
And then if you have that, of course, it might
also be important to develop dedicated techniques.

1:20:02.862 --> 1:20:10.988
We have seen today some ways to mitigate biases,
but I hope you see that a lot of these techniques

1:20:10.988 --> 1:20:13.476
you can also use to mitigate

1:20:13.573 --> 1:20:31.702
at least related things: you can adjust the
training data, and you can do the same for other things.

1:20:33.253 --> 1:20:36.022
Before we finish, do we have any
more questions?

1:20:41.761 --> 1:20:47.218
Then thanks a lot, and then we will see each
other again on the first step.