WEBVTT
0:00:01.541 --> 0:00:06.926
Okay, so welcome back to today's lecture.
0:00:08.528 --> 0:00:23.334
What we want to talk about is speech translation,
so we'll have two lectures this week about
0:00:23.334 --> 0:00:26.589
speech translation.
0:00:27.087 --> 0:00:36.456
And then in the last week we'll have some exercises
and repetition.
0:00:36.456 --> 0:00:46.690
We want to look at what needs to be done when
we want to translate speech.
0:00:46.946 --> 0:00:55.675
So we want to address the specific challenges
that occur when we switch from translating
0:00:55.675 --> 0:00:56.754
text to translating speech.
0:00:57.697 --> 0:01:13.303
Today we will look at the more general picture
and at how to build the systems.
0:01:13.493 --> 0:01:23.645
And then secondly an end-to-end approach, where
we are going to put in audio and directly generate
the translation.
0:01:24.224 --> 0:01:41.439
These are the two main dominant approaches which
are used in research and in commercial systems.
0:01:43.523 --> 0:01:56.879 | |
More general, what is the general task of | |
speech translation that is shown here? | |
0:01:56.879 --> 0:02:01.826 | |
The idea is we have a speech. | |
0:02:02.202 --> 0:02:12.838 | |
Then we want to have a system which takes | |
this audio and then translates it into another | |
0:02:12.838 --> 0:02:14.033 | |
language. | |
0:02:15.095 --> 0:02:20.694 | |
Then it's no longer as clear the output modality. | |
0:02:20.694 --> 0:02:33.153 | |
In contrast, for humans we can typically have: | |
So you can either have more textual translation, | |
0:02:33.153 --> 0:02:37.917 | |
then you have subtitles, and the. | |
0:02:38.538 --> 0:02:57.010
Or you want to have it also in audio, like
it's done for human interpretation.
0:02:57.417 --> 0:03:03.922
You see there is not the one best solution
where one of them is always better.
0:03:03.922 --> 0:03:09.413
It heavily depends on the use case and on what
the people prefer.
0:03:09.929 --> 0:03:14.950
For example, you can think of a case where you know
the source language a bit, but you're a
0:03:14.950 --> 0:03:17.549
bit unsure and don't understand everything.
0:03:17.549 --> 0:03:23.161
Then text output may be better, because
you can direct your ear to what was said, and
0:03:23.161 --> 0:03:26.705
only if you're unsure you check with the
translation.
0:03:27.727 --> 0:03:33.511
In other cases it might be preferable
to have a completely spoken output.
0:03:34.794 --> 0:03:48.727 | |
So there are both ones for a long time in | |
automatic systems focused mainly on text output. | |
0:03:48.727 --> 0:04:06.711 | |
In most cases: But of course you can always | |
hand them to text to speech systems which generates | |
0:04:06.711 --> 0:04:09.960 | |
audio from that. | |
0:04:12.772 --> 0:04:14.494 | |
Why should we care about that? | |
0:04:14.494 --> 0:04:15.771 | |
Why should we do that? | |
0:04:17.737 --> 0:04:24.141
There is the nice thing that, yeah, with a
globalized world, we are able to now interact
0:04:24.141 --> 0:04:25.888
with a lot more people.
0:04:25.888 --> 0:04:29.235
You can go to conferences around the world.
0:04:29.235 --> 0:04:31.564
We can travel around the world.
0:04:31.671 --> 0:04:37.802
We can, by Internet, watch movies from all over
the world and watch TV from all over the world.
0:04:38.618 --> 0:04:47.812
However, there is still this barrier that
you can mainly watch videos either in English
0:04:47.812 --> 0:04:49.715
or in a language you know.
0:04:50.250 --> 0:05:00.622
So what is currently happening in order to
reach a large audience is that everybody switches
to English.
0:05:00.820 --> 0:05:07.300
So if we are going, for example, to conferences,
these are international conferences.
0:05:08.368 --> 0:05:22.412
However, everybody will then speak English,
since that is the common language that
0:05:22.412 --> 0:05:26.001
everybody understands.
0:05:26.686 --> 0:05:32.929
On the other hand, we cannot have
human interpreters everywhere.
0:05:32.892 --> 0:05:37.797
You have that maybe in the European Parliament
or in important business meetings.
0:05:38.078 --> 0:05:47.151
But this is relatively expensive, and so the
question is: can we enable communication in
0:05:47.151 --> 0:05:53.675
your mother tongue without having to have human
interpretation?
0:05:54.134 --> 0:06:04.321
And there speech translation can be helpful
in order to help you bridge this gap.
0:06:06.726 --> 0:06:22.507
In this case, there are different scenarios
of how you can apply speech translation.
0:06:22.422 --> 0:06:29.282
Speech is typically more interactive than what
we are talking about in text translation.
0:06:29.282 --> 0:06:32.800
Text translation is most commonly used for static text.
0:06:33.153 --> 0:06:41.637
Of course, nowadays there are things like chat
and so on where it can also be interactive.
0:06:42.082 --> 0:06:48.299
In contrast to that, speech translation is
less static, so there are different ways of
0:06:48.299 --> 0:06:48.660
how to do it.
0:06:49.149 --> 0:07:00.544
The one scenario is what is called consecutive translation,
where you first get an input, then you translate
0:07:00.544 --> 0:07:03.799
this fixed input, and then output the translation.
0:07:04.944 --> 0:07:12.823
This means you always have, like,
fixed, yeah, fixed chunks which you need
0:07:12.823 --> 0:07:14.105
to translate.
0:07:14.274 --> 0:07:25.093
You don't need to rack your brain over where
the boundaries are, where there's an end.
0:07:25.405 --> 0:07:31.023
Also, there is no overlapping.
0:07:31.023 --> 0:07:42.983
There is always one person's sentence that
is getting translated.
0:07:43.443 --> 0:07:51.181
Of course, this has the disadvantage that it
makes the conversation a lot longer, because
0:07:51.181 --> 0:07:55.184
you always alternate between speech and translation.
0:07:57.077 --> 0:08:03.780
For example, if you would use that for a presentation,
it would, yeah, get quite long. Just
0:08:03.780 --> 0:08:09.738
imagine you sitting here in the
lecture: I would say three sentences, then I
0:08:09.738 --> 0:08:15.765
would wait for the interpreter to translate
them, then I would say the next two sentences,
0:08:15.765 --> 0:08:16.103
and so on.
0:08:16.676 --> 0:08:28.170
That is why this is used in situations like, for example,
a direct conversation with a patient.
0:08:29.209 --> 0:08:32.733
But still, there it has the drawback of taking
very long.
0:08:33.473 --> 0:08:42.335
And that's why there's also the research on
simultaneous translation, where the idea is
0:08:42.335 --> 0:08:43.644
to translate in parallel.
0:08:43.964 --> 0:08:46.179
That is what is done for human
0:08:46.126 --> 0:08:52.429
interpretation, like if you think of things
like the European Parliament, where they of
0:08:52.429 --> 0:08:59.099
course not only speak always one sentence but
are just giving their speech, and in parallel
0:08:59.099 --> 0:09:04.157
human interpreters are translating the speech
into another language.
0:09:04.985 --> 0:09:12.733
The same thing is interesting for automatic
speech translation, where we in parallel generate
0:09:12.733 --> 0:09:13.817
the translation.
0:09:15.415 --> 0:09:32.271
The challenges then, of course, are that we
need to segment our speech into some kind of chunks.
0:09:32.152 --> 0:09:34.903
In text, we just looked for the dots.
0:09:34.903 --> 0:09:38.648
We saw there are some challenges that we have to
check:
0:09:38.648 --> 0:09:41.017
the dot in "Dr." may not end a sentence.
0:09:41.201 --> 0:09:47.478
But in general, getting sentence boundaries
in text is not really a research question.
0:09:47.647 --> 0:09:51.668
While in speech translation, this is not that
easy.
0:09:51.952 --> 0:10:05.908
Even getting that from the audio is difficult,
because it's not like we typically make breaks
0:10:05.908 --> 0:10:09.742
exactly when there's a sentence end.
0:10:10.150 --> 0:10:17.432
And even if you then see the transcript and
would have to add the punctuation, this is
0:10:17.432 --> 0:10:18.101
not as easy.
0:10:20.340 --> 0:10:25.942
Another question is how many speakers we have.
0:10:25.942 --> 0:10:31.759
In presentations you have more like a single
speaker.
0:10:31.931 --> 0:10:40.186
That is normally easier from the audio-processing
point of view.
0:10:40.460 --> 0:10:49.308
So in general, in speech translation you can have
different challenges, and they
0:10:49.308 --> 0:10:57.132
can be in different components; in addition
to translation, you have recognition. And
0:10:57.132 --> 0:11:00.378
if you have not, for example, a single
speaker, there are significant additional
challenges.
0:11:00.720 --> 0:11:10.313
We as humans are very good at filtering
out noises, or, if two people speak in parallel,
0:11:10.313 --> 0:11:15.058
at separating these two speakers and listening
to one of them.
0:11:15.495 --> 0:11:28.300 | |
However, if you want to do that with automatic | |
systems that is very challenging so that you | |
0:11:28.300 --> 0:11:33.172 | |
can separate the speakers so that. | |
0:11:33.453 --> 0:11:41.284 | |
For the more of you have this multi-speaker | |
scenario, typically it's also less well prepared. | |
0:11:41.721 --> 0:11:45.807 | |
So you're getting very, we'll talk about the | |
spontaneous effects. | |
0:11:46.186 --> 0:11:53.541 | |
So people like will stop in the middle of | |
the sentence, they change their sentence, and | |
0:11:53.541 --> 0:12:01.481 | |
so on, and like filtering these, these fluences | |
out of the text and working with them is often | |
0:12:01.481 --> 0:12:02.986 | |
very challenging. | |
0:12:05.565 --> 0:12:09.144 | |
So these are all additional challenges when | |
you have multiples. | |
0:12:10.330 --> 0:12:19.995
Then there's the question of an online or offline
system. In text translation,
0:12:19.995 --> 0:12:21.836
we also mainly work offline.
0:12:21.962 --> 0:12:36.507
That means you can take the whole text and
you can translate it in a batch.
0:12:37.337 --> 0:12:44.344
For speech translation, there are also
several scenarios where this is the case.
0:12:44.344 --> 0:12:51.513
For example, when you're translating a movie,
it's not only that you don't have to do it
0:12:51.513 --> 0:12:54.735
live, but you can take the whole movie.
0:12:55.215 --> 0:13:05.473
However, there are also a lot of situations
where you don't have this opportunity, like
0:13:05.473 --> 0:13:06.785
lectures or sports.
0:13:07.247 --> 0:13:13.963
And you don't want to, like, first record
a whole sports event and then show
0:13:13.963 --> 0:13:19.117
the game three hours later; then there is not
really any interest anymore.
0:13:19.399 --> 0:13:31.118
So you have to do it live, and so we have
the additional challenge of translating while
0:13:31.118 --> 0:13:32.208
the input is still coming in.
0:13:32.412 --> 0:13:42.108
There are several requirements. On the one hand, of course,
0:13:42.108 --> 0:13:49.627
it needs to be real-time translation.
0:13:49.869 --> 0:13:54.153
If it's taking longer, then you're getting more
and more and more delayed.
0:13:55.495 --> 0:14:05.245
So it maybe seems simple, but there have been
research systems which run slower
0:14:05.245 --> 0:14:07.628
than real time or so,
0:14:07.628 --> 0:14:15.103
if you want to show what is possible with
the best current systems.
0:14:16.596 --> 0:14:18.477
But even that is not enough.
0:14:18.918 --> 0:14:29.593
The other question is latency: you can have a system
which is even, like, several times faster than real time,
0:14:29.509 --> 0:14:33.382
processing each second of audio in less than one second,
and it might still be not useful.
0:14:33.382 --> 0:14:39.648
Then the question is, like, the latency: how
much time has passed before you can produce
0:14:39.648 --> 0:14:39.930
an output.
0:14:40.120 --> 0:14:45.814
It might be that on average you can, like, process
it fast enough, but you still can't do it directly.
0:14:45.814 --> 0:14:51.571
You need to do it after, or you need to have
the full context of thirty seconds before you
0:14:51.571 --> 0:14:55.178
can output something, and then you have a large
latency.
0:14:55.335 --> 0:15:05.871
So it can be that you process it as fast as it is produced,
but have to wait until the full input is there.
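The distinction just described, between processing speed and latency, can be made concrete with a toy calculation. The numbers below are invented purely for illustration:

```python
# Toy illustration of real-time factor versus latency. The numbers are
# invented: a system that needs 15 s of compute for a 30 s chunk of
# audio runs faster than real time, yet if it must wait for the whole
# chunk before emitting anything, its first output is still very late.

audio_seconds = 30.0       # length of the input chunk the system waits for
processing_seconds = 15.0  # compute time needed for that chunk

# Real-time factor: < 1.0 means faster than real time.
real_time_factor = processing_seconds / audio_seconds

# Latency of the first output: wait for the full chunk, then process it.
first_output_latency = audio_seconds + processing_seconds

print(real_time_factor)      # 0.5
print(first_output_latency)  # 45.0
```

So a fast system can still feel slow to the user if it needs long input windows before producing anything.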
0:15:06.426 --> 0:15:13.772
So we'll look into that on Thursday, how we
can then generate translations that have
0:15:13.772 --> 0:15:14.996
a low latency.
0:15:15.155 --> 0:15:21.587
You can imagine, for example, in German that
it's maybe quite challenging, since the verb
0:15:21.587 --> 0:15:23.466
is often, like, at the end.
0:15:23.466 --> 0:15:30.115
If you're using the perfect tense, like with "habe" and
so on, then in English you have to directly
0:15:30.115 --> 0:15:30.983
produce the verb.
0:15:31.311 --> 0:15:38.757
So if you really want to have the full context, you
might need to wait until the end of the sentence.
0:15:41.021 --> 0:15:45.920 | |
Besides that, of course, offline and it gives | |
you more additional help. | |
0:15:45.920 --> 0:15:52.044 | |
I think last week you talked about context | |
based systems that typically have context from | |
0:15:52.044 --> 0:15:55.583 | |
maybe from the past but maybe also from the | |
future. | |
0:15:55.595 --> 0:16:02.923 | |
Then, of course, you cannot use anything from | |
the future in this case, but you can use it. | |
0:16:07.407 --> 0:16:24.813
Finally, there is the question of how you want
to present the automatic
0:16:24.813 --> 0:16:27.384
translation to the audience.
0:16:27.507 --> 0:16:31.361
There is also the question of how you present
0:16:31.361 --> 0:16:35.300
all your outputs while the system is running.
0:16:35.996 --> 0:16:36.990
On top of it,
0:16:36.990 --> 0:16:44.314
there are questions like: how should it
be spoken? So you can do things like
0:16:46.586 --> 0:16:52.507
voice cloning, so that it's, like, even the same
voice as the original speaker.
0:16:53.994 --> 0:16:59.081
And if you do subtitles or dubbing, then there might
be additional constraints.
0:16:59.081 --> 0:17:05.729
So if you think about subtitles: they
should be readable, and people often speak
0:17:05.729 --> 0:17:07.957
faster than you can read.
0:17:08.908 --> 0:17:14.239
So you might need to shorten your text.
0:17:14.239 --> 0:17:20.235
People say that a subtitle can be two lines.
0:17:20.235 --> 0:17:26.099
Each line can have this number of characters.
0:17:26.346 --> 0:17:31.753
So if you have a too long text, you
might need to shorten it to meet these constraints.
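As a toy version of such a constraint check: the exact limits are conventions and vary between guidelines; two lines of at most 42 characters each is a commonly cited value, assumed here for illustration.

```python
import textwrap

# Check whether a translation fits a subtitle of at most MAX_LINES lines,
# each at most MAX_CHARS_PER_LINE characters. These limits are assumed
# example values, not an official standard.
MAX_LINES = 2
MAX_CHARS_PER_LINE = 42

def fits_subtitle(text: str) -> bool:
    lines = textwrap.wrap(text, width=MAX_CHARS_PER_LINE)
    return len(lines) <= MAX_LINES

print(fits_subtitle("This short sentence fits easily."))  # True
print(fits_subtitle("This considerably longer sentence keeps "
                    "going on and on and will certainly not fit "
                    "into two lines of forty-two characters."))  # False
```

When a translation fails such a check, a system would have to compress or paraphrase it, which is exactly the shortening problem mentioned above.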
0:17:32.052 --> 0:17:48.272
Similarly, if you think about dubbing: if
you want to produce a dubbing voice, then it
0:17:48.272 --> 0:17:50.158
should match the timing of the original.
0:17:51.691 --> 0:17:59.294
Here is another aspect: we have different
settings, like a more formal setting, and these
0:17:59.294 --> 0:18:00.602
have different styles.
0:18:00.860 --> 0:18:09.775
If you think about the United Nations, maybe
you want more formal output, and between friends
0:18:09.775 --> 0:18:14.911
maybe less formal, and there are languages which
use different forms for this.
0:18:15.355 --> 0:18:21.867
That is, sure, an important research
question.
0:18:21.867 --> 0:18:28.010
But I would think of it more generally.
0:18:28.308 --> 0:18:32.902
That's also important in text translation.
0:18:32.902 --> 0:18:41.001
If you translate a letter to your boss, it
should sound different than a message to a friend.
0:18:42.202 --> 0:18:53.718
So there is the question of how you can control
this style, how you can do that,
0:18:53.718 --> 0:19:00.542
for example whether you can specify the style
you want.
0:19:00.460 --> 0:19:10.954
So you can tag the sentence to generate a formal
or an informal style because, as you correctly said, this
0:19:10.954 --> 0:19:16.709
is especially challenging again in these situations.
0:19:16.856 --> 0:19:20.111
Of course, there are ways of, like, being formal
or less formal.
0:19:20.500 --> 0:19:24.846
But it's not, like, as clear as you do it, for
example, in German, where you have the two forms
0:19:24.846 --> 0:19:24.994
"du" and "Sie".
0:19:25.165 --> 0:19:26.855
So there is no one-to-one mapping.
0:19:27.287 --> 0:19:34.269
If you want to make sure of that, you can build
a system which generates different styles in
0:19:34.269 --> 0:19:38.662
the output, so yeah, that's definitely also
a challenge.
0:19:38.662 --> 0:19:43.762
It just may be not mentioned here because
it's not specific to speech translation.
0:19:44.524 --> 0:19:54.029
Generally, of course, these are all challenges
in how to customize and adapt systems to use
0:19:54.029 --> 0:19:56.199
cases with specific requirements.
0:20:00.360 --> 0:20:11.020
Speech translation has been done for quite
a while, and it's maybe not surprising it started
0:20:11.020 --> 0:20:13.569
with more simple use cases.
0:20:13.793 --> 0:20:24.557
So people first started to look into, for
example, limited-domain translations.
0:20:24.557 --> 0:20:33.726
The tourist domain was a typical application, if you're
going to a new city.
0:20:34.834 --> 0:20:44.028
Then there were several attempts at doing
open-domain translation, especially for parliament
speeches.
0:20:44.204 --> 0:20:51.957
Like, where there's a lot of data, so you could
build systems which are more open-domain,
0:20:51.957 --> 0:20:55.790
but of course it's still a bit restricted.
0:20:55.790 --> 0:20:59.101
It's true that in the European Parliament
0:20:59.101 --> 0:21:01.888
people talk about nearly anything, but in a
specific register.
0:21:02.162 --> 0:21:04.820
And so it's not completely usable for everything.
0:21:05.165 --> 0:21:11.545 | |
Nowadays we've seen this technology in a lot | |
of different situations guess you ought. | |
0:21:11.731 --> 0:21:17.899 | |
Use it so there is some basic technologies | |
where you can use them already. | |
0:21:18.218 --> 0:21:33.599 | |
There is still a lot of open questions going | |
from if you are going to really spontaneous | |
0:21:33.599 --> 0:21:35.327 | |
meetings. | |
0:21:35.655 --> 0:21:41.437 | |
Then these systems typically work good for | |
like some languages where we have a lot of | |
0:21:41.437 --> 0:21:42.109 | |
friendly. | |
0:21:42.742 --> 0:21:48.475 | |
But if we want to go for really low resource | |
data then things are often challenging. | |
0:21:48.448 --> 0:22:02.294 | |
Last week we had a workshop on spoken language | |
translation and there is a low-resource data | |
0:22:02.294 --> 0:22:05.756 | |
track which is dialed. | |
0:22:05.986 --> 0:22:06.925 | |
And so on. | |
0:22:06.925 --> 0:22:14.699 | |
All these languages can still then have significantly | |
lower performance than for a higher. | |
0:22:17.057 --> 0:22:20.126
So how does this work?
0:22:20.126 --> 0:22:31.614
If we want to do speech translation, there
are, like, three basic technologies. On the one
0:22:31.614 --> 0:22:40.908
hand, there is automatic speech recognition, where
automatic speech recognition normally transcribes
0:22:40.908 --> 0:22:41.600
audio into text.
0:22:42.822 --> 0:22:58.289
Then, what we talked about here is machine
translation, which takes text input and translates it
0:22:58.289 --> 0:23:01.276
into the target language.
0:23:02.642 --> 0:23:11.244
And the third is speech synthesis. The very simple
model now, if you think about it, is of course the
combination of these.
0:23:11.451 --> 0:23:14.740
We have, like, worked on all these parts separately.
0:23:14.975 --> 0:23:31.470
We are working on all these problems, so if
we want to do speech translation, maybe
0:23:31.331 --> 0:23:35.058
we can just put all these components
together.
0:23:35.335 --> 0:23:45.130
And then you get what you have as a cascaded
system: first, you take your audio;
0:23:45.045 --> 0:23:59.288
the ASR takes this as input and generates the text output,
and then you take this text output and put it
0:23:59.288 --> 0:24:00.238
into the MT system.
0:24:00.640 --> 0:24:05.782
So in that way you now
0:24:08.008 --> 0:24:18.483
have a solution for doing speech
translation with these types of systems, and
0:24:18.483 --> 0:24:20.874
this type is called a cascaded system.
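The cascaded pipeline just described can be sketched as two chained components. Both functions below are stand-ins, not real models: a real system would wrap an actual ASR model and an actual MT model behind the same interfaces.

```python
# Minimal sketch of a cascaded speech translation system: ASR first,
# then MT on the ASR output. Both components are toy stand-ins.

def recognize(audio: bytes) -> str:
    # Stand-in ASR: a real model would map the waveform to a (typically
    # lower-cased, unpunctuated) transcript in the source language.
    return "hello how are you"

def translate(text: str) -> str:
    # Stand-in MT: a real model would translate source text to target
    # text. Here a tiny word lexicon fakes English-to-German.
    lexicon = {"hello": "hallo", "how": "wie", "are": "geht", "you": "es"}
    return " ".join(lexicon.get(word, word) for word in text.split())

def cascaded_speech_translation(audio: bytes) -> str:
    transcript = recognize(audio)  # component 1: speech recognition
    return translate(transcript)   # component 2: machine translation

print(cascaded_speech_translation(b"\x00\x01"))  # hallo wie geht es
```

The point of the sketch is the interface: the MT component sees only the text the ASR component emits, which is exactly where the mismatch problems discussed below come from.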
0:24:21.681 --> 0:24:28.303 | |
It is still often reaching state of the art, | |
however it has benefits and disadvantages. | |
0:24:28.668 --> 0:24:41.709 | |
So the one big benefit is we have independent | |
components and some of that is nice. | |
0:24:41.709 --> 0:24:48.465 | |
So if there are great ideas put into your. | |
0:24:48.788 --> 0:24:57.172 | |
And then some other times people develop a | |
new good way of how to improve. | |
0:24:57.172 --> 0:25:00.972 | |
You can also take this model and. | |
0:25:01.381 --> 0:25:07.639 | |
So you can leverage improvements from all | |
the different communities in order to adapt. | |
0:25:08.288 --> 0:25:18.391 | |
Furthermore, we would like to see, since all | |
of them is learning, that the biggest advantage | |
0:25:18.391 --> 0:25:23.932 | |
is that we have training data for each individual. | |
0:25:24.164 --> 0:25:34.045 | |
So there's a lot less training data where | |
you have the English audio, so it's easy to | |
0:25:34.045 --> 0:25:34.849 | |
train. | |
0:25:36.636 --> 0:25:48.595
Now, one disadvantage that we will focus on when talking
about the cascaded approach is that often the
components don't fit together perfectly.
0:25:48.928 --> 0:25:58.049
So you need to adapt each component a bit
so that it is adapted to its input and output.
0:25:58.278 --> 0:26:07.840
So we'll focus there especially on how to
combine them, since, as said, that is the main focus.
0:26:07.840 --> 0:26:18.589
If you would directly use the ASR output, that might
not work as well as you would like.
0:26:18.918 --> 0:26:33.467
So a major challenge when building a cascaded
speech translation system is how we can
0:26:33.467 --> 0:26:38.862
adapt these components and how we can combine them.
0:26:41.681 --> 0:26:43.918
So why, why is this tricky?
0:26:44.164 --> 0:26:49.183
So it would look quite nice.
0:26:49.183 --> 0:26:54.722
It seems to be very reasonable.
0:26:54.722 --> 0:26:58.356
You have some audio.
0:26:58.356 --> 0:27:03.376
You put it into your ASR system.
0:27:04.965 --> 0:27:23.759
However, this is a bit of wishful thinking,
because what you speak is more informal than written
text.
0:27:23.984 --> 0:27:29.513
And especially, ASR output rarely has punctuation
in there, while the MT system
0:27:29.629 --> 0:27:43.247
assumes, of course, that it gets a full sentence,
and that you don't have disfluencies in there.
0:27:43.523 --> 0:27:55.087
So we see we want to bridge the gap between
the ASR output and the MT input, and we might need
0:27:55.087 --> 0:27:56.646
an additional component.
0:27:58.778 --> 0:28:05.287
And that is typically what is referred to
as a re-casing and re-punctuation system.
0:28:05.445 --> 0:28:15.045
So the idea is that it might be good to have
something like an adapter here in between,
0:28:15.045 --> 0:28:20.007
which really tries to adapt the speech input.
0:28:20.260 --> 0:28:28.809
That can be at different levels, but it might
be even more like rephrasing.
0:28:29.569 --> 0:28:40.620
If you think of a sentence where you have a
false start, then when speaking you sometimes
0:28:40.620 --> 0:28:41.986
notice, oh,
0:28:41.901 --> 0:28:52.224
this is wrong. You restart it; then you might want to delete
the false start, because if you read it you don't want
0:28:52.224 --> 0:28:52.688
to see it.
0:28:56.096 --> 0:28:57.911
So why is, yeah,
0:28:57.911 --> 0:29:01.442
the casing and punctuation important?
0:29:02.622 --> 0:29:17.875
One important thing, directly related to the challenge,
is that speech is just a continuous stream of
0:29:17.875 --> 0:29:18.999
words.
0:29:19.079 --> 0:29:27.422
We are just speaking, and punctuation marks
and so on are not there in natural speech.
0:29:27.507 --> 0:29:30.281
However, they are of course important.
0:29:30.410 --> 0:29:33.877
They are first of all very important for readability.
0:29:34.174 --> 0:29:41.296
If you have once read a text without punctuation
marks, you need more time to process it.
0:29:41.861 --> 0:29:47.375
They're sometimes even semantically important.
0:29:47.375 --> 0:29:52.890
There's a big difference between "Let's eat, Grandpa"
and "Let's eat Grandpa".
0:29:53.553 --> 0:30:00.089
And so this, of course, would be easy for humans
to distinguish, but doing
0:30:00.089 --> 0:30:01.426
it automatically
0:30:01.426 --> 0:30:06.180
is more difficult. And finally, it matters in our case
if we want to do machine translation.
0:30:06.386 --> 0:30:13.672
We are normally assuming sentence-wise input, so
we always enter our system with, like, one
0:30:13.672 --> 0:30:16.238
sentence after the next sentence.
0:30:16.736 --> 0:30:26.058
If you want to do speech translation of a
continuous stream, then of course the question is what
0:30:26.058 --> 0:30:26.716
your units are.
0:30:28.168 --> 0:30:39.095
And the easiest and most straightforward situation
is, of course, if you have a continuous transcript.
0:30:39.239 --> 0:30:51.686
And if it generates your punctuation marks,
it's easy to separate your text into sentences.
0:30:52.032 --> 0:31:09.157
So we can again reuse our MT system and thereby
have a normal MT system run on this continuous input.
0:31:14.174 --> 0:31:21.708
These are a bit older numbers, but they show
you a bit also how important all that is.
0:31:21.861 --> 0:31:31.719
So the best case is if you translate the reference
transcript; that gives you the highest BLEU score.
0:31:32.112 --> 0:31:47.678
If you have it as it is, with some pause-based length
segmentation, then you get a lower score.
0:31:47.907 --> 0:31:57.707
If you then use the segments exactly as
they are done in the reference, you get one BLEU
0:31:57.707 --> 0:32:01.010
point, and another BLEU point on top.
0:32:01.201 --> 0:32:08.085
So you see that you have in total, like, nearly
two BLEU points just by having the correct
0:32:08.085 --> 0:32:09.144
segmentation.
0:32:10.050 --> 0:32:21.178
This shows you that it's important to estimate
as good a segmentation as possible, because even if you
0:32:21.178 --> 0:32:25.629
still have the same errors in your transcript,
the translation gets better.
0:32:27.147 --> 0:32:35.718
This is a bit of an oracle experiment, which is also
not as unusual as it may seem in translation research.
0:32:36.736 --> 0:32:40.495
So this is done by looking at the reference.
0:32:40.495 --> 0:32:48.097
It should show you how these scores are
obtained, to just analyze how important these are.
0:32:48.097 --> 0:32:55.699
So you take the ASR transcript and you look
at the reference, and it's only done for the segmentation.
0:32:55.635 --> 0:33:01.720
If we have optimal punctuation, if our model
is as good and optimal as the reference, we
0:33:01.720 --> 0:33:15.602
could reach this. But of course this is not how we can
do it in reality, because we don't have access
0:33:15.602 --> 0:33:16.990
to that.
0:33:17.657 --> 0:33:24.044
Because one could ask: okay, why should
we do that?
0:33:24.044 --> 0:33:28.778
Because it shows what would be possible if we had the
optimal segmentation.
0:33:31.011 --> 0:33:40.060
And yeah, that is why a typical system does
not only, yeah, depend on these two key components.
0:33:40.280 --> 0:33:56.468
But in between you have this segmentation
component in there, in order to have better input for
the MT system.
0:33:56.496 --> 0:34:01.595
And you often prefer this over using the ASR
output directly.
0:34:04.164 --> 0:34:19.708
So the task of segmentation is to re-segment
the text into what are called sentence-like
0:34:19.708 --> 0:34:24.300
units, and you also assign casing.
0:34:24.444 --> 0:34:39.421
That is more a traditional thing, because for
a long time case information was not provided.
0:34:39.879 --> 0:34:50.355
Nowadays there are good ASR systems which directly
provide you with case information, and this
0:34:50.355 --> 0:34:52.746
may not be necessary anymore.
0:34:56.296 --> 0:35:12.060
How that can be done: you can have three
different approaches, and the separate component was,
0:35:12.060 --> 0:35:16.459
for a long time, the most common one.
0:35:17.097 --> 0:35:23.579
Of course, that is not the only thing you can
do.
0:35:23.579 --> 0:35:30.888
You can also try to train the ASR on data that
contains punctuation, to generate it directly.
0:35:31.891 --> 0:35:41.324
On the other hand, that is of course more
challenging.
0:35:41.324 --> 0:35:47.498
You need training data with some type of segmentation.
0:35:48.028 --> 0:35:59.382
I mean, of course, you can easily remove casing and
punctuation information from your data and then
0:35:59.382 --> 0:36:05.515
train a system which maps non-cased input to cased,
punctuated output.
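The data-creation idea just mentioned can be sketched like this: strip a normal sentence to mimic ASR-style output and keep the original as the target. The helper below is a hypothetical illustration, not taken from any particular toolkit:

```python
import re
import string

# Create a (source, target) training pair for punctuation/case
# restoration: the source mimics raw ASR output, the target is the
# original, properly cased and punctuated sentence.
def make_training_pair(sentence: str) -> tuple[str, str]:
    no_punct = sentence.translate(str.maketrans("", "", string.punctuation))
    source = re.sub(r"\s+", " ", no_punct).lower().strip()
    return source, sentence

src, tgt = make_training_pair("Hello, how are you?")
print(src)  # hello how are you
print(tgt)  # Hello, how are you?
```

Because any punctuated text corpus can be turned into pairs this way, such a restoration model can be trained without any audio data at all.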
0:36:05.945 --> 0:36:15.751
You can also, of course, try to combine these
two into one, so that you directly translate
0:36:15.751 --> 0:36:17.386
from non-cased source text.
0:36:17.817 --> 0:36:24.722
What is happening more by now is that you
also try to have the ASR provide this directly.
0:36:24.704 --> 0:36:35.267
The ASR and the segmentation directly get this
information in there;
0:36:35.267 --> 0:36:45.462
there are systems that combine the ASR and the
segmentation. Yes, there is a valid point.
0:36:45.462 --> 0:36:51.187
What we come to later today is that you can do
audio-to-text in the target language.
0:36:51.187 --> 0:36:54.932
That is what is referred to as an end-to-end
system.
0:36:54.932 --> 0:36:59.738
So it translates directly, and this is still more often
done for text output.
0:36:59.738 --> 0:37:03.414
But there are also end-to-end systems which
generate audio directly.
0:37:03.683 --> 0:37:09.109
There you have additional challenges, like how
to even measure if things are correct or not.
0:37:09.089 --> 0:37:10.522
I mean, for text
0:37:10.522 --> 0:37:18.073
you can measure it in words; for
audio, evaluating the audio signal is even harder.
0:37:18.318 --> 0:37:27.156
That's why it's currently mostly speech to
text, but as one single system; but of
0:37:27.156 --> 0:37:27.969
course speech-to-speech also exists.
0:37:32.492 --> 0:37:35.605 | |
Yeah, how can you do that? | |
0:37:35.605 --> 0:37:45.075 | |
You can do adding these calculation information: | |
Will look into three systems. | |
0:37:45.075 --> 0:37:53.131 | |
You can do that as a sequence labeling problem | |
or as a monolingual. | |
0:37:54.534 --> 0:37:57.145 | |
Let's have a little bit of history. | |
0:37:57.145 --> 0:37:59.545 | |
This was some of the first ideas. | |
0:37:59.545 --> 0:38:04.626 | |
There's the idea where you try to do it mainly | |
based on language model. | |
0:38:04.626 --> 0:38:11.471 | |
So how probable is it that there is a punctuation | |
mark? That was done with old-style n-gram language | |
0:38:11.471 --> 0:38:12.883 | |
models initially. | |
0:38:13.073 --> 0:38:24.687 | |
So you can, for example, if you have an n-gram | |
language model, calculate the score of "Hello, | |
0:38:24.687 --> 0:38:25.787 | |
how are you?" with different punctuations. | |
0:38:25.725 --> 0:38:33.615 | |
And then you compare this probability and | |
take the one which has the highest probability. | |
0:38:33.615 --> 0:38:39.927 | |
You might add rules like: if you have | |
very long pauses, you anyway insert a boundary. | |
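To make the language-model scoring idea above concrete, here is a minimal sketch with a toy bigram model; all words and probabilities here are invented for illustration, and a real n-gram model trained on text would replace the hand-written table:

```python
import math

# Toy bigram "language model": log-probabilities for token pairs.
# The entries are made up for illustration only.
BIGRAM_LOGP = {
    ("hello", ","): math.log(0.4), ("hello", "."): math.log(0.1),
    (",", "how"): math.log(0.5), (".", "how"): math.log(0.2),
    ("how", "are"): math.log(0.6), ("are", "you"): math.log(0.7),
    ("you", "?"): math.log(0.5), ("you", "."): math.log(0.1),
}
UNK_LOGP = math.log(1e-4)  # back-off score for unseen pairs

def score(tokens):
    """Sum bigram log-probabilities over the token sequence."""
    return sum(BIGRAM_LOGP.get(pair, UNK_LOGP)
               for pair in zip(tokens, tokens[1:]))

# Compare candidate punctuations of the same word sequence
# and keep the highest-scoring one.
candidates = [
    ["hello", ",", "how", "are", "you", "?"],
    ["hello", ".", "how", "are", "you", "."],
]
best = max(candidates, key=score)
```

In practice the pause-length heuristic mentioned above would be combined with this score.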
0:38:40.340 --> 0:38:51.953 | |
So this is a very easy model, which only calculates | |
some language model probabilities; however, | |
0:38:51.953 --> 0:39:00.023 | |
the advantage of course is its simplicity. And then, of | |
course, in general, what we will look into | |
0:39:00.023 --> 0:39:06.249 | |
here is that, maybe interestingly, most | |
of the systems, also the advanced ones, are really | |
0:39:06.249 --> 0:39:08.698 | |
mainly focused purely on the text. | |
0:39:09.289 --> 0:39:19.237 | |
If you think about how to insert punctuation | |
marks, maybe your first idea would have been | |
0:39:19.237 --> 0:39:22.553 | |
we can use pause information. | |
0:39:23.964 --> 0:39:30.065 | |
But interestingly, most systems that are | |
used are really focusing on the text. | |
0:39:31.151 --> 0:39:34.493 | |
There are several reasons. | |
0:39:34.493 --> 0:39:44.147 | |
One is that it's easier to get training data | |
so you only need pure text data. | |
0:39:46.806 --> 0:40:03.221 | |
The next way you can do it is you can make | |
it as a sequence labeling task or something like | |
0:40:03.221 --> 0:40:04.328 | |
that. | |
0:40:04.464 --> 0:40:11.734 | |
Then you have labels like: there is nothing, | |
there is a comma, there is a period. | |
0:40:11.651 --> 0:40:15.015 | |
Or a question mark. | |
0:40:15.315 --> 0:40:31.443 | |
So the number of labels is the number | |
of punctuation symbols you have, plus the empty | |
0:40:31.443 --> 0:40:32.329 | |
one. | |
0:40:32.892 --> 0:40:44.074 | |
Typically nowadays you would use something | |
like BERT, and then you can train a classifier. | |
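As a sketch of how the sequence-labeling training data can be derived from punctuated text (the label names and the example sentence are my own assumptions; a real system would then feed the word/label pairs into a pretrained encoder such as BERT with a token classification head):

```python
# Derive sequence-labeling data from punctuated text: each word
# gets the label of the punctuation mark that follows it, or "O"
# when nothing follows. Only the label extraction is shown here.
PUNCT = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}

def to_labels(punctuated_tokens):
    words, labels = [], []
    for tok in punctuated_tokens:
        if tok in PUNCT:
            if labels:                # attach mark to preceding word
                labels[-1] = PUNCT[tok]
        else:
            words.append(tok)
            labels.append("O")
    return words, labels

words, labels = to_labels(["hello", ",", "how", "are", "you", "?"])
# words  -> ["hello", "how", "are", "you"]
# labels -> ["COMMA", "O", "O", "QUESTION"]
```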
0:40:48.168 --> 0:40:59.259 | |
Any questions on that? Then most labels would | |
probably be the empty one, or not? | |
0:41:00.480 --> 0:41:03.221 | |
Yeah, you definitely have a label imbalance. | |
0:41:04.304 --> 0:41:12.405 | |
I think that works relatively well, and I haven't | |
seen that be a big problem. | |
0:41:12.405 --> 0:41:21.085 | |
It's not a completely crazy imbalance, maybe twenty | |
times more. | |
0:41:21.561 --> 0:41:29.636 | |
It can be an issue, especially for the rarer things; | |
I mean, the rarest of these things is question marks. | |
0:41:30.670 --> 0:41:43.877 | |
At least for question marks you have typically | |
very strong indicator words. | |
0:41:47.627 --> 0:42:03.321 | |
And then, what was done for quite a long time: | |
we know how to do machine translation, can we use that? | |
0:42:04.504 --> 0:42:12.640 | |
So the idea is, can we just translate non | |
punctuated English into punctuated English | |
0:42:12.640 --> 0:42:14.650 | |
and do it correctly? | |
0:42:15.855 --> 0:42:25.344 | |
So what you need is something like this type | |
of data where the source doesn't have punctuation. | |
0:42:25.845 --> 0:42:30.641 | |
Of course, one side is already done. | |
0:42:30.641 --> 0:42:36.486 | |
You have to make it a bit more challenging. | |
0:42:41.661 --> 0:42:44.550 | |
Yeah, that is true. | |
0:42:44.550 --> 0:42:55.237 | |
If you think about the normal training data, | |
you have to do one thing more. | |
0:42:55.237 --> 0:43:00.724 | |
Otherwise it is difficult to predict. | |
0:43:05.745 --> 0:43:09.277 | |
Here, this already looks different | |
than normal training data. | |
0:43:09.277 --> 0:43:09.897 | |
What is the difference? | |
0:43:10.350 --> 0:43:15.305 | |
If people want to use this for transcripts of speech, | |
0:43:15.305 --> 0:43:19.507 | |
it won't look like edited text. | |
0:43:19.419 --> 0:43:25.906 | |
Yes, that is right. | |
0:43:26.346 --> 0:43:33.528 | |
I mean, that makes things harder. The | |
first and easiest thing is you have to | |
0:43:33.528 --> 0:43:35.895 | |
randomly cut your sentences. | |
0:43:35.895 --> 0:43:43.321 | |
So normally we have one | |
sentence per line, and if you take this as your | |
0:43:43.321 --> 0:43:44.545 | |
training data. | |
0:43:44.924 --> 0:43:47.857 | |
And that is, of course, not very helpful. | |
0:43:48.208 --> 0:44:01.169 | |
So in order to build the training corpus for | |
doing punctuation you randomly cut your sentences | |
0:44:01.169 --> 0:44:08.264 | |
and then you can remove all your punctuation | |
marks. | |
0:44:08.528 --> 0:44:21.598 | |
Because of course that matches what you later have | |
to do, when you get some random segments from your | |
0:44:21.598 --> 0:44:22.814 | |
system. | |
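The data preparation just described, random cutting plus punctuation removal, can be sketched like this (the sentences, segment lengths, and punctuation set are assumptions for illustration):

```python
import random
import re

# Build synthetic training pairs for punctuation restoration:
# concatenate sentences, cut the token stream at random points,
# and strip punctuation and casing on the source side.
sentences = [
    "Hello, how are you?",
    "I am fine.",
    "We will talk about speech translation today.",
]

def make_pairs(sentences, rng, min_len=3, max_len=8):
    tokens = " ".join(sentences).split()
    pairs, i = [], 0
    while i < len(tokens):
        n = rng.randint(min_len, max_len)      # random segment length
        target = " ".join(tokens[i:i + n])     # punctuated target
        source = re.sub(r"[.,?]", "", target).lower()  # stripped source
        pairs.append((source, target))
        i += n
    return pairs

pairs = make_pairs(sentences, random.Random(0))
```

Training a "monolingual translation" model on such pairs then restores punctuation for arbitrary ASR segments.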
0:44:25.065 --> 0:44:37.984 | |
And then, for example, once you have | |
generated your punctuation marks, before | |
0:44:37.984 --> 0:44:41.067 | |
going to the MT system, you can re-segment. | |
0:44:41.221 --> 0:44:54.122 | |
And that is an important thing, which as we will | |
see is more challenging for end-to-end systems: | |
0:44:54.122 --> 0:45:00.143 | |
we can change the segmentation here. | |
0:45:00.040 --> 0:45:06.417 | |
If you're combining these things, | |
you can change the segmentation here, so | |
0:45:06.406 --> 0:45:18.178 | |
while you have ten segments from your ASR, | |
you might only have five sentences in your MT input. | |
0:45:18.178 --> 0:45:18.946 | |
Then. | |
0:45:19.259 --> 0:45:33.172 | |
Which might be more useful or helpful in translation, | |
because you have to reorder things and so on. | |
0:45:33.273 --> 0:45:43.994 | |
And if you have the wrong segmentation, | |
then you cannot reorder things from the beginning | |
0:45:43.994 --> 0:45:47.222 | |
to the end of the sentence. | |
0:45:49.749 --> 0:45:58.006 | |
Okay, so much about segmentation do you have | |
any more questions about that? | |
0:46:02.522 --> 0:46:21.299 | |
Then there is one additional thing you can | |
do, and that refers to the errors the ASR makes. | |
0:46:21.701 --> 0:46:29.356 | |
When you get the ASR output there might be some | |
errors in there, so it might not be perfect. | |
0:46:29.889 --> 0:46:36.322 | |
So the question is, can we adapt to that? | |
0:46:36.322 --> 0:46:45.358 | |
And can the MT system be improved by making | |
it aware of that? | |
0:46:45.265 --> 0:46:50.591 | |
So that it is aware that before it there is an ASR system. | |
0:46:50.490 --> 0:46:55.449 | |
Its output might not be the best one. | |
0:46:55.935 --> 0:47:01.961 | |
There are different ways of dealing with them. | |
0:47:01.961 --> 0:47:08.116 | |
You can use not only the best hypothesis but an n-best list. | |
0:47:08.408 --> 0:47:16.711 | |
So the idea is that you're not only telling | |
the system this is the transcript, but here | |
0:47:16.711 --> 0:47:18.692 | |
is how confident I am in it. | |
0:47:19.419 --> 0:47:30.748 | |
Or you can try to make it more robust | |
towards errors from an ASR system. | |
0:47:32.612 --> 0:47:48.657 | |
Interestingly, I hope I convinced you that it | |
might be a good idea to deal with these errors. | |
0:47:48.868 --> 0:47:57.777 | |
The interesting thing is if you're looking | |
into a lot of systems, this is often ignored, | |
0:47:57.777 --> 0:48:04.784 | |
so they are not adapting their MT system to | |
this type of ASR output. | |
0:48:05.345 --> 0:48:15.232 | |
So they're not really doing any handling of errors, | |
and the interesting thing is it often works just as | |
0:48:15.232 --> 0:48:15.884 | |
good. | |
0:48:16.516 --> 0:48:23.836 | |
And one reason is: if the ASR system makes | |
an error, it is typically in | |
0:48:23.836 --> 0:48:31.654 | |
a challenging situation, and then it is | |
really hard for the MT system to detect that. | |
0:48:31.931 --> 0:48:39.375 | |
If it would be easy for the system to detect | |
the error you would integrate this information | |
0:48:39.375 --> 0:48:45.404 | |
into the ASR system itself. That is not always the case, but that | |
of course makes it a bit challenging, and that's | |
0:48:45.404 --> 0:48:49.762 | |
why there are a lot of systems where it's not | |
explicitly handled how to deal with ASR errors. | |
0:48:52.912 --> 0:49:06.412 | |
But of course it might be good, so one thing | |
is you can give it an n-best list and you can | |
0:49:06.412 --> 0:49:09.901 | |
translate every entry. | |
0:49:10.410 --> 0:49:17.705 | |
And then you have two scores, the ASR probability | |
and the MT probability. | |
0:49:18.058 --> 0:49:25.695 | |
You combine them and then output the hypothesis | |
which has the best combined score. | |
0:49:26.366 --> 0:49:29.891 | |
And then it might no longer be the best. | |
0:49:29.891 --> 0:49:38.144 | |
It's like we had in beam search: this one | |
has the best ASR score, but that one has a better combined score. | |
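A minimal sketch of this n-best rescoring; the hypotheses, scores, and weights here are invented, and in practice the interpolation weights would be tuned on held-out data:

```python
# Rescoring an ASR n-best list with an MT score: for each ASR
# hypothesis we combine its ASR log-probability with the MT
# log-probability of its translation and pick the best sum.
nbest = [
    # (ASR hypothesis, ASR logprob, MT logprob of its translation)
    ("the conference is recorded", -1.2, -3.5),
    ("the conference is reported", -1.0, -6.0),  # best ASR score
]

def combined(entry, asr_weight=1.0, mt_weight=1.0):
    _, asr_lp, mt_lp = entry
    return asr_weight * asr_lp + mt_weight * mt_lp

best = max(nbest, key=combined)
# "recorded" wins on the combined score (-4.7 vs -7.0) although
# "reported" had the better ASR score on its own.
```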
0:49:39.059 --> 0:49:46.557 | |
This sometimes works, but the problem | |
is that the MT system might then tend to | |
0:49:46.557 --> 0:49:52.777 | |
just translate not the correct sentence but | |
the one easier to translate. | |
0:49:53.693 --> 0:50:03.639 | |
You can also generate a more compact representation | |
of this n-best list by having this type of | |
0:50:03.639 --> 0:50:04.467 | |
graph. | |
0:50:05.285 --> 0:50:22.952 | |
Lattices: So then you could try to do | |
a graph-to-text translation, so you can translate. | |
0:50:22.802 --> 0:50:26.582 | |
This encodes all the possibilities the ASR | |
system considered. | |
0:50:26.906 --> 0:50:31.485 | |
So it can be, for example: this is a conference, | |
with some probabilities on the edges. | |
0:50:31.591 --> 0:50:35.296 | |
So the highest probability is here. | |
0:50:35.296 --> 0:50:41.984 | |
Conference is being recorded, but there are | |
other possibilities. | |
0:50:42.302 --> 0:50:53.054 | |
And you can take all of this information out | |
there with your probabilities. | |
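A tiny sketch of such a lattice and how one would pick the most probable path; the nodes, words, and probabilities are invented for illustration:

```python
# A tiny ASR word lattice: edges carry words and probabilities,
# compactly encoding several hypotheses in one graph.
lattice = {
    0: [("this", 0.9, 1)],
    1: [("conference", 0.6, 2), ("concert", 0.4, 2)],
    2: [("is", 1.0, 3)],
    3: [("recorded", 0.7, 4), ("reported", 0.3, 4)],
    4: [],  # final node
}

def best_path(node=0, prob=1.0, words=()):
    """Enumerate paths recursively and return the most probable one."""
    if not lattice[node]:
        return prob, list(words)
    return max(best_path(nxt, prob * p, words + (w,))
               for w, p, nxt in lattice[node])

prob, words = best_path()
# Highest-probability path: "this conference is recorded"
```

A lattice-to-text translation model would consume the whole graph with its probabilities instead of only this single best path.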
0:50:59.980 --> 0:51:07.614 | |
But we'll see this type of error propagation: | |
if you have an ASR error, this might then | |
0:51:07.614 --> 0:51:15.165 | |
propagate to MT errors, and that is one of the main | |
reasons why people looked into other ways of | |
0:51:15.165 --> 0:51:17.240 | |
doing it and not having a cascade. | |
0:51:19.219 --> 0:51:28.050 | |
But generally a cascaded combination, as we've | |
seen it, has several advantages: The biggest | |
0:51:28.050 --> 0:51:42.674 | |
maybe is the data availability so we can train | |
systems for the different components. | |
0:51:42.822 --> 0:51:47.228 | |
So you can train your individual components | |
on relatively large data sets. | |
0:51:47.667 --> 0:51:58.207 | |
A modular system where you can improve each | |
individual model, and if there are new developments | |
0:51:58.207 --> 0:52:01.415 | |
in models, you can swap them in. | |
0:52:01.861 --> 0:52:11.280 | |
There are several advantages, but of course | |
there are also some disadvantages: The most | |
0:52:11.280 --> 0:52:19.522 | |
common one is that there is what is referred | |
to as error propagation. | |
0:52:19.522 --> 0:52:28.222 | |
If the ASR makes an error, probably your output | |
will then directly contain an error too. | |
0:52:28.868 --> 0:52:41.740 | |
Typically it's like: if there's an error in | |
the ASR output, it's easier to ignore it in the | |
0:52:41.740 --> 0:52:46.474 | |
transcript than in the translation output. | |
0:52:46.967 --> 0:52:49.785 | |
What does that mean? | |
0:52:49.785 --> 0:53:01.209 | |
For example, if you have German, the | |
ASR makes an error and writes a similar word instead. | |
0:53:01.101 --> 0:53:05.976 | |
Then most probably you'll ignore it or you'll | |
still know what it was said. | |
0:53:05.976 --> 0:53:11.827 | |
Maybe you even don't notice because you'll | |
quickly read over it and don't see that there's | |
0:53:11.827 --> 0:53:12.997 | |
one letter wrong. | |
0:53:13.673 --> 0:53:25.291 | |
However, if you translate this, then in the English | |
sentence about speeches there's suddenly something | |
0:53:25.291 --> 0:53:26.933 | |
about wines. | |
0:53:27.367 --> 0:53:37.238 | |
So it's typically a lot easier to read over | |
errors in the transcript than reading over them in | |
0:53:37.238 --> 0:53:38.569 | |
the translation. | |
0:53:40.120 --> 0:53:45.863 | |
But there are additional challenges in cascaded | |
systems. | |
0:53:46.066 --> 0:53:52.667 | |
So secondly we have seen that we optimize | |
each component individually so you have a separate | |
0:53:52.667 --> 0:53:59.055 | |
optimization and that doesn't mean that the | |
overall performance is really the best at the | |
0:53:59.055 --> 0:53:59.410 | |
end. | |
0:53:59.899 --> 0:54:07.945 | |
And we have tried to address that already by | |
saying: | |
0:54:07.945 --> 0:54:17.692 | |
You need to adapt them a bit to work well | |
together, but still. | |
0:54:20.280 --> 0:54:24.185 | |
Secondly, like that, there's a computational | |
complexity. | |
0:54:24.185 --> 0:54:30.351 | |
You always need to run an ASR system and an | |
MT system, and especially if you think about | |
0:54:30.351 --> 0:54:32.886 | |
it, it should be fast and real time. | |
0:54:32.886 --> 0:54:37.065 | |
It's challenging to always run two systems | |
and not just a single one. | |
0:54:38.038 --> 0:54:45.245 | |
And one final thing which you might have not | |
directly thought of, but most of the world's | |
0:54:45.245 --> 0:54:47.407 | |
languages do not have any written form. | |
0:54:48.108 --> 0:55:01.942 | |
So if you have a language which doesn't have | |
any script, then of course if you want to translate | |
0:55:01.942 --> 0:55:05.507 | |
it, you cannot first transcribe it. | |
0:55:05.905 --> 0:55:13.705 | |
So in order to do this, the approach was mentioned | |
before already: | |
0:55:13.705 --> 0:55:24.264 | |
Build somehow a system which takes the audio | |
and directly generates text in the target language. | |
0:55:26.006 --> 0:55:41.935 | |
And there is now quite a big opportunity for that, | |
because before, ASR and MT used very different | |
0:55:41.935 --> 0:55:44.082 | |
technology. | |
0:55:44.644 --> 0:55:55.421 | |
However, since we are using neural machine translation | |
with encoder-decoder models, the interesting thing | |
0:55:55.421 --> 0:56:00.429 | |
is that we are using very similar technology. | |
0:56:00.360 --> 0:56:06.047 | |
It's like in both cases very similar architecture. | |
0:56:06.047 --> 0:56:09.280 | |
The main difference is only the input. | |
0:56:09.649 --> 0:56:17.143 | |
But generally how it's done is very similar, | |
and therefore of course we might put everything | |
0:56:17.143 --> 0:56:22.140 | |
together, and that is what is referred to as | |
end-to-end speech translation. | |
0:56:22.502 --> 0:56:31.411 | |
So that means we're having one large neural | |
network, an encoder-decoder system, where we put | |
0:56:31.411 --> 0:56:34.914 | |
audio in one language in and get the target text out. | |
0:56:36.196 --> 0:56:43.106 | |
We can then have a system which directly does | |
the full process. | |
0:56:43.106 --> 0:56:46.454 | |
We don't have to care about the intermediate steps anymore. | |
0:56:48.048 --> 0:57:02.615 | |
So if you think of it as before, we had | |
the ASR and the MT model as two separate systems. | |
0:57:02.615 --> 0:57:04.792 | |
Now we have one. | |
0:57:05.085 --> 0:57:18.044 | |
And instead of going via the discrete text | |
representation in the source language, we can | |
0:57:18.044 --> 0:57:21.470 | |
go via a continuous representation. | |
0:57:21.681 --> 0:57:26.027 | |
Of course, the hope is that by not doing this | |
discretization in between, | |
0:57:26.146 --> 0:57:30.275 | |
we don't commit to errors early on | |
0:57:30.275 --> 0:57:32.793 | |
that we can then not recover from later. | |
0:57:32.772 --> 0:57:47.849 | |
But we can encode here the uncertainty | |
that we have, and then only take the final decision at the end. | |
0:57:51.711 --> 0:57:54.525 | |
And so. | |
0:57:54.274 --> 0:58:02.253 | |
What we're doing is we're having very similar | |
technique. | |
0:58:02.253 --> 0:58:12.192 | |
We're still having the encoder-decoder model | |
that we know from machine translation. | |
0:58:12.552 --> 0:58:24.098 | |
Instead of getting discrete tokens in there | |
as we have subwords, which we encoded | |
0:58:24.098 --> 0:58:26.197 | |
as one-hot vectors. | |
0:58:26.846 --> 0:58:42.505 | |
The problem is that this is now continuous, | |
so we have to check how we can work with continuous | |
0:58:42.505 --> 0:58:43.988 | |
signals. | |
0:58:47.627 --> 0:58:55.166 | |
I mean, what is the first thing in your system | |
when you get your discrete input and encode it? | |
0:59:02.402 --> 0:59:03.888 | |
In neural machine translation. | |
0:59:03.888 --> 0:59:05.067 | |
You're getting a word. | |
0:59:05.067 --> 0:59:06.297 | |
It's one-hot encoded. | |
0:59:21.421 --> 0:59:24.678 | |
What does the first layer of the machine translation system do? | |
0:59:27.287 --> 0:59:36.147 | |
Yes, you do the word embedding, so then you | |
have a continuous thing. | |
0:59:36.147 --> 0:59:40.128 | |
So if you now get continuous input directly. | |
0:59:40.961 --> 0:59:46.316 | |
We can deal with it the same way, so we'll see it's | |
not too big of a challenge. | |
0:59:46.316 --> 0:59:48.669 | |
What is more challenging is the length. | |
0:59:49.349 --> 1:00:04.498 | |
So the audio signal is ten times longer or | |
so; you have many more time steps. | |
1:00:04.764 --> 1:00:10.332 | |
And so that is, of course, a challenge: how | |
we can deal with this type of long sequence. | |
1:00:11.171 --> 1:00:13.055 | |
The advantage is a bit. | |
1:00:13.055 --> 1:00:17.922 | |
The long sequence is only at the input and | |
not at the output. | |
1:00:17.922 --> 1:00:24.988 | |
So when you remember for the efficiency, for | |
example, long sequences are especially | |
1:00:24.988 --> 1:00:29.227 | |
challenging in the decoder, but also for the | |
encoder. | |
1:00:31.371 --> 1:00:33.595 | |
So how is this done? | |
1:00:33.595 --> 1:00:40.617 | |
How can we process audio in a speech translation | |
system? | |
1:00:41.501 --> 1:00:51.856 | |
And you can mainly follow what is done in | |
an ASR system, so you have the audio signal. | |
1:00:52.172 --> 1:00:59.135 | |
Then you measure your amplitude at every time | |
step. | |
1:00:59.135 --> 1:01:04.358 | |
It's typically sampled at something like sixteen kilohertz. | |
1:01:04.384 --> 1:01:13.893 | |
And then you're doing this windowing, | |
so that you get a window of a length of twenty | |
1:01:13.893 --> 1:01:22.430 | |
to thirty milliseconds, and you have all these overlapping | |
windows so that you measure them. | |
1:01:22.342 --> 1:01:32.260 | |
With a ten-millisecond shift, and then you look at | |
these short time signals. | |
1:01:32.432 --> 1:01:36.920 | |
So in the end, for ten seconds of audio, | |
that is ten thousand milliseconds. | |
1:01:36.920 --> 1:01:39.735 | |
You have a frame for every ten milliseconds. | |
1:01:40.000 --> 1:01:48.309 | |
Some type of representation; which type of | |
representation you can generate from that, | |
1:01:48.309 --> 1:01:49.286 | |
we will see. | |
1:01:49.649 --> 1:02:06.919 | |
So instead of having a letter or word, you | |
now have representations for every ten milliseconds of your | |
1:02:06.919 --> 1:02:08.437 | |
signal. | |
1:02:08.688 --> 1:02:13.372 | |
How we encode this thirty-millisecond window now, | |
there are different ways. | |
1:02:16.176 --> 1:02:31.891 | |
The traditional way of how people have done | |
that is to extract from the audio signal what frequencies | |
1:02:31.891 --> 1:02:34.010 | |
are in there. | |
1:02:34.114 --> 1:02:44.143 | |
So to do that you can compute Mel-frequency | |
cepstral coefficients, so you use Fourier transformations. | |
1:02:44.324 --> 1:02:47.031 | |
Which frequencies are there? | |
1:02:47.031 --> 1:02:53.566 | |
You know that the letters differ by | |
the frequencies they contain. | |
1:02:53.813 --> 1:03:04.243 | |
And then, if you're doing that, you compute these | |
coefficients for each window we had before. | |
1:03:04.624 --> 1:03:14.550 | |
So for each of these windows you will calculate | |
what frequencies are in there, and then get features | |
1:03:14.550 --> 1:03:20.059 | |
for this window and features for the next window. | |
1:03:19.980 --> 1:03:28.028 | |
These are the frequencies that occur there | |
and that help you to model which letters are | |
1:03:28.028 --> 1:03:28.760 | |
spoken. | |
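The framing step described above can be sketched as follows, using common (assumed) values of a 16 kHz sampling rate, 25 ms windows, and a 10 ms shift; a real pipeline would then apply a Fourier transform and Mel filter bank per frame to get the coefficients:

```python
# Framing an audio signal: overlapping windows of ~25 ms with a
# 10 ms shift, so you get one frame (and later one feature
# vector) per 10 ms of audio.
SAMPLE_RATE = 16_000            # 16 kHz sampling
WIN = int(0.025 * SAMPLE_RATE)  # 25 ms window -> 400 samples
HOP = int(0.010 * SAMPLE_RATE)  # 10 ms shift  -> 160 samples

def frame(signal):
    """Split a list of samples into overlapping windows."""
    return [signal[i:i + WIN]
            for i in range(0, len(signal) - WIN + 1, HOP)]

one_second = [0.0] * SAMPLE_RATE   # dummy 1 s of silence
frames = frame(one_second)
# Roughly one frame per 10 ms: 98 full windows fit into 1 s here.
```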
1:03:31.611 --> 1:03:43.544 | |
More recently, instead of doing the traditional | |
signal processing, you can also replace that | |
1:03:43.544 --> 1:03:45.853 | |
by deep learning. | |
1:03:46.126 --> 1:03:56.406 | |
So we are using a self-supervised approach, | |
like in language models, to generate features that | |
1:03:56.406 --> 1:03:58.047 | |
describe what is in the signal. | |
1:03:58.358 --> 1:03:59.821 | |
So you have your. | |
1:03:59.759 --> 1:04:07.392 | |
audio signal again, and then for each chunk | |
you apply convolutional neural networks to | |
1:04:07.392 --> 1:04:07.811 | |
get. | |
1:04:07.807 --> 1:04:23.699 | |
a first representation, then there is a transformer | |
network on top, and in the end it's trained similar to | |
1:04:23.699 --> 1:04:25.866 | |
a language model. | |
1:04:25.705 --> 1:04:30.238 | |
And you try to predict what was masked | |
here. | |
1:04:30.670 --> 1:04:42.122 | |
So that is in a way similar that you also | |
try to learn a good representation of all these | |
1:04:42.122 --> 1:04:51.608 | |
audio signals by predicting them. And then you don't | |
do the signal-processing-based features, but have this learned | |
1:04:51.608 --> 1:04:52.717 | |
way to make them. | |
1:04:52.812 --> 1:04:59.430 | |
But in all these cases, what you have to remember, | |
what is most important for an end-to-end | |
1:04:59.430 --> 1:05:05.902 | |
system, is of course that in the end, | |
for every ten milliseconds, you get | |
1:05:05.902 --> 1:05:11.283 | |
a representation of this audio signal, which | |
is again a vector. | |
1:05:11.331 --> 1:05:15.365 | |
And then you can use your normal encoder-decoder | |
model to do the translation. | |
1:05:21.861 --> 1:05:32.694 | |
So that is all which directly has to be changed, | |
and then you can build your first baseline. | |
1:05:33.213 --> 1:05:37.167 | |
You do the audio processing. | |
1:05:37.167 --> 1:05:49.166 | |
You of course need data, which is like audio | |
in English and text in German, and then you | |
1:05:49.166 --> 1:05:50.666 | |
can train. | |
1:05:53.333 --> 1:05:57.854 | |
And interestingly, it works at the beginning. | |
1:05:57.854 --> 1:06:03.261 | |
The systems were maybe a bit worse, but we | |
saw really fast improvement. | |
1:06:03.964 --> 1:06:11.803 | |
This is like from the biggest workshop where | |
people like compared different systems. | |
1:06:11.751 --> 1:06:17.795 | |
There was a special challenge on comparing cascaded to | |
end to end systems and you see two thousand | |
1:06:17.795 --> 1:06:18.767 | |
and eighteen. | |
1:06:18.767 --> 1:06:25.089 | |
We had quite a huge gap between the Cascaded | |
and end-to-end systems, and then it got nearer | |
1:06:25.089 --> 1:06:27.937 | |
and nearer; then, starting in two thousand | |
1:06:27.907 --> 1:06:33.619 | |
Twenty the performance was mainly the same, | |
so there was no clear difference anymore. | |
1:06:34.014 --> 1:06:42.774 | |
So this is, of course, giving a bit of hope, | |
saying if we better learn how to build these | |
1:06:42.774 --> 1:06:47.544 | |
end-to-end systems, they might really perform better. | |
1:06:49.549 --> 1:06:52.346 | |
However, it is a bit unsatisfying. | |
1:06:52.452 --> 1:06:59.018 | |
This is how it all continued, | |
and this is not only in two thousand and twenty | |
1:06:59.018 --> 1:07:04.216 | |
one, but even nowadays we can say there is | |
no clear performance difference. | |
1:07:04.216 --> 1:07:10.919 | |
It's not like the one model is better than | |
the other, but we are seeing very similar performance. | |
1:07:11.391 --> 1:07:19.413 | |
So the question is what is the difference? | |
1:07:19.413 --> 1:07:29.115 | |
Of course, this can only be achieved by new | |
tricks. | |
1:07:30.570 --> 1:07:35.658 | |
Yes and no, that's what we will mainly look | |
into now. | |
1:07:35.658 --> 1:07:39.333 | |
How can we make use of other types of data? | |
1:07:39.359 --> 1:07:53.236 | |
In that case you can achieve better performance | |
by using different types of training, so you | |
1:07:53.236 --> 1:07:55.549 | |
can also make use of additional data. | |
1:07:55.855 --> 1:08:04.961 | |
So if you are training or preparing the systems | |
only on very small corpora, where you have not as | |
1:08:04.961 --> 1:08:10.248 | |
much data as you have for the individual | |
tasks, then performance suffers. | |
1:08:10.550 --> 1:08:22.288 | |
So that is the biggest challenge of an end-to-end | |
system: you have small corpora, and therefore need more data. | |
1:08:24.404 --> 1:08:30.479 | |
Of course, there are several advantages: you | |
have direct access to the audio information. | |
1:08:30.750 --> 1:08:42.046 | |
So that's, for example, interesting if you | |
think about it, you might not have modeled | |
1:08:42.046 --> 1:08:45.198 | |
everything in the text. | |
1:08:45.198 --> 1:08:50.321 | |
So remember when we talked about biases. | |
1:08:50.230 --> 1:08:55.448 | |
Male or female, and that of course is not | |
in the text any more, but in the audio signal | |
1:08:55.448 --> 1:08:56.515 | |
it's still there. | |
1:08:58.078 --> 1:09:03.108 | |
It also helps with latency; we will talk about | |
that on Thursday. | |
1:09:03.108 --> 1:09:08.902 | |
You have a bit better chance if you do an | |
end to end system to get a lower latency because | |
1:09:08.902 --> 1:09:14.377 | |
you only have one system and you don't have | |
two systems which might have to wait for. | |
1:09:14.934 --> 1:09:20.046 | |
And having one system might be also a bit | |
easier management. | |
1:09:20.046 --> 1:09:23.146 | |
You don't have to check that two systems work together and so on. | |
1:09:26.346 --> 1:09:41.149 | |
The biggest challenge of end-to-end systems is the | |
data, so as you correctly pointed out, typically | |
1:09:41.149 --> 1:09:42.741 | |
there is a lot less of it. | |
1:09:43.123 --> 1:09:45.829 | |
There is some data from TED talks. | |
1:09:45.829 --> 1:09:47.472 | |
People did that. | |
1:09:47.472 --> 1:09:52.789 | |
They took the English audio with all the translations. | |
1:09:53.273 --> 1:10:02.423 | |
But in general there is a lot less, so we'll | |
look into how you can use other data sources. | |
1:10:05.305 --> 1:10:10.950 | |
And secondly, the second challenge is that | |
we have to deal with audio. | |
1:10:11.431 --> 1:10:22.163 | |
For example, in input length, and therefore | |
it's also important to handle this in your | |
1:10:22.163 --> 1:10:27.590 | |
network and maybe have dedicated solutions. | |
1:10:31.831 --> 1:10:40.265 | |
So in general we have this challenge that | |
we have a lot of text and translation and audio | |
1:10:40.265 --> 1:10:43.076 | |
transcript data by quite few. | |
1:10:43.643 --> 1:10:50.844 | |
So what can we do? Here is one trick. | |
1:10:50.844 --> 1:11:00.745 | |
You already know it a bit from other lectures. | |
1:11:02.302 --> 1:11:14.325 | |
Exactly. So what you can do is, for | |
example, use TTS: take a parallel corpus, generate | |
1:11:14.325 --> 1:11:19.594 | |
audio of the source language, and then train on that. | |
1:11:21.341 --> 1:11:33.780 | |
This has been a bit motivated by what we | |
have seen in back translation, which was very | |
1:11:33.780 --> 1:11:35.476 | |
successful. | |
1:11:38.758 --> 1:11:54.080 | |
However, it's a bit more challenging, because | |
synthetic audio is often very different from real audio. | |
1:11:54.314 --> 1:12:07.131 | |
So often, if you build a system only trained | |
on synthetic audio, then generalizing to real audio data | |
1:12:07.131 --> 1:12:10.335 | |
is quite challenging. | |
1:12:10.910 --> 1:12:20.927 | |
And therefore here the synthetic data generation | |
is significantly more challenging than for text. | |
1:12:20.981 --> 1:12:27.071 | |
Because if you back-translate a text, it's maybe a bad | |
translation. | |
1:12:27.071 --> 1:12:33.161 | |
It's flawed, but it's still real text, or a text | |
generated by a model. | |
1:12:35.835 --> 1:12:42.885 | |
But it's a valid solution, and for example | |
we use that also for our current systems. | |
1:12:43.923 --> 1:12:53.336 | |
Of course you can also do a bit of forward | |
translation: you take ASR data and translate the transcripts. | |
1:12:53.773 --> 1:13:02.587 | |
But then the problem is that your reference | |
is not always correct, and you remember when | |
1:13:02.587 --> 1:13:08.727 | |
we talked about back translation, it's an | |
advantage to have the synthetic part on the input. | |
1:13:09.229 --> 1:13:11.930 | |
But both can be done and both have been done. | |
1:13:12.212 --> 1:13:20.277 | |
So you can think about this picture again. | |
1:13:20.277 --> 1:13:30.217 | |
You can take this data and generate the audio | |
to it. | |
1:13:30.750 --> 1:13:37.938 | |
However, it is only synthetic; what can | |
be done with voice cloning technology for that? | |
1:13:40.240 --> 1:13:47.153 | |
But, I mean, you get text | |
to speech, but the voice cloning would need | |
1:13:47.153 --> 1:13:47.868 | |
a voice. | |
1:13:47.868 --> 1:13:53.112 | |
You can use one, of course, and then it's nothing | |
else than normal TTS. | |
1:13:54.594 --> 1:14:03.210 | |
But I still think, though TTS got better, | |
there are some characteristics of synthetic speech | |
1:14:03.210 --> 1:14:05.784 | |
which are quite different. | |
1:14:07.327 --> 1:14:09.341 | |
But yeah, it's getting better. | |
1:14:09.341 --> 1:14:13.498 | |
That is definitely true, and then this might | |
get more and more. | |
1:14:16.596 --> 1:14:21.885 | |
You have to make sure it's a good system; with our | |
own systems there could be issues, because we try to train on them. | |
1:14:21.881 --> 1:14:24.356 | |
And it's like a feedback loop. | |
1:14:24.356 --> 1:14:28.668 | |
Is there anything like that for the Dutch-English | |
model? | |
1:14:28.648 --> 1:14:33.081 | |
Yeah, you of course need a decent amount of | |
real data. | |
1:14:33.081 --> 1:14:40.255 | |
But I mean, as I said, so there is always | |
an advantage if you have this synthetics thing | |
1:14:40.255 --> 1:14:44.044 | |
only on the input side and not on the output side. | |
1:14:44.464 --> 1:14:47.444 | |
So that you at least always generate correct | |
outputs. | |
1:14:48.688 --> 1:14:54.599 | |
That's different in the forward translation case, because | |
there the synthetic part is the output, and it's not | |
1:14:54.599 --> 1:14:55.002 | |
guaranteed correct. | |
1:14:58.618 --> 1:15:15.815 | |
The other idea is to integrate additional | |
data sources, so you can have more model sharing. | |
1:15:16.376 --> 1:15:23.301 | |
But you can use these components also in the | |
end-to-end system. | |
1:15:23.301 --> 1:15:28.659 | |
Typically the text decoder and the audio encoder. | |
1:15:29.169 --> 1:15:41.845 | |
And so the other way of leveraging this is to | |
jointly train, or somehow train, all these tasks together. | |
1:15:43.403 --> 1:15:54.467 | |
The first and easy thing to do is multi task | |
training so the idea is you take these components | |
1:15:54.467 --> 1:16:02.038 | |
and train the ASR and the MT task jointly with the | |
speech translation. | |
1:16:02.362 --> 1:16:13.086 | |
So then, for example, all your encoders used | |
by the speech translation system can also gain | |
1:16:13.086 --> 1:16:14.951 | |
from the large ASR data. | |
1:16:14.975 --> 1:16:24.048 | |
So not everything gains to the same extent, | |
but it can partly gain quite a bit in there. | |
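A minimal sketch of such a multi-task objective; the loss values and task weights below are invented, and in practice the weights are tuning parameters:

```python
# Multi-task training: the speech translation (ST) loss is
# combined with auxiliary ASR and MT losses, so the shared audio
# encoder and text decoder also learn from the larger ASR/MT data.
def multitask_loss(st_loss, asr_loss, mt_loss,
                   w_st=1.0, w_asr=0.3, w_mt=0.3):
    """Weighted sum of the task losses used for one update."""
    return w_st * st_loss + w_asr * asr_loss + w_mt * mt_loss

# One hypothetical training step's losses:
total = multitask_loss(st_loss=2.5, asr_loss=1.8, mt_loss=1.2)
# 1.0*2.5 + 0.3*1.8 + 0.3*1.2 = 3.4
```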
1:16:27.407 --> 1:16:39.920 | |
The other idea is to do it in a pre-training | |
phase. | |
1:16:40.080 --> 1:16:50.414 | |
And then you take the audio encoder and the text | |
decoder and train your model on that. | |
1:16:54.774 --> 1:17:04.895 | |
Finally, there is also what is referred to | |
as knowledge distillation, so there you have | |
1:17:04.895 --> 1:17:11.566 | |
to remember you can learn from a probability | |
distribution. | |
1:17:11.771 --> 1:17:24.371 | |
So what you can do then is you have your MT system, | |
and if you then have your audio and text input, | |
1:17:24.371 --> 1:17:26.759 | |
you can use your MT system as a teacher. | |
1:17:27.087 --> 1:17:32.699 | |
And then you get a richer signal: you not | |
only know this is the word, but you have | |
1:17:32.699 --> 1:17:33.456 | |
a complete probability distribution. | |
1:17:34.394 --> 1:17:41.979 | |
This is typically also possible because, of | |
course, if you have speech translation data, it is often | |
1:17:41.979 --> 1:17:49.735 | |
the case that you don't only have source language audio | |
and target language text, but then you also | |
1:17:49.735 --> 1:17:52.377 | |
have the source language text. | |
1:17:53.833 --> 1:18:00.996 | |
The idea is that the text decoder and the | |
audio encoder. | |
1:18:00.996 --> 1:18:15.888 | |
Now have to be aligned so that they fit together; otherwise | |
you wouldn't be able to determine to which degree | |
1:18:15.888 --> 1:18:17.922 | |
they agree. | |
1:18:18.178 --> 1:18:25.603 | |
What you are doing in knowledge distillation | |
is you run your MT system and then you get your probability | |
1:18:25.603 --> 1:18:32.716 | |
distribution for all the words, and you use | |
that to train, and that is more helpful | |
1:18:32.716 --> 1:18:34.592 | |
than only getting the reference word back. | |
1:18:35.915 --> 1:18:44.427 | |
You should, of course, use the same vocabulary | |
so the distributions are comparable. | |
1:18:44.427 --> 1:18:49.729 | |
Otherwise you don't have exactly the same output space. | |
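The distillation objective just described can be sketched as follows; this is an illustration of the idea, not code from the lecture:

```python
import numpy as np

# Sketch of the distillation loss: instead of training against a
# one-hot reference word, the student is trained against the teacher
# MT system's full probability distribution over a shared vocabulary.

def distillation_loss(student_logits, teacher_probs):
    """Cross-entropy between the teacher distribution and the
    student's predicted distribution (log-softmax of its logits)."""
    log_probs = student_logits - np.log(np.sum(np.exp(student_logits)))
    return -np.sum(teacher_probs * log_probs)
```

If the student exactly matches the teacher, the loss reduces to the teacher's entropy, which is its minimum.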
1:18:52.832 --> 1:19:03.515 | |
That is a good point, and generally | |
in all these cases it's good to have more similar | |
1:19:03.515 --> 1:19:05.331 | |
representations. | |
1:19:05.331 --> 1:19:07.253 | |
You can transfer. | |
1:19:07.607 --> 1:19:23.743 | |
If the representations you get from | |
the audio encoder and the text encoder are | |
1:19:23.743 --> 1:19:27.410 | |
more similar, then transfer works better. | |
1:19:30.130 --> 1:19:39.980 | |
So here you have your text encoder in the | |
target language and you can train it on large | |
1:19:39.980 --> 1:19:40.652 | |
data. | |
1:19:41.341 --> 1:19:45.994 | |
But of course you want to benefit also for | |
this task because that's what you're most interested in. | |
1:19:46.846 --> 1:19:59.665 | |
Of course, the most benefit for this task | |
is if these two representations you give are | |
1:19:59.665 --> 1:20:01.728 | |
more similar. | |
1:20:02.222 --> 1:20:10.583 | |
Therefore, it's interesting to look into how | |
can we make these two representations as similar | |
1:20:10.583 --> 1:20:20.929 | |
as possible. The hope is that in the end you can even | |
do something like zero shot transfer, but while | |
1:20:20.929 --> 1:20:25.950 | |
you only train on one modality you can also deal with the other. | |
1:20:30.830 --> 1:20:40.257 | |
So what you can do is you can look at these | |
two representations. | |
1:20:40.257 --> 1:20:42.867 | |
So once from the text encoder and once from the audio encoder. | |
1:20:43.003 --> 1:20:51.184 | |
And you can either put them into the text | |
decoder or into the encoder. | |
1:20:51.184 --> 1:20:53.539 | |
We have seen both. | |
1:20:53.539 --> 1:21:03.738 | |
You can think: if you want to build an end-to-end | |
system, you can either take the audio | |
1:21:03.738 --> 1:21:06.575 | |
encoder and see how far you get. | |
1:21:08.748 --> 1:21:21.915 | |
However, you have these two representations | |
and you want to make them more similar. | |
1:21:21.915 --> 1:21:23.640 | |
One thing is the difference in sequence length. | |
1:21:23.863 --> 1:21:32.797 | |
Here we have, as said, for every ten | |
milliseconds a representation. | |
1:21:35.335 --> 1:21:46.085 | |
So what people may have done, for example, | |
is to remove redundant information. | |
1:21:46.366 --> 1:21:56.403 | |
So you can use your ASR system to segment the audio based | |
on letters or words and then average over the | |
1:21:56.403 --> 1:21:58.388 | |
words or letters. | |
1:21:59.179 --> 1:22:07.965 | |
So that the number of representations from | |
the encoder is the same as you would get from text. | |
1:22:12.692 --> 1:22:20.919 | |
Okay, that much about the data. Do you have any more questions | |
first about that? | |
1:22:27.207 --> 1:22:36.787 | |
Then we'll finish with the audio processing | |
and highlight a bit why this is challenging, | |
1:22:36.787 --> 1:22:52.891 | |
so here's an example: one test set here has one thousand eight | |
hundred sentences, with correspondingly many words or characters. | |
1:22:53.954 --> 1:22:59.336 | |
If you look at how many audio features, so | |
how many samples, there are, it is like one point five | |
1:22:59.336 --> 1:22:59.880 | |
million. | |
1:23:00.200 --> 1:23:10.681 | |
So you have ten times more features than you | |
have characters, and then again five times | |
1:23:10.681 --> 1:23:11.413 | |
more. | |
1:23:11.811 --> 1:23:23.934 | |
So the sequence length of the audio is much longer | |
than what you have for words, and that is | |
1:23:23.934 --> 1:23:25.788 | |
a challenge. | |
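The length mismatch can be made concrete with some rough arithmetic; the sentence statistics here are illustrative assumptions, not the corpus numbers from the lecture:

```python
# Rough arithmetic for the audio/text length mismatch; the sentence
# statistics are illustrative assumptions, not the lecture's numbers.
frames_per_second = 100      # one feature vector per 10 ms frame shift
seconds = 5                  # an example utterance length
chars, words = 80, 15        # rough counts for such a sentence

frames = frames_per_second * seconds     # 500 audio feature vectors
ratio_chars = frames / chars             # several times more frames than characters
ratio_words = frames / words             # dozens of times more frames than words
```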
1:23:26.086 --> 1:23:34.935 | |
So the question is what can you do to make | |
the sequence a bit shorter and not have this problem? | |
1:23:38.458 --> 1:23:48.466 | |
The one thing is you can try to reduce the | |
dimensionality in your encoder. | |
1:23:48.466 --> 1:23:50.814 | |
There are different ways. | |
1:23:50.991 --> 1:24:04.302 | |
So, for example, you can just sum up always | |
over some frames or you can do an aggregation. | |
1:24:04.804 --> 1:24:12.045 | |
Or you do a linear projection, or you even take | |
not every feature but only every fifth or something? | |
1:24:12.492 --> 1:24:23.660 | |
So this way you can very easily reduce your | |
number of features in there, and there has | |
1:24:23.660 --> 1:24:25.713 | |
been different approaches. | |
1:24:26.306 --> 1:24:38.310 | |
There's also what you can do with things like | |
a convolutional layer. | |
1:24:38.310 --> 1:24:43.877 | |
With a stride, you can skip over frames. | |
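A minimal sketch of such a length reduction, here as a fixed average pooling (a strided convolution would learn the reduction instead of hard-coding it):

```python
import numpy as np

# Sketch of sequence-length reduction: average every `factor`
# consecutive audio feature vectors into one vector, shrinking the
# time dimension by that factor.

def downsample(features, factor=4):
    """features: (T, d) array; returns (T // factor, d)."""
    T, d = features.shape
    T_trim = (T // factor) * factor          # drop the ragged tail
    return features[:T_trim].reshape(-1, factor, d).mean(axis=1)
```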
1:24:47.327 --> 1:24:55.539 | |
And then, in addition to the audio, the other | |
problem is higher variability. | |
1:24:55.539 --> 1:25:04.957 | |
So in text there is one way of writing it, but there are | |
very different ways of saying it: you can | |
1:25:04.957 --> 1:25:09.867 | |
distinguish who says a sentence or which | |
voice. | |
1:25:10.510 --> 1:25:21.224 | |
That of course makes it more challenging because | |
now you get different inputs and while they | |
1:25:21.224 --> 1:25:22.837 | |
would be the same in text. | |
1:25:23.263 --> 1:25:32.360 | |
So that makes especially for limited data | |
things more challenging and you want to somehow | |
1:25:32.360 --> 1:25:35.796 | |
learn that this is not important. | |
1:25:36.076 --> 1:25:39.944 | |
So there is the idea again okay. | |
1:25:39.944 --> 1:25:47.564 | |
Can we do some type of data augmentation | |
to better deal with this? | |
1:25:48.908 --> 1:25:55.735 | |
And again people can mainly use what has been | |
done in ASR and try to do the same things. | |
1:25:56.276 --> 1:26:02.937 | |
You can try to do a bit of noise and speed | |
perturbation, so playing the audio a bit slower | |
1:26:02.937 --> 1:26:08.563 | |
and a bit faster to get more samples, and then | |
you can train on all of them. | |
1:26:08.563 --> 1:26:14.928 | |
What is very important and very successful | |
recently is what is called SpecAugment. | |
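The speed perturbation mentioned above can be sketched with naive linear-interpolation resampling; real toolkits resample more carefully, this only illustrates the idea:

```python
import numpy as np

# Sketch of speed perturbation: resample the waveform by a factor,
# so factor > 1 plays faster (shorter signal) and factor < 1 slower.

def speed_perturb(waveform, factor):
    """Return the waveform resampled by linear interpolation."""
    n_out = int(round(len(waveform) / factor))
    positions = np.linspace(0, len(waveform) - 1, n_out)
    return np.interp(positions, np.arange(len(waveform)), waveform)
```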
1:26:15.235 --> 1:26:25.882 | |
The idea is that you directly work on all | |
your audio features, and you can try to mask them, | |
1:26:25.882 --> 1:26:29.014 | |
and that gives you more robustness. | |
1:26:29.469 --> 1:26:41.717 | |
What do they mean with masking? So this is | |
your audio feature, and there are different ways. | |
1:26:41.962 --> 1:26:47.252 | |
You can do what is referred to as | |
time masking. | |
1:26:47.252 --> 1:26:50.480 | |
That means you just set some frames to zero. | |
1:26:50.730 --> 1:26:58.003 | |
And even then you should still be able to | |
deal with it, because you can normally infer it from context. | |
1:26:57.937 --> 1:27:05.840 | |
Also without that you are getting more robust | |
and you can handle that, because then | |
1:27:05.840 --> 1:27:10.877 | |
many segments which have a different timing look | |
more similar. | |
1:27:11.931 --> 1:27:22.719 | |
You are not only doing that for time masking | |
but also for frequency masking so that if you | |
1:27:22.719 --> 1:27:30.188 | |
have here the frequency channels you mask a | |
frequency channel. | |
1:27:30.090 --> 1:27:33.089 | |
Thereby being able to better recognize these | |
things. | |
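The time and frequency masking described above can be sketched as follows, simplified to a single mask of each kind; this is an illustration, not the lecture's implementation:

```python
import numpy as np

# Sketch of SpecAugment-style masking: zero out a random span of
# frames (time masking) and a random span of frequency channels
# (frequency masking) in a (time, freq) feature matrix.

rng = np.random.default_rng(0)

def spec_augment(spec, max_t=10, max_f=5):
    """Return a copy of `spec` with one time mask and one frequency mask."""
    spec = spec.copy()
    T, F = spec.shape
    t = int(rng.integers(0, max_t + 1))      # time mask width
    t0 = int(rng.integers(0, T - t + 1))
    f = int(rng.integers(0, max_f + 1))      # frequency mask width
    f0 = int(rng.integers(0, F - f + 1))
    spec[t0:t0 + t, :] = 0.0                 # time masking
    spec[:, f0:f0 + f] = 0.0                 # frequency masking
    return spec
```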
1:27:35.695 --> 1:27:43.698 | |
With this we have had an overview of the two main | |
approaches for speech translation that is on | |
1:27:43.698 --> 1:27:51.523 | |
the one hand cascaded speech translation and | |
on the other hand we talked about end-to-end | |
1:27:51.523 --> 1:27:53.302 | |
speech translation. | |
1:27:53.273 --> 1:28:02.080 | |
It's like how to combine components and how they | |
work together for end-to-end speech translation. | |
1:28:02.362 --> 1:28:06.581 | |
There were data challenges and a bit about long | |
sequences. | |
1:28:07.747 --> 1:28:09.304 | |
Do we have any more questions? | |
1:28:11.451 --> 1:28:19.974 | |
Can you briefly describe the challenge in cascading | |
from translation to text-to-speech, because | |
1:28:19.974 --> 1:28:22.315 | |
I thought the translation. | |
1:28:25.745 --> 1:28:30.201 | |
Yes, so I mean that works; again, that is the easiest | |
thing. | |
1:28:30.201 --> 1:28:33.021 | |
What of course is challenging? | |
1:28:33.021 --> 1:28:40.751 | |
What can be challenging is how to make that | |
more lively, and things like the pronunciation. | |
1:28:40.680 --> 1:28:47.369 | |
And yeah, which things are more important, | |
and how to put things like that into the speech. | |
1:28:47.627 --> 1:28:53.866 | |
This is not in the normal text; otherwise it would sound | |
very monotone. | |
1:28:53.866 --> 1:28:57.401 | |
You want to add this information. | |
1:28:58.498 --> 1:29:02.656 | |
That is maybe one thing to make it a bit more | |
emotional. | |
1:29:02.656 --> 1:29:04.917 | |
That is maybe one thing which is challenging. | |
1:29:05.305 --> 1:29:13.448 | |
But you are right, out of the box, | |
1:29:13.448 --> 1:29:20.665 | |
if you have all the components, everything works decently. | |
1:29:20.800 --> 1:29:30.507 | |
Still, especially if you have a very monotone | |
voice, I think these are quite some open challenges. | |
1:29:30.750 --> 1:29:35.898 | |
Maybe another open challenge is that it's | |
not so much for the end product, but for the | |
1:29:35.898 --> 1:29:37.732 | |
development it is very important. | |
1:29:37.732 --> 1:29:40.099 | |
It's very hard to evaluate the quality. | |
1:29:40.740 --> 1:29:48.143 | |
There is no automatic way to do that, so | |
most systems are currently evaluated by human | |
1:29:48.143 --> 1:29:49.109 | |
evaluation. | |
1:29:49.589 --> 1:29:54.474 | |
So you cannot try hundreds of things and run | |
your BLEU score and get this score. | |
1:29:54.975 --> 1:30:00.609 | |
So therefore it is very important to have | |
some type of evaluation metric and that is | |
1:30:00.609 --> 1:30:01.825 | |
quite challenging. | |
1:30:08.768 --> 1:30:15.550 | |
And thanks for listening, and we'll have the | |
second part of speech translation on search. | |