Spaces:

retkowski
/

ytseg_demo

Running

App Files Files Community

ytseg_demo / demo_data /lectures /Lecture-18-18.07.2023 /English.vtt

retkowski

Add demo

cb71ef5 over 1 year ago

raw

history blame contribute delete

67.9 kB

	WEBVTT

	0:00:01.541 --> 0:00:06.926
	Okay, so we'll come back to today's lecture.

	0:00:08.528 --> 0:00:23.334
	We want to talk about is speech translation,
	so we'll have two lectures in this week about

	0:00:23.334 --> 0:00:26.589
	speech translation.

	0:00:27.087 --> 0:00:36.456
	And so in the last week we'll have some exercise
	and repetition.

	0:00:36.456 --> 0:00:46.690
	We want to look at what is now to do when
	we want to translate speech.

	0:00:46.946 --> 0:00:55.675
	So we want to address the specific challenges
	that occur when we switch from translating

	0:00:55.675 --> 0:00:56.754
	to speech.

	0:00:57.697 --> 0:01:13.303
	Today we will look at the more general picture
	out and build the systems.

	0:01:13.493 --> 0:01:23.645
	And then secondly an end approach where we
	are going to put in audio and generate.

	0:01:24.224 --> 0:01:41.439
	Which are the main dominant systems which
	are used in research and commercial systems.

	0:01:43.523 --> 0:01:56.879
	More general, what is the general task of
	speech translation that is shown here?

	0:01:56.879 --> 0:02:01.826
	The idea is we have a speech.

	0:02:02.202 --> 0:02:12.838
	Then we want to have a system which takes
	this audio and then translates it into another

	0:02:12.838 --> 0:02:14.033
	language.

	0:02:15.095 --> 0:02:20.694
	Then it's no longer as clear the output modality.

	0:02:20.694 --> 0:02:33.153
	In contrast, for humans we can typically have:
	So you can either have more textual translation,

	0:02:33.153 --> 0:02:37.917
	then you have subtitles, and the.

	0:02:38.538 --> 0:02:57.010
	Are you want to have it also in audio like
	it's done for human interpretation?

	0:02:57.417 --> 0:03:03.922
	See there is not the one best solution, so
	all of this one is always better.

	0:03:03.922 --> 0:03:09.413
	It heavily depends on what is the use of what
	the people prefer.

	0:03:09.929 --> 0:03:14.950
	For example, you can think of if you know
	a bit the source of language, but you're a

	0:03:14.950 --> 0:03:17.549
	bit unsure and don't understand everything.

	0:03:17.549 --> 0:03:23.161
	They may texture it out for this pattern because
	you can direct your gear to what was said and

	0:03:23.161 --> 0:03:26.705
	only if you're unsure you check down with your
	translation.

	0:03:27.727 --> 0:03:33.511
	Are another things that might be preferable
	to have a complete spoken of.

	0:03:34.794 --> 0:03:48.727
	So there are both ones for a long time in
	automatic systems focused mainly on text output.

	0:03:48.727 --> 0:04:06.711
	In most cases: But of course you can always
	hand them to text to speech systems which generates

	0:04:06.711 --> 0:04:09.960
	audio from that.

	0:04:12.772 --> 0:04:14.494
	Why should we care about that?

	0:04:14.494 --> 0:04:15.771
	Why should we do that?

	0:04:17.737 --> 0:04:24.141
	There is the nice thing that yeah, with a
	globalized world, we are able to now interact

	0:04:24.141 --> 0:04:25.888
	with a lot more people.

	0:04:25.888 --> 0:04:29.235
	You can do some conferences around the world.

	0:04:29.235 --> 0:04:31.564
	We can travel around the world.

	0:04:31.671 --> 0:04:37.802
	We can by Internet watch movies from all over
	the world and watch TV from all over the world.

	0:04:38.618 --> 0:04:47.812
	However, there is still this barrier that
	is mainly to watch videos, either in English

	0:04:47.812 --> 0:04:49.715
	or in a language.

	0:04:50.250 --> 0:05:00.622
	So what is currently happening in order to
	reach a large audience is that everybody.

	0:05:00.820 --> 0:05:07.300
	So if we are going, for example, to a conferences,
	these are international conferences.

	0:05:08.368 --> 0:05:22.412
	However, everybody will then speak English
	since that is some of the common language that

	0:05:22.412 --> 0:05:26.001
	everybody understands.

	0:05:26.686 --> 0:05:32.929
	So on the other hand, we cannot like have
	human interpreters like they ever work.

	0:05:32.892 --> 0:05:37.797
	You have that maybe in the European Parliament
	or in important business meetings.

	0:05:38.078 --> 0:05:47.151
	But this is relatively expensive, and so the
	question is, can we enable communication in

	0:05:47.151 --> 0:05:53.675
	your mother-in-law without having to have human
	interpretation?

	0:05:54.134 --> 0:06:04.321
	And there like speech translation can be helpful
	in order to help you bridge this gap.

	0:06:06.726 --> 0:06:22.507
	In this case, there are different scenarios
	of how you can apply speech translation.

	0:06:22.422 --> 0:06:29.282
	That's typically more interactive than we
	are talking about text translation.

	0:06:29.282 --> 0:06:32.800
	Text translation is most commonly used.

	0:06:33.153 --> 0:06:41.637
	Course: Nowadays there's things like chat
	and so on where it could also be interactive.

	0:06:42.082 --> 0:06:48.299
	In contrast to speech translation, that is
	less static, so there is different ways of

	0:06:48.299 --> 0:06:48.660
	how.

	0:06:49.149 --> 0:07:00.544
	The one scenario is what is called a translation
	where you first get an input, then you translate

	0:07:00.544 --> 0:07:03.799
	this fixed input, and then.

	0:07:04.944 --> 0:07:12.823
	With me, which means you have always like
	fixed, yeah fixed challenges which you need

	0:07:12.823 --> 0:07:14.105
	to translate.

	0:07:14.274 --> 0:07:25.093
	You don't need to like beat your mind what
	are the boundaries where there's an end.

	0:07:25.405 --> 0:07:31.023
	Also, there is no overlapping.

	0:07:31.023 --> 0:07:42.983
	There is always a one-person sentence that
	is getting translated.

	0:07:43.443 --> 0:07:51.181
	Of course, this has a disadvantage that it
	makes the conversation a lot longer because

	0:07:51.181 --> 0:07:55.184
	you always have only speech and translation.

	0:07:57.077 --> 0:08:03.780
	For example, if you would use that for a presentation
	there would be yeah quite get quite long, if

	0:08:03.780 --> 0:08:09.738
	I would just imagine you sitting here in the
	lecture I would say three sentences that I

	0:08:09.738 --> 0:08:15.765
	would wait for this interpreter to translate
	it, then I would say the next two sentences

	0:08:15.765 --> 0:08:16.103
	and.

	0:08:16.676 --> 0:08:28.170
	That is why in these situations, for example,
	if you have a direct conversation with a patient,

	0:08:28.170 --> 0:08:28.888
	then.

	0:08:29.209 --> 0:08:32.733
	But still there it's too big to be taking
	them very long.

	0:08:33.473 --> 0:08:42.335
	And that's why there's also the research on
	simultaneous translation, where the idea is

	0:08:42.335 --> 0:08:43.644
	in parallel.

	0:08:43.964 --> 0:08:46.179
	That Is the Dining for Human.

	0:08:46.126 --> 0:08:52.429
	Interpretation like if you think of things
	like the European Parliament where they of

	0:08:52.429 --> 0:08:59.099
	course not only speak always one sentence but
	are just giving their speech and in parallel

	0:08:59.099 --> 0:09:04.157
	human interpreters are translating the speech
	into another language.

	0:09:04.985 --> 0:09:12.733
	The same thing is interesting for automatic
	speech translation where we in parallel generate

	0:09:12.733 --> 0:09:13.817
	translation.

	0:09:15.415 --> 0:09:32.271
	The challenges then, of course, are that we
	need to segment our speech into somehow's chunks.

	0:09:32.152 --> 0:09:34.903
	We just looked for the dots we saw.

	0:09:34.903 --> 0:09:38.648
	There are some challenges that we have to
	check.

	0:09:38.648 --> 0:09:41.017
	The Doctor may not understand.

	0:09:41.201 --> 0:09:47.478
	But in generally getting sentence boundary
	sentences is not a really research question.

	0:09:47.647 --> 0:09:51.668
	While in speech translation, this is not that
	easy.

	0:09:51.952 --> 0:10:05.908
	Either getting that in the audio is difficult
	because it's not like we typically do breaks

	0:10:05.908 --> 0:10:09.742
	when there's a sentence.

	0:10:10.150 --> 0:10:17.432
	And even if you then see the transcript and
	would have to add the punctuation, this is

	0:10:17.432 --> 0:10:18.101
	not as.

	0:10:20.340 --> 0:10:25.942
	Another question is how many speakers we have
	here.

	0:10:25.942 --> 0:10:31.759
	In presentations you have more like a single
	speaker.

	0:10:31.931 --> 0:10:40.186
	That is normally easier from the part of audio
	processing, so in general in speech translation.

	0:10:40.460 --> 0:10:49.308
	You can have different challenges and they
	can be of different components.

	0:10:49.308 --> 0:10:57.132
	In addition to translation, you have: And
	if you're not going, for example, the magical

	0:10:57.132 --> 0:11:00.378
	speaker, there are significantly additional
	challenges.

	0:11:00.720 --> 0:11:10.313
	So we as humans we are very good in filtering
	out noises, or if two people speak in parallel

	0:11:10.313 --> 0:11:15.058
	to like separate these two speakers and hear.

	0:11:15.495 --> 0:11:28.300
	However, if you want to do that with automatic
	systems that is very challenging so that you

	0:11:28.300 --> 0:11:33.172
	can separate the speakers so that.

	0:11:33.453 --> 0:11:41.284
	For the more of you have this multi-speaker
	scenario, typically it's also less well prepared.

	0:11:41.721 --> 0:11:45.807
	So you're getting very, we'll talk about the
	spontaneous effects.

	0:11:46.186 --> 0:11:53.541
	So people like will stop in the middle of
	the sentence, they change their sentence, and

	0:11:53.541 --> 0:12:01.481
	so on, and like filtering these, these fluences
	out of the text and working with them is often

	0:12:01.481 --> 0:12:02.986
	very challenging.

	0:12:05.565 --> 0:12:09.144
	So these are all additional challenges when
	you have multiples.

	0:12:10.330 --> 0:12:19.995
	Then there's a question of an online or offline
	system, sometimes textbook station.

	0:12:19.995 --> 0:12:21.836
	We also mainly.

	0:12:21.962 --> 0:12:36.507
	That means you can take the whole text and
	you can translate it in a badge.

	0:12:37.337 --> 0:12:44.344
	However, for speech translation there's also
	several scenarios where this is the case.

	0:12:44.344 --> 0:12:51.513
	For example, when you're translating a movie,
	it's not only that you don't have to do it

	0:12:51.513 --> 0:12:54.735
	live, but you can take the whole movie.

	0:12:55.215 --> 0:13:05.473
	However, there is also a lot of situations
	where you don't have this opportunity like

	0:13:05.473 --> 0:13:06.785
	or sports.

	0:13:07.247 --> 0:13:13.963
	And you don't want to like first like let
	around a sports event and then like show in

	0:13:13.963 --> 0:13:19.117
	the game three hours later then there is not
	really any interest.

	0:13:19.399 --> 0:13:31.118
	So you have to do it live, and so we have
	the additional challenge of translating the

	0:13:31.118 --> 0:13:32.208
	system.

	0:13:32.412 --> 0:13:42.108
	There are still things on the one end of course.

	0:13:42.108 --> 0:13:49.627
	It needs to be real time translation.

	0:13:49.869 --> 0:13:54.153
	It's taking longer, then you're getting more
	and more and more delayed.

	0:13:55.495 --> 0:14:05.245
	So it maybe seems simple, but there have been
	research systems which are undertime slower

	0:14:05.245 --> 0:14:07.628
	than real time or so.

	0:14:07.628 --> 0:14:15.103
	If you want to show what is possible with
	the best current systems,.

	0:14:16.596 --> 0:14:18.477
	But that isn't even not enough.

	0:14:18.918 --> 0:14:29.593
	The other question: You can have a system
	which is even like several times real time.

	0:14:29.509 --> 0:14:33.382
	In less than one second, it might still be
	not useful.

	0:14:33.382 --> 0:14:39.648
	Then the question is like the latency, so
	how much time has passed since you can produce

	0:14:39.648 --> 0:14:39.930
	an.

	0:14:40.120 --> 0:14:45.814
	It might be that in average you can like concress
	it, but you still can't do it directly.

	0:14:45.814 --> 0:14:51.571
	You need to do it after, or you need to have
	the full context of thirty seconds before you

	0:14:51.571 --> 0:14:55.178
	can output something, and then you have a large
	latency.

	0:14:55.335 --> 0:15:05.871
	So it can be that do it as fast as it is produced,
	but have to wait until the food.

	0:15:06.426 --> 0:15:13.772
	So we'll look into that on Thursday how we
	can then generate translations that are having

	0:15:13.772 --> 0:15:14.996
	a low latency.

	0:15:15.155 --> 0:15:21.587
	You can imagine, for example, in German that
	it's maybe quite challenging since the word

	0:15:21.587 --> 0:15:23.466
	is often like at the end.

	0:15:23.466 --> 0:15:30.115
	If you're using perfect, like in harbor and
	so on, and then in English you have to directly

	0:15:30.115 --> 0:15:30.983
	produce it.

	0:15:31.311 --> 0:15:38.757
	So if you really want to have no context you
	might need to wait until the end of the sentence.

	0:15:41.021 --> 0:15:45.920
	Besides that, of course, offline and it gives
	you more additional help.

	0:15:45.920 --> 0:15:52.044
	I think last week you talked about context
	based systems that typically have context from

	0:15:52.044 --> 0:15:55.583
	maybe from the past but maybe also from the
	future.

	0:15:55.595 --> 0:16:02.923
	Then, of course, you cannot use anything from
	the future in this case, but you can use it.

	0:16:07.407 --> 0:16:24.813
	Finally, there is a thing about how you want
	to present it to the audience in automatic

	0:16:24.813 --> 0:16:27.384
	translation.

	0:16:27.507 --> 0:16:31.361
	There is also the thing that you want to do.

	0:16:31.361 --> 0:16:35.300
	All your outfits are running like the system.

	0:16:35.996 --> 0:16:36.990
	Top of it.

	0:16:36.990 --> 0:16:44.314
	Then they answered questions: How should it
	be spoken so you can do things like.

	0:16:46.586 --> 0:16:52.507
	Voice cloning so that it's like even the same
	voice than the original speaker.

	0:16:53.994 --> 0:16:59.081
	And if you do text or dubbing then there might
	be additional constraints.

	0:16:59.081 --> 0:17:05.729
	So if you think about subtitles: And they
	should be readable, and we are too big to speak

	0:17:05.729 --> 0:17:07.957
	faster than you can maybe read.

	0:17:08.908 --> 0:17:14.239
	So you might need to shorten your text.

	0:17:14.239 --> 0:17:20.235
	People say that a subtitle can be two lines.

	0:17:20.235 --> 0:17:26.099
	Each line can be this number of characters.

	0:17:26.346 --> 0:17:31.753
	So you cannot like if you have too long text,
	we might need to shorten that to do that.

	0:17:32.052 --> 0:17:48.272
	Similarly, if you think about dubbing, if
	you want to produce dubbing voice, then the

	0:17:48.272 --> 0:17:50.158
	original.

	0:17:51.691 --> 0:17:59.294
	Here is another problem that we have different
	settings like a more formal setting and let's

	0:17:59.294 --> 0:18:00.602
	have different.

	0:18:00.860 --> 0:18:09.775
	If you think about the United Nations maybe
	you want more former things and between friends

	0:18:09.775 --> 0:18:14.911
	maybe that former and there are languages which
	use.

	0:18:15.355 --> 0:18:21.867
	That is sure that is an important research
	question.

	0:18:21.867 --> 0:18:28.010
	To do that would more think of it more generally.

	0:18:28.308 --> 0:18:32.902
	That's important in text translation.

	0:18:32.902 --> 0:18:41.001
	If you translate a letter to your boss, it
	should sound different.

	0:18:42.202 --> 0:18:53.718
	So there is a question of how you can do this
	style work on how you can do that.

	0:18:53.718 --> 0:19:00.542
	For example, if you can specify that you might.

	0:19:00.460 --> 0:19:10.954
	So you can tax the center or generate an informal
	style because, as you correctly said, this

	0:19:10.954 --> 0:19:16.709
	is especially challenging again in the situations.

	0:19:16.856 --> 0:19:20.111
	Of course, there are ways of like being formal
	or less formal.

	0:19:20.500 --> 0:19:24.846
	But it's not like as clear as you do it, for
	example, in German where you have the twin

	0:19:24.846 --> 0:19:24.994
	C.

	0:19:25.165 --> 0:19:26.855
	So there is no one to own mapping.

	0:19:27.287 --> 0:19:34.269
	If you want to make that sure you can build
	a system which generates different styles in

	0:19:34.269 --> 0:19:38.662
	the output, so yeah that's definitely also
	a challenge.

	0:19:38.662 --> 0:19:43.762
	It just may be not mentioned here because
	it's not specific now.

	0:19:44.524 --> 0:19:54.029
	Generally, of course, these are all challenges
	in how to customize and adapt systems to use

	0:19:54.029 --> 0:19:56.199
	cases with specific.

	0:20:00.360 --> 0:20:11.020
	Speech translation has been done for quite
	a while and it's maybe not surprising it started

	0:20:11.020 --> 0:20:13.569
	with more simple use.

	0:20:13.793 --> 0:20:24.557
	So people first started to look into, for
	example, limited to main translations.

	0:20:24.557 --> 0:20:33.726
	The tourist was typically application if you're
	going to a new city.

	0:20:34.834 --> 0:20:44.028
	Then there are several open things of doing
	open domain translation, especially people.

	0:20:44.204 --> 0:20:51.957
	Like where there's a lot of data so you could
	build systems which are more open to main,

	0:20:51.957 --> 0:20:55.790
	but of course it's still a bit restrictive.

	0:20:55.790 --> 0:20:59.101
	It's true in the European Parliament.

	0:20:59.101 --> 0:21:01.888
	People talk about anything but.

	0:21:02.162 --> 0:21:04.820
	And so it's not completely used for everything.

	0:21:05.165 --> 0:21:11.545
	Nowadays we've seen this technology in a lot
	of different situations guess you ought.

	0:21:11.731 --> 0:21:17.899
	Use it so there is some basic technologies
	where you can use them already.

	0:21:18.218 --> 0:21:33.599
	There is still a lot of open questions going
	from if you are going to really spontaneous

	0:21:33.599 --> 0:21:35.327
	meetings.

	0:21:35.655 --> 0:21:41.437
	Then these systems typically work good for
	like some languages where we have a lot of

	0:21:41.437 --> 0:21:42.109
	friendly.

	0:21:42.742 --> 0:21:48.475
	But if we want to go for really low resource
	data then things are often challenging.

	0:21:48.448 --> 0:22:02.294
	Last week we had a workshop on spoken language
	translation and there is a low-resource data

	0:22:02.294 --> 0:22:05.756
	track which is dialed.

	0:22:05.986 --> 0:22:06.925
	And so on.

	0:22:06.925 --> 0:22:14.699
	All these languages can still then have significantly
	lower performance than for a higher.

	0:22:17.057 --> 0:22:20.126
	So how does this work?

	0:22:20.126 --> 0:22:31.614
	If we want to do speech translation, there's
	like three basic technology: So on the one

	0:22:31.614 --> 0:22:40.908
	hand, it's automatic speech recognition where
	automatic speech recognition normally transacts

	0:22:40.908 --> 0:22:41.600
	audio.

	0:22:42.822 --> 0:22:58.289
	Then what we talked about here is machine
	translation, which takes input and translates

	0:22:58.289 --> 0:23:01.276
	into the target.

	0:23:02.642 --> 0:23:11.244
	And the very simple model now, if you think
	about it, is of course the similar combination.

	0:23:11.451 --> 0:23:14.740
	We have solved all these parts in a salt bedrock.

	0:23:14.975 --> 0:23:31.470
	We are working on all these problems there,
	so if we want to do a speech transition, maybe.

	0:23:31.331 --> 0:23:35.058
	Such problems we just put all these combinations
	together.

	0:23:35.335 --> 0:23:45.130
	And then you get what you have as a cascading
	system, which first is so you take your audio.

	0:23:45.045 --> 0:23:59.288
	To take this as input and generate the output,
	and then you take this text output, put it

	0:23:59.288 --> 0:24:00.238
	into.

	0:24:00.640 --> 0:24:05.782
	So in that way you have now.

	0:24:08.008 --> 0:24:18.483
	Have now a solution for generating doing speech
	translation for these types of systems, and

	0:24:18.483 --> 0:24:20.874
	this type is called.

	0:24:21.681 --> 0:24:28.303
	It is still often reaching state of the art,
	however it has benefits and disadvantages.

	0:24:28.668 --> 0:24:41.709
	So the one big benefit is we have independent
	components and some of that is nice.

	0:24:41.709 --> 0:24:48.465
	So if there are great ideas put into your.

	0:24:48.788 --> 0:24:57.172
	And then some other times people develop a
	new good way of how to improve.

	0:24:57.172 --> 0:25:00.972
	You can also take this model and.

	0:25:01.381 --> 0:25:07.639
	So you can leverage improvements from all
	the different communities in order to adapt.

	0:25:08.288 --> 0:25:18.391
	Furthermore, we would like to see, since all
	of them is learning, that the biggest advantage

	0:25:18.391 --> 0:25:23.932
	is that we have training data for each individual.

	0:25:24.164 --> 0:25:34.045
	So there's a lot less training data where
	you have the English audio, so it's easy to

	0:25:34.045 --> 0:25:34.849
	train.

	0:25:36.636 --> 0:25:48.595
	Now am a one that we will focus on when talking
	about the cascaded approach is that often it.

	0:25:48.928 --> 0:25:58.049
	So you need to adapt each component a bit
	so that it's adapting to its input and.

	0:25:58.278 --> 0:26:07.840
	So we'll focus there especially on how to
	combine and since said the main focus is: So

	0:26:07.840 --> 0:26:18.589
	if you would directly use an output that might
	not work as perfect as you would,.

	0:26:18.918 --> 0:26:33.467
	So a major challenge when building a cascade
	of speech translation systems is how can we

	0:26:33.467 --> 0:26:38.862
	adapt these systems and how can?

	0:26:41.681 --> 0:26:43.918
	So why, why is this the kick?

	0:26:44.164 --> 0:26:49.183
	So it would look quite nice.

	0:26:49.183 --> 0:26:54.722
	It seems to be very reasonable.

	0:26:54.722 --> 0:26:58.356
	You have some audio.

	0:26:58.356 --> 0:27:03.376
	You put it into your system.

	0:27:04.965 --> 0:27:23.759
	However, this is a bit which for thinking
	because if you speak what you speak is more.

	0:27:23.984 --> 0:27:29.513
	And especially all that rarely have punctuations
	in there, and while the anti-system.

	0:27:29.629 --> 0:27:43.247
	They assume, of course, that it's a full sentence,
	that you don't have there some.

	0:27:43.523 --> 0:27:55.087
	So we see we want to get this bridge between
	the output and the input, and we might need

	0:27:55.087 --> 0:27:56.646
	additional.

	0:27:58.778 --> 0:28:05.287
	And that is typically what is referred to
	as re-case and re-piculation system.

	0:28:05.445 --> 0:28:15.045
	So the idea is that you might be good to have
	something like an adapter here in between,

	0:28:15.045 --> 0:28:20.007
	which really tries to adapt the speech input.

	0:28:20.260 --> 0:28:28.809
	That can be at different levels, but it might
	be even more rephrasing.

	0:28:29.569 --> 0:28:40.620
	If you think of the sentence, if you have
	false starts, then when speaking you sometimes

	0:28:40.620 --> 0:28:41.986
	assume oh.

	0:28:41.901 --> 0:28:52.224
	You restart it, then you might want to delete
	that because if you read it you don't want

	0:28:52.224 --> 0:28:52.688
	to.

	0:28:56.096 --> 0:28:57.911
	Why is this yeah?

	0:28:57.911 --> 0:29:01.442
	The case in punctuation important.

	0:29:02.622 --> 0:29:17.875
	One important thing is directly for the challenge
	is when speak is just a continuous stream of

	0:29:17.875 --> 0:29:18.999
	words.

	0:29:19.079 --> 0:29:27.422
	Then just speaking and punctuation marks,
	and so on are all notes are there in natural.

	0:29:27.507 --> 0:29:30.281
	However, they are of course important.

	0:29:30.410 --> 0:29:33.877
	They are first of all very important for readability.

	0:29:34.174 --> 0:29:41.296
	If you have once read a text without characterization
	marks, you need more time to process it.

	0:29:41.861 --> 0:29:47.375
	They're sometimes even semantically important.

	0:29:47.375 --> 0:29:52.890
	There's a list for grandpa and big difference.

	0:29:53.553 --> 0:30:00.089
	And so this, of course, with humans as well,
	it'd be easy to distinguish by again doing

	0:30:00.089 --> 0:30:01.426
	it automatically.

	0:30:01.426 --> 0:30:06.180
	It's more typically and finally, in our case,
	if we want to do.

	0:30:06.386 --> 0:30:13.672
	We are assuming normally sentence wise, so
	we always enter out system which is like one

	0:30:13.672 --> 0:30:16.238
	sentence by the next sentence.

	0:30:16.736 --> 0:30:26.058
	If you want to do speech translation of a
	continuous stream, then of course what are

	0:30:26.058 --> 0:30:26.716
	your.

	0:30:28.168 --> 0:30:39.095
	And the easiest and most straightforward situation
	is, of course, if you have a continuously.

	0:30:39.239 --> 0:30:51.686
	And if it generates your calculation marks,
	it's easy to separate your text into sentences.

	0:30:52.032 --> 0:31:09.157
	So we can again reuse our system and thereby
	have a normal anti-system on this continuous.

	0:31:14.174 --> 0:31:21.708
	These are a bit older numbers, but they show
	you a bit also how important all that is.

	0:31:21.861 --> 0:31:31.719
	So this was so the best is if you do insurance
	transcript you get roughly a blue score of.

	0:31:32.112 --> 0:31:47.678
	If you have as it is with some air based length
	segmentation, then you get something like.

	0:31:47.907 --> 0:31:57.707
	If you then use the segments correctly as
	it's done from the reference, you get one blue

	0:31:57.707 --> 0:32:01.010
	point and another blue point.

	0:32:01.201 --> 0:32:08.085
	So you see that you have been total like nearly
	two blue points just by having the correct

	0:32:08.085 --> 0:32:09.144
	segmentation.

	0:32:10.050 --> 0:32:21.178
	This shows you that it's important to estimate
	as good a segmentation because even if you

	0:32:21.178 --> 0:32:25.629
	still have the same arrows in your.

	0:32:27.147 --> 0:32:35.718
	Is to be into this movement, which is also
	not as unusual as we do in translation.

	0:32:36.736 --> 0:32:40.495
	So this is done by looking at the reference.

	0:32:40.495 --> 0:32:48.097
	It should show you how much these scores are
	done to just analyze how important are these.

	0:32:48.097 --> 0:32:55.699
	So you take the A's R transcript and you look
	at the reference and it's only done for the.

	0:32:55.635 --> 0:33:01.720
	If we have optimal punctuations, if our model
	is as good and optimal, so as a reference we

	0:33:01.720 --> 0:33:15.602
	could: But of course this is not how we can
	do it in reality because we don't have access

	0:33:15.602 --> 0:33:16.990
	to that.

	0:33:17.657 --> 0:33:24.044
	Because one would invade you okay, why should
	we do that?

	0:33:24.044 --> 0:33:28.778
	If we have the optimal then it's possible.

	0:33:31.011 --> 0:33:40.060
	And yeah, that is why a typical system does
	not only yeah depend on if our key component.

	0:33:40.280 --> 0:33:56.468
	But in between you have this segmentation
	in there in order to have more input and.

	0:33:56.496 --> 0:34:01.595
	You can also prefer often this invariability
	over the average study.

	0:34:04.164 --> 0:34:19.708
	So the task of segmentation is to re-segment
	the text into what is called sentence like

	0:34:19.708 --> 0:34:24.300
	unit, so you also assign.

	0:34:24.444 --> 0:34:39.421
	That is more a traditional thing because for
	a long time case information was not provided.

	0:34:39.879 --> 0:34:50.355
	So there was any good ASR system which directly
	provides you with case information and this

	0:34:50.355 --> 0:34:52.746
	may not be any more.

	0:34:56.296 --> 0:35:12.060
	How that can be done is you can have three
	different approaches because that was some

	0:35:12.060 --> 0:35:16.459
	of the most common one.

	0:35:17.097 --> 0:35:23.579
	Course: That is not the only thing you can
	do.

	0:35:23.579 --> 0:35:30.888
	You can also try to train the data to generate
	that.

	0:35:31.891 --> 0:35:41.324
	On the other hand, that is of course more
	challenging.

	0:35:41.324 --> 0:35:47.498
	You need some type of segmentation.

	0:35:48.028 --> 0:35:59.382
	Mean, of course, you can easily remove and
	capture information from your data and then

	0:35:59.382 --> 0:36:05.515
	play a system which does non-case to non-case.

	0:36:05.945 --> 0:36:15.751
	You can also, of course, try to combine these
	two into one so that you directly translate

	0:36:15.751 --> 0:36:17.386
	from non-case.

	0:36:17.817 --> 0:36:24.722
	What is more happening by now is that you
	also try to provide these to that you provide.

	0:36:24.704 --> 0:36:35.267
	The ASR is a segmentation directly get these
	information in there.

	0:36:35.267 --> 0:36:45.462
	The systems that combine the A's and A's are:
	Yes, there is a valid rule.

	0:36:45.462 --> 0:36:51.187
	What we come later to today is that you do
	audio to text in the target language.

	0:36:51.187 --> 0:36:54.932
	That is what is referred to as an end to end
	system.

	0:36:54.932 --> 0:36:59.738
	So it's directly and this is still more often
	done for text output.

	0:36:59.738 --> 0:37:03.414
	But there is also end to end system which
	directly.

	0:37:03.683 --> 0:37:09.109
	There you have additional challenges by how
	to even measure if things are correct or not.

	0:37:09.089 --> 0:37:10.522
	Mean for text.

	0:37:10.522 --> 0:37:18.073
	You can mention, in other words, that for
	audio the audio signal is even more.

	0:37:18.318 --> 0:37:27.156
	That's why it's currently mostly speech to
	text, but that is one single system, but of

	0:37:27.156 --> 0:37:27.969
	course.

	0:37:32.492 --> 0:37:35.605
	Yeah, how can you do that?

	0:37:35.605 --> 0:37:45.075
	You can do adding these calculation information:
	Will look into three systems.

	0:37:45.075 --> 0:37:53.131
	You can do that as a sequence labeling problem
	or as a monolingual.

	0:37:54.534 --> 0:37:57.145
	Let's have a little bit of a series.

	0:37:57.145 --> 0:37:59.545
	This was some of the first ideas.

	0:37:59.545 --> 0:38:04.626
	There's the idea where you try to do it mainly
	based on language model.

	0:38:04.626 --> 0:38:11.471
	So how probable is that there is a punctuation
	that was done with like old style engram language

	0:38:11.471 --> 0:38:12.883
	models to visually.

	0:38:13.073 --> 0:38:24.687
	So you can, for example, if you have a program
	language model to calculate the score of Hello,

	0:38:24.687 --> 0:38:25.787
	how are?

	0:38:25.725 --> 0:38:33.615
	And then you compare this probability and
	take the one which has the highest probability.

	0:38:33.615 --> 0:38:39.927
	You might have something like if you have
	very long pauses, you anyway.

	0:38:40.340 --> 0:38:51.953
	So this is a very easy model, which only calculates
	some language model probabilities, and however

	0:38:51.953 --> 0:39:00.023
	the advantages of course are: And then, of
	course, in general, so what we will look into

	0:39:00.023 --> 0:39:06.249
	here is that maybe interesting is that most
	of the systems, also the advance, are really

	0:39:06.249 --> 0:39:08.698
	mainly focused purely on the text.

	0:39:09.289 --> 0:39:19.237
	If you think about how to insert punctuation
	marks, maybe your first idea would have been

	0:39:19.237 --> 0:39:22.553
	we can use pause information.

	0:39:23.964 --> 0:39:30.065
	But however interestingly most systems that
	use are really focusing on the text.

	0:39:31.151 --> 0:39:34.493
	There are several reasons.

	0:39:34.493 --> 0:39:44.147
	One is that it's easier to get training data
	so you only need pure text data.

	0:39:46.806 --> 0:40:03.221
	The next way you can do it is you can make
	it as a secret labeling tax or something like

	0:40:03.221 --> 0:40:04.328
	that.

	0:40:04.464 --> 0:40:11.734
	Then you have how there is nothing in you,
	and there is a.

	0:40:11.651 --> 0:40:15.015
	A question.

	0:40:15.315 --> 0:40:31.443
	So you have the number of labels, the number
	of punctuation symbols you have for the basic

	0:40:31.443 --> 0:40:32.329
	one.

	0:40:32.892 --> 0:40:44.074
	Typically nowadays it would use something
	like bird, and then you can train a sister.

	0:40:48.168 --> 0:40:59.259
	Any questions to that then it would probably
	be no contrary, you know, or not.

	0:41:00.480 --> 0:41:03.221
	Yeah, you have definitely a labeled imbalance.

	0:41:04.304 --> 0:41:12.405
	Think that works relatively well and haven't
	seen that.

	0:41:12.405 --> 0:41:21.085
	It's not a completely crazy label, maybe twenty
	times more.

	0:41:21.561 --> 0:41:29.636
	It can and especially for the more rare things
	mean, the more rare things is question marks.

	0:41:30.670 --> 0:41:43.877
	At least for question marks you have typically
	very strong indicator words.

	0:41:47.627 --> 0:42:03.321
	And then what was done for quite a long time
	can we know how to do machine translation?

	0:42:04.504 --> 0:42:12.640
	So the idea is, can we just translate non
	punctuated English into punctuated English

	0:42:12.640 --> 0:42:14.650
	and do it correctly?

	0:42:15.855 --> 0:42:25.344
	So what you need is something like this type
	of data where the source doesn't have punctuation.

	0:42:25.845 --> 0:42:30.641
	Course: A year is already done.

	0:42:30.641 --> 0:42:36.486
	You have to make it a bit challenging.

	0:42:41.661 --> 0:42:44.550
	Yeah, that is true.

	0:42:44.550 --> 0:42:55.237
	If you think about the normal trained age,
	you have to do one thing more.

	0:42:55.237 --> 0:43:00.724
	Is it otherwise difficult to predict?

	0:43:05.745 --> 0:43:09.277
	Here it's already this already looks different
	than normal training data.

	0:43:09.277 --> 0:43:09.897
	What is the.

	0:43:10.350 --> 0:43:15.305
	People want to use this transcript of speech.

	0:43:15.305 --> 0:43:19.507
	We'll probably go to our text editors.

	0:43:19.419 --> 0:43:25.906
	Yes, that is all already quite too difficult.

	0:43:26.346 --> 0:43:33.528
	Mean, that's making things a lot better with
	the first and easiest thing is you have to

	0:43:33.528 --> 0:43:35.895
	randomly cut your sentences.

	0:43:35.895 --> 0:43:43.321
	So if you take just me normally we have one
	sentence per line and if you take this as your

	0:43:43.321 --> 0:43:44.545
	training data.

	0:43:44.924 --> 0:43:47.857
	And that is, of course, not very helpful.

	0:43:48.208 --> 0:44:01.169
	So in order to build the training corpus for
	doing punctuation you randomly cut your sentences

	0:44:01.169 --> 0:44:08.264
	and then you can remove all your punctuation
	marks.

	0:44:08.528 --> 0:44:21.598
	Because of course there is no longer to do
	when you have some random segments in your

	0:44:21.598 --> 0:44:22.814
	system.

	0:44:25.065 --> 0:44:37.984
	And then you can, for example, if you then
	have generated your punctuation marks before

	0:44:37.984 --> 0:44:41.067
	going to the system.

	0:44:41.221 --> 0:44:54.122
	And that is an important thing, which we like
	to see is more challenging for end systems.

	0:44:54.122 --> 0:45:00.143
	We can change the segmentation, so maybe.

	0:45:00.040 --> 0:45:06.417
	You can, then if you're combining these things
	you can change the segmentation here, so.

	0:45:06.406 --> 0:45:18.178
	While you have ten new ten segments in your,
	you might only have five ones in your anymore.

	0:45:18.178 --> 0:45:18.946
	Then.

	0:45:19.259 --> 0:45:33.172
	Which might be more useful or helpful in because
	you have to reorder things and so on.

	0:45:33.273 --> 0:45:43.994
	And if you think of the wrong segmentation
	then you cannot reorder things from the beginning

	0:45:43.994 --> 0:45:47.222
	to the end of the sentence.

	0:45:49.749 --> 0:45:58.006
	Okay, so much about segmentation do you have
	any more questions about that?

	0:46:02.522 --> 0:46:21.299
	Then there is one additional thing you can
	do, and that is when we refer to the idea.

	0:46:21.701 --> 0:46:29.356
	And when you get input there might be some
	arrows in there, so it might not be perfect.

	0:46:29.889 --> 0:46:36.322
	So the question is, can we adapt to that?

	0:46:36.322 --> 0:46:45.358
	And can the system be improved by saying that
	it can some.

	0:46:45.265 --> 0:46:50.591
	So that is as aware that before there is a.

	0:46:50.490 --> 0:46:55.449
	Their arm might not be the best one.

	0:46:55.935 --> 0:47:01.961
	There are different ways of dealing with them.

	0:47:01.961 --> 0:47:08.116
	You can use a best list but several best lists.

	0:47:08.408 --> 0:47:16.711
	So the idea is that you're not only telling
	the system this is the transcript, but here

	0:47:16.711 --> 0:47:18.692
	I'm not going to be.

	0:47:19.419 --> 0:47:30.748
	Or that you can try to make it more robust
	towards arrows from an system so that.

	0:47:32.612 --> 0:47:48.657
	Interesting what is often done is hope convince
	you it might be a good idea to deal.

	0:47:48.868 --> 0:47:57.777
	The interesting thing is if you're looking
	into a lot of systems, this is often ignored,

	0:47:57.777 --> 0:48:04.784
	so they are not adapting their T-system to
	this type of A-S-R system.

	0:48:05.345 --> 0:48:15.232
	So it's not really doing any handling of Arab,
	and the interesting thing is often works as

	0:48:15.232 --> 0:48:15.884
	good.

	0:48:16.516 --> 0:48:23.836
	And one reason is, of course, one reason is
	if the ASR system does not arrow up to like

	0:48:23.836 --> 0:48:31.654
	a challenging situation, and then the antisystem
	is really for the antisystem hard to detect.

	0:48:31.931 --> 0:48:39.375
	If it would be easy for the system to detect
	the error you would integrate this information

	0:48:39.375 --> 0:48:45.404
	into: That is not always the case, but that
	of course makes it a bit challenging, and that's

	0:48:45.404 --> 0:48:49.762
	why there is a lot of systems where it's not
	explicitly handled how to deal with.

	0:48:52.912 --> 0:49:06.412
	But of course it might be good, so one thing
	is you can give him a best list and you can

	0:49:06.412 --> 0:49:09.901
	translate every entry.

	0:49:10.410 --> 0:49:17.705
	And then you have two scores like the anti-probability
	and the square probability.

	0:49:18.058 --> 0:49:25.695
	Combine them and then generate or output the
	output from what has the best combined.

	0:49:26.366 --> 0:49:29.891
	And then it might no longer be the best.

	0:49:29.891 --> 0:49:38.144
	It might like we had a bean search, so this
	has the best score, but this has a better combined.

	0:49:39.059 --> 0:49:46.557
	The problem sometimes works, but the problem
	is that the anti-system might then tend to

	0:49:46.557 --> 0:49:52.777
	just translate not the correct sentence but
	the one easier to translate.

	0:49:53.693 --> 0:50:03.639
	You can also generate a more compact representation
	of this invest in it by having this type of

	0:50:03.639 --> 0:50:04.467
	graphs.

	0:50:05.285 --> 0:50:22.952
	Lettices: So then you could like try to do
	a graph to text translation so you can translate.

	0:50:22.802 --> 0:50:26.582
	Where like all possibilities, by the way our
	systems are invented.

	0:50:26.906 --> 0:50:31.485
	So it can be like a hostage, a conference
	with some programs.

	0:50:31.591 --> 0:50:35.296
	So the highest probability is here.

	0:50:35.296 --> 0:50:41.984
	Conference is being recorded, but there are
	other possibilities.

	0:50:42.302 --> 0:50:53.054
	And you can take all of this information out
	there with your probabilities.

	0:50:59.980 --> 0:51:07.614
	But we'll see this type of arrow propagation
	that if you have an error that this might then

	0:51:07.614 --> 0:51:15.165
	propagate to, and t errors is one of the main
	reasons why people looked into other ways of

	0:51:15.165 --> 0:51:17.240
	doing it and not having.

	0:51:19.219 --> 0:51:28.050
	By generally a cascaded combination, as we've
	seen it, it has several advantages: The biggest

	0:51:28.050 --> 0:51:42.674
	maybe is the data availability so we can train
	systems for the different components.

	0:51:42.822 --> 0:51:47.228
	So you can train your individual components
	on relatively large stages.

	0:51:47.667 --> 0:51:58.207
	A modular system where you can improve each
	individual model and if there's new development

	0:51:58.207 --> 0:52:01.415
	and models you can improve.

	0:52:01.861 --> 0:52:11.280
	There are several advantages, but of course
	there are also some disadvantages: The most

	0:52:11.280 --> 0:52:19.522
	common thing is that there is what is referred
	to as arrow propagation.

	0:52:19.522 --> 0:52:28.222
	If the arrow is arrow, probably your output
	will then directly do an arrow.

	0:52:28.868 --> 0:52:41.740
	Typically it's like if there's an error in
	the system, it's easier to like ignore by a

	0:52:41.740 --> 0:52:46.474
	quantity scale than the output.

	0:52:46.967 --> 0:52:49.785
	What do that mean?

	0:52:49.785 --> 0:53:01.209
	It's complicated, so if you have German, the
	ASR does the Arab, and instead.

	0:53:01.101 --> 0:53:05.976
	Then most probably you'll ignore it or you'll
	still know what it was said.

	0:53:05.976 --> 0:53:11.827
	Maybe you even don't notice because you'll
	fastly read over it and don't see that there's

	0:53:11.827 --> 0:53:12.997
	one letter wrong.

	0:53:13.673 --> 0:53:25.291
	However, if you translate this one in an English
	sentence about speeches, there's something

	0:53:25.291 --> 0:53:26.933
	about wines.

	0:53:27.367 --> 0:53:37.238
	So it's a lot easier typically to read over
	like arrows in the than reading over them in

	0:53:37.238 --> 0:53:38.569
	the speech.

	0:53:40.120 --> 0:53:45.863
	But there is additional challenges in in cascaded
	systems.

	0:53:46.066 --> 0:53:52.667
	So secondly we have seen that we optimize
	each component individually so you have a separate

	0:53:52.667 --> 0:53:59.055
	optimization and that doesn't mean that the
	overall performance is really the best at the

	0:53:59.055 --> 0:53:59.410
	end.

	0:53:59.899 --> 0:54:07.945
	And we have tried to do that by already saying
	yes.

	0:54:07.945 --> 0:54:17.692
	You need to adapt them a bit to work good
	together, but still.

	0:54:20.280 --> 0:54:24.185
	Secondly, like that, there's a computational
	complexity.

	0:54:24.185 --> 0:54:30.351
	You always need to run an ASR system and an
	MTT system, and especially if you think about

	0:54:30.351 --> 0:54:32.886
	it, it should be fast and real time.

	0:54:32.886 --> 0:54:37.065
	It's challenging to always run two systems
	and not a single.

	0:54:38.038 --> 0:54:45.245
	And one final thing which you might have not
	directly thought of, but most of the world's

	0:54:45.245 --> 0:54:47.407
	languages do not have any.

	0:54:48.108 --> 0:55:01.942
	So if you have a language which doesn't have
	any script, then of course if you want to translate

	0:55:01.942 --> 0:55:05.507
	it you cannot first use.

	0:55:05.905 --> 0:55:13.705
	So in order to do this, the pressure was mentioned
	before ready.

	0:55:13.705 --> 0:55:24.264
	Build somehow a system which takes the audio
	and directly generates text in the target.

	0:55:26.006 --> 0:55:41.935
	And there is quite big opportunity for that
	because before that there was very different

	0:55:41.935 --> 0:55:44.082
	technology.

	0:55:44.644 --> 0:55:55.421
	However, since we are using neuromachine translation
	encoded decoder models, the interesting thing

	0:55:55.421 --> 0:56:00.429
	is that we are using very similar technology.

	0:56:00.360 --> 0:56:06.047
	It's like in both cases very similar architecture.

	0:56:06.047 --> 0:56:09.280
	The main difference is once.

	0:56:09.649 --> 0:56:17.143
	But generally how it's done is very similar,
	and therefore of course it might be put everything

	0:56:17.143 --> 0:56:22.140
	together, and that is what is referred to as
	end-to-end speech.

	0:56:22.502 --> 0:56:31.411
	So that means we're having one large neural
	network and decoded voice system, but we put

	0:56:31.411 --> 0:56:34.914
	an audio in one language and then.

	0:56:36.196 --> 0:56:43.106
	We can then have a system which directly does
	the full process.

	0:56:43.106 --> 0:56:46.454
	We don't have to care anymore.

	0:56:48.048 --> 0:57:02.615
	So if you think of it as before, so we have
	this decoder, and that's the two separate.

	0:57:02.615 --> 0:57:04.792
	We have the.

	0:57:05.085 --> 0:57:18.044
	And instead of going via the discrete text
	representation in the Suez language, we can

	0:57:18.044 --> 0:57:21.470
	go via the continuous.

	0:57:21.681 --> 0:57:26.027
	Of course, they hope it's by not doing this
	discrimination in between.

	0:57:26.146 --> 0:57:30.275
	We don't have a problem at doing errors.

	0:57:30.275 --> 0:57:32.793
	We can only cover later.

	0:57:32.772 --> 0:57:47.849
	But we can encode here the variability or
	so that we have and then only define the decision.

	0:57:51.711 --> 0:57:54.525
	And so.

	0:57:54.274 --> 0:58:02.253
	What we're doing is we're having very similar
	technique.

	0:58:02.253 --> 0:58:12.192
	We're having still the decoder model where
	we're coming from the main.

	0:58:12.552 --> 0:58:24.098
	Instead of getting discrete tokens in there
	as we have subwords, we always encoded that

	0:58:24.098 --> 0:58:26.197
	in one pattern.

	0:58:26.846 --> 0:58:42.505
	The problem is that this is in continuous,
	so we have to check how we can work with continuous

	0:58:42.505 --> 0:58:43.988
	signals.

	0:58:47.627 --> 0:58:55.166
	Mean, the first thing in your system is when
	you do your disc freeze and code it.

	0:59:02.402 --> 0:59:03.888
	A newer machine translation.

	0:59:03.888 --> 0:59:05.067
	You're getting a word.

	0:59:05.067 --> 0:59:06.297
	It's one hot, some not.

	0:59:21.421 --> 0:59:24.678
	The first layer of the machine translation.

	0:59:27.287 --> 0:59:36.147
	Yes, you do the word embedding, so then you
	have a continuous thing.

	0:59:36.147 --> 0:59:40.128
	So if you know get continuous.

	0:59:40.961 --> 0:59:46.316
	Deal with it the same way, so we'll see not
	a big of a challenge.

	0:59:46.316 --> 0:59:48.669
	What is more challenging is.

	0:59:49.349 --> 1:00:04.498
	So the audio signal is ten times longer or
	so, like more time steps you have.

	1:00:04.764 --> 1:00:10.332
	And so that is, of course, any challenge how
	we can deal with this type of long sequence.

	1:00:11.171 --> 1:00:13.055
	The advantage is a bit.

	1:00:13.055 --> 1:00:17.922
	The long sequence is only at the input and
	not at the output.

	1:00:17.922 --> 1:00:24.988
	So when you remember for the efficiency, for
	example, like a long sequence are especially

	1:00:24.988 --> 1:00:29.227
	challenging in the decoder, but also for the
	encoder.

	1:00:31.371 --> 1:00:33.595
	So how it is this?

	1:00:33.595 --> 1:00:40.617
	How can we process audio into an speech translation
	system?

	1:00:41.501 --> 1:00:51.856
	And you can follow mainly what is done in
	an system, so you have the audio signal.

	1:00:52.172 --> 1:00:59.135
	Then you measure your amplitude at every time
	step.

	1:00:59.135 --> 1:01:04.358
	It's typically something like killing.

	1:01:04.384 --> 1:01:13.893
	And then you're doing this, this windowing,
	so that you get a signal of a length twenty

	1:01:13.893 --> 1:01:22.430
	to thirty seconds, and you have all these windowings
	so that you measure them.

	1:01:22.342 --> 1:01:32.260
	A simple gear, and then you look at these
	time signals of seconds.

	1:01:32.432 --> 1:01:36.920
	So in the end then it is ten seconds, ten
	million seconds.

	1:01:36.920 --> 1:01:39.735
	You have for every ten milliseconds.

	1:01:40.000 --> 1:01:48.309
	Some type of representation which type of
	representation you can generate from that,

	1:01:48.309 --> 1:01:49.286
	but that.

	1:01:49.649 --> 1:02:06.919
	So instead of having no letter or word, you
	have no representations for every 10mm of your

	1:02:06.919 --> 1:02:08.437
	system.

	1:02:08.688 --> 1:02:13.372
	How we record that now your thirty second
	window here there is different ways.

	1:02:16.176 --> 1:02:31.891
	Was a traditional way of how people have done
	that from an audio signal what frequencies

	1:02:31.891 --> 1:02:34.010
	are in the.

	1:02:34.114 --> 1:02:44.143
	So to do that you can do this malfrequency,
	capsule co-pression so you can use gear transformations.

	1:02:44.324 --> 1:02:47.031
	Which frequencies are there?

	1:02:47.031 --> 1:02:53.566
	You know that the letters are different by
	the different frequencies.

	1:02:53.813 --> 1:03:04.243
	And then if you're doing that, use the matte
	to covers for your window we have before.

	1:03:04.624 --> 1:03:14.550
	So for each of these windows: You will calculate
	what frequencies in there and then get features

	1:03:14.550 --> 1:03:20.059
	for this window and features for this window.

	1:03:19.980 --> 1:03:28.028
	These are the frequencies that occur there
	and that help you to model which letters are

	1:03:28.028 --> 1:03:28.760
	spoken.

	1:03:31.611 --> 1:03:43.544
	More recently, instead of doing the traditional
	signal processing, you can also replace that

	1:03:43.544 --> 1:03:45.853
	by deep learning.

	1:03:46.126 --> 1:03:56.406
	So that we are using a self-supervised approach
	from language model to generate features that

	1:03:56.406 --> 1:03:58.047
	describe what.

	1:03:58.358 --> 1:03:59.821
	So you have your.

	1:03:59.759 --> 1:04:07.392
	All your signal again, and then for each child
	to do your convolutional neural networks to

	1:04:07.392 --> 1:04:07.811
	get.

	1:04:07.807 --> 1:04:23.699
	First representation here is a transformer
	network here, and in the end it's similar to

	1:04:23.699 --> 1:04:25.866
	a language.

	1:04:25.705 --> 1:04:30.238
	And you tried to predict what was referenced
	here.

	1:04:30.670 --> 1:04:42.122
	So that is in a way similar that you also
	try to learn a good representation of all these

	1:04:42.122 --> 1:04:51.608
	audio signals by predicting: And then you don't
	do the signal processing base, but have this

	1:04:51.608 --> 1:04:52.717
	way to make.

	1:04:52.812 --> 1:04:59.430
	But in all the things that you have to remember
	what is most important for you, and to end

	1:04:59.430 --> 1:05:05.902
	system is, of course, that you in the end get
	for every minute ten milliseconds, you get

	1:05:05.902 --> 1:05:11.283
	a representation of this audio signal, which
	is again a vector, and that.

	1:05:11.331 --> 1:05:15.365
	And then you can use your normal encoder to
	code your model to do this research.

	1:05:21.861 --> 1:05:32.694
	So that is all which directly has to be changed,
	and then you can build your first base.

	1:05:33.213 --> 1:05:37.167
	You do the audio processing.

	1:05:37.167 --> 1:05:49.166
	You of course need data which is like Audio
	and English and Text in German and then you

	1:05:49.166 --> 1:05:50.666
	can train.

	1:05:53.333 --> 1:05:57.854
	And interestingly, it works at the beginning.

	1:05:57.854 --> 1:06:03.261
	The systems were maybe a bit worse, but we
	saw really.

	1:06:03.964 --> 1:06:11.803
	This is like from the biggest workshop where
	people like compared different systems.

	1:06:11.751 --> 1:06:17.795
	Special challenge on comparing Cascaded to
	end to end systems and you see two thousand

	1:06:17.795 --> 1:06:18.767
	and eighteen.

	1:06:18.767 --> 1:06:25.089
	We had quite a huge gap between the Cascaded
	and end to end systems and then it got nearer

	1:06:25.089 --> 1:06:27.937
	and earlier in starting in two thousand.

	1:06:27.907 --> 1:06:33.619
	Twenty the performance was mainly the same,
	so there was no clear difference anymore.

	1:06:34.014 --> 1:06:42.774
	So this is, of course, writing a bit of hope
	saying if we better learn how to build these

	1:06:42.774 --> 1:06:47.544
	internal systems, they might really fall better.

	1:06:49.549 --> 1:06:52.346
	However, a bit.

	1:06:52.452 --> 1:06:59.018
	This satisfying this is how this all continues,
	and this is not only in two thousand and twenty

	1:06:59.018 --> 1:07:04.216
	one, but even nowadays we can say there is
	no clear performance difference.

	1:07:04.216 --> 1:07:10.919
	It's not like the one model is better than
	the other, but we are seeing very similar performance.

	1:07:11.391 --> 1:07:19.413
	So the question is what is the difference?

	1:07:19.413 --> 1:07:29.115
	Of course, this can only be achieved by new
	tricks.

	1:07:30.570 --> 1:07:35.658
	Yes and no, that's what we will mainly look
	into now.

	1:07:35.658 --> 1:07:39.333
	How can we make use of other types of.

	1:07:39.359 --> 1:07:53.236
	In that case you can achieve some performance
	by using different types of training so you

	1:07:53.236 --> 1:07:55.549
	can also make.

	1:07:55.855 --> 1:08:04.961
	So if you are training or preparing the systems
	only on very small corpora where you have as

	1:08:04.961 --> 1:08:10.248
	much data than you have for the individual
	ones then.

	1:08:10.550 --> 1:08:22.288
	So that is the biggest challenge of an end
	system that you have small corpora and therefore.

	1:08:24.404 --> 1:08:30.479
	Of course, there is several advantages so
	you can give access to the audio information.

	1:08:30.750 --> 1:08:42.046
	So that's, for example, interesting if you
	think about it, you might not have modeled

	1:08:42.046 --> 1:08:45.198
	everything in the text.

	1:08:45.198 --> 1:08:50.321
	So remember when we talk about biases.

	1:08:50.230 --> 1:08:55.448
	Male or female, and that of course is not
	in the text any more, but in the audio signal

	1:08:55.448 --> 1:08:56.515
	it's still there.

	1:08:58.078 --> 1:09:03.108
	It also allows you to talk about that on Thursday
	when you talk about latency.

	1:09:03.108 --> 1:09:08.902
	You have a bit better chance if you do an
	end to end system to get a lower latency because

	1:09:08.902 --> 1:09:14.377
	you only have one system and you don't have
	two systems which might have to wait for.

	1:09:14.934 --> 1:09:20.046
	And having one system might be also a bit
	easier management.

	1:09:20.046 --> 1:09:23.146
	See that two systems work and so on.

	1:09:26.346 --> 1:09:41.149
	The biggest challenge of end systems is the
	data, so as you correctly pointed out, typically

	1:09:41.149 --> 1:09:42.741
	there is.

	1:09:43.123 --> 1:09:45.829
	There is some data for Ted.

	1:09:45.829 --> 1:09:47.472
	People did that.

	1:09:47.472 --> 1:09:52.789
	They took the English audio with all the translations.

	1:09:53.273 --> 1:10:02.423
	But in January there is a lot less so we'll
	look into how you can use other data sources.

	1:10:05.305 --> 1:10:10.950
	And secondly, the second challenge is that
	we have to deal with audio.

	1:10:11.431 --> 1:10:22.163
	For example, in input length, and therefore
	it's also important to handle this in your

	1:10:22.163 --> 1:10:27.590
	network and maybe have dedicated solutions.

	1:10:31.831 --> 1:10:40.265
	So in general we have this challenge that
	we have a lot of text and translation and audio

	1:10:40.265 --> 1:10:43.076
	transcript data by quite few.

	1:10:43.643 --> 1:10:50.844
	So what can we do in one trick?

	1:10:50.844 --> 1:11:00.745
	You already know a bit from other research.

	1:11:02.302 --> 1:11:14.325
	Exactly so what you can do is you can, for
	example, use to take a power locust, generate

	1:11:14.325 --> 1:11:19.594
	an audio of a Suez language, and then.

	1:11:21.341 --> 1:11:33.780
	There has been a bit motivated by what we
	have seen in Beck translation, which was very

	1:11:33.780 --> 1:11:35.476
	successful.

	1:11:38.758 --> 1:11:54.080
	However, it's a bit more challenging because
	it is often very different from real audience.

	1:11:54.314 --> 1:12:07.131
	So often if you build a system only trained
	on, but then generalized to real audio data

	1:12:07.131 --> 1:12:10.335
	is quite challenging.

	1:12:10.910 --> 1:12:20.927
	And therefore here the synthetic data generation
	is significantly more challenging than when.

	1:12:20.981 --> 1:12:27.071
	Because if you read a text, it's maybe bad
	translation.

	1:12:27.071 --> 1:12:33.161
	It's hard, but it's a real text or a text
	generated by.

	1:12:35.835 --> 1:12:42.885
	But it's a valid solution, and for example
	we use that also for say current systems.

	1:12:43.923 --> 1:12:53.336
	Of course you can also do a bit of forward
	translation that is done so that you take data.

	1:12:53.773 --> 1:13:02.587
	But then the problem is that your reference
	is not always correct, and you remember when

	1:13:02.587 --> 1:13:08.727
	we talked about back translation, it's a bit
	of an advantage.

	1:13:09.229 --> 1:13:11.930
	But both can be done and both have been done.

	1:13:12.212 --> 1:13:20.277
	So you can think about this picture again.

	1:13:20.277 --> 1:13:30.217
	You can take this data and generate the audio
	to it.

	1:13:30.750 --> 1:13:37.938
	However, it is only synthetic of what can
	be used for the voice handling technology for:

	1:13:40.240 --> 1:13:47.153
	But you have not, I mean, yet you get text
	to speech, but the voice cloning would need

	1:13:47.153 --> 1:13:47.868
	a voice.

	1:13:47.868 --> 1:13:53.112
	You can use, of course, and then it's nothing
	else than a normal.

	1:13:54.594 --> 1:14:03.210
	But still think there are better than both,
	but there are some characteristics of that

	1:14:03.210 --> 1:14:05.784
	which is quite different.

	1:14:07.327 --> 1:14:09.341
	But yeah, it's getting better.

	1:14:09.341 --> 1:14:13.498
	That is definitely true, and then this might
	get more and more.

	1:14:16.596 --> 1:14:21.885
	Here make sure it's a good person and our
	own systems because we try to train and.

	1:14:21.881 --> 1:14:24.356
	And it's like a feedback mood.

	1:14:24.356 --> 1:14:28.668
	There's anything like the Dutch English model
	that's.

	1:14:28.648 --> 1:14:33.081
	Yeah, you of course need a decent amount of
	real data.

	1:14:33.081 --> 1:14:40.255
	But I mean, as I said, so there is always
	an advantage if you have this synthetics thing

	1:14:40.255 --> 1:14:44.044
	only on the input side and not on the outside.

	1:14:44.464 --> 1:14:47.444
	That you at least always generate correct
	outcomes.

	1:14:48.688 --> 1:14:54.599
	That's different in a language case because
	they have input and the output and it's not

	1:14:54.599 --> 1:14:55.002
	like.

	1:14:58.618 --> 1:15:15.815
	The other idea is to integrate additional
	sources so you can have more model sharing.

	1:15:16.376 --> 1:15:23.301
	But you can use these components also in the
	system.

	1:15:23.301 --> 1:15:28.659
	Typically the text decoder and the text.

	1:15:29.169 --> 1:15:41.845
	And so the other way of languaging is to join
	a train or somehow train all these tasks.

	1:15:43.403 --> 1:15:54.467
	The first and easy thing to do is multi task
	training so the idea is you take these components

	1:15:54.467 --> 1:16:02.038
	and train these two components and train the
	speech translation.

	1:16:02.362 --> 1:16:13.086
	So then, for example, all your encoders used
	by the speech translation system can also gain

	1:16:13.086 --> 1:16:14.951
	from the large.

	1:16:14.975 --> 1:16:24.048
	So everything can gain a bit of emphasis,
	but it can partly gain in there quite a bit.

	1:16:27.407 --> 1:16:39.920
	The other idea is to do it in a pre-training
	phase.

	1:16:40.080 --> 1:16:50.414
	And then you take the end coder and the text
	decoder and trade your model on that.

	1:16:54.774 --> 1:17:04.895
	Finally, there is also what is referred to
	as knowledge distillation, so there you have

	1:17:04.895 --> 1:17:11.566
	to remember if you learn from a probability
	distribution.

	1:17:11.771 --> 1:17:24.371
	So what you can do then is you have your system
	and if you then have your audio and text input

	1:17:24.371 --> 1:17:26.759
	you can use your.

	1:17:27.087 --> 1:17:32.699
	And then get a more rich signal that you'll
	not only know this is the word, but you have

	1:17:32.699 --> 1:17:33.456
	a complete.

	1:17:34.394 --> 1:17:41.979
	Example is typically also done because, of
	course, if you have ski data, it still begins

	1:17:41.979 --> 1:17:49.735
	that you don't only have source language audio
	and target language text, but then you also

	1:17:49.735 --> 1:17:52.377
	have the source language text.

	1:17:53.833 --> 1:18:00.996
	Get a good idea of the text editor and the
	artist design.

	1:18:00.996 --> 1:18:15.888
	Now have to be aligned so that: Otherwise
	they wouldn't be able to determine which degree

	1:18:15.888 --> 1:18:17.922
	they'd be.

	1:18:18.178 --> 1:18:25.603
	What you've been doing in non-stasilation
	is you run your MP and then you get your probability

	1:18:25.603 --> 1:18:32.716
	distribution for all the words and you use
	that to train and that is not only more helpful

	1:18:32.716 --> 1:18:34.592
	than only getting back.

	1:18:35.915 --> 1:18:44.427
	You can, of course, use the same decoder to
	be even similar.

	1:18:44.427 --> 1:18:49.729
	Otherwise you don't have exactly the.

	1:18:52.832 --> 1:19:03.515
	Is a good point making these tools, and generally
	in all these cases it's good to have more similar

	1:19:03.515 --> 1:19:05.331
	representations.

	1:19:05.331 --> 1:19:07.253
	You can transfer.

	1:19:07.607 --> 1:19:23.743
	If you hear your representation to give from
	the audio encoder and the text encoder are

	1:19:23.743 --> 1:19:27.410
	more similar, then.

	1:19:30.130 --> 1:19:39.980
	So here you have your text encoder in the
	target language and you can train it on large

	1:19:39.980 --> 1:19:40.652
	data.

	1:19:41.341 --> 1:19:45.994
	But of course you want to benefit also for
	this task because that's what your most interested.

	1:19:46.846 --> 1:19:59.665
	Of course, the most benefit for this task
	is if these two representations you give are

	1:19:59.665 --> 1:20:01.728
	more similar.

	1:20:02.222 --> 1:20:10.583
	Therefore, it's interesting to look into how
	can we make these two representations as similar

	1:20:10.583 --> 1:20:20.929
	as: The hope is that in the end you can't even
	do something like zero shot transfer, but while

	1:20:20.929 --> 1:20:25.950
	you only learn this one you can also deal with.

	1:20:30.830 --> 1:20:40.257
	So what you can do is you can look at these
	two representations.

	1:20:40.257 --> 1:20:42.867
	So once the text.

	1:20:43.003 --> 1:20:51.184
	And you can either put them into the text
	decoder to the encoder.

	1:20:51.184 --> 1:20:53.539
	We have seen both.

	1:20:53.539 --> 1:21:03.738
	You can think: If you want to build an A's
	and to insist on you can either take the audio

	1:21:03.738 --> 1:21:06.575
	encoder and see how deep.

	1:21:08.748 --> 1:21:21.915
	However, you have these two representations
	and you want to make them more similar.

	1:21:21.915 --> 1:21:23.640
	One thing.

	1:21:23.863 --> 1:21:32.797
	Here we have, like you said, for every ten
	million seconds we have a representation.

	1:21:35.335 --> 1:21:46.085
	So what people may have done, for example,
	is to remove redundant information so you can:

	1:21:46.366 --> 1:21:56.403
	So you can use your system to put India based
	on letter or words and then average over the

	1:21:56.403 --> 1:21:58.388
	words or letters.

	1:21:59.179 --> 1:22:07.965
	So that the number of representations from
	the encoder is the same as you would get from.

	1:22:12.692 --> 1:22:20.919
	Okay, that much to data do have any more questions
	first about that.

	1:22:27.207 --> 1:22:36.787
	Then we'll finish with the audience assessing
	and highlight a bit while this is challenging,

	1:22:36.787 --> 1:22:52.891
	so here's: One test here has one thousand eight
	hundred sentences, so there are words or characters.

	1:22:53.954 --> 1:22:59.336
	If you look how many all your features, so
	how many samples there is like one point five

	1:22:59.336 --> 1:22:59.880
	million.

	1:23:00.200 --> 1:23:10.681
	So you have ten times more pizzas than you
	have characters, and then again five times

	1:23:10.681 --> 1:23:11.413
	more.

	1:23:11.811 --> 1:23:23.934
	So you have the sequence leg of the audio
	as long as you have for words, and that is

	1:23:23.934 --> 1:23:25.788
	a challenge.

	1:23:26.086 --> 1:23:34.935
	So the question is what can you do to make
	the sequins a bit shorter and not have this?

	1:23:38.458 --> 1:23:48.466
	The one thing is you can try to reduce the
	dimensional entity in your encounter.

	1:23:48.466 --> 1:23:50.814
	There's different.

	1:23:50.991 --> 1:24:04.302
	So, for example, you can just sum up always
	over some or you can do a congregation.

	1:24:04.804 --> 1:24:12.045
	Are you a linear projectile or you even take
	not every feature but only every fifth or something?

	1:24:12.492 --> 1:24:23.660
	So this way you can very easily reduce your
	number of features in there, and there has

	1:24:23.660 --> 1:24:25.713
	been different.

	1:24:26.306 --> 1:24:38.310
	There's also what you can do with things like
	a convolutional layer.

	1:24:38.310 --> 1:24:43.877
	If you skip over what you can,.

	1:24:47.327 --> 1:24:55.539
	And then, in addition to the audio, the other
	problem is higher variability.

	1:24:55.539 --> 1:25:04.957
	So if you have a text you can: But there are
	very different ways of saying that you can

	1:25:04.957 --> 1:25:09.867
	distinguish whether say a sentence or your
	voice.

	1:25:10.510 --> 1:25:21.224
	That of course makes it more challenging because
	now you get different inputs and while they

	1:25:21.224 --> 1:25:22.837
	were in text.

	1:25:23.263 --> 1:25:32.360
	So that makes especially for limited data
	things more challenging and you want to somehow

	1:25:32.360 --> 1:25:35.796
	learn that this is not important.

	1:25:36.076 --> 1:25:39.944
	So there is the idea again okay.

	1:25:39.944 --> 1:25:47.564
	Can we doing some type of data augmentation
	to better deal with?

	1:25:48.908 --> 1:25:55.735
	And again people can mainly use what has been
	done in and try to do the same things.

	1:25:56.276 --> 1:26:02.937
	You can try to do a bit of noise and speech
	perturbation so playing the audio like slower

	1:26:02.937 --> 1:26:08.563
	and a bit faster to get more samples then and
	you can train on all of them.

	1:26:08.563 --> 1:26:14.928
	What is very important and very successful
	recently is what is called Spektr augment.

	1:26:15.235 --> 1:26:25.882
	The idea is that you directly work on all
	your features and you can try to last them

	1:26:25.882 --> 1:26:29.014
	and that gives you more.

	1:26:29.469 --> 1:26:41.717
	What do they mean with masking so this is
	your audio feature and then there is different?

	1:26:41.962 --> 1:26:47.252
	You can do what is referred to as mask and
	a time masking.

	1:26:47.252 --> 1:26:50.480
	That means you just set some masks.

	1:26:50.730 --> 1:26:58.003
	And since then you should be still able to
	to deal with it because you can normally.

	1:26:57.937 --> 1:27:05.840
	Also without that you are getting more robust
	and not and you can handle that because then

	1:27:05.840 --> 1:27:10.877
	many symbols which have different time look
	more similar.

	1:27:11.931 --> 1:27:22.719
	You are not only doing that for time masking
	but also for frequency masking so that if you

	1:27:22.719 --> 1:27:30.188
	have here the frequency channels you mask a
	frequency channel.

	1:27:30.090 --> 1:27:33.089
	Thereby being able to better recognize these
	things.

	1:27:35.695 --> 1:27:43.698
	This we have had an overview of the two main
	approaches for speech translation that is on

	1:27:43.698 --> 1:27:51.523
	the one hand cascaded speech translation and
	on the other hand we talked about advanced

	1:27:51.523 --> 1:27:53.302
	speech translation.

	1:27:53.273 --> 1:28:02.080
	It's like how to combine things and what they
	work together for end speech translations.

	1:28:02.362 --> 1:28:06.581
	Here was data challenges and a bit about long
	circuits.

	1:28:07.747 --> 1:28:09.304
	We have any more questions.

	1:28:11.451 --> 1:28:19.974
	Can you really describe the change in cascading
	from translation to text to speech because

	1:28:19.974 --> 1:28:22.315
	thought the translation.

	1:28:25.745 --> 1:28:30.201
	Yes, so mean that works again the easiest
	thing.

	1:28:30.201 --> 1:28:33.021
	What of course is challenging?

	1:28:33.021 --> 1:28:40.751
	What can be challenging is how to make that
	more lively and like that pronunciation?

	1:28:40.680 --> 1:28:47.369
	And yeah, which things are put more important,
	how to put things like that into.

	1:28:47.627 --> 1:28:53.866
	In the normal text, otherwise it would sound
	very monotone.

	1:28:53.866 --> 1:28:57.401
	You want to add this information.

	1:28:58.498 --> 1:29:02.656
	That is maybe one thing to make it a bit more
	emotional.

	1:29:02.656 --> 1:29:04.917
	That is maybe one thing which.

	1:29:05.305 --> 1:29:13.448
	But you are right there and out of the box.

	1:29:13.448 --> 1:29:20.665
	If you have everything works decently.

	1:29:20.800 --> 1:29:30.507
	Still, especially if you have a very monotone
	voice, so think these are quite some open challenges.

	1:29:30.750 --> 1:29:35.898
	Maybe another open challenge is that it's
	not so much for the end product, but for the

	1:29:35.898 --> 1:29:37.732
	development is very important.

	1:29:37.732 --> 1:29:40.099
	It's very hard to evaluate the quality.

	1:29:40.740 --> 1:29:48.143
	So you cannot doubt that there is a way about
	most systems are currently evaluated by human

	1:29:48.143 --> 1:29:49.109
	evaluation.

	1:29:49.589 --> 1:29:54.474
	So you cannot try hundreds of things and run
	your blue score and get this score.

	1:29:54.975 --> 1:30:00.609
	So therefore no means very important to have
	some type of evaluation metric and that is

	1:30:00.609 --> 1:30:01.825
	quite challenging.

	1:30:08.768 --> 1:30:15.550
	And thanks for listening, and we'll have the
	second part of speech translation on search.