WEBVTT

0:00:01.721 --> 0:00:05.064
Hey, and welcome to today's lecture.

0:00:06.126 --> 0:00:13.861
What we want to do today is we will finish
with what we have done last time, so we started

0:00:13.861 --> 0:00:22.192
looking at the neural machine translation system,
and we have covered the components of the sequence-to-sequence

0:00:22.192 --> 0:00:22.787
model.

0:00:22.722 --> 0:00:29.361
What we're still missing is the transformer-based
architecture, so mainly the self-attention.

0:00:29.849 --> 0:00:31.958
Then we want to look at the beginning today.

0:00:32.572 --> 0:00:39.315
And then the main part of today's lecture
will be decoding.

0:00:39.315 --> 0:00:43.992
That means we know how to train the model.

0:00:44.624 --> 0:00:47.507
So decoding is searching for the output that the model considers

0:00:47.667 --> 0:00:53.359
best, and the idea is how we find
that and what challenges there are.

0:00:53.359 --> 0:00:59.051
Since it's autoregressive, we will see that
it's not as easy as for other tasks.

0:00:59.359 --> 0:01:08.206
While generating the translation step by step,
we might make additional errors along the way.

0:01:09.069 --> 0:01:16.464
But let's start with self-attention. What
we looked at until now was an RNN-based model.

0:01:16.816 --> 0:01:27.931
And in RNN-based models you always take
the last hidden state, you take your input, and you

0:01:27.931 --> 0:01:31.513
generate a new hidden state.

0:01:31.513 --> 0:01:35.218
This is more like a standard.

0:01:35.675 --> 0:01:41.088
And one challenge in this is that we always
store all our history in one single hidden

0:01:41.088 --> 0:01:41.523
state.

0:01:41.781 --> 0:01:50.235
We saw that this is a problem when going from
encoder to decoder, and that is why we then

0:01:50.235 --> 0:01:58.031
introduced the attention mechanism so that
we can look back and see all the parts.

0:01:59.579 --> 0:02:06.059
However, in the decoder we still have this
issue so we are still storing all information

0:02:06.059 --> 0:02:12.394
in one hidden state and we might do things
like here that we start to overwrite things

0:02:12.394 --> 0:02:13.486
and forget things.

0:02:14.254 --> 0:02:23.575
So the idea is, can we do something similar
which we do between encoder and decoder within

0:02:23.575 --> 0:02:24.907
the decoder?

0:02:26.526 --> 0:02:33.732
And the idea is: each time we're generating
a new hidden state here, it will not only depend

0:02:33.732 --> 0:02:40.780
on the previous one, but we will focus on the
whole sequence and look at different parts

0:02:40.780 --> 0:02:46.165
as we did in attention in order to generate
our new representation.

0:02:46.206 --> 0:02:53.903
So each time we generate a new representation
we will look into what is important now to

0:02:53.903 --> 0:02:54.941
understand.

0:02:55.135 --> 0:03:00.558
You may want to understand what much is important.

0:03:00.558 --> 0:03:08.534
You might want to look to vary and to like
so that it's much about liking.

0:03:08.808 --> 0:03:24.076
So the idea is that we are not storing everything
in one state; instead, each time we look at the full sequence.

0:03:25.125 --> 0:03:35.160
And that is achieved by no longer going strictly
sequentially, and the hidden states here aren't dependent

0:03:35.160 --> 0:03:37.086
on other states in the same layer.

0:03:37.086 --> 0:03:42.864
But instead we are always looking at the previous
layer.

0:03:42.942 --> 0:03:45.510
So we always have all the information from
the layer we are coming from.

0:03:47.147 --> 0:03:51.572
So how does this self-attention work in detail?

0:03:51.572 --> 0:03:56.107
We start with our initial hidden states.

0:03:56.107 --> 0:04:08.338
So, for example: Now where we had the three
terms already, the query, the key and the value,

0:04:08.338 --> 0:04:12.597
it was motivated by our database.

0:04:12.772 --> 0:04:20.746
We are comparing it to the keys to all the
other values, and then we are merging the values.

0:04:21.321 --> 0:04:35.735
There was a difference between the decoder
and the encoder.

0:04:35.775 --> 0:04:41.981
You could assume they are all the same because
we are comparing the sequence to itself.

0:04:41.981 --> 0:04:49.489
However, we can make them different by just
learning a linear projection.

0:04:49.529 --> 0:05:01.836
So you learn here some projection based on
what need to do in order to ask which question.

0:05:02.062 --> 0:05:11.800
That is, the query is what you ask, the key
is what you compare against for the others, and

0:05:11.800 --> 0:05:13.748
the value is what each state provides.

0:05:14.014 --> 0:05:23.017
This is not hand-defined but learned, so
it's like three linear projections that

0:05:23.017 --> 0:05:26.618
you apply on all of these hidden.

0:05:26.618 --> 0:05:32.338
That is the first thing based on your initial
hidden.

0:05:32.612 --> 0:05:37.249
And now you can do exactly as before, you
can do the attention.

0:05:37.637 --> 0:05:40.023
How did the attention work?

0:05:40.023 --> 0:05:45.390
The first thing is we are comparing our query
to all the keys.

0:05:45.445 --> 0:05:52.713
And that is now the difference: before, the
query was from the decoder and the keys were

0:05:52.713 --> 0:05:54.253
from the encoder.

0:05:54.253 --> 0:06:02.547
Now they all come from the same sequence, so we compare
the first hidden state to the keys of all the others.

0:06:02.582 --> 0:06:06.217
We're learning some value here.

0:06:06.217 --> 0:06:12.806
How important are these information to better
understand?

0:06:13.974 --> 0:06:19.103
And these are just like floating point numbers.

0:06:19.103 --> 0:06:21.668
They are normalized so.

0:06:22.762 --> 0:06:30.160
And that is the first step; let's do it
first for the first query.

0:06:30.470 --> 0:06:41.937
What we can then do is multiply each value
as we have done before with the importance

0:06:41.937 --> 0:06:43.937
of each state.

0:06:45.145 --> 0:06:47.686
And then we have here the new hidden state.

0:06:48.308 --> 0:06:57.862
You see, this new hidden state now depends
on all the hidden states of the whole sequence

0:06:57.862 --> 0:06:59.686
in the previous layer.

0:06:59.879 --> 0:07:01.739
One important thing.

0:07:01.739 --> 0:07:08.737
This one doesn't really depend, so the hidden
states here don't depend on the.

0:07:09.029 --> 0:07:15.000
So it only depends on the hidden state of
the previous layer, but it depends on all the

0:07:15.000 --> 0:07:18.664
hidden states, and that is of course a big
advantage.

0:07:18.664 --> 0:07:25.111
So on the one hand information can directly
flow from each hidden state before the information

0:07:25.111 --> 0:07:27.214
flow was always a bit limited.

0:07:28.828 --> 0:07:35.100
And the independence is important because we can
calculate all these hidden states in parallel.

0:07:35.100 --> 0:07:41.371
That's another big advantage of self attention
that we can calculate all the hidden states

0:07:41.371 --> 0:07:46.815
in one layer in parallel, and therefore it is
well suited for GPUs and fast.
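
As a rough illustration (a sketch added here, not from the lecture itself), the self-attention step described above can be written in a few lines of numpy; all names and shapes are assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) states of the previous layer;
    W_q, W_k, W_v: learned (d_model, d_k) projection matrices."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v       # three learned projections of the same states
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # compare every query with every key
    weights = softmax(scores)                 # normalized importance of each position
    return weights @ V                        # weighted sum of values, all positions at once
```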

0:07:47.587 --> 0:07:50.235
Then we can do the same thing for the second
hidden state.

0:07:50.530 --> 0:08:06.866
And the only difference here is which query
we use for the comparison.

0:08:07.227 --> 0:08:15.733
The attention weights are different because
we use a different query, and then we get

0:08:15.733 --> 0:08:17.316
our new hidden state.

0:08:18.258 --> 0:08:26.036
Yes, a question: is the order of the words
considered in this case, or is it simply

0:08:26.036 --> 0:08:26.498
not?

0:08:27.127 --> 0:08:33.359
That's a very good question about
the basic setup.

0:08:33.359 --> 0:08:38.503
That is exactly not modeled in this architecture.

0:08:38.503 --> 0:08:44.042
At first you would maybe think of it as a very big
disadvantage.

0:08:44.384 --> 0:08:49.804
So this hidden state would be the same if
the word order were different.

0:08:50.650 --> 0:08:59.983
And of course that is an issue: if this state
would be at a different position in the sentence,

0:08:59.983 --> 0:09:06.452
then, except for this correspondence, the output
would be identical, so the word order is completely lost.

0:09:06.706 --> 0:09:17.133
Therefore, just doing self attention wouldn't
work at all because we know word order is important

0:09:17.133 --> 0:09:21.707
and there is a complete different meaning.

0:09:22.262 --> 0:09:26.277
We introduce the word position again.

0:09:26.277 --> 0:09:33.038
The main idea is to put the position directly
into your embeddings.

0:09:33.533 --> 0:09:39.296
Then of course the position is there and you
don't lose it anymore.

0:09:39.296 --> 0:09:46.922
So basically, if your representation here
encodes that it is at the second position, then your output

0:09:46.922 --> 0:09:48.533
will be different.

0:09:49.049 --> 0:09:54.585
And that's how you encode it; this is essential
in order to get this to work.

0:09:57.137 --> 0:10:08.752
But before we are coming to the next slide,
one other thing that is typically done is multi-head

0:10:08.752 --> 0:10:10.069
attention.

0:10:10.430 --> 0:10:15.662
And it might be that in order to understand
'much', it might be good that in some way we

0:10:15.662 --> 0:10:19.872
focus on 'like', and in some way we focus
on 'very', but not equally.

0:10:19.872 --> 0:10:25.345
But maybe it's like to understand again on
different dimensions we should look into these.

0:10:25.905 --> 0:10:31.393
And therefore what we're doing is we're not
doing the self-attention just once, but we're

0:10:31.393 --> 0:10:35.031
doing it n times, based on the number of heads
in your multi-head attention.

0:10:35.031 --> 0:10:41.299
So in typical setups, the number of heads
people use is something like eight: you're

0:10:41.299 --> 0:10:50.638
doing this process eight times with different queries
and keys, so each head can focus on something else.

0:10:50.790 --> 0:10:52.887
How can you generate eight different?

0:10:53.593 --> 0:11:07.595
That's quite easy here: instead of
having one linear projection you can have eight

0:11:07.595 --> 0:11:09.326
different.

0:11:09.569 --> 0:11:13.844
And it might be that sometimes you're looking
more into one thing, and sometimes you're looking

0:11:13.844 --> 0:11:14.779
more into the other.

0:11:15.055 --> 0:11:24.751
So that's of course nice with this type of
learned approach, because we can automatically

0:11:24.751 --> 0:11:25.514
learn what each head should focus on.
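
A hedged sketch of the multi-head idea, reusing the single-head routine from the sketch above: the eight heads are simply eight independent sets of projection matrices (names are illustrative).

```python
def multi_head_attention(X, heads):
    """heads: a list of (W_q, W_k, W_v) tuples, e.g. eight of them."""
    outputs = [self_attention(X, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
    # concatenate the per-head results; a final linear layer usually mixes them again
    return np.concatenate(outputs, axis=-1)
```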

0:11:29.529 --> 0:11:36.629
And what you correctly said is that it is positionally
independent, so it doesn't really matter in which

0:11:36.629 --> 0:11:39.176
order the words appear, although that should be important.

0:11:39.379 --> 0:11:47.686
So how can we fix that? The idea is we are
just encoding the position directly into the embedding,

0:11:47.686 --> 0:11:52.024
so into the starting representation.

0:11:52.512 --> 0:11:55.873
How do we get that so we started with our
embeddings?

0:11:55.873 --> 0:11:58.300
Just imagine this is the embedding of 'I'.

0:11:59.259 --> 0:12:06.169
And then we are having additionally this positional
encoding.

0:12:06.169 --> 0:12:10.181
This positional encoding is just a set of sinusoidal functions

0:12:10.670 --> 0:12:19.564
with different wavelengths, so with different
periods of the signal, as you see here.

0:12:20.160 --> 0:12:37.531
And the number of functions you have is exactly
the number of dimensions you have in your embedding.

0:12:38.118 --> 0:12:51.091
And what we then do is take the first function,
and based on your position you read off its value

0:12:51.091 --> 0:12:51.955
and add it to your word embedding.

0:12:52.212 --> 0:13:02.518
And you see now if you put it in this position,
of course it will get a different value.

0:13:03.003 --> 0:13:12.347
And thereby in each position a different function
is multiplied.

0:13:12.347 --> 0:13:19.823
This is the representation for the word at the first
position.

0:13:20.020 --> 0:13:34.922
If you have it in the input already encoded
then of course the model is able to keep the

0:13:34.922 --> 0:13:38.605
position information.

0:13:38.758 --> 0:13:48.045
But the model can also learn the word embeddings
in a way that they collaborate optimally

0:13:48.045 --> 0:13:49.786
with these positional encodings.
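
The encoding described here is usually written as sine/cosine functions with a different wavelength per dimension; the exact formula below follows the standard transformer paper and is an assumption, since the lecture only sketches the idea.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                          # positions 0 .. seq_len-1
    i = np.arange(d_model)[None, :]                            # embedding dimensions
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)  # one wavelength per dimension pair
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))  # added to the word embeddings
```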

0:13:51.451 --> 0:13:59.351
Is that somehow clear? Yes, there is a question?

0:14:06.006 --> 0:14:13.630
For the first position and the second position,

0:14:16.576 --> 0:14:17.697
if they have a long wavelength,

0:14:17.697 --> 0:14:19.624
aren't the values almost the same, so you can hardly tell the positions apart?

0:14:21.441 --> 0:14:26.927
It won't be a complete issue, because if you have a
very short wavelength there might be quite

0:14:26.927 --> 0:14:28.011
big differences.

0:14:28.308 --> 0:14:33.577
And it might also be that it depends, of course,
on what type of word embedding

0:14:33.577 --> 0:14:34.834
you've learned.

0:14:34.834 --> 0:14:37.588
Is the dimension where you have long wavelengths

0:14:37.588 --> 0:14:43.097
important for your embedding or not? That's
what I mean: the model can somehow

0:14:43.097 --> 0:14:47.707
learn that by putting more information into
one or the other embedding dimension.

0:14:48.128 --> 0:14:54.560
So it is incorporated, and I would assume the model
learns to deal with it a bit, but I haven't seen

0:14:54.560 --> 0:14:57.409
detailed studies of how different it really is.

0:14:58.078 --> 0:15:07.863
It's also a bit difficult, because really measuring
how similar or different a word representation is isn't that

0:15:07.863 --> 0:15:08.480
easy.

0:15:08.480 --> 0:15:13.115
You can do, of course, the average distance.

0:15:14.114 --> 0:15:21.393
So are the wavelengths also learned by the
model, or are they fixed and given to the

0:15:21.393 --> 0:15:21.986
model?

0:15:24.164 --> 0:15:30.165
I believe they are fixed and the model learns
to work with them; there is a different way of doing it.

0:15:30.165 --> 0:15:32.985
The other thing you can do is the following:

0:15:33.213 --> 0:15:36.945
you can learn a second embedding which
says: this is position one,

0:15:36.945 --> 0:15:38.628
this is position two, and so on.

0:15:38.628 --> 0:15:42.571
Just like for words, you could learn fixed embeddings
for positions and then add them on top.

0:15:42.571 --> 0:15:45.094
So then it would achieve the same thing.

0:15:45.094 --> 0:15:46.935
There is one disadvantage of this.

0:15:46.935 --> 0:15:51.403
Does anybody have an idea what could be the
disadvantage of such a learned embedding?

0:15:54.955 --> 0:16:00.000
Maybe it cannot extrapolate to positions it
has not seen; that would be hard

0:16:00.000 --> 0:16:01.751
especially for long sequences.

0:16:02.502 --> 0:16:08.323
Exactly: you would only be good at positions you have
seen often, and especially for long sequences

0:16:08.323 --> 0:16:14.016
you might have seen the later positions very rarely
and then it normally doesn't perform that well,

0:16:14.016 --> 0:16:17.981
while here it can better learn a more general
representation.

0:16:18.298 --> 0:16:22.522
Then there is another thing, which we won't discuss
in detail here,

0:16:22.522 --> 0:16:25.964
which is what is called relative attention.

0:16:25.945 --> 0:16:32.570
And in this case you don't learn absolute
positions, but in your calculation of the similarity

0:16:32.570 --> 0:16:39.194
you take again the relative distance into account
and have a different similarity depending on

0:16:39.194 --> 0:16:40.449
how far they are.

0:16:40.660 --> 0:16:45.898
And then you don't need to encode it beforehand;
instead it happens within your comparison.

0:16:46.186 --> 0:16:53.471
So when you compare how similar things are, you
of course also take the relative position into account.

0:16:55.715 --> 0:17:03.187
Because there are multiple ways to use the
one, to multiply all the embedding, or to use

0:17:03.187 --> 0:17:03.607
all.

0:17:17.557 --> 0:17:21.931
The encoder can be bidirectional.

0:17:21.931 --> 0:17:30.679
We have the whole input from the beginning, so we
can have a model where every state sees everything.

0:17:31.111 --> 0:17:36.455
Decoder training of course has also everything
available but during inference you always have

0:17:36.455 --> 0:17:41.628
only the past available so you can only look
into the previous one and not into the future

0:17:41.628 --> 0:17:46.062
because if you generate word by word you don't
know what will be there in the future.

0:17:46.866 --> 0:17:53.180
And so we also have to consider this somehow
in the attention, and until now we look more

0:17:53.180 --> 0:17:54.653
at the encoder style.

0:17:54.653 --> 0:17:58.652
So if you look at this type of model, it's
bidirectional.

0:17:58.652 --> 0:18:03.773
So for this hidden state we are looking into
the past and into the future.

0:18:04.404 --> 0:18:14.436
So the question is: can we also do this
unidirectionally, so that you only look into

0:18:14.436 --> 0:18:15.551
the past?

0:18:15.551 --> 0:18:22.573
And the nice thing is, this is even easier
than for RNNs.

0:18:23.123 --> 0:18:29.738
For RNNs we would need different parameters
and models, because you have a forward and a backward direction.

0:18:31.211 --> 0:18:35.679
For attention, that is very simple.

0:18:35.679 --> 0:18:39.403
We are doing what is called masking.

0:18:39.403 --> 0:18:45.609
If you want to have a unidirectional model,
you just mask out these connections.

0:18:45.845 --> 0:18:54.355
So the first hidden state is only
looking at itself.

0:18:54.894 --> 0:19:05.310
The second one looks at the first and the
second, and so on; you're always hiding all values

0:19:05.310 --> 0:19:07.085
in the future.

0:19:07.507 --> 0:19:13.318
And thereby, with the same parameters and
the same model,

0:19:13.318 --> 0:19:15.783
you can then have a unidirectional model.

0:19:16.156 --> 0:19:29.895
In the decoder you do the masked self attention
where you only look into the past and you don't

0:19:29.895 --> 0:19:30.753
look into the future.
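
A minimal sketch of this masking, assuming the common additive mask before the softmax and reusing the softmax helper from the earlier attention sketch:

```python
def masked_self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    mask = np.triu(np.full_like(scores, -1e9), k=1)   # large negative values above the diagonal
    weights = softmax(scores + mask)                  # future positions get (almost) zero weight
    return weights @ V
```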

0:19:32.212 --> 0:19:36.400
But then, of course, we have only looked
at the target side itself.

0:19:36.616 --> 0:19:50.903
So the question is: how can we combine encoder
and decoder? In the decoder we can

0:19:50.903 --> 0:19:54.114
just add a second attention.

0:19:54.374 --> 0:20:00.286
And then we're doing the cross-attention, which
attends from the decoder to the encoder.

0:20:00.540 --> 0:20:10.239
So this time the queries are the current
states of the decoder, while the keys and values

0:20:10.239 --> 0:20:22.833
are the encoder states. You can do both: attend to yourself to get the
meaning on the target side, and attend to the encoder to get the meaning of the source.
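
Sketched the same way (again an assumption, reusing the helpers above), the only change in cross-attention is where the queries and the keys/values come from:

```python
def cross_attention(decoder_states, encoder_states, W_q, W_k, W_v):
    Q = decoder_states @ W_q                  # queries: current decoder states
    K = encoder_states @ W_k                  # keys and values: top encoder states
    V = encoder_states @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V                # no mask needed: the source is fully known
```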

0:20:23.423 --> 0:20:25.928
So see then the full picture.

0:20:25.928 --> 0:20:33.026
This is now the typical picture of the transformer
and where you use self attention.

0:20:33.026 --> 0:20:36.700
So what you have first is your input embeddings.

0:20:37.217 --> 0:20:43.254
What you then apply here is the positional
encoding; then we do the self-attention

0:20:43.254 --> 0:20:46.734
to all the others, and this can be bi-directional.

0:20:47.707 --> 0:20:54.918
You normally do another feed forward layer
just like to make things to learn additional

0:20:54.918 --> 0:20:55.574
things.

0:20:55.574 --> 0:21:02.785
You also have a feed-forward layer
which takes your hidden state and generates

0:21:02.785 --> 0:21:07.128
a new hidden state, because we are making things
deeper.

0:21:07.747 --> 0:21:15.648
Then this blue part you can stack over several
times so you can have layers so that.

0:21:16.336 --> 0:21:30.256
In addition, you have these blue arrows, the residual
connections; we talked about this for RNNs: if you are now back-

0:21:30.256 --> 0:21:35.883
propagating your error from the top, it has to pass through all layers.

0:21:36.436 --> 0:21:48.578
In order to prevent problems with that, each layer does not really
learn how to transform everything; instead

0:21:48.578 --> 0:21:51.230
it only has to learn what should change.

0:21:51.671 --> 0:22:00.597
So you're calculating what should be changed
by this layer;

0:22:00.597 --> 0:22:09.365
the error can flow backwards past each layer, and the learning
is easier.

0:22:10.750 --> 0:22:21.632
That is the encoder; before we go to the decoder:

0:22:21.632 --> 0:22:30.655
do we have any additional questions?

0:22:31.471 --> 0:22:33.220
That's a very good point.

0:22:33.553 --> 0:22:38.709
Yeah, you normally always, at least in the
default architecture, only look at the

0:22:38.709 --> 0:22:38.996
top

0:22:40.000 --> 0:22:40.388
of the encoder.

0:22:40.388 --> 0:22:42.383
Of course, you can do other things.

0:22:42.383 --> 0:22:45.100
We investigated, for example, using the lowest layer:

0:22:45.100 --> 0:22:49.424
the decoder looking at the lowest layer
of the encoder and not at the top.

0:22:49.749 --> 0:23:05.342
You can average, or you can even learn it; theoretically
what you can also do is attend to all of them.

0:23:05.785 --> 0:23:11.180
You can attend to all possible layers and states.

0:23:11.180 --> 0:23:18.335
But the default is that you
only use the top layer.

0:23:20.580 --> 0:23:31.999
In the decoder, what we're doing is first
the same positional encoding, then we're doing

0:23:31.999 --> 0:23:36.419
self attention in the decoder side.

0:23:37.837 --> 0:23:43.396
Of course here it's important that we're doing
the masked self-attention, so that we're only

0:23:43.396 --> 0:23:45.708
attending to the past and not to the future.

0:23:47.287 --> 0:24:02.698
Here you see the difference, so in this case
the keys and values are from the encoder and

0:24:02.698 --> 0:24:03.554
the queries are from the decoder.

0:24:03.843 --> 0:24:12.103
You compare it to all the encoder hidden
states, calculate the similarity, and then

0:24:12.103 --> 0:24:13.866
you do the weighted sum.

0:24:14.294 --> 0:24:17.236
And that is added to what is here.

0:24:18.418 --> 0:24:29.778
Then you have a linear layer, and again this
green block is stacked several times.

0:24:32.232 --> 0:24:36.987
Question: so each decoder layer,

0:24:36.987 --> 0:24:46.039
every one of them, attends to the last layer
of the encoder?

0:24:46.246 --> 0:24:51.007
Yes, all of them, and only to the last, or top, layer
of the encoder.

0:24:57.197 --> 0:25:00.127
Good, so that would be it.

0:25:01.501 --> 0:25:12.513
So for sequence-to-sequence models we have looked at attention;
before we move on to decoding, do you have any

0:25:12.513 --> 0:25:18.020
more questions on this type of architecture?

0:25:20.480 --> 0:25:30.049
Transformer was first used in machine translation,
but now it's a standard thing for doing nearly

0:25:30.049 --> 0:25:32.490
any type of sequence model.

0:25:33.013 --> 0:25:35.984
Even large language models.

0:25:35.984 --> 0:25:38.531
They are a bit similar.

0:25:38.531 --> 0:25:45.111
They just throw away the encoder and
the cross-attention.

0:25:45.505 --> 0:25:59.329
And that is maybe interesting that it's important
to have this attention because you cannot store

0:25:59.329 --> 0:26:01.021
everything.

0:26:01.361 --> 0:26:05.357
The interesting thing with the attention is
now we can attend to everything.

0:26:05.745 --> 0:26:13.403
So you can again go back to your initial model
and have just a simple sequence model and then

0:26:13.403 --> 0:26:14.055
target.

0:26:14.694 --> 0:26:24.277
That would be more of a language-model style,
or people call it a decoder-only model, where

0:26:24.277 --> 0:26:26.617
you throw this away.

0:26:27.247 --> 0:26:30.327
The nice thing is, because of your self-attention,

0:26:30.327 --> 0:26:34.208
you had the original problem for which you introduced
the attention;

0:26:34.208 --> 0:26:39.691
you don't have that anymore, because not
everything is summarized, but each time you

0:26:39.691 --> 0:26:44.866
generate, you're looking back at all the previous
words, the source and the target.

0:26:45.805 --> 0:26:51.734
And there is a lot of work on whether it is really
important to have an encoder-decoder model, or

0:26:51.734 --> 0:26:54.800
whether a decoder-only model is just as good.

0:26:54.800 --> 0:27:00.048
But the comparison is not that easy because
how many parameters do you have?

0:27:00.360 --> 0:27:08.832
So I think the general finding at the moment is,
at least for machine translation, it's normally

0:27:08.832 --> 0:27:17.765
a bit better to have an encoder-decoder model
and not a decoder-only model where you just concatenate

0:27:17.765 --> 0:27:20.252
the source and the target.

0:27:21.581 --> 0:27:24.073
But there is not really a big difference anymore.

0:27:24.244 --> 0:27:29.891
Because this big issue which we had initially,
that everything is stored in one working

0:27:29.891 --> 0:27:31.009
state, is gone.

0:27:31.211 --> 0:27:45.046
Of course, the advantage here is maybe that
you give it a bias to keep source-language information separate.

0:27:45.285 --> 0:27:53.702
While in a decoder-only model this is all
merged into one thing, and sometimes it is good

0:27:53.702 --> 0:28:02.120
to give models a bit of a bias: you should
maybe treat things separately and you should

0:28:02.120 --> 0:28:03.617
look at them differently.

0:28:04.144 --> 0:28:11.612
And of course there is one other difference, one other
disadvantage maybe, of a decoder-only model.

0:28:16.396 --> 0:28:19.634
Think about the source sentence and how
it's treated.

0:28:21.061 --> 0:28:33.787
In the encoder-decoder architecture, the encoder can look in both
directions of the sentence for every state, and that makes a little

0:28:33.787 --> 0:28:35.563
difference.

0:28:35.475 --> 0:28:43.178
If you only have a decoder, that has to be
unidirectional, because for the decoder side,

0:28:43.178 --> 0:28:51.239
for the generation, you need it; and so your
input is read state by state and you don't have

0:28:51.239 --> 0:28:54.463
bidirectional information about the source.

0:28:56.596 --> 0:29:05.551
Again, it receives a sequence of embeddings
with position encoding,

0:29:05.551 --> 0:29:11.082
which is like a long vector, and it has an output.

0:29:11.031 --> 0:29:17.148
I don't understand how you can stack these blocks
on top of each other through the inputs,

0:29:17.097 --> 0:29:20.060
if the output is not the same as the input it consumes.

0:29:21.681 --> 0:29:27.438
Okay, very good question: this projection into words
is only done on the top layer.

0:29:27.727 --> 0:29:32.012
So only this green block is repeated.

0:29:32.012 --> 0:29:38.558
You have the word embedding plus the position
embedding.

0:29:38.558 --> 0:29:42.961
You have one layer of the decoder, which outputs hidden states of the same size.

0:29:43.283 --> 0:29:48.245
Then you stack the second one, the third
one, the fourth one, and then on the top

0:29:48.208 --> 0:29:55.188
layer you put this projection layer, which
takes a, say, one-thousand-dimensional vector and

0:29:55.188 --> 0:30:02.089
generates, based on your vocabulary of maybe
ten thousand words, a softmax output which gives you

0:30:02.089 --> 0:30:04.442
the probability of all words.
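
As a rough sketch of this output layer (the sizes are just the example values mentioned above, and softmax is the helper from the earlier sketch):

```python
def output_distribution(hidden, W_out, b_out):
    """hidden: (d_model,) top-layer state, e.g. 1000-dimensional;
    W_out: (d_model, vocab_size), e.g. 1000 x 10000."""
    logits = hidden @ W_out + b_out
    return softmax(logits)    # probability of every word in the vocabulary
```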

0:30:06.066 --> 0:30:22.369
That works for the masked part on the target
side, but wouldn't it be possible

0:30:22.262 --> 0:30:27.015
to make the masking only partially causal,
for example bidirectional over the source part?

0:30:27.647 --> 0:30:33.140
Yes, there is work on that; I think we will discuss
it in the lecture on pre-trained models.

0:30:33.493 --> 0:30:39.756
There is an architecture where you do exactly that.

0:30:39.756 --> 0:30:48.588
If you view the attention as a matrix, it is like a triangular
mask here.

0:30:48.708 --> 0:30:53.018
And here it's a full matrix, so everybody is
attending to each position;

0:30:53.018 --> 0:30:54.694
here you're only attending to the past.

0:30:54.975 --> 0:31:05.744
Then you can do something in between, where this
first part attends to everything and the rest does not.

0:31:06.166 --> 0:31:13.961
So there is a bit more that is possible, and
we'll cover that in the lecture on pre-trained

0:31:13.961 --> 0:31:14.662
models.

0:31:18.478 --> 0:31:27.440
So now we know how to build a translation
system, but of course we don't just want to have

0:31:27.440 --> 0:31:30.774
a translation system by itself; we want to use it.

0:31:31.251 --> 0:31:40.037
Now, given this model and an input sentence, how
can we generate an output?

0:31:40.037 --> 0:31:49.398
The general idea is: what we really
want to do is start with the model.

0:31:49.398 --> 0:31:53.893
We generate different possible translations.

0:31:54.014 --> 0:31:59.754
We score them with the log probability that we're
getting: for each input and output pair

0:31:59.754 --> 0:32:05.430
we can calculate the log probability, which
comes from the product of the probabilities of each

0:32:05.430 --> 0:32:09.493
word in there, and then we can find what is
the most probable.

0:32:09.949 --> 0:32:15.410
However, that's a bit complicated we will
see because we can't look at all possible translations.

0:32:15.795 --> 0:32:28.842
So there is an infinite, or at least enormous, number of possible
translations, so we have to do it somehow

0:32:28.842 --> 0:32:31.596
more intelligently.

0:32:32.872 --> 0:32:37.821
So what we want to do today in the rest of
the lecture?

0:32:37.821 --> 0:32:40.295
What is the search problem?

0:32:40.295 --> 0:32:44.713
Then we will look at different search algorithms.

0:32:45.825 --> 0:32:56.636
We will compare model and search errors: there
can be errors of the model, where the model

0:32:56.636 --> 0:33:03.483
is not giving the highest score to the best
translation.

0:33:03.903 --> 0:33:21.069
Search, on the other hand, is always about finding the best translation
according to one model, and it is interesting to separate the two.

0:33:24.004 --> 0:33:29.570
And how do we do the search?

0:33:29.570 --> 0:33:41.853
Ideally, we want to find the translation where the
difference to the reference is minimal.

0:33:42.042 --> 0:33:44.041
The nice thing is: in SMT

0:33:44.041 --> 0:33:51.347
this wasn't the case, but in neural machine translation
we can generate any possible translation,

0:33:51.347 --> 0:33:53.808
at least within our vocabulary.

0:33:53.808 --> 0:33:58.114
But if we have BPE we can really generate
any possible.

0:33:58.078 --> 0:34:04.604
translation. In theory we could always minimize
that difference, but we can't do it that easily because

0:34:04.604 --> 0:34:07.734
of course we don't have the reference at hand.

0:34:07.747 --> 0:34:10.384
If we had the reference, it would not be a problem.

0:34:10.384 --> 0:34:13.694
We would know what we are searching for, but we
don't have it.

0:34:14.054 --> 0:34:23.886
So instead we model this by just finding
the translation with the highest probability.

0:34:23.886 --> 0:34:29.015
In other words, we want to find the most probable translation.

0:34:29.169 --> 0:34:32.525
The idea is that our model is a good approximation;

0:34:32.525 --> 0:34:34.399
that's how we trained it:

0:34:34.399 --> 0:34:36.584
to model what a good translation is.

0:34:36.584 --> 0:34:43.687
And if we find translation with the highest
probability, this should also give us the best

0:34:43.687 --> 0:34:44.702
translation.

0:34:45.265 --> 0:34:56.965
And that is then, of course, the difference
to the search error: a model error means the model

0:34:56.965 --> 0:35:02.076
doesn't predict the best translation.

0:35:02.622 --> 0:35:08.777
How can we do the search? First of all, the basic
search seems to be very easy:

0:35:08.777 --> 0:35:15.003
what we can do is the forward
pass for the whole encoder, and that's how it

0:35:15.003 --> 0:35:21.724
starts: the input sentence is known, so you can put
in the input sentence and calculate all your encoder states

0:35:21.724 --> 0:35:22.573
and hidden representations.

0:35:23.083 --> 0:35:35.508
Then you can put in your sentence-start token
and you can generate the first output.

0:35:35.508 --> 0:35:41.721
Here you have the probability distribution over the first word.

0:35:41.801 --> 0:35:52.624
I guess what you all would do, and that is the typical
algorithm (we will see later whether it is a good idea),

0:35:52.624 --> 0:35:54.788
is to then select the most probable word.

0:35:55.235 --> 0:36:06.265
So if you generate here a probability distribution
over all the words in your vocabulary then

0:36:06.265 --> 0:36:08.025
you can just select from it.

0:36:08.688 --> 0:36:13.147
Yeah, this is how auto-completion is done
in our systems.

0:36:14.794 --> 0:36:19.463
Yeah, and this is also why there you have to have
a model of possible continuations.

0:36:19.463 --> 0:36:24.314
It's more of a language model, but then this
is one algorithm to do the search.

0:36:24.314 --> 0:36:26.801
They maybe have also more advanced ones.

0:36:26.801 --> 0:36:32.076
We will see that; the search in auto-
completion can be exactly the same as the

0:36:32.076 --> 0:36:33.774
search in machine translation.

0:36:34.914 --> 0:36:40.480
We'll see that this is not optimal, so
hopefully it's not done exactly this way, but it is a starting point for this

0:36:40.480 --> 0:36:41.043
problem.

0:36:41.941 --> 0:36:47.437
And what you can do then is select this
word

0:36:47.437 --> 0:36:50.778
as the best partial translation so far.

0:36:51.111 --> 0:36:57.675
Because the decoder, of course, in the next
step needs to know what the best word

0:36:57.675 --> 0:37:02.396
here was: it takes it as input and generates the next probability
distribution.

0:37:03.423 --> 0:37:14.608
Then you have a new distribution, and you can
do the same thing: take the best word there,

0:37:14.608 --> 0:37:15.216
and so on.

0:37:15.435 --> 0:37:22.647
So you can continue doing that and, hopefully,
always get the best translation in the end.

0:37:23.483 --> 0:37:30.839
The first question is, of course: how long
do you keep doing this?

0:37:30.839 --> 0:37:33.854
Now we could go forever.

0:37:36.476 --> 0:37:52.596
In training we had the start token at the input and we put
the stop token at the end of the output.

0:37:53.974 --> 0:38:07.217
And this is important, because if we didn't do that
we wouldn't know when to stop: we generate until the model predicts the stop token.
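
Putting the last steps together, greedy decoding is roughly the following loop; this is a hedged sketch where model.encode, model.step, BOS and EOS are placeholder names, not an actual API:

```python
def greedy_decode(model, src_tokens, max_len=200):
    encoder_states = model.encode(src_tokens)       # forward pass over the known input
    output = [BOS]                                  # start-of-sentence token
    for _ in range(max_len):
        probs = model.step(encoder_states, output)  # distribution over the next word
        best = int(probs.argmax())                  # always pick the most probable word
        output.append(best)
        if best == EOS:                             # stop once the model predicts the end token
            break
    return output[1:]
```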

0:38:10.930 --> 0:38:16.193
So that seems to be a good idea, but is it
really?

0:38:16.193 --> 0:38:21.044
Do we find the most probable sentence in this?

0:38:23.763 --> 0:38:25.154
Or might that yield problems?

0:38:27.547 --> 0:38:41.823
We are always selecting the highest probability
one, so it seems to be that this is a very

0:38:41.823 --> 0:38:45.902
good solution. Does anybody see a problem?

0:38:46.406 --> 0:38:49.909
Yes, that is actually the problem.

0:38:49.909 --> 0:38:56.416
You might do early decisions and you don't
have the global view.

0:38:56.796 --> 0:39:02.813
And this problem happens because it is an
autoregressive model.

0:39:03.223 --> 0:39:13.275
So it happens because yeah, the output we
generate is the input in the next step.

0:39:13.793 --> 0:39:19.493
And this, of course, is leading to problems.

0:39:19.493 --> 0:39:27.474
Always taking the locally best word doesn't
mean you end up with the best overall sequence.

0:39:27.727 --> 0:39:33.941
It would be different if you have a problem
where the output is not influencing your input.

0:39:34.294 --> 0:39:44.079
Then this greedy solution would give you the best
output; but since the output is influencing

0:39:44.079 --> 0:39:47.762
your next input and thus the model's next prediction, it does not.

0:39:48.268 --> 0:39:51.599
One question might now be: why do we
have this type of model at all?

0:39:51.771 --> 0:39:58.946
So why do we really need to put in here the
last generated word?

0:39:58.946 --> 0:40:06.078
You could also put in something that does not depend on the output
and then always predict the word; the nice thing is then you wouldn't

0:40:06.078 --> 0:40:11.846
need to do beams or a difficult search because
then the output here wouldn't influence what

0:40:11.846 --> 0:40:12.975
is inputted here.

0:40:15.435 --> 0:40:20.219
Any idea why that might not be the best idea?

0:40:20.219 --> 0:40:24.588
You would just be translating each word independently.

0:40:26.626 --> 0:40:37.815
The second point is right: yes, you're not generating
a coherent sentence.

0:40:38.058 --> 0:40:48.197
We'll also see that later; it's called non-
autoregressive translation, so there is work

0:40:48.197 --> 0:40:49.223
on that.

0:40:49.529 --> 0:41:02.142
So you might know roughly what to generate, because
it's based on this hidden state, but it can

0:41:02.142 --> 0:41:08.588
be that in the end you only have independent probability distributions.

0:41:09.189 --> 0:41:14.633
And then you're not modeling the dependencies
between the words within the target sentence.

0:41:14.633 --> 0:41:27.547
For example, if you can express something in German in several ways,
then you don't know which one you really selected.

0:41:27.547 --> 0:41:32.156
That influences what you generate later.

0:41:33.393 --> 0:41:46.411
Could you then try to find a better way, not only
based on the English sentence but also on the words

0:41:46.411 --> 0:41:48.057
that come around it?

0:41:49.709 --> 0:42:00.954
Yes, that is more like a two-step decoding, but
that is, of course, a lot more computationally expensive.

0:42:01.181 --> 0:42:15.978
The first thing you can do, which is typically
done, is to not really do the full search.

0:42:16.176 --> 0:42:32.968
So let's first look at what the problem of greedy search
is, to make it a bit more clear.

0:42:34.254 --> 0:42:53.163
And now you can extend them: you can extend
these hypotheses and compute the joint probabilities.

0:42:54.334 --> 0:42:59.063
The other thing is the second word.

0:42:59.063 --> 0:43:03.397
You can then do the same for the second word.

0:43:03.397 --> 0:43:07.338
Now you see the problem here.

0:43:07.707 --> 0:43:17.507
It is true that this one has the highest probability,
but the other one has a better extension.

0:43:18.078 --> 0:43:31.585
So the problem is: just because at one position
one hypothesis, you can also call it a

0:43:31.585 --> 0:43:34.702
partial translation, scores higher, it does not have to stay ahead.

0:43:34.874 --> 0:43:41.269
The blue one is higher at the beginning, but the green
one can be extended better and it will overtake it.

0:43:45.525 --> 0:43:54.672
So the problem with this greedy
search is that we might not end up with a really

0:43:54.672 --> 0:43:55.275
good translation.
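
To make the greedy strategy concrete, here is a minimal sketch in Python; the `step_fn` callback and the token ids are assumptions for illustration, not something from the lecture itself.

```python
import numpy as np

def greedy_decode(step_fn, bos_id, eos_id, max_len=100):
    """Greedy decoding sketch: at every position take the single most
    probable next word and feed it back in as the next input.

    `step_fn(prefix)` is assumed to return a probability distribution
    over the vocabulary for the next word given the partial translation.
    """
    prefix = [bos_id]
    for _ in range(max_len):
        probs = step_fn(prefix)           # shape: (vocab_size,)
        next_id = int(np.argmax(probs))   # locally best word only
        prefix.append(next_id)
        if next_id == eos_id:
            break
    return prefix
```

Because each locally best word immediately becomes the next input, an early mistake cannot be undone later, which is exactly the problem described above.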

0:43:55.956 --> 0:44:00.916
So the first thing we could now do is: yeah, we can just try

0:44:00.880 --> 0:44:06.049
all combinations that are there, so that
is the other direction.

0:44:06.049 --> 0:44:13.020
So if only checking the first one
doesn't give us a

0:44:13.020 --> 0:44:17.876
good result, maybe what we have to do is just
try everything.

0:44:18.318 --> 0:44:23.120
The nice thing is if we try everything, we'll
definitely find the best translation.

0:44:23.463 --> 0:44:26.094
So we won't have a search error.

0:44:26.094 --> 0:44:28.167
We'll come to that later.

0:44:28.167 --> 0:44:32.472
The interesting thing is our translation performance.

0:44:33.353 --> 0:44:37.039
But we will definitely find the most probable
translation.

0:44:38.598 --> 0:44:44.552
However, it's not really possible because
the number of combinations is just too high.

0:44:44.764 --> 0:44:57.127
So the number of combinations is your vocabulary
size to the power of the length of your sentence.

0:44:57.157 --> 0:45:03.665
With ten thousand words or so, you can imagine that very
soon you will have so many possibilities here

0:45:03.665 --> 0:45:05.597
that you cannot check them all.
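
As a rough worked example (the concrete numbers are illustrative, not from the slides): with a vocabulary of $|V| = 10{,}000$ and a target length of $I$ words there are $|V|^{I}$ complete hypotheses, so already for $I = 10$ that is $10{,}000^{10} = 10^{40}$ candidate translations, far too many to enumerate.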

0:45:06.226 --> 0:45:13.460
So this is not really an implementation or an
algorithm that you can use for doing machine

0:45:13.460 --> 0:45:14.493
translation.

0:45:15.135 --> 0:45:24.657
So maybe we have to do something in between
and yeah, not look at all but only look at

0:45:24.657 --> 0:45:25.314
some.

0:45:26.826 --> 0:45:29.342
And the easiest thing for that is okay.

0:45:29.342 --> 0:45:34.877
Just do sampling, so if we don't know what
to look at, maybe it's good to randomly pick

0:45:34.877 --> 0:45:35.255
some.

0:45:35.255 --> 0:45:40.601
That alone is not yet a very good algorithm;
the basic idea is we always randomly select

0:45:40.601 --> 0:45:42.865
the word, of course, based on the model.

0:45:43.223 --> 0:45:52.434
We are doing that n times, and then we are
looking at which one at the end has the highest probability.

0:45:52.672 --> 0:45:59.060
So we are not doing anymore really searching
for the best one, but we are more randomly

0:45:59.060 --> 0:46:05.158
doing selections, with the idea that always
selecting the best one at the beginning might not be ideal.

0:46:05.158 --> 0:46:11.764
So maybe it's better to do it randomly, but of
course one important thing is how do we randomly

0:46:11.764 --> 0:46:12.344
select?

0:46:12.452 --> 0:46:15.756
If we just do uniform distribution, it would
be very bad.

0:46:15.756 --> 0:46:18.034
You'll only have very bad translations.

0:46:18.398 --> 0:46:23.261
Because in each position if you think about
it you have ten thousand possibilities.

0:46:23.903 --> 0:46:28.729
Most of them are really bad decisions and
you shouldn't do that.

0:46:28.729 --> 0:46:35.189
There is always only a very small number of good ones,
at least compared to the 10,000 words.

0:46:35.395 --> 0:46:43.826
So if you have the sentence here, this is
an English sentence.

0:46:43.826 --> 0:46:47.841
You can start with 'die' and so on.

0:46:48.408 --> 0:46:58.345
If you're thinking about, say, translating
legal documents,

0:46:58.345 --> 0:47:02.350
you should not change the translation randomly.

0:47:03.603 --> 0:47:11.032
The problem is we have a neural network, we
have a black box, so it's anyway a bit random.

0:47:12.092 --> 0:47:24.341
It is a concern, but you will see that if
you do it intelligently, for clear sentences

0:47:24.341 --> 0:47:26.986
there is not that much randomness.

0:47:27.787 --> 0:47:35.600
It is an issue we should consider: this
might lead to more randomness, but it might

0:47:35.600 --> 0:47:39.286
also be positive for machine translation.

0:47:40.080 --> 0:47:46.395
I at least can't directly think of a good application
in machine translation where it's positive, but if you think

0:47:46.395 --> 0:47:52.778
about dialogue systems, for example, where
a similar architecture is nowadays also used,

0:47:52.778 --> 0:47:55.524
you predict what the system should say.

0:47:55.695 --> 0:48:00.885
Then you want to have randomness because it's
not always saying the same thing.

0:48:01.341 --> 0:48:08.370
Machine translation is typically not like that: you want
to have consistency, so if you have the same

0:48:08.370 --> 0:48:09.606
input, you normally want the same output.

0:48:09.889 --> 0:48:14.528
Therefore, sampling is not the method typically used.

0:48:14.528 --> 0:48:22.584
There are some settings, which you will see later,
where it is used as a preprocessing step.

0:48:23.003 --> 0:48:27.832
But of course it's important how you can make
this process not too random.

0:48:29.269 --> 0:48:41.619
Therefore, the first thing is: don't take a
uniform distribution, because we already have a very nice

0:48:41.619 --> 0:48:43.562
distribution.

0:48:43.843 --> 0:48:46.621
So instead of just randomly taking any word,

0:48:46.621 --> 0:48:51.328
we are looking at the output distribution and
taking a word according to it.

0:48:51.731 --> 0:49:03.901
So that means we are taking the word 'die' with its probability,
the word 'das' with its probability, and so on for all the words.

0:49:04.444 --> 0:49:06.095
How can you do that?

0:49:06.095 --> 0:49:09.948
You randomly draw a number between zero and
one.

0:49:10.390 --> 0:49:23.686
And then you have ordered your words in some
way, and you take the word where the cumulative

0:49:23.686 --> 0:49:26.375
sum of the probabilities first exceeds the drawn number.

0:49:26.806 --> 0:49:34.981
So the easiest thing is you have zero point
five, zero point two five, and zero point two

0:49:34.981 --> 0:49:35.526
five.

0:49:35.526 --> 0:49:43.428
If you draw a number smaller than 0.5 you take
the first word, between 0.5 and 0.75 you take the second word, and

0:49:43.428 --> 0:49:45.336
if it's higher you take the third.

0:49:45.845 --> 0:49:57.707
Therefore, you can very easily get a distribution
distributed according to this probability mass

0:49:57.707 --> 0:49:59.541
and no longer uniformly.
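
A minimal sketch of this cumulative-probability trick, assuming `words` and `probs` come from the model's output distribution (the function name is mine, not from the lecture):

```python
import random

def sample_word(words, probs):
    """Sample one word according to the model's output distribution.

    Draw a number in [0, 1) and walk through the cumulative
    probabilities; the word whose interval contains the draw is taken.
    E.g. with probs [0.5, 0.25, 0.25]: a draw below 0.5 gives the first
    word, below 0.75 the second, otherwise the third.
    """
    r = random.random()
    cumulative = 0.0
    for word, p in zip(words, probs):
        cumulative += p
        if r < cumulative:
            return word
    return words[-1]  # guard against floating-point rounding
```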

0:49:59.799 --> 0:50:12.479
You can even take that a bit further and focus more
on the important part if we are not randomly

0:50:12.479 --> 0:50:19.494
drawing from all words, but are looking
only at the most probable ones.

0:50:21.361 --> 0:50:24.278
Do you have an idea why this is an important
step?

0:50:24.278 --> 0:50:29.459
Although we say I'm only throwing away the
words which have a very low probability, so

0:50:29.459 --> 0:50:32.555
anyway the probability of taking them is quite
low.

0:50:32.555 --> 0:50:35.234
So normally that shouldn't matter that much.

0:50:36.256 --> 0:50:38.830
There's ten thousand words.

0:50:40.300 --> 0:50:42.074
Of course, maybe nine thousand nine hundred of them are bad,

0:50:42.074 --> 0:50:44.002
but together they could still add up to quite a bit of
probability mass.

0:50:47.867 --> 0:50:55.299
Yes, that's exactly why you do this kind of restricted sampling,
so that you don't take the lowest-

0:50:55.415 --> 0:50:59.694
probability words, but only look at the
most probable ones.

0:50:59.694 --> 0:51:04.632
Of course you have to rescale your probability
mass then so that it's still a probability

0:51:04.632 --> 0:51:08.417
because now it's a probability distribution
over ten thousand words.

0:51:08.417 --> 0:51:13.355
If you only take ten of them or so it's no
longer a probability distribution, you rescale

0:51:13.355 --> 0:51:15.330
them, and then you can still sample from it.

0:51:16.756 --> 0:51:20.095
That is what is done in sampling.

0:51:20.095 --> 0:51:26.267
It's not the most common approach, but it's used
in several settings.
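
A sketch of the restricted sampling with rescaling just described, often called top-k sampling; the value of k and the function name are assumptions for illustration:

```python
import numpy as np

def top_k_sample(probs, k=10):
    """Top-k sampling sketch: keep only the k most probable words,
    rescale them so they sum to one again, then sample from that
    restricted distribution. This avoids occasionally drawing one of
    the many very unlikely words in a 10,000-word vocabulary.
    """
    probs = np.asarray(probs, dtype=float)
    top = np.argsort(probs)[-k:]            # indices of the k best words
    restricted = np.zeros_like(probs)
    restricted[top] = probs[top]
    restricted /= restricted.sum()          # renormalize to a distribution
    return int(np.random.choice(len(probs), p=restricted))
```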

0:51:28.088 --> 0:51:40.625
Then there is beam search, which is somehow the standard
if you're doing some type of machine translation.

0:51:41.181 --> 0:51:50.162
And the basic idea is that in greedy search we
select the most probable word and only continue

0:51:50.162 --> 0:51:51.171
with that one.

0:51:51.691 --> 0:51:53.970
You can easily generalize this.

0:51:53.970 --> 0:52:00.451
We are not only continuing the most probable
one, but we are continuing the n most probable ones.

0:52:17.697 --> 0:52:26.920
You could say we are sampling so many examples
that it makes sense to take the ones with the highest probability.

0:52:27.127 --> 0:52:33.947
But what is important is that once you make a mistake,
you want it to not have that much influence.

0:52:39.899 --> 0:52:45.815
So the idea is that we're keeping the n best
hypotheses and not only the first-best.

0:52:46.586 --> 0:52:51.558
And the nice thing is in statistical machine
translation.

0:52:51.558 --> 0:52:54.473
We have exactly the same problem.

0:52:54.473 --> 0:52:57.731
You would do the same thing, however.

0:52:57.731 --> 0:53:03.388
Since the model wasn't that strong you needed
a quite large beam.

0:53:03.984 --> 0:53:18.944
Neural machine translation models are really strong
and you already get a very good performance with a small beam.

0:53:19.899 --> 0:53:22.835
So how does it work?

0:53:22.835 --> 0:53:35.134
We can compare it to what we did before, but now
we are storing the n most probable ones.

0:53:36.156 --> 0:53:45.163
Having done that, we extend all these hypotheses, and
of course it is now a bit difficult because

0:53:45.163 --> 0:53:54.073
now we always have to switch what is the input
so the search gets more complicated and the

0:53:54.073 --> 0:53:55.933
first one is easy.

0:53:56.276 --> 0:54:09.816
In this case we have to first put in here the one word,
and then somehow remove it and instead

0:54:09.816 --> 0:54:12.759
put in the other one.

0:54:13.093 --> 0:54:24.318
Otherwise you could only store your current
network states here and just continue by going

0:54:24.318 --> 0:54:25.428
forward.

0:54:26.766 --> 0:54:34.357
So now you have done the first two steps, and then
you have the n best.

0:54:34.357 --> 0:54:37.285
Can you now just continue?

0:54:39.239 --> 0:54:53.511
Yes, that's very important, otherwise all
your beam search doesn't really help because

0:54:53.511 --> 0:54:57.120
you would still have exponentially many hypotheses.

0:54:57.317 --> 0:55:06.472
So now you have to do one important step and
then reduce again to n.

0:55:06.472 --> 0:55:13.822
So in our case, to make things easier, we keep
two hypotheses.

0:55:14.014 --> 0:55:19.072
Otherwise you will have two to the power of
length possibilities, so it is still exponential.

0:55:19.559 --> 0:55:26.637
But by always throwing hypotheses away you keep
your beam size fixed.

0:55:26.637 --> 0:55:31.709
The items now differ in the last position.

0:55:32.492 --> 0:55:42.078
They are completely different, but you are
always searching what is the best one.

0:55:44.564 --> 0:55:50.791
So another way of hearing it is like this,
so just imagine you start with the empty sentence.

0:55:50.791 --> 0:55:55.296
Then you have three possible extensions: A,
B, and end of sentence.

0:55:55.296 --> 0:55:59.205
You're throwing away the worst one and continuing
with the two best.

0:55:59.699 --> 0:56:13.136
Then you want to stay at two, so at this step
you keep either one or the other, and then you continue.

0:56:13.293 --> 0:56:24.924
So you always have this exponentially growing
tree, but you throw most of it away and only

0:56:24.924 --> 0:56:26.475
continue with a few.

0:56:26.806 --> 0:56:42.455
And thereby you can hopefully make fewer errors,
because in these examples you would still keep this

0:56:42.455 --> 0:56:43.315
one.

0:56:43.503 --> 0:56:47.406
So you're preventing some errors, but of course
it's not perfect.

0:56:47.447 --> 0:56:56.829
You can still make errors, because the best continuation could
come not from the second-best but from the fourth-best hypothesis.

0:56:57.017 --> 0:57:03.272
The idea is just that you make, yeah, fewer
errors and prevent some of that.
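
Putting the pieces together, here is a hedged sketch of beam search; the `step_fn` interface and default beam size are assumptions made for the example, not part of the lecture material:

```python
import numpy as np

def beam_search(step_fn, bos_id, eos_id, beam_size=4, max_len=100):
    """Beam search sketch: keep the n best partial translations,
    extend each of them, then prune back to the n best again, so the
    search space stays fixed instead of growing exponentially.

    `step_fn(prefix)` is assumed to return log-probabilities over the
    vocabulary for the next word given the partial translation.
    """
    beams = [([bos_id], 0.0)]                     # (prefix, log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos_id:              # finished hypothesis: keep as-is
                candidates.append((prefix, score))
                continue
            log_probs = step_fn(prefix)           # shape: (vocab_size,)
            # extending with the top beam_size words per prefix is enough,
            # since only that many can survive the global pruning step
            for w in np.argsort(log_probs)[-beam_size:]:
                candidates.append((prefix + [int(w)],
                                   score + float(log_probs[w])))
        # prune: keep only the beam_size best partial translations
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(p[-1] == eos_id for p, _ in beams):
            break
    return beams[0]
```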

0:57:07.667 --> 0:57:11.191
Then the question is how much does it help?

0:57:11.191 --> 0:57:14.074
And here is some examples for that.

0:57:14.074 --> 0:57:16.716
So for statistical machine translation it was really like this:

0:57:16.716 --> 0:57:23.523
Typically, the larger the beam, the larger the
search space, and the better the score.

0:57:23.763 --> 0:57:27.370
So the larger you get, the bigger your beam,
the better you will be.

0:57:27.370 --> 0:57:30.023
Typically maybe use something like three hundred.

0:57:30.250 --> 0:57:38.777
And it's mainly a trade-off between quality
and speed because the larger your beams, the

0:57:38.777 --> 0:57:43.184
more time it takes and you want to finish it.

0:57:43.184 --> 0:57:49.124
So your quality improvements are getting smaller
and smaller.

0:57:49.349 --> 0:57:57.164
So the difference between a beam of one and
ten is bigger than the difference between a.

0:57:58.098 --> 0:58:14.203
And the interesting thing is that for neural models
we're seeing a bit of a different picture, and we're seeing

0:58:14.203 --> 0:58:16.263
typically.

0:58:16.776 --> 0:58:24.376
And then especially if you look at the green
ones, this is unnormalized.

0:58:24.376 --> 0:58:26.770
You're seeing a sharp drop.

0:58:27.207 --> 0:58:32.284
So your translation quality here, measured
in BLEU, will go down again.

0:58:33.373 --> 0:58:35.663
That is now a question.

0:58:35.663 --> 0:58:37.762
Why is that the case?

0:58:37.762 --> 0:58:43.678
Why should it get worse if we are seeing more and more possible
translations?

0:58:46.226 --> 0:58:48.743
If we have a bigger search space and we are searching more.

0:58:52.612 --> 0:58:56.312
I'm going to be using my examples before we
also look at the bar.

0:58:56.656 --> 0:58:59.194
A good idea.

0:59:00.000 --> 0:59:18.521
But that's not everything, because in the
end we are always selecting from this list.

0:59:18.538 --> 0:59:19.382
So this is here.

0:59:19.382 --> 0:59:21.170
We don't do any reranking to do that.

0:59:21.601 --> 0:59:29.287
So the probabilities at the end we always
give out the hypothesis with the highest probabilities.

0:59:30.250 --> 0:59:33.623
That is always the case.

0:59:33.623 --> 0:59:43.338
If you have a beam of n, the hypotheses you look at are a subset
of the ones you look at with a larger beam.

0:59:44.224 --> 0:59:52.571
So if you increase your beam you're just
looking at more, and you're always taking the

0:59:52.571 --> 0:59:54.728
one with the highest probability.

0:59:57.737 --> 1:00:07.014
Maybe the probabilities are all so comparable
that they don't really differ.

1:00:08.388 --> 1:00:14.010
But that the probabilities are exactly the same is not that
likely.

1:00:14.010 --> 1:00:23.931
One morning maybe you will have more examples
where we look at some stuff that's not seen

1:00:23.931 --> 1:00:26.356
in the training data.

1:00:28.428 --> 1:00:36.478
That's mainly the answer; why we assign too high
a probability mass we will see, but that is first of all

1:00:36.478 --> 1:00:43.087
the biggest issue, so here is a BLEU score,
so that is measuring translation quality.

1:00:43.883 --> 1:00:48.673
This will go down, while the probability of the
highest one only goes up or stays

1:00:48.673 --> 1:00:49.224
at least the same.

1:00:49.609 --> 1:00:57.971
The problem is, if we are searching more, we
are finding hypotheses which have a high probability

1:00:57.971 --> 1:00:59.193
but are bad translations.

1:00:59.579 --> 1:01:10.375
So we are finding these things which we wouldn't
find and we'll see why this is happening.

1:01:10.375 --> 1:01:15.714
So somehow we are reducing our search error.

1:01:16.336 --> 1:01:25.300
However, we also have a model error: the model
doesn't assign the highest probability in terms of translation

1:01:25.300 --> 1:01:27.942
quality to the really best one.

1:01:28.548 --> 1:01:31.460
These two errors don't always add up.

1:01:31.460 --> 1:01:34.932
Of course somehow they add up.

1:01:34.932 --> 1:01:41.653
If your model is worse, then your performance
will also go down.

1:01:42.202 --> 1:01:49.718
But sometimes it's happening that by increasing
search errors we are missing out the really

1:01:49.718 --> 1:01:57.969
bad translations which have a high probability
and we are only finding the decently good probability

1:01:57.969 --> 1:01:58.460
mass.

1:01:59.159 --> 1:02:03.859
So they are a bit independent of each other,
and you can make both types of errors.

1:02:04.224 --> 1:02:09.858
That's why, for example, doing exact search
will give you the translation with the highest

1:02:09.858 --> 1:02:15.245
probability, but there has been work on it
that you then even have a lower translation

1:02:15.245 --> 1:02:21.436
quality because then you find some random translation
which has a very high translation probability

1:02:21.436 --> 1:02:22.984
but which is really bad.

1:02:23.063 --> 1:02:29.036
Because our model is not perfect and is not giving
a perfect translation probability everywhere.

1:02:31.431 --> 1:02:34.537
So why is this happening?

1:02:34.537 --> 1:02:42.301
And one issue with this is the so-called label
or length bias.

1:02:42.782 --> 1:02:47.115
And we are in each step of decoding.

1:02:47.115 --> 1:02:55.312
We are modeling the probability of the next
word given the input and the previous words.

1:02:55.895 --> 1:03:06.037
So if you have this picture, so you always
hear you have the probability of the next word.

1:03:06.446 --> 1:03:16.147
That's that's what your modeling, and of course
the model is not perfect.

1:03:16.576 --> 1:03:22.765
So it can be that at some point we make a bit of a
wrong prediction, not for the first word but

1:03:22.765 --> 1:03:28.749
maybe for the 5th or 6th word, and we're
giving it an exceptionally high probability that we

1:03:28.749 --> 1:03:30.178
cannot recover from.

1:03:30.230 --> 1:03:34.891
Because this high probability will stay there
forever and we just multiply other things to

1:03:34.891 --> 1:03:39.910
it, but we cannot like later say all this probability
was a bit too high, we shouldn't have done.

1:03:41.541 --> 1:03:48.984
And this leads to that the more the longer
your translation is, the more often you use

1:03:48.984 --> 1:03:51.637
this probability distribution.

1:03:52.112 --> 1:04:03.321
The typical example is this one, so you have
the probability of the translation.

1:04:04.104 --> 1:04:12.608
And this probability is quite low as you see,
and maybe there are a lot of other things.

1:04:13.053 --> 1:04:25.658
However, it might still be overestimated that
it's still a bit too high.

1:04:26.066 --> 1:04:33.042
The problem is: if the correct translation
is a very long one, its probability mass gets

1:04:33.042 --> 1:04:33.545
lower.

1:04:34.314 --> 1:04:45.399
Because each time you multiply your probability
to it, so your sequence probability gets lower

1:04:45.399 --> 1:04:46.683
and lower.
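
In symbols (a standard way of writing this; the notation is mine, not from the slides):
$$\log P(y \mid x) = \sum_{t=1}^{|y|} \log P(y_t \mid y_{<t}, x), \qquad \text{with every summand} \le 0,$$
so each additional target word can only keep the total log-probability the same or push it further down, which is why a short hypothesis can end up with a higher score than a long, correct one.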

1:04:48.588 --> 1:04:59.776
And this means that at some point the long translation might
drop below this and have a lower probability.

1:05:00.180 --> 1:05:09.651
And if this short hypothesis was not thrown away at the
beginning but stayed in your beam, then

1:05:09.651 --> 1:05:14.958
at this point you would select the empty sentence.

1:05:15.535 --> 1:05:25.379
So this happens because this short translation
is kept and not thrown away.

1:05:28.268 --> 1:05:31.121
So,.

1:05:31.151 --> 1:05:41.256
If you have a very small beam that can be prevented,
but if you have a large beam, this one is in

1:05:41.256 --> 1:05:41.986
there.

1:05:42.302 --> 1:05:52.029
In general it seems somewhat reasonable that shorter
translations get a higher probability than longer sentences,

1:05:52.029 --> 1:05:54.543
because of this effect.

1:05:56.376 --> 1:06:01.561
It's a bit depending on whether the translation
should be a bit related to your input.

1:06:02.402 --> 1:06:18.053
And since we are always multiplying probabilities,
the longer the sequence, the smaller

1:06:18.053 --> 1:06:18.726
it gets.

1:06:19.359 --> 1:06:29.340
It's somewhat right for human language too, but
the models tend to overestimate

1:06:29.340 --> 1:06:34.388
short translations compared to long translations.

1:06:35.375 --> 1:06:46.474
Then, of course, that means that it's not
easy to stay on a computer because eventually

1:06:46.474 --> 1:06:48.114
it suggests.

1:06:51.571 --> 1:06:59.247
First of all, there is another way that's
typically used, but you don't really have to,

1:06:59.247 --> 1:07:07.089
because this is normally not at the second position,
and if it's at, say, the 20th position you only

1:07:07.089 --> 1:07:09.592
need a beam that large.

1:07:10.030 --> 1:07:17.729
But you are right because these issues get
larger, the larger your input is, and then

1:07:17.729 --> 1:07:20.235
you might make more errors.

1:07:20.235 --> 1:07:27.577
So therefore this is true, but it's not as
simple as this one always being in the beam.

1:07:28.408 --> 1:07:45.430
That the translation quality goes down with
higher beam sizes, has there been more controlled analysis of that?

1:07:47.507 --> 1:07:51.435
In this work you see it does not.

1:07:51.435 --> 1:07:53.027
It does not go down.

1:07:53.027 --> 1:08:00.246
That's the light green here, but at least you
don't see the sharp drop.

1:08:00.820 --> 1:08:07.897
So if you do some type of normalization, at
least you can address this problem and limit

1:08:07.897 --> 1:08:08.204
it.

1:08:15.675 --> 1:08:24.828
There are other reasons as well; it's not
only the length, but there can be

1:08:24.828 --> 1:08:26.874
other reasons why.

1:08:27.067 --> 1:08:37.316
And if you just make the beam too large, you're
looking too often at hypotheses that it would be

1:08:37.316 --> 1:08:40.195
better to ignore.

1:08:41.101 --> 1:08:44.487
But that's more a hand-wavy argument.

1:08:44.487 --> 1:08:47.874
I agree, so I don't know the exact answer.

1:08:48.648 --> 1:08:53.223
You need to do the normalization and there
are different ways of doing it.

1:08:53.223 --> 1:08:54.199
It's mainly OK.

1:08:54.199 --> 1:08:59.445
We're just now not taking the translation
with the highest probability, but we during

1:08:59.445 --> 1:09:04.935
decoding have another feature saying not
only take the one with the highest probability

1:09:04.935 --> 1:09:08.169
but also prefer translations which are a bit
longer.

1:09:08.488 --> 1:09:16.933
One way to do that is to divide
by the sentence length.

1:09:16.933 --> 1:09:23.109
We take not the highest but the highest average.

1:09:23.563 --> 1:09:28.841
Of course, if both are the same length, it
doesn't matter; if the length is the same in

1:09:28.841 --> 1:09:34.483
all cases, but if you compare a translation
with seven or eight words, there is a difference

1:09:34.483 --> 1:09:39.700
if you want to have the one with the highest
probability or with the highest average.

1:09:41.021 --> 1:09:50.993
So that is the first option; or you can have some reward
for each word, adding a bit to the score,

1:09:50.993 --> 1:09:51.540
and.

1:09:51.711 --> 1:10:03.258
And then, of course, you have to fine-tune
that; there are also more complex schemes.
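
Two common ways to score finished hypotheses so that short translations are no longer favoured, sketched below; `reward` is a tuning constant of my choosing, not a value from the lecture:

```python
def average_log_prob(log_prob, length):
    """Rank by the average per-word log-probability instead of the sum,
    i.e. divide the total log-probability by the sentence length."""
    return log_prob / length

def reward_per_word(log_prob, length, reward=0.1):
    """Alternatively, add a small constant bonus for every generated
    word, which also counteracts the bias towards short outputs."""
    return log_prob + reward * length
```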

1:10:03.903 --> 1:10:08.226
So there is different ways of doing that,
and of course that's important.

1:10:08.428 --> 1:10:11.493
But in all of that, the main idea is OK.

1:10:11.493 --> 1:10:18.520
We know of the error that the
model seems to prefer short translations.

1:10:18.520 --> 1:10:24.799
We circumvent that by saying: OK, we
are no longer searching only for the single best one.

1:10:24.764 --> 1:10:30.071
But we're searching for the best one plus
some additional constraints, so mainly what you

1:10:30.071 --> 1:10:32.122
are doing here is during decoding:

1:10:32.122 --> 1:10:37.428
You're not completely trusting your model,
but you're adding some biases or constraints

1:10:37.428 --> 1:10:39.599
into what should also be fulfilled.

1:10:40.000 --> 1:10:42.543
That can be, for example, that the length
should be reasonable.

1:10:49.369 --> 1:10:51.071
Any More Questions to That.

1:10:56.736 --> 1:11:04.001
The last idea, which recently gets quite a bit
more interest also, is what is called minimum

1:11:04.001 --> 1:11:11.682
Bayes risk decoding, and there is maybe not the
one correct translation but there are several

1:11:11.682 --> 1:11:13.937
good correct translations.

1:11:14.294 --> 1:11:21.731
And the idea is now we don't want to find
the one translation, which is maybe the highest

1:11:21.731 --> 1:11:22.805
probability.

1:11:23.203 --> 1:11:31.707
Instead we are looking at all the translations
with high probability, and then

1:11:31.707 --> 1:11:39.524
we want to take one representative out of this set,
the one that is most similar to all the other

1:11:39.524 --> 1:11:42.187
high-probability translations.

1:11:43.643 --> 1:11:46.642
So how does it work?

1:11:46.642 --> 1:11:55.638
First you could have imagined you have reference
translations.

1:11:55.996 --> 1:12:13.017
You have a set of reference translations, and
then what you want to get is the following.

1:12:13.073 --> 1:12:28.641
Weighted by a probability distribution, you measure
the similarity of each reference and the hypothesis.

1:12:28.748 --> 1:12:31.408
So you have two sets of translation.

1:12:31.408 --> 1:12:34.786
You have the human translations of a sentence.

1:12:35.675 --> 1:12:39.251
That's of course not realistic, but first
from the idea.

1:12:39.251 --> 1:12:42.324
Then you have your set of possible translations.

1:12:42.622 --> 1:12:52.994
And now you're not saying okay, we have only
one human, but we have several humans with

1:12:52.994 --> 1:12:56.294
different types of quality.

1:12:56.796 --> 1:13:07.798
You have two ingredients here: the similarity
between the automatic translation and the reference, and the quality

1:13:07.798 --> 1:13:09.339
of the human translation.

1:13:10.951 --> 1:13:17.451
Of course, we have the same problem that we
don't have the human references, so we have to approximate them.

1:13:18.058 --> 1:13:29.751
So when we are doing it, instead of estimating
the quality based on the human, we use our

1:13:29.751 --> 1:13:30.660
model.

1:13:31.271 --> 1:13:37.612
So we can't be like humans, so we take the
model probability.

1:13:37.612 --> 1:13:40.782
We take this set here first of all.

1:13:41.681 --> 1:13:48.755
Then we are comparing each hypothesis to this
one, so you have two sets.

1:13:48.755 --> 1:13:53.987
Just imagine here you take all possible translations.

1:13:53.987 --> 1:13:58.735
Here you take your hypothesis in comparing
them.

1:13:58.678 --> 1:14:03.798
And then you're estimating the quality
based on that comparison.

1:14:04.304 --> 1:14:06.874
So the overall idea is okay.

1:14:06.874 --> 1:14:14.672
We are not finding the best hypothesis but
finding the hypothesis which is most similar

1:14:14.672 --> 1:14:17.065
to many good translations.

1:14:19.599 --> 1:14:21.826
Why would you do that?

1:14:21.826 --> 1:14:25.119
It's a bit like a smoothing idea.

1:14:25.119 --> 1:14:28.605
Imagine this is the probability distribution.

1:14:29.529 --> 1:14:36.634
So if you would do beam search or greedy search
or anything, if you just take the highest-probability

1:14:36.634 --> 1:14:39.049
one, you would take this red one.

1:14:39.799 --> 1:14:45.686
If it has this type of probability distribution,

1:14:45.686 --> 1:14:58.555
then it might be better to take something from this other
mode, even though it's a bit lower in probability.

1:14:58.618 --> 1:15:12.501
So what you're mainly doing is you're doing
some smoothing of your probability distribution.

1:15:15.935 --> 1:15:17.010
How can you do that?

1:15:17.010 --> 1:15:20.131
Of course, we cannot do this against
all possible hypotheses.

1:15:21.141 --> 1:15:29.472
But what we can do is have just two sets,
and we can even take them to be the same.

1:15:29.472 --> 1:15:38.421
So we have our candidate set of hypotheses
and the set of pseudo references.

1:15:39.179 --> 1:15:55.707
And we can just take the same set for both, so we can
compare each hypothesis against the others as pseudo references.

1:15:56.656 --> 1:16:16.182
And then, of course, the question is how do
we measure the quality of the hypothesis?

1:16:16.396 --> 1:16:28.148
Of course, you could also take here the probability
P(y given x) as a weight, but you can also say

1:16:28.148 --> 1:16:30.958
we only take the top.

1:16:31.211 --> 1:16:39.665
And where we don't want to really rely on
how good they are, we filtered out all the

1:16:39.665 --> 1:16:40.659
bad ones.

1:16:40.940 --> 1:16:54.657
So that is the first question for minimum
Bayes risk: what are your pseudo references?

1:16:55.255 --> 1:17:06.968
So how do you set the weight of all these
pseudo references here? With independent sampling,

1:17:06.968 --> 1:17:10.163
they all have the same weight.

1:17:10.750 --> 1:17:12.308
There's also work where you can weight them.

1:17:13.453 --> 1:17:17.952
And then there is the second question, of course:

1:17:17.917 --> 1:17:26.190
how do you compare now two hypotheses? So
you have Y and H, which are both generated

1:17:26.190 --> 1:17:34.927
by the system and you want to find the H which
is most similar to all the other translations.

1:17:35.335 --> 1:17:41.812
So it's mainly this part here, which
says how similar H is to all the other Ys.

1:17:42.942 --> 1:17:50.127
So you have to again use some type of similarity
metric, which says how similar two hypotheses are.

1:17:52.172 --> 1:17:53.775
How can you do that?

1:17:53.775 --> 1:17:58.355
We luckily knew how to compare a reference
to a hypothesis.

1:17:58.355 --> 1:18:00.493
We have evaluation metrics.

1:18:00.493 --> 1:18:03.700
You can do something like a sentence-level metric.

1:18:04.044 --> 1:18:13.501
But especially if you're looking into neural models
you could use a stronger metric, so you can use

1:18:13.501 --> 1:18:17.836
a neural metric which directly compares the two.
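
A minimal sketch of the whole MBR idea, assuming uniform weights over the pseudo references and a `similarity` callback (for example sentence-level BLEU or a neural metric); all names here are mine, chosen for the illustration:

```python
def mbr_decode(hypotheses, pseudo_references, similarity):
    """Minimum Bayes risk decoding sketch.

    `hypotheses` and `pseudo_references` are both lists of candidate
    translations (they can be the very same sampled set), and
    `similarity(h, r)` is any sentence-level metric. Instead of
    returning the single most probable hypothesis, return the one with
    the highest average similarity to all the pseudo references, i.e.
    the most "representative" candidate.
    """
    best, best_utility = None, float("-inf")
    for h in hypotheses:
        utility = sum(similarity(h, r) for r in pseudo_references)
        utility /= len(pseudo_references)
        if utility > best_utility:
            best, best_utility = h, utility
    return best
```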

1:18:22.842 --> 1:18:29.292
Yes, so that is the main idea of minimum
Bayes risk decoding; the important idea you should

1:18:29.292 --> 1:18:35.743
keep in mind is that it's doing somehow the
smoothing by not taking the highest probability

1:18:35.743 --> 1:18:40.510
one, but by comparing like by taking a set
of high probability one.

1:18:40.640 --> 1:18:45.042
And then looking for the translation, which
is most similar to all of that.

1:18:45.445 --> 1:18:49.888
And thereby doing a bit more smoothing because
you look at this one.

1:18:49.888 --> 1:18:55.169
If you have this one, for example, it would
be more similar to all of these ones.

1:18:55.169 --> 1:19:00.965
But if you take this one, it's higher probability,
but it's very dissimilar to all these.

1:19:05.445 --> 1:19:17.609
Okay, that is all for decoding before we finish
with the combination of models.

1:19:18.678 --> 1:19:20.877
A question on the set of pseudo-references:

1:19:20.877 --> 1:19:24.368
how is it generated, with some type of search
or…?

1:19:24.944 --> 1:19:27.087
For example, you can do beam search.

1:19:27.087 --> 1:19:28.825
You can do sampling for that.

1:19:28.825 --> 1:19:31.257
Oh yeah, we had mentioned sampling there.

1:19:31.257 --> 1:19:34.500
I don't know, somebody was asking what sampling
is good for.

1:19:34.500 --> 1:19:37.280
So there's, of course, another important issue.

1:19:37.280 --> 1:19:40.117
How do you get a good, representative set of
H?

1:19:40.620 --> 1:19:47.147
If you do beam search, it might be that you
end up with two similar ones, and maybe it's

1:19:47.147 --> 1:19:49.274
prevented by doing sampling.

1:19:49.274 --> 1:19:55.288
But maybe with sampling you find worse ones;
still, some mix of methods can be helpful.

1:19:56.416 --> 1:20:04.863
Which search method is used more for transformer-based
translation models?

1:20:04.863 --> 1:20:09.848
Nowadays beam search is definitely the standard.

1:20:10.130 --> 1:20:13.749
There is work on this.

1:20:13.749 --> 1:20:27.283
The problem is that MBR is often a lot
more computationally heavy because you have to sample

1:20:27.283 --> 1:20:29.486
translations.

1:20:31.871 --> 1:20:40.946
If you are sampling, could we take the probability of
each sample, so the most probable one counts more?

1:20:40.946 --> 1:20:43.003
Now we put them.

1:20:43.623 --> 1:20:46.262
Bit and then we say okay, you don't have to
be fine.

1:20:46.262 --> 1:20:47.657
I'm going to put it to you.

1:20:48.428 --> 1:20:52.690
Yes, so that is what you can also do.

1:20:52.690 --> 1:21:00.092
Instead of taking uniform probability, you
could take the model's probability.

1:21:01.041 --> 1:21:14.303
The uniform one is a bit more robust, because if
you used the model probability it might be that there are

1:21:14.303 --> 1:21:17.810
some crazy exceptions.

1:21:17.897 --> 1:21:21.088
And then it would still rely on that too much.

1:21:21.088 --> 1:21:28.294
So if you look at this picture, the probability
here would be higher.

1:21:28.294 --> 1:21:31.794
But yeah, that's a bit of tuning.

1:21:33.073 --> 1:21:42.980
In this case, yes, it is like modeling
also the uncertainty.

1:21:49.169 --> 1:21:56.265
The last thing is now we always have considered
one model.

1:21:56.265 --> 1:22:04.084
It is also sometimes helpful to not only
look at one model but at several.

1:22:04.384 --> 1:22:10.453
So in general there are many ways you
can make several models, and it's even

1:22:10.453 --> 1:22:17.370
easier: you can just start from three different random
initializations, you get three different models,

1:22:17.370 --> 1:22:18.428
and typically they have different strengths.

1:22:19.019 --> 1:22:27.299
And then the question is, can we combine their
strength into one model and use that then?

1:22:29.669 --> 1:22:39.281
And that can be done either online, which
is the ensemble, or more offline, which

1:22:39.281 --> 1:22:41.549
is called reranking.

1:22:42.462 --> 1:22:52.800
So the idea is, for example, an ensemble that
you combine different initializations.

1:22:52.800 --> 1:23:02.043
Of course, you can also do other things like
having different architecture.

1:23:02.222 --> 1:23:08.922
But the easiest thing you can always change when
generating two models is to have different initializations.

1:23:09.209 --> 1:23:24.054
And then the question is how can you combine
that?

1:23:26.006 --> 1:23:34.245
And the easiest thing, as said, is the
ensemble.

1:23:34.245 --> 1:23:39.488
What you mainly do is run them in parallel.

1:23:39.488 --> 1:23:43.833
You decode with all of the models.

1:23:44.444 --> 1:23:59.084
So you get the probability of the output from each model, and
you can join them into one by just summing

1:23:59.084 --> 1:24:04.126
up (averaging) over your K models again.

1:24:04.084 --> 1:24:10.374
So you still have a probability distribution,
but you are not taking only one model's output here,

1:24:10.374 --> 1:24:10.719
but the average.

1:24:11.491 --> 1:24:20.049
So that's how you can easily combine different
models, and the nice thing is it typically

1:24:20.049 --> 1:24:20.715
works.
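
A sketch of one such ensemble step, averaging the next-word distributions; the `next_word_probs` interface is an assumption made for this example, not an API from the lecture:

```python
import numpy as np

def ensemble_step(models, source, prefix):
    """Ensemble decoding sketch: run every model in parallel on the
    same prefix and average their output distributions for the next
    word. This only works if all models share the same target
    vocabulary, since the distributions are combined position by
    position over that vocabulary.

    Each model is assumed to expose a `next_word_probs(source, prefix)`
    method returning a distribution over the shared vocabulary.
    """
    probs = [m.next_word_probs(source, prefix) for m in models]
    return np.mean(probs, axis=0)   # still a valid probability distribution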

1:24:21.141 --> 1:24:27.487
You get an additional improvement with only more
computation but not more human work.

1:24:27.487 --> 1:24:33.753
You just do the same thing four times and you're
getting a better performance.

1:24:33.793 --> 1:24:41.623
Compared to just having bigger models with more layers
and so on, the difference is of course that you have to have

1:24:41.623 --> 1:24:46.272
all the models jointly only during decoding at
inference.

1:24:46.272 --> 1:24:52.634
There you have to load models in parallel
because you have to do your search.

1:24:52.672 --> 1:24:57.557
Normally there are more memory resources for
training than you need for inference.

1:25:00.000 --> 1:25:12.637
You have to train four models and the decoding
speed is also slower because you need to decode

1:25:12.637 --> 1:25:14.367
four models.

1:25:14.874 --> 1:25:25.670
There is one other very important thing: the
models have to be very similar, at least

1:25:25.670 --> 1:25:27.368
in some ways.

1:25:27.887 --> 1:25:28.506
Of course.

1:25:28.506 --> 1:25:34.611
You can only combine them if you have
the same vocabulary, because you are just summing the distributions.

1:25:34.874 --> 1:25:43.110
So just imagine you have two different vocabulary sizes,
because you want to compare them, or a character-

1:25:43.110 --> 1:25:44.273
based model.

1:25:44.724 --> 1:25:53.327
That's at least not easily possible here, because
one output here would be a word and the

1:25:53.327 --> 1:25:56.406
other one would have to sum over several characters.

1:25:56.636 --> 1:26:07.324
So this ensemble typically only works if you
have the same output vocabulary.

1:26:07.707 --> 1:26:16.636
Your input can be different, because that is
only encoded once, and then:

1:26:16.636 --> 1:26:23.752
your output vocabulary has to be the same,
otherwise it doesn't work.

1:26:27.507 --> 1:26:41.522
There's even a surprising effect of improving
your performance and it's again some kind of

1:26:41.522 --> 1:26:43.217
smoothing.

1:26:43.483 --> 1:26:52.122
So normally during training what we are doing
is we can save the checkpoints after each epoch.

1:26:52.412 --> 1:27:01.774
And you have this type of curve where your
error normally goes down, and

1:27:01.774 --> 1:27:09.874
if you do early stopping it means that at the
end you select not the last checkpoint but the one with the lowest error.

1:27:11.571 --> 1:27:21.467
However, some type of smoothing is there again.

1:27:21.467 --> 1:27:31.157
Sometimes what you can do is take an ensemble of the last checkpoints.

1:27:31.491 --> 1:27:38.798
Each of them is not as good, but you still have four
different models, and they give you a little improvement.

1:27:39.259 --> 1:27:42.212
So,.

1:27:43.723 --> 1:27:48.340
They somehow are helping you, since they're
supposed to be somewhat different, you know.

1:27:49.489 --> 1:27:53.812
So that is a checkpoint ensemble.

1:27:53.812 --> 1:27:59.117
There is one interesting variant, which is even
faster.

1:27:59.419 --> 1:28:12.255
Normally that also gives you better performance,
because this is again like a smoothed

1:28:12.255 --> 1:28:13.697
ensemble.
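
The transcript is garbled here, so as an assumption about which "even faster" variant is meant: checkpoint averaging, i.e. averaging the parameters of the last few checkpoints into a single model, is a common way to get an ensemble-like smoothing effect without the extra decoding cost. A minimal sketch:

```python
def average_checkpoints(checkpoints):
    """Checkpoint averaging sketch: element-wise average of the
    parameters of several saved checkpoints (e.g. the last few epochs),
    giving a single model that often behaves like a cheap, smoothed
    ensemble. Each checkpoint is assumed to be a dict mapping parameter
    names to numpy arrays of identical shapes.
    """
    averaged = {}
    for name in checkpoints[0]:
        averaged[name] = sum(cp[name] for cp in checkpoints) / len(checkpoints)
    return averaged
```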

1:28:16.736 --> 1:28:22.364
Of course, there is also some problems with
this, so I said.

1:28:22.364 --> 1:28:30.022
For example, maybe you want to combine different
word representations, with characters and so on.

1:28:30.590 --> 1:28:37.189
You want to do right to left decoding so you
normally do like I go home but then your translation

1:28:37.189 --> 1:28:39.613
depends only on the previous words.

1:28:39.613 --> 1:28:45.942
If you want to model on the future you could
do the inverse direction and generate the target

1:28:45.942 --> 1:28:47.895
sentence from right to left.

1:28:48.728 --> 1:28:50.839
But it's not easy to combine these things.

1:28:51.571 --> 1:28:56.976
In order to do this, or what is also sometimes
interesting, is doing inverse translation.

1:28:57.637 --> 1:29:07.841
You can combine these types of models with reranking,
which we will see in the next lecture.

1:29:07.841 --> 1:29:13.963
That is only a bit which we can do.

1:29:14.494 --> 1:29:29.593
Next time we will cover that. What you should remember is how
search works, and do you have any final questions?

1:29:33.773 --> 1:29:43.393
Then I wish you a happy holiday for next week
and then Monday there is another practical

1:29:43.393 --> 1:29:50.958
and then Thursday in two weeks so we'll have
the next lecture Monday.