WEBVTT

0:00:01.921 --> 0:00:16.424
Hey, welcome to today's lecture. What we want
to look at today is how we can make neural
machine translation more efficient.

0:00:16.796 --> 0:00:26.458
So until now we have looked at the whole system,
the encoder and the decoder mostly, and we haven't

0:00:26.458 --> 0:00:29.714
really thought about how long things take.

0:00:30.170 --> 0:00:42.684
And what we know, for example, is that you
can make the systems bigger in different ways.

0:00:42.684 --> 0:00:47.084
We can make them deeper, or we can make them wider.

0:00:47.407 --> 0:00:56.331
And if we have enough data, that typically
helps to make the performance better.

0:00:56.576 --> 0:01:00.620
But of course that leads to the problem that
we need more resources.

0:01:00.620 --> 0:01:06.587
That is a problem at universities where we
have typically limited computation capacities.

0:01:06.587 --> 0:01:11.757
So at some point you have such big models
that you cannot train them anymore.

0:01:13.033 --> 0:01:23.792
And also for companies it is of course important
what it costs to generate a translation,

0:01:23.792 --> 0:01:26.984
just in terms of power consumption.

0:01:27.667 --> 0:01:35.386
So yeah, there's different reasons why you
want to do efficient machine translation.

0:01:36.436 --> 0:01:48.338
Remember, there are different ways of
how you can improve your machine translation

0:01:48.338 --> 0:01:50.527
system.

0:01:50.670 --> 0:01:55.694
There can be different types of data: we looked
into data crawling and monolingual data.

0:01:55.875 --> 0:01:59.024
With all this data, the aim is always the same.

0:01:59.099 --> 0:02:05.735
Of course, we are not just purely interested
in having more data, but the idea why we want

0:02:05.735 --> 0:02:12.299
to have more data is that more data also means
that we have better quality because mostly

0:02:12.299 --> 0:02:17.550
we are interested in increasing the quality
of the machine translation.

0:02:18.838 --> 0:02:24.892
But there's also other ways of how you can
improve the quality of a machine translation.

0:02:25.325 --> 0:02:36.450
And there are the algorithms, which is, of course,
where most research is focusing.

0:02:36.450 --> 0:02:44.467
It means that we want to build better algorithms.

0:02:44.684 --> 0:02:48.199
Of course, the other approaches are often just as good.

0:02:48.199 --> 0:02:54.631
Sometimes it's easier to improve, so often
it's easier to just collect more data than

0:02:54.631 --> 0:02:57.473
to invent some great new algorithms.

0:02:57.473 --> 0:03:00.315
But yeah, both of them are important.

0:03:00.920 --> 0:03:09.812
But there is this third thing, especially
with neural machine translation, and that means

0:03:09.812 --> 0:03:11.590
we make a bigger model.

0:03:11.751 --> 0:03:16.510
That can be, as said, that we have more layers,
or that we have wider layers.

0:03:16.510 --> 0:03:19.977
The other thing we talked a bit about is ensemble.

0:03:19.977 --> 0:03:24.532
That means we are not building just one machine
translation system.

0:03:24.965 --> 0:03:27.505
And we can easily build four.

0:03:27.505 --> 0:03:32.331
What is the typical strategy to build different
systems?

0:03:32.331 --> 0:03:33.177
Remember.

0:03:35.795 --> 0:03:40.119
The systems should of course be a bit different.

0:03:40.119 --> 0:03:44.585
If they all predict the same then combining
them doesn't help.

0:03:44.585 --> 0:03:48.979
So what is the easiest way if you have to
build four systems?

0:03:51.711 --> 0:04:01.747
(Student answer, partly inaudible:) You take
the best output of a single system.

0:04:02.362 --> 0:04:10.165
I mean, you want really different systems
so that you later can combine them and maybe

0:04:10.165 --> 0:04:11.280
average them.

0:04:11.280 --> 0:04:16.682
Ensembling typically means that you average
all the probabilities.

0:04:19.439 --> 0:04:24.227
The idea is to think about neural networks.

0:04:24.227 --> 0:04:29.342
There's one parameter which you can easily adjust:
the random seed.

0:04:29.342 --> 0:04:36.525
That's exactly it: the easiest way is to randomize
with three different seeds.

0:04:37.017 --> 0:04:43.119
They have the same architecture, so all the
hyperparameters are the same, but the seeds are

0:04:43.119 --> 0:04:43.891
different.

0:04:43.891 --> 0:04:46.556
They will have different predictions.
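
NOTE
A minimal sketch of ensembling at decoding time, assuming several models
with the same architecture trained with different random seeds, each
exposing a hypothetical next_token_probs(prefix) method that returns a
NumPy array over the target vocabulary. Averaging these distributions is
the equal-weight ensemble described above.
  import numpy as np
  def ensemble_next_token_probs(models, prefix):
      # equal-weight average of each model's next-token distribution
      return np.mean([m.next_token_probs(prefix) for m in models], axis=0)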

0:04:48.228 --> 0:04:52.572
So, of course, bigger models.

0:04:52.572 --> 0:05:05.325
These are a bit the easiest ways of
improving your quality because you don't really

0:05:05.325 --> 0:05:08.268
have to do anything.

0:05:08.588 --> 0:05:12.588
There are limits on that: bigger models only
get better

0:05:12.588 --> 0:05:19.132
if you have enough training data. You can't
just add like a hundred layers, and it will not work

0:05:19.132 --> 0:05:24.877
on very small data, but with a reasonable amount
of data that is the easiest thing.

0:05:25.305 --> 0:05:33.726
However, there are challenges with making
bigger models, and that is the

0:05:33.726 --> 0:05:34.970
computation.

0:05:35.175 --> 0:05:44.482
So, of course, if you have a bigger model
that can mean that you have longer running

0:05:44.482 --> 0:05:49.518
times; if you have two times the layers, you need two times the time.

0:05:51.171 --> 0:05:56.685
Normally you cannot parallelize the different
layers because the input to one layer is always

0:05:56.685 --> 0:06:02.442
the output of the previous layer, so you propagate
that so it will also increase your runtime.

0:06:02.822 --> 0:06:10.720
Then you have to store your whole model in
memory.

0:06:10.720 --> 0:06:20.927
If you have double the weights, you will need double the memory.
It is also more difficult to then do backpropagation.

0:06:20.927 --> 0:06:27.680
You have to store the intermediate activations,
so not only do you increase the model

0:06:27.680 --> 0:06:31.865
in your memory, but also all these other variables
that you need.

0:06:34.414 --> 0:06:36.734
And so in general it is more expensive.

0:06:37.137 --> 0:06:54.208
And therefore there are good reasons for looking
into whether we can make these models more efficient.

0:06:54.134 --> 0:07:00.982
So you can view it as: okay, I have one GPU
and one day of training time,

0:07:00.982 --> 0:07:01.274
or.

0:07:01.221 --> 0:07:07.535
Forty thousand euros and then what is the
best machine translation system I can get within

0:07:07.535 --> 0:07:08.437
this budget.

0:07:08.969 --> 0:07:19.085
And then, of course, you can make the models
bigger, but then you have to train them for a shorter time,

0:07:19.085 --> 0:07:24.251
and then we can make more efficient algorithms.

0:07:25.925 --> 0:07:31.699
If you think about efficiency, there are quite
different scenarios.

0:07:32.312 --> 0:07:43.635
So if you're coming more from the research
community, what you'll be doing is building

0:07:43.635 --> 0:07:47.913
a lot of models in your research.

0:07:48.088 --> 0:07:58.645
So you're having your test set, calculating
the BLEU score, then training another model.

0:07:58.818 --> 0:08:08.911
So what that means is typically you're training
on millions of sentences, so your training time

0:08:08.911 --> 0:08:14.944
is long, maybe a day, but maybe in other cases
a week.

0:08:15.135 --> 0:08:22.860
The testing is not really the costly part,
but the training is very costly.

0:08:23.443 --> 0:08:37.830
If you are more thinking of building models
for application, the scenario is quite different.

0:08:38.038 --> 0:08:46.603
And then you keep it running, and maybe thousands
of customers are using it in translating.

0:08:46.603 --> 0:08:47.720
So in that case it is the testing that matters.

0:08:48.168 --> 0:08:59.577
And we will see that it is not always the
same type of challenge: you can parallelize some

0:08:59.577 --> 0:09:07.096
things in training which you cannot parallelize
in testing.

0:09:07.347 --> 0:09:14.124
For example, in training you have to do back
propagation, so you have to store the activations.

0:09:14.394 --> 0:09:23.901
Therefore, in testing we briefly discussed
that we would do it in more detail today in

0:09:23.901 --> 0:09:24.994
training.

0:09:25.265 --> 0:09:36.100
In training you know the target and you can process
everything in parallel, while in testing

0:09:36.356 --> 0:09:46.741
you can only do one word at a time, and
so you can parallelize this less.

0:09:46.741 --> 0:09:50.530
Therefore, it's important.

0:09:52.712 --> 0:09:55.347
There is a specific shared task on this.

0:09:55.347 --> 0:10:03.157
For example, there's the efficiency task, where
it's about making things as efficient

0:10:03.123 --> 0:10:09.230
as possible, and they look at different
resources.

0:10:09.230 --> 0:10:14.207
So how much GPU runtime do you need?

0:10:14.454 --> 0:10:19.366
See how much memory you need or you can have
a fixed memory budget and then have to build

0:10:19.366 --> 0:10:20.294
the best system.

0:10:20.500 --> 0:10:29.010
And here is a bit of an example of that:
there were three teams, from Edinburgh and elsewhere,

0:10:29.010 --> 0:10:30.989
and they submitted.

0:10:31.131 --> 0:10:36.278
So then, of course, if you want to find the
most efficient system you have to do a bit

0:10:36.278 --> 0:10:36.515
of a trade-off.

0:10:36.776 --> 0:10:44.656
You can trade quality against
runtime, and there's not the one solution.

0:10:44.656 --> 0:10:46.720
You can improve one at the cost of the other.

0:10:46.946 --> 0:10:49.662
And that you see that there are different
systems.

0:10:49.909 --> 0:11:06.051
Here is how many words you can translate per
second, and you want to be as far to the top as

0:11:06.051 --> 0:11:07.824
possible.

0:11:08.068 --> 0:11:08.889
And you see here a bit.

0:11:08.889 --> 0:11:09.984
This is a little bit different.

0:11:11.051 --> 0:11:27.717
You want to be there in the top right corner,
and you can get a certain score for a given number of

0:11:27.717 --> 0:11:29.014
words per second.

0:11:30.250 --> 0:11:34.161
At two hundred and fifty thousand words per second,
your COMET score is around zero point three.

0:11:34.834 --> 0:11:41.243
There is, of course, a bit of a decision to make,
but the question is, like, how far can you go?

0:11:41.243 --> 0:11:47.789
Some of all these points on this line would
be winners because they are somehow most efficient

0:11:47.789 --> 0:11:53.922
in a way that there's no system which achieves
the same quality with less computational cost.

0:11:57.657 --> 0:12:04.131
So there's the one question of which resources
you are interested in.

0:12:04.131 --> 0:12:07.416
Are you running it on CPU or GPU?

0:12:07.416 --> 0:12:11.668
There's different ways of parallelizing stuff.

0:12:14.654 --> 0:12:20.777
Another dimension is how you process your
data.

0:12:20.777 --> 0:12:27.154
There's batch processing and streaming.

0:12:27.647 --> 0:12:34.672
So in batch processing you have the whole
document available so you can translate all

0:12:34.672 --> 0:12:39.981
sentences in parallel, and then you're interested
in throughput.

0:12:40.000 --> 0:12:43.844
You can then batch your processing, which is
especially interesting on GPUs.

0:12:43.844 --> 0:12:49.810
That's interesting, you're not translating
one sentence at a time, but you're translating

0:12:49.810 --> 0:12:56.108
one hundred sentences or so in parallel, so
you have one more dimension where you can parallelize

0:12:56.108 --> 0:12:57.964
and then be more efficient.

0:12:58.558 --> 0:13:14.863
On the other hand, you can for example sort the sentences
of the documents; we learned that if you do batch processing

0:13:14.863 --> 0:13:16.544
you have padding.

0:13:16.636 --> 0:13:24.636
Then, of course, it makes sense to sort the
sentences by length in order to have the minimum padding

0:13:24.636 --> 0:13:25.535
attached.
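
NOTE
A minimal sketch of the length-sorted batching just described, assuming
`sentences` is a list of token lists and batch_size counts sentences.
  def make_batches(sentences, batch_size):
      # sort indices by sentence length so each batch holds similar lengths
      order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
      for start in range(0, len(order), batch_size):
          idx = order[start:start + batch_size]
          # each batch is padded only up to its own maximum length
          yield idx, [sentences[i] for i in idx]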

0:13:27.427 --> 0:13:32.150
The other scenario is more the streaming scenario,
where you do live translation.

0:13:32.512 --> 0:13:40.212
So in that case you can't wait for the whole
document to arrive, but you have to translate as you go.

0:13:40.520 --> 0:13:49.529
And then, for example, that's especially in
situations like speech translation, and then

0:13:49.529 --> 0:13:53.781
you're interested in things like latency.

0:13:53.781 --> 0:14:00.361
So how much do you have to wait to get the
output of a sentence?

0:14:06.566 --> 0:14:16.956
Finally, there is the thing about the implementation:
Today we're mainly looking at different algorithms,

0:14:16.956 --> 0:14:23.678
different models of how you can model them
in your machine translation system, but of

0:14:23.678 --> 0:14:29.227
course for the same algorithms there's also
different implementations.

0:14:29.489 --> 0:14:38.643
So, for example, for machine translation
there are tools that are very fast.

0:14:38.638 --> 0:14:46.615
They have coded a lot of the operations
low level,

0:14:46.615 --> 0:14:49.973
directly on the CUDA kernels.

0:14:50.110 --> 0:15:00.948
So the same attention network is typically
more efficient in that type of implementation

0:15:00.880 --> 0:15:02.474
than in any other.

0:15:03.323 --> 0:15:13.105
Of course, it might have other disadvantages;
maybe you have seen that if you have worked

0:15:13.105 --> 0:15:15.106
in the practical course.

0:15:15.255 --> 0:15:22.604
Because it's normally easier to understand,
easier to change, and so on, but there is again

0:15:22.604 --> 0:15:23.323
a trade-off.

0:15:23.483 --> 0:15:29.440
You have to think about, do you want to include
this into my study or comparison or not?

0:15:29.440 --> 0:15:36.468
Should I compare different implementations
and also find the most efficient implementation?

0:15:36.468 --> 0:15:39.145
Or is it only about the pure algorithm?

0:15:42.742 --> 0:15:50.355
Yeah, when building these systems there are
different trade-offs to make.

0:15:50.850 --> 0:15:56.555
So there's the trade-off between memory
and throughput, so how many words you can generate

0:15:56.555 --> 0:15:57.299
per second.

0:15:57.557 --> 0:16:03.351
So typically you can easily increase
your throughput by increasing the batch size.

0:16:03.643 --> 0:16:06.899
So that means you are translating more sentences
in parallel.

0:16:07.107 --> 0:16:09.241
And GPUs are very good at that stuff.

0:16:09.349 --> 0:16:15.161
Whether you translate one sentence or one hundred
sentences takes not the same time, but nearly.

0:16:15.115 --> 0:16:20.784
Roughly similar, because GPUs do
efficient matrix multiplication, so that

0:16:20.784 --> 0:16:24.415
you can do the same operation on all sentences
parallel.

0:16:24.415 --> 0:16:30.148
So typically that means if you increase your
batch size you can do more things in parallel

0:16:30.148 --> 0:16:31.995
and you will translate more words

0:16:31.952 --> 0:16:33.370
per second.

0:16:33.653 --> 0:16:43.312
On the other hand, with this advantage, of
course you will need larger batch sizes and

0:16:43.312 --> 0:16:44.755
more memory.

0:16:44.965 --> 0:16:56.452
The other problem is that you may
have such big models that you can only translate

0:16:56.452 --> 0:16:59.141
with smaller batch sizes.

0:16:59.119 --> 0:17:08.466
If you are running out of memory while translating,
one idea to cope with that is to decrease your batch size.

0:17:13.453 --> 0:17:24.456
Then there is the trade-off between quality and throughput:
larger models are slower,

0:17:24.456 --> 0:17:28.124
but generally of higher quality.

0:17:28.124 --> 0:17:31.902
The first one is always this way.

0:17:32.092 --> 0:17:38.709
Of course, a larger model does not always help;
you have overfitting at some point, but in general it holds.

0:17:43.883 --> 0:17:52.901
And with this a bit on this training and testing
thing we had before.

0:17:53.113 --> 0:17:58.455
So where is the difference between training
and testing, for the encoder and the decoder?

0:17:58.798 --> 0:18:06.992
So if we are looking at what mentioned before
at training time, we have a source sentence

0:18:06.992 --> 0:18:17.183
here, and this is how it is processed; note
the attention here.

0:18:17.183 --> 0:18:21.836
That's a typical transformer.

0:18:22.162 --> 0:18:31.626
And what we can do on a GPU is that we can
parallelize it.

0:18:31.626 --> 0:18:40.422
The first thing to know is that we have the full
source sentence available; that is, of course, not so in all cases.

0:18:40.422 --> 0:18:49.184
We'll later talk about speech translation,
where we might want to translate before the sentence ends.

0:18:49.389 --> 0:18:56.172
But in the general case, you
have the full sentence you want to translate.

0:18:56.416 --> 0:19:02.053
So the important thing is that we have everything
available on the source side.

0:19:03.323 --> 0:19:13.524
And then this was one of the big advantages
of the transformer, if you remember back.

0:19:13.524 --> 0:19:15.752
There are several.

0:19:16.156 --> 0:19:25.229
But the relevant one is now that we can calculate
the full layer in parallel.

0:19:25.645 --> 0:19:29.318
There is no dependency between this and this
state or this and this state.

0:19:29.749 --> 0:19:36.662
So we always calculate here the
key, value and query, and based on that you

0:19:36.662 --> 0:19:37.536
calculate.

0:19:37.937 --> 0:19:46.616
Which means we can do all these calculations
here in parallel.

0:19:48.028 --> 0:19:55.967
And that, of course, is very efficient,
because for GPUs it's typically better

0:19:55.967 --> 0:20:00.887
to do these things in parallel than one after
the other.

0:20:01.421 --> 0:20:10.311
And then we can do that for each layer one by
one, and then we have calculated the encoder.

0:20:10.790 --> 0:20:21.921
In training now an important thing is that
for the decoder we have the full sentence available

0:20:21.921 --> 0:20:28.365
because we know this is the target we should
generate.

0:20:29.649 --> 0:20:33.526
But we model it now in a different way.

0:20:33.526 --> 0:20:38.297
This hidden state depends only on the previous
ones.

0:20:38.598 --> 0:20:51.887
And the first thing here depends only on this
information, so you see if you remember we

0:20:51.887 --> 0:20:56.665
had this masked self-attention.
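
NOTE
A minimal sketch of the causal mask behind masked self-attention: position
i may only attend to positions up to i, which is what makes the parallel
training pass possible while respecting the target-side dependencies.
  import numpy as np
  def causal_mask(length):
      # 0 where attending is allowed (j <= i), -inf where it is not;
      # added to the attention logits before the softmax, it removes
      # all weight on future target positions
      return np.triu(np.full((length, length), float("-inf")), k=1)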

0:20:56.896 --> 0:21:04.117
So that means, of course, we can only calculate
the decoder once the encoder is done, but that's fine.

0:21:04.444 --> 0:21:06.656
First we can calculate the encoder.

0:21:06.656 --> 0:21:08.925
Then we can calculate here the decoder.

0:21:09.569 --> 0:21:25.566
But again in training we have x, y and that
is available so we can calculate everything

0:21:25.566 --> 0:21:27.929
in parallel.

0:21:28.368 --> 0:21:40.941
So the interesting thing, or the advantage of the transformer,
is that in training

0:21:40.941 --> 0:21:46.408
we can parallelize over positions, even for the decoder.

0:21:46.866 --> 0:21:54.457
That means you will have more calculations
because you can only calculate one layer at

0:21:54.457 --> 0:22:02.310
a time, but for example the sentence length, which is
typically quite long, doesn't really matter

0:22:02.310 --> 0:22:03.270
that much.

0:22:05.665 --> 0:22:10.704
However, in testing this situation is different.

0:22:10.704 --> 0:22:13.276
In testing we only have the source.

0:22:13.713 --> 0:22:20.622
So this means we start with a source sentence. We don't
know the full target sentence yet because we

0:22:20.622 --> 0:22:29.063
autoregressively generate it, so for the encoder
we have the same situation, but not for the decoder.

0:22:29.409 --> 0:22:39.598
In this case we only have the first state and can
compute the second state, but not all states in

0:22:39.598 --> 0:22:40.756
parallel.

0:22:41.101 --> 0:22:51.752
And then we can do the next step for y, because
we are feeding in our most probable word.

0:22:51.752 --> 0:22:58.643
We do greedy search or beam search, but you
cannot do it in parallel.
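
NOTE
A minimal sketch of the contrast just described, with a hypothetical
`model` interface (encode, decode_step): the encoder runs once over the
whole source in parallel, while the decoder must loop, feeding back its
own prediction token by token.
  import numpy as np
  def greedy_decode(model, src_tokens, bos_id, eos_id, max_len=100):
      enc = model.encode(src_tokens)        # one parallel pass
      out = [bos_id]
      for _ in range(max_len):              # one decoder pass per token
          probs = model.decode_step(enc, out)
          nxt = int(np.argmax(probs))
          out.append(nxt)
          if nxt == eos_id:
              break
      return out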

0:23:03.663 --> 0:23:16.838
Yes, so we are interested in making things
more efficient for testing, which we see, for

0:23:16.838 --> 0:23:22.363
example, in real application scenarios.

0:23:22.642 --> 0:23:34.286
It makes sense that we think about our architecture
and that we are currently working on attention

0:23:34.286 --> 0:23:35.933
based models.

0:23:36.096 --> 0:23:44.150
The decoder is where most of the time is
spent during testing.

0:23:44.150 --> 0:23:47.142
In training it's similar work, but it can be parallelized.

0:23:47.167 --> 0:23:50.248
Not to mention beam search.

0:23:50.248 --> 0:23:59.833
It might be even more costly because
in beam search you have to try different hypotheses.

0:24:02.762 --> 0:24:15.140
So the question is what can you now do in
order to make your model more efficient and

0:24:15.140 --> 0:24:21.905
better in translation in these types of cases?

0:24:24.604 --> 0:24:30.178
And the one thing is to look into the encoder-decoder
trade-off.

0:24:30.690 --> 0:24:43.898
And then until now we typically assume that
the depth of the encoder and the depth of the

0:24:43.898 --> 0:24:48.154
decoder is roughly the same.

0:24:48.268 --> 0:24:55.553
So if you haven't thought about it, you just
take what is known to run well.

0:24:55.553 --> 0:24:57.678
You would use the same depth on both sides.

0:24:58.018 --> 0:25:04.148
However, we saw now that there is quite a
big difference: the runtime of the decoder is a lot longer

0:25:04.148 --> 0:25:04.914
than that of the encoder.

0:25:05.425 --> 0:25:14.018
The question is whether the same holds for quality,
or do we have there the same issue that we

0:25:14.018 --> 0:25:21.887
only get good quality if both sides are
deep. We know that making the model

0:25:21.887 --> 0:25:25.415
deeper increases our quality.

0:25:25.425 --> 0:25:31.920
But what we haven't talked about: is it really
important that we increase the depth the same

0:25:31.920 --> 0:25:32.285
way.

0:25:32.552 --> 0:25:41.815
So what we can instead do is something
like this, where you have a deep encoder and

0:25:41.815 --> 0:25:42.923
a shallow decoder.

0:25:43.163 --> 0:25:57.386
So that would be that you, for example, instead
of having the same number of layers on the encoder

0:25:57.386 --> 0:25:59.757
and the decoder, put more layers on the encoder.

0:26:00.080 --> 0:26:10.469
So in this case the overall depth from start
to end would be similar, and so hopefully also the quality.

0:26:11.471 --> 0:26:21.662
But we could parallelize a lot more things here,
and what is costly at the end during decoding is

0:26:21.662 --> 0:26:22.973
the decoder.

0:26:22.973 --> 0:26:29.330
Because that is generated in an autoregressive
way; there we cannot parallelize.
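
NOTE
A minimal sketch of the hyperparameter change being discussed; the layer
counts here are illustrative, not the exact ones from the slides. The
total depth stays comparable, but the autoregressive (per-token) part
shrinks.
  balanced     = {"encoder_layers": 6,  "decoder_layers": 6}
  deep_encoder = {"encoder_layers": 12, "decoder_layers": 2}
  # the encoder runs once per sentence, fully parallel over positions;
  # the decoder runs once per generated token, so moving layers into the
  # encoder cuts the cost per output token at test time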

0:26:31.411 --> 0:26:33.727
And that can be analyzed.

0:26:33.727 --> 0:26:38.734
So here are some examples where people have
done exactly this.

0:26:39.019 --> 0:26:55.710
So here we are mainly interested in the orange
rows, which are the autoregressive models, and in the

0:26:55.710 --> 0:26:57.607
speed up.

0:26:57.717 --> 0:27:15.031
You have the baseline system; the setups are not exactly
the same, but similar.

0:27:15.055 --> 0:27:23.004
This is what you see if you look at the speed-up
column.

0:27:23.004 --> 0:27:31.644
I think they put a speed-up of one for the baseline.

0:27:31.771 --> 0:27:35.348
So it is several times as fast

0:27:35.348 --> 0:27:42.621
if you switch from a balanced system to one where
you have most layers in the encoder.

0:27:42.782 --> 0:27:52.309
You see that although you have slightly more
parameters, the calculations are also roughly

0:27:52.309 --> 0:28:00.283
the same, but you get a speed-up because now
during testing you can parallelize more.

0:28:02.182 --> 0:28:09.754
The other thing is that you're speeding up,
but if you look at the performance it's similar,

0:28:09.754 --> 0:28:13.500
so sometimes you improve, sometimes you lose.

0:28:13.500 --> 0:28:20.421
There's a bit of a loss on English to Romanian,
but in general the quality is very similar.

0:28:20.680 --> 0:28:30.343
So you see that you can keep a similar performance
while improving your speed by just distributing the depth differently.

0:28:30.470 --> 0:28:34.903
And you also see, from the speed perspective, the encoder layers:

0:28:34.903 --> 0:28:38.136
they don't really matter that much.

0:28:38.136 --> 0:28:38.690
Most.

0:28:38.979 --> 0:28:50.319
Because if you compare the 12-layer encoder system to
the 6-layer one, you have a lower performance

0:28:50.319 --> 0:28:57.309
with 6 encoder layers, but the speed is
similar.

0:28:57.897 --> 0:29:02.233
(Question:) And you see the huge decrease; is it maybe due
to a lack of data?

0:29:03.743 --> 0:29:11.899
Good idea, but I would say it's not the case.

0:29:11.899 --> 0:29:23.191
Romanian-English should have the same amount
of data.

0:29:24.224 --> 0:29:31.184
Maybe it's just something in that language.

0:29:31.184 --> 0:29:40.702
If you generate Romanian, maybe you need more
target-side dependencies.

0:29:42.882 --> 0:29:46.263
(Inaudible remark from the audience.)

0:29:47.887 --> 0:29:49.034
There could be, yeah.

0:29:49.889 --> 0:29:58.962
(Suggestion from the audience:) Maybe it is
about the vocabulary: it's much

0:29:58.962 --> 0:30:12.492
easier to expand the vocabulary for English,
so it might be the vocabulary.

0:30:13.333 --> 0:30:21.147
I'd have to check, but I would assume that in this
case the system is not pre-trained, but

0:30:21.147 --> 0:30:22.391
trained from scratch.

0:30:22.902 --> 0:30:30.213
And that's why I was assuming that they have
the same data, but maybe you are right that in this

0:30:30.213 --> 0:30:35.595
paper, for example, they pre-trained
the decoder on English.

0:30:36.096 --> 0:30:43.733
But I don't remember exactly if they do something
like that; that could be a good explanation.

0:30:45.325 --> 0:30:52.457
So this is one of the easiest ways to speed
up.

0:30:52.457 --> 0:31:01.443
You just switch two hyperparameters; there is nothing to
implement.

0:31:02.722 --> 0:31:08.367
Of course, there's other ways of doing that.

0:31:08.367 --> 0:31:11.880
We'll look into two things.

0:31:11.880 --> 0:31:16.521
The other thing is the architecture.

0:31:16.796 --> 0:31:28.154
Self-attention is now the baseline in the
systems that we are building.

0:31:28.488 --> 0:31:39.978
However, in translation in the decoder side,
it might not be the best solution.

0:31:39.978 --> 0:31:41.845
There is no rule that it must be.

0:31:42.222 --> 0:31:47.130
So we can use different types of architectures
in the encoder and in the decoder.

0:31:47.747 --> 0:31:52.475
And there are several things you could do
differently.

0:31:52.912 --> 0:31:54.825
We will look into two todays.

0:31:54.825 --> 0:31:58.842
The one is average attention, which is a very
simple solution.

0:31:59.419 --> 0:32:01.464
It does what the name says.

0:32:01.464 --> 0:32:04.577
It's not really attending anymore.

0:32:04.577 --> 0:32:08.757
It's just like equal attention to everything.

0:32:09.249 --> 0:32:23.422
And the other idea, which is currently used
in most systems which are optimized for efficiency,

0:32:23.422 --> 0:32:24.913
is the following:

0:32:25.065 --> 0:32:32.623
on the decoder side we are then not using
the transformer self-attention, but we are using a

0:32:32.623 --> 0:32:39.700
recurrent neural network, because there the
disadvantages of recurrent networks matter less.

0:32:39.799 --> 0:32:48.353
And the recurrent unit is normally easier
to calculate because it only depends on

0:32:48.353 --> 0:32:49.684
the input and the previous state.

0:32:51.931 --> 0:33:02.190
So what is the difference in decoding, and
why is attention maybe not efficient

0:33:02.190 --> 0:33:03.841
for decoding?

0:33:04.204 --> 0:33:14.390
In an RNN, if we want to compute the new state, we only
have to look at the input and the previous

0:33:14.390 --> 0:33:15.649
state.

0:33:16.136 --> 0:33:19.029
There are also convolutional networks.

0:33:19.029 --> 0:33:19.994
There we have a

0:33:19.980 --> 0:33:31.291
dependency on a fixed number of previous states,
but that's rarely used for decoding.

0:33:31.291 --> 0:33:39.774
In contrast, in the transformer we have this long-range
dependency.

0:33:40.000 --> 0:33:52.760
So y t depends on y 1 up to y t minus one;
that is mainly not very efficient, though I

0:33:52.760 --> 0:33:56.053
mean it's very good for modeling, because you see all previous words.

0:33:56.276 --> 0:34:03.543
However, the disadvantage is that we also
have to do all these calculations, so if we

0:34:03.543 --> 0:34:10.895
view it from the point of view of efficient
calculation, this might not be the best.

0:34:11.471 --> 0:34:20.517
So the question is, can we change our architecture
to keep some of the advantages but make things

0:34:20.517 --> 0:34:21.994
more efficient?

0:34:24.284 --> 0:34:31.131
The one idea is what is called the average
attention, and the interesting thing is that this

0:34:31.131 --> 0:34:32.610
works surprisingly well.

0:34:33.013 --> 0:34:38.917
So the only change is in
the decoder.

0:34:38.917 --> 0:34:42.646
You're not doing attention anymore.

0:34:42.646 --> 0:34:46.790
The attention weights are all the same.

0:34:47.027 --> 0:35:00.723
So you don't calculate with query and key
the different weights, and then you just take

0:35:00.723 --> 0:35:03.058
equal weights.

0:35:03.283 --> 0:35:07.585
So here it would be one third from this, one
third from this, and one third from this.

0:35:09.009 --> 0:35:14.719
And because of this simplification you can now do
precalculation, and things get more efficient.

0:35:15.195 --> 0:35:18.803
So first the formula; it's maybe not directly clear
here.

0:35:18.979 --> 0:35:38.712
So the difference here is that your new hidden
state is the average of all hidden states so far.

0:35:38.678 --> 0:35:40.844
So here would be with this.

0:35:40.844 --> 0:35:45.022
It would be one third of this plus one third
of this.

0:35:46.566 --> 0:35:57.162
But if you calculate it this way, it's not
yet being more efficient because you still

0:35:57.162 --> 0:36:01.844
have to sum here over all the hidden states.

0:36:04.524 --> 0:36:22.932
But you can now easily speed up these things
by keeping an intermediate value, which is just

0:36:22.932 --> 0:36:24.568
the running sum.

0:36:25.585 --> 0:36:30.057
If you take this at t minus one, you add this
one plus this one.

0:36:30.350 --> 0:36:36.739
Because this one then was before this, and
this one was this, so in the end.

0:36:37.377 --> 0:36:49.545
So now this one is not the final one; in order
to get the final one you do the average,

0:36:49.545 --> 0:36:50.111
dividing by t.

0:36:50.430 --> 0:37:00.264
But then if you do this calculation with speed
up you can do it with a fixed number of steps.

0:37:00.180 --> 0:37:11.300
Instead of a sum whose length depends on the position, you
only have to do a fixed number of calculations to get

0:37:11.300 --> 0:37:12.535
this one.

0:37:12.732 --> 0:37:21.718
(Question from the audience about the notation,

0:37:21.718 --> 0:37:32.701
largely inaudible.)

0:37:32.993 --> 0:37:38.762
That's a very good point, and that's why this
notation in the image

0:37:38.762 --> 0:37:44.531
is not very good: this is the one with the
tilde, and the tilde is the running sum.

0:37:44.884 --> 0:37:57.895
So this one is just the sum of these two,
because this one is just this one.

0:37:58.238 --> 0:38:08.956
So the sum of this is exactly the sum of
these, and the sum of these is the sum here.

0:38:08.956 --> 0:38:15.131
So you only do the sum in here, and the multiplication by one over t at the end.

0:38:15.255 --> 0:38:22.145
So what you can mainly do here is you can
do it more mathematically.

0:38:22.145 --> 0:38:31.531
You do this by taking the one over t out of the
sum; then you can calculate the sum incrementally.
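
NOTE
A minimal sketch of the average-attention trick just derived: keep a
running sum g so that each step costs a fixed number of operations
instead of a sum over all previous positions.
  import numpy as np
  def average_attention(inputs):
      # inputs: array of shape (T, d); output[t] = mean of inputs[0..t]
      outputs, g = [], np.zeros(inputs.shape[1])
      for t, x in enumerate(inputs, start=1):
          g = g + x              # running sum (the tilde value)
          outputs.append(g / t)  # equal weights 1/t, O(1) per step
      return np.stack(outputs)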

0:38:36.256 --> 0:38:42.443
That maybe looks a bit weird and simple, so
we were all talking about this great attention

0:38:42.443 --> 0:38:47.882
that can focus on different parts, and the
bit surprising thing in this work is:

0:38:47.882 --> 0:38:53.321
in the end it might also work well without
real attention, just using equal weights.

0:38:53.954 --> 0:38:56.164
I mean, it's not that easy.

0:38:56.376 --> 0:38:58.261
It's like sometimes this is working.

0:38:58.261 --> 0:39:00.451
There's also report weight work that well.

0:39:01.481 --> 0:39:05.848
But I think it's an interesting result, and it
maybe shows that a lot of

0:39:05.805 --> 0:39:10.624
things in the transformer paper
which are presented more as side choices,

0:39:10.624 --> 0:39:15.930
like the hyperparameters around it,
that you do the layer norm in between,

0:39:15.930 --> 0:39:21.785
and that you do a feed-forward layer before, and
things like that, are also all important,

0:39:21.785 --> 0:39:25.566
and that the right set up around that is also
very important.

0:39:28.969 --> 0:39:38.598
The other thing you can do in the end is not
completely different from this one.

0:39:38.598 --> 0:39:42.521
It just comes from a very different direction.

0:39:42.942 --> 0:39:54.338
And that is a recurrent network which also
has this type of highway connection that can

0:39:54.338 --> 0:40:01.330
bypass the recurrent unit and directly pass
the input through.

0:40:01.561 --> 0:40:10.770
It's not really adding them up, but the
hidden state depends on your input, and what you

0:40:10.770 --> 0:40:15.480
can do is somehow directly go to the output.

0:40:17.077 --> 0:40:28.390
These are the four components of the simple
recurrent unit, and the unit is motivated by GRUs

0:40:28.390 --> 0:40:33.418
and by LSTMs, which we have seen before.

0:40:33.513 --> 0:40:43.633
And gating has proven to be very good for
RNNs; it allows you to control your information flow.

0:40:44.164 --> 0:40:48.186
In this thing we have two gates, the reset
gate and the forget gate.

0:40:48.768 --> 0:40:57.334
So first we have the general structure which
has a cell state.

0:40:57.334 --> 0:41:01.277
Here we have the cell state.

0:41:01.361 --> 0:41:09.661
And then this is passed on, and we always get
the different cell states over the time steps.

0:41:10.030 --> 0:41:11.448
This is the cell state.

0:41:11.771 --> 0:41:16.518
How do we now calculate that? Just assume we
have an initial cell state here.

0:41:17.017 --> 0:41:19.670
But the first thing is we're doing the forget
gate.

0:41:20.060 --> 0:41:34.774
The forget gate models whether the new cell
state should mainly depend on the previous cell state

0:41:34.774 --> 0:41:40.065
or on our current input.

0:41:40.000 --> 0:41:41.356
Or on some mix of the two.

0:41:41.621 --> 0:41:42.877
How can we model that?

0:41:44.024 --> 0:41:45.599
First let's look at the formula.

0:41:45.945 --> 0:41:52.151
The forget gate depends on the state at t minus one and on the input.

0:41:52.151 --> 0:41:56.480
You also see here the formula.

0:41:57.057 --> 0:42:01.963
So we are multiplying both the cell state
and our input

0:42:01.963 --> 0:42:04.890
with some weight matrices.

0:42:05.105 --> 0:42:08.472
We are adding some bias vector, and then
we are applying a sigmoid on that.

0:42:08.868 --> 0:42:13.452
So in the end we have numbers between zero
and one, one for each dimension.

0:42:13.853 --> 0:42:22.041
If it's near to zero, we will
mainly use the new input.

0:42:22.041 --> 0:42:31.890
If it's near to one, we will keep the old cell state
and ignore the input at this dimension.

0:42:33.313 --> 0:42:40.173
And with this motivation we can then create
the new cell state; here you see

0:42:40.173 --> 0:42:41.141
the formula.

0:42:41.601 --> 0:42:55.048
So you take your forget gate and multiply
it with your previous cell state.

0:42:55.048 --> 0:43:00.427
So if the gate was around one, you keep the old state.

0:43:00.800 --> 0:43:07.405
In the other case, with one minus the gate value,
you take the new contribution.

0:43:07.405 --> 0:43:10.946
There you're adding a transformation of the input.

0:43:11.351 --> 0:43:24.284
So if this gate value was maybe zero, then you're
taking most of the information from the input.

0:43:25.065 --> 0:43:26.947
That is already your cell state.

0:43:26.947 --> 0:43:30.561
The only question is now, based on your cell state:

0:43:30.561 --> 0:43:32.067
What is the output?

0:43:33.253 --> 0:43:47.951
And there you have another gate, so
you can either output the cell state, or instead

0:43:47.951 --> 0:43:50.957
prefer the input.

0:43:52.612 --> 0:43:58.166
(Question:) So is the value also the same for the reset
gate and the forget gate?

0:43:58.166 --> 0:43:59.417
(Partly inaudible.)

0:44:00.900 --> 0:44:10.004
Yes, exactly: the matrices are different,
and therefore the values can differ, and that should be so,

0:44:10.004 --> 0:44:16.323
because sometimes you want to keep different
information.

0:44:16.636 --> 0:44:23.843
So here again we have this vector with values
between zero and one which controls how

0:44:23.843 --> 0:44:25.205
the information flows.

0:44:25.505 --> 0:44:36.459
And then the output is calculated here similarly
to the cell state, but again the input can come in directly.

0:44:36.536 --> 0:44:45.714
So the reset gate decides: should we output
what is currently stored in the cell, or the input.

0:44:46.346 --> 0:44:58.647
So it's not exactly like the thing we had before
with the residual connections, where we added things

0:44:58.647 --> 0:45:01.293
up; here we do a gated combination instead.
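
NOTE
A minimal sketch of one step of the simple recurrent unit as verbally
described here (exact gating details vary between papers); it assumes
input and hidden dimensions are equal so the highway path works.
  import numpy as np
  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))
  def sru_step(x, c_prev, Wf, Uf, Wr, Ur, W, bf, br):
      f = sigmoid(Wf @ x + Uf @ c_prev + bf)  # forget gate, in (0, 1)
      c = f * c_prev + (1.0 - f) * (W @ x)    # mix old cell state and input
      r = sigmoid(Wr @ x + Ur @ c_prev + br)  # reset gate
      h = r * c + (1.0 - r) * x               # output cell state or pass input
      return h, c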

0:45:04.224 --> 0:45:08.472
This is the general idea of the simple recurrent
unit.

0:45:08.472 --> 0:45:13.125
Then we will now look at how we can make things
even more efficient.

0:45:13.125 --> 0:45:17.104
But first do you have more questions on how
it is working?

0:45:23.063 --> 0:45:38.799
Now these calculations are where things can
get more efficient, because of a dependency issue.

0:45:38.718 --> 0:45:43.177
Each dimension depends on all the other dimensions,
for the second gate also.

0:45:43.423 --> 0:45:48.904
Because if you do a matrix multiplication
with a vector like for the output vector, each

0:45:48.904 --> 0:45:52.353
dimension of the output vector depends on all
the others.

0:45:52.973 --> 0:46:06.561
The cell state here has this dependency because this
is used here, while we would like the first dimension

0:46:06.561 --> 0:46:11.340
of the cell state to depend only on the first dimension.

0:46:11.931 --> 0:46:17.973
That, of course, is again making
things less parallelizable, when things

0:46:17.973 --> 0:46:18.481
depend.

0:46:19.359 --> 0:46:35.122
We can easily change that by replacing
the matrix product with an element-wise product.

0:46:35.295 --> 0:46:51.459
So you multiply element-wise: the first dimension
with the first dimension, the second with the second.

0:46:52.032 --> 0:46:53.772
There is, of course, an error here.

0:46:53.772 --> 0:46:59.294
This should be the reset gate, because
it should be a different matrix.

0:46:59.899 --> 0:47:12.053
Now the first dimension only depends on the
first dimension, so you don't have dependencies

0:47:12.053 --> 0:47:16.148
any longer between dimensions.

0:47:18.078 --> 0:47:25.692
Maybe it gets a bit clearer if you look at
it in this way: what do we have to do now?

0:47:25.966 --> 0:47:31.911
First, we have to do a matrix multiplication
on the input to get the gate inputs.

0:47:32.292 --> 0:47:38.041
And then we only have the element-wise operations,
where we take this output.

0:47:38.041 --> 0:47:38.713
We take c t

0:47:39.179 --> 0:47:42.978
minus one and our original input.

0:47:42.978 --> 0:47:52.748
Here we only have element-wise operations, which
can be optimally parallelized.

0:47:53.273 --> 0:48:07.603
So here we can additionally parallelize things
across the dimensions.
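
NOTE
A minimal sketch of this efficiency trick: all matrix multiplications
touch only the inputs, so they are done for every time step in one
batched product up front; the sequential part is purely element-wise,
with vector (diagonal) weights vf, vr on the cell state so that each
dimension depends only on itself. Shapes and names are assumptions.
  import numpy as np
  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))
  def sru_sequence(X, Wf, Wr, W, vf, vr, bf, br):
      # X: (T, d); one parallel matmul per projection, over all steps
      Fx, Rx, Zx = X @ Wf.T, X @ Wr.T, X @ W.T
      c, hs = np.zeros(Zx.shape[1]), []
      for t in range(X.shape[0]):   # cheap element-wise scan over time
          f = sigmoid(Fx[t] + vf * c + bf)
          c = f * c + (1.0 - f) * Zx[t]
          r = sigmoid(Rx[t] + vr * c + br)
          hs.append(r * c + (1.0 - r) * X[t])
      return np.stack(hs)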

0:48:09.929 --> 0:48:24.255
Yeah, and this part you can do in parallel
again for all time steps.

0:48:24.544 --> 0:48:33.014
Here you can't do it in parallel over time, but
you only have to do it once per step, and then you

0:48:33.014 --> 0:48:34.650
can parallelize across dimensions.

0:48:35.495 --> 0:48:39.190
But this maybe matters for the dimension.

0:48:39.190 --> 0:48:42.124
Maybe it's also important.

0:48:42.124 --> 0:48:46.037
I don't know if they have tried it.

0:48:46.037 --> 0:48:55.383
I assume it's not only for dimensionality reduction,
but it's hard to say, because you could easily test it.

0:49:01.001 --> 0:49:08.164
People have even made the second equation
even simpler.

0:49:08.164 --> 0:49:10.313
So there is this variant.

0:49:10.313 --> 0:49:17.954
This is how we have the highway connections
in the transformer.

0:49:17.954 --> 0:49:20.699
Then it's like you do a weighted skip connection.

0:49:20.780 --> 0:49:24.789
So that is how things are put together
in a transformer.

0:49:25.125 --> 0:49:39.960
And it is similar in the simple recurrent
neural network, where you do exactly the same

0:49:39.960 --> 0:49:44.512
for the output, so you don't have

0:49:46.326 --> 0:49:47.503
this type of extra computation.

0:49:49.149 --> 0:50:01.196
And with this we are at the end of how to
make efficient architectures, before we go to

0:50:01.196 --> 0:50:02.580
the next topic.

0:50:13.013 --> 0:50:24.424
Besides the encoder and decoder architectures,
there is a next technique which is used in

0:50:24.424 --> 0:50:28.988
nearly all of deep learning very successfully.

0:50:29.449 --> 0:50:43.463
So the idea is: can we extract the knowledge
from a large network into a smaller one, so that

0:50:43.463 --> 0:50:45.983
it performs similarly?

0:50:47.907 --> 0:50:53.217
And the nice thing is that this really works,
and it may be very, very surprising.

0:50:53.673 --> 0:51:03.000
So the idea is that we have a large, strong
model which we train for long, and the question

0:51:03.000 --> 0:51:07.871
is: Can that help us to train a smaller model?

0:51:08.148 --> 0:51:16.296
So can what we refer to as the teacher model
help us to build a better small student model than

0:51:16.296 --> 0:51:17.005
before.

0:51:17.257 --> 0:51:27.371
So before, as the student model, we would just
learn from the data, and that is how we train

0:51:27.371 --> 0:51:28.755
our systems.

0:51:29.249 --> 0:51:37.949
The question is: Can we train this small model
better if we are not only learning from the

0:51:37.949 --> 0:51:46.649
data, but we are also learning from a large
model which has been trained maybe on the same

0:51:46.649 --> 0:51:47.222
data?

0:51:47.667 --> 0:51:55.564
So that in the end you have a smaller
model that is somehow performing better than before.

0:51:55.895 --> 0:51:59.828
And maybe that's, at first view,

0:51:59.739 --> 0:52:05.396
very, very surprising, because it has seen the
same data, so it should have learned the same:

0:52:05.396 --> 0:52:11.053
the baseline model trained only on the data
and the knowledge-distilled student both

0:52:11.053 --> 0:52:11.682
model it.

0:52:11.682 --> 0:52:17.401
They all have seen only this data, because
your teacher model was also typically trained

0:52:17.401 --> 0:52:19.161
only on this data, however.

0:52:20.580 --> 0:52:30.071
It has by now been shown in many settings that
the model trained in this teacher-student framework

0:52:30.071 --> 0:52:32.293
is performing better.

0:52:33.473 --> 0:52:40.971
We will get a bit of an explanation when we
see how it works.

0:52:40.971 --> 0:52:46.161
There's different ways of doing it.

0:52:46.161 --> 0:52:47.171
Maybe.

0:52:47.567 --> 0:52:51.501
So how does it work?

0:52:51.501 --> 0:53:04.802
This is our student network, the normal one,
some type of neural network.

0:53:04.802 --> 0:53:06.113
We're training it.

0:53:06.586 --> 0:53:17.050
So we are training the model to predict the
reference, and we are doing that by calculating a loss.

0:53:17.437 --> 0:53:23.173
The cross-entropy loss was defined in a way
that says the probability for the

0:53:23.173 --> 0:53:25.332
correct word should be as high as possible.

0:53:25.745 --> 0:53:32.207
So you are always calculating your output
probabilities, and at each time step you have an output

0:53:32.207 --> 0:53:33.055
probability.

0:53:33.055 --> 0:53:38.669
What is the most probable next word? Your
training signal is to put as much of

0:53:38.669 --> 0:53:43.368
your probability mass as possible on the correct
word, the word that is there in the reference.

0:53:43.903 --> 0:53:51.367
And this is achieved by this cross-entropy
loss, which says: we sum over all training

0:53:51.367 --> 0:53:58.664
examples and all positions, we sum over the
full vocabulary, and then this one is the indicator

0:53:58.664 --> 0:54:03.947
that this current word is the k-th word
in the vocabulary.
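
As a tiny worked illustration of this loss (the numbers and shapes here are made up for demonstration): the log-probability matrix has one row per target position and one column per vocabulary entry, and the loss sums the negated log probabilities of the reference words.

```python
import torch
import torch.nn.functional as F

# Toy example: 3 target positions, vocabulary of size 5.
log_probs = F.log_softmax(torch.randn(3, 5), dim=-1)  # model outputs
reference = torch.tensor([2, 0, 4])                   # correct word ids

# Cross-entropy: sum of negated log probabilities of the correct words.
loss = -log_probs[torch.arange(3), reference].sum()
```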

0:54:04.204 --> 0:54:11.339
And then we take here the log probability
of that. So what we mainly do is: We have

0:54:11.339 --> 0:54:27.313
this matrix here, with one row per position
and one column per vocabulary entry.

0:54:27.507 --> 0:54:38.656
In the end, what you do is sum these
three log probabilities, and then you want

0:54:38.656 --> 0:54:40.785
them to be as high as possible.

0:54:41.041 --> 0:54:54.614
So although this is a sum over this whole matrix
here, in the end in each row only one entry counts.

0:54:54.794 --> 0:55:06.366
So that is the normal cross-entropy loss that
we have discussed at the very beginning of

0:55:06.366 --> 0:55:07.016
how to train these models.

0:55:08.068 --> 0:55:15.132
So what can we do differently in the teacher
network?

0:55:15.132 --> 0:55:23.374
We also have a teacher network which is trained
on large data.

0:55:24.224 --> 0:55:35.957
And of course this distribution might be better
than the one from the small model, because it's a bigger model trained for longer.

0:55:36.456 --> 0:55:40.941
So in this case we have now the training signal
from the teacher network.

0:55:41.441 --> 0:55:46.262
And it's the same way as we had before.

0:55:46.262 --> 0:55:56.507
The only difference is we're training not towards
the ground-truth probability distribution

0:55:56.507 --> 0:55:59.159
here, which is sharp.

0:55:59.299 --> 0:56:11.303
That's also a probability distribution, so this word
has a high probability, but other words also have some probability.

0:56:12.612 --> 0:56:19.577
And that is the main difference.

0:56:19.577 --> 0:56:30.341
Typically you then use an interpolation of
these two losses.
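
As a rough sketch of what such an interpolated training objective could look like (the mixing weight alpha and the exact form are assumptions for illustration, not the lecture's definition):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5):
    # Hard loss: cross-entropy against the ground-truth words.
    hard = F.cross_entropy(student_logits, targets)
    # Soft loss: cross-entropy against the teacher's output distribution.
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    soft = -(teacher_probs * F.log_softmax(student_logits, dim=-1)).sum(-1).mean()
    # Interpolate the two training signals (alpha is a hypothetical weight).
    return alpha * hard + (1 - alpha) * soft
```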

0:56:33.213 --> 0:56:38.669
Because there's more information contained
in the distribution than in the ground truth,

0:56:38.669 --> 0:56:44.187
because it encodes more information about the
language, because language always has more

0:56:44.187 --> 0:56:47.907
options to express the same sentence. Yes,
exactly.

0:56:47.907 --> 0:56:53.114
So there's ambiguity in there that is hopefully
encoded very well in the teacher's distribution.

0:56:53.513 --> 0:56:57.257
The teacher network is bigger, so it can model this
better than the student network learns on its own.

0:56:57.537 --> 0:57:05.961
So maybe often there's only one correct word,
but it might be two or three, and then all

0:57:05.961 --> 0:57:10.505
of these three get probability mass in the distribution.

0:57:10.590 --> 0:57:21.242
And that is the main advantage, or one explanation,
of why it's better to also train from the teacher's distribution.

0:57:21.361 --> 0:57:32.652
Of course, it's good to also keep the ground-truth
signal in there, because then you can prevent the model from going

0:57:32.652 --> 0:57:33.493
crazy.

0:57:37.017 --> 0:57:49.466
Any more questions on the first type of knowledge
distillation? Also about distribution changes?

0:57:50.550 --> 0:58:02.202
Coming back around: I would put it a bit
differently, so this is not a solution to domain

0:58:02.202 --> 0:58:04.244
or distribution shift.

0:58:04.744 --> 0:58:12.680
But I don't think it's performing worse than
only training on the ground truth, because the teacher has also seen that data.

0:58:13.113 --> 0:58:21.254
So it's more that it's not improving things there;
you would assume it's similarly helping you, but not more.

0:58:21.481 --> 0:58:28.145
Of course, if you now have a teacher, maybe
you have no data in your target domain,

0:58:28.145 --> 0:58:28.524
but you still have the teacher.

0:58:28.888 --> 0:58:39.895
Then you can use its output, which is not the
ground truth but helpful for learning

0:58:39.895 --> 0:58:42.147
the distribution better.

0:58:46.326 --> 0:58:57.012
The second idea is to do sequence level knowledge
distillation, so what we have in this case

0:58:57.012 --> 0:59:02.757
is we have looked at each position independently.

0:59:03.423 --> 0:59:05.436
I mean, we do that often.

0:59:05.436 --> 0:59:10.972
We are not modeling whole sequences,
but that has a problem.

0:59:10.972 --> 0:59:13.992
We have this propagation of errors.

0:59:13.992 --> 0:59:16.760
We start with one error and then it propagates.

0:59:17.237 --> 0:59:27.419
So if we are doing word-level knowledge distillation,
we are treating each word in the sentence independently.

0:59:28.008 --> 0:59:32.091
So we are not trying to model the dependencies
between the words.

0:59:32.932 --> 0:59:47.480
We can try to do that by sequence-level knowledge
distillation, but there is, of course, a problem.

0:59:47.847 --> 0:59:53.478
For each position we can get
a distribution over all the words at this position.

0:59:53.793 --> 1:00:05.305
But if we want to have a distribution over all
possible target sentences, that's not possible

1:00:05.305 --> 1:00:06.431
because there are exponentially many.

1:00:08.508 --> 1:00:15.940
So we can then again do a bit of a hack
on that.

1:00:15.940 --> 1:00:23.238
If we can't have a distribution over all sentences,
we approximate it.

1:00:23.843 --> 1:00:30.764
So what we can do is use the
teacher network and sample different translations.

1:00:31.931 --> 1:00:39.327
And now there are different ways to train
on them.

1:00:39.327 --> 1:00:49.343
We can weight them by their probability; the
easiest is to use them directly as targets.

1:00:50.050 --> 1:00:56.373
So what that comes down to is that we're taking
our teacher network, we're generating some

1:00:56.373 --> 1:01:01.135
translations, and these ones we're using as
additional training data.

1:01:01.781 --> 1:01:11.382
Then we have mainly done this at the sequence
level, because the teacher network tells us:

1:01:11.382 --> 1:01:17.513
These are all probable translations of the
sentence.
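
A sketch of this procedure; `teacher.translate` is a hypothetical decoding call (e.g. beam search with the teacher model), not a specific library API:

```python
def build_distilled_corpus(teacher, source_sentences):
    # Decode the training sources with the teacher; its outputs replace
    # (or augment) the human references as student training targets.
    pairs = []
    for src in source_sentences:
        hyp = teacher.translate(src)   # hypothetical beam-search decoding
        pairs.append((src, hyp))
    return pairs
```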

1:01:26.286 --> 1:01:34.673
And then you can try to do a bit better and
make a bit of an interpolated

1:01:34.673 --> 1:01:36.206
version of that.

1:01:36.716 --> 1:01:42.802
So what people have also done is sequence-level
interpolation.

1:01:42.802 --> 1:01:52.819
You generate here several translations, but
then you don't use all of them.

1:01:52.819 --> 1:02:00.658
You use some metric to decide which of these ones to keep.

1:02:01.021 --> 1:02:12.056
So otherwise it's a bit like training on the ground
truth, which might be improbable or unreachable

1:02:12.056 --> 1:02:16.520
because the model cannot generate everything.

1:02:16.676 --> 1:02:23.378
And we are giving it an easier solution, which
is also of good quality, and training on that.

1:02:23.703 --> 1:02:32.602
So you're not training it on a very difficult
solution, but you're training it on an easier

1:02:32.602 --> 1:02:33.570
solution.

1:02:36.356 --> 1:02:38.494
Any more questions on this?

1:02:40.260 --> 1:02:41.557
Yeah.

1:02:41.461 --> 1:02:44.296
Good.

1:02:43.843 --> 1:03:01.642
The next idea is to look at the vocabulary; the problem
is, we have seen that vocabulary calculations

1:03:01.642 --> 1:03:06.784
are often very time-consuming.

1:03:09.789 --> 1:03:19.805
The thing is that most of the vocabulary is
not needed for each sentence; in each sentence only a small subset of words can occur.

1:03:20.280 --> 1:03:28.219
The question is: Can we somehow easily precalculate
which words are probable to occur in the sentence,

1:03:28.219 --> 1:03:30.967
and then only calculate these ones?

1:03:31.691 --> 1:03:34.912
And this can be done.

1:03:34.912 --> 1:03:43.932
For example, if your sentence is about cars,
many words are probably not going to occur.

1:03:44.164 --> 1:03:48.701
So what you can try to do is to limit the
vocabulary

1:03:48.701 --> 1:03:51.093
you're considering for each sentence.

1:03:51.151 --> 1:04:04.693
So you're no longer taking the full vocabulary
as possible output, but you're restricting it.

1:04:06.426 --> 1:04:18.275
What typically works is that we limit it: the
most frequent words we always take, because

1:04:18.275 --> 1:04:23.613
these are not so easy to align to source words.

1:04:23.964 --> 1:04:32.241
So we take the most frequent target words, and
then the words that often align with one of the

1:04:32.241 --> 1:04:32.985
source words.

1:04:33.473 --> 1:04:46.770
So for each source word you calculate the
word alignment on your training data, and then

1:04:46.770 --> 1:04:51.700
you calculate which target words typically occur.

1:04:52.352 --> 1:04:57.680
And then for decoding you build the union
of the frequent-word list and the per-source-word candidate lists.

1:04:59.960 --> 1:05:02.145
That is, for each source word:

1:05:02.145 --> 1:05:08.773
you take the most frequent translations of that
source word, for example for each source word

1:05:08.773 --> 1:05:13.003
the most frequent ones, and then in addition
the overall most frequent target words.

1:05:13.193 --> 1:05:24.333
In total, if you have short sentences, you
have a lot fewer words, so in most cases it's

1:05:24.333 --> 1:05:26.232
not more than a small fraction of the full vocabulary.

1:05:26.546 --> 1:05:33.957
And so you have dramatically reduced your
vocabulary, and thereby can also speed up decoding.
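
A minimal sketch of such vocabulary selection, assuming a candidate table precomputed from word alignments on the training data (all names and the top-k cutoff are illustrative):

```python
import torch

def build_shortlist(src_words, cand_table, frequent_ids, k=10):
    # Union of globally frequent target words and the top-k aligned
    # translation candidates of every source word in the sentence.
    ids = set(frequent_ids)
    for w in src_words:
        ids.update(cand_table.get(w, [])[:k])
    return torch.tensor(sorted(ids))

def restricted_logits(hidden, out_weight, shortlist):
    # Output scores only over the shortlist: a (V', d) @ (d,) product
    # instead of the full (V, d) @ (d,) projection, with V' << V.
    return out_weight[shortlist] @ hidden
```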

1:05:35.495 --> 1:05:43.757
That sounds easy; but does anybody see what is
challenging here, and why it might not always help?

1:05:47.687 --> 1:05:54.448
It's not about the translation performance.

1:05:54.448 --> 1:06:01.838
If you implement it, the speed-up might not be as strong as expected.

1:06:01.941 --> 1:06:06.053
You have to store this list.

1:06:06.053 --> 1:06:14.135
You have to build the union, and of course that
eats into your saved time.

1:06:14.554 --> 1:06:21.920
The second thing is that the vocabulary is used in
our last step, so we have the hidden state,

1:06:21.920 --> 1:06:23.868
and then we calculate the output probabilities.

1:06:24.284 --> 1:06:29.610
Now we are no longer calculating them for
all output words, but for a subset of them.

1:06:30.430 --> 1:06:35.613
However, this matrix multiplication is typically
parallelized nearly perfectly on the hardware.

1:06:35.956 --> 1:06:46.937
But if you only calculate some of them and
you're not implementing it right, it will take

1:06:46.937 --> 1:06:52.794
as long as before, because of the nature of
the parallel hardware.

1:06:56.776 --> 1:07:07.997
Here for beam search there are some ideas; of
course you can go back to greedy search because

1:07:07.997 --> 1:07:10.833
that's more efficient.

1:07:11.651 --> 1:07:18.347
And beam search has better quality; you can buffer
some states in between, and how much buffering is

1:07:18.347 --> 1:07:22.216
again this trade-off between computation and
memory.

1:07:25.125 --> 1:07:41.236
Then at the end of today what we want to look
into is one last type of neural machine translation

1:07:41.236 --> 1:07:42.932
approach.

1:07:43.403 --> 1:07:53.621
And the idea relates to what we've already seen
in our first steps: this auto-regressive

1:07:53.621 --> 1:07:57.246
part is dominating the decoding.

1:07:57.557 --> 1:08:04.461
The encoder can process everything in parallel, but in
decoding we are always taking the most probable word and then continuing.

1:08:05.905 --> 1:08:10.476
The question is: Do we really need to do that?

1:08:10.476 --> 1:08:14.074
Therefore, there is a bunch of work.

1:08:14.074 --> 1:08:16.602
Can we do it differently?

1:08:16.602 --> 1:08:19.616
Can we generate the full target sentence at once?

1:08:20.160 --> 1:08:29.417
We'll see it's not that easy, and there's still
an open debate whether this is really faster

1:08:29.417 --> 1:08:31.832
at the same quality, but I think it's worth a look.

1:08:32.712 --> 1:08:45.594
So, as said, what we have is our encoder-decoder,
where we can process our encoder in parallel,

1:08:45.594 --> 1:08:50.527
and then each output always depends on the previous ones.

1:08:50.410 --> 1:08:54.709
We generate the output and then we have to
put it in here as the y, because then everything

1:08:54.709 --> 1:08:56.565
depends on the previously generated output.

1:08:56.916 --> 1:09:10.464
This is what is referred to as an auto-regressive
model, and nearly all speech generation and

1:09:10.464 --> 1:09:16.739
language generation works in this auto-regressive way.

1:09:18.318 --> 1:09:21.132
So the motivation is, can we do that more
efficiently?

1:09:21.361 --> 1:09:31.694
And can we somehow process all target words
in parallel?

1:09:31.694 --> 1:09:41.302
So instead of doing it one by one, we are
inputting everything at once.

1:09:45.105 --> 1:09:46.726
So how does it work?

1:09:46.726 --> 1:09:50.587
So let's first look at a basic non-autoregressive
model.

1:09:50.810 --> 1:09:53.551
So the encoder looks as it did before.

1:09:53.551 --> 1:09:58.310
That's maybe not surprising because here we
know we can parallelize.

1:09:58.618 --> 1:10:04.592
So we have put in here our encoder input and
generated the encoder states, so that's exactly

1:10:04.592 --> 1:10:05.295
the same.

1:10:05.845 --> 1:10:16.229
However, now we need to do one more thing:
One challenge is what we had before and that's

1:10:16.229 --> 1:10:26.799
a challenge of natural language generation
like machine translation.

1:10:32.672 --> 1:10:38.447
Normally we generate until we produce this
end-of-sentence token, but if we now generate

1:10:38.447 --> 1:10:44.625
everything at once, that's no longer possible,
so we cannot keep generating until we stop, because we only

1:10:44.625 --> 1:10:45.632
generate once.

1:10:46.206 --> 1:10:58.321
So the question is how we can now determine
in advance how long the output sequence is.

1:11:00.000 --> 1:11:06.384
Yes, but there would be one idea, and there
is other work which tries to do that.

1:11:06.806 --> 1:11:15.702
However, here there's some work already
done before, and maybe you remember we had the

1:11:15.702 --> 1:11:20.900
IBM models and there was this concept of fertility.

1:11:21.241 --> 1:11:26.299
The concept of fertility means: for one source
word, into how many target words does

1:11:26.299 --> 1:11:27.104
it translate?

1:11:27.847 --> 1:11:34.805
And exactly that we try to do here, and that
means, at the top of the encoder, we

1:11:34.805 --> 1:11:36.134
are calculating the fertilities.

1:11:36.396 --> 1:11:42.045
So it says this word is translated into one word.

1:11:42.045 --> 1:11:54.171
That word might be translated into two words, and so on;
we're trying to predict into how many words each source word translates.

1:11:55.935 --> 1:12:10.314
And then, at the end of the encoder, this is
like a length estimation.

1:12:10.314 --> 1:12:15.523
You can also do it differently.

1:12:16.236 --> 1:12:24.526
You initialize your decoder input, and we know
word embeddings work well, so we're trying

1:12:24.526 --> 1:12:28.627
to do the same thing, and what people then do:

1:12:28.627 --> 1:12:35.224
they initialize it again with the source word embeddings,
but each repeated according to its fertility.

1:12:35.315 --> 1:12:36.460
So take this example.

1:12:36.896 --> 1:12:47.816
So one word has fertility two, so it appears twice,
and the other ones once; that is then our initialization.
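
As a sketch, this fertility-based initialization boils down to copying each source embedding according to its predicted fertility (names and shapes are illustrative):

```python
import torch

def init_decoder_input(src_embeddings, fertilities):
    # src_embeddings: (S, d); fertilities: (S,) non-negative integers.
    # Each source embedding is repeated fertility-many times, so the
    # decoder input length equals the predicted target length.
    return torch.repeat_interleave(src_embeddings, fertilities, dim=0)

# e.g. fertilities = torch.tensor([2, 1, 1]) copies the first source
# embedding twice, giving a decoder input of length four.
```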

1:12:48.208 --> 1:12:57.151
In other words, if you don't predict fertilities
but predict lengths, you can just initialize

1:12:57.151 --> 1:12:57.912
the decoder input uniformly instead.

1:12:58.438 --> 1:13:07.788
This often works a bit better, but that's
the other option.

1:13:07.788 --> 1:13:16.432
Now you have everything in training and testing.

1:13:16.656 --> 1:13:18.621
This is all available at once.

1:13:20.280 --> 1:13:31.752
Then we can generate everything in parallel,
so we have the decoder stack, and that is now

1:13:31.752 --> 1:13:33.139
as before.

1:13:35.395 --> 1:13:41.555
And then we're doing the translation predictions
here on top of it, in order to generate the output.

1:13:43.083 --> 1:13:59.821
And then we are predicting here all the target
words at once, and that is the basic

1:13:59.821 --> 1:14:00.924
idea.

1:14:01.241 --> 1:14:08.171
This is non-autoregressive machine translation, where
the idea is: we don't have to generate one word at a time.

1:14:10.210 --> 1:14:13.900
So this looks really, really, really great.

1:14:13.900 --> 1:14:20.358
At first view. But there's one challenge with
this, and you see it against the baseline.

1:14:20.358 --> 1:14:27.571
Of course there are some improvements, but in
general the quality drop is often significant.

1:14:28.068 --> 1:14:32.075
So here you see the baseline models.

1:14:32.075 --> 1:14:38.466
You have a loss of ten BLEU points or something
like that.

1:14:38.878 --> 1:14:40.230
So why does this happen?

1:14:40.230 --> 1:14:41.640
So why is it happening?

1:14:43.903 --> 1:14:56.250
If you look at the errors, there are repetitive
tokens: you get the same word twice, or things like that.

1:14:56.536 --> 1:15:01.995
Broken sentences or non-fluent sentences; that is
exactly where auto-regressive models are

1:15:01.995 --> 1:15:04.851
very good; we said that's even a bit of a problem.

1:15:04.851 --> 1:15:07.390
They generate very fluent translations.

1:15:07.387 --> 1:15:10.898
Sometimes the translation doesn't have
anything to do with the input.

1:15:11.411 --> 1:15:14.047
But generally it always looks very
fluent.

1:15:14.995 --> 1:15:20.865
Here it's exactly the opposite: the problem
is that we don't get really fluent translations.

1:15:21.421 --> 1:15:26.123
And that is mainly due to the challenge that
we have this independence assumption.

1:15:26.646 --> 1:15:35.873
So in this case, the probability of Y at the
second position is independent of the probability

1:15:35.873 --> 1:15:40.632
of Y at the first position, so we don't know what was generated there.

1:15:40.632 --> 1:15:43.740
We're just generating it there.

1:15:43.964 --> 1:15:55.439
You can also see it in a few examples.

1:15:55.439 --> 1:16:03.636
You can over-penalize shifts.

1:16:04.024 --> 1:16:10.566
And the problem is that an output which is only
shifted is fully penalized, although it is similar to the reference.

1:16:11.071 --> 1:16:19.900
So you can, for example, translate it one way,
or maybe you could also translate it

1:16:19.900 --> 1:16:31.105
another way; but you end up with a mix if the
first position picks words from one

1:16:31.105 --> 1:16:34.594
feeling done and the second.

1:16:35.075 --> 1:16:42.908
So each position here, and that is one of the
main issues, doesn't know what the others generate.

1:16:43.243 --> 1:16:53.846
And for example, if you are translating into
German, you can often translate things in two

1:16:53.846 --> 1:16:58.471
ways, with different grammatical agreement.

1:16:58.999 --> 1:17:02.058
And then here you have to decide which
variant to use.

1:17:02.162 --> 1:17:05.460
The model doesn't know which word
it has to select.

1:17:06.086 --> 1:17:14.789
I mean, of course it knows the hidden state,
but in the end you have a probability distribution.

1:17:16.256 --> 1:17:20.026
And that is the important difference to the
auto-regressive model.

1:17:20.026 --> 1:17:24.335
There you know it, because you have put it in
as input; here you don't know that.

1:17:24.335 --> 1:17:29.660
If two words are equally probable here, you don't
know which is selected, and of course that

1:17:29.660 --> 1:17:32.832
depends on what was selected at the other
position.

1:17:33.333 --> 1:17:39.554
Yep, that's the issue, and we're going
to come back to that in a second.

1:17:39.554 --> 1:17:39.986
Yes.

1:17:40.840 --> 1:17:44.935
Doesn't this also appear in the auto-regressive
model, now that we're talking about the training?

1:17:46.586 --> 1:17:48.412
The thing is, in the auto-regressive model:

1:17:48.412 --> 1:17:50.183
you give it the correct previous word as input.

1:17:50.450 --> 1:17:55.827
So if you predict something here where the reference
is "feeling", then you tell the model:

1:17:55.827 --> 1:17:59.573
the last one was "feeling", and then it knows
what has to come next.

1:17:59.573 --> 1:18:04.044
But here it doesn't know that, because it doesn't
get the previous word as input.

1:18:04.204 --> 1:18:24.286
Yes, that's a bit dependent on the setup.

1:18:24.204 --> 1:18:27.973
But in training, of course, you just try to
make the correct word the highest-scoring one.

1:18:31.751 --> 1:18:38.181
So what you can do is use things like the CTC loss,
which can adjust for this.

1:18:38.181 --> 1:18:42.866
So then you can also have this shifted correction.

1:18:42.866 --> 1:18:50.582
If you do this type of correction with
the CTC loss, you don't get the full penalty.

1:18:50.930 --> 1:18:58.486
If it's just shifted by one, it counts differently; it's
a bit of a different loss, which is mainly used in speech recognition, but:

1:19:00.040 --> 1:19:03.412
It can be used in order to address this problem.
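
For illustration, this is how a CTC loss could be applied on top of a non-autoregressive decoder's outputs; the shapes and the blank index are assumptions. The decoder emits more positions than target tokens, and CTC marginalizes over alignments, so a shifted output is not fully penalized.

```python
import torch
import torch.nn as nn

T, B, V, S = 12, 1, 100, 7          # decoder positions, batch, vocab, target length
log_probs = torch.randn(T, B, V).log_softmax(-1)   # stand-in decoder outputs
targets = torch.randint(1, V, (B, S))              # id 0 reserved for blank
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((B,), T, dtype=torch.long),
           target_lengths=torch.full((B,), S, dtype=torch.long))
```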

1:19:04.504 --> 1:19:13.844
The other problem is that, non-autoregressively,
we have this ambiguity that the model has to disambiguate.

1:19:13.844 --> 1:19:20.515
That's the example from before: if you translate
"thank you" into German.

1:19:20.460 --> 1:19:31.925
And then it might end up mixing variants, because it learns
one option for the first position and another for the second.

1:19:32.492 --> 1:19:43.201
In order to prevent that, it would be helpful
to have, for one input, only one output; that makes

1:19:43.201 --> 1:19:47.002
it easier for the system to learn.

1:19:47.227 --> 1:19:53.867
It might be that for slightly different inputs
you have different outputs, but for the same input only one.

1:19:54.714 --> 1:19:57.467
That we can luckily very easily solve.

1:19:59.119 --> 1:19:59.908
And it's done.

1:19:59.908 --> 1:20:04.116
We just learned the technique for it, which
is called knowledge distillation.

1:20:04.985 --> 1:20:13.398
So what we can do, and the easiest solution
to improve your non-autoregressive model, is to

1:20:13.398 --> 1:20:16.457
train an auto-regressive model.

1:20:16.457 --> 1:20:22.958
Then you decode your whole training data
with this model, and train on its output.

1:20:23.603 --> 1:20:27.078
The main advantage of that is that this
output is more consistent.

1:20:27.407 --> 1:20:33.995
So for the same input you always have the
same output.

1:20:33.995 --> 1:20:41.901
So you make your training data more
consistent, and learning gets easier.

1:20:42.482 --> 1:20:54.471
So there is another advantage of knowledge
distillation and that advantage is you have

1:20:54.471 --> 1:20:59.156
more consistent training signals.

1:21:04.884 --> 1:21:10.630
There's another way to make things easier
at the beginning.

1:21:10.630 --> 1:21:16.467
There's this glancing model, where
you work with masks.

1:21:16.756 --> 1:21:26.080
So during training, especially at the beginning,
you already give the model some of the correct target words.

1:21:28.468 --> 1:21:38.407
And there is this K-tokens-at-a-time idea, where
you start from essentially auto-regressive training.

1:21:40.000 --> 1:21:50.049
And some targets are left open, so you always predict
only K tokens at once; at first, like auto-regression, K

1:21:50.049 --> 1:21:59.174
is one, so you always have one input
and one output, and then you do partial parallel prediction.

1:21:59.699 --> 1:22:05.825
So in that way you can slowly learn what is
a good and what is a bad answer.
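
A sketch of this kind of partial revealing during training; the keep ratio and its schedule are assumptions (in glancing-style training it would be decayed over time):

```python
import torch

def reveal_targets(decoder_init, target_ids, target_emb, keep_ratio=0.3):
    # Randomly reveal a fraction of the gold target embeddings in the
    # decoder input so that early training is easier; the remaining
    # positions keep the normal non-autoregressive initialization.
    mask = torch.rand(target_ids.shape) < keep_ratio
    gold = target_emb(target_ids)                  # (T, d) gold embeddings
    return torch.where(mask.unsqueeze(-1), gold, decoder_init)
```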

1:22:08.528 --> 1:22:10.862
It doesn't sound very efficient.

1:22:10.862 --> 1:22:12.578
Don't worry too much about that.

1:22:12.578 --> 1:22:15.323
You go over your training data several times anyway.

1:22:15.875 --> 1:22:20.655
You can even switch between the modes in between.

1:22:20.655 --> 1:22:29.318
There is whole work on this, where you
try different schedules.

1:22:31.271 --> 1:22:41.563
There's a whole line of work
on that; this is often done, and it doesn't

1:22:41.563 --> 1:22:46.598
mean it's less efficient but still it helps.

1:22:49.389 --> 1:22:57.979
For later reference, here are some examples of
how much these things help.

1:22:57.979 --> 1:23:04.958
One point here is really important.

1:23:05.365 --> 1:23:13.787
Here's the translation performance and speed.

1:23:13.787 --> 1:23:24.407
One important point is which baseline you
compare against.

1:23:24.784 --> 1:23:33.880
So yeah, if you compare to a very weak
baseline, a transformer even with beam search,

1:23:33.880 --> 1:23:40.522
then that is already ten times slower than a very
strong auto-regressive implementation.

1:23:40.961 --> 1:23:48.620
If you take a strong baseline, then the speed-up
goes down, depending on the setup; here you

1:23:48.620 --> 1:23:53.454
have a lot of different speed ups.

1:23:53.454 --> 1:24:03.261
Generally: take a strong baseline, and
not a very simple transformer.

1:24:07.407 --> 1:24:20.010
Yeah, one last thing that you can
do to speed things up and also reduce your

1:24:20.010 --> 1:24:25.950
memory is what is called half precision.

1:24:26.326 --> 1:24:29.139
That helps especially for decoding; for training it's trickier.

1:24:29.139 --> 1:24:31.148
Sometimes it also gets less stable.
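
A minimal sketch of half-precision inference in PyTorch; the tiny linear layer stands in for a real translation model:

```python
import torch

model = torch.nn.Linear(512, 512).cuda().half()   # fp16 weights: half the memory
x = torch.randn(1, 512).cuda().half()
with torch.no_grad():
    y = model(x)                                  # all math runs in float16

# For training, mixed precision is usually the more stable option:
# with torch.autocast("cuda", dtype=torch.float16):
#     loss = model(x).sum()
```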

1:24:32.592 --> 1:24:45.184
With this we are nearly at the end; wait just a bit.
What you should remember is the toolbox for efficient machine

1:24:45.184 --> 1:24:46.963
translation.

1:24:47.007 --> 1:24:51.939
We have, for example, looked at knowledge
distillation.

1:24:51.939 --> 1:24:55.991
We have looked at non auto regressive models.

1:24:55.991 --> 1:24:57.665
We have seen different efficient architectures.

1:24:58.898 --> 1:25:02.383
That's it for today; then only one request.

1:25:02.383 --> 1:25:08.430
So if you haven't done so, please fill out
the evaluation.

1:25:08.388 --> 1:25:20.127
If you have done so already, thanks; and hopefully
the online people will do it as well.

1:25:20.320 --> 1:25:29.758
It's one possibility to tell us what things are
good and what not; not the only one, but the most

1:25:29.758 --> 1:25:30.937
efficient.

1:25:31.851 --> 1:25:35.871
So please, all students, think of doing it.
Okay, then thank you.