WEBVTT
0:00:01.921 --> 0:00:16.424
Hey, welcome to today's lecture. What we want
to look at today is how we can make neural machine translation more efficient.
0:00:16.796 --> 0:00:26.458
So until now we have this global system, the
encoder and the decoder mostly, and we haven't
0:00:26.458 --> 0:00:29.714
really thought about how long.
0:00:30.170 --> 0:00:42.684
And what we, for example, know is yeah, you
can make the systems bigger in different ways.
0:00:42.684 --> 0:00:47.084
We can make them deeper so the.
0:00:47.407 --> 0:00:56.331
And if we have enough data, that typically
helps to make the performance better.
0:00:56.576 --> 0:01:00.620
But of course leads to problems that we need
more resources.
0:01:00.620 --> 0:01:06.587
That is a problem at universities where we
have typically limited computation capacities.
0:01:06.587 --> 0:01:11.757
So at some point you have such big models
that you cannot train them anymore.
0:01:13.033 --> 0:01:23.792
And also for companies is of course important
if it costs you like to generate translation
0:01:23.792 --> 0:01:26.984
just by power consumption.
0:01:27.667 --> 0:01:35.386
So yeah, there's different reasons why you
want to do efficient machine translation.
0:01:36.436 --> 0:01:48.338
One reason is there are different ways of
how you can improve your machine translation
0:01:48.338 --> 0:01:50.527
system once we.
0:01:50.670 --> 0:01:55.694
There can be different types of data we looked
into data crawling, monolingual data.
0:01:55.875 --> 0:01:59.024
All this data and the aim is always.
0:01:59.099 --> 0:02:05.735
Of course, we are not just purely interested
in having more data, but the idea why we want
0:02:05.735 --> 0:02:12.299
to have more data is that more data also means
that we have better quality because mostly
0:02:12.299 --> 0:02:17.550
we are interested in increasing the quality
of the machine translation.
0:02:18.838 --> 0:02:24.892
But there's also other ways of how you can
improve the quality of a machine translation.
0:02:25.325 --> 0:02:36.450
And what is, of course, that is where most
research is focusing on.
0:02:36.450 --> 0:02:44.467
It means all we want to build better algorithms.
0:02:44.684 --> 0:02:48.199
Course: The other things are normally as good.
0:02:48.199 --> 0:02:54.631
Sometimes it's easier to improve, so often
it's easier to just collect more data than
0:02:54.631 --> 0:02:57.473
to invent some great new algorithms.
0:02:57.473 --> 0:03:00.315
But yeah, both of them are important.
0:03:00.920 --> 0:03:09.812
But there is this third thing, especially
with neural machine translation, and that means
0:03:09.812 --> 0:03:11.590
we make a bigger model.
0:03:11.751 --> 0:03:16.510
Can be, as said, that we have more layers,
that we have wider layers.
0:03:16.510 --> 0:03:19.977
The other thing we talked a bit about is ensemble.
0:03:19.977 --> 0:03:24.532
That means we are not building one new machine
translation system.
0:03:24.965 --> 0:03:27.505
And we can easily build four.
0:03:27.505 --> 0:03:32.331
What is the typical strategy to build different
systems?
0:03:32.331 --> 0:03:33.177
Remember.
0:03:35.795 --> 0:03:40.119
It should be of course a bit different if
you have the same.
0:03:40.119 --> 0:03:44.585
If they all predict the same then combining
them doesn't help.
0:03:44.585 --> 0:03:48.979
So what is the easiest way if you have to
build four systems?
0:03:51.711 --> 0:04:01.747
And the Charleston's will take, but this is
the best output of a single system.
0:04:02.362 --> 0:04:10.165
Mean now, it's really three different systems
so that you later can combine them and maybe
0:04:10.165 --> 0:04:11.280
the average.
0:04:11.280 --> 0:04:16.682
Ensembles typically work by averaging all the output probabilities.
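As a small illustration (not from the lecture slides; the `models` interface is an assumption), ensembling by averaging the models' next-word distributions could look like this:

```python
import torch

def ensemble_next_word_probs(models, src, prefix):
    """Average the next-word distributions of several independently trained models."""
    probs = []
    for model in models:
        with torch.no_grad():
            logits = model(src, prefix)               # logits over the vocabulary for the next word
            probs.append(torch.softmax(logits, dim=-1))
    return torch.stack(probs).mean(dim=0)             # equal-weight average over the ensemble
```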
0:04:19.439 --> 0:04:24.227
The idea is to think about neural networks.
0:04:24.227 --> 0:04:29.342
There's one parameter which can easily adjust.
0:04:29.342 --> 0:04:36.525
That's exactly the easiest way to randomize
with three different.
0:04:37.017 --> 0:04:43.119
They have the same architecture, so all the
hyperparameters are the same, but they are
0:04:43.119 --> 0:04:43.891
different.
0:04:43.891 --> 0:04:46.556
They will have different predictions.
0:04:48.228 --> 0:04:52.572
So, of course, bigger amounts.
0:04:52.572 --> 0:05:05.325
Some of these are a bit the easiest way of
improving your quality because you don't really
0:05:05.325 --> 0:05:08.268
have to do anything.
0:05:08.588 --> 0:05:12.588
There are limits on that: bigger models only
get better
0:05:12.588 --> 0:05:19.132
if you have enough training data. You can't
just add a hundred layers, and it will not work
0:05:19.132 --> 0:05:24.877
on very small data, but with a reasonable amount
of data that is the easiest thing.
0:05:25.305 --> 0:05:33.726
However, there are challenges with making
better, bigger models, and that is the
0:05:33.726 --> 0:05:34.970
computation.
0:05:35.175 --> 0:05:44.482
So, of course, if you have a bigger model
that can mean that you have longer running
0:05:44.482 --> 0:05:49.518
times, if you have models, you have to times.
0:05:51.171 --> 0:05:56.685
Normally you cannot parallelize the different
layers because the input to one layer is always
0:05:56.685 --> 0:06:02.442
the output of the previous layer, so you propagate
through them, and that will also increase your runtime.
0:06:02.822 --> 0:06:10.720
Then you have to store all your models in
memory.
0:06:10.720 --> 0:06:20.927
If you have double the weights you will need double the memory.
It is also more difficult to then do backpropagation.
0:06:20.927 --> 0:06:27.680
You have to store the activations in between,
so not only do you increase the model
0:06:27.680 --> 0:06:31.865
in your memory, but also all these other variables
that.
0:06:34.414 --> 0:06:36.734
And so in general it is more expensive.
0:06:37.137 --> 0:06:54.208
And therefore there are good reasons for looking
into whether we can make these models somehow more efficient.
0:06:54.134 --> 0:07:00.982
So it's been through the viewer, you can have
it okay, have one and one day of training time,
0:07:00.982 --> 0:07:01.274
or.
0:07:01.221 --> 0:07:07.535
Forty thousand euros and then what is the
best machine translation system I can get within
0:07:07.535 --> 0:07:08.437
this budget.
0:07:08.969 --> 0:07:19.085
And then, of course, you can make the models
bigger, but then you have to train them shorter,
0:07:19.085 --> 0:07:24.251
and then we can make more efficient algorithms.
0:07:25.925 --> 0:07:31.699
If you think about efficiency, there's a bit
different scenarios.
0:07:32.312 --> 0:07:43.635
So if you're more of coming from the research
community, what you'll be doing is building
0:07:43.635 --> 0:07:47.913
a lot of models in your research.
0:07:48.088 --> 0:07:58.645
So you're having your test set of maybe sentences,
calculating the blue score, then another model.
0:07:58.818 --> 0:08:08.911
So what that means is typically you're training
on millions of sentences, so your training time
0:08:08.911 --> 0:08:14.944
is long, maybe a day, but maybe in other cases
a week.
0:08:15.135 --> 0:08:22.860
The testing is not really the cost efficient,
but the training is very costly.
0:08:23.443 --> 0:08:37.830
If you are more thinking of building models
for application, the scenario is quite different.
0:08:38.038 --> 0:08:46.603
And then you keep it running, and maybe thousands
of customers are using it in translating.
0:08:46.603 --> 0:08:47.720
So in that.
0:08:48.168 --> 0:08:59.577
And we will see that it is not always the
same type of challenge: you can parallelize some
0:08:59.577 --> 0:09:07.096
things in training which you cannot parallelize
in testing.
0:09:07.347 --> 0:09:14.124
For example, in training you have to do back
propagation, so you have to store the activations.
0:09:14.394 --> 0:09:23.901
Therefore, in testing we briefly discussed
that we would do it in more detail today in
0:09:23.901 --> 0:09:24.994
training.
0:09:25.265 --> 0:09:36.100
You know the target and you can process
everything in parallel, while in testing you cannot.
0:09:36.356 --> 0:09:46.741
So you can only do one word at a time, and
so you can parallelize this less.
0:09:46.741 --> 0:09:50.530
Therefore, it's important.
0:09:52.712 --> 0:09:55.347
Is a specific task on this.
0:09:55.347 --> 0:10:03.157
For example, it's the efficiency task where
it's about making things as efficient.
0:10:03.123 --> 0:10:09.230
Is possible and they can look at different
resources.
0:10:09.230 --> 0:10:14.207
So how much GPU runtime do you need?
0:10:14.454 --> 0:10:19.366
See how much memory you need or you can have
a fixed memory budget and then have to build
0:10:19.366 --> 0:10:20.294
the best system.
0:10:20.500 --> 0:10:29.010
And here is a bit like an example of that,
so there's three teams from Edinburgh from
0:10:29.010 --> 0:10:30.989
and they submitted.
0:10:31.131 --> 0:10:36.278
So then, of course, if you want to know the
most efficient system you have to do a bit
0:10:36.278 --> 0:10:36.515
of.
0:10:36.776 --> 0:10:44.656
You want to have a better quality or more
runtime and there's not the one solution.
0:10:44.656 --> 0:10:46.720
You can improve your.
0:10:46.946 --> 0:10:49.662
And that you see that there are different
systems.
0:10:49.909 --> 0:11:06.051
Here is how many words you can do for a second
on the clock, and you want to be as talk as
0:11:06.051 --> 0:11:07.824
possible.
0:11:08.068 --> 0:11:08.889
And you see here a bit.
0:11:08.889 --> 0:11:09.984
This is a little bit different.
0:11:11.051 --> 0:11:27.717
You want to be there on the top right corner
and you can get a score of something between
0:11:27.717 --> 0:11:29.014
words.
0:11:30.250 --> 0:11:34.161
Two hundred and fifty thousand, then you'll
ever come and score zero point three.
0:11:34.834 --> 0:11:41.243
There is, of course, any bit of a decision,
but the question is, like how far can you again?
0:11:41.243 --> 0:11:47.789
Some of all these points on this line would
be winners because they are somehow most efficient
0:11:47.789 --> 0:11:53.922
in a way that there's no system which achieves
the same quality with less computational.
0:11:57.657 --> 0:12:04.131
So there's the one question of which resources
are you interested.
0:12:04.131 --> 0:12:07.416
Are you running it on CPU or GPU?
0:12:07.416 --> 0:12:11.668
There are different ways of parallelizing things.
0:12:14.654 --> 0:12:20.777
Another dimension is how you process your
data.
0:12:20.777 --> 0:12:27.154
There is batch processing and streaming.
0:12:27.647 --> 0:12:34.672
So in batch processing you have the whole
document available so you can translate all
0:12:34.672 --> 0:12:39.981
sentences in parallel and then you're interested
in throughput.
0:12:40.000 --> 0:12:43.844
But you can then process, for example, especially
in GPS.
0:12:43.844 --> 0:12:49.810
That's interesting, you're not translating
one sentence at a time, but you're translating
0:12:49.810 --> 0:12:56.108
one hundred sentences or so in parallel, so
you have one more dimension where you can parallelize
0:12:56.108 --> 0:12:57.964
and then be more efficient.
0:12:58.558 --> 0:13:14.863
On the other hand, you can for example sort the sentences of a document,
because we learned that if you do batch processing
0:13:14.863 --> 0:13:16.544
you have to pad all sentences in a batch to the same length.
0:13:16.636 --> 0:13:24.636
Then, of course, it makes sense to sort the
sentences by length in order to have the minimum padding
0:13:24.636 --> 0:13:25.535
attached.
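A minimal sketch of this length-based batching (assuming `sentences` is a list of token lists; not tied to any particular toolkit):

```python
def make_batches(sentences, batch_size):
    # Sort indices by sentence length so each batch contains similar lengths
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    batches = [
        [sentences[i] for i in order[start:start + batch_size]]   # little padding needed per batch
        for start in range(0, len(order), batch_size)
    ]
    return batches, order   # keep `order` to restore the original sentence order afterwards
```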
0:13:27.427 --> 0:13:32.150
The other scenario is more the streaming scenario
where you do live translation.
0:13:32.512 --> 0:13:40.212
So in that case you can't wait for the whole
document to pass, but you have to do.
0:13:40.520 --> 0:13:49.529
And then, for example, that's especially in
situations like speech translation, and then
0:13:49.529 --> 0:13:53.781
you're interested in things like latency.
0:13:53.781 --> 0:14:00.361
So how much do you have to wait to get the
output of a sentence?
0:14:06.566 --> 0:14:16.956
Finally, there is the thing about the implementation:
Today we're mainly looking at different algorithms,
0:14:16.956 --> 0:14:23.678
different models of how you can model them
in your machine translation system, but of
0:14:23.678 --> 0:14:29.227
course for the same algorithms there's also
different implementations.
0:14:29.489 --> 0:14:38.643
So, for example, for a machine translation
this tool could be very fast.
0:14:38.638 --> 0:14:46.615
So they have coded a lot of the operations
at a low level,
0:14:46.615 --> 0:14:49.973
directly in the CUDA kernels.
0:14:50.110 --> 0:15:00.948
So the same attention network is typically
more efficient in that type of algorithm.
0:15:00.880 --> 0:15:02.474
Than in in any other.
0:15:03.323 --> 0:15:13.105
Of course, it might be other disadvantages,
so if you're a little worker or have worked
0:15:13.105 --> 0:15:15.106
in the practical.
0:15:15.255 --> 0:15:22.604
Because it's normally easier to understand,
easier to change, and so on, but there is again
0:15:22.604 --> 0:15:23.323
a trade-off.
0:15:23.483 --> 0:15:29.440
You have to think about, do you want to include
this into my study or comparison or not?
0:15:29.440 --> 0:15:36.468
Should it be like I compare different implementations
and I also find the most efficient implementation?
0:15:36.468 --> 0:15:39.145
Or is it only about the pure algorithm?
0:15:42.742 --> 0:15:50.355
Yeah, when building these systems there is
a different trade-off to do.
0:15:50.850 --> 0:15:56.555
So there's one of the traders between memory
and throughput, so how many words can generate
0:15:56.555 --> 0:15:57.299
per second.
0:15:57.557 --> 0:16:03.351
So typically you can easily increase
your throughput by increasing the batch size.
0:16:03.643 --> 0:16:06.899
So that means you are translating more sentences
in parallel.
0:16:07.107 --> 0:16:09.241
And GPUs are very good at that stuff.
0:16:09.349 --> 0:16:15.161
It should translate one sentence or one hundred
sentences, not the same time, but its.
0:16:15.115 --> 0:16:20.784
Rough are very similar because they are at
this efficient metrics multiplication so that
0:16:20.784 --> 0:16:24.415
you can do the same operation on all sentences
parallel.
0:16:24.415 --> 0:16:30.148
So typically that means if you increase your
batch size you can do more things in parallel
0:16:30.148 --> 0:16:31.995
and you will translate more.
0:16:31.952 --> 0:16:33.370
Second.
0:16:33.653 --> 0:16:43.312
On the other hand, for this advantage you will
of course need higher batch sizes and
0:16:43.312 --> 0:16:44.755
more memory.
0:16:44.965 --> 0:16:56.452
To begin with, the other problem is that you
have such big models that you can only translate
0:16:56.452 --> 0:16:59.141
with lower batch sizes.
0:16:59.119 --> 0:17:08.466
If you are running out of memory while translating,
one idea is to decrease your batch size.
0:17:13.453 --> 0:17:24.456
Then there is the trade-off between quality and throughput,
of course; as said before, larger models
0:17:24.456 --> 0:17:28.124
generally give higher quality.
0:17:28.124 --> 0:17:31.902
The first one is always this way.
0:17:32.092 --> 0:17:38.709
Of course, a larger model does not always help; you can
have overfitting at some point, but in general it does.
0:17:43.883 --> 0:17:52.901
And with this a bit on this training and testing
thing we had before.
0:17:53.113 --> 0:17:58.455
So it wears all the difference between training
and testing, and for the encoder and decoder.
0:17:58.798 --> 0:18:06.992
So if we are looking at what mentioned before
at training time, we have a source sentence
0:18:06.992 --> 0:18:17.183
here: And how this is processed on a is not
the attention here.
0:18:17.183 --> 0:18:21.836
That's a typical transformer.
0:18:22.162 --> 0:18:31.626
And how we can do that on a is that we can
paralyze the ear ever since.
0:18:31.626 --> 0:18:40.422
The first thing to know is: So that is, of
course, not in all cases.
0:18:40.422 --> 0:18:49.184
We'll later talk about speech translation
where we might want to translate.
0:18:49.389 --> 0:18:56.172
Without the general case in, it's like you
have the full sentence you want to translate.
0:18:56.416 --> 0:19:02.053
So the important thing is we are here everything
available on the source side.
0:19:03.323 --> 0:19:13.524
And then this was one of the big advantages
that you can remember back of transformer.
0:19:13.524 --> 0:19:15.752
There are several.
0:19:16.156 --> 0:19:25.229
But the other one is now that we can calculate
the full layer.
0:19:25.645 --> 0:19:29.318
There is no dependency between this and this
state or this and this state.
0:19:29.749 --> 0:19:36.662
So we always did like here to calculate the
key value and query, and based on that you
0:19:36.662 --> 0:19:37.536
calculate.
0:19:37.937 --> 0:19:46.616
Which means we can do all these calculations
here in parallel and in parallel.
0:19:48.028 --> 0:19:55.967
And that, of course, is very efficient
because for GPUs it's typically better
0:19:55.967 --> 0:20:00.887
to do these things in parallel than one after
the other.
0:20:01.421 --> 0:20:10.311
And then we can also for each layer one by
one, and then we calculate here the encoder.
0:20:10.790 --> 0:20:21.921
In training now an important thing is that
for the decoder we have the full sentence available
0:20:21.921 --> 0:20:28.365
because we know this is the target we should
generate.
0:20:29.649 --> 0:20:33.526
We have models now in a different way.
0:20:33.526 --> 0:20:38.297
This hidden state is only on the previous
ones.
0:20:38.598 --> 0:20:51.887
And the first thing here depends only on this
information, so you see if you remember we
0:20:51.887 --> 0:20:56.665
had this masked self-attention.
0:20:56.896 --> 0:21:04.117
So that means, of course, we can only calculate
the decoder once the encoder is done, but that's.
0:21:04.444 --> 0:21:06.656
Percent can calculate the end quarter.
0:21:06.656 --> 0:21:08.925
Then we can calculate here the decoder.
0:21:09.569 --> 0:21:25.566
But again in training we have x, y and that
is available so we can calculate everything
0:21:25.566 --> 0:21:27.929
in parallel.
0:21:28.368 --> 0:21:40.941
So the interesting thing or advantage of transformer
is in training.
0:21:40.941 --> 0:21:46.408
We can do it for the decoder.
0:21:46.866 --> 0:21:54.457
That means you will have more calculations
because you can only calculate one layer at
0:21:54.457 --> 0:22:02.310
a time, but for example the length which is
too bigly quite long or doesn't really matter
0:22:02.310 --> 0:22:03.270
that much.
0:22:05.665 --> 0:22:10.704
However, in testing this situation is different.
0:22:10.704 --> 0:22:13.276
In testing we only have.
0:22:13.713 --> 0:22:20.622
So this means we start with a source sentence. We don't
know the full target sentence yet because we
0:22:20.622 --> 0:22:29.063
autoregressively generate it, so for the encoder
we have the same situation here, but not for the decoder.
0:22:29.409 --> 0:22:39.598
In this case we only have the first and the
second instinct, but only for all states in
0:22:39.598 --> 0:22:40.756
parallel.
0:22:41.101 --> 0:22:51.752
And then we can do the next step for y because
we are putting our most probable one.
0:22:51.752 --> 0:22:58.643
We do greedy search or beam search, but you
cannot do it in parallel.
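To make the contrast concrete, here is a minimal sketch of greedy autoregressive decoding (the `encode`/`decode_step` interface is a hypothetical stand-in, not a specific toolkit API):

```python
import torch

def greedy_decode(model, src_tokens, bos_id, eos_id, max_len=100):
    enc = model.encode(src_tokens)                  # encoder: all source positions in parallel
    target = [bos_id]
    for _ in range(max_len):                        # decoder: one target word at a time
        logits = model.decode_step(enc, target)     # depends on all previously generated words
        next_id = int(torch.argmax(logits[-1]))     # greedy choice; beam search would track k hypotheses
        target.append(next_id)
        if next_id == eos_id:
            break
    return target
```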
0:23:03.663 --> 0:23:16.838
Yes, so if we are interesting in making things
more efficient for testing, which we see, for
0:23:16.838 --> 0:23:22.363
example in the scenario of really our.
0:23:22.642 --> 0:23:34.286
It makes sense that we think about our architecture
and that we are currently working on attention
0:23:34.286 --> 0:23:35.933
based models.
0:23:36.096 --> 0:23:44.150
The decoder there is some of the most time
spent testing and testing.
0:23:44.150 --> 0:23:47.142
It's similar, but during.
0:23:47.167 --> 0:23:50.248
Nothing about beam search.
0:23:50.248 --> 0:23:59.833
It might be even more complicated because
in beam search you have to try different.
0:24:02.762 --> 0:24:15.140
So the question is what can you now do in
order to make your model more efficient and
0:24:15.140 --> 0:24:21.905
better in translation in these types of cases?
0:24:24.604 --> 0:24:30.178
And the one thing is to look into the encoder-decoder
trade-off.
0:24:30.690 --> 0:24:43.898
And then until now we typically assume that
the depth of the encoder and the depth of the
0:24:43.898 --> 0:24:48.154
decoder is roughly the same.
0:24:48.268 --> 0:24:55.553
So if you haven't thought about it, you just
take what is running well.
0:24:55.553 --> 0:24:57.678
You would try to do.
0:24:58.018 --> 0:25:04.148
However, we saw now that there is a quite
big challenge and the runtime is a lot longer
0:25:04.148 --> 0:25:04.914
than here.
0:25:05.425 --> 0:25:14.018
The question is also the case for the calculations,
or do we have there the same issue that we
0:25:14.018 --> 0:25:21.887
only get the good quality if we are having
high and high, so we know that making these
0:25:21.887 --> 0:25:25.415
more depths is increasing our quality.
0:25:25.425 --> 0:25:31.920
But what we haven't talked about is really
important that we increase the depth the same
0:25:31.920 --> 0:25:32.285
way.
0:25:32.552 --> 0:25:41.815
So what we can instead also do is something
like this, where you have a deep encoder and
0:25:41.815 --> 0:25:42.923
a shallow decoder.
0:25:43.163 --> 0:25:57.386
So that would be that you, for example, have
instead of having layers on the encoder, and
0:25:57.386 --> 0:25:59.757
layers on the.
0:26:00.080 --> 0:26:10.469
So in this case the overall depth from start
to end would be similar and so hopefully.
0:26:11.471 --> 0:26:21.662
But we can parallelize a lot more things here,
and what is costly at the end during decoding is
0:26:21.662 --> 0:26:22.973
the decoder.
0:26:22.973 --> 0:26:29.330
Because that is generated in an autoregressive
way, we cannot parallelize there.
0:26:31.411 --> 0:26:33.727
And that that can be analyzed.
0:26:33.727 --> 0:26:38.734
So here is some examples: Where people have
done all this.
0:26:39.019 --> 0:26:55.710
So here it's mainly interested on the orange
things, which is auto-regressive about the
0:26:55.710 --> 0:26:57.607
speed up.
0:26:57.717 --> 0:27:15.031
You have the system, so agree is not exactly
the same, but it's similar.
0:27:15.055 --> 0:27:23.004
It's always the case if you look at speed
up.
0:27:23.004 --> 0:27:31.644
Think they put a speed of so that's the baseline.
0:27:31.771 --> 0:27:35.348
So between and times as fast.
0:27:35.348 --> 0:27:42.621
If you switch from a system to where you have
layers in the.
0:27:42.782 --> 0:27:52.309
You see that although you have slightly more
parameters, more calculations are also roughly
0:27:52.309 --> 0:28:00.283
the same, but you get a speed-up because now
during testing you can parallelize more.
0:28:02.182 --> 0:28:09.754
The other thing is that you're speeding up,
but if you look at the performance it's similar,
0:28:09.754 --> 0:28:13.500
so sometimes you improve, sometimes you lose.
0:28:13.500 --> 0:28:20.421
There's a bit of losing English to Romania,
but in general the quality is very slow.
0:28:20.680 --> 0:28:30.343
So you see that you can keep a similar performance
while improving your speed by just having different.
0:28:30.470 --> 0:28:34.903
And you also see the encoder layers from speed.
0:28:34.903 --> 0:28:38.136
They don't really metal that much.
0:28:38.136 --> 0:28:38.690
Most.
0:28:38.979 --> 0:28:50.319
Because if you compare the 12th system to
the 6th system you have a lower performance
0:28:50.319 --> 0:28:57.309
with 6th and colder layers but the speed is
similar.
0:28:57.897 --> 0:29:02.233
And see the huge decrease is it maybe due
to a lack of data.
0:29:03.743 --> 0:29:11.899
Good idea would say it's not the case.
0:29:11.899 --> 0:29:23.191
Romanian English should have the same number
of data.
0:29:24.224 --> 0:29:31.184
Maybe it's just that something in that language.
0:29:31.184 --> 0:29:40.702
If you generate Romanian maybe they need more
target dependencies.
0:29:42.882 --> 0:29:46.263
The Wine's the Eye Also Don't Know Any Sex
People Want To.
0:29:47.887 --> 0:29:49.034
There could be yeah the.
0:29:49.889 --> 0:29:58.962
As the maybe if you go from like a movie sphere
to a hybrid sphere, you can: It's very much
0:29:58.962 --> 0:30:12.492
easier to expand the vocabulary to English,
but it must be the vocabulary.
0:30:13.333 --> 0:30:21.147
Have to check, but would assume that in this
case the system is not retrained, but it's
0:30:21.147 --> 0:30:22.391
trained with.
0:30:22.902 --> 0:30:30.213
And that's why I was assuming that they have
the same, but maybe you'll write that in this
0:30:30.213 --> 0:30:35.595
piece, for example, if they were pre-trained,
the decoder English.
0:30:36.096 --> 0:30:43.733
But don't remember exactly if they do something
like that, but that could be a good.
0:30:45.325 --> 0:30:52.457
So this is some of the most easy way to speed
up.
0:30:52.457 --> 0:31:01.443
You just change two hyperparameters and don't have to
implement anything.
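As a sketch of what such a hyperparameter switch looks like (using the generic PyTorch transformer module; the 12/1 split is just an example, not the exact configuration from the paper):

```python
import torch.nn as nn

# Balanced baseline: same depth on both sides
balanced = nn.Transformer(d_model=512, nhead=8,
                          num_encoder_layers=6, num_decoder_layers=6)

# Deep encoder, shallow decoder: similar total depth, but the expensive
# autoregressive part at test time is only a single layer
deep_enc_shallow_dec = nn.Transformer(d_model=512, nhead=8,
                                      num_encoder_layers=12, num_decoder_layers=1)
```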
0:31:02.722 --> 0:31:08.367
Of course, there's other ways of doing that.
0:31:08.367 --> 0:31:11.880
We'll look into two things.
0:31:11.880 --> 0:31:16.521
The other thing is the architecture.
0:31:16.796 --> 0:31:28.154
We are now at some of the baselines that we
are doing.
0:31:28.488 --> 0:31:39.978
However, in translation in the decoder side,
it might not be the best solution.
0:31:39.978 --> 0:31:41.845
There is no.
0:31:42.222 --> 0:31:47.130
So we can use different types of architectures,
also in the encoder and the.
0:31:47.747 --> 0:31:52.475
And there's two ways of what you could do
different, or there's more ways.
0:31:52.912 --> 0:31:54.825
We will look into two today.
0:31:54.825 --> 0:31:58.842
The one is average attention, which is a very
simple solution.
0:31:59.419 --> 0:32:01.464
You can do as it says.
0:32:01.464 --> 0:32:04.577
It's not really attending anymore.
0:32:04.577 --> 0:32:08.757
It's just like equal attendance to everything.
0:32:09.249 --> 0:32:23.422
And the other idea, which is currently done
in most systems which are optimized to efficiency,
0:32:23.422 --> 0:32:24.913
is we're.
0:32:25.065 --> 0:32:32.623
But on the decoder side we are then not using
transformer or self attention, but we are using
0:32:32.623 --> 0:32:39.700
a recurrent neural network, because here the
disadvantages of recurrent neural networks matter less.
0:32:39.799 --> 0:32:48.353
And then the recurrent is normally easier
to calculate because it only depends on inputs,
0:32:48.353 --> 0:32:49.684
the input on.
0:32:51.931 --> 0:33:02.190
So what is the difference between decoding
and why is the tension maybe not sufficient
0:33:02.190 --> 0:33:03.841
for decoding?
0:33:04.204 --> 0:33:14.390
If we want to populate the new state, we only
have to look at the input and the previous
0:33:14.390 --> 0:33:15.649
state, so.
0:33:16.136 --> 0:33:19.029
We are more conditional here networks.
0:33:19.029 --> 0:33:19.994
We have the.
0:33:19.980 --> 0:33:31.291
Dependency to a fixed number of previous ones,
but that's rarely used for decoding.
0:33:31.291 --> 0:33:39.774
In contrast, in transformer we have this large
dependency, so.
0:33:40.000 --> 0:33:52.760
So y t depends on y 1 up to y t minus one, and that is somehow
not very efficient in this way; I mean,
0:33:52.760 --> 0:33:56.053
it's very good because you can model all dependencies.
0:33:56.276 --> 0:34:03.543
However, the disadvantage is that we also
have to do all these calculations, so if we
0:34:03.543 --> 0:34:10.895
more view from the point of view of efficient
calculation, this might not be the best.
0:34:11.471 --> 0:34:20.517
So the question is, can we change our architecture
to keep some of the advantages but make things
0:34:20.517 --> 0:34:21.994
more efficient?
0:34:24.284 --> 0:34:31.131
The one idea is what is called the average
attention, and the interesting thing is this
0:34:31.131 --> 0:34:32.610
works surprisingly well.
0:34:33.013 --> 0:34:38.917
So the only idea what you're doing is doing
the decoder.
0:34:38.917 --> 0:34:42.646
You're not doing attention anymore.
0:34:42.646 --> 0:34:46.790
The attention weights are all the same.
0:34:47.027 --> 0:35:00.723
So you don't calculate with query and key
the different weights, and then you just take
0:35:00.723 --> 0:35:03.058
equal weights.
0:35:03.283 --> 0:35:07.585
So here would be one third from this, one
third from this, and one third.
0:35:09.009 --> 0:35:14.719
And while it is sufficient you can now do
precalculation and things get more efficient.
0:35:15.195 --> 0:35:18.803
So first go the formula that's maybe not directed
here.
0:35:18.979 --> 0:35:38.712
So the difference here is that your new hidden
state is the average of all the previous hidden states.
0:35:38.678 --> 0:35:40.844
So here would be with this.
0:35:40.844 --> 0:35:45.022
It would be one third of this plus one third
of this.
0:35:46.566 --> 0:35:57.162
But if you calculate it this way, it's not
yet being more efficient because you still
0:35:57.162 --> 0:36:01.844
have to sum over here all the hidden.
0:36:04.524 --> 0:36:22.932
But you can now easily speed up these things
by having an intermediate value, which is just
0:36:22.932 --> 0:36:24.568
the running sum.
0:36:25.585 --> 0:36:30.057
If you take this as ten to one, you take this
one class this one.
0:36:30.350 --> 0:36:36.739
Because this one then was before this, and
this one was this, so in the end.
0:36:37.377 --> 0:36:49.545
So now this one is not the final one in order
to get the final one to do the average.
0:36:49.545 --> 0:36:50.111
So.
0:36:50.430 --> 0:37:00.264
But if you do this calculation with the speed-up,
you can do it with a fixed number of operations per step,
0:37:00.180 --> 0:37:11.300
instead of the sum whose length depends on the position, so
you only have to do a couple of operations to calculate
0:37:11.300 --> 0:37:12.535
this one.
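A small sketch of this trick (illustrative shapes only, not the full average-attention layer with its gating and feed-forward parts):

```python
import torch

def running_average(states):
    """states: (seq_len, d_model) decoder inputs; returns the cumulative averages."""
    outputs = []
    running_sum = torch.zeros(states.size(1))
    for t, h in enumerate(states, start=1):
        running_sum = running_sum + h        # the intermediate ("tilde") value: previous sum plus current state
        outputs.append(running_sum / t)      # divide by t to get the equal-weight average
    return torch.stack(outputs)
```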
0:37:12.732 --> 0:37:21.718
Can you do the lakes and the lakes?
0:37:21.718 --> 0:37:32.701
For example, light bulb here now takes and.
0:37:32.993 --> 0:37:38.762
That's a very good point and that's why this
is now in the image.
0:37:38.762 --> 0:37:44.531
It's not very good so this is the one with
tilder and the tilder.
0:37:44.884 --> 0:37:57.895
So this one is just the sum of these two,
because this is just this one.
0:37:58.238 --> 0:38:08.956
So the sum of this is exactly as the sum of
these, and the sum of these is the sum of here.
0:38:08.956 --> 0:38:15.131
So you only do the sum in here, and the multiplying.
0:38:15.255 --> 0:38:22.145
So what you can mainly do here is you can
do it more mathematically.
0:38:22.145 --> 0:38:31.531
You can see this by taking the one over t out of the
sum, and then you can calculate the sum incrementally.
0:38:36.256 --> 0:38:42.443
That maybe looks a bit weird and simple, so
we were all talking about this great attention
0:38:42.443 --> 0:38:47.882
that we can focus on different parts, and a
bit surprising on this work is now.
0:38:47.882 --> 0:38:53.321
In the end it might also work well without
really putting and just doing equal.
0:38:53.954 --> 0:38:56.164
Mean it's not that easy.
0:38:56.376 --> 0:38:58.261
It's like sometimes this is working.
0:38:58.261 --> 0:39:00.451
There's also report weight work that well.
0:39:01.481 --> 0:39:05.848
But I think it's an interesting way and it
maybe shows that a lot of.
0:39:05.805 --> 0:39:10.624
Things in the self or in the transformer paper
which are more put as like yet.
0:39:10.624 --> 0:39:15.930
These are some hyperparameters around it,
like that you do the layer norm in between,
0:39:15.930 --> 0:39:21.785
and that you do a feed-forward layer before, and
things like that, that these are also all important,
0:39:21.785 --> 0:39:25.566
and that the right set up around that is also
very important.
0:39:28.969 --> 0:39:38.598
The other thing you can do in the end is not
completely different from this one.
0:39:38.598 --> 0:39:42.521
It's just like a very different.
0:39:42.942 --> 0:39:54.338
And that is a recurrent network which also
has this type of highway connection that can
0:39:54.338 --> 0:40:01.330
ignore the recurrent unit and directly put
the input.
0:40:01.561 --> 0:40:10.770
It's not really adding out, but if you see
the hitting step is your input, but what you
0:40:10.770 --> 0:40:15.480
can do is somehow directly go to the output.
0:40:17.077 --> 0:40:28.390
These are the four components of the simple
recurrent unit, and the unit is motivated by GRUs
0:40:28.390 --> 0:40:33.418
and by LSTMs, which we have seen before.
0:40:33.513 --> 0:40:43.633
And gating has proven to be very good for RNNs;
it allows you to have a gate on your states.
0:40:44.164 --> 0:40:48.186
In this thing we have two gates, the reset
gate and the forget gate.
0:40:48.768 --> 0:40:57.334
So first we have the general structure which
has a cell state.
0:40:57.334 --> 0:41:01.277
Here we have the cell state.
0:41:01.361 --> 0:41:09.661
And then this goes next, and we always get
the different cell states over the times that.
0:41:10.030 --> 0:41:11.448
This is the cell state.
0:41:11.771 --> 0:41:16.518
How do we now calculate that just assume we
have an initial cell safe here?
0:41:17.017 --> 0:41:19.670
The first thing is we're computing the forget
gate.
0:41:20.060 --> 0:41:34.774
The forget gate models whether the new cell
state should mainly depend on the previous cell state
0:41:34.774 --> 0:41:40.065
or on our current input.
0:41:40.000 --> 0:41:41.356
Like Add to Them.
0:41:41.621 --> 0:41:42.877
How can we model that?
0:41:44.024 --> 0:41:45.599
First we were at a cocktail.
0:41:45.945 --> 0:41:52.151
The forget gate depends on the previous cell state and the current input.
0:41:52.151 --> 0:41:56.480
You also see here the former.
0:41:57.057 --> 0:42:01.963
So we are multiplying both the cell state
and our input.
0:42:01.963 --> 0:42:04.890
With some weights we are getting.
0:42:05.105 --> 0:42:08.472
We are adding some bias vector and then
we are applying a sigmoid on that.
0:42:08.868 --> 0:42:13.452
So in the end we have numbers between zero
and one saying for each dimension.
0:42:13.853 --> 0:42:22.041
Like how much if it's near to zero we will
mainly use the new input.
0:42:22.041 --> 0:42:31.890
If it's near to one we will keep the input
and ignore the input at this dimension.
0:42:33.313 --> 0:42:40.173
And with this motivation we can then create
here the new cell state, and here you see
0:42:40.173 --> 0:42:41.141
the formula.
0:42:41.601 --> 0:42:55.048
So you take your forget gate and multiply
it with your previous cell state.
0:42:55.048 --> 0:43:00.427
So if my was around then.
0:43:00.800 --> 0:43:07.405
In the other case, when the value was others,
that's what you added.
0:43:07.405 --> 0:43:10.946
Then you're adding a transformation.
0:43:11.351 --> 0:43:24.284
So if this value was near zero, then you're
taking most of the information from the input.
0:43:25.065 --> 0:43:26.947
Is already your element?
0:43:26.947 --> 0:43:30.561
The only question is now based on your element.
0:43:30.561 --> 0:43:32.067
What is the output?
0:43:33.253 --> 0:43:47.951
And there you have another opportunity so
you can either take the output or instead you
0:43:47.951 --> 0:43:50.957
prefer the input.
0:43:52.612 --> 0:43:58.166
So is the value also the same for the recept
game and the forget game.
0:43:58.166 --> 0:43:59.417
Yes, the movie.
0:44:00.900 --> 0:44:10.004
Yes exactly so the matrices are different
and therefore it can be and that should be
0:44:10.004 --> 0:44:16.323
and maybe there is sometimes you want to have
information.
0:44:16.636 --> 0:44:23.843
So here again we have this vector with values
between zero and which says controlling how
0:44:23.843 --> 0:44:25.205
the information.
0:44:25.505 --> 0:44:36.459
And then the output is calculated here similar
to a cell stage, but again input is from.
0:44:36.536 --> 0:44:45.714
So either the reset gate decides should give
what is currently stored in there, or.
0:44:46.346 --> 0:44:58.647
So it's not exactly as the thing we had before,
with the residual connections where we added
0:44:58.647 --> 0:45:01.293
up, but here we do.
0:45:04.224 --> 0:45:08.472
This is the general idea of a simple recurrent
neural network.
0:45:08.472 --> 0:45:13.125
Then we will now look at how we can make things
even more efficient.
0:45:13.125 --> 0:45:17.104
But first do you have more questions on how
it is working?
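Here is a minimal sketch of such a gated cell (a simplified SRU/GRU-style unit with assumed weight names, not the exact slide formulas):

```python
import torch

def gated_cell_step(x_t, c_prev, Wf, Uf, bf, Wr, Ur, br, W):
    f = torch.sigmoid(Wf @ x_t + Uf @ c_prev + bf)   # forget gate: values in (0, 1) per dimension
    r = torch.sigmoid(Wr @ x_t + Ur @ c_prev + br)   # reset/output gate
    c = f * c_prev + (1 - f) * (W @ x_t)             # new cell state: mix old state and transformed input
    h = r * c + (1 - r) * x_t                        # highway-style output: cell state or raw input
    return h, c
```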
0:45:23.063 --> 0:45:38.799
Now these calculations are where things can
get more efficient, because written like this,
0:45:38.718 --> 0:45:43.177
It depends on all the other damage for the
second one also.
0:45:43.423 --> 0:45:48.904
Because if you do a matrix multiplication
with a vector like for the output vector, each
0:45:48.904 --> 0:45:52.353
diameter of the output vector depends on all
the other.
0:45:52.973 --> 0:46:06.561
The cell state here depends because this one
is used here, and somehow the first dimension
0:46:06.561 --> 0:46:11.340
of the cell state only depends.
0:46:11.931 --> 0:46:17.973
In order to make that, of course, is sometimes
again making things less paralyzeable if things
0:46:17.973 --> 0:46:18.481
depend.
0:46:19.359 --> 0:46:35.122
You can easily change that by replacing
the matrix product with an element-wise product.
0:46:35.295 --> 0:46:51.459
So you do first, just like inside here, you
take like the first dimension, my second dimension.
0:46:52.032 --> 0:46:53.772
Is, of course, narrow.
0:46:53.772 --> 0:46:59.294
This should be reset or this should be because
it should be a different.
0:46:59.899 --> 0:47:12.053
Now the first dimension only depends on the
first dimension, so you don't have dependencies
0:47:12.053 --> 0:47:16.148
any longer between dimensions.
0:47:18.078 --> 0:47:25.692
Maybe it gets a bit clearer if you see about
it in this way, so what we have to do now.
0:47:25.966 --> 0:47:31.911
First, we have to do a matrix multiplication
on the input to get the transformed input and the gate values.
0:47:32.292 --> 0:47:38.041
And then we only have the element wise operations
where we take this output.
0:47:38.041 --> 0:47:38.713
We take.
0:47:39.179 --> 0:47:42.978
Minus one and our original.
0:47:42.978 --> 0:47:52.748
Here we only have element-wise operations which
can be optimally parallelized.
0:47:53.273 --> 0:48:07.603
So here we can additionally parallelize things
across the dimensions and don't have to do that sequentially.
0:48:09.929 --> 0:48:24.255
Yeah, but this you can do like in parallel
again for all xts.
0:48:24.544 --> 0:48:33.014
Here you can't do it in parallel, but you
only have to do it on each seat, and then you
0:48:33.014 --> 0:48:34.650
can parallelize.
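A sketch of this reformulation (shapes are assumptions): all matrix multiplications touch only the inputs and can be done for every time step at once, while the recurrence over time is purely element-wise:

```python
import torch

def fast_recurrence(X, W, Wf, bf, Wr, br):
    """X: (seq_len, d_in); weight matrices map d_in -> d_hidden."""
    Xt = X @ W.T                              # transformed inputs, parallel over all time steps
    F = torch.sigmoid(X @ Wf.T + bf)          # forget gates, parallel over time
    R = torch.sigmoid(X @ Wr.T + br)          # reset gates, parallel over time
    c = torch.zeros(Xt.size(1))
    outputs = []
    for t in range(X.size(0)):                # sequential over time, but only element-wise operations
        c = F[t] * c + (1 - F[t]) * Xt[t]
        outputs.append(R[t] * c + (1 - R[t]) * Xt[t])   # highway uses the transformed input here (an assumption)
    return torch.stack(outputs)
```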
0:48:35.495 --> 0:48:39.190
But this maybe for the dimension.
0:48:39.190 --> 0:48:42.124
Maybe it's also important.
0:48:42.124 --> 0:48:46.037
I don't know if they have tried it.
0:48:46.037 --> 0:48:55.383
I assume it's not only for dimension reduction,
but it's hard because you can easily.
0:49:01.001 --> 0:49:08.164
People have even like made the second thing
even more easy.
0:49:08.164 --> 0:49:10.313
So there is this.
0:49:10.313 --> 0:49:17.954
This is how we have the highway connections
in the transformer.
0:49:17.954 --> 0:49:20.699
Then it's like you do.
0:49:20.780 --> 0:49:24.789
So that is like how things are put together
as a transformer.
0:49:25.125 --> 0:49:39.960
And that is a similar and simple recurring
neural network where you do exactly the same
0:49:39.960 --> 0:49:44.512
for the so you don't have.
0:49:46.326 --> 0:49:47.503
This type of things.
0:49:49.149 --> 0:50:01.196
And with this we are at the end of how to
make efficient architectures before we go to
0:50:01.196 --> 0:50:02.580
the next.
0:50:13.013 --> 0:50:24.424
Besides the encoder-decoder trade-off and the architectures,
there is a next technique which is used in
0:50:24.424 --> 0:50:28.988
nearly all deep learning very successfully: knowledge distillation.
0:50:29.449 --> 0:50:43.463
So the idea is can we extract the knowledge
from a large network into a smaller one, but
0:50:43.463 --> 0:50:45.983
it's similarly.
0:50:47.907 --> 0:50:53.217
And the nice thing is that this really works,
and it may be very, very surprising.
0:50:53.673 --> 0:51:03.000
So the idea is that we have a large strong
model which we train for long, and the question
0:51:03.000 --> 0:51:07.871
is: Can that help us to train a smaller model?
0:51:08.148 --> 0:51:16.296
So can what we refer to as teacher model tell
us better to build a small student model than
0:51:16.296 --> 0:51:17.005
before.
0:51:17.257 --> 0:51:27.371
So what we're before in it as a student model,
we learn from the data and that is how we train
0:51:27.371 --> 0:51:28.755
our systems.
0:51:29.249 --> 0:51:37.949
The question is: Can we train this small model
better if we are not only learning from the
0:51:37.949 --> 0:51:46.649
data, but we are also learning from a large
model which has been trained maybe in the same
0:51:46.649 --> 0:51:47.222
data?
0:51:47.667 --> 0:51:55.564
So that you have then in the end a smaller
model that is somehow better performing than.
0:51:55.895 --> 0:51:59.828
And maybe that's on the first view.
0:51:59.739 --> 0:52:05.396
Very, very surprising, because the student has seen the
same data, so it should have learned the same:
0:52:05.396 --> 0:52:11.053
the baseline model trained only on the data
and the student trained from the teacher both have to
0:52:11.053 --> 0:52:11.682
model it.
0:52:11.682 --> 0:52:17.401
They all have seen only this data because
your teacher modeling was also trained typically
0:52:17.401 --> 0:52:19.161
only on this model however.
0:52:20.580 --> 0:52:30.071
It has by now shown that by many ways the
model trained in the teacher and analysis framework
0:52:30.071 --> 0:52:32.293
is performing better.
0:52:33.473 --> 0:52:40.971
A bit of an explanation when we see how that
works.
0:52:40.971 --> 0:52:46.161
There's different ways of doing it.
0:52:46.161 --> 0:52:47.171
Maybe.
0:52:47.567 --> 0:52:51.501
So how does it work?
0:52:51.501 --> 0:53:04.802
This is our student network, the normal one,
some type of new network.
0:53:04.802 --> 0:53:06.113
We're.
0:53:06.586 --> 0:53:17.050
So we are training the model to predict the
same thing as we are doing that by calculating.
0:53:17.437 --> 0:53:23.173
The cross-entropy loss was defined in a way
that says the probability for the
0:53:23.173 --> 0:53:25.332
correct word should be as high as possible.
0:53:25.745 --> 0:53:32.207
So you are calculating your alphabet probabilities
always, and each time step you have an alphabet
0:53:32.207 --> 0:53:33.055
probability.
0:53:33.055 --> 0:53:38.669
What is the most probable in the next word
and your training signal is put as much of
0:53:38.669 --> 0:53:43.368
your probability mass to the correct word to
the word that is there in.
0:53:43.903 --> 0:53:51.367
And this is achieved by the cross-entropy
loss, which sums over all the training
0:53:51.367 --> 0:53:58.664
examples or positions, and over the
full vocabulary, and then this indicator is
0:53:58.664 --> 0:54:03.947
one if the current word is the k-th word
in the vocabulary.
0:54:04.204 --> 0:54:11.339
And then we take the log probability
of that. So what we mainly do is: we have
0:54:11.339 --> 0:54:27.313
this matrix here, of sentence positions times
vocabulary size.
0:54:27.507 --> 0:54:38.656
In the end what you do is sum these
log probabilities, and then you want
0:54:38.656 --> 0:54:40.785
this to be as high as possible.
0:54:41.041 --> 0:54:54.614
So although this is a sum over this whole matrix,
in the end at each position only one entry counts.
0:54:54.794 --> 0:55:06.366
So that is the normal cross-entropy loss that
we have discussed at the very beginning of
0:55:06.366 --> 0:55:07.016
how we train.
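Written out (with $y_t$ the reference word at position $t$ and $V$ the vocabulary), this is the loss
$$\mathcal{L}_{CE} = -\sum_{t=1}^{T}\sum_{k \in V} \mathbb{1}[y_t = k]\,\log p(k \mid y_{<t}, x),$$
so at each position only the log probability of the correct word contributes.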
0:55:08.068 --> 0:55:15.132
So what can we do differently in the teacher
network?
0:55:15.132 --> 0:55:23.374
We also have a teacher network which is trained
on large data.
0:55:24.224 --> 0:55:35.957
And of course this distribution might be better
than the one from the small model because it's.
0:55:36.456 --> 0:55:40.941
So in this case we have now the training signal
from the teacher network.
0:55:41.441 --> 0:55:46.262
And it's the same way as we had before.
0:55:46.262 --> 0:55:56.507
The only difference is we're training not
towards the ground-truth probability distribution
0:55:56.507 --> 0:55:59.159
here, which is sharp, but towards the teacher distribution.
0:55:59.299 --> 0:56:11.303
That's also a probability distribution, so this word has
a high probability, but other words also have some probability.
0:56:12.612 --> 0:56:19.577
And that is the main difference.
0:56:19.577 --> 0:56:30.341
Typically you do an interpolation of
these two losses.
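A minimal sketch of this word-level distillation loss, assuming teacher and student produce logits of the same shape; the interpolation weight `alpha` is an assumed, tunable value:

```python
import torch
import torch.nn.functional as F

def word_level_distillation_loss(student_logits, teacher_logits, targets, alpha=0.5):
    """Interpolate the hard cross-entropy against the reference words with a
    soft loss against the teacher's per-position distribution."""
    # Hard loss: usual cross-entropy against the ground-truth words.
    hard = F.cross_entropy(student_logits.transpose(1, 2), targets)

    # Soft loss: KL divergence between teacher and student distributions.
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    soft = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")

    return alpha * hard + (1.0 - alpha) * soft

# Example call with random tensors (batch of 2, length 5, vocabulary 1000):
s = torch.randn(2, 5, 1000)
t = torch.randn(2, 5, 1000)
y = torch.randint(0, 1000, (2, 5))
loss = word_level_distillation_loss(s, t, y)
```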
0:56:33.213 --> 0:56:38.669
Because there's more information contained
in the distribution than in the ground truth,
0:56:38.669 --> 0:56:44.187
because it encodes more information about the
language, because language often has more than one
0:56:44.187 --> 0:56:47.907
option to express the same sentence,
yes exactly.
0:56:47.907 --> 0:56:53.114
So there's ambiguity in there that is hopefully
encoded very well in the teacher's distribution.
0:56:53.513 --> 0:56:57.257
The trained teacher network is bigger than the student
network, so it captures this better than what the student learns from the hard labels alone.
0:56:57.537 --> 0:57:05.961
So maybe often there's only one correct word,
but it might be two or three, and then all
0:57:05.961 --> 0:57:10.505
of these three have a probability distribution.
0:57:10.590 --> 0:57:21.242
And that is the main advantage, or one explanation,
of why it's better to train from the teacher distribution.
0:57:21.361 --> 0:57:32.652
Of course, it's good to also keep the ground-truth signal
in there, because then you can prevent the student
0:57:32.652 --> 0:57:33.493
from learning something crazy.
0:57:37.017 --> 0:57:49.466
Any more questions on the first type of knowledge
distillation, or on how it relates to distribution changes?
0:57:50.550 --> 0:58:02.202
Coming back to the question, I would put it a bit
differently: this is not by itself a solution to domain
0:58:02.202 --> 0:58:04.244
or distribution shift.
0:58:04.744 --> 0:58:12.680
But I don't think it's performing worse than
only training on the ground truth either.
0:58:13.113 --> 0:58:21.254
So it's more like it's not improving things there; you would
assume it's helping you similarly, but not more.
0:58:21.481 --> 0:58:28.145
Of course, if you now have a teacher, maybe
you have no data in your target domain,
0:58:28.145 --> 0:58:28.524
but:
0:58:28.888 --> 0:58:39.895
then you can use this teacher output, which is not the
ground truth but still helpful for learning
0:58:39.895 --> 0:58:42.147
the distribution better.
0:58:46.326 --> 0:58:57.012
The second idea is to do sequence-level knowledge
distillation. What we have done so far in this case
0:58:57.012 --> 0:59:02.757
is we have looked at each position independently.
0:59:03.423 --> 0:59:05.436
I mean, we do that often.
0:59:05.436 --> 0:59:10.972
We are not generating whole sequences there,
but that has a problem.
0:59:10.972 --> 0:59:13.992
We have this propagation of errors.
0:59:13.992 --> 0:59:16.760
We start with one error and then it propagates.
0:59:17.237 --> 0:59:27.419
So if we are doing word-level knowledge distillation,
we are treating each word in the sentence independently.
0:59:28.008 --> 0:59:32.091
So we are not trying to model
the dependency between the words.
0:59:32.932 --> 0:59:47.480
We can try to do that by sequence-level knowledge
distillation, but there is, of course, a problem.
0:59:47.847 --> 0:59:53.478
For each position we can get
a distribution over all the words at this position.
0:59:53.793 --> 1:00:05.305
But if we want to have a distribution over all
possible target sentences, that's not possible,
1:00:05.305 --> 1:00:06.431
because there are too many.
1:00:08.508 --> 1:00:15.940
So we can then again do a bit of a hack
on that.
1:00:15.940 --> 1:00:23.238
If we can't have a distribution over all sentences,
we approximate it.
1:00:23.843 --> 1:00:30.764
So what we can do is use the
teacher network and sample different translations.
1:00:31.931 --> 1:00:39.327
And now there are different ways to train
on them.
1:00:39.327 --> 1:00:49.343
We can weight them by their probability; the
easiest one is to use them directly.
1:00:50.050 --> 1:00:56.373
So what that comes down to is that we're taking
our teacher network, we're generating some
1:00:56.373 --> 1:01:01.135
translations, and these ones we're using as
additional training data.
1:01:01.781 --> 1:01:11.382
Then we have mainly done this at the sequence level,
because the teacher network gives us full sentences.
1:01:11.382 --> 1:01:17.513
These are all probable translations of the
sentence.
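A sketch of sequence-level distillation as described: decode the training sources with the teacher and use its outputs as training targets. `teacher.translate` is a hypothetical method standing in for beam search or sampling:

```python
def build_distilled_corpus(teacher, source_sentences, num_hypotheses=1):
    """Sequence-level knowledge distillation (sketch): the teacher's
    translations of the training sources become the student's targets."""
    distilled = []
    for src in source_sentences:
        # Hypothetical call: beam search or sampling from the teacher model.
        for hyp in teacher.translate(src, num_hypotheses=num_hypotheses):
            distilled.append((src, hyp))
    return distilled

# The student is then trained on `distilled` exactly like on normal parallel
# data, optionally mixed with the original references.
```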
1:01:26.286 --> 1:01:34.673
And then you can also
try to make a bit of an interpolated
1:01:34.673 --> 1:01:36.206
version of that.
1:01:36.716 --> 1:01:42.802
So what people have also done is sequence-
level interpolation.
1:01:42.802 --> 1:01:52.819
You generate here several translations, but
then you don't use all of them.
1:01:52.819 --> 1:02:00.658
You use some metric to select which of these you keep.
1:02:01.021 --> 1:02:12.056
So instead of training only on the ground
truth, which might be improbable or unreachable,
1:02:12.056 --> 1:02:16.520
because the model cannot always reach it,
1:02:16.676 --> 1:02:23.378
we are giving it an easier target which
is also of good quality, and training on that.
1:02:23.703 --> 1:02:32.602
So you're not training it on a very difficult
solution, but you're training it on an easier
1:02:32.602 --> 1:02:33.570
solution.
1:02:36.356 --> 1:02:38.494
Any more questions on this?
1:02:40.260 --> 1:02:41.557
Yeah.
1:02:41.461 --> 1:02:44.296
Good.
1:02:43.843 --> 1:03:01.642
The next idea is to look at the vocabulary. The problem
is, we have seen that the output vocabulary computations
1:03:01.642 --> 1:03:06.784
are often very time-consuming.
1:03:09.789 --> 1:03:19.805
The thing is that most of the vocabulary is
not needed for each sentence; only a small subset of words can occur in each.
1:03:20.280 --> 1:03:28.219
The question is: Can we somehow easily precalculate
which words are likely to occur in the translation of this sentence,
1:03:28.219 --> 1:03:30.967
and then only calculate these ones?
1:03:31.691 --> 1:03:34.912
And this can be done.
1:03:34.912 --> 1:03:43.932
For example, given a particular source sentence, most
of the target vocabulary is probably not going to appear.
1:03:44.164 --> 1:03:48.701
So what you can try to do is to limit your
vocabulary.
1:03:48.701 --> 1:03:51.093
You're considering only a subset for each sentence.
1:03:51.151 --> 1:04:04.693
So you're no longer taking the full vocabulary
as possible output, but you're restricting it to a subset.
1:04:06.426 --> 1:04:18.275
What typically works is that we always take
the most frequent words, because
1:04:18.275 --> 1:04:23.613
these are not so easy to align to source words.
1:04:23.964 --> 1:04:32.241
So we take the most frequent target words, and
then the words that often align with one of the
1:04:32.241 --> 1:04:32.985
source words.
1:04:33.473 --> 1:04:46.770
So for each source word you calculate the
word alignment on your training data, and then
1:04:46.770 --> 1:04:51.700
you calculate which target words occur as its translations.
1:04:52.352 --> 1:04:57.680
And then for decoding you build the union
of these per-source-word lists and the frequent-word list.
1:04:59.960 --> 1:05:02.145
So, for each source word:
1:05:02.145 --> 1:05:08.773
you take, for example, the most frequent translations
of that source word,
1:05:08.773 --> 1:05:13.003
and on top of that the overall
most frequent target words.
1:05:13.193 --> 1:05:24.333
In total, especially for short sentences, you
have a lot fewer words, so in most cases it's
1:05:24.333 --> 1:05:26.232
only a small fraction of the full vocabulary.
1:05:26.546 --> 1:05:33.957
And so you have dramatically reduced your
vocabulary, and thereby can also speed up decoding.
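A sketch of this vocabulary selection, assuming an alignment-based lexicon (`translations_of`) and a global frequent-word list have been precomputed; the names and the per-word cut-off are illustrative assumptions:

```python
import torch

def candidate_vocabulary(source_ids, translations_of, frequent_ids, per_word=50):
    """Union of the globally most frequent target words and, for every source
    word, its most likely translations from the word-alignment statistics."""
    candidates = set(frequent_ids)
    for s in source_ids:
        candidates.update(translations_of.get(s, [])[:per_word])
    return sorted(candidates)

def restricted_output_logits(hidden, out_weight, out_bias, candidates):
    """Output projection over the candidate subset only: the relevant rows of
    the softmax matrix are selected before the matrix multiplication."""
    idx = torch.tensor(candidates)
    return hidden @ out_weight[idx].t() + out_bias[idx]   # (..., |candidates|)
```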
1:05:35.495 --> 1:05:43.757
That sounds easy. Does anybody see what is challenging
here and why that might not always help?
1:05:47.687 --> 1:05:54.448
It's not about the translation performance; the question is why this might not help.
1:05:54.448 --> 1:06:01.838
If you implement it, it might not give a strong speed-up.
1:06:01.941 --> 1:06:06.053
You have to store this list.
1:06:06.053 --> 1:06:14.135
You have to build the union, and of course that
costs part of the time you save.
1:06:14.554 --> 1:06:21.920
The second thing: the vocabulary is used in
our last step, so we have the hidden state,
1:06:21.920 --> 1:06:23.868
and then we calculate the output probabilities.
1:06:24.284 --> 1:06:29.610
Now we are no longer calculating them for
all output words, but for a subset of them.
1:06:30.430 --> 1:06:35.613
However, this matrix multiplication is typically
parallelized on the GPU almost perfectly.
1:06:35.956 --> 1:06:46.937
But if you only calculate some of them
and you're not implementing it right, it will take
1:06:46.937 --> 1:06:52.794
as long as before because of the nature of
the parallel hardware.
1:06:56.776 --> 1:07:07.997
Here for beam search there's some ideas of
course you can go back to greedy search because
1:07:07.997 --> 1:07:10.833
that's more efficient.
1:07:11.651 --> 1:07:18.347
Beam search gives better quality, and you can buffer some
states in between; how much buffering you do is
1:07:18.347 --> 1:07:22.216
again this tradeoff between calculation and
memory.
1:07:25.125 --> 1:07:41.236
Then at the end of today what we want to look
into is one last type of neural machine translation
1:07:41.236 --> 1:07:42.932
approach.
1:07:43.403 --> 1:07:53.621
And the idea is, as we've already seen in
our first two steps, that this auto-regressive
1:07:53.621 --> 1:07:57.246
part is dominating the decoding.
1:07:57.557 --> 1:08:04.461
The encoder can process everything in parallel, but in decoding we
are always taking the most probable word and then feeding it back.
1:08:05.905 --> 1:08:10.476
The question is: Do we really need to do that?
1:08:10.476 --> 1:08:14.074
Therefore, there is a bunch of work.
1:08:14.074 --> 1:08:16.602
Can we do it differently?
1:08:16.602 --> 1:08:19.616
Can we generate the full target sequence at once?
1:08:20.160 --> 1:08:29.417
We'll see it's not that easy and there's still
an open debate whether this is really faster
1:08:29.417 --> 1:08:31.832
at the same quality, but I think it is worth looking at.
1:08:32.712 --> 1:08:45.594
So, as said, what we have is our encoder-
decoder, where we can process our encoder in parallel,
1:08:45.594 --> 1:08:50.527
and then the output always depends on the previous outputs.
1:08:50.410 --> 1:08:54.709
We generate the output and then we have to
put it in here as the y, because everything
1:08:54.709 --> 1:08:56.565
depends on the previously generated output.
1:08:56.916 --> 1:09:10.464
This is what is referred to as an auto-regressive
model, and nearly all speech generation and
1:09:10.464 --> 1:09:16.739
language generation works in this auto-regressive way.
1:09:18.318 --> 1:09:21.132
So the motivation is, can we do that more
efficiently?
1:09:21.361 --> 1:09:31.694
And can we somehow process all target words
in parallel?
1:09:31.694 --> 1:09:41.302
So instead of doing it one by one, we are
inputting everything and predicting all positions at once.
1:09:45.105 --> 1:09:46.726
So how does it work?
1:09:46.726 --> 1:09:50.587
So let's first look at a basic non-autoregressive
model.
1:09:50.810 --> 1:09:53.551
So the encoder looks as it is before.
1:09:53.551 --> 1:09:58.310
That's maybe not surprising because here we
know we can parallelize.
1:09:58.618 --> 1:10:04.592
So we have put in here our encoder and
generated the encoder states, so that's exactly
1:10:04.592 --> 1:10:05.295
the same.
1:10:05.845 --> 1:10:16.229
However, now we need to do one more thing:
One challenge is what we had before, and that's
1:10:16.229 --> 1:10:26.799
a general challenge of natural language generation
like machine translation: how long should the output be?
1:10:32.672 --> 1:10:38.447
Normally we generate until we produce the
end-of-sentence token, but if we now generate
1:10:38.447 --> 1:10:44.625
everything at once that's no longer possible,
so we cannot just generate until we stop, because we only
1:10:44.625 --> 1:10:45.632
generate once.
1:10:46.206 --> 1:10:58.321
So the question is how can we now determine
how long the target sequence is, before we generate it.
1:11:00.000 --> 1:11:06.384
Yes, but there would be one idea, and there
is other work which tries to do that.
1:11:06.806 --> 1:11:15.702
However, in here there's some work already
done before and maybe you remember we had the
1:11:15.702 --> 1:11:20.900
IBM models and there was this concept of fertility.
1:11:21.241 --> 1:11:26.299
The concept of fertility means: for
one source word, into how many target words does
1:11:26.299 --> 1:11:27.104
it translate?
1:11:27.847 --> 1:11:34.805
And exactly that we try to do here; that
means at the top of the encoder we
1:11:34.805 --> 1:11:36.134
are calculating a fertility for each source word.
1:11:36.396 --> 1:11:42.045
So it says this word is translated into one word.
1:11:42.045 --> 1:11:54.171
That word might be translated into two words, and so on;
we're trying to predict into how many words each source word translates.
1:11:55.935 --> 1:12:10.314
And this happens at the end of the encoder, so this is
like a length estimation.
1:12:10.314 --> 1:12:15.523
You can also do it differently, by predicting the target length directly.
1:12:16.236 --> 1:12:24.526
You initialize your decoder input, and we know
word embeddings work well, so we're trying
1:12:24.526 --> 1:12:28.627
to do the same thing, and what people then do is:
1:12:28.627 --> 1:12:35.224
they initialize it again with the source word embeddings,
but each repeated according to its fertility.
1:12:35.315 --> 1:12:36.460
So we have the copied source embeddings.
1:12:36.896 --> 1:12:47.816
So one word has fertility two, so it appears twice, and the others
appear once; that is then our initialization.
1:12:48.208 --> 1:12:57.151
In other words, if you don't predict fertilities
but predict the length directly, you can just initialize
1:12:57.151 --> 1:12:57.912
the decoder input by copying the source embeddings uniformly.
1:12:58.438 --> 1:13:07.788
This often works a bit better, but both
are options.
1:13:07.788 --> 1:13:16.432
Now you have everything in training and testing.
1:13:16.656 --> 1:13:18.621
This is all available at once.
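A sketch of this fertility-based initialization of the decoder input; names and sizes are assumptions:

```python
import torch

def decoder_input_from_fertilities(source_embeddings, fertilities):
    """source_embeddings: (src_len, dim); fertilities: (src_len,) integers.
    Every source embedding is copied as many times as its predicted fertility,
    which at the same time fixes the target length (the sum of fertilities)."""
    return torch.repeat_interleave(source_embeddings, fertilities, dim=0)

# Example: three source words with fertilities 2, 1 and 0.
emb = torch.randn(3, 8)
fert = torch.tensor([2, 1, 0])
dec_in = decoder_input_from_fertilities(emb, fert)   # shape (3, 8): word 1 twice, word 2 once
```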
1:13:20.280 --> 1:13:31.752
Then we can generate everything in parallel,
so we have the decoder stack, and that is now
1:13:31.752 --> 1:13:33.139
as before.
1:13:35.395 --> 1:13:41.555
And then we're doing the translation predictions
here on top of it, in order to get the output words.
1:13:43.083 --> 1:13:59.821
And then we are predicting here the target
words, all at once, and that is the basic
1:13:59.821 --> 1:14:00.924
idea.
1:14:01.241 --> 1:14:08.171
of non-autoregressive machine translation: the idea is we
don't have to generate the words one by one.
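To make the contrast concrete, here is a rough sketch of both decoding modes; `decoder` is a placeholder for a model that maps encoder states plus a decoder input to per-position logits:

```python
import torch

def autoregressive_decode(decoder, encoder_states, bos_id, eos_id, max_len=100):
    """One position per step; every step depends on the previously generated words."""
    ys = [bos_id]
    for _ in range(max_len):
        logits = decoder(encoder_states, torch.tensor(ys))   # (cur_len, vocab)
        next_word = int(logits[-1].argmax())
        ys.append(next_word)
        if next_word == eos_id:
            break
    return ys[1:]

def non_autoregressive_decode(decoder, encoder_states, decoder_input):
    """All positions predicted in a single parallel pass, given a decoder input
    whose length was fixed beforehand (e.g. via fertilities)."""
    logits = decoder(encoder_states, decoder_input)          # (tgt_len, vocab)
    return logits.argmax(dim=-1).tolist()
```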
1:14:10.210 --> 1:14:13.900
So this looks really, really, really great.
1:14:13.900 --> 1:14:20.358
On the first view. But there's one challenge with
this, and you see it in this baseline.
1:14:20.358 --> 1:14:27.571
Of course there are improvements, but in
general the quality is often significantly worse.
1:14:28.068 --> 1:14:32.075
So here you see the baseline models.
1:14:32.075 --> 1:14:38.466
You have a loss of ten BLEU points or something
like that.
1:14:38.878 --> 1:14:40.230
So why does the quality drop?
1:14:40.230 --> 1:14:41.640
So why is it happening?
1:14:43.903 --> 1:14:56.250
If you look at the errors, there are repetitive
tokens, so you have the same word twice, or things like that.
1:14:56.536 --> 1:15:01.995
Broken sentences or disfluent sentences; that is
exactly where auto-regressive models are
1:15:01.995 --> 1:15:04.851
very good, we even said that's a bit of a problem:
1:15:04.851 --> 1:15:07.390
they generate very fluent output.
1:15:07.387 --> 1:15:10.898
Translations that sometimes don't have
anything to do with the input.
1:15:11.411 --> 1:15:14.047
But generally it really looks always very
fluent.
1:15:14.995 --> 1:15:20.865
Here exactly the opposite, so the problem
is that we don't get really fluent translations.
1:15:21.421 --> 1:15:26.123
And that is mainly due to the challenge that
we have this independence assumption.
1:15:26.646 --> 1:15:35.873
So in this case, the probability of y at the
second position is independent of the word at the first
1:15:35.873 --> 1:15:40.632
position, so we don't know what was generated there.
1:15:40.632 --> 1:15:43.740
We're just generating it there.
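Written out, the two factorizations differ exactly in this independence assumption (the notation below is mine, not from the slides):

```latex
% Auto-regressive: every target word is conditioned on all previous words.
P(y_1,\dots,y_T \mid x) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, x)

% Non-autoregressive: a length model plus conditionally independent positions,
% which is exactly what allows repetitions and inconsistent mixtures.
P(y_1,\dots,y_T \mid x) = P(T \mid x) \prod_{t=1}^{T} P(y_t \mid x, T)
```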
1:15:43.964 --> 1:15:55.439
You can see it also in a couple of examples.
1:15:55.439 --> 1:16:03.636
You can over-penalize shifts.
1:16:04.024 --> 1:16:10.566
And the problem is that this remains an issue even in improved models;
it is also related to the following example.
1:16:11.071 --> 1:16:19.900
So you can, for example, translate it one
way, or maybe you could also translate it
1:16:19.900 --> 1:16:31.105
with a different phrasing. But if the first position
assumes one of these variants and the second
1:16:31.105 --> 1:16:34.594
position assumes the other, you get a mixture of both.
1:16:35.075 --> 1:16:42.908
So each position here, and that is one of the
main issues, doesn't know what the other positions generate.
1:16:43.243 --> 1:16:53.846
And for example, if you are translating something
into German, you can often translate things in two
1:16:53.846 --> 1:16:58.471
ways, with different grammatical agreement.
1:16:58.999 --> 1:17:02.058
And then here you have to decide which
form you use.
1:17:02.162 --> 1:17:05.460
The decoder doesn't know which word
it has to select.
1:17:06.086 --> 1:17:14.789
I mean, of course, it knows the hidden state,
but in the end you have a probability distribution.
1:17:16.256 --> 1:17:20.026
And that is the important difference to the auto-
regressive model.
1:17:20.026 --> 1:17:24.335
There you know it, because you have put it in as input;
here, you don't know it.
1:17:24.335 --> 1:17:29.660
If two words are equally probable here, you don't
know which one was selected, and of course that
1:17:29.660 --> 1:17:32.832
determines what the next word
should be.
1:17:33.333 --> 1:17:39.554
Yep, that's the shift problem, and we're going
to look at that next time.
1:17:39.554 --> 1:17:39.986
Yes.
1:17:40.840 --> 1:17:44.935
Doesn't this also appear in the auto-regressive model,
now that we're talking about parallel training?
1:17:46.586 --> 1:17:48.412
The thing is, in the auto-regressive model
1:17:48.412 --> 1:17:50.183
you give it the correct previous word as input.
1:17:50.450 --> 1:17:55.827
So if you predict here, say, a comma while the reference
is "feeling", then you tell the model:
1:17:55.827 --> 1:17:59.573
the last word was "feeling", and then it knows
the next one has to be "down".
1:17:59.573 --> 1:18:04.044
But here it doesn't know that, because it doesn't
get it as input.
1:18:04.204 --> 1:18:24.286
Yes, that depends a bit on the setup.
1:18:24.204 --> 1:18:27.973
But in training, of course, you just try to
make the correct word the most probable one.
1:18:31.751 --> 1:18:38.181
So what you can do is use things like the CTC loss,
which can adjust for this.
1:18:38.181 --> 1:18:42.866
So then you can also allow this shifted output.
1:18:42.866 --> 1:18:50.582
If the output is shifted in this way, with
the CTC loss you don't get the full penalty.
1:18:50.930 --> 1:18:58.486
It is just shifted by one, so it's a bit of a different
loss, which is mainly known from speech recognition, but:
1:19:00.040 --> 1:19:03.412
It can be used in order to address this problem.
1:19:04.504 --> 1:19:13.844
The other problem is that outer regressively
we have the label buyers that tries to disimmigrate.
1:19:13.844 --> 1:19:20.515
That's the example did before was if you translate
thank you to Dung.
1:19:20.460 --> 1:19:31.925
And then it might end up because it learns
in the first position and the second also.
1:19:32.492 --> 1:19:43.201
In order to prevent that, it would be helpful
for one output, only one output, so that makes
1:19:43.201 --> 1:19:47.002
the system already better learn.
1:19:47.227 --> 1:19:53.867
Might be that for slightly different inputs
you have different outputs, but for the same.
1:19:54.714 --> 1:19:57.467
That we can luckily very easily solve.
1:19:59.119 --> 1:19:59.908
And it's done.
1:19:59.908 --> 1:20:04.116
We just learned the technique about it, which
is called knowledge distillation.
1:20:04.985 --> 1:20:13.398
So what we can do and the easiest solution
to prove your non-autoregressive model is to
1:20:13.398 --> 1:20:16.457
train an auto regressive model.
1:20:16.457 --> 1:20:22.958
Then you decode your whole training gamer
with this model and then.
1:20:23.603 --> 1:20:27.078
While the main advantage of that is that this
is more consistent,.
1:20:27.407 --> 1:20:33.995
So for the same input you always have the
same output.
1:20:33.995 --> 1:20:41.901
So you make your training data more
consistent and easier to learn.
1:20:42.482 --> 1:20:54.471
So there is another advantage of knowledge
distillation and that advantage is you have
1:20:54.471 --> 1:20:59.156
more consistent training signals.
1:21:04.884 --> 1:21:10.630
There's another way to make things easier
at the beginning.
1:21:10.630 --> 1:21:16.467
There's this glancing model, where
you use masking.
1:21:16.756 --> 1:21:26.080
So during training, especially at the beginning,
you give the model some of the correct target words as input.
1:21:28.468 --> 1:21:38.407
And there is this idea of K tokens at a time, where the
idea is to start from auto-regressive training.
1:21:40.000 --> 1:21:50.049
Some targets are left open, so you always predict
only a few; at first K
1:21:50.049 --> 1:21:59.174
equals one, so you always have one input
and one output, and then you do it partially in parallel.
1:21:59.699 --> 1:22:05.825
So in that way you can slowly learn what is
a good and what is a bad answer.
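A sketch of this kind of partial teacher input during training, where some correct target words are revealed to the decoder; the reveal ratio and all names are assumptions:

```python
import torch

def reveal_some_targets(decoder_input, target_embeddings, reveal_ratio=0.5):
    """Replace a random subset of decoder-input positions with the embeddings of
    the correct target words; the model only has to predict the rest. The ratio
    is assumed to start high and be reduced as training progresses."""
    tgt_len = decoder_input.size(0)
    revealed = torch.rand(tgt_len) < reveal_ratio            # positions given away
    mixed = torch.where(revealed.unsqueeze(-1), target_embeddings, decoder_input)
    return mixed, ~revealed                                  # predict only the hidden ones
```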
1:22:08.528 --> 1:22:10.862
It doesn't sound very efficient.
1:22:10.862 --> 1:22:12.578
It doesn't cost much extra anyway:
1:22:12.578 --> 1:22:15.323
you go over your training data several times in any case.
1:22:15.875 --> 1:22:20.655
You can even switch in between.
1:22:20.655 --> 1:22:29.318
There is a whole line of work on this, where you
try out where to start.
1:22:31.271 --> 1:22:41.563
The model still has to learn; there's a whole line of work
on that. This only affects training, so it doesn't
1:22:41.563 --> 1:22:46.598
mean decoding gets less efficient, and it still helps.
1:22:49.389 --> 1:22:57.979
For later maybe here are some examples of
how much things help.
1:22:57.979 --> 1:23:04.958
Maybe one point here is really important.
1:23:05.365 --> 1:23:13.787
Here's the translation performance and speed.
1:23:13.787 --> 1:23:24.407
One important point is what baseline you
compare against.
1:23:24.784 --> 1:23:33.880
So yeah, the very weak baseline, a
transformer with beam search, is itself
1:23:33.880 --> 1:23:40.522
something like ten times slower than a very strong
auto-regressive implementation.
1:23:40.961 --> 1:23:48.620
If you compare against a strong baseline, the reported
speed-up goes down, depending on the setup: you
1:23:48.620 --> 1:23:53.454
see a lot of different speed-ups here.
1:23:53.454 --> 1:24:03.261
Generally, one should compare against a strong baseline and
not a very simple transformer.
1:24:07.407 --> 1:24:20.010
Yeah, with this one last thing that you can
do to speed up things and also reduce your
1:24:20.010 --> 1:24:25.950
memory is what is called half precision.
1:24:26.326 --> 1:24:29.139
So you compute with 16-bit instead of 32-bit floats; this helps especially for decoding.
1:24:29.139 --> 1:24:31.148
For training it sometimes gets less stable.
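A minimal sketch of half-precision inference in PyTorch; the tiny linear layer only stands in for a real translation model:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 32000)      # stand-in for the output projection of an NMT model
x = torch.randn(8, 512)

if torch.cuda.is_available():
    # FP16 weights and activations: roughly half the memory, often faster decoding.
    model_fp16, x_fp16 = model.cuda().half(), x.cuda().half()
    with torch.no_grad():
        logits = model_fp16(x_fp16)
else:
    # On CPU we stay in full precision; half precision mainly pays off on GPUs.
    with torch.no_grad():
        logits = model(x)
```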
1:24:32.592 --> 1:24:45.184
With this we are nearly done; wait a bit. What
you should remember is that there are several ways to make machine
1:24:45.184 --> 1:24:46.963
translation efficient.
1:24:47.007 --> 1:24:51.939
We have, for example, looked at knowledge
distillation.
1:24:51.939 --> 1:24:55.991
We have looked at non-autoregressive models.
1:24:55.991 --> 1:24:57.665
And we have looked at several other techniques.
1:24:58.898 --> 1:25:02.383
That's it for today, and then only one request.
1:25:02.383 --> 1:25:08.430
So if you haven't done so, please fill out
the evaluation.
1:25:08.388 --> 1:25:20.127
If you have done so already, thank you; and hopefully
the online people will do it as well.
1:25:20.320 --> 1:25:29.758
It is a possibility to tell us which things are
good and which are not; not the only one, but the most
1:25:29.758 --> 1:25:30.937
efficient.
1:25:31.851 --> 1:25:35.871
So thanks to all the students doing it. In this
case, okay, then thank you.