WEBVTT
0:00:02.822 --> 0:00:07.880
We look into more linguistic approaches.
0:00:07.880 --> 0:00:14.912
We can do machine translation in a more traditional
way.
0:00:14.912 --> 0:00:21.224
It should be: Translation should be generated
this way.
0:00:21.224 --> 0:00:27.933
We can analyze first the source sentence: what
is the meaning or the syntax?
0:00:27.933 --> 0:00:35.185
Then we transfer this information to the target
side and then we generate.
0:00:36.556 --> 0:00:42.341
And this was the dominant and commonly used approach
for several years.
0:00:44.024 --> 0:00:50.839
However, we already saw at the beginning that there
are some challenges with that: language is very
0:00:50.839 --> 0:00:57.232
ambiguous, and it's often very difficult to really
write hand-crafted rules.
0:00:57.232 --> 0:01:05.336
What are the different meanings and we have
to do that also with a living language so new
0:01:05.336 --> 0:01:06.596
things occur.
0:01:07.007 --> 0:01:09.308
And that's why people look into.
0:01:09.308 --> 0:01:13.282
Can we maybe do it differently and use machine
learning?
0:01:13.333 --> 0:01:24.849
So we are no longer giving rules of how to
do it, but we just give examples and the system learns from them.
0:01:25.045 --> 0:01:34.836
And one important thing then is these examples:
how can we learn how to translate one sentence?
0:01:35.635 --> 0:01:42.516
And therefore the data is now
really a very important issue.
0:01:42.582 --> 0:01:50.021
And that is what we want to look into today.
0:01:50.021 --> 0:01:58.783
What type of data do we use for machine translation?
0:01:59.019 --> 0:02:08.674
So the idea in preprocessing is always: Can
we make the task somehow a bit easier so that
0:02:08.674 --> 0:02:13.180
the MT system will in the end be better?
0:02:13.493 --> 0:02:28.309
So one example could be if it has problems
dealing with numbers because each one occurs only rarely.
0:02:28.648 --> 0:02:35.479
Or think about one problem which might still
be there in some systems: think about
0:02:35.479 --> 0:02:36.333
different units.
0:02:36.656 --> 0:02:44.897
So a system might learn that, of course, if
there's a number in German, in English there should be the same number.
0:02:45.365 --> 0:02:52.270
However, if it's in parallel text, it will see
that in German there is often km, and in English
0:02:52.270 --> 0:02:54.107
typically miles.
0:02:54.594 --> 0:03:00.607
It might just translate three hundred and fifty-five
miles into three hundred and fifty-five
0:03:00.607 --> 0:03:04.348
kilometers, which of course is not right, and
so forth.
0:03:04.348 --> 0:03:06.953
So it might make sense to look into this beforehand.
0:03:07.067 --> 0:03:13.072
Therefore, the first step when you build your
machine translation system is normally to look
0:03:13.072 --> 0:03:19.077
at the data, to check it, to see if there is
anything happening which you should address
0:03:19.077 --> 0:03:19.887
beforehand.
0:03:20.360 --> 0:03:29.152
And then the second part is how do you represent
words, since machine learning normally works on numbers.
0:03:29.109 --> 0:03:35.404
So the question is how do we get out from
the words into numbers and I've seen some of
0:03:35.404 --> 0:03:35.766
you?
0:03:35.766 --> 0:03:42.568
For example, there we have introduced
an algorithm which we will also shortly repeat
0:03:42.568 --> 0:03:43.075
today.
0:03:43.303 --> 0:03:53.842
The subword unit approach which was first
introduced in machine translation and now used
0:03:53.842 --> 0:04:05.271
in order to represent words. Now you've learned
about morphology, so you know that maybe in
0:04:05.271 --> 0:04:09.270
English it's not that important.
0:04:09.429 --> 0:04:22.485
In German you have all these different word
forms, and you would have to learn independent representations.
0:04:24.024 --> 0:04:26.031
And then, of course, they are more extreme.
0:04:27.807 --> 0:04:34.387
So how are we doing?
0:04:34.975 --> 0:04:37.099
Machine translation.
0:04:37.099 --> 0:04:46.202
So hopefully you remember we had these approaches
to machine translation, the rule based.
0:04:46.202 --> 0:04:52.473
We had a big block of corpus-based machine
translation.
0:04:52.492 --> 0:05:00.443
We will on Thursday have an overview of statistical
models and then afterwards concentrate on the neural ones.
0:05:00.680 --> 0:05:08.828
Both of them are corpus-based machine translation,
and therefore data is really essential. What
0:05:08.828 --> 0:05:16.640
we typically train a machine translation
system on is what we refer to as parallel data.
0:05:16.957 --> 0:05:22.395
We will talk a lot about parallel corpora or parallel data,
and what I mean there is something which you
0:05:22.395 --> 0:05:28.257
might know from the Rosetta Stone or something
like that: typically you have one sentence
0:05:28.257 --> 0:05:33.273
in the one language, and then you have aligned
to it one sentence in the target language.
0:05:33.833 --> 0:05:38.261
And this is how we train all our alignments.
0:05:38.261 --> 0:05:43.181
We'll see today that of course we might not
have.
0:05:43.723 --> 0:05:51.279
However, this is relatively easy to create,
at least for high-quality data.
0:05:51.279 --> 0:06:00.933
We will look into data crawling, so that means how
we can automatically create this parallel data
0:06:00.933 --> 0:06:02.927
from the Internet.
0:06:04.144 --> 0:06:13.850
It's not so difficult to learn these alignments
if we have some type of dictionary, so which
0:06:13.850 --> 0:06:16.981
sentence is aligned to which.
0:06:18.718 --> 0:06:25.069
What would, of course, be a lot more difficult
is really to do word alignment, and that's also
0:06:25.069 --> 0:06:27.476
often no longer that well possible.
0:06:27.476 --> 0:06:33.360
We can do that automatically to some extent,
but it's definitely more challenging.
0:06:33.733 --> 0:06:40.691
For sentence alignment, of course, it's still
not always perfect, so there might be that
0:06:40.691 --> 0:06:46.085
there are two German sentences and one English
sentence, or the other way around.
0:06:46.085 --> 0:06:53.511
So there's not always perfect alignment, but
if you look at the text, it is still mostly relatively good.
0:06:54.014 --> 0:07:03.862
If we have that then we can build a machine
learning model which tries to map source
0:07:03.862 --> 0:07:06.239
sentences to target sentences.
0:07:06.626 --> 0:07:15.932
So this is the idea behind statistical
machine translation and neural machine translation.
0:07:15.932 --> 0:07:27.098
The difference is: Statistical machine translation
is typically a whole box of different models
0:07:27.098 --> 0:07:30.205
which try to model the translation.
0:07:30.510 --> 0:07:42.798
In neural machine translation, it's all one
large neural network where we use the source sentence as
0:07:42.798 --> 0:07:43.667
input.
0:07:44.584 --> 0:07:50.971
And then we can train it by having exactly
this mapped parallel data.
0:07:54.214 --> 0:08:02.964
So what we want to look at today is:
we want to first look at general text data.
0:08:03.083 --> 0:08:06.250
So what is text data?
0:08:06.250 --> 0:08:09.850
What text data is there?
0:08:09.850 --> 0:08:18.202
Why is it challenging so that we have large
vocabularies?
0:08:18.378 --> 0:08:22.003
It's so that you always have words which you
haven't seen.
0:08:22.142 --> 0:08:29.053
If you increase your corpus size, normally
you will also increase your vocabulary so you
0:08:29.053 --> 0:08:30.744
always find new words.
0:08:31.811 --> 0:08:39.738
Then based on that we'll look into pre-processing.
0:08:39.738 --> 0:08:45.333
So how can we pre-process our data?
0:08:45.333 --> 0:08:46.421
Maybe.
0:08:46.526 --> 0:08:54.788
This is a lot about tokenization, for example,
which we heard is not so challenging in European
0:08:54.788 --> 0:09:02.534
languages but still important, but might be
really difficult in Asian languages where you
0:09:02.534 --> 0:09:05.030
don't have space separation.
0:09:05.986 --> 0:09:12.161
And this preprocessing typically tries to
deal with the extreme cases, where you have rarely
0:09:12.161 --> 0:09:13.105
seen things.
0:09:13.353 --> 0:09:25.091
If you have seen your words a hundred
times, it doesn't really matter if you have
0:09:25.091 --> 0:09:31.221
seen them with or without punctuation or
so.
0:09:31.651 --> 0:09:38.578
And then we look into word representation,
so what is the best way to represent a word?
0:09:38.578 --> 0:09:45.584
And finally, we look into the other type of
data we really need for machine translation.
0:09:45.725 --> 0:09:56.842
So first there is parallel data, which we can use for many tasks, and
later we can also use purely monolingual data
0:09:56.842 --> 0:10:00.465
to improve machine translation.
0:10:00.660 --> 0:10:03.187
In the traditional approach that was easier.
0:10:03.483 --> 0:10:08.697
We have this type of language model which
we can train only on the target data to make
0:10:08.697 --> 0:10:12.173
the text more fluent; that is different in a neural machine
translation model.
0:10:12.173 --> 0:10:18.106
It's partly a bit more complicated to integrate
this data but still it's very important especially
0:10:18.106 --> 0:10:22.362
if you think about low-resource languages where
you have very little data.
0:10:23.603 --> 0:10:26.999
It's harder to get parallel data than you
get monolingual data.
0:10:27.347 --> 0:10:33.821
Because monolingual data you just have out
there, maybe not huge amounts for some languages,
0:10:33.821 --> 0:10:38.113
but definitely the amount of data is always
significantly larger.
0:10:40.940 --> 0:10:50.454
When we talk about data, it's also of course
important how we use it for machine learning.
0:10:50.530 --> 0:11:05.867
And that you hopefully learned in some prior
class: typically we separate our data into
0:11:05.867 --> 0:11:17.848
three chunks. The training data is really by far the
largest, and it grows with the data we get.
0:11:17.848 --> 0:11:21.387
Today we get here millions.
0:11:22.222 --> 0:11:27.320
Then we have our validation data and that
is used to tune some type of parameters.
0:11:27.320 --> 0:11:33.129
So normally you have some things to configure
and you don't know what is the right value,
0:11:33.129 --> 0:11:39.067
so what you can do is train a model and change
these a bit and try to find the best ones on
0:11:39.067 --> 0:11:40.164
your validation.
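As a rough illustration of this split, a minimal Python sketch could look as follows; the file names and split sizes are only placeholders, not the ones used in the lecture:

    # Sketch: split a parallel corpus into training, validation and test chunks.
    import random

    def split_parallel(src_path, tgt_path, n_valid=2000, n_test=2000, seed=42):
        with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
            pairs = list(zip(fs.read().splitlines(), ft.read().splitlines()))
        random.Random(seed).shuffle(pairs)        # shuffle sentence pairs jointly
        valid = pairs[:n_valid]                   # small set for tuning choices
        test = pairs[n_valid:n_valid + n_test]    # untouched set for final reporting
        train = pairs[n_valid + n_test:]          # by far the largest chunk
        return train, valid, test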
0:11:40.700 --> 0:11:48.531
For a statistical model, for example, the validation data
is what you want to use if you have several
0:11:48.531 --> 0:11:54.664
models: you need to know how to combine them, so how
much weight should you put on the different
0:11:54.664 --> 0:11:55.186
models?
0:11:55.186 --> 0:11:59.301
And if it's like twenty models, then it's only
twenty parameters.
0:11:59.301 --> 0:12:02.828
It's not that much, so that can still be reliably
estimated.
0:12:03.183 --> 0:12:18.964
For a neural model there's often the question of how
long you should train the model before you have
0:12:18.964 --> 0:12:21.322
overfitting.
0:12:22.902 --> 0:12:28.679
And then you have your test data, which is
finally where you report your results.
0:12:29.009 --> 0:12:33.663
And therefore it's also important that from
time to time you get new test data because
0:12:33.663 --> 0:12:38.423
if throughout your experiments you always
test on it, and then you do new experiments
0:12:38.423 --> 0:12:43.452
and test again, at some point you have tested
so much on it that you do some type of training
0:12:43.452 --> 0:12:48.373
on your test data again, because you just select
the things which are in the end best on your
0:12:48.373 --> 0:12:48.962
test data.
0:12:49.009 --> 0:12:54.755
It's important to get a new test data from
time to time, for example in important evaluation
0:12:54.755 --> 0:12:58.340
campaigns for machine translation and speech
translation.
0:12:58.618 --> 0:13:07.459
There, every year a new test set is created,
so we can see if the model really
0:13:07.459 --> 0:13:09.761
gets better on new data.
0:13:10.951 --> 0:13:19.629
And of course it is important that this is
representative of the use case you are interested in.
0:13:19.879 --> 0:13:36.511
So if you're building a system for translating
websites, this should be on websites.
0:13:36.816 --> 0:13:39.356
So normally a system is good on some tasks.
0:13:40.780 --> 0:13:48.596
Or maybe you want a system that would solve everything; then your test
data should be drawn from everything, because if
0:13:48.596 --> 0:13:54.102
you only have a very small subset, you only know
it's good on this.
0:13:54.394 --> 0:14:02.714
Therefore, the selection of your test data
is really important in order to ensure that
0:14:02.714 --> 0:14:05.200
the MT system in the end performs well.
0:14:05.525 --> 0:14:12.646
Maybe it is the greatest system ever when you have evaluated
it on translating the Bible, but
0:14:12.646 --> 0:14:21.830
the use case is to translate some Twitter
data, and you can imagine the performance might
0:14:21.830 --> 0:14:22.965
be really different.
0:14:23.803 --> 0:14:25.471
And importantly:
0:14:25.471 --> 0:14:35.478
Of course, in order to have a realistic
evaluation, it's important that there's no
0:14:35.478 --> 0:14:39.370
overlap between these data sets, because:
0:14:39.799 --> 0:14:51.615
the danger might be that the system learns by
heart how to translate the sentences from your
0:14:51.615 --> 0:14:53.584
training data.
0:14:54.194 --> 0:15:04.430
So the test data should be really different from
your training data.
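A simple way to check this, sketched here with made-up file names, is to test whether any test sentence also occurs verbatim in the training data:

    # Sketch: verify there is no verbatim overlap between training and test sets.
    def overlap(train_sentences, test_sentences):
        train_set = {s.strip() for s in train_sentences}
        return [s for s in test_sentences if s.strip() in train_set]

    # Example with placeholder file names:
    # dupes = overlap(open("train.de", encoding="utf-8"), open("test.de", encoding="utf-8"))
    # print(len(dupes), "test sentences also occur in the training data")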
0:15:04.430 --> 0:15:16.811
Therefore, it's important to keep them separate. So what type
of data do we have?
0:15:16.811 --> 0:15:24.966
There's a lot of different text data, and the
nice thing is that with digitalization a lot of it is available.
0:15:25.345 --> 0:15:31.785
You might think a large amount comes from
books, but to be honest, books and printed things,
0:15:31.785 --> 0:15:35.524
that's by now a minor percentage of the data
we have.
0:15:35.815 --> 0:15:39.947
There's like so much data created every day
on the Internet.
0:15:39.980 --> 0:15:46.223
With social media and all the other types.
0:15:46.223 --> 0:15:56.821
this of course is the largest amount of data,
but more colloquial language.
0:15:56.856 --> 0:16:02.609
It might be more noisy and harder to process,
so there is a whole area on how to deal with
0:16:02.609 --> 0:16:04.948
social media and similar content.
0:16:07.347 --> 0:16:20.702
What type of data is there if you think about
parallel data? News-type data, official sites.
0:16:20.900 --> 0:16:26.629
So the first parallel corpora were things
like the European Parliament or like some news
0:16:26.629 --> 0:16:27.069
sites.
0:16:27.227 --> 0:16:32.888
Nowadays there's quite a large amount of data
crawled from the Internet, but of course if
0:16:32.888 --> 0:16:38.613
you crawl parallel data from the Internet,
a lot of the data is also like company websites
0:16:38.613 --> 0:16:41.884
or so which gets translated into several languages.
0:16:45.365 --> 0:17:00.613
Then, of course, there are different levels
of text and we have to look at what level we
0:17:00.613 --> 0:17:05.118
want to process our data.
0:17:05.885 --> 0:17:16.140
It normally doesn't make sense to work
on full sentences because a lot of sentences
0:17:16.140 --> 0:17:22.899
have never been seen and you always create
new sentences.
0:17:23.283 --> 0:17:37.421
So typically what we take as our basic unit is words, or
something between words and letters, and that
0:17:37.421 --> 0:17:40.033
is an essential decision.
0:17:40.400 --> 0:17:47.873
So we need some of these atomic blocks or
basic blocks which we can't make smaller.
0:17:48.128 --> 0:17:55.987
So if we're building a sentence, for example,
you can build it out of these blocks and you can
0:17:55.987 --> 0:17:57.268
either decide.
0:17:57.268 --> 0:18:01.967
For example, you take words, or you split them
further.
0:18:03.683 --> 0:18:10.178
Then, of course, the nice thing is if the blocks are not too
small, because then building larger things
0:18:10.178 --> 0:18:11.386
like sentences is easy.
0:18:11.831 --> 0:18:16.690
So you only have to take your vocabulary and
put it somewhere together to get your full
0:18:16.690 --> 0:18:17.132
sentence.
0:18:19.659 --> 0:18:27.670
However, if it's too large, these blocks don't
occur often enough, and you have more blocks
0:18:27.670 --> 0:18:28.715
that occur only rarely.
0:18:29.249 --> 0:18:34.400
And that's why we can work with smaller
blocks, like subword blocks.
0:18:34.714 --> 0:18:38.183
Or you can work with neural models.
0:18:38.183 --> 0:18:50.533
Then you can work on letters so you have a
system which tries to understand the sentence
0:18:50.533 --> 0:18:53.031
letter by letter.
0:18:53.313 --> 0:18:57.608
But that is a design decision which you have
to take at some point.
0:18:57.608 --> 0:19:03.292
On which level do you want to split your text
and what are the basic blocks that you are
0:19:03.292 --> 0:19:04.176
working with?
0:19:04.176 --> 0:19:06.955
And that's something we'll look into today.
0:19:06.955 --> 0:19:08.471
What the possibilities are.
0:19:12.572 --> 0:19:14.189
Any questions?
0:19:17.998 --> 0:19:24.456
Then let's look a bit at what type of data
there is and how much data there is to process.
0:19:24.824 --> 0:19:34.006
The point is that nowadays, at least for pure text,
data is no longer a problem for some languages.
0:19:34.006 --> 0:19:38.959
There is so much data that we cannot process it all.
0:19:39.479 --> 0:19:49.384
That is only true for some languages, but
there is also interest in other languages and
0:19:49.384 --> 0:19:50.622
they are important too.
0:19:50.810 --> 0:20:01.483
So if you want to build a system for Swedish
or for some dialect in other countries, then
0:20:01.483 --> 0:20:02.802
of course there is much less.
0:20:03.103 --> 0:20:06.888
Otherwise you have this huge amount of data here.
0:20:06.888 --> 0:20:11.515
We are often no longer talking about gigabytes,
but more.
0:20:11.891 --> 0:20:35.788
The general information that is produced every
year is enormous, and this is like all the information
0:20:35.788 --> 0:20:40.661
that is available, so there is really a lot.
0:20:41.001 --> 0:20:44.129
We look at machine translation.
0:20:44.129 --> 0:20:53.027
We can see these numbers are really like more
than ten years old, but we see this increase
0:20:53.027 --> 0:20:58.796
of one billion words we had at that time for
English data.
0:20:59.019 --> 0:21:01.955
Then there were things like the shuffled news corpora and Google n-grams
and so on.
0:21:02.382 --> 0:21:05.003
For this one you could train your system on.
0:21:05.805 --> 0:21:20.457
And the interesting thing is this one billion
words is more than any human typically speaks.
0:21:21.001 --> 0:21:25.892
So these systems, they see by now like a magnitude
more data.
0:21:25.892 --> 0:21:32.465
I think they see a magnitude
more data than a human has ever seen in their
0:21:32.465 --> 0:21:33.229
lifetime.
0:21:35.175 --> 0:21:41.808
And that is maybe the interesting thing: why
it still doesn't fully work, because you see
0:21:41.808 --> 0:21:42.637
how much they have seen.
0:21:43.103 --> 0:21:48.745
So we are seeing really impressive results,
but in most cases it's not that they're really
0:21:48.745 --> 0:21:49.911
better than humans.
0:21:50.170 --> 0:21:56.852
However, they really have seen more data than
any human has ever seen in their lifetime.
0:21:57.197 --> 0:22:01.468
They can just process so much data, so.
0:22:01.501 --> 0:22:08.425
The question is, can we make them more efficient
so that they can learn similarly good without
0:22:08.425 --> 0:22:09.592
that much data?
0:22:09.592 --> 0:22:16.443
And that is essential if we now go to low-resource
languages where we might never get that much
0:22:16.443 --> 0:22:21.254
data, and we should be also able to achieve
a reasonable performance.
0:22:23.303 --> 0:22:32.399
On the other hand, this of course links also
to one topic which we will cover later: If
0:22:32.399 --> 0:22:37.965
you think about this, it's really important
that your algorithms are also very efficient
0:22:37.965 --> 0:22:41.280
in order to process that much data both in
training.
0:22:41.280 --> 0:22:46.408
If you have more data, you want to process
more data so you can make use of that.
0:22:46.466 --> 0:22:54.499
On the other hand, if more and more data is
processed, more and more people will use machine
0:22:54.499 --> 0:23:06.816
translation to generate translations, and it
will be important to do that efficiently. And there
0:23:06.816 --> 0:23:07.257
is more and
0:23:07.607 --> 0:23:10.610
more
0:23:10.170 --> 0:23:17.262
data generated every day. Here are just
some general numbers on how much data there
0:23:17.262 --> 0:23:17.584
is.
0:23:17.584 --> 0:23:24.595
It says that a lot of the data we produce
at least at the moment is text rich, so text
0:23:24.595 --> 0:23:26.046
that is produced.
0:23:26.026 --> 0:23:29.748
That is very important in two ways.
0:23:29.748 --> 0:23:33.949
We can use it as training data in some way.
0:23:33.873 --> 0:23:40.836
Or we want to translate some of that, because
it might not be published in all the languages,
0:23:40.836 --> 0:23:46.039
and so the need for machine translation
is even more important.
0:23:47.907 --> 0:23:51.547
So what are the challenges with this?
0:23:51.831 --> 0:24:01.360
So first of all that seems to be very good
news, so there is more and more data, so we
0:24:01.360 --> 0:24:10.780
can just wait for three years and have more
data, and then our system will be better.
0:24:11.011 --> 0:24:22.629
If you see in competitions, the system performance
increases.
0:24:24.004 --> 0:24:27.190
See that here are three different systems.
0:24:27.190 --> 0:24:34.008
The BLEU score is a metric to measure how good an
MT system is, and we'll talk about evaluation
0:24:34.008 --> 0:24:40.974
next week, so you will learn how to evaluate
machine translation, and there is also a practical session.
0:24:41.581 --> 0:24:45.219
And so.
0:24:44.784 --> 0:24:50.960
This shows you how much of the training data
you have: with five percent of the data
0:24:50.960 --> 0:24:56.117
you're significantly worse than if you have
forty percent or eighty percent.
0:24:56.117 --> 0:25:02.021
You're getting better, and you see this
curve, which maybe does not really
0:25:02.021 --> 0:25:02.971
flattens out.
0:25:02.971 --> 0:25:03.311
But.
0:25:03.263 --> 0:25:07.525
Of course, the gains you get are normally
smaller and smaller.
0:25:07.525 --> 0:25:09.216
the more data you have.
0:25:09.549 --> 0:25:21.432
Your improvements are normally larger
if you add the same amount again or even double your
0:25:21.432 --> 0:25:25.657
data, so of course more data helps.
0:25:26.526 --> 0:25:34.955
However, you see the clear tendency: if you
need to improve your system,
0:25:34.955 --> 0:25:38.935
this is often possible by just getting more data.
0:25:39.039 --> 0:25:41.110
But it's not all about data.
0:25:41.110 --> 0:25:45.396
It can also be about the domain of the data
you are building on.
0:25:45.865 --> 0:25:55.668
So this was a test of a machine translation
system on translating genome data.
0:25:55.668 --> 0:26:02.669
As I said, we have people working on translating this kind of data.
0:26:02.862 --> 0:26:06.868
Here you see the performance given in BLEU score.
0:26:06.868 --> 0:26:12.569
You see one system which was only trained
on genome data, and it only has very little of it.
0:26:12.812 --> 0:26:17.742
That's very, very little for machine translation.
0:26:18.438 --> 0:26:23.927
And to compare that to a system which was
trained on general news translation data.
0:26:24.104 --> 0:26:34.177
With four point five million sentences so
roughly one hundred times as much data you
0:26:34.177 --> 0:26:40.458
still see that this system doesn't really work
well.
0:26:40.820 --> 0:26:50.575
So you see it's not only about data, it's
also that the data has to somewhat fit to the
0:26:50.575 --> 0:26:51.462
domain.
0:26:51.831 --> 0:26:58.069
The more general data you get, the more you have
covered all domains.
0:26:58.418 --> 0:27:07.906
But that's very difficult and especially for
more specific domains.
0:27:07.906 --> 0:27:16.696
It can be really important to get data which
fits your domain.
0:27:16.716 --> 0:27:18.520
Maybe if you can do some prompting
or something like that, maybe if you.
0:27:18.598 --> 0:27:22.341
To say, okay, concentrate on this domain to be better.
0:27:24.564 --> 0:27:28.201
It's not that easy to prompt it.
0:27:28.201 --> 0:27:35.807
You can do the prompting in the more traditional
way of fine tuning.
0:27:35.807 --> 0:27:44.514
Then, of course, if you select data and later combine
it, you can get better.
0:27:44.904 --> 0:27:52.675
But it will always be that this type of similar
data is much more important than the general.
0:27:52.912 --> 0:28:00.705
So of course you can make such a system
a lot better if you search for similar data
0:28:00.705 --> 0:28:01.612
and find it.
0:28:02.122 --> 0:28:08.190
We will have a lecture on domain adaptation, where
it's exactly the idea how you can make systems
0:28:08.190 --> 0:28:13.935
in these situations better so you can adapt
it to this data but then you still need this
0:28:13.935 --> 0:28:14.839
type of data.
0:28:15.335 --> 0:28:21.590
And in prompting it might work if you have
seen it in your data so it can make the system
0:28:21.590 --> 0:28:25.134
aware and tell it to focus more on this type of
data.
0:28:25.465 --> 0:28:30.684
But if you haven't had enough of the really
specific good matching data, I think it will
0:28:30.684 --> 0:28:31.681
still not work.
0:28:31.681 --> 0:28:37.077
So you need to have this type of data and
therefore it's important not only to have general
0:28:37.077 --> 0:28:42.120
data but also data, at least in your overall
system, which really fits to the domain.
0:28:45.966 --> 0:28:53.298
And then the second thing, of course, is you
need to have data that has good quality.
0:28:53.693 --> 0:29:00.170
In the early stages it might be good to have
all the data but later it's especially important
0:29:00.170 --> 0:29:06.577
that you have somehow good quality and so that
you're learning what you really want to learn
0:29:06.577 --> 0:29:09.057
and not learning some strange things.
0:29:10.370 --> 0:29:21.551
We talked about this with the kilometers and
miles, so if you just take in some type of
0:29:21.551 --> 0:29:26.253
data and don't look at the quality, you can learn such errors.
0:29:26.766 --> 0:29:30.875
But of course, the question here is what is
good quality data?
0:29:31.331 --> 0:29:35.054
It is not yet that easy to define what is
a good quality data.
0:29:36.096 --> 0:29:43.961
That doesn't mean it has to be what people generally
assume as high quality text or so, like written
0:29:43.961 --> 0:29:47.814
by a Nobel Prize winner or something like that.
0:29:47.814 --> 0:29:54.074
This is not what we mean by this quality,
but again, the most important thing is that it fits your task.
0:29:54.354 --> 0:30:09.181
So if you have Twitter data, high quality
data doesn't mean you have now some novels.
0:30:09.309 --> 0:30:12.875
Test data, but it should also be represented
similarly.
0:30:12.875 --> 0:30:18.480
What quality definitely means, for example, is that
the sentences should really be translations of each
0:30:18.480 --> 0:30:18.862
other.
0:30:19.199 --> 0:30:25.556
So especially if you crawl data you would
often have that it's not a direct translation.
0:30:25.805 --> 0:30:28.436
So then, of course, this is not high-quality training data.
0:30:29.449 --> 0:30:39.974
But in general that's a very difficult thing
to define, and it's very difficult to decide what
0:30:39.974 --> 0:30:41.378
is really good quality.
0:30:41.982 --> 0:30:48.333
And of course one metric is always: the quality
of your data is good if your machine translation gets better.
0:30:48.648 --> 0:30:50.719
So that is like the indirect measure.
0:30:50.991 --> 0:30:52.447
Well, how can we measure it?
0:30:52.447 --> 0:30:57.210
Of course, it's difficult to always try a
lot of things and evaluate each of them,
0:30:57.210 --> 0:30:59.396
build a full MT system and then check:
0:30:59.396 --> 0:31:00.852
Oh, was this a good idea?
0:31:00.852 --> 0:31:01.357
I mean,.
0:31:01.581 --> 0:31:19.055
Imagine you have two tokenizers which split sentences
into words, and you want to know which one you really should apply.
0:31:19.179 --> 0:31:21.652
Now you could maybe argue or your idea could
be.
0:31:21.841 --> 0:31:30.186
Just test it very quickly on small data and then get
the result, but the problem is there is not
0:31:30.186 --> 0:31:31.448
always this correlation.
0:31:31.531 --> 0:31:36.269
One thing might work very well for small data.
0:31:36.269 --> 0:31:43.123
It's not for sure that the same effect will
happen at large scale.
0:31:43.223 --> 0:31:50.395
Maybe this idea really improves on very low-resource
data if you only train on a hundred words.
0:31:51.271 --> 0:31:58.357
But if you use it for a large data set, it
doesn't really matter and your idea does not help anymore.
0:31:58.598 --> 0:32:01.172
So that is also a typical thing.
0:32:01.172 --> 0:32:05.383
This quality issue becomes more and more important
as you scale up.
0:32:06.026 --> 0:32:16.459
But one motivation which you should generally have:
you want to represent your data such that you have seen
0:32:16.459 --> 0:32:17.469
things as often as possible.
0:32:17.677 --> 0:32:21.805
Why is this the case, any idea?
0:32:21.805 --> 0:32:33.389
Why could this be a motivation, that we try
to represent the data in a way that we have
0:32:33.389 --> 0:32:34.587
seen things as often as possible?
0:32:38.338 --> 0:32:50.501
We also want to learn about the context, because
maybe sometimes something occurs in the context.
0:32:52.612 --> 0:32:54.020
The context is here.
0:32:54.020 --> 0:32:56.432
It's more about the learning first.
0:32:56.432 --> 0:33:00.990
You can generally learn better if you've seen
something more often.
0:33:00.990 --> 0:33:06.553
So if you have seen an event only once, it's
really hard to learn about the event.
0:33:07.107 --> 0:33:15.057
If you have seen an event a hundred times
you are better at estimating it, and maybe that
0:33:15.057 --> 0:33:18.529
is where the context comes in, then you can use it.
0:33:18.778 --> 0:33:21.331
So, for example, if you here have the word
house.
0:33:21.761 --> 0:33:28.440
If you would just take the data normally you
would directly process the data.
0:33:28.440 --> 0:33:32.893
In the uppercase version you have the house with
the dot attached.
0:33:32.893 --> 0:33:40.085
That's a different word than the house written this
way, and then the house with the comma attached.
0:33:40.520 --> 0:33:48.365
So you want to learn how this translates into
house, but for the uppercase version you learn separately
0:33:48.365 --> 0:33:50.281
how this translates.
0:33:50.610 --> 0:33:59.445
You would be learning how to translate 'House'
and 'house' separately, so you have to learn four different
0:33:59.445 --> 0:34:00.205
things.
0:34:00.205 --> 0:34:06.000
Instead, we really want to learn only once that house
gets mapped to its translation.
0:34:06.366 --> 0:34:18.796
And then imagine it would be even worse: it might be that
here the uppercase house was always translated into one word,
0:34:18.678 --> 0:34:22.089
and the lowercase one into another word.
0:34:22.202 --> 0:34:29.512
If it's uppercase, then I would always have to
translate it into the one word, while if it's lower
0:34:29.512 --> 0:34:34.955
case, it is translated into the other word, and that's
of course not right.
0:34:34.955 --> 0:34:39.260
We have to use the context to decide what
is better.
0:34:39.679 --> 0:34:47.086
If you have seen an event several times then
you are better able to learn your model and
0:34:47.086 --> 0:34:51.414
that doesn't matter what type of learning you
have.
0:34:52.392 --> 0:34:58.981
I shouldn't say all but for most of these
models it's always better to have like seen
0:34:58.981 --> 0:35:00.897
an event more often.
0:35:00.920 --> 0:35:11.483
Therefore, if you preprocess your data, you
should ask the question how you can represent the data
0:35:11.483 --> 0:35:14.212
in order to have seen things as often as possible.
0:35:14.514 --> 0:35:17.885
Of course you should not remove that information.
0:35:18.078 --> 0:35:25.519
So you could now, of course, just lowercase
everything.
0:35:25.519 --> 0:35:30.303
Then you've seen things more often.
0:35:30.710 --> 0:35:38.443
And that might be an issue because in the
final application you want to have real text
0:35:38.443 --> 0:35:38.887
and correct casing.
0:35:40.440 --> 0:35:44.003
And finally, it's even more important that
it's consistent.
0:35:44.965 --> 0:35:52.630
So this is a problem where, for example, things aren't
consistent.
0:35:52.630 --> 0:35:58.762
So 'I'm' is written together as one token in the training
data.
0:35:58.762 --> 0:36:04.512
And if it's not like that in the test data, you have a mismatch.
0:36:04.824 --> 0:36:14.612
Therefore, the most important thing is to design your preprocessing
so that your data is represented as consistently as possible,
0:36:14.612 --> 0:36:18.413
because then it's easier to map similar things.
0:36:18.758 --> 0:36:26.588
If your text is represented very, very differently
then your data will be badly translated.
0:36:26.666 --> 0:36:30.664
So we once had the case.
0:36:30.664 --> 0:36:40.420
For example, there was some data where whoever wrote
it used slightly different characters in German.
0:36:40.900 --> 0:36:44.187
And if you read it as a human you see it.
0:36:44.187 --> 0:36:49.507
It's even hard to get the difference because
it looks very similar.
0:36:50.130 --> 0:37:02.997
If you use it for a machine translation system,
it would not be able to translate anything
0:37:02.997 --> 0:37:08.229
of it because it's a different word.
0:37:09.990 --> 0:37:17.736
And on the other hand, you should
of course not change your training data so that significant information
0:37:17.736 --> 0:37:18.968
is lost thereby.
0:37:18.968 --> 0:37:27.155
For example, removing case information is a problem
if your task is to generate case information.
0:37:31.191 --> 0:37:41.081
One thing which is a good point to look into
it in order to see the difficulty of your data
0:37:41.081 --> 0:37:42.711
is to compare types and tokens.
0:37:43.103 --> 0:37:45.583
There are types and tokens.
0:37:45.583 --> 0:37:57.983
By types we mean the number of unique words in the
corpus, so your vocabulary; tokens are the running words.
0:37:58.298 --> 0:38:08.628
And then you can look at the type token ratio
that means the number of types per token.
0:38:15.815 --> 0:38:22.381
You have fewer types than tokens, because every
word appears at least once in the corpus, but most
0:38:22.381 --> 0:38:27.081
of them will occur more often, so the token count
is bigger.
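Counting types and tokens is straightforward; a small sketch, using naive whitespace tokenisation as a simplification, could look like this:

    # Sketch: type and token counts plus the type-token ratio for a text file.
    from collections import Counter

    def type_token_stats(path):
        counts = Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts.update(line.split())       # naive whitespace tokenisation
        tokens = sum(counts.values())             # running words
        types = len(counts)                       # distinct words (vocabulary)
        return types, tokens, types / tokens

    # types, tokens, ttr = type_token_stats("wikipedia.en")  # placeholder file name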
0:38:27.667 --> 0:38:30.548
And of course this changes if you have more
data.
0:38:31.191 --> 0:38:38.103
Here is an example from an English Wikipedia.
0:38:38.103 --> 0:38:45.015
That means each word occurs a certain number of times on average.
0:38:45.425 --> 0:38:47.058
Of course there's a big difference.
0:38:47.058 --> 0:38:51.323
There will be some words which occur one hundred
times, but therefore most of the words occur
0:38:51.323 --> 0:38:51.777
only once.
0:38:52.252 --> 0:38:55.165
However, you see this ratio goes down.
0:38:55.165 --> 0:39:01.812
That's a good thing, so you have seen each
word more often and therefore your model gets
0:39:01.812 --> 0:39:03.156
typically better.
0:39:03.156 --> 0:39:08.683
However, the problem is we always have a lot
of words which we have seen only rarely.
0:39:09.749 --> 0:39:15.111
Even here there will be a bunch of words which
you have only seen once.
0:39:15.111 --> 0:39:20.472
However, this can give you an indication about
the quality of the data.
0:39:20.472 --> 0:39:27.323
So you should always, of course, try to achieve
data where you have a very low type-token
0:39:27.323 --> 0:39:28.142
ratio.
0:39:28.808 --> 0:39:39.108
For example, if you compare Simple Wikipedia and
normal Wikipedia, what would be your expectation?
0:39:41.861 --> 0:39:49.842
Yes, exactly; however, it's surprisingly
only a little bit lower, but you see that it's
0:39:49.842 --> 0:39:57.579
lower, so we are using less words to express
the same thing, and therefore the task to produce
0:39:57.579 --> 0:39:59.941
this text is also easier.
0:40:01.221 --> 0:40:07.702
However, as to how many words there are, there
is no clear definition.
0:40:07.787 --> 0:40:19.915
So there will be always more words, especially
depending on your dataset, how many different
0:40:19.915 --> 0:40:22.132
words there are.
0:40:22.482 --> 0:40:30.027
So if you have some million tweets with around
fifty million tokens, you have six hundred
0:40:30.027 --> 0:40:30.875
thousand different words.
0:40:31.251 --> 0:40:40.299
If you have many times this number of tweets, you
also have significantly more tokens, but also significantly more different words.
0:40:40.660 --> 0:40:58.590
So especially in things like the social media,
of course, there are always new types of
0:40:58.590 --> 0:40:59.954
words.
0:41:00.040 --> 0:41:04.028
Another example from not social media is here.
0:41:04.264 --> 0:41:18.360
So yeah, there is a smaller dataset of
phone conversations: two million tokens, and
0:41:18.360 --> 0:41:22.697
only twenty thousand words.
0:41:23.883 --> 0:41:37.221
If you think about Shakespeare, it has even
fewer tokens, significantly less than a million,
0:41:37.221 --> 0:41:40.006
but the number of different words is still quite large.
0:41:40.060 --> 0:41:48.781
On the other hand, there is this Google n-gram
corpus, which has vastly more tokens, and there are always
0:41:48.781 --> 0:41:50.506
new words coming in.
0:41:50.991 --> 0:41:52.841
This is English.
0:41:52.841 --> 0:42:08.103
The nice thing about English is that the vocabulary
is relatively small; well, not really small, but relatively
0:42:08.103 --> 0:42:09.183
small.
0:42:09.409 --> 0:42:14.224
So here you see the TED corpus.
0:42:15.555 --> 0:42:18.144
You all know TED talks.
0:42:18.144 --> 0:42:26.429
They are transcribed and translated, so a nice source
for us, though an especially small corpus.
0:42:26.846 --> 0:42:32.702
You can do a lot of experiments with that
and you see that the corpus size is relatively
0:42:32.702 --> 0:42:36.782
similar so we have around four million tokens
in this corpus.
0:42:36.957 --> 0:42:44.464
However, if you look at the vocabulary, English
has about half as many different words
0:42:44.464 --> 0:42:47.045
as German and Dutch and Italian.
0:42:47.527 --> 0:42:56.260
So one influence is compound words,
which are more frequent in German; the
0:42:56.260 --> 0:43:02.978
more important reason is that we have all these different
morphological forms.
0:43:03.263 --> 0:43:08.170
These all lead to new words, and they need
to be somehow represented in the vocabulary.
0:43:11.531 --> 0:43:20.278
So to deal with this, the question is how
can we normalize the text in order to make
0:43:20.278 --> 0:43:22.028
the text easier?
0:43:22.028 --> 0:43:25.424
Can we make the task simpler?
0:43:25.424 --> 0:43:29.231
But we need to keep all information.
0:43:29.409 --> 0:43:32.239
So here is an example where information can get lost.
0:43:32.239 --> 0:43:35.012
Of course you make the task easier if you
just lowercase everything.
0:43:35.275 --> 0:43:41.141
You don't have to deal with different cases.
0:43:41.141 --> 0:43:42.836
It's easier.
0:43:42.836 --> 0:43:52.482
However, information gets lost and you might
need to generate the correct case on the target side.
0:43:52.832 --> 0:44:00.153
So the question is always: How can we on the
one hand simplify the task but keep all the
0:44:00.153 --> 0:44:01.223
information?
0:44:01.441 --> 0:44:06.639
I say necessary information because it depends on the task.
0:44:06.639 --> 0:44:11.724
For some tasks you might be fine with removing the case information.
0:44:14.194 --> 0:44:23.463
So the steps we are typically doing are
that you segment the words in a running
0:44:23.463 --> 0:44:30.696
text, you normalize word forms, and you do segmentation
into sentences.
0:44:30.696 --> 0:44:33.955
Also, if you do not have a single sentence per line:
0:44:33.933 --> 0:44:38.739
if the text is not already segmented,
you also split the text into sentences.
0:44:39.779 --> 0:44:52.609
So what are we doing there? For European languages,
segmentation into words
0:44:52.609 --> 0:44:57.290
is not that complicated.
0:44:57.277 --> 0:45:06.001
You have to somehow handle joined words,
and when handling joined words the most important thing is consistency.
0:45:06.526 --> 0:45:11.331
So in most systems it really doesn't matter
much.
0:45:11.331 --> 0:45:16.712
If you write 'I'm' together as one word or
as two words.
0:45:17.197 --> 0:45:23.511
The nice thing about 'I'm' is that it occurs
so often that it doesn't matter if you have both,
0:45:23.511 --> 0:45:26.560
as long as they both occur often enough.
0:45:26.560 --> 0:45:32.802
But you'll have some of these cases where
they don't occur that often, so you should
0:45:32.802 --> 0:45:35.487
be as consistent as possible.
0:45:36.796 --> 0:45:41.662
But of course things can get more complicated.
0:45:41.662 --> 0:45:48.598
If you have 'Finland's capital', do you want to
split off the 's or not?
0:45:48.598 --> 0:45:53.256
Do you split it, or do you even write it out?
0:45:53.433 --> 0:46:00.468
And what about like things with hyphens in
the middle and so on?
0:46:00.540 --> 0:46:07.729
So not everything is easy, but
it is generally possible to keep it reasonably consistent.
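A toy tokenizer in this spirit, which splits off punctuation but keeps contractions and hyphenated words together, might look like the sketch below; real systems usually rely on an established tool such as the Moses tokenizer.

    # Toy sketch of word segmentation for European languages: split punctuation,
    # keep contractions ("I'm") and hyphenated words ("well-known") as one token.
    import re

    TOKEN_RE = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")

    def tokenize(sentence):
        return TOKEN_RE.findall(sentence)

    print(tokenize("I'm from Finland, but it's a well-known problem."))
    # ["I'm", 'from', 'Finland', ',', 'but', "it's", 'a', 'well-known', 'problem', '.']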
0:46:11.791 --> 0:46:25.725
Sometimes the most challenging thing in traditional
systems were compounds, and how to deal with
0:46:25.725 --> 0:46:28.481
things like this.
0:46:28.668 --> 0:46:32.154
The nice thing is, as said, we will come to this later.
0:46:32.154 --> 0:46:34.501
Nowadays we typically use subword
0:46:35.255 --> 0:46:42.261
units, so we don't have to deal with this in
the preprocessing directly, but in the subword
0:46:42.261 --> 0:46:47.804
splitting we're doing it, and then we can learn
how to best split these.
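One common way to learn such a subword splitting is a BPE model, for example with the sentencepiece library; the file name and vocabulary size below are only illustrative placeholders:

    # Sketch: learn and apply a BPE subword model with sentencepiece.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input="train.de",          # placeholder training file
        model_prefix="bpe_de",
        vocab_size=8000,
        model_type="bpe")

    sp = spm.SentencePieceProcessor(model_file="bpe_de.model")
    # Rare compounds get split into pieces that were each seen often enough.
    print(sp.encode("Donaudampfschifffahrt", out_type=str))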
0:46:52.392 --> 0:46:56.974
Things get more complicated
0:46:56.977 --> 0:46:59.934
when we talk about non-European languages.
0:46:59.934 --> 0:47:08.707
Because in non European languages, not all
of them, there is no space between the words.
0:47:09.029 --> 0:47:18.752
Nowadays you can also download word segmentation
models where you put in the full sentence and
0:47:18.752 --> 0:47:22.744
then it gets split into parts.
0:47:22.963 --> 0:47:31.814
And then, of course, there is even the case that you have
different writing systems mixed, for example in Japanese.
0:47:31.814 --> 0:47:40.385
For example, they have these katakana, hiragana
and kanji symbols in there, and you have to
0:47:40.385 --> 0:47:42.435
somehow deal with these.
0:47:49.669 --> 0:47:54.560
The next thing is that we can do some
normalization.
0:47:54.874 --> 0:48:00.376
So the idea is that you map several words
onto the same representation.
0:48:00.460 --> 0:48:07.877
And that is task-dependent, and the idea is
to define something like equivalence classes, so
0:48:07.877 --> 0:48:15.546
that words which have the same meaning, where
it's not important to keep the difference,
0:48:15.546 --> 0:48:19.423
map onto the same thing in order to make the task easier.
0:48:19.679 --> 0:48:27.023
The most important thing there is casing,
and then there is sometimes something like
0:48:27.023 --> 0:48:27.508
word classes.
0:48:28.048 --> 0:48:37.063
For casing you can do two things, and it
depends on the task.
0:48:37.063 --> 0:48:44.769
You can lowercase everything, maybe with some exceptions.
0:48:45.045 --> 0:48:47.831
For the target side it's
normally not done.
0:48:48.188 --> 0:48:51.020
Why is it not done?
0:48:51.020 --> 0:48:56.542
Why should you only do it for the source side?
0:48:56.542 --> 0:49:07.729
Yes, because you have to generate correct text,
not just lowercased text.
0:49:08.848 --> 0:49:16.370
Nowadays we typically keep the true casing on both
sides, also on the source side; that means you
0:49:16.370 --> 0:49:17.610
keep the case.
0:49:17.610 --> 0:49:24.966
The only thing that people sometimes still work on
or handle is the word at the beginning
0:49:24.966 --> 0:49:25.628
of the sentence.
0:49:25.825 --> 0:49:31.115
For frequent words this is not that important,
because you will have seen them otherwise a lot
0:49:31.115 --> 0:49:31.696
of times.
0:49:31.696 --> 0:49:36.928
But if you now have rare words, which you
have only seen maybe three times, and you have
0:49:36.928 --> 0:49:42.334
only seen them in the middle of the sentence, and
now one occurs at the beginning of the sentence,
0:49:42.334 --> 0:49:45.763
where it is uppercase, then you don't know how
to deal with it.
0:49:46.146 --> 0:49:50.983
So then it might be good to do a true casing.
0:49:50.983 --> 0:49:56.241
That means you recase each word at the beginning of a sentence.
0:49:56.576 --> 0:49:59.830
The only question, of course, is how do you
recase it?
0:49:59.830 --> 0:50:01.961
So which case do you choose? Do you always know it?
0:50:02.162 --> 0:50:18.918
Would you just lowercase the first word of the sentence, or do you have a better
solution, especially not for English but maybe for German?
0:50:18.918 --> 0:50:20.000
It's.
0:50:25.966 --> 0:50:36.648
The fancy solution would be to count how often each case occurs
and decide based on this; the unfancy solution
0:50:36.648 --> 0:50:43.147
would be to just lowercase. I think that's not really good, because
then most of the recased words are lowercased.
0:50:43.683 --> 0:50:53.657
Counting is one idea and definitely better,
because you uppercase a word only if it more often occurs uppercase.
0:50:53.653 --> 0:50:57.934
Otherwise you only have lowercase at the
beginning, where you again
0:50:58.338 --> 0:51:03.269
haven't gained anything. You can make it even
a bit better when counting:
0:51:03.269 --> 0:51:09.134
you ignore the first position, so that
you don't count words at the sentence beginning, and yeah,
0:51:09.134 --> 0:51:12.999
that's typically how this type of true casing
is done.
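
To illustrate the counting idea, here is a minimal Python sketch of such a frequency-based truecaser; the function names and the simple whitespace tokenization are assumptions for the example, not part of any particular toolkit.

```python
from collections import Counter, defaultdict

def learn_truecase_counts(sentences):
    """Count the casing of each word, ignoring the sentence-initial position."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        # Skip the first token: its uppercase form only reflects the sentence start.
        for token in tokens[1:]:
            counts[token.lower()][token] += 1
    return counts

def truecase_first_word(sentence, counts):
    """Recase the sentence-initial word to its most frequently observed form."""
    tokens = sentence.split()
    if not tokens:
        return sentence
    observed = counts.get(tokens[0].lower())
    if observed:
        tokens[0] = observed.most_common(1)[0][0]
    return " ".join(tokens)

corpus = ["The house is new .", "He bought the house .", "Peter lives there ."]
counts = learn_truecase_counts(corpus)
print(truecase_first_word("The house is new .", counts))   # -> "the house is new ."
print(truecase_first_word("Peter lives there .", counts))  # "Peter" never seen mid-sentence, so it stays as is
```
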
0:51:13.273 --> 0:51:23.907
And that's the easy approach; you could even use
bigram statistics, so word pairs.
0:51:23.907 --> 0:51:29.651
There are very few words which occur often in both cases.
0:51:29.970 --> 0:51:33.163
It's OK to have them in both forms, because the model can
otherwise learn it.
0:51:36.376 --> 0:51:52.305
Another thing about these classes is to use
word classes; that was partly done, for example,
0:51:52.305 --> 0:51:55.046
for numbers.
0:51:55.375 --> 0:51:57.214
Ten Thousand One Hundred Books.
0:51:57.597 --> 0:52:07.397
And then for an MT system the exact number might not be important,
so you can map it to something like "NUM books".
0:52:07.847 --> 0:52:16.450
However, you see here already that it's not
that easy, because if you have "one book" you
0:52:16.450 --> 0:52:19.318
have to deal with a problem, since the agreement is different.
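
A small, hedged sketch of such a number class; the NUM placeholder and the exception that keeps "1" untouched are illustrative choices only, meant to show how quickly such rules need exceptions.

```python
import re

def normalize_numbers(tokens):
    """Map digit tokens to a NUM placeholder, leaving '1' untouched."""
    out = []
    for tok in tokens:
        # Keep "1" as-is so that singular/plural agreement such as "1 book" is not broken.
        if re.fullmatch(r"\d+", tok) and tok != "1":
            out.append("NUM")
        else:
            out.append(tok)
    return out

print(normalize_numbers("He sold 10100 books".split()))  # ['He', 'sold', 'NUM', 'books']
print(normalize_numbers("He sold 1 book".split()))       # ['He', 'sold', '1', 'book']
```
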
0:52:20.020 --> 0:52:21.669
Always be careful.
0:52:21.669 --> 0:52:28.094
It's very easy to ignore some exceptions and
make things worse rather than better.
0:52:28.488 --> 0:52:37.879
So it's always difficult to decide when to
do this and when it is better not to do it and keep
0:52:37.879 --> 0:52:38.724
things as they are.
0:52:43.483 --> 0:52:56.202
Then the next step is sentence segmentation,
so we are typically working on sentences.
0:52:56.476 --> 0:53:11.633
However, with dots things are a bit more complicated,
since not every dot ends a sentence, so you have to do a bit more.
0:53:11.731 --> 0:53:20.111
You can even have some type of classifier
with features, but generally it
0:53:20.500 --> 0:53:30.731
is not too complicated; you can have different
types of classifiers to do that.
0:53:30.650 --> 0:53:32.537
I Didn't Know It.
0:53:33.393 --> 0:53:35.583
It's not a super complicated task.
0:53:35.583 --> 0:53:39.461
There are nowadays also a lot of libraries
which you can use.
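
To make this concrete, here is a hedged, self-contained sketch of a simple rule-based sentence splitter; the tiny abbreviation list is a toy assumption, and in practice you would rather use an existing library (for example NLTK's pretrained Punkt models) than write your own.

```python
import re

# A tiny abbreviation list; a real system would use a much longer one or a trained model.
ABBREVIATIONS = {"dr.", "prof.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    """Split on '.', '!' or '?' followed by whitespace and an uppercase letter,
    unless the dot belongs to a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        end = match.end()
        candidate = text[start:end].strip()
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the dot is part of an abbreviation, keep scanning
        sentences.append(candidate)
        start = end
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)
    return sentences

print(split_sentences("Dr. Smith gave a talk. It was short. Great!"))
# ['Dr. Smith gave a talk.', 'It was short.', 'Great!']
```
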
0:53:39.699 --> 0:53:45.714
Normally, if you're doing the normalization
beforehand, it can be handled there, so you only
0:53:45.714 --> 0:53:51.126
split off the dot if it is a sentence
boundary, and otherwise you keep it attached to the word;
0:53:51.126 --> 0:53:54.194
so you can do that a bit jointly with the segmentation.
0:53:54.634 --> 0:54:06.017
It's something to think about and take care of,
because that's where errors happen.
0:54:06.017 --> 0:54:14.712
However, in the end you can still do it
very well.
0:54:14.834 --> 0:54:19.740
You will never get data which is perfectly
clean and where everything is great.
0:54:20.340 --> 0:54:31.020
There's just too much data and it will never
happen, so therefore it's important to be aware
0:54:31.020 --> 0:54:35.269
of that during the full development.
0:54:37.237 --> 0:54:42.369
And one last thing about the preprocessing
before we get into the representation.
0:54:42.369 --> 0:54:47.046
If you're working on that, you'll become friends
with regular expressions.
0:54:47.046 --> 0:54:50.034
That's mostly how you do all this matching.
0:54:50.430 --> 0:55:03.811
And if you look into the scripts for how to
deal with punctuation marks and stuff like
0:55:03.811 --> 0:55:04.900
that, you will see a lot of them.
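
As a small illustration of that kind of regular-expression work, here is a hedged sketch of a tokenizer that splits punctuation off while keeping the dots of a toy abbreviation list attached; real tokenizer scripts are far more elaborate.

```python
import re

ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Dr.", "Mr."}

def tokenize(text):
    """Split punctuation off as separate tokens; keep dots of known abbreviations."""
    # Put spaces around punctuation that is always split off.
    text = re.sub(r'([,;:!?"()])', r' \1 ', text)
    tokens = []
    for tok in text.split():
        if tok.endswith(".") and tok not in ABBREVIATIONS and len(tok) > 1:
            tokens.extend([tok[:-1], "."])   # split the final dot off
        else:
            tokens.append(tok)
    return tokens

print(tokenize('He lives in Finland, e.g. in Helsinki.'))
# ['He', 'lives', 'in', 'Finland', ',', 'e.g.', 'in', 'Helsinki', '.']
```
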
0:55:11.011 --> 0:55:19.025
So if we now have the data, our next step
to build the system is to represent our words.
0:55:19.639 --> 0:55:27.650
Before we start with this, are there any more questions
about preprocessing,
0:55:27.650 --> 0:55:32.672
where we work on the pure text?
0:55:33.453 --> 0:55:40.852
The idea is again to make things simpler,
because if you think about the capitalized word
0:55:40.852 --> 0:55:48.252
at the beginning of a sentence, it might be
that you haven't seen that form of the word; or, for example,
0:55:48.252 --> 0:55:49.619
think of titles.
0:55:49.619 --> 0:55:56.153
In newspaper articles titles are often capitalized: so you have then
seen the word in the title before,
0:55:56.153 --> 0:55:58.425
but in the running text you have never seen it.
0:55:58.898 --> 0:56:03.147
But there is always the decision.
0:56:03.123 --> 0:56:09.097
Do I gain more because I've seen things more
often or do I lose because now I remove information
0:56:09.097 --> 0:56:11.252
which helps me to the same degree?
0:56:11.571 --> 0:56:21.771
Because if we, for example, do that in German
and remove the case, this might be an important
0:56:21.771 --> 0:56:22.531
issue.
0:56:22.842 --> 0:56:30.648
So there is no perfect solution, but
generally you can avoid some errors by making things
0:56:30.648 --> 0:56:32.277
look more similar.
0:56:35.295 --> 0:56:43.275
What do people do in the current state of the art;
is the trend to do this more or
0:56:43.275 --> 0:56:43.813
less.
0:56:44.944 --> 0:56:50.193
It is done even less, because models get more
powerful, so it's not that important, but be
0:56:50.193 --> 0:56:51.136
careful.
0:56:51.136 --> 0:56:56.326
Partly it's also an evaluation thing, because these
things which are problematic happen
0:56:56.326 --> 0:56:57.092
very rarely.
0:56:57.092 --> 0:57:00.159
If you take average performance, it doesn't
matter.
0:57:00.340 --> 0:57:06.715
However, in between it makes these stupid
mistakes that don't count on average, but they
0:57:06.715 --> 0:57:08.219
are not really good.
0:57:09.089 --> 0:57:15.118
So normally you do some type of tokenization.
0:57:15.118 --> 0:57:19.911
You can do true casing or not.
0:57:19.911 --> 0:57:28.723
Some people nowadays don't do it, but that's
still done.
0:57:28.948 --> 0:57:34.441
Then it depends a bit on the type
of domain.
0:57:34.441 --> 0:57:37.437
Think again of software translation.
0:57:37.717 --> 0:57:46.031
So in the text there is sometimes a marker in
a menu entry for the letter
0:57:46.031 --> 0:57:49.957
that is used as the keyboard shortcut.
0:57:49.957 --> 0:57:57.232
Then you cannot match the word anymore, because it's no
longer "file" but has the marker inside.
0:57:58.018 --> 0:58:09.037
Then you cannot deal with it, so then it might
make sense to remove this.
0:58:12.032 --> 0:58:17.437
Now the next step is how to map words into
numbers.
0:58:17.437 --> 0:58:22.142
Machine learning models deal with numbers.
0:58:22.342 --> 0:58:27.091
The first idea is to use words as our basic
components.
0:58:27.247 --> 0:58:40.695
And then you have a large vocabulary where
each word gets mapped to an index.
0:58:40.900 --> 0:58:49.059
So your sentence "I go home" is now, for example, the
sequence of indices one, two, three, and that is your representation.
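
A minimal sketch of such a word-to-index mapping; the class name and the `<unk>` fallback (which anticipates the unknown-word discussion below) are illustrative assumptions, not a specific toolkit's API.

```python
class Vocabulary:
    """Maps words to integer ids; unseen words fall back to a single <unk> id."""
    def __init__(self, corpus):
        self.word2id = {"<unk>": 0}
        for sentence in corpus:
            for word in sentence.split():
                self.word2id.setdefault(word, len(self.word2id))
        self.id2word = {i: w for w, i in self.word2id.items()}

    def encode(self, sentence):
        return [self.word2id.get(w, 0) for w in sentence.split()]

vocab = Vocabulary(["I go home", "he goes home"])
print(vocab.encode("I go home"))      # [1, 2, 3]
print(vocab.encode("she goes home"))  # 'she' is unknown -> id 0
```
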
0:58:52.052 --> 0:59:00.811
So the nice thing is you have very short sequences
so that you can deal with them.
0:59:00.811 --> 0:59:01.867
However:
0:59:01.982 --> 0:59:11.086
you have not really modeled how words
are built internally.
0:59:11.086 --> 0:59:16.951
Why, or when, can that be a problem?
0:59:17.497 --> 0:59:20.741
And there is an easy solution to deal with
unknown words.
0:59:20.741 --> 0:59:22.698
You just have one special token for all unknown words.
0:59:23.123 --> 0:59:25.906
You replace maybe some rare words in your training
data with it, so the model learns to deal with it.
0:59:26.206 --> 0:59:34.938
That's working a bit for some problems, but
in general it's not good because you know nothing
0:59:34.938 --> 0:59:35.588
about the word.
0:59:35.895 --> 0:59:38.770
You can at least deal with it and maybe map
it.
0:59:38.770 --> 0:59:44.269
So an easy solution in machine translation
is: if it's an unknown word, we just
0:59:44.269 --> 0:59:49.642
copy it to the target side because unknown
words are often named entities and in many
0:59:49.642 --> 0:59:52.454
languages the best solution is just to keep them.
0:59:53.013 --> 1:00:01.203
So that is somehow a trick, but yeah,
that's of course not a real solution.
1:00:01.821 --> 1:00:08.959
Another problem if you deal with full
words is that you have very few examples for
1:00:08.959 --> 1:00:09.451
some of them.
1:00:09.949 --> 1:00:17.696
And of course if you've seen a word once you
can somehow translate it, but we will
1:00:17.696 --> 1:00:24.050
learn that in neural networks you represent words
with continuous vectors.
1:00:24.264 --> 1:00:26.591
If you have seen them only two, three or four times,
1:00:26.591 --> 1:00:31.246
the representation is not really well learned, and you
typically make the most errors on words you have
1:00:31.246 --> 1:00:31.763
rarely seen.
1:00:33.053 --> 1:00:40.543
And yeah, you cannot deal with things which
are inside the word.
1:00:40.543 --> 1:00:50.303
So if you know that "house" is, say, index one hundred
and twelve, and you now see "houses", you have
1:00:50.303 --> 1:00:51.324
no idea.
1:00:51.931 --> 1:00:55.533
That is of course not really convenient, so humans
are better here.
1:00:55.533 --> 1:00:58.042
They can use the internal information.
1:00:58.498 --> 1:01:04.080
So if we have "houses" you'll know that it's
the plural form of "house".
1:01:05.285 --> 1:01:16.829
And for the ones who don't know it in advance:
you have this nice word here, and I guess
1:01:16.716 --> 1:01:20.454
Don't know the meaning of these words.
1:01:20.454 --> 1:01:25.821
However, all of you will know it is the fear
of something.
1:01:26.686 --> 1:01:39.437
From the ending: "-phobia" always means
the fear of something, even if you don't know of what.
1:01:39.879 --> 1:01:46.618
So splitting words into some parts
is helpful to deal with this.
1:01:46.618 --> 1:01:49.888
Here, for example, you at least know it is a fear of something.
1:01:50.450 --> 1:02:04.022
It's not very important, it's not how to happen
very often, but yeah, it's also not important
1:02:04.022 --> 1:02:10.374
for understanding that you know everything.
1:02:15.115 --> 1:02:18.791
So what can we do instead?
1:02:18.791 --> 1:02:29.685
One thing which we could do instead is to go to
the other extreme and represent words by characters.
1:02:29.949 --> 1:02:42.900
So you really split everything into single characters,
and then you also need a space symbol.
1:02:43.203 --> 1:02:55.875
So you have now a representation for each
character that enables you to implicitly learn
1:02:55.875 --> 1:03:01.143
morphology, because words which have a similar form share characters.
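
A tiny sketch of this character-level view; using an underscore as the explicit space symbol is just one possible convention.

```python
def to_characters(sentence, space_symbol="_"):
    """Represent a sentence as a sequence of characters, making spaces explicit."""
    return [space_symbol if ch == " " else ch for ch in sentence]

chars = to_characters("I go home")
print(chars)                     # ['I', '_', 'g', 'o', '_', 'h', 'o', 'm', 'e']
char_vocab = sorted(set(chars))  # tiny vocabulary, but a much longer sequence
print(len(chars), len(char_vocab))
```
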
1:03:01.541 --> 1:03:05.517
And you can then deal with unknown words.
1:03:05.517 --> 1:03:10.344
There's still not everything you can process,
but.
1:03:11.851 --> 1:03:16.953
So if you would go to the character level, what might
still be a problem?
1:03:18.598 --> 1:03:24.007
All characters which you haven't seen;
that nowadays happens a bit more often
1:03:24.007 --> 1:03:25.140
with new emojis.
1:03:25.140 --> 1:03:26.020
You couldn't.
1:03:26.020 --> 1:03:31.366
It could also be that you translate
between English and German, and then there is
1:03:31.366 --> 1:03:35.077
a Japanese character or Chinese that you cannot
translate.
1:03:35.435 --> 1:03:43.938
But most of the time all characters that occur
have been seen, so that usually works very well.
1:03:44.464 --> 1:03:58.681
This is at first a nice thing: you have a
very small vocabulary size, and one big part
1:03:58.681 --> 1:04:01.987
of the computation in
1:04:02.222 --> 1:04:11.960
neural networks depends on the
vocabulary size, so if you are efficient there
1:04:11.960 --> 1:04:13.382
it's better.
1:04:14.914 --> 1:04:26.998
On the other hand, the problem is you now have
very long sequences; if you think about
1:04:26.998 --> 1:04:29.985
the example from before, you have many more symbols.
1:04:30.410 --> 1:04:43.535
Your computation often depends on your input
size, and not only linearly but quadratically or
1:04:43.535 --> 1:04:44.410
more.
1:04:44.504 --> 1:04:49.832
And of course it might also be that you just
generally make things more complicated than
1:04:49.832 --> 1:04:50.910
they were before.
1:04:50.951 --> 1:04:58.679
We said before: make things easy; but now, if
we really have to analyze each character independently,
1:04:58.679 --> 1:05:05.003
we cannot directly learn what "university" means
as one unit, but we have to learn that there
1:05:05.185 --> 1:05:12.179
is a "u" at the beginning, and then there is an "i" and then
there is an "e", and all this together means
1:05:12.179 --> 1:05:17.273
"university", but another combination of these
letters is a completely different word.
1:05:17.677 --> 1:05:24.135
So of course you make everything here a lot
more complicated than on the word basis.
1:05:24.744 --> 1:05:32.543
Character based models work very well in conditions
with few data because you have seen the words
1:05:32.543 --> 1:05:33.578
very rarely.
1:05:33.578 --> 1:05:38.751
That is hard to learn well, but you have seen all
letters more often.
1:05:38.751 --> 1:05:44.083
So if you have scenarios with very little data
this is one good option.
1:05:46.446 --> 1:05:59.668
The other idea is to split, but not at one of the
extremes, so neither taking full words nor taking
1:05:59.668 --> 1:06:06.573
only characters, but doing something in between.
1:06:07.327 --> 1:06:12.909
And one of these ideas has been done for a
long time.
1:06:12.909 --> 1:06:17.560
It's called compound splitting, but it was only done for compounds.
1:06:17.477 --> 1:06:18.424
Take "Baumstumpf".
1:06:18.424 --> 1:06:24.831
You see that "Baum" and "Stumpf" occur very often,
maybe more often than "Baumstumpf".
1:06:24.831 --> 1:06:28.180
Then you split it into "Baum" and "Stumpf" and you use
these.
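
A minimal sketch of this frequency-based idea; the corpus counts and the geometric-mean scoring are assumptions for illustration, and real compound splitters also handle linking letters.

```python
from math import prod

def split_compound(word, counts, min_len=3):
    """Split a word into two parts if the parts score higher (geometric mean of
    their counts) than the count of the whole word."""
    best = (counts.get(word, 0), [word])
    for i in range(min_len, len(word) - min_len + 1):
        parts = [word[:i], word[i:]]
        freqs = [counts.get(p, 0) or counts.get(p.capitalize(), 0) for p in parts]
        if all(freqs):
            score = prod(freqs) ** (1 / len(freqs))   # geometric mean
            if score > best[0]:
                best = (score, parts)
    return best[1]

counts = {"Baum": 50, "Stumpf": 30, "Baumstumpf": 2}
print(split_compound("Baumstumpf", counts))   # ['Baum', 'stumpf']
```
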
1:06:29.509 --> 1:06:44.165
But it's not so easy; it will learn wrong
splits. We did that in our old systems, and
1:06:44.165 --> 1:06:47.708
there is the word "asiatisch", which got split into "Asia" and "Tisch".
1:06:48.288 --> 1:06:56.137
And that, of course, is not a really
good way of splitting it, because it is not semantic.
1:06:56.676 --> 1:07:05.869
The good thing is we didn't really care that
much about it, because the system anyway learned that
1:07:05.869 --> 1:07:09.428
if you have "Asia" and "Tisch" together, it means "asiatisch".
1:07:09.729 --> 1:07:17.452
So you can of course learn all that, but the compound
splitting doesn't really help you to get a deeper
1:07:17.452 --> 1:07:18.658
understanding.
1:07:21.661 --> 1:07:23.364
The thing, of course, is that it can also go wrong.
1:07:23.943 --> 1:07:30.475
Yeah, there was one paper reporting where this does
not work; I think it was called "Burning
1:07:30.475 --> 1:07:30.972
Ducks.
1:07:30.972 --> 1:07:37.503
I think it was like: if you had a German compound,
you could split it in the wrong place,
1:07:37.503 --> 1:07:43.254
and sometimes you have to add an "e" to make
the compounds, and that is what happened there.
1:07:43.583 --> 1:07:48.515
So it translated "Esperanto" into "burning duck".
1:07:48.888 --> 1:07:56.127
So of course you can introduce some
additional errors there, but in general
1:07:56.127 --> 1:07:57.221
it's a good approach.
1:07:57.617 --> 1:08:03.306
Of course there is a trade off between vocabulary
size so you want to have a lower vocabulary
1:08:03.306 --> 1:08:08.812
size so you've seen everything more often but
the length of the sequence should not be too
1:08:08.812 --> 1:08:13.654
long, because if you split more often you get
fewer different types but longer sequences.
1:08:16.896 --> 1:08:25.281
The motivation, and the advantage compared
to character-based models, is that you can directly
1:08:25.281 --> 1:08:33.489
learn the representation for words that occur
very often, while still being able to represent
1:08:33.489 --> 1:08:35.783
words that are rare.
1:08:36.176 --> 1:08:42.973
And while first this was only done for compounds,
nowadays there's an algorithm which really
1:08:42.973 --> 1:08:49.405
tries to do it for everything, and there are
different ways, to be honest, like compound splitting
1:08:49.405 --> 1:08:50.209
and so on.
1:08:50.209 --> 1:08:56.129
But the most successful one which is commonly
used is based on data compression.
1:08:56.476 --> 1:08:59.246
And there the idea is:
1:08:59.246 --> 1:09:06.765
can we find an encoding so that the text is
compressed in the most efficient way?
1:09:07.027 --> 1:09:22.917
And the compression algorithm is called
byte-pair encoding, and this is also then used
1:09:22.917 --> 1:09:25.625
for splitting.
1:09:26.346 --> 1:09:39.164
And the idea is: we recursively replace the
most frequent pair of bytes by a new byte.
1:09:39.819 --> 1:09:51.926
For language, you now first split all your
words into letters, and then you look at what
1:09:51.926 --> 1:09:59.593
is the most frequent bigram, so which two letters
occur together most often.
1:10:00.040 --> 1:10:04.896
And then you replace it, and you repeat this until you
have a fixed vocabulary size.
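
To make the merge loop concrete, here is a minimal Python sketch of learning such merges; the "</w>" end-of-word marker, the toy word list and all names are illustrative, not a particular toolkit's implementation.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn merge rules: repeatedly merge the most frequent adjacent symbol pair."""
    # Each word is a tuple of symbols, ending with an end-of-word marker.
    corpus = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

print(learn_bpe(["go", "goes", "goes", "he", "she"], num_merges=4))
```
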
1:10:04.985 --> 1:10:08.031
So that's a nice thing.
1:10:08.031 --> 1:10:16.663
Now you can predefine your vocabulary size, so with how
many symbols you want to represent your text
1:10:16.936 --> 1:10:28.486
beforehand, and then you can represent any text
with these symbols, and of course the more merges, the shorter
1:10:28.486 --> 1:10:30.517
your text will be.
1:10:32.772 --> 1:10:36.543
So the original idea was something like that.
1:10:36.543 --> 1:10:39.411
We have the sequence A, B, A, B, C.
1:10:39.411 --> 1:10:45.149
For example, a common bigram is A B, so
you can replace it with a new symbol, say D.
1:10:45.149 --> 1:10:46.788
Then the text gets shorter.
1:10:48.108 --> 1:10:53.615
Then you can merge again, and so on, and
this is then your compressed text.
1:10:54.514 --> 1:11:00.691
Similarly, we can do it now for tokenization.
1:11:01.761 --> 1:11:05.436
Let's assume you have these sentences.
1:11:05.436 --> 1:11:11.185
"I go", "he goes", "she goes", so your word vocabulary
is "I", "go", "goes", "he" and "she".
1:11:11.851 --> 1:11:30.849
And the first thing you're doing is to split
your corpus into single characters.
1:11:30.810 --> 1:11:34.692
So thereby you can still split into words again,
like you split sentences into words.
1:11:34.692 --> 1:11:38.980
Because now you only have characters, you
don't know the word boundaries.
1:11:38.980 --> 1:11:44.194
You introduce the word boundaries by having
a special symbol at the end of each word, and
1:11:44.194 --> 1:11:46.222
then whenever this symbol occurs, you know:
1:11:46.222 --> 1:11:48.366
I can split here and a new word starts.
1:11:48.708 --> 1:11:55.245
So you have the corpus I go, he goes, and
she goes, and then you have now here the sequences
1:11:55.245 --> 1:11:56.229
of characters.
1:11:56.229 --> 1:12:02.625
So this is the character-based representation,
and now you calculate the bigram statistics.
1:12:02.625 --> 1:12:08.458
So "I" plus the end-of-word symbol occurs one time,
"g" and "o" occur together three times, and so on.
1:12:09.189 --> 1:12:18.732
And these are all the others, and now you
look which pair is the most common.
1:12:19.119 --> 1:12:26.046
So then you get the first rule.
1:12:26.046 --> 1:12:39.235
You merge "g" and "o" together and you have this new
symbol: "go" is now no longer two symbols, but it's
1:12:39.235 --> 1:12:41.738
one single symbol, because you join them.
1:12:42.402 --> 1:12:51.175
And then you have here now the new
bigram counts, and so on.
1:12:52.092 --> 1:13:01.753
In small examples you now have a lot of pairs
which occur the same number of times.
1:13:01.753 --> 1:13:09.561
In reality that is happening sometimes but
not that often.
1:13:10.370 --> 1:13:21.240
You add, for example, the end-of-word symbol to "he", and this
way you go on until you have your vocabulary.
1:13:21.601 --> 1:13:38.242
And your vocabulary is then these rules, so
people also speak of the rules as the vocabulary.
1:13:38.658 --> 1:13:43.637
And these are the rules, and if you now have
a different sentence, something like "they tell",
1:13:44.184 --> 1:13:53.600
then your final output looks something
like this:
1:13:53.600 --> 1:13:59.250
these two words are represented by the learned subword units.
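
Applying the learned rules to new text can then look like this sketch; it reuses the merge-list format from the learning sketch above, and the example merges are made up for illustration.

```python
def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        i, merged = 0, []
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

merges = [("g", "o"), ("e", "s"), ("es", "</w>"), ("go", "es</w>")]
print(apply_bpe("goes", merges))   # ['goes</w>']
print(apply_bpe("gone", merges))   # ['go', 'n', 'e', '</w>'] -- an unseen word is still representable
```
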
1:14:00.940 --> 1:14:06.398
And that is your algorithm.
1:14:06.398 --> 1:14:18.873
Now you can represent any type of text with
a fixed vocabulary.
1:14:20.400 --> 1:14:23.593
So that is defined in the beginning?
1:14:23.593 --> 1:14:27.243
Like how many merges you do, and that is set?
1:14:28.408 --> 1:14:35.253
It's nearly correct; in addition there is the number
of characters.
1:14:35.253 --> 1:14:38.734
It can be that there are additional symbols.
1:14:38.878 --> 1:14:49.162
So on the one hand all the right-hand sides
of the rules can occur, and then additionally
1:14:49.162 --> 1:14:49.721
all single characters.
1:14:49.809 --> 1:14:55.851
In reality it can even happen that your
vocabulary is smaller, because it might
1:14:55.851 --> 1:15:01.960
happen that, for example, "go" never occurs
on its own in the end, because you always merge
1:15:01.960 --> 1:15:06.793
all occurrences, so not all right-hand
sides really occur, because
1:15:06.746 --> 1:15:11.269
this rule is never applied alone, but afterwards
another rule is also applied.
1:15:11.531 --> 1:15:15.621
So it's more an upper bound on your vocabulary
than a fixed size.
1:15:20.480 --> 1:15:29.014
Then we come to the last part, which is about
parallel data, but we have some questions beforehand.
1:15:36.436 --> 1:15:38.824
So what is parallel data?
1:15:38.824 --> 1:15:47.368
So as we said, for machine translation it is really,
really important that we are dealing with parallel
1:15:47.368 --> 1:15:52.054
data; that means we have aligned input and
output.
1:15:52.054 --> 1:15:54.626
You have this type of data.
1:15:55.015 --> 1:16:01.773
However, in machine translation we have one
very big advantage that is somewhat naturally
1:16:01.773 --> 1:16:07.255
occurring, so you have a lot of parallel data
which you can gather somewhere.
1:16:07.255 --> 1:16:13.788
In many NLP tasks you need to manually annotate
your data and generate the aligned data.
1:16:14.414 --> 1:16:22.540
We would have to manually create translations, and
of course that is very expensive; it's
1:16:22.540 --> 1:16:29.281
really expensive to pay for like one million
sentences to be translated.
1:16:29.889 --> 1:16:36.952
The nice thing is that there is normally data
available, because other people have done the
1:16:36.952 --> 1:16:37.889
translation.
1:16:40.120 --> 1:16:44.672
So there is this data, and of course you have to process
it.
1:16:44.672 --> 1:16:51.406
We'll have a full lecture on how to deal with
more complex situations.
1:16:52.032 --> 1:16:56.645
The idea is really you don't do really much
human work.
1:16:56.645 --> 1:17:02.825
You really just start the crawler with some
initial start pages and then let it run.
1:17:03.203 --> 1:17:07.953
But a lot of high-quality parallel data is really
targeted at some scenarios.
1:17:07.953 --> 1:17:13.987
So, for example, think of the European Parliament
as one website where you can easily extract
1:17:13.987 --> 1:17:17.581
this information from, and there you have a
large data set.
1:17:17.937 --> 1:17:22.500
Or like we have the TED data, which is also
you can get from the TED website.
1:17:23.783 --> 1:17:33.555
So in general, a parallel corpus is a collection
of texts with translations into one or several languages.
1:17:34.134 --> 1:17:42.269
And this data is important because normally there is
no general MT system; instead you work on specific scenarios.
1:17:42.222 --> 1:17:46.732
It works especially well if your training
and test conditions are similar.
1:17:46.732 --> 1:17:50.460
So if the topic is similar, the style or modality
is similar.
1:17:50.460 --> 1:17:55.391
So if you want to translate speech, it's often
better to also train on speech.
1:17:55.391 --> 1:17:58.818
If you want to translate text, it's better
to train on text.
1:17:59.379 --> 1:18:08.457
And there is a lot of these data available
nowadays for common languages.
1:18:08.457 --> 1:18:12.014
You normally can start with.
1:18:12.252 --> 1:18:15.298
It's really available.
1:18:15.298 --> 1:18:27.350
For example, Opus is a big website collecting
different types of parallel corpus where you
1:18:27.350 --> 1:18:29.601
can select them.
1:18:29.529 --> 1:18:33.276
You have this document alignment; we will come
to that later.
1:18:33.553 --> 1:18:39.248
There are things like comparable data, where
you have not full sentences but only some parts
1:18:39.248 --> 1:18:40.062
that are parallel.
1:18:40.220 --> 1:18:48.700
But now first let's assume we have an easy task
like the European Parliament, where we have the speech
1:18:48.700 --> 1:18:55.485
in German and the speech in English and you
need to generate parallel data.
1:18:55.485 --> 1:18:59.949
That means you have to align the source and target sentences.
1:19:00.000 --> 1:19:01.573
And doing this right.
1:19:05.905 --> 1:19:08.435
How can we do that?
1:19:08.435 --> 1:19:19.315
And that is what people refer to as sentence
alignment, so we have parallel documents in
1:19:19.315 --> 1:19:20.707
languages.
1:19:22.602 --> 1:19:32.076
You cannot normally do that word
by word, because there is no direct correspondence
1:19:32.076 --> 1:19:34.158
between them, but it is
1:19:34.074 --> 1:19:39.837
relatively possible to do it on the sentence level.
It will not be perfect, so you sometimes have
1:19:39.837 --> 1:19:42.535
two sentences in English and one in German.
1:19:42.535 --> 1:19:47.992
Germans like to have these long sentences with
sub-clauses and so on, so there you can do
1:19:47.992 --> 1:19:51.733
it, but with long sentences it might not be
really possible.
1:19:55.015 --> 1:19:59.454
And for some languages we saw that sentence markers aren't
there, so it's more complicated.
1:19:59.819 --> 1:20:10.090
So how can we formalize this sentence alignment
problem?
1:20:10.090 --> 1:20:16.756
So we have a set of source sentences.
1:20:17.377 --> 1:20:22.167
And machine translation relatively often.
1:20:22.167 --> 1:20:32.317
Sometimes source sentences nowadays are and,
but traditionally it was and because people
1:20:32.317 --> 1:20:34.027
started using.
1:20:34.594 --> 1:20:45.625
And then the idea is to find this alignment,
where we have aligned segments.
1:20:46.306 --> 1:20:50.421
And of course you want these sequences to
be as short as possible.
1:20:50.421 --> 1:20:56.400
Of course an easy solution is: here are all my
source sentences and here all my target sentences.
1:20:56.756 --> 1:21:07.558
So you want to have short sequences there, typically
one sentence or at most two or three sentences,
1:21:07.558 --> 1:21:09.340
so that they really correspond.
1:21:13.913 --> 1:21:21.479
Then there are different restrictions on
this type of alignment; so first of all
1:21:21.479 --> 1:21:29.131
it should be a monotone alignment, so that
means that each segment on the source should
1:21:29.131 --> 1:21:31.218
start after the previous one.
1:21:31.431 --> 1:21:36.428
So we assume that the document is really monotone
and goes the same way in source and target.
1:21:36.957 --> 1:21:41.965
Of course, for a very free translation that might
not be valid anymore.
1:21:41.965 --> 1:21:49.331
But this algorithm, the classic Gale
and Church algorithm, is more for translations
1:21:49.331 --> 1:21:51.025
which are very direct.
1:21:51.025 --> 1:21:54.708
So each segment should come right after the previous
one.
1:21:55.115 --> 1:22:04.117
Then we want to cover the full sequence,
and of course each segment should start before
1:22:04.117 --> 1:22:04.802
the next one.
1:22:05.525 --> 1:22:22.654
And then you want to have something like this,
where you can also have two-to-one or one-to-two alignments.
1:22:25.525 --> 1:22:41.851
The alignment types are these: you then, of course,
sometimes have insertions and deletions, where there
1:22:41.851 --> 1:22:43.858
is some information added.
1:22:44.224 --> 1:22:50.412
That can be, for example, an explanation, so it can
be that some term is known in the one language
1:22:50.412 --> 1:22:51.018
but not in the other.
1:22:51.111 --> 1:22:53.724
Think of things like Deutschland ticket.
1:22:53.724 --> 1:22:58.187
In Germany everybody will by now know what
the Deutschland ticket is.
1:22:58.187 --> 1:23:03.797
But if you translate it to English it might
be important to explain it and other things
1:23:03.797 --> 1:23:04.116
are.
1:23:04.116 --> 1:23:09.853
So sometimes you have to explain things and
then you have more sentences with insertions.
1:23:10.410 --> 1:23:15.956
Then you have two-to-one and one-to-two alignments,
and that is, for example, because in German you have
1:23:15.956 --> 1:23:19.616
a lot of sub-clauses and things that are expressed
in English by two sentences.
1:23:20.580 --> 1:23:37.725
Of course, it might be more complex, but typically
you make it simple and only allow for these types
1:23:37.725 --> 1:23:40.174
of alignment.
1:23:41.301 --> 1:23:56.588
Then it is about finding the alignment, and
for that we try to score alignments, where we just take
1:23:56.588 --> 1:23:59.575
a general score.
1:24:00.000 --> 1:24:04.011
That is done like in the Gale and Church algorithm,
based on the matching of one segment.
1:24:04.011 --> 1:24:09.279
If you have one segment now, so this is one
of the global assumptions, the global alignment
1:24:09.279 --> 1:24:13.828
is as good as the product of all single steps,
and then you have two scores.
1:24:13.828 --> 1:24:18.558
First of all you say one-to-one alignments
are much more likely than all the others.
1:24:19.059 --> 1:24:26.884
And then you have a lexical similarity, which
is, for example, based on an initial dictionary
1:24:26.884 --> 1:24:30.713
which counts how many dictionary entries are matched.
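
To make this scoring idea concrete, here is a hedged sketch of a per-segment score combining an alignment-type prior with a dictionary-overlap similarity; the prior weights, the toy dictionary and all function names are assumptions for illustration (the classic Gale and Church model additionally scores sentence lengths, and a full aligner would still have to search over segmentations, e.g. with dynamic programming).

```python
from math import log

# Prior preference for different alignment types (1-1 links are the most likely).
TYPE_LOG_PRIOR = {(1, 1): log(0.89), (1, 0): log(0.01), (0, 1): log(0.01),
                  (2, 1): log(0.045), (1, 2): log(0.045)}

def lexical_score(src_sentences, tgt_sentences, dictionary):
    """Fraction of source words whose dictionary translation appears on the target side."""
    src_words = [w for s in src_sentences for w in s.lower().split()]
    tgt_words = {w for s in tgt_sentences for w in s.lower().split()}
    if not src_words:
        return 0.0
    hits = sum(1 for w in src_words if dictionary.get(w) in tgt_words)
    return hits / len(src_words)

def segment_score(src_sentences, tgt_sentences, dictionary):
    """Score one aligned segment; a full aligner sums such scores over the document."""
    prior = TYPE_LOG_PRIOR.get((len(src_sentences), len(tgt_sentences)), log(1e-4))
    return prior + lexical_score(src_sentences, tgt_sentences, dictionary)

dictionary = {"ich": "i", "gehe": "go", "haus": "house"}
print(segment_score(["ich gehe"], ["I go"], dictionary))              # good 1-1 link
print(segment_score(["ich gehe", "das haus"], ["I go home"], dictionary))  # weaker 2-1 link
```
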
1:24:31.091 --> 1:24:35.407
So this is a very simple algorithm.
1:24:35.407 --> 1:24:41.881
Typically it is like your first step when
you want parallel data.
1:24:43.303 --> 1:24:54.454
And with this one you can get
an initial alignment, and then you can have better parallel
1:24:54.454 --> 1:24:55.223
data.
1:24:55.675 --> 1:25:02.369
Now, it is an optimization problem:
based on the scores you can calculate
1:25:02.369 --> 1:25:07.541
a score for each possible alignment and then
select the best one.
1:25:07.541 --> 1:25:14.386
Of course, you won't try all possibilities
out but you can do a good search and then find
1:25:14.386 --> 1:25:15.451
the best one.
1:25:15.815 --> 1:25:18.726
This can typically be done automatically.
1:25:18.726 --> 1:25:25.456
Of course, you should do some checks, like
aligning as many sentences as possible.
1:25:26.766 --> 1:25:32.043
A bill like typically for training data is
done this way.
1:25:32.043 --> 1:25:35.045
Maybe if you have test data you.
1:25:40.000 --> 1:25:47.323
Sorry, I'm a bit late because originally wanted
to do a quiz at the end.
1:25:47.323 --> 1:25:49.129
Can we do a quiz?
1:25:49.429 --> 1:25:51.833
We'll do it somewhere else.
1:25:51.833 --> 1:25:56.813
We had a bachelor project about making quizzes
for lectures.
1:25:56.813 --> 1:25:59.217
And I still want to try it.
1:25:59.217 --> 1:26:04.197
So let's see I hope in some other lecture
we can do that.
1:26:04.197 --> 1:26:09.435
Then we can at the end of the lecture do
some quiz about it.
1:26:09.609 --> 1:26:13.081
All we can do is the practical thing; let's
see.
1:26:13.533 --> 1:26:24.719
And that's it for today. What you should remember is
what parallel data is and how we can
1:26:25.045 --> 1:26:29.553
create parallel data, and how to generally
process data.
1:26:29.553 --> 1:26:36.435
How you think about data is really important
if you build systems, and there are different ways
1:26:36.696 --> 1:26:46.857
of representing text: the three main options are full words,
the character level, or subword units.
1:26:47.687 --> 1:26:49.634
Is there any question?
1:26:52.192 --> 1:26:57.768
Yes, is this alignment thing done like dynamic
time warping?
1:27:00.000 --> 1:27:05.761
It's not directly using dynamic time warping,
but the idea is similar and you can use all
1:27:05.761 --> 1:27:11.771
these types of similar algorithms; the
main thing, and the difficulty,
1:27:11.771 --> 1:27:14.807
is to define your loss function
here.
1:27:14.807 --> 1:27:16.418
What is a good alignment?
1:27:16.736 --> 1:27:24.115
But as in dynamic time warping, you
have a monotone alignment in there, and you
1:27:24.115 --> 1:27:26.150
cannot have reordering.
1:27:30.770 --> 1:27:40.121
Then thanks a lot, and next time we
will then continue and discuss this further.
|