WEBVTT
0:00:01.561 --> 0:00:05.186
Okay So Um.
0:00:08.268 --> 0:00:17.655
Welcome to today's presentation of the second
class on machine translation, where today we'll
0:00:17.655 --> 0:00:25.044
do a bit of a specific topic and talk
about linguistic background.
0:00:26.226 --> 0:00:34.851
We will cover three different parts in
the lecture.
0:00:35.615 --> 0:00:42.538
We'll do first a very, very brief introduction
to linguistic background, in the sense of: what
0:00:42.538 --> 0:00:49.608
is language, what are ways of describing language,
what are the theories behind it; very, very
0:00:49.608 --> 0:00:50.123
short.
0:00:50.410 --> 0:00:57.669
I don't know; some of you have listened, I think,
to NLP in the last semester or so.
0:00:58.598 --> 0:01:02.553
So there we did a lot longer explanation.
0:01:02.553 --> 0:01:08.862
Here it is just brief, because here we are talking
about machine translation.
0:01:09.109 --> 0:01:15.461
So it's really focused on the parts which
are important when we talk about machine translation.
0:01:15.755 --> 0:01:19.377
Though for everybody who has listened to that
already, it's a bit of a repetition.
0:01:19.377 --> 0:01:19.683
Maybe.
0:01:19.980 --> 0:01:23.415
But it's really trying to look at:
0:01:23.415 --> 0:01:31.358
These are properties of languages, and how
can they influence translation?
0:01:31.671 --> 0:01:38.928
We'll use that in the second part to discuss
why machine translation is hard, given what we
0:01:38.928 --> 0:01:40.621
know about language.
0:01:40.940 --> 0:01:47.044
We will see that there are two main things:
one is that languages might express ideas and
0:01:47.044 --> 0:01:53.279
information differently, and if they are expressed
differently in different languages, we have to
0:01:53.279 --> 0:01:54.920
do somehow the transfer.
0:01:55.135 --> 0:02:02.771
And it's not simply that we know which words
are used for it; it's not that simple, and very
0:02:02.771 --> 0:02:03.664
different.
0:02:04.084 --> 0:02:10.088
And the other problem we mentioned last time
about biases is that there's not always the
0:02:10.088 --> 0:02:12.179
same amount of information in both languages.
0:02:12.592 --> 0:02:18.206
So it can be that there's more information
in the one, or you can't express that little information
0:02:18.206 --> 0:02:19.039
in the target.
0:02:19.039 --> 0:02:24.264
We had that also, for example, with the example
of the rice plant: in German, we would just
0:02:24.264 --> 0:02:24.820
say rice.
0:02:24.904 --> 0:02:33.178
Or in English; while in other languages you
have to distinguish between the rice plant and rice
0:02:33.178 --> 0:02:33.724
as a.
0:02:34.194 --> 0:02:40.446
And then it's not always possible to directly
infer this from the surface.
0:02:41.781 --> 0:02:48.501
And if we make it to the last point (otherwise
we'll do that next Tuesday, or we'll partly
0:02:48.501 --> 0:02:55.447
do it only there): we'll describe briefly
the three main approaches to rule-based, so
0:02:55.447 --> 0:02:59.675
linguistically motivated, ways of doing machine
translation.
0:02:59.779 --> 0:03:03.680
We mentioned them last time: the direct
translation,
0:03:03.680 --> 0:03:10.318
the translation by transfer, and the interlingua-based
translation; we will do that a bit more in detail today.
0:03:10.590 --> 0:03:27.400
But very briefly, because this is not a focus
of this class, and then next week we will continue.
0:03:29.569 --> 0:03:31.757
Why do we think this is important?
0:03:31.757 --> 0:03:37.259
On the one hand, of course, we are dealing
with natural language, so therefore it might
0:03:37.259 --> 0:03:43.074
be good to spend a bit of time in understanding
what we are really dealing with because this
0:03:43.074 --> 0:03:45.387
is challenging; there are many problems here.
0:03:45.785 --> 0:03:50.890
And on the other hand, this was the first
way of doing machine translation.
0:03:51.271 --> 0:04:01.520
Therefore, it's interesting to understand
what was the idea behind that and also to later
0:04:01.520 --> 0:04:08.922
see what is done differently, and to understand
why some models behave as they do.
0:04:13.453 --> 0:04:20.213
When we're talking about linguistics, we can
of course do that on different levels and there's
0:04:20.213 --> 0:04:21.352
different ways.
0:04:21.521 --> 0:04:26.841
On the right side here you are seeing the
basic levels of linguistics.
0:04:27.007 --> 0:04:31.431
So we have at the bottom the phonetics and
phonology.
0:04:31.431 --> 0:04:38.477
Phonetics we will not cover this year, because we
are mainly focusing on text input, where we
0:04:38.477 --> 0:04:42.163
directly have characters and then words.
0:04:42.642 --> 0:04:52.646
Then what we touch on today, at least to mention
what it is, is morphology, which is the first
0:04:52.646 --> 0:04:53.424
level.
0:04:53.833 --> 0:04:59.654
Already mentioned it a bit on Tuesday that
of course there are some languages where this
0:04:59.654 --> 0:05:05.343
is very, very basic and there is not really
a lot of rules of how you can build words.
0:05:05.343 --> 0:05:11.099
But since I assume you all have some basic
knowledge of German, there are a lot more
0:05:11.099 --> 0:05:12.537
challenges than that.
0:05:13.473 --> 0:05:20.030
You know, maybe if you're a native speaker
that's quite easy and everything is clear,
0:05:20.030 --> 0:05:26.969
but if you have to learn it, like the endings
of a word; we are famous for doing composita
0:05:26.969 --> 0:05:29.103
and putting words together.
0:05:29.103 --> 0:05:31.467
So this is like the first level.
0:05:32.332 --> 0:05:40.268
Then we have the syntax, which is both on
the word and on the sentence level, and that's
0:05:40.268 --> 0:05:43.567
about the structure of the sentence.
0:05:43.567 --> 0:05:46.955
What are the functions of some words?
0:05:47.127 --> 0:05:51.757
You might remember part-of-speech tags from
your high school time.
0:05:51.757 --> 0:05:57.481
There are nouns and adjectives and things
like that, and this is something helpful.
0:05:57.737 --> 0:06:03.933
Just imagine: in the beginning it was
not only used for rule-based but also for statistical
0:06:03.933 --> 0:06:10.538
machine translation, for example, the reordering
between languages was quite a challenging task.
0:06:10.770 --> 0:06:16.330
Especially if you have long-range reorderings,
and there part-of-speech information is very
0:06:16.330 --> 0:06:16.880
helpful.
0:06:16.880 --> 0:06:20.301
You know, in German you have to move the verb
0:06:20.260 --> 0:06:26.599
To the second position, if you have Spanish
you have to change the noun and the adjective
0:06:26.599 --> 0:06:30.120
so information from part of speech can be
very helpful.
0:06:30.410 --> 0:06:38.621
Then you have a syntax-based structure, where
you have a full syntax tree in the beginning,
0:06:38.621 --> 0:06:43.695
and then it came into statistical machine translation.
0:06:44.224 --> 0:06:50.930
And it got more and more important for statistical
machine translation that you are really trying
0:06:50.930 --> 0:06:53.461
to model the whole syntax tree of a
0:06:53.413 --> 0:06:57.574
sentence, in order to better match how to do
that in
0:06:57.574 --> 0:07:04.335
the target language. Yeah, the syntax-based
statistical machine translation had a
0:07:04.335 --> 0:07:05.896
bit of a problem.
0:07:05.896 --> 0:07:08.422
It got better and better and was.
0:07:08.368 --> 0:07:13.349
It was just on the way to getting better in some
languages than traditional statistical models.
0:07:13.349 --> 0:07:18.219
But then the neural models came up and they
were just so much better at modelling that
0:07:18.219 --> 0:07:19.115
all implicitly.
0:07:19.339 --> 0:07:23.847
So they were never used in practice
so much.
0:07:24.304 --> 0:07:34.262
And then we'll talk about the semantics, so
what is the meaning of the words?
0:07:34.262 --> 0:07:40.007
We saw last time that words can have different meanings.
0:07:40.260 --> 0:07:46.033
And yeah, how you represent meaning, of course,
is very challenging.
0:07:45.966 --> 0:07:53.043
And normally, formalizing this is
typically done in quite limited domains, because
0:07:53.043 --> 0:08:00.043
doing that for all possible words
has not really been achieved yet; it is very
0:08:00.043 --> 0:08:00.898
challenging.
0:08:02.882 --> 0:08:09.436
Then about pragmatics: pragmatics is what
is meant in the context of the current situation.
0:08:09.789 --> 0:08:16.202
So one famous example is there, for example,
if you say the light is red.
0:08:16.716 --> 0:08:21.795
The traffic light is red: typically
you don't want to tell the other person,
0:08:21.795 --> 0:08:27.458
if you're sitting in a car, that it's surprising,
"oh, the light is red"; but typically you mean:
0:08:27.458 --> 0:08:30.668
okay you should stop and you shouldn't pass
the light.
0:08:30.850 --> 0:08:40.994
So the meaning of this sentence, "the light
is red", depends on the context of sitting in the car.
0:08:42.762 --> 0:08:51.080
So let's start with morphology; that is
the thing we are starting with, and one
0:08:51.080 --> 0:08:53.977
easy first thing is this:
0:08:53.977 --> 0:09:02.575
Of course we have to split the sentence into
words, or join characters, so that we have words.
0:09:02.942 --> 0:09:09.017
Because in most of our work in machine translation
we'll deal with some type of words.
0:09:09.449 --> 0:09:15.970
In neural machine translation, people are working
also on character-based and subword models, but the
0:09:15.970 --> 0:09:20.772
basic unit, the words of the sentence, is a very
important first step.
0:09:21.421 --> 0:09:32.379
And for many languages that is quite simple:
in German, it's not that hard to determine
0:09:32.379 --> 0:09:33.639
the word.
0:09:34.234 --> 0:09:46.265
In tokenization, the main challenge is if
we are doing corpus-based methods that we are
0:09:46.265 --> 0:09:50.366
also dealing with punctuation and the like as normal words.
0:09:50.770 --> 0:10:06.115
And there of course it's getting a bit more
challenging.
0:10:13.173 --> 0:10:17.426
So that is maybe the main thing where, for
example, in German, if you think of German
0:10:17.426 --> 0:10:19.528
tokenization, it's easy to get every word.
0:10:19.779 --> 0:10:26.159
You split at the spaces, but then you would
have the dot at the end joined to the last word,
0:10:26.159 --> 0:10:30.666
and of course you don't want that, because it's
a different word.
0:10:30.666 --> 0:10:37.046
The last word would not be "go", but "go.",
so what you can do is always split off the dots.
0:10:37.677 --> 0:10:45.390
But can you really do that always, or might it
sometimes be better to keep the dot attached?
0:10:47.807 --> 0:10:51.001
For example, email addresses or abbreviations
here.
0:10:51.001 --> 0:10:56.284
For example, doctor, maybe it doesn't make
sense to split up the dot because then you
0:10:56.284 --> 0:11:01.382
would assume a new sentence starts there,
but it's just the "Dr." from doctor.
0:11:01.721 --> 0:11:08.797
Or if you have numbers, like he's the seventh
person, like "der siebte", then you don't want
0:11:08.797 --> 0:11:09.610
to split.
0:11:09.669 --> 0:11:15.333
So there are some things where it could be
a bit more difficult, but it's not really challenging.
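The dot handling just described can be sketched in a few lines of Python; the abbreviation list and the ordinal rule ("7." stays together, as in German) are made-up assumptions for the illustration, not the lecture's actual tool:

```python
import re

# Known abbreviations whose trailing dot should stay attached (hypothetical list).
ABBREVIATIONS = {"Dr.", "etc.", "e.g.", "z.B."}

def tokenize(sentence):
    """Split on whitespace, then detach a sentence-final dot unless the
    token is a known abbreviation or an ordinal number like "7."."""
    tokens = []
    for tok in sentence.split():
        if tok.endswith(".") and tok not in ABBREVIATIONS \
                and not re.fullmatch(r"\d+\.", tok):
            tokens.append(tok[:-1])  # the word itself
            tokens.append(".")       # the dot as its own token
        else:
            tokens.append(tok)
    return tokens

print(tokenize("Dr. Smith goes."))
```

Running this keeps "Dr." as one token but splits the final "goes." into "goes" and ".".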
0:11:16.796 --> 0:11:23.318
In other languages it's getting a lot more
challenging, especially in Asian languages
0:11:23.318 --> 0:11:26.882
where often there are no spaces between words.
0:11:27.147 --> 0:11:32.775
So you just have the sequence of characters.
0:11:32.775 --> 0:11:38.403
The quick brown fox jumps over the lazy dog.
0:11:38.999 --> 0:11:44.569
And then it still might be helpful to work
on something like words.
0:11:44.569 --> 0:11:48.009
Then you need to have a bit more complex approach.
0:11:48.328 --> 0:11:55.782
And here you see we are again having our typical
problem.
0:11:55.782 --> 0:12:00.408
That means that there is ambiguity.
0:12:00.600 --> 0:12:02.104
So you're seeing here.
0:12:02.104 --> 0:12:08.056
We have exactly the same sequence of characters
here, but depending on how we split it,
0:12:08.056 --> 0:12:12.437
it means he is your servant or he is the one
who used your things.
0:12:12.437 --> 0:12:15.380
Or here we have round eyes and take the air.
0:12:15.895 --> 0:12:22.953
So then of course yeah this type of tokenization
gets more important because you could introduce
0:12:22.953 --> 0:12:27.756
errors already, and you can imagine if you're
doing it here wrong.
0:12:27.756 --> 0:12:34.086
If you once make a wrong decision, it's quite
difficult to recover from it.
0:12:34.634 --> 0:12:47.088
And so in these cases, looking at how we're
doing tokenization is an important issue.
0:12:47.127 --> 0:12:54.424
And then it might be helpful to do things
like character-based models, where we treat each
0:12:54.424 --> 0:12:56.228
character as a symbol,
0:12:56.228 --> 0:13:01.803
and for example make this decision later,
or never really make it.
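The greedy style of segmentation that causes the lock-in problem above can be sketched with a toy longest-match ("maximal matching") segmenter over a hypothetical dictionary; real systems use statistical models, but this shows how one early match, once taken, cannot be undone:

```python
def segment(text, vocab):
    """Greedy longest-match word segmentation for text without spaces:
    scan left to right, always take the longest dictionary word,
    falling back to a single character when nothing matches."""
    words, i = [], 0
    max_len = max(len(w) for w in vocab)
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in vocab or length == 1:
                words.append(candidate)
                i += length
                break
    return words

vocab = {"the", "quick", "brown", "fox"}
print(segment("thequickbrownfox", vocab))
```

With the dictionary {"aa", "ab", "a", "b"}, the string "aab" comes out as ["aa", "b"] even though ["a", "ab"] is an equally valid reading: the greedy choice commits immediately, which is exactly the ambiguity problem from the lecture.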
0:13:06.306 --> 0:13:12.033
The other thing is that, if we have words,
they might not be the optimal unit to
0:13:12.033 --> 0:13:18.155
work with because it can be that we should
look into the internal structure of words because
0:13:18.155 --> 0:13:20.986
if we have a morphologically rich language.
0:13:21.141 --> 0:13:27.100
That means we have a lot of different types
of words, and if you have many different
0:13:27.100 --> 0:13:32.552
types of words, it on the other hand means
of course each of these words we have seen
0:13:32.552 --> 0:13:33.757
very infrequently.
0:13:33.793 --> 0:13:39.681
So if you only have ten words and you have
a large corpus, each word occurs more often.
0:13:39.681 --> 0:13:45.301
If you have three million different words,
then each of them will occur less often.
0:13:45.301 --> 0:13:51.055
Hopefully you know, from machine learning,
it's helpful if you have seen each example
0:13:51.055 --> 0:13:51.858
very often.
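The type/token trade-off just described is easy to check on any corpus; a small illustrative sketch (the two toy "corpora" are made up for the demo):

```python
from collections import Counter

def avg_tokens_per_type(tokens):
    """With a fixed number of tokens, more distinct word types means
    each type is observed less often on average."""
    return len(tokens) / len(Counter(tokens))

few_types  = "a b a b a b a b".split()   # 8 tokens, 2 types
many_types = "a b c d e f g h".split()   # 8 tokens, 8 types
print(avg_tokens_per_type(few_types), avg_tokens_per_type(many_types))
```

The vocabulary with fewer types sees each type four times on average; the larger vocabulary sees each type only once, which is the sparsity problem for learning.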
0:13:52.552 --> 0:13:54.524
And so why is that?
0:13:54.524 --> 0:13:56.495
Why does this happen?
0:13:56.495 --> 0:14:02.410
Yeah, in some languages we have quite a complex
information inside a word.
0:14:02.410 --> 0:14:09.271
So here's a word from Finnish, "talossanikinko"
or something like that, and it means "in my
0:14:09.271 --> 0:14:10.769
house too?", as a question.
0:14:11.491 --> 0:14:15.690
So you have all these information attached
to the word.
0:14:16.036 --> 0:14:20.326
And that is of course an extreme case; that's
why typically, for example, Finnish is a
0:14:20.326 --> 0:14:20.831
language.
0:14:20.820 --> 0:14:26.725
Where machine translation quality is less
good because generating all these different
0:14:26.725 --> 0:14:33.110
morphological variants is a challenge; and
the additional challenge is that while Finnish is
0:14:33.110 --> 0:14:39.564
not really low-resource, in low-resource
languages you quite often have more difficult
0:14:39.564 --> 0:14:40.388
morphology.
0:14:40.440 --> 0:14:43.949
I mean, English is an example of a relatively
easy one.
0:14:46.066 --> 0:14:54.230
And so in general we can say that words are
composed of morphemes, and morphemes are
0:14:54.230 --> 0:15:03.069
the smallest meaning-carrying units; so normally
it means: all morphemes should have some type
0:15:03.069 --> 0:15:04.218
of meaning.
0:15:04.218 --> 0:15:09.004
For example, a part like this here does not really have a meaning.
0:15:09.289 --> 0:15:12.005
"Un" has some type of meaning.
0:15:12.005 --> 0:15:14.371
It's changing the meaning.
0:15:14.371 --> 0:15:21.468
The "ness" has the meaning that it's making, out
of an adjective, a noun; and then "happy".
0:15:21.701 --> 0:15:31.215
So each of these parts conveys some meaning,
but you cannot split them further up and still
0:15:31.215 --> 0:15:32.156
have some meaning.
0:15:32.312 --> 0:15:36.589
You see that of course a little bit more is
happening.
0:15:36.589 --> 0:15:43.511
Typically the "y" is turned into an "i", so there
can be some variation, but these are typical
0:15:43.511 --> 0:15:46.544
examples of what we have as morphemes.
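The un+happy+ness example can be sketched with toy affix stripping; the prefix and suffix lists are made up for the demo, and the y-to-i spelling variation is undone when recovering the stem. Real morphological analysis is far more involved than this:

```python
# Hypothetical affix inventories, just enough for the "unhappiness" example.
PREFIXES = ["un"]
SUFFIXES = ["ness"]

def split_morphemes(word):
    """Strip known prefixes and suffixes, then undo the y -> i variation
    in the remaining stem (happi -> happy)."""
    parts = []
    for p in PREFIXES:
        if word.startswith(p):
            parts.append(p)
            word = word[len(p):]
    stem, tail = word, []
    for s in SUFFIXES:
        if word.endswith(s):
            tail.append(s)
            stem = word[: -len(s)]
    if tail and stem.endswith("i"):      # undo the y -> i spelling change
        stem = stem[:-1] + "y"
    return parts + [stem] + tail

print(split_morphemes("unhappiness"))
```

This yields the three meaning-carrying parts "un", "happy", "ness" from the single surface word.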
0:16:02.963 --> 0:16:08.804
That is, of course, a problem, and that's the
question of how you do your splitting.
0:16:08.804 --> 0:16:15.057
But that problem we have anyway, because
even full words can have different meanings
0:16:15.057 --> 0:16:17.806
depending on the context they're used in.
0:16:18.038 --> 0:16:24.328
So we always have to somewhat have a model
which can infer or represent the meaning of
0:16:24.328 --> 0:16:25.557
the word in the context.
0:16:25.825 --> 0:16:30.917
But you are right that this problem might
get even more severe if you're splitting up.
0:16:30.917 --> 0:16:36.126
Therefore, it might not be the best to go
for the very extreme and represent each letter
0:16:36.126 --> 0:16:41.920
and have a model which works only on letters because,
of course, a letter can have a lot of different
0:16:41.920 --> 0:16:44.202
meanings depending on where it's used.
0:16:44.524 --> 0:16:50.061
And yeah, there is no single right solution for
what the right splitting is.
0:16:50.061 --> 0:16:56.613
It depends on the language and the application,
and on the amount of data you're having.
0:16:56.613 --> 0:17:01.058
For example, typically it means the fewer
data you have.
0:17:01.301 --> 0:17:12.351
The more splitting you should do; if you have
more data, then you can better distinguish.
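One standard data-driven answer to "how much should we split" is byte-pair-encoding-style subword learning: start from characters and repeatedly merge the most frequent adjacent pair, so few merges give small units (useful with little data) and many merges approach full words. A minimal sketch, with an arbitrary toy corpus and merge count:

```python
from collections import Counter

def merge_word(symbols, pair):
    """Replace every adjacent occurrence of `pair` by the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def bpe(words, num_merges):
    """Learn `num_merges` most-frequent-pair merges over a word list,
    returning each word as its resulting subword sequence."""
    corpus = [list(w) for w in words]
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        corpus = [merge_word(w, best) for w in corpus]
    return corpus

print(bpe(["low", "low", "lower"], 2))
```

After two merges, the frequent word "low" is a single unit while the rarer "lower" stays split as "low" + "e" + "r": frequent material gets long units, rare material stays in small, reusable pieces.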
0:17:13.653 --> 0:17:19.065
Then there are different types of morphemes.
So we have typically one stem morpheme: it's
0:17:19.065 --> 0:17:21.746
like "house" or "Tisch", so the main meaning.
0:17:21.941 --> 0:17:29.131
And then you can have functional or bound
morphemes, which can be a prefix,
0:17:29.131 --> 0:17:34.115
suffix, infix or circumfix so it can be before
can be after.
0:17:34.114 --> 0:17:39.416
It can be inside, or it can be around it, something
like "gekauft" there.
0:17:39.416 --> 0:17:45.736
Typically you would say that these are not
two morphemes, "ge" and "t", because they both
0:17:45.736 --> 0:17:50.603
describe the function; together, "ge" and "t"
are marking the participle.
0:17:53.733 --> 0:18:01.209
For what are people using them? You can use
them for inflection, to describe something like
0:18:01.209 --> 0:18:03.286
tense, number, person, case.
0:18:04.604 --> 0:18:09.238
That is yeah, if you know German, this is
commonly used in German.
0:18:10.991 --> 0:18:16.749
But of course there are more complicated
things: in some languages it goes further.
0:18:16.749 --> 0:18:21.431
I mean, in German, count and person
only depend on the subject.
0:18:21.431 --> 0:18:27.650
For the verb, for example, in other languages
the form can also be determined by the first and
0:18:27.650 --> 0:18:28.698
second object.
0:18:28.908 --> 0:18:35.776
So if you buy an apple or a
house, it is not only that
0:18:35.776 --> 0:18:43.435
'kauft' depends on me, as in German, but
it can also depend on whether it's an apple
0:18:43.435 --> 0:18:44.492
or a house.
0:18:44.724 --> 0:18:48.305
And then of course you have an exploding number
of word forms.
0:18:49.409 --> 0:19:04.731
Furthermore, it can be used to do derivations
so you can make other types of words from it.
0:19:05.165 --> 0:19:06.254
And then yeah.
0:19:06.254 --> 0:19:12.645
This is compounding: creating new words by joining
them, like 'rainbow' or 'waterproof'. For example,
0:19:12.645 --> 0:19:19.254
in German, 'Einkaufswagen', 'eiskalt' and
so on, where you can do that
0:19:19.254 --> 0:19:22.014
with nouns and adjectives in German.
0:19:22.282 --> 0:19:29.077
Then of course you might have additional challenges,
like the Fugen elements, where you have to add this extra sound.
0:19:32.452 --> 0:19:39.021
Yeah, then there are of course additional
special things.
0:19:39.639 --> 0:19:48.537
You sometimes have to put extra material in
because of phonology, for example in the plural.
0:19:48.537 --> 0:19:56.508
The third person singular in English
is normally an 's', but for 'goes', for example, it is
0:19:56.508 --> 0:19:57.249
an 'es'.
0:19:57.277 --> 0:20:04.321
In German you can also have other things:
'aus Mutter wird Mütter', so you're changing
0:20:04.321 --> 0:20:11.758
the umlaut in order to express the plural; and
in other languages there is for example vowel harmony,
0:20:11.758 --> 0:20:17.315
where the vowels inside change depending
on which form you have.
0:20:17.657 --> 0:20:23.793
This makes things more difficult, since splitting
a word into its parts doesn't really work anymore.
0:20:23.793 --> 0:20:28.070
So for 'Mutter' and 'Mütter', for example, that
is not really possible.
0:20:28.348 --> 0:20:36.520
The nice thing, as more of a
general tendency, is that irregular things are
0:20:36.520 --> 0:20:39.492
happening for words which occur frequently,
0:20:39.839 --> 0:20:52.177
so that you can have enough examples, while
the regular things you can handle by some type
0:20:52.177 --> 0:20:53.595
of rules.
0:20:55.655 --> 0:20:57.326
Yeah, this can be done.
0:20:57.557 --> 0:21:02.849
So there are tasks on this: how to do automatic
inflection, how to analyze them.
0:21:02.849 --> 0:21:04.548
So you give it a word.
0:21:04.548 --> 0:21:10.427
It tells you what the possible forms
of it are, like how they are built, and so on.
0:21:10.427 --> 0:21:15.654
And at least for the high-resource languages,
there are a lot of tools for that.
0:21:15.654 --> 0:21:18.463
Of course, if you now want to do that for
0:21:18.558 --> 0:21:24.281
some language which is very low-resource,
it might be very difficult, and there might be
0:21:24.281 --> 0:21:25.492
no tool for it.
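As a rough illustration of such rule-based inflection, here is a toy sketch (not one of the tools mentioned; the function name and rules are my own, and they cover only the regular English cases from the examples above, like 'goes' taking '-es'):

```python
# Toy rule-based inflection: English third-person singular.
# Only the regular patterns; irregular verbs would need a lexicon.
def third_person_singular(verb: str) -> str:
    # Phonology forces -es after sibilants and after -o ("goes").
    if verb.endswith(("s", "sh", "ch", "x", "z", "o")):
        return verb + "es"
    # Consonant + y becomes -ies ("try" -> "tries").
    if verb.endswith("y") and verb[-2:-1] not in "aeiou":
        return verb[:-1] + "ies"
    return verb + "s"

print(third_person_singular("walk"))  # walks
print(third_person_singular("go"))    # goes
```

Real morphological analyzers go the other way as well, mapping an inflected form back to its stem and features, which is much harder for low-resource languages.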
0:21:28.368 --> 0:21:37.652
Good. Before we go on to the next part,
about part of speech, are there any questions
about this?
0:22:01.781 --> 0:22:03.187
Yeah, we'll come to that in a bit.
0:22:03.483 --> 0:22:09.108
So it's a very good question, and a difficult
one; and especially we'll see that later: if you
0:22:09.108 --> 0:22:14.994
just put in words it would be very bad because
words are put into neural networks just as
0:22:14.994 --> 0:22:15.844
some digits.
0:22:15.844 --> 0:22:21.534
Each word is mapped to an integer and you
put it in, so the network doesn't really know anything
0:22:21.534 --> 0:22:22.908
about the structure anymore.
0:22:23.543 --> 0:22:29.898
What we will see, therefore: the most successful
approach, which is mostly done, is subword
0:22:29.898 --> 0:22:34.730
units, where we split words up. But we will do this later.
0:22:34.730 --> 0:22:40.154
I don't know if you have seen it in an advanced lecture already.
0:22:40.154 --> 0:22:44.256
We'll cover this on Tuesday.
0:22:44.364 --> 0:22:52.316
So there is an algorithm called byte pair
encoding (BPE), which is about splitting words into
0:22:52.316 --> 0:22:52.942
parts.
0:22:53.293 --> 0:23:00.078
So it's doing the splitting of words, but not
morphologically motivated; it is more based on
0:23:00.078 --> 0:23:00.916
frequency.
0:23:00.940 --> 0:23:11.312
However, it performs very well, and that's
why it's used; and there is a bit of correlation.
0:23:11.312 --> 0:23:15.529
Sometimes the frequency-based splits agree with the morphological ones.
0:23:15.695 --> 0:23:20.709
So we're splitting words and we're splitting
especially words which are infrequent and that's
0:23:20.709 --> 0:23:23.962
maybe a good motivation why that's good for
neural networks.
0:23:23.962 --> 0:23:28.709
That means if you have seen a word very often
you don't need to split it and it's easier
0:23:28.709 --> 0:23:30.043
to just process it fast.
0:23:30.690 --> 0:23:39.218
While if you have seen a word infrequently,
it is good to split it into parts so the model can
0:23:39.218 --> 0:23:39.593
still process it.
0:23:39.779 --> 0:23:47.729
So there is some way of doing it, but linguists
would say this is not a morphological analysis.
0:23:47.729 --> 0:23:53.837
That is true, but we are splitting words into
parts if they have not been seen.
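The merge-learning step of byte pair encoding can be sketched in a few lines of Python (a simplified version for illustration; the toy corpus and its counts are made up, and real implementations work on much larger data):

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a {word: count} corpus sketch."""
    # Represent each word as a tuple of symbols (characters to start).
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Replace every occurrence of the best pair by a merged symbol.
        merged = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges, vocab

corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(corpus, 5)
print(merges)  # first merges are ('e', 's'), then ('es', 't')
```

Frequent words end up as single units while rare words stay split into smaller pieces, which is exactly the behavior described above.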
0:23:59.699 --> 0:24:06.324
Yes, so another important thing about words
are the part-of-speech tags.
0:24:06.324 --> 0:24:14.881
These are the common ones: noun, verb, adjective,
adverb, determiner, pronoun, preposition, and
0:24:14.881 --> 0:24:16.077
conjunction.
0:24:16.077 --> 0:24:26.880
There are some more. They are not the same
in all languages, but for example there are
0:24:26.880 --> 0:24:38.104
universal part-of-speech tag sets which try to
define this kind of tag for many languages.
0:24:38.258 --> 0:24:42.018
And then, of course, it helps you with
generalization.
0:24:42.018 --> 0:24:48.373
There are rules for how a language deals with verbs and
nouns, especially if you look at sentence structure.
0:24:48.688 --> 0:24:55.332
And so if you know the part-of-speech tag
you can easily generalize and get these
0:24:55.332 --> 0:24:58.459
rules, or apply these rules, as you know.
0:24:58.459 --> 0:25:02.680
The verb in English is always at the second
position.
0:25:03.043 --> 0:25:10.084
So you know how to deal with verbs independently
of which words you are now really looking at.
0:25:12.272 --> 0:25:18.551
And that, again, can be ambiguous.
0:25:18.598 --> 0:25:27.171
So there are some words which can have several
part-of-speech tags.
0:25:27.171 --> 0:25:38.686
An example is the word 'can', which
can be the can of beans or 'can do something'.
0:25:38.959 --> 0:25:46.021
Often, in English, the meanings are also related:
0:25:46.021 --> 0:25:55.256
'access' can be the noun 'access' or the verb 'to access' something.
0:25:56.836 --> 0:26:02.877
Most words have only one single part of speech
tag, but there are some where it's a bit more
0:26:02.877 --> 0:26:03.731
challenging.
0:26:03.731 --> 0:26:09.640
The nice thing is: the ambiguous ones
are often words which occur more often,
0:26:09.640 --> 0:26:12.858
while for really rare words it's not that common.
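A toy illustration of such an ambiguous part-of-speech lexicon (the entries, tag names, and helper function are made up for the example):

```python
# Tiny hand-made POS lexicon; ambiguous words list several tags.
LEXICON = {
    "can":   ["NOUN", "VERB", "AUX"],   # can of beans / can do something
    "house": ["NOUN", "VERB"],          # the house / to house somebody
    "the":   ["DET"],
    "beans": ["NOUN"],
}

def possible_tags(token):
    """Return all tags a token can carry, or UNK if unknown."""
    return LEXICON.get(token.lower(), ["UNK"])

for tok in "The can can house the beans".split():
    print(tok, possible_tags(tok))
```

A real tagger then uses the sentence context to pick one tag per token; the lexicon lookup alone only narrows down the options.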
0:26:13.473 --> 0:26:23.159
If you look at these classes you can distinguish
open classes, where new words can appear, so
0:26:23.159 --> 0:26:25.790
we can invent new nouns.
0:26:26.926 --> 0:26:31.461
But then there are the closed classes,
which are things like determiners or pronouns.
0:26:31.461 --> 0:26:35.414
For example, it's not that you can easily
develop your new pronoun.
0:26:35.414 --> 0:26:38.901
So there is a fixed list of pronouns and we
are using that.
0:26:38.901 --> 0:26:44.075
So it's not like today or tomorrow
something happens and then people start using
0:26:44.075 --> 0:26:44.482
a new
0:26:45.085 --> 0:26:52.426
pronoun or new conjunction, like 'and',
because it's not that you normally invent a
0:26:52.426 --> 0:26:52.834
new.
0:27:00.120 --> 0:27:03.391
And in addition to the part-of-speech tags themselves:
0:27:03.391 --> 0:27:09.012
some of these part-of-speech classes have
different properties.
0:27:09.389 --> 0:27:21.813
So, for example, for nouns and adjectives
we can have singular and plural. In other languages,
0:27:21.813 --> 0:27:29.351
there is also a dual, so that a word is not only
singular or plural, but can also be
0:27:29.351 --> 0:27:31.257
dual, if it refers to two things.
0:27:31.631 --> 0:27:36.246
You have the genders masculine, feminine,
neuter, which we know.
0:27:36.246 --> 0:27:43.912
In other languages there is animate and inanimate,
and you have the cases: in German you have
0:27:43.912 --> 0:27:46.884
nominative, genitive, dative, accusative.
0:27:47.467 --> 0:27:57.201
And then other languages have even more;
Latin, for example, additionally has the ablative.
0:27:57.497 --> 0:28:03.729
So there are more, and
and there you have no one to one correspondence,
0:28:03.729 --> 0:28:09.961
so it can be that there are some cases which
are only in the one language and do not happen
0:28:09.961 --> 0:28:11.519
in the other language.
0:28:13.473 --> 0:28:20.373
For verbs we have tenses of course, like walk:
is walking, walked, have walked, had walked, will
0:28:20.373 --> 0:28:21.560
walk, and so on.
0:28:21.560 --> 0:28:28.015
Interestingly, for example in Japanese, this
can also happen for adjectives, so there
is a difference between 'something is white'
and 'something was white'.
0:28:35.635 --> 0:28:41.496
There is the continuous aspect, which we
do not really have in German, and
0:28:41.496 --> 0:28:47.423
I guess if you're German and learning
English, that's something like 'she sings' versus
0:28:47.423 --> 0:28:53.350
'she is singing'. Of course we can express
that in German, but it's not commonly used, and normally
0:28:53.350 --> 0:28:55.281
we're not marking this aspect.
0:28:55.455 --> 0:28:57.240
Also about tenses.
0:28:57.240 --> 0:29:05.505
If you use past tense in English, you will also
use past tense in German, so we have similar
0:29:05.505 --> 0:29:09.263
tenses, but the usage might be different.
0:29:14.214 --> 0:29:20.710
There is uncertainty, like the mood:
indicative versus subjunctive,
0:29:20.710 --> 0:29:26.742
as in 'If he were here'; and there are
the voices, active and passive.
0:29:27.607 --> 0:29:34.024
Those, you know, exist both in German
and English, but there is also something like
0:29:34.024 --> 0:29:35.628
the middle voice in Greek:
0:29:35.628 --> 0:29:42.555
'I get myself taught'. So there are other phenomena
which might only exist in one language.
0:29:42.762 --> 0:29:50.101
These are, yeah, the different syntactic
structures that you can have in a language,
0:29:50.101 --> 0:29:57.361
and there are two issues. It might
be that some structures only exist in some languages and in others
0:29:57.361 --> 0:29:58.376
don't exist.
0:29:58.358 --> 0:30:05.219
And on the other hand there is also the matching,
so it might be that in some situations you
0:30:05.219 --> 0:30:07.224
use different structures.
0:30:10.730 --> 0:30:13.759
The next would be then about semantics.
0:30:13.759 --> 0:30:16.712
Do you have any questions before that?
0:30:19.819 --> 0:30:31.326
I'll just continue, but ask if something is unclear.
Besides the structure, we typically have more
0:30:31.326 --> 0:30:39.863
ambiguities, so it can be that words themselves
have different meanings.
0:30:40.200 --> 0:30:48.115
And we are typically talking about polysemy
and homonymy, where polysemy means that a word
0:30:48.115 --> 0:30:50.637
can have several related meanings.
0:30:50.690 --> 0:30:58.464
So if you have the English word interest,
it can be that you are interested in something.
0:30:58.598 --> 0:31:07.051
Or it can be the financial interest rate,
but the meanings are somehow related, because if you are
0:31:07.051 --> 0:31:11.002
getting some interest rate, there is some interest involved.
0:31:11.531 --> 0:31:18.158
But then there is homonymy, where the meanings
really are not related.
0:31:18.458 --> 0:31:24.086
So the two meanings of 'can' don't really have anything
in common; they are really very different.
0:31:24.324 --> 0:31:29.527
And of course that's not completely clear;
there is not a clear definition. So, for example,
0:31:29.527 --> 0:31:34.730
for 'bank' it can be that you say the meanings are related,
but others can argue against that. So
0:31:34.730 --> 0:31:39.876
there are some clear cases, like 'interest',
there are some which are vague, and then there
0:31:39.876 --> 0:31:43.439
are some where it's very clear again that
the meanings are different.
0:31:45.065 --> 0:31:49.994
And in order to translate them, of course,
we might need the context to disambiguate.
0:31:49.994 --> 0:31:54.981
That's typically how we can disambiguate,
and that's not only for lexical semantics;
0:31:54.981 --> 0:32:00.198
generally, very often, if you want
to disambiguate, context can be very helpful.
0:32:00.198 --> 0:32:03.981
So: in which sentence does it occur, what is
the general knowledge, who is speaking?
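One classic way to use context for disambiguation is Lesk-style gloss overlap: pick the sense whose dictionary gloss shares the most words with the surrounding sentence. This is a minimal sketch with made-up sense names and glosses, not a full word-sense disambiguation system:

```python
def lesk(context_words, senses):
    """Pick the sense whose gloss shares the most words with the context."""
    ctx = set(w.lower() for w in context_words)

    def overlap(gloss):
        return len(ctx & set(gloss.lower().split()))

    return max(senses, key=lambda s: overlap(senses[s]))

# Toy sense inventory for "interest" (glosses written for the example).
senses = {
    "interest_feeling": "a feeling of wanting to know or learn about something",
    "interest_money": "money paid regularly at a rate for the use of money lent",
}
sentence = "the bank raised the rate so the interest on the loan went up".split()
print(lesk(sentence, senses))  # interest_money
```

Modern systems do this implicitly with contextual representations, but the intuition is the same: the surrounding words vote for a sense.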
0:32:04.944 --> 0:32:09.867
You can do that externally by some disambiguation
task.
0:32:09.867 --> 0:32:14.702
A machine translation system will also do it
internally.
0:32:16.156 --> 0:32:21.485
And sometimes you're lucky and you don't need
to do it because you just have the same ambiguity
0:32:21.485 --> 0:32:23.651
in the source and the target language.
0:32:23.651 --> 0:32:26.815
And then it doesn't matter. Think about
the mouse:
0:32:26.815 --> 0:32:31.812
As I said, you don't really need to know if
it's a computer mouse or the living mouse when you
translate from German to English because it
has exactly the same ambiguity.
0:32:40.400 --> 0:32:46.764
There are also relations between words, like
synonyms, antonyms, and hyponyms, which is the is-
0:32:46.764 --> 0:32:50.019
a relation, and the part-of relation, like door and house.
0:32:50.019 --> 0:32:55.569
Big and small are antonyms, and synonyms are
words which mean something similar.
0:32:56.396 --> 0:33:03.252
There are resources which try to express all
this linguistic information, like WordNet
0:33:03.252 --> 0:33:10.107
or GermaNet, where you have a graph with words
and how they are related to each other.
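Such a graph can be pictured as a simple is-a hierarchy; here is a miniature, hand-made stand-in for a WordNet-style resource (the entries are invented for illustration, not taken from WordNet):

```python
# A miniature WordNet-like graph: each word maps to one hypernym ("is-a").
HYPERNYM = {
    "poodle": "dog",
    "dog": "canine",
    "canine": "animal",
    "house": "building",
    "building": "artifact",
}

def hypernym_chain(word):
    """Follow is-a links up to the most general concept in the toy graph."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

print(hypernym_chain("poodle"))  # ['poodle', 'dog', 'canine', 'animal']
```

The real resources store sets of synonyms per node and several relation types (hypernymy, meronymy, antonymy), but the graph-walking idea is the same.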
0:33:11.131 --> 0:33:12.602
Which can be helpful.
0:33:12.602 --> 0:33:18.690
Typically these things were used more in tasks
where there is less data, and there are a lot
0:33:18.690 --> 0:33:24.510
of tasks in NLP where you have very limited
data, because you really need to hand-annotate
0:33:24.510 --> 0:33:24.911
that.
0:33:25.125 --> 0:33:28.024
Machine translation has a big advantage.
0:33:28.024 --> 0:33:31.842
There's naturally a lot of text translated
out there.
0:33:32.212 --> 0:33:39.519
Typically in machine translation we have, compared
to other tasks, a significant amount of data.
0:33:39.519 --> 0:33:46.212
People have looked into integrating WordNet
or things like that, but it is rarely used
0:33:46.212 --> 0:33:49.366
in like commercial systems or something.
0:33:52.692 --> 0:33:55.626
So this was based on the words.
0:33:55.626 --> 0:34:03.877
We have morphology, syntax, and semantics,
and then of course it makes sense to also look
0:34:03.877 --> 0:34:06.169
at the bigger structure.
0:34:06.169 --> 0:34:08.920
That means information about the whole sentence.
0:34:08.948 --> 0:34:17.822
Of course, we don't really have morphology
there, because morphology is about the structure
0:34:17.822 --> 0:34:26.104
of words, but we have syntax on the sentence
level and the semantic representation.
0:34:28.548 --> 0:34:35.637
When we are thinking about the sentence structure,
then the sentence is, of course, first a sequence
0:34:35.637 --> 0:34:37.742
of words terminated by a dot.
0:34:37.742 --> 0:34:42.515
Jane bought the house and we can say something
about the structure.
0:34:42.515 --> 0:34:47.077
It's typically subject, verb, and then one
or several objects.
0:34:47.367 --> 0:34:51.996
And the number of objects, for example, is
then determined by the verb.
0:34:52.232 --> 0:34:54.317
This is called the valency.
0:34:54.354 --> 0:35:01.410
So you have intransitive verbs which don't
take any object, like 'to sleep'.
0:35:02.622 --> 0:35:05.912
For example, there is no object: 'he sleeps the bed' does not exist.
0:35:05.912 --> 0:35:14.857
You cannot say that. And there are transitive
verbs where you have to add one or more objects,
0:35:14.857 --> 0:35:16.221
and you always have to.
0:35:16.636 --> 0:35:19.248
The sentence is not correct if you don't put the
object.
0:35:19.599 --> 0:35:33.909
So with 'to buy' you have to say what you
bought, or with 'to give', whom you give something and what.
0:35:34.194 --> 0:35:40.683
Here you see a bit of the maybe interesting
relation between word order and morphology.
0:35:40.683 --> 0:35:47.243
Of course it's not that strong, but for example
in English you always have to first say whom
you gave it to and what you gave.
you gave it and what you gave.
0:35:49.453 --> 0:35:53.304
So the structure is very clear and cannot
be changed.
0:35:54.154 --> 0:36:00.801
German, for example, has the possibility of
distinguishing what you gave and whom you gave
0:36:00.801 --> 0:36:07.913
it to, because there is morphology:
what you gave gets a different form than whom
0:36:07.913 --> 0:36:08.685
you gave it to.
0:36:11.691 --> 0:36:18.477
And that is a general tendency that if you
have morphology then typically the word order
0:36:18.477 --> 0:36:25.262
is more free, while in English
you cannot express this information through
0:36:25.262 --> 0:36:26.482
the morphology.
0:36:26.706 --> 0:36:30.238
You typically have to express them through
the word order.
0:36:30.238 --> 0:36:32.872
It's not as free, but it's more restricted.
0:36:35.015 --> 0:36:40.060
Yeah, the first part is typically the noun
phrase, the subject, and that can not only
0:36:40.060 --> 0:36:43.521
be a single noun, but of course it can be a
longer phrase.
0:36:43.521 --> 0:36:48.860
So if you have Jane the woman, it can be Jane,
it can be the woman, it can be a woman, it can
0:36:48.860 --> 0:36:52.791
be the young woman or the young woman who lives
across the street.
0:36:53.073 --> 0:36:56.890
All of these are the subjects, so this can
be already very, very long.
0:36:57.257 --> 0:36:58.921
And this also complicates things.
0:36:58.921 --> 0:37:05.092
The verb being at the second position then holds in a
more complicated way, because if you now have
0:37:05.092 --> 0:37:11.262
'the young woman who lives across the street
runs' to somewhere or so, then 'runs' is at
0:37:11.262 --> 0:37:16.185
the second position in this tree, but the first
constituent is quite long.
0:37:16.476 --> 0:37:19.277
And so it's not just counting.
0:37:19.277 --> 0:37:22.700
The second word is not always the verb.
0:37:26.306 --> 0:37:32.681
In addition to these simple things, there's
more complex stuff.
0:37:32.681 --> 0:37:43.104
Jane bought the house from Jim without hesitation,
or Jane bought the house in the posh neighborhood
0:37:43.104 --> 0:37:44.925
across the river.
0:37:45.145 --> 0:37:51.694
And these often lead to additional ambiguities
because it's not always completely clear to
0:37:51.694 --> 0:37:53.565
which part this prepositional phrase attaches.
0:37:54.054 --> 0:37:59.076
So that we'll see and you have, of course,
subclasses and so on.
0:38:01.061 --> 0:38:09.926
And then there is a theory behind it which
was very important for rule based machine translation
0:38:09.926 --> 0:38:14.314
because that's exactly what you're doing there.
0:38:14.314 --> 0:38:18.609
You would take the sentence, do the syntactic.
0:38:18.979 --> 0:38:28.432
So that we can have this constituents which
like describe the basic parts of the language.
0:38:28.468 --> 0:38:35.268
And we can create the sentence structure as
a context free grammar, which you hopefully
0:38:35.268 --> 0:38:42.223
remember from basic computer science, which
is a pair of non terminals, terminal symbols,
0:38:42.223 --> 0:38:44.001
production rules, and.
0:38:43.943 --> 0:38:50.218
and the start symbol; and you can then describe
a sentence by this phrase structure grammar:
0:38:51.751 --> 0:38:59.628
So a simple example would be something like
this: you have a lexicon, where 'Jane'
0:38:59.628 --> 0:39:02.367
is a noun, 'telescope' is a noun, and so on.
0:39:02.782 --> 0:39:10.318
And then you have the production rules:
a sentence is a noun phrase and a verb phrase.
0:39:10.318 --> 0:39:18.918
The noun phrase can either be a determiner and a
noun, or it can be a noun phrase and a prepositional
0:39:18.918 --> 0:39:19.628
phrase.
0:39:19.919 --> 0:39:25.569
And a prepositional phrase
is a preposition and a noun phrase.
0:39:26.426 --> 0:39:27.622
We're looking at this.
0:39:27.622 --> 0:39:30.482
What is the valency of the verb we're describing
here?
0:39:33.513 --> 0:39:36.330
How many objects would the verb
have in this case?
0:39:46.706 --> 0:39:48.810
We're looking at the verb phrase.
0:39:48.810 --> 0:39:54.358
The verb phrase is a verb and a noun phrase,
so one object here; so this would be a
0:39:54.358 --> 0:39:55.378
valency of one.
0:39:55.378 --> 0:40:00.925
If you have an intransitive verb, the verb
phrase would be just a verb, and if you have
0:40:00.925 --> 0:40:03.667
two objects, it would be verb, noun phrase, noun phrase.
0:40:08.088 --> 0:40:15.348
And yeah, the challenge, or
what you have to do, is this: given a natural
0:40:15.348 --> 0:40:23.657
language sentence, you want to parse it to
get this type of parse tree; you know this from programming languages,
0:40:23.657 --> 0:40:30.198
where you also need to parse the code in order
to get the representation.
0:40:30.330 --> 0:40:39.356
However, there is one challenge if you parse
natural language compared to computer language.
0:40:43.823 --> 0:40:56.209
So there are different ways of how you can
express things, and there are different parse trees
0:40:56.209 --> 0:41:00.156
belonging to the same input.
0:41:00.740 --> 0:41:05.241
So if you have 'Jane buys a house', that's
an easy example.
0:41:05.241 --> 0:41:07.491
So you do the lexicon look up.
0:41:07.491 --> 0:41:13.806
'Jane' can be a noun phrase, 'buys' is a verb,
a is a determiner, and a house is a noun.
0:41:15.215 --> 0:41:18.098
And then you can now use the grammar rules
here.
0:41:18.098 --> 0:41:19.594
There is no rule for that.
0:41:20.080 --> 0:41:23.564
Here we have no rules, but here we have a
rule.
0:41:23.564 --> 0:41:27.920
A noun is a noun phrase, so we have mapped
the noun to a noun phrase.
0:41:28.268 --> 0:41:34.012
Then we can map this to the verb phrase.
0:41:34.012 --> 0:41:47.510
We have 'a verb and a noun phrase make a verb phrase',
and then we can map the whole thing to a sentence, representing the full parse.
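The derivation just walked through can be run mechanically, for example with a small CYK-style chart parser. This is a sketch under the toy grammar above (rule and lexicon entries follow the lecture's example; the function name is mine):

```python
from collections import defaultdict
from itertools import product

# The toy grammar in binary form: (B, C) -> set of left-hand sides A.
BINARY = {
    ("NP", "VP"): {"S"},    # sentence = noun phrase + verb phrase
    ("Det", "N"): {"NP"},   # noun phrase = determiner + noun
    ("V", "NP"): {"VP"},    # verb phrase = verb + noun phrase
    ("P", "NP"): {"PP"},    # prepositional phrase = preposition + NP
    ("NP", "PP"): {"NP"},   # noun phrase + PP is again a noun phrase
}
LEXICAL = {
    "Jane": {"NP"},
    "buys": {"V"},
    "a": {"Det"},
    "house": {"N"},
}

def cyk(tokens):
    """Fill a chart bottom-up; chart[(i, j)] holds symbols spanning i..j."""
    n = len(tokens)
    chart = defaultdict(set)
    for i, tok in enumerate(tokens):
        chart[(i, i + 1)] |= LEXICAL.get(tok, set())
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for b, c in product(chart[(start, mid)], chart[(mid, end)]):
                    chart[(start, end)] |= BINARY.get((b, c), set())
    return chart[(0, n)]

print(cyk("Jane buys a house".split()))  # contains 'S': a valid sentence
```

If the top cell of the chart contains the start symbol S, the sentence is derivable by the grammar, exactly as in the step-by-step mapping above.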
0:41:49.069 --> 0:41:53.042
We can have that even more complex.
0:41:53.042 --> 0:42:01.431
The woman who won the lottery yesterday bought
the house across the street.
0:42:01.431 --> 0:42:05.515
The structure gets more complicated.
0:42:05.685 --> 0:42:12.103
You now see that the verb phrase is at the
second position, but the noun phrase is quite
0:42:12.052 --> 0:42:18.655
big in here. And with the PP phrases it's
sometimes difficult where to attach them, because
0:42:18.655 --> 0:42:25.038
they can be put to the noun phrase, but in
other sentences they can also be put to the
0:42:25.038 --> 0:42:25.919
verb phrase.
0:42:36.496 --> 0:42:38.250
Yeah.
0:42:43.883 --> 0:42:50.321
Yes, so then either it can have two tags,
noun or noun phrase, or you can have the extra
0:42:50.321 --> 0:42:50.755
rule.
0:42:50.755 --> 0:42:57.409
The noun phrase can not only be a determiner
and a noun, but it can also be just a noun.
0:42:57.717 --> 0:43:04.360
Then of course either you introduce additional
rules about what is possible, or you have the problem
0:43:04.360 --> 0:43:11.446
that you build parse trees which are not correct,
and then you have to add some type of probability
0:43:11.446 --> 0:43:13.587
saying which tree is more probable.
0:43:16.876 --> 0:43:23.280
But of course some things also can't really
be modeled easily with this type of grammar.
0:43:23.923 --> 0:43:32.095
There, for example, agreement is not straightforward
to do, so that for subject and verb you can check
0:43:32.095 --> 0:43:38.866
that the agreement in person
and number is correct:
0:43:38.866 --> 0:43:41.279
so if it's a singular subject,
0:43:41.561 --> 0:43:44.191
the verb is also singular,
0:43:44.604 --> 0:43:49.242
and if it's a plural subject,
it's a plural verb.
0:43:49.489 --> 0:43:56.519
Similar is, yeah, the agreement between
determiner, adjective and noun, so they also
0:43:56.519 --> 0:43:57.717
have to agree.
0:43:57.877 --> 0:44:05.549
Things like that cannot be easily done with
this type of grammar or this subcategorization
0:44:05.549 --> 0:44:13.221
that you check whether the verb is transitive
or intransitive, so that 'Jane sleeps' is OK,
0:44:13.221 --> 0:44:16.340
but Jane sleeps the house is not OK.
0:44:16.436 --> 0:44:21.073
And 'Jane bought the house' is okay, but just 'Jane bought'
is not okay.
0:44:23.183 --> 0:44:29.285
Furthermore, long-range dependencies might
be difficult, and also which word orders are allowed
0:44:29.285 --> 0:44:31.056
and which are not allowed.
0:44:31.571 --> 0:44:40.011
This is also not direct: you can say 'Maria
gibt dem Mann das Buch', 'Dem Mann gibt Maria das
0:44:40.011 --> 0:44:47.258
Buch', 'Das Buch gibt Maria dem Mann', but
'Maria dem Mann gibt das Buch' is somehow off.
0:44:47.227 --> 0:44:55.191
Yeah, and which of these orders is possible
and which not is sometimes not
0:44:55.191 --> 0:44:56.164
simple to model.
0:44:56.876 --> 0:45:05.842
Therefore, people have done more complex stuff,
like unification grammars, and tried to
0:45:05.842 --> 0:45:09.328
model more than the plain category of the verb.
0:45:09.529 --> 0:45:13.367
The agreement has to be, say, third person
and singular.
0:45:13.367 --> 0:45:20.028
You're joining that so you're annotating this
thing with more information and then you have
0:45:20.028 --> 0:45:25.097
more complex syntactic structures in order
to also model these phenomena.
0:45:28.948 --> 0:45:33.137
Yeah, why is this difficult?
0:45:33.873 --> 0:45:39.783
We have different ambiguities, and that makes
it difficult: words have different part
0:45:39.783 --> 0:45:43.610
of speech tags, and you can have 'time flies
like an arrow'.
0:45:43.583 --> 0:45:53.554
It can mean that the insects, the 'time flies',
like an arrow, or it can mean that the time
0:45:53.554 --> 0:45:59.948
is flying very fast, is going away very fast,
like an arrow.
0:46:00.220 --> 0:46:10.473
And if you want to build a parse tree, these two
meanings have different part-of-speech tags,
0:46:10.473 --> 0:46:13.008
so in one reading 'flies' is the verb.
0:46:13.373 --> 0:46:17.999
And of course that is a different semantics,
so the two readings are very different.
0:46:19.499 --> 0:46:23.361
And then there is structural
0:46:23.243 --> 0:46:32.419
ambiguity, so that some part of the sentence
can be built by different rules; the famous thing
0:46:32.419 --> 0:46:34.350
is the PP attachment.
0:46:34.514 --> 0:46:39.724
So: the cop saw the burglar with the binoculars.
0:46:39.724 --> 0:46:48.038
Then 'with the binoculars' can be attached to 'saw',
or it can be attached to 'the burglar'.
0:46:48.448 --> 0:46:59.897
And in this case it's more probable
that the cop saw the burglar through them, and not that the burglar
0:46:59.897 --> 0:47:01.570
had the binoculars.
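This attachment ambiguity can be made concrete by counting parse trees. The sketch below (toy grammar with both attachment rules; names and rules are my own illustration) finds exactly two parses for the sentence:

```python
from functools import lru_cache

# Binary rules (lhs, left child, right child), including BOTH attachments.
BINARY = [
    ("S", "NP", "VP"),
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # attach the PP to the noun phrase ...
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # ... or attach it to the verb phrase
    ("PP", "P", "NP"),
]
LEXICAL = {
    "the": "Det", "cop": "N", "burglar": "N",
    "binoculars": "N", "saw": "V", "with": "P",
}

def count_parses(tokens, symbol="S"):
    """Count distinct parse trees for the token span under the toy grammar."""
    toks = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(sym, i, j):
        if j - i == 1:
            return 1 if LEXICAL.get(toks[i]) == sym else 0
        total = 0
        for lhs, b, c in BINARY:
            if lhs != sym:
                continue
            for mid in range(i + 1, j):
                total += count(b, i, mid) * count(c, mid, j)
        return total

    return count(symbol, 0, len(toks))

sent = "the cop saw the burglar with the binoculars".split()
print(count_parses(sent))  # 2: PP attaches to "saw" or to "the burglar"
```

The two counts correspond exactly to the two readings discussed: seeing with the binoculars versus the burglar who has the binoculars.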
0:47:01.982 --> 0:47:13.356
And this, of course, makes things difficult
because while parsing and building the structure you are implicitly
0:47:13.356 --> 0:47:16.424
defining the semantics.
0:47:20.120 --> 0:47:29.736
Then we would go on to semantics,
but maybe first: are there any questions about syntax and
0:47:29.736 --> 0:47:31.373
how that works.
0:47:33.113 --> 0:47:46.647
Then we'll do a bit more about semantics;
so far we have only described the structure of the
0:47:46.647 --> 0:47:48.203
sentence.
0:47:48.408 --> 0:47:55.584
And for the meaning of the sentence we typically
have the compositionality of meaning.
0:47:55.584 --> 0:48:03.091
The meaning of the full sentence is determined
by the meaning of the individual words, and
0:48:03.091 --> 0:48:06.308
together they form the meaning of the whole.
0:48:06.686 --> 0:48:17.936
For words that is partly true, but not always;
for things like 'rainbow' you get the meaning
jointly from 'rain'
and bow.
0:48:19.319 --> 0:48:26.020
But this is not always the case; for sentences,
though, it typically holds, because you can't
0:48:26.020 --> 0:48:30.579
directly determine the full meaning, but you
split it into parts.
0:48:30.590 --> 0:48:36.164
Sometimes it only holds in parts, like in
the expression 'kick the bucket'.
0:48:36.164 --> 0:48:43.596
Of course you cannot get the meaning of 'kick
the bucket' by looking at the individual words, or
0:48:43.596 --> 0:48:46.130
in German, 'ins Gras beißen'.
0:48:47.207 --> 0:48:53.763
You cannot get that somebody died by looking at
the individual words of 'ins Gras beißen', but
0:48:53.763 --> 0:48:54.611
together they have this meaning.
0:48:55.195 --> 0:49:10.264
And there are different ways of describing
the meaning that people have tried; some are more commonly
0:49:10.264 --> 0:49:13.781
used for some tasks.
0:49:14.654 --> 0:49:20.073
We will come to that. So the first thing would be something
like first-order logic.
0:49:20.073 --> 0:49:27.297
If you have Peter loves Jane then you have
this meaning, and you have a representation
0:49:27.297 --> 0:49:33.005
with a 'loves' relation between Peter
and Jane, and you try to construct that.
0:49:32.953 --> 0:49:40.606
You will see that this is a lot more complex
than only doing syntax, because you are also
0:49:40.606 --> 0:49:43.650
building this type of representation.
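A minimal sketch of such a predicate-style meaning representation (the class, the tiny knowledge base, and the helper are illustrative, not a real semantic parser):

```python
from typing import NamedTuple

class Predicate(NamedTuple):
    """A first-order-logic-style fact: relation name plus arguments."""
    relation: str
    args: tuple

# "Peter loves Jane" parsed into loves(Peter, Jane).
kb = {
    Predicate("loves", ("Peter", "Jane")),
}

def holds(relation, *args):
    """Check whether a fact is in the knowledge base."""
    return Predicate(relation, args) in kb

print(holds("loves", "Peter", "Jane"))  # True
print(holds("loves", "Jane", "Peter"))  # False: the relation is directed
```

Note that the argument order matters: loves(Peter, Jane) and loves(Jane, Peter) are different facts, which is exactly what the syntax analysis has to get right.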
0:49:44.164 --> 0:49:47.761
The other thing is to try to do frame semantics.
0:49:47.867 --> 0:49:55.094
That means you try to represent knowledge
about the world, and you have these frames.
0:49:55.094 --> 0:49:58.372
For example, you might have a frame to buy.
0:49:58.418 --> 0:50:05.030
And the meaning is that you have a commercial
transaction.
0:50:05.030 --> 0:50:08.840
You have a person who is selling.
0:50:08.969 --> 0:50:10.725
You have a person who's buying.
0:50:11.411 --> 0:50:16.123
You have something that is priced, you might
have a price, and so on.
0:50:17.237 --> 0:50:22.698
And then, in semantic parsing
with frame semantics, you first try to determine
0:50:22.902 --> 0:50:30.494
which frames are present in the sentence.
So if it's something with buying, you
0:50:30.494 --> 0:50:33.025
would try to first identify.
0:50:33.025 --> 0:50:40.704
Oh, here we have the frame 'buy', which does
not always have to be indicated by the verb:
0:50:40.704 --> 0:50:42.449
it can be 'sell' or other expressions.
0:50:42.582 --> 0:50:52.515
And then you try to find out which elements
of these frame are in the sentence and try
0:50:52.515 --> 0:50:54.228
to align them.
0:50:56.856 --> 0:51:01.121
Yeah, you have, for example, to buy and sell.
0:51:01.121 --> 0:51:07.239
If you have a model that uses frames, both
evoke the same frame with the same elements.
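A sketch of that idea: 'buy' and 'sell' map their arguments onto the same frame roles (role names and the helper here are made up for illustration, not the official FrameNet inventory):

```python
def annotate(verb, subject, obj, other=None):
    """Map syntactic arguments to roles of a toy COMMERCE frame."""
    if verb == "buy":
        frame = {"Buyer": subject, "Goods": obj, "Seller": other}
    elif verb == "sell":
        frame = {"Seller": subject, "Goods": obj, "Buyer": other}
    else:
        raise ValueError("verb does not evoke the COMMERCE frame")
    # Drop roles that were not filled in this sentence.
    return {role: filler for role, filler in frame.items() if filler}

# "Jane bought the house from Jim" and "Jim sold the house to Jane"
# end up with identical role assignments.
print(annotate("buy", "Jane", "the house", "Jim"))
print(annotate("sell", "Jim", "the house", "Jane"))
```

This is why frame semantics is attractive for meaning representation: sentences with different syntax but the same event get the same analysis.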
0:51:09.829 --> 0:51:15.018
In addition, beyond single sentences, you
also have phenomena beyond the sentence level.
0:51:15.018 --> 0:51:20.088
We're coming to this later because it's a
special challenge for machine translation.
0:51:20.088 --> 0:51:22.295
There is, for example, co reference.
0:51:22.295 --> 0:51:27.186
That means if you first mention it, it's like
the President of the United States.
0:51:27.467 --> 0:51:30.107
And later you would refer to him maybe as
he.
0:51:30.510 --> 0:51:36.966
And that is especially challenging in machine
translation because you're not always using
0:51:36.966 --> 0:51:38.114
the same thing.
0:51:38.114 --> 0:51:44.355
Of course, for the president it's 'he' in English and
'er' in German, but for other things it might
0:51:44.355 --> 0:51:49.521
be different, depending on the gender
in the language you translate into.
0:51:55.435 --> 0:52:03.866
So much for the background. Next we
want to look, based on the knowledge we have
0:52:03.866 --> 0:52:04.345
now.
0:52:04.345 --> 0:52:10.285
Why is machine translation difficult?
0:52:16.316 --> 0:52:22.471
The first type of problem is what we refer
to as translation divergences.
0:52:22.471 --> 0:52:30.588
That means that we have the same information
in source and target, but the problem is that
0:52:30.588 --> 0:52:33.442
they are expressed differently.
0:52:33.713 --> 0:52:42.222
So it is not expressed the same way, and we
cannot translate these things easily by just
0:52:42.222 --> 0:52:44.924
doing something simple; we need something a bit more complex.
0:52:45.325 --> 0:52:51.324
So an example is the structure in
English, "the delicious soup."
0:52:51.324 --> 0:52:59.141
The adjective is before the noun, while in
Spanish you have to put it after the noun,
0:52:59.141 --> 0:53:02.413
and so you have to change the word order.
0:53:02.983 --> 0:53:10.281
So there are different types of divergence,
so there can be structural divergence, which
0:53:10.281 --> 0:53:10.613
is.
0:53:10.550 --> 0:53:16.121
The word orders so that the order is different,
so in German we have that especially in the
0:53:16.121 --> 0:53:19.451
in the sub clause, while in English in the
sub clause.
0:53:19.451 --> 0:53:24.718
The verb is also at the second position, in
German it's at the end, and so you have to
0:53:24.718 --> 0:53:25.506
move it all.
0:53:25.465 --> 0:53:27.222
Um, all over.
0:53:27.487 --> 0:53:32.978
It can be that it's a completely different
grammatical role.
0:53:33.253 --> 0:53:35.080
So:
0:53:35.595 --> 0:53:37.458
You have "you like her."
0:53:38.238 --> 0:53:41.472
And, eh, in
0:53:41.261 --> 0:53:47.708
English. In Spanish it's "ella te gusta", which
means "she" is now no longer the object
0:53:47.708 --> 0:53:54.509
but the subject here, and "you" is now accusative,
and it "pleases" you, or so; you really
0:53:54.509 --> 0:53:58.689
use a different sentence structure and you
have to change it.
0:53:59.139 --> 0:54:03.624
It can also be head switching.
0:54:03.624 --> 0:54:09.501
In English you say the baby just ate.
0:54:09.501 --> 0:54:16.771
In Spanish, literally, you say the baby finishes eating.
0:54:16.997 --> 0:54:20.803
So "eat" is no longer the head verb; the finishing
is the head verb.
0:54:21.241 --> 0:54:30.859
So you have to learn that you cannot always
have the same structures in your input and
0:54:30.859 --> 0:54:31.764
output.
0:54:36.856 --> 0:54:42.318
There are lexical divergences, like "to swim
across" versus "to cross swimming."
0:54:43.243 --> 0:54:57.397
You have categorial divergence, like an adjective
turning into a noun, or "to decide" turning into
0:54:57.397 --> 0:55:00.162
"make a decision."
0:55:00.480 --> 0:55:15.427
That is the one challenge and the even bigger
challenge is referred to as translation mismatches.
0:55:17.017 --> 0:55:19.301
That can be a lexical mismatch.
0:55:19.301 --> 0:55:21.395
That's the fish we talked about.
0:55:21.395 --> 0:55:27.169
If it's like the fish you eat or the
fish which is living, these are two different words
0:55:27.169 --> 0:55:27.931
in Spanish.
0:55:28.108 --> 0:55:34.334
And then that's partly sometimes even not
known, so even the human might not be able
0:55:34.334 --> 0:55:34.627
to.
0:55:34.774 --> 0:55:40.242
infer that; you maybe need to see the context,
you maybe need to have the sentences around,
0:55:40.242 --> 0:55:45.770
so one problem is that at least traditional
machine translation works on a sentence level,
0:55:45.770 --> 0:55:51.663
so we take each sentence and translate it independent
of everything else, but that's, of course,
0:55:51.663 --> 0:55:52.453
not correct.
0:55:52.532 --> 0:55:59.901
We will look into some ways of doing
document-based machine translation later.
0:56:00.380 --> 0:56:06.793
There's gender information that might be a problem:
so in English it's player and you don't know
0:56:06.793 --> 0:56:10.139
if it's "Spieler" or "Spielerin", or if it's not known.
0:56:10.330 --> 0:56:15.770
But in the English, if you now generate German,
you should know about the reader:
0:56:15.770 --> 0:56:21.830
Does he know the gender or does he not know
the gender and then generate the right one?
0:56:22.082 --> 0:56:38.333
So just imagine a commentator if he's talking
about the player and you can see if it's male
0:56:38.333 --> 0:56:40.276
or female.
0:56:40.540 --> 0:56:47.801
So in general the problem is that if you
have less information and you need more information
0:56:47.801 --> 0:56:51.928
in your target, this translation doesn't really
work.
0:56:55.175 --> 0:56:59.180
Another problem is, as we just talked about:
0:56:59.119 --> 0:57:01.429
Coreference.
0:57:01.641 --> 0:57:08.818
So if you refer to an object and that can
be across sentence boundaries then you have
0:57:08.818 --> 0:57:14.492
to use the right pronoun and you cannot just
translate the pronoun.
0:57:14.492 --> 0:57:18.581
"If the baby does not thrive on raw milk, boil
it."
0:57:19.079 --> 0:57:28.279
And if you are now using it and just take
the typical translation, it will be "es". And that
0:57:28.279 --> 0:57:31.065
will be, ah, wrong.
0:57:31.291 --> 0:57:35.784
No, that will even be right, because it is
"das Baby."
0:57:35.784 --> 0:57:42.650
Yes, but I mean, you have to determine that
and it might be wrong at some point.
0:57:42.650 --> 0:57:48.753
So getting this right, um, yeah, it will be wrong,
yes, that is right, yeah.
0:57:48.908 --> 0:57:55.469
Because in English both baby and milk are
referred to as "it"; so if you
0:57:55.469 --> 0:58:02.180
use "es" it will refer to the first one mentioned,
so it's correct, but in German it will be
0:58:02.180 --> 0:58:06.101
"es", and so if you translate it as "es" it will
refer to the baby.
0:58:06.546 --> 0:58:13.808
But you have to use "sie" because milk is feminine,
although that is really very uncommon because
0:58:13.808 --> 0:58:18.037
maybe the milk is an object and so it should
be neuter.
0:58:18.358 --> 0:58:25.176
Of course, I agree there might be a situation
which is a bit constructed and not a common thing,
0:58:25.176 --> 0:58:29.062
but you can see that these things are not that
easy.
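(Not from the lecture: the pronoun problem just discussed can be made concrete in a tiny sketch. Choosing the German pronoun for English "it" requires the grammatical gender of the antecedent; the three-word gender lexicon here is purely illustrative.)

```python
# Choosing a German pronoun for English "it" requires the grammatical
# gender of the antecedent, which the English source does not carry.
# Three-entry gender lexicon, purely illustrative.
GENDER = {"Baby": "neut", "Milch": "fem", "Motor": "masc"}
PRONOUN = {"neut": "es", "fem": "sie", "masc": "er"}

def german_pronoun(antecedent):
    # "boil it": if "it" resolves to the milk, German needs "sie",
    # not the default "es".
    return PRONOUN[GENDER[antecedent]]

print(german_pronoun("Baby"))   # es
print(german_pronoun("Milch"))  # sie
```

The hard part, of course, is not this lookup but deciding which antecedent "it" resolves to in the first place.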
0:58:29.069 --> 0:58:31.779
Another example is this: Dr.
0:58:31.779 --> 0:58:37.855
McLean often brings his dog Champion to visit
with his patients.
0:58:37.855 --> 0:58:41.594
He loves to give big wet sloppy kisses.
0:58:42.122 --> 0:58:58.371
And there, of course, it's also important
whether "he" refers to the dog or to the doctor.
0:58:59.779 --> 0:59:11.260
Another example of a challenge is that we
don't have a fixed language and that was referred
0:59:11.260 --> 0:59:16.501
to as morphology, and we can build new words.
0:59:16.496 --> 0:59:23.787
So we can in all languages build new words
by just concatenating parts, like "Brexit" and
0:59:23.787 --> 0:59:30.570
things like that. And then, of course,
words in languages don't exist
0:59:30.570 --> 0:59:31.578
in isolation.
0:59:32.012 --> 0:59:41.591
In German you can now use the word "download"
somewhere and you can also use a morphological
0:59:41.591 --> 0:59:43.570
operation on that.
0:59:43.570 --> 0:59:48.152
I guess there is not even an agreed correct form.
0:59:48.508 --> 0:59:55.575
But so you have to deal with these things,
and yeah, in social media.
0:59:55.996 --> 1:00:00.215
This word, maybe most of you have forgotten it
already.
1:00:00.215 --> 1:00:02.517
This was ten years ago or so.
1:00:02.517 --> 1:00:08.885
I don't know; there was a volcano in Iceland
which stopped Europeans from flying around.
1:00:09.929 --> 1:00:14.706
So there are always new words coming up and
you have to deal with them.
1:00:18.278 --> 1:00:24.041
Yeah, one last thing, so some of these examples
we have seen are a bit artificial.
1:00:24.041 --> 1:00:30.429
So one example that is very common, where machine
translation doesn't really work, is "this box
1:00:30.429 --> 1:00:31.540
was in the pen."
1:00:32.192 --> 1:00:36.887
And maybe you were surprised, at least
when you read it.
1:00:36.887 --> 1:00:39.441
How can a box be inside a pen?
1:00:40.320 --> 1:00:44.175
Does anybody have a solution for that while
the sentence is still correct?
1:00:47.367 --> 1:00:51.692
Maybe it's directly clear for you, maybe your
English is better, yeah.
1:00:54.654 --> 1:01:07.377
Yes, like at a farm or for small children,
and such an enclosure is also called a pen on a
1:01:07.377 --> 1:01:08.254
farm.
1:01:08.368 --> 1:01:12.056
And then this makes sense, okay.
1:01:12.056 --> 1:01:16.079
To distinguish these two meanings is quite difficult.
1:01:16.436 --> 1:01:23.620
But at least when I saw it, I wasn't completely
convinced because it's maybe not the sentence
1:01:23.620 --> 1:01:29.505
you're using in your daily life, and some of
these constructions seem a bit artificial.
1:01:29.509 --> 1:01:35.155
They are very good in showing where the problem
is, but the question is, does it really apply
1:01:35.155 --> 1:01:35.995
in real life?
1:01:35.996 --> 1:01:42.349
And therefore here some examples also that
we had here with a lecture translator that
1:01:42.349 --> 1:01:43.605
really occurred.
1:01:43.605 --> 1:01:49.663
They maybe looked simple, but you will see
that some of them still are happening.
1:01:50.050 --> 1:01:53.948
And they are partly about splitting words,
and then they are happening.
1:01:54.294 --> 1:01:56.816
So, um.
1:01:56.596 --> 1:02:03.087
We had a text about the numeral system in
German, the "Zahlensystem", which got split
1:02:03.087 --> 1:02:07.041
into subparts, because otherwise we can't translate it.
1:02:07.367 --> 1:02:14.927
And then it did only an approximate match and
was talking about the binary payment system
1:02:14.927 --> 1:02:23.270
because the payment system was a lot more common
in the training data than the "Zahlensystem".
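(Not from the lecture: the splitting-plus-approximate-match behavior behind this error can be sketched as a toy compound splitter. The vocabulary, glosses, and matching heuristic below are invented; real systems use far larger vocabularies and learned subword models.)

```python
import difflib

# Toy reconstruction of the "Zahlensystem" error: split an unknown
# compound greedily against a vocabulary, approximate-matching parts
# that are not found. Vocabulary and glosses are invented.
VOCAB = {"zahlung": "payment", "system": "system", "binär": "binary"}

def split_and_gloss(compound):
    glosses, i = [], 0
    word = compound.lower()
    while i < len(word):
        # greedy longest known prefix
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                glosses.append(VOCAB[word[i:j]])
                break
        else:
            # no exact prefix: approximate-match the next chunk,
            # which is where "zahlen" gets pulled toward "zahlung"
            head = word[i:i + 6]
            close = difflib.get_close_matches(head, VOCAB, n=1, cutoff=0.5)
            glosses.append(VOCAB[close[0]] if close else head)
            j = i + 6
        i = j
    return " ".join(glosses)

print(split_and_gloss("Zahlensystem"))
# payment system
```

So the numeral system comes out as the "payment system", simply because the payment-related part is the closest known piece.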
1:02:23.823 --> 1:02:29.900
And so there you see like rare words, which
don't occur that often.
1:02:29.900 --> 1:02:38.211
They are very challenging to deal with because
we are good at inferring that sometimes, but
1:02:38.211 --> 1:02:41.250
for others that's very difficult.
1:02:44.344 --> 1:02:49.605
Another challenge is that, of course, the
context is very difficult to handle.
1:02:50.010 --> 1:02:56.448
This is also an example a bit older from also
the lecture translators we were translating
1:02:56.448 --> 1:03:01.813
in a math lecture, and it was always talking
about the omens of the numbers.
1:03:02.322 --> 1:03:11.063
Which doesn't make any sense at all, but the
German word "Vorzeichen" can of course mean the
1:03:11.063 --> 1:03:12.408
sign and the omen.
1:03:12.732 --> 1:03:22.703
And if you do not have the right domain knowledge
in there encoded, it might use the wrong domain
1:03:22.703 --> 1:03:23.869
knowledge.
1:03:25.705 --> 1:03:31.205
A more recent version of that is like here
from a paper where it's about translating.
1:03:31.205 --> 1:03:36.833
We had this pivot based translation where
you translate maybe to English and then to another
1:03:36.833 --> 1:03:39.583
because you have not enough training data.
1:03:40.880 --> 1:03:48.051
And we did that from Dutch to German; you can guess
it even if you don't understand Dutch, if you speak
1:03:48.051 --> 1:03:48.710
German.
1:03:48.908 --> 1:03:56.939
So we have this Dutch "een voorbeeld geven", which
means to give an example.
1:03:56.939 --> 1:04:05.417
It was correctly translated as "setting an example." However,
when we then translate to German, it didn't
1:04:05.417 --> 1:04:11.524
get the full context, and in German you normally
don't set an example, but you give an example,
1:04:11.524 --> 1:04:16.740
and so yes, going through another language
you introduce additional errors there.
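(Not from the lecture: pivot translation is just the composition of two translation steps, and the idiom error above falls out of that composition. The phrase tables below are toy assumptions, not the real system's.)

```python
# Pivot translation: source -> pivot (English) -> target (German).
# The Dutch idiom becomes "set" in English; composing the two steps
# then forces German to the literal, unidiomatic "setzen".
# All dictionaries here are toy phrase tables.
NL_EN = {"een voorbeeld geven": "to set an example"}
EN_DE = {"to set an example": "ein Beispiel setzen"}   # literal, unidiomatic
NL_DE = {"een voorbeeld geven": "ein Beispiel geben"}  # direct, idiomatic

def pivot(phrase):
    # compose the two translation steps through English
    return EN_DE[NL_EN[phrase]]

print(pivot("een voorbeeld geven"))   # ein Beispiel setzen  (wrong verb)
print(NL_DE["een voorbeeld geven"])   # ein Beispiel geben   (correct)
```

The direct system keeps the idiomatic verb; the pivot bakes in the English verb choice, which is exactly the error described.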
1:04:19.919 --> 1:04:27.568
Good, so much for this. Are there more questions
about why this is difficult?
1:04:30.730 --> 1:04:35.606
Then we'll start with this one.
1:04:35.606 --> 1:04:44.596
I have to leave a bit early today in a quarter
of an hour.
1:04:44.904 --> 1:04:58.403
If you look at linguistic approaches to
machine translation, they are typically described
1:04:58.403 --> 1:05:03.599
by this triangle. So we can do a direct translation: you
take the source language.
1:05:03.599 --> 1:05:09.452
Do not apply a lot of the analysis we were
discussing today about syntax representation,
1:05:09.452 --> 1:05:11.096
semantic representation.
1:05:11.551 --> 1:05:14.678
But you directly translate to your target
text.
1:05:14.678 --> 1:05:16.241
That's the direct approach here.
1:05:16.516 --> 1:05:19.285
Then there is a transfer based approach.
1:05:19.285 --> 1:05:23.811
Then you transfer everything over and you
do the text translation.
1:05:24.064 --> 1:05:28.354
And you can do that at two levels, more at
the syntax level.
1:05:28.354 --> 1:05:34.683
That means you only do syntactic analysis,
like you run a parser or so, or at the semantic
1:05:34.683 --> 1:05:37.848
level, where you do semantic parsing with frames.
1:05:38.638 --> 1:05:51.489
Then there is an interlingua based approach
where you don't do any transfer anymore, but
1:05:51.489 --> 1:05:55.099
you only do an analysis.
1:05:57.437 --> 1:06:02.790
So how does the direct transfer, the direct
translation, look?
1:06:03.043 --> 1:06:07.031
It's one of the earliest approaches.
1:06:07.327 --> 1:06:18.485
So you do maybe some morphological analysis,
but not a lot, and then you do this bilingual
1:06:18.485 --> 1:06:20.202
word mapping.
1:06:20.540 --> 1:06:25.067
You might do some generation here.
1:06:25.067 --> 1:06:32.148
These two steps are not really big; mostly you
are working on the word level.
1:06:32.672 --> 1:06:39.237
And of course this might be a first easy solution
for all the challenges we have seen: that
1:06:39.237 --> 1:06:41.214
the structure is different.
1:06:41.214 --> 1:06:45.449
That you have to reorder, look at the agreement,
and so on.
1:06:45.449 --> 1:06:47.638
That's why this was only the first approach.
1:06:47.827 --> 1:06:54.618
So if we have different word order, structural
shifts or idiomatic expressions that doesn't
1:06:54.618 --> 1:06:55.208
really work.
1:06:57.797 --> 1:07:05.034
Then there are these rule based approaches
which were more commonly used.
1:07:05.034 --> 1:07:15.249
They might still be used somewhere; I mean, most commonly
systems are now neural networks, but I wouldn't
1:07:15.249 --> 1:07:19.254
be sure there is no such system out there.
1:07:19.719 --> 1:07:25.936
And in this transfer based approach we have
these steps, nicely visualized in the
1:07:26.406 --> 1:07:32.397
triangle. So we have the analysis of the source
sentence where we then get some type of abstract
1:07:32.397 --> 1:07:33.416
representation.
1:07:33.693 --> 1:07:40.010
Then we are doing the transfer of the representation
of the source sentence into the representation
1:07:40.010 --> 1:07:40.263
of the target sentence.
1:07:40.580 --> 1:07:46.754
And then we have the generation where we take
this abstract representation and do then the
1:07:46.754 --> 1:07:47.772
surface forms.
1:07:47.772 --> 1:07:54.217
For example, it might be that there is no
morphological variant in the abstract representation
1:07:54.217 --> 1:07:56.524
and we have to do this agreement.
1:07:56.656 --> 1:08:00.077
Which components do you then need?
1:08:01.061 --> 1:08:08.854
You need monolingual source and target lexicon
and the corresponding grammars in order to
1:08:08.854 --> 1:08:12.318
do both the analysis and the generation.
1:08:12.412 --> 1:08:18.584
Then you need the bilingual dictionary in
order to do the lexical translation and the
1:08:18.584 --> 1:08:25.116
bilingual transfer rules in order to transfer
the grammar, for example in German, into the
1:08:25.116 --> 1:08:28.920
grammar in English, and that enables you to
do that.
1:08:29.269 --> 1:08:32.579
So an example is something like this here.
1:08:32.579 --> 1:08:38.193
So if you're doing a syntactic transfer it
means you're starting with John E.
1:08:38.193 --> 1:08:38.408
Z.
1:08:38.408 --> 1:08:43.014
Apple you do the analyst then you have this
type of graph here.
1:08:43.014 --> 1:08:48.340
Therefore you need your monolingual lexicon
and your monolingual grammar.
1:08:48.748 --> 1:08:59.113
Then you're doing the transfer where you're
transferring this representation into this
1:08:59.113 --> 1:09:01.020
representation.
1:09:01.681 --> 1:09:05.965
So what could this type of translation then
look like?
1:09:07.607 --> 1:09:08.276
So.
1:09:08.276 --> 1:09:14.389
We have the example of a delicious soup and
"una sopa deliciosa."
1:09:14.894 --> 1:09:22.173
This is your source language tree and this
is your target language tree and then the rules
1:09:22.173 --> 1:09:26.092
that you need are these ones to do the transfer.
1:09:26.092 --> 1:09:31.211
So if you have a noun phrase that also goes
to the noun phrase.
1:09:31.691 --> 1:09:44.609
You see here that the switch is happening,
so the second position is here at the first
1:09:44.609 --> 1:09:46.094
position.
1:09:46.146 --> 1:09:52.669
Then you have the translation of the determiner
and of the words, so the dictionary entries.
1:09:53.053 --> 1:10:07.752
And with these types of rules you can then
do these mappings and do the transfer between
1:10:07.752 --> 1:10:11.056
the representation.
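(Not from the lecture: the NP rule just described, with the adjective-noun swap plus dictionary lookup for the leaves, can be sketched on a flat toy tree. The rule and lexicon are illustrative assumptions.)

```python
# Syntactic transfer for NP -> DET ADJ N  =>  NP -> DET N ADJ,
# plus leaf-level dictionary translation. Toy rule and lexicon.
LEX = {"a": "una", "delicious": "deliciosa", "soup": "sopa"}

def transfer_np(tree):
    label, det, adj, noun = tree            # ("NP", det, adj, noun)
    assert label == "NP"
    # structural transfer: swap adjective and noun,
    # lexical transfer: translate each leaf via the dictionary
    return (label, LEX[det], LEX[noun], LEX[adj])

print(transfer_np(("NP", "a", "delicious", "soup")))
# ('NP', 'una', 'sopa', 'deliciosa')
```

The rule touches only the structure; the bilingual dictionary handles the words, which is exactly the division of labor between transfer rules and dictionary entries described above.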
1:10:25.705 --> 1:10:32.505
I think it more depends on the amount of expertise
you have in representing them.
1:10:32.505 --> 1:10:35.480
The rules will get more difficult.
1:10:36.136 --> 1:10:42.445
For example, with these rule-based systems, I think
it more depends on how difficult the structure
1:10:42.445 --> 1:10:42.713
is.
1:10:42.713 --> 1:10:48.619
So for generating German they were for quite
a long time quite successful, because modeling
1:10:48.619 --> 1:10:52.579
all the German phenomena which are in there
was difficult.
1:10:52.953 --> 1:10:56.786
And that could be done there, while it wasn't
easy to learn that just from data.
1:10:59.019 --> 1:11:07.716
I think even if you think about Chinese and
English or so, if you have the trees there
1:11:07.716 --> 1:11:10.172
are quite some rules needed.
1:11:15.775 --> 1:11:23.370
Another thing is you can also try to do something
like that on the semantic, which means this
1:11:23.370 --> 1:11:24.905
gets more complex.
1:11:25.645 --> 1:11:31.047
This gets maybe a bit easier because this
representation, the semantic representation
1:11:31.047 --> 1:11:36.198
between languages, are more similar, while the
analysis gets more difficult again.
1:11:36.496 --> 1:11:45.869
So typically if you go higher in your triangle
this is more work while this is less work.
1:11:49.729 --> 1:11:56.023
So it can be then, for example, like with "gusta":
we have again that the order changes.
1:11:56.023 --> 1:12:02.182
So you see the transfer rule for "like" is that
the first argument is here and the second is
1:12:02.182 --> 1:12:06.514
there, while on the "gusta" side here
the second argument.
1:12:06.466 --> 1:12:11.232
It is in the first position and the first
argument is in the second position.
1:12:11.511 --> 1:12:14.061
So you do the reordering there as well, yeah.
1:12:14.354 --> 1:12:20.767
In principle it is more like you have
a different type of formalism of representing
1:12:20.767 --> 1:12:27.038
your sentence and therefore you need to do
more on one side and less on the other side.
1:12:32.852 --> 1:12:42.365
So in general, for transfer-based approaches,
you first have to select how to represent
1:12:42.365 --> 1:12:44.769
the syntactic structure.
1:12:45.165 --> 1:12:55.147
There are these variable abstraction levels,
and then you have the three components. The
1:12:55.147 --> 1:13:04.652
disadvantage is that on the one hand you need
normally a lot of experts, monolingual experts,
1:13:04.652 --> 1:13:08.371
who analyze how to do the transfer.
1:13:08.868 --> 1:13:18.860
And if you're doing a new language, you have
to do analysis and generation, and the
1:13:18.860 --> 1:13:19.970
transfer.
1:13:20.400 --> 1:13:27.074
So if you need one language, add one language
in existing systems, of course you have to
1:13:27.074 --> 1:13:29.624
do transfer to all the languages.
1:13:32.752 --> 1:13:39.297
Therefore, the other idea which people were
interested in is the interlingua based machine
1:13:39.297 --> 1:13:40.232
translation.
1:13:40.560 --> 1:13:47.321
Where the idea is that we have this intermediate
language with this abstract language independent
1:13:47.321 --> 1:13:53.530
representation and so the important thing is
it's language independent so it's really the
1:13:53.530 --> 1:13:59.188
same for all language and it's a pure meaning
and there is no ambiguity in there.
1:14:00.100 --> 1:14:05.833
That allows this nice translation without
transfer, so you just do an analysis into your
1:14:05.833 --> 1:14:11.695
representation, and there afterwards you do
the generation into the other target language.
1:14:13.293 --> 1:14:16.953
And that of course makes multilingual translation especially easy.
1:14:16.953 --> 1:14:19.150
It's somehow like a dream.
1:14:19.150 --> 1:14:25.519
If you want to add a language you just need
to add one analysis tool and one generation
1:14:25.519 --> 1:14:25.959
tool.
1:14:29.249 --> 1:14:32.279
Which is not the case in the other scenario.
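(Not from the lecture slides: the "add one language" argument is simple combinatorics. Pairwise transfer needs a module per ordered language pair, while an interlingua needs only one analyzer and one generator per language.)

```python
# Number of components needed to translate between all ordered
# pairs of n languages.
def transfer_components(n):
    return n * (n - 1)   # one transfer module per ordered pair

def interlingua_components(n):
    return 2 * n         # one analyzer + one generator per language

for n in (2, 5, 10):
    print(n, transfer_components(n), interlingua_components(n))
# 2 2 4
# 5 20 10
# 10 90 20
```

So for two languages the interlingua is more work, but from four or five languages onward it wins, and the gap grows quadratically.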
1:14:33.193 --> 1:14:40.547
However, the big challenge is in this case
the interlingua based representation because
1:14:40.547 --> 1:14:47.651
you need to represent all different types of
knowledge in there in order to do that.
1:14:47.807 --> 1:14:54.371
And also world knowledge, so something
like: an apple is a fruit, and a property of a
1:14:54.371 --> 1:14:57.993
fruit is that it is eatable, and stuff like that.
1:14:58.578 --> 1:15:06.286
So that is why this is typically always only
done for small amounts of data.
1:15:06.326 --> 1:15:13.106
So what people have done for special applications
like hotel reservation people have looked into
1:15:13.106 --> 1:15:18.348
that, but they have typically not done it for
any arbitrary input.
1:15:18.718 --> 1:15:31.640
So the disadvantage is you need to represent
all the world knowledge in your interlingua.
1:15:32.092 --> 1:15:40.198
And that is not possible at the moment,
and never was possible so far.
1:15:40.198 --> 1:15:47.364
Typically they were for small domains,
like hotel reservation.
1:15:51.431 --> 1:15:57.926
But of course this idea remains, and that's
why some people are interested in the question:
1:15:57.926 --> 1:16:04.950
like if you now do a neural system where you
learn the representation in your neural network
1:16:04.950 --> 1:16:07.442
is that some type of artificial
1:16:08.848 --> 1:16:09.620
interlingua?
1:16:09.620 --> 1:16:15.025
However, what we at least found out until
now is that there's often very language specific
1:16:15.025 --> 1:16:15.975
information in it.
1:16:16.196 --> 1:16:19.648
And they might be important and essential.
1:16:19.648 --> 1:16:26.552
You don't have all the information in your
input, so you typically can't do resolving
1:16:26.552 --> 1:16:32.412
all ambiguities inside there because you might
not have all information.
1:16:32.652 --> 1:16:37.870
So in English you don't know if it's a living
fish or the fish which you're eating, and if
1:16:37.870 --> 1:16:43.087
you're translating to German you also don't
have to resolve this problem because you have
1:16:43.087 --> 1:16:45.610
the same ambiguity in your target language.
1:16:45.610 --> 1:16:50.828
So why would you put in the effort of finding
out if it's the one fish or the other fish if it's
1:16:50.828 --> 1:16:52.089
not necessary at all?
1:16:54.774 --> 1:16:59.509
Yeah, yeah.
1:17:05.585 --> 1:17:15.019
In semantic transfer the representation is not the same for
both languages, so you still represent the
1:17:15.019 --> 1:17:17.127
semantics per language.
1:17:17.377 --> 1:17:23.685
So you have the semantic representation
with "gusta", but that's not the same semantic
1:17:23.685 --> 1:17:28.134
representation for both languages, and that's
the main difference.
1:17:35.515 --> 1:17:44.707
Okay, then these are the most important things
for today: what language is and how rule
1:17:44.707 --> 1:17:46.205
based systems work.
1:17:46.926 --> 1:17:59.337
And if there is no more questions thank you
for joining, we have today a bit of a shorter
1:17:59.337 --> 1:18:00.578
lecture.
|