WEBVTT
0:00:01.541 --> 0:00:06.926
Okay, so welcome back to today's lecture.
0:00:08.528 --> 0:00:23.334
What we want to talk about is speech translation,
so we'll have two lectures this week about
0:00:23.334 --> 0:00:26.589
speech translation.
0:00:27.087 --> 0:00:36.456
And then in the last week we'll have some exercises
and repetition.
0:00:36.456 --> 0:00:46.690
We want to look at what we now have to do when
we want to translate speech.
0:00:46.946 --> 0:00:55.675
So we want to address the specific challenges
that occur when we switch from translating
0:00:55.675 --> 0:00:56.754
text to speech.
0:00:57.697 --> 0:01:13.303
Today we will look at the more general picture
and how to build the systems.
0:01:13.493 --> 0:01:23.645
And then, secondly, an end-to-end approach where we
are going to put in audio and generate.
0:01:24.224 --> 0:01:41.439
These are the main dominant approaches which
are used in research and commercial systems.
0:01:43.523 --> 0:01:56.879
More generally, what is the task of
speech translation? That is shown here.
0:01:56.879 --> 0:02:01.826
The idea is that we have speech.
0:02:02.202 --> 0:02:12.838
Then we want to have a system which takes
this audio and then translates it into another
0:02:12.838 --> 0:02:14.033
language.
0:02:15.095 --> 0:02:20.694
Then the output modality is no longer as clear.
0:02:20.694 --> 0:02:33.153
In contrast, for humans we can typically have:
So you can either have more textual translation,
0:02:33.153 --> 0:02:37.917
then you have subtitles, and the.
0:02:38.538 --> 0:02:57.010
Or you want to have it also in audio, like
it's done in human interpretation.
0:02:57.417 --> 0:03:03.922
You see, there is not one best solution, so
it's not that one of these is always better.
0:03:03.922 --> 0:03:09.413
It heavily depends on what the use case is and what
people prefer.
0:03:09.929 --> 0:03:14.950
For example, think of the case where you know
the source language a bit, but you're a
0:03:14.950 --> 0:03:17.549
bit unsure and don't understand everything.
0:03:17.549 --> 0:03:23.161
Then maybe text output is better for this person, because
you can directly listen to what was said and
0:03:23.161 --> 0:03:26.705
only if you're unsure you check with the
translation.
0:03:27.727 --> 0:03:33.511
In other situations it might be preferable
to have a completely spoken output.
0:03:34.794 --> 0:03:48.727
So there are both options. For a long time,
automatic systems focused mainly on text output.
0:03:48.727 --> 0:04:06.711
In most cases: But of course you can always
hand the text to a text-to-speech system, which generates
0:04:06.711 --> 0:04:09.960
audio from that.
0:04:12.772 --> 0:04:14.494
Why should we care about that?
0:04:14.494 --> 0:04:15.771
Why should we do that?
0:04:17.737 --> 0:04:24.141
The nice thing is that, in a
globalized world, we are now able to interact
0:04:24.141 --> 0:04:25.888
with a lot more people.
0:04:25.888 --> 0:04:29.235
You can attend conferences around the world.
0:04:29.235 --> 0:04:31.564
We can travel around the world.
0:04:31.671 --> 0:04:37.802
Over the Internet we can watch movies from all over
the world and watch TV from all over the world.
0:04:38.618 --> 0:04:47.812
However, there is still this barrier that
you can mainly watch videos either in English
0:04:47.812 --> 0:04:49.715
or in a language you understand.
0:04:50.250 --> 0:05:00.622
So what is currently happening in order to
reach a large audience is that everybody.
0:05:00.820 --> 0:05:07.300
So if we are going, for example, to conferences,
these are international conferences.
0:05:08.368 --> 0:05:22.412
However, everybody will then speak English
since that is sort of the common language that
0:05:22.412 --> 0:05:26.001
everybody understands.
0:05:26.686 --> 0:05:32.929
On the other hand, we cannot have
human interpreters everywhere.
0:05:32.892 --> 0:05:37.797
You have that maybe in the European Parliament
or in important business meetings.
0:05:38.078 --> 0:05:47.151
But this is relatively expensive, and so the
question is, can we enable communication in
0:05:47.151 --> 0:05:53.675
your mother tongue without having to have human
interpretation?
0:05:54.134 --> 0:06:04.321
And there speech translation can be helpful
in order to bridge this gap.
0:06:06.726 --> 0:06:22.507
In this case, there are different scenarios
of how you can apply speech translation.
0:06:22.422 --> 0:06:29.282
That's typically more interactive than when we
are talking about text translation.
0:06:29.282 --> 0:06:32.800
Text translation is most commonly used.
0:06:33.153 --> 0:06:41.637
Of course, nowadays there are things like chat
and so on where it could also be interactive.
0:06:42.082 --> 0:06:48.299
In contrast, speech translation is
less static, so there are different ways of
0:06:48.299 --> 0:06:48.660
how.
0:06:49.149 --> 0:07:00.544
One scenario is what is called consecutive translation,
where you first get an input, then you translate
0:07:00.544 --> 0:07:03.799
this fixed input, and then.
0:07:04.944 --> 0:07:12.823
Which means you always have
fixed chunks which you need
0:07:12.823 --> 0:07:14.105
to translate.
0:07:14.274 --> 0:07:25.093
You don't need to rack your brain about what
the boundaries are and where there's an end.
0:07:25.405 --> 0:07:31.023
Also, there is no overlapping.
0:07:31.023 --> 0:07:42.983
There is always one person's sentence that
is getting translated.
0:07:43.443 --> 0:07:51.181
Of course, this has a disadvantage that it
makes the conversation a lot longer because
0:07:51.181 --> 0:07:55.184
you always have speech first and then the translation.
0:07:57.077 --> 0:08:03.780
For example, if you used that for a presentation,
it would get quite long. Just
0:08:03.780 --> 0:08:09.738
imagine you sitting here in the
lecture: I would say three sentences, then I
0:08:09.738 --> 0:08:15.765
would wait for the interpreter to translate
them, then I would say the next two sentences,
0:08:15.765 --> 0:08:16.103
and so on.
0:08:16.676 --> 0:08:28.170
That is why in these situations, for example,
if you have a direct conversation with a patient,
0:08:28.170 --> 0:08:28.888
then.
0:08:29.209 --> 0:08:32.733
But even there it's typically taking
very long.
0:08:33.473 --> 0:08:42.335
And that's why there's also research on
simultaneous translation, where the idea is
0:08:42.335 --> 0:08:43.644
to translate in parallel.
0:08:43.964 --> 0:08:46.179
That is what is done in human
0:08:46.126 --> 0:08:52.429
interpretation, if you think of things
like the European Parliament, where they of
0:08:52.429 --> 0:08:59.099
course do not speak just one sentence at a time but
are just giving their speech, and in parallel
0:08:59.099 --> 0:09:04.157
human interpreters are translating the speech
into another language.
0:09:04.985 --> 0:09:12.733
The same thing is interesting for automatic
speech translation where we in parallel generate
0:09:12.733 --> 0:09:13.817
translation.
0:09:15.415 --> 0:09:32.271
The challenges then, of course, are that we
need to segment our speech into some kind of chunks.
0:09:32.152 --> 0:09:34.903
For text, we just looked for the dots, as we saw.
0:09:34.903 --> 0:09:38.648
There are some challenges that we have to
check.
0:09:38.648 --> 0:09:41.017
The dot in "Dr." may not mark a sentence end.
0:09:41.201 --> 0:09:47.478
But in general, getting sentence boundaries
in text is not really a research question.
0:09:47.647 --> 0:09:51.668
While in speech translation, this is not that
easy.
0:09:51.952 --> 0:10:05.908
Getting that from the audio is difficult,
because it's not like we typically pause
0:10:05.908 --> 0:10:09.742
when a sentence ends.
0:10:10.150 --> 0:10:17.432
And even if you see the transcript and
have to add the punctuation, this is
0:10:17.432 --> 0:10:18.101
not as easy.
0:10:20.340 --> 0:10:25.942
Another question is how many speakers we have
here.
0:10:25.942 --> 0:10:31.759
In presentations you have more like a single
speaker.
0:10:31.931 --> 0:10:40.186
That is normally easier from the audio
processing side, so in general in speech translation.
0:10:40.460 --> 0:10:49.308
You can have different challenges, and they
can come from different components.
0:10:49.308 --> 0:10:57.132
In addition to translation, you have: And
if you don't have, for example, a single
0:10:57.132 --> 0:11:00.378
speaker, there are significant additional
challenges.
0:11:00.720 --> 0:11:10.313
So we as humans are very good at filtering
out noise, or, if two people speak in parallel,
0:11:10.313 --> 0:11:15.058
we can separate these two speakers and hear.
0:11:15.495 --> 0:11:28.300
However, if you want to do that with automatic
systems that is very challenging so that you
0:11:28.300 --> 0:11:33.172
can separate the speakers so that.
0:11:33.453 --> 0:11:41.284
Furthermore, if you have this multi-speaker
scenario, typically it's also less well prepared.
0:11:41.721 --> 0:11:45.807
So you're getting, as we'll talk about, the
spontaneous speech effects.
0:11:46.186 --> 0:11:53.541
So people will stop in the middle of
the sentence, they change their sentence, and
0:11:53.541 --> 0:12:01.481
so on, and filtering these disfluencies
out of the text and working with them is often
0:12:01.481 --> 0:12:02.986
very challenging.
0:12:05.565 --> 0:12:09.144
So these are all additional challenges when
you have multiple speakers.
0:12:10.330 --> 0:12:19.995
Then there's the question of an online or offline
system. In text translation
0:12:19.995 --> 0:12:21.836
we also mainly looked at offline systems.
0:12:21.962 --> 0:12:36.507
That means you can take the whole text and
you can translate it in a batch.
0:12:37.337 --> 0:12:44.344
However, for speech translation there's also
several scenarios where this is the case.
0:12:44.344 --> 0:12:51.513
For example, when you're translating a movie,
it's not only that you don't have to do it
0:12:51.513 --> 0:12:54.735
live, but you can take the whole movie.
0:12:55.215 --> 0:13:05.473
However, there are also a lot of situations
where you don't have this opportunity like
0:13:05.473 --> 0:13:06.785
or sports.
0:13:07.247 --> 0:13:13.963
And you don't want to first let
a sports event run and then show
0:13:13.963 --> 0:13:19.117
the game three hours later; then there is not
really any interest.
0:13:19.399 --> 0:13:31.118
So you have to do it live, and so we have
the additional challenge of translating with an online
0:13:31.118 --> 0:13:32.208
system.
0:13:32.412 --> 0:13:42.108
There are several things. On the one hand, of course,
0:13:42.108 --> 0:13:49.627
it needs to be real-time translation.
0:13:49.869 --> 0:13:54.153
If it's taking longer, then you're getting more
and more delayed.
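The real-time requirement can be made concrete with a small calculation. The following is a minimal sketch, not something shown in the lecture: the function names, the fixed chunk length, and the strictly sequential processing model are all assumptions made for illustration. It computes the real-time factor (processing time divided by audio duration) and shows that the delay keeps growing once that factor is above 1.

```python
from typing import List


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; above 1.0 means slower than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def accumulated_delay(rtf: float, chunk_seconds: float, num_chunks: int) -> List[float]:
    """Delay in seconds behind the speaker after each chunk has been processed.

    Assumption: audio arrives in fixed-length chunks that are processed one
    after another. Chunk i is fully spoken at (i + 1) * chunk_seconds, and the
    system starts on it once it is available and the previous chunk is done.
    """
    delays = []
    finished_at = 0.0
    for i in range(num_chunks):
        available_at = (i + 1) * chunk_seconds
        start = max(available_at, finished_at)
        finished_at = start + rtf * chunk_seconds
        delays.append(finished_at - available_at)
    return delays


if __name__ == "__main__":
    # Slower than real time: the delay grows with every chunk.
    print(accumulated_delay(rtf=1.5, chunk_seconds=2.0, num_chunks=3))
    # Faster than real time: the delay stays constant.
    print(accumulated_delay(rtf=0.5, chunk_seconds=2.0, num_chunks=3))
```

Under these assumptions, a factor below 1 keeps the delay bounded, while any factor above 1 lets it grow without limit over a long talk.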
0:13:55.495 --> 0:14:05.245
So it maybe seems simple, but there have been
research systems which are a hundred times slower
0:14:05.245 --> 0:14:07.628
than real time or so,
0:14:07.628 --> 0:14:15.103
if you want to show what is possible with
the best current systems.
0:14:16.596 --> 0:14:18.477
But even that is not enough.
0:14:18.918 --> 0:14:29.593
The other question: You can have a system
which is even several times faster than real time.
0:14:29.509 --> 0:14:33.382
In less than one second, it might still be
not useful.
0:14:33.382 --> 0:14:39.648
Then the question is the latency, so
how much time has passed until you can produce
0:14:39.648 --> 0:14:39.930
an output.
0:14:40.120 --> 0:14:45.814
It might be that on average you can process
it fast enough, but you still can't do it directly.
0:14:45.814 --> 0:14:51.571
You need to do it after, or you need to have
the full context of thirty seconds before you
0:14:51.571 --> 0:14:55.178
can output something, and then you have a large
latency.
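The difference between speed and latency can be sketched with a short calculation. This is an illustrative model only: the function name and the assumption that the system waits for a complete fixed-length window before it starts processing are not from the lecture, and the real-time factor is assumed to be at most 1 so that windows never queue up.

```python
import math


def output_latency(spoken_at: float, window_seconds: float, rtf: float) -> float:
    """Seconds between a word being spoken and its translation appearing.

    Assumptions: the system waits until the window containing the word is
    complete, then needs rtf * window_seconds to process that window.
    """
    window_index = math.floor(spoken_at / window_seconds)
    window_end = (window_index + 1) * window_seconds
    processing = rtf * window_seconds
    return window_end + processing - spoken_at


if __name__ == "__main__":
    # Same processing speed (twice as fast as real time), different context size.
    print(output_latency(spoken_at=1.0, window_seconds=30.0, rtf=0.5))
    print(output_latency(spoken_at=1.0, window_seconds=2.0, rtf=0.5))
```

In this sketch both systems have the same real-time factor, but the one that needs thirty seconds of context makes the listener wait far longer for a word spoken early in the window.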
0:14:55.335 --> 0:15:05.871
So it can be that you process it as fast as it is produced,
but have to wait until the full context is there.
0:15:06.426 --> 0:15:13.772
So we'll look into that on Thursday how we
can then generate translations that are having
0:15:13.772 --> 0:15:14.996
a low latency.
0:15:15.155 --> 0:15:21.587
You can imagine, for example, in German that
it's maybe quite challenging since the verb
0:15:21.587 --> 0:15:23.466
is often at the end.
0:15:23.466 --> 0:15:30.115
If you're using the perfect tense, like with "haben" and
so on, and then in English you have to directly
0:15:30.115 --> 0:15:30.983
produce it.
0:15:31.311 --> 0:15:38.757
So if you really want to have no context you
might need to wait until the end of the sentence.
0:15:41.021 --> 0:15:45.920
Besides that, of course, the offline setting gives
you additional help.
0:15:45.920 --> 0:15:52.044
I think last week you talked about context-based
systems that typically have context
0:15:52.044 --> 0:15:55.583
from the past but maybe also from the
future.
0:15:55.595 --> 0:16:02.923
Then, of course, you cannot use anything from
the future in the online case, but you can use it offline.
0:16:07.407 --> 0:16:24.813
Finally, there is the question of how you want
to present it to the audience in automatic
0:16:24.813 --> 0:16:27.384
translation.
0:16:27.507 --> 0:16:31.361
There is also the thing that you want to do.
0:16:31.361 --> 0:16:35.300
For audio output you are running a text-to-speech system
0:16:35.996 --> 0:16:36.990
on top of it.
0:16:36.990 --> 0:16:44.314
Then there are questions: How should it
be spoken? You can do things like
0:16:46.586 --> 0:16:52.507
voice cloning, so that it's even the same
voice as the original speaker.
0:16:53.994 --> 0:16:59.081
And if you do text or dubbing then there might
be additional constraints.
0:16:59.081 --> 0:17:05.729
So if you think about subtitles, they
should be readable, and we are typically speaking
0:17:05.729 --> 0:17:07.957
faster than you can maybe read.
0:17:08.908 --> 0:17:14.239
So you might need to shorten your text.
0:17:14.239 --> 0:17:20.235
People say that a subtitle can be two lines.
0:17:20.235 --> 0:17:26.099
Each line can be this number of characters.
0:17:26.346 --> 0:17:31.753
So if you have too long a text,
we might need to shorten it to fit that.
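The subtitle constraint can be checked mechanically. The sketch below is a hypothetical helper, not part of the lecture: the two-line limit is taken from what was said, but the lecture does not give the number of characters per line, so the value 42 is purely an assumed placeholder. The function wraps a translation into lines and reports whether it fits or would have to be shortened.

```python
import textwrap
from typing import List, Tuple

MAX_LINES = 2             # from the lecture: a subtitle can be two lines
MAX_CHARS_PER_LINE = 42   # assumption: the lecture only says "this number of characters"


def fit_subtitle(text: str,
                 max_lines: int = MAX_LINES,
                 max_chars: int = MAX_CHARS_PER_LINE) -> Tuple[List[str], bool]:
    """Wrap text into subtitle lines.

    Returns (lines, fits): lines holds at most max_lines wrapped lines, and
    fits is False when the text needed more lines and was cut off, which
    signals that the translation should be shortened.
    """
    lines = textwrap.wrap(text, width=max_chars)
    return lines[:max_lines], len(lines) <= max_lines


if __name__ == "__main__":
    lines, fits = fit_subtitle("So you might need to shorten your text.")
    print(lines, fits)
    lines, fits = fit_subtitle("word " * 30)
    print(lines, fits)
```

A real system would shorten or paraphrase the translation rather than cut it off; the boolean here only flags when that step is needed.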
0:17:32.052 --> 0:17:48.272
Similarly, if you think about dubbing, if
you want to produce dubbing voice, then the
0:17:48.272 --> 0:17:50.158
original.
0:17:51.691 --> 0:17:59.294
There is another problem: we have different
settings, like a more formal setting and less
0:17:59.294 --> 0:18:00.602
formal ones.
0:18:00.860 --> 0:18:09.775
If you think about the United Nations, maybe
you want more formal language, and between friends
0:18:09.775 --> 0:18:14.911
maybe less formal, and there are languages which
use.
0:18:15.355 --> 0:18:21.867
That is for sure an important research
question.
0:18:21.867 --> 0:18:28.010
But I would think of it more generally.
0:18:28.308 --> 0:18:32.902
That's important in text translation.
0:18:32.902 --> 0:18:41.001
If you translate a letter to your boss, it
should sound different.
0:18:42.202 --> 0:18:53.718
So there is the question of how you can control
the style, and there is work on how you can do that.
0:18:53.718 --> 0:19:00.542
For example, if you can specify that you might.
0:19:00.460 --> 0:19:10.954
So you can tag the sentence to generate a formal or an informal
style, because, as you correctly said, this
0:19:10.954 --> 0:19:16.709
is especially challenging again in these situations.
0:19:16.856 --> 0:19:20.111
Of course, there are ways of like being formal
or less formal.
0:19:20.500 --> 0:19:24.846
But it's not as clear as it is, for
example, in German, where you have "du" and
0:19:24.846 --> 0:19:24.994
"Sie".
0:19:25.165 --> 0:19:26.855
So there is no one-to-one mapping.
0:19:27.287 --> 0:19:34.269
If you want to make that sure you can build
a system which generates different styles in
0:19:34.269 --> 0:19:38.662
the output, so yeah that's definitely also
a challenge.
0:19:38.662 --> 0:19:43.762
It's just maybe not mentioned here because
it's not specific to speech translation.
0:19:44.524 --> 0:19:54.029
Generally, of course, these are all challenges
in how to customize and adapt systems to use
0:19:54.029 --> 0:19:56.199
cases with specific.
0:20:00.360 --> 0:20:11.020
Speech translation has been done for quite
a while and it's maybe not surprising it started
0:20:11.020 --> 0:20:13.569
with simpler use cases.
0:20:13.793 --> 0:20:24.557
So people first started to look into, for
example, limited-domain translation.
0:20:24.557 --> 0:20:33.726
Tourism was a typical application, if you're
going to a new city.
0:20:34.834 --> 0:20:44.028
Then there were several efforts on
open-domain translation, especially people.
0:20:44.204 --> 0:20:51.957
Like where there's a lot of data so you could
build systems which are more open-domain,
0:20:51.957 --> 0:20:55.790
but of course it's still a bit restrictive.
0:20:55.790 --> 0:20:59.101
It's true that in the European Parliament
0:20:59.101 --> 0:21:01.888
people talk about anything, but.
0:21:02.162 --> 0:21:04.820
And so it's not completely usable for everything.
0:21:05.165 --> 0:21:11.545
Nowadays we see this technology in a lot
of different situations; I guess you all
0:21:11.731 --> 0:21:17.899
use it, so there are some basic technologies
that you can already use.
0:21:18.218 --> 0:21:33.599
There are still a lot of open questions,
for example if you are going to really spontaneous
0:21:33.599 --> 0:21:35.327
meetings.
0:21:35.655 --> 0:21:41.437
Then these systems typically work well for
some languages where we have a lot of
0:21:41.437 --> 0:21:42.109
training data.
0:21:42.742 --> 0:21:48.475
But if we want to go for really low resource
data then things are often challenging.
0:21:48.448 --> 0:22:02.294
Last week we had a workshop on spoken language
translation, and there is a low-resource data
0:22:02.294 --> 0:22:05.756
track which covers dialects
0:22:05.986 --> 0:22:06.925
and so on.
0:22:06.925 --> 0:22:14.699
All these languages can still have significantly
lower performance than higher-resource ones.
0:22:17.057 --> 0:22:20.126
So how does this work?
0:22:20.126 --> 0:22:31.614
If we want to do speech translation, there
are three basic technologies. On the one
0:22:31.614 --> 0:22:40.908
hand, there is automatic speech recognition, where
automatic speech recognition normally transcribes
0:22:40.908 --> 0:22:41.600
audio.
0:22:42.822 --> 0:22:58.289
Then what we talked about here is machine
translation, which takes text input and translates it
0:22:58.289 --> 0:23:01.276
into the target language.
0:23:02.642 --> 0:23:11.244
And the very simple model now, if you think
about it, is of course the simple combination.
0:23:11.451 --> 0:23:14.740
We have solved all these parts in a salt bedrock.
0:23:14.975 --> 0:23:31.470
We are working on all these problems there,
so if we want to do speech translation, maybe
0:23:31.331 --> 0:23:35.058
we just put all these components
together.
0:23:35.335 --> 0:23:45.130
And then you get what is called a cascaded
system: first you take your audio,
0:23:45.045 --> 0:23:59.288
the speech recognition takes this as input and generates text as output,
and then you take this text output and put it
0:23:59.288 --> 0:24:00.238
into the machine translation.
0:24:00.640 --> 0:24:05.782
So in that way you
0:24:08.008 --> 0:24:18.483
now have a solution for doing speech
translation with these types of systems, and
0:24:18.483 --> 0:24:20.874
this type is called a cascaded system.
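The combination described here, where the output of speech recognition is fed into machine translation, can be sketched as plain function composition. Everything in this block is a toy stand-in chosen for illustration: the names `cascade`, `toy_asr`, `toy_mt` and the two-word lexicon are hypothetical, and no real recognizer or translation model is involved.

```python
from typing import Callable

# Hypothetical interfaces: each component is just a function here.
Recognizer = Callable[[bytes], str]   # audio -> source-language transcript
Translator = Callable[[str], str]     # source-language text -> target-language text


def cascade(audio: bytes, asr: Recognizer, mt: Translator) -> str:
    """Run speech recognition first, then translate its text output."""
    transcript = asr(audio)
    return mt(transcript)


def toy_asr(audio: bytes) -> str:
    """Toy recognizer: pretends the audio bytes already are the transcript."""
    return audio.decode("utf-8")


TOY_LEXICON = {"guten": "good", "morgen": "morning"}


def toy_mt(text: str) -> str:
    """Toy translator: word-by-word lookup, unknown words are copied."""
    return " ".join(TOY_LEXICON.get(word, word) for word in text.lower().split())


if __name__ == "__main__":
    print(cascade(b"Guten Morgen", toy_asr, toy_mt))
```

Because the components only meet at the text interface, either one can be swapped for a better model without touching the other. That is the independence benefit, and it is also why a mismatch between what the recognizer outputs and what the translator expects becomes the main issue.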
0:24:21.681 --> 0:24:28.303
It is still often reaching state of the art,
however it has benefits and disadvantages.
0:24:28.668 --> 0:24:41.709
So the one big benefit is we have independent
components and some of that is nice.
0:24:41.709 --> 0:24:48.465
So if there are great ideas put into your.
0:24:48.788 --> 0:24:57.172
And then some other times people develop a
new good way of how to improve.
0:24:57.172 --> 0:25:00.972
You can also take this model and.
0:25:01.381 --> 0:25:07.639
So you can leverage improvements from all
the different communities in order to adapt.
0:25:08.288 --> 0:25:18.391
Furthermore, we would like to see, since all
of them are learned, that the biggest advantage
0:25:18.391 --> 0:25:23.932
is that we have training data for each individual.
0:25:24.164 --> 0:25:34.045
So there's a lot less training data where
you have the English audio, so it's easy to
0:25:34.045 --> 0:25:34.849
train.
0:25:36.636 --> 0:25:48.595
Now am a one that we will focus on when talking
about the cascaded approach is that often it.
0:25:48.928 --> 0:25:58.049
So you need to adapt each component a bit
so that it's adapting to its input and.
0:25:58.278 --> 0:26:07.840
So we'll focus there especially on how to
combine and since said the main focus is: So
0:26:07.840 --> 0:26:18.589
if you would directly use an output that might
not work as perfect as you would,.
0:26:18.918 --> 0:26:33.467
So a major challenge when building a cascade
of speech translation systems is how can we
0:26:33.467 --> 0:26:38.862
adapt these systems and how can?
0:26:41.681 --> 0:26:43.918
So why is this the case?
0:26:44.164 --> 0:26:49.183
So it would look quite nice.
0:26:49.183 --> 0:26:54.722
It seems to be very reasonable.
0:26:54.722 --> 0:26:58.356
You have some audio.
0:26:58.356 --> 0:27:03.376
You put it into your system.
0:27:04.965 --> 0:27:23.759
However, this is a bit of wishful thinking,
because if you speak, what you speak is more.
0:27:23.984 --> 0:27:29.513
And especially it rarely has punctuation
in there, while the MT system.
0:27:29.629 --> 0:27:43.247
They assume, of course, that it's a full sentence,
that you don't have there some.
0:27:43.523 --> 0:27:55.087
So we see we want to get this bridge between
the output and the input, and we might need
0:27:55.087 --> 0:27:56.646
additional.
0:27:58.778 --> 0:28:05.287
And that is typically what is referred to
as a re-casing and re-punctuation system.
0:28:05.445 --> 0:28:15.045
So the idea is that you might be good to have
something like an adapter here in between,
0:28:15.045 --> 0:28:20.007
which really tries to adapt the speech input.
0:28:20.260 --> 0:28:28.809
That can be at different levels, but it might
be even more rephrasing.
0:28:29.569 --> 0:28:40.620
If you think of the sentence, if you have
false starts, then when speaking you sometimes
0:28:40.620 --> 0:28:41.986
assume oh.
0:28:41.901 --> 0:28:52.224
You restart it, then you might want to delete
that because if you read it you don't want
0:28:52.224 --> 0:28:52.688
to.
0:28:56.096 --> 0:28:57.911
Why is this, yeah,
0:28:57.911 --> 0:29:01.442
the case and punctuation, important?
0:29:02.622 --> 0:29:17.875
One important thing, directly the challenge,
is that when you speak, it is just a continuous stream of
0:29:17.875 --> 0:29:18.999
words.
0:29:19.079 --> 0:29:27.422
You are just speaking, and punctuation marks
and so on are not there in natural.
0:29:27.507 --> 0:29:30.281
However, they are of course important.
0:29:30.410 --> 0:29:33.877
They are first of all very important for readability.
0:29:34.174 --> 0:29:41.296
If you have once read a text without punctuation
marks, you need more time to process it.
0:29:41.861 --> 0:29:47.375
They're sometimes even semantically important.
0:29:47.375 --> 0:29:52.890
There's "Let's eat, grandpa" and "Let's eat grandpa": a big difference.
0:29:53.553 --> 0:30:00.089
And so this, of course, for humans as well,
it'd be easy to distinguish, but again doing
0:30:00.089 --> 0:30:01.426
it automatically.
0:30:01.426 --> 0:30:06.180
It's more tricky. And finally, in our case,
if we want to do.
0:30:06.386 --> 0:30:13.672
We are normally assuming sentence-wise input, so
we always have an MT system which is like one
0:30:13.672 --> 0:30:16.238
sentence by the next sentence.
0:30:16.736 --> 0:30:26.058
If you want to do speech translation of a
continuous stream, then of course what are
0:30:26.058 --> 0:30:26.716
your.
0:30:28.168 --> 0:30:39.095
And the easiest and most straightforward situation
is, of course, if you have a continuously.
0:30:39.239 --> 0:30:51.686
And if it generates your punctuation marks,
it's easy to separate your text into sentences.
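As a minimal sketch of this step, assuming the ASR output already carries punctuation, sentence boundaries can be recovered with a simple regular expression. The function name `split_sentences` is made up for this illustration, and real systems would need to handle abbreviations, numbers, and similar cases.

```python
import re
from typing import List

def split_sentences(text: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

for sentence in split_sentences("Hello. How are you? I am fine!"):
    print(sentence)
```

Each resulting segment can then be passed to a sentence-level MT system one at a time.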
0:30:52.032 --> 0:31:09.157
So we can again reuse our system and thereby
have a normal MT system on this continuous.
0:31:14.174 --> 0:31:21.708
These are a bit older numbers, but they show
you a bit also how important all that is.
0:31:21.861 --> 0:31:31.719
So this was so the best is if you do insurance
transcript you get roughly a BLEU score of.
0:31:32.112 --> 0:31:47.678
If you have as it is with some air based length
segmentation, then you get something like.
0:31:47.907 --> 0:31:57.707
If you then use the segments correctly as
it's done from the reference, you get one BLEU
0:31:57.707 --> 0:32:01.010
point and another BLEU point.
0:32:01.201 --> 0:32:08.085
So you see that you gain in total nearly
two BLEU points just by having the correct
0:32:08.085 --> 0:32:09.144
segmentation.
0:32:10.050 --> 0:32:21.178
This shows you that it's important to estimate
as good a segmentation as possible because even if you
0:32:21.178 --> 0:32:25.629
still have the same errors in your.
0:32:27.147 --> 0:32:35.718
Is to be into this movement, which is also
not as unusual as we do in translation.
0:32:36.736 --> 0:32:40.495
So this is done by looking at the reference.
0:32:40.495 --> 0:32:48.097
It should show you how much these scores are
done to just analyze how important are these.
0:32:48.097 --> 0:32:55.699
So you take the ASR transcript and you look
at the reference and it's only done for the.
0:32:55.635 --> 0:33:01.720
If we have optimal punctuations, if our model
is as good and optimal, so as a reference we
0:33:01.720 --> 0:33:15.602
could: But of course this is not how we can
do it in reality because we don't have access
0:33:15.602 --> 0:33:16.990
to that.
0:33:17.657 --> 0:33:24.044
Because one would invade you okay, why should
we do that?
0:33:24.044 --> 0:33:28.778
If we have the optimal then it's possible.
0:33:31.011 --> 0:33:40.060
And yeah, that is why a typical system does
not only yeah depend on if our key component.
0:33:40.280 --> 0:33:56.468
But in between you have this segmentation
in there in order to have more input and.
0:33:56.496 --> 0:34:01.595
You can also prefer often this invariability
over the average study.
0:34:04.164 --> 0:34:19.708
So the task of segmentation is to re-segment
the text into what are called sentence-like
0:34:19.708 --> 0:34:24.300
units, so you also assign.
0:34:24.444 --> 0:34:39.421
That is more a traditional thing because for
a long time case information was not provided.
0:34:39.879 --> 0:34:50.355
So there was hardly any ASR system which directly
provides you with case information, and this
0:34:50.355 --> 0:34:52.746
may no longer be the case.
0:34:56.296 --> 0:35:12.060
How that can be done is you can have three
different approaches because that was some
0:35:12.060 --> 0:35:16.459
of the most common one.
0:35:17.097 --> 0:35:23.579
Of course, that is not the only thing you can
do.
0:35:23.579 --> 0:35:30.888
You can also try to train the data to generate
that.
0:35:31.891 --> 0:35:41.324
On the other hand, that is of course more
challenging.
0:35:41.324 --> 0:35:47.498
You need some type of segmentation.
0:35:48.028 --> 0:35:59.382
I mean, of course, you can easily remove the
case information from your data and then
0:35:59.382 --> 0:36:05.515
train a system which does non-cased to non-cased.
0:36:05.945 --> 0:36:15.751
You can also, of course, try to combine these
two into one so that you directly translate
0:36:15.751 --> 0:36:17.386
from non-case.
0:36:17.817 --> 0:36:24.722
What is more happening by now is that you
also try to provide these to that you provide.
0:36:24.704 --> 0:36:35.267
The ASR is a segmentation directly get these
information in there.
0:36:35.267 --> 0:36:45.462
The systems that combine the A's and A's are:
Yes, there is a valid rule.
0:36:45.462 --> 0:36:51.187
What we come later to today is that you do
audio to text in the target language.
0:36:51.187 --> 0:36:54.932
That is what is referred to as an end-to-end
system.
0:36:54.932 --> 0:36:59.738
So it's directly and this is still more often
done for text output.
0:36:59.738 --> 0:37:03.414
But there are also end-to-end systems which
directly.
0:37:03.683 --> 0:37:09.109
There you have additional challenges by how
to even measure if things are correct or not.
0:37:09.089 --> 0:37:10.522
I mean, for text
0:37:10.522 --> 0:37:18.073
you can measure it in words, but for
audio the audio signal is even more.
0:37:18.318 --> 0:37:27.156
That's why it's currently mostly speech to
text, but that is one single system, but of
0:37:27.156 --> 0:37:27.969
course.
0:37:32.492 --> 0:37:35.605
Yeah, how can you do that?
0:37:35.605 --> 0:37:45.075
You can add this punctuation information:
We will look into three systems.
0:37:45.075 --> 0:37:53.131
You can do that as a sequence labeling problem
or as a monolingual.
0:37:54.534 --> 0:37:57.145
Let's have a little bit of a series.
0:37:57.145 --> 0:37:59.545
This was some of the first ideas.
0:37:59.545 --> 0:38:04.626
There's the idea where you try to do it mainly
based on language model.
0:38:04.626 --> 0:38:11.471
So how probable is it that there is a punctuation mark?
That was done with old-style n-gram language
0:38:11.471 --> 0:38:12.883
models to visually.
0:38:13.073 --> 0:38:24.687
So you can, for example, if you have an n-gram
language model, calculate the score of Hello,
0:38:24.687 --> 0:38:25.787
how are?
0:38:25.725 --> 0:38:33.615
And then you compare this probability and
take the one which has the highest probability.
0:38:33.615 --> 0:38:39.927
You might have something like if you have
very long pauses, you anyway.
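The language-model idea can be illustrated with a toy bigram model: insert each candidate mark between two word sequences, score every variant, and keep the most probable one. This is a rough sketch under stated assumptions. The three-sentence corpus, the add-one smoothing, and the per-token length normalization are choices made for this example, not details given in the lecture.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams over whitespace-tokenized, punctuated text."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def avg_log_prob(tokens, unigrams, bigrams):
    """Add-one smoothed bigram log-probability, averaged per predicted token."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + tokens
    total = sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )
    return total / (len(tokens) - 1)

def best_mark(left, right, unigrams, bigrams, marks=("", ",", ".", "?")):
    """Try every mark (or none) between left and right; return the best one."""
    def score(mark):
        candidate = left + ([mark] if mark else []) + right
        return avg_log_prob(candidate, unigrams, bigrams)
    return max(marks, key=score)

# Toy training text with punctuation marks as separate tokens (an assumption).
corpus = [
    "hello , how are you ?",
    "hello , nice to meet you .",
    "how are you ?",
]
unigrams, bigrams = train_bigram(corpus)
print(best_mark(["hello"], ["how", "are"], unigrams, bigrams))
```

The length normalization matters in this sketch: without it, the variant with no punctuation has one token fewer and tends to win simply because it multiplies fewer probabilities.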
0:38:40.340 --> 0:38:51.953
So this is a very easy model, which only calculates
some language model probabilities, and however
0:38:51.953 --> 0:39:00.023
the advantages of course are: And then, of
course, in general, so what we will look into
0:39:00.023 --> 0:39:06.249
here is that maybe interesting is that most
of the systems, also the advanced ones, are really
0:39:06.249 --> 0:39:08.698
mainly focused purely on the text.
0:39:09.289 --> 0:39:19.237
If you think about how to insert punctuation
marks, maybe your first idea would have been
0:39:19.237 --> 0:39:22.553
we can use pause information.
0:39:23.964 --> 0:39:30.065
However, interestingly, most systems that
are used really focus on the text.
0:39:31.151 --> 0:39:34.493
There are several reasons.
0:39:34.493 --> 0:39:44.147
One is that it's easier to get training data
so you only need pure text data.
0:39:46.806 --> 0:40:03.221
The next way you can do it is you can frame
it as a sequence labeling task or something like
0:40:03.221 --> 0:40:04.328
that.
0:40:04.464 --> 0:40:11.734
Then you have how there is nothing in you,
and there is a.
0:40:11.651 --> 0:40:15.015
A question.
0:40:15.315 --> 0:40:31.443
So you have the number of labels, the number
of punctuation symbols you have for the basic
0:40:31.443 --> 0:40:32.329
one.
0:40:32.892 --> 0:40:44.074
Typically nowadays you would use something
like BERT, and then you can train a system.
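To make the sequence labeling view concrete, here is a small sketch of how punctuated text could be turned into training examples: each word gets a label naming the mark that follows it. The label names (`O`, `COMMA`, `PERIOD`, `QUESTION`) and the lowercasing are assumptions made for this illustration. A tagger such as a fine-tuned BERT model would then be trained to predict these labels from the unpunctuated words.

```python
from typing import List, Tuple

# Assumed label inventory: one label per punctuation mark, plus "O" for none.
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}

def to_labeled_sequence(text: str) -> Tuple[List[str], List[str]]:
    """Strip punctuation from each word and record it as that word's label."""
    words, labels = [], []
    for token in text.split():
        mark = token[-1] if token[-1] in PUNCT_LABELS else ""
        word = token.rstrip("".join(PUNCT_LABELS)).lower()
        if not word:          # token consisted only of punctuation
            continue
        words.append(word)
        labels.append(PUNCT_LABELS.get(mark, "O"))
    return words, labels

words, labels = to_labeled_sequence("Hello, how are you?")
for word, label in zip(words, labels):
    print(word, label)
```

Since only plain punctuated text is needed to build such pairs, training data is easy to obtain, which matches the point made earlier about text-only systems.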
0:40:48.168 --> 0:40:59.259
Any questions to that then it would probably
be no contrary, you know, or not.
0:41:00.480 --> 0:41:03.221
Yeah, you definitely have a label imbalance.
0:41:04.304 --> 0:41:12.405
Think that works relatively well and haven't
seen that.
0:41:12.405 --> 0:41:21.085
It's not a completely crazy label, maybe twenty
times more.
0:41:21.561 --> 0:41:29.636
It can and especially for the more rare things
mean, the more rare things is question marks.
0:41:30.670 --> 0:41:43.877
At least for question marks you have typically
very strong indicator words.
0:41:47.627 --> 0:42:03.321
And then what was done for quite a long time:
we know how to do machine translation.
0:42:04.504 --> 0:42:12.640
So the idea is, can we just translate non
punctuated English into punctuated English
0:42:12.640 --> 0:42:14.650
and do it correctly?
0:42:15.855 --> 0:42:25.344
So what you need is something like this type
of data where the source doesn't have punctuation.
0:42:25.845 --> 0:42:30.641
Course: A year is already done.
0:42:30.641 --> 0:42:36.486
You have to make it a bit challenging.
0:42:41.661 --> 0:42:44.550
Yeah, that is true.
0:42:44.550 --> 0:42:55.237
If you think about the normal trained age,
you have to do one thing more.
0:42:55.237 --> 0:43:00.724
Is it otherwise difficult to predict?
0:43:05.745 --> 0:43:09.277
Here it's already this already looks different
than normal training data.
0:43:09.277 --> 0:43:09.897
What is the.
0:43:10.350 --> 0:43:15.305
People want to use this transcript of speech.
0:43:15.305 --> 0:43:19.507
We'll probably go to our text editors.
0:43:19.419 --> 0:43:25.906
Yes, that is all already quite too difficult.
0:43:26.346 --> 0:43:33.528
Mean, that's making things a lot better with
the first and easiest thing is you have to
0:43:33.528 --> 0:43:35.895
randomly cut your sentences.
0:43:35.895 --> 0:43:43.321
So if you take just me normally we have one
sentence per line and if you take this as your
0:43:43.321 --> 0:43:44.545
training data.
0:43:44.924 --> 0:43:47.857
And that is, of course, not very helpful.
0:43:48.208 --> 0:44:01.169
So in order to build the training corpus for
doing punctuation you randomly cut your sentences
0:44:01.169 --> 0:44:08.264
and then you can remove all your punctuation
marks.
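The corpus construction just described can be sketched as follows: concatenate the sentences, cut the word stream at random positions, and strip punctuation from the source side while keeping it on the target side. The segment length range, the fixed seed, and the lowercasing of the source are assumptions chosen for this example.

```python
import random
import re
from typing import List, Tuple

def make_training_pairs(sentences: List[str], min_len: int = 3,
                        max_len: int = 12, seed: int = 0) -> List[Tuple[str, str]]:
    """Return (unpunctuated source, punctuated target) pairs from random cuts."""
    rng = random.Random(seed)
    words = " ".join(sentences).split()
    pairs = []
    start = 0
    while start < len(words):
        length = rng.randint(min_len, max_len)   # random segment length
        target = " ".join(words[start:start + length])
        # Source side: drop punctuation (keeping apostrophes) and lowercase.
        source = re.sub(r"[^\w\s']", "", target).lower()
        source = " ".join(source.split())
        pairs.append((source, target))
        start += length
    return pairs

for source, target in make_training_pairs(
        ["Hello, how are you?", "I am fine. Thanks for asking!"]):
    print(source, "=>", target)
```

Because the cuts ignore sentence boundaries, the model cannot rely on every segment ending with a sentence-final mark, which is the failure mode the lecture warns about for one-sentence-per-line data.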
0:44:08.528 --> 0:44:21.598
Because of course there is no longer to do
when you have some random segments in your
0:44:21.598 --> 0:44:22.814
system.
0:44:25.065 --> 0:44:37.984
And then you can, for example, if you then
have generated your punctuation marks before
0:44:37.984 --> 0:44:41.067
going to the system.
0:44:41.221 --> 0:44:54.122
And that is an important thing, which we like
to see is more challenging for end-to-end systems.
0:44:54.122 --> 0:45:00.143
We can change the segmentation, so maybe.
0:45:00.040 --> 0:45:06.417
You can, then if you're combining these things
you can change the segmentation here, so.
0:45:06.406 --> 0:45:18.178
While you have ten new ten segments in your,
you might only have five ones in your anymore.
0:45:18.178 --> 0:45:18.946
Then.
0:45:19.259 --> 0:45:33.172
Which might be more useful or helpful in because
you have to reorder things and so on.
0:45:33.273 --> 0:45:43.994
And if you think of the wrong segmentation
then you cannot reorder things from the beginning
0:45:43.994 --> 0:45:47.222
to the end of the sentence.
0:45:49.749 --> 0:45:58.006
Okay, so much about segmentation do you have
any more questions about that?
0:46:02.522 --> 0:46:21.299
Then there is one additional thing you can
do, and that is when we refer to the idea.
0:46:21.701 --> 0:46:29.356
And when you get input there might be some
errors in there, so it might not be perfect.
0:46:29.889 --> 0:46:36.322
So the question is, can we adapt to that?
0:46:36.322 --> 0:46:45.358
And can the system be improved by saying that
it can some.
0:46:45.265 --> 0:46:50.591
So that is as aware that before there is a.
0:46:50.490 --> 0:46:55.449
Their arm might not be the best one.
0:46:55.935 --> 0:47:01.961
There are different ways of dealing with them.
0:47:01.961 --> 0:47:08.116
You can use an n-best list, so not one but several best hypotheses.
0:47:08.408 --> 0:47:16.711
So the idea is that you're not only telling
the system this is the transcript, but here
0:47:16.711 --> 0:47:18.692
I'm not going to be.
0:47:19.419 --> 0:47:30.748
Or you can try to make it more robust
towards errors from an ASR system so that.
0:47:32.612 --> 0:47:48.657
Interesting what is often done is hope convince
you it might be a good idea to deal.
0:47:48.868 --> 0:47:57.777
The interesting thing is if you're looking
into a lot of systems, this is often ignored,
0:47:57.777 --> 0:48:04.784
so they are not adapting their MT system to
this type of ASR system.
0:48:05.345 --> 0:48:15.232
So it's not really doing any handling of errors,
and the interesting thing is it often works as
0:48:15.232 --> 0:48:15.884
good.
0:48:16.516 --> 0:48:23.836
And one reason is, of course,
if the ASR system makes an error, it is often in
0:48:23.836 --> 0:48:31.654
a challenging situation, and then it is
really hard for the MT system to detect.
0:48:31.931 --> 0:48:39.375
If it would be easy for the system to detect
the error you would integrate this information
0:48:39.375 --> 0:48:45.404
into: That is not always the case, but that
of course makes it a bit challenging, and that's
0:48:45.404 --> 0:48:49.762
why there is a lot of systems where it's not
explicitly handled how to deal with.
0:48:52.912 --> 0:49:06.412
But of course it might be good, so one thing
is you can give it an n-best list and you can
0:49:06.412 --> 0:49:09.901
translate every entry.
0:49:10.410 --> 0:49:17.705
And then you have two scores, the MT probability
and the ASR probability.
0:49:18.058 --> 0:49:25.695
Combine them and then generate or output the
output from what has the best combined.
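A minimal sketch of this n-best rescoring, assuming both systems expose log-probabilities: translate every ASR hypothesis, combine the two scores, and output the translation with the best combined score. The linear interpolation with `asr_weight`, the function names, and the toy scores are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

def rescore(nbest: List[Tuple[str, float]],
            translate: Callable[[str], Tuple[str, float]],
            asr_weight: float = 0.5) -> Tuple[float, str, str]:
    """Return (combined score, transcript, translation) of the best entry."""
    best = None
    for transcript, asr_logprob in nbest:
        translation, mt_logprob = translate(transcript)
        combined = asr_weight * asr_logprob + (1 - asr_weight) * mt_logprob
        if best is None or combined > best[0]:
            best = (combined, transcript, translation)
    return best

# Hypothetical MT system: a lookup table with made-up log-probabilities.
toy_table = {
    "the conference is recorded": ("die Konferenz wird aufgezeichnet", -1.0),
    "the conference is record it": ("die Konferenz ist Rekord es", -6.0),
}

def toy_translate(transcript: str) -> Tuple[str, float]:
    return toy_table[transcript]

# The first ASR hypothesis has the better ASR score but a worse combined score.
nbest = [("the conference is record it", -2.0),
         ("the conference is recorded", -2.5)]
print(rescore(nbest, toy_translate))
```

In this toy example the second hypothesis wins after combination even though the ASR system ranked it lower.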
0:49:26.366 --> 0:49:29.891
And then it might no longer be the best.
0:49:29.891 --> 0:49:38.144
It might be, like when we had beam search, that this
has the best score, but this has a better combined.
0:49:39.059 --> 0:49:46.557
This sometimes works, but the problem
is that the MT system might then tend to
0:49:46.557 --> 0:49:52.777
just translate not the correct sentence but
the one easier to translate.
0:49:53.693 --> 0:50:03.639
You can also generate a more compact representation
of this n-best list by having this type of
0:50:03.639 --> 0:50:04.467
graphs.
0:50:05.285 --> 0:50:22.952
Lattices: So then you could try to do
a graph to text translation so you can translate.
0:50:22.802 --> 0:50:26.582
Where like all possibilities, by the way our
systems are invented.
0:50:26.906 --> 0:50:31.485
So it can be like a hostage, a conference
with some programs.
0:50:31.591 --> 0:50:35.296
So the highest probability is here.
0:50:35.296 --> 0:50:41.984
Conference is being recorded, but there are
other possibilities.
0:50:42.302 --> 0:50:53.054
And you can take all of this information out
there with your probabilities.
0:50:59.980 --> 0:51:07.614
But we'll see this type of error propagation,
that if you have an error this might then
0:51:07.614 --> 0:51:15.165
propagate to MT errors, is one of the main
reasons why people looked into other ways of
0:51:15.165 --> 0:51:17.240
doing it and not having.
0:51:19.219 --> 0:51:28.050
But generally a cascaded combination, as we've
seen it, it has several advantages: The biggest
0:51:28.050 --> 0:51:42.674
maybe is the data availability so we can train
systems for the different components.
0:51:42.822 --> 0:51:47.228
So you can train your individual components
on relatively large stages.
0:51:47.667 --> 0:51:58.207
A modular system where you can improve each
individual model and if there's new development
0:51:58.207 --> 0:52:01.415
and models you can improve.
0:52:01.861 --> 0:52:11.280
There are several advantages, but of course
there are also some disadvantages: The most
0:52:11.280 --> 0:52:19.522
common thing is that there is what is referred
to as error propagation.
0:52:19.522 --> 0:52:28.222
If the ASR makes an error, probably your output
will then directly contain an error.
0:52:28.868 --> 0:52:41.740
Typically it's like if there's an error in
the system, it's easier to like ignore by a
0:52:41.740 --> 0:52:46.474
quantity scale than the output.
0:52:46.967 --> 0:52:49.785
What does that mean?
0:52:49.785 --> 0:53:01.209
It's complicated, so if you have German, the
ASR makes an error, and instead.
0:53:01.101 --> 0:53:05.976
Then most probably you'll ignore it or you'll
still know what it was said.
0:53:05.976 --> 0:53:11.827
Maybe you don't even notice because you'll
quickly read over it and don't see that there's
0:53:11.827 --> 0:53:12.997
one letter wrong.
0:53:13.673 --> 0:53:25.291
However, if you translate this one in an English
sentence about speeches, there's something
0:53:25.291 --> 0:53:26.933
about wines.
0:53:27.367 --> 0:53:37.238
So it's a lot easier typically to read over
like errors in the than reading over them in
0:53:37.238 --> 0:53:38.569
the speech.
0:53:40.120 --> 0:53:45.863
But there are additional challenges in cascaded
systems.
0:53:46.066 --> 0:53:52.667
So secondly we have seen that we optimize
each component individually so you have a separate
0:53:52.667 --> 0:53:59.055
optimization and that doesn't mean that the
overall performance is really the best at the
0:53:59.055 --> 0:53:59.410
end.
0:53:59.899 --> 0:54:07.945
And we have tried to do that by already saying
yes.
0:54:07.945 --> 0:54:17.692
You need to adapt them a bit to work good
together, but still.
0:54:20.280 --> 0:54:24.185
Secondly, like that, there's a computational
complexity.
0:54:24.185 --> 0:54:30.351
You always need to run an ASR system and an
MT system, and especially if you think about
0:54:30.351 --> 0:54:32.886
it, it should be fast and real time.
0:54:32.886 --> 0:54:37.065
It's challenging to always run two systems
and not a single.
0:54:38.038 --> 0:54:45.245
And one final thing which you might have not
directly thought of, but most of the world's
0:54:45.245 --> 0:54:47.407
languages do not have any.
0:54:48.108 --> 0:55:01.942
So if you have a language which doesn't have
any script, then of course if you want to translate
0:55:01.942 --> 0:55:05.507
it you cannot first use.
0:55:05.905 --> 0:55:13.705
So in order to do this, the pressure was mentioned
before ready.
0:55:13.705 --> 0:55:24.264
Build somehow a system which takes the audio
and directly generates text in the target.
0:55:26.006 --> 0:55:41.935
And there is quite big opportunity for that
because before that there was very different
0:55:41.935 --> 0:55:44.082
technology.
0:55:44.644 --> 0:55:55.421
However, since we are using neural machine translation
encoder-decoder models, the interesting thing
0:55:55.421 --> 0:56:00.429
is that we are using very similar technology.
0:56:00.360 --> 0:56:06.047
It's like in both cases very similar architecture.
0:56:06.047 --> 0:56:09.280
The main difference is once.
0:56:09.649 --> 0:56:17.143
But generally how it's done is very similar,
and therefore of course it might be put everything
0:56:17.143 --> 0:56:22.140
together, and that is what is referred to as
end-to-end speech.
0:56:22.502 --> 0:56:31.411
So that means we're having one large neural
network, an encoder-decoder system, but we put
0:56:31.411 --> 0:56:34.914
an audio in one language and then.
0:56:36.196 --> 0:56:43.106
We can then have a system which directly does
the full process.
0:56:43.106 --> 0:56:46.454
We don't have to care anymore.
0:56:48.048 --> 0:57:02.615
So if you think of it as before, so we have
this decoder, and that's the two separate.
0:57:02.615 --> 0:57:04.792
We have the.
0:57:05.085 --> 0:57:18.044
And instead of going via the discrete text
representation in the source language, we can
0:57:18.044 --> 0:57:21.470
go via the continuous.
0:57:21.681 --> 0:57:26.027
Of course, the hope is that by not doing this
discretization in between.
0:57:26.146 --> 0:57:30.275
We don't have a problem at doing errors.
0:57:30.275 --> 0:57:32.793
We can only cover later.
0:57:32.772 --> 0:57:47.849
But we can encode here the variability or
so that we have and then only define the decision.
0:57:51.711 --> 0:57:54.525
And so.
0:57:54.274 --> 0:58:02.253
What we're doing is we're having very similar
technique.
0:58:02.253 --> 0:58:12.192
We're having still the decoder model where
we're coming from the main.
0:58:12.552 --> 0:58:24.098
Instead of getting discrete tokens in there
as we have subwords, we always encoded that
0:58:24.098 --> 0:58:26.197
in one pattern.
0:58:26.846 --> 0:58:42.505
The problem is that this is in continuous,
so we have to check how we can work with continuous
0:58:42.505 --> 0:58:43.988
signals.
0:58:47.627 --> 0:58:55.166
Mean, the first thing in your system is when
you do your disc freeze and code it.
0:59:02.402 --> 0:59:03.888
In neural machine translation.
0:59:03.888 --> 0:59:05.067
You're getting a word.
0:59:05.067 --> 0:59:06.297
It's one hot, some not.
0:59:21.421 --> 0:59:24.678
The first layer of the machine translation.
0:59:27.287 --> 0:59:36.147
Yes, you do the word embedding, so then you
have a continuous thing.
0:59:36.147 --> 0:59:40.128
So if you now get continuous.
0:59:40.961 --> 0:59:46.316
Deal with it the same way, so we'll see not
a big of a challenge.
0:59:46.316 --> 0:59:48.669
What is more challenging is.
0:59:49.349 --> 1:00:04.498
So the audio signal is ten times longer or
so, like more time steps you have.
1:00:04.764 --> 1:00:10.332
And so that is, of course, any challenge how
we can deal with this type of long sequence.
1:00:11.171 --> 1:00:13.055
The advantage is a bit.
1:00:13.055 --> 1:00:17.922
The long sequence is only at the input and
not at the output.
1:00:17.922 --> 1:00:24.988
So when you remember for the efficiency, for
example, like a long sequence are especially
1:00:24.988 --> 1:00:29.227
challenging in the decoder, but also for the
encoder.
1:00:31.371 --> 1:00:33.595
So how it is this?
1:00:33.595 --> 1:00:40.617
How can we process audio into an speech translation
system?
1:00:41.501 --> 1:00:51.856
And you can follow mainly what is done in
an system, so you have the audio signal.
1:00:52.172 --> 1:00:59.135
Then you measure your amplitude at every time
step.
1:00:59.135 --> 1:01:04.358
It's typically something like killing.
1:01:04.384 --> 1:01:13.893
And then you're doing this windowing,
so that you get a signal of a length of twenty
1:01:13.893 --> 1:01:22.430
to thirty milliseconds, and you have all these windowings
so that you measure them.
1:01:22.342 --> 1:01:32.260
A simple gear, and then you look at these
time signals of seconds.
1:01:32.432 --> 1:01:36.920
So in the end then it is every ten
milliseconds.
1:01:36.920 --> 1:01:39.735
You have for every ten milliseconds.
1:01:40.000 --> 1:01:48.309
Some type of representation which type of
representation you can generate from that,
1:01:48.309 --> 1:01:49.286
but that.
1:01:49.649 --> 1:02:06.919
So instead of having now a letter or word, you
have now representations for every 10 ms of your
1:02:06.919 --> 1:02:08.437
system.
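The windowing can be sketched in a few lines. The 16 kHz sample rate and the 25 ms window are assumed values for this example (the lecture only gives a window of roughly twenty to thirty milliseconds and a step of ten milliseconds).

```python
from typing import List, Sequence

def frame_signal(samples: Sequence[float], sample_rate: int = 16000,
                 window_ms: int = 25, hop_ms: int = 10) -> List[Sequence[float]]:
    """Cut a signal into overlapping windows, one starting every hop_ms."""
    window = sample_rate * window_ms // 1000   # samples per window
    hop = sample_rate * hop_ms // 1000         # samples between window starts
    return [samples[start:start + window]
            for start in range(0, len(samples) - window + 1, hop)]

one_second = [0.0] * 16000
frames = frame_signal(one_second)
print(len(frames), len(frames[0]))
```

Under these assumptions one second of audio yields about a hundred frames, which gives a feeling for why the audio input sequence is much longer than the corresponding text.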
1:02:08.688 --> 1:02:13.372
How we encode that now, your thirty millisecond
window here there is different ways.
1:02:16.176 --> 1:02:31.891
Was a traditional way of how people have done
that from an audio signal what frequencies
1:02:31.891 --> 1:02:34.010
are in the.
1:02:34.114 --> 1:02:44.143
So to do that you can use mel-frequency
cepstral coefficients, so you can use Fourier transformations.
1:02:44.324 --> 1:02:47.031
Which frequencies are there?
1:02:47.031 --> 1:02:53.566
You know that the letters are different by
the different frequencies.
1:02:53.813 --> 1:03:04.243
And then if you're doing that, use the MFCCs
to covers for your window we have before.
1:03:04.624 --> 1:03:14.550
So for each of these windows: You will calculate
what frequencies in there and then get features
1:03:14.550 --> 1:03:20.059
for this window and features for this window.
1:03:19.980 --> 1:03:28.028
These are the frequencies that occur there
and that help you to model which letters are
1:03:28.028 --> 1:03:28.760
spoken.
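The "which frequencies are there" step can be illustrated with a naive discrete Fourier transform over one window. This sketch covers only the Fourier step: a full MFCC pipeline would additionally apply a mel filterbank, a logarithm, and a discrete cosine transform. The 64-sample test frame is an assumption chosen to keep the example small.

```python
import cmath
import math
from typing import List, Sequence

def magnitude_spectrum(frame: Sequence[float]) -> List[float]:
    """Magnitude of each DFT bin from 0 up to the Nyquist bin (naive O(n^2))."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(frame))
        spectrum.append(abs(coeff))
    return spectrum

# A pure sine wave that completes exactly 4 cycles within the 64-sample frame.
frame = [math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]
spectrum = magnitude_spectrum(frame)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
print(peak_bin)
```

The energy concentrates in the bin that matches the sine wave's frequency, and such per-window frequency profiles are what the features summarize.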
1:03:31.611 --> 1:03:43.544
More recently, instead of doing the traditional
signal processing, you can also replace that
1:03:43.544 --> 1:03:45.853
by deep learning.
1:03:46.126 --> 1:03:56.406
So that we are using a self-supervised approach
from language model to generate features that
1:03:56.406 --> 1:03:58.047
describe what.
1:03:58.358 --> 1:03:59.821
So you have your.
1:03:59.759 --> 1:04:07.392
Your audio signal again, and then for each chunk
you use convolutional neural networks to
1:04:07.392 --> 1:04:07.811
get.
1:04:07.807 --> 1:04:23.699
First representation here is a transformer
network here, and in the end it's similar to
1:04:23.699 --> 1:04:25.866
a language.
1:04:25.705 --> 1:04:30.238
And you tried to predict what was referenced
here.
1:04:30.670 --> 1:04:42.122
So that is in a way similar that you also
try to learn a good representation of all these
1:04:42.122 --> 1:04:51.608
audio signals by predicting: And then you don't
do the signal processing base, but have this
1:04:51.608 --> 1:04:52.717
way to make.
1:04:52.812 --> 1:04:59.430
But in all of this, what you have to remember,
what is most important for your end-to-end
1:04:59.430 --> 1:05:05.902
system, is, of course, that you in the end get
for every ten milliseconds
1:05:05.902 --> 1:05:11.283
a representation of this audio signal, which
is again a vector, and that.
1:05:11.331 --> 1:05:15.365
And then you can use your normal encoder-decoder
model to do this.
1:05:21.861 --> 1:05:32.694
So that is all which directly has to be changed,
and then you can build your first base.
1:05:33.213 --> 1:05:37.167
You do the audio processing.
1:05:37.167 --> 1:05:49.166
You of course need data which is like audio
in English and text in German and then you
1:05:49.166 --> 1:05:50.666
can train.
1:05:53.333 --> 1:05:57.854
And interestingly, it works at the beginning.
1:05:57.854 --> 1:06:03.261
The systems were maybe a bit worse, but we
saw really.
1:06:03.964 --> 1:06:11.803
This is like from the biggest workshop where
people like compared different systems.
1:06:11.751 --> 1:06:17.795
There was a special challenge on comparing cascaded to
end-to-end systems, and you see in two thousand
1:06:17.795 --> 1:06:18.767
and eighteen.
1:06:18.767 --> 1:06:25.089
We had quite a huge gap between the cascaded
and end-to-end systems and then it got nearer
1:06:25.089 --> 1:06:27.937
and nearer, and starting in two thousand
1:06:27.907 --> 1:06:33.619
twenty the performance was mainly the same,
so there was no clear difference anymore.
1:06:34.014 --> 1:06:42.774
So this is, of course, raising a bit of hope,
saying if we better learn how to build these
1:06:42.774 --> 1:06:47.544
end-to-end systems, they might really become better.
1:06:49.549 --> 1:06:52.346
However, a bit
1:06:52.452 --> 1:06:59.018
dissatisfying is how this all continues,
and this is not only in two thousand and twenty
1:06:59.018 --> 1:07:04.216
one, but even nowadays we can say there is
no clear performance difference.
1:07:04.216 --> 1:07:10.919
It's not like the one model is better than
the other, but we are seeing very similar performance.
1:07:11.391 --> 1:07:19.413
So the question is what is the difference?
1:07:19.413 --> 1:07:29.115
Of course, this can only be achieved by new
tricks.
1:07:30.570 --> 1:07:35.658
Yes and no, that's what we will mainly look
into now.
1:07:35.658 --> 1:07:39.333
How can we make use of other types of.
1:07:39.359 --> 1:07:53.236
In that case you can achieve some performance
by using different types of training so you
1:07:53.236 --> 1:07:55.549
can also make.
1:07:55.855 --> 1:08:04.961
So if you are training or preparing the systems
only on very small corpora where you have as
1:08:04.961 --> 1:08:10.248
much data than you have for the individual
ones then.
1:08:10.550 --> 1:08:22.288
So that is the biggest challenge of an end-to-end
system, that you have small corpora and therefore.
1:08:24.404 --> 1:08:30.479
Of course, there are several advantages, so
you can give access to the audio information.
1:08:30.750 --> 1:08:42.046
So that's, for example, interesting if you
think about it, you might not have modeled
1:08:42.046 --> 1:08:45.198
everything in the text.
1:08:45.198 --> 1:08:50.321
So remember when we talk about biases.
1:08:50.230 --> 1:08:55.448
Male or female, and that of course is not
in the text any more, but in the audio signal
1:08:55.448 --> 1:08:56.515
it's still there.
1:08:58.078 --> 1:09:03.108
It also helps with something we will talk about on Thursday,
when we talk about latency.
1:09:03.108 --> 1:09:08.902
You have a bit better chance if you do an
end-to-end system to get a lower latency because
1:09:08.902 --> 1:09:14.377
you only have one system and you don't have
two systems which might have to wait for.
1:09:14.934 --> 1:09:20.046
And having one system might also be a bit
easier to manage:
1:09:20.046 --> 1:09:23.146
you do not have to see that two systems work, and so on.
1:09:26.346 --> 1:09:41.149
The biggest challenge of end-to-end systems is the
data, so as you correctly pointed out, typically
1:09:41.149 --> 1:09:42.741
there is.
1:09:43.123 --> 1:09:45.829
There is some data for TED.
1:09:45.829 --> 1:09:47.472
People did that.
1:09:47.472 --> 1:09:52.789
They took the English audio with all the translations.
1:09:53.273 --> 1:10:02.423
But in general there is a lot less, so we'll
look into how you can use other data sources.
1:10:05.305 --> 1:10:10.950
And secondly, the second challenge is that
we have to deal with audio,
1:10:11.431 --> 1:10:22.163
for example with its input length, and therefore
it's also important to handle this in your
1:10:22.163 --> 1:10:27.590
network and maybe have dedicated solutions.
1:10:31.831 --> 1:10:40.265
So in general we have this challenge that
we have a lot of text translation data and audio
1:10:40.265 --> 1:10:43.076
transcript data, but quite little end-to-end data.
1:10:43.643 --> 1:10:50.844
So what can we do? One trick
1:10:50.844 --> 1:11:00.745
you already know a bit from other research.
1:11:02.302 --> 1:11:14.325
Exactly, so what you can do is you can, for
example, take a parallel corpus, generate
1:11:14.325 --> 1:11:19.594
audio for the source language, and then.
1:11:21.341 --> 1:11:33.780
This has been a bit motivated by what we
have seen in back-translation, which was very
1:11:33.780 --> 1:11:35.476
successful.
1:11:38.758 --> 1:11:54.080
However, it's a bit more challenging because
synthetic audio is often very different from real audio.
1:11:54.314 --> 1:12:07.131
So often, if you build a system trained only
on synthetic audio, generalizing to real audio data
1:12:07.131 --> 1:12:10.335
is quite challenging.
1:12:10.910 --> 1:12:20.927
And therefore here the synthetic data generation
is significantly more challenging than when.
1:12:20.981 --> 1:12:27.071
Because if you read a text, it's maybe a bad
translation.
1:12:27.071 --> 1:12:33.161
It's hard, but it's a real text or a text
generated by.
1:12:35.835 --> 1:12:42.885
But it's a valid solution, and for example
we use that also for say current systems.
1:12:43.923 --> 1:12:53.336
Of course you can also do a bit of forward
translation that is done so that you take data.
1:12:53.773 --> 1:13:02.587
But then the problem is that your reference
is not always correct, and you remember when
1:13:02.587 --> 1:13:08.727
we talked about back translation, it's a bit
of an advantage.
1:13:09.229 --> 1:13:11.930
But both can be done and both have been done.
1:13:12.212 --> 1:13:20.277
So you can think about this picture again.
1:13:20.277 --> 1:13:30.217
You can take this data and generate the audio
to it.
1:13:30.750 --> 1:13:37.938
However, it is only synthetic. What about
using voice cloning technology for that?
1:13:40.240 --> 1:13:47.153
But you have not, I mean, yet you get text
to speech, but the voice cloning would need
1:13:47.153 --> 1:13:47.868
a voice.
1:13:47.868 --> 1:13:53.112
You can use, of course, and then it's nothing
else than a normal.
1:13:54.594 --> 1:14:03.210
But still think there are better than both,
but there are some characteristics of that
1:14:03.210 --> 1:14:05.784
which is quite different.
1:14:07.327 --> 1:14:09.341
But yeah, it's getting better.
1:14:09.341 --> 1:14:13.498
That is definitely true, and then this might
get more and more.
1:14:16.596 --> 1:14:21.885
Here make sure it's a good person and our
own systems because we try to train and.
1:14:21.881 --> 1:14:24.356
And it's like a feedback loop.
1:14:24.356 --> 1:14:28.668
There's anything like the Dutch English model
that's.
1:14:28.648 --> 1:14:33.081
Yeah, you of course need a decent amount of
real data.
1:14:33.081 --> 1:14:40.255
But I mean, as I said, there is always
an advantage if you have this synthetic data
1:14:40.255 --> 1:14:44.044
only on the input side and not on the output side,
1:14:44.464 --> 1:14:47.444
so that you at least always generate correct
outputs.
1:14:48.688 --> 1:14:54.599
That's different in a language case because
they have input and the output and it's not
1:14:54.599 --> 1:14:55.002
like.
1:14:58.618 --> 1:15:15.815
The other idea is to integrate additional
sources so you can have more model sharing.
1:15:16.376 --> 1:15:23.301
But you can use these components also in the
system.
1:15:23.301 --> 1:15:28.659
Typically the text decoder and the text.
1:15:29.169 --> 1:15:41.845
And so the other way of leveraging this is to jointly
train, or somehow train, all these tasks.
1:15:43.403 --> 1:15:54.467
The first and easy thing to do is multi-task
training, so the idea is you take these components
1:15:54.467 --> 1:16:02.038
and train these two components and train the
speech translation.
1:16:02.362 --> 1:16:13.086
So then, for example, the audio encoder used
by the speech translation system can also gain
1:16:13.086 --> 1:16:14.951
from the large.
1:16:14.975 --> 1:16:24.048
So everything can gain a bit of emphasis,
but it can partly gain in there quite a bit.
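The multi-task idea above can be sketched as a weighted sum of per-task losses, so that one update step is driven by all tasks at once. This is only an illustrative sketch: the task names, the loss values, and the weights are assumptions made up for the example and do not come from the lecture.

```python
def multitask_loss(losses, weights):
    """Weighted sum of per-task losses; both dicts use the same task names."""
    return sum(weights[task] * losses[task] for task in losses)


# Hypothetical loss values for one training step of each task.
step_losses = {"asr": 2.0, "mt": 1.0, "st": 4.0}
# Hypothetical weights; here the speech translation task counts the most.
task_weights = {"asr": 0.25, "mt": 0.25, "st": 0.5}

total = multitask_loss(step_losses, task_weights)
```

Because the components are shared between the tasks, the gradient of this single number would update, for example, the audio encoder with signal from both the recognition data and the speech translation data.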
1:16:27.407 --> 1:16:39.920
The other idea is to do it in a pre-training
phase.
1:16:40.080 --> 1:16:50.414
And then you take the encoder and the text
decoder and train your model on that.
1:16:54.774 --> 1:17:04.895
Finally, there is also what is referred to
as knowledge distillation, so there, as you
1:17:04.895 --> 1:17:11.566
may remember, you learn from a probability
distribution.
1:17:11.771 --> 1:17:24.371
So what you can do then is you have your system
and if you then have your audio and text input
1:17:24.371 --> 1:17:26.759
you can use your.
1:17:27.087 --> 1:17:32.699
And then you get a richer signal: you
not only know this is the word, but you have
1:17:32.699 --> 1:17:33.456
a complete.
1:17:34.394 --> 1:17:41.979
This is typically also done because, of
course, if you have speech translation data, it is often the case
1:17:41.979 --> 1:17:49.735
that you don't only have source language audio
and target language text, but then you also
1:17:49.735 --> 1:17:52.377
have the source language text.
1:17:53.833 --> 1:18:00.996
Get a good idea of the text editor and the
artist design.
1:18:00.996 --> 1:18:15.888
Now have to be aligned so that: Otherwise
they wouldn't be able to determine which degree
1:18:15.888 --> 1:18:17.922
they'd be.
1:18:18.178 --> 1:18:25.603
What you are doing in knowledge distillation
is you run your MT and then you get your probability
1:18:25.603 --> 1:18:32.716
distribution for all the words and you use
that to train, and that is more helpful
1:18:32.716 --> 1:18:34.592
than only getting back.
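As a rough illustration of why the teacher distribution is a richer signal than a single reference word, the sketch below compares the usual loss against a one-hot target with a loss against a full teacher distribution. The three-word vocabulary, the logits, and the function names are hypothetical and chosen only for this example; a real system would compute this per target position over the whole vocabulary.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, student_probs):
    """Cross-entropy of the student distribution against a target distribution."""
    return -sum(t * math.log(s) for t, s in zip(target_probs, student_probs) if t > 0)


vocab = ["house", "home", "building"]       # hypothetical tiny vocabulary
teacher_probs = softmax([2.0, 1.5, -1.0])   # hypothetical MT teacher scores
student_probs = softmax([1.0, 0.2, 0.1])    # hypothetical speech translation student scores

one_hot = [1.0, 0.0, 0.0]                   # reference says only "house"
hard_loss = cross_entropy(one_hot, student_probs)
soft_loss = cross_entropy(teacher_probs, student_probs)
```

With the one-hot target the student only learns that "house" is correct. With the teacher distribution it also learns that "home" is nearly as good and "building" is not.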
1:18:35.915 --> 1:18:44.427
You can, of course, use the same decoder to
be even similar.
1:18:44.427 --> 1:18:49.729
Otherwise you don't have exactly the.
1:18:52.832 --> 1:19:03.515
That is a good point, and generally
in all these cases it's good to have more similar
1:19:03.515 --> 1:19:05.331
representations.
1:19:05.331 --> 1:19:07.253
You can transfer.
1:19:07.607 --> 1:19:23.743
If the representations you get from
the audio encoder and the text encoder are
1:19:23.743 --> 1:19:27.410
more similar, then.
1:19:30.130 --> 1:19:39.980
So here you have your text encoder in the
target language and you can train it on large
1:19:39.980 --> 1:19:40.652
data.
1:19:41.341 --> 1:19:45.994
But of course you also want to benefit for
this task, because that's what you're most interested in.
1:19:46.846 --> 1:19:59.665
Of course, you get the most benefit for this task
if these two representations are
1:19:59.665 --> 1:20:01.728
more similar.
1:20:02.222 --> 1:20:10.583
Therefore, it's interesting to look into how
we can make these two representations as similar
1:20:10.583 --> 1:20:20.929
as possible. The hope is that in the end you can even
do something like zero-shot transfer, so that while
1:20:20.929 --> 1:20:25.950
you only learn this one you can also deal with.
1:20:30.830 --> 1:20:40.257
So what you can do is you can look at these
two representations.
1:20:40.257 --> 1:20:42.867
So once the text.
1:20:43.003 --> 1:20:51.184
And you can either put them into the text
decoder to the encoder.
1:20:51.184 --> 1:20:53.539
We have seen both.
1:20:53.539 --> 1:21:03.738
You can think: if you want to build an
end-to-end system, you can either take the audio
1:21:03.738 --> 1:21:06.575
encoder and see how deep.
1:21:08.748 --> 1:21:21.915
However, you have these two representations
and you want to make them more similar.
1:21:21.915 --> 1:21:23.640
One thing.
1:21:23.863 --> 1:21:32.797
Here we have, like you said, for every ten
milliseconds a representation.
1:21:35.335 --> 1:21:46.085
So what people have done, for example,
is to remove redundant information.
1:21:46.366 --> 1:21:56.403
So you can use your system to predict
letters or words and then average over the
1:21:56.403 --> 1:21:58.388
frames of each word or letter,
1:21:59.179 --> 1:22:07.965
so that the number of representations from
the encoder is the same as you would get from.
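The averaging described above can be sketched as follows, assuming we already have one predicted label per audio frame. The frame vectors, the labels, and the function name are invented for the example, and merging runs of identical labels is a simplification: a real predictor would need some way to separate repeated letters.

```python
from itertools import groupby


def shrink_by_label(frames, labels):
    """Average consecutive frames that share the same predicted label."""
    shrunk = []
    index = 0
    for label, run in groupby(labels):
        run_length = len(list(run))
        segment = frames[index:index + run_length]
        dim = len(segment[0])
        # Mean of the segment, computed separately for each feature dimension.
        mean = [sum(frame[d] for frame in segment) / run_length for d in range(dim)]
        shrunk.append((label, mean))
        index += run_length
    return shrunk


# Six hypothetical encoder frames with two dimensions each.
frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0], [5.0, 5.0]]
# One hypothetical predicted letter per frame.
labels = ["h", "h", "i", "i", "i", "_"]

shrunk = shrink_by_label(frames, labels)
```

Six frames are reduced to three vectors, one per predicted unit, which is closer to the number of representations a text encoder would produce.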
1:22:12.692 --> 1:22:20.919
Okay, that much about data. Do you have any more questions
about that first?
1:22:27.207 --> 1:22:36.787
Then we'll finish with the audio processing
and highlight a bit why this is challenging,
1:22:36.787 --> 1:22:52.891
so here is one example: one test set here has one thousand eight
hundred sentences, so there are words or characters.
1:22:53.954 --> 1:22:59.336
If you look at how many audio features, so
how many samples, there are, it is like one point five
1:22:59.336 --> 1:22:59.880
million.
1:23:00.200 --> 1:23:10.681
So you have ten times more features than you
have characters, and then again five times
1:23:10.681 --> 1:23:11.413
more.
1:23:11.811 --> 1:23:23.934
So the sequence length of the audio is
far longer than what you have for words, and that is
1:23:23.934 --> 1:23:25.788
a challenge.
1:23:26.086 --> 1:23:34.935
So the question is what can you do to make
the sequence a bit shorter and not have this?
1:23:38.458 --> 1:23:48.466
The one thing is you can try to reduce the
dimensionality in your encoder.
1:23:48.466 --> 1:23:50.814
There are different.
1:23:50.991 --> 1:24:04.302
So, for example, you can always just sum up
over some frames, or you can do a concatenation,
1:24:04.804 --> 1:24:12.045
or a linear projection, or you even take
not every feature but only every fifth or something.
1:24:12.492 --> 1:24:23.660
So this way you can very easily reduce your
number of features in there, and there has
1:24:23.660 --> 1:24:25.713
been different.
1:24:26.306 --> 1:24:38.310
There's also what you can do with things like
a convolutional layer.
1:24:38.310 --> 1:24:43.877
If you skip over what you can.
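The simple length-reduction options mentioned here (taking only every k-th feature, summing over a few frames, or concatenating them) can be sketched in plain Python. The function names, the window size of five, and the toy frames are assumptions for illustration; a strided convolutional layer is not shown and would apply a learned projection to each window.

```python
def subsample(frames, k):
    """Keep only every k-th frame."""
    return frames[::k]


def sum_pool(frames, k):
    """Sum every k consecutive frames into one frame of the same size."""
    pooled = []
    for start in range(0, len(frames) - k + 1, k):
        window = frames[start:start + k]
        pooled.append([sum(column) for column in zip(*window)])
    return pooled


def stack(frames, k):
    """Concatenate every k consecutive frames into one longer frame."""
    stacked = []
    for start in range(0, len(frames) - k + 1, k):
        window = frames[start:start + k]
        stacked.append([value for frame in window for value in frame])
    return stacked


# Ten hypothetical frames with two feature dimensions each.
frames = [[float(i), float(10 * i)] for i in range(10)]

every_fifth = subsample(frames, 5)
summed = sum_pool(frames, 5)
concatenated = stack(frames, 5)
```

All three variants turn ten frames into two. Subsampling throws information away, summing keeps the dimension but blurs the frames, and concatenation keeps everything at the cost of a five times larger feature vector.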
1:24:47.327 --> 1:24:55.539
And then, in addition, the other
problem with audio is the higher variability.
1:24:55.539 --> 1:25:04.957
So if you have a text you can. But there are
very different ways of saying that; you can
1:25:04.957 --> 1:25:09.867
distinguish who says a sentence, or the
voice.
1:25:10.510 --> 1:25:21.224
That of course makes it more challenging because
now you get different inputs, while they
1:25:21.224 --> 1:25:22.837
were the same in text.
1:25:23.263 --> 1:25:32.360
So, especially for limited data, that makes
things more challenging, and you want to somehow
1:25:32.360 --> 1:25:35.796
learn that this is not important.
1:25:36.076 --> 1:25:39.944
So there is the idea again: okay,
1:25:39.944 --> 1:25:47.564
can we do some type of data augmentation
to better deal with that?
1:25:48.908 --> 1:25:55.735
And again people can mainly use what has been
done in ASR and try to do the same things.
1:25:56.276 --> 1:26:02.937
You can try to do a bit of noise and speed
perturbation, so playing the audio a bit slower
1:26:02.937 --> 1:26:08.563
and a bit faster to get more samples, and then
you can train on all of them.
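Playing the audio a bit slower or faster can be approximated by resampling the waveform. The sketch below uses plain linear interpolation; the function name, the factors 0.9 and 1.1, and the toy signal are assumptions for illustration, and real toolkits typically use a proper resampling filter.

```python
def speed_perturb(samples, factor):
    """Resample by linear interpolation; factor > 1 is faster, so fewer samples."""
    last = len(samples) - 1
    new_length = max(1, int(len(samples) / factor))
    output = []
    for i in range(new_length):
        position = i * factor
        left = min(int(position), last)
        right = min(left + 1, last)
        fraction = position - left
        output.append((1.0 - fraction) * samples[left] + fraction * samples[right])
    return output


# A hypothetical signal of 100 samples.
samples = [float(i) for i in range(100)]

slower = speed_perturb(samples, 0.9)   # longer than the original
faster = speed_perturb(samples, 1.1)   # shorter than the original
training_variants = [slower, samples, faster]
```

Each variant keeps the same transcript and translation, so one recorded utterance yields three training samples.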
1:26:08.563 --> 1:26:14.928
What is very important and very successful
recently is what is called SpecAugment.
1:26:15.235 --> 1:26:25.882
The idea is that you directly work on the
audio features and you can try to mask them,
1:26:25.882 --> 1:26:29.014
and that gives you more.
1:26:29.469 --> 1:26:41.717
What do they mean by masking? So this is
your audio feature, and then there are different.
1:26:41.962 --> 1:26:47.252
You can do what is referred to as
time masking.
1:26:47.252 --> 1:26:50.480
That means you just mask out some time steps.
1:26:50.730 --> 1:26:58.003
And then you should still be able
to deal with it because you can normally.
1:26:57.937 --> 1:27:05.840
Also with that you are getting more robust
and you can handle that because then
1:27:05.840 --> 1:27:10.877
many samples which have different timing look
more similar.
1:27:11.931 --> 1:27:22.719
You are not only doing that for time masking
but also for frequency masking so that if you
1:27:22.719 --> 1:27:30.188
have here the frequency channels you mask a
frequency channel.
1:27:30.090 --> 1:27:33.089
Thereby being able to better recognize these
things.
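Time and frequency masking can be sketched on a small feature matrix with one row per time frame and one column per frequency channel. This is a simplified illustration and not the published SpecAugment recipe: it applies a single time mask and a single frequency mask, fills them with zeros, and the maximum widths and the function name are assumed values.

```python
import random


def spec_augment(features, max_time_width, max_freq_width, rng):
    """Return a copy with one time span and one frequency band set to zero."""
    num_frames = len(features)
    num_channels = len(features[0])
    masked = [row[:] for row in features]

    # Time masking: zero out a random run of consecutive frames.
    time_width = rng.randint(0, max_time_width)
    time_start = rng.randint(0, num_frames - time_width)
    for frame in range(time_start, time_start + time_width):
        masked[frame] = [0.0] * num_channels

    # Frequency masking: zero out a random band of channels in every frame.
    freq_width = rng.randint(0, max_freq_width)
    freq_start = rng.randint(0, num_channels - freq_width)
    for row in masked:
        for channel in range(freq_start, freq_start + freq_width):
            row[channel] = 0.0
    return masked


# Hypothetical features: 20 frames with 8 frequency channels, all ones.
features = [[1.0] * 8 for _ in range(20)]
rng = random.Random(0)
augmented = spec_augment(features, max_time_width=5, max_freq_width=2, rng=rng)
```

Calling the function again with a fresh random draw gives a differently masked copy of the same utterance, so the model sees a new variant in every epoch.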
1:27:35.695 --> 1:27:43.698
With this we have had an overview of the two main
approaches for speech translation, that is on
1:27:43.698 --> 1:27:51.523
the one hand cascaded speech translation and
on the other hand we talked about end-to-end
1:27:51.523 --> 1:27:53.302
speech translation.
1:27:53.273 --> 1:28:02.080
It's like how to combine things and how they
work together, and for end-to-end speech translation
1:28:02.362 --> 1:28:06.581
there were the data challenges and a bit about long
sequences.
1:28:07.747 --> 1:28:09.304
Do we have any more questions?
1:28:11.451 --> 1:28:19.974
Can you describe the challenges in cascading
from translation to text-to-speech, because
1:28:19.974 --> 1:28:22.315
I thought the translation.
1:28:25.745 --> 1:28:30.201
Yes, so I mean that works; again, it's the easiest
thing.
1:28:30.201 --> 1:28:33.021
What of course is challenging?
1:28:33.021 --> 1:28:40.751
What can be challenging is how to make that
more lively, and things like the pronunciation.
1:28:40.680 --> 1:28:47.369
And yeah, which things are more important,
how to put things like that into.
1:28:47.627 --> 1:28:53.866
In the normal text, otherwise it would sound
very monotone.
1:28:53.866 --> 1:28:57.401
You want to add this information.
1:28:58.498 --> 1:29:02.656
That is maybe one thing to make it a bit more
emotional.
1:29:02.656 --> 1:29:04.917
That is maybe one thing which.
1:29:05.305 --> 1:29:13.448
But you are right, out of the box
1:29:13.448 --> 1:29:20.665
everything works decently.
1:29:20.800 --> 1:29:30.507
Still, especially if you have a very monotone
voice, I think there are quite some open challenges.
1:29:30.750 --> 1:29:35.898
Maybe another open challenge, which is
not so much for the end product but for the
1:29:35.898 --> 1:29:37.732
development is very important:
1:29:37.732 --> 1:29:40.099
it's very hard to evaluate the quality.
1:29:40.740 --> 1:29:48.143
So
most systems are currently evaluated by human
1:29:48.143 --> 1:29:49.109
evaluation.
1:29:49.589 --> 1:29:54.474
So you cannot try hundreds of things and run
your BLEU score and get this score.
1:29:54.975 --> 1:30:00.609
So therefore it is very important to have
some type of evaluation metric and that is
1:30:00.609 --> 1:30:01.825
quite challenging.
1:30:08.768 --> 1:30:15.550
And thanks for listening, and we'll have the
second part of speech translation on Thursday.