Luca committed on
Commit f97a253 · 2 Parent(s): 6e6b619 6242511

Merge branch 'ufal:main' into main

Files changed (2):
  1. README.md +19 -6
  2. whisper_online.py +26 -2
README.md CHANGED
@@ -12,21 +12,34 @@ Pre-print: https://arxiv.org/abs/2307.14743
 
 Demo video: https://player.vimeo.com/video/840442741
 
+[Slides](http://ufallab.ms.mff.cuni.cz/~machacek/pre-prints/AACL23-2.11.2023-Turning-Whisper-oral.pdf) -- 15-minute oral presentation at IJCNLP-AACL 2023
+
 ## Installation
 
-This code work with two kinds of backends. Both require
+1) ``pip install librosa`` -- audio processing library
 
-```
-pip install librosa
-pip install opus-fast-mosestokenizer
-```
+2) Whisper backend.
 
-The most recommended backend is [faster-whisper](https://github.com/guillaumekln/faster-whisper) with GPU support. Follow their instructions for NVIDIA libraries -- we succeeded with CUDNN 8.5.0 and CUDA 11.7. Install with `pip install faster-whisper`.
+Two alternative backends are integrated. The most recommended one is [faster-whisper](https://github.com/guillaumekln/faster-whisper) with GPU support. Follow their instructions for NVIDIA libraries -- we succeeded with CUDNN 8.5.0 and CUDA 11.7. Install with `pip install faster-whisper`.
 
 Alternative, less restrictive, but slower backend is [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped): `pip install git+https://github.com/linto-ai/whisper-timestamped`
 
 The backend is loaded only when chosen. The unused one does not have to be installed.
 
+3) Sentence segmenter (aka sentence tokenizer)
+
+It splits punctuated text into sentences at full stops, avoiding dots that are not full stops. The segmenters are language-specific.
+The unused ones do not have to be installed. We integrate the following segmenters, but suggestions for better alternatives are welcome.
+
+- `pip install opus-fast-mosestokenizer` for the languages with codes `as bn ca cs de el en es et fi fr ga gu hi hu is it kn lt lv ml mni mr nl or pa pl pt ro ru sk sl sv ta te yue zh`
+
+- `pip install tokenize_uk` for Ukrainian -- `uk`
+
+- for other languages, we integrate a well-performing multilingual model of `wtpsplit`. It requires `pip install torch wtpsplit` and its neural model `wtp-canine-s-12l-no-adapters`, which is downloaded to the default Hugging Face cache during the first use.
+
+- we did not find a segmenter for the languages `as ba bo br bs fo haw hr ht jw lb ln lo mi nn oc sa sd sn so su sw tk tl tt` that are supported by Whisper but not by wtpsplit. The default fallback option for them is wtpsplit with unspecified language. Alternative suggestions are welcome.
+
 ## Usage
 
 ### Realtime simulation from audio file
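The three segmenter options added to the README above are chosen by language code. A minimal, pure-Python sketch of that routing (the language sets are copied from the README; the function name `pick_segmenter` and its return strings are illustrative only, not part of the repository):

```python
# Sketch of the segmenter selection described in the README diff above.
# The language sets are copied verbatim from the README; the function
# name and return values are made up for illustration.

MOSES_LANGS = set(
    "as bn ca cs de el en es et fi fr ga gu hi hu is it kn lt lv ml mni "
    "mr nl or pa pl pt ro ru sk sl sv ta te yue zh".split())
NO_WTPSPLIT_LANGS = set(
    "as ba bo br bs fo haw hr ht jw lb ln lo mi nn oc sa sd sn so su sw "
    "tk tl tt".split())

def pick_segmenter(lan):
    """Which segmenter the README suggests for a Whisper language code."""
    if lan == "uk":
        return "tokenize_uk"
    if lan in MOSES_LANGS:
        return "opus-fast-mosestokenizer"
    if lan in NO_WTPSPLIT_LANGS:
        # Whisper supports these, wtpsplit does not: unspecified-language fallback.
        return "wtpsplit (lang_code=None fallback)"
    return "wtpsplit"

print(pick_segmenter("cs"))   # opus-fast-mosestokenizer
print(pick_segmenter("uk"))   # tokenize_uk
print(pick_segmenter("haw"))  # wtpsplit (lang_code=None fallback)
```

Note that `as` appears in both sets; checking the Moses set first matches the order of the checks in `create_tokenizer` below.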
whisper_online.py CHANGED
@@ -422,16 +422,40 @@ class OnlineASRProcessor:
         e = offset + sents[-1][1]
         return (b,e,t)
 
+WHISPER_LANG_CODES = "af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,zh".split(",")
 
 def create_tokenizer(lan):
+    """returns an object that has a split function that works like the one of MosesTokenizer"""
+
+    assert lan in WHISPER_LANG_CODES, "language must be Whisper's supported lang code: " + " ".join(WHISPER_LANG_CODES)
+
     if lan == "uk":
         import tokenize_uk
         class UkrainianTokenizer:
             def split(self, text):
                 return tokenize_uk.tokenize_sents(text)
         return UkrainianTokenizer()
-    from mosestokenizer import MosesTokenizer
-    return MosesTokenizer(lan)
+
+    # supported by fast-mosestokenizer
+    if lan in "as bn ca cs de el en es et fi fr ga gu hi hu is it kn lt lv ml mni mr nl or pa pl pt ro ru sk sl sv ta te yue zh".split():
+        from mosestokenizer import MosesTokenizer
+        return MosesTokenizer(lan)
+
+    # the following languages are in Whisper, but not in wtpsplit:
+    if lan in "as ba bo br bs fo haw hr ht jw lb ln lo mi nn oc sa sd sn so su sw tk tl tt".split():
+        print(f"{lan} code is not supported by wtpsplit. Going to use None lang_code option.", file=sys.stderr)
+        lan = None
+
+    from wtpsplit import WtP
+    # downloads the model from huggingface on the first use
+    wtp = WtP("wtp-canine-s-12l-no-adapters")
+    class WtPtok:
+        def split(self, sent):
+            return wtp.split(sent, lang_code=lan)
+    return WtPtok()
+
 
 ## main:
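The patch standardizes all three backends behind the same Moses-style interface: an object with a single `split(text) -> list of sentences` method (`UkrainianTokenizer`, `MosesTokenizer`, `WtPtok`). A minimal stand-in with that interface, using a deliberately naive regex split (purely illustrative -- not a substitute for any of the real segmenters, which handle abbreviation dots and language-specific punctuation):

```python
import re

class NaiveSentenceTokenizer:
    """Illustrative stand-in sharing the split() interface of
    MosesTokenizer / UkrainianTokenizer / WtPtok in the diff above."""
    def split(self, text):
        # Split after ., ! or ? followed by whitespace -- naive on purpose:
        # it would wrongly split after abbreviations like "e.g. this".
        return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

tok = NaiveSentenceTokenizer()
print(tok.split("Hello there. How are you? Fine."))
# ['Hello there.', 'How are you?', 'Fine.']
```

Because every backend exposes the same `split` method, the caller in `OnlineASRProcessor` never needs to know which segmenter was chosen.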
461