Dominik Macháček committed
Commit 4a51e13 · 1 Parent(s): 2249846

segmenters for all Whisper languages

Files changed (2)
  1. README.md +17 -6
  2. whisper_online.py +26 -2
README.md CHANGED
@@ -14,19 +14,30 @@ Demo video: https://player.vimeo.com/video/840442741
 
 ## Installation
 
-This code work with two kinds of backends. Both require
+1) ``pip install librosa`` -- audio processing library
 
-```
-pip install librosa
-pip install opus-fast-mosestokenizer
-```
+2) Whisper backend.
 
-The most recommended backend is [faster-whisper](https://github.com/guillaumekln/faster-whisper) with GPU support. Follow their instructions for NVIDIA libraries -- we succeeded with CUDNN 8.5.0 and CUDA 11.7. Install with `pip install faster-whisper`.
+Two alternative backends are integrated. The most recommended one is [faster-whisper](https://github.com/guillaumekln/faster-whisper) with GPU support. Follow their instructions for NVIDIA libraries -- we succeeded with CUDNN 8.5.0 and CUDA 11.7. Install with `pip install faster-whisper`.
 
 Alternative, less restrictive, but slower backend is [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped): `pip install git+https://github.com/linto-ai/whisper-timestamped`
 
 The backend is loaded only when chosen. The unused one does not have to be installed.
 
+3) Sentence segmenter (aka sentence tokenizer)
+
+It splits punctuated text to sentences by full stops, avoiding the dots that are not full stops. The segmenters are language specific.
+The unused one does not have to be installed. We integrate the following segmenters, but suggestions for better alternatives are welcome.
+
+- `pip install opus-fast-mosestokenizer` for the languages with codes `as bn ca cs de el en es et fi fr ga gu hi hu is it kn lt lv ml mni mr nl or pa pl pt ro ru sk sl sv ta te yue zh`
+
+- `pip install tokenize_uk` for Ukrainian -- `uk`
+
+- for other languages, we integrate a good performing multi-lingual model of `wtpsplit`. It requires `pip install torch wtpsplit`, and its neural model `wtp-canine-s-12l-no-adapters`. It is downloaded to the default huggingface cache during the first use.
+
+- we did not find a segmenter for languages `as ba bo br bs fo haw hr ht jw lb ln lo mi nn oc sa sd sn so su sw tk tl tt` that are supported by Whisper and not by wtpsplit. The default fallback option for them is wtpsplit with unspecified language. Alternative suggestions welcome.
+
+
 ## Usage
 
 ### Realtime simulation from audio file
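
Note on the new "Sentence segmenter" step: the README says the segmenter must avoid "the dots that are not full stops". The sketch below illustrates why a plain punctuation rule is not enough. `naive_split` is a hypothetical helper written for this note only. It is not part of the repository and is not how any of the integrated segmenters work.

```python
import re


def naive_split(text):
    """Hypothetical baseline: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


text = "Dr. Novak arrived at 5 p.m. on Monday. He left early."

# The abbreviations "Dr." and "p.m." are wrongly treated as sentence ends,
# so the two real sentences come out as four fragments.
for fragment in naive_split(text):
    print(repr(fragment))
# 'Dr.'
# 'Novak arrived at 5 p.m.'
# 'on Monday.'
# 'He left early.'
```

Which dots are abbreviations depends on the language, which is presumably why the commit selects a segmenter per language code instead of using a single rule.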
whisper_online.py CHANGED
@@ -416,16 +416,40 @@ class OnlineASRProcessor:
         e = offset + sents[-1][1]
         return (b,e,t)
 
+WHISPER_LANG_CODES = "af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,zh".split(",")
 
 def create_tokenizer(lan):
+    """returns an object that has split function that works like the one of MosesTokenizer"""
+
+    assert lan in WHISPER_LANG_CODES, "language must be Whisper's supported lang code: " + " ".join(WHISPER_LANG_CODES)
+
     if lan == "uk":
         import tokenize_uk
         class UkrainianTokenizer:
             def split(self, text):
                 return tokenize_uk.tokenize_sents(text)
         return UkrainianTokenizer()
-    from mosestokenizer import MosesTokenizer
-    return MosesTokenizer(lan)
+
+    # supported by fast-mosestokenizer
+    if lan in "as bn ca cs de el en es et fi fr ga gu hi hu is it kn lt lv ml mni mr nl or pa pl pt ro ru sk sl sv ta te yue zh".split():
+        from mosestokenizer import MosesTokenizer
+        return MosesTokenizer(lan)
+
+    # the following languages are in Whisper, but not in wtpsplit:
+    if lan in "as ba bo br bs fo haw hr ht jw lb ln lo mi nn oc sa sd sn so su sw tk tl tt".split():
+        print(f"{lan} code is not supported by wtpsplit. Going to use None lang_code option.", file=sys.stderr)
+        lan = None
+
+    from wtpsplit import WtP
+    # downloads the model from huggingface on the first use
+    wtp = WtP("wtp-canine-s-12l-no-adapters")
+    class WtPtok:
+        def split(self, sent):
+            return wtp.split(sent, lang_code=lan)
+    return WtPtok()
+
+
+
 
 ## main:
 
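
Note on the dispatch order in the new `create_tokenizer`: the sketch below mirrors the order of the checks in the diff, but returns only the name of the backend that would be chosen, so it runs without `tokenize_uk`, `mosestokenizer` or `wtpsplit` installed. `choose_segmenter` and the two list names are hypothetical, introduced here for illustration. The sketch also leaves out the `WHISPER_LANG_CODES` assertion.

```python
# Language lists copied from the diff above.
MOSES_LANGS = ("as bn ca cs de el en es et fi fr ga gu hi hu is it kn lt lv ml "
               "mni mr nl or pa pl pt ro ru sk sl sv ta te yue zh").split()
NOT_IN_WTPSPLIT = ("as ba bo br bs fo haw hr ht jw lb ln lo mi nn oc sa sd sn "
                   "so su sw tk tl tt").split()


def choose_segmenter(lan):
    """Return (backend name, language code handed to it), in the diff's check order."""
    if lan == "uk":
        return ("tokenize_uk", "uk")
    if lan in MOSES_LANGS:
        return ("mosestokenizer", lan)
    if lan in NOT_IN_WTPSPLIT:
        # Fallback: wtpsplit is called with lang_code=None.
        return ("wtpsplit", None)
    return ("wtpsplit", lan)


print(choose_segmenter("en"))  # ('mosestokenizer', 'en')
print(choose_segmenter("ja"))  # ('wtpsplit', 'ja')
print(choose_segmenter("hr"))  # ('wtpsplit', None)
print(choose_segmenter("as"))  # ('mosestokenizer', 'as')
```

The last call shows a consequence of the ordering. `as` appears in both lists, and because the Moses check runs first, it never reaches the `lang_code=None` fallback.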