Dominik Macháček committed on
Commit aa51e39 · 1 Parent(s): ef08538

buffer trimming option, sent. segmenter not required anymore


- both for whisper_online + server
- removed argparse code repetition
- README updated

Files changed (3)
  1. README.md +28 -15
  2. whisper_online.py +27 -15
  3. whisper_online_server.py +7 -11
README.md CHANGED
@@ -41,10 +41,17 @@ Alternative, less restrictive, but slower backend is [whisper-timestamped](https
 
 The backend is loaded only when chosen. The unused one does not have to be installed.
 
-3) Sentence segmenter (aka sentence tokenizer)
+3) Optional, not recommended: sentence segmenter (aka sentence tokenizer)
 
-It splits punctuated text to sentences by full stops, avoiding the dots that are not full stops. The segmenters are language specific.
-The unused one does not have to be installed. We integrate the following segmenters, but suggestions for better alternatives are welcome.
+Two buffer trimming options are integrated and evaluated. They have impact on
+the quality and latency. The default "segment" option performs better according
+to our tests and does not require any sentence segmentation installed.
+
+The other option, "sentence" -- trimming at the end of confirmed sentences,
+requires sentence segmenter installed. It splits punctuated text to sentences by full
+stops, avoiding the dots that are not full stops. The segmenters are language
+specific. The unused one does not have to be installed. We integrate the
+following segmenters, but suggestions for better alternatives are welcome.
 
 - `pip install opus-fast-mosestokenizer` for the languages with codes `as bn ca cs de el en es et fi fr ga gu hi hu is it kn lt lv ml mni mr nl or pa pl pt ro ru sk sl sv ta te yue zh`
 
@@ -54,14 +61,16 @@ The unused one does not have to be installed. We integrate the following segment
 
 - we did not find a segmenter for languages `as ba bo br bs fo haw hr ht jw lb ln lo mi nn oc sa sd sn so su sw tk tl tt` that are supported by Whisper and not by wtpsplit. The default fallback option for them is wtpsplit with unspecified language. Alternative suggestions welcome.
 
+In case of installation issues of opus-fast-mosestokenizer, especially on Windows and Mac, we recommend using only the "segment" option that does not require it.
 
 ## Usage
 
 ### Real-time simulation from audio file
 
 ```
-usage: whisper_online.py [-h] [--min-chunk-size MIN_CHUNK_SIZE] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large}] [--model_cache_dir MODEL_CACHE_DIR] [--model_dir MODEL_DIR] [--lan LAN] [--task {transcribe,translate}]
-                         [--start_at START_AT] [--backend {faster-whisper,whisper_timestamped}] [--offline] [--comp_unaware] [--vad]
+usage: whisper_online.py [-h] [--min-chunk-size MIN_CHUNK_SIZE] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large}] [--model_cache_dir MODEL_CACHE_DIR]
+                         [--model_dir MODEL_DIR] [--lan LAN] [--task {transcribe,translate}] [--start_at START_AT] [--backend {faster-whisper,whisper_timestamped}] [--vad]
+                         [--buffer_trimming {sentence,segment}] [--buffer_trimming_sec BUFFER_TRIMMING_SEC] [--offline] [--comp_unaware]
                          audio_path
 
 positional arguments:
@@ -70,8 +79,9 @@ positional arguments:
 options:
   -h, --help            show this help message and exit
   --min-chunk-size MIN_CHUNK_SIZE
-                        Minimum audio chunk size in seconds. It waits up to this time to do processing. If the processing takes shorter time, it waits, otherwise it processes the whole segment that was received by this time.
-  --model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large}
+                        Minimum audio chunk size in seconds. It waits up to this time to do processing. If the processing takes shorter time, it waits, otherwise it processes the whole segment that was
+                        received by this time.
+  --model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large}
                         Name size of the Whisper model to use (default: large-v2). The model is automatically downloaded from the model hub if not present in model cache dir.
   --model_cache_dir MODEL_CACHE_DIR
                         Overriding the default model cache dir where models downloaded from the hub are saved
@@ -84,9 +94,14 @@ options:
   --start_at START_AT   Start processing audio at this time.
   --backend {faster-whisper,whisper_timestamped}
                         Load only this backend for Whisper processing.
+  --vad                 Use VAD = voice activity detection, with the default parameters.
+  --buffer_trimming {sentence,segment}
+                        Buffer trimming strategy -- trim completed sentences marked with punctuation mark and detected by sentence segmenter, or the completed segments returned by Whisper. Sentence segmenter
+                        must be installed for "sentence" option.
+  --buffer_trimming_sec BUFFER_TRIMMING_SEC
+                        Buffer trimming length threshold in seconds. If buffer length is longer, trimming sentence/segment is triggered.
   --offline             Offline mode.
   --comp_unaware        Computationally unaware simulation.
-  --vad                 Use VAD = voice activity detection, with the default parameters.
 ```
 
 Example:
@@ -133,7 +148,7 @@ TL;DR: use OnlineASRProcessor object and its methods insert_audio_chunk and proc
 The code whisper_online.py is nicely commented, read it as the full documentation.
 
 
-This pseudocode describes the interface that we suggest for your implementation. You can implement e.g. audio from mic or stdin, server-client, etc.
+This pseudocode describes the interface that we suggest for your implementation. You can implement any features that you need for your application.
 
 ```
 from whisper_online import *
@@ -146,10 +161,7 @@ asr = FasterWhisperASR(lan, "large-v2") # loads and wraps Whisper model
 # asr.set_translate_task() # it will translate from lan into English
 # asr.use_vad() # set using VAD
 
-tokenizer = create_tokenizer(tgt_lan) # sentence segmenter for the target language
-
-online = OnlineASRProcessor(asr, tokenizer) # create processing object
-
+online = OnlineASRProcessor(asr) # create processing object with default buffer trimming option
 
 while audio_has_not_ended:   # processing loop:
     a = # receive new audio chunk (and e.g. wait for min_chunk_size seconds first, ...)
@@ -209,9 +221,10 @@ overlap, and we limit the processing buffer window.
 
 Contributions are welcome.
 
-### Tests
+### Performance evaluation
+
+[See the paper.](http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-demo/cdrom/pdf/2023.ijcnlp-demo.3.pdf)
 
-[See the results in paper.](http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-demo/cdrom/pdf/2023.ijcnlp-demo.3.pdf)
 
 ## Contact
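For reference, a minimal runnable sketch of the two trimming modes with the updated constructor. The signatures follow the diff above; the dummy silence chunk and the explicit import list are illustrative only.

```
import numpy as np
from whisper_online import FasterWhisperASR, OnlineASRProcessor, create_tokenizer

asr = FasterWhisperASR("en", "large-v2")   # loads and wraps the Whisper model

# Default: "segment" trimming -- no sentence segmenter needed at all.
online = OnlineASRProcessor(asr)

# Alternative: "sentence" trimming -- requires a segmenter for the language.
# tokenizer = create_tokenizer("en")
# online = OnlineASRProcessor(asr, tokenizer, buffer_trimming=("sentence", 15))

chunk = np.zeros(16000, dtype=np.float32)  # stand-in for 1 s of 16 kHz audio
online.insert_audio_chunk(chunk)
print(online.process_iter())               # confirmed (committed) output so far
print(online.finish())                     # flush the remainder at end of stream
```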
whisper_online.py CHANGED
@@ -212,9 +212,11 @@ class OnlineASRProcessor:
 
     SAMPLING_RATE = 16000
 
-    def __init__(self, asr, tokenizer=None, logfile=sys.stderr, buffer_trimming=("segment", 15)):
+    def __init__(self, asr, tokenizer=None, buffer_trimming=("segment", 15), logfile=sys.stderr):
         """asr: WhisperASR object
-        tokenizer: sentence tokenizer object for the target language. Must have a method *split* that behaves like the one of MosesTokenizer.
+        tokenizer: sentence tokenizer object for the target language. Must have a method *split* that behaves like the one of MosesTokenizer. It can be None, if "segment" buffer trimming option is used, then tokenizer is not used at all.
+        ("segment", 15)
+        buffer_trimming: a pair of (option, seconds), where option is either "sentence" or "segment", and seconds is a number. Buffer is trimmed if it is longer than "seconds" threshold. Default is the most recommended option.
         logfile: where to store the log.
         """
         self.asr = asr
@@ -441,7 +443,21 @@ def create_tokenizer(lan):
     return WtPtok()
 
 
-
+def add_shared_args(parser):
+    """shared args for simulation (this entry point) and server
+    parser: argparse.ArgumentParser object
+    """
+    parser.add_argument('--min-chunk-size', type=float, default=1.0, help='Minimum audio chunk size in seconds. It waits up to this time to do processing. If the processing takes shorter time, it waits, otherwise it processes the whole segment that was received by this time.')
+    parser.add_argument('--model', type=str, default='large-v2', choices="tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large".split(","),help="Name size of the Whisper model to use (default: large-v2). The model is automatically downloaded from the model hub if not present in model cache dir.")
+    parser.add_argument('--model_cache_dir', type=str, default=None, help="Overriding the default model cache dir where models downloaded from the hub are saved")
+    parser.add_argument('--model_dir', type=str, default=None, help="Dir where Whisper model.bin and other files are saved. This option overrides --model and --model_cache_dir parameter.")
+    parser.add_argument('--lan', '--language', type=str, default='en', help="Language code for transcription, e.g. en,de,cs.")
+    parser.add_argument('--task', type=str, default='transcribe', choices=["transcribe","translate"],help="Transcribe or translate.")
+    parser.add_argument('--start_at', type=float, default=0.0, help='Start processing audio at this time.')
+    parser.add_argument('--backend', type=str, default="faster-whisper", choices=["faster-whisper", "whisper_timestamped"],help='Load only this backend for Whisper processing.')
+    parser.add_argument('--vad', action="store_true", default=False, help='Use VAD = voice activity detection, with the default parameters.')
+    parser.add_argument('--buffer_trimming', type=str, default="segment", choices=["sentence", "segment"],help='Buffer trimming strategy -- trim completed sentences marked with punctuation mark and detected by sentence segmenter, or the completed segments returned by Whisper. Sentence segmenter must be installed for "sentence" option.')
+    parser.add_argument('--buffer_trimming_sec', type=float, default=15, help='Buffer trimming length threshold in seconds. If buffer length is longer, trimming sentence/segment is triggered.')
 
 ## main:
 
@@ -450,19 +466,11 @@ if __name__ == "__main__":
     import argparse
     parser = argparse.ArgumentParser()
     parser.add_argument('audio_path', type=str, help="Filename of 16kHz mono channel wav, on which live streaming is simulated.")
-    parser.add_argument('--min-chunk-size', type=float, default=1.0, help='Minimum audio chunk size in seconds. It waits up to this time to do processing. If the processing takes shorter time, it waits, otherwise it processes the whole segment that was received by this time.')
-    parser.add_argument('--model', type=str, default='large-v2', choices="tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large".split(","),help="Name size of the Whisper model to use (default: large-v2). The model is automatically downloaded from the model hub if not present in model cache dir.")
-    parser.add_argument('--model_cache_dir', type=str, default=None, help="Overriding the default model cache dir where models downloaded from the hub are saved")
-    parser.add_argument('--model_dir', type=str, default=None, help="Dir where Whisper model.bin and other files are saved. This option overrides --model and --model_cache_dir parameter.")
-    parser.add_argument('--lan', '--language', type=str, default='en', help="Language code for transcription, e.g. en,de,cs.")
-    parser.add_argument('--task', type=str, default='transcribe', choices=["transcribe","translate"],help="Transcribe or translate.")
-    parser.add_argument('--start_at', type=float, default=0.0, help='Start processing audio at this time.')
-    parser.add_argument('--backend', type=str, default="faster-whisper", choices=["faster-whisper", "whisper_timestamped"],help='Load only this backend for Whisper processing.')
+    add_shared_args(parser)
     parser.add_argument('--offline', action="store_true", default=False, help='Offline mode.')
    parser.add_argument('--comp_unaware', action="store_true", default=False, help='Computationally unaware simulation.')
-    parser.add_argument('--vad', action="store_true", default=False, help='Use VAD = voice activity detection, with the default parameters.')
-    parser.add_argument('--buffer_trimming', type=str, default="sentence", choices=["sentence", "segment"],help='Buffer trimming strategy')
-    parser.add_argument('--buffer_trimming_sec', type=float, default=15, help='Buffer trimming lenght threshold in seconds. If buffer length longer, trimming sentence/segment is triggered.')
+
+
     args = parser.parse_args()
 
     # reset to store stderr to different file stream, e.g. open(os.devnull,"w")
@@ -507,7 +515,11 @@ if __name__ == "__main__":
 
 
     min_chunk = args.min_chunk_size
-    online = OnlineASRProcessor(asr,create_tokenizer(tgt_language),logfile=logfile,buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec))
+    if args.buffer_trimming == "sentence":
+        tokenizer = create_tokenizer(tgt_language)
+    else:
+        tokenizer = None
+    online = OnlineASRProcessor(asr,tokenizer,logfile=logfile,buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec))
 
 
     # load the audio into the LRU cache before we start the timer
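With the shared options factored into add_shared_args(), any further entry point only registers its own flags and delegates the rest; a short sketch (the extra flag is hypothetical, not part of the commit):

```
import argparse
from whisper_online import add_shared_args

parser = argparse.ArgumentParser()
parser.add_argument('--my-extra-flag', action="store_true")  # hypothetical, script-specific
add_shared_args(parser)  # --min-chunk-size, --model, --lan, --vad, --buffer_trimming, ...
args = parser.parse_args([])  # defaults only; pass real argv in practice
print(args.buffer_trimming, args.buffer_trimming_sec)  # -> segment 15.0
```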
whisper_online_server.py CHANGED
@@ -12,16 +12,7 @@ parser.add_argument("--port", type=int, default=43007)
 
 
 # options from whisper_online
-# TODO: code repetition
-
-parser.add_argument('--min-chunk-size', type=float, default=1.0, help='Minimum audio chunk size in seconds. It waits up to this time to do processing. If the processing takes shorter time, it waits, otherwise it processes the whole segment that was received by this time.')
-parser.add_argument('--model', type=str, default='large-v2', choices="tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large".split(","),help="Name size of the Whisper model to use (default: large-v2). The model is automatically downloaded from the model hub if not present in model cache dir.")
-parser.add_argument('--model_cache_dir', type=str, default=None, help="Overriding the default model cache dir where models downloaded from the hub are saved")
-parser.add_argument('--model_dir', type=str, default=None, help="Dir where Whisper model.bin and other files are saved. This option overrides --model and --model_cache_dir parameter.")
-parser.add_argument('--lan', '--language', type=str, default='en', help="Language code for transcription, e.g. en,de,cs.")
-parser.add_argument('--task', type=str, default='transcribe', choices=["transcribe","translate"],help="Transcribe or translate.")
-parser.add_argument('--backend', type=str, default="faster-whisper", choices=["faster-whisper", "whisper_timestamped"],help='Load only this backend for Whisper processing.')
-parser.add_argument('--vad', action="store_true", default=False, help='Use VAD = voice activity detection, with the default parameters.')
+add_shared_args(parser)
 args = parser.parse_args()
 
 
@@ -61,7 +52,12 @@ if args.vad:
 
 
 min_chunk = args.min_chunk_size
-online = OnlineASRProcessor(asr,create_tokenizer(tgt_language))
+
+if args.buffer_trimming == "sentence":
+    tokenizer = create_tokenizer(tgt_language)
+else:
+    tokenizer = None
+online = OnlineASRProcessor(asr,tokenizer,buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec))
 
 
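The if/else tokenizer selection is now duplicated in both entry points. If it changes again, it could be hoisted next to add_shared_args in whisper_online.py; a possible follow-up sketch (the helper name is made up, not part of this commit):

```
def pick_tokenizer(args, tgt_language):
    # The "sentence" strategy is the only consumer of the segmenter;
    # "segment" trimming runs with tokenizer=None.
    if args.buffer_trimming == "sentence":
        return create_tokenizer(tgt_language)
    return None

# both entry points would then reduce to:
# online = OnlineASRProcessor(asr, pick_tokenizer(args, tgt_language),
#                             buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec))
```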