Dominik Macháček committed on
Commit
a1ba5e6
·
1 Parent(s): 9310b4f

- fix errors
- module documented

Files changed (2)
  1. README.md +66 -12
  2. whisper_online.py +42 -15
README.md CHANGED
@@ -16,7 +16,7 @@ Alternative, less restrictive, but slower backend is [whisper-timestamped](https:
  The backend is loaded only when chosen. The unused one does not have to be installed.

- ## Usage

  ```
  usage: whisper_online.py [-h] [--min-chunk-size MIN_CHUNK_SIZE] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large}] [--model_cache_dir MODEL_CACHE_DIR] [--model_dir MODEL_DIR] [--lan LAN] [--task {transcribe,translate}]
@@ -49,11 +49,13 @@ options:
  Example:

  ```
  python3 whisper_online.py en-demo16.wav --language en --min-chunk-size 1 > out.txt
  ```

- ## Output format

  ```
  2691.4399 300 1380 Chairman, thank you.
@@ -70,27 +72,79 @@ python3 whisper_online.py en-demo16.wav --language en --min-chunk-size 1 > out.t
  [See description here](https://github.com/ufal/whisper_streaming/blob/d915d790a62d7be4e7392dde1480e7981eb142ae/whisper_online.py#L361)

- ## Background

- Default Whisper is intended for audio chunks of at most 30 seconds that contain one full sentence. Longer audio files must be split into shorter chunks and merged with "init prompt". In low-latency simultaneous streaming mode, simple and naive chunking into fixed-sized windows does not work well; it can split a word in the middle. It is also necessary to know when the transcript is stable, should be confirmed ("committed") and followed up, and when future content makes the transcript clearer.

- For that, there is the LocalAgreement-n policy: if n consecutive updates, each with a newly available audio stream chunk, agree on a prefix transcript, it is confirmed. (Reference: CUNI-KIT at IWSLT 2022 etc.)

- In this project, we re-use the idea of Peter Polák from this demo: https://github.com/pe-trik/transformers/blob/online_decode/examples/pytorch/online-decoding/whisper-online-demo.py However, it doesn't do any sentence segmentation, but Whisper produces punctuation and `whisper_timestamped` makes word-level timestamps. In short: we consecutively process new audio chunks, emit the transcripts that are confirmed by 2 iterations, and scroll the audio processing buffer on the timestamp of a confirmed complete sentence. The processing audio buffer is not too long and the processing is fast.

- In more detail: we use the init prompt, we handle the inaccurate timestamps, we re-process confirmed sentence prefixes and skip them, making sure they don't overlap, and we limit the processing buffer window.

- This project is work in progress. Contributions are welcome.

- ### Tests

- Rigorous quality and latency tests are pending.

- Small initial debugging shows that on fluent monologue speech without pauses, both the quality and latency of English and German ASR are impressive.

- Czech ASR tests show that a multi-speaker interview with pauses and disfluencies is challenging; however, parameters should be tuned.

  ## Contact
 
  The backend is loaded only when chosen. The unused one does not have to be installed.

+ ## Usage: example entry point

  ```
  usage: whisper_online.py [-h] [--min-chunk-size MIN_CHUNK_SIZE] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large}] [--model_cache_dir MODEL_CACHE_DIR] [--model_dir MODEL_DIR] [--lan LAN] [--task {transcribe,translate}]

  Example:

+ It simulates real-time processing from a pre-recorded mono 16 kHz wav file.

  ```
  python3 whisper_online.py en-demo16.wav --language en --min-chunk-size 1 > out.txt
  ```

+ ### Output format

  ```
  2691.4399 300 1380 Chairman, thank you.

  [See description here](https://github.com/ufal/whisper_streaming/blob/d915d790a62d7be4e7392dde1480e7981eb142ae/whisper_online.py#L361)
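A minimal sketch of how a downstream consumer could split such a line into the three numeric fields and the text (an illustration only, not part of the repository):

```
# each stdout line: emission time (ms), segment beginning (ms), segment end (ms), transcript text
def parse_output_line(line):
    emitted_ms, beg_ms, end_ms, text = line.split(" ", 3)
    return float(emitted_ms), float(beg_ms), float(end_ms), text

print(parse_output_line("2691.4399 300 1380 Chairman, thank you."))
# (2691.4399, 300.0, 1380.0, 'Chairman, thank you.')
```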
+ ## Usage as a module

+ TL;DR: use the OnlineASRProcessor object and its methods insert_audio_chunk and process_iter.

+ The code in whisper_online.py is thoroughly commented; read it as the full documentation.

+ This pseudocode describes the interface that we suggest for your implementation. You can implement it e.g. for audio from a microphone or stdin, a server-client connection, etc. A more concrete sketch follows after the pseudocode.
+ ```
+ from whisper_online import *
+
+ src_lan = "en"  # source language
+ tgt_lan = "en"  # target language -- same as source for ASR, "en" if translate task is used
+
+ asr = FasterWhisperASR(src_lan, "large-v2")  # loads and wraps the Whisper model
+ # set options:
+ # asr.set_translate_task()  # it will translate from src_lan into English
+ # asr.use_vad()  # set using VAD
+
+ online = OnlineASRProcessor(tgt_lan, asr)  # create processing object
+
+ while audio_has_not_ended:   # processing loop:
+     a = # receive new audio chunk (and e.g. wait for min_chunk_size seconds first, ...)
+     online.insert_audio_chunk(a)
+     o = online.process_iter()
+     print(o)  # do something with the current partial output
+ # at the end of this audio processing
+ o = online.finish()
+ print(o)  # do something with the last output
+
+ online.init()  # refresh if you're going to re-use the object for the next audio
+ ```
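A more concrete variant of the loop above, as a sketch only: it assumes `en-demo16.wav` from the example above and simulates the stream by slicing a pre-loaded 16 kHz file, similarly to what the `__main__` part of whisper_online.py does.

```
import librosa
from whisper_online import FasterWhisperASR, OnlineASRProcessor

SAMPLING_RATE = 16000
audio, _ = librosa.load("en-demo16.wav", sr=SAMPLING_RATE)  # whole file, mono, 16 kHz

asr = FasterWhisperASR("en", "large-v2")
online = OnlineASRProcessor("en", asr)

chunk = SAMPLING_RATE  # feed roughly 1 second at a time to simulate a stream
for i in range(0, len(audio), chunk):
    online.insert_audio_chunk(audio[i:i+chunk])
    print(online.process_iter())  # current confirmed (partial) output
print(online.finish())            # the remaining, unconfirmed tail
```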
+ ## Background

+ Default Whisper is intended for audio chunks of at most 30 seconds that contain one full sentence. Longer audio files must be split into shorter chunks and merged with "init prompt". In low-latency simultaneous streaming mode, simple and naive chunking into fixed-sized windows does not work well; it can split a word in the middle. It is also necessary to know when the transcript is stable, should be confirmed ("committed") and followed up, and when future content makes the transcript clearer.

+ For that, there is the LocalAgreement-n policy: if n consecutive updates, each with a newly available audio stream chunk, agree on a prefix transcript, it is confirmed. (Reference: CUNI-KIT at IWSLT 2022 etc.)
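A toy illustration of the agreement check for n=2 on plain word lists (the repository's HypothesisBuffer additionally tracks and aligns word timestamps):

```
# LocalAgreement-2: the longest common prefix of the previous and the current
# hypothesis is considered confirmed ("committed")
def agreed_prefix(prev_hypothesis, cur_hypothesis):
    out = []
    for a, b in zip(prev_hypothesis, cur_hypothesis):
        if a != b:
            break
        out.append(a)
    return out

print(agreed_prefix("chairman thank you for the".split(),
                    "chairman thank you for inviting".split()))
# ['chairman', 'thank', 'you', 'for']
```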
+ In this project, we re-use the idea of Peter Polák from this demo: https://github.com/pe-trik/transformers/blob/online_decode/examples/pytorch/online-decoding/whisper-online-demo.py However, it doesn't do any sentence segmentation, but Whisper produces punctuation and the libraries `faster-whisper` and `whisper_timestamped` make word-level timestamps. In short: we consecutively process new audio chunks, emit the transcripts that are confirmed by 2 iterations, and scroll the audio processing buffer on the timestamp of a confirmed complete sentence. The processing audio buffer is not too long and the processing is fast.

+ In more detail: we use the init prompt, we handle the inaccurate timestamps, we re-process confirmed sentence prefixes and skip them, making sure they don't overlap, and we limit the processing buffer window.
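A schematic of the buffer scrolling described above (not the actual OnlineASRProcessor code; the real class also keeps the confirmed text as a prompt for the next Whisper call):

```
import numpy as np

def scroll_buffer(audio_buffer, buffer_offset, sentence_end, sr=16000):
    # drop the audio up to the end timestamp of a confirmed complete sentence
    cut = int((sentence_end - buffer_offset) * sr)
    return audio_buffer[cut:], sentence_end  # shorter buffer, new time offset

buf = np.zeros(16000 * 20, dtype=np.float32)  # 20 s of buffered audio
buf, offset = scroll_buffer(buf, 0.0, 12.25)  # a sentence is confirmed up to 12.25 s
print(len(buf) / 16000, offset)               # 7.75 s of audio left, offset 12.25
```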
+ Contributions are welcome.

+ ### Tests

+ Rigorous quality and latency tests are pending.
  ## Contact
whisper_online.py CHANGED
@@ -158,10 +158,10 @@ class HypothesisBuffer:
  a,b,t = self.new[0]
  if abs(a - self.last_commited_time) < 1:
      if self.commited_in_buffer:
-         # it's going to search for 1, 2 or 3 consecutive words that are identical in commited and new. If they are, they're dropped.
          cn = len(self.commited_in_buffer)
          nn = len(self.new)
-         for i in range(1,min(min(cn,nn),5)+1):
              c = " ".join([self.commited_in_buffer[-j][2] for j in range(1,i+1)][::-1])
              tail = " ".join(self.new[j-1][2] for j in range(1,i+1))
              if c == tail:
@@ -204,20 +204,17 @@ class OnlineASRProcessor:
  SAMPLING_RATE = 16000

- def __init__(self, language, asr, chunk):
-     """language: lang. code
      asr: WhisperASR object
      chunk: number of seconds for intended size of audio interval that is inserted and looped
      """
      self.language = language
      self.asr = asr
-     self.tokenizer = MosesTokenizer("en")

      self.init()

-     self.chunk = chunk
-
-
  def init(self):
      """run this when starting or restarting processing"""
      self.audio_buffer = np.array([],dtype=np.float32)
@@ -436,9 +433,14 @@ if __name__ == "__main__":
  parser.add_argument('--start_at', type=float, default=0.0, help='Start processing audio at this time.')
  parser.add_argument('--backend', type=str, default="faster-whisper", choices=["faster-whisper", "whisper_timestamped"],help='Load only this backend for Whisper processing.')
  parser.add_argument('--offline', action="store_true", default=False, help='Offline mode.')
  parser.add_argument('--vad', action="store_true", default=False, help='Use VAD = voice activity detection, with the default parameters.')
  args = parser.parse_args()

  audio_path = args.audio_path

  SAMPLING_RATE = 16000
@@ -465,6 +467,9 @@ if __name__ == "__main__":
  if args.task == "translate":
      asr.set_translate_task()

  e = time.time()
@@ -475,7 +480,7 @@ if __name__ == "__main__":
      asr.use_vad()

  min_chunk = args.min_chunk_size
- online = OnlineASRProcessor(language,asr,min_chunk)

  # load the audio into the LRU cache before we start the timer
@@ -487,14 +492,15 @@ if __name__ == "__main__":
  beg = args.start_at
  start = time.time()-beg

- def output_transcript(o):
      # output format in stdout is like:
      # 4186.3606 0 1720 Takhle to je
      # - the first three words are:
      # - emission time from beginning of processing, in milliseconds
      # - beg and end timestamp of the text segment, as estimated by Whisper model. The timestamps are not accurate, but they're useful anyway
      # - the next words: segment transcript
-     now = time.time()-start
      if o[0] is not None:
          print("%1.4f %1.0f %1.0f %s" % (now*1000, o[0]*1000,o[1]*1000,o[2]),file=sys.stderr,flush=True)
          print("%1.4f %1.0f %1.0f %s" % (now*1000, o[0]*1000,o[1]*1000,o[2]),flush=True)
@@ -511,6 +517,28 @@ if __name__ == "__main__":
          pass
      else:
          output_transcript(o)

  else: # online = simultaneous mode
      end = 0
      while True:
@@ -530,12 +558,11 @@ if __name__ == "__main__":
          else:
              output_transcript(o)
          now = time.time() - start
-         print(f"## last processed {end:.2f} s, now is {now:.2f}, the latency is {now-end:.2f}",file=sys.stderr)
-
-         print(file=sys.stderr,flush=True)

          if end >= duration:
              break

  o = online.finish()
- output_transcript(o)
  a,b,t = self.new[0]
  if abs(a - self.last_commited_time) < 1:
      if self.commited_in_buffer:
+         # it's going to search for 1, 2, ..., 5 consecutive words (n-grams) that are identical in commited and new. If they are, they're dropped.
          cn = len(self.commited_in_buffer)
          nn = len(self.new)
+         for i in range(1,min(min(cn,nn),5)+1):  # 5 is the maximum
              c = " ".join([self.commited_in_buffer[-j][2] for j in range(1,i+1)][::-1])
              tail = " ".join(self.new[j-1][2] for j in range(1,i+1))
              if c == tail:
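A standalone sketch of what this check does (simplified; the method above also requires the new words to be close in time to the last committed word, and it drops the repeated words from self.new in place):

```
def drop_duplicate_prefix(commited_words, new_words, max_n=5):
    # if the last i committed words equal the first i new words (i = 1..max_n),
    # the repeated prefix of the new hypothesis is dropped
    for i in range(1, min(len(commited_words), len(new_words), max_n) + 1):
        if commited_words[-i:] == new_words[:i]:
            return new_words[i:]
    return new_words

print(drop_duplicate_prefix(["chairman,", "thank", "you"],
                            ["thank", "you", "for", "inviting"]))
# ['for', 'inviting']
```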
 
  SAMPLING_RATE = 16000

+ def __init__(self, language, asr):
+     """language: lang. code that MosesTokenizer uses for sentence segmentation
      asr: WhisperASR object
      chunk: number of seconds for intended size of audio interval that is inserted and looped
      """
      self.language = language
      self.asr = asr
+     self.tokenizer = MosesTokenizer(self.language)

      self.init()

  def init(self):
      """run this when starting or restarting processing"""
      self.audio_buffer = np.array([],dtype=np.float32)
 
  parser.add_argument('--start_at', type=float, default=0.0, help='Start processing audio at this time.')
  parser.add_argument('--backend', type=str, default="faster-whisper", choices=["faster-whisper", "whisper_timestamped"],help='Load only this backend for Whisper processing.')
  parser.add_argument('--offline', action="store_true", default=False, help='Offline mode.')
+ parser.add_argument('--comp_unaware', action="store_true", default=False, help='Computationally unaware simulation.')
  parser.add_argument('--vad', action="store_true", default=False, help='Use VAD = voice activity detection, with the default parameters.')
  args = parser.parse_args()

+ if args.offline and args.comp_unaware:
+     print("No or one option from --offline and --comp_unaware are available, not both. Exiting.",file=sys.stderr)
+     sys.exit(1)

  audio_path = args.audio_path

  SAMPLING_RATE = 16000
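For instance, the new flag can presumably be combined with the example invocation from the README:

```
python3 whisper_online.py en-demo16.wav --language en --min-chunk-size 1 --comp_unaware > out.txt
```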
 
  if args.task == "translate":
      asr.set_translate_task()
+     tgt_language = "en"  # Whisper translates into English
+ else:
+     tgt_language = language  # Whisper transcribes in this language

  e = time.time()
      asr.use_vad()

  min_chunk = args.min_chunk_size
+ online = OnlineASRProcessor(tgt_language,asr)

  # load the audio into the LRU cache before we start the timer
  beg = args.start_at
  start = time.time()-beg

+ def output_transcript(o, now=None):
      # output format in stdout is like:
      # 4186.3606 0 1720 Takhle to je
      # - the first three words are:
      # - emission time from beginning of processing, in milliseconds
      # - beg and end timestamp of the text segment, as estimated by Whisper model. The timestamps are not accurate, but they're useful anyway
      # - the next words: segment transcript
+     if now is None:
+         now = time.time()-start
      if o[0] is not None:
          print("%1.4f %1.0f %1.0f %s" % (now*1000, o[0]*1000,o[1]*1000,o[2]),file=sys.stderr,flush=True)
          print("%1.4f %1.0f %1.0f %s" % (now*1000, o[0]*1000,o[1]*1000,o[2]),flush=True)
          pass
      else:
          output_transcript(o)
+     now = None
+ elif args.comp_unaware: # computational unaware mode
+     end = beg + min_chunk
+     while True:
+         a = load_audio_chunk(audio_path,beg,end)
+         online.insert_audio_chunk(a)
+         try:
+             o = online.process_iter()
+         except AssertionError:
+             print("assertion error",file=sys.stderr)
+             pass
+         else:
+             output_transcript(o, now=end)
+
+         print(f"## last processed {end:.2f}s",file=sys.stderr,flush=True)
+
+         beg = end
+         end += min_chunk
+         if end >= duration:
+             break
+     now = duration
+
  else: # online = simultaneous mode
      end = 0
      while True:
 
          else:
              output_transcript(o)
          now = time.time() - start
+         print(f"## last processed {end:.2f} s, now is {now:.2f}, the latency is {now-end:.2f}",file=sys.stderr,flush=True)

          if end >= duration:
              break
+     now = None

  o = online.finish()
+ output_transcript(o, now=now)