Dominik Macháček
committed on
Commit · a1ba5e6
1 Parent(s): 9310b4f

updates:
- fix errors
- module documented

- README.md +66 -12
- whisper_online.py +42 -15
README.md
CHANGED
@@ -16,7 +16,7 @@ Alternative, less restrictive, but slower backend is [whisper-timestamped](https:

The backend is loaded only when chosen. The unused one does not have to be installed.

-## Usage

```
usage: whisper_online.py [-h] [--min-chunk-size MIN_CHUNK_SIZE] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large}] [--model_cache_dir MODEL_CACHE_DIR] [--model_dir MODEL_DIR] [--lan LAN] [--task {transcribe,translate}]
```
@@ -49,11 +49,13 @@ options:

Example:

```
python3 whisper_online.py en-demo16.wav --language en --min-chunk-size 1 > out.txt
```

-

```
2691.4399 300 1380 Chairman, thank you.
```
@@ -70,27 +72,79 @@ python3 whisper_online.py en-demo16.wav --language en --min-chunk-size 1 > out.txt

[See description here](https://github.com/ufal/whisper_streaming/blob/d915d790a62d7be4e7392dde1480e7981eb142ae/whisper_online.py#L361)

-
-Default Whisper is intended for audio chunks of at most 30 seconds that contain one full sentence. Longer audio files must be split into shorter chunks and merged with "init prompt". In low-latency simultaneous streaming mode, simple and naive chunking into fixed-sized windows does not work well; it can split a word in the middle. It is also necessary to know when the transcript is stable, should be confirmed ("commited") and followed up, and when the future content makes the transcript clearer.
-
-
-
-This project is work in progress. Contributions are welcome.
-
-Rigorous quality and latency tests are pending.
-
-

## Contact

The backend is loaded only when chosen. The unused one does not have to be installed.

+## Usage: example entry point

```
usage: whisper_online.py [-h] [--min-chunk-size MIN_CHUNK_SIZE] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large}] [--model_cache_dir MODEL_CACHE_DIR] [--model_dir MODEL_DIR] [--lan LAN] [--task {transcribe,translate}]
```

Example:

+It simulates realtime processing from a pre-recorded mono 16k wav file.
+
```
python3 whisper_online.py en-demo16.wav --language en --min-chunk-size 1 > out.txt
```

+### Output format

```
2691.4399 300 1380 Chairman, thank you.
```

[See description here](https://github.com/ufal/whisper_streaming/blob/d915d790a62d7be4e7392dde1480e7981eb142ae/whisper_online.py#L361)
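
Each output line carries the emission time measured from the start of processing, then the segment's start and end as estimated by Whisper (all in milliseconds), then the transcript (see the output_transcript comments in whisper_online.py below). A minimal sketch of parsing one such line; the helper name is only illustrative:

```
def parse_output_line(line):
    # e.g. "2691.4399 300 1380 Chairman, thank you."
    emission_ms, beg_ms, end_ms, text = line.split(" ", 3)
    return float(emission_ms), float(beg_ms), float(end_ms), text

emission, beg, end, text = parse_output_line("2691.4399 300 1380 Chairman, thank you.")
print(f"{text!r}: {beg/1000:.2f}-{end/1000:.2f} s, emitted at {emission/1000:.2f} s")
```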

+## Usage as a module

+TL;DR: use the OnlineASRProcessor object and its methods insert_audio_chunk and process_iter.

+The code whisper_online.py is nicely commented; read it as the full documentation.

+This pseudocode describes the interface that we suggest for your implementation. You can implement it e.g. for audio from a microphone or stdin, a server-client connection, etc.

+```
+from whisper_online import *
+
+src_lan = "en"  # source language
+tgt_lan = "en"  # target language -- same as source for ASR, "en" if translate task is used
+
+asr = FasterWhisperASR(src_lan, "large-v2")  # loads and wraps Whisper model
+# set options:
+# asr.set_translate_task()  # it will translate from src_lan into English
+# asr.use_vad()  # set using VAD
+
+online = OnlineASRProcessor(tgt_lan, asr)  # create processing object
+
+while audio_has_not_ended:  # processing loop:
+    a = # receive new audio chunk (and e.g. wait for min_chunk_size seconds first, ...)
+    online.insert_audio_chunk(a)
+    o = online.process_iter()
+    print(o)  # do something with current partial output
+
+# at the end of this audio processing
+o = online.finish()
+print(o)  # do something with the last output
+
+online.init()  # refresh if you're going to re-use the object for the next audio
+```
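
A slightly more concrete sketch of the same loop, simulating a stream by slicing a pre-recorded 16 kHz mono wav file into roughly one-second pieces. The soundfile dependency and the chunk size here are illustrative choices, not requirements of the module:

```
import soundfile as sf  # illustrative; any way of getting float32 16 kHz mono audio works
from whisper_online import FasterWhisperASR, OnlineASRProcessor

src_lan = tgt_lan = "en"                      # plain transcription: source == target
asr = FasterWhisperASR(src_lan, "large-v2")   # loads and wraps the Whisper model
online = OnlineASRProcessor(tgt_lan, asr)

audio, sr = sf.read("en-demo16.wav", dtype="float32")  # expects 16 kHz mono, as above
step = sr                                     # feed about one second per iteration
for i in range(0, len(audio), step):
    online.insert_audio_chunk(audio[i:i + step])
    o = online.process_iter()                 # (beg, end, text); beg is None until something is confirmed
    if o[0] is not None:
        print(o)

o = online.finish()                           # flush the last, still unconfirmed part
if o[0] is not None:
    print(o)

online.init()                                 # reset before re-using the object on other audio
```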

+## Background

+Default Whisper is intended for audio chunks of at most 30 seconds that contain one full sentence. Longer audio files must be split into shorter chunks and merged with "init prompt". In low-latency simultaneous streaming mode, simple and naive chunking into fixed-sized windows does not work well; it can split a word in the middle. It is also necessary to know when the transcript is stable, should be confirmed ("commited") and followed up, and when the future content makes the transcript clearer.

+For that, there is the LocalAgreement-n policy: if n consecutive updates, each with a newly available audio stream chunk, agree on a prefix transcript, it is confirmed. (Reference: CUNI-KIT at IWSLT 2022 etc.)
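
With n = 2, the policy amounts to taking the longest common prefix of the two most recent hypotheses. A minimal sketch on plain word lists; the real HypothesisBuffer additionally keeps per-word timestamps:

```
def agreed_prefix(prev_hypothesis, new_hypothesis):
    """LocalAgreement-2: a word is confirmed once two consecutive
    hypotheses, each computed on a longer audio prefix, agree on it."""
    confirmed = []
    for prev_word, new_word in zip(prev_hypothesis, new_hypothesis):
        if prev_word != new_word:
            break
        confirmed.append(new_word)
    return confirmed

# update 1 heard "chairman thank"; update 2 received more audio and extended it:
print(agreed_prefix("chairman thank".split(), "chairman thank you".split()))
# ['chairman', 'thank'] -- "you" stays unconfirmed until the next update agrees on it
```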

+In this project, we re-use the idea of Peter Polák from this demo:
+https://github.com/pe-trik/transformers/blob/online_decode/examples/pytorch/online-decoding/whisper-online-demo.py
+However, it doesn't do any sentence segmentation, but Whisper produces punctuation, and the libraries `faster-whisper` and `whisper_timestamped` make word-level timestamps. In short: we consecutively process new audio chunks, emit the transcripts that are confirmed by 2 iterations, and scroll the audio processing buffer on a timestamp of a confirmed complete sentence. The processing audio buffer is not too long and the processing is fast.

+In more detail: we use the init prompt, we handle the inaccurate timestamps, we re-process confirmed sentence prefixes and skip them, making sure they don't overlap, and we limit the processing buffer window.
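
A rough sketch of the buffer scrolling described above: once a complete sentence is confirmed, the audio up to its end timestamp can be dropped. The names here are illustrative; the actual logic lives in OnlineASRProcessor:

```
import numpy as np

SAMPLING_RATE = 16000

def scroll_buffer(audio_buffer, buffer_time_offset, sentence_end_time):
    """Drop audio up to the end of a confirmed, complete sentence.
    Timestamps are in seconds relative to the whole stream."""
    cut = int((sentence_end_time - buffer_time_offset) * SAMPLING_RATE)
    return audio_buffer[cut:], sentence_end_time

buf = np.zeros(10 * SAMPLING_RATE, dtype=np.float32)   # 10 s of buffered audio
buf, offset = scroll_buffer(buf, buffer_time_offset=0.0, sentence_end_time=4.2)
print(len(buf) / SAMPLING_RATE, offset)                # 5.8 s of audio remain, new offset is 4.2 s
```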

+Contributions are welcome.

+### Tests

+Rigorous quality and latency tests are pending.

## Contact

whisper_online.py
CHANGED
@@ -158,10 +158,10 @@ class HypothesisBuffer:
        a,b,t = self.new[0]
        if abs(a - self.last_commited_time) < 1:
            if self.commited_in_buffer:
-               # it's going to search for 1, 2
                cn = len(self.commited_in_buffer)
                nn = len(self.new)
-               for i in range(1,min(min(cn,nn),5)+1):
                    c = " ".join([self.commited_in_buffer[-j][2] for j in range(1,i+1)][::-1])
                    tail = " ".join(self.new[j-1][2] for j in range(1,i+1))
                    if c == tail:

@@ -204,20 +204,17 @@ class OnlineASRProcessor:

    SAMPLING_RATE = 16000

-   def __init__(self, language, asr
-       """language: lang. code
        asr: WhisperASR object
        chunk: number of seconds for intended size of audio interval that is inserted and looped
        """
        self.language = language
        self.asr = asr
-       self.tokenizer = MosesTokenizer(

        self.init()

-       self.chunk = chunk
-
-
    def init(self):
        """run this when starting or restarting processing"""
        self.audio_buffer = np.array([],dtype=np.float32)
@@ -436,9 +433,14 @@ if __name__ == "__main__":
    parser.add_argument('--start_at', type=float, default=0.0, help='Start processing audio at this time.')
    parser.add_argument('--backend', type=str, default="faster-whisper", choices=["faster-whisper", "whisper_timestamped"],help='Load only this backend for Whisper processing.')
    parser.add_argument('--offline', action="store_true", default=False, help='Offline mode.')
    parser.add_argument('--vad', action="store_true", default=False, help='Use VAD = voice activity detection, with the default parameters.')
    args = parser.parse_args()

    audio_path = args.audio_path

    SAMPLING_RATE = 16000

@@ -465,6 +467,9 @@

    if args.task == "translate":
        asr.set_translate_task()


    e = time.time()
@@ -475,7 +480,7 @@
        asr.use_vad()

    min_chunk = args.min_chunk_size
-   online = OnlineASRProcessor(


    # load the audio into the LRU cache before we start the timer
@@ -487,14 +492,15 @@
    beg = args.start_at
    start = time.time()-beg

-   def output_transcript(o):
        # output format in stdout is like:
        # 4186.3606 0 1720 Takhle to je
        # - the first three words are:
        # - emission time from beginning of processing, in milliseconds
        # - beg and end timestamp of the text segment, as estimated by Whisper model. The timestamps are not accurate, but they're useful anyway
        # - the next words: segment transcript
-       now
        if o[0] is not None:
            print("%1.4f %1.0f %1.0f %s" % (now*1000, o[0]*1000,o[1]*1000,o[2]),file=sys.stderr,flush=True)
            print("%1.4f %1.0f %1.0f %s" % (now*1000, o[0]*1000,o[1]*1000,o[2]),flush=True)
@@ -511,6 +517,28 @@
            pass
        else:
            output_transcript(o)
    else: # online = simultaneous mode
        end = 0
        while True:
@@ -530,12 +558,11 @@
            else:
                output_transcript(o)
            now = time.time() - start
-           print(f"## last processed {end:.2f} s, now is {now:.2f}, the latency is {now-end:.2f}",file=sys.stderr)
-
-           print(file=sys.stderr,flush=True)

            if end >= duration:
                break

    o = online.finish()
-   output_transcript(o)

        a,b,t = self.new[0]
        if abs(a - self.last_commited_time) < 1:
            if self.commited_in_buffer:
+               # it's going to search for 1, 2, ..., 5 consecutive words (n-grams) that are identical in commited and new. If they are, they're dropped.
                cn = len(self.commited_in_buffer)
                nn = len(self.new)
+               for i in range(1,min(min(cn,nn),5)+1):  # 5 is the maximum
                    c = " ".join([self.commited_in_buffer[-j][2] for j in range(1,i+1)][::-1])
                    tail = " ".join(self.new[j-1][2] for j in range(1,i+1))
                    if c == tail:
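
The new comment describes the overlap check: the last one to five committed words are compared with the first words of the new hypothesis, and a repeated n-gram at the boundary is dropped. A standalone illustration of that comparison; the actual method also removes the matched words from self.new:

```
commited_in_buffer = [(0.0, 0.4, "thank"), (0.4, 0.8, "you")]              # (beg, end, word)
new = [(0.5, 0.9, "thank"), (0.9, 1.3, "you"), (1.3, 1.8, "chairman")]

for i in range(1, min(min(len(commited_in_buffer), len(new)), 5) + 1):
    c = " ".join([commited_in_buffer[-j][2] for j in range(1, i + 1)][::-1])  # last i committed words
    tail = " ".join(new[j - 1][2] for j in range(1, i + 1))                   # first i new words
    if c == tail:
        new = new[i:]        # the boundary n-gram was already committed, drop it
        break

print(new)                   # [(1.3, 1.8, 'chairman')]
```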

    SAMPLING_RATE = 16000

+   def __init__(self, language, asr):
+       """language: lang. code that MosesTokenizer uses for sentence segmentation
        asr: WhisperASR object
        chunk: number of seconds for intended size of audio interval that is inserted and looped
        """
        self.language = language
        self.asr = asr
+       self.tokenizer = MosesTokenizer(self.language)

        self.init()

    def init(self):
        """run this when starting or restarting processing"""
        self.audio_buffer = np.array([],dtype=np.float32)

    parser.add_argument('--start_at', type=float, default=0.0, help='Start processing audio at this time.')
    parser.add_argument('--backend', type=str, default="faster-whisper", choices=["faster-whisper", "whisper_timestamped"],help='Load only this backend for Whisper processing.')
    parser.add_argument('--offline', action="store_true", default=False, help='Offline mode.')
+   parser.add_argument('--comp_unaware', action="store_true", default=False, help='Computationally unaware simulation.')
    parser.add_argument('--vad', action="store_true", default=False, help='Use VAD = voice activity detection, with the default parameters.')
    args = parser.parse_args()

+   if args.offline and args.comp_unaware:
+       print("No or one option from --offline and --comp_unaware are available, not both. Exiting.",file=sys.stderr)
+       sys.exit(1)
+
    audio_path = args.audio_path

    SAMPLING_RATE = 16000
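
Combined with the existing options, the new computationally unaware simulation can be run on the demo file in the same way as the README example above:

```
python3 whisper_online.py en-demo16.wav --language en --min-chunk-size 1 --comp_unaware > out.txt
```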

    if args.task == "translate":
        asr.set_translate_task()
+       tgt_language = "en"  # Whisper translates into English
+   else:
+       tgt_language = language  # Whisper transcribes in this language

    e = time.time()

        asr.use_vad()

    min_chunk = args.min_chunk_size
+   online = OnlineASRProcessor(tgt_language,asr)

    # load the audio into the LRU cache before we start the timer

    beg = args.start_at
    start = time.time()-beg

+   def output_transcript(o, now=None):
        # output format in stdout is like:
        # 4186.3606 0 1720 Takhle to je
        # - the first three words are:
        # - emission time from beginning of processing, in milliseconds
        # - beg and end timestamp of the text segment, as estimated by Whisper model. The timestamps are not accurate, but they're useful anyway
        # - the next words: segment transcript
+       if now is None:
+           now = time.time()-start
        if o[0] is not None:
            print("%1.4f %1.0f %1.0f %s" % (now*1000, o[0]*1000,o[1]*1000,o[2]),file=sys.stderr,flush=True)
            print("%1.4f %1.0f %1.0f %s" % (now*1000, o[0]*1000,o[1]*1000,o[2]),flush=True)

            pass
        else:
            output_transcript(o)
+       now = None
+   elif args.comp_unaware: # computational unaware mode
+       end = beg + min_chunk
+       while True:
+           a = load_audio_chunk(audio_path,beg,end)
+           online.insert_audio_chunk(a)
+           try:
+               o = online.process_iter()
+           except AssertionError:
+               print("assertion error",file=sys.stderr)
+               pass
+           else:
+               output_transcript(o, now=end)
+
+           print(f"## last processed {end:.2f}s",file=sys.stderr,flush=True)
+
+           beg = end
+           end += min_chunk
+           if end >= duration:
+               break
+       now = duration
+
    else: # online = simultaneous mode
        end = 0
        while True:

            else:
                output_transcript(o)
            now = time.time() - start
+           print(f"## last processed {end:.2f} s, now is {now:.2f}, the latency is {now-end:.2f}",file=sys.stderr,flush=True)

            if end >= duration:
                break
+       now = None

    o = online.finish()
+   output_transcript(o, now=now)