Dominik Macháček committed

Commit bb93952 · 2 Parent(s): bccbb15 ce215e6

Merge branch 'main' into online-from-factory

Files changed (3):
1. README.md +15 -9
2. whisper_online.py +1 -1
3. whisper_online_server.py +19 -10
README.md CHANGED
@@ -3,14 +3,12 @@ Whisper realtime streaming for long speech-to-text transcription and translation
 
 **Turning Whisper into Real-Time Transcription System**
 
-Demonstration paper, by Dominik Macháček, Raj Dabre, Ondřej Bojar, 2023
+Demonstration paper, by [Dominik Macháček](https://ufal.mff.cuni.cz/dominik-machacek), [Raj Dabre](https://prajdabre.github.io/), [Ondřej Bojar](https://ufal.mff.cuni.cz/ondrej-bojar), 2023
 
-Abstract: Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models, however, it is not designed for real time transcription. In this paper, we build on top of Whisper and create Whisper-Streaming, an implementation of real-time speech transcription and translation of Whisper-like models. Whisper-Streaming uses local agreement policy with self-adaptive latency to enable streaming transcription. We show that Whisper-Streaming achieves high quality and 3.3 seconds latency on unsegmented long-form speech transcription test set, and we demonstrate its robustness and practical usability as a component in live transcription service at a multilingual conference.
+Abstract: Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models, however, it is not designed for real-time transcription. In this paper, we build on top of Whisper and create Whisper-Streaming, an implementation of real-time speech transcription and translation of Whisper-like models. Whisper-Streaming uses local agreement policy with self-adaptive latency to enable streaming transcription. We show that Whisper-Streaming achieves high quality and 3.3 seconds latency on unsegmented long-form speech transcription test set, and we demonstrate its robustness and practical usability as a component in live transcription service at a multilingual conference.
 
-Paper in proceedings: http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-demo/cdrom/pdf/2023.ijcnlp-demo.3.pdf
-
-Demo video: https://player.vimeo.com/video/840442741
+[Paper PDF](https://aclanthology.org/2023.ijcnlp-demo.3.pdf), [Demo video](https://player.vimeo.com/video/840442741)
 
 [Slides](http://ufallab.ms.mff.cuni.cz/~machacek/pre-prints/AACL23-2.11.2023-Turning-Whisper-oral.pdf) -- 15 minutes oral presentation at IJCNLP-AACL 2023
 
@@ -157,7 +155,7 @@ The code whisper_online.py is nicely commented, read it as the full documentation
 
 This pseudocode describes the interface that we suggest for your implementation. You can implement any features that you need for your application.
 
-```
+```python
 from whisper_online import *
 
 src_lan = "en"  # source language
@@ -185,7 +183,7 @@ online.init() # refresh if you're going to re-use the object for the next audio
 
 ### Server -- real-time from mic
 
-`whisper_online_server.py` has the same model options as `whisper_online.py`, plus `--host` and `--port` of the TCP connection. See help message (`-h` option).
+`whisper_online_server.py` has the same model options as `whisper_online.py`, plus `--host` and `--port` of the TCP connection and the `--warmup-file`. See the help message (`-h` option).
 
 Client example:
 
@@ -226,12 +224,20 @@ In more detail: we use the init prompt, we handle the inaccurate timestamps, we
 re-process confirmed sentence prefixes and skip them, making sure they don't
 overlap, and we limit the processing buffer window.
 
-Contributions are welcome.
-
 ### Performance evaluation
 
 [See the paper.](http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-demo/cdrom/pdf/2023.ijcnlp-demo.3.pdf)
 
+### Contributions
+
+Contributions are welcome. We acknowledge especially:
+
+- [The GitHub contributors](https://github.com/ufal/whisper_streaming/graphs/contributors) for their pull requests with new features and bugfixes.
+- [The translation of this repo into Chinese.](https://github.com/Gloridust/whisper_streaming_CN)
+- [Ondřej Plátek](https://opla.cz/) for the paper pre-review.
+- [Peter Polák](https://ufal.mff.cuni.cz/peter-polak) for the original idea.
+- The UEDIN team of the [ELITR project](https://elitr.eu) for the original line_packet.py.
+
 
 ## Contact
 
 
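The suggested-interface pseudocode in the README hunk above is shown only in part by this diff. As a self-contained toy sketch of the same streaming-interface shape — the class below is a stand-in, not the real `OnlineASRProcessor`, and the method names (`insert_audio_chunk`, `process_iter`, `finish`) are assumptions based on the repository's public README:

```python
class ToyOnlineProcessor:
    """Toy mirror of the streaming interface: feed audio incrementally,
    read back only newly 'confirmed' text, flush the rest at the end."""
    def __init__(self):
        self.buffer = []
        self.confirmed = []

    def insert_audio_chunk(self, chunk):
        self.buffer.extend(chunk)

    def process_iter(self):
        # Pretend every 2 buffered samples yield one confirmed word.
        out = []
        while len(self.buffer) >= 2:
            self.buffer = self.buffer[2:]
            out.append("word%d" % (len(self.confirmed) + len(out)))
        self.confirmed.extend(out)
        return " ".join(out)

    def finish(self):
        # Flush whatever is still unconfirmed in the buffer.
        out = "tail" if self.buffer else ""
        self.buffer = []
        return out

online = ToyOnlineProcessor()
online.insert_audio_chunk([0.1, 0.2, 0.3])
print(online.process_iter())   # prints: word0
print(online.finish())         # prints: tail
```

The real processor confirms text by local agreement between consecutive model outputs; the toy merely simulates incremental confirmation to show the call pattern.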
whisper_online.py CHANGED
@@ -626,7 +626,7 @@ if __name__ == "__main__":
     # load the audio into the LRU cache before we start the timer
     a = load_audio_chunk(audio_path,0,1)
 
-    # warm up the ASR, because the very first transcribe takes much more time than the other
+    # warm up the ASR because the very first transcribe takes much more time than the other
     asr.transcribe(a)
 
  beg = args.start_at
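Both warm-up hunks in this commit exist because the very first `transcribe` call pays a one-time initialization cost. A minimal, self-contained sketch with a stand-in class (not the real Whisper backend) showing why warming up before the timed loop matters:

```python
import time

class LazyASR:
    """Stand-in for a Whisper backend: the first call pays a one-time
    initialization cost (model load, kernel compilation); later calls are fast."""
    def __init__(self):
        self._initialized = False

    def transcribe(self, audio):
        if not self._initialized:
            time.sleep(0.2)  # simulated one-time setup cost
            self._initialized = True
        return "transcript"

asr = LazyASR()

t0 = time.perf_counter()
asr.transcribe([0.0])          # cold call: slow, pays the setup cost
cold = time.perf_counter() - t0

t0 = time.perf_counter()
asr.transcribe([0.0])          # warm call: fast
warm = time.perf_counter() - t0

print(cold > warm)             # warming up first keeps later latency low
```

Warming up on a short dummy chunk moves that cost outside the latency-sensitive loop, which is exactly what both files do here.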
whisper_online_server.py CHANGED
@@ -10,6 +10,8 @@ parser = argparse.ArgumentParser()
 # server options
 parser.add_argument("--host", type=str, default='localhost')
 parser.add_argument("--port", type=int, default=43007)
+parser.add_argument("--warmup-file", type=str, dest="warmup_file",
+        help="The path to a speech audio wav file to warm up Whisper so that the very first chunk processing is fast. It can be e.g. https://github.com/ggerganov/whisper.cpp/raw/master/samples/jfk.wav .")
 
 
 # options from whisper_online
@@ -26,18 +28,25 @@ language = args.lan
 asr, online = asr_factory(args)
 min_chunk = args.min_chunk_size
 
-demo_audio_path = "cs-maji-2.16k.wav"
-if os.path.exists(demo_audio_path):
-    # load the audio into the LRU cache before we start the timer
-    a = load_audio_chunk(demo_audio_path,0,1)
 
-    # TODO: it should be tested whether it's meaningful
-    # warm up the ASR, because the very first transcribe takes much more time than the other
-    asr.transcribe(a)
+if args.buffer_trimming == "sentence":
+    tokenizer = create_tokenizer(tgt_language)
 else:
-    print("Whisper is not warmed up",file=sys.stderr)
-
-
+    tokenizer = None
+online = OnlineASRProcessor(asr,tokenizer,buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec))
+
+# warm up the ASR because the very first transcribe takes more time than the others.
+# Test results in https://github.com/ufal/whisper_streaming/pull/81
+msg = "Whisper is not warmed up. The first chunk processing may take longer."
+if args.warmup_file:
+    if os.path.isfile(args.warmup_file):
+        a = load_audio_chunk(args.warmup_file,0,1)
+        asr.transcribe(a)
+        print("INFO: Whisper is warmed up.",file=sys.stderr)
+    else:
+        print("WARNING: The warm up file is not available. "+msg,file=sys.stderr)
+else:
+    print("WARNING: " + msg, file=sys.stderr)
 
 
  ######### Server objects
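The server above is reached over the `--host`/`--port` TCP socket. A hedged sketch of the client side: the dummy server below is a stand-in for `whisper_online_server.py` (which actually returns transcript lines), and the raw-bytes chunking is an assumption for illustration; the repository's README documents the real client and audio format.

```python
import socket
import threading

def run_dummy_server(host="127.0.0.1"):
    """Stand-in for whisper_online_server.py: accepts one connection,
    reads raw audio bytes, and answers each chunk with one text line."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))          # port 0: let the OS pick a free port
    srv.listen(1)

    def serve():
        conn, _ = srv.accept()
        with conn:
            while True:
                data = conn.recv(4096)
                if not data:
                    break
                conn.sendall(b"received %d bytes\n" % len(data))

    threading.Thread(target=serve, daemon=True).start()
    return srv.getsockname()[1], srv

def stream_audio(host, port, chunks):
    """Client side: send audio chunks over TCP, read back one line per chunk."""
    lines = []
    with socket.create_connection((host, port)) as sock:
        reader = sock.makefile("rb")
        for chunk in chunks:
            sock.sendall(chunk)
            lines.append(reader.readline().decode().strip())
    return lines

port, srv = run_dummy_server()
print(stream_audio("127.0.0.1", port, [b"\x00" * 4, b"\x00" * 6]))
srv.close()
```

The lockstep send-then-read loop keeps the example deterministic; a real client would instead write audio continuously and read transcript lines as they arrive.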