qfuxa committed
Commit 8c6c010 · 1 Parent(s): 494b6e3

Update README.md

Files changed (1):
  1. README.md +29 -257
README.md CHANGED
@@ -1,218 +1,27 @@
- # whisper_streaming
- Whisper realtime streaming for long speech-to-text transcription and translation
-
- **Turning Whisper into Real-Time Transcription System**
-
- Demonstration paper, by [Dominik Macháček](https://ufal.mff.cuni.cz/dominik-machacek), [Raj Dabre](https://prajdabre.github.io/), [Ondřej Bojar](https://ufal.mff.cuni.cz/ondrej-bojar), 2023
-
- Abstract: Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models, however, it is not designed for real-time transcription. In this paper, we build on top of Whisper and create Whisper-Streaming, an implementation of real-time speech transcription and translation of Whisper-like models. Whisper-Streaming uses local agreement policy with self-adaptive latency to enable streaming transcription. We show that Whisper-Streaming achieves high quality and 3.3 seconds latency on unsegmented long-form speech transcription test set, and we demonstrate its robustness and practical usability as a component in live transcription service at a multilingual conference.
-
- [Paper PDF](https://aclanthology.org/2023.ijcnlp-demo.3.pdf), [Demo video](https://player.vimeo.com/video/840442741)
-
- [Slides](http://ufallab.ms.mff.cuni.cz/~machacek/pre-prints/AACL23-2.11.2023-Turning-Whisper-oral.pdf) -- 15 minutes oral presentation at IJCNLP-AACL 2023
-
- Please, cite us. [ACL Anthology](https://aclanthology.org/2023.ijcnlp-demo.3/), [Bibtex citation](https://aclanthology.org/2023.ijcnlp-demo.3.bib):
-
- ```
- @inproceedings{machacek-etal-2023-turning,
-     title = "Turning Whisper into Real-Time Transcription System",
-     author = "Mach{\'a}{\v{c}}ek, Dominik and
-       Dabre, Raj and
-       Bojar, Ond{\v{r}}ej",
-     editor = "Saha, Sriparna and
-       Sujaini, Herry",
-     booktitle = "Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: System Demonstrations",
-     month = nov,
-     year = "2023",
-     address = "Bali, Indonesia",
-     publisher = "Association for Computational Linguistics",
-     url = "https://aclanthology.org/2023.ijcnlp-demo.3",
-     pages = "17--24",
- }
- ```

  ## Installation

- 1) ``pip install librosa soundfile`` -- audio processing library
-
- 2) Whisper backend.
-
- Several alternative backends are integrated. The most recommended one is [faster-whisper](https://github.com/guillaumekln/faster-whisper) with GPU support. Follow their instructions for the NVIDIA libraries -- we succeeded with CUDNN 8.5.0 and CUDA 11.7. Install with `pip install faster-whisper`.
-
- An alternative, less restrictive but slower backend is [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped): `pip install git+https://github.com/linto-ai/whisper-timestamped`
-
- Thirdly, it's also possible to run this software with the [OpenAI Whisper API](https://platform.openai.com/docs/api-reference/audio/createTranscription). This solution is fast and requires no GPU -- a small VM will suffice -- but you will need to pay OpenAI for API access. Also note that, since each audio fragment is processed multiple times, the [price](https://openai.com/pricing) will be higher than the pricing page suggests, so keep an eye on costs while using it. Setting a larger chunk size reduces costs significantly.
- Install with `pip install openai` ([requires Python >=3.8](https://pypi.org/project/openai/)).
-
- For the openai-api backend, make sure your [OpenAI API key](https://platform.openai.com/api-keys) is set in the `OPENAI_API_KEY` environment variable. For example, before running, do `export OPENAI_API_KEY=sk-xxx`, with *sk-xxx* replaced by your API key.
-
- A backend is loaded only when chosen; the unused ones do not have to be installed.
-
- 3) For the voice activity controller: `pip install torch torchaudio`. Optional, but highly recommended.
-
- <details>
- <summary>4) Optional, not recommended: sentence segmenter (aka sentence tokenizer)</summary>
-
- Two buffer trimming options are integrated and evaluated. They have an impact on quality and latency. The default "segment" option performs better according to our tests and does not require any sentence segmenter to be installed.
-
- The other option, "sentence" -- trimming at the end of confirmed sentences -- requires a sentence segmenter. It splits punctuated text into sentences at full stops, avoiding dots that are not full stops. The segmenters are language specific; the unused ones do not have to be installed. We integrate the following segmenters, but suggestions for better alternatives are welcome.
-
- - `pip install opus-fast-mosestokenizer` for the languages with codes `as bn ca cs de el en es et fi fr ga gu hi hu is it kn lt lv ml mni mr nl or pa pl pt ro ru sk sl sv ta te yue zh`
-
- - `pip install tokenize_uk` for Ukrainian -- `uk`
-
- - for other languages, we integrate a well-performing multilingual model from `wtpsplit`. It requires `pip install torch wtpsplit` and its neural model `wtp-canine-s-12l-no-adapters`, which is downloaded to the default Hugging Face cache during the first use.
-
- - we did not find a segmenter for the languages `as ba bo br bs fo haw hr ht jw lb ln lo mi nn oc sa sd sn so su sw tk tl tt` that are supported by Whisper but not by wtpsplit. The default fallback for them is wtpsplit with unspecified language. Alternative suggestions are welcome.
-
- In case of installation issues with opus-fast-mosestokenizer, especially on Windows and Mac, we recommend the default "segment" option, which does not require it.
- </details>
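
For the "sentence" trimming option, here is a minimal sketch of how such a segmenter is used on its own, assuming the `WtP` API of wtpsplit and the `wtp-canine-s-12l-no-adapters` model named above (illustrative only; whisper_online.py selects and wires up the segmenter internally):

```python
# Hypothetical standalone use of the wtpsplit sentence segmenter (assumed API).
from wtpsplit import WtP

wtp = WtP("wtp-canine-s-12l-no-adapters")  # downloaded to the default Hugging Face cache on first use
text = "Chairman, thank you. If the debate today had a different subject, I might have joined the other colleagues."
sentences = wtp.split(text)  # split punctuated text into sentences
print(sentences)
# e.g. ['Chairman, thank you.', 'If the debate today had a different subject, I might have joined the other colleagues.']
```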
-
- ## Usage
-
- ### Real-time simulation from audio file
-
- ```
- whisper_online.py -h
- usage: whisper_online.py [-h] [--min-chunk-size MIN_CHUNK_SIZE] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large}] [--model_cache_dir MODEL_CACHE_DIR]
-                          [--model_dir MODEL_DIR] [--lan LAN] [--task {transcribe,translate}] [--backend {faster-whisper,whisper_timestamped,openai-api}] [--vac] [--vac-chunk-size VAC_CHUNK_SIZE] [--vad]
-                          [--buffer_trimming {sentence,segment}] [--buffer_trimming_sec BUFFER_TRIMMING_SEC] [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--start_at START_AT] [--offline] [--comp_unaware]
-                          audio_path
-
- positional arguments:
-   audio_path            Filename of 16kHz mono channel wav, on which live streaming is simulated.
-
- options:
-   -h, --help            show this help message and exit
-   --min-chunk-size MIN_CHUNK_SIZE
-                         Minimum audio chunk size in seconds. It waits up to this time to do processing. If the processing takes shorter time, it waits, otherwise it processes the whole segment that was received by this time.
-   --model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large,large-v3-turbo}
-                         Name size of the Whisper model to use (default: large-v2). The model is automatically downloaded from the model hub if not present in model cache dir.
-   --model_cache_dir MODEL_CACHE_DIR
-                         Overriding the default model cache dir where models downloaded from the hub are saved
-   --model_dir MODEL_DIR
-                         Dir where Whisper model.bin and other files are saved. This option overrides --model and --model_cache_dir parameter.
-   --lan LAN, --language LAN
-                         Source language code, e.g. en,de,cs, or 'auto' for language detection.
-   --task {transcribe,translate}
-                         Transcribe or translate.
-   --backend {faster-whisper,whisper_timestamped,openai-api}
-                         Load only this backend for Whisper processing.
-   --vac                 Use VAC = voice activity controller. Recommended. Requires torch.
-   --vac-chunk-size VAC_CHUNK_SIZE
-                         VAC sample size in seconds.
-   --vad                 Use VAD = voice activity detection, with the default parameters.
-   --buffer_trimming {sentence,segment}
-                         Buffer trimming strategy -- trim completed sentences marked with punctuation mark and detected by sentence segmenter, or the completed segments returned by Whisper. Sentence segmenter must be installed for "sentence" option.
-   --buffer_trimming_sec BUFFER_TRIMMING_SEC
-                         Buffer trimming length threshold in seconds. If buffer length is longer, trimming sentence/segment is triggered.
-   -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
-                         Set the log level
-   --start_at START_AT   Start processing audio at this time.
-   --offline             Offline mode.
-   --comp_unaware        Computationally unaware simulation.
- ```
-
- Example:
-
- It simulates real-time processing from a pre-recorded mono 16kHz wav file.
-
- ```
- python3 whisper_online.py en-demo16.wav --language en --min-chunk-size 1 > out.txt
- ```
-
- Simulation modes:
-
- - default mode, no special option: real-time simulation from file, computationally aware. The chunk size is `MIN_CHUNK_SIZE` or larger, if more audio arrived during the last update's computation.
-
- - `--comp_unaware` option: computationally unaware simulation. The timer that counts the emission times "stops" while the model is computing. The chunk size is always `MIN_CHUNK_SIZE`. The latency is caused only by the model being unable to confirm the output, e.g. because of language ambiguity, and not by slow hardware or a suboptimal implementation. We implement this feature to find the lower bound for latency.
-
- - `--start_at START_AT`: Start processing the audio at this time. The first update receives the whole audio up to `START_AT`. It is useful for debugging, e.g. when we observe a bug at a specific time in an audio file and want to reproduce it quickly, without a long wait.
-
- - `--offline` option: Processes the whole audio file at once, in offline mode. We implement it to find the lowest possible WER on a given audio file.
-
- ### Output format
-
- ```
- 2691.4399 300 1380 Chairman, thank you.
- 6914.5501 1940 4940 If the debate today had a
- 9019.0277 5160 7160 the subject the situation in
- 10065.1274 7180 7480 Gaza
- 11058.3558 7480 9460 Strip, I might
- 12224.3731 9460 9760 have
- 13555.1929 9760 11060 joined Mrs.
- 14928.5479 11140 12240 De Kaiser and all the
- 16588.0787 12240 12560 other
- 18324.9285 12560 14420 colleagues across the
- ```
-
- [See description here](https://github.com/ufal/whisper_streaming/blob/d915d790a62d7be4e7392dde1480e7981eb142ae/whisper_online.py#L361)

-
- ### As a module
-
- TL;DR: use the OnlineASRProcessor object and its methods insert_audio_chunk and process_iter.
-
- The code whisper_online.py is well commented; read it as the full documentation.
-
- This pseudocode describes the interface that we suggest for your implementation. You can implement any features that you need for your application.
-
- ```python
- from whisper_online import *
-
- src_lan = "en"  # source language
- tgt_lan = "en"  # target language -- same as source for ASR, "en" if translate task is used
-
- asr = FasterWhisperASR(src_lan, "large-v2")  # loads and wraps Whisper model
- # set options:
- # asr.set_translate_task()  # it will translate from src_lan into English
- # asr.use_vad()  # set using VAD
-
- online = OnlineASRProcessor(asr)  # create processing object with default buffer trimming option
-
- while audio_has_not_ended:  # processing loop:
-     a = ...  # receive new audio chunk (and e.g. wait for min_chunk_size seconds first, ...)
-     online.insert_audio_chunk(a)
-     o = online.process_iter()
-     print(o)  # do something with the current partial output
- # at the end of this audio processing
- o = online.finish()
- print(o)  # do something with the last output
-
- online.init()  # refresh if you're going to re-use the object for the next audio
- ```
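
As a concrete companion to the pseudocode above, here is a minimal, hedged sketch that drives `OnlineASRProcessor` from a pre-recorded 16 kHz mono wav file; the 1-second chunking and the use of librosa are illustrative assumptions (whisper_online.py already provides the official simulation):

```python
# Illustrative driver for OnlineASRProcessor; not the official simulation script.
import librosa
from whisper_online import FasterWhisperASR, OnlineASRProcessor

SAMPLING_RATE = 16000
CHUNK_SECONDS = 1.0  # assumed minimum chunk size

audio, _ = librosa.load("en-demo16.wav", sr=SAMPLING_RATE, mono=True)

asr = FasterWhisperASR("en", "large-v2")
online = OnlineASRProcessor(asr)

chunk = int(SAMPLING_RATE * CHUNK_SECONDS)
for start in range(0, len(audio), chunk):
    online.insert_audio_chunk(audio[start:start + chunk])
    beg, end, text = online.process_iter()  # confirmed (start, end, text); text is empty if nothing is confirmed yet
    if text:
        print(beg, end, text)

print(online.finish())  # flush whatever remains in the buffer
```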
-
- ### Server -- real-time from mic
-
- `whisper_online_server.py` has the same model options as `whisper_online.py`, plus `--host` and `--port` for the TCP connection and `--warmup-file`. See the help message (`-h` option).
-
- Client example:
-
- ```
- arecord -f S16_LE -c1 -r 16000 -t raw -D default | nc localhost 43001
- ```
-
- - arecord sends real-time audio from a sound device (e.g. a mic) in raw audio format -- 16000 Hz sampling rate, mono channel, S16_LE (signed 16-bit integer, little-endian). (Use whatever alternative to arecord works for you; a Python sketch follows below.)
-
- - nc is netcat with the server's host and port
-
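If arecord is not available (e.g. on Windows or macOS), here is a hedged sketch of an equivalent client that streams a pre-recorded 16 kHz mono wav file as raw S16_LE PCM over TCP; the file name, pacing, and result handling are illustrative assumptions:

```python
# Illustrative replacement for `arecord ... | nc localhost 43001`: stream raw
# signed 16-bit little-endian PCM (16 kHz, mono) from a wav file to the server.
import socket
import time
import wave

HOST, PORT = "localhost", 43001  # must match the server's --host/--port
CHUNK_FRAMES = 16000             # ~1 second of audio per send (assumed pacing)

with wave.open("en-demo16.wav", "rb") as wav, socket.create_connection((HOST, PORT)) as sock:
    assert wav.getframerate() == 16000 and wav.getnchannels() == 1 and wav.getsampwidth() == 2
    while True:
        frames = wav.readframes(CHUNK_FRAMES)
        if not frames:
            break
        sock.sendall(frames)     # raw PCM bytes, no header
        time.sleep(1.0)          # crude real-time pacing
    sock.shutdown(socket.SHUT_WR)
    # For simplicity, the transcription lines sent back by the server are read only at the end.
    for line in sock.makefile(encoding="utf-8", errors="replace"):
        print(line.rstrip())
```
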
- ## Live Transcription Web Interface
-
- This repository also includes a **FastAPI server** and an **HTML/JavaScript client** for quick testing of live speech transcription in the browser. The client uses native WebSockets and the `MediaRecorder` API to capture microphone audio in **WebM** format and send it to the server -- **no additional front-end framework** is required.
-
- ![Demo Screenshot](src/demo.png)

  ### How to Launch the Server

@@ -221,8 +30,19 @@ This repository also includes a **FastAPI server** and an **HTML/JavaScript clie
  ```bash
  pip install -r requirements.txt
  ```

- 2. **Run the FastAPI Server**:

  ```bash
  python whisper_fastapi_online_server.py --host 0.0.0.0 --port 8000
@@ -230,7 +50,7 @@ This repository also includes a **FastAPI server** and an **HTML/JavaScript clie

  - `--host` and `--port` let you specify the server’s IP/port.

- 3. **Open the Provided HTML**:

  - By default, the server root endpoint `/` serves a simple `live_transcription.html` page.
  - Open your browser at `http://localhost:8000` (or replace `localhost` and `8000` with whatever you specified).
@@ -240,7 +60,7 @@ This repository also includes a **FastAPI server** and an **HTML/JavaScript clie

  - Once you **allow microphone access**, the page records small chunks of audio using the **MediaRecorder** API in **webm/opus** format.
  - These chunks are sent over a **WebSocket** to the FastAPI endpoint at `/ws`.
- - The Python server decodes `.webm` chunks on the fly using **FFmpeg** and streams them into **Whisper** for transcription.
  - **Partial transcription** appears as soon as enough audio is processed. The “unvalidated” text is shown in **lighter or grey color** (i.e., an ‘aperçu’) to indicate it’s still buffered partial output. Once Whisper finalizes that segment, it’s displayed in normal text.
  - You can watch the transcription update in near real time, ideal for demos, prototyping, or quick debugging.

@@ -248,61 +68,13 @@ This repository also includes a **FastAPI server** and an **HTML/JavaScript clie

  If you want to **deploy** this setup:

- 1. **Host the FastAPI app** behind a production-grade HTTP server (like **Uvicorn + Nginx** or Docker).
  2. The **HTML/JS page** can be served by the same FastAPI app or a separate static host.
  3. Users open the page in **Chrome/Firefox** (any modern browser that supports MediaRecorder + WebSocket).

  No additional front-end libraries or frameworks are required. The WebSocket logic in `live_transcription.html` is minimal enough to adapt for your own custom UI or embed in other pages.

- ## Background
-
- Default Whisper is intended for audio chunks of at most 30 seconds that contain one full sentence. Longer audio files must be split into shorter chunks and merged with an "init prompt". In low-latency simultaneous streaming mode, simple and naive chunking into fixed-sized windows does not work well; it can split a word in the middle. It is also necessary to know when the transcript is stable, should be confirmed ("committed") and followed up, and when the future content makes the transcript clearer.
-
- For that, there is the LocalAgreement-n policy: if n consecutive updates, each with a newly available audio stream chunk, agree on a prefix transcript, it is confirmed. (Reference: CUNI-KIT at IWSLT 2022 etc.)
-
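A hedged illustration of the LocalAgreement-2 idea on plain word lists (the repository's actual implementation in whisper_online.py works with word-level timestamps and a trimmed audio buffer): the newly confirmed output is the part of the longest common prefix of two consecutive hypotheses that has not been committed yet.

```python
# Toy LocalAgreement-2: commit the longest common word prefix of two consecutive hypotheses.
def common_prefix(a: list[str], b: list[str]) -> list[str]:
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out

previous: list[str] = []
committed: list[str] = []

def update(hypothesis: list[str]) -> list[str]:
    """Return the words newly confirmed by this update."""
    global previous, committed
    agreed = common_prefix(previous, hypothesis)
    newly_confirmed = agreed[len(committed):]  # agreed words beyond what is already committed
    committed += newly_confirmed
    previous = hypothesis
    return newly_confirmed

print(update("if the debate".split()))            # [] -- nothing to agree with yet
print(update("if the debate today had".split()))  # ['if', 'the', 'debate']
```
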
- In this project, we re-use the idea of Peter Polák from this demo:
- https://github.com/pe-trik/transformers/blob/online_decode/examples/pytorch/online-decoding/whisper-online-demo.py
- However, it doesn't do any sentence segmentation, but Whisper produces punctuation, and the libraries `faster-whisper` and `whisper_timestamped` provide word-level timestamps. In short: we consecutively process new audio chunks, emit the transcripts that are confirmed by 2 iterations, and scroll the audio processing buffer on the timestamp of a confirmed complete sentence. This keeps the processing audio buffer reasonably short and the processing fast.
-
- In more detail: we use the init prompt, we handle the inaccurate timestamps, we re-process confirmed sentence prefixes and skip them, making sure they don't overlap, and we limit the processing buffer window.
-
- ### Performance evaluation
-
- [See the paper.](http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-demo/cdrom/pdf/2023.ijcnlp-demo.3.pdf)
-
- ### Contributions
-
- Contributions are welcome. We acknowledge especially:
-
- - [The GitHub contributors](https://github.com/ufal/whisper_streaming/graphs/contributors) for their pull requests with new features and bugfixes.
- - [Nice explanation video](https://www.youtube.com/watch?v=_spinzpEeFM) -- published on 31st March 2024; note that newer updates are not included.
- - [The translation of this repo into Chinese.](https://github.com/Gloridust/whisper_streaming_CN)
- - [Ondřej Plátek](https://opla.cz/) for the paper pre-review.
-
- Credits:
-
- - [Peter Polák](https://ufal.mff.cuni.cz/peter-polak) for the original idea.
- - The UEDIN team of the [ELITR project](https://elitr.eu) for the original line_packet.py.
- - Silero Team for their VAD [model](https://github.com/snakers4/silero-vad) and [VADIterator](https://github.com/ufal/whisper_streaming/blob/47caa80588ee9c0fa8945a5d05f0aea6315eb837/silero_vad.py#L8).
-
- ## Contact
-
- Dominik Macháček, machacek@ufal.mff.cuni.cz

 
+ # Whisper Streaming with FastAPI and WebSocket Integration
+
+ This project extends the [Whisper Streaming](https://github.com/ufal/whisper_streaming) implementation by adding a few extras. The enhancements include:
+
+ 1. **FastAPI Server with WebSocket Endpoint**: Enables real-time speech-to-text transcription directly from the browser.
+
+ 2. **Buffering Indication**: Improves the streaming display by showing the current processing status, giving users immediate feedback.
+
+ 3. **JavaScript Client Implementation**: A functional and minimalist MediaRecorder implementation that can be copied into your own client.
+
+ 4. **MLX Whisper Backend**: Integrates the alternative MLX Whisper backend, optimized for efficient speech recognition on Apple silicon.
+
+ ![Demo Screenshot](src/demo.png)

  ## Installation

+ 1. **Clone the Repository**:
+
+ ```bash
+ git clone https://github.com/QuentinFuxa/whisper_streaming_web
+ cd whisper_streaming_web
+ ```

  ### How to Launch the Server

  ```bash
  pip install -r requirements.txt
  ```

+ 2. **Install a Whisper backend**, one of:
+
+ ```
+ whisper
+ whisper-timestamped
+ faster-whisper (faster backend on NVIDIA GPU)
+ mlx-whisper (faster backend on Apple Silicon)
+
+ and torch if you want to use VAC (Voice Activity Controller)
+ ```
+
+ 3. **Run the FastAPI Server**:

  ```bash
  python whisper_fastapi_online_server.py --host 0.0.0.0 --port 8000

  - `--host` and `--port` let you specify the server’s IP/port.

+ 4. **Open the Provided HTML**:

  - By default, the server root endpoint `/` serves a simple `live_transcription.html` page.
  - Open your browser at `http://localhost:8000` (or replace `localhost` and `8000` with whatever you specified).

  - Once you **allow microphone access**, the page records small chunks of audio using the **MediaRecorder** API in **webm/opus** format.
  - These chunks are sent over a **WebSocket** to the FastAPI endpoint at `/ws`.
+ - The Python server decodes `.webm` chunks on the fly using **FFmpeg** and streams them into the **whisper streaming** implementation for transcription (see the sketch after this list).
  - **Partial transcription** appears as soon as enough audio is processed. The “unvalidated” text is shown in **lighter or grey color** (i.e., an ‘aperçu’) to indicate it’s still buffered partial output. Once Whisper finalizes that segment, it’s displayed in normal text.
  - You can watch the transcription update in near real time, ideal for demos, prototyping, or quick debugging.

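A heavily simplified, hedged sketch of what such a `/ws` endpoint can look like; the real whisper_fastapi_online_server.py differs in details (chunk pacing, buffering indication, response format), and the FFmpeg flags, model choice, and JSON shape below are illustrative assumptions:

```python
# Illustrative /ws endpoint: receive webm/opus chunks from MediaRecorder, decode them to
# 16 kHz mono float32 PCM with FFmpeg, feed whisper_streaming, and send back partial text.
import asyncio
import numpy as np
from fastapi import FastAPI, WebSocket
from whisper_online import FasterWhisperASR, OnlineASRProcessor

app = FastAPI()
asr = FasterWhisperASR("en", "large-v3")  # assumed language and model
BYTES_PER_STEP = 16000 * 4                # ~1 s of float32 samples at 16 kHz

@app.websocket("/ws")
async def ws_endpoint(websocket: WebSocket):
    await websocket.accept()
    online = OnlineASRProcessor(asr)
    # One FFmpeg process per connection: webm in on stdin, raw f32le 16 kHz mono out on stdout.
    ffmpeg = await asyncio.create_subprocess_exec(
        "ffmpeg", "-i", "pipe:0", "-f", "f32le", "-ac", "1", "-ar", "16000", "pipe:1",
        stdin=asyncio.subprocess.PIPE, stdout=asyncio.subprocess.PIPE,
    )

    async def feed_ffmpeg():
        while True:
            chunk = await websocket.receive_bytes()  # webm chunk from the browser
            ffmpeg.stdin.write(chunk)
            await ffmpeg.stdin.drain()

    feeder = asyncio.create_task(feed_ffmpeg())
    try:
        while True:
            pcm = await ffmpeg.stdout.read(BYTES_PER_STEP)
            if not pcm:
                break
            online.insert_audio_chunk(np.frombuffer(pcm, dtype=np.float32))
            beg, end, text = online.process_iter()  # NOTE: blocking; the real server handles this more carefully
            if text:
                await websocket.send_json({"beg": beg, "end": end, "text": text})
    finally:
        feeder.cancel()
```
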

  If you want to **deploy** this setup:

+ 1. **Host the FastAPI app** behind a production-grade HTTP(S) server (like **Uvicorn + Nginx** or Docker).
  2. The **HTML/JS page** can be served by the same FastAPI app or a separate static host.
  3. Users open the page in **Chrome/Firefox** (any modern browser that supports MediaRecorder + WebSocket).

  No additional front-end libraries or frameworks are required. The WebSocket logic in `live_transcription.html` is minimal enough to adapt for your own custom UI or embed in other pages.

+ ## Acknowledgments
+
+ This project builds upon the foundational work of the Whisper Streaming project. We extend our gratitude to the original authors for their contributions.