regularfry committed
Commit 8883397
2 parents: fc4b3cd d497503

Merge branch 'main' into ayo-warmup-file

Files changed (3)
  1. README.md +39 -23
  2. whisper_online.py +132 -31
  3. whisper_online_server.py +3 -23
README.md CHANGED
@@ -3,42 +3,50 @@ Whisper realtime streaming for long speech-to-text transcription and translation

 **Turning Whisper into Real-Time Transcription System**

- Demonstration paper, by Dominik Macháček, Raj Dabre, Ondřej Bojar, 2023
+ Demonstration paper, by [Dominik Macháček](https://ufal.mff.cuni.cz/dominik-machacek), [Raj Dabre](https://prajdabre.github.io/), [Ondřej Bojar](https://ufal.mff.cuni.cz/ondrej-bojar), 2023

- Abstract: Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models, however, it is not designed for real time transcription. In this paper, we build on top of Whisper and create Whisper-Streaming, an implementation of real-time speech transcription and translation of Whisper-like models. Whisper-Streaming uses local agreement policy with self-adaptive latency to enable streaming transcription. We show that Whisper-Streaming achieves high quality and 3.3 seconds latency on unsegmented long-form speech transcription test set, and we demonstrate its robustness and practical usability as a component in live transcription service at a multilingual conference.
+ Abstract: Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models, however, it is not designed for real-time transcription. In this paper, we build on top of Whisper and create Whisper-Streaming, an implementation of real-time speech transcription and translation of Whisper-like models. Whisper-Streaming uses local agreement policy with self-adaptive latency to enable streaming transcription. We show that Whisper-Streaming achieves high quality and 3.3 seconds latency on unsegmented long-form speech transcription test set, and we demonstrate its robustness and practical usability as a component in live transcription service at a multilingual conference.


- Paper in proceedings: http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-demo/cdrom/pdf/2023.ijcnlp-demo.3.pdf
-
- Demo video: https://player.vimeo.com/video/840442741
+ [Paper PDF](https://aclanthology.org/2023.ijcnlp-demo.3.pdf), [Demo video](https://player.vimeo.com/video/840442741)

 [Slides](http://ufallab.ms.mff.cuni.cz/~machacek/pre-prints/AACL23-2.11.2023-Turning-Whisper-oral.pdf) -- 15 minutes oral presentation at IJCNLP-AACL 2023

- Please, cite us. [Bibtex citation](http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-demo/cdrom/bib/2023.ijcnlp-demo.3.bib):
+ Please, cite us. [ACL Anthology](https://aclanthology.org/2023.ijcnlp-demo.3/), [Bibtex citation](https://aclanthology.org/2023.ijcnlp-demo.3.bib):

 ```
- @InProceedings{machacek-dabre-bojar:2023:ijcnlp,
- author = {Macháček, Dominik and Dabre, Raj and Bojar, Ondřej},
- title = {Turning Whisper into Real-Time Transcription System},
- booktitle = {System Demonstrations},
- month = {November},
- year = {2023},
- address = {Bali, Indonesia},
- publisher = {Asian Federation of Natural Language Processing},
- pages = {17--24},
+ @inproceedings{machacek-etal-2023-turning,
+ title = "Turning Whisper into Real-Time Transcription System",
+ author = "Mach{\'a}{\v{c}}ek, Dominik and
+ Dabre, Raj and
+ Bojar, Ond{\v{r}}ej",
+ editor = "Saha, Sriparna and
+ Sujaini, Herry",
+ booktitle = "Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: System Demonstrations",
+ month = nov,
+ year = "2023",
+ address = "Bali, Indonesia",
+ publisher = "Association for Computational Linguistics",
+ url = "https://aclanthology.org/2023.ijcnlp-demo.3",
+ pages = "17--24",
 }
 ```

 ## Installation

- 1) ``pip install librosa`` -- audio processing library
+ 1) ``pip install librosa soundfile`` -- audio processing library

 2) Whisper backend.

- Two alternative backends are integrated. The most recommended one is [faster-whisper](https://github.com/guillaumekln/faster-whisper) with GPU support. Follow their instructions for NVIDIA libraries -- we succeeded with CUDNN 8.5.0 and CUDA 11.7. Install with `pip install faster-whisper`.
+ Several alternative backends are integrated. The most recommended one is [faster-whisper](https://github.com/guillaumekln/faster-whisper) with GPU support. Follow their instructions for NVIDIA libraries -- we succeeded with CUDNN 8.5.0 and CUDA 11.7. Install with `pip install faster-whisper`.

 Alternative, less restrictive, but slower backend is [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped): `pip install git+https://github.com/linto-ai/whisper-timestamped`

+ Thirdly, it's also possible to run this software from the [OpenAI Whisper API](https://platform.openai.com/docs/api-reference/audio/createTranscription). This solution is fast and requires no GPU, just a small VM will suffice, but you will need to pay OpenAI for api access. Also note that, since each audio fragment is processed multiple times, the [price](https://openai.com/pricing) will be higher than obvious from the pricing page, so keep an eye on costs while using. Setting a higher chunk-size will reduce costs significantly.
+ Install with: `pip install openai`
+
+ For running with the openai-api backend, make sure that your [OpenAI api key](https://platform.openai.com/api-keys) is set in the `OPENAI_API_KEY` environment variable. For example, before running, do: `export OPENAI_API_KEY=sk-xxx` with *sk-xxx* replaced with your api key.
+
 The backend is loaded only when chosen. The unused one does not have to be installed.

 3) Optional, not recommended: sentence segmenter (aka sentence tokenizer)
@@ -69,7 +77,7 @@ In case of installation issues of opus-fast-mosestokenizer, especially on Window

 ```
 usage: whisper_online.py [-h] [--min-chunk-size MIN_CHUNK_SIZE] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large}] [--model_cache_dir MODEL_CACHE_DIR] [--model_dir MODEL_DIR] [--lan LAN] [--task {transcribe,translate}]
- [--backend {faster-whisper,whisper_timestamped}] [--vad] [--buffer_trimming {sentence,segment}] [--buffer_trimming_sec BUFFER_TRIMMING_SEC] [--start_at START_AT] [--offline] [--comp_unaware]
+ [--backend {faster-whisper,whisper_timestamped,openai-api}] [--vad] [--buffer_trimming {sentence,segment}] [--buffer_trimming_sec BUFFER_TRIMMING_SEC] [--start_at START_AT] [--offline] [--comp_unaware]
 audio_path

 positional arguments:
@@ -86,10 +94,10 @@ options:
 --model_dir MODEL_DIR
 Dir where Whisper model.bin and other files are saved. This option overrides --model and --model_cache_dir parameter.
 --lan LAN, --language LAN
- Language code for transcription, e.g. en,de,cs.
+ Source language code, e.g. en,de,cs, or 'auto' for language detection.
 --task {transcribe,translate}
 Transcribe or translate.
- --backend {faster-whisper,whisper_timestamped}
+ --backend {faster-whisper,whisper_timestamped,openai-api}
 Load only this backend for Whisper processing.
 --vad Use VAD = voice activity detection, with the default parameters.
 --buffer_trimming {sentence,segment}
@@ -147,7 +155,7 @@ The code whisper_online.py is nicely commented, read it as the full documentatio

 This pseudocode describes the interface that we suggest for your implementation. You can implement any features that you need for your application.

- ```
+ ```python
 from whisper_online import *

 src_lan = "en" # source language
@@ -216,12 +224,20 @@ In more detail: we use the init prompt, we handle the inaccurate timestamps, we
 re-process confirmed sentence prefixes and skip them, making sure they don't
 overlap, and we limit the processing buffer window.

- Contributions are welcome.
-
 ### Performance evaluation

 [See the paper.](http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-demo/cdrom/pdf/2023.ijcnlp-demo.3.pdf)

+ ### Contributions
+
+ Contributions are welcome. We acknowledge especially:
+
+ - [The GitHub contributors](https://github.com/ufal/whisper_streaming/graphs/contributors) for their pull requests with new features and bugfixes.
+ - [The translation of this repo into Chinese.](https://github.com/Gloridust/whisper_streaming_CN)
+ - [Ondřej Plátek](https://opla.cz/) for the paper pre-review.
+ - [Peter Polák](https://ufal.mff.cuni.cz/peter-polak) for the original idea.
+ - The UEDIN team of the [ELITR project](https://elitr.eu) for the original line_packet.py.
+

 ## Contact

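The README above introduces the new `openai-api` backend and an `OnlineASRProcessor`-based processing loop, but only in prose and (elided) pseudocode. A rough, non-authoritative sketch of how those pieces fit together -- assuming `OPENAI_API_KEY` is exported, audio arrives as 16 kHz float32 numpy arrays, and `process_iter()`/`finish()` follow the interface the README's pseudocode describes:

```python
# Sketch only: wiring the openai-api backend added in this commit into the
# streaming interface sketched in the README. Assumes OPENAI_API_KEY is set
# and that audio chunks are 16 kHz float32 numpy arrays.
import numpy as np
from whisper_online import OpenaiApiASR, OnlineASRProcessor

asr = OpenaiApiASR(lan="en")           # backend class added in whisper_online.py below
online = OnlineASRProcessor(asr)       # online wrapper around any ASRBase backend

audio_chunk = np.zeros(16000, dtype=np.float32)  # placeholder for one second of captured audio
online.insert_audio_chunk(audio_chunk)
print(online.process_iter())           # emits committed text as it becomes stable
print(online.finish())                 # flush whatever remains at the end of the stream
```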
whisper_online.py CHANGED
@@ -4,12 +4,13 @@ import numpy as np
 import librosa
 from functools import lru_cache
 import time
-
-
+ import io
+ import soundfile as sf
+ import math

 @lru_cache
 def load_audio(fname):
- a, _ = librosa.load(fname, sr=16000)
+ a, _ = librosa.load(fname, sr=16000, dtype=np.float32)
 return a

 def load_audio_chunk(fname, beg, end):
@@ -30,7 +31,10 @@ class ASRBase:
 self.logfile = logfile

 self.transcribe_kargs = {}
- self.original_language = lan
+ if lan == "auto":
+ self.original_language = None
+ else:
+ self.original_language = lan

 self.model = self.load_model(modelsize, cache_dir, model_dir)

@@ -54,6 +58,7 @@ class WhisperTimestampedASR(ASRBase):

 def load_model(self, modelsize=None, cache_dir=None, model_dir=None):
 import whisper
+ import whisper_timestamped
 from whisper_timestamped import transcribe_timestamped
 self.transcribe_timestamped = transcribe_timestamped
 if model_dir is not None:
@@ -118,8 +123,11 @@ class FasterWhisperASR(ASRBase):
 return model

 def transcribe(self, audio, init_prompt=""):
+
 # tested: beam_size=5 is faster and better than 1 (on one 200 second document from En ESIC, min chunk 0.01)
 segments, info = self.model.transcribe(audio, language=self.original_language, initial_prompt=init_prompt, beam_size=5, word_timestamps=True, condition_on_previous_text=True, **self.transcribe_kargs)
+ #print(info) # info contains language detection result
+
 return list(segments)

 def ts_words(self, segments):
@@ -142,6 +150,93 @@ class FasterWhisperASR(ASRBase):
 self.transcribe_kargs["task"] = "translate"


+ class OpenaiApiASR(ASRBase):
+ """Uses OpenAI's Whisper API for audio transcription."""
+
+ def __init__(self, lan=None, temperature=0, logfile=sys.stderr):
+ self.logfile = logfile
+
+ self.modelname = "whisper-1"
+ self.original_language = None if lan == "auto" else lan # ISO-639-1 language code
+ self.response_format = "verbose_json"
+ self.temperature = temperature
+
+ self.load_model()
+
+ self.use_vad_opt = False
+
+ # reset the task in set_translate_task
+ self.task = "transcribe"
+
+ def load_model(self, *args, **kwargs):
+ from openai import OpenAI
+ self.client = OpenAI()
+
+ self.transcribed_seconds = 0 # for logging how many seconds were processed by API, to know the cost
+
+
+ def ts_words(self, segments):
+ no_speech_segments = []
+ if self.use_vad_opt:
+ for segment in segments.segments:
+ # TODO: threshold can be set from outside
+ if segment["no_speech_prob"] > 0.8:
+ no_speech_segments.append((segment.get("start"), segment.get("end")))
+
+ o = []
+ for word in segments.words:
+ start = word.get("start")
+ end = word.get("end")
+ if any(s[0] <= start <= s[1] for s in no_speech_segments):
+ # print("Skipping word", word.get("word"), "because it's in a no-speech segment")
+ continue
+ o.append((start, end, word.get("word")))
+ return o
+
+
+ def segments_end_ts(self, res):
+ return [s["end"] for s in res.words]
+
+ def transcribe(self, audio_data, prompt=None, *args, **kwargs):
+ # Write the audio data to a buffer
+ buffer = io.BytesIO()
+ buffer.name = "temp.wav"
+ sf.write(buffer, audio_data, samplerate=16000, format='WAV', subtype='PCM_16')
+ buffer.seek(0) # Reset buffer's position to the beginning
+
+ self.transcribed_seconds += math.ceil(len(audio_data)/16000) # it rounds up to the whole seconds
+
+ params = {
+ "model": self.modelname,
+ "file": buffer,
+ "response_format": self.response_format,
+ "temperature": self.temperature,
+ "timestamp_granularities": ["word", "segment"]
+ }
+ if self.task != "translate" and self.original_language:
+ params["language"] = self.original_language
+ if prompt:
+ params["prompt"] = prompt
+
+ if self.task == "translate":
+ proc = self.client.audio.translations
+ else:
+ proc = self.client.audio.transcriptions
+
+ # Process transcription/translation
+ transcript = proc.create(**params)
+ print(f"OpenAI API processed accumulated {self.transcribed_seconds} seconds",file=self.logfile)
+
+ return transcript
+
+ def use_vad(self):
+ self.use_vad_opt = True
+
+ def set_translate_task(self):
+ self.task = "translate"
+
+
+

 class HypothesisBuffer:

@@ -234,9 +329,6 @@ class OnlineASRProcessor:

 self.transcript_buffer = HypothesisBuffer(logfile=self.logfile)
 self.commited = []
- self.last_chunked_at = 0
-
- self.silence_iters = 0

 def insert_audio_chunk(self, audio):
 self.audio_buffer = np.append(self.audio_buffer, audio)
@@ -246,7 +338,7 @@
 "context" is the commited text that is inside the audio buffer. It is transcribed again and skipped. It is returned only for debugging and logging reasons.
 """
 k = max(0,len(self.commited)-1)
- while k > 0 and self.commited[k-1][1] > self.last_chunked_at:
+ while k > 0 and self.commited[k-1][1] > self.buffer_time_offset:
 k -= 1

 p = self.commited[:k]
@@ -357,7 +449,6 @@
 cut_seconds = time - self.buffer_time_offset
 self.audio_buffer = self.audio_buffer[int(cut_seconds*self.SAMPLING_RATE):]
 self.buffer_time_offset = time
- self.last_chunked_at = time

 def words_to_sentences(self, words):
 """Uses self.tokenizer for sentence segmentation of words.
@@ -451,13 +542,42 @@ def add_shared_args(parser):
 parser.add_argument('--model', type=str, default='large-v2', choices="tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large".split(","),help="Name size of the Whisper model to use (default: large-v2). The model is automatically downloaded from the model hub if not present in model cache dir.")
 parser.add_argument('--model_cache_dir', type=str, default=None, help="Overriding the default model cache dir where models downloaded from the hub are saved")
 parser.add_argument('--model_dir', type=str, default=None, help="Dir where Whisper model.bin and other files are saved. This option overrides --model and --model_cache_dir parameter.")
- parser.add_argument('--lan', '--language', type=str, default='en', help="Language code for transcription, e.g. en,de,cs.")
+ parser.add_argument('--lan', '--language', type=str, default='auto', help="Source language code, e.g. en,de,cs, or 'auto' for language detection.")
 parser.add_argument('--task', type=str, default='transcribe', choices=["transcribe","translate"],help="Transcribe or translate.")
- parser.add_argument('--backend', type=str, default="faster-whisper", choices=["faster-whisper", "whisper_timestamped"],help='Load only this backend for Whisper processing.')
+ parser.add_argument('--backend', type=str, default="faster-whisper", choices=["faster-whisper", "whisper_timestamped", "openai-api"],help='Load only this backend for Whisper processing.')
 parser.add_argument('--vad', action="store_true", default=False, help='Use VAD = voice activity detection, with the default parameters.')
 parser.add_argument('--buffer_trimming', type=str, default="segment", choices=["sentence", "segment"],help='Buffer trimming strategy -- trim completed sentences marked with punctuation mark and detected by sentence segmenter, or the completed segments returned by Whisper. Sentence segmenter must be installed for "sentence" option.')
 parser.add_argument('--buffer_trimming_sec', type=float, default=15, help='Buffer trimming length threshold in seconds. If buffer length is longer, trimming sentence/segment is triggered.')

+ def asr_factory(args, logfile=sys.stderr):
+ """
+ Creates and configures an ASR instance based on the specified backend and arguments.
+ """
+ backend = args.backend
+ if backend == "openai-api":
+ print("Using OpenAI API.", file=logfile)
+ asr = OpenaiApiASR(lan=args.lan)
+ else:
+ if backend == "faster-whisper":
+ asr_cls = FasterWhisperASR
+ else:
+ asr_cls = WhisperTimestampedASR
+
+ # Only for FasterWhisperASR and WhisperTimestampedASR
+ size = args.model
+ t = time.time()
+ print(f"Loading Whisper {size} model for {args.lan}...", file=logfile, end=" ", flush=True)
+ asr = asr_cls(modelsize=size, lan=args.lan, cache_dir=args.model_cache_dir, model_dir=args.model_dir)
+ e = time.time()
+ print(f"done. It took {round(e-t,2)} seconds.", file=logfile)
+
+ # Apply common configurations
+ if getattr(args, 'vad', False): # Checks if VAD argument is present and True
+ print("Setting VAD filter", file=logfile)
+ asr.use_vad()
+
+ return asr
+
 ## main:

 if __name__ == "__main__":
@@ -485,33 +605,14 @@ if __name__ == "__main__":
 duration = len(load_audio(audio_path))/SAMPLING_RATE
 print("Audio duration is: %2.2f seconds" % duration, file=logfile)

- size = args.model
+ asr = asr_factory(args, logfile=logfile)
 language = args.lan
-
- t = time.time()
- print(f"Loading Whisper {size} model for {language}...",file=logfile,end=" ",flush=True)
-
- if args.backend == "faster-whisper":
- asr_cls = FasterWhisperASR
- else:
- asr_cls = WhisperTimestampedASR
-
- asr = asr_cls(modelsize=size, lan=language, cache_dir=args.model_cache_dir, model_dir=args.model_dir)
-
 if args.task == "translate":
 asr.set_translate_task()
 tgt_language = "en" # Whisper translates into English
 else:
 tgt_language = language # Whisper transcribes in this language

-
- e = time.time()
- print(f"done. It took {round(e-t,2)} seconds.",file=logfile)
-
- if args.vad:
- print("setting VAD filter",file=logfile)
- asr.use_vad()
-

 min_chunk = args.min_chunk_size
 if args.buffer_trimming == "sentence":
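The new `asr_factory` collects the backend selection, model loading, and VAD setup that whisper_online.py and whisper_online_server.py previously duplicated. A minimal sketch of driving it outside those scripts, with placeholder argument values (the flags come from `add_shared_args` shown above):

```python
# Sketch: building an ASR object through the new factory, outside the CLI scripts.
# add_shared_args() defines --model, --lan, --backend, --vad, etc.; values below are examples only.
import argparse
import sys
from whisper_online import add_shared_args, asr_factory

parser = argparse.ArgumentParser()
add_shared_args(parser)
args = parser.parse_args(["--backend", "openai-api", "--lan", "en", "--task", "translate"])

asr = asr_factory(args, logfile=sys.stderr)  # picks the backend class and applies --vad if set
if args.task == "translate":
    asr.set_translate_task()                 # same follow-up step the scripts perform
```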
whisper_online_server.py CHANGED
@@ -4,6 +4,7 @@ from whisper_online import *
 import sys
 import argparse
 import os
+ import numpy as np
 parser = argparse.ArgumentParser()

 # server options
@@ -25,34 +26,13 @@ SAMPLING_RATE = 16000
 size = args.model
 language = args.lan

- t = time.time()
- print(f"Loading Whisper {size} model for {language}...",file=sys.stderr,end=" ",flush=True)
-
- if args.backend == "faster-whisper":
- from faster_whisper import WhisperModel
- asr_cls = FasterWhisperASR
- else:
- import whisper
- import whisper_timestamped
- # from whisper_timestamped_model import WhisperTimestampedASR
- asr_cls = WhisperTimestampedASR
-
- asr = asr_cls(modelsize=size, lan=language, cache_dir=args.model_cache_dir, model_dir=args.model_dir)
-
+ asr = asr_factory(args)
 if args.task == "translate":
 asr.set_translate_task()
 tgt_language = "en"
 else:
 tgt_language = language

- e = time.time()
- print(f"done. It took {round(e-t,2)} seconds.",file=sys.stderr)
-
- if args.vad:
- print("setting VAD filter",file=sys.stderr)
- asr.use_vad()
-
-
 min_chunk = args.min_chunk_size

 if args.buffer_trimming == "sentence":
@@ -136,7 +116,7 @@ class ServerProcessor:
 if not raw_bytes:
 break
 sf = soundfile.SoundFile(io.BytesIO(raw_bytes), channels=1,endian="LITTLE",samplerate=SAMPLING_RATE, subtype="PCM_16",format="RAW")
- audio, _ = librosa.load(sf,sr=SAMPLING_RATE)
+ audio, _ = librosa.load(sf,sr=SAMPLING_RATE,dtype=np.float32)
 out.append(audio)
 if not out:
 return None
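The server change keeps the raw PCM_16 decoding path but pins the decoded samples to float32, matching the dtype now used in `load_audio`. A standalone sketch of that conversion, with a synthetic silence buffer standing in for bytes read from the socket:

```python
# Sketch of the decoding done in ServerProcessor: raw little-endian PCM_16 bytes are
# wrapped as a RAW SoundFile and loaded as 16 kHz float32 samples.
import io
import librosa
import numpy as np
import soundfile

SAMPLING_RATE = 16000
raw_bytes = np.zeros(SAMPLING_RATE, dtype=np.int16).tobytes()  # stand-in for one second from the socket

sf = soundfile.SoundFile(io.BytesIO(raw_bytes), channels=1, endian="LITTLE",
                         samplerate=SAMPLING_RATE, subtype="PCM_16", format="RAW")
audio, _ = librosa.load(sf, sr=SAMPLING_RATE, dtype=np.float32)
assert audio.dtype == np.float32 and len(audio) == SAMPLING_RATE
```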