Dominik Macháček committed on
Commit 36bf3a3 · 2 Parent(s): 2ec2266 84a9995

Merge branch 'main' into vad-streaming-clean

Files changed (4)
  1. README.md +40 -23
  2. line_packet.py +1 -2
  3. whisper_online.py +196 -75
  4. whisper_online_server.py +29 -68
README.md CHANGED
@@ -3,44 +3,52 @@ Whisper realtime streaming for long speech-to-text transcription and translation
3
 
4
  **Turning Whisper into Real-Time Transcription System**
5
 
6
- Demonstration paper, by Dominik Macháček, Raj Dabre, Ondřej Bojar, 2023
7
 
8
- Abstract: Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models, however, it is not designed for real time transcription. In this paper, we build on top of Whisper and create Whisper-Streaming, an implementation of real-time speech transcription and translation of Whisper-like models. Whisper-Streaming uses local agreement policy with self-adaptive latency to enable streaming transcription. We show that Whisper-Streaming achieves high quality and 3.3 seconds latency on unsegmented long-form speech transcription test set, and we demonstrate its robustness and practical usability as a component in live transcription service at a multilingual conference.
9
 
10
 
11
- Paper in proceedings: http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-demo/cdrom/pdf/2023.ijcnlp-demo.3.pdf
12
-
13
- Demo video: https://player.vimeo.com/video/840442741
14
 
15
  [Slides](http://ufallab.ms.mff.cuni.cz/~machacek/pre-prints/AACL23-2.11.2023-Turning-Whisper-oral.pdf) -- 15 minutes oral presentation at IJCNLP-AACL 2023
16
 
17
- Please, cite us. [Bibtex citation](http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-demo/cdrom/bib/2023.ijcnlp-demo.3.bib):
18
 
19
  ```
20
- @InProceedings{machacek-dabre-bojar:2023:ijcnlp,
21
- author = {Macháček, Dominik and Dabre, Raj and Bojar, Ondřej},
22
- title = {Turning Whisper into Real-Time Transcription System},
23
- booktitle = {System Demonstrations},
24
- month = {November},
25
- year = {2023},
26
- address = {Bali, Indonesia},
27
- publisher = {Asian Federation of Natural Language Processing},
28
- pages = {17--24},
29
  }
30
  ```
31
 
32
  ## Installation
33
 
34
- 1) ``pip install librosa`` -- audio processing library
35
 
36
  Note: for the VAD I need to `pip install torch torchaudio`.
37
 
38
  2) Whisper backend.
39
 
40
- Two alternative backends are integrated. The most recommended one is [faster-whisper](https://github.com/guillaumekln/faster-whisper) with GPU support. Follow their instructions for NVIDIA libraries -- we succeeded with CUDNN 8.5.0 and CUDA 11.7. Install with `pip install faster-whisper`.
41
 
42
  Alternative, less restrictive, but slower backend is [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped): `pip install git+https://github.com/linto-ai/whisper-timestamped`
43
 
44
  The backend is loaded only when chosen. The unused one does not have to be installed.
45
 
46
  3) Optional, not recommended: sentence segmenter (aka sentence tokenizer)
@@ -71,7 +79,7 @@ In case of installation issues of opus-fast-mosestokenizer, especially on Window
71
 
72
  ```
73
  usage: whisper_online.py [-h] [--min-chunk-size MIN_CHUNK_SIZE] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large}] [--model_cache_dir MODEL_CACHE_DIR] [--model_dir MODEL_DIR] [--lan LAN] [--task {transcribe,translate}]
74
- [--backend {faster-whisper,whisper_timestamped}] [--vad] [--buffer_trimming {sentence,segment}] [--buffer_trimming_sec BUFFER_TRIMMING_SEC] [--start_at START_AT] [--offline] [--comp_unaware]
75
  audio_path
76
 
77
  positional arguments:
@@ -91,7 +99,7 @@ options:
91
  Source language code, e.g. en,de,cs, or 'auto' for language detection.
92
  --task {transcribe,translate}
93
  Transcribe or translate.
94
- --backend {faster-whisper,whisper_timestamped}
95
  Load only this backend for Whisper processing.
96
  --vad Use VAD = voice activity detection, with the default parameters.
97
  --buffer_trimming {sentence,segment}
@@ -149,7 +157,7 @@ The code whisper_online.py is nicely commented, read it as the full documentatio
149
 
150
  This pseudocode describes the interface that we suggest for your implementation. You can implement any features that you need for your application.
151
 
152
- ```
153
  from whisper_online import *
154
 
155
  src_lan = "en" # source language
@@ -177,7 +185,7 @@ online.init() # refresh if you're going to re-use the object for the next audio
177
 
178
  ### Server -- real-time from mic
179
 
180
- `whisper_online_server.py` has the same model options as `whisper_online.py`, plus `--host` and `--port` of the TCP connection. See help message (`-h` option).
181
 
182
  Client example:
183
 
@@ -218,12 +226,21 @@ In more detail: we use the init prompt, we handle the inaccurate timestamps, we
218
  re-process confirmed sentence prefixes and skip them, making sure they don't
219
  overlap, and we limit the processing buffer window.
220
 
221
- Contributions are welcome.
222
-
223
  ### Performance evaluation
224
 
225
  [See the paper.](http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-demo/cdrom/pdf/2023.ijcnlp-demo.3.pdf)
226
 
227
 
228
  ## Contact
229
 
 
3
 
4
  **Turning Whisper into Real-Time Transcription System**
5
 
6
+ Demonstration paper, by [Dominik Macháček](https://ufal.mff.cuni.cz/dominik-machacek), [Raj Dabre](https://prajdabre.github.io/), [Ondřej Bojar](https://ufal.mff.cuni.cz/ondrej-bojar), 2023
7
 
8
+ Abstract: Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models, however, it is not designed for real-time transcription. In this paper, we build on top of Whisper and create Whisper-Streaming, an implementation of real-time speech transcription and translation of Whisper-like models. Whisper-Streaming uses local agreement policy with self-adaptive latency to enable streaming transcription. We show that Whisper-Streaming achieves high quality and 3.3 seconds latency on unsegmented long-form speech transcription test set, and we demonstrate its robustness and practical usability as a component in live transcription service at a multilingual conference.
9
 
10
 
11
+ [Paper PDF](https://aclanthology.org/2023.ijcnlp-demo.3.pdf), [Demo video](https://player.vimeo.com/video/840442741)
 
 
12
 
13
  [Slides](http://ufallab.ms.mff.cuni.cz/~machacek/pre-prints/AACL23-2.11.2023-Turning-Whisper-oral.pdf) -- 15 minutes oral presentation at IJCNLP-AACL 2023
14
 
15
+ Please cite us. [ACL Anthology](https://aclanthology.org/2023.ijcnlp-demo.3/), [Bibtex citation](https://aclanthology.org/2023.ijcnlp-demo.3.bib):
16
 
17
  ```
18
+ @inproceedings{machacek-etal-2023-turning,
19
+ title = "Turning Whisper into Real-Time Transcription System",
20
+ author = "Mach{\'a}{\v{c}}ek, Dominik and
21
+ Dabre, Raj and
22
+ Bojar, Ond{\v{r}}ej",
23
+ editor = "Saha, Sriparna and
24
+ Sujaini, Herry",
25
+ booktitle = "Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: System Demonstrations",
26
+ month = nov,
27
+ year = "2023",
28
+ address = "Bali, Indonesia",
29
+ publisher = "Association for Computational Linguistics",
30
+ url = "https://aclanthology.org/2023.ijcnlp-demo.3",
31
+ pages = "17--24",
32
  }
33
  ```
34
 
35
  ## Installation
36
 
37
+ 1) ``pip install librosa soundfile`` -- audio processing libraries
38
 
39
  Note: for the VAD I need to `pip install torch torchaudio`.
40
 
41
  2) Whisper backend.
42
 
43
+ Several alternative backends are integrated. The recommended one is [faster-whisper](https://github.com/guillaumekln/faster-whisper) with GPU support. Follow their instructions for NVIDIA libraries -- we succeeded with CUDNN 8.5.0 and CUDA 11.7. Install with `pip install faster-whisper`.
44
 
45
  Alternative, less restrictive, but slower backend is [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped): `pip install git+https://github.com/linto-ai/whisper-timestamped`
46
 
47
+ Thirdly, it's also possible to run this software with the [OpenAI Whisper API](https://platform.openai.com/docs/api-reference/audio/createTranscription). This solution is fast and requires no GPU; a small VM will suffice, but you will need to pay OpenAI for API access. Also note that, since each audio fragment is processed multiple times, the [price](https://openai.com/pricing) will be higher than is obvious from the pricing page, so keep an eye on costs while using it. Setting a higher chunk-size will reduce costs significantly.
48
+ Install with: `pip install openai`
49
+
50
+ For running with the openai-api backend, make sure that your [OpenAI API key](https://platform.openai.com/api-keys) is set in the `OPENAI_API_KEY` environment variable. For example, before running, do: `export OPENAI_API_KEY=sk-xxx`, with *sk-xxx* replaced with your API key.
51
+
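+ As a minimal, hedged sketch of using the openai-api backend from Python rather than the CLI (the class and function names come from this commit's `whisper_online.py`; `audio.wav` is a placeholder for your own file, and `OPENAI_API_KEY` must already be exported):
+
+ ```python
+ from whisper_online import OpenaiApiASR, OnlineASRProcessor, load_audio_chunk
+
+ asr = OpenaiApiASR(lan="en")       # ISO-639-1 source language, or None/"auto" for detection
+ online = OnlineASRProcessor(asr)   # default (segment) buffer trimming, no sentence tokenizer
+
+ audio = load_audio_chunk("audio.wav", 0, 60)   # first 60 seconds as 16 kHz mono float32
+ online.insert_audio_chunk(audio)
+ print(online.process_iter())       # (beg, end, "confirmed text so far"); text may be empty
+ print(online.finish())             # flush the remaining, unconfirmed tail
+
+ # asr.transcribed_seconds accumulates the audio seconds sent to the API across all
+ # (repeated) calls -- useful for keeping an eye on the cost.
+ print("billed seconds so far:", asr.transcribed_seconds)
+ ```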
52
  The backend is loaded only when chosen. The unused one does not have to be installed.
53
 
54
  3) Optional, not recommended: sentence segmenter (aka sentence tokenizer)
 
79
 
80
  ```
81
  usage: whisper_online.py [-h] [--min-chunk-size MIN_CHUNK_SIZE] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large}] [--model_cache_dir MODEL_CACHE_DIR] [--model_dir MODEL_DIR] [--lan LAN] [--task {transcribe,translate}]
82
+ [--backend {faster-whisper,whisper_timestamped,openai-api}] [--vad] [--buffer_trimming {sentence,segment}] [--buffer_trimming_sec BUFFER_TRIMMING_SEC] [--start_at START_AT] [--offline] [--comp_unaware]
83
  audio_path
84
 
85
  positional arguments:
 
99
  Source language code, e.g. en,de,cs, or 'auto' for language detection.
100
  --task {transcribe,translate}
101
  Transcribe or translate.
102
+ --backend {faster-whisper,whisper_timestamped,openai-api}
103
  Load only this backend for Whisper processing.
104
  --vad Use VAD = voice activity detection, with the default parameters.
105
  --buffer_trimming {sentence,segment}
 
157
 
158
  This pseudocode describes the interface that we suggest for your implementation. You can implement any features that you need for your application.
159
 
160
+ ```python
161
  from whisper_online import *
162
 
163
  src_lan = "en" # source language
 
185
 
186
  ### Server -- real-time from mic
187
 
188
+ `whisper_online_server.py` has the same model options as `whisper_online.py`, plus `--host` and `--port` for the TCP connection, and `--warmup-file`. See the help message (`-h` option).
189
 
190
  Client example:
191
 
 
226
  re-process confirmed sentence prefixes and skip them, making sure they don't
227
  overlap, and we limit the processing buffer window.
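For intuition, here is a minimal, self-contained sketch of the local agreement idea behind the prefix confirmation (it is not the repo's `HypothesisBuffer`, which additionally tracks timestamps and de-duplicates repeated prefixes): only the words on which two consecutive hypotheses agree are committed, while the unstable tail stays in the buffer.

```python
def confirm_by_local_agreement(prev_hypothesis, new_hypothesis):
    """Return (committed_words, pending_words) for the new hypothesis."""
    committed = []
    for prev_word, new_word in zip(prev_hypothesis, new_hypothesis):
        if prev_word != new_word:
            break
        committed.append(new_word)
    return committed, new_hypothesis[len(committed):]

# Example: the second update confirms "hello world"; the rest remains tentative.
prev = ["hello", "world", "thus"]
new = ["hello", "world", "this", "is"]
print(confirm_by_local_agreement(prev, new))  # (['hello', 'world'], ['this', 'is'])
```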
228
 
 
 
229
  ### Performance evaluation
230
 
231
  [See the paper.](http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-demo/cdrom/pdf/2023.ijcnlp-demo.3.pdf)
232
 
233
+ ### Contributions
234
+
235
+ Contributions are welcome. We acknowledge especially:
236
+
237
+ - [The GitHub contributors](https://github.com/ufal/whisper_streaming/graphs/contributors) for their pull requests with new features and bugfixes.
238
+ - [Nice explanation video](https://www.youtube.com/watch?v=_spinzpEeFM) -- published on 31st March 2024, note that newer updates are not included.
239
+ - [The translation of this repo into Chinese.](https://github.com/Gloridust/whisper_streaming_CN)
240
+ - [Ondřej Plátek](https://opla.cz/) for the paper pre-review.
241
+ - [Peter Polák](https://ufal.mff.cuni.cz/peter-polak) for the original idea.
242
+ - The UEDIN team of the [ELITR project](https://elitr.eu) for the original line_packet.py.
243
+
244
 
245
  ## Contact
246
 
line_packet.py CHANGED
@@ -2,8 +2,6 @@
2
 
3
  """Functions for sending and receiving individual lines of text over a socket.
4
 
5
- Used by marian-server-server.py to communicate with the Marian worker.
6
-
7
  A line is transmitted using one or more fixed-size packets of UTF-8 bytes
8
  containing:
9
 
@@ -11,6 +9,7 @@ containing:
11
 
12
  - Zero or more \0 bytes as required to pad the packet to PACKET_SIZE
13
 
 
14
  """
15
 
16
  PACKET_SIZE = 65536
 
2
 
3
  """Functions for sending and receiving individual lines of text over a socket.
4
 
 
 
5
  A line is transmitted using one or more fixed-size packets of UTF-8 bytes
6
  containing:
7
 
 
9
 
10
  - Zero or more \0 bytes as required to pad the packet to PACKET_SIZE
11
 
12
+ Originally from the UEDIN team of the ELITR project.
13
  """
14
 
15
  PACKET_SIZE = 65536
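
To make the framing described in the docstring concrete, here is a small, hedged sketch of a sender and receiver that follow that packet format; `send_line_sketch` and `receive_line_sketch` are illustrative names, not the actual helpers defined in `line_packet.py`.

```python
import socket

PACKET_SIZE = 65536  # same constant as line_packet.py

def send_line_sketch(conn: socket.socket, line: str) -> None:
    """Send one UTF-8 text line, padded with NUL bytes to a multiple of PACKET_SIZE."""
    data = line.encode("utf-8")
    n_packets = max(1, -(-len(data) // PACKET_SIZE))  # ceil division, at least one packet
    conn.sendall(data.ljust(n_packets * PACKET_SIZE, b"\0"))

def receive_line_sketch(conn: socket.socket) -> str:
    """Receive one full packet and strip the trailing NUL padding."""
    chunks, remaining = [], PACKET_SIZE
    while remaining:
        chunk = conn.recv(remaining)
        if not chunk:                 # connection closed
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks).rstrip(b"\0").decode("utf-8")
```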
whisper_online.py CHANGED
@@ -4,12 +4,17 @@ import numpy as np
4
  import librosa
5
  from functools import lru_cache
6
  import time
7
- import datetime
8
 
9
 
10
  @lru_cache
11
  def load_audio(fname):
12
- a, _ = librosa.load(fname, sr=16000)
13
  return a
14
 
15
  def load_audio_chunk(fname, beg, end):
@@ -57,10 +62,11 @@ class WhisperTimestampedASR(ASRBase):
57
 
58
  def load_model(self, modelsize=None, cache_dir=None, model_dir=None):
59
  import whisper
 
60
  from whisper_timestamped import transcribe_timestamped
61
  self.transcribe_timestamped = transcribe_timestamped
62
  if model_dir is not None:
63
- print("ignoring model_dir, not implemented",file=self.logfile)
64
  return whisper.load_model(modelsize, download_root=cache_dir)
65
 
66
  def transcribe(self, audio, init_prompt=""):
@@ -99,8 +105,9 @@ class FasterWhisperASR(ASRBase):
99
 
100
  def load_model(self, modelsize=None, cache_dir=None, model_dir=None):
101
  from faster_whisper import WhisperModel
 
102
  if model_dir is not None:
103
- print(f"Loading whisper model from model_dir {model_dir}. modelsize and cache_dir parameters are not used.",file=self.logfile)
104
  model_size_or_path = model_dir
105
  elif modelsize is not None:
106
  model_size_or_path = modelsize
@@ -150,6 +157,93 @@ class FasterWhisperASR(ASRBase):
150
  self.transcribe_kargs["task"] = "translate"
151
 
152
 
153
 
154
  class HypothesisBuffer:
155
 
@@ -181,9 +275,11 @@ class HypothesisBuffer:
181
  c = " ".join([self.commited_in_buffer[-j][2] for j in range(1,i+1)][::-1])
182
  tail = " ".join(self.new[j-1][2] for j in range(1,i+1))
183
  if c == tail:
184
- print("removing last",i,"words:",file=self.logfile)
185
  for j in range(i):
186
- print("\t",self.new.pop(0),file=self.logfile)
 
 
187
  break
188
 
189
  def flush(self):
@@ -246,8 +342,6 @@ class OnlineASRProcessor:
246
  self.transcript_buffer.last_commited_time = self.buffer_time_offset
247
 
248
  self.commited = []
249
- self.last_chunked_at = 0
250
-
251
 
252
  def insert_audio_chunk(self, audio):
253
  self.audio_buffer = np.append(self.audio_buffer, audio)
@@ -257,7 +351,7 @@ class OnlineASRProcessor:
257
  "context" is the commited text that is inside the audio buffer. It is transcribed again and skipped. It is returned only for debugging and logging reasons.
258
  """
259
  k = max(0,len(self.commited)-1)
260
- while k > 0 and self.commited[k-1][1] > self.last_chunked_at:
261
  k -= 1
262
 
263
  p = self.commited[:k]
@@ -278,9 +372,9 @@ class OnlineASRProcessor:
278
  """
279
 
280
  prompt, non_prompt = self.prompt()
281
- print("PROMPT:", prompt, file=self.logfile)
282
- print("CONTEXT:", non_prompt, file=self.logfile)
283
- print(f"transcribing {len(self.audio_buffer)/self.SAMPLING_RATE:2.2f} seconds from {self.buffer_time_offset:2.2f}",file=self.logfile)
284
  res = self.asr.transcribe(self.audio_buffer, init_prompt=prompt)
285
 
286
  # transform to [(beg,end,"word1"), ...]
@@ -289,8 +383,10 @@ class OnlineASRProcessor:
289
  self.transcript_buffer.insert(tsw, self.buffer_time_offset)
290
  o = self.transcript_buffer.flush()
291
  self.commited.extend(o)
292
- print(">>>>COMPLETE NOW:",self.to_flush(o),file=self.logfile,flush=True)
293
- print("INCOMPLETE:",self.to_flush(self.transcript_buffer.complete()),file=self.logfile,flush=True)
 
 
294
 
295
  # there is a newly confirmed text
296
 
@@ -314,18 +410,18 @@ class OnlineASRProcessor:
314
  #while k>0 and self.commited[k][1] > l:
315
  # k -= 1
316
  #t = self.commited[k][1]
317
- print(f"chunking segment",file=self.logfile)
318
  #self.chunk_at(t)
319
 
320
- print(f"len of buffer now: {len(self.audio_buffer)/self.SAMPLING_RATE:2.2f}",file=self.logfile)
321
  return self.to_flush(o)
322
 
323
  def chunk_completed_sentence(self):
324
  if self.commited == []: return
325
- print(self.commited,file=self.logfile)
326
  sents = self.words_to_sentences(self.commited)
327
  for s in sents:
328
- print("\t\tSENT:",s,file=self.logfile)
329
  if len(sents) < 2:
330
  return
331
  while len(sents) > 2:
@@ -333,7 +429,7 @@ class OnlineASRProcessor:
333
  # we will continue with audio processing at this timestamp
334
  chunk_at = sents[-2][1]
335
 
336
- print(f"--- sentence chunked at {chunk_at:2.2f}",file=self.logfile)
337
  self.chunk_at(chunk_at)
338
 
339
  def chunk_completed_segment(self, res):
@@ -350,12 +446,12 @@ class OnlineASRProcessor:
350
  ends.pop(-1)
351
  e = ends[-2]+self.buffer_time_offset
352
  if e <= t:
353
- print(f"--- segment chunked at {e:2.2f}",file=self.logfile)
354
  self.chunk_at(e)
355
  else:
356
- print(f"--- last segment not within commited area",file=self.logfile)
357
  else:
358
- print(f"--- not enough segments to chunk",file=self.logfile)
359
 
360
 
361
 
@@ -368,7 +464,6 @@ class OnlineASRProcessor:
368
  cut_seconds = time - self.buffer_time_offset
369
  self.audio_buffer = self.audio_buffer[int(cut_seconds*self.SAMPLING_RATE):]
370
  self.buffer_time_offset = time
371
- self.last_chunked_at = time
372
 
373
  def words_to_sentences(self, words):
374
  """Uses self.tokenizer for sentence segmentation of words.
@@ -402,7 +497,7 @@ class OnlineASRProcessor:
402
  """
403
  o = self.transcript_buffer.complete()
404
  f = self.to_flush(o)
405
- print("last, noncommited:",f,file=self.logfile)
406
  self.buffer_time_offset += len(self.audio_buffer)/16000
407
  return f
408
 
@@ -443,7 +538,7 @@ def create_tokenizer(lan):
443
 
444
  # the following languages are in Whisper, but not in wtpsplit:
445
  if lan in "as ba bo br bs fo haw hr ht jw lb ln lo mi nn oc sa sd sn so su sw tk tl tt".split():
446
- print(f"{lan} code is not supported by wtpsplit. Going to use None lang_code option.", file=sys.stderr)
447
  lan = None
448
 
449
  from wtpsplit import WtP
@@ -463,14 +558,67 @@ def add_shared_args(parser):
463
  parser.add_argument('--model', type=str, default='large-v2', choices="tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large".split(","),help="Name size of the Whisper model to use (default: large-v2). The model is automatically downloaded from the model hub if not present in model cache dir.")
464
  parser.add_argument('--model_cache_dir', type=str, default=None, help="Overriding the default model cache dir where models downloaded from the hub are saved")
465
  parser.add_argument('--model_dir', type=str, default=None, help="Dir where Whisper model.bin and other files are saved. This option overrides --model and --model_cache_dir parameter.")
466
- parser.add_argument('--lan', '--language', type=str, default='en', help="Source language code, e.g. en,de,cs, or 'auto' for language detection.")
467
  parser.add_argument('--task', type=str, default='transcribe', choices=["transcribe","translate"],help="Transcribe or translate.")
468
- parser.add_argument('--backend', type=str, default="faster-whisper", choices=["faster-whisper", "whisper_timestamped"],help='Load only this backend for Whisper processing.')
469
  parser.add_argument('--vad', action="store_true", default=False, help='Use VAD = voice activity detection, with the default parameters.')
470
  parser.add_argument('--buffer_trimming', type=str, default="segment", choices=["sentence", "segment"],help='Buffer trimming strategy -- trim completed sentences marked with punctuation mark and detected by sentence segmenter, or the completed segments returned by Whisper. Sentence segmenter must be installed for "sentence" option.')
471
  parser.add_argument('--buffer_trimming_sec', type=float, default=15, help='Buffer trimming length threshold in seconds. If buffer length is longer, trimming sentence/segment is triggered.')
472
 
473
- ## main:
474
 
475
  if __name__ == "__main__":
476
 
@@ -488,55 +636,28 @@ if __name__ == "__main__":
488
  logfile = sys.stderr
489
 
490
  if args.offline and args.comp_unaware:
491
- print("No or one option from --offline and --comp_unaware are available, not both. Exiting.",file=logfile)
492
  sys.exit(1)
493
 
494
  audio_path = args.audio_path
495
 
496
  SAMPLING_RATE = 16000
497
  duration = len(load_audio(audio_path))/SAMPLING_RATE
498
- print("Audio duration is: %2.2f seconds" % duration, file=logfile)
499
-
500
- size = args.model
501
- language = args.lan
502
-
503
- t = time.time()
504
- print(f"Loading Whisper {size} model for {language}...",file=logfile,end=" ",flush=True)
505
-
506
- if args.backend == "faster-whisper":
507
- asr_cls = FasterWhisperASR
508
- else:
509
- asr_cls = WhisperTimestampedASR
510
-
511
- asr = asr_cls(modelsize=size, lan=language, cache_dir=args.model_cache_dir, model_dir=args.model_dir)
512
 
513
- if args.task == "translate":
514
- asr.set_translate_task()
515
- tgt_language = "en" # Whisper translates into English
516
- else:
517
- tgt_language = language # Whisper transcribes in this language
518
-
519
-
520
- e = time.time()
521
- print(f"done. It took {round(e-t,2)} seconds.",file=logfile)
522
-
523
- if args.vad:
524
- print("setting VAD filter",file=logfile)
525
- asr.use_vad()
526
-
527
-
528
  min_chunk = args.min_chunk_size
529
- if args.buffer_trimming == "sentence":
530
- tokenizer = create_tokenizer(tgt_language)
531
- else:
532
- tokenizer = None
533
- online = OnlineASRProcessor(asr,tokenizer,logfile=logfile,buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec))
534
-
535
 
536
  # load the audio into the LRU cache before we start the timer
537
  a = load_audio_chunk(audio_path,0,1)
538
 
539
- # warm up the ASR, because the very first transcribe takes much more time than the other
540
  asr.transcribe(a)
541
 
542
  beg = args.start_at
@@ -555,16 +676,16 @@ if __name__ == "__main__":
555
  print("%1.4f %1.0f %1.0f %s" % (now*1000, o[0]*1000,o[1]*1000,o[2]),file=logfile,flush=True)
556
  print("%1.4f %1.0f %1.0f %s" % (now*1000, o[0]*1000,o[1]*1000,o[2]),flush=True)
557
  else:
558
- print(o,file=logfile,flush=True)
 
559
 
560
  if args.offline: ## offline mode processing (for testing/debugging)
561
  a = load_audio(audio_path)
562
  online.insert_audio_chunk(a)
563
  try:
564
  o = online.process_iter()
565
- except AssertionError:
566
- print("assertion error",file=logfile)
567
- pass
568
  else:
569
  output_transcript(o)
570
  now = None
@@ -575,13 +696,13 @@ if __name__ == "__main__":
575
  online.insert_audio_chunk(a)
576
  try:
577
  o = online.process_iter()
578
- except AssertionError:
579
- print("assertion error",file=logfile)
580
  pass
581
  else:
582
  output_transcript(o, now=end)
583
 
584
- print(f"## last processed {end:.2f}s",file=logfile,flush=True)
585
 
586
  if end >= duration:
587
  break
@@ -607,13 +728,13 @@ if __name__ == "__main__":
607
 
608
  try:
609
  o = online.process_iter()
610
- except AssertionError:
611
- print("assertion error",file=logfile)
612
  pass
613
  else:
614
  output_transcript(o)
615
  now = time.time() - start
616
- print(f"## last processed {end:.2f} s, now is {now:.2f}, the latency is {now-end:.2f}",file=logfile,flush=True)
617
 
618
  if end >= duration:
619
  break
 
4
  import librosa
5
  from functools import lru_cache
6
  import time
7
+ import logging
8
 
9
+ import io
10
+ import soundfile as sf
11
+ import math
12
+
13
+ logger = logging.getLogger(__name__)
14
 
15
  @lru_cache
16
  def load_audio(fname):
17
+ a, _ = librosa.load(fname, sr=16000, dtype=np.float32)
18
  return a
19
 
20
  def load_audio_chunk(fname, beg, end):
 
62
 
63
  def load_model(self, modelsize=None, cache_dir=None, model_dir=None):
64
  import whisper
65
+ import whisper_timestamped
66
  from whisper_timestamped import transcribe_timestamped
67
  self.transcribe_timestamped = transcribe_timestamped
68
  if model_dir is not None:
69
+ logger.debug("ignoring model_dir, not implemented")
70
  return whisper.load_model(modelsize, download_root=cache_dir)
71
 
72
  def transcribe(self, audio, init_prompt=""):
 
105
 
106
  def load_model(self, modelsize=None, cache_dir=None, model_dir=None):
107
  from faster_whisper import WhisperModel
108
+ # logging.getLogger("faster_whisper").setLevel(logger.level)
109
  if model_dir is not None:
110
+ logger.debug(f"Loading whisper model from model_dir {model_dir}. modelsize and cache_dir parameters are not used.")
111
  model_size_or_path = model_dir
112
  elif modelsize is not None:
113
  model_size_or_path = modelsize
 
157
  self.transcribe_kargs["task"] = "translate"
158
 
159
 
160
+ class OpenaiApiASR(ASRBase):
161
+ """Uses OpenAI's Whisper API for audio transcription."""
162
+
163
+ def __init__(self, lan=None, temperature=0, logfile=sys.stderr):
164
+ self.logfile = logfile
165
+
166
+ self.modelname = "whisper-1"
167
+ self.original_language = None if lan == "auto" else lan # ISO-639-1 language code
168
+ self.response_format = "verbose_json"
169
+ self.temperature = temperature
170
+
171
+ self.load_model()
172
+
173
+ self.use_vad_opt = False
174
+
175
+ # reset the task in set_translate_task
176
+ self.task = "transcribe"
177
+
178
+ def load_model(self, *args, **kwargs):
179
+ from openai import OpenAI
180
+ self.client = OpenAI()
181
+
182
+ self.transcribed_seconds = 0 # for logging how many seconds were processed by API, to know the cost
183
+
184
+
185
+ def ts_words(self, segments):
186
+ no_speech_segments = []
187
+ if self.use_vad_opt:
188
+ for segment in segments.segments:
189
+ # TODO: threshold can be set from outside
190
+ if segment["no_speech_prob"] > 0.8:
191
+ no_speech_segments.append((segment.get("start"), segment.get("end")))
192
+
193
+ o = []
194
+ for word in segments.words:
195
+ start = word.get("start")
196
+ end = word.get("end")
197
+ if any(s[0] <= start <= s[1] for s in no_speech_segments):
198
+ # print("Skipping word", word.get("word"), "because it's in a no-speech segment")
199
+ continue
200
+ o.append((start, end, word.get("word")))
201
+ return o
202
+
203
+
204
+ def segments_end_ts(self, res):
205
+ return [s["end"] for s in res.words]
206
+
207
+ def transcribe(self, audio_data, prompt=None, *args, **kwargs):
208
+ # Write the audio data to a buffer
209
+ buffer = io.BytesIO()
210
+ buffer.name = "temp.wav"
211
+ sf.write(buffer, audio_data, samplerate=16000, format='WAV', subtype='PCM_16')
212
+ buffer.seek(0) # Reset buffer's position to the beginning
213
+
214
+ self.transcribed_seconds += math.ceil(len(audio_data)/16000) # it rounds up to the whole seconds
215
+
216
+ params = {
217
+ "model": self.modelname,
218
+ "file": buffer,
219
+ "response_format": self.response_format,
220
+ "temperature": self.temperature,
221
+ "timestamp_granularities": ["word", "segment"]
222
+ }
223
+ if self.task != "translate" and self.original_language:
224
+ params["language"] = self.original_language
225
+ if prompt:
226
+ params["prompt"] = prompt
227
+
228
+ if self.task == "translate":
229
+ proc = self.client.audio.translations
230
+ else:
231
+ proc = self.client.audio.transcriptions
232
+
233
+ # Process transcription/translation
234
+ transcript = proc.create(**params)
235
+ logger.debug(f"OpenAI API processed accumulated {self.transcribed_seconds} seconds")
236
+
237
+ return transcript
238
+
239
+ def use_vad(self):
240
+ self.use_vad_opt = True
241
+
242
+ def set_translate_task(self):
243
+ self.task = "translate"
244
+
245
+
246
+
247
 
248
  class HypothesisBuffer:
249
 
 
275
  c = " ".join([self.commited_in_buffer[-j][2] for j in range(1,i+1)][::-1])
276
  tail = " ".join(self.new[j-1][2] for j in range(1,i+1))
277
  if c == tail:
278
+ words = []
279
  for j in range(i):
280
+ words.append(repr(self.new.pop(0)))
281
+ words_msg = " ".join(words)
282
+ logger.debug(f"removing last {i} words: {words_msg}")
283
  break
284
 
285
  def flush(self):
 
342
  self.transcript_buffer.last_commited_time = self.buffer_time_offset
343
 
344
  self.commited = []
 
 
345
 
346
  def insert_audio_chunk(self, audio):
347
  self.audio_buffer = np.append(self.audio_buffer, audio)
 
351
  "context" is the commited text that is inside the audio buffer. It is transcribed again and skipped. It is returned only for debugging and logging reasons.
352
  """
353
  k = max(0,len(self.commited)-1)
354
+ while k > 0 and self.commited[k-1][1] > self.buffer_time_offset:
355
  k -= 1
356
 
357
  p = self.commited[:k]
 
372
  """
373
 
374
  prompt, non_prompt = self.prompt()
375
+ logger.debug(f"PROMPT: {prompt}")
376
+ logger.debug(f"CONTEXT: {non_prompt}")
377
+ logger.debug(f"transcribing {len(self.audio_buffer)/self.SAMPLING_RATE:2.2f} seconds from {self.buffer_time_offset:2.2f}")
378
  res = self.asr.transcribe(self.audio_buffer, init_prompt=prompt)
379
 
380
  # transform to [(beg,end,"word1"), ...]
 
383
  self.transcript_buffer.insert(tsw, self.buffer_time_offset)
384
  o = self.transcript_buffer.flush()
385
  self.commited.extend(o)
386
+ completed = self.to_flush(o)
387
+ logger.debug(f">>>>COMPLETE NOW: {completed}")
388
+ the_rest = self.to_flush(self.transcript_buffer.complete())
389
+ logger.debug(f"INCOMPLETE: {the_rest}")
390
 
391
  # there is a newly confirmed text
392
 
 
410
  #while k>0 and self.commited[k][1] > l:
411
  # k -= 1
412
  #t = self.commited[k][1]
413
+ logger.debug("chunking segment")
414
  #self.chunk_at(t)
415
 
416
+ logger.debug(f"len of buffer now: {len(self.audio_buffer)/self.SAMPLING_RATE:2.2f}")
417
  return self.to_flush(o)
418
 
419
  def chunk_completed_sentence(self):
420
  if self.commited == []: return
421
+ logger.debug(self.commited)
422
  sents = self.words_to_sentences(self.commited)
423
  for s in sents:
424
+ logger.debug(f"\t\tSENT: {s}")
425
  if len(sents) < 2:
426
  return
427
  while len(sents) > 2:
 
429
  # we will continue with audio processing at this timestamp
430
  chunk_at = sents[-2][1]
431
 
432
+ logger.debug(f"--- sentence chunked at {chunk_at:2.2f}")
433
  self.chunk_at(chunk_at)
434
 
435
  def chunk_completed_segment(self, res):
 
446
  ends.pop(-1)
447
  e = ends[-2]+self.buffer_time_offset
448
  if e <= t:
449
+ logger.debug(f"--- segment chunked at {e:2.2f}")
450
  self.chunk_at(e)
451
  else:
452
+ logger.debug(f"--- last segment not within commited area")
453
  else:
454
+ logger.debug(f"--- not enough segments to chunk")
455
 
456
 
457
 
 
464
  cut_seconds = time - self.buffer_time_offset
465
  self.audio_buffer = self.audio_buffer[int(cut_seconds*self.SAMPLING_RATE):]
466
  self.buffer_time_offset = time
 
467
 
468
  def words_to_sentences(self, words):
469
  """Uses self.tokenizer for sentence segmentation of words.
 
497
  """
498
  o = self.transcript_buffer.complete()
499
  f = self.to_flush(o)
500
+ logger.debug(f"last, noncommited: {f}")
501
  self.buffer_time_offset += len(self.audio_buffer)/16000
502
  return f
503
 
 
538
 
539
  # the following languages are in Whisper, but not in wtpsplit:
540
  if lan in "as ba bo br bs fo haw hr ht jw lb ln lo mi nn oc sa sd sn so su sw tk tl tt".split():
541
+ logger.debug(f"{lan} code is not supported by wtpsplit. Going to use None lang_code option.")
542
  lan = None
543
 
544
  from wtpsplit import WtP
 
558
  parser.add_argument('--model', type=str, default='large-v2', choices="tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large".split(","),help="Name size of the Whisper model to use (default: large-v2). The model is automatically downloaded from the model hub if not present in model cache dir.")
559
  parser.add_argument('--model_cache_dir', type=str, default=None, help="Overriding the default model cache dir where models downloaded from the hub are saved")
560
  parser.add_argument('--model_dir', type=str, default=None, help="Dir where Whisper model.bin and other files are saved. This option overrides --model and --model_cache_dir parameter.")
561
+ parser.add_argument('--lan', '--language', type=str, default='auto', help="Source language code, e.g. en,de,cs, or 'auto' for language detection.")
562
  parser.add_argument('--task', type=str, default='transcribe', choices=["transcribe","translate"],help="Transcribe or translate.")
563
+ parser.add_argument('--backend', type=str, default="faster-whisper", choices=["faster-whisper", "whisper_timestamped", "openai-api"],help='Load only this backend for Whisper processing.')
564
  parser.add_argument('--vad', action="store_true", default=False, help='Use VAD = voice activity detection, with the default parameters.')
565
  parser.add_argument('--buffer_trimming', type=str, default="segment", choices=["sentence", "segment"],help='Buffer trimming strategy -- trim completed sentences marked with punctuation mark and detected by sentence segmenter, or the completed segments returned by Whisper. Sentence segmenter must be installed for "sentence" option.')
566
  parser.add_argument('--buffer_trimming_sec', type=float, default=15, help='Buffer trimming length threshold in seconds. If buffer length is longer, trimming sentence/segment is triggered.')
567
+ parser.add_argument("-l", "--log-level", dest="log_level", choices=['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'], help="Set the log level", default='DEBUG')
568
+
569
+ def asr_factory(args, logfile=sys.stderr):
570
+ """
571
+ Creates and configures an ASR and ASR Online instance based on the specified backend and arguments.
572
+ """
573
+ backend = args.backend
574
+ if backend == "openai-api":
575
+ logger.debug("Using OpenAI API.")
576
+ asr = OpenaiApiASR(lan=args.lan)
577
+ else:
578
+ if backend == "faster-whisper":
579
+ asr_cls = FasterWhisperASR
580
+ else:
581
+ asr_cls = WhisperTimestampedASR
582
+
583
+ # Only for FasterWhisperASR and WhisperTimestampedASR
584
+ size = args.model
585
+ t = time.time()
586
+ logger.info(f"Loading Whisper {size} model for {args.lan}...")
587
+ asr = asr_cls(modelsize=size, lan=args.lan, cache_dir=args.model_cache_dir, model_dir=args.model_dir)
588
+ e = time.time()
589
+ logger.info(f"done. It took {round(e-t,2)} seconds.")
590
+
591
+ # Apply common configurations
592
+ if getattr(args, 'vad', False): # Checks if VAD argument is present and True
593
+ logger.info("Setting VAD filter")
594
+ asr.use_vad()
595
+
596
+ language = args.lan
597
+ if args.task == "translate":
598
+ asr.set_translate_task()
599
+ tgt_language = "en" # Whisper translates into English
600
+ else:
601
+ tgt_language = language # Whisper transcribes in this language
602
+
603
+ # Create the tokenizer
604
+ if args.buffer_trimming == "sentence":
605
+ tokenizer = create_tokenizer(tgt_language)
606
+ else:
607
+ tokenizer = None
608
+
609
+ # Create the OnlineASRProcessor
610
+ online = OnlineASRProcessor(asr,tokenizer,logfile=logfile,buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec))
611
+
612
+ return asr, online
613
+
614
+ def set_logging(args,logger,other="_server"):
615
+ logging.basicConfig(#format='%(name)s
616
+ format='%(levelname)s\t%(message)s')
617
+ logger.setLevel(args.log_level)
618
+ logging.getLogger("whisper_online"+other).setLevel(args.log_level)
619
+ # logging.getLogger("whisper_online_server").setLevel(args.log_level)
620
+
621
 
 
622
 
623
  if __name__ == "__main__":
624
 
 
636
  logfile = sys.stderr
637
 
638
  if args.offline and args.comp_unaware:
639
+ logger.error("No or one option from --offline and --comp_unaware are available, not both. Exiting.")
640
  sys.exit(1)
641
 
642
+ # if args.log_level:
643
+ # logging.basicConfig(format='whisper-%(levelname)s:%(name)s: %(message)s',
644
+ # level=getattr(logging, args.log_level))
645
+
646
+ set_logging(args,logger)
647
+
648
  audio_path = args.audio_path
649
 
650
  SAMPLING_RATE = 16000
651
  duration = len(load_audio(audio_path))/SAMPLING_RATE
652
+ logger.info("Audio duration is: %2.2f seconds" % duration)
653
 
654
+ asr, online = asr_factory(args, logfile=logfile)
655
  min_chunk = args.min_chunk_size
656
 
657
  # load the audio into the LRU cache before we start the timer
658
  a = load_audio_chunk(audio_path,0,1)
659
 
660
+ # warm up the ASR because the very first transcribe takes much more time than the other
661
  asr.transcribe(a)
662
 
663
  beg = args.start_at
 
676
  print("%1.4f %1.0f %1.0f %s" % (now*1000, o[0]*1000,o[1]*1000,o[2]),file=logfile,flush=True)
677
  print("%1.4f %1.0f %1.0f %s" % (now*1000, o[0]*1000,o[1]*1000,o[2]),flush=True)
678
  else:
679
+ # No text, so no output
680
+ pass
681
 
682
  if args.offline: ## offline mode processing (for testing/debugging)
683
  a = load_audio(audio_path)
684
  online.insert_audio_chunk(a)
685
  try:
686
  o = online.process_iter()
687
+ except AssertionError as e:
688
+ logger.error(f"assertion error: {repr(e)}")
 
689
  else:
690
  output_transcript(o)
691
  now = None
 
696
  online.insert_audio_chunk(a)
697
  try:
698
  o = online.process_iter()
699
+ except AssertionError as e:
700
+ logger.error(f"assertion error: {repr(e)}")
701
  pass
702
  else:
703
  output_transcript(o, now=end)
704
 
705
+ logger.debug(f"## last processed {end:.2f}s")
706
 
707
  if end >= duration:
708
  break
 
728
 
729
  try:
730
  o = online.process_iter()
731
+ except AssertionError as e:
732
+ logger.error(f"assertion error: {e}")
733
  pass
734
  else:
735
  output_transcript(o)
736
  now = time.time() - start
737
+ logger.debug(f"## last processed {end:.2f} s, now is {now:.2f}, the latency is {now-end:.2f}")
738
 
739
  if end >= duration:
740
  break
whisper_online_server.py CHANGED
@@ -4,6 +4,10 @@ from whisper_online import *
4
  import sys
5
  import argparse
6
  import os
7
  parser = argparse.ArgumentParser()
8
 
9
  # server options
@@ -11,11 +15,14 @@ parser.add_argument("--host", type=str, default='localhost')
11
  parser.add_argument("--port", type=int, default=43007)
12
  parser.add_argument('--vac', action="store_true", default=False, help='Use VAC = voice activity controller.')
13
  parser.add_argument('--vac-chunk-size', type=float, default=0.04, help='VAC sample size in seconds.')
 
 
14
 
15
  # options from whisper_online
16
  add_shared_args(parser)
17
  args = parser.parse_args()
18
 
 
19
 
20
  # setting whisper object by args
21
 
@@ -23,59 +30,22 @@ SAMPLING_RATE = 16000
23
 
24
  size = args.model
25
  language = args.lan
26
-
27
- t = time.time()
28
- print(f"Loading Whisper {size} model for {language}...",file=sys.stderr,end=" ",flush=True)
29
-
30
- if args.backend == "faster-whisper":
31
- from faster_whisper import WhisperModel
32
- asr_cls = FasterWhisperASR
33
- elif args.backend == "whisper_timestamped":
34
- import whisper
35
- from whisper_online import WhisperTimestampedASR
36
- asr_cls = WhisperTimestampedASR
37
- else:
38
- raise ValueError(f"Unknown {args.backend=}")
39
-
40
- asr = asr_cls(modelsize=size, lan=language, cache_dir=args.model_cache_dir, model_dir=args.model_dir)
41
-
42
- if args.task == "translate":
43
- asr.set_translate_task()
44
- tgt_language = "en"
45
- else:
46
- tgt_language = language
47
-
48
- print(f"done. It took {round(time.time()-t,2)} seconds.",file=sys.stderr)
49
-
50
- if args.vad:
51
- print("setting VAD filter",file=sys.stderr)
52
- asr.use_vad()
53
-
54
-
55
- if args.buffer_trimming == "sentence":
56
- tokenizer = create_tokenizer(tgt_language)
57
  else:
58
- tokenizer = None
59
- if not args.vac:
60
- from whisper_online import OnlineASRProcessor
61
- online = OnlineASRProcessor(asr,tokenizer,buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec))
62
- else:
63
- from whisper_online_vac import VACOnlineASRProcessor
64
- online = VACOnlineASRProcessor(args.min_chunk_size, asr,tokenizer,buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec))
65
-
66
-
67
- demo_audio_path = "cs-maji-2.16k.wav"
68
- if os.path.exists(demo_audio_path):
69
- # load the audio into the LRU cache before we start the timer
70
- a = load_audio_chunk(demo_audio_path,0,1)
71
-
72
- # TODO: it should be tested whether it's meaningful
73
- # warm up the ASR, because the very first transcribe takes much more time than the other
74
- asr.transcribe(a)
75
- else:
76
- print("Whisper is not warmed up",file=sys.stderr)
77
-
78
-
79
 
80
 
81
  ######### Server objects
@@ -83,9 +53,6 @@ else:
83
  import line_packet
84
  import socket
85
 
86
- import logging
87
-
88
-
89
  class Connection:
90
  '''it wraps conn object'''
91
  PACKET_SIZE = 32000*5*60 # 5 minutes # was: 65536
@@ -143,7 +110,7 @@ class ServerProcessor:
143
  break
144
  print("received audio:",len(raw_bytes), "bytes", raw_bytes[:10])
145
  sf = soundfile.SoundFile(io.BytesIO(raw_bytes), channels=1,endian="LITTLE",samplerate=SAMPLING_RATE, subtype="PCM_16",format="RAW")
146
- audio, _ = librosa.load(sf,sr=SAMPLING_RATE)
147
  out.append(audio)
148
  if not out:
149
  return None
@@ -174,7 +141,7 @@ class ServerProcessor:
174
  print("%1.0f %1.0f %s" % (beg,end,o[2]),flush=True,file=sys.stderr)
175
  return "%1.0f %1.0f %s" % (beg,end,o[2])
176
  else:
177
- print(o,file=sys.stderr,flush=True)
178
  return None
179
 
180
  def send_result(self, o):
@@ -188,14 +155,13 @@ class ServerProcessor:
188
  while True:
189
  a = self.receive_audio_chunk()
190
  if a is None:
191
- print("break here",file=sys.stderr)
192
  break
193
  self.online_asr_proc.insert_audio_chunk(a)
194
  o = online.process_iter()
195
  try:
196
  self.send_result(o)
197
  except BrokenPipeError:
198
- print("broken pipe -- connection closed?",file=sys.stderr)
199
  break
200
 
201
  # o = online.finish() # this should be working
@@ -203,23 +169,18 @@ class ServerProcessor:
203
 
204
 
205
 
206
-
207
- # Start logging.
208
- level = logging.INFO
209
- logging.basicConfig(level=level, format='whisper-server-%(levelname)s: %(message)s')
210
-
211
  # server loop
212
 
213
  with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
214
  s.bind((args.host, args.port))
215
  s.listen(1)
216
- logging.info('INFO: Listening on'+str((args.host, args.port)))
217
  while True:
218
  conn, addr = s.accept()
219
- logging.info('INFO: Connected to client on {}'.format(addr))
220
  connection = Connection(conn)
221
  proc = ServerProcessor(connection, online, args.min_chunk_size)
222
  proc.process()
223
  conn.close()
224
- logging.info('INFO: Connection to client closed')
225
- logging.info('INFO: Connection closed, terminating.')
 
4
  import sys
5
  import argparse
6
  import os
7
+ import logging
8
+ import numpy as np
9
+
10
+ logger = logging.getLogger(__name__)
11
  parser = argparse.ArgumentParser()
12
 
13
  # server options
 
15
  parser.add_argument("--port", type=int, default=43007)
16
  parser.add_argument('--vac', action="store_true", default=False, help='Use VAC = voice activity controller.')
17
  parser.add_argument('--vac-chunk-size', type=float, default=0.04, help='VAC sample size in seconds.')
18
+ parser.add_argument("--warmup-file", type=str, dest="warmup_file",
19
+ help="The path to a speech audio wav file to warm up Whisper so that the very first chunk processing is fast. It can be e.g. https://github.com/ggerganov/whisper.cpp/raw/master/samples/jfk.wav .")
20
 
21
  # options from whisper_online
22
  add_shared_args(parser)
23
  args = parser.parse_args()
24
 
25
+ set_logging(args,logger,other="")
26
 
27
  # setting whisper object by args
28
 
 
30
 
31
  size = args.model
32
  language = args.lan
33
+ asr, online = asr_factory(args)
34
+ min_chunk = args.min_chunk_size
35
+
36
+ # warm up the ASR because the very first transcribe takes more time than the others.
37
+ # Test results in https://github.com/ufal/whisper_streaming/pull/81
38
+ msg = "Whisper is not warmed up. The first chunk processing may take longer."
39
+ if args.warmup_file:
40
+ if os.path.isfile(args.warmup_file):
41
+ a = load_audio_chunk(args.warmup_file,0,1)
42
+ asr.transcribe(a)
43
+ logger.info("Whisper is warmed up.")
44
+ else:
45
+ logger.critical("The warm up file is not available. "+msg)
46
+ sys.exit(1)
47
  else:
48
+ logger.warning(msg)
49
 
50
 
51
  ######### Server objects
 
53
  import line_packet
54
  import socket
55
 
 
 
 
56
  class Connection:
57
  '''it wraps conn object'''
58
  PACKET_SIZE = 32000*5*60 # 5 minutes # was: 65536
 
110
  break
111
  print("received audio:",len(raw_bytes), "bytes", raw_bytes[:10])
112
  sf = soundfile.SoundFile(io.BytesIO(raw_bytes), channels=1,endian="LITTLE",samplerate=SAMPLING_RATE, subtype="PCM_16",format="RAW")
113
+ audio, _ = librosa.load(sf,sr=SAMPLING_RATE,dtype=np.float32)
114
  out.append(audio)
115
  if not out:
116
  return None
 
141
  print("%1.0f %1.0f %s" % (beg,end,o[2]),flush=True,file=sys.stderr)
142
  return "%1.0f %1.0f %s" % (beg,end,o[2])
143
  else:
144
+ logger.debug("No text in this segment")
145
  return None
146
 
147
  def send_result(self, o):
 
155
  while True:
156
  a = self.receive_audio_chunk()
157
  if a is None:
 
158
  break
159
  self.online_asr_proc.insert_audio_chunk(a)
160
  o = online.process_iter()
161
  try:
162
  self.send_result(o)
163
  except BrokenPipeError:
164
+ logger.info("broken pipe -- connection closed?")
165
  break
166
 
167
  # o = online.finish() # this should be working
 
169
 
170
 
171
 
 
172
  # server loop
173
 
174
  with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
175
  s.bind((args.host, args.port))
176
  s.listen(1)
177
+ logger.info('Listening on '+str((args.host, args.port)))
178
  while True:
179
  conn, addr = s.accept()
180
+ logger.info('Connected to client on {}'.format(addr))
181
  connection = Connection(conn)
182
  proc = ServerProcessor(connection, online, args.min_chunk_size)
183
  proc.process()
184
  conn.close()
185
+ logger.info('Connection to client closed')
186
+ logger.info('Connection closed, terminating.')
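
The server above reads raw little-endian 16 kHz mono PCM_16 audio from the TCP connection and writes result lines back. As a hedged, illustrative sketch (not the repository's documented client example), a minimal Python client could stream a wav file to it like this; `input.wav` is a placeholder:

```python
import socket
import numpy as np
import librosa

HOST, PORT = "localhost", 43007                   # server defaults from the options above
CHUNK_SECONDS = 1.0

# Load and resample to the 16 kHz mono float32 format the server expects,
# then convert to raw little-endian 16-bit PCM.
audio, _ = librosa.load("input.wav", sr=16000, dtype=np.float32)
pcm16 = (audio * 32767).astype("<i2").tobytes()
bytes_per_chunk = int(16000 * CHUNK_SECONDS) * 2  # 2 bytes per PCM_16 sample

with socket.create_connection((HOST, PORT)) as s:
    s.settimeout(0.5)
    # Note: this pushes the file faster than real time; a live client would pace
    # the chunks (e.g. time.sleep(CHUNK_SECONDS) between sends).
    for i in range(0, len(pcm16), bytes_per_chunk):
        s.sendall(pcm16[i:i + bytes_per_chunk])
        try:
            reply = s.recv(65536)
            if reply:
                print(reply.rstrip(b"\0").decode("utf-8", errors="replace").strip())
        except socket.timeout:
            pass                                  # no confirmed text for this chunk yet
```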