Dominik Macháček committed on
Commit 52da121 · 1 Parent(s): 7edc534

cleaner code
Files changed (4):
  1. README.md +17 -9
  2. silero_vad.py +3 -1
  3. voice_activity_controller.py +0 -35
  4. whisper_online.py +1 -1
README.md CHANGED
@@ -36,8 +36,6 @@ Please, cite us. [ACL Anthology](https://aclanthology.org/2023.ijcnlp-demo.3/),
 
 1) ``pip install librosa soundfile`` -- audio processing library
 
-Note: for the VAD I need to `pip install torch torchaudio`.
-
 2) Whisper backend.
 
 Several alternative backends are integrated. The most recommended one is [faster-whisper](https://github.com/guillaumekln/faster-whisper) with GPU support. Follow their instructions for NVIDIA libraries -- we succeeded with CUDNN 8.5.0 and CUDA 11.7. Install with `pip install faster-whisper`.
@@ -51,7 +49,9 @@ For running with the openai-api backend, make sure that your [OpenAI api key](ht
 
 The backend is loaded only when chosen. The unused one does not have to be installed.
 
-3) Optional, not recommended: sentence segmenter (aka sentence tokenizer)
+3) For the voice activity controller: `pip install torch torchaudio`. Optional, but highly recommended.
+
+4) Optional, not recommended: sentence segmenter (aka sentence tokenizer)
 
 Two buffer trimming options are integrated and evaluated. They have impact on
 the quality and latency. The default "segment" option performs better according
@@ -78,8 +78,10 @@ In case of installation issues of opus-fast-mosestokenizer, especially on Window
 ### Real-time simulation from audio file
 
 ```
-usage: whisper_online.py [-h] [--min-chunk-size MIN_CHUNK_SIZE] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large}] [--model_cache_dir MODEL_CACHE_DIR] [--model_dir MODEL_DIR] [--lan LAN] [--task {transcribe,translate}]
-                         [--backend {faster-whisper,whisper_timestamped,openai-api}] [--vad] [--buffer_trimming {sentence,segment}] [--buffer_trimming_sec BUFFER_TRIMMING_SEC] [--start_at START_AT] [--offline] [--comp_unaware]
+whisper_online.py -h
+usage: whisper_online.py [-h] [--min-chunk-size MIN_CHUNK_SIZE] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large}] [--model_cache_dir MODEL_CACHE_DIR]
+                         [--model_dir MODEL_DIR] [--lan LAN] [--task {transcribe,translate}] [--backend {faster-whisper,whisper_timestamped,openai-api}] [--vac] [--vac-chunk-size VAC_CHUNK_SIZE] [--vad]
+                         [--buffer_trimming {sentence,segment}] [--buffer_trimming_sec BUFFER_TRIMMING_SEC] [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--start_at START_AT] [--offline] [--comp_unaware]
                          audio_path
 
 positional arguments:
@@ -88,7 +90,8 @@ positional arguments:
 options:
   -h, --help            show this help message and exit
   --min-chunk-size MIN_CHUNK_SIZE
-                        Minimum audio chunk size in seconds. It waits up to this time to do processing. If the processing takes shorter time, it waits, otherwise it processes the whole segment that was received by this time.
+                        Minimum audio chunk size in seconds. It waits up to this time to do processing. If the processing takes shorter time, it waits, otherwise it processes the whole segment that was
+                        received by this time.
   --model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large}
                         Name size of the Whisper model to use (default: large-v2). The model is automatically downloaded from the model hub if not present in model cache dir.
   --model_cache_dir MODEL_CACHE_DIR
@@ -101,11 +104,17 @@ options:
                         Transcribe or translate.
   --backend {faster-whisper,whisper_timestamped,openai-api}
                         Load only this backend for Whisper processing.
+  --vac                 Use VAC = voice activity controller. Recommended. Requires torch.
+  --vac-chunk-size VAC_CHUNK_SIZE
+                        VAC sample size in seconds.
   --vad                 Use VAD = voice activity detection, with the default parameters.
   --buffer_trimming {sentence,segment}
-                        Buffer trimming strategy -- trim completed sentences marked with punctuation mark and detected by sentence segmenter, or the completed segments returned by Whisper. Sentence segmenter must be installed for "sentence" option.
+                        Buffer trimming strategy -- trim completed sentences marked with punctuation mark and detected by sentence segmenter, or the completed segments returned by Whisper. Sentence segmenter
+                        must be installed for "sentence" option.
   --buffer_trimming_sec BUFFER_TRIMMING_SEC
                        Buffer trimming length threshold in seconds. If buffer length is longer, trimming sentence/segment is triggered.
+  -l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
+                        Set the log level.
   --start_at START_AT   Start processing audio at this time.
   --offline             Offline mode.
   --comp_unaware        Computationally unaware simulation.
@@ -240,11 +249,10 @@ Contributions are welcome. We acknowledge especially:
 - [Ondřej Plátek](https://opla.cz/) for the paper pre-review.
 - [Peter Polák](https://ufal.mff.cuni.cz/peter-polak) for the original idea.
 - The UEDIN team of the [ELITR project](https://elitr.eu) for the original line_packet.py.
+- The Silero Team for their VAD [model](https://github.com/snakers4/silero-vad) and [VADIterator](https://github.com/ufal/whisper_streaming/main/silero_vad.py).
 
 
 ## Contact
 
 Dominik Macháček, [email protected]
 
-
-
 
 
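For orientation beyond the diff: a minimal invocation that exercises the new `--vac` option could look like this (the audio file name is a placeholder, not part of this commit):

```
python3 whisper_online.py audio_16k.wav --language en --min-chunk-size 1 --vac > out.txt
```
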
silero_vad.py CHANGED
@@ -1,8 +1,10 @@
 import torch
 
-# this is copypasted from silero-vad's utils_vad.py:
+# This is copied from silero-vad's utils_vad.py:
 # https://github.com/snakers4/silero-vad/blob/f6b1294cb27590fb2452899df98fb234dfef1134/utils_vad.py#L340
 
+# Their licence is MIT, same as ours: https://github.com/snakers4/silero-vad/blob/f6b1294cb27590fb2452899df98fb234dfef1134/LICENSE
+
 class VADIterator:
     def __init__(self,
                  model,
voice_activity_controller.py DELETED
@@ -1,35 +0,0 @@
-import torch
-from silero_vad import VADIterator
-import time
-
-class VoiceActivityController:
-    SAMPLING_RATE = 16000
-    def __init__(self):
-        self.model, _ = torch.hub.load(
-            repo_or_dir='snakers4/silero-vad',
-            model='silero_vad'
-        )
-        # we use the default options: 500ms silence, etc.
-        self.iterator = VADIterator(self.model)
-
-    def reset(self):
-        self.iterator.reset_states()
-
-    def __call__(self, audio):
-        '''
-        audio: audio chunk in the current np.array format
-        returns:
-        - { 'start': time_frame } ... when voice start was detected. time_frame is number of frame (can be converted to seconds)
-        - { 'end': time_frame } ... when voice end is detected
-        - None ... when no change detected by current chunk
-        '''
-        x = audio
-        # if not torch.is_tensor(x):
-        #     try:
-        #         x = torch.Tensor(x)
-        #     except:
-        #         raise TypeError("Audio cannot be casted to tensor. Cast it manually")
-        t = time.time()
-        a = self.iterator(x)
-        print("VAD took ",time.time()-t,"seconds")
-        return a
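
The deleted controller was a thin wrapper around the `VADIterator` kept in silero_vad.py, so its behaviour is easy to reproduce inline. A minimal sketch of equivalent usage, based on the deleted code above (the chunk size and the silent dummy signal are illustrative assumptions, not from this commit):

```python
import numpy as np
import torch

from silero_vad import VADIterator

SAMPLING_RATE = 16000

# Load the Silero VAD model from torch.hub, exactly as the deleted controller did.
model, _ = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad')
iterator = VADIterator(model)  # default options, as in the deleted controller

# Stream the audio in small fixed-size chunks; 0.04 s matches the default --vac-chunk-size.
chunk_samples = int(0.04 * SAMPLING_RATE)  # 640 samples
audio = np.zeros(10 * SAMPLING_RATE, dtype=np.float32)  # placeholder: 10 s of silence

for start in range(0, len(audio), chunk_samples):
    result = iterator(audio[start:start + chunk_samples])
    # result is {'start': frame}, {'end': frame}, or None, as the deleted docstring describes.
    if result is not None:
        print(result)

iterator.reset_states()  # reset between utterances / audio files
```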
whisper_online.py CHANGED
@@ -656,7 +656,7 @@ def add_shared_args(parser):
     parser.add_argument('--lan', '--language', type=str, default='auto', help="Source language code, e.g. en,de,cs, or 'auto' for language detection.")
     parser.add_argument('--task', type=str, default='transcribe', choices=["transcribe","translate"],help="Transcribe or translate.")
     parser.add_argument('--backend', type=str, default="faster-whisper", choices=["faster-whisper", "whisper_timestamped", "openai-api"],help='Load only this backend for Whisper processing.')
-    parser.add_argument('--vac', action="store_true", default=False, help='Use VAC = voice activity controller.')
+    parser.add_argument('--vac', action="store_true", default=False, help='Use VAC = voice activity controller. Recommended. Requires torch.')
     parser.add_argument('--vac-chunk-size', type=float, default=0.04, help='VAC sample size in seconds.')
     parser.add_argument('--vad', action="store_true", default=False, help='Use VAD = voice activity detection, with the default parameters.')
     parser.add_argument('--buffer_trimming', type=str, default="segment", choices=["sentence", "segment"],help='Buffer trimming strategy -- trim completed sentences marked with punctuation mark and detected by sentence segmenter, or the completed segments returned by Whisper. Sentence segmenter must be installed for "sentence" option.')
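
As a self-contained illustration of how the two VAC flags surface to a caller of `add_shared_args`, here is a sketch with just those arguments; the sample-count conversion at the end is a hypothetical consumer, not code from this commit:

```python
import argparse

parser = argparse.ArgumentParser()
# The two VAC arguments, as added in add_shared_args above.
parser.add_argument('--vac', action="store_true", default=False,
                    help='Use VAC = voice activity controller. Recommended. Requires torch.')
parser.add_argument('--vac-chunk-size', type=float, default=0.04,
                    help='VAC sample size in seconds.')

args = parser.parse_args(['--vac'])

# At 16 kHz, the default 0.04 s VAC chunk corresponds to 640 samples.
SAMPLING_RATE = 16000
print(args.vac, int(args.vac_chunk_size * SAMPLING_RATE))  # True 640
```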