Quentin Fuxa committed on
Commit 25ca243 · 2 Parent(s): 4ae7cba 61170f4

Merge pull request #103 from QuentinFuxa/readme

Files changed (1):
  README.md +195 -91
README.md CHANGED
@@ -1,43 +1,69 @@
  <h1 align="center">WhisperLiveKit</h1>
- <p align="center"><b>Real-time, Fully Local Whisper's Speech-to-Text and Speaker Diarization</b></p>

  <p align="center">
- <img alt="PyPI Version" src="https://img.shields.io/pypi/v/whisperlivekit?color=g">
- <img alt="PyPI Downloads" src="https://static.pepy.tech/personalized-badge/whisperlivekit">
- <img alt="Python Versions" src="https://img.shields.io/badge/python-3.9%20%7C%203.10%20%7C%203.11%20%7C%203.12-dark_green">
  </p>

- This project is based on [Whisper Streaming](https://github.com/ufal/whisper_streaming) and lets you transcribe audio directly from your browser. Simply launch the local server and grant microphone access. Everything runs locally on your machine ✨

  <p align="center">
- <img src="https://raw.githubusercontent.com/QuentinFuxa/WhisperLiveKit/refs/heads/main/demo.png" alt="Demo Screenshot" width="730">
  </p>

- ### Differences from [Whisper Streaming](https://github.com/ufal/whisper_streaming)

- #### ⚙️ **Core Improvements**
  - **Buffering Preview** – Displays unvalidated transcription segments
- - **Multi-User Support** – Handles multiple users simultaneously by decoupling backend and online asr
- - **MLX Whisper Backend** – Optimized for Apple Silicon for faster local processing.
- - **Confidence validation** – Immediately validate high-confidence tokens for faster inference

- #### 🎙️ **Speaker Identification**
- - **Real-Time Diarization** – Identify different speakers in real time using [Diart](https://github.com/juanmc2005/diart)

- #### 🌐 **Web & API**
- - **Built-in Web UI** – Simple raw html browser interface with no frontend setup required
- - **FastAPI WebSocket Server** – Real-time speech-to-text processing with async FFmpeg streaming.
- - **JavaScript Client** – Ready-to-use MediaRecorder implementation for seamless client-side integration.

- ## Installation

- ### Via pip (recommended)

  ```bash
  pip install whisperlivekit
  ```

- ### From source

  ```bash
  git clone https://github.com/QuentinFuxa/WhisperLiveKit
@@ -47,78 +73,86 @@ pip install -e .

  ### System Dependencies

- You need to install FFmpeg on your system:

  ```bash
- # For Ubuntu/Debian:
  sudo apt install ffmpeg

- # For macOS:
  brew install ffmpeg

- # For Windows:
  # Download from https://ffmpeg.org/download.html and add to PATH
  ```

  ### Optional Dependencies

  ```bash
- # If you want to use VAC (Voice Activity Controller). Useful for preventing hallucinations
  pip install torch
-
- # If you choose sentences as buffer trimming strategy
  pip install mosestokenizer wtpsplit
  pip install tokenize_uk # If you work with Ukrainian text

- # If you want to use diarization
  pip install diart

- # Optional backends. Default is faster-whisper
- pip install whisperlivekit[whisper] # Original Whisper backend
- pip install whisperlivekit[whisper-timestamped] # Whisper with improved timestamps
- pip install whisperlivekit[mlx-whisper] # Optimized for Apple Silicon
- pip install whisperlivekit[openai] # OpenAI API backend
  ```

- ### Get access to 🎹 pyannote models
-
- By default, diart is based on [pyannote.audio](https://github.com/pyannote/pyannote-audio) models from the [huggingface](https://huggingface.co/) hub.
- In order to use them, please follow these steps:

- 1) [Accept user conditions](https://huggingface.co/pyannote/segmentation) for the `pyannote/segmentation` model
- 2) [Accept user conditions](https://huggingface.co/pyannote/segmentation-3.0) for the newest `pyannote/segmentation-3.0` model
- 3) [Accept user conditions](https://huggingface.co/pyannote/embedding) for the `pyannote/embedding` model
- 4) Install [huggingface-cli](https://huggingface.co/docs/huggingface_hub/quick-start#install-the-hub-library) and [log in](https://huggingface.co/docs/huggingface_hub/quick-start#login) with your user access token (or provide it manually in diart CLI or API).

- ## Usage

- ### Using the command-line tool
-
- After installation, you can start the server using the provided command-line tool:

  ```bash
- whisperlivekit-server --host 0.0.0.0 --port 8000 --model tiny.en
- ```

- Then open your browser at `http://localhost:8000` (or your specified host and port).

- ### Using the library in your code

  ```python
  from whisperlivekit import WhisperLiveKit
  from whisperlivekit.audio_processor import AudioProcessor
  from fastapi import FastAPI, WebSocket

  kit = WhisperLiveKit(model="medium", diarization=True)
- app = FastAPI() # Create a FastAPI application

  @app.get("/")
  async def get():
-     return HTMLResponse(kit.web_interface()) # Use the built-in web interface

- async def handle_websocket_results(websocket, results_generator): # Sends results to frontend
      async for response in results_generator:
          await websocket.send_json(response)
@@ -127,57 +161,127 @@ async def websocket_endpoint(websocket: WebSocket):
      audio_processor = AudioProcessor()
      await websocket.accept()
      results_generator = await audio_processor.create_tasks()
-     websocket_task = asyncio.create_task(handle_websocket_results(websocket, results_generator))

-     while True:
-         message = await websocket.receive_bytes()
-         await audio_processor.process_audio(message)
  ```

- For a complete audio processing example, check [whisper_fastapi_online_server.py](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisper_fastapi_online_server.py)

- ## Configuration Options

- The following parameters are supported when initializing `WhisperLiveKit`:

- - `--host` and `--port` let you specify the server's IP/port.
- - `--min-chunk-size` sets the minimum chunk size for audio processing. Make sure this value aligns with the chunk size selected in the frontend. If not aligned, the system will work but may unnecessarily over-process audio data.
- - `--no-transcription`: Disable transcription (enabled by default)
- - `--diarization`: Enable speaker diarization (disabled by default)
- - `--confidence-validation`: Use confidence scores for faster validation. Transcription will be faster but punctuation might be less accurate (disabled by default)
- - `--warmup-file`: The path to a speech audio wav file to warm up Whisper so that the very first chunk processing is fast:
-   - If not set, uses https://github.com/ggerganov/whisper.cpp/raw/master/samples/jfk.wav.
-   - If False, no warmup is performed.
- - `--min-chunk-size` Minimum audio chunk size in seconds. It waits up to this time to do processing. If the processing takes shorter time, it waits, otherwise it processes the whole segment that was received by this time.
- - `--model`: Name size of the Whisper model to use (default: tiny). Suggested values: tiny.en, tiny, base.en, base, small.en, small, medium.en, medium, large-v1, large-v2, large-v3, large, large-v3-turbo. The model is automatically downloaded from the model hub if not present in model cache dir.
- - `--model_cache_dir`: Overriding the default model cache dir where models downloaded from the hub are saved
- - `--model_dir`: Dir where Whisper model.bin and other files are saved. This option overrides --model and --model_cache_dir parameter.
- - `--lan`, `--language`: Source language code, e.g. en,de,cs, or 'auto' for language detection.
- - `--task` {_transcribe, translate_}: Transcribe or translate. If translate is set, we recommend avoiding the _large-v3-turbo_ backend, as it [performs significantly worse](https://github.com/QuentinFuxa/whisper_streaming_web/issues/40#issuecomment-2652816533) than other models for translation.
- - `--backend` {_faster-whisper, whisper_timestamped, openai-api, mlx-whisper_}: Load only this backend for Whisper processing.
- - `--vac`: Use VAC = voice activity controller. Requires torch. (disabled by default)
- - `--vac-chunk-size`: VAC sample size in seconds.
- - `--no-vad`: Disable VAD (voice activity detection), which is enabled by default.
- - `--buffer_trimming` {_sentence, segment_}: Buffer trimming strategy -- trim completed sentences marked with punctuation mark and detected by sentence segmenter, or the completed segments returned by Whisper. Sentence segmenter must be installed for "sentence" option.
- - `--buffer_trimming_sec`: Buffer trimming length threshold in seconds. If buffer length is longer, trimming sentence/segment is triggered.

- ## How the Live Interface Works

- - Once you **allow microphone access**, the page records small chunks of audio using the **MediaRecorder** API in **webm/opus** format.
- - These chunks are sent over a **WebSocket** to the FastAPI endpoint at `/asr`.
- - The Python server decodes `.webm` chunks on the fly using **FFmpeg** and streams them into the **whisper streaming** implementation for transcription.
- - **Partial transcription** appears as soon as enough audio is processed. The "unvalidated" text is shown in **lighter or grey color** (i.e., an 'aperçu') to indicate it's still buffered partial output. Once Whisper finalizes that segment, it's displayed in normal text.

- ### Deploying to a Remote Server

- If you want to **deploy** this setup:

- 1. **Host the FastAPI app** behind a production-grade HTTP(S) server (like **Uvicorn + Nginx** or Docker). If you use HTTPS, use "wss" instead of "ws" in WebSocket URL.
- 2. The **HTML/JS page** can be served by the same FastAPI app or a separate static host.
- 3. Users open the page in **Chrome/Firefox** (any modern browser that supports MediaRecorder + WebSocket). No additional front-end libraries or frameworks are required.

- ## Acknowledgments

- This project builds upon the foundational work of the Whisper Streaming and Diart projects. We extend our gratitude to the original authors for their contributions.

  <h1 align="center">WhisperLiveKit</h1>

  <p align="center">
+ <img src="https://raw.githubusercontent.com/QuentinFuxa/WhisperLiveKit/refs/heads/main/demo.png" alt="WhisperLiveKit Demo" width="730">
  </p>

+ <p align="center"><b>Real-time, Fully Local Speech-to-Text with Speaker Diarization</b></p>

  <p align="center">
+ <a href="https://pypi.org/project/whisperlivekit/"><img alt="PyPI Version" src="https://img.shields.io/pypi/v/whisperlivekit?color=g"></a>
+ <a href="https://pepy.tech/project/whisperlivekit"><img alt="PyPI Downloads" src="https://static.pepy.tech/personalized-badge/whisperlivekit?period=total&units=international_system&left_color=grey&right_color=brightgreen&left_text=downloads"></a>
+ <a href="https://pypi.org/project/whisperlivekit/"><img alt="Python Versions" src="https://img.shields.io/badge/python-3.9%20%7C%203.10%20%7C%203.11%20%7C%203.12-dark_green"></a>
+ <a href="https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/github/license/QuentinFuxa/WhisperLiveKit?color=blue"></a>
  </p>

+ ## 🚀 Overview
+
+ This project is based on [Whisper Streaming](https://github.com/ufal/whisper_streaming) and lets you transcribe audio directly from your browser. WhisperLiveKit provides a complete backend for real-time speech transcription, along with an example frontend that you can customize for your own needs. Everything runs locally on your machine ✨
+
+ ### 🔄 Architecture
+
+ WhisperLiveKit consists of two main components:
+
+ - **Backend (Server)**: FastAPI WebSocket server that processes audio and provides real-time transcription
+ - **Frontend Example**: Basic HTML & JavaScript implementation that demonstrates how to capture and stream audio
+
+ > **Note**: We recommend installing this library on the server/backend. For the frontend, you can use and adapt the provided HTML template from [whisperlivekit/web/live_transcription.html](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/web/live_transcription.html) for your specific use case.
+
+ ### ✨ Key Features
+
+ - **🎙️ Real-time Transcription** - Convert speech to text instantly as you speak
+ - **👥 Speaker Diarization** - Identify different speakers in real time using [Diart](https://github.com/juanmc2005/diart)
+ - **🔒 Fully Local** - All processing happens on your machine; no data is sent to external servers
+ - **📱 Multi-User Support** - Handle multiple users simultaneously with a single backend/server
+
+ ### ⚙️ Differences from [Whisper Streaming](https://github.com/ufal/whisper_streaming)
+
+ - **Multi-User Support** – Handles multiple users simultaneously by decoupling backend and online ASR
+ - **MLX Whisper Backend** – Optimized for Apple Silicon (Mac) for faster local processing
  - **Buffering Preview** – Displays unvalidated transcription segments
+ - **Confidence Validation** – Immediately validate high-confidence tokens for faster inference

+ ## 📖 Quick Start
+
+ ```bash
+ # Install the package
+ pip install whisperlivekit
+
+ # Start the transcription server
+ whisperlivekit-server --model tiny.en
+
+ # Open your browser at http://localhost:8000
+ ```
+
+ That's it! Start speaking and watch your words appear on screen.
+
+ ## 🛠️ Installation Options
+
+ ### Install from PyPI (Recommended)

  ```bash
  pip install whisperlivekit
  ```

+ ### Install from Source

  ```bash
  git clone https://github.com/QuentinFuxa/WhisperLiveKit

  ### System Dependencies

+ FFmpeg is required:

  ```bash
+ # Ubuntu/Debian
  sudo apt install ffmpeg

+ # macOS
  brew install ffmpeg

+ # Windows
  # Download from https://ffmpeg.org/download.html and add to PATH
  ```

  ### Optional Dependencies

  ```bash
+ # Voice Activity Controller (prevents hallucinations)
  pip install torch
+
+ # Sentence-based buffer trimming
  pip install mosestokenizer wtpsplit
  pip install tokenize_uk # If you work with Ukrainian text

+ # Speaker diarization
  pip install diart

+ # Alternative Whisper backends (default is faster-whisper)
+ pip install whisperlivekit[whisper] # Original Whisper
+ pip install whisperlivekit[whisper-timestamped] # Improved timestamps
+ pip install whisperlivekit[mlx-whisper] # Apple Silicon optimization
+ pip install whisperlivekit[openai] # OpenAI API
  ```

+ ### 🎹 Pyannote Models Setup
+
+ For diarization, you need access to the pyannote.audio models:
+
+ 1. [Accept user conditions](https://huggingface.co/pyannote/segmentation) for the `pyannote/segmentation` model
+ 2. [Accept user conditions](https://huggingface.co/pyannote/segmentation-3.0) for the `pyannote/segmentation-3.0` model
+ 3. [Accept user conditions](https://huggingface.co/pyannote/embedding) for the `pyannote/embedding` model
+ 4. Log in with HuggingFace:
+ ```bash
+ pip install huggingface_hub
+ huggingface-cli login
+ ```

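If you prefer to authenticate from Python rather than the CLI, here is a minimal sketch using `huggingface_hub`; the token value is a placeholder you must replace with your own:

```python
# Programmatic alternative to `huggingface-cli login`.
# Requires: pip install huggingface_hub
from huggingface_hub import login

# Create a User Access Token at https://huggingface.co/settings/tokens
login(token="hf_xxx")  # placeholder; replace with your own token
```
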
+ ## 💻 Usage Examples
+
+ ### Command-line Interface
+
+ Start the transcription server with various options:

  ```bash
+ # Basic server with English model
+ whisperlivekit-server --model tiny.en
+
+ # Advanced configuration with diarization
+ whisperlivekit-server --host 0.0.0.0 --port 8000 --model medium --diarization --language auto
+ ```

+ ### Python API Integration (Backend)

  ```python
  from whisperlivekit import WhisperLiveKit
  from whisperlivekit.audio_processor import AudioProcessor
  from fastapi import FastAPI, WebSocket
+ import asyncio
+ from fastapi.responses import HTMLResponse

+ # Initialize components
+ app = FastAPI()
  kit = WhisperLiveKit(model="medium", diarization=True)

+ # Serve the web interface
  @app.get("/")
  async def get():
+     return HTMLResponse(kit.web_interface()) # Use the built-in web interface

+ # Process WebSocket connections
+ async def handle_websocket_results(websocket, results_generator):
      async for response in results_generator:
          await websocket.send_json(response)

      audio_processor = AudioProcessor()
      await websocket.accept()
      results_generator = await audio_processor.create_tasks()
+     websocket_task = asyncio.create_task(
+         handle_websocket_results(websocket, results_generator)
+     )
+
+     try:
+         while True:
+             message = await websocket.receive_bytes()
+             await audio_processor.process_audio(message)
+     except Exception as e:
+         print(f"WebSocket error: {e}")
+         websocket_task.cancel()
+ ```

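If you want to run the example above directly during development, one option (assuming `uvicorn` is installed) is to append a small launcher to the same module:

```python
# Development-only launcher for the FastAPI app defined above.
import uvicorn

if __name__ == "__main__":
    uvicorn.run(app, host="localhost", port=8000)
```

For production, see the deployment guide further below.
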
+ ### Frontend Implementation
+
+ The package includes a simple HTML/JavaScript implementation that you can adapt for your project. You can find it at [whisperlivekit/web/live_transcription.html](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/web/live_transcription.html), or retrieve it with:
+
+ ```python
+ kit.web_interface()
  ```

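For instance, to write the bundled page to a file so it can be customized and hosted separately, a small sketch (this assumes `web_interface()` returns the page as a string, as its use with `HTMLResponse` above suggests):

```python
# Dump the bundled demo page to disk for customization or static hosting.
from whisperlivekit import WhisperLiveKit

kit = WhisperLiveKit(model="tiny")
with open("live_transcription.html", "w", encoding="utf-8") as f:
    f.write(kit.web_interface())
```
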
+ ## ⚙️ Configuration Reference
+
+ WhisperLiveKit offers extensive configuration options:
+
+ | Parameter | Description | Default |
+ |-----------|-------------|---------|
+ | `--host` | Server host address | `localhost` |
+ | `--port` | Server port | `8000` |
+ | `--model` | Whisper model size | `tiny` |
+ | `--language` | Source language code or `auto` | `en` |
+ | `--task` | `transcribe` or `translate` | `transcribe` |
+ | `--backend` | Processing backend | `faster-whisper` |
+ | `--diarization` | Enable speaker identification | `False` |
+ | `--confidence-validation` | Use confidence scores for faster validation | `False` |
+ | `--min-chunk-size` | Minimum audio chunk size (seconds) | `1.0` |
+ | `--vac` | Use Voice Activity Controller | `False` |
+ | `--no-vad` | Disable Voice Activity Detection | `False` |
+ | `--buffer_trimming` | Buffer trimming strategy (`sentence` or `segment`) | `segment` |
+ | `--warmup-file` | Audio file path for model warmup | `jfk.wav` |
+
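These flags can also be driven from a script. A rough sketch that simply launches the documented CLI via `subprocess` (the flag values here are arbitrary examples, and this wrapper is my own illustration, not an API of the package):

```python
# Start the documented CLI with a few of the options from the table above.
import subprocess

subprocess.run([
    "whisperlivekit-server",
    "--host", "0.0.0.0",
    "--port", "8000",
    "--model", "medium",
    "--language", "auto",
    "--diarization",
    "--min-chunk-size", "1.0",
], check=True)
```
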
+ ## 🔧 How It Works
+
+ <p align="center">
+ <img src="https://raw.githubusercontent.com/QuentinFuxa/WhisperLiveKit/refs/heads/main/demo.png" alt="WhisperLiveKit in Action" width="500">
+ </p>
+
+ 1. **Audio Capture**: The browser's MediaRecorder API captures audio in webm/opus format
+ 2. **Streaming**: Audio chunks are sent to the server via WebSocket (see the client sketch after this list)
+ 3. **Processing**: The server decodes audio with FFmpeg and streams it into Whisper for transcription
+ 4. **Real-time Output**:
+    - Partial transcriptions appear immediately in light gray (the 'aperçu')
+    - Finalized text appears in normal color
+    - Different speakers are identified and highlighted (when diarization is enabled)

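To make the exchange concrete, here is a minimal Python client sketch. It assumes the server from the Quick Start is running, that the WebSocket route is mounted at `/asr` (as in the example server), that the `websockets` package is installed, and that `sample.webm` is any webm/opus recording you have on hand:

```python
# Minimal streaming client sketch: send webm/opus bytes, print JSON results.
import asyncio
import json
import websockets

async def send_audio(ws, path, chunk_size=16000):
    # Replay a pre-recorded file in small chunks, roughly like MediaRecorder would.
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            await ws.send(chunk)
            await asyncio.sleep(0.25)  # pace the stream a bit like live audio

async def print_results(ws):
    # The server pushes JSON results as they become available.
    async for message in ws:
        print(json.loads(message))

async def main():
    async with websockets.connect("ws://localhost:8000/asr") as ws:
        # Runs until the connection closes; stop with Ctrl+C.
        await asyncio.gather(send_audio(ws, "sample.webm"), print_results(ws))

asyncio.run(main())
```
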
+ ## 🚀 Deployment Guide
+
+ To deploy WhisperLiveKit in production:
+
+ 1. **Server Setup** (Backend): `your_app:app` below refers to the module that defines your FastAPI `app` (see the sketch after this list)
+ ```bash
+ # Install production ASGI server
+ pip install uvicorn gunicorn
+
+ # Launch with multiple workers
+ gunicorn -k uvicorn.workers.UvicornWorker -w 4 your_app:app
+ ```
+
+ 2. **Frontend Integration**:
+    - Host your customized version of the example HTML/JS in your web application
+    - Ensure the WebSocket connection points to your server's address
+
+ 3. **Nginx Configuration** (recommended for production):
+ ```nginx
+ server {
+     listen 80;
+     server_name your-domain.com;
+
+     location / {
+         proxy_pass http://localhost:8000;
+         proxy_set_header Upgrade $http_upgrade;
+         proxy_set_header Connection "upgrade";
+         proxy_set_header Host $host;
+     }
+ }
+ ```
+
+ 4. **HTTPS Support**: For secure deployments, use "wss://" instead of "ws://" in the WebSocket URL

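As a minimal sketch of the `your_app:app` target referenced in step 1, reusing only the Python API shown earlier (the file name and model choice are placeholders; add the WebSocket endpoint from the usage example to get actual transcription):

```python
# your_app.py - module served by `gunicorn ... your_app:app`
from fastapi import FastAPI
from fastapi.responses import HTMLResponse
from whisperlivekit import WhisperLiveKit

kit = WhisperLiveKit(model="medium", diarization=True)
app = FastAPI()

@app.get("/")
async def index():
    # Serve the bundled demo page
    return HTMLResponse(kit.web_interface())
```
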
+ ## 🔮 Use Cases
+
+ - **Meeting Transcription**: Capture discussions in real time
+ - **Accessibility Tools**: Help hearing-impaired users follow conversations
+ - **Content Creation**: Transcribe podcasts or videos automatically
+ - **Customer Service**: Transcribe support calls with speaker identification
+
+ ## 🤝 Contributing
+
+ Contributions are welcome! Here's how to get started:
+
+ 1. Fork the repository
+ 2. Create a feature branch: `git checkout -b feature/amazing-feature`
+ 3. Commit your changes: `git commit -m 'Add amazing feature'`
+ 4. Push to your branch: `git push origin feature/amazing-feature`
+ 5. Open a Pull Request
+
+ ## 🙏 Acknowledgments
+
+ This project builds upon the foundational work of:
+ - [Whisper Streaming](https://github.com/ufal/whisper_streaming)
+ - [Diart](https://github.com/juanmc2005/diart)
+ - [OpenAI Whisper](https://github.com/openai/whisper)
+
+ We extend our gratitude to the original authors for their contributions.
+
+ ## 📄 License
+
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+
+ ## 🔗 Links
+
+ - [GitHub Repository](https://github.com/QuentinFuxa/WhisperLiveKit)
+ - [PyPI Package](https://pypi.org/project/whisperlivekit/)
+ - [Issue Tracker](https://github.com/QuentinFuxa/WhisperLiveKit/issues)