Luigi committed on
Commit
53fe0cb
·
1 Parent(s): 2a31e9c

update readme

Files changed (1)
  1. README.md +64 -55
README.md CHANGED
@@ -11,21 +11,18 @@ short_description: Streaming zipformer
 
 # 🎙️ Real-Time Streaming ASR Demo (FastAPI + Sherpa-ONNX)
 
- This project demonstrates a real-time speech-to-text (ASR) web application using:
 
 * 🧠 [Sherpa-ONNX](https://github.com/k2-fsa/sherpa-onnx) streaming Zipformer model
 * 🚀 FastAPI backend with WebSocket support
- * 🧑‍💻 Hugging Face Spaces (Docker CPU-only deployment)
- * 🌐 Browser-based microphone input + UI in vanilla HTML/JS
-
- ---
 
 ## 📦 Model
 
- This app uses the following **bilingual (Chinese-English)** streaming model:
 
- **🔗 Model Source:**
- [Zipformer Small Bilingual zh-en (2023-02-16)](https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-transducer/zipformer-transducer-models.html#sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16-bilingual-chinese-english)
 
 Model files (ONNX) are located under:
 
@@ -33,55 +30,88 @@ Model files (ONNX) are located under:
 models/zipformer_bilingual/
 ```
 
- ---
-
 ## 🚀 Features
 
- * 🎤 Real-time microphone input (captured in browser)
- * 🔁 WebSocket-based streaming inference
- * 💬 Partial + final transcription
- * 🌏 Automatic conversion to **Traditional Chinese** using OpenCC
- * 📊 Real-time volume indicator
- * ☁️ Deployed on Hugging Face Spaces (CPU only)
 
- ---
 
 ## 🧪 Local Development
 
- ### 1. Install dependencies
 
 ```bash
 pip install -r requirements.txt
 ```
 
- ### 2. Run the app locally
 
 ```bash
 uvicorn app.main:app --reload --host 0.0.0.0 --port 8501
 ```
 
- Then open: [http://localhost:8501](http://localhost:8501)
-
- ---
-
- ## 🐳 Deploy on Hugging Face Spaces
 
- This repo includes a `Dockerfile` compatible with HF Spaces. It uses:
-
- * `uvicorn` for serving the FastAPI app
- * `opencc-python-reimplemented` for Simplified → Traditional Chinese
- * `pysoxr` or `scipy` for audio resampling (48kHz → 16kHz)
-
- ---
 
 ## 📁 Project Structure
 
 ```
 .
 ├── app
- │ ├── main.py # FastAPI + WebSocket
- │ ├── asr_worker.py # Sherpa inference + resampling + OpenCC
- │ └── static/index.html # Client-side mic UI
 ├── models/zipformer_bilingual/
 │ └── ... (onnx, tokens.txt)
 ├── requirements.txt
@@ -89,30 +119,9 @@ This repo includes a `Dockerfile` compatible with HF Spaces. It uses:
 └── README.md
 ```
 
- ---
-
 ## 🔧 Credits
 
 * [Sherpa-ONNX](https://github.com/k2-fsa/sherpa-onnx)
 * [OpenCC](https://github.com/BYVoid/OpenCC)
 * [FastAPI](https://fastapi.tiangolo.com/)
 * [Hugging Face Spaces](https://huggingface.co/docs/hub/spaces)
-
- ---
-
- ## 🗣 Languages Supported
-
- * 🇨🇳 Chinese (Simplified input, converted to Traditional)
- * 🇺🇸 English
-
- ---
-
- ## 🤝 Contributing
-
- PRs welcome! Feel free to fork this and adapt to other models or languages.
-
- ---
-
- ## 📜 License
-
- Apache 2.0
 
 
 # 🎙️ Real-Time Streaming ASR Demo (FastAPI + Sherpa-ONNX)
 
+ This project demonstrates a real-time speech-to-text (ASR) web application with:
 
 * 🧠 [Sherpa-ONNX](https://github.com/k2-fsa/sherpa-onnx) streaming Zipformer model
 * 🚀 FastAPI backend with WebSocket support
+ * 🎛️ Configurable browser-based UI using vanilla HTML/JS
+ * ☁️ Docker-compatible deployment (CPU-only) on Hugging Face Spaces
 
 ## 📦 Model
 
+ The app uses the bilingual (Chinese-English) streaming Zipformer model:
 
+ 🔗 **Model Source:** [Zipformer Small Bilingual zh-en (2023-02-16)](https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-transducer/zipformer-transducer-models.html#sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16-bilingual-chinese-english)
 
 Model files (ONNX) are located under:
 
 models/zipformer_bilingual/
 ```
 
 ## 🚀 Features
 
+ * 🎤 **Real-Time Microphone Input:** capture audio directly in the browser.
+ * 🎛️ **Recognition Settings:** select ASR model and precision; view supported languages and model size.
+ * 🔑 **Hotword Biasing:** input custom hotwords (one per line) and adjust boost score. See the [Sherpa-ONNX Hotwords Guide](https://k2-fsa.github.io/sherpa/onnx/hotwords/index.html).
+ * ⏱️ **Endpoint Detection:** configure silence-based rules (Rule 1 threshold, Rule 2 threshold, minimum utterance length) to control segmentation. See [Sherpa-NCNN Endpoint Detection](https://k2-fsa.github.io/sherpa/ncnn/endpoint.html).
+ * 📊 **Volume Meter:** real-time volume indicator based on RMS.
+ * 💬 **Streaming Transcription:** display partial (in red) and final (in green) results with automatic scrolling.
+ * 🛠️ **Debug Logging:** backend logs configuration steps and endpoint detection events.
+ * 🐳 **Deployment:** Dockerfile provided for CPU-only deployment on Hugging Face Spaces.
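The volume meter is driven by a simple RMS (root-mean-square) level computed over each incoming audio chunk. The real meter runs in the browser's JavaScript, but the same computation can be sketched in Python (`rms_level` is an illustrative helper, not part of the repo):

```python
import math

def rms_level(samples):
    """RMS level of one audio chunk (float samples in [-1.0, 1.0]).

    0.0 means silence; a full-scale square wave gives 1.0.
    """
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))
```

The UI typically maps this value (or its dB equivalent) onto a bar width to animate the meter.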
 
+ ## 🛠️ Configuration Guide
+
+ ### 🔑 Hotword Biasing Configuration
+
+ * **Hotwords List** (`hotwordsList`): Enter one hotword or phrase per line. These are words/phrases the ASR will preferentially recognize. For multilingual models, you can mix scripts according to your model's `modeling-unit` (e.g., `cjkchar+bpe`).
+ * **Boost Score** (`boostScore`): A global score applied at the token level for each matched hotword (range: `0.0`–`10.0`). You may also specify per-hotword scores inline in the list using `:`, for example:
+
+ ```
+ 语音识别 :3.5
+ 深度学习 :2.0
+ SPEECH RECOGNITION :1.5
+ ```
+ * **Decoding Method**: Ensure your model uses `modified_beam_search` (not the default `greedy_search`) to enable hotword biasing.
+ * **Applying**: Click **Apply Hotwords** in the UI to send the following JSON payload to the backend:
+
+ ```json
+ {
+ "type": "config",
+ "hotwords": ["..."],
+ "hotwordsScore": 2.0
+ }
+ ```
+
+ (For full details, see the [Sherpa-ONNX Hotwords Guide](https://k2-fsa.github.io/sherpa/onnx/hotwords/index.html).)
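The hotword format above (one phrase per line, optional inline `:score`) can be turned into `(phrase, score)` pairs with a few lines of Python. This is a sketch only — `parse_hotwords` is a hypothetical helper, and the actual parsing in `app/main.py` may differ:

```python
def parse_hotwords(text, default_score=2.0):
    """Parse the hotwords textarea into (phrase, score) pairs.

    Lines like 'FOO :1.5' carry a per-hotword score; lines without a
    numeric ':score' suffix fall back to the global boost score.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        phrase, sep, score = line.rpartition(":")
        if sep and phrase.strip():
            try:
                entries.append((phrase.strip(), float(score)))
                continue
            except ValueError:
                pass  # ':' present but no numeric score; keep whole line
        entries.append((line, default_score))
    return entries
```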
+
+ ### ⏱️ Endpoint Detection Configuration
+
+ The system supports three endpointing rules borrowed from Kaldi:
+
+ * **Rule 1** (`epRule1`): Minimum duration of trailing silence to trigger an endpoint, in **seconds** (default: `2.4`). Fires whether or not any token has been decoded.
+ * **Rule 2** (`epRule2`): Minimum duration of trailing silence to trigger an endpoint *only after* at least one token has been decoded, in **seconds** (default: `1.2`).
+ * **Rule 3** (`epRule3`): Maximum utterance length before forcing an endpoint, in **milliseconds** (default: `300`). Disable it by setting a very large value.
+ * **Applying**: Click **Apply Endpoint Config** in the UI to send the following JSON payload to the backend:
+
+ ```json
+ {
+ "type": "config",
+ "epRule1": 2.4,
+ "epRule2": 1.2,
+ "epRule3": 300
+ }
+ ```
+
+ (See the [Sherpa-NCNN Endpointing documentation](https://k2-fsa.github.io/sherpa/ncnn/endpoint.html).)
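Sherpa evaluates these rules internally; purely as an illustration of how the three rules combine (not the library's implementation — `is_endpoint` and its parameter names are hypothetical, defaults taken from the UI fields above):

```python
def is_endpoint(trailing_silence_s, utterance_ms, tokens_decoded,
                ep_rule1=2.4, ep_rule2=1.2, ep_rule3_ms=300):
    """Return True when any of the three endpointing rules fires."""
    if trailing_silence_s >= ep_rule1:
        # Rule 1: long trailing silence, with or without decoded tokens
        return True
    if tokens_decoded > 0 and trailing_silence_s >= ep_rule2:
        # Rule 2: shorter silence, but only after something was decoded
        return True
    if utterance_ms >= ep_rule3_ms:
        # Rule 3: utterance length cap (set very large to disable)
        return True
    return False
```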
 
 ## 🧪 Local Development
 
+ 1. **Install dependencies**
 
 ```bash
 pip install -r requirements.txt
 ```
 
+ 2. **Run the app locally**
 
 ```bash
 uvicorn app.main:app --reload --host 0.0.0.0 --port 8501
 ```
 
+ Open [http://localhost:8501](http://localhost:8501) in your browser.
 
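Both **Apply** buttons in the Configuration Guide send a `"type": "config"` JSON message over the WebSocket. Server-side handling can be sketched as a pure function (`apply_config` is a hypothetical helper; the real handler lives in `app/main.py` and may differ):

```python
def apply_config(msg, state):
    """Merge a client 'config' message into a per-connection settings dict.

    `state` stands in for the session settings; unknown keys and
    non-config messages are ignored.
    """
    if msg.get("type") != "config":
        return state
    for key in ("hotwords", "hotwordsScore", "epRule1", "epRule2", "epRule3"):
        if key in msg:
            state[key] = msg[key]
    return state
```

In the actual app, the updated settings would then be used to rebuild the recognizer's hotword and endpoint configuration.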
 
 ## 📁 Project Structure
 
 ```
 .
 ├── app
+ │ ├── main.py # FastAPI + WebSocket endpoint, config parsing, debug logging
+ │ ├── asr_worker.py # Audio resampling, inference, endpoint detection, OpenCC conversion
+ │ └── static/index.html # Client-side UI: recognition, hotword, endpoint, mic, transcript
 ├── models/zipformer_bilingual/
 │ └── ... (onnx, tokens.txt)
 ├── requirements.txt
 
 └── README.md
 ```
 
 ## 🔧 Credits
 
 * [Sherpa-ONNX](https://github.com/k2-fsa/sherpa-onnx)
 * [OpenCC](https://github.com/BYVoid/OpenCC)
 * [FastAPI](https://fastapi.tiangolo.com/)
 * [Hugging Face Spaces](https://huggingface.co/docs/hub/spaces)