# 🎙️ Real-Time Streaming ASR Demo (FastAPI + Sherpa-ONNX)

This project demonstrates a real-time speech-to-text (ASR) web application with:
* 🧠 [Sherpa-ONNX](https://github.com/k2-fsa/sherpa-onnx) streaming Zipformer model
* 🚀 FastAPI backend with WebSocket support
* 🎛️ Configurable browser-based UI using vanilla HTML/JS
* ⚙️ Docker-compatible deployment (CPU-only) on Hugging Face Spaces

## 📦 Model
The app uses the bilingual (Chinese-English) streaming Zipformer model:

📌 **Model Source:** [Zipformer Small Bilingual zh-en (2023-02-16)](https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-transducer/zipformer-transducer-models.html#sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16-bilingual-chinese-english)

Model files (ONNX) are located under:

```
models/zipformer_bilingual/
```
## 🌟 Features

* 🎤 **Real-Time Microphone Input:** capture audio directly in the browser.
* 🎛️ **Recognition Settings:** select the ASR model and precision; view supported languages and model size.
* 🔍 **Hotword Biasing:** enter custom hotwords (one per line) and adjust the boost score. See the [Sherpa-ONNX Hotwords Guide](https://k2-fsa.github.io/sherpa/onnx/hotwords/index.html).
* ⏱️ **Endpoint Detection:** configure silence-based rules (Rule 1 threshold, Rule 2 threshold, minimum utterance length) to control segmentation. See [Sherpa-NCNN Endpoint Detection](https://k2-fsa.github.io/sherpa/ncnn/endpoint.html).
* 📊 **Volume Meter:** real-time volume indicator based on RMS (see the sketch after this list).
* 💬 **Streaming Transcription:** partial results displayed in red and final results in green, with automatic scrolling.
* 🛠️ **Debug Logging:** the backend logs configuration steps and endpoint detection events.
* 🐳 **Deployment:** Dockerfile provided for CPU-only deployment on Hugging Face Spaces.
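The volume meter's level is plain RMS over each audio chunk. Below is a minimal Python sketch of the same math for a 16-bit PCM chunk; the real meter runs client-side in JS, and `chunk_rms_db` is an illustrative name, not a function from this repo.

```python
import numpy as np

def chunk_rms_db(pcm_bytes: bytes) -> float:
    """RMS level of a 16-bit little-endian PCM chunk, in dBFS."""
    samples = np.frombuffer(pcm_bytes, dtype="<i2").astype(np.float32) / 32768.0
    if samples.size == 0:
        return -100.0  # treat an empty chunk as silence
    rms = float(np.sqrt(np.mean(np.square(samples))))
    return 20.0 * float(np.log10(max(rms, 1e-10)))  # clamp so log(0) cannot occur
```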
## 🛠️ Configuration Guide

### 🔍 Hotword Biasing Configuration
* **Hotwords List** (`hotwordsList`): Enter one hotword or phrase per line. These are words or phrases the ASR will preferentially recognize. For multilingual models, you can mix scripts according to your model's `modeling-unit` (e.g., `cjkchar+bpe`).
* **Boost Score** (`boostScore`): A global score applied at the token level for each matched hotword (range: `0.0`–`10.0`). You may also specify per-hotword scores inline in the list using `:`, for example:

  ```
  语音识别 :3.5
  深度学习 :2.0
  SPEECH RECOGNITION :1.5
  ```

* **Decoding Method**: Ensure the recognizer uses `modified_beam_search` (not the default `greedy_search`); hotword biasing has no effect with greedy search.
* **Applying**: Click **Apply Hotwords** in the UI to send the following JSON payload to the backend:

  ```json
  {
    "type": "config",
    "hotwords": ["..."],
    "hotwordsScore": 2.0
  }
  ```

For full details, see the [Sherpa-ONNX Hotwords Guide](https://k2-fsa.github.io/sherpa/onnx/hotwords/index.html).
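In sherpa-onnx's Python API, hotwords are passed when the recognizer is built (`hotwords_file` plus `hotwords_score`), so a backend typically writes the list received from the UI to a temporary file and rebuilds the recognizer. A minimal sketch, with assumed ONNX file names under `models/zipformer_bilingual/` (check the actual names in that directory) and an illustrative `build_recognizer` helper:

```python
import tempfile
import sherpa_onnx

MODEL_DIR = "models/zipformer_bilingual"  # assumed layout; verify file names

def build_recognizer(hotwords: list[str], boost_score: float):
    # Hotword biasing only works with modified_beam_search decoding.
    hw_file = ""
    if hotwords:
        f = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
        f.write("\n".join(hotwords))  # one hotword/phrase per line
        f.close()
        hw_file = f.name
    return sherpa_onnx.OnlineRecognizer.from_transducer(
        tokens=f"{MODEL_DIR}/tokens.txt",
        encoder=f"{MODEL_DIR}/encoder.onnx",   # assumed file name
        decoder=f"{MODEL_DIR}/decoder.onnx",   # assumed file name
        joiner=f"{MODEL_DIR}/joiner.onnx",     # assumed file name
        decoding_method="modified_beam_search",
        hotwords_file=hw_file,
        hotwords_score=boost_score,
    )
```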
### ⏱️ Endpoint Detection Configuration

The system supports three endpointing rules borrowed from Kaldi:
* **Rule 1** (`epRule1`): Minimum duration of trailing silence to trigger an endpoint, in **seconds** (default: `2.4`). Fires whether or not any token has been decoded.
* **Rule 2** (`epRule2`): Minimum duration of trailing silence to trigger an endpoint *only after* at least one token has been decoded, in **seconds** (default: `1.2`).
* **Rule 3** (`epRule3`): Maximum utterance length before an endpoint is forced, in **seconds** (default: `300`); upstream calls this `rule3_min_utterance_length`. Disable it by setting a very large value.
* **Applying**: Click **Apply Endpoint Config** in the UI to send the following JSON payload to the backend:

  ```json
  {
    "type": "config",
    "epRule1": 2.4,
    "epRule2": 1.2,
    "epRule3": 300
  }
  ```

For details, see the [Sherpa-NCNN Endpointing documentation](https://k2-fsa.github.io/sherpa/ncnn/endpoint.html).
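These UI values map onto `rule1_min_trailing_silence`, `rule2_min_trailing_silence`, and `rule3_min_utterance_length` in sherpa-onnx's `OnlineRecognizer.from_transducer`, and segmentation is driven by polling `is_endpoint()` while decoding. A minimal sketch (model paths assumed as in the hotword sketch above; `feed_chunk` is an illustrative name):

```python
import sherpa_onnx

recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens="models/zipformer_bilingual/tokens.txt",
    encoder="models/zipformer_bilingual/encoder.onnx",  # assumed file name
    decoder="models/zipformer_bilingual/decoder.onnx",  # assumed file name
    joiner="models/zipformer_bilingual/joiner.onnx",    # assumed file name
    enable_endpoint_detection=True,
    rule1_min_trailing_silence=2.4,   # epRule1
    rule2_min_trailing_silence=1.2,   # epRule2
    rule3_min_utterance_length=300,   # epRule3
)
stream = recognizer.create_stream()

def feed_chunk(samples_16k):
    """Feed one chunk of 16 kHz float32 samples; return final text at an endpoint."""
    stream.accept_waveform(16000, samples_16k)
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)
    if recognizer.is_endpoint(stream):
        text = recognizer.get_result(stream)  # final result for this segment
        recognizer.reset(stream)              # start the next segment
        return text
    return None  # caller may show recognizer.get_result(stream) as a partial
```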
## 🧪 Local Development

1. **Install dependencies**

   ```bash
   pip install -r requirements.txt
   ```

2. **Run the app locally**

   ```bash
   uvicorn app.main:app --reload --host 0.0.0.0 --port 8501
   ```

Open [http://localhost:8501](http://localhost:8501) in your browser.
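To poke the backend without the browser UI, a short WebSocket client can send the config payload above followed by raw audio. This is a hedged sketch using the third-party `websockets` package; the `/ws` path, the binary-PCM message format, and the reply shape are assumptions, so match them to `app/main.py`:

```python
import asyncio
import json

import numpy as np
import websockets  # third-party: pip install websockets

async def smoke_test():
    # "/ws" is an assumed endpoint path; adjust to match app/main.py.
    async with websockets.connect("ws://localhost:8501/ws") as ws:
        await ws.send(json.dumps({
            "type": "config",
            "hotwords": ["SPEECH RECOGNITION"],
            "hotwordsScore": 2.0,
        }))
        silence = np.zeros(16000, dtype=np.int16)  # 1 s of 16 kHz silence
        await ws.send(silence.tobytes())           # assumed: binary PCM frames
        print(await ws.recv())                     # reply format is app-specific

asyncio.run(smoke_test())
```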
## 📁 Project Structure
```
.
├── app
│   ├── main.py           # FastAPI + WebSocket endpoint, config parsing, debug logging
│   ├── asr_worker.py     # Audio resampling, inference, endpoint detection, OpenCC conversion
│   └── static/index.html # Client-side UI: recognition, hotword, endpoint, mic, transcript
├── models/zipformer_bilingual/
│   └── ... (onnx, tokens.txt)
├── requirements.txt
└── README.md
```
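The `asr_worker.py` responsibilities above (resampling and OpenCC conversion) reduce to two small transforms. A minimal sketch, assuming browser audio arrives at 48 kHz and that `soxr` and `opencc-python-reimplemented` are installed; `prepare_audio` and `to_traditional` are illustrative names:

```python
import numpy as np
import soxr                 # pip install soxr
from opencc import OpenCC   # pip install opencc-python-reimplemented

_s2t = OpenCC("s2t")  # Simplified -> Traditional Chinese

def prepare_audio(samples_48k: np.ndarray) -> np.ndarray:
    """Resample browser audio (48 kHz) to the model's 16 kHz rate."""
    return soxr.resample(samples_48k, 48000, 16000)

def to_traditional(text: str) -> str:
    """Convert a Simplified-Chinese ASR result to Traditional for display."""
    return _s2t.convert(text)
```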
## 🧠 Credits
* [Sherpa-ONNX](https://github.com/k2-fsa/sherpa-onnx)
* [OpenCC](https://github.com/BYVoid/OpenCC)
* [FastAPI](https://fastapi.tiangolo.com/)
* [Hugging Face Spaces](https://huggingface.co/docs/hub/spaces)