Luigi committed on
Commit
53fe0cb
·
1 Parent(s): 2a31e9c

update readme

Files changed (1)
  1. README.md +64 -55
README.md CHANGED
@@ -11,21 +11,18 @@ short_description: Streaming zipformer
 
 # 🎙️ Real-Time Streaming ASR Demo (FastAPI + Sherpa-ONNX)
 
- This project demonstrates a real-time speech-to-text (ASR) web application using:
 
 * 🧠 [Sherpa-ONNX](https://github.com/k2-fsa/sherpa-onnx) streaming Zipformer model
 * 🚀 FastAPI backend with WebSocket support
- * 🧑‍💻 Hugging Face Spaces (Docker CPU-only deployment)
- * 🌐 Browser-based microphone input + UI in vanilla HTML/JS
-
- ---
 
 ## 📦 Model
 
- This app uses the following **bilingual (Chinese-English)** streaming model:
 
- **🔗 Model Source:**
- [Zipformer Small Bilingual zh-en (2023-02-16)](https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-transducer/zipformer-transducer-models.html#sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16-bilingual-chinese-english)
 
 Model files (ONNX) are located under:
 
@@ -33,55 +30,88 @@ Model files (ONNX) are located under:
 models/zipformer_bilingual/
 ```
 
- ---
-
 ## 🚀 Features
 
- * 🎤 Real-time microphone input (captured in browser)
- * 🔁 WebSocket-based streaming inference
- * 💬 Partial + final transcription
- * 🌏 Automatic conversion to **Traditional Chinese** using OpenCC
- * 📊 Real-time volume indicator
- * ☁️ Deployed on Hugging Face Spaces (CPU only)
 
- ---
 
 ## 🧪 Local Development
 
- ### 1. Install dependencies
 
 ```bash
 pip install -r requirements.txt
 ```
 
- ### 2. Run the app locally
 
 ```bash
 uvicorn app.main:app --reload --host 0.0.0.0 --port 8501
 ```
 
- Then open: [http://localhost:8501](http://localhost:8501)
-
- ---
-
- ## 🐳 Deploy on Hugging Face Spaces
 
- This repo includes a `Dockerfile` compatible with HF Spaces. It uses:
-
- * `uvicorn` for serving the FastAPI app
- * `opencc-python-reimplemented` for Simplified → Traditional Chinese
- * `pysoxr` or `scipy` for audio resampling (48kHz → 16kHz)
-
- ---
 
 ## 📁 Project Structure
 
 ```
 .
 ├── app
- │ ├── main.py # FastAPI + WebSocket
- │ ├── asr_worker.py # Sherpa inference + resampling + OpenCC
- │ └── static/index.html # Client-side mic UI
 ├── models/zipformer_bilingual/
 │ └── ... (onnx, tokens.txt)
 ├── requirements.txt
@@ -89,30 +119,9 @@ This repo includes a `Dockerfile` compatible with HF Spaces. It uses:
 └── README.md
 ```
 
- ---
-
 ## 🔧 Credits
 
 * [Sherpa-ONNX](https://github.com/k2-fsa/sherpa-onnx)
 * [OpenCC](https://github.com/BYVoid/OpenCC)
 * [FastAPI](https://fastapi.tiangolo.com/)
 * [Hugging Face Spaces](https://huggingface.co/docs/hub/spaces)
-
- ---
-
- ## 🗣 Languages Supported
-
- * 🇨🇳 Chinese (Simplified input, converted to Traditional)
- * 🇺🇸 English
-
- ---
-
- ## 🤝 Contributing
-
- PRs welcome! Feel free to fork this and adapt to other models or languages.
-
- ---
-
- ## 📜 License
-
- Apache 2.0
 
 
 # 🎙️ Real-Time Streaming ASR Demo (FastAPI + Sherpa-ONNX)
 
+ This project demonstrates a real-time speech-to-text (ASR) web application with:
 
 * 🧠 [Sherpa-ONNX](https://github.com/k2-fsa/sherpa-onnx) streaming Zipformer model
 * 🚀 FastAPI backend with WebSocket support
+ * 🎛️ Configurable browser-based UI using vanilla HTML/JS
+ * ☁️ Docker-compatible deployment (CPU-only) on Hugging Face Spaces
 
 ## 📦 Model
 
+ The app uses the bilingual (Chinese-English) streaming Zipformer model:
 
+ 🔗 **Model Source:** [Zipformer Small Bilingual zh-en (2023-02-16)](https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-transducer/zipformer-transducer-models.html#sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16-bilingual-chinese-english)
 
 Model files (ONNX) are located under:
 
 models/zipformer_bilingual/
 ```
 
 ## 🚀 Features
 
+ * 🎤 **Real-Time Microphone Input:** capture audio directly in the browser.
+ * 🎛️ **Recognition Settings:** select ASR model and precision; view supported languages and model size.
+ * 🔑 **Hotword Biasing:** input custom hotwords (one per line) and adjust boost score. See the [Sherpa-ONNX Hotwords Guide](https://k2-fsa.github.io/sherpa/onnx/hotwords/index.html).
+ * ⏱️ **Endpoint Detection:** configure silence-based rules (Rule 1 threshold, Rule 2 threshold, minimum utterance length) to control segmentation. See [Sherpa-NCNN Endpoint Detection](https://k2-fsa.github.io/sherpa/ncnn/endpoint.html).
+ * 📊 **Volume Meter:** real-time volume indicator based on RMS.
+ * 💬 **Streaming Transcription:** display partial (in red) and final (in green) results with automatic scrolling.
+ * 🛠️ **Debug Logging:** backend logs configuration steps and endpoint detection events.
+ * 🐳 **Deployment:** Dockerfile provided for CPU-only deployment on Hugging Face Spaces.
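The volume meter is driven by a simple RMS (root-mean-square) level computed over each incoming audio chunk. The real meter runs in the browser's JavaScript, but the same computation can be sketched in Python (`rms_level` is an illustrative helper, not part of the repo):

```python
import math

def rms_level(samples):
    """RMS level of one audio chunk (float samples in [-1.0, 1.0]).

    0.0 means silence; a full-scale square wave gives 1.0.
    """
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))
```

The UI typically maps this value (or its dB equivalent) onto a bar width to animate the meter.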
 
+ ## 🛠️ Configuration Guide
+
+ ### 🔑 Hotword Biasing Configuration
+
+ * **Hotwords List** (`hotwordsList`): Enter one hotword or phrase per line. These are words/phrases the ASR will preferentially recognize. For multilingual models, you can mix scripts according to your model's `modeling-unit` (e.g., `cjkchar+bpe`).
+ * **Boost Score** (`boostScore`): A global score applied at the token level for each matched hotword (range: `0.0`–`10.0`). You may also specify per-hotword scores inline in the list using `:`, for example:
+
+ ```
+ 语音识别 :3.5
+ 深度学习 :2.0
+ SPEECH RECOGNITION :1.5
+ ```
+ * **Decoding Method**: Ensure your model uses `modified_beam_search` (not the default `greedy_search`) to enable hotword biasing.
+ * **Applying**: Click **Apply Hotwords** in the UI to send the following JSON payload to the backend:
+
+ ```json
+ {
+ "type": "config",
+ "hotwords": ["..."],
+ "hotwordsScore": 2.0
+ }
+ ```
+
+ (For full details, see the [Sherpa-ONNX Hotwords Guide](https://k2-fsa.github.io/sherpa/onnx/hotwords/index.html).)
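The hotword format above (one phrase per line, optional inline `:score`) can be turned into `(phrase, score)` pairs with a few lines of Python. This is a sketch only — `parse_hotwords` is a hypothetical helper, and the actual parsing in `app/main.py` may differ:

```python
def parse_hotwords(text, default_score=2.0):
    """Parse the hotwords textarea into (phrase, score) pairs.

    Lines like 'FOO :1.5' carry a per-hotword score; lines without a
    numeric ':score' suffix fall back to the global boost score.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        phrase, sep, score = line.rpartition(":")
        if sep and phrase.strip():
            try:
                entries.append((phrase.strip(), float(score)))
                continue
            except ValueError:
                pass  # ':' present but no numeric score; keep whole line
        entries.append((line, default_score))
    return entries
```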
+
+ ### ⏱️ Endpoint Detection Configuration
+
+ The system supports three endpointing rules borrowed from Kaldi:
+
+ * **Rule 1** (`epRule1`): Minimum duration of trailing silence to trigger an endpoint, in **seconds** (default: `2.4`). Fires whether or not any token has been decoded.
+ * **Rule 2** (`epRule2`): Minimum duration of trailing silence to trigger an endpoint *only after* at least one token has been decoded, in **seconds** (default: `1.2`).
+ * **Rule 3** (`epRule3`): Maximum utterance length before forcing an endpoint, in **milliseconds** (default: `300`). Disable it by setting a very large value.
+ * **Applying**: Click **Apply Endpoint Config** in the UI to send the following JSON payload to the backend:
+
+ ```json
+ {
+ "type": "config",
+ "epRule1": 2.4,
+ "epRule2": 1.2,
+ "epRule3": 300
+ }
+ ```
+
+ (See the [Sherpa-NCNN Endpointing documentation](https://k2-fsa.github.io/sherpa/ncnn/endpoint.html).)
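Sherpa evaluates these rules internally; purely as an illustration of how the three rules combine (not the library's implementation — `is_endpoint` and its parameter names are hypothetical, defaults taken from the UI fields above):

```python
def is_endpoint(trailing_silence_s, utterance_ms, tokens_decoded,
                ep_rule1=2.4, ep_rule2=1.2, ep_rule3_ms=300):
    """Return True when any of the three endpointing rules fires."""
    if trailing_silence_s >= ep_rule1:
        # Rule 1: long trailing silence, with or without decoded tokens
        return True
    if tokens_decoded > 0 and trailing_silence_s >= ep_rule2:
        # Rule 2: shorter silence, but only after something was decoded
        return True
    if utterance_ms >= ep_rule3_ms:
        # Rule 3: utterance length cap (set very large to disable)
        return True
    return False
```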
 
 ## 🧪 Local Development
 
+ 1. **Install dependencies**
 
 ```bash
 pip install -r requirements.txt
 ```
 
+ 2. **Run the app locally**
 
 ```bash
 uvicorn app.main:app --reload --host 0.0.0.0 --port 8501
 ```
 
+ Open [http://localhost:8501](http://localhost:8501) in your browser.
 
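Both **Apply** buttons in the Configuration Guide send a `"type": "config"` JSON message over the WebSocket. Server-side handling can be sketched as a pure function (`apply_config` is a hypothetical helper; the real handler lives in `app/main.py` and may differ):

```python
def apply_config(msg, state):
    """Merge a client 'config' message into a per-connection settings dict.

    `state` stands in for the session settings; unknown keys and
    non-config messages are ignored.
    """
    if msg.get("type") != "config":
        return state
    for key in ("hotwords", "hotwordsScore", "epRule1", "epRule2", "epRule3"):
        if key in msg:
            state[key] = msg[key]
    return state
```

In the actual app, the updated settings would then be used to rebuild the recognizer's hotword and endpoint configuration.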
 
 ## 📁 Project Structure
 
 ```
 .
 ├── app
+ │ ├── main.py # FastAPI + WebSocket endpoint, config parsing, debug logging
+ │ ├── asr_worker.py # Audio resampling, inference, endpoint detection, OpenCC conversion
+ │ └── static/index.html # Client-side UI: recognition, hotword, endpoint, mic, transcript
 ├── models/zipformer_bilingual/
 │ └── ... (onnx, tokens.txt)
 ├── requirements.txt
 
 └── README.md
 ```
 
 ## 🔧 Credits
 
 * [Sherpa-ONNX](https://github.com/k2-fsa/sherpa-onnx)
 * [OpenCC](https://github.com/BYVoid/OpenCC)
 * [FastAPI](https://fastapi.tiangolo.com/)
 * [Hugging Face Spaces](https://huggingface.co/docs/hub/spaces)