---
title: OpenAI Whisper Vs Alibaba SenseVoice Small
emoji: ⚡
colorFrom: gray
colorTo: purple
sdk: gradio
sdk_version: 5.31.0
app_file: app.py
pinned: false
license: mit
short_description: Compare OpenAI Whisper against FunAudioLLM SenseVoice.
---
# OpenAI Whisper vs. Alibaba SenseVoice Comparison
This Space lets you compare **faster-whisper** models against Alibaba FunAudioLLM’s **SenseVoice** models for automatic speech recognition (ASR), featuring:
* Multiple faster-whisper and SenseVoice model choices.
* Per-engine language selection, exposing the full list of supported language codes.
* Explicit device selection (GPU or CPU) with ZeroGPU support via the `spaces.GPU` decorator (see the sketch after this list).
* Speaker diarization with `pyannote.audio`, displaying speaker-labeled transcripts.
* Simplified Chinese to Traditional Chinese conversion via `opencc`.
* Color-coded and scrollable diarized transcript panel.
* Semi-streaming output: speaker-labeled transcript segments accumulate live as each audio segment or speaker turn finishes processing.
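Two details make the last two bullets work: Gradio streams output when an event handler is a generator, and on ZeroGPU hardware the `spaces.GPU` decorator attaches a GPU only for the duration of the decorated call. A minimal sketch assuming a faster-whisper backend (illustrative, not the actual app.py code):

```python
# Minimal sketch: ZeroGPU allocation plus semi-streaming output.
# Illustrative only; app.py wires these pieces differently.
import spaces
from faster_whisper import WhisperModel

@spaces.GPU  # on ZeroGPU, a GPU is attached only while this call runs
def transcribe_stream(audio_path: str, model_name: str = "large-v3"):
    # A CPU variant would build WhisperModel(model_name, device="cpu",
    # compute_type="int8") without the decorator.
    model = WhisperModel(model_name, device="cuda", compute_type="float16")
    segments, _info = model.transcribe(audio_path)
    text = ""
    for seg in segments:  # `segments` is lazy: decoding happens as we iterate
        text += seg.text
        yield text  # Gradio streams each yielded value, giving semi-live output
```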
## 🚀 How to Use
1. Upload an audio file or record from your microphone.
2. **Faster-Whisper ASR**:
1. Select a model variant from the dropdown.
2. Choose the transcription language (default: auto-detect).
3. Pick device: GPU or CPU.
4. Toggle diarization on/off.
5. Click **Transcribe with Faster-Whisper**.
3. **SenseVoice ASR**:
1. Select a SenseVoice model.
2. Choose the transcription language.
3. Pick device: GPU or CPU.
4. Toggle punctuation on/off.
5. Toggle diarization on/off.
6. Click **Transcribe with SenseVoice**.
4. View both the plain transcript and the color-coded, speaker-labeled diarized transcript side by side.
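Under the hood, the SenseVoice button amounts to a `funasr` call followed by an `opencc` conversion, roughly as in this sketch (the model id and the `s2twp` OpenCC configuration are assumptions, not values read from app.py):

```python
# Hedged sketch of the SenseVoice path; model id and OpenCC config are assumptions.
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess
from opencc import OpenCC

def sensevoice_transcribe(audio_path: str, language: str = "auto", punctuate: bool = True) -> str:
    model = AutoModel(model="FunAudioLLM/SenseVoiceSmall", trust_remote_code=True, device="cpu")
    # use_itn toggles inverse text normalization (punctuation, numbers),
    # which is what the punctuation switch in the UI controls.
    result = model.generate(input=audio_path, language=language, use_itn=punctuate)
    text = rich_transcription_postprocess(result[0]["text"])  # strip SenseVoice event/emotion tags
    return OpenCC("s2twp").convert(text)  # Simplified -> Taiwan Traditional Chinese
```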
## 📁 Files
* **app.py**
Main Gradio app implementing dual ASR pipelines with device control, diarization, and Chinese conversion.
* **requirements.txt**
Python dependencies: Gradio, PyTorch, Transformers, faster-whisper, funasr, pyannote.audio, pydub, opencc-python-reimplemented, ctranslate2, termcolor, NVIDIA cuBLAS/cuDNN wheels.
* **Dockerfile** (optional)
Defines a CUDA 12 + cuDNN 9 environment for GPU acceleration.
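For orientation, the comparison UI is a two-column Gradio Blocks layout; the sketch below shows only its rough shape, with placeholder handlers and illustrative component choices:

```python
# Stripped-down sketch of the two-column comparison layout (illustrative).
import gradio as gr

def fw_transcribe(audio_path, model_name):
    return f"[faster-whisper/{model_name}] transcript placeholder"

def sv_transcribe(audio_path, model_name):
    return f"[SenseVoice/{model_name}] transcript placeholder"

with gr.Blocks() as demo:
    audio = gr.Audio(sources=["upload", "microphone"], type="filepath", label="Audio")
    with gr.Row():
        with gr.Column():
            fw_model = gr.Dropdown(["large-v3", "medium", "base"], label="Faster-Whisper model")
            fw_btn = gr.Button("Transcribe with Faster-Whisper")
            fw_out = gr.Textbox(label="Transcript", lines=8)
        with gr.Column():
            sv_model = gr.Dropdown(["FunAudioLLM/SenseVoiceSmall"], label="SenseVoice model")
            sv_btn = gr.Button("Transcribe with SenseVoice")
            sv_out = gr.Textbox(label="Transcript", lines=8)
    fw_btn.click(fw_transcribe, inputs=[audio, fw_model], outputs=fw_out)
    sv_btn.click(sv_transcribe, inputs=[audio, sv_model], outputs=sv_out)

if __name__ == "__main__":
    demo.launch()
```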
## ⚠️ Notes
* **Hugging Face token**: Set `HF_TOKEN` (or `HUGGINGFACE_TOKEN`) in Space secrets for authenticated diarization model access.
* **GPU allocation**: GPU resources are acquired only when GPU is explicitly selected, thanks to the `spaces.GPU` decorator.
* **Python version**: Python 3.10+ recommended.
* **System `ffmpeg`**: Ensure `ffmpeg` is installed on the host (or via Dockerfile) for audio processing.
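To illustrate how the token is consumed, the diarization pipeline can be loaded as in this minimal sketch (the pipeline id is an assumption; app.py may pin a different one):

```python
# Hedged sketch: reading the Space secret and loading a gated diarization pipeline.
import os
from pyannote.audio import Pipeline

token = os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_TOKEN")
# Pipeline id is an assumption, not read from app.py.
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token=token)

diarization = pipeline("sample.wav")  # any ffmpeg-decodable audio file
for turn, _track, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s to {turn.end:.1f}s")
```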
## 🛠️ Dependencies
* **Python**: 3.10+
* **gradio** (>=3.39.0)
* **torch** (>=2.0.0) & **torchaudio**
* **transformers** (>=4.35.0)
* **faster-whisper** (>=1.1.1) & **ctranslate2** (==4.5.0)
* **funasr** (>=1.0.14)
* **pyannote.audio** (>=2.1.1) & **huggingface-hub** (>=0.18.0)
* **pydub** (>=0.25.1) & **ffmpeg-python** (>=0.2.0)
* **opencc-python-reimplemented**
* **termcolor**
* **nvidia-cublas-cu12**, **nvidia-cudnn-cu12**
## License
MIT
---
## Chinese (Taiwan) Version
# OpenAI Whisper vs. Alibaba FunASR SenseVoice Feature Overview
This Space compares **faster-whisper** against Alibaba FunAudioLLM's **SenseVoice** models side by side, with the following features:
* A choice of multiple faster-whisper and SenseVoice models
* Configurable recognition language (full list of supported language codes)
* Explicit device switching (GPU/CPU), with GPU allocation deferred via the `spaces.GPU` decorator
* Speaker diarization via `pyannote.audio`, with speakers labeled in the transcript
* Automatic conversion of Simplified Chinese to Taiwan Traditional Chinese via `opencc`
* Color-coded conversational transcript that can be scrolled and copied
* Semi-real-time segmented output: the transcript accumulates and is displayed live as each audio segment or speaker turn finishes processing
## 🚀 Usage
1. Upload an audio file or record from your microphone.
2. **Faster-Whisper ASR**:
1. Select a model version.
2. Choose the recognition language (default: auto-detect).
3. Switch the compute device: GPU or CPU.
4. Toggle speaker diarization on/off.
5. Click **Transcribe with Faster-Whisper**.
3. **SenseVoice ASR**:
1. Select a SenseVoice model.
2. Set the recognition language.
3. Switch the compute device: GPU or CPU.
4. Toggle punctuation on/off.
5. Toggle speaker diarization on/off.
6. Click **Transcribe with SenseVoice**.
4. View the plain-text transcript and the color-coded diarized transcript side by side.
## 📁 File Structure
* **app.py**
Source of the Gradio app implementing both ASR pipelines, including device selection, speaker diarization, and Chinese conversion.
* **requirements.txt**
Python dependencies: Gradio, PyTorch, Transformers, faster-whisper, funasr, pyannote.audio, pydub, opencc-python-reimplemented, ctranslate2, termcolor, and cuBLAS/cuDNN.
* **Dockerfile** (optional)
Defines a CUDA 12 + cuDNN 9 Docker environment.
## ⚠️ Notes
* **Hugging Face token**: Set `HF_TOKEN` or `HUGGINGFACE_TOKEN` in the Space secrets so the speaker-diarization models can be downloaded.
* **GPU allocation**: GPU resources are requested only when GPU is selected.
* **Python version**: Python 3.10 or later is recommended.
* **System ffmpeg**: Make sure ffmpeg is installed on the host or in the container for audio processing.
## 🛠️ Dependencies
* **Python**: 3.10+
* **gradio**: >=3.39.0
* **torch** & **torchaudio**: >=2.0.0
* **transformers**: >=4.35.0
* **faster-whisper**: >=1.1.1 & **ctranslate2**: ==4.5.0
* **funasr**: >=1.0.14
* **pyannote.audio**: >=2.1.1 & **huggingface-hub**: >=0.18.0
* **pydub**: >=0.25.1 & **ffmpeg-python**: >=0.2.0
* **opencc-python-reimplemented**
* **termcolor**
* **nvidia-cublas-cu12**, **nvidia-cudnn-cu12**
## License
MIT