---
title: OpenAI Whisper Vs Alibaba SenseVoice Small
emoji:
colorFrom: gray
colorTo: purple
sdk: gradio
sdk_version: 5.31.0
app_file: app.py
pinned: false
license: mit
short_description: Compare OpenAI Whisper against FunAudioLLM SenseVoice.
---

# OpenAI Whisper vs. Alibaba SenseVoice Comparison

This Space lets you compare faster-whisper models against Alibaba FunAudioLLM’s SenseVoice models for automatic speech recognition (ASR), featuring:

- Multiple faster-whisper and SenseVoice model choices.
- Language selection for each ASR engine (full list of language codes).
- Explicit device selection (GPU or CPU) with ZeroGPU support (`spaces.GPU` decorator).
- Speaker diarization with pyannote.audio, displaying speaker-labeled transcripts.
- Simplified Chinese to Traditional Chinese conversion via opencc.
- Color-coded and scrollable diarized transcript panel.
- Semi-streaming output: incremental transcript updates accumulate live as each segment or speaker turn completes (see the sketch below).
- Semi-real-time diarized transcription: speaker-labeled segments appear incrementally as they finish processing.
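
The semi-streaming output can be approximated with a Gradio generator function. Below is a minimal sketch, assuming faster-whisper and opencc-python-reimplemented; the model size, OpenCC config, and ZeroGPU wiring are illustrative, not necessarily what app.py does:

```python
# Minimal sketch: semi-streaming ASR with live Traditional Chinese conversion.
# Model size, paths, and decorator placement are assumptions, not app.py's code.
import spaces
from faster_whisper import WhisperModel
from opencc import OpenCC

cc = OpenCC("s2twp")  # Simplified -> Taiwan Traditional, with phrase mapping

@spaces.GPU  # on ZeroGPU, a GPU slot is held only while this function runs
def stream_transcribe(audio_path, language=None, device="cuda"):
    model = WhisperModel(
        "base", device=device,
        compute_type="float16" if device == "cuda" else "int8",
    )
    segments, _info = model.transcribe(audio_path, language=language)
    transcript = ""
    for seg in segments:   # lazy generator; decoding happens during iteration
        transcript += cc.convert(seg.text)
        yield transcript   # each yield becomes a live transcript update in Gradio
```

Because `stream_transcribe` is a generator, wiring it to a Gradio button handler streams each yielded string into the output component as it arrives.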

## 🚀 How to Use

1. Upload an audio file or record from your microphone.
2. **Faster-Whisper ASR**:
   1. Select a model variant from the dropdown.
   2. Choose the transcription language (default: auto-detect).
   3. Pick a device: GPU or CPU.
   4. Toggle diarization on or off.
   5. Click **Transcribe with Faster-Whisper**.
3. **SenseVoice ASR** (see the sketch after this list):
   1. Select a SenseVoice model.
   2. Choose the transcription language.
   3. Pick a device: GPU or CPU.
   4. Toggle punctuation on or off.
   5. Toggle diarization on or off.
   6. Click **Transcribe with SenseVoice**.
4. View both the plain transcript and the color-coded, speaker-labeled diarized transcript side by side.
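
For the SenseVoice side, a minimal sketch of the underlying inference call, following funasr's documented API; the model id and options are assumptions about what app.py actually uses:

```python
# Minimal sketch of SenseVoice inference via funasr; the model id and flags
# follow funasr's documented usage and may differ from app.py's exact wiring.
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model = AutoModel(model="iic/SenseVoiceSmall", trust_remote_code=True,
                  device="cpu")
result = model.generate(
    input="audio.wav",
    language="auto",  # or an explicit code such as "zh", "en", "yue"
    use_itn=True,     # inverse text normalization, i.e. the punctuation toggle
)
print(rich_transcription_postprocess(result[0]["text"]))
```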

## 📁 Files

- `app.py`: Main Gradio app implementing the dual ASR pipelines with device control, diarization, and Chinese conversion.
- `requirements.txt`: Python dependencies: Gradio, PyTorch, Transformers, faster-whisper, funasr, pyannote.audio, pydub, opencc-python-reimplemented, ctranslate2, termcolor, and the NVIDIA cuBLAS/cuDNN wheels.
- `Dockerfile` (optional): Defines a CUDA 12 + cuDNN 9 environment for GPU acceleration.

## ⚠️ Notes

- **Hugging Face token**: Set `HF_TOKEN` (or `HUGGINGFACE_TOKEN`) in the Space secrets for authenticated access to the diarization model (see the sketch below).
- **GPU allocation**: GPU resources are acquired only when GPU is explicitly selected, thanks to the `spaces.GPU` decorator.
- **Python version**: Python 3.10+ is recommended.
- **System ffmpeg**: Ensure ffmpeg is installed on the host (or via the Dockerfile) for audio processing.
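
For reference, a minimal sketch of token-authenticated diarization with pyannote.audio; the pipeline id is an assumption, and app.py may load a different checkpoint:

```python
# Minimal sketch: authenticated speaker diarization with pyannote.audio.
# The pipeline id is an assumption; app.py may load a different one.
import os
import torch
from pyannote.audio import Pipeline

token = os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_TOKEN")
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token=token)
# Move to GPU when one is available (supported in pyannote.audio 3.x).
pipeline.to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))

diarization = pipeline("audio.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s to {turn.end:.1f}s")
```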

## 🛠️ Dependencies

- Python 3.10+
- gradio (>=3.39.0)
- torch (>=2.0.0) & torchaudio
- transformers (>=4.35.0)
- faster-whisper (>=1.1.1) & ctranslate2 (==4.5.0)
- funasr (>=1.0.14)
- pyannote.audio (>=2.1.1) & huggingface-hub (>=0.18.0)
- pydub (>=0.25.1) & ffmpeg-python (>=0.2.0)
- opencc-python-reimplemented
- termcolor
- nvidia-cublas-cu12, nvidia-cudnn-cu12

## License

MIT


# Chinese (Taiwan) Version

## OpenAI Whisper vs. Alibaba FunASR SenseVoice Feature Overview

This Space compares faster-whisper and Alibaba FunAudioLLM's SenseVoice models side by side, with the following features:

- A choice of multiple faster-whisper and SenseVoice models.
- Configurable recognition language (full list of language codes).
- Explicit device switching (GPU/CPU), with GPU allocation deferred via the `spaces.GPU` decorator.
- Speaker diarization with pyannote.audio, with speakers labeled in the transcript.
- Automatic conversion of Simplified Chinese to Taiwan Traditional Chinese via opencc.
- Color-coded conversational transcript that can be scrolled and copied.
- Semi-real-time segmented output: the transcript accumulates and displays live as each speech segment or speaker turn finishes processing.

## 🚀 Usage

1. Upload an audio file or record audio with your microphone.
2. **Faster-Whisper ASR**:
   1. Select a model variant.
   2. Set the recognition language (auto-detect by default).
   3. Switch the device: GPU or CPU.
   4. Toggle speaker diarization on or off.
   5. Click "Transcribe with Faster-Whisper".
3. **SenseVoice ASR**:
   1. Select a SenseVoice model.
   2. Set the recognition language.
   3. Switch the device: GPU or CPU.
   4. Toggle punctuation on or off.
   5. Toggle speaker diarization on or off.
   6. Click "Transcribe with SenseVoice".
4. View the plain-text transcript and the color-coded speaker-diarized transcript side by side.

## 📁 File Structure

- `app.py`: Gradio application source implementing the dual ASR pipelines, including device selection, speaker diarization, and Chinese conversion.
- `requirements.txt`: Python dependencies: Gradio, PyTorch, Transformers, faster-whisper, funasr, pyannote.audio, pydub, opencc-python-reimplemented, ctranslate2, termcolor, and cuBLAS/cuDNN.
- `Dockerfile` (optional): Defines a CUDA 12 + cuDNN 9 Docker environment.

## ⚠️ Notes

- **Hugging Face token**: Set `HF_TOKEN` (or `HUGGINGFACE_TOKEN`) in the Space secrets so the speaker-diarization model can be downloaded.
- **GPU allocation**: GPU resources are requested only when GPU is selected.
- **Python version**: Python 3.10 or later is recommended.
- **System ffmpeg**: Make sure ffmpeg is installed on the host or in the container for audio processing.

## 🛠️ Dependencies

- Python 3.10+
- gradio (>=3.39.0)
- torch & torchaudio (>=2.0.0)
- transformers (>=4.35.0)
- faster-whisper (>=1.1.1) & ctranslate2 (==4.5.0)
- funasr (>=1.0.14)
- pyannote.audio (>=2.1.1) & huggingface-hub (>=0.18.0)
- pydub (>=0.25.1) & ffmpeg-python (>=0.2.0)
- opencc-python-reimplemented
- termcolor
- nvidia-cublas-cu12, nvidia-cudnn-cu12

## License

MIT