---
title: OpenAI Whisper Vs Alibaba SenseVoice Small
emoji: ⚡
colorFrom: gray
colorTo: purple
sdk: gradio
sdk_version: 5.31.0
app_file: app.py
pinned: false
license: mit
short_description: Compare OpenAI Whisper against FunAudioLLM SenseVoice.
---
# OpenAI Whisper vs. Alibaba SenseVoice Comparison
This Space lets you compare faster-whisper models against Alibaba FunAudioLLM's SenseVoice models for automatic speech recognition (ASR), featuring:

- Multiple faster-whisper and SenseVoice model choices.
- Language selection for each ASR engine (full list of language codes).
- Explicit device selection (GPU or CPU) with ZeroGPU support (via the `spaces.GPU` decorator).
- Speaker diarization with `pyannote.audio`, displaying speaker-labeled transcripts.
- Simplified-to-Traditional Chinese conversion via `opencc`.
- A color-coded, scrollable diarized transcript panel.
- Semi-streaming output: incremental transcript updates accumulate live as each segment or speaker turn completes.
- Semi-real-time diarized transcription: speaker-labeled segments appear incrementally as they finish processing.
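The semi-streaming behavior can be sketched as a plain Python generator that yields the growing transcript after each finished segment. This is a minimal illustration, not the actual `app.py` code; the segment objects here are hypothetical stand-ins for an ASR engine's incremental output:

```python
def stream_transcript(segments):
    """Yield the accumulated transcript after each completed segment.

    `segments` is any iterable of objects with a `.text` attribute,
    mimicking the incremental output of an ASR engine.
    """
    parts = []
    for seg in segments:
        parts.append(seg.text.strip())
        # Each yield lets the UI (e.g. a Gradio generator handler)
        # refresh with the transcript accumulated so far.
        yield " ".join(parts)
```

In Gradio, returning such a generator from an event handler makes the bound output component update on every `yield`, which is what produces the "live accumulating" transcript effect.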
## 🚀 How to Use
1. Upload an audio file or record from your microphone.
2. **Faster-Whisper ASR**:
   - Select a model variant from the dropdown.
   - Choose the transcription language (default: auto-detect).
   - Pick a device: GPU or CPU.
   - Toggle diarization on or off.
   - Click **Transcribe with Faster-Whisper**.
3. **SenseVoice ASR**:
   - Select a SenseVoice model.
   - Choose the transcription language.
   - Pick a device: GPU or CPU.
   - Toggle punctuation on or off.
   - Toggle diarization on or off.
   - Click **Transcribe with SenseVoice**.
4. View both the plain transcript and the color-coded, speaker-labeled diarized transcript side by side.
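A speaker-labeled transcript like the one above is typically produced by matching each ASR segment to the diarization turn it overlaps most in time. A minimal sketch of that matching step (a hypothetical helper, not the actual `app.py` implementation):

```python
def assign_speakers(asr_segments, speaker_turns):
    """Label each ASR segment with the speaker whose turn overlaps it most.

    asr_segments:  list of (start, end, text) tuples from the ASR engine.
    speaker_turns: list of (start, end, speaker) tuples from diarization.
    Returns a list of (speaker, text) tuples.
    """
    labeled = []
    for s_start, s_end, text in asr_segments:
        best, best_overlap = "UNKNOWN", 0.0
        for t_start, t_end, speaker in speaker_turns:
            # Length of the time interval shared by segment and turn
            # (negative when they do not overlap at all).
            overlap = min(s_end, t_end) - max(s_start, t_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labeled.append((best, text))
    return labeled
```

Segments that overlap no diarization turn fall back to an `UNKNOWN` label, which keeps the transcript complete even when diarization misses a stretch of audio.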
## 📁 Files
- `app.py`: Main Gradio app implementing dual ASR pipelines with device control, diarization, and Chinese conversion.
- `requirements.txt`: Python dependencies: Gradio, PyTorch, Transformers, faster-whisper, funasr, pyannote.audio, pydub, opencc-python-reimplemented, ctranslate2, termcolor, and the NVIDIA cuBLAS/cuDNN wheels.
- `Dockerfile` (optional): Defines a CUDA 12 + cuDNN 9 environment for GPU acceleration.
## ⚠️ Notes
- Hugging Face token: Set `HF_TOKEN` (or `HUGGINGFACE_TOKEN`) in the Space secrets for authenticated diarization model access.
- GPU allocation: GPU resources are acquired only when GPU is explicitly selected, thanks to the `spaces.GPU` decorator.
- Python version: Python 3.10+ recommended.
- System `ffmpeg`: Ensure `ffmpeg` is installed on the host (or via the Dockerfile) for audio processing.
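The two-name token lookup can be sketched in a few lines (illustrative only; `app.py` may resolve the token differently):

```python
import os

def get_hf_token():
    """Return the Hugging Face token from either supported secret name.

    Checks HF_TOKEN first, then falls back to HUGGINGFACE_TOKEN; returns
    None when neither is set, so callers can fail with a clear message
    before attempting to load the gated diarization models.
    """
    return os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_TOKEN")
```

The resulting value would typically be passed as the `token`/`use_auth_token` argument when loading the `pyannote.audio` pipeline.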
## 🛠️ Dependencies
- Python: 3.10+
- gradio (>=3.39.0)
- torch (>=2.0.0) & torchaudio
- transformers (>=4.35.0)
- faster-whisper (>=1.1.1) & ctranslate2 (==4.5.0)
- funasr (>=1.0.14)
- pyannote.audio (>=2.1.1) & huggingface-hub (>=0.18.0)
- pydub (>=0.25.1) & ffmpeg-python (>=0.2.0)
- opencc-python-reimplemented
- termcolor
- nvidia-cublas-cu12, nvidia-cudnn-cu12
## License
MIT
## 中文(臺灣)版本

### OpenAI Whisper vs. Alibaba FunASR SenseVoice 功能說明
本 Space 同步比較 faster-whisper 與 Alibaba FunAudioLLM 的 SenseVoice 模型,提供以下特色:

- 多款 faster-whisper 與 SenseVoice 模型可自由選擇
- 支援設定辨識語言(完整語言代碼列表)
- 明確切換運算裝置(GPU/CPU),並以 `spaces.GPU` 裝飾器延後 GPU 資源配置
- 整合 `pyannote.audio` 做語者分離,並在抄本中標示不同語者
- 使用 `opencc` 自動將簡體中文轉為臺灣繁體中文
- 彩色區隔對話式抄本,可捲動瀏覽及複製
- 半即時分段輸出:每段語音或語者片段處理完成後,即時累積顯示抄本
### 🚀 使用步驟
1. 上傳音檔或透過麥克風錄製音訊。
2. **Faster-Whisper ASR**:
   - 選擇模型版本。
   - 選定辨識語言(預設自動偵測)。
   - 切換運算裝置:GPU 或 CPU。
   - 開啟/關閉語者分離功能。
   - 點擊「Transcribe with Faster-Whisper」。
3. **SenseVoice ASR**:
   - 選擇 SenseVoice 模型。
   - 設定辨識語言。
   - 切換運算裝置:GPU 或 CPU。
   - 開啟/關閉標點符號功能。
   - 開啟/關閉語者分離功能。
   - 點擊「Transcribe with SenseVoice」。
4. 左右並排查看純文字抄本與彩色標註的語者分離抄本。
### 📁 檔案結構
- `app.py`:Gradio 應用程式原始碼,實作雙 ASR 流程,包含運算裝置選擇、語者分離與中文轉換。
- `requirements.txt`:Python 相依套件:Gradio、PyTorch、Transformers、faster-whisper、funasr、pyannote.audio、pydub、opencc-python-reimplemented、ctranslate2、termcolor、cuBLAS/cuDNN。
- `Dockerfile`(選用):定義 CUDA 12 + cuDNN 9 的 Docker 環境。
### ⚠️ 注意事項
- Hugging Face 權杖:請在 Space Secrets 設定 `HF_TOKEN` 或 `HUGGINGFACE_TOKEN`,以便下載語者分離模型。
- GPU 分配:僅於選擇 GPU 時才會申請 GPU 資源。
- Python 版本:建議使用 Python 3.10 以上。
- 系統 `ffmpeg`:請確保主機或容器中已安裝 `ffmpeg`,以支援音訊處理。
### 🛠️ 相依套件
- Python: 3.10+
- gradio: >=3.39.0
- torch & torchaudio: >=2.0.0
- transformers: >=4.35.0
- faster-whisper: >=1.1.1 & ctranslate2: ==4.5.0
- funasr: >=1.0.14
- pyannote.audio: >=2.1.1 & huggingface-hub: >=0.18.0
- pydub: >=0.25.1 & ffmpeg-python: >=0.2.0
- opencc-python-reimplemented
- termcolor
- nvidia-cublas-cu12, nvidia-cudnn-cu12
### 授權
MIT