---
title: OpenAI Whisper Vs Alibaba SenseVoice Small
emoji:
colorFrom: gray
colorTo: purple
sdk: gradio
sdk_version: 5.31.0
app_file: app.py
pinned: false
license: mit
short_description: Compare OpenAI Whisper against FunAudioLLM SenseVoice.
---
# OpenAI Whisper vs. Alibaba SenseVoice Comparison
This Space lets you compare **faster-whisper** models against Alibaba FunAudioLLM’s **SenseVoice** models for automatic speech recognition (ASR), featuring:
* Multiple faster-whisper and SenseVoice model choices.
* Language selection for each ASR engine, covering the full list of supported language codes.
* Explicit device selection (GPU or CPU) with ZeroGPU support (`spaces.GPU` decorator).
* Speaker diarization with `pyannote.audio`, displaying speaker-labeled transcripts.
* Simplified-to-Traditional Chinese conversion via `opencc`.
* Color-coded and scrollable diarized transcript panel.
* Semi-streaming output: the transcript accumulates live as each segment or speaker turn completes.
* Semi-real-time diarized transcription: speaker-labeled segments appear incrementally as they finish processing (a sketch of the streaming pattern follows this list).
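A minimal sketch of how the semi-streaming output and the `opencc` conversion can fit together, assuming a generator-based Gradio handler (the model size, OpenCC config, and the `stream_transcribe` name are illustrative, not taken from app.py):

```python
from faster_whisper import WhisperModel
from opencc import OpenCC

def stream_transcribe(audio_path: str):
    # Illustrative settings; in app.py these come from the UI controls.
    model = WhisperModel("base", device="cpu")
    cc = OpenCC("s2twp")  # Simplified -> Taiwan Traditional, with phrase mapping
    transcript = ""
    segments, _info = model.transcribe(audio_path)
    for seg in segments:
        transcript += cc.convert(seg.text)
        yield transcript  # each yield re-renders the bound Gradio output
```

When a Gradio event handler is a generator, every `yield` updates the bound output component, which is what produces the incremental accumulation described above.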
## 🚀 How to Use
1. Upload an audio file or record from your microphone.
2. **Faster-Whisper ASR**:
1. Select a model variant from the dropdown.
2. Choose the transcription language (default: auto-detect).
3. Pick device: GPU or CPU.
4. Toggle diarization on/off.
5. Click **Transcribe with Faster-Whisper** (see the first sketch after these steps).
3. **SenseVoice ASR**:
1. Select a SenseVoice model.
2. Choose the transcription language.
3. Pick device: GPU or CPU.
4. Toggle punctuation on/off.
5. Toggle diarization on/off.
6. Click **Transcribe with SenseVoice** (see the second sketch after these steps).
4. View both the plain transcript and the color-coded, speaker-labeled diarized transcript side by side.
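For reference, the faster-whisper call behind the first button looks roughly like this; the model size, file name, and printed format are placeholders rather than app.py's exact code:

```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
# language=None enables auto-detection, mirroring the dropdown default.
segments, info = model.transcribe("audio.wav", language=None)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:6.2f}s -> {seg.end:6.2f}s] {seg.text}")
```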
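The SenseVoice side is driven through funasr's `AutoModel`. A minimal sketch, assuming the `iic/SenseVoiceSmall` checkpoint (the model ID and options app.py actually uses may differ):

```python
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model = AutoModel(model="iic/SenseVoiceSmall", device="cuda:0")
# use_itn=True turns on inverse text normalization (punctuation, numerals).
result = model.generate(input="audio.wav", language="auto", use_itn=True)
# SenseVoice emits inline emotion/event tags; strip them for a clean transcript.
print(rich_transcription_postprocess(result[0]["text"]))
```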
## 📁 Files
* **app.py**
Main Gradio app implementing dual ASR pipelines with device control, diarization, and Chinese conversion.
* **requirements.txt**
Python dependencies: Gradio, PyTorch, Transformers, faster-whisper, funasr, pyannote.audio, pydub, opencc-python-reimplemented, ctranslate2, termcolor, NVIDIA cuBLAS/cuDNN wheels.
* **Dockerfile** (optional)
Defines a CUDA 12 + cuDNN 9 environment for GPU acceleration.
## ⚠️ Notes
* **Hugging Face token**: Set `HF_TOKEN` (or `HUGGINGFACE_TOKEN`) in the Space secrets for authenticated access to the gated diarization model (see the first sketch below).
* **GPU allocation**: GPU resources are acquired only when GPU is explicitly selected, thanks to the `spaces.GPU` decorator (see the second sketch below).
* **Python version**: Python 3.10+ recommended.
* **System `ffmpeg`**: Ensure `ffmpeg` is installed on the host (or via Dockerfile) for audio processing.
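A sketch of the token-gated diarization flow, assuming the `pyannote/speaker-diarization-3.1` checkpoint and a max-overlap rule for attaching speakers to ASR segments (app.py may pin a different pipeline version or merge the two streams differently):

```python
import os
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_TOKEN"),
)
diarization = pipeline("audio.wav")

def label_segments(asr_segments, diarization):
    """Give each ASR segment the speaker whose turn overlaps it the most."""
    labeled = []
    for seg in asr_segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            overlap = min(seg.end, turn.end) - max(seg.start, turn.start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((best_speaker, seg.text))
    return labeled
```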
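The deferred GPU allocation boils down to decorating only the GPU code path; a minimal sketch (the function name and model choice are illustrative):

```python
import spaces

@spaces.GPU  # ZeroGPU attaches a GPU only while this function is running
def transcribe_on_gpu(audio_path: str) -> str:
    from faster_whisper import WhisperModel
    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
    segments, _ = model.transcribe(audio_path)
    return "".join(seg.text for seg in segments)
```

CPU requests can route to an undecorated sibling function, which is consistent with GPU quota being consumed only on explicit selection.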
## 🛠️ Dependencies
* **Python**: 3.10+
* **gradio** (>=3.39.0)
* **torch** (>=2.0.0) & **torchaudio**
* **transformers** (>=4.35.0)
* **faster-whisper** (>=1.1.1) & **ctranslate2** (==4.5.0)
* **funasr** (>=1.0.14)
* **pyannote.audio** (>=2.1.1) & **huggingface-hub** (>=0.18.0)
* **pydub** (>=0.25.1) & **ffmpeg-python** (>=0.2.0)
* **opencc-python-reimplemented**
* **termcolor**
* **nvidia-cublas-cu12**, **nvidia-cudnn-cu12**
## License
MIT
---
## 中文(臺灣)版本
# OpenAI Whisper vs. Alibaba FunASR SenseVoice 功能說明
本 Space 同步比較 **faster-whisper** 與 Alibaba FunAudioLLM 的 **SenseVoice** 模型,提供以下特色:
* 多款 faster-whisper 與 SenseVoice 模型可自由選擇
* 支援設定辨識語言(完整語言代碼列表)
* 明確切換運算裝置 (GPU/CPU),並以 `spaces.GPU` 裝飾器延後 GPU 資源配置
* 整合 `pyannote.audio` 做語者分離,並在抄本中標示不同語者
* 使用 `opencc` 自動將簡體中文轉為臺灣繁體中文
* 彩色區隔對話式抄本,可捲動瀏覽及複製
* 半即時分段輸出:每段語音或語者片段處理完成後,即時累積顯示抄本
## 🚀 使用步驟
1. 上傳音檔或透過麥克風錄製音訊。
2. **Faster-Whisper ASR**
1. 選擇模型版本。
2. 選定辨識語言 (預設自動偵測)。
3. 切換運算裝置:GPU 或 CPU。
4. 開啟/關閉語者分離功能。
5. 點擊「Transcribe with Faster-Whisper」。
3. **SenseVoice ASR**
1. 選擇 SenseVoice 模型。
2. 設定辨識語言。
3. 切換運算裝置:GPU 或 CPU。
4. 開啟/關閉標點符號功能。
5. 開啟/關閉語者分離功能。
6. 點擊「Transcribe with SenseVoice」。
4. 左右並排查看純文字抄本與彩色標註的語者分離抄本。
## 📁 檔案結構
* **app.py**
Gradio 應用程式原始碼,實作雙 ASR 流程,包含運算裝置選擇、語者分離與中文轉換。
* **requirements.txt**
Python 相依套件:Gradio、PyTorch、Transformers、faster-whisper、funasr、pyannote.audio、pydub、opencc-python-reimplemented、ctranslate2、termcolor、cuBLAS/cuDNN。
* **Dockerfile**(選用)
定義 CUDA 12 + cuDNN 9 的 Docker 環境。
## ⚠️ 注意事項
* **Hugging Face 權杖**:請在 Space Secrets 設定 `HF_TOKEN` 或 `HUGGINGFACE_TOKEN`,以便下載語者分離模型。
* **GPU 分配**:僅於選擇 GPU 時才會申請 GPU 資源。
* **Python 版本**:建議使用 Python 3.10 以上。
* **系統 ffmpeg**:請確保主機或容器中已安裝 ffmpeg,以支援音訊處理。
## 🛠️ 相依套件
* **Python**: 3.10+
* **gradio**: >=3.39.0
* **torch** & **torchaudio**: >=2.0.0
* **transformers**: >=4.35.0
* **faster-whisper**: >=1.1.1 & **ctranslate2**: ==4.5.0
* **funasr**: >=1.0.14
* **pyannote.audio**: >=2.1.1 & **huggingface-hub**: >=0.18.0
* **pydub**: >=0.25.1 & **ffmpeg-python**: >=0.2.0
* **opencc-python-reimplemented**
* **termcolor**
* **nvidia-cublas-cu12**, **nvidia-cudnn-cu12**
## 授權
MIT