---
title: OpenAI Whisper Vs Alibaba SenseVoice Small
emoji:
colorFrom: gray
colorTo: purple
sdk: gradio
sdk_version: 5.31.0
app_file: app.py
pinned: false
license: mit
short_description: Compare OpenAI Whisper against FunAudioLLM SenseVoice.
---

# OpenAI Whisper vs. Alibaba SenseVoice Comparison

This Space lets you compare faster-whisper models against Alibaba FunAudioLLM’s SenseVoice models for automatic speech recognition (ASR), featuring:

- Multiple faster-whisper and SenseVoice model choices.
- Language selection for each ASR engine (full list of language codes).
- Explicit device selection (GPU or CPU) with ZeroGPU support (`spaces.GPU` decorator).
- Speaker diarization with pyannote.audio, displaying speaker-labeled transcripts.
- Simplified Chinese to Traditional Chinese conversion via opencc.
- Color-coded and scrollable diarized transcript panel.
- Semi-streaming output: incremental transcript updates accumulate live as each segment or speaker turn completes (see the sketch below).
- Semi-real-time diarized transcription: speaker-labeled segments appear incrementally as they finish processing.
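
The semi-streaming output can be approximated with a Gradio generator function. Below is a minimal sketch, assuming faster-whisper and opencc-python-reimplemented; the model size, OpenCC config, and ZeroGPU wiring are illustrative, not necessarily what app.py does:

```python
# Minimal sketch: semi-streaming ASR with live Traditional Chinese conversion.
# Model size, paths, and decorator placement are assumptions, not app.py's code.
import spaces
from faster_whisper import WhisperModel
from opencc import OpenCC

cc = OpenCC("s2twp")  # Simplified -> Taiwan Traditional, with phrase mapping

@spaces.GPU  # on ZeroGPU, a GPU slot is held only while this function runs
def stream_transcribe(audio_path, language=None, device="cuda"):
    model = WhisperModel(
        "base", device=device,
        compute_type="float16" if device == "cuda" else "int8",
    )
    segments, _info = model.transcribe(audio_path, language=language)
    transcript = ""
    for seg in segments:   # lazy generator; decoding happens during iteration
        transcript += cc.convert(seg.text)
        yield transcript   # each yield becomes a live transcript update in Gradio
```

Because `stream_transcribe` is a generator, wiring it to a Gradio button handler streams each yielded string into the output component as it arrives.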

## 🚀 How to Use

1. Upload an audio file or record from your microphone.
2. **Faster-Whisper ASR**:
   1. Select a model variant from the dropdown.
   2. Choose the transcription language (default: auto-detect).
   3. Pick a device: GPU or CPU.
   4. Toggle diarization on or off.
   5. Click **Transcribe with Faster-Whisper**.
3. **SenseVoice ASR** (see the sketch after this list):
   1. Select a SenseVoice model.
   2. Choose the transcription language.
   3. Pick a device: GPU or CPU.
   4. Toggle punctuation on or off.
   5. Toggle diarization on or off.
   6. Click **Transcribe with SenseVoice**.
4. View both the plain transcript and the color-coded, speaker-labeled diarized transcript side by side.
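
For the SenseVoice side, a minimal sketch of the underlying inference call, following funasr's documented API; the model id and options are assumptions about what app.py actually uses:

```python
# Minimal sketch of SenseVoice inference via funasr; the model id and flags
# follow funasr's documented usage and may differ from app.py's exact wiring.
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model = AutoModel(model="iic/SenseVoiceSmall", trust_remote_code=True,
                  device="cpu")
result = model.generate(
    input="audio.wav",
    language="auto",  # or an explicit code such as "zh", "en", "yue"
    use_itn=True,     # inverse text normalization, i.e. the punctuation toggle
)
print(rich_transcription_postprocess(result[0]["text"]))
```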

## 📁 Files

- `app.py`: Main Gradio app implementing the dual ASR pipelines with device control, diarization, and Chinese conversion.
- `requirements.txt`: Python dependencies: Gradio, PyTorch, Transformers, faster-whisper, funasr, pyannote.audio, pydub, opencc-python-reimplemented, ctranslate2, termcolor, and the NVIDIA cuBLAS/cuDNN wheels.
- `Dockerfile` (optional): Defines a CUDA 12 + cuDNN 9 environment for GPU acceleration.

## ⚠️ Notes

- **Hugging Face token**: Set `HF_TOKEN` (or `HUGGINGFACE_TOKEN`) in the Space secrets for authenticated access to the diarization model (see the sketch below).
- **GPU allocation**: GPU resources are acquired only when GPU is explicitly selected, thanks to the `spaces.GPU` decorator.
- **Python version**: Python 3.10+ is recommended.
- **System ffmpeg**: Ensure ffmpeg is installed on the host (or via the Dockerfile) for audio processing.
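
For reference, a minimal sketch of token-authenticated diarization with pyannote.audio; the pipeline id is an assumption, and app.py may load a different checkpoint:

```python
# Minimal sketch: authenticated speaker diarization with pyannote.audio.
# The pipeline id is an assumption; app.py may load a different one.
import os
import torch
from pyannote.audio import Pipeline

token = os.environ.get("HF_TOKEN") or os.environ.get("HUGGINGFACE_TOKEN")
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token=token)
# Move to GPU when one is available (supported in pyannote.audio 3.x).
pipeline.to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))

diarization = pipeline("audio.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s to {turn.end:.1f}s")
```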

## 🛠️ Dependencies

- Python 3.10+
- gradio (>=3.39.0)
- torch (>=2.0.0) & torchaudio
- transformers (>=4.35.0)
- faster-whisper (>=1.1.1) & ctranslate2 (==4.5.0)
- funasr (>=1.0.14)
- pyannote.audio (>=2.1.1) & huggingface-hub (>=0.18.0)
- pydub (>=0.25.1) & ffmpeg-python (>=0.2.0)
- opencc-python-reimplemented
- termcolor
- nvidia-cublas-cu12, nvidia-cudnn-cu12

## License

MIT


# Chinese (Taiwan) Version

## OpenAI Whisper vs. Alibaba FunASR SenseVoice Feature Overview

This Space compares faster-whisper and Alibaba FunAudioLLM's SenseVoice models side by side, with the following features:

- A choice of multiple faster-whisper and SenseVoice models.
- Configurable recognition language (full list of language codes).
- Explicit device switching (GPU/CPU), with GPU allocation deferred via the `spaces.GPU` decorator.
- Speaker diarization with pyannote.audio, with speakers labeled in the transcript.
- Automatic conversion of Simplified Chinese to Taiwan Traditional Chinese via opencc.
- Color-coded conversational transcript that can be scrolled and copied.
- Semi-real-time segmented output: the transcript accumulates and displays live as each speech segment or speaker turn finishes processing.

## 🚀 Usage

1. Upload an audio file or record audio with your microphone.
2. **Faster-Whisper ASR**:
   1. Select a model variant.
   2. Set the recognition language (auto-detect by default).
   3. Switch the device: GPU or CPU.
   4. Toggle speaker diarization on or off.
   5. Click "Transcribe with Faster-Whisper".
3. **SenseVoice ASR**:
   1. Select a SenseVoice model.
   2. Set the recognition language.
   3. Switch the device: GPU or CPU.
   4. Toggle punctuation on or off.
   5. Toggle speaker diarization on or off.
   6. Click "Transcribe with SenseVoice".
4. View the plain-text transcript and the color-coded speaker-diarized transcript side by side.

## 📁 File Structure

- `app.py`: Gradio application source implementing the dual ASR pipelines, including device selection, speaker diarization, and Chinese conversion.
- `requirements.txt`: Python dependencies: Gradio, PyTorch, Transformers, faster-whisper, funasr, pyannote.audio, pydub, opencc-python-reimplemented, ctranslate2, termcolor, and cuBLAS/cuDNN.
- `Dockerfile` (optional): Defines a CUDA 12 + cuDNN 9 Docker environment.

## ⚠️ Notes

- **Hugging Face token**: Set `HF_TOKEN` (or `HUGGINGFACE_TOKEN`) in the Space secrets so the speaker-diarization model can be downloaded.
- **GPU allocation**: GPU resources are requested only when GPU is selected.
- **Python version**: Python 3.10 or later is recommended.
- **System ffmpeg**: Make sure ffmpeg is installed on the host or in the container for audio processing.

## 🛠️ Dependencies

- Python 3.10+
- gradio (>=3.39.0)
- torch & torchaudio (>=2.0.0)
- transformers (>=4.35.0)
- faster-whisper (>=1.1.1) & ctranslate2 (==4.5.0)
- funasr (>=1.0.14)
- pyannote.audio (>=2.1.1) & huggingface-hub (>=0.18.0)
- pydub (>=0.25.1) & ffmpeg-python (>=0.2.0)
- opencc-python-reimplemented
- termcolor
- nvidia-cublas-cu12, nvidia-cudnn-cu12

## License

MIT