metadata
title: Accent Classifier + Transcriber
emoji: 🎙️
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 4.20.0
app_file: app.py
pinned: false

Accent Classifier + Speech Transcriber

This Gradio app allows you to:

  • Upload or link to audio/video files
  • Automatically transcribe the speech (via OpenAI Whisper)
  • Detect the speaker's accent (28-class Wav2Vec2 model)
  • View a top-5 ranked list of likely accents with confidence scores

How to Use

Option 1: Upload an audio file

  • Supported formats: .mp3, .wav

Option 2: Upload a video file

  • Supported format: .mp4 (audio will be extracted automatically)
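As an illustration of what "extracted automatically" can look like, here is a minimal sketch that builds an ffmpeg command to pull the audio track out of a video as mono 16 kHz WAV. It is not taken from app.py: the dependency list suggests the app may use moviepy for this step, and the function name `build_extract_command` and the file names are hypothetical. The sketch only builds the argument list; running it requires ffmpeg on PATH.

```python
def build_extract_command(video_path: str, wav_path: str, sample_rate: int = 16000) -> list:
    """Build an ffmpeg argument list that writes a video's audio track as mono WAV.

    Hypothetical helper for illustration; the app itself may extract audio differently.
    """
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it already exists
        "-i", video_path,          # input video file
        "-vn",                     # ignore the video stream
        "-ac", "1",                # downmix to a single (mono) channel
        "-ar", str(sample_rate),   # resample to the target rate in Hz
        wav_path,                  # output path; the .wav extension selects the format
    ]


cmd = build_extract_command("clip.mp4", "clip.wav")
print(" ".join(cmd))
```

To actually run the extraction, pass the list to `subprocess.run(cmd, check=True)`.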

Option 3: Paste a direct .mp4 video URL

  • Must be a direct video file URL (not a webpage)
  • Example: a file hosted on archive.org or a CDN

Not Supported

  • Loom, YouTube, Dropbox, or other webpage links (these point to web pages, not to raw video files)
  • Workaround: download the video manually, then upload the file
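The difference between a direct file URL and a webpage link can be approximated with a quick check on the URL path. The sketch below is a heuristic only and is not the app's own validation logic; the function name `looks_like_direct_mp4` and the example.com URLs are made up for illustration. A URL can pass this check and still return an HTML page, so a real download step should also verify the response.

```python
from urllib.parse import urlparse


def looks_like_direct_mp4(url: str) -> bool:
    """Return True if the URL is http(s) and its path ends in .mp4.

    Heuristic only: it inspects the URL text and makes no network request.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    # The query string is not part of parsed.path, so "?download=1" is ignored.
    return parsed.path.lower().endswith(".mp4")


for candidate in (
    "https://example.com/media/talk.mp4",       # direct file URL
    "https://example.com/media/talk.mp4?dl=1",  # direct file URL with a query string
    "https://www.youtube.com/watch?v=abc123",   # webpage link
):
    print(candidate, "->", looks_like_direct_mp4(candidate))
```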

Models Used

Transcription:

Accent Classification:


Running Locally

To set up and run the app locally, follow these steps:

  1. Clone the repository
    git clone https://huggingface.co/spaces/usamaijaz-ai/accent-classifier
    cd accent-classifier
  2. Create a virtual environment (optional but recommended)
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install the dependencies
    pip install -r requirements.txt

    If there’s no requirements.txt, use:

    pip install gradio==4.20.0 transformers torch moviepy==1.0.3 requests safetensors soundfile scipy
  4. Install ffmpeg
    • macOS: brew install ffmpeg
    • Ubuntu: sudo apt install ffmpeg
    • Windows: download ffmpeg and add it to PATH
  5. Run the app
    python app.py
  6. Access in your browser
    Visit http://localhost:7860 to use the app locally.
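If the app fails on video input after setup, a common cause is that ffmpeg is not on PATH. The following sketch uses the standard library's `shutil.which` to confirm the ffmpeg install step worked; the helper name `find_executable` is hypothetical and not part of the app.

```python
import shutil


def find_executable(name: str = "ffmpeg"):
    """Return the full path of an executable found on PATH, or None if it is missing."""
    return shutil.which(name)


path = find_executable()
if path:
    print(f"ffmpeg found at {path}")
else:
    print("ffmpeg not found on PATH; install it before running the app")
```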

How It Works

  1. Audio is extracted (if the input is a video)
  2. Audio is converted to .wav and resampled to 16 kHz
  3. Speech is transcribed using Whisper
  4. Accent is classified using a Wav2Vec2 model
  5. Output includes:
    • Top accent prediction
    • Confidence score
    • Top-5 accent list
    • Full transcription
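The confidence scores and top-5 list in the output can be derived from a classifier's raw scores roughly as follows. This is a sketch under assumptions, not the app's code: it assumes the classifier emits one logit per accent class and that confidences come from a softmax, and the label names and logit values are placeholders rather than the model's real classes.

```python
import math


def top_k_accents(logits, labels, k=5):
    """Convert raw scores to probabilities with a softmax and return the k most likely labels."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable without changing the result.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Placeholder labels and scores for illustration only.
labels = ["accent_a", "accent_b", "accent_c", "accent_d", "accent_e", "accent_f"]
logits = [2.0, 0.5, 1.0, -1.0, 0.0, 3.0]

top5 = top_k_accents(logits, labels)
for name, prob in top5:
    print(f"{name}: {prob:.1%}")
```

The first entry of the returned list corresponds to the top accent prediction and its confidence score; the full list is the top-5 ranking.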