---
title: Multilingual Audio Intelligence System
emoji: 🎵
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
short_description: AI system for multilingual transcription and translation
---

# 🎵 Multilingual Audio Intelligence System

*Multilingual Audio Intelligence System Banner*

## Overview

The Multilingual Audio Intelligence System is an AI-powered platform that combines state-of-the-art speaker diarization, automatic speech recognition, and neural machine translation into a single analysis pipeline. It processes multilingual audio, identifies individual speakers, transcribes their speech, and translates the results into multiple languages, turning raw audio into structured, actionable output.

## Features

### Demo Mode with Professional Audio Files

- **Yuri Kizaki - Japanese Audio:** professional voice message about website communication
- **French Film Podcast:** discussion of films including The Social Network and Paranormal Activity
- **Smart Demo File Management:** automatic download and preprocessing of demo files
- **Instant Results:** cached processing for fast demonstrations

### Enhanced User Interface

- **Audio Waveform Visualization:** real-time waveform display using HTML5 Canvas
- **Interactive Demo Selection:** card-based selection of demo audio files
- **Improved Transcript Display:** color-coded confidence levels and clearly separated translation sections
- **Professional Audio Preview:** audio player with waveform visualization

## Screenshots

### 🎬 Demo Banner

*Demo Banner*

### 📝 Transcript with Translation

*Transcript with Translation*

### 📊 Visual Representation

*Visual Output*

### 🧠 Summary Output

*Summary Output*

## Demo & Documentation

## Installation and Quick Start

1. **Clone the Repository:**

   ```bash
   git clone https://github.com/Prathameshv07/Multilingual-Audio-Intelligence-System.git
   cd Multilingual-Audio-Intelligence-System
   ```

2. **Create and Activate a Conda Environment:**

   ```bash
   conda create --name audio_challenge python=3.9
   conda activate audio_challenge
   ```

3. **Install Dependencies:**

   ```bash
   pip install -r requirements.txt
   ```

4. **Configure Environment Variables:**

   ```bash
   cp config.example.env .env
   # Edit .env and set HUGGINGFACE_TOKEN for access to gated models
   ```

5. **Preload AI Models (Recommended):**

   ```bash
   python model_preloader.py
   ```

6. **Initialize the Application:**

   ```bash
   python run_fastapi.py
   ```
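Once the startup script finishes loading models, the web interface should be reachable on the configured port; this README's other examples assume the default of 8000. A quick check from another terminal:

```bash
# Smoke test: assumes the server is bound to the default port 8000
curl -I http://localhost:8000
```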

## File Structure

```
Multilingual-Audio-Intelligence-System/
├── web_app.py                      # FastAPI application with RESTful endpoints
├── model_preloader.py              # Intelligent model loading with progress tracking
├── run_fastapi.py                  # Application startup script with preloading
├── src/
│   ├── main.py                     # AudioIntelligencePipeline orchestrator
│   ├── audio_processor.py          # Advanced audio preprocessing and normalization
│   ├── speaker_diarizer.py         # pyannote.audio integration for speaker identification
│   ├── speech_recognizer.py        # faster-whisper ASR with language detection
│   ├── translator.py               # Neural machine translation with multiple models
│   ├── output_formatter.py         # Multi-format result generation and export
│   └── utils.py                    # Utility functions and performance monitoring
├── templates/
│   └── index.html                  # Responsive web interface with home page
├── static/                         # Static assets and client-side resources
├── model_cache/                    # Intelligent model caching directory
├── uploads/                        # User audio file storage
├── outputs/                        # Generated results and downloads
├── requirements.txt                # Comprehensive dependency specification
├── Dockerfile                      # Production-ready containerization
└── config.example.env              # Environment configuration template
```

## Configuration

### Environment Variables

Create a `.env` file:

```env
HUGGINGFACE_TOKEN=hf_your_token_here  # Optional overall; required for gated models such as pyannote.audio
```

### Model Configuration

- **Whisper Model:** `tiny` / `small` / `medium` / `large`
- **Target Language:** `en`, `es`, `fr`, `de`, `it`, `pt`, `zh`, `ja`, `ko`, `ar`
- **Device:** `auto` / `cpu` / `cuda`
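These settings are typically supplied through the same `.env` file. The variable names below are illustrative assumptions, not confirmed names; `config.example.env` in the repository is the authoritative reference:

```env
# Hypothetical variable names; confirm against config.example.env
WHISPER_MODEL_SIZE=small   # tiny / small / medium / large
TARGET_LANGUAGE=en         # en / es / fr / de / it / pt / zh / ja / ko / ar
DEVICE=auto                # auto / cpu / cuda
```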

## Supported Audio Formats

- WAV (recommended)
- MP3
- OGG
- FLAC
- M4A

**Maximum file size:** 100 MB
**Recommended duration:** under 30 minutes
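If a file in another container or codec fails to load, converting it to WAV with `ffmpeg` (a separate install, not bundled with this project) usually helps; 16 kHz mono is a common ASR-friendly target:

```bash
# Convert an input file to 16 kHz mono WAV
ffmpeg -i input.m4a -ar 16000 -ac 1 output.wav
```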

## Development

### Local Development

```bash
python run_fastapi.py
```

### Production Deployment

```bash
uvicorn web_app:app --host 0.0.0.0 --port 8000
```
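Because the repository includes a `Dockerfile`, a containerized deployment is also possible. A minimal sketch, assuming the app listens on port 8000 inside the container (the image name is a placeholder):

```bash
# Build the image and run it with the local .env file
docker build -t audio-intelligence .
docker run -p 8000:8000 --env-file .env audio-intelligence
```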

## Performance

- **Processing Speed:** 2-14x real-time, depending on model size
- **Memory Usage:** reduced through INT8 quantization
- **CPU Optimized:** runs without a GPU
- **Concurrent Processing:** async/await support

## Troubleshooting

### Common Issues

1. **Dependencies:** install from `requirements.txt` into a clean environment
2. **Memory:** use smaller models (`tiny`/`small`) on limited hardware
3. **Audio Format:** convert to WAV if other formats fail (see the ffmpeg example in the Supported Audio Formats section above)
4. **Port Conflicts:** change the port in `run_fastapi.py` if 8000 is occupied, or run uvicorn on another port as shown below
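For the port-conflict case above, an alternative to editing `run_fastapi.py` is to launch uvicorn directly on a free port:

```bash
# Serve on 8080 instead of the default 8000
uvicorn web_app:app --host 0.0.0.0 --port 8080
```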

### Error Resolution

- Check the logs in the terminal output
- Verify the audio file format and size
- Ensure all dependencies are installed
- Check available system memory

## Support

- **Documentation:** interactive API docs at the `/api/docs` endpoint
- **System Info:** use the info button in the web interface
- **Logs:** monitor terminal output for detailed information

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference