---
title: Multilingual Audio Intelligence System
emoji: 🎡
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
short_description: AI system for multilingual transcription and translation
---

# 🎡 Multilingual Audio Intelligence System

<img src="static/imgs/banner.png" alt="Multilingual Audio Intelligence System Banner"/>

## Overview

The Multilingual Audio Intelligence System combines speaker diarization, automatic speech recognition, and neural machine translation in a single audio analysis pipeline. It processes multilingual audio, identifies individual speakers, transcribes their speech, and translates the transcript into a target language, turning raw audio into structured output.

## Features

### Demo Mode with Professional Audio Files
- **Yuri Kizaki - Japanese Audio**: Professional voice message about website communication
- **French Film Podcast**: Discussion about movies including Social Network and Paranormal Activity
- Smart demo file management with automatic download and preprocessing
- Instant results with cached processing for blazing-fast demonstration

### Enhanced User Interface
- **Audio Waveform Visualization**: Real-time waveform display with HTML5 Canvas
- **Interactive Demo Selection**: Beautiful cards for selecting demo audio files
- **Improved Transcript Display**: Color-coded confidence levels and clear translation sections
- **Professional Audio Preview**: Audio player with waveform visualization

### Screenshots

#### 🎬 Demo Banner

<img src="static/imgs/demo_banner.png" alt="Demo Banner"/>

#### 📝 Transcript with Translation

<img src="static/imgs/demo_res_transcript_translate.png" alt="Transcript with Translation"/>

#### 📊 Visual Representation

<p align="center">
  <img src="static/imgs/demo_res_visual.png" alt="Visual Output"/>
</p>

#### 🧠 Summary Output

<img src="static/imgs/demo_res_summary.png" alt="Summary Output"/>

## Demo & Documentation

- 🎥 [Video Preview](https://drive.google.com/file/d/1dfYM5p9cKGw0C5RBvmyN6DUWgnEZk56M/view)
- 📄 [Project Documentation](DOCUMENTATION.md)

## Installation and Quick Start

1. **Clone the Repository:**
   ```bash
   git clone https://github.com/Prathameshv07/Multilingual-Audio-Intelligence-System.git
   cd Multilingual-Audio-Intelligence-System
   ```

2. **Create and Activate Conda Environment:**
   ```bash
   conda create --name audio_challenge python=3.9
   conda activate audio_challenge
   ```

3. **Install Dependencies:**
   ```bash
   pip install -r requirements.txt
   ```

4. **Configure Environment Variables:**
   ```bash
   cp config.example.env .env
   # Edit .env file with your HUGGINGFACE_TOKEN for accessing gated models
   ```

5. **Preload AI Models (Recommended):**
   ```bash
   python model_preloader.py
   ```

6. **Start the Application:**
   ```bash
   python run_fastapi.py
   ```

## File Structure

```
Multilingual-Audio-Intelligence-System/
├── web_app.py                      # FastAPI application with RESTful endpoints
├── model_preloader.py              # Intelligent model loading with progress tracking
├── run_fastapi.py                  # Application startup script with preloading
├── src/
│   ├── main.py                     # AudioIntelligencePipeline orchestrator
│   ├── audio_processor.py          # Advanced audio preprocessing and normalization
│   ├── speaker_diarizer.py         # pyannote.audio integration for speaker identification
│   ├── speech_recognizer.py        # faster-whisper ASR with language detection
│   ├── translator.py               # Neural machine translation with multiple models
│   ├── output_formatter.py         # Multi-format result generation and export
│   └── utils.py                    # Utility functions and performance monitoring
├── templates/
│   └── index.html                  # Responsive web interface with home page
├── static/                         # Static assets and client-side resources
├── model_cache/                    # Intelligent model caching directory
├── uploads/                        # User audio file storage
├── outputs/                        # Generated results and downloads
├── requirements.txt                # Comprehensive dependency specification
├── Dockerfile                      # Production-ready containerization
└── config.example.env              # Environment configuration template
```

## Configuration

### Environment Variables
Create a `.env` file:
```env
# Optional, only needed for gated models
HUGGINGFACE_TOKEN=hf_your_token_here
```
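
Because the token is optional, code that reads it should treat a missing or blank value as "no token" rather than as an error. The following is a minimal sketch of that lookup, not the project's actual implementation; the helper name `get_hf_token` is hypothetical, and it assumes the variable has already been exported into the process environment.

```python
import os
from typing import Mapping, Optional


def get_hf_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the Hugging Face token, or None if it is unset or blank.

    Hypothetical helper: only the variable name HUGGINGFACE_TOKEN
    comes from this README.
    """
    token = env.get("HUGGINGFACE_TOKEN", "").strip()
    return token or None


if __name__ == "__main__":
    if get_hf_token() is None:
        print("No HUGGINGFACE_TOKEN set; gated models may be unavailable.")
    else:
        print("HUGGINGFACE_TOKEN found.")
```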

### Model Configuration
- **Whisper Model**: tiny/small/medium/large
- **Target Language**: en/es/fr/de/it/pt/zh/ja/ko/ar
- **Device**: auto/cpu/cuda
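
The options above can be checked before any model is loaded. The sketch below is illustrative only: the `ModelConfig` class and its default values are assumptions, not part of the project; only the allowed values are taken from the list above.

```python
from dataclasses import dataclass

# Allowed values, copied from the Model Configuration list.
WHISPER_MODELS = ("tiny", "small", "medium", "large")
TARGET_LANGUAGES = ("en", "es", "fr", "de", "it", "pt", "zh", "ja", "ko", "ar")
DEVICES = ("auto", "cpu", "cuda")


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical settings container; the defaults are assumptions."""

    whisper_model: str = "small"
    target_language: str = "en"
    device: str = "auto"

    def __post_init__(self) -> None:
        # Reject any value that is not in the documented option lists.
        if self.whisper_model not in WHISPER_MODELS:
            raise ValueError(f"Unknown Whisper model: {self.whisper_model!r}")
        if self.target_language not in TARGET_LANGUAGES:
            raise ValueError(f"Unsupported target language: {self.target_language!r}")
        if self.device not in DEVICES:
            raise ValueError(f"Unknown device: {self.device!r}")


if __name__ == "__main__":
    print(ModelConfig(whisper_model="tiny", target_language="fr", device="cpu"))
```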

## Supported Audio Formats

- WAV (recommended)
- MP3
- OGG
- FLAC
- M4A

**Maximum file size**: 100MB  
**Recommended duration**: Under 30 minutes
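
These limits can be checked on the client side before uploading. The following is a rough sketch under stated assumptions: the function `check_upload` is hypothetical, "100MB" is interpreted as 100 MiB, and the format is judged by file extension only, which may differ from how the application itself validates files.

```python
from pathlib import Path
from typing import List

# Extensions for the formats listed above.
SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".ogg", ".flac", ".m4a"}
# Assumption: "100MB" is read as 100 MiB.
MAX_FILE_SIZE_BYTES = 100 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> List[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"Unsupported format: {extension or '(no extension)'}")
    if size_bytes > MAX_FILE_SIZE_BYTES:
        problems.append(f"File is larger than {MAX_FILE_SIZE_BYTES} bytes")
    return problems


if __name__ == "__main__":
    print(check_upload("interview.aac", 5_000_000))
```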

## Development

### Local Development
```bash
python run_fastapi.py
```

### Production Deployment
```bash
uvicorn web_app:app --host 0.0.0.0 --port 8000
```

## Performance

- **Processing Speed**: 2-14x real-time (depending on model size)
- **Memory Usage**: Optimized with INT8 quantization
- **CPU Optimized**: Works without GPU
- **Concurrent Processing**: Async/await support
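
To relate the speed figure to wall-clock time, the sketch below assumes that "Nx real-time" means audio duration divided by processing time. That interpretation and both function names are assumptions; actual throughput depends on hardware and model size.

```python
def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Audio duration divided by processing time (assumed meaning of 'Nx real-time')."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def estimated_processing_seconds(audio_seconds: float, factor: float) -> float:
    """Rough processing-time estimate for a given speed factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return audio_seconds / factor


if __name__ == "__main__":
    thirty_minutes = 30 * 60
    # Estimates at the low and high ends of the quoted 2-14x range.
    print(estimated_processing_seconds(thirty_minutes, 2))
    print(round(estimated_processing_seconds(thirty_minutes, 14), 1))
```

Under this reading, a 30-minute file would take roughly 15 minutes at 2x and a little over 2 minutes at 14x.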

## Troubleshooting

### Common Issues

1. **Dependencies**: Use `requirements.txt` for clean installation
2. **Memory**: Use smaller models (tiny/small) for limited hardware
3. **Audio Format**: Convert to WAV if other formats fail
4. **Port Conflicts**: Change port in `run_fastapi.py` if 8000 is occupied
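
For the port conflict case, a quick way to check whether a port is occupied is to try binding to it. This is a minimal standard-library sketch; the function names are hypothetical, and it only checks the loopback address by default, so it is a hint rather than a guarantee that the server will start.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def find_free_port(start: int = 8000, attempts: int = 20) -> int:
    """Return the first free port in [start, start + attempts)."""
    for port in range(start, start + attempts):
        if is_port_free(port):
            return port
    raise RuntimeError(f"No free port found in {start}-{start + attempts - 1}")


if __name__ == "__main__":
    print("Port 8000 free:", is_port_free(8000))
    print("First free port from 8000:", find_free_port())
```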

### Error Resolution
- Check logs in terminal output
- Verify audio file format and size
- Ensure all dependencies are installed
- Check available system memory

## Support

- **Documentation**: Check `/api/docs` endpoint
- **System Info**: Use the info button in the web interface
- **Logs**: Monitor terminal output for detailed information

---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference