Prathamesh Sarjerao Vaidya
committed · Commit 5e6e4ea · Parent(s): 3f792e8

made some changes

Browse files:
- DOCUMENTATION.md +9 -10
- README.md +60 -88
- static/imgs/banner.png +3 -0
- static/imgs/demo_banner.png +3 -0
- static/imgs/demo_res_summary.png +3 -0
- static/imgs/demo_res_transcript_translate.png +3 -0
- static/imgs/demo_res_visual.png +3 -0
DOCUMENTATION.md
CHANGED

````diff
@@ -16,7 +16,7 @@ The primary objective of the Multilingual Audio Intelligence System is to revolu
 
 ## 3. Technologies and Tools
 
-- **Programming Language:** Python 3.
+- **Programming Language:** Python 3.8+
 - **Web Framework:** FastAPI with Uvicorn ASGI server for high-performance async operations
 - **Frontend Technology:** HTML5, TailwindCSS, and Vanilla JavaScript for responsive user interface
 - **Machine Learning Libraries:**
@@ -45,7 +45,7 @@ The primary objective of the Multilingual Audio Intelligence System is to revolu
 - Storage: 10GB+ available space for application, models, and processing cache
 - GPU: Optional NVIDIA GPU with 4GB+ VRAM for accelerated processing
 - Network: Stable internet connection for initial model downloading
-- **Software:** Python 3.
+- **Software:** Python 3.8+, pip package manager, Docker (optional), web browser (Chrome, Firefox, Safari, Edge)
 
 ## 5. Setup Instructions
 
@@ -53,8 +53,8 @@ The primary objective of the Multilingual Audio Intelligence System is to revolu
 
 1. **Clone the Repository:**
    ```bash
-   git clone https://github.com/
-   cd
+   git clone https://github.com/Prathameshv07/Multilingual-Audio-Intelligence-System.git
+   cd Multilingual-Audio-Intelligence-System
    ```
 
 2. **Create and Activate Conda Environment:**
@@ -98,7 +98,7 @@ The primary objective of the Multilingual Audio Intelligence System is to revolu
 ## 6. Detailed Project Structure
 
 ```
-
+Multilingual-Audio-Intelligence-System/
 ├── web_app.py                # FastAPI application with RESTful endpoints
 ├── model_preloader.py        # Intelligent model loading with progress tracking
 ├── run_fastapi.py            # Application startup script with preloading
@@ -117,10 +117,9 @@ multilingual-audio-intelligence/
 ├── model_cache/              # Intelligent model caching directory
 ├── uploads/                  # User audio file storage
 ├── outputs/                  # Generated results and downloads
-├── requirements.txt
+├── requirements.txt          # Comprehensive dependency specification
 ├── Dockerfile                # Production-ready containerization
-
-├── config.example.env        # Environment configuration template
+└── config.example.env        # Environment configuration template
 ```
 
 ## 6.1 Demo Mode & Sample Files
@@ -129,8 +128,8 @@ The application ships with a professional demo mode for instant showcases withou
 
 - Demo files are automatically downloaded at startup (if missing) into `demo_audio/` and preprocessed into `demo_results/` for blazing-fast responses.
 - Available demos:
-  -
-  -
+  - [Yuri_Kizaki.mp3](https://www.mitsue.co.jp/service/audio_and_video/audio_production/media/narrators_sample/yuri_kizaki/03.mp3) – Japanese narration about website communication
+  - [Film_Podcast.mp3](https://www.lightbulblanguages.co.uk/resources/audio/film-podcast.mp3) – French podcast discussing films like The Social Network
 - Static serving: demo audio is exposed at `/demo_audio/<filename>` for local preview.
 - The UI provides two selectable cards under Demo Mode; once selected, the system loads a preview and renders a waveform using HTML5 Canvas (Web Audio API) before processing.
````
README.md
CHANGED

````diff
@@ -1,6 +1,12 @@
 # 🎵 Multilingual Audio Intelligence System
 
-
+![Multilingual Audio Intelligence System](static/imgs/banner.png)
+
+## Overview
+
+The Multilingual Audio Intelligence System is an advanced AI-powered platform that combines state-of-the-art speaker diarization, automatic speech recognition, and neural machine translation to deliver comprehensive audio analysis capabilities. This sophisticated system processes multilingual audio content, identifies individual speakers, transcribes speech with high accuracy, and provides intelligent translations across multiple languages, transforming raw audio into structured, actionable insights.
+
+## Features
 
 ### Demo Mode with Professional Audio Files
 - **Yuri Kizaki - Japanese Audio**: Professional voice message about website communication (23 seconds)
@@ -14,111 +20,81 @@
 - **Improved Transcript Display**: Color-coded confidence levels and clear translation sections
 - **Professional Audio Preview**: Audio player with waveform visualization
 
-### 
-- Automatic demo file download from original sources
-- Cached preprocessing results for instant demo response
-- Enhanced error handling for missing or corrupted demo files
-- Web Audio API integration for dynamic waveform generation
+### Screenshots
 
-
+#### 🎬 Demo Banner
 
-
-# Install dependencies
-pip install -r requirements.txt
+![Demo Banner](static/imgs/demo_banner.png)
 
-
-python run_fastapi.py
+#### 📝 Transcript with Translation
 
-
-# http://127.0.0.1:8000
-```
+![Transcript with Translation](static/imgs/demo_res_transcript_translate.png)
 
-
+#### 📊 Visual Representation
 
-
-
-
-4. **Process**: Click "Process Audio" for instant results
-5. **Explore**: View transcripts, translations, and analytics
+<p align="center">
+  <img src="static/imgs/demo_res_visual.png" alt="Visual Output"/>
+</p>
 
-
+#### 🧠 Summary Output
 
-
-2. **Preview**: View waveform and listen to your audio
-3. **Configure**: Select model size and target language
-4. **Process**: Real-time processing with progress tracking
-5. **Download**: Export results in JSON, SRT, or TXT format
+![Summary Output](static/imgs/demo_res_summary.png)
 
-## 
+## Installation and Quick Start
 
-
+1. **Clone the Repository:**
+   ```bash
+   git clone https://github.com/Prathameshv07/Multilingual-Audio-Intelligence-System.git
+   cd Multilingual-Audio-Intelligence-System
+   ```
 
-
+2. **Create and Activate Conda Environment:**
+   ```bash
+   conda create --name audio_challenge python=3.9
+   conda activate audio_challenge
+   ```
 
-
-
-
-
+3. **Install Dependencies:**
+   ```bash
+   pip install -r requirements.txt
+   ```
 
-
+4. **Configure Environment Variables:**
+   ```bash
+   cp config.example.env .env
+   # Edit .env file with your HUGGINGFACE_TOKEN for accessing gated models
+   ```
 
-
-
-
-
-- **Interactive Visualization** - Waveform analysis
-- **Multiple Export Formats** - JSON, SRT, TXT
+5. **Preload AI Models (Recommended):**
+   ```bash
+   python model_preloader.py
+   ```
 
-
-
-
-
-- **Uvicorn** - ASGI server
-- **PyTorch** - Deep learning framework
-- **pyannote.audio** - Speaker diarization
-- **faster-whisper** - Speech recognition
-- **Helsinki-NLP** - Neural translation
-
-### Frontend
-- **HTML5/CSS3** - Clean markup
-- **TailwindCSS** - Utility-first styling
-- **JavaScript (Vanilla)** - Client-side logic
-- **Plotly.js** - Interactive visualizations
-- **Font Awesome** - Professional icons
-
-## API Endpoints
-
-### Core Endpoints
-- `GET /` - Main application interface
-- `POST /api/upload` - Upload and process audio
-- `GET /api/status/{task_id}` - Check processing status
-- `GET /api/results/{task_id}` - Retrieve results
-- `GET /api/download/{task_id}/{format}` - Download outputs
-
-### Demo Endpoints
-- `POST /api/demo-process` - Quick demo processing
-- `GET /api/system-info` - System information
+6. **Initialize Application:**
+   ```bash
+   python run_fastapi.py
+   ```
 
 ## File Structure
 
 ```
 audio_challenge/
-├── web_app.py
-├── run_fastapi.py
-├── requirements.txt
+├── web_app.py               # FastAPI application
+├── run_fastapi.py           # Startup script
+├── requirements.txt         # Dependencies
 ├── templates/
-│   └── index.html
-├── src/
-│   ├── main.py
-│   ├── audio_processor.py
-│   ├── speaker_diarizer.py
+│   └── index.html           # Main interface
+├── src/                     # Core modules
+│   ├── main.py              # Pipeline orchestrator
+│   ├── audio_processor.py   # Audio preprocessing
+│   ├── speaker_diarizer.py  # Speaker identification
 │   ├── speech_recognizer.py # ASR with language detection
-│   ├── translator.py
-│   ├── output_formatter.py
-│   └── utils.py
-├── static/
-├── uploads/
-├── outputs/
+│   ├── translator.py        # Neural machine translation
+│   ├── output_formatter.py  # Output generation
+│   └── utils.py             # Utility functions
+├── static/                  # Static assets
+├── uploads/                 # Uploaded files
+├── outputs/                 # Generated outputs
 └── README.md
 ```
 
@@ -180,10 +156,6 @@ uvicorn web_app:app --host 0.0.0.0 --port 8000
 - Ensure all dependencies are installed
 - Check available system memory
 
-## License
-
-MIT License - See LICENSE file for details
-
 ## Support
 
 - **Documentation**: Check `/api/docs` endpoint
````
static/imgs/banner.png
ADDED (Git LFS)

static/imgs/demo_banner.png
ADDED (Git LFS)

static/imgs/demo_res_summary.png
ADDED (Git LFS)

static/imgs/demo_res_transcript_translate.png
ADDED (Git LFS)

static/imgs/demo_res_visual.png
ADDED (Git LFS)