Prathamesh Sarjerao Vaidya committed
Commit 5e6e4ea · 1 Parent(s): 3f792e8

made some changes
DOCUMENTATION.md CHANGED
@@ -16,7 +16,7 @@ The primary objective of the Multilingual Audio Intelligence System is to revolu
 
 ## 3. Technologies and Tools
 
-- **Programming Language:** Python 3.9+
 - **Web Framework:** FastAPI with Uvicorn ASGI server for high-performance async operations
 - **Frontend Technology:** HTML5, TailwindCSS, and Vanilla JavaScript for responsive user interface
 - **Machine Learning Libraries:**
@@ -45,7 +45,7 @@ The primary objective of the Multilingual Audio Intelligence System is to revolu
 - Storage: 10GB+ available space for application, models, and processing cache
 - GPU: Optional NVIDIA GPU with 4GB+ VRAM for accelerated processing
 - Network: Stable internet connection for initial model downloading
-- **Software:** Python 3.9+, pip package manager, Docker (optional), web browser (Chrome, Firefox, Safari, Edge)
 
 ## 5. Setup Instructions
 
@@ -53,8 +53,8 @@ The primary objective of the Multilingual Audio Intelligence System is to revolu
 
 1. **Clone the Repository:**
 ```bash
-git clone https://github.com/your-username/multilingual-audio-intelligence.git
-cd multilingual-audio-intelligence
 ```
 
 2. **Create and Activate Conda Environment:**
@@ -98,7 +98,7 @@ The primary objective of the Multilingual Audio Intelligence System is to revolu
 ## 6. Detailed Project Structure
 
 ```
-multilingual-audio-intelligence/
 ├── web_app.py # FastAPI application with RESTful endpoints
 ├── model_preloader.py # Intelligent model loading with progress tracking
 ├── run_fastapi.py # Application startup script with preloading
@@ -117,10 +117,9 @@ multilingual-audio-intelligence/
 ├── model_cache/ # Intelligent model caching directory
 ├── uploads/ # User audio file storage
 ├── outputs/ # Generated results and downloads
-├── requirements.txt # Comprehensive dependency specification
 ├── Dockerfile # Production-ready containerization
-├── DEPLOYMENT_GUIDE.md # Comprehensive deployment instructions
-└── config.example.env # Environment configuration template
 ```
 
 ## 6.1 Demo Mode & Sample Files
@@ -129,8 +128,8 @@ The application ships with a professional demo mode for instant showcases withou
 
 - Demo files are automatically downloaded at startup (if missing) into `demo_audio/` and preprocessed into `demo_results/` for blazing-fast responses.
 - Available demos:
-  - `Yuri_Kizaki.mp3` — Japanese narration about website communication
-  - `Film_Podcast.mp3` — French podcast discussing films like The Social Network
 - Static serving: demo audio is exposed at `/demo_audio/<filename>` for local preview.
 - The UI provides two selectable cards under Demo Mode; once selected, the system loads a preview and renders a waveform using HTML5 Canvas (Web Audio API) before processing.
 
 
@@ -16,7 +16,7 @@
 
 ## 3. Technologies and Tools
 
+- **Programming Language:** Python 3.8+
 - **Web Framework:** FastAPI with Uvicorn ASGI server for high-performance async operations
 - **Frontend Technology:** HTML5, TailwindCSS, and Vanilla JavaScript for responsive user interface
 - **Machine Learning Libraries:**
@@ -45,7 +45,7 @@
 - Storage: 10GB+ available space for application, models, and processing cache
 - GPU: Optional NVIDIA GPU with 4GB+ VRAM for accelerated processing
 - Network: Stable internet connection for initial model downloading
+- **Software:** Python 3.8+, pip package manager, Docker (optional), web browser (Chrome, Firefox, Safari, Edge)
 
 ## 5. Setup Instructions
 
@@ -53,8 +53,8 @@
 
 1. **Clone the Repository:**
 ```bash
+git clone https://github.com/Prathameshv07/Multilingual-Audio-Intelligence-System.git
+cd Multilingual-Audio-Intelligence-System
 ```
 
 2. **Create and Activate Conda Environment:**
@@ -98,7 +98,7 @@
 ## 6. Detailed Project Structure
 
 ```
+Multilingual-Audio-Intelligence-System/
 ├── web_app.py # FastAPI application with RESTful endpoints
 ├── model_preloader.py # Intelligent model loading with progress tracking
 ├── run_fastapi.py # Application startup script with preloading
@@ -117,10 +117,9 @@
 ├── model_cache/ # Intelligent model caching directory
 ├── uploads/ # User audio file storage
 ├── outputs/ # Generated results and downloads
+├── requirements.txt # Comprehensive dependency specification
 ├── Dockerfile # Production-ready containerization
+└── config.example.env # Environment configuration template
 ```
 
 ## 6.1 Demo Mode & Sample Files
@@ -129,8 +128,8 @@
 
 - Demo files are automatically downloaded at startup (if missing) into `demo_audio/` and preprocessed into `demo_results/` for blazing-fast responses.
 - Available demos:
+  - [Yuri_Kizaki.mp3](https://www.mitsue.co.jp/service/audio_and_video/audio_production/media/narrators_sample/yuri_kizaki/03.mp3) — Japanese narration about website communication
+  - [Film_Podcast.mp3](https://www.lightbulblanguages.co.uk/resources/audio/film-podcast.mp3) — French podcast discussing films like The Social Network
 - Static serving: demo audio is exposed at `/demo_audio/<filename>` for local preview.
 - The UI provides two selectable cards under Demo Mode; once selected, the system loads a preview and renders a waveform using HTML5 Canvas (Web Audio API) before processing.
 
README.md CHANGED
@@ -1,6 +1,12 @@
 # 🎵 Multilingual Audio Intelligence System
 
-## New Features ✨
 
 ### Demo Mode with Professional Audio Files
 - **Yuri Kizaki - Japanese Audio**: Professional voice message about website communication (23 seconds)
@@ -14,111 +20,81 @@
 - **Improved Transcript Display**: Color-coded confidence levels and clear translation sections
 - **Professional Audio Preview**: Audio player with waveform visualization
 
-### Technical Improvements
-- Automatic demo file download from original sources
-- Cached preprocessing results for instant demo response
-- Enhanced error handling for missing or corrupted demo files
-- Web Audio API integration for dynamic waveform generation
-
-## Quick Start
-
-```bash
-# Install dependencies
-pip install -r requirements.txt
-
-# Start the application (includes demo file setup)
-python run_fastapi.py
-
-# Access the application
-# http://127.0.0.1:8000
-```
-
-## Demo Mode Usage
-
-1. **Select Demo Mode**: Click the "Demo Mode" button in the header
-2. **Choose Audio File**: Select either Japanese or French demo audio
-3. **Preview**: Listen to the audio and view the waveform
-4. **Process**: Click "Process Audio" for instant results
-5. **Explore**: View transcripts, translations, and analytics
-
-## Full Processing Mode
-
-1. **Upload Audio**: Drag & drop or click to upload your audio file
-2. **Preview**: View waveform and listen to your audio
-3. **Configure**: Select model size and target language
-4. **Process**: Real-time processing with progress tracking
-5. **Download**: Export results in JSON, SRT, or TXT format
-
-## Features
-
-## System Architecture
-
-### Core Components
-
-- **FastAPI Backend** - Production-ready web framework
-- **HTML/TailwindCSS Frontend** - Clean, professional interface
-- **Audio Processing Pipeline** - Integrated ML models
-- **RESTful API** - Standardized endpoints
-
-### Key Features
-
-- **Speaker Diarization** - Identify "who spoke when"
-- **Speech Recognition** - Convert speech to text
-- **Language Detection** - Automatic language identification
-- **Neural Translation** - Multi-language translation
-- **Interactive Visualization** - Waveform analysis
-- **Multiple Export Formats** - JSON, SRT, TXT
-
-## Technology Stack
-
-### Backend
-- **FastAPI** - Modern Python web framework
-- **Uvicorn** - ASGI server
-- **PyTorch** - Deep learning framework
-- **pyannote.audio** - Speaker diarization
-- **faster-whisper** - Speech recognition
-- **Helsinki-NLP** - Neural translation
-
-### Frontend
-- **HTML5/CSS3** - Clean markup
-- **TailwindCSS** - Utility-first styling
-- **JavaScript (Vanilla)** - Client-side logic
-- **Plotly.js** - Interactive visualizations
-- **Font Awesome** - Professional icons
-
-## API Endpoints
-
-### Core Endpoints
-- `GET /` - Main application interface
-- `POST /api/upload` - Upload and process audio
-- `GET /api/status/{task_id}` - Check processing status
-- `GET /api/results/{task_id}` - Retrieve results
-- `GET /api/download/{task_id}/{format}` - Download outputs
-
-### Demo Endpoints
-- `POST /api/demo-process` - Quick demo processing
-- `GET /api/system-info` - System information
 
 ## File Structure
 
 ```
 audio_challenge/
-├── web_app.py # FastAPI application
-├── run_fastapi.py # Startup script
-├── requirements.txt # Dependencies
 ├── templates/
-│   └── index.html # Main interface
-├── src/ # Core modules
-│   ├── main.py # Pipeline orchestrator
-│   ├── audio_processor.py # Audio preprocessing
-│   ├── speaker_diarizer.py # Speaker identification
 │   ├── speech_recognizer.py # ASR with language detection
-│   ├── translator.py # Neural machine translation
-│   ├── output_formatter.py # Output generation
-│   └── utils.py # Utility functions
-├── static/ # Static assets
-├── uploads/ # Uploaded files
-└── outputs/ # Generated outputs
 └── README.md
 ```
 
@@ -180,10 +156,6 @@ uvicorn web_app:app --host 0.0.0.0 --port 8000
 - Ensure all dependencies are installed
 - Check available system memory
 
-## License
-
-MIT License - See LICENSE file for details
-
 ## Support
 
 - **Documentation**: Check `/api/docs` endpoint
 
@@ -1,6 +1,12 @@
 # 🎵 Multilingual Audio Intelligence System
 
+![Multilingual Audio Intelligence System Banner](/static/imgs/banner.png)
+
+## Overview
+
+The Multilingual Audio Intelligence System is an advanced AI-powered platform that combines state-of-the-art speaker diarization, automatic speech recognition, and neural machine translation to deliver comprehensive audio analysis capabilities. This sophisticated system processes multilingual audio content, identifies individual speakers, transcribes speech with high accuracy, and provides intelligent translations across multiple languages, transforming raw audio into structured, actionable insights.
+
+## Features
 
 ### Demo Mode with Professional Audio Files
 - **Yuri Kizaki - Japanese Audio**: Professional voice message about website communication (23 seconds)
@@ -14,111 +20,81 @@
 - **Improved Transcript Display**: Color-coded confidence levels and clear translation sections
 - **Professional Audio Preview**: Audio player with waveform visualization
 
+### Screenshots
+
+#### 🎬 Demo Banner
+
+![Demo Banner](/static/imgs/demo_banner.png)
+
+#### 📝 Transcript with Translation
+
+![Transcript with Translation](/static/imgs/demo_res_transcript_translate.png)
+
+#### 📊 Visual Representation
+
+<p align="center">
+  <img src="static/imgs/demo_res_visual.png" alt="Visual Output"/>
+</p>
+
+#### 🧠 Summary Output
+
+![Summary Output](/static/imgs/demo_res_summary.png)
+
+## Installation and Quick Start
+
+1. **Clone the Repository:**
+```bash
+git clone https://github.com/Prathameshv07/Multilingual-Audio-Intelligence-System.git
+cd Multilingual-Audio-Intelligence-System
+```
+
+2. **Create and Activate Conda Environment:**
+```bash
+conda create --name audio_challenge python=3.9
+conda activate audio_challenge
+```
+
+3. **Install Dependencies:**
+```bash
+pip install -r requirements.txt
+```
+
+4. **Configure Environment Variables:**
+```bash
+cp config.example.env .env
+# Edit .env file with your HUGGINGFACE_TOKEN for accessing gated models
+```
+
+5. **Preload AI Models (Recommended):**
+```bash
+python model_preloader.py
+```
+
+6. **Initialize Application:**
+```bash
+python run_fastapi.py
+```
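Once the server from step 6 is up, the task lifecycle this README describes (an upload returns a `task_id`, which is then polled via `GET /api/status/{task_id}`) can be exercised with a small client. The sketch below is hedged: the endpoint paths follow the README, but the helper names and the `status` field in the response are assumptions, not the project's confirmed schema.

```python
import json
import time
from urllib.request import urlopen

# Default host/port as given in the README's Quick Start.
BASE = "http://127.0.0.1:8000"

def status_url(task_id, base=BASE):
    """Build the status-polling URL for a task."""
    return f"{base}/api/status/{task_id}"

def wait_for_result(task_id, poll=lambda url: json.load(urlopen(url)),
                    interval=2.0, timeout=300):
    """Poll the status endpoint until the task finishes or the timeout expires.

    `poll` is injectable so the HTTP layer can be replaced in tests; the
    terminal status values ("completed"/"failed") are assumed names.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = poll(status_url(task_id))
        if status.get("status") in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish within {timeout}s")
```

After a successful poll, results would be retrieved from `GET /api/results/{task_id}` in the same fashion.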
 
 ## File Structure
 
 ```
 audio_challenge/
+├── web_app.py # FastAPI application
+├── run_fastapi.py # Startup script
+├── requirements.txt # Dependencies
 ├── templates/
+│   └── index.html # Main interface
+├── src/ # Core modules
+│   ├── main.py # Pipeline orchestrator
+│   ├── audio_processor.py # Audio preprocessing
+│   ├── speaker_diarizer.py # Speaker identification
 │   ├── speech_recognizer.py # ASR with language detection
+│   ├── translator.py # Neural machine translation
+│   ├── output_formatter.py # Output generation
+│   └── utils.py # Utility functions
+├── static/ # Static assets
+├── uploads/ # Uploaded files
+└── outputs/ # Generated outputs
 └── README.md
 ```
 
 - Ensure all dependencies are installed
 - Check available system memory
 
 ## Support
 
 - **Documentation**: Check `/api/docs` endpoint
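The README's feature list advertises multiple export formats (JSON, SRT, TXT), generated by `output_formatter.py` in the tree above. A minimal sketch of the SRT side, assuming a simple segment schema (`start`/`end` in seconds, `speaker`, `text`) that may differ from the project's actual data structures:

```python
def to_srt_timestamp(seconds):
    """Convert seconds to the SRT HH:MM:SS,mmm timestamp format."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render diarized segments as numbered SRT cues, one per segment."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{to_srt_timestamp(seg['start'])} --> {to_srt_timestamp(seg['end'])}\n"
            f"{seg['speaker']}: {seg['text']}\n"
        )
    return "\n".join(blocks)
```

Prefixing each cue with the speaker label is one way to carry diarization output into a subtitle format that has no native speaker field.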
static/imgs/banner.png ADDED

Git LFS Details

  • SHA256: e4a170267de3826c8d9ac50ee263263e2e47a2af5931b430c5729e7249aed76d
  • Pointer size: 130 Bytes
  • Size of remote file: 91.1 kB
static/imgs/demo_banner.png ADDED

Git LFS Details

  • SHA256: 15718eb8bb2bfad00a146505ff01027ab25264ea5c09bb218cfd4f1333c40eb1
  • Pointer size: 130 Bytes
  • Size of remote file: 66.6 kB
static/imgs/demo_res_summary.png ADDED

Git LFS Details

  • SHA256: f5a3afc43bc5a3499c45edab49814f9153a3518f4d03843e1a05bd5648edc85f
  • Pointer size: 130 Bytes
  • Size of remote file: 22.9 kB
static/imgs/demo_res_transcript_translate.png ADDED

Git LFS Details

  • SHA256: bf03b1eb12f0bbc94da734190999d83309cfb0e9f54cabbb14f9d3a5d5782a98
  • Pointer size: 130 Bytes
  • Size of remote file: 82.8 kB
static/imgs/demo_res_visual.png ADDED

Git LFS Details

  • SHA256: aff9f34b8c715a3ecc0ec81d5d69efb56037c4b78de105711654d02a98fa1653
  • Pointer size: 130 Bytes
  • Size of remote file: 30.5 kB