HAMMALE committed
Commit c56a4e0 · verified · 1 Parent(s): 0062c8a

Upload all space files

Files changed (5)
  1. README.md +185 -12
  2. app.py +255 -0
  3. female_embedding.pt +3 -0
  4. male_embedding.pt +3 -0
  5. requirements.txt +6 -0
README.md CHANGED
@@ -1,12 +1,185 @@
- ---
- title: Speecht5 Darija
- emoji: 📈
- colorFrom: yellow
- colorTo: purple
- sdk: gradio
- sdk_version: 5.27.0
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Moroccan Darija Text-to-Speech Model
+
+ This project implements a Text-to-Speech (TTS) system for Moroccan Darija using the SpeechT5 architecture. It is fine-tuned on the DODa-audio-dataset to generate natural-sounding Darija speech from text input.
+
+ ## Table of Contents
+ - [Dataset Overview](#dataset-overview)
+ - [Installation](#installation)
+ - [Model Training](#model-training)
+ - [Inference](#inference)
+ - [Gradio Demo](#gradio-demo)
+ - [Project Features](#project-features)
+ - [Potential Applications](#potential-applications)
+ - [Limitations and Future Work](#limitations-and-future-work)
+ - [License](#license)
+ - [Acknowledgments](#acknowledgments)
+
+ ## Dataset Overview
+
+ The **DODa audio dataset** contains 12,743 sentences recorded by 7 contributors (4 female, 3 male). Key characteristics:
+
+ - Audio recordings standardized at a 16kHz sample rate
+ - Multiple text representations (Latin script, Arabic script, and English translations)
+ - High-quality recordings with manual corrections
+
+ ### Dataset Structure
+ | Column Name | Description |
+ |-------------|-------------|
+ | **audio** | Speech recordings for Darija sentences |
+ | **darija_Ltn** | Darija sentences using Latin letters |
+ | **darija_Arab_new** | Corrected Darija sentences using Arabic script |
+ | **english** | English translation of Darija sentences |
+ | **darija_Arab_old** | Original (uncorrected) Darija sentences in Arabic script |
+
+ ### Speaker Distribution
+ The recordings are distributed across the 7 speakers as follows:
+ ```
+ Samples 0-999       -> Female 1
+ Samples 1000-1999   -> Male 3
+ Samples 2000-2730   -> Female 2
+ Samples 2731-2800   -> Male 1
+ Samples 2801-3999   -> Male 2
+ Samples 4000-4999   -> Male 1
+ Samples 5000-5999   -> Female 3
+ Samples 6000-6999   -> Male 1
+ Samples 7000-7999   -> Female 4
+ Samples 8000-8999   -> Female 1
+ Samples 9000-9999   -> Male 2
+ Samples 10000-11999 -> Male 1
+ Samples 12000-12350 -> Male 2
+ Samples 12351-12742 -> Male 1
+ ```
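+
+ Speaker identity is implicit in the sample index, so a small helper can recover it. This is an illustrative sketch (the function and label names are not part of the dataset):
+
+ ```python
+ # Inclusive index ranges taken from the distribution table above
+ SPEAKER_RANGES = [
+     (0, 999, "female_1"), (1000, 1999, "male_3"),
+     (2000, 2730, "female_2"), (2731, 2800, "male_1"),
+     (2801, 3999, "male_2"), (4000, 4999, "male_1"),
+     (5000, 5999, "female_3"), (6000, 6999, "male_1"),
+     (7000, 7999, "female_4"), (8000, 8999, "female_1"),
+     (9000, 9999, "male_2"), (10000, 11999, "male_1"),
+     (12000, 12350, "male_2"), (12351, 12742, "male_1"),
+ ]
+
+ def speaker_for_index(i):
+     """Return the speaker label for dataset sample i."""
+     for start, end, speaker in SPEAKER_RANGES:
+         if start <= i <= end:
+             return speaker
+     raise ValueError(f"index {i} is outside the 12,743-sample dataset")
+ ```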
+
+ ## Installation
+
+ To set up the project environment:
+
+ ```bash
+ # Clone the repository
+ git clone https://github.com/yourusername/darija-tts.git
+ cd darija-tts
+
+ # Create a virtual environment (optional but recommended)
+ python -m venv venv
+ source venv/bin/activate  # On Windows: venv\Scripts\activate
+
+ # Install dependencies
+ pip install -r requirements.txt
+ ```
+
+ ## Model Training
+
+ The model training process involves:
+
+ 1. **Data Loading**: Loading the DODa audio dataset from Hugging Face
+ 2. **Data Preprocessing**: Normalizing text and extracting speaker embeddings (see the sketch below)
+ 3. **Model Setup**: Configuring the SpeechT5 model for Darija TTS
+ 4. **Training**: Fine-tuning the model using the prepared dataset
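+
+ The speaker embeddings in step 2 are 512-dimensional x-vectors. A minimal sketch of how such an embedding can be extracted with SpeechBrain, assuming a 16kHz waveform (the helper name is illustrative):
+
+ ```python
+ import torch
+ from speechbrain.pretrained import EncoderClassifier
+
+ # x-vector speaker encoder, the same one used in the demo app
+ encoder = EncoderClassifier.from_hparams(
+     source="speechbrain/spkrec-xvect-voxceleb",
+     savedir="/tmp/spkrec-xvect-voxceleb",
+ )
+
+ def extract_embedding(waveform):
+     """1-D float tensor at 16kHz -> L2-normalized (1, 512) x-vector."""
+     with torch.no_grad():
+         emb = encoder.encode_batch(waveform.unsqueeze(0))
+         emb = torch.nn.functional.normalize(emb, dim=2)
+     return emb.squeeze(0)
+ ```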
+
+ To run the training:
+
+ ```bash
+ # Open the Jupyter notebook
+ jupyter notebook notebooks/train_darija_tts.ipynb
+ ```
+
+ Key training parameters:
+ - Learning rate: 1e-4
+ - Batch size: 4 (with gradient accumulation over 8 steps)
+ - Training steps: 1000
+ - Evaluation frequency: every 100 steps
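+
+ In `Seq2SeqTrainingArguments` terms, these map roughly to the sketch below (`output_dir` and the save settings are assumptions, not values taken from the notebook):
+
+ ```python
+ from transformers import Seq2SeqTrainingArguments
+
+ training_args = Seq2SeqTrainingArguments(
+     output_dir="speecht5_finetuned_darija",  # assumed path
+     per_device_train_batch_size=4,
+     gradient_accumulation_steps=8,
+     learning_rate=1e-4,
+     max_steps=1000,
+     evaluation_strategy="steps",
+     eval_steps=100,
+     save_steps=100,  # assumption
+     label_names=["labels"],
+ )
+ ```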
+
+ ## Inference
+
+ To generate speech from text after training:
+
+ ```python
+ from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
+ import torch
+ import soundfile as sf
+
+ # Load the fine-tuned model, its processor, and the HiFi-GAN vocoder
+ model_path = "./models/speecht5_finetuned_Darija"
+ processor = SpeechT5Processor.from_pretrained(model_path)
+ model = SpeechT5ForTextToSpeech.from_pretrained(model_path)
+ vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
+
+ # Load a pre-computed speaker embedding of shape (1, 512)
+ speaker_embedding = torch.load("./data/speaker_embeddings/female_embedding.pt")
+
+ # Process the input text
+ text = "Salam, kifach nta lyoum?"
+ inputs = processor(text=text, return_tensors="pt")
+
+ # Generate speech
+ speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)
+
+ # Save the audio at the model's 16kHz sample rate
+ sf.write("output.wav", speech.numpy(), 16000)
+ ```
+
+ ## Gradio Demo
+
+ The project includes a Gradio demo that provides a user-friendly interface for text-to-speech conversion:
+
+ ```bash
+ # Run the demo locally
+ cd demo
+ python app.py
+ ```
+
+ The demo features:
+ - Text input field for Darija text (Latin script)
+ - Voice selection (male/female)
+ - Speech speed adjustment
+ - Audio playback of generated speech
+
+ ### Deploying to Hugging Face Spaces
+
+ To deploy the demo to Hugging Face Spaces:
+
+ 1. Push your model to the Hugging Face Hub
+ 2. Create a new Space with the Gradio SDK
+ 3. Upload the demo files to the Space
+
+ See the notebook for detailed deployment instructions.
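+
+ Step 3 can also be done programmatically with `huggingface_hub` (a sketch; the Space id is a placeholder, and it assumes you are already logged in via `huggingface-cli login`):
+
+ ```python
+ from huggingface_hub import HfApi
+
+ api = HfApi()
+ api.upload_folder(
+     folder_path="demo",                  # local demo files
+     repo_id="your-username/darija-tts",  # placeholder Space id
+     repo_type="space",
+ )
+ ```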
+
+ ## Project Features
+
+ - **Multi-Speaker TTS**: Generate speech in both male and female voices
+ - **Voice Cloning**: Uses speaker embeddings to preserve voice characteristics
+ - **Speed Control**: Adjust the speech rate as needed
+ - **Text Normalization**: Handles varied text input through proper normalization
+
+ ## Potential Applications
+
+ - **Voice Assistants**: Build voice assistants that speak Moroccan Darija
+ - **Accessibility Tools**: Create tools for people with visual impairments
+ - **Language Learning**: Develop applications for learning Darija pronunciation
+ - **Content Creation**: Generate voiceovers for videos or audio content
+ - **Public Announcements**: Create automated announcement systems in Darija
+
+ ## Limitations and Future Work
+
+ Current limitations:
+ - The model may struggle with code-switching between Darija and other languages
+ - Pronunciation of certain loanwords might be inconsistent
+ - Limited emotional range in the generated speech
+
+ Future improvements:
+ - Fine-tune with more diverse speech data
+ - Implement emotion control for expressive speech
+ - Add support for Arabic-script input
+ - Develop a multilingual version supporting Darija, Arabic, and French
+
+ ## License
+
+ This project is released under the MIT License. The DODa audio dataset is also available under the MIT License.
+
+ ## Acknowledgments
+
+ - The [DODa audio dataset](https://huggingface.co/datasets/atlasia/DODa-audio-dataset) creators
+ - Hugging Face for the Transformers library and model hosting
+ - Microsoft Research for the SpeechT5 model architecture
app.py ADDED
@@ -0,0 +1,255 @@
+ import os
+ import re
+
+ import gradio as gr
+ import torch
+ import soundfile as sf
+ from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
+ from speechbrain.pretrained import EncoderClassifier
+
+ # Define paths and device
+ model_path = "HAMMALE/speecht5-darija"  # Model repo on the HF Hub
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ print(f"Using device: {device}")
+
+ # Load the fine-tuned model, its processor, and the HiFi-GAN vocoder
+ processor = SpeechT5Processor.from_pretrained(model_path)
+ model = SpeechT5ForTextToSpeech.from_pretrained(model_path).to(device)
+ vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan").to(device)
+
+ # Load the x-vector speaker encoder (available for computing new speaker
+ # embeddings; the demo below uses the pre-computed ones)
+ speaker_model = EncoderClassifier.from_hparams(
+     source="speechbrain/spkrec-xvect-voxceleb",
+     run_opts={"device": device},
+     savedir=os.path.join("/tmp", "spkrec-xvect-voxceleb"),
+ )
+
+ # Load pre-computed speaker embeddings, falling back to random vectors
+ # if the .pt files are missing
+ male_embedding = torch.load("male_embedding.pt") if os.path.exists("male_embedding.pt") else torch.randn(1, 512)
+ female_embedding = torch.load("female_embedding.pt") if os.path.exists("female_embedding.pt") else torch.randn(1, 512)
+
+ # Text normalization function
+ def normalize_text(text):
+     """Normalize text for TTS processing"""
+     text = text.lower()
+     # Keep letters, digits, whitespace, apostrophes, and Arabic-script characters
+     text = re.sub(r'[^\w\s\'\u0600-\u06FF]', '', text)
+     # Collapse repeated whitespace
+     text = ' '.join(text.split())
+     return text
+
+ # Function to synthesize speech
+ def synthesize_speech(text, voice_type="male", speed=1.0):
+     """Generate speech from text using the specified voice type"""
+     try:
+         # Select speaker embedding based on voice type
+         if voice_type == "male":
+             speaker_embeddings = male_embedding.to(device)
+         else:
+             speaker_embeddings = female_embedding.to(device)
+
+         # Normalize and tokenize input text
+         normalized_text = normalize_text(text)
+         inputs = processor(text=normalized_text, return_tensors="pt").to(device)
+
+         # Generate speech
+         with torch.no_grad():
+             speech = model.generate_speech(
+                 inputs["input_ids"],
+                 speaker_embeddings,
+                 vocoder=vocoder
+             )
+
+         # Convert to a numpy array on the CPU
+         speech_np = speech.cpu().numpy()
+
+         # Apply speed adjustment by simple resampling; note this also shifts
+         # pitch, so a production system would use a proper time-stretching library
+         if speed != 1.0:
+             from scipy import signal
+             new_length = int(len(speech_np) / speed)
+             speech_np = signal.resample(speech_np, new_length)
+
+         # Save a temporary audio file at the model's 16kHz sample rate
+         output_file = "output_speech.wav"
+         sf.write(output_file, speech_np, 16000)
+
+         return output_file, None
+
+     except Exception as e:
+         return None, f"Error generating speech: {str(e)}"
+
+ # Custom CSS for better design
+ custom_css = """
+ .gradio-container {
+     font-family: 'Poppins', 'Arial', sans-serif;
+     max-width: 750px;
+     margin: auto;
+ }
+
+ .main-header {
+     background: linear-gradient(90deg, #c31432, #240b36);
+     color: white;
+     padding: 1.5em;
+     border-radius: 10px;
+     text-align: center;
+     margin-bottom: 1em;
+     box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1);
+ }
+
+ .main-header h1 {
+     font-size: 2.2em;
+     margin-bottom: 0.3em;
+ }
+
+ .main-header p {
+     font-size: 1.1em;
+     opacity: 0.9;
+ }
+
+ footer {
+     text-align: center;
+     margin-top: 2em;
+     color: #555;
+     font-size: 0.9em;
+ }
+
+ .flag-icon {
+     width: 24px;
+     height: 24px;
+     vertical-align: middle;
+     margin-right: 8px;
+ }
+
+ .example-header {
+     font-weight: bold;
+     color: #c31432;
+     margin-top: 1em;
+ }
+
+ .info-box {
+     background-color: #f9f9f9;
+     border-left: 4px solid #c31432;
+     padding: 1em;
+     margin: 1em 0;
+     border-radius: 5px;
+ }
+
+ .voice-selector {
+     display: flex;
+     justify-content: center;
+     gap: 20px;
+     margin: 10px 0;
+ }
+
+ .voice-option {
+     border: 2px solid #ddd;
+     border-radius: 10px;
+     padding: 10px 15px;
+     transition: all 0.3s ease;
+     cursor: pointer;
+ }
+
+ .voice-option.selected {
+     border-color: #c31432;
+     background-color: #fff5f5;
+ }
+
+ .slider-container {
+     margin: 20px 0;
+ }
+ """
+
+ # Create Gradio interface with improved design
+ with gr.Blocks(css=custom_css) as demo:
+     gr.HTML(
+         """
+         <div class="main-header">
+             <h1>🇲🇦 Moroccan Darija Text-to-Speech 🎧</h1>
+             <p>Convert Moroccan Arabic (Darija) text into natural-sounding speech</p>
+         </div>
+         """
+     )
+
+     with gr.Row():
+         with gr.Column():
+             gr.HTML(
+                 """
+                 <div class="info-box">
+                     <p>This model was fine-tuned on the DODa audio dataset to produce high-quality
+                     Darija speech from text input. You can adjust the voice and speed below.</p>
+                 </div>
+                 """
+             )
+
+             text_input = gr.Textbox(
+                 label="Enter Darija Text",
+                 placeholder="Kteb chi jomla b darija hna...",
+                 lines=3
+             )
+
+             with gr.Row():
+                 voice_type = gr.Radio(
+                     ["male", "female"],
+                     label="Voice Type",
+                     value="male"
+                 )
+
+                 speed = gr.Slider(
+                     minimum=0.5,
+                     maximum=2.0,
+                     value=1.0,
+                     step=0.1,
+                     label="Speech Speed"
+                 )
+
+             generate_btn = gr.Button("Generate Speech", variant="primary")
+
+             gr.HTML(
+                 """
+                 <div class="example-header">Example phrases:</div>
+                 <ul>
+                     <li>"Ana Nadi Bezzaaf hhh"</li>
+                     <li>"Lyoum ajwaa zwina bezzaf."</li>
+                     <li>"lmaghrib ahssan blad fi l3alam"</li>
+                 </ul>
+                 """
+             )
+
+         with gr.Column():
+             audio_output = gr.Audio(label="Generated Speech")
+             error_output = gr.Textbox(label="Error (if any)", visible=False)
+
+             gr.Examples(
+                 examples=[
+                     ["Ana Nadi Bezzaaf hhh", "male", 1.0],
+                     ["Lyoum ajwaa zwina bezzaf.", "female", 1.0],
+                     ["lmaghrib ahssan blad fi l3alam", "male", 1.0],
+                     ["Filistine hora mina lbar ila lbahr", "female", 0.8],
+                 ],
+                 inputs=[text_input, voice_type, speed],
+                 outputs=[audio_output, error_output],
+                 fn=synthesize_speech
+             )
+
+     gr.HTML(
+         """
+         <footer>
+             <p>Developed by HAMMALE | Powered by Microsoft SpeechT5 | Data: DODa</p>
+         </footer>
+         """
+     )
+
+     # Set button click action
+     generate_btn.click(
+         fn=synthesize_speech,
+         inputs=[text_input, voice_type, speed],
+         outputs=[audio_output, error_output]
+     )
+
+ # Launch the demo
+ if __name__ == "__main__":
+     demo.launch()
female_embedding.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9f87744b22b910652a6cc2645e99e9c2d6d63d4d8cb5e803dd0b8b520863ea0f
+ size 3273
male_embedding.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:268bdfbc37c7bab90de174bc30105c509d8149b1b8be2ddd888353ab924d6c7c
+ size 3263
requirements.txt ADDED
@@ -0,0 +1,6 @@
+ gradio>=3.50.2
+ torch>=2.0.0
+ transformers>=4.32.0
+ speechbrain==0.5.16
+ soundfile>=0.12.1
+ scipy>=1.11.1