---
language:
- uz
license: apache-2.0
tags:
- whisper
- automatic-speech-recognition
- audio-transcription
- uzbek
- fine-tuned
- speech-recognition
---

# NavaiSTT-2v Medium - Uzbek Speech-to-Text Model

A classic Whisper Medium model fine-tuned for the Uzbek language. The training data consisted of diverse audio: publicly available podcasts, Tashkent dialect podcasts, news, Google FLEURS, USC, and Common Voice 17. Transcript quality was mixed: 50% of the data was human-transcribed and 50% was pseudo-transcribed using Gemini 2.5 Pro.

The difference from v1 is that v2 is fully open-source. Due to conflicts with data partners, v1 was removed and the 500-hour dataset was excluded. New, different datasets were used instead, all of which will be open-sourced. The training scripts will also be open-sourced, so the entire process will be fully reproducible.

Special attention was given to Tashkent dialect audio, resulting in strong performance on this dialect. Future versions will include other regional dialects to improve overall coverage.

## Whitepaper

For more details on the methodology and research behind this model, visit: https://uz-speech.web.app/navaistt02m

Training and filtering code: https://github.com/Islomov49/navaistt_v2-open-sourced

Support my work and the open-source movement: https://tirikchilik.uz/islomovs

## Model Details

- **Base Model:** Whisper Medium
- **Parameters:** 769M
- **Performance:**
  - WER: ~17%
  - CER: ~5.5%
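
For readers unfamiliar with the two metrics: WER is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words, and CER is the same ratio computed over characters. The following is a minimal plain-Python sketch of both. The function names and the example sentences are illustrative and are not taken from this model's evaluation code, and no text normalization is applied, so results from it will not necessarily match the figures above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative sentences: one character differs in the last word
reference = "men bugun maktabga bordim"
hypothesis = "men bugun maktabga bordum"
print(f"WER: {wer(reference, hypothesis):.2%}")
print(f"CER: {cer(reference, hypothesis):.2%}")
```

A single wrong character counts as one full word error, which is why WER is typically much higher than CER for the same output.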

## Training Data

This model was fine-tuned on approximately 475 hours of diverse Uzbek audio data, including:

- Common Voice 17 dataset (filtered)
- USC (filtered)
- Google FLEURS (filtered)
- Podcasts Tashkent Dialect Youtube Uzbek Speech Dataset: [Link HF](https://huggingface.co/datasets/islomov/podcasts_tashkent_dialect_youtube_uzbek_speech_dataset)
- News Youtube Uzbek Speech Dataset: [Link HF](https://huggingface.co/datasets/islomov/news_youtube_uzbek_speech_dataset)
- IT Youtube Uzbek Speech Dataset: [Link HF](https://huggingface.co/datasets/islomov/it_youtube_uzbek_speech_dataset)

The dataset consisted of 50% human-transcribed and 50% pseudo-transcribed material (using Gemini 2.5 Pro). Special attention was given to Tashkent dialect audio to ensure strong performance on this dialect.

The source datasets were filtered using Word Error Rate (WER) and similarity checks. The script for this process will also be open-sourced.
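
The exact filtering procedure is not described in this card, so the following is only a rough sketch of what a WER-plus-similarity filter could look like. It assumes each sample carries two transcripts to compare (for example, a label and the output of a second transcription pass). The field names `label` and `check`, the function `keep_sample`, and both thresholds are hypothetical. Similarity is computed with `difflib.SequenceMatcher` from the Python standard library as a stand-in for whatever measure was actually used.

```python
from difflib import SequenceMatcher


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1] / len(ref)


def keep_sample(label, check, max_wer=0.3, min_similarity=0.8):
    """Keep a sample only if its two transcripts agree closely enough.

    max_wer and min_similarity are hypothetical thresholds, not the values
    used to build this model's training set.
    """
    label, check = label.lower().strip(), check.lower().strip()
    similarity = SequenceMatcher(None, label, check).ratio()
    return word_error_rate(label, check) <= max_wer and similarity >= min_similarity


# Illustrative samples: the first pair agrees, the second does not
samples = [
    {"label": "bugun havo juda issiq", "check": "bugun havo juda issiq"},
    {"label": "bugun havo juda issiq", "check": "ertaga yomg'ir yog'adi"},
]
filtered = [s for s in samples if keep_sample(s["label"], s["check"])]
print(f"Kept {len(filtered)} of {len(samples)} samples")
```

Combining the two checks is a design choice: WER catches word-level disagreement, while the character-level similarity ratio is more forgiving of small spelling differences.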

## Usage Example

```python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

MODEL_ID = "islomov/navaistt_v2_medium"

# Load model and processor once, and move the model to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = WhisperProcessor.from_pretrained(MODEL_ID)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID).to(device)


def transcribe_audio(audio_path):
    """Transcribe a single audio file of at most 30 seconds."""
    # Load audio as a (channels, samples) tensor
    waveform, sample_rate = torchaudio.load(audio_path)

    # Convert to mono if needed
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)

    # Resample to the 16 kHz rate Whisper expects
    if sample_rate != 16000:
        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

    # Convert the waveform to log-mel input features
    input_features = processor(
        waveform.squeeze(0).numpy(),
        sampling_rate=16000,
        return_tensors="pt",
    ).input_features.to(device)

    # Generate token IDs; the language is set at generation time,
    # because the processor ignores a `language` argument
    with torch.no_grad():
        predicted_ids = model.generate(
            input_features, language="uz", task="transcribe"
        )

    # Decode token IDs to text
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]


# Example usage
if __name__ == "__main__":
    audio_file = "some_audio_max_30_sec.wav"
    text = transcribe_audio(audio_file)
    print(f"Transcription: {text}")
```
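
The example file name suggests inputs of at most 30 seconds, which matches Whisper's fixed 30-second input window. For longer recordings, one simple option is to cut the waveform into windows and transcribe each one separately. The helper below is a hypothetical sketch: the name `chunk_bounds` and its defaults are assumptions and are not part of this repository. It only computes sample index ranges, and cutting at fixed points can split words, so treat it as a starting point.

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_seconds=30.0):
    """Return (start, end) sample indices covering the audio in fixed-size windows."""
    if num_samples < 0:
        raise ValueError("num_samples must be non-negative")
    chunk_size = int(sample_rate * chunk_seconds)
    if chunk_size < 1:
        raise ValueError("sample_rate * chunk_seconds must be at least one sample")
    return [
        (start, min(start + chunk_size, num_samples))
        for start in range(0, num_samples, chunk_size)
    ]


# 70 seconds of 16 kHz audio: two full windows plus a 10-second remainder
bounds = chunk_bounds(70 * 16000)
print(bounds)
```

Each `(start, end)` pair can be used to slice a mono 16 kHz waveform, for example `waveform[:, start:end]`, before running the same feature extraction and `generate` steps on the slice and joining the resulting texts.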

## Future Improvements

Future versions will include more regional Uzbek dialects to improve overall coverage.