	---
	language:
	- uz
	license: apache-2.0
	tags:
	- whisper
	- automatic-speech-recognition
	- audio-transcription
	- uzbek
	- fine-tuned
	- speech-recognition
	---

	# NavaiSTT v2 Medium - Uzbek Speech-to-Text Model

	The original Whisper medium model, fine-tuned for Uzbek. The training data consisted of diverse audio: publicly available podcasts, Tashkent dialect podcasts, news, Google FLEURS, USC, and Common Voice 17. Transcript sources were mixed: 50% human-transcribed and 50% pseudo-transcribed using Gemini 2.5 Pro.

	The difference from v1 is that v2 is fully open-sourced. Due to conflicts with data partners, v1 was removed and its 500-hour dataset was excluded. New, different datasets were used instead, all of which will be open-sourced. Training scripts will also be open-sourced, so the entire process will be fully reproducible.

	Special attention was given to Tashkent dialect audio materials, resulting in strong performance on this dialect. Future versions will include other regional dialects to improve overall coverage.

	## Whitepaper
	For more details on the methodology and research behind this model, visit: https://uz-speech.web.app/navaistt02m

	Training and filtering code: https://github.com/Islomov49/navaistt_v2-open-sourced

	Support my work and the open-source movement: https://tirikchilik.uz/islomovs

	## Model Details

	- Base Model: Whisper Medium
	- Parameters: 769M
	- Performance:
	  - WER: ~17%
	  - CER: ~5.5%
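For readers unfamiliar with the two metrics, the following is a minimal sketch of how WER and CER are commonly computed from a reference transcript and a model output. It is not the evaluation script used for this model: the function names are illustrative, no text normalization is applied, and counting spaces as characters in CER is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length.

    Assumption: spaces are counted as characters.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


reference = "men bugun maktabga bordim"
hypothesis = "men bugun maktabga bordm"  # one character missing in the last word
print(f"WER: {wer(reference, hypothesis):.2f}")  # 1 wrong word out of 4 -> 0.25
print(f"CER: {cer(reference, hypothesis):.2f}")  # 1 edit over 25 characters -> 0.04
```

A single misspelled word counts as a full word error but only as one character error, which is why CER is typically much lower than WER.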

	## Training Data

	This model was fine-tuned on approximately 475 hours of diverse Uzbek audio data including:
	- Common Voice 17 dataset (filtered)
	- USC (filtered)
	- Google FLEURS (filtered)
	- Podcasts Tashkent Dialect Youtube Uzbek Speech Dataset: [Link HF](https://huggingface.co/datasets/islomov/podcasts_tashkent_dialect_youtube_uzbek_speech_dataset)
	- News Youtube Uzbek Speech Dataset: [Link HF](https://huggingface.co/datasets/islomov/news_youtube_uzbek_speech_dataset)
	- IT Youtube Uzbek Speech Dataset: [Link HF](https://huggingface.co/datasets/islomov/it_youtube_uzbek_speech_dataset)

	The dataset consisted of 50% human-transcribed and 50% pseudo-transcribed material (using Gemini 2.5 Pro). Special attention was given to Tashkent dialect audio materials to ensure strong performance on this dialect.

	The datasets were filtered using Word Error Rate (WER) and similarity checks. The script for this process will also be open-sourced.
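The filtering procedure itself is not described here, so the snippet below is only a rough sketch of what WER- and similarity-based filtering could look like. It assumes each sample has a transcript label and a second transcription from a reference model to compare against. The thresholds, field names, and the use of `difflib.SequenceMatcher` as the similarity measure are all hypothetical choices, not the author's method.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds; the values used for this model are not stated.
MAX_WER = 0.3
MIN_SIMILARITY = 0.8


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1] / len(ref)


def keep_sample(label, prediction, max_wer=MAX_WER, min_similarity=MIN_SIMILARITY):
    """Keep a sample only if the label and the second transcription roughly agree."""
    if not label.split():
        return False  # an empty label cannot be verified
    similarity = SequenceMatcher(None, label, prediction).ratio()
    return word_error_rate(label, prediction) <= max_wer and similarity >= min_similarity


# Toy data: "label" is the dataset transcript, "prediction" a second transcription.
samples = [
    {"label": "salom dunyo", "prediction": "salom dunyo"},
    {"label": "bugun havo juda issiq", "prediction": "ertaga yomg'ir yog'adi"},
]
filtered = [s for s in samples if keep_sample(s["label"], s["prediction"])]
print(len(filtered))  # only the first sample passes both checks
```

Requiring both checks is a conservative design choice: a sample is dropped when either metric signals a mismatch between the audio and its transcript.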

	## Usage Example

	```python
	import torch
	import torchaudio
	from transformers import WhisperProcessor, WhisperForConditionalGeneration

	# Load model and processor
	processor = WhisperProcessor.from_pretrained("islomov/navaistt_v2_medium")
	model = WhisperForConditionalGeneration.from_pretrained("islomov/navaistt_v2_medium")

	def transcribe_audio(audio_path):
	    global model, processor

	    # Move to GPU if available
	    device = "cuda" if torch.cuda.is_available() else "cpu"
	    model = model.to(device)

	    # Load and preprocess audio
	    waveform, sample_rate = torchaudio.load(audio_path)
	    if sample_rate != 16000:
	        waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

	    # Convert to mono if needed
	    if waveform.shape[0] > 1:
	        waveform = waveform.mean(dim=0, keepdim=True)

	    # Process audio
	    input_features = processor(
	        waveform.squeeze().numpy(),
	        sampling_rate=16000,
	        return_tensors="pt",
	        language="uz"
	    ).input_features.to(device)

	    # Generate transcription
	    with torch.no_grad():
	        predicted_ids = model.generate(input_features)

	    # Decode
	    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
	    return transcription

	# Example usage
	if __name__ == "__main__":
	    audio_file = "some_audio_max_30_sec.wav"

	    text = transcribe_audio(audio_file)
	    print(f"Transcription: {text}")
	```
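The example file name hints at clips of at most 30 seconds, which matches Whisper's 30-second input window. For longer recordings, one simple option is to split the waveform into fixed windows and transcribe each one separately. The sketch below only computes the window boundaries. `chunk_bounds` is a hypothetical helper, not part of this model or of `transformers`.

```python
SAMPLE_RATE = 16_000  # sampling rate used in the usage example
CHUNK_SECONDS = 30    # Whisper's input window length


def chunk_bounds(num_samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Return (start, end) sample indices that cover the audio in
    consecutive windows of at most `chunk_seconds` each."""
    chunk_size = sample_rate * chunk_seconds
    return [
        (start, min(start + chunk_size, num_samples))
        for start in range(0, num_samples, chunk_size)
    ]


# A 70-second recording at 16 kHz splits into 30 s + 30 s + 10 s.
bounds = chunk_bounds(70 * SAMPLE_RATE)
print(bounds)  # [(0, 480000), (480000, 960000), (960000, 1120000)]
```

Each slice `waveform[:, start:end]` could then go through the same preprocessing and generation steps, with the resulting texts joined in order. Hard cuts can split a word in the middle, so overlapping windows or the `chunk_length_s` option of the `transformers` automatic-speech-recognition pipeline may give cleaner results.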

	## Future Improvements
	Future versions will include more regional Uzbek dialects to improve overall coverage.