
OpenAI Whisper-Base Fine-Tuned Model for Speech-to-Text

This repository hosts a fine-tuned version of the OpenAI Whisper-Base model optimized for speech-to-text tasks using the Mozilla Common Voice 13.0 dataset. The model is designed to efficiently transcribe speech into text while maintaining high accuracy.

Model Details

  • Model Architecture: OpenAI Whisper-Base
  • Task: Speech-to-Text
  • Dataset: Mozilla Common Voice 13.0
  • Quantization: FP16
  • Fine-tuning Framework: Hugging Face Transformers

🚀 Usage

Installation

pip install transformers torch torchaudio

Loading the Model

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "AventIQ-AI/whisper-speech-text"
model = WhisperForConditionalGeneration.from_pretrained(model_name).to(device)
processor = WhisperProcessor.from_pretrained(model_name)

Speech-to-Text Inference

import torchaudio

# Load an audio file, convert it to 16 kHz mono, and transcribe it
def transcribe(audio_path):
    waveform, sample_rate = torchaudio.load(audio_path)

    # Whisper expects 16 kHz mono input
    if sample_rate != 16000:
        waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)
    waveform = waveform.mean(dim=0)  # collapse multi-channel audio to mono

    inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt").input_features.to(device)

    # Generate transcription
    with torch.no_grad():
        predicted_ids = model.generate(inputs)
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
    return transcription

# Example usage
audio_file = "sample_audio.wav"
print(transcribe(audio_file))
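
By default, Whisper detects the spoken language automatically. If you already know the language, you can pin the decoder to a specific language and task. The snippet below is a minimal sketch assuming English transcription; adjust the language code as needed.

# Optional: pin the decoder to English transcription instead of auto-detection
forced_decoder_ids = processor.get_decoder_prompt_ids(language="en", task="transcribe")

# Then, inside transcribe(), pass it to generate:
#     predicted_ids = model.generate(inputs, forced_decoder_ids=forced_decoder_ids)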

📊 Evaluation Results

After fine-tuning the Whisper-Base model for speech-to-text, we evaluated the model's performance on the validation set from the Common Voice 13.0 dataset. The following results were obtained:

| Metric | Score | Meaning |
|--------|-------|---------|
| WER    | 8.2%  | Word Error Rate: measures word-level transcription accuracy |
| CER    | 4.5%  | Character Error Rate: measures character-level accuracy |
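
For reference, word and character error rates like these can be computed with the Hugging Face evaluate library. The snippet below is a minimal sketch (not the exact evaluation script used for this model) comparing model transcriptions against reference texts; it assumes the evaluate and jiwer packages are installed.

import evaluate

# Requires: pip install evaluate jiwer
wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# predictions: model transcriptions; references: ground-truth texts
predictions = ["hello world"]
references = ["hello world"]

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))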

Fine-Tuning Details

Dataset

The Mozilla Common Voice 13.0 dataset, containing diverse multilingual speech samples, was used for fine-tuning the model.
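
If you want to reproduce or extend the fine-tuning, the dataset is available on the Hugging Face Hub. The snippet below is a minimal sketch assuming the English subset; the exact language configurations used for fine-tuning may differ, and the dataset is gated, so you must accept its terms and authenticate first.

from datasets import load_dataset, Audio

# Load the English subset of Common Voice 13.0 (gated dataset: accept the terms on the Hub and log in first)
common_voice = load_dataset("mozilla-foundation/common_voice_13_0", "en", split="train")

# Whisper expects 16 kHz audio
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))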

Training

  • Number of epochs: 3
  • Batch size: 8
  • Evaluation strategy: epoch (evaluate at the end of each epoch; see the configuration sketch below)
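
The snippet below is a minimal sketch of training arguments matching the hyperparameters above. It is illustrative rather than the exact training script; the output directory, learning rate, and fp16 flag are assumptions.

from transformers import Seq2SeqTrainingArguments

# Illustrative configuration matching the hyperparameters above;
# learning_rate and fp16 are assumptions, not values from the original run.
# Note: newer transformers versions name this argument eval_strategy.
training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-base-common-voice",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    evaluation_strategy="epoch",
    learning_rate=1e-5,
    fp16=True,
)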

Quantization

Post-training quantization was applied using PyTorch's built-in quantization framework to reduce the model size and improve inference efficiency.
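
As a rough illustration, an FP16 copy of a Whisper checkpoint can be produced with PyTorch's half-precision conversion. This is a minimal sketch and not necessarily the exact quantization procedure used for this repository.

import torch
from transformers import WhisperForConditionalGeneration

# Minimal sketch: cast the fine-tuned model to FP16 and save it.
# This may not match the exact quantization steps used for this checkpoint.
model = WhisperForConditionalGeneration.from_pretrained("AventIQ-AI/whisper-speech-text")
model = model.half()  # cast weights to float16
model.save_pretrained("./whisper-base-fp16", safe_serialization=True)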

📂 Repository Structure

.
├── model/               # Contains the quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary files
├── model.safetensors    # Quantized model weights
└── README.md            # Model documentation

⚠️ Limitations

  • The model may struggle with highly noisy or overlapping speech.
  • Quantization may lead to slight degradation in accuracy compared to full-precision models.
  • Performance may vary across different accents and dialects.

🤝 Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.