---
base_model: unsloth/whisper-large-v3-turbo
tags:
- text-generation-inference
- transformers
- unsloth
- whisper
- creole
- haiti
license: apache-2.0
language:
- ht
datasets:
- jsbeaudry/cmu_haitian_creole_speech
- jsbeaudry/creole-text-voice
pipeline_tag: automatic-speech-recognition
---




# oswald-large-v3-turbo-m1

This model is a fine-tuned version of [unsloth/whisper-large-v3-turbo](https://huggingface.co/unsloth/whisper-large-v3-turbo) on the **creole-text-voice** dataset.  
The main objective is a Haitian Creole speech-to-text model that reaches **99% transcription accuracy** across diverse Haitian voices, accents, regions, and speaking styles.
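
The card does not say which metric the 99% target refers to. ASR quality is commonly reported as word error rate (WER), so the following is a minimal sketch under the assumption that "99% accurate" means a WER of about 1% or lower. The function name `word_error_rate` and the example sentences are illustrative and are not taken from the dataset.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edit distance between the ref words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Hypothetical reference/hypothesis pair: one word is missing from the output.
reference = "mwen renmen peyi mwen anpil"
hypothesis = "mwen renmen peyi anpil"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2f}")  # WER: 0.20
```

Averaging this value over a held-out test set, after applying the same text normalization to references and hypotheses, gives a number that can be compared against the target.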

---

## 🧠 Model description

**oswald-large-v3-turbo-m1** is optimized for Haitian Creole automatic speech recognition (ASR). It builds on OpenAI's Whisper architecture and adapts it to Haitian Creole through transfer learning: the pretrained model is fine-tuned on a curated dataset of roughly 15 hours of Haitian Creole audio-text pairs.

- **Architecture**: Whisper Large v3 Turbo
- **Fine-tuned for**: Haitian Creole (Kreyòl Ayisyen)
- **Vocabulary**: Based on Latin script (Creole orthography), preserving diacritics and linguistic nuances.
- **Voice types**: Female and male voices, both synthetic and natural.
- **Sampling rate**: 16 kHz
- **Training objective**: Maximize transcription accuracy for everyday Creole speech
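
Because the model expects 16 kHz input, audio recorded at another rate has to be resampled first. The sketch below shows the idea with NumPy only: it downmixes to mono and resamples with linear interpolation. The helper name `to_mono_16k` and the `(samples, channels)` layout are assumptions for illustration, and linear interpolation is a crude stand-in for a proper anti-aliased resampler such as the one `librosa.load(path, sr=16000)` applies.

```python
import numpy as np

TARGET_SR = 16_000  # sampling rate stated in this card


def to_mono_16k(audio: np.ndarray, source_sr: int) -> np.ndarray:
    """Downmix to mono and resample to 16 kHz with linear interpolation."""
    audio = np.asarray(audio, dtype=np.float32)
    if audio.ndim == 2:
        # Assumes shape (samples, channels).
        audio = audio.mean(axis=1)
    if source_sr == TARGET_SR:
        return audio
    duration = audio.shape[0] / source_sr
    target_len = int(round(duration * TARGET_SR))
    source_times = np.arange(audio.shape[0]) / source_sr
    target_times = np.arange(target_len) / TARGET_SR
    return np.interp(target_times, source_times, audio).astype(np.float32)


# One second of a 440 Hz tone at 44.1 kHz, duplicated into two channels.
t = np.arange(44_100) / 44_100
stereo = np.stack([np.sin(2 * np.pi * 440 * t)] * 2, axis=1)
mono_16k = to_mono_16k(stereo, source_sr=44_100)
print(mono_16k.shape)  # (16000,)
```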

---


### ✅ Intended uses
- Transcribe Haitian Creole speech from:
  - Voice notes 
  - Radio shows
  - Interviews
  - Public speeches
  - Educational content
  - Synthetic voices 

- Enable Creole voice interfaces in:
  - Voice assistants
  - Transcription services
  - Language-learning tools
  - Chatbots and accessibility platforms

### ⚠️ Limitations
- May struggle with:
  - Extremely poor audio quality (e.g., heavy background noise)
  - Very fast or mumbled speech in some dialects
  - Long-duration audio files
- Not optimized for **real-time transcription** on low-resource devices
- Fine-tuned on a specific dataset, so it may generalize less well to completely unseen voice types or rare accents
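
One common workaround for the long-audio limitation is to split a recording into overlapping windows and transcribe each window separately, since Whisper models process audio in 30-second segments. The sketch below only computes the window boundaries; the function name `chunk_bounds` and the 5-second overlap are illustrative choices, not settings taken from this model's training or evaluation.

```python
def chunk_bounds(num_samples, sample_rate=16_000, chunk_s=30.0, overlap_s=5.0):
    """Return (start, end) sample indices of overlapping windows."""
    if overlap_s >= chunk_s:
        raise ValueError("overlap_s must be smaller than chunk_s")
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last window reached the end of the recording
        start += step
    return bounds


# A hypothetical 70-second recording at 16 kHz.
windows = chunk_bounds(70 * 16_000)
for start, end in windows:
    print(f"{start / 16_000:.0f}s -> {end / 16_000:.0f}s")
# 0s -> 30s
# 25s -> 55s
# 50s -> 70s
```

Each window can then be passed to the model as its own input; the overlapping text still has to be merged afterwards, which this sketch does not attempt. The `transformers` ASR pipeline also accepts a `chunk_length_s` argument that performs chunking internally.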

---

## 📊 Training and evaluation data

The model was trained on the **creole-text-voice** dataset, which includes:

- **7 hours** of synthetic Haitian Creole speech
- **8 hours** of human Haitian Creole speech
- Annotated, time-aligned text transcripts following standard Creole orthography

### Planned data sources for next steps:
- Public domain radio and podcast archives
- Open-access interviews and spoken-word audio
- Community-submitted voice samples

### Preprocessing steps:
- Voice Activity Detection (VAD)
- Noise filtering and audio normalization
- Manual transcript review and correction
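
The card does not name the tools behind these steps. As a rough illustration of the first two, the sketch below applies peak normalization and a simple energy-threshold VAD with NumPy. Both function names, the 30 ms frame size, and the -40 dB threshold are assumptions for the example; the dataset's actual pipeline may differ.

```python
import numpy as np


def peak_normalize(audio: np.ndarray, peak: float = 0.95) -> np.ndarray:
    """Scale the waveform so its largest absolute sample equals `peak`."""
    max_abs = float(np.max(np.abs(audio))) if audio.size else 0.0
    return audio if max_abs == 0.0 else audio * (peak / max_abs)


def energy_vad(audio: np.ndarray, sample_rate: int = 16_000,
               frame_ms: int = 30, threshold_db: float = -40.0) -> np.ndarray:
    """Return one boolean per frame: True where RMS energy exceeds the threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    rms_db = 20 * np.log10(np.maximum(rms, 1e-10))  # floor avoids log(0)
    return rms_db > threshold_db


# Synthetic check: 1 s of silence followed by 1 s of a 220 Hz tone.
sr = 16_000
t = np.arange(sr) / sr
signal = np.concatenate([np.zeros(sr), 0.5 * np.sin(2 * np.pi * 220 * t)])
normalized = peak_normalize(signal)
speech_mask = energy_vad(normalized, sample_rate=sr)
print(f"{speech_mask.sum()} of {speech_mask.size} frames flagged as active")
```

An energy threshold like this flags any loud frame, including noise, so it is only a starting point compared with a trained VAD model.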


## Model usage script

```python
# Load the processor and model directly
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import librosa

processor = AutoProcessor.from_pretrained("jsbeaudry/oswald-large-v3-turbo-m1")
model = AutoModelForSpeechSeq2Seq.from_pretrained("jsbeaudry/oswald-large-v3-turbo-m1")

def transcript(audio_file_path):
    # 1. Load the audio and resample it to 16 kHz mono
    speech_array, sampling_rate = librosa.load(audio_file_path, sr=16000)

    # 2. Convert the waveform to the input features the model expects
    input_features = processor(
        speech_array, sampling_rate=sampling_rate, return_tensors="pt"
    ).input_features

    # 3. Generate predictions (token IDs)
    predicted_ids = model.generate(input_features)

    # 4. Decode the predictions; batch_decode returns a list with one string per input
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)

    return transcription

text = transcript("/path_audio")

print(text)
```


## Model usage with gradio (UI)

```python

from transformers import pipeline
import gradio as gr

# Load Whisper model
print("Loading model...")
pipe = pipeline("automatic-speech-recognition", model="jsbeaudry/oswald-large-v3-turbo-m1")
print("Model loaded successfully.")

# Transcription function
def transcribe(audio_path):
    if audio_path is None:
        return "Please upload or record an audio file first."
    result = pipe(audio_path)
    return result["text"]

# Build Gradio interface
def create_interface():
    with gr.Blocks(title="Whisper Large v3 Turbo - Haitian Creole") as demo:
        gr.Markdown("# 🎙️ Whisper Large v3 Turbo Creole ASR")
        gr.Markdown(
            "Upload an audio file or record your voice in Haitian Creole. "
            "Then click **Transcribe** to see the result."
        )

        with gr.Row():
            with gr.Column():
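                # Gradio 4+ takes `sources` (a list); Gradio 3.x used `source="upload"`.
                audio_input = gr.Audio(sources=["upload", "microphone"], type="filepath", label="🎧 Upload or Record Audio")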
            with gr.Column():
                transcribe_button = gr.Button("🔍 Transcribe")
                output_text = gr.Textbox(label="📝 Transcribed Text", lines=4)
                
    
        transcribe_button.click(fn=transcribe, inputs=audio_input, outputs=output_text)

    return demo

if __name__ == "__main__":
    interface = create_interface()
    interface.launch()
```

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-4
- num_epochs: 6.65
- training_time: 2:52 (h:mm)


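| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 100  | 0.565400      | 0.656878        |
| 200  | 0.481000      | 0.528320        |
| 300  | 0.457000      | 0.460658        |
| 400  | 0.822300      | 0.419748        |
| 500  | 0.298300      | 0.397042        |
| ...  | ...           | ...             |
| 8300 | 0.049500      | 0.215643        |
| 8400 | 0.024700      | 0.210167        |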
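
As a small illustration of how to read this log, the snippet below picks the logged step with the lowest validation loss and prints the gap between validation and training loss. It uses only the rows shown above; the elided rows are not reconstructed, so the result applies to the visible rows only.

```python
# (step, training_loss, validation_loss) rows copied from the log above.
logged = [
    (100, 0.565400, 0.656878),
    (200, 0.481000, 0.528320),
    (300, 0.457000, 0.460658),
    (400, 0.822300, 0.419748),
    (500, 0.298300, 0.397042),
    (8300, 0.049500, 0.215643),
    (8400, 0.024700, 0.210167),
]

# Checkpoint with the lowest validation loss among the visible rows.
best_step, _, best_val = min(logged, key=lambda row: row[2])
print(f"Lowest validation loss: {best_val:.6f} at step {best_step}")

# A growing validation-minus-training gap is a common sign of overfitting.
for step, train_loss, val_loss in logged:
    print(f"step {step:>5}: gap = {val_loss - train_loss:+.6f}")
```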




### Framework versions

- Transformers 4.46.1
- Pytorch 2.6.0+cu124
- Datasets 3.5.0
- Tokenizers 0.20.3



## 📌 Citation

If you use this model, please cite:

```bibtex
@misc{whispermediumcreoleoswald2025,
  title={oswald-large-v3-turbo-m1},
  author={Jean Sauvenel Beaudry},
  year={2025},
  howpublished={\url{https://huggingface.co/jsbeaudry}}
}