Tiny Audio

Efficient Speech Recognition with Frozen Pretrained Models

Tiny Audio is a lightweight automatic speech recognition (ASR) model that combines a frozen HuBERT encoder with a frozen SmolLM3 language model decoder, connected via a trainable audio projector. This architecture enables efficient training by training only a small projection layer (~7M parameters) while leveraging the power of large pretrained models.

Model Description

  • Developed by: Alex Kroman
  • Model type: Automatic Speech Recognition (Speech-to-Text)
  • Language(s): English
  • License: MIT
  • Architecture: Encoder-Projector-Decoder
    • Audio Encoder: HuBERT-large (317M params, frozen)
    • Audio Projector: 2-layer MLP (~7M params, trainable)
    • Text Decoder: SmolLM3-3B (3B params, frozen)

Key Features

✅ Parameter Efficient: Only ~7M trainable parameters
✅ Fast Training: Frozen encoder/decoder enable rapid fine-tuning
✅ Modular Design: Easy to swap in different encoder or decoder models
✅ Production Ready: Includes evaluation tools and remote training scripts
✅ HuggingFace Native: Full integration with the transformers library

Quick Start

from transformers import pipeline

# Load ASR pipeline
pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)

# Transcribe audio file
result = pipe("path/to/audio.wav")
print(result["text"])

# With custom generation parameters
result = pipe(
    "path/to/audio.wav",
    max_new_tokens=200,
    num_beams=4,
    length_penalty=1.0,
)
print(result["text"])

The model automatically handles:

  • Audio resampling to 16kHz
  • Various audio formats (WAV, MP3, FLAC, etc.)
  • Batch processing for multiple files
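
The pipeline also accepts raw arrays if you prefer to load audio yourself. Here is a minimal sketch (librosa is an assumption here, not a stated project dependency; any loader that yields 16kHz float arrays works), reusing the pipe object from the Quick Start above:

import librosa

# Load and resample to 16kHz yourself, then pass a raw array to the pipeline
audio, sr = librosa.load("path/to/audio.wav", sr=16000)
result = pipe({"raw": audio, "sampling_rate": sr})
print(result["text"])

# Batch processing: pass a list of files and get a list of results back
for result in pipe(["first.wav", "second.wav"]):
    print(result["text"])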

Architecture Details

Model Components

  1. Audio Encoder (Frozen)

    • Base Model: facebook/hubert-large-ls960-ft
    • Parameters: 317M (frozen)
    • Extracts acoustic features from raw audio waveforms
    • Output: Audio embeddings at ~50Hz frame rate
  2. Audio Projector (Trainable)

    • Architecture: Linear(encoder_dim × 5, 2048) → ReLU → Linear(2048, llm_dim)
    • Parameters: ~7M (trainable)
    • Downsamples audio features by 5x (from ~50Hz to ~10Hz)
    • Maps audio embeddings into the language model's embedding space (see the sketch after this list)
  3. Language Model Decoder (Frozen)

    • Base Model: HuggingFaceTB/SmolLM3-3B-Base
    • Parameters: 3B (frozen)
    • Generates text transcriptions autoregressively
    • Uses beam search (beam_size=4) for decoding
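
A minimal PyTorch sketch of the projector described in item 2, assuming the HuBERT-large hidden size of 1024 and the 2048-dimensional language model embeddings shown in the data flow diagram below; the module and names are illustrative, not the repository's actual code:

import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    def __init__(self, encoder_dim=1024, llm_dim=2048, stack=5, hidden=2048):
        super().__init__()
        self.stack = stack
        self.mlp = nn.Sequential(
            nn.Linear(encoder_dim * stack, hidden),
            nn.ReLU(),
            nn.Linear(hidden, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: [batch, frames, encoder_dim] at ~50Hz
        b, t, d = feats.shape
        t = t - t % self.stack  # drop trailing frames that don't fill a 5-frame stack
        stacked = feats[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.mlp(stacked)  # [batch, ~frames/5, llm_dim] at ~10Hz

Stacking five consecutive ~50Hz frames before the first linear layer is what yields the 5x downsampling to ~10Hz.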

Data Flow

Raw Audio (16kHz)
    ↓
HuBERT Encoder (frozen)
    ↓
Audio Features [batch, ~1500, 1024]
    ↓
Audio Projector (trainable, 5x downsample)
    ↓
Language Embeddings [batch, ~300, 2048]
    ↓
SmolLM3 Decoder (frozen)
    ↓
Text Transcription
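
To make the first stage of this flow concrete, here is a hedged, standalone sketch of extracting ~50Hz features with the frozen HuBERT encoder through transformers (an illustration, not the repository's code):

import torch
from transformers import AutoFeatureExtractor, HubertModel

extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-large-ls960-ft")
encoder = HubertModel.from_pretrained("facebook/hubert-large-ls960-ft").eval()

waveform = torch.randn(16000 * 5)  # placeholder: 5 seconds of 16kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():  # the encoder stays frozen
    features = encoder(**inputs).last_hidden_state
print(features.shape)  # roughly [1, 250, 1024] -> ~50 frames per second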

Training Details

Training Data

The model is trained on a diverse mix of English speech datasets:

Dataset              Hours       Domain                 Description
LibriSpeech          960h        Audiobooks             Clean read speech
GigaSpeech           10,000h     Podcasts, YouTube      Diverse acoustic conditions
Common Voice 17.0    ~2,500h     Crowdsourced           Multiple accents and speakers
LoquaciousSet        Variable    European Parliament    Formal speech

Total Training Data: more than 13,000 hours of English speech
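
All of these corpora are available on the HuggingFace Hub. As a sketch of how one of them might be loaded with the datasets library (illustrative only, not the project's training script):

from datasets import load_dataset

# Stream LibriSpeech rather than downloading all 960 hours up front
librispeech = load_dataset("librispeech_asr", "clean", split="train.100", streaming=True)
sample = next(iter(librispeech))
print(sample["text"])                    # reference transcript
print(sample["audio"]["sampling_rate"])  # 16000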

Training Configuration

  • Optimizer: AdamW
  • Learning Rate: Cosine schedule with warmup
  • Precision: BF16 mixed precision
  • Batch Size: Dynamic based on audio length
  • Gradient Checkpointing: Enabled
  • Training Steps: ~50,000 steps
  • Hardware: Single NVIDIA A100 (40GB)
  • Training Time: ~24 hours
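
In transformers terms, this corresponds roughly to the TrainingArguments below; the warmup length, peak learning rate, and per-device batch size are assumptions, since the card does not specify them:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="tiny-audio-projector",
    optim="adamw_torch",            # AdamW
    bf16=True,                      # BF16 mixed precision
    lr_scheduler_type="cosine",     # cosine schedule...
    warmup_steps=1000,              # ...with warmup (assumed length)
    learning_rate=1e-4,             # assumed peak learning rate
    max_steps=50_000,               # ~50,000 training steps
    gradient_checkpointing=True,
    per_device_train_batch_size=8,  # assumed; the real batch size is dynamic
)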

Training Strategy

Only the audio projector weights are trained from scratch. The HuBERT encoder and SmolLM3 decoder remain frozen throughout training, which:

  • Reduces memory requirements significantly
  • Enables faster training convergence
  • Preserves pretrained knowledge
  • Prevents catastrophic forgetting
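
In code, this strategy reduces to marking the encoder and decoder parameters as non-trainable and handing the optimizer only the projector's parameters. A sketch with stand-in modules (the real components are HuBERT and SmolLM3):

import torch
import torch.nn as nn

encoder = nn.Linear(1024, 1024)        # stand-in for the frozen HuBERT encoder
projector = nn.Linear(5 * 1024, 2048)  # stand-in for the trainable projector
decoder = nn.Linear(2048, 2048)        # stand-in for the frozen SmolLM3 decoder

for module in (encoder, decoder):
    for param in module.parameters():
        param.requires_grad_(False)    # frozen: no gradients, no optimizer state

trainable = [p for p in projector.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
print(sum(p.numel() for p in trainable))  # only the projector's weights count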

Evaluation

The model is evaluated on the LoquaciousSet benchmark dataset using Word Error Rate (WER) as the primary metric.
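
WER counts word-level substitutions, insertions, and deletions against a reference transcript. A quick way to compute it is the evaluate library (an assumption; the card does not say which WER implementation the eval script uses):

import evaluate

wer_metric = evaluate.load("wer")
references = ["the cat sat on the mat"]
predictions = ["the cat sat on a mat"]
print(wer_metric.compute(references=references, predictions=predictions))
# one substitution over six reference words ≈ 0.167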

Benchmark Results

Dataset                  Split    Samples    WER
LoquaciousSet (large)    test     ~2,000     TBD
LoquaciousSet (clean)    test     ~500       TBD

Evaluation Script

# Install tiny-audio
pip install tiny-audio

# Run evaluation
uv run scripts/eval.py --max-samples 100

# Compare with baselines
uv run scripts/eval.py --provider assemblyai --api-key YOUR_API_KEY

Limitations and Bias

Limitations

  • English Only: Currently trained only on English speech data
  • Formal Speech: May perform better on clear, formal speech than casual conversation
  • Background Noise: Performance may degrade in noisy environments
  • Accents: May have varying performance across different English accents
  • Domain Shift: Best performance on domains similar to training data

Potential Biases

  • Dataset Bias: Training data may not equally represent all demographics
  • Accent Bias: May perform differently across accents (American, British, Indian, etc.)
  • Gender Bias: Performance may vary by speaker gender
  • Age Bias: Primarily trained on adult speech

Users should evaluate the model on their specific use case and demographics before production deployment.

Intended Use

Primary Use Cases

✅ Transcription Services: Converting speech to text for podcasts, videos, interviews
✅ Accessibility Tools: Generating captions and subtitles
✅ Voice Assistants: Speech-to-text component in voice interfaces
✅ Research: ASR research and experimentation
✅ Education: Learning about multimodal models and parameter-efficient training

Out-of-Scope Use

❌ Real-time critical systems (medical, legal) without thorough validation
❌ Surveillance or privacy-invasive applications
❌ Non-English languages (not trained for this)
❌ Child safety applications without age-appropriate testing

Environmental Impact

  • Hardware: 1× NVIDIA A100 (40GB)
  • Training Time: ~24 hours
  • Power Consumption: ~300W × 24h = 7.2 kWh
  • Estimated CO₂ Emissions: ~3.6 kg CO₂e (assuming 0.5 kg CO₂/kWh)

This is significantly lower than training full ASR models from scratch thanks to frozen pretrained components.

Citation

If you use Tiny Audio in your research, please cite:

@software{kroman2024tinyaudio,
  author = {Kroman, Alex},
  title = {Tiny Audio: Efficient Speech Recognition with Frozen Pretrained Models},
  year = {2024},
  url = {https://github.com/alexkroman/tiny-audio},
  note = {HuggingFace Model: https://huggingface.co/mazesmazes/tiny-audio}
}

Acknowledgments

This project builds upon excellent prior work, including HuBERT (facebook/hubert-large-ls960-ft), SmolLM3 (HuggingFaceTB/SmolLM3-3B-Base), and the HuggingFace transformers library.

License

This model is released under the MIT License. See the LICENSE file for details.
