Tiny Audio
Efficient Speech Recognition with Frozen Pretrained Models
Tiny Audio is a lightweight automatic speech recognition (ASR) model that combines a frozen HuBERT encoder with a SmolLM3 language model decoder, connected by a trainable audio projector. This architecture enables efficient training: only a small projection layer (~7M parameters) is trained, while the large pretrained encoder and decoder stay frozen.
Model Description
- Developed by: Alex Kroman
- Model type: Automatic Speech Recognition (Speech-to-Text)
- Language(s): English
- License: MIT
- Architecture: Encoder-Projector-Decoder
- Audio Encoder: HuBERT-large (317M params, frozen)
- Audio Projector: 2-layer MLP (~7M params, trainable)
- Text Decoder: SmolLM3-3B (3B params, frozen)
Key Features
- ✅ Parameter Efficient: Only ~7M trainable parameters
- ✅ Fast Training: Frozen encoder/decoder enable rapid fine-tuning
- ✅ Modular Design: Easy to swap in different encoder or decoder models
- ✅ Production Ready: Includes evaluation tools and remote training scripts
- ✅ HuggingFace Native: Full integration with the transformers library
Quick Start
```python
from transformers import pipeline

# Load ASR pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
)

# Transcribe audio file
result = pipe("path/to/audio.wav")
print(result["text"])

# With custom generation parameters
result = pipe(
    "path/to/audio.wav",
    max_new_tokens=200,
    num_beams=4,
    length_penalty=1.0,
)
print(result["text"])
```
The model automatically handles:
- Audio resampling to 16kHz
- Various audio formats (WAV, MP3, FLAC, etc.)
- Batch processing for multiple files
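For example, batch processing works through the standard transformers pipeline interface: pass a list of file paths and results come back in the same order. A minimal sketch (file names here are placeholders):

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
)

# Transcribe several files in one call; results come back in the same order.
files = ["interview_part1.wav", "interview_part2.wav", "meeting.flac"]
results = pipe(files, batch_size=2)
for path, result in zip(files, results):
    print(f"{path}: {result['text']}")
```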
Architecture Details
Model Components
Audio Encoder (Frozen)
- Base Model: `facebook/hubert-large-ls960-ft`
- Parameters: 317M (frozen)
- Extracts acoustic features from raw audio waveforms
- Output: Audio embeddings at ~50Hz frame rate
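To make the encoder output concrete, here is a small standalone sketch of extracting HuBERT features with transformers; it is independent of this repository's code and only illustrates the ~50Hz, 1024-dimensional output described above:

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, HubertModel

# Frozen acoustic encoder used by Tiny Audio
feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-large-ls960-ft")
encoder = HubertModel.from_pretrained("facebook/hubert-large-ls960-ft")
encoder.eval()

# 30 seconds of 16kHz audio (silence, just to illustrate shapes)
waveform = np.zeros(16_000 * 30, dtype=np.float32)
inputs = feature_extractor(waveform, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    features = encoder(**inputs).last_hidden_state

print(features.shape)  # roughly [1, 1499, 1024]: ~50 frames/s, hidden size 1024
```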
Audio Projector (Trainable)
- Architecture: `Linear(encoder_dim × 5, 2048) → ReLU → Linear(2048, llm_dim)`
- Parameters: ~7M (trainable)
- Downsamples audio features by 5x (from ~50Hz to ~10Hz)
- Maps audio embeddings to language model embedding space
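A minimal PyTorch sketch of such a projector. The frame-stacking detail (concatenating 5 consecutive encoder frames to form the encoder_dim × 5 input) and the class name are illustrative, not the repository's actual implementation:

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Stacks 5 consecutive encoder frames, then maps them into the LLM embedding space."""

    def __init__(self, encoder_dim: int = 1024, llm_dim: int = 2048, k: int = 5):
        super().__init__()
        self.k = k
        self.net = nn.Sequential(
            nn.Linear(encoder_dim * k, 2048),
            nn.ReLU(),
            nn.Linear(2048, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: [batch, T, encoder_dim] at ~50Hz
        b, t, d = feats.shape
        t = t - (t % self.k)                      # drop trailing frames so T is divisible by k
        stacked = feats[:, :t].reshape(b, t // self.k, d * self.k)
        return self.net(stacked)                  # [batch, T/5, llm_dim] at ~10Hz

proj = AudioProjector()
audio_feats = torch.randn(1, 1500, 1024)
print(proj(audio_feats).shape)  # torch.Size([1, 300, 2048])
```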
Language Model Decoder (Frozen)
- Base Model: `HuggingFaceTB/SmolLM3-3B-Base`
- Parameters: 3B (frozen)
- Generates text transcriptions autoregressively
- Uses beam search (beam_size=4) for decoding
Data Flow
```
Raw Audio (16kHz)
        ↓
HuBERT Encoder (frozen)
        ↓
Audio Features [batch, ~1500, 1024]
        ↓
Audio Projector (trainable, 5x downsample)
        ↓
Language Embeddings [batch, ~300, 2048]
        ↓
SmolLM3 Decoder (frozen)
        ↓
Text Transcription
```
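Conceptually, the projected audio frames act as a soft prompt for the frozen decoder: they are concatenated with text token embeddings and fed to the LLM, which then generates the transcript. A rough sketch of that wiring with transformers (the actual prompt construction used by this repository may differ; `inputs_embeds` is the mechanism being illustrated):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B-Base")
decoder = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM3-3B-Base")

# Stand-in for the audio projector output: [batch, ~300, llm_dim]
audio_embeds = torch.randn(1, 300, decoder.config.hidden_size)

# Embed a short text prompt (illustrative) and prepend the audio embeddings to it
prompt_ids = tokenizer("Transcript:", return_tensors="pt").input_ids
prompt_embeds = decoder.get_input_embeddings()(prompt_ids)
inputs_embeds = torch.cat([audio_embeds, prompt_embeds], dim=1)

# The frozen decoder generates the transcription autoregressively with beam search
out = decoder.generate(inputs_embeds=inputs_embeds, max_new_tokens=20, num_beams=4)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```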
Training Details
Training Data
The model is trained on a diverse mix of English speech datasets:
| Dataset | Hours | Domain | Description |
|---|---|---|---|
| LibriSpeech | 960h | Audiobooks | Clean read speech |
| GigaSpeech | 10,000h | Podcasts, YouTube | Diverse acoustic conditions |
| Common Voice 17.0 | ~2,500h | Crowdsourced | Multiple accents and speakers |
| LoquaciousSet | Variable | European Parliament | Formal speech |
Total Training Data: over 13,000 hours of English speech
Training Configuration
- Optimizer: AdamW
- Learning Rate: Cosine schedule with warmup
- Precision: BF16 mixed precision
- Batch Size: Dynamic based on audio length
- Gradient Checkpointing: Enabled
- Training Steps: ~50,000 steps
- Hardware: Single NVIDIA A100 (40GB)
- Training Time: ~24 hours
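These settings map naturally onto a Hugging Face TrainingArguments configuration. A hedged sketch, where values not listed above (learning rate, warmup, batch size, output path) are placeholders rather than the actual training recipe:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="checkpoints/tiny-audio",   # illustrative path
    max_steps=50_000,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,                      # assumed; exact warmup not documented
    learning_rate=1e-4,                     # assumed; exact LR not documented
    bf16=True,
    gradient_checkpointing=True,
    per_device_train_batch_size=8,          # actual batching is dynamic by audio length
    logging_steps=50,
)
```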
Training Strategy
Only the audio projector weights are trained from scratch. The HuBERT encoder and SmolLM3 decoder remain frozen throughout training, which:
- Reduces memory requirements significantly
- Enables faster training convergence
- Preserves pretrained knowledge
- Prevents catastrophic forgetting
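In code, this strategy amounts to disabling gradients on the pretrained components so that only the projector receives updates. A minimal sketch with illustrative stand-ins for the three modules:

```python
import torch.nn as nn
from transformers import AutoModelForCausalLM, HubertModel

# Illustrative stand-ins for the real Tiny Audio modules
encoder = HubertModel.from_pretrained("facebook/hubert-large-ls960-ft")
decoder = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM3-3B-Base")
projector = nn.Sequential(nn.Linear(1024 * 5, 2048), nn.ReLU(), nn.Linear(2048, 2048))

# Freeze the pretrained encoder and decoder; only the projector stays trainable.
for module in (encoder, decoder):
    for param in module.parameters():
        param.requires_grad = False

assert not any(p.requires_grad for p in encoder.parameters())
assert not any(p.requires_grad for p in decoder.parameters())
assert all(p.requires_grad for p in projector.parameters())
```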
Evaluation
The model is evaluated on the LoquaciousSet benchmark dataset using Word Error Rate (WER) as the primary metric.
Benchmark Results
| Dataset | Split | Samples | WER |
|---|---|---|---|
| LoquaciousSet (large) | test | ~2,000 | TBD |
| LoquaciousSet (clean) | test | ~500 | TBD |
Evaluation Script
```bash
# Install tiny-audio
pip install tiny-audio

# Run evaluation
uv run scripts/eval.py --max-samples 100

# Compare with baselines
uv run scripts/eval.py --provider assemblyai --api-key YOUR_API_KEY
```
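WER counts substitutions, insertions, and deletions against the reference transcript, divided by the number of reference words. To reproduce the metric outside the bundled script, here is a small sketch with the Hugging Face evaluate library (not necessarily what scripts/eval.py uses internally):

```python
import evaluate

wer_metric = evaluate.load("wer")

references = ["the quick brown fox jumps over the lazy dog"]
predictions = ["the quick brown fox jumped over the lazy dog"]

# 1 substitution out of 9 reference words ≈ 11.11%
wer = wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer:.2%}")
```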
Limitations and Bias
Limitations
- English Only: Currently trained only on English speech data
- Formal Speech: May perform better on clear, formal speech than casual conversation
- Background Noise: Performance may degrade in noisy environments
- Accents: May have varying performance across different English accents
- Domain Shift: Best performance on domains similar to training data
Potential Biases
- Dataset Bias: Training data may not equally represent all demographics
- Accent Bias: May perform differently across accents (American, British, Indian, etc.)
- Gender Bias: Performance may vary by speaker gender
- Age Bias: Primarily trained on adult speech
Users should evaluate the model on their specific use case and demographics before production deployment.
Intended Use
Primary Use Cases
- ✅ Transcription Services: Converting speech to text for podcasts, videos, interviews
- ✅ Accessibility Tools: Generating captions and subtitles
- ✅ Voice Assistants: Speech-to-text component in voice interfaces
- ✅ Research: ASR research and experimentation
- ✅ Education: Learning about multimodal models and parameter-efficient training
Out-of-Scope Use
- ❌ Real-time critical systems (medical, legal) without thorough validation
- ❌ Surveillance or privacy-invasive applications
- ❌ Non-English languages (not trained for this)
- ❌ Child safety applications without age-appropriate testing
Environmental Impact
- Hardware: 1× NVIDIA A100 (40GB)
- Training Time: ~24 hours
- Power Consumption: ~300W × 24h = 7.2 kWh
- Estimated CO₂ Emissions: ~3.6 kg CO₂e (assuming 0.5 kg CO₂/kWh)
This is significantly lower than training full ASR models from scratch thanks to frozen pretrained components.
Citation
If you use Tiny Audio in your research, please cite:
```bibtex
@software{kroman2024tinyaudio,
  author = {Kroman, Alex},
  title  = {Tiny Audio: Efficient Speech Recognition with Frozen Pretrained Models},
  year   = {2024},
  url    = {https://github.com/alexkroman/tiny-audio},
  note   = {HuggingFace Model: https://huggingface.co/mazesmazes/tiny-audio}
}
```
Acknowledgments
This project builds upon excellent prior work:
- HuBERT (Hsu et al., 2021): Self-supervised speech representation learning
- SmolLM3 (HuggingFace Team): Efficient language model
License
This model is released under the MIT License. See the LICENSE file for details.