# Tiny Audio

A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with the Tiny Audio codebase, a minimal, hackable framework for training ASR models.
## Architecture
Audio (16kHz) → Whisper Encoder (frozen) → MLP Projector (trained) → SmolLM3-3B (frozen) → Text
MLP Projector:
- Convolutional downsampling: 4x sequence compression via two stride-2 conv layers
- Feed-forward: Linear (1280 → 2048) → GELU → Linear (2048 → 2048)
- Output normalization: RMSNorm
## Training Details
| Detail | Value |
|---|---|
| Dataset | LoquaciousSet (25,000 hours) |
| Hardware | Single NVIDIA A40 (48 GB) |
| Training time | ~24 hours |
| Cost | ~$12 |
| Trainable parameters | ~12M (projector only) |
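"Projector only" means the encoder and language model are frozen and only the projector's parameters receive gradients. A toy sketch of that setup (module names and sizes here are illustrative, not the real model):

```python
import torch.nn as nn


class ToyASR(nn.Module):
    """Stand-in for the frozen-encoder / trained-projector / frozen-LM stack."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 8)    # stands in for the frozen Whisper encoder
        self.projector = nn.Linear(8, 8)  # the only trained component
        self.lm = nn.Linear(8, 8)         # stands in for frozen SmolLM3-3B


model = ToyASR()
model.requires_grad_(False)               # freeze everything...
model.projector.requires_grad_(True)      # ...then unfreeze the projector

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(trainable)  # only the projector's weights and biases count
```

An optimizer built from `filter(lambda p: p.requires_grad, model.parameters())` then updates only the projector, which is why the trainable count stays at ~12M even though the full stack is billions of parameters.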
## Performance
Word Error Rate (WER): 12.14% on the LoquaciousSet test set.
See the community leaderboard for comparisons.
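WER is the word-level edit distance (substitutions + insertions + deletions) between hypothesis and reference, divided by the number of reference words. A minimal reference implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    r, h = reference.split(), hypothesis.split()
    # Dynamic-programming (Levenshtein) edit distance over words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[-1][-1] / len(r)


print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1 error / 6 words
```

So a 12.14% WER means roughly one word in eight is wrong relative to the reference transcript. Production evaluations usually normalize casing and punctuation first; that step is omitted here.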
## Usage
```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="mazesmazes/tiny-audio",
    trust_remote_code=True,
)
result = pipe("path/to/audio.wav")
print(result["text"])
```
## Limitations
- English only
- Optimized for 16kHz audio; other sample rates are resampled automatically
- Performance may degrade on heavily accented speech, noisy environments, or domain-specific jargon
- Maximum audio length limited by the language model's context window
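Automatic resampling to 16 kHz is handled inside the pipeline, so the sketch below is only needed for manual preprocessing. It shows polyphase resampling with SciPy; whether the model itself uses this exact method is an assumption.

```python
import numpy as np
from scipy.signal import resample_poly

sr_in, sr_out = 44100, 16000
t = np.arange(sr_in) / sr_in
wav = np.sin(2 * np.pi * 440 * t)  # 1 second of a 440 Hz tone at 44.1 kHz

# 16000 / 44100 reduces to 160 / 441 (divide both by gcd = 100)
wav16k = resample_poly(wav, up=160, down=441)
print(len(wav16k))  # 1 second at the new rate
```

A resampled NumPy array can be passed to the pipeline directly as `pipe({"raw": wav16k, "sampling_rate": 16000})`.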
## Learn More
- Train your own model — The full codebase with training scripts
- Free 3-hour course — Build your own ASR system from scratch
- Submit to leaderboard — Share your trained model
## Model Tree

Model tree for mazesmazes/tiny-audio-glm:
- Base model: HuggingFaceTB/SmolLM3-3B-Base
- Finetuned: HuggingFaceTB/SmolLM3-3B