# OmniVoice 🌍
OmniVoice is a massively multilingual zero-shot text-to-speech (TTS) model supporting over 600 languages. Built on a novel diffusion language model-style architecture, it delivers high-quality speech at superior inference speed, with support for voice cloning and voice design.
- Paper: OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
- Repository: GitHub
- Demo: Hugging Face Space
## Key Features
- **600+ Languages Supported**: The broadest language coverage among zero-shot TTS models.
- **Voice Cloning**: State-of-the-art voice cloning quality from a short reference audio.
- **Voice Design**: Control voices via assigned speaker attributes (gender, age, pitch, dialect/accent, whisper, etc.).
- **Fine-grained Control**: Non-verbal symbols (e.g., `[laughter]`) and pronunciation correction via pinyin or phonemes.
- **Fast Inference**: RTF as low as 0.025 (40x faster than real time).
- **Diffusion Language Model-style Architecture**: A clean, streamlined, and scalable design that delivers both quality and speed.
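For context, a real-time factor (RTF) of 0.025 means the model needs 0.025 seconds of compute per second of audio it produces. A quick sketch of the arithmetic (the timing numbers below are illustrative, not a benchmark):

```python
# RTF = inference wall-clock time / duration of generated audio.
# The values here are hypothetical, chosen to match the quoted RTF.
audio_duration_s = 2.0    # e.g., a 2-second utterance
inference_time_s = 0.05   # hypothetical generation time

rtf = inference_time_s / audio_duration_s
speedup = 1 / rtf         # how many times faster than real time

print(rtf, speedup)       # 0.025 40.0
```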
## Sample Usage
To get started, install the `omnivoice` library. We recommend using a fresh virtual environment (e.g., conda, venv) to avoid dependency conflicts.
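For example, with Python's built-in `venv` module (the environment name below is arbitrary):

```shell
# Create and activate an isolated environment (name is arbitrary).
python3 -m venv omnivoice-env
. omnivoice-env/bin/activate
python -m pip install --upgrade pip
```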
### Step 1: Install PyTorch

#### NVIDIA GPU

```shell
# Install PyTorch matching your CUDA version, e.g.:
pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
```

See the PyTorch official site for installation instructions for other versions.
#### Apple Silicon

```shell
pip install torch==2.8.0 torchaudio==2.8.0
```
### Step 2: Install OmniVoice

```shell
pip install omnivoice
```
### Python API
You can use OmniVoice for zero-shot voice cloning as follows:
```python
import torch
import torchaudio

from omnivoice import OmniVoice

# Load the model
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16,
)

# Generate audio
audio = model.generate(
    text="Hello, this is a test of zero-shot voice cloning.",
    ref_audio="ref.wav",
    ref_text="Transcription of the reference audio.",
)  # `audio` is a list of `torch.Tensor` with shape (1, T) at 24 kHz.

torchaudio.save("out.wav", audio[0], 24000)
```
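Since `generate` returns a list of tensors, several utterances can be post-processed together before saving. A minimal sketch, with dummy tensors standing in for model output:

```python
import torch

# Dummy stand-ins for generate() output: a list of (1, T) tensors at 24 kHz.
sample_rate = 24000
chunks = [torch.zeros(1, 24000), torch.zeros(1, 12000)]  # 1.0 s and 0.5 s

# Concatenate along the time axis into a single (1, T) waveform.
combined = torch.cat(chunks, dim=1)
total_seconds = combined.shape[1] / sample_rate

print(combined.shape, total_seconds)  # torch.Size([1, 36000]) 1.5
```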
For more generation modes (e.g., voice design), functions (e.g., non-verbal symbols, pronunciation correction) and comprehensive usage instructions, see our GitHub Repository.
## Citation
```bibtex
@article{zhu2026omnivoice,
  title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},
  author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},
  journal={arXiv preprint arXiv:2604.00688},
  year={2026}
}
```