---
license: cc-by-nc-4.0
tags:
- audio
- codec
- speech
- xcodec2
- text-to-speech
- multilingual
language:
- en
- ja
- zh
- bn
- fr
- de
- ko
---
# 🗣️ XCodec2 Trained on 100K Hours of Multilingual Data
This is a retrained version of HKUSTAudio's XCodec2 neural audio codec, trained on 100,000 hours of multilingual speech spanning seven languages. The model compresses speech into discrete tokens and reconstructs it with high quality at low bitrates. Its token outputs are well suited as an intermediate representation for LLM-based TTS, AudioLM-style models, multimodal LLMs, and speech-to-speech systems.
---
## Overview
- **Model Architecture:** [XCodec2](https://huggingface.co/HKUSTAudio/xcodec2)
- **Sampling Rate:** 16 kHz
- **Token Rate:** 50 tokens/second
- **Developed By:** [Verbex.ai (Hishab Technologies Ltd.)](https://verbex.ai)
- **Primary Use Case:** High-quality speech reconstruction and intermediate TTS representations
- **Training Time:** 11 days (8× H100 80GB)
- **Epochs:** 1
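The numbers above imply a very low bitrate for the token stream. A back-of-the-envelope sketch, assuming the upstream XCodec2's single codebook of 65,536 entries (16 bits per token; the codebook size is not stated in this card):

```python
import math

# Bitrate of the discrete token stream.
# Assumption: a single codebook with 65,536 entries, as in the
# upstream XCodec2 design; 16 bits per token follows from log2.
tokens_per_second = 50
codebook_size = 65_536

bits_per_token = math.log2(codebook_size)          # 16.0
bitrate_bps = tokens_per_second * bits_per_token   # 800.0

print(f"{bitrate_bps:.0f} bps ({bitrate_bps / 1000:.1f} kbps)")
```

Under that assumption the codec operates at roughly 0.8 kbps, far below conventional speech codecs.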
---
## Installation & Usage
This model requires `xcodec2`. We recommend using a minimal setup:
```bash
# Create environment
conda create -n xcodec2 python=3.9
conda activate xcodec2
# Install dependencies
pip install xcodec2==0.1.5
pip install numpy==1.26.4
```
### Example Usage
```python
import torch
import soundfile as sf
from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "hishab/titu-xcodec2"
model = XCodec2Model.from_pretrained(model_path)
model.eval().cuda()

# Load and preprocess the waveform: mono, 16 kHz
wav, sr = sf.read("test_bn.wav")
if sr != 16000:
    import librosa
    wav = librosa.resample(wav, orig_sr=sr, target_sr=16000)
    sr = 16000
if wav.ndim > 1:
    wav = wav.mean(axis=1)  # downmix stereo to mono
wav_tensor = torch.from_numpy(wav).float().unsqueeze(0)

# Encode to discrete codes, then decode back to audio
with torch.no_grad():
    vq_code = model.encode_code(input_waveform=wav_tensor)
    print("Code:", vq_code)
    recon_wav = model.decode_code(vq_code).cpu()

# Save the reconstruction
sf.write("reconstructed_bn.wav", recon_wav[0, 0].numpy(), sr)
print("Done! Check reconstructed_bn.wav")
```
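At 16 kHz input and 50 tokens per second, each code covers 320 samples, which gives a quick sanity check on the length of the encoded sequence (the hop size here is inferred from the stated rates, not read from the model internals):

```python
# Approximate token count for a clip, assuming a fixed hop of
# 16000 / 50 = 320 samples per token (inferred from the stated
# sampling rate and token rate above).
SAMPLE_RATE = 16_000
TOKENS_PER_SECOND = 50
HOP = SAMPLE_RATE // TOKENS_PER_SECOND  # 320 samples per token

def expected_num_tokens(num_samples: int) -> int:
    """Approximate number of codec tokens for a mono waveform."""
    return num_samples // HOP

# A 3-second clip should encode to roughly 150 tokens.
print(expected_num_tokens(3 * SAMPLE_RATE))  # 150
```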
---
## Multilingual Training Dataset
| Language | Dataset(s) | Hours (K) |
|-----------|----------------------------------------|-----------|
| Japanese | EmiliaYODAS + Verbex JA TTS Dataset | 31.41 |
| English | EmiliaYODAS | 25.69 |
| Chinese | EmiliaYODAS | 12.50 |
| Bangla | Verbex Bengali TTS Dataset | 11.58 |
| French    | EmiliaYODAS + Multilingual LibriSpeech  | 8.40      |
| German    | EmiliaYODAS + Multilingual LibriSpeech  | 5.42      |
| Korean | EmiliaYODAS | 5.00 |
| **Total** | —                                      | **100**   |
---
## Reconstruction Evaluation
Reconstruction metrics are computed over 100 samples each for English, Japanese, and Bangla, comparing this retrained model (**XCODEC2 (Ours)**) against three baselines: the original XCodec, SNAC, and NVIDIA's low-frame-rate speech codec.
**Evaluation Test Sets:**
- English: 100 Examples (Emilia Dataset)
- Japanese: 100 Examples (Emilia Dataset)
- Bangla: 100 Examples (Verbex's Inhouse TTS Dataset)
| Model | Lang | MCD ↓ | MSE ↓ | SpeechBERTScore ↑ | SpeechBLEU ↑ | SpeechTokenDist ↑ |
|-------------------|------|--------|--------|-------------|--------|-------------|
| **XCODEC** | BN | 2.823 | 0.003 | 0.939 | 0.500 | 0.816 |
| | EN | 3.166 | 0.012 | 0.962 | 0.660 | 0.856 |
| | JA | 3.021 | 0.010 | 0.948 | 0.582 | 0.838 |
| **Overall** | | 3.003 | 0.008 | 0.949 | 0.581 | 0.837 |
| **XCODEC2 (Ours)** | BN | 2.712 | 0.003 | 0.940 | 0.508 | 0.817 |
| | EN | 3.206 | 0.014 | 0.957 | 0.644 | 0.851 |
| | JA | 3.022 | 0.012 | 0.946 | 0.573 | 0.838 |
| **Overall** | | 2.980 | 0.010 | 0.948 | 0.575 | 0.835 |
| **hubertsiuzdak/snac_24khz** | BN | 3.104 | 0.002 | 0.911 | 0.442 | 0.785 |
| | EN | 3.983 | 0.014 | 0.912 | 0.541 | 0.797 |
| | JA | 3.512 | 0.009 | 0.903 | 0.472 | 0.761 |
| **Overall** | | 3.533 | 0.008 | 0.909 | 0.485 | 0.781 |
| **nvidia/low-frame-rate-speech-codec-22khz** | BN | 2.247 | 0.000 | 0.957 | 0.589 | 0.863 |
| | EN | 2.867 | 0.007 | 0.969 | 0.707 | 0.872 |
| | JA | 2.677 | 0.003 | 0.955 | 0.614 | 0.853 |
| **Overall** | | 2.597 | 0.003 | 0.960 | 0.636 | 0.863 |
*SpeechBERTScore, SpeechBLEU, and SpeechTokenDistance are computed with [DiscreteSpeechMetrics](https://github.com/Takaaki-Saeki/DiscreteSpeechMetrics).*
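For a quick local check on your own audio, waveform-level mean squared error between a reference and its reconstruction can be computed directly. This is a minimal sketch of one plausible reading of the MSE column; the exact definition used for the table is not specified in this card:

```python
import numpy as np

def waveform_mse(ref: np.ndarray, recon: np.ndarray) -> float:
    """Mean squared error between two mono waveforms.

    Truncates to the shorter length, since codecs may pad or
    trim the output by a few samples.
    """
    n = min(len(ref), len(recon))
    return float(np.mean((ref[:n] - recon[:n]) ** 2))

# Toy example: a 440 Hz sine vs. a slightly noisy copy of itself.
t = np.linspace(0, 1, 16_000, endpoint=False)
ref = np.sin(2 * np.pi * 440 * t)
recon = ref + 0.01 * np.random.default_rng(0).standard_normal(len(ref))
print(waveform_mse(ref, recon))  # small, on the order of 1e-4
```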
---
## Intended Use
This model is suitable for:
- Speech tokenization in TTS pipelines
- Low-bitrate speech compression
- Code-based speech synthesis or generation tasks
- Multimodal LLM, AudioLM, and speech-to-speech modeling
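For the LLM use cases above, codec token ids are typically folded into the language model's vocabulary by offsetting them past the text-token range. A hypothetical sketch; the vocabulary sizes and offset below are illustrative, not taken from this model card:

```python
# Illustrative mapping between codec token ids and LLM token ids.
# Both sizes are assumptions for the sketch, not model facts.
TEXT_VOCAB_SIZE = 32_000    # assumed text tokenizer size
CODEC_VOCAB_SIZE = 65_536   # assumed single-codebook size

def codec_to_llm_ids(codes):
    """Shift codec token ids past the text vocabulary."""
    return [TEXT_VOCAB_SIZE + c for c in codes]

def llm_to_codec_ids(ids):
    """Inverse mapping, keeping only ids in the codec range."""
    return [i - TEXT_VOCAB_SIZE for i in ids
            if TEXT_VOCAB_SIZE <= i < TEXT_VOCAB_SIZE + CODEC_VOCAB_SIZE]

codes = [17, 402, 65_535]
ids = codec_to_llm_ids(codes)
assert llm_to_codec_ids(ids) == codes
print(ids)  # [32017, 32402, 97535]
```

The round trip drops any ids outside the codec range, so interleaved text tokens pass through untouched.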
---
## Limitations
- Licensed for **non-commercial use only**
---
## License
This model is licensed under **Creative Commons Attribution-NonCommercial 4.0 (CC BY-NC 4.0)**.
Commercial usage is **not allowed**.
- SPDX Identifier: `CC-BY-NC-4.0`
- License Details: [https://creativecommons.org/licenses/by-nc/4.0](https://creativecommons.org/licenses/by-nc/4.0)
---
## Contact
For research collaborations, feedback, or commercial licensing inquiries, please reach out to:
**Website:** [https://verbex.ai](https://verbex.ai)
---
<!-- ## Citation
```latex
@misc{verbex2025xcodec2,
title = {{Titu-XCodec2}: A Multilingual Neural Audio Codec by Verbex.ai},
author = {Mohammad Jahid Ibna Basher* and Saiful Islam* and Mehedi Hasan Menon and Tareq-Al-Muntasir},
year = {2025},
howpublished = {\url{https://huggingface.co/hishab/titu-xcodec2}},
}
```
-->