---
license: cdla-permissive-2.0
datasets:
- mythicinfinity/libritts_r
- mythicinfinity/libritts
- keithito/lj_speech
- ginger-turmeric/LibriLight
- corvj/daps
language:
- en
base_model:
- descript/dac_24khz
tags:
- speech
- autoencoder
- tokenizer
- speech coding
- vocoder
---

## Model Summary

[DAC auto-encoder models](https://github.com/descriptinc/descript-audio-codec) provide a compact discrete tokenization of speech and audio signals. The tokens support both signal generation by cascaded generative AI models (e.g. multi-modal generative AI models) and high-quality reconstruction of the original signals. [The fine-tuned models in this repository](https://www.isca-archive.org/interspeech_2024/shechtman24_interspeech.pdf) improve upon the [original DAC models](https://github.com/descriptinc/descript-audio-codec) by providing a more compact representation of wide-band speech signals while retaining high-quality signal reconstruction. They achieve speech reconstruction that is [nearly indistinguishable from PCM](https://ibm.biz/IS24SpeechRVQ) at a rate of 150-300 tokens per second (1500-3000 bps). [The evaluation](https://www.isca-archive.org/interspeech_2024/shechtman24_interspeech.pdf) used comprehensive English speech data encompassing different recording conditions, including studio settings.

| Model | Speech Sample Rate | Codebooks | Bit Rate | Token Rate | Version |
| :---: | :---: | :---: | :---: | :---: | :---: |
| weights_24khz_3.0kbps_v1.0.pth | 24 kHz | 4 | 3 kbps | 300 Hz | 1.0 |
| weights_24khz_1.5kbps_v1.0.pth | 24 kHz | 2 | 1.5 kbps | 150 Hz | 1.0 |

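The token and bit rates listed for the two checkpoints follow from simple arithmetic, sketched below. The sketch assumes a hop length of 320 samples (75 frames per second at 24 kHz) and 1024-entry codebooks (10 bits per token). Both values are assumptions consistent with the base `descript/dac_24khz` configuration rather than figures stated in this card, and `codec_rates` is a hypothetical helper name.

```python
import math


def codec_rates(sample_rate_hz, hop_length, n_codebooks, codebook_size):
    """Return (token_rate, bit_rate) per second for a residual-VQ codec."""
    # One frame of codes is produced for every `hop_length` input samples.
    frame_rate = sample_rate_hz / hop_length
    # Each frame carries one token per codebook.
    token_rate = frame_rate * n_codebooks
    # Each token costs log2(codebook_size) bits.
    bit_rate = token_rate * math.log2(codebook_size)
    return token_rate, bit_rate


# Assumed values: 24 kHz input, hop length 320, 1024-entry codebooks.
for name, n_codebooks in [("3.0kbps", 4), ("1.5kbps", 2)]:
    token_rate, bit_rate = codec_rates(24_000, 320, n_codebooks, 1024)
    print(f"{name}: {token_rate:.0f} tokens/s, {bit_rate:.0f} bps")
```

Under these assumptions the 4-codebook model yields 300 tokens/s at 3000 bps and the 2-codebook model yields 150 tokens/s at 1500 bps, matching the figures quoted for the two checkpoints.
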
## Usage

* Follow the [DAC](https://github.com/descriptinc/descript-audio-codec) installation instructions.
* Clone this repository:

```
git clone https://huggingface.co/ibm/DAC.speech.v1.0
cd DAC.speech.v1.0
```

### Compress audio

```
python3 -m dac encode /path/to/input --output /path/to/output/codes --weights_path weights_24khz_3.0kbps_v1.0.pth
```

This command creates `.dac` files with the same names as the input files. It also preserves the directory structure relative to the input root and re-creates it in the output directory. Run `python3 -m dac encode --help` for more options.

### Reconstruct audio from compressed codes

```
python3 -m dac decode /path/to/output/codes --output /path/to/reconstructed_input --weights_path weights_24khz_3.0kbps_v1.0.pth
```

This command creates `.wav` files with the same names as the input `.dac` files. It also preserves the directory structure relative to the input root and re-creates it in the output directory. Run `python3 -m dac decode --help` for more options.

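When the encode and decode steps need to run from a script, the two commands can be chained from Python. The following is a minimal sketch that uses only the arguments shown in the commands above; `dac_cli_args` and `round_trip` are hypothetical helper names, not part of the DAC package.

```python
import subprocess
import sys


def dac_cli_args(mode, source, output, weights_path):
    """Build the argument list for `python -m dac encode|decode`."""
    if mode not in ("encode", "decode"):
        raise ValueError(f"unsupported mode: {mode!r}")
    return [
        sys.executable, "-m", "dac", mode, str(source),
        "--output", str(output),
        "--weights_path", str(weights_path),
    ]


def round_trip(input_dir, codes_dir, output_dir, weights_path):
    """Encode input_dir to .dac files, then decode them back to .wav files."""
    # check=True raises CalledProcessError if either step fails.
    subprocess.run(dac_cli_args("encode", input_dir, codes_dir, weights_path), check=True)
    subprocess.run(dac_cli_args("decode", codes_dir, output_dir, weights_path), check=True)
```

Calling `round_trip("/path/to/input", "/path/to/output/codes", "/path/to/reconstructed_input", "weights_24khz_3.0kbps_v1.0.pth")` requires DAC to be installed in the same Python environment.
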
### Programmatic Usage

```py
import dac
from audiotools import AudioSignal

# Load the fine-tuned weights from the cloned repo
model_path = 'weights_24khz_3.0kbps_v1.0.pth'
model = dac.DAC.load(model_path)

model.to('cuda')

# Load audio signal file
# (the direct encode path below expects audio at the model's 24 kHz sample rate)
signal = AudioSignal('input.wav')

# Encode audio signal as one long file
# (may run out of GPU memory on long files)
signal.to(model.device)

x = model.preprocess(signal.audio_data, signal.sample_rate)
z, codes, latents, _, _ = model.encode(x)

# Decode audio signal
y = model.decode(z)

# Alternatively, use the `compress` and `decompress` functions
# to compress long files.
signal = signal.cpu()
x = model.compress(signal)

# Save and load to and from disk
x.save("compressed.dac")
x = dac.DACFile.load("compressed.dac")

# Decompress it back to an AudioSignal
y = model.decompress(x)

# Write to file
y.write('output.wav')
```

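As a sanity check on the `codes` tensor returned by `model.encode`, its expected shape can be estimated from the input length. The sketch below assumes that the input is padded to a multiple of the hop length, that one frame is emitted per hop, and that the hop length is 320 samples, as in the base 24 kHz DAC configuration. These are assumptions rather than facts stated in this card, and `expected_codes_shape` is a hypothetical helper name.

```python
import math


def expected_codes_shape(n_samples, n_codebooks, hop_length=320, batch_size=1):
    """Estimate the (batch, codebooks, frames) shape of the discrete codes."""
    # The input is assumed to be right-padded up to a whole number of hops.
    n_frames = math.ceil(n_samples / hop_length)
    return (batch_size, n_codebooks, n_frames)


# One second of 24 kHz audio with the 4-codebook (3.0 kbps) model.
print(expected_codes_shape(24_000, n_codebooks=4))
```

Comparing this estimate with `codes.shape` after encoding is a quick way to confirm that the intended checkpoint (2 or 4 codebooks) was loaded.
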
## Citing & Authors

If you find this model helpful, feel free to cite our publication [Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer](https://www.isca-archive.org/interspeech_2024/shechtman24_interspeech.pdf):

```bibtex
@inproceedings{shechtman24_interspeech,
  title     = {Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer},
  author    = {Slava Shechtman and Avihu Dekel},
  year      = {2024},
  booktitle = {Interspeech 2024},
  pages     = {4174--4178},
  doi       = {10.21437/Interspeech.2024-2366},
  issn      = {2958-1796},
}
```