	---
	license: cdla-permissive-2.0
	datasets:
	- mythicinfinity/libritts_r
	- mythicinfinity/libritts
	- keithito/lj_speech
	- ginger-turmeric/LibriLight
	- corvj/daps
	language:
	- en
	base_model:
	- descript/dac_24khz
	tags:
	- speech
	- autoencoder
	- tokenizer
	- speech coding
	- vocoder
	---

	## Model Summary
	[DAC auto-encoder models](https://github.com/descriptinc/descript-audio-codec) provide compact discrete tokenization of speech and audio signals. The tokens facilitate signal generation by cascaded generative AI models (e.g. multi-modal generative AI models) and allow high-quality reconstruction of the original signals. [The current fine-tuned models](https://www.isca-archive.org/interspeech_2024/shechtman24_interspeech.pdf) improve upon the [original DAC models](https://github.com/descriptinc/descript-audio-codec) by allowing a more compact representation of wide-band speech signals while preserving high-quality signal reconstruction. The models achieve speech reconstruction that is [nearly indistinguishable from PCM](https://ibm.biz/IS24SpeechRVQ) at a rate of 150-300 tokens per second (1500-3000 bps). [The evaluation](https://www.isca-archive.org/interspeech_2024/shechtman24_interspeech.pdf) used comprehensive English speech data encompassing different recording conditions, including studio settings.

	| Model | Speech Sample Rate | Codebooks | Bit Rate | Token Rate | Version |
	| :---: | :---: | :---: | :---: | :---: | :---: |
	| weights_24khz_3.0kbps_v1.0.pth | 24 kHz | 4 | 3 kbps | 300 tokens/s | 1.0 |
	| weights_24khz_1.5kbps_v1.0.pth | 24 kHz | 2 | 1.5 kbps | 150 tokens/s | 1.0 |
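
The token and bit rates above are related by simple arithmetic, sketched below. The 75 Hz frame rate and the 1024-entry codebook (10 bits per token) are not stated in this card; they are assumptions inferred from the listed figures (300 tokens/s over 4 codebooks, and 3000 bps over 300 tokens/s).

```python
import math

# Assumptions inferred from the figures in this card, not stated in it:
FRAME_RATE_HZ = 75    # 300 tokens/s divided by 4 codebooks
CODEBOOK_SIZE = 1024  # 3000 bps divided by 300 tokens/s gives 10 bits per token


def token_rate(n_codebooks, frame_rate_hz=FRAME_RATE_HZ):
    """Tokens per second: one token per codebook per frame."""
    return n_codebooks * frame_rate_hz


def bit_rate_bps(n_codebooks, frame_rate_hz=FRAME_RATE_HZ, codebook_size=CODEBOOK_SIZE):
    """Bits per second: each token carries log2(codebook_size) bits."""
    return token_rate(n_codebooks, frame_rate_hz) * math.log2(codebook_size)


for n_codebooks in (4, 2):
    print(n_codebooks, token_rate(n_codebooks), bit_rate_bps(n_codebooks))
```

Under these assumptions, halving the number of codebooks halves both the token rate and the bit rate, which matches the two rows of the table.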

	## Usage
	* Follow the [DAC](https://github.com/descriptinc/descript-audio-codec) installation instructions.

	* Clone the current repo:
	```
	git clone https://huggingface.co/ibm/DAC.speech.v1.0
	cd DAC.speech.v1.0
	```

	### Compress audio
	```
	python3 -m dac encode /path/to/input --output /path/to/output/codes --weights_path weights_24khz_3.0kbps_v1.0.pth
	```

	This command creates `.dac` files with the same names as the input files. It also preserves the directory structure relative to the input root and re-creates it in the output directory. Run `python3 -m dac encode --help` for more options.

	### Reconstruct audio from compressed codes
	```
	python3 -m dac decode /path/to/output/codes --output /path/to/reconstructed_input --weights_path weights_24khz_3.0kbps_v1.0.pth
	```

	This command creates `.wav` files with the same names as the input `.dac` files. It also preserves the directory structure relative to the input root and re-creates it in the output directory. Run `python3 -m dac decode --help` for more options.
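
The two commands above can also be chained from a script. The following is a minimal sketch that only reuses the arguments shown in this section (`encode`/`decode`, `--output`, `--weights_path`); the helper names `build_dac_command` and `round_trip` are hypothetical and not part of the `dac` package.

```python
import subprocess
import sys


def build_dac_command(mode, input_path, output_path, weights_path):
    """Build the argument list for `python -m dac encode|decode` (hypothetical helper)."""
    if mode not in ("encode", "decode"):
        raise ValueError(f"mode must be 'encode' or 'decode', got {mode!r}")
    return [
        sys.executable, "-m", "dac", mode, str(input_path),
        "--output", str(output_path),
        "--weights_path", str(weights_path),
    ]


def round_trip(input_dir, codes_dir, output_dir, weights_path):
    """Encode a directory to .dac files, then decode them back to .wav files."""
    # check=True raises CalledProcessError if either step fails
    subprocess.run(build_dac_command("encode", input_dir, codes_dir, weights_path), check=True)
    subprocess.run(build_dac_command("decode", codes_dir, output_dir, weights_path), check=True)
```

Calling `round_trip("/path/to/input", "/path/to/output/codes", "/path/to/reconstructed_input", "weights_24khz_3.0kbps_v1.0.pth")` would run both steps in sequence, assuming `dac` is installed in the same Python environment.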

	### Programmatic Usage
	```py
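	import dac
	import torch
	from audiotools import AudioSignal

	# Load the fine-tuned weights from the cloned repo
	model_path = 'weights_24khz_3.0kbps_v1.0.pth'
	model = dac.DAC.load(model_path)

	# Use the GPU when available, otherwise fall back to the CPU
	device = 'cuda' if torch.cuda.is_available() else 'cpu'
	model.to(device)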

	# Load audio signal file
	signal = AudioSignal('input.wav')

	# Encode audio signal as one long file
	# (may run out of GPU memory on long files)
	signal.to(model.device)

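	# `preprocess` expects audio at the model's sample rate (24 kHz for these weights)
	x = model.preprocess(signal.audio_data, signal.sample_rate)
	# z: quantized latents, codes: discrete token indices per codebook
	z, codes, latents, _, _ = model.encode(x)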

	# Decode audio signal
	y = model.decode(z)

	# Alternatively, use the `compress` and `decompress` functions
	# to compress long files.

	signal = signal.cpu()
	x = model.compress(signal)

	# Save and load to and from disk
	x.save("compressed.dac")
	x = dac.DACFile.load("compressed.dac")

	# Decompress it back to an AudioSignal
	y = model.decompress(x)

	# Write to file
	y.write('output.wav')
	```
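
As a sanity check on the `codes` tensor returned by `model.encode`, its expected shape can be estimated from the clip length. This is a rough sketch: the hop length of 320 samples per frame is an assumption (24000 Hz divided by the 75 Hz frame rate implied by the token rates in this card), and `expected_codes_shape` is a hypothetical helper, not part of `dac`.

```python
import math

# Assumed samples per frame for the 24 kHz model (24000 / 75); not stated in this card
HOP_LENGTH = 320


def expected_codes_shape(n_samples, n_codebooks, batch_size=1, hop_length=HOP_LENGTH):
    """Estimate the (batch, codebooks, frames) shape of the code tensor."""
    # Input is assumed to be padded up to a whole number of frames
    n_frames = math.ceil(n_samples / hop_length)
    return (batch_size, n_codebooks, n_frames)


# Two seconds of 24 kHz audio with the 4-codebook (3.0 kbps) weights
shape = expected_codes_shape(n_samples=24_000 * 2, n_codebooks=4)
print(shape)
```

Under these assumptions, a two-second clip yields 4 x 150 = 600 tokens, consistent with the 300 tokens per second quoted in the summary.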

	## Citing & Authors

	If you find this model helpful, feel free to cite our publication [Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer](https://www.isca-archive.org/interspeech_2024/shechtman24_interspeech.pdf):
	```bibtex
	@inproceedings{shechtman24_interspeech,
	title = {Low Bitrate High-Quality RVQGAN-based Discrete Speech Tokenizer},
	author = {Slava Shechtman and Avihu Dekel},
	year = {2024},
	booktitle = {Interspeech 2024},
	pages = {4174--4178},
	doi = {10.21437/Interspeech.2024-2366},
	issn = {2958-1796},
	}
	```