Add pipeline_tag and paper link
This PR updates the model card by adding `pipeline_tag: text-to-audio` to the metadata, which is crucial for categorizing the model correctly. It also adds the paper title and a direct link to the paper at the beginning of the content for better visibility.
README.md (CHANGED)
---
license: mit
tags:
- text-to-audio
- controlnet
pipeline_tag: text-to-audio
library_name: diffusers
---

<img src="https://github.com/haidog-yaqub/EzAudio/blob/main/arts/ezaudio.png?raw=true">

# EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

[EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer](https://huggingface.co/papers/2409.10819)

**Abstract:** We introduce EzAudio, a text-to-audio (T2A) generation framework designed to produce high-quality, natural-sounding sound effects. Core designs include: (1) We propose EzAudio-DiT, an optimized Diffusion Transformer (DiT) designed for audio latent representations, improving convergence speed, as well as parameter and memory efficiency. (2) We apply a classifier-free guidance (CFG) rescaling technique to mitigate fidelity loss at higher CFG scores while enhancing prompt adherence without compromising audio quality. (3) We propose a synthetic caption generation strategy leveraging recent advances in audio understanding and LLMs to enhance T2A pretraining. We show that EzAudio, with its computationally efficient architecture and fast convergence, is a competitive open-source model that excels in both objective and subjective evaluations by delivering highly realistic listening experiences. Code, data, and pre-trained models are released at: this https URL.

[Project Page](https://haidog-yaqub.github.io/EzAudio-Page/)
[arXiv](https://arxiv.org/abs/2409.10819)
[Demo](https://huggingface.co/spaces/OpenSound/EzAudio)

EzAudio is a diffusion-based text-to-audio generation model. Designed for real-world audio applications, EzAudio brings together high-quality audio synthesis with lower computational demands.

Play with EzAudio for text-to-audio generation, editing, and inpainting: [EzAudio Space](https://huggingface.co/spaces/OpenSound/EzAudio)

EzAudio-ControlNet is available: [EzAudio-ControlNet Space](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet)

<!-- We want to thank Hugging Face Spaces and Gradio for providing an incredible demo platform. -->

## Installation

Install the dependencies:

```bash
cd EzAudio
pip install -r requirements.txt
```

Download checkpoints (optional):
[https://huggingface.co/OpenSound/EzAudio](https://huggingface.co/OpenSound/EzAudio/tree/main)

## Usage

You can use the model with the following code:

```python
from api.ezaudio import EzAudio
import torch
import soundfile as sf

# load model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ezaudio = EzAudio(model_name='s3_xl', device=device)

# text-to-audio generation
prompt = "a dog barking in the distance"
sr, audio = ezaudio.generate_audio(prompt)
sf.write(f'{prompt}.wav', audio, sr)

# audio inpainting
prompt = "A train passes by, blowing its horns"
original_audio = 'ref.wav'
sr, audio = ezaudio.editing_audio(prompt, boundary=2, gt_file=original_audio,
                                  mask_start=1, mask_length=5)
sf.write(f'{prompt}_edit.wav', audio, sr)
```
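
If you are running this in a Jupyter notebook, you can also audition the result inline instead of opening the saved file; a minimal sketch, assuming `audio` and `sr` are the waveform array and sample rate returned by the calls above:

```python
# Play the generated waveform inline in a notebook.
from IPython.display import Audio

Audio(audio, rate=sr)
```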

## Training

#### Autoencoder
Refer to the VAE training section in our work [SoloAudio](https://github.com/WangHelin1997/SoloAudio).

#### T2A Diffusion Model
Prepare your data (see the example in `src/dataset/meta_example.csv`), then run:

```bash
cd src
accelerate launch train.py
```
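
The authoritative metadata format is the one shipped in `src/dataset/meta_example.csv`; the snippet below only illustrates assembling such a file with pandas, and the column names used here are hypothetical placeholders rather than the repository's actual schema.

```python
# Illustrative only: build a training metadata CSV.
# Column names are hypothetical; copy the real header from src/dataset/meta_example.csv.
import pandas as pd

meta = pd.DataFrame([
    {"audio_path": "data/dog_bark.wav", "caption": "a dog barking in the distance"},
    {"audio_path": "data/train_horn.wav", "caption": "a train passes by, blowing its horns"},
])
meta.to_csv("meta.csv", index=False)
```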

## Todo
- [x] Release Gradio Demo along with checkpoints [EzAudio Space](https://huggingface.co/spaces/OpenSound/EzAudio)
- [x] Release ControlNet Demo along with checkpoints [EzAudio ControlNet Space](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet)
- [x] Release inference code
- [x] Release training pipeline and dataset
- [x] Improve API and support automatic ckpts downloading
- [ ] Release checkpoints for stage1 and stage2 [WIP]

## Reference

If you find the code useful for your research, please consider citing the EzAudio paper: [arXiv:2409.10819](https://arxiv.org/abs/2409.10819).

## Acknowledgement
Some codes are borrowed from or inspired by: [U-ViT](https://github.com/baofff/U-ViT), [PixArt-alpha](https://github.com/PixArt-alpha/PixArt-alpha), [Hunyuan-DiT](https://github.com/Tencent/HunyuanDiT), and [Stable Audio](https://github.com/Stability-AI/stable-audio-tools).