Add pipeline_tag and paper link
This PR updates the model card by adding `pipeline_tag: text-to-audio` to the metadata, which is crucial for categorizing the model correctly. It also adds the paper title and a direct link to the paper at the beginning of the content for better visibility.
README.md (CHANGED)
---
license: mit
tags:
- text-to-audio
- controlnet
pipeline_tag: text-to-audio
library_name: diffusers
---

<img src="https://github.com/haidog-yaqub/EzAudio/blob/main/arts/ezaudio.png?raw=true">

# EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

[EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer](https://huggingface.co/papers/2409.10819)

**Abstract:** We introduce EzAudio, a text-to-audio (T2A) generation framework designed to produce high-quality, natural-sounding sound effects. Core designs include: (1) We propose EzAudio-DiT, an optimized Diffusion Transformer (DiT) designed for audio latent representations, improving convergence speed, as well as parameter and memory efficiency. (2) We apply a classifier-free guidance (CFG) rescaling technique to mitigate fidelity loss at higher CFG scores while enhancing prompt adherence without compromising audio quality. (3) We propose a synthetic caption generation strategy leveraging recent advances in audio understanding and LLMs to enhance T2A pretraining. We show that EzAudio, with its computationally efficient architecture and fast convergence, is a competitive open-source model that excels in both objective and subjective evaluations by delivering highly realistic listening experiences. Code, data, and pre-trained models are released at: this https URL.

[Project Page](https://haidog-yaqub.github.io/EzAudio-Page/)
[arXiv](https://arxiv.org/abs/2409.10819)
[Demo](https://huggingface.co/spaces/OpenSound/EzAudio)

EzAudio is a diffusion-based text-to-audio generation model. Designed for real-world audio applications, EzAudio brings together high-quality audio synthesis with lower computational demands.

Play with EzAudio for text-to-audio generation, editing, and inpainting: [EzAudio Space](https://huggingface.co/spaces/OpenSound/EzAudio)

EzAudio-ControlNet is available: [EzAudio-ControlNet Space](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet)

<!-- We want to thank Hugging Face Spaces and Gradio for providing an incredible demo platform. -->

## Installation

Install the dependencies:

```bash
cd EzAudio
pip install -r requirements.txt
```

Download checkpoints (optional):
[https://huggingface.co/OpenSound/EzAudio](https://huggingface.co/OpenSound/EzAudio/tree/main)

## Usage

You can use the model with the following code:

```python
from api.ezaudio import EzAudio
import torch
import soundfile as sf

# load model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ezaudio = EzAudio(model_name='s3_xl', device=device)

# text-to-audio generation
prompt = "a dog barking in the distance"
sr, audio = ezaudio.generate_audio(prompt)
sf.write(f'{prompt}.wav', audio, sr)

# audio inpainting
prompt = "A train passes by, blowing its horns"
original_audio = 'ref.wav'
sr, audio = ezaudio.editing_audio(prompt, boundary=2, gt_file=original_audio,
                                  mask_start=1, mask_length=5)
sf.write(f'{prompt}_edit.wav', audio, sr)
```
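
If you are running this in a Jupyter notebook, you can also audition the result inline instead of opening the saved file; a minimal sketch, assuming `audio` and `sr` are the waveform array and sample rate returned by the calls above:

```python
# Play the generated waveform inline in a notebook.
from IPython.display import Audio

Audio(audio, rate=sr)
```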

## Training

#### Autoencoder
Refer to the VAE training section in our work [SoloAudio](https://github.com/WangHelin1997/SoloAudio).

#### T2A Diffusion Model
Prepare your data (see the example in `src/dataset/meta_example.csv`), then run:

```bash
cd src
accelerate launch train.py
```
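
The authoritative metadata format is the one shipped in `src/dataset/meta_example.csv`; the snippet below only illustrates assembling such a file with pandas, and the column names used here are hypothetical placeholders rather than the repository's actual schema.

```python
# Illustrative only: build a training metadata CSV.
# Column names are hypothetical; copy the real header from src/dataset/meta_example.csv.
import pandas as pd

meta = pd.DataFrame([
    {"audio_path": "data/dog_bark.wav", "caption": "a dog barking in the distance"},
    {"audio_path": "data/train_horn.wav", "caption": "a train passes by, blowing its horns"},
])
meta.to_csv("meta.csv", index=False)
```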

## Todo
- [x] Release Gradio Demo along with checkpoints [EzAudio Space](https://huggingface.co/spaces/OpenSound/EzAudio)
- [x] Release ControlNet Demo along with checkpoints [EzAudio ControlNet Space](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet)
- [x] Release inference code
- [x] Release training pipeline and dataset
- [x] Improve API and support automatic ckpts downloading
- [ ] Release checkpoints for stage1 and stage2 [WIP]

## Reference

If you find the code useful for your research, please consider citing the EzAudio paper: [arXiv:2409.10819](https://arxiv.org/abs/2409.10819).

## Acknowledgement
Some codes are borrowed from or inspired by: [U-ViT](https://github.com/baofff/U-ViT), [PixArt-alpha](https://github.com/PixArt-alpha/PixArt-alpha), [Hunyuan-DiT](https://github.com/Tencent/HunyuanDiT), and [Stable Audio](https://github.com/Stability-AI/stable-audio-tools).