Spaces:

soft-boy
/

sdxs

Runtime error

File size: 5,619 Bytes

b0a43a6
affe6d7
 
b0a43a6
affe6d7
b0a43a6
affe6d7
b0a43a6
affe6d7

---
title: sdxs
app_file: demo_sketch.py
sdk: gradio
sdk_version: 3.43.1
---
<div align="center">

## SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions

[![Project](https://img.shields.io/badge/Home-Project-green?logo=Houzz&logoColor=white)](https://idkiro.github.io/sdxs)
[![Paper](https://img.shields.io/badge/arxiv-Paper-blue?logo=arxiv)](https://arxiv.org/abs/2403.16627) 
[![SDXS-512-0.9](https://img.shields.io/badge/🤗Model-512--0.9-gold)](https://huggingface.co/IDKiro/sdxs-512-0.9)
[![SDXS-512-DreamShaper](https://img.shields.io/badge/🤗Model-512--DreamShaper-gold)](https://huggingface.co/IDKiro/sdxs-512-dreamshaper)
[![SDXS-512-DreamShaper-Anime](https://img.shields.io/badge/🤗Model-512--DreamShaper--Anime-gold)](https://huggingface.co/IDKiro/sdxs-512-dreamshaper-anime)
[![SDXS-512-DreamShaper-Sketch](https://img.shields.io/badge/🤗Model-512--DreamShaper--Sketch-gold)](https://huggingface.co/IDKiro/sdxs-512-dreamshaper-sketch)
[![SDXS-512-DreamShaper-Demo](https://img.shields.io/badge/🤗Demo-Text2Image-pink)](https://huggingface.co/spaces/IDKiro/SDXS-512-DreamShaper)
[![SDXS-512-DreamShaper-Anime-Demo](https://img.shields.io/badge/🤗Demo-Text2Image--Anime-pink)](https://huggingface.co/spaces/IDKiro/SDXS-512-DreamShaper-Anime)
[![SDXS-512-DreamShaper-Sketch-Demo](https://img.shields.io/badge/🤗Demo-Sketch2Image-pink)](https://huggingface.co/spaces/IDKiro/SDXS-512-DreamShaper-Sketch)


*Yuda Song, Zehao Sun, Xuanwu Yin*

</div>

We present two models, SDXS-512 and SDXS-1024, achieving inference speeds of approximately <b>100 FPS</b> (30x faster than SD v1.5) and <b>30 FPS</b> (60x faster than SDXL) on a single GPU. Assuming the image generation time is limited to <b>1 second</b>, then SDXL can only use 16 NFEs to produce a slightly blurry image, while SDXS-1024 can generate 30 clear images. 

![](images/intro.png)

Moreover, our proposed method can also train ControlNet, offering promising applications in image-conditioned control and facilitating efficient image-to-image translation.

<p align="left" >
<img src="images\sketch.gif" width="800" />
</p>

## 🔥News

- **April 11, 2024:** [SDXS-512-DreamShaper-Anime](https://huggingface.co/IDKiro/sdxs-512-dreamshaper-anime) is released. We also create some Gradio demo on Hugging Face.
- **April 10, 2024:** [SDXS-512-DreamShaper](https://huggingface.co/IDKiro/sdxs-512-dreamshaper) and [SDXS-512-DreamShaper-Sketch](https://huggingface.co/IDKiro/sdxs-512-dreamshaper-sketch) are released. We also upload our demo code.
- **March 25, 2024:** [SDXS-512-0.9](https://huggingface.co/IDKiro/sdxs-512-0.9) is released, it is an old version of SDXS-512.

## ⚡️Demo

Create a new environment:

```sh
conda create -n sdxs
```

Activate the new environment:

```sh
conda activate sdxs
```

Install requirements:

```sh
conda install python=3.10 pytorch=2.2.1 torchvision torchaudio pytorch-cuda=11.8 xformers=0.0.25 -c pytorch -c nvidia -c xformers
pip install -r requirements.txt
```

Run text-to-image demo:

```sh
python demo.py
```

Run anime-style text-to-image (LoRA) demo:

```sh
python demo_anime.py
```

Run sketch-to-image (ControlNet) demo:

```sh
python demo_sketch.py
```

## 💡Train

I found that [DMD2](https://github.com/tianweiy/DMD2) release the training code, and its training scheme is identical to the new version of SDXS, so you can refer to it. 
Unfortunately, the SDXS training code is not allowed to be open-sourced and will most likely not be updated again.

## ✒️Method

### Model Acceleration

We train an extremely light-weight image decoder to mimic the original VAE decoder’s output through a combination of output distillation loss and GAN loss. We also leverage the block removal distillation strategy to efficiently transfer the knowledge from the original U-Net to a more compact version.

![](images/method1.png)

SDXS demonstrates efficiency far surpassing that of the base models, even achieving image generation at 100 FPS for 512x512 images and 30 FPS for 1024x1024 images on the GPU.

![](images/speed.png)

### Text-to-Image

To reduce the NFEs, we suggest straightening the sampling trajectory and quickly finetuning the multi-step model into a one-step model by replacing the distillation loss function with the proposed feature matching loss. Then, we extend the Diff-Instruct training strategy, using the gradient of the proposed feature matching loss to replace the gradient provided by score distillation in the latter half of the timestep.

![](images/method2.png)

Despite a noticeable downsizing in both the sizes of the models and the number of sampling steps required, the prompt-following capability of SDXS-512 remains superior to that of SD v1.5. This observation is consistently validated in the performance of SDXS-1024 as well.  

![](images/imgs.png)

### Image-to-Image

We extend our proposed training strategy to the training of ControlNet, relying on adding the pretrained ControlNet to the score function. 

![](images/method3.png)

We demonstrate its efficacy in facilitating image-to-image conversions utilizing ControlNet, specifically for transformations involving canny edges and depth maps.

![](images/control_imgs.png)


## Citation

If you find this work useful for your research, please cite our paper:

```bibtex
@article{song2024sdxs,
  author    = {Yuda Song, Zehao Sun, Xuanwu Yin},
  title     = {SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions},
  journal   = {arxiv},
  year      = {2024},
}
```

**Acknowledgment**: the demo code is based on https://github.com/GaParmar/img2img-turbo.