train latent dm with pre-trained vae from hf hub
Files changed:
- README.md (+10 -1)
- scripts/train_unconditional.py (+14 -3)
README.md (CHANGED)

````diff
@@ -119,11 +119,13 @@ accelerate launch --config_file config/accelerate_sagemaker.yaml \
 --lr_warmup_steps 500 \
 --mixed_precision no
 ```
+
 ## DDIM ([De-noising Diffusion Implicit Models](https://arxiv.org/pdf/2010.02502.pdf))
 #### A DDIM can be trained by adding the parameter
 ```bash
 --scheduler ddim
 ```
+
 Inference can then be run with far fewer steps than the number used for training (e.g., ~50), allowing for much faster generation. Without retraining, the parameter `eta` can be used to replicate a DDPM if it is set to 1 or a DDIM if it is set to 0, with all values in between being valid. When `eta` is 0 (the default value), the de-noising procedure is deterministic, which means that it can be run in reverse as a kind of encoder that recovers the original noise used in generation. A function `encode` has been added to `AudioDiffusionPipeline` for this purpose. It is then possible to interpolate between audios in the latent "noise" space using the function `slerp` (Spherical Linear intERPolation).
 
 ## Latent Audio Diffusion
@@ -131,7 +133,14 @@ Rather than de-noising images directly, it is interesting to work in the "latent
 
 At the time of writing, the Hugging Face `diffusers` library is geared towards inference and lacking in training functionality (rather like its cousin `transformers` in the early days of development). In order to train a VAE (Variational AutoEncoder), I use the [stable-diffusion](https://github.com/CompVis/stable-diffusion) repo from CompVis and convert the checkpoints to `diffusers` format. Note that it uses a perceptual loss function for images; it would be nice to try a perceptual *audio* loss function.
 
-####
+#### Train latent diffusion model using pre-trained VAE.
+```bash
+accelerate launch ...
+...
+--vae teticio/latent-audio-diffusion-256
+```
+
+#### Install dependencies to train with Stable Diffusion.
 ```
 pip install omegaconf pytorch_lightning
 pip install -e git+https://github.com/CompVis/stable-diffusion.git@main#egg=latent-diffusion
````
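As a rough illustration of the `encode`/`slerp` workflow described in the README text above, the sketch below interpolates between two existing spectrograms via the deterministic DDIM noise space. It is only a sketch: the import path, checkpoint name, file names and the exact signatures of `encode`, `slerp` and the pipeline call are assumptions inferred from the description, not the definitive API.

```python
from PIL import Image

# Import path, checkpoint and signatures below are illustrative assumptions.
from audiodiffusion import AudioDiffusionPipeline

pipe = AudioDiffusionPipeline.from_pretrained("teticio/audio-diffusion-ddim-256")

# Two existing mel-spectrogram images to interpolate between.
image1 = Image.open("spectrogram1.png")
image2 = Image.open("spectrogram2.png")

# With eta=0 the reverse process is deterministic, so `encode` can recover
# the noise that would have generated each spectrogram.
noise1 = pipe.encode([image1], steps=50)
noise2 = pipe.encode([image2], steps=50)

# Spherical linear interpolation half-way between the two noise tensors.
noise = pipe.slerp(noise1, noise2, 0.5)

# De-noise the interpolated latent with ~50 DDIM steps (eta defaults to 0).
output = pipe(noise=noise, steps=50, eta=0)
```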
scripts/train_unconditional.py (CHANGED)

```diff
@@ -11,6 +11,7 @@ from accelerate.logging import get_logger
 from datasets import load_from_disk, load_dataset
 from diffusers import (DiffusionPipeline, DDPMScheduler, UNet2DModel,
                        DDIMScheduler, AutoencoderKL)
+from diffusers.modeling_utils import EntryNotFoundError
 from diffusers.hub_utils import init_git_repo, push_to_hub
 from diffusers.optimization import get_scheduler
 from diffusers.training_utils import EMAModel
@@ -85,7 +86,11 @@ def main(args):
 
     vqvae = None
     if args.vae is not None:
-        vqvae = AutoencoderKL.from_pretrained(args.vae)
+        try:
+            vqvae = AutoencoderKL.from_pretrained(args.vae)
+        except EnvironmentError:
+            vqvae = LatentAudioDiffusionPipeline.from_pretrained(
+                args.vae).vqvae
         # Determine latent resolution
         with torch.no_grad():
             latent_resolution = vqvae.encode(
@@ -93,10 +98,16 @@ def main(args):
                 resolution)).latent_dist.sample().shape[2:]
 
     if args.from_pretrained is not None:
-        pipeline = ...
+        pipeline = {
+            'LatentAudioDiffusionPipeline': LatentAudioDiffusionPipeline,
+            'AudioDiffusionPipeline': AudioDiffusionPipeline
+        }.get(
+            DiffusionPipeline.get_config_dict(
+                args.from_pretrained)['_class_name'], AudioDiffusionPipeline)
+        pipeline = pipeline.from_pretrained(args.from_pretrained)
         model = pipeline.unet
         if hasattr(pipeline, 'vqvae'):
-            vqvae = ...
+            vqvae = pipeline.vqvae
         else:
             model = UNet2DModel(
                 sample_size=resolution if vqvae is None else latent_resolution,
```
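Read together, the changes to `scripts/train_unconditional.py` amount to the logic sketched below. This is a hedged, self-contained sketch rather than the script itself: the `audiodiffusion` import path, the `load_models` wrapper, the dummy tensor used to probe the latent resolution and the trailing `UNet2DModel` arguments are assumptions filled in for illustration.

```python
import torch
from diffusers import AutoencoderKL, DiffusionPipeline, UNet2DModel

# Pipeline classes defined in this repository; the import path is assumed here.
from audiodiffusion import AudioDiffusionPipeline, LatentAudioDiffusionPipeline


def load_models(args, resolution):
    """Sketch of the VAE / pipeline loading introduced by this commit."""
    vqvae = None
    latent_resolution = None
    if args.vae is not None:
        try:
            # --vae may point directly at a VAE checkpoint on the Hub...
            vqvae = AutoencoderKL.from_pretrained(args.vae)
        except EnvironmentError:
            # ...or at a full latent audio diffusion pipeline, in which case
            # the VAE is taken from the pipeline.
            vqvae = LatentAudioDiffusionPipeline.from_pretrained(args.vae).vqvae

        # The UNet operates at the latent resolution, probed here by encoding
        # a dummy image of the training resolution (shape is an assumption).
        with torch.no_grad():
            latent_resolution = vqvae.encode(
                torch.zeros((1, 1, resolution, resolution))
            ).latent_dist.sample().shape[2:]

    if args.from_pretrained is not None:
        # Resume from a pre-trained pipeline, picking the pipeline class
        # recorded in its config and defaulting to AudioDiffusionPipeline.
        pipeline_cls = {
            'LatentAudioDiffusionPipeline': LatentAudioDiffusionPipeline,
            'AudioDiffusionPipeline': AudioDiffusionPipeline,
        }.get(
            DiffusionPipeline.get_config_dict(
                args.from_pretrained)['_class_name'],
            AudioDiffusionPipeline)
        pipeline = pipeline_cls.from_pretrained(args.from_pretrained)
        model = pipeline.unet
        if hasattr(pipeline, 'vqvae'):
            vqvae = pipeline.vqvae
    else:
        # Train from scratch at the image resolution, or at the latent
        # resolution when a VAE is used.
        model = UNet2DModel(
            sample_size=resolution if vqvae is None else latent_resolution,
            # ... remaining UNet2DModel arguments as in the script ...
        )
    return model, vqvae
```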