train latent dm with pre-trained vae from hf hub
Files changed:
- README.md (+10 -1)
- scripts/train_unconditional.py (+14 -3)
README.md (CHANGED)

````diff
@@ -119,11 +119,13 @@ accelerate launch --config_file config/accelerate_sagemaker.yaml \
 --lr_warmup_steps 500 \
 --mixed_precision no
 ```
+
 ## DDIM ([De-noising Diffusion Implicit Models](https://arxiv.org/pdf/2010.02502.pdf))
 #### A DDIM can be trained by adding the parameter
 ```bash
 --scheduler ddim
 ```
+
 Inference can then be run with far fewer steps than the number used for training (e.g., ~50), allowing for much faster generation. Without retraining, the parameter `eta` can be used to replicate a DDPM if it is set to 1 or a DDIM if it is set to 0, with all values in between being valid. When `eta` is 0 (the default value), the de-noising procedure is deterministic, which means that it can be run in reverse as a kind of encoder that recovers the original noise used in generation. A function `encode` has been added to `AudioDiffusionPipeline` for this purpose. It is then possible to interpolate between audios in the latent "noise" space using the function `slerp` (Spherical Linear intERPolation).
 
 ## Latent Audio Diffusion
@@ -131,7 +133,14 @@ Rather than de-noising images directly, it is interesting to work in the "latent
 
 At the time of writing, the Hugging Face `diffusers` library is geared towards inference and lacking in training functionality (rather like its cousin `transformers` in the early days of development). In order to train a VAE (Variational AutoEncoder), I use the [stable-diffusion](https://github.com/CompVis/stable-diffusion) repo from CompVis and convert the checkpoints to `diffusers` format. Note that it uses a perceptual loss function for images; it would be nice to try a perceptual *audio* loss function.
 
-####
+#### Train latent diffusion model using pre-trained VAE.
+```bash
+accelerate launch ...
+...
+--vae teticio/latent-audio-diffusion-256
+```
+
+#### Install dependencies to train with Stable Diffusion.
 ```
 pip install omegaconf pytorch_lightning
 pip install -e git+https://github.com/CompVis/stable-diffusion.git@main#egg=latent-diffusion
````
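As a rough illustration of the `encode`/`slerp` workflow described in the README text above, the sketch below interpolates between two existing spectrograms via the deterministic DDIM noise space. It is only a sketch: the import path, checkpoint name, file names and the exact signatures of `encode`, `slerp` and the pipeline call are assumptions inferred from the description, not the definitive API.

```python
from PIL import Image

# Import path, checkpoint and signatures below are illustrative assumptions.
from audiodiffusion import AudioDiffusionPipeline

pipe = AudioDiffusionPipeline.from_pretrained("teticio/audio-diffusion-ddim-256")

# Two existing mel-spectrogram images to interpolate between.
image1 = Image.open("spectrogram1.png")
image2 = Image.open("spectrogram2.png")

# With eta=0 the reverse process is deterministic, so `encode` can recover
# the noise that would have generated each spectrogram.
noise1 = pipe.encode([image1], steps=50)
noise2 = pipe.encode([image2], steps=50)

# Spherical linear interpolation half-way between the two noise tensors.
noise = pipe.slerp(noise1, noise2, 0.5)

# De-noise the interpolated latent with ~50 DDIM steps (eta defaults to 0).
output = pipe(noise=noise, steps=50, eta=0)
```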
scripts/train_unconditional.py (CHANGED)

```diff
@@ -11,6 +11,7 @@ from accelerate.logging import get_logger
 from datasets import load_from_disk, load_dataset
 from diffusers import (DiffusionPipeline, DDPMScheduler, UNet2DModel,
                        DDIMScheduler, AutoencoderKL)
+from diffusers.modeling_utils import EntryNotFoundError
 from diffusers.hub_utils import init_git_repo, push_to_hub
 from diffusers.optimization import get_scheduler
 from diffusers.training_utils import EMAModel
@@ -85,7 +86,11 @@ def main(args):
 
     vqvae = None
     if args.vae is not None:
-        vqvae = AutoencoderKL.from_pretrained(args.vae)
+        try:
+            vqvae = AutoencoderKL.from_pretrained(args.vae)
+        except EnvironmentError:
+            vqvae = LatentAudioDiffusionPipeline.from_pretrained(
+                args.vae).vqvae
         # Determine latent resolution
         with torch.no_grad():
             latent_resolution = vqvae.encode(
@@ -93,10 +98,16 @@ def main(args):
                 resolution)).latent_dist.sample().shape[2:]
 
     if args.from_pretrained is not None:
-        pipeline = ...
+        pipeline = {
+            'LatentAudioDiffusionPipeline': LatentAudioDiffusionPipeline,
+            'AudioDiffusionPipeline': AudioDiffusionPipeline
+        }.get(
+            DiffusionPipeline.get_config_dict(
+                args.from_pretrained)['_class_name'], AudioDiffusionPipeline)
+        pipeline = pipeline.from_pretrained(args.from_pretrained)
         model = pipeline.unet
         if hasattr(pipeline, 'vqvae'):
-            vqvae = ...
+            vqvae = pipeline.vqvae
         else:
             model = UNet2DModel(
                 sample_size=resolution if vqvae is None else latent_resolution,
```
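Read together, the changes to `scripts/train_unconditional.py` amount to the logic sketched below. This is a hedged, self-contained sketch rather than the script itself: the `audiodiffusion` import path, the `load_models` wrapper, the dummy tensor used to probe the latent resolution and the trailing `UNet2DModel` arguments are assumptions filled in for illustration.

```python
import torch
from diffusers import AutoencoderKL, DiffusionPipeline, UNet2DModel

# Pipeline classes defined in this repository; the import path is assumed here.
from audiodiffusion import AudioDiffusionPipeline, LatentAudioDiffusionPipeline


def load_models(args, resolution):
    """Sketch of the VAE / pipeline loading introduced by this commit."""
    vqvae = None
    latent_resolution = None
    if args.vae is not None:
        try:
            # --vae may point directly at a VAE checkpoint on the Hub...
            vqvae = AutoencoderKL.from_pretrained(args.vae)
        except EnvironmentError:
            # ...or at a full latent audio diffusion pipeline, in which case
            # the VAE is taken from the pipeline.
            vqvae = LatentAudioDiffusionPipeline.from_pretrained(args.vae).vqvae

        # The UNet operates at the latent resolution, probed here by encoding
        # a dummy image of the training resolution (shape is an assumption).
        with torch.no_grad():
            latent_resolution = vqvae.encode(
                torch.zeros((1, 1, resolution, resolution))
            ).latent_dist.sample().shape[2:]

    if args.from_pretrained is not None:
        # Resume from a pre-trained pipeline, picking the pipeline class
        # recorded in its config and defaulting to AudioDiffusionPipeline.
        pipeline_cls = {
            'LatentAudioDiffusionPipeline': LatentAudioDiffusionPipeline,
            'AudioDiffusionPipeline': AudioDiffusionPipeline,
        }.get(
            DiffusionPipeline.get_config_dict(
                args.from_pretrained)['_class_name'],
            AudioDiffusionPipeline)
        pipeline = pipeline_cls.from_pretrained(args.from_pretrained)
        model = pipeline.unet
        if hasattr(pipeline, 'vqvae'):
            vqvae = pipeline.vqvae
    else:
        # Train from scratch at the image resolution, or at the latent
        # resolution when a VAE is used.
        model = UNet2DModel(
            sample_size=resolution if vqvae is None else latent_resolution,
            # ... remaining UNet2DModel arguments as in the script ...
        )
    return model, vqvae
```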