Spaces: svjack/Wuerstchen

svjack committed on Nov 21, 2023

Commit 148c316

1 Parent(s): 8076d6f

Upload folder using huggingface_hub

Files changed (18)

wuerstchen/.gitattributes +35 -0
wuerstchen/README.md +90 -0
wuerstchen/decoder/config.json +43 -0
wuerstchen/decoder/diffusion_pytorch_model.bin +3 -0
wuerstchen/decoder/diffusion_pytorch_model.safetensors +3 -0
wuerstchen/model_index.json +25 -0
wuerstchen/scheduler/scheduler_config.json +6 -0
wuerstchen/text_encoder/config.json +25 -0
wuerstchen/text_encoder/model.safetensors +3 -0
wuerstchen/text_encoder/pytorch_model.bin +3 -0
wuerstchen/tokenizer/merges.txt +0 -0
wuerstchen/tokenizer/special_tokens_map.json +24 -0
wuerstchen/tokenizer/tokenizer.json +0 -0
wuerstchen/tokenizer/tokenizer_config.json +33 -0
wuerstchen/tokenizer/vocab.json +0 -0
wuerstchen/vqgan/config.json +13 -0
wuerstchen/vqgan/diffusion_pytorch_model.bin +3 -0
wuerstchen/vqgan/diffusion_pytorch_model.safetensors +3 -0

wuerstchen/.gitattributes ADDED

	@@ -0,0 +1,35 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
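
The patterns above route matching files through Git LFS. As a rough illustration, the sketch below checks which of this commit's files would match. It uses Python's `fnmatch` on the file's basename as a stand-in for Git's own pattern matching. That is an assumption: real gitattributes rules differ in details such as `**` and directory handling. The pattern subset and the `is_lfs_tracked` helper are hypothetical names used only for this example.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# A subset of the patterns from the .gitattributes above (illustration only).
LFS_PATTERNS = ["*.bin", "*.safetensors", "*.ckpt", "*.pt", "*.pth", "*tfevents*"]


def is_lfs_tracked(path, patterns=LFS_PATTERNS):
    """Approximate check: match the file's basename against each pattern."""
    name = PurePosixPath(path).name
    return any(fnmatch(name, pattern) for pattern in patterns)


files = [
    "wuerstchen/decoder/diffusion_pytorch_model.safetensors",
    "wuerstchen/decoder/config.json",
    "wuerstchen/text_encoder/pytorch_model.bin",
    "wuerstchen/tokenizer/vocab.json",
]
tracked = {path: is_lfs_tracked(path) for path in files}
for path, flag in tracked.items():
    print(("LFS" if flag else "git"), path)
```

This is consistent with the diff itself: the `.bin` and `.safetensors` entries show three-line LFS pointers, while the JSON files are stored as regular text.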

wuerstchen/README.md ADDED

	@@ -0,0 +1,90 @@

+---
+license: mit
+prior:
+- warp-diffusion/wuerstchen-prior
+tags:
+- text-to-image
+- wuerstchen
+---
+<img src="https://cdn-uploads.huggingface.co/production/uploads/634cb5eefb80cc6bcaf63c3e/i-DYpDHw8Pwiy7QBKZVR5.jpeg" width=1500>
+## Würstchen - Overview
+Würstchen is a diffusion model whose text-conditional component works in a highly compressed latent space of images. Why is this important? Compressing data can reduce
+computational costs for both training and inference by orders of magnitude. Training on 1024x1024 images is far more expensive than training at 32x32. Other works usually
+use a relatively small compression, in the range of 4x - 8x spatial compression. Würstchen takes this to an extreme. Through its novel design, we achieve a 42x spatial
+compression. This had not been seen before because common methods fail to faithfully reconstruct detailed images beyond 16x spatial compression. Würstchen employs a
+two-stage compression, which we call Stage A and Stage B. Stage A is a VQGAN, and Stage B is a Diffusion Autoencoder (more details can be found in the [paper](https://arxiv.org/abs/2306.00637)).
+A third model, Stage C, is learned in that highly compressed latent space. This training requires a fraction of the compute used for current top-performing models, while also
+allowing cheaper and faster inference.
+## Würstchen - Decoder
+The Decoder is what we refer to as "Stage A" and "Stage B". The decoder takes in image embeddings, either generated by the Prior (Stage C) or extracted from a real image, and decodes those latents back into pixel space. Specifically, Stage B first decodes the image embeddings into the VQGAN space, and Stage A (which is a VQGAN)
+decodes the latents into pixel space. Together, they achieve a spatial compression of 42.
+**Note:** The reconstruction is lossy, so some image information is lost. The current Stage B often lacks detail in its reconstructions, which is especially noticeable to
+us humans when looking at faces, hands, etc. We are working on making these reconstructions even better in the future!
+### Image Sizes
+Würstchen was trained on image resolutions between 1024x1024 and 1536x1536. We sometimes also observe good outputs at resolutions like 1024x2048. Feel free to try it out.
+We also observed that the Prior (Stage C) adapts extremely quickly to new resolutions, so finetuning it at 2048x2048 should be computationally cheap.
+<img src="https://cdn-uploads.huggingface.co/production/uploads/634cb5eefb80cc6bcaf63c3e/5pA5KUfGmvsObqiIjdGY1.jpeg" width=1000>
+## How to run
+This pipeline should be run together with a prior https://huggingface.co/warp-ai/wuerstchen-prior:
+```py
+import torch
+from diffusers import AutoPipelineForText2Image
+# Run on the GPU in half precision to reduce memory use.
+device = "cuda"
+dtype = torch.float16
+# Load the text-to-image pipeline and move it to the GPU.
+pipeline = AutoPipelineForText2Image.from_pretrained(
+    "warp-diffusion/wuerstchen", torch_dtype=dtype
+).to(device)
+caption = "Anthropomorphic cat dressed as a fire fighter"
+# Guidance is applied in the prior only; the decoder runs without it.
+output = pipeline(
+    prompt=caption,
+    height=1024,
+    width=1024,
+    prior_guidance_scale=4.0,
+    decoder_guidance_scale=0.0,
+).images
+```
+### Image Sampling Times
+The figure below compares inference times (on an A100) for different batch sizes (`num_images_per_prompt`) on Würstchen and [Stable Diffusion XL](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) (without refiner).
+The left panel shows inference times (using torch > 2.0), whereas the right panel applies `torch.compile` to both pipelines in advance.
+![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/634cb5eefb80cc6bcaf63c3e/UPhsIH2f079ZuTA_sLdVe.jpeg)
+## Model Details
+- **Developed by:** Pablo Pernias, Dominic Rampas
+- **Model type:** Diffusion-based text-to-image generation model
+- **Language(s):** English
+- **License:** MIT
+- **Model Description:** This is a model that can be used to generate and modify images based on text prompts. It is a Diffusion model in the style of Stage C from the [Würstchen paper](https://arxiv.org/abs/2306.00637) that uses a fixed, pretrained text encoder ([CLIP ViT-bigG/14](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)).
+- **Resources for more information:** [GitHub Repository](https://github.com/dome272/Wuerstchen), [Paper](https://arxiv.org/abs/2306.00637).
+- **Cite as:**
+      @misc{pernias2023wuerstchen,
+            title={Wuerstchen: Efficient Pretraining of Text-to-Image Models},
+            author={Pablo Pernias and Dominic Rampas and Mats L. Richter and Christopher Pal and Marc Aubreville},
+            year={2023},
+            eprint={2306.00637},
+            archivePrefix={arXiv},
+            primaryClass={cs.CV}
+      }
+## Environmental Impact
+**Würstchen v2** **Estimated Emissions**
+We estimate the following CO2 emissions using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The hardware, runtime, cloud provider, and compute region were used to estimate the carbon impact.
+- **Hardware Type:** A100 PCIe 40GB
+- **Hours used:** 24602
+- **Cloud Provider:** AWS
+- **Compute Region:** US-east
+- **Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid):** 2275.68 kg CO2 eq.
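
The emissions figure in the README follows the stated formula: power consumption x time x carbon intensity of the grid. The sketch below reproduces that arithmetic. Only the hours and the reported total come from the README. The 250 W board power and the 0.37 kg CO2 eq. per kWh grid intensity are assumptions, chosen because they are plausible for the listed hardware and region and because they reproduce the reported number. The README does not state them.

```python
# Values stated in the README.
HOURS_USED = 24602
REPORTED_KG_CO2 = 2275.68

# Assumptions (not stated in the README): board power of an A100 PCIe 40GB
# and a grid carbon intensity for the US-east region.
ASSUMED_POWER_KW = 0.25
ASSUMED_KG_CO2_PER_KWH = 0.37


def estimate_emissions_kg(hours, power_kw, kg_co2_per_kwh):
    """Power consumption x time x carbon intensity of the power grid."""
    energy_kwh = hours * power_kw
    return energy_kwh * kg_co2_per_kwh


estimate = estimate_emissions_kg(HOURS_USED, ASSUMED_POWER_KW, ASSUMED_KG_CO2_PER_KWH)
print(f"estimated: {estimate:.2f} kg CO2 eq. (reported: {REPORTED_KG_CO2})")
```

Under these assumptions the estimate lands within a rounding error of the reported value, which suggests the README figure was computed the same way.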

wuerstchen/decoder/config.json ADDED

	@@ -0,0 +1,43 @@

+{
+  "_class_name": "WuerstchenDiffNeXt",
+  "_diffusers_version": "0.21.0.dev0",
+  "blocks": [
+    4,
+    4,
+    14,
+    4
+  ],
+  "c_cond": 1024,
+  "c_hidden": [
+    320,
+    640,
+    1280,
+    1280
+  ],
+  "c_in": 4,
+  "c_out": 4,
+  "c_r": 64,
+  "clip_embd": 1024,
+  "dropout": 0.1,
+  "effnet_embd": 16,
+  "inject_effnet": [
+    false,
+    true,
+    true,
+    true
+  ],
+  "kernel_size": 3,
+  "level_config": [
+    "CT",
+    "CTA",
+    "CTA",
+    "CTA"
+  ],
+  "nhead": [
+    -1,
+    10,
+    20,
+    20
+  ],
+  "patch_size": 2
+}
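
The per-level lists in this config are easier to read side by side. The sketch below zips them into one row per level and derives the attention head width. Reading `"A"` in `level_config` as "this level has attention blocks" and `nhead: -1` as "no attention" is an interpretation of the config, not something this commit confirms. `summarize_levels` is a hypothetical helper.

```python
import json

# The per-level fields copied from decoder/config.json above.
CONFIG_JSON = """
{
  "blocks": [4, 4, 14, 4],
  "c_hidden": [320, 640, 1280, 1280],
  "inject_effnet": [false, true, true, true],
  "level_config": ["CT", "CTA", "CTA", "CTA"],
  "nhead": [-1, 10, 20, 20]
}
"""
config = json.loads(CONFIG_JSON)


def summarize_levels(cfg):
    """Combine the parallel per-level lists into one dict per level."""
    per_level = zip(
        cfg["blocks"], cfg["c_hidden"], cfg["level_config"],
        cfg["nhead"], cfg["inject_effnet"],
    )
    rows = []
    for index, (n_blocks, width, kinds, heads, effnet) in enumerate(per_level):
        # Assumption: "A" marks attention, and a non-positive nhead means none.
        has_attention = "A" in kinds and heads > 0
        rows.append({
            "level": index,
            "blocks": n_blocks,
            "width": width,
            "attention": has_attention,
            "head_dim": width // heads if has_attention else None,
            "effnet": effnet,
        })
    return rows


levels = summarize_levels(config)
for row in levels:
    print(row)
```

Under this reading, every level that has attention uses the same per-head width, and the first level is the only one without attention or EfficientNet injection.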

wuerstchen/decoder/diffusion_pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b2e99829fe0a2c946ec6b4ef6979aee78bfaa05f87b0cf7b80ecafa20272ef60
+size 4221843094
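
The three lines above are a Git LFS pointer, not the weights themselves: the repository stores only the spec version, a SHA-256 object id, and the byte size. A minimal sketch of reading such a pointer follows. `parse_lfs_pointer` is a hypothetical helper, and the parser assumes the simple `key value` layout seen in this diff.

```python
# The pointer text shown in the diff above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:b2e99829fe0a2c946ec6b4ef6979aee78bfaa05f87b0cf7b80ecafa20272ef60
size 4221843094
"""


def parse_lfs_pointer(text):
    """Split each 'key value' line and decode the oid and size fields."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
print(f"{pointer['hash_algo']} object, {pointer['size'] / 2**30:.2f} GiB")
```

The same parser applies to the other `.bin` and `.safetensors` entries in this commit, which use the same three-line layout.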

wuerstchen/decoder/diffusion_pytorch_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1510c2cc1a891df02d61d79866c40c506e9099519829e0282c2a79d7e9c7e66f
+size 4221568336

wuerstchen/model_index.json ADDED

	@@ -0,0 +1,25 @@

+{
+  "_class_name": "WuerstchenDecoderPipeline",
+  "_diffusers_version": "0.21.0.dev0",
+  "decoder": [
+    "wuerstchen",
+    "WuerstchenDiffNeXt"
+  ],
+  "latent_dim_scale": 10.67,
+  "scheduler": [
+    "diffusers",
+    "DDPMWuerstchenScheduler"
+  ],
+  "text_encoder": [
+    "transformers",
+    "CLIPTextModel"
+  ],
+  "tokenizer": [
+    "transformers",
+    "CLIPTokenizerFast"
+  ],
+  "vqgan": [
+    "wuerstchen",
+    "PaellaVQModel"
+  ]
+}
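
The `latent_dim_scale` of 10.67 ties in with the 42x spatial compression quoted in the README. If Stage A (the VQGAN) downsamples by 4x per side, then 4 x 10.67 is roughly 42.7. The sketch below works through the resulting latent sizes. The 4x VQGAN factor and the rounding choices (`ceil` for the prior latent, `int` for the Stage B latent) are assumptions made for illustration. `latent_sizes` is a hypothetical helper.

```python
import math

LATENT_DIM_SCALE = 10.67   # from model_index.json above
ASSUMED_VQGAN_FACTOR = 4   # assumption: Stage A downsamples 4x per side


def latent_sizes(height, width):
    """Return (h, w) of the prior latent, the Stage B latent, and the decoded image."""
    total_compression = LATENT_DIM_SCALE * ASSUMED_VQGAN_FACTOR  # about 42.7
    prior = (math.ceil(height / total_compression),
             math.ceil(width / total_compression))
    stage_b = (int(prior[0] * LATENT_DIM_SCALE),
               int(prior[1] * LATENT_DIM_SCALE))
    pixels = (stage_b[0] * ASSUMED_VQGAN_FACTOR,
              stage_b[1] * ASSUMED_VQGAN_FACTOR)
    return {"prior": prior, "stage_b": stage_b, "pixels": pixels}


for side in (1024, 1536):
    print(side, latent_sizes(side, side))
```

Under these assumptions a 1024x1024 image corresponds to a 24x24 prior latent, which Stage B expands to 256x256 before Stage A decodes back to 1024x1024.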

wuerstchen/scheduler/scheduler_config.json ADDED

	@@ -0,0 +1,6 @@

+{
+  "_class_name": "DDPMWuerstchenScheduler",
+  "_diffusers_version": "0.21.0.dev0",
+  "s": 0.008,
+  "scaler": 1.0
+}
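
The value `s = 0.008` is the usual small offset in a cosine noise schedule. The sketch below shows such a schedule in plain Python, with a `scaler` that warps the time axis and a clamp that keeps values away from 0 and 1. Whether `DDPMWuerstchenScheduler` applies `s` and `scaler` in exactly this way is an assumption. Treat this as an illustration of the technique and check the diffusers source for the actual behavior.

```python
import math

S = 0.008      # from scheduler_config.json above
SCALER = 1.0   # from scheduler_config.json above


def alpha_cumprod(t, s=S, scaler=SCALER):
    """Cosine schedule: cumulative signal level at continuous time t in [0, 1]."""
    # Assumed time warp: scaler != 1 bends the schedule, scaler == 1 leaves t as is.
    if scaler > 1:
        t = 1 - (1 - t) ** scaler
    elif scaler < 1:
        t = t ** scaler
    # Normalize so that the unclamped value at t = 0 is exactly 1.
    start = math.cos(s / (1 + s) * math.pi / 2) ** 2
    value = math.cos((t + s) / (1 + s) * math.pi / 2) ** 2 / start
    # Clamp to avoid degenerate values at both ends.
    return min(max(value, 0.0001), 0.9999)


schedule = [alpha_cumprod(i / 10) for i in range(11)]
for i, value in enumerate(schedule):
    print(f"t={i / 10:.1f}  alpha_cumprod={value:.4f}")
```

The schedule decreases monotonically from almost pure signal at `t = 0` to almost pure noise at `t = 1`.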

wuerstchen/text_encoder/config.json ADDED

	@@ -0,0 +1,25 @@

+{
+  "_name_or_path": "laion/CLIP-ViT-H-14-laion2B-s32B-b79K",
+  "architectures": [
+    "CLIPTextModel"
+  ],
+  "attention_dropout": 0.0,
+  "bos_token_id": 0,
+  "dropout": 0.0,
+  "eos_token_id": 2,
+  "hidden_act": "gelu",
+  "hidden_size": 1024,
+  "initializer_factor": 1.0,
+  "initializer_range": 0.02,
+  "intermediate_size": 4096,
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 77,
+  "model_type": "clip_text_model",
+  "num_attention_heads": 16,
+  "num_hidden_layers": 24,
+  "pad_token_id": 1,
+  "projection_dim": 1024,
+  "torch_dtype": "float32",
+  "transformers_version": "4.33.0.dev0",
+  "vocab_size": 49408
+}
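
The config values above are enough to estimate the text encoder's parameter count and compare it with the `model.safetensors` pointer in this commit (1411983168 bytes). The sketch assumes a standard transformer layout: biases on every linear layer, two LayerNorms per block, a final LayerNorm, and no projection head. `clip_text_param_count` is a hypothetical helper.

```python
def clip_text_param_count(vocab_size, max_positions, hidden, intermediate, layers):
    """Count parameters of a plain transformer text encoder (assumed layout)."""
    embeddings = vocab_size * hidden + max_positions * hidden
    attention = 4 * (hidden * hidden + hidden)              # q, k, v, out projections
    mlp = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    layer_norms = 2 * (2 * hidden)                          # weight + bias, twice per block
    final_norm = 2 * hidden
    return embeddings + layers * (attention + mlp + layer_norms) + final_norm


# Values from text_encoder/config.json above.
params = clip_text_param_count(
    vocab_size=49408, max_positions=77, hidden=1024, intermediate=4096, layers=24
)
fp32_bytes = params * 4  # "torch_dtype": "float32" means 4 bytes per parameter

SAFETENSORS_SIZE = 1411983168  # size from the model.safetensors pointer in this commit
print(f"parameters: {params:,}")
print(f"fp32 bytes: {fp32_bytes:,}")
print(f"unaccounted bytes: {SAFETENSORS_SIZE - fp32_bytes:,}")
```

Under these assumptions the estimate falls just below the stored file size, and the small remainder is consistent with file metadata rather than extra weights.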

wuerstchen/text_encoder/model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bd94a7ea6922e8028227567fe14e04d2989eec31c482e0813e9006afea6637f1
+size 1411983168

wuerstchen/text_encoder/pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0483b11b48b0f5a5079f778c0df4057d7b797cf58ef176087ec03a236d3e16e0
+size 1412064410

wuerstchen/tokenizer/merges.txt ADDED

The diff for this file is too large to render. See raw diff

wuerstchen/tokenizer/special_tokens_map.json ADDED

	@@ -0,0 +1,24 @@

+{
+  "bos_token": {
+    "content": "<|startoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": "<|endoftext|>",
+  "unk_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}

wuerstchen/tokenizer/tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

wuerstchen/tokenizer/tokenizer_config.json ADDED

	@@ -0,0 +1,33 @@

+{
+  "add_prefix_space": false,
+  "bos_token": {
+    "__type": "AddedToken",
+    "content": "<|startoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "clean_up_tokenization_spaces": true,
+  "do_lower_case": true,
+  "eos_token": {
+    "__type": "AddedToken",
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "errors": "replace",
+  "model_max_length": 77,
+  "pad_token": "<|endoftext|>",
+  "tokenizer_class": "CLIPTokenizer",
+  "unk_token": {
+    "__type": "AddedToken",
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}

wuerstchen/tokenizer/vocab.json ADDED

The diff for this file is too large to render. See raw diff

wuerstchen/vqgan/config.json ADDED

	@@ -0,0 +1,13 @@

+{
+  "_class_name": "PaellaVQModel",
+  "_diffusers_version": "0.21.0.dev0",
+  "bottleneck_blocks": 12,
+  "embed_dim": 384,
+  "in_channels": 3,
+  "latent_channels": 4,
+  "levels": 2,
+  "num_vq_embeddings": 8192,
+  "out_channels": 3,
+  "scale_factor": 0.3764,
+  "up_down_scale_factor": 2
+}

wuerstchen/vqgan/diffusion_pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f3ab7752b474058d177e8565860367a438b8016ba788954394fbb7f1da16d6e1
+size 73674142

wuerstchen/vqgan/diffusion_pytorch_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:052db8852c0d8b117e6d2a59ae3e0c7d7aaae3d00f247e392ef8e9837e11d6c4
+size 73639568