svjack committed on
Commit
148c316
·
1 Parent(s): 8076d6f

Upload folder using huggingface_hub

wuerstchen/.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
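The attribute rules above route matching files through Git LFS. As a rough illustration of which files in this commit they catch, here is a minimal Python sketch. It assumes that `fnmatch` on the bare file name is a close enough stand-in for gitattributes matching for these simple patterns (it is not the full gitattributes matching algorithm, and it ignores the directory pattern `saved_model/**/*`). The helper name `is_lfs_tracked` and the pattern subset are chosen here for illustration only.

```python
from fnmatch import fnmatch

# A subset of the patterns from the .gitattributes diff above.
LFS_PATTERNS = ["*.bin", "*.safetensors", "*.ckpt", "*.pt", "*.tar.*", "*tfevents*"]


def is_lfs_tracked(filename: str) -> bool:
    """Approximate check: does the bare file name match any LFS pattern?"""
    return any(fnmatch(filename, pattern) for pattern in LFS_PATTERNS)


for name in ["diffusion_pytorch_model.safetensors", "pytorch_model.bin", "config.json"]:
    print(name, is_lfs_tracked(name))
```

Under these assumptions the weight files are flagged as LFS-tracked while `config.json` is not, which matches the diff: weights appear as three-line pointer files and the JSON configs appear in full.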
wuerstchen/README.md ADDED
@@ -0,0 +1,90 @@
+ ---
+ license: mit
+ prior:
+ - warp-diffusion/wuerstchen-prior
+ tags:
+ - text-to-image
+ - wuerstchen
+ ---
+
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/634cb5eefb80cc6bcaf63c3e/i-DYpDHw8Pwiy7QBKZVR5.jpeg" width=1500>
+
+ ## Würstchen - Overview
+ Würstchen is a diffusion model whose text-conditional component works in a highly compressed latent space of images. Why is this important? Compressing data can reduce
+ computational costs for both training and inference by orders of magnitude. Training on 1024x1024 images is far more expensive than training at 32x32. Other works typically
+ use relatively small compression, in the range of 4x to 8x spatial compression. Würstchen takes this to an extreme: through its novel design, we achieve a 42x spatial
+ compression. This had not been seen before, because common methods fail to faithfully reconstruct detailed images beyond 16x spatial compression. Würstchen employs a
+ two-stage compression, which we call Stage A and Stage B. Stage A is a VQGAN, and Stage B is a Diffusion Autoencoder (more details can be found in the [paper](https://arxiv.org/abs/2306.00637)).
+ A third model, Stage C, is learned in that highly compressed latent space. This training requires a fraction of the compute used for current top-performing models, and it also
+ allows cheaper and faster inference.
+
+ ## Würstchen - Decoder
+ The Decoder is what we refer to as "Stage A" and "Stage B". It takes in image embeddings, either generated by the Prior (Stage C) or extracted from a real image, and decodes those latents back into pixel space. Specifically, Stage B first decodes the image embeddings into the VQGAN latent space, and Stage A (which is a VQGAN)
+ then decodes those latents into pixel space. Together, they achieve a spatial compression of 42.
+
+ **Note:** The reconstruction is lossy, so some image information is lost. The current Stage B often lacks detail in its reconstructions, which is especially noticeable to
+ humans in faces, hands, etc. We are working on making these reconstructions even better in the future!
+
+ ### Image Sizes
+ Würstchen was trained on image resolutions between 1024x1024 and 1536x1536. We sometimes also observe good outputs at resolutions like 1024x2048. Feel free to try it out.
+ We also observed that the Prior (Stage C) adapts extremely fast to new resolutions, so finetuning it at 2048x2048 should be computationally cheap.
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/634cb5eefb80cc6bcaf63c3e/5pA5KUfGmvsObqiIjdGY1.jpeg" width=1000>
+
+ ## How to run
+ This pipeline should be run together with a prior (https://huggingface.co/warp-ai/wuerstchen-prior):
+
+ ```py
+ import torch
+ from diffusers import AutoPipelineForText2Image
+
+ # Run in half precision on the GPU.
+ device = "cuda"
+ dtype = torch.float16
+
+ pipeline = AutoPipelineForText2Image.from_pretrained(
+     "warp-diffusion/wuerstchen", torch_dtype=dtype
+ ).to(device)
+
+ caption = "Anthropomorphic cat dressed as a fire fighter"
+
+ # Guidance is applied in the prior only; the decoder runs without it.
+ output = pipeline(
+     prompt=caption,
+     height=1024,
+     width=1024,
+     prior_guidance_scale=4.0,
+     decoder_guidance_scale=0.0,
+ ).images
+ ```
+
+ ### Image Sampling Times
+ The figure compares inference times (on an A100) of Würstchen and [Stable Diffusion XL](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) (without refiner) for different batch sizes (`num_images_per_prompt`).
+ The left panel shows inference times using torch > 2.0, whereas the right panel applies `torch.compile` to both pipelines in advance.
+ ![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/634cb5eefb80cc6bcaf63c3e/UPhsIH2f079ZuTA_sLdVe.jpeg)
+
+ ## Model Details
+ - **Developed by:** Pablo Pernias, Dominic Rampas
+ - **Model type:** Diffusion-based text-to-image generation model
+ - **Language(s):** English
+ - **License:** MIT
+ - **Model Description:** This is a model that can be used to generate and modify images based on text prompts. It is a Diffusion model in the style of Stage C from the [Würstchen paper](https://arxiv.org/abs/2306.00637) that uses a fixed, pretrained text encoder ([CLIP ViT-bigG/14](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)).
+ - **Resources for more information:** [GitHub Repository](https://github.com/dome272/Wuerstchen), [Paper](https://arxiv.org/abs/2306.00637).
+ - **Cite as:**
+
+       @misc{pernias2023wuerstchen,
+             title={Wuerstchen: Efficient Pretraining of Text-to-Image Models},
+             author={Pablo Pernias and Dominic Rampas and Mats L. Richter and Christopher Pal and Marc Aubreville},
+             year={2023},
+             eprint={2306.00637},
+             archivePrefix={arXiv},
+             primaryClass={cs.CV}
+       }
+
+ ## Environmental Impact
+
+ **Würstchen v2** **Estimated Emissions**
+ Based on the information listed below, we estimate the following CO2 emissions using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). The hardware, runtime, cloud provider, and compute region were used to estimate the carbon impact.
+
+ - **Hardware Type:** A100 PCIe 40GB
+ - **Hours used:** 24602
+ - **Cloud Provider:** AWS
+ - **Compute Region:** US-east
+ - **Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid):** 2275.68 kg CO2 eq.
wuerstchen/decoder/config.json ADDED
@@ -0,0 +1,43 @@
+ {
+   "_class_name": "WuerstchenDiffNeXt",
+   "_diffusers_version": "0.21.0.dev0",
+   "blocks": [
+     4,
+     4,
+     14,
+     4
+   ],
+   "c_cond": 1024,
+   "c_hidden": [
+     320,
+     640,
+     1280,
+     1280
+   ],
+   "c_in": 4,
+   "c_out": 4,
+   "c_r": 64,
+   "clip_embd": 1024,
+   "dropout": 0.1,
+   "effnet_embd": 16,
+   "inject_effnet": [
+     false,
+     true,
+     true,
+     true
+   ],
+   "kernel_size": 3,
+   "level_config": [
+     "CT",
+     "CTA",
+     "CTA",
+     "CTA"
+   ],
+   "nhead": [
+     -1,
+     10,
+     20,
+     20
+   ],
+   "patch_size": 2
+ }
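The decoder config above describes four levels through parallel lists (`blocks`, `c_hidden`, `inject_effnet`, `level_config`, `nhead`). The following is a minimal sketch of a sanity check over those lists. It rests on two assumptions that the diff itself does not state: that an `"A"` in a `level_config` entry marks a level with attention, and that `nhead` of `-1` simply means "no attention" for a level without it. The function `check_levels` is a hypothetical helper, not part of diffusers.

```python
import json

# Per-level fields copied from the decoder config.json diff above.
CONFIG_JSON = """
{
  "blocks": [4, 4, 14, 4],
  "c_hidden": [320, 640, 1280, 1280],
  "inject_effnet": [false, true, true, true],
  "level_config": ["CT", "CTA", "CTA", "CTA"],
  "nhead": [-1, 10, 20, 20]
}
"""


def check_levels(config: dict) -> list:
    """Check that per-level lists align and return the head width of each attention level."""
    per_level = ["blocks", "c_hidden", "inject_effnet", "level_config", "nhead"]
    lengths = {key: len(config[key]) for key in per_level}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"per-level lists differ in length: {lengths}")

    head_dims = []
    for width, heads, layout in zip(config["c_hidden"], config["nhead"], config["level_config"]):
        if "A" not in layout:  # assumption: no "A" means no attention at this level
            continue
        if heads <= 0 or width % heads != 0:
            raise ValueError(f"width {width} is not divisible by {heads} heads")
        head_dims.append(width // heads)
    return head_dims


head_dims = check_levels(json.loads(CONFIG_JSON))
print(head_dims)  # [64, 64, 64]
```

Under these assumptions every attention level works out to 64 channels per head, which suggests the `nhead` values were chosen to keep the head width constant as `c_hidden` grows.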
wuerstchen/decoder/diffusion_pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b2e99829fe0a2c946ec6b4ef6979aee78bfaa05f87b0cf7b80ecafa20272ef60
+ size 4221843094
wuerstchen/decoder/diffusion_pytorch_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1510c2cc1a891df02d61d79866c40c506e9099519829e0282c2a79d7e9c7e66f
+ size 4221568336
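The weight files in this commit appear only as three-line Git LFS pointers (`version`, `oid`, `size`). As a minimal sketch of how such a pointer could be used to verify a downloaded file, the Python below parses the pointer text shown above and compares a local file's size and SHA-256 digest against it. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the sketch handles only the three keys present in this diff.

```python
import hashlib

# Pointer text copied from the safetensors diff above.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:1510c2cc1a891df02d61d79866c40c506e9099519829e0282c2a79d7e9c7e66f
size 4221568336
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line and break the oid into algorithm and digest."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Stream the file in chunks and compare its size and SHA-256 digest to the pointer."""
    sha = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and sha.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algorithm"], pointer["size"])
```

Streaming in chunks matters here because the pointer describes a file of roughly 4.2 GB, which should not be read into memory in one call.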
wuerstchen/model_index.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "_class_name": "WuerstchenDecoderPipeline",
+   "_diffusers_version": "0.21.0.dev0",
+   "decoder": [
+     "wuerstchen",
+     "WuerstchenDiffNeXt"
+   ],
+   "latent_dim_scale": 10.67,
+   "scheduler": [
+     "diffusers",
+     "DDPMWuerstchenScheduler"
+   ],
+   "text_encoder": [
+     "transformers",
+     "CLIPTextModel"
+   ],
+   "tokenizer": [
+     "transformers",
+     "CLIPTokenizerFast"
+   ],
+   "vqgan": [
+     "wuerstchen",
+     "PaellaVQModel"
+   ]
+ }
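The `latent_dim_scale` of 10.67 above relates the size of the prior's image embeddings to the size of the decoder's latents. The following worked sketch shows the arithmetic under two assumptions not stated in this diff: that the prior produces a 24x24 embedding for a 1024x1024 image (consistent with the roughly 42x compression quoted in the README), and that the decoder latent size is the embedding size times the scale, truncated to an integer. The function `decoder_latent_size` is a hypothetical helper.

```python
LATENT_DIM_SCALE = 10.67  # "latent_dim_scale" from model_index.json above


def decoder_latent_size(embedding_size: int, scale: float = LATENT_DIM_SCALE) -> int:
    """Assumed rule: decoder latent side length = embedding side length * scale, truncated."""
    return int(embedding_size * scale)


image_size = 1024
embedding_size = 24  # assumption: prior embedding side length for a 1024x1024 image

latent_size = decoder_latent_size(embedding_size)
print(latent_size)                # 256
print(image_size // latent_size)  # 4
```

Read this way, the 42x figure splits into two steps: about 10.67x between the Stage C embeddings and the Stage B latents, and the remaining 4x between those latents and pixels, which would be the VQGAN's share.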
wuerstchen/scheduler/scheduler_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "_class_name": "DDPMWuerstchenScheduler",
+   "_diffusers_version": "0.21.0.dev0",
+   "s": 0.008,
+   "scaler": 1.0
+ }
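The scheduler config above carries only two numbers, `s` and `scaler`. The sketch below illustrates one plausible reading of `s`, assuming it is the small offset used in a cosine noise schedule and that a `scaler` of 1.0 leaves the time variable unchanged. The formula, the clamping bounds, and the function name `alpha_cumprod` are assumptions made for illustration; the diff does not confirm how `DDPMWuerstchenScheduler` uses these values.

```python
import math

S = 0.008  # "s" from scheduler_config.json above


def alpha_cumprod(t: float, s: float = S) -> float:
    """Cosine-schedule signal level at continuous time t in [0, 1] (assumed form)."""

    def f(u: float) -> float:
        return math.cos((u + s) / (1 + s) * math.pi / 2) ** 2

    value = f(t) / f(0.0)  # normalise so that t = 0 maps to 1 before clamping
    return min(max(value, 0.0001), 0.9999)  # assumed clamp to avoid exact 0 and 1


for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(t, round(alpha_cumprod(t), 4))
```

Under this reading, the signal level falls smoothly from nearly 1 at `t = 0` to nearly 0 at `t = 1`, and the offset `s` keeps the schedule from changing too abruptly near `t = 0`.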
wuerstchen/text_encoder/config.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "_name_or_path": "laion/CLIP-ViT-H-14-laion2B-s32B-b79K",
+   "architectures": [
+     "CLIPTextModel"
+   ],
+   "attention_dropout": 0.0,
+   "bos_token_id": 0,
+   "dropout": 0.0,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_size": 1024,
+   "initializer_factor": 1.0,
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 77,
+   "model_type": "clip_text_model",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "pad_token_id": 1,
+   "projection_dim": 1024,
+   "torch_dtype": "float32",
+   "transformers_version": "4.33.0.dev0",
+   "vocab_size": 49408
+ }
wuerstchen/text_encoder/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bd94a7ea6922e8028227567fe14e04d2989eec31c482e0813e9006afea6637f1
+ size 1411983168
wuerstchen/text_encoder/pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0483b11b48b0f5a5079f778c0df4057d7b797cf58ef176087ec03a236d3e16e0
+ size 1412064410
wuerstchen/tokenizer/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
wuerstchen/tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "bos_token": {
+     "content": "<|startoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<|endoftext|>",
+   "unk_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
wuerstchen/tokenizer/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
wuerstchen/tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,33 @@
+ {
+   "add_prefix_space": false,
+   "bos_token": {
+     "__type": "AddedToken",
+     "content": "<|startoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "clean_up_tokenization_spaces": true,
+   "do_lower_case": true,
+   "eos_token": {
+     "__type": "AddedToken",
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "errors": "replace",
+   "model_max_length": 77,
+   "pad_token": "<|endoftext|>",
+   "tokenizer_class": "CLIPTokenizer",
+   "unk_token": {
+     "__type": "AddedToken",
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
wuerstchen/tokenizer/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
wuerstchen/vqgan/config.json ADDED
@@ -0,0 +1,13 @@
+ {
+   "_class_name": "PaellaVQModel",
+   "_diffusers_version": "0.21.0.dev0",
+   "bottleneck_blocks": 12,
+   "embed_dim": 384,
+   "in_channels": 3,
+   "latent_channels": 4,
+   "levels": 2,
+   "num_vq_embeddings": 8192,
+   "out_channels": 3,
+   "scale_factor": 0.3764,
+   "up_down_scale_factor": 2
+ }
wuerstchen/vqgan/diffusion_pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f3ab7752b474058d177e8565860367a438b8016ba788954394fbb7f1da16d6e1
+ size 73674142
wuerstchen/vqgan/diffusion_pytorch_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:052db8852c0d8b117e6d2a59ae3e0c7d7aaae3d00f247e392ef8e9837e11d6c4
+ size 73639568