hunyuang3D #27
by Temi3003 - opened
- LICENSE +1 -2
- NOTICE +3 -3
- README.md +1 -1
- hunyuan3d-dit-v2-0-turbo/config.yaml +0 -70
- hunyuan3d-dit-v2-0-turbo/model.fp16.ckpt +0 -3
- hunyuan3d-dit-v2-0-turbo/model.fp16.safetensors +0 -3
- hunyuan3d-dit-v2-0/model.fp16.ckpt +0 -3
- hunyuan3d-dit-v2-0/model.fp16.safetensors +0 -3
- hunyuan3d-paint-v2-0-turbo/.gitattributes +0 -35
- hunyuan3d-paint-v2-0-turbo/README.md +0 -53
- hunyuan3d-paint-v2-0-turbo/feature_extractor/preprocessor_config.json +0 -20
- hunyuan3d-paint-v2-0-turbo/image_encoder/config.json +0 -23
- hunyuan3d-paint-v2-0-turbo/image_encoder/model.safetensors +0 -3
- hunyuan3d-paint-v2-0-turbo/image_encoder/preprocessor_config.json +0 -27
- hunyuan3d-paint-v2-0-turbo/model_index.json +0 -37
- hunyuan3d-paint-v2-0-turbo/scheduler/scheduler_config.json +0 -15
- hunyuan3d-paint-v2-0-turbo/text_encoder/config.json +0 -25
- hunyuan3d-paint-v2-0-turbo/text_encoder/pytorch_model.bin +0 -3
- hunyuan3d-paint-v2-0-turbo/tokenizer/merges.txt +0 -0
- hunyuan3d-paint-v2-0-turbo/tokenizer/special_tokens_map.json +0 -24
- hunyuan3d-paint-v2-0-turbo/tokenizer/tokenizer_config.json +0 -34
- hunyuan3d-paint-v2-0-turbo/tokenizer/vocab.json +0 -0
- hunyuan3d-paint-v2-0-turbo/unet/config.json +0 -45
- hunyuan3d-paint-v2-0-turbo/unet/diffusion_pytorch_model.bin +0 -3
- hunyuan3d-paint-v2-0-turbo/unet/diffusion_pytorch_model.safetensors +0 -3
- hunyuan3d-paint-v2-0-turbo/unet/modules.py +0 -610
- hunyuan3d-paint-v2-0-turbo/vae/config.json +0 -29
- hunyuan3d-paint-v2-0-turbo/vae/diffusion_pytorch_model.bin +0 -3
- hunyuan3d-vae-v2-0-turbo/config.yaml +0 -15
- hunyuan3d-vae-v2-0-turbo/model.fp16.ckpt +0 -3
- hunyuan3d-vae-v2-0-turbo/model.fp16.safetensors +0 -3
- hunyuan3d-vae-v2-0/config.yaml +0 -15
- hunyuan3d-vae-v2-0/model.fp16.ckpt +0 -3
- hunyuan3d-vae-v2-0/model.fp16.safetensors +0 -3
LICENSE
CHANGED
@@ -11,8 +11,7 @@ e. “Licensee,” “You” or “Your” shall mean a natural person or legal
 f. “Materials” shall mean, collectively, Tencent’s proprietary Tencent Hunyuan 3D 2.0 and Documentation (and any portion thereof) as made available by Tencent under this Agreement.
 g. “Model Derivatives” shall mean all: (i) modifications to Tencent Hunyuan 3D 2.0 or any Model Derivative of Tencent Hunyuan 3D 2.0; (ii) works based on Tencent Hunyuan 3D 2.0 or any Model Derivative of Tencent Hunyuan 3D 2.0; or (iii) any other machine learning model which is created by transfer of patterns of the weights, parameters, operations, or Output of Tencent Hunyuan 3D 2.0 or any Model Derivative of Tencent Hunyuan 3D 2.0, to that model in order to cause that model to perform similarly to Tencent Hunyuan 3D 2.0 or a Model Derivative of Tencent Hunyuan 3D 2.0, including distillation methods, methods that use intermediate data representations, or methods based on the generation of synthetic data Outputs by Tencent Hunyuan 3D 2.0 or a Model Derivative of Tencent Hunyuan 3D 2.0 for training that model. For clarity, Outputs by themselves are not deemed Model Derivatives.
 h. “Output” shall mean the information and/or content output of Tencent Hunyuan 3D 2.0 or a Model Derivative that results from operating or otherwise using Tencent Hunyuan 3D 2.0 or a Model Derivative, including via a Hosted Service.
-i. “Tencent,” “We” or “Us” shall mean
-* Section 1.i of the previous Hunyuan License Agreement defined “Tencent,” “We” or “Us” to mean THL A29 Limited, and the copyright notices pertaining to the Materials were previously in the name of “THL A29 Limited.” That entity has now been de-registered. You should treat all previously distributed copies of the Materials as if Section 1.i of the Agreement defined “Tencent,” “We” or “Us” to mean “the applicable entity or entities in the Tencent corporate family that own(s) intellectual property or other rights embodied in or utilized by the Materials,” and treat the copyright notice(s) accompanying the Materials as if they were in the name of “Tencent.” When providing a copy of any Agreement to Third Party recipients of the Tencent Hunyuan Works or products or services using them, as required by Section 3.a of the Agreement, you should provide the most current version of the Agreement, including the change of definition in Section 1.i of the Agreement.
+i. “Tencent,” “We” or “Us” shall mean THL A29 Limited.
 j. “Tencent Hunyuan 3D 2.0” shall mean the 3D generation models and their software and algorithms, including trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing made publicly available by Us at https://github.com/Tencent/Hunyuan3D-2.
 k. “Tencent Hunyuan 3D 2.0 Works” shall mean: (i) the Materials; (ii) Model Derivatives; and (iii) all derivative works thereof.
 l. “Territory” shall mean the worldwide territory, excluding the territory of the European Union, United Kingdom and South Korea.

NOTICE
CHANGED
@@ -2,7 +2,7 @@ Usage and Legal Notices:

 Tencent is pleased to support the open source community by making Hunyuan 3D 2.0 available.

-Copyright (C) 2025 Tencent. All rights reserved. The below software and/or models in this distribution may have been modified by
+Copyright (C) 2025 THL A29 Limited, a Tencent company. All rights reserved. The below software and/or models in this distribution may have been modified by THL A29 Limited ("Tencent Modifications"). All Tencent Modifications are Copyright (C) THL A29 Limited.

 Hunyuan 3D 2.0 is licensed under the TENCENT HUNYUAN 3D 2.0 COMMUNITY LICENSE AGREEMENT except for the third-party components listed below, which is licensed under different terms. Hunyuan 3D 2.0 does not impose any additional limitations beyond what is outlined in the respective licenses of these third-party components. Users must comply with all terms and conditions of original licenses of these third-party components and must ensure that the usage of the third party components adheres to all relevant laws and regulations.

@@ -126,7 +126,7 @@ You agree not to use the Model or Derivatives of the Model:
 Open Source Model Licensed under the TENCENT HUNYUAN COMMUNITY LICENSE AGREEMENT and Other Licenses of the Third-Party Components therein:
 --------------------------------------------------------------------
 1. HunyuanDiT
-Copyright (C) 2024 Tencent. All rights reserved.
+Copyright (C) 2024 THL A29 Limited, a Tencent company. All rights reserved.


 Terms of the TENCENT HUNYUAN COMMUNITY LICENSE AGREEMENT:
@@ -143,7 +143,7 @@ e. “Licensee,” “You” or “Your” shall mean a natural person or legal
 f. “Materials” shall mean, collectively, Tencent’s proprietary Tencent Hunyuan and Documentation (and any portion thereof) as made available by Tencent under this Agreement.
 g. “Model Derivatives” shall mean all: (i) modifications to Tencent Hunyuan or any Model Derivative of Tencent Hunyuan; (ii) works based on Tencent Hunyuan or any Model Derivative of Tencent Hunyuan; or (iii) any other machine learning model which is created by transfer of patterns of the weights, parameters, operations, or Output of Tencent Hunyuan or any Model Derivative of Tencent Hunyuan, to that model in order to cause that model to perform similarly to Tencent Hunyuan or a Model Derivative of Tencent Hunyuan, including distillation methods, methods that use intermediate data representations, or methods based on the generation of synthetic data Outputs by Tencent Hunyuan or a Model Derivative of Tencent Hunyuan for training that model. For clarity, Outputs by themselves are not deemed Model Derivatives.
 h. “Output” shall mean the information and/or content output of Tencent Hunyuan or a Model Derivative that results from operating or otherwise using Tencent Hunyuan or a Model Derivative, including via a Hosted Service.
-i. “Tencent,” “We” or “Us” shall mean
+i. “Tencent,” “We” or “Us” shall mean THL A29 Limited.
 j. “Tencent Hunyuan” shall mean the large language models, image/video/audio/3D generation models, and multimodal large language models and their software and algorithms, including trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing made publicly available by Us at https://huggingface.co/Tencent-Hunyuan/HunyuanDiT and https://github.com/Tencent/HunyuanDiT .
 k. “Tencent Hunyuan Works” shall mean: (i) the Materials; (ii) Model Derivatives; and (iii) all derivative works thereof.
 l. “Third Party” or “Third Parties” shall mean individuals or legal entities that are not under common control with Us or You.

README.md
CHANGED
@@ -153,7 +153,7 @@ pipeline = Hunyuan3DPaintPipeline.from_pretrained('tencent/Hunyuan3D-2')
 mesh = pipeline(mesh, image='assets/demo.png')
 ```

-Please visit [minimal_demo.py](
+Please visit [minimal_demo.py](minimal_demo.py) for more advanced usage, such as **text to 3D** and **texture generation
 for handcrafted mesh**.

 ### Gradio App

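For orientation, the hunk above sits in the README's quick-start section. Below is a minimal sketch of the image-to-textured-mesh flow that section describes; it assumes the hy3dgen package layout referenced elsewhere in this PR (the shapegen pipeline name is taken from the deleted config.yaml), so treat the exact import paths, arguments, and return types as illustrative rather than authoritative.

```python
# Hedged sketch of the README's usage flow (image -> shape -> texture).
# Assumes the hy3dgen package and the 'tencent/Hunyuan3D-2' checkpoint layout;
# import paths and call signatures are illustrative, not taken verbatim from this PR.
from hy3dgen.shapegen import Hunyuan3DDiTFlowMatchingPipeline
from hy3dgen.texgen import Hunyuan3DPaintPipeline

# 1. Generate an untextured mesh from a single reference image.
shape_pipeline = Hunyuan3DDiTFlowMatchingPipeline.from_pretrained('tencent/Hunyuan3D-2')
mesh = shape_pipeline(image='assets/demo.png')[0]

# 2. Texture the mesh with the same reference image (the step shown in the hunk).
paint_pipeline = Hunyuan3DPaintPipeline.from_pretrained('tencent/Hunyuan3D-2')
mesh = paint_pipeline(mesh, image='assets/demo.png')
mesh.export('demo_textured.glb')
```
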
hunyuan3d-dit-v2-0-turbo/config.yaml
DELETED
@@ -1,70 +0,0 @@
model:
  target: hy3dgen.shapegen.models.Hunyuan3DDiT
  params:
    in_channels: 64
    context_in_dim: 1536
    hidden_size: 1024
    mlp_ratio: 4.0
    num_heads: 16
    depth: 16
    depth_single_blocks: 32
    axes_dim: [ 64 ]
    theta: 10000
    qkv_bias: true
    guidance_embed: true

vae:
  target: hy3dgen.shapegen.models.ShapeVAE
  params:
    num_latents: 3072
    embed_dim: 64
    num_freqs: 8
    include_pi: false
    heads: 16
    width: 1024
    num_decoder_layers: 16
    qkv_bias: false
    qk_norm: true
    scale_factor: 0.9990943042622529

conditioner:
  target: hy3dgen.shapegen.models.SingleImageEncoder
  params:
    main_image_encoder:
      type: DinoImageEncoder # dino giant
      kwargs:
        config:
          attention_probs_dropout_prob: 0.0
          drop_path_rate: 0.0
          hidden_act: gelu
          hidden_dropout_prob: 0.0
          hidden_size: 1536
          image_size: 518
          initializer_range: 0.02
          layer_norm_eps: 1.e-6
          layerscale_value: 1.0
          mlp_ratio: 4
          model_type: dinov2
          num_attention_heads: 24
          num_channels: 3
          num_hidden_layers: 40
          patch_size: 14
          qkv_bias: true
          torch_dtype: float32
          use_swiglu_ffn: true
        image_size: 518

scheduler:
  target: hy3dgen.shapegen.schedulers.ConsistencyFlowMatchEulerDiscreteScheduler
  params:
    num_train_timesteps: 1000
    pcm_timesteps: 100

image_processor:
  target: hy3dgen.shapegen.preprocessors.ImageProcessorV2
  params:
    size: 512
    border_ratio: 0.15

pipeline:
  target: hy3dgen.shapegen.pipelines.Hunyuan3DDiTFlowMatchingPipeline

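The deleted config above is what the shape pipeline reads to assemble the turbo DiT, the shape VAE, the DINOv2 conditioner, and the consistency flow-matching scheduler. A minimal sketch of loading that variant is given below; it assumes the hy3dgen API that the config's target fields point to, and the subfolder keyword and step count are illustrative, not taken from this PR.

```python
# Hedged sketch: instantiate the turbo shape pipeline described by the deleted config.yaml.
# Assumes the hy3dgen package referenced by the config's `target` fields; the subfolder
# keyword and the step count are illustrative assumptions.
from hy3dgen.shapegen import Hunyuan3DDiTFlowMatchingPipeline

pipeline = Hunyuan3DDiTFlowMatchingPipeline.from_pretrained(
    'tencent/Hunyuan3D-2',
    subfolder='hunyuan3d-dit-v2-0-turbo',  # folder whose config.yaml is shown above
)
# The turbo checkpoint pairs with the ConsistencyFlowMatchEulerDiscreteScheduler from the
# config, so few-step sampling is the intended use.
mesh = pipeline(image='assets/demo.png', num_inference_steps=5)[0]
```
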
hunyuan3d-dit-v2-0-turbo/model.fp16.ckpt
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f04cecad6953ca9644f9e1d3a22cd0abb20665ce8be30fc4409451ce78d622f1
size 4931245140

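Each of the deleted weight files in this PR is stored as a Git LFS pointer like the one above: a three-line stub recording the spec version, the blob's sha256 oid, and its byte size. A small sketch of checking a locally downloaded blob against such a pointer follows; the file paths are hypothetical placeholders.

```python
# Hedged sketch: verify a downloaded blob against a Git LFS pointer like the one above.
# The pointer and blob paths are hypothetical placeholders.
import hashlib
from pathlib import Path

def parse_lfs_pointer(text: str) -> dict:
    # Pointer lines look like "oid sha256:<hex>" and "size <bytes>".
    fields = dict(line.split(' ', 1) for line in text.strip().splitlines())
    return {'oid': fields['oid'].split(':', 1)[1], 'size': int(fields['size'])}

pointer = parse_lfs_pointer(Path('model.fp16.ckpt').read_text())  # the pointer stub
blob = Path('downloads/model.fp16.ckpt')                          # the actual weights

digest = hashlib.sha256()
with blob.open('rb') as f:
    for chunk in iter(lambda: f.read(1 << 20), b''):
        digest.update(chunk)

assert blob.stat().st_size == pointer['size'], 'size mismatch'
assert digest.hexdigest() == pointer['oid'], 'sha256 mismatch'
```
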
hunyuan3d-dit-v2-0-turbo/model.fp16.safetensors
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5ee5a81e4df08a1c65b79910bf5b145a90376e526794f4607a4d5d068d62f269
size 4930777530

hunyuan3d-dit-v2-0/model.fp16.ckpt
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:39c2a6bf54f5674f2001b763d8e15b773fbda24604b3911544d09846496bc972
size 4928568095

hunyuan3d-dit-v2-0/model.fp16.safetensors
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:360bc281fc956d4acac0c3d36d5ec0ebf8cdddbf4b8892e894d12419388d479b
size 4928151562

hunyuan3d-paint-v2-0-turbo/.gitattributes
DELETED
@@ -1,35 +0,0 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text

hunyuan3d-paint-v2-0-turbo/README.md
DELETED
@@ -1,53 +0,0 @@
---
license: openrail++
tags:
- stable-diffusion
- text-to-image
---

# SD v2.1-base with Zero Terminal SNR (LAION Aesthetic 6+)

This model is used in [Diffusion Model with Perceptual Loss](https://arxiv.org/abs/2401.00110) paper as the MSE baseline.

This model is trained using zero terminal SNR schedule following [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://arxiv.org/abs/2305.08891) paper on LAION aesthetic 6+ data.

This model is finetuned from [stabilityai/stable-diffusion-2-1-base](https://huggingface.co/stabilityai/stable-diffusion-2-1-base).

This model is meant for research demonstration, not for production use.

## Usage

```python
from diffusers import StableDiffusionPipeline
prompt = "A young girl smiling"
pipe = StableDiffusionPipeline.from_pretrained("ByteDance/sd2.1-base-zsnr-laionaes6").to("cuda")
pipe(prompt, guidance_scale=7.5, guidance_rescale=0.7).images[0].save("out.jpg")
```

## Related Models

* [bytedance/sd2.1-base-zsnr-laionaes5](https://huggingface.co/ByteDance/sd2.1-base-zsnr-laionaes5)
* [bytedance/sd2.1-base-zsnr-laionaes6](https://huggingface.co/ByteDance/sd2.1-base-zsnr-laionaes6)
* [bytedance/sd2.1-base-zsnr-laionaes6-perceptual](https://huggingface.co/ByteDance/sd2.1-base-zsnr-laionaes6-perceptual)

## Cite as
```
@misc{lin2024diffusion,
  title={Diffusion Model with Perceptual Loss},
  author={Shanchuan Lin and Xiao Yang},
  year={2024},
  eprint={2401.00110},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

@misc{lin2023common,
  title={Common Diffusion Noise Schedules and Sample Steps are Flawed},
  author={Shanchuan Lin and Bingchen Liu and Jiashi Li and Xiao Yang},
  year={2023},
  eprint={2305.08891},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```

hunyuan3d-paint-v2-0-turbo/feature_extractor/preprocessor_config.json
DELETED
@@ -1,20 +0,0 @@
{
  "crop_size": 224,
  "do_center_crop": true,
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_resize": true,
  "feature_extractor_type": "CLIPFeatureExtractor",
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "resample": 3,
  "size": 224
}

hunyuan3d-paint-v2-0-turbo/image_encoder/config.json
DELETED
@@ -1,23 +0,0 @@
{
  "_name_or_path": "D:\\.cache\\huggingface\\hub\\models--sudo-ai--zero123plus-v1.1\\snapshots\\36df7de980afd15f80b2e1a4e9a920d7020e2654\\vision_encoder",
  "architectures": [
    "CLIPVisionModelWithProjection"
  ],
  "attention_dropout": 0.0,
  "dropout": 0.0,
  "hidden_act": "gelu",
  "hidden_size": 1280,
  "image_size": 224,
  "initializer_factor": 1.0,
  "initializer_range": 0.02,
  "intermediate_size": 5120,
  "layer_norm_eps": 1e-05,
  "model_type": "clip_vision_model",
  "num_attention_heads": 16,
  "num_channels": 3,
  "num_hidden_layers": 32,
  "patch_size": 14,
  "projection_dim": 1024,
  "torch_dtype": "float16",
  "transformers_version": "4.36.0"
}

hunyuan3d-paint-v2-0-turbo/image_encoder/model.safetensors
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ae616c24393dd1854372b0639e5541666f7521cbe219669255e865cb7f89466a
size 1264217240

hunyuan3d-paint-v2-0-turbo/image_encoder/preprocessor_config.json
DELETED
@@ -1,27 +0,0 @@
{
  "crop_size": {
    "height": 224,
    "width": 224
  },
  "do_center_crop": true,
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "CLIPImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "shortest_edge": 224
  }
}

hunyuan3d-paint-v2-0-turbo/model_index.json
DELETED
@@ -1,37 +0,0 @@
{
  "_class_name": "StableDiffusionPipeline",
  "_diffusers_version": "0.23.1",
  "feature_extractor": [
    "transformers",
    "CLIPImageProcessor"
  ],
  "requires_safety_checker": false,
  "safety_checker": [
    null,
    null
  ],
  "scheduler": [
    "diffusers",
    "DDIMScheduler"
  ],
  "text_encoder": [
    "transformers",
    "CLIPTextModel"
  ],
  "tokenizer": [
    "transformers",
    "CLIPTokenizer"
  ],
  "image_encoder": [
    "transformers",
    "CLIPVisionModelWithProjection"
  ],
  "unet": [
    "modules",
    "UNet2p5DConditionModel"
  ],
  "vae": [
    "diffusers",
    "AutoencoderKL"
  ]
}

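model_index.json is the diffusers pipeline manifest: each entry maps a component name to the (library, class) pair used to load the matching subfolder, and the unet entry points at the local modules.py class rather than a stock diffusers one. A small sketch of walking that mapping follows, assuming a local copy of the deleted folder; the path is hypothetical.

```python
# Hedged sketch: inspect the component map in a model_index.json like the one above.
# `root` is a hypothetical local copy of the deleted hunyuan3d-paint-v2-0-turbo folder.
import json
from pathlib import Path

root = Path('hunyuan3d-paint-v2-0-turbo')
index = json.loads((root / 'model_index.json').read_text())

for name, spec in index.items():
    if name.startswith('_') or not isinstance(spec, list) or spec[0] is None:
        continue  # skip metadata keys, boolean flags, and the disabled safety_checker
    library, cls = spec
    print(f'{name:18} -> {library}.{cls}  (weights expected under {root / name}/)')
```
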
hunyuan3d-paint-v2-0-turbo/scheduler/scheduler_config.json
DELETED
@@ -1,15 +0,0 @@
{
  "_class_name": "DDIMScheduler",
  "_diffusers_version": "0.23.1",
  "beta_end": 0.012,
  "beta_schedule": "scaled_linear",
  "beta_start": 0.00085,
  "clip_sample": false,
  "num_train_timesteps": 1000,
  "prediction_type": "v_prediction",
  "set_alpha_to_one": true,
  "steps_offset": 1,
  "trained_betas": null,
  "timestep_spacing": "trailing",
  "rescale_betas_zero_snr": true
}

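The deleted scheduler config selects v-prediction DDIM with trailing timestep spacing and zero-terminal-SNR rescaling, i.e. the recipe from the "Common Diffusion Noise Schedules and Sample Steps are Flawed" paper cited in the deleted README above. A sketch of building an equivalent scheduler directly with diffusers (the settings mirror the JSON; the diffusers version is assumed to be recent enough to support them):

```python
# Hedged sketch: recreate the deleted scheduler_config.json with diffusers' DDIMScheduler.
# Mirrors the zero-terminal-SNR / v-prediction / trailing-spacing settings shown above.
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule='scaled_linear',
    clip_sample=False,
    set_alpha_to_one=True,
    steps_offset=1,
    prediction_type='v_prediction',
    timestep_spacing='trailing',
    rescale_betas_zero_snr=True,
)
```
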
hunyuan3d-paint-v2-0-turbo/text_encoder/config.json
DELETED
@@ -1,25 +0,0 @@
{
  "_name_or_path": "stabilityai/stable-diffusion-2",
  "architectures": [
    "CLIPTextModel"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "dropout": 0.0,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_size": 1024,
  "initializer_factor": 1.0,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 77,
  "model_type": "clip_text_model",
  "num_attention_heads": 16,
  "num_hidden_layers": 23,
  "pad_token_id": 1,
  "projection_dim": 512,
  "torch_dtype": "float32",
  "transformers_version": "4.25.0.dev0",
  "vocab_size": 49408
}

hunyuan3d-paint-v2-0-turbo/text_encoder/pytorch_model.bin
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c3e254d7b61353497ea0be2c4013df4ea8f739ee88cffa0ba58cd085459ed565
size 1361671895

hunyuan3d-paint-v2-0-turbo/tokenizer/merges.txt
DELETED
The diff for this file is too large to render.
hunyuan3d-paint-v2-0-turbo/tokenizer/special_tokens_map.json
DELETED
@@ -1,24 +0,0 @@
{
  "bos_token": {
    "content": "<|startoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "!",
  "unk_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}

hunyuan3d-paint-v2-0-turbo/tokenizer/tokenizer_config.json
DELETED
@@ -1,34 +0,0 @@
{
  "add_prefix_space": false,
  "bos_token": {
    "__type": "AddedToken",
    "content": "<|startoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "do_lower_case": true,
  "eos_token": {
    "__type": "AddedToken",
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "errors": "replace",
  "model_max_length": 77,
  "name_or_path": "stabilityai/stable-diffusion-2",
  "pad_token": "<|endoftext|>",
  "special_tokens_map_file": "./special_tokens_map.json",
  "tokenizer_class": "CLIPTokenizer",
  "unk_token": {
    "__type": "AddedToken",
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}

hunyuan3d-paint-v2-0-turbo/tokenizer/vocab.json
DELETED
The diff for this file is too large to render.
hunyuan3d-paint-v2-0-turbo/unet/config.json
DELETED
@@ -1,45 +0,0 @@
{
  "_class_name": "UNet2DConditionModel",
  "_diffusers_version": "0.10.0.dev0",
  "act_fn": "silu",
  "attention_head_dim": [
    5,
    10,
    20,
    20
  ],
  "block_out_channels": [
    320,
    640,
    1280,
    1280
  ],
  "center_input_sample": false,
  "cross_attention_dim": 1024,
  "down_block_types": [
    "CrossAttnDownBlock2D",
    "CrossAttnDownBlock2D",
    "CrossAttnDownBlock2D",
    "DownBlock2D"
  ],
  "downsample_padding": 1,
  "dual_cross_attention": false,
  "flip_sin_to_cos": true,
  "freq_shift": 0,
  "in_channels": 4,
  "layers_per_block": 2,
  "mid_block_scale_factor": 1,
  "norm_eps": 1e-05,
  "norm_num_groups": 32,
  "num_class_embeds": null,
  "only_cross_attention": false,
  "out_channels": 4,
  "sample_size": 64,
  "up_block_types": [
    "UpBlock2D",
    "CrossAttnUpBlock2D",
    "CrossAttnUpBlock2D",
    "CrossAttnUpBlock2D"
  ],
  "use_linear_projection": true
}

hunyuan3d-paint-v2-0-turbo/unet/diffusion_pytorch_model.bin
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:24e7f1aea8a7c94cee627eb06f5265f19eeff4e19568636c5eaef050cc19ba3d
size 7325432923

hunyuan3d-paint-v2-0-turbo/unet/diffusion_pytorch_model.safetensors
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d6acffa4a22f4da61d87f446bfa83e7ac245481c1535fbf25b200fe4462d0b22
size 3722161032

hunyuan3d-paint-v2-0-turbo/unet/modules.py
DELETED
@@ -1,610 +0,0 @@
|
|
1 |
-
# Open Source Model Licensed under the Apache License Version 2.0
|
2 |
-
# and Other Licenses of the Third-Party Components therein:
|
3 |
-
# The below Model in this distribution may have been modified by THL A29 Limited
|
4 |
-
# ("Tencent Modifications"). All Tencent Modifications are Copyright (C) 2024 THL A29 Limited.
|
5 |
-
|
6 |
-
# Copyright (C) 2024 THL A29 Limited, a Tencent company. All rights reserved.
|
7 |
-
# The below software and/or models in this distribution may have been
|
8 |
-
# modified by THL A29 Limited ("Tencent Modifications").
|
9 |
-
# All Tencent Modifications are Copyright (C) THL A29 Limited.
|
10 |
-
|
11 |
-
# Hunyuan 3D is licensed under the TENCENT HUNYUAN NON-COMMERCIAL LICENSE AGREEMENT
|
12 |
-
# except for the third-party components listed below.
|
13 |
-
# Hunyuan 3D does not impose any additional limitations beyond what is outlined
|
14 |
-
# in the repsective licenses of these third-party components.
|
15 |
-
# Users must comply with all terms and conditions of original licenses of these third-party
|
16 |
-
# components and must ensure that the usage of the third party components adheres to
|
17 |
-
# all relevant laws and regulations.
|
18 |
-
|
19 |
-
# For avoidance of doubts, Hunyuan 3D means the large language models and
|
20 |
-
# their software and algorithms, including trained model weights, parameters (including
|
21 |
-
# optimizer states), machine-learning model code, inference-enabling code, training-enabling code,
|
22 |
-
# fine-tuning enabling code and other elements of the foregoing made publicly available
|
23 |
-
# by Tencent in accordance with TENCENT HUNYUAN COMMUNITY LICENSE AGREEMENT.
|
24 |
-
|
25 |
-
import copy
|
26 |
-
import json
|
27 |
-
import os
|
28 |
-
from typing import Any, Dict, List, Optional, Tuple, Union
|
29 |
-
|
30 |
-
import torch
|
31 |
-
import torch.nn as nn
|
32 |
-
import torch.nn.functional as F
|
33 |
-
from diffusers.models import UNet2DConditionModel
|
34 |
-
from diffusers.models.attention_processor import Attention
|
35 |
-
from diffusers.models.transformers.transformer_2d import BasicTransformerBlock
|
36 |
-
from einops import rearrange
|
37 |
-
|
38 |
-
|
39 |
-
def _chunked_feed_forward(ff: nn.Module, hidden_states: torch.Tensor, chunk_dim: int, chunk_size: int):
|
40 |
-
# "feed_forward_chunk_size" can be used to save memory
|
41 |
-
if hidden_states.shape[chunk_dim] % chunk_size != 0:
|
42 |
-
raise ValueError(
|
43 |
-
f"`hidden_states` dimension to be chunked: {hidden_states.shape[chunk_dim]}"
|
44 |
-
f"has to be divisible by chunk size: {chunk_size}."
|
45 |
-
f" Make sure to set an appropriate `chunk_size` when calling `unet.enable_forward_chunking`."
|
46 |
-
)
|
47 |
-
|
48 |
-
num_chunks = hidden_states.shape[chunk_dim] // chunk_size
|
49 |
-
ff_output = torch.cat(
|
50 |
-
[ff(hid_slice) for hid_slice in hidden_states.chunk(num_chunks, dim=chunk_dim)],
|
51 |
-
dim=chunk_dim,
|
52 |
-
)
|
53 |
-
return ff_output
|
54 |
-
|
55 |
-
|
56 |
-
class Basic2p5DTransformerBlock(torch.nn.Module):
|
57 |
-
def __init__(self, transformer: BasicTransformerBlock, layer_name, use_ma=True, use_ra=True, is_turbo=False) -> None:
|
58 |
-
super().__init__()
|
59 |
-
self.transformer = transformer
|
60 |
-
self.layer_name = layer_name
|
61 |
-
self.use_ma = use_ma
|
62 |
-
self.use_ra = use_ra
|
63 |
-
self.is_turbo = is_turbo
|
64 |
-
|
65 |
-
# multiview attn
|
66 |
-
if self.use_ma:
|
67 |
-
self.attn_multiview = Attention(
|
68 |
-
query_dim=self.dim,
|
69 |
-
heads=self.num_attention_heads,
|
70 |
-
dim_head=self.attention_head_dim,
|
71 |
-
dropout=self.dropout,
|
72 |
-
bias=self.attention_bias,
|
73 |
-
cross_attention_dim=None,
|
74 |
-
upcast_attention=self.attn1.upcast_attention,
|
75 |
-
out_bias=True,
|
76 |
-
)
|
77 |
-
|
78 |
-
# ref attn
|
79 |
-
if self.use_ra:
|
80 |
-
self.attn_refview = Attention(
|
81 |
-
query_dim=self.dim,
|
82 |
-
heads=self.num_attention_heads,
|
83 |
-
dim_head=self.attention_head_dim,
|
84 |
-
dropout=self.dropout,
|
85 |
-
bias=self.attention_bias,
|
86 |
-
cross_attention_dim=None,
|
87 |
-
upcast_attention=self.attn1.upcast_attention,
|
88 |
-
out_bias=True,
|
89 |
-
)
|
90 |
-
if self.is_turbo:
|
91 |
-
self._initialize_attn_weights()
|
92 |
-
|
93 |
-
def _initialize_attn_weights(self):
|
94 |
-
|
95 |
-
if self.use_ma:
|
96 |
-
self.attn_multiview.load_state_dict(self.attn1.state_dict())
|
97 |
-
with torch.no_grad():
|
98 |
-
for layer in self.attn_multiview.to_out:
|
99 |
-
for param in layer.parameters():
|
100 |
-
param.zero_()
|
101 |
-
if self.use_ra:
|
102 |
-
self.attn_refview.load_state_dict(self.attn1.state_dict())
|
103 |
-
with torch.no_grad():
|
104 |
-
for layer in self.attn_refview.to_out:
|
105 |
-
for param in layer.parameters():
|
106 |
-
param.zero_()
|
107 |
-
|
108 |
-
def __getattr__(self, name: str):
|
109 |
-
try:
|
110 |
-
return super().__getattr__(name)
|
111 |
-
except AttributeError:
|
112 |
-
return getattr(self.transformer, name)
|
113 |
-
|
114 |
-
def forward(
|
115 |
-
self,
|
116 |
-
hidden_states: torch.Tensor,
|
117 |
-
attention_mask: Optional[torch.Tensor] = None,
|
118 |
-
encoder_hidden_states: Optional[torch.Tensor] = None,
|
119 |
-
encoder_attention_mask: Optional[torch.Tensor] = None,
|
120 |
-
timestep: Optional[torch.LongTensor] = None,
|
121 |
-
cross_attention_kwargs: Dict[str, Any] = None,
|
122 |
-
class_labels: Optional[torch.LongTensor] = None,
|
123 |
-
added_cond_kwargs: Optional[Dict[str, torch.Tensor]] = None,
|
124 |
-
) -> torch.Tensor:
|
125 |
-
|
126 |
-
# Notice that normalization is always applied before the real computation in the following blocks.
|
127 |
-
# 0. Self-Attention
|
128 |
-
batch_size = hidden_states.shape[0]
|
129 |
-
|
130 |
-
cross_attention_kwargs = cross_attention_kwargs.copy() if cross_attention_kwargs is not None else {}
|
131 |
-
num_in_batch = cross_attention_kwargs.pop('num_in_batch', 1)
|
132 |
-
mode = cross_attention_kwargs.pop('mode', None)
|
133 |
-
if not self.is_turbo:
|
134 |
-
mva_scale = cross_attention_kwargs.pop('mva_scale', 1.0)
|
135 |
-
ref_scale = cross_attention_kwargs.pop('ref_scale', 1.0)
|
136 |
-
else:
|
137 |
-
position_attn_mask = cross_attention_kwargs.pop("position_attn_mask", None)
|
138 |
-
position_voxel_indices = cross_attention_kwargs.pop("position_voxel_indices", None)
|
139 |
-
mva_scale = 1.0
|
140 |
-
ref_scale = 1.0
|
141 |
-
|
142 |
-
condition_embed_dict = cross_attention_kwargs.pop("condition_embed_dict", None)
|
143 |
-
|
144 |
-
if self.norm_type == "ada_norm":
|
145 |
-
norm_hidden_states = self.norm1(hidden_states, timestep)
|
146 |
-
elif self.norm_type == "ada_norm_zero":
|
147 |
-
norm_hidden_states, gate_msa, shift_mlp, scale_mlp, gate_mlp = self.norm1(
|
148 |
-
hidden_states, timestep, class_labels, hidden_dtype=hidden_states.dtype
|
149 |
-
)
|
150 |
-
elif self.norm_type in ["layer_norm", "layer_norm_i2vgen"]:
|
151 |
-
norm_hidden_states = self.norm1(hidden_states)
|
152 |
-
elif self.norm_type == "ada_norm_continuous":
|
153 |
-
norm_hidden_states = self.norm1(hidden_states, added_cond_kwargs["pooled_text_emb"])
|
154 |
-
elif self.norm_type == "ada_norm_single":
|
155 |
-
shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
|
156 |
-
self.scale_shift_table[None] + timestep.reshape(batch_size, 6, -1)
|
157 |
-
).chunk(6, dim=1)
|
158 |
-
norm_hidden_states = self.norm1(hidden_states)
|
159 |
-
norm_hidden_states = norm_hidden_states * (1 + scale_msa) + shift_msa
|
160 |
-
else:
|
161 |
-
raise ValueError("Incorrect norm used")
|
162 |
-
|
163 |
-
if self.pos_embed is not None:
|
164 |
-
norm_hidden_states = self.pos_embed(norm_hidden_states)
|
165 |
-
|
166 |
-
# 1. Prepare GLIGEN inputs
|
167 |
-
cross_attention_kwargs = cross_attention_kwargs.copy() if cross_attention_kwargs is not None else {}
|
168 |
-
gligen_kwargs = cross_attention_kwargs.pop("gligen", None)
|
169 |
-
|
170 |
-
attn_output = self.attn1(
|
171 |
-
norm_hidden_states,
|
172 |
-
encoder_hidden_states=encoder_hidden_states if self.only_cross_attention else None,
|
173 |
-
attention_mask=attention_mask,
|
174 |
-
**cross_attention_kwargs,
|
175 |
-
)
|
176 |
-
|
177 |
-
if self.norm_type == "ada_norm_zero":
|
178 |
-
attn_output = gate_msa.unsqueeze(1) * attn_output
|
179 |
-
elif self.norm_type == "ada_norm_single":
|
180 |
-
attn_output = gate_msa * attn_output
|
181 |
-
|
182 |
-
hidden_states = attn_output + hidden_states
|
183 |
-
if hidden_states.ndim == 4:
|
184 |
-
hidden_states = hidden_states.squeeze(1)
|
185 |
-
|
186 |
-
# 1.2 Reference Attention
|
187 |
-
if 'w' in mode:
|
188 |
-
condition_embed_dict[self.layer_name] = rearrange(
|
189 |
-
norm_hidden_states, '(b n) l c -> b (n l) c',
|
190 |
-
n=num_in_batch
|
191 |
-
) # B, (N L), C
|
192 |
-
|
193 |
-
if 'r' in mode and self.use_ra:
|
194 |
-
condition_embed = condition_embed_dict[self.layer_name].unsqueeze(1).repeat(1, num_in_batch, 1,
|
195 |
-
1) # B N L C
|
196 |
-
condition_embed = rearrange(condition_embed, 'b n l c -> (b n) l c')
|
197 |
-
|
198 |
-
attn_output = self.attn_refview(
|
199 |
-
norm_hidden_states,
|
200 |
-
encoder_hidden_states=condition_embed,
|
201 |
-
attention_mask=None,
|
202 |
-
**cross_attention_kwargs
|
203 |
-
)
|
204 |
-
if not self.is_turbo:
|
205 |
-
ref_scale_timing = ref_scale
|
206 |
-
if isinstance(ref_scale, torch.Tensor):
|
207 |
-
ref_scale_timing = ref_scale.unsqueeze(1).repeat(1, num_in_batch).view(-1)
|
208 |
-
for _ in range(attn_output.ndim - 1):
|
209 |
-
ref_scale_timing = ref_scale_timing.unsqueeze(-1)
|
210 |
-
|
211 |
-
hidden_states = ref_scale_timing * attn_output + hidden_states
|
212 |
-
|
213 |
-
if hidden_states.ndim == 4:
|
214 |
-
hidden_states = hidden_states.squeeze(1)
|
215 |
-
|
216 |
-
# 1.3 Multiview Attention
|
217 |
-
if num_in_batch > 1 and self.use_ma:
|
218 |
-
multivew_hidden_states = rearrange(norm_hidden_states, '(b n) l c -> b (n l) c', n=num_in_batch)
|
219 |
-
|
220 |
-
if self.is_turbo:
|
221 |
-
position_mask = None
|
222 |
-
if position_attn_mask is not None:
|
223 |
-
if multivew_hidden_states.shape[1] in position_attn_mask:
|
224 |
-
position_mask = position_attn_mask[multivew_hidden_states.shape[1]]
|
225 |
-
position_indices = None
|
226 |
-
if position_voxel_indices is not None:
|
227 |
-
if multivew_hidden_states.shape[1] in position_voxel_indices:
|
228 |
-
position_indices = position_voxel_indices[multivew_hidden_states.shape[1]]
|
229 |
-
attn_output = self.attn_multiview(
|
230 |
-
multivew_hidden_states,
|
231 |
-
encoder_hidden_states=multivew_hidden_states,
|
232 |
-
attention_mask=position_mask,
|
233 |
-
position_indices=position_indices,
|
234 |
-
**cross_attention_kwargs
|
235 |
-
)
|
236 |
-
else:
|
237 |
-
attn_output = self.attn_multiview(
|
238 |
-
multivew_hidden_states,
|
239 |
-
encoder_hidden_states=multivew_hidden_states,
|
240 |
-
**cross_attention_kwargs
|
241 |
-
)
|
242 |
-
|
243 |
-
attn_output = rearrange(attn_output, 'b (n l) c -> (b n) l c', n=num_in_batch)
|
244 |
-
|
245 |
-
hidden_states = mva_scale * attn_output + hidden_states
|
246 |
-
if hidden_states.ndim == 4:
|
247 |
-
hidden_states = hidden_states.squeeze(1)
|
248 |
-
|
249 |
-
# 1.2 GLIGEN Control
|
250 |
-
if gligen_kwargs is not None:
|
251 |
-
hidden_states = self.fuser(hidden_states, gligen_kwargs["objs"])
|
252 |
-
|
253 |
-
# 3. Cross-Attention
|
254 |
-
if self.attn2 is not None:
|
255 |
-
if self.norm_type == "ada_norm":
|
256 |
-
norm_hidden_states = self.norm2(hidden_states, timestep)
|
257 |
-
elif self.norm_type in ["ada_norm_zero", "layer_norm", "layer_norm_i2vgen"]:
|
258 |
-
norm_hidden_states = self.norm2(hidden_states)
|
259 |
-
elif self.norm_type == "ada_norm_single":
|
260 |
-
# For PixArt norm2 isn't applied here:
|
261 |
-
# https://github.com/PixArt-alpha/PixArt-alpha/blob/0f55e922376d8b797edd44d25d0e7464b260dcab/diffusion/model/nets/PixArtMS.py#L70C1-L76C103
|
262 |
-
norm_hidden_states = hidden_states
|
263 |
-
elif self.norm_type == "ada_norm_continuous":
|
264 |
-
norm_hidden_states = self.norm2(hidden_states, added_cond_kwargs["pooled_text_emb"])
|
265 |
-
else:
|
266 |
-
raise ValueError("Incorrect norm")
|
267 |
-
|
268 |
-
if self.pos_embed is not None and self.norm_type != "ada_norm_single":
|
269 |
-
norm_hidden_states = self.pos_embed(norm_hidden_states)
|
270 |
-
|
271 |
-
attn_output = self.attn2(
|
272 |
-
norm_hidden_states,
|
273 |
-
encoder_hidden_states=encoder_hidden_states,
|
274 |
-
attention_mask=encoder_attention_mask,
|
275 |
-
**cross_attention_kwargs,
|
276 |
-
)
|
277 |
-
|
278 |
-
hidden_states = attn_output + hidden_states
|
279 |
-
|
280 |
-
# 4. Feed-forward
|
281 |
-
# i2vgen doesn't have this norm 🤷♂️
|
282 |
-
if self.norm_type == "ada_norm_continuous":
|
283 |
-
norm_hidden_states = self.norm3(hidden_states, added_cond_kwargs["pooled_text_emb"])
|
284 |
-
elif not self.norm_type == "ada_norm_single":
|
285 |
-
norm_hidden_states = self.norm3(hidden_states)
|
286 |
-
|
287 |
-
if self.norm_type == "ada_norm_zero":
|
288 |
-
norm_hidden_states = norm_hidden_states * (1 + scale_mlp[:, None]) + shift_mlp[:, None]
|
289 |
-
|
290 |
-
if self.norm_type == "ada_norm_single":
|
291 |
-
norm_hidden_states = self.norm2(hidden_states)
|
292 |
-
norm_hidden_states = norm_hidden_states * (1 + scale_mlp) + shift_mlp
|
293 |
-
|
294 |
-
if self._chunk_size is not None:
|
295 |
-
# "feed_forward_chunk_size" can be used to save memory
|
296 |
-
ff_output = _chunked_feed_forward(self.ff, norm_hidden_states, self._chunk_dim, self._chunk_size)
|
297 |
-
else:
|
298 |
-
ff_output = self.ff(norm_hidden_states)
|
299 |
-
|
300 |
-
if self.norm_type == "ada_norm_zero":
|
301 |
-
ff_output = gate_mlp.unsqueeze(1) * ff_output
|
302 |
-
elif self.norm_type == "ada_norm_single":
|
303 |
-
ff_output = gate_mlp * ff_output
|
304 |
-
|
305 |
-
hidden_states = ff_output + hidden_states
|
306 |
-
if hidden_states.ndim == 4:
|
307 |
-
hidden_states = hidden_states.squeeze(1)
|
308 |
-
|
309 |
-
return hidden_states
|
310 |
-
|
311 |
-
@torch.no_grad()
|
312 |
-
def compute_voxel_grid_mask(position, grid_resolution=8):
|
313 |
-
|
314 |
-
position = position.half()
|
315 |
-
B,N,_,H,W = position.shape
|
316 |
-
assert H%grid_resolution==0 and W%grid_resolution==0
|
317 |
-
|
318 |
-
valid_mask = (position != 1).all(dim=2, keepdim=True)
|
319 |
-
valid_mask = valid_mask.expand_as(position)
|
320 |
-
position[valid_mask==False] = 0
|
321 |
-
|
322 |
-
|
323 |
-
position = rearrange(
|
324 |
-
position,
|
325 |
-
'b n c (num_h grid_h) (num_w grid_w) -> b n num_h num_w c grid_h grid_w',
|
326 |
-
num_h=grid_resolution, num_w=grid_resolution
|
327 |
-
)
|
328 |
-
valid_mask = rearrange(
|
329 |
-
valid_mask,
|
330 |
-
'b n c (num_h grid_h) (num_w grid_w) -> b n num_h num_w c grid_h grid_w',
|
331 |
-
num_h=grid_resolution, num_w=grid_resolution
|
332 |
-
)
|
333 |
-
|
334 |
-
grid_position = position.sum(dim=(-2, -1))
|
335 |
-
count_masked = valid_mask.sum(dim=(-2, -1))
|
336 |
-
|
337 |
-
grid_position = grid_position / count_masked.clamp(min=1)
|
338 |
-
grid_position[count_masked<5] = 0
|
339 |
-
|
340 |
-
grid_position = grid_position.permute(0,1,4,2,3)
|
341 |
-
grid_position = rearrange(grid_position, 'b n c h w -> b n (h w) c')
|
342 |
-
|
343 |
-
grid_position_expanded_1 = grid_position.unsqueeze(2).unsqueeze(4) # 形状变为 B, N, 1, L, 1, 3
|
344 |
-
grid_position_expanded_2 = grid_position.unsqueeze(1).unsqueeze(3) # 形状变为 B, 1, N, 1, L, 3
|
345 |
-
|
346 |
-
# 计算欧氏距离
|
347 |
-
distances = torch.norm(grid_position_expanded_1 - grid_position_expanded_2, dim=-1) # 形状为 B, N, N, L, L
|
348 |
-
|
349 |
-
weights = distances
|
350 |
-
grid_distance = 1.73/grid_resolution
|
351 |
-
|
352 |
-
#weights = weights*-32
|
353 |
-
#weights = weights.clamp(min=-10000.0)
|
354 |
-
|
355 |
-
weights = weights< grid_distance
|
356 |
-
|
357 |
-
return weights
|
358 |
-
|
359 |
-
def compute_multi_resolution_mask(position_maps, grid_resolutions=[32, 16, 8]):
|
360 |
-
position_attn_mask = {}
|
361 |
-
with torch.no_grad():
|
362 |
-
for grid_resolution in grid_resolutions:
|
363 |
-
position_mask = compute_voxel_grid_mask(position_maps, grid_resolution)
|
364 |
-
position_mask = rearrange(position_mask, 'b ni nj li lj -> b (ni li) (nj lj)')
|
365 |
-
position_attn_mask[position_mask.shape[1]] = position_mask
|
366 |
-
return position_attn_mask
|
367 |
-
|
368 |
-
@torch.no_grad()
|
369 |
-
def compute_discrete_voxel_indice(position, grid_resolution=8, voxel_resolution=128):
|
370 |
-
|
371 |
-
position = position.half()
|
372 |
-
B,N,_,H,W = position.shape
|
373 |
-
assert H%grid_resolution==0 and W%grid_resolution==0
|
374 |
-
|
375 |
-
valid_mask = (position != 1).all(dim=2, keepdim=True)
|
376 |
-
valid_mask = valid_mask.expand_as(position)
|
377 |
-
position[valid_mask==False] = 0
|
378 |
-
|
379 |
-
position = rearrange(
|
380 |
-
position,
|
381 |
-
'b n c (num_h grid_h) (num_w grid_w) -> b n num_h num_w c grid_h grid_w',
|
382 |
-
num_h=grid_resolution, num_w=grid_resolution
|
383 |
-
)
|
384 |
-
valid_mask = rearrange(
|
385 |
-
valid_mask,
|
386 |
-
'b n c (num_h grid_h) (num_w grid_w) -> b n num_h num_w c grid_h grid_w',
|
387 |
-
num_h=grid_resolution, num_w=grid_resolution
|
388 |
-
)
|
389 |
-
|
390 |
-
grid_position = position.sum(dim=(-2, -1))
|
391 |
-
count_masked = valid_mask.sum(dim=(-2, -1))
|
392 |
-
|
393 |
-
grid_position = grid_position / count_masked.clamp(min=1)
|
394 |
-
grid_position[count_masked<5] = 0
|
395 |
-
|
396 |
-
grid_position = grid_position.permute(0,1,4,2,3).clamp(0, 1) # B N C H W
|
397 |
-
voxel_indices = grid_position * (voxel_resolution - 1)
|
398 |
-
voxel_indices = torch.round(voxel_indices).long()
|
399 |
-
return voxel_indices
|
400 |
-
|
401 |
-
def compute_multi_resolution_discrete_voxel_indice(
|
402 |
-
position_maps,
|
403 |
-
grid_resolutions=[64, 32, 16, 8],
|
404 |
-
voxel_resolutions=[512, 256, 128, 64]
|
405 |
-
):
|
406 |
-
voxel_indices = {}
|
407 |
-
with torch.no_grad():
|
408 |
-
for grid_resolution, voxel_resolution in zip(grid_resolutions, voxel_resolutions):
|
409 |
-
voxel_indice = compute_discrete_voxel_indice(position_maps, grid_resolution, voxel_resolution)
|
410 |
-
voxel_indice = rearrange(voxel_indice, 'b n c h w -> b (n h w) c')
|
411 |
-
voxel_indices[voxel_indice.shape[1]] = {'voxel_indices':voxel_indice, 'voxel_resolution':voxel_resolution}
|
412 |
-
return voxel_indices
|
413 |
-
|
414 |
-
class UNet2p5DConditionModel(torch.nn.Module):
|
415 |
-
def __init__(self, unet: UNet2DConditionModel) -> None:
|
416 |
-
super().__init__()
|
417 |
-
self.unet = unet
|
418 |
-
|
419 |
-
self.use_ma = True
|
420 |
-
self.use_ra = True
|
421 |
-
self.use_camera_embedding = True
|
422 |
-
self.use_dual_stream = True
|
423 |
-
self.is_turbo = False
|
424 |
-
|
425 |
-
if self.use_dual_stream:
|
426 |
-
self.unet_dual = copy.deepcopy(unet)
|
427 |
-
self.init_attention(self.unet_dual)
|
428 |
-
self.init_attention(self.unet, use_ma=self.use_ma, use_ra=self.use_ra, is_turbo=self.is_turbo)
|
429 |
-
self.init_condition()
|
430 |
-
self.init_camera_embedding()
|
431 |
-
|
432 |
-
@staticmethod
|
433 |
-
def from_pretrained(pretrained_model_name_or_path, **kwargs):
|
434 |
-
torch_dtype = kwargs.pop('torch_dtype', torch.float32)
|
435 |
-
config_path = os.path.join(pretrained_model_name_or_path, 'config.json')
|
436 |
-
unet_ckpt_path = os.path.join(pretrained_model_name_or_path, 'diffusion_pytorch_model.bin')
|
437 |
-
with open(config_path, 'r', encoding='utf-8') as file:
|
438 |
-
config = json.load(file)
|
439 |
-
unet = UNet2DConditionModel(**config)
|
440 |
-
unet = UNet2p5DConditionModel(unet)
|
441 |
-
unet_ckpt = torch.load(unet_ckpt_path, map_location='cpu', weights_only=True)
|
442 |
-
unet.load_state_dict(unet_ckpt, strict=True)
|
443 |
-
unet = unet.to(torch_dtype)
|
444 |
-
return unet
|
445 |
-
|
446 |
-
def init_condition(self):
|
447 |
-
self.unet.conv_in = torch.nn.Conv2d(
|
448 |
-
12,
|
449 |
-
self.unet.conv_in.out_channels,
|
450 |
-
kernel_size=self.unet.conv_in.kernel_size,
|
451 |
-
stride=self.unet.conv_in.stride,
|
452 |
-
padding=self.unet.conv_in.padding,
|
453 |
-
dilation=self.unet.conv_in.dilation,
|
454 |
-
groups=self.unet.conv_in.groups,
|
455 |
-
bias=self.unet.conv_in.bias is not None)
|
456 |
-
|
457 |
-
self.unet.learned_text_clip_gen = nn.Parameter(torch.randn(1, 77, 1024))
|
458 |
-
self.unet.learned_text_clip_ref = nn.Parameter(torch.randn(1, 77, 1024))
|
459 |
-
|
460 |
-
def init_camera_embedding(self):
|
461 |
-
|
462 |
-
if self.use_camera_embedding:
|
463 |
-
time_embed_dim = 1280
|
464 |
-
self.max_num_ref_image = 5
|
465 |
-
self.max_num_gen_image = 12 * 3 + 4 * 2
|
466 |
-
self.unet.class_embedding = nn.Embedding(self.max_num_ref_image + self.max_num_gen_image, time_embed_dim)
|
467 |
-
|
468 |
-
def init_attention(self, unet, use_ma=False, use_ra=False, is_turbo=False):
|
469 |
-
|
470 |
-
for down_block_i, down_block in enumerate(unet.down_blocks):
|
471 |
-
if hasattr(down_block, "has_cross_attention") and down_block.has_cross_attention:
|
472 |
-
                for attn_i, attn in enumerate(down_block.attentions):
                    for transformer_i, transformer in enumerate(attn.transformer_blocks):
                        if isinstance(transformer, BasicTransformerBlock):
                            attn.transformer_blocks[transformer_i] = Basic2p5DTransformerBlock(
                                transformer,
                                f'down_{down_block_i}_{attn_i}_{transformer_i}',
                                use_ma, use_ra, is_turbo
                            )

        if hasattr(unet.mid_block, "has_cross_attention") and unet.mid_block.has_cross_attention:
            for attn_i, attn in enumerate(unet.mid_block.attentions):
                for transformer_i, transformer in enumerate(attn.transformer_blocks):
                    if isinstance(transformer, BasicTransformerBlock):
                        attn.transformer_blocks[transformer_i] = Basic2p5DTransformerBlock(
                            transformer,
                            f'mid_{attn_i}_{transformer_i}',
                            use_ma, use_ra, is_turbo
                        )

        for up_block_i, up_block in enumerate(unet.up_blocks):
            if hasattr(up_block, "has_cross_attention") and up_block.has_cross_attention:
                for attn_i, attn in enumerate(up_block.attentions):
                    for transformer_i, transformer in enumerate(attn.transformer_blocks):
                        if isinstance(transformer, BasicTransformerBlock):
                            attn.transformer_blocks[transformer_i] = Basic2p5DTransformerBlock(
                                transformer,
                                f'up_{up_block_i}_{attn_i}_{transformer_i}',
                                use_ma, use_ra, is_turbo
                            )

    def __getattr__(self, name: str):
        try:
            return super().__getattr__(name)
        except AttributeError:
            return getattr(self.unet, name)

    def forward(
        self, sample, timestep, encoder_hidden_states,
        *args, down_intrablock_additional_residuals=None,
        down_block_res_samples=None, mid_block_res_sample=None,
        **cached_condition,
    ):
        B, N_gen, _, H, W = sample.shape
        assert H == W

        if self.use_camera_embedding:
            camera_info_gen = cached_condition['camera_info_gen'] + self.max_num_ref_image
            camera_info_gen = rearrange(camera_info_gen, 'b n -> (b n)')
        else:
            camera_info_gen = None

        sample = [sample]
        if 'normal_imgs' in cached_condition:
            sample.append(cached_condition["normal_imgs"])
        if 'position_imgs' in cached_condition:
            sample.append(cached_condition["position_imgs"])
        sample = torch.cat(sample, dim=2)

        sample = rearrange(sample, 'b n c h w -> (b n) c h w')

        encoder_hidden_states_gen = encoder_hidden_states.unsqueeze(1).repeat(1, N_gen, 1, 1)
        encoder_hidden_states_gen = rearrange(encoder_hidden_states_gen, 'b n l c -> (b n) l c')

        if self.use_ra:
            if 'condition_embed_dict' in cached_condition:
                condition_embed_dict = cached_condition['condition_embed_dict']
            else:
                condition_embed_dict = {}
                ref_latents = cached_condition['ref_latents']
                N_ref = ref_latents.shape[1]
                if self.use_camera_embedding:
                    camera_info_ref = cached_condition['camera_info_ref']
                    camera_info_ref = rearrange(camera_info_ref, 'b n -> (b n)')
                else:
                    camera_info_ref = None

                ref_latents = rearrange(ref_latents, 'b n c h w -> (b n) c h w')

                encoder_hidden_states_ref = self.unet.learned_text_clip_ref.unsqueeze(1).repeat(B, N_ref, 1, 1)
                encoder_hidden_states_ref = rearrange(encoder_hidden_states_ref, 'b n l c -> (b n) l c')

                noisy_ref_latents = ref_latents
                timestep_ref = 0

                if self.use_dual_stream:
                    unet_ref = self.unet_dual
                else:
                    unet_ref = self.unet
                unet_ref(
                    noisy_ref_latents, timestep_ref,
                    encoder_hidden_states=encoder_hidden_states_ref,
                    class_labels=camera_info_ref,
                    # **kwargs
                    return_dict=False,
                    cross_attention_kwargs={
                        'mode': 'w', 'num_in_batch': N_ref,
                        'condition_embed_dict': condition_embed_dict},
                )
                cached_condition['condition_embed_dict'] = condition_embed_dict
        else:
            condition_embed_dict = None

        mva_scale = cached_condition.get('mva_scale', 1.0)
        ref_scale = cached_condition.get('ref_scale', 1.0)

        if self.is_turbo:
            cross_attention_kwargs_ = {
                'mode': 'r', 'num_in_batch': N_gen,
                'condition_embed_dict': condition_embed_dict,
                'position_attn_mask': position_attn_mask,
                'position_voxel_indices': position_voxel_indices,
                'mva_scale': mva_scale,
                'ref_scale': ref_scale,
            }
        else:
            cross_attention_kwargs_ = {
                'mode': 'r', 'num_in_batch': N_gen,
                'condition_embed_dict': condition_embed_dict,
                'mva_scale': mva_scale,
                'ref_scale': ref_scale,
            }
        return self.unet(
            sample, timestep,
            encoder_hidden_states_gen, *args,
            class_labels=camera_info_gen,
            down_intrablock_additional_residuals=[
                sample.to(dtype=self.unet.dtype) for sample in down_intrablock_additional_residuals
            ] if down_intrablock_additional_residuals is not None else None,
            down_block_additional_residuals=[
                sample.to(dtype=self.unet.dtype) for sample in down_block_res_samples
            ] if down_block_res_samples is not None else None,
            mid_block_additional_residual=(
                mid_block_res_sample.to(dtype=self.unet.dtype)
                if mid_block_res_sample is not None else None
            ),
            return_dict=False,
            cross_attention_kwargs=cross_attention_kwargs_,
        )
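The deleted forward() above folds the view axis of each (B, N, C, H, W) input into the batch axis before calling the wrapped 2D UNet, and repeats the scene-level text embedding once per generated view. Below is a minimal sketch of that shape convention only; it is my own illustration, not code from the deleted file, and the tensor sizes are hypothetical.

import torch
from einops import rearrange

B, N_gen, C, H, W = 2, 6, 4, 64, 64   # hypothetical batch, views, latent channels, resolution
L, D = 77, 1024                       # hypothetical text-embedding length and width

sample = torch.randn(B, N_gen, C, H, W)
encoder_hidden_states = torch.randn(B, L, D)

# Fold the view axis into the batch axis, as the deleted forward() did.
sample_flat = rearrange(sample, 'b n c h w -> (b n) c h w')            # (B*N_gen, C, H, W)

# Repeat the per-scene conditioning once per generated view.
ehs_gen = encoder_hidden_states.unsqueeze(1).repeat(1, N_gen, 1, 1)
ehs_gen = rearrange(ehs_gen, 'b n l c -> (b n) l c')                   # (B*N_gen, L, D)

assert sample_flat.shape == (B * N_gen, C, H, W)
assert ehs_gen.shape == (B * N_gen, L, D)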
hunyuan3d-paint-v2-0-turbo/vae/config.json
DELETED
@@ -1,29 +0,0 @@
{
  "_class_name": "AutoencoderKL",
  "_diffusers_version": "0.10.0.dev0",
  "act_fn": "silu",
  "block_out_channels": [
    128,
    256,
    512,
    512
  ],
  "down_block_types": [
    "DownEncoderBlock2D",
    "DownEncoderBlock2D",
    "DownEncoderBlock2D",
    "DownEncoderBlock2D"
  ],
  "in_channels": 3,
  "latent_channels": 4,
  "layers_per_block": 2,
  "norm_num_groups": 32,
  "out_channels": 3,
  "sample_size": 768,
  "up_block_types": [
    "UpDecoderBlock2D",
    "UpDecoderBlock2D",
    "UpDecoderBlock2D",
    "UpDecoderBlock2D"
  ]
}
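The config above describes a standard diffusers AutoencoderKL (4 latent channels, 768-pixel sample size). As a reference point, such a config can be instantiated without the (also deleted) weights; this is a minimal sketch assuming diffusers is installed and the JSON above is saved as ./vae/config.json, a hypothetical local path.

from diffusers import AutoencoderKL

# Build the module from the config alone; no pretrained weights are loaded.
vae_config = AutoencoderKL.load_config("./vae")
vae = AutoencoderKL.from_config(vae_config)
print(vae.config.latent_channels)  # 4, per the config above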
hunyuan3d-paint-v2-0-turbo/vae/diffusion_pytorch_model.bin
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1b4889b6b1d4ce7ae320a02dedaeff1780ad77d415ea0d744b476155c6377ddc
size 334707217
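The deleted file above is a Git LFS pointer rather than the weights themselves; its oid is the SHA-256 of the actual blob. A minimal sketch (my own illustration, with a hypothetical local filename) of checking a downloaded file against such a pointer:

import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    # Stream the file so large checkpoints need not fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            digest.update(block)
    return digest.hexdigest()

# sha256_of("diffusion_pytorch_model.bin") should equal the oid recorded above.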
hunyuan3d-vae-v2-0-turbo/config.yaml
DELETED
@@ -1,15 +0,0 @@
target: hy3dgen.shapegen.models.ShapeVAE
params:
  num_latents: 3072
  embed_dim: 64
  num_freqs: 8
  include_pi: false
  heads: 16
  width: 1024
  num_decoder_layers: 16
  qkv_bias: false
  qk_norm: true
  scale_factor: 0.9990943042622529
  geo_decoder_mlp_expand_ratio: 1
  geo_decoder_downsample_ratio: 2
  geo_decoder_ln_post: false
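Configs in this target/params form are conventionally turned into objects by importing the target class and passing the params block as keyword arguments. A minimal sketch of that pattern; the helper name is my own, and it assumes the YAML above is saved as config.yaml and the hy3dgen package is importable.

import importlib

def instantiate_from_config(cfg: dict):
    # Resolve "package.module.ClassName" and construct it with the params block.
    module_path, class_name = cfg["target"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**cfg.get("params", {}))

# import yaml
# with open("config.yaml") as f:
#     shape_vae = instantiate_from_config(yaml.safe_load(f))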
hunyuan3d-vae-v2-0-turbo/model.fp16.ckpt
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bea74b1d912e245510b062ac62d938e5312760c07b122eadd0199e325e3d5343
size 407480196
hunyuan3d-vae-v2-0-turbo/model.fp16.safetensors
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:66b7af697118db2de00c22f63a385c75f9049709739281971837af9e4ce31cb4
size 407410402
hunyuan3d-vae-v2-0/config.yaml
DELETED
@@ -1,15 +0,0 @@
target: hy3dgen.shapegen.models.ShapeVAE
params:
  num_latents: 3072
  embed_dim: 64
  num_freqs: 8
  include_pi: false
  heads: 16
  width: 1024
  num_decoder_layers: 16
  qkv_bias: false
  qk_norm: true
  scale_factor: 0.9990943042622529
  geo_decoder_mlp_expand_ratio: 4
  geo_decoder_downsample_ratio: 1
  geo_decoder_ln_post: true
hunyuan3d-vae-v2-0/model.fp16.ckpt
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5fc24dcac763a28d09f49dfe5ef71cf129f2e7193e75a8753d5f8b1e4bd526e2
size 428526216
hunyuan3d-vae-v2-0/model.fp16.safetensors
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:abbad221083fbde12d7e8afc69243de7cee85abc806368ba2338fc5736be7341
size 428455666