Image-to-3D
Hunyuan3D-2
Diffusers
Safetensors
English
Chinese
text-to-3d
Files changed (34)
  1. LICENSE +1 -2
  2. NOTICE +3 -3
  3. README.md +1 -1
  4. hunyuan3d-dit-v2-0-turbo/config.yaml +0 -70
  5. hunyuan3d-dit-v2-0-turbo/model.fp16.ckpt +0 -3
  6. hunyuan3d-dit-v2-0-turbo/model.fp16.safetensors +0 -3
  7. hunyuan3d-dit-v2-0/model.fp16.ckpt +0 -3
  8. hunyuan3d-dit-v2-0/model.fp16.safetensors +0 -3
  9. hunyuan3d-paint-v2-0-turbo/.gitattributes +0 -35
  10. hunyuan3d-paint-v2-0-turbo/README.md +0 -53
  11. hunyuan3d-paint-v2-0-turbo/feature_extractor/preprocessor_config.json +0 -20
  12. hunyuan3d-paint-v2-0-turbo/image_encoder/config.json +0 -23
  13. hunyuan3d-paint-v2-0-turbo/image_encoder/model.safetensors +0 -3
  14. hunyuan3d-paint-v2-0-turbo/image_encoder/preprocessor_config.json +0 -27
  15. hunyuan3d-paint-v2-0-turbo/model_index.json +0 -37
  16. hunyuan3d-paint-v2-0-turbo/scheduler/scheduler_config.json +0 -15
  17. hunyuan3d-paint-v2-0-turbo/text_encoder/config.json +0 -25
  18. hunyuan3d-paint-v2-0-turbo/text_encoder/pytorch_model.bin +0 -3
  19. hunyuan3d-paint-v2-0-turbo/tokenizer/merges.txt +0 -0
  20. hunyuan3d-paint-v2-0-turbo/tokenizer/special_tokens_map.json +0 -24
  21. hunyuan3d-paint-v2-0-turbo/tokenizer/tokenizer_config.json +0 -34
  22. hunyuan3d-paint-v2-0-turbo/tokenizer/vocab.json +0 -0
  23. hunyuan3d-paint-v2-0-turbo/unet/config.json +0 -45
  24. hunyuan3d-paint-v2-0-turbo/unet/diffusion_pytorch_model.bin +0 -3
  25. hunyuan3d-paint-v2-0-turbo/unet/diffusion_pytorch_model.safetensors +0 -3
  26. hunyuan3d-paint-v2-0-turbo/unet/modules.py +0 -610
  27. hunyuan3d-paint-v2-0-turbo/vae/config.json +0 -29
  28. hunyuan3d-paint-v2-0-turbo/vae/diffusion_pytorch_model.bin +0 -3
  29. hunyuan3d-vae-v2-0-turbo/config.yaml +0 -15
  30. hunyuan3d-vae-v2-0-turbo/model.fp16.ckpt +0 -3
  31. hunyuan3d-vae-v2-0-turbo/model.fp16.safetensors +0 -3
  32. hunyuan3d-vae-v2-0/config.yaml +0 -15
  33. hunyuan3d-vae-v2-0/model.fp16.ckpt +0 -3
  34. hunyuan3d-vae-v2-0/model.fp16.safetensors +0 -3
LICENSE CHANGED
@@ -11,8 +11,7 @@ e. “Licensee,” “You” or “Your” shall mean a natural person or legal
11
  f. “Materials” shall mean, collectively, Tencent’s proprietary Tencent Hunyuan 3D 2.0 and Documentation (and any portion thereof) as made available by Tencent under this Agreement.
12
  g. “Model Derivatives” shall mean all: (i) modifications to Tencent Hunyuan 3D 2.0 or any Model Derivative of Tencent Hunyuan 3D 2.0; (ii) works based on Tencent Hunyuan 3D 2.0 or any Model Derivative of Tencent Hunyuan 3D 2.0; or (iii) any other machine learning model which is created by transfer of patterns of the weights, parameters, operations, or Output of Tencent Hunyuan 3D 2.0 or any Model Derivative of Tencent Hunyuan 3D 2.0, to that model in order to cause that model to perform similarly to Tencent Hunyuan 3D 2.0 or a Model Derivative of Tencent Hunyuan 3D 2.0, including distillation methods, methods that use intermediate data representations, or methods based on the generation of synthetic data Outputs by Tencent Hunyuan 3D 2.0 or a Model Derivative of Tencent Hunyuan 3D 2.0 for training that model. For clarity, Outputs by themselves are not deemed Model Derivatives.
13
  h. “Output” shall mean the information and/or content output of Tencent Hunyuan 3D 2.0 or a Model Derivative that results from operating or otherwise using Tencent Hunyuan 3D 2.0 or a Model Derivative, including via a Hosted Service.
14
- i. “Tencent,” “We” or “Us” shall mean the applicable entity or entities in the Tencent corporate family that own(s) intellectual property or other rights embodied in or utilized by the Materials.
15
- * Section 1.i of the previous Hunyuan License Agreement defined “Tencent,” “We” or “Us” to mean THL A29 Limited, and the copyright notices pertaining to the Materials were previously in the name of “THL A29 Limited.” That entity has now been de-registered. You should treat all previously distributed copies of the Materials as if Section 1.i of the Agreement defined “Tencent,” “We” or “Us” to mean “the applicable entity or entities in the Tencent corporate family that own(s) intellectual property or other rights embodied in or utilized by the Materials,” and treat the copyright notice(s) accompanying the Materials as if they were in the name of “Tencent.” When providing a copy of any Agreement to Third Party recipients of the Tencent Hunyuan Works or products or services using them, as required by Section 3.a of the Agreement, you should provide the most current version of the Agreement, including the change of definition in Section 1.i of the Agreement.
16
  j. “Tencent Hunyuan 3D 2.0” shall mean the 3D generation models and their software and algorithms, including trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing made publicly available by Us at https://github.com/Tencent/Hunyuan3D-2.
17
  k. “Tencent Hunyuan 3D 2.0 Works” shall mean: (i) the Materials; (ii) Model Derivatives; and (iii) all derivative works thereof.
18
  l. “Territory” shall mean the worldwide territory, excluding the territory of the European Union, United Kingdom and South Korea.
 
11
  f. “Materials” shall mean, collectively, Tencent’s proprietary Tencent Hunyuan 3D 2.0 and Documentation (and any portion thereof) as made available by Tencent under this Agreement.
12
  g. “Model Derivatives” shall mean all: (i) modifications to Tencent Hunyuan 3D 2.0 or any Model Derivative of Tencent Hunyuan 3D 2.0; (ii) works based on Tencent Hunyuan 3D 2.0 or any Model Derivative of Tencent Hunyuan 3D 2.0; or (iii) any other machine learning model which is created by transfer of patterns of the weights, parameters, operations, or Output of Tencent Hunyuan 3D 2.0 or any Model Derivative of Tencent Hunyuan 3D 2.0, to that model in order to cause that model to perform similarly to Tencent Hunyuan 3D 2.0 or a Model Derivative of Tencent Hunyuan 3D 2.0, including distillation methods, methods that use intermediate data representations, or methods based on the generation of synthetic data Outputs by Tencent Hunyuan 3D 2.0 or a Model Derivative of Tencent Hunyuan 3D 2.0 for training that model. For clarity, Outputs by themselves are not deemed Model Derivatives.
13
  h. “Output” shall mean the information and/or content output of Tencent Hunyuan 3D 2.0 or a Model Derivative that results from operating or otherwise using Tencent Hunyuan 3D 2.0 or a Model Derivative, including via a Hosted Service.
14
+ i. “Tencent,” “We” or “Us” shall mean THL A29 Limited.
 
15
  j. “Tencent Hunyuan 3D 2.0” shall mean the 3D generation models and their software and algorithms, including trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing made publicly available by Us at https://github.com/Tencent/Hunyuan3D-2.
16
  k. “Tencent Hunyuan 3D 2.0 Works” shall mean: (i) the Materials; (ii) Model Derivatives; and (iii) all derivative works thereof.
17
  l. “Territory” shall mean the worldwide territory, excluding the territory of the European Union, United Kingdom and South Korea.
NOTICE CHANGED
@@ -2,7 +2,7 @@ Usage and Legal Notices:
2
 
3
  Tencent is pleased to support the open source community by making Hunyuan 3D 2.0 available.
4
 
5
- Copyright (C) 2025 Tencent. All rights reserved. The below software and/or models in this distribution may have been modified by Tencent ("Tencent Modifications"). All Tencent Modifications are Copyright (C) Tencent.
6
 
7
  Hunyuan 3D 2.0 is licensed under the TENCENT HUNYUAN 3D 2.0 COMMUNITY LICENSE AGREEMENT except for the third-party components listed below, which is licensed under different terms. Hunyuan 3D 2.0 does not impose any additional limitations beyond what is outlined in the respective licenses of these third-party components. Users must comply with all terms and conditions of original licenses of these third-party components and must ensure that the usage of the third party components adheres to all relevant laws and regulations.
8
 
@@ -126,7 +126,7 @@ You agree not to use the Model or Derivatives of the Model:
126
  Open Source Model Licensed under the TENCENT HUNYUAN COMMUNITY LICENSE AGREEMENT and Other Licenses of the Third-Party Components therein:
127
  --------------------------------------------------------------------
128
  1. HunyuanDiT
129
- Copyright (C) 2024 Tencent. All rights reserved.
130
 
131
 
132
  Terms of the TENCENT HUNYUAN COMMUNITY LICENSE AGREEMENT:
@@ -143,7 +143,7 @@ e. “Licensee,” “You” or “Your” shall mean a natural person or legal
143
  f. “Materials” shall mean, collectively, Tencent’s proprietary Tencent Hunyuan and Documentation (and any portion thereof) as made available by Tencent under this Agreement.
144
  g. “Model Derivatives” shall mean all: (i) modifications to Tencent Hunyuan or any Model Derivative of Tencent Hunyuan; (ii) works based on Tencent Hunyuan or any Model Derivative of Tencent Hunyuan; or (iii) any other machine learning model which is created by transfer of patterns of the weights, parameters, operations, or Output of Tencent Hunyuan or any Model Derivative of Tencent Hunyuan, to that model in order to cause that model to perform similarly to Tencent Hunyuan or a Model Derivative of Tencent Hunyuan, including distillation methods, methods that use intermediate data representations, or methods based on the generation of synthetic data Outputs by Tencent Hunyuan or a Model Derivative of Tencent Hunyuan for training that model. For clarity, Outputs by themselves are not deemed Model Derivatives.
145
  h. “Output” shall mean the information and/or content output of Tencent Hunyuan or a Model Derivative that results from operating or otherwise using Tencent Hunyuan or a Model Derivative, including via a Hosted Service.
146
- i. “Tencent,” “We” or “Us” shall mean Tencent.
147
  j. “Tencent Hunyuan” shall mean the large language models, image/video/audio/3D generation models, and multimodal large language models and their software and algorithms, including trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing made publicly available by Us at https://huggingface.co/Tencent-Hunyuan/HunyuanDiT and https://github.com/Tencent/HunyuanDiT .
148
  k. “Tencent Hunyuan Works” shall mean: (i) the Materials; (ii) Model Derivatives; and (iii) all derivative works thereof.
149
  l. “Third Party” or “Third Parties” shall mean individuals or legal entities that are not under common control with Us or You.
 
2
 
3
  Tencent is pleased to support the open source community by making Hunyuan 3D 2.0 available.
4
 
5
+ Copyright (C) 2025 THL A29 Limited, a Tencent company. All rights reserved. The below software and/or models in this distribution may have been modified by THL A29 Limited ("Tencent Modifications"). All Tencent Modifications are Copyright (C) THL A29 Limited.
6
 
7
  Hunyuan 3D 2.0 is licensed under the TENCENT HUNYUAN 3D 2.0 COMMUNITY LICENSE AGREEMENT except for the third-party components listed below, which is licensed under different terms. Hunyuan 3D 2.0 does not impose any additional limitations beyond what is outlined in the respective licenses of these third-party components. Users must comply with all terms and conditions of original licenses of these third-party components and must ensure that the usage of the third party components adheres to all relevant laws and regulations.
8
 
 
126
  Open Source Model Licensed under the TENCENT HUNYUAN COMMUNITY LICENSE AGREEMENT and Other Licenses of the Third-Party Components therein:
127
  --------------------------------------------------------------------
128
  1. HunyuanDiT
129
+ Copyright (C) 2024 THL A29 Limited, a Tencent company. All rights reserved.
130
 
131
 
132
  Terms of the TENCENT HUNYUAN COMMUNITY LICENSE AGREEMENT:
 
143
  f. “Materials” shall mean, collectively, Tencent’s proprietary Tencent Hunyuan and Documentation (and any portion thereof) as made available by Tencent under this Agreement.
144
  g. “Model Derivatives” shall mean all: (i) modifications to Tencent Hunyuan or any Model Derivative of Tencent Hunyuan; (ii) works based on Tencent Hunyuan or any Model Derivative of Tencent Hunyuan; or (iii) any other machine learning model which is created by transfer of patterns of the weights, parameters, operations, or Output of Tencent Hunyuan or any Model Derivative of Tencent Hunyuan, to that model in order to cause that model to perform similarly to Tencent Hunyuan or a Model Derivative of Tencent Hunyuan, including distillation methods, methods that use intermediate data representations, or methods based on the generation of synthetic data Outputs by Tencent Hunyuan or a Model Derivative of Tencent Hunyuan for training that model. For clarity, Outputs by themselves are not deemed Model Derivatives.
145
  h. “Output” shall mean the information and/or content output of Tencent Hunyuan or a Model Derivative that results from operating or otherwise using Tencent Hunyuan or a Model Derivative, including via a Hosted Service.
146
+ i. “Tencent,” “We” or “Us” shall mean THL A29 Limited.
147
  j. “Tencent Hunyuan” shall mean the large language models, image/video/audio/3D generation models, and multimodal large language models and their software and algorithms, including trained model weights, parameters (including optimizer states), machine-learning model code, inference-enabling code, training-enabling code, fine-tuning enabling code and other elements of the foregoing made publicly available by Us at https://huggingface.co/Tencent-Hunyuan/HunyuanDiT and https://github.com/Tencent/HunyuanDiT .
148
  k. “Tencent Hunyuan Works” shall mean: (i) the Materials; (ii) Model Derivatives; and (iii) all derivative works thereof.
149
  l. “Third Party” or “Third Parties” shall mean individuals or legal entities that are not under common control with Us or You.
README.md CHANGED
@@ -153,7 +153,7 @@ pipeline = Hunyuan3DPaintPipeline.from_pretrained('tencent/Hunyuan3D-2')
153
  mesh = pipeline(mesh, image='assets/demo.png')
154
  ```
155
 
156
- Please visit [minimal_demo.py](https://github.com/Tencent/Hunyuan3D-2/blob/main/minimal_demo.py) for more advanced usage, such as **text to 3D** and **texture generation
157
  for handcrafted mesh**.
158
 
159
  ### Gradio App
 
153
  mesh = pipeline(mesh, image='assets/demo.png')
154
  ```
155
 
156
+ Please visit [minimal_demo.py](minimal_demo.py) for more advanced usage, such as **text to 3D** and **texture generation
157
  for handcrafted mesh**.
158
 
159
  ### Gradio App
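The README snippet above covers only the texture step. As a minimal end-to-end image-to-3D sketch (assuming the `hy3dgen` package from the GitHub repo is installed; the `hy3dgen.shapegen`/`hy3dgen.texgen` import paths follow the pipeline target listed in `hunyuan3d-dit-v2-0-turbo/config.yaml` below and the project README, and the returned mesh is assumed to be a trimesh-compatible object):

```python
from hy3dgen.shapegen import Hunyuan3DDiTFlowMatchingPipeline
from hy3dgen.texgen import Hunyuan3DPaintPipeline

# 1. Image -> untextured mesh (shape generation)
shape_pipeline = Hunyuan3DDiTFlowMatchingPipeline.from_pretrained('tencent/Hunyuan3D-2')
mesh = shape_pipeline(image='assets/demo.png')[0]

# 2. Untextured mesh + reference image -> textured mesh (texture generation)
paint_pipeline = Hunyuan3DPaintPipeline.from_pretrained('tencent/Hunyuan3D-2')
mesh = paint_pipeline(mesh, image='assets/demo.png')
mesh.export('demo_textured.glb')  # export assumes a trimesh-style mesh object
```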
hunyuan3d-dit-v2-0-turbo/config.yaml DELETED
@@ -1,70 +0,0 @@
1
- model:
2
- target: hy3dgen.shapegen.models.Hunyuan3DDiT
3
- params:
4
- in_channels: 64
5
- context_in_dim: 1536
6
- hidden_size: 1024
7
- mlp_ratio: 4.0
8
- num_heads: 16
9
- depth: 16
10
- depth_single_blocks: 32
11
- axes_dim: [ 64 ]
12
- theta: 10000
13
- qkv_bias: true
14
- guidance_embed: true
15
-
16
- vae:
17
- target: hy3dgen.shapegen.models.ShapeVAE
18
- params:
19
- num_latents: 3072
20
- embed_dim: 64
21
- num_freqs: 8
22
- include_pi: false
23
- heads: 16
24
- width: 1024
25
- num_decoder_layers: 16
26
- qkv_bias: false
27
- qk_norm: true
28
- scale_factor: 0.9990943042622529
29
-
30
- conditioner:
31
- target: hy3dgen.shapegen.models.SingleImageEncoder
32
- params:
33
- main_image_encoder:
34
- type: DinoImageEncoder # dino giant
35
- kwargs:
36
- config:
37
- attention_probs_dropout_prob: 0.0
38
- drop_path_rate: 0.0
39
- hidden_act: gelu
40
- hidden_dropout_prob: 0.0
41
- hidden_size: 1536
42
- image_size: 518
43
- initializer_range: 0.02
44
- layer_norm_eps: 1.e-6
45
- layerscale_value: 1.0
46
- mlp_ratio: 4
47
- model_type: dinov2
48
- num_attention_heads: 24
49
- num_channels: 3
50
- num_hidden_layers: 40
51
- patch_size: 14
52
- qkv_bias: true
53
- torch_dtype: float32
54
- use_swiglu_ffn: true
55
- image_size: 518
56
-
57
- scheduler:
58
- target: hy3dgen.shapegen.schedulers.ConsistencyFlowMatchEulerDiscreteScheduler
59
- params:
60
- num_train_timesteps: 1000
61
- pcm_timesteps: 100
62
-
63
- image_processor:
64
- target: hy3dgen.shapegen.preprocessors.ImageProcessorV2
65
- params:
66
- size: 512
67
- border_ratio: 0.15
68
-
69
- pipeline:
70
- target: hy3dgen.shapegen.pipelines.Hunyuan3DDiTFlowMatchingPipeline
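The config above follows the common `target:`/`params:` convention: each block names a dotted import path plus the keyword arguments it is constructed with. A minimal sketch of how such a block can be resolved (a generic helper for illustration only; the actual loader lives inside `hy3dgen` and may differ, and `hy3dgen` must be importable for the targets to resolve):

```python
import importlib

import yaml


def instantiate_from_config(cfg: dict):
    """Import the class named by cfg['target'] and construct it with cfg['params']."""
    module_name, class_name = cfg["target"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**cfg.get("params", {}))


with open("hunyuan3d-dit-v2-0-turbo/config.yaml") as f:  # local checkout path (assumption)
    config = yaml.safe_load(f)

model = instantiate_from_config(config["model"])  # -> hy3dgen.shapegen.models.Hunyuan3DDiT
vae = instantiate_from_config(config["vae"])      # -> hy3dgen.shapegen.models.ShapeVAE
```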
hunyuan3d-dit-v2-0-turbo/model.fp16.ckpt DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:f04cecad6953ca9644f9e1d3a22cd0abb20665ce8be30fc4409451ce78d622f1
3
- size 4931245140
hunyuan3d-dit-v2-0-turbo/model.fp16.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:5ee5a81e4df08a1c65b79910bf5b145a90376e526794f4607a4d5d068d62f269
3
- size 4930777530
hunyuan3d-dit-v2-0/model.fp16.ckpt DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:39c2a6bf54f5674f2001b763d8e15b773fbda24604b3911544d09846496bc972
3
- size 4928568095
hunyuan3d-dit-v2-0/model.fp16.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:360bc281fc956d4acac0c3d36d5ec0ebf8cdddbf4b8892e894d12419388d479b
3
- size 4928151562
hunyuan3d-paint-v2-0-turbo/.gitattributes DELETED
@@ -1,35 +0,0 @@
1
- *.7z filter=lfs diff=lfs merge=lfs -text
2
- *.arrow filter=lfs diff=lfs merge=lfs -text
3
- *.bin filter=lfs diff=lfs merge=lfs -text
4
- *.bz2 filter=lfs diff=lfs merge=lfs -text
5
- *.ckpt filter=lfs diff=lfs merge=lfs -text
6
- *.ftz filter=lfs diff=lfs merge=lfs -text
7
- *.gz filter=lfs diff=lfs merge=lfs -text
8
- *.h5 filter=lfs diff=lfs merge=lfs -text
9
- *.joblib filter=lfs diff=lfs merge=lfs -text
10
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
- *.model filter=lfs diff=lfs merge=lfs -text
13
- *.msgpack filter=lfs diff=lfs merge=lfs -text
14
- *.npy filter=lfs diff=lfs merge=lfs -text
15
- *.npz filter=lfs diff=lfs merge=lfs -text
16
- *.onnx filter=lfs diff=lfs merge=lfs -text
17
- *.ot filter=lfs diff=lfs merge=lfs -text
18
- *.parquet filter=lfs diff=lfs merge=lfs -text
19
- *.pb filter=lfs diff=lfs merge=lfs -text
20
- *.pickle filter=lfs diff=lfs merge=lfs -text
21
- *.pkl filter=lfs diff=lfs merge=lfs -text
22
- *.pt filter=lfs diff=lfs merge=lfs -text
23
- *.pth filter=lfs diff=lfs merge=lfs -text
24
- *.rar filter=lfs diff=lfs merge=lfs -text
25
- *.safetensors filter=lfs diff=lfs merge=lfs -text
26
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
- *.tar.* filter=lfs diff=lfs merge=lfs -text
28
- *.tar filter=lfs diff=lfs merge=lfs -text
29
- *.tflite filter=lfs diff=lfs merge=lfs -text
30
- *.tgz filter=lfs diff=lfs merge=lfs -text
31
- *.wasm filter=lfs diff=lfs merge=lfs -text
32
- *.xz filter=lfs diff=lfs merge=lfs -text
33
- *.zip filter=lfs diff=lfs merge=lfs -text
34
- *.zst filter=lfs diff=lfs merge=lfs -text
35
- *tfevents* filter=lfs diff=lfs merge=lfs -text
hunyuan3d-paint-v2-0-turbo/README.md DELETED
@@ -1,53 +0,0 @@
1
- ---
2
- license: openrail++
3
- tags:
4
- - stable-diffusion
5
- - text-to-image
6
- ---
7
-
8
- # SD v2.1-base with Zero Terminal SNR (LAION Aesthetic 6+)
9
-
10
- This model is used in [Diffusion Model with Perceptual Loss](https://arxiv.org/abs/2401.00110) paper as the MSE baseline.
11
-
12
- This model is trained using zero terminal SNR schedule following [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://arxiv.org/abs/2305.08891) paper on LAION aesthetic 6+ data.
13
-
14
- This model is finetuned from [stabilityai/stable-diffusion-2-1-base](https://huggingface.co/stabilityai/stable-diffusion-2-1-base).
15
-
16
- This model is meant for research demonstration, not for production use.
17
-
18
- ## Usage
19
-
20
- ```python
21
- from diffusers import StableDiffusionPipeline
22
- prompt = "A young girl smiling"
23
- pipe = StableDiffusionPipeline.from_pretrained("ByteDance/sd2.1-base-zsnr-laionaes6").to("cuda")
24
- pipe(prompt, guidance_scale=7.5, guidance_rescale=0.7).images[0].save("out.jpg")
25
- ```
26
-
27
- ## Related Models
28
-
29
- * [bytedance/sd2.1-base-zsnr-laionaes5](https://huggingface.co/ByteDance/sd2.1-base-zsnr-laionaes5)
30
- * [bytedance/sd2.1-base-zsnr-laionaes6](https://huggingface.co/ByteDance/sd2.1-base-zsnr-laionaes6)
31
- * [bytedance/sd2.1-base-zsnr-laionaes6-perceptual](https://huggingface.co/ByteDance/sd2.1-base-zsnr-laionaes6-perceptual)
32
-
33
-
34
- ## Cite as
35
- ```
36
- @misc{lin2024diffusion,
37
- title={Diffusion Model with Perceptual Loss},
38
- author={Shanchuan Lin and Xiao Yang},
39
- year={2024},
40
- eprint={2401.00110},
41
- archivePrefix={arXiv},
42
- primaryClass={cs.CV}
43
- }
44
-
45
- @misc{lin2023common,
46
- title={Common Diffusion Noise Schedules and Sample Steps are Flawed},
47
- author={Shanchuan Lin and Bingchen Liu and Jiashi Li and Xiao Yang},
48
- year={2023},
49
- eprint={2305.08891},
50
- archivePrefix={arXiv},
51
- primaryClass={cs.CV}
52
- }
53
- ```
hunyuan3d-paint-v2-0-turbo/feature_extractor/preprocessor_config.json DELETED
@@ -1,20 +0,0 @@
1
- {
2
- "crop_size": 224,
3
- "do_center_crop": true,
4
- "do_convert_rgb": true,
5
- "do_normalize": true,
6
- "do_resize": true,
7
- "feature_extractor_type": "CLIPFeatureExtractor",
8
- "image_mean": [
9
- 0.48145466,
10
- 0.4578275,
11
- 0.40821073
12
- ],
13
- "image_std": [
14
- 0.26862954,
15
- 0.26130258,
16
- 0.27577711
17
- ],
18
- "resample": 3,
19
- "size": 224
20
- }
hunyuan3d-paint-v2-0-turbo/image_encoder/config.json DELETED
@@ -1,23 +0,0 @@
1
- {
2
- "_name_or_path": "D:\\.cache\\huggingface\\hub\\models--sudo-ai--zero123plus-v1.1\\snapshots\\36df7de980afd15f80b2e1a4e9a920d7020e2654\\vision_encoder",
3
- "architectures": [
4
- "CLIPVisionModelWithProjection"
5
- ],
6
- "attention_dropout": 0.0,
7
- "dropout": 0.0,
8
- "hidden_act": "gelu",
9
- "hidden_size": 1280,
10
- "image_size": 224,
11
- "initializer_factor": 1.0,
12
- "initializer_range": 0.02,
13
- "intermediate_size": 5120,
14
- "layer_norm_eps": 1e-05,
15
- "model_type": "clip_vision_model",
16
- "num_attention_heads": 16,
17
- "num_channels": 3,
18
- "num_hidden_layers": 32,
19
- "patch_size": 14,
20
- "projection_dim": 1024,
21
- "torch_dtype": "float16",
22
- "transformers_version": "4.36.0"
23
- }
hunyuan3d-paint-v2-0-turbo/image_encoder/model.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:ae616c24393dd1854372b0639e5541666f7521cbe219669255e865cb7f89466a
3
- size 1264217240
hunyuan3d-paint-v2-0-turbo/image_encoder/preprocessor_config.json DELETED
@@ -1,27 +0,0 @@
1
- {
2
- "crop_size": {
3
- "height": 224,
4
- "width": 224
5
- },
6
- "do_center_crop": true,
7
- "do_convert_rgb": true,
8
- "do_normalize": true,
9
- "do_rescale": true,
10
- "do_resize": true,
11
- "image_mean": [
12
- 0.48145466,
13
- 0.4578275,
14
- 0.40821073
15
- ],
16
- "image_processor_type": "CLIPImageProcessor",
17
- "image_std": [
18
- 0.26862954,
19
- 0.26130258,
20
- 0.27577711
21
- ],
22
- "resample": 3,
23
- "rescale_factor": 0.00392156862745098,
24
- "size": {
25
- "shortest_edge": 224
26
- }
27
- }
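The two configs above describe a CLIP vision encoder with a projection head and its preprocessing. A hedged sketch of loading them with `transformers` (the local path is an assumption; point it at a checkout of the repo revision that still contains these files):

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

path = "hunyuan3d-paint-v2-0-turbo/image_encoder"  # local folder (assumption)
encoder = CLIPVisionModelWithProjection.from_pretrained(path, torch_dtype=torch.float16)
processor = CLIPImageProcessor.from_pretrained(path)

image = Image.open("assets/demo.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values.half()
with torch.no_grad():
    image_embeds = encoder(pixel_values=pixel_values).image_embeds  # (1, 1024), matching projection_dim
```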
hunyuan3d-paint-v2-0-turbo/model_index.json DELETED
@@ -1,37 +0,0 @@
1
- {
2
- "_class_name": "StableDiffusionPipeline",
3
- "_diffusers_version": "0.23.1",
4
- "feature_extractor": [
5
- "transformers",
6
- "CLIPImageProcessor"
7
- ],
8
- "requires_safety_checker": false,
9
- "safety_checker": [
10
- null,
11
- null
12
- ],
13
- "scheduler": [
14
- "diffusers",
15
- "DDIMScheduler"
16
- ],
17
- "text_encoder": [
18
- "transformers",
19
- "CLIPTextModel"
20
- ],
21
- "tokenizer": [
22
- "transformers",
23
- "CLIPTokenizer"
24
- ],
25
- "image_encoder": [
26
- "transformers",
27
- "CLIPVisionModelWithProjection"
28
- ],
29
- "unet": [
30
- "modules",
31
- "UNet2p5DConditionModel"
32
- ],
33
- "vae": [
34
- "diffusers",
35
- "AutoencoderKL"
36
- ]
37
- }
hunyuan3d-paint-v2-0-turbo/scheduler/scheduler_config.json DELETED
@@ -1,15 +0,0 @@
1
- {
2
- "_class_name": "DDIMScheduler",
3
- "_diffusers_version": "0.23.1",
4
- "beta_end": 0.012,
5
- "beta_schedule": "scaled_linear",
6
- "beta_start": 0.00085,
7
- "clip_sample": false,
8
- "num_train_timesteps": 1000,
9
- "prediction_type": "v_prediction",
10
- "set_alpha_to_one": true,
11
- "steps_offset": 1,
12
- "trained_betas": null,
13
- "timestep_spacing": "trailing",
14
- "rescale_betas_zero_snr": true
15
- }
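The settings above (`prediction_type: v_prediction`, `timestep_spacing: trailing`, `rescale_betas_zero_snr: true`) correspond to the zero-terminal-SNR fixes described in the deleted `hunyuan3d-paint-v2-0-turbo/README.md` earlier in this diff. A sketch of the equivalent `diffusers` scheduler, with constructor arguments mirroring the JSON (requires a diffusers version that supports `rescale_betas_zero_snr`):

```python
from diffusers import DDIMScheduler

scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    clip_sample=False,
    set_alpha_to_one=True,
    steps_offset=1,
    prediction_type="v_prediction",
    timestep_spacing="trailing",
    rescale_betas_zero_snr=True,
)
# Equivalent load from a local checkout (path is an assumption):
# scheduler = DDIMScheduler.from_pretrained("hunyuan3d-paint-v2-0-turbo/scheduler")
```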
hunyuan3d-paint-v2-0-turbo/text_encoder/config.json DELETED
@@ -1,25 +0,0 @@
1
- {
2
- "_name_or_path": "stabilityai/stable-diffusion-2",
3
- "architectures": [
4
- "CLIPTextModel"
5
- ],
6
- "attention_dropout": 0.0,
7
- "bos_token_id": 0,
8
- "dropout": 0.0,
9
- "eos_token_id": 2,
10
- "hidden_act": "gelu",
11
- "hidden_size": 1024,
12
- "initializer_factor": 1.0,
13
- "initializer_range": 0.02,
14
- "intermediate_size": 4096,
15
- "layer_norm_eps": 1e-05,
16
- "max_position_embeddings": 77,
17
- "model_type": "clip_text_model",
18
- "num_attention_heads": 16,
19
- "num_hidden_layers": 23,
20
- "pad_token_id": 1,
21
- "projection_dim": 512,
22
- "torch_dtype": "float32",
23
- "transformers_version": "4.25.0.dev0",
24
- "vocab_size": 49408
25
- }
hunyuan3d-paint-v2-0-turbo/text_encoder/pytorch_model.bin DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:c3e254d7b61353497ea0be2c4013df4ea8f739ee88cffa0ba58cd085459ed565
3
- size 1361671895
hunyuan3d-paint-v2-0-turbo/tokenizer/merges.txt DELETED
The diff for this file is too large to render.
 
hunyuan3d-paint-v2-0-turbo/tokenizer/special_tokens_map.json DELETED
@@ -1,24 +0,0 @@
1
- {
2
- "bos_token": {
3
- "content": "<|startoftext|>",
4
- "lstrip": false,
5
- "normalized": true,
6
- "rstrip": false,
7
- "single_word": false
8
- },
9
- "eos_token": {
10
- "content": "<|endoftext|>",
11
- "lstrip": false,
12
- "normalized": true,
13
- "rstrip": false,
14
- "single_word": false
15
- },
16
- "pad_token": "!",
17
- "unk_token": {
18
- "content": "<|endoftext|>",
19
- "lstrip": false,
20
- "normalized": true,
21
- "rstrip": false,
22
- "single_word": false
23
- }
24
- }
hunyuan3d-paint-v2-0-turbo/tokenizer/tokenizer_config.json DELETED
@@ -1,34 +0,0 @@
1
- {
2
- "add_prefix_space": false,
3
- "bos_token": {
4
- "__type": "AddedToken",
5
- "content": "<|startoftext|>",
6
- "lstrip": false,
7
- "normalized": true,
8
- "rstrip": false,
9
- "single_word": false
10
- },
11
- "do_lower_case": true,
12
- "eos_token": {
13
- "__type": "AddedToken",
14
- "content": "<|endoftext|>",
15
- "lstrip": false,
16
- "normalized": true,
17
- "rstrip": false,
18
- "single_word": false
19
- },
20
- "errors": "replace",
21
- "model_max_length": 77,
22
- "name_or_path": "stabilityai/stable-diffusion-2",
23
- "pad_token": "<|endoftext|>",
24
- "special_tokens_map_file": "./special_tokens_map.json",
25
- "tokenizer_class": "CLIPTokenizer",
26
- "unk_token": {
27
- "__type": "AddedToken",
28
- "content": "<|endoftext|>",
29
- "lstrip": false,
30
- "normalized": true,
31
- "rstrip": false,
32
- "single_word": false
33
- }
34
- }
hunyuan3d-paint-v2-0-turbo/tokenizer/vocab.json DELETED
The diff for this file is too large to render.
 
hunyuan3d-paint-v2-0-turbo/unet/config.json DELETED
@@ -1,45 +0,0 @@
1
- {
2
- "_class_name": "UNet2DConditionModel",
3
- "_diffusers_version": "0.10.0.dev0",
4
- "act_fn": "silu",
5
- "attention_head_dim": [
6
- 5,
7
- 10,
8
- 20,
9
- 20
10
- ],
11
- "block_out_channels": [
12
- 320,
13
- 640,
14
- 1280,
15
- 1280
16
- ],
17
- "center_input_sample": false,
18
- "cross_attention_dim": 1024,
19
- "down_block_types": [
20
- "CrossAttnDownBlock2D",
21
- "CrossAttnDownBlock2D",
22
- "CrossAttnDownBlock2D",
23
- "DownBlock2D"
24
- ],
25
- "downsample_padding": 1,
26
- "dual_cross_attention": false,
27
- "flip_sin_to_cos": true,
28
- "freq_shift": 0,
29
- "in_channels": 4,
30
- "layers_per_block": 2,
31
- "mid_block_scale_factor": 1,
32
- "norm_eps": 1e-05,
33
- "norm_num_groups": 32,
34
- "num_class_embeds": null,
35
- "only_cross_attention": false,
36
- "out_channels": 4,
37
- "sample_size": 64,
38
- "up_block_types": [
39
- "UpBlock2D",
40
- "CrossAttnUpBlock2D",
41
- "CrossAttnUpBlock2D",
42
- "CrossAttnUpBlock2D"
43
- ],
44
- "use_linear_projection": true
45
- }
hunyuan3d-paint-v2-0-turbo/unet/diffusion_pytorch_model.bin DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:24e7f1aea8a7c94cee627eb06f5265f19eeff4e19568636c5eaef050cc19ba3d
3
- size 7325432923
hunyuan3d-paint-v2-0-turbo/unet/diffusion_pytorch_model.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:d6acffa4a22f4da61d87f446bfa83e7ac245481c1535fbf25b200fe4462d0b22
3
- size 3722161032
hunyuan3d-paint-v2-0-turbo/unet/modules.py DELETED
@@ -1,610 +0,0 @@
1
- # Open Source Model Licensed under the Apache License Version 2.0
2
- # and Other Licenses of the Third-Party Components therein:
3
- # The below Model in this distribution may have been modified by THL A29 Limited
4
- # ("Tencent Modifications"). All Tencent Modifications are Copyright (C) 2024 THL A29 Limited.
5
-
6
- # Copyright (C) 2024 THL A29 Limited, a Tencent company. All rights reserved.
7
- # The below software and/or models in this distribution may have been
8
- # modified by THL A29 Limited ("Tencent Modifications").
9
- # All Tencent Modifications are Copyright (C) THL A29 Limited.
10
-
11
- # Hunyuan 3D is licensed under the TENCENT HUNYUAN NON-COMMERCIAL LICENSE AGREEMENT
12
- # except for the third-party components listed below.
13
- # Hunyuan 3D does not impose any additional limitations beyond what is outlined
14
- # in the respective licenses of these third-party components.
15
- # Users must comply with all terms and conditions of original licenses of these third-party
16
- # components and must ensure that the usage of the third party components adheres to
17
- # all relevant laws and regulations.
18
-
19
- # For avoidance of doubts, Hunyuan 3D means the large language models and
20
- # their software and algorithms, including trained model weights, parameters (including
21
- # optimizer states), machine-learning model code, inference-enabling code, training-enabling code,
22
- # fine-tuning enabling code and other elements of the foregoing made publicly available
23
- # by Tencent in accordance with TENCENT HUNYUAN COMMUNITY LICENSE AGREEMENT.
24
-
25
- import copy
26
- import json
27
- import os
28
- from typing import Any, Dict, List, Optional, Tuple, Union
29
-
30
- import torch
31
- import torch.nn as nn
32
- import torch.nn.functional as F
33
- from diffusers.models import UNet2DConditionModel
34
- from diffusers.models.attention_processor import Attention
35
- from diffusers.models.transformers.transformer_2d import BasicTransformerBlock
36
- from einops import rearrange
37
-
38
-
39
- def _chunked_feed_forward(ff: nn.Module, hidden_states: torch.Tensor, chunk_dim: int, chunk_size: int):
40
- # "feed_forward_chunk_size" can be used to save memory
41
- if hidden_states.shape[chunk_dim] % chunk_size != 0:
42
- raise ValueError(
43
- f"`hidden_states` dimension to be chunked: {hidden_states.shape[chunk_dim]}"
44
- f"has to be divisible by chunk size: {chunk_size}."
45
- f" Make sure to set an appropriate `chunk_size` when calling `unet.enable_forward_chunking`."
46
- )
47
-
48
- num_chunks = hidden_states.shape[chunk_dim] // chunk_size
49
- ff_output = torch.cat(
50
- [ff(hid_slice) for hid_slice in hidden_states.chunk(num_chunks, dim=chunk_dim)],
51
- dim=chunk_dim,
52
- )
53
- return ff_output
54
-
55
-
56
- class Basic2p5DTransformerBlock(torch.nn.Module):
57
- def __init__(self, transformer: BasicTransformerBlock, layer_name, use_ma=True, use_ra=True, is_turbo=False) -> None:
58
- super().__init__()
59
- self.transformer = transformer
60
- self.layer_name = layer_name
61
- self.use_ma = use_ma
62
- self.use_ra = use_ra
63
- self.is_turbo = is_turbo
64
-
65
- # multiview attn
66
- if self.use_ma:
67
- self.attn_multiview = Attention(
68
- query_dim=self.dim,
69
- heads=self.num_attention_heads,
70
- dim_head=self.attention_head_dim,
71
- dropout=self.dropout,
72
- bias=self.attention_bias,
73
- cross_attention_dim=None,
74
- upcast_attention=self.attn1.upcast_attention,
75
- out_bias=True,
76
- )
77
-
78
- # ref attn
79
- if self.use_ra:
80
- self.attn_refview = Attention(
81
- query_dim=self.dim,
82
- heads=self.num_attention_heads,
83
- dim_head=self.attention_head_dim,
84
- dropout=self.dropout,
85
- bias=self.attention_bias,
86
- cross_attention_dim=None,
87
- upcast_attention=self.attn1.upcast_attention,
88
- out_bias=True,
89
- )
90
- if self.is_turbo:
91
- self._initialize_attn_weights()
92
-
93
- def _initialize_attn_weights(self):
94
-
95
- if self.use_ma:
96
- self.attn_multiview.load_state_dict(self.attn1.state_dict())
97
- with torch.no_grad():
98
- for layer in self.attn_multiview.to_out:
99
- for param in layer.parameters():
100
- param.zero_()
101
- if self.use_ra:
102
- self.attn_refview.load_state_dict(self.attn1.state_dict())
103
- with torch.no_grad():
104
- for layer in self.attn_refview.to_out:
105
- for param in layer.parameters():
106
- param.zero_()
107
-
108
- def __getattr__(self, name: str):
109
- try:
110
- return super().__getattr__(name)
111
- except AttributeError:
112
- return getattr(self.transformer, name)
113
-
114
- def forward(
115
- self,
116
- hidden_states: torch.Tensor,
117
- attention_mask: Optional[torch.Tensor] = None,
118
- encoder_hidden_states: Optional[torch.Tensor] = None,
119
- encoder_attention_mask: Optional[torch.Tensor] = None,
120
- timestep: Optional[torch.LongTensor] = None,
121
- cross_attention_kwargs: Dict[str, Any] = None,
122
- class_labels: Optional[torch.LongTensor] = None,
123
- added_cond_kwargs: Optional[Dict[str, torch.Tensor]] = None,
124
- ) -> torch.Tensor:
125
-
126
- # Notice that normalization is always applied before the real computation in the following blocks.
127
- # 0. Self-Attention
128
- batch_size = hidden_states.shape[0]
129
-
130
- cross_attention_kwargs = cross_attention_kwargs.copy() if cross_attention_kwargs is not None else {}
131
- num_in_batch = cross_attention_kwargs.pop('num_in_batch', 1)
132
- mode = cross_attention_kwargs.pop('mode', None)
133
- if not self.is_turbo:
134
- mva_scale = cross_attention_kwargs.pop('mva_scale', 1.0)
135
- ref_scale = cross_attention_kwargs.pop('ref_scale', 1.0)
136
- else:
137
- position_attn_mask = cross_attention_kwargs.pop("position_attn_mask", None)
138
- position_voxel_indices = cross_attention_kwargs.pop("position_voxel_indices", None)
139
- mva_scale = 1.0
140
- ref_scale = 1.0
141
-
142
- condition_embed_dict = cross_attention_kwargs.pop("condition_embed_dict", None)
143
-
144
- if self.norm_type == "ada_norm":
145
- norm_hidden_states = self.norm1(hidden_states, timestep)
146
- elif self.norm_type == "ada_norm_zero":
147
- norm_hidden_states, gate_msa, shift_mlp, scale_mlp, gate_mlp = self.norm1(
148
- hidden_states, timestep, class_labels, hidden_dtype=hidden_states.dtype
149
- )
150
- elif self.norm_type in ["layer_norm", "layer_norm_i2vgen"]:
151
- norm_hidden_states = self.norm1(hidden_states)
152
- elif self.norm_type == "ada_norm_continuous":
153
- norm_hidden_states = self.norm1(hidden_states, added_cond_kwargs["pooled_text_emb"])
154
- elif self.norm_type == "ada_norm_single":
155
- shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
156
- self.scale_shift_table[None] + timestep.reshape(batch_size, 6, -1)
157
- ).chunk(6, dim=1)
158
- norm_hidden_states = self.norm1(hidden_states)
159
- norm_hidden_states = norm_hidden_states * (1 + scale_msa) + shift_msa
160
- else:
161
- raise ValueError("Incorrect norm used")
162
-
163
- if self.pos_embed is not None:
164
- norm_hidden_states = self.pos_embed(norm_hidden_states)
165
-
166
- # 1. Prepare GLIGEN inputs
167
- cross_attention_kwargs = cross_attention_kwargs.copy() if cross_attention_kwargs is not None else {}
168
- gligen_kwargs = cross_attention_kwargs.pop("gligen", None)
169
-
170
- attn_output = self.attn1(
171
- norm_hidden_states,
172
- encoder_hidden_states=encoder_hidden_states if self.only_cross_attention else None,
173
- attention_mask=attention_mask,
174
- **cross_attention_kwargs,
175
- )
176
-
177
- if self.norm_type == "ada_norm_zero":
178
- attn_output = gate_msa.unsqueeze(1) * attn_output
179
- elif self.norm_type == "ada_norm_single":
180
- attn_output = gate_msa * attn_output
181
-
182
- hidden_states = attn_output + hidden_states
183
- if hidden_states.ndim == 4:
184
- hidden_states = hidden_states.squeeze(1)
185
-
186
- # 1.2 Reference Attention
187
- if 'w' in mode:
188
- condition_embed_dict[self.layer_name] = rearrange(
189
- norm_hidden_states, '(b n) l c -> b (n l) c',
190
- n=num_in_batch
191
- ) # B, (N L), C
192
-
193
- if 'r' in mode and self.use_ra:
194
- condition_embed = condition_embed_dict[self.layer_name].unsqueeze(1).repeat(1, num_in_batch, 1,
195
- 1) # B N L C
196
- condition_embed = rearrange(condition_embed, 'b n l c -> (b n) l c')
197
-
198
- attn_output = self.attn_refview(
199
- norm_hidden_states,
200
- encoder_hidden_states=condition_embed,
201
- attention_mask=None,
202
- **cross_attention_kwargs
203
- )
204
- if not self.is_turbo:
205
- ref_scale_timing = ref_scale
206
- if isinstance(ref_scale, torch.Tensor):
207
- ref_scale_timing = ref_scale.unsqueeze(1).repeat(1, num_in_batch).view(-1)
208
- for _ in range(attn_output.ndim - 1):
209
- ref_scale_timing = ref_scale_timing.unsqueeze(-1)
210
-
211
- hidden_states = ref_scale_timing * attn_output + hidden_states
212
-
213
- if hidden_states.ndim == 4:
214
- hidden_states = hidden_states.squeeze(1)
215
-
216
- # 1.3 Multiview Attention
217
- if num_in_batch > 1 and self.use_ma:
218
- multivew_hidden_states = rearrange(norm_hidden_states, '(b n) l c -> b (n l) c', n=num_in_batch)
219
-
220
- if self.is_turbo:
221
- position_mask = None
222
- if position_attn_mask is not None:
223
- if multivew_hidden_states.shape[1] in position_attn_mask:
224
- position_mask = position_attn_mask[multivew_hidden_states.shape[1]]
225
- position_indices = None
226
- if position_voxel_indices is not None:
227
- if multivew_hidden_states.shape[1] in position_voxel_indices:
228
- position_indices = position_voxel_indices[multivew_hidden_states.shape[1]]
229
- attn_output = self.attn_multiview(
230
- multivew_hidden_states,
231
- encoder_hidden_states=multivew_hidden_states,
232
- attention_mask=position_mask,
233
- position_indices=position_indices,
234
- **cross_attention_kwargs
235
- )
236
- else:
237
- attn_output = self.attn_multiview(
238
- multivew_hidden_states,
239
- encoder_hidden_states=multivew_hidden_states,
240
- **cross_attention_kwargs
241
- )
242
-
243
- attn_output = rearrange(attn_output, 'b (n l) c -> (b n) l c', n=num_in_batch)
244
-
245
- hidden_states = mva_scale * attn_output + hidden_states
246
- if hidden_states.ndim == 4:
247
- hidden_states = hidden_states.squeeze(1)
248
-
249
- # 1.2 GLIGEN Control
250
- if gligen_kwargs is not None:
251
- hidden_states = self.fuser(hidden_states, gligen_kwargs["objs"])
252
-
253
- # 3. Cross-Attention
254
- if self.attn2 is not None:
255
- if self.norm_type == "ada_norm":
256
- norm_hidden_states = self.norm2(hidden_states, timestep)
257
- elif self.norm_type in ["ada_norm_zero", "layer_norm", "layer_norm_i2vgen"]:
258
- norm_hidden_states = self.norm2(hidden_states)
259
- elif self.norm_type == "ada_norm_single":
260
- # For PixArt norm2 isn't applied here:
261
- # https://github.com/PixArt-alpha/PixArt-alpha/blob/0f55e922376d8b797edd44d25d0e7464b260dcab/diffusion/model/nets/PixArtMS.py#L70C1-L76C103
262
- norm_hidden_states = hidden_states
263
- elif self.norm_type == "ada_norm_continuous":
264
- norm_hidden_states = self.norm2(hidden_states, added_cond_kwargs["pooled_text_emb"])
265
- else:
266
- raise ValueError("Incorrect norm")
267
-
268
- if self.pos_embed is not None and self.norm_type != "ada_norm_single":
269
- norm_hidden_states = self.pos_embed(norm_hidden_states)
270
-
271
- attn_output = self.attn2(
272
- norm_hidden_states,
273
- encoder_hidden_states=encoder_hidden_states,
274
- attention_mask=encoder_attention_mask,
275
- **cross_attention_kwargs,
276
- )
277
-
278
- hidden_states = attn_output + hidden_states
279
-
280
- # 4. Feed-forward
281
- # i2vgen doesn't have this norm 🤷‍♂️
282
- if self.norm_type == "ada_norm_continuous":
283
- norm_hidden_states = self.norm3(hidden_states, added_cond_kwargs["pooled_text_emb"])
284
- elif not self.norm_type == "ada_norm_single":
285
- norm_hidden_states = self.norm3(hidden_states)
286
-
287
- if self.norm_type == "ada_norm_zero":
288
- norm_hidden_states = norm_hidden_states * (1 + scale_mlp[:, None]) + shift_mlp[:, None]
289
-
290
- if self.norm_type == "ada_norm_single":
291
- norm_hidden_states = self.norm2(hidden_states)
292
- norm_hidden_states = norm_hidden_states * (1 + scale_mlp) + shift_mlp
293
-
294
- if self._chunk_size is not None:
295
- # "feed_forward_chunk_size" can be used to save memory
296
- ff_output = _chunked_feed_forward(self.ff, norm_hidden_states, self._chunk_dim, self._chunk_size)
297
- else:
298
- ff_output = self.ff(norm_hidden_states)
299
-
300
- if self.norm_type == "ada_norm_zero":
301
- ff_output = gate_mlp.unsqueeze(1) * ff_output
302
- elif self.norm_type == "ada_norm_single":
303
- ff_output = gate_mlp * ff_output
304
-
305
- hidden_states = ff_output + hidden_states
306
- if hidden_states.ndim == 4:
307
- hidden_states = hidden_states.squeeze(1)
308
-
309
- return hidden_states
310
-
311
- @torch.no_grad()
312
- def compute_voxel_grid_mask(position, grid_resolution=8):
313
-
314
- position = position.half()
315
- B,N,_,H,W = position.shape
316
- assert H%grid_resolution==0 and W%grid_resolution==0
317
-
318
- valid_mask = (position != 1).all(dim=2, keepdim=True)
319
- valid_mask = valid_mask.expand_as(position)
320
- position[valid_mask==False] = 0
321
-
322
-
323
- position = rearrange(
324
- position,
325
- 'b n c (num_h grid_h) (num_w grid_w) -> b n num_h num_w c grid_h grid_w',
326
- num_h=grid_resolution, num_w=grid_resolution
327
- )
328
- valid_mask = rearrange(
329
- valid_mask,
330
- 'b n c (num_h grid_h) (num_w grid_w) -> b n num_h num_w c grid_h grid_w',
331
- num_h=grid_resolution, num_w=grid_resolution
332
- )
333
-
334
- grid_position = position.sum(dim=(-2, -1))
335
- count_masked = valid_mask.sum(dim=(-2, -1))
336
-
337
- grid_position = grid_position / count_masked.clamp(min=1)
338
- grid_position[count_masked<5] = 0
339
-
340
- grid_position = grid_position.permute(0,1,4,2,3)
341
- grid_position = rearrange(grid_position, 'b n c h w -> b n (h w) c')
342
-
343
- grid_position_expanded_1 = grid_position.unsqueeze(2).unsqueeze(4) # shape becomes B, N, 1, L, 1, 3
344
- grid_position_expanded_2 = grid_position.unsqueeze(1).unsqueeze(3) # shape becomes B, 1, N, 1, L, 3
345
-
346
- # compute pairwise Euclidean distances
347
- distances = torch.norm(grid_position_expanded_1 - grid_position_expanded_2, dim=-1) # shape B, N, N, L, L
348
-
349
- weights = distances
350
- grid_distance = 1.73/grid_resolution
351
-
352
- #weights = weights*-32
353
- #weights = weights.clamp(min=-10000.0)
354
-
355
- weights = weights< grid_distance
356
-
357
- return weights
358
-
359
- def compute_multi_resolution_mask(position_maps, grid_resolutions=[32, 16, 8]):
360
- position_attn_mask = {}
361
- with torch.no_grad():
362
- for grid_resolution in grid_resolutions:
363
- position_mask = compute_voxel_grid_mask(position_maps, grid_resolution)
364
- position_mask = rearrange(position_mask, 'b ni nj li lj -> b (ni li) (nj lj)')
365
- position_attn_mask[position_mask.shape[1]] = position_mask
366
- return position_attn_mask
367
-
368
- @torch.no_grad()
369
- def compute_discrete_voxel_indice(position, grid_resolution=8, voxel_resolution=128):
370
-
371
- position = position.half()
372
- B,N,_,H,W = position.shape
373
- assert H%grid_resolution==0 and W%grid_resolution==0
374
-
375
- valid_mask = (position != 1).all(dim=2, keepdim=True)
376
- valid_mask = valid_mask.expand_as(position)
377
- position[valid_mask==False] = 0
378
-
379
- position = rearrange(
380
- position,
381
- 'b n c (num_h grid_h) (num_w grid_w) -> b n num_h num_w c grid_h grid_w',
382
- num_h=grid_resolution, num_w=grid_resolution
383
- )
384
- valid_mask = rearrange(
385
- valid_mask,
386
- 'b n c (num_h grid_h) (num_w grid_w) -> b n num_h num_w c grid_h grid_w',
387
- num_h=grid_resolution, num_w=grid_resolution
388
- )
389
-
390
- grid_position = position.sum(dim=(-2, -1))
391
- count_masked = valid_mask.sum(dim=(-2, -1))
392
-
393
- grid_position = grid_position / count_masked.clamp(min=1)
394
- grid_position[count_masked<5] = 0
395
-
396
- grid_position = grid_position.permute(0,1,4,2,3).clamp(0, 1) # B N C H W
397
- voxel_indices = grid_position * (voxel_resolution - 1)
398
- voxel_indices = torch.round(voxel_indices).long()
399
- return voxel_indices
400
-
401
- def compute_multi_resolution_discrete_voxel_indice(
402
- position_maps,
403
- grid_resolutions=[64, 32, 16, 8],
404
- voxel_resolutions=[512, 256, 128, 64]
405
- ):
406
- voxel_indices = {}
407
- with torch.no_grad():
408
- for grid_resolution, voxel_resolution in zip(grid_resolutions, voxel_resolutions):
409
- voxel_indice = compute_discrete_voxel_indice(position_maps, grid_resolution, voxel_resolution)
410
- voxel_indice = rearrange(voxel_indice, 'b n c h w -> b (n h w) c')
411
- voxel_indices[voxel_indice.shape[1]] = {'voxel_indices':voxel_indice, 'voxel_resolution':voxel_resolution}
412
- return voxel_indices
413
-
414
- class UNet2p5DConditionModel(torch.nn.Module):
415
- def __init__(self, unet: UNet2DConditionModel) -> None:
416
- super().__init__()
417
- self.unet = unet
418
-
419
- self.use_ma = True
420
- self.use_ra = True
421
- self.use_camera_embedding = True
422
- self.use_dual_stream = True
423
- self.is_turbo = False
424
-
425
- if self.use_dual_stream:
426
- self.unet_dual = copy.deepcopy(unet)
427
- self.init_attention(self.unet_dual)
428
- self.init_attention(self.unet, use_ma=self.use_ma, use_ra=self.use_ra, is_turbo=self.is_turbo)
429
- self.init_condition()
430
- self.init_camera_embedding()
431
-
432
- @staticmethod
433
- def from_pretrained(pretrained_model_name_or_path, **kwargs):
434
- torch_dtype = kwargs.pop('torch_dtype', torch.float32)
435
- config_path = os.path.join(pretrained_model_name_or_path, 'config.json')
436
- unet_ckpt_path = os.path.join(pretrained_model_name_or_path, 'diffusion_pytorch_model.bin')
437
- with open(config_path, 'r', encoding='utf-8') as file:
438
- config = json.load(file)
439
- unet = UNet2DConditionModel(**config)
440
- unet = UNet2p5DConditionModel(unet)
441
- unet_ckpt = torch.load(unet_ckpt_path, map_location='cpu', weights_only=True)
442
- unet.load_state_dict(unet_ckpt, strict=True)
443
- unet = unet.to(torch_dtype)
444
- return unet
445
-
446
- def init_condition(self):
447
- self.unet.conv_in = torch.nn.Conv2d(
448
- 12,
449
- self.unet.conv_in.out_channels,
450
- kernel_size=self.unet.conv_in.kernel_size,
451
- stride=self.unet.conv_in.stride,
452
- padding=self.unet.conv_in.padding,
453
- dilation=self.unet.conv_in.dilation,
454
- groups=self.unet.conv_in.groups,
455
- bias=self.unet.conv_in.bias is not None)
456
-
457
- self.unet.learned_text_clip_gen = nn.Parameter(torch.randn(1, 77, 1024))
458
- self.unet.learned_text_clip_ref = nn.Parameter(torch.randn(1, 77, 1024))
459
-
460
- def init_camera_embedding(self):
461
-
462
- if self.use_camera_embedding:
463
- time_embed_dim = 1280
464
- self.max_num_ref_image = 5
465
- self.max_num_gen_image = 12 * 3 + 4 * 2
466
- self.unet.class_embedding = nn.Embedding(self.max_num_ref_image + self.max_num_gen_image, time_embed_dim)
467
-
468
- def init_attention(self, unet, use_ma=False, use_ra=False, is_turbo=False):
469
-
470
- for down_block_i, down_block in enumerate(unet.down_blocks):
471
- if hasattr(down_block, "has_cross_attention") and down_block.has_cross_attention:
472
- for attn_i, attn in enumerate(down_block.attentions):
473
- for transformer_i, transformer in enumerate(attn.transformer_blocks):
474
- if isinstance(transformer, BasicTransformerBlock):
475
- attn.transformer_blocks[transformer_i] = Basic2p5DTransformerBlock(
476
- transformer,
477
- f'down_{down_block_i}_{attn_i}_{transformer_i}',
478
- use_ma, use_ra, is_turbo
479
- )
480
-
481
- if hasattr(unet.mid_block, "has_cross_attention") and unet.mid_block.has_cross_attention:
482
- for attn_i, attn in enumerate(unet.mid_block.attentions):
483
- for transformer_i, transformer in enumerate(attn.transformer_blocks):
484
- if isinstance(transformer, BasicTransformerBlock):
485
- attn.transformer_blocks[transformer_i] = Basic2p5DTransformerBlock(
486
- transformer,
487
- f'mid_{attn_i}_{transformer_i}',
488
- use_ma, use_ra, is_turbo
489
- )
490
-
491
- for up_block_i, up_block in enumerate(unet.up_blocks):
492
- if hasattr(up_block, "has_cross_attention") and up_block.has_cross_attention:
493
- for attn_i, attn in enumerate(up_block.attentions):
494
- for transformer_i, transformer in enumerate(attn.transformer_blocks):
495
- if isinstance(transformer, BasicTransformerBlock):
496
- attn.transformer_blocks[transformer_i] = Basic2p5DTransformerBlock(
497
- transformer,
498
- f'up_{up_block_i}_{attn_i}_{transformer_i}',
499
- use_ma, use_ra, is_turbo
500
- )
501
-
502
- def __getattr__(self, name: str):
503
- try:
504
- return super().__getattr__(name)
505
- except AttributeError:
506
- return getattr(self.unet, name)
507
-
508
- def forward(
509
- self, sample, timestep, encoder_hidden_states,
510
- *args, down_intrablock_additional_residuals=None,
511
- down_block_res_samples=None, mid_block_res_sample=None,
512
- **cached_condition,
513
- ):
514
- B, N_gen, _, H, W = sample.shape
515
- assert H == W
516
-
517
- if self.use_camera_embedding:
518
- camera_info_gen = cached_condition['camera_info_gen'] + self.max_num_ref_image
519
- camera_info_gen = rearrange(camera_info_gen, 'b n -> (b n)')
520
- else:
521
- camera_info_gen = None
522
-
523
- sample = [sample]
524
- if 'normal_imgs' in cached_condition:
525
- sample.append(cached_condition["normal_imgs"])
526
- if 'position_imgs' in cached_condition:
527
- sample.append(cached_condition["position_imgs"])
528
- sample = torch.cat(sample, dim=2)
529
-
530
- sample = rearrange(sample, 'b n c h w -> (b n) c h w')
531
-
532
- encoder_hidden_states_gen = encoder_hidden_states.unsqueeze(1).repeat(1, N_gen, 1, 1)
533
- encoder_hidden_states_gen = rearrange(encoder_hidden_states_gen, 'b n l c -> (b n) l c')
534
-
535
- if self.use_ra:
536
- if 'condition_embed_dict' in cached_condition:
537
- condition_embed_dict = cached_condition['condition_embed_dict']
538
- else:
539
- condition_embed_dict = {}
540
- ref_latents = cached_condition['ref_latents']
541
- N_ref = ref_latents.shape[1]
542
- if self.use_camera_embedding:
543
- camera_info_ref = cached_condition['camera_info_ref']
544
- camera_info_ref = rearrange(camera_info_ref, 'b n -> (b n)')
545
- else:
546
- camera_info_ref = None
547
-
548
- ref_latents = rearrange(ref_latents, 'b n c h w -> (b n) c h w')
549
-
550
- encoder_hidden_states_ref = self.unet.learned_text_clip_ref.unsqueeze(1).repeat(B, N_ref, 1, 1)
551
- encoder_hidden_states_ref = rearrange(encoder_hidden_states_ref, 'b n l c -> (b n) l c')
552
-
553
- noisy_ref_latents = ref_latents
554
- timestep_ref = 0
555
-
556
- if self.use_dual_stream:
557
- unet_ref = self.unet_dual
558
- else:
559
- unet_ref = self.unet
560
- unet_ref(
561
- noisy_ref_latents, timestep_ref,
562
- encoder_hidden_states=encoder_hidden_states_ref,
563
- class_labels=camera_info_ref,
564
- # **kwargs
565
- return_dict=False,
566
- cross_attention_kwargs={
567
- 'mode': 'w', 'num_in_batch': N_ref,
568
- 'condition_embed_dict': condition_embed_dict},
569
- )
570
- cached_condition['condition_embed_dict'] = condition_embed_dict
571
- else:
572
- condition_embed_dict = None
573
-
574
- mva_scale = cached_condition.get('mva_scale', 1.0)
575
- ref_scale = cached_condition.get('ref_scale', 1.0)
576
-
577
- if self.is_turbo:
578
- cross_attention_kwargs_ = {
579
- 'mode': 'r', 'num_in_batch': N_gen,
580
- 'condition_embed_dict': condition_embed_dict,
581
- 'position_attn_mask':position_attn_mask,
582
- 'position_voxel_indices':position_voxel_indices,
583
- 'mva_scale': mva_scale,
584
- 'ref_scale': ref_scale,
585
- }
586
- else:
587
- cross_attention_kwargs_ = {
588
- 'mode': 'r', 'num_in_batch': N_gen,
589
- 'condition_embed_dict': condition_embed_dict,
590
- 'mva_scale': mva_scale,
591
- 'ref_scale': ref_scale,
592
- }
593
- return self.unet(
594
- sample, timestep,
595
- encoder_hidden_states_gen, *args,
596
- class_labels=camera_info_gen,
597
- down_intrablock_additional_residuals=[
598
- sample.to(dtype=self.unet.dtype) for sample in down_intrablock_additional_residuals
599
- ] if down_intrablock_additional_residuals is not None else None,
600
- down_block_additional_residuals=[
601
- sample.to(dtype=self.unet.dtype) for sample in down_block_res_samples
602
- ] if down_block_res_samples is not None else None,
603
- mid_block_additional_residual=(
604
- mid_block_res_sample.to(dtype=self.unet.dtype)
605
- if mid_block_res_sample is not None else None
606
- ),
607
- return_dict=False,
608
- cross_attention_kwargs=cross_attention_kwargs_,
609
- )
610
-
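A short usage sketch for the wrapper defined above, based on its own `from_pretrained` (which reads `config.json` and `diffusion_pytorch_model.bin` from a local folder); the path and the `modules` import are assumptions:

```python
import torch
from modules import UNet2p5DConditionModel  # assumes modules.py is on the import path

unet = UNet2p5DConditionModel.from_pretrained(
    "hunyuan3d-paint-v2-0-turbo/unet",  # local folder with config.json + diffusion_pytorch_model.bin (assumption)
    torch_dtype=torch.float16,
)
```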
hunyuan3d-paint-v2-0-turbo/vae/config.json DELETED
@@ -1,29 +0,0 @@
1
- {
2
- "_class_name": "AutoencoderKL",
3
- "_diffusers_version": "0.10.0.dev0",
4
- "act_fn": "silu",
5
- "block_out_channels": [
6
- 128,
7
- 256,
8
- 512,
9
- 512
10
- ],
11
- "down_block_types": [
12
- "DownEncoderBlock2D",
13
- "DownEncoderBlock2D",
14
- "DownEncoderBlock2D",
15
- "DownEncoderBlock2D"
16
- ],
17
- "in_channels": 3,
18
- "latent_channels": 4,
19
- "layers_per_block": 2,
20
- "norm_num_groups": 32,
21
- "out_channels": 3,
22
- "sample_size": 768,
23
- "up_block_types": [
24
- "UpDecoderBlock2D",
25
- "UpDecoderBlock2D",
26
- "UpDecoderBlock2D",
27
- "UpDecoderBlock2D"
28
- ]
29
- }
hunyuan3d-paint-v2-0-turbo/vae/diffusion_pytorch_model.bin DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:1b4889b6b1d4ce7ae320a02dedaeff1780ad77d415ea0d744b476155c6377ddc
3
- size 334707217
hunyuan3d-vae-v2-0-turbo/config.yaml DELETED
@@ -1,15 +0,0 @@
1
- target: hy3dgen.shapegen.models.ShapeVAE
2
- params:
3
- num_latents: 3072
4
- embed_dim: 64
5
- num_freqs: 8
6
- include_pi: false
7
- heads: 16
8
- width: 1024
9
- num_decoder_layers: 16
10
- qkv_bias: false
11
- qk_norm: true
12
- scale_factor: 0.9990943042622529
13
- geo_decoder_mlp_expand_ratio: 1
14
- geo_decoder_downsample_ratio: 2
15
- geo_decoder_ln_post: false
hunyuan3d-vae-v2-0-turbo/model.fp16.ckpt DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:bea74b1d912e245510b062ac62d938e5312760c07b122eadd0199e325e3d5343
3
- size 407480196
hunyuan3d-vae-v2-0-turbo/model.fp16.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:66b7af697118db2de00c22f63a385c75f9049709739281971837af9e4ce31cb4
3
- size 407410402
hunyuan3d-vae-v2-0/config.yaml DELETED
@@ -1,15 +0,0 @@
1
- target: hy3dgen.shapegen.models.ShapeVAE
2
- params:
3
- num_latents: 3072
4
- embed_dim: 64
5
- num_freqs: 8
6
- include_pi: false
7
- heads: 16
8
- width: 1024
9
- num_decoder_layers: 16
10
- qkv_bias: false
11
- qk_norm: true
12
- scale_factor: 0.9990943042622529
13
- geo_decoder_mlp_expand_ratio: 4
14
- geo_decoder_downsample_ratio: 1
15
- geo_decoder_ln_post: true
hunyuan3d-vae-v2-0/model.fp16.ckpt DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:5fc24dcac763a28d09f49dfe5ef71cf129f2e7193e75a8753d5f8b1e4bd526e2
3
- size 428526216
hunyuan3d-vae-v2-0/model.fp16.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:abbad221083fbde12d7e8afc69243de7cee85abc806368ba2338fc5736be7341
3
- size 428455666