staswrs committed on
Commit 514160d · 1 Parent(s): 4d75483

git and hf files

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .DS_Store +0 -0
  2. app.py +752 -8
  3. assets/.DS_Store +0 -0
  4. assets/config.json +5 -0
  5. assets/hunyuan3d-delight-v2-0/.DS_Store +0 -0
  6. assets/hunyuan3d-delight-v2-0/feature_extractor/preprocessor_config.json +27 -0
  7. assets/hunyuan3d-delight-v2-0/model_index.json +38 -0
  8. assets/hunyuan3d-delight-v2-0/scheduler/scheduler_config.json +20 -0
  9. assets/hunyuan3d-delight-v2-0/text_encoder/config.json +25 -0
  10. assets/hunyuan3d-delight-v2-0/tokenizer/merges.txt +0 -0
  11. assets/hunyuan3d-delight-v2-0/tokenizer/special_tokens_map.json +30 -0
  12. assets/hunyuan3d-delight-v2-0/tokenizer/tokenizer_config.json +38 -0
  13. assets/hunyuan3d-delight-v2-0/tokenizer/vocab.json +0 -0
  14. assets/hunyuan3d-delight-v2-0/unet/config.json +73 -0
  15. assets/hunyuan3d-delight-v2-0/vae/config.json +38 -0
  16. assets/hunyuan3d-dit-v2-0-fast/config.yaml +69 -0
  17. assets/hunyuan3d-dit-v2-0-turbo/config.yaml +70 -0
  18. assets/hunyuan3d-dit-v2-0/config.yaml +68 -0
  19. assets/hunyuan3d-paint-v2-0-turbo/.DS_Store +0 -0
  20. assets/hunyuan3d-paint-v2-0-turbo/.gitattributes +35 -0
  21. assets/hunyuan3d-paint-v2-0-turbo/README.md +53 -0
  22. assets/hunyuan3d-paint-v2-0-turbo/feature_extractor/preprocessor_config.json +20 -0
  23. assets/hunyuan3d-paint-v2-0-turbo/image_encoder/config.json +23 -0
  24. assets/hunyuan3d-paint-v2-0-turbo/image_encoder/preprocessor_config.json +27 -0
  25. assets/hunyuan3d-paint-v2-0-turbo/model_index.json +37 -0
  26. assets/hunyuan3d-paint-v2-0-turbo/scheduler/scheduler_config.json +15 -0
  27. assets/hunyuan3d-paint-v2-0-turbo/text_encoder/config.json +25 -0
  28. assets/hunyuan3d-paint-v2-0-turbo/tokenizer/merges.txt +0 -0
  29. assets/hunyuan3d-paint-v2-0-turbo/tokenizer/special_tokens_map.json +24 -0
  30. assets/hunyuan3d-paint-v2-0-turbo/tokenizer/tokenizer_config.json +34 -0
  31. assets/hunyuan3d-paint-v2-0-turbo/tokenizer/vocab.json +0 -0
  32. assets/hunyuan3d-paint-v2-0-turbo/unet/config.json +45 -0
  33. assets/hunyuan3d-paint-v2-0-turbo/unet/modules.py +610 -0
  34. assets/hunyuan3d-paint-v2-0-turbo/vae/config.json +29 -0
  35. assets/hunyuan3d-paint-v2-0/.DS_Store +0 -0
  36. assets/hunyuan3d-paint-v2-0/.gitattributes +35 -0
  37. assets/hunyuan3d-paint-v2-0/feature_extractor/preprocessor_config.json +20 -0
  38. assets/hunyuan3d-paint-v2-0/model_index.json +33 -0
  39. assets/hunyuan3d-paint-v2-0/scheduler/scheduler_config.json +15 -0
  40. assets/hunyuan3d-paint-v2-0/text_encoder/config.json +25 -0
  41. assets/hunyuan3d-paint-v2-0/tokenizer/merges.txt +0 -0
  42. assets/hunyuan3d-paint-v2-0/tokenizer/special_tokens_map.json +24 -0
  43. assets/hunyuan3d-paint-v2-0/tokenizer/tokenizer_config.json +34 -0
  44. assets/hunyuan3d-paint-v2-0/tokenizer/vocab.json +0 -0
  45. assets/hunyuan3d-paint-v2-0/unet/config.json +45 -0
  46. assets/hunyuan3d-paint-v2-0/unet/modules.py +437 -0
  47. assets/hunyuan3d-paint-v2-0/vae/config.json +29 -0
  48. assets/hunyuan3d-vae-v2-0-turbo/config.yaml +15 -0
  49. assets/hunyuan3d-vae-v2-0/config.yaml +15 -0
  50. hy3dgen/__init__.py +13 -0
.DS_Store ADDED
Binary file (6.15 kB).
 
app.py CHANGED
@@ -1,11 +1,755 @@
1
  import gradio as gr
2
 
3
- def dummy_predict(image):
4
- return "demo.glb"
5
 
6
- gr.Interface(
7
- fn=dummy_predict,
8
- inputs=gr.Image(type="filepath"),
9
- outputs=gr.File(label="GLB Model"),
10
- title="Hunyuan3D Placeholder"
11
- ).launch()
 
1
+ # Hunyuan 3D is licensed under the TENCENT HUNYUAN NON-COMMERCIAL LICENSE AGREEMENT
2
+ # except for the third-party components listed below.
3
+ # Hunyuan 3D does not impose any additional limitations beyond what is outlined
4
+ # in the respective licenses of these third-party components.
5
+ # Users must comply with all terms and conditions of original licenses of these third-party
6
+ # components and must ensure that the usage of the third party components adheres to
7
+ # all relevant laws and regulations.
8
+
9
+ # For avoidance of doubts, Hunyuan 3D means the large language models and
10
+ # their software and algorithms, including trained model weights, parameters (including
11
+ # optimizer states), machine-learning model code, inference-enabling code, training-enabling code,
12
+ # fine-tuning enabling code and other elements of the foregoing made publicly available
13
+ # by Tencent in accordance with TENCENT HUNYUAN COMMUNITY LICENSE AGREEMENT.
14
+
15
+ import os
16
+ import random
17
+ import shutil
18
+ import time
19
+ from glob import glob
20
+ from pathlib import Path
21
+
22
  import gradio as gr
23
+ import torch
24
+ import trimesh
25
+ import uvicorn
26
+ from fastapi import FastAPI
27
+ from fastapi.staticfiles import StaticFiles
28
+ import uuid
29
+
30
+ from hy3dgen.shapegen.utils import logger
31
+
32
+ MAX_SEED = int(1e7)
33
+
34
+
35
+ def get_example_img_list():
36
+ print('Loading example img list ...')
37
+ return sorted(glob('./assets/example_images/**/*.png', recursive=True))
38
+
39
+
40
+ def get_example_txt_list():
41
+ print('Loading example txt list ...')
42
+ txt_list = list()
43
+ for line in open('./assets/example_prompts.txt', encoding='utf-8'):
44
+ txt_list.append(line.strip())
45
+ return txt_list
46
+
47
+
48
+ def get_example_mv_list():
49
+ print('Loading example mv list ...')
50
+ mv_list = list()
51
+ root = './assets/example_mv_images'
52
+ for mv_dir in os.listdir(root):
53
+ view_list = []
54
+ for view in ['front', 'back', 'left', 'right']:
55
+ path = os.path.join(root, mv_dir, f'{view}.png')
56
+ if os.path.exists(path):
57
+ view_list.append(path)
58
+ else:
59
+ view_list.append(None)
60
+ mv_list.append(view_list)
61
+ return mv_list
62
+
63
+
64
+ def gen_save_folder(max_size=200):
65
+ os.makedirs(SAVE_DIR, exist_ok=True)
66
+
67
+ # 获取所有文件夹路径
68
+ dirs = [f for f in Path(SAVE_DIR).iterdir() if f.is_dir()]
69
+
70
+ # 如果文件夹数量超过 max_size,删除创建时间最久的文件夹
71
+ if len(dirs) >= max_size:
72
+ # 按创建时间排序,最久的排在前面
73
+ oldest_dir = min(dirs, key=lambda x: x.stat().st_ctime)
74
+ shutil.rmtree(oldest_dir)
75
+ print(f"Removed the oldest folder: {oldest_dir}")
76
+
77
+ # 生成一个新的 uuid 文件夹名称
78
+ new_folder = os.path.join(SAVE_DIR, str(uuid.uuid4()))
79
+ os.makedirs(new_folder, exist_ok=True)
80
+ print(f"Created new folder: {new_folder}")
81
+
82
+ return new_folder
83
+
84
+
85
+ def export_mesh(mesh, save_folder, textured=False, type='glb'):
86
+ if textured:
87
+ path = os.path.join(save_folder, f'textured_mesh.{type}')
88
+ else:
89
+ path = os.path.join(save_folder, f'white_mesh.{type}')
90
+ if type not in ['glb', 'obj']:
91
+ mesh.export(path)
92
+ else:
93
+ mesh.export(path, include_normals=textured)
94
+ return path
95
+
96
+
97
+ def randomize_seed_fn(seed: int, randomize_seed: bool) -> int:
98
+ if randomize_seed:
99
+ seed = random.randint(0, MAX_SEED)
100
+ return seed
101
+
102
+
103
+ def build_model_viewer_html(save_folder, height=660, width=790, textured=False):
104
+ # Remove first folder from path to make relative path
105
+ if textured:
106
+ related_path = f"./textured_mesh.glb"
107
+ template_name = './assets/modelviewer-textured-template.html'
108
+ output_html_path = os.path.join(save_folder, f'textured_mesh.html')
109
+ else:
110
+ related_path = f"./white_mesh.glb"
111
+ template_name = './assets/modelviewer-template.html'
112
+ output_html_path = os.path.join(save_folder, f'white_mesh.html')
113
+ offset = 50 if textured else 10
114
+ with open(os.path.join(CURRENT_DIR, template_name), 'r', encoding='utf-8') as f:
115
+ template_html = f.read()
116
+
117
+ with open(output_html_path, 'w', encoding='utf-8') as f:
118
+ template_html = template_html.replace('#height#', f'{height - offset}')
119
+ template_html = template_html.replace('#width#', f'{width}')
120
+ template_html = template_html.replace('#src#', f'{related_path}/')
121
+ f.write(template_html)
122
+
123
+ rel_path = os.path.relpath(output_html_path, SAVE_DIR)
124
+ iframe_tag = f'<iframe src="/static/{rel_path}" height="{height}" width="100%" frameborder="0"></iframe>'
125
+ print(
126
+ f'Find html file {output_html_path}, {os.path.exists(output_html_path)}, relative HTML path is /static/{rel_path}')
127
+
128
+ return f"""
129
+ <div style='height: {height}; width: 100%;'>
130
+ {iframe_tag}
131
+ </div>
132
+ """
133
+
134
+
135
+ def _gen_shape(
136
+ caption=None,
137
+ image=None,
138
+ mv_image_front=None,
139
+ mv_image_back=None,
140
+ mv_image_left=None,
141
+ mv_image_right=None,
142
+ steps=50,
143
+ guidance_scale=7.5,
144
+ seed=1234,
145
+ octree_resolution=256,
146
+ check_box_rembg=False,
147
+ num_chunks=200000,
148
+ randomize_seed: bool = False,
149
+ ):
150
+ if not MV_MODE and image is None and caption is None:
151
+ raise gr.Error("Please provide either a caption or an image.")
152
+ if MV_MODE:
153
+ if mv_image_front is None and mv_image_back is None and mv_image_left is None and mv_image_right is None:
154
+ raise gr.Error("Please provide at least one view image.")
155
+ image = {}
156
+ if mv_image_front:
157
+ image['front'] = mv_image_front
158
+ if mv_image_back:
159
+ image['back'] = mv_image_back
160
+ if mv_image_left:
161
+ image['left'] = mv_image_left
162
+ if mv_image_right:
163
+ image['right'] = mv_image_right
164
+
165
+ seed = int(randomize_seed_fn(seed, randomize_seed))
166
+
167
+ octree_resolution = int(octree_resolution)
168
+ if caption: print('prompt is', caption)
169
+ save_folder = gen_save_folder()
170
+ stats = {
171
+ 'model': {
172
+ 'shapegen': f'{args.model_path}/{args.subfolder}',
173
+ 'texgen': f'{args.texgen_model_path}',
174
+ },
175
+ 'params': {
176
+ 'caption': caption,
177
+ 'steps': steps,
178
+ 'guidance_scale': guidance_scale,
179
+ 'seed': seed,
180
+ 'octree_resolution': octree_resolution,
181
+ 'check_box_rembg': check_box_rembg,
182
+ 'num_chunks': num_chunks,
183
+ }
184
+ }
185
+ time_meta = {}
186
+
187
+ if image is None:
188
+ start_time = time.time()
189
+ try:
190
+ image = t2i_worker(caption)
191
+ except Exception as e:
192
+ raise gr.Error(f"Text to 3D is disabled. Please enable it by running `python gradio_app.py --enable_t23d`.")
193
+ time_meta['text2image'] = time.time() - start_time
194
+
195
+ # remove disk io to make responding faster, uncomment at your will.
196
+ # image.save(os.path.join(save_folder, 'input.png'))
197
+ if MV_MODE:
198
+ start_time = time.time()
199
+ for k, v in image.items():
200
+ if check_box_rembg or v.mode == "RGB":
201
+ img = rmbg_worker(v.convert('RGB'))
202
+ image[k] = img
203
+ time_meta['remove background'] = time.time() - start_time
204
+ else:
205
+ if check_box_rembg or image.mode == "RGB":
206
+ start_time = time.time()
207
+ image = rmbg_worker(image.convert('RGB'))
208
+ time_meta['remove background'] = time.time() - start_time
209
+
210
+ # remove disk io to make responding faster, uncomment at your will.
211
+ # image.save(os.path.join(save_folder, 'rembg.png'))
212
+
213
+ # image to white model
214
+ start_time = time.time()
215
+
216
+ generator = torch.Generator()
217
+ generator = generator.manual_seed(int(seed))
218
+ outputs = i23d_worker(
219
+ image=image,
220
+ num_inference_steps=steps,
221
+ guidance_scale=guidance_scale,
222
+ generator=generator,
223
+ octree_resolution=octree_resolution,
224
+ num_chunks=num_chunks,
225
+ output_type='mesh'
226
+ )
227
+ time_meta['shape generation'] = time.time() - start_time
228
+ logger.info("---Shape generation takes %s seconds ---" % (time.time() - start_time))
229
+
230
+ tmp_start = time.time()
231
+ mesh = export_to_trimesh(outputs)[0]
232
+ time_meta['export to trimesh'] = time.time() - tmp_start
233
+
234
+ stats['number_of_faces'] = mesh.faces.shape[0]
235
+ stats['number_of_vertices'] = mesh.vertices.shape[0]
236
+
237
+ stats['time'] = time_meta
238
+ main_image = image if not MV_MODE else image['front']
239
+ return mesh, main_image, save_folder, stats, seed
240
+
241
+
242
+ def generation_all(
243
+ caption=None,
244
+ image=None,
245
+ mv_image_front=None,
246
+ mv_image_back=None,
247
+ mv_image_left=None,
248
+ mv_image_right=None,
249
+ steps=50,
250
+ guidance_scale=7.5,
251
+ seed=1234,
252
+ octree_resolution=256,
253
+ check_box_rembg=False,
254
+ num_chunks=200000,
255
+ randomize_seed: bool = False,
256
+ ):
257
+ start_time_0 = time.time()
258
+ mesh, image, save_folder, stats, seed = _gen_shape(
259
+ caption,
260
+ image,
261
+ mv_image_front=mv_image_front,
262
+ mv_image_back=mv_image_back,
263
+ mv_image_left=mv_image_left,
264
+ mv_image_right=mv_image_right,
265
+ steps=steps,
266
+ guidance_scale=guidance_scale,
267
+ seed=seed,
268
+ octree_resolution=octree_resolution,
269
+ check_box_rembg=check_box_rembg,
270
+ num_chunks=num_chunks,
271
+ randomize_seed=randomize_seed,
272
+ )
273
+ path = export_mesh(mesh, save_folder, textured=False)
274
+
275
+ # tmp_time = time.time()
276
+ # mesh = floater_remove_worker(mesh)
277
+ # mesh = degenerate_face_remove_worker(mesh)
278
+ # logger.info("---Postprocessing takes %s seconds ---" % (time.time() - tmp_time))
279
+ # stats['time']['postprocessing'] = time.time() - tmp_time
280
+
281
+ tmp_time = time.time()
282
+ mesh = face_reduce_worker(mesh)
283
+ logger.info("---Face Reduction takes %s seconds ---" % (time.time() - tmp_time))
284
+ stats['time']['face reduction'] = time.time() - tmp_time
285
+
286
+ tmp_time = time.time()
287
+ textured_mesh = texgen_worker(mesh, image)
288
+ logger.info("---Texture Generation takes %s seconds ---" % (time.time() - tmp_time))
289
+ stats['time']['texture generation'] = time.time() - tmp_time
290
+ stats['time']['total'] = time.time() - start_time_0
291
+
292
+ textured_mesh.metadata['extras'] = stats
293
+ path_textured = export_mesh(textured_mesh, save_folder, textured=True)
294
+ model_viewer_html_textured = build_model_viewer_html(save_folder, height=HTML_HEIGHT, width=HTML_WIDTH,
295
+ textured=True)
296
+ if args.low_vram_mode:
297
+ torch.cuda.empty_cache()
298
+ return (
299
+ gr.update(value=path),
300
+ gr.update(value=path_textured),
301
+ model_viewer_html_textured,
302
+ stats,
303
+ seed,
304
+ )
305
+
306
+
307
+ def shape_generation(
308
+ caption=None,
309
+ image=None,
310
+ mv_image_front=None,
311
+ mv_image_back=None,
312
+ mv_image_left=None,
313
+ mv_image_right=None,
314
+ steps=50,
315
+ guidance_scale=7.5,
316
+ seed=1234,
317
+ octree_resolution=256,
318
+ check_box_rembg=False,
319
+ num_chunks=200000,
320
+ randomize_seed: bool = False,
321
+ ):
322
+ start_time_0 = time.time()
323
+ mesh, image, save_folder, stats, seed = _gen_shape(
324
+ caption,
325
+ image,
326
+ mv_image_front=mv_image_front,
327
+ mv_image_back=mv_image_back,
328
+ mv_image_left=mv_image_left,
329
+ mv_image_right=mv_image_right,
330
+ steps=steps,
331
+ guidance_scale=guidance_scale,
332
+ seed=seed,
333
+ octree_resolution=octree_resolution,
334
+ check_box_rembg=check_box_rembg,
335
+ num_chunks=num_chunks,
336
+ randomize_seed=randomize_seed,
337
+ )
338
+ stats['time']['total'] = time.time() - start_time_0
339
+ mesh.metadata['extras'] = stats
340
+
341
+ path = export_mesh(mesh, save_folder, textured=False)
342
+ model_viewer_html = build_model_viewer_html(save_folder, height=HTML_HEIGHT, width=HTML_WIDTH)
343
+ if args.low_vram_mode:
344
+ torch.cuda.empty_cache()
345
+ return (
346
+ gr.update(value=path),
347
+ model_viewer_html,
348
+ stats,
349
+ seed,
350
+ )
351
+
352
+
353
+ def build_app():
354
+ title = 'Hunyuan3D-2: High Resolution Textured 3D Assets Generation'
355
+ if MV_MODE:
356
+ title = 'Hunyuan3D-2mv: Image to 3D Generation with 1-4 Views'
357
+ if 'mini' in args.subfolder:
358
+ title = 'Hunyuan3D-2mini: Strong 0.6B Image to Shape Generator'
359
+ if TURBO_MODE:
360
+ title = title.replace(':', '-Turbo: Fast ')
361
+
362
+ title_html = f"""
363
+ <div style="font-size: 2em; font-weight: bold; text-align: center; margin-bottom: 5px">
364
+
365
+ {title}
366
+ </div>
367
+ <div align="center">
368
+ Tencent Hunyuan3D Team
369
+ </div>
370
+ <div align="center">
371
+ <a href="https://github.com/tencent/Hunyuan3D-2">Github</a> &ensp;
372
+ <a href="http://3d-models.hunyuan.tencent.com">Homepage</a> &ensp;
373
+ <a href="https://3d.hunyuan.tencent.com">Hunyuan3D Studio</a> &ensp;
374
+ <a href="#">Technical Report</a> &ensp;
375
+ <a href="https://huggingface.co/Tencent/Hunyuan3D-2"> Pretrained Models</a> &ensp;
376
+ </div>
377
+ """
378
+ custom_css = """
379
+ .app.svelte-wpkpf6.svelte-wpkpf6:not(.fill_width) {
380
+ max-width: 1480px;
381
+ }
382
+ .mv-image button .wrap {
383
+ font-size: 10px;
384
+ }
385
+
386
+ .mv-image .icon-wrap {
387
+ width: 20px;
388
+ }
389
+
390
+ """
391
+
392
+ with gr.Blocks(theme=gr.themes.Base(), title='Hunyuan-3D-2.0', analytics_enabled=False, css=custom_css) as demo:
393
+ gr.HTML(title_html)
394
+
395
+ with gr.Row():
396
+ with gr.Column(scale=3):
397
+ with gr.Tabs(selected='tab_img_prompt') as tabs_prompt:
398
+ with gr.Tab('Image Prompt', id='tab_img_prompt', visible=not MV_MODE) as tab_ip:
399
+ image = gr.Image(label='Image', type='pil', image_mode='RGBA', height=290)
400
+
401
+ with gr.Tab('Text Prompt', id='tab_txt_prompt', visible=HAS_T2I and not MV_MODE) as tab_tp:
402
+ caption = gr.Textbox(label='Text Prompt',
403
+ placeholder='HunyuanDiT will be used to generate image.',
404
+ info='Example: A 3D model of a cute cat, white background')
405
+ with gr.Tab('MultiView Prompt', visible=MV_MODE) as tab_mv:
406
+ # gr.Label('Please upload at least one front image.')
407
+ with gr.Row():
408
+ mv_image_front = gr.Image(label='Front', type='pil', image_mode='RGBA', height=140,
409
+ min_width=100, elem_classes='mv-image')
410
+ mv_image_back = gr.Image(label='Back', type='pil', image_mode='RGBA', height=140,
411
+ min_width=100, elem_classes='mv-image')
412
+ with gr.Row():
413
+ mv_image_left = gr.Image(label='Left', type='pil', image_mode='RGBA', height=140,
414
+ min_width=100, elem_classes='mv-image')
415
+ mv_image_right = gr.Image(label='Right', type='pil', image_mode='RGBA', height=140,
416
+ min_width=100, elem_classes='mv-image')
417
+
418
+ with gr.Row():
419
+ btn = gr.Button(value='Gen Shape', variant='primary', min_width=100)
420
+ btn_all = gr.Button(value='Gen Textured Shape',
421
+ variant='primary',
422
+ visible=HAS_TEXTUREGEN,
423
+ min_width=100)
424
+
425
+ with gr.Group():
426
+ file_out = gr.File(label="File", visible=False)
427
+ file_out2 = gr.File(label="File", visible=False)
428
+
429
+ with gr.Tabs(selected='tab_options' if TURBO_MODE else 'tab_export'):
430
+ with gr.Tab("Options", id='tab_options', visible=TURBO_MODE):
431
+ gen_mode = gr.Radio(label='Generation Mode',
432
+ info='Recommendation: Turbo for most cases, Fast for very complex cases, Standard is seldom needed.',
433
+ choices=['Turbo', 'Fast', 'Standard'], value='Turbo')
434
+ decode_mode = gr.Radio(label='Decoding Mode',
435
+ info='The resolution for exporting mesh from generated vectset',
436
+ choices=['Low', 'Standard', 'High'],
437
+ value='Standard')
438
+ with gr.Tab('Advanced Options', id='tab_advanced_options'):
439
+ with gr.Row():
440
+ check_box_rembg = gr.Checkbox(value=True, label='Remove Background', min_width=100)
441
+ randomize_seed = gr.Checkbox(label="Randomize seed", value=True, min_width=100)
442
+ seed = gr.Slider(
443
+ label="Seed",
444
+ minimum=0,
445
+ maximum=MAX_SEED,
446
+ step=1,
447
+ value=1234,
448
+ min_width=100,
449
+ )
450
+ with gr.Row():
451
+ num_steps = gr.Slider(maximum=100,
452
+ minimum=1,
453
+ value=5 if 'turbo' in args.subfolder else 30,
454
+ step=1, label='Inference Steps')
455
+ octree_resolution = gr.Slider(maximum=512, minimum=16, value=256, label='Octree Resolution')
456
+ with gr.Row():
457
+ cfg_scale = gr.Number(value=5.0, label='Guidance Scale', min_width=100)
458
+ num_chunks = gr.Slider(maximum=5000000, minimum=1000, value=8000,
459
+ label='Number of Chunks', min_width=100)
460
+ with gr.Tab("Export", id='tab_export'):
461
+ with gr.Row():
462
+ file_type = gr.Dropdown(label='File Type', choices=SUPPORTED_FORMATS,
463
+ value='glb', min_width=100)
464
+ reduce_face = gr.Checkbox(label='Simplify Mesh', value=False, min_width=100)
465
+ export_texture = gr.Checkbox(label='Include Texture', value=False,
466
+ visible=False, min_width=100)
467
+ target_face_num = gr.Slider(maximum=1000000, minimum=100, value=10000,
468
+ label='Target Face Number')
469
+ with gr.Row():
470
+ confirm_export = gr.Button(value="Transform", min_width=100)
471
+ file_export = gr.DownloadButton(label="Download", variant='primary',
472
+ interactive=False, min_width=100)
473
+
474
+ with gr.Column(scale=6):
475
+ with gr.Tabs(selected='gen_mesh_panel') as tabs_output:
476
+ with gr.Tab('Generated Mesh', id='gen_mesh_panel'):
477
+ html_gen_mesh = gr.HTML(HTML_OUTPUT_PLACEHOLDER, label='Output')
478
+ with gr.Tab('Exporting Mesh', id='export_mesh_panel'):
479
+ html_export_mesh = gr.HTML(HTML_OUTPUT_PLACEHOLDER, label='Output')
480
+ with gr.Tab('Mesh Statistic', id='stats_panel'):
481
+ stats = gr.Json({}, label='Mesh Stats')
482
+
483
+ with gr.Column(scale=3 if MV_MODE else 2):
484
+ with gr.Tabs(selected='tab_img_gallery') as gallery:
485
+ with gr.Tab('Image to 3D Gallery', id='tab_img_gallery', visible=not MV_MODE) as tab_gi:
486
+ with gr.Row():
487
+ gr.Examples(examples=example_is, inputs=[image],
488
+ label=None, examples_per_page=18)
489
+
490
+ with gr.Tab('Text to 3D Gallery', id='tab_txt_gallery', visible=HAS_T2I and not MV_MODE) as tab_gt:
491
+ with gr.Row():
492
+ gr.Examples(examples=example_ts, inputs=[caption],
493
+ label=None, examples_per_page=18)
494
+ with gr.Tab('MultiView to 3D Gallery', id='tab_mv_gallery', visible=MV_MODE) as tab_mv:
495
+ with gr.Row():
496
+ gr.Examples(examples=example_mvs,
497
+ inputs=[mv_image_front, mv_image_back, mv_image_left, mv_image_right],
498
+ label=None, examples_per_page=6)
499
+
500
+ gr.HTML(f"""
501
+ <div align="center">
502
+ Activated Model - Shape Generation ({args.model_path}/{args.subfolder}) ; Texture Generation ({'Hunyuan3D-2' if HAS_TEXTUREGEN else 'Unavailable'})
503
+ </div>
504
+ """)
505
+ if not HAS_TEXTUREGEN:
506
+ gr.HTML("""
507
+ <div style="margin-top: 5px;" align="center">
508
+ <b>Warning: </b>
509
+ Texture synthesis is disabled due to missing requirements,
510
+ please install requirements following <a href="https://github.com/Tencent/Hunyuan3D-2?tab=readme-ov-file#install-requirements">README.md</a> to activate it.
511
+ </div>
512
+ """)
513
+ if not args.enable_t23d:
514
+ gr.HTML("""
515
+ <div style="margin-top: 5px;" align="center">
516
+ <b>Warning: </b>
517
+ Text to 3D is disabled. To activate it, please run `python gradio_app.py --enable_t23d`.
518
+ </div>
519
+ """)
520
+
521
+ tab_ip.select(fn=lambda: gr.update(selected='tab_img_gallery'), outputs=gallery)
522
+ if HAS_T2I:
523
+ tab_tp.select(fn=lambda: gr.update(selected='tab_txt_gallery'), outputs=gallery)
524
+
525
+ btn.click(
526
+ shape_generation,
527
+ inputs=[
528
+ caption,
529
+ image,
530
+ mv_image_front,
531
+ mv_image_back,
532
+ mv_image_left,
533
+ mv_image_right,
534
+ num_steps,
535
+ cfg_scale,
536
+ seed,
537
+ octree_resolution,
538
+ check_box_rembg,
539
+ num_chunks,
540
+ randomize_seed,
541
+ ],
542
+ outputs=[file_out, html_gen_mesh, stats, seed]
543
+ ).then(
544
+ lambda: (gr.update(visible=False, value=False), gr.update(interactive=True), gr.update(interactive=True),
545
+ gr.update(interactive=False)),
546
+ outputs=[export_texture, reduce_face, confirm_export, file_export],
547
+ ).then(
548
+ lambda: gr.update(selected='gen_mesh_panel'),
549
+ outputs=[tabs_output],
550
+ )
551
+
552
+ btn_all.click(
553
+ generation_all,
554
+ inputs=[
555
+ caption,
556
+ image,
557
+ mv_image_front,
558
+ mv_image_back,
559
+ mv_image_left,
560
+ mv_image_right,
561
+ num_steps,
562
+ cfg_scale,
563
+ seed,
564
+ octree_resolution,
565
+ check_box_rembg,
566
+ num_chunks,
567
+ randomize_seed,
568
+ ],
569
+ outputs=[file_out, file_out2, html_gen_mesh, stats, seed]
570
+ ).then(
571
+ lambda: (gr.update(visible=True, value=True), gr.update(interactive=False), gr.update(interactive=True),
572
+ gr.update(interactive=False)),
573
+ outputs=[export_texture, reduce_face, confirm_export, file_export],
574
+ ).then(
575
+ lambda: gr.update(selected='gen_mesh_panel'),
576
+ outputs=[tabs_output],
577
+ )
578
+
579
+ def on_gen_mode_change(value):
580
+ if value == 'Turbo':
581
+ return gr.update(value=5)
582
+ elif value == 'Fast':
583
+ return gr.update(value=10)
584
+ else:
585
+ return gr.update(value=30)
586
+
587
+ gen_mode.change(on_gen_mode_change, inputs=[gen_mode], outputs=[num_steps])
588
+
589
+ def on_decode_mode_change(value):
590
+ if value == 'Low':
591
+ return gr.update(value=196)
592
+ elif value == 'Standard':
593
+ return gr.update(value=256)
594
+ else:
595
+ return gr.update(value=384)
596
+
597
+ decode_mode.change(on_decode_mode_change, inputs=[decode_mode], outputs=[octree_resolution])
598
+
599
+ def on_export_click(file_out, file_out2, file_type, reduce_face, export_texture, target_face_num):
600
+ if file_out is None:
601
+ raise gr.Error('Please generate a mesh first.')
602
+
603
+ print(f'exporting {file_out}')
604
+ print(f'reduce face to {target_face_num}')
605
+ if export_texture:
606
+ mesh = trimesh.load(file_out2)
607
+ save_folder = gen_save_folder()
608
+ path = export_mesh(mesh, save_folder, textured=True, type=file_type)
609
+
610
+ # for preview
611
+ save_folder = gen_save_folder()
612
+ _ = export_mesh(mesh, save_folder, textured=True)
613
+ model_viewer_html = build_model_viewer_html(save_folder, height=HTML_HEIGHT, width=HTML_WIDTH,
614
+ textured=True)
615
+ else:
616
+ mesh = trimesh.load(file_out)
617
+ mesh = floater_remove_worker(mesh)
618
+ mesh = degenerate_face_remove_worker(mesh)
619
+ if reduce_face:
620
+ mesh = face_reduce_worker(mesh, target_face_num)
621
+ save_folder = gen_save_folder()
622
+ path = export_mesh(mesh, save_folder, textured=False, type=file_type)
623
+
624
+ # for preview
625
+ save_folder = gen_save_folder()
626
+ _ = export_mesh(mesh, save_folder, textured=False)
627
+ model_viewer_html = build_model_viewer_html(save_folder, height=HTML_HEIGHT, width=HTML_WIDTH,
628
+ textured=False)
629
+ print(f'export to {path}')
630
+ return model_viewer_html, gr.update(value=path, interactive=True)
631
+
632
+ confirm_export.click(
633
+ lambda: gr.update(selected='export_mesh_panel'),
634
+ outputs=[tabs_output],
635
+ ).then(
636
+ on_export_click,
637
+ inputs=[file_out, file_out2, file_type, reduce_face, export_texture, target_face_num],
638
+ outputs=[html_export_mesh, file_export]
639
+ )
640
+
641
+ return demo
642
+
643
+
644
+ if __name__ == '__main__':
645
+ import argparse
646
+
647
+ parser = argparse.ArgumentParser()
648
+ parser.add_argument("--model_path", type=str, default='tencent/Hunyuan3D-2mini')
649
+ parser.add_argument("--subfolder", type=str, default='hunyuan3d-dit-v2-mini-turbo')
650
+ parser.add_argument("--texgen_model_path", type=str, default='tencent/Hunyuan3D-2')
651
+ parser.add_argument('--port', type=int, default=8080)
652
+ parser.add_argument('--host', type=str, default='0.0.0.0')
653
+ parser.add_argument('--device', type=str, default='cuda')
654
+ parser.add_argument('--mc_algo', type=str, default='mc')
655
+ parser.add_argument('--cache-path', type=str, default='gradio_cache')
656
+ parser.add_argument('--enable_t23d', action='store_true')
657
+ parser.add_argument('--disable_tex', action='store_true')
658
+ parser.add_argument('--enable_flashvdm', action='store_true')
659
+ parser.add_argument('--compile', action='store_true')
660
+ parser.add_argument('--low_vram_mode', action='store_true')
661
+ args = parser.parse_args()
662
+
663
+ SAVE_DIR = args.cache_path
664
+ os.makedirs(SAVE_DIR, exist_ok=True)
665
+
666
+ CURRENT_DIR = os.path.dirname(os.path.abspath(__file__))
667
+ MV_MODE = 'mv' in args.model_path
668
+ TURBO_MODE = 'turbo' in args.subfolder
669
+
670
+ HTML_HEIGHT = 690 if MV_MODE else 650
671
+ HTML_WIDTH = 500
672
+ HTML_OUTPUT_PLACEHOLDER = f"""
673
+ <div style='height: {650}px; width: 100%; border-radius: 8px; border-color: #e5e7eb; border-style: solid; border-width: 1px; display: flex; justify-content: center; align-items: center;'>
674
+ <div style='text-align: center; font-size: 16px; color: #6b7280;'>
675
+ <p style="color: #8d8d8d;">Welcome to Hunyuan3D!</p>
676
+ <p style="color: #8d8d8d;">No mesh here.</p>
677
+ </div>
678
+ </div>
679
+ """
680
+
681
+ INPUT_MESH_HTML = """
682
+ <div style='height: 490px; width: 100%; border-radius: 8px;
683
+ border-color: #e5e7eb; border-style: solid; border-width: 1px;'>
684
+ </div>
685
+ """
686
+ example_is = get_example_img_list()
687
+ example_ts = get_example_txt_list()
688
+ example_mvs = get_example_mv_list()
689
+
690
+ SUPPORTED_FORMATS = ['glb', 'obj', 'ply', 'stl']
691
+
692
+ HAS_TEXTUREGEN = False
693
+ if not args.disable_tex:
694
+ try:
695
+ from hy3dgen.texgen import Hunyuan3DPaintPipeline
696
+
697
+ texgen_worker = Hunyuan3DPaintPipeline.from_pretrained(args.texgen_model_path)
698
+ if args.low_vram_mode:
699
+ texgen_worker.enable_model_cpu_offload()
700
+ # Not help much, ignore for now.
701
+ # if args.compile:
702
+ # texgen_worker.models['delight_model'].pipeline.unet.compile()
703
+ # texgen_worker.models['delight_model'].pipeline.vae.compile()
704
+ # texgen_worker.models['multiview_model'].pipeline.unet.compile()
705
+ # texgen_worker.models['multiview_model'].pipeline.vae.compile()
706
+ HAS_TEXTUREGEN = True
707
+ except Exception as e:
708
+ print(e)
709
+ print("Failed to load texture generator.")
710
+ print('Please try to install requirements by following README.md')
711
+ HAS_TEXTUREGEN = False
712
+
713
+ HAS_T2I = True
714
+ if args.enable_t23d:
715
+ from hy3dgen.text2image import HunyuanDiTPipeline
716
+
717
+ t2i_worker = HunyuanDiTPipeline('Tencent-Hunyuan/HunyuanDiT-v1.1-Diffusers-Distilled', device=args.device)
718
+ HAS_T2I = True
719
+
720
+ from hy3dgen.shapegen import FaceReducer, FloaterRemover, DegenerateFaceRemover, MeshSimplifier, \
721
+ Hunyuan3DDiTFlowMatchingPipeline
722
+ from hy3dgen.shapegen.pipelines import export_to_trimesh
723
+ from hy3dgen.rembg import BackgroundRemover
724
+
725
+ rmbg_worker = BackgroundRemover()
726
+ i23d_worker = Hunyuan3DDiTFlowMatchingPipeline.from_pretrained(
727
+ args.model_path,
728
+ subfolder=args.subfolder,
729
+ use_safetensors=True,
730
+ device=args.device,
731
+ )
732
+ if args.enable_flashvdm:
733
+ mc_algo = 'mc' if args.device in ['cpu', 'mps'] else args.mc_algo
734
+ i23d_worker.enable_flashvdm(mc_algo=mc_algo)
735
+ if args.compile:
736
+ i23d_worker.compile()
737
+
738
+ floater_remove_worker = FloaterRemover()
739
+ degenerate_face_remove_worker = DegenerateFaceRemover()
740
+ face_reduce_worker = FaceReducer()
741
 
742
+ # https://discuss.huggingface.co/t/how-to-serve-an-html-file/33921/2
743
+ # create a FastAPI app
744
+ app = FastAPI()
745
+ # create a static directory to store the static files
746
+ static_dir = Path(SAVE_DIR).absolute()
747
+ static_dir.mkdir(parents=True, exist_ok=True)
748
+ app.mount("/static", StaticFiles(directory=static_dir, html=True), name="static")
749
+ shutil.copytree('./assets/env_maps', os.path.join(static_dir, 'env_maps'), dirs_exist_ok=True)
750
 
751
+ if args.low_vram_mode:
752
+ torch.cuda.empty_cache()
753
+ demo = build_app()
754
+ app = gr.mount_gradio_app(app, demo, path="/")
755
+ uvicorn.run(app, host=args.host, port=args.port, workers=1)
 
assets/.DS_Store ADDED
Binary file (10.2 kB).
 
assets/config.json ADDED
@@ -0,0 +1,5 @@
1
+ {
2
+ "Name": [
3
+ "Hunyuan3D-2"
4
+ ],
5
+ }
assets/hunyuan3d-delight-v2-0/.DS_Store ADDED
Binary file (6.15 kB).
 
assets/hunyuan3d-delight-v2-0/feature_extractor/preprocessor_config.json ADDED
@@ -0,0 +1,27 @@
1
+ {
2
+ "crop_size": {
3
+ "height": 224,
4
+ "width": 224
5
+ },
6
+ "do_center_crop": true,
7
+ "do_convert_rgb": true,
8
+ "do_normalize": true,
9
+ "do_rescale": true,
10
+ "do_resize": true,
11
+ "image_mean": [
12
+ 0.48145466,
13
+ 0.4578275,
14
+ 0.40821073
15
+ ],
16
+ "image_processor_type": "CLIPImageProcessor",
17
+ "image_std": [
18
+ 0.26862954,
19
+ 0.26130258,
20
+ 0.27577711
21
+ ],
22
+ "resample": 3,
23
+ "rescale_factor": 0.00392156862745098,
24
+ "size": {
25
+ "shortest_edge": 224
26
+ }
27
+ }
assets/hunyuan3d-delight-v2-0/model_index.json ADDED
@@ -0,0 +1,38 @@
1
+ {
2
+ "_class_name": "StableDiffusionInstructPix2PixPipeline",
3
+ "_diffusers_version": "0.30.1",
4
+ "_name_or_path": "",
5
+ "feature_extractor": [
6
+ "transformers",
7
+ "CLIPImageProcessor"
8
+ ],
9
+ "image_encoder": [
10
+ null,
11
+ null
12
+ ],
13
+ "requires_safety_checker": false,
14
+ "safety_checker": [
15
+ null,
16
+ null
17
+ ],
18
+ "scheduler": [
19
+ "diffusers",
20
+ "DDIMScheduler"
21
+ ],
22
+ "text_encoder": [
23
+ "transformers",
24
+ "CLIPTextModel"
25
+ ],
26
+ "tokenizer": [
27
+ "transformers",
28
+ "CLIPTokenizer"
29
+ ],
30
+ "unet": [
31
+ "diffusers",
32
+ "UNet2DConditionModel"
33
+ ],
34
+ "vae": [
35
+ "diffusers",
36
+ "AutoencoderKL"
37
+ ]
38
+ }
assets/hunyuan3d-delight-v2-0/scheduler/scheduler_config.json ADDED
@@ -0,0 +1,20 @@
1
+ {
2
+ "_class_name": "DDIMScheduler",
3
+ "_diffusers_version": "0.30.1",
4
+ "beta_end": 0.012,
5
+ "beta_schedule": "scaled_linear",
6
+ "beta_start": 0.00085,
7
+ "clip_sample": false,
8
+ "clip_sample_range": 1.0,
9
+ "dynamic_thresholding_ratio": 0.995,
10
+ "num_train_timesteps": 1000,
11
+ "prediction_type": "v_prediction",
12
+ "rescale_betas_zero_snr": false,
13
+ "sample_max_value": 1.0,
14
+ "set_alpha_to_one": false,
15
+ "skip_prk_steps": true,
16
+ "steps_offset": 1,
17
+ "thresholding": false,
18
+ "timestep_spacing": "leading",
19
+ "trained_betas": null
20
+ }
assets/hunyuan3d-delight-v2-0/text_encoder/config.json ADDED
@@ -0,0 +1,25 @@
1
+ {
2
+ "_name_or_path": "",
3
+ "architectures": [
4
+ "CLIPTextModel"
5
+ ],
6
+ "attention_dropout": 0.0,
7
+ "bos_token_id": 0,
8
+ "dropout": 0.0,
9
+ "eos_token_id": 2,
10
+ "hidden_act": "gelu",
11
+ "hidden_size": 1024,
12
+ "initializer_factor": 1.0,
13
+ "initializer_range": 0.02,
14
+ "intermediate_size": 4096,
15
+ "layer_norm_eps": 1e-05,
16
+ "max_position_embeddings": 77,
17
+ "model_type": "clip_text_model",
18
+ "num_attention_heads": 16,
19
+ "num_hidden_layers": 23,
20
+ "pad_token_id": 1,
21
+ "projection_dim": 512,
22
+ "torch_dtype": "float16",
23
+ "transformers_version": "4.45.0.dev0",
24
+ "vocab_size": 49408
25
+ }
assets/hunyuan3d-delight-v2-0/tokenizer/merges.txt ADDED
The diff for this file is too large to render.
 
assets/hunyuan3d-delight-v2-0/tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<|startoftext|>",
4
+ "lstrip": false,
5
+ "normalized": true,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "<|endoftext|>",
11
+ "lstrip": false,
12
+ "normalized": true,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "!",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "unk_token": {
24
+ "content": "<|endoftext|>",
25
+ "lstrip": false,
26
+ "normalized": true,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ }
30
+ }
assets/hunyuan3d-delight-v2-0/tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,38 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "0": {
5
+ "content": "!",
6
+ "lstrip": false,
7
+ "normalized": false,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": true
11
+ },
12
+ "49406": {
13
+ "content": "<|startoftext|>",
14
+ "lstrip": false,
15
+ "normalized": true,
16
+ "rstrip": false,
17
+ "single_word": false,
18
+ "special": true
19
+ },
20
+ "49407": {
21
+ "content": "<|endoftext|>",
22
+ "lstrip": false,
23
+ "normalized": true,
24
+ "rstrip": false,
25
+ "single_word": false,
26
+ "special": true
27
+ }
28
+ },
29
+ "bos_token": "<|startoftext|>",
30
+ "clean_up_tokenization_spaces": true,
31
+ "do_lower_case": true,
32
+ "eos_token": "<|endoftext|>",
33
+ "errors": "replace",
34
+ "model_max_length": 77,
35
+ "pad_token": "!",
36
+ "tokenizer_class": "CLIPTokenizer",
37
+ "unk_token": "<|endoftext|>"
38
+ }
assets/hunyuan3d-delight-v2-0/tokenizer/vocab.json ADDED
The diff for this file is too large to render.
 
assets/hunyuan3d-delight-v2-0/unet/config.json ADDED
@@ -0,0 +1,73 @@
1
+ {
2
+ "_class_name": "UNet2DConditionModel",
3
+ "_diffusers_version": "0.30.1",
4
+ "_name_or_path": "",
5
+ "act_fn": "silu",
6
+ "addition_embed_type": null,
7
+ "addition_embed_type_num_heads": 64,
8
+ "addition_time_embed_dim": null,
9
+ "attention_head_dim": [
10
+ 5,
11
+ 10,
12
+ 20,
13
+ 20
14
+ ],
15
+ "attention_type": "default",
16
+ "block_out_channels": [
17
+ 320,
18
+ 640,
19
+ 1280,
20
+ 1280
21
+ ],
22
+ "center_input_sample": false,
23
+ "class_embed_type": null,
24
+ "class_embeddings_concat": false,
25
+ "conv_in_kernel": 3,
26
+ "conv_out_kernel": 3,
27
+ "cross_attention_dim": 1024,
28
+ "cross_attention_norm": null,
29
+ "down_block_types": [
30
+ "CrossAttnDownBlock2D",
31
+ "CrossAttnDownBlock2D",
32
+ "CrossAttnDownBlock2D",
33
+ "DownBlock2D"
34
+ ],
35
+ "downsample_padding": 1,
36
+ "dropout": 0.0,
37
+ "dual_cross_attention": false,
38
+ "encoder_hid_dim": null,
39
+ "encoder_hid_dim_type": null,
40
+ "flip_sin_to_cos": true,
41
+ "freq_shift": 0,
42
+ "in_channels": 8,
43
+ "layers_per_block": 2,
44
+ "mid_block_only_cross_attention": null,
45
+ "mid_block_scale_factor": 1,
46
+ "mid_block_type": "UNetMidBlock2DCrossAttn",
47
+ "norm_eps": 1e-05,
48
+ "norm_num_groups": 32,
49
+ "num_attention_heads": null,
50
+ "num_class_embeds": null,
51
+ "only_cross_attention": false,
52
+ "out_channels": 4,
53
+ "projection_class_embeddings_input_dim": null,
54
+ "resnet_out_scale_factor": 1.0,
55
+ "resnet_skip_time_act": false,
56
+ "resnet_time_scale_shift": "default",
57
+ "reverse_transformer_layers_per_block": null,
58
+ "sample_size": 96,
59
+ "time_cond_proj_dim": null,
60
+ "time_embedding_act_fn": null,
61
+ "time_embedding_dim": null,
62
+ "time_embedding_type": "positional",
63
+ "timestep_post_act": null,
64
+ "transformer_layers_per_block": 1,
65
+ "up_block_types": [
66
+ "UpBlock2D",
67
+ "CrossAttnUpBlock2D",
68
+ "CrossAttnUpBlock2D",
69
+ "CrossAttnUpBlock2D"
70
+ ],
71
+ "upcast_attention": true,
72
+ "use_linear_projection": true
73
+ }
assets/hunyuan3d-delight-v2-0/vae/config.json ADDED
@@ -0,0 +1,38 @@
1
+ {
2
+ "_class_name": "AutoencoderKL",
3
+ "_diffusers_version": "0.30.1",
4
+ "_name_or_path": "",
5
+ "act_fn": "silu",
6
+ "block_out_channels": [
7
+ 128,
8
+ 256,
9
+ 512,
10
+ 512
11
+ ],
12
+ "down_block_types": [
13
+ "DownEncoderBlock2D",
14
+ "DownEncoderBlock2D",
15
+ "DownEncoderBlock2D",
16
+ "DownEncoderBlock2D"
17
+ ],
18
+ "force_upcast": true,
19
+ "in_channels": 3,
20
+ "latent_channels": 4,
21
+ "latents_mean": null,
22
+ "latents_std": null,
23
+ "layers_per_block": 2,
24
+ "mid_block_add_attention": true,
25
+ "norm_num_groups": 32,
26
+ "out_channels": 3,
27
+ "sample_size": 768,
28
+ "scaling_factor": 0.18215,
29
+ "shift_factor": null,
30
+ "up_block_types": [
31
+ "UpDecoderBlock2D",
32
+ "UpDecoderBlock2D",
33
+ "UpDecoderBlock2D",
34
+ "UpDecoderBlock2D"
35
+ ],
36
+ "use_post_quant_conv": true,
37
+ "use_quant_conv": true
38
+ }
assets/hunyuan3d-dit-v2-0-fast/config.yaml ADDED
@@ -0,0 +1,69 @@
1
+ model:
2
+ target: hy3dgen.shapegen.models.Hunyuan3DDiT
3
+ params:
4
+ in_channels: 64
5
+ context_in_dim: 1536
6
+ hidden_size: 1024
7
+ mlp_ratio: 4.0
8
+ num_heads: 16
9
+ depth: 16
10
+ depth_single_blocks: 32
11
+ axes_dim: [ 64 ]
12
+ theta: 10000
13
+ qkv_bias: true
14
+ guidance_embed: true
15
+
16
+ vae:
17
+ target: hy3dgen.shapegen.models.ShapeVAE
18
+ params:
19
+ num_latents: 3072
20
+ embed_dim: 64
21
+ num_freqs: 8
22
+ include_pi: false
23
+ heads: 16
24
+ width: 1024
25
+ num_decoder_layers: 16
26
+ qkv_bias: false
27
+ qk_norm: true
28
+ scale_factor: 0.9990943042622529
29
+
30
+ conditioner:
31
+ target: hy3dgen.shapegen.models.SingleImageEncoder
32
+ params:
33
+ main_image_encoder:
34
+ type: DinoImageEncoder # dino giant
35
+ kwargs:
36
+ config:
37
+ attention_probs_dropout_prob: 0.0
38
+ drop_path_rate: 0.0
39
+ hidden_act: gelu
40
+ hidden_dropout_prob: 0.0
41
+ hidden_size: 1536
42
+ image_size: 518
43
+ initializer_range: 0.02
44
+ layer_norm_eps: 1.e-6
45
+ layerscale_value: 1.0
46
+ mlp_ratio: 4
47
+ model_type: dinov2
48
+ num_attention_heads: 24
49
+ num_channels: 3
50
+ num_hidden_layers: 40
51
+ patch_size: 14
52
+ qkv_bias: true
53
+ torch_dtype: float32
54
+ use_swiglu_ffn: true
55
+ image_size: 518
56
+
57
+ scheduler:
58
+ target: hy3dgen.shapegen.schedulers.FlowMatchEulerDiscreteScheduler
59
+ params:
60
+ num_train_timesteps: 1000
61
+
62
+ image_processor:
63
+ target: hy3dgen.shapegen.preprocessors.ImageProcessorV2
64
+ params:
65
+ size: 512
66
+ border_ratio: 0.15
67
+
68
+ pipeline:
69
+ target: hy3dgen.shapegen.pipelines.Hunyuan3DDiTFlowMatchingPipeline
assets/hunyuan3d-dit-v2-0-turbo/config.yaml ADDED
@@ -0,0 +1,70 @@
1
+ model:
2
+ target: hy3dgen.shapegen.models.Hunyuan3DDiT
3
+ params:
4
+ in_channels: 64
5
+ context_in_dim: 1536
6
+ hidden_size: 1024
7
+ mlp_ratio: 4.0
8
+ num_heads: 16
9
+ depth: 16
10
+ depth_single_blocks: 32
11
+ axes_dim: [ 64 ]
12
+ theta: 10000
13
+ qkv_bias: true
14
+ guidance_embed: true
15
+
16
+ vae:
17
+ target: hy3dgen.shapegen.models.ShapeVAE
18
+ params:
19
+ num_latents: 3072
20
+ embed_dim: 64
21
+ num_freqs: 8
22
+ include_pi: false
23
+ heads: 16
24
+ width: 1024
25
+ num_decoder_layers: 16
26
+ qkv_bias: false
27
+ qk_norm: true
28
+ scale_factor: 0.9990943042622529
29
+
30
+ conditioner:
31
+ target: hy3dgen.shapegen.models.SingleImageEncoder
32
+ params:
33
+ main_image_encoder:
34
+ type: DinoImageEncoder # dino giant
35
+ kwargs:
36
+ config:
37
+ attention_probs_dropout_prob: 0.0
38
+ drop_path_rate: 0.0
39
+ hidden_act: gelu
40
+ hidden_dropout_prob: 0.0
41
+ hidden_size: 1536
42
+ image_size: 518
43
+ initializer_range: 0.02
44
+ layer_norm_eps: 1.e-6
45
+ layerscale_value: 1.0
46
+ mlp_ratio: 4
47
+ model_type: dinov2
48
+ num_attention_heads: 24
49
+ num_channels: 3
50
+ num_hidden_layers: 40
51
+ patch_size: 14
52
+ qkv_bias: true
53
+ torch_dtype: float32
54
+ use_swiglu_ffn: true
55
+ image_size: 518
56
+
57
+ scheduler:
58
+ target: hy3dgen.shapegen.schedulers.ConsistencyFlowMatchEulerDiscreteScheduler
59
+ params:
60
+ num_train_timesteps: 1000
61
+ pcm_timesteps: 100
62
+
63
+ image_processor:
64
+ target: hy3dgen.shapegen.preprocessors.ImageProcessorV2
65
+ params:
66
+ size: 512
67
+ border_ratio: 0.15
68
+
69
+ pipeline:
70
+ target: hy3dgen.shapegen.pipelines.Hunyuan3DDiTFlowMatchingPipeline
assets/hunyuan3d-dit-v2-0/config.yaml ADDED
@@ -0,0 +1,68 @@
1
+ model:
2
+ target: hy3dgen.shapegen.models.Hunyuan3DDiT
3
+ params:
4
+ in_channels: 64
5
+ context_in_dim: 1536
6
+ hidden_size: 1024
7
+ mlp_ratio: 4.0
8
+ num_heads: 16
9
+ depth: 16
10
+ depth_single_blocks: 32
11
+ axes_dim: [ 64 ]
12
+ theta: 10000
13
+ qkv_bias: True
14
+
15
+ vae:
16
+ target: hy3dgen.shapegen.models.ShapeVAE
17
+ params:
18
+ num_latents: 3072
19
+ embed_dim: 64
20
+ num_freqs: 8
21
+ include_pi: false
22
+ heads: 16
23
+ width: 1024
24
+ num_decoder_layers: 16
25
+ qkv_bias: false
26
+ qk_norm: true
27
+ scale_factor: 0.9990943042622529
28
+
29
+ conditioner:
30
+ target: hy3dgen.shapegen.models.SingleImageEncoder
31
+ params:
32
+ main_image_encoder:
33
+ type: DinoImageEncoder # dino giant
34
+ kwargs:
35
+ config:
36
+ attention_probs_dropout_prob: 0.0
37
+ drop_path_rate: 0.0
38
+ hidden_act: gelu
39
+ hidden_dropout_prob: 0.0
40
+ hidden_size: 1536
41
+ image_size: 518
42
+ initializer_range: 0.02
43
+ layer_norm_eps: 1.e-6
44
+ layerscale_value: 1.0
45
+ mlp_ratio: 4
46
+ model_type: dinov2
47
+ num_attention_heads: 24
48
+ num_channels: 3
49
+ num_hidden_layers: 40
50
+ patch_size: 14
51
+ qkv_bias: true
52
+ torch_dtype: float32
53
+ use_swiglu_ffn: true
54
+ image_size: 518
55
+
56
+ scheduler:
57
+ target: hy3dgen.shapegen.schedulers.FlowMatchEulerDiscreteScheduler
58
+ params:
59
+ num_train_timesteps: 1000
60
+
61
+ image_processor:
62
+ target: hy3dgen.shapegen.preprocessors.ImageProcessorV2
63
+ params:
64
+ size: 512
65
+ border_ratio: 0.15
66
+
67
+ pipeline:
68
+ target: hy3dgen.shapegen.pipelines.Hunyuan3DDiTFlowMatchingPipeline
assets/hunyuan3d-paint-v2-0-turbo/.DS_Store ADDED
Binary file (8.2 kB).
 
assets/hunyuan3d-paint-v2-0-turbo/.gitattributes ADDED
@@ -0,0 +1,35 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
assets/hunyuan3d-paint-v2-0-turbo/README.md ADDED
@@ -0,0 +1,53 @@
1
+ ---
2
+ license: openrail++
3
+ tags:
4
+ - stable-diffusion
5
+ - text-to-image
6
+ ---
7
+
8
+ # SD v2.1-base with Zero Terminal SNR (LAION Aesthetic 6+)
9
+
10
+ This model is used in [Diffusion Model with Perceptual Loss](https://arxiv.org/abs/2401.00110) paper as the MSE baseline.
11
+
12
+ This model is trained using zero terminal SNR schedule following [Common Diffusion Noise Schedules and Sample Steps are Flawed](https://arxiv.org/abs/2305.08891) paper on LAION aesthetic 6+ data.
13
+
14
+ This model is finetuned from [stabilityai/stable-diffusion-2-1-base](https://huggingface.co/stabilityai/stable-diffusion-2-1-base).
15
+
16
+ This model is meant for research demonstration, not for production use.
17
+
18
+ ## Usage
19
+
20
+ ```python
21
+ from diffusers import StableDiffusionPipeline
22
+ prompt = "A young girl smiling"
23
+ pipe = StableDiffusionPipeline.from_pretrained("ByteDance/sd2.1-base-zsnr-laionaes6").to("cuda")
24
+ pipe(prompt, guidance_scale=7.5, guidance_rescale=0.7).images[0].save("out.jpg")
25
+ ```
26
+
27
+ ## Related Models
28
+
29
+ * [bytedance/sd2.1-base-zsnr-laionaes5](https://huggingface.co/ByteDance/sd2.1-base-zsnr-laionaes5)
30
+ * [bytedance/sd2.1-base-zsnr-laionaes6](https://huggingface.co/ByteDance/sd2.1-base-zsnr-laionaes6)
31
+ * [bytedance/sd2.1-base-zsnr-laionaes6-perceptual](https://huggingface.co/ByteDance/sd2.1-base-zsnr-laionaes6-perceptual)
32
+
33
+
34
+ ## Cite as
35
+ ```
36
+ @misc{lin2024diffusion,
37
+ title={Diffusion Model with Perceptual Loss},
38
+ author={Shanchuan Lin and Xiao Yang},
39
+ year={2024},
40
+ eprint={2401.00110},
41
+ archivePrefix={arXiv},
42
+ primaryClass={cs.CV}
43
+ }
44
+
45
+ @misc{lin2023common,
46
+ title={Common Diffusion Noise Schedules and Sample Steps are Flawed},
47
+ author={Shanchuan Lin and Bingchen Liu and Jiashi Li and Xiao Yang},
48
+ year={2023},
49
+ eprint={2305.08891},
50
+ archivePrefix={arXiv},
51
+ primaryClass={cs.CV}
52
+ }
53
+ ```
assets/hunyuan3d-paint-v2-0-turbo/feature_extractor/preprocessor_config.json ADDED
@@ -0,0 +1,20 @@
1
+ {
2
+ "crop_size": 224,
3
+ "do_center_crop": true,
4
+ "do_convert_rgb": true,
5
+ "do_normalize": true,
6
+ "do_resize": true,
7
+ "feature_extractor_type": "CLIPFeatureExtractor",
8
+ "image_mean": [
9
+ 0.48145466,
10
+ 0.4578275,
11
+ 0.40821073
12
+ ],
13
+ "image_std": [
14
+ 0.26862954,
15
+ 0.26130258,
16
+ 0.27577711
17
+ ],
18
+ "resample": 3,
19
+ "size": 224
20
+ }
assets/hunyuan3d-paint-v2-0-turbo/image_encoder/config.json ADDED
@@ -0,0 +1,23 @@
1
+ {
2
+ "_name_or_path": "D:\\.cache\\huggingface\\hub\\models--sudo-ai--zero123plus-v1.1\\snapshots\\36df7de980afd15f80b2e1a4e9a920d7020e2654\\vision_encoder",
3
+ "architectures": [
4
+ "CLIPVisionModelWithProjection"
5
+ ],
6
+ "attention_dropout": 0.0,
7
+ "dropout": 0.0,
8
+ "hidden_act": "gelu",
9
+ "hidden_size": 1280,
10
+ "image_size": 224,
11
+ "initializer_factor": 1.0,
12
+ "initializer_range": 0.02,
13
+ "intermediate_size": 5120,
14
+ "layer_norm_eps": 1e-05,
15
+ "model_type": "clip_vision_model",
16
+ "num_attention_heads": 16,
17
+ "num_channels": 3,
18
+ "num_hidden_layers": 32,
19
+ "patch_size": 14,
20
+ "projection_dim": 1024,
21
+ "torch_dtype": "float16",
22
+ "transformers_version": "4.36.0"
23
+ }
assets/hunyuan3d-paint-v2-0-turbo/image_encoder/preprocessor_config.json ADDED
@@ -0,0 +1,27 @@
1
+ {
2
+ "crop_size": {
3
+ "height": 224,
4
+ "width": 224
5
+ },
6
+ "do_center_crop": true,
7
+ "do_convert_rgb": true,
8
+ "do_normalize": true,
9
+ "do_rescale": true,
10
+ "do_resize": true,
11
+ "image_mean": [
12
+ 0.48145466,
13
+ 0.4578275,
14
+ 0.40821073
15
+ ],
16
+ "image_processor_type": "CLIPImageProcessor",
17
+ "image_std": [
18
+ 0.26862954,
19
+ 0.26130258,
20
+ 0.27577711
21
+ ],
22
+ "resample": 3,
23
+ "rescale_factor": 0.00392156862745098,
24
+ "size": {
25
+ "shortest_edge": 224
26
+ }
27
+ }
assets/hunyuan3d-paint-v2-0-turbo/model_index.json ADDED
@@ -0,0 +1,37 @@
1
+ {
2
+ "_class_name": "StableDiffusionPipeline",
3
+ "_diffusers_version": "0.23.1",
4
+ "feature_extractor": [
5
+ "transformers",
6
+ "CLIPImageProcessor"
7
+ ],
8
+ "requires_safety_checker": false,
9
+ "safety_checker": [
10
+ null,
11
+ null
12
+ ],
13
+ "scheduler": [
14
+ "diffusers",
15
+ "DDIMScheduler"
16
+ ],
17
+ "text_encoder": [
18
+ "transformers",
19
+ "CLIPTextModel"
20
+ ],
21
+ "tokenizer": [
22
+ "transformers",
23
+ "CLIPTokenizer"
24
+ ],
25
+ "image_encoder": [
26
+ "transformers",
27
+ "CLIPVisionModelWithProjection"
28
+ ],
29
+ "unet": [
30
+ "modules",
31
+ "UNet2p5DConditionModel"
32
+ ],
33
+ "vae": [
34
+ "diffusers",
35
+ "AutoencoderKL"
36
+ ]
37
+ }
assets/hunyuan3d-paint-v2-0-turbo/scheduler/scheduler_config.json ADDED
@@ -0,0 +1,15 @@
1
+ {
2
+ "_class_name": "DDIMScheduler",
3
+ "_diffusers_version": "0.23.1",
4
+ "beta_end": 0.012,
5
+ "beta_schedule": "scaled_linear",
6
+ "beta_start": 0.00085,
7
+ "clip_sample": false,
8
+ "num_train_timesteps": 1000,
9
+ "prediction_type": "v_prediction",
10
+ "set_alpha_to_one": true,
11
+ "steps_offset": 1,
12
+ "trained_betas": null,
13
+ "timestep_spacing": "trailing",
14
+ "rescale_betas_zero_snr": true
15
+ }
assets/hunyuan3d-paint-v2-0-turbo/text_encoder/config.json ADDED
@@ -0,0 +1,25 @@
1
+ {
2
+ "_name_or_path": "stabilityai/stable-diffusion-2",
3
+ "architectures": [
4
+ "CLIPTextModel"
5
+ ],
6
+ "attention_dropout": 0.0,
7
+ "bos_token_id": 0,
8
+ "dropout": 0.0,
9
+ "eos_token_id": 2,
10
+ "hidden_act": "gelu",
11
+ "hidden_size": 1024,
12
+ "initializer_factor": 1.0,
13
+ "initializer_range": 0.02,
14
+ "intermediate_size": 4096,
15
+ "layer_norm_eps": 1e-05,
16
+ "max_position_embeddings": 77,
17
+ "model_type": "clip_text_model",
18
+ "num_attention_heads": 16,
19
+ "num_hidden_layers": 23,
20
+ "pad_token_id": 1,
21
+ "projection_dim": 512,
22
+ "torch_dtype": "float32",
23
+ "transformers_version": "4.25.0.dev0",
24
+ "vocab_size": 49408
25
+ }
assets/hunyuan3d-paint-v2-0-turbo/tokenizer/merges.txt ADDED
The diff for this file is too large to render.
 
assets/hunyuan3d-paint-v2-0-turbo/tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<|startoftext|>",
4
+ "lstrip": false,
5
+ "normalized": true,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "<|endoftext|>",
11
+ "lstrip": false,
12
+ "normalized": true,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": "!",
17
+ "unk_token": {
18
+ "content": "<|endoftext|>",
19
+ "lstrip": false,
20
+ "normalized": true,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ }
24
+ }
assets/hunyuan3d-paint-v2-0-turbo/tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,34 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "bos_token": {
4
+ "__type": "AddedToken",
5
+ "content": "<|startoftext|>",
6
+ "lstrip": false,
7
+ "normalized": true,
8
+ "rstrip": false,
9
+ "single_word": false
10
+ },
11
+ "do_lower_case": true,
12
+ "eos_token": {
13
+ "__type": "AddedToken",
14
+ "content": "<|endoftext|>",
15
+ "lstrip": false,
16
+ "normalized": true,
17
+ "rstrip": false,
18
+ "single_word": false
19
+ },
20
+ "errors": "replace",
21
+ "model_max_length": 77,
22
+ "name_or_path": "stabilityai/stable-diffusion-2",
23
+ "pad_token": "<|endoftext|>",
24
+ "special_tokens_map_file": "./special_tokens_map.json",
25
+ "tokenizer_class": "CLIPTokenizer",
26
+ "unk_token": {
27
+ "__type": "AddedToken",
28
+ "content": "<|endoftext|>",
29
+ "lstrip": false,
30
+ "normalized": true,
31
+ "rstrip": false,
32
+ "single_word": false
33
+ }
34
+ }
assets/hunyuan3d-paint-v2-0-turbo/tokenizer/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
assets/hunyuan3d-paint-v2-0-turbo/unet/config.json ADDED
@@ -0,0 +1,45 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "_class_name": "UNet2DConditionModel",
3
+ "_diffusers_version": "0.10.0.dev0",
4
+ "act_fn": "silu",
5
+ "attention_head_dim": [
6
+ 5,
7
+ 10,
8
+ 20,
9
+ 20
10
+ ],
11
+ "block_out_channels": [
12
+ 320,
13
+ 640,
14
+ 1280,
15
+ 1280
16
+ ],
17
+ "center_input_sample": false,
18
+ "cross_attention_dim": 1024,
19
+ "down_block_types": [
20
+ "CrossAttnDownBlock2D",
21
+ "CrossAttnDownBlock2D",
22
+ "CrossAttnDownBlock2D",
23
+ "DownBlock2D"
24
+ ],
25
+ "downsample_padding": 1,
26
+ "dual_cross_attention": false,
27
+ "flip_sin_to_cos": true,
28
+ "freq_shift": 0,
29
+ "in_channels": 4,
30
+ "layers_per_block": 2,
31
+ "mid_block_scale_factor": 1,
32
+ "norm_eps": 1e-05,
33
+ "norm_num_groups": 32,
34
+ "num_class_embeds": null,
35
+ "only_cross_attention": false,
36
+ "out_channels": 4,
37
+ "sample_size": 64,
38
+ "up_block_types": [
39
+ "UpBlock2D",
40
+ "CrossAttnUpBlock2D",
41
+ "CrossAttnUpBlock2D",
42
+ "CrossAttnUpBlock2D"
43
+ ],
44
+ "use_linear_projection": true
45
+ }
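A quick sanity-check sketch for the UNet config above (a standard SD2-style UNet). The path is an assumption, and note that unet/modules.py below later swaps conv_in to a 12-channel input (noisy latents plus normal and position maps).

```python
import json
import torch
from diffusers import UNet2DConditionModel

with open("assets/hunyuan3d-paint-v2-0-turbo/unet/config.json", "r") as f:
    config = json.load(f)

# from_config ignores the bookkeeping keys (_class_name, _diffusers_version).
unet = UNet2DConditionModel.from_config(config)

latents = torch.randn(1, config["in_channels"], 64, 64)       # sample_size 64
context = torch.randn(1, 77, config["cross_attention_dim"])   # CLIP text states
with torch.no_grad():
    out = unet(latents, timestep=0, encoder_hidden_states=context).sample
print(out.shape)  # torch.Size([1, 4, 64, 64])
```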
assets/hunyuan3d-paint-v2-0-turbo/unet/modules.py ADDED
@@ -0,0 +1,610 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Open Source Model Licensed under the Apache License Version 2.0
2
+ # and Other Licenses of the Third-Party Components therein:
3
+ # The below Model in this distribution may have been modified by THL A29 Limited
4
+ # ("Tencent Modifications"). All Tencent Modifications are Copyright (C) 2024 THL A29 Limited.
5
+
6
+ # Copyright (C) 2024 THL A29 Limited, a Tencent company. All rights reserved.
7
+ # The below software and/or models in this distribution may have been
8
+ # modified by THL A29 Limited ("Tencent Modifications").
9
+ # All Tencent Modifications are Copyright (C) THL A29 Limited.
10
+
11
+ # Hunyuan 3D is licensed under the TENCENT HUNYUAN NON-COMMERCIAL LICENSE AGREEMENT
12
+ # except for the third-party components listed below.
13
+ # Hunyuan 3D does not impose any additional limitations beyond what is outlined
14
+ # in the respective licenses of these third-party components.
15
+ # Users must comply with all terms and conditions of original licenses of these third-party
16
+ # components and must ensure that the usage of the third party components adheres to
17
+ # all relevant laws and regulations.
18
+
19
+ # For avoidance of doubts, Hunyuan 3D means the large language models and
20
+ # their software and algorithms, including trained model weights, parameters (including
21
+ # optimizer states), machine-learning model code, inference-enabling code, training-enabling code,
22
+ # fine-tuning enabling code and other elements of the foregoing made publicly available
23
+ # by Tencent in accordance with TENCENT HUNYUAN COMMUNITY LICENSE AGREEMENT.
24
+
25
+ import copy
26
+ import json
27
+ import os
28
+ from typing import Any, Dict, List, Optional, Tuple, Union
29
+
30
+ import torch
31
+ import torch.nn as nn
32
+ import torch.nn.functional as F
33
+ from diffusers.models import UNet2DConditionModel
34
+ from diffusers.models.attention_processor import Attention
35
+ from diffusers.models.transformers.transformer_2d import BasicTransformerBlock
36
+ from einops import rearrange
37
+
38
+
39
+ def _chunked_feed_forward(ff: nn.Module, hidden_states: torch.Tensor, chunk_dim: int, chunk_size: int):
40
+ # "feed_forward_chunk_size" can be used to save memory
41
+ if hidden_states.shape[chunk_dim] % chunk_size != 0:
42
+ raise ValueError(
43
+ f"`hidden_states` dimension to be chunked: {hidden_states.shape[chunk_dim]}"
44
+ f"has to be divisible by chunk size: {chunk_size}."
45
+ f" Make sure to set an appropriate `chunk_size` when calling `unet.enable_forward_chunking`."
46
+ )
47
+
48
+ num_chunks = hidden_states.shape[chunk_dim] // chunk_size
49
+ ff_output = torch.cat(
50
+ [ff(hid_slice) for hid_slice in hidden_states.chunk(num_chunks, dim=chunk_dim)],
51
+ dim=chunk_dim,
52
+ )
53
+ return ff_output
54
+
55
+
56
+ class Basic2p5DTransformerBlock(torch.nn.Module):
57
+ def __init__(self, transformer: BasicTransformerBlock, layer_name, use_ma=True, use_ra=True, is_turbo=False) -> None:
58
+ super().__init__()
59
+ self.transformer = transformer
60
+ self.layer_name = layer_name
61
+ self.use_ma = use_ma
62
+ self.use_ra = use_ra
63
+ self.is_turbo = is_turbo
64
+
65
+ # multiview attn
66
+ if self.use_ma:
67
+ self.attn_multiview = Attention(
68
+ query_dim=self.dim,
69
+ heads=self.num_attention_heads,
70
+ dim_head=self.attention_head_dim,
71
+ dropout=self.dropout,
72
+ bias=self.attention_bias,
73
+ cross_attention_dim=None,
74
+ upcast_attention=self.attn1.upcast_attention,
75
+ out_bias=True,
76
+ )
77
+
78
+ # ref attn
79
+ if self.use_ra:
80
+ self.attn_refview = Attention(
81
+ query_dim=self.dim,
82
+ heads=self.num_attention_heads,
83
+ dim_head=self.attention_head_dim,
84
+ dropout=self.dropout,
85
+ bias=self.attention_bias,
86
+ cross_attention_dim=None,
87
+ upcast_attention=self.attn1.upcast_attention,
88
+ out_bias=True,
89
+ )
90
+ if self.is_turbo:
91
+ self._initialize_attn_weights()
92
+
93
+ def _initialize_attn_weights(self):
94
+
95
+ if self.use_ma:
96
+ self.attn_multiview.load_state_dict(self.attn1.state_dict())
97
+ with torch.no_grad():
98
+ for layer in self.attn_multiview.to_out:
99
+ for param in layer.parameters():
100
+ param.zero_()
101
+ if self.use_ra:
102
+ self.attn_refview.load_state_dict(self.attn1.state_dict())
103
+ with torch.no_grad():
104
+ for layer in self.attn_refview.to_out:
105
+ for param in layer.parameters():
106
+ param.zero_()
107
+
108
+ def __getattr__(self, name: str):
109
+ try:
110
+ return super().__getattr__(name)
111
+ except AttributeError:
112
+ return getattr(self.transformer, name)
113
+
114
+ def forward(
115
+ self,
116
+ hidden_states: torch.Tensor,
117
+ attention_mask: Optional[torch.Tensor] = None,
118
+ encoder_hidden_states: Optional[torch.Tensor] = None,
119
+ encoder_attention_mask: Optional[torch.Tensor] = None,
120
+ timestep: Optional[torch.LongTensor] = None,
121
+ cross_attention_kwargs: Dict[str, Any] = None,
122
+ class_labels: Optional[torch.LongTensor] = None,
123
+ added_cond_kwargs: Optional[Dict[str, torch.Tensor]] = None,
124
+ ) -> torch.Tensor:
125
+
126
+ # Notice that normalization is always applied before the real computation in the following blocks.
127
+ # 0. Self-Attention
128
+ batch_size = hidden_states.shape[0]
129
+
130
+ cross_attention_kwargs = cross_attention_kwargs.copy() if cross_attention_kwargs is not None else {}
131
+ num_in_batch = cross_attention_kwargs.pop('num_in_batch', 1)
132
+ mode = cross_attention_kwargs.pop('mode', None)
133
+ if not self.is_turbo:
134
+ mva_scale = cross_attention_kwargs.pop('mva_scale', 1.0)
135
+ ref_scale = cross_attention_kwargs.pop('ref_scale', 1.0)
136
+ else:
137
+ position_attn_mask = cross_attention_kwargs.pop("position_attn_mask", None)
138
+ position_voxel_indices = cross_attention_kwargs.pop("position_voxel_indices", None)
139
+ mva_scale = 1.0
140
+ ref_scale = 1.0
141
+
142
+ condition_embed_dict = cross_attention_kwargs.pop("condition_embed_dict", None)
143
+
144
+ if self.norm_type == "ada_norm":
145
+ norm_hidden_states = self.norm1(hidden_states, timestep)
146
+ elif self.norm_type == "ada_norm_zero":
147
+ norm_hidden_states, gate_msa, shift_mlp, scale_mlp, gate_mlp = self.norm1(
148
+ hidden_states, timestep, class_labels, hidden_dtype=hidden_states.dtype
149
+ )
150
+ elif self.norm_type in ["layer_norm", "layer_norm_i2vgen"]:
151
+ norm_hidden_states = self.norm1(hidden_states)
152
+ elif self.norm_type == "ada_norm_continuous":
153
+ norm_hidden_states = self.norm1(hidden_states, added_cond_kwargs["pooled_text_emb"])
154
+ elif self.norm_type == "ada_norm_single":
155
+ shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
156
+ self.scale_shift_table[None] + timestep.reshape(batch_size, 6, -1)
157
+ ).chunk(6, dim=1)
158
+ norm_hidden_states = self.norm1(hidden_states)
159
+ norm_hidden_states = norm_hidden_states * (1 + scale_msa) + shift_msa
160
+ else:
161
+ raise ValueError("Incorrect norm used")
162
+
163
+ if self.pos_embed is not None:
164
+ norm_hidden_states = self.pos_embed(norm_hidden_states)
165
+
166
+ # 1. Prepare GLIGEN inputs
167
+ cross_attention_kwargs = cross_attention_kwargs.copy() if cross_attention_kwargs is not None else {}
168
+ gligen_kwargs = cross_attention_kwargs.pop("gligen", None)
169
+
170
+ attn_output = self.attn1(
171
+ norm_hidden_states,
172
+ encoder_hidden_states=encoder_hidden_states if self.only_cross_attention else None,
173
+ attention_mask=attention_mask,
174
+ **cross_attention_kwargs,
175
+ )
176
+
177
+ if self.norm_type == "ada_norm_zero":
178
+ attn_output = gate_msa.unsqueeze(1) * attn_output
179
+ elif self.norm_type == "ada_norm_single":
180
+ attn_output = gate_msa * attn_output
181
+
182
+ hidden_states = attn_output + hidden_states
183
+ if hidden_states.ndim == 4:
184
+ hidden_states = hidden_states.squeeze(1)
185
+
186
+ # 1.2 Reference Attention
187
+ if 'w' in mode:
188
+ condition_embed_dict[self.layer_name] = rearrange(
189
+ norm_hidden_states, '(b n) l c -> b (n l) c',
190
+ n=num_in_batch
191
+ ) # B, (N L), C
192
+
193
+ if 'r' in mode and self.use_ra:
194
+ condition_embed = condition_embed_dict[self.layer_name].unsqueeze(1).repeat(1, num_in_batch, 1,
195
+ 1) # B N L C
196
+ condition_embed = rearrange(condition_embed, 'b n l c -> (b n) l c')
197
+
198
+ attn_output = self.attn_refview(
199
+ norm_hidden_states,
200
+ encoder_hidden_states=condition_embed,
201
+ attention_mask=None,
202
+ **cross_attention_kwargs
203
+ )
204
+ if not self.is_turbo:
205
+ ref_scale_timing = ref_scale
206
+ if isinstance(ref_scale, torch.Tensor):
207
+ ref_scale_timing = ref_scale.unsqueeze(1).repeat(1, num_in_batch).view(-1)
208
+ for _ in range(attn_output.ndim - 1):
209
+ ref_scale_timing = ref_scale_timing.unsqueeze(-1)
210
+
211
+ hidden_states = ref_scale_timing * attn_output + hidden_states
212
+
213
+ if hidden_states.ndim == 4:
214
+ hidden_states = hidden_states.squeeze(1)
215
+
216
+ # 1.3 Multiview Attention
217
+ if num_in_batch > 1 and self.use_ma:
218
+ multivew_hidden_states = rearrange(norm_hidden_states, '(b n) l c -> b (n l) c', n=num_in_batch)
219
+
220
+ if self.is_turbo:
221
+ position_mask = None
222
+ if position_attn_mask is not None:
223
+ if multivew_hidden_states.shape[1] in position_attn_mask:
224
+ position_mask = position_attn_mask[multivew_hidden_states.shape[1]]
225
+ position_indices = None
226
+ if position_voxel_indices is not None:
227
+ if multivew_hidden_states.shape[1] in position_voxel_indices:
228
+ position_indices = position_voxel_indices[multivew_hidden_states.shape[1]]
229
+ attn_output = self.attn_multiview(
230
+ multivew_hidden_states,
231
+ encoder_hidden_states=multivew_hidden_states,
232
+ attention_mask=position_mask,
233
+ position_indices=position_indices,
234
+ **cross_attention_kwargs
235
+ )
236
+ else:
237
+ attn_output = self.attn_multiview(
238
+ multivew_hidden_states,
239
+ encoder_hidden_states=multivew_hidden_states,
240
+ **cross_attention_kwargs
241
+ )
242
+
243
+ attn_output = rearrange(attn_output, 'b (n l) c -> (b n) l c', n=num_in_batch)
244
+
245
+ hidden_states = mva_scale * attn_output + hidden_states
246
+ if hidden_states.ndim == 4:
247
+ hidden_states = hidden_states.squeeze(1)
248
+
249
+ # 1.2 GLIGEN Control
250
+ if gligen_kwargs is not None:
251
+ hidden_states = self.fuser(hidden_states, gligen_kwargs["objs"])
252
+
253
+ # 3. Cross-Attention
254
+ if self.attn2 is not None:
255
+ if self.norm_type == "ada_norm":
256
+ norm_hidden_states = self.norm2(hidden_states, timestep)
257
+ elif self.norm_type in ["ada_norm_zero", "layer_norm", "layer_norm_i2vgen"]:
258
+ norm_hidden_states = self.norm2(hidden_states)
259
+ elif self.norm_type == "ada_norm_single":
260
+ # For PixArt norm2 isn't applied here:
261
+ # https://github.com/PixArt-alpha/PixArt-alpha/blob/0f55e922376d8b797edd44d25d0e7464b260dcab/diffusion/model/nets/PixArtMS.py#L70C1-L76C103
262
+ norm_hidden_states = hidden_states
263
+ elif self.norm_type == "ada_norm_continuous":
264
+ norm_hidden_states = self.norm2(hidden_states, added_cond_kwargs["pooled_text_emb"])
265
+ else:
266
+ raise ValueError("Incorrect norm")
267
+
268
+ if self.pos_embed is not None and self.norm_type != "ada_norm_single":
269
+ norm_hidden_states = self.pos_embed(norm_hidden_states)
270
+
271
+ attn_output = self.attn2(
272
+ norm_hidden_states,
273
+ encoder_hidden_states=encoder_hidden_states,
274
+ attention_mask=encoder_attention_mask,
275
+ **cross_attention_kwargs,
276
+ )
277
+
278
+ hidden_states = attn_output + hidden_states
279
+
280
+ # 4. Feed-forward
281
+ # i2vgen doesn't have this norm 🤷‍♂️
282
+ if self.norm_type == "ada_norm_continuous":
283
+ norm_hidden_states = self.norm3(hidden_states, added_cond_kwargs["pooled_text_emb"])
284
+ elif not self.norm_type == "ada_norm_single":
285
+ norm_hidden_states = self.norm3(hidden_states)
286
+
287
+ if self.norm_type == "ada_norm_zero":
288
+ norm_hidden_states = norm_hidden_states * (1 + scale_mlp[:, None]) + shift_mlp[:, None]
289
+
290
+ if self.norm_type == "ada_norm_single":
291
+ norm_hidden_states = self.norm2(hidden_states)
292
+ norm_hidden_states = norm_hidden_states * (1 + scale_mlp) + shift_mlp
293
+
294
+ if self._chunk_size is not None:
295
+ # "feed_forward_chunk_size" can be used to save memory
296
+ ff_output = _chunked_feed_forward(self.ff, norm_hidden_states, self._chunk_dim, self._chunk_size)
297
+ else:
298
+ ff_output = self.ff(norm_hidden_states)
299
+
300
+ if self.norm_type == "ada_norm_zero":
301
+ ff_output = gate_mlp.unsqueeze(1) * ff_output
302
+ elif self.norm_type == "ada_norm_single":
303
+ ff_output = gate_mlp * ff_output
304
+
305
+ hidden_states = ff_output + hidden_states
306
+ if hidden_states.ndim == 4:
307
+ hidden_states = hidden_states.squeeze(1)
308
+
309
+ return hidden_states
310
+
311
+ @torch.no_grad()
312
+ def compute_voxel_grid_mask(position, grid_resolution=8):
313
+
314
+ position = position.half()
315
+ B,N,_,H,W = position.shape
316
+ assert H%grid_resolution==0 and W%grid_resolution==0
317
+
318
+ valid_mask = (position != 1).all(dim=2, keepdim=True)
319
+ valid_mask = valid_mask.expand_as(position)
320
+ position[valid_mask==False] = 0
321
+
322
+
323
+ position = rearrange(
324
+ position,
325
+ 'b n c (num_h grid_h) (num_w grid_w) -> b n num_h num_w c grid_h grid_w',
326
+ num_h=grid_resolution, num_w=grid_resolution
327
+ )
328
+ valid_mask = rearrange(
329
+ valid_mask,
330
+ 'b n c (num_h grid_h) (num_w grid_w) -> b n num_h num_w c grid_h grid_w',
331
+ num_h=grid_resolution, num_w=grid_resolution
332
+ )
333
+
334
+ grid_position = position.sum(dim=(-2, -1))
335
+ count_masked = valid_mask.sum(dim=(-2, -1))
336
+
337
+ grid_position = grid_position / count_masked.clamp(min=1)
338
+ grid_position[count_masked<5] = 0
339
+
340
+ grid_position = grid_position.permute(0,1,4,2,3)
341
+ grid_position = rearrange(grid_position, 'b n c h w -> b n (h w) c')
342
+
343
+ grid_position_expanded_1 = grid_position.unsqueeze(2).unsqueeze(4) # shape becomes B, N, 1, L, 1, 3
344
+ grid_position_expanded_2 = grid_position.unsqueeze(1).unsqueeze(3) # shape becomes B, 1, N, 1, L, 3
345
+
346
+ # compute pairwise Euclidean distances
347
+ distances = torch.norm(grid_position_expanded_1 - grid_position_expanded_2, dim=-1) # shape B, N, N, L, L
348
+
349
+ weights = distances
350
+ grid_distance = 1.73/grid_resolution
351
+
352
+ #weights = weights*-32
353
+ #weights = weights.clamp(min=-10000.0)
354
+
355
+ weights = weights< grid_distance
356
+
357
+ return weights
358
+
359
+ def compute_multi_resolution_mask(position_maps, grid_resolutions=[32, 16, 8]):
360
+ position_attn_mask = {}
361
+ with torch.no_grad():
362
+ for grid_resolution in grid_resolutions:
363
+ position_mask = compute_voxel_grid_mask(position_maps, grid_resolution)
364
+ position_mask = rearrange(position_mask, 'b ni nj li lj -> b (ni li) (nj lj)')
365
+ position_attn_mask[position_mask.shape[1]] = position_mask
366
+ return position_attn_mask
367
+
368
+ @torch.no_grad()
369
+ def compute_discrete_voxel_indice(position, grid_resolution=8, voxel_resolution=128):
370
+
371
+ position = position.half()
372
+ B,N,_,H,W = position.shape
373
+ assert H%grid_resolution==0 and W%grid_resolution==0
374
+
375
+ valid_mask = (position != 1).all(dim=2, keepdim=True)
376
+ valid_mask = valid_mask.expand_as(position)
377
+ position[valid_mask==False] = 0
378
+
379
+ position = rearrange(
380
+ position,
381
+ 'b n c (num_h grid_h) (num_w grid_w) -> b n num_h num_w c grid_h grid_w',
382
+ num_h=grid_resolution, num_w=grid_resolution
383
+ )
384
+ valid_mask = rearrange(
385
+ valid_mask,
386
+ 'b n c (num_h grid_h) (num_w grid_w) -> b n num_h num_w c grid_h grid_w',
387
+ num_h=grid_resolution, num_w=grid_resolution
388
+ )
389
+
390
+ grid_position = position.sum(dim=(-2, -1))
391
+ count_masked = valid_mask.sum(dim=(-2, -1))
392
+
393
+ grid_position = grid_position / count_masked.clamp(min=1)
394
+ grid_position[count_masked<5] = 0
395
+
396
+ grid_position = grid_position.permute(0,1,4,2,3).clamp(0, 1) # B N C H W
397
+ voxel_indices = grid_position * (voxel_resolution - 1)
398
+ voxel_indices = torch.round(voxel_indices).long()
399
+ return voxel_indices
400
+
401
+ def compute_multi_resolution_discrete_voxel_indice(
402
+ position_maps,
403
+ grid_resolutions=[64, 32, 16, 8],
404
+ voxel_resolutions=[512, 256, 128, 64]
405
+ ):
406
+ voxel_indices = {}
407
+ with torch.no_grad():
408
+ for grid_resolution, voxel_resolution in zip(grid_resolutions, voxel_resolutions):
409
+ voxel_indice = compute_discrete_voxel_indice(position_maps, grid_resolution, voxel_resolution)
410
+ voxel_indice = rearrange(voxel_indice, 'b n c h w -> b (n h w) c')
411
+ voxel_indices[voxel_indice.shape[1]] = {'voxel_indices':voxel_indice, 'voxel_resolution':voxel_resolution}
412
+ return voxel_indices
413
+
414
+ class UNet2p5DConditionModel(torch.nn.Module):
415
+ def __init__(self, unet: UNet2DConditionModel) -> None:
416
+ super().__init__()
417
+ self.unet = unet
418
+
419
+ self.use_ma = True
420
+ self.use_ra = True
421
+ self.use_camera_embedding = True
422
+ self.use_dual_stream = True
423
+ self.is_turbo = False
424
+
425
+ if self.use_dual_stream:
426
+ self.unet_dual = copy.deepcopy(unet)
427
+ self.init_attention(self.unet_dual)
428
+ self.init_attention(self.unet, use_ma=self.use_ma, use_ra=self.use_ra, is_turbo=self.is_turbo)
429
+ self.init_condition()
430
+ self.init_camera_embedding()
431
+
432
+ @staticmethod
433
+ def from_pretrained(pretrained_model_name_or_path, **kwargs):
434
+ torch_dtype = kwargs.pop('torch_dtype', torch.float32)
435
+ config_path = os.path.join(pretrained_model_name_or_path, 'config.json')
436
+ unet_ckpt_path = os.path.join(pretrained_model_name_or_path, 'diffusion_pytorch_model.bin')
437
+ with open(config_path, 'r', encoding='utf-8') as file:
438
+ config = json.load(file)
439
+ unet = UNet2DConditionModel(**config)
440
+ unet = UNet2p5DConditionModel(unet)
441
+ unet_ckpt = torch.load(unet_ckpt_path, map_location='cpu', weights_only=True)
442
+ unet.load_state_dict(unet_ckpt, strict=True)
443
+ unet = unet.to(torch_dtype)
444
+ return unet
445
+
446
+ def init_condition(self):
447
+ self.unet.conv_in = torch.nn.Conv2d(
448
+ 12,
449
+ self.unet.conv_in.out_channels,
450
+ kernel_size=self.unet.conv_in.kernel_size,
451
+ stride=self.unet.conv_in.stride,
452
+ padding=self.unet.conv_in.padding,
453
+ dilation=self.unet.conv_in.dilation,
454
+ groups=self.unet.conv_in.groups,
455
+ bias=self.unet.conv_in.bias is not None)
456
+
457
+ self.unet.learned_text_clip_gen = nn.Parameter(torch.randn(1, 77, 1024))
458
+ self.unet.learned_text_clip_ref = nn.Parameter(torch.randn(1, 77, 1024))
459
+
460
+ def init_camera_embedding(self):
461
+
462
+ if self.use_camera_embedding:
463
+ time_embed_dim = 1280
464
+ self.max_num_ref_image = 5
465
+ self.max_num_gen_image = 12 * 3 + 4 * 2
466
+ self.unet.class_embedding = nn.Embedding(self.max_num_ref_image + self.max_num_gen_image, time_embed_dim)
467
+
468
+ def init_attention(self, unet, use_ma=False, use_ra=False, is_turbo=False):
469
+
470
+ for down_block_i, down_block in enumerate(unet.down_blocks):
471
+ if hasattr(down_block, "has_cross_attention") and down_block.has_cross_attention:
472
+ for attn_i, attn in enumerate(down_block.attentions):
473
+ for transformer_i, transformer in enumerate(attn.transformer_blocks):
474
+ if isinstance(transformer, BasicTransformerBlock):
475
+ attn.transformer_blocks[transformer_i] = Basic2p5DTransformerBlock(
476
+ transformer,
477
+ f'down_{down_block_i}_{attn_i}_{transformer_i}',
478
+ use_ma, use_ra, is_turbo
479
+ )
480
+
481
+ if hasattr(unet.mid_block, "has_cross_attention") and unet.mid_block.has_cross_attention:
482
+ for attn_i, attn in enumerate(unet.mid_block.attentions):
483
+ for transformer_i, transformer in enumerate(attn.transformer_blocks):
484
+ if isinstance(transformer, BasicTransformerBlock):
485
+ attn.transformer_blocks[transformer_i] = Basic2p5DTransformerBlock(
486
+ transformer,
487
+ f'mid_{attn_i}_{transformer_i}',
488
+ use_ma, use_ra, is_turbo
489
+ )
490
+
491
+ for up_block_i, up_block in enumerate(unet.up_blocks):
492
+ if hasattr(up_block, "has_cross_attention") and up_block.has_cross_attention:
493
+ for attn_i, attn in enumerate(up_block.attentions):
494
+ for transformer_i, transformer in enumerate(attn.transformer_blocks):
495
+ if isinstance(transformer, BasicTransformerBlock):
496
+ attn.transformer_blocks[transformer_i] = Basic2p5DTransformerBlock(
497
+ transformer,
498
+ f'up_{up_block_i}_{attn_i}_{transformer_i}',
499
+ use_ma, use_ra, is_turbo
500
+ )
501
+
502
+ def __getattr__(self, name: str):
503
+ try:
504
+ return super().__getattr__(name)
505
+ except AttributeError:
506
+ return getattr(self.unet, name)
507
+
508
+ def forward(
509
+ self, sample, timestep, encoder_hidden_states,
510
+ *args, down_intrablock_additional_residuals=None,
511
+ down_block_res_samples=None, mid_block_res_sample=None,
512
+ **cached_condition,
513
+ ):
514
+ B, N_gen, _, H, W = sample.shape
515
+ assert H == W
516
+
517
+ if self.use_camera_embedding:
518
+ camera_info_gen = cached_condition['camera_info_gen'] + self.max_num_ref_image
519
+ camera_info_gen = rearrange(camera_info_gen, 'b n -> (b n)')
520
+ else:
521
+ camera_info_gen = None
522
+
523
+ sample = [sample]
524
+ if 'normal_imgs' in cached_condition:
525
+ sample.append(cached_condition["normal_imgs"])
526
+ if 'position_imgs' in cached_condition:
527
+ sample.append(cached_condition["position_imgs"])
528
+ sample = torch.cat(sample, dim=2)
529
+
530
+ sample = rearrange(sample, 'b n c h w -> (b n) c h w')
531
+
532
+ encoder_hidden_states_gen = encoder_hidden_states.unsqueeze(1).repeat(1, N_gen, 1, 1)
533
+ encoder_hidden_states_gen = rearrange(encoder_hidden_states_gen, 'b n l c -> (b n) l c')
534
+
535
+ if self.use_ra:
536
+ if 'condition_embed_dict' in cached_condition:
537
+ condition_embed_dict = cached_condition['condition_embed_dict']
538
+ else:
539
+ condition_embed_dict = {}
540
+ ref_latents = cached_condition['ref_latents']
541
+ N_ref = ref_latents.shape[1]
542
+ if self.use_camera_embedding:
543
+ camera_info_ref = cached_condition['camera_info_ref']
544
+ camera_info_ref = rearrange(camera_info_ref, 'b n -> (b n)')
545
+ else:
546
+ camera_info_ref = None
547
+
548
+ ref_latents = rearrange(ref_latents, 'b n c h w -> (b n) c h w')
549
+
550
+ encoder_hidden_states_ref = self.unet.learned_text_clip_ref.unsqueeze(1).repeat(B, N_ref, 1, 1)
551
+ encoder_hidden_states_ref = rearrange(encoder_hidden_states_ref, 'b n l c -> (b n) l c')
552
+
553
+ noisy_ref_latents = ref_latents
554
+ timestep_ref = 0
555
+
556
+ if self.use_dual_stream:
557
+ unet_ref = self.unet_dual
558
+ else:
559
+ unet_ref = self.unet
560
+ unet_ref(
561
+ noisy_ref_latents, timestep_ref,
562
+ encoder_hidden_states=encoder_hidden_states_ref,
563
+ class_labels=camera_info_ref,
564
+ # **kwargs
565
+ return_dict=False,
566
+ cross_attention_kwargs={
567
+ 'mode': 'w', 'num_in_batch': N_ref,
568
+ 'condition_embed_dict': condition_embed_dict},
569
+ )
570
+ cached_condition['condition_embed_dict'] = condition_embed_dict
571
+ else:
572
+ condition_embed_dict = None
573
+
574
+ mva_scale = cached_condition.get('mva_scale', 1.0)
575
+ ref_scale = cached_condition.get('ref_scale', 1.0)
576
+
577
+ if self.is_turbo:
+ # Assumed to be supplied by the caller via cached_condition; without this the
+ # turbo branch below would raise a NameError on these two names.
+ position_attn_mask = cached_condition.get('position_attn_mask', None)
+ position_voxel_indices = cached_condition.get('position_voxel_indices', None)
578
+ cross_attention_kwargs_ = {
579
+ 'mode': 'r', 'num_in_batch': N_gen,
580
+ 'condition_embed_dict': condition_embed_dict,
581
+ 'position_attn_mask':position_attn_mask,
582
+ 'position_voxel_indices':position_voxel_indices,
583
+ 'mva_scale': mva_scale,
584
+ 'ref_scale': ref_scale,
585
+ }
586
+ else:
587
+ cross_attention_kwargs_ = {
588
+ 'mode': 'r', 'num_in_batch': N_gen,
589
+ 'condition_embed_dict': condition_embed_dict,
590
+ 'mva_scale': mva_scale,
591
+ 'ref_scale': ref_scale,
592
+ }
593
+ return self.unet(
594
+ sample, timestep,
595
+ encoder_hidden_states_gen, *args,
596
+ class_labels=camera_info_gen,
597
+ down_intrablock_additional_residuals=[
598
+ sample.to(dtype=self.unet.dtype) for sample in down_intrablock_additional_residuals
599
+ ] if down_intrablock_additional_residuals is not None else None,
600
+ down_block_additional_residuals=[
601
+ sample.to(dtype=self.unet.dtype) for sample in down_block_res_samples
602
+ ] if down_block_res_samples is not None else None,
603
+ mid_block_additional_residual=(
604
+ mid_block_res_sample.to(dtype=self.unet.dtype)
605
+ if mid_block_res_sample is not None else None
606
+ ),
607
+ return_dict=False,
608
+ cross_attention_kwargs=cross_attention_kwargs_,
609
+ )
610
+
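A hedged usage sketch for the wrapper defined above. It assumes the weights file diffusion_pytorch_model.bin sits next to config.json in the assets folder, and that this modules.py is importable as `modules` (the name model_index.json uses for the unet entry).

```python
import torch
from modules import UNet2p5DConditionModel  # this file, importable as "modules"

# Assumed local path containing config.json + diffusion_pytorch_model.bin.
unet = UNet2p5DConditionModel.from_pretrained(
    "assets/hunyuan3d-paint-v2-0-turbo/unet", torch_dtype=torch.float16
)

# The wrapper's forward expects multiview latents of shape (B, N_gen, C, H, W)
# plus a cached_condition dict (ref_latents, camera_info_gen/ref and, if used,
# normal_imgs / position_imgs that are concatenated into the 12-channel conv_in).
```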
assets/hunyuan3d-paint-v2-0-turbo/vae/config.json ADDED
@@ -0,0 +1,29 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "_class_name": "AutoencoderKL",
3
+ "_diffusers_version": "0.10.0.dev0",
4
+ "act_fn": "silu",
5
+ "block_out_channels": [
6
+ 128,
7
+ 256,
8
+ 512,
9
+ 512
10
+ ],
11
+ "down_block_types": [
12
+ "DownEncoderBlock2D",
13
+ "DownEncoderBlock2D",
14
+ "DownEncoderBlock2D",
15
+ "DownEncoderBlock2D"
16
+ ],
17
+ "in_channels": 3,
18
+ "latent_channels": 4,
19
+ "layers_per_block": 2,
20
+ "norm_num_groups": 32,
21
+ "out_channels": 3,
22
+ "sample_size": 768,
23
+ "up_block_types": [
24
+ "UpDecoderBlock2D",
25
+ "UpDecoderBlock2D",
26
+ "UpDecoderBlock2D",
27
+ "UpDecoderBlock2D"
28
+ ]
29
+ }
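A small sketch of what the VAE config above implies for latent shapes; the local path is an assumption.

```python
import json
import torch
from diffusers import AutoencoderKL

with open("assets/hunyuan3d-paint-v2-0-turbo/vae/config.json", "r") as f:
    config = json.load(f)

vae = AutoencoderKL.from_config(config)

# Four encoder blocks give three 2x downsamples, so a 768x768 RGB image maps
# to a 4-channel 96x96 latent.
with torch.no_grad():
    image = torch.randn(1, 3, 768, 768)
    latent = vae.encode(image).latent_dist.sample()
print(latent.shape)  # torch.Size([1, 4, 96, 96])
```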
assets/hunyuan3d-paint-v2-0/.DS_Store ADDED
Binary file (6.15 kB). View file
 
assets/hunyuan3d-paint-v2-0/.gitattributes ADDED
@@ -0,0 +1,35 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
assets/hunyuan3d-paint-v2-0/feature_extractor/preprocessor_config.json ADDED
@@ -0,0 +1,20 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "crop_size": 224,
3
+ "do_center_crop": true,
4
+ "do_convert_rgb": true,
5
+ "do_normalize": true,
6
+ "do_resize": true,
7
+ "feature_extractor_type": "CLIPFeatureExtractor",
8
+ "image_mean": [
9
+ 0.48145466,
10
+ 0.4578275,
11
+ 0.40821073
12
+ ],
13
+ "image_std": [
14
+ 0.26862954,
15
+ 0.26130258,
16
+ 0.27577711
17
+ ],
18
+ "resample": 3,
19
+ "size": 224
20
+ }
assets/hunyuan3d-paint-v2-0/model_index.json ADDED
@@ -0,0 +1,33 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "_class_name": "StableDiffusionPipeline",
3
+ "_diffusers_version": "0.23.1",
4
+ "feature_extractor": [
5
+ "transformers",
6
+ "CLIPImageProcessor"
7
+ ],
8
+ "requires_safety_checker": false,
9
+ "safety_checker": [
10
+ null,
11
+ null
12
+ ],
13
+ "scheduler": [
14
+ "diffusers",
15
+ "DDIMScheduler"
16
+ ],
17
+ "text_encoder": [
18
+ "transformers",
19
+ "CLIPTextModel"
20
+ ],
21
+ "tokenizer": [
22
+ "transformers",
23
+ "CLIPTokenizer"
24
+ ],
25
+ "unet": [
26
+ "modules",
27
+ "UNet2p5DConditionModel"
28
+ ],
29
+ "vae": [
30
+ "diffusers",
31
+ "AutoencoderKL"
32
+ ]
33
+ }
assets/hunyuan3d-paint-v2-0/scheduler/scheduler_config.json ADDED
@@ -0,0 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "_class_name": "DDIMScheduler",
3
+ "_diffusers_version": "0.23.1",
4
+ "beta_end": 0.012,
5
+ "beta_schedule": "scaled_linear",
6
+ "beta_start": 0.00085,
7
+ "clip_sample": false,
8
+ "num_train_timesteps": 1000,
9
+ "prediction_type": "v_prediction",
10
+ "set_alpha_to_one": true,
11
+ "steps_offset": 1,
12
+ "trained_betas": null,
13
+ "timestep_spacing": "trailing",
14
+ "rescale_betas_zero_snr": true
15
+ }
assets/hunyuan3d-paint-v2-0/text_encoder/config.json ADDED
@@ -0,0 +1,25 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "_name_or_path": "stabilityai/stable-diffusion-2",
3
+ "architectures": [
4
+ "CLIPTextModel"
5
+ ],
6
+ "attention_dropout": 0.0,
7
+ "bos_token_id": 0,
8
+ "dropout": 0.0,
9
+ "eos_token_id": 2,
10
+ "hidden_act": "gelu",
11
+ "hidden_size": 1024,
12
+ "initializer_factor": 1.0,
13
+ "initializer_range": 0.02,
14
+ "intermediate_size": 4096,
15
+ "layer_norm_eps": 1e-05,
16
+ "max_position_embeddings": 77,
17
+ "model_type": "clip_text_model",
18
+ "num_attention_heads": 16,
19
+ "num_hidden_layers": 23,
20
+ "pad_token_id": 1,
21
+ "projection_dim": 512,
22
+ "torch_dtype": "float32",
23
+ "transformers_version": "4.25.0.dev0",
24
+ "vocab_size": 49408
25
+ }
assets/hunyuan3d-paint-v2-0/tokenizer/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
assets/hunyuan3d-paint-v2-0/tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "bos_token": {
3
+ "content": "<|startoftext|>",
4
+ "lstrip": false,
5
+ "normalized": true,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "<|endoftext|>",
11
+ "lstrip": false,
12
+ "normalized": true,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": "!",
17
+ "unk_token": {
18
+ "content": "<|endoftext|>",
19
+ "lstrip": false,
20
+ "normalized": true,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ }
24
+ }
assets/hunyuan3d-paint-v2-0/tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,34 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "add_prefix_space": false,
3
+ "bos_token": {
4
+ "__type": "AddedToken",
5
+ "content": "<|startoftext|>",
6
+ "lstrip": false,
7
+ "normalized": true,
8
+ "rstrip": false,
9
+ "single_word": false
10
+ },
11
+ "do_lower_case": true,
12
+ "eos_token": {
13
+ "__type": "AddedToken",
14
+ "content": "<|endoftext|>",
15
+ "lstrip": false,
16
+ "normalized": true,
17
+ "rstrip": false,
18
+ "single_word": false
19
+ },
20
+ "errors": "replace",
21
+ "model_max_length": 77,
22
+ "name_or_path": "stabilityai/stable-diffusion-2",
23
+ "pad_token": "<|endoftext|>",
24
+ "special_tokens_map_file": "./special_tokens_map.json",
25
+ "tokenizer_class": "CLIPTokenizer",
26
+ "unk_token": {
27
+ "__type": "AddedToken",
28
+ "content": "<|endoftext|>",
29
+ "lstrip": false,
30
+ "normalized": true,
31
+ "rstrip": false,
32
+ "single_word": false
33
+ }
34
+ }
assets/hunyuan3d-paint-v2-0/tokenizer/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
assets/hunyuan3d-paint-v2-0/unet/config.json ADDED
@@ -0,0 +1,45 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "_class_name": "UNet2DConditionModel",
3
+ "_diffusers_version": "0.10.0.dev0",
4
+ "act_fn": "silu",
5
+ "attention_head_dim": [
6
+ 5,
7
+ 10,
8
+ 20,
9
+ 20
10
+ ],
11
+ "block_out_channels": [
12
+ 320,
13
+ 640,
14
+ 1280,
15
+ 1280
16
+ ],
17
+ "center_input_sample": false,
18
+ "cross_attention_dim": 1024,
19
+ "down_block_types": [
20
+ "CrossAttnDownBlock2D",
21
+ "CrossAttnDownBlock2D",
22
+ "CrossAttnDownBlock2D",
23
+ "DownBlock2D"
24
+ ],
25
+ "downsample_padding": 1,
26
+ "dual_cross_attention": false,
27
+ "flip_sin_to_cos": true,
28
+ "freq_shift": 0,
29
+ "in_channels": 4,
30
+ "layers_per_block": 2,
31
+ "mid_block_scale_factor": 1,
32
+ "norm_eps": 1e-05,
33
+ "norm_num_groups": 32,
34
+ "num_class_embeds": null,
35
+ "only_cross_attention": false,
36
+ "out_channels": 4,
37
+ "sample_size": 64,
38
+ "up_block_types": [
39
+ "UpBlock2D",
40
+ "CrossAttnUpBlock2D",
41
+ "CrossAttnUpBlock2D",
42
+ "CrossAttnUpBlock2D"
43
+ ],
44
+ "use_linear_projection": true
45
+ }
assets/hunyuan3d-paint-v2-0/unet/modules.py ADDED
@@ -0,0 +1,437 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import os
2
+ import json
3
+ from typing import Any, Dict, Optional
4
+ from diffusers.models import UNet2DConditionModel
5
+
6
+ import numpy
7
+ import torch
8
+ import torch.nn as nn
9
+ import torch.nn.functional as F
10
+ import torch.utils.checkpoint
11
+ import torch.distributed
12
+ from PIL import Image
13
+ from einops import rearrange
14
+ from typing import Any, Callable, Dict, List, Optional, Union, Tuple
15
+
16
+ import diffusers
17
+ from diffusers import (
18
+ AutoencoderKL,
19
+ DDPMScheduler,
20
+ DiffusionPipeline,
21
+ EulerAncestralDiscreteScheduler,
22
+ UNet2DConditionModel,
23
+ ImagePipelineOutput
24
+ )
25
+ from diffusers.image_processor import VaeImageProcessor
26
+ from diffusers.models.attention_processor import Attention, AttnProcessor, XFormersAttnProcessor, AttnProcessor2_0
27
+ from diffusers.utils.import_utils import is_xformers_available
28
+
29
+
30
+ from diffusers.utils import deprecate
31
+
32
+ from diffusers.models.transformers.transformer_2d import BasicTransformerBlock
33
+
34
+
35
+
36
+ def _chunked_feed_forward(ff: nn.Module, hidden_states: torch.Tensor, chunk_dim: int, chunk_size: int):
37
+ # "feed_forward_chunk_size" can be used to save memory
38
+ if hidden_states.shape[chunk_dim] % chunk_size != 0:
39
+ raise ValueError(
40
+ f"`hidden_states` dimension to be chunked: {hidden_states.shape[chunk_dim]} has to be divisible by chunk size: {chunk_size}. Make sure to set an appropriate `chunk_size` when calling `unet.enable_forward_chunking`."
41
+ )
42
+
43
+ num_chunks = hidden_states.shape[chunk_dim] // chunk_size
44
+ ff_output = torch.cat(
45
+ [ff(hid_slice) for hid_slice in hidden_states.chunk(num_chunks, dim=chunk_dim)],
46
+ dim=chunk_dim,
47
+ )
48
+ return ff_output
49
+
50
+
51
+ class Basic2p5DTransformerBlock(torch.nn.Module):
52
+ def __init__(self, transformer: BasicTransformerBlock, layer_name, use_ma=True, use_ra=True) -> None:
53
+ super().__init__()
54
+ self.transformer = transformer
55
+ self.layer_name = layer_name
56
+ self.use_ma = use_ma
57
+ self.use_ra = use_ra
58
+
59
+ # multiview attn
60
+ if self.use_ma:
61
+ self.attn_multiview = Attention(
62
+ query_dim=self.dim,
63
+ heads=self.num_attention_heads,
64
+ dim_head=self.attention_head_dim,
65
+ dropout=self.dropout,
66
+ bias=self.attention_bias,
67
+ cross_attention_dim=None,
68
+ upcast_attention=self.attn1.upcast_attention,
69
+ out_bias=True,
70
+ )
71
+
72
+ # ref attn
73
+ if self.use_ra:
74
+ self.attn_refview = Attention(
75
+ query_dim=self.dim,
76
+ heads=self.num_attention_heads,
77
+ dim_head=self.attention_head_dim,
78
+ dropout=self.dropout,
79
+ bias=self.attention_bias,
80
+ cross_attention_dim=None,
81
+ upcast_attention=self.attn1.upcast_attention,
82
+ out_bias=True,
83
+ )
84
+
85
+ def __getattr__(self, name: str):
86
+ try:
87
+ return super().__getattr__(name)
88
+ except AttributeError:
89
+ return getattr(self.transformer, name)
90
+
91
+ def forward(
92
+ self,
93
+ hidden_states: torch.Tensor,
94
+ attention_mask: Optional[torch.Tensor] = None,
95
+ encoder_hidden_states: Optional[torch.Tensor] = None,
96
+ encoder_attention_mask: Optional[torch.Tensor] = None,
97
+ timestep: Optional[torch.LongTensor] = None,
98
+ cross_attention_kwargs: Dict[str, Any] = None,
99
+ class_labels: Optional[torch.LongTensor] = None,
100
+ added_cond_kwargs: Optional[Dict[str, torch.Tensor]] = None,
101
+ ) -> torch.Tensor:
102
+
103
+ # Notice that normalization is always applied before the real computation in the following blocks.
104
+ # 0. Self-Attention
105
+ batch_size = hidden_states.shape[0]
106
+
107
+ cross_attention_kwargs = cross_attention_kwargs.copy() if cross_attention_kwargs is not None else {}
108
+ num_in_batch = cross_attention_kwargs.pop('num_in_batch', 1)
109
+ mode = cross_attention_kwargs.pop('mode', None)
110
+ mva_scale = cross_attention_kwargs.pop('mva_scale', 1.0)
111
+ ref_scale = cross_attention_kwargs.pop('ref_scale', 1.0)
112
+ condition_embed_dict = cross_attention_kwargs.pop("condition_embed_dict", None)
113
+
114
+
115
+ if self.norm_type == "ada_norm":
116
+ norm_hidden_states = self.norm1(hidden_states, timestep)
117
+ elif self.norm_type == "ada_norm_zero":
118
+ norm_hidden_states, gate_msa, shift_mlp, scale_mlp, gate_mlp = self.norm1(
119
+ hidden_states, timestep, class_labels, hidden_dtype=hidden_states.dtype
120
+ )
121
+ elif self.norm_type in ["layer_norm", "layer_norm_i2vgen"]:
122
+ norm_hidden_states = self.norm1(hidden_states)
123
+ elif self.norm_type == "ada_norm_continuous":
124
+ norm_hidden_states = self.norm1(hidden_states, added_cond_kwargs["pooled_text_emb"])
125
+ elif self.norm_type == "ada_norm_single":
126
+ shift_msa, scale_msa, gate_msa, shift_mlp, scale_mlp, gate_mlp = (
127
+ self.scale_shift_table[None] + timestep.reshape(batch_size, 6, -1)
128
+ ).chunk(6, dim=1)
129
+ norm_hidden_states = self.norm1(hidden_states)
130
+ norm_hidden_states = norm_hidden_states * (1 + scale_msa) + shift_msa
131
+ else:
132
+ raise ValueError("Incorrect norm used")
133
+
134
+ if self.pos_embed is not None:
135
+ norm_hidden_states = self.pos_embed(norm_hidden_states)
136
+
137
+ # 1. Prepare GLIGEN inputs
138
+ cross_attention_kwargs = cross_attention_kwargs.copy() if cross_attention_kwargs is not None else {}
139
+ gligen_kwargs = cross_attention_kwargs.pop("gligen", None)
140
+
141
+ attn_output = self.attn1(
142
+ norm_hidden_states,
143
+ encoder_hidden_states=encoder_hidden_states if self.only_cross_attention else None,
144
+ attention_mask=attention_mask,
145
+ **cross_attention_kwargs,
146
+ )
147
+
148
+ if self.norm_type == "ada_norm_zero":
149
+ attn_output = gate_msa.unsqueeze(1) * attn_output
150
+ elif self.norm_type == "ada_norm_single":
151
+ attn_output = gate_msa * attn_output
152
+
153
+ hidden_states = attn_output + hidden_states
154
+ if hidden_states.ndim == 4:
155
+ hidden_states = hidden_states.squeeze(1)
156
+
157
+ # 1.2 Reference Attention
158
+ if 'w' in mode:
159
+ condition_embed_dict[self.layer_name] = rearrange(norm_hidden_states, '(b n) l c -> b (n l) c', n=num_in_batch) # B, (N L), C
160
+
161
+ if 'r' in mode and self.use_ra:
162
+ condition_embed = condition_embed_dict[self.layer_name].unsqueeze(1).repeat(1,num_in_batch,1,1) # B N L C
163
+ condition_embed = rearrange(condition_embed, 'b n l c -> (b n) l c')
164
+
165
+ attn_output = self.attn_refview(
166
+ norm_hidden_states,
167
+ encoder_hidden_states=condition_embed,
168
+ attention_mask=None,
169
+ **cross_attention_kwargs
170
+ )
171
+ ref_scale_timing = ref_scale
172
+ if isinstance(ref_scale, torch.Tensor):
173
+ ref_scale_timing = ref_scale.unsqueeze(1).repeat(1, num_in_batch).view(-1)
174
+ for _ in range(attn_output.ndim - 1):
175
+ ref_scale_timing = ref_scale_timing.unsqueeze(-1)
176
+ hidden_states = ref_scale_timing * attn_output + hidden_states
177
+ if hidden_states.ndim == 4:
178
+ hidden_states = hidden_states.squeeze(1)
179
+
180
+
181
+ # 1.3 Multiview Attention
182
+ if num_in_batch > 1 and self.use_ma:
183
+ multivew_hidden_states = rearrange(norm_hidden_states, '(b n) l c -> b (n l) c', n=num_in_batch)
184
+
185
+ attn_output = self.attn_multiview(
186
+ multivew_hidden_states,
187
+ encoder_hidden_states=multivew_hidden_states,
188
+ **cross_attention_kwargs
189
+ )
190
+
191
+ attn_output = rearrange(attn_output, 'b (n l) c -> (b n) l c', n=num_in_batch)
192
+
193
+ hidden_states = mva_scale * attn_output + hidden_states
194
+ if hidden_states.ndim == 4:
195
+ hidden_states = hidden_states.squeeze(1)
196
+
197
+ # 1.2 GLIGEN Control
198
+ if gligen_kwargs is not None:
199
+ hidden_states = self.fuser(hidden_states, gligen_kwargs["objs"])
200
+
201
+ # 3. Cross-Attention
202
+ if self.attn2 is not None:
203
+ if self.norm_type == "ada_norm":
204
+ norm_hidden_states = self.norm2(hidden_states, timestep)
205
+ elif self.norm_type in ["ada_norm_zero", "layer_norm", "layer_norm_i2vgen"]:
206
+ norm_hidden_states = self.norm2(hidden_states)
207
+ elif self.norm_type == "ada_norm_single":
208
+ # For PixArt norm2 isn't applied here:
209
+ # https://github.com/PixArt-alpha/PixArt-alpha/blob/0f55e922376d8b797edd44d25d0e7464b260dcab/diffusion/model/nets/PixArtMS.py#L70C1-L76C103
210
+ norm_hidden_states = hidden_states
211
+ elif self.norm_type == "ada_norm_continuous":
212
+ norm_hidden_states = self.norm2(hidden_states, added_cond_kwargs["pooled_text_emb"])
213
+ else:
214
+ raise ValueError("Incorrect norm")
215
+
216
+ if self.pos_embed is not None and self.norm_type != "ada_norm_single":
217
+ norm_hidden_states = self.pos_embed(norm_hidden_states)
218
+
219
+
220
+ attn_output = self.attn2(
221
+ norm_hidden_states,
222
+ encoder_hidden_states=encoder_hidden_states,
223
+ attention_mask=encoder_attention_mask,
224
+ **cross_attention_kwargs,
225
+ )
226
+
227
+ hidden_states = attn_output + hidden_states
228
+
229
+ # 4. Feed-forward
230
+ # i2vgen doesn't have this norm 🤷‍♂️
231
+ if self.norm_type == "ada_norm_continuous":
232
+ norm_hidden_states = self.norm3(hidden_states, added_cond_kwargs["pooled_text_emb"])
233
+ elif not self.norm_type == "ada_norm_single":
234
+ norm_hidden_states = self.norm3(hidden_states)
235
+
236
+ if self.norm_type == "ada_norm_zero":
237
+ norm_hidden_states = norm_hidden_states * (1 + scale_mlp[:, None]) + shift_mlp[:, None]
238
+
239
+ if self.norm_type == "ada_norm_single":
240
+ norm_hidden_states = self.norm2(hidden_states)
241
+ norm_hidden_states = norm_hidden_states * (1 + scale_mlp) + shift_mlp
242
+
243
+ if self._chunk_size is not None:
244
+ # "feed_forward_chunk_size" can be used to save memory
245
+ ff_output = _chunked_feed_forward(self.ff, norm_hidden_states, self._chunk_dim, self._chunk_size)
246
+ else:
247
+ ff_output = self.ff(norm_hidden_states)
248
+
249
+ if self.norm_type == "ada_norm_zero":
250
+ ff_output = gate_mlp.unsqueeze(1) * ff_output
251
+ elif self.norm_type == "ada_norm_single":
252
+ ff_output = gate_mlp * ff_output
253
+
254
+ hidden_states = ff_output + hidden_states
255
+ if hidden_states.ndim == 4:
256
+ hidden_states = hidden_states.squeeze(1)
257
+
258
+ return hidden_states
259
+
260
+ import copy
261
+ class UNet2p5DConditionModel(torch.nn.Module):
262
+ def __init__(self, unet: UNet2DConditionModel) -> None:
263
+ super().__init__()
264
+ self.unet = unet
265
+
266
+ self.use_ma = True
267
+ self.use_ra = True
268
+ self.use_camera_embedding = True
269
+ self.use_dual_stream = True
270
+
271
+ if self.use_dual_stream:
272
+ self.unet_dual = copy.deepcopy(unet)
273
+ self.init_attention(self.unet_dual)
274
+ self.init_attention(self.unet, use_ma=self.use_ma, use_ra=self.use_ra)
275
+ self.init_condition()
276
+ self.init_camera_embedding()
277
+
278
+
279
+ @staticmethod
280
+ def from_pretrained(pretrained_model_name_or_path, **kwargs):
281
+ torch_dtype = kwargs.pop('torch_dtype', torch.float32)
282
+ config_path = os.path.join(pretrained_model_name_or_path, 'config.json')
283
+ unet_ckpt_path = os.path.join(pretrained_model_name_or_path, 'diffusion_pytorch_model.bin')
284
+ with open(config_path, 'r', encoding='utf-8') as file:
285
+ config = json.load(file)
286
+ unet = UNet2DConditionModel(**config)
287
+ unet = UNet2p5DConditionModel(unet)
288
+ unet_ckpt = torch.load(unet_ckpt_path, map_location='cpu', weights_only=True)
289
+ unet.load_state_dict(unet_ckpt, strict=True)
290
+ unet = unet.to(torch_dtype)
291
+ return unet
292
+
293
+ def init_condition(self):
294
+ self.unet.conv_in = torch.nn.Conv2d(
295
+ 12,
296
+ self.unet.conv_in.out_channels,
297
+ kernel_size=self.unet.conv_in.kernel_size,
298
+ stride=self.unet.conv_in.stride,
299
+ padding=self.unet.conv_in.padding,
300
+ dilation=self.unet.conv_in.dilation,
301
+ groups=self.unet.conv_in.groups,
302
+ bias=self.unet.conv_in.bias is not None)
303
+ self.unet.learned_text_clip_gen = nn.Parameter(torch.randn(1,77,1024))
304
+ self.unet.learned_text_clip_ref = nn.Parameter(torch.randn(1,77,1024))
305
+
306
+ def init_camera_embedding(self):
307
+
308
+ self.max_num_ref_image = 5
309
+ self.max_num_gen_image = 12*3+4*2
310
+
311
+ if self.use_camera_embedding:
312
+ time_embed_dim = 1280
313
+ self.unet.class_embedding = nn.Embedding(self.max_num_ref_image+self.max_num_gen_image, time_embed_dim)
314
+
315
+
316
+ def init_attention(self, unet, use_ma=False, use_ra=False):
317
+
318
+ for down_block_i, down_block in enumerate(unet.down_blocks):
319
+ if hasattr(down_block, "has_cross_attention") and down_block.has_cross_attention:
320
+ for attn_i, attn in enumerate(down_block.attentions):
321
+ for transformer_i, transformer in enumerate(attn.transformer_blocks):
322
+ if isinstance(transformer, BasicTransformerBlock):
323
+ attn.transformer_blocks[transformer_i] = Basic2p5DTransformerBlock(transformer, f'down_{down_block_i}_{attn_i}_{transformer_i}', use_ma, use_ra)
324
+
325
+
326
+ if hasattr(unet.mid_block, "has_cross_attention") and unet.mid_block.has_cross_attention:
327
+ for attn_i, attn in enumerate(unet.mid_block.attentions):
328
+ for transformer_i, transformer in enumerate(attn.transformer_blocks):
329
+ if isinstance(transformer, BasicTransformerBlock):
330
+ attn.transformer_blocks[transformer_i] = Basic2p5DTransformerBlock(transformer, f'mid_{attn_i}_{transformer_i}', use_ma, use_ra)
331
+
332
+ for up_block_i, up_block in enumerate(unet.up_blocks):
333
+ if hasattr(up_block, "has_cross_attention") and up_block.has_cross_attention:
334
+ for attn_i, attn in enumerate(up_block.attentions):
335
+ for transformer_i, transformer in enumerate(attn.transformer_blocks):
336
+ if isinstance(transformer, BasicTransformerBlock):
337
+ attn.transformer_blocks[transformer_i] = Basic2p5DTransformerBlock(transformer, f'up_{up_block_i}_{attn_i}_{transformer_i}', use_ma, use_ra)
338
+
339
+
340
+ def __getattr__(self, name: str):
341
+ try:
342
+ return super().__getattr__(name)
343
+ except AttributeError:
344
+ return getattr(self.unet, name)
345
+
346
+ def forward(
347
+ self, sample, timestep, encoder_hidden_states,
348
+ *args, down_intrablock_additional_residuals=None,
349
+ down_block_res_samples=None, mid_block_res_sample=None,
350
+ **cached_condition,
351
+ ):
352
+ B, N_gen, _, H, W = sample.shape
353
+ assert H == W
354
+
355
+ if self.use_camera_embedding:
356
+ camera_info_gen = cached_condition['camera_info_gen'] + self.max_num_ref_image
357
+ camera_info_gen = rearrange(camera_info_gen, 'b n -> (b n)')
358
+ else:
359
+ camera_info_gen = None
360
+
361
+ sample = [sample]
362
+ if 'normal_imgs' in cached_condition:
363
+ sample.append(cached_condition["normal_imgs"])
364
+ if 'position_imgs' in cached_condition:
365
+ sample.append(cached_condition["position_imgs"])
366
+ sample = torch.cat(sample, dim=2)
367
+
368
+ sample = rearrange(sample, 'b n c h w -> (b n) c h w')
369
+
370
+ encoder_hidden_states_gen = encoder_hidden_states.unsqueeze(1).repeat(1, N_gen, 1, 1)
371
+ encoder_hidden_states_gen = rearrange(encoder_hidden_states_gen, 'b n l c -> (b n) l c')
372
+
373
+ if self.use_ra:
374
+ if 'condition_embed_dict' in cached_condition:
375
+ condition_embed_dict = cached_condition['condition_embed_dict']
376
+ else:
377
+ condition_embed_dict = {}
378
+ ref_latents = cached_condition['ref_latents']
379
+ N_ref = ref_latents.shape[1]
380
+ if self.use_camera_embedding:
381
+ camera_info_ref = cached_condition['camera_info_ref']
382
+ camera_info_ref = rearrange(camera_info_ref, 'b n -> (b n)')
383
+ else:
384
+ camera_info_ref = None
385
+
386
+ ref_latents = rearrange(ref_latents, 'b n c h w -> (b n) c h w')
387
+
388
+ encoder_hidden_states_ref = self.unet.learned_text_clip_ref.unsqueeze(1).repeat(B, N_ref, 1, 1)
389
+ encoder_hidden_states_ref = rearrange(encoder_hidden_states_ref, 'b n l c -> (b n) l c')
390
+
391
+ noisy_ref_latents = ref_latents
392
+ timestep_ref = 0
393
+
394
+ if self.use_dual_stream:
395
+ unet_ref = self.unet_dual
396
+ else:
397
+ unet_ref = self.unet
398
+ unet_ref(
399
+ noisy_ref_latents, timestep_ref,
400
+ encoder_hidden_states=encoder_hidden_states_ref,
401
+ class_labels=camera_info_ref,
402
+ # **kwargs
403
+ return_dict=False,
404
+ cross_attention_kwargs={
405
+ 'mode':'w', 'num_in_batch':N_ref,
406
+ 'condition_embed_dict':condition_embed_dict},
407
+ )
408
+ cached_condition['condition_embed_dict'] = condition_embed_dict
409
+ else:
410
+ condition_embed_dict = None
411
+
412
+
413
+ mva_scale = cached_condition.get('mva_scale', 1.0)
414
+ ref_scale = cached_condition.get('ref_scale', 1.0)
415
+
416
+ return self.unet(
417
+ sample, timestep,
418
+ encoder_hidden_states_gen, *args,
419
+ class_labels=camera_info_gen,
420
+ down_intrablock_additional_residuals=[
421
+ sample.to(dtype=self.unet.dtype) for sample in down_intrablock_additional_residuals
422
+ ] if down_intrablock_additional_residuals is not None else None,
423
+ down_block_additional_residuals=[
424
+ sample.to(dtype=self.unet.dtype) for sample in down_block_res_samples
425
+ ] if down_block_res_samples is not None else None,
426
+ mid_block_additional_residual=(
427
+ mid_block_res_sample.to(dtype=self.unet.dtype)
428
+ if mid_block_res_sample is not None else None
429
+ ),
430
+ return_dict=False,
431
+ cross_attention_kwargs={
432
+ 'mode':'r', 'num_in_batch':N_gen,
433
+ 'condition_embed_dict':condition_embed_dict,
434
+ 'mva_scale': mva_scale,
435
+ 'ref_scale': ref_scale,
436
+ },
437
+ )
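A hedged sketch of the inputs the forward above expects from its caller; all shapes are assumptions read off the rearranges in the code (B batches, N_gen generated views, N_ref reference views, 64x64 latents, 4-channel normal/position latents so conv_in sees 12 channels).

```python
import torch

B, N_gen, N_ref = 1, 6, 1

sample = torch.randn(B, N_gen, 4, 64, 64)          # noisy latents per generated view
encoder_hidden_states = torch.randn(B, 77, 1024)   # text/CLIP context, repeated per view

cached_condition = dict(
    ref_latents=torch.randn(B, N_ref, 4, 64, 64),
    camera_info_gen=torch.zeros(B, N_gen, dtype=torch.long),
    camera_info_ref=torch.zeros(B, N_ref, dtype=torch.long),
    normal_imgs=torch.randn(B, N_gen, 4, 64, 64),
    position_imgs=torch.randn(B, N_gen, 4, 64, 64),
    mva_scale=1.0,
    ref_scale=1.0,
)

# noise_pred = unet(sample, timestep, encoder_hidden_states, **cached_condition)
```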
assets/hunyuan3d-paint-v2-0/vae/config.json ADDED
@@ -0,0 +1,29 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "_class_name": "AutoencoderKL",
3
+ "_diffusers_version": "0.10.0.dev0",
4
+ "act_fn": "silu",
5
+ "block_out_channels": [
6
+ 128,
7
+ 256,
8
+ 512,
9
+ 512
10
+ ],
11
+ "down_block_types": [
12
+ "DownEncoderBlock2D",
13
+ "DownEncoderBlock2D",
14
+ "DownEncoderBlock2D",
15
+ "DownEncoderBlock2D"
16
+ ],
17
+ "in_channels": 3,
18
+ "latent_channels": 4,
19
+ "layers_per_block": 2,
20
+ "norm_num_groups": 32,
21
+ "out_channels": 3,
22
+ "sample_size": 768,
23
+ "up_block_types": [
24
+ "UpDecoderBlock2D",
25
+ "UpDecoderBlock2D",
26
+ "UpDecoderBlock2D",
27
+ "UpDecoderBlock2D"
28
+ ]
29
+ }
assets/hunyuan3d-vae-v2-0-turbo/config.yaml ADDED
@@ -0,0 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ target: hy3dgen.shapegen.models.ShapeVAE
2
+ params:
3
+ num_latents: 3072
4
+ embed_dim: 64
5
+ num_freqs: 8
6
+ include_pi: false
7
+ heads: 16
8
+ width: 1024
9
+ num_decoder_layers: 16
10
+ qkv_bias: false
11
+ qk_norm: true
12
+ scale_factor: 0.9990943042622529
13
+ geo_decoder_mlp_expand_ratio: 1
14
+ geo_decoder_downsample_ratio: 2
15
+ geo_decoder_ln_post: false
assets/hunyuan3d-vae-v2-0/config.yaml ADDED
@@ -0,0 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ target: hy3dgen.shapegen.models.ShapeVAE
2
+ params:
3
+ num_latents: 3072
4
+ embed_dim: 64
5
+ num_freqs: 8
6
+ include_pi: false
7
+ heads: 16
8
+ width: 1024
9
+ num_decoder_layers: 16
10
+ qkv_bias: false
11
+ qk_norm: true
12
+ scale_factor: 0.9990943042622529
13
+ geo_decoder_mlp_expand_ratio: 4
14
+ geo_decoder_downsample_ratio: 1
15
+ geo_decoder_ln_post: true
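The two ShapeVAE configs above follow the common target/params pattern. A minimal sketch of instantiating from such a file; `instantiate_from_config` is an assumed helper name (not necessarily what hy3dgen uses) and the path is an assumption.

```python
import importlib
import yaml

def instantiate_from_config(cfg):
    """Resolve a `target: module.Class` + `params:` block into an instance."""
    module_name, cls_name = cfg["target"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_name), cls_name)
    return cls(**cfg.get("params", {}))

with open("assets/hunyuan3d-vae-v2-0/config.yaml", "r") as f:
    cfg = yaml.safe_load(f)

# vae = instantiate_from_config(cfg)  # requires hy3dgen to be on the import path
```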
hy3dgen/__init__.py ADDED
@@ -0,0 +1,13 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Hunyuan 3D is licensed under the TENCENT HUNYUAN NON-COMMERCIAL LICENSE AGREEMENT
2
+ # except for the third-party components listed below.
3
+ # Hunyuan 3D does not impose any additional limitations beyond what is outlined
4
+ # in the respective licenses of these third-party components.
5
+ # Users must comply with all terms and conditions of original licenses of these third-party
6
+ # components and must ensure that the usage of the third party components adheres to
7
+ # all relevant laws and regulations.
8
+
9
+ # For avoidance of doubts, Hunyuan 3D means the large language models and
10
+ # their software and algorithms, including trained model weights, parameters (including
11
+ # optimizer states), machine-learning model code, inference-enabling code, training-enabling code,
12
+ # fine-tuning enabling code and other elements of the foregoing made publicly available
13
+ # by Tencent in accordance with TENCENT HUNYUAN COMMUNITY LICENSE AGREEMENT.