---
title: FramePack
emoji: 📊
colorFrom: purple
colorTo: red
sdk: gradio
sdk_version: 5.25.2
app_file: app.py
pinned: false
short_description: Video diffusion, but feels like image diffusion.
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# FramePack

Official implementation and desktop software for ["Packing Input Frame Context in Next-Frame Prediction Models for Video Generation"](https://lllyasviel.github.io/frame_pack_gitpage/).

Links: [**Paper**](https://arxiv.org/abs/2504.12626), [**Project Page**](https://lllyasviel.github.io/frame_pack_gitpage/)

FramePack is a next-frame (next-frame-section) prediction neural network structure that generates videos progressively.

FramePack compresses input contexts to a constant length so that the generation workload is invariant to video length.
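
The constant length comes from packing: older frames are compressed more aggressively than recent ones, so the total context size converges instead of growing with the video. Below is a minimal toy sketch of one way to get this property (illustrative only, not the repo's actual code; the real model uses learned patchifying kernels as described in the paper):

    import torch
    import torch.nn.functional as F

    def pack_context(frames):
        # frames: past frame latents, each (C, H, W), newest last.
        # Each step back in time doubles the pooling kernel, so token
        # counts shrink geometrically (N, N/4, N/16, ...) and the total
        # stays bounded by about 4/3 * N for any number of frames.
        tokens = []
        for age, frame in enumerate(reversed(frames)):
            k = min(2 ** age, frame.shape[-1])  # cap kernel at frame size
            pooled = F.avg_pool2d(frame.unsqueeze(0), kernel_size=k)
            tokens.append(pooled.flatten(2).transpose(1, 2))  # (1, h*w, C)
        return torch.cat(tokens, dim=1)

    # 16 past frames of 64x64 latents pack into ~5.5k tokens,
    # versus 16 * 4096 = 65536 tokens without packing.
    frames = [torch.randn(16, 64, 64) for _ in range(16)]
    print(pack_context(frames).shape)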

FramePack can process a very large number of frames with 13B models even on laptop GPUs.

FramePack can be trained with a much larger batch size, similar to the batch sizes used for image diffusion training.

**Video diffusion, but feels like image diffusion.**

# Notes

Note that this GitHub repository is the only official FramePack website. We do not have any web services. All other websites are spam and fake, including but not limited to `framepack.co`, `frame_pack.co`, `framepack.net`, `frame_pack.net`, `framepack.ai`, `frame_pack.ai`, `framepack.pro`, `frame_pack.pro`, `framepack.cc`, `frame_pack.cc`, `framepackai.co`, `frame_pack_ai.co`, `framepackai.net`, `frame_pack_ai.net`, `framepackai.pro`, `frame_pack_ai.pro`, `framepackai.cc`, `frame_pack_ai.cc`, and so on. Again, they are all spam and fake. **Do not pay money or download files from any of those websites.**

The team is on leave between April 21 and 29. PR merging will be delayed.

# Requirements

Note that this repo is functional desktop software with a minimal, standalone, high-quality sampling system and memory management.

**Start with this repo before you try anything else!**

Requirements:

* An Nvidia GPU in the RTX 30XX, 40XX, or 50XX series that supports fp16 and bf16. The GTX 10XX/20XX series are not tested.
* Linux or Windows operating system.
* At least 6GB of GPU memory.

To generate a 1-minute video (60 seconds) at 30 fps (1800 frames) with the 13B model, the minimum required GPU memory is 6GB. (Yes, 6GB, not a typo. Laptop GPUs are okay.)

About speed: on my RTX 4090 desktop, it generates at 2.5 seconds/frame (unoptimized) or 1.5 seconds/frame (with teacache). On my laptops, like a 3070 Ti laptop or a 3060 laptop, it is about 4x to 8x slower. [Troubleshoot if your speed is much slower than this.](https://github.com/lllyasviel/FramePack/issues/151#issuecomment-2817054649)

In any case, you will see the generated frames directly since this is next-frame(-section) prediction, so you will get lots of visual feedback before the entire video is generated.

# Installation

**Windows**:

[>>> Click Here to Download One-Click Package (CUDA 12.6 + PyTorch 2.6) <<<](https://github.com/lllyasviel/FramePack/releases/download/windows/framepack_cu126_torch26.7z)

After downloading, uncompress the package, use `update.bat` to update, and use `run.bat` to run.

Note that running `update.bat` is important; otherwise you may be using a previous version with unfixed bugs.

![image](https://github.com/lllyasviel/stable-diffusion-webui-forge/assets/19834515/c49bd60d-82bd-4086-9859-88d472582b94)

Note that the models will be downloaded automatically. You will download more than 30GB from HuggingFace.

**Linux**:

We recommend an independent Python 3.10 environment.

    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
    pip install -r requirements.txt

To start the GUI, run:

    python demo_gradio.py

Note that it supports `--share`, `--port`, `--server`, and so on.
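
For example, to serve the GUI on a specific address and port with a public Gradio link (the flag values here are illustrative; only the flag names come from the repo):

    python demo_gradio.py --server 0.0.0.0 --port 7860 --share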

The software supports PyTorch attention, xformers, flash-attn, and sage-attention. By default, it will just use PyTorch attention. You can install those attention kernels if you know how.
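
Here, PyTorch attention means the built-in fused kernel, roughly equivalent to the following (an illustrative call, not the repo's exact code):

    import torch
    import torch.nn.functional as F

    # (batch, heads, sequence, head_dim); PyTorch picks a fused backend
    # (flash / memory-efficient / math) automatically when available.
    q = k = v = torch.randn(1, 8, 128, 64)
    out = F.scaled_dot_product_attention(q, k, v)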

To install one of the others, for example sage-attention (Linux):

    pip install sageattention==1.0.6

However, we highly recommend first trying without sage-attention, since it influences the results, though the influence is minimal.

# GUI

![ui](https://github.com/user-attachments/assets/8c5cdbb1-b80c-4b7e-ac27-83834ac24cc4)

On the left, you upload an image and write a prompt.

On the right are the generated videos and latent previews.

Because this is a next-frame-section prediction model, the video grows longer and longer as more sections are generated.

You will see the progress bar for each section and the latent preview for the next section.

Note that the initial progress may be slower than later diffusion, as the device may need some warmup.

# Sanity Check

Before trying your own inputs, we highly recommend going through the sanity check to find out whether anything in your hardware or software setup has gone wrong.

Next-frame-section prediction models are very sensitive to subtle differences in noise and hardware. Usually, people will get slightly different results on different devices, but the results should look similar overall. In some cases, you may get exactly the same results.

## Image-to-5-seconds

Download this image:

<img src="https://github.com/user-attachments/assets/f3bc35cf-656a-4c9c-a83a-bbab24858b09" width="150">

Copy this prompt:

`The man dances energetically, leaping mid-air with fluid arm swings and quick footwork.`

Set it like this (all default parameters, with teacache turned off):

![image](https://github.com/user-attachments/assets/0071fbb6-600c-4e0f-adc9-31980d540e9d)

The result will be:

<table>
<tr>
<td align="center" width="300">
<video
  src="https://github.com/user-attachments/assets/bc74f039-2b14-4260-a30b-ceacf611a185"
  controls
  style="max-width:100%;">
</video>
</td>
</tr>
<tr>
<td align="center">
<em>Video may be compressed by GitHub</em>
</td>
</tr>
</table>

**Important Note:**

Again, this is a next-frame-section prediction model. This means you will generate videos frame-by-frame or section-by-section.

**If you get a much shorter video in the UI, like a video with only 1 second, then it is totally expected.** You just need to wait. More sections will be generated to complete the video.

## Know the influence of TeaCache and Quantization

Download this image:

<img src="https://github.com/user-attachments/assets/42293e30-bdd4-456d-895c-8fedff71be04" width="150">

Copy this prompt:

`The girl dances gracefully, with clear movements, full of charm.`

Set it like this:

![image](https://github.com/user-attachments/assets/4274207d-5180-4824-a552-d0d801933435)

Turn off teacache:

![image](https://github.com/user-attachments/assets/53b309fb-667b-4aa8-96a1-f129c7a09ca6)

You will get this:

<table>
<tr>
<td align="center" width="300">
<video
  src="https://github.com/user-attachments/assets/04ab527b-6da1-4726-9210-a8853dda5577"
  controls
  style="max-width:100%;">
</video>
</td>
</tr>
<tr>
<td align="center">
<em>Video may be compressed by GitHub</em>
</td>
</tr>
</table>

Now turn on teacache:

![image](https://github.com/user-attachments/assets/16ad047b-fbcc-4091-83dc-d46bea40708c)

About 30% of users will get this (the other 70% will get other random results depending on their hardware):

<table>
<tr>
<td align="center" width="300">
<video
  src="https://github.com/user-attachments/assets/149fb486-9ccc-4a48-b1f0-326253051e9b"
  controls
  style="max-width:100%;">
</video>
</td>
</tr>
<tr>
<td align="center">
<em>A typical worse result.</em>
</td>
</tr>
</table>

So you can see that teacache is not really lossless and can sometimes influence the result a lot.
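
This is inherent to how such caches work: they reuse computation from earlier diffusion steps when the inputs have changed little, which is an approximation. A toy sketch of the mechanism (illustrative only, not this repo's implementation):

    import torch

    class StepCache:
        # Reuse the previous step's output when the input barely moved;
        # every skipped recomputation introduces a small approximation error.
        def __init__(self, fn, threshold=0.05):
            self.fn, self.threshold = fn, threshold
            self.last_in, self.last_out = None, None

        def __call__(self, x):
            if self.last_in is not None:
                rel = (x - self.last_in).norm() / (self.last_in.norm() + 1e-8)
                if rel < self.threshold:
                    return self.last_out  # stale output: fast but lossy
            self.last_in, self.last_out = x, self.fn(x)
            return self.last_out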

We recommend using teacache to try ideas and then using the full diffusion process to get high-quality results.

This recommendation also applies to sage-attention, bnb quantization, gguf, and so on.

## Image-to-1-minute

<img src="https://github.com/user-attachments/assets/820af6ca-3c2e-4bbc-afe8-9a9be1994ff5" width="150">

`The girl dances gracefully, with clear movements, full of charm.`

![image](https://github.com/user-attachments/assets/8c34fcb2-288a-44b3-a33d-9d2324e30cbd)

Set video length to 60 seconds:

![image](https://github.com/user-attachments/assets/5595a7ea-f74e-445e-ad5f-3fb5b4b21bee)

If everything is in order, you will eventually get a result like this.

60s version:

<table>
<tr>
<td align="center" width="300">
<video
  src="https://github.com/user-attachments/assets/c3be4bde-2e33-4fd4-b76d-289a036d3a47"
  controls
  style="max-width:100%;">
</video>
</td>
</tr>
<tr>
<td align="center">
<em>Video may be compressed by GitHub</em>
</td>
</tr>
</table>

6s version:

<table>
<tr>
<td align="center" width="300">
<video
  src="https://github.com/user-attachments/assets/37fe2c33-cb03-41e8-acca-920ab3e34861"
  controls
  style="max-width:100%;">
</video>
</td>
</tr>
<tr>
<td align="center">
<em>Video may be compressed by GitHub</em>
</td>
</tr>
</table>

# More Examples

Many more examples are on the [**Project Page**](https://lllyasviel.github.io/frame_pack_gitpage/).

Below are some more examples that you may be interested in reproducing.

---

<img src="https://github.com/user-attachments/assets/99f4d281-28ad-44f5-8700-aa7a4e5638fa" width="150">

`The girl dances gracefully, with clear movements, full of charm.`

![image](https://github.com/user-attachments/assets/0e98bfca-1d91-4b1d-b30f-4236b517c35e)

<table>
<tr>
<td align="center" width="300">
<video
  src="https://github.com/user-attachments/assets/cebe178a-09ce-4b7a-8f3c-060332f4dab1"
  controls
  style="max-width:100%;">
</video>
</td>
</tr>
<tr>
<td align="center">
<em>Video may be compressed by GitHub</em>
</td>
</tr>
</table>

---

<img src="https://github.com/user-attachments/assets/853f4f40-2956-472f-aa7a-fa50da03ed92" width="150">

`The girl suddenly took out a sign that said “cute” using right hand`

![image](https://github.com/user-attachments/assets/d51180e4-5537-4e25-a6c6-faecae28648a)

<table>
<tr>
<td align="center" width="300">
<video
  src="https://github.com/user-attachments/assets/116069d2-7499-4f38-ada7-8f85517d1fbb"
  controls
  style="max-width:100%;">
</video>
</td>
</tr>
<tr>
<td align="center">
<em>Video may be compressed by GitHub</em>
</td>
</tr>
</table>

---

<img src="https://github.com/user-attachments/assets/6d87c53f-81b2-4108-a704-697164ae2e81" width="150">

`The girl skateboarding, repeating the endless spinning and dancing and jumping on a skateboard, with clear movements, full of charm.`

![image](https://github.com/user-attachments/assets/c2cfa835-b8e6-4c28-97f8-88f42da1ffdf)

<table>
<tr>
<td align="center" width="300">
<video
  src="https://github.com/user-attachments/assets/d9e3534a-eb17-4af2-a8ed-8e692e9993d2"
  controls
  style="max-width:100%;">
</video>
</td>
</tr>
<tr>
<td align="center">
<em>Video may be compressed by GitHub</em>
</td>
</tr>
</table>

---

<img src="https://github.com/user-attachments/assets/6e95d1a5-9674-4c9a-97a9-ddf704159b79" width="150">

`The girl dances gracefully, with clear movements, full of charm.`

![image](https://github.com/user-attachments/assets/7412802a-ce44-4188-b1a4-cfe19f9c9118)

<table>
<tr>
<td align="center" width="300">
<video
  src="https://github.com/user-attachments/assets/e1b3279e-e30d-4d32-b55f-2fb1d37c81d2"
  controls
  style="max-width:100%;">
</video>
</td>
</tr>
<tr>
<td align="center">
<em>Video may be compressed by GitHub</em>
</td>
</tr>
</table>

---

<img src="https://github.com/user-attachments/assets/90fc6d7e-8f6b-4f8c-a5df-ee5b1c8b63c9" width="150">

`The man dances flamboyantly, swinging his hips and striking bold poses with dramatic flair.`

![image](https://github.com/user-attachments/assets/1dcf10a3-9747-4e77-a269-03a9379dd9af)

<table>
<tr>
<td align="center" width="300">
<video
  src="https://github.com/user-attachments/assets/aaa4481b-7bf8-4c64-bc32-909659767115"
  controls
  style="max-width:100%;">
</video>
</td>
</tr>
<tr>
<td align="center">
<em>Video may be compressed by GitHub</em>
</td>
</tr>
</table>

---

<img src="https://github.com/user-attachments/assets/62ecf987-ec0c-401d-b3c9-be9ffe84ee5b" width="150">

`The woman dances elegantly among the blossoms, spinning slowly with flowing sleeves and graceful hand movements.`

![image](https://github.com/user-attachments/assets/396f06bc-e399-4ac3-9766-8a42d4f8d383)

<table>
<tr>
<td align="center" width="300">
<video
  src="https://github.com/user-attachments/assets/f23f2f37-c9b8-45d5-a1be-7c87bd4b41cf"
  controls
  style="max-width:100%;">
</video>
</td>
</tr>
<tr>
<td align="center">
<em>Video may be compressed by GitHub</em>
</td>
</tr>
</table>

---

<img src="https://github.com/user-attachments/assets/4f740c1a-2d2f-40a6-9613-d6fe64c428aa" width="150">

`The young man writes intensely, flipping papers and adjusting his glasses with swift, focused movements.`

![image](https://github.com/user-attachments/assets/c4513c4b-997a-429b-b092-bb275a37b719)

<table>
<tr>
<td align="center" width="300">
<video
  src="https://github.com/user-attachments/assets/62e9910e-aea6-4b2b-9333-2e727bccfc64"
  controls
  style="max-width:100%;">
</video>
</td>
</tr>
<tr>
<td align="center">
<em>Video may be compressed by GitHub</em>
</td>
</tr>
</table>

---

# Prompting Guideline

Many people ask how to write better prompts.

Below is a ChatGPT template that I personally often use to get prompts:

    You are an assistant that writes short, motion-focused prompts for animating images.

    When the user sends an image, respond with a single, concise prompt describing visual motion (such as human activity, moving objects, or camera movements). Focus only on how the scene could come alive and become dynamic using brief phrases.

    Larger and more dynamic motions (like dancing, jumping, running, etc.) are preferred over smaller or more subtle ones (like standing still, sitting, etc.).

    Describe subject, then motion, then other things. For example: "The girl dances gracefully, with clear movements, full of charm."

    If there is something that can dance (like a man, girl, robot, etc.), then prefer to describe it as dancing.

    Stay in a loop: one image in, one motion prompt out. Do not explain, ask questions, or generate multiple options.

Paste this instruction into ChatGPT and then feed it an image to get a prompt like this:

![image](https://github.com/user-attachments/assets/586c53b9-0b8c-4c94-b1d3-d7e7c1a705c3)

*The man dances powerfully, striking sharp poses and gliding smoothly across the reflective floor.*

Usually this will give you a prompt that works well.
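
If you prefer to script this workflow rather than use the ChatGPT UI, the same template can be the system message for any vision-capable chat API. A sketch using the `openai` Python package (an assumed setup; the model name and file path are placeholders):

    import base64
    from openai import OpenAI

    TEMPLATE = "You are an assistant that writes short, motion-focused prompts for animating images. ..."  # paste the full template above

    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    image_b64 = base64.b64encode(open("input.png", "rb").read()).decode()

    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model
        messages=[
            {"role": "system", "content": TEMPLATE},
            {"role": "user", "content": [{
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{image_b64}"},
            }]},
        ],
    )
    print(response.choices[0].message.content)  # the motion prompt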

You can also write prompts yourself. Concise prompts are usually preferred, for example:

*The girl dances gracefully, with clear movements, full of charm.*

*The man dances powerfully, with clear movements, full of energy.*

and so on.

# Cite

    @article{zhang2025framepack,
        title={Packing Input Frame Context in Next-Frame Prediction Models for Video Generation},
        author={Lvmin Zhang and Maneesh Agrawala},
        journal={arXiv preprint arXiv:2504.12626},
        year={2025}
    }