vitorcalvi committed on
Commit a378df1 · 1 Parent(s): 4c397c7
Files changed (3)
  1. README.md +9 -170
  2. app.py +40 -146
  3. requirements.txt +52 -1
README.md CHANGED
@@ -1,174 +1,13 @@
  ---
- title: Pyramid Flow
+ title: Multi-Modal for Emotion and Sentiment Analysis (for GITEX 2024)
+ emoji: 😀😲😐😥🥴😱😡
+ colorFrom: blue
+ colorTo: pink
  sdk: gradio
- emoji: ⚱️
- sdk_version: 5.0.1
- suggested_hardware: l40sx1
+ sdk_version: '4.24.0'
+ app_file: app.py
+ pinned: false
+ license: mit
  ---

- <div align="center">
-
- # ⚡️Pyramid Flow⚡️
-
- [[Paper]](https://arxiv.org/abs/2410.05954) [[Project Page ✨]](https://pyramid-flow.github.io) [[Model 🤗]](https://huggingface.co/rain1011/pyramid-flow-sd3)
-
- </div>
-
- This is the official repository for Pyramid Flow, a training-efficient **Autoregressive Video Generation** method based on **Flow Matching**. By training only on **open-source datasets**, it can generate high-quality 10-second videos at 768p resolution and 24 FPS, and naturally supports image-to-video generation.
-
- <table class="center" border="0" style="width: 100%; text-align: left;">
-   <tr>
-     <th>10s, 768p, 24fps</th>
-     <th>5s, 768p, 24fps</th>
-     <th>Image-to-video</th>
-   </tr>
-   <tr>
-     <td><video src="https://github.com/user-attachments/assets/9935da83-ae56-4672-8747-0f46e90f7b2b" autoplay muted loop playsinline></video></td>
-     <td><video src="https://github.com/user-attachments/assets/3412848b-64db-4d9e-8dbf-11403f6d02c5" autoplay muted loop playsinline></video></td>
-     <td><video src="https://github.com/user-attachments/assets/3bd7251f-7b2c-4bee-951d-656fdb45f427" autoplay muted loop playsinline></video></td>
-   </tr>
- </table>
-
- ## News
-
- - `COMING SOON` ⚡️⚡️⚡️ Training code for both the Video VAE and DiT; New model checkpoints trained from scratch.
-
- > We are training Pyramid Flow from scratch to fix human structure issues related to the currently adopted SD3 initialization and hope to release it in the next few days.
-
- - `2024.10.10` 🚀🚀🚀 We release the [technical report](https://arxiv.org/abs/2410.05954), [project page](https://pyramid-flow.github.io) and [model checkpoint](https://huggingface.co/rain1011/pyramid-flow-sd3) of Pyramid Flow.
-
- ## Introduction
-
- ![motivation](assets/motivation.jpg)
-
- Existing video diffusion models operate at full resolution, spending a lot of computation on very noisy latents. By contrast, our method harnesses the flexibility of flow matching ([Lipman et al., 2023](https://openreview.net/forum?id=PqvMRDCJT9t); [Liu et al., 2023](https://openreview.net/forum?id=XVjTT1nw5z); [Albergo & Vanden-Eijnden, 2023](https://openreview.net/forum?id=li7qeBbCR1t)) to interpolate between latents of different resolutions and noise levels, allowing for simultaneous generation and decompression of visual content with better computational efficiency. The entire framework is end-to-end optimized with a single DiT ([Peebles & Xie, 2023](http://openaccess.thecvf.com/content/ICCV2023/html/Peebles_Scalable_Diffusion_Models_with_Transformers_ICCV_2023_paper.html)), generating high-quality 10-second videos at 768p resolution and 24 FPS within 20.7k A100 GPU training hours.
-
- ## Usage
-
- You can directly download the model from [Huggingface](https://huggingface.co/rain1011/pyramid-flow-sd3). We provide both model checkpoints for 768p and 384p video generation. The 384p checkpoint supports 5-second video generation at 24FPS, while the 768p checkpoint supports up to 10-second video generation at 24FPS.
-
- ```python
- from huggingface_hub import snapshot_download
-
- model_path = 'PATH' # The local directory to save downloaded checkpoint
- snapshot_download("rain1011/pyramid-flow-sd3", local_dir=model_path, local_dir_use_symlinks=False, repo_type='model')
- ```
-
- To use our model, please follow the inference code in `video_generation_demo.ipynb` at [this link](https://github.com/jy0205/Pyramid-Flow/blob/main/video_generation_demo.ipynb). We further simplify it into the following two-step procedure. First, load the downloaded model:
-
- ```python
- import torch
- from PIL import Image
- from pyramid_dit import PyramidDiTForVideoGeneration
- from diffusers.utils import load_image, export_to_video
-
- torch.cuda.set_device(0)
- model_dtype, torch_dtype = 'bf16', torch.bfloat16 # Use bf16, fp16 or fp32
-
- model = PyramidDiTForVideoGeneration(
-     'PATH', # The downloaded checkpoint dir
-     model_dtype,
-     model_variant='diffusion_transformer_768p', # 'diffusion_transformer_384p'
- )
-
- model.vae.to("cuda")
- model.dit.to("cuda")
- model.text_encoder.to("cuda")
- model.vae.enable_tiling()
- ```
-
- Then, you can try text-to-video generation on your own prompts:
-
- ```python
- prompt = "A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors"
-
- with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
-     frames = model.generate(
-         prompt=prompt,
-         num_inference_steps=[20, 20, 20],
-         video_num_inference_steps=[10, 10, 10],
-         height=768,
-         width=1280,
-         temp=16, # temp=16: 5s, temp=31: 10s
-         guidance_scale=9.0, # The guidance for the first frame
-         video_guidance_scale=5.0, # The guidance for the other video latent
-         output_type="pil",
-         save_memory=True, # If you have enough GPU memory, set it to `False` to improve vae decoding speed
-     )
-
- export_to_video(frames, "./text_to_video_sample.mp4", fps=24)
- ```
-
- As an autoregressive model, our model also supports (text conditioned) image-to-video generation:
-
- ```python
- image = Image.open('assets/the_great_wall.jpg').convert("RGB").resize((1280, 768))
- prompt = "FPV flying over the Great Wall"
-
- with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
-     frames = model.generate_i2v(
-         prompt=prompt,
-         input_image=image,
-         num_inference_steps=[10, 10, 10],
-         temp=16,
-         video_guidance_scale=4.0,
-         output_type="pil",
-         save_memory=True, # If you have enough GPU memory, set it to `False` to improve vae decoding speed
-     )
-
- export_to_video(frames, "./image_to_video_sample.mp4", fps=24)
- ```
-
- Usage tips:
-
- - The `guidance_scale` parameter controls the visual quality. We suggest using a guidance within [7, 9] for the 768p checkpoint during text-to-video generation, and 7 for the 384p checkpoint.
- - The `video_guidance_scale` parameter controls the motion. A larger value increases the dynamic degree and mitigates the autoregressive generation degradation, while a smaller value stabilizes the video.
- - For 10-second video generation, we recommend using a guidance scale of 7 and a video guidance scale of 5.
-
- ## Gallery
-
- The following video examples are generated at 5s, 768p, 24fps. For more results, please visit our [project page](https://pyramid-flow.github.io).
-
- <table class="center" border="0" style="width: 100%; text-align: left;">
-   <tr>
-     <td><video src="https://github.com/user-attachments/assets/5b44a57e-fa08-4554-84a2-2c7a99f2b343" autoplay muted loop playsinline></video></td>
-     <td><video src="https://github.com/user-attachments/assets/5afd5970-de72-40e2-900d-a20d18308e8e" autoplay muted loop playsinline></video></td>
-   </tr>
-   <tr>
-     <td><video src="https://github.com/user-attachments/assets/1d44daf8-017f-40e9-bf18-1e19c0a8983b" autoplay muted loop playsinline></video></td>
-     <td><video src="https://github.com/user-attachments/assets/7f5dd901-b7d7-48cc-b67a-3c5f9e1546d2" autoplay muted loop playsinline></video></td>
-   </tr>
- </table>
-
- ## Comparison
-
- On VBench ([Huang et al., 2024](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard)), our method surpasses all the compared open-source baselines. Even with only public video data, it achieves comparable performance to commercial models like Kling ([Kuaishou, 2024](https://kling.kuaishou.com/en)) and Gen-3 Alpha ([Runway, 2024](https://runwayml.com/research/introducing-gen-3-alpha)), especially in the quality score (84.74 vs. 84.11 of Gen-3) and motion smoothness.
-
- ![vbench](assets/vbench.jpg)
-
- We conduct an additional user study with 20+ participants. As can be seen, our method is preferred over open-source models such as [Open-Sora](https://github.com/hpcaitech/Open-Sora) and [CogVideoX-2B](https://github.com/THUDM/CogVideo) especially in terms of motion smoothness.
-
- ![user_study](assets/user_study.jpg)
-
- ## Acknowledgement
-
- We are grateful for the following awesome projects when implementing Pyramid Flow:
-
- - [SD3 Medium](https://huggingface.co/stabilityai/stable-diffusion-3-medium) and [Flux 1.0](https://huggingface.co/black-forest-labs/FLUX.1-dev): State-of-the-art image generation models based on flow matching.
- - [Diffusion Forcing](https://boyuan.space/diffusion-forcing) and [GameNGen](https://gamengen.github.io): Next-token prediction meets full-sequence diffusion.
- - [WebVid-10M](https://github.com/m-bain/webvid), [OpenVid-1M](https://github.com/NJU-PCALab/OpenVid-1M) and [Open-Sora Plan](https://github.com/PKU-YuanGroup/Open-Sora-Plan): Large-scale datasets for text-to-video generation.
- - [CogVideoX](https://github.com/THUDM/CogVideo): An open-source text-to-video generation model that shares many training details.
- - [Video-LLaMA2](https://github.com/DAMO-NLP-SG/VideoLLaMA2): An open-source video LLM for our video recaptioning.
-
- ## Citation
-
- Consider giving this repository a star and cite Pyramid Flow in your publications if it helps your research.
-
- ```
- @article{jin2024pyramidal,
-   title={Pyramidal Flow Matching for Efficient Video Generative Modeling},
-   author={Jin, Yang and Sun, Zhicheng and Li, Ningyuan and Xu, Kun and Xu, Kun and Jiang, Hao and Zhuang, Nan and Huang, Quzhe and Song, Yang and Mu, Yadong and Lin, Zhouchen},
-   journal={arXiv preprint arXiv:2410.05954},
-   year={2024}
- }
- ```
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
app.py CHANGED
@@ -1,147 +1,41 @@
- import os
- import torch
  import gradio as gr
- from PIL import Image, ImageOps
-
- from huggingface_hub import snapshot_download
- from pyramid_dit import PyramidDiTForVideoGeneration
- from diffusers.utils import export_to_video
-
- import spaces
- import uuid
-
- is_canonical = True if os.environ.get("SPACE_ID") == "Pyramid-Flow/pyramid-flow" else False
-
- # Constants
- MODEL_PATH = "pyramid-flow-model"
- MODEL_REPO = "rain1011/pyramid-flow-sd3"
- MODEL_VARIANT = "diffusion_transformer_768p"
- MODEL_DTYPE = "bf16"
-
- def center_crop(image, target_width, target_height):
-     width, height = image.size
-     aspect_ratio_target = target_width / target_height
-     aspect_ratio_image = width / height
-
-     if aspect_ratio_image > aspect_ratio_target:
-         # Crop the width (left and right)
-         new_width = int(height * aspect_ratio_target)
-         left = (width - new_width) // 2
-         right = left + new_width
-         top, bottom = 0, height
-     else:
-         # Crop the height (top and bottom)
-         new_height = int(width / aspect_ratio_target)
-         top = (height - new_height) // 2
-         bottom = top + new_height
-         left, right = 0, width
-
-     image = image.crop((left, top, right, bottom))
-     return image
-
- # Download and load the model
- def load_model():
-     if not os.path.exists(MODEL_PATH):
-         snapshot_download(MODEL_REPO, local_dir=MODEL_PATH, local_dir_use_symlinks=False, repo_type='model')
-
-     model = PyramidDiTForVideoGeneration(
-         MODEL_PATH,
-         MODEL_DTYPE,
-         model_variant=MODEL_VARIANT,
-     )
-
-     model.vae.to("cuda")
-     model.dit.to("cuda")
-     model.text_encoder.to("cuda")
-     model.vae.enable_tiling()
-
-     return model
-
- # Global model variable
- model = load_model()
-
- # Text-to-video generation function
- @spaces.GPU(duration=140)
- def generate_video(prompt, image=None, duration=3, guidance_scale=9, video_guidance_scale=5, frames_per_second=8, progress=gr.Progress(track_tqdm=True)):
-     multiplier = 1.2 if is_canonical else 3.0
-     temp = int(duration * multiplier) + 1
-     torch_dtype = torch.bfloat16 if MODEL_DTYPE == "bf16" else torch.float32
-     if(image):
-         cropped_image = center_crop(image, 1280, 768)
-         resized_image = cropped_image.resize((1280, 768))
-         with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
-             frames = model.generate_i2v(
-                 prompt=prompt,
-                 input_image=resized_image,
-                 num_inference_steps=[10, 10, 10],
-                 temp=temp,
-                 guidance_scale=7.0,
-                 video_guidance_scale=video_guidance_scale,
-                 output_type="pil",
-                 save_memory=True,
-             )
-     else:
-         with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
-             frames = model.generate(
-                 prompt=prompt,
-                 num_inference_steps=[20, 20, 20],
-                 video_num_inference_steps=[10, 10, 10],
-                 height=768,
-                 width=1280,
-                 temp=temp,
-                 guidance_scale=guidance_scale,
-                 video_guidance_scale=video_guidance_scale,
-                 output_type="pil",
-                 save_memory=True,
-             )
-     output_path = f"{str(uuid.uuid4())}_output_video.mp4"
-     export_to_video(frames, output_path, fps=frames_per_second)
-     return output_path
-
- # Gradio interface
- with gr.Blocks() as demo:
-     gr.Markdown("# Pyramid Flow")
-     gr.Markdown("Pyramid Flow is a training-efficient Autoregressive Video Generation model based on Flow Matching. It is trained only on open-source datasets within 20.7k A100 GPU hours")
-     gr.Markdown("[[Paper](https://arxiv.org/pdf/2410.05954)], [[Model](https://huggingface.co/rain1011/pyramid-flow-sd3)], [[Code](https://github.com/jy0205/Pyramid-Flow)]")
-
-     with gr.Row():
-         with gr.Column():
-             with gr.Accordion("Image to Video (optional)", open=False):
-                 i2v_image = gr.Image(type="pil", label="Input Image")
-             t2v_prompt = gr.Textbox(label="Prompt")
-             with gr.Accordion("Advanced settings", open=False):
-                 t2v_duration = gr.Slider(minimum=1, maximum=3 if is_canonical else 10, value=3 if is_canonical else 5, step=1, label="Duration (seconds)", visible=not is_canonical)
-                 t2v_fps = gr.Slider(minimum=8, maximum=24, step=16, value=8 if is_canonical else 24, label="Frames per second", visible=is_canonical)
-                 t2v_guidance_scale = gr.Slider(minimum=1, maximum=15, value=9, step=0.1, label="Guidance Scale")
-                 t2v_video_guidance_scale = gr.Slider(minimum=1, maximum=15, value=5, step=0.1, label="Video Guidance Scale")
-             t2v_generate_btn = gr.Button("Generate Video")
-         with gr.Column():
-             t2v_output = gr.Video(label=f"Generated Video")
-             gr.HTML("""
-             <div style="display: flex; flex-direction: column;justify-content: center; align-items: center; text-align: center;">
-                 <p style="display: flex;gap: 6px;">
-                     <a href="https://huggingface.co/spaces/Pyramid-Flow/pyramid-flow?duplicate=true">
-                         <img src="https://huggingface.co/datasets/huggingface/badges/resolve/main/duplicate-this-space-lg.svg" alt="Duplicate this Space">
-                     </a>
-                 </p>
-                 <p>to use privately and generate videos up to 10s at 24fps</p>
-             </div>
-             """)
-     gr.Examples(
-         examples=[
-             "A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors",
-             "Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes"
-         ],
-         fn=generate_video,
-         inputs=t2v_prompt,
-         outputs=t2v_output,
-         cache_examples=True,
-         cache_mode="lazy"
-     )
-     t2v_generate_btn.click(
-         generate_video,
-         inputs=[t2v_prompt, i2v_image, t2v_duration, t2v_guidance_scale, t2v_video_guidance_scale, t2v_fps],
-         outputs=t2v_output
-     )
-
- demo.launch()
+ import torch
+ from tabs.FACS_analysis import create_facs_analysis_tab
+ from ui_components import CUSTOM_CSS, HEADER_HTML, DISCLAIMER_HTML
+ import spaces # Importing spaces to utilize GPU if available
+
+ import logging
+
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+ # Define the tab structure
+ TAB_STRUCTURE = [
+     ("Visual Analysis", [
+         ("FACS for Stress, Anxiety, Depression", create_facs_analysis_tab),
+     ])
+ ]
+
+ def create_demo():
+     device = "cuda" if torch.cuda.is_available() else "cpu"
+     logger.info(f"Using device: {device}")
+
+     # Ensure that any models loaded within create_facs_analysis_tab use the correct device
+     with gr.Blocks(css=CUSTOM_CSS) as demo:
+         gr.Markdown(HEADER_HTML)
+         with gr.Tabs(elem_classes=["main-tab"]):
+             for main_tab, sub_tabs in TAB_STRUCTURE:
+                 with gr.Tab(main_tab):
+                     with gr.Tabs():
+                         for sub_tab, create_fn in sub_tabs:
+                             with gr.Tab(sub_tab):
+                                 create_fn(device=device) # Pass device if needed
+         gr.HTML(DISCLAIMER_HTML)
+
+     return demo
+
+ # Create the demo instance without GPU decorator
+ demo = create_demo()
+
+ if __name__ == "__main__":
+     demo.launch()
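The refactored `app.py` treats each entry in `TAB_STRUCTURE` as a factory that builds its own Gradio widgets and accepts the `device` chosen in `create_demo()`. The module `tabs/FACS_analysis.py` is not included in this commit, so the following is only a minimal sketch of what such a factory might look like; the `analyze_frame` helper and its output format are hypothetical placeholders, not the Space's actual implementation.

```python
# Hypothetical sketch of tabs/FACS_analysis.py -- not part of this commit.
# It illustrates the factory contract app.py relies on: a callable that builds
# its own widgets and accepts the device selected in create_demo().
import gradio as gr


def create_facs_analysis_tab(device: str = "cpu"):
    def analyze_frame(image):
        # Placeholder: a real implementation would run a FACS model on `device`
        # and map facial action units to stress/anxiety/depression indicators.
        if image is None:
            return "No image provided."
        return f"Received a {image.size[0]}x{image.size[1]} frame (device: {device})."

    with gr.Column():
        input_image = gr.Image(type="pil", label="Face image")
        result = gr.Textbox(label="FACS analysis")
        gr.Button("Analyze").click(analyze_frame, inputs=input_image, outputs=result)
```

In the actual Space, the factory would presumably load its FACS model once and move it to `device`, mirroring the `device` keyword that `create_demo()` passes down to every tab.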
requirements.txt CHANGED
@@ -1,3 +1,54 @@
+ # CUDA-enabled PyTorch packages
+ torch==2.0.1+cu118
+ torchvision==0.15.2+cu118
+ torchaudio==2.0.2+cu118
+ -f https://download.pytorch.org/whl/torch_stable.html
+
+ # Core dependencies
+ gradio==4.38.1
+ gradio_client==1.1.0
+
+ # Additional dependencies
+ absl-py==2.1.0
+ aiofiles==23.2.1
+ altair==5.3.0
+ anyio==4.4.0
+ attrs==23.2.0
+ audioread==3.0.1
+ certifi==2024.7.4
+ charset-normalizer==3.3.2
+ click==8.1.7
+ decorator==4.4.2
+ fastapi==0.111.1
+ h5py==3.11.0
+ huggingface-hub==0.23.5
+ idna==3.7
+ Jinja2==3.1.4
+ joblib==1.4.2
+ jsonschema==4.23.0
+ kiwisolver==1.4.5
+ librosa==0.10.2.post1
+ MarkupSafe==2.1.5
+ matplotlib==3.9.1
+ numpy==1.26.4
+ pandas==2.2.2
+ Pillow==10.4.0
+ pydantic==2.8.2
+ python-multipart==0.0.9
+ pytz==2024.1
+ PyYAML==6.0.1
+ requests==2.32.3
+ scikit-learn==1.5.1
+ scipy==1.14.0
+ soundfile==0.12.1
+ starlette==0.37.2
+ tqdm==4.66.4
+ transformers==4.42.4
+ uvicorn==0.30.1
+
+ # Any other necessary dependencies
+ # Add your additional dependencies here
+
  sentencepiece
  tiktoken
  jsonlines
@@ -12,4 +63,4 @@ transformers
  opencv-python-headless
  einops
  tensorboardX
- ipython
+ ipython
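The `+cu118` pins only resolve because of the extra index supplied by the `-f https://download.pytorch.org/whl/torch_stable.html` line; without it, pip would fall back to CPU-only wheels. As an optional sanity check after the Space builds (not part of this commit), a short Python snippet can confirm which build was actually installed:

```python
# Optional diagnostic (not part of this commit): confirm the pinned CUDA 11.8
# wheels were installed and whether a GPU is visible at runtime.
import torch

print("torch version:", torch.__version__)          # expected to end in +cu118
print("built with CUDA:", torch.version.cuda)       # expected "11.8"
print("GPU available:", torch.cuda.is_available())  # False on CPU-only Spaces hardware
```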