FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis (CogVideoX-based FloVD)
Wonjoon Jin, Qi Dai, Chong Luo, Seung-Hwan Baek, Sunghyun Cho
POSTECH, Microsoft Research Asia
Gallery
FloVD-CogVideoX-5B
Project Updates
News 2025/05/02: We have updated the code for FloVD-CogVideoX. We will release the dataset preprocessing and training code soon.
News 2025/02/26: Our paper has been accepted to CVPR 2025.
Quick Start
Prompt Optimization
As mentioned in CogVideoX, we recommend using long, detailed text prompts for better results. Our FloVD-CogVideoX model is trained on text captions extracted with CogVLM2.
Environment
Please make sure your Python version is between 3.10 and 3.12 (inclusive).
pip install -r requirements.txt
Optical flow normalization
As mentioned in the FloVD paper, we normalize optical flow following Generative Image Dynamics, using scale factors (s_x, s_y) = (60, 36) for both FVSM and OMSM.
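The normalization above can be sketched as follows. This is a hypothetical helper, not the repository's code: it assumes flow is stored as an array of shape (..., 2) holding (u, v) pixel displacements, and that values are clipped to [-1, 1] after scaling (the clipping is an assumption).

```python
import numpy as np

# Scale factors used for both FVSM and OMSM (from the paper)
S_X, S_Y = 60.0, 36.0

def normalize_flow(flow: np.ndarray) -> np.ndarray:
    """Scale raw optical flow into a normalized range.

    `flow` has shape (..., 2); the last axis holds horizontal (u)
    and vertical (v) pixel displacements.
    """
    scaled = flow / np.array([S_X, S_Y])
    # Clip so extreme displacements stay in [-1, 1] (an assumption here)
    return np.clip(scaled, -1.0, 1.0)

def denormalize_flow(flow_norm: np.ndarray) -> np.ndarray:
    """Invert the normalization back to pixel displacements."""
    return flow_norm * np.array([S_X, S_Y])
```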
Pre-trained checkpoints
Download the FloVD-CogVideoX
FVSM and OMSM (Curated)
[Google Drive]
In addition, we use the off-the-shelf metric-depth estimation model Depth Anything V2. Please refer to the link below to download it.
[Depth_anything_v2_metric]
Then, place these checkpoints in the ./ckpt directory:
# File tree
./ckpt/
├── FVSM
│   └── FloVD_FVSM_Controlnet.pt
├── OMSM
│   ├── selected_blocks.safetensors
│   └── pytorch_lora_weights.safetensors
└── others
    └── depth_anything_v2_metric_hypersim_vitb.pth
Pre-defined camera trajectory
We provide several example camera trajectories for quick inference. Refer to "./assets/cam_trajectory/" for a visualization of each trajectory.
# File tree
./assets/
├── manual_poses
│   └── ...
├── re10k_poses
│   └── ...
└── manual_poses_PanTiltSpin
    └── ...
Inference Settings
At inference time, we recommend using the same settings as in training.
The number of frames: 49
FPS: 16
Flow scale factor: (s_x, s_y) = (60, 36)
CONTROLNET_GUIDANCE_END: the fraction of denoising timesteps during which ControlNet features are injected into the pre-trained model. Use 0.4 for better camera controllability, or 0.1 for more natural object motion.
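The effect of CONTROLNET_GUIDANCE_END can be sketched as a gating function over the denoising schedule. This is a hypothetical helper for illustration (the function name and signature are assumptions; the real code injects features inside the diffusion loop):

```python
def controlnet_active(step: int, num_steps: int, guidance_end: float = 0.4) -> bool:
    """Return True while ControlNet features should be injected.

    Features are injected only during the first `guidance_end`
    fraction of denoising steps; later steps run the base model
    alone, which favors more natural object motion.
    """
    return step < guidance_end * num_steps
```

With 50 denoising steps and guidance_end=0.4, ControlNet would be active for the first 20 steps and inactive afterwards.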
Inference
flovd_demo: Synthesizes videos with the desired camera trajectory and natural object motions. For a more detailed explanation of the inference code, including the significance of common parameters, refer to flovd_demo_script.
flovd_fvsm_demo: Uses only the FVSM model for more accurate camera control with little object motion; OMSM is omitted. (The script will be released soon.)
flovd_ddp_demo: Use this to sample a large number of videos. Note that you need to prepare the dataset in advance following our dataset preprocessing pipeline. (The preprocessing pipeline will be released.)
Tools
This folder contains some tools for camera trajectory generation, visualization, etc.
generate_camparam: Generates manual camera parameters such as zoom-in, zoom-out, etc.
visualize_trajectory: Visualizes camera trajectories.
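For illustration, a zoom-in (dolly-forward) trajectory like those produced by generate_camparam could be sketched as a sequence of 4x4 extrinsic matrices translated along the optical axis. This is a hypothetical sketch; the function name, pose convention, and array layout are assumptions, not the repository's API:

```python
import numpy as np

def make_zoom_in_trajectory(num_frames: int = 49, depth: float = 1.0) -> np.ndarray:
    """Build a simple dolly-forward ("zoom-in") camera trajectory.

    Returns an array of shape (num_frames, 4, 4), where each entry is
    a 4x4 pose matrix with identity rotation and a translation that
    moves linearly along +z (the view direction, by assumption).
    """
    poses = np.tile(np.eye(4), (num_frames, 1, 1))
    # Translate from 0 to `depth` along the optical axis
    poses[:, 2, 3] = np.linspace(0.0, depth, num_frames)
    return poses
```

Other manual motions (zoom-out, pan, tilt, spin) would follow the same pattern with different translations or rotations per frame.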
Citation
If you find our work helpful, please leave us a star ⭐ and cite our paper.
@article{jin2025flovd,
title={FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis},
author={Jin, Wonjoon and Dai, Qi and Luo, Chong and Baek, Seung-Hwan and Cho, Sunghyun},
journal={arXiv preprint arXiv:2502.08244},
year={2025}
}
Reference
We thank the CogVideoX team for open-sourcing their work.
Model-License
The CogVideoX-5B model (Transformers module, including I2V and T2V) is released under the CogVideoX LICENSE.