FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis (CogVideoX-based FloVD)
Wonjoon Jin, Qi Dai, Chong Luo, Seung-Hwan Baek, Sunghyun Cho
POSTECH, Microsoft Research Asia
Gallery
FloVD-CogVideoX-5B
Project Updates
News 2025/05/02: We have updated the code for FloVD-CogVideoX. We will release the dataset preprocessing and training code soon.
News 2025/02/26: Our paper has been accepted to CVPR 2025.
Quick Start
Prompt Optimization
As mentioned in CogVideoX, we recommend using long, detailed text prompts for better results. Our FloVD-CogVideoX model is trained on text captions extracted with CogVLM2.
Environment
Please make sure your Python version is between 3.10 and 3.12 (inclusive).
pip install -r requirements.txt
Optical flow normalization
As mentioned in the FloVD paper, we normalize optical flow following Generative Image Dynamics, using scale factors (s_x, s_y) = (60, 36) for both FVSM and OMSM.
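The normalization above can be sketched as follows. This is a hypothetical helper, not the repository's code: it assumes flow is stored as an array of shape (..., 2) holding (u, v) pixel displacements, and that values are clipped to [-1, 1] after scaling (the clipping is an assumption).

```python
import numpy as np

# Scale factors used for both FVSM and OMSM (from the paper)
S_X, S_Y = 60.0, 36.0

def normalize_flow(flow: np.ndarray) -> np.ndarray:
    """Scale raw optical flow into a normalized range.

    `flow` has shape (..., 2); the last axis holds horizontal (u)
    and vertical (v) pixel displacements.
    """
    scaled = flow / np.array([S_X, S_Y])
    # Clip so extreme displacements stay in [-1, 1] (an assumption here)
    return np.clip(scaled, -1.0, 1.0)

def denormalize_flow(flow_norm: np.ndarray) -> np.ndarray:
    """Invert the normalization back to pixel displacements."""
    return flow_norm * np.array([S_X, S_Y])
```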
Pre-trained checkpoints
Download the FloVD-CogVideoX
FVSM and OMSM (Curated)
[Google Drive]
In addition, we use the off-the-shelf metric-depth estimation model Depth Anything V2. Please refer to the link below to download it.
[Depth_anything_v2_metric]
Then, place these checkpoints in the ./ckpt directory:
# File tree
./ckpt/
├── FVSM
│   └── FloVD_FVSM_Controlnet.pt
├── OMSM
│   ├── selected_blocks.safetensors
│   └── pytorch_lora_weights.safetensors
└── others
    └── depth_anything_v2_metric_hypersim_vitb.pth
Pre-defined camera trajectory
We provide several example camera trajectories for quick inference. Refer to "./assets/cam_trajectory/" for a visualization of each trajectory.
# File tree
./assets/
├── manual_poses
│   └── ...
├── re10k_poses
│   └── ...
└── manual_poses_PanTiltSpin
    └── ...
Inference Settings
At inference time, we recommend using the same settings as in training.
The number of frames: 49
FPS: 16
Flow scale factor: (s_x, s_y) = (60, 36)
CONTROLNET_GUIDANCE_END: the fraction of denoising timesteps during which ControlNet features are injected into the pre-trained model. Use 0.4 for better camera controllability, or 0.1 for more natural object motion.
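The effect of CONTROLNET_GUIDANCE_END can be sketched as a gating function over the denoising schedule. This is a hypothetical helper for illustration (the function name and signature are assumptions; the real code injects features inside the diffusion loop):

```python
def controlnet_active(step: int, num_steps: int, guidance_end: float = 0.4) -> bool:
    """Return True while ControlNet features should be injected.

    Features are injected only during the first `guidance_end`
    fraction of denoising steps; later steps run the base model
    alone, which favors more natural object motion.
    """
    return step < guidance_end * num_steps
```

With 50 denoising steps and guidance_end=0.4, ControlNet would be active for the first 20 steps and inactive afterwards.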
Inference
flovd_demo: Synthesizes videos with the desired camera trajectory and natural object motions. For a more detailed explanation of the inference code, including the significance of common parameters, refer to flovd_demo_script.
flovd_fvsm_demo: Uses only the FVSM model for more accurate camera control with little object motion; OMSM is omitted. (The script will be released soon.)
flovd_ddp_demo: Use this to sample a large number of videos. Note that you need to prepare the dataset in advance following our dataset preprocessing pipeline. (The preprocessing pipeline will be released.)
Tools
This folder contains some tools for camera trajectory generation, visualization, etc.
generate_camparam: Generates manual camera parameters such as zoom-in, zoom-out, etc.
visualize_trajectory: Visualizes camera trajectories.
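For illustration, a zoom-in (dolly-forward) trajectory like those produced by generate_camparam could be sketched as a sequence of 4x4 extrinsic matrices translated along the optical axis. This is a hypothetical sketch; the function name, pose convention, and array layout are assumptions, not the repository's API:

```python
import numpy as np

def make_zoom_in_trajectory(num_frames: int = 49, depth: float = 1.0) -> np.ndarray:
    """Build a simple dolly-forward ("zoom-in") camera trajectory.

    Returns an array of shape (num_frames, 4, 4), where each entry is
    a 4x4 pose matrix with identity rotation and a translation that
    moves linearly along +z (the view direction, by assumption).
    """
    poses = np.tile(np.eye(4), (num_frames, 1, 1))
    # Translate from 0 to `depth` along the optical axis
    poses[:, 2, 3] = np.linspace(0.0, depth, num_frames)
    return poses
```

Other manual motions (zoom-out, pan, tilt, spin) would follow the same pattern with different translations or rotations per frame.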
Citation
If you find our work helpful, please leave us a star ⭐ and cite our paper.
@article{jin2025flovd,
title={FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis},
author={Jin, Wonjoon and Dai, Qi and Luo, Chong and Baek, Seung-Hwan and Cho, Sunghyun},
journal={arXiv preprint arXiv:2502.08244},
year={2025}
}
Reference
We thank the CogVideoX team for open-sourcing their work.
Model-License
The CogVideoX-5B model (Transformers module, including I2V and T2V) is released under the CogVideoX LICENSE.