RealCam-I2V / index.html
roll-ai's picture
Upload 3 files
14964a5 verified
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<!-- Meta tags for social media banners, these should be filled in appropriatly as they are your "business card" -->
<!-- Replace the content tag with appropriate information -->
<meta name="description"
content="RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control">
<meta property="og:title"
content="RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control" />
<meta property="og:description"
content="RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control" />
<!-- Keywords for your paper to be indexed by-->
<meta name="keywords" content="RealCam-I2V, Complex Camera Control, Image-to-Video Generation">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control</title>
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet">
<link href="https://fonts.googleapis.com/css2?family=Playfair+Display:ital,wght@1,400&display=swap"
rel="stylesheet">
<link rel="stylesheet" href="static/css/bulma.min.css">
<link rel="stylesheet" href="static/css/bulma-carousel.min.css">
<link rel="stylesheet" href="static/css/bulma-slider.min.css">
<link rel="stylesheet" href="static/css/fontawesome.all.min.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
<link rel="stylesheet" href="static/css/index.css">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/dreampulse/computer-modern-web-font@master/fonts.css">
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
<script src="https://documentcloud.adobe.com/view-sdk/main.js"></script>
<script defer src="static/js/fontawesome.all.min.js"></script>
<script src="static/js/bulma-carousel.min.js"></script>
<script src="static/js/bulma-slider.min.js"></script>
<script src="static/js/index.js"></script>
<script type="text/javascript" async
src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_SVG"></script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
inlineMath: [['$','$'], ['\\(','\\)']]
}
});
</script>
<style>
.video-container {
display: flex;
justify-content: center;
gap: 0px;
}
.italic {
font-family: 'Playfair Display';
font-style: italic;
}
</style>
</head>
<body>
<!-- title and author -->
<section class="hero">
<div class="hero-body">
<div class="container is-max-desktop">
<div class="columns is-centered">
<div class="column has-text-centered">
<h1 class="title is-2 publication-title">
RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control
</h1>
<div class="is-size-5 publication-authors">
<span class="author-block">Teng Li<sup>1,2*</sup>,</span>
<span class="author-block">Guangcong Zheng<sup>1,2*</sup>,</span>
<span class="author-block">Rui Jiang<sup>1,2</sup>,</span>
<span class="author-block">Shuigen Zhan<sup>1</sup>,</span>
<span class="author-block">Tao Wu<sup>1</sup>,</span>
<span class="author-block">Yehao Lu<sup>1</sup>,</span>
<span class="author-block">Yining Lin<sup>3</sup>,</span>
<br>
<span class="author-block">Chuanyun Deng<sup>2</sup>,</span>
<span class="author-block">Yepan Xiong<sup>2</sup>,</span>
<span class="author-block">Min Chen<sup>2</sup>,</span>
<span class="author-block">Lin Cheng<sup>2</sup>,</span>
<span class="author-block">Xi Li<sup>1&#9993;</sup></span>
</div>
<div class="is-size-5 publication-authors">
<span class="author-block"><sup>1</sup>Zhejiang University,</span>
<span class="author-block"><sup>2</sup>Huawei,</span>
<span class="author-block"><sup>3</sup>Supremind</span>
<br>
<span class="author-block">ICCV 2025</span>
</div>
<div class="column has-text-centered">
<div class="publication-links">
<span class="link-block">
<a href="https://arxiv.org/pdf/2502.10059.pdf" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fas fa-file-pdf"></i>
</span>
<span>Paper</span>
</a>
</span>
<span class="link-block">
<a href="https://arxiv.org/abs/2502.10059" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="ai ai-arxiv"></i>
</span>
<span>arXiv</span>
</a>
</span>
<span class="link-block">
<a href="https://github.com/ZGCTroy/RealCam-I2V" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-github"></i>
</span>
<span>Code</span>
</a>
</span>
<span class="link-block">
<a href="https://github.com/ZGCTroy/CamI2V" target="_blank"
class="external-link button is-normal is-rounded is-dark">
<span class="icon">
<i class="fab fa-github"></i>
</span>
<span>CamI2V</span>
</a>
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</section>
<!-- abstract -->
<section class="section hero is-light">
<div class="container is-max-desktop">
<div class="columns is-centered has-text-centered">
<div class="column is-four-fifths">
<h2 class="title is-3">Abstract</h2>
<div class="content has-text-justified">
<p>
Recent advancements in camera-trajectory-guided image-to-video generation offer higher
precision and better support for complex camera control compared to text-based approaches.
However, they also introduce significant usability challenges, as users often struggle to
provide precise camera parameters when working with arbitrary real-world images without
knowledge of their depth nor scene scale.
To address these real-world application issues, we propose RealCam-I2V, a novel
diffusion-based video generation framework that integrates monocular metric depth estimation
to establish 3D scene reconstruction in a preprocessing step.
During training, the reconstructed 3D scene enables scaling camera parameters from relative
to metric scales, ensuring compatibility and scale consistency across diverse real-world
images.
In inference, RealCam-I2V offers an intuitive interface where users can precisely draw
camera trajectories by dragging within the 3D scene.
To further enhance precise camera control and scene consistency, we propose
scene-constrained noise shaping, which shapes high-level noise and also allows the framework
to maintain dynamic and coherent video generation in lower noise stages.
RealCam-I2V achieves significant improvements in controllability and video quality on the
RealEstate10K and out-of-domain images. We further enables applications like
camera-controlled looping video generation and generative frame interpolation.
</p>
</div>
</div>
</div>
</div>
</section>
<section class="section hero">
<div class="container has-text-centered">
<h2 class="title is-3">Demo</h2>
<div class="video-container">
<div>
<video autoplay controls muted loop width="80%">
<source src="static/videos/demo/4d_demo.mp4" type="video/mp4" />
</video>
</div>
</div>
<h2 class="subtitle has-text-centered italic">
4D Visualization
</h2>
<br>
<div class="video-container">
<div>
<video autoplay controls muted loop>
<source src="static/videos/cogvideo1.5/73c3266a-d3e1-41c9-9691-729478a8bf77.mp4"
type="video/mp4" />
</video>
</div>
<div>
<video autoplay controls muted loop>
<source src="static/videos/cogvideo1.5/79131dea-ca85-49df-b68b-cdb208f164c7.mp4"
type="video/mp4" />
</video>
</div>
<div>
<video autoplay controls muted loop>
<source src="static/videos/cogvideo1.5/b17050b5-3ed8-44ae-94a4-ec939c57b41f.mp4"
type="video/mp4" />
</video>
</div>
<div>
<video autoplay controls muted loop>
<source src="static/videos/cogvideo1.5/8ab67ba3-8300-4b82-98b7-8e28403cf6f7.mp4"
type="video/mp4" />
</video>
</div>
</div>
<h2 class="subtitle has-text-centered italic">
Aerial View
</h2>
<br>
<div class="video-container">
<div>
<video autoplay controls muted loop>
<source src="static/videos/cogvideo1.5/3f962cd6-fbf4-4b8a-b107-1468931c80f4.mp4"
type="video/mp4" />
</video>
</div>
<div>
<video autoplay controls muted loop>
<source src="static/videos/cogvideo1.5/d4db16a8-3f82-43b3-8432-cc8df007f10c.mp4"
type="video/mp4" />
</video>
</div>
<div>
<video autoplay controls muted loop>
<source src="static/videos/cogvideo1.5/34614b89-431d-4e31-8d82-89a0f082aaed.mp4"
type="video/mp4" />
</video>
</div>
<div>
<video autoplay controls muted loop>
<source src="static/videos/cogvideo1.5/4f0b8b62-a278-4b0e-8457-b6e8b099de59.mp4"
type="video/mp4" />
</video>
</div>
</div>
<h2 class="subtitle has-text-centered italic">
Urban Exploration
</h2>
<br>
<div class="video-container">
<div>
<video autoplay controls muted loop>
<source src="static/videos/cogvideo1.5/f8180809-8e91-4ef8-b19b-9d42e99f5e00.mp4"
type="video/mp4" />
</video>
</div>
<div>
<video autoplay controls muted loop>
<source src="static/videos/cogvideo1.5/6c23cfd0-9618-4edd-9003-28b6b92c4196.mp4"
type="video/mp4" />
</video>
</div>
<div>
<video autoplay controls muted loop>
<source src="static/videos/cogvideo1.5/37c7abfa-c442-4df5-ace5-d2a2fa1c23aa.mp4"
type="video/mp4" />
</video>
</div>
<div>
<video autoplay controls muted loop>
<source src="static/videos/cogvideo1.5/e304abe7-3e5a-4929-9c0d-0dd8fec78b48.mp4"
type="video/mp4" />
</video>
</div>
</div>
<h2 class="subtitle has-text-centered italic">
FPV & Sports
</h2>
<br>
<div class="video-container">
<div>
<video autoplay controls muted loop>
<source src="static/videos/dynamic/cogvideox_controlnetxs_c52592a0.mp4" type="video/mp4" />
</video>
</div>
<div>
<video autoplay controls muted loop>
<source src="static/videos/dynamic/cogvideox_controlnetxs_19c3e433.mp4" type="video/mp4" />
</video>
</div>
<div>
<video autoplay controls muted loop>
<source src="static/videos/dynamic/cogvideox_controlnetxs_43d1ce7d.mp4" type="video/mp4" />
</video>
</div>
<div>
<video autoplay controls muted loop>
<source src="static/videos/dynamic/cogvideox_controlnetxs_183e7ba2.mp4" type="video/mp4" />
</video>
</div>
</div>
<h2 class="subtitle has-text-centered italic">
Complex Trajectories & Scene Dynamics
</h2>
<br>
<div class="video-container">
<div>
<video autoplay controls muted loop width="60%">
<source src="static/videos/demo/cogvideox.mp4" type="video/mp4" />
</video>
</div>
</div>
<h2 class="subtitle has-text-centered italic">
Various Domains
</h2>
<br>
<div class="video-container">
<div>
<video autoplay controls muted loop>
<source src="static/videos/various_types/cartoon.mp4" type="video/mp4" />
</video>
<h2 class="subtitle has-text-centered italic">
Cartoon
</h2>
</div>
<div>
<video autoplay controls muted loop>
<source src="static/videos/various_types/food.mp4" type="video/mp4" />
</video>
<h2 class="subtitle has-text-centered italic">
Food
</h2>
</div>
</div>
<br>
<div class="video-container">
<div>
<video autoplay controls muted loop>
<source src="static/videos/various_types/human.mp4" type="video/mp4" />
</video>
<h2 class="subtitle has-text-centered italic">
Human
</h2>
</div>
<div>
<video autoplay controls muted loop>
<source src="static/videos/various_types/pets.mp4" type="video/mp4" />
</video>
<h2 class="subtitle has-text-centered italic">
Pets
</h2>
</div>
</div>
<br>
<div class="video-container">
<div>
<video autoplay controls muted loop>
<source src="static/videos/demo/product_demo.mp4" type="video/mp4" />
</video>
<h2 class="subtitle has-text-centered italic">
Product Demo
</h2>
</div>
<div>
<video autoplay controls muted loop>
<source src="static/videos/demo/chinese_landscape.mp4" type="video/mp4" />
</video>
<h2 class="subtitle has-text-centered italic">
Chinese Antique
</h2>
</div>
</div>
</div>
</section>
<!-- Method -->
<section class="section hero">
<div class="container has-text-centered">
<h2 class="title is-3">Method</h2>
<!-- step 1 -->
<div class="container has-text-centered">
<h2 class="title has-text-centered is-4 italic">
Step 1 (Training & Inference): Construct 3D point cloud by monocular metric depth estimation.
</h2>
<div class="video-container" style="gap: 5px;">
<img src="static/images/scene1.jpg" width="25%" />
<img src="static/images/scene2.jpg" width="25%" />
<img src="static/images/scene3.jpg" width="25%" />
</div>
</div>
<br>
<!-- step 2 -->
<div class="container has-text-centered">
<h2 class="title has-text-centered is-4 italic">
Step 2 (Training): Align from relative-scale to metric-scale.
</h2>
<img src="static/images/align.jpg" width="80%" />
</div>
<br>
<!-- step 3 -->
<div class="container has-text-centered">
<h2 class="title has-text-centered is-4 italic">
Step 3 (Inference): Render preview video with camera trajectory on the reconstructed 3D scene.
</h2>
<div class="video-container">
<video autoplay controls muted loop>
<source src="static/videos/preview_video/preview1.mp4" type="video/mp4" />
</video>
<video autoplay controls muted loop>
<source src="static/videos/preview_video/preview2.mp4" type="video/mp4" />
</video>
<video autoplay controls muted loop>
<source src="static/videos/preview_video/preview3.mp4" type="video/mp4" />
</video>
</div>
</div>
<br>
<!-- step 4 -->
<div class="container has-text-centered">
<h2 class="title has-text-centered is-4 italic">
Step 4 (Inference): Scene-constrained noise shaping.
</h2>
<div class="content has-text-justified">
We paste the visible latents of preview video into the predicted latent during generation process.
However, we only paste on the high noise level and allow for
dynamics in lower level of noise, thus we name it "noise shaping" that only shapes the noise at the
initial high noise stage.
</div>
<div class="video-container">
<div>
<video autoplay controls muted loop>
<source src="static/videos/ablation/ablation1_preview.mp4" type="video/mp4" />
</video>
<h2 class="subtitle has-text-centered is-6 italic">
Preview Video
</h2>
</div>
<div>
<video autoplay controls muted loop>
<source src="static/videos/ablation/ablation1_withNoiseShaping.mp4" type="video/mp4" />
</video>
<h2 class="subtitle has-text-centered is-6 italic">
w. Scene-Constrained Noise Shaping
</h2>
</div>
<div>
<video autoplay controls muted loop>
<source src="static/videos/ablation/ablation1_withoutNoiseShaping.mp4" type="video/mp4" />
</video>
<h2 class="subtitle has-text-centered is-6 italic">
w.o. Scene-Constrained Noise Shaping
</h2>
</div>
</div>
</div>
</div>
</section>
<section class="section" id="BibTeX">
<div class="container is-max-desktop content">
<h2 class="title">BibTeX</h2>
<pre><code>
@article{li2025realcam,
title={RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control},
author={Li, Teng and Zheng, Guangcong and Jiang, Rui and Zhan, Shuigen and Wu, Tao and Lu, Yehao and Lin, Yining and Li, Xi},
journal={arXiv preprint arXiv:2502.10059},
year={2025},
}
</code></pre>
</div>
</section>
</body>
</html>