	<!DOCTYPE html>
	<html>

	<head>
	<meta charset="utf-8">
	<!-- Meta tags for social media banners; these should be filled in appropriately, as they are your "business card" -->
	<!-- Replace the content tag with appropriate information -->
	<meta name="description"
	content="RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control">
	<meta property="og:title"
	content="RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control" />
	<meta property="og:description"
	content="RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control" />

	<!-- Keywords for your paper to be indexed by-->
	<meta name="keywords" content="RealCam-I2V, Complex Camera Control, Image-to-Video Generation">
	<meta name="viewport" content="width=device-width, initial-scale=1">

	<title>RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control</title>
	<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet">
	<link href="https://fonts.googleapis.com/css2?family=Playfair+Display:ital,wght@1,400&display=swap"
	rel="stylesheet">

	<link rel="stylesheet" href="static/css/bulma.min.css">
	<link rel="stylesheet" href="static/css/bulma-carousel.min.css">
	<link rel="stylesheet" href="static/css/bulma-slider.min.css">
	<link rel="stylesheet" href="static/css/fontawesome.all.min.css">
	<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css">
	<link rel="stylesheet" href="static/css/index.css">
	<link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/dreampulse/computer-modern-web-font@master/fonts.css">

	<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
	<script src="https://documentcloud.adobe.com/view-sdk/main.js"></script>
	<script defer src="static/js/fontawesome.all.min.js"></script>
	<script src="static/js/bulma-carousel.min.js"></script>
	<script src="static/js/bulma-slider.min.js"></script>
	<script src="static/js/index.js"></script>

	<script type="text/javascript" async
	src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_SVG"></script>
	<script type="text/x-mathjax-config">
	MathJax.Hub.Config({
	tex2jax: {
	inlineMath: [['$','$'], ['\$','\$']]
	}
	});
	</script>

	<style>
	.video-container {
	display: flex;
	justify-content: center;
	gap: 0px;
	}

	.italic {
	font-family: 'Playfair Display';
	font-style: italic;
	}
	</style>
	</head>

	<body>
	<!-- title and author -->
	<section class="hero">
	<div class="hero-body">
	<div class="container is-max-desktop">
	<div class="columns is-centered">
	<div class="column has-text-centered">
	<h1 class="title is-2 publication-title">
	RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control
	</h1>

	<div class="is-size-5 publication-authors">
	<span class="author-block">Teng Li<sup>1,2*</sup>,</span>
	<span class="author-block">Guangcong Zheng<sup>1,2*</sup>,</span>
	<span class="author-block">Rui Jiang<sup>1,2</sup>,</span>
	<span class="author-block">Shuigen Zhan<sup>1</sup>,</span>
	<span class="author-block">Tao Wu<sup>1</sup>,</span>
	<span class="author-block">Yehao Lu<sup>1</sup>,</span>
	<span class="author-block">Yining Lin<sup>3</sup>,</span>
	<br>
	<span class="author-block">Chuanyun Deng<sup>2</sup>,</span>
	<span class="author-block">Yepan Xiong<sup>2</sup>,</span>
	<span class="author-block">Min Chen<sup>2</sup>,</span>
	<span class="author-block">Lin Cheng<sup>2</sup>,</span>
	<span class="author-block">Xi Li<sup>1✉</sup></span>
	</div>

	<div class="is-size-5 publication-authors">
	<span class="author-block"><sup>1</sup>Zhejiang University,</span>
	<span class="author-block"><sup>2</sup>Huawei,</span>
	<span class="author-block"><sup>3</sup>Supremind</span>
	<br>
	<span class="author-block">ICCV 2025</span>
	</div>

	<div class="column has-text-centered">
	<div class="publication-links">
	<span class="link-block">
	<a href="https://arxiv.org/pdf/2502.10059.pdf" target="_blank"
	class="external-link button is-normal is-rounded is-dark">
	<span class="icon">
	<i class="fas fa-file-pdf"></i>
	</span>
	<span>Paper</span>
	</a>
	</span>

	<span class="link-block">
	<a href="https://arxiv.org/abs/2502.10059" target="_blank"
	class="external-link button is-normal is-rounded is-dark">
	<span class="icon">
	<i class="ai ai-arxiv"></i>
	</span>
	<span>arXiv</span>
	</a>
	</span>

	<span class="link-block">
	<a href="https://github.com/ZGCTroy/RealCam-I2V" target="_blank"
	class="external-link button is-normal is-rounded is-dark">
	<span class="icon">
	<i class="fab fa-github"></i>
	</span>
	<span>Code</span>
	</a>
	</span>

	<span class="link-block">
	<a href="https://github.com/ZGCTroy/CamI2V" target="_blank"
	class="external-link button is-normal is-rounded is-dark">
	<span class="icon">
	<i class="fab fa-github"></i>
	</span>
	<span>CamI2V</span>
	</a>
	</span>
	</div>
	</div>
	</div>
	</div>
	</div>
	</div>
	</section>

	<!-- abstract -->
	<section class="section hero is-light">
	<div class="container is-max-desktop">
	<div class="columns is-centered has-text-centered">
	<div class="column is-four-fifths">
	<h2 class="title is-3">Abstract</h2>
	<div class="content has-text-justified">
	<p>
	Recent advancements in camera-trajectory-guided image-to-video generation offer higher
	precision and better support for complex camera control compared to text-based approaches.
	However, they also introduce significant usability challenges, as users often struggle to
	provide precise camera parameters when working with arbitrary real-world images without
	knowledge of their depth or scene scale.
	To address these real-world application issues, we propose RealCam-I2V, a novel
	diffusion-based video generation framework that integrates monocular metric depth estimation
	to reconstruct the 3D scene in a preprocessing step.
	During training, the reconstructed 3D scene enables scaling camera parameters from relative
	to metric scales, ensuring compatibility and scale consistency across diverse real-world
	images.
	At inference time, RealCam-I2V offers an intuitive interface where users can precisely draw
	camera trajectories by dragging within the 3D scene.
	To further enhance precise camera control and scene consistency, we propose
	scene-constrained noise shaping, which shapes the noise at high noise levels while still
	allowing the framework to maintain dynamic and coherent video generation in lower noise stages.
	RealCam-I2V achieves significant improvements in controllability and video quality on
	RealEstate10K and out-of-domain images. We further enable applications such as
	camera-controlled looping video generation and generative frame interpolation.
	</p>
	</div>
	</div>
	</div>
	</div>
	</section>

	<section class="section hero">
	<div class="container has-text-centered">
	<h2 class="title is-3">Demo</h2>

	<div class="video-container">
	<div>
	<video autoplay controls muted loop width="80%">
	<source src="static/videos/demo/4d_demo.mp4" type="video/mp4" />
	</video>
	</div>
	</div>
	<h2 class="subtitle has-text-centered italic">
	4D Visualization
	</h2>

	<br>

	<div class="video-container">
	<div>
	<video autoplay controls muted loop>
	<source src="static/videos/cogvideo1.5/73c3266a-d3e1-41c9-9691-729478a8bf77.mp4"
	type="video/mp4" />
	</video>
	</div>

	<div>
	<video autoplay controls muted loop>
	<source src="static/videos/cogvideo1.5/79131dea-ca85-49df-b68b-cdb208f164c7.mp4"
	type="video/mp4" />
	</video>
	</div>

	<div>
	<video autoplay controls muted loop>
	<source src="static/videos/cogvideo1.5/b17050b5-3ed8-44ae-94a4-ec939c57b41f.mp4"
	type="video/mp4" />
	</video>
	</div>

	<div>
	<video autoplay controls muted loop>
	<source src="static/videos/cogvideo1.5/8ab67ba3-8300-4b82-98b7-8e28403cf6f7.mp4"
	type="video/mp4" />
	</video>
	</div>
	</div>
	<h2 class="subtitle has-text-centered italic">
	Aerial View
	</h2>

	<br>

	<div class="video-container">
	<div>
	<video autoplay controls muted loop>
	<source src="static/videos/cogvideo1.5/3f962cd6-fbf4-4b8a-b107-1468931c80f4.mp4"
	type="video/mp4" />
	</video>
	</div>

	<div>
	<video autoplay controls muted loop>
	<source src="static/videos/cogvideo1.5/d4db16a8-3f82-43b3-8432-cc8df007f10c.mp4"
	type="video/mp4" />
	</video>
	</div>

	<div>
	<video autoplay controls muted loop>
	<source src="static/videos/cogvideo1.5/34614b89-431d-4e31-8d82-89a0f082aaed.mp4"
	type="video/mp4" />
	</video>
	</div>

	<div>
	<video autoplay controls muted loop>
	<source src="static/videos/cogvideo1.5/4f0b8b62-a278-4b0e-8457-b6e8b099de59.mp4"
	type="video/mp4" />
	</video>
	</div>
	</div>
	<h2 class="subtitle has-text-centered italic">
	Urban Exploration
	</h2>

	<br>

	<div class="video-container">
	<div>
	<video autoplay controls muted loop>
	<source src="static/videos/cogvideo1.5/f8180809-8e91-4ef8-b19b-9d42e99f5e00.mp4"
	type="video/mp4" />
	</video>
	</div>

	<div>
	<video autoplay controls muted loop>
	<source src="static/videos/cogvideo1.5/6c23cfd0-9618-4edd-9003-28b6b92c4196.mp4"
	type="video/mp4" />
	</video>
	</div>

	<div>
	<video autoplay controls muted loop>
	<source src="static/videos/cogvideo1.5/37c7abfa-c442-4df5-ace5-d2a2fa1c23aa.mp4"
	type="video/mp4" />
	</video>
	</div>

	<div>
	<video autoplay controls muted loop>
	<source src="static/videos/cogvideo1.5/e304abe7-3e5a-4929-9c0d-0dd8fec78b48.mp4"
	type="video/mp4" />
	</video>
	</div>
	</div>
	<h2 class="subtitle has-text-centered italic">
	FPV &amp; Sports
	</h2>

	<br>

	<div class="video-container">
	<div>
	<video autoplay controls muted loop>
	<source src="static/videos/dynamic/cogvideox_controlnetxs_c52592a0.mp4" type="video/mp4" />
	</video>
	</div>

	<div>
	<video autoplay controls muted loop>
	<source src="static/videos/dynamic/cogvideox_controlnetxs_19c3e433.mp4" type="video/mp4" />
	</video>
	</div>

	<div>
	<video autoplay controls muted loop>
	<source src="static/videos/dynamic/cogvideox_controlnetxs_43d1ce7d.mp4" type="video/mp4" />
	</video>
	</div>

	<div>
	<video autoplay controls muted loop>
	<source src="static/videos/dynamic/cogvideox_controlnetxs_183e7ba2.mp4" type="video/mp4" />
	</video>
	</div>
	</div>
	<h2 class="subtitle has-text-centered italic">
	Complex Trajectories &amp; Scene Dynamics
	</h2>

	<br>

	<div class="video-container">
	<div>
	<video autoplay controls muted loop width="60%">
	<source src="static/videos/demo/cogvideox.mp4" type="video/mp4" />
	</video>
	</div>
	</div>
	<h2 class="subtitle has-text-centered italic">
	Various Domains
	</h2>

	<br>

	<div class="video-container">
	<div>
	<video autoplay controls muted loop>
	<source src="static/videos/various_types/cartoon.mp4" type="video/mp4" />
	</video>
	<h2 class="subtitle has-text-centered italic">
	Cartoon
	</h2>
	</div>

	<div>
	<video autoplay controls muted loop>
	<source src="static/videos/various_types/food.mp4" type="video/mp4" />
	</video>
	<h2 class="subtitle has-text-centered italic">
	Food
	</h2>
	</div>
	</div>

	<br>

	<div class="video-container">
	<div>
	<video autoplay controls muted loop>
	<source src="static/videos/various_types/human.mp4" type="video/mp4" />
	</video>
	<h2 class="subtitle has-text-centered italic">
	Human
	</h2>
	</div>

	<div>
	<video autoplay controls muted loop>
	<source src="static/videos/various_types/pets.mp4" type="video/mp4" />
	</video>
	<h2 class="subtitle has-text-centered italic">
	Pets
	</h2>
	</div>
	</div>

	<br>

	<div class="video-container">
	<div>
	<video autoplay controls muted loop>
	<source src="static/videos/demo/product_demo.mp4" type="video/mp4" />
	</video>
	<h2 class="subtitle has-text-centered italic">
	Product Demo
	</h2>
	</div>

	<div>
	<video autoplay controls muted loop>
	<source src="static/videos/demo/chinese_landscape.mp4" type="video/mp4" />
	</video>
	<h2 class="subtitle has-text-centered italic">
	Chinese Antique
	</h2>
	</div>
	</div>

	</div>

	</section>

	<!-- Method -->
	<section class="section hero">
	<div class="container has-text-centered">
	<h2 class="title is-3">Method</h2>

	<!-- step 1 -->
	<div class="container has-text-centered">
	<h2 class="title has-text-centered is-4 italic">
	Step 1 (Training &amp; Inference): Construct a 3D point cloud via monocular metric depth estimation.
	</h2>

	<div class="video-container" style="gap: 5px;">
	<img src="static/images/scene1.jpg" width="25%" />
	<img src="static/images/scene2.jpg" width="25%" />
	<img src="static/images/scene3.jpg" width="25%" />
	</div>
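	<div class="content has-text-justified">
	<p>
	As an illustration of this step, the sketch below lifts a metric depth map to a camera-space
	point cloud with a pinhole model. It is a minimal sketch, not the released implementation:
	the function name <code>unproject_depth</code>, the intrinsics <code>fx, fy, cx, cy</code>,
	and the toy depth values are all assumptions made for the example.
	</p>
	<pre><code>
```python
def unproject_depth(depth, fx, fy, cx, cy):
    """Lift an HxW metric depth map (meters) to camera-space 3D points.

    Hypothetical helper: assumes a pinhole camera with focal lengths
    (fx, fy) and principal point (cx, cy), all in pixels.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:  # keep only pixels with a valid depth
                x = (u - cx) * z / fx
                y = (v - cy) * z / fy
                points.append((x, y, z))
    return points


# Toy 2x2 depth map; the zero marks an invalid pixel and is skipped.
depth = [[2.0, 2.0], [4.0, 0.0]]
cloud = unproject_depth(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(cloud)
```
	</code></pre>
	<p>
	Because the depth is metric, the resulting coordinates are in meters, which is what lets a
	user-drawn trajectory carry a real-world scale.
	</p>
	</div>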
	</div>

	<br>

	<!-- step 2 -->
	<div class="container has-text-centered">
	<h2 class="title has-text-centered is-4 italic">
	Step 2 (Training): Align camera parameters from relative scale to metric scale.
	</h2>

	<img src="static/images/align.jpg" width="80%" />
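	<div class="content has-text-justified">
	<p>
	One simple way such an alignment could work is sketched below. This page does not specify how
	the scale factor is estimated, so the median depth ratio used here is an assumption, as are
	the helper names <code>metric_scale</code> and <code>rescale_translations</code> and the toy
	numbers.
	</p>
	<pre><code>
```python
from statistics import median


def metric_scale(relative_depths, metric_depths):
    """Estimate one scale factor mapping relative-scale depths to meters.

    Assumption: both lists hold depths of the same pixels; the median of
    the per-pixel ratios is used because it tolerates a few outliers.
    """
    ratios = [m / r for r, m in zip(relative_depths, metric_depths)
              if r > 0 and m > 0]
    return median(ratios)


def rescale_translations(translations, scale):
    """Scale camera translations; rotations are unaffected by scene scale."""
    return [tuple(scale * t for t in vec) for vec in translations]


rel = [1.0, 2.0, 4.0]   # depths in an arbitrary relative scale
met = [2.0, 4.0, 8.4]   # estimated metric depths of the same pixels
s = metric_scale(rel, met)
scaled = rescale_translations([(0.0, 0.0, 0.0), (0.5, 0.0, 1.0)], s)
print(s, scaled)
```
	</code></pre>
	</div>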
	</div>

	<br>

	<!-- step 3 -->
	<div class="container has-text-centered">
	<h2 class="title has-text-centered is-4 italic">
	Step 3 (Inference): Render a preview video along the camera trajectory in the reconstructed 3D scene.
	</h2>

	<div class="video-container">
	<video autoplay controls muted loop>
	<source src="static/videos/preview_video/preview1.mp4" type="video/mp4" />
	</video>

	<video autoplay controls muted loop>
	<source src="static/videos/preview_video/preview2.mp4" type="video/mp4" />
	</video>

	<video autoplay controls muted loop>
	<source src="static/videos/preview_video/preview3.mp4" type="video/mp4" />
	</video>
	</div>
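	<div class="content has-text-justified">
	<p>
	A single preview frame can be thought of as the point cloud re-projected into a new camera
	pose. The sketch below shows that idea with a basic z-buffered point splat; the renderer
	actually used for the previews is not described on this page, and
	<code>render_preview</code>, its parameters, and the toy scene are hypothetical.
	</p>
	<pre><code>
```python
def render_preview(points, colors, R, t, fx, fy, cx, cy, width, height):
    """Splat colored 3D points into a target view.

    R (3x3) and t (3,) map world coordinates to the target camera.
    Pixels that no point lands on stay None (unseen regions).
    """
    image = [[None] * width for _ in range(height)]
    zbuf = [[float("inf")] * width for _ in range(height)]
    for p, c in zip(points, colors):
        # world to camera: X_cam = R * X_world + t
        x, y, z = (sum(R[i][j] * p[j] for j in range(3)) + t[i]
                   for i in range(3))
        if z > 0:  # point is in front of the camera
            u = round(fx * x / z + cx)
            v = round(fy * y / z + cy)
            # keep the nearest point per pixel
            if u in range(width) and v in range(height) and zbuf[v][u] > z:
                zbuf[v][u] = z
                image[v][u] = c
    return image


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
points = [(0.0, 0.0, 2.0), (0.0, 0.0, 4.0)]
colors = ["near", "far"]
# Source view: the far point hides behind the near one.
front = render_preview(points, colors, identity, (0.0, 0.0, 0.0),
                       2.0, 2.0, 2.0, 0.0, 5, 1)
# Moved camera: parallax separates the two points.
moved = render_preview(points, colors, identity, (2.0, 0.0, 0.0),
                       2.0, 2.0, 2.0, 0.0, 5, 1)
print(front[0], moved[0])
```
	</code></pre>
	<p>
	The <code>None</code> pixels are the regions the source image never observed; only the
	remaining visible pixels can constrain generation.
	</p>
	</div>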
	</div>

	<br>

	<!-- step 4 -->
	<div class="container has-text-centered">
	<h2 class="title has-text-centered is-4 italic">
	Step 4 (Inference): Scene-constrained noise shaping.
	</h2>

	<div class="content has-text-justified">
	During generation, we paste the visible latents of the preview video into the predicted latents.
	We paste only at high noise levels and leave the lower noise levels free to generate
	dynamics. Hence the name "noise shaping": it shapes the noise only in the initial
	high-noise stage.
	</div>
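	<div class="content has-text-justified">
	<p>
	The pasting rule described above can be sketched as follows. This is a simplified
	illustration rather than the released code: latents are flattened to plain lists, the noise
	level is assumed to be normalized to [0, 1], and the function name
	<code>shape_latents</code> and the <code>threshold</code> of 0.5 are assumptions.
	</p>
	<pre><code>
```python
def shape_latents(predicted, preview, visible, noise_level, threshold=0.5):
    """Paste visible preview latents into the prediction at high noise only.

    predicted, preview: flattened latents of equal length.
    visible: per-element flags, True where the preview actually saw content.
    noise_level: assumed normalized so 1.0 is pure noise, 0.0 is clean.
    """
    if noise_level > threshold:
        # High-noise stage: constrain visible regions with the preview.
        return [pv if vis else pr
                for pr, pv, vis in zip(predicted, preview, visible)]
    # Low-noise stage: leave the prediction free so dynamics can emerge.
    return list(predicted)


predicted = [0.1, 0.2, 0.3]
preview = [1.0, 2.0, 3.0]
visible = [True, False, True]
early = shape_latents(predicted, preview, visible, noise_level=0.9)
late = shape_latents(predicted, preview, visible, noise_level=0.2)
print(early, late)
```
	</code></pre>
	<p>
	In the early call the visible entries are overwritten by the preview while the unseen entry
	keeps the model's prediction; in the late call nothing is overwritten.
	</p>
	</div>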

	<div class="video-container">
	<div>
	<video autoplay controls muted loop>
	<source src="static/videos/ablation/ablation1_preview.mp4" type="video/mp4" />
	</video>
	<h2 class="subtitle has-text-centered is-6 italic">
	Preview Video
	</h2>
	</div>

	<div>
	<video autoplay controls muted loop>
	<source src="static/videos/ablation/ablation1_withNoiseShaping.mp4" type="video/mp4" />
	</video>
	<h2 class="subtitle has-text-centered is-6 italic">
	With Scene-Constrained Noise Shaping
	</h2>
	</div>

	<div>
	<video autoplay controls muted loop>
	<source src="static/videos/ablation/ablation1_withoutNoiseShaping.mp4" type="video/mp4" />
	</video>
	<h2 class="subtitle has-text-centered is-6 italic">
	Without Scene-Constrained Noise Shaping
	</h2>
	</div>
	</div>
	</div>
	</div>
	</section>

	<section class="section" id="BibTeX">
	<div class="container is-max-desktop content">
	<h2 class="title">BibTeX</h2>
	<pre><code>
	@article{li2025realcam,
	title={RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control},
	author={Li, Teng and Zheng, Guangcong and Jiang, Rui and Zhan, Shuigen and Wu, Tao and Lu, Yehao and Lin, Yining and Li, Xi},
	journal={arXiv preprint arXiv:2502.10059},
	year={2025},
	}
	</code></pre>
	</div>
	</section>
	</body>

	</html>