Image-to-Video

Image-to-video models take a still image as input and generate a video. These models can be guided by text prompts to influence the content and style of the output video.

Inputs: a still image, plus an optional text prompt (for example, "This penguin is dancing").

Output: a short video generated from the input image by the image-to-video model.

About Image-to-Video

Use Cases

Image-to-video models transform a static image into a video sequence. This can be used for a variety of creative and practical applications.

Animated Images

Bring still photos to life by adding subtle motion or creating short animated clips. This is great for social media content or dynamic presentations.

Storytelling from a Single Frame

Expand on the narrative of an image by generating a short video that imagines what happened before or after the moment captured in the photo.

Video Generation with Visual Consistency

Use an input image as a strong visual anchor to guide the generation of a video, ensuring that the style, characters, or objects in the video remain consistent with the source image.

Controllable Motion

Some image-to-video models let you specify the direction and intensity of motion, or control camera movement, giving fine-grained control over the generated animation.
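
For example, the Stable Video Diffusion pipeline in diffusers exposes a motion_bucket_id argument that scales how much motion appears in the output. A minimal sketch of this kind of control (the input image path is a placeholder):

import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.to("cuda")

# Placeholder path: replace with your own conditioning image.
image = load_image("path/to/input_image.png").resize((1024, 576))

frames = pipe(
    image,
    decode_chunk_size=8,        # decode fewer frames at a time to save VRAM
    motion_bucket_id=180,       # higher values produce more motion
    noise_aug_strength=0.1,     # noise added to the conditioning image
    generator=torch.manual_seed(42),
).frames[0]
export_to_video(frames, "generated.mp4", fps=7)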

Inference

Running the Wan2.1 I2V 14B 720P model (Diffusers-format checkpoint) with diffusers. The input image path below is a placeholder; replace it with your own conditioning image.

import torch
from diffusers import AutoencoderKLWan, WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image
from transformers import CLIPVisionModel

model_id = "Wan-AI/Wan2.1-I2V-14B-720P-Diffusers"
# Keep the image encoder and VAE in float32 for quality; the transformer runs in bfloat16.
image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float32)
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanImageToVideoPipeline.from_pretrained(model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16)
pipe.to("cuda")

# Conditioning image: the first frame the video is generated from (placeholder path).
image = load_image("path/to/input_image.jpg").resize((1280, 720))

prompt = "A cat walks on the grass, realistic"
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"

output = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=720,
    width=1280,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(output, "output.mp4", fps=16)
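
The 14B checkpoint is large; if you run out of GPU memory, you can replace pipe.to("cuda") in the snippet above with diffusers' model offloading, at the cost of slower inference:

# Keep weights on the CPU and move each sub-model to the GPU only while it is in use.
pipe.enable_model_cpu_offload()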

Useful Resources

To train image-to-video LoRAs, check out finetrainers and musubi-tuner.
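
Once trained, LoRA weights can be loaded into the pipeline before running inference. A minimal sketch, assuming a diffusers-compatible LoRA checkpoint and the pipe object from the inference snippet above (the repository id is a placeholder):

# Load image-to-video LoRA weights (placeholder repository id or local path).
pipe.load_lora_weights("your-username/your-i2v-lora")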

Compatible libraries

diffusers

Image-to-Video demo

No example widget is defined for this task.

Note Contribute by proposing a widget for this task!

Models for Image-to-Video
Browse Models (209)

Note LTX-Video, a 13B parameter model for high quality video generation

Note A 14B parameter model for reference image controlled video generation

Note An image-to-video generation model using FramePack F1 methodology with Hunyuan-DiT architecture

Note A distilled version of the LTX-Video-0.9.7-dev model for faster inference

Note An image-to-video generation model by Skywork AI, 14B parameters, producing 720p videos.

Note Image-to-video variant of Tencent's HunyuanVideo.

Note A 14B parameter model for 720p image-to-video generation by Wan-AI.

Note A Diffusers version of the Wan2.1-I2V-14B-720P model for 720p image-to-video generation.

Datasets for Image-to-Video
Browse Datasets (48)

Note A benchmark dataset for reference image controlled video generation.

Note A dataset of videos with captions provided throughout each video.

Spaces using Image-to-Video

Note An application to generate videos fast.

Note Generate videos with FramePack-F1

Note Generate videos with FramePack

Note Wan2.1 with CausVid LoRA

Note A demo for Stable Video Diffusion

Metrics for Image-to-Video
fvd
Fréchet Video Distance (FVD) measures the perceptual similarity between the distributions of generated videos and a set of real videos, assessing overall visual quality and temporal coherence of the video generated from an input image.
clip_score
CLIP Score measures the semantic similarity between a textual prompt (if provided alongside the input image) and the generated video frames. It evaluates how well the video's generated content and motion align with the textual description, conditioned on the initial image.
lpips
First Frame Fidelity, often measured using LPIPS (Learned Perceptual Image Patch Similarity), PSNR, or SSIM, quantifies how closely the first frame of the generated video matches the input conditioning image; a sketch of this measurement follows the metric definitions below.
identity_preservation
Identity Preservation Score measures how consistently an identity from the input image (e.g., a person's face or a specific object's characteristics) is preserved across the generated video frames, often computed using features from specialized models such as face recognition (e.g., ArcFace) or re-identification models.
motion_score
Motion Score evaluates the quality, realism, and temporal consistency of motion in the video generated from a static image. This can be based on optical flow analysis (e.g., smoothness, magnitude), consistency of object trajectories, or specific motion plausibility assessments.
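
A minimal sketch of two of these measurements using torchmetrics, assuming frames is the list of PIL frames returned by a diffusers pipeline, input_image is the conditioning image, and prompt is the optional text prompt: CLIP Score between the prompt and sampled frames, and LPIPS between the input image and the first generated frame (lower LPIPS means higher first-frame fidelity).

import numpy as np
import torch
from torchmetrics.multimodal.clip_score import CLIPScore
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

def to_uint8_tensor(pil_image):
    # PIL image -> (C, H, W) uint8 tensor
    return torch.from_numpy(np.array(pil_image)).permute(2, 0, 1)

def evaluate_generation(frames, input_image, prompt):
    # CLIP Score between the prompt and a few sampled frames (higher is better).
    clip_metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
    sampled = torch.stack([to_uint8_tensor(f) for f in frames[::16]])
    clip_score = clip_metric(sampled, [prompt] * len(sampled)).item()

    # First-frame fidelity via LPIPS (lower is better); normalize=True expects inputs in [0, 1].
    lpips_metric = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)
    first_frame = to_uint8_tensor(frames[0]).unsqueeze(0).float() / 255.0
    reference = to_uint8_tensor(input_image.resize(frames[0].size)).unsqueeze(0).float() / 255.0
    lpips_score = lpips_metric(first_frame, reference).item()

    return clip_score, lpips_score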