|
--- |
|
language: |
|
- en |
|
library_name: transformers |
|
license: apache-2.0 |
|
metrics: |
|
- accuracy |
|
tags: |
|
- multimodal |
|
pipeline_tag: video-text-to-text |
|
model-index: |
|
- name: InternVideo2.5 |
|
results: |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: MLVU |
|
type: mlvu |
|
metrics: |
|
- type: accuracy |
|
value: 72.8 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: MVBench |
|
type: mvbench |
|
metrics: |
|
- type: accuracy |
|
value: 75.7 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: Perception Test |
|
type: percepTest |
|
metrics: |
|
- type: accuracy |
|
value: 74.9 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: LongVideoBench |
|
type: longvideobench |
|
metrics: |
|
- type: accuracy |
|
value: 60.6 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: VideoMME (w/o sub) |
|
type: videomme |
|
metrics: |
|
- type: accuracy |
|
value: 65.1 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: LVBench |
|
type: lvbench |
|
metrics: |
|
- type: accuracy |
|
value: 46.4 |
|
name: accuracy |
|
verified: true |
|
|
|
|
|
--- |
|
|
|
# 📕InternVideo2.5⚡ |
|
<!-- [\[📰 Blog\]](https://internvideo.github.io/blog/2024-12-31-VideoChat-Flash) --> |
|
[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5) |
|
[\[📜 Tech Report\]](https://arxiv.org/abs/2501.12386) |
|
<!-- [\[🗨️ Chat Demo\]](https://huggingface.co/spaces/OpenGVLab/VideoChat-Flash) --> |
|
|
|
InternVideo2.5 is a video multimodal large language model (MLLM, built upon InternVL2.5) enhanced with **long and rich context (LRC) modeling**. It significantly improves upon existing MLLMs by strengthening their ability to perceive fine-grained details and to capture long-form temporal structure. This is achieved through dense vision task annotations with task preference optimization (TPO) and compact spatiotemporal representations via adaptive hierarchical token compression (HiCo).
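To make the idea of token compression concrete, the toy sketch below merges adjacent frames and pools the per-clip tokens so that a long video yields a compact token sequence. This is purely illustrative: it is **not** the HiCo algorithm shipped with the model, and the shapes, window size, and pooling choices are assumptions for demonstration only.

```python
# Toy illustration of spatiotemporal token compression (NOT the actual HiCo
# implementation). All shapes and pooling choices are illustrative assumptions.
import torch
import torch.nn.functional as F

def toy_hierarchical_compress(tokens: torch.Tensor, temporal_window: int = 4, spatial_keep: int = 64) -> torch.Tensor:
    """Compress frame-level visual tokens (T, N, C) into a compact (T // temporal_window * spatial_keep, C) sequence."""
    T, N, C = tokens.shape
    T_trim = (T // temporal_window) * temporal_window
    x = tokens[:T_trim].reshape(T_trim // temporal_window, temporal_window, N, C)
    x = x.mean(dim=1)                                            # level 1: merge adjacent frames into clips
    x = F.adaptive_avg_pool1d(x.transpose(1, 2), spatial_keep)   # level 2: reduce tokens per clip
    return x.transpose(1, 2).reshape(-1, C)                      # flatten clips into one token sequence

# e.g. 128 frames x 256 tokens/frame -> 32 clips x 64 tokens = 2048 tokens
compressed = toy_hierarchical_compress(torch.randn(128, 256, 1024))
print(compressed.shape)  # torch.Size([2048, 1024])
```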
|
|
|
|
|
|
|
|
|
## 📈 Performance |
|
|
|
- Video benchmarks

| Model | MVBench | LongVideoBench | VideoMME (w/o sub) |
| --- | --- | --- | --- |
| InternVideo2.5 | 75.7 | 60.6 | 65.1 |
|
|
|
- Inference Speed |
|
|
|
We measured the average inference speed (tokens/s) of generating 1024 new tokens and 5198 (8192-2998) new tokens, with a video as context (2998 tokens), under BF16 precision. "w/ encoder" means the measured time includes the video encoder, while "w/o encoder" excludes it. A minimal sketch of how such a measurement can be reproduced is given after the table.
|
|
|
| Quantization | Speed (3022 tokens) | Speed (8192 tokens) w/o encoder | Speed (8192 tokens) w/ encoder |
|
|--- |--- |---| ---| |
|
|BF16 | 33.40 | 31.91 | 21.33| |
|
|INT4 | - | 31.95 | 26.37| |
|
|
|
Profiling was performed on a single A800-SXM4-80G GPU with PyTorch 2.4.0 and CUDA 12.1.
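For reference, throughput figures like those above can be approximated by timing a generation call and dividing the number of generated tokens by the elapsed wall-clock time. The sketch below is a minimal example of that measurement, not the official profiling script; it assumes `model`, `tokenizer`, `pixel_values`, `num_patches_list`, and `question` have been prepared as in the usage example further down.

```python
# Rough throughput measurement: tokens/s = generated tokens / elapsed time.
# Assumes the model and inputs are set up as in the usage example below;
# this is not the script used to produce the table above.
import time
import torch

gen_cfg = dict(do_sample=False, max_new_tokens=1024, num_beams=1)

torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    output, _ = model.chat(
        tokenizer, pixel_values, question, gen_cfg,
        num_patches_list=num_patches_list, history=None, return_history=True,
    )
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

num_generated = len(tokenizer(output).input_ids)
print(f"{num_generated / elapsed:.2f} tokens/s")
```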
|
|
|
|
|
## 🚀 How to use the model |
|
|
|
First, you need to install [FlashAttention-2](https://github.com/Dao-AILab/flash-attention) and a few other dependencies. A simple installation example is provided below:
|
``` |
|
pip install transformers==4.40.1 |
|
pip install av |
|
pip install imageio |
|
pip install decord |
|
pip install opencv-python |
|
pip install flash-attn --no-build-isolation |
|
``` |
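Building `flash-attn` can take a while and requires a CUDA toolchain matching your PyTorch build. As an optional sanity check, you can verify that the package imports correctly:

```python
# optional: confirm flash-attn is importable after installation
import flash_attn
print(flash_attn.__version__)
```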
|
Then you can run the model as follows:
|
```python |
|
import numpy as np |
|
import torch |
|
import torchvision.transforms as T |
|
from decord import VideoReader, cpu |
|
from PIL import Image |
|
from torchvision.transforms.functional import InterpolationMode |
|
from transformers import AutoModel, AutoTokenizer |
|
|
|
|
|
# model setting |
|
model_path = 'OpenGVLab/InternVideo2_5_Chat_8B' |
|
|
|
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) |
|
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(torch.bfloat16).cuda()  # load weights in bfloat16 on the GPU
|
|
|
IMAGENET_MEAN = (0.485, 0.456, 0.406) |
|
IMAGENET_STD = (0.229, 0.224, 0.225) |
|
|
|
def build_transform(input_size): |
|
MEAN, STD = IMAGENET_MEAN, IMAGENET_STD |
|
    transform = T.Compose([
        T.Lambda(lambda img: img.convert("RGB") if img.mode != "RGB" else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD),
    ])
|
return transform |
|
|
|
|
|
# choose the tile grid (cols, rows) whose aspect ratio is closest to the input image's
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
|
best_ratio_diff = float("inf") |
|
best_ratio = (1, 1) |
|
area = width * height |
|
for ratio in target_ratios: |
|
target_aspect_ratio = ratio[0] / ratio[1] |
|
ratio_diff = abs(aspect_ratio - target_aspect_ratio) |
|
if ratio_diff < best_ratio_diff: |
|
best_ratio_diff = ratio_diff |
|
best_ratio = ratio |
|
elif ratio_diff == best_ratio_diff: |
|
if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]: |
|
best_ratio = ratio |
|
return best_ratio |
|
|
|
|
|
# split an image into up to `max_num` tiles of size `image_size` x `image_size`, optionally appending a thumbnail
def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
|
orig_width, orig_height = image.size |
|
aspect_ratio = orig_width / orig_height |
|
|
|
# calculate the existing image aspect ratio |
|
    target_ratios = set(
        (i, j)
        for n in range(min_num, max_num + 1)
        for i in range(1, n + 1)
        for j in range(1, n + 1)
        if min_num <= i * j <= max_num
    )
|
target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1]) |
|
|
|
# find the closest aspect ratio to the target |
|
target_aspect_ratio = find_closest_aspect_ratio(aspect_ratio, target_ratios, orig_width, orig_height, image_size) |
|
|
|
# calculate the target width and height |
|
target_width = image_size * target_aspect_ratio[0] |
|
target_height = image_size * target_aspect_ratio[1] |
|
blocks = target_aspect_ratio[0] * target_aspect_ratio[1] |
|
|
|
# resize the image |
|
resized_img = image.resize((target_width, target_height)) |
|
processed_images = [] |
|
for i in range(blocks): |
|
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size,
        )
|
# split the image |
|
split_img = resized_img.crop(box) |
|
processed_images.append(split_img) |
|
assert len(processed_images) == blocks |
|
if use_thumbnail and len(processed_images) != 1: |
|
thumbnail_img = image.resize((image_size, image_size)) |
|
processed_images.append(thumbnail_img) |
|
return processed_images |
|
|
|
|
|
# preprocess a single PIL image into normalized tiles (not used in the video example below, kept for reference)
def load_image(image, input_size=448, max_num=6):
|
transform = build_transform(input_size=input_size) |
|
images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num) |
|
pixel_values = [transform(image) for image in images] |
|
pixel_values = torch.stack(pixel_values) |
|
return pixel_values |
|
|
|
|
|
# uniformly sample `num_segments` frame indices, optionally restricted to a (start, end) time bound in seconds
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
|
if bound: |
|
start, end = bound[0], bound[1] |
|
else: |
|
start, end = -100000, 100000 |
|
start_idx = max(first_idx, round(start * fps)) |
|
end_idx = min(round(end * fps), max_frame) |
|
seg_size = float(end_idx - start_idx) / num_segments |
|
frame_indices = np.array([int(start_idx + (seg_size / 2) + np.round(seg_size * idx)) for idx in range(num_segments)]) |
|
return frame_indices |
|
|
|
# pick the number of frames to sample from the video duration, clamped to [128, 512]
def get_num_frames_by_duration(duration):
|
local_num_frames = 4 |
|
num_segments = int(duration // local_num_frames) |
|
if num_segments == 0: |
|
num_frames = local_num_frames |
|
else: |
|
num_frames = local_num_frames * num_segments |
|
|
|
num_frames = min(512, num_frames) |
|
num_frames = max(128, num_frames) |
|
|
|
return num_frames |
|
|
|
# decode the video, sample frames, and preprocess them into model-ready pixel values
def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32, get_frame_by_duration=False):
|
vr = VideoReader(video_path, ctx=cpu(0), num_threads=1) |
|
max_frame = len(vr) - 1 |
|
fps = float(vr.get_avg_fps()) |
|
|
|
pixel_values_list, num_patches_list = [], [] |
|
transform = build_transform(input_size=input_size) |
|
if get_frame_by_duration: |
|
duration = max_frame / fps |
|
num_segments = get_num_frames_by_duration(duration) |
|
frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments) |
|
for frame_index in frame_indices: |
|
img = Image.fromarray(vr[frame_index].asnumpy()).convert("RGB") |
|
img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num) |
|
pixel_values = [transform(tile) for tile in img] |
|
pixel_values = torch.stack(pixel_values) |
|
num_patches_list.append(pixel_values.shape[0]) |
|
pixel_values_list.append(pixel_values) |
|
pixel_values = torch.cat(pixel_values_list) |
|
return pixel_values, num_patches_list |
|
|
|
# evaluation setting |
|
max_num_frames = 512 |
|
# greedy decoding: temperature and top_p have no effect when do_sample=False
generation_config = dict(
    do_sample=False,
    temperature=0.0,
    max_new_tokens=1024,
    top_p=0.1,
    num_beams=1,
)
|
video_path = "your_video.mp4" |
|
num_segments=128 |
|
|
|
|
|
with torch.no_grad(): |
|
|
|
pixel_values, num_patches_list = load_video(video_path, num_segments=num_segments, max_num=1, get_frame_by_duration=False) |
|
pixel_values = pixel_values.to(torch.bfloat16).to(model.device) |
|
video_prefix = "".join([f"Frame{i+1}: <image>\n" for i in range(len(num_patches_list))]) |
|
# single-turn conversation |
|
question1 = "Describe this video in detail." |
|
question = video_prefix + question1 |
|
output1, chat_history = model.chat(tokenizer, pixel_values, question, generation_config, num_patches_list=num_patches_list, history=None, return_history=True) |
|
print(output1) |
|
|
|
# multi-turn conversation |
|
question2 = "How many people appear in the video?" |
|
    output2, chat_history = model.chat(tokenizer, pixel_values, question2, generation_config, num_patches_list=num_patches_list, history=chat_history, return_history=True)
|
|
|
print(output2) |
|
``` |
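The INT4 row in the speed table above refers to 4-bit quantized weights. One way to experiment with this is `bitsandbytes` quantization in `transformers`; whether the remote InternVideo2.5 code path is fully compatible with it is an assumption here, and the official INT4 numbers may have been obtained differently.

```python
# Hedged sketch: load 4-bit quantized weights via bitsandbytes
# (requires `pip install bitsandbytes accelerate`; compatibility with the
# custom remote code is assumed, not verified).
import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

model_path = "OpenGVLab/InternVideo2_5_Chat_8B"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto",
)
```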
|
|
|
## ✏️ Citation |
|
|
|
```bibtex |
|
|
|
@article{wang2025internvideo, |
|
title={InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling}, |
|
author={Wang, Yi and Li, Xinhao and Yan, Ziang and He, Yinan and Yu, Jiashuo and Zeng, Xiangyu and Wang, Chenting and Ma, Changlian and Huang, Haian and Gao, Jianfei and Dou, Min and Chen, Kai and Wang, Wenhai and Qiao, Yu and Wang, Yali and Wang, Limin}, |
|
journal={arXiv preprint arXiv:2501.12386}, |
|
year={2025} |
|
} |
|
``` |