|
--- |
|
language: |
|
- en |
|
library_name: transformers |
|
license: apache-2.0 |
|
metrics: |
|
- accuracy |
|
tags: |
|
- multimodal |
|
pipeline_tag: video-text-to-text |
|
model-index: |
|
- name: InternVideo2.5 |
|
results: |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: MLVU |
|
type: mlvu |
|
metrics: |
|
- type: accuracy |
|
value: 72.8 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: MVBench |
|
type: mvbench |
|
metrics: |
|
- type: accuracy |
|
value: 75.7 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: Perception Test |
|
type: percepTest |
|
metrics: |
|
- type: accuracy |
|
value: 74.9 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: LongVideoBench |
|
type: longvideobench |
|
metrics: |
|
- type: accuracy |
|
value: 60.6 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: VideoMME (w/o sub) |
|
type: videomme |
|
metrics: |
|
- type: accuracy |
|
value: 65.1 |
|
name: accuracy |
|
verified: true |
|
- task: |
|
type: multimodal |
|
dataset: |
|
name: LVBench |
|
type: lvbench |
|
metrics: |
|
- type: accuracy |
|
value: 46.4 |
|
name: accuracy |
|
verified: true |
|
|
|
|
|
--- |
|
|
|
# 📕InternVideo2.5⚡ |
|
<!-- [\[📰 Blog\]](https://internvideo.github.io/blog/2024-12-31-VideoChat-Flash) --> |
|
[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2.5) |
|
[\[📜 Tech Report\]](https://arxiv.org/abs/2501.12386) |
|
<!-- [\[🗨️ Chat Demo\]](https://huggingface.co/spaces/OpenGVLab/VideoChat-Flash) --> |
|
|
|
InternVideo2.5 is a video multimodal large language model (MLLM, built upon InternVL2.5) enhanced with **long and rich context (LRC) modeling**. It significantly improves upon existing MLLMs by strengthening their ability to perceive fine-grained details and to capture long-form temporal structure. This is achieved through dense vision task annotations with task preference optimization (TPO) and compact spatiotemporal representations via adaptive hierarchical token compression (HiCo).
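To make the idea of token compression concrete, the toy sketch below merges adjacent frames and pools the per-clip tokens so that a long video yields a compact token sequence. This is purely illustrative: it is **not** the HiCo algorithm shipped with the model, and the shapes, window size, and pooling choices are assumptions for demonstration only.

```python
# Toy illustration of spatiotemporal token compression (NOT the actual HiCo
# implementation). All shapes and pooling choices are illustrative assumptions.
import torch
import torch.nn.functional as F

def toy_hierarchical_compress(tokens: torch.Tensor, temporal_window: int = 4, spatial_keep: int = 64) -> torch.Tensor:
    """Compress frame-level visual tokens (T, N, C) into a compact (T // temporal_window * spatial_keep, C) sequence."""
    T, N, C = tokens.shape
    T_trim = (T // temporal_window) * temporal_window
    x = tokens[:T_trim].reshape(T_trim // temporal_window, temporal_window, N, C)
    x = x.mean(dim=1)                                            # level 1: merge adjacent frames into clips
    x = F.adaptive_avg_pool1d(x.transpose(1, 2), spatial_keep)   # level 2: reduce tokens per clip
    return x.transpose(1, 2).reshape(-1, C)                      # flatten clips into one token sequence

# e.g. 128 frames x 256 tokens/frame -> 32 clips x 64 tokens = 2048 tokens
compressed = toy_hierarchical_compress(torch.randn(128, 256, 1024))
print(compressed.shape)  # torch.Size([2048, 1024])
```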
|
|
|
|
|
|
|
|
|
## 📈 Performance |
|
|
|
- Video benchmarks

| Model | MVBench | LongVideoBench | VideoMME (w/o sub) |
| --- | --- | --- | --- |
| InternVideo2.5 | 75.7 | 60.6 | 65.1 |
|
|
|
- Inference Speed |
|
|
|
We measured the average inference speed (tokens/s) of generating 1024 new tokens and 5198 (8192-2998) new tokens, with a video as context (2998 tokens), under BF16 precision. "w/ encoder" means the measured time includes the video encoder, while "w/o encoder" excludes it. A minimal sketch of how such a measurement can be reproduced is given after the table.
|
|
|
| Quantization | Speed (3022 tokens) | Speed (8192 tokens) w/o encoder | Speed (8192 tokens) w/ encoder |
|
|--- |--- |---| ---| |
|
|BF16 | 33.40 | 31.91 | 21.33| |
|
|INT4 | - | 31.95 | 26.37| |
|
|
|
Profiling was performed on a single A800-SXM4-80G GPU with PyTorch 2.4.0 and CUDA 12.1.
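For reference, throughput figures like those above can be approximated by timing a generation call and dividing the number of generated tokens by the elapsed wall-clock time. The sketch below is a minimal example of that measurement, not the official profiling script; it assumes `model`, `tokenizer`, `pixel_values`, `num_patches_list`, and `question` have been prepared as in the usage example further down.

```python
# Rough throughput measurement: tokens/s = generated tokens / elapsed time.
# Assumes the model and inputs are set up as in the usage example below;
# this is not the script used to produce the table above.
import time
import torch

gen_cfg = dict(do_sample=False, max_new_tokens=1024, num_beams=1)

torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    output, _ = model.chat(
        tokenizer, pixel_values, question, gen_cfg,
        num_patches_list=num_patches_list, history=None, return_history=True,
    )
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

num_generated = len(tokenizer(output).input_ids)
print(f"{num_generated / elapsed:.2f} tokens/s")
```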
|
|
|
|
|
## 🚀 How to use the model |
|
|
|
First, you need to install [FlashAttention-2](https://github.com/Dao-AILab/flash-attention) and a few other dependencies. A simple installation example is provided below:
|
``` |
|
pip install transformers==4.40.1 |
|
pip install av |
|
pip install imageio |
|
pip install decord |
|
pip install opencv-python |
|
pip install flash-attn --no-build-isolation |
|
``` |
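Building `flash-attn` can take a while and requires a CUDA toolchain matching your PyTorch build. As an optional sanity check, you can verify that the package imports correctly:

```python
# optional: confirm flash-attn is importable after installation
import flash_attn
print(flash_attn.__version__)
```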
|
Then you can run the model as follows:
|
```python |
|
import numpy as np |
|
import torch |
|
import torchvision.transforms as T |
|
from decord import VideoReader, cpu |
|
from PIL import Image |
|
from torchvision.transforms.functional import InterpolationMode |
|
from transformers import AutoModel, AutoTokenizer |
|
|
|
|
|
# model setting |
|
model_path = 'OpenGVLab/InternVideo2_5_Chat_8B' |
|
|
|
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True) |
|
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).to(torch.bfloat16).cuda()  # load weights in bfloat16 on the GPU
|
|
|
IMAGENET_MEAN = (0.485, 0.456, 0.406) |
|
IMAGENET_STD = (0.229, 0.224, 0.225) |
|
|
|
def build_transform(input_size): |
|
MEAN, STD = IMAGENET_MEAN, IMAGENET_STD |
|
    transform = T.Compose([
        T.Lambda(lambda img: img.convert("RGB") if img.mode != "RGB" else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD),
    ])
|
return transform |
|
|
|
|
|
# choose the tile grid (cols, rows) whose aspect ratio is closest to the input image's
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
|
best_ratio_diff = float("inf") |
|
best_ratio = (1, 1) |
|
area = width * height |
|
for ratio in target_ratios: |
|
target_aspect_ratio = ratio[0] / ratio[1] |
|
ratio_diff = abs(aspect_ratio - target_aspect_ratio) |
|
if ratio_diff < best_ratio_diff: |
|
best_ratio_diff = ratio_diff |
|
best_ratio = ratio |
|
elif ratio_diff == best_ratio_diff: |
|
if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]: |
|
best_ratio = ratio |
|
return best_ratio |
|
|
|
|
|
# split an image into up to `max_num` tiles of size `image_size` x `image_size`, optionally appending a thumbnail
def dynamic_preprocess(image, min_num=1, max_num=6, image_size=448, use_thumbnail=False):
|
orig_width, orig_height = image.size |
|
aspect_ratio = orig_width / orig_height |
|
|
|
# calculate the existing image aspect ratio |
|
    target_ratios = set(
        (i, j)
        for n in range(min_num, max_num + 1)
        for i in range(1, n + 1)
        for j in range(1, n + 1)
        if min_num <= i * j <= max_num
    )
|
target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1]) |
|
|
|
# find the closest aspect ratio to the target |
|
target_aspect_ratio = find_closest_aspect_ratio(aspect_ratio, target_ratios, orig_width, orig_height, image_size) |
|
|
|
# calculate the target width and height |
|
target_width = image_size * target_aspect_ratio[0] |
|
target_height = image_size * target_aspect_ratio[1] |
|
blocks = target_aspect_ratio[0] * target_aspect_ratio[1] |
|
|
|
# resize the image |
|
resized_img = image.resize((target_width, target_height)) |
|
processed_images = [] |
|
for i in range(blocks): |
|
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size,
        )
|
# split the image |
|
split_img = resized_img.crop(box) |
|
processed_images.append(split_img) |
|
assert len(processed_images) == blocks |
|
if use_thumbnail and len(processed_images) != 1: |
|
thumbnail_img = image.resize((image_size, image_size)) |
|
processed_images.append(thumbnail_img) |
|
return processed_images |
|
|
|
|
|
# preprocess a single PIL image into normalized tiles (not used in the video example below, kept for reference)
def load_image(image, input_size=448, max_num=6):
|
transform = build_transform(input_size=input_size) |
|
images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num) |
|
pixel_values = [transform(image) for image in images] |
|
pixel_values = torch.stack(pixel_values) |
|
return pixel_values |
|
|
|
|
|
# uniformly sample `num_segments` frame indices, optionally restricted to a (start, end) time bound in seconds
def get_index(bound, fps, max_frame, first_idx=0, num_segments=32):
|
if bound: |
|
start, end = bound[0], bound[1] |
|
else: |
|
start, end = -100000, 100000 |
|
start_idx = max(first_idx, round(start * fps)) |
|
end_idx = min(round(end * fps), max_frame) |
|
seg_size = float(end_idx - start_idx) / num_segments |
|
frame_indices = np.array([int(start_idx + (seg_size / 2) + np.round(seg_size * idx)) for idx in range(num_segments)]) |
|
return frame_indices |
|
|
|
# pick the number of frames to sample from the video duration, clamped to [128, 512]
def get_num_frames_by_duration(duration):
|
local_num_frames = 4 |
|
num_segments = int(duration // local_num_frames) |
|
if num_segments == 0: |
|
num_frames = local_num_frames |
|
else: |
|
num_frames = local_num_frames * num_segments |
|
|
|
num_frames = min(512, num_frames) |
|
num_frames = max(128, num_frames) |
|
|
|
return num_frames |
|
|
|
# decode the video, sample frames, and preprocess them into model-ready pixel values
def load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32, get_frame_by_duration=False):
|
vr = VideoReader(video_path, ctx=cpu(0), num_threads=1) |
|
max_frame = len(vr) - 1 |
|
fps = float(vr.get_avg_fps()) |
|
|
|
pixel_values_list, num_patches_list = [], [] |
|
transform = build_transform(input_size=input_size) |
|
if get_frame_by_duration: |
|
duration = max_frame / fps |
|
num_segments = get_num_frames_by_duration(duration) |
|
frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments) |
|
for frame_index in frame_indices: |
|
img = Image.fromarray(vr[frame_index].asnumpy()).convert("RGB") |
|
img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num) |
|
pixel_values = [transform(tile) for tile in img] |
|
pixel_values = torch.stack(pixel_values) |
|
num_patches_list.append(pixel_values.shape[0]) |
|
pixel_values_list.append(pixel_values) |
|
pixel_values = torch.cat(pixel_values_list) |
|
return pixel_values, num_patches_list |
|
|
|
# evaluation setting |
|
max_num_frames = 512 |
|
# greedy decoding: temperature and top_p have no effect when do_sample=False
generation_config = dict(
    do_sample=False,
    temperature=0.0,
    max_new_tokens=1024,
    top_p=0.1,
    num_beams=1,
)
|
video_path = "your_video.mp4" |
|
num_segments=128 |
|
|
|
|
|
with torch.no_grad(): |
|
|
|
pixel_values, num_patches_list = load_video(video_path, num_segments=num_segments, max_num=1, get_frame_by_duration=False) |
|
pixel_values = pixel_values.to(torch.bfloat16).to(model.device) |
|
video_prefix = "".join([f"Frame{i+1}: <image>\n" for i in range(len(num_patches_list))]) |
|
# single-turn conversation |
|
question1 = "Describe this video in detail." |
|
question = video_prefix + question1 |
|
output1, chat_history = model.chat(tokenizer, pixel_values, question, generation_config, num_patches_list=num_patches_list, history=None, return_history=True) |
|
print(output1) |
|
|
|
# multi-turn conversation |
|
question2 = "How many people appear in the video?" |
|
    output2, chat_history = model.chat(tokenizer, pixel_values, question2, generation_config, num_patches_list=num_patches_list, history=chat_history, return_history=True)
|
|
|
print(output2) |
|
``` |
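The INT4 row in the speed table above refers to 4-bit quantized weights. One way to experiment with this is `bitsandbytes` quantization in `transformers`; whether the remote InternVideo2.5 code path is fully compatible with it is an assumption here, and the official INT4 numbers may have been obtained differently.

```python
# Hedged sketch: load 4-bit quantized weights via bitsandbytes
# (requires `pip install bitsandbytes accelerate`; compatibility with the
# custom remote code is assumed, not verified).
import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

model_path = "OpenGVLab/InternVideo2_5_Chat_8B"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto",
)
```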
|
|
|
## ✏️ Citation |
|
|
|
```bibtex |
|
|
|
@article{wang2025internvideo, |
|
title={InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling}, |
|
author={Wang, Yi and Li, Xinhao and Yan, Ziang and He, Yinan and Yu, Jiashuo and Zeng, Xiangyu and Wang, Chenting and Ma, Changlian and Huang, Haian and Gao, Jianfei and Dou, Min and Chen, Kai and Wang, Wenhai and Qiao, Yu and Wang, Yali and Wang, Limin}, |
|
journal={arXiv preprint arXiv:2501.12386}, |
|
year={2025} |
|
} |
|
``` |