Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation
Model Summary
This model, named Sparrow, was presented in the paper "Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation". It builds upon the success of Multimodal Large Language Models (MLLMs) in vision understanding, specifically addressing the challenge of data efficiency in video-LLMs.
The paper revisits scaling with synthetic data and develops video-LLMs from a data-centric perspective. Preliminary experiments revealed low learning efficiency when simply scaling up the number of video training samples, which was attributed to a lack of instruction diversity.
Key Highlights:
- Data Augmentation Method (Sparrow): Proposes a novel data augmentation method that synthesizes video-like samples from pure text instruction data (a minimal sketch of the idea follows this list).
- Efficient Training: Mixing these synthetic samples with real video data enables a more efficient training scheme, achieving performance comparable to or even superior to baselines trained with significantly more samples.
- Long Video Understanding: Demonstrates that incorporating these synthetic samples can enhance the performance of long video understanding without requiring explicit training on long video data.
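The core idea is to render pure-text instruction data into a sequence of image "frames" so that text-only samples resemble video input. The snippet below is a minimal, illustrative sketch of such a rendering step, assuming PIL with its default font; the chunking strategy, frame count, and frame size are arbitrary choices for demonstration and do not reflect the exact pipeline used in the paper.
# Illustrative sketch only: render a long text into pseudo-video frames.
# Assumptions: PIL default font, fixed frame size, naive word chunking.
from PIL import Image, ImageDraw
import textwrap

def text_to_frames(text, num_frames=8, size=(448, 448)):
    """Split a long text into chunks and render each chunk as one image 'frame'."""
    words = text.split()
    chunk_len = max(1, len(words) // num_frames)
    chunks = [" ".join(words[i:i + chunk_len]) for i in range(0, len(words), chunk_len)]
    frames = []
    for chunk in chunks[:num_frames]:
        img = Image.new("RGB", size, color="white")
        draw = ImageDraw.Draw(img)
        draw.text((10, 10), textwrap.fill(chunk, width=60), fill="black")
        frames.append(img)
    return frames

# A text-only instruction sample then becomes a "video-like" training sample:
# frames = text_to_frames(long_text_context)
# messages = [{'role': 'user', 'content': frames + [instruction]}]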
The video-LLM is fine-tuned from the image-LLM MiniCPM-Llama3-V-2_5.
How to Use
You can use the Sparrow model with the transformers library. For more detailed instructions on video loading (e.g., extracting frames) and advanced usage scenarios, please refer to the project's GitHub repository.
First, ensure you have the necessary dependencies installed:
pip install transformers torch accelerate
pip install -U flash-attn --no-build-isolation # For efficient training and inference
Here's a quick example to get started with inference using an image as a proxy for video frames. In a full video-LLM setup, you would process a sequence of frames from a video.
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
import torch
from PIL import Image
import requests
from io import BytesIO
# The model is fine-tuned from openbmb/MiniCPM-Llama3-V-2_5
model_id = "openbmb/MiniCPM-Llama3-V-2_5"
# Load model, tokenizer, and processor
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="cuda", trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model.eval()
# --- Example: Using a single image as a proxy for a video frame ---
# For a real video-text-to-text task, you would typically:
# 1. Load a video using libraries like `decord` or `imageio`.
# 2. Extract a sequence of representative frames from the video.
# 3. Pass these frames as a list of PIL Images to the processor.
# For demonstration, we use a single dummy image:
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
response = requests.get(image_url)
image_frame = Image.open(BytesIO(response.content)).convert("RGB")
# Prepare the conversation input with image and text
# For video input, 'content' would be a list like: [frame1, frame2, ..., question_text]
question = "Describe the scene shown in this image in detail."
messages = [{'role': 'user', 'content': [image_frame, question]}]
# Process inputs
inputs = processor(messages, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
# Generate text response
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        repetition_penalty=1.05,
    )
response_text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response_text)
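For real video input, the single proxy image above would be replaced by a list of sampled frames. Below is a minimal sketch of uniform frame sampling with decord (mentioned in the comments above); it assumes decord is installed (pip install decord), and the frame count is illustrative rather than the setting used for Sparrow.
# Minimal sketch: uniformly sample frames from a video with decord.
# Assumption: `decord` is installed; num_frames is an illustrative choice.
from decord import VideoReader, cpu
from PIL import Image
import numpy as np

def sample_frames(video_path, num_frames=8):
    vr = VideoReader(video_path, ctx=cpu(0))
    # Pick `num_frames` evenly spaced frame indices across the whole video.
    indices = np.linspace(0, len(vr) - 1, num_frames, dtype=int)
    frames = vr.get_batch(indices).asnumpy()  # (num_frames, H, W, 3) uint8 array
    return [Image.fromarray(f) for f in frames]

# The sampled frames then replace the single proxy image in the message content:
# messages = [{'role': 'user', 'content': sample_frames("video.mp4") + [question]}]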
License
Model License
- The code in this repo is released under the Apache-2.0 License.
- The usage of MiniCPM-V series model weights must strictly follow MiniCPM Model License.md.
- The models and weights of MiniCPM are completely free for academic research. After filling out a registration questionnaire, they are also available for free commercial use.
Statement
- As an LLM, MiniCPM-Llama3-V 2.5 generates content by learning from a large amount of text, but it cannot comprehend, express personal opinions, or make value judgments. Anything generated by MiniCPM-Llama3-V 2.5 does not represent the views and positions of the model developers.
- We will not be liable for any problems arising from the use of the MiniCPM-V open-source model, including but not limited to data security issues, risks of public opinion, or any risks and problems arising from the misguidance, misuse, or dissemination of the model.
Training dataset
- 10K video instruction samples from Video-ChatGPT
- 10K video caption samples from ShareGemini
- 10K synthetic samples derived from long-text instruction data