---
license: cc-by-sa-4.0
---
# ViMUL: A Culturally-diverse Multilingual Multimodal Video Model

[![🤗 Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Model-blue)](https://huggingface.co/MBZUAI/ViMUL)
[![📄 Paper](https://img.shields.io/badge/📄-Paper-red)](https://huggingface.co/papers/2506.07032)
[![🌐 Project Page](https://img.shields.io/badge/🌐-Project%20Page-green)](https://mbzuai-oryx.github.io/ViMUL/)
[![📊 Benchmark](https://img.shields.io/badge/📊-ViMUL--Bench-orange)](https://huggingface.co/datasets/MBZUAI/ViMUL-Bench)

## Overview
ViMUL is a multilingual video Large Multimodal Model (LMM) designed to balance performance between high- and low-resource languages in video understanding. It is trained on a machine-translated multilingual video training set of 1.2 million samples and shows improved performance on culturally diverse video content across multiple languages.

## Key Features
- **🌍 Multilingual Support:** Optimized for 14 languages, covering both high- and low-resource languages
- **🎥 Video Understanding:** Specialized for multimodal video analysis and description
- **🎭 Cultural Awareness:** Enhanced understanding of culturally diverse content
- **⚖️ Balanced Performance:** Better tradeoff between performance on high- and low-resource languages

## Model Details
- **Base Architecture:** LLaVA-NeXT with Qwen backbone
- **Training Data:** 1.2M machine-translated multilingual video samples
- **Supported Languages:** English, Chinese, Spanish, French, German, Hindi, Arabic, Russian, Bengali, Urdu, Sinhala, Tamil, Swedish, Japanese
- **Input Modalities:** Video | Image | Text
- **Output:** Text descriptions and analysis

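## Requirements

```bash
git clone https://github.com/LLaVA-VL/LLaVA-NeXT.git
# Install from the cloned directory; the leading ./ makes pip treat it as a local path
pip install ./LLaVA-NeXT
# The inference example below reads video frames with decord
pip install decord
```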
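
Before running the inference example, it can help to confirm that the modules it imports are available in the current environment. The following is a minimal sketch and not part of ViMUL or LLaVA-NeXT: `missing_modules` is a hypothetical helper, and the module list is simply taken from the import statements of the inference script in this README.

```python
import importlib.util

# Top-level modules imported by the inference example in this README.
REQUIRED_MODULES = ["torch", "numpy", "transformers", "decord", "llava"]

def missing_modules(names):
    """Return the names in `names` that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

If `llava` is reported as missing, the LLaVA-NeXT install step most likely did not complete in the active environment.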

## Inference

Example video inference:

```python
import torch
import numpy as np
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path, KeywordsStoppingCriteria
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates, SeparatorStyle
from decord import VideoReader, cpu

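def load_video(video_path, num_frames=32, force_sample=False):
    """Decode a video and return the sampled frames as a NumPy array."""
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    # Sample roughly one frame per second; keep the step at least 1 for low frame rates.
    fps = max(1, round(vr.get_avg_fps()))
    frame_idx = list(range(0, total_frame_num, fps))
    # Fall back to uniform sampling when there are too many frames or sampling is forced.
    if len(frame_idx) > num_frames or force_sample:
        uniform_sampled_frames = np.linspace(0, total_frame_num - 1, num_frames, dtype=int)
        frame_idx = uniform_sampled_frames.tolist()
    frames = vr.get_batch(frame_idx).asnumpy()
    return frames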

def infer(
    model_path,
    video_path,
    prompt,
    model_base=None,
    conv_mode=None,
    num_frames=32,
    force_sample=False,
    load_8bit=False,
    device="cuda"
):
    model_name = get_model_name_from_path(model_path)+"llava_qwen" # For llava internal checks and proper loading
    tokenizer, model, image_processor, context_len = load_pretrained_model(
        model_path, model_base, model_name, load_8bit=load_8bit
    )
    frames = load_video(video_path, num_frames=num_frames, force_sample=force_sample)
    video = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"].half().to(device)
    video = [video]

    qs = DEFAULT_IMAGE_TOKEN + "\n" + prompt
    conv = conv_templates[conv_mode].copy() if conv_mode else conv_templates["default"].copy()
    conv.append_message(conv.roles[0], qs)
    conv.append_message(conv.roles[1], None)
    prompt_str = conv.get_prompt()

    input_ids = tokenizer_image_token(prompt_str, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id

    attention_masks = input_ids.ne(tokenizer.pad_token_id).long().to(device)
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

    with torch.inference_mode():
        output_ids = model.generate(
            inputs=input_ids,
            images=video,
            attention_mask=attention_masks,
            modalities="video",
            do_sample=False,
            temperature=0.0,
            max_new_tokens=1024,
            top_p=0.1,
            num_beams=1,
            use_cache=True,
            stopping_criteria=[stopping_criteria]
        )
    outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
    if outputs.endswith(stop_str):
        outputs = outputs[:-len(stop_str)]
    return outputs.strip()

if __name__ == "__main__":
    model_path = "MBZUAI/ViMUL"
    video_path = "LLaVA-NeXT/playground/demo/xU25MMA2N4aVtYay.mp4"
    prompt = "Describe what happens in the video."
    conv_mode = "qwen_1_5"
    output = infer(model_path, video_path, prompt, conv_mode=conv_mode)
    print("\n")
    print("="*40)
    print("Output:", output)
    print("="*40)
```
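
The frame selection in `load_video` can be examined without decoding a video. The sketch below reproduces only the index logic: it takes roughly one frame per second, then switches to `num_frames` evenly spaced indices when that yields too many frames or when `force_sample` is set. `select_frame_indices` is a hypothetical helper written for illustration, and the `max(1, ...)` guard on the step is an added safeguard for very low frame rates.

```python
import numpy as np

def select_frame_indices(total_frames, avg_fps, num_frames=32, force_sample=False):
    """Mirror the index selection in load_video, without decoding any video."""
    step = max(1, round(avg_fps))  # added guard: avoid a zero step at very low frame rates
    frame_idx = list(range(0, total_frames, step))  # roughly one frame per second
    if len(frame_idx) > num_frames or force_sample:
        # Too many candidates (or sampling forced): spread num_frames evenly over the clip.
        frame_idx = np.linspace(0, total_frames - 1, num_frames, dtype=int).tolist()
    return frame_idx

short_clip = select_frame_indices(total_frames=300, avg_fps=30)   # 10-second clip
long_clip = select_frame_indices(total_frames=3000, avg_fps=30)   # 100-second clip
print(len(short_clip), short_clip[:3])             # 10 [0, 30, 60]
print(len(long_clip), long_clip[0], long_clip[-1])  # 32 0 2999
```

A 10-second clip at 30 fps keeps its ten one-per-second frames, while a 100-second clip is reduced to 32 evenly spaced frames, so the number of frames passed to the model never exceeds `num_frames`.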

## Citation
```bibtex
@misc{shafique2025culturallydiversemultilingualmultimodalvideo,
      title={A Culturally-diverse Multilingual Multimodal Video Benchmark & Model}, 
      author={Bhuiyan Sanjid Shafique and Ashmal Vayani and Muhammad Maaz and Hanoona Abdul Rasheed and Dinura Dissanayake and Mohammed Irfan Kurpath and Yahya Hmaiti and Go Inoue and Jean Lahoud and Md. Safirur Rashid and Shadid Intisar Quasem and Maheen Fatima and Franco Vidal and Mykola Maslych and Ketan Pravin More and Sanoojan Baliah and Hasindri Watawana and Yuhao Li and Fabian Farestam and Leon Schaller and Roman Tymtsiv and Simon Weber and Hisham Cholakkal and Ivan Laptev and Shin'ichi Satoh and Michael Felsberg and Mubarak Shah and Salman Khan and Fahad Shahbaz Khan},
      year={2025},
      eprint={2506.07032},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.07032}, 
}
```