---
license: mit
datasets:
- Tevatron/bge-ir
- Tevatron/wiki-ss-nq-new
- Tevatron/pixmo-docs
- Tevatron/colpali
- Tevatron/msrvtt
- Tevatron/audiocaps
- Tevatron/multivent
base_model:
- Tevatron/OmniEmbed-v0.1
pipeline_tag: visual-document-retrieval
library_name: peft
---
# Tevatron/OmniEmbed-v0.1

**OmniEmbed** is a powerful multi-modal embedding model built on [Qwen2.5-Omni-7B](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) using our [Tevatron](https://github.com/texttron/tevatron/) toolkit, a unified document retrieval toolkit across scale, language, and modality.
OmniEmbed generates unified embeddings across multilingual text, images, audio, and video, enabling effective cross-modal retrieval for diverse applications. [Paper](https://arxiv.org/pdf/2505.02466v1).

**OmniEmbed-multivent** is further fine-tuned from OmniEmbed for video retrieval, improving performance on joint video, audio, and text input.

OmniEmbed-multivent achieves state-of-the-art performance on the MAGMaR 2025 shared task on [MultiVENT-2.0](https://huggingface.co/datasets/hltcoe/MultiVENT2.0), a large-scale, multilingual, event-centric video retrieval benchmark featuring a collection of more than 218,000 news videos.

📝 Text 🖼️ Image 🎧 Audio 🎥 Video 🌐 Multilingual

## Evaluation Results

|     | Modality                         | Model                  | nDCG@10   | AP        | nDCG      | RR        | R@10      |
|-----|----------------------------------|------------------------|-----------|-----------|-----------|-----------|-----------|
|     | **Official Baselines**           |                        |           |           |           |           |           |
|     | All                              | VAST                   | 0.116     | 0.08      | 0.115     | 0.198     | 0.118     |
|     | OCR                              | ICDAR OCR → CLIP       | 0.217     | 0.166     | 0.288     | 0.363     | 0.227     |
|     | ASR                              | Whisper ASR            | 0.267     | 0.212     | 0.336     | 0.417     | 0.29      |
|     | Vision (key frame)               | CLIP                   | 0.304     | 0.261     | 0.435     | 0.429     | 0.333     |
|     | All                              | LanguageBind           | 0.324     | 0.283     | 0.452     | 0.443     | 0.355     |
|     | **Zero-Shot**                    |                        |           |           |           |           |           |
| (a) | text, ASR                        | DRAMA                  | 0.629     | 0.576     | 0.693     | 0.749     | 0.649     |
| (b) | text, ASR                        | OmniEmbed              | 0.377     | 0.329     | 0.453     | 0.493     | 0.403     |
| (c) | text, ASR, Vision (video), Audio | OmniEmbed              | 0.595     | 0.537     | 0.673     | 0.732     | 0.616     |
|     | **Trained on MultiVENT 2.0 Training Set** |               |           |           |           |           |           |
| (d) | text, ASR                        | OmniEmbedMultivent     | 0.710     | 0.673     | 0.772     | 0.808     | 0.734     |
| (f) | Vision (video), Audio            | OmniEmbedMultivent     | 0.709     | 0.665     | 0.776     | 0.822     | 0.724     |
| (h) | text, ASR, Vision (video), Audio | **OmniEmbedMultivent** | **0.753** | **0.769** | **0.807** | **0.848** | **0.715** |
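
For readers unfamiliar with the column headers, the sketch below shows how these ranking metrics are commonly computed for a single query under binary relevance. It is an illustration only, not the official MAGMaR scoring script: the function names, the video IDs, and the binary-relevance assumption are all hypothetical, and the shared task's evaluator may differ (for example in ranking depth or relevance grading).

```python
import math


def reciprocal_rank(relevant, ranking):
    """RR: 1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(relevant, ranking, k=10):
    """R@k: fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranking[:k] if doc_id in relevant)
    return hits / len(relevant)


def average_precision(relevant, ranking):
    """AP: mean of precision values taken at each relevant item's rank."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def ndcg_at_k(relevant, ranking, k=10):
    """nDCG@k with binary gains: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranking[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical query with two relevant videos, retrieved at ranks 2 and 5.
relevant = {"v2", "v5"}
ranking = ["v1", "v2", "v3", "v4", "v5"]

print(reciprocal_rank(relevant, ranking))        # 0.5
print(recall_at_k(relevant, ranking, k=10))      # 1.0
print(average_precision(relevant, ranking))      # 0.45
print(round(ndcg_at_k(relevant, ranking), 3))    # 0.624
```

Benchmark scores such as those in the table are the mean of these per-query values over all queries.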

---

### Usage
```python
# Import libraries, then load the model and processor
import torch
from transformers import AutoProcessor, Qwen2_5OmniThinkerForConditionalGeneration
from qwen_omni_utils import process_mm_info

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")
model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
    'Tevatron/OmniEmbed-v0.1',
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16
).to(device).eval()

# Pad on the left so the final token of every sequence sits at position -1
processor.tokenizer.padding_side = "left"
model.padding_side = "left"

# Encode one chat-style message into an L2-normalized embedding
def encode_message(message):
    # Render the chat template and append the end-of-text token, whose hidden state is pooled below
    texts = processor.apply_chat_template(message, tokenize=False, add_generation_prompt=True)[0] + "<|endoftext|>"
    audio_inputs, image_inputs, video_inputs = process_mm_info(message, use_audio_in_video=True)

    inputs = processor(
        text=texts,
        audio=audio_inputs,
        images=image_inputs,
        videos=video_inputs,
        return_tensors="pt",
        padding="longest",
    )
    # Move every input tensor to the model's device
    for k in inputs:
        inputs[k] = inputs[k].to(device)

    cache_position = torch.arange(0, inputs['input_ids'].shape[1], device=device)
    inputs = model.prepare_inputs_for_generation(**inputs, use_cache=True, cache_position=cache_position)
    model_outputs = model(**inputs, return_dict=True, output_hidden_states=True)

    # Use the last layer's hidden state at the final token as the embedding
    last_hidden_state = model_outputs.hidden_states[-1]
    reps = last_hidden_state[:, -1]
    reps = torch.nn.functional.normalize(reps, p=2, dim=-1)
    return reps
```
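
Because `encode_message` L2-normalizes its output, the cosine similarity between two embeddings equals their dot product, so ranking a pool of candidates reduces to sorting by dot product. The sketch below illustrates this on plain Python lists. The helper name `rank_candidates` and the toy 2-D vectors are hypothetical, and it assumes each embedding has already been converted with something like `encode_message(msg)[0].detach().float().cpu().tolist()`.

```python
def rank_candidates(query_vec, candidate_vecs, top_k=None):
    """Return (candidate_index, score) pairs sorted by dot product, best first.

    Assumes every vector is already L2-normalized, so the dot product
    is the cosine similarity.
    """
    scores = [
        sum(q * c for q, c in zip(query_vec, vec))
        for vec in candidate_vecs
    ]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    return [(i, scores[i]) for i in order]


# Toy unit-length vectors standing in for real embeddings.
query = [1.0, 0.0]
candidates = [
    [0.0, 1.0],   # orthogonal to the query
    [0.6, 0.8],   # partially aligned
    [1.0, 0.0],   # identical direction
]

ranked = rank_candidates(query, candidates, top_k=2)
print(ranked)  # [(2, 1.0), (1, 0.6)]
```

For large candidate pools, the same scoring is usually done on tensors as a single matrix product followed by a top-k selection.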

### 🎬 Video Retrieval
```python
example_query = 'Query: How to cook Mapo Tofu?'
example_video_1 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/mapo_tofu.mp4"
example_video_2 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/zhajiang_noodle.mp4"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
video_1 = [{'role': 'user', 'content': [{'type': 'video', 'video': example_video_1}]}]
video_2 = [{'role': 'user', 'content': [{'type': 'video', 'video': example_video_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(video_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(video_2))

print("Similarities:", sim1.item(), sim2.item())
```

### 🎵 Audio Retrieval
```python
example_query = 'Query: A light piano piece'
example_audio_1 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/joe_hisaishi_summer.mp3"
example_audio_2 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/jay_chou_superman_cant_fly.mp3"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
audio_1 = [{'role': 'user', 'content': [{'type': 'audio', 'audio': example_audio_1}]}]
audio_2 = [{'role': 'user', 'content': [{'type': 'audio', 'audio': example_audio_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(audio_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(audio_2))

print("Similarities:", sim1.item(), sim2.item())
```

### 📈 Image Document Retrieval (Image, Chart, PDF)
```python
example_query = 'Query: How many input modality does Qwen2.5-Omni support?'
example_image_1 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/qwen2.5omni_hgf.png"
example_image_2 = "https://huggingface.co/Tevatron/OmniEmbed-v0/resolve/main/assets/llama4_hgf.png"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
image_1 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_1}]}]
image_2 = [{'role': 'user', 'content': [{'type': 'image', 'image': example_image_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(image_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(image_2))

print("Similarities:", sim1.item(), sim2.item())
```

### 🌍 Multilingual Text Retrieval
```python
# Chinese query: "What percentage of air is oxygen?"
example_query = 'Query: 氧气在空气中占比多少?'
# Passage about the composition of air (answers the query)
example_text_1 = "空气是指大气层中由不同气体和各类飘浮在其中的固体与液体颗粒(大气颗粒与气溶胶)所组成的气态混合物。地球大气层的空气主要由78.1%的氮气、20.9%氧气、0.9%的氩气和1~4%的水蒸气组成,其成分并不是固定的,随着高度、气压、温度的改变和对流情况不同,局部空气的组成比例也会改变。空气在大气层(特别是对流层)中的流动形成了风和曳流、气旋、龙卷等自然现象,而空气中飘浮的颗粒则形成了云、雾、霾和沙尘暴等短期天气情况。空气在海洋和陆地之间跨区域流动所承载的湿度和热能传导也是水循环和气候变率与变化的关键一环。"
# Passage about the properties of water (does not answer the query)
example_text_2 = "水(化学式:H2O)是一种无机化合物,在常温且无杂质中是无色[1]无味不导电的透明液体,也会通过蒸发产生气态的水蒸气(这种蒸发可以发生在任何温度下,同时取决于与空气接触的表面积和湿度差)。在标准大气压下,水的凝固点是0 °C(32 °F;273 K),沸点是100 °C(212 °F;373 K)。"
query = [{'role': 'user', 'content': [{'type': 'text', 'text': example_query}]}]
text_1 = [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_1}]}]
text_2 = [{'role': 'user', 'content': [{'type': 'text', 'text': example_text_2}]}]

sim1 = torch.cosine_similarity(encode_message(query), encode_message(text_1))
sim2 = torch.cosine_similarity(encode_message(query), encode_message(text_2))

print("Similarities:", sim1.item(), sim2.item())
```

## Data & Training
We have fully open-sourced the training code in [Tevatron](https://github.com/texttron/tevatron/tree/qwenomni).

## Contact
This model is developed by:

Samantha Zhan, Crystina Zhang, Shengyao Zhuang, Xueguang Ma, Jimmy Lin

Feel free to reach out to us with any questions or for further discussion.