from dataclasses import dataclass
from enum import Enum
@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str
# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    task0 = Task("anli_r1", "acc", "ANLI")
    task1 = Task("logiqa", "acc_norm", "LogiQA")
NUM_FEWSHOT = 0  # Change with your few-shot setting
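# Illustrative use of Tasks (not executed here): each entry maps a task key and a
# metric key in the results JSON to a leaderboard column, e.g. assuming a results
# dict shaped like results["results"]["anli_r1"]["acc"]:
#
#   for task in Tasks:
#       score = results["results"][task.value.benchmark][task.value.metric]
#       print(task.value.col_name, score)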
# ---------------------------------------------------
# Your leaderboard name
TITLE = """<h1 align="center" id="space-title">MVRB Leaderboard</h1>"""
# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
**MVRB (Massive Visualized IR Benchmark)** evaluates multimodal retrievers’ performance on general **Vis-IR** tasks. The benchmark includes various task types, such as screenshot-based multimodal retrieval (screenshot to anything, anything to screenshot) and screenshot-conditioned retrieval (e.g., searching for documents using queries conditioned on screenshots). It also covers a variety of important domains, including news, products, papers, and charts.
More details can be found at:
- Paper: https://arxiv.org/pdf/2502.11431
- Repo: https://github.com/VectorSpaceLab/Vis-IR
"""
# Which evaluations are you running? How can people reproduce what you have?
LLM_BENCHMARKS_TEXT = f"""
## Tasks
- **Screenshot Retrieval (SR)** consists of evaluation samples, each comprising a textual query *q* and its relevant screenshot *s*: *(q, s)*. The retrieval model needs to precisely retrieve the relevant screenshot for a test query from a given corpus *S*. Each evaluation sample is created in two steps: 1) sample a screenshot *s*, 2) prompt the LLM to generate a search query based on the caption of the screenshot. We consider seven tasks under this category, including product retrieval, paper retrieval, repo retrieval, news retrieval, chart retrieval, document retrieval, and slide retrieval.
- **Composed Screenshot Retrieval (CSR)** is made up of *sq2s* triplets. Given a screenshot *s1* and a query *q* conditioned on *s1*, the retrieval model needs to retrieve the relevant screenshot *s2* from the corpus *S*. We define four tasks for this category, including product discovery, news-to-Wiki, knowledge relation, and Wiki-to-product. All tasks in this category are created by human annotators: for each task, annotators are instructed to identify relevant screenshot pairs and write queries to retrieve *s2* based on *s1*.
- **Screenshot Question Answering (SQA)** comprises *sq2a* triplets. Given a screenshot *s* and a question *q* conditioned on *s*, the retrieval model needs to retrieve the correct answer *a* from a candidate corpus *A*. Each evaluation sample is created in three steps: 1) sample a screenshot *s*, 2) prompt the MLLM to generate a question *q*, 3) prompt the MLLM to generate the answer *a* for *q* based on *s*. The following tasks are included in this category: product-QA, news-QA, Wiki-QA, paper-QA, and repo-QA.
- **Open-Vocab Classification (OVC)** is performed using evaluation samples of screenshots and their textual class labels. Given a screenshot *s* and the label class *C*, the retrieval model needs to discriminate the correct label *c* from *C* based on embedding similarity (see the sketch below). We include the following tasks in this category: product classification, news-topic classification, academic-field classification, and knowledge classification. For each task, we employ human labelers to create the label class and assign each screenshot its correct label.
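
All four categories are scored the same way: the query and every candidate (screenshot, answer, or textual label) are embedded by the retriever, and candidates are ranked by embedding similarity. As a rough illustration only, not the official evaluation code, a cosine-similarity ranking sketch might look like this:

```python
import numpy as np

def rank_candidates(query_emb, candidate_embs):
    # Return candidate indices ranked from most to least similar (cosine similarity).
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = c @ q  # cosine similarity of the query against every candidate
    return np.argsort(-scores)
```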
"""
EVALUATION_QUEUE_TEXT = """
## Some good practices before submitting a model
### 1) Make sure you can load your model and tokenizer using AutoClasses:
```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

revision = "main"  # branch, tag, or commit hash of the weights you want evaluated
config = AutoConfig.from_pretrained("your model name", revision=revision)
model = AutoModel.from_pretrained("your model name", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
```
If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.
Note: make sure your model is public!
Note: if your model needs `trust_remote_code=True`, we do not support this option yet, but we are working on adding it, stay posted!
### 2) Convert your model weights to [safetensors](https://huggingface.co/docs/safetensors/index)
It's a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!
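If your weights are still stored as `pytorch_model.bin`, one simple way to convert them is to re-save the model with `safe_serialization=True` (a minimal sketch; the model name and output path are placeholders):
```python
from transformers import AutoModel

model = AutoModel.from_pretrained("your model name")
# safe_serialization=True writes the weights as model.safetensors
model.save_pretrained("path/to/safetensors-output", safe_serialization=True)
```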
### 3) Make sure your model has an open license!
This is an open leaderboard, and we'd love for as many people as possible to know they can use your model 🤗
### 4) Fill out your model card
When we add extra information about models to the leaderboard, it will be automatically taken from the model card.
## In case of model failure
If your model is displayed in the `FAILED` category, its execution stopped.
Make sure you have followed the above steps first.
If everything is done, check that you can launch the EleutherAI Harness on your model locally (you can add `--limit` to limit the number of examples per task).
"""
SUBMIT_FORM = """
## Make sure you submit your evaluation results in a JSON file with the following format:
```json
{
  "Model": "<Model Name>",
  "URL (optional)": "<Model/Repo/Paper URL>",
  "#params": "7.11B",
  "Overall": 30.00,
  "SR": 30.00,
  "CSR": 30.00,
  "VQA": 30.00,
  "OVC": 30.00
}
```
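For reference, here is a minimal Python sketch that writes a submission file in this format (all names and scores below are placeholders):
```python
import json

submission = {
    "Model": "my-retriever",
    "URL (optional)": "https://huggingface.co/my-org/my-retriever",
    "#params": "7.11B",
    "Overall": 30.00,
    "SR": 30.00,
    "CSR": 30.00,
    "VQA": 30.00,
    "OVC": 30.00,
}

with open("mvrb_submission.json", "w") as f:
    json.dump(submission, f, indent=2)
```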
Then send an email to [email protected] with the JSON file attached. We will review your submission and add it to the leaderboard.
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite MVRB:"
CITATION_BUTTON_TEXT = """
@article{liu2025any,
  title={Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval},
  author={Liu, Ze and Liang, Zhengyang and Zhou, Junjie and Liu, Zheng and Lian, Defu},
  journal={arXiv preprint arXiv:2502.11431},
  year={2025}
}
"""