ZiyiXia committed · Commit 2cfcdca · 1 Parent(s): 1b8d472
Files changed (2):
  1. results.csv +10 -10
  2. src/about.py +12 -3
results.csv CHANGED
@@ -1,13 +1,13 @@
 Rank,Model,#Params (B),Overall,SR,CSR,SQA,OVC
 1,UniSE-MLLM,2.21,55.72,69.63,54.49,43.2,48.26
-2,GME,2.21,48.14,61.62,37.68,37.78,47.98
-3,DSE,4.15,45.21,61.54,37.78,39.24,31.51
-4,ColPali,2.92,43.64,61.73,35.0,35.32,31.04
+2,"<a href=""https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-2B-Instruct"">GME</a>",2.21,48.14,61.62,37.68,37.78,47.98
+3,"<a href=""https://huggingface.co/Tevatron/dse-phi3-v1.0"">DSE</a>",4.15,45.21,61.54,37.78,39.24,31.51
+4,"<a href=""https://huggingface.co/vidore/colpali"">ColPali</a>",2.92,43.64,61.73,35.0,35.32,31.04
 5,UniSE-CLIP,0.428,36.41,35.95,43.38,28.13,40.62
-6,MM-Embed,7.57,34.48,25.86,40.93,42.83,32.67
-7,SigLIP,0.878,33.34,38.33,34.48,19.6,40.64
-8,VLM2Vec,4.15,32.19,15.93,48.05,49.42,23.24
-9,E5-V,8.35,25.13,34.11,26.59,5.23,32.85
-10,CLIP,0.428,23.75,18.89,25.39,23.9,30.4
-11,Uni-IR,0.428,19.63,12.35,35.92,29.68,20.06
-12,VISTA,0.196,13.85,5.21,11.29,25.78,16.61
+6,"<a href=""https://huggingface.co/nvidia/MM-Embed"">MM-Embed</a>",7.57,34.48,25.86,40.93,42.83,32.67
+7,"<a href=""https://huggingface.co/google/siglip-so400m-patch14-384"">SigLIP</a>",0.878,33.34,38.33,34.48,19.6,40.64
+8,"<a href=""https://huggingface.co/TIGER-Lab/VLM2Vec-Full"">VLM2Vec</a>",4.15,32.19,15.93,48.05,49.42,23.24
+9,"<a href=""https://huggingface.co/royokong/e5-v"">E5-V</a>",8.35,25.13,34.11,26.59,5.23,32.85
+10,"<a href=""https://huggingface.co/openai/clip-vit-large-patch14"">CLIP</a>",0.428,23.75,18.89,25.39,23.9,30.4
+11,"<a href=""https://huggingface.co/TIGER-Lab/UniIR"">Uni-IR</a>",0.428,19.63,12.35,35.92,29.68,20.06
+12,"<a href=""https://huggingface.co/OpenDriveLab/Vista"">VISTA</a>",0.196,13.85,5.21,11.29,25.78,16.61
src/about.py CHANGED
@@ -29,7 +29,7 @@ INTRODUCTION_TEXT = """
 
 More details can be found:
 - Paper: https://arxiv.org/pdf/2502.11431
-- Code: https://github.com/VectorSpaceLab/Vis-IR
+- Repo: https://github.com/VectorSpaceLab/Vis-IR
 """
 
 # Which evaluations are you running? how can people reproduce what you have?
@@ -39,9 +39,17 @@ LLM_BENCHMARKS_TEXT = f"""
 
 - **Composed Screenshot Retrieval (CSR)** is made up of sq2s triplets. Given a screenshot *s1* and a query *q* conditioned on *s1*, the retrieval model needs to retrieve the relevant screenshot *s2* from the corpus *S*. We define four tasks for this category, including product discovery, news-to-Wiki, knowledge relation, and Wiki-to-product. All tasks in this category are created by human annotators. For each task, annotators are instructed to identify relevant screenshot pairs and write queries to retrieve *s2* based on *s1*.
 
-- **Screenshot Question Answering (SQA)** comprises sq2a triplets. Given a screenshot *s* and a question *q* conditioned on *s*, the retrieval model needs to retrieve the correct answer a from a candidate corpus *A*. Each evaluation sample is created in three steps: 1) sample a screenshot $s$, 2) prompt the MLLM to generate a question *q*, 3) prompt the MLLM to generate the answer *a* for *q* based on *s*. The following tasks are included in this category: product-QA, news-QA, Wiki-QA, paper-QA, repo-QA.
+- **Screenshot Question Answering (SQA)** comprises sq2a triplets. Given a screenshot *s* and a question *q* conditioned on *s*, the retrieval model needs to retrieve the correct answer *a* from a candidate corpus *A*. Each evaluation sample is created in three steps:
+  1) sample a screenshot *s*,
+  2) prompt the MLLM to generate a question *q*,
+  3) prompt the MLLM to generate the answer *a* for *q* based on *s*.
+  The following tasks are included in this category: product-QA, news-QA, Wiki-QA, paper-QA, repo-QA.
 
 - **Open-Vocab Classification (OVC)** is performed using evaluation samples of screenshots and their textual class labels. Given a screenshot s and the label class *C*, the retrieval model needs to discriminate the correct label c from *C* based on the embedding similarity. We include the following tasks in this category: product classification, news-topic classification, academic-field classification, knowledge classification. For each task, we employ human labelers to create the label class and assign each screenshot with its correct label.
+- **Screenshot Retrieval (SR)** consists of evaluation samples, each comprising a textual query *q* and its relevant screenshot *s*: *(q, s)*. The retrieval model needs to precisely retrieve the relevant screenshot for a test query from a given corpus *S*. Each evaluation sample is created in two steps:
+  1) sample a screenshot *s*,
+  2) prompt the LLM to generate a search query based on the caption of the screenshot.
+  We consider seven tasks under this category, including product retrieval, paper retrieval, repo retrieval, news retrieval, chart retrieval, document retrieval, and slide retrieval.
 """
 
 EVALUATION_QUEUE_TEXT = """
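All four task categories above reduce to the same operation: embed the query side and the candidate side, then rank candidates by similarity. As a rough illustration of that protocol (an editor's sketch, not the benchmark's evaluation code), here is how the CLIP baseline from the leaderboard would score a textual SR query against a small screenshot corpus; the file names and query are made up:

```python
# Sketch of the embedding-similarity protocol described above, using the
# leaderboard's CLIP baseline via Hugging Face transformers. Illustrative
# file names and query; not the benchmark's own evaluation pipeline.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# SR setup: one textual query q and a tiny screenshot corpus S.
query = "pricing page of an open-source vector database"
screenshots = [Image.open(p) for p in ["s1.png", "s2.png", "s3.png"]]

with torch.no_grad():
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    q_emb = model.get_text_features(**text_inputs)
    image_inputs = processor(images=screenshots, return_tensors="pt")
    s_embs = model.get_image_features(**image_inputs)

# Cosine similarity; the top-ranked screenshot is the retrieval result.
q_emb = q_emb / q_emb.norm(dim=-1, keepdim=True)
s_embs = s_embs / s_embs.norm(dim=-1, keepdim=True)
scores = (q_emb @ s_embs.T).squeeze(0)
print("best match:", int(scores.argmax()))
```

OVC follows the same pattern with the label set *C* embedded in place of the corpus *S*; CSR and SQA swap in composed screenshot-plus-query inputs and answer candidates, respectively.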
@@ -79,7 +87,8 @@ SUBMIT_FORM = """
 ```json
 {
     "Model": "<Model Name>",
-    "#params (B)": "7.11",
+    "URL (optional)": "<Model/Repo/Paper URL>",
+    "#params": "7.11B",
     "Overall": 30.00,
     "SR": 30.00,
     "CSR": 30.00,
 