ZiyiXia committed · Commit 2cfcdca · 1 Parent(s): 1b8d472
Files changed (2):
  1. results.csv +10 -10
  2. src/about.py +12 -3
results.csv CHANGED
@@ -1,13 +1,13 @@
 Rank,Model,#Params (B),Overall,SR,CSR,SQA,OVC
 1,UniSE-MLLM,2.21,55.72,69.63,54.49,43.2,48.26
-2,GME,2.21,48.14,61.62,37.68,37.78,47.98
-3,DSE,4.15,45.21,61.54,37.78,39.24,31.51
-4,ColPali,2.92,43.64,61.73,35.0,35.32,31.04
+2,"<a href=""https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-2B-Instruct"">GME</a>",2.21,48.14,61.62,37.68,37.78,47.98
+3,"<a href=""https://huggingface.co/Tevatron/dse-phi3-v1.0"">DSE</a>",4.15,45.21,61.54,37.78,39.24,31.51
+4,"<a href=""https://huggingface.co/vidore/colpali"">ColPali</a>",2.92,43.64,61.73,35.0,35.32,31.04
 5,UniSE-CLIP,0.428,36.41,35.95,43.38,28.13,40.62
-6,MM-Embed,7.57,34.48,25.86,40.93,42.83,32.67
-7,SigLIP,0.878,33.34,38.33,34.48,19.6,40.64
-8,VLM2Vec,4.15,32.19,15.93,48.05,49.42,23.24
-9,E5-V,8.35,25.13,34.11,26.59,5.23,32.85
-10,CLIP,0.428,23.75,18.89,25.39,23.9,30.4
-11,Uni-IR,0.428,19.63,12.35,35.92,29.68,20.06
-12,VISTA,0.196,13.85,5.21,11.29,25.78,16.61
+6,"<a href=""https://huggingface.co/nvidia/MM-Embed"">MM-Embed</a>",7.57,34.48,25.86,40.93,42.83,32.67
+7,"<a href=""https://huggingface.co/google/siglip-so400m-patch14-384"">SigLIP</a>",0.878,33.34,38.33,34.48,19.6,40.64
+8,"<a href=""https://huggingface.co/TIGER-Lab/VLM2Vec-Full"">VLM2Vec</a>",4.15,32.19,15.93,48.05,49.42,23.24
+9,"<a href=""https://huggingface.co/royokong/e5-v"">E5-V</a>",8.35,25.13,34.11,26.59,5.23,32.85
+10,"<a href=""https://huggingface.co/openai/clip-vit-large-patch14"">CLIP</a>",0.428,23.75,18.89,25.39,23.9,30.4
+11,"<a href=""https://huggingface.co/TIGER-Lab/UniIR"">Uni-IR</a>",0.428,19.63,12.35,35.92,29.68,20.06
+12,"<a href=""https://huggingface.co/OpenDriveLab/Vista"">VISTA</a>",0.196,13.85,5.21,11.29,25.78,16.61
src/about.py CHANGED
@@ -29,7 +29,7 @@ INTRODUCTION_TEXT = """
 
 More details can be found:
 - Paper: https://arxiv.org/pdf/2502.11431
-- Code: https://github.com/VectorSpaceLab/Vis-IR
+- Repo: https://github.com/VectorSpaceLab/Vis-IR
 """
 
 # Which evaluations are you running? how can people reproduce what you have?
@@ -39,9 +39,17 @@ LLM_BENCHMARKS_TEXT = f"""
 
 - **Composed Screenshot Retrieval (CSR)** is made up of sq2s triplets. Given a screenshot *s1* and a query *q* conditioned on *s1*, the retrieval model needs to retrieve the relevant screenshot *s2* from the corpus *S*. We define four tasks for this category, including product discovery, news-to-Wiki, knowledge relation, and Wiki-to-product. All tasks in this category are created by human annotators. For each task, annotators are instructed to identify relevant screenshot pairs and write queries to retrieve *s2* based on *s1*.
 
-- **Screenshot Question Answering (SQA)** comprises sq2a triplets. Given a screenshot *s* and a question *q* conditioned on *s*, the retrieval model needs to retrieve the correct answer a from a candidate corpus *A*. Each evaluation sample is created in three steps: 1) sample a screenshot $s$, 2) prompt the MLLM to generate a question *q*, 3) prompt the MLLM to generate the answer *a* for *q* based on *s*. The following tasks are included in this category: product-QA, news-QA, Wiki-QA, paper-QA, repo-QA.
+- **Screenshot Question Answering (SQA)** comprises sq2a triplets. Given a screenshot *s* and a question *q* conditioned on *s*, the retrieval model needs to retrieve the correct answer *a* from a candidate corpus *A*. Each evaluation sample is created in three steps:
+  1) sample a screenshot *s*,
+  2) prompt the MLLM to generate a question *q*,
+  3) prompt the MLLM to generate the answer *a* for *q* based on *s*.
+  The following tasks are included in this category: product-QA, news-QA, Wiki-QA, paper-QA, repo-QA.
 
 - **Open-Vocab Classification (OVC)** is performed using evaluation samples of screenshots and their textual class labels. Given a screenshot s and the label class *C*, the retrieval model needs to discriminate the correct label c from *C* based on the embedding similarity. We include the following tasks in this category: product classification, news-topic classification, academic-field classification, knowledge classification. For each task, we employ human labelers to create the label class and assign each screenshot with its correct label.
+- **Screenshot Retrieval (SR)** consists of evaluation samples, each comprising a textual query *q* and its relevant screenshot *s*: *(q, s)*. The retrieval model needs to precisely retrieve the relevant screenshot for a test query from a given corpus *S*. Each evaluation sample is created in two steps:
+  1) sample a screenshot *s*,
+  2) prompt the LLM to generate a search query based on the caption of the screenshot.
+  We consider seven tasks under this category, including product retrieval, paper retrieval, repo retrieval, news retrieval, chart retrieval, document retrieval, and slide retrieval.
 """
 
 EVALUATION_QUEUE_TEXT = """
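All four task categories above reduce to the same operation: embed the query side and the candidate side, then rank candidates by similarity. As a rough illustration of that protocol (an editor's sketch, not the benchmark's evaluation code), here is how the CLIP baseline from the leaderboard would score a textual SR query against a small screenshot corpus; the file names and query are made up:

```python
# Sketch of the embedding-similarity protocol described above, using the
# leaderboard's CLIP baseline via Hugging Face transformers. Illustrative
# file names and query; not the benchmark's own evaluation pipeline.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# SR setup: one textual query q and a tiny screenshot corpus S.
query = "pricing page of an open-source vector database"
screenshots = [Image.open(p) for p in ["s1.png", "s2.png", "s3.png"]]

with torch.no_grad():
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    q_emb = model.get_text_features(**text_inputs)
    image_inputs = processor(images=screenshots, return_tensors="pt")
    s_embs = model.get_image_features(**image_inputs)

# Cosine similarity; the top-ranked screenshot is the retrieval result.
q_emb = q_emb / q_emb.norm(dim=-1, keepdim=True)
s_embs = s_embs / s_embs.norm(dim=-1, keepdim=True)
scores = (q_emb @ s_embs.T).squeeze(0)
print("best match:", int(scores.argmax()))
```

OVC follows the same pattern with the label set *C* embedded in place of the corpus *S*; CSR and SQA swap in composed screenshot-plus-query inputs and answer candidates, respectively.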
@@ -79,7 +87,8 @@ SUBMIT_FORM = """
 ```json
 {
     "Model": "<Model Name>",
-    "#params (B)": "7.11",
+    "URL (optional)": "<Model/Repo/Paper URL>",
+    "#params": "7.11B",
     "Overall": 30.00,
     "SR": 30.00,
     "CSR": 30.00,
 