---
title: Leaderboard Viewer
emoji: πŸš€
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
- streamlit
pinned: false
short_description: Grounding benchmark leaderboard viewer
---

# Grounding Benchmark Leaderboard Viewer

A Streamlit application for visualizing model performance on grounding benchmarks.

## Features

- **Real-time Data**: Streams results directly from the HuggingFace leaderboard repository without local storage
- **Interactive Visualizations**: Bar charts comparing model performance across different metrics
- **Baseline Comparisons**: Shows baseline models (Qwen2-VL, UI-TARS) alongside evaluated models
- **Best Checkpoint Selection**: Automatically shows the best performing checkpoint for each model (marked with * if not the last checkpoint)
- **UI Type Breakdown**: 
  - For ScreenSpot-v2: Comprehensive charts showing Overall, Desktop, Web, and individual UI type performance
  - For other datasets: Desktop vs Web and Text vs Icon performance
- **Checkpoint Progression Analysis**: Visualize how metrics evolve during training
- **Model Details**: View training loss, checkpoint steps, and evaluation timestamps

## Installation

1. Clone or download this directory
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```

## Running the App

```bash
streamlit run src/streamlit_app.py
```

The app will open in your browser at `http://localhost:8501`.

## Usage

1. **Select Dataset**: Use the sidebar to choose which benchmark dataset to view (e.g., screenspot-v2, screenspot-pro)

2. **Filter Models**: Optionally filter to view a specific model or all models

3. **View Charts**: 
   - For ScreenSpot-v2:
     - Overall performance (average of desktop and web)
     - Desktop and Web averages
     - Individual UI type metrics: Desktop (Text), Desktop (Icon), Web (Text), Web (Icon)
     - Text and Icon averages across environments
   - Baseline model comparisons shown in orange
   - Models marked with * indicate the best checkpoint is not the last one

4. **Explore Details**: 
   - Expand "Model Details" to see training metadata
   - Expand "Detailed UI Type Breakdown" for a comprehensive table
   - Expand "Checkpoint Progression Analysis" to:
     - View accuracy progression over training steps
     - See the relationship between training loss and accuracy
     - Compare performance across checkpoints

## Data Source

The app streams data directly from the HuggingFace dataset repository:
- Repository: `mlfoundations-cua-dev/leaderboard`
- Path: `grounding/[dataset_name]/[model_results].json`

## Streaming Approach

To minimize local storage requirements, the app:
- Streams JSON files directly from HuggingFace Hub
- Extracts only the necessary data for visualization
- Discards the full JSON after processing
- Caches the extracted data in memory for 5 minutes
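The extract-and-discard step can be sketched as a small pure function. The field names below (`model`, `checkpoint_step`, `metrics.accuracy`) are illustrative assumptions, not the leaderboard files' actual schema:

```python
import json


def extract_summary(raw_json: str) -> dict:
    """Parse a full results JSON and keep only the fields the charts need.

    The field names used here are illustrative assumptions; the real
    leaderboard result files may use a different schema.
    """
    full = json.loads(raw_json)  # full results payload, including per-sample data
    summary = {
        "model": full.get("model"),
        "checkpoint_step": full.get("checkpoint_step"),
        "accuracy": full.get("metrics", {}).get("accuracy"),
    }
    # `full` is not retained beyond this function, so the bulky
    # per-sample data is discarded after extraction.
    return summary
```

Because only the small summary dict is kept, memory use stays flat no matter how large the underlying result files grow.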

## Supported Datasets

- **ScreenSpot-v2**: Web and desktop UI element grounding (with special handling for desktop/web averaging)
- **ScreenSpot-Pro**: Professional UI grounding benchmark
- **ShowdownClicks**: Click prediction benchmark
- And more as they are added to the leaderboard

## Baseline Models

The dashboard includes baseline performance from established models:

### ScreenSpot-v2 Baselines
- **Qwen2-VL-7B**: 38.0% overall
- **UI-TARS-2B**: 82.8% overall
- **UI-TARS-7B**: 92.2% overall
- **UI-TARS-72B**: 88.3% overall

### ScreenSpot-Pro Baselines
- **Qwen2.5-VL-3B-Instruct**: 16.1% overall
- **Qwen2.5-VL-7B-Instruct**: 26.8% overall
- **Qwen2.5-VL-72B-Instruct**: 53.3% overall
- **UI-TARS-2B**: 27.7% overall
- **UI-TARS-7B**: 35.7% overall
- **UI-TARS-72B**: 38.1% overall

### ShowDown-Clicks Baselines
- **Qwen2.5-VL-72B-Instruct**: 24.8% overall
- **UI-TARS-72B-SFT**: 54.4% overall
- **Molmo-72B-0924**: 54.8% overall

## Checkpoint Handling

- The app automatically identifies the best performing checkpoint for each model
- If multiple checkpoints exist, only the best one is shown in the main charts
- An asterisk (*) indicates when the best checkpoint is not the last one
- Use the "Checkpoint Progression Analysis" to explore all checkpoints

## Caching

Results are cached in memory for 5 minutes to improve performance. The cache automatically refreshes to show new evaluation results.
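In a Streamlit app this kind of expiry is typically done with a cache decorator such as `st.cache_data(ttl=300)`; the behavior can be sketched framework-free as a minimal TTL cache (how the real app wires up its cache may differ):

```python
import time
from typing import Any, Callable


def ttl_cache(ttl_seconds: float) -> Callable:
    """Cache a zero-argument loader's result, refetching after ttl_seconds.

    A framework-free stand-in for a Streamlit-style TTL cache; the real
    app's caching mechanism may differ.
    """
    def decorator(fn: Callable[[], Any]) -> Callable[[], Any]:
        state = {"value": None, "fetched_at": float("-inf")}

        def wrapper() -> Any:
            now = time.monotonic()
            if now - state["fetched_at"] >= ttl_seconds:
                state["value"] = fn()  # cache empty or stale: reload
                state["fetched_at"] = now
            return state["value"]

        return wrapper
    return decorator
```

With `ttl_seconds=300`, repeated chart redraws within five minutes reuse the cached data, and the first access after expiry pulls fresh evaluation results.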