---
title: Leaderboard Viewer
emoji: π
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
- streamlit
pinned: false
short_description: Grounding benchmark leaderboard viewer
---
# Grounding Benchmark Leaderboard Viewer
A Streamlit application for visualizing model performance on grounding benchmarks.
## Features
- **Real-time Data**: Streams results directly from the HuggingFace leaderboard repository without local storage
- **Interactive Visualizations**: Bar charts comparing model performance across different metrics
- **Baseline Comparisons**: Shows baseline models (Qwen2-VL, UI-TARS) alongside evaluated models
- **Best Checkpoint Selection**: Automatically shows the best performing checkpoint for each model (marked with * if not the last checkpoint)
- **UI Type Breakdown**:
- For ScreenSpot-v2: Comprehensive charts showing Overall, Desktop, Web, and individual UI type performance
- For other datasets: Desktop vs Web and Text vs Icon performance
- **Checkpoint Progression Analysis**: Visualize how metrics evolve during training
- **Model Details**: View training loss, checkpoint steps, and evaluation timestamps
## Installation
1. Clone or download this directory
2. Install dependencies:
```bash
pip install -r requirements.txt
```
## Running the App
```bash
streamlit run src/streamlit_app.py
```
The app will open in your browser at `http://localhost:8501`.
## Usage
1. **Select Dataset**: Use the sidebar to choose which benchmark dataset to view (e.g., screenspot-v2, screenspot-pro)
2. **Filter Models**: Optionally filter to view a specific model or all models
3. **View Charts**:
- For ScreenSpot-v2:
- Overall performance (average of desktop and web)
- Desktop and Web averages
- Individual UI type metrics: Desktop (Text), Desktop (Icon), Web (Text), Web (Icon)
- Text and Icon averages across environments
- Baseline model comparisons shown in orange
   - An asterisk (*) next to a model name indicates that its best checkpoint is not its final one
4. **Explore Details**:
- Expand "Model Details" to see training metadata
- Expand "Detailed UI Type Breakdown" for a comprehensive table
- Expand "Checkpoint Progression Analysis" to:
- View accuracy progression over training steps
- See the relationship between training loss and accuracy
- Compare performance across checkpoints
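The ScreenSpot-v2 averages listed above (Overall, Desktop/Web, Text/Icon) can be sketched as follows. The per-split accuracy keys (`desktop_text`, etc.) are illustrative assumptions, not the app's actual result schema:

```python
# Sketch of the ScreenSpot-v2 metric aggregation described above.
# The per-split keys ("desktop_text", ...) are assumed names, not the
# leaderboard's actual schema.

def aggregate_screenspot_v2(acc: dict) -> dict:
    """Compute the derived averages shown in the charts."""
    desktop = (acc["desktop_text"] + acc["desktop_icon"]) / 2
    web = (acc["web_text"] + acc["web_icon"]) / 2
    return {
        "desktop": desktop,
        "web": web,
        # Overall is the average of the desktop and web averages
        "overall": (desktop + web) / 2,
        # Text and Icon accuracy averaged across environments
        "text": (acc["desktop_text"] + acc["web_text"]) / 2,
        "icon": (acc["desktop_icon"] + acc["web_icon"]) / 2,
    }
```

For example, a model with 90/70 desktop text/icon accuracy and 80/60 web text/icon accuracy gets a desktop average of 80, a web average of 70, and an overall score of 75.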
## Data Source
The app streams data directly from the HuggingFace dataset repository:
- Repository: `mlfoundations-cua-dev/leaderboard`
- Path: `grounding/[dataset_name]/[model_results].json`
## Streaming Approach
To minimize local storage requirements, the app:
- Streams JSON files directly from HuggingFace Hub
- Extracts only the necessary data for visualization
- Discards the full JSON after processing
- Caches the extracted data in memory for 5 minutes
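A minimal sketch of this stream-and-extract flow, assuming `huggingface_hub`'s `HfFileSystem` for streaming; the retained field names are hypothetical, not the leaderboard's actual schema:

```python
import json

# Only the fields the charts need survive; the full JSON is discarded.
# The key names here ("metrics", "checkpoint_step", ...) are illustrative,
# not the leaderboard's actual schema.
KEEP_KEYS = ("metrics", "checkpoint_step", "training_loss", "timestamp")

def extract_for_charts(raw: dict) -> dict:
    """Keep only the chart-relevant fields from a full result JSON."""
    return {k: raw[k] for k in KEEP_KEYS if k in raw}

def stream_result(dataset: str, filename: str) -> dict:
    """Stream one result file from the Hub without saving it locally.

    Uses huggingface_hub's HfFileSystem (imported lazily so the pure
    extraction logic above carries no extra dependency).
    """
    from huggingface_hub import HfFileSystem

    fs = HfFileSystem()
    path = f"datasets/mlfoundations-cua-dev/leaderboard/grounding/{dataset}/{filename}"
    with fs.open(path, "r") as f:
        return extract_for_charts(json.load(f))
```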
## Supported Datasets
- **ScreenSpot-v2**: Web and desktop UI element grounding (with special handling for desktop/web averaging)
- **ScreenSpot-Pro**: Professional UI grounding benchmark
- **ShowdownClicks**: Click prediction benchmark
- And more as they are added to the leaderboard
## Baseline Models
The dashboard includes baseline performance from established models:
### ScreenSpot-v2 Baselines
- **Qwen2-VL-7B**: 38.0% overall
- **UI-TARS-2B**: 82.8% overall
- **UI-TARS-7B**: 92.2% overall
- **UI-TARS-72B**: 88.3% overall
### ScreenSpot-Pro Baselines
- **Qwen2.5-VL-3B-Instruct**: 16.1% overall
- **Qwen2.5-VL-7B-Instruct**: 26.8% overall
- **Qwen2.5-VL-72B-Instruct**: 53.3% overall
- **UI-TARS-2B**: 27.7% overall
- **UI-TARS-7B**: 35.7% overall
- **UI-TARS-72B**: 38.1% overall
### ShowdownClicks Baselines
- **Qwen2.5-VL-72B-Instruct**: 24.8% overall
- **UI-TARS-72B-SFT**: 54.4% overall
- **Molmo-72B-0924**: 54.8% overall
## Checkpoint Handling
- The app automatically identifies the best performing checkpoint for each model
- If multiple checkpoints exist, only the best one is shown in the main charts
- An asterisk (*) indicates when the best checkpoint is not the last one
- Use the "Checkpoint Progression Analysis" to explore all checkpoints
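The selection rule above can be sketched as a small helper: pick the checkpoint with the highest accuracy, and star the model name when that checkpoint is not the final one. The `(step, accuracy)` pair representation is an assumption for illustration:

```python
# Sketch of the best-checkpoint rule: choose the checkpoint with the
# highest accuracy, and flag it with "*" when it is not the final one.
# Checkpoints are assumed to be (training_step, accuracy) pairs.

def best_checkpoint(name: str, checkpoints: list[tuple[int, float]]):
    """Return (display_name, step, accuracy) for the best checkpoint."""
    ordered = sorted(checkpoints)                 # order by training step
    step, acc = max(ordered, key=lambda c: c[1])  # highest accuracy wins
    starred = name + " *" if (step, acc) != ordered[-1] else name
    return starred, step, acc
```

For instance, if accuracy peaks at step 200 but training ran to step 300, the model is shown with a trailing `*`.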
## Caching
Results are cached in memory for 5 minutes to improve performance. The cache automatically refreshes to show new evaluation results.
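As a pure-Python illustration of that 5-minute in-memory cache (in the actual Streamlit app this would more naturally be done with `@st.cache_data(ttl=300)`):

```python
import time

# Pure-Python sketch of a 5-minute in-memory cache; the Streamlit-native
# equivalent would be @st.cache_data(ttl=300) on the fetch function.

class TTLCache:
    """Cache values for `ttl` seconds, then refetch."""

    def __init__(self, ttl: float = 300.0):
        self.ttl = ttl
        self._store = {}  # key -> (expiry_time, value)

    def get(self, key, fetch):
        """Return the cached value, calling `fetch()` if missing or stale."""
        now = time.monotonic()
        hit = self._store.get(key)
        if hit is not None and now < hit[0]:
            return hit[1]
        value = fetch()
        self._store[key] = (now + self.ttl, value)
        return value
```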