|
--- |
|
title: Leaderboard Viewer |
|
emoji: π |
|
colorFrom: red |
|
colorTo: red |
|
sdk: docker |
|
app_port: 8501 |
|
tags: |
|
- streamlit |
|
pinned: false |
|
short_description: Grounding benchmark leaderboard viewer
|
--- |
|
|
|
# Grounding Benchmark Leaderboard Viewer |
|
|
|
A Streamlit application for visualizing model performance on grounding benchmarks. |
|
|
|
## Features |
|
|
|
- **Real-time Data**: Streams results directly from the HuggingFace leaderboard repository without local storage |
|
- **Interactive Visualizations**: Bar charts comparing model performance across different metrics |
|
- **Baseline Comparisons**: Shows baseline models (Qwen2-VL, UI-TARS) alongside evaluated models |
|
- **Best Checkpoint Selection**: Automatically shows the best-performing checkpoint for each model (marked with * if it is not the last checkpoint)
|
- **UI Type Breakdown**: |
|
- For ScreenSpot-v2: Comprehensive charts showing Overall, Desktop, Web, and individual UI type performance |
|
- For other datasets: Desktop vs Web and Text vs Icon performance |
|
- **Checkpoint Progression Analysis**: Visualize how metrics evolve during training |
|
- **Model Details**: View training loss, checkpoint steps, and evaluation timestamps |
|
|
|
## Installation |
|
|
|
1. Clone or download this directory |
|
2. Install dependencies: |
|
```bash |
|
pip install -r requirements.txt |
|
``` |
|
|
|
## Running the App |
|
|
|
```bash |
|
streamlit run src/streamlit_app.py |
|
``` |
|
|
|
The app will open in your browser at `http://localhost:8501`.
|
|
|
## Usage |
|
|
|
1. **Select Dataset**: Use the sidebar to choose which benchmark dataset to view (e.g., screenspot-v2, screenspot-pro) |
|
|
|
2. **Filter Models**: Optionally filter to view a specific model or all models |
|
|
|
3. **View Charts**: |
|
- For ScreenSpot-v2: |
|
- Overall performance (average of desktop and web) |
|
- Desktop and Web averages |
|
- Individual UI type metrics: Desktop (Text), Desktop (Icon), Web (Text), Web (Icon) |
|
- Text and Icon averages across environments |
|
- Baseline model comparisons shown in orange |
|
- Models marked with * indicate the best checkpoint is not the last one |
|
|
|
4. **Explore Details**: |
|
- Expand "Model Details" to see training metadata |
|
- Expand "Detailed UI Type Breakdown" for a comprehensive table |
|
- Expand "Checkpoint Progression Analysis" to: |
|
- View accuracy progression over training steps |
|
- See the relationship between training loss and accuracy |
|
- Compare performance across checkpoints |
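The ScreenSpot-v2 aggregation described above (overall as the average of the desktop and web averages, plus text/icon averages across environments) can be sketched in plain Python. The metric names below are illustrative assumptions, not the app's actual result schema:

```python
# Sketch of the ScreenSpot-v2 score aggregation described above.
# The key names ("desktop_text", etc.) are hypothetical.

def aggregate_screenspot_v2(metrics: dict) -> dict:
    """Average per-UI-type accuracies into desktop/web/overall scores."""
    desktop = (metrics["desktop_text"] + metrics["desktop_icon"]) / 2
    web = (metrics["web_text"] + metrics["web_icon"]) / 2
    return {
        "desktop": desktop,
        "web": web,
        # Overall is the average of the desktop and web averages.
        "overall": (desktop + web) / 2,
        # Text and icon averages across both environments.
        "text": (metrics["desktop_text"] + metrics["web_text"]) / 2,
        "icon": (metrics["desktop_icon"] + metrics["web_icon"]) / 2,
    }
```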
|
|
|
## Data Source |
|
|
|
The app streams data directly from the HuggingFace dataset repository: |
|
- Repository: `mlfoundations-cua-dev/leaderboard` |
|
- Path: `grounding/[dataset_name]/[model_results].json` |
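Given that layout, result files can be grouped by dataset with simple path parsing. The helper below is a hypothetical sketch, not code from the app:

```python
# Hypothetical helper: group leaderboard result paths by dataset name,
# following the grounding/[dataset_name]/[model_results].json layout.

def group_by_dataset(paths: list[str]) -> dict[str, list[str]]:
    grouped: dict[str, list[str]] = {}
    for path in paths:
        parts = path.split("/")
        # Expect exactly: ["grounding", dataset_name, "<results>.json"]
        if len(parts) == 3 and parts[0] == "grounding" and parts[2].endswith(".json"):
            grouped.setdefault(parts[1], []).append(path)
    return grouped
```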
|
|
|
## Streaming Approach |
|
|
|
To minimize local storage requirements, the app: |
|
- Streams JSON files directly from HuggingFace Hub |
|
- Extracts only the necessary data for visualization |
|
- Discards the full JSON after processing |
|
- Caches the extracted data in memory for 5 minutes |
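The extract-then-discard step can be sketched in plain Python. The field names below are assumptions about the result schema, not the app's actual keys:

```python
import json

# Fields kept for visualization; everything else in the raw result JSON
# is discarded after parsing. These key names are illustrative assumptions.
KEEP = ("model", "checkpoint", "accuracy", "train_loss", "timestamp")

def extract_for_viz(raw_json: str) -> dict:
    """Parse a streamed result file and keep only the charted fields."""
    record = json.loads(raw_json)
    return {k: record[k] for k in KEEP if k in record}
```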
|
|
|
## Supported Datasets |
|
|
|
- **ScreenSpot-v2**: Web and desktop UI element grounding (with special handling for desktop/web averaging) |
|
- **ScreenSpot-Pro**: Professional UI grounding benchmark |
|
- **ShowdownClicks**: Click prediction benchmark |
|
- And more as they are added to the leaderboard |
|
|
|
## Baseline Models |
|
|
|
The dashboard includes baseline performance from established models: |
|
|
|
### ScreenSpot-v2 Baselines |
|
- **Qwen2-VL-7B**: 38.0% overall |
|
- **UI-TARS-2B**: 82.8% overall |
|
- **UI-TARS-7B**: 92.2% overall |
|
- **UI-TARS-72B**: 88.3% overall |
|
|
|
### ScreenSpot-Pro Baselines |
|
- **Qwen2.5-VL-3B-Instruct**: 16.1% overall |
|
- **Qwen2.5-VL-7B-Instruct**: 26.8% overall |
|
- **Qwen2.5-VL-72B-Instruct**: 53.3% overall |
|
- **UI-TARS-2B**: 27.7% overall |
|
- **UI-TARS-7B**: 35.7% overall |
|
- **UI-TARS-72B**: 38.1% overall |
|
|
|
### ShowdownClicks Baselines
|
- **Qwen2.5-VL-72B-Instruct**: 24.8% overall |
|
- **UI-TARS-72B-SFT**: 54.4% overall |
|
- **Molmo-72B-0924**: 54.8% overall |
|
|
|
## Checkpoint Handling |
|
|
|
- The app automatically identifies the best performing checkpoint for each model |
|
- If multiple checkpoints exist, only the best one is shown in the main charts |
|
- An asterisk (*) indicates when the best checkpoint is not the last one |
|
- Use the "Checkpoint Progression Analysis" to explore all checkpoints |
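The selection rule above can be sketched as follows, assuming each checkpoint record carries hypothetical `step` and `accuracy` keys:

```python
def best_checkpoint(checkpoints: list[dict]) -> tuple[dict, str]:
    """Pick the highest-accuracy checkpoint; star it if it is not the last.

    Assumes each checkpoint dict has "step" and "accuracy" keys.
    Returns the chosen checkpoint and "*" when it is not the final one.
    """
    ordered = sorted(checkpoints, key=lambda c: c["step"])
    best = max(ordered, key=lambda c: c["accuracy"])
    marker = "" if best is ordered[-1] else "*"
    return best, marker
```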
|
|
|
## Caching |
|
|
|
Results are cached in memory for 5 minutes to improve performance. The cache automatically refreshes to show new evaluation results. |
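In a Streamlit app this kind of time-limited cache is typically provided by `st.cache_data(ttl=300)`. As a minimal pure-Python sketch of the same idea (not the app's implementation):

```python
import time
from functools import wraps

def ttl_cache(seconds: float):
    """Minimal TTL cache: re-run the wrapped function once an entry expires.

    Illustrates the 5-minute caching idea; in the app itself this role
    would be played by Streamlit's @st.cache_data(ttl=300).
    """
    def decorator(fn):
        store = {}  # args -> (timestamp, value)

        @wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and now - hit[0] < seconds:
                return hit[1]  # still fresh: serve the cached value
            value = fn(*args)
            store[args] = (now, value)
            return value
        return wrapper
    return decorator
```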
|
|