---
title: Leaderboard Viewer
emoji: πŸš€
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
- streamlit
pinned: false
short_description: Grounding benchmark leaderboard viewer
---
# Grounding Benchmark Leaderboard Viewer
A Streamlit application for visualizing model performance on grounding benchmarks.
## Features
- **Real-time Data**: Streams results directly from the HuggingFace leaderboard repository without local storage
- **Interactive Visualizations**: Bar charts comparing model performance across different metrics
- **Baseline Comparisons**: Shows baseline models (Qwen2-VL, UI-TARS) alongside evaluated models
- **Best Checkpoint Selection**: Automatically shows the best performing checkpoint for each model (marked with * if not the last checkpoint)
- **UI Type Breakdown**:
  - For ScreenSpot-v2: comprehensive charts showing Overall, Desktop, Web, and individual UI type performance
  - For other datasets: Desktop vs. Web and Text vs. Icon performance
- **Checkpoint Progression Analysis**: Visualize how metrics evolve during training
- **Model Details**: View training loss, checkpoint steps, and evaluation timestamps
## Installation
1. Clone or download this directory
2. Install dependencies:
```bash
pip install -r requirements.txt
```
## Running the App
```bash
streamlit run src/streamlit_app.py
```
The app will open in your browser at `http://localhost:8501`.
## Usage
1. **Select Dataset**: Use the sidebar to choose which benchmark dataset to view (e.g., screenspot-v2, screenspot-pro)
2. **Filter Models**: Optionally filter to view a specific model or all models
3. **View Charts**:
   - For ScreenSpot-v2:
     - Overall performance (average of desktop and web)
     - Desktop and Web averages
     - Individual UI type metrics: Desktop (Text), Desktop (Icon), Web (Text), Web (Icon)
     - Text and Icon averages across environments
   - Baseline model comparisons are shown in orange
   - An asterisk (*) marks models whose best checkpoint is not the last one
4. **Explore Details**:
   - Expand "Model Details" to see training metadata
   - Expand "Detailed UI Type Breakdown" for a comprehensive table
   - Expand "Checkpoint Progression Analysis" to:
     - View accuracy progression over training steps
     - See the relationship between training loss and accuracy
     - Compare performance across checkpoints
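The ScreenSpot-v2 averaging described above can be sketched as a small helper. The per-UI-type key names (`desktop_text`, etc.) are illustrative assumptions, not the app's actual result schema:

```python
def screenspot_v2_summary(scores: dict) -> dict:
    """Aggregate per-UI-type accuracies (in percent) the way the charts do.

    Key names such as 'desktop_text' are assumed for illustration.
    """
    desktop = (scores["desktop_text"] + scores["desktop_icon"]) / 2
    web = (scores["web_text"] + scores["web_icon"]) / 2
    return {
        "desktop": desktop,
        "web": web,
        # Overall is the average of the desktop and web averages.
        "overall": (desktop + web) / 2,
        # Text and Icon averages across both environments.
        "text": (scores["desktop_text"] + scores["web_text"]) / 2,
        "icon": (scores["desktop_icon"] + scores["web_icon"]) / 2,
    }
```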
## Data Source
The app streams data directly from the HuggingFace dataset repository:
- Repository: `mlfoundations-cua-dev/leaderboard`
- Path: `grounding/[dataset_name]/[model_results].json`
## Streaming Approach
To minimize local storage requirements, the app:
- Streams JSON files directly from HuggingFace Hub
- Extracts only the necessary data for visualization
- Discards the full JSON after processing
- Caches the extracted data in memory for 5 minutes
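A minimal sketch of this streaming flow using only the standard library. The repository path follows the layout above; the field names in `extract_metrics` are assumptions about the result JSON, not the app's actual schema:

```python
import json
from urllib.request import urlopen

REPO = "mlfoundations-cua-dev/leaderboard"

def result_url(dataset_name: str, filename: str) -> str:
    # Hugging Face serves raw dataset files over HTTP at "resolve" URLs.
    return (
        f"https://huggingface.co/datasets/{REPO}"
        f"/resolve/main/grounding/{dataset_name}/{filename}"
    )

def extract_metrics(record: dict) -> dict:
    # Keep only the fields the charts need; the rest of the JSON is
    # discarded. Field names here are illustrative assumptions.
    return {
        "model": record.get("model_name"),
        "checkpoint": record.get("checkpoint_step"),
        "accuracy": record.get("accuracy"),
    }

def stream_result(dataset_name: str, filename: str) -> dict:
    # Parse the response in memory; nothing is written to disk.
    with urlopen(result_url(dataset_name, filename)) as resp:
        return extract_metrics(json.load(resp))
```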
## Supported Datasets
- **ScreenSpot-v2**: Web and desktop UI element grounding (with special handling for desktop/web averaging)
- **ScreenSpot-Pro**: Professional UI grounding benchmark
- **ShowDown-Clicks**: Click prediction benchmark
- And more as they are added to the leaderboard
## Baseline Models
The dashboard includes baseline performance from established models:
### ScreenSpot-v2 Baselines
- **Qwen2-VL-7B**: 38.0% overall
- **UI-TARS-2B**: 82.8% overall
- **UI-TARS-7B**: 92.2% overall
- **UI-TARS-72B**: 88.3% overall
### ScreenSpot-Pro Baselines
- **Qwen2.5-VL-3B-Instruct**: 16.1% overall
- **Qwen2.5-VL-7B-Instruct**: 26.8% overall
- **Qwen2.5-VL-72B-Instruct**: 53.3% overall
- **UI-TARS-2B**: 27.7% overall
- **UI-TARS-7B**: 35.7% overall
- **UI-TARS-72B**: 38.1% overall
### ShowDown-Clicks Baselines
- **Qwen2.5-VL-72B-Instruct**: 24.8% overall
- **UI-TARS-72B-SFT**: 54.4% overall
- **Molmo-72B-0924**: 54.8% overall
## Checkpoint Handling
- The app automatically identifies the best performing checkpoint for each model
- If multiple checkpoints exist, only the best one is shown in the main charts
- An asterisk (*) indicates when the best checkpoint is not the last one
- Use the "Checkpoint Progression Analysis" to explore all checkpoints
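The selection logic above can be sketched as follows, with checkpoints represented as `(step, accuracy)` pairs. This is an illustrative stand-in, not the app's actual code:

```python
def best_checkpoint(checkpoints: list[tuple[int, float]]) -> tuple[tuple[int, float], bool]:
    """Return the highest-accuracy checkpoint and whether it is also the last one."""
    best = max(checkpoints, key=lambda c: c[1])  # highest accuracy
    last = max(checkpoints, key=lambda c: c[0])  # largest training step
    return best, best[0] == last[0]

def chart_label(model_name: str, checkpoints: list[tuple[int, float]]) -> str:
    # Append * when the best checkpoint is not the final one.
    _, is_last = best_checkpoint(checkpoints)
    return model_name if is_last else model_name + "*"
```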
## Caching
Results are cached in memory for 5 minutes to improve performance. The cache automatically refreshes to show new evaluation results.
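In a Streamlit app this is typically done by decorating the loading function with `@st.cache_data(ttl=300)`; the same behavior can be sketched framework-free with a small TTL cache:

```python
import time

class TTLCache:
    """In-memory cache whose entries expire after `ttl` seconds (300 = 5 minutes)."""

    def __init__(self, ttl: float = 300.0):
        self.ttl = ttl
        self._store = {}  # key -> (inserted_at, value)

    def get(self, key, loader):
        # Return the cached value if still fresh; otherwise call
        # `loader()` and cache the result.
        now = time.monotonic()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        value = loader()
        self._store[key] = (now, value)
        return value
```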