---
title: Leaderboard Viewer
emoji: π
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
  - streamlit
pinned: false
short_description: Streamlit template space
---
# Grounding Benchmark Leaderboard Viewer

A Streamlit application for visualizing model performance on grounding benchmarks.
## Features

- **Real-time Data**: Streams results directly from the HuggingFace leaderboard repository without local storage
- **Interactive Visualizations**: Bar charts comparing model performance across different metrics
- **Baseline Comparisons**: Shows baseline models (Qwen2-VL, UI-TARS) alongside evaluated models
- **Best Checkpoint Selection**: Automatically shows each model's best-performing checkpoint (marked with `*` when it is not the last checkpoint)
- **UI Type Breakdown**:
  - For ScreenSpot-v2: comprehensive charts showing Overall, Desktop, Web, and individual UI type performance
  - For other datasets: Desktop vs. Web and Text vs. Icon performance
- **Checkpoint Progression Analysis**: Visualize how metrics evolve during training
- **Model Details**: View training loss, checkpoint steps, and evaluation timestamps
## Installation

1. Clone or download this directory
2. Install dependencies:

```bash
pip install -r requirements.txt
```
## Running the App

```bash
streamlit run src/streamlit_app.py
```

The app will open in your browser at `http://localhost:8501`.
Usage
Select Dataset: Use the sidebar to choose which benchmark dataset to view (e.g., screenspot-v2, screenspot-pro)
Filter Models: Optionally filter to view a specific model or all models
View Charts:
- For ScreenSpot-v2:
- Overall performance (average of desktop and web)
- Desktop and Web averages
- Individual UI type metrics: Desktop (Text), Desktop (Icon), Web (Text), Web (Icon)
- Text and Icon averages across environments
- Baseline model comparisons shown in orange
- Models marked with * indicate the best checkpoint is not the last one
- For ScreenSpot-v2:
Explore Details:
- Expand "Model Details" to see training metadata
- Expand "Detailed UI Type Breakdown" for a comprehensive table
- Expand "Checkpoint Progression Analysis" to:
- View accuracy progression over training steps
- See the relationship between training loss and accuracy
- Compare performance across checkpoints
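The progression view is driven by a small table of per-checkpoint records. A minimal sketch using pandas, with hypothetical field names and values for illustration:

```python
import pandas as pd

# Hypothetical per-checkpoint records, as extracted from the results JSONs.
records = [
    {"step": 300, "train_loss": 0.88, "accuracy": 0.66},
    {"step": 100, "train_loss": 1.20, "accuracy": 0.61},
    {"step": 200, "train_loss": 0.95, "accuracy": 0.68},
]

# Sort by training step so the line chart reads left to right.
progression = pd.DataFrame(records).sort_values("step")

# In the app, something like st.line_chart(progression.set_index("step")["accuracy"])
# renders the accuracy-over-steps view, and a scatter of train_loss vs. accuracy
# shows their relationship.
```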
## Data Source

The app streams data directly from the HuggingFace dataset repository:

- Repository: `mlfoundations-cua-dev/leaderboard`
- Path: `grounding/[dataset_name]/[model_results].json`
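Concretely, a single results file can be streamed with the standard library alone, using the Hub's `resolve/main` raw-file URL scheme (the file name below is a hypothetical placeholder):

```python
import json
from urllib.request import urlopen

HUB = "https://huggingface.co/datasets/mlfoundations-cua-dev/leaderboard/resolve/main"

def results_url(dataset_name: str, results_file: str) -> str:
    """Build the raw-file URL for one results JSON in the leaderboard repo."""
    return f"{HUB}/grounding/{dataset_name}/{results_file}"

def load_results(dataset_name: str, results_file: str) -> dict:
    """Stream a results file straight from the Hub; nothing is written to disk."""
    with urlopen(results_url(dataset_name, results_file)) as resp:
        return json.load(resp)
```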
## Streaming Approach

To minimize local storage requirements, the app:

- Streams JSON files directly from HuggingFace Hub
- Extracts only the data needed for visualization
- Discards the full JSON after processing
- Caches the extracted data in memory for 5 minutes
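The extract-and-discard step can be sketched as a pure function; the key names here are hypothetical stand-ins for whatever the results JSONs actually contain:

```python
def extract_metrics(raw: dict) -> dict:
    """Keep only the fields the charts need.

    The (possibly large) `raw` dict is not retained anywhere, so it is
    garbage-collected once this function returns.
    """
    return {
        "model": raw.get("model_name"),
        "checkpoint": raw.get("checkpoint_step"),
        "accuracy": raw.get("metrics", {}).get("overall_accuracy"),
    }
```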
## Supported Datasets

- **ScreenSpot-v2**: Web and desktop UI element grounding (with special handling for desktop/web averaging)
- **ScreenSpot-Pro**: Professional UI grounding benchmark
- **ShowdownClicks**: Click prediction benchmark
- More as they are added to the leaderboard
## Baseline Models

The dashboard includes baseline performance from established models:

### ScreenSpot-v2 Baselines

- Qwen2-VL-7B: 38.0% overall
- UI-TARS-2B: 82.8% overall
- UI-TARS-7B: 92.2% overall
- UI-TARS-72B: 88.3% overall

### ScreenSpot-Pro Baselines

- Qwen2.5-VL-3B-Instruct: 16.1% overall
- Qwen2.5-VL-7B-Instruct: 26.8% overall
- Qwen2.5-VL-72B-Instruct: 53.3% overall
- UI-TARS-2B: 27.7% overall
- UI-TARS-7B: 35.7% overall
- UI-TARS-72B: 38.1% overall

### ShowdownClicks Baselines

- Qwen2.5-VL-72B-Instruct: 24.8% overall
- UI-TARS-72B-SFT: 54.4% overall
- Molmo-72B-0924: 54.8% overall
## Checkpoint Handling

- The app automatically identifies the best-performing checkpoint for each model
- If multiple checkpoints exist, only the best one appears in the main charts
- An asterisk (`*`) indicates that the best checkpoint is not the last one
- Use the "Checkpoint Progression Analysis" to explore all checkpoints
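The selection-and-starring logic amounts to a few lines; a sketch assuming each checkpoint record carries hypothetical `"step"` and `"accuracy"` keys and the list is sorted by training step:

```python
def best_checkpoint(checkpoints: list[dict]) -> tuple[dict, str]:
    """Pick the highest-accuracy checkpoint and build its chart label.

    The label gets a trailing "*" when the best checkpoint is not the
    final (last-trained) one.
    """
    best = max(checkpoints, key=lambda c: c["accuracy"])
    suffix = "" if best is checkpoints[-1] else "*"
    return best, f"step {best['step']}{suffix}"
```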
## Caching

Results are cached in memory for 5 minutes to improve performance; the cache refreshes automatically so new evaluation results appear without a restart.
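In a Streamlit app, a 5-minute in-memory cache is typically expressed as `@st.cache_data(ttl=300)` on the loader function. The behavior can be sketched framework-free as a small TTL memoizer (positional-args only, for brevity):

```python
import time
from functools import wraps

def ttl_cache(seconds: float):
    """Memoize a function's results for `seconds`, like st.cache_data(ttl=...)."""
    def deco(fn):
        state = {}  # args tuple -> (time cached, value)

        @wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = state.get(args)
            if hit is not None and now - hit[0] < seconds:
                return hit[1]  # still fresh: serve from memory
            value = fn(*args)  # stale or missing: recompute and re-cache
            state[args] = (now, value)
            return value

        return wrapper
    return deco
```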