---
title: Leaderboard Viewer
emoji: 🚀
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
  - streamlit
pinned: false
short_description: Streamlit template space
---

# Grounding Benchmark Leaderboard Viewer

A Streamlit application for visualizing model performance on grounding benchmarks.

## Features

- **Real-time Data**: Streams results directly from the HuggingFace leaderboard repository without local storage
- **Interactive Visualizations**: Bar charts comparing model performance across different metrics
- **Baseline Comparisons**: Shows baseline models (Qwen2-VL, UI-TARS) alongside evaluated models
- **Best Checkpoint Selection**: Automatically shows the best-performing checkpoint for each model (marked with `*` if it is not the last checkpoint)
- **UI Type Breakdown**:
  - For ScreenSpot-v2: comprehensive charts showing Overall, Desktop, Web, and individual UI type performance
  - For other datasets: Desktop vs. Web and Text vs. Icon performance
- **Checkpoint Progression Analysis**: Visualize how metrics evolve during training
- **Model Details**: View training loss, checkpoint steps, and evaluation timestamps

## Installation

1. Clone or download this directory
2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

## Running the App

```bash
streamlit run src/streamlit_app.py
```

The app will open in your browser at http://localhost:8501.

## Usage

1. **Select Dataset**: Use the sidebar to choose which benchmark dataset to view (e.g., screenspot-v2, screenspot-pro)
2. **Filter Models**: Optionally filter to a specific model, or view all models
3. **View Charts**:
   - For ScreenSpot-v2:
     - Overall performance (average of desktop and web)
     - Desktop and Web averages
     - Individual UI type metrics: Desktop (Text), Desktop (Icon), Web (Text), Web (Icon)
     - Text and Icon averages across environments
   - Baseline model comparisons are shown in orange
   - Models marked with `*` indicate that the best checkpoint is not the last one
4. **Explore Details**:
   - Expand "Model Details" to see training metadata
   - Expand "Detailed UI Type Breakdown" for a comprehensive table
   - Expand "Checkpoint Progression Analysis" to:
     - View accuracy progression over training steps
     - See the relationship between training loss and accuracy
     - Compare performance across checkpoints
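The ScreenSpot-v2 averaging described above can be sketched as follows. The metric names here are illustrative assumptions, not the app's actual schema:

```python
# Illustrative sketch of the ScreenSpot-v2 derived averages.
# The metric keys below are assumptions, not the app's actual schema.
metrics = {
    "desktop_text": 90.0,
    "desktop_icon": 80.0,
    "web_text": 88.0,
    "web_icon": 70.0,
}

desktop_avg = (metrics["desktop_text"] + metrics["desktop_icon"]) / 2
web_avg = (metrics["web_text"] + metrics["web_icon"]) / 2
overall = (desktop_avg + web_avg) / 2  # overall = average of desktop and web
text_avg = (metrics["desktop_text"] + metrics["web_text"]) / 2
icon_avg = (metrics["desktop_icon"] + metrics["web_icon"]) / 2

print(overall)  # 82.0
```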

## Data Source

The app streams data directly from the HuggingFace dataset repository:

- Repository: `mlfoundations-cua-dev/leaderboard`
- Path: `grounding/[dataset_name]/[model_results].json`

## Streaming Approach

To minimize local storage requirements, the app:

- Streams JSON files directly from HuggingFace Hub
- Extracts only the data needed for visualization
- Discards the full JSON after processing
- Caches the extracted data in memory for 5 minutes
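A minimal sketch of this streaming pattern, assuming the repository layout above. The helper names and the use of the Hub's raw-file `resolve` endpoint are illustrative, not the app's actual code:

```python
import json
import urllib.request

REPO = "mlfoundations-cua-dev/leaderboard"


def result_path(dataset_name: str, result_file: str) -> str:
    """Build the in-repo path for one model's results JSON."""
    return f"grounding/{dataset_name}/{result_file}.json"


def stream_results(dataset_name: str, result_file: str) -> dict:
    """Stream one results file from the Hub without saving it to disk.

    Fetches via the Hub's raw-file `resolve` endpoint; the JSON is
    parsed in memory and never written locally.
    """
    url = (
        f"https://huggingface.co/datasets/{REPO}"
        f"/resolve/main/{result_path(dataset_name, result_file)}"
    )
    with urllib.request.urlopen(url) as resp:  # network call
        return json.load(resp)
```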

## Supported Datasets

- **ScreenSpot-v2**: Web and desktop UI element grounding (with special handling for desktop/web averaging)
- **ScreenSpot-Pro**: Professional UI grounding benchmark
- **ShowdownClicks**: Click prediction benchmark
- And more as they are added to the leaderboard

## Baseline Models

The dashboard includes baseline performance from established models:

### ScreenSpot-v2 Baselines

- Qwen2-VL-7B: 38.0% overall
- UI-TARS-2B: 82.8% overall
- UI-TARS-7B: 92.2% overall
- UI-TARS-72B: 88.3% overall

### ScreenSpot-Pro Baselines

- Qwen2.5-VL-3B-Instruct: 16.1% overall
- Qwen2.5-VL-7B-Instruct: 26.8% overall
- Qwen2.5-VL-72B-Instruct: 53.3% overall
- UI-TARS-2B: 27.7% overall
- UI-TARS-7B: 35.7% overall
- UI-TARS-72B: 38.1% overall

### ShowDown-Clicks Baselines

- Qwen2.5-VL-72B-Instruct: 24.8% overall
- UI-TARS-72B-SFT: 54.4% overall
- Molmo-72B-0924: 54.8% overall

## Checkpoint Handling

- The app automatically identifies the best-performing checkpoint for each model
- If multiple checkpoints exist, only the best one is shown in the main charts
- An asterisk (`*`) indicates that the best checkpoint is not the last one
- Use the "Checkpoint Progression Analysis" to explore all checkpoints
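The selection rule above can be sketched as follows; the checkpoint record fields (`step`, `accuracy`) are assumptions, not the app's actual schema:

```python
# Illustrative sketch of best-checkpoint selection and asterisk marking.
# The record fields ("step", "accuracy") are assumed, not the app's schema.
def best_checkpoint(checkpoints: list[dict]) -> tuple[dict, str]:
    """Return the best checkpoint and a display label.

    The label gets a trailing `*` when the best checkpoint is not the
    one with the highest training step.
    """
    best = max(checkpoints, key=lambda c: c["accuracy"])
    last = max(checkpoints, key=lambda c: c["step"])
    marker = "" if best is last else "*"
    return best, f"step {best['step']}{marker}"


ckpts = [
    {"step": 1000, "accuracy": 78.5},
    {"step": 2000, "accuracy": 81.2},
    {"step": 3000, "accuracy": 80.4},
]
best, label = best_checkpoint(ckpts)
print(label)  # step 2000* — the best checkpoint is not the last one
```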

## Caching

Results are cached in memory for 5 minutes to improve performance. The cache automatically refreshes to show new evaluation results.
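In a Streamlit app this kind of time-based cache is typically expressed with `st.cache_data(ttl=300)`; whether this app uses that decorator is an assumption. A plain-Python sketch of the same idea:

```python
import time

TTL_SECONDS = 300  # 5 minutes, matching the cache lifetime described above

_cache: dict[str, tuple[float, object]] = {}


def cached_fetch(key: str, fetch, now=time.monotonic):
    """Return the cached value for `key`, refetching once the entry is
    older than TTL_SECONDS. `fetch` is any zero-argument loader; `now`
    is injectable for testing."""
    entry = _cache.get(key)
    if entry is not None and now() - entry[0] < TTL_SECONDS:
        return entry[1]
    value = fetch()
    _cache[key] = (now(), value)
    return value


calls = []
value1 = cached_fetch("screenspot-v2", lambda: calls.append(1) or {"rows": 3})
value2 = cached_fetch("screenspot-v2", lambda: calls.append(1) or {"rows": 3})
print(len(calls))  # 1 — the second call is served from the cache
```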