---
title: Leaderboard Viewer
emoji: 🚀
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
  - streamlit
pinned: false
short_description: Streamlit template space
---

# Grounding Benchmark Leaderboard Viewer

A Streamlit application for visualizing model performance on grounding benchmarks.

## Features

- **Real-time Data**: Streams results directly from the HuggingFace leaderboard repository without local storage
- **Interactive Visualizations**: Bar charts comparing model performance across different metrics
- **Baseline Comparisons**: Shows baseline models (Qwen2-VL, UI-TARS) alongside evaluated models
- **Best Checkpoint Selection**: Automatically shows the best-performing checkpoint for each model (marked with `*` if it is not the last checkpoint)
- **UI Type Breakdown**:
  - For ScreenSpot-v2: comprehensive charts showing Overall, Desktop, Web, and individual UI type performance
  - For other datasets: Desktop vs. Web and Text vs. Icon performance
- **Checkpoint Progression Analysis**: Visualize how metrics evolve during training
- **Model Details**: View training loss, checkpoint steps, and evaluation timestamps

## Installation

1. Clone or download this directory
2. Install dependencies:

```bash
pip install -r requirements.txt
```

## Running the App

```bash
streamlit run src/streamlit_app.py
```

The app will open in your browser at `http://localhost:8501`.

## Usage

1. **Select Dataset**: Use the sidebar to choose which benchmark dataset to view (e.g., screenspot-v2, screenspot-pro)
2. **Filter Models**: Optionally filter to view a specific model or all models
3. **View Charts**:
   - For ScreenSpot-v2:
     - Overall performance (average of desktop and web)
     - Desktop and Web averages
     - Individual UI type metrics: Desktop (Text), Desktop (Icon), Web (Text), Web (Icon)
     - Text and Icon averages across environments
   - Baseline model comparisons are shown in orange
   - Models marked with `*` indicate the best checkpoint is not the last one
4.
   **Explore Details**:
   - Expand "Model Details" to see training metadata
   - Expand "Detailed UI Type Breakdown" for a comprehensive table
   - Expand "Checkpoint Progression Analysis" to:
     - View accuracy progression over training steps
     - See the relationship between training loss and accuracy
     - Compare performance across checkpoints

## Data Source

The app streams data directly from the HuggingFace dataset repository:

- Repository: `mlfoundations-cua-dev/leaderboard`
- Path: `grounding/[dataset_name]/[model_results].json`

## Streaming Approach

To minimize local storage requirements, the app:

- Streams JSON files directly from HuggingFace Hub
- Extracts only the necessary data for visualization
- Discards the full JSON after processing
- Caches the extracted data in memory for 5 minutes

## Supported Datasets

- **ScreenSpot-v2**: Web and desktop UI element grounding (with special handling for desktop/web averaging)
- **ScreenSpot-Pro**: Professional UI grounding benchmark
- **ShowdownClicks**: Click prediction benchmark
- And more as they are added to the leaderboard

## Baseline Models

The dashboard includes baseline performance from established models:

### ScreenSpot-v2 Baselines

- **Qwen2-VL-7B**: 38.0% overall
- **UI-TARS-2B**: 82.8% overall
- **UI-TARS-7B**: 92.2% overall
- **UI-TARS-72B**: 88.3% overall

### ScreenSpot-Pro Baselines

- **Qwen2.5-VL-3B-Instruct**: 16.1% overall
- **Qwen2.5-VL-7B-Instruct**: 26.8% overall
- **Qwen2.5-VL-72B-Instruct**: 53.3% overall
- **UI-TARS-2B**: 27.7% overall
- **UI-TARS-7B**: 35.7% overall
- **UI-TARS-72B**: 38.1% overall

### ShowDown-Clicks Baselines

- **Qwen2.5-VL-72B-Instruct**: 24.8% overall
- **UI-TARS-72B-SFT**: 54.4% overall
- **Molmo-72B-0924**: 54.8% overall

## Checkpoint Handling

- The app automatically identifies the best-performing checkpoint for each model
- If multiple checkpoints exist, only the best one is shown in the main charts
- An asterisk (`*`) indicates when the best checkpoint is not the last one
- Use the
"Checkpoint Progression Analysis" to explore all checkpoints

## Caching

Results are cached in memory for 5 minutes to improve performance. The cache automatically refreshes to show new evaluation results.
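The best-checkpoint selection and asterisk marking described under "Checkpoint Handling" can be sketched in plain Python. This is a minimal illustration, not the app's actual code: the field names `step` and `accuracy` are assumed for checkpoint records and may differ from the real result schema.

```python
# Sketch of best-checkpoint selection. Field names ("step", "accuracy")
# are assumptions about the result schema, not the app's actual format.

def select_best_checkpoint(checkpoints):
    """Return (best checkpoint, whether it is also the last one)."""
    ordered = sorted(checkpoints, key=lambda c: c["step"])
    best = max(ordered, key=lambda c: c["accuracy"])
    return best, best is ordered[-1]

def display_name(model, checkpoints):
    """Append '*' when the best checkpoint is not the last checkpoint."""
    _, is_last = select_best_checkpoint(checkpoints)
    return model if is_last else f"{model}*"
```

For example, a model whose accuracy peaks at step 200 of a 300-step run would be labeled `my-model*` in the main charts, while its full trajectory remains visible in the Checkpoint Progression Analysis view.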
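The stream-extract-discard-cache flow described in "Streaming Approach" and "Caching" can be sketched as follows. The real app runs inside Streamlit (where `st.cache_data(ttl=300)` would provide the 5-minute cache); this stdlib-only sketch uses a hand-rolled TTL cache instead, and the result-file fields (`model`, `metrics.overall`) are hypothetical, not the actual schema.

```python
# Illustrative sketch only: stream a result JSON from the Hub, keep just the
# fields the charts need, and cache them in memory for 5 minutes.
import json
import time
from urllib.request import urlopen

REPO = "mlfoundations-cua-dev/leaderboard"
_TTL = 300                # seconds; matches the app's 5-minute cache
_CACHE = {}               # path -> (fetch timestamp, extracted data)

def extract_metrics(results):
    """Keep only what the charts need; the full JSON is then discarded.
    The "model" / "metrics.overall" fields are assumed, not the real schema."""
    return {
        "model": results.get("model"),
        "overall": results.get("metrics", {}).get("overall"),
    }

def load_result(dataset, filename):
    """Fetch grounding/<dataset>/<filename> from the Hub, with a TTL cache."""
    path = f"grounding/{dataset}/{filename}"
    hit = _CACHE.get(path)
    if hit and time.time() - hit[0] < _TTL:
        return hit[1]
    url = f"https://huggingface.co/datasets/{REPO}/resolve/main/{path}"
    with urlopen(url) as resp:            # streamed; nothing written to disk
        data = extract_metrics(json.load(resp))
    _CACHE[path] = (time.time(), data)
    return data
```

Because only the extracted dictionary is cached, memory usage stays proportional to the number of charted metrics rather than the size of the raw result files.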