---
title: Leaderboard Viewer
emoji: 🚀
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
  - streamlit
pinned: false
short_description: Streamlit template space
---

# Grounding Benchmark Leaderboard Viewer

A Streamlit application for visualizing model performance on grounding benchmarks.

## Features

- **Real-time Data**: Streams results directly from the HuggingFace leaderboard repository without local storage
- **Interactive Visualizations**: Bar charts comparing model performance across different metrics
- **Baseline Comparisons**: Shows baseline models (Qwen2-VL, UI-TARS) alongside evaluated models
- **Best Checkpoint Selection**: Automatically shows the best-performing checkpoint for each model (marked with `*` if it is not the last checkpoint)
- **UI Type Breakdown**:
  - For ScreenSpot-v2: comprehensive charts showing Overall, Desktop, Web, and individual UI type performance
  - For other datasets: Desktop vs. Web and Text vs. Icon performance
- **Checkpoint Progression Analysis**: Visualize how metrics evolve during training
- **Model Details**: View training loss, checkpoint steps, and evaluation timestamps

## Installation

1. Clone or download this directory
2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

## Running the App

```bash
streamlit run src/streamlit_app.py
```

The app will open in your browser at http://localhost:8501.

## Usage

1. **Select Dataset**: Use the sidebar to choose which benchmark dataset to view (e.g., screenspot-v2, screenspot-pro)
2. **Filter Models**: Optionally filter to a specific model, or view all models
3. **View Charts**:
   - For ScreenSpot-v2:
     - Overall performance (average of desktop and web)
     - Desktop and Web averages
     - Individual UI type metrics: Desktop (Text), Desktop (Icon), Web (Text), Web (Icon)
     - Text and Icon averages across environments
   - Baseline model comparisons are shown in orange
   - Models marked with `*` indicate that the best checkpoint is not the last one
4. **Explore Details**:
   - Expand "Model Details" to see training metadata
   - Expand "Detailed UI Type Breakdown" for a comprehensive table
   - Expand "Checkpoint Progression Analysis" to:
     - View accuracy progression over training steps
     - See the relationship between training loss and accuracy
     - Compare performance across checkpoints
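The ScreenSpot-v2 averaging described above can be sketched as follows. The metric names here are illustrative assumptions, not the app's actual schema:

```python
# Illustrative sketch of the ScreenSpot-v2 derived averages.
# The metric keys below are assumptions, not the app's actual schema.
metrics = {
    "desktop_text": 90.0,
    "desktop_icon": 80.0,
    "web_text": 88.0,
    "web_icon": 70.0,
}

desktop_avg = (metrics["desktop_text"] + metrics["desktop_icon"]) / 2
web_avg = (metrics["web_text"] + metrics["web_icon"]) / 2
overall = (desktop_avg + web_avg) / 2  # overall = average of desktop and web
text_avg = (metrics["desktop_text"] + metrics["web_text"]) / 2
icon_avg = (metrics["desktop_icon"] + metrics["web_icon"]) / 2

print(overall)  # 82.0
```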

## Data Source

The app streams data directly from the HuggingFace dataset repository:

- Repository: `mlfoundations-cua-dev/leaderboard`
- Path: `grounding/[dataset_name]/[model_results].json`

## Streaming Approach

To minimize local storage requirements, the app:

- Streams JSON files directly from HuggingFace Hub
- Extracts only the data needed for visualization
- Discards the full JSON after processing
- Caches the extracted data in memory for 5 minutes
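A minimal sketch of this streaming pattern, assuming the repository layout above. The helper names and the use of the Hub's raw-file `resolve` endpoint are illustrative, not the app's actual code:

```python
import json
import urllib.request

REPO = "mlfoundations-cua-dev/leaderboard"


def result_path(dataset_name: str, result_file: str) -> str:
    """Build the in-repo path for one model's results JSON."""
    return f"grounding/{dataset_name}/{result_file}.json"


def stream_results(dataset_name: str, result_file: str) -> dict:
    """Stream one results file from the Hub without saving it to disk.

    Fetches via the Hub's raw-file `resolve` endpoint; the JSON is
    parsed in memory and never written locally.
    """
    url = (
        f"https://huggingface.co/datasets/{REPO}"
        f"/resolve/main/{result_path(dataset_name, result_file)}"
    )
    with urllib.request.urlopen(url) as resp:  # network call
        return json.load(resp)
```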

## Supported Datasets

- **ScreenSpot-v2**: Web and desktop UI element grounding (with special handling for desktop/web averaging)
- **ScreenSpot-Pro**: Professional UI grounding benchmark
- **ShowdownClicks**: Click prediction benchmark
- And more as they are added to the leaderboard

## Baseline Models

The dashboard includes baseline performance from established models:

### ScreenSpot-v2 Baselines

- Qwen2-VL-7B: 38.0% overall
- UI-TARS-2B: 82.8% overall
- UI-TARS-7B: 92.2% overall
- UI-TARS-72B: 88.3% overall

### ScreenSpot-Pro Baselines

- Qwen2.5-VL-3B-Instruct: 16.1% overall
- Qwen2.5-VL-7B-Instruct: 26.8% overall
- Qwen2.5-VL-72B-Instruct: 53.3% overall
- UI-TARS-2B: 27.7% overall
- UI-TARS-7B: 35.7% overall
- UI-TARS-72B: 38.1% overall

### ShowDown-Clicks Baselines

- Qwen2.5-VL-72B-Instruct: 24.8% overall
- UI-TARS-72B-SFT: 54.4% overall
- Molmo-72B-0924: 54.8% overall

## Checkpoint Handling

- The app automatically identifies the best-performing checkpoint for each model
- If multiple checkpoints exist, only the best one is shown in the main charts
- An asterisk (`*`) indicates that the best checkpoint is not the last one
- Use the "Checkpoint Progression Analysis" to explore all checkpoints
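The selection rule above can be sketched as follows; the checkpoint record fields (`step`, `accuracy`) are assumptions, not the app's actual schema:

```python
# Illustrative sketch of best-checkpoint selection and asterisk marking.
# The record fields ("step", "accuracy") are assumed, not the app's schema.
def best_checkpoint(checkpoints: list[dict]) -> tuple[dict, str]:
    """Return the best checkpoint and a display label.

    The label gets a trailing `*` when the best checkpoint is not the
    one with the highest training step.
    """
    best = max(checkpoints, key=lambda c: c["accuracy"])
    last = max(checkpoints, key=lambda c: c["step"])
    marker = "" if best is last else "*"
    return best, f"step {best['step']}{marker}"


ckpts = [
    {"step": 1000, "accuracy": 78.5},
    {"step": 2000, "accuracy": 81.2},
    {"step": 3000, "accuracy": 80.4},
]
best, label = best_checkpoint(ckpts)
print(label)  # step 2000* — the best checkpoint is not the last one
```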

## Caching

Results are cached in memory for 5 minutes to improve performance. The cache automatically refreshes to show new evaluation results.
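In a Streamlit app this kind of time-based cache is typically expressed with `st.cache_data(ttl=300)`; whether this app uses that decorator is an assumption. A plain-Python sketch of the same idea:

```python
import time

TTL_SECONDS = 300  # 5 minutes, matching the cache lifetime described above

_cache: dict[str, tuple[float, object]] = {}


def cached_fetch(key: str, fetch, now=time.monotonic):
    """Return the cached value for `key`, refetching once the entry is
    older than TTL_SECONDS. `fetch` is any zero-argument loader; `now`
    is injectable for testing."""
    entry = _cache.get(key)
    if entry is not None and now() - entry[0] < TTL_SECONDS:
        return entry[1]
    value = fetch()
    _cache[key] = (now(), value)
    return value


calls = []
value1 = cached_fetch("screenspot-v2", lambda: calls.append(1) or {"rows": 3})
value2 = cached_fetch("screenspot-v2", lambda: calls.append(1) or {"rows": 3})
print(len(calls))  # 1 — the second call is served from the cache
```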