|
--- |
|
title: Leaderboard Viewer |
|
emoji: π |
|
colorFrom: red |
|
colorTo: red |
|
sdk: docker |
|
app_port: 8501 |
|
tags: |
|
- streamlit |
|
pinned: false |
|
short_description: Grounding benchmark leaderboard viewer
|
--- |
|
|
|
# Grounding Benchmark Leaderboard Viewer |
|
|
|
A Streamlit application for visualizing model performance on grounding benchmarks. |
|
|
|
## Features |
|
|
|
- **Real-time Data**: Streams results directly from the HuggingFace leaderboard repository without local storage |
|
- **Interactive Visualizations**: Bar charts comparing model performance across different metrics |
|
- **Baseline Comparisons**: Shows baseline models (Qwen2-VL, UI-TARS) alongside evaluated models |
|
- **Best Checkpoint Selection**: Automatically shows the best-performing checkpoint for each model (marked with * if it is not the last checkpoint)
|
- **UI Type Breakdown**: |
|
- For ScreenSpot-v2: Comprehensive charts showing Overall, Desktop, Web, and individual UI type performance |
|
- For other datasets: Desktop vs Web and Text vs Icon performance |
|
- **Checkpoint Progression Analysis**: Visualize how metrics evolve during training |
|
- **Model Details**: View training loss, checkpoint steps, and evaluation timestamps |
|
|
|
## Installation |
|
|
|
1. Clone or download this directory |
|
2. Install dependencies: |
|
```bash |
|
pip install -r requirements.txt |
|
``` |
|
|
|
## Running the App |
|
|
|
```bash |
|
streamlit run src/streamlit_app.py |
|
``` |
|
|
|
The app will open in your browser at `http://localhost:8501`.
|
|
|
## Usage |
|
|
|
1. **Select Dataset**: Use the sidebar to choose which benchmark dataset to view (e.g., screenspot-v2, screenspot-pro) |
|
|
|
2. **Filter Models**: Optionally filter to view a specific model or all models |
|
|
|
3. **View Charts**: |
|
- For ScreenSpot-v2: |
|
- Overall performance (average of desktop and web) |
|
- Desktop and Web averages |
|
- Individual UI type metrics: Desktop (Text), Desktop (Icon), Web (Text), Web (Icon) |
|
- Text and Icon averages across environments |
|
- Baseline model comparisons shown in orange |
|
- Models marked with * indicate the best checkpoint is not the last one |
|
|
|
4. **Explore Details**: |
|
- Expand "Model Details" to see training metadata |
|
- Expand "Detailed UI Type Breakdown" for a comprehensive table |
|
- Expand "Checkpoint Progression Analysis" to: |
|
- View accuracy progression over training steps |
|
- See the relationship between training loss and accuracy |
|
- Compare performance across checkpoints |
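The ScreenSpot-v2 aggregation described above (overall as the average of the desktop and web averages, plus text/icon averages across environments) can be sketched in plain Python. The metric names below are illustrative assumptions, not the app's actual result schema:

```python
# Sketch of the ScreenSpot-v2 score aggregation described above.
# The key names ("desktop_text", etc.) are hypothetical.

def aggregate_screenspot_v2(metrics: dict) -> dict:
    """Average per-UI-type accuracies into desktop/web/overall scores."""
    desktop = (metrics["desktop_text"] + metrics["desktop_icon"]) / 2
    web = (metrics["web_text"] + metrics["web_icon"]) / 2
    return {
        "desktop": desktop,
        "web": web,
        # Overall is the average of the desktop and web averages.
        "overall": (desktop + web) / 2,
        # Text and icon averages across both environments.
        "text": (metrics["desktop_text"] + metrics["web_text"]) / 2,
        "icon": (metrics["desktop_icon"] + metrics["web_icon"]) / 2,
    }
```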
|
|
|
## Data Source |
|
|
|
The app streams data directly from the HuggingFace dataset repository: |
|
- Repository: `mlfoundations-cua-dev/leaderboard` |
|
- Path: `grounding/[dataset_name]/[model_results].json` |
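Given that layout, result files can be grouped by dataset with simple path parsing. The helper below is a hypothetical sketch, not code from the app:

```python
# Hypothetical helper: group leaderboard result paths by dataset name,
# following the grounding/[dataset_name]/[model_results].json layout.

def group_by_dataset(paths: list[str]) -> dict[str, list[str]]:
    grouped: dict[str, list[str]] = {}
    for path in paths:
        parts = path.split("/")
        # Expect exactly: ["grounding", dataset_name, "<results>.json"]
        if len(parts) == 3 and parts[0] == "grounding" and parts[2].endswith(".json"):
            grouped.setdefault(parts[1], []).append(path)
    return grouped
```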
|
|
|
## Streaming Approach |
|
|
|
To minimize local storage requirements, the app: |
|
- Streams JSON files directly from HuggingFace Hub |
|
- Extracts only the necessary data for visualization |
|
- Discards the full JSON after processing |
|
- Caches the extracted data in memory for 5 minutes |
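The extract-then-discard step can be sketched in plain Python. The field names below are assumptions about the result schema, not the app's actual keys:

```python
import json

# Fields kept for visualization; everything else in the raw result JSON
# is discarded after parsing. These key names are illustrative assumptions.
KEEP = ("model", "checkpoint", "accuracy", "train_loss", "timestamp")

def extract_for_viz(raw_json: str) -> dict:
    """Parse a streamed result file and keep only the charted fields."""
    record = json.loads(raw_json)
    return {k: record[k] for k in KEEP if k in record}
```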
|
|
|
## Supported Datasets |
|
|
|
- **ScreenSpot-v2**: Web and desktop UI element grounding (with special handling for desktop/web averaging) |
|
- **ScreenSpot-Pro**: Professional UI grounding benchmark |
|
- **ShowdownClicks**: Click prediction benchmark |
|
- And more as they are added to the leaderboard |
|
|
|
## Baseline Models |
|
|
|
The dashboard includes baseline performance from established models: |
|
|
|
### ScreenSpot-v2 Baselines |
|
- **Qwen2-VL-7B**: 38.0% overall |
|
- **UI-TARS-2B**: 82.8% overall |
|
- **UI-TARS-7B**: 92.2% overall |
|
- **UI-TARS-72B**: 88.3% overall |
|
|
|
### ScreenSpot-Pro Baselines |
|
- **Qwen2.5-VL-3B-Instruct**: 16.1% overall |
|
- **Qwen2.5-VL-7B-Instruct**: 26.8% overall |
|
- **Qwen2.5-VL-72B-Instruct**: 53.3% overall |
|
- **UI-TARS-2B**: 27.7% overall |
|
- **UI-TARS-7B**: 35.7% overall |
|
- **UI-TARS-72B**: 38.1% overall |
|
|
|
### ShowdownClicks Baselines
|
- **Qwen2.5-VL-72B-Instruct**: 24.8% overall |
|
- **UI-TARS-72B-SFT**: 54.4% overall |
|
- **Molmo-72B-0924**: 54.8% overall |
|
|
|
## Checkpoint Handling |
|
|
|
- The app automatically identifies the best performing checkpoint for each model |
|
- If multiple checkpoints exist, only the best one is shown in the main charts |
|
- An asterisk (*) indicates when the best checkpoint is not the last one |
|
- Use the "Checkpoint Progression Analysis" to explore all checkpoints |
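The selection rule above can be sketched as follows, assuming each checkpoint record carries hypothetical `step` and `accuracy` keys:

```python
def best_checkpoint(checkpoints: list[dict]) -> tuple[dict, str]:
    """Pick the highest-accuracy checkpoint; star it if it is not the last.

    Assumes each checkpoint dict has "step" and "accuracy" keys.
    Returns the chosen checkpoint and "*" when it is not the final one.
    """
    ordered = sorted(checkpoints, key=lambda c: c["step"])
    best = max(ordered, key=lambda c: c["accuracy"])
    marker = "" if best is ordered[-1] else "*"
    return best, marker
```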
|
|
|
## Caching |
|
|
|
Results are cached in memory for 5 minutes to improve performance. The cache automatically refreshes to show new evaluation results. |
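In a Streamlit app this kind of time-limited cache is typically provided by `st.cache_data(ttl=300)`. As a minimal pure-Python sketch of the same idea (not the app's implementation):

```python
import time
from functools import wraps

def ttl_cache(seconds: float):
    """Minimal TTL cache: re-run the wrapped function once an entry expires.

    Illustrates the 5-minute caching idea; in the app itself this role
    would be played by Streamlit's @st.cache_data(ttl=300).
    """
    def decorator(fn):
        store = {}  # args -> (timestamp, value)

        @wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and now - hit[0] < seconds:
                return hit[1]  # still fresh: serve the cached value
            value = fn(*args)
            store[args] = (now, value)
            return value
        return wrapper
    return decorator
```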
|
|