---
title: Leaderboard Viewer
emoji: 🚀
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
  - streamlit
pinned: false
short_description: Streamlit template space
---

# Grounding Benchmark Leaderboard Viewer

A Streamlit application for visualizing model performance on grounding benchmarks.

## Features

- **Real-time Data**: Streams results directly from the HuggingFace leaderboard repository without local storage
- **Interactive Visualizations**: Bar charts comparing model performance across different metrics
- **Baseline Comparisons**: Shows baseline models (Qwen2-VL, UI-TARS) alongside evaluated models
- **Best Checkpoint Selection**: Automatically shows the best-performing checkpoint for each model (marked with `*` if it is not the last checkpoint)
- **UI Type Breakdown**:
  - For ScreenSpot-v2: comprehensive charts showing Overall, Desktop, Web, and individual UI type performance
  - For other datasets: Desktop vs. Web and Text vs. Icon performance
- **Checkpoint Progression Analysis**: Visualize how metrics evolve during training
- **Model Details**: View training loss, checkpoint steps, and evaluation timestamps

## Installation

1. Clone or download this directory
2. Install dependencies:

```bash
pip install -r requirements.txt
```

## Running the App

```bash
streamlit run src/streamlit_app.py
```

The app will open in your browser at `http://localhost:8501`.

## Usage

1. **Select Dataset**: Use the sidebar to choose which benchmark dataset to view (e.g., screenspot-v2, screenspot-pro)
2. **Filter Models**: Optionally filter to view a specific model or all models
3. **View Charts**:
   - For ScreenSpot-v2:
     - Overall performance (average of desktop and web)
     - Desktop and Web averages
     - Individual UI type metrics: Desktop (Text), Desktop (Icon), Web (Text), Web (Icon)
     - Text and Icon averages across environments
   - Baseline model comparisons are shown in orange
   - Models marked with `*` indicate the best checkpoint is not the last one
4.
   **Explore Details**:
   - Expand "Model Details" to see training metadata
   - Expand "Detailed UI Type Breakdown" for a comprehensive table
   - Expand "Checkpoint Progression Analysis" to:
     - View accuracy progression over training steps
     - See the relationship between training loss and accuracy
     - Compare performance across checkpoints

## Data Source

The app streams data directly from the HuggingFace dataset repository:

- Repository: `mlfoundations-cua-dev/leaderboard`
- Path: `grounding/[dataset_name]/[model_results].json`

## Streaming Approach

To minimize local storage requirements, the app:

- Streams JSON files directly from HuggingFace Hub
- Extracts only the necessary data for visualization
- Discards the full JSON after processing
- Caches the extracted data in memory for 5 minutes

## Supported Datasets

- **ScreenSpot-v2**: Web and desktop UI element grounding (with special handling for desktop/web averaging)
- **ScreenSpot-Pro**: Professional UI grounding benchmark
- **ShowdownClicks**: Click prediction benchmark
- And more as they are added to the leaderboard

## Baseline Models

The dashboard includes baseline performance from established models:

### ScreenSpot-v2 Baselines

- **Qwen2-VL-7B**: 38.0% overall
- **UI-TARS-2B**: 82.8% overall
- **UI-TARS-7B**: 92.2% overall
- **UI-TARS-72B**: 88.3% overall

### ScreenSpot-Pro Baselines

- **Qwen2.5-VL-3B-Instruct**: 16.1% overall
- **Qwen2.5-VL-7B-Instruct**: 26.8% overall
- **Qwen2.5-VL-72B-Instruct**: 53.3% overall
- **UI-TARS-2B**: 27.7% overall
- **UI-TARS-7B**: 35.7% overall
- **UI-TARS-72B**: 38.1% overall

### ShowDown-Clicks Baselines

- **Qwen2.5-VL-72B-Instruct**: 24.8% overall
- **UI-TARS-72B-SFT**: 54.4% overall
- **Molmo-72B-0924**: 54.8% overall

## Checkpoint Handling

- The app automatically identifies the best-performing checkpoint for each model
- If multiple checkpoints exist, only the best one is shown in the main charts
- An asterisk (`*`) indicates when the best checkpoint is not the last one
- Use the
"Checkpoint Progression Analysis" to explore all checkpoints

## Caching

Results are cached in memory for 5 minutes to improve performance. The cache automatically refreshes to show new evaluation results.
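The best-checkpoint selection and asterisk marking described under "Checkpoint Handling" can be sketched in plain Python. This is a minimal illustration, not the app's actual code: the field names `step` and `accuracy` are assumed for checkpoint records and may differ from the real result schema.

```python
# Sketch of best-checkpoint selection. Field names ("step", "accuracy")
# are assumptions about the result schema, not the app's actual format.

def select_best_checkpoint(checkpoints):
    """Return (best checkpoint, whether it is also the last one)."""
    ordered = sorted(checkpoints, key=lambda c: c["step"])
    best = max(ordered, key=lambda c: c["accuracy"])
    return best, best is ordered[-1]

def display_name(model, checkpoints):
    """Append '*' when the best checkpoint is not the last checkpoint."""
    _, is_last = select_best_checkpoint(checkpoints)
    return model if is_last else f"{model}*"
```

For example, a model whose accuracy peaks at step 200 of a 300-step run would be labeled `my-model*` in the main charts, while its full trajectory remains visible in the Checkpoint Progression Analysis view.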
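The stream-extract-discard-cache flow described in "Streaming Approach" and "Caching" can be sketched as follows. The real app runs inside Streamlit (where `st.cache_data(ttl=300)` would provide the 5-minute cache); this stdlib-only sketch uses a hand-rolled TTL cache instead, and the result-file fields (`model`, `metrics.overall`) are hypothetical, not the actual schema.

```python
# Illustrative sketch only: stream a result JSON from the Hub, keep just the
# fields the charts need, and cache them in memory for 5 minutes.
import json
import time
from urllib.request import urlopen

REPO = "mlfoundations-cua-dev/leaderboard"
_TTL = 300                # seconds; matches the app's 5-minute cache
_CACHE = {}               # path -> (fetch timestamp, extracted data)

def extract_metrics(results):
    """Keep only what the charts need; the full JSON is then discarded.
    The "model" / "metrics.overall" fields are assumed, not the real schema."""
    return {
        "model": results.get("model"),
        "overall": results.get("metrics", {}).get("overall"),
    }

def load_result(dataset, filename):
    """Fetch grounding/<dataset>/<filename> from the Hub, with a TTL cache."""
    path = f"grounding/{dataset}/{filename}"
    hit = _CACHE.get(path)
    if hit and time.time() - hit[0] < _TTL:
        return hit[1]
    url = f"https://huggingface.co/datasets/{REPO}/resolve/main/{path}"
    with urlopen(url) as resp:            # streamed; nothing written to disk
        data = extract_metrics(json.load(resp))
    _CACHE[path] = (time.time(), data)
    return data
```

Because only the extracted dictionary is cached, memory usage stays proportional to the number of charted metrics rather than the size of the raw result files.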