Commit 79cb6e1 by Anas Awadalla (parent: 2dbb46e)

Commit message: more analysis + baselines

Files changed (2):
  1. README.md (+37 -13)
  2. src/streamlit_app.py (+285 -33)
README.md CHANGED
@@ -20,10 +20,11 @@ A Streamlit application for visualizing model performance on grounding benchmark
 - **Real-time Data**: Streams results directly from the HuggingFace leaderboard repository without local storage
 - **Interactive Visualizations**: Bar charts comparing model performance across different metrics
 - **Baseline Comparisons**: Shows baseline models (Qwen2-VL, UI-TARS) alongside evaluated models
-- **UI Type Breakdown**: For ScreenSpot datasets, shows performance by:
-  - Desktop vs Web
-  - Text vs Icon elements
-  - Overall averages
+- **Best Checkpoint Selection**: Automatically shows the best-performing checkpoint for each model (marked with * if it is not the last checkpoint)
+- **UI Type Breakdown**:
+  - For ScreenSpot-v2: comprehensive charts showing Overall, Desktop, Web, and individual UI type performance
+  - For other datasets: Desktop vs Web and Text vs Icon performance
+- **Checkpoint Progression Analysis**: Visualize how metrics evolve during training
 - **Model Details**: View training loss, checkpoint steps, and evaluation timestamps
 - **Sample Results**: Inspect the first 5 evaluation samples for each model
 
@@ -49,15 +50,23 @@ The app will open in your browser at `http://localhost:8501`
 
 2. **Filter Models**: Optionally filter to view a specific model or all models
 
-3. **View Charts**: The main page displays:
-   - Overall metrics (number of models, best accuracy, total samples)
-   - Bar charts comparing performance across different UI types
-   - Baseline model comparisons (shown in orange)
+3. **View Charts**:
+   - For ScreenSpot-v2:
+     - Overall performance (average of desktop and web)
+     - Desktop and Web averages
+     - Individual UI type metrics: Desktop (Text), Desktop (Icon), Web (Text), Web (Icon)
+     - Text and Icon averages across environments
+   - Baseline model comparisons shown in orange
+   - Models marked with * indicate the best checkpoint is not the final one
 
 4. **Explore Details**:
    - Expand "Model Details" to see training metadata
    - Expand "Detailed UI Type Breakdown" for a comprehensive table
    - Expand "Sample Results" to see the first 5 evaluation samples
+   - Expand "Checkpoint Progression Analysis" to:
+     - View accuracy progression over training steps
+     - See the relationship between training loss and accuracy
+     - Compare performance across checkpoints
 
 ## Data Source
 
@@ -75,7 +84,7 @@ To minimize local storage requirements, the app:
 
 ## Supported Datasets
 
-- **ScreenSpot-v2**: Web and desktop UI element grounding
+- **ScreenSpot-v2**: Web and desktop UI element grounding (with special handling for desktop/web averaging)
 - **ScreenSpot-Pro**: Professional UI grounding benchmark
 - **ShowdownClicks**: Click prediction benchmark
 - And more as they are added to the leaderboard
@@ -83,10 +92,25 @@ To minimize local storage requirements, the app:
 ## Baseline Models
 
 For ScreenSpot-v2, the following baselines are included:
-- Qwen2-VL-7B
-- UI-TARS-2B
-- UI-TARS-7B
-- UI-TARS-72B
+- Qwen2-VL-7B: 37.96%
+- UI-TARS-2B: 82.8%
+- UI-TARS-7B: 92.2%
+- UI-TARS-72B: 88.3%
+
+For ScreenSpot-Pro, the following baselines are included:
+- Qwen2.5-VL-3B-Instruct: 16.1%
+- Qwen2.5-VL-7B-Instruct: 26.8%
+- Qwen2.5-VL-72B-Instruct: 53.3%
+- UI-TARS-2B: 27.7%
+- UI-TARS-7B: 35.7%
+- UI-TARS-72B: 38.1%
+
+## Checkpoint Handling
+
+- The app automatically identifies the best-performing checkpoint for each model
+- If multiple checkpoints exist, only the best one is shown in the main charts
+- An asterisk (*) indicates when the best checkpoint is not the last one
+- Use the "Checkpoint Progression Analysis" to explore all checkpoints
 
 ## Caching
 
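The best-checkpoint selection described above (and implemented in the app's `fetch_leaderboard_data`) boils down to a group-by-and-argmax over (dataset, base model) pairs. A minimal self-contained sketch of that logic; `select_best_checkpoints` and the toy three-checkpoint frame are illustrative, not code from the app:

```python
import pandas as pd

def select_best_checkpoints(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the highest-accuracy checkpoint per (dataset, base_model) group,
    appending '*' to the model name when that checkpoint is not the last one."""
    best_rows = []
    for _, group in df.groupby(["dataset", "base_model"]):
        # The row with the maximum overall accuracy wins
        best = group.loc[group["overall_accuracy"].idxmax()].copy()
        # Flag it when a later checkpoint exists but scored worse
        if len(group) > 1 and best["checkpoint_steps"] != group["checkpoint_steps"].max():
            best["model"] += "*"
        best_rows.append(best)
    return pd.DataFrame(best_rows).reset_index(drop=True)

runs = pd.DataFrame({
    "dataset": ["screenspot-v2"] * 3,
    "base_model": ["model-a"] * 3,
    "model": ["model-a/checkpoint-100", "model-a/checkpoint-200", "model-a/checkpoint-300"],
    "checkpoint_steps": [100, 200, 300],
    "overall_accuracy": [71.2, 84.5, 80.1],
})
best = select_best_checkpoints(runs)
print(best.loc[0, "model"])  # model-a/checkpoint-200*
```

The app's version additionally carries the full per-group checkpoint list along in an `all_checkpoints` column so the progression expander can re-plot every run; the sketch omits that bookkeeping.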
src/streamlit_app.py CHANGED
@@ -53,6 +53,26 @@ BASELINES = {
             "web_icon": 86.3,
             "overall": 88.3
         }
+    },
+    "screenspot-pro": {
+        "Qwen2.5-VL-3B-Instruct": {
+            "overall": 16.1
+        },
+        "Qwen2.5-VL-7B-Instruct": {
+            "overall": 26.8
+        },
+        "Qwen2.5-VL-72B-Instruct": {
+            "overall": 53.3
+        },
+        "UI-TARS-2B": {
+            "overall": 27.7
+        },
+        "UI-TARS-7B": {
+            "overall": 35.7
+        },
+        "UI-TARS-72B": {
+            "overall": 38.1
+        }
     }
 }
 
@@ -99,18 +119,36 @@ def fetch_leaderboard_data():
             # Get model name from metadata or path
             model_checkpoint = metadata.get("model_checkpoint", "")
             model_name = model_checkpoint.split('/')[-1]
+            base_model_name = None
+            is_checkpoint = False
 
             # Handle checkpoint names
             if not model_name and len(path_parts) > 2:
                 # Check if it's a checkpoint subdirectory structure
                 if len(path_parts) > 3 and path_parts[2] != path_parts[3]:
                     # Format: grounding/dataset/base_model/checkpoint.json
-                    base_model = path_parts[2]
+                    base_model_name = path_parts[2]
                     checkpoint_file = path_parts[3].replace(".json", "")
-                    model_name = f"{base_model}/{checkpoint_file}"
+                    model_name = f"{base_model_name}/{checkpoint_file}"
+                    is_checkpoint = True
                 else:
                     # Regular format: grounding/dataset/results_modelname.json
                     model_name = path_parts[2].replace("results_", "").replace(".json", "")
+                    base_model_name = model_name
+
+            # Check if model name indicates a checkpoint
+            if 'checkpoint-' in model_name:
+                is_checkpoint = True
+                if not base_model_name:
+                    # Extract base model name from full path
+                    if '/' in model_name:
+                        parts = model_name.split('/')
+                        base_model_name = parts[0]
+                    else:
+                        # Try to get from model_checkpoint path
+                        checkpoint_parts = model_checkpoint.split('/')
+                        if len(checkpoint_parts) > 1:
+                            base_model_name = checkpoint_parts[-2]
 
             # Extract UI type results if available
             ui_type_results = detailed_results.get("by_ui_type", {})
@@ -120,6 +158,8 @@ def fetch_leaderboard_data():
             result_entry = {
                 "dataset": dataset_name,
                 "model": model_name,
+                "base_model": base_model_name or model_name,
+                "is_checkpoint": is_checkpoint,
                 "model_path": model_checkpoint,
                 "overall_accuracy": metrics.get("accuracy", 0) * 100,  # Convert to percentage
                 "total_samples": metrics.get("total", 0),
@@ -145,7 +185,49 @@ def fetch_leaderboard_data():
         progress_bar.empty()
         status_text.empty()
 
-        return pd.DataFrame(results)
+        # Create DataFrame
+        df = pd.DataFrame(results)
+
+        # Process checkpoints: for each base model, find the best checkpoint
+        if not df.empty:
+            # Group by dataset and base_model
+            grouped = df.groupby(['dataset', 'base_model'])
+
+            # For each group, find the best checkpoint
+            best_models = []
+            for (dataset, base_model), group in grouped:
+                if len(group) > 1:
+                    # Multiple entries for this model (likely checkpoints)
+                    best_idx = group['overall_accuracy'].idxmax()
+                    best_row = group.loc[best_idx].copy()
+
+                    # Check if the best is the last checkpoint
+                    checkpoint_steps = group[group['checkpoint_steps'].notna()]['checkpoint_steps'].sort_values()
+                    if len(checkpoint_steps) > 0:
+                        last_checkpoint_steps = checkpoint_steps.iloc[-1]
+                        best_checkpoint_steps = best_row['checkpoint_steps']
+                        if pd.notna(best_checkpoint_steps) and best_checkpoint_steps != last_checkpoint_steps:
+                            # Best checkpoint is not the last one, add asterisk
+                            best_row['model'] = best_row['model'] + '*'
+                            best_row['is_best_not_last'] = True
+                        else:
+                            best_row['is_best_not_last'] = False
+
+                    # Store all checkpoints for this model
+                    best_row['all_checkpoints'] = group.to_dict('records')
+                    best_models.append(best_row)
+                else:
+                    # Single entry for this model
+                    row = group.iloc[0].copy()
+                    row['is_best_not_last'] = False
+                    row['all_checkpoints'] = [row.to_dict()]
+                    best_models.append(row)
+
+            # Create new dataframe with best models
+            df_best = pd.DataFrame(best_models)
+            return df_best
+
+        return df
 
     except Exception as e:
         st.error(f"Error fetching leaderboard data: {str(e)}")
@@ -164,17 +246,23 @@ def parse_ui_type_metrics(df: pd.DataFrame, dataset_filter: str) -> pd.DataFrame
 
         # For ScreenSpot datasets, we have desktop/web and text/icon
         if 'screenspot' in dataset_filter.lower():
-            # Calculate aggregated metrics
+            # Calculate individual metrics
            desktop_text = ui_results.get('desktop_text', {}).get('correct', 0) / max(ui_results.get('desktop_text', {}).get('total', 1), 1) * 100
            desktop_icon = ui_results.get('desktop_icon', {}).get('correct', 0) / max(ui_results.get('desktop_icon', {}).get('total', 1), 1) * 100
            web_text = ui_results.get('web_text', {}).get('correct', 0) / max(ui_results.get('web_text', {}).get('total', 1), 1) * 100
            web_icon = ui_results.get('web_icon', {}).get('correct', 0) / max(ui_results.get('web_icon', {}).get('total', 1), 1) * 100
 
            # Calculate averages
-           desktop_avg = (desktop_text + desktop_icon) / 2 if desktop_text or desktop_icon else 0
-           web_avg = (web_text + web_icon) / 2 if web_text or web_icon else 0
-           text_avg = (desktop_text + web_text) / 2 if desktop_text or web_text else 0
-           icon_avg = (desktop_icon + web_icon) / 2 if desktop_icon or web_icon else 0
+           desktop_avg = (desktop_text + desktop_icon) / 2 if (desktop_text > 0 or desktop_icon > 0) else 0
+           web_avg = (web_text + web_icon) / 2 if (web_text > 0 or web_icon > 0) else 0
+           text_avg = (desktop_text + web_text) / 2 if (desktop_text > 0 or web_text > 0) else 0
+           icon_avg = (desktop_icon + web_icon) / 2 if (desktop_icon > 0 or web_icon > 0) else 0
+
+           # For screenspot-v2, calculate the overall as average of desktop and web
+           if dataset_filter == 'screenspot-v2':
+               overall = (desktop_avg + web_avg) / 2 if (desktop_avg > 0 or web_avg > 0) else 0
+           else:
+               overall = row['overall_accuracy']
 
            metrics_list.append({
                'model': model,
@@ -186,7 +274,9 @@ def main():
                'web_avg': web_avg,
                'text_avg': text_avg,
                'icon_avg': icon_avg,
-               'overall': row['overall_accuracy']
+               'overall': overall,
+               'is_best_not_last': row.get('is_best_not_last', False),
+               'all_checkpoints': row.get('all_checkpoints', [])
            })
 
    return pd.DataFrame(metrics_list)
@@ -303,35 +393,197 @@ def main():
     if not ui_metrics_df.empty and 'screenspot' in selected_dataset.lower():
         st.subheader("Performance by UI Type")
 
-        # Create charts in a grid
-        col1, col2 = st.columns(2)
-
-        with col1:
-            # Overall Average
-            chart = create_bar_chart(ui_metrics_df, 'overall', 'Overall Average')
-            if chart:
-                st.altair_chart(chart, use_container_width=True)
-
-            # Desktop Average
-            chart = create_bar_chart(ui_metrics_df, 'desktop_avg', 'Desktop Average')
-            if chart:
-                st.altair_chart(chart, use_container_width=True)
-
-            # Text Average
-            chart = create_bar_chart(ui_metrics_df, 'text_avg', 'Text Average (UI-Type)')
-            if chart:
-                st.altair_chart(chart, use_container_width=True)
-
-        with col2:
-            # Web Average
-            chart = create_bar_chart(ui_metrics_df, 'web_avg', 'Web Average')
-            if chart:
-                st.altair_chart(chart, use_container_width=True)
-
-            # Icon Average
-            chart = create_bar_chart(ui_metrics_df, 'icon_avg', 'Icon Average (UI-Type)')
-            if chart:
-                st.altair_chart(chart, use_container_width=True)
+        # Add note about asterisks
+        if any(ui_metrics_df['is_best_not_last']):
+            st.info("* indicates the best checkpoint is not the last checkpoint")
+
+        # Create charts in a grid
+        if selected_dataset == 'screenspot-v2':
+            # First row: Overall, Desktop, Web averages
+            col1, col2, col3 = st.columns(3)
+
+            with col1:
+                chart = create_bar_chart(ui_metrics_df, 'overall', 'Overall Average (Desktop + Web) / 2')
+                if chart:
+                    st.altair_chart(chart, use_container_width=True)
+
+            with col2:
+                chart = create_bar_chart(ui_metrics_df, 'desktop_avg', 'Desktop Average')
+                if chart:
+                    st.altair_chart(chart, use_container_width=True)
+
+            with col3:
+                chart = create_bar_chart(ui_metrics_df, 'web_avg', 'Web Average')
+                if chart:
+                    st.altair_chart(chart, use_container_width=True)
+
+            # Second row: Individual UI type metrics
+            col1, col2, col3, col4 = st.columns(4)
+
+            with col1:
+                chart = create_bar_chart(ui_metrics_df, 'desktop_text', 'Desktop (Text)')
+                if chart:
+                    st.altair_chart(chart, use_container_width=True)
+
+            with col2:
+                chart = create_bar_chart(ui_metrics_df, 'desktop_icon', 'Desktop (Icon)')
+                if chart:
+                    st.altair_chart(chart, use_container_width=True)
+
+            with col3:
+                chart = create_bar_chart(ui_metrics_df, 'web_text', 'Web (Text)')
+                if chart:
+                    st.altair_chart(chart, use_container_width=True)
+
+            with col4:
+                chart = create_bar_chart(ui_metrics_df, 'web_icon', 'Web (Icon)')
+                if chart:
+                    st.altair_chart(chart, use_container_width=True)
+
+            # Third row: Text vs Icon averages
+            col1, col2 = st.columns(2)
+
+            with col1:
+                chart = create_bar_chart(ui_metrics_df, 'text_avg', 'Text Average (Desktop + Web)')
+                if chart:
+                    st.altair_chart(chart, use_container_width=True)
+
+            with col2:
+                chart = create_bar_chart(ui_metrics_df, 'icon_avg', 'Icon Average (Desktop + Web)')
+                if chart:
+                    st.altair_chart(chart, use_container_width=True)
+        else:
+            # For other screenspot datasets, show the standard layout
+            col1, col2 = st.columns(2)
+
+            with col1:
+                # Overall Average
+                chart = create_bar_chart(ui_metrics_df, 'overall', 'Overall Average')
+                if chart:
+                    st.altair_chart(chart, use_container_width=True)
+
+                # Desktop Average
+                chart = create_bar_chart(ui_metrics_df, 'desktop_avg', 'Desktop Average')
+                if chart:
+                    st.altair_chart(chart, use_container_width=True)
+
+                # Text Average
+                chart = create_bar_chart(ui_metrics_df, 'text_avg', 'Text Average (UI-Type)')
+                if chart:
+                    st.altair_chart(chart, use_container_width=True)
+
+            with col2:
+                # Web Average
+                chart = create_bar_chart(ui_metrics_df, 'web_avg', 'Web Average')
+                if chart:
+                    st.altair_chart(chart, use_container_width=True)
+
+                # Icon Average
+                chart = create_bar_chart(ui_metrics_df, 'icon_avg', 'Icon Average (UI-Type)')
+                if chart:
+                    st.altair_chart(chart, use_container_width=True)
+
+        # Checkpoint progression visualization
+        with st.expander("Checkpoint Progression Analysis"):
+            # Select a model with checkpoints
+            models_with_checkpoints = ui_metrics_df[ui_metrics_df['all_checkpoints'].apply(lambda x: len(x) > 1)]
+
+            if not models_with_checkpoints.empty:
+                selected_checkpoint_model = st.selectbox(
+                    "Select a model to view checkpoint progression:",
+                    models_with_checkpoints['model'].str.replace('*', '').unique()
+                )
+
+                # Get checkpoint data for selected model
+                model_row = models_with_checkpoints[models_with_checkpoints['model'].str.replace('*', '') == selected_checkpoint_model].iloc[0]
+                checkpoint_data = model_row['all_checkpoints']
+
+                # Create DataFrame from checkpoint data
+                checkpoint_df = pd.DataFrame(checkpoint_data)
+
+                # Prepare data for visualization
+                checkpoint_metrics = []
+                for _, cp in checkpoint_df.iterrows():
+                    ui_results = cp['ui_type_results']
+
+                    # Calculate metrics
+                    desktop_text = ui_results.get('desktop_text', {}).get('correct', 0) / max(ui_results.get('desktop_text', {}).get('total', 1), 1) * 100
+                    desktop_icon = ui_results.get('desktop_icon', {}).get('correct', 0) / max(ui_results.get('desktop_icon', {}).get('total', 1), 1) * 100
+                    web_text = ui_results.get('web_text', {}).get('correct', 0) / max(ui_results.get('web_text', {}).get('total', 1), 1) * 100
+                    web_icon = ui_results.get('web_icon', {}).get('correct', 0) / max(ui_results.get('web_icon', {}).get('total', 1), 1) * 100
+
+                    desktop_avg = (desktop_text + desktop_icon) / 2
+                    web_avg = (web_text + web_icon) / 2
+                    overall = (desktop_avg + web_avg) / 2 if selected_dataset == 'screenspot-v2' else cp['overall_accuracy']
+
+                    checkpoint_metrics.append({
+                        'steps': cp['checkpoint_steps'] or 0,
+                        'overall': overall,
+                        'desktop': desktop_avg,
+                        'web': web_avg,
+                        'loss': cp['training_loss'],
+                        'neg_log_loss': -np.log(cp['training_loss']) if cp['training_loss'] and cp['training_loss'] > 0 else None
+                    })
+
+                metrics_df = pd.DataFrame(checkpoint_metrics).sort_values('steps')
+
+                # Plot metrics over training steps
+                col1, col2 = st.columns(2)
+
+                with col1:
+                    st.write("**Accuracy over Training Steps**")
+
+                    # Melt data for multi-line chart
+                    melted = metrics_df[['steps', 'overall', 'desktop', 'web']].melt(
+                        id_vars=['steps'],
+                        var_name='Metric',
+                        value_name='Accuracy'
+                    )
+
+                    chart = alt.Chart(melted).mark_line(point=True).encode(
+                        x=alt.X('steps:Q', title='Training Steps'),
+                        y=alt.Y('Accuracy:Q', scale=alt.Scale(domain=[0, 100]), title='Accuracy (%)'),
+                        color=alt.Color('Metric:N', scale=alt.Scale(
+                            domain=['overall', 'desktop', 'web'],
+                            range=['#4ECDC4', '#45B7D1', '#96CEB4']
+                        )),
+                        tooltip=['steps', 'Metric', 'Accuracy']
+                    ).properties(
+                        width=400,
+                        height=300,
+                        title='Accuracy Progression During Training'
+                    )
+                    st.altair_chart(chart, use_container_width=True)
+
+                with col2:
+                    st.write("**Accuracy vs. Training Loss**")
+
+                    if metrics_df['neg_log_loss'].notna().any():
+                        scatter_data = metrics_df[metrics_df['neg_log_loss'].notna()]
+
+                        chart = alt.Chart(scatter_data).mark_circle(size=100).encode(
+                            x=alt.X('neg_log_loss:Q', title='-log(Training Loss)'),
+                            y=alt.Y('overall:Q', scale=alt.Scale(domain=[0, 100]), title='Overall Accuracy (%)'),
+                            color=alt.Color('steps:Q', scale=alt.Scale(scheme='viridis'), title='Training Steps'),
+                            tooltip=['steps', 'loss', 'overall']
+                        ).properties(
+                            width=400,
+                            height=300,
+                            title='Accuracy vs. -log(Training Loss)'
+                        )
+                        st.altair_chart(chart, use_container_width=True)
+                    else:
+                        st.info("No training loss data available for this model")
+
+                # Show checkpoint details table
+                st.write("**Checkpoint Details**")
+                display_metrics = metrics_df[['steps', 'overall', 'desktop', 'web', 'loss']].copy()
+                display_metrics.columns = ['Steps', 'Overall %', 'Desktop %', 'Web %', 'Training Loss']
+                display_metrics[['Overall %', 'Desktop %', 'Web %']] = display_metrics[['Overall %', 'Desktop %', 'Web %']].round(2)
+                display_metrics['Training Loss'] = display_metrics['Training Loss'].apply(lambda x: f"{x:.4f}" if pd.notna(x) else "N/A")
+                st.dataframe(display_metrics, use_container_width=True)
+            else:
+                st.info("No models with multiple checkpoints available for progression analysis")
 
         # Detailed breakdown
         with st.expander("Detailed UI Type Breakdown"):
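One detail worth calling out in `parse_ui_type_metrics`: for screenspot-v2 the "overall" number is no longer the raw sample accuracy but the mean of the desktop and web averages, each of which is itself the mean of its text and icon bucket accuracies. A minimal sketch of that averaging scheme, using the same `{"correct", "total"}` bucket shape the app reads; `accuracy` and `screenspot_v2_overall` are hypothetical helper names, not functions from the app:

```python
def accuracy(bucket: dict) -> float:
    """Bucket accuracy in percent; missing or empty buckets count as 0."""
    return bucket.get("correct", 0) / max(bucket.get("total", 1), 1) * 100

def screenspot_v2_overall(ui_results: dict) -> float:
    """Overall = mean(desktop average, web average), where each environment
    average is the mean of its text and icon bucket accuracies."""
    desktop_avg = (accuracy(ui_results.get("desktop_text", {})) +
                   accuracy(ui_results.get("desktop_icon", {}))) / 2
    web_avg = (accuracy(ui_results.get("web_text", {})) +
               accuracy(ui_results.get("web_icon", {}))) / 2
    return (desktop_avg + web_avg) / 2

sample = {
    "desktop_text": {"correct": 90, "total": 100},
    "desktop_icon": {"correct": 70, "total": 100},
    "web_text": {"correct": 80, "total": 100},
    "web_icon": {"correct": 60, "total": 100},
}
print(screenspot_v2_overall(sample))  # 75.0 (desktop 80.0, web 70.0)
```

Because every bucket carries equal weight, this macro average can differ from the sample-weighted accuracy when bucket sizes are unequal, which is why the commit special-cases screenspot-v2 instead of reusing `overall_accuracy`.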