YanBoChen committed
Commit 16a2990 · Parent(s): a2aaea2
Add multi-system evaluation support for clinical actionability and evidence quality metrics
- Introduced evaluation rubrics for Clinical Actionability and Clinical Evidence Quality.
- Implemented a new Llama3-70B judge client for evaluating multiple AI systems.
- Refactored evaluation methods to handle comparisons between different systems.
- Enhanced prompt generation to include detailed system descriptions and evaluation criteria.
- Updated response parsing to accommodate results from multiple systems.
- Added functionality to save comparison statistics with detailed system-specific results.
- Improved command-line interface for evaluating single or multiple systems.
evaluation/direct_llm_evaluator.py
CHANGED

@@ -329,7 +329,8 @@ if __name__ == "__main__":
     if len(sys.argv) > 1:
         query_file = sys.argv[1]
     else:
-
+        # Default to evaluation/single_test_query.txt for consistency
+        query_file = Path(__file__).parent / "single_test_query.txt"
 
     if not os.path.exists(query_file):
         print(f"❌ Query file not found: {query_file}")
evaluation/metric5_6_judge_evaluator_manual.md
ADDED

@@ -0,0 +1,303 @@

# Metric 5-6 LLM Judge Evaluator Manual

## Overview

The `metric5_6_llm_judge_evaluator.py` is a multi-system evaluation tool that uses Llama3-70B as a third-party judge to assess medical advice quality across different AI systems. It supports both single-system evaluation and multi-system comparison with a single LLM call for maximum consistency.

## Metrics Evaluated

**Metric 5: Clinical Actionability (臨床可操作性)**
- Scale: 1-10 (normalized to 0.0-1.0)
- Question: "Can healthcare providers immediately act on this advice?"
- Target: ≥7.0/10 for acceptable actionability

**Metric 6: Clinical Evidence Quality (臨床證據品質)**
- Scale: 1-10 (normalized to 0.0-1.0)
- Question: "Is the advice evidence-based and follows medical standards?"
- Target: ≥7.5/10 for acceptable evidence quality
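
Both metrics share the same 1-10 judge scale, a /10 normalization, and per-metric targets. The snippet below is an illustrative sketch of that arithmetic only; the constant and function names here are not taken from the evaluator code:

```python
# Illustrative sketch of the documented scale/targets; not copied from the evaluator.
ACTIONABILITY_TARGET = 7.0   # on the 1-10 scale
EVIDENCE_TARGET = 7.5        # on the 1-10 scale

def normalize(raw_score: int) -> float:
    """Map a 1-10 judge score to the 0.0-1.0 range used in the result files."""
    return raw_score / 10.0

def targets_met(actionability_raw: float, evidence_raw: float) -> dict:
    """Check the per-metric targets on the raw 1-10 scores."""
    return {
        "actionability_target_met": actionability_raw >= ACTIONABILITY_TARGET,
        "evidence_target_met": evidence_raw >= EVIDENCE_TARGET,
    }

print(normalize(8), targets_met(8, 7))
# 0.8 {'actionability_target_met': True, 'evidence_target_met': False}
```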

## System Architecture

### Multi-System Support
The evaluator supports flexible system combinations:
- **Single System**: `rag` or `direct`
- **Two-System Comparison**: `rag,direct`
- **Future Extension**: `rag,direct,claude,gpt4` (any combination)

### Judge LLM
- **Model**: Llama3-70B-Instruct via Hugging Face API
- **Strategy**: Single batch call for all evaluations
- **Temperature**: 0.1 (low for consistent evaluation)
- **Max Tokens**: 2048 (sufficient for evaluation responses)
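
The judge call itself is wrapped in the project's `llm_clients` module. Purely to illustrate the parameters listed above, a direct Hugging Face Inference API call might look like the sketch below; the model id and client usage here are assumptions, not code from the repository:

```python
import os
from huggingface_hub import InferenceClient

# Assumed model id for a Llama3-70B-Instruct judge; the project may target a different endpoint.
client = InferenceClient(model="meta-llama/Meta-Llama-3-70B-Instruct", token=os.environ["HF_TOKEN"])

response = client.chat_completion(
    messages=[{"role": "user", "content": "..."}],  # the batched evaluation prompt
    temperature=0.1,   # low temperature for consistent scoring
    max_tokens=2048,   # enough room for the score lines
)
print(response.choices[0].message.content)
```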

## Prerequisites

### 1. Environment Setup
```bash
# Ensure HF_TOKEN is set in your environment
export HF_TOKEN="your_huggingface_token"

# Or add to .env file
echo "HF_TOKEN=your_huggingface_token" >> .env
```

### 2. Required Data Files
Before running the judge evaluator, you must have medical outputs from your systems:

**For RAG System**:
```bash
python latency_evaluator.py single_test_query.txt
# Generates: results/medical_outputs_YYYYMMDD_HHMMSS.json
```

**For Direct LLM System**:
```bash
python direct_llm_evaluator.py single_test_query.txt
# Generates: results/medical_outputs_direct_YYYYMMDD_HHMMSS.json
```

## Usage

### Command Line Interface

#### Single System Evaluation
```bash
# Evaluate RAG system only
python metric5_6_llm_judge_evaluator.py rag

# Evaluate Direct LLM system only
python metric5_6_llm_judge_evaluator.py direct
```

#### Multi-System Comparison (Recommended)
```bash
# Compare RAG vs Direct systems
python metric5_6_llm_judge_evaluator.py rag,direct

# Future: Compare multiple systems
python metric5_6_llm_judge_evaluator.py rag,direct,claude
```

### Complete Workflow Example

```bash
# Step 1: Navigate to evaluation directory
cd /path/to/GenAI-OnCallAssistant/evaluation

# Step 2: Generate medical outputs from both systems
python latency_evaluator.py single_test_query.txt
python direct_llm_evaluator.py single_test_query.txt

# Step 3: Run comparative evaluation
python metric5_6_llm_judge_evaluator.py rag,direct
```

## Output Files

### Generated Files
- **Statistics**: `results/judge_evaluation_comparison_rag_vs_direct_YYYYMMDD_HHMMSS.json`
- **Detailed Results**: Stored in evaluator's internal results array

### File Structure
```json
{
  "comparison_metadata": {
    "systems_compared": ["rag", "direct"],
    "comparison_type": "multi_system",
    "timestamp": "2025-08-04T22:00:00"
  },
  "category_results": {
    "diagnosis": {
      "average_actionability": 0.850,
      "average_evidence": 0.780,
      "query_count": 1,
      "actionability_target_met": true,
      "evidence_target_met": true
    }
  },
  "overall_results": {
    "average_actionability": 0.850,
    "average_evidence": 0.780,
    "successful_evaluations": 2,
    "total_queries": 2,
    "actionability_target_met": true,
    "evidence_target_met": true
  }
}
```
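
To consume the statistics file programmatically, a minimal sketch (field names as in the example above; the file name is the example one used later in the console output):

```python
import json
from pathlib import Path

stats_path = Path("results") / "judge_evaluation_comparison_rag_vs_direct_20250804_220000.json"
stats = json.loads(stats_path.read_text(encoding="utf-8"))

overall = stats["overall_results"]
print(stats["comparison_metadata"]["systems_compared"])               # ['rag', 'direct']
print(overall["average_actionability"], overall["average_evidence"])  # 0.85 0.78
```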

## Evaluation Process

### 1. File Discovery
The evaluator automatically finds the latest medical output files (a minimal sketch follows this list):
- **RAG**: `medical_outputs_*.json`
- **Direct**: `medical_outputs_direct_*.json`
- **Custom**: `medical_outputs_{system}_*.json`
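
A sketch of this discovery step, assuming the naming patterns above (the evaluator's own helper may differ in details):

```python
import glob
import os
from pathlib import Path

def find_latest_output(system: str, results_dir: Path) -> str:
    """Return the newest medical_outputs file for a system, following the patterns above."""
    pattern = "medical_outputs_*.json" if system == "rag" else f"medical_outputs_{system}_*.json"
    files = glob.glob(str(results_dir / pattern))
    if not files:
        raise FileNotFoundError(f"No medical outputs files found for {system} system")
    return max(files, key=os.path.getmtime)  # most recently modified file wins
```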

### 2. Prompt Generation
For multi-system comparison, the evaluator creates a structured prompt:
```
You are a medical expert evaluating and comparing AI systems...

SYSTEM 1 (RAG): Uses medical guidelines + LLM for evidence-based advice
SYSTEM 2 (Direct): Uses LLM only without external guidelines

QUERY 1 (DIAGNOSIS):
Patient Query: 60-year-old patient with hypertension history...

SYSTEM 1 Response: For a 60-year-old patient with...
SYSTEM 2 Response: Based on the symptoms described...

RESPONSE FORMAT:
Query 1 System 1: Actionability=X, Evidence=Y
Query 1 System 2: Actionability=X, Evidence=Y
```

### 3. LLM Judge Evaluation
- **Single API Call**: All systems evaluated in one request for consistency
- **Response Parsing**: Automatic extraction of numerical scores (see the sketch below)
- **Error Handling**: Graceful handling of parsing failures
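
Score extraction relies on the fixed response format shown above. A sketch of the parsing step (the exact regex in the evaluator may differ slightly):

```python
import re

line = "Query 1 System 2: Actionability=8, Evidence=7"
match = re.match(
    r"Query\s+(\d+)\s+System\s+(\d+):\s*Actionability\s*=\s*(\d+)\s*,\s*Evidence\s*=\s*(\d+)",
    line,
    re.IGNORECASE,
)
if match:
    query_idx = int(match.group(1)) - 1          # 0-based query index
    system_idx = int(match.group(2)) - 1         # 0-based system index
    actionability = int(match.group(3)) / 10.0   # normalized to 0.0-1.0
    evidence = int(match.group(4)) / 10.0
```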

### 4. Results Analysis
- **System-Specific Statistics**: Individual performance metrics
- **Comparative Analysis**: Direct system-to-system comparison
- **Target Compliance**: Automatic threshold checking

## Expected Output

### Console Output Example
```
🧠 OnCall.ai LLM Judge Evaluator - Metrics 5-6 Multi-System Evaluation

🧪 Multi-System Comparison: RAG vs DIRECT
📊 Found rag outputs: results/medical_outputs_20250804_215917.json
📊 Found direct outputs: results/medical_outputs_direct_20250804_220000.json
📊 Comparing 2 systems with 1 queries each
🎯 Metrics: 5 (Actionability) + 6 (Evidence Quality)
⚡ Strategy: Single comparison call for maximum consistency

🧠 Multi-system comparison: rag, direct
📊 Evaluating 1 queries across 2 systems...
📝 Comparison prompt created (2150 characters)
🔄 Calling judge LLM for multi-system comparison...
✅ Judge LLM completed comparison evaluation in 45.3s
📄 Response length: 145 characters
📊 RAG: 1 evaluations parsed
📊 DIRECT: 1 evaluations parsed

📊 === LLM JUDGE EVALUATION SUMMARY ===
Systems Compared: RAG vs DIRECT
Overall Performance:
  Average Actionability: 0.850 (8.5/10)
  Average Evidence Quality: 0.780 (7.8/10)
  Actionability Target (≥7.0): ✅ Met
  Evidence Target (≥7.5): ✅ Met

System Breakdown:
  RAG: Actionability=0.900, Evidence=0.850 [1 queries]
  DIRECT: Actionability=0.800, Evidence=0.710 [1 queries]

✅ LLM judge evaluation complete!
📊 Statistics: results/judge_evaluation_comparison_rag_vs_direct_20250804_220000.json
⚡ Efficiency: 2 evaluations in 1 LLM call
```

## Key Features

### 1. Scientific Comparison Design
- **Single Judge Call**: All systems evaluated simultaneously for consistency
- **Eliminates Temporal Bias**: Same judge, same context, same standards
- **Direct System Comparison**: Side-by-side evaluation format

### 2. Flexible Architecture
- **Backward Compatible**: Single-system evaluation still supported
- **Future Extensible**: Easy to add new systems (`claude`, `gpt4`, etc.)
- **Modular Design**: Clean separation of concerns

### 3. Robust Error Handling
- **File Validation**: Automatic detection of missing input files
- **Query Count Verification**: Warns if systems have different query counts
- **Graceful Degradation**: Continues operation despite partial failures

### 4. Comprehensive Reporting
- **System-Specific Metrics**: Individual performance analysis
- **Comparative Statistics**: Direct system-to-system comparison
- **Target Compliance**: Automatic benchmark checking
- **Detailed Metadata**: Full traceability of evaluation parameters

## Troubleshooting

### Common Issues

#### 1. Missing Input Files
```
❌ No medical outputs files found for rag system
💡 Please run evaluators first:
   python latency_evaluator.py single_test_query.txt
```
**Solution**: Run the prerequisite evaluators to generate medical outputs.

#### 2. HF_TOKEN Not Set
```
❌ HF_TOKEN is missing from environment variables
```
**Solution**: Set your Hugging Face token in the environment or in the `.env` file.

#### 3. Query Count Mismatch
```
⚠️ Warning: Systems have different query counts: {'rag': 3, 'direct': 1}
```
**Solution**: Ensure both systems processed the same input file.

#### 4. LLM API Timeout
```
❌ Multi-system evaluation failed: timeout
```
**Solution**: Check internet connection and Hugging Face API status.

### Debug Tips

1. **Check File Existence**: Verify medical output files in the `results/` directory
2. **Validate JSON Format**: Ensure input files are properly formatted
3. **Monitor API Usage**: Check Hugging Face account limits
4. **Review Logs**: Examine detailed logging output for specific errors

## Future Extensions

### Phase 2: Generic Multi-System Framework
```bash
# Configuration-driven system comparison
python metric5_6_llm_judge_evaluator.py --config comparison_config.json
```

### Phase 3: Unlimited System Support
```bash
# Dynamic system registration
python metric5_6_llm_judge_evaluator.py med42,claude,gpt4,palm,llama2
```

### Integration with Chart Generators
```bash
# Generate comparison visualizations
python metric5_6_llm_judge_chart_generator.py rag,direct
```

## Best Practices

1. **Consistent Test Data**: Use the same query file for all systems
2. **Sequential Execution**: Complete data collection before evaluation
3. **Batch Processing**: Use multi-system mode for scientific comparison
4. **Result Verification**: Review detailed statistics files for accuracy
5. **Performance Monitoring**: Track evaluation latency and API costs

## Scientific Validity

The multi-system comparison approach provides superior scientific validity compared to separate evaluations:

- **Eliminates Judge Variability**: Same judge evaluates all systems
- **Reduces Temporal Effects**: All evaluations in a single time window
- **Ensures Consistent Standards**: Identical evaluation criteria applied
- **Enables Direct Comparison**: Side-by-side system assessment
- **Maximizes Efficiency**: Single API call vs multiple separate calls

This design makes the evaluation results more reliable for research publications and system optimization decisions.
evaluation/metric5_6_llm_judge_chart_generator.py
ADDED

@@ -0,0 +1,430 @@

#!/usr/bin/env python3
"""
OnCall.ai System - LLM Judge Chart Generator (Metrics 5-6)
==========================================================

Generates comprehensive comparison charts for LLM judge evaluation results.
Supports both single-system and multi-system visualization with professional layouts.

Metrics visualized:
5. Clinical Actionability (臨床可操作性) - 1-10 scale
6. Clinical Evidence Quality (臨床證據品質) - 1-10 scale

Author: YanBo Chen
Date: 2025-08-04
"""

import json
import os
import sys
from typing import Dict, List, Any, Tuple
from datetime import datetime
from pathlib import Path
import glob
import numpy as np

# Visualization imports
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from matplotlib.patches import Rectangle


class LLMJudgeChartGenerator:
    """Generate professional comparison charts for LLM judge evaluation results"""

    def __init__(self):
        """Initialize chart generator with professional styling"""
        print("📈 Initializing LLM Judge Chart Generator...")

        # Set up professional chart style
        plt.style.use('default')
        sns.set_palette("husl")

        # Professional color scheme for medical evaluation
        self.colors = {
            'rag': '#2E8B57',        # Sea Green - represents evidence-based
            'direct': '#4682B4',     # Steel Blue - represents direct approach
            'claude': '#9370DB',     # Medium Purple - future extension
            'gpt4': '#DC143C',       # Crimson - future extension
            'actionability': '#FF6B6B',  # Coral Red
            'evidence': '#4ECDC4',       # Turquoise
            'target_line': '#FF4444',    # Red for target thresholds
            'grid': '#E0E0E0'            # Light gray for grid
        }

        print("✅ Chart Generator ready with professional medical styling")

    def load_latest_statistics(self, results_dir: str = None) -> Dict[str, Any]:
        """
        Load the most recent judge evaluation statistics file

        Args:
            results_dir: Directory containing statistics files
        """
        if results_dir is None:
            results_dir = Path(__file__).parent / "results"

        # Find latest comparison statistics file
        pattern = str(results_dir / "judge_evaluation_comparison_*.json")
        stat_files = glob.glob(pattern)

        if not stat_files:
            raise FileNotFoundError(f"No judge evaluation comparison files found in {results_dir}")

        # Get the most recent file
        latest_file = max(stat_files, key=os.path.getmtime)

        print(f"📊 Loading statistics from: {latest_file}")

        with open(latest_file, 'r', encoding='utf-8') as f:
            return json.load(f)

    def generate_comparison_charts(self, stats: Dict[str, Any], save_path: str = None) -> str:
        """
        Generate comprehensive 4-panel comparison visualization

        Creates professional charts showing:
        1. System comparison radar chart
        2. Grouped bar chart comparison
        3. Actionability vs Evidence scatter plot
        4. Category-wise heatmap
        """
        try:
            # Create figure with subplots
            fig, axes = plt.subplots(2, 2, figsize=(16, 12))
            fig.suptitle(
                'Medical AI Systems Comparison - Clinical Quality Assessment\n'
                'Actionability (1-10): Can healthcare providers act immediately? | '
                'Evidence Quality (1-10): Is advice evidence-based?',
                fontsize=14, fontweight='bold', y=0.95
            )

            # Extract comparison metadata
            comparison_meta = stats.get('comparison_metadata', {})
            systems = comparison_meta.get('systems_compared', ['rag', 'direct'])

            overall_results = stats['overall_results']
            category_results = stats['category_results']

            # Chart 1: System Comparison Radar Chart
            self._create_radar_chart(axes[0, 0], stats, systems)

            # Chart 2: Grouped Bar Chart Comparison
            self._create_grouped_bar_chart(axes[0, 1], stats, systems)

            # Chart 3: Actionability vs Evidence Scatter Plot
            self._create_scatter_plot(axes[1, 0], stats, systems)

            # Chart 4: Category-wise Performance Heatmap
            self._create_heatmap(axes[1, 1], stats, systems)

            # Add method annotation at bottom
            method_text = (
                f"Evaluation: Llama3-70B judge | Targets: Actionability ≥7.0, Evidence ≥7.5 | "
                f"Systems: {', '.join([s.upper() for s in systems])} | "
                f"Queries: {overall_results.get('total_queries', 'N/A')}"
            )
            fig.text(0.5, 0.02, method_text, ha='center', fontsize=10,
                     style='italic', color='gray')

            # Adjust layout
            plt.tight_layout()
            plt.subplots_adjust(top=0.88, bottom=0.08)

            # Save the chart
            if save_path is None:
                timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
                systems_str = "_vs_".join(systems)
                save_path = f"judge_comparison_charts_{systems_str}_{timestamp}.png"

            results_dir = Path(__file__).parent / "results"
            results_dir.mkdir(exist_ok=True)
            full_path = results_dir / save_path

            plt.savefig(full_path, dpi=300, bbox_inches='tight')
            plt.show()

            print(f"📊 Comparison charts saved to: {full_path}")
            return str(full_path)

        except Exception as e:
            print(f"❌ Chart generation failed: {e}")
            raise

    def _create_radar_chart(self, ax, stats: Dict, systems: List[str]):
        """Create radar chart for multi-dimensional system comparison"""
        ax.set_title('Multi-Dimensional System Comparison', fontweight='bold', pad=20)

        # Prepare data for radar chart using real system-specific data
        categories = ['Overall Actionability', 'Overall Evidence', 'Diagnosis', 'Treatment', 'Mixed']

        # Extract real system-specific metrics
        detailed_results = stats.get('detailed_system_results', {})
        system_data = {}

        for system in systems:
            if system in detailed_results:
                system_info = detailed_results[system]
                system_results = system_info['results']

                # Calculate category-specific performance
                category_performance = {}
                for result in system_results:
                    category = result.get('category', 'unknown').lower()
                    if category not in category_performance:
                        category_performance[category] = {'actionability': [], 'evidence': []}
                    category_performance[category]['actionability'].append(result['actionability_score'])
                    category_performance[category]['evidence'].append(result['evidence_score'])

                # Build radar chart data
                system_scores = [
                    system_info['avg_actionability'],  # Overall Actionability
                    system_info['avg_evidence'],       # Overall Evidence
                    # Category-specific scores (average of actionability and evidence)
                    (sum(category_performance.get('diagnosis', {}).get('actionability', [0])) /
                     len(category_performance.get('diagnosis', {}).get('actionability', [1])) +
                     sum(category_performance.get('diagnosis', {}).get('evidence', [0])) /
                     len(category_performance.get('diagnosis', {}).get('evidence', [1]))) / 2 if 'diagnosis' in category_performance else 0.5,

                    (sum(category_performance.get('treatment', {}).get('actionability', [0])) /
                     len(category_performance.get('treatment', {}).get('actionability', [1])) +
                     sum(category_performance.get('treatment', {}).get('evidence', [0])) /
                     len(category_performance.get('treatment', {}).get('evidence', [1]))) / 2 if 'treatment' in category_performance else 0.5,

                    (sum(category_performance.get('mixed', {}).get('actionability', [0])) /
                     len(category_performance.get('mixed', {}).get('actionability', [1])) +
                     sum(category_performance.get('mixed', {}).get('evidence', [0])) /
                     len(category_performance.get('mixed', {}).get('evidence', [1]))) / 2 if 'mixed' in category_performance else 0.5
                ]
                system_data[system] = system_scores
            else:
                # Fallback to overall stats if detailed results not available
                overall_results = stats['overall_results']
                system_data[system] = [
                    overall_results['average_actionability'],
                    overall_results['average_evidence'],
                    0.7, 0.6, 0.5  # Placeholder for missing category data
                ]

        # Create radar chart
        angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
        angles += angles[:1]  # Complete the circle

        for system in systems:
            values = system_data[system] + [system_data[system][0]]  # Complete the circle
            ax.plot(angles, values, 'o-', linewidth=2,
                    label=f'{system.upper()} System', color=self.colors.get(system, 'gray'))
            ax.fill(angles, values, alpha=0.1, color=self.colors.get(system, 'gray'))

        # Customize radar chart
        ax.set_xticks(angles[:-1])
        ax.set_xticklabels(categories, fontsize=9)
        ax.set_ylim(0, 1)
        ax.set_yticks([0.2, 0.4, 0.6, 0.8, 1.0])
        ax.set_yticklabels(['2.0', '4.0', '6.0', '8.0', '10.0'])
        ax.grid(True, alpha=0.3)
        ax.legend(loc='upper right', bbox_to_anchor=(1.2, 1.0))

        # Add target threshold circle
        target_circle = [0.7] * (len(categories) + 1)  # 7.0 threshold
        ax.plot(angles, target_circle, '--', color=self.colors['target_line'],
                alpha=0.7, label='Target (7.0)')

    def _create_grouped_bar_chart(self, ax, stats: Dict, systems: List[str]):
        """Create grouped bar chart for direct metric comparison"""
        ax.set_title('Direct Metric Comparison', fontweight='bold', pad=20)

        # Prepare data using real system-specific metrics
        metrics = ['Actionability', 'Evidence Quality']
        detailed_results = stats.get('detailed_system_results', {})

        # Extract real system-specific data
        system_scores = {}
        for system in systems:
            if system in detailed_results:
                system_info = detailed_results[system]
                system_scores[system] = [
                    system_info['avg_actionability'],
                    system_info['avg_evidence']
                ]
            else:
                # Fallback to overall results
                overall_results = stats['overall_results']
                system_scores[system] = [
                    overall_results['average_actionability'],
                    overall_results['average_evidence']
                ]

        # Create grouped bar chart
        x = np.arange(len(metrics))
        width = 0.35 if len(systems) == 2 else 0.25

        for i, system in enumerate(systems):
            offset = (i - len(systems)/2 + 0.5) * width
            bars = ax.bar(x + offset, system_scores[system], width,
                          label=f'{system.upper()}', color=self.colors.get(system, 'gray'),
                          alpha=0.8)

            # Add value labels on bars
            for bar, value in zip(bars, system_scores[system]):
                height = bar.get_height()
                ax.text(bar.get_x() + bar.get_width()/2., height + 0.01,
                        f'{value:.3f}', ha='center', va='bottom', fontweight='bold')

        # Add target threshold lines
        ax.axhline(y=0.7, color=self.colors['target_line'], linestyle='--',
                   alpha=0.7, label='Actionability Target (7.0)')
        ax.axhline(y=0.75, color=self.colors['target_line'], linestyle=':',
                   alpha=0.7, label='Evidence Target (7.5)')

        # Customize chart
        ax.set_xlabel('Evaluation Metrics')
        ax.set_ylabel('Score (0-1 scale)')
        ax.set_title('System Performance Comparison')
        ax.set_xticks(x)
        ax.set_xticklabels(metrics)
        ax.legend(loc='upper left')
        ax.grid(True, alpha=0.3, axis='y')
        ax.set_ylim(0, 1.0)

    def _create_scatter_plot(self, ax, stats: Dict, systems: List[str]):
        """Create scatter plot for actionability vs evidence quality analysis"""
        ax.set_title('Actionability vs Evidence Quality Analysis', fontweight='bold', pad=20)

        # Extract real query-level data from detailed results
        detailed_results = stats.get('detailed_system_results', {})

        for system in systems:
            if system in detailed_results:
                system_results = detailed_results[system]['results']

                # Extract real actionability and evidence scores for each query
                actionability_scores = [r['actionability_score'] for r in system_results]
                evidence_scores = [r['evidence_score'] for r in system_results]

                ax.scatter(actionability_scores, evidence_scores,
                           label=f'{system.upper()}', color=self.colors.get(system, 'gray'),
                           alpha=0.7, s=100, edgecolors='white', linewidth=1)
            else:
                # Fallback: create single point from overall averages
                overall_results = stats['overall_results']
                ax.scatter([overall_results['average_actionability']],
                           [overall_results['average_evidence']],
                           label=f'{system.upper()}', color=self.colors.get(system, 'gray'),
                           alpha=0.7, s=100, edgecolors='white', linewidth=1)

        # Add target threshold lines
        ax.axvline(x=0.7, color=self.colors['target_line'], linestyle='--',
                   alpha=0.7, label='Actionability Target')
        ax.axhline(y=0.75, color=self.colors['target_line'], linestyle='--',
                   alpha=0.7, label='Evidence Target')

        # Add target zone
        target_rect = Rectangle((0.7, 0.75), 0.3, 0.25, linewidth=1,
                                edgecolor=self.colors['target_line'], facecolor='green',
                                alpha=0.1, label='Target Zone')
        ax.add_patch(target_rect)

        # Customize chart
        ax.set_xlabel('Clinical Actionability (0-1 scale)')
        ax.set_ylabel('Clinical Evidence Quality (0-1 scale)')
        ax.legend(loc='lower right')
        ax.grid(True, alpha=0.3)
        ax.set_xlim(0, 1)
        ax.set_ylim(0, 1)

    def _create_heatmap(self, ax, stats: Dict, systems: List[str]):
        """Create heatmap for category-wise performance matrix"""
        ax.set_title('Category-wise Performance Matrix', fontweight='bold', pad=20)

        # Prepare data
        categories = ['Diagnosis', 'Treatment', 'Mixed']
        metrics = ['Actionability', 'Evidence']
        category_results = stats['category_results']

        # Create data matrix
        data_matrix = []
        row_labels = []

        for system in systems:
            for metric in metrics:
                row_data = []
                for category in categories:
                    cat_key = category.lower()
                    if cat_key in category_results and category_results[cat_key]['query_count'] > 0:
                        if metric == 'Actionability':
                            value = category_results[cat_key]['average_actionability']
                        else:
                            value = category_results[cat_key]['average_evidence']
                    else:
                        value = 0.5  # Placeholder for missing data
                    row_data.append(value)

                data_matrix.append(row_data)
                row_labels.append(f'{system.upper()}\n{metric}')

        # Create heatmap
        im = ax.imshow(data_matrix, cmap='RdYlGn', aspect='auto', vmin=0, vmax=1)

        # Set ticks and labels
        ax.set_xticks(np.arange(len(categories)))
        ax.set_yticks(np.arange(len(row_labels)))
        ax.set_xticklabels(categories)
        ax.set_yticklabels(row_labels, fontsize=9)

        # Add text annotations
        for i in range(len(row_labels)):
            for j in range(len(categories)):
                text = ax.text(j, i, f'{data_matrix[i][j]:.3f}',
                               ha='center', va='center', fontweight='bold',
                               color='white' if data_matrix[i][j] < 0.5 else 'black')

        # Add colorbar
        cbar = plt.colorbar(im, ax=ax, shrink=0.6)
        cbar.set_label('Performance Score (0-1)', rotation=270, labelpad=15)

        ax.set_xlabel('Query Categories')
        ax.set_ylabel('System × Metric')


# Independent execution interface
if __name__ == "__main__":
    """Independent chart generation interface"""

    print("📊 OnCall.ai LLM Judge Chart Generator - Metrics 5-6 Visualization")

    # Initialize generator
    generator = LLMJudgeChartGenerator()

    try:
        # Load latest statistics
        stats = generator.load_latest_statistics()

        print(f"📈 Generating comparison charts...")

        # Generate comprehensive comparison charts
        chart_path = generator.generate_comparison_charts(stats)

        # Print summary
        comparison_meta = stats.get('comparison_metadata', {})
        systems = comparison_meta.get('systems_compared', ['rag', 'direct'])
        overall_results = stats['overall_results']

        print(f"\n📊 === CHART GENERATION SUMMARY ===")
        print(f"Systems Visualized: {' vs '.join([s.upper() for s in systems])}")
        print(f"Overall Actionability: {overall_results['average_actionability']:.3f}")
        print(f"Overall Evidence Quality: {overall_results['average_evidence']:.3f}")
        print(f"Total Queries: {overall_results['total_queries']}")
        print(f"Chart Components: Radar Chart, Bar Chart, Scatter Plot, Heatmap")

        print(f"\n✅ Comprehensive visualization complete!")
        print(f"📊 Charts saved to: {chart_path}")
        print(f"💡 Tip: Charts optimized for research presentations and publications")

    except FileNotFoundError as e:
        print(f"❌ {e}")
        print(f"💡 Please run judge evaluation first:")
        print("   python metric5_6_llm_judge_evaluator.py rag,direct")
    except Exception as e:
        print(f"❌ Chart generation failed: {e}")
evaluation/metric5_6_llm_judge_evaluator.py
CHANGED
|
@@ -10,6 +10,22 @@ Metrics evaluated:
|
|
| 10 |
5. Clinical Actionability (臨床可操作性)
|
| 11 |
6. Clinical Evidence Quality (臨床證據品質)
|
| 12 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 13 |
Author: YanBo Chen
|
| 14 |
Date: 2025-08-04
|
| 15 |
"""
|
|
@@ -17,12 +33,62 @@ Date: 2025-08-04
|
|
| 17 |
import json
|
| 18 |
import os
|
| 19 |
import sys
|
|
|
|
| 20 |
from typing import Dict, List, Any, Tuple
|
| 21 |
from datetime import datetime
|
| 22 |
from pathlib import Path
|
| 23 |
import glob
|
| 24 |
import re
|
| 25 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 26 |
# Add project path
|
| 27 |
current_dir = Path(__file__).parent
|
| 28 |
project_root = current_dir.parent
|
|
@@ -31,8 +97,7 @@ sys.path.insert(0, str(src_dir))
|
|
| 31 |
|
| 32 |
# Import LLM client for judge evaluation
|
| 33 |
try:
|
| 34 |
-
from llm_clients import
|
| 35 |
-
# TODO: Replace with actual Llama3-70B client when available
|
| 36 |
except ImportError as e:
|
| 37 |
print(f"❌ Import failed: {e}")
|
| 38 |
print("Please ensure running from project root directory")
|
|
@@ -46,10 +111,8 @@ class LLMJudgeEvaluator:
|
|
| 46 |
"""Initialize judge LLM client"""
|
| 47 |
print("🔧 Initializing LLM Judge Evaluator...")
|
| 48 |
|
| 49 |
-
#
|
| 50 |
-
|
| 51 |
-
self.judge_llm = llm_Med42_70BClient()
|
| 52 |
-
print("⚠️ Note: Using Med42 as placeholder for Llama3-70B judge")
|
| 53 |
|
| 54 |
self.evaluation_results = []
|
| 55 |
|
|
@@ -67,55 +130,109 @@ class LLMJudgeEvaluator:
|
|
| 67 |
|
| 68 |
return medical_outputs
|
| 69 |
|
| 70 |
-
def
|
| 71 |
-
"""Find
|
| 72 |
results_dir = Path(__file__).parent / "results"
|
|
|
|
| 73 |
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 86 |
|
| 87 |
-
return
|
| 88 |
|
| 89 |
-
def
|
| 90 |
"""
|
| 91 |
-
Create
|
| 92 |
|
| 93 |
-
|
|
|
|
| 94 |
"""
|
|
|
|
|
|
|
| 95 |
prompt_parts = [
|
| 96 |
-
"You are a medical expert evaluating clinical advice quality.",
|
| 97 |
-
"Please evaluate
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 98 |
"",
|
| 99 |
-
"CRITERIA:",
|
| 100 |
"1. Clinical Actionability (1-10): Can healthcare providers immediately act on this advice?",
|
| 101 |
"2. Clinical Evidence Quality (1-10): Is the advice evidence-based and follows medical standards?",
|
| 102 |
"",
|
| 103 |
"QUERIES TO EVALUATE:",
|
| 104 |
""
|
| 105 |
-
]
|
| 106 |
|
| 107 |
-
#
|
| 108 |
-
|
| 109 |
-
|
| 110 |
-
|
| 111 |
-
|
|
|
|
|
|
|
|
|
|
| 112 |
|
| 113 |
prompt_parts.extend([
|
| 114 |
f"=== QUERY {i} ({category.upper()}) ===",
|
| 115 |
f"Patient Query: {query}",
|
| 116 |
-
f"Medical Advice: {advice}",
|
| 117 |
""
|
| 118 |
])
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 119 |
|
| 120 |
prompt_parts.extend([
|
| 121 |
"RESPONSE FORMAT (provide exactly this format):",
|
|
@@ -123,120 +240,135 @@ class LLMJudgeEvaluator:
|
|
| 123 |
])
|
| 124 |
|
| 125 |
# Add response format template
|
| 126 |
-
for i in range(1, len(
|
| 127 |
-
|
|
|
|
| 128 |
|
| 129 |
prompt_parts.extend([
|
| 130 |
"",
|
| 131 |
"Replace X and Y with numeric scores 1-10.",
|
| 132 |
-
"Provide only the scores in the exact format above."
|
|
|
|
| 133 |
])
|
| 134 |
|
| 135 |
return "\n".join(prompt_parts)
|
| 136 |
|
| 137 |
-
def
|
| 138 |
-
"""Parse
|
| 139 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 140 |
|
| 141 |
-
# Parse response format: "Query 1: Actionability=8, Evidence=7"
|
| 142 |
lines = response.strip().split('\n')
|
| 143 |
|
| 144 |
-
for
|
| 145 |
line = line.strip()
|
| 146 |
if not line:
|
| 147 |
continue
|
| 148 |
|
| 149 |
-
#
|
| 150 |
-
match = re.match(r'Query\s+(\d+):\s*Actionability\s*=\s*(\d+)\s*,\s*Evidence\s*=\s*(\d+)', line, re.IGNORECASE)
|
| 151 |
|
| 152 |
if match:
|
| 153 |
-
query_num = int(match.group(1)) - 1 #
|
| 154 |
-
|
| 155 |
-
|
|
|
|
| 156 |
|
| 157 |
-
if query_num < len(
|
| 158 |
-
|
|
|
|
| 159 |
|
| 160 |
result = {
|
| 161 |
"query": output.get('query', ''),
|
| 162 |
"category": output.get('category', 'unknown'),
|
| 163 |
-
"
|
| 164 |
"medical_advice": output.get('medical_advice', ''),
|
| 165 |
|
| 166 |
# Metric 5: Clinical Actionability
|
| 167 |
-
"actionability_score": actionability_score / 10.0,
|
| 168 |
"actionability_raw": actionability_score,
|
| 169 |
|
| 170 |
# Metric 6: Clinical Evidence Quality
|
| 171 |
-
"evidence_score": evidence_score / 10.0,
|
| 172 |
"evidence_raw": evidence_score,
|
| 173 |
|
| 174 |
"evaluation_success": True,
|
| 175 |
"timestamp": datetime.now().isoformat()
|
| 176 |
}
|
| 177 |
|
| 178 |
-
|
| 179 |
|
| 180 |
-
return
|
| 181 |
|
| 182 |
-
def
|
| 183 |
"""
|
| 184 |
-
|
| 185 |
|
| 186 |
Args:
|
| 187 |
-
|
| 188 |
"""
|
| 189 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 190 |
|
| 191 |
try:
|
| 192 |
-
# Create
|
| 193 |
-
|
| 194 |
|
| 195 |
-
print(f"📝
|
| 196 |
-
print(f"🔄 Calling judge LLM for
|
| 197 |
|
| 198 |
-
# Single LLM call for all
|
| 199 |
eval_start = time.time()
|
| 200 |
-
response = self.judge_llm.
|
| 201 |
eval_time = time.time() - eval_start
|
| 202 |
|
| 203 |
# Extract response text
|
| 204 |
response_text = response.get('content', '') if isinstance(response, dict) else str(response)
|
| 205 |
|
| 206 |
-
print(f"✅ Judge LLM completed
|
| 207 |
print(f"📄 Response length: {len(response_text)} characters")
|
| 208 |
|
| 209 |
-
# Parse
|
| 210 |
-
|
| 211 |
-
|
| 212 |
-
if len(parsed_results) != len(medical_outputs):
|
| 213 |
-
print(f"⚠️ Warning: Expected {len(medical_outputs)} results, got {len(parsed_results)}")
|
| 214 |
|
| 215 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 216 |
|
| 217 |
-
|
| 218 |
|
| 219 |
-
return
|
| 220 |
|
| 221 |
except Exception as e:
|
| 222 |
-
print(f"❌
|
| 223 |
|
| 224 |
-
# Create error results for all
|
| 225 |
-
error_results =
|
| 226 |
-
for
|
| 227 |
-
|
| 228 |
-
|
| 229 |
-
|
| 230 |
-
|
| 231 |
-
|
| 232 |
-
|
| 233 |
-
|
| 234 |
-
|
| 235 |
-
|
| 236 |
-
|
| 237 |
-
|
|
|
|
|
|
|
|
|
|
| 238 |
|
| 239 |
-
self.evaluation_results.extend(error_results)
|
| 240 |
return error_results
|
| 241 |
|
| 242 |
def calculate_judge_statistics(self) -> Dict[str, Any]:
|
|
@@ -309,93 +441,168 @@ class LLMJudgeEvaluator:
|
|
| 309 |
"timestamp": datetime.now().isoformat()
|
| 310 |
}
|
| 311 |
|
| 312 |
-
def
|
| 313 |
-
"""Save
|
| 314 |
stats = self.calculate_judge_statistics()
|
| 315 |
|
| 316 |
if filename is None:
|
| 317 |
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
| 318 |
-
|
|
|
|
| 319 |
|
| 320 |
results_dir = Path(__file__).parent / "results"
|
| 321 |
results_dir.mkdir(exist_ok=True)
|
| 322 |
filepath = results_dir / filename
|
| 323 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 324 |
with open(filepath, 'w', encoding='utf-8') as f:
|
| 325 |
json.dump(stats, f, indent=2, ensure_ascii=False)
|
| 326 |
|
| 327 |
-
print(f"📊
|
| 328 |
return str(filepath)
|
| 329 |
|
| 330 |
|
| 331 |
# Independent execution interface
|
| 332 |
if __name__ == "__main__":
|
| 333 |
-
"""Independent LLM judge evaluation interface"""
|
|
|
|
|
|
|
| 334 |
|
| 335 |
-
|
|
|
|
| 336 |
|
| 337 |
-
if len(sys.argv)
|
| 338 |
-
|
| 339 |
-
|
| 340 |
-
print("
|
| 341 |
-
print(" rag
|
| 342 |
-
print("
|
| 343 |
sys.exit(1)
|
| 344 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 345 |
# Initialize evaluator
|
| 346 |
evaluator = LLMJudgeEvaluator()
|
| 347 |
|
| 348 |
try:
|
| 349 |
-
|
| 350 |
-
|
| 351 |
-
|
| 352 |
-
|
| 353 |
-
|
| 354 |
-
|
| 355 |
-
|
| 356 |
-
|
| 357 |
-
|
| 358 |
-
|
| 359 |
-
|
| 360 |
-
|
| 361 |
-
|
| 362 |
-
|
| 363 |
-
|
| 364 |
-
|
| 365 |
-
|
| 366 |
-
|
| 367 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 368 |
|
| 369 |
# Print summary
|
|
|
|
| 370 |
stats = evaluator.calculate_judge_statistics()
|
| 371 |
overall_results = stats['overall_results']
|
| 372 |
-
category_results = stats['category_results']
|
| 373 |
|
| 374 |
-
print(f"\n📊 === LLM JUDGE EVALUATION SUMMARY
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 375 |
print(f"Overall Performance:")
|
| 376 |
-
|
| 377 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 378 |
print(f" Actionability Target (≥7.0): {'✅ Met' if overall_results['actionability_target_met'] else '❌ Not Met'}")
|
| 379 |
print(f" Evidence Target (≥7.5): {'✅ Met' if overall_results['evidence_target_met'] else '❌ Not Met'}")
|
| 380 |
|
| 381 |
-
|
| 382 |
-
|
| 383 |
-
|
| 384 |
-
|
| 385 |
-
|
| 386 |
-
|
| 387 |
-
|
|
|
|
|
|
|
| 388 |
|
| 389 |
print(f"\n✅ LLM judge evaluation complete!")
|
| 390 |
print(f"📊 Statistics: {stats_path}")
|
| 391 |
-
print(f"⚡ Efficiency: {
|
| 392 |
|
| 393 |
except FileNotFoundError as e:
|
| 394 |
print(f"❌ {e}")
|
| 395 |
-
print(f"💡 Please run
|
| 396 |
-
|
| 397 |
-
|
| 398 |
-
|
| 399 |
-
|
|
|
|
|
|
|
|
|
|
| 400 |
except Exception as e:
|
| 401 |
print(f"❌ Judge evaluation failed: {e}")
|
|
|
|
| 10 |
5. Clinical Actionability (臨床可操作性)
|
| 11 |
6. Clinical Evidence Quality (臨床證據品質)
|
| 12 |
|
| 13 |
+
EVALUATION RUBRICS:
|
| 14 |
+
|
| 15 |
+
Metric 5: Clinical Actionability (1-10 scale)
|
| 16 |
+
1-2 points: Almost no actionable advice; extremely abstract or empty responses.
|
| 17 |
+
3-4 points: Provides some directional suggestions but too vague, lacks clear steps.
|
| 18 |
+
5-6 points: Offers basic executable steps but lacks details or insufficient explanation for key aspects.
|
| 19 |
+
7-8 points: Clear and complete steps that clinicians can follow, with occasional gaps needing supplementation.
|
| 20 |
+
9-10 points: Extremely actionable with precise, step-by-step executable guidance; can be used "as-is" immediately.
|
| 21 |
+
|
| 22 |
+
Metric 6: Clinical Evidence Quality (1-10 scale)
|
| 23 |
+
1-2 points: Almost no evidence support; cites completely irrelevant or unreliable sources.
|
| 24 |
+
3-4 points: References lower quality literature or guidelines, or sources lack authority.
|
| 25 |
+
5-6 points: Uses general quality literature/guidelines but lacks depth or currency.
|
| 26 |
+
7-8 points: References reliable, authoritative sources (renowned journals or authoritative guidelines) with accurate explanations.
|
| 27 |
+
9-10 points: Rich and high-quality evidence sources (systematic reviews, RCTs, etc.) combined with latest research; enhances recommendation credibility.
|
| 28 |
+
|
| 29 |
Author: YanBo Chen
|
| 30 |
Date: 2025-08-04
|
| 31 |
"""
|
|
|
|
| 33 |
import json
|
| 34 |
import os
|
| 35 |
import sys
|
| 36 |
+
import time
|
| 37 |
from typing import Dict, List, Any, Tuple
|
| 38 |
from datetime import datetime
|
| 39 |
from pathlib import Path
|
| 40 |
import glob
|
| 41 |
import re
|
| 42 |
|
| 43 |
+
# Evaluation Rubrics as programmable constants
|
| 44 |
+
ACTIONABILITY_RUBRIC = {
|
| 45 |
+
(1, 2): "Almost no actionable advice; extremely abstract or empty responses.",
|
| 46 |
+
(3, 4): "Provides some directional suggestions but too vague, lacks clear steps.",
|
| 47 |
+
(5, 6): "Offers basic executable steps but lacks details or insufficient explanation for key aspects.",
|
| 48 |
+
(7, 8): "Clear and complete steps that clinicians can follow, with occasional gaps needing supplementation.",
|
| 49 |
+
(9, 10): "Extremely actionable with precise, step-by-step executable guidance; can be used 'as-is' immediately."
|
| 50 |
+
}
|
| 51 |
+
|
| 52 |
+
EVIDENCE_RUBRIC = {
|
| 53 |
+
(1, 2): "Almost no evidence support; cites completely irrelevant or unreliable sources.",
|
| 54 |
+
(3, 4): "References lower quality literature or guidelines, or sources lack authority.",
|
| 55 |
+
(5, 6): "Uses general quality literature/guidelines but lacks depth or currency.",
|
| 56 |
+
(7, 8): "References reliable, authoritative sources (renowned journals or authoritative guidelines) with accurate explanations.",
|
| 57 |
+
(9, 10): "Rich and high-quality evidence sources (systematic reviews, RCTs, etc.) combined with latest research; enhances recommendation credibility."
|
| 58 |
+
}
|
| 59 |
+
|
| 60 |
+
def print_evaluation_rubrics():
|
| 61 |
+
"""Print detailed evaluation rubrics for reference"""
|
| 62 |
+
print("=" * 60)
|
| 63 |
+
print("CLINICAL EVALUATION RUBRICS")
|
| 64 |
+
print("=" * 60)
|
| 65 |
+
|
| 66 |
+
print("\n🎯 METRIC 5: Clinical Actionability (1-10 scale)")
|
| 67 |
+
print("-" * 50)
|
| 68 |
+
for score_range, description in ACTIONABILITY_RUBRIC.items():
|
| 69 |
+
print(f"{score_range[0]}–{score_range[1]} points: {description}")
|
| 70 |
+
|
| 71 |
+
print("\n📚 METRIC 6: Clinical Evidence Quality (1-10 scale)")
|
| 72 |
+
print("-" * 50)
|
| 73 |
+
for score_range, description in EVIDENCE_RUBRIC.items():
|
| 74 |
+
print(f"{score_range[0]}–{score_range[1]} points: {description}")
|
| 75 |
+
|
| 76 |
+
print("\n" + "=" * 60)
|
| 77 |
+
print("TARGET THRESHOLDS:")
|
| 78 |
+
print("• Actionability: ≥7.0 (Acceptable clinical utility)")
|
| 79 |
+
print("• Evidence Quality: ≥7.5 (Reliable evidence support)")
|
| 80 |
+
print("=" * 60)
|
| 81 |
+
|
| 82 |
+
def get_rubric_description(score: int, metric_type: str) -> str:
|
| 83 |
+
"""Get rubric description for a given score and metric type"""
|
| 84 |
+
rubric = ACTIONABILITY_RUBRIC if metric_type == "actionability" else EVIDENCE_RUBRIC
|
| 85 |
+
|
| 86 |
+
for score_range, description in rubric.items():
|
| 87 |
+
if score_range[0] <= score <= score_range[1]:
|
| 88 |
+
return description
|
| 89 |
+
|
| 90 |
+
return "Score out of valid range (1-10)"
|
| 91 |
+
|
| 92 |
# Add project path
|
| 93 |
current_dir = Path(__file__).parent
|
| 94 |
project_root = current_dir.parent
|
|
|
|
| 97 |
|
| 98 |
# Import LLM client for judge evaluation
|
| 99 |
try:
|
| 100 |
+
from llm_clients import llm_Llama3_70B_JudgeClient
|
|
|
|
| 101 |
except ImportError as e:
|
| 102 |
print(f"❌ Import failed: {e}")
|
| 103 |
print("Please ensure running from project root directory")
|
|
|
|
| 111 |
"""Initialize judge LLM client"""
|
| 112 |
print("🔧 Initializing LLM Judge Evaluator...")
|
| 113 |
|
| 114 |
+
# Initialize Llama3-70B as judge LLM
|
| 115 |
+
self.judge_llm = llm_Llama3_70B_JudgeClient()
|
|
|
|
|
|
|
| 116 |
|
| 117 |
self.evaluation_results = []
|
| 118 |
|
|
|
|
| 130 |
|
| 131 |
return medical_outputs
|
| 132 |
|
| 133 |
+
def find_medical_outputs_for_systems(self, systems: List[str]) -> Dict[str, str]:
|
| 134 |
+
"""Find medical outputs files for multiple systems"""
|
| 135 |
results_dir = Path(__file__).parent / "results"
|
| 136 |
+
system_files = {}
|
| 137 |
|
| 138 |
+
for system in systems:
|
| 139 |
+
if system == "rag":
|
| 140 |
+
pattern = str(results_dir / "medical_outputs_*.json")
|
| 141 |
+
elif system == "direct":
|
| 142 |
+
pattern = str(results_dir / "medical_outputs_direct_*.json")
|
| 143 |
+
else:
|
| 144 |
+
# Future extension: support other systems
|
| 145 |
+
pattern = str(results_dir / f"medical_outputs_{system}_*.json")
|
| 146 |
+
|
| 147 |
+
output_files = glob.glob(pattern)
|
| 148 |
+
|
| 149 |
+
if not output_files:
|
| 150 |
+
raise FileNotFoundError(f"No medical outputs files found for {system} system")
|
| 151 |
+
|
| 152 |
+
latest_file = max(output_files, key=os.path.getmtime)
|
| 153 |
+
system_files[system] = latest_file
|
| 154 |
+
print(f"📊 Found {system} outputs: {latest_file}")
|
| 155 |
|
| 156 |
+
return system_files
|
| 157 |
|
| 158 | +      def create_comparison_evaluation_prompt(self, systems_outputs: Dict[str, List[Dict]]) -> str:
| 159 |            """
| 160 | +          Create comparison evaluation prompt for multiple systems
| 161 |  
| 162 | +          Args:
| 163 | +              systems_outputs: Dict mapping system names to their medical outputs
| 164 |            """
| 165 | +          system_names = list(systems_outputs.keys())
| 166 | +  
| 167 |            prompt_parts = [
| 168 | +              "You are a medical expert evaluating and comparing AI systems for clinical advice quality.",
| 169 | +              f"Please evaluate {len(system_names)} different systems using the detailed rubrics below:",
| 170 | +              "",
| 171 | +              "EVALUATION RUBRICS:",
| 172 | +              "",
| 173 | +              "METRIC 1: Clinical Actionability (1-10 scale)",
| 174 | +              "Question: Can healthcare providers immediately act on this advice?",
| 175 | +              "1-2 points: Almost no actionable advice; extremely abstract or empty responses.",
| 176 | +              "3-4 points: Provides directional suggestions but too vague, lacks clear steps.",
| 177 | +              "5-6 points: Offers basic executable steps but lacks details for key aspects.",
| 178 | +              "7-8 points: Clear and complete steps that clinicians can follow with occasional gaps.",
| 179 | +              "9-10 points: Extremely actionable with precise, step-by-step executable guidance.",
| 180 | +              "",
| 181 | +              "METRIC 2: Clinical Evidence Quality (1-10 scale)",
| 182 | +              "Question: Is the advice evidence-based and follows medical standards?",
| 183 | +              "1-2 points: Almost no evidence support; cites irrelevant or unreliable sources.",
| 184 | +              "3-4 points: References lower quality literature or sources lack authority.",
| 185 | +              "5-6 points: Uses general quality literature/guidelines but lacks depth or currency.",
| 186 | +              "7-8 points: References reliable, authoritative sources with accurate explanations.",
| 187 | +              "9-10 points: Rich, high-quality evidence sources combined with latest research.",
| 188 | +              "",
| 189 | +              "TARGET THRESHOLDS: Actionability ≥7.0, Evidence Quality ≥7.5",
| 190 | +              ""
| 191 | +          ]
| 192 | +  
| 193 | +          # Add system descriptions
| 194 | +          for i, system in enumerate(system_names, 1):
| 195 | +              if system == "rag":
| 196 | +                  prompt_parts.append(f"SYSTEM {i} (RAG): Uses medical guidelines + LLM for evidence-based advice")
| 197 | +              elif system == "direct":
| 198 | +                  prompt_parts.append(f"SYSTEM {i} (Direct): Uses LLM only without external guidelines")
| 199 | +              else:
| 200 | +                  prompt_parts.append(f"SYSTEM {i} ({system.upper()}): {system} medical AI system")
| 201 | +  
| 202 | +          prompt_parts.extend([
| 203 |                "",
| 204 | +              "EVALUATION CRITERIA:",
| 205 |                "1. Clinical Actionability (1-10): Can healthcare providers immediately act on this advice?",
| 206 |                "2. Clinical Evidence Quality (1-10): Is the advice evidence-based and follows medical standards?",
| 207 |                "",
| 208 |                "QUERIES TO EVALUATE:",
| 209 |                ""
| 210 | +          ])
| 211 |  
| 212 | +          # Get all queries (assuming all systems processed same queries)
| 213 | +          first_system = system_names[0]
| 214 | +          queries = systems_outputs[first_system]
| 215 | +  
| 216 | +          # Add each query with all system responses
| 217 | +          for i, query_data in enumerate(queries, 1):
| 218 | +              query = query_data.get('query', '')
| 219 | +              category = query_data.get('category', 'unknown')
| 220 |  
| 221 |                prompt_parts.extend([
| 222 |                    f"=== QUERY {i} ({category.upper()}) ===",
| 223 |                    f"Patient Query: {query}",
| 224 |                    ""
| 225 |                ])
| 226 | +  
| 227 | +              # Add each system's response
| 228 | +              for j, system in enumerate(system_names, 1):
| 229 | +                  system_query = systems_outputs[system][i-1]  # Get corresponding query from this system
| 230 | +                  advice = system_query.get('medical_advice', '')
| 231 | +  
| 232 | +                  prompt_parts.extend([
| 233 | +                      f"SYSTEM {j} Response: {advice}",
| 234 | +                      ""
| 235 | +                  ])
| 236 |  
| 237 |            prompt_parts.extend([
| 238 |                "RESPONSE FORMAT (provide exactly this format):",
| 240 |            ])
| 241 |  
| 242 |            # Add response format template
| 243 | +          for i in range(1, len(queries) + 1):
| 244 | +              for j, system in enumerate(system_names, 1):
| 245 | +                  prompt_parts.append(f"Query {i} System {j}: Actionability=X, Evidence=Y")
| 246 |  
| 247 |            prompt_parts.extend([
| 248 |                "",
| 249 |                "Replace X and Y with numeric scores 1-10.",
| 250 | +              "Provide only the scores in the exact format above.",
| 251 | +              f"Note: System 1={system_names[0]}, System 2={system_names[1] if len(system_names) > 1 else 'N/A'}"
| 252 |            ])
| 253 |  
| 254 |            return "\n".join(prompt_parts)
| 255 |  
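To make the expected judge output concrete, here is a minimal sketch (not part of the diff) of the reply the `Query {i} System {j}: Actionability=X, Evidence=Y` template above asks for, assuming two queries compared across a `rag` and a `direct` system; all scores are placeholders:

```python
# Hypothetical well-formed judge reply matching the RESPONSE FORMAT template
expected_reply = """Query 1 System 1: Actionability=8, Evidence=8
Query 1 System 2: Actionability=6, Evidence=5
Query 2 System 1: Actionability=9, Evidence=8
Query 2 System 2: Actionability=7, Evidence=6"""
```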
| 256 | +      def parse_comparison_evaluation_response(self, response: str, systems_outputs: Dict[str, List[Dict]]) -> Dict[str, List[Dict]]:
| 257 | +          """Parse comparison evaluation response into results by system"""
| 258 | +          results_by_system = {}
| 259 | +          system_names = list(systems_outputs.keys())
| 260 | +  
| 261 | +          # Initialize results for each system
| 262 | +          for system in system_names:
| 263 | +              results_by_system[system] = []
| 264 |  
| 265 |            lines = response.strip().split('\n')
| 266 |  
| 267 | +          for line in lines:
| 268 |                line = line.strip()
| 269 |                if not line:
| 270 |                    continue
| 271 |  
| 272 | +              # Parse format: "Query X System Y: Actionability=A, Evidence=B"
| 273 | +              match = re.match(r'Query\s+(\d+)\s+System\s+(\d+):\s*Actionability\s*=\s*(\d+)\s*,\s*Evidence\s*=\s*(\d+)', line, re.IGNORECASE)
| 274 |  
| 275 |                if match:
| 276 | +                  query_num = int(match.group(1)) - 1  # 0-based index
| 277 | +                  system_num = int(match.group(2)) - 1  # 0-based index
| 278 | +                  actionability_score = int(match.group(3))
| 279 | +                  evidence_score = int(match.group(4))
| 280 |  
| 281 | +                  if system_num < len(system_names) and query_num < len(systems_outputs[system_names[system_num]]):
| 282 | +                      system_name = system_names[system_num]
| 283 | +                      output = systems_outputs[system_name][query_num]
| 284 |  
| 285 |                        result = {
| 286 |                            "query": output.get('query', ''),
| 287 |                            "category": output.get('category', 'unknown'),
| 288 | +                          "system_type": system_name,
| 289 |                            "medical_advice": output.get('medical_advice', ''),
| 290 |  
| 291 |                            # Metric 5: Clinical Actionability
| 292 | +                          "actionability_score": actionability_score / 10.0,
| 293 |                            "actionability_raw": actionability_score,
| 294 |  
| 295 |                            # Metric 6: Clinical Evidence Quality
| 296 | +                          "evidence_score": evidence_score / 10.0,
| 297 |                            "evidence_raw": evidence_score,
| 298 |  
| 299 |                            "evaluation_success": True,
| 300 |                            "timestamp": datetime.now().isoformat()
| 301 |                        }
| 302 |  
| 303 | +                      results_by_system[system_name].append(result)
| 304 |  
| 305 | +          return results_by_system
| 306 |  
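A quick way to sanity-check the parsing pattern used above (same regex; the example line is invented for illustration):

```python
import re

# Same pattern as in parse_comparison_evaluation_response
PATTERN = r'Query\s+(\d+)\s+System\s+(\d+):\s*Actionability\s*=\s*(\d+)\s*,\s*Evidence\s*=\s*(\d+)'

line = "Query 3 System 2: Actionability=7, Evidence=8"
match = re.match(PATTERN, line, re.IGNORECASE)
if match:
    # Groups are returned as strings: query 3, system 2, actionability 7, evidence 8
    print(match.groups())  # ('3', '2', '7', '8')
```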
| 307 | +      def evaluate_multiple_systems(self, systems_outputs: Dict[str, List[Dict]]) -> Dict[str, List[Dict]]:
| 308 |            """
| 309 | +          Evaluate multiple systems using single LLM call for comparison
| 310 |  
| 311 |            Args:
| 312 | +              systems_outputs: Dict mapping system names to their medical outputs
| 313 |            """
| 314 | +          system_names = list(systems_outputs.keys())
| 315 | +          total_queries = len(systems_outputs[system_names[0]])
| 316 | +  
| 317 | +          print(f"🧠 Multi-system comparison: {', '.join(system_names)}")
| 318 | +          print(f"📊 Evaluating {total_queries} queries across {len(system_names)} systems...")
| 319 |  
| 320 |            try:
| 321 | +              # Create comparison evaluation prompt
| 322 | +              comparison_prompt = self.create_comparison_evaluation_prompt(systems_outputs)
| 323 |  
| 324 | +              print(f"📝 Comparison prompt created ({len(comparison_prompt)} characters)")
| 325 | +              print(f"🔄 Calling judge LLM for multi-system comparison...")
| 326 |  
| 327 | +              # Single LLM call for all systems comparison
| 328 |                eval_start = time.time()
| 329 | +              response = self.judge_llm.batch_evaluate(comparison_prompt)
| 330 |                eval_time = time.time() - eval_start
| 331 |  
| 332 |                # Extract response text
| 333 |                response_text = response.get('content', '') if isinstance(response, dict) else str(response)
| 334 |  
| 335 | +              print(f"✅ Judge LLM completed comparison evaluation in {eval_time:.2f}s")
| 336 |                print(f"📄 Response length: {len(response_text)} characters")
| 337 |  
| 338 | +              # Parse comparison response
| 339 | +              results_by_system = self.parse_comparison_evaluation_response(response_text, systems_outputs)
| 340 |  
| 341 | +              # Combine all results for storage
| 342 | +              all_results = []
| 343 | +              for system_name, system_results in results_by_system.items():
| 344 | +                  all_results.extend(system_results)
| 345 | +                  print(f"📊 {system_name.upper()}: {len(system_results)} evaluations parsed")
| 346 |  
| 347 | +              self.evaluation_results.extend(all_results)
| 348 |  
| 349 | +              return results_by_system
| 350 |  
| 351 |            except Exception as e:
| 352 | +              print(f"❌ Multi-system evaluation failed: {e}")
| 353 |  
| 354 | +              # Create error results for all systems
| 355 | +              error_results = {}
| 356 | +              for system_name, outputs in systems_outputs.items():
| 357 | +                  error_results[system_name] = []
| 358 | +                  for output in outputs:
| 359 | +                      error_result = {
| 360 | +                          "query": output.get('query', ''),
| 361 | +                          "category": output.get('category', 'unknown'),
| 362 | +                          "system_type": system_name,
| 363 | +                          "actionability_score": 0.0,
| 364 | +                          "evidence_score": 0.0,
| 365 | +                          "evaluation_success": False,
| 366 | +                          "error": str(e),
| 367 | +                          "timestamp": datetime.now().isoformat()
| 368 | +                      }
| 369 | +                      error_results[system_name].append(error_result)
| 370 | +                  self.evaluation_results.extend(error_results[system_name])
| 371 |  
| 372 |                return error_results
| 373 |  
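A hedged sketch of how this method is driven (the same calls appear in the `__main__` block further down in this diff); it assumes the latest medical outputs files already exist under `evaluation/results/`:

```python
# Sketch only: discover the latest outputs per system, then run one comparison call
evaluator = LLMJudgeEvaluator()
system_files = evaluator.find_medical_outputs_for_systems(["rag", "direct"])
systems_outputs = {name: evaluator.load_medical_outputs(path) for name, path in system_files.items()}

results_by_system = evaluator.evaluate_multiple_systems(systems_outputs)
stats_path = evaluator.save_comparison_statistics(["rag", "direct"])
```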
| 374 |        def calculate_judge_statistics(self) -> Dict[str, Any]:

| 441 |                "timestamp": datetime.now().isoformat()
| 442 |            }
| 443 |  
| 444 | +      def save_comparison_statistics(self, systems: List[str], filename: str = None) -> str:
| 445 | +          """Save comparison evaluation statistics for multiple systems"""
| 446 |            stats = self.calculate_judge_statistics()
| 447 |  
| 448 |            if filename is None:
| 449 |                timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
| 450 | +              systems_str = "_vs_".join(systems)
| 451 | +              filename = f"judge_evaluation_comparison_{systems_str}_{timestamp}.json"
| 452 |  
| 453 |            results_dir = Path(__file__).parent / "results"
| 454 |            results_dir.mkdir(exist_ok=True)
| 455 |            filepath = results_dir / filename
| 456 |  
| 457 | +          # Add comparison metadata
| 458 | +          stats["comparison_metadata"] = {
| 459 | +              "systems_compared": systems,
| 460 | +              "comparison_type": "multi_system",
| 461 | +              "timestamp": datetime.now().isoformat()
| 462 | +          }
| 463 | +  
| 464 | +          # Add detailed system-specific results for chart generation
| 465 | +          stats["detailed_system_results"] = {}
| 466 | +          for system in systems:
| 467 | +              system_results = [r for r in self.evaluation_results if r.get('system_type') == system and r.get('evaluation_success')]
| 468 | +              stats["detailed_system_results"][system] = {
| 469 | +                  "results": system_results,
| 470 | +                  "query_count": len(system_results),
| 471 | +                  "avg_actionability": sum(r['actionability_score'] for r in system_results) / len(system_results) if system_results else 0.0,
| 472 | +                  "avg_evidence": sum(r['evidence_score'] for r in system_results) / len(system_results) if system_results else 0.0
| 473 | +              }
| 474 | +  
| 475 |            with open(filepath, 'w', encoding='utf-8') as f:
| 476 |                json.dump(stats, f, indent=2, ensure_ascii=False)
| 477 |  
| 478 | +          print(f"📊 Comparison evaluation statistics saved to: {filepath}")
| 479 |            return str(filepath)
| 480 |  
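Roughly, the two keys added above give the saved statistics file a shape like the following (all values are placeholders; the remaining keys come from `calculate_judge_statistics()`):

```python
# Illustrative shape of the extra keys written by save_comparison_statistics
comparison_extras = {
    "comparison_metadata": {
        "systems_compared": ["rag", "direct"],
        "comparison_type": "multi_system",
        "timestamp": "2025-08-04T12:00:00",
    },
    "detailed_system_results": {
        "rag": {"results": ["... per-query result dicts ..."], "query_count": 3,
                "avg_actionability": 0.80, "avg_evidence": 0.75},
        "direct": {"results": ["... per-query result dicts ..."], "query_count": 3,
                   "avg_actionability": 0.60, "avg_evidence": 0.55},
    },
}
```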
| 481 |  
| 482 |    # Independent execution interface
| 483 |    if __name__ == "__main__":
| 484 | +      """Independent LLM judge evaluation interface with multi-system support"""
| 485 | +  
| 486 | +      print("🧠 OnCall.ai LLM Judge Evaluator - Metrics 5-6 Multi-System Evaluation")
| 487 |  
| 488 | +      # Print evaluation rubrics for reference
| 489 | +      print_evaluation_rubrics()
| 490 |  
| 491 | +      if len(sys.argv) < 2:
| 492 | +          print("Usage: python metric5_6_llm_judge_evaluator.py [system1] or [system1,system2,...]")
| 493 | +          print("  rag - Evaluate RAG system medical outputs")
| 494 | +          print("  direct - Evaluate direct LLM medical outputs")
| 495 | +          print("  rag,direct - Compare RAG vs Direct systems")
| 496 | +          print("  system1,system2,system3 - Compare multiple systems")
| 497 |            sys.exit(1)
| 498 |  
| 499 | +      # Parse systems from command line
| 500 | +      systems_input = sys.argv[1]
| 501 | +      systems = [s.strip() for s in systems_input.split(',')]
| 502 | +  
| 503 |        # Initialize evaluator
| 504 |        evaluator = LLMJudgeEvaluator()
| 505 |  
| 506 |        try:
| 507 | +          if len(systems) == 1:
| 508 | +              # Single system evaluation (legacy mode)
| 509 | +              system = systems[0]
| 510 | +              print(f"\n🧪 Single System LLM Judge Evaluation: {system.upper()}")
| 511 | +  
| 512 | +              # Find and load medical outputs for single system
| 513 | +              system_files = evaluator.find_medical_outputs_for_systems([system])
| 514 | +              medical_outputs = evaluator.load_medical_outputs(system_files[system])
| 515 | +  
| 516 | +              if not medical_outputs:
| 517 | +                  print(f"❌ No medical outputs found for {system}")
| 518 | +                  sys.exit(1)
| 519 | +  
| 520 | +              print(f"📊 Evaluating {len(medical_outputs)} medical advice outputs")
| 521 | +              print(f"🎯 Metrics: 5 (Actionability) + 6 (Evidence Quality)")
| 522 | +  
| 523 | +              # Convert to multi-system format for consistency
| 524 | +              systems_outputs = {system: medical_outputs}
| 525 | +              results_by_system = evaluator.evaluate_multiple_systems(systems_outputs)
| 526 | +  
| 527 | +              # Save results
| 528 | +              stats_path = evaluator.save_comparison_statistics([system])
| 529 | +  
| 530 | +          else:
| 531 | +              # Multi-system comparison evaluation
| 532 | +              print(f"\n🧪 Multi-System Comparison: {' vs '.join([s.upper() for s in systems])}")
| 533 | +  
| 534 | +              # Find and load medical outputs for all systems
| 535 | +              system_files = evaluator.find_medical_outputs_for_systems(systems)
| 536 | +              systems_outputs = {}
| 537 | +  
| 538 | +              for system in systems:
| 539 | +                  outputs = evaluator.load_medical_outputs(system_files[system])
| 540 | +                  if not outputs:
| 541 | +                      print(f"❌ No medical outputs found for {system}")
| 542 | +                      sys.exit(1)
| 543 | +                  systems_outputs[system] = outputs
| 544 | +  
| 545 | +              # Validate all systems have same number of queries
| 546 | +              query_counts = [len(outputs) for outputs in systems_outputs.values()]
| 547 | +              if len(set(query_counts)) > 1:
| 548 | +                  print(f"⚠️ Warning: Systems have different query counts: {dict(zip(systems, query_counts))}")
| 549 | +  
| 550 | +              print(f"📊 Comparing {len(systems)} systems with {min(query_counts)} queries each")
| 551 | +              print(f"🎯 Metrics: 5 (Actionability) + 6 (Evidence Quality)")
| 552 | +              print(f"⚡ Strategy: Single comparison call for maximum consistency")
| 553 | +  
| 554 | +              # Multi-system comparison evaluation
| 555 | +              results_by_system = evaluator.evaluate_multiple_systems(systems_outputs)
| 556 | +  
| 557 | +              # Save comparison results
| 558 | +              stats_path = evaluator.save_comparison_statistics(systems)
| 559 |  
| 560 |            # Print summary
| 561 | +          print(f"\n📊 Generating evaluation analysis...")
| 562 |            stats = evaluator.calculate_judge_statistics()
| 563 |            overall_results = stats['overall_results']
| 564 |  
| 565 | +          print(f"\n📊 === LLM JUDGE EVALUATION SUMMARY ===")
| 566 | +  
| 567 | +          if len(systems) == 1:
| 568 | +              print(f"System: {systems[0].upper()}")
| 569 | +          else:
| 570 | +              print(f"Systems Compared: {' vs '.join([s.upper() for s in systems])}")
| 571 | +  
| 572 |            print(f"Overall Performance:")
| 573 | +          actionability_raw = overall_results['average_actionability'] * 10
| 574 | +          evidence_raw = overall_results['average_evidence'] * 10
| 575 | +  
| 576 | +          print(f"  Average Actionability: {overall_results['average_actionability']:.3f} ({actionability_raw:.1f}/10)")
| 577 | +          print(f"    • {get_rubric_description(int(actionability_raw), 'actionability')}")
| 578 | +          print(f"  Average Evidence Quality: {overall_results['average_evidence']:.3f} ({evidence_raw:.1f}/10)")
| 579 | +          print(f"    • {get_rubric_description(int(evidence_raw), 'evidence')}")
| 580 |            print(f"  Actionability Target (≥7.0): {'✅ Met' if overall_results['actionability_target_met'] else '❌ Not Met'}")
| 581 |            print(f"  Evidence Target (≥7.5): {'✅ Met' if overall_results['evidence_target_met'] else '❌ Not Met'}")
| 582 |  
| 583 | +          # System-specific breakdown for multi-system comparison
| 584 | +          if len(systems) > 1:
| 585 | +              print(f"\nSystem Breakdown:")
| 586 | +              for system in systems:
| 587 | +                  system_results = [r for r in evaluator.evaluation_results if r.get('system_type') == system and r.get('evaluation_success')]
| 588 | +                  if system_results:
| 589 | +                      avg_action = sum(r['actionability_score'] for r in system_results) / len(system_results)
| 590 | +                      avg_evidence = sum(r['evidence_score'] for r in system_results) / len(system_results)
| 591 | +                      print(f"  {system.upper()}: Actionability={avg_action:.3f}, Evidence={avg_evidence:.3f} [{len(system_results)} queries]")
| 592 |  
| 593 |            print(f"\n✅ LLM judge evaluation complete!")
| 594 |            print(f"📊 Statistics: {stats_path}")
| 595 | +          print(f"⚡ Efficiency: {overall_results['total_queries']} evaluations in 1 LLM call")
| 596 |  
| 597 |        except FileNotFoundError as e:
| 598 |            print(f"❌ {e}")
| 599 | +          print(f"💡 Please run evaluators first:")
| 600 | +          for system in systems:
| 601 | +              if system == "rag":
| 602 | +                  print("   python latency_evaluator.py single_test_query.txt")
| 603 | +              elif system == "direct":
| 604 | +                  print("   python direct_llm_evaluator.py single_test_query.txt")
| 605 | +              else:
| 606 | +                  print(f"   python {system}_evaluator.py single_test_query.txt")
| 607 |        except Exception as e:
| 608 |            print(f"❌ Judge evaluation failed: {e}")
src/llm_clients.py
CHANGED
@@ -461,5 +461,136 @@ def main():
| 461 |            'total_execution_time': total_execution_time
| 462 |        }
| 463 |  
| 464 | +  
| 465 | +  class llm_Llama3_70B_JudgeClient:
| 466 | +      """
| 467 | +      Llama3-70B client specifically for LLM judge evaluation.
| 468 | +      Used for metrics 5-6 evaluation: Clinical Actionability & Evidence Quality.
| 469 | +      """
| 470 | +  
| 471 | +      def __init__(
| 472 | +          self,
| 473 | +          model_name: str = "meta-llama/Meta-Llama-3-70B-Instruct",
| 474 | +          timeout: float = 60.0
| 475 | +      ):
| 476 | +          """
| 477 | +          Initialize Llama3-70B judge client for evaluation tasks.
| 478 | +  
| 479 | +          Args:
| 480 | +              model_name: Hugging Face model name for Llama3-70B
| 481 | +              timeout: API call timeout duration (longer for judge evaluation)
| 482 | +  
| 483 | +          Note: This client is specifically designed for third-party evaluation,
| 484 | +          not for medical advice generation.
| 485 | +          """
| 486 | +          self.logger = logging.getLogger(__name__)
| 487 | +          self.timeout = timeout
| 488 | +          self.model_name = model_name
| 489 | +  
| 490 | +          # Get Hugging Face token from environment
| 491 | +          hf_token = os.getenv('HF_TOKEN')
| 492 | +          if not hf_token:
| 493 | +              self.logger.error("HF_TOKEN is missing from environment variables.")
| 494 | +              raise ValueError(
| 495 | +                  "HF_TOKEN not found in environment variables. "
| 496 | +                  "Please set HF_TOKEN in your .env file or environment."
| 497 | +              )
| 498 | +  
| 499 | +          # Initialize Hugging Face Inference Client for judge evaluation
| 500 | +          try:
| 501 | +              self.client = InferenceClient(
| 502 | +                  provider="auto",
| 503 | +                  api_key=hf_token,
| 504 | +              )
| 505 | +              self.logger.info(f"Llama3-70B judge client initialized with model: {model_name}")
| 506 | +              self.logger.info("Judge LLM: Evaluation tool only. Not for medical advice generation.")
| 507 | +  
| 508 | +          except Exception as e:
| 509 | +              self.logger.error(f"Failed to initialize Llama3-70B judge client: {e}")
| 510 | +              raise
| 511 | +  
| 512 | +      def generate_completion(self, prompt: str) -> Dict[str, Union[str, float]]:
| 513 | +          """
| 514 | +          Generate completion using Llama3-70B for judge evaluation.
| 515 | +  
| 516 | +          Args:
| 517 | +              prompt: Evaluation prompt for medical advice assessment
| 518 | +  
| 519 | +          Returns:
| 520 | +              Dict containing response content and timing information
| 521 | +          """
| 522 | +          import time
| 523 | +  
| 524 | +          start_time = time.time()
| 525 | +  
| 526 | +          try:
| 527 | +              self.logger.info(f"Calling Llama3-70B Judge with evaluation prompt ({len(prompt)} chars)")
| 528 | +  
| 529 | +              # Call Llama3-70B for judge evaluation
| 530 | +              completion = self.client.chat.completions.create(
| 531 | +                  model=self.model_name,
| 532 | +                  messages=[
| 533 | +                      {
| 534 | +                          "role": "user",
| 535 | +                          "content": prompt
| 536 | +                      }
| 537 | +                  ],
| 538 | +                  max_tokens=2048,  # Sufficient for evaluation responses
| 539 | +                  temperature=0.1,  # Low temperature for consistent evaluation
| 540 | +              )
| 541 | +  
| 542 | +              # Extract response content
| 543 | +              response_content = completion.choices[0].message.content
| 544 | +  
| 545 | +              end_time = time.time()
| 546 | +              latency = end_time - start_time
| 547 | +  
| 548 | +              self.logger.info(f"Llama3-70B Judge Response: {response_content[:100]}...")
| 549 | +              self.logger.info(f"Judge Evaluation Latency: {latency:.4f} seconds")
| 550 | +  
| 551 | +              return {
| 552 | +                  'content': response_content,
| 553 | +                  'latency': latency,
| 554 | +                  'model': self.model_name,
| 555 | +                  'timestamp': time.time()
| 556 | +              }
| 557 | +  
| 558 | +          except Exception as e:
| 559 | +              end_time = time.time()
| 560 | +              error_latency = end_time - start_time
| 561 | +  
| 562 | +              self.logger.error(f"Llama3-70B judge evaluation failed: {e}")
| 563 | +              self.logger.error(f"Error occurred after {error_latency:.4f} seconds")
| 564 | +  
| 565 | +              return {
| 566 | +                  'content': f"Judge evaluation error: {str(e)}",
| 567 | +                  'latency': error_latency,
| 568 | +                  'error': str(e),
| 569 | +                  'model': self.model_name,
| 570 | +                  'timestamp': time.time()
| 571 | +              }
| 572 | +  
| 573 | +      def batch_evaluate(self, evaluation_prompt: str) -> Dict[str, Union[str, float]]:
| 574 | +          """
| 575 | +          Specialized method for batch evaluation of medical advice.
| 576 | +          Alias for generate_completion with judge-specific logging.
| 577 | +  
| 578 | +          Args:
| 579 | +              evaluation_prompt: Batch evaluation prompt containing multiple queries
| 580 | +  
| 581 | +          Returns:
| 582 | +              Dict containing batch evaluation results and timing
| 583 | +          """
| 584 | +          self.logger.info("Starting batch judge evaluation...")
| 585 | +          result = self.generate_completion(evaluation_prompt)
| 586 | +  
| 587 | +          if 'error' not in result:
| 588 | +              self.logger.info(f"Batch evaluation completed successfully in {result['latency']:.2f}s")
| 589 | +          else:
| 590 | +              self.logger.error(f"Batch evaluation failed: {result.get('error', 'Unknown error')}")
| 591 | +  
| 592 | +          return result
| 593 | +  
| 594 | +  
| 595 |    if __name__ == "__main__":
| 596 |        main()
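A minimal usage sketch for the new judge client (assumes `HF_TOKEN` is set in the environment and that `src/` is on the import path; the prompt string is a placeholder):

```python
# Sketch only: drive the judge client directly, outside the evaluator script
from llm_clients import llm_Llama3_70B_JudgeClient

client = llm_Llama3_70B_JudgeClient()
result = client.batch_evaluate("Evaluate the following medical advice ...")

if 'error' not in result:
    print(result['content'])                  # judge scores in the requested format
    print(f"latency: {result['latency']:.2f}s")
```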