# Metric 5-6 LLM Judge Evaluator Manual

## Overview

The `metric5_6_llm_judge_evaluator.py` is a multi-system evaluation tool that uses Llama3-70B as a third-party judge to assess medical advice quality across different AI systems. It supports both single-system evaluation and multi-system comparison, using a single LLM call for maximum consistency.

## Metrics Evaluated

**Metric 5: Clinical Actionability**
- Scale: 1-10, normalized to 0.0-1.0 (see the sketch after this list)
- Question: "Can healthcare providers immediately act on this advice?"
- Target: ≥7.0/10 for acceptable actionability

**Metric 6: Clinical Evidence Quality**
- Scale: 1-10, normalized to 0.0-1.0
- Question: "Is the advice evidence-based, and does it follow medical standards?"
- Target: ≥7.5/10 for acceptable evidence quality

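The scoring arithmetic is straightforward; a minimal sketch (the constant and function names are illustrative, not the evaluator's actual code):

```python
# Targets from this manual, expressed on the normalized 0.0-1.0 scale.
ACTIONABILITY_TARGET = 7.0 / 10  # Metric 5: >= 0.70
EVIDENCE_TARGET = 7.5 / 10       # Metric 6: >= 0.75

def normalize(raw_score: float) -> float:
    """Map a 1-10 judge score onto 0.0-1.0."""
    return raw_score / 10.0

print(normalize(8.5) >= ACTIONABILITY_TARGET)  # True  (0.85 >= 0.70)
print(normalize(7.2) >= EVIDENCE_TARGET)       # False (0.72 <  0.75)
```
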
## System Architecture

### Multi-System Support

The evaluator supports flexible system combinations:
- **Single System**: `rag` or `direct`
- **Two-System Comparison**: `rag,direct`
- **Future Extension**: `rag,direct,claude,gpt4` (any combination)

### Judge LLM

- **Model**: Llama3-70B-Instruct via the Hugging Face API
- **Strategy**: Single batch call for all evaluations (sketched below)
- **Temperature**: 0.1 (low, for consistent evaluation)
- **Max Tokens**: 2048 (sufficient for evaluation responses)

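A minimal sketch of the single batched judge call, assuming the `huggingface_hub` client and the `meta-llama/Meta-Llama-3-70B-Instruct` model ID; the evaluator's actual call may differ in its details:

```python
# Sketch: one chat-completion request carries the whole comparison.
import os
from huggingface_hub import InferenceClient

client = InferenceClient(token=os.environ["HF_TOKEN"])

# Built as shown in "Prompt Generation" below; placeholder here.
comparison_prompt = "You are a medical expert evaluating and comparing AI systems..."

response = client.chat_completion(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": comparison_prompt}],
    temperature=0.1,  # low temperature for consistent judging
    max_tokens=2048,  # room for all per-query, per-system scores
)
judge_text = response.choices[0].message.content
```
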
## Prerequisites

### 1. Environment Setup

```bash
# Ensure HF_TOKEN is set in your environment
export HF_TOKEN="your_huggingface_token"

# Or add it to a .env file
echo "HF_TOKEN=your_huggingface_token" >> .env
```

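The token lookup presumably works along these lines (a sketch using the standard `python-dotenv` package; the evaluator may do this differently):

```python
# Load .env entries into the environment, then fail fast if the token
# is still missing.
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # merge .env entries into os.environ
hf_token = os.getenv("HF_TOKEN")
if not hf_token:
    raise RuntimeError("HF_TOKEN is missing from environment variables")
```
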
### 2. Required Data Files

Before running the judge evaluator, you must have medical outputs from your systems:

**For RAG System**:
```bash
python latency_evaluator.py single_test_query.txt
# Generates: results/medical_outputs_YYYYMMDD_HHMMSS.json
```

**For Direct LLM System**:
```bash
python direct_llm_evaluator.py single_test_query.txt
# Generates: results/medical_outputs_direct_YYYYMMDD_HHMMSS.json
```

## Usage

### Command Line Interface

#### Single System Evaluation
```bash
# Evaluate RAG system only
python metric5_6_llm_judge_evaluator.py rag

# Evaluate Direct LLM system only
python metric5_6_llm_judge_evaluator.py direct
```

#### Multi-System Comparison (Recommended)
```bash
# Compare RAG vs Direct systems
python metric5_6_llm_judge_evaluator.py rag,direct

# Future: compare multiple systems
python metric5_6_llm_judge_evaluator.py rag,direct,claude
```

### Complete Workflow Example

```bash
# Step 1: Navigate to the evaluation directory
cd /path/to/GenAI-OnCallAssistant/evaluation

# Step 2: Generate medical outputs from both systems
python latency_evaluator.py single_test_query.txt
python direct_llm_evaluator.py single_test_query.txt

# Step 3: Run the comparative evaluation
python metric5_6_llm_judge_evaluator.py rag,direct
```

## Output Files

### Generated Files

- **Statistics**: `results/judge_evaluation_comparison_rag_vs_direct_YYYYMMDD_HHMMSS.json`
- **Detailed Results**: Stored in the evaluator's internal results array

### File Structure

```json
{
  "comparison_metadata": {
    "systems_compared": ["rag", "direct"],
    "comparison_type": "multi_system",
    "timestamp": "2025-08-04T22:00:00"
  },
  "category_results": {
    "diagnosis": {
      "average_actionability": 0.850,
      "average_evidence": 0.780,
      "query_count": 1,
      "actionability_target_met": true,
      "evidence_target_met": true
    }
  },
  "overall_results": {
    "average_actionability": 0.850,
    "average_evidence": 0.780,
    "successful_evaluations": 2,
    "total_queries": 2,
    "actionability_target_met": true,
    "evidence_target_met": true
  }
}
```

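Once generated, the statistics file can be inspected programmatically; a short sketch using the example file name from this manual:

```python
# Read the statistics file and report overall target compliance.
import json

path = "results/judge_evaluation_comparison_rag_vs_direct_20250804_220000.json"
with open(path) as f:
    stats = json.load(f)

overall = stats["overall_results"]
print(f"Actionability: {overall['average_actionability']:.3f} "
      f"(target met: {overall['actionability_target_met']})")
print(f"Evidence:      {overall['average_evidence']:.3f} "
      f"(target met: {overall['evidence_target_met']})")
```
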
## Evaluation Process

### 1. File Discovery

The evaluator automatically finds the latest medical output files (see the sketch after this list):
- **RAG**: `medical_outputs_*.json`
- **Direct**: `medical_outputs_direct_*.json`
- **Custom**: `medical_outputs_{system}_*.json`

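A minimal sketch of how such a lookup can work (a hypothetical helper, not necessarily the evaluator's implementation):

```python
# Find the newest timestamped output file for a given system.
from pathlib import Path

def find_latest_outputs(system: str, results_dir: str = "results") -> Path:
    pattern = ("medical_outputs_*.json" if system == "rag"
               else f"medical_outputs_{system}_*.json")
    # For "rag" the wildcard would also match suffixed files such as
    # medical_outputs_direct_*, so keep only the unsuffixed names.
    candidates = sorted(p for p in Path(results_dir).glob(pattern)
                        if system != "rag" or p.stem.count("_") == 3)
    if not candidates:
        raise FileNotFoundError(f"No medical outputs files found for {system} system")
    return candidates[-1]  # YYYYMMDD_HHMMSS names sort chronologically
```
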
### 2. Prompt Generation

For multi-system comparison, the evaluator creates a structured prompt:

```
You are a medical expert evaluating and comparing AI systems...
SYSTEM 1 (RAG): Uses medical guidelines + LLM for evidence-based advice
SYSTEM 2 (Direct): Uses LLM only without external guidelines
QUERY 1 (DIAGNOSIS):
Patient Query: 60-year-old patient with hypertension history...
SYSTEM 1 Response: For a 60-year-old patient with...
SYSTEM 2 Response: Based on the symptoms described...
RESPONSE FORMAT:
Query 1 System 1: Actionability=X, Evidence=Y
Query 1 System 2: Actionability=X, Evidence=Y
```

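For illustration, a prompt of this shape could be assembled roughly as follows (hypothetical structure: `queries` as dicts with `category` and `text` keys, `responses_by_system` as an ordered mapping of system name to response list; not the evaluator's actual code):

```python
# Assemble a comparison prompt in the format shown above (sketch only).
def build_comparison_prompt(queries: list[dict], responses_by_system: dict) -> str:
    lines = [
        "You are a medical expert evaluating and comparing AI systems...",
        "SYSTEM 1 (RAG): Uses medical guidelines + LLM for evidence-based advice",
        "SYSTEM 2 (Direct): Uses LLM only without external guidelines",
    ]
    for i, query in enumerate(queries, start=1):
        lines.append(f"QUERY {i} ({query['category'].upper()}):")
        lines.append(f"Patient Query: {query['text']}")
        for j, system in enumerate(responses_by_system, start=1):
            lines.append(f"SYSTEM {j} Response: {responses_by_system[system][i - 1]}")
    lines.append("RESPONSE FORMAT:")
    lines.append("Query N System M: Actionability=X, Evidence=Y")
    return "\n".join(lines)
```
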
### 3. LLM Judge Evaluation

- **Single API Call**: All systems are evaluated in one request for consistency
- **Response Parsing**: Automatic extraction of numerical scores (sketched below)
- **Error Handling**: Graceful handling of parsing failures

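Score extraction can be as simple as a regular expression over the judge's reply; a sketch matching the `Query N System M: Actionability=X, Evidence=Y` format (hypothetical implementation):

```python
# Parse (query, system) -> normalized scores from the judge's text.
import re

LINE_RE = re.compile(
    r"Query\s+(\d+)\s+System\s+(\d+):\s*"
    r"Actionability\s*=\s*([\d.]+),\s*Evidence\s*=\s*([\d.]+)"
)

def parse_judge_response(text: str) -> dict:
    scores = {}
    for query, system, actionability, evidence in LINE_RE.findall(text):
        scores[(int(query), int(system))] = {
            "actionability": float(actionability) / 10.0,  # normalize to 0-1
            "evidence": float(evidence) / 10.0,
        }
    return scores  # an empty dict signals a parsing failure
```
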
### 4. Results Analysis

- **System-Specific Statistics**: Individual performance metrics (see the aggregation sketch below)
- **Comparative Analysis**: Direct system-to-system comparison
- **Target Compliance**: Automatic threshold checking

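The aggregation itself reduces to simple averaging plus the threshold checks; a sketch whose field names mirror the statistics file shown earlier (hypothetical helper):

```python
# Aggregate parsed per-query scores for one system.
def summarize(evaluations: list[dict]) -> dict:
    n = len(evaluations)
    avg_act = sum(e["actionability"] for e in evaluations) / n
    avg_evi = sum(e["evidence"] for e in evaluations) / n
    return {
        "average_actionability": round(avg_act, 3),
        "average_evidence": round(avg_evi, 3),
        "query_count": n,
        "actionability_target_met": avg_act >= 0.70,  # Metric 5 target
        "evidence_target_met": avg_evi >= 0.75,       # Metric 6 target
    }
```
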
## Expected Output

### Console Output Example

```
OnCall.ai LLM Judge Evaluator - Metrics 5-6 Multi-System Evaluation
Multi-System Comparison: RAG vs DIRECT
Found rag outputs: results/medical_outputs_20250804_215917.json
Found direct outputs: results/medical_outputs_direct_20250804_220000.json
Comparing 2 systems with 1 queries each
Metrics: 5 (Actionability) + 6 (Evidence Quality)
Strategy: Single comparison call for maximum consistency
Multi-system comparison: rag, direct
Evaluating 1 queries across 2 systems...
Comparison prompt created (2150 characters)
Calling judge LLM for multi-system comparison...
Judge LLM completed comparison evaluation in 45.3s
Response length: 145 characters
RAG: 1 evaluations parsed
DIRECT: 1 evaluations parsed

=== LLM JUDGE EVALUATION SUMMARY ===
Systems Compared: RAG vs DIRECT

Overall Performance:
  Average Actionability: 0.850 (8.5/10)
  Average Evidence Quality: 0.780 (7.8/10)
  Actionability Target (≥7.0): Met
  Evidence Target (≥7.5): Met

System Breakdown:
  RAG: Actionability=0.900, Evidence=0.850 [1 queries]
  DIRECT: Actionability=0.800, Evidence=0.710 [1 queries]

LLM judge evaluation complete!
Statistics: results/judge_evaluation_comparison_rag_vs_direct_20250804_220000.json
Efficiency: 2 evaluations in 1 LLM call
```

## Key Features

### 1. Scientific Comparison Design
- **Single Judge Call**: All systems evaluated simultaneously for consistency
- **Eliminates Temporal Bias**: Same judge, same context, same standards
- **Direct System Comparison**: Side-by-side evaluation format

### 2. Flexible Architecture
- **Backward Compatible**: Single-system evaluation is still supported
- **Future Extensible**: Easy to add new systems (`claude`, `gpt4`, etc.)
- **Modular Design**: Clean separation of concerns

### 3. Robust Error Handling
- **File Validation**: Automatic detection of missing input files
- **Query Count Verification**: Warns if systems have different query counts
- **Graceful Degradation**: Continues operation despite partial failures

### 4. Comprehensive Reporting
- **System-Specific Metrics**: Individual performance analysis
- **Comparative Statistics**: Direct system-to-system comparison
- **Target Compliance**: Automatic benchmark checking
- **Detailed Metadata**: Full traceability of evaluation parameters

## Troubleshooting

### Common Issues

#### 1. Missing Input Files
```
No medical outputs files found for rag system
Please run evaluators first:
   python latency_evaluator.py single_test_query.txt
```
**Solution**: Run the prerequisite evaluators to generate the medical outputs.

#### 2. HF_TOKEN Not Set
```
HF_TOKEN is missing from environment variables
```
**Solution**: Set your Hugging Face token in the environment or in a `.env` file.

#### 3. Query Count Mismatch
```
Warning: Systems have different query counts: {'rag': 3, 'direct': 1}
```
**Solution**: Ensure both systems processed the same input file.

#### 4. LLM API Timeout
```
Multi-system evaluation failed: timeout
```
**Solution**: Check your internet connection and the Hugging Face API status.

### Debug Tips

1. **Check File Existence**: Verify the medical output files in the `results/` directory
2. **Validate JSON Format**: Ensure input files are properly formatted
3. **Monitor API Usage**: Check your Hugging Face account limits
4. **Review Logs**: Examine the detailed logging output for specific errors

## Future Extensions

### Phase 2: Generic Multi-System Framework
```bash
# Configuration-driven system comparison
python metric5_6_llm_judge_evaluator.py --config comparison_config.json
```

### Phase 3: Unlimited System Support
```bash
# Dynamic system registration
python metric5_6_llm_judge_evaluator.py med42,claude,gpt4,palm,llama2
```

### Integration with Chart Generators
```bash
# Generate comparison visualizations
python metric5_6_llm_judge_chart_generator.py rag,direct
```

## Best Practices

1. **Consistent Test Data**: Use the same query file for all systems
2. **Sequential Execution**: Complete data collection before evaluation
3. **Batch Processing**: Use multi-system mode for scientific comparison
4. **Result Verification**: Review the detailed statistics files for accuracy
5. **Performance Monitoring**: Track evaluation latency and API costs

## Scientific Validity

The multi-system comparison approach provides stronger scientific validity than separate evaluations:

- **Eliminates Judge Variability**: The same judge evaluates all systems
- **Reduces Temporal Effects**: All evaluations occur in a single time window
- **Ensures Consistent Standards**: Identical evaluation criteria are applied
- **Enables Direct Comparison**: Side-by-side system assessment
- **Maximizes Efficiency**: One API call instead of multiple separate calls

This design makes the evaluation results more reliable for research publications and system optimization decisions.