Spaces:

ybchen928
/

oncall-guide-ai

Running

File size: 10,115 Bytes

# OnCall.ai - Medical Emergency Assistant

A RAG-based medical assistant system that provides evidence-based clinical guidance for emergency medical situations using real medical guidelines and advanced language models.

## 🎯 Project Overview

OnCall.ai helps healthcare professionals by:
- Processing medical queries through multi-level validation
- Retrieving relevant medical guidelines from curated datasets
- Generating evidence-based clinical advice using specialized medical LLMs
- Providing transparent, traceable medical guidance

## ✅ Current Implementation Status

### **🎉 COMPLETED MODULES (2025-07-31)**

#### **1. Multi-Level Query Processing System**
- ✅ **UserPromptProcessor** (`src/user_prompt.py`)
  - Level 1: Predefined medical condition mapping (instant response)
  - Level 2: LLM-based condition extraction (Llama3-Med42-70B)
  - Level 3: Semantic search fallback
  - Level 4: Medical query validation (100% non-medical rejection)
  - Level 5: Generic medical search for rare conditions

#### **2. Dual-Index Retrieval System**
- ✅ **BasicRetrievalSystem** (`src/retrieval.py`)
  - Emergency medical guidelines index (emergency.ann)
  - Treatment protocols index (treatment.ann)
  - Vector-based similarity search using PubMedBERT embeddings
  - Intelligent deduplication and result ranking

#### **3. Medical Knowledge Base**
- ✅ **MedicalConditions** (`src/medical_conditions.py`)
  - Predefined condition-keyword mappings
  - Medical terminology validation
  - Extensible condition database

#### **4. LLM Integration**
- ✅ **Med42-70B Client** (`src/llm_clients.py`)
  - Specialized medical language model integration
  - Dual-layer rejection detection for non-medical queries
  - Robust error handling and timeout management

#### **5. Medical Advice Generation**
- ✅ **MedicalAdviceGenerator** (`src/generation.py`)
  - RAG-based prompt construction
  - Intention-aware chunk selection (treatment/diagnosis)
  - Confidence scoring and response formatting
  - Integration with Med42-70B for clinical advice generation

#### **6. Data Processing Pipeline**
- ✅ **Processed Medical Guidelines** (`src/data_processing.py`)
  - ~4000 medical guidelines from EPFL-LLM dataset
  - Emergency subset: ~2000-2500 records
  - Treatment subset: ~2000-2500 records
  - PubMedBERT embeddings (768 dimensions)
  - ANNOY vector indices for fast retrieval

## 📊 **System Performance (Validated)**

### **Test Results Summary**
```
🎯 Multi-Level Fallback Validation: 69.2% success rate
   - Level 1 (Predefined): 100% success (instant response)
   - Level 4a (Non-medical rejection): 100% success
   - Level 4b→5 (Rare medical): 100% success

📈 End-to-End Pipeline: 100% technical completion
   - Condition extraction: 2.6s average
   - Medical guideline retrieval: 0.3s average
   - Total pipeline: 15.5s average (including generation)
```

### **Quality Metrics**
```
🔍 Retrieval Performance:
   - Guidelines retrieved: 8-9 per query
   - Relevance scores: 0.245-0.326 (good for medical domain)
   - Emergency/Treatment balance: Correctly maintained
   
🧠 Generation Quality:
   - Confidence scores: 0.90 for successful generations
   - Evidence-based responses with specific guideline references
   - Appropriate medical caution and clinical judgment emphasis
```

## 🛠️ **Technical Architecture**

### **Data Flow**
```
User Query → Level 1: Predefined Mapping
     ↓ (if fails)
Level 2: LLM Extraction
     ↓ (if fails)
Level 3: Semantic Search
     ↓ (if fails)
Level 4: Medical Validation
     ↓ (if fails)
Level 5: Generic Search
     ↓ (if fails)
No Match Found
```

### **Core Technologies**
- **Embeddings**: NeuML/pubmedbert-base-embeddings (768D)
- **Vector Search**: ANNOY indices with angular distance
- **LLM**: m42-health/Llama3-Med42-70B (medical specialist)
- **Dataset**: EPFL-LLM medical guidelines (~4000 documents)

### **Fallback Mechanism**
```
Level 1: Predefined Mapping (0.001s) → Success: Direct return
Level 2: LLM Extraction (8-15s) → Success: Condition mapping  
Level 3: Semantic Search (1-2s) → Success: Sliding window chunks
Level 4: Medical Validation (8-10s) → Fail: Return rejection
Level 5: Generic Search (1s) → Final: General medical guidance
```

## 🚀 **NEXT PHASE: Interactive Interface**

### **🎯 Immediate Goals (Next 1-2 Days)**

#### **Phase 1: Gradio Interface Development**
- [ ] **Create `app.py`** - Interactive web interface
  - [ ] Complete pipeline integration
  - [ ] Multi-output display (advice + guidelines + technical details)
  - [ ] Environment-controlled debug mode
  - [ ] User-friendly error handling

#### **Phase 2: Local Validation Testing**
- [ ] **Manual testing** with 20-30 realistic medical queries
  - [ ] Emergency scenarios (cardiac arrest, stroke, MI)
  - [ ] Diagnostic queries (chest pain, respiratory distress)
  - [ ] Treatment protocols (medication management, procedures)
  - [ ] Edge cases (rare conditions, complex symptoms)

#### **Phase 3: HuggingFace Spaces Deployment**
- [ ] **Create requirements.txt** for deployment
- [ ] **Deploy to HF Spaces** for public testing
- [ ] **Production mode configuration** (limited technical details)
- [ ] **Performance monitoring** and user feedback collection

### **🔮 Future Enhancements (Next 1-2 Weeks)**

#### **Audio Input Integration**
- [ ] **Whisper ASR integration** for voice queries
- [ ] **Audio preprocessing** and quality validation
- [ ] **Multi-modal interface** (text + audio input)

#### **Evaluation & Metrics**
- [ ] **Faithfulness scoring** implementation
- [ ] **Automated evaluation pipeline** 
- [ ] **Clinical validation** with medical professionals
- [ ] **Performance benchmarking** against target metrics

#### **Dataset Expansion (Future)**
- [ ] **Dataset B integration** (symptom/diagnosis subsets)
- [ ] **Multi-dataset RAG** architecture
- [ ] **Enhanced medical knowledge** coverage

## 📋 **Target Performance Metrics**

### **Response Quality**
- [ ] Physician satisfaction: ≥ 4/5
- [ ] RAG content coverage: ≥ 80%
- [ ] Retrieval precision (P@5): ≥ 0.7
- [ ] Medical advice faithfulness: ≥ 0.8

### **System Performance**  
- [ ] Total response latency: ≤ 30 seconds
- [ ] Condition extraction: ≤ 5 seconds
- [ ] Guideline retrieval: ≤ 2 seconds
- [ ] Medical advice generation: ≤ 25 seconds

### **User Experience**
- [ ] Non-medical query rejection: 100%
- [ ] System availability: ≥ 99%
- [ ] Error handling: Graceful degradation
- [ ] Interface responsiveness: Immediate feedback

## 🏗️ **Project Structure**
```
OnCall.ai/
├── src/                          # Core modules (✅ Complete)
│   ├── user_prompt.py           # Multi-level query processing
│   ├── retrieval.py             # Dual-index vector search
│   ├── generation.py            # RAG-based advice generation
│   ├── llm_clients.py           # Med42-70B integration
│   ├── medical_conditions.py    # Medical knowledge configuration
│   └── data_processing.py       # Dataset preprocessing
├── models/                       # Pre-processed data (✅ Complete)
│   ├── embeddings/              # Vector embeddings and chunks
│   └── indices/                 # ANNOY vector indices
├── tests/                        # Validation tests (✅ Complete)
│   ├── test_multilevel_fallback_validation.py
│   ├── test_end_to_end_pipeline.py
│   └── test_userinput_userprompt_medical_*.py
├── docs/                         # Documentation and planning
│   ├── next/                    # Current implementation docs
│   └── next_gradio_evaluation/  # Interface planning
├── app.py                        # 🎯 NEXT: Gradio interface
├── requirements.txt              # 🎯 NEXT: Deployment dependencies
└── README.md                     # This file
```

## 🧪 **Testing Validation**

### **Completed Tests**
- ✅ **Multi-level fallback validation**: 13 test cases, 69.2% success
- ✅ **End-to-end pipeline testing**: 6 scenarios, 100% technical completion
- ✅ **Component integration**: All modules working together
- ✅ **Error handling**: Graceful degradation and user-friendly messages

### **Key Findings**
- **Predefined mapping**: Instant response for known conditions
- **LLM extraction**: Reliable for complex symptom descriptions  
- **Non-medical rejection**: Perfect accuracy with updated prompt engineering
- **Retrieval quality**: High-relevance medical guidelines (0.2-0.4 relevance scores)
- **Generation capability**: Evidence-based advice with proper medical caution

## 🤝 **Contributing & Development**

### **Environment Setup**
```bash
# Clone repository
git clone [repository-url]
cd OnCall.ai

# Setup virtual environment
python -m venv genAIvenv
source genAIvenv/bin/activate  # On Windows: genAIvenv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run tests
python tests/test_end_to_end_pipeline.py

# Start Gradio interface (coming soon)
python app.py
```

### **API Configuration**
```bash
# Set up HuggingFace token for LLM access
export HF_TOKEN=your_huggingface_token

# Enable debug mode for development
export ONCALL_DEBUG=true
```

## ⚠️ **Important Notes**

### **Medical Disclaimer**
This system is designed for **research and educational purposes only**. It should not replace professional medical consultation, diagnosis, or treatment. Always consult qualified healthcare providers for medical decisions.

### **Current Limitations**
- **API Dependencies**: Requires HuggingFace API access for LLM functionality
- **Dataset Scope**: Currently focused on emergency and treatment guidelines
- **Language Support**: English medical terminology only
- **Validation Stage**: System under active development and testing

## 📞 **Contact & Support**

**Development Team**: OnCall.ai Team  
**Last Updated**: 2025-07-31  
**Version**: 0.9.0 (Pre-release)  
**Status**: 🚧 Ready for Interactive Testing Phase

---

*Built with ❤️ for healthcare professionals*