harshalmore31 committed on
Commit 4305b0a · 1 Parent(s): d9837be

Implement code changes to enhance functionality and improve performance

Files changed (2)
  1. README.md +502 -41
  2. mai_dx/main.py +1256 -7
README.md CHANGED
@@ -1,79 +1,540 @@
1
- # Open-MAI-Dx-Orchestrator [WIP]
2
 
3
- An open source implementation of the paper: "Sequential Diagnosis with Language Models" From Microsoft Built with Swarms Framework.
4
 
5
- - [Paper Link](https://arxiv.org/abs/2506.22405)
 
 
6
 
7
- # Install
 
 
8
 
9
  ```bash
10
- pip3 install mai-dx
 
 
 
 
 
 
11
  ```
12
 
13
- ## Usage
 
14
 
15
- ...
 
16
 
17
- ---
18
 
19
- ## Architecture
20
 
21
- ### Virtual Panel Roles
22
 
23
- The virtual panel consists of five specialized roles:
24
 
25
- - **Dr. Hypothesis** – Maintains a probability-ranked differential diagnosis with the top three most likely conditions, updating probabilities in a Bayesian manner after each new finding.
 
 
 
 
 
 
 
 
 
 
26
 
27
- - **Dr. Test-Chooser** – Selects up to three diagnostic tests per round that maximally discriminate between leading hypotheses.
28
 
29
- - **Dr. Challenger** – Acts as devil's advocate by identifying potential anchoring bias, highlighting contradictory evidence, and proposing tests that could falsify the current leading diagnosis.
 
 
30
 
31
- - **Dr. Stewardship** – Enforces cost-conscious care by advocating for cheaper alternatives when diagnostically equivalent and vetoing low-yield expensive tests.
 
 
32
 
33
- - **Dr. Checklist** – Performs silent quality control to ensure the model generates valid test names and maintains internal consistency across the panel's reasoning.
 
 
34
 
35
- ### Decision Process
36
 
37
- After internal deliberation, the panel reaches consensus on one of three actions:
38
- - Asking questions
39
- - Ordering tests
40
- - Committing to a diagnosis (if certainty exceeds threshold)
 
 
 
 
 
 
 
41
 
42
- Before tests are ordered, an optional budget tracker can be invoked to estimate both the cumulative medical costs so far and the cost of each test in the order.
43
 
44
- ### MAI-DxO Variants
 
 
45
 
46
- We evaluate five variants of MAI-DxO to explore different points on the accuracy-cost frontier (from most cost conscious to least):
47
 
48
- - **Instant Answer** – Diagnosis based solely on initial vignette (as in Figure 3), without any follow-up questions or tests.
 
 
 
 
 
 
 
 
 
 
 
49
 
50
- - **Question Only** – The panel can ask questions, but cannot order diagnostic tests. The cost is simply the cost of a single physician visit.
 
 
 
51
 
52
- - **Budgeted** – The panel is augmented with a budgeting system that tracks cumulative costs (a separately orchestrated language model call) towards a max budget and allows the panel to cancel tests after seeing their estimated cost.
 
 
 
 
53
 
54
- - **No Budget** Full panel with no explicit cost tracking or budget limitations.
 
 
 
55
 
56
- - **Ensemble** – Simulates multiple doctor panels working in parallel, with an additional panel to provide a final diagnosis. This is implemented as multiple independent No Budget runs with a final aggregation step to select the best diagnosis. Costs are computed as the sum of the costs of all tests ordered by each of the runs, accounting for duplicates.
 
 
 
 
 
 
 
 
 
 
57
 
58
- ### Technical Implementation
 
 
 
 
59
 
60
- MAI-DxO was primarily developed and optimized using GPT-4.1, but is designed to be model-agnostic. All MAI-DxO variants used the same underlying orchestration structure, with capabilities selectively enabled or disabled for variants.
61
 
62
- ## Citation
 
 
 
 
63
 
64
  ```bibtex
65
  @misc{nori2025sequentialdiagnosislanguagemodels,
66
- title={Sequential Diagnosis with Language Models},
67
- author={Harsha Nori and Mayank Daswani and Christopher Kelly and Scott Lundberg and Marco Tulio Ribeiro and Marc Wilson and Xiaoxuan Liu and Viknesh Sounderajah and Jonathan Carlson and Matthew P Lungren and Bay Gross and Peter Hames and Mustafa Suleyman and Dominic King and Eric Horvitz},
68
- year={2025},
69
- eprint={2506.22405},
70
- archivePrefix={arXiv},
71
- primaryClass={cs.CL},
72
- url={https://arxiv.org/abs/2506.22405},
 
 
 
 
 
 
 
73
  }
74
  ```
75
 
 
76
 
 
 
 
 
 
 
 
 
 
 
 
77
 
78
- # License
79
- MIT
 
 
1
+ # Open-MAI-Dx-Orchestrator
2
 
3
+ > **An open-source implementation of the "Sequential Diagnosis with Language Models" paper by Microsoft Research, built with the Swarms AI framework.**
4
 
5
+ [![Paper](https://img.shields.io/badge/Paper-arXiv:2506.22405-red.svg)](https://arxiv.org/abs/2506.22405)
6
+ [![License](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
7
+ [![Python](https://img.shields.io/badge/Python-3.8+-green.svg)](https://python.org)
8
 
9
+ MAI-DxO (MAI Diagnostic Orchestrator) is a sophisticated AI-powered diagnostic system that simulates a virtual panel of physician-agents to perform iterative medical diagnosis with cost-effectiveness optimization. This implementation faithfully reproduces the methodology described in the Microsoft Research paper while providing additional features and flexibility.
10
+
11
+ ## 🚀 Quick Start
12
 
13
  ```bash
14
+ # Install the package
15
+ pip install mai-dx
16
+
17
+ # Or install from source
18
+ git clone https://github.com/The-Swarm-Corporation/Open-MAI-Dx-Orchestrator.git
19
+ cd Open-MAI-Dx-Orchestrator
20
+ pip install -e .
21
  ```
22
 
23
+ ```python
24
+ from mai_dx import MaiDxOrchestrator
25
 
26
+ # Create orchestrator
27
+ orchestrator = MaiDxOrchestrator(model_name="gemini/gemini-2.5-flash")
28
 
29
+ # Run diagnosis
30
+ result = orchestrator.run(
+     initial_case_info="29-year-old woman with sore throat and peritonsillar swelling...",
+     full_case_details="Patient: 29-year-old female. History: Onset of sore throat...",
+     ground_truth_diagnosis="Embryonal rhabdomyosarcoma of the pharynx"
+ )
35
+
36
+ print(f"Diagnosis: {result.final_diagnosis}")
37
+ print(f"Accuracy: {result.accuracy_score}/5.0")
38
+ print(f"Cost: ${result.total_cost:,}")
39
+ ```
40
+
41
+ ## 📚 Table of Contents
42
+
43
+ - [Features](#-features)
44
+ - [Installation](#-installation)
45
+ - [Architecture](#-architecture)
46
+ - [Usage](#-usage)
47
+ - [MAI-DxO Variants](#-mai-dxo-variants)
48
+ - [Configuration](#-configuration)
49
+ - [Examples](#-examples)
50
+ - [API Reference](#-api-reference)
51
+ - [Contributing](#-contributing)
52
+ - [Citation](#-citation)
53
+
54
+ ## ✨ Features
55
+
56
+ ### 🏥 Virtual Physician Panel
57
+ - **8 Specialized AI Agents**: Each with distinct medical expertise and decision-making roles
58
+ - **Iterative Deliberation**: Sequential consultation and consensus-building process
59
+ - **Bayesian Reasoning**: Probability-based differential diagnosis updates
60
+ - **Cognitive Bias Detection**: Built-in challenger agent to prevent diagnostic errors
61
+
62
+ ### 💰 Cost-Effectiveness Optimization
63
+ - **Comprehensive Cost Tracking**: Real-time budget monitoring with 25+ medical test costs
64
+ - **Resource Stewardship**: AI agent dedicated to cost-conscious care decisions
65
+ - **Budget Constraints**: Configurable spending limits with intelligent test prioritization
66
+ - **Value-Based Testing**: Information theory-driven test selection
67
+
68
+ ### 🎯 Multiple Operational Modes
69
+ - **Instant**: Immediate diagnosis from initial presentation
70
+ - **Question-Only**: History-taking without diagnostic tests
71
+ - **Budgeted**: Cost-constrained diagnostic workup
72
+ - **No-Budget**: Full diagnostic capability
73
+ - **Ensemble**: Multiple independent panels with consensus aggregation
74
+
75
+ ### 📊 Advanced Evaluation
76
+ - **Clinical Accuracy Scoring**: 5-point Likert scale with detailed rubric
77
+ - **Management Impact Assessment**: Evaluation based on treatment implications
78
+ - **Diagnostic Reasoning Tracking**: Complete conversation history and decision trails
79
+ - **Ensemble Methods**: Multi-run consensus for improved accuracy
80
+
81
+ ### 🔧 Technical Excellence
82
+ - **Model Agnostic**: Support for GPT, Gemini, Claude, and other LLMs
83
+ - **Robust Error Handling**: Comprehensive exception management and fallback mechanisms
84
+ - **Beautiful Logging**: Structured logging with Loguru for debugging and monitoring
85
+ - **Type Safety**: Full Pydantic models and type hints throughout
86
+
87
+ ## 🛠 Installation
88
+
89
+ ### Prerequisites
90
+ - Python 3.8 or higher
91
+ - API keys for your chosen language model provider
92
+
93
+ ### Standard Installation
94
+ ```bash
95
+ pip install mai-dx
96
+ ```
97
+
98
+ ### Development Installation
99
+ ```bash
100
+ git clone https://github.com/The-Swarm-Corporation/Open-MAI-Dx-Orchestrator.git
101
+ cd Open-MAI-Dx-Orchestrator
102
+ pip install -e .
103
+ ```
104
+
105
+ ### Dependencies
106
+ The package automatically installs:
107
+ - `swarms` - AI agent orchestration framework
108
+ - `loguru` - Advanced logging
109
+ - `pydantic` - Data validation and serialization
110
+
111
+ ## 🏗 Architecture
112
+
113
+ ### Virtual Panel Composition
114
+
115
+ The MAI-DxO system consists of 8 specialized AI agents that work together to provide comprehensive medical diagnosis:
116
+
117
+ #### Core Diagnostic Panel
118
+
119
+ **🧠 Dr. Hypothesis**
120
+ - Maintains probability-ranked differential diagnosis (top 3 conditions)
121
+ - Updates probabilities using Bayesian reasoning after each finding
122
+ - Tracks evidence supporting and contradicting each hypothesis
123
+
124
+ **🔬 Dr. Test-Chooser**
125
+ - Selects up to 3 diagnostic tests per round for maximum information value
126
+ - Optimizes for discriminatory power between competing hypotheses
127
+ - Balances diagnostic yield with patient burden
128
+
129
+ **🤔 Dr. Challenger**
130
+ - Acts as devil's advocate to prevent cognitive biases
131
+ - Identifies contradictory evidence and alternative explanations
132
+ - Proposes falsifying tests and guards against premature closure
133
+
134
+ **💰 Dr. Stewardship**
135
+ - Enforces cost-conscious, high-value care decisions
136
+ - Advocates for cheaper alternatives when diagnostically equivalent
137
+ - Evaluates test necessity and suggests cost-effective strategies
138
+
139
+ **✅ Dr. Checklist**
140
+ - Performs quality control on panel deliberations
141
+ - Validates test names and maintains logical consistency
142
+ - Flags errors and ensures proper diagnostic methodology
143
+
144
+ #### Coordination and Evaluation
145
+
146
+ **🤝 Consensus Coordinator**
147
+ - Synthesizes panel input into optimal next action
148
+ - Decides between asking questions, ordering tests, or diagnosing
149
+ - Balances accuracy, cost, efficiency, and thoroughness
150
+
151
+ **🔑 Gatekeeper**
152
+ - Serves as clinical information oracle with complete case access
153
+ - Provides objective findings and realistic synthetic results
154
+ - Maintains clinical realism while preventing information leakage
155
+
156
+ **⚖️ Judge**
157
+ - Evaluates final diagnoses against ground truth
158
+ - Uses rigorous 5-point clinical rubric
159
+ - Considers management implications and diagnostic completeness
160
+
161
+ ### Decision Process Flow
162
+
163
+ ```mermaid
164
+ graph TD
165
+ A[Initial Case Information] --> B[Panel Deliberation]
166
+ B --> C{Consensus Decision}
167
+ C -->|Ask| D[Question to Gatekeeper]
168
+ C -->|Test| E[Diagnostic Tests]
169
+ C -->|Diagnose| F[Final Diagnosis]
170
+ D --> G[Update Case Information]
171
+ E --> G
172
+ G --> H{Max Iterations or Budget?}
173
+ H -->|No| B
174
+ H -->|Yes| F
175
+ F --> I[Judge Evaluation]
176
+ I --> J[Diagnosis Result]
177
+ ```
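The flow in the diagram can be approximated in a few lines of Python. This is an illustrative sketch only, not the code in `mai_dx/main.py`: `Step`, `run_loop`, `decide`, and `gatekeeper` are hypothetical names standing in for the consensus decision, the orchestration loop, panel deliberation, and the Gatekeeper agent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    """One consensus decision (hypothetical structure)."""
    action_type: str  # "ask", "test", or "diagnose"
    content: str


def run_loop(
    initial_info: str,
    decide: Callable[[List[str]], Step],   # stand-in for panel deliberation
    gatekeeper: Callable[[Step], str],     # stand-in for the Gatekeeper agent
    test_costs: Dict[str, int],
    max_iterations: int = 10,
    budget: int = 10000,
) -> Tuple[str, int, int]:
    """Return (final_diagnosis, total_cost, iterations_used)."""
    findings = [initial_info]
    spent = 0
    used = 0
    for used in range(1, max_iterations + 1):
        step = decide(findings)
        if step.action_type == "diagnose":
            return step.content, spent, used
        if step.action_type == "test":
            # Unknown test names fall back to the "default" price.
            spent += test_costs.get(step.content.lower(), test_costs["default"])
        # Questions and tests both go to the gatekeeper; the reply updates the case.
        findings.append(gatekeeper(step))
        if spent >= budget:
            break
    # Iteration or budget limit reached: ask once more and take the answer as final.
    # (Assumption: a real panel would be prompted to return a diagnosis here.)
    forced = decide(findings + ["Limit reached: commit to a diagnosis."])
    return forced.content, spent, used


def scripted_panel(findings: List[str]) -> Step:
    """Toy panel that asks, then tests, then diagnoses."""
    script = [
        Step("ask", "Any fever?"),
        Step("test", "CBC"),
        Step("diagnose", "Example diagnosis"),
    ]
    return script[min(len(findings) - 1, len(script) - 1)]


diagnosis, cost, iterations = run_loop(
    "29-year-old woman with sore throat",
    decide=scripted_panel,
    gatekeeper=lambda step: f"Result for: {step.content}",
    test_costs={"default": 150, "cbc": 50},
)
print(diagnosis, cost, iterations)
```

Passing the panel and gatekeeper in as callables keeps the loop testable without any model calls; a real run would replace them with agent invocations.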
178
+
179
+ ## 🎮 Usage
180
+
181
+ ### Basic Usage
182
+
183
+ ```python
184
+ from mai_dx import MaiDxOrchestrator
185
+
186
+ # Initialize orchestrator
187
+ orchestrator = MaiDxOrchestrator(
+     model_name="gemini/gemini-2.5-flash",
+     max_iterations=10,
+     initial_budget=10000
+ )
192
+
193
+ # Define case information
194
+ initial_info = "A 45-year-old male presents with chest pain..."
195
+ full_case = "Patient: 45-year-old male. History: Acute onset chest pain..."
196
+ ground_truth = "Myocardial infarction"
197
+
198
+ # Run diagnosis
199
+ result = orchestrator.run(initial_info, full_case, ground_truth)
200
+
201
+ # Access results
202
+ print(f"Diagnosis: {result.final_diagnosis}")
203
+ print(f"Accuracy Score: {result.accuracy_score}/5.0")
204
+ print(f"Total Cost: ${result.total_cost:,}")
205
+ print(f"Iterations: {result.iterations}")
206
+ ```
207
+
208
+ ### Advanced Configuration
209
+
210
+ ```python
211
+ # Custom orchestrator with specific settings
212
+ orchestrator = MaiDxOrchestrator(
+     model_name="gpt-4",
+     max_iterations=15,
+     initial_budget=5000,
+     mode="budgeted",
+     physician_visit_cost=250,
+     enable_budget_tracking=True
+ )
220
+
221
+ # Enable debug logging (set this before importing mai_dx; the flag is read at import time)
222
+ import os
223
+ os.environ["MAIDX_DEBUG"] = "1"
224
+ ```
225
+
226
+ ## 📋 MAI-DxO Variants
227
+
228
+ The system supports five distinct operational variants, each optimized for different clinical scenarios:
229
+
230
+ ### 1. Instant Answer
231
+ ```python
232
+ orchestrator = MaiDxOrchestrator.create_variant("instant")
233
+ result = orchestrator.run(initial_info, full_case, ground_truth)
234
+ ```
235
+ - **Use Case**: Emergency triage, rapid screening
236
+ - **Behavior**: Immediate diagnosis from initial presentation only
237
+ - **Cost**: Single physician visit ($300)
238
+
239
+ ### 2. Question-Only
240
+ ```python
241
+ orchestrator = MaiDxOrchestrator.create_variant("question_only")
242
+ result = orchestrator.run(initial_info, full_case, ground_truth)
243
+ ```
244
+ - **Use Case**: Telemedicine, history-taking focused consultations
245
+ - **Behavior**: Detailed questioning without diagnostic tests
246
+ - **Cost**: Physician visit only
247
+
248
+ ### 3. Budgeted
249
+ ```python
250
+ orchestrator = MaiDxOrchestrator.create_variant("budgeted", budget=3000)
251
+ result = orchestrator.run(initial_info, full_case, ground_truth)
252
+ ```
253
+ - **Use Case**: Resource-constrained settings, cost-conscious care
254
+ - **Behavior**: Full panel with strict budget enforcement
255
+ - **Cost**: Limited by specified budget
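One way to picture "strict budget enforcement" is a filter applied to the tests the panel proposes. The sketch below is an assumption about how such a check could work, not a description of this commit's logic; `filter_affordable` is a hypothetical helper, and the prices are taken from the cost table in `mai_dx/main.py`.

```python
from typing import Dict, List, Tuple


def filter_affordable(
    proposed: List[str], costs: Dict[str, int], remaining: int
) -> Tuple[List[str], int]:
    """Approve proposed tests in order while they still fit the remaining budget."""
    approved = []
    for name in proposed:
        price = costs.get(name.lower(), costs["default"])
        if price <= remaining:
            approved.append(name)
            remaining -= price
    return approved, remaining


costs = {"default": 150, "cbc": 50, "mri": 1500, "biopsy": 800}
approved, left = filter_affordable(["MRI", "CBC", "Biopsy"], costs, remaining=1000)
print(approved, left)
```

A greedy pass in proposal order is the simplest policy; ranking tests by expected diagnostic value per dollar would be a natural refinement.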
256
+
257
+ ### 4. No-Budget
258
+ ```python
259
+ orchestrator = MaiDxOrchestrator.create_variant("no_budget")
260
+ result = orchestrator.run(initial_info, full_case, ground_truth)
261
+ ```
262
+ - **Use Case**: Academic medical centers, complex cases
263
+ - **Behavior**: Full diagnostic capability without cost constraints
264
+ - **Cost**: Unlimited (tracks for analysis)
265
+
266
+ ### 5. Ensemble
267
+ ```python
268
+ orchestrator = MaiDxOrchestrator.create_variant("ensemble")
269
+ result = orchestrator.run_ensemble(initial_info, full_case, ground_truth, num_runs=3)
270
+ ```
271
+ - **Use Case**: Critical diagnoses, second opinion simulation
272
+ - **Behavior**: Multiple independent panels with consensus aggregation
273
+ - **Cost**: Sum of all panel costs
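The aggregation step can be illustrated with a simple majority vote over the per-run diagnoses. Note that this swaps in a plain vote for whatever aggregation the implementation actually uses (the paper describes an additional panel that selects the final diagnosis); `aggregate` and the sample runs are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def aggregate(runs: List[Tuple[str, int]]) -> Tuple[str, int]:
    """Pick the most frequent diagnosis and sum the cost of every run.

    Each run is a (diagnosis, cost) pair. Ties go to the diagnosis seen first.
    """
    votes = Counter(diagnosis.strip().lower() for diagnosis, _ in runs)
    winner_key, _ = votes.most_common(1)[0]
    # Return the winning diagnosis with its original spelling.
    winner = next(d for d, _ in runs if d.strip().lower() == winner_key)
    total_cost = sum(cost for _, cost in runs)
    return winner, total_cost


runs = [("Diagnosis A", 1000), ("Diagnosis B", 500), ("diagnosis a", 700)]
final, total = aggregate(runs)
print(final, total)
```

Exact string matching is brittle for free-text diagnoses, which is one reason a model-based aggregation step may be preferable in practice.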
274
+
275
+ ## ⚙️ Configuration
276
+
277
+ ### Model Configuration
278
+
279
+ ```python
280
+ # Supported models
281
+ models = [
+     "gemini/gemini-2.5-flash",
+     "gpt-4o",
+     "gpt-4o-mini",
+     "claude-3-5-sonnet-20241022",
+     "meta-llama/llama-3.1-8b-instruct"
+ ]
288
 
289
+ orchestrator = MaiDxOrchestrator(model_name="gpt-4o")
290
+ ```
291
+
292
+ ### Cost Database Customization
293
+
294
+ ```python
295
+ # Access and modify cost database
296
+ orchestrator = MaiDxOrchestrator()
297
+ orchestrator.test_cost_db.update({
+     "custom_test": 450,
+     "specialized_imaging": 2000
+ })
301
+ ```
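Because the cost table is a plain dictionary with a `"default"` entry, a lookup needs some rule for test names that do not match a key exactly. The sketch below shows one possible rule (exact match, then longest key contained in the name, then the default); `estimate_cost` is a hypothetical helper, and the matching strategy is an assumption rather than the behavior of this commit.

```python
from typing import Dict


def estimate_cost(test_name: str, cost_db: Dict[str, int]) -> int:
    """Look up a test price, falling back to substring matches and then the default."""
    name = test_name.strip().lower()
    if name in cost_db:
        return cost_db[name]
    # Prefer the most specific (longest) key that appears inside the requested name.
    matches = [key for key in cost_db if key != "default" and key in name]
    if matches:
        return cost_db[max(matches, key=len)]
    return cost_db["default"]


cost_db = {"default": 150, "cbc": 50, "mri": 1500, "mri brain": 1800}
cost_db.update({"custom_test": 450})

print(estimate_cost("CBC", cost_db))
print(estimate_cost("Contrast MRI brain", cost_db))
print(estimate_cost("lumbar puncture", cost_db))
```

Substring matching can misfire on short keys, so a production lookup would likely want a curated alias list instead.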
302
+
303
+ ### Logging Configuration
304
+
305
+ ```python
306
+ # Enable detailed debug logging (set this before importing mai_dx; the flag is read at import time)
307
+ import os
308
+ os.environ["MAIDX_DEBUG"] = "1"
309
+
310
+ # Custom log levels and formats available
311
+ ```
312
+
313
+ ## 📖 Examples
314
+
315
+ ### Example 1: Comprehensive Diagnostic Workup
316
+
317
+ ```python
318
+ from mai_dx import MaiDxOrchestrator
319
+
320
+ # Complex case requiring multiple tests
321
+ case_info = """
322
+ A 29-year-old woman was admitted to the hospital because of sore throat
323
+ and peritonsillar swelling and bleeding. Symptoms did not abate with
324
+ antimicrobial therapy.
325
+ """
326
+
327
+ case_details = """
328
+ Patient: 29-year-old female.
329
+ History: Onset of sore throat 7 weeks prior to admission. Worsening
330
+ right-sided pain and swelling. No fevers, headaches, or GI symptoms.
331
+ Physical Exam: Right peritonsillar mass, displacing the uvula.
332
+ Initial Labs: FBC, clotting studies normal.
333
+ """
334
+
335
+ ground_truth = "Embryonal rhabdomyosarcoma of the pharynx"
336
+
337
+ # Run with different variants
338
+ variants = ["question_only", "budgeted", "no_budget"]
339
+ results = {}
340
+
341
+ for variant in variants:
+     if variant == "budgeted":
+         orch = MaiDxOrchestrator.create_variant(variant, budget=3000)
+     else:
+         orch = MaiDxOrchestrator.create_variant(variant)
+
+     results[variant] = orch.run(case_info, case_details, ground_truth)
+
+ # Compare results
+ for variant, result in results.items():
+     print(f"{variant}: {result.final_diagnosis} (Score: {result.accuracy_score})")
352
+ ```
353
+
354
+ ### Example 2: Ensemble Diagnosis
355
+
356
+ ```python
357
+ # High-stakes diagnosis with ensemble approach
358
+ ensemble_orchestrator = MaiDxOrchestrator.create_variant("ensemble")
359
+
360
+ ensemble_result = ensemble_orchestrator.run_ensemble(
+     initial_case_info=case_info,
+     full_case_details=case_details,
+     ground_truth_diagnosis=ground_truth,
+     num_runs=5  # 5 independent diagnostic panels
+ )
+
+ print(f"Ensemble Diagnosis: {ensemble_result.final_diagnosis}")
+ print(f"Accuracy Score: {ensemble_result.accuracy_score}/5.0")
369
+ print(f"Total Cost: ${ensemble_result.total_cost:,}")
370
+ ```
371
+
372
+ ### Example 3: Custom Cost Analysis
373
+
374
+ ```python
375
+ # Analyze cost-effectiveness across variants
376
+ import matplotlib.pyplot as plt
377
+
378
+ variants = ["instant", "question_only", "budgeted", "no_budget"]
379
+ costs = []
380
+ accuracies = []
381
+
382
+ for variant in variants:
+     orch = MaiDxOrchestrator.create_variant(variant)
+     result = orch.run(case_info, case_details, ground_truth)
+     costs.append(result.total_cost)
+     accuracies.append(result.accuracy_score)
387
+
388
+ # Plot cost vs accuracy
389
+ plt.scatter(costs, accuracies)
390
+ plt.xlabel('Total Cost ($)')
391
+ plt.ylabel('Accuracy Score')
392
+ plt.title('Cost vs Accuracy Trade-off')
393
+ for i, variant in enumerate(variants):
+     plt.annotate(variant, (costs[i], accuracies[i]))
395
+ plt.show()
396
+ ```
397
 
398
+ ## 🔍 API Reference
399
 
400
+ ### MaiDxOrchestrator Class
401
 
402
+ #### Constructor
403
+ ```python
404
+ MaiDxOrchestrator(
+     model_name: str = "gemini/gemini-2.5-flash",
+     max_iterations: int = 10,
+     initial_budget: int = 10000,
+     mode: str = "no_budget",
+     physician_visit_cost: int = 300,
+     enable_budget_tracking: bool = False
+ )
412
+ ```
413
 
414
+ #### Methods
415
 
416
+ **`run(initial_case_info, full_case_details, ground_truth_diagnosis)`**
417
+ - Executes the sequential diagnostic process
418
+ - Returns: `DiagnosisResult` object
419
 
420
+ **`run_ensemble(initial_case_info, full_case_details, ground_truth_diagnosis, num_runs=3)`**
421
+ - Runs multiple independent sessions with consensus aggregation
422
+ - Returns: `DiagnosisResult` object
423
 
424
+ **`create_variant(variant, **kwargs)` (Class Method)**
425
+ - Factory method for creating specialized variants
426
+ - Variants: "instant", "question_only", "budgeted", "no_budget", "ensemble"
427
 
428
+ ### DiagnosisResult Class
429
 
430
+ ```python
431
+ @dataclass
432
+ class DiagnosisResult:
+     final_diagnosis: str
+     ground_truth: str
+     accuracy_score: float
+     accuracy_reasoning: str
+     total_cost: int
+     iterations: int
+     conversation_history: str
440
+ ```
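Since `DiagnosisResult` is a standard dataclass, results can be converted to dictionaries for logging or export. The following sketch redefines the class locally so it runs on its own; the field values are made-up placeholders, not output from a real run.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DiagnosisResult:
    """Local copy of the result structure shown above, for illustration."""
    final_diagnosis: str
    ground_truth: str
    accuracy_score: float
    accuracy_reasoning: str
    total_cost: int
    iterations: int
    conversation_history: str


# Placeholder values for demonstration only.
result = DiagnosisResult(
    final_diagnosis="Example diagnosis",
    ground_truth="Example diagnosis",
    accuracy_score=5.0,
    accuracy_reasoning="Exact match (illustrative).",
    total_cost=350,
    iterations=3,
    conversation_history="(omitted)",
)

# Drop the potentially long transcript before serializing a summary.
summary = {k: v for k, v in asdict(result).items() if k != "conversation_history"}
payload = json.dumps(summary, indent=2)
print(payload)
```

Keeping the transcript out of the summary makes it practical to collect many runs into a single JSON file or table.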
441
 
442
+ ### Utility Functions
443
 
444
+ **`run_mai_dxo_demo(case_info=None, case_details=None, ground_truth=None)`**
445
+ - Convenience function for quick demonstrations
446
+ - Returns: Dictionary of results from multiple variants
447
 
448
+ ## 🧪 Testing and Validation
449
 
450
+ ### Running Tests
451
+ ```bash
452
+ # Run the built-in demo
453
+ python -m mai_dx.main
454
+
455
+ # Run with custom cases
456
+ python -c "
457
+ from mai_dx import run_mai_dxo_demo
458
+ results = run_mai_dxo_demo()
459
+ print(results)
460
+ "
461
+ ```
462
 
463
+ ### Benchmarking
464
+ ```python
465
+ import time
466
+ from mai_dx import MaiDxOrchestrator
467
 
468
+ # Performance benchmarking
469
+ start_time = time.time()
470
+ orchestrator = MaiDxOrchestrator()
471
+ result = orchestrator.run(case_info, case_details, ground_truth)
472
+ elapsed = time.time() - start_time
473
 
474
+ print(f"Diagnosis completed in {elapsed:.2f} seconds")
475
+ print(f"Accuracy: {result.accuracy_score}/5.0")
476
+ print(f"Cost efficiency: ${result.total_cost/result.accuracy_score:.0f} per accuracy point")
477
+ ```
478
 
479
+ ## 🤝 Contributing
480
+
481
+ We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details.
482
+
483
+ ### Development Setup
484
+ ```bash
485
+ git clone https://github.com/The-Swarm-Corporation/Open-MAI-Dx-Orchestrator.git
486
+ cd Open-MAI-Dx-Orchestrator
487
+ pip install -e ".[dev]"
488
+ pre-commit install
489
+ ```
490
 
491
+ ### Code Style
492
+ - Follow PEP 8 guidelines
493
+ - Use type hints throughout
494
+ - Maintain comprehensive docstrings
495
+ - Add tests for new features
496
 
497
+ ## 📄 License
498
 
499
+ This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
500
+
501
+ ## 📚 Citation
502
+
503
+ If you use this implementation in your research, please cite both the original paper and this implementation:
504
 
505
  ```bibtex
506
  @misc{nori2025sequentialdiagnosislanguagemodels,
507
+ title={Sequential Diagnosis with Language Models},
508
+ author={Harsha Nori and Mayank Daswani and Christopher Kelly and Scott Lundberg and Marco Tulio Ribeiro and Marc Wilson and Xiaoxuan Liu and Viknesh Sounderajah and Jonathan Carlson and Matthew P Lungren and Bay Gross and Peter Hames and Mustafa Suleyman and Dominic King and Eric Horvitz},
509
+ year={2025},
510
+ eprint={2506.22405},
511
+ archivePrefix={arXiv},
512
+ primaryClass={cs.CL},
513
+ url={https://arxiv.org/abs/2506.22405},
514
+ }
515
+
516
+ @software{mai_dx_orchestrator,
517
+ title={Open-MAI-Dx-Orchestrator: An Open Source Implementation of Sequential Diagnosis with Language Models},
518
+ author={The-Swarm-Corporation},
519
+ year={2025},
520
+ url={https://github.com/The-Swarm-Corporation/Open-MAI-Dx-Orchestrator.git}
521
  }
522
  ```
523
 
524
+ ## 🔗 Related Work
525
 
526
+ - [Original Paper](https://arxiv.org/abs/2506.22405) - Sequential Diagnosis with Language Models
527
+ - [Swarms Framework](https://github.com/kyegomez/swarms) - Multi-agent AI orchestration
528
+ - [Microsoft Research](https://www.microsoft.com/en-us/research/) - Original research institution
529
+
530
+ ## 📞 Support
531
+
532
+ - **Issues**: [GitHub Issues](https://github.com/The-Swarm-Corporation/Open-MAI-Dx-Orchestrator/issues)
533
+ - **Discussions**: [GitHub Discussions](https://github.com/The-Swarm-Corporation/Open-MAI-Dx-Orchestrator/discussions)
534
+ - **Documentation**: [Full Documentation](https://docs.swarms.world)
535
+
536
+ ---
537
 
538
+ <p align="center">
539
+ <strong>Built with Swarms for advancing AI-powered medical diagnosis</strong>
540
+ </p>
mai_dx/main.py CHANGED
@@ -1,12 +1,1261 @@
1
- from swarms import Agent
 
2
3
 
4
  class MaiDxOrchestrator:
-     def __init__(self, conversation_backend: str = None, agents: list[Agent] = None):
-         self.conversation_backend = conversation_backend
-         self.agents = agents
8
 
9
-     def run(self, task: str, *args, **kwargs):
-         pass
11
 
12
-
1
+ """
2
+ MAI Diagnostic Orchestrator (MAI-DxO)
3
 
4
+ This script provides a complete implementation of the "Sequential Diagnosis with Language Models"
5
+ paper, using the `swarms` framework. It simulates a virtual panel of physician-agents to perform
6
+ iterative medical diagnosis with cost-effectiveness optimization.
7
+
8
+ Based on the paper: "Sequential Diagnosis with Language Models"
9
+ (arXiv:2506.22405v1) by Nori et al.
10
+
11
+ Key Features:
12
+ - Virtual physician panel with specialized roles (Hypothesis, Test-Chooser, Challenger, Stewardship, Checklist)
13
+ - Multiple operational modes (instant, question_only, budgeted, no_budget, ensemble)
14
+ - Comprehensive cost tracking and budget management
15
+ - Clinical accuracy evaluation with 5-point Likert scale
16
+ - Gatekeeper system for realistic clinical information disclosure
17
+ - Ensemble methods for improved diagnostic accuracy
18
+
19
+ Example Usage:
20
+ # Standard MAI-DxO usage
21
+ orchestrator = MaiDxOrchestrator(model_name="gemini/gemini-2.5-flash")
22
+ result = orchestrator.run(initial_case_info, full_case_details, ground_truth)
23
+
24
+ # Budget-constrained variant
25
+ budgeted_orchestrator = MaiDxOrchestrator.create_variant("budgeted", budget=5000)
26
+
27
+ # Ensemble approach
28
+ ensemble_result = orchestrator.run_ensemble(initial_case_info, full_case_details, ground_truth)
29
+ """
30
+
31
+ import json
32
+ import sys
33
+ import time
34
+ from dataclasses import dataclass
35
+ from enum import Enum
36
+ from typing import Any, Dict, List, Optional, Union, Literal
37
+
38
+ from loguru import logger
39
+ from pydantic import BaseModel, Field
40
+ from swarms import Agent, Conversation
41
+
42
+ # Configure Loguru with beautiful formatting and features
43
+ logger.remove() # Remove default handler
44
+
45
+ # Console handler with beautiful colors
46
+ logger.add(
+     sys.stdout,
+     level="INFO",
+     format="<green>{time:YYYY-MM-DD HH:mm:ss}</green> | <level>{level: <8}</level> | <cyan>{name}</cyan>:<cyan>{function}</cyan>:<cyan>{line}</cyan> - <level>{message}</level>",
+     colorize=True
+ )
+
+ # Enable debug mode if environment variable is set
+ import os
+ if os.getenv("MAIDX_DEBUG", "").lower() in ("1", "true", "yes"):
+     logger.add(
+         "logs/maidx_debug_{time:YYYY-MM-DD}.log",
+         level="DEBUG",
+         format="{time:YYYY-MM-DD HH:mm:ss} | {level: <8} | {name}:{function}:{line} - {message}",
+         rotation="1 day",
+         retention="3 days"
+     )
+     logger.info("🐛 Debug logging enabled - logs will be written to logs/ directory")
64
+
65
+ # File handler for persistent logging (optional - uncomment if needed)
66
+ # logger.add(
67
+ # "logs/mai_dxo_{time:YYYY-MM-DD}.log",
68
+ # rotation="1 day",
69
+ # retention="7 days",
70
+ # level="DEBUG",
71
+ # format="{time:YYYY-MM-DD HH:mm:ss} | {level: <8} | {name}:{function}:{line} - {message}",
72
+ # compression="zip"
73
+ # )
74
+
75
+ # --- Data Structures and Enums ---
76
+
77
+ class AgentRole(Enum):
+     """Enumeration of roles for the virtual physician panel."""
+     HYPOTHESIS = "Dr. Hypothesis"
+     TEST_CHOOSER = "Dr. Test-Chooser"
+     CHALLENGER = "Dr. Challenger"
+     STEWARDSHIP = "Dr. Stewardship"
+     CHECKLIST = "Dr. Checklist"
+     CONSENSUS = "Consensus Coordinator"
+     GATEKEEPER = "Gatekeeper"
+     JUDGE = "Judge"
+
+ @dataclass
+ class DiagnosisResult:
+     """Stores the final result of a diagnostic session."""
+     final_diagnosis: str
+     ground_truth: str
+     accuracy_score: float
+     accuracy_reasoning: str
+     total_cost: int
+     iterations: int
+     conversation_history: str
+
+ class Action(BaseModel):
+     """Pydantic model for a structured action decided by the consensus agent."""
+     action_type: Literal["ask", "test", "diagnose"] = Field(..., description="The type of action to perform.")
+     content: Union[str, List[str]] = Field(..., description="The content of the action (question, test name, or diagnosis).")
+     reasoning: str = Field(..., description="The reasoning behind choosing this action.")
104
+
105
+ # --- Main Orchestrator Class ---
106
 
107
  class MaiDxOrchestrator:
+     """
+     Implements the MAI Diagnostic Orchestrator (MAI-DxO) framework.
+     This class orchestrates a virtual panel of AI agents to perform sequential medical diagnosis,
+     evaluates the final diagnosis, and tracks costs.
+     """
+     def __init__(
+         self,
+         model_name: str = "gemini/gemini-2.5-flash",
+         max_iterations: int = 10,
+         initial_budget: int = 10000,
+         mode: str = "no_budget",  # "instant", "question_only", "budgeted", "no_budget", "ensemble"
+         physician_visit_cost: int = 300,
+         enable_budget_tracking: bool = False,
+     ):
+         """
+         Initializes the MAI-DxO system.
+
+         Args:
+             model_name (str): The language model to be used by all agents.
+             max_iterations (int): The maximum number of diagnostic loops.
+             initial_budget (int): The starting budget for diagnostic tests.
+             mode (str): The operational mode of MAI-DxO.
+             physician_visit_cost (int): Cost per physician visit.
+             enable_budget_tracking (bool): Whether to enable budget tracking.
+         """
+         self.model_name = model_name
+         self.max_iterations = max_iterations
+         self.initial_budget = initial_budget
+         self.mode = mode
+         self.physician_visit_cost = physician_visit_cost
+         self.enable_budget_tracking = enable_budget_tracking
+
+         self.cumulative_cost = 0
+         self.differential_diagnosis = "Not yet formulated."
+         self.conversation = Conversation(
+             time_enabled=True,
+             autosave=False,
+             save_enabled=False
+         )
+
+         # Enhanced cost model based on the paper's methodology
+         self.test_cost_db = {
+             "default": 150,
+             "cbc": 50,
+             "complete blood count": 50,
+             "fbc": 50,
+             "chest x-ray": 200,
+             "chest xray": 200,
+             "mri": 1500,
+             "mri brain": 1800,
+             "mri neck": 1600,
+             "ct scan": 1200,
+             "ct chest": 1300,
+             "ct abdomen": 1400,
+             "biopsy": 800,
+             "core biopsy": 900,
+             "immunohistochemistry": 400,
+             "fish test": 500,
+             "fish": 500,
+             "ultrasound": 300,
+             "ecg": 100,
+             "ekg": 100,
+             "blood glucose": 30,
+             "liver function tests": 80,
+             "renal function": 70,
+             "toxic alcohol panel": 200,
+             "urinalysis": 40,
+             "culture": 150,
+             "pathology": 600,
+         }
+
+         self._init_agents()
+         logger.info(f"🏥 MAI Diagnostic Orchestrator initialized successfully in '{mode}' mode with budget ${initial_budget:,}")
+
+     def _init_agents(self):
+         """Initializes all required agents with their specific roles and prompts."""
184
+ self.agents = {
185
+ role: Agent(
186
+ agent_name=role.value,
187
+ system_prompt=self._get_prompt_for_role(role),
188
+ model_name=self.model_name,
189
+ max_loops=1,
190
+ output_type="json" if role == AgentRole.CONSENSUS else "str",
191
+ print_on=True, # Enable printing for all agents to see outputs
192
+ ) for role in AgentRole
193
+ }
194
+ logger.info(f"👥 {len(self.agents)} virtual physician agents initialized and ready for consultation")
195
+
196
+ def _get_prompt_for_role(self, role: AgentRole) -> str:
197
+ """Returns the system prompt for a given agent role."""
198
+ prompts = {
199
+ AgentRole.HYPOTHESIS: """
200
+ You are Dr. Hypothesis, a specialist in maintaining differential diagnoses. Your role is critical to the diagnostic process.
201
+
202
+ CORE RESPONSIBILITIES:
203
+ - Maintain a probability-ranked differential diagnosis with the top 3 most likely conditions
204
+ - Update probabilities using Bayesian reasoning after each new finding
205
+ - Consider both common and rare diseases appropriate to the clinical context
206
+ - Explicitly track how new evidence changes your diagnostic thinking
207
+
208
+ APPROACH:
209
+ 1. Start with the most likely diagnoses based on presenting symptoms
210
+ 2. For each new piece of evidence, consider:
211
+ - How it supports or refutes each hypothesis
212
+ - Whether it suggests new diagnoses to consider
213
+ - How it changes the relative probabilities
214
+ 3. Always explain your Bayesian reasoning clearly
215
+
216
+ OUTPUT FORMAT:
217
+ Provide your updated differential diagnosis with:
218
+ - Top 3 diagnoses with probability estimates (percentages)
219
+ - Brief rationale for each
220
+ - Key evidence supporting each hypothesis
221
+ - Evidence that contradicts or challenges each hypothesis
222
+
223
+ Remember: Your differential drives the entire diagnostic process. Be thorough, evidence-based, and adaptive.
224
+ """,
225
+
226
+ AgentRole.TEST_CHOOSER: """
227
+ You are Dr. Test-Chooser, a specialist in diagnostic test selection and information theory.
228
+
229
+ CORE RESPONSIBILITIES:
230
+ - Select up to 3 diagnostic tests per round that maximally discriminate between leading hypotheses
231
+ - Optimize for information value, not just clinical reasonableness
232
+ - Consider test characteristics: sensitivity, specificity, positive/negative predictive values
233
+ - Balance diagnostic yield with patient burden and resource utilization
234
+
235
+ SELECTION CRITERIA:
236
+ 1. Information Value: How much will this test change diagnostic probabilities?
237
+ 2. Discriminatory Power: How well does it distinguish between competing hypotheses?
238
+ 3. Clinical Impact: Will the result meaningfully alter management?
239
+ 4. Sequential Logic: What should we establish first before ordering more complex tests?
240
+
241
+ APPROACH:
242
+ - For each proposed test, explicitly state which hypotheses it will help confirm or exclude
243
+ - Consider both positive and negative results and their implications
244
+ - Think about test sequences (e.g., basic labs before advanced imaging)
245
+ - Avoid redundant tests that won't add new information
246
+
247
+ OUTPUT FORMAT:
248
+ For each recommended test:
249
+ - Test name (be specific)
250
+ - Primary hypotheses it will help evaluate
251
+ - Expected information gain
252
+ - How results will change management decisions
253
+
254
+ Focus on tests that will most efficiently narrow the differential diagnosis.
255
+ """,
256
+
257
+ AgentRole.CHALLENGER: """
258
+ You are Dr. Challenger, the critical thinking specialist and devil's advocate.
259
+
260
+ CORE RESPONSIBILITIES:
261
+ - Identify and challenge cognitive biases in the diagnostic process
262
+ - Highlight contradictory evidence that might be overlooked
263
+ - Propose alternative hypotheses and falsifying tests
264
+ - Guard against premature diagnostic closure
265
+
266
+ COGNITIVE BIASES TO WATCH FOR:
267
+ 1. Anchoring: Over-reliance on initial impressions
268
+ 2. Confirmation bias: Seeking only supporting evidence
269
+ 3. Availability bias: Overestimating probability of recently seen conditions
270
+ 4. Representativeness: Ignoring base rates and prevalence
271
+ 5. Search satisficing: Stopping at "good enough" explanations
272
+
273
+ YOUR APPROACH:
274
+ - Ask "What else could this be?" and "What doesn't fit?"
275
+ - Challenge assumptions and look for alternative explanations
276
+ - Propose tests that could disprove the leading hypothesis
277
+ - Consider rare diseases when common ones don't fully explain the picture
278
+ - Advocate for considering multiple conditions simultaneously
279
+
280
+ OUTPUT FORMAT:
281
+ - Specific biases you've identified in the current reasoning
282
+ - Evidence that contradicts the leading hypotheses
283
+ - Alternative diagnoses to consider
284
+ - Tests that could falsify current assumptions
285
+ - Red flags or concerning patterns that need attention
286
+
287
+ Be constructively critical - your role is to strengthen diagnostic accuracy through rigorous challenge.
288
+ """,
289
+
290
+ AgentRole.STEWARDSHIP: """
291
+ You are Dr. Stewardship, the resource optimization and cost-effectiveness specialist.
292
+
293
+ CORE RESPONSIBILITIES:
294
+ - Enforce cost-conscious, high-value care
295
+ - Advocate for cheaper alternatives when diagnostically equivalent
296
+ - Challenge low-yield, expensive tests
297
+ - Balance diagnostic thoroughness with resource stewardship
298
+
299
+ COST-VALUE FRAMEWORK:
300
+ 1. High-Value Tests: Low cost, high diagnostic yield, changes management
301
+ 2. Moderate-Value Tests: Moderate cost, specific indication, incremental value
302
+ 3. Low-Value Tests: High cost, low yield, minimal impact on decisions
303
+ 4. No-Value Tests: Any cost, no diagnostic value, ordered out of habit
304
+
305
+ ALTERNATIVE STRATEGIES:
306
+ - Could patient history/physical exam provide this information?
307
+ - Is there a less expensive test with similar diagnostic value?
308
+ - Can we use a staged approach (cheap test first, expensive if needed)?
309
+ - Does the test result actually change management?
310
+
311
+ YOUR APPROACH:
312
+ - Review all proposed tests for necessity and value
313
+ - Suggest cost-effective alternatives
314
+ - Question tests that don't clearly advance diagnosis
315
+ - Advocate for asking questions before ordering expensive tests
316
+ - Consider the cumulative cost burden
317
+
318
+ OUTPUT FORMAT:
319
+ - Assessment of proposed tests (high/moderate/low/no value)
320
+ - Specific cost-effective alternatives
321
+ - Questions that might obviate need for testing
322
+ - Recommended modifications to testing strategy
323
+ - Cumulative cost considerations
324
+
325
+ Your goal: Maximum diagnostic accuracy at minimum necessary cost.
326
+ """,
327
+
328
+ AgentRole.CHECKLIST: """
329
+ You are Dr. Checklist, the quality assurance and consistency specialist.
330
+
331
+ CORE RESPONSIBILITIES:
332
+ - Perform silent quality control on all panel deliberations
333
+ - Ensure test names are valid and properly specified
334
+ - Check internal consistency of reasoning across panel members
335
+ - Flag logical errors or contradictions in the diagnostic approach
336
+
337
+ QUALITY CHECKS:
338
+ 1. Test Validity: Are proposed tests real and properly named?
339
+ 2. Logical Consistency: Do the recommendations align with the differential?
340
+ 3. Evidence Integration: Are all findings being considered appropriately?
341
+ 4. Process Adherence: Is the panel following proper diagnostic methodology?
342
+ 5. Safety Checks: Are any critical possibilities being overlooked?
343
+
344
+ SPECIFIC VALIDATIONS:
345
+ - Test names match standard medical terminology
346
+ - Proposed tests are appropriate for the clinical scenario
347
+ - No contradictions between different panel members' reasoning
348
+ - All significant findings are being addressed
349
+ - No gaps in the diagnostic logic
350
+
351
+ OUTPUT FORMAT:
352
+ - Brief validation summary (✓ Clear / ⚠ Issues noted)
353
+ - Any test name corrections needed
354
+ - Logical inconsistencies identified
355
+ - Missing considerations or gaps
356
+ - Process improvement suggestions
357
+
358
+ Keep your feedback concise but comprehensive. Flag any issues that could compromise diagnostic quality.
359
+ """,
360
+
361
+ AgentRole.CONSENSUS: """
362
+ You are the Consensus Coordinator, responsible for synthesizing the virtual panel's expertise into a single, optimal decision.
363
+
364
+ CORE RESPONSIBILITIES:
365
+ - Integrate input from Dr. Hypothesis, Dr. Test-Chooser, Dr. Challenger, Dr. Stewardship, and Dr. Checklist
366
+ - Decide on the single best next action: 'ask', 'test', or 'diagnose'
367
+ - Balance competing priorities: accuracy, cost, efficiency, and thoroughness
368
+ - Ensure the chosen action advances the diagnostic process optimally
369
+
370
+ DECISION FRAMEWORK:
371
+ 1. DIAGNOSE: Choose when diagnostic certainty is sufficiently high (>85%) for the leading hypothesis
372
+ 2. TEST: Choose when tests will meaningfully discriminate between hypotheses
373
+ 3. ASK: Choose when history/exam questions could provide high-value information
374
+
375
+ SYNTHESIS PROCESS:
376
+ - Weight Dr. Hypothesis's confidence level and differential
377
+ - Consider Dr. Test-Chooser's information value analysis
378
+ - Incorporate Dr. Challenger's alternative perspectives
379
+ - Respect Dr. Stewardship's cost-effectiveness concerns
380
+ - Address any quality issues raised by Dr. Checklist
381
+
382
+ OUTPUT REQUIREMENTS:
383
+ Provide a JSON object with this exact structure:
384
+ {
385
+ "action_type": "ask" | "test" | "diagnose",
386
+ "content": "specific question(s), test name(s), or final diagnosis",
387
+ "reasoning": "clear justification synthesizing panel input"
388
+ }
389
+
390
+ For action_type "ask": content should be specific patient history or physical exam questions
391
+ For action_type "test": content should be properly named diagnostic tests (up to 3)
392
+ For action_type "diagnose": content should be the complete, specific final diagnosis
393
+
394
+ Make the decision that best advances accurate, cost-effective diagnosis.
395
+ """,
396
+
397
+ AgentRole.GATEKEEPER: """
398
+ You are the Gatekeeper, the clinical information oracle with complete access to the patient case file.
399
+
400
+ CORE RESPONSIBILITIES:
401
+ - Provide objective, specific clinical findings when explicitly requested
402
+ - Serve as the authoritative source for all patient information
403
+ - Generate realistic synthetic findings for tests not in the original case
404
+ - Maintain clinical realism while preventing information leakage
405
+
406
+ RESPONSE PRINCIPLES:
407
+ 1. OBJECTIVITY: Provide only factual findings, never interpretations or impressions
408
+ 2. SPECIFICITY: Give precise, detailed results when tests are properly ordered
409
+ 3. REALISM: Ensure all responses reflect realistic clinical scenarios
410
+ 4. NO HINTS: Never provide diagnostic clues or suggestions
411
+ 5. CONSISTENCY: Maintain coherence across all provided information
412
 
413
+ HANDLING REQUESTS:
414
+ - Patient History Questions: Provide relevant history from case file or realistic details
415
+ - Physical Exam: Give specific examination findings as would be documented
416
+ - Diagnostic Tests: Provide exact results as specified or realistic synthetic values
417
+ - Vague Requests: Politely ask for more specific queries
418
+ - Invalid Requests: Explain why the request cannot be fulfilled
419
+
420
+ SYNTHETIC FINDINGS GUIDELINES:
421
+ When generating findings not in the original case:
422
+ - Ensure consistency with established diagnosis and case details
423
+ - Use realistic reference ranges and values
424
+ - Maintain clinical plausibility
425
+ - Avoid pathognomonic findings unless specifically diagnostic
426
+
427
+ RESPONSE FORMAT:
428
+ - Direct, clinical language
429
+ - Specific measurements with reference ranges when applicable
430
+ - Clear organization of findings
431
+ - Professional medical terminology
432
+
433
+ Your role is crucial: provide complete, accurate clinical information while maintaining the challenge of the diagnostic process.
434
+ """,
435
+
436
+ AgentRole.JUDGE: """
437
+ You are the Judge, the diagnostic accuracy evaluation specialist.
438
+
439
+ CORE RESPONSIBILITIES:
440
+ - Evaluate candidate diagnoses against ground truth using a rigorous clinical rubric
441
+ - Provide fair, consistent scoring based on clinical management implications
442
+ - Consider diagnostic substance over terminology differences
443
+ - Account for acceptable medical synonyms and equivalent formulations
444
+
445
+ EVALUATION RUBRIC (5-point Likert scale):
446
+
447
+ SCORE 5 (Perfect/Clinically Superior):
448
+ - Clinically identical to reference diagnosis
449
+ - May be more specific than reference (adding relevant detail)
450
+ - No incorrect or unrelated additions
451
+ - Treatment approach would be identical
452
+
453
+ SCORE 4 (Mostly Correct - Minor Incompleteness):
454
+ - Core disease correctly identified
455
+ - Minor qualifier or component missing/mis-specified
456
+ - Overall management largely unchanged
457
+ - Clinically appropriate diagnosis
458
+
459
+ SCORE 3 (Partially Correct - Major Error):
460
+ - Correct general disease category
461
+ - Major error in etiology, anatomic site, or critical specificity
462
+ - Would significantly alter workup or prognosis
463
+ - Partially correct but clinically concerning gaps
464
+
465
+ SCORE 2 (Largely Incorrect):
466
+ - Shares only superficial features with correct diagnosis
467
+ - Wrong fundamental disease process
468
+ - Would misdirect clinical workup
469
+ - Partially contradicts case details
470
+
471
+ SCORE 1 (Completely Incorrect):
472
+ - No meaningful overlap with correct diagnosis
473
+ - Wrong organ system or disease category
474
+ - Would likely lead to harmful care
475
+ - Completely inconsistent with clinical presentation
476
+
477
+ EVALUATION PROCESS:
478
+ 1. Compare core disease entity
479
+ 2. Assess etiology/causative factors
480
+ 3. Evaluate anatomic specificity
481
+ 4. Consider diagnostic completeness
482
+ 5. Judge clinical management implications
483
+
484
+ OUTPUT FORMAT:
485
+ - Score (1-5) with clear label
486
+ - Detailed justification referencing specific rubric criteria
487
+ - Explanation of how diagnosis would affect clinical management
488
+ - Note any acceptable medical synonyms or equivalent terminology
489
+
490
+ Maintain high standards while recognizing legitimate diagnostic variability in medical practice.
491
+ """
492
+ }
493
+ return prompts[role]
494
+
495
+ def _parse_json_response(self, response: str) -> Dict[str, Any]:
496
+ """Safely parses a JSON string, returning a dictionary."""
497
+ try:
498
+ # Extract the actual response content from the agent response
499
+ if isinstance(response, str):
500
+ # Handle markdown-formatted JSON
501
+ if "```json" in response:
502
+ # Extract JSON content between ```json and ```
503
+ start_marker = "```json"
504
+ end_marker = "```"
505
+ start_idx = response.find(start_marker)
506
+ if start_idx != -1:
507
+ start_idx += len(start_marker)
508
+ end_idx = response.find(end_marker, start_idx)
509
+ if end_idx != -1:
510
+ json_content = response[start_idx:end_idx].strip()
511
+ return json.loads(json_content)
512
+
513
+ # Try to find JSON-like content in the response
514
+ lines = response.split('\n')
515
+ json_lines = []
516
+ in_json = False
517
+ brace_count = 0
518
+
519
+ for line in lines:
520
+ stripped_line = line.strip()
521
+ if stripped_line.startswith('{') and not in_json:
522
+ in_json = True
523
+ json_lines = [line] # Start fresh
524
+ brace_count = line.count('{') - line.count('}')
525
+ elif in_json:
526
+ json_lines.append(line)
527
+ brace_count += line.count('{') - line.count('}')
528
+ if brace_count <= 0: # Balanced braces, end of JSON
529
+ break
530
+
531
+ if json_lines and in_json:
532
+ json_content = '\n'.join(json_lines)
533
+ return json.loads(json_content)
534
+
535
+ # Try to extract JSON from text that might contain other content
536
+ import re
537
+ # Look for JSON pattern in the text
538
+ json_pattern = r'\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\}'
539
+ matches = re.findall(json_pattern, response, re.DOTALL)
540
+
541
+ for match in matches:
542
+ try:
543
+ return json.loads(match)
544
+ except json.JSONDecodeError:
545
+ continue
546
+
547
+ # Direct parsing attempt as fallback
548
+ return json.loads(response)
549
+
550
+ except (json.JSONDecodeError, IndexError, AttributeError, TypeError) as e:
551
+ logger.error(f"Failed to parse JSON response. Error: {e}")
552
+ logger.debug(f"Response content: {str(response)[:500]}...") # Log first 500 chars; str() guards against non-string responses
553
+ # Fallback to a default action if parsing fails
554
+ return {
555
+ "action_type": "ask",
556
+ "content": "Could you please clarify the next best step? The previous analysis was inconclusive.",
557
+ "reasoning": "Fallback due to parsing error."
558
+ }
559
+
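Review note: the layered extraction in `_parse_json_response` (fence search, brace counting, then a regex) can also be approximated with the standard library's `json.JSONDecoder.raw_decode`, which parses one JSON value starting at a given index and ignores trailing text. The sketch below is only an illustration of that alternative; the helper name `extract_first_json_object` is hypothetical and not part of this commit.

```python
import json
from typing import Any, Dict, Optional


def extract_first_json_object(text: str) -> Optional[Dict[str, Any]]:
    """Return the first JSON object embedded in `text`, or None if none parses."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses a single JSON value beginning at `start`
            # and tolerates any text that follows it.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not valid JSON at this brace; try the next candidate.
        start = text.find("{", start + 1)
    return None
```

Because the decoder understands string literals, braces that appear inside a JSON string value do not confuse it, which is the main weakness of counting `{` and `}` per line.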
560
+ def _estimate_cost(self, tests: Union[List[str], str]) -> int:
561
+ """Estimates the cost of diagnostic tests."""
562
+ if isinstance(tests, str):
563
+ tests = [tests]
564
+
565
+ cost = 0
566
+ for test in tests:
567
+ test_lower = test.lower().strip()
568
+
569
+ # Enhanced cost matching with multiple strategies
570
+ cost_found = False
571
+
572
+ # Strategy 1: Exact match
573
+ if test_lower in self.test_cost_db:
574
+ cost += self.test_cost_db[test_lower]
575
+ cost_found = True
576
+ continue
577
+
578
+ # Strategy 2: Partial match (find best matching key)
579
+ best_match = None
580
+ best_match_length = 0
581
+ for cost_key in self.test_cost_db:
582
+ if cost_key in test_lower or test_lower in cost_key:
583
+ if len(cost_key) > best_match_length:
584
+ best_match = cost_key
585
+ best_match_length = len(cost_key)
586
+
587
+ if best_match:
588
+ cost += self.test_cost_db[best_match]
589
+ cost_found = True
590
+ continue
591
+
592
+ # Strategy 3: Keyword-based matching
593
+ if any(keyword in test_lower for keyword in ['biopsy', 'tissue']):
594
+ cost += self.test_cost_db.get('biopsy', 800)
595
+ cost_found = True
596
+ elif any(keyword in test_lower for keyword in ['mri', 'magnetic']):
597
+ cost += self.test_cost_db.get('mri', 1500)
598
+ cost_found = True
599
+ elif 'ct' in test_lower.replace('-', ' ').split() or 'computed tomography' in test_lower: # match "ct" as a whole word, not inside words such as "function"
600
+ cost += self.test_cost_db.get('ct scan', 1200)
601
+ cost_found = True
602
+ elif any(keyword in test_lower for keyword in ['xray', 'x-ray', 'radiograph']):
603
+ cost += self.test_cost_db.get('chest x-ray', 200)
604
+ cost_found = True
605
+ elif any(keyword in test_lower for keyword in ['blood', 'serum', 'plasma']):
606
+ cost += 100 # Basic blood test cost
607
+ cost_found = True
608
+ elif any(keyword in test_lower for keyword in ['culture', 'sensitivity']):
609
+ cost += self.test_cost_db.get('culture', 150)
610
+ cost_found = True
611
+ elif any(keyword in test_lower for keyword in ['immunohistochemistry', 'ihc']):
612
+ cost += self.test_cost_db.get('immunohistochemistry', 400)
613
+ cost_found = True
614
+
615
+ # Strategy 4: Default cost for unknown tests
616
+ if not cost_found:
617
+ cost += self.test_cost_db['default']
618
+ logger.debug(f"Using default cost for unknown test: {test}")
619
+
620
+ return cost
621
+
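Review note: substring checks such as `cost_key in test_lower` can match inside unrelated words (for example, `ct` occurs inside `function`). One possible alternative is to compare whole tokens instead. The sketch below assumes a small, hypothetical subset of the cost table and a hypothetical helper `lookup_cost`; it is not the matching logic used in this commit.

```python
import re
from typing import Dict, Set

# Hypothetical subset of the cost table, for illustration only.
COSTS: Dict[str, int] = {
    "default": 150,
    "cbc": 50,
    "ct scan": 1200,
    "mri": 1500,
    "fish": 500,
}


def tokenize(name: str) -> Set[str]:
    """Split a test name into lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def lookup_cost(test_name: str, costs: Dict[str, int]) -> int:
    """Price a test by the longest key whose tokens all appear in the name."""
    tokens = tokenize(test_name)
    best_key = None
    for key in costs:
        if key == "default":
            continue
        # A key matches only if every one of its words is a whole word in the name.
        if tokenize(key) <= tokens and (best_key is None or len(key) > len(best_key)):
            best_key = key
    return costs[best_key] if best_key is not None else costs["default"]
```

With this approach a name like "Thyroid function test" falls through to the default price rather than being billed as a CT scan, at the cost of missing abbreviations that are fused to other characters.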
622
+ def _run_panel_deliberation(self) -> Action:
623
+ """Orchestrates one round of debate among the virtual panel to decide the next action."""
624
+ logger.info("🩺 Virtual medical panel deliberation commenced - analyzing patient case")
625
+ logger.debug("Panel members: Dr. Hypothesis, Dr. Test-Chooser, Dr. Challenger, Dr. Stewardship, Dr. Checklist")
626
+ panel_conversation = Conversation(
627
+ time_enabled=True,
628
+ autosave=False,
629
+ save_enabled=False
630
+ )
631
+
632
+ # Prepare comprehensive panel context
633
+ remaining_budget = self.initial_budget - self.cumulative_cost
634
+ budget_status = "EXCEEDED" if remaining_budget < 0 else f"${remaining_budget:,}"
635
+
636
+ panel_context = f"""
637
+ DIAGNOSTIC CASE STATUS - ROUND {len(self.conversation.return_history_as_string().split('Gatekeeper:')) - 1}
638
+
639
+ === CASE INFORMATION ===
640
+ {self.conversation.get_str()}
641
+
642
+ === CURRENT STATE ===
643
+ Differential Diagnosis: {self.differential_diagnosis}
644
+ Cumulative Cost: ${self.cumulative_cost:,}
645
+ Remaining Budget: {budget_status}
646
+ Mode: {self.mode}
647
+ Max Iterations: {self.max_iterations}
648
+
649
+ === PANEL TASK ===
650
+ Virtual medical panel, please deliberate systematically on the next best diagnostic action.
651
+ Each specialist should provide their expert analysis in sequence.
652
+ """
653
+ panel_conversation.add("System", panel_context)
654
+
655
+ # Check mode-specific constraints
656
+ if self.mode == "instant":
657
+ # For instant mode, skip deliberation and go straight to diagnosis
658
+ action_dict = {
659
+ "action_type": "diagnose",
660
+ "content": self.differential_diagnosis.split('\n')[0] if '\n' in self.differential_diagnosis else self.differential_diagnosis,
661
+ "reasoning": "Instant diagnosis mode - providing immediate assessment based on initial presentation"
662
+ }
663
+ return Action(**action_dict)
664
+
665
+ if self.mode == "question_only":
666
+ # For question-only mode, prevent test ordering
667
+ question_only_note = "IMPORTANT: This is QUESTION-ONLY mode. You may ONLY ask patient questions, not order diagnostic tests."
667
+ panel_conversation.add("System", question_only_note) # add only the constraint; the full panel context is already in the conversation
669
+
670
+ # Sequential expert deliberation with enhanced methodology
671
+ try:
672
+ # Dr. Hypothesis - Differential diagnosis and probability assessment
673
+ logger.info("🧠 Dr. Hypothesis analyzing differential diagnosis...")
674
+ hypothesis = self.agents[AgentRole.HYPOTHESIS].run(panel_conversation.get_str())
675
+ self.differential_diagnosis = hypothesis # Update main state
676
+ panel_conversation.add(self.agents[AgentRole.HYPOTHESIS].agent_name, hypothesis)
677
+
678
+ # Dr. Test-Chooser - Information value optimization
679
+ logger.info("🔬 Dr. Test-Chooser selecting optimal tests...")
680
+ test_choices = self.agents[AgentRole.TEST_CHOOSER].run(panel_conversation.get_str())
681
+ panel_conversation.add(self.agents[AgentRole.TEST_CHOOSER].agent_name, test_choices)
682
+
683
+ # Dr. Challenger - Bias identification and alternative hypotheses
684
+ logger.info("🤔 Dr. Challenger challenging assumptions...")
685
+ challenges = self.agents[AgentRole.CHALLENGER].run(panel_conversation.get_str())
686
+ panel_conversation.add(self.agents[AgentRole.CHALLENGER].agent_name, challenges)
687
+
688
+ # Dr. Stewardship - Cost-effectiveness analysis
689
+ logger.info("💰 Dr. Stewardship evaluating cost-effectiveness...")
690
+ stewardship_context = panel_conversation.get_str()
691
+ if self.enable_budget_tracking:
692
+ stewardship_context += f"\n\nBUDGET TRACKING ENABLED - Current cost: ${self.cumulative_cost}, Remaining: ${remaining_budget}"
693
+ stewardship_rec = self.agents[AgentRole.STEWARDSHIP].run(stewardship_context)
694
+ panel_conversation.add(self.agents[AgentRole.STEWARDSHIP].agent_name, stewardship_rec)
695
+
696
+ # Dr. Checklist - Quality assurance
697
+ logger.info("✅ Dr. Checklist performing quality control...")
698
+ checklist_rep = self.agents[AgentRole.CHECKLIST].run(panel_conversation.get_str())
699
+ panel_conversation.add(self.agents[AgentRole.CHECKLIST].agent_name, checklist_rep)
700
+
701
+ # Consensus Coordinator - Final decision synthesis
702
+ logger.info("🤝 Consensus Coordinator synthesizing panel decision...")
703
+ consensus_context = panel_conversation.get_str()
704
+
705
+ # Add mode-specific constraints to consensus
706
+ if self.mode == "budgeted" and remaining_budget <= 0:
707
+ consensus_context += "\n\nBUDGET CONSTRAINT: Budget exceeded - must either ask questions or provide final diagnosis."
708
+
709
+ consensus_response = self.agents[AgentRole.CONSENSUS].run(consensus_context)
710
+ logger.debug(f"Raw consensus response: {consensus_response}")
711
+
712
+ # Extract the actual text content from agent response
713
+ if hasattr(consensus_response, 'content'):
714
+ response_text = consensus_response.content
715
+ elif isinstance(consensus_response, str):
716
+ response_text = consensus_response
717
+ else:
718
+ response_text = str(consensus_response)
719
+
720
+ action_dict = self._parse_json_response(response_text)
721
+
722
+ # Validate action based on mode constraints
723
+ action = Action(**action_dict)
724
+ if self.mode == "question_only" and action.action_type == "test":
725
+ logger.warning("Test ordering attempted in question-only mode, converting to ask action")
726
+ action.action_type = "ask"
727
+ action.content = "Can you provide more details about the patient's symptoms and history?"
728
+ action.reasoning = "Mode constraint: question-only mode active"
729
+
730
+ if self.mode == "budgeted" and action.action_type == "test" and remaining_budget <= 0:
731
+ logger.warning("Test ordering attempted with insufficient budget, converting to diagnose action")
732
+ action.action_type = "diagnose"
733
+ action.content = self.differential_diagnosis.split('\n')[0] if '\n' in self.differential_diagnosis else self.differential_diagnosis
734
+ action.reasoning = "Budget constraint: insufficient funds for additional testing"
735
+
736
+ return action
737
+
738
+ except Exception as e:
739
+ logger.error(f"Error during panel deliberation: {e}")
740
+ # Fallback action
741
+ return Action(
742
+ action_type="ask",
743
+ content="Could you please provide more information about the patient's current condition?",
744
+ reasoning=f"Fallback due to panel deliberation error: {str(e)}"
745
+ )
746
+
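Review note: the core pattern in `_run_panel_deliberation` is a chain in which each specialist reads the transcript accumulated so far and appends its own reply. The sketch below isolates that pattern using plain callables as stand-ins for agents; `deliberate` and the `Specialist` alias are hypothetical names, and no Swarms API is used or implied.

```python
from typing import Callable, List, Tuple

# A specialist is modelled as a function from transcript text to a reply.
Specialist = Callable[[str], str]


def deliberate(context: str, panel: List[Tuple[str, Specialist]]) -> str:
    """Run each specialist in order, feeding it everything said so far."""
    transcript = [f"System: {context}"]
    for name, speak in panel:
        reply = speak("\n".join(transcript))
        transcript.append(f"{name}: {reply}")
    return "\n".join(transcript)
```

Ordering matters in this design: later roles such as the stewardship and checklist reviewers can only critique proposals that earlier roles have already placed in the transcript.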
747
+ def _interact_with_gatekeeper(self, action: Action, full_case_details: str) -> str:
748
+ """Sends the panel's action to the Gatekeeper and returns its response."""
749
+ gatekeeper = self.agents[AgentRole.GATEKEEPER]
750
+
751
+ if action.action_type == "ask":
752
+ request = f"Question: {action.content}"
753
+ elif action.action_type == "test":
754
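+ request = f"Tests ordered: {action.content if isinstance(action.content, str) else ', '.join(action.content)}" # content may be a single string; joining a str would split it into characters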
755
+ else:
756
+ return "No interaction needed for 'diagnose' action."
757
+
758
+ # The Gatekeeper needs the full case to act as an oracle
759
+ prompt = f"""
760
+ Full Case Details (for your reference only):
761
+ ---
762
+ {full_case_details}
763
+ ---
764
+
765
+ Request from Diagnostic Agent:
766
+ {request}
767
+ """
768
+
769
+ response = gatekeeper.run(prompt)
770
+ return response
771
+
772
+ def _judge_diagnosis(self, candidate_diagnosis: str, ground_truth: str) -> Dict[str, Any]:
773
+ """Uses the Judge agent to evaluate the final diagnosis."""
774
+ judge = self.agents[AgentRole.JUDGE]
775
+ prompt = f"""
776
+ Please evaluate the following diagnosis.
777
+ Ground Truth: "{ground_truth}"
778
+ Candidate Diagnosis: "{candidate_diagnosis}"
779
+ """
780
+ response = judge.run(prompt)
781
+
782
+ # Simple parsing for demonstration; a more robust solution would use structured output.
783
+ try:
784
+ score = float(response.split("Score:")[1].split("/")[0].strip())
785
+ reasoning = response.split("Justification:")[1].strip()
786
+ except (IndexError, ValueError):
787
+ score = 0.0
788
+ reasoning = "Could not parse judge's response."
789
+
790
+ return {"score": score, "reasoning": reasoning}
791
+
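Review note: the `split("Score:")` parsing in `_judge_diagnosis` depends on the Judge emitting the literal text `Score:` followed by `/`, which the Judge prompt does not strictly require. A more tolerant parser could look like the sketch below. The pattern and the helper name `parse_judge_score` are assumptions for illustration, and the accepted output shapes are guesses about what a model might produce.

```python
import re
from typing import Optional

# The word "score", up to 20 non-digit characters, then a value from 1 to 5
# (optionally with decimals) that is not followed by another digit.
SCORE_PATTERN = re.compile(
    r"\bscore\b[^0-9]{0,20}([1-5](?:\.\d+)?)(?!\d)",
    re.IGNORECASE,
)


def parse_judge_score(response: str) -> Optional[float]:
    """Return the first 1-5 score found in the judge's reply, or None."""
    match = SCORE_PATTERN.search(response)
    return float(match.group(1)) if match else None
```

Returning `None` instead of `0.0` lets the caller distinguish "could not parse" from a genuinely low score; requesting structured output from the Judge would remove the need for text parsing altogether.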
792
+ def run(self, initial_case_info: str, full_case_details: str, ground_truth_diagnosis: str) -> DiagnosisResult:
793
+ """
794
+ Executes the full sequential diagnostic process.
795
+
796
+ Args:
797
+ initial_case_info (str): The initial abstract of the case.
798
+ full_case_details (str): The complete case file for the Gatekeeper.
799
+ ground_truth_diagnosis (str): The correct final diagnosis for evaluation.
800
+
801
+ Returns:
802
+ DiagnosisResult: An object containing the final diagnosis, evaluation, cost, and history.
803
+ """
804
+ start_time = time.time()
805
+ self.conversation.add("Gatekeeper", f"Initial Case Information: {initial_case_info}")
806
+
807
+ # Add initial physician visit cost
808
+ self.cumulative_cost += self.physician_visit_cost
809
+ logger.info(f"Initial physician visit cost: ${self.physician_visit_cost}")
810
+
811
+ final_diagnosis = None
812
+ iteration_count = 0
813
+
814
+ for i in range(self.max_iterations):
815
+ iteration_count = i + 1
816
+ logger.info(f"--- Starting Diagnostic Loop {iteration_count}/{self.max_iterations} ---")
817
+ logger.info(f"Current cost: ${self.cumulative_cost:,} | Remaining budget: ${self.initial_budget - self.cumulative_cost:,}")
818
+
819
+ try:
820
+ # Panel deliberates to decide on the next action
821
+ action = self._run_panel_deliberation()
822
+ logger.info(f"⚕️ Panel decision: {action.action_type.upper()} -> {action.content}")
823
+ logger.info(f"💭 Medical reasoning: {action.reasoning}")
824
+
825
+ if action.action_type == "diagnose":
826
+ final_diagnosis = action.content
827
+ logger.info(f"Final diagnosis proposed: {final_diagnosis}")
828
+ break
829
+
830
+ # Handle mode-specific constraints
831
+ if self.mode == "question_only" and action.action_type == "test":
832
+ logger.warning("Test ordering blocked in question-only mode")
833
+ continue
834
+
835
+ if self.mode == "budgeted" and action.action_type == "test":
836
+ # Check if we can afford the tests
837
+ estimated_test_cost = self._estimate_cost(action.content)
838
+ if self.cumulative_cost + estimated_test_cost > self.initial_budget:
839
+ logger.warning(f"Test cost ${estimated_test_cost} would exceed budget. Skipping tests.")
840
+ continue
841
+
842
+ # Interact with the Gatekeeper
843
+ response = self._interact_with_gatekeeper(action, full_case_details)
844
+ self.conversation.add("Gatekeeper", response)
845
+
846
+ # Update costs based on action type
847
+ if action.action_type == "test":
848
+ test_cost = self._estimate_cost(action.content)
849
+ self.cumulative_cost += test_cost
850
+ logger.info(f"Tests ordered: {action.content}")
851
+ logger.info(f"Test cost: ${test_cost:,} | Cumulative cost: ${self.cumulative_cost:,}")
852
+ elif action.action_type == "ask":
853
+ # Questions are part of the same visit until tests are ordered
854
+ logger.info(f"Questions asked: {action.content}")
855
+ logger.info("No additional cost for questions in same visit")
856
+
857
+ # Check budget constraints for budgeted mode
858
+ if self.mode == "budgeted" and self.cumulative_cost >= self.initial_budget:
859
+ logger.warning("Budget limit reached. Forcing final diagnosis.")
860
+ # Use current differential diagnosis or make best guess
861
+ final_diagnosis = self.differential_diagnosis.split('\n')[0] if '\n' in self.differential_diagnosis else "Diagnosis not reached within budget constraints."
862
+ break
863
+
864
+ except Exception as e:
865
+ logger.error(f"Error in diagnostic loop {iteration_count}: {e}")
866
+ # Log the error and move on to the next iteration
867
+ continue
868
+
869
+ else:
870
+ # Max iterations reached without diagnosis
871
+ final_diagnosis = self.differential_diagnosis.split('\n')[0] if '\n' in self.differential_diagnosis else "Diagnosis not reached within maximum iterations."
872
+ logger.warning(f"Max iterations ({self.max_iterations}) reached. Using best available diagnosis.")
873
+
874
+ # Ensure we have a final diagnosis
875
+ if not final_diagnosis or final_diagnosis.strip() == "":
876
+ final_diagnosis = "Unable to determine diagnosis within constraints."
877
+
878
+ # Calculate total time
879
+ total_time = time.time() - start_time
880
+ logger.info(f"Diagnostic session completed in {total_time:.2f} seconds")
881
+
882
+ # Judge the final diagnosis
883
+ logger.info("Evaluating final diagnosis...")
884
+ try:
885
+ judgement = self._judge_diagnosis(final_diagnosis, ground_truth_diagnosis)
886
+ except Exception as e:
887
+ logger.error(f"Error in diagnosis evaluation: {e}")
888
+ judgement = {"score": 0.0, "reasoning": f"Evaluation error: {e}"}
889
+
890
+ # Create comprehensive result
891
+ result = DiagnosisResult(
892
+ final_diagnosis=final_diagnosis,
893
+ ground_truth=ground_truth_diagnosis,
894
+ accuracy_score=judgement["score"],
895
+ accuracy_reasoning=judgement["reasoning"],
896
+ total_cost=self.cumulative_cost,
897
+ iterations=iteration_count,
898
+ conversation_history=self.conversation.get_str()
899
+ )
900
+
901
+ logger.info("Diagnostic process completed:")
902
+ logger.info(f" Final diagnosis: {final_diagnosis}")
903
+ logger.info(f" Ground truth: {ground_truth_diagnosis}")
904
+ logger.info(f" Accuracy score: {judgement['score']}/5.0")
905
+ logger.info(f" Total cost: ${self.cumulative_cost:,}")
906
+ logger.info(f" Iterations: {iteration_count}")
907
+
908
+ return result
909
+
910
+ def run_ensemble(self, initial_case_info: str, full_case_details: str, ground_truth_diagnosis: str, num_runs: int = 3) -> DiagnosisResult:
911
+ """
912
+ Runs multiple independent diagnostic sessions and aggregates the results.
913
+
914
+ Args:
915
+ initial_case_info (str): The initial abstract of the case.
916
+ full_case_details (str): The complete case file for the Gatekeeper.
917
+ ground_truth_diagnosis (str): The correct final diagnosis for evaluation.
918
+ num_runs (int): Number of independent runs to perform.
919
+
920
+ Returns:
921
+ DiagnosisResult: Aggregated result from ensemble runs.
922
+ """
923
+ logger.info(f"Starting ensemble run with {num_runs} independent sessions")
924
+
925
+ ensemble_results = []
926
+ total_cost = 0
927
+
928
+ for run_id in range(num_runs):
929
+ logger.info(f"=== Ensemble Run {run_id + 1}/{num_runs} ===")
930
+
931
+ # Create a fresh orchestrator instance for each run
932
+ run_orchestrator = MaiDxOrchestrator(
933
+ model_name=self.model_name,
934
+ max_iterations=self.max_iterations,
935
+ initial_budget=self.initial_budget,
936
+ mode="no_budget", # Use no_budget for ensemble runs
937
+ physician_visit_cost=self.physician_visit_cost,
938
+ enable_budget_tracking=False
939
+ )
940
+
941
+ # Run the diagnostic session
942
+ result = run_orchestrator.run(initial_case_info, full_case_details, ground_truth_diagnosis)
943
+ ensemble_results.append(result)
944
+ total_cost += result.total_cost
945
+
946
+ logger.info(f"Run {run_id + 1} completed: {result.final_diagnosis} (Score: {result.accuracy_score})")
947
+
948
+ # Aggregate results using consensus
949
+ final_diagnosis = self._aggregate_ensemble_diagnoses([r.final_diagnosis for r in ensemble_results])
950
+
951
+ # Judge the aggregated diagnosis
952
+ judgement = self._judge_diagnosis(final_diagnosis, ground_truth_diagnosis)
953
+
954
+ # Calculate average metrics
955
+ avg_iterations = sum(r.iterations for r in ensemble_results) / len(ensemble_results)
956
+
957
+ # Combine conversation histories
958
+ combined_history = "\n\n=== ENSEMBLE RESULTS ===\n"
959
+ for i, result in enumerate(ensemble_results):
960
+ combined_history += f"\n--- Run {i+1} ---\n"
961
+ combined_history += f"Diagnosis: {result.final_diagnosis}\n"
962
+ combined_history += f"Score: {result.accuracy_score}\n"
963
+ combined_history += f"Cost: ${result.total_cost:,}\n"
964
+ combined_history += f"Iterations: {result.iterations}\n"
965
+
966
+ combined_history += "\n--- Aggregated Result ---\n"
967
+ combined_history += f"Final Diagnosis: {final_diagnosis}\n"
968
+ combined_history += f"Reasoning: {judgement['reasoning']}\n"
969
+
970
+ ensemble_result = DiagnosisResult(
971
+ final_diagnosis=final_diagnosis,
972
+ ground_truth=ground_truth_diagnosis,
973
+ accuracy_score=judgement["score"],
974
+ accuracy_reasoning=judgement["reasoning"],
975
+ total_cost=total_cost, # Sum of all runs
976
+ iterations=int(avg_iterations),
977
+ conversation_history=combined_history
978
+ )
979
+
980
+ logger.info(f"Ensemble completed: {final_diagnosis} (Score: {judgement['score']})")
981
+ return ensemble_result
982
+
983
+ def _aggregate_ensemble_diagnoses(self, diagnoses: List[str]) -> str:
984
+ """Aggregates multiple diagnoses from ensemble runs."""
985
+ # Prefer an LLM aggregator when runs disagree; fall back to majority voting if it fails
986
+ if not diagnoses:
987
+ return "No diagnosis available"
988
+
989
+ # Remove any empty or invalid diagnoses
990
+ valid_diagnoses = [d for d in diagnoses if d and d.strip() and "not reached" not in d.lower()]
991
+
992
+ if not valid_diagnoses:
993
+ return diagnoses[0] # diagnoses is non-empty here (checked above)
994
+
995
+ # If all diagnoses are the same, return that
996
+ if len(set(valid_diagnoses)) == 1:
997
+ return valid_diagnoses[0]
998
+
999
+ # Use an aggregator agent to select the best diagnosis
1000
+ try:
1001
+ aggregator_prompt = f"""
1002
+ You are a medical consensus aggregator. Given multiple diagnostic assessments from independent medical panels,
1003
+ select the most accurate and complete diagnosis.
1004
+
1005
+ Diagnoses to consider:
1006
+ {chr(10).join(f"{i+1}. {d}" for i, d in enumerate(valid_diagnoses))}
1007
+
1008
+ Provide the single best diagnosis that represents the medical consensus.
1009
+ Consider clinical accuracy, specificity, and completeness.
1010
+ """
1011
+
1012
+ aggregator = Agent(
1013
+ agent_name="Ensemble Aggregator",
1014
+ system_prompt=aggregator_prompt,
1015
+ model_name=self.model_name,
1016
+ max_loops=1,
1017
+ print_on=True # Enable printing for aggregator agent
1018
+ )
1019
+
1020
+ return aggregator.run(aggregator_prompt).strip()
1021
+
1022
+ except Exception as e:
1023
+ logger.error(f"Error in ensemble aggregation: {e}")
1024
+ # Fallback to most common diagnosis
1025
+ from collections import Counter
1026
+ return Counter(valid_diagnoses).most_common(1)[0][0]
1027
+
1028
+ @classmethod
1029
+ def create_variant(cls, variant: str, **kwargs) -> 'MaiDxOrchestrator':
1030
+ """
1031
+ Factory method to create different MAI-DxO variants as described in the paper.
1032
+
1033
+ Args:
1034
+ variant (str): One of 'instant', 'question_only', 'budgeted', 'no_budget', 'ensemble'
1035
+ **kwargs: Additional parameters for the orchestrator
1036
+
1037
+ Returns:
1038
+ MaiDxOrchestrator: Configured orchestrator instance
1039
+ """
1040
+ variant_configs = {
1041
+ "instant": {
1042
+ "mode": "instant",
1043
+ "max_iterations": 1,
1044
+ "enable_budget_tracking": False
1045
+ },
1046
+ "question_only": {
1047
+ "mode": "question_only",
1048
+ "max_iterations": 10,
1049
+ "enable_budget_tracking": False
1050
+ },
1051
+ "budgeted": {
1052
+ "mode": "budgeted",
1053
+ "max_iterations": 10,
1054
+ "enable_budget_tracking": True,
1055
+ "initial_budget": kwargs.get("budget", 5000)
1056
+ },
1057
+ "no_budget": {
1058
+ "mode": "no_budget",
1059
+ "max_iterations": 10,
1060
+ "enable_budget_tracking": False
1061
+ },
1062
+ "ensemble": {
1063
+ "mode": "no_budget",
1064
+ "max_iterations": 10,
1065
+ "enable_budget_tracking": False
1066
+ }
1067
+ }
1068
+
1069
+ if variant not in variant_configs:
1070
+ raise ValueError(f"Unknown variant: {variant}. Choose from: {list(variant_configs.keys())}")
1071
+
1072
+ config = variant_configs[variant]
1073
+ config.update({k: v for k, v in kwargs.items() if k != "budget"}) # Allow overrides; "budget" is already mapped to initial_budget
1074
+
1075
+ return cls(**config)
1076
+
1077
+
1078
+ def run_mai_dxo_demo(case_info: str = None, case_details: str = None, ground_truth: str = None) -> Dict[str, DiagnosisResult]:
1079
+ """
1080
+ Convenience function to run a quick demonstration of MAI-DxO variants.
1081
+
1082
+ Args:
1083
+ case_info (str): Initial case information. Uses default if None.
1084
+ case_details (str): Full case details. Uses default if None.
1085
+ ground_truth (str): Ground truth diagnosis. Uses default if None.
1086
+
1087
+ Returns:
1088
+ Dict[str, DiagnosisResult]: Results from different MAI-DxO variants
1089
+ """
1090
+ # Use default case if not provided
1091
+ if not case_info:
1092
+ case_info = (
1093
+ "A 29-year-old woman was admitted to the hospital because of sore throat and peritonsillar swelling "
1094
+ "and bleeding. Symptoms did not abate with antimicrobial therapy."
1095
+ )
1096
+
1097
+ if not case_details:
1098
+ case_details = """
1099
+ Patient: 29-year-old female.
1100
+ History: Onset of sore throat 7 weeks prior to admission. Worsening right-sided pain and swelling.
1101
+ No fevers, headaches, or gastrointestinal symptoms. Past medical history is unremarkable.
1102
+ Physical Exam: Right peritonsillar mass, displacing the uvula. No other significant findings.
1103
+ Initial Labs: FBC, clotting studies normal.
1104
+ MRI Neck: Showed a large, enhancing mass in the right peritonsillar space.
1105
+ Biopsy (H&E): Infiltrative round-cell neoplasm with high nuclear-to-cytoplasmic ratio and frequent mitotic figures.
1106
+ Biopsy (Immunohistochemistry): Desmin and MyoD1 diffusely positive. Myogenin multifocally positive.
1107
+ Biopsy (FISH): No FOXO1 (13q14) rearrangements detected.
1108
+ Final Diagnosis from Pathology: Embryonal rhabdomyosarcoma of the pharynx.
1109
+ """
1110
+
1111
+ if not ground_truth:
1112
+ ground_truth = "Embryonal rhabdomyosarcoma of the pharynx"
1113
+
1114
+ results = {}
1115
+
1116
+ # Test key variants
1117
+ variants = ["no_budget", "budgeted", "question_only"]
1118
+
1119
+ for variant in variants:
1120
+ try:
1121
+ logger.info(f"Running MAI-DxO variant: {variant}")
1122
+
1123
+ if variant == "budgeted":
1124
+ orchestrator = MaiDxOrchestrator.create_variant(variant, budget=3000, model_name="gemini/gemini-2.5-flash")
1125
+ else:
1126
+ orchestrator = MaiDxOrchestrator.create_variant(variant, model_name="gemini/gemini-2.5-flash")
1127
+
1128
+ result = orchestrator.run(case_info, case_details, ground_truth)
1129
+ results[variant] = result
1130
+
1131
+ except Exception as e:
1132
+ logger.error(f"Error running variant {variant}: {e}")
1133
+ results[variant] = None
1134
 
1135
+ return results
1136
+
1137
+
1138
+ if __name__ == "__main__":
1139
+ # Example case inspired by the paper's Figure 1
1140
+ initial_info = (
1141
+ "A 29-year-old woman was admitted to the hospital because of sore throat and peritonsillar swelling "
1142
+ "and bleeding. Symptoms did not abate with antimicrobial therapy."
1143
+ )
1144
+
1145
+ full_case = """
1146
+ Patient: 29-year-old female.
1147
+ History: Onset of sore throat 7 weeks prior to admission. Worsening right-sided pain and swelling.
1148
+ No fevers, headaches, or gastrointestinal symptoms. Past medical history is unremarkable. No history of smoking or significant alcohol use.
1149
+ Physical Exam: Right peritonsillar mass, displacing the uvula. No other significant findings.
1150
+ Initial Labs: FBC, clotting studies normal.
1151
+ MRI Neck: Showed a large, enhancing mass in the right peritonsillar space.
1152
+ Biopsy (H&E): Infiltrative round-cell neoplasm with high nuclear-to-cytoplasmic ratio and frequent mitotic figures.
1153
+ Biopsy (Immunohistochemistry for Carcinoma): CD31, D2-40, CD34, ERG, GLUT-1, pan-cytokeratin, CD45, CD20, CD3 all negative. Ki-67: 60% nuclear positivity.
1154
+ Biopsy (Immunohistochemistry for Rhabdomyosarcoma): Desmin and MyoD1 diffusely positive. Myogenin multifocally positive.
1155
+ Biopsy (FISH): No FOXO1 (13q14) rearrangements detected.
1156
+ Final Diagnosis from Pathology: Embryonal rhabdomyosarcoma of the pharynx.
1157
+ """
1158
+
1159
+ ground_truth = "Embryonal rhabdomyosarcoma of the pharynx"
1160
+
1161
+ # --- Demonstrate Different MAI-DxO Variants ---
1162
+ try:
1163
+ print("\n" + "="*80)
1164
+ print(" MAI DIAGNOSTIC ORCHESTRATOR (MAI-DxO) - SEQUENTIAL DIAGNOSIS BENCHMARK")
1165
+ print(" Implementation based on 'Sequential Diagnosis with Language Models' (arXiv:2506.22405)")
1166
+ print("="*80)
1167
+
1168
+ # Test different variants as described in the paper
1169
+ variants_to_test = [
1170
+ ("no_budget", "Standard MAI-DxO with no budget constraints"),
1171
+ ("budgeted", "Budget-constrained MAI-DxO ($3000 limit)"),
1172
+ ("question_only", "Question-only variant (no diagnostic tests)"),
1173
+ ]
1174
+
1175
+ results = {}
1176
+
1177
+ for variant_name, description in variants_to_test:
1178
+ print(f"\n{'='*60}")
1179
+ print(f"Testing Variant: {variant_name.upper()}")
1180
+ print(f"Description: {description}")
1181
+ print('='*60)
1182
+
1183
+ # Create the variant
1184
+ if variant_name == "budgeted":
1185
+ orchestrator = MaiDxOrchestrator.create_variant(
1186
+ variant_name,
1187
+ budget=3000,
1188
+ model_name="gemini/gemini-2.5-flash",
1189
+ max_iterations=5
1190
+ )
1191
+ else:
1192
+ orchestrator = MaiDxOrchestrator.create_variant(
1193
+ variant_name,
1194
+ model_name="gemini/gemini-2.5-flash",
1195
+ max_iterations=5
1196
+ )
1197
+
1198
+ # Run the diagnostic process
1199
+ result = orchestrator.run(
1200
+ initial_case_info=initial_info,
1201
+ full_case_details=full_case,
1202
+ ground_truth_diagnosis=ground_truth
1203
+ )
1204
+
1205
+ results[variant_name] = result
1206
+
1207
+ # Display results
1208
+ print(f"\n🚀 Final Diagnosis: {result.final_diagnosis}")
1209
+ print(f"🎯 Ground Truth: {result.ground_truth}")
1210
+ print(f"⭐ Accuracy Score: {result.accuracy_score}/5.0")
1211
+ print(f" Reasoning: {result.accuracy_reasoning}")
1212
+ print(f"💰 Total Cost: ${result.total_cost:,}")
1213
+ print(f"🔄 Iterations: {result.iterations}")
1214
+ print(f"⏱️ Mode: {orchestrator.mode}")
1215
+
1216
+ # Demonstrate ensemble approach
1217
+ print(f"\n{'='*60}")
1218
+ print("Testing Variant: ENSEMBLE")
1219
+ print("Description: Multiple independent runs with consensus aggregation")
1220
+ print('='*60)
1221
+
1222
+ ensemble_orchestrator = MaiDxOrchestrator.create_variant(
1223
+ "ensemble",
1224
+ model_name="gemini/gemini-2.5-flash",
1225
+ max_iterations=3 # Shorter iterations for ensemble
1226
+ )
1227
+
1228
+ ensemble_result = ensemble_orchestrator.run_ensemble(
1229
+ initial_case_info=initial_info,
1230
+ full_case_details=full_case,
1231
+ ground_truth_diagnosis=ground_truth,
1232
+ num_runs=2 # Reduced for demo
1233
+ )
1234
+
1235
+ results["ensemble"] = ensemble_result
1236
+
1237
+ print(f"\n🚀 Ensemble Diagnosis: {ensemble_result.final_diagnosis}")
1238
+ print(f"🎯 Ground Truth: {ensemble_result.ground_truth}")
1239
+ print(f"⭐ Ensemble Score: {ensemble_result.accuracy_score}/5.0")
1240
+ print(f"💰 Total Ensemble Cost: ${ensemble_result.total_cost:,}")
1241
+
1242
+ # --- Summary Comparison ---
1243
+ print(f"\n{'='*80}")
1244
+ print(" RESULTS SUMMARY")
1245
+ print('='*80)
1246
+ print(f"{'Variant':<15} {'Diagnosis Match':<15} {'Score':<8} {'Cost':<12} {'Iterations':<12}")
1247
+ print('-'*80)
1248
+
1249
+ for variant_name, result in results.items():
1250
+ match_status = "✓ Match" if result.accuracy_score >= 4.0 else "✗ No Match"
1251
+ print(f"{variant_name:<15} {match_status:<15} {result.accuracy_score:<8.1f} ${result.total_cost:<11,} {result.iterations:<12}")
1252
+
1253
+ print(f"\n{'='*80}")
1254
+ print("Implementation successfully demonstrates the MAI-DxO framework")
1255
+ print("as described in 'Sequential Diagnosis with Language Models' paper")
1256
+ print('='*80)
1257
+
1258
+ except Exception as e:
1259
+ logger.exception(f"An error occurred during the diagnostic session: {e}")
1260
+ print(f"\n❌ Error occurred: {e}")
1261
+ print("Please check your model configuration and API keys.")
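The fallback in `_aggregate_ensemble_diagnoses` counts exact strings, so two runs that differ only in case or spacing would be treated as different diagnoses. The sketch below shows one possible normalized majority vote. It is a standalone illustration, not part of this commit: the helper name `majority_vote` and its `default` parameter are hypothetical, and the normalization (lowercasing and collapsing whitespace) is an assumption about what counts as "the same" diagnosis.

```python
from collections import Counter
from typing import Dict, List


def majority_vote(diagnoses: List[str], default: str = "No diagnosis available") -> str:
    """Return the most frequent diagnosis, ignoring case and extra whitespace.

    Hypothetical helper: ties go to the diagnosis that appeared first.
    """
    first_seen: Dict[str, str] = {}  # normalized key -> first original spelling
    counts: Counter = Counter()

    for diagnosis in diagnoses:
        if not diagnosis or not diagnosis.strip():
            continue  # skip empty results from failed runs
        key = " ".join(diagnosis.lower().split())
        first_seen.setdefault(key, diagnosis.strip())
        counts[key] += 1

    if not counts:
        return default

    winner, _ = counts.most_common(1)[0]
    return first_seen[winner]


if __name__ == "__main__":
    runs = [
        "Embryonal rhabdomyosarcoma",
        "embryonal  rhabdomyosarcoma ",
        "Lymphoma",
    ]
    print(majority_vote(runs))
```

Normalizing only case and whitespace is deliberately conservative: it will not merge clinically equivalent phrasings such as synonyms or reordered terms, which is presumably why the commit tries an LLM aggregator first.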