---
language:
- en
license: other
pipeline_tag: text-generation
library_name: transformers
tags:
- clinical-nlp
- medical-coding
- icd10
- icd-10-cm
- reasoning
- reinforcement-learning
- grpo
- healthcare
base_model:
- Qwen/Qwen2.5-7B-Instruct
---

# DeepICD-R1-7B

## Model Summary

**DeepICD-R1-7B** is a clinical reasoning language model for **ICD-10-CM diagnosis outcome prediction from admission notes**. It is derived from **Qwen2.5-7B-Instruct** and trained with the **DeepICD-R1 framework**, which combines structured reasoning traces with reinforcement learning and hierarchical reward signals.

The model predicts a **single ICD-10-CM diagnosis code** from clinical text while producing an interpretable reasoning trace explaining the decision.

The training methodology follows the approach described in the paper:

**DeepICD-R1: Medical Reasoning through Hierarchical Rewards and Unsupervised Distillation**

This work frames clinical diagnosis prediction as a **reasoning task optimized through reinforcement learning**.

---

## Model Details

- **Model name:** DeepICD-R1-7B
- **Organization:** DATEXIS
- **Base model:** Qwen2.5-7B-Instruct
- **Parameters:** ~7B
- **Task:** Single ICD-10-CM diagnosis prediction from admission notes
- **Training paradigm:** Supervised reasoning + reinforcement learning
- **Framework:** VERL RL trainer
- **Domain:** Clinical NLP / healthcare reasoning

Qwen2.5-7B-Instruct is an **instruction-tuned 7-billion-parameter language model suited to instruction following and long-form generation**.
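Because the task is to emit a single ICD-10-CM code, downstream tooling typically checks the surface form of the prediction before any lookup. A minimal sketch of such a check (the regex is a simplified syntactic pattern assumed here; it does not verify membership in the official code set):

```python
import re

# Simplified ICD-10-CM surface pattern: a letter, a digit, an alphanumeric,
# an optional dot, then up to four more alphanumerics.
# MIMIC-style exports often omit the dot (e.g. "M5116" for M51.16).
_ICD10CM_RE = re.compile(r"^[A-Z]\d[0-9A-Z]\.?[0-9A-Z]{0,4}$")

def looks_like_icd10cm(code: str) -> bool:
    """Syntactic check only; not validation against the official code set."""
    return bool(_ICD10CM_RE.match(code.strip().upper()))

print(looks_like_icd10cm("M5116"))   # dot-less form, as in MIMIC exports
print(looks_like_icd10cm("M51.16"))  # dotted form
print(looks_like_icd10cm("hello"))   # rejected: second character is not a digit
```

A check like this is best paired with a lookup against the official CMS code tables before any downstream use.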
---

## Intended Use

This model is intended for **research purposes**, including:

- clinical reasoning research
- ICD-10-CM coding prediction
- reinforcement learning for language models
- reasoning trace generation
- structured prediction from clinical text

### Out-of-Scope Use

This model **must not be used for**:

- medical diagnosis
- clinical decision support
- patient triage
- automated medical coding without expert supervision
- billing or compliance workflows

---

## Training Methodology

The **DeepICD-R1 framework** treats diagnosis prediction as a reasoning problem. Training combines:

### 1. Supervised reasoning traces

A dataset of reasoning chains explaining diagnosis predictions.

### 2. Reinforcement learning optimization

Training uses **Group Relative Policy Optimization (GRPO)** to improve reasoning quality and prediction accuracy.

### 3. Hierarchical reward signals

Rewards are aligned with the hierarchical structure of ICD codes. The reward function combines:

- **format reward:** correct reasoning + diagnosis structure
- **outcome reward:** correct diagnosis prediction
- **hierarchical reward:** partial credit for correct ICD prefixes

This design encourages models to produce both **accurate diagnoses and structured reasoning**.

---

## Training Data

The training task uses **clinical admission notes paired with ICD-10-CM diagnosis codes**, derived from de-identified electronic health record datasets such as **MIMIC-IV**.

Task formulation:

**Input:** a clinical admission note describing the patient presentation.

**Output:** a structured reasoning trace and the predicted ICD-10-CM code.

---

## Output Format

The model is trained to produce structured outputs separating reasoning from the final diagnosis.

### Example

```text
The patient presents with ...
Symptoms and clinical history suggest ...
...
M5116
```

## Training Configuration

The model was trained using the **VERL reinforcement learning trainer** with **Group Relative Policy Optimization (GRPO)**, following the DeepICD-R1 training framework.

### Core Training Parameters

| Parameter | Value |
|-----------|-------|
| Algorithm | GRPO |
| Training framework | VERL (`verl.trainer.main_ppo`) |
| Base model | Qwen2.5-7B-Instruct |
| Training batch size | 64 |
| PPO mini batch size | 64 |
| PPO micro batch size per GPU | 16 |
| Learning rate | 1e-6 |
| LR warmup steps | 80 |
| Total epochs | 1 |
| Max prompt length | 2048 tokens |
| Max response length | 1024 tokens |

### Rollout / Generation Settings

| Parameter | Value |
|-----------|-------|
| Rollout engine | vLLM |
| Samples per prompt (`n`) | 8 |
| Temperature | 0.9 |
| Top-k | disabled |
| dtype | bfloat16 |
| Tensor parallel size | 1 |
| GPU memory utilization | 0.4 |

### Optimization Details

| Parameter | Value |
|-----------|-------|
| Entropy coefficient | 0.001 |
| KL controller coefficient | 0.001 |
| KL loss | disabled |
| Gradient checkpointing | enabled |
| Torch compile | enabled |
| FSDP param offload | disabled |
| FSDP optimizer offload | disabled |

### Hardware

| Component | Value |
|-----------|-------|
| GPUs | 4 |
| Nodes | 1 |
| Precision | bfloat16 |

### Reward Function

Training uses a **custom batched reward function** combining several reward signals:

- **Outcome reward:** correct ICD-10 prediction
- **Format reward:** correct reasoning/diagnosis tag structure
- **Hierarchical reward:** partial credit for ICD prefix matches
- **Reasoning reward:** encourages meaningful reasoning traces
- **LLM-based reward:** optional external judge scoring

These rewards align the model toward producing **both accurate diagnoses and structured reasoning traces**. The reasoning trace provides transparency into how the diagnosis was derived from the clinical note.

---

## Evaluation

Evaluation follows the methodology described in the **DeepICD-R1 paper**.
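Both the hierarchical reward described above and the hierarchical evaluation rest on prefix matching between predicted and gold codes. A minimal sketch of the idea (the credit values and level boundaries here are illustrative assumptions, not the exact function from the paper):

```python
def hierarchical_credit(pred: str, gold: str) -> float:
    """Illustrative prefix-based partial credit; not the paper's exact reward."""
    # Normalize: strip dots and case so "M51.16" and "M5116" compare equal.
    pred = pred.replace(".", "").upper()
    gold = gold.replace(".", "").upper()
    if not pred or not gold:
        return 0.0
    if pred == gold:
        return 1.0   # exact full-code match
    if pred[:3] == gold[:3]:
        return 0.5   # same three-character category
    if pred[0] == gold[0]:
        return 0.25  # same leading letter (rough chapter proxy)
    return 0.0

print(hierarchical_credit("M51.16", "M5116"))  # exact match after normalization
print(hierarchical_credit("M51.26", "M5116"))  # partial credit: same category M51
```

Note that real ICD-10 chapters are letter ranges rather than single letters, so the leading-letter comparison is only a coarse approximation.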
Performance is measured using **macro-averaged F1 scores** at multiple levels of the ICD hierarchy.

| Level | Description |
|-------|-------------|
| Chapter | Broad ICD category |
| Category | First three characters |
| Full code | Complete ICD-10 code |

Hierarchical evaluation allows partial credit when the model predicts the correct high-level diagnostic category even if the full code is incorrect.

---

## Limitations

Models following the **DeepICD-R1 framework** share several limitations.

### Dataset limitations

- Training data consists primarily of **English clinical notes**
- Distribution reflects **hospital-specific patient populations**
- ICD labels are **highly imbalanced**, affecting rare diagnoses

### Model limitations

- Reasoning traces may appear convincing while being incorrect
- Predictions may fail for rare or long-tail diagnoses
- Models may demonstrate **premature diagnostic closure**
- Reinforcement learning rewards are only proxies for expert feedback

---

## Ethical Considerations

This model is trained on **de-identified clinical data** and is intended strictly for research.

### Potential risks

- propagation of dataset biases
- overconfidence in generated reasoning
- misuse in clinical decision making

### Appropriate safeguards

- expert oversight
- dataset bias evaluation
- fairness audits
- controlled deployment environments

---

## Hardware and Training Setup

Typical training configuration for models in this family includes:

- **GPUs:** multi-GPU training (4–8 GPUs)
- **Precision:** bfloat16
- **Rollout engine:** vLLM
- **Training framework:** VERL PPO / GRPO trainer
- **Sampling:** multiple rollouts per prompt

---

## Usage

### Transformers Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "DATEXIS/DeepICD-R1-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto"
)

prompt = """
You are a clinical reasoning model.
Given the following admission note, produce your reasoning and a final ICD-10 diagnosis in the tagged format used during training.

[ADMISSION NOTE]
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Recommended Inference Practices

- Use prompts consistent with the training format.
- Validate predicted ICD-10 codes against official code formats.
- Always review predictions with medical experts.
- Avoid exposing reasoning traces in safety-critical settings without verification.

---

## Citation

If you use this model, please cite:

```bibtex
@inproceedings{roehr2026deepicdr1,
  title={DeepICD-R1: Medical Reasoning through Hierarchical Rewards and Unsupervised Distillation},
  author={R{\"o}hr, Tom and Steffek, Thomas and Teucher, Roman and Bressem, Keno and others},
  booktitle={Proceedings of LREC-COLING},
  year={2026}
}
```