# Model Card for BioGPT-FineTuned-MedicalTextbooks-FP16

## Model Overview

This model is a fine-tuned and quantized version of microsoft/biogpt, tailored for medical text understanding. It was fine-tuned on the dmedhi/medical-textbooks dataset from Hugging Face and subsequently quantized to FP16 (half precision) to reduce memory usage and improve inference speed, with minimal impact on accuracy (see Limitations). The model is designed for tasks such as keyword extraction from medical texts and generative tasks in the biomedical domain.
## Model Details

```
Base Model: microsoft/biogpt
Fine-Tuning Dataset: dmedhi/medical-textbooks (15,970 rows)
Quantization: FP16 (half precision) via PyTorch's .half() method
Model Type: Causal Language Model
Language: English
```
## Intended Use

This model is intended for:

- **Keyword Extraction:** Extracting relevant lines containing specific keywords (e.g., "anatomy") from medical textbooks, along with metadata such as book names.
- **Generative Tasks:** Generating short explanations or summaries in the biomedical domain (e.g., answering questions like "What is anatomy?").
- **Research and Education:** Assisting researchers, students, and educators in exploring medical texts and generating insights.
## Out of Scope

- Real-time clinical decision-making or medical diagnosis (not evaluated for such tasks).
- Non-English text processing (not tested on other languages).
- Tasks requiring high precision in generative output without human oversight.
## Training Details

### Dataset

The model was fine-tuned on the dmedhi/medical-textbooks dataset, which contains excerpts from medical textbooks with two fields:

- **text:** The content of the excerpt.
- **book:** The name of the source book (e.g., "Gray's Anatomy").

### Dataset Splits

- Original split: train (15,970 rows).
- Custom splits: 80% train (12,776 rows), 20% validation (3,194 rows); see the sketch below.
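The custom split can be reproduced with the `datasets` library. This is a minimal sketch; the `seed` value is an assumption, not the one used in the original run:

```
from datasets import load_dataset

# Load the single original "train" split from the Hugging Face Hub.
raw = load_dataset("dmedhi/medical-textbooks")

# Carve out 20% of the rows for validation; the seed is illustrative only.
splits = raw["train"].train_test_split(test_size=0.2, seed=42)
train_ds, val_ds = splits["train"], splits["test"]
```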
### Training Procedure

#### Preprocessing

- Tokenized the text field using the BioGPT tokenizer (microsoft/biogpt).
- Set max_length=512, with truncation and padding.
- Used input_ids as labels for causal language modeling (see the sketch after this list).
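A minimal sketch of this preprocessing step, assuming the `train_ds` and `val_ds` splits from above; the `tokenize` helper is illustrative:

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")

def tokenize(batch):
    # Truncate and pad every excerpt to a fixed length of 512 tokens.
    enc = tokenizer(batch["text"], max_length=512,
                    truncation=True, padding="max_length")
    # For causal language modeling, the labels are the input ids themselves.
    enc["labels"] = enc["input_ids"].copy()
    return enc

tokenized_train = train_ds.map(tokenize, batched=True,
                               remove_columns=train_ds.column_names)
tokenized_val = val_ds.map(tokenize, batched=True,
                           remove_columns=val_ds.column_names)
```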
#### Fine-Tuning

Fine-tuned microsoft/biogpt using Hugging Face's Trainer API (see the sketch below) with the following arguments:

```
Epochs: 1
Batch size: 4 per device
Learning rate: 2e-5
Mixed precision: FP16 (fp16=True)
Evaluation strategy: steps (every 1,000 steps)
```

Training loss decreased from 2.8409 to 2.7006 over 3,194 steps; validation loss decreased from 2.7317 to 2.6512.
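A minimal sketch of this setup, continuing from the preprocessing sketch; the output directory is an assumption:

```
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")

args = TrainingArguments(
    output_dir="./biogpt_finetuned",  # assumed output path
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    fp16=True,  # mixed-precision training
    eval_strategy="steps",  # named evaluation_strategy on older transformers versions
    eval_steps=1000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
)
trainer.train()
```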
#### Quantization

- Converted the fine-tuned model to FP16 using PyTorch's .half() method (see the sketch below).
- Saved as ./biogpt_finetuned/final_model_fp16.
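A minimal sketch of this conversion, continuing from the fine-tuning sketch:

```
# Cast all floating-point parameters and buffers to FP16 in place.
model = model.half()

# Persist the half-precision weights alongside the tokenizer files.
model.save_pretrained("./biogpt_finetuned/final_model_fp16")
tokenizer.save_pretrained("./biogpt_finetuned/final_model_fp16")
```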
### Compute Infrastructure

- Hardware: NVIDIA GPU with 12 GB of memory
- Environment: Jupyter Notebook on Windows
- Framework: PyTorch, Hugging Face Transformers
- Training Time: approximately 27 minutes for 1 epoch
## Evaluation

**Metrics**

```
Training loss: decreased from 2.8409 to 2.7006.
Validation loss: decreased from 2.7317 to 2.6512.
Memory usage: ~661 MB reported post-quantization (FP16); actual savings may vary due to buffers and non-weight tensors.
```
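The reported figure can be approximated by summing parameter sizes directly (a sketch; this counts parameters only, not buffers or activations):

```
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"Parameter memory: {param_bytes / 1024**2:.0f} MB")  # ~661 MB in FP16
```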
### Qualitative Testing

- **Generative Task:** Prompted with "What is anatomy?", the model produced a reasonable continuation: "What is anatomy? Anatomy is the basis of medicine..."
- **Keyword Extraction:** Successfully extracted up to 10 lines containing a given keyword (e.g., "anatomy") together with the corresponding book names (e.g., "Gray's Anatomy").
## Usage

**Installation**

Ensure the required libraries are installed:

```
pip install transformers torch datasets sacremoses
```
### Loading the Model

Load the quantized FP16 model and tokenizer:

```
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = "path/to/biogpt_finetuned/final_model_fp16"  # update with your HF repo path

# Pass torch_dtype so the checkpoint is loaded in half precision rather
# than upcast to FP32 (the default when no dtype is specified).
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# FP16 inference is intended for GPU; on CPU, consider model.float().
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
```
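A quick sanity check that the weights really are half precision:

```
print(next(model.parameters()).dtype)  # expected: torch.float16
```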
### Example 1: Generative Inference

Generate text with the quantized model:

```
input_text = "What is anatomy?"
inputs = tokenizer(input_text, return_tensors="pt",
                   padding=True, truncation=True, max_length=512)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    # max_length caps the prompt plus generated tokens combined.
    outputs = model.generate(**inputs, max_length=50)

output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(output_text)
```
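For more varied output, sampling can be enabled via standard generate arguments (the values below are illustrative, not tuned):

```
outputs = model.generate(**inputs, max_new_tokens=50,
                         do_sample=True, top_p=0.9, temperature=0.7)
```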
### Example 2: Keyword Extraction

Extract lines containing a keyword, along with their source books, from the original dataset:

```
from datasets import load_from_disk

# Load the original dataset, previously saved locally with save_to_disk().
original_datasets = load_from_disk('path/to/original_medical_textbooks')

def extract_lines_with_keyword(keyword, dataset_split='train', max_results=10):
    """Return up to max_results lines containing the keyword, with book names."""
    dataset = original_datasets[dataset_split]
    matching_lines = []
    for entry in dataset:
        text = entry['text']
        book = entry['book']
        for line in text.split('\n'):
            # Case-insensitive substring match.
            if keyword.lower() in line.lower():
                matching_lines.append({'text': line.strip(), 'book': book})
                if len(matching_lines) >= max_results:
                    return matching_lines
    return matching_lines

keyword = "anatomy"
matching_lines = extract_lines_with_keyword(keyword)
for i, match in enumerate(matching_lines, 1):
    print(f"{i}. Text: {match['text']}")
    print(f"   Book: {match['book']}\n")
```
## Limitations

- **Quantization Trade-offs:** FP16 quantization may lead to minor accuracy degradation, though this has not been extensively evaluated.
- **Dataset Bias:** Fine-tuned only on dmedhi/medical-textbooks, which may not cover all medical domains or topics.
- **Generative Quality:** Generative outputs may require human oversight for correctness.
- **Scalability:** Keyword extraction relies on string matching, not semantic understanding, limiting its ability to capture nuanced relationships.