|
# Model Card for BioGPT-FineTuned-MedicalTextbooks-FP16 |
|
|
|
# Model Overview |
|
This model is a fine-tuned and quantized version of microsoft/biogpt, tailored for medical text understanding. It was fine-tuned on the dmedhi/medical-textbooks dataset from Hugging Face and subsequently quantized to FP16 (half-precision) to reduce memory usage and speed up inference, with minimal expected impact on accuracy. The model is designed for tasks such as keyword extraction from medical texts and generative tasks in the biomedical domain.
|
|
|
# Model Details |
|
```
Base Model: microsoft/biogpt
Fine-Tuning Dataset: dmedhi/medical-textbooks (15,970 rows)
Quantization: FP16 (half-precision) using PyTorch's .half() method
Model Type: Causal Language Model
Language: English
```
|
# Intended Use |
|
This model is intended for: |
|
|
|
- Keyword Extraction: Extracting relevant lines containing specific keywords (e.g., "anatomy") from medical textbooks, along with metadata like book names. |
|
- Generative Tasks: Generating short explanations or summaries in the biomedical domain (e.g., answering questions like "What is anatomy?"). |
|
- Research and Education: Assisting researchers, students, and educators in exploring medical texts and generating insights. |
|
# Out of Scope |
|
- Real-time clinical decision-making or medical diagnosis (not evaluated for such tasks). |
|
- Non-English text processing (not tested on other languages). |
|
- Tasks requiring high precision in generative output without human oversight. |
|
# Training Details |
|
## Dataset
|
The model was fine-tuned on the dmedhi/medical-textbooks dataset, which contains excerpts from medical textbooks with two attributes: |
|
|
|
**text:** The content of the excerpt. |
|
**book:** The name of the book (e.g., "Gray's Anatomy"). |
|
### Dataset Splits
|
- Original split: train (15,970 rows). |
|
- Custom splits: 80% train (12,776 rows), 20% validation (3,194 rows); see the split sketch below.
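
This split can be reproduced with the `datasets` library. The snippet below is a minimal sketch; the random seed is an assumption for illustration and was not stated in the original training run.

```
from datasets import load_dataset

# Load the original single-split dataset from the Hugging Face Hub.
dataset = load_dataset("dmedhi/medical-textbooks")

# 80/20 train/validation split; the seed value is assumed, not taken from the original run.
splits = dataset["train"].train_test_split(test_size=0.2, seed=42)
train_dataset = splits["train"]       # ~12,776 rows
validation_dataset = splits["test"]   # ~3,194 rows
```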
|
## Training Procedure
|
### Preprocessing
|
|
|
- Tokenized the text field using the BioGPT tokenizer (microsoft/biogpt). |
|
- Set max_length=512, with truncation and padding. |
|
- Used input_ids as labels for causal language modeling (see the sketch below).
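
A minimal sketch of this preprocessing, assuming the usual `datasets.map` pattern and the `train_dataset`/`validation_dataset` splits from the sketch above (batching and column removal are illustrative assumptions):

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")

def tokenize_function(examples):
    # Tokenize the excerpts to a fixed 512-token window with truncation and padding.
    tokenized = tokenizer(
        examples["text"],
        max_length=512,
        truncation=True,
        padding="max_length",
    )
    # For causal language modeling, the labels are a copy of the input_ids.
    tokenized["labels"] = tokenized["input_ids"].copy()
    return tokenized

tokenized_train = train_dataset.map(tokenize_function, batched=True, remove_columns=["text", "book"])
tokenized_validation = validation_dataset.map(tokenize_function, batched=True, remove_columns=["text", "book"])
```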
|
### Fine-Tuning
|
- Fine-tuned microsoft/biogpt using Hugging Face's Trainer API. |
|
```
Training arguments:
  Epochs: 1
  Batch size: 4 per device
  Learning rate: 2e-5
  Mixed precision: FP16 (fp16=True)
  Evaluation strategy: steps (every 1,000 steps)

Results:
  Training loss: decreased from 2.8409 to 2.7006 over 3,194 steps
  Validation loss: decreased from 2.7317 to 2.6512
```
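
A hedged sketch of the corresponding Trainer setup; the argument names follow the Transformers API, while values not listed above (output directory, tokenized dataset variable names) are assumptions:

```
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")

training_args = TrainingArguments(
    output_dir="./biogpt_finetuned",       # assumed output directory
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    fp16=True,                             # mixed-precision training
    evaluation_strategy="steps",           # "eval_strategy" in newer transformers releases
    eval_steps=1000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,         # tokenized 80% split
    eval_dataset=tokenized_validation,     # tokenized 20% split
)

trainer.train()
```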
|
### Quantization
|
- Converted the fine-tuned model to FP16 using PyTorch's .half() method. |
|
- Saved as ./biogpt_finetuned/final_model_fp16 (see the sketch below).
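
A sketch of this conversion step, assuming the fine-tuned model and tokenizer from the training run are still in memory:

```
# Cast all floating-point parameters to half precision (FP16).
model = model.half()

# Persist the FP16 model and tokenizer for later inference.
model.save_pretrained("./biogpt_finetuned/final_model_fp16")
tokenizer.save_pretrained("./biogpt_finetuned/final_model_fp16")
```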
|
## Compute Infrastructure
|
- Hardware: 12 GB GPU (NVIDIA) |
|
- Environment: Jupyter Notebook on Windows |
|
- Framework: PyTorch, Hugging Face Transformers |
|
- Training Time: Approximately 27 minutes for 1 epoch |
|
# Evaluation |
|
## Metrics
|
```
Training loss: decreased from 2.8409 to 2.7006
Validation loss: decreased from 2.7317 to 2.6512
Memory usage: ~661 MB post-quantization (FP16); actual savings may vary due to buffers and non-weight tensors
```
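
One way such a figure can be approximated is by summing the sizes of the model's parameters and buffers after loading the FP16 weights; this is a rough sketch and does not account for activation memory:

```
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "./biogpt_finetuned/final_model_fp16",
    torch_dtype=torch.float16,  # keep the weights in half precision
)

param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
buffer_bytes = sum(b.numel() * b.element_size() for b in model.buffers())
print(f"Weights: {param_bytes / 1024**2:.1f} MB, buffers: {buffer_bytes / 1024**2:.1f} MB")
```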
|
## Qualitative Testing
|
**Generative Task:** Generated a response to "What is anatomy?" with reasonable output: "What is anatomy? Anatomy is the basis of medicine..." |
|
**Keyword Extraction:** Successfully extracted up to 10 lines containing keywords (e.g., "anatomy") with corresponding book names (e.g., "Gray's Anatomy"). |
|
|
|
# Usage |
|
## Installation
|
- Ensure you have the required libraries installed: |
|
|
|
```
pip install transformers torch datasets sacremoses
```
|
## Loading the Model
|
- Load the quantized FP16 model and tokenizer: |
|
```
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = "path/to/biogpt_finetuned/final_model_fp16"  # Update with your HF repo path

# Load in half precision so the saved FP16 weights are not upcast to FP32.
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Move to GPU if available and switch to inference mode.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
```
|
## Example 1: Generative Inference
|
Generate text with the quantized model:
|
```
input_text = "What is anatomy?"
inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True, max_length=512)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Greedy decoding with a short generation budget of 50 tokens.
with torch.no_grad():
    outputs = model.generate(**inputs, max_length=50)

output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(output_text)
```
|
## Example 2: Keyword Extraction
|
```
from datasets import load_from_disk

# Load the original (untokenized) dataset previously saved to disk.
original_datasets = load_from_disk('path/to/original_medical_textbooks')

def extract_lines_with_keyword(keyword, dataset_split='train', max_results=10):
    """Return up to max_results lines containing the keyword, with their source book names."""
    dataset = original_datasets[dataset_split]
    matching_lines = []
    for entry in dataset:
        text = entry['text']
        book = entry['book']
        for line in text.split('\n'):
            if keyword.lower() in line.lower():
                matching_lines.append({'text': line.strip(), 'book': book})
                if len(matching_lines) >= max_results:
                    return matching_lines
    return matching_lines

keyword = "anatomy"
matching_lines = extract_lines_with_keyword(keyword)
for i, match in enumerate(matching_lines, 1):
    print(f"{i}. Text: {match['text']}")
    print(f"   Book: {match['book']}\n")
```
|
# Limitations |
|
- Quantization Trade-offs: FP16 quantization may cause minor accuracy degradation; this has not been extensively evaluated.
|
- Dataset Bias: Fine-tuned only on dmedhi/medical-textbooks, which may not cover all medical domains or topics. |
|
- Generative Quality: Generative outputs may require human oversight for correctness. |
|
- Keyword Matching: Keyword extraction relies on exact string matching rather than semantic understanding, so it cannot capture paraphrases or nuanced relationships.
|
|