# Model Card for BioGPT-FineTuned-MedicalTextbooks-FP16
# Model Overview
This model is a fine-tuned and quantized version of the microsoft/biogpt model, specifically tailored for medical text understanding. It was fine-tuned on the dmedhi/medical-textbooks dataset from Hugging Face and subsequently quantized to FP16 (half-precision) to reduce memory usage and improve inference speed while maintaining accuracy. The model is designed for tasks like keyword extraction from medical texts and generative tasks in the biomedical domain.
# Model Details
```
Base Model: microsoft/biogpt
Fine-Tuning Dataset: dmedhi/medical-textbooks (15,970 rows)
Quantization: FP16 (half-precision) using PyTorch's .half() method
Model Type: Causal Language Model
Language: English
```
# Intended Use
This model is intended for:
- Keyword Extraction: Extracting relevant lines containing specific keywords (e.g., "anatomy") from medical textbooks, along with metadata like book names.
- Generative Tasks: Generating short explanations or summaries in the biomedical domain (e.g., answering questions like "What is anatomy?").
- Research and Education: Assisting researchers, students, and educators in exploring medical texts and generating insights.
# Out of Scope
- Real-time clinical decision-making or medical diagnosis (not evaluated for such tasks).
- Non-English text processing (not tested on other languages).
- Tasks requiring high precision in generative output without human oversight.
# Training Details
## Dataset
The model was fine-tuned on the dmedhi/medical-textbooks dataset, which contains excerpts from medical textbooks with two fields:
- **text:** The content of the excerpt.
- **book:** The name of the source book (e.g., "Gray's Anatomy").
## Dataset Splits
- Original split: train (15,970 rows).
- Custom splits: 80% train (12,776 rows), 20% validation (3,194 rows).
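A minimal sketch of how such a split can be produced with the `datasets` library (the seed below is an illustrative assumption, not necessarily the one used for the released model):
```
from datasets import load_dataset

# Load the single "train" split (15,970 rows)
dataset = load_dataset("dmedhi/medical-textbooks")

# Create custom 80/20 train/validation splits
splits = dataset["train"].train_test_split(test_size=0.2, seed=42)  # seed is illustrative
train_dataset = splits["train"]  # ~12,776 rows
val_dataset = splits["test"]     # ~3,194 rows
```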
## Training Procedure
### Preprocessing
- Tokenized the text field using the BioGPT tokenizer (microsoft/biogpt).
- Set max_length=512, with truncation and padding.
- Used input_ids as labels for causal language modeling.
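A sketch of this preprocessing step, assuming the standard `Dataset.map` workflow (the `train_dataset`/`val_dataset` names reuse the split sketch above):
```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")

def tokenize_function(examples):
    # Tokenize the "text" field with truncation and padding to 512 tokens
    tokens = tokenizer(
        examples["text"],
        max_length=512,
        truncation=True,
        padding="max_length",
    )
    # For causal language modeling, the labels are the input_ids themselves
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_train = train_dataset.map(tokenize_function, batched=True, remove_columns=["text", "book"])
tokenized_val = val_dataset.map(tokenize_function, batched=True, remove_columns=["text", "book"])
```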
### Fine-Tuning
- Fine-tuned microsoft/biogpt using Hugging Face's Trainer API.
```
Training arguments:
Epochs: 1
Batch size: 4 per device
Learning rate: 2e-5
Mixed precision: FP16 (fp16=True)
Evaluation strategy: steps (evaluate every 1,000 steps)
```
- Training loss decreased from 2.8409 to 2.7006 over 3,194 steps.
- Validation loss decreased from 2.7317 to 2.6512.
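The exact training script is not included in this card; a `Trainer` setup consistent with the arguments above might look like the following sketch (the output directory and any settings not listed above are assumptions):
```
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")

training_args = TrainingArguments(
    output_dir="./biogpt_finetuned",   # assumed output directory
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    fp16=True,
    evaluation_strategy="steps",
    eval_steps=1000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
)
trainer.train()
```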
### Quantization
- Converted the fine-tuned model to FP16 using PyTorch's .half() method.
- Saved as ./biogpt_finetuned/final_model_fp16.
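A sketch of the conversion and save step, assuming `model` and `tokenizer` are the fine-tuned objects from the previous step:
```
# Convert the fine-tuned weights to half precision (FP16) and save the result
model = model.half()
model.save_pretrained("./biogpt_finetuned/final_model_fp16")
tokenizer.save_pretrained("./biogpt_finetuned/final_model_fp16")
```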
## Compute Infrastructure
- Hardware: 12 GB GPU (NVIDIA)
- Environment: Jupyter Notebook on Windows
- Framework: PyTorch, Hugging Face Transformers
- Training Time: Approximately 27 minutes for 1 epoch
# Evaluation
## Metrics
```
Training Loss: Decreased from 2.8409 to 2.7006.
Validation Loss: Decreased from 2.7317 to 2.6512.
Memory Usage: Post-quantization memory usage reported as ~661 MB (FP16), though actual savings may vary due to buffers and non-weight tensors.
```
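For reference, the final validation loss of 2.6512 corresponds to a perplexity of roughly exp(2.6512) ≈ 14.2 on the held-out split.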
## Qualitative Testing
- **Generative Task:** Generated a response to "What is anatomy?" with reasonable output: "What is anatomy? Anatomy is the basis of medicine..."
- **Keyword Extraction:** Successfully extracted up to 10 lines containing keywords (e.g., "anatomy") with corresponding book names (e.g., "Gray's Anatomy").
# Usage
## Installation
Ensure you have the required libraries installed:
```
pip install transformers torch datasets sacremoses
```
## Loading the Model
Load the quantized FP16 model and tokenizer:
```
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_path = "path/to/biogpt_finetuned/final_model_fp16" # Update with your HF repo path
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
```
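Note: depending on your `transformers` version, `from_pretrained` may load the checkpoint in full precision by default; passing `torch_dtype=torch.float16` (or `torch_dtype="auto"`) keeps the weights in FP16.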
## Example 1: Generative Inference
Generate text with the quantized model:
```
input_text = "What is anatomy?"
inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True, max_length=512)
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
    outputs = model.generate(**inputs, max_length=50)
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(output_text)
```
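The `max_length=50` setting above is a conservative default; `generate` also accepts parameters such as `max_new_tokens`, `num_beams`, `do_sample`, and `temperature` if longer or more varied outputs are needed.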
## Example 2: Keyword Extraction
```
from datasets import load_from_disk

original_datasets = load_from_disk('path/to/original_medical_textbooks')

def extract_lines_with_keyword(keyword, dataset_split='train', max_results=10):
    dataset = original_datasets[dataset_split]
    matching_lines = []
    for entry in dataset:
        text = entry['text']
        book = entry['book']
        lines = text.split('\n')
        for line in lines:
            if keyword.lower() in line.lower():
                matching_lines.append({'text': line.strip(), 'book': book})
                if len(matching_lines) >= max_results:
                    return matching_lines
    return matching_lines

keyword = "anatomy"
matching_lines = extract_lines_with_keyword(keyword)
for i, match in enumerate(matching_lines, 1):
    print(f"{i}. Text: {match['text']}")
    print(f"   Book: {match['book']}\n")
```
# Limitations
- Quantization Trade-offs: FP16 quantization may lead to minor accuracy degradation, though this has not been extensively evaluated.
- Dataset Bias: Fine-tuned only on dmedhi/medical-textbooks, which may not cover all medical domains or topics.
- Generative Quality: Generative outputs may require human oversight for correctness.
- Scalability: Keyword extraction relies on string matching, not semantic understanding, limiting its ability to capture nuanced relationships.