# Model Card for BioGPT-FineTuned-MedicalTextbooks-FP16

## Model Overview
This model is a fine-tuned and quantized version of microsoft/biogpt, tailored for medical text understanding. It was fine-tuned on the dmedhi/medical-textbooks dataset from Hugging Face and then quantized to FP16 (half precision) to reduce memory usage and improve inference speed while largely preserving accuracy. The model is designed for tasks such as keyword extraction from medical texts and generative tasks in the biomedical domain.

## Model Details
```
Base Model: microsoft/biogpt
Fine-Tuning Dataset: dmedhi/medical-textbooks (15,970 rows)
Quantization: FP16 (half precision) using PyTorch's .half() method
Model Type: Causal Language Model
Language: English
```

## Intended Use
This model is intended for:

- Keyword Extraction: Extracting lines that contain specific keywords (e.g., "anatomy") from medical textbooks, along with metadata such as book names.
- Generative Tasks: Generating short explanations or summaries in the biomedical domain (e.g., answering questions like "What is anatomy?").
- Research and Education: Assisting researchers, students, and educators in exploring medical texts and generating insights.

## Out of Scope
- Real-time clinical decision-making or medical diagnosis (the model has not been evaluated for such tasks).
- Non-English text processing (not tested on other languages).
- Tasks requiring high precision in generative output without human oversight.

## Training Details

### Dataset
The model was fine-tuned on the dmedhi/medical-textbooks dataset, which contains excerpts from medical textbooks with two fields:

- **text:** The content of the excerpt.
- **book:** The name of the source book (e.g., "Gray's Anatomy").

### Dataset Splits
- Original split: train (15,970 rows).
- Custom splits: 80% train (12,776 rows), 20% validation (3,194 rows); see the sketch below.
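
The custom split can be reproduced with the `datasets` library. A minimal sketch, assuming a fixed seed (the exact seed used is not documented in this card):

```python
from datasets import load_dataset

# Load the original single-split dataset from the Hugging Face Hub.
raw = load_dataset("dmedhi/medical-textbooks")

# 80/20 train/validation split; seed=42 is an assumed value, chosen for reproducibility.
splits = raw["train"].train_test_split(test_size=0.2, seed=42)
train_ds, val_ds = splits["train"], splits["test"]
print(len(train_ds), len(val_ds))  # 12776, 3194
```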

### Training Procedure

#### Preprocessing
- Tokenized the text field using the BioGPT tokenizer (microsoft/biogpt).
- Set max_length=512, with truncation and padding.
- Used input_ids as labels for causal language modeling (see the sketch after this list).
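
A minimal sketch of this preprocessing step, continuing from the split sketch above (the function name and the `datasets.map` pattern are illustrative, not the author's exact code):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")

def tokenize_fn(batch):
    # Truncate/pad each excerpt to 512 tokens.
    enc = tokenizer(batch["text"], max_length=512, truncation=True, padding="max_length")
    # For causal language modeling, the labels are the input ids themselves.
    enc["labels"] = [ids.copy() for ids in enc["input_ids"]]
    return enc

tokenized_train = train_ds.map(tokenize_fn, batched=True, remove_columns=train_ds.column_names)
tokenized_val = val_ds.map(tokenize_fn, batched=True, remove_columns=val_ds.column_names)
```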

#### Fine-Tuning
Fine-tuned microsoft/biogpt using Hugging Face's Trainer API.
```
Training arguments:
Epochs: 1
Batch size: 4 per device
Learning rate: 2e-5
Mixed precision: FP16 (fp16=True)
Evaluation strategy: steps (every 1,000 steps)

Training loss decreased from 2.8409 to 2.7006 over 3,194 steps.
Validation loss decreased from 2.7317 to 2.6512.
```
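
A minimal sketch of the Trainer setup under the arguments listed above (the output directory is an assumption; the card only states the final save path):

```python
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")

args = TrainingArguments(
    output_dir="./biogpt_finetuned",  # assumed output directory
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    fp16=True,                        # mixed-precision training
    evaluation_strategy="steps",
    eval_steps=1000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,  # from the preprocessing sketch above
    eval_dataset=tokenized_val,
)
trainer.train()
```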

#### Quantization
- Converted the fine-tuned model to FP16 using PyTorch's .half() method.
- Saved the result as ./biogpt_finetuned/final_model_fp16 (see the sketch below).
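
A minimal sketch of the conversion and save step (the path of the fine-tuned checkpoint is an assumption):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned checkpoint, cast all weights to half precision, and save.
model = AutoModelForCausalLM.from_pretrained("./biogpt_finetuned")  # assumed checkpoint path
model = model.half()  # FP16 conversion of parameters and buffers
model.save_pretrained("./biogpt_finetuned/final_model_fp16")

tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
tokenizer.save_pretrained("./biogpt_finetuned/final_model_fp16")
```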

### Compute Infrastructure
- Hardware: 12 GB NVIDIA GPU
- Environment: Jupyter Notebook on Windows
- Framework: PyTorch, Hugging Face Transformers
- Training Time: approximately 27 minutes for 1 epoch

## Evaluation

### Metrics
```
Training Loss: decreased from 2.8409 to 2.7006.
Validation Loss: decreased from 2.7317 to 2.6512.
Memory Usage: post-quantization weight memory reported as ~661 MB (FP16); actual savings may vary due to buffers and non-weight tensors.
```
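
The ~661 MB figure is consistent with summing parameter sizes at 2 bytes per FP16 weight. A minimal sketch of one way to measure it (not necessarily the author's exact method):

```python
import torch

def weight_memory_mb(model: torch.nn.Module) -> float:
    # Sum bytes over all parameters; FP16 parameters take 2 bytes each.
    return sum(p.numel() * p.element_size() for p in model.parameters()) / 1024**2

# After loading the FP16 model:
# print(f"{weight_memory_mb(model):.0f} MB")
```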

## Qualitative Testing
- **Generative Task:** Generated a reasonable response to "What is anatomy?": "What is anatomy? Anatomy is the basis of medicine..."
- **Keyword Extraction:** Extracted up to 10 lines containing a given keyword (e.g., "anatomy") together with the corresponding book names (e.g., "Gray's Anatomy").

## Usage

### Installation
Ensure the required libraries are installed:

```bash
pip install transformers torch datasets sacremoses
```

### Loading the Model
Load the quantized FP16 model and tokenizer:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = "path/to/biogpt_finetuned/final_model_fp16"  # update with your local path or HF repo ID

# Load the weights in the dtype they were saved in (FP16); without torch_dtype,
# transformers may upcast them to FP32 on load.
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# FP16 inference is intended for GPU; on CPU, consider calling model.float() instead.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
```

### Example 1: Generative Inference
Generate text with the quantized model:

```python
input_text = "What is anatomy?"
inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True, max_length=512)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(**inputs, max_length=50)

output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(output_text)
```

### Example 2: Keyword Extraction

```python
from datasets import load_from_disk

# Load the original (unsplit) dataset previously saved to disk.
original_datasets = load_from_disk('path/to/original_medical_textbooks')

def extract_lines_with_keyword(keyword, dataset_split='train', max_results=10):
    """Return up to max_results lines containing the keyword, with their book names."""
    dataset = original_datasets[dataset_split]
    matching_lines = []
    for entry in dataset:
        text = entry['text']
        book = entry['book']
        for line in text.split('\n'):
            if keyword.lower() in line.lower():
                matching_lines.append({'text': line.strip(), 'book': book})
                if len(matching_lines) >= max_results:
                    return matching_lines
    return matching_lines

keyword = "anatomy"
matching_lines = extract_lines_with_keyword(keyword)
for i, match in enumerate(matching_lines, 1):
    print(f"{i}. Text: {match['text']}")
    print(f"   Book: {match['book']}\n")
```

## Limitations
- Quantization Trade-offs: FP16 quantization may cause minor accuracy degradation; this has not been extensively evaluated.
- Dataset Bias: Fine-tuned only on dmedhi/medical-textbooks, which may not cover all medical domains or topics.
- Generative Quality: Generated outputs may require human oversight for correctness.
- Scalability: Keyword extraction relies on literal string matching rather than semantic understanding, which limits its ability to capture nuanced relationships.