# BERT-Based Named Entity Recognition (NER) Model
This repository contains a fine-tuned BERT-based model for Named Entity Recognition (NER) using the WNUT-17 dataset. The model is trained using the Hugging Face Transformers and Datasets libraries, and supports inference and quantization for deployment in resource-constrained environments.
---
## Model Details
- **Model Name:** BERT-Base-Cased NER
- **Model Architecture:** BERT Base
- **Task:** Named Entity Recognition (NER)
- **Dataset:** WNUT-17 (from Hugging Face Datasets)
- **Quantization:** Float16
- **Fine-tuning Framework:** Hugging Face Transformers
---
## Usage
### Installation
```bash
pip install transformers datasets evaluate seqeval scikit-learn torch
```
### Training the Model
```python
from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
```
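The `Trainer` call above assumes that `model`, `tokenizer`, `data_collator`, and the datasets were already created. A minimal setup sketch, assuming the `bert-base-cased` checkpoint and the `wnut_17` dataset id on the Hugging Face Hub (with `training_args`, `tokenized_datasets`, and `compute_metrics` sketched in the sections below), could look like this:
```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
)

# WNUT-17 provides "tokens" and integer "ner_tags" for each example
raw_datasets = load_dataset("wnut_17")
label_list = raw_datasets["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(label_list)
)

# Pads input ids and label ids dynamically to the longest sequence in each batch
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
```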
### Saving the Model
```python
model.save_pretrained("./saved_model")
tokenizer.save_pretrained("./saved_model")
```
### Testing the Saved Model
```python
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained("./saved_model")
tokenizer = AutoTokenizer.from_pretrained("./saved_model")
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
sample_sentences = [
    "Barack Obama visited Microsoft headquarters in Redmond.",
    "Nancy Gautam lives in Faridabad and studies at J.C. Bose University.",
    "Google is launching a new AI product in California."
]
for sentence in sample_sentences:
    print(f"Sentence: {sentence}")
    print(ner_pipeline(sentence))
```
### Quantizing the Model
```python
import torch
# Cast the fine-tuned weights to float16 (half precision) and move to GPU if available
quantized_model = model.to(dtype=torch.float16, device="cuda" if torch.cuda.is_available() else "cpu")
quantized_model.save_pretrained("quantized-model")
tokenizer.save_pretrained("quantized-model")
```
### Testing the Quantized Model
```python
model = AutoModelForTokenClassification.from_pretrained("quantized-model", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("quantized-model")
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
```
---
## Performance Metrics
- **Accuracy:** Computed with seqeval on the validation split
- **Precision, Recall, F1 Score:** Computed from label-wise predictions, with ignored indices (-100) excluded (see the sketch below)
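A `compute_metrics` function along these lines realizes this. This is a sketch that assumes `seqeval` is loaded through the `evaluate` library and that `label_list` holds the WNUT-17 label names (as in the setup sketch above):
```python
import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Drop positions labeled -100 (special tokens and non-initial wordpieces)
    true_labels = [
        [label_list[l] for l in label_row if l != -100]
        for label_row in labels
    ]
    true_predictions = [
        [label_list[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
```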
---
## Fine-Tuning Details
### Dataset
The model was fine-tuned on the WNUT-17 dataset, a benchmark dataset for emerging and rare named entities. The preprocessing includes:
- Tokenization using BERT tokenizer
- Label alignment for wordpiece tokens (see the sketch below)
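A typical alignment function labels only the first wordpiece of each word and masks the rest with `-100` so they are ignored by the loss and the metrics. The sketch below assumes the `tokenizer` and `raw_datasets` from the setup sketch above; the exact helper name in the training script may differ:
```python
def tokenize_and_align_labels(examples):
    tokenized = tokenizer(
        examples["tokens"],
        truncation=True,
        max_length=128,
        is_split_into_words=True,
    )
    all_labels = []
    for i, ner_tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        labels, previous_word = [], None
        for word_id in word_ids:
            if word_id is None:
                labels.append(-100)                # special tokens ([CLS], [SEP])
            elif word_id != previous_word:
                labels.append(ner_tags[word_id])   # first wordpiece keeps the word's label
            else:
                labels.append(-100)                # remaining wordpieces are ignored
            previous_word = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_datasets = raw_datasets.map(tokenize_and_align_labels, batched=True)
```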
### Training Configuration
- **Epochs:** 3
- **Batch Size:** 16
- **Learning Rate:** 2e-5
- **Max Length:** 128 tokens (implicitly handled by tokenizer)
- **Evaluation Strategy:** Per epoch (see the sketch below)
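These settings correspond to `TrainingArguments` along the lines of the following sketch; the `output_dir` matches the `ner_output/` directory in this repository, and the keyword for the evaluation schedule depends on the installed `transformers` version:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="ner_output",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    eval_strategy="epoch",  # "evaluation_strategy" on older transformers releases
)
```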
### Quantization
The model was quantized using PyTorch's half-precision (float16) support to reduce memory footprint and inference time.
---
## Repository Structure
```
.
├── saved_model/       # Fine-tuned BERT model and tokenizer
├── quantized-model/   # Quantized model for deployment
├── ner_output/        # Training logs and checkpoints
└── README.md          # Documentation
```
---
## Limitations
- May not generalize well to entity types or domains not covered by WNUT-17
- Float16 quantization may slightly reduce accuracy in exchange for a smaller memory footprint and faster inference
---
## Contributing
Contributions are welcome! Please raise an issue or PR for improvements, bug fixes, or feature additions.
---