|
|
|
# BERT-Based Named Entity Recognition (NER) Model |
|
|
|
This repository contains a fine-tuned BERT-based model for Named Entity Recognition (NER) using the WNUT-17 dataset. The model is trained using the Hugging Face Transformers and Datasets libraries, and supports inference and quantization for deployment in resource-constrained environments. |
|
|
|
--- |
|
|
|
## Model Details |
|
|
|
- **Model Name:** BERT-Base-Cased NER |
|
- **Model Architecture:** BERT Base |
|
- **Task:** Named Entity Recognition (NER) |
|
- **Dataset:** WNUT-17 (from Hugging Face Datasets) |
|
- **Quantization:** Float16 |
|
- **Fine-tuning Framework:** Hugging Face Transformers |
|
|
|
--- |
|
|
|
## Usage |
|
|
|
### Installation |
|
|
|
```bash
pip install transformers datasets evaluate seqeval scikit-learn torch
```
|
|
|
### Training the Model |
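
Before calling `Trainer`, the dataset, model, and metric plumbing must be in place. Below is a minimal setup sketch consistent with the Fine-Tuning Details section later in this README; helper names such as `tokenize_and_align_labels` and `label_names` are illustrative, not part of this repository.

```python
import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
)

# Load WNUT-17 and look up its label names
raw_datasets = load_dataset("wnut_17")
label_names = raw_datasets["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(label_names),
    id2label=dict(enumerate(label_names)),
    label2id={name: i for i, name in enumerate(label_names)},
)

def tokenize_and_align_labels(examples):
    # Tokenize pre-split words; wordpiece continuations and special
    # tokens receive the ignore index -100 so the loss skips them.
    tokenized = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, word_labels in enumerate(examples["ner_tags"]):
        previous_word = None
        aligned = []
        for word_id in tokenized.word_ids(batch_index=i):
            if word_id is None or word_id == previous_word:
                aligned.append(-100)
            else:
                aligned.append(word_labels[word_id])
            previous_word = word_id
        all_labels.append(aligned)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_datasets = raw_datasets.map(tokenize_and_align_labels, batched=True)
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # Drop ignored positions (-100) before span-level scoring with seqeval
    true_labels = [[label_names[l] for l in row if l != -100] for row in labels]
    true_preds = [
        [label_names[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]
    results = seqeval.compute(predictions=true_preds, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

# Hyperparameters from the Training Configuration section below
training_args = TrainingArguments(
    output_dir="ner_output",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
)
```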
|
|
|
```python
from transformers import Trainer

# model, training_args, tokenized_datasets, tokenizer, data_collator,
# and compute_metrics are defined in the setup sketch above.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
```
|
|
|
### Saving the Model |
|
|
|
```python
model.save_pretrained("./saved_model")
tokenizer.save_pretrained("./saved_model")
```
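
This typically writes `config.json` and the model weights to `./saved_model`, along with the tokenizer files (`vocab.txt`, `tokenizer_config.json`, `special_tokens_map.json`).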
|
|
|
### Testing the Saved Model |
|
|
|
```python
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("./saved_model")
tokenizer = AutoTokenizer.from_pretrained("./saved_model")

# aggregation_strategy="simple" merges wordpiece predictions into whole-entity spans
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

sample_sentences = [
    "Barack Obama visited Microsoft headquarters in Redmond.",
    "Nancy Gautam lives in Faridabad and studies at J.C. Bose University.",
    "Google is launching a new AI product in California."
]

for sentence in sample_sentences:
    print(f"Sentence: {sentence}")
    print(ner_pipeline(sentence))
```
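
With `aggregation_strategy="simple"`, each call returns a list of dictionaries, one per detected entity, containing `entity_group`, `score`, `word`, `start`, and `end` fields.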
|
|
|
### Quantizing the Model |
|
|
|
```python
import torch

# Cast the fine-tuned model's weights to float16 (half precision)
quantized_model = model.to(dtype=torch.float16, device="cuda" if torch.cuda.is_available() else "cpu")
quantized_model.save_pretrained("quantized-model")
tokenizer.save_pretrained("quantized-model")
```
|
|
|
### Testing the Quantized Model |
|
|
|
```python
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

# torch_dtype=torch.float16 loads the saved weights in half precision
model = AutoModelForTokenClassification.from_pretrained("quantized-model", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("quantized-model")
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
```
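
The half-precision pipeline is then used exactly like the full-precision one (note that float16 inference on CPU depends on the installed PyTorch version):

```python
print(ner_pipeline("Nancy Gautam lives in Faridabad and studies at J.C. Bose University."))
```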
|
|
|
--- |
|
|
|
## Performance Metrics |
|
|
|
- **Accuracy:** Evaluated with seqeval on the validation split

- **Precision, Recall, F1 Score:** Computed with seqeval from label-wise predictions, excluding ignored positions (label index `-100`); see the example below
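
As a reference point for how seqeval scores predictions, here is a toy example (entity labels follow the WNUT-17 `B-`/`I-` scheme; the values are illustrative, not the model's results):

```python
import evaluate

seqeval = evaluate.load("seqeval")
# seqeval scores at the entity-span level, not per token
results = seqeval.compute(
    predictions=[["B-person", "I-person", "O", "B-location"]],
    references=[["B-person", "I-person", "O", "O"]],
)
# precision 0.5 (1 of 2 predicted spans correct), recall 1.0, F1 ≈ 0.67
print(results["overall_precision"], results["overall_recall"], results["overall_f1"])
```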
|
|
|
--- |
|
|
|
## Fine-Tuning Details |
|
|
|
### Dataset |
|
|
|
The model was fine-tuned on the WNUT-17 dataset, a benchmark dataset for emerging and rare named entities. The preprocessing includes: |
|
- Tokenization using BERT tokenizer |
|
- Label alignment for wordpiece tokens (illustrated below)
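
To make the alignment step concrete, here is a small illustration (assuming the `bert-base-cased` fast tokenizer; the exact subword split depends on the vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
words = ["Nancy", "Gautam", "lives", "in", "Faridabad"]
encoding = tokenizer(words, is_split_into_words=True)

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
print(encoding.word_ids())
# word_ids() returns None for [CLS]/[SEP] and repeats an index for
# "##" continuation pieces; both kinds of positions receive the label
# -100, so the loss is computed once per original word.
```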
|
|
|
### Training Configuration |
|
|
|
- **Epochs:** 3 |
|
- **Batch Size:** 16 |
|
- **Learning Rate:** 2e-5 |
|
- **Max Length:** 128 tokens (implicitly handled by tokenizer) |
|
- **Evaluation Strategy:** Per epoch |
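
These values correspond to the `TrainingArguments` shown in the setup sketch under *Training the Model* above.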
|
|
|
### Quantization |
|
|
|
The model was quantized using PyTorch's half-precision (float16) support to reduce memory footprint and inference time. |
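
As a rough sanity check of the memory savings (a sketch, assuming `model` still holds the float32 fine-tuned weights):

```python
import torch

def param_megabytes(m: torch.nn.Module) -> float:
    # Parameter storage only; buffers and activations are extra
    return sum(p.numel() * p.element_size() for p in m.parameters()) / 1e6

print(param_megabytes(model))                    # float32 baseline
print(param_megabytes(model.to(torch.float16)))  # roughly half
```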
|
|
|
--- |
|
|
|
## Repository Structure |
|
|
|
```
.
├── saved_model/        # Fine-Tuned BERT Model and Tokenizer
├── quantized-model/    # Quantized Model for Deployment
├── ner_output/         # Training Logs and Checkpoints
└── README.md           # Documentation
```
|
|
|
--- |
|
|
|
## Limitations |
|
|
|
- May not generalize well to entity types or domains outside those covered by WNUT-17

- The quantized model trades a small amount of accuracy for lower memory use and faster inference
|
|
|
--- |
|
|
|
## Contributing |
|
|
|
Contributions are welcome! Please open an issue or submit a PR for improvements, bug fixes, or feature additions.
|
|
|
--- |
|
|