BERT-Based Named Entity Recognition (NER) Model
This repository contains a fine-tuned BERT-based model for Named Entity Recognition (NER) using the WNUT-17 dataset. The model is trained using the Hugging Face Transformers and Datasets libraries, and supports inference and quantization for deployment in resource-constrained environments.
Model Details
- Model Name: BERT-Base-Cased NER
- Model Architecture: BERT Base
- Task: Named Entity Recognition (NER)
- Dataset: WNUT-17 (from Hugging Face Datasets)
- Quantization: Float16
- Fine-tuning Framework: Hugging Face Transformers
Usage
Installation
pip install transformers datasets evaluate seqeval scikit-learn torch
Training the Model
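The Trainer snippet below assumes that the dataset, model, tokenizer, data collator, and training arguments already exist. A minimal setup sketch is given here; the checkpoint name bert-base-cased, the output directory, and the variable names are illustrative assumptions, the hyperparameters mirror the Training Configuration section below, and tokenize_and_align_labels and compute_metrics are sketched later in this README.

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
)

# Load WNUT-17 and a cased BERT checkpoint
raw_datasets = load_dataset("wnut_17")
label_list = raw_datasets["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(label_list)
)
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

# Tokenize and align labels (see the sketch under "Fine-Tuning Details")
tokenized_datasets = raw_datasets.map(tokenize_and_align_labels, batched=True)

# Hyperparameters as listed under "Training Configuration"
training_args = TrainingArguments(
    output_dir="ner_output",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
)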
from transformers import Trainer

# model, training_args, tokenized_datasets, tokenizer, data_collator, and
# compute_metrics are the objects created in the setup sketch above
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
Saving the Model
model.save_pretrained("./saved_model")
tokenizer.save_pretrained("./saved_model")
Testing the Saved Model
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained("./saved_model")
tokenizer = AutoTokenizer.from_pretrained("./saved_model")
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
sample_sentences = [
    "Barack Obama visited Microsoft headquarters in Redmond.",
    "Nancy Gautam lives in Faridabad and studies at J.C. Bose University.",
    "Google is launching a new AI product in California."
]

for sentence in sample_sentences:
    print(f"Sentence: {sentence}")
    print(ner_pipeline(sentence))
Quantizing the Model
import torch

# Cast the fine-tuned model to half precision (float16)
device = "cuda" if torch.cuda.is_available() else "cpu"
quantized_model = model.to(dtype=torch.float16, device=device)

quantized_model.save_pretrained("quantized-model")
tokenizer.save_pretrained("quantized-model")
Testing the Quantized Model
model = AutoModelForTokenClassification.from_pretrained("quantized-model", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("quantized-model")
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
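The quantized pipeline is used exactly like the full-precision one; the sentence below is only an example, and float16 inference is typically run on a GPU.

print(ner_pipeline("Google is launching a new AI product in California."))
# Output is a list of dicts with keys such as 'entity_group', 'score', 'word', 'start', and 'end'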
Performance Metrics
- Accuracy: evaluated with seqeval on the validation split
- Precision, Recall, F1 Score: computed from label-wise predictions, excluding ignored (-100) indices; see the compute_metrics sketch below
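A minimal sketch of this computation, assuming the seqeval metric from the evaluate library and the label_list created during setup (names are illustrative):

import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Drop positions labelled -100 (special tokens and non-first wordpieces)
    true_labels = [
        [label_list[l] for l in label_row if l != -100]
        for label_row in labels
    ]
    true_predictions = [
        [label_list[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }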
Fine-Tuning Details
Dataset
The model was fine-tuned on the WNUT-17 dataset, a benchmark dataset for emerging and rare named entities. The preprocessing includes:
- Tokenization with the BERT tokenizer
- Label alignment for wordpiece tokens (a sketch of this step is shown below)
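A minimal sketch of the label-alignment step, following the standard Hugging Face token-classification recipe (function and variable names are illustrative):

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"],
        truncation=True,
        max_length=128,
        is_split_into_words=True,
    )

    labels = []
    for i, ner_tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = []
        previous_word_id = None
        for word_id in word_ids:
            if word_id is None:
                # Special tokens ([CLS], [SEP]) get the ignore index
                label_ids.append(-100)
            elif word_id != previous_word_id:
                # The first wordpiece of a word keeps the original label
                label_ids.append(ner_tags[word_id])
            else:
                # Remaining wordpieces of the same word are ignored
                label_ids.append(-100)
            previous_word_id = word_id
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs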
Training Configuration
- Epochs: 3
- Batch Size: 16
- Learning Rate: 2e-5
- Max Length: 128 tokens (handled via the tokenizer's truncation)
- Evaluation Strategy: Per epoch
Quantization
The model was quantized using PyTorch's half-precision (float16) support to reduce memory footprint and inference time.
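As a rough, illustrative way to compare memory footprints (the helper below is an assumption, not part of the repository; exact sizes depend on the saved format):

import os

def dir_size_mb(path):
    # Sum the sizes of all files directly inside the directory, in megabytes
    total = sum(
        os.path.getsize(os.path.join(path, name))
        for name in os.listdir(path)
        if os.path.isfile(os.path.join(path, name))
    )
    return total / (1024 * 1024)

print(f"Full-precision model: {dir_size_mb('saved_model'):.1f} MB")
print(f"Float16 model:        {dir_size_mb('quantized-model'):.1f} MB")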
Repository Structure
.
├── saved_model/        # Fine-Tuned BERT Model and Tokenizer
├── quantized-model/    # Quantized Model for Deployment
├── ner_output/         # Training Logs and Checkpoints
└── README.md           # Documentation
Limitations
- May not generalize well to domains beyond the emerging and rare entity types covered by WNUT-17
- The quantized (float16) model trades a small amount of accuracy for lower memory use and faster inference
Contributing
Contributions are welcome! Please raise an issue or PR for improvements, bug fixes, or feature additions.