# BERT-Based Named Entity Recognition (NER) Model
This repository contains a fine-tuned BERT-based model for Named Entity Recognition (NER) using the WNUT-17 dataset. The model is trained using the Hugging Face Transformers and Datasets libraries, and supports inference and quantization for deployment in resource-constrained environments.
---
## Model Details
- **Model Name:** BERT-Base-Cased NER
- **Model Architecture:** BERT Base
- **Task:** Named Entity Recognition (NER)
- **Dataset:** WNUT-17 (from Hugging Face Datasets)
- **Quantization:** Float16
- **Fine-tuning Framework:** Hugging Face Transformers
---
## Usage
### Installation
```bash
pip install transformers datasets evaluate seqeval scikit-learn torch
```
### Training the Model
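The training snippet below assumes that the model, tokenized datasets, data collator, training arguments, and metrics function have already been created. The full training script is not reproduced in this README; the following is a minimal setup sketch, assuming the standard `wnut_17` loader and the `bert-base-cased` checkpoint (the label-alignment step, `training_args`, and `compute_metrics` are sketched in the Fine-Tuning Details section below):

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
)

# WNUT-17 ships word-level tokens with integer NER tags
dataset = load_dataset("wnut_17")
label_list = dataset["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(label_list),
    id2label=dict(enumerate(label_list)),
    label2id={label: i for i, label in enumerate(label_list)},
)

# Pads input IDs and label IDs to a common length within each batch
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
```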
```python
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
```
### Saving the Model
```python
model.save_pretrained("./saved_model")
tokenizer.save_pretrained("./saved_model")
```
### Testing the Saved Model
```python
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained("./saved_model")
tokenizer = AutoTokenizer.from_pretrained("./saved_model")
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
sample_sentences = [
    "Barack Obama visited Microsoft headquarters in Redmond.",
    "Nancy Gautam lives in Faridabad and studies at J.C. Bose University.",
    "Google is launching a new AI product in California.",
]

for sentence in sample_sentences:
    print(f"Sentence: {sentence}")
    print(ner_pipeline(sentence))
```
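With `aggregation_strategy="simple"`, each pipeline call returns one dictionary per detected entity span. The entry below only illustrates the structure; the actual groups, scores, and offsets depend on the trained model:

```python
# Illustrative structure of ner_pipeline(sample_sentences[0]); values will vary per run
example_output = [
    {"entity_group": "person", "score": 0.97, "word": "Barack Obama", "start": 0, "end": 12},
    {"entity_group": "corporation", "score": 0.88, "word": "Microsoft", "start": 21, "end": 30},
    {"entity_group": "location", "score": 0.74, "word": "Redmond", "start": 47, "end": 54},
]
```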
### Quantizing the Model
```python
import torch

# Cast the fine-tuned model to half precision (float16) to reduce its memory footprint
device = "cuda" if torch.cuda.is_available() else "cpu"
quantized_model = model.to(dtype=torch.float16, device=device)

quantized_model.save_pretrained("quantized-model")
tokenizer.save_pretrained("quantized-model")
```
### Testing the Quantized Model
```python
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

# Reload the half-precision weights for inference
model = AutoModelForTokenClassification.from_pretrained("quantized-model", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("quantized-model")
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
```
---
## Performance Metrics
- **Accuracy:** Evaluated with `seqeval` on the validation split
- **Precision, Recall, F1 Score:** Computed from label-wise predictions, excluding positions marked with the ignore index (`-100`); see the metric-function sketch below
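The metric function itself is not reproduced above; the following is a typical `compute_metrics` sketch using the `evaluate` library's seqeval wrapper, assuming `label_list` is taken from the dataset features as in the setup sketch above:

```python
import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Keep only positions with a real label; -100 marks sub-word pieces and special tokens
    true_labels = [
        [label_list[l] for l in label_row if l != -100]
        for label_row in labels
    ]
    true_predictions = [
        [label_list[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
```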
---
## Fine-Tuning Details
### Dataset
The model was fine-tuned on the WNUT-17 dataset, a benchmark dataset for emerging and rare named entities. The preprocessing includes:
- Tokenization with the BERT (WordPiece) tokenizer
- Alignment of word-level labels to sub-word tokens (see the sketch after this list)
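The preprocessing function is not shown elsewhere in this README; a common sketch of the alignment step, assuming the word-level `tokens`/`ner_tags` columns of WNUT-17 and the `tokenizer` from the setup sketch, looks like this (only the first sub-word piece of a word keeps its label, the rest get `-100` so the loss ignores them):

```python
def tokenize_and_align_labels(examples):
    tokenized = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,
    )
    all_labels = []
    for i, word_labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word_id = None
        label_ids = []
        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)                  # special tokens ([CLS], [SEP])
            elif word_id != previous_word_id:
                label_ids.append(word_labels[word_id])  # first piece of a word keeps its label
            else:
                label_ids.append(-100)                  # remaining pieces are ignored by the loss
            previous_word_id = word_id
        all_labels.append(label_ids)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)
```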
### Training Configuration
- **Epochs:** 3
- **Batch Size:** 16
- **Learning Rate:** 2e-5
- **Max Length:** 128 tokens (handled by the tokenizer during preprocessing)
- **Evaluation Strategy:** Per epoch (a `TrainingArguments` sketch with these values follows this list)
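Expressed as `TrainingArguments`, this configuration corresponds roughly to the sketch below (the `output_dir` matches the `ner_output/` folder in the repository structure; arguments not listed above are assumed to stay at their defaults):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="ner_output",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    eval_strategy="epoch",  # `evaluation_strategy="epoch"` on older transformers releases
)
```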
### Quantization
The model was quantized using PyTorch's half-precision (float16) support to reduce memory footprint and inference time.
---
## Repository Structure
```
.
├── saved_model/       # Fine-tuned BERT model and tokenizer
├── quantized-model/   # Quantized model for deployment
├── ner_output/        # Training logs and checkpoints
└── README.md          # Documentation
```
---
## Limitations
- May not generalize well to entity types or domains outside the WNUT-17 data
- The float16 quantized model trades a small amount of accuracy for lower memory use and faster inference
---
## Contributing
Contributions are welcome! Please raise an issue or PR for improvements, bug fixes, or feature additions.
---