# BERT-Based Named Entity Recognition (NER) Model

This repository contains a fine-tuned BERT-based model for Named Entity Recognition (NER) using the WNUT-17 dataset. The model is trained using the Hugging Face Transformers and Datasets libraries, and supports inference and quantization for deployment in resource-constrained environments.

---

## Model Details

- **Model Name:** BERT-Base-Cased NER  
- **Model Architecture:** BERT Base  
- **Task:** Named Entity Recognition (NER)  
- **Dataset:** WNUT-17 (from Hugging Face Datasets)  
- **Quantization:** Float16  
- **Fine-tuning Framework:** Hugging Face Transformers

---

## Usage

### Installation

```bash
pip install transformers datasets evaluate seqeval scikit-learn torch
```

### Training the Model
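
Fine-tuning uses the Hugging Face `Trainer` API. The block below is a minimal setup sketch that defines the objects the `Trainer` call further down expects (`model`, `tokenizer`, `data_collator`, `tokenized_datasets`, `training_args`). It assumes the `bert-base-cased` checkpoint and the hyperparameters listed under Training Configuration; the `compute_metrics` helper is sketched under Performance Metrics below.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    TrainingArguments,
)

# Load WNUT-17 and the label names it defines.
raw_datasets = load_dataset("wnut_17")
label_names = raw_datasets["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(label_names),
    id2label=dict(enumerate(label_names)),
    label2id={name: i for i, name in enumerate(label_names)},
)
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

def tokenize_and_align_labels(examples):
    # Tokenize pre-split words and align the NER tags with the resulting
    # wordpieces; special tokens and sub-word continuations receive -100
    # so they are ignored by the loss and the metrics.
    tokenized = tokenizer(
        examples["tokens"],
        truncation=True,
        max_length=128,
        is_split_into_words=True,
    )
    all_labels = []
    for i, labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous = None
        aligned = []
        for word_id in word_ids:
            if word_id is None or word_id == previous:
                aligned.append(-100)
            else:
                aligned.append(labels[word_id])
            previous = word_id
        all_labels.append(aligned)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_datasets = raw_datasets.map(tokenize_and_align_labels, batched=True)

training_args = TrainingArguments(
    output_dir="ner_output",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    eval_strategy="epoch",  # use evaluation_strategy on older transformers releases
)
```

With these objects in place, fine-tuning runs through the standard `Trainer` loop: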

```python
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train()
```
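
After training, `trainer.evaluate()` recomputes the validation metrics on demand (the same precision, recall, F1, and accuracy reported after each epoch):

```python
metrics = trainer.evaluate()
print(metrics)
```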

### Saving the Model

```python
model.save_pretrained("./saved_model")
tokenizer.save_pretrained("./saved_model")
```

### Testing the Saved Model

```python
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("./saved_model")
tokenizer = AutoTokenizer.from_pretrained("./saved_model")
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

sample_sentences = [
    "Barack Obama visited Microsoft headquarters in Redmond.",
    "Nancy Gautam lives in Faridabad and studies at J.C. Bose University.",
    "Google is launching a new AI product in California."
]

for sentence in sample_sentences:
    print(f"Sentence: {sentence}")
    print(ner_pipeline(sentence))
```

### Quantizing the Model

```python
import torch

quantized_model = model.to(dtype=torch.float16, device="cuda" if torch.cuda.is_available() else "cpu")
quantized_model.save_pretrained("quantized-model")
tokenizer.save_pretrained("quantized-model")
```

### Testing the Quantized Model

```python
model = AutoModelForTokenClassification.from_pretrained("quantized-model", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("quantized-model")
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
```

---

## Performance Metrics

- **Accuracy:** Token-level accuracy computed with seqeval on the WNUT-17 validation split  
- **Precision, Recall, F1 Score:** Entity-level scores from seqeval, computed on the predicted labels with ignored positions (label `-100`) excluded (see the sketch below)
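
The `compute_metrics` helper passed to the `Trainer` can be sketched as follows; it assumes `label_names` from the training setup above and the `evaluate`/`seqeval` packages from the installation step, so the exact details may differ from the original training script:

```python
import numpy as np
import evaluate

seqeval_metric = evaluate.load("seqeval")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # Positions labelled -100 (special tokens and sub-word continuations)
    # are excluded from scoring.
    true_labels = [
        [label_names[l] for l in row if l != -100]
        for row in labels
    ]
    true_predictions = [
        [label_names[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]
    results = seqeval_metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
```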

---

## Fine-Tuning Details

### Dataset

The model was fine-tuned on WNUT-17, a benchmark dataset for emerging and rare named entities in noisy, user-generated text. The preprocessing includes:
- Tokenization with the BERT (`bert-base-cased`) tokenizer
- Label alignment for wordpiece tokens (implemented by the `tokenize_and_align_labels` helper in the training setup above)
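
For reference, each WNUT-17 record exposes pre-split words and integer tag ids; a quick inspection snippet (using the `wnut_17` dataset id from Hugging Face Datasets, as in the training setup above) shows the raw format and the label names:

```python
from datasets import load_dataset

wnut = load_dataset("wnut_17")
example = wnut["train"][0]
print(example["tokens"])    # pre-tokenized words
print(example["ner_tags"])  # integer label ids, one per word
print(wnut["train"].features["ner_tags"].feature.names)  # IOB2 label names, e.g. B-person, I-location
```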

### Training Configuration

- **Epochs:** 3  
- **Batch Size:** 16  
- **Learning Rate:** 2e-5  
- **Max Length:** 128 tokens (sequences truncated by the tokenizer)  
- **Evaluation Strategy:** Per epoch  

### Quantization

The model was quantized to half precision (float16) using PyTorch's native dtype casting, which roughly halves the parameter storage and memory footprint and can reduce inference time on hardware with float16 support.
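
As a rough sanity check of the memory savings, the on-disk size of the two exported directories can be compared; the float16 weights should come out at roughly half the size of the float32 ones (paths follow the repository layout below):

```python
import os

def dir_size_mb(path):
    # Total size of all files under a directory, in megabytes.
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, files in os.walk(path)
        for name in files
    ) / 1e6

print(f"saved_model:     {dir_size_mb('saved_model'):.1f} MB")
print(f"quantized-model: {dir_size_mb('quantized-model'):.1f} MB")
```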

---

## Repository Structure

```
.
├── saved_model/           # Fine-Tuned BERT Model and Tokenizer
├── quantized-model/       # Quantized Model for Deployment
├── ner_output/            # Training Logs and Checkpoints
└── README.md              # Documentation
```

---

## Limitations

- May not generalize well to text outside the social-media domain and the emerging entity types covered by WNUT-17  
- The float16 (quantized) model trades a possible small drop in accuracy for lower memory use and faster inference

---

## Contributing

Contributions are welcome! Please raise an issue or PR for improvements, bug fixes, or feature additions.

---