BERT-Based Named Entity Recognition (NER) Model
This repository contains a fine-tuned BERT-based model for Named Entity Recognition (NER) using the WNUT-17 dataset. The model is trained using the Hugging Face Transformers and Datasets libraries, and supports inference and quantization for deployment in resource-constrained environments.
Model Details
- Model Name: BERT-Base-Cased NER
- Model Architecture: BERT Base
- Task: Named Entity Recognition (NER)
- Dataset: WNUT-17 (from Hugging Face Datasets)
- Quantization: Float16
- Fine-tuning Framework: Hugging Face Transformers
Usage
Installation
pip install transformers datasets evaluate seqeval scikit-learn torch
Training the Model
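The Trainer snippet below assumes that the dataset, model, tokenizer, data collator, and training arguments already exist. A minimal setup sketch is given here; the checkpoint name bert-base-cased, the output directory, and the variable names are illustrative assumptions, the hyperparameters mirror the Training Configuration section below, and tokenize_and_align_labels and compute_metrics are sketched later in this README.

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
)

# Load WNUT-17 and a cased BERT checkpoint
raw_datasets = load_dataset("wnut_17")
label_list = raw_datasets["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(label_list)
)
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

# Tokenize and align labels (see the sketch under "Fine-Tuning Details")
tokenized_datasets = raw_datasets.map(tokenize_and_align_labels, batched=True)

# Hyperparameters as listed under "Training Configuration"
training_args = TrainingArguments(
    output_dir="ner_output",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
)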
from transformers import Trainer

# model, training_args, tokenized_datasets, tokenizer, data_collator, and
# compute_metrics are the objects created in the setup sketch above
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
Saving the Model
model.save_pretrained("./saved_model")
tokenizer.save_pretrained("./saved_model")
Testing the Saved Model
from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification
model = AutoModelForTokenClassification.from_pretrained("./saved_model")
tokenizer = AutoTokenizer.from_pretrained("./saved_model")
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
sample_sentences = [
    "Barack Obama visited Microsoft headquarters in Redmond.",
    "Nancy Gautam lives in Faridabad and studies at J.C. Bose University.",
    "Google is launching a new AI product in California."
]

for sentence in sample_sentences:
    print(f"Sentence: {sentence}")
    print(ner_pipeline(sentence))
Quantizing the Model
import torch

# Cast the fine-tuned model to half precision (float16)
device = "cuda" if torch.cuda.is_available() else "cpu"
quantized_model = model.to(dtype=torch.float16, device=device)

quantized_model.save_pretrained("quantized-model")
tokenizer.save_pretrained("quantized-model")
Testing the Quantized Model
model = AutoModelForTokenClassification.from_pretrained("quantized-model", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("quantized-model")
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
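The quantized pipeline is used exactly like the full-precision one; the sentence below is only an example, and float16 inference is typically run on a GPU.

print(ner_pipeline("Google is launching a new AI product in California."))
# Output is a list of dicts with keys such as 'entity_group', 'score', 'word', 'start', and 'end'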
Performance Metrics
- Accuracy: evaluated with seqeval on the validation split
- Precision, Recall, F1 Score: computed from label-wise predictions, excluding ignored (-100) indices; see the compute_metrics sketch below
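A minimal sketch of this computation, assuming the seqeval metric from the evaluate library and the label_list created during setup (names are illustrative):

import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Drop positions labelled -100 (special tokens and non-first wordpieces)
    true_labels = [
        [label_list[l] for l in label_row if l != -100]
        for label_row in labels
    ]
    true_predictions = [
        [label_list[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }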
Fine-Tuning Details
Dataset
The model was fine-tuned on the WNUT-17 dataset, a benchmark dataset for emerging and rare named entities. The preprocessing includes:
- Tokenization with the BERT tokenizer
- Label alignment for wordpiece tokens (a sketch of this step is shown below)
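A minimal sketch of the label-alignment step, following the standard Hugging Face token-classification recipe (function and variable names are illustrative):

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"],
        truncation=True,
        max_length=128,
        is_split_into_words=True,
    )

    labels = []
    for i, ner_tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = []
        previous_word_id = None
        for word_id in word_ids:
            if word_id is None:
                # Special tokens ([CLS], [SEP]) get the ignore index
                label_ids.append(-100)
            elif word_id != previous_word_id:
                # The first wordpiece of a word keeps the original label
                label_ids.append(ner_tags[word_id])
            else:
                # Remaining wordpieces of the same word are ignored
                label_ids.append(-100)
            previous_word_id = word_id
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs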
Training Configuration
- Epochs: 3
- Batch Size: 16
- Learning Rate: 2e-5
- Max Length: 128 tokens (handled via the tokenizer's truncation)
- Evaluation Strategy: Per epoch
Quantization
The model was quantized using PyTorch's half-precision (float16) support to reduce memory footprint and inference time.
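As a rough, illustrative way to compare memory footprints (the helper below is an assumption, not part of the repository; exact sizes depend on the saved format):

import os

def dir_size_mb(path):
    # Sum the sizes of all files directly inside the directory, in megabytes
    total = sum(
        os.path.getsize(os.path.join(path, name))
        for name in os.listdir(path)
        if os.path.isfile(os.path.join(path, name))
    )
    return total / (1024 * 1024)

print(f"Full-precision model: {dir_size_mb('saved_model'):.1f} MB")
print(f"Float16 model:        {dir_size_mb('quantized-model'):.1f} MB")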
Repository Structure
.
├── saved_model/        # Fine-Tuned BERT Model and Tokenizer
├── quantized-model/    # Quantized Model for Deployment
├── ner_output/         # Training Logs and Checkpoints
└── README.md           # Documentation
Limitations
- May not generalize well to domains beyond the emerging and rare entity types covered by WNUT-17
- The quantized (float16) model trades a small amount of accuracy for lower memory use and faster inference
Contributing
Contributions are welcome! Please raise an issue or PR for improvements, bug fixes, or feature additions.