BERT-Based Named Entity Recognition (NER) Model

This repository contains a fine-tuned BERT-based model for Named Entity Recognition (NER) using the WNUT-17 dataset. The model is trained using the Hugging Face Transformers and Datasets libraries, and supports inference and quantization for deployment in resource-constrained environments.


Model Details

  • Model Name: BERT-Base-Cased NER
  • Model Architecture: BERT Base
  • Task: Named Entity Recognition (NER)
  • Dataset: WNUT-17 (from Hugging Face Datasets)
  • Quantization: Float16
  • Fine-tuning Framework: Hugging Face Transformers

Usage

Installation

pip install transformers datasets evaluate seqeval scikit-learn torch

Training the Model
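
The Trainer call below assumes that the dataset, tokenizer, model, data collator, training arguments, and metric function have already been prepared. A minimal setup sketch follows (assuming the bert-base-cased checkpoint and the wnut_17 dataset from the Hub; the variable names are chosen to match the call below and may differ from the original training script):

import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
)

wnut = load_dataset("wnut_17")
label_list = wnut["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_and_align_labels(examples):
    # Tokenize pre-split words and align each NER tag with the first wordpiece of its word;
    # special tokens and continuation pieces get -100 so the loss ignores them.
    tokenized = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word_id = None
        label_ids = []
        for word_id in word_ids:
            if word_id is None or word_id == previous_word_id:
                label_ids.append(-100)
            else:
                label_ids.append(labels[word_id])
            previous_word_id = word_id
        all_labels.append(label_ids)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_datasets = wnut.map(tokenize_and_align_labels, batched=True)

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(label_list),
    id2label=dict(enumerate(label_list)),
    label2id={label: i for i, label in enumerate(label_list)},
)
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_preds):
    # Drop ignored positions (-100), map ids back to tag names, and score with seqeval.
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    true_labels = [
        [label_list[l] for l in label_row if l != -100]
        for label_row in labels
    ]
    true_predictions = [
        [label_list[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

training_args = TrainingArguments(
    output_dir="ner_output",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",  # named "eval_strategy" on recent transformers releases
)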

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train()

Saving the Model

model.save_pretrained("./saved_model")
tokenizer.save_pretrained("./saved_model")

Testing the Saved Model

from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("./saved_model")
tokenizer = AutoTokenizer.from_pretrained("./saved_model")
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

sample_sentences = [
    "Barack Obama visited Microsoft headquarters in Redmond.",
    "Nancy Gautam lives in Faridabad and studies at J.C. Bose University.",
    "Google is launching a new AI product in California."
]

for sentence in sample_sentences:
    print(f"Sentence: {sentence}")
    print(ner_pipeline(sentence))
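
With aggregation_strategy="simple", each call returns a list of grouped entity spans; an entry looks roughly like the following (values are illustrative and depend on the trained weights):

# [{'entity_group': 'person', 'score': 0.99, 'word': 'Barack Obama', 'start': 0, 'end': 12}, ...]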

Quantizing the Model

import torch

# Cast the fine-tuned model to half precision (float16); .to() returns the same module with converted weights.
quantized_model = model.to(dtype=torch.float16, device="cuda" if torch.cuda.is_available() else "cpu")
quantized_model.save_pretrained("quantized-model")
tokenizer.save_pretrained("quantized-model")

Testing the Quantized Model

model = AutoModelForTokenClassification.from_pretrained("quantized-model", torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("quantized-model")
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

Performance Metrics

  • Accuracy: Overall token-level accuracy reported by seqeval on the validation split
  • Precision, Recall, F1 Score: Entity-level scores from seqeval, computed on predicted label sequences with ignored positions (label -100) filtered out; see the snippet below
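
For reference, a toy seqeval call of the following form (illustrative labels only) returns the overall fields used for these metrics:

import evaluate

seqeval = evaluate.load("seqeval")
predictions = [["O", "B-person", "I-person", "O"]]
references = [["O", "B-person", "I-person", "O"]]
results = seqeval.compute(predictions=predictions, references=references)
print(results["overall_precision"], results["overall_recall"],
      results["overall_f1"], results["overall_accuracy"])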

Fine-Tuning Details

Dataset

The model was fine-tuned on the WNUT-17 dataset, a benchmark for emerging and rare named entities. Preprocessing includes the following steps (illustrated in the short snippet after the list):

  • Tokenization using BERT tokenizer
  • Label alignment for wordpiece tokens
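
A quick, self-contained inspection snippet (hypothetical; not part of the original training script) shows why alignment is needed: wordpiece tokenization splits words into pieces, and word_ids() maps each piece back to its source word:

from datasets import load_dataset
from transformers import AutoTokenizer

wnut = load_dataset("wnut_17")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

example = wnut["train"][0]
encoding = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)

print(example["tokens"][:8])      # original whitespace-split words
print(encoding.tokens()[:12])     # wordpieces, including [CLS] and subword pieces
print(encoding.word_ids()[:12])   # source-word index per piece (None for special tokens)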

Training Configuration

  • Epochs: 3
  • Batch Size: 16
  • Learning Rate: 2e-5
  • Max Length: 128 tokens (implicitly handled by tokenizer)
  • Evaluation Strategy: Per epoch

Quantization

The model was converted to half precision (float16) using PyTorch to reduce memory footprint and speed up inference; this is weight casting rather than integer quantization.
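
As a rough sanity check of the memory claim, the parameter sizes of the two saved models can be compared (illustrative snippet; paths follow the repository structure below):

import torch
from transformers import AutoModelForTokenClassification

fp32_model = AutoModelForTokenClassification.from_pretrained("./saved_model")
fp16_model = AutoModelForTokenClassification.from_pretrained("quantized-model", torch_dtype=torch.float16)

def param_megabytes(m):
    # Total parameter storage in megabytes (element count times bytes per element).
    return sum(p.numel() * p.element_size() for p in m.parameters()) / 1e6

print(f"float32 parameters: {param_megabytes(fp32_model):.1f} MB")
print(f"float16 parameters: {param_megabytes(fp16_model):.1f} MB")  # roughly half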


Repository Structure

.
├── saved_model/           # Fine-tuned BERT model and tokenizer
├── quantized-model/       # Quantized (float16) model for deployment
├── ner_output/            # Training logs and checkpoints
└── README.md              # Documentation

Limitations

  • May not generalize well to domains outside the social-media text and emerging entities covered by WNUT-17
  • The float16 model trades a small amount of accuracy for lower memory use and faster inference

Contributing

Contributions are welcome! Please raise an issue or PR for improvements, bug fixes, or feature additions.