
🧠 NER-BERT-AI-Model-using-annotated-corpus-ner

A BERT-based Named Entity Recognition (NER) model fine-tuned on the Entity Annotated Corpus. It classifies tokens in text into predefined entity types such as Person (PER), Organization (ORG), and Location (LOC). This model is well-suited for information extraction, resume parsing, and chatbot applications.


✨ Model Highlights

  • 📌 Based on bert-base-cased (by Google)
  • 🔍 Fine-tuned on the Entity Annotated Corpus (ner_dataset.csv)
  • ⚡ Supports prediction of 3 entity types: PER, ORG, LOC
  • 💾 Compatible with the Hugging Face pipeline() for easy inference

🧠 Intended Uses

  • Resume and document parsing
  • Chatbots and virtual assistants
  • Named entity tagging in structured documents
  • Search and information retrieval systems
  • News or content analysis

🚫 Limitations

  • Trained only on formal English text
  • May not generalize well to informal text or domain-specific jargon
  • Subword tokenization may split entities (e.g., "Cupertino" → "Cup", "##ert", "##ino"; see the snippet after this list)
  • Limited to the entity types in the original dataset (PER, ORG, LOC only)
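
A quick way to see this subword behavior (a minimal sketch; the exact word pieces depend on the bert-base-cased vocabulary):

```python
from transformers import AutoTokenizer

# Load the base tokenizer used by this model
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Rare proper nouns are often split into word pieces marked with "##",
# so downstream code must merge the pieces back into one entity span
print(tokenizer.tokenize("Cupertino"))
```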

πŸ‹οΈβ€β™‚οΈ Training Details

| Field      | Value                          |
|------------|--------------------------------|
| Base Model | bert-base-cased                |
| Dataset    | Entity Annotated Corpus        |
| Framework  | PyTorch with Transformers      |
| Epochs     | 3                              |
| Batch Size | 16                             |
| Max Length | 128 tokens                     |
| Optimizer  | AdamW                          |
| Loss       | CrossEntropyLoss (token-level) |
| Device     | CUDA-enabled GPU               |
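
For reference, these hyperparameters map onto a Hugging Face Trainer setup roughly as below (a minimal sketch, not the exact training script; dataset loading, tokenization to max_length=128, and label alignment are omitted, and the learning rate is an assumption):

```python
from transformers import (AutoModelForTokenClassification, Trainer,
                          TrainingArguments)

# 7 labels: O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=7)

training_args = TrainingArguments(
    output_dir="ner-bert",
    num_train_epochs=3,              # from the table above
    per_device_train_batch_size=16,  # from the table above
    learning_rate=5e-5,              # assumed value, not stated in the card
)

# Trainer uses AdamW and token-level cross-entropy loss by default
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=tokenized_train_dataset)
# trainer.train()
```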

📊 Evaluation Metrics

| Metric    | Score (%) |
|-----------|-----------|
| Precision | 83.15     |
| Recall    | 83.85     |
| F1-Score  | 83.50     |
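
These are standard entity-level NER scores; assuming they were computed with the common seqeval library, the calculation looks like this (toy label sequences for illustration, not the actual test set):

```python
from seqeval.metrics import f1_score, precision_score, recall_score

# Gold and predicted tag sequences in BIO format (illustrative only)
y_true = [["B-PER", "I-PER", "O", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC", "B-LOC"]]

print(precision_score(y_true, y_pred))  # entity-level precision
print(recall_score(y_true, y_pred))     # entity-level recall
print(f1_score(y_true, y_pred))         # entity-level F1
```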

🔎 Label Mapping

| Label ID | Entity Type |
|----------|-------------|
| 0        | O           |
| 1        | B-PER       |
| 2        | I-PER       |
| 3        | B-ORG       |
| 4        | I-ORG       |
| 5        | B-LOC       |
| 6        | I-LOC       |
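
In code, this mapping is just a dictionary from class index to BIO tag (a minimal sketch of decoding raw logits by hand; the pipeline in the Usage section below does this automatically):

```python
import torch

id2label = {0: "O", 1: "B-PER", 2: "I-PER",
            3: "B-ORG", 4: "I-ORG", 5: "B-LOC", 6: "I-LOC"}

def decode(logits: torch.Tensor) -> list:
    """Map token-classification logits of shape (1, seq_len, 7) to BIO tags."""
    ids = logits.argmax(dim=-1)[0]          # best label ID per token
    return [id2label[int(i)] for i in ids]  # IDs -> BIO tags
```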

πŸš€ Usage

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          pipeline)

model_name = "AventIQ-AI/NER-BERT-AI-Model-using-annotated-corpus-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# aggregation_strategy="simple" merges word pieces back into whole entities
nlp = pipeline("ner", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple")

example = "My name is Wolfgang and I live in Berlin"
ner_results = nlp(example)
print(ner_results)
```
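
With aggregation_strategy="simple", each item in ner_results is a dictionary describing one merged entity span (fields such as entity_group, score, word, start, and end), so "Wolfgang" should come back labeled PER and "Berlin" labeled LOC; exact scores will vary.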

🧩 Quantization

Post-training quantization can be applied using PyTorch to reduce model size and improve inference performance, especially on edge devices.
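
For example, PyTorch dynamic quantization swaps the model's Linear layers for int8 versions at load time (a minimal sketch for CPU inference; accuracy should be re-validated after quantizing):

```python
import torch
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "AventIQ-AI/NER-BERT-AI-Model-using-annotated-corpus-ner")

# Replace nn.Linear modules with dynamically quantized int8 equivalents
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)
```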

🗂 Repository Structure

```
.
├── model/               # Trained model files
├── tokenizer_config/    # Tokenizer and vocab files
├── model.safetensors    # Model weights in safetensors format
└── README.md            # Model card
```

🤝 Contributing

We welcome feedback, bug reports, and improvements! Feel free to open an issue or submit a pull request.
