# NER-BERT-AI-Model-using-annotated-corpus-ner
A BERT-based Named Entity Recognition (NER) model fine-tuned on the Entity Annotated Corpus. It classifies tokens in text into three entity types: Person (PER), Organization (ORG), and Location (LOC). This model is well-suited for information extraction, resume parsing, and chatbot applications.
---
## Model Highlights
- Based on `bert-base-cased` (by Google)
- Fine-tuned on the Entity Annotated Corpus (`ner_dataset.csv`)
- Supports prediction of 3 entity types: PER, ORG, LOC
- Compatible with Hugging Face `pipeline()` for easy inference
---
## Intended Uses
- Resume and document parsing
- Chatbots and virtual assistants
- Named entity tagging in structured documents
- Search and information retrieval systems
- News or content analysis
---
## Limitations
- Trained only on English formal texts
- May not generalize well to informal text or domain-specific jargon
- Subword tokenization may split entities (e.g., "Cupertino" → "Cup", "##ert", "##ino"); see the tokenizer sketch after this list
- Limited to the entities available in the original dataset (PER, ORG, LOC only)
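
A minimal sketch of the subword-splitting behaviour, using the `bert-base-cased` tokenizer this model is based on. The exact sub-pieces depend on the tokenizer vocabulary, so the printed output is illustrative only:

```python
from transformers import AutoTokenizer

# Illustration only: WordPiece may split an out-of-vocabulary word into pieces.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(tokenizer.tokenize("Cupertino"))
# Vocabulary-dependent, e.g. ['Cup', '##ert', '##ino']
```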
---
## Training Details
| Field | Value |
|---------------|------------------------------|
| Base Model | `bert-base-cased` |
| Dataset | Entity Annotated Corpus |
| Framework | PyTorch with Transformers |
| Epochs | 3 |
| Batch Size | 16 |
| Max Length | 128 tokens |
| Optimizer | AdamW |
| Loss | CrossEntropyLoss (token-level) |
| Device | Trained on CUDA-enabled GPU |
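
As a rough illustration of how these settings fit together, here is a minimal, self-contained fine-tuning sketch. The single toy sentence, the learning rate, and the label-alignment strategy are assumptions; real training would batch `ner_dataset.csv` at the table's batch size of 16.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(LABELS)
)

# Toy stand-in for ner_dataset.csv: one pre-tokenized sentence with word-level tags.
words = ["Wolfgang", "lives", "in", "Berlin"]
tags  = ["B-PER",    "O",     "O",  "B-LOC"]

enc = tokenizer(words, is_split_into_words=True, truncation=True,
                max_length=128, return_tensors="pt")
# Align word-level labels to subword tokens; -100 is ignored by the loss.
labels = [-100 if wid is None else LABELS.index(tags[wid])
          for wid in enc.word_ids()]
enc["labels"] = torch.tensor([labels])

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # lr is an assumption
model.train()
for epoch in range(3):            # 3 epochs, as in the table
    optimizer.zero_grad()
    loss = model(**enc).loss      # token-level cross-entropy
    loss.backward()
    optimizer.step()
```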
---
## Evaluation Metrics
| Metric    | Score (%) |
|-----------|-------|
| Precision | 83.15 |
| Recall | 83.85 |
| F1-Score | 83.50 |
---
## Label Mapping
| Label ID | Entity Type |
|----------|--------------|
| 0 | O |
| 1 | B-PER |
| 2 | I-PER |
| 3 | B-ORG |
| 4 | I-ORG |
| 5 | B-LOC |
| 6 | I-LOC |
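
When this mapping is attached to the model config as `id2label`/`label2id`, the `pipeline()` call below returns readable tags such as `B-PER` rather than generic label names. A small sketch mirroring the table (attaching the mapping at load time is an assumption about how the checkpoint was prepared):

```python
from transformers import AutoModelForTokenClassification

id2label = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG",
            4: "I-ORG", 5: "B-LOC", 6: "I-LOC"}
label2id = {v: k for k, v in id2label.items()}

# Attaching the mapping lets the NER pipeline emit "B-PER" instead of "LABEL_1".
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(id2label),
    id2label=id2label, label2id=label2id
)
```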
---
## Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "AventIQ-AI/NER-BERT-AI-Model-using-annotated-corpus-ner"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Token-classification pipeline: returns one prediction per subword token
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

example = "My name is Wolfgang and I live in Berlin"
ner_results = nlp(example)
print(ner_results)
```
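
The pipeline above reports one entry per subword token. Continuing from that snippet, the pipeline's `aggregation_strategy="simple"` option groups subword pieces back into whole-word entity spans, which addresses the subword-splitting limitation noted earlier:

```python
# Continuing from the snippet above: group subword predictions into entity spans.
nlp_grouped = pipeline("ner", model=model, tokenizer=tokenizer,
                       aggregation_strategy="simple")
print(nlp_grouped("My name is Wolfgang and I live in Berlin"))
# e.g. [{'entity_group': 'PER', 'word': 'Wolfgang', ...},
#       {'entity_group': 'LOC', 'word': 'Berlin', ...}]
```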
## Quantization
Post-training quantization can be applied using PyTorch to reduce model size and improve inference performance, especially on edge devices.
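
A minimal sketch of dynamic post-training quantization with PyTorch, assuming CPU inference; the output filename is illustrative:

```python
import torch
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "AventIQ-AI/NER-BERT-AI-Model-using-annotated-corpus-ner"
)

# Quantize the Linear layers to int8 for smaller size and faster CPU inference.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
torch.save(quantized_model.state_dict(), "ner_bert_quantized.pt")
```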
## Repository Structure
```
.
├── model/               # Trained model files
├── tokenizer_config/    # Tokenizer and vocab files
├── model.safetensors    # Model weights in safetensors format
└── README.md            # Model card
```
## Contributing
We welcome feedback, bug reports, and improvements!
Feel free to open an issue or submit a pull request.