# 🧠 NER-BERT-AI-Model-using-annotated-corpus-ner

A BERT-based Named Entity Recognition (NER) model fine-tuned on the Entity Annotated Corpus. It classifies tokens in text into predefined entity types such as Person (PER), Organization (ORG), and Location (LOC). This model is well-suited for information extraction, resume parsing, and chatbot applications.

---

## ✨ Model Highlights

- 📌 Based on `bert-base-cased` (by Google)  
- 🔍 Fine-tuned on the Entity Annotated Corpus (`ner_dataset.csv`)  
- ⚡ Supports prediction of 3 entity types: PER, ORG, LOC  
- 💾 Compatible with Hugging Face `pipeline()` for easy inference

---

## 🧠 Intended Uses

- Resume and document parsing  
- Chatbots and virtual assistants  
- Named entity tagging in structured documents  
- Search and information retrieval systems  
- News or content analysis  

---

## 🚫 Limitations

- Trained only on formal English text  
- May not generalize well to informal text or domain-specific jargon  
- Subword tokenization may split entities (e.g., "Cupertino" → "Cup", "##ert", "##ino"); see the sketch after this list  
- Limited to the entities available in the original dataset (PER, ORG, LOC only)  
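
The subword behaviour can be reproduced directly with the tokenizer. A minimal sketch (the exact pieces depend on the `bert-base-cased` vocabulary):

```python
from transformers import AutoTokenizer

# WordPiece tokenization: words outside the vocabulary are split into subwords
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(tokenizer.tokenize("Cupertino"))
# e.g. something like ['Cup', '##ert', '##ino']
```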

---

## 🏋️‍♂️ Training Details

| Field         | Value                        |
|---------------|------------------------------|
| Base Model    | `bert-base-cased`            |
| Dataset       | Entity Annotated Corpus      |
| Framework     | PyTorch with Transformers    |
| Epochs        | 3                            |
| Batch Size    | 16                           |
| Max Length    | 128 tokens                   |
| Optimizer     | AdamW                        |
| Loss          | CrossEntropyLoss (token-level) |
| Device        | Trained on CUDA-enabled GPU  |
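
The training script itself is not included in this card; the sketch below shows how the hyperparameters above could be wired together with the Hugging Face `Trainer`. Dataset loading and label alignment are omitted, and `train_dataset` / `eval_dataset` are placeholders:

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=7)

training_args = TrainingArguments(
    output_dir="./ner-bert",
    num_train_epochs=3,              # epochs from the table above
    per_device_train_batch_size=16,  # batch size
    # The 128-token max length is applied when tokenizing the dataset;
    # AdamW is the Trainer's default optimizer, and the model computes
    # token-level cross-entropy internally when labels are provided.
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # placeholder: tokenized split of ner_dataset.csv
    eval_dataset=eval_dataset,    # placeholder
)
trainer.train()
```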

---

## 📊 Evaluation Metrics

| Metric    | Score (%) |
|-----------|-----------|
| Precision | 83.15 |
| Recall    | 83.85 |
| F1-Score  | 83.50 |
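
The card does not state which tool produced these scores; entity-level NER metrics are commonly computed with `seqeval`, roughly as follows (the label sequences here are illustrative):

```python
from seqeval.metrics import f1_score, precision_score, recall_score

# One list of BIO tags per sentence (illustrative values)
y_true = [["B-PER", "I-PER", "O", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "O", "B-LOC"]]

print(f"Precision: {precision_score(y_true, y_pred) * 100:.2f}")
print(f"Recall:    {recall_score(y_true, y_pred) * 100:.2f}")
print(f"F1-Score:  {f1_score(y_true, y_pred) * 100:.2f}")
```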


---

## 🔎 Label Mapping

| Label ID | Entity Type |
|----------|--------------|
| 0        | O            |
| 1        | B-PER        |
| 2        | I-PER        |
| 3        | B-ORG        |
| 4        | I-ORG        |
| 5        | B-LOC        |
| 6        | I-LOC        |
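
The fine-tuned checkpoint should already carry this mapping in its config. When fine-tuning the base model yourself, the same mapping can be attached explicitly so that predictions come back with readable labels; a minimal sketch:

```python
from transformers import AutoModelForTokenClassification

id2label = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG", 5: "B-LOC", 6: "I-LOC"}
label2id = {label: idx for idx, label in id2label.items()}

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
)
```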

---

## 🚀 Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned model and tokenizer from the Hugging Face Hub
model_name = "AventIQ-AI/NER-BERT-AI-Model-using-annotated-corpus-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Token-classification pipeline for inference
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin"

ner_results = nlp(example)
print(ner_results)

```
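
By default the pipeline returns one prediction per subword token. Building on the snippet above, passing `aggregation_strategy="simple"` to `pipeline()` groups the pieces back into whole entity spans:

```python
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
print(nlp("My name is Wolfgang and I live in Berlin"))
# Each result is a grouped span, e.g. {'entity_group': 'PER', 'word': 'Wolfgang', ...}
```
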
## 🧩 Quantization
Post-training quantization can be applied using PyTorch to reduce model size and improve inference performance, especially on edge devices.
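
A minimal sketch of post-training dynamic quantization with PyTorch, assuming the `model` and `tokenizer` loaded in the Usage section (Linear layers are stored as int8; inference runs on CPU):

```python
import torch

# Dynamic quantization: Linear weights stored as int8,
# activations quantized on the fly at inference time
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    inputs = tokenizer("My name is Wolfgang and I live in Berlin", return_tensors="pt")
    logits = quantized_model(**inputs).logits
print(logits.shape)  # (batch, sequence_length, num_labels)
```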

## 🗂 Repository Structure

```
.
├── model/               # Trained model files
├── tokenizer_config/    # Tokenizer and vocab files
├── model.safetensors    # Model weights in safetensors format
└── README.md            # Model card
```

## 🤝 Contributing
We welcome feedback, bug reports, and improvements!
Feel free to open an issue or submit a pull request.