|
# NER-BERT-AI-Model-using-annotated-corpus-ner
|
|
|
A BERT-based Named Entity Recognition (NER) model fine-tuned on the Entity Annotated Corpus. It classifies tokens in text into predefined entity types such as Person (PER), Organization (ORG), and Location (LOC). This model is well-suited for information extraction, resume parsing, and chatbot applications. |
|
|
|
--- |
|
|
|
## Model Highlights
|
|
|
- Based on `bert-base-cased` (by Google)

- Fine-tuned on the Entity Annotated Corpus (`ner_dataset.csv`)

- Supports prediction of three entity types: PER, ORG, and LOC

- Compatible with the Hugging Face `pipeline()` for easy inference
|
|
|
--- |
|
|
|
## Intended Uses
|
|
|
- Resume and document parsing |
|
- Chatbots and virtual assistants |
|
- Named entity tagging in structured documents |
|
- Search and information retrieval systems |
|
- News or content analysis |
|
|
|
--- |
|
|
|
## Limitations
|
|
|
- Trained only on formal English text

- May not generalize well to informal text or domain-specific jargon

- Subword tokenization may split entities (e.g., "Cupertino" → "Cup", "##ert", "##ino"); see the mitigation sketch after this list

- Limited to the entity types in the original dataset (PER, ORG, LOC only)
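
The subword-splitting issue can be mitigated at inference time by letting the pipeline regroup word pieces into whole entities. A minimal sketch, assuming the model is loaded as shown in the Usage section below:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "AventIQ-AI/NER-BERT-AI-Model-using-annotated-corpus-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# aggregation_strategy="simple" merges subword pieces ("Cup", "##ert", "##ino")
# back into a single entity span ("Cupertino") in the pipeline output.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
print(nlp("Apple is opening a new office in Cupertino."))
```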
|
|
|
--- |
|
|
|
## Training Details
|
|
|
| Field      | Value                          |
|------------|--------------------------------|
| Base Model | `bert-base-cased`              |
| Dataset    | Entity Annotated Corpus        |
| Framework  | PyTorch with Transformers      |
| Epochs     | 3                              |
| Batch Size | 16                             |
| Max Length | 128 tokens                     |
| Optimizer  | AdamW                          |
| Loss       | CrossEntropyLoss (token-level) |
| Device     | Trained on CUDA-enabled GPU    |
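
For reference, a minimal fine-tuning sketch matching the configuration above. Only the hyperparameters come from the table; the learning rate and the data-loading code (including BIO-label alignment from `ner_dataset.csv`) are illustrative assumptions:

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForTokenClassification

EPOCHS, BATCH_SIZE, MAX_LEN = 3, 16, 128  # from the table above

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=7)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Dummy single-sentence dataset so the sketch runs end to end; real training
# tokenizes ner_dataset.csv and aligns the BIO labels to word pieces.
enc = tokenizer(["Wolfgang lives in Berlin"], padding="max_length",
                truncation=True, max_length=MAX_LEN, return_tensors="pt")
enc["labels"] = torch.zeros_like(enc["input_ids"])
train_loader = DataLoader([{k: v[0] for k, v in enc.items()}], batch_size=BATCH_SIZE)

optimizer = AdamW(model.parameters(), lr=2e-5)  # learning rate is an assumption
model.train()
for epoch in range(EPOCHS):
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss  # token-level CrossEntropyLoss computed internally
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```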
|
|
|
--- |
|
|
|
## Evaluation Metrics
|
|
|
| Metric    | Score (%) |
|-----------|-----------|
| Precision | 83.15     |
| Recall    | 83.85     |
| F1-Score  | 83.50     |
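
As an illustration of how entity-level scores like these are typically computed, a sketch using the `seqeval` library with placeholder tag sequences (real evaluation would decode the model's predictions on a held-out split):

```python
from seqeval.metrics import precision_score, recall_score, f1_score

# Placeholder gold/predicted BIO tag sequences; in practice these come from
# mapping the model's label IDs back to tags for each validation sentence.
y_true = [["B-PER", "I-PER", "O", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "B-ORG", "B-LOC"]]

print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall:    {recall_score(y_true, y_pred):.4f}")
print(f"F1-Score:  {f1_score(y_true, y_pred):.4f}")
```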
|
|
|
|
|
--- |
|
|
|
## Label Mapping
|
|
|
| Label ID | Entity Type |
|----------|-------------|
| 0        | O           |
| 1        | B-PER       |
| 2        | I-PER       |
| 3        | B-ORG       |
| 4        | I-ORG       |
| 5        | B-LOC       |
| 6        | I-LOC       |
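
The same mapping expressed as the `id2label` / `label2id` dictionaries that a Transformers model config carries (a sketch mirroring the table above):

```python
id2label = {
    0: "O",
    1: "B-PER", 2: "I-PER",
    3: "B-ORG", 4: "I-ORG",
    5: "B-LOC", 6: "I-LOC",
}
label2id = {label: idx for idx, label in id2label.items()}
```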
|
|
|
--- |
|
|
|
## Usage
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "AventIQ-AI/NER-BERT-AI-Model-using-annotated-corpus-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin."

ner_results = nlp(example)
print(ner_results)
```
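
Each item in `ner_results` is a dict with fields such as `word`, `entity`, `score`, `start`, and `end`. To merge subword pieces into whole entity spans, pass `aggregation_strategy="simple"` to `pipeline()` (see the sketch in the Limitations section).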
|
## Quantization
|
Post-training quantization can be applied using PyTorch to reduce model size and improve inference performance, especially on edge devices. |
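
A minimal sketch of one common approach, dynamic quantization of the Linear layers to int8 with PyTorch (this illustrates the technique, not the exact recipe used for this repository):

```python
import torch
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "AventIQ-AI/NER-BERT-AI-Model-using-annotated-corpus-ner"
)

# Dynamic quantization stores Linear weights in int8 and dequantizes on the
# fly, which typically shrinks the model and speeds up CPU inference.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```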
|
|
|
## Repository Structure
|
|
|
```
.
├── model/               # Trained model files
├── tokenizer_config/    # Tokenizer and vocab files
├── model.safetensors    # Model weights in safetensors format
└── README.md            # Model card
```
|
## Contributing
|
We welcome feedback, bug reports, and improvements! |
|
Feel free to open an issue or submit a pull request. |