# NER-BERT-AI-Model-using-annotated-corpus-ner
A BERT-based Named Entity Recognition (NER) model fine-tuned on the Entity Annotated Corpus. It classifies tokens in text into three entity types: Person (PER), Organization (ORG), and Location (LOC). This model is well-suited for information extraction, resume parsing, and chatbot applications.
---
## Model Highlights
- Based on `bert-base-cased` (by Google)
- Fine-tuned on the Entity Annotated Corpus (`ner_dataset.csv`)
- Supports prediction of 3 entity types: PER, ORG, LOC
- Compatible with Hugging Face `pipeline()` for easy inference
---
## Intended Uses
- Resume and document parsing
- Chatbots and virtual assistants
- Named entity tagging in structured documents
- Search and information retrieval systems
- News or content analysis
---
## Limitations
- Trained only on English formal texts
- May not generalize well to informal text or domain-specific jargon
- Subword tokenization may split entities (e.g., "Cupertino" → "Cup", "##ert", "##ino"); see the tokenizer sketch after this list
- Limited to the entities available in the original dataset (PER, ORG, LOC only)
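
A minimal sketch of the subword-splitting behaviour, using the `bert-base-cased` tokenizer this model is based on. The exact sub-pieces depend on the tokenizer vocabulary, so the printed output is illustrative only:

```python
from transformers import AutoTokenizer

# Illustration only: WordPiece may split an out-of-vocabulary word into pieces.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
print(tokenizer.tokenize("Cupertino"))
# Vocabulary-dependent, e.g. ['Cup', '##ert', '##ino']
```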
---
## Training Details
| Field | Value |
|---------------|------------------------------|
| Base Model | `bert-base-cased` |
| Dataset | Entity Annotated Corpus |
| Framework | PyTorch with Transformers |
| Epochs | 3 |
| Batch Size | 16 |
| Max Length | 128 tokens |
| Optimizer | AdamW |
| Loss | CrossEntropyLoss (token-level) |
| Device | Trained on CUDA-enabled GPU |
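
As a rough illustration of how these settings fit together, here is a minimal, self-contained fine-tuning sketch. The single toy sentence, the learning rate, and the label-alignment strategy are assumptions; real training would batch `ner_dataset.csv` at the table's batch size of 16.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(LABELS)
)

# Toy stand-in for ner_dataset.csv: one pre-tokenized sentence with word-level tags.
words = ["Wolfgang", "lives", "in", "Berlin"]
tags  = ["B-PER",    "O",     "O",  "B-LOC"]

enc = tokenizer(words, is_split_into_words=True, truncation=True,
                max_length=128, return_tensors="pt")
# Align word-level labels to subword tokens; -100 is ignored by the loss.
labels = [-100 if wid is None else LABELS.index(tags[wid])
          for wid in enc.word_ids()]
enc["labels"] = torch.tensor([labels])

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # lr is an assumption
model.train()
for epoch in range(3):            # 3 epochs, as in the table
    optimizer.zero_grad()
    loss = model(**enc).loss      # token-level cross-entropy
    loss.backward()
    optimizer.step()
```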
---
## Evaluation Metrics
| Metric    | Score (%) |
|-----------|-------|
| Precision | 83.15 |
| Recall | 83.85 |
| F1-Score | 83.50 |
---
## Label Mapping
| Label ID | Entity Type |
|----------|--------------|
| 0 | O |
| 1 | B-PER |
| 2 | I-PER |
| 3 | B-ORG |
| 4 | I-ORG |
| 5 | B-LOC |
| 6 | I-LOC |
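
When this mapping is attached to the model config as `id2label`/`label2id`, the `pipeline()` call below returns readable tags such as `B-PER` rather than generic label names. A small sketch mirroring the table (attaching the mapping at load time is an assumption about how the checkpoint was prepared):

```python
from transformers import AutoModelForTokenClassification

id2label = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG",
            4: "I-ORG", 5: "B-LOC", 6: "I-LOC"}
label2id = {v: k for k, v in id2label.items()}

# Attaching the mapping lets the NER pipeline emit "B-PER" instead of "LABEL_1".
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(id2label),
    id2label=id2label, label2id=label2id
)
```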
---
## Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "AventIQ-AI/NER-BERT-AI-Model-using-annotated-corpus-ner"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Token-classification pipeline: returns one prediction per subword token
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

example = "My name is Wolfgang and I live in Berlin"
ner_results = nlp(example)
print(ner_results)
```
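
The pipeline above reports one entry per subword token. Continuing from that snippet, the pipeline's `aggregation_strategy="simple"` option groups subword pieces back into whole-word entity spans, which addresses the subword-splitting limitation noted earlier:

```python
# Continuing from the snippet above: group subword predictions into entity spans.
nlp_grouped = pipeline("ner", model=model, tokenizer=tokenizer,
                       aggregation_strategy="simple")
print(nlp_grouped("My name is Wolfgang and I live in Berlin"))
# e.g. [{'entity_group': 'PER', 'word': 'Wolfgang', ...},
#       {'entity_group': 'LOC', 'word': 'Berlin', ...}]
```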
## Quantization
Post-training quantization can be applied using PyTorch to reduce model size and improve inference performance, especially on edge devices.
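
A minimal sketch of dynamic post-training quantization with PyTorch, assuming CPU inference; the output filename is illustrative:

```python
import torch
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "AventIQ-AI/NER-BERT-AI-Model-using-annotated-corpus-ner"
)

# Quantize the Linear layers to int8 for smaller size and faster CPU inference.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
torch.save(quantized_model.state_dict(), "ner_bert_quantized.pt")
```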
## Repository Structure
```
.
├── model/               # Trained model files
├── tokenizer_config/    # Tokenizer and vocab files
├── model.safetensors    # Model weights in safetensors format
└── README.md            # Model card
```
## Contributing
We welcome feedback, bug reports, and improvements!
Feel free to open an issue or submit a pull request.