|
# NER-BERT-AI-Model-using-annotated-corpus-ner
|
|
|
A BERT-based Named Entity Recognition (NER) model fine-tuned on the Entity Annotated Corpus. It classifies tokens in text into predefined entity types such as Person (PER), Organization (ORG), and Location (LOC). This model is well-suited for information extraction, resume parsing, and chatbot applications. |
|
|
|
--- |
|
|
|
## Model Highlights
|
|
|
- Based on `bert-base-cased` (by Google)

- Fine-tuned on the Entity Annotated Corpus (`ner_dataset.csv`)

- Supports prediction of three entity types: PER, ORG, and LOC

- Compatible with the Hugging Face `pipeline()` for easy inference
|
|
|
--- |
|
|
|
## Intended Uses
|
|
|
- Resume and document parsing |
|
- Chatbots and virtual assistants |
|
- Named entity tagging in structured documents |
|
- Search and information retrieval systems |
|
- News or content analysis |
|
|
|
--- |
|
|
|
## Limitations
|
|
|
- Trained only on formal English text

- May not generalize well to informal text or domain-specific jargon

- Subword tokenization may split entities (e.g., "Cupertino" → "Cup", "##ert", "##ino"); see the mitigation sketch after this list

- Limited to the entity types in the original dataset (PER, ORG, LOC only)
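
The subword-splitting issue can be mitigated at inference time by letting the pipeline regroup word pieces into whole entities. A minimal sketch, assuming the model is loaded as shown in the Usage section below:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "AventIQ-AI/NER-BERT-AI-Model-using-annotated-corpus-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# aggregation_strategy="simple" merges subword pieces ("Cup", "##ert", "##ino")
# back into a single entity span ("Cupertino") in the pipeline output.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
print(nlp("Apple is opening a new office in Cupertino."))
```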
|
|
|
--- |
|
|
|
## Training Details
|
|
|
| Field      | Value                          |
|------------|--------------------------------|
| Base Model | `bert-base-cased`              |
| Dataset    | Entity Annotated Corpus        |
| Framework  | PyTorch with Transformers      |
| Epochs     | 3                              |
| Batch Size | 16                             |
| Max Length | 128 tokens                     |
| Optimizer  | AdamW                          |
| Loss       | CrossEntropyLoss (token-level) |
| Device     | Trained on CUDA-enabled GPU    |
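
For reference, a minimal fine-tuning sketch matching the configuration above. Only the hyperparameters come from the table; the learning rate and the data-loading code (including BIO-label alignment from `ner_dataset.csv`) are illustrative assumptions:

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForTokenClassification

EPOCHS, BATCH_SIZE, MAX_LEN = 3, 16, 128  # from the table above

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=7)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Dummy single-sentence dataset so the sketch runs end to end; real training
# tokenizes ner_dataset.csv and aligns the BIO labels to word pieces.
enc = tokenizer(["Wolfgang lives in Berlin"], padding="max_length",
                truncation=True, max_length=MAX_LEN, return_tensors="pt")
enc["labels"] = torch.zeros_like(enc["input_ids"])
train_loader = DataLoader([{k: v[0] for k, v in enc.items()}], batch_size=BATCH_SIZE)

optimizer = AdamW(model.parameters(), lr=2e-5)  # learning rate is an assumption
model.train()
for epoch in range(EPOCHS):
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss  # token-level CrossEntropyLoss computed internally
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```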
|
|
|
--- |
|
|
|
## Evaluation Metrics
|
|
|
| Metric    | Score (%) |
|-----------|-----------|
| Precision | 83.15     |
| Recall    | 83.85     |
| F1-Score  | 83.50     |
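
As an illustration of how entity-level scores like these are typically computed, a sketch using the `seqeval` library with placeholder tag sequences (real evaluation would decode the model's predictions on a held-out split):

```python
from seqeval.metrics import precision_score, recall_score, f1_score

# Placeholder gold/predicted BIO tag sequences; in practice these come from
# mapping the model's label IDs back to tags for each validation sentence.
y_true = [["B-PER", "I-PER", "O", "O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "B-ORG", "B-LOC"]]

print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall:    {recall_score(y_true, y_pred):.4f}")
print(f"F1-Score:  {f1_score(y_true, y_pred):.4f}")
```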
|
|
|
|
|
--- |
|
|
|
## Label Mapping
|
|
|
| Label ID | Entity Type |
|----------|-------------|
| 0        | O           |
| 1        | B-PER       |
| 2        | I-PER       |
| 3        | B-ORG       |
| 4        | I-ORG       |
| 5        | B-LOC       |
| 6        | I-LOC       |
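
The same mapping expressed as the `id2label` / `label2id` dictionaries that a Transformers model config carries (a sketch mirroring the table above):

```python
id2label = {
    0: "O",
    1: "B-PER", 2: "I-PER",
    3: "B-ORG", 4: "I-ORG",
    5: "B-LOC", 6: "I-LOC",
}
label2id = {label: idx for idx, label in id2label.items()}
```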
|
|
|
--- |
|
|
|
## Usage
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "AventIQ-AI/NER-BERT-AI-Model-using-annotated-corpus-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin."

ner_results = nlp(example)
print(ner_results)
```
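
Each item in `ner_results` is a dict with fields such as `word`, `entity`, `score`, `start`, and `end`. To merge subword pieces into whole entity spans, pass `aggregation_strategy="simple"` to `pipeline()` (see the sketch in the Limitations section).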
|
## Quantization
|
Post-training quantization can be applied using PyTorch to reduce model size and improve inference performance, especially on edge devices. |
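
A minimal sketch of one common approach, dynamic quantization of the Linear layers to int8 with PyTorch (this illustrates the technique, not the exact recipe used for this repository):

```python
import torch
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "AventIQ-AI/NER-BERT-AI-Model-using-annotated-corpus-ner"
)

# Dynamic quantization stores Linear weights in int8 and dequantizes on the
# fly, which typically shrinks the model and speeds up CPU inference.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```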
|
|
|
## Repository Structure
|
|
|
```
.
├── model/               # Trained model files
├── tokenizer_config/    # Tokenizer and vocab files
├── model.safetensors    # Model weights in safetensors format
└── README.md            # Model card
```
|
## Contributing
|
We welcome feedback, bug reports, and improvements! |
|
Feel free to open an issue or submit a pull request. |