|
# RoBERTa-Base Model for Named Entity Recognition (NER) on CoNLL-2003 Dataset |
|
|
|
This repository hosts a fine-tuned version of the RoBERTa-base model for Named Entity Recognition (NER) on the CoNLL-2003 dataset. The model identifies and classifies named entities such as persons, organizations, locations, and miscellaneous entities.
|
|
|
## Model Details |
|
|
|
- **Model Architecture:** RoBERTa Base |
|
- **Task:** Named Entity Recognition |
|
- **Dataset:** CoNLL-2003 (Hugging Face Datasets) |
|
- **Quantization:** Float16 |
|
- **Fine-tuning Framework:** Hugging Face Transformers |
|
|
|
--- |
|
|
|
## Installation |
|
|
|
```bash |
|
pip install datasets transformers seqeval torch --quiet |
|
``` |
|
|
|
--- |
|
|
|
## Loading the Model |
|
|
|
|
|
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# CoNLL-2003 label set, in the order used by the Hugging Face dataset
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# Load tokenizer and the fine-tuned token-classification model
model_path = "quantized-model"  # directory containing the fine-tuned artifacts
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForTokenClassification.from_pretrained(model_path).to(device)
model.eval()

# Define test sentences
sentences = [
    "Barack Obama was born in Hawaii.",
    "Elon Musk founded SpaceX and Tesla.",
    "Apple is headquartered in Cupertino, California."
]

for sentence in sentences:
    tokens = tokenizer(sentence, return_tensors="pt", truncation=True).to(device)
    with torch.no_grad():
        outputs = model(**tokens)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=2)
    predicted_labels = predictions[0].cpu().numpy()
    tokens_decoded = tokenizer.convert_ids_to_tokens(tokens["input_ids"][0])
    print(f"Sentence: {sentence}")
    for token, label_id in zip(tokens_decoded, predicted_labels):
        label = label_list[label_id]
        # Strip RoBERTa's "Ġ" space marker; skip special tokens such as <s> and </s>
        if token.startswith("Ġ") or not token.startswith("<"):
            token = token.replace("Ġ", "")
            if label != "O":
                print(f"{token}: {label}")
    print("\n" + "-" * 50 + "\n")
```
|
|
|
|
|
## Performance Metrics |
|
|
|
- **Accuracy:** 0.9921 |
|
- **Precision:** 0.9466 |
|
- **Recall:** 0.9589 |
|
- **F1 Score:** 0.9527 |
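
Scores of this kind are typically produced with `seqeval` (installed above), which evaluates precision, recall, and F1 at the entity-span level and accuracy at the token level. A minimal, self-contained illustration with made-up label sequences:

```python
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy example: seqeval scores whole entity spans, not individual tokens.
# These label sequences are illustrative only.
y_true = [["B-PER", "I-PER", "O", "B-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "B-ORG", "O"]]

print("Precision:", precision_score(y_true, y_pred))  # 0.5 (one of two predicted spans correct)
print("Recall:   ", recall_score(y_true, y_pred))     # 0.5 (one of two gold spans found)
print("F1:       ", f1_score(y_true, y_pred))         # 0.5
print("Accuracy: ", accuracy_score(y_true, y_pred))   # 0.8 (4 of 5 tokens correct)
```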
|
|
|
--- |
|
|
|
## Fine-Tuning Details |
|
|
|
### Dataset |
|
|
|
The dataset used is the CoNLL-2003 dataset, which contains labeled tokens for Named Entity Recognition (NER). |
|
Entities are categorized into classes such as PER (person), ORG (organization), LOC (location), and MISC (miscellaneous). |
|
It includes four columns: the word, part-of-speech tag, syntactic chunk tag, and NER tag. |
|
|
|
The dataset is automatically loaded using the Hugging Face datasets library and is split into train, validation, and test sets. |
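
The snippet below shows how the splits and the label inventory can be inspected, assuming the `conll2003` dataset id on the Hub:

```python
from datasets import load_dataset

# Download CoNLL-2003 and inspect the splits
dataset = load_dataset("conll2003")
print(dataset)  # DatasetDict with 'train', 'validation', and 'test' splits

# The NER label names, in the order used by the integer tags
label_list = dataset["train"].features["ner_tags"].feature.names
print(label_list)
# ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
```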
|
|
|
|
|
### Training |
|
|
|
- **Epochs:** 3 |
|
- **Batch size:** 16 (train) / 16 (eval) |
|
- **Learning rate:** 2e-5 |
|
- **Evaluation strategy:** `epoch` |
|
- **FP16 Training:** Enabled |
|
- **Trainer:** Hugging Face `Trainer` API (see the sketch below)
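
The following is a minimal sketch consistent with the settings above; the `tokenize_and_align_labels` helper is the standard pattern for aligning word-level NER tags to RoBERTa sub-words and is an assumption, not code taken verbatim from this repo:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("conll2003")
label_list = dataset["train"].features["ner_tags"].feature.names

# RoBERTa's tokenizer needs add_prefix_space=True for pre-split words
tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained(
    "roberta-base", num_labels=len(label_list)
)

def tokenize_and_align_labels(examples):
    # Align word-level NER tags to sub-word tokens; special tokens and
    # continuation sub-words get the ignore index -100.
    tokenized = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous, ids = None, []
        for word_id in word_ids:
            if word_id is None or word_id == previous:
                ids.append(-100)
            else:
                ids.append(tags[word_id])
            previous = word_id
        labels.append(ids)
    tokenized["labels"] = labels
    return tokenized

tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
)

trainer.train()
```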
|
|
|
--- |
|
|
|
## Quantization |
|
|
|
Post-training quantization was applied using `model.to(dtype=torch.float16)` to reduce model size and speed up inference. |
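
A minimal sketch of that step, assuming a fine-tuned checkpoint saved locally (the input path below is illustrative; the output directory matches `quantized-model/` in this repo):

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Load the fine-tuned full-precision checkpoint (this path is illustrative)
checkpoint = "path/to/fine-tuned-model"
model = AutoModelForTokenClassification.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Cast every weight to float16, roughly halving the model size
model = model.to(dtype=torch.float16)

# Save the quantized model and its tokenizer together
model.save_pretrained("quantized-model")
tokenizer.save_pretrained("quantized-model")
```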
|
|
|
--- |
|
|
|
## Repository Structure |
|
|
|
```bash
.
├── quantized-model/            # Directory containing trained model artifacts
│   ├── config.json
│   ├── merges.txt
│   ├── model.safetensors       # (May appear as 'model' in UI)
│   ├── special_tokens_map.json
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   └── vocab.json
└── README.md
```
|
|
|
--- |
|
|
|
## Limitations |
|
|
|
- The model is trained only on CoNLL-2003 and may not generalize to other domains or to entity types beyond PER, ORG, LOC, and MISC.

- Sub-word tokenization can cause token/label misalignment for complex or ambiguous phrases.
|
|
|
|
|
## Contributing |
|
|
|
Feel free to open issues or submit pull requests to improve the model, training process, or documentation. |
|
|
|
|