# RoBERTa-Base Model for Named Entity Recognition (NER) on CoNLL-2003 Dataset

This repository hosts a fine-tuned version of the RoBERTa model for Named Entity Recognition (NER) using the CoNLL-2003 dataset. The model identifies and classifies named entities such as people (PER), organizations (ORG), locations (LOC), and miscellaneous entities (MISC).

## Model Details

- **Model Architecture:** RoBERTa Base  
- **Task:** Named Entity Recognition  
- **Dataset:** CoNLL-2003 (Hugging Face Datasets)  
- **Quantization:** Float16
- **Fine-tuning Framework:** Hugging Face Transformers  

---

## Installation

```bash
pip install datasets transformers seqeval torch --quiet
```

---

## Loading the Model


```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# Load the fine-tuned tokenizer and model (NER needs a token-classification head,
# not a sequence-classification one)
model_path = "quantized-model"  # directory with the fine-tuned artifacts (see Repository Structure)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForTokenClassification.from_pretrained(model_path)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# CoNLL-2003 label set, in the same order as the dataset's `ner_tags` feature
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# Define test sentences
sentences = [
    "Barack Obama was born in Hawaii.",
    "Elon Musk founded SpaceX and Tesla.",
    "Apple is headquartered in Cupertino, California."
]

for sentence in sentences:
    tokens = tokenizer(sentence, return_tensors="pt", truncation=True).to(device)
    with torch.no_grad():
        outputs = model(**tokens)
        logits = outputs.logits                    # (batch, seq_len, num_labels)
        predictions = torch.argmax(logits, dim=2)
    predicted_labels = predictions[0].cpu().numpy()
    tokens_decoded = tokenizer.convert_ids_to_tokens(tokens["input_ids"][0])
    print(f"Sentence: {sentence}")
    for token, label_id in zip(tokens_decoded, predicted_labels):
        label = label_list[label_id]
        token = token.lstrip("Ġ")  # strip RoBERTa's BPE space marker
        if label != "O":
            print(f"{token}: {label}")
    print("\n" + "-" * 50 + "\n")
```
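
For the first sentence, a well-trained checkpoint is expected to tag `Barack`/`Obama` as `B-PER`/`I-PER` and `Hawaii` as `B-LOC`; the other sentences should surface `B-ORG` for `SpaceX`, `Tesla`, and `Apple`, and `B-LOC` for `Cupertino` and `California`. Exact predictions depend on the checkpoint.

---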


## Performance Metrics

- **Accuracy:** 0.9921  
- **Precision:** 0.9466  
- **Recall:** 0.9589  
- **F1 Score:** 0.9527  
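
These are entity-level scores of the kind computed by `seqeval` (installed above). A minimal sketch of how such scores are derived from gold and predicted label sequences; the sequences below are illustrative, not drawn from the actual evaluation run:

```python
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score

# Illustrative gold and predicted label sequences (one inner list per sentence)
y_true = [["B-PER", "I-PER", "O", "O", "B-LOC", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O", "B-LOC", "O"]]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
print("accuracy: ", accuracy_score(y_true, y_pred))
```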

---

## Fine-Tuning Details

### Dataset

The dataset used is CoNLL-2003, which provides token-level labels for Named Entity Recognition (NER).
Entities fall into four classes: PER (person), ORG (organization), LOC (location), and MISC (miscellaneous).
Each token carries four annotations: the word itself, a part-of-speech tag, a syntactic chunk tag, and an NER tag.

The dataset is automatically loaded using the Hugging Face datasets library and is split into train, validation, and test sets.
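
A minimal sketch of loading the dataset and inspecting its label set (the label order below is what the `datasets` copy of CoNLL-2003 exposes):

```python
from datasets import load_dataset

# Newer `datasets` releases may resolve "conll2003" to the Hub copy (e.g. eriktks/conll2003)
dataset = load_dataset("conll2003")   # splits: train / validation / test
label_list = dataset["train"].features["ner_tags"].feature.names
print(label_list)
# ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
```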


### Training

- **Epochs:** 3 
- **Batch size:** 16 (train) / 16 (eval)  
- **Learning rate:** 2e-5  
- **Evaluation strategy:** `epoch`  
- **FP16 Training:** Enabled  
- **Trainer:** Hugging Face `Trainer` API (see the sketch below)  
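
A minimal sketch of how these hyperparameters map onto the `Trainer` API. The dataset variables (`tokenized_train`, `tokenized_val`), the `label_list`, and the output path are illustrative assumptions, not names from the actual training script:

```python
from transformers import (
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

model = AutoModelForTokenClassification.from_pretrained(
    "roberta-base", num_labels=len(label_list)
)

args = TrainingArguments(
    output_dir="roberta-ner-conll2003",  # illustrative path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    evaluation_strategy="epoch",         # renamed `eval_strategy` in recent transformers releases
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,       # assumed: tokenized CoNLL-2003 train split
    eval_dataset=tokenized_val,          # assumed: tokenized validation split
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```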

---

## Quantization

Post-training, the weights were cast to half precision with `model.to(dtype=torch.float16)`, roughly halving the size of the saved model and speeding up inference on float16-capable hardware.
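
A minimal sketch of that conversion, assuming the full-precision checkpoint lives in an illustrative `roberta-ner-conll2003` directory; the output path matches the `quantized-model/` directory in this repo:

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Load the full-precision fine-tuned checkpoint (path is illustrative)
model = AutoModelForTokenClassification.from_pretrained("roberta-ner-conll2003")
tokenizer = AutoTokenizer.from_pretrained("roberta-ner-conll2003")

# Cast weights to float16 and save alongside the tokenizer
model = model.to(dtype=torch.float16)
model.save_pretrained("quantized-model")
tokenizer.save_pretrained("quantized-model")
```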

---

## Repository Structure

```bash
.
β”œβ”€β”€ quantized-model/                            # Directory containing trained model artifacts
β”‚   β”œβ”€β”€ config.json
β”‚   β”œβ”€β”€ merges.txt
β”‚   β”œβ”€β”€ model.safetensors            # (May appear as 'model' in UI)
β”‚   β”œβ”€β”€ special_tokens_map.json
β”‚   β”œβ”€β”€ tokenizer.json
β”‚   β”œβ”€β”€ tokenizer_config.json
β”‚   └── vocab.json
β”œβ”€β”€ README.md
```

---

## Limitations

- The model is trained only on CoNLL-2003 and may not generalize well to other domains or entity schemes.
- Because RoBERTa tokenizes into subwords, token-label alignment can degrade on complex or ambiguous phrases.


## Contributing

Feel free to open issues or submit pull requests to improve the model, training process, or documentation.