# 🧠 Keyphrase Extraction with BERT (Fine-Tuned on `midas/inspec`)

This repository contains a complete pipeline to **fine-tune BERT** for **Keyphrase Extraction** using the [`midas/inspec`](https://huggingface.co/datasets/midas/inspec) dataset. The model performs sequence labeling with BIO tags to extract meaningful phrases from scientific text.

---

## 🔧 Features

- ✅ Preprocessed dataset with BIO-tagged tokens
- ✅ Fine-tuning BERT (`bert-base-cased`) using Hugging Face Transformers
- ✅ Token-label alignment
- ✅ Evaluation using `seqeval` metrics (Precision, Recall, F1)
- ✅ Inference pipeline to extract keyphrases
- ✅ CUDA-enabled for GPU acceleration

---

## 📂 Dataset

**Source:** [`midas/inspec`](https://huggingface.co/datasets/midas/inspec)

- Fields:
  - `document`: List of tokenized words (already split)
  - `doc_bio_tags`: BIO-format labels for keyphrases
- Splits:
  - `train`: 1000 samples
  - `validation`: 500 samples
  - `test`: 500 samples
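
Loading the data and inspecting a sample looks roughly like this (a sketch; the `extraction` config name is an assumption, so check the dataset card if your version differs):

```python
from datasets import load_dataset

# "extraction" is assumed to be the token-level config of midas/inspec.
dataset = load_dataset("midas/inspec", "extraction")

sample = dataset["train"][0]
print(sample["document"][:8])      # first eight pre-tokenized words
print(sample["doc_bio_tags"][:8])  # matching "B" / "I" / "O" tags
```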

---

## 🚀 Setup & Installation

```bash
git clone https://github.com/your-username/keyphrase-bert-inspec
cd keyphrase-bert-inspec

pip install -r requirements.txt
```

### `requirements.txt`

```text
datasets
transformers
torch
evaluate
seqeval
```
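
`torch` is listed explicitly because the `Trainer` and the CUDA acceleration mentioned above depend on it. To confirm that a GPU is visible:

```python
import torch

print(torch.cuda.is_available())  # True when a CUDA GPU is visible to PyTorch
```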

---

## 🧪 Training

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, TrainingArguments, Trainer)
```

1. Load and preprocess data with aligned BIO labels (see the alignment sketch after this list)
2. Fine-tune `bert-base-cased` on the dataset
3. Evaluate and save model artifacts
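
Step 1 hinges on re-aligning the word-level BIO tags with BERT's sub-word tokens. A minimal sketch of that step, using the imports above and the `dataset` object from the loading example in the Dataset section (only the first sub-word of each word keeps a label; the rest get `-100` so the loss ignores them):

```python
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tag2id = {"O": 0, "B": 1, "I": 2}  # dataset tag -> label id

def tokenize_and_align_labels(examples):
    tokenized = tokenizer(examples["document"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, tags in enumerate(examples["doc_bio_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        labels, previous = [], None
        for word_id in word_ids:
            if word_id is None:           # special tokens ([CLS], [SEP])
                labels.append(-100)
            elif word_id != previous:     # first sub-word of a word
                labels.append(tag2id[tags[word_id]])
            else:                         # remaining sub-words of the same word
                labels.append(-100)
            previous = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)
```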

### Training Script Overview
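
The `Trainer` call below also expects a model, a data collator, training arguments, and a `compute_metrics` function (sketched in the Evaluation Metrics section). A rough setup, continuing from the code above; the `-KEY` label names and the hyperparameter values are illustrative assumptions, not the exact configuration behind the reported scores:

```python
# The dataset's plain "B"/"I"/"O" tags are exposed as "B-KEY"/"I-KEY"/"O" so the
# NER pipeline can group keyphrase spans at inference time.
label_list = ["O", "B-KEY", "I-KEY"]

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(label_list),
    id2label=dict(enumerate(label_list)),
    label2id={label: i for i, label in enumerate(label_list)},
)
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

training_args = TrainingArguments(
    output_dir="keyphrase-bert-inspec",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)
```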

```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.save_model("keyphrase-bert-inspec")
```

---

## 📊 Evaluation Metrics

```python
{
  "precision": 0.84,
  "recall": 0.81,
  "f1": 0.825,
  "accuracy": 0.88
}
```
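
These scores are reported by the `compute_metrics` function passed to the `Trainer`. A sketch of such a function, built on the `seqeval` metric from the `evaluate` library and the label order assumed in the training setup:

```python
import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")
label_list = ["O", "B-KEY", "I-KEY"]

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Keep only positions with a real label (drop the -100 sub-word/special-token slots).
    true_labels = [
        [label_list[l] for l in row if l != -100]
        for row in labels
    ]
    true_predictions = [
        [label_list[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
```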

---

## 🔍 Inference Example

```python
from transformers import pipeline

ner_pipeline = pipeline(
    "ner",
    model="keyphrase-bert-inspec",
    tokenizer="keyphrase-bert-inspec",
    aggregation_strategy="simple"
)

text = "Information-based semantics is a theory in the philosophy of mind."
results = ner_pipeline(text)

print("🟒 Extracted Keyphrases:")
for r in results:
    print(f" - {r['word']} (score: {r['score']:.2f})")
```

### Sample Output

```
🟒 Extracted Keyphrases:
 - Information-based semantics (score: 0.94)
 - philosophy of mind (score: 0.91)
```

---

## 💾 Model Artifacts

After training, the model and tokenizer are saved as:

```
keyphrase-bert-inspec/
├── config.json
├── pytorch_model.bin
├── tokenizer_config.json
└── vocab.txt
```

---

## 📌 Future Improvements

- Add postprocessing to group fragmented tokens
- Use a larger dataset (like `scientific_keyphrases`)
- Convert to a web app using Gradio or Streamlit

---

## 👨‍🔬 Author

**Your Name**  
GitHub: [@your-username](https://github.com/your-username)  
Contact: [email protected]

---

## 📄 License

MIT License. See `LICENSE` file.