# 🧠 Keyphrase Extraction with BERT (Fine-Tuned on `midas/inspec`)

This repository contains a complete pipeline to **fine-tune BERT** for **Keyphrase Extraction** using the [`midas/inspec`](https://huggingface.co/datasets/midas/inspec) dataset. The model performs sequence labeling with BIO tags to extract meaningful phrases from scientific text.

---

## 🔧 Features

- ✅ Preprocessed dataset with BIO-tagged tokens  
- ✅ Fine-tuning BERT (`bert-base-cased`) using Hugging Face Transformers  
- ✅ Token-label alignment  
- ✅ Evaluation using `seqeval` metrics (Precision, Recall, F1)  
- ✅ Inference pipeline to extract keyphrases  
- ✅ CUDA-enabled for GPU acceleration  

---

## 📂 Dataset

**Source:** [`midas/inspec`](https://huggingface.co/datasets/midas/inspec)

- Fields:
  - `document`: List of tokenized words (already split)
  - `doc_bio_tags`: BIO-format labels for keyphrases
- Splits:
  - `train`: 1000 samples
  - `validation`: 500 samples
  - `test`: 500 samples
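
To make the record layout concrete, here is a minimal sketch of one example and the label mapping a token-classification head needs. The sample record is made up for illustration (it is not taken from the dataset), and the tag strings `B`, `I`, `O` and their ordering in `label_list` are assumptions; any fixed order works as long as training and inference agree.

```python
# Hypothetical record in the shape described above (not a real dataset row).
sample = {
    "document": ["Keyphrase", "extraction", "with", "BERT"],
    "doc_bio_tags": ["B", "I", "O", "B"],
}

# Assumed label order; keep it identical at training and inference time.
label_list = ["B", "I", "O"]
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for label, i in label2id.items()}

# Each word must carry exactly one tag.
assert len(sample["document"]) == len(sample["doc_bio_tags"])

# Convert the string tags to the integer ids the model is trained on.
label_ids = [label2id[tag] for tag in sample["doc_bio_tags"]]
print(label_ids)  # [0, 1, 2, 0]
```

With the real data, the same mapping would be applied to every row returned by `load_dataset` for `midas/inspec`; check the dataset card for the configuration name to pass.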

---

## 🚀 Setup & Installation

```bash
git clone https://github.com/your-username/keyphrase-bert-inspec
cd keyphrase-bert-inspec

pip install -r requirements.txt
```

### `requirements.txt`

```text
datasets
transformers
evaluate
seqeval
```

---

## 🧪 Training

```python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer
```

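The training script runs in three steps:

1. Load the dataset and tokenize it, aligning the word-level BIO labels with BERT's subword tokens
2. Fine-tune `bert-base-cased` on the `train` split, evaluating on `validation`
3. Evaluate and save model artifacts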
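
The label alignment in step 1 is needed because BERT splits words into subword tokens, so one word-level tag may cover several tokens. The sketch below shows one common convention: label only the first subword of each word and mask the rest with `-100`, which PyTorch's cross-entropy loss ignores by default. The function name `align_labels` and the example tokenization are hypothetical, and the repository may use a different convention.

```python
def align_labels(word_ids, word_labels, label2id, ignore_index=-100):
    """Map word-level BIO tags onto subword tokens.

    word_ids: one entry per token, in the form returned by a fast
              tokenizer's `word_ids()`; None marks special tokens.
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and [SEP] get no label.
            aligned.append(ignore_index)
        elif word_id != previous_word:
            # First subword of a word carries the word's label.
            aligned.append(label2id[word_labels[word_id]])
        else:
            # Later subwords of the same word are masked out of the loss.
            aligned.append(ignore_index)
        previous_word = word_id
    return aligned


label2id = {"B": 0, "I": 1, "O": 2}  # assumed label order
# Hypothetical tokenization: [CLS] key ##phrase extraction works [SEP]
word_ids = [None, 0, 0, 1, 2, None]
word_labels = ["B", "I", "O"]
print(align_labels(word_ids, word_labels, label2id))
# [-100, 0, -100, 1, 2, -100]
```

In a real preprocessing function, `word_ids` would come from tokenizing `document` with `is_split_into_words=True`, since the words are already split.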

### Training Script Overview

```python
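# `model`, `training_args`, `tokenized_datasets`, `tokenizer`, `data_collator`,
# and `compute_metrics` are assumed to be defined earlier in the script.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,  # recent Transformers releases rename this to `processing_class`
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
# Saves the model weights and config (and the tokenizer, since it was passed to Trainer).
trainer.save_model("keyphrase-bert-inspec")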
```
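
The snippet above uses `model`, `training_args`, and `data_collator` without showing how they are built. The sketch below is one plausible way to create them. The hyperparameter values, the label order, and the helper name `build_training_objects` are assumptions chosen as common starting points for BERT fine-tuning; they are not the settings behind this repository's reported scores.

```python
# Assumed hyperparameters (illustrative defaults, not the repository's values).
HYPERPARAMS = {
    "output_dir": "keyphrase-bert-inspec",
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "num_train_epochs": 3,
    "weight_decay": 0.01,
}

LABEL_LIST = ["B", "I", "O"]  # assumed label order


def build_training_objects(checkpoint="bert-base-cased"):
    """Create the tokenizer, model, collator, and arguments passed to Trainer."""
    # Imported lazily so this module can be read without transformers installed.
    from transformers import (
        AutoModelForTokenClassification,
        AutoTokenizer,
        DataCollatorForTokenClassification,
        TrainingArguments,
    )

    id2label = dict(enumerate(LABEL_LIST))
    label2id = {label: i for i, label in id2label.items()}

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint,
        num_labels=len(LABEL_LIST),
        id2label=id2label,
        label2id=label2id,
    )
    # Pads inputs per batch; labels are padded with -100 so padding is ignored by the loss.
    data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
    training_args = TrainingArguments(**HYPERPARAMS)
    return tokenizer, model, data_collator, training_args
```

Calling `build_training_objects()` downloads the `bert-base-cased` checkpoint on first use, so it needs network access or a local cache.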

---

## 📊 Evaluation Metrics

```python
{
  "precision": 0.84,
  "recall": 0.81,
  "f1": 0.825,
  "accuracy": 0.88
}
```
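
Assuming the precision, recall, and F1 above come from `seqeval`, they are computed over whole keyphrase spans rather than individual tokens, while accuracy is token-level. The sketch below is a simplified illustration of span-level scoring for untyped `B`/`I`/`O` tags. It is not `seqeval`'s implementation, the function names are hypothetical, and treating a stray `I` as the start of a span is an assumption.

```python
def extract_spans(tags):
    """Return the set of (start, end) keyphrase spans in a BIO tag sequence."""
    spans, start = set(), None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.add((start, i))  # close the previous span
            start = i
        elif tag == "I":
            if start is None:
                start = i  # stray "I" opens a span (lenient handling)
        else:
            if start is not None:
                spans.add((start, i))  # "O" closes any open span
            start = None
    if start is not None:
        spans.add((start, len(tags)))
    return spans


def span_scores(gold_tags, pred_tags):
    """Precision, recall, and F1 over exactly matching spans."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["B", "I", "O", "B", "O"]
pred = ["B", "I", "O", "O", "B"]
print(span_scores(gold, pred))  # (0.5, 0.5, 0.5)

# F1 is the harmonic mean of precision and recall, so the reported
# numbers are internally consistent:
print(round(2 * 0.84 * 0.81 / (0.84 + 0.81), 3))  # 0.825
```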

---

## 🔍 Inference Example

```python
from transformers import pipeline

ner_pipeline = pipeline(
    "ner",
    model="keyphrase-bert-inspec",
    tokenizer="keyphrase-bert-inspec",
    aggregation_strategy="simple"
)

text = "Information-based semantics is a theory in the philosophy of mind."
results = ner_pipeline(text)

print("🟢 Extracted Keyphrases:")
for r in results:
    print(f" - {r['word']} (score: {r['score']:.2f})")
```

### Sample Output

```
🟢 Extracted Keyphrases:
 - Information-based semantics (score: 0.94)
 - philosophy of mind (score: 0.91)
```

---

## 💾 Model Artifacts

After training, the model and tokenizer are saved as:

```
keyphrase-bert-inspec/
├── config.json
├── pytorch_model.bin
├── tokenizer_config.json
└── vocab.txt
```

---

## 📌 Future Improvements

- Add postprocessing to group fragmented tokens
- Use a larger dataset (like `scientific_keyphrases`)
- Convert to a web app using Gradio or Streamlit
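
As a starting point for the first item, here is a minimal sketch of postprocessing that merges word-level BIO predictions into keyphrase strings. The function name `group_keyphrases` and the tag sequence in the example are hypothetical, and dropping an `I` that has no preceding `B` is a design choice rather than the repository's confirmed behavior.

```python
def group_keyphrases(words, tags):
    """Merge word-level BIO tags into keyphrase strings."""
    phrases, current = [], []
    for word, tag in zip(words, tags):
        if tag == "B":
            # A new phrase starts; flush the one in progress, if any.
            if current:
                phrases.append(" ".join(current))
            current = [word]
        elif tag == "I" and current:
            # Continue the open phrase.
            current.append(word)
        else:
            # "O", or an "I" with no open phrase: close what we have.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


words = ["Information-based", "semantics", "is", "a", "theory", "in",
         "the", "philosophy", "of", "mind", "."]
# Hypothetical predictions for the sentence above.
tags = ["B", "I", "O", "O", "O", "O", "O", "B", "I", "I", "O"]
print(group_keyphrases(words, tags))
# ['Information-based semantics', 'philosophy of mind']
```

Working at the word level avoids the fragmented subword pieces that token-level aggregation can produce, at the cost of first mapping token predictions back to words.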

---

## 👨‍🔬 Author

**Your Name**  
GitHub: [@your-username](https://github.com/your-username)  
Contact: [email protected]

---

## 📄 License

MIT License. See `LICENSE` file.