🧠 Keyphrase Extraction with BERT (Fine-Tuned on midas/inspec)

This repository contains a complete pipeline to fine-tune BERT for Keyphrase Extraction using the midas/inspec dataset. The model performs sequence labeling with BIO tags to extract meaningful phrases from scientific text.


🔧 Features

  • ✅ Preprocessed dataset with BIO-tagged tokens
  • ✅ Fine-tuning BERT (bert-base-cased) using Hugging Face Transformers
  • ✅ Token-label alignment
  • ✅ Evaluation using seqeval metrics (Precision, Recall, F1)
  • ✅ Inference pipeline to extract keyphrases
  • ✅ CUDA-enabled for GPU acceleration

📂 Dataset

Source: midas/inspec

  • Fields:
    • document: List of tokenized words (already split)
    • doc_bio_tags: BIO-format labels for keyphrases
  • Splits:
    • train: 1000 samples
    • validation: 500 samples
    • test: 500 samples
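
For orientation, the dataset can be loaded and inspected directly. A minimal sketch, assuming the extraction configuration (the one that carries the document and doc_bio_tags fields):

from datasets import load_dataset

ds = load_dataset("midas/inspec", "extraction")
sample = ds["train"][0]
print(sample["document"][:10])      # first ten word tokens
print(sample["doc_bio_tags"][:10])  # matching bare "B"/"I"/"O" tags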

🚀 Setup & Installation

git clone https://github.com/your-username/keyphrase-bert-inspec
cd keyphrase-bert-inspec

pip install -r requirements.txt

requirements.txt

datasets
transformers
evaluate
seqeval
# needed to fine-tune with the Trainer API on GPU
torch
accelerate

🧪 Training

The training script follows three steps:

  1. Load and preprocess the data with aligned BIO labels (see the alignment sketch below)
  2. Fine-tune bert-base-cased on the dataset
  3. Evaluate and save the model artifacts

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer
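
Because WordPiece tokenization splits words into subwords, the word-level BIO tags must be realigned to the tokenizer output before training. A minimal sketch of that step; the typed label names (B-KEY/I-KEY) and the helper name tokenize_and_align_labels are illustrative, not from the original script:

label_list = ["O", "B-KEY", "I-KEY"]   # typed labels play well with seqeval and pipeline grouping
tag2id = {"O": 0, "B": 1, "I": 2}      # doc_bio_tags uses bare B/I/O

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_and_align_labels(examples):
    # The dataset is pre-tokenized into words, so skip re-splitting
    tokenized = tokenizer(examples["document"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, tags in enumerate(examples["doc_bio_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        labels, previous = [], None
        for word_id in word_ids:
            if word_id is None:
                labels.append(-100)                  # special tokens: ignored by the loss
            elif word_id != previous:
                labels.append(tag2id[tags[word_id]])
            else:
                labels.append(-100)                  # label only the first subword of a word
            previous = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

raw = load_dataset("midas/inspec", "extraction")
tokenized_datasets = raw.map(tokenize_and_align_labels, batched=True)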

Training Script Overview:
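
The Trainer below references model, training_args, data_collator, and compute_metrics. A minimal sketch of those pieces, assuming the label definitions from the alignment step; the hyperparameters are illustrative placeholders, not tuned values:

import numpy as np
import evaluate
from transformers import DataCollatorForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(label_list),
    id2label=dict(enumerate(label_list)),
    label2id={l: i for i, l in enumerate(label_list)},
)
data_collator = DataCollatorForTokenClassification(tokenizer)  # pads inputs and labels together
seqeval = evaluate.load("seqeval")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Drop the -100 positions before scoring with seqeval
    true_predictions = [
        [label_list[p] for p, l in zip(pred, lab) if l != -100]
        for pred, lab in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for p, l in zip(pred, lab) if l != -100]
        for pred, lab in zip(predictions, labels)
    ]
    scores = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": scores["overall_precision"],
        "recall": scores["overall_recall"],
        "f1": scores["overall_f1"],
        "accuracy": scores["overall_accuracy"],
    }

training_args = TrainingArguments(
    output_dir="keyphrase-bert-inspec",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

With these pieces defined, the Trainer is assembled and run: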

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.save_model("keyphrase-bert-inspec")

📊 Evaluation Metrics

Example scores reported by seqeval on the validation split:

{
  "precision": 0.84,
  "recall": 0.81,
  "f1": 0.825,
  "accuracy": 0.88
}
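
The JSON above is the kind of dictionary returned by compute_metrics; after training, equivalent numbers can be obtained from the Trainer's evaluation loop (keys in the returned dict carry an eval_ prefix):

metrics = trainer.evaluate()   # runs compute_metrics on the validation split
print({k: round(v, 3) for k, v in metrics.items() if k.startswith("eval_")})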

🔍 Inference Example

from transformers import pipeline

ner_pipeline = pipeline(
    "ner",
    model="keyphrase-bert-inspec",
    tokenizer="keyphrase-bert-inspec",
    aggregation_strategy="simple"
)

text = "Information-based semantics is a theory in the philosophy of mind."
results = ner_pipeline(text)

for r in results:
    print(f"{r['word']} ({r['entity_group']}) - {r['score']:.2f}")

Sample Output

🟢 Extracted Keyphrases:
 - Information-based semantics (score: 0.94)
 - philosophy of mind (score: 0.91)
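
To turn the aggregated spans into the flat keyphrase list shown above, filter the pipeline output by score; a sketch (the 0.5 threshold is an arbitrary choice, not from the original script):

keyphrases = [r["word"] for r in results if r["score"] >= 0.5]
for phrase in keyphrases:
    print(f" - {phrase}")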

💾 Model Artifacts

After training, the model and tokenizer are saved as:

keyphrase-bert-inspec/
├── config.json
├── pytorch_model.bin
├── tokenizer_config.json
└── vocab.txt
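
The saved directory can be reloaded directly, either through the pipeline call shown in the inference example or via the Auto classes:

from transformers import AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("keyphrase-bert-inspec")
tokenizer = AutoTokenizer.from_pretrained("keyphrase-bert-inspec")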

📌 Future Improvements

  • Add postprocessing to group fragmented tokens
  • Use a larger dataset (like scientific_keyphrases)
  • Convert to a web app using Gradio or Streamlit

👨‍🔬 Author

Your Name
GitHub: @your-username
Contact: [email protected]


📄 License

MIT License. See LICENSE file.
