|
|
|
# Keyphrase Extraction with BERT (Fine-Tuned on `midas/inspec`)
|
|
|
This repository contains a complete pipeline to **fine-tune BERT** for **Keyphrase Extraction** using the [`midas/inspec`](https://huggingface.co/datasets/midas/inspec) dataset. The model performs sequence labeling with BIO tags to extract meaningful phrases from scientific text. |
|
|
|
--- |
|
|
|
## Features
|
|
|
- Preprocessed dataset with BIO-tagged tokens

- Fine-tuning BERT (`bert-base-cased`) using Hugging Face Transformers

- Token-label alignment

- Evaluation using `seqeval` metrics (Precision, Recall, F1)

- Inference pipeline to extract keyphrases

- CUDA-enabled for GPU acceleration
|
|
|
--- |
|
|
|
## Dataset
|
|
|
**Source:** [`midas/inspec`](https://huggingface.co/datasets/midas/inspec) |
|
|
|
- Fields:

  - `document`: the pre-tokenized words of a document (a list of strings)

  - `doc_bio_tags`: BIO-format keyphrase labels, one per word in `document`

- Splits:

  - `train`: 1000 samples

  - `validation`: 500 samples

  - `test`: 500 samples
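
A quick way to sanity-check the data is to load it and print one record. A minimal sketch, assuming the dataset's `extraction` configuration (the one that exposes `document` / `doc_bio_tags`):

```python
from datasets import load_dataset

# The "extraction" configuration carries `document` and `doc_bio_tags`
dataset = load_dataset("midas/inspec", "extraction")

sample = dataset["train"][0]
# The two fields are parallel lists of equal length
for word, tag in zip(sample["document"], sample["doc_bio_tags"]):
    print(f"{word}\t{tag}")
```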
|
|
|
--- |
|
|
|
## Setup & Installation
|
|
|
```bash
git clone https://github.com/your-username/keyphrase-bert-inspec
cd keyphrase-bert-inspec

pip install -r requirements.txt
```
|
|
|
### `requirements.txt` |
|
|
|
```text
torch
datasets
transformers
evaluate
seqeval
```
|
|
|
--- |
|
|
|
## Training
|
|
|
```python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer
```
|
|
|
1. Load and preprocess the data with BIO labels aligned to BERT's subword tokens (see the alignment sketch below)

2. Fine-tune `bert-base-cased` on the dataset

3. Evaluate and save the model artifacts
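
Because BERT's WordPiece tokenizer splits words into subwords, the word-level BIO tags must be re-aligned to subword positions before training. Below is a minimal sketch of that step; the `B-KEY`/`I-KEY` renaming is a choice of this sketch (the raw tags are plain `B`/`I`/`O`) so that `seqeval` and the inference pipeline can group spans by type:

```python
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

tag2label = {"B": "B-KEY", "I": "I-KEY", "O": "O"}
label_list = ["B-KEY", "I-KEY", "O"]
label2id = {label: i for i, label in enumerate(label_list)}

def tokenize_and_align_labels(examples):
    tokenized = tokenizer(
        examples["document"],
        truncation=True,
        is_split_into_words=True,  # documents are already word lists
    )
    all_labels = []
    for i, tags in enumerate(examples["doc_bio_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        labels, previous = [], None
        for word_id in word_ids:
            if word_id is None:        # special tokens ([CLS], [SEP])
                labels.append(-100)    # -100 is ignored by the loss
            elif word_id != previous:  # first subword of each word
                labels.append(label2id[tag2label[tags[word_id]]])
            else:                      # later subwords of the same word
                labels.append(-100)
            previous = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)
```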
|
|
|
### Training Script Overview
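
The `Trainer` call below assumes a model, training arguments, and a data collator have already been built. Here is a minimal sketch of those pieces (the hyperparameters are illustrative, not tuned values from this repo; `compute_metrics` is sketched in the Evaluation Metrics section):

```python
from transformers import DataCollatorForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(label_list),
    id2label=dict(enumerate(label_list)),
    label2id=label2id,
)

# Illustrative hyperparameters
training_args = TrainingArguments(
    output_dir="keyphrase-bert-inspec",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",
)

# Pads input_ids and labels together within each batch
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
```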
|
|
|
```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.save_model("keyphrase-bert-inspec")
```
|
|
|
--- |
|
|
|
## Evaluation Metrics

Example entity-level scores from `seqeval`:
|
|
|
```python
{
    "precision": 0.84,
    "recall": 0.81,
    "f1": 0.825,
    "accuracy": 0.88
}
```
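
These numbers come from the `compute_metrics` hook passed to the `Trainer`. A minimal sketch using the `evaluate` wrapper around `seqeval`, reusing `label_list` from the alignment sketch above:

```python
import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Keep only positions with a real label (-100 marks ignored subwords)
    true_labels = [
        [label_list[l] for l in row if l != -100]
        for row in labels
    ]
    true_preds = [
        [label_list[p] for p, l in zip(p_row, l_row) if l != -100]
        for p_row, l_row in zip(predictions, labels)
    ]

    results = seqeval.compute(predictions=true_preds, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
```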
|
|
|
--- |
|
|
|
## Inference Example
|
|
|
```python
from transformers import pipeline

# "ner" is the token-classification task; "simple" aggregation merges
# consecutive subword predictions into whole phrases
ner_pipeline = pipeline(
    "ner",
    model="keyphrase-bert-inspec",
    tokenizer="keyphrase-bert-inspec",
    aggregation_strategy="simple",
)

text = "Information-based semantics is a theory in the philosophy of mind."
results = ner_pipeline(text)

print("Extracted Keyphrases:")
for r in results:
    print(f"- {r['word']} (score: {r['score']:.2f})")
```
|
|
|
### Sample Output |
|
|
|
```
Extracted Keyphrases:
- Information-based semantics (score: 0.94)
- philosophy of mind (score: 0.91)
```
|
|
|
--- |
|
|
|
## Model Artifacts
|
|
|
After training, the model and tokenizer are saved as: |
|
|
|
```
keyphrase-bert-inspec/
├── config.json
├── pytorch_model.bin
├── tokenizer_config.json
└── vocab.txt
```
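
The saved directory can be reloaded by path for further training or inference, e.g.:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("keyphrase-bert-inspec")
tokenizer = AutoTokenizer.from_pretrained("keyphrase-bert-inspec")
```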
|
|
|
--- |
|
|
|
## Future Improvements
|
|
|
- Add postprocessing to group fragmented tokens |
|
- Use a larger dataset (like `scientific_keyphrases`) |
|
- Convert to a web app using Gradio or Streamlit |
|
|
|
--- |
|
|
|
## Author
|
|
|
**Your Name** |
|
GitHub: [@your-username](https://github.com/your-username) |
|
Contact: [email protected] |
|
|
|
--- |
|
|
|
## License
|
|
|
MIT License. See `LICENSE` file. |
|
|