# 🧠 Keyphrase Extraction with BERT (Fine-Tuned on `midas/inspec`)
This repository contains a complete pipeline to **fine-tune BERT** for **Keyphrase Extraction** using the [`midas/inspec`](https://huggingface.co/datasets/midas/inspec) dataset. The model performs sequence labeling with BIO tags to extract meaningful phrases from scientific text.
---
## 🔧 Features
- ✅ Preprocessed dataset with BIO-tagged tokens
- ✅ Fine-tuning BERT (`bert-base-cased`) using Hugging Face Transformers
- ✅ Token-label alignment
- ✅ Evaluation using `seqeval` metrics (Precision, Recall, F1)
- ✅ Inference pipeline to extract keyphrases
- ✅ CUDA-enabled for GPU acceleration
---
## 📂 Dataset
**Source:** [`midas/inspec`](https://huggingface.co/datasets/midas/inspec)
- Fields:
  - `document`: list of tokenized words (already split)
  - `doc_bio_tags`: BIO-format labels for keyphrases
- Splits:
  - `train`: 1,000 samples
  - `validation`: 500 samples
  - `test`: 500 samples
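A minimal loading sketch, assuming the `extraction` configuration is the one that exposes `document` and `doc_bio_tags` (recent `datasets` releases may additionally require `trust_remote_code=True` for script-based datasets):
```python
from datasets import load_dataset

# Inspec keyphrase extraction data: documents are pre-tokenized,
# labels are per-word "B"/"I"/"O" strings
dataset = load_dataset("midas/inspec", "extraction")

sample = dataset["train"][0]
print(sample["document"][:8])      # first eight words
print(sample["doc_bio_tags"][:8])  # their BIO tags
```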
---
## 🚀 Setup & Installation
```bash
git clone https://github.com/your-username/keyphrase-bert-inspec
cd keyphrase-bert-inspec
pip install -r requirements.txt
```
### `requirements.txt`
```text
torch
datasets
transformers
evaluate
seqeval
```
---
## 🧪 Training
```python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer
```
1. Load and preprocess data with aligned BIO labels (see the alignment sketch below)
2. Fine-tune `bert-base-cased` on the dataset
3. Evaluate and save model artifacts
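A minimal sketch of step 1, continuing from the loading snippet above. The raw `"B"`/`"I"`/`"O"` tags are renamed to `B-KEY`/`I-KEY`/`O` here so that `seqeval` and the inference pipeline treat keyphrases as a single entity type; that renaming is a choice made for this sketch, not part of the dataset:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

label_list = ["B-KEY", "I-KEY", "O"]
tag2label = {"B": "B-KEY", "I": "I-KEY", "O": "O"}
label2id = {label: i for i, label in enumerate(label_list)}

def tokenize_and_align_labels(examples):
    # Documents are already split into words, so tokenize word-by-word
    tokenized = tokenizer(
        examples["document"], truncation=True, is_split_into_words=True
    )
    all_labels = []
    for i, tags in enumerate(examples["doc_bio_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        labels, previous = [], None
        for word_id in word_ids:
            if word_id is None:          # special tokens ([CLS], [SEP])
                labels.append(-100)      # -100 is ignored by the loss
            elif word_id != previous:    # first sub-token of a word
                labels.append(label2id[tag2label[tags[word_id]]])
            else:                        # remaining sub-tokens of the word
                labels.append(-100)
            previous = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)
```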
### Training Script Overview
```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.save_model("keyphrase-bert-inspec")
```
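The overview assumes `model`, `training_args`, `data_collator`, and `compute_metrics` were defined earlier. A plausible set of definitions, reusing `label_list`/`label2id` from the alignment sketch (the hyperparameters are illustrative, not the repo's exact settings):
```python
import numpy as np
import evaluate
from transformers import (
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
)

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(label_list),
    id2label={i: label for i, label in enumerate(label_list)},
    label2id=label2id,
)
data_collator = DataCollatorForTokenClassification(tokenizer)
seqeval = evaluate.load("seqeval")

training_args = TrainingArguments(
    output_dir="keyphrase-bert-inspec",
    learning_rate=2e-5,                  # illustrative hyperparameters
    per_device_train_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Drop the -100 positions that were masked out during alignment
    true_preds = [
        [label_list[p] for p, l in zip(pred, lab) if l != -100]
        for pred, lab in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for p, l in zip(pred, lab) if l != -100]
        for pred, lab in zip(predictions, labels)
    ]
    scores = seqeval.compute(predictions=true_preds, references=true_labels)
    return {
        "precision": scores["overall_precision"],
        "recall": scores["overall_recall"],
        "f1": scores["overall_f1"],
        "accuracy": scores["overall_accuracy"],
    }
```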
---
## 📊 Evaluation Metrics
```python
{
"precision": 0.84,
"recall": 0.81,
"f1": 0.825,
"accuracy": 0.88
}
```
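After training, numbers like these can be recomputed on any split with `trainer.evaluate`:
```python
metrics = trainer.evaluate(tokenized_datasets["test"])
print(metrics)  # keys are prefixed with "eval_", e.g. "eval_f1"
```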
---
## 🔍 Inference Example
```python
from transformers import pipeline

# Load the fine-tuned model as a token-classification pipeline;
# "simple" aggregation merges B-/I- sub-tokens into whole spans
ner_pipeline = pipeline(
    "ner",
    model="keyphrase-bert-inspec",
    tokenizer="keyphrase-bert-inspec",
    aggregation_strategy="simple",
)

text = "Information-based semantics is a theory in the philosophy of mind."
results = ner_pipeline(text)

print("🟢 Extracted Keyphrases:")
for r in results:
    print(f"- {r['word']} (score: {r['score']:.2f})")
```
### Sample Output
```
🟢 Extracted Keyphrases:
- Information-based semantics (score: 0.94)
- philosophy of mind (score: 0.91)
```
---
## 💾 Model Artifacts
After training, the model and tokenizer are saved as:
```
keyphrase-bert-inspec/
├── config.json
├── pytorch_model.bin
├── tokenizer_config.json
└── vocab.txt
```
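These artifacts can be reloaded without retraining via the standard `from_pretrained` API:
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("keyphrase-bert-inspec")
tokenizer = AutoTokenizer.from_pretrained("keyphrase-bert-inspec")
```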
---
## 📌 Future Improvements
- Add postprocessing to group fragmented tokens (see the sketch after this list)
- Use a larger dataset (like `scientific_keyphrases`)
- Convert to a web app using Gradio or Streamlit
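For the first item, one possible starting point is to merge pipeline spans that touch or overlap in the source text; the `merge_adjacent` helper below is a hypothetical sketch, not part of this repo:
```python
def merge_adjacent(spans, text):
    """Merge pipeline spans that touch or overlap in the original text."""
    merged = []
    for span in sorted(spans, key=lambda s: s["start"]):
        if merged and span["start"] <= merged[-1]["end"] + 1:
            # Extend the previous span and keep the more cautious score
            merged[-1]["end"] = max(merged[-1]["end"], span["end"])
            merged[-1]["word"] = text[merged[-1]["start"]:merged[-1]["end"]]
            merged[-1]["score"] = min(merged[-1]["score"], span["score"])
        else:
            merged.append(dict(span))
    return merged

keyphrases = merge_adjacent(results, text)
```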
---
## 👨‍🔬 Author
**Your Name**
GitHub: [@your-username](https://github.com/your-username)
Contact: [email protected]
---
## 📄 License
MIT License. See `LICENSE` file.