# Keyphrase Extraction with BERT (Fine-Tuned on midas/inspec)
This repository contains a complete pipeline to fine-tune BERT for keyphrase extraction using the `midas/inspec` dataset. The model performs sequence labeling with BIO tags to extract meaningful phrases from scientific text.
## Features

- Preprocessed dataset with BIO-tagged tokens
- Fine-tuning BERT (`bert-base-cased`) using Hugging Face Transformers
- Token-label alignment
- Evaluation using `seqeval` metrics (Precision, Recall, F1)
- Inference pipeline to extract keyphrases
- CUDA-enabled for GPU acceleration
## Dataset

**Source:** `midas/inspec`

- Fields:
  - `document`: list of tokenized words (already split)
  - `doc_bio_tags`: BIO-format labels for keyphrases
- Splits:
  - `train`: 1000 samples
  - `validation`: 500 samples
  - `test`: 500 samples
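To get a feel for the data, the dataset can be loaded and inspected directly. A minimal sketch (field names follow the dataset description above):

```python
from datasets import load_dataset

dataset = load_dataset("midas/inspec")

sample = dataset["train"][0]
print(sample["document"][:10])      # first ten pre-tokenized words
print(sample["doc_bio_tags"][:10])  # the matching B/I/O labels
```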
## Setup & Installation

```bash
git clone https://github.com/your-username/keyphrase-bert-inspec
cd keyphrase-bert-inspec
pip install -r requirements.txt
```

`requirements.txt`:

```text
datasets
transformers
evaluate
seqeval
```
## Training

```python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer
```

The training script:

1. Loads and preprocesses the data with aligned BIO labels (see the alignment sketch below)
2. Fine-tunes `bert-base-cased` on the dataset
3. Evaluates and saves the model artifacts
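Because BERT's wordpiece tokenizer splits words into sub-tokens, the word-level BIO labels must be re-aligned after tokenization. A minimal sketch of this standard alignment pattern (the `label2id` mapping here is an assumption; the actual script defines its own):

```python
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
label2id = {"O": 0, "B": 1, "I": 2}  # hypothetical mapping

def tokenize_and_align_labels(batch):
    tokenized = tokenizer(
        batch["document"],
        truncation=True,
        is_split_into_words=True,  # documents are already word-tokenized
    )
    labels = []
    for i, bio_tags in enumerate(batch["doc_bio_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word = None
        label_ids = []
        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)  # special tokens: ignored by the loss
            elif word_id != previous_word:
                label_ids.append(label2id[bio_tags[word_id]])
            else:
                label_ids.append(-100)  # label only the first sub-token of a word
            previous_word = word_id
        labels.append(label_ids)
    tokenized["labels"] = labels
    return tokenized

tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)
```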
**Training Script Overview:**

```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.save_model("keyphrase-bert-inspec")
```
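The `Trainer` above also expects a `model`, `training_args`, and a `data_collator` defined earlier in the script. A minimal sketch of that setup, with illustrative (not the actual) hyperparameters:

```python
from transformers import (
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
)

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=3,  # O, B, I
)

# Pads inputs and label sequences to the same length within each batch
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

training_args = TrainingArguments(
    output_dir="keyphrase-bert-inspec",
    evaluation_strategy="epoch",
    learning_rate=2e-5,             # illustrative values
    per_device_train_batch_size=16,
    num_train_epochs=3,
)
```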
## Evaluation Metrics

```json
{
  "precision": 0.84,
  "recall": 0.81,
  "f1": 0.825,
  "accuracy": 0.88
}
```
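These numbers come from the `compute_metrics` function passed to the `Trainer`. A sketch of how such a function is typically built with `evaluate` and `seqeval` (the `id2label` mapping is hypothetical; `seqeval` expects typed BIO tags, so the bare B/I labels are mapped to a single KEY type):

```python
import numpy as np
import evaluate

seqeval = evaluate.load("seqeval")
id2label = {0: "O", 1: "B-KEY", 2: "I-KEY"}  # hypothetical mapping

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    true_labels, true_predictions = [], []
    for pred_row, label_row in zip(predictions, labels):
        # Skip positions labeled -100 (special tokens and trailing sub-tokens)
        true_labels.append([id2label[l] for l in label_row if l != -100])
        true_predictions.append(
            [id2label[p] for p, l in zip(pred_row, label_row) if l != -100]
        )
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
```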
## Inference Example

```python
from transformers import pipeline

ner_pipeline = pipeline(
    "ner",
    model="keyphrase-bert-inspec",
    tokenizer="keyphrase-bert-inspec",
    aggregation_strategy="simple",
)

text = "Information-based semantics is a theory in the philosophy of mind."
results = ner_pipeline(text)

for r in results:
    print(f"{r['word']} ({r['entity_group']}) - {r['score']:.2f}")
```
**Sample Output:**

```text
Extracted Keyphrases:
- Information-based semantics (score: 0.94)
- philosophy of mind (score: 0.91)
```
## Model Artifacts

After training, the model and tokenizer are saved as:

```text
keyphrase-bert-inspec/
├── config.json
├── pytorch_model.bin
├── tokenizer_config.json
└── vocab.txt
```
## Future Improvements

- Add postprocessing to group fragmented tokens
- Use a larger dataset (such as `scientific_keyphrases`)
- Convert to a web app using Gradio or Streamlit (see the sketch below)
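For the web-app idea, a minimal Gradio sketch might look like this (the interface layout is an assumption; the model path matches the artifacts saved above):

```python
import gradio as gr
from transformers import pipeline

extractor = pipeline(
    "ner",
    model="keyphrase-bert-inspec",
    aggregation_strategy="simple",
)

def extract_keyphrases(text):
    # Return phrase -> confidence; Gradio's Label component renders this as a ranked list
    return {r["word"]: float(r["score"]) for r in extractor(text)}

demo = gr.Interface(fn=extract_keyphrases, inputs="textbox", outputs="label")
demo.launch()
```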
## Author

Your Name
GitHub: @your-username
Contact: [email protected]
## License

MIT License. See the `LICENSE` file.