🧠 Keyphrase Extraction with BERT (Fine-Tuned on midas/inspec)

This repository contains a complete pipeline to fine-tune BERT for Keyphrase Extraction using the midas/inspec dataset. The model performs sequence labeling with BIO tags to extract meaningful phrases from scientific text.


🔧 Features

  • ✅ Preprocessed dataset with BIO-tagged tokens
  • ✅ Fine-tuning BERT (bert-base-cased) using Hugging Face Transformers
  • ✅ Token-label alignment
  • ✅ Evaluation using seqeval metrics (Precision, Recall, F1)
  • ✅ Inference pipeline to extract keyphrases
  • ✅ CUDA-enabled for GPU acceleration

📂 Dataset

Source: midas/inspec

  • Fields:
    • document: List of tokenized words (already split)
    • doc_bio_tags: BIO-format labels for keyphrases
  • Splits:
    • train: 1000 samples
    • validation: 500 samples
    • test: 500 samples
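
For orientation, the dataset can be loaded and inspected directly. A minimal sketch, assuming the extraction configuration (the one that carries the document and doc_bio_tags fields):

from datasets import load_dataset

ds = load_dataset("midas/inspec", "extraction")
sample = ds["train"][0]
print(sample["document"][:10])      # first ten word tokens
print(sample["doc_bio_tags"][:10])  # matching bare "B"/"I"/"O" tags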

🚀 Setup & Installation

git clone https://github.com/your-username/keyphrase-bert-inspec
cd keyphrase-bert-inspec

pip install -r requirements.txt

requirements.txt

datasets
transformers
evaluate
seqeval
# needed to fine-tune with the Trainer API on GPU
torch
accelerate

🧪 Training

The training script follows three steps:

  1. Load and preprocess the data with aligned BIO labels (see the alignment sketch below)
  2. Fine-tune bert-base-cased on the dataset
  3. Evaluate and save the model artifacts

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer
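
Because WordPiece tokenization splits words into subwords, the word-level BIO tags must be realigned to the tokenizer output before training. A minimal sketch of that step; the typed label names (B-KEY/I-KEY) and the helper name tokenize_and_align_labels are illustrative, not from the original script:

label_list = ["O", "B-KEY", "I-KEY"]   # typed labels play well with seqeval and pipeline grouping
tag2id = {"O": 0, "B": 1, "I": 2}      # doc_bio_tags uses bare B/I/O

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_and_align_labels(examples):
    # The dataset is pre-tokenized into words, so skip re-splitting
    tokenized = tokenizer(examples["document"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, tags in enumerate(examples["doc_bio_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        labels, previous = [], None
        for word_id in word_ids:
            if word_id is None:
                labels.append(-100)                  # special tokens: ignored by the loss
            elif word_id != previous:
                labels.append(tag2id[tags[word_id]])
            else:
                labels.append(-100)                  # label only the first subword of a word
            previous = word_id
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

raw = load_dataset("midas/inspec", "extraction")
tokenized_datasets = raw.map(tokenize_and_align_labels, batched=True)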

Training Script Overview:
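
The Trainer below references model, training_args, data_collator, and compute_metrics. A minimal sketch of those pieces, assuming the label definitions from the alignment step; the hyperparameters are illustrative placeholders, not tuned values:

import numpy as np
import evaluate
from transformers import DataCollatorForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(label_list),
    id2label=dict(enumerate(label_list)),
    label2id={l: i for i, l in enumerate(label_list)},
)
data_collator = DataCollatorForTokenClassification(tokenizer)  # pads inputs and labels together
seqeval = evaluate.load("seqeval")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Drop the -100 positions before scoring with seqeval
    true_predictions = [
        [label_list[p] for p, l in zip(pred, lab) if l != -100]
        for pred, lab in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for p, l in zip(pred, lab) if l != -100]
        for pred, lab in zip(predictions, labels)
    ]
    scores = seqeval.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": scores["overall_precision"],
        "recall": scores["overall_recall"],
        "f1": scores["overall_f1"],
        "accuracy": scores["overall_accuracy"],
    }

training_args = TrainingArguments(
    output_dir="keyphrase-bert-inspec",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

With these pieces defined, the Trainer is assembled and run: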

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.save_model("keyphrase-bert-inspec")

📊 Evaluation Metrics

Example scores reported by seqeval on the validation split:

{
  "precision": 0.84,
  "recall": 0.81,
  "f1": 0.825,
  "accuracy": 0.88
}
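
The JSON above is the kind of dictionary returned by compute_metrics; after training, equivalent numbers can be obtained from the Trainer's evaluation loop (keys in the returned dict carry an eval_ prefix):

metrics = trainer.evaluate()   # runs compute_metrics on the validation split
print({k: round(v, 3) for k, v in metrics.items() if k.startswith("eval_")})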

🔍 Inference Example

from transformers import pipeline

ner_pipeline = pipeline(
    "ner",
    model="keyphrase-bert-inspec",
    tokenizer="keyphrase-bert-inspec",
    aggregation_strategy="simple"
)

text = "Information-based semantics is a theory in the philosophy of mind."
results = ner_pipeline(text)

for r in results:
    print(f"{r['word']} ({r['entity_group']}) - {r['score']:.2f}")

Sample Output

🟢 Extracted Keyphrases:
 - Information-based semantics (score: 0.94)
 - philosophy of mind (score: 0.91)
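
To turn the aggregated spans into the flat keyphrase list shown above, filter the pipeline output by score; a sketch (the 0.5 threshold is an arbitrary choice, not from the original script):

keyphrases = [r["word"] for r in results if r["score"] >= 0.5]
for phrase in keyphrases:
    print(f" - {phrase}")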

💾 Model Artifacts

After training, the model and tokenizer are saved as:

keyphrase-bert-inspec/
├── config.json
├── pytorch_model.bin
├── tokenizer_config.json
└── vocab.txt
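
The saved directory can be reloaded directly, either through the pipeline call shown in the inference example or via the Auto classes:

from transformers import AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("keyphrase-bert-inspec")
tokenizer = AutoTokenizer.from_pretrained("keyphrase-bert-inspec")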

📌 Future Improvements

  • Add postprocessing to group fragmented tokens
  • Use a larger dataset (like scientific_keyphrases)
  • Convert to a web app using Gradio or Streamlit

👨‍🔬 Author

Your Name
GitHub: @your-username
Contact: [email protected]


📄 License

MIT License. See LICENSE file.
