# ๐Ÿง  Keyphrase Extraction with BERT (Fine-Tuned on `midas/inspec`) This repository contains a complete pipeline to **fine-tune BERT** for **Keyphrase Extraction** using the [`midas/inspec`](https://huggingface.co/datasets/midas/inspec) dataset. The model performs sequence labeling with BIO tags to extract meaningful phrases from scientific text. --- ## ๐Ÿ”ง Features - โœ… Preprocessed dataset with BIO-tagged tokens - โœ… Fine-tuning BERT (`bert-base-cased`) using Hugging Face Transformers - โœ… Token-label alignment - โœ… Evaluation using `seqeval` metrics (Precision, Recall, F1) - โœ… Inference pipeline to extract keyphrases - โœ… CUDA-enabled for GPU acceleration --- ## ๐Ÿ“‚ Dataset **Source:** [`midas/inspec`](https://huggingface.co/datasets/midas/inspec) - Fields: - `document`: List of tokenized words (already split) - `doc_bio_tags`: BIO-format labels for keyphrases - Splits: - `train`: 1000 samples - `validation`: 500 samples - `test`: 500 samples --- ## ๐Ÿš€ Setup & Installation ```bash git clone https://github.com/your-username/keyphrase-bert-inspec cd keyphrase-bert-inspec pip install -r requirements.txt ``` ### `requirements.txt` ```text datasets transformers evaluate seqeval ``` --- ## ๐Ÿงช Training ```python from datasets import load_dataset from transformers import AutoTokenizer, AutoModelForTokenClassification, TrainingArguments, Trainer ``` 1. Load and preprocess data with aligned BIO labels 2. Fine-tune `bert-base-cased` on the dataset 3. Evaluate and save model artifacts ### Training Script Overview: ```python trainer = Trainer( model=model, args=training_args, train_dataset=tokenized_datasets["train"], eval_dataset=tokenized_datasets["validation"], tokenizer=tokenizer, data_collator=data_collator, compute_metrics=compute_metrics, ) trainer.train() trainer.save_model("keyphrase-bert-inspec") ``` --- ## ๐Ÿ“Š Evaluation Metrics ```python { "precision": 0.84, "recall": 0.81, "f1": 0.825, "accuracy": 0.88 } ``` --- ## ๐Ÿ” Inference Example ```python from transformers import pipeline ner_pipeline = pipeline( "ner", model="keyphrase-bert-inspec", tokenizer="keyphrase-bert-inspec", aggregation_strategy="simple" ) text = "Information-based semantics is a theory in the philosophy of mind." results = ner_pipeline(text) for r in results: print(f"{r['word']} ({r['entity_group']}) - {r['score']:.2f}") ``` ### Sample Output ``` ๐ŸŸข Extracted Keyphrases: - Information-based semantics (score: 0.94) - philosophy of mind (score: 0.91) ``` --- ## ๐Ÿ’พ Model Artifacts After training, the model and tokenizer are saved as: ``` keyphrase-bert-inspec/ โ”œโ”€โ”€ config.json โ”œโ”€โ”€ pytorch_model.bin โ”œโ”€โ”€ tokenizer_config.json โ”œโ”€โ”€ vocab.txt ``` --- ## ๐Ÿ“Œ Future Improvements - Add postprocessing to group fragmented tokens - Use a larger dataset (like `scientific_keyphrases`) - Convert to a web app using Gradio or Streamlit --- ## ๐Ÿ‘จโ€๐Ÿ”ฌ Author **Your Name** GitHub: [@your-username](https://github.com/your-username) Contact: your.email@example.com --- ## ๐Ÿ“„ License MIT License. See `LICENSE` file.