Upload 7 files

Files changed (7)

README.md +157 -0
config (1).json +35 -0
model (2).safetensors +3 -0
special_tokens_map (1).json +7 -0
tokenizer (1).json +0 -0
tokenizer_config (1).json +56 -0
vocab (1).txt +0 -0

README.md ADDED

	@@ -0,0 +1,157 @@

+# 🧠 Keyphrase Extraction with BERT (Fine-Tuned on `midas/inspec`)
+This repository contains a complete pipeline to **fine-tune BERT** for **keyphrase extraction** on the [`midas/inspec`](https://huggingface.co/datasets/midas/inspec) dataset. The model treats extraction as sequence labeling: each token is tagged `B` (begins a keyphrase), `I` (continues one), or `O` (outside any keyphrase), and the tagged spans are read off as keyphrases from scientific abstracts.
+---
+## 🔧 Features
+- ✅ Preprocessed dataset with BIO-tagged tokens
+- ✅ Fine-tuning BERT (`bert-base-cased`) using Hugging Face Transformers
+- ✅ Token-label alignment
+- ✅ Evaluation using `seqeval` metrics (Precision, Recall, F1)
+- ✅ Inference pipeline to extract keyphrases
+- ✅ CUDA-enabled for GPU acceleration
+---
+## 📂 Dataset
+**Source:** [`midas/inspec`](https://huggingface.co/datasets/midas/inspec)
+- Fields:
+  - `document`: List of tokenized words (already split)
+  - `doc_bio_tags`: BIO-format labels for keyphrases
+- Splits:
+  - `train`: 1000 samples
+  - `validation`: 500 samples
+  - `test`: 500 samples
+---
+## 🚀 Setup & Installation
+```bash
+git clone https://github.com/your-username/keyphrase-bert-inspec
+cd keyphrase-bert-inspec
+pip install -r requirements.txt
+```
+### `requirements.txt`
+```text
+datasets
+transformers
+torch
+accelerate  # required by the transformers `Trainer`
+evaluate
+seqeval
+```
+---
+## 🧪 Training
+```python
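+from datasets import load_dataset
+from transformers import AutoModelForTokenClassification, AutoTokenizer, Trainer, TrainingArguments
+from transformers import DataCollatorForTokenClassification  # builds the `data_collator` passed to `Trainer`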
+```
+1. Load and preprocess data with aligned BIO labels
+2. Fine-tune `bert-base-cased` on the dataset
+3. Evaluate and save model artifacts
+### Training Script Overview
+```python
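+# `model`, `training_args`, `tokenized_datasets`, `data_collator`, and
+# `compute_metrics` are assumed to be defined in the preceding steps.
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=tokenized_datasets["train"],
+    eval_dataset=tokenized_datasets["validation"],
+    processing_class=tokenizer,  # replaces the deprecated `tokenizer=` argument
+    data_collator=data_collator,
+    compute_metrics=compute_metrics,
+)
+trainer.train()
+# Write the model weights, config, and tokenizer files to this directory.
+trainer.save_model("keyphrase-bert-inspec")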
+```
+---
+## 📊 Evaluation Metrics
+```python
+{
+  "precision": 0.84,
+  "recall": 0.81,
+  "f1": 0.825,
+  "accuracy": 0.88
+}
+```
+---
+## 🔍 Inference Example
+```python
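+from transformers import pipeline
+# Load the fine-tuned model and tokenizer from the directory written by `trainer.save_model`.
+ner_pipeline = pipeline(
+    "ner",
+    model="keyphrase-bert-inspec",
+    tokenizer="keyphrase-bert-inspec",
+    aggregation_strategy="simple",
+)
+text = "Information-based semantics is a theory in the philosophy of mind."
+results = ner_pipeline(text)
+# The labels are bare `B`/`I` tags (no `B-TYPE` suffix), so the pipeline may
+# return the first word and the rest of a phrase as separate groups.
+print("🟢 Extracted Keyphrases:")
+for r in results:
+    print(f" - {r['word']} (score: {r['score']:.2f})")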
+```
+### Sample Output
+```
+🟢 Extracted Keyphrases:
+ - Information-based semantics (score: 0.94)
+ - philosophy of mind (score: 0.91)
+```
+---
+## 💾 Model Artifacts
+After training, the model and tokenizer are saved as:
+```
+keyphrase-bert-inspec/
+├── config.json
+├── model.safetensors
+├── special_tokens_map.json
+├── tokenizer.json
+├── tokenizer_config.json
+└── vocab.txt
+```
+---
+## 📌 Future Improvements
+- Add postprocessing to group fragmented tokens
+- Use a larger dataset (like `scientific_keyphrases`)
+- Convert to a web app using Gradio or Streamlit
+---
+## 👨‍🔬 Author
+**Your Name**
+GitHub: [@your-username](https://github.com/your-username)
+Contact: [email protected]
+---
+## 📄 License
+MIT License. See `LICENSE` file.
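
Note on the README's "Token-label alignment" feature: the README lists this step but does not show it, so the following is only a minimal sketch of one common approach, not necessarily what the author's pipeline does. It assumes a `word_ids` list of the kind a fast tokenizer returns from `BatchEncoding.word_ids()` when called with `is_split_into_words=True`. The helper name `align_labels`, the `label_all_subwords` flag, and the example values are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss

LABEL2ID = {"O": 0, "B": 1, "I": 2}  # same mapping as config.json in this commit


def align_labels(word_ids, word_tags, label_all_subwords=False):
    """Map word-level BIO tags onto subword tokens.

    word_ids: one entry per subword token; None marks special tokens such as
        [CLS]/[SEP], otherwise the index of the source word.
    word_tags: BIO tags, one per word.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special token: never contributes to the loss.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First subword of a word carries the word's own tag.
            aligned.append(LABEL2ID[word_tags[word_id]])
        elif label_all_subwords:
            # Continuation subword: a "B" word continues as "I".
            tag = word_tags[word_id]
            aligned.append(LABEL2ID["I" if tag == "B" else tag])
        else:
            # Continuation subword ignored during training.
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Hypothetical input: three words, the first split into three subwords.
word_tags = ["B", "I", "O"]
word_ids = [None, 0, 0, 0, 1, 2, None]
print(align_labels(word_ids, word_tags))
# [-100, 1, -100, -100, 2, 0, -100]
```

Labeling only the first subword keeps the evaluation at word level, which matches how `seqeval` scores spans. Setting `label_all_subwords=True` is the usual alternative when every subword should receive a training signal.
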

config (1).json ADDED

	@@ -0,0 +1,35 @@

+{
+  "architectures": [
+    "BertForTokenClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "O",
+    "1": "B",
+    "2": "I"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "label2id": {
+    "B": 1,
+    "I": 2,
+    "O": 0
+  },
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float16",
+  "transformers_version": "4.51.3",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 28996
+}
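
Note on `id2label`: because the labels are bare `O`/`B`/`I` tags, turning predictions into keyphrases requires grouping each `B` with the `I` tags that follow it. The sketch below shows one way to do that at word level. It is an illustration under assumptions, not the author's post-processing: the function name `bio_to_keyphrases` is hypothetical, and the label ids in the example are made up for demonstration rather than produced by the model.

```python
ID2LABEL = {0: "O", 1: "B", 2: "I"}  # copied from the config.json diff above


def bio_to_keyphrases(words, label_ids):
    """Group word-level B/I/O predictions into keyphrase strings."""
    phrases, current = [], []
    for word, label_id in zip(words, label_ids):
        tag = ID2LABEL[label_id]
        if tag == "B":
            # A new phrase starts; close the previous one if it is still open.
            if current:
                phrases.append(" ".join(current))
            current = [word]
        elif tag == "I" and current:
            current.append(word)
        else:
            # "O", or a stray "I" with no open phrase: close any open phrase.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Hypothetical predictions for the README's example sentence.
words = ["Information-based", "semantics", "is", "a", "theory",
         "in", "the", "philosophy", "of", "mind", "."]
label_ids = [1, 2, 0, 0, 0, 0, 0, 1, 2, 2, 0]
print(bio_to_keyphrases(words, label_ids))
# ['Information-based semantics', 'philosophy of mind']
```

A stray `I` with no preceding `B` is dropped here. Promoting it to the start of a new phrase is an equally defensible choice and only requires changing the `else` branch.
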

model (2).safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bb746d06f9e1af11b95f06a2ba6440f24907a80bff33075dabea15b969fb880e
+size 215467198
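
Note on the pointer size: the LFS pointer records only a hash and a byte count, but the count can be checked against the config as a rough sanity test. The sketch below counts parameters from the `config.json` values, assuming the standard BERT layout with no pooler layer, a single linear tagging head, and 2 bytes per parameter for `float16`. The function name is hypothetical, and the interpretation of the leftover bytes as the safetensors header is an assumption.

```python
def bert_token_classifier_params(vocab_size, hidden, layers, intermediate,
                                 max_positions, type_vocab, num_labels):
    """Count parameters of a standard BERT encoder plus a linear tagging head."""
    layer_norm = 2 * hidden  # weight + bias
    # Word, position, and token-type embeddings followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections followed by a LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (up and down projection) followed by a LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    classifier = hidden * num_labels + num_labels
    return embeddings + layers * (attention + feed_forward) + classifier


# Values taken from the config.json diff in this commit.
params = bert_token_classifier_params(
    vocab_size=28996, hidden=768, layers=12, intermediate=3072,
    max_positions=512, type_vocab=2, num_labels=3,
)
payload_bytes = params * 2   # float16 stores 2 bytes per parameter
file_bytes = 215467198       # size recorded in the LFS pointer
print(params, payload_bytes, file_bytes - payload_bytes)
# 107721987 215443974 23224
```

Under these assumptions the tensor data accounts for all but about 23 KB of the file, which is plausible for a safetensors metadata header and consistent with `"torch_dtype": "float16"` in the config.
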

special_tokens_map (1).json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

tokenizer (1).json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config (1).json ADDED

	@@ -0,0 +1,56 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "[CLS]",
+  "do_lower_case": false,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
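
Note on `model_max_length`: with a limit of 512 positions and two slots taken by `[CLS]` and `[SEP]`, at most 510 WordPiece ids from a document fit into one input. The sketch below illustrates that arithmetic using the ids from `added_tokens_decoder`. It is a simplified stand-in for what the tokenizer does for a single sequence when truncation is enabled, and the helper name `build_inputs` is hypothetical.

```python
CLS_ID, SEP_ID = 101, 102   # ids from added_tokens_decoder above
MODEL_MAX_LENGTH = 512      # model_max_length from tokenizer_config.json


def build_inputs(piece_ids, max_length=MODEL_MAX_LENGTH):
    """Wrap WordPiece ids as [CLS] ... [SEP], truncating to max_length."""
    body = piece_ids[: max_length - 2]  # reserve two slots for [CLS] and [SEP]
    return [CLS_ID] + body + [SEP_ID]


# Hypothetical document of 600 WordPiece ids.
long_document = list(range(1000, 1600))
input_ids = build_inputs(long_document)
print(len(input_ids), input_ids[0], input_ids[-1])
# 512 101 102
```

Any labels aligned to the truncated tail are lost, so longer documents would need to be split into windows if every token should be tagged.
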

vocab (1).txt ADDED

The diff for this file is too large to render. See raw diff