# Paraphrase Detection Pipeline using Transformers
|
|
|
This repository provides a complete pipeline for fine-tuning a transformer model for **paraphrase detection** on the PAWS dataset.
|
|
|
---

## Steps
|
|
|
### 1. Load Dataset

Load the PAWS dataset, which contains sentence pairs labeled as paraphrases (1) or non-paraphrases (0).
|
|
|
```python
from datasets import load_dataset

# Returns a DatasetDict with train, validation, and test splits.
dataset = load_dataset("paws", "labeled_final")
```
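
Before training, it can help to check how balanced the two classes are. The sketch below counts labels over any iterable of PAWS-style records. The helper name `label_distribution` and the three toy rows are illustrative assumptions, not part of the pipeline; the same call should also work on `dataset["train"]`, since iterating a split yields dictionaries with a `label` field.

```python
from collections import Counter


def label_distribution(rows):
    """Return the fraction of examples per label (0 = not paraphrase, 1 = paraphrase)."""
    counts = Counter(int(row["label"]) for row in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in sorted(counts.items())}


# Hypothetical toy rows that mimic the PAWS fields; real data comes from load_dataset.
toy_rows = [
    {"sentence1": "A flew from B to C.", "sentence2": "A flew from C to B.", "label": 0},
    {"sentence1": "How old are you?", "sentence2": "What is your age?", "label": 1},
    {"sentence1": "He sold her the car.", "sentence2": "She sold him the car.", "label": 0},
]

print(label_distribution(toy_rows))
```

A strongly skewed distribution would be a reason to report more than accuracy during evaluation.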
|
|
|
### 2. Preprocess and Tokenize

Tokenize each sentence pair as a single input, truncating and padding to 128 tokens.
|
|
|
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-MiniLM-L6-v2")


def preprocess_function(examples):
    # Encode each (sentence1, sentence2) pair as one input sequence.
    return tokenizer(
        examples["sentence1"],
        examples["sentence2"],
        truncation=True,
        padding="max_length",
        max_length=128,
    )


tokenized_datasets = dataset.map(preprocess_function, batched=True)
```
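
`max_length=128` is a fixed choice, and whether it truncates many pairs depends on the data. One rough way to check is to tokenize without padding or truncation, collect the sequence lengths, and look at a high percentile. The helpers below (`percentile`, `truncated_fraction`) are hypothetical names, and the lengths are made-up numbers for illustration.

```python
import math


def percentile(lengths, q):
    """Nearest-rank percentile of token lengths, with q in (0, 100]."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def truncated_fraction(lengths, max_length):
    """Fraction of sequences longer than max_length (these would be truncated)."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    return sum(1 for n in lengths if n > max_length) / len(lengths)


# Made-up token counts for ten sentence pairs, for illustration only.
toy_lengths = [34, 41, 47, 52, 55, 58, 63, 71, 96, 140]

print(percentile(toy_lengths, 95))
print(truncated_fraction(toy_lengths, 128))
```

With the tokenizer from this step, real lengths could be gathered from the `input_ids` returned by calling `tokenizer` on the two sentence columns without `padding` or `truncation`.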
|
|
|
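### 3. Load Model

Load the pre-trained MiniLM encoder with a sequence classification head for two labels. The classification head is newly initialized, so the model needs the fine-tuning in step 4 before its predictions are meaningful.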
|
|
|
```python
from transformers import AutoModelForSequenceClassification

# num_labels=2: label 0 = not a paraphrase, label 1 = paraphrase.
model = AutoModelForSequenceClassification.from_pretrained(
    "sentence-transformers/paraphrase-MiniLM-L6-v2", num_labels=2
)
```
|
|
|
### 4. Fine-tune the Model

Set up the training arguments and fine-tune the model with the `Trainer` API.
|
|
|
```python
from transformers import TrainingArguments, Trainer
import evaluate

training_args = TrainingArguments(
    output_dir="./paraphrase-detector",
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

accuracy = evaluate.load("accuracy")


def compute_metrics(eval_preds):
    # eval_preds unpacks into the raw logits and the reference labels.
    logits, labels = eval_preds
    predictions = logits.argmax(axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
# Save the model to the directory that the inference step loads.
trainer.save_model("paraphrase-detector")
```
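
Accuracy alone can hide how the model treats the paraphrase class. A possible drop-in replacement for `compute_metrics` that also reports precision, recall, and F1 for label 1 is sketched below in plain Python. It is an illustration rather than this repository's method, and the toy logits are made up. Because it still returns an `accuracy` key, `metric_for_best_model="accuracy"` would keep working.

```python
def compute_metrics(eval_preds):
    """Accuracy plus precision, recall, and F1 for the positive class (label 1)."""
    logits, labels = eval_preds
    # Argmax over each row of logits; works for nested lists and NumPy arrays alike.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    labels = [int(label) for label in labels]

    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == 1 and y == 1)
    fp = sum(1 for p, y in pairs if p == 1 and y == 0)
    fn = sum(1 for p, y in pairs if p == 0 and y == 1)
    correct = sum(1 for p, y in pairs if p == y)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(labels) if labels else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical logits for four sentence pairs, with their reference labels.
toy_logits = [[2.0, -1.0], [0.1, 1.5], [0.3, 0.9], [1.2, 0.4]]
toy_labels = [0, 1, 0, 1]
print(compute_metrics((toy_logits, toy_labels)))
```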
|
|
|
### 5. Evaluate

Evaluate the fine-tuned model on the validation split.
|
|
|
```python
eval_results = trainer.evaluate()
print(eval_results)
```
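
`trainer.evaluate()` returns a flat dictionary whose keys carry an `eval_` prefix by default, for example `eval_loss` and `eval_accuracy`. A small helper for printing it as aligned lines might look like the following. `format_metrics` is a hypothetical name, and the numbers in the toy dictionary are placeholders, not results from this pipeline.

```python
def format_metrics(results, prefix="eval_"):
    """Render a metrics dict as aligned 'name : value' lines, dropping the prefix."""
    cleaned = {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in results.items()
    }
    width = max((len(name) for name in cleaned), default=0)
    lines = []
    for name, value in cleaned.items():
        text = f"{value:.4f}" if isinstance(value, float) else str(value)
        lines.append(f"{name:<{width}} : {text}")
    return "\n".join(lines)


# Placeholder values for illustration only; these are not measured results.
toy_results = {"eval_loss": 0.5, "eval_accuracy": 0.75, "epoch": 3.0}
print(format_metrics(toy_results))
```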
|
|
|
### 6. Inference

Use the fine-tuned model to classify new sentence pairs.
|
|
|
```python
from transformers import pipeline

paraphrase_pipeline = pipeline("text-classification", model="paraphrase-detector", tokenizer=tokenizer)

# Pass the sentence pair as "text" and "text_pair".
example = paraphrase_pipeline({
    "text": "How old are you?",
    "text_pair": "What is your age?"
})

print(example)
```
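
Since no `id2label` mapping was set when loading the model, the pipeline is expected to report the generic labels `LABEL_0` and `LABEL_1`; in PAWS, label 1 marks a paraphrase. The sketch below maps a prediction to a readable verdict. `interpret_prediction`, the `threshold` parameter, and the sample output are illustrative assumptions. The helper accepts either a single dictionary or a one-element list, because the return shape may vary with input type and library version.

```python
def interpret_prediction(output, threshold=0.5):
    """Map a text-classification result to (is_paraphrase, paraphrase_probability)."""
    result = output[0] if isinstance(output, list) else output
    label, score = result["label"], float(result["score"])
    # Assumes the score is a softmax probability for the predicted label, so in
    # the binary case the other class has probability 1 - score.
    prob_paraphrase = score if label == "LABEL_1" else 1.0 - score
    return prob_paraphrase >= threshold, prob_paraphrase


# Hypothetical pipeline output, for illustration only.
sample_output = [{"label": "LABEL_1", "score": 0.91}]
print(interpret_prediction(sample_output))
```

Raising `threshold` trades recall for precision on the paraphrase class, which may suit applications where false positives are costly.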
|
|
|
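---

## Requirements

- `datasets`
- `transformers`
- `evaluate`
- `torch` (the PyTorch backend used by the model and `Trainer`)
- `accelerate` (required by `Trainer` in recent `transformers` releases)

Install dependencies with:

```bash
pip install datasets transformers evaluate torch accelerate
```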
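
To confirm the environment before running the steps, a quick check along these lines can report which packages are importable. It uses only the standard library, and `missing_packages` is a hypothetical helper name.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Extend this list with any other packages your setup needs.
required = ["datasets", "transformers", "evaluate"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```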
|
|
|
---

## Author

Your Name - [email protected]

---

## License

MIT License