Paraphrase Detection Pipeline using Transformers
This repository provides a complete pipeline for fine-tuning a transformer model for paraphrase detection on the PAWS dataset.
Steps
1. Load Dataset
Load the PAWS dataset, which contains sentence pairs labeled 1 (paraphrase) or 0 (not a paraphrase). The "labeled_final" configuration provides train, validation, and test splits.
from datasets import load_dataset
dataset = load_dataset("paws", "labeled_final")
2. Preprocess and Tokenize
Tokenize each sentence pair as a single input, truncating or padding every example to 128 tokens.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/paraphrase-MiniLM-L6-v2")
def preprocess_function(examples):
    # Encode sentence1 and sentence2 together as one paired input
    return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, padding="max_length", max_length=128)
tokenized_datasets = dataset.map(preprocess_function, batched=True)
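To make the effect of truncation=True and padding="max_length" on a sentence pair concrete, here is a small dependency-free sketch. It is a simplified illustration, not the tokenizer's actual implementation: the encode_pair helper is hypothetical, it works on pre-split words rather than subword tokens, and it assumes a BERT-style [CLS] A [SEP] B [SEP] layout with tokens trimmed one at a time from the longer sentence.

```python
def encode_pair(tokens_a, tokens_b, max_length=8, pad="[PAD]"):
    """Toy pair encoding: longest-first truncation, then right padding."""
    if max_length < 3:
        raise ValueError("max_length must leave room for [CLS] and two [SEP] tokens")
    a, b = list(tokens_a), list(tokens_b)
    budget = max_length - 3  # space left after the three special tokens
    # Drop one token at a time from the longer sentence (the second one on ties).
    while len(a) + len(b) > budget:
        if len(a) > len(b):
            a.pop()
        else:
            b.pop()
    tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
    attention_mask = [1] * len(tokens)
    # Pad on the right up to max_length; padded positions get mask 0.
    n_pad = max_length - len(tokens)
    return tokens + [pad] * n_pad, attention_mask + [0] * n_pad


short_tokens, short_mask = encode_pair(["hi"], ["hello"], max_length=8)
long_tokens, long_mask = encode_pair(
    ["how", "old", "are", "you", "?"],
    ["what", "is", "your", "age", "?"],
    max_length=10,
)
print(short_tokens, short_mask)
print(long_tokens, long_mask)
```

Every encoded example ends up with exactly max_length positions, which is why the pipeline can batch them without a separate padding step.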
3. Load Model
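Load the pre-trained MiniLM encoder with a two-class sequence classification head. The checkpoint is a sentence-embedding model, so the classification head is newly initialized and only becomes useful after fine-tuning.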
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("sentence-transformers/paraphrase-MiniLM-L6-v2", num_labels=2)
4. Fine-tune the Model
Set up the training arguments and fine-tune the model with the Trainer API.
from transformers import TrainingArguments, Trainer
import evaluate
training_args = TrainingArguments(
    output_dir="./paraphrase-detector",
    # Newer transformers releases rename this argument to eval_strategy
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    weight_decay=0.01,
    # Reload the checkpoint with the highest validation accuracy after training
    load_best_model_at_end=True,
    metric_for_best_model="accuracy"
)
accuracy = evaluate.load("accuracy")
def compute_metrics(eval_preds):
    logits, labels = eval_preds
    # Pick the class with the highest logit for each example
    predictions = logits.argmax(axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    # Newer transformers releases rename this argument to processing_class
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.save_model("paraphrase-detector")
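As a rough guide to how long this configuration trains, the number of optimizer steps can be estimated from the dataset size, batch size, and epoch count. The sketch below assumes a single device, no gradient accumulation, and a final partial batch that is kept; the figure of 49,401 training pairs is the size commonly reported for the "labeled_final" train split and should be checked against len(tokenized_datasets["train"]). The helper names are hypothetical.

```python
import math


def steps_per_epoch(num_examples, batch_size):
    """Batches per epoch when the last, smaller batch is kept."""
    return math.ceil(num_examples / batch_size)


def total_training_steps(num_examples, batch_size, num_epochs):
    """Optimizer steps for the whole run (single device, no accumulation)."""
    return steps_per_epoch(num_examples, batch_size) * num_epochs


# Assumed size of the PAWS "labeled_final" train split; verify locally.
NUM_TRAIN_EXAMPLES = 49_401

per_epoch = steps_per_epoch(NUM_TRAIN_EXAMPLES, batch_size=16)
total = total_training_steps(NUM_TRAIN_EXAMPLES, batch_size=16, num_epochs=3)
print(per_epoch, total)
```

With evaluation_strategy and save_strategy both set to "epoch", the run evaluates and writes a checkpoint once per epoch, so three times in total under these settings.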
5. Evaluate
Evaluate the fine-tuned model on the validation split, which is the eval_dataset passed to the Trainer.
eval_results = trainer.evaluate()
print(eval_results)
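The returned dictionary reports metrics with an eval_ prefix, so the accuracy from compute_metrics appears as eval_accuracy. The following dependency-free sketch shows the argmax-then-compare computation behind that number on a few made-up logits; the values and the accuracy_from_logits helper are illustrative only and are not output from the real model.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for four sentence pairs: [not paraphrase, paraphrase]
toy_logits = [[2.0, -1.0], [0.1, 0.3], [-0.5, 1.5], [1.2, 0.9]]
toy_labels = [0, 1, 0, 0]

toy_result = accuracy_from_logits(toy_logits, toy_labels)
print(toy_result)
```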
6. Inference
Load the saved model from the paraphrase-detector directory and classify a sentence pair.
from transformers import pipeline
paraphrase_pipeline = pipeline("text-classification", model="paraphrase-detector", tokenizer=tokenizer)
example = paraphrase_pipeline({
    "text": "How old are you?",
    "text_pair": "What is your age?"
})
print(example)
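Because the model is loaded with only num_labels=2, the pipeline will most likely report the generic labels LABEL_0 and LABEL_1 rather than readable names. A minimal sketch of mapping them back to the PAWS meaning (1 = paraphrase) follows; the readable_prediction helper is hypothetical, and the sample predictions are invented for illustration rather than taken from a real run.

```python
# Assumed mapping from default label names to PAWS classes.
LABEL_NAMES = {"LABEL_0": "not paraphrase", "LABEL_1": "paraphrase"}


def readable_prediction(prediction):
    """Replace a generic label with a readable one.

    Accepts a single result dict or a one-element list of dicts,
    since the pipeline's return shape depends on how it is called.
    """
    if isinstance(prediction, list):
        prediction = prediction[0]
    label = LABEL_NAMES.get(prediction["label"], prediction["label"])
    return {"label": label, "score": prediction["score"]}


# Invented pipeline-style outputs, for illustration only.
sample_dict = {"label": "LABEL_1", "score": 0.97}
sample_list = [{"label": "LABEL_0", "score": 0.88}]

print(readable_prediction(sample_dict))
print(readable_prediction(sample_list))
```

An alternative is to store the names in the model configuration when loading the model, so that the pipeline emits them directly.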
Requirements
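datasets
transformers
evaluate
torch (backend for the model and Trainer)
accelerate (required by the Trainer in recent transformers releases)
scikit-learn (used by the accuracy metric in evaluate)
Install dependencies with:
pip install datasets transformers evaluate torch accelerate scikit-learn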
Author
Your Name - [email protected]
License
MIT License