# Paraphrase Detection with RoBERTa-base
## Overview
This repository hosts a quantized version of the RoBERTa-base model fine-tuned for paraphrase detection. The model determines whether two sentences convey the same meaning: if they do, it outputs "Duplicate" with a confidence score; otherwise, it outputs "Not Duplicate" with a confidence score. The model has been optimized for efficient deployment while maintaining reasonable accuracy, making it suitable for real-time applications.
## Model Details
- **Model Architecture:** RoBERTa-base
- **Task:** Paraphrase Detection
- **Dataset:** Hugging Face's `quora-question-pairs`
- **Quantization:** Float16 (FP16) for optimized inference
- **Fine-tuning Framework:** Hugging Face Transformers
## Usage
### Installation
```bash
pip install transformers torch
```
### Loading the Model
```python
from transformers import RobertaTokenizer, RobertaForSequenceClassification
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "AventIQ-AI/roberta-paraphrase-detection"
tokenizer = RobertaTokenizer.from_pretrained(model_name)
model = RobertaForSequenceClassification.from_pretrained(model_name).to(device)
```
### Paraphrase Detection Inference
```python
def predict_paraphrase(sentence1, sentence2, threshold=0.96):
    inputs = tokenizer(sentence1, sentence2, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    probabilities = torch.nn.functional.softmax(logits, dim=-1)
    predicted_class = torch.argmax(probabilities, dim=-1).item()
    confidence = probabilities[0][predicted_class].item()
    label_map = {0: "Not Duplicate", 1: "Duplicate"}
    # Require high confidence before accepting a "Duplicate" prediction
    if predicted_class == 1 and confidence < threshold:
        return {"sentence1": sentence1, "sentence2": sentence2, "predicted_label": "Not Duplicate", "confidence": confidence}
    return {"sentence1": sentence1, "sentence2": sentence2, "predicted_label": label_map[predicted_class], "confidence": confidence}

# Test examples
test_cases = [
    ("The sun rises in the east.", "The east is where the sun rises."),        # Duplicate
    ("She enjoys playing the piano.", "She loves playing musical instruments."),  # Duplicate
    ("I had a great time at the party.", "The event was really fun."),         # Duplicate
    ("The sky is blue.", "Bananas are yellow."),                               # Not Duplicate
    ("The capital of France is Paris.", "Berlin is the capital of Germany."),  # Not Duplicate
    ("I like reading books.", "She is going for a run."),                      # Not Duplicate
]
for sent1, sent2 in test_cases:
    result = predict_paraphrase(sent1, sent2)
    print(result)
```
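For larger workloads, many sentence pairs can be scored in one batched forward pass instead of calling `predict_paraphrase` per pair. The sketch below shows only the batched post-processing on placeholder logits; the shape `(batch_size, num_labels)` matches what `model(**inputs).logits` returns for this model, but the tensor values here are illustrative, not real model output:

```python
import torch

# Placeholder logits for a batch of 3 sentence pairs, shaped like
# model(**inputs).logits: (batch_size, num_labels).
logits = torch.tensor([[0.2, 2.1], [1.8, -0.5], [0.0, 0.3]])

probabilities = torch.nn.functional.softmax(logits, dim=-1)
predicted_classes = torch.argmax(probabilities, dim=-1)  # shape (3,)
# Gather the probability of each row's predicted class
confidences = probabilities.gather(1, predicted_classes.unsqueeze(1)).squeeze(1)

label_map = {0: "Not Duplicate", 1: "Duplicate"}
for cls, conf in zip(predicted_classes.tolist(), confidences.tolist()):
    print(label_map[cls], round(conf, 3))
```

The same per-pair confidence threshold from `predict_paraphrase` can then be applied element-wise over `confidences`.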
## Quantized Model Evaluation Results
### Evaluation Metrics
- **Accuracy:** 0.7515
- **Precision:** 0.6697
- **Recall:** 0.5840
- **F1-score:** 0.6022
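Metrics of this kind can be recomputed from model predictions with scikit-learn. The snippet below is a minimal sketch using placeholder gold labels (`y_true`) and predictions (`y_pred`); a real evaluation would fill these from a labeled validation split run through the model:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Placeholder gold labels and model predictions (1 = duplicate, 0 = not duplicate).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy: ", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1-score: ", f1_score(y_true, y_pred))
```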
## Quantization Details
Post-training quantization was applied using PyTorch's built-in quantization framework. The model was quantized to Float16 (FP16) to reduce model size and improve inference efficiency while balancing accuracy.
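FP16 conversion of this kind can be done in PyTorch with `Module.half()`, which casts every parameter to `torch.float16`. A minimal sketch on a toy `nn.Linear` (a stand-in for the full model, which would be converted the same way after `from_pretrained(...)`):

```python
import torch
import torch.nn as nn

# Toy stand-in for the fine-tuned model; a real run would call
# RobertaForSequenceClassification.from_pretrained(...).half() instead.
model_fp32 = nn.Linear(768, 2)

model_fp16 = model_fp32.half()  # casts all parameters to float16

n_params = sum(p.numel() for p in model_fp16.parameters())
print(model_fp16.weight.dtype)           # torch.float16
print(n_params * 2, "bytes of weights")  # 2 bytes per FP16 parameter
```

Note that FP16 halves storage relative to FP32 but, unlike INT8 quantization, keeps the same floating-point arithmetic, which is why the accuracy drop stays small.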
## Repository Structure
```
.
├── model/               # Contains the quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary files
├── model.safetensors    # Quantized model weights
└── README.md            # Model documentation
```
## Limitations
- The model may struggle with highly nuanced paraphrases.
- Quantization may lead to slight degradation in accuracy compared to full-precision models.
- Performance may vary across different domains and sentence structures.
## Contributing
Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements. |