# RoBERTa-Base Quantized Model for Toxic Comment Classification
This repository hosts a quantized version of the RoBERTa model, fine-tuned for toxic comment classification. The model has been optimized using FP16 quantization for efficient deployment without significant accuracy loss.
## Model Details
- **Model Architecture:** RoBERTa Base
- **Task:** Binary Toxic Comment Classification (Toxic/Non-Toxic)
- **Dataset:** Classified_comments
- **Quantization:** Float16
- **Fine-tuning Framework:** Hugging Face Transformers
---
## Installation
```bash
pip install torch transformers datasets scikit-learn
```
---
## Loading the Model
```python
import re

import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load tokenizer and model (point this at the fine-tuned checkpoint,
# e.g. the quantized-model/ directory from this repository)
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2).to(device)

# Define test sentences
new_comments = [
    "I hate you so much, you are disgusting.",
    "What a terrible idea. Just awful.",
    "You are looking beautiful today"
]

def predict_comments(texts, model, tokenizer):
    # If a single string is passed, convert it to a list
    if isinstance(texts, str):
        texts = [texts]

    # Preprocess (same cleaning steps as used during training)
    def preprocess(text):
        text = text.lower()
        text = re.sub(r"http\S+|www\S+|https\S+", '', text)   # strip URLs
        text = re.sub(r'\@\w+|\#', '', text)                  # strip mentions and hashtags
        text = re.sub(r"[^a-zA-Z0-9\s.,!?']", '', text)       # keep only basic punctuation
        text = re.sub(r'\s+', ' ', text).strip()              # collapse whitespace
        return text

    cleaned_texts = [preprocess(text) for text in texts]

    # Tokenize and move inputs to the model's device (CPU/GPU)
    inputs = tokenizer(cleaned_texts, padding=True, truncation=True, return_tensors="pt").to(device)

    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=1).tolist()

    # Map class indices to labels
    label_map = {0: "Non-Toxic", 1: "Toxic"}
    return [label_map[pred] for pred in predictions]

print(predict_comments(new_comments, model, tokenizer))
```
---
## Performance Metrics
- **Accuracy:** 0.979737
- **Precision:** 0.976084
- **Recall:** 0.984133
- **F1 Score:** 0.980092
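
Figures of this kind can be computed on the held-out test split with scikit-learn; a minimal sketch, assuming `y_true` and `y_pred` hold the gold and predicted 0/1 labels:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_true / y_pred: lists of 0/1 labels for the held-out test split (assumed)
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))
```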
---
## Fine-Tuning Details
### Dataset
The dataset is sourced from Kaggle (`Classified_comment.csv`). It contains 140,000 labeled comments (Toxic or Non-Toxic).
The original training and testing sets were merged, shuffled, and re-split using an 80/20 ratio.
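
A minimal sketch of the merge-and-resplit step; the `comment` and `label` column names are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Kaggle data (the original train/test sets are assumed
# to be merged into this single file)
df = pd.read_csv("Classified_comment.csv")

# Shuffle and re-split 80/20
train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=42)
```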
### Training
- **Epochs:** 3
- **Batch size:** 8
- **Learning rate:** 2e-5
- **Evaluation strategy:** `epoch`
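
A minimal sketch of such a run with the Hugging Face `Trainer`, reusing `model` from the loading example; `train_dataset` and `eval_dataset` are assumed to be pre-tokenized splits (hypothetical names):

```python
from transformers import TrainingArguments, Trainer

# Hyperparameters from the list above
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,   # assumed pre-tokenized training split
    eval_dataset=eval_dataset,     # assumed pre-tokenized evaluation split
)
trainer.train()
```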
---
## Quantization
Post-training quantization was applied using PyTorch's `half()` conversion to FP16 precision to reduce model size and inference time.
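
A minimal sketch of this step, reusing the fine-tuned `model` and `tokenizer` from above (the output directory matches the repository structure below):

```python
# Cast all weights to half precision (FP16) and save the smaller checkpoint
model = model.half()
model.save_pretrained("quantized-model")
tokenizer.save_pretrained("quantized-model")
```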
---
## Repository Structure
```
.
├── quantized-model/           # Contains the quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
└── README.md                  # Model documentation
```
---
## Limitations
- The model is trained specifically for binary toxic comment classification and may not generalize to other tasks.
- FP16 quantization may result in slight numerical instability in edge cases.
---
## Contributing
Feel free to open issues or submit pull requests to improve the model or documentation.