
# RoBERTa-Base Quantized Model for Toxic Comment Classification

This repository hosts a quantized version of the RoBERTa model, fine-tuned for toxic comment classification. The model has been converted to FP16 (half precision) for efficient deployment without significant accuracy loss.

## Model Details

- **Model Architecture:** RoBERTa Base
- **Task:** Binary Toxic Comment Classification (Toxic/Non-Toxic)
- **Dataset:** Classified_comments
- **Quantization:** Float16
- **Fine-tuning Framework:** Hugging Face Transformers



---



## Installation



```bash
pip install torch transformers datasets scikit-learn
```



---



## Loading the Model



```python
import re

import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Use a GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load tokenizer and model.
# NOTE: "roberta-base" is the base checkpoint. To use the fine-tuned weights,
# point from_pretrained at this repository's quantized-model/ directory instead.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2).to(device)

# Define test sentences
new_comments = [
    "I hate you so much, you are disgusting.",
    "What a terrible idea. Just awful.",
    "You are looking beautiful today",
]


# Tokenize and predict
def predict_comments(texts, model, tokenizer):
    # If a single string is passed, convert to list
    if isinstance(texts, str):
        texts = [texts]

    # Preprocess (same as training)
    def preprocess(text):
        text = text.lower()
        text = re.sub(r"http\S+|www\S+|https\S+", '', text)  # strip URLs
        text = re.sub(r'\@\w+|\#', '', text)  # strip @mentions and '#' symbols
        text = re.sub(r"[^a-zA-Z0-9\s.,!?']", '', text)  # drop other special characters
        text = re.sub(r'\s+', ' ', text).strip()  # collapse whitespace
        return text

    cleaned_texts = [preprocess(text) for text in texts]

    # Tokenize and move the tensors to the model's device (CPU/GPU)
    inputs = tokenizer(cleaned_texts, padding=True, truncation=True, return_tensors="pt").to(device)

    # Run inference without tracking gradients
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = torch.argmax(outputs.logits, dim=1).tolist()

    # Map predictions
    label_map = {0: "Non-Toxic", 1: "Toxic"}
    return [label_map[pred] for pred in predictions]


# Classify the test sentences
for comment, label in zip(new_comments, predict_comments(new_comments, model, tokenizer)):
    print(f"{label}: {comment}")
```



---



## Performance Metrics



- **Accuracy:** 0.979737  

- **Precision:** 0.976084 

- **Recall:** 0.984133  

- **F1 Score:** 0.980092  
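
The F1 score is the harmonic mean of precision and recall, so the reported figures can be cross-checked against each other. A minimal sketch using only the values listed above (the helper name `f1_from` is illustrative, not part of this repository):

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


precision = 0.976084  # reported above
recall = 0.984133     # reported above

f1 = f1_from(precision, recall)
print(round(f1, 6))  # 0.980092, matching the reported F1 score
```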



---



## Fine-Tuning Details



### Dataset



The dataset is sourced from Kaggle (`Classified_comment.csv`). It contains 140,000 labeled comments (Toxic or Non-Toxic).

The original training and testing sets were merged, shuffled, and re-split using an 80/20 ratio.
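
The merge, shuffle, and 80/20 re-split described here could look roughly like the sketch below. The function name, the seed, and the toy data are assumptions for illustration; the tooling and seed actually used are not stated in this README.

```python
import random


def merge_shuffle_split(train_rows, test_rows, train_fraction=0.8, seed=42):
    """Merge two lists of examples, shuffle them, and split into train/test."""
    rows = list(train_rows) + list(test_rows)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Toy stand-in data: (comment, label) pairs, not the real dataset
original_train = [(f"comment {i}", i % 2) for i in range(7)]
original_test = [(f"comment {i}", i % 2) for i in range(7, 10)]

train, test = merge_shuffle_split(original_train, original_test)
print(len(train), len(test))  # 8 2
```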



### Training



- **Epochs:** 3  

- **Batch size:** 8  

- **Learning rate:** 2e-5  

- **Evaluation strategy:** `epoch`  
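
To get a feel for the size of this training run, the number of optimizer steps can be estimated from the figures above. This is a back-of-the-envelope sketch that assumes the 140,000 comments are the merged total, that the 80% portion is the training set, and that training ran on a single device without gradient accumulation; none of these are confirmed by this README.

```python
import math

num_examples = 140_000  # dataset size stated in the Dataset section
train_percent = 80      # 80/20 split (assumed: the 80% portion is used for training)
batch_size = 8
epochs = 3

train_examples = num_examples * train_percent // 100
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(train_examples, steps_per_epoch, total_steps)  # 112000 14000 42000
```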



---



## Quantization



Post-training quantization was applied by casting the model weights to half precision (FP16) with PyTorch’s `half()`, which reduces model size and inference time.
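
The effect of `half()` on parameter storage can be seen on a small module. The sketch below uses a hypothetical stand-in network rather than the fine-tuned model, and `parameter_bytes` is an illustrative helper:

```python
import torch
from torch import nn


def parameter_bytes(model: nn.Module) -> int:
    """Total bytes occupied by the model's parameters."""
    return sum(p.numel() * p.element_size() for p in model.parameters())


# Hypothetical stand-in for the classifier (not the real fine-tuned model)
model = nn.Sequential(nn.Linear(768, 768), nn.Tanh(), nn.Linear(768, 2))

fp32_bytes = parameter_bytes(model)
model = model.half()  # cast floating-point parameters and buffers to FP16
fp16_bytes = parameter_bytes(model)

print(fp32_bytes, fp16_bytes)  # the FP16 figure is half the FP32 one
```

A Transformers model is also a `torch.nn.Module`, so the same `half()` call applies to it; the exact export steps used for this repository are not stated here.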



---



## Repository Structure



```
.
├── quantized-model/               # Contains the quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
└── README.md                      # Model documentation
```

---

## Limitations

- The model is trained specifically for binary toxic comment classification (Toxic vs. Non-Toxic).
- FP16 quantization may result in slight numerical instability in edge cases.
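
To illustrate the kind of rounding that FP16 introduces, the sketch below round-trips a few Python floats through IEEE 754 half precision using only the standard library. The input values are chosen for illustration and are not measurements from this model.

```python
import struct


def to_fp16(value: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("e", struct.pack("e", value))[0]


print(struct.calcsize("e"), struct.calcsize("f"))  # 2 4 (bytes per value)
print(to_fp16(0.1))     # 0.0999755859375 (limited precision)
print(to_fp16(2049.0))  # 2048.0 (not every integer above 2048 is representable)
print(to_fp16(1e-8))    # 0.0 (very small values underflow to zero)
```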


---

## Contributing

Feel free to open issues or submit pull requests to improve the model or documentation.