# RoBERTa-Base Quantized Model for Toxic Comment Classification
This repository hosts a quantized version of the RoBERTa model, fine-tuned for toxic comment classification. The model has been optimized using FP16 quantization for efficient deployment without significant accuracy loss.
## Model Details
- **Model Architecture:** RoBERTa Base
- **Task:** Binary Toxic Comment Classification (Toxic/Non-Toxic)
- **Dataset:** Classified_comments
- **Quantization:** Float16
- **Fine-tuning Framework:** Hugging Face Transformers
---
## Installation
```bash
pip install torch transformers datasets scikit-learn
```
---
## Loading the Model
```python
import re

import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load tokenizer and model (point this at the fine-tuned checkpoint,
# e.g. the quantized-model/ directory from this repository)
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2).to(device)

# Define test sentences
new_comments = [
    "I hate you so much, you are disgusting.",
    "What a terrible idea. Just awful.",
    "You are looking beautiful today"
]

def predict_comments(texts, model, tokenizer):
    # If a single string is passed, convert it to a list
    if isinstance(texts, str):
        texts = [texts]

    # Preprocess (same cleaning steps as used during training)
    def preprocess(text):
        text = text.lower()
        text = re.sub(r"http\S+|www\S+|https\S+", '', text)   # strip URLs
        text = re.sub(r'\@\w+|\#', '', text)                  # strip mentions and hashtags
        text = re.sub(r"[^a-zA-Z0-9\s.,!?']", '', text)       # keep only basic punctuation
        text = re.sub(r'\s+', ' ', text).strip()              # collapse whitespace
        return text

    cleaned_texts = [preprocess(text) for text in texts]

    # Tokenize and move inputs to the model's device (CPU/GPU)
    inputs = tokenizer(cleaned_texts, padding=True, truncation=True, return_tensors="pt").to(device)

    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=1).tolist()

    # Map class indices to labels
    label_map = {0: "Non-Toxic", 1: "Toxic"}
    return [label_map[pred] for pred in predictions]

print(predict_comments(new_comments, model, tokenizer))
```
---
## Performance Metrics
- **Accuracy:** 0.979737
- **Precision:** 0.976084
- **Recall:** 0.984133
- **F1 Score:** 0.980092
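
Figures of this kind can be computed on the held-out test split with scikit-learn; a minimal sketch, assuming `y_true` and `y_pred` hold the gold and predicted 0/1 labels:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_true / y_pred: lists of 0/1 labels for the held-out test split (assumed)
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))
```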
---
## Fine-Tuning Details
### Dataset
The dataset is sourced from Kaggle (`Classified_comment.csv`). It contains 140,000 labeled comments (Toxic or Non-Toxic).
The original training and testing sets were merged, shuffled, and re-split using an 80/20 ratio.
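
A minimal sketch of the merge-and-resplit step; the `comment` and `label` column names are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Kaggle data (the original train/test sets are assumed
# to be merged into this single file)
df = pd.read_csv("Classified_comment.csv")

# Shuffle and re-split 80/20
train_df, test_df = train_test_split(df, test_size=0.2, shuffle=True, random_state=42)
```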
### Training
- **Epochs:** 3
- **Batch size:** 8
- **Learning rate:** 2e-5
- **Evaluation strategy:** `epoch`
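
A minimal sketch of such a run with the Hugging Face `Trainer`, reusing `model` from the loading example; `train_dataset` and `eval_dataset` are assumed to be pre-tokenized splits (hypothetical names):

```python
from transformers import TrainingArguments, Trainer

# Hyperparameters from the list above
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,   # assumed pre-tokenized training split
    eval_dataset=eval_dataset,     # assumed pre-tokenized evaluation split
)
trainer.train()
```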
---
## Quantization
Post-training quantization was applied using PyTorch's `half()` conversion to FP16 precision to reduce model size and inference time.
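
A minimal sketch of this step, reusing the fine-tuned `model` and `tokenizer` from above (the output directory matches the repository structure below):

```python
# Cast all weights to half precision (FP16) and save the smaller checkpoint
model = model.half()
model.save_pretrained("quantized-model")
tokenizer.save_pretrained("quantized-model")
```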
---
## Repository Structure
```
.
├── quantized-model/           # Contains the quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
└── README.md                  # Model documentation
```
---
## Limitations
- The model is trained specifically for binary toxic comment classification and may not generalize to other tasks.
- FP16 quantization may result in slight numerical instability in edge cases.
---
## Contributing
Feel free to open issues or submit pull requests to improve the model or documentation.