AventIQ-AI/distilbert-sms-messaging-spam-detection


nimishgarg committed on May 1, 2025

Commit 929f617 · verified · 1 Parent(s): 7d1b62e

Create README.md

Files changed (1)

README.md +112 -0


	@@ -0,0 +1,112 @@

+# DistilBERT-Base-Uncased Quantized Model for Spam Detection
+This repository hosts a quantized version of the DistilBERT model, fine-tuned for spam classification on a labeled SMS dataset. The weights are stored in FP16 (half precision), which roughly halves the model size relative to FP32 for efficient deployment without significant accuracy loss.
+## Model Details
+- **Model Architecture:** DistilBERT Base Uncased
+- **Task:** Binary Spam Classification (Spam/Ham)
+- **Dataset:** SMS Spam Collection
+- **Quantization:** Float16
+- **Fine-tuning Framework:** Hugging Face Transformers
+---
+## Installation
+```bash
+pip install torch transformers datasets scikit-learn
+```
+---
+## Loading the Model
+```python
+import torch
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+# Load the tokenizer and the fine-tuned model from this repository.
+# Loading plain "distilbert-base-uncased" would attach a randomly initialized classifier head.
+# If the files sit under quantized-model/, also pass subfolder="quantized-model".
+model_path = "AventIQ-AI/distilbert-sms-messaging-spam-detection"
+tokenizer = AutoTokenizer.from_pretrained(model_path)
+model = AutoModelForSequenceClassification.from_pretrained(model_path)
+model.eval()
+# Map class indices to human-readable labels
+label_map = {0: "Ham", 1: "Spam"}
+# Define test messages
+texts = [
+    "Congratulations! You have won a free iPhone. Click here to claim your prize.",
+    "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
+]
+# Tokenize and predict
+for text in texts:
+    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
+    # Token IDs and attention mask must be integer tensors (kept as a guard)
+    inputs = {k: v.long() for k, v in inputs.items()}
+    with torch.no_grad():
+        outputs = model(**inputs)
+    predicted_class = torch.argmax(outputs.logits, dim=1).item()
+    print(f"Text: {text}")
+    print(f"Predicted Label: {label_map[predicted_class]}\n")
+```
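The example above reports only the arg-max label. When a confidence score is useful, the two logits can be turned into probabilities with a softmax. A minimal standard-library sketch, using made-up logit values rather than real model output:

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one message, ordered as [Ham, Spam]
example_logits = [-2.0, 3.0]
probs = softmax(example_logits)

label_map = {0: "Ham", 1: "Spam"}
predicted = max(range(len(probs)), key=probs.__getitem__)
print(f"{label_map[predicted]} (p={probs[predicted]:.3f})")
```

With PyTorch tensors, `torch.softmax(outputs.logits, dim=1)` computes the same quantity.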
+---
+## Performance Metrics
+- **Accuracy:** 0.9994
+- **Precision:** 1.0000
+- **Recall:** 0.9955
+- **F1 Score:** 0.9978
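For reference, the four metrics follow from confusion-matrix counts, treating spam as the positive class. The sketch below uses small made-up label lists, not the evaluation data behind the numbers above, and `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels: 1 = spam, 0 = ham (one spam message is missed)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

A precision of 1.0 combined with a recall below 1.0, as reported above, means no ham message was flagged as spam while some spam was missed. scikit-learn's `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` compute the same values.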
+---
+## Fine-Tuning Details
+### Dataset
+The model was fine-tuned on the SMS Spam Collection dataset, in which each message is labeled either "spam" or "ham".
+After custom preprocessing, the data was split into 80% training and 20% validation sets, stratified by label.
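A stratified 80/20 split keeps the spam/ham ratio roughly equal in both sets. The following is a minimal standard-library sketch of the idea, not the original preprocessing code; the helper name `stratified_split`, the seed, and the placeholder messages are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(texts, labels, val_fraction=0.2, seed=42):
    """Split (text, label) pairs so each label keeps about the same share in both sets."""
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    rng = random.Random(seed)
    train, val = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        n_val = round(len(items) * val_fraction)  # per-class validation size
        val.extend((t, label) for t in items[:n_val])
        train.extend((t, label) for t in items[n_val:])

    rng.shuffle(train)
    rng.shuffle(val)
    return train, val

# Toy data: 10 spam and 40 ham placeholders
texts = [f"spam message {i}" for i in range(10)] + [f"ham message {i}" for i in range(40)]
labels = ["spam"] * 10 + ["ham"] * 40

train, val = stratified_split(texts, labels)
print(len(train), len(val))
```

scikit-learn's `train_test_split` accepts a `stratify` argument that performs this kind of split.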
+### Training
+- **Epochs:** 5
+- **Batch size:** 12 (train) / 16 (eval)
+- **Learning rate:** 3e-5
+- **Evaluation strategy:** `epoch`
+- **FP16 Training:** Enabled
+- **Trainer:** Hugging Face `Trainer` API
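The hyperparameters above map onto Hugging Face `TrainingArguments` roughly as sketched below. This is an assumption about how the listed values would be passed, not the original training script: the output directory and the helper name `build_training_args` are placeholders, the batch sizes are assumed to be per device, and the argument names follow recent `transformers` releases.

```python
# Hyperparameters listed above, keyed by TrainingArguments parameter names
HYPERPARAMS = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 12,  # assumes the listed size is per device
    "per_device_eval_batch_size": 16,
    "learning_rate": 3e-5,
    "eval_strategy": "epoch",  # spelled "evaluation_strategy" in older releases
    "fp16": True,              # mixed-precision training; expects a CUDA GPU
}

def build_training_args(output_dir="./results"):
    """Build TrainingArguments from HYPERPARAMS (output_dir is a placeholder)."""
    # Imported here so the settings can be inspected without transformers installed
    from transformers import TrainingArguments
    return TrainingArguments(output_dir=output_dir, **HYPERPARAMS)

print(HYPERPARAMS)
```

The resulting object would be passed to `Trainer` together with the model and the tokenized training and validation sets.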
+---
+## Quantization
+After training, the weights were cast to half precision with `model.to(dtype=torch.float16)`. This halves the storage per weight relative to FP32 and can speed up inference on hardware with native FP16 support.
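To illustrate what a cast to half precision trades away, the standard-library sketch below stores a single value as FP32 and as FP16 and compares size and round-trip error. It illustrates the number format only and does not touch the model; the sample value and the helper name `roundtrip` are arbitrary.

```python
import struct

def roundtrip(value, fmt):
    """Pack a float in the given struct format and read it back."""
    packed = struct.pack(fmt, value)
    return len(packed), struct.unpack(fmt, packed)[0]

weight = 0.1  # arbitrary sample value
size32, value32 = roundtrip(weight, "<f")  # IEEE 754 single precision
size16, value16 = roundtrip(weight, "<e")  # IEEE 754 half precision

print(f"FP32: {size32} bytes, error {abs(value32 - weight):.2e}")
print(f"FP16: {size16} bytes, error {abs(value16 - weight):.2e}")

# FP16 cannot represent magnitudes above 65504; struct raises OverflowError
try:
    struct.pack("<e", 70000.0)
    overflowed = False
except OverflowError:
    overflowed = True
print("70000.0 overflows FP16:", overflowed)
```

Each weight takes half the bytes in FP16, at the cost of a larger rounding error and a much smaller representable range.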
+---
+## Repository Structure
+```text
+.
+├── quantized-model/               # Contains the quantized model files
+│   ├── config.json
+│   ├── model.safetensors
+│   ├── tokenizer_config.json
+│   ├── vocab.txt
+│   └── special_tokens_map.json
+└── README.md                      # Project documentation
+```
+---
+## Limitations
+- The model is trained specifically for binary spam classification on SMS data.
+- Performance might degrade when applied to emails or social media without domain adaptation.
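+- FP16 has a narrower numeric range and lower precision than FP32, so inference might show slight instability on edge cases, such as messages that score close to the decision boundary.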
+---
+## Contributing
+Feel free to open issues or submit pull requests to improve the model, training process, or documentation.