**🧠 SMSDetection-DistilBERT-SMS**

A DistilBERT-based binary classifier fine-tuned on the SMS Spam Collection dataset. It labels each message as either **spam** or **ham** (not spam). It is intended for applications such as mobile SMS spam filters, automated customer message triage, and telecom fraud detection.

---
✨ **Model Highlights**

- Based on `distilbert-base-uncased`
- Fine-tuned on the SMS Spam Collection dataset
- ⚡ Supports binary classification: spam vs. not spam
- 💾 Lightweight and optimized for both CPU and GPU environments

---
🧠 **Intended Uses**

- ✅ Mobile SMS spam filtering
- ✅ Telecom customer service automation
- ✅ Fraudulent message detection
- ✅ User inbox categorization
- ✅ Regulatory compliance monitoring

---
🚫 **Limitations**

- ❌ Trained on English SMS messages only
- ❌ May underperform on emails, social media posts, or non-English content
- ❌ Not designed for multilingual datasets
- ❌ Inputs are truncated at 128 tokens, so a slight performance dip is expected on longer messages

---
🏋️‍♂️ **Training Details**

| Field          | Value                             |
| -------------- | --------------------------------- |
| **Base Model** | `distilbert-base-uncased`         |
| **Dataset**    | SMS Spam Collection (UCI)         |
| **Framework**  | PyTorch with 🤗 Transformers      |
| **Epochs**     | 3                                 |
| **Batch Size** | 16                                |
| **Max Length** | 128 tokens                        |
| **Optimizer**  | AdamW                             |
| **Loss**       | CrossEntropyLoss (sequence-level) |
| **Device**     | Trained on CUDA-enabled GPU       |

---
**Evaluation Metrics**

| Metric    | Score |
| --------- | ----- |
| Accuracy  | 0.99  |
| F1-Score  | 0.96  |
| Precision | 0.98  |
| Recall    | 0.93  |
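
For readers who want to recompute these four metrics on their own held-out split, the sketch below derives them from true and predicted labels. The function name `binary_metrics`, the toy label lists, and the convention that 1 means spam are illustrative assumptions, not part of the released code.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for the positive (spam) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only (1 = spam, 0 = ham); not the model's real outputs
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
```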

---

**Usage**

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "AventIQ-AI/SMS-Spam-Detection-Model"

# The Auto classes load the tokenizer and architecture recorded in the
# checkpoint config, with a sequence-classification head (one label per message).
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Run on GPU when available, otherwise on CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()


def predict_sms(text):
    """Classify a single SMS message as "spam" or "ham"."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, num_labels)
    predicted = torch.argmax(logits, dim=-1).item()
    # Label mapping used in this model card: 1 = spam, 0 = ham
    return "spam" if predicted == 1 else "ham"


# Test example
print(predict_sms("You've won $1,000,000! Call now to claim your prize!"))
```
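
`predict_sms` takes the argmax over the two logits, which for a two-class softmax amounts to a 0.5 cut-off on the spam probability. Where false positives cost more than missed spam (or the reverse), a different cut-off may work better. The sketch below shows that arithmetic in plain Python. The helpers `spam_probability` and `label_with_threshold`, the example logits, and the 0.9 threshold are all hypothetical, and any real threshold would need tuning on validation data.

```python
import math


def spam_probability(logits):
    """Softmax over [ham_logit, spam_logit]; returns the spam probability."""
    ham_logit, spam_logit = logits
    m = max(ham_logit, spam_logit)  # subtract the max for numerical stability
    exp_ham = math.exp(ham_logit - m)
    exp_spam = math.exp(spam_logit - m)
    return exp_spam / (exp_ham + exp_spam)


def label_with_threshold(logits, threshold=0.5):
    """Label as spam only when the spam probability reaches the threshold."""
    return "spam" if spam_probability(logits) >= threshold else "ham"


# Hypothetical logits for one message, ordered [ham, spam]
example_logits = [0.2, 1.4]
print(round(spam_probability(example_logits), 3))
print(label_with_threshold(example_logits))       # default 0.5 cut-off
print(label_with_threshold(example_logits, 0.9))  # stricter cut-off
```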

---

🧩 **Quantization**

- Post-training static quantization was applied with PyTorch to reduce model size and accelerate inference on edge devices.
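
To make the idea concrete, here is a minimal sketch of the affine int8 mapping that static quantization relies on. A calibration pass records the observed value range, a scale and zero point are derived from it, and both are then reused at inference time. This is plain-Python arithmetic for illustration, not PyTorch's implementation, and the helper names and sample values are hypothetical.

```python
def calibrate(values, qmin=-128, qmax=127):
    """Derive scale and zero point from the observed range (extended to include 0)."""
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to an int8 code, clamping to the representable range."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    """Map an int8 code back to an approximate float."""
    return (q - zero_point) * scale


# Hypothetical activation values seen during calibration
calibration_values = [-1.0, -0.25, 0.0, 0.5, 3.0]
scale, zero_point = calibrate(calibration_values)
for x in calibration_values:
    q = quantize(x, scale, zero_point)
    print(x, q, round(dequantize(q, scale, zero_point), 4))
```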

---

**Repository Structure**

```
.
├── model/               # Quantized model files
├── tokenizer_config/    # Tokenizer and vocab files
├── model.safetensors    # Fine-tuned model in safetensors format
└── README.md            # Model card
```

---
**Contributing**

Open to improvements and feedback! Feel free to submit a pull request or open an issue if you find any bugs or want to enhance the model.