AmanSengar committed on
Commit 4405582 · verified · 1 Parent(s): 4163b86

Create README.md

Files changed (1)
  1. README.md +116 -0
README.md ADDED
@@ -0,0 +1,116 @@
+
+ **🧠 SMSDetection-DistilBERT-SMS**
+
+ A DistilBERT-based binary classifier fine-tuned on the SMS Spam Collection dataset. It labels each message as **spam** or **ham** (not spam). Intended applications include mobile SMS spam filters, automated customer message triage, and telecom fraud detection.
+
+ ---
+
+ ✨ **Model Highlights**
+
+ - 📌 Based on `distilbert-base-uncased`
+ - 🔍 Fine-tuned on the SMS Spam Collection dataset
+ - ⚡ Supports binary classification: Spam vs Not Spam
+ - 💾 Lightweight and optimized for both CPU and GPU environments
+
+ ---
+
+ 🧠 Intended Uses
+
+ - ✅ Mobile SMS spam filtering
+ - ✅ Telecom customer service automation
+ - ✅ Fraudulent message detection
+ - ✅ User inbox categorization
+ - ✅ Regulatory compliance monitoring
+
+ ---
+ 🚫 Limitations
+
+ - ❌ Trained on English SMS messages only
+ - ❌ May underperform on emails, social media texts, or non-English content
+ - ❌ Not designed for multilingual datasets
+ - ❌ Inputs longer than 128 tokens are truncated, so a slight performance dip is expected on long messages
+
+ ---
+
+ 🏋️‍♂️ Training Details
+
+ | Field          | Value                          |
+ | -------------- | ------------------------------ |
+ | **Base Model** | `distilbert-base-uncased`      |
+ | **Dataset**    | SMS Spam Collection (UCI)      |
+ | **Framework**  | PyTorch with 🤗 Transformers   |
+ | **Epochs**     | 3                              |
+ | **Batch Size** | 16                             |
+ | **Max Length** | 128 tokens                     |
+ | **Optimizer**  | AdamW                          |
+ | **Loss**       | CrossEntropyLoss (per message) |
+ | **Device**     | Trained on CUDA-enabled GPU    |
+
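The epoch count and batch size in the table together determine how many optimizer updates a training run performs. The sketch below records the table's values in a small config object and computes that count. `TrainingConfig` and `total_optimizer_steps` are hypothetical names, the training-set size of 4000 is an illustrative placeholder rather than the author's actual split, and the calculation assumes one update per batch with no gradient accumulation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters copied from the Training Details table."""
    base_model: str = "distilbert-base-uncased"
    epochs: int = 3
    batch_size: int = 16
    max_length: int = 128  # tokens; longer inputs are truncated


def total_optimizer_steps(num_train_examples, cfg):
    """Number of updates, assuming one optimizer step per batch and that
    the last (possibly smaller) batch of each epoch is kept."""
    steps_per_epoch = math.ceil(num_train_examples / cfg.batch_size)
    return steps_per_epoch * cfg.epochs


cfg = TrainingConfig()
# 4000 is a hypothetical training-set size, NOT the author's actual split.
print(total_optimizer_steps(4000, cfg))
```

A count like this is mainly useful for sizing a learning-rate schedule or estimating run time before training starts.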
+ ---
+
+ 📊 Evaluation Metrics
+
+ | Metric    | Score |
+ | --------- | ----- |
+ | Accuracy  | 0.99  |
+ | F1-Score  | 0.96  |
+ | Precision | 0.98  |
+ | Recall    | 0.93  |
+
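For readers who want to see how the four scores relate, here is a minimal sketch that derives them from confusion-matrix counts, treating spam as the positive class. The function names and the counts are hypothetical and chosen only for illustration; they are not the author's evaluation data.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for the positive (spam) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(tp, tn, fp, fn):
    """Fraction of all messages that were labeled correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Hypothetical counts for a 1000-message test set (illustration only).
tp, tn, fp, fn = 93, 898, 2, 7
p, r, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f} "
      f"accuracy={accuracy(tp, tn, fp, fn):.2f}")
```

Because F1 is the harmonic mean of precision and recall, the rounded table values give 2 × 0.98 × 0.93 / (0.98 + 0.93) ≈ 0.954. The small gap from the reported 0.96 could be explained by the inputs having been rounded to two decimals.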
+ ---
+
+ 🚀 Usage
+ ```python
+ import torch
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ model_name = "AventIQ-AI/SMS-Spam-Detection-Model"
+ # The Auto classes resolve the architecture from the repository's config.
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ # A sequence-classification head yields one prediction per message.
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # Run on the GPU when one is available; eval() disables dropout.
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model.to(device)
+ model.eval()
+
+ def predict_sms(text):
+     """Classify a single SMS message as "spam" or "ham"."""
+     inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
+     inputs = {k: v.to(device) for k, v in inputs.items()}
+     with torch.no_grad():
+         outputs = model(**inputs)
+     logits = outputs.logits  # shape: (1, num_labels)
+     predicted = torch.argmax(logits, dim=1).item()
+     return "spam" if predicted == 1 else "ham"
+
+ # Test example
+ print(predict_sms("You've won $1,000,000! Call now to claim your prize!"))
+ ```
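`predict_sms` above returns only the argmax label. Where a deployment needs to trade precision against recall, the two logits can instead be turned into a spam probability and compared with a threshold. The sketch below shows that arithmetic in plain Python. It assumes the logits are ordered `[ham, spam]`, matching the `predicted == 1` check above; `spam_probability`, `classify`, the threshold values, and the example logits are all hypothetical and not part of the published model.

```python
import math


def spam_probability(logits):
    """Softmax over two logits ordered [ham, spam]; returns P(spam)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


def classify(logits, threshold=0.5):
    """Label a message as spam only if P(spam) reaches the threshold."""
    return "spam" if spam_probability(logits) >= threshold else "ham"


example_logits = [-1.2, 2.3]  # made-up values, not real model output
print(classify(example_logits))                  # default threshold
print(classify(example_logits, threshold=0.99))  # stricter threshold
```

Raising the threshold flags fewer messages as spam, which generally favors precision over recall. With the real model, the logits would come from `outputs.logits[0].tolist()`.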
+ ---
+
+ 🧩 Quantization
+
+ Post-training static quantization was applied using PyTorch to reduce model size and accelerate inference on edge devices.
+
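To make the idea concrete, the following is a minimal sketch of the affine int8 arithmetic that static quantization relies on: a calibration pass observes a value range, derives a scale and zero point, and values are then mapped to 8-bit integers and back. It is plain Python for illustration only. It is neither PyTorch's implementation nor the author's quantization script, and the function names and calibration values are hypothetical.

```python
def calibrate(values, qmin=-128, qmax=127):
    """Derive (scale, zero_point) from the range seen during calibration."""
    lo = min(min(values), 0.0)  # include 0 so that 0.0 maps to an exact integer
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero data
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Map a float to a clamped 8-bit integer."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an 8-bit integer back to an approximate float."""
    return (q - zero_point) * scale


calibration_values = [-1.0, 0.0, 0.5, 3.0]  # made-up activations
scale, zero_point = calibrate(calibration_values)
for x in calibration_values:
    q = quantize(x, scale, zero_point)
    print(x, q, dequantize(q, scale, zero_point))
```

Storing 8-bit integers in place of 32-bit floats is where the size reduction comes from; the round-trip error per value is bounded by roughly one quantization step (`scale`).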
+ ---
+
+ 🗂 Repository Structure
+ ```
+ .
+ ├── model/               # Quantized model files
+ ├── tokenizer_config/    # Tokenizer and vocab files
+ ├── model.safetensors    # Fine-tuned model in safetensors format
+ └── README.md            # Model card
+ ```
+ ---
+
+ 🤝 Contributing
+
+ Open to improvements and feedback! Feel free to submit a pull request or open an issue if you find a bug or want to enhance the model.