r3ddkahili's picture
Upload README.md
1e47e82 verified
---
language: en
tags:
- cybersecurity
- malicious-url-detection
- bert
- transformers
- phishing-detection
license: apache-2.0
---
# Malicious URL Detection Model
> A fine-tuned **BERT-LoRA** model for detecting malicious URLs, including phishing, malware, and defacement threats.
## Model Description
This model is a **fine-tuned BERT-based classifier** designed to detect **malicious URLs** in real-time. It applies **Low-Rank Adaptation (LoRA)** for efficient fine-tuning, reducing computational costs while maintaining high accuracy.
The model classifies URLs into **four categories**:
- **Benign**
- **Defacement**
- **Phishing**
- **Malware**
It achieves **98% validation accuracy** and an **F1-score of 0.965**, ensuring robust detection capabilities.
---
## Intended Uses
### Use Cases
- Real-time URL classification for cybersecurity tools
- Phishing and malware detection for online safety
- Integration into browser extensions for instant threat alerts
- Security monitoring for SOC (Security Operations Centers)
---
## Model Details
- **Model Type:** BERT-based URL Classifier
- **Fine-Tuning Method:** LoRA (Low-Rank Adaptation)
- **Base Model:** `bert-base-uncased`
- **Number of Parameters:** 110M
- **Dataset:** Kaggle Malicious URLs Dataset (~651,191 samples)
- **Max Sequence Length:** `128`
- **Framework:** πŸ€— `transformers`, `torch`, `peft`
---
## How to Use
You can use this model directly with πŸ€— **Transformers**:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
# Load the model and tokenizer
model_name = "your-huggingface-model-name"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# Example URL
url = "http://example.com/login"
# Tokenize and predict
inputs = tokenizer(url, return_tensors="pt", truncation=True, padding=True, max_length=128)
with torch.no_grad():
outputs = model(**inputs)
prediction = torch.argmax(outputs.logits).item()
# Mapping prediction to labels
label_map = {0: "Benign", 1: "Defacement", 2: "Phishing", 3: "Malware"}
print(f"Prediction: {label_map[prediction]}")
```
---
## Training Details
- **Batch Size:** `16`
- **Epochs:** `5`
- **Learning Rate:** `2e-5`
- **Optimizer:** AdamW with weight decay
- **Loss Function:** Weighted Cross-Entropy
- **Evaluation Strategy:** Epoch-based
- **Fine-Tuning Strategy:** LoRA applied to BERT layers
---
## Evaluation Results
| Metric | Value |
| ------------ | --------- |
| Accuracy | **98%** |
| Precision | **0.96** |
| Recall | **0.97** |
| **F1 Score** | **0.965** |
### Category-wise Performance
| Category | Precision | Recall | F1-Score |
| -------------- | --------- | ------ | -------- |
| **Benign** | 0.98 | 0.99 | 0.985 |
| **Defacement** | 0.98 | 0.99 | 0.985 |
| **Phishing** | 0.93 | 0.94 | 0.935 |
| **Malware** | 0.95 | 0.96 | 0.955 |
---
## Deployment Options
### Streamlit Web App
- Deployed on **Streamlit Cloud, AWS, or Google Cloud**.
- Provides **real-time URL analysis** with a user-friendly interface.
### Browser Extension (Planned)
- **Real-time scanning** of visited web pages.
- **Dynamic threat alerts** with confidence scores.
### API Integration
- REST API for bulk URL analysis.
- Supports **Security Operations Centers (SOC)**.
---
## Limitations & Bias
- **May misclassify complex phishing URLs** that mimic legitimate sites.
- **Needs regular updates** to counter evolving threats.
- **Potential bias** if future threats are not represented in training data.
---
## Training Data & Citation
### Data Source
Dataset sourced from **Kaggle Malicious URLs Dataset**:
πŸ“Œ [Dataset Link](https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset)
### BibTeX Citation
```
@article{maliciousurl2025,
author = {Gleyzie Tongo, Dr. Farnaz Farid, Dr. Ala Al-Areqi, Dr. Farhad Ahamed},
title = {Fine-Tuned BERT for Malicious URL Detection},
year = {2025},
institution = {Western Sydney University}
}
```
---
## Contact
For inquiries, collaborations, or feedback, feel free to reach out via LinkedIn:
πŸ”— [Gleyzie Tongo](https://www.linkedin.com/in/gleyzie-tongo-83b454218/)