Upload README.md
    	
        README.md
    CHANGED
    
    | 
         @@ -1,3 +1,118 @@ 
     | 
|
| 1 | 
         
            -
            ---
         
     | 
| 2 | 
         
            -
             
     | 
| 3 | 
         
            -
             
     | 
| 1 | 
         
            +
            ---
         
     | 
| 2 | 
         
            +
            language:
         
     | 
| 3 | 
         
            +
            - en
         
     | 
| 4 | 
         
            +
            base_model:
         
     | 
| 5 | 
         
            +
            - CrabInHoney/urlbert-tiny-base-v4
         
     | 
| 6 | 
         
            +
            pipeline_tag: text-classification
         
     | 
| 7 | 
         
            +
            tags:
         
     | 
| 8 | 
         
            +
            - url
         
     | 
| 9 | 
         
            +
            - cybersecurity
         
     | 
| 10 | 
         
            +
            - urls
         
     | 
| 11 | 
         
            +
            - links
         
     | 
| 12 | 
         
            +
            - classification
         
     | 
| 13 | 
         
            +
            - phishing-detection
         
     | 
| 14 | 
         
            +
            - tiny
         
     | 
| 15 | 
         
            +
            - phishing
         
     | 
| 16 | 
         
            +
            - malware
         
     | 
| 17 | 
         
            +
            - defacement
         
     | 
| 18 | 
         
            +
            - transformers
         
     | 
| 19 | 
         
            +
            - urlbert
         
     | 
| 20 | 
         
            +
            - bert
         
     | 
| 21 | 
         
            +
            - malicious
         
     | 
| 22 | 
         
            +
            license: apache-2.0
         
     | 
| 23 | 
         
            +
            ---
         
     | 
| 24 | 
         
            +
             
     | 
| 25 | 
         
            +
            # URLBERT-Tiny-v4 Malicious URL Classifier
         
     | 
| 26 | 
         
            +
             
     | 
| 27 | 
         
            +
            This is a lightweight BERT-based model fine-tuned to classify URLs into four categories: benign, defacement, malware, and phishing.
         
     | 
| 28 | 
         
            +
             
     | 
| 29 | 
         
            +
            ## Model Details
         
     | 
| 30 | 
         
            +
             
     | 
| 31 | 
         
            +
            - **Model size**: 3.69M parameters  
         
     | 
| 32 | 
         
            +
            - **Tensor type**: F32  
         
     | 
| 33 | 
         
            +
            - **Model weight size**: 14.8 MB  
         
     | 
| 34 | 
         
            +
            - **Base model**: [CrabInHoney/urlbert-tiny-base-v4](https://huggingface.co/CrabInHoney/urlbert-tiny-base-v4)  
         
     | 
| 35 | 
         
            +
            - **Dataset**: [Malicious URLs Dataset](https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset)  
         
     | 
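As a quick sanity check on the figures above, the reported weight size follows from the parameter count and the tensor type. The sketch below assumes that the weights file is dominated by the F32 parameters (4 bytes each), that file overhead is negligible, and that 1 MB means 10^6 bytes; the variable names are illustrative.

```python
# Back-of-the-envelope check of the reported weight size.
# Assumptions: F32 parameters take 4 bytes each, file overhead is negligible,
# and 1 MB = 10**6 bytes.
num_parameters = 3.69e6      # "3.69M parameters" from the model details
bytes_per_parameter = 4      # F32

weight_size_mb = num_parameters * bytes_per_parameter / 1e6
print(f"Estimated weight size: {weight_size_mb:.1f} MB")  # prints "Estimated weight size: 14.8 MB"
```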
| 36 | 
         
            +
             
     | 
| 37 | 
         
            +
            ## Model Evaluation Results
         
     | 
| 38 | 
         
            +
             
     | 
| 39 | 
         
            +
            The model was evaluated on a test set with the following classification metrics:
         
     | 
| 40 | 
         
            +
             
     | 
| 41 | 
         
            +
             
     | 
| 42 | 
         
            +
            | Metric | Model V3 | Model V4 (this model) |
         
     | 
| 43 | 
         
            +
            |--------|----------|----------|
         
     | 
| 44 | 
         
            +
            | **Overall Accuracy** | 0.9837 | **0.9922** |
         
     | 
| 45 | 
         
            +
            | **F1-score (Benign)** | 0.9907 | **0.9955** |
         
     | 
| 46 | 
         
            +
            | **F1-score (Defacement)** | 0.9937 | **0.9984** |
         
     | 
| 47 | 
         
            +
            | **F1-score (Malware)** | 0.9741 | **0.9845** |
         
     | 
| 48 | 
         
            +
            | **F1-score (Phishing)** | 0.9444 | **0.9734** |
         
     | 
| 49 | 
         
            +
            | **Weighted Average F1-score** | 0.9836 | **0.9922** |
         
     | 
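One way to read the table is in terms of error rate rather than accuracy. The sketch below uses only the two overall accuracy figures from the table and assumes both versions were evaluated on the same test set, which the table suggests but does not state; the variable names are illustrative.

```python
# Relative reduction in overall error rate from V3 to V4,
# computed from the accuracy values in the table above.
accuracy_v3 = 0.9837
accuracy_v4 = 0.9922

error_v3 = 1 - accuracy_v3   # share of test URLs misclassified by V3
error_v4 = 1 - accuracy_v4   # share of test URLs misclassified by V4

relative_reduction = (error_v3 - error_v4) / error_v3
print(f"Error rate: {error_v3:.2%} -> {error_v4:.2%} "
      f"({relative_reduction:.0%} fewer errors)")
```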
| 50 | 
         
            +
             
     | 
| 51 | 
         
            +
             
     | 
| 52 | 
         
            +
             
     | 
| 53 | 
         
            +
            ## Usage Example
         
     | 
| 54 | 
         
            +
             
     | 
| 55 | 
         
            +
            The following example classifies URLs with the Hugging Face `transformers` library:
         
     | 
| 56 | 
         
            +
             
     | 
| 57 | 
         
            +
            ```python
         
     | 
| 58 | 
         
            +
            from transformers import BertTokenizerFast, BertForSequenceClassification, pipeline
         
     | 
| 59 | 
         
            +
            import torch
         
     | 
| 60 | 
         
            +
             
     | 
| 61 | 
         
            +
            # Select the device (GPU or CPU)
         
     | 
| 62 | 
         
            +
            device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
         
     | 
| 63 | 
         
            +
            print(f"Using device: {device}")
         
     | 
| 64 | 
         
            +
             
     | 
| 65 | 
         
            +
            # Load the model and tokenizer
         
     | 
| 66 | 
         
            +
            model_name = "CrabInHoney/urlbert-tiny-v4-malicious-url-classifier"
         
     | 
| 67 | 
         
            +
            tokenizer = BertTokenizerFast.from_pretrained(model_name)
         
     | 
| 68 | 
         
            +
            model = BertForSequenceClassification.from_pretrained(model_name)
         
     | 
| 69 | 
         
            +
            model.to(device)
         
     | 
| 70 | 
         
            +
             
     | 
| 71 | 
         
            +
            # Create the classification pipeline
         
     | 
| 72 | 
         
            +
            classifier = pipeline(
         
     | 
| 73 | 
         
            +
                "text-classification",
         
     | 
| 74 | 
         
            +
                model=model,
         
     | 
| 75 | 
         
            +
                tokenizer=tokenizer,
         
     | 
| 76 | 
         
            +
                device=0 if torch.cuda.is_available() else -1,
         
     | 
| 77 | 
         
            +
                return_all_scores=True  # return scores for all classes; deprecated in newer transformers releases in favor of top_k=None
         
     | 
| 78 | 
         
            +
            )
         
     | 
| 79 | 
         
            +
             
     | 
| 80 | 
         
            +
            # Example URLs to classify
         
     | 
| 81 | 
         
            +
            test_urls = [
         
     | 
| 82 | 
         
            +
                "wikiobits.com/Obits/TonyProudfoot",
         
     | 
| 83 | 
         
            +
                "http://www.824555.com/app/member/SportOption.php?uid=guest&langx=gb",
         
     | 
| 84 | 
         
            +
            ]
         
     | 
| 85 | 
         
            +
             
     | 
| 86 | 
         
            +
            # Map model labels to human-readable class names
         
     | 
| 87 | 
         
            +
            label_mapping = {
         
     | 
| 88 | 
         
            +
                "LABEL_0": "benign",
         
     | 
| 89 | 
         
            +
                "LABEL_1": "defacement",
         
     | 
| 90 | 
         
            +
                "LABEL_2": "malware",
         
     | 
| 91 | 
         
            +
                "LABEL_3": "phishing"
         
     | 
| 92 | 
         
            +
            }
         
     | 
| 93 | 
         
            +
             
     | 
| 94 | 
         
            +
            # Classify each URL and print the score for every class
         
     | 
| 95 | 
         
            +
            for url in test_urls:
         
     | 
| 96 | 
         
            +
                results = classifier(url)
         
     | 
| 97 | 
         
            +
                print(f"\nURL: {url}")
         
     | 
| 98 | 
         
            +
                for result in results[0]: 
         
     | 
| 99 | 
         
            +
                    label = result['label']
         
     | 
| 100 | 
         
            +
                    score = result['score']
         
     | 
| 101 | 
         
            +
                    friendly_label = label_mapping.get(label, label)
         
     | 
| 102 | 
         
            +
                    print(f"{friendly_label}, %: {score:.4f}")
         
     | 
| 103 | 
         
            +
            ```
         
     | 
| 104 | 
         
            +
             
     | 
| 105 | 
         
            +
            ### Example Output:
         
     | 
| 106 | 
         
            +
            ```
         
     | 
| 107 | 
         
            +
            URL: wikiobits.com/Obits/TonyProudfoot
         
     | 
| 108 | 
         
            +
            benign, %: 0.9996
         
     | 
| 109 | 
         
            +
            defacement, %: 0.0000
         
     | 
| 110 | 
         
            +
            malware, %: 0.0000
         
     | 
| 111 | 
         
            +
            phishing, %: 0.0003
         
     | 
| 112 | 
         
            +
             
     | 
| 113 | 
         
            +
            URL: http://www.824555.com/app/member/SportOption.php?uid=guest&langx=gb
         
     | 
| 114 | 
         
            +
            benign, %: 0.0000
         
     | 
| 115 | 
         
            +
            defacement, %: 0.0001
         
     | 
| 116 | 
         
            +
            malware, %: 0.9998
         
     | 
| 117 | 
         
            +
            phishing, %: 0.0001
         
     | 
| 118 | 
         
            +
            ```
         
     |
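The usage example prints all four class scores. When only the final decision is needed, the per-class scores can be reduced to a single label. The sketch below is a minimal illustration that works on data shaped like one URL's pipeline result (a list of `{'label', 'score'}` dictionaries). The helper name `top_prediction` and the `min_score` cutoff are hypothetical and are not part of the model or of `transformers`.

```python
from typing import Dict, List, Tuple

# Same label mapping as in the usage example.
label_mapping = {
    "LABEL_0": "benign",
    "LABEL_1": "defacement",
    "LABEL_2": "malware",
    "LABEL_3": "phishing",
}


def top_prediction(scores: List[Dict], min_score: float = 0.5) -> Tuple[str, float]:
    """Return the highest-scoring class name and its score.

    `scores` is one URL's list of {'label': ..., 'score': ...} dicts.
    If the best score is below `min_score` (a hypothetical cutoff),
    the label "uncertain" is returned instead of a class name.
    """
    best = max(scores, key=lambda item: item["score"])
    if best["score"] < min_score:
        return "uncertain", best["score"]
    return label_mapping.get(best["label"], best["label"]), best["score"]


# Scores copied from the second URL in the example output.
example_scores = [
    {"label": "LABEL_0", "score": 0.0000},
    {"label": "LABEL_1", "score": 0.0001},
    {"label": "LABEL_2", "score": 0.9998},
    {"label": "LABEL_3", "score": 0.0001},
]
print(top_prediction(example_scores))  # prints ('malware', 0.9998)
```

With the pipeline from the usage example, the helper would be called as `top_prediction(classifier(url)[0])`. The 0.5 cutoff is only a placeholder; a suitable value depends on how costly false alarms and misses are in a given deployment.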