---
library_name: transformers
tags:
- Code
- Vulnerability
- Detection
datasets:
- DetectVul/bigvul
language:
- en
base_model:
- microsoft/codebert-base
license: mit
metrics:
- accuracy
- precision
- f1
- recall
---

# CodeBERT for Code Vulnerability Detection

## Model Summary
This model is a fine-tuned version of **microsoft/codebert-base** for detecting vulnerabilities in source code. It was trained on the **Big-Vul** dataset (`DetectVul/bigvul`) and classifies a given code snippet as either **benign (0)** or **vulnerable (1)**.

## Model Details

- **Developed by:** Eun Jung
- **Fine-tuned from:** `microsoft/codebert-base`
- **Language(s):** English (code comments & metadata), C/C++
- **License:** MIT
- **Task:** Code vulnerability detection (binary sequence classification)
- **Dataset:** Big-Vul (`DetectVul/bigvul`)
- **Architecture:** Transformer-based sequence classification

## How to Get Started with the Model
Use the code below to load the model and run inference on a sample code snippet:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned model; the tokenizer is the unchanged base CodeBERT tokenizer
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained("eunJ/codebert_vulnerabilty_detector")
model.eval()

# Sample code snippet containing a classic buffer overflow
code_snippet = '''
void process(char *input) {
    char buffer[50];
    strcpy(buffer, input); // Potential buffer overflow
}
'''

# Tokenize the input (CodeBERT handles at most 512 tokens)
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, max_length=512)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_label = torch.argmax(predictions, dim=-1).item()

# Output the result: 1 = vulnerable, 0 = benign
print("Vulnerable Code" if predicted_label == 1 else "Benign Code")
```
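
For quick experiments, the same checkpoint can also be wrapped in the generic `text-classification` pipeline. This is a minimal sketch: the label strings shown (`LABEL_0`/`LABEL_1`) are the `transformers` defaults and depend on the uploaded model config, so check the actual output:

```python
from transformers import pipeline

# Convenience sketch: the pipeline bundles tokenization, inference, and softmax.
# Label strings (e.g. "LABEL_0"/"LABEL_1") depend on the uploaded model config.
classifier = pipeline(
    "text-classification",
    model="eunJ/codebert_vulnerabilty_detector",
    tokenizer="microsoft/codebert-base",
)

print(classifier("strcpy(buffer, input);"))
# e.g. [{'label': 'LABEL_1', 'score': ...}]
```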

## Training Details

### Training Data
- **Dataset:** Big-Vul (`DetectVul/bigvul`); see the loading sketch below
- **Classes:** `0 (Benign)`, `1 (Vulnerable)`
- **Size:** 21,800 code snippets

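The dataset can be pulled straight from the Hub for inspection. A minimal sketch, assuming the `datasets` library is installed; the split and column names are whatever `DetectVul/bigvul` defines, so inspect them before relying on this:

```python
from datasets import load_dataset

# Load Big-Vul from the Hugging Face Hub.
# Split/column names depend on the dataset repo; print to inspect.
ds = load_dataset("DetectVul/bigvul")
print(ds)              # available splits and columns
print(ds["train"][0])  # assumes a "train" split exists
```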

### Metrics
| Metric        | Score  |
|---------------|--------|
| **Accuracy**  | 99.11% |
| **F1 Score**  | 91.88% |
| **Precision** | 89.57% |
| **Recall**    | 94.31% |

Accuracy is well above F1, which is consistent with a test set dominated by benign samples; a sketch for reproducing these scores follows.
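
A minimal sketch of how these scores can be recomputed, assuming you already have gold labels and model predictions for a held-out test split (the arrays below are illustrative placeholders):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Illustrative placeholders: replace with real test labels and predictions.
y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1
)
print(f"Accuracy {accuracy:.2%} | Precision {precision:.2%} | "
      f"Recall {recall:.2%} | F1 {f1:.2%}")
```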