---
library_name: transformers
tags:
- Code
- Vulnerability
- Detection
datasets:
- DetectVul/bigvul
language:
- en
base_model:
- microsoft/codebert-base
license: mit
metrics:
- accuracy
- precision
- f1
- recall
---

## CodeBERT for Code Vulnerability Detection

## Model Summary
This model is a fine-tuned version of **microsoft/codebert-base** for detecting vulnerabilities in source code. It was trained on the **bigvul** dataset. Given a code snippet, it classifies the snippet as either **benign (0)** or **vulnerable (1)**.

## Model Details

- **Developed by:** Eun Jung
- **Finetuned from:** `microsoft/codebert-base`
- **Language(s):** English (for code comments & metadata), C/C++
- **License:** MIT
- **Task:** Code vulnerability detection
- **Dataset Used:** `DetectVul/bigvul`
- **Architecture:** Transformer-based sequence classification

## How to Get Started with the Model
Use the code below to load the model and run inference on a sample code snippet:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the base CodeBERT tokenizer and the fine-tuned classifier
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained("eunJ/codebert_vulnerabilty_detector")

# Sample code snippet
code_snippet = '''
void process(char *input) {
    char buffer[50];
    strcpy(buffer, input); // Potential buffer overflow
}
'''

# Tokenize the input (anything beyond 512 tokens is truncated)
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_label = torch.argmax(predictions, dim=1).item()

# Output the result
print("Vulnerable Code" if predicted_label == 1 else "Benign Code")
```
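
The softmax and argmax step above yields only a label. The following is a minimal sketch, in plain Python, of how the two logits map to a label plus a confidence score, assuming the 0 = benign, 1 = vulnerable mapping from the model summary. The helper name `interpret_logits` and the sample logits are hypothetical and are not real model output.

```python
import math

# Label mapping stated in the model summary: 0 = benign, 1 = vulnerable.
ID2LABEL = {0: "Benign Code", 1: "Vulnerable Code"}

def interpret_logits(logits):
    """Convert raw logits into (label, probability of that label)."""
    # Numerically stable softmax: subtract the max before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Index of the most probable class (first index wins on a tie).
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[predicted], probs[predicted]

# Hypothetical logits, for illustration only (not real model output).
label, confidence = interpret_logits([-1.2, 2.3])
print(f"{label} ({confidence:.2%})")
```

With the real model, the same helper could be called as `interpret_logits(outputs.logits[0].tolist())`. A confidence score may be useful for setting a stricter threshold than a plain argmax when false positives are costly.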

## Training Details

### Training Data
- **Dataset:** `bigvul`
- **Classes:** `0 (Benign)`, `1 (Vulnerable)`
- **Size:** 21,800 code snippets

### Metrics
| Metric  | Score |
|------------|-------------|
| **Accuracy** | 99.11% |
| **F1 Score** | 91.88% |
| **Precision** | 89.57% |
| **Recall** | 94.31% |
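
As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the reported precision and recall. The variable names are illustrative.

```python
# Reported scores from the metrics table, expressed as fractions.
precision = 0.8957
recall = 0.9431

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 = {f1:.2%}")
```

The result rounds to 91.88%, which agrees with the reported F1 score.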