---
library_name: transformers
tags:
- Code
- Vulnerability
- Detection
datasets:
- DetectVul/bigvul
language:
- en
base_model:
- microsoft/codebert-base
license: mit
metrics:
- accuracy
- precision
- f1
- recall
---

## CodeBERT for Code Vulnerability Detection

## Model Summary

This model is a fine-tuned version of **microsoft/codebert-base** for detecting vulnerabilities in source code. It was trained on the **Big-Vul** dataset (`DetectVul/bigvul`) and classifies an input code snippet as either **benign (0)** or **vulnerable (1)**.

## Model Details

- **Developed by:** Eun Jung
- **Fine-tuned from:** `microsoft/codebert-base`
- **Language(s):** English (code comments & metadata), C/C++
- **License:** MIT
- **Task:** Code vulnerability detection
- **Dataset:** `DetectVul/bigvul` (Big-Vul)
- **Architecture:** Transformer-based sequence classification
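
Because the checkpoint is a standard sequence-classification model, its configuration can be inspected before downloading the full weights. This is a minimal sketch, assuming the repository id used in the example below; the expected values follow from CodeBERT being a RoBERTa-style encoder fine-tuned with two output labels.

```python
from transformers import AutoConfig

# Load only the configuration (no weights) to check the classification setup
config = AutoConfig.from_pretrained("eunJ/codebert_vulnerabilty_detector")

print(config.model_type)   # expected "roberta": CodeBERT builds on the RoBERTa architecture
print(config.num_labels)   # expected 2: benign (0) vs. vulnerable (1)
```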
## Uses

The model is intended for binary classification of source code: given a code snippet (primarily C/C++ functions, as in Big-Vul), it predicts whether the snippet is benign (0) or vulnerable (1).

## How to Get Started with the Model

Use the code below to load the model and run inference on a sample code snippet:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer (from the base model) and the fine-tuned classifier
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained("eunJ/codebert_vulnerabilty_detector")

# Sample code snippet (unbounded strcpy into a fixed-size buffer)
code_snippet = '''
void process(char *input) {
    char buffer[50];
    strcpy(buffer, input); // Potential buffer overflow
}
'''

# Tokenize the input
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_label = torch.argmax(predictions, dim=1).item()

# Output the result
print("Vulnerable Code" if predicted_label == 1 else "Benign Code")
```
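
For screening several snippets at once, the tokenizer can batch inputs with dynamic padding. The following is a sketch rather than part of the original card: the `snippets` list is a hypothetical example, and the 0/1 label convention is the same as above.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained("eunJ/codebert_vulnerabilty_detector")
model.eval()

# Hypothetical list of code snippets to screen
snippets = [
    "int add(int a, int b) { return a + b; }",
    "void copy(char *src) { char dst[8]; strcpy(dst, src); }",
]

# Tokenize as a batch; padding=True pads only to the longest snippet in the batch
batch = tokenizer(snippets, return_tensors="pt", truncation=True, padding=True, max_length=512)

with torch.no_grad():
    logits = model(**batch).logits

# Label 1 = vulnerable, label 0 = benign (same convention as the single-snippet example)
for code, label in zip(snippets, logits.argmax(dim=-1).tolist()):
    verdict = "Vulnerable Code" if label == 1 else "Benign Code"
    print(f"{verdict}: {code}")
```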
## Training Details

### Training Data

- **Dataset:** `DetectVul/bigvul` (Big-Vul)
- **Classes:** `0 (Benign)`, `1 (Vulnerable)`
- **Size:** 21,800 code snippets
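
The dataset listed in the card metadata can be loaded with the `datasets` library. This is a minimal sketch; the split and column names are not documented here, so inspect them after loading rather than assuming specific field names.

```python
from datasets import load_dataset

# Load the dataset referenced in the card metadata
ds = load_dataset("DetectVul/bigvul")

# Inspect the available splits and columns before relying on specific names
print(ds)
for split in ds:
    print(split, ds[split].column_names)
```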
### Metrics

| Metric        | Score  |
|---------------|--------|
| **Accuracy**  | 99.11% |
| **F1 Score**  | 91.88% |
| **Precision** | 89.57% |
| **Recall**    | 94.31% |
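
The scores above can be recomputed from predictions on a held-out evaluation set with scikit-learn. This is a generic sketch with placeholder labels, not the exact evaluation script used for this card.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 0 = benign, 1 = vulnerable; placeholder values stand in for real model predictions
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.4f}")
print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall:    {recall_score(y_true, y_pred):.4f}")
print(f"F1 Score:  {f1_score(y_true, y_pred):.4f}")
```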