README.md · eunJ/codebert_vulnerabilty

metadata

library_name: transformers
tags:
  - Code
  - Vulnerability
  - Detection
datasets:
  - DetectVul/bigvul
language:
  - en
base_model:
  - microsoft/codebert-base
license: mit
metrics:
  - accuracy
  - precision
  - f1
  - recall

CodeBERT for Code Vulnerability Detection

Model Summary

This model is a fine-tuned version of microsoft/codebert-base for detecting vulnerabilities in C/C++ code. It was fine-tuned on the DetectVul/bigvul dataset. The model takes a code snippet as input and classifies it as either benign (0) or vulnerable (1).

Model Details

Developed by: Eun Jung
Finetuned from: microsoft/codebert-base
Language(s): English (for code comments & metadata), C/C++
License: MIT
Task: Code vulnerability detection
Dataset Used: DetectVul/bigvul
Architecture: Transformer-based sequence classification

Uses

How to Get Started with the Model

Use the code below to load the model and run inference on a sample code snippet:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer from the base model and the fine-tuned classifier weights
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained("eunJ/codebert_vulnerabilty_detector")
model.eval()  # inference mode: disables dropout

# Sample code snippet
code_snippet = '''
void process(char *input) {
    char buffer[50];
    strcpy(buffer, input); // Potential buffer overflow
}
'''

# Tokenize the input, truncating to CodeBERT's 512-token limit
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    # Convert logits to class probabilities, then pick the most likely class
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_label = torch.argmax(predictions, dim=-1).item()

# Output the result
print("Vulnerable Code" if predicted_label == 1 else "Benign Code")
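
The snippet above reduces the two logits to a hard label with argmax. When a confidence score or a stricter decision threshold is more useful, the softmax step can be applied by hand. The following is a minimal sketch in plain Python; the helper names `softmax` and `classify`, the example logits, and the 0.5 default threshold are illustrative assumptions, not part of the model or the transformers API.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits, threshold=0.5):
    """Return (label, p_vulnerable) for a [benign, vulnerable] logit pair.

    The threshold is a hypothetical knob: raising it trades recall
    for precision, lowering it does the opposite.
    """
    p_vulnerable = softmax(logits)[1]
    label = 1 if p_vulnerable >= threshold else 0
    return label, p_vulnerable

# Made-up logits standing in for outputs.logits[0].tolist()
label, prob = classify([-1.2, 2.3])
print("Vulnerable Code" if label == 1 else "Benign Code", f"(p={prob:.3f})")
```

With the default threshold of 0.5 this gives the same decision as argmax for a two-class model, except on exact ties.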

Training Details

Training Data

Dataset: DetectVul/bigvul
Classes: 0 (Benign), 1 (Vulnerable)
Size: 21,800 code snippets

Metrics

| Metric    | Score  |
|-----------|--------|
| Accuracy  | 99.11% |
| F1 Score  | 91.88% |
| Precision | 89.57% |
| Recall    | 94.31% |
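
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 score can be recomputed from the other two values. A small sketch using the numbers above (the variable names are illustrative):

```python
# Values copied from the metrics above, expressed as fractions.
precision = 0.8957
recall = 0.9431

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 recomputed from precision and recall: {f1 * 100:.2f}%")
```

The recomputed value agrees with the reported F1 score of 91.88% to two decimal places.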