---
library_name: transformers
tags:
- Code
- Vulnerability
- Detection
datasets:
- DetectVul/bigvul
language:
- en
base_model:
- microsoft/codebert-base
license: mit
metrics:
- accuracy
- precision
- f1
- recall
---

# CodeBERT for Code Vulnerability Detection

## Model Summary
This model is a fine-tuned version of **microsoft/codebert-base** for detecting vulnerabilities in source code. It was trained on the **Big-Vul** dataset (`DetectVul/bigvul`) and classifies a given code snippet as either **benign (0)** or **vulnerable (1)**.

## Model Details

- **Developed by:** Eun Jung
- **Fine-tuned from:** `microsoft/codebert-base`
- **Language(s):** English (code comments & metadata), C/C++
- **License:** MIT
- **Task:** Code vulnerability detection (binary sequence classification)
- **Dataset used:** `DetectVul/bigvul`
- **Architecture:** Transformer-based sequence classification

## Uses
The model takes a code snippet (primarily C/C++) and flags it as benign or vulnerable, which makes it suitable as a pre-screening step in code review or static-analysis pipelines.

## How to Get Started with the Model
Use the code below to load the model and run inference on a sample code snippet:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer from the base model and the fine-tuned classifier
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained("eunJ/codebert_vulnerabilty_detector")

# Sample code snippet
code_snippet = '''
void process(char *input) {
    char buffer[50];
    strcpy(buffer, input); // Potential buffer overflow
}
'''

# Tokenize the input
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_label = torch.argmax(predictions, dim=1).item()

# Output the result
print("Vulnerable Code" if predicted_label == 1 else "Benign Code")
```
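
If the hosted checkpoint ships with its classification head config, the high-level `pipeline` API should work as well. A minimal sketch; the `LABEL_0`/`LABEL_1` names are the library defaults, and their mapping to benign/vulnerable is an assumption here:

```python
from transformers import pipeline

# Wrap the same checkpoint in a text-classification pipeline
detector = pipeline(
    "text-classification",
    model="eunJ/codebert_vulnerabilty_detector",
    tokenizer="microsoft/codebert-base",
)

result = detector("strcpy(buffer, input);", truncation=True, max_length=512)
print(result)  # e.g. [{'label': 'LABEL_1', 'score': ...}] -> LABEL_1 = vulnerable (assumed mapping)
```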

## Training Details

### Training Data
- **Dataset:** `DetectVul/bigvul` (Big-Vul)
- **Classes:** `0 (Benign)`, `1 (Vulnerable)`
- **Size:** 21,800 code snippets

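For reference, the dataset can be pulled from the Hub with the `datasets` library. A minimal sketch; the split name and record schema are assumptions, so check the dataset card for the exact layout:

```python
from datasets import load_dataset

# Load the Big-Vul dataset used for fine-tuning (split name assumed)
ds = load_dataset("DetectVul/bigvul", split="train")
print(ds[0])  # one record: a code snippet plus its 0/1 vulnerability label
```
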
### Metrics
| Metric        | Score  |
|---------------|--------|
| **Accuracy**  | 99.11% |
| **F1 Score**  | 91.88% |
| **Precision** | 89.57% |
| **Recall**    | 94.31% |
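
Metrics of this kind are typically computed from held-out predictions with scikit-learn. A minimal sketch; `y_true` and `y_pred` are illustrative placeholders, not the author's evaluation data:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Placeholder arrays standing in for the test labels and model predictions
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.4f}")
print(f"F1 Score:  {f1_score(y_true, y_pred):.4f}")
print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall:    {recall_score(y_true, y_pred):.4f}")
```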