
	---
	library_name: transformers
	tags:
	- Code
	- Vulnerability
	- Detection
	datasets:
	- DetectVul/bigvul
	language:
	- en
	base_model:
	- microsoft/codebert-base
	license: mit
	metrics:
	- accuracy
	- precision
	- f1
	- recall
	---

	## CodeBERT for Code Vulnerability Detection

	## Model Summary
	This model is a fine-tuned version of `microsoft/codebert-base` for detecting vulnerabilities in C/C++ source code. It was trained on the `bigvul` dataset. Given a code snippet, the model classifies it as either benign (0) or vulnerable (1).

	## Model Details

	- Developed by: Eun Jung
	- Finetuned from: `microsoft/codebert-base`
	- Language(s): English (for code comments & metadata), C/C++
	- License: MIT
	- Task: Code vulnerability detection
	- Dataset Used: `bigvul`
	- Architecture: Transformer-based sequence classification

	## Uses

	## How to Get Started with the Model
	Use the code below to load the model and run inference on a sample code snippet:

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	# Load the base CodeBERT tokenizer and the fine-tuned classifier
	tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
	model = AutoModelForSequenceClassification.from_pretrained("eunJ/codebert_vulnerabilty_detector")

	# Sample code snippet
	code_snippet = '''
	void process(char *input) {
	    char buffer[50];
	    strcpy(buffer, input); // Potential buffer overflow
	}
	'''

	# Tokenize the input
	inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

	# Run inference
	with torch.no_grad():
	    outputs = model(**inputs)
	    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
	    predicted_label = torch.argmax(predictions, dim=1).item()

	# Output the result
	print("Vulnerable Code" if predicted_label == 1 else "Benign Code")
	```
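
The snippet above reports only the argmax label. When false positives and false negatives carry different costs, it may be more useful to inspect the vulnerable-class probability and apply a custom cutoff. The following is a minimal sketch and not part of the original model card: `vulnerable_probability`, `classify`, and the `threshold` parameter are hypothetical names, and the code assumes the two-logit output order described above (index 0 benign, index 1 vulnerable). It uses only the standard library, so the logits would first be converted to a plain list, for example with `outputs.logits[0].tolist()`.

```python
import math


def vulnerable_probability(logits):
    """Softmax probability of class 1 (vulnerable) from two raw logits."""
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


def classify(logits, threshold=0.5):
    """Return 1 (vulnerable) if the class-1 probability reaches the threshold, else 0."""
    return 1 if vulnerable_probability(logits) >= threshold else 0


# Hypothetical logits for illustration only (assumption, not real model output).
example_logits = [0.2, 1.3]
print(vulnerable_probability(example_logits))
print(classify(example_logits))                 # default cutoff of 0.5
print(classify(example_logits, threshold=0.9))  # stricter cutoff
```

Raising the threshold trades recall for precision; a suitable value would have to be chosen on held-out data.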

	## Training Details

	### Training Data
	- Dataset: `bigvul`
	- Classes: `0 (Benign)`, `1 (Vulnerable)`
	- Size: `21800` code snippets

	### Metrics
	| Metric    | Score  |
	|-----------|--------|
	| Accuracy  | 99.11% |
	| F1 Score  | 91.88% |
	| Precision | 89.57% |
	| Recall    | 94.31% |
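
As a quick consistency check on the table above, the F1 score can be recomputed as the harmonic mean of precision and recall. This is a small illustrative sketch; the helper name `f1_score` is an assumption and is not taken from the model's training code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the table above, converted from percentages.
precision = 0.8957
recall = 0.9431

f1 = f1_score(precision, recall)
print(f"F1 = {f1 * 100:.2f}%")  # prints "F1 = 91.88%"
```

The recomputed value agrees with the reported F1 score of 91.88%.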