---
license: mit
tags:
- linux
- bugfix
- codellama
- qlora
- transformers
- causal-lm
model_type: causal-lm
library_name: transformers
pipeline_tag: text-generation
base_model: codellama/CodeLlama-7b-Instruct-hf
language:
- en
- code
---

# CodeLLaMA-Linux-BugFix

A fine-tuned CodeLlama-7B-Instruct model designed specifically for Linux kernel bug fixing. It generates Git diff patches from buggy C code and commit messages.

## Model Description

This model is a QLoRA fine-tuned version of CodeLlama-7B-Instruct, trained on a dataset of Linux kernel bug fixes extracted from Git commits. It learns to generate Git diff patches that fix bugs in C code.

- **Developed by:** Maaac
- **Model type:** Causal language model (QLoRA fine-tuned)
- **Language(s):** English, C
- **License:** MIT
- **Finetuned from model:** codellama/CodeLlama-7b-Instruct-hf

## Uses

### Direct Use

This model is designed to:

- Generate Git diff patches for Linux kernel bug fixes
- Assist developers in fixing common kernel bugs
- Provide automated code review suggestions
- Help with learning Linux kernel development patterns

### Downstream Use

The model can be integrated into:

- Automated code review systems
- Development IDEs and editors
- Continuous integration pipelines
- Educational tools for kernel development

### Out-of-Scope Use

This model is not suitable for:

- Non-Linux kernel code
- Non-C programming languages
- Security-critical applications without human review
- Production systems without proper validation

## Bias, Risks, and Limitations

### Limitations

- Focused specifically on Linux kernel C code
- May not generalize to other codebases
- Generated fixes should be reviewed by human developers
- Limited to the patterns present in the training data

### Recommendations

Users should:

- Always review generated patches before applying them
- Test fixes in a safe environment first
- Understand the context of the bug being fixed
- Use the model as a development aid, not a replacement for human expertise

## How to Get Started with the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model and tokenizer
model = AutoModelForCausalLM.from_pretrained("Maaac/CodeLLaMA-Linux-BugFix")
tokenizer = AutoTokenizer.from_pretrained("Maaac/CodeLLaMA-Linux-BugFix")

# Example prompt: buggy C code plus an instruction describing the fix
prompt = """Given the following original C code:
int *ptr = kmalloc(sizeof(int), GFP_KERNEL);
if (!ptr) {
    return -ENOMEM;
}
// ... use ptr ...
// Missing kfree(ptr)

Instruction: Fix memory leak by adding proper cleanup

Return the diff that fixes it:
"""

# Tokenize, generate, and decode the predicted diff
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
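
Loading the 7B weights in full precision takes roughly 14 GB of memory. If memory is tight, here is a minimal sketch of 4-bit loading with `bitsandbytes`, mirroring the QLoRA setup used for training (the NF4 settings below are illustrative assumptions, not the recorded training configuration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Illustrative 4-bit NF4 quantization config; assumes a CUDA GPU
# with the bitsandbytes package installed
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Maaac/CodeLLaMA-Linux-BugFix",
    quantization_config=bnb_config,
    device_map="auto",  # requires the accelerate package
)
tokenizer = AutoTokenizer.from_pretrained("Maaac/CodeLLaMA-Linux-BugFix")
```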

## Training Details

### Training Data

- **Source:** Linux kernel Git repository
- **Size:** 100,000 bug-fix samples
- **Format:** JSONL with prompt-completion pairs
- **Extraction Method:** PyDriller analysis of commit history (see the sketch below)
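
A minimal sketch of this kind of PyDriller pass (the keyword list and repository path are illustrative assumptions, not the exact filters used to build the dataset):

```python
from pydriller import Repository

# Illustrative bug-fix keywords; the actual filter list may differ
BUGFIX_KEYWORDS = ("fix", "bug", "leak", "null", "overflow")

samples = []
for commit in Repository("path/to/linux").traverse_commits():
    msg = commit.msg.lower()
    if not any(kw in msg for kw in BUGFIX_KEYWORDS):
        continue
    for mf in commit.modified_files:
        # Keep C source changes that have both pre- and post-fix content
        if mf.filename.endswith(".c") and mf.source_code_before and mf.source_code:
            samples.append({
                "message": commit.msg,
                "before": mf.source_code_before,
                "diff": mf.diff,  # unified diff of the fix
            })
```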

### Training Procedure

#### Preprocessing

- Extracted bug-fix commits using keyword filtering
- Captured code context (10 lines before/after the bug location)
- Converted samples to prompt-completion format for supervised learning (see the sketch below)
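
A minimal sketch of that conversion step, writing one JSONL record per sample (the field names and prompt template are assumptions modeled on the inference example above, not the exact training template):

```python
import json

def to_record(before_context: str, instruction: str, diff: str) -> dict:
    # Prompt template mirrors the example in "How to Get Started"
    prompt = (
        "Given the following original C code:\n"
        f"{before_context}\n\n"
        f"Instruction: {instruction}\n\n"
        "Return the diff that fixes it:\n"
    )
    return {"prompt": prompt, "completion": diff}

# Write one JSONL line per extracted sample (`samples` from the sketch above)
with open("train.jsonl", "w") as f:
    for s in samples:
        f.write(json.dumps(to_record(s["before"], s["message"], s["diff"])) + "\n")
```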

#### Training Hyperparameters

- **Base Model:** codellama/CodeLlama-7b-Instruct-hf
- **Method:** QLoRA with 4-bit quantization
- **LoRA Config:** r=64, alpha=16, dropout=0.1
- **Training:** 3 epochs, batch size 64, learning rate 2e-4
- **Hardware:** H200 GPU with bfloat16 (see the configuration sketch below)
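
A minimal sketch of the corresponding PEFT/bitsandbytes configuration (the LoRA hyperparameters come from the list above; `target_modules` and the NF4 settings are illustrative assumptions, since the card does not record them):

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit quantization for QLoRA; NF4 with bfloat16 compute is assumed
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter settings from the hyperparameter list above
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
)
```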

## Evaluation

### Testing Data

- Separate evaluation dataset with known bug-fix pairs
- Focused on common Linux kernel bug patterns

### Metrics

- **BLEU Score:** Measures n-gram overlap between generated and reference diffs (see the sketch below)
- **ROUGE Score:** Evaluates recall-oriented overlap between predicted and actual fixes
- **Human Evaluation:** Qualitative assessment of fix quality
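
A minimal sketch of computing both automatic metrics with the Hugging Face `evaluate` library (assuming `predictions` and `references` are parallel lists of diff strings):

```python
import evaluate

# Parallel lists of generated and ground-truth diffs (toy examples)
predictions = ["+\tkfree(ptr);\n \treturn 0;"]
references = ["+\tkfree(ptr);\n \treturn 0;"]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

# BLEU expects a list of reference lists per prediction
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))
```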

### Results

The model can generate contextually appropriate Git diff patches for common Linux kernel bugs, though every generated patch should still be validated by a human developer.

## Technical Specifications

### Model Architecture

- **Base:** CodeLlama-7B-Instruct (7 billion parameters)
- **Adapter:** LoRA layers for parameter-efficient fine-tuning
- **Output:** Patches in Git diff format

### Compute Infrastructure

- **Hardware:** H200 GPU
- **Framework:** PyTorch with Transformers
- **Quantization:** 4-bit QLoRA for memory efficiency

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{CodeLLaMA-Linux-BugFix,
  author = {Maaac},
  title  = {CodeLLaMA-Linux-BugFix: A Fine-tuned Model for Linux Kernel Bug Fixing},
  year   = {2024},
  url    = {https://huggingface.co/Maaac/CodeLLaMA-Linux-BugFix}
}
```

## Model Card Authors

- **Author:** Maaac
- **Contact:** [Your contact information]

## Framework Versions

- PEFT 0.16.0
- Transformers 4.53.1
- PyTorch 2.7.1