---
license: mit
tags:
  - codellama
  - linux
  - bugfix
  - lora
  - qlora
  - git-diff
base_model: codellama/CodeLlama-7b-Instruct-hf
model_type: LlamaForCausalLM
library_name: peft
pipeline_tag: text-generation
---
# CodeLLaMA-Linux-BugFix
A version of CodeLLaMA-7B-Instruct fine-tuned with QLoRA (Quantized Low-Rank Adaptation) for Linux kernel bug fixing. The model learns to generate Git diff patches from buggy C code and commit messages.
## 🎯 Overview
This project targets automated Linux kernel bug fixing by:
- Mining real commit data from the kernel Git history
- Training a specialized QLoRA model on diff-style fixes
- Generating Git patches in response to bug-prone code
- Evaluating results using BLEU, ROUGE, and human inspection
## 🔧 Model Configuration

- **Base model:** CodeLLaMA-7B-Instruct
- **Fine-tuning method:** QLoRA with 4-bit quantization
- **Training setup:**
  - LoRA r=64, alpha=16, dropout=0.1
  - Batch size: 64, learning rate: 2e-4, epochs: 3
  - Mixed precision (bfloat16), gradient checkpointing
- **Hardware:** optimized for NVIDIA H200 GPUs
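For reference, a minimal sketch of how these hyperparameters map onto a PEFT/Transformers training setup (the `target_modules` list, output path, and per-device/accumulation split are assumptions; see `train/train_codellama_qlora_linux_bugfix.py` for the actual script):

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter settings from the list above.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    bias="none",
    task_type="CAUSAL_LM",
)

# Training hyperparameters from the list above; the effective batch size of 64
# can be reached via gradient accumulation on a single GPU.
training_args = TrainingArguments(
    output_dir="train/output",       # assumed path
    per_device_train_batch_size=8,   # assumed split
    gradient_accumulation_steps=8,   # 8 x 8 = 64 effective batch size
    learning_rate=2e-4,
    num_train_epochs=3,
    bf16=True,
    gradient_checkpointing=True,
)
```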
## 📊 Dataset

Custom dataset extracted from the Linux kernel Git history.

### Filtering Criteria

Bug-fix commits whose messages contain keywords such as `fix`, `bug`, `crash`, `memory`, `null`, `panic`, `overflow`, `race`, `corruption`, etc. A sketch of this filter follows below.
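A minimal sketch of this kind of keyword filter over the kernel's commit log (illustrative only; the helper name and exact keyword handling are assumptions, and `extract_linux_bugfixes.py` may differ):

```python
import re
import subprocess

# Keywords that mark likely bug-fix commits (from the filtering criteria above).
BUGFIX_KEYWORDS = re.compile(
    r"\b(fix|bug|crash|memory|null|panic|overflow|race|corruption)\b",
    re.IGNORECASE,
)

def bugfix_commit_hashes(repo_path: str) -> list[str]:
    """Return hashes of commits whose subject line matches a bug-fix keyword."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=format:%H %s"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [
        line.split(" ", 1)[0]
        for line in log.splitlines()
        if BUGFIX_KEYWORDS.search(line)
    ]
```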
### Structure

- **Language:** C (`.c`, `.h`)
- **Context:** 10 lines before/after the change
- **Format:**

```json
{
  "input": {
    "original code": "C code snippet with bug",
    "instruction": "Commit message or fix description"
  },
  "output": {
    "diff codes": "Git diff showing the fix"
  }
}
```

- **File:** `training_data_100k.jsonl` (100,000 samples)
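Each line of the JSONL file is one sample in the format above; a quick way to inspect it (a sketch assuming the file sits under `dataset/`):

```python
import json

# Load the JSONL training file (one JSON object per line).
with open("dataset/training_data_100k.jsonl") as f:
    samples = [json.loads(line) for line in f]

sample = samples[0]
print(sample["input"]["instruction"])   # commit message / fix description
print(sample["output"]["diff codes"])   # reference Git diff
```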
## 🚀 Quick Start

Install dependencies:

```bash
pip install -r requirements.txt
```

### 1. Build the Dataset

```bash
cd dataset_builder
python extract_linux_bugfixes.py
python format_for_training.py
```

### 2. Fine-tune the Model

```bash
cd train
python train_codellama_qlora_linux_bugfix.py
```

### 3. Run Evaluation

```bash
cd evaluate
python evaluate_linux_bugfix_model.py
```
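Once training finishes, the adapter can be loaded for inference roughly like this (a sketch; the adapter path and prompt format are assumptions, not necessarily what the evaluation script uses):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "codellama/CodeLlama-7b-Instruct-hf"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "train/output")  # assumed adapter path

prompt = (
    "Instruction: Fix the NULL pointer dereference.\n"
    "Code:\n<buggy C snippet here>\n"
    "Diff:\n"
)  # assumed prompt format
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```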
## 📁 Project Structure

```
CodeLLaMA-Linux-BugFix/
├── dataset_builder/
│   ├── extract_linux_bugfixes.py
│   ├── extract_linux_bugfixes_parallel.py
│   └── format_for_training.py
├── dataset/
│   ├── training_data_100k.jsonl
│   └── training_data_prompt_completion.jsonl
├── train/
│   ├── train_codellama_qlora_linux_bugfix.py
│   ├── train_codellama_qlora_simple.py
│   ├── download_codellama_model.py
│   └── output/
├── evaluate/
│   ├── evaluate_linux_bugfix_model.py
│   ├── test_samples.jsonl
│   └── output/
└── requirements.txt
```
## 🧩 Features

- 🔧 **Efficient fine-tuning:** QLoRA + 4-bit quantization for large memory savings
- 🐧 **Real-world commits:** mined from actual Linux kernel development
- 💡 **Context-aware:** extracts code context around the buggy lines
- 💻 **Output-ready:** generates valid Git-style diffs
## 📈 Evaluation Metrics

- **BLEU:** n-gram overlap between generated and reference diffs
- **ROUGE:** recall-oriented overlap in fix content
- **Human evaluation:** subjective patch quality
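A minimal sketch of computing the automatic scores with the Hugging Face `evaluate` library (the actual evaluation script may compute them differently):

```python
import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

# Model outputs and ground-truth diffs (toy single-example lists).
predictions = ["@@ -1,3 +1,3 @@\n- if (ptr)\n+ if (ptr != NULL)"]
references = ["@@ -1,3 +1,3 @@\n- if (ptr)\n+ if (ptr != NULL)"]

# BLEU expects a list of reference lists per prediction.
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))
```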
## 🧪 Use Cases
- Automated kernel bug fixing
- Code review assistance
- Teaching/debugging kernel code
- Research in automated program repair (APR)
## 🔬 Technical Highlights

### Memory & Speed Optimizations
- 4-bit quantization (NF4)
- Gradient checkpointing
- Mixed precision (bfloat16)
- Gradient accumulation
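These map onto standard Transformers/bitsandbytes settings; a hedged sketch (double quantization is an assumption beyond the list above; gradient accumulation appears in the training-arguments sketch earlier):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with bfloat16 compute, as listed above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # assumed
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model.gradient_checkpointing_enable()  # trade recompute for activation memory
```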
## 🤝 Contributing

1. Fork this repo
2. Create a branch
3. Add your feature or fix
4. Submit a PR 🎉
## 📄 License

MIT License. See the LICENSE file for details.
## 🙏 Acknowledgments
- Meta for CodeLLaMA
- Hugging Face for Transformers + PEFT
- The Linux kernel community for open access to commit data
- Microsoft for introducing LoRA