# CodeLLaMA-Linux-BugFix
A machine learning project that fine-tunes CodeLLaMA-7B-Instruct specifically for Linux kernel bug fixing using QLoRA (Quantized Low-Rank Adaptation). The model learns to generate Git diff patches from buggy C code and commit messages.
## Project Overview
This project addresses the challenging task of automated Linux kernel bug fixing by:
- Extracting real bug-fix data from the Linux kernel Git repository
- Training a specialized model using QLoRA for efficient fine-tuning
- Generating Git diff patches that can be applied to fix bugs
- Providing evaluation metrics to assess model performance
## Architecture
### Base Model
- Model: `codellama/CodeLLaMA-7b-Instruct-hf` (7 billion parameters)
- Fine-tuning Method: QLoRA with 4-bit quantization
- Hardware: Optimized for H200 GPU with bfloat16 precision
### Training Configuration
- LoRA Config: r=64, alpha=16, dropout=0.1
- Training: 3 epochs, batch size 64, learning rate 2e-4
- Memory Optimization: Gradient checkpointing, mixed precision training
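
The configuration above maps onto the Hugging Face `transformers`, `peft`, and `bitsandbytes` APIs roughly as in the sketch below. The target modules, micro-batch split, and output path are illustrative assumptions, not necessarily what `train_codellama_qlora_linux_bugfix.py` uses.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# QLoRA: load the base model in 4-bit NF4 with bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLLaMA-7b-Instruct-hf",  # base model id as listed above
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapter: r=64, alpha=16, dropout=0.1 (matches the configuration above)
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed target modules
)
model = get_peft_model(model, lora_config)

# 3 epochs, learning rate 2e-4, bf16 mixed precision, gradient checkpointing
training_args = TrainingArguments(
    output_dir="train/output",
    num_train_epochs=3,
    per_device_train_batch_size=8,   # illustrative micro-batch size
    gradient_accumulation_steps=8,   # accumulated toward the batch size of 64 above
    learning_rate=2e-4,
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=50,
)
```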
## Dataset
The project creates a specialized dataset from Linux kernel commits:
### Data Extraction Process
1. Commit Filtering: Identifies bug-fix commits using keywords: `fix`, `bug`, `leak`, `null`, `overflow`, `error`, `failure`, `crash`, `panic`, `memory`, `race`, `deadlock`, `corruption`, `security`, `vulnerability`, `exploit`, `buffer`, `stack`
2. Code Context Extraction (a minimal sketch of steps 1 and 2 follows this list):
   - Focuses on C and header files (`.c`, `.h`)
   - Extracts 10 lines before/after the bug location
   - Captures relevant code context
3. Data Format:

```json
{
  "input": {
    "original code": "C code snippet with bug",
    "instruction": "Bug fix instruction from commit message"
  },
  "output": {
    "diff codes": "Git diff showing the fix"
  }
}
```
### Dataset Statistics
- Training Data: 100K samples (`training_data_100k.jsonl`)
- Format: JSONL (one JSON object per line)
- Source: Linux kernel Git repository
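Because the file is plain JSONL, each sample can be parsed independently; a quick sketch, with field names following the data format above:

```python
import json

# Each line of the JSONL file is one self-contained training sample
with open("dataset/training_data_100k.jsonl") as f:
    samples = [json.loads(line) for line in f]

first = samples[0]
print(first["input"]["instruction"])   # bug-fix instruction from the commit message
print(first["output"]["diff codes"])   # reference Git diff
```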
## Quick Start
### Prerequisites

```bash
pip install -r requirements.txt
```
### 1. Build Dataset

```bash
cd dataset_builder
python extract_linux_bugfixes.py
python format_for_training.py
```
### 2. Train Model

```bash
cd train
python train_codellama_qlora_linux_bugfix.py
```
### 3. Evaluate Model

```bash
cd evaluate
python evaluate_linux_bugfix_model.py
```
## Project Structure

```
CodeLLaMA-Linux-BugFix/
├── dataset_builder/                            # Dataset creation scripts
│   ├── extract_linux_bugfixes.py               # Main dataset extraction
│   ├── extract_linux_bugfixes_parallel.py      # Parallelized version
│   └── format_for_training.py
├── dataset/                                    # Generated datasets
│   ├── training_data_100k.jsonl
│   └── training_data_prompt_completion.jsonl
├── train/                                      # Training scripts and outputs
│   ├── train_codellama_qlora_linux_bugfix.py   # Main training script
│   ├── train_codellama_qlora_simple.py
│   ├── download_codellama_model.py
│   └── output/                                 # Trained model checkpoints
├── evaluate/                                   # Evaluation scripts and results
│   ├── evaluate_linux_bugfix_model.py          # Model evaluation
│   ├── test_samples.jsonl                      # Evaluation dataset
│   └── output/                                 # Evaluation results
└── requirements.txt                            # Python dependencies
```
## Key Features
### Efficient Training
- QLoRA: Reduces memory requirements by 75% while maintaining performance
- 4-bit Quantization: Enables training on consumer hardware
- Gradient Checkpointing: Optimizes memory usage during training
### Real-world Data
- Authentic Bug Fixes: Extracted from actual Linux kernel development
- Contextual Understanding: Captures relevant code context around bugs
- Git Integration: Outputs proper Git diff format
### Evaluation
- BLEU Score: Measures n-gram overlap between generated and reference patches
- ROUGE Score: Measures recall-oriented overlap with the reference fixes
- Comprehensive Metrics: Results exported in JSON and CSV formats
## Use Cases
The fine-tuned model can assist with:
- Automated Bug Fixing: Generate patches for common kernel bugs
- Code Review: Suggest fixes during development
- Learning: Study patterns in Linux kernel bug fixes
- Research: Advance automated software repair techniques
## Performance
The model is evaluated using:
- BLEU Score: Measures how well generated diffs match reference fixes
- ROUGE Score: Evaluates overlap between predicted and actual fixes
- Human Evaluation: Qualitative assessment of fix quality
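
One way to compute BLEU and ROUGE for generated patches is with the Hugging Face `evaluate` package, as sketched below; the actual evaluation script may use different libraries or settings, and the example strings are placeholders.

```python
import evaluate  # pip install evaluate (plus rouge_score for the ROUGE metric)

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

# Placeholder diffs; in practice predictions come from the model and
# references from test_samples.jsonl
predictions = ["--- a/foo.c\n+++ b/foo.c\n@@ -1 +1 @@\n-int x;\n+int x = 0;"]
references  = ["--- a/foo.c\n+++ b/foo.c\n@@ -1 +1 @@\n-int x;\n+int x = 0;"]

print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))
```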
## Technical Details
### Model Architecture
- Base: CodeLLaMA-7B-Instruct with instruction tuning
- Adapter: LoRA layers for efficient fine-tuning
- Output: Generates Git diff format patches
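
A minimal inference sketch showing how the LoRA adapter is applied on top of the 4-bit base model at generation time; the adapter path and prompt template are assumptions for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

base_id = "codellama/CodeLLaMA-7b-Instruct-hf"  # base model id as listed above
adapter_dir = "train/output"                    # trained LoRA checkpoint directory

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
    ),
    device_map="auto",
)
model = PeftModel.from_pretrained(base, adapter_dir)  # attach the LoRA adapter

# Hypothetical prompt; the real template depends on how the training data was formatted
prompt = (
    "### Instruction:\nFix the possible NULL pointer dereference.\n\n"
    "### Code:\nif (dev->ops->init(dev))\n\treturn ret;\n\n"
    "### Diff:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```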
### Training Process
1. Data Preprocessing: Extract and clean commit data
2. Tokenization: Convert records to the model input format (see the sketch after this list)
3. QLoRA Training: Parameter-efficient fine-tuning of the quantized base model
4. Checkpointing: Save model states for evaluation
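
As an illustration of step 2, a record can be rendered into a prompt/completion pair and tokenized roughly as follows. The prompt template here is an assumption and may differ from the one used to build `training_data_prompt_completion.jsonl`.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLLaMA-7b-Instruct-hf")
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers define no pad token

def build_example(record: dict, max_length: int = 2048) -> dict:
    """Render one JSONL record as prompt + completion, then tokenize."""
    prompt = (
        f"### Instruction:\n{record['input']['instruction']}\n\n"
        f"### Code:\n{record['input']['original code']}\n\n"
        "### Diff:\n"
    )
    completion = record["output"]["diff codes"]
    return tokenizer(prompt + completion, truncation=True,
                     max_length=max_length, padding="max_length")
```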
### Memory Optimization
- 4-bit Quantization: Reduces model size significantly
- Gradient Accumulation: Enables larger effective batch sizes
- Mixed Precision: Uses bfloat16 for faster training
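
For example, the batch size of 64 listed in the training configuration can be reached on a single GPU by accumulating gradients over smaller micro-batches; the split below is one possible choice, not necessarily the one used in the training script.

```python
# One way to reach the configured batch size of 64 on a single H200
per_device_train_batch_size = 8   # micro-batch that fits in GPU memory (illustrative)
gradient_accumulation_steps = 8   # gradients summed over 8 micro-batches per optimizer step

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
assert effective_batch_size == 64
```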
## Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- CodeLLaMA Team: For the base model
- Linux Kernel Community: For the bug-fix data
- Hugging Face: For the transformers library
- Microsoft: For the LoRA technique