CodeLLaMA-Linux-BugFix

A machine learning project that fine-tunes CodeLLaMA-7B-Instruct specifically for Linux kernel bug fixing using QLoRA (Quantized Low-Rank Adaptation). The model learns to generate Git diff patches from buggy C code and commit messages.

🎯 Project Overview

This project addresses the challenging task of automated Linux kernel bug fixing by:

  • Extracting real bug-fix data from the Linux kernel Git repository
  • Training a specialized model using QLoRA for efficient fine-tuning
  • Generating Git diff patches that can be applied to fix bugs
  • Providing evaluation metrics to assess model performance

πŸ—οΈ Architecture

Base Model

  • Model: codellama/CodeLlama-7b-Instruct-hf (7 billion parameters)
  • Fine-tuning Method: QLoRA with 4-bit quantization
  • Hardware: Optimized for an NVIDIA H200 GPU with bfloat16 precision

Training Configuration

  • LoRA Config: r=64, alpha=16, dropout=0.1
  • Training: 3 epochs, batch size 64, learning rate 2e-4
  • Memory Optimization: Gradient checkpointing, mixed precision training
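
A minimal sketch of this configuration using the Hugging Face transformers and peft libraries. The nf4 quantization type and the target_modules list are assumptions not stated above; the authoritative settings live in train/train_codellama_qlora_linux_bugfix.py.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization (QLoRA) with bfloat16 compute, as used on the H200
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # assumed; NF4 is the QLoRA default
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapter matching the bullets above: r=64, alpha=16, dropout=0.1
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable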

📊 Dataset

The project creates a specialized dataset from Linux kernel commits:

Data Extraction Process

  1. Commit Filtering: Identifies bug-fix commits using keywords in the commit message (see the sketch after this list):

    • fix, bug, leak, null, overflow, error, failure
    • crash, panic, memory, race, deadlock, corruption
    • security, vulnerability, exploit, buffer, stack
  2. Code Context Extraction:

    • Focuses on C and header files (.c, .h)
    • Extracts 10 lines before/after the bug location
    • Captures relevant code context
  3. Data Format:

    {
      "input": {
        "original code": "C code snippet with bug",
        "instruction": "Bug fix instruction from commit message"
      },
      "output": {
        "diff codes": "Git diff showing the fix"
      }
    }
    
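
A hedged sketch of steps 1 and 2, assuming a local kernel checkout; the real implementation is dataset_builder/extract_linux_bugfixes.py and may differ in detail.

import subprocess

BUGFIX_KEYWORDS = {
    "fix", "bug", "leak", "null", "overflow", "error", "failure",
    "crash", "panic", "memory", "race", "deadlock", "corruption",
    "security", "vulnerability", "exploit", "buffer", "stack",
}

def is_bugfix_commit(subject: str) -> bool:
    """Step 1: heuristic keyword filter on the commit subject line."""
    subject = subject.lower()
    return any(keyword in subject for keyword in BUGFIX_KEYWORDS)

def context_window(lines: list[str], bug_line: int, radius: int = 10) -> str:
    """Step 2: keep `radius` lines before/after the bug location."""
    lo = max(0, bug_line - radius)
    hi = min(len(lines), bug_line + radius + 1)
    return "\n".join(lines[lo:hi])

def bugfix_commits(repo: str) -> list[str]:
    """Scan git log subjects and keep likely bug-fix commit SHAs."""
    log = subprocess.run(
        ["git", "-C", repo, "log", "--pretty=%H %s"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [
        sha for sha, _, subject in
        (line.partition(" ") for line in log.splitlines())
        if is_bugfix_commit(subject)
    ]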

Dataset Statistics

  • Training Data: 100K samples (training_data_100k.jsonl)
  • Format: JSONL (one JSON object per line)
  • Source: Linux kernel Git repository
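
Because each line is a self-contained JSON object, the dataset can be streamed without loading all 100K samples at once. A minimal reader, with field names taken from the Data Format above:

import json

def load_samples(path: str = "dataset/training_data_100k.jsonl"):
    """Yield (input, output) pairs, one per JSONL line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)
            yield sample["input"], sample["output"]

# Inspect the first sample's instruction
inp, out = next(load_samples())
print(inp["instruction"])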

🚀 Quick Start

Prerequisites

pip install -r requirements.txt

1. Build Dataset

cd dataset_builder
python extract_linux_bugfixes.py
python format_for_training.py

2. Train Model

cd train
python train_codellama_qlora_linux_bugfix.py

3. Evaluate Model

cd evaluate
python evaluate_linux_bugfix_model.py

πŸ“ Project Structure

CodeLLaMA-Linux-BugFix/
├── dataset_builder/          # Dataset creation scripts
│   ├── extract_linux_bugfixes.py          # Main dataset extraction
│   ├── extract_linux_bugfixes_parallel.py # Parallelized version
│   └── format_for_training.py
├── dataset/                  # Generated datasets
│   ├── training_data_100k.jsonl
│   └── training_data_prompt_completion.jsonl
├── train/                    # Training scripts and outputs
│   ├── train_codellama_qlora_linux_bugfix.py # Main training script
│   ├── train_codellama_qlora_simple.py
│   ├── download_codellama_model.py
│   └── output/               # Trained model checkpoints
├── evaluate/                 # Evaluation scripts and results
│   ├── evaluate_linux_bugfix_model.py # Model evaluation
│   ├── test_samples.jsonl    # Evaluation dataset
│   └── output/               # Evaluation results
└── requirements.txt          # Python dependencies

🔧 Key Features

Efficient Training

  • QLoRA: Reduces GPU memory requirements by roughly 75% relative to full fine-tuning while maintaining performance
  • 4-bit Quantization: Enables training on consumer hardware
  • Gradient Checkpointing: Optimizes memory usage during training

Real-world Data

  • Authentic Bug Fixes: Extracted from actual Linux kernel development
  • Contextual Understanding: Captures relevant code context around bugs
  • Git Integration: Outputs proper Git diff format

Evaluation

  • BLEU Score: Measures n-gram overlap between generated and reference patches
  • ROUGE Score: Measures recall-oriented overlap with the reference fixes
  • Comprehensive Metrics: JSON and CSV output formats
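
A hedged sketch of the metric computation using the Hugging Face evaluate library; evaluate_linux_bugfix_model.py may use different tooling, and the output path here is an assumption.

import json
import evaluate

predictions = ["- kfree(ptr);\n+ if (ptr)\n+     kfree(ptr);"]  # model output
references = ["- kfree(ptr);\n+ if (ptr)\n+     kfree(ptr);"]   # ground-truth diff

bleu = evaluate.load("bleu").compute(
    predictions=predictions, references=[[ref] for ref in references]
)
rouge = evaluate.load("rouge").compute(
    predictions=predictions, references=references
)

# Persist a JSON summary (path assumed; the project also emits CSV)
with open("evaluate/output/metrics.json", "w") as f:
    json.dump({"bleu": bleu["bleu"], "rougeL": rouge["rougeL"]}, f, indent=2)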

🎯 Use Cases

The fine-tuned model can assist with:

  1. Automated Bug Fixing: Generate patches for common kernel bugs
  2. Code Review: Suggest fixes during development
  3. Learning: Study patterns in Linux kernel bug fixes
  4. Research: Advance automated software repair techniques
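
For use case 1, a generated patch is only useful if it applies cleanly. A small hedged sketch (not part of the project's scripts) that dry-runs a model-generated diff against a kernel checkout:

import subprocess
import tempfile

def patch_applies(repo: str, diff_text: str) -> bool:
    """Dry-run `git apply --check` on a model-generated diff."""
    with tempfile.NamedTemporaryFile("w", suffix=".patch") as patch:
        patch.write(diff_text)
        patch.flush()
        result = subprocess.run(
            ["git", "-C", repo, "apply", "--check", patch.name],
            capture_output=True, text=True,
        )
    return result.returncode == 0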

📈 Performance

The model is evaluated using:

  • BLEU Score: Measures how well generated diffs match reference fixes
  • ROUGE Score: Evaluates overlap between predicted and actual fixes
  • Human Evaluation: Qualitative assessment of fix quality

🔬 Technical Details

Model Architecture

  • Base: CodeLLaMA-7B-Instruct, the instruction-tuned variant of CodeLLaMA
  • Adapter: LoRA layers for efficient fine-tuning
  • Output: Generates Git diff format patches
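
A hypothetical inference sketch: load the 7B base model, attach the trained LoRA adapter, and sample a diff. The adapter path train/output is taken from the project tree, and the prompt template is illustrative only.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "train/output")  # LoRA checkpoint
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")

prompt = (
    "Fix the potential NULL pointer dereference.\n\n"
    "static void release_buf(struct foo *f)\n"
    "{\n\tkfree(f->buf);\n\tf->buf = NULL;\n}\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens (the predicted diff)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))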

Training Process

  1. Data Preprocessing: Extract and clean commit data
  2. Tokenization: Convert to model input format
  3. QLoRA Training: Parameter-efficient fine-tuning of the LoRA adapters over the frozen 4-bit base
  4. Checkpointing: Save model states for evaluation
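
A hedged sketch of step 2's input formatting, turning the extracted records into the prompt/completion pairs stored in training_data_prompt_completion.jsonl; the prompt template is an assumption, and the real logic lives in dataset_builder/format_for_training.py.

import json

def to_prompt_completion(sample: dict) -> dict:
    """Flatten one extracted record into a prompt/completion pair."""
    prompt = (
        "### Instruction:\n" + sample["input"]["instruction"] + "\n\n"
        "### Buggy code:\n" + sample["input"]["original code"] + "\n\n"
        "### Fix (git diff):\n"
    )
    return {"prompt": prompt, "completion": sample["output"]["diff codes"]}

with open("dataset/training_data_100k.jsonl", encoding="utf-8") as src, \
     open("dataset/training_data_prompt_completion.jsonl", "w",
          encoding="utf-8") as dst:
    for line in src:
        dst.write(json.dumps(to_prompt_completion(json.loads(line))) + "\n")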

Memory Optimization

  • 4-bit Quantization: Cuts stored weight memory roughly 4x versus 16-bit precision
  • Gradient Accumulation: Enables larger effective batch sizes
  • Mixed Precision: Uses bfloat16 for faster training
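
These three optimizations map directly onto transformers TrainingArguments. A minimal sketch, assuming the effective batch size of 64 is split as 8 per device times 8 accumulation steps (the actual split in the training script may differ):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="train/output",
    num_train_epochs=3,
    per_device_train_batch_size=8,   # assumed split of the effective batch
    gradient_accumulation_steps=8,   # 8 x 8 = effective batch size 64
    learning_rate=2e-4,
    bf16=True,                       # bfloat16 mixed precision
    gradient_checkpointing=True,     # recompute activations to save memory
    logging_steps=50,
    save_strategy="epoch",           # checkpoint each epoch for evaluation
)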

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • CodeLLaMA Team: For the base model
  • Linux Kernel Community: For the bug-fix data
  • Hugging Face: For the transformers library
  • Microsoft: For the LoRA technique
