CodeLLaMA-Linux-BugFix

A machine learning project that fine-tunes CodeLLaMA-7B-Instruct specifically for Linux kernel bug fixing using QLoRA (Quantized Low-Rank Adaptation). The model learns to generate Git diff patches from buggy C code and commit messages.

🎯 Project Overview

This project addresses the challenging task of automated Linux kernel bug fixing by:

  • Extracting real bug-fix data from the Linux kernel Git repository
  • Training a specialized model using QLoRA for efficient fine-tuning
  • Generating Git diff patches that can be applied to fix bugs
  • Providing evaluation metrics to assess model performance

πŸ—οΈ Architecture

Base Model

  • Model: codellama/CodeLlama-7b-Instruct-hf (7 billion parameters)
  • Fine-tuning Method: QLoRA with 4-bit quantization
  • Hardware: Optimized for an NVIDIA H200 GPU with bfloat16 precision

Training Configuration

  • LoRA Config: r=64, alpha=16, dropout=0.1
  • Training: 3 epochs, batch size 64, learning rate 2e-4
  • Memory Optimization: Gradient checkpointing, mixed precision training
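
A minimal sketch of this configuration using the Hugging Face transformers and peft libraries. The nf4 quantization type and the target_modules list are assumptions not stated above; the authoritative settings live in train/train_codellama_qlora_linux_bugfix.py.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization (QLoRA) with bfloat16 compute, as used on the H200
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # assumed; NF4 is the QLoRA default
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapter matching the bullets above: r=64, alpha=16, dropout=0.1
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed targets
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable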

📊 Dataset

The project creates a specialized dataset from Linux kernel commits:

Data Extraction Process

  1. Commit Filtering: Identifies bug-fix commits using keywords in the commit message (see the sketch after this list):

    • fix, bug, leak, null, overflow, error, failure
    • crash, panic, memory, race, deadlock, corruption
    • security, vulnerability, exploit, buffer, stack
  2. Code Context Extraction:

    • Focuses on C and header files (.c, .h)
    • Extracts 10 lines before/after the bug location
    • Captures relevant code context
  3. Data Format:

    {
      "input": {
        "original code": "C code snippet with bug",
        "instruction": "Bug fix instruction from commit message"
      },
      "output": {
        "diff codes": "Git diff showing the fix"
      }
    }
    
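
A hedged sketch of steps 1 and 2, assuming a local kernel checkout; the real implementation is dataset_builder/extract_linux_bugfixes.py and may differ in detail.

import subprocess

BUGFIX_KEYWORDS = {
    "fix", "bug", "leak", "null", "overflow", "error", "failure",
    "crash", "panic", "memory", "race", "deadlock", "corruption",
    "security", "vulnerability", "exploit", "buffer", "stack",
}

def is_bugfix_commit(subject: str) -> bool:
    """Step 1: heuristic keyword filter on the commit subject line."""
    subject = subject.lower()
    return any(keyword in subject for keyword in BUGFIX_KEYWORDS)

def context_window(lines: list[str], bug_line: int, radius: int = 10) -> str:
    """Step 2: keep `radius` lines before/after the bug location."""
    lo = max(0, bug_line - radius)
    hi = min(len(lines), bug_line + radius + 1)
    return "\n".join(lines[lo:hi])

def bugfix_commits(repo: str) -> list[str]:
    """Scan git log subjects and keep likely bug-fix commit SHAs."""
    log = subprocess.run(
        ["git", "-C", repo, "log", "--pretty=%H %s"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [
        sha for sha, _, subject in
        (line.partition(" ") for line in log.splitlines())
        if is_bugfix_commit(subject)
    ]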

Dataset Statistics

  • Training Data: 100K samples (training_data_100k.jsonl)
  • Format: JSONL (one JSON object per line)
  • Source: Linux kernel Git repository
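
Because each line is a self-contained JSON object, the dataset can be streamed without loading all 100K samples at once. A minimal reader, with field names taken from the Data Format above:

import json

def load_samples(path: str = "dataset/training_data_100k.jsonl"):
    """Yield (input, output) pairs, one per JSONL line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)
            yield sample["input"], sample["output"]

# Inspect the first sample's instruction
inp, out = next(load_samples())
print(inp["instruction"])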

🚀 Quick Start

Prerequisites

pip install -r requirements.txt

1. Build Dataset

cd dataset_builder
python extract_linux_bugfixes.py
python format_for_training.py

2. Train Model

cd train
python train_codellama_qlora_linux_bugfix.py

3. Evaluate Model

cd evaluate
python evaluate_linux_bugfix_model.py

πŸ“ Project Structure

CodeLLaMA-Linux-BugFix/
├── dataset_builder/          # Dataset creation scripts
│   ├── extract_linux_bugfixes.py          # Main dataset extraction
│   ├── extract_linux_bugfixes_parallel.py # Parallelized version
│   └── format_for_training.py
├── dataset/                  # Generated datasets
│   ├── training_data_100k.jsonl
│   └── training_data_prompt_completion.jsonl
├── train/                    # Training scripts and outputs
│   ├── train_codellama_qlora_linux_bugfix.py # Main training script
│   ├── train_codellama_qlora_simple.py
│   ├── download_codellama_model.py
│   └── output/               # Trained model checkpoints
├── evaluate/                 # Evaluation scripts and results
│   ├── evaluate_linux_bugfix_model.py # Model evaluation
│   ├── test_samples.jsonl    # Evaluation dataset
│   └── output/               # Evaluation results
└── requirements.txt          # Python dependencies

🔧 Key Features

Efficient Training

  • QLoRA: Reduces GPU memory requirements by roughly 75% relative to full fine-tuning while maintaining performance
  • 4-bit Quantization: Enables training on consumer hardware
  • Gradient Checkpointing: Optimizes memory usage during training

Real-world Data

  • Authentic Bug Fixes: Extracted from actual Linux kernel development
  • Contextual Understanding: Captures relevant code context around bugs
  • Git Integration: Outputs proper Git diff format

Evaluation

  • BLEU Score: Measures n-gram overlap between generated and reference patches
  • ROUGE Score: Measures recall-oriented overlap with the reference fixes
  • Comprehensive Metrics: JSON and CSV output formats
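
A hedged sketch of the metric computation using the Hugging Face evaluate library; evaluate_linux_bugfix_model.py may use different tooling, and the output path here is an assumption.

import json
import evaluate

predictions = ["- kfree(ptr);\n+ if (ptr)\n+     kfree(ptr);"]  # model output
references = ["- kfree(ptr);\n+ if (ptr)\n+     kfree(ptr);"]   # ground-truth diff

bleu = evaluate.load("bleu").compute(
    predictions=predictions, references=[[ref] for ref in references]
)
rouge = evaluate.load("rouge").compute(
    predictions=predictions, references=references
)

# Persist a JSON summary (path assumed; the project also emits CSV)
with open("evaluate/output/metrics.json", "w") as f:
    json.dump({"bleu": bleu["bleu"], "rougeL": rouge["rougeL"]}, f, indent=2)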

🎯 Use Cases

The fine-tuned model can assist with:

  1. Automated Bug Fixing: Generate patches for common kernel bugs
  2. Code Review: Suggest fixes during development
  3. Learning: Study patterns in Linux kernel bug fixes
  4. Research: Advance automated software repair techniques
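
For use case 1, a generated patch is only useful if it applies cleanly. A small hedged sketch (not part of the project's scripts) that dry-runs a model-generated diff against a kernel checkout:

import subprocess
import tempfile

def patch_applies(repo: str, diff_text: str) -> bool:
    """Dry-run `git apply --check` on a model-generated diff."""
    with tempfile.NamedTemporaryFile("w", suffix=".patch") as patch:
        patch.write(diff_text)
        patch.flush()
        result = subprocess.run(
            ["git", "-C", repo, "apply", "--check", patch.name],
            capture_output=True, text=True,
        )
    return result.returncode == 0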

📈 Performance

The model is evaluated using:

  • BLEU Score: Measures how well generated diffs match reference fixes
  • ROUGE Score: Evaluates overlap between predicted and actual fixes
  • Human Evaluation: Qualitative assessment of fix quality

🔬 Technical Details

Model Architecture

  • Base: CodeLLaMA-7B-Instruct, the instruction-tuned variant of CodeLLaMA
  • Adapter: LoRA layers for efficient fine-tuning
  • Output: Generates Git diff format patches
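
A hypothetical inference sketch: load the 7B base model, attach the trained LoRA adapter, and sample a diff. The adapter path train/output is taken from the project tree, and the prompt template is illustrative only.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "train/output")  # LoRA checkpoint
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")

prompt = (
    "Fix the potential NULL pointer dereference.\n\n"
    "static void release_buf(struct foo *f)\n"
    "{\n\tkfree(f->buf);\n\tf->buf = NULL;\n}\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens (the predicted diff)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))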

Training Process

  1. Data Preprocessing: Extract and clean commit data
  2. Tokenization: Convert to model input format
  3. QLoRA Training: Parameter-efficient fine-tuning of the LoRA adapters over the frozen 4-bit base
  4. Checkpointing: Save model states for evaluation
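
A hedged sketch of step 2's input formatting, turning the extracted records into the prompt/completion pairs stored in training_data_prompt_completion.jsonl; the prompt template is an assumption, and the real logic lives in dataset_builder/format_for_training.py.

import json

def to_prompt_completion(sample: dict) -> dict:
    """Flatten one extracted record into a prompt/completion pair."""
    prompt = (
        "### Instruction:\n" + sample["input"]["instruction"] + "\n\n"
        "### Buggy code:\n" + sample["input"]["original code"] + "\n\n"
        "### Fix (git diff):\n"
    )
    return {"prompt": prompt, "completion": sample["output"]["diff codes"]}

with open("dataset/training_data_100k.jsonl", encoding="utf-8") as src, \
     open("dataset/training_data_prompt_completion.jsonl", "w",
          encoding="utf-8") as dst:
    for line in src:
        dst.write(json.dumps(to_prompt_completion(json.loads(line))) + "\n")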

Memory Optimization

  • 4-bit Quantization: Cuts stored weight memory roughly 4x versus 16-bit precision
  • Gradient Accumulation: Enables larger effective batch sizes
  • Mixed Precision: Uses bfloat16 for faster training
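
These three optimizations map directly onto transformers TrainingArguments. A minimal sketch, assuming the effective batch size of 64 is split as 8 per device times 8 accumulation steps (the actual split in the training script may differ):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="train/output",
    num_train_epochs=3,
    per_device_train_batch_size=8,   # assumed split of the effective batch
    gradient_accumulation_steps=8,   # 8 x 8 = effective batch size 64
    learning_rate=2e-4,
    bf16=True,                       # bfloat16 mixed precision
    gradient_checkpointing=True,     # recompute activations to save memory
    logging_steps=50,
    save_strategy="epoch",           # checkpoint each epoch for evaluation
)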

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • CodeLLaMA Team: For the base model
  • Linux Kernel Community: For the bug-fix data
  • Hugging Face: For the transformers library
  • Microsoft: For the LoRA technique
