---
license: mit
tags:
  - codellama
  - linux
  - bugfix
  - lora
  - qlora
  - git-diff
base_model: codellama/CodeLlama-7b-Instruct-hf
model_type: LlamaForCausalLM
library_name: peft
pipeline_tag: text-generation
---

# CodeLLaMA-Linux-BugFix

A fine-tuned version of CodeLLaMA-7B-Instruct, specialized for Linux kernel bug fixing via QLoRA (Quantized Low-Rank Adaptation). Given buggy C code and a commit message, the model generates a Git diff patch implementing the fix.


## 🎯 Overview

This project targets automated Linux kernel bug fixing by:

  • Mining real commit data from the kernel Git history
  • Training a specialized QLoRA model on diff-style fixes
  • Generating Git patches in response to bug-prone code
  • Evaluating results using BLEU, ROUGE, and human inspection

## 🧠 Model Configuration

  • Base model: CodeLLaMA-7B-Instruct
  • Fine-tuning method: QLoRA with 4-bit quantization
  • Training setup:
    • LoRA r=64, alpha=16, dropout=0.1
    • Batch size: 64, LR: 2e-4, Epochs: 3
    • Mixed precision (bfloat16), gradient checkpointing
  • Hardware: Optimized for NVIDIA H200 GPUs
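
For concreteness, the setup above can be collected into a plain-Python configuration sketch. The per-device batch size and gradient-accumulation split (4 × 16) is an assumption for illustration only; the source states just the effective batch size of 64.

```python
# Hyperparameter sketch mirroring the configuration above.
# The per-device/accumulation split is an assumption, not from the source.
lora_config = {
    "r": 64,            # LoRA rank
    "lora_alpha": 16,   # LoRA scaling factor
    "lora_dropout": 0.1,
}

train_config = {
    "per_device_train_batch_size": 4,   # assumed split
    "gradient_accumulation_steps": 16,  # 4 * 16 = 64 effective
    "learning_rate": 2e-4,
    "num_train_epochs": 3,
    "bf16": True,                       # mixed precision (bfloat16)
    "gradient_checkpointing": True,
    "load_in_4bit": True,               # QLoRA 4-bit quantization
}

effective_batch = (train_config["per_device_train_batch_size"]
                   * train_config["gradient_accumulation_steps"])
print(effective_batch)  # 64
```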

## 📊 Dataset

Custom dataset extracted from Linux kernel Git history.

### Filtering Criteria

Commits are kept when the message contains bug-fix keywords such as: fix, bug, crash, memory, null, panic, overflow, race, corruption, etc.
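
A minimal sketch of such a keyword filter (the keyword list is a subset of the one above; the project's actual extraction script may match differently):

```python
import re

# Keywords that mark a commit message as a likely bug fix.
BUGFIX_KEYWORDS = ["fix", "bug", "crash", "memory", "null",
                   "panic", "overflow", "race", "corruption"]

_pattern = re.compile(r"\b(" + "|".join(BUGFIX_KEYWORDS) + r")\b",
                      re.IGNORECASE)

def is_bugfix_commit(message: str) -> bool:
    """Return True if the commit message mentions a bug-fix keyword."""
    return bool(_pattern.search(message))

print(is_bugfix_commit("mm: fix NULL pointer dereference in page allocator"))  # True
print(is_bugfix_commit("docs: update maintainer list"))                        # False
```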

### Structure

  • Language: C (.c, .h)
  • Context: 10 lines before/after the change
  • Format:

```json
{
  "input": {
    "original code": "C code snippet with bug",
    "instruction": "Commit message or fix description"
  },
  "output": {
    "diff codes": "Git diff showing the fix"
  }
}
```
  • File: training_data_100k.jsonl (100,000 samples)
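
For illustration, one record in this schema can be built and serialized as a JSONL line like so (the code snippet and commit message are invented examples, not taken from the dataset):

```python
import json

# A hypothetical sample in the schema shown above.
sample = {
    "input": {
        "original code": "kfree(ptr);\nptr->count = 0;",
        "instruction": "Fix use-after-free of ptr in cleanup path",
    },
    "output": {
        "diff codes": "-kfree(ptr);\n-ptr->count = 0;\n+ptr->count = 0;\n+kfree(ptr);",
    },
}

# Each line of training_data_100k.jsonl is one such JSON object.
line = json.dumps(sample)
restored = json.loads(line)
assert restored == sample
```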

## 🚀 Quick Start

### Install dependencies

```bash
pip install -r requirements.txt
```

### 1. Build the Dataset

```bash
cd dataset_builder
python extract_linux_bugfixes.py
python format_for_training.py
```

### 2. Fine-tune the Model

```bash
cd train
python train_codellama_qlora_linux_bugfix.py
```

### 3. Run Evaluation

```bash
cd evaluate
python evaluate_linux_bugfix_model.py
```

πŸ“ Project Structure

CodeLLaMA-Linux-BugFix/
β”œβ”€β”€ dataset_builder/
β”‚   β”œβ”€β”€ extract_linux_bugfixes.py
β”‚   β”œβ”€β”€ extract_linux_bugfixes_parallel.py
β”‚   └── format_for_training.py
β”œβ”€β”€ dataset/
β”‚   β”œβ”€β”€ training_data_100k.jsonl
β”‚   └── training_data_prompt_completion.jsonl
β”œβ”€β”€ train/
β”‚   β”œβ”€β”€ train_codellama_qlora_linux_bugfix.py
β”‚   β”œβ”€β”€ train_codellama_qlora_simple.py
β”‚   β”œβ”€β”€ download_codellama_model.py
β”‚   └── output/
β”œβ”€β”€ evaluate/
β”‚   β”œβ”€β”€ evaluate_linux_bugfix_model.py
β”‚   β”œβ”€β”€ test_samples.jsonl
β”‚   └── output/
└── requirements.txt

## 🧩 Features

  • 🔧 Efficient fine-tuning: QLoRA + 4-bit quantization = massive memory savings
  • 🧠 Real-world commits: mined from actual Linux kernel development
  • 💡 Context-aware: code context is extracted around the buggy lines
  • 💻 Output-ready: generates valid Git-style diffs
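
The "context-aware" extraction (10 lines before and after a change, per the Dataset section) might look like the following sketch; the function name and signature are hypothetical:

```python
def extract_context(lines, change_idx, window=10):
    """Return up to `window` lines before and after the changed line."""
    start = max(0, change_idx - window)
    end = min(len(lines), change_idx + window + 1)
    return lines[start:end]

# Toy example: a 100-line file with a change on line index 50.
source = [f"line {i}" for i in range(100)]
snippet = extract_context(source, change_idx=50)
print(len(snippet))  # 21: 10 before + the changed line + 10 after
```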

## 📈 Evaluation Metrics

  • BLEU: n-gram overlap with reference diffs
  • ROUGE: recall-oriented overlap with the reference fix
  • Human evaluation: subjective inspection of patch quality
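
As a rough illustration, a ROUGE-1-style unigram recall between a generated diff and a reference can be computed like this (a toy sketch for intuition, not the project's actual evaluation code):

```python
from collections import Counter

def unigram_recall(candidate: str, reference: str) -> float:
    """ROUGE-1 recall: fraction of reference tokens covered by the candidate."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    overlap = sum(min(cand[tok], n) for tok, n in ref.items())
    return overlap / max(1, sum(ref.values()))

ref = "- kfree(ptr); + if (ptr) kfree(ptr);"
hyp = "+ if (ptr) kfree(ptr);"
print(round(unigram_recall(hyp, ref), 3))  # 0.667: 4 of 6 reference tokens covered
```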

## 🧪 Use Cases

  • Automated kernel bug fixing
  • Code review assistance
  • Teaching/debugging kernel code
  • Research in automated program repair (APR)

## 🔬 Technical Highlights

### Memory & Speed Optimizations

  • 4-bit quantization (NF4)
  • Gradient checkpointing
  • Mixed precision (bfloat16)
  • Gradient accumulation
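
Gradient accumulation, the last item above, trades steps for memory: gradients from several micro-batches are summed before one optimizer update, matching the full-batch gradient at a fraction of the peak memory. A toy numeric sketch (pure Python, no framework):

```python
# Toy model: loss = 0.5 * (w*x - y)^2, so dL/dw = (w*x - y) * x.
def grad(w, x, y):
    return (w * x - y) * x

w = 0.0
data = [(1.0, 2.0), (2.0, 2.0), (1.0, 4.0), (3.0, 3.0)]

# Full-batch mean gradient (what we want, but may not fit in memory).
full = sum(grad(w, x, y) for x, y in data) / len(data)

# Accumulated over micro-batches of size 2: same result, lower peak memory.
acc = 0.0
for micro in (data[:2], data[2:]):
    acc += sum(grad(w, x, y) for x, y in micro)
acc /= len(data)

assert abs(full - acc) < 1e-12  # accumulation reproduces the full-batch gradient
```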

## 🤝 Contributing

  1. Fork this repo
  2. Create a branch
  3. Add your feature or fix
  4. Submit a PR 🙌

## 📄 License

MIT License – see LICENSE file for details.


πŸ™ Acknowledgments

  • Meta for CodeLLaMA
  • Hugging Face for Transformers + PEFT
  • The Linux kernel community for open access to commit data
  • Microsoft for introducing LoRA

## 📚 References