---
license: mit
tags:
  - codellama
  - linux
  - bugfix
  - lora
  - qlora
  - git-diff
base_model: codellama/CodeLlama-7b-Instruct-hf
model_type: LlamaForCausalLM
library_name: peft
pipeline_tag: text-generation
---

# CodeLLaMA-Linux-BugFix

A fine-tuned version of CodeLLaMA-7B-Instruct, specialized for Linux kernel bug fixing via QLoRA (Quantized Low-Rank Adaptation). Given buggy C code and a commit message, the model generates a Git diff patch implementing the fix.


## 🎯 Overview

This project targets automated Linux kernel bug fixing by:

  • Mining real commit data from the kernel Git history
  • Training a specialized QLoRA model on diff-style fixes
  • Generating Git patches in response to bug-prone code
  • Evaluating results using BLEU, ROUGE, and human inspection

## 🧠 Model Configuration

  • Base model: CodeLLaMA-7B-Instruct
  • Fine-tuning method: QLoRA with 4-bit quantization
  • Training setup:
    • LoRA r=64, alpha=16, dropout=0.1
    • Batch size: 64, LR: 2e-4, Epochs: 3
    • Mixed precision (bfloat16), gradient checkpointing
  • Hardware: Optimized for NVIDIA H200 GPUs
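
For concreteness, the setup above can be collected into a plain-Python configuration sketch. The per-device batch size and gradient-accumulation split (4 × 16) is an assumption for illustration only; the source states just the effective batch size of 64.

```python
# Hyperparameter sketch mirroring the configuration above.
# The per-device/accumulation split is an assumption, not from the source.
lora_config = {
    "r": 64,            # LoRA rank
    "lora_alpha": 16,   # LoRA scaling factor
    "lora_dropout": 0.1,
}

train_config = {
    "per_device_train_batch_size": 4,   # assumed split
    "gradient_accumulation_steps": 16,  # 4 * 16 = 64 effective
    "learning_rate": 2e-4,
    "num_train_epochs": 3,
    "bf16": True,                       # mixed precision (bfloat16)
    "gradient_checkpointing": True,
    "load_in_4bit": True,               # QLoRA 4-bit quantization
}

effective_batch = (train_config["per_device_train_batch_size"]
                   * train_config["gradient_accumulation_steps"])
print(effective_batch)  # 64
```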

## 📊 Dataset

Custom dataset extracted from Linux kernel Git history.

### Filtering Criteria

Commits are kept when the message contains bug-fix keywords such as: fix, bug, crash, memory, null, panic, overflow, race, corruption, etc.
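
A minimal sketch of such a keyword filter (the keyword list is a subset of the one above; the project's actual extraction script may match differently):

```python
import re

# Keywords that mark a commit message as a likely bug fix.
BUGFIX_KEYWORDS = ["fix", "bug", "crash", "memory", "null",
                   "panic", "overflow", "race", "corruption"]

_pattern = re.compile(r"\b(" + "|".join(BUGFIX_KEYWORDS) + r")\b",
                      re.IGNORECASE)

def is_bugfix_commit(message: str) -> bool:
    """Return True if the commit message mentions a bug-fix keyword."""
    return bool(_pattern.search(message))

print(is_bugfix_commit("mm: fix NULL pointer dereference in page allocator"))  # True
print(is_bugfix_commit("docs: update maintainer list"))                        # False
```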

### Structure

  • Language: C (.c, .h)
  • Context: 10 lines before/after the change
  • Format:

```json
{
  "input": {
    "original code": "C code snippet with bug",
    "instruction": "Commit message or fix description"
  },
  "output": {
    "diff codes": "Git diff showing the fix"
  }
}
```
  • File: training_data_100k.jsonl (100,000 samples)
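
For illustration, one record in this schema can be built and serialized as a JSONL line like so (the code snippet and commit message are invented examples, not taken from the dataset):

```python
import json

# A hypothetical sample in the schema shown above.
sample = {
    "input": {
        "original code": "kfree(ptr);\nptr->count = 0;",
        "instruction": "Fix use-after-free of ptr in cleanup path",
    },
    "output": {
        "diff codes": "-kfree(ptr);\n-ptr->count = 0;\n+ptr->count = 0;\n+kfree(ptr);",
    },
}

# Each line of training_data_100k.jsonl is one such JSON object.
line = json.dumps(sample)
restored = json.loads(line)
assert restored == sample
```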

## 🚀 Quick Start

### Install dependencies

```bash
pip install -r requirements.txt
```

### 1. Build the Dataset

```bash
cd dataset_builder
python extract_linux_bugfixes.py
python format_for_training.py
```

### 2. Fine-tune the Model

```bash
cd train
python train_codellama_qlora_linux_bugfix.py
```

### 3. Run Evaluation

```bash
cd evaluate
python evaluate_linux_bugfix_model.py
```

πŸ“ Project Structure

CodeLLaMA-Linux-BugFix/
β”œβ”€β”€ dataset_builder/
β”‚   β”œβ”€β”€ extract_linux_bugfixes.py
β”‚   β”œβ”€β”€ extract_linux_bugfixes_parallel.py
β”‚   └── format_for_training.py
β”œβ”€β”€ dataset/
β”‚   β”œβ”€β”€ training_data_100k.jsonl
β”‚   └── training_data_prompt_completion.jsonl
β”œβ”€β”€ train/
β”‚   β”œβ”€β”€ train_codellama_qlora_linux_bugfix.py
β”‚   β”œβ”€β”€ train_codellama_qlora_simple.py
β”‚   β”œβ”€β”€ download_codellama_model.py
β”‚   └── output/
β”œβ”€β”€ evaluate/
β”‚   β”œβ”€β”€ evaluate_linux_bugfix_model.py
β”‚   β”œβ”€β”€ test_samples.jsonl
β”‚   └── output/
└── requirements.txt

## 🧩 Features

  • 🔧 Efficient fine-tuning: QLoRA + 4-bit quantization = massive memory savings
  • 🧠 Real-world commits: mined from actual Linux kernel development
  • 💡 Context-aware: code context is extracted around the buggy lines
  • 💻 Output-ready: generates valid Git-style diffs
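
The "context-aware" extraction (10 lines before and after a change, per the Dataset section) might look like the following sketch; the function name and signature are hypothetical:

```python
def extract_context(lines, change_idx, window=10):
    """Return up to `window` lines before and after the changed line."""
    start = max(0, change_idx - window)
    end = min(len(lines), change_idx + window + 1)
    return lines[start:end]

# Toy example: a 100-line file with a change on line index 50.
source = [f"line {i}" for i in range(100)]
snippet = extract_context(source, change_idx=50)
print(len(snippet))  # 21: 10 before + the changed line + 10 after
```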

## 📈 Evaluation Metrics

  • BLEU: n-gram overlap with reference diffs
  • ROUGE: recall-oriented overlap with the reference fix
  • Human evaluation: subjective inspection of patch quality
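
As a rough illustration, a ROUGE-1-style unigram recall between a generated diff and a reference can be computed like this (a toy sketch for intuition, not the project's actual evaluation code):

```python
from collections import Counter

def unigram_recall(candidate: str, reference: str) -> float:
    """ROUGE-1 recall: fraction of reference tokens covered by the candidate."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    overlap = sum(min(cand[tok], n) for tok, n in ref.items())
    return overlap / max(1, sum(ref.values()))

ref = "- kfree(ptr); + if (ptr) kfree(ptr);"
hyp = "+ if (ptr) kfree(ptr);"
print(round(unigram_recall(hyp, ref), 3))  # 0.667: 4 of 6 reference tokens covered
```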

## 🧪 Use Cases

  • Automated kernel bug fixing
  • Code review assistance
  • Teaching/debugging kernel code
  • Research in automated program repair (APR)

## 🔬 Technical Highlights

### Memory & Speed Optimizations

  • 4-bit quantization (NF4)
  • Gradient checkpointing
  • Mixed precision (bfloat16)
  • Gradient accumulation
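
Gradient accumulation, the last item above, trades steps for memory: gradients from several micro-batches are summed before one optimizer update, matching the full-batch gradient at a fraction of the peak memory. A toy numeric sketch (pure Python, no framework):

```python
# Toy model: loss = 0.5 * (w*x - y)^2, so dL/dw = (w*x - y) * x.
def grad(w, x, y):
    return (w * x - y) * x

w = 0.0
data = [(1.0, 2.0), (2.0, 2.0), (1.0, 4.0), (3.0, 3.0)]

# Full-batch mean gradient (what we want, but may not fit in memory).
full = sum(grad(w, x, y) for x, y in data) / len(data)

# Accumulated over micro-batches of size 2: same result, lower peak memory.
acc = 0.0
for micro in (data[:2], data[2:]):
    acc += sum(grad(w, x, y) for x, y in micro)
acc /= len(data)

assert abs(full - acc) < 1e-12  # accumulation reproduces the full-batch gradient
```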

## 🤝 Contributing

  1. Fork this repo
  2. Create a branch
  3. Add your feature or fix
  4. Submit a PR 🙌

## 📄 License

MIT License – see LICENSE file for details.


πŸ™ Acknowledgments

  • Meta for CodeLLaMA
  • Hugging Face for Transformers + PEFT
  • The Linux kernel community for open access to commit data
  • Microsoft for introducing LoRA

## 📚 References