# CodeLLaMA-Linux-BugFix
A machine learning project that fine-tunes CodeLLaMA-7B-Instruct specifically for Linux kernel bug fixing using QLoRA (Quantized Low-Rank Adaptation). The model learns to generate Git diff patches from buggy C code and commit messages.
## 🎯 Project Overview
This project addresses the challenging task of automated Linux kernel bug fixing by:
- **Extracting real bug-fix data** from the Linux kernel Git repository
- **Training a specialized model** using QLoRA for efficient fine-tuning
- **Generating Git diff patches** that can be applied to fix bugs
- **Providing evaluation metrics** to assess model performance
## 🏗️ Architecture
### Base Model
- **Model**: `codellama/CodeLlama-7b-Instruct-hf` (7 billion parameters)
- **Fine-tuning Method**: QLoRA with 4-bit quantization
- **Hardware**: Optimized for H200 GPU with bfloat16 precision
### Training Configuration
- **LoRA Config**: r=64, alpha=16, dropout=0.1
- **Training**: 3 epochs, batch size 64, learning rate 2e-4
- **Memory Optimization**: Gradient checkpointing, mixed precision training
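In `peft`/`transformers` terms, this setup looks roughly like the sketch below; the NF4 quantization type and the choice of target modules are assumptions typical for LLaMA-family models, not details confirmed from the training script:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit base-model quantization (QLoRA); NF4 is an assumption here.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 compute, as on H200
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters with the hyperparameters listed above; target modules are
# a typical choice for LLaMA-family models, not confirmed from the script.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```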
## 📊 Dataset
The project creates a specialized dataset from Linux kernel commits:
### Data Extraction Process
1. **Commit Filtering**: Identifies bug-fix commits using keywords (see the sketch after this list):
- `fix`, `bug`, `leak`, `null`, `overflow`, `error`, `failure`
- `crash`, `panic`, `memory`, `race`, `deadlock`, `corruption`
- `security`, `vulnerability`, `exploit`, `buffer`, `stack`
2. **Code Context Extraction**:
- Focuses on C and header files (`.c`, `.h`)
   - Extracts 10 lines before/after the bug location
- Captures relevant code context
3. **Data Format**:
```json
{
"input": {
"original code": "C code snippet with bug",
"instruction": "Bug fix instruction from commit message"
},
"output": {
"diff codes": "Git diff showing the fix"
}
}
```
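A hypothetical sketch of the commit-filtering step, scanning `git log` subjects against the keyword list above; the real `extract_linux_bugfixes.py` may differ in detail:

```python
import subprocess

# Keywords from the commit-filtering step above.
KEYWORDS = {
    "fix", "bug", "leak", "null", "overflow", "error", "failure",
    "crash", "panic", "memory", "race", "deadlock", "corruption",
    "security", "vulnerability", "exploit", "buffer", "stack",
}

def bugfix_commits(repo_path: str):
    """Yield (sha, subject) pairs whose subject matches a bug-fix keyword."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=format:%H %s"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in log.splitlines():
        sha, _, subject = line.partition(" ")
        if any(k in subject.lower() for k in KEYWORDS):
            yield sha, subject
```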
### Dataset Statistics
- **Training Data**: 100K samples (`training_data_100k.jsonl`)
- **Format**: JSONL (one JSON object per line)
- **Source**: Linux kernel Git repository
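Because each line is a standalone JSON object, the file can be loaded with nothing more than the standard library (field names follow the format shown above):

```python
import json

# Load the JSONL dataset: one {"input": ..., "output": ...} object per line.
with open("dataset/training_data_100k.jsonl") as f:
    samples = [json.loads(line) for line in f]

print(samples[0]["input"]["instruction"])   # commit-message instruction
print(samples[0]["output"]["diff codes"])   # reference Git diff
```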
## 🚀 Quick Start
### Prerequisites
```bash
pip install -r requirements.txt
```
### 1. Build Dataset
```bash
cd dataset_builder
python extract_linux_bugfixes.py
python format_for_training.py
```
### 2. Train Model
```bash
cd train
python train_codellama_qlora_linux_bugfix.py
```
### 3. Evaluate Model
```bash
cd evaluate
python evaluate_linux_bugfix_model.py
```
## 📁 Project Structure
```
CodeLLaMA-Linux-BugFix/
├── dataset_builder/                          # Dataset creation scripts
│   ├── extract_linux_bugfixes.py             # Main dataset extraction
│   ├── extract_linux_bugfixes_parallel.py    # Parallelized version
│   └── format_for_training.py
├── dataset/                                  # Generated datasets
│   ├── training_data_100k.jsonl
│   └── training_data_prompt_completion.jsonl
├── train/                                    # Training scripts and outputs
│   ├── train_codellama_qlora_linux_bugfix.py # Main training script
│   ├── train_codellama_qlora_simple.py
│   ├── download_codellama_model.py
│   └── output/                               # Trained model checkpoints
├── evaluate/                                 # Evaluation scripts and results
│   ├── evaluate_linux_bugfix_model.py        # Model evaluation
│   ├── test_samples.jsonl                    # Evaluation dataset
│   └── output/                               # Evaluation results
└── requirements.txt                          # Python dependencies
```
## 🔧 Key Features
### Efficient Training
- **QLoRA**: Reduces weight memory by roughly 75% (4-bit vs. 16-bit weights) while maintaining performance
- **4-bit Quantization**: Enables training on consumer hardware
- **Gradient Checkpointing**: Optimizes memory usage during training
### Real-world Data
- **Authentic Bug Fixes**: Extracted from actual Linux kernel development
- **Contextual Understanding**: Captures relevant code context around bugs
- **Git Integration**: Outputs proper Git diff format
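Because the model emits standard Git diffs, a generated patch can be sanity-checked before use. For example (a sketch, not part of the repository's scripts):

```python
import subprocess
import tempfile

def patch_applies(repo_path: str, diff_text: str) -> bool:
    """Dry-run a model-generated diff with `git apply --check`."""
    with tempfile.NamedTemporaryFile("w", suffix=".patch", delete=False) as f:
        f.write(diff_text)
        patch_path = f.name
    result = subprocess.run(
        ["git", "-C", repo_path, "apply", "--check", patch_path],
        capture_output=True,
    )
    return result.returncode == 0
```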
### Evaluation
- **BLEU Score**: Measures n-gram overlap between generated and reference patches
- **ROUGE Score**: Measures recall-oriented overlap with the reference fixes
- **Comprehensive Metrics**: JSON and CSV output formats
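Both metrics are available off the shelf; a minimal scoring sketch using the Hugging Face `evaluate` package (the evaluation script itself may compute them differently):

```python
import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

# Hypothetical model output and reference diff for illustration.
predictions = ["-\tkfree(ptr);\n+\tif (ptr)\n+\t\tkfree(ptr);"]
references = ["-\tkfree(ptr);\n+\tif (ptr)\n+\t\tkfree(ptr);"]

print(bleu.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions, references=references))
```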
## 🎯 Use Cases
The fine-tuned model can assist with:
1. **Automated Bug Fixing**: Generate patches for common kernel bugs
2. **Code Review**: Suggest fixes during development
3. **Learning**: Study patterns in Linux kernel bug fixes
4. **Research**: Advance automated software repair techniques
## 📈 Performance
The model is evaluated using:
- **BLEU Score**: Measures how well generated diffs match reference fixes
- **ROUGE Score**: Evaluates overlap between predicted and actual fixes
- **Human Evaluation**: Qualitative assessment of fix quality
## 🔬 Technical Details
### Model Architecture
- **Base**: CodeLLaMA-7B-Instruct, an instruction-tuned code model
- **Adapter**: LoRA layers for efficient fine-tuning
- **Output**: Generates Git diff format patches
### Training Process
1. **Data Preprocessing**: Extract and clean commit data
2. **Tokenization**: Convert to model input format
3. **QLoRA Training**: Parameter-efficient fine-tuning of the LoRA adapters over the frozen 4-bit base model
4. **Checkpointing**: Save model states for evaluation
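As one illustration of step 2, a sample can be rendered into CodeLLaMA's `[INST]` instruction format before tokenization (the exact template used in training is an assumption):

```python
def build_prompt(sample: dict) -> str:
    """Render one dataset sample as an instruction-style training string."""
    return (
        f"[INST] {sample['input']['instruction']}\n\n"
        f"{sample['input']['original code']} [/INST]\n"
        f"{sample['output']['diff codes']}"
    )
```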
### Memory Optimization
- **4-bit Quantization**: Reduces model size significantly
- **Gradient Accumulation**: Enables larger effective batch sizes
- **Mixed Precision**: Uses bfloat16 for faster training
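These knobs map directly onto `transformers.TrainingArguments`; a sketch using the values listed under Training Configuration (the per-device/accumulation split is an assumption):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="train/output",
    num_train_epochs=3,
    per_device_train_batch_size=8,   # assumed split of the batch size of 64
    gradient_accumulation_steps=8,   # 8 x 8 = effective batch size 64
    learning_rate=2e-4,
    bf16=True,                       # mixed precision (bfloat16)
    gradient_checkpointing=True,     # recompute activations to save memory
)
```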
## 🤝 Contributing
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request
## 📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🙏 Acknowledgments
- **CodeLLaMA Team**: For the base model
- **Linux Kernel Community**: For the bug-fix data
- **Hugging Face**: For the transformers library
- **Microsoft**: For the LoRA technique
## 📚 References
- [Code Llama: Open Foundation Models for Code](https://arxiv.org/abs/2308.12950)
- [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685) |