# CodeLLaMA-Linux-BugFix

A machine learning project that fine-tunes CodeLLaMA-7B-Instruct specifically for Linux kernel bug fixing using QLoRA (Quantized Low-Rank Adaptation). The model learns to generate Git diff patches from buggy C code and commit messages.

## 🎯 Project Overview

This project addresses the challenging task of automated Linux kernel bug fixing by:

- **Extracting real bug-fix data** from the Linux kernel Git repository
- **Training a specialized model** using QLoRA for efficient fine-tuning
- **Generating Git diff patches** that can be applied to fix bugs
- **Providing evaluation metrics** to assess model performance

## πŸ—οΈ Architecture

### Base Model
- **Model**: `codellama/CodeLlama-7b-Instruct-hf` (7 billion parameters)
- **Fine-tuning Method**: QLoRA with 4-bit quantization
- **Hardware**: Optimized for H200 GPU with bfloat16 precision

### Training Configuration
- **LoRA Config**: r=64, alpha=16, dropout=0.1
- **Training**: 3 epochs, batch size 64, learning rate 2e-4
- **Memory Optimization**: Gradient checkpointing, mixed precision training
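
A minimal sketch of how these settings map onto the `transformers`/`peft`/`bitsandbytes` APIs; the quantization details and LoRA target modules are assumptions, not taken from the training script:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with bfloat16 compute (matches the H200 setup above)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapter with the r/alpha/dropout listed above; target modules are assumed
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```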

## πŸ“Š Dataset

The project creates a specialized dataset from Linux kernel commits:

### Data Extraction Process
1. **Commit Filtering**: Identifies bug-fix commits by keyword matching on the commit message (see the sketch after this list):
   - `fix`, `bug`, `leak`, `null`, `overflow`, `error`, `failure`
   - `crash`, `panic`, `memory`, `race`, `deadlock`, `corruption`
   - `security`, `vulnerability`, `exploit`, `buffer`, `stack`

2. **Code Context Extraction**: 
   - Focuses on C and header files (`.c`, `.h`)
   - Extracts 10 lines before/after bug location
   - Captures relevant code context

3. **Data Format**:
   ```json
   {
     "input": {
       "original code": "C code snippet with bug",
       "instruction": "Bug fix instruction from commit message"
     },
     "output": {
       "diff codes": "Git diff showing the fix"
     }
   }
   ```
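
Steps 1 and 2 might look roughly as follows in code, assuming GitPython and a local kernel checkout; `BUGFIX_KEYWORDS` mirrors the list above, and the remaining names are illustrative, not the extraction script's own.

```python
from git import Repo

# Keyword list from step 1 above
BUGFIX_KEYWORDS = {
    "fix", "bug", "leak", "null", "overflow", "error", "failure",
    "crash", "panic", "memory", "race", "deadlock", "corruption",
    "security", "vulnerability", "exploit", "buffer", "stack",
}

def is_bugfix(commit) -> bool:
    """Step 1: keyword match on the commit message."""
    message = commit.message.lower()
    return any(keyword in message for keyword in BUGFIX_KEYWORDS)

def context_window(lines: list[str], bug_line: int, radius: int = 10) -> str:
    """Step 2: keep 10 lines of context before/after the bug location."""
    start = max(0, bug_line - radius)
    end = min(len(lines), bug_line + radius + 1)
    return "".join(lines[start:end])

repo = Repo("/path/to/linux")  # local kernel clone (path is illustrative)
for commit in repo.iter_commits("master", max_count=100_000):
    if not is_bugfix(commit):
        continue
    # Step 2: restrict to C sources and headers
    touched = [path for path in commit.stats.files if path.endswith((".c", ".h"))]
```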

### Dataset Statistics
- **Training Data**: 100K samples (`training_data_100k.jsonl`)
- **Format**: JSONL (one JSON object per line)
- **Source**: Linux kernel Git repository
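
Since each line is a standalone JSON object in the format shown above, the file can be streamed record by record; a minimal sketch:

```python
import json

# Stream the training file one record at a time (JSONL: one object per line)
with open("dataset/training_data_100k.jsonl") as f:
    for line in f:
        example = json.loads(line)
        buggy_code = example["input"]["original code"]
        instruction = example["input"]["instruction"]
        diff = example["output"]["diff codes"]
```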

## πŸš€ Quick Start

### Prerequisites
```bash
pip install -r requirements.txt
```

### 1. Build Dataset
```bash
cd dataset_builder
python extract_linux_bugfixes.py
python format_for_training.py
```

### 2. Train Model
```bash
cd train
python train_codellama_qlora_linux_bugfix.py
```

### 3. Evaluate Model
```bash
cd evaluate
python evaluate_linux_bugfix_model.py
```

## πŸ“ Project Structure

```
CodeLLaMA-Linux-BugFix/
├── dataset_builder/          # Dataset creation scripts
│   ├── extract_linux_bugfixes.py          # Main dataset extraction
│   ├── extract_linux_bugfixes_parallel.py # Parallelized version
│   └── format_for_training.py
├── dataset/                  # Generated datasets
│   ├── training_data_100k.jsonl
│   └── training_data_prompt_completion.jsonl
├── train/                    # Training scripts and outputs
│   ├── train_codellama_qlora_linux_bugfix.py # Main training script
│   ├── train_codellama_qlora_simple.py
│   ├── download_codellama_model.py
│   └── output/               # Trained model checkpoints
├── evaluate/                 # Evaluation scripts and results
│   ├── evaluate_linux_bugfix_model.py # Model evaluation
│   ├── test_samples.jsonl    # Evaluation dataset
│   └── output/               # Evaluation results
└── requirements.txt          # Python dependencies
```

## πŸ”§ Key Features

### Efficient Training
- **QLoRA**: 4-bit weights cut model memory by roughly 75% versus 16-bit while maintaining performance
- **4-bit Quantization**: Enables training on consumer hardware
- **Gradient Checkpointing**: Optimizes memory usage during training

### Real-world Data
- **Authentic Bug Fixes**: Extracted from actual Linux kernel development
- **Contextual Understanding**: Captures relevant code context around bugs
- **Git Integration**: Outputs proper Git diff format

### Evaluation
- **BLEU Score**: Measures n-gram overlap between generated and reference patches
- **ROUGE Score**: Measures recall-oriented overlap with the reference fixes
- **Comprehensive Metrics**: JSON and CSV output formats

## 🎯 Use Cases

The fine-tuned model can assist with:

1. **Automated Bug Fixing**: Generate patches for common kernel bugs
2. **Code Review**: Suggest fixes during development
3. **Learning**: Study patterns in Linux kernel bug fixes
4. **Research**: Advance automated software repair techniques

## πŸ“ˆ Performance

The model is evaluated using:
- **BLEU Score**: Measures how well generated diffs match reference fixes
- **ROUGE Score**: Evaluates overlap between predicted and actual fixes
- **Human Evaluation**: Qualitative assessment of fix quality
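
A hedged sketch of the automatic scoring using the Hugging Face `evaluate` library; the project's evaluation script may compute these metrics differently, and the sample diffs are made up for illustration:

```python
import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

# Toy examples: generated vs. reference patches
predictions = ["- kfree(ptr);\n+ if (ptr)\n+     kfree(ptr);"]
references = ["- kfree(ptr);\n+ if (!ptr)\n+     return;\n+ kfree(ptr);"]

print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))
```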

## πŸ”¬ Technical Details

### Model Architecture
- **Base**: CodeLLaMA-7B-Instruct with instruction tuning
- **Adapter**: LoRA layers for efficient fine-tuning
- **Output**: Generates Git diff format patches
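
A hedged inference sketch, attaching the trained LoRA adapter to the base model; the adapter path under `train/output/` and the prompt template are assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "codellama/CodeLlama-7b-Instruct-hf"
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "train/output")  # adapter checkpoint (assumed path)

prompt = (
    "### Instruction:\nFix the potential NULL pointer dereference.\n\n"
    "### Buggy code:\nif (dev->ops->start)\n\tdev->ops->start(dev);\n\n"
    "### Fix (git diff):\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```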

### Training Process
1. **Data Preprocessing**: Extract and clean commit data
2. **Tokenization**: Convert to model input format
3. **QLoRA Training**: Parameter-efficient fine-tuning over 4-bit quantized weights
4. **Checkpointing**: Save model states for evaluation
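
Steps 2 and 3 might look roughly like the following; the prompt template is an assumption, not the exact one used by the training scripts:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")
tokenizer.pad_token = tokenizer.eos_token  # LLaMA-family tokenizers ship without a pad token

def to_features(example: dict) -> dict:
    # Fold one dataset record into a single training prompt (template assumed)
    prompt = (
        "### Instruction:\n" + example["input"]["instruction"] + "\n\n"
        "### Buggy code:\n" + example["input"]["original code"] + "\n\n"
        "### Fix (git diff):\n" + example["output"]["diff codes"]
    )
    return tokenizer(prompt, truncation=True, max_length=2048)
```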

### Memory Optimization
- **4-bit Quantization**: Reduces model size significantly
- **Gradient Accumulation**: Enables larger effective batch sizes
- **Mixed Precision**: Uses bfloat16 for faster training
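
These optimizations correspond to standard `transformers.TrainingArguments` flags; the per-device/accumulation split below is an assumed way to reach the effective batch size of 64:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="train/output",
    num_train_epochs=3,                 # matches the training config above
    per_device_train_batch_size=8,      # assumed split:
    gradient_accumulation_steps=8,      # 8 x 8 = effective batch size 64
    learning_rate=2e-4,
    bf16=True,                          # bfloat16 mixed precision
    gradient_checkpointing=True,
    logging_steps=50,
)
```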

## 🀝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

## πŸ™ Acknowledgments

- **CodeLLaMA Team**: For the base model
- **Linux Kernel Community**: For the bug-fix data
- **Hugging Face**: For the transformers library
- **Microsoft**: For the LoRA technique

## πŸ“š References

- [Code Llama: Open Foundation Models for Code](https://arxiv.org/abs/2308.12950)
- [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
- [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)