# Project Structure

This document provides a detailed overview of the CodeLLaMA-Linux-BugFix project structure, explaining the purpose and organization of each component.

## πŸ“ Root Directory

```
CodeLLaMA-Linux-BugFix/
├── dataset_builder/          # Dataset creation and processing
├── dataset/                  # Generated datasets and data files
├── train/                    # Model training scripts and outputs
├── evaluate/                 # Model evaluation and testing
├── requirements.txt          # Python dependencies
├── README.md                 # Project documentation
└── PROJECT_STRUCTURE.md      # This file
```

## πŸ”§ Dataset Builder (`dataset_builder/`)



The dataset builder extracts bug-fix data from the Linux kernel Git repository and converts it into a training-ready format.



### Files:

- **`extract_linux_bugfixes.py`** - Main dataset extraction script (see the sketch after this list)

  - Uses PyDriller to analyze Linux kernel Git history

  - Filters commits using bug-fix keywords

  - Extracts code context around bug locations

  - Generates structured dataset entries



- **`extract_linux_bugfixes_parallel.py`** - Parallelized version of the dataset builder (sketched below)

  - Multi-process implementation for faster processing

  - Configurable worker count (default: 16 workers)

  - Test mode with limited commit processing



- **`format_for_training.py`** - Format conversion script

  - Converts structured data to prompt-completion pairs

  - Formats input for supervised fine-tuning

  - Creates training-ready JSONL format
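
A minimal sketch of how the main steps of `extract_linux_bugfixes.py` might look with PyDriller. The keyword subset, the context-window handling, and the field names below are illustrative assumptions, not the project's exact code:

```python
# Illustrative sketch only: keyword filter, +/-10-line context window,
# C/header files, and diff capture. Keywords, paths, and field names
# are assumptions, not the project's exact implementation.
from pydriller import Repository

BUGFIX_KEYWORDS = ("fix", "bug", "leak", "null", "overflow")  # assumed subset of the 17

def extract_samples(repo_path):
    for commit in Repository(repo_path).traverse_commits():
        if not any(k in commit.msg.lower() for k in BUGFIX_KEYWORDS):
            continue  # skip commits that do not look like bug fixes
        for mod in commit.modified_files:
            if not mod.filename.endswith((".c", ".h")) or not mod.diff:
                continue  # keep only C sources/headers with a real diff
            context = ""
            deleted = mod.diff_parsed["deleted"]
            if deleted and mod.source_code_before:
                lines = mod.source_code_before.splitlines()
                center = deleted[0][0] - 1  # first removed line, 0-based
                context = "\n".join(lines[max(0, center - 10):center + 11])
            yield {
                "instruction": commit.msg.splitlines()[0],
                "original code": context,
                "diff codes": mod.diff,
            }
```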



### Key Features:

- **Commit Filtering**: Identifies bug-fix commits using 17 keywords

- **Code Context**: Extracts 10 lines before and after the bug location

- **File Filtering**: Focuses on C and header files (`.c`, `.h`)

- **Diff Extraction**: Captures Git diff patches for fixes
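
The parallelized extractor distributes this mining across worker processes. A rough sketch of that pattern, assuming commit hashes are collected up front and split across the default 16 workers:

```python
# Hypothetical sketch of the parallel variant: pre-collect commit hashes,
# split them across worker processes, and let each worker mine its slice.
# REPO_PATH and the per-commit filtering shown here are assumptions.
from multiprocessing import Pool
from pydriller import Repository

REPO_PATH = "path/to/linux"  # assumed location of the kernel checkout
WORKERS = 16                 # documented default worker count

def mine_slice(hashes):
    # Each worker opens its own Repository restricted to its slice of hashes.
    rows = []
    for commit in Repository(REPO_PATH, only_commits=list(hashes)).traverse_commits():
        for mod in commit.modified_files:
            if mod.filename.endswith((".c", ".h")) and mod.diff:
                rows.append({"hash": commit.hash, "diff codes": mod.diff})
    return rows

if __name__ == "__main__":
    all_hashes = [c.hash for c in Repository(REPO_PATH).traverse_commits()]
    chunks = [all_hashes[i::WORKERS] for i in range(WORKERS)]
    with Pool(WORKERS) as pool:
        samples = [row for chunk in pool.map(mine_slice, chunks) for row in chunk]
```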



## πŸ“Š Dataset (`dataset/`)



Contains the generated datasets used for training and evaluation.



### Files:

- **`training_data_100k.jsonl`** - Main training dataset

  - 100,000 bug-fix samples

  - Structured format with input/output pairs

  - Stored using Git LFS for large file handling



- **`training_data_prompt_completion.jsonl`** - Converted training format

  - Prompt-completion pairs for supervised learning

  - Optimized for transformer model training

  - Stored using Git LFS



### Data Format:

```json
{
  "input": {
    "original code": "C code snippet with bug",
    "instruction": "Bug fix instruction from commit message"
  },
  "output": {
    "diff codes": "Git diff showing the fix"
  }
}
```
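
A hedged sketch of how this structured format might be converted into prompt-completion pairs; the prompt template and file paths are assumptions rather than the exact output of `format_for_training.py`:

```python
# Hypothetical conversion step: each structured record becomes a
# prompt/completion pair. Field names match the data format above;
# the prompt template and paths are illustrative assumptions.
import json

def to_prompt_completion(record):
    prompt = (
        "### Instruction:\n" + record["input"]["instruction"] + "\n\n"
        "### Buggy code:\n" + record["input"]["original code"] + "\n\n"
        "### Fix:\n"
    )
    return {"prompt": prompt, "completion": record["output"]["diff codes"]}

with open("dataset/training_data_100k.jsonl") as src, \
     open("dataset/training_data_prompt_completion.jsonl", "w") as dst:
    for line in src:
        dst.write(json.dumps(to_prompt_completion(json.loads(line))) + "\n")
```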



## πŸš€ Training (`train/`)



Contains all training-related scripts, configurations, and model outputs.



### Files:

- **`train_codellama_qlora_linux_bugfix.py`** - Main training script

  - QLoRA fine-tuning implementation

  - Optimized for H200 GPU with bfloat16

  - Includes Weights & Biases integration

  - Comprehensive training configuration



- **`train_codellama_qlora_simple.py`** - Alternative training script

  - Simplified QLoRA implementation

  - Basic training setup without advanced features

  - Good for testing and development



- **`download_codellama_model.py`** - Model download utility

  - Downloads base CodeLLaMA-7B-Instruct model

  - Ensures model availability before training
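
A minimal sketch of such a download step using the standard `huggingface_hub` API; the local directory is an assumption:

```python
# Minimal download sketch; the project's script may differ in details,
# and the local_dir path here is an assumption.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="codellama/CodeLlama-7b-Instruct-hf",
    local_dir="models/CodeLlama-7b-Instruct-hf",
)
```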



### Output Directory (`train/output/`):

- **`qlora-codellama-bugfix/`** - Main model output

  - **`adapter_model.safetensors`** - LoRA adapter weights

  - **`adapter_config.json`** - LoRA configuration

  - **`tokenizer.json`** - Tokenizer files

  - **`chat_template.jinja`** - Conversation template

  - **`checkpoint-500/`** - Training checkpoint at step 500

  - **`checkpoint-1000/`** - Training checkpoint at step 1000

  - **`README.md`** - Model card and documentation



### Training Configuration:

- **Base Model**: `codellama/CodeLlama-7b-Instruct-hf`

- **Method**: QLoRA with 4-bit quantization

- **LoRA Config**: r=64, alpha=16, dropout=0.1

- **Training**: 3 epochs, batch size 64, learning rate 2e-4

- **Hardware**: Optimized for H200 GPU
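
A sketch of what this configuration might look like in code; the NF4 quantization type and target modules are assumptions, while r, alpha, dropout, and bfloat16 follow the settings listed above:

```python
# QLoRA setup sketch matching the documented configuration.
# NF4 quant type and target_modules are assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16, as used on the H200
)
model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```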



## πŸ“ˆ Evaluation (`evaluate/`)



Contains evaluation scripts and results for assessing model performance.



### Files:

- **`evaluate_linux_bugfix_model.py`** - Main evaluation script (see the inference sketch after this list)

  - Loads fine-tuned model for inference

  - Generates predictions on test data

  - Computes BLEU and ROUGE metrics

  - Saves results in multiple formats



- **`test_samples.jsonl`** - Evaluation dataset

  - Test samples for model evaluation

  - Stored using Git LFS
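
A hypothetical sketch of the inference step performed by the evaluation script, loading the base model together with the LoRA adapter from `train/output/qlora-codellama-bugfix/`; the prompt text and generation settings are illustrative:

```python
# Inference sketch: base model + LoRA adapter. Prompt and generation
# settings are placeholders, not the script's exact values.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf", device_map="auto"
)
model = PeftModel.from_pretrained(base, "train/output/qlora-codellama-bugfix")
tokenizer = AutoTokenizer.from_pretrained("train/output/qlora-codellama-bugfix")

prompt = "### Instruction:\nFix the possible NULL pointer dereference\n\n### Buggy code:\n...\n\n### Fix:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```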



### Output Directory (`evaluate/output/`):

- **`eval_results.json`** - Detailed evaluation results

  - Complete predictions and references

  - Stored using Git LFS



- **`eval_results.csv`** - Tabular evaluation results

  - CSV format for easy analysis

  - Stored using Git LFS



### Evaluation Metrics:

- **BLEU Score**: Measures n-gram overlap between generated and reference patches

- **ROUGE Score**: Measures recall-oriented overlap between generated and reference patches

- **Human Evaluation**: Qualitative assessment
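
A short sketch of how BLEU and ROUGE might be computed with the `evaluate` library from `requirements.txt`; the prediction and reference strings are placeholders:

```python
# Metric computation sketch using the Hugging Face `evaluate` library.
# Prediction/reference strings below are placeholder examples only.
import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

predictions = ["-\tkfree(ptr);\n+\tkfree_sensitive(ptr);"]
references = ["-\tkfree(ptr);\n+\tkfree_sensitive(ptr);"]

print(bleu.compute(predictions=predictions, references=[[r] for r in references])["bleu"])
print(rouge.compute(predictions=predictions, references=references)["rougeL"])
```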



## πŸ”§ Dependencies (`requirements.txt`)



Comprehensive list of Python packages required for the project:



### Core ML Libraries:

- `transformers==4.53.1` - Hugging Face transformers

- `torch==2.7.1+cu128` - PyTorch with CUDA support

- `peft==0.16.0` - Parameter-efficient fine-tuning

- `accelerate==1.8.1` - Distributed training

- `bitsandbytes==0.46.1` - Quantization support



### Data Processing:

- `datasets==3.6.0` - Dataset handling

- `pandas==2.3.1` - Data manipulation

- `numpy==2.3.1` - Numerical computing



### Git Analysis:

- `pydriller` - Git repository mining

- `gitpython` - Git operations



### Utilities:

- `tqdm==4.67.1` - Progress bars

- `wandb` - Experiment tracking

- `evaluate==0.4.4` - Evaluation metrics



## πŸ”„ Workflow



### 1. Dataset Creation

```bash
cd dataset_builder
python extract_linux_bugfixes.py   # Extract bug-fix data
python format_for_training.py      # Convert to prompt-completion format
```



### 2. Model Training

```bash
cd train
python train_codellama_qlora_linux_bugfix.py   # Train with QLoRA
```

### 3. Model Evaluation
```bash
cd evaluate
python evaluate_linux_bugfix_model.py   # Evaluate performance
```

## 🎯 Key Design Principles

### Modularity
- Each component has a specific responsibility
- Clear separation between data, training, and evaluation
- Easy to modify or extend individual components

### Efficiency
- QLoRA for memory-efficient training
- Parallel processing for dataset creation
- Optimized for modern GPU hardware

### Reproducibility
- Version-controlled dependencies
- Structured data formats
- Comprehensive logging and evaluation

### Scalability
- Configurable parameters for different hardware
- Support for distributed training
- Efficient data handling with Git LFS

## πŸ” File Naming Conventions

- **Scripts**: Descriptive names with clear purpose
- **Datasets**: Include size/version information
- **Models**: Include architecture and method
- **Results**: Include timestamp or version
- **Configs**: Use `.json` or `.yaml` format

## πŸ“ Documentation

- **README.md**: Project overview and quick start
- **PROJECT_STRUCTURE.md**: This detailed structure guide
- **Model README**: Generated model cards in output directories
- **Code Comments**: Inline documentation in all scripts



This structure ensures the project is organized, maintainable, and easy to understand for both users and contributors.