Mac committed on
Commit 8046c68 · 1 Parent(s): 15eb8ca

Update README with Hugging Face metadata and full project description

Files changed (1)
  1. README.md +127 -113

README.md CHANGED
@@ -1,182 +1,196 @@
  # CodeLLaMA-Linux-BugFix
 
- A machine learning project that fine-tunes CodeLLaMA-7B-Instruct specifically for Linux kernel bug fixing using QLoRA (Quantized Low-Rank Adaptation). The model learns to generate Git diff patches from buggy C code and commit messages.
 
- ## 🎯 Project Overview
 
- This project addresses the challenging task of automated Linux kernel bug fixing by:
 
- - **Extracting real bug-fix data** from the Linux kernel Git repository
- - **Training a specialized model** using QLoRA for efficient fine-tuning
- - **Generating Git diff patches** that can be applied to fix bugs
- - **Providing evaluation metrics** to assess model performance
 
- ## 🏗️ Architecture
 
- ### Base Model
- - **Model**: `codellama/CodeLLaMA-7b-Instruct-hf` (7 billion parameters)
- - **Fine-tuning Method**: QLoRA with 4-bit quantization
- - **Hardware**: Optimized for H200 GPU with bfloat16 precision
 
- ### Training Configuration
- - **LoRA Config**: r=64, alpha=16, dropout=0.1
- - **Training**: 3 epochs, batch size 64, learning rate 2e-4
- - **Memory Optimization**: Gradient checkpointing, mixed precision training
 
  ## 📊 Dataset
 
- The project creates a specialized dataset from Linux kernel commits:
-
- ### Data Extraction Process
- 1. **Commit Filtering**: Identifies bug-fix commits using keywords:
-    - `fix`, `bug`, `leak`, `null`, `overflow`, `error`, `failure`
-    - `crash`, `panic`, `memory`, `race`, `deadlock`, `corruption`
-    - `security`, `vulnerability`, `exploit`, `buffer`, `stack`
-
- 2. **Code Context Extraction**:
-    - Focuses on C and header files (`.c`, `.h`)
-    - Extracts 10 lines before/after bug location
-    - Captures relevant code context
-
- 3. **Data Format**:
-    ```json
-    {
-      "input": {
-        "original code": "C code snippet with bug",
-        "instruction": "Bug fix instruction from commit message"
-      },
-      "output": {
-        "diff codes": "Git diff showing the fix"
-      }
-    }
-    ```
-
- ### Dataset Statistics
- - **Training Data**: 100K samples (`training_data_100k.jsonl`)
- - **Format**: JSONL (one JSON object per line)
- - **Source**: Linux kernel Git repository
 
  ## 🚀 Quick Start
 
- ### Prerequisites
  ```bash
  pip install -r requirements.txt
  ```
 
- ### 1. Build Dataset
  ```bash
  cd dataset_builder
  python extract_linux_bugfixes.py
  python format_for_training.py
  ```
 
- ### 2. Train Model
  ```bash
  cd train
  python train_codellama_qlora_linux_bugfix.py
  ```
 
- ### 3. Evaluate Model
  ```bash
  cd evaluate
  python evaluate_linux_bugfix_model.py
  ```
 
  ## 📁 Project Structure
 
  ```
  CodeLLaMA-Linux-BugFix/
- ├── dataset_builder/                         # Dataset creation scripts
- │   ├── extract_linux_bugfixes.py            # Main dataset extraction
- │   ├── extract_linux_bugfixes_parallel.py   # Parallelized version
  │   └── format_for_training.py
- ├── dataset/                                 # Generated datasets
  │   ├── training_data_100k.jsonl
  │   └── training_data_prompt_completion.jsonl
- ├── train/                                   # Training scripts and outputs
- │   ├── train_codellama_qlora_linux_bugfix.py  # Main training script
  │   ├── train_codellama_qlora_simple.py
  │   ├── download_codellama_model.py
- │   └── output/                              # Trained model checkpoints
- ├── evaluate/                                # Evaluation scripts and results
- │   ├── evaluate_linux_bugfix_model.py       # Model evaluation
- │   ├── test_samples.jsonl                   # Evaluation dataset
- │   └── output/                              # Evaluation results
- └── requirements.txt                         # Python dependencies
  ```
 
- ## 🔧 Key Features
 
- ### Efficient Training
- - **QLoRA**: Reduces memory requirements by 75% while maintaining performance
- - **4-bit Quantization**: Enables training on consumer hardware
- - **Gradient Checkpointing**: Optimizes memory usage during training
 
- ### Real-world Data
- - **Authentic Bug Fixes**: Extracted from actual Linux kernel development
- - **Contextual Understanding**: Captures relevant code context around bugs
- - **Git Integration**: Outputs proper Git diff format
 
- ### Evaluation
- - **BLEU Score**: Measures translation quality
- - **ROUGE Score**: Evaluates text generation accuracy
- - **Comprehensive Metrics**: JSON and CSV output formats
 
- ## 🎯 Use Cases
 
- The fine-tuned model can assist with:
 
- 1. **Automated Bug Fixing**: Generate patches for common kernel bugs
- 2. **Code Review**: Suggest fixes during development
- 3. **Learning**: Study patterns in Linux kernel bug fixes
- 4. **Research**: Advance automated software repair techniques
 
- ## 📈 Performance
 
- The model is evaluated using:
- - **BLEU Score**: Measures how well generated diffs match reference fixes
- - **ROUGE Score**: Evaluates overlap between predicted and actual fixes
- - **Human Evaluation**: Qualitative assessment of fix quality
 
- ## 🔬 Technical Details
 
- ### Model Architecture
- - **Base**: CodeLLaMA-7B-Instruct with instruction tuning
- - **Adapter**: LoRA layers for efficient fine-tuning
- - **Output**: Generates Git diff format patches
 
- ### Training Process
- 1. **Data Preprocessing**: Extract and clean commit data
- 2. **Tokenization**: Convert to model input format
- 3. **QLoRA Training**: Efficient parameter-efficient fine-tuning
- 4. **Checkpointing**: Save model states for evaluation
 
- ### Memory Optimization
- - **4-bit Quantization**: Reduces model size significantly
- - **Gradient Accumulation**: Enables larger effective batch sizes
- - **Mixed Precision**: Uses bfloat16 for faster training
 
  ## 🤝 Contributing
 
- 1. Fork the repository
- 2. Create a feature branch
- 3. Make your changes
- 4. Add tests if applicable
- 5. Submit a pull request
 
  ## 📄 License
 
- This project is licensed under the MIT License - see the LICENSE file for details.
 
  ## 🙏 Acknowledgments
 
- - **CodeLLaMA Team**: For the base model
- - **Linux Kernel Community**: For the bug-fix data
- - **Hugging Face**: For the transformers library
- - **Microsoft**: For the LoRA technique
 
  ## 📚 References
 
- - [CodeLLaMA: Open Foundation for Code](https://arxiv.org/abs/2308.12950)
- - [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
- - [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)
 
+ ---
+ license: mit
+ tags:
+ - codellama
+ - linux
+ - bugfix
+ - lora
+ - qlora
+ - git-diff
+ base_model: codellama/CodeLLaMA-7b-Instruct-hf
+ model_type: LlamaForCausalLM
+ library_name: peft
+ pipeline_tag: text-generation
+ ---
+
  # CodeLLaMA-Linux-BugFix
 
+ A fine-tuned version of `CodeLLaMA-7B-Instruct`, designed specifically for Linux kernel bug fixing using QLoRA (Quantized Low-Rank Adaptation). The model learns to generate Git diff patches based on buggy C code and commit messages.
+
+ ---
+
+ ## 🎯 Overview
+
+ This project targets automated Linux kernel bug fixing by:
+
+ - **Mining real commit data** from the kernel Git history
+ - **Training a specialized QLoRA model** on diff-style fixes
+ - **Generating Git patches** in response to bug-prone code
+ - **Evaluating results** using BLEU, ROUGE, and human inspection
+
+ ---
+
+ ## 🧠 Model Configuration
+
+ - **Base model**: `CodeLLaMA-7B-Instruct`
+ - **Fine-tuning method**: QLoRA with 4-bit quantization
+ - **Training setup** (see the sketch below):
+   - LoRA r=64, alpha=16, dropout=0.1
+   - Batch size: 64, LR: 2e-4, Epochs: 3
+   - Mixed precision (bfloat16), gradient checkpointing
+ - **Hardware**: Optimized for NVIDIA H200 GPUs
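+
+ A minimal sketch of this configuration using `transformers` + `peft` (the actual `train/train_codellama_qlora_linux_bugfix.py` may differ; the target modules below are an assumption, not taken from the script):
+
+ ```python
+ # Sketch only: QLoRA setup matching the hyperparameters listed above.
+ import torch
+ from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+ from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
+
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,                      # 4-bit quantization (QLoRA)
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 compute
+ )
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "codellama/CodeLLaMA-7b-Instruct-hf",
+     quantization_config=bnb_config,
+     device_map="auto",
+ )
+ model.gradient_checkpointing_enable()
+ model = prepare_model_for_kbit_training(model)
+
+ lora_config = LoraConfig(
+     r=64, lora_alpha=16, lora_dropout=0.1,
+     task_type="CAUSAL_LM",
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
+ )
+ model = get_peft_model(model, lora_config)
+ ```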
 
+ ---
 
  ## 📊 Dataset
 
+ Custom dataset extracted from Linux kernel Git history.
+
+ ### Filtering Criteria
+
+ Bug-fix commits are identified by message keywords such as:
+ `fix`, `bug`, `crash`, `memory`, `null`, `panic`, `overflow`, `race`, `corruption`, etc.
+
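+ As a rough illustration of this filtering step (the real logic lives in `dataset_builder/extract_linux_bugfixes.py` and may differ), candidate commits can be listed from a local kernel clone like this:
+
+ ```python
+ # Sketch only: list candidate bug-fix commits by message keyword.
+ import subprocess
+
+ KEYWORDS = ["fix", "bug", "crash", "memory", "null", "panic",
+             "overflow", "race", "corruption"]
+
+ def candidate_commits(repo_path="linux"):  # path to a kernel checkout (assumed)
+     cmd = ["git", "-C", repo_path, "log", "--no-merges", "-i",
+            "--pretty=format:%H|%s"]
+     for kw in KEYWORDS:          # multiple --grep patterns are OR-ed by git
+         cmd += ["--grep", kw]
+     out = subprocess.run(cmd, capture_output=True, text=True, check=True)
+     for line in out.stdout.splitlines():
+         sha, _, subject = line.partition("|")
+         yield sha, subject
+ ```
+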
+ ### Structure
+
+ - Language: C (`.c`, `.h`)
+ - Context: 10 lines before/after the change
+ - Format:
+
+ ```json
+ {
+   "input": {
+     "original code": "C code snippet with bug",
+     "instruction": "Commit message or fix description"
+   },
+   "output": {
+     "diff codes": "Git diff showing the fix"
+   }
+ }
+ ```
+
+ * **File**: `training_data_100k.jsonl` (100,000 samples)
+
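+ Reading this format is straightforward; a minimal sketch (the field names come from the example above, while the prompt template is an assumption and may differ from the training scripts):
+
+ ```python
+ # Sketch only: iterate the JSONL dataset and build prompt/completion pairs.
+ import json
+
+ def load_pairs(path="dataset/training_data_100k.jsonl"):
+     with open(path, encoding="utf-8") as f:
+         for line in f:
+             ex = json.loads(line)
+             prompt = (
+                 f"### Instruction:\n{ex['input']['instruction']}\n\n"
+                 f"### Buggy code:\n{ex['input']['original code']}\n\n"
+                 "### Fix (git diff):\n"
+             )
+             completion = ex["output"]["diff codes"]
+             yield prompt, completion
+ ```
+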
+ ---
 
  ## 🚀 Quick Start
 
+ ### Install dependencies
+
  ```bash
  pip install -r requirements.txt
  ```
 
+ ### 1. Build the Dataset
+
  ```bash
  cd dataset_builder
  python extract_linux_bugfixes.py
  python format_for_training.py
  ```
 
+ ### 2. Fine-tune the Model
+
  ```bash
  cd train
  python train_codellama_qlora_linux_bugfix.py
  ```
 
+ ### 3. Run Evaluation
+
  ```bash
  cd evaluate
  python evaluate_linux_bugfix_model.py
  ```
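+
+ ### 4. Try the Model (optional)
+
+ A minimal inference sketch, assuming the LoRA adapter was saved under `train/output/` (the adapter path and prompt format are assumptions, not fixed by the scripts):
+
+ ```python
+ # Sketch only: load the base model plus LoRA adapter and generate a diff.
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ from peft import PeftModel
+
+ base = "codellama/CodeLLaMA-7b-Instruct-hf"
+ tokenizer = AutoTokenizer.from_pretrained(base)
+ model = AutoModelForCausalLM.from_pretrained(
+     base, torch_dtype=torch.bfloat16, device_map="auto")
+ model = PeftModel.from_pretrained(model, "train/output")  # assumed adapter dir
+
+ prompt = (
+     "### Instruction:\nFix the possible NULL pointer dereference.\n\n"
+     "### Buggy code:\nkfree(dev->priv->buf);\n\n"
+     "### Fix (git diff):\n"
+ )
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+ out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
+ print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
+                        skip_special_tokens=True))
+ ```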
 
+ ---
+
  ## 📁 Project Structure
 
  ```
  CodeLLaMA-Linux-BugFix/
+ ├── dataset_builder/
+ │   ├── extract_linux_bugfixes.py
+ │   ├── extract_linux_bugfixes_parallel.py
  │   └── format_for_training.py
+ ├── dataset/
  │   ├── training_data_100k.jsonl
  │   └── training_data_prompt_completion.jsonl
+ ├── train/
+ │   ├── train_codellama_qlora_linux_bugfix.py
  │   ├── train_codellama_qlora_simple.py
  │   ├── download_codellama_model.py
+ │   └── output/
+ ├── evaluate/
+ │   ├── evaluate_linux_bugfix_model.py
+ │   ├── test_samples.jsonl
+ │   └── output/
+ └── requirements.txt
  ```
 
+ ---
+
+ ## 🧩 Features
+
+ * 🔧 **Efficient fine-tuning**: QLoRA with 4-bit quantization sharply reduces GPU memory use
+ * 🧠 **Real-world commits**: Trained on fixes from actual Linux kernel development
+ * 💡 **Context-aware**: Captures the code context surrounding each bug
+ * 💻 **Output-ready**: Generates Git-style diffs that can be reviewed and applied
+
+ ---
+
+ ## 📈 Evaluation Metrics
+
+ * **BLEU**: n-gram match between generated and reference diffs
+ * **ROUGE**: content overlap between predicted and actual fixes
+ * **Human evaluation**: qualitative assessment of patch quality
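+
+ The scores are produced by `evaluate/evaluate_linux_bugfix_model.py`; a standalone sketch of the same idea using the Hugging Face `evaluate` library (the repo's script may implement this differently):
+
+ ```python
+ # Sketch only: score generated diffs against reference diffs.
+ import evaluate
+
+ bleu = evaluate.load("bleu")
+ rouge = evaluate.load("rouge")
+
+ predictions = ["-\tkfree(ptr);\n+\tif (ptr)\n+\t\tkfree(ptr);"]  # model output (example)
+ references = ["-\tkfree(ptr);\n+\tif (ptr)\n+\t\tkfree(ptr);"]   # ground-truth diff
+
+ print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
+ print(rouge.compute(predictions=predictions, references=references))
+ ```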
 
+ ---
+
+ ## 🧪 Use Cases
+
+ * Automated kernel bug fixing
+ * Code review assistance
+ * Teaching/debugging kernel code
+ * Research in automated program repair (APR)
+
+ ---
 
+ ## 🔬 Technical Highlights
+
+ ### Memory & Speed Optimizations
+
+ * 4-bit quantization (NF4)
+ * Gradient checkpointing
+ * Mixed precision (bfloat16)
+ * Gradient accumulation
+
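+ A minimal sketch of how these settings are commonly expressed with `transformers` `TrainingArguments` (values taken from this README; the split between per-device batch size and accumulation steps is an assumption):
+
+ ```python
+ # Sketch only: memory/speed-related trainer settings.
+ from transformers import TrainingArguments
+
+ training_args = TrainingArguments(
+     output_dir="train/output",         # assumed checkpoint directory
+     num_train_epochs=3,
+     per_device_train_batch_size=8,     # 8 x 8 accumulation = effective batch 64
+     gradient_accumulation_steps=8,
+     learning_rate=2e-4,
+     bf16=True,                         # mixed precision (bfloat16)
+     gradient_checkpointing=True,
+     logging_steps=50,
+     save_strategy="epoch",
+ )
+ ```
+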
+ ---
 
  ## 🤝 Contributing
 
+ 1. Fork this repo
+ 2. Create a branch
+ 3. Add your feature or fix
+ 4. Submit a PR 🙌
+
+ ---
 
  ## 📄 License
 
+ MIT License – see `LICENSE` file for details.
+
+ ---
 
  ## 🙏 Acknowledgments
 
+ * Meta for CodeLLaMA
+ * Hugging Face for Transformers + PEFT
+ * The Linux kernel community for open access to commit data
+ * Microsoft for introducing LoRA
+
+ ---
 
  ## 📚 References
 
+ * [CodeLLaMA (Meta, 2023)](https://arxiv.org/abs/2308.12950)
+ * [QLoRA (Dettmers et al., 2023)](https://arxiv.org/abs/2305.14314)
+ * [LoRA (Hu et al., 2021)](https://arxiv.org/abs/2106.09685)