Mac committed on
Commit 15eb8ca Β· 1 Parent(s): 4babdc8

Refactor filenames and paths for clarity and structure

PROJECT_STRUCTURE.md CHANGED
@@ -1,228 +1,222 @@
- # Linux Kernel Anti-Pattern Detector - Project Structure
 
- ## Overview
 
- This project is organized into a clear, maintainable structure that separates concerns and makes it easy to find, modify, and extend functionality.
-
- ## Directory Structure
 
 ```
- Linux Kernel Anti-Pattern Detector/
- β”œβ”€β”€ πŸ“ data/                              # Analysis data and results
- β”‚   β”œβ”€β”€ results.json                      # Main analysis results
- β”‚   β”œβ”€β”€ concurrency_analysis_report.json
- β”‚   └── kernel_analysis.log               # Analysis logs
- β”‚
- β”œβ”€β”€ πŸ“ docs/                              # Documentation
- β”‚   β”œβ”€β”€ kernel-analysis-guide.md          # Kernel analysis documentation
- β”‚   └── [additional documentation]
- β”‚
- β”œβ”€β”€ πŸ“ examples/                          # Example code and usage
- β”‚   └── [example files]
- β”‚
- β”œβ”€β”€ πŸ“ reports/                           # Generated analysis reports
- β”‚   β”œβ”€β”€ Linux_Kernel_Anti_Pattern_Analysis_Report.md
- β”‚   β”œβ”€β”€ Executive_Summary.md
- β”‚   └── πŸ“ concurrency/                   # Concurrency-specific reports
- β”‚       └── Concurrency_Analysis_Report.md
- β”‚
- β”œβ”€β”€ πŸ“ scripts/                           # Analysis and utility scripts
- β”‚   β”œβ”€β”€ πŸ“ analysis/                      # Core analysis scripts
- β”‚   β”‚   β”œβ”€β”€ concurrency_analyzer.py       # Concurrency issue analyzer
- β”‚   β”‚   └── analyze_kernel_structure.py
- β”‚   β”œβ”€β”€ πŸ“ reporting/                     # Report generation scripts
- β”‚   β”‚   └── view_results.py               # Results viewer
- β”‚   └── πŸ“ utils/                         # Utility scripts
- β”‚       └── quick_summary.py              # Quick summary generator
- β”‚
- β”œβ”€β”€ πŸ“ src/                               # Source code (main project)
- β”‚   β”œβ”€β”€ __init__.py
- β”‚   β”œβ”€β”€ πŸ“ detectors/                     # Anti-pattern detection modules
- β”‚   β”œβ”€β”€ πŸ“ rules/                         # Detection rules and patterns
- β”‚   └── πŸ“ utils/                         # Utility functions
- β”‚
- β”œβ”€β”€ πŸ“ tests/                             # Test files
- β”‚   └── [test files]
- β”‚
- β”œβ”€β”€ πŸ“ tools/                             # Analysis tools and detectors
- β”‚   β”œβ”€β”€ πŸ“ detectors/                     # Main detection tools
- β”‚   β”‚   β”œβ”€β”€ detector.py                   # Main anti-pattern detector
- β”‚   β”‚   └── config.yaml                   # Detection configuration
- β”‚   β”œβ”€β”€ πŸ“ visualizers/                   # Data visualization tools
- β”‚   └── πŸ“ exporters/                     # Data export tools
- β”‚
- β”œβ”€β”€ πŸ“ linux/                             # Linux kernel source (cloned)
- β”‚   └── [kernel source files]
- β”‚
- β”œβ”€β”€ πŸ“„ README.md                          # Main project documentation
- β”œβ”€β”€ πŸ“„ requirements.txt                   # Main project dependencies
- β”œβ”€β”€ πŸ“„ requirements-kernel-analysis.txt
- β”œβ”€β”€ πŸ“„ requirements-simple.txt
- β”œβ”€β”€ πŸ“„ .gitignore                         # Git ignore rules
- └── πŸ“„ PROJECT_STRUCTURE.md               # This file
 ```
 
- ## Directory Descriptions
-
- ### πŸ“ data/
- Contains all analysis results, logs, and generated data files.
- - **results.json**: Complete analysis results from the main detector
- - **concurrency_analysis_report.json**: Detailed concurrency analysis
- - **kernel_analysis.log**: Analysis execution logs
-
- ### πŸ“ docs/
- Project documentation and guides.
- - **kernel-analysis-guide.md**: Comprehensive guide for kernel analysis
- - Additional documentation for specific features
-
- ### πŸ“ examples/
- Example code, usage patterns, and sample data.
- - Example kernel modules for testing
- - Sample configuration files
- - Usage examples
-
- ### πŸ“ reports/
- Generated analysis reports in various formats.
- - **Linux_Kernel_Anti_Pattern_Analysis_Report.md**: Complete technical report
- - **Executive_Summary.md**: High-level summary for stakeholders
- - **concurrency/**: Specialized reports for specific issue types
-
- ### πŸ“ scripts/
- Analysis and utility scripts organized by function.
-
- #### πŸ“ analysis/
- Core analysis scripts for different types of anti-patterns.
- - **concurrency_analyzer.py**: Specialized concurrency issue analysis
- - **analyze_kernel_structure.py**: Kernel structure analysis
-
- #### πŸ“ reporting/
- Scripts for generating and viewing reports.
- - **view_results.py**: Interactive results viewer and reporter
-
- #### πŸ“ utils/
- Utility scripts for common tasks.
- - **quick_summary.py**: Quick summary generation
-
- ### πŸ“ src/
- Main project source code (core framework).
- - **detectors/**: Anti-pattern detection modules
- - **rules/**: Detection rules and pattern definitions
- - **utils/**: Utility functions and helpers
-
- ### πŸ“ tests/
- Test files and test data.
- - Unit tests for detection modules
- - Integration tests
- - Test data and fixtures
-
- ### πŸ“ tools/
- Analysis tools and detectors.
-
- #### πŸ“ detectors/
- Main detection tools and configurations.
- - **detector.py**: Primary anti-pattern detection engine
- - **config.yaml**: Detection configuration and rules
-
- #### πŸ“ visualizers/
- Data visualization and charting tools.
- - Interactive dashboards
- - Chart generators
- - Data plotting utilities
-
- #### πŸ“ exporters/
- Data export and format conversion tools.
- - JSON to other formats
- - Report generation
- - Data transformation
-
- ### πŸ“ linux/
- Cloned Linux kernel source code for analysis.
- - Complete kernel source tree
- - Used for code snippet extraction
- - Reference for pattern validation
-
- ## File Descriptions
-
- ### Core Files
- - **README.md**: Main project documentation and getting started guide
- - **requirements.txt**: Main project Python dependencies
- - **requirements-kernel-analysis.txt**: Kernel analysis specific dependencies
- - **requirements-simple.txt**: Simplified dependencies for basic usage
- - **.gitignore**: Git ignore patterns for the project
-
- ### Configuration Files
- - **tools/detectors/config.yaml**: Main detection configuration
- - **tools/detectors/detector.py**: Primary detection engine
-
- ## Usage Patterns
-
- ### Running Analysis
- ```bash
- # Main analysis
- python tools/detectors/detector.py --clone --output data/results.json
-
- # Concurrency analysis
- python scripts/analysis/concurrency_analyzer.py
-
- # View results
- python scripts/reporting/view_results.py data/results.json
 ```
 
- ### Generating Reports
 ```bash
- # Quick summary
- python scripts/utils/quick_summary.py
-
- # Interactive viewer
- python scripts/reporting/view_results.py --interactive
 ```
 
- ### Development
 ```bash
- # Install dependencies
- pip install -r requirements.txt
- pip install -r requirements-kernel-analysis.txt
-
- # Run tests
- python -m pytest tests/
 
- # Development setup
- conda activate linux-kernel-anti-pattern-detector
 ```
 
- ## Best Practices
 
- ### Adding New Features
- 1. **Analysis scripts**: Add to `scripts/analysis/`
- 2. **Reporting tools**: Add to `scripts/reporting/`
- 3. **Utilities**: Add to `scripts/utils/`
- 4. **Core detection**: Add to `src/detectors/`
- 5. **Configuration**: Update `tools/detectors/config.yaml`
 
- ### File Naming Conventions
- - **Python files**: snake_case (e.g., `concurrency_analyzer.py`)
- - **Configuration files**: kebab-case (e.g., `kernel-analysis-guide.md`)
- - **Reports**: Pascal_Case (e.g., `Concurrency_Analysis_Report.md`)
 
- ### Data Management
- - **Raw data**: Store in `data/`
- - **Processed results**: Store in `data/`
- - **Reports**: Generate in `reports/`
- - **Logs**: Store in `data/`
 
- ## Maintenance
 
- ### Regular Tasks
- 1. **Update dependencies**: Review and update requirements files
- 2. **Clean data**: Remove old analysis results periodically
- 3. **Update kernel**: Refresh the Linux kernel source
- 4. **Backup reports**: Archive important analysis reports
 
- ### Version Control
- - **Track**: Source code, configuration, documentation
- - **Ignore**: Analysis results, logs, kernel source (large files)
- - **Archive**: Important reports and findings
 
- ---
 
- *This structure is designed to be scalable, maintainable, and easy to navigate. Each directory has a clear purpose and the organization supports both development and research workflows.*
+ # Project Structure
 
+ This document provides a detailed overview of the CodeLLaMA-Linux-BugFix project structure, explaining the purpose and organization of each component.
 
+ ## πŸ“ Root Directory
 
 ```
+ CodeLLaMA-Linux-BugFix/
+ β”œβ”€β”€ dataset_builder/       # Dataset creation and processing
+ β”œβ”€β”€ dataset/               # Generated datasets and data files
+ β”œβ”€β”€ train/                 # Model training scripts and outputs
+ β”œβ”€β”€ evaluate/              # Model evaluation and testing
+ β”œβ”€β”€ requirements.txt       # Python dependencies
+ β”œβ”€β”€ README.md              # Project documentation
+ └── PROJECT_STRUCTURE.md   # This file
 ```
 
+ ## πŸ”§ Dataset Builder (`dataset_builder/`)
+
+ The dataset builder extracts bug-fix data from the Linux kernel Git repository and converts it into a training-ready format.
+
+ ### Files:
+ - **`extract_linux_bugfixes.py`** - Main dataset extraction script
+   - Uses PyDriller to analyze Linux kernel Git history
+   - Filters commits using bug-fix keywords
+   - Extracts code context around bug locations
+   - Generates structured dataset entries
+
+ - **`extract_linux_bugfixes_parallel.py`** - Parallelized version of the dataset builder
+   - Multi-process implementation for faster processing
+   - Configurable worker count (default: 16 workers)
+   - Test mode with limited commit processing
+
+ - **`format_for_training.py`** - Format conversion script
+   - Converts structured data to prompt-completion pairs
+   - Formats input for supervised fine-tuning
+   - Creates training-ready JSONL format
+
+ ### Key Features:
+ - **Commit Filtering**: Identifies bug-fix commits using 17 keywords (see the sketch below)
+ - **Code Context**: Extracts 10 lines before/after the bug location
+ - **File Filtering**: Focuses on C source and header files (`.c`, `.h`)
+ - **Diff Extraction**: Captures Git diff patches for fixes
+
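+ A minimal sketch of the extraction loop, assuming PyDriller's `Repository` API; the keyword set below is an illustrative subset of the 17 keywords, and the helper `is_bugfix` is hypothetical:
+
+ ```python
+ # Sketch: mine bug-fix commits from a local kernel checkout with PyDriller.
+ # BUG_KEYWORDS is an illustrative subset; the real script uses 17 keywords.
+ import json
+ from pydriller import Repository
+
+ BUG_KEYWORDS = {"fix", "bug", "leak", "null", "overflow", "race", "deadlock"}
+
+ def is_bugfix(msg: str) -> bool:
+     msg = msg.lower()
+     return any(kw in msg for kw in BUG_KEYWORDS)
+
+ with open("linux_bugfix_dataset.jsonl", "w") as out:
+     for commit in Repository("path/to/linux").traverse_commits():
+         if not is_bugfix(commit.msg):
+             continue
+         for mod in commit.modified_files:
+             # Keep only C source and header files, as the builder does.
+             if mod.filename.endswith((".c", ".h")) and mod.diff:
+                 entry = {
+                     "input": {
+                         # The real script trims this to ~10 lines of context
+                         # around the changed hunk instead of the whole file.
+                         "original code": mod.source_code_before or "",
+                         "instruction": commit.msg.splitlines()[0],
+                     },
+                     "output": {"diff codes": mod.diff},
+                 }
+                 out.write(json.dumps(entry) + "\n")
+ ```
+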
+ ## πŸ“Š Dataset (`dataset/`)
+
+ Contains the generated datasets used for training and evaluation.
+
+ ### Files:
+ - **`training_data_100k.jsonl`** - Main training dataset
+   - 100,000 bug-fix samples
+   - Structured format with input/output pairs
+   - Stored using Git LFS for large-file handling
+
+ - **`training_data_prompt_completion.jsonl`** - Converted training format
+   - Prompt-completion pairs for supervised learning
+   - Optimized for transformer model training
+   - Stored using Git LFS
+
+ ### Data Format:
+ ```json
+ {
+   "input": {
+     "original code": "C code snippet with bug",
+     "instruction": "Bug fix instruction from commit message"
+   },
+   "output": {
+     "diff codes": "Git diff showing the fix"
+   }
+ }
 ```
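+
+ A minimal sketch of the conversion that `format_for_training.py` performs; the prompt wording below is an assumption (the actual template lives in the script's `format_prompt`):
+
+ ```python
+ # Sketch: convert structured entries into prompt-completion pairs (JSONL).
+ # The prompt wording is an assumption; see format_prompt() in the script.
+ import json
+
+ with open("../dataset/training_data_100k.jsonl") as src, \
+      open("../dataset/training_data_prompt_completion.jsonl", "w") as dst:
+     for line in src:
+         entry = json.loads(line)
+         prompt = (
+             "### Buggy code:\n" + entry["input"]["original code"] + "\n"
+             "### Instruction:\n" + entry["input"]["instruction"] + "\n"
+             "### Fixed diff:\n"
+         )
+         pair = {"prompt": prompt, "completion": entry["output"]["diff codes"]}
+         dst.write(json.dumps(pair) + "\n")
+ ```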
 
+ ## πŸš€ Training (`train/`)
+
+ Contains all training-related scripts, configurations, and model outputs.
+
+ ### Files:
+ - **`train_codellama_qlora_linux_bugfix.py`** - Main training script
+   - QLoRA fine-tuning implementation
+   - Optimized for an H200 GPU with bfloat16
+   - Includes Weights & Biases integration
+   - Comprehensive training configuration
+
+ - **`train_codellama_qlora_simple.py`** - Alternative training script
+   - Simplified QLoRA implementation
+   - Basic training setup without advanced features
+   - Good for testing and development
+
+ - **`download_codellama_model.py`** - Model download utility
+   - Downloads the base CodeLLaMA-7B-Instruct model
+   - Ensures model availability before training
+
+ ### Output Directory (`train/output/`):
+ - **`qlora-codellama-bugfix/`** - Main model output
+   - **`adapter_model.safetensors`** - LoRA adapter weights
+   - **`adapter_config.json`** - LoRA configuration
+   - **`tokenizer.json`** - Tokenizer files
+   - **`chat_template.jinja`** - Conversation template
+   - **`checkpoint-500/`** - Training checkpoint at step 500
+   - **`checkpoint-1000/`** - Training checkpoint at step 1000
+   - **`README.md`** - Model card and documentation
+
+ ### Training Configuration:
+ - **Base Model**: `codellama/CodeLlama-7b-Instruct-hf`
+ - **Method**: QLoRA with 4-bit quantization
+ - **LoRA Config**: r=64, alpha=16, dropout=0.1
+ - **Training**: 3 epochs, batch size 64, learning rate 2e-4
+ - **Hardware**: Optimized for an H200 GPU
+
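+ In `peft`/`bitsandbytes` terms, the configuration above corresponds roughly to the sketch below; `target_modules` and the NF4 quantization type are assumptions, not values read from the training script:
+
+ ```python
+ # Sketch: QLoRA setup matching the documented hyperparameters.
+ import torch
+ from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+ from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
+
+ bnb = BitsAndBytesConfig(
+     load_in_4bit=True,                      # 4-bit quantization
+     bnb_4bit_quant_type="nf4",              # assumption: NF4 quant type
+     bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 compute, as documented
+ )
+ model = AutoModelForCausalLM.from_pretrained(
+     "codellama/CodeLlama-7b-Instruct-hf",
+     quantization_config=bnb,
+     device_map="auto",
+ )
+ model = prepare_model_for_kbit_training(model)
+ lora = LoraConfig(
+     r=64, lora_alpha=16, lora_dropout=0.1,  # documented LoRA config
+     target_modules=["q_proj", "v_proj"],    # assumption: attention projections
+     task_type="CAUSAL_LM",
+ )
+ model = get_peft_model(model, lora)
+ ```
+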
 
+ ## πŸ“ˆ Evaluation (`evaluate/`)
+
+ Contains evaluation scripts and results for assessing model performance.
+
+ ### Files:
+ - **`evaluate_linux_bugfix_model.py`** - Main evaluation script
+   - Loads the fine-tuned model for inference
+   - Generates predictions on test data
+   - Computes BLEU and ROUGE metrics
+   - Saves results in multiple formats
+
+ - **`test_samples.jsonl`** - Evaluation dataset
+   - Test samples for model evaluation
+   - Stored using Git LFS
+
+ ### Output Directory (`evaluate/output/`):
+ - **`eval_results.json`** - Detailed evaluation results
+   - Complete predictions and references
+   - Stored using Git LFS
+
+ - **`eval_results.csv`** - Tabular evaluation results
+   - CSV format for easy analysis
+   - Stored using Git LFS
+
+ ### Evaluation Metrics:
+ - **BLEU Score**: Measures n-gram overlap between generated and reference patches
+ - **ROUGE Score**: Measures recall-oriented overlap with the reference fix
+ - **Human Evaluation**: Qualitative assessment of fix quality
+
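+ Both scores can be computed with the Hugging Face `evaluate` library; a minimal sketch with placeholder strings:
+
+ ```python
+ # Sketch: score generated diffs against reference fixes.
+ import evaluate
+
+ predictions = ["-\tkfree(ptr);\n+\tif (ptr)\n+\t\tkfree(ptr);"]  # placeholder
+ references = ["-\tkfree(ptr);\n+\tif (ptr)\n+\t\tkfree(ptr);"]   # placeholder
+
+ bleu = evaluate.load("bleu").compute(
+     predictions=predictions, references=[[r] for r in references]
+ )
+ rouge = evaluate.load("rouge").compute(
+     predictions=predictions, references=references
+ )
+ print(bleu["bleu"], rouge["rougeL"])
+ ```
+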
+ ## πŸ”§ Dependencies (`requirements.txt`)
+
+ The Python packages required for the project:
+
+ ### Core ML Libraries:
+ - `transformers==4.53.1` - Hugging Face Transformers
+ - `torch==2.7.1+cu128` - PyTorch with CUDA support
+ - `peft==0.16.0` - Parameter-efficient fine-tuning
+ - `accelerate==1.8.1` - Distributed training
+ - `bitsandbytes==0.46.1` - Quantization support
+
+ ### Data Processing:
+ - `datasets==3.6.0` - Dataset handling
+ - `pandas==2.3.1` - Data manipulation
+ - `numpy==2.3.1` - Numerical computing
+
+ ### Git Analysis:
+ - `pydriller` - Git repository mining
+ - `gitpython` - Git operations
+
+ ### Utilities:
+ - `tqdm==4.67.1` - Progress bars
+ - `wandb` - Experiment tracking
+ - `evaluate==0.4.4` - Evaluation metrics
+
+ ## πŸ”„ Workflow
+
+ ### 1. Dataset Creation
 ```bash
+ cd dataset_builder
+ python extract_linux_bugfixes.py   # Extract bug-fix data
+ python format_for_training.py      # Convert format
 ```
 
+ ### 2. Model Training
 ```bash
+ cd train
+ python train_codellama_qlora_linux_bugfix.py   # Train with QLoRA
+ ```
 
+ ### 3. Model Evaluation
+ ```bash
+ cd evaluate
+ python evaluate_linux_bugfix_model.py   # Evaluate performance
 ```
 
+ ## 🎯 Key Design Principles
+
+ ### Modularity
+ - Each component has a specific responsibility
+ - Clear separation between data, training, and evaluation
+ - Easy to modify or extend individual components
 
+ ### Efficiency
+ - QLoRA for memory-efficient training
+ - Parallel processing for dataset creation
+ - Optimized for modern GPU hardware
 
+ ### Reproducibility
+ - Version-controlled dependencies
+ - Structured data formats
+ - Comprehensive logging and evaluation
 
+ ### Scalability
+ - Configurable parameters for different hardware
+ - Support for distributed training
+ - Efficient data handling with Git LFS
 
+ ## πŸ” File Naming Conventions
 
+ - **Scripts**: Descriptive names with a clear purpose
+ - **Datasets**: Include size/version information
+ - **Models**: Include architecture and method
+ - **Results**: Include a timestamp or version
+ - **Configs**: Use `.json` or `.yaml` format
 
+ ## πŸ“ Documentation
 
+ - **README.md**: Project overview and quick start
+ - **PROJECT_STRUCTURE.md**: This detailed structure guide
+ - **Model README**: Generated model cards in output directories
+ - **Code Comments**: Inline documentation in all scripts
 
+ This structure keeps the project organized, maintainable, and easy to understand for both users and contributors.
README.md CHANGED
@@ -1,294 +1,182 @@
- # Linux Kernel Anti-Pattern Detector
 
- A comprehensive tool for detecting anti-patterns and potential issues in Linux kernel code.
 
- ## 🎯 Overview
 
- This project provides automated static analysis tools to identify common anti-patterns, code smells, and potential issues in Linux kernel source code. The analysis covers 7 major categories including memory management, concurrency, security vulnerabilities, and code quality issues.
 
- ## πŸ“Š Recent Analysis Results
 
- **Analysis Date:** June 29, 2025
- **Kernel Version:** Linux 6.16-rc4
- **Files Analyzed:** 35,588
- **Total Issues Found:** 3,122
 
- ### Issue Distribution
- - **πŸ”΄ Critical Security Issues:** 347 (11.1%)
- - **🟑 High Priority Issues:** 2,670 (85.5%)
- - **πŸ”΅ Medium Priority Issues:** 105 (3.4%)
 
- ### Top Categories
- 1. **Concurrency Issues:** 2,314 (74.1%) - Race conditions, deadlocks
- 2. **Memory Management:** 356 (11.4%) - Memory leaks, use-after-free
- 3. **Security Vulnerabilities:** 347 (11.1%) - Buffer overflows, format strings
- 4. **Code Quality:** 92 (2.9%) - Magic numbers, code duplication
 
- ## πŸš€ Quick Start
-
- ### Prerequisites
- - Python 3.8+
- - Git
- - Conda (recommended for environment management)
-
- ### Setup
- ```bash
- # Clone the repository
- git clone https://github.com/Mac-Huang/linux-kernel-anti-pattern-detector.git
- cd linux-kernel-anti-pattern-detector
 
- # Create and activate conda environment
- conda create -n linux-kernel-anti-pattern-detector python=3.10 -y
- conda activate linux-kernel-anti-pattern-detector
-
- # Install dependencies
- pip install -r requirements.txt
- pip install -r requirements-kernel-analysis.txt
- ```
-
- ### Run Analysis
- ```bash
- # Full kernel analysis (clones kernel if needed)
- python tools/detectors/detector.py --clone --output data/results.json
 
- # Concurrency-specific analysis
- python scripts/analysis/concurrency_analyzer.py
 
- # View results interactively
- python scripts/reporting/view_results.py --interactive
- ```
 
- ## πŸ—ƒοΈ Dataset Building Pipeline
 
- ### βœ… **Successfully Implemented**
 
- The project includes a robust dataset building pipeline for training code intelligence models on Linux kernel bug fixes.
 
- #### **Dataset Format**
- ```json
- {
-   "input": {
-     "original code": "code before the patch (extracted from a known bug-fix)",
-     "instruction": "the commit message describing what this fix is"
-   },
-   "output": {
-     "diff codes": "the unified diff patch for the bug fix"
-   }
- }
 ```
 
- #### **Features**
- - **πŸ” Intelligent Bug Detection:** Uses keyword-based filtering to identify bug-fix commits
- - **πŸ“ Focused Code Extraction:** Extracts relevant code context around bug fixes
- - **πŸ”§ Diff Processing:** Parses and formats unified diff patches
- - **⚑ Parallel Processing:** Multi-threaded processing for large repositories
- - **πŸ“Š Quality Filtering:** Only includes valid C source file modifications
-
- #### **Usage**
 ```bash
- # Activate environment
- conda activate detector
-
- # Build test dataset (small sample)
 cd dataset_builder
- python build_dataset_demo.py
-
- # Build full dataset (entire repository)
- # Edit TEST_MODE = False in build_dataset_demo.py
- python build_dataset_demo.py
 ```
 
- #### **Output**
- - **File:** `dataset_builder/output/linux_bugfix_dataset.jsonl`
- - **Format:** JSONL (one JSON object per line)
- - **Content:** Bug-fix commits with original code, commit messages, and diff patches
 
- #### **Keywords Detected**
- - Memory issues: `leak`, `null`, `overflow`, `memory`
- - Security: `security`, `vulnerability`, `exploit`, `buffer`
- - Concurrency: `race`, `deadlock`, `lock`
- - General bugs: `fix`, `bug`, `error`, `failure`, `crash`
 
 ## πŸ“ Project Structure
 
 ```
- β”œβ”€β”€ πŸ“ data/                    # Analysis results and logs
- β”œβ”€β”€ πŸ“ dataset_builder/         # Dataset building pipeline
- β”‚   β”œβ”€β”€ build_dataset.py        # Main dataset builder
- β”‚   β”œβ”€β”€ build_dataset_demo.py   # Test dataset builder
- β”‚   └── output/                 # Generated datasets
- β”œβ”€β”€ πŸ“ docs/                    # Documentation
- β”œβ”€β”€ πŸ“ reports/                 # Generated reports
- β”œβ”€β”€ πŸ“ scripts/                 # Analysis and utility scripts
- β”œβ”€β”€ πŸ“ src/                     # Core source code
- β”œβ”€β”€ πŸ“ tests/                   # Test files
- β”œβ”€β”€ πŸ“ tools/                   # Detection tools
- └── πŸ“ linux/                   # Kernel source (cloned)
 ```
 
- See [PROJECT_STRUCTURE.md](PROJECT_STRUCTURE.md) for detailed structure information.
-
- ## πŸ“‹ Available Reports
-
- ### πŸ“„ Main Reports
- - **[Complete Analysis Report](reports/Linux_Kernel_Anti_Pattern_Analysis_Report.md)** - Full technical analysis
- - **[Executive Summary](reports/Executive_Summary.md)** - High-level overview for stakeholders
-
- ### πŸ”’ Specialized Reports
- - **[Concurrency Analysis](reports/concurrency/Concurrency_Analysis_Report.md)** - Detailed concurrency issues (2,314 issues)
 
- ## πŸ› οΈ Usage Examples
-
- ### Basic Analysis
- ```bash
- # Run complete analysis
- python tools/detectors/detector.py --clone --output data/results.json
-
- # Quick summary
- python scripts/utils/quick_summary.py
-
- # Interactive results viewer
- python scripts/reporting/view_results.py --interactive
- ```
 
- ### Specialized Analysis
- ```bash
- # Concurrency issues only
- python scripts/analysis/concurrency_analyzer.py
 
- # Kernel structure analysis
- python scripts/analysis/analyze_kernel_structure.py
- ```
 
- ### Custom Configuration
- ```bash
- # Use custom config
- python tools/detectors/detector.py --config tools/detectors/config.yaml
 
- # Analyze specific kernel path
- python tools/detectors/detector.py --kernel-path /path/to/kernel
- ```
 
- ## Detection Categories
-
- ### 1. Memory Management
- - Memory leaks (kmalloc without kfree)
- - Use-after-free bugs
- - Double-free issues
- - Null pointer dereferences
-
- ### 2. Concurrency
- - Race conditions
- - Deadlocks
- - Missing locks
- - Double locking
- - Lock ordering violations
-
- ### 3. Security
- - Buffer overflows
- - Format string vulnerabilities
- - Privilege escalation
- - Information disclosure
-
- ### 4. Error Handling
- - Unchecked return values
- - Missing error handling
- - Ignored error codes
- - Wrong error propagation
-
- ### 5. Performance
- - O(nΒ²) algorithms
- - Unnecessary memory allocation
- - Inefficient data structures
- - Cache miss patterns
-
- ### 6. Code Quality
- - Magic numbers
- - Hardcoded values
- - Complex functions
- - Code duplication
-
- ### 7. API Usage
- - Deprecated functions
- - Wrong API usage
- - Missing parameter validation
- - Incorrect flags
-
- ## πŸ“Š Analysis Features
-
- - **Pattern-based Detection:** Regular expression matching with context awareness
- - **Parallel Processing:** Configurable concurrent analysis for performance
- - **Detailed Reporting:** JSON output with file locations and line numbers
- - **Interactive Viewer:** Browse and filter results by category and severity
- - **Code Snippet Extraction:** View actual code around detected issues
- - **Severity Classification:** Critical, High, Medium, Low priority levels
-
- ## πŸ”§ Configuration
-
- The analysis can be customized through `tools/detectors/config.yaml`:
-
- ```yaml
- detection_rules:
-   memory_management:
-     enabled: true
-     severity: "high"
-     patterns:
-       - "kmalloc.*without.*kfree"
-       - "use.*after.*free"
-     directories: ["drivers", "kernel", "mm"]
- ```
 
 ## πŸ“ˆ Performance
 
- - **Analysis Speed:** ~11 minutes for 35,588 files
- - **Memory Usage:** Configurable limits (default: 1GB)
- - **Parallel Processing:** Up to 4 concurrent analyzers
- - **File Filtering:** Excludes generated files and build artifacts
 
- ## 🀝 Contributing
-
- 1. **Fork the repository**
- 2. **Create a feature branch** (`git checkout -b feature/amazing-feature`)
- 3. **Add your changes** following the project structure
- 4. **Test your changes** (`python -m pytest tests/`)
- 5. **Commit your changes** (`git commit -m 'Add amazing feature'`)
- 6. **Push to the branch** (`git push origin feature/amazing-feature`)
- 7. **Open a Pull Request**
 
- ### Development Guidelines
- - **Analysis scripts:** Add to `scripts/analysis/`
- - **Reporting tools:** Add to `scripts/reporting/`
- - **Core detection:** Add to `src/detectors/`
- - **Configuration:** Update `tools/detectors/config.yaml`
 
- ## πŸ“š Documentation
 
- - **[Project Structure](PROJECT_STRUCTURE.md)** - Detailed project organization
- - **[Kernel Analysis Guide](docs/kernel-analysis-guide.md)** - Comprehensive analysis guide
- - **[Concurrency Analysis](reports/concurrency/Concurrency_Analysis_Report.md)** - Concurrency-specific findings
 
- ## πŸ› Known Issues
 
- - **Code snippet extraction:** Some file paths may not match due to the kernel cloning method
- - **Large file handling:** Files >10MB are skipped to prevent memory issues
- - **Pattern accuracy:** Some patterns may generate false positives
 
 ## πŸ“„ License
 
- This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
 
 ## πŸ™ Acknowledgments
 
- - Linux kernel community for the source code
- - Static analysis research community
- - Contributors and maintainers
-
- ## πŸ“ž Contact
-
- - **Repository:** https://github.com/Mac-Huang/linux-kernel-anti-pattern-detector
- - **Issues:** Use GitHub Issues for bug reports and feature requests
- - **Discussions:** Use GitHub Discussions for questions and ideas
 
- ---
 
- *This tool is designed to help improve Linux kernel code quality by identifying potential issues early in the development process. The analysis results should be used as guidance for improvement rather than definitive assessments of code quality.*
+ # CodeLLaMA-Linux-BugFix
 
+ A machine learning project that fine-tunes CodeLLaMA-7B-Instruct specifically for Linux kernel bug fixing using QLoRA (Quantized Low-Rank Adaptation). The model learns to generate Git diff patches from buggy C code and commit messages.
 
+ ## 🎯 Project Overview
 
+ This project addresses the challenging task of automated Linux kernel bug fixing by:
 
+ - **Extracting real bug-fix data** from the Linux kernel Git repository
+ - **Training a specialized model** using QLoRA for efficient fine-tuning
+ - **Generating Git diff patches** that can be applied to fix bugs
+ - **Providing evaluation metrics** to assess model performance
 
+ ## πŸ—οΈ Architecture
 
+ ### Base Model
+ - **Model**: `codellama/CodeLlama-7b-Instruct-hf` (7 billion parameters)
+ - **Fine-tuning Method**: QLoRA with 4-bit quantization
+ - **Hardware**: Optimized for an H200 GPU with bfloat16 precision
 
+ ### Training Configuration
+ - **LoRA Config**: r=64, alpha=16, dropout=0.1
+ - **Training**: 3 epochs, batch size 64, learning rate 2e-4
+ - **Memory Optimization**: Gradient checkpointing, mixed precision training
 
+ ## πŸ“Š Dataset
 
+ The project creates a specialized dataset from Linux kernel commits:
 
+ ### Data Extraction Process
+ 1. **Commit Filtering**: Identifies bug-fix commits using keywords:
+    - `fix`, `bug`, `leak`, `null`, `overflow`, `error`, `failure`
+    - `crash`, `panic`, `memory`, `race`, `deadlock`, `corruption`
+    - `security`, `vulnerability`, `exploit`, `buffer`, `stack`
 
+ 2. **Code Context Extraction**:
+    - Focuses on C source and header files (`.c`, `.h`)
+    - Extracts 10 lines before/after the bug location
+    - Captures relevant code context
 
+ 3. **Data Format**:
+ ```json
+ {
+   "input": {
+     "original code": "C code snippet with bug",
+     "instruction": "Bug fix instruction from commit message"
+   },
+   "output": {
+     "diff codes": "Git diff showing the fix"
+   }
+ }
+ ```
 
+ ### Dataset Statistics
+ - **Training Data**: 100K samples (`training_data_100k.jsonl`)
+ - **Format**: JSONL (one JSON object per line)
+ - **Source**: Linux kernel Git repository
 
+ ## πŸš€ Quick Start
 
+ ### Prerequisites
+ ```bash
+ pip install -r requirements.txt
 ```
 
+ ### 1. Build Dataset
 ```bash
 cd dataset_builder
+ python extract_linux_bugfixes.py
+ python format_for_training.py
 ```
 
+ ### 2. Train Model
+ ```bash
+ cd train
+ python train_codellama_qlora_linux_bugfix.py
+ ```
 
+ ### 3. Evaluate Model
+ ```bash
+ cd evaluate
+ python evaluate_linux_bugfix_model.py
+ ```
 
 ## πŸ“ Project Structure
 
 ```
+ CodeLLaMA-Linux-BugFix/
+ β”œβ”€β”€ dataset_builder/                          # Dataset creation scripts
+ β”‚   β”œβ”€β”€ extract_linux_bugfixes.py             # Main dataset extraction
+ β”‚   β”œβ”€β”€ extract_linux_bugfixes_parallel.py    # Parallelized version
+ β”‚   └── format_for_training.py
+ β”œβ”€β”€ dataset/                                  # Generated datasets
+ β”‚   β”œβ”€β”€ training_data_100k.jsonl
+ β”‚   └── training_data_prompt_completion.jsonl
+ β”œβ”€β”€ train/                                    # Training scripts and outputs
+ β”‚   β”œβ”€β”€ train_codellama_qlora_linux_bugfix.py # Main training script
+ β”‚   β”œβ”€β”€ train_codellama_qlora_simple.py
+ β”‚   β”œβ”€β”€ download_codellama_model.py
+ β”‚   └── output/                               # Trained model checkpoints
+ β”œβ”€β”€ evaluate/                                 # Evaluation scripts and results
+ β”‚   β”œβ”€β”€ evaluate_linux_bugfix_model.py        # Model evaluation
+ β”‚   β”œβ”€β”€ test_samples.jsonl                    # Evaluation dataset
+ β”‚   └── output/                               # Evaluation results
+ └── requirements.txt                          # Python dependencies
 ```
 
+ ## πŸ”§ Key Features
 
+ ### Efficient Training
+ - **QLoRA**: Reduces memory requirements by roughly 75% while maintaining performance
+ - **4-bit Quantization**: Enables training on consumer hardware
+ - **Gradient Checkpointing**: Optimizes memory usage during training
 
+ ### Real-world Data
+ - **Authentic Bug Fixes**: Extracted from actual Linux kernel development
+ - **Contextual Understanding**: Captures relevant code context around bugs
+ - **Git Integration**: Outputs proper Git diff format
 
+ ### Evaluation
+ - **BLEU Score**: Measures n-gram overlap with reference fixes
+ - **ROUGE Score**: Measures recall-oriented overlap with reference fixes
+ - **Comprehensive Metrics**: JSON and CSV output formats
 
+ ## 🎯 Use Cases
 
+ The fine-tuned model can assist with:
 
+ 1. **Automated Bug Fixing**: Generate patches for common kernel bugs
+ 2. **Code Review**: Suggest fixes during development
+ 3. **Learning**: Study patterns in Linux kernel bug fixes
+ 4. **Research**: Advance automated software repair techniques
 
 ## πŸ“ˆ Performance
 
+ The model is evaluated using:
+ - **BLEU Score**: Measures how well generated diffs match reference fixes
+ - **ROUGE Score**: Evaluates overlap between predicted and actual fixes
+ - **Human Evaluation**: Qualitative assessment of fix quality
 
+ ## πŸ”¬ Technical Details
 
+ ### Model Architecture
+ - **Base**: CodeLLaMA-7B-Instruct with instruction tuning
+ - **Adapter**: LoRA layers for efficient fine-tuning
+ - **Output**: Generates patches in Git diff format
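+
+ For inference, the adapter is loaded on top of the base model with `peft` (a sketch; the prompt and generation settings are illustrative):
+
+ ```python
+ # Sketch: generate a diff with the fine-tuned adapter.
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from peft import PeftModel
+
+ base = AutoModelForCausalLM.from_pretrained(
+     "codellama/CodeLlama-7b-Instruct-hf", device_map="auto"
+ )
+ model = PeftModel.from_pretrained(base, "train/output/qlora-codellama-bugfix")
+ tok = AutoTokenizer.from_pretrained("train/output/qlora-codellama-bugfix")
+
+ prompt = "### Buggy code:\n...\n### Instruction:\nFix the leak\n### Fixed diff:\n"
+ inputs = tok(prompt, return_tensors="pt").to(model.device)
+ out = model.generate(**inputs, max_new_tokens=256)  # illustrative settings
+ print(tok.decode(out[0], skip_special_tokens=True))
+ ```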
 
 
+ ### Training Process
+ 1. **Data Preprocessing**: Extract and clean commit data
+ 2. **Tokenization**: Convert to model input format (see the sketch below)
+ 3. **QLoRA Training**: Parameter-efficient fine-tuning with 4-bit quantization
+ 4. **Checkpointing**: Save model states for evaluation
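+
+ Step 2 might look like the following with `datasets` (a sketch; the field names assume the prompt-completion format described above, and `max_length` is illustrative):
+
+ ```python
+ # Sketch: tokenize prompt-completion pairs for causal-LM training.
+ from datasets import load_dataset
+ from transformers import AutoTokenizer
+
+ tok = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")
+ ds = load_dataset("json", data_files="dataset/training_data_prompt_completion.jsonl")
+
+ def tokenize(example):
+     # Concatenate prompt and completion into one causal-LM training sequence.
+     text = example["prompt"] + example["completion"]
+     return tok(text, truncation=True, max_length=1024)
+
+ tokenized = ds["train"].map(tokenize, remove_columns=ds["train"].column_names)
+ ```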
 
+ ### Memory Optimization
+ - **4-bit Quantization**: Significantly reduces the model's memory footprint
+ - **Gradient Accumulation**: Enables larger effective batch sizes
+ - **Mixed Precision**: Uses bfloat16 for faster training
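+
+ These optimizations map onto a few `TrainingArguments` fields (a sketch; the per-device batch size and accumulation split are assumptions consistent with an overall batch size of 64):
+
+ ```python
+ # Sketch: memory-oriented TrainingArguments for QLoRA fine-tuning.
+ from transformers import TrainingArguments
+
+ args = TrainingArguments(
+     output_dir="./output/qlora-codellama-bugfix",
+     num_train_epochs=3,
+     per_device_train_batch_size=8,    # assumption
+     gradient_accumulation_steps=8,    # 8 x 8 = effective batch size 64
+     learning_rate=2e-4,
+     bf16=True,                        # bfloat16 mixed precision
+     gradient_checkpointing=True,      # trade compute for memory
+     logging_steps=50,
+     save_steps=500,                   # matches checkpoint-500/-1000
+ )
+ ```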
 
+ ## 🀝 Contributing
 
+ 1. Fork the repository
+ 2. Create a feature branch
+ 3. Make your changes
+ 4. Add tests if applicable
+ 5. Submit a pull request
 
 ## πŸ“„ License
 
+ This project is licensed under the MIT License - see the LICENSE file for details.
 
 ## πŸ™ Acknowledgments
 
+ - **CodeLLaMA Team**: For the base model
+ - **Linux Kernel Community**: For the bug-fix data
+ - **Hugging Face**: For the transformers library
+ - **Microsoft**: For the LoRA technique
 
+ ## πŸ“š References
 
+ - [Code Llama: Open Foundation Models for Code](https://arxiv.org/abs/2308.12950)
+ - [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
+ - [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)
dataset/{linux_bugfix_100k.jsonl β†’ training_data_100k.jsonl} RENAMED
File without changes
dataset/{linux_bugfix_prompt_completion.jsonl β†’ training_data_prompt_completion.jsonl} RENAMED
File without changes
dataset_builder/{build_dataset_demo.py β†’ extract_linux_bugfixes_parallel.py} RENAMED
File without changes
dataset_builder/{convert_to_prompt_completion.py β†’ format_for_training.py} RENAMED
@@ -1,7 +1,7 @@
 import json
 
 INPUT_FILE = './output/linux_bugfix_dataset.jsonl'
- OUTPUT_FILE = './output/linux_bugfix_prompt_completion.jsonl'
+ OUTPUT_FILE = '../dataset/training_data_prompt_completion.jsonl'
 
 def format_prompt(original_code, instruction):
     return (
evaluate/{evaluate.py β†’ evaluate_linux_bugfix_model.py} RENAMED
@@ -9,7 +9,7 @@ import evaluate
 
 # ==== CONFIG ====
 MODEL_PATH = "../train/output/qlora-codellama-bugfix"
- EVAL_FILE = "eval.jsonl"
+ EVAL_FILE = "test_samples.jsonl"
 OUTPUT_JSON = "./output/eval_results.json"
 OUTPUT_CSV = "./output/eval_results.csv"
 MAX_INPUT_LEN = 1024
evaluate/{eval.jsonl β†’ test_samples.jsonl} RENAMED
File without changes
train/{download_model.py β†’ download_codellama_model.py} RENAMED
File without changes
train/output/qlora-codellama-bugfix/README.md CHANGED
@@ -1,11 +1,12 @@
 ---
- base_model: codellama/CodeLLaMA-7b-Instruct-hf
- library_name: peft
- pipeline_tag: text-generation
+ license: mit
 tags:
- - base_model:adapter:codellama/CodeLLaMA-7b-Instruct-hf
- - lora
- - transformers
+ - linux
+ - bugfix
+ - codellama
+ model_type: causal-lm
+ library_name: transformers
+ pipeline_tag: text-generation
 ---
 
 # Model Card for Model ID
train/{train.py β†’ train_codellama_qlora_linux_bugfix.py} RENAMED
@@ -18,7 +18,7 @@ os.environ["WANDB_PROJECT"] = "codellama-7b-instruct-qlora-linux-bugfix"
 os.environ["WANDB_NAME"] = "run-v1"
 # Paths and model
 BASE_MODEL = "codellama/CodeLLaMA-7b-Instruct-hf"
- DATA_PATH = "../dataset/linux_bugfix_100k.jsonl"
+ DATA_PATH = "../dataset/training_data_100k.jsonl"
 OUTPUT_DIR = "./output/qlora-codellama-bugfix"
 
 # Load dataset (prompt-completion format)
train/{train_codellama_qlora.py β†’ train_codellama_qlora_simple.py} RENAMED
@@ -9,7 +9,7 @@ import os
 
 # Paths and parameters
 BASE_MODEL = "codellama/CodeLlama-7b-Instruct-hf"
- DATA_PATH = "../dataset_builder/output/linux_bugfix_prompt_completion.jsonl"
+ DATA_PATH = "../dataset/training_data_prompt_completion.jsonl"
 OUTPUT_DIR = "./output/qlora-codellama-bugfix"
 
 # Load dataset (prompt, completion)