# Project Structure
This document provides a detailed overview of the CodeLLaMA-Linux-BugFix project structure, explaining the purpose and organization of each component.
## Root Directory
```
CodeLLaMA-Linux-BugFix/
├── dataset_builder/       # Dataset creation and processing
├── dataset/               # Generated datasets and data files
├── train/                 # Model training scripts and outputs
├── evaluate/              # Model evaluation and testing
├── requirements.txt       # Python dependencies
├── README.md              # Project documentation
└── PROJECT_STRUCTURE.md   # This file
```
## Dataset Builder (`dataset_builder/`)
The dataset builder extracts bug-fix data from the Linux kernel Git repository and converts it into a training-ready format.
### Files:
- **`extract_linux_bugfixes.py`** - Main dataset extraction script
- Uses PyDriller to analyze Linux kernel Git history
- Filters commits using bug-fix keywords
- Extracts code context around bug locations
- Generates structured dataset entries
- **`extract_linux_bugfixes_parallel.py`** - Parallelized version of dataset builder
- Multi-process implementation for faster processing
- Configurable worker count (default: 16 workers)
- Test mode with limited commit processing
- **`format_for_training.py`** - Format conversion script
- Converts structured data to prompt-completion pairs
- Formats input for supervised fine-tuning
  - Creates a training-ready JSONL format (sketched after the data format below)
### Key Features:
- **Commit Filtering**: Identifies bug-fix commits using 17 keywords (combined with the other filters in the sketch below)
- **Code Context**: Extracts 10 lines before/after the bug location
- **File Filtering**: Focuses on C sources and headers (`.c`, `.h`)
- **Diff Extraction**: Captures Git diff patches for fixes
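
Conceptually, the extraction combines these four filters. The following is a minimal sketch using PyDriller; the keyword list and repository path are illustrative placeholders, not the actual 17 keywords or paths used by `extract_linux_bugfixes.py`:

```python
# Illustrative sketch only: keyword list and repo path are placeholders.
from pydriller import Repository

BUGFIX_KEYWORDS = {"fix", "bug", "leak", "overflow", "race"}  # sample subset

def is_bugfix(message: str) -> bool:
    msg = message.lower()
    return any(kw in msg for kw in BUGFIX_KEYWORDS)

for commit in Repository("path/to/linux").traverse_commits():
    if not is_bugfix(commit.msg):
        continue
    for mod in commit.modified_files:
        # Keep only C sources and headers, matching the file filter above
        if mod.filename.endswith((".c", ".h")) and mod.diff:
            entry = {
                "commit": commit.hash,
                "instruction": commit.msg.splitlines()[0],
                "diff codes": mod.diff,
            }
            # ... extract +/-10 lines of context and write the entry ...
```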
## Dataset (`dataset/`)
Contains the generated datasets used for training and evaluation.
### Files:
- **`training_data_100k.jsonl`** - Main training dataset
- 100,000 bug-fix samples
- Structured format with input/output pairs
- Stored using Git LFS for large file handling
- **`training_data_prompt_completion.jsonl`** - Converted training format
- Prompt-completion pairs for supervised learning
- Optimized for transformer model training
- Stored using Git LFS
### Data Format:
```json
{
"input": {
"original code": "C code snippet with bug",
"instruction": "Bug fix instruction from commit message"
},
"output": {
"diff codes": "Git diff showing the fix"
}
}
```
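
The conversion performed by `format_for_training.py` can be sketched as follows; the exact prompt template is an assumption, but the field names match the data format above:

```python
# Sketch of the structured-to-prompt conversion; the prompt template
# here is an assumed example, not necessarily the one used in the repo.
import json

def to_prompt_completion(record: dict) -> dict:
    prompt = (
        "### Instruction:\n" + record["input"]["instruction"] + "\n\n"
        "### Buggy code:\n" + record["input"]["original code"] + "\n\n"
        "### Fix:\n"
    )
    return {"prompt": prompt, "completion": record["output"]["diff codes"]}

with open("dataset/training_data_100k.jsonl") as src, \
     open("dataset/training_data_prompt_completion.jsonl", "w") as dst:
    for line in src:
        dst.write(json.dumps(to_prompt_completion(json.loads(line))) + "\n")
```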
## Training (`train/`)
Contains all training-related scripts, configurations, and model outputs.
### Files:
- **`train_codellama_qlora_linux_bugfix.py`** - Main training script
- QLoRA fine-tuning implementation
- Optimized for H200 GPU with bfloat16
- Includes Weights & Biases integration
- Comprehensive training configuration
- **`train_codellama_qlora_simple.py`** - Alternative training script
- Simplified QLoRA implementation
- Basic training setup without advanced features
- Good for testing and development
- **`download_codellama_model.py`** - Model download utility
- Downloads base CodeLLaMA-7B-Instruct model
- Ensures model availability before training
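
In its simplest form, the download step amounts to a snapshot fetch from the Hugging Face Hub; whether `download_codellama_model.py` uses exactly this call is an assumption:

```python
# Assumed implementation sketch using huggingface_hub's snapshot_download.
from huggingface_hub import snapshot_download

# Fetch all model files locally so training does not hit the network.
snapshot_download(repo_id="codellama/CodeLlama-7b-Instruct-hf")
```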
### Output Directory (`train/output/`):
- **`qlora-codellama-bugfix/`** - Main model output
- **`adapter_model.safetensors`** - LoRA adapter weights
- **`adapter_config.json`** - LoRA configuration
- **`tokenizer.json`** - Tokenizer files
- **`chat_template.jinja`** - Conversation template
- **`checkpoint-500/`** - Training checkpoint at step 500
- **`checkpoint-1000/`** - Training checkpoint at step 1000
- **`README.md`** - Model card and documentation
### Training Configuration:
- **Base Model**: `codellama/CodeLlama-7b-Instruct-hf`
- **Method**: QLoRA with 4-bit quantization
- **LoRA Config**: r=64, alpha=16, dropout=0.1
- **Training**: 3 epochs, batch size 64, learning rate 2e-4
- **Hardware**: Optimized for H200 GPU
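
A minimal sketch of this configuration with the pinned `transformers`, `peft`, and `bitsandbytes` packages; the `nf4` quantization type and the default LLaMA `target_modules` are assumptions, since the list above only specifies 4-bit quantization:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization with bfloat16 compute (nf4 is an assumed default)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA configuration matching the values listed above
lora_config = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.1, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```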
## Evaluation (`evaluate/`)
Contains evaluation scripts and results for assessing model performance.
### Files:
- **`evaluate_linux_bugfix_model.py`** - Main evaluation script
- Loads fine-tuned model for inference
- Generates predictions on test data
- Computes BLEU and ROUGE metrics
- Saves results in multiple formats
- **`test_samples.jsonl`** - Evaluation dataset
- Test samples for model evaluation
- Stored using Git LFS
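
Loading the fine-tuned model for inference follows the standard PEFT pattern, sketched below against the adapter path from `train/output/`:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

ADAPTER_DIR = "train/output/qlora-codellama-bugfix"

base = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, ADAPTER_DIR)  # attach LoRA weights
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_DIR)
```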
### Output Directory (`evaluate/output/`):
- **`eval_results.json`** - Detailed evaluation results
- Complete predictions and references
- Stored using Git LFS
- **`eval_results.csv`** - Tabular evaluation results
- CSV format for easy analysis
- Stored using Git LFS
### Evaluation Metrics:
- **BLEU Score**: Measures n-gram overlap between generated and reference patches
- **ROUGE Score**: Measures recall-oriented overlap with the reference fixes
- **Human Evaluation**: Qualitative assessment
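
The BLEU and ROUGE computation maps directly onto the pinned `evaluate` package; the predictions and references below are placeholders for the model outputs and ground-truth diffs:

```python
import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

predictions = ["<model-generated diff>"]   # placeholder
references = ["<ground-truth diff>"]       # placeholder

# BLEU expects a list of reference lists per prediction
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))
```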
## Dependencies (`requirements.txt`)
Comprehensive list of Python packages required for the project:
### Core ML Libraries:
- `transformers==4.53.1` - Hugging Face transformers
- `torch==2.7.1+cu128` - PyTorch with CUDA support
- `peft==0.16.0` - Parameter-efficient fine-tuning
- `accelerate==1.8.1` - Distributed training
- `bitsandbytes==0.46.1` - Quantization support
### Data Processing:
- `datasets==3.6.0` - Dataset handling
- `pandas==2.3.1` - Data manipulation
- `numpy==2.3.1` - Numerical computing
### Git Analysis:
- `pydriller` - Git repository mining
- `gitpython` - Git operations
### Utilities:
- `tqdm==4.67.1` - Progress bars
- `wandb` - Experiment tracking
- `evaluate==0.4.4` - Evaluation metrics
## Workflow
### 1. Dataset Creation
```bash
cd dataset_builder
python extract_linux_bugfixes.py # Extract bug-fix data
python format_for_training.py # Convert format
```
### 2. Model Training
```bash
cd train
python train_codellama_qlora_linux_bugfix.py # Train with QLoRA
```
### 3. Model Evaluation
```bash
cd evaluate
python evaluate_linux_bugfix_model.py # Evaluate performance
```
## Key Design Principles
### Modularity
- Each component has a specific responsibility
- Clear separation between data, training, and evaluation
- Easy to modify or extend individual components
### Efficiency
- QLoRA for memory-efficient training
- Parallel processing for dataset creation
- Optimized for modern GPU hardware
### Reproducibility
- Version-controlled dependencies
- Structured data formats
- Comprehensive logging and evaluation
### Scalability
- Configurable parameters for different hardware
- Support for distributed training
- Efficient data handling with Git LFS
## File Naming Conventions
- **Scripts**: Descriptive names with clear purpose
- **Datasets**: Include size/version information
- **Models**: Include architecture and method
- **Results**: Include timestamp or version
- **Configs**: Use `.json` or `.yaml` format
## Documentation
- **README.md**: Project overview and quick start
- **PROJECT_STRUCTURE.md**: This detailed structure guide
- **Model README**: Generated model cards in output directories
- **Code Comments**: Inline documentation in all scripts
This structure ensures the project is organized, maintainable, and easy to understand for both users and contributors. |