# Project Structure

This document provides a detailed overview of the CodeLLaMA-Linux-BugFix project structure, explaining the purpose and organization of each component.

## Root Directory
```
CodeLLaMA-Linux-BugFix/
├── dataset_builder/        # Dataset creation and processing
├── dataset/                # Generated datasets and data files
├── train/                  # Model training scripts and outputs
├── evaluate/               # Model evaluation and testing
├── requirements.txt        # Python dependencies
├── README.md               # Project documentation
└── PROJECT_STRUCTURE.md    # This file
```
## Dataset Builder (`dataset_builder/`)

The dataset builder extracts bug-fix data from the Linux kernel Git repository and converts it into a training-ready format.

### Files:
- **`extract_linux_bugfixes.py`** - Main dataset extraction script
  - Uses PyDriller to analyze the Linux kernel Git history
  - Filters commits using bug-fix keywords
  - Extracts code context around bug locations
  - Generates structured dataset entries

- **`extract_linux_bugfixes_parallel.py`** - Parallelized version of the dataset builder
  - Multi-process implementation for faster processing
  - Configurable worker count (default: 16 workers)
  - Test mode with limited commit processing

- **`format_for_training.py`** - Format conversion script
  - Converts structured data to prompt-completion pairs
  - Formats input for supervised fine-tuning
  - Creates training-ready JSONL format (see the conversion sketch under **Data Format** below)
### Key Features:

- **Commit Filtering**: Identifies bug-fix commits using 17 keywords (see the extraction sketch below)
- **Code Context**: Extracts 10 lines before/after the bug location
- **File Filtering**: Focuses on C source and header files (`.c`, `.h`)
- **Diff Extraction**: Captures Git diff patches for fixes
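A minimal sketch of what this extraction pass can look like with PyDriller; the keyword list, repository path, and output field names below are illustrative assumptions rather than the exact values used by `extract_linux_bugfixes.py`:

```python
# Hedged sketch of the mining loop; keywords, paths, and field names are assumptions.
import json
from pydriller import Repository

BUGFIX_KEYWORDS = ["fix", "bug", "leak", "overflow", "null", "race"]  # illustrative subset

with open("linux_bugfixes.jsonl", "w") as out:
    for commit in Repository("path/to/linux").traverse_commits():
        message = commit.msg.lower()
        if not any(keyword in message for keyword in BUGFIX_KEYWORDS):
            continue  # keep only commits that look like bug fixes
        for mod in commit.modified_files:
            if not mod.filename.endswith((".c", ".h")):
                continue  # restrict to C sources and headers
            entry = {
                "file": mod.new_path or mod.old_path,
                "instruction": commit.msg.strip(),
                "original_code": mod.source_code_before,  # context trimming omitted here
                "diff": mod.diff,
            }
            out.write(json.dumps(entry) + "\n")
```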
## Dataset (`dataset/`)

Contains the generated datasets used for training and evaluation; a short loading example follows the file list below.

### Files:

- **`training_data_100k.jsonl`** - Main training dataset
  - 100,000 bug-fix samples
  - Structured format with input/output pairs
  - Stored using Git LFS for large file handling

- **`training_data_prompt_completion.jsonl`** - Converted training format
  - Prompt-completion pairs for supervised learning
  - Optimized for transformer model training
  - Stored using Git LFS
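Either JSONL file can be loaded directly with the `datasets` library; the relative path below assumes you run from the repository root (see the Data Format section for the schema):

```python
from datasets import load_dataset

# Load the structured dataset; nested "input"/"output" objects become struct columns.
ds = load_dataset("json", data_files="dataset/training_data_100k.jsonl", split="train")
print(ds[0]["input"]["instruction"])
```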
### Data Format:

```json
{
  "input": {
    "original code": "C code snippet with bug",
    "instruction": "Bug fix instruction from commit message"
  },
  "output": {
    "diff codes": "Git diff showing the fix"
  }
}
```
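A hedged sketch of the conversion step performed by `format_for_training.py`; the prompt template shown here is an assumption, not necessarily the exact one used by the script:

```python
# Hedged sketch: turning structured entries into prompt-completion pairs.
# The prompt template is an illustrative assumption.
import json

with open("dataset/training_data_100k.jsonl") as src, \
        open("dataset/training_data_prompt_completion.jsonl", "w") as dst:
    for line in src:
        entry = json.loads(line)
        prompt = (
            f"### Instruction:\n{entry['input']['instruction']}\n\n"
            f"### Buggy code:\n{entry['input']['original code']}\n\n"
            "### Fix:\n"
        )
        completion = entry["output"]["diff codes"]
        dst.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```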
## Training (`train/`)

Contains all training-related scripts, configurations, and model outputs.

### Files:

- **`train_codellama_qlora_linux_bugfix.py`** - Main training script
  - QLoRA fine-tuning implementation
  - Optimized for an H200 GPU with bfloat16
  - Includes Weights & Biases integration
  - Comprehensive training configuration

- **`train_codellama_qlora_simple.py`** - Alternative training script
  - Simplified QLoRA implementation
  - Basic training setup without advanced features
  - Good for testing and development

- **`download_codellama_model.py`** - Model download utility
  - Downloads the base CodeLLaMA-7B-Instruct model
  - Ensures model availability before training (see the sketch below)
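A minimal sketch of what such a pre-download can look like with `huggingface_hub`; the local directory is an illustrative assumption:

```python
# Hedged sketch: pre-fetch the base model so training does not stall on downloads.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="codellama/CodeLlama-7b-Instruct-hf",
    local_dir="models/CodeLlama-7b-Instruct-hf",  # illustrative path
)
```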
### Output Directory (`train/output/`):

- **`qlora-codellama-bugfix/`** - Main model output
  - **`adapter_model.safetensors`** - LoRA adapter weights
  - **`adapter_config.json`** - LoRA configuration
  - **`tokenizer.json`** - Tokenizer files
  - **`chat_template.jinja`** - Conversation template
  - **`checkpoint-500/`** - Training checkpoint at step 500
  - **`checkpoint-1000/`** - Training checkpoint at step 1000
  - **`README.md`** - Model card and documentation
### Training Configuration:

- **Base Model**: `codellama/CodeLlama-7b-Instruct-hf`
- **Method**: QLoRA with 4-bit quantization (see the sketch below)
- **LoRA Config**: r=64, alpha=16, dropout=0.1
- **Training**: 3 epochs, batch size 64, learning rate 2e-4
- **Hardware**: Optimized for an H200 GPU
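A hedged sketch of how this configuration maps onto `transformers`/`peft`; the 4-bit quantization details and `target_modules` are assumptions, since only r, alpha, dropout, epochs, batch size, and learning rate are stated above:

```python
# Hedged sketch of the QLoRA setup; quantization details and target_modules are assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # assumption
    bnb_4bit_compute_dtype=torch.bfloat16,  # matches the bfloat16 note above
)
model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="train/output/qlora-codellama-bugfix",
    num_train_epochs=3,
    per_device_train_batch_size=64,  # stated batch size; real script may use accumulation
    learning_rate=2e-4,
    bf16=True,
)
```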
## Evaluation (`evaluate/`)

Contains evaluation scripts and results for assessing model performance.

### Files:

- **`evaluate_linux_bugfix_model.py`** - Main evaluation script
  - Loads the fine-tuned model for inference (see the sketch below)
  - Generates predictions on test data
  - Computes BLEU and ROUGE metrics
  - Saves results in multiple formats

- **`test_samples.jsonl`** - Evaluation dataset
  - Test samples for model evaluation
  - Stored using Git LFS
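A hedged sketch of loading the LoRA adapter on top of the base model for inference; the prompt template and generation settings are illustrative assumptions:

```python
# Hedged sketch: load base model + LoRA adapter, then generate a candidate fix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

ADAPTER_DIR = "train/output/qlora-codellama-bugfix"

base = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, ADAPTER_DIR)
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_DIR)

# Assumed prompt template; <snippet> stands in for the buggy code.
prompt = "### Instruction:\nFix the reported bug\n\n### Buggy code:\n<snippet>\n\n### Fix:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```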
### Output Directory (`evaluate/output/`):

- **`eval_results.json`** - Detailed evaluation results
  - Complete predictions and references
  - Stored using Git LFS

- **`eval_results.csv`** - Tabular evaluation results
  - CSV format for easy analysis
  - Stored using Git LFS
### Evaluation Metrics:

- **BLEU Score**: Measures n-gram overlap between generated and reference patches (see the sketch below)
- **ROUGE Score**: Measures recall-oriented overlap with the reference fix
- **Human Evaluation**: Qualitative assessment of generated fixes
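A minimal sketch of computing these scores with the `evaluate` package pinned in `requirements.txt`; the placeholder strings stand in for the real predictions and references:

```python
# Hedged sketch: BLEU and ROUGE over generated vs. reference diffs.
import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

predictions = ["- kfree(ptr);\n+ if (ptr)\n+     kfree(ptr);"]  # placeholder model output
references = ["- kfree(ptr);\n+ if (ptr)\n+     kfree(ptr);"]   # placeholder ground truth

bleu_result = bleu.compute(predictions=predictions, references=[[r] for r in references])
rouge_result = rouge.compute(predictions=predictions, references=references)
print(bleu_result["bleu"], rouge_result["rougeL"])
```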
## Dependencies (`requirements.txt`)

Comprehensive list of Python packages required for the project:

### Core ML Libraries:

- `transformers==4.53.1` - Hugging Face transformers
- `torch==2.7.1+cu128` - PyTorch with CUDA support
- `peft==0.16.0` - Parameter-efficient fine-tuning
- `accelerate==1.8.1` - Distributed training
- `bitsandbytes==0.46.1` - Quantization support

### Data Processing:

- `datasets==3.6.0` - Dataset handling
- `pandas==2.3.1` - Data manipulation
- `numpy==2.3.1` - Numerical computing

### Git Analysis:

- `pydriller` - Git repository mining
- `gitpython` - Git operations

### Utilities:

- `tqdm==4.67.1` - Progress bars
- `wandb` - Experiment tracking
- `evaluate==0.4.4` - Evaluation metrics
## Workflow

### 1. Dataset Creation

```bash
cd dataset_builder
python extract_linux_bugfixes.py   # Extract bug-fix data
python format_for_training.py      # Convert format
```

### 2. Model Training

```bash
cd train
python train_codellama_qlora_linux_bugfix.py   # Train with QLoRA
```

### 3. Model Evaluation

```bash
cd evaluate
python evaluate_linux_bugfix_model.py   # Evaluate performance
```
## Key Design Principles

### Modularity

- Each component has a specific responsibility
- Clear separation between data, training, and evaluation
- Easy to modify or extend individual components

### Efficiency

- QLoRA for memory-efficient training
- Parallel processing for dataset creation
- Optimized for modern GPU hardware

### Reproducibility

- Version-controlled dependencies
- Structured data formats
- Comprehensive logging and evaluation

### Scalability

- Configurable parameters for different hardware
- Support for distributed training
- Efficient data handling with Git LFS
## File Naming Conventions

- **Scripts**: Descriptive names with clear purpose
- **Datasets**: Include size/version information
- **Models**: Include architecture and method
- **Results**: Include timestamp or version
- **Configs**: Use `.json` or `.yaml` format
## Documentation

- **README.md**: Project overview and quick start
- **PROJECT_STRUCTURE.md**: This detailed structure guide
- **Model README**: Generated model cards in output directories
- **Code Comments**: Inline documentation in all scripts

This structure ensures the project is organized, maintainable, and easy to understand for both users and contributors.