# Project Structure
This document provides a detailed overview of the CodeLLaMA-Linux-BugFix project structure, explaining the purpose and organization of each component.
## πŸ“ Root Directory
```
CodeLLaMA-Linux-BugFix/
β”œβ”€β”€ dataset_builder/ # Dataset creation and processing
β”œβ”€β”€ dataset/ # Generated datasets and data files
β”œβ”€β”€ train/ # Model training scripts and outputs
β”œβ”€β”€ evaluate/ # Model evaluation and testing
β”œβ”€β”€ requirements.txt # Python dependencies
β”œβ”€β”€ README.md # Project documentation
└── PROJECT_STRUCTURE.md # This file
```
## πŸ”§ Dataset Builder (`dataset_builder/`)
The dataset builder extracts bug-fix data from the Linux kernel Git repository and converts it into a training-ready format.
### Files:
- **`extract_linux_bugfixes.py`** - Main dataset extraction script
- Uses PyDriller to analyze Linux kernel Git history
- Filters commits using bug-fix keywords
- Extracts code context around bug locations
- Generates structured dataset entries
- **`extract_linux_bugfixes_parallel.py`** - Parallelized version of dataset builder
- Multi-process implementation for faster processing
- Configurable worker count (default: 16 workers)
- Test mode with limited commit processing
- **`format_for_training.py`** - Format conversion script
- Converts structured data to prompt-completion pairs
- Formats input for supervised fine-tuning
- Creates training-ready JSONL format
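A minimal sketch of this conversion, assuming the record layout shown in the Data Format section below; the prompt wording, output field names, and file paths are illustrative rather than the exact ones used by `format_for_training.py`:
```python
import json

def to_prompt_completion(record):
    # Flatten one structured record into a prompt-completion pair.
    prompt = (
        f"### Instruction:\n{record['input']['instruction']}\n\n"
        f"### Buggy code:\n{record['input']['original code']}\n\n"
        "### Fix:\n"
    )
    return {"prompt": prompt, "completion": record["output"]["diff codes"]}

# Paths are illustrative; adjust to where the dataset actually lives.
with open("../dataset/training_data_100k.jsonl") as src, \
     open("../dataset/training_data_prompt_completion.jsonl", "w") as dst:
    for line in src:
        dst.write(json.dumps(to_prompt_completion(json.loads(line))) + "\n")
```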
### Key Features:
- **Commit Filtering**: Identifies bug-fix commits using 17 keywords
- **Code Context**: Extracts 10 lines before/after bug location
- **File Filtering**: Focuses on C and header files (`.c`, `.h`)
- **Diff Extraction**: Captures Git diff patches for fixes
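The extraction pipeline behind these features can be sketched as follows, assuming PyDriller 2.x and a local kernel clone at `linux/`; the keyword list is shortened for illustration and the 10-line context windowing is omitted:
```python
from pydriller import Repository

# Shortened, illustrative keyword list; the real script uses 17 keywords.
BUGFIX_KEYWORDS = ("fix", "bug", "leak", "overflow", "null", "race", "crash")

def looks_like_bugfix(message: str) -> bool:
    msg = message.lower()
    return any(keyword in msg for keyword in BUGFIX_KEYWORDS)

samples = []
for commit in Repository("linux").traverse_commits():
    if not looks_like_bugfix(commit.msg):
        continue
    for mod in commit.modified_files:
        if not mod.filename.endswith((".c", ".h")):   # C and header files only
            continue
        samples.append({
            "input": {
                "original code": mod.source_code_before or "",
                "instruction": commit.msg.splitlines()[0],
            },
            "output": {"diff codes": mod.diff},        # Git diff patch for the fix
        })
```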
## πŸ“Š Dataset (`dataset/`)
Contains the generated datasets used for training and evaluation.
### Files:
- **`training_data_100k.jsonl`** - Main training dataset
- 100,000 bug-fix samples
- Structured format with input/output pairs
- Stored using Git LFS for large file handling
- **`training_data_prompt_completion.jsonl`** - Converted training format
- Prompt-completion pairs for supervised learning
- Optimized for transformer model training
- Stored using Git LFS
### Data Format:
```json
{
  "input": {
    "original code": "C code snippet with bug",
    "instruction": "Bug fix instruction from commit message"
  },
  "output": {
    "diff codes": "Git diff showing the fix"
  }
}
```
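Since both files are plain JSONL, they can be inspected or loaded directly with the `datasets` library pinned in `requirements.txt`; the path below assumes the repository root as the working directory:
```python
from datasets import load_dataset

# Nested "input"/"output" objects become nested features automatically.
ds = load_dataset("json", data_files="dataset/training_data_100k.jsonl", split="train")
print(ds[0]["input"]["instruction"])
```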
## πŸš€ Training (`train/`)
Contains all training-related scripts, configurations, and model outputs.
### Files:
- **`train_codellama_qlora_linux_bugfix.py`** - Main training script
- QLoRA fine-tuning implementation
- Optimized for H200 GPU with bfloat16
- Includes Weights & Biases integration
- Comprehensive training configuration
- **`train_codellama_qlora_simple.py`** - Alternative training script
- Simplified QLoRA implementation
- Basic training setup without advanced features
- Good for testing and development
- **`download_codellama_model.py`** - Model download utility
- Downloads base CodeLLaMA-7B-Instruct model
- Ensures model availability before training
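A minimal sketch of the download step using `huggingface_hub`; the target directory is an assumption, and `download_codellama_model.py` may use a different mechanism:
```python
from huggingface_hub import snapshot_download

# Fetch the base model weights ahead of training; local_dir is illustrative.
snapshot_download(
    repo_id="codellama/CodeLlama-7b-Instruct-hf",
    local_dir="models/CodeLlama-7b-Instruct-hf",
)
```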
### Output Directory (`train/output/`):
- **`qlora-codellama-bugfix/`** - Main model output
- **`adapter_model.safetensors`** - LoRA adapter weights
- **`adapter_config.json`** - LoRA configuration
- **`tokenizer.json`** - Tokenizer files
- **`chat_template.jinja`** - Conversation template
- **`checkpoint-500/`** - Training checkpoint at step 500
- **`checkpoint-1000/`** - Training checkpoint at step 1000
- **`README.md`** - Model card and documentation
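The adapter files above are sufficient to run inference by attaching them to the base model. A minimal sketch with `peft`; the prompt format is an illustrative assumption:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

ADAPTER_DIR = "train/output/qlora-codellama-bugfix"

base = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_DIR)
model = PeftModel.from_pretrained(base, ADAPTER_DIR)  # attach the LoRA adapter

prompt = "### Instruction:\nFix the memory leak.\n\n### Buggy code:\n...\n\n### Fix:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```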
### Training Configuration:
- **Base Model**: `codellama/CodeLlama-7b-Instruct-hf`
- **Method**: QLoRA with 4-bit quantization
- **LoRA Config**: r=64, alpha=16, dropout=0.1
- **Training**: 3 epochs, batch size 64, learning rate 2e-4
- **Hardware**: Optimized for H200 GPU
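A minimal sketch of this configuration with `transformers` and `peft`; the target modules and the per-device batch size / gradient accumulation split are assumptions, while the quantization, LoRA, and optimizer values come from the list above:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit quantization (QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 compute on the H200
)
model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.1,  # values listed above
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed projection layers
)
model = get_peft_model(model, lora_config)

args = TrainingArguments(
    output_dir="output/qlora-codellama-bugfix",
    num_train_epochs=3,
    learning_rate=2e-4,
    per_device_train_batch_size=8,          # assumed split of the batch size of 64
    gradient_accumulation_steps=8,
    bf16=True,
    report_to="wandb",                      # Weights & Biases integration
)
# A Trainer (or SFT trainer) would then consume `model`, `args`, and the tokenized dataset.
```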
## πŸ“ˆ Evaluation (`evaluate/`)
Contains evaluation scripts and results for assessing model performance.
### Files:
- **`evaluate_linux_bugfix_model.py`** - Main evaluation script
- Loads fine-tuned model for inference
- Generates predictions on test data
- Computes BLEU and ROUGE metrics
- Saves results in multiple formats
- **`test_samples.jsonl`** - Evaluation dataset
- Test samples for model evaluation
- Stored using Git LFS
### Output Directory (`evaluate/output/`):
- **`eval_results.json`** - Detailed evaluation results
- Complete predictions and references
- Stored using Git LFS
- **`eval_results.csv`** - Tabular evaluation results
- CSV format for easy analysis
- Stored using Git LFS
### Evaluation Metrics:
- **BLEU Score**: Measures n-gram overlap between generated and reference diff patches
- **ROUGE Score**: Measures recall-oriented overlap between generated and reference fixes
- **Human Evaluation**: Qualitative assessment
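The automatic metrics can be computed with the `evaluate` library pinned in `requirements.txt`; the prediction and reference strings below are illustrative stand-ins for generated and ground-truth diffs:
```python
import evaluate

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")

predictions = ["-\tkfree(ptr);\n+\tkfree_sensitive(ptr);"]
references = ["-\tkfree(ptr);\n+\tkfree_sensitive(ptr);"]

print(bleu.compute(predictions=predictions, references=references)["bleu"])
print(rouge.compute(predictions=predictions, references=references)["rougeL"])
```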
## πŸ”§ Dependencies (`requirements.txt`)
Comprehensive list of Python packages required for the project:
### Core ML Libraries:
- `transformers==4.53.1` - Hugging Face transformers
- `torch==2.7.1+cu128` - PyTorch with CUDA support
- `peft==0.16.0` - Parameter-efficient fine-tuning
- `accelerate==1.8.1` - Distributed training
- `bitsandbytes==0.46.1` - Quantization support
### Data Processing:
- `datasets==3.6.0` - Dataset handling
- `pandas==2.3.1` - Data manipulation
- `numpy==2.3.1` - Numerical computing
### Git Analysis:
- `pydriller` - Git repository mining
- `gitpython` - Git operations
### Utilities:
- `tqdm==4.67.1` - Progress bars
- `wandb` - Experiment tracking
- `evaluate==0.4.4` - Evaluation metrics
## πŸ”„ Workflow
### 1. Dataset Creation
```bash
cd dataset_builder
python extract_linux_bugfixes.py # Extract bug-fix data
python format_for_training.py # Convert format
```
### 2. Model Training
```bash
cd train
python train_codellama_qlora_linux_bugfix.py # Train with QLoRA
```
### 3. Model Evaluation
```bash
cd evaluate
python evaluate_linux_bugfix_model.py # Evaluate performance
```
## 🎯 Key Design Principles
### Modularity
- Each component has a specific responsibility
- Clear separation between data, training, and evaluation
- Easy to modify or extend individual components
### Efficiency
- QLoRA for memory-efficient training
- Parallel processing for dataset creation
- Optimized for modern GPU hardware
### Reproducibility
- Version-controlled dependencies
- Structured data formats
- Comprehensive logging and evaluation
### Scalability
- Configurable parameters for different hardware
- Support for distributed training
- Efficient data handling with Git LFS
## πŸ” File Naming Conventions
- **Scripts**: Descriptive names with clear purpose
- **Datasets**: Include size/version information
- **Models**: Include architecture and method
- **Results**: Include timestamp or version
- **Configs**: Use `.json` or `.yaml` format
## πŸ“ Documentation
- **README.md**: Project overview and quick start
- **PROJECT_STRUCTURE.md**: This detailed structure guide
- **Model README**: Generated model cards in output directories
- **Code Comments**: Inline documentation in all scripts
This structure ensures the project is organized, maintainable, and easy to understand for both users and contributors.