# Project Structure

This document provides a detailed overview of the CodeLLaMA-Linux-BugFix project structure, explaining the purpose and organization of each component.
## Root Directory

```
CodeLLaMA-Linux-BugFix/
├── dataset_builder/       # Dataset creation and processing
├── dataset/               # Generated datasets and data files
├── train/                 # Model training scripts and outputs
├── evaluate/              # Model evaluation and testing
├── requirements.txt       # Python dependencies
├── README.md              # Project documentation
└── PROJECT_STRUCTURE.md   # This file
```
## Dataset Builder (`dataset_builder/`)

The dataset builder extracts bug-fix data from the Linux kernel Git repository and converts it into a training-ready format.
### Files

**`extract_linux_bugfixes.py`** - Main dataset extraction script
- Uses PyDriller to analyze Linux kernel Git history
- Filters commits using bug-fix keywords
- Extracts code context around bug locations
- Generates structured dataset entries

**`extract_linux_bugfixes_parallel.py`** - Parallelized version of the dataset builder
- Multi-process implementation for faster processing
- Configurable worker count (default: 16 workers)
- Test mode with limited commit processing

**`format_for_training.py`** - Format conversion script
- Converts structured data to prompt-completion pairs
- Formats input for supervised fine-tuning
- Creates training-ready JSONL output
### Key Features

- **Commit Filtering**: Identifies bug-fix commits using 17 keywords
- **Code Context**: Extracts 10 lines before/after the bug location
- **File Filtering**: Focuses on C source and header files (`.c`, `.h`)
- **Diff Extraction**: Captures Git diff patches for fixes
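The filtering and context-extraction steps can be sketched in plain Python. This is a hypothetical sketch: the keyword set below is illustrative (the real script's 17 keywords are not reproduced here), and the function names are inventions for this example.

```python
# Illustrative subset -- the actual extraction script uses 17 keywords.
BUGFIX_KEYWORDS = {"fix", "bug", "leak", "overflow", "null", "crash", "race"}

def is_bugfix_commit(commit_msg: str) -> bool:
    """Return True if the commit message contains any bug-fix keyword."""
    msg = commit_msg.lower()
    return any(kw in msg for kw in BUGFIX_KEYWORDS)

def extract_context(lines: list, bug_line: int, radius: int = 10) -> list:
    """Extract `radius` lines before/after the bug location (1-indexed)."""
    start = max(0, bug_line - 1 - radius)
    end = min(len(lines), bug_line + radius)
    return lines[start:end]
```

In practice, messages and file contents would come from PyDriller's commit objects rather than plain strings.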
## Dataset (`dataset/`)

Contains the generated datasets used for training and evaluation.
Files:
training_data_100k.jsonl
- Main training dataset- 100,000 bug-fix samples
- Structured format with input/output pairs
- Stored using Git LFS for large file handling
training_data_prompt_completion.jsonl
- Converted training format- Prompt-completion pairs for supervised learning
- Optimized for transformer model training
- Stored using Git LFS
### Data Format

```json
{
  "input": {
    "original code": "C code snippet with bug",
    "instruction": "Bug fix instruction from commit message"
  },
  "output": {
    "diff codes": "Git diff showing the fix"
  }
}
```
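A sketch of the conversion that `format_for_training.py` is described as performing, flattening one structured entry into a prompt-completion pair. The prompt template here is an assumption for illustration; the real script's template may differ.

```python
import json

def to_prompt_completion(entry: dict) -> dict:
    """Flatten a structured entry into a prompt-completion pair
    (hypothetical template; the real script's template may differ)."""
    prompt = (
        f"### Instruction:\n{entry['input']['instruction']}\n\n"
        f"### Buggy code:\n{entry['input']['original code']}\n\n"
        "### Fix:\n"
    )
    return {"prompt": prompt, "completion": entry["output"]["diff codes"]}

entry = {
    "input": {
        "original code": "int *p = NULL;\n*p = 1;",
        "instruction": "Fix null pointer dereference",
    },
    "output": {"diff codes": "-*p = 1;\n+if (p)\n+\t*p = 1;"},
}
# Each pair would be written as one JSON line of the output JSONL file.
line = json.dumps(to_prompt_completion(entry))
```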
## Training (`train/`)

Contains all training-related scripts, configurations, and model outputs.
### Files

**`train_codellama_qlora_linux_bugfix.py`** - Main training script
- QLoRA fine-tuning implementation
- Optimized for the H200 GPU with bfloat16
- Includes Weights & Biases integration
- Comprehensive training configuration

**`train_codellama_qlora_simple.py`** - Alternative training script
- Simplified QLoRA implementation
- Basic training setup without advanced features
- Good for testing and development

**`download_codellama_model.py`** - Model download utility
- Downloads the base CodeLLaMA-7B-Instruct model
- Ensures model availability before training
### Output Directory (`train/output/`)

**`qlora-codellama-bugfix/`** - Main model output
- `adapter_model.safetensors` - LoRA adapter weights
- `adapter_config.json` - LoRA configuration
- `tokenizer.json` - Tokenizer files
- `chat_template.jinja` - Conversation template
- `checkpoint-500/` - Training checkpoint at step 500
- `checkpoint-1000/` - Training checkpoint at step 1000
- `README.md` - Model card and documentation
### Training Configuration

- **Base Model**: `codellama/CodeLlama-7b-Instruct-hf`
- **Method**: QLoRA with 4-bit quantization
- **LoRA Config**: r=64, alpha=16, dropout=0.1
- **Training**: 3 epochs, batch size 64, learning rate 2e-4
- **Hardware**: Optimized for the H200 GPU
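A minimal sketch of this configuration using `peft` and `transformers` (both pinned in `requirements.txt`). Only the values listed above are confirmed by this document; the NF4 quantization type and the causal-LM task type are assumptions, as they are the conventional QLoRA choices.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit quantization for QLoRA, computing in bfloat16 as on the H200
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",  # assumption: NF4 is the usual QLoRA choice
)

# LoRA settings matching the values listed above
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",  # assumption: causal language-model fine-tuning
)
```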
## Evaluation (`evaluate/`)

Contains evaluation scripts and results for assessing model performance.
### Files

**`evaluate_linux_bugfix_model.py`** - Main evaluation script
- Loads the fine-tuned model for inference
- Generates predictions on test data
- Computes BLEU and ROUGE metrics
- Saves results in multiple formats

**`test_samples.jsonl`** - Evaluation dataset
- Test samples for model evaluation
- Stored using Git LFS
### Output Directory (`evaluate/output/`)

**`eval_results.json`** - Detailed evaluation results
- Complete predictions and references
- Stored using Git LFS

**`eval_results.csv`** - Tabular evaluation results
- CSV format for easy analysis
- Stored using Git LFS
### Evaluation Metrics

- **BLEU Score**: Measures n-gram overlap between generated and reference fixes
- **ROUGE Score**: Measures recall-oriented overlap with the reference text
- **Human Evaluation**: Qualitative assessment
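Both automatic metrics boil down to n-gram overlap between a prediction and a reference. A deliberately simplified, pure-Python illustration of the idea (the evaluation script itself is described as using the library implementations from the `evaluate` package):

```python
from collections import Counter

def unigram_f1(prediction: str, reference: str) -> float:
    """ROUGE-1-style F1 over whitespace tokens.

    Simplified illustration only: real BLEU also uses higher-order n-grams
    and a brevity penalty; real ROUGE handles stemming and multiple variants.
    """
    pred, ref = prediction.split(), reference.split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if not overlap:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```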
## Dependencies (`requirements.txt`)

Comprehensive list of Python packages required for the project:

**Core ML Libraries:**
- `transformers==4.53.1` - Hugging Face Transformers
- `torch==2.7.1+cu128` - PyTorch with CUDA support
- `peft==0.16.0` - Parameter-efficient fine-tuning
- `accelerate==1.8.1` - Distributed training
- `bitsandbytes==0.46.1` - Quantization support

**Data Processing:**
- `datasets==3.6.0` - Dataset handling
- `pandas==2.3.1` - Data manipulation
- `numpy==2.3.1` - Numerical computing

**Git Analysis:**
- `pydriller` - Git repository mining
- `gitpython` - Git operations

**Utilities:**
- `tqdm==4.67.1` - Progress bars
- `wandb` - Experiment tracking
- `evaluate==0.4.4` - Evaluation metrics
## Workflow

### 1. Dataset Creation

```bash
cd dataset_builder
python extract_linux_bugfixes.py   # Extract bug-fix data
python format_for_training.py      # Convert format
```

### 2. Model Training

```bash
cd train
python train_codellama_qlora_linux_bugfix.py   # Train with QLoRA
```

### 3. Model Evaluation

```bash
cd evaluate
python evaluate_linux_bugfix_model.py   # Evaluate performance
```
## Key Design Principles

### Modularity
- Each component has a specific responsibility
- Clear separation between data, training, and evaluation
- Easy to modify or extend individual components

### Efficiency
- QLoRA for memory-efficient training
- Parallel processing for dataset creation
- Optimized for modern GPU hardware

### Reproducibility
- Version-controlled dependencies
- Structured data formats
- Comprehensive logging and evaluation

### Scalability
- Configurable parameters for different hardware
- Support for distributed training
- Efficient data handling with Git LFS
## File Naming Conventions

- **Scripts**: Descriptive names with a clear purpose
- **Datasets**: Include size/version information
- **Models**: Include architecture and method
- **Results**: Include timestamp or version
- **Configs**: Use `.json` or `.yaml` format
## Documentation

- **README.md**: Project overview and quick start
- **PROJECT_STRUCTURE.md**: This detailed structure guide
- **Model README**: Generated model cards in output directories
- **Code Comments**: Inline documentation in all scripts

This structure ensures the project is organized, maintainable, and easy to understand for both users and contributors.