
Project Structure

This document provides a detailed overview of the CodeLLaMA-Linux-BugFix project structure, explaining the purpose and organization of each component.

πŸ“ Root Directory

CodeLLaMA-Linux-BugFix/
β”œβ”€β”€ dataset_builder/          # Dataset creation and processing
β”œβ”€β”€ dataset/                  # Generated datasets and data files
β”œβ”€β”€ train/                    # Model training scripts and outputs
β”œβ”€β”€ evaluate/                 # Model evaluation and testing
β”œβ”€β”€ requirements.txt          # Python dependencies
β”œβ”€β”€ README.md                 # Project documentation
└── PROJECT_STRUCTURE.md      # This file

πŸ”§ Dataset Builder (dataset_builder/)

The dataset builder extracts bug-fix data from the Linux kernel Git repository and converts it into a training-ready format.

Files:

  • extract_linux_bugfixes.py - Main dataset extraction script

    • Uses PyDriller to analyze Linux kernel Git history
    • Filters commits using bug-fix keywords
    • Extracts code context around bug locations
    • Generates structured dataset entries
  • extract_linux_bugfixes_parallel.py - Parallelized version of dataset builder

    • Multi-process implementation for faster processing
    • Configurable worker count (default: 16 workers)
    • Test mode with limited commit processing
  • format_for_training.py - Format conversion script

    • Converts structured data to prompt-completion pairs
    • Formats input for supervised fine-tuning
    • Creates training-ready JSONL format

Key Features:

  • Commit Filtering: Identifies bug-fix commits using 17 keywords
  • Code Context: Extracts 10 lines before/after bug location
  • File Filtering: Focuses on C and header files (.c, .h)
  • Diff Extraction: Captures Git diff patches for fixes
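
The commit filtering and file filtering described above can be sketched roughly as follows (a minimal sketch assuming PyDriller 2.x, a local kernel clone at ./linux, and an illustrative keyword subset rather than the full list of 17):

from pydriller import Repository

BUGFIX_KEYWORDS = ["fix", "bug", "leak", "null", "overflow", "race"]  # illustrative subset of the 17 keywords

def is_bugfix(message: str) -> bool:
    # A commit counts as a bug fix if any keyword appears in its message
    msg = message.lower()
    return any(keyword in msg for keyword in BUGFIX_KEYWORDS)

for commit in Repository("./linux").traverse_commits():
    if not is_bugfix(commit.msg):
        continue
    for mod in commit.modified_files:
        # Keep only C sources and headers, mirroring the file filtering above
        if mod.filename.endswith((".c", ".h")) and mod.diff:
            print(commit.hash, mod.filename)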

πŸ“Š Dataset (dataset/)

Contains the generated datasets used for training and evaluation.

Files:

  • training_data_100k.jsonl - Main training dataset

    • 100,000 bug-fix samples
    • Structured format with input/output pairs
    • Stored using Git LFS for large file handling
  • training_data_prompt_completion.jsonl - Converted training format

    • Prompt-completion pairs for supervised learning
    • Optimized for transformer model training
    • Stored using Git LFS

Data Format:

{
  "input": {
    "original code": "C code snippet with bug",
    "instruction": "Bug fix instruction from commit message"
  },
  "output": {
    "diff codes": "Git diff showing the fix"
  }
}
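
A record in this format can be flattened into a prompt-completion pair along the following lines (a sketch; the exact prompt template used by format_for_training.py is an assumption):

import json

def to_prompt_completion(record: dict) -> dict:
    # Hypothetical prompt template; the real script may word this differently
    prompt = (
        "### Instruction:\n" + record["input"]["instruction"] + "\n\n"
        "### Original code:\n" + record["input"]["original code"] + "\n\n"
        "### Fixed diff:\n"
    )
    return {"prompt": prompt, "completion": record["output"]["diff codes"]}

with open("training_data_100k.jsonl") as src, \
     open("training_data_prompt_completion.jsonl", "w") as dst:
    for line in src:
        dst.write(json.dumps(to_prompt_completion(json.loads(line))) + "\n")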

πŸš€ Training (train/)

Contains all training-related scripts, configurations, and model outputs.

Files:

  • train_codellama_qlora_linux_bugfix.py - Main training script

    • QLoRA fine-tuning implementation
    • Optimized for H200 GPU with bfloat16
    • Includes Weights & Biases integration
    • Comprehensive training configuration
  • train_codellama_qlora_simple.py - Alternative training script

    • Simplified QLoRA implementation
    • Basic training setup without advanced features
    • Good for testing and development
  • download_codellama_model.py - Model download utility

    • Downloads base CodeLLaMA-7B-Instruct model
    • Ensures model availability before training
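
A download utility of this kind can be as small as the following sketch (it assumes huggingface_hub and a hypothetical local target directory; the actual script may rely on transformers' from_pretrained caching instead):

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="codellama/CodeLlama-7b-Instruct-hf",   # base model used for fine-tuning
    local_dir="models/CodeLlama-7b-Instruct-hf",    # hypothetical local path
)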

Output Directory (train/output/):

  • qlora-codellama-bugfix/ - Main model output
    • adapter_model.safetensors - LoRA adapter weights
    • adapter_config.json - LoRA configuration
    • tokenizer.json - Tokenizer files
    • chat_template.jinja - Conversation template
    • checkpoint-500/ - Training checkpoint at step 500
    • checkpoint-1000/ - Training checkpoint at step 1000
    • README.md - Model card and documentation

Training Configuration:

  • Base Model: codellama/CodeLlama-7b-Instruct-hf
  • Method: QLoRA with 4-bit quantization
  • LoRA Config: r=64, alpha=16, dropout=0.1
  • Training: 3 epochs, batch size 64, learning rate 2e-4
  • Hardware: Optimized for H200 GPU
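
In code, this configuration corresponds roughly to the following sketch (hyperparameters taken from the list above; target modules and other details are assumptions, not the exact contents of the training script):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization with bfloat16 compute (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapter: r=64, alpha=16, dropout=0.1
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed target modules
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="output/qlora-codellama-bugfix",
    num_train_epochs=3,
    per_device_train_batch_size=64,
    learning_rate=2e-4,
    bf16=True,
)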

πŸ“ˆ Evaluation (evaluate/)

Contains evaluation scripts and results for assessing model performance.

Files:

  • evaluate_linux_bugfix_model.py - Main evaluation script

    • Loads fine-tuned model for inference
    • Generates predictions on test data
    • Computes BLEU and ROUGE metrics
    • Saves results in multiple formats
  • test_samples.jsonl - Evaluation dataset

    • Test samples for model evaluation
    • Stored using Git LFS
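
Inference with the fine-tuned adapter follows the usual PEFT pattern; a minimal sketch (paths and the prompt template are assumptions):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "train/output/qlora-codellama-bugfix")
tokenizer = AutoTokenizer.from_pretrained("train/output/qlora-codellama-bugfix")

# Hypothetical prompt built the same way as the training data
prompt = (
    "### Instruction:\nFix the NULL pointer dereference when name is NULL\n\n"
    "### Original code:\nlen = strlen(name);\n\n"
    "### Fixed diff:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))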

Output Directory (evaluate/output/):

  • eval_results.json - Detailed evaluation results

    • Complete predictions and references
    • Stored using Git LFS
  • eval_results.csv - Tabular evaluation results

    • CSV format for easy analysis
    • Stored using Git LFS

Evaluation Metrics:

  • BLEU Score: Measures n-gram overlap between generated and reference diffs
  • ROUGE Score: Measures recall-oriented overlap between generated and reference text
  • Human Evaluation: Qualitative assessment
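
Both scores can be computed with the evaluate library pinned in requirements.txt; a minimal sketch with dummy strings standing in for real predictions and references:

import evaluate

predictions = ["-\treturn p->x;\n+\treturn p ? p->x : 0;"]   # model output (dummy example)
references  = ["-\treturn p->x;\n+\treturn p ? p->x : 0;"]   # ground-truth diff (dummy example)

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(rouge.compute(predictions=predictions, references=references))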

πŸ”§ Dependencies (requirements.txt)

Comprehensive list of Python packages required for the project:

Core ML Libraries:

  • transformers==4.53.1 - Hugging Face transformers
  • torch==2.7.1+cu128 - PyTorch with CUDA support
  • peft==0.16.0 - Parameter-efficient fine-tuning
  • accelerate==1.8.1 - Distributed training
  • bitsandbytes==0.46.1 - Quantization support

Data Processing:

  • datasets==3.6.0 - Dataset handling
  • pandas==2.3.1 - Data manipulation
  • numpy==2.3.1 - Numerical computing

Git Analysis:

  • pydriller - Git repository mining
  • gitpython - Git operations

Utilities:

  • tqdm==4.67.1 - Progress bars
  • wandb - Experiment tracking
  • evaluate==0.4.4 - Evaluation metrics

πŸ”„ Workflow

1. Dataset Creation

cd dataset_builder
python extract_linux_bugfixes.py   # Extract bug-fix data
python format_for_training.py      # Convert format

2. Model Training

cd train
python train_codellama_qlora_linux_bugfix.py   # Train with QLoRA

3. Model Evaluation

cd evaluate
python evaluate_linux_bugfix_model.py   # Evaluate performance

🎯 Key Design Principles

Modularity

  • Each component has a specific responsibility
  • Clear separation between data, training, and evaluation
  • Easy to modify or extend individual components

Efficiency

  • QLoRA for memory-efficient training
  • Parallel processing for dataset creation
  • Optimized for modern GPU hardware

Reproducibility

  • Version-controlled dependencies
  • Structured data formats
  • Comprehensive logging and evaluation

Scalability

  • Configurable parameters for different hardware
  • Support for distributed training
  • Efficient data handling with Git LFS

πŸ” File Naming Conventions

  • Scripts: Descriptive names with clear purpose
  • Datasets: Include size/version information
  • Models: Include architecture and method
  • Results: Include timestamp or version
  • Configs: Use .json or .yaml format

πŸ“ Documentation

  • README.md: Project overview and quick start
  • PROJECT_STRUCTURE.md: This detailed structure guide
  • Model README: Generated model cards in output directories
  • Code Comments: Inline documentation in all scripts

This structure ensures the project is organized, maintainable, and easy to understand for both users and contributors.