Mac committed on
Commit 8046c68 · 1 Parent(s): 15eb8ca

Update README with Hugging Face metadata and full project description

Files changed (1)
  1. README.md +127 -113

README.md CHANGED
@@ -1,182 +1,196 @@
  # CodeLLaMA-Linux-BugFix
 
- A machine learning project that fine-tunes CodeLLaMA-7B-Instruct specifically for Linux kernel bug fixing using QLoRA (Quantized Low-Rank Adaptation). The model learns to generate Git diff patches from buggy C code and commit messages.
 
- ## 🎯 Project Overview
 
- This project addresses the challenging task of automated Linux kernel bug fixing by:
 
- - **Extracting real bug-fix data** from the Linux kernel Git repository
- - **Training a specialized model** using QLoRA for efficient fine-tuning
- - **Generating Git diff patches** that can be applied to fix bugs
- - **Providing evaluation metrics** to assess model performance
 
- ## 🏗️ Architecture
 
- ### Base Model
- - **Model**: `codellama/CodeLLaMA-7b-Instruct-hf` (7 billion parameters)
- - **Fine-tuning Method**: QLoRA with 4-bit quantization
- - **Hardware**: Optimized for H200 GPU with bfloat16 precision
 
- ### Training Configuration
- - **LoRA Config**: r=64, alpha=16, dropout=0.1
- - **Training**: 3 epochs, batch size 64, learning rate 2e-4
- - **Memory Optimization**: Gradient checkpointing, mixed precision training
 
  ## 📊 Dataset
 
- The project creates a specialized dataset from Linux kernel commits:
-
- ### Data Extraction Process
- 1. **Commit Filtering**: Identifies bug-fix commits using keywords:
-    - `fix`, `bug`, `leak`, `null`, `overflow`, `error`, `failure`
-    - `crash`, `panic`, `memory`, `race`, `deadlock`, `corruption`
-    - `security`, `vulnerability`, `exploit`, `buffer`, `stack`
-
- 2. **Code Context Extraction**:
-    - Focuses on C and header files (`.c`, `.h`)
-    - Extracts 10 lines before/after bug location
-    - Captures relevant code context
-
- 3. **Data Format**:
-    ```json
-    {
-      "input": {
-        "original code": "C code snippet with bug",
-        "instruction": "Bug fix instruction from commit message"
-      },
-      "output": {
-        "diff codes": "Git diff showing the fix"
-      }
-    }
-    ```
-
- ### Dataset Statistics
- - **Training Data**: 100K samples (`training_data_100k.jsonl`)
- - **Format**: JSONL (one JSON object per line)
- - **Source**: Linux kernel Git repository
 
  ## 🚀 Quick Start
 
- ### Prerequisites
  ```bash
  pip install -r requirements.txt
  ```
 
- ### 1. Build Dataset
  ```bash
  cd dataset_builder
  python extract_linux_bugfixes.py
  python format_for_training.py
  ```
 
- ### 2. Train Model
  ```bash
  cd train
  python train_codellama_qlora_linux_bugfix.py
  ```
 
- ### 3. Evaluate Model
  ```bash
  cd evaluate
  python evaluate_linux_bugfix_model.py
  ```
 
  ## 📁 Project Structure
 
  ```
  CodeLLaMA-Linux-BugFix/
- ├── dataset_builder/                         # Dataset creation scripts
- │   ├── extract_linux_bugfixes.py            # Main dataset extraction
- │   ├── extract_linux_bugfixes_parallel.py   # Parallelized version
  │   └── format_for_training.py
- ├── dataset/                                 # Generated datasets
  │   ├── training_data_100k.jsonl
  │   └── training_data_prompt_completion.jsonl
- ├── train/                                   # Training scripts and outputs
- │   ├── train_codellama_qlora_linux_bugfix.py  # Main training script
  │   ├── train_codellama_qlora_simple.py
  │   ├── download_codellama_model.py
- │   └── output/                              # Trained model checkpoints
- ├── evaluate/                                # Evaluation scripts and results
- │   ├── evaluate_linux_bugfix_model.py       # Model evaluation
- │   ├── test_samples.jsonl                   # Evaluation dataset
- │   └── output/                              # Evaluation results
- └── requirements.txt                         # Python dependencies
  ```
 
- ## 🔧 Key Features
 
- ### Efficient Training
- - **QLoRA**: Reduces memory requirements by 75% while maintaining performance
- - **4-bit Quantization**: Enables training on consumer hardware
- - **Gradient Checkpointing**: Optimizes memory usage during training
 
- ### Real-world Data
- - **Authentic Bug Fixes**: Extracted from actual Linux kernel development
- - **Contextual Understanding**: Captures relevant code context around bugs
- - **Git Integration**: Outputs proper Git diff format
 
- ### Evaluation
- - **BLEU Score**: Measures translation quality
- - **ROUGE Score**: Evaluates text generation accuracy
- - **Comprehensive Metrics**: JSON and CSV output formats
 
- ## 🎯 Use Cases
 
- The fine-tuned model can assist with:
 
- 1. **Automated Bug Fixing**: Generate patches for common kernel bugs
- 2. **Code Review**: Suggest fixes during development
- 3. **Learning**: Study patterns in Linux kernel bug fixes
- 4. **Research**: Advance automated software repair techniques
 
- ## 📈 Performance
 
- The model is evaluated using:
- - **BLEU Score**: Measures how well generated diffs match reference fixes
- - **ROUGE Score**: Evaluates overlap between predicted and actual fixes
- - **Human Evaluation**: Qualitative assessment of fix quality
 
- ## 🔬 Technical Details
 
- ### Model Architecture
- - **Base**: CodeLLaMA-7B-Instruct with instruction tuning
- - **Adapter**: LoRA layers for efficient fine-tuning
- - **Output**: Generates Git diff format patches
 
- ### Training Process
- 1. **Data Preprocessing**: Extract and clean commit data
- 2. **Tokenization**: Convert to model input format
- 3. **QLoRA Training**: Efficient parameter-efficient fine-tuning
- 4. **Checkpointing**: Save model states for evaluation
 
- ### Memory Optimization
- - **4-bit Quantization**: Reduces model size significantly
- - **Gradient Accumulation**: Enables larger effective batch sizes
- - **Mixed Precision**: Uses bfloat16 for faster training
 
  ## 🤝 Contributing
 
- 1. Fork the repository
- 2. Create a feature branch
- 3. Make your changes
- 4. Add tests if applicable
- 5. Submit a pull request
 
  ## 📄 License
 
- This project is licensed under the MIT License - see the LICENSE file for details.
 
  ## 🙏 Acknowledgments
 
- - **CodeLLaMA Team**: For the base model
- - **Linux Kernel Community**: For the bug-fix data
- - **Hugging Face**: For the transformers library
- - **Microsoft**: For the LoRA technique
 
  ## 📚 References
 
- - [CodeLLaMA: Open Foundation for Code](https://arxiv.org/abs/2308.12950)
- - [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
- - [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)
 
+ ---
+ license: mit
+ tags:
+ - codellama
+ - linux
+ - bugfix
+ - lora
+ - qlora
+ - git-diff
+ base_model: codellama/CodeLLaMA-7b-Instruct-hf
+ model_type: LlamaForCausalLM
+ library_name: peft
+ pipeline_tag: text-generation
+ ---
+
  # CodeLLaMA-Linux-BugFix
 
+ A fine-tuned version of `CodeLLaMA-7B-Instruct`, designed specifically for Linux kernel bug fixing using QLoRA (Quantized Low-Rank Adaptation). The model learns to generate Git diff patches based on buggy C code and commit messages.
+
+ ---
+
+ ## 🎯 Overview
+
+ This project targets automated Linux kernel bug fixing by:
+
+ - **Mining real commit data** from the kernel Git history
+ - **Training a specialized QLoRA model** on diff-style fixes
+ - **Generating Git patches** in response to bug-prone code
+ - **Evaluating results** using BLEU, ROUGE, and human inspection
+
+ ---
+
+ ## 🧠 Model Configuration
+
+ - **Base model**: `CodeLLaMA-7B-Instruct`
+ - **Fine-tuning method**: QLoRA with 4-bit quantization
+ - **Training setup** (see the sketch below):
+   - LoRA r=64, alpha=16, dropout=0.1
+   - Batch size: 64, LR: 2e-4, Epochs: 3
+   - Mixed precision (bfloat16), gradient checkpointing
+ - **Hardware**: Optimized for NVIDIA H200 GPUs
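+
+ A minimal sketch of this configuration using `transformers` + `peft` (the actual `train/train_codellama_qlora_linux_bugfix.py` may differ; the target modules below are an assumption, not taken from the script):
+
+ ```python
+ # Sketch only: QLoRA setup matching the hyperparameters listed above.
+ import torch
+ from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+ from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
+
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,                      # 4-bit quantization (QLoRA)
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 compute
+ )
+
+ model = AutoModelForCausalLM.from_pretrained(
+     "codellama/CodeLLaMA-7b-Instruct-hf",
+     quantization_config=bnb_config,
+     device_map="auto",
+ )
+ model.gradient_checkpointing_enable()
+ model = prepare_model_for_kbit_training(model)
+
+ lora_config = LoraConfig(
+     r=64, lora_alpha=16, lora_dropout=0.1,
+     task_type="CAUSAL_LM",
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
+ )
+ model = get_peft_model(model, lora_config)
+ ```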
 
+ ---
 
  ## 📊 Dataset
 
+ Custom dataset extracted from Linux kernel Git history.
+
+ ### Filtering Criteria
+
+ Bug-fix commits are identified by message keywords such as:
+ `fix`, `bug`, `crash`, `memory`, `null`, `panic`, `overflow`, `race`, `corruption`, etc.
+
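+ As a rough illustration of this filtering step (the real logic lives in `dataset_builder/extract_linux_bugfixes.py` and may differ), candidate commits can be listed from a local kernel clone like this:
+
+ ```python
+ # Sketch only: list candidate bug-fix commits by message keyword.
+ import subprocess
+
+ KEYWORDS = ["fix", "bug", "crash", "memory", "null", "panic",
+             "overflow", "race", "corruption"]
+
+ def candidate_commits(repo_path="linux"):  # path to a kernel checkout (assumed)
+     cmd = ["git", "-C", repo_path, "log", "--no-merges", "-i",
+            "--pretty=format:%H|%s"]
+     for kw in KEYWORDS:          # multiple --grep patterns are OR-ed by git
+         cmd += ["--grep", kw]
+     out = subprocess.run(cmd, capture_output=True, text=True, check=True)
+     for line in out.stdout.splitlines():
+         sha, _, subject = line.partition("|")
+         yield sha, subject
+ ```
+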
+ ### Structure
+
+ - Language: C (`.c`, `.h`)
+ - Context: 10 lines before/after the change
+ - Format:
+
+ ```json
+ {
+   "input": {
+     "original code": "C code snippet with bug",
+     "instruction": "Commit message or fix description"
+   },
+   "output": {
+     "diff codes": "Git diff showing the fix"
+   }
+ }
+ ```
+
+ * **File**: `training_data_100k.jsonl` (100,000 samples)
+
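+ Reading this format is straightforward; a minimal sketch (the field names come from the example above, while the prompt template is an assumption and may differ from the training scripts):
+
+ ```python
+ # Sketch only: iterate the JSONL dataset and build prompt/completion pairs.
+ import json
+
+ def load_pairs(path="dataset/training_data_100k.jsonl"):
+     with open(path, encoding="utf-8") as f:
+         for line in f:
+             ex = json.loads(line)
+             prompt = (
+                 f"### Instruction:\n{ex['input']['instruction']}\n\n"
+                 f"### Buggy code:\n{ex['input']['original code']}\n\n"
+                 "### Fix (git diff):\n"
+             )
+             completion = ex["output"]["diff codes"]
+             yield prompt, completion
+ ```
+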
+ ---
 
  ## 🚀 Quick Start
 
+ ### Install dependencies
+
  ```bash
  pip install -r requirements.txt
  ```
 
+ ### 1. Build the Dataset
+
  ```bash
  cd dataset_builder
  python extract_linux_bugfixes.py
  python format_for_training.py
  ```
 
+ ### 2. Fine-tune the Model
+
  ```bash
  cd train
  python train_codellama_qlora_linux_bugfix.py
  ```
 
+ ### 3. Run Evaluation
+
  ```bash
  cd evaluate
  python evaluate_linux_bugfix_model.py
  ```
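+
+ ### 4. Try the Model (optional)
+
+ A minimal inference sketch, assuming the LoRA adapter was saved under `train/output/` (the adapter path and prompt format are assumptions, not fixed by the scripts):
+
+ ```python
+ # Sketch only: load the base model plus LoRA adapter and generate a diff.
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ from peft import PeftModel
+
+ base = "codellama/CodeLLaMA-7b-Instruct-hf"
+ tokenizer = AutoTokenizer.from_pretrained(base)
+ model = AutoModelForCausalLM.from_pretrained(
+     base, torch_dtype=torch.bfloat16, device_map="auto")
+ model = PeftModel.from_pretrained(model, "train/output")  # assumed adapter dir
+
+ prompt = (
+     "### Instruction:\nFix the possible NULL pointer dereference.\n\n"
+     "### Buggy code:\nkfree(dev->priv->buf);\n\n"
+     "### Fix (git diff):\n"
+ )
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+ out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
+ print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
+                        skip_special_tokens=True))
+ ```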
 
+ ---
+
  ## 📁 Project Structure
 
  ```
  CodeLLaMA-Linux-BugFix/
+ ├── dataset_builder/
+ │   ├── extract_linux_bugfixes.py
+ │   ├── extract_linux_bugfixes_parallel.py
  │   └── format_for_training.py
+ ├── dataset/
  │   ├── training_data_100k.jsonl
  │   └── training_data_prompt_completion.jsonl
+ ├── train/
+ │   ├── train_codellama_qlora_linux_bugfix.py
  │   ├── train_codellama_qlora_simple.py
  │   ├── download_codellama_model.py
+ │   └── output/
+ ├── evaluate/
+ │   ├── evaluate_linux_bugfix_model.py
+ │   ├── test_samples.jsonl
+ │   └── output/
+ └── requirements.txt
  ```
 
+ ---
+
+ ## 🧩 Features
+
+ * 🔧 **Efficient fine-tuning**: QLoRA with 4-bit quantization sharply reduces GPU memory use
+ * 🧠 **Real-world commits**: Trained on fixes from actual Linux kernel development
+ * 💡 **Context-aware**: Captures the code context surrounding each bug
+ * 💻 **Output-ready**: Generates Git-style diffs that can be reviewed and applied
+
+ ---
+
+ ## 📈 Evaluation Metrics
+
+ * **BLEU**: n-gram match between generated and reference diffs
+ * **ROUGE**: content overlap between predicted and actual fixes
+ * **Human evaluation**: qualitative assessment of patch quality
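+
+ The scores are produced by `evaluate/evaluate_linux_bugfix_model.py`; a standalone sketch of the same idea using the Hugging Face `evaluate` library (the repo's script may implement this differently):
+
+ ```python
+ # Sketch only: score generated diffs against reference diffs.
+ import evaluate
+
+ bleu = evaluate.load("bleu")
+ rouge = evaluate.load("rouge")
+
+ predictions = ["-\tkfree(ptr);\n+\tif (ptr)\n+\t\tkfree(ptr);"]  # model output (example)
+ references = ["-\tkfree(ptr);\n+\tif (ptr)\n+\t\tkfree(ptr);"]   # ground-truth diff
+
+ print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
+ print(rouge.compute(predictions=predictions, references=references))
+ ```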
 
+ ---
+
+ ## 🧪 Use Cases
+
+ * Automated kernel bug fixing
+ * Code review assistance
+ * Teaching/debugging kernel code
+ * Research in automated program repair (APR)
+
+ ---
 
+ ## 🔬 Technical Highlights
+
+ ### Memory & Speed Optimizations
+
+ * 4-bit quantization (NF4)
+ * Gradient checkpointing
+ * Mixed precision (bfloat16)
+ * Gradient accumulation
+
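+ A minimal sketch of how these settings are commonly expressed with `transformers` `TrainingArguments` (values taken from this README; the split between per-device batch size and accumulation steps is an assumption):
+
+ ```python
+ # Sketch only: memory/speed-related trainer settings.
+ from transformers import TrainingArguments
+
+ training_args = TrainingArguments(
+     output_dir="train/output",         # assumed checkpoint directory
+     num_train_epochs=3,
+     per_device_train_batch_size=8,     # 8 x 8 accumulation = effective batch 64
+     gradient_accumulation_steps=8,
+     learning_rate=2e-4,
+     bf16=True,                         # mixed precision (bfloat16)
+     gradient_checkpointing=True,
+     logging_steps=50,
+     save_strategy="epoch",
+ )
+ ```
+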
+ ---
 
  ## 🤝 Contributing
 
+ 1. Fork this repo
+ 2. Create a branch
+ 3. Add your feature or fix
+ 4. Submit a PR 🙌
+
+ ---
 
  ## 📄 License
 
+ MIT License – see `LICENSE` file for details.
+
+ ---
 
  ## 🙏 Acknowledgments
 
+ * Meta for CodeLLaMA
+ * Hugging Face for Transformers + PEFT
+ * The Linux kernel community for open access to commit data
+ * Microsoft for introducing LoRA
+
+ ---
 
  ## 📚 References
 
+ * [CodeLLaMA (Meta, 2023)](https://arxiv.org/abs/2308.12950)
+ * [QLoRA (Dettmers et al., 2023)](https://arxiv.org/abs/2305.14314)
+ * [LoRA (Hu et al., 2021)](https://arxiv.org/abs/2106.09685)