Mac committed on
Commit 15eb8ca Β· 1 Parent(s): 4babdc8

Refactor filenames and paths for clarity and structure

PROJECT_STRUCTURE.md CHANGED
@@ -1,228 +1,222 @@
- # Linux Kernel Anti-Pattern Detector - Project Structure
 
- ## Overview
 
- This project is organized into a clear, maintainable structure that separates concerns and makes it easy to find, modify, and extend functionality.
-
- ## Directory Structure
 
 ```
- Linux Kernel Anti-Pattern Detector/
- β”œβ”€β”€ πŸ“ data/                              # Analysis data and results
- β”‚   β”œβ”€β”€ results.json                      # Main analysis results
- β”‚   β”œβ”€β”€ concurrency_analysis_report.json
- β”‚   └── kernel_analysis.log               # Analysis logs
- β”‚
- β”œβ”€β”€ πŸ“ docs/                              # Documentation
- β”‚   β”œβ”€β”€ kernel-analysis-guide.md          # Kernel analysis documentation
- β”‚   └── [additional documentation]
- β”‚
- β”œβ”€β”€ πŸ“ examples/                          # Example code and usage
- β”‚   └── [example files]
- β”‚
- β”œβ”€β”€ πŸ“ reports/                           # Generated analysis reports
- β”‚   β”œβ”€β”€ Linux_Kernel_Anti_Pattern_Analysis_Report.md
- β”‚   β”œβ”€β”€ Executive_Summary.md
- β”‚   └── πŸ“ concurrency/                   # Concurrency-specific reports
- β”‚       └── Concurrency_Analysis_Report.md
- β”‚
- β”œβ”€β”€ πŸ“ scripts/                           # Analysis and utility scripts
- β”‚   β”œβ”€β”€ πŸ“ analysis/                      # Core analysis scripts
- β”‚   β”‚   β”œβ”€β”€ concurrency_analyzer.py       # Concurrency issue analyzer
- β”‚   β”‚   └── analyze_kernel_structure.py
- β”‚   β”œβ”€β”€ πŸ“ reporting/                     # Report generation scripts
- β”‚   β”‚   └── view_results.py               # Results viewer
- β”‚   └── πŸ“ utils/                         # Utility scripts
- β”‚       └── quick_summary.py              # Quick summary generator
- β”‚
- β”œβ”€β”€ πŸ“ src/                               # Source code (main project)
- β”‚   β”œβ”€β”€ __init__.py
- β”‚   β”œβ”€β”€ πŸ“ detectors/                     # Anti-pattern detection modules
- β”‚   β”œβ”€β”€ πŸ“ rules/                         # Detection rules and patterns
- β”‚   └── πŸ“ utils/                         # Utility functions
- β”‚
- β”œβ”€β”€ πŸ“ tests/                             # Test files
- β”‚   └── [test files]
- β”‚
- β”œβ”€β”€ πŸ“ tools/                             # Analysis tools and detectors
- β”‚   β”œβ”€β”€ πŸ“ detectors/                     # Main detection tools
- β”‚   β”‚   β”œβ”€β”€ detector.py                   # Main anti-pattern detector
- β”‚   β”‚   └── config.yaml                   # Detection configuration
- β”‚   β”œβ”€β”€ πŸ“ visualizers/                   # Data visualization tools
- β”‚   └── πŸ“ exporters/                     # Data export tools
- β”‚
- β”œβ”€β”€ πŸ“ linux/                             # Linux kernel source (cloned)
- β”‚   └── [kernel source files]
- β”‚
- β”œβ”€β”€ πŸ“„ README.md                          # Main project documentation
- β”œβ”€β”€ πŸ“„ requirements.txt                   # Main project dependencies
- β”œβ”€β”€ πŸ“„ requirements-kernel-analysis.txt
- β”œβ”€β”€ πŸ“„ requirements-simple.txt
- β”œβ”€β”€ πŸ“„ .gitignore                         # Git ignore rules
- └── πŸ“„ PROJECT_STRUCTURE.md               # This file
 ```
 
- ## Directory Descriptions
-
- ### πŸ“ data/
- Contains all analysis results, logs, and generated data files.
- - **results.json**: Complete analysis results from the main detector
- - **concurrency_analysis_report.json**: Detailed concurrency analysis
- - **kernel_analysis.log**: Analysis execution logs
-
- ### πŸ“ docs/
- Project documentation and guides.
- - **kernel-analysis-guide.md**: Comprehensive guide for kernel analysis
- - Additional documentation for specific features
-
- ### πŸ“ examples/
- Example code, usage patterns, and sample data.
- - Example kernel modules for testing
- - Sample configuration files
- - Usage examples
-
- ### πŸ“ reports/
- Generated analysis reports in various formats.
- - **Linux_Kernel_Anti_Pattern_Analysis_Report.md**: Complete technical report
- - **Executive_Summary.md**: High-level summary for stakeholders
- - **concurrency/**: Specialized reports for specific issue types
-
- ### πŸ“ scripts/
- Analysis and utility scripts organized by function.
-
- #### πŸ“ analysis/
- Core analysis scripts for different types of anti-patterns.
- - **concurrency_analyzer.py**: Specialized concurrency issue analysis
- - **analyze_kernel_structure.py**: Kernel structure analysis
-
- #### πŸ“ reporting/
- Scripts for generating and viewing reports.
- - **view_results.py**: Interactive results viewer and reporter
-
- #### πŸ“ utils/
- Utility scripts for common tasks.
- - **quick_summary.py**: Quick summary generation
-
- ### πŸ“ src/
- Main project source code (core framework).
- - **detectors/**: Anti-pattern detection modules
- - **rules/**: Detection rules and pattern definitions
- - **utils/**: Utility functions and helpers
-
- ### πŸ“ tests/
- Test files and test data.
- - Unit tests for detection modules
- - Integration tests
- - Test data and fixtures
-
- ### πŸ“ tools/
- Analysis tools and detectors.
-
- #### πŸ“ detectors/
- Main detection tools and configurations.
- - **detector.py**: Primary anti-pattern detection engine
- - **config.yaml**: Detection configuration and rules
-
- #### πŸ“ visualizers/
- Data visualization and charting tools.
- - Interactive dashboards
- - Chart generators
- - Data plotting utilities
-
- #### πŸ“ exporters/
- Data export and format conversion tools.
- - JSON to other formats
- - Report generation
- - Data transformation
-
- ### πŸ“ linux/
- Cloned Linux kernel source code for analysis.
- - Complete kernel source tree
- - Used for code snippet extraction
- - Reference for pattern validation
-
- ## File Descriptions
-
- ### Core Files
- - **README.md**: Main project documentation and getting started guide
- - **requirements.txt**: Main project Python dependencies
- - **requirements-kernel-analysis.txt**: Kernel analysis specific dependencies
- - **requirements-simple.txt**: Simplified dependencies for basic usage
- - **.gitignore**: Git ignore patterns for the project
-
- ### Configuration Files
- - **tools/detectors/config.yaml**: Main detection configuration
- - **tools/detectors/detector.py**: Primary detection engine
-
- ## Usage Patterns
-
- ### Running Analysis
- ```bash
- # Main analysis
- python tools/detectors/detector.py --clone --output data/results.json
-
- # Concurrency analysis
- python scripts/analysis/concurrency_analyzer.py
-
- # View results
- python scripts/reporting/view_results.py data/results.json
 ```
 
- ### Generating Reports
 ```bash
- # Quick summary
- python scripts/utils/quick_summary.py
-
- # Interactive viewer
- python scripts/reporting/view_results.py --interactive
 ```
 
- ### Development
 ```bash
- # Install dependencies
- pip install -r requirements.txt
- pip install -r requirements-kernel-analysis.txt
-
- # Run tests
- python -m pytest tests/
 
- # Development setup
- conda activate linux-kernel-anti-pattern-detector
 ```
 
- ## Best Practices
 
- ### Adding New Features
- 1. **Analysis scripts**: Add to `scripts/analysis/`
- 2. **Reporting tools**: Add to `scripts/reporting/`
- 3. **Utilities**: Add to `scripts/utils/`
- 4. **Core detection**: Add to `src/detectors/`
- 5. **Configuration**: Update `tools/detectors/config.yaml`
 
- ### File Naming Conventions
- - **Python files**: snake_case (e.g., `concurrency_analyzer.py`)
- - **Configuration files**: kebab-case (e.g., `kernel-analysis-guide.md`)
- - **Reports**: Pascal_Case (e.g., `Concurrency_Analysis_Report.md`)
 
- ### Data Management
- - **Raw data**: Store in `data/`
- - **Processed results**: Store in `data/`
- - **Reports**: Generate in `reports/`
- - **Logs**: Store in `data/`
 
- ## Maintenance
 
- ### Regular Tasks
- 1. **Update dependencies**: Review and update requirements files
- 2. **Clean data**: Remove old analysis results periodically
- 3. **Update kernel**: Refresh the Linux kernel source
- 4. **Backup reports**: Archive important analysis reports
 
- ### Version Control
- - **Track**: Source code, configuration, documentation
- - **Ignore**: Analysis results, logs, kernel source (large files)
- - **Archive**: Important reports and findings
 
- ---
 
- *This structure is designed to be scalable, maintainable, and easy to navigate. Each directory has a clear purpose and the organization supports both development and research workflows.*
+ # Project Structure
 
+ This document provides a detailed overview of the CodeLLaMA-Linux-BugFix project structure, explaining the purpose and organization of each component.
 
+ ## πŸ“ Root Directory
 
 ```
+ CodeLLaMA-Linux-BugFix/
+ β”œβ”€β”€ dataset_builder/       # Dataset creation and processing
+ β”œβ”€β”€ dataset/               # Generated datasets and data files
+ β”œβ”€β”€ train/                 # Model training scripts and outputs
+ β”œβ”€β”€ evaluate/              # Model evaluation and testing
+ β”œβ”€β”€ requirements.txt       # Python dependencies
+ β”œβ”€β”€ README.md              # Project documentation
+ └── PROJECT_STRUCTURE.md   # This file
 ```
 
+ ## πŸ”§ Dataset Builder (`dataset_builder/`)
+
+ The dataset builder extracts bug-fix data from the Linux kernel Git repository and converts it into a training-ready format.
+
+ ### Files:
+ - **`extract_linux_bugfixes.py`** - Main dataset extraction script
+   - Uses PyDriller to analyze Linux kernel Git history
+   - Filters commits using bug-fix keywords
+   - Extracts code context around bug locations
+   - Generates structured dataset entries
+
+ - **`extract_linux_bugfixes_parallel.py`** - Parallelized version of the dataset builder
+   - Multi-process implementation for faster processing
+   - Configurable worker count (default: 16 workers)
+   - Test mode with limited commit processing
+
+ - **`format_for_training.py`** - Format conversion script
+   - Converts structured data to prompt-completion pairs
+   - Formats input for supervised fine-tuning
+   - Creates training-ready JSONL format
+
+ ### Key Features:
+ - **Commit Filtering**: Identifies bug-fix commits using 17 keywords (see the sketch below)
+ - **Code Context**: Extracts 10 lines before/after the bug location
+ - **File Filtering**: Focuses on C source and header files (`.c`, `.h`)
+ - **Diff Extraction**: Captures Git diff patches for fixes
+
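+ A minimal sketch of the extraction loop, assuming PyDriller's `Repository` API; the keyword set below is an illustrative subset of the 17 keywords, and the helper `is_bugfix` is hypothetical:
+
+ ```python
+ # Sketch: mine bug-fix commits from a local kernel checkout with PyDriller.
+ # BUG_KEYWORDS is an illustrative subset; the real script uses 17 keywords.
+ import json
+ from pydriller import Repository
+
+ BUG_KEYWORDS = {"fix", "bug", "leak", "null", "overflow", "race", "deadlock"}
+
+ def is_bugfix(msg: str) -> bool:
+     msg = msg.lower()
+     return any(kw in msg for kw in BUG_KEYWORDS)
+
+ with open("linux_bugfix_dataset.jsonl", "w") as out:
+     for commit in Repository("path/to/linux").traverse_commits():
+         if not is_bugfix(commit.msg):
+             continue
+         for mod in commit.modified_files:
+             # Keep only C source and header files, as the builder does.
+             if mod.filename.endswith((".c", ".h")) and mod.diff:
+                 entry = {
+                     "input": {
+                         # The real script trims this to ~10 lines of context
+                         # around the changed hunk instead of the whole file.
+                         "original code": mod.source_code_before or "",
+                         "instruction": commit.msg.splitlines()[0],
+                     },
+                     "output": {"diff codes": mod.diff},
+                 }
+                 out.write(json.dumps(entry) + "\n")
+ ```
+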
+ ## πŸ“Š Dataset (`dataset/`)
+
+ Contains the generated datasets used for training and evaluation.
+
+ ### Files:
+ - **`training_data_100k.jsonl`** - Main training dataset
+   - 100,000 bug-fix samples
+   - Structured format with input/output pairs
+   - Stored using Git LFS for large-file handling
+
+ - **`training_data_prompt_completion.jsonl`** - Converted training format
+   - Prompt-completion pairs for supervised learning
+   - Optimized for transformer model training
+   - Stored using Git LFS
+
+ ### Data Format:
+ ```json
+ {
+   "input": {
+     "original code": "C code snippet with bug",
+     "instruction": "Bug fix instruction from commit message"
+   },
+   "output": {
+     "diff codes": "Git diff showing the fix"
+   }
+ }
 ```
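+
+ A minimal sketch of the conversion that `format_for_training.py` performs; the prompt wording below is an assumption (the actual template lives in the script's `format_prompt`):
+
+ ```python
+ # Sketch: convert structured entries into prompt-completion pairs (JSONL).
+ # The prompt wording is an assumption; see format_prompt() in the script.
+ import json
+
+ with open("../dataset/training_data_100k.jsonl") as src, \
+      open("../dataset/training_data_prompt_completion.jsonl", "w") as dst:
+     for line in src:
+         entry = json.loads(line)
+         prompt = (
+             "### Buggy code:\n" + entry["input"]["original code"] + "\n"
+             "### Instruction:\n" + entry["input"]["instruction"] + "\n"
+             "### Fixed diff:\n"
+         )
+         pair = {"prompt": prompt, "completion": entry["output"]["diff codes"]}
+         dst.write(json.dumps(pair) + "\n")
+ ```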
 
+ ## πŸš€ Training (`train/`)
+
+ Contains all training-related scripts, configurations, and model outputs.
+
+ ### Files:
+ - **`train_codellama_qlora_linux_bugfix.py`** - Main training script
+   - QLoRA fine-tuning implementation
+   - Optimized for an H200 GPU with bfloat16
+   - Includes Weights & Biases integration
+   - Comprehensive training configuration
+
+ - **`train_codellama_qlora_simple.py`** - Alternative training script
+   - Simplified QLoRA implementation
+   - Basic training setup without advanced features
+   - Good for testing and development
+
+ - **`download_codellama_model.py`** - Model download utility
+   - Downloads the base CodeLLaMA-7B-Instruct model
+   - Ensures model availability before training
+
+ ### Output Directory (`train/output/`):
+ - **`qlora-codellama-bugfix/`** - Main model output
+   - **`adapter_model.safetensors`** - LoRA adapter weights
+   - **`adapter_config.json`** - LoRA configuration
+   - **`tokenizer.json`** - Tokenizer files
+   - **`chat_template.jinja`** - Conversation template
+   - **`checkpoint-500/`** - Training checkpoint at step 500
+   - **`checkpoint-1000/`** - Training checkpoint at step 1000
+   - **`README.md`** - Model card and documentation
+
+ ### Training Configuration:
+ - **Base Model**: `codellama/CodeLlama-7b-Instruct-hf`
+ - **Method**: QLoRA with 4-bit quantization
+ - **LoRA Config**: r=64, alpha=16, dropout=0.1
+ - **Training**: 3 epochs, batch size 64, learning rate 2e-4
+ - **Hardware**: Optimized for an H200 GPU
+
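+ In `peft`/`bitsandbytes` terms, the configuration above corresponds roughly to the sketch below; `target_modules` and the NF4 quantization type are assumptions, not values read from the training script:
+
+ ```python
+ # Sketch: QLoRA setup matching the documented hyperparameters.
+ import torch
+ from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+ from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
+
+ bnb = BitsAndBytesConfig(
+     load_in_4bit=True,                      # 4-bit quantization
+     bnb_4bit_quant_type="nf4",              # assumption: NF4 quant type
+     bnb_4bit_compute_dtype=torch.bfloat16,  # bfloat16 compute, as documented
+ )
+ model = AutoModelForCausalLM.from_pretrained(
+     "codellama/CodeLlama-7b-Instruct-hf",
+     quantization_config=bnb,
+     device_map="auto",
+ )
+ model = prepare_model_for_kbit_training(model)
+ lora = LoraConfig(
+     r=64, lora_alpha=16, lora_dropout=0.1,  # documented LoRA config
+     target_modules=["q_proj", "v_proj"],    # assumption: attention projections
+     task_type="CAUSAL_LM",
+ )
+ model = get_peft_model(model, lora)
+ ```
+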
 
+ ## πŸ“ˆ Evaluation (`evaluate/`)
+
+ Contains evaluation scripts and results for assessing model performance.
+
+ ### Files:
+ - **`evaluate_linux_bugfix_model.py`** - Main evaluation script
+   - Loads the fine-tuned model for inference
+   - Generates predictions on test data
+   - Computes BLEU and ROUGE metrics
+   - Saves results in multiple formats
+
+ - **`test_samples.jsonl`** - Evaluation dataset
+   - Test samples for model evaluation
+   - Stored using Git LFS
+
+ ### Output Directory (`evaluate/output/`):
+ - **`eval_results.json`** - Detailed evaluation results
+   - Complete predictions and references
+   - Stored using Git LFS
+
+ - **`eval_results.csv`** - Tabular evaluation results
+   - CSV format for easy analysis
+   - Stored using Git LFS
+
+ ### Evaluation Metrics:
+ - **BLEU Score**: Measures n-gram overlap between generated and reference patches
+ - **ROUGE Score**: Measures recall-oriented overlap with the reference fix
+ - **Human Evaluation**: Qualitative assessment of fix quality
+
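+ Both scores can be computed with the Hugging Face `evaluate` library; a minimal sketch with placeholder strings:
+
+ ```python
+ # Sketch: score generated diffs against reference fixes.
+ import evaluate
+
+ predictions = ["-\tkfree(ptr);\n+\tif (ptr)\n+\t\tkfree(ptr);"]  # placeholder
+ references = ["-\tkfree(ptr);\n+\tif (ptr)\n+\t\tkfree(ptr);"]   # placeholder
+
+ bleu = evaluate.load("bleu").compute(
+     predictions=predictions, references=[[r] for r in references]
+ )
+ rouge = evaluate.load("rouge").compute(
+     predictions=predictions, references=references
+ )
+ print(bleu["bleu"], rouge["rougeL"])
+ ```
+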
+ ## πŸ”§ Dependencies (`requirements.txt`)
+
+ The Python packages required for the project:
+
+ ### Core ML Libraries:
+ - `transformers==4.53.1` - Hugging Face Transformers
+ - `torch==2.7.1+cu128` - PyTorch with CUDA support
+ - `peft==0.16.0` - Parameter-efficient fine-tuning
+ - `accelerate==1.8.1` - Distributed training
+ - `bitsandbytes==0.46.1` - Quantization support
+
+ ### Data Processing:
+ - `datasets==3.6.0` - Dataset handling
+ - `pandas==2.3.1` - Data manipulation
+ - `numpy==2.3.1` - Numerical computing
+
+ ### Git Analysis:
+ - `pydriller` - Git repository mining
+ - `gitpython` - Git operations
+
+ ### Utilities:
+ - `tqdm==4.67.1` - Progress bars
+ - `wandb` - Experiment tracking
+ - `evaluate==0.4.4` - Evaluation metrics
+
+ ## πŸ”„ Workflow
+
+ ### 1. Dataset Creation
 ```bash
+ cd dataset_builder
+ python extract_linux_bugfixes.py   # Extract bug-fix data
+ python format_for_training.py      # Convert format
 ```
 
+ ### 2. Model Training
 ```bash
+ cd train
+ python train_codellama_qlora_linux_bugfix.py   # Train with QLoRA
+ ```
 
+ ### 3. Model Evaluation
+ ```bash
+ cd evaluate
+ python evaluate_linux_bugfix_model.py   # Evaluate performance
 ```
 
+ ## 🎯 Key Design Principles
+
+ ### Modularity
+ - Each component has a specific responsibility
+ - Clear separation between data, training, and evaluation
+ - Easy to modify or extend individual components
 
+ ### Efficiency
+ - QLoRA for memory-efficient training
+ - Parallel processing for dataset creation
+ - Optimized for modern GPU hardware
 
+ ### Reproducibility
+ - Version-controlled dependencies
+ - Structured data formats
+ - Comprehensive logging and evaluation
 
+ ### Scalability
+ - Configurable parameters for different hardware
+ - Support for distributed training
+ - Efficient data handling with Git LFS
 
+ ## πŸ” File Naming Conventions
 
+ - **Scripts**: Descriptive names with a clear purpose
+ - **Datasets**: Include size/version information
+ - **Models**: Include architecture and method
+ - **Results**: Include a timestamp or version
+ - **Configs**: Use `.json` or `.yaml` format
 
+ ## πŸ“ Documentation
 
+ - **README.md**: Project overview and quick start
+ - **PROJECT_STRUCTURE.md**: This detailed structure guide
+ - **Model README**: Generated model cards in output directories
+ - **Code Comments**: Inline documentation in all scripts
 
+ This structure keeps the project organized, maintainable, and easy to understand for both users and contributors.
README.md CHANGED
@@ -1,294 +1,182 @@
- # Linux Kernel Anti-Pattern Detector
 
- A comprehensive tool for detecting anti-patterns and potential issues in Linux kernel code.
 
- ## 🎯 Overview
 
- This project provides automated static analysis tools to identify common anti-patterns, code smells, and potential issues in Linux kernel source code. The analysis covers 7 major categories including memory management, concurrency, security vulnerabilities, and code quality issues.
 
- ## πŸ“Š Recent Analysis Results
 
- **Analysis Date:** June 29, 2025
- **Kernel Version:** Linux 6.16-rc4
- **Files Analyzed:** 35,588
- **Total Issues Found:** 3,122
 
- ### Issue Distribution
- - **πŸ”΄ Critical Security Issues:** 347 (11.1%)
- - **🟑 High Priority Issues:** 2,670 (85.5%)
- - **πŸ”΅ Medium Priority Issues:** 105 (3.4%)
 
- ### Top Categories
- 1. **Concurrency Issues:** 2,314 (74.1%) - Race conditions, deadlocks
- 2. **Memory Management:** 356 (11.4%) - Memory leaks, use-after-free
- 3. **Security Vulnerabilities:** 347 (11.1%) - Buffer overflows, format strings
- 4. **Code Quality:** 92 (2.9%) - Magic numbers, code duplication
 
- ## πŸš€ Quick Start
-
- ### Prerequisites
- - Python 3.8+
- - Git
- - Conda (recommended for environment management)
-
- ### Setup
- ```bash
- # Clone the repository
- git clone https://github.com/Mac-Huang/linux-kernel-anti-pattern-detector.git
- cd linux-kernel-anti-pattern-detector
 
- # Create and activate conda environment
- conda create -n linux-kernel-anti-pattern-detector python=3.10 -y
- conda activate linux-kernel-anti-pattern-detector
-
- # Install dependencies
- pip install -r requirements.txt
- pip install -r requirements-kernel-analysis.txt
- ```
-
- ### Run Analysis
- ```bash
- # Full kernel analysis (clones kernel if needed)
- python tools/detectors/detector.py --clone --output data/results.json
 
- # Concurrency-specific analysis
- python scripts/analysis/concurrency_analyzer.py
 
- # View results interactively
- python scripts/reporting/view_results.py --interactive
- ```
 
- ## πŸ—ƒοΈ Dataset Building Pipeline
 
- ### βœ… **Successfully Implemented**
 
- The project includes a robust dataset building pipeline for training code intelligence models on Linux kernel bug fixes.
 
- #### **Dataset Format**
- ```json
- {
-   "input": {
-     "original code": "code before the patch (extracted from a known bug-fix)",
-     "instruction": "the commit message describing what this fix is"
-   },
-   "output": {
-     "diff codes": "the unified diff patch for the bug fix"
-   }
- }
 ```
 
- #### **Features**
- - **πŸ” Intelligent Bug Detection:** Uses keyword-based filtering to identify bug-fix commits
- - **πŸ“ Focused Code Extraction:** Extracts relevant code context around bug fixes
- - **πŸ”§ Diff Processing:** Parses and formats unified diff patches
- - **⚑ Parallel Processing:** Multi-threaded processing for large repositories
- - **πŸ“Š Quality Filtering:** Only includes valid C source file modifications
-
- #### **Usage**
 ```bash
- # Activate environment
- conda activate detector
-
- # Build test dataset (small sample)
 cd dataset_builder
- python build_dataset_demo.py
-
- # Build full dataset (entire repository)
- # Edit TEST_MODE = False in build_dataset_demo.py
- python build_dataset_demo.py
 ```
 
- #### **Output**
- - **File:** `dataset_builder/output/linux_bugfix_dataset.jsonl`
- - **Format:** JSONL (one JSON object per line)
- - **Content:** Bug-fix commits with original code, commit messages, and diff patches
 
- #### **Keywords Detected**
- - Memory issues: `leak`, `null`, `overflow`, `memory`
- - Security: `security`, `vulnerability`, `exploit`, `buffer`
- - Concurrency: `race`, `deadlock`, `lock`
- - General bugs: `fix`, `bug`, `error`, `failure`, `crash`
 
 ## πŸ“ Project Structure
 
 ```
- β”œβ”€β”€ πŸ“ data/                    # Analysis results and logs
- β”œβ”€β”€ πŸ“ dataset_builder/         # Dataset building pipeline
- β”‚   β”œβ”€β”€ build_dataset.py        # Main dataset builder
- β”‚   β”œβ”€β”€ build_dataset_demo.py   # Test dataset builder
- β”‚   └── output/                 # Generated datasets
- β”œβ”€β”€ πŸ“ docs/                    # Documentation
- β”œβ”€β”€ πŸ“ reports/                 # Generated reports
- β”œβ”€β”€ πŸ“ scripts/                 # Analysis and utility scripts
- β”œβ”€β”€ πŸ“ src/                     # Core source code
- β”œβ”€β”€ πŸ“ tests/                   # Test files
- β”œβ”€β”€ πŸ“ tools/                   # Detection tools
- └── πŸ“ linux/                   # Kernel source (cloned)
 ```
 
- See [PROJECT_STRUCTURE.md](PROJECT_STRUCTURE.md) for detailed structure information.
-
- ## πŸ“‹ Available Reports
-
- ### πŸ“„ Main Reports
- - **[Complete Analysis Report](reports/Linux_Kernel_Anti_Pattern_Analysis_Report.md)** - Full technical analysis
- - **[Executive Summary](reports/Executive_Summary.md)** - High-level overview for stakeholders
-
- ### πŸ”’ Specialized Reports
- - **[Concurrency Analysis](reports/concurrency/Concurrency_Analysis_Report.md)** - Detailed concurrency issues (2,314 issues)
 
- ## πŸ› οΈ Usage Examples
-
- ### Basic Analysis
- ```bash
- # Run complete analysis
- python tools/detectors/detector.py --clone --output data/results.json
-
- # Quick summary
- python scripts/utils/quick_summary.py
-
- # Interactive results viewer
- python scripts/reporting/view_results.py --interactive
- ```
 
- ### Specialized Analysis
- ```bash
- # Concurrency issues only
- python scripts/analysis/concurrency_analyzer.py
 
- # Kernel structure analysis
- python scripts/analysis/analyze_kernel_structure.py
- ```
 
- ### Custom Configuration
- ```bash
- # Use custom config
- python tools/detectors/detector.py --config tools/detectors/config.yaml
 
- # Analyze specific kernel path
- python tools/detectors/detector.py --kernel-path /path/to/kernel
- ```
 
- ## Detection Categories
-
- ### 1. Memory Management
- - Memory leaks (kmalloc without kfree)
- - Use-after-free bugs
- - Double-free issues
- - Null pointer dereferences
-
- ### 2. Concurrency
- - Race conditions
- - Deadlocks
- - Missing locks
- - Double locking
- - Lock ordering violations
-
- ### 3. Security
- - Buffer overflows
- - Format string vulnerabilities
- - Privilege escalation
- - Information disclosure
-
- ### 4. Error Handling
- - Unchecked return values
- - Missing error handling
- - Ignored error codes
- - Wrong error propagation
-
- ### 5. Performance
- - O(nΒ²) algorithms
- - Unnecessary memory allocation
- - Inefficient data structures
- - Cache miss patterns
-
- ### 6. Code Quality
- - Magic numbers
- - Hardcoded values
- - Complex functions
- - Code duplication
-
- ### 7. API Usage
- - Deprecated functions
- - Wrong API usage
- - Missing parameter validation
- - Incorrect flags
-
- ## πŸ“Š Analysis Features
-
- - **Pattern-based Detection:** Regular expression matching with context awareness
- - **Parallel Processing:** Configurable concurrent analysis for performance
- - **Detailed Reporting:** JSON output with file locations and line numbers
- - **Interactive Viewer:** Browse and filter results by category and severity
- - **Code Snippet Extraction:** View actual code around detected issues
- - **Severity Classification:** Critical, High, Medium, Low priority levels
-
- ## πŸ”§ Configuration
-
- The analysis can be customized through `tools/detectors/config.yaml`:
-
- ```yaml
- detection_rules:
-   memory_management:
-     enabled: true
-     severity: "high"
-     patterns:
-       - "kmalloc.*without.*kfree"
-       - "use.*after.*free"
-     directories: ["drivers", "kernel", "mm"]
- ```
 
 ## πŸ“ˆ Performance
 
- - **Analysis Speed:** ~11 minutes for 35,588 files
- - **Memory Usage:** Configurable limits (default: 1GB)
- - **Parallel Processing:** Up to 4 concurrent analyzers
- - **File Filtering:** Excludes generated files and build artifacts
 
- ## 🀝 Contributing
-
- 1. **Fork the repository**
- 2. **Create a feature branch** (`git checkout -b feature/amazing-feature`)
- 3. **Add your changes** following the project structure
- 4. **Test your changes** (`python -m pytest tests/`)
- 5. **Commit your changes** (`git commit -m 'Add amazing feature'`)
- 6. **Push to the branch** (`git push origin feature/amazing-feature`)
- 7. **Open a Pull Request**
 
- ### Development Guidelines
- - **Analysis scripts:** Add to `scripts/analysis/`
- - **Reporting tools:** Add to `scripts/reporting/`
- - **Core detection:** Add to `src/detectors/`
- - **Configuration:** Update `tools/detectors/config.yaml`
 
- ## πŸ“š Documentation
 
- - **[Project Structure](PROJECT_STRUCTURE.md)** - Detailed project organization
- - **[Kernel Analysis Guide](docs/kernel-analysis-guide.md)** - Comprehensive analysis guide
- - **[Concurrency Analysis](reports/concurrency/Concurrency_Analysis_Report.md)** - Concurrency-specific findings
 
- ## πŸ› Known Issues
 
- - **Code snippet extraction:** Some file paths may not match due to the kernel cloning method
- - **Large file handling:** Files >10MB are skipped to prevent memory issues
- - **Pattern accuracy:** Some patterns may generate false positives
 
 ## πŸ“„ License
 
- This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
 
 ## πŸ™ Acknowledgments
 
- - Linux kernel community for the source code
- - Static analysis research community
- - Contributors and maintainers
-
- ## πŸ“ž Contact
-
- - **Repository:** https://github.com/Mac-Huang/linux-kernel-anti-pattern-detector
- - **Issues:** Use GitHub Issues for bug reports and feature requests
- - **Discussions:** Use GitHub Discussions for questions and ideas
 
- ---
 
- *This tool is designed to help improve Linux kernel code quality by identifying potential issues early in the development process. The analysis results should be used as guidance for improvement rather than definitive assessments of code quality.*
+ # CodeLLaMA-Linux-BugFix
 
+ A machine learning project that fine-tunes CodeLLaMA-7B-Instruct specifically for Linux kernel bug fixing using QLoRA (Quantized Low-Rank Adaptation). The model learns to generate Git diff patches from buggy C code and commit messages.
 
+ ## 🎯 Project Overview
 
+ This project addresses the challenging task of automated Linux kernel bug fixing by:
 
+ - **Extracting real bug-fix data** from the Linux kernel Git repository
+ - **Training a specialized model** using QLoRA for efficient fine-tuning
+ - **Generating Git diff patches** that can be applied to fix bugs
+ - **Providing evaluation metrics** to assess model performance
 
+ ## πŸ—οΈ Architecture
 
+ ### Base Model
+ - **Model**: `codellama/CodeLlama-7b-Instruct-hf` (7 billion parameters)
+ - **Fine-tuning Method**: QLoRA with 4-bit quantization
+ - **Hardware**: Optimized for an H200 GPU with bfloat16 precision
 
+ ### Training Configuration
+ - **LoRA Config**: r=64, alpha=16, dropout=0.1
+ - **Training**: 3 epochs, batch size 64, learning rate 2e-4
+ - **Memory Optimization**: Gradient checkpointing, mixed precision training
 
+ ## πŸ“Š Dataset
 
+ The project creates a specialized dataset from Linux kernel commits:
 
+ ### Data Extraction Process
+ 1. **Commit Filtering**: Identifies bug-fix commits using keywords:
+    - `fix`, `bug`, `leak`, `null`, `overflow`, `error`, `failure`
+    - `crash`, `panic`, `memory`, `race`, `deadlock`, `corruption`
+    - `security`, `vulnerability`, `exploit`, `buffer`, `stack`
 
+ 2. **Code Context Extraction**:
+    - Focuses on C source and header files (`.c`, `.h`)
+    - Extracts 10 lines before/after the bug location
+    - Captures relevant code context
 
+ 3. **Data Format**:
+ ```json
+ {
+   "input": {
+     "original code": "C code snippet with bug",
+     "instruction": "Bug fix instruction from commit message"
+   },
+   "output": {
+     "diff codes": "Git diff showing the fix"
+   }
+ }
+ ```
 
+ ### Dataset Statistics
+ - **Training Data**: 100K samples (`training_data_100k.jsonl`)
+ - **Format**: JSONL (one JSON object per line)
+ - **Source**: Linux kernel Git repository
 
+ ## πŸš€ Quick Start
 
+ ### Prerequisites
+ ```bash
+ pip install -r requirements.txt
 ```
 
+ ### 1. Build Dataset
 ```bash
 cd dataset_builder
+ python extract_linux_bugfixes.py
+ python format_for_training.py
 ```
 
+ ### 2. Train Model
+ ```bash
+ cd train
+ python train_codellama_qlora_linux_bugfix.py
+ ```
 
+ ### 3. Evaluate Model
+ ```bash
+ cd evaluate
+ python evaluate_linux_bugfix_model.py
+ ```
 
 ## πŸ“ Project Structure
 
 ```
+ CodeLLaMA-Linux-BugFix/
+ β”œβ”€β”€ dataset_builder/                          # Dataset creation scripts
+ β”‚   β”œβ”€β”€ extract_linux_bugfixes.py             # Main dataset extraction
+ β”‚   β”œβ”€β”€ extract_linux_bugfixes_parallel.py    # Parallelized version
+ β”‚   └── format_for_training.py
+ β”œβ”€β”€ dataset/                                  # Generated datasets
+ β”‚   β”œβ”€β”€ training_data_100k.jsonl
+ β”‚   └── training_data_prompt_completion.jsonl
+ β”œβ”€β”€ train/                                    # Training scripts and outputs
+ β”‚   β”œβ”€β”€ train_codellama_qlora_linux_bugfix.py # Main training script
+ β”‚   β”œβ”€β”€ train_codellama_qlora_simple.py
+ β”‚   β”œβ”€β”€ download_codellama_model.py
+ β”‚   └── output/                               # Trained model checkpoints
+ β”œβ”€β”€ evaluate/                                 # Evaluation scripts and results
+ β”‚   β”œβ”€β”€ evaluate_linux_bugfix_model.py        # Model evaluation
+ β”‚   β”œβ”€β”€ test_samples.jsonl                    # Evaluation dataset
+ β”‚   └── output/                               # Evaluation results
+ └── requirements.txt                          # Python dependencies
 ```
 
+ ## πŸ”§ Key Features
 
+ ### Efficient Training
+ - **QLoRA**: Reduces memory requirements by roughly 75% while maintaining performance
+ - **4-bit Quantization**: Enables training on consumer hardware
+ - **Gradient Checkpointing**: Optimizes memory usage during training
 
+ ### Real-world Data
+ - **Authentic Bug Fixes**: Extracted from actual Linux kernel development
+ - **Contextual Understanding**: Captures relevant code context around bugs
+ - **Git Integration**: Outputs proper Git diff format
 
+ ### Evaluation
+ - **BLEU Score**: Measures n-gram overlap with reference fixes
+ - **ROUGE Score**: Measures recall-oriented overlap with reference fixes
+ - **Comprehensive Metrics**: JSON and CSV output formats
 
+ ## 🎯 Use Cases
 
+ The fine-tuned model can assist with:
 
+ 1. **Automated Bug Fixing**: Generate patches for common kernel bugs
+ 2. **Code Review**: Suggest fixes during development
+ 3. **Learning**: Study patterns in Linux kernel bug fixes
+ 4. **Research**: Advance automated software repair techniques
 
 ## πŸ“ˆ Performance
 
+ The model is evaluated using:
+ - **BLEU Score**: Measures how well generated diffs match reference fixes
+ - **ROUGE Score**: Evaluates overlap between predicted and actual fixes
+ - **Human Evaluation**: Qualitative assessment of fix quality
 
+ ## πŸ”¬ Technical Details
 
+ ### Model Architecture
+ - **Base**: CodeLLaMA-7B-Instruct with instruction tuning
+ - **Adapter**: LoRA layers for efficient fine-tuning
+ - **Output**: Generates patches in Git diff format
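+
+ For inference, the adapter is loaded on top of the base model with `peft` (a sketch; the prompt and generation settings are illustrative):
+
+ ```python
+ # Sketch: generate a diff with the fine-tuned adapter.
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from peft import PeftModel
+
+ base = AutoModelForCausalLM.from_pretrained(
+     "codellama/CodeLlama-7b-Instruct-hf", device_map="auto"
+ )
+ model = PeftModel.from_pretrained(base, "train/output/qlora-codellama-bugfix")
+ tok = AutoTokenizer.from_pretrained("train/output/qlora-codellama-bugfix")
+
+ prompt = "### Buggy code:\n...\n### Instruction:\nFix the leak\n### Fixed diff:\n"
+ inputs = tok(prompt, return_tensors="pt").to(model.device)
+ out = model.generate(**inputs, max_new_tokens=256)  # illustrative settings
+ print(tok.decode(out[0], skip_special_tokens=True))
+ ```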
 
 
+ ### Training Process
+ 1. **Data Preprocessing**: Extract and clean commit data
+ 2. **Tokenization**: Convert to model input format (see the sketch below)
+ 3. **QLoRA Training**: Parameter-efficient fine-tuning with 4-bit quantization
+ 4. **Checkpointing**: Save model states for evaluation
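+
+ Step 2 might look like the following with `datasets` (a sketch; the field names assume the prompt-completion format described above, and `max_length` is illustrative):
+
+ ```python
+ # Sketch: tokenize prompt-completion pairs for causal-LM training.
+ from datasets import load_dataset
+ from transformers import AutoTokenizer
+
+ tok = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-Instruct-hf")
+ ds = load_dataset("json", data_files="dataset/training_data_prompt_completion.jsonl")
+
+ def tokenize(example):
+     # Concatenate prompt and completion into one causal-LM training sequence.
+     text = example["prompt"] + example["completion"]
+     return tok(text, truncation=True, max_length=1024)
+
+ tokenized = ds["train"].map(tokenize, remove_columns=ds["train"].column_names)
+ ```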
 
+ ### Memory Optimization
+ - **4-bit Quantization**: Significantly reduces the model's memory footprint
+ - **Gradient Accumulation**: Enables larger effective batch sizes
+ - **Mixed Precision**: Uses bfloat16 for faster training
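+
+ These optimizations map onto a few `TrainingArguments` fields (a sketch; the per-device batch size and accumulation split are assumptions consistent with an overall batch size of 64):
+
+ ```python
+ # Sketch: memory-oriented TrainingArguments for QLoRA fine-tuning.
+ from transformers import TrainingArguments
+
+ args = TrainingArguments(
+     output_dir="./output/qlora-codellama-bugfix",
+     num_train_epochs=3,
+     per_device_train_batch_size=8,    # assumption
+     gradient_accumulation_steps=8,    # 8 x 8 = effective batch size 64
+     learning_rate=2e-4,
+     bf16=True,                        # bfloat16 mixed precision
+     gradient_checkpointing=True,      # trade compute for memory
+     logging_steps=50,
+     save_steps=500,                   # matches checkpoint-500/-1000
+ )
+ ```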
 
+ ## 🀝 Contributing
 
+ 1. Fork the repository
+ 2. Create a feature branch
+ 3. Make your changes
+ 4. Add tests if applicable
+ 5. Submit a pull request
 
 ## πŸ“„ License
 
+ This project is licensed under the MIT License - see the LICENSE file for details.
 
 ## πŸ™ Acknowledgments
 
+ - **CodeLLaMA Team**: For the base model
+ - **Linux Kernel Community**: For the bug-fix data
+ - **Hugging Face**: For the transformers library
+ - **Microsoft**: For the LoRA technique
 
+ ## πŸ“š References
 
+ - [Code Llama: Open Foundation Models for Code](https://arxiv.org/abs/2308.12950)
+ - [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
+ - [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)
dataset/{linux_bugfix_100k.jsonl β†’ training_data_100k.jsonl} RENAMED
File without changes
dataset/{linux_bugfix_prompt_completion.jsonl β†’ training_data_prompt_completion.jsonl} RENAMED
File without changes
dataset_builder/{build_dataset_demo.py β†’ extract_linux_bugfixes_parallel.py} RENAMED
File without changes
dataset_builder/{convert_to_prompt_completion.py β†’ format_for_training.py} RENAMED
@@ -1,7 +1,7 @@
 import json
 
 INPUT_FILE = './output/linux_bugfix_dataset.jsonl'
- OUTPUT_FILE = './output/linux_bugfix_prompt_completion.jsonl'
+ OUTPUT_FILE = '../dataset/training_data_prompt_completion.jsonl'
 
 def format_prompt(original_code, instruction):
     return (
evaluate/{evaluate.py β†’ evaluate_linux_bugfix_model.py} RENAMED
@@ -9,7 +9,7 @@ import evaluate
 
 # ==== CONFIG ====
 MODEL_PATH = "../train/output/qlora-codellama-bugfix"
- EVAL_FILE = "eval.jsonl"
+ EVAL_FILE = "test_samples.jsonl"
 OUTPUT_JSON = "./output/eval_results.json"
 OUTPUT_CSV = "./output/eval_results.csv"
 MAX_INPUT_LEN = 1024
evaluate/{eval.jsonl β†’ test_samples.jsonl} RENAMED
File without changes
train/{download_model.py β†’ download_codellama_model.py} RENAMED
File without changes
train/output/qlora-codellama-bugfix/README.md CHANGED
@@ -1,11 +1,12 @@
 ---
- base_model: codellama/CodeLLaMA-7b-Instruct-hf
- library_name: peft
- pipeline_tag: text-generation
+ license: mit
 tags:
- - base_model:adapter:codellama/CodeLLaMA-7b-Instruct-hf
- - lora
- - transformers
+ - linux
+ - bugfix
+ - codellama
+ model_type: causal-lm
+ library_name: transformers
+ pipeline_tag: text-generation
 ---
 
 # Model Card for Model ID
train/{train.py β†’ train_codellama_qlora_linux_bugfix.py} RENAMED
@@ -18,7 +18,7 @@ os.environ["WANDB_PROJECT"] = "codellama-7b-instruct-qlora-linux-bugfix"
 os.environ["WANDB_NAME"] = "run-v1"
 # Paths and model
 BASE_MODEL = "codellama/CodeLLaMA-7b-Instruct-hf"
- DATA_PATH = "../dataset/linux_bugfix_100k.jsonl"
+ DATA_PATH = "../dataset/training_data_100k.jsonl"
 OUTPUT_DIR = "./output/qlora-codellama-bugfix"
 
 # Load dataset (prompt-completion format)
train/{train_codellama_qlora.py β†’ train_codellama_qlora_simple.py} RENAMED
@@ -9,7 +9,7 @@ import os
 
 # Paths and parameters
 BASE_MODEL = "codellama/CodeLlama-7b-Instruct-hf"
- DATA_PATH = "../dataset_builder/output/linux_bugfix_prompt_completion.jsonl"
+ DATA_PATH = "../dataset/training_data_prompt_completion.jsonl"
 OUTPUT_DIR = "./output/qlora-codellama-bugfix"
 
 # Load dataset (prompt, completion)