pinned: false
license: mit
short_description: RAG + LoRA Fine-Tuning for Code Analysis
---
# Fine-tuned LLM with RAG for Codebase Analysis

This project demonstrates a production-ready Retrieval-Augmented Generation (RAG) system engineered for codebase analysis. Its core innovation is the **automatic fine-tuning** of a code-specialized language model (`Salesforce/codegen-350M-mono`) on startup, which produces a specialized model that gives accurate, context-aware answers to complex software engineering questions.
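The LoRA technique behind this automatic fine-tuning can be sketched numerically. The dimensions and hyperparameters below are illustrative toy values, not the ones this project uses:

```python
import numpy as np

# LoRA keeps the pretrained weight W frozen and learns a low-rank update:
#   W_adapted = W + (alpha / r) * B @ A
# Only A and B are trained, which is why the method is parameter-efficient.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16   # toy layer sizes and LoRA rank

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # small random init
B = np.zeros((d_out, r))                # zero init: adapter starts as a no-op

W_adapted = W + (alpha / r) * (B @ A)

trainable = A.size + B.size             # parameters LoRA actually updates
full = W.size                           # parameters full fine-tuning would touch
print(trainable, full)                  # 1024 trainable vs 4096 for this toy layer
```

Because `B` starts at zero, the adapted weight equals the pretrained weight at initialization, so fine-tuning begins from the base model's behavior and only gradually specializes it.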
The system is designed to be a transparent and robust framework, featuring detailed performance evaluation, cost tracking, and clear source attribution for every generated response.

## Core Features

This system integrates a complete, automated pipeline for building and querying a specialized code analysis engine.
* **Automatic Model Fine-Tuning**: On initialization, the system automatically fine-tunes the `Salesforce/codegen-350M-mono` model using Parameter-Efficient Fine-Tuning (PEFT/LoRA). This process adapts the model to the specific nuances of software development Q&A, significantly improving its accuracy and relevance.

* **Retrieval-Augmented Generation (RAG)**: The framework leverages a `ChromaDB` vector store and a `sentence-transformers` model to create and query a knowledge base. When a question is asked, the system retrieves the most relevant document chunks to ground the language model's response in factual, verifiable information.

* **Code-Specific Language Model**: The entire system is built upon `Salesforce/codegen-350M-mono`, a model pre-trained specifically on code. This provides a strong foundation for understanding programming concepts, syntax, and architecture.

* **Comprehensive Evaluation Metrics**: Every response is evaluated in real time. The system calculates and displays scores for:
  * **Relevance**: How closely the answer matches the user's query.
  * **Context Grounding**: How well the answer is supported by the retrieved documents.
  * **Hallucination Score**: An estimate of how much the model deviates from the provided context (lower is better).
  * **Technical Accuracy**: A measure of the response's use of correct technical terminology.

* **Performance & Cost Tracking**: A built-in `PerformanceTracker` monitors key operational metrics, including query latency, the number of tokens processed, and the estimated cost of each interaction, providing the insights needed for production deployment.

* **Source Attribution**: To ensure transparency and trust, the system clearly cites the source documents that were used to formulate each answer.
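The metrics above can be approximated with simple lexical overlap. The project's actual scoring functions are not shown in this README, so the function names and formulas below are illustrative stand-ins, not the real implementation:

```python
# Toy lexical-overlap versions of the response metrics. The real system's
# scorers are not published here, so these formulas are illustrative only.
def _tokens(text: str) -> set[str]:
    return set(text.lower().split())

def relevance_score(query: str, answer: str) -> float:
    """Fraction of query terms that appear in the answer."""
    q = _tokens(query)
    return len(q & _tokens(answer)) / len(q) if q else 0.0

def grounding_score(answer: str, context: str) -> float:
    """Fraction of answer terms supported by the retrieved context."""
    a = _tokens(answer)
    return len(a & _tokens(context)) / len(a) if a else 0.0

def hallucination_score(answer: str, context: str) -> float:
    """Share of answer terms NOT found in the context (lower is better)."""
    return 1.0 - grounding_score(answer, context)

answer = "microservices split an application into small services"
context = "microservices architecture splits an application into small independent services"
print(round(grounding_score(answer, context), 2))  # → 0.86
```

Embedding-based versions of these scores (e.g. cosine similarity between sentence embeddings) behave similarly but are robust to paraphrasing, which plain word overlap is not.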
## How It Works

The system follows an automated, multi-stage process to deliver high-quality codebase analysis.

1. **Initialization & Fine-Tuning**: On the very first run, the system fine-tunes the base CodeGen model using a curated dataset of software development Q&A. This one-time process creates a specialized LoRA adapter, which is then loaded for all subsequent operations.
2. **Knowledge Ingestion**: A knowledge base of software engineering documents (covering architecture, testing, best practices, etc.) is processed. Each document is split into manageable chunks, converted into vector embeddings, and indexed into a `ChromaDB` vector store.
3. **Query & Retrieval**: When a user submits a query, the system embeds the question and searches the vector store to find the most semantically similar document chunks.
4. **Augmented Generation**: The user's query and the retrieved context chunks are combined into a detailed prompt. This prompt is then passed to the fine-tuned language model, which generates a comprehensive, context-aware answer.
5. **Evaluation & Presentation**: The final answer, its sources, and the full suite of performance and quality metrics are presented to the user in a clean, interactive Gradio dashboard.
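Setting the models aside, steps 2–4 reduce to chunk, retrieve, and assemble a prompt. A minimal pure-Python sketch, in which a word-overlap ranking stands in for the ChromaDB vector search and the sentence-transformers embeddings (all function names here are illustrative, not the project's API):

```python
def chunk(text: str, size: int = 40) -> list[str]:
    """Split a document into word-window chunks (the real system chunks differently)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by word overlap with the query; embedding search would go here."""
    q = set(query.lower().split())
    return sorted(chunks,
                  key=lambda c: len(q & set(c.lower().split())),
                  reverse=True)[:k]

def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Combine retrieved context and the question into one generation prompt."""
    context = "\n---\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "Microservices architecture splits an application into small independent services.",
    "Test-driven development writes a failing test before the production code.",
]
chunks = [c for d in docs for c in chunk(d)]
top = retrieve("what is microservices architecture", chunks, k=1)
prompt = build_prompt("what is microservices architecture", top)
```

In the real pipeline, `retrieve` is replaced by a ChromaDB similarity query over sentence-transformer embeddings, and `prompt` is fed to the fine-tuned CodeGen model rather than printed.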
## Technical Stack

* **LLM & Fine-Tuning**: Transformers, PEFT (LoRA), PyTorch, BitsAndBytes
* **Retrieval & Embeddings**: ChromaDB, Sentence-Transformers, LangChain
* **Core Data Science**: Pandas, NumPy, Scikit-learn
* **Web Interface**: Gradio
* **Core Language**: Python
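A dependency file matching this stack might look like the following. The package list is inferred from the stack above and versions are left unpinned, so treat it as a starting point rather than the project's actual requirements:

```text
transformers
peft
torch
bitsandbytes
chromadb
sentence-transformers
langchain
pandas
numpy
scikit-learn
gradio
```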
## How to Use the Demo

The interface is designed for simplicity and clarity.

1. **Wait for Initialization**: The first time the application starts, it will perform the automatic fine-tuning process. A status banner will indicate when the fine-tuned model is active.
2. **Ask a Question**: Use the text box to ask a question related to software development, such as "What is microservices architecture?" or "Explain test-driven development."
3. **Analyze Query**: Click the "Analyze Query" button to submit your question.
4. **Review the Results**:
    * The generated **Analysis Result** will appear on the left.
    * The **Referenced Sources**, **Response Metrics**, and **Performance Data** will be displayed on the right, giving you a complete overview of the system's operation for your query.
## Disclaimer

This application is a demonstration of a sophisticated RAG and fine-tuning pipeline. The knowledge base is pre-loaded with general software engineering documents and does not reflect any specific proprietary codebase.