# Repository Explorer - Vectorization Feature
## Overview
The Repository Explorer now includes **simple vectorization** to enhance the chatbot's ability to answer questions about loaded repositories. This feature uses semantic search to find the most relevant code sections for user queries.
## How It Works
### 1. **Content Chunking**
- Repository content is split into overlapping chunks (~500 lines each with 50 lines overlap)
- Each chunk maintains metadata (repo ID, line numbers, chunk index)
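The chunking step described above can be sketched as follows. This is an illustrative helper, not the project's actual code: the function name `chunk_lines` and the dictionary layout are assumptions, but the chunk size (~500 lines) and overlap (50 lines) match the numbers stated in this document.

```python
def chunk_lines(text, chunk_size=500, overlap=50):
    """Split repository text into overlapping line-based chunks.

    Hypothetical sketch of the chunking described above; the real
    implementation's names and metadata fields may differ.
    """
    lines = text.splitlines()
    chunks = []
    step = chunk_size - overlap  # advance 450 lines per chunk
    for idx, start in enumerate(range(0, len(lines), step)):
        end = min(start + chunk_size, len(lines))
        chunks.append({
            "chunk_index": idx,
            "start_line": start + 1,  # 1-based, for line references
            "end_line": end,
            "text": "\n".join(lines[start:end]),
        })
        if end == len(lines):
            break
    return chunks
```

Because consecutive chunks share 50 lines, a code section that straddles a chunk boundary still appears whole in at least one chunk.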
### 2. **Embedding Creation**
- Uses the lightweight `all-MiniLM-L6-v2` SentenceTransformer model
- Creates vector embeddings for each chunk
- Embeddings capture semantic meaning of code content
### 3. **Semantic Search**
- When you ask a question, it searches for the 3 most relevant chunks
- Uses cosine similarity to rank chunks by relevance
- Returns both similarity scores and line number references
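The ranking step can be sketched with plain NumPy. In the real feature the vectors would come from `SentenceTransformer('all-MiniLM-L6-v2').encode(...)`; here the function name `top_k_chunks` is a hypothetical stand-in that just shows the cosine-similarity math.

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, k=3):
    """Return the indices and cosine similarities of the k chunks
    most similar to the query embedding (highest first).

    query_vec:  1-D array (one embedding)
    chunk_vecs: 2-D array, one row per chunk embedding
    """
    q = query_vec / np.linalg.norm(query_vec)
    m = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    sims = m @ q  # cosine similarity of every chunk vs. the query
    order = np.argsort(sims)[::-1][:k]
    return [(int(i), float(sims[i])) for i in order]
```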
### 4. **Enhanced Responses**
- The chatbot combines both the general repository analysis AND the most relevant code sections
- Provides specific code examples and implementation details
- References exact line numbers for better context
## Installation
The vectorization feature requires additional dependencies:
```bash
pip install sentence-transformers numpy
```
These are already included in the updated `requirements.txt`.
## Testing
Run the test script to verify everything is working:
```bash
python test_vectorization.py
```
This will test:
- ✅ Dependencies import correctly
- ✅ SentenceTransformer model loads
- ✅ Embedding creation works
- ✅ Similarity calculations function
- ✅ Integration with the repo explorer
## Features
### ✅ What's Included
- **Simple setup**: Uses a lightweight, fast embedding model
- **Automatic chunking**: Smart content splitting with overlap for context
- **Semantic search**: Find relevant code based on meaning, not just keywords
- **Graceful fallback**: If vectorization fails, falls back to text-only analysis
- **Memory efficient**: In-memory storage suitable for single repository exploration
- **Clear feedback**: Status messages show when vectorization is active
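The graceful-fallback behaviour can be sketched like this. The wrapper name `build_vector_store` is hypothetical; the point is that any failure during vectorization (missing dependency, model download error) is caught so repository loading still succeeds in text-only mode.

```python
def build_vector_store(repo_text):
    """Try to create embeddings; fall back to text-only analysis
    on any failure. Illustrative sketch, not the project's code.
    """
    try:
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer("all-MiniLM-L6-v2")
        # Simplified: real code chunks by ~500 lines, not per line
        chunks = repo_text.splitlines()
        embeddings = model.encode(chunks)
        return {"mode": "vector", "chunks": chunks, "embeddings": embeddings}
    except Exception:
        # Missing dependency, download failure, etc.
        return {"mode": "text-only", "text": repo_text}
```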
### 🚀 How to Use
1. Load any repository in the Repository Explorer tab
2. Look for "Vector embeddings created" in the status message
3. Ask questions - the chatbot will automatically use vector search
4. Responses will include "MOST RELEVANT CODE SECTIONS" with similarity scores
### 📊 Example Output
When you ask "How do I use this repository?", you might get:
```
=== MOST RELEVANT CODE SECTIONS ===
--- Relevant Section 1 (similarity: 0.847, lines 25-75) ---
# Installation and Usage
...actual code from those lines...
--- Relevant Section 2 (similarity: 0.792, lines 150-200) ---
def main():
    """Main usage example"""
    ...actual code from those lines...
```
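Assembling that output block is straightforward string formatting. This is a hypothetical helper (the real project's formatting code may differ); it assumes each search hit carries a similarity score, start/end line numbers, and the chunk text.

```python
def format_relevant_sections(results):
    """Format search hits in the style shown above.

    results: list of (similarity, start_line, end_line, text) tuples,
             already sorted by descending similarity.
    """
    lines = ["=== MOST RELEVANT CODE SECTIONS ==="]
    for rank, (sim, start, end, text) in enumerate(results, start=1):
        lines.append(
            f"--- Relevant Section {rank} "
            f"(similarity: {sim:.3f}, lines {start}-{end}) ---"
        )
        lines.append(text)
    return "\n".join(lines)
```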
## Technical Details
- **Model**: `all-MiniLM-L6-v2` (384-dimensional embeddings)
- **Chunk size**: 500 lines with 50 line overlap
- **Search**: Top 3 most similar chunks per query
- **Storage**: In-memory (cleared when loading new repository)
- **Fallback**: Graceful degradation to text-only analysis if vectorization fails
## Benefits
1. **Better Context**: Finds relevant code sections even with natural language queries
2. **Specific Examples**: Provides actual code snippets related to your question
3. **Line References**: Shows exactly where information comes from
4. **Semantic Understanding**: Understands intent, not just keyword matching
5. **Fast Setup**: Lightweight model downloads quickly on first use
## Limitations
- **Single Repository**: Vector store is cleared when loading a new repository
- **Memory Usage**: Keeps all embeddings in memory (suitable for exploration use case)
- **Model Size**: ~80MB download for the embedding model (one-time)
- **No Persistence**: Vectors are recreated each time you load a repository
This simple vectorization approach significantly improves the chatbot's ability to provide relevant, code-specific answers while keeping the implementation straightforward and fast.