|
# Repository Explorer - Vectorization Feature |
|
|
|
## Overview |
|
|
|
The Repository Explorer now includes **simple vectorization** to enhance the chatbot's ability to answer questions about loaded repositories. This feature uses semantic search to find the most relevant code sections for user queries. |
|
|
|
## How It Works |
|
|
|
### 1. **Content Chunking** |
|
- Repository content is split into overlapping chunks (~500 lines each, with a 50-line overlap)
|
- Each chunk maintains metadata (repo ID, line numbers, chunk index) |
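The chunking step above can be sketched as follows. This is a minimal illustration, not the actual implementation; the function name `chunk_content` and the metadata field names are assumptions.

```python
def chunk_content(lines, repo_id, chunk_size=500, overlap=50):
    """Split a file's lines into overlapping chunks with metadata."""
    chunks = []
    step = chunk_size - overlap  # advance 450 lines per chunk
    for index, start in enumerate(range(0, len(lines), step)):
        end = min(start + chunk_size, len(lines))
        chunks.append({
            "repo_id": repo_id,
            "chunk_index": index,
            "start_line": start + 1,  # 1-based line numbers
            "end_line": end,
            "text": "\n".join(lines[start:end]),
        })
        if end == len(lines):
            break
    return chunks

# A 1200-line file yields chunks covering lines 1-500, 451-950, 901-1200.
chunks = chunk_content([f"line {i}" for i in range(1200)], "demo-repo")
```

The overlap means a function split across a chunk boundary still appears whole in at least one chunk.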
|
|
|
### 2. **Embedding Creation** |
|
- Uses the lightweight `all-MiniLM-L6-v2` SentenceTransformer model |
|
- Creates vector embeddings for each chunk |
|
- Embeddings capture semantic meaning of code content |
|
|
|
### 3. **Semantic Search** |
|
- When you ask a question, the system searches for the 3 most relevant chunks
|
- Uses cosine similarity to rank chunks by relevance |
|
- Returns both similarity scores and line number references |
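Ranking by cosine similarity reduces to a few NumPy operations. In this sketch the vectors are tiny stand-ins for the real 384-dimensional embeddings, and the function name `top_k_chunks` is illustrative:

```python
import numpy as np

def top_k_chunks(query_vec, chunk_vecs, k=3):
    """Return (index, similarity) pairs for the k most similar chunks."""
    query = np.asarray(query_vec, dtype=float)
    matrix = np.asarray(chunk_vecs, dtype=float)
    # Cosine similarity: dot product divided by the vector norms.
    sims = (matrix @ query) / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    )
    order = np.argsort(sims)[::-1][:k]  # highest similarity first
    return [(int(i), float(sims[i])) for i in order]

# Toy 3-dimensional "embeddings"; real ones come from the model's encode().
chunks = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
results = top_k_chunks([1.0, 0.05, 0.0], chunks)
```

Each returned index maps back to a chunk's stored metadata, which is how the line-number references in the responses are produced.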
|
|
|
### 4. **Enhanced Responses** |
|
- The chatbot combines both the general repository analysis AND the most relevant code sections |
|
- Provides specific code examples and implementation details |
|
- References exact line numbers for better context |
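Combining the general analysis with the retrieved sections is plain string assembly. The section header matches the example output shown later in this document, while the function and field names here are illustrative:

```python
def build_context(analysis, relevant):
    """Merge the repository analysis with retrieved chunks for the chatbot."""
    parts = [analysis, "", "=== MOST RELEVANT CODE SECTIONS ==="]
    for rank, chunk in enumerate(relevant, start=1):
        parts.append(
            f"\n--- Relevant Section {rank} "
            f"(similarity: {chunk['similarity']:.3f}, "
            f"lines {chunk['start_line']}-{chunk['end_line']}) ---"
        )
        parts.append(chunk["text"])
    return "\n".join(parts)
```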
|
|
|
## Installation |
|
|
|
The vectorization feature requires additional dependencies: |
|
|
|
```bash |
|
pip install sentence-transformers numpy |
|
``` |
|
|
|
These are already included in the updated `requirements.txt`. |
|
|
|
## Testing |
|
|
|
Run the test script to verify everything is working: |
|
|
|
```bash |
|
python test_vectorization.py |
|
``` |
|
|
|
This will test: |
|
- ✅ Dependencies import correctly

- ✅ SentenceTransformer model loads

- ✅ Embedding creation works

- ✅ Similarity calculations function

- ✅ Integration with repo explorer
|
|
|
## Features |
|
|
|
### ✅ **What's Included**
|
- **Simple setup**: Uses a lightweight, fast embedding model |
|
- **Automatic chunking**: Smart content splitting with overlap for context |
|
- **Semantic search**: Find relevant code based on meaning, not just keywords |
|
- **Graceful fallback**: If vectorization fails, falls back to text-only analysis |
|
- **Memory efficient**: In-memory storage suitable for single repository exploration |
|
- **Clear feedback**: Status messages show when vectorization is active |
|
|
|
### **How to Use**
|
1. Load any repository in the Repository Explorer tab |
|
2. Look for "Vector embeddings created" in the status message |
|
3. Ask questions - the chatbot will automatically use vector search |
|
4. Responses will include "MOST RELEVANT CODE SECTIONS" with similarity scores |
|
|
|
### **Example Output**
|
When you ask "How do I use this repository?", you might get: |
|
|
|
```
=== MOST RELEVANT CODE SECTIONS ===

--- Relevant Section 1 (similarity: 0.847, lines 25-75) ---
# Installation and Usage
...actual code from those lines...

--- Relevant Section 2 (similarity: 0.792, lines 150-200) ---
def main():
    """Main usage example"""
    ...actual code from those lines...
```
|
|
|
## Technical Details |
|
|
|
- **Model**: `all-MiniLM-L6-v2` (384-dimensional embeddings) |
|
- **Chunk size**: 500 lines with 50 line overlap |
|
- **Search**: Top 3 most similar chunks per query |
|
- **Storage**: In-memory (cleared when loading new repository) |
|
- **Fallback**: Graceful degradation to text-only analysis if vectorization fails |
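The graceful-fallback behaviour can be sketched as a guarded import, assuming the dependency named in the Installation section; the flag and function names are illustrative:

```python
try:
    from sentence_transformers import SentenceTransformer
    VECTOR_SEARCH_AVAILABLE = True
except ImportError:
    # Without the optional dependency, fall back to text-only analysis.
    VECTOR_SEARCH_AVAILABLE = False

def answer_question(question, analysis):
    """Use vector search when available, otherwise plain text analysis."""
    if VECTOR_SEARCH_AVAILABLE:
        return f"{analysis}\n(augmented with vector search for: {question})"
    return analysis
```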
|
|
|
## Benefits |
|
|
|
1. **Better Context**: Finds relevant code sections even with natural language queries |
|
2. **Specific Examples**: Provides actual code snippets related to your question |
|
3. **Line References**: Shows exactly where information comes from |
|
4. **Semantic Understanding**: Understands intent, not just keyword matching |
|
5. **Fast Setup**: Lightweight model downloads quickly on first use |
|
|
|
## Limitations |
|
|
|
- **Single Repository**: Vector store is cleared when loading a new repository |
|
- **Memory Usage**: Keeps all embeddings in memory (suitable for exploration use case) |
|
- **Model Size**: ~80MB download for the embedding model (one-time) |
|
- **No Persistence**: Vectors are recreated each time you load a repository |
|
|
|
This simple vectorization approach significantly improves the chatbot's ability to provide relevant, code-specific answers while keeping the implementation straightforward and fast. |