# Repository Explorer - Vectorization Feature
## Overview
The Repository Explorer now includes **simple vectorization** to enhance the chatbot's ability to answer questions about loaded repositories. This feature uses semantic search to find the most relevant code sections for user queries.
## How It Works
### 1. **Content Chunking**
- Repository content is split into overlapping chunks (~500 lines each, with a 50-line overlap)
- Each chunk maintains metadata (repo ID, line numbers, chunk index)
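The chunking step above can be sketched in plain Python. This is an illustrative sketch, not the actual implementation; the function and field names are assumptions:

```python
# Sketch of overlapping chunking with per-chunk metadata.
# chunk_size and overlap mirror the values described in this document.
def chunk_lines(lines, chunk_size=500, overlap=50):
    """Split a list of source lines into overlapping chunks with metadata."""
    chunks = []
    step = chunk_size - overlap
    for index, start in enumerate(range(0, len(lines), step)):
        end = min(start + chunk_size, len(lines))
        chunks.append({
            "chunk_index": index,
            "start_line": start + 1,  # 1-based line numbers for references
            "end_line": end,
            "text": "\n".join(lines[start:end]),
        })
        if end == len(lines):
            break
    return chunks

# Example: a 1200-line file yields chunks covering lines
# 1-500, 451-950, and 901-1200.
demo = chunk_lines([f"line {i}" for i in range(1, 1201)])
print([(c["start_line"], c["end_line"]) for c in demo])
```

The 50-line overlap means a function that straddles a chunk boundary still appears whole in at least one chunk.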
### 2. **Embedding Creation**
- Uses the lightweight `all-MiniLM-L6-v2` SentenceTransformer model
- Creates vector embeddings for each chunk
- Embeddings capture the semantic meaning of code content
### 3. **Semantic Search**
- When you ask a question, it searches for the 3 most relevant chunks
- Uses cosine similarity to rank chunks by relevance
- Returns both similarity scores and line number references
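The ranking step can be sketched in pure Python. This is an illustrative sketch with toy 2-D vectors; in the actual feature the vectors come from the embedding model:

```python
# Cosine-similarity ranking: score each chunk vector against the query
# vector and keep the top k matches.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k=3):
    """Return (chunk_index, similarity) pairs, best match first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

chunks = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [0.9, 0.1]]
result = top_k([1.0, 0.0], chunks)
print(result)  # chunk 0 scores 1.0, chunk 3 close behind, chunk 1 third
```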
### 4. **Enhanced Responses**
- The chatbot combines both the general repository analysis AND the most relevant code sections
- Provides specific code examples and implementation details
- References exact line numbers for better context
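A hypothetical sketch of how the retrieved chunks might be folded into the chatbot's context, matching the "MOST RELEVANT CODE SECTIONS" format this document describes (function and parameter names are assumptions):

```python
# Combine the general analysis with the top-ranked chunks into one
# context string, annotated with similarity scores and line ranges.
def build_context(analysis, relevant):
    """relevant: list of (similarity, start_line, end_line, text) tuples."""
    parts = [analysis, "", "=== MOST RELEVANT CODE SECTIONS ==="]
    for rank, (score, start, end, text) in enumerate(relevant, start=1):
        parts.append(
            f"--- Relevant Section {rank} (similarity: {score:.3f}, lines {start}-{end}) ---"
        )
        parts.append(text)
    return "\n".join(parts)

context = build_context(
    "General repository analysis...",
    [(0.847, 25, 75, "# Installation and Usage")],
)
print(context)
```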
## Installation
The vectorization feature requires additional dependencies:
```bash
pip install sentence-transformers numpy
```
These are already included in the updated `requirements.txt`.
## Testing
Run the test script to verify everything is working:
```bash
python test_vectorization.py
```
This will test:
- ✅ Dependencies import correctly
- ✅ SentenceTransformer model loads
- ✅ Embedding creation works
- ✅ Similarity calculations function
- ✅ Integration with the repo explorer
## Features
### ✅ **What's Included**
- **Simple setup**: Uses a lightweight, fast embedding model
- **Automatic chunking**: Smart content splitting with overlap for context
- **Semantic search**: Finds relevant code based on meaning, not just keywords
- **Graceful fallback**: If vectorization fails, falls back to text-only analysis
- **Memory efficient**: In-memory storage suitable for single-repository exploration
- **Clear feedback**: Status messages show when vectorization is active
### **How to Use**
1. Load any repository in the Repository Explorer tab
2. Look for "Vector embeddings created" in the status message
3. Ask questions - the chatbot will automatically use vector search
4. Responses will include "MOST RELEVANT CODE SECTIONS" with similarity scores
### **Example Output**
When you ask "How do I use this repository?", you might get:
```
=== MOST RELEVANT CODE SECTIONS ===
--- Relevant Section 1 (similarity: 0.847, lines 25-75) ---
# Installation and Usage
...actual code from those lines...
--- Relevant Section 2 (similarity: 0.792, lines 150-200) ---
def main():
    """Main usage example"""
...actual code from those lines...
```
## Technical Details
- **Model**: `all-MiniLM-L6-v2` (384-dimensional embeddings)
- **Chunk size**: 500 lines with a 50-line overlap
- **Search**: Top 3 most similar chunks per query
- **Storage**: In-memory (cleared when loading a new repository)
- **Fallback**: Graceful degradation to text-only analysis if vectorization fails
## Benefits
1. **Better Context**: Finds relevant code sections even with natural language queries
2. **Specific Examples**: Provides actual code snippets related to your question
3. **Line References**: Shows exactly where information comes from
4. **Semantic Understanding**: Understands intent, not just keyword matching
5. **Fast Setup**: Lightweight model downloads quickly on first use
## Limitations
- **Single Repository**: The vector store is cleared when loading a new repository
- **Memory Usage**: Keeps all embeddings in memory (suitable for the exploration use case)
- **Model Size**: ~80MB download for the embedding model (one-time)
- **No Persistence**: Vectors are recreated each time you load a repository

This simple vectorization approach significantly improves the chatbot's ability to provide relevant, code-specific answers while keeping the implementation straightforward and fast.