Repository Explorer - Vectorization Feature

Overview

The Repository Explorer now includes simple vectorization to enhance the chatbot's ability to answer questions about loaded repositories. This feature uses semantic search to find the most relevant code sections for user queries.

How It Works

1. Content Chunking

  • Repository content is split into overlapping chunks (~500 lines each, with a 50-line overlap)
  • Each chunk keeps its metadata (repo ID, line numbers, chunk index), as sketched below
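
A minimal sketch of this chunking scheme (the function name and metadata fields here are illustrative, not the app's actual implementation):

```python
def chunk_lines(text, repo_id, chunk_size=500, overlap=50):
    """Split text into overlapping line-based chunks with metadata."""
    lines = text.splitlines()
    step = chunk_size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(lines), step)):
        end = min(start + chunk_size, len(lines))
        chunks.append({
            "repo_id": repo_id,        # which repository the chunk came from
            "chunk_index": index,
            "start_line": start + 1,   # 1-based line numbers for references
            "end_line": end,
            "text": "\n".join(lines[start:end]),
        })
        if end == len(lines):          # reached the end; stop to avoid a tiny tail chunk
            break
    return chunks
```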

2. Embedding Creation

  • Uses the lightweight all-MiniLM-L6-v2 SentenceTransformer model
  • Creates a vector embedding for each chunk
  • Embeddings capture the semantic meaning of the code content (see the sketch below)
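
In code terms, the embedding step looks roughly like this (the SentenceTransformer calls are the library's real API; the surrounding variable names are illustrative and `chunks` comes from the sketch above):

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 is small (~80 MB download) and produces 384-dimensional vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(
    [chunk["text"] for chunk in chunks],
    convert_to_numpy=True,
)  # shape: (num_chunks, 384)
```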

3. Semantic Search

  • When you ask a question, the chatbot searches for the 3 most relevant chunks
  • Uses cosine similarity to rank chunks by relevance
  • Returns both similarity scores and line-number references, as sketched below
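
A sketch of the ranking step, reusing `model`, `chunks`, and `embeddings` from the sketches above:

```python
import numpy as np

def search(query: str, top_k: int = 3):
    """Return the top_k (similarity, chunk) pairs for a query."""
    query_vec = model.encode([query], convert_to_numpy=True)[0]
    # Cosine similarity: dot product divided by the product of the norms.
    sims = embeddings @ query_vec / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vec)
    )
    top = np.argsort(sims)[::-1][:top_k]   # indices of the 3 highest scores
    return [(float(sims[i]), chunks[i]) for i in top]
```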

4. Enhanced Responses

  • The chatbot combines the general repository analysis with the most relevant code sections
  • Provides specific code examples and implementation details
  • References exact line numbers for better context (see the sketch below)
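
The combined context handed to the chatbot might be assembled like this (the helper name is hypothetical; the format mirrors the example output shown later in this document):

```python
def build_context(general_analysis: str, results) -> str:
    """Append retrieved sections to the general analysis for the chatbot."""
    parts = [general_analysis, "=== MOST RELEVANT CODE SECTIONS ==="]
    for rank, (score, chunk) in enumerate(results, start=1):
        parts.append(
            f"--- Relevant Section {rank} "
            f"(similarity: {score:.3f}, "
            f"lines {chunk['start_line']}-{chunk['end_line']}) ---"
        )
        parts.append(chunk["text"])
    return "\n\n".join(parts)
```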

Installation

The vectorization feature requires additional dependencies:

pip install sentence-transformers numpy

These are already included in the updated requirements.txt.

Testing

Run the test script to verify everything is working:

python test_vectorization.py

This will test:

  • ✅ Dependencies import correctly
  • ✅ SentenceTransformer model loads
  • ✅ Embedding creation works
  • ✅ Similarity calculations function
  • ✅ Integration with repo explorer

Features

✅ What's Included

  • Simple setup: Uses a lightweight, fast embedding model
  • Automatic chunking: Smart content splitting with overlap for context
  • Semantic search: Find relevant code based on meaning, not just keywords
  • Graceful fallback: If vectorization fails, the chatbot falls back to text-only analysis (see the sketch after this list)
  • Memory efficient: In-memory storage suitable for single repository exploration
  • Clear feedback: Status messages show when vectorization is active
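
The graceful fallback amounts to a guard like the following (illustrative; `search` and `build_context` are the hypothetical helpers from the earlier sketches):

```python
try:
    context = build_context(general_analysis, search(user_question))
except Exception:
    # Vectorization failed (e.g., dependencies missing); use text-only analysis.
    context = general_analysis
```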

πŸ” How to Use

  1. Load any repository in the Repository Explorer tab
  2. Look for "Vector embeddings created" in the status message
  3. Ask questions - the chatbot will automatically use vector search
  4. Responses will include "MOST RELEVANT CODE SECTIONS" with similarity scores

📊 Example Output

When you ask "How do I use this repository?", you might get:

=== MOST RELEVANT CODE SECTIONS ===

--- Relevant Section 1 (similarity: 0.847, lines 25-75) ---
# Installation and Usage
...actual code from those lines...

--- Relevant Section 2 (similarity: 0.792, lines 150-200) ---  
def main():
    """Main usage example"""
...actual code from those lines...

Technical Details

  • Model: all-MiniLM-L6-v2 (384-dimensional embeddings; verifiable with the snippet below)
  • Chunk size: 500 lines with a 50-line overlap
  • Search: Top 3 most similar chunks per query
  • Storage: In-memory (cleared when loading new repository)
  • Fallback: Graceful degradation to text-only analysis if vectorization fails
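
You can confirm the embedding dimensionality yourself with the sentence-transformers API:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.get_sentence_embedding_dimension())  # 384
```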

Benefits

  1. Better Context: Finds relevant code sections even with natural language queries
  2. Specific Examples: Provides actual code snippets related to your question
  3. Line References: Shows exactly where information comes from
  4. Semantic Understanding: Understands intent, not just keyword matching
  5. Fast Setup: Lightweight model downloads quickly on first use

Limitations

  • Single Repository: Vector store is cleared when loading a new repository
  • Memory Usage: Keeps all embeddings in memory (suitable for exploration use case)
  • Model Size: ~80MB download for the embedding model (one-time)
  • No Persistence: Vectors are recreated each time you load a repository

This simple vectorization approach significantly improves the chatbot's ability to provide relevant, code-specific answers while keeping the implementation straightforward and fast.