Repository Explorer - Vectorization Feature

Overview

The Repository Explorer now includes simple vectorization to enhance the chatbot's ability to answer questions about loaded repositories. This feature uses semantic search to find the most relevant code sections for user queries.

How It Works

1. Content Chunking

  • Repository content is split into overlapping chunks (~500 lines each, with a 50-line overlap)
  • Each chunk keeps its metadata (repo ID, line numbers, chunk index), as sketched below
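
A minimal sketch of this chunking scheme (the function name and metadata fields here are illustrative, not the app's actual implementation):

```python
def chunk_lines(text, repo_id, chunk_size=500, overlap=50):
    """Split text into overlapping line-based chunks with metadata."""
    lines = text.splitlines()
    step = chunk_size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(lines), step)):
        end = min(start + chunk_size, len(lines))
        chunks.append({
            "repo_id": repo_id,        # which repository the chunk came from
            "chunk_index": index,
            "start_line": start + 1,   # 1-based line numbers for references
            "end_line": end,
            "text": "\n".join(lines[start:end]),
        })
        if end == len(lines):          # reached the end; stop to avoid a tiny tail chunk
            break
    return chunks
```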

2. Embedding Creation

  • Uses the lightweight all-MiniLM-L6-v2 SentenceTransformer model
  • Creates a vector embedding for each chunk
  • Embeddings capture the semantic meaning of the code content (see the sketch below)
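
In code terms, the embedding step looks roughly like this (the SentenceTransformer calls are the library's real API; the surrounding variable names are illustrative and `chunks` comes from the sketch above):

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 is small (~80 MB download) and produces 384-dimensional vectors.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(
    [chunk["text"] for chunk in chunks],
    convert_to_numpy=True,
)  # shape: (num_chunks, 384)
```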

3. Semantic Search

  • When you ask a question, the chatbot searches for the 3 most relevant chunks
  • Uses cosine similarity to rank chunks by relevance
  • Returns both similarity scores and line-number references, as sketched below
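
A sketch of the ranking step, reusing `model`, `chunks`, and `embeddings` from the sketches above:

```python
import numpy as np

def search(query: str, top_k: int = 3):
    """Return the top_k (similarity, chunk) pairs for a query."""
    query_vec = model.encode([query], convert_to_numpy=True)[0]
    # Cosine similarity: dot product divided by the product of the norms.
    sims = embeddings @ query_vec / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_vec)
    )
    top = np.argsort(sims)[::-1][:top_k]   # indices of the 3 highest scores
    return [(float(sims[i]), chunks[i]) for i in top]
```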

4. Enhanced Responses

  • The chatbot combines the general repository analysis with the most relevant code sections
  • Provides specific code examples and implementation details
  • References exact line numbers for better context (see the sketch below)
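
The combined context handed to the chatbot might be assembled like this (the helper name is hypothetical; the format mirrors the example output shown later in this document):

```python
def build_context(general_analysis: str, results) -> str:
    """Append retrieved sections to the general analysis for the chatbot."""
    parts = [general_analysis, "=== MOST RELEVANT CODE SECTIONS ==="]
    for rank, (score, chunk) in enumerate(results, start=1):
        parts.append(
            f"--- Relevant Section {rank} "
            f"(similarity: {score:.3f}, "
            f"lines {chunk['start_line']}-{chunk['end_line']}) ---"
        )
        parts.append(chunk["text"])
    return "\n\n".join(parts)
```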

Installation

The vectorization feature requires additional dependencies:

pip install sentence-transformers numpy

These are already included in the updated requirements.txt.

Testing

Run the test script to verify everything is working:

python test_vectorization.py

This will test:

  • ✅ Dependencies import correctly
  • ✅ SentenceTransformer model loads
  • ✅ Embedding creation works
  • ✅ Similarity calculations function
  • ✅ Integration with repo explorer

Features

✅ What's Included

  • Simple setup: Uses a lightweight, fast embedding model
  • Automatic chunking: Smart content splitting with overlap for context
  • Semantic search: Find relevant code based on meaning, not just keywords
  • Graceful fallback: If vectorization fails, the chatbot falls back to text-only analysis (see the sketch after this list)
  • Memory efficient: In-memory storage suitable for single repository exploration
  • Clear feedback: Status messages show when vectorization is active
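
The graceful fallback amounts to a guard like the following (illustrative; `search` and `build_context` are the hypothetical helpers from the earlier sketches):

```python
try:
    context = build_context(general_analysis, search(user_question))
except Exception:
    # Vectorization failed (e.g., dependencies missing); use text-only analysis.
    context = general_analysis
```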

πŸ” How to Use

  1. Load any repository in the Repository Explorer tab
  2. Look for "Vector embeddings created" in the status message
  3. Ask questions - the chatbot will automatically use vector search
  4. Responses will include "MOST RELEVANT CODE SECTIONS" with similarity scores

📊 Example Output

When you ask "How do I use this repository?", you might get:

=== MOST RELEVANT CODE SECTIONS ===

--- Relevant Section 1 (similarity: 0.847, lines 25-75) ---
# Installation and Usage
...actual code from those lines...

--- Relevant Section 2 (similarity: 0.792, lines 150-200) ---  
def main():
    """Main usage example"""
...actual code from those lines...

Technical Details

  • Model: all-MiniLM-L6-v2 (384-dimensional embeddings; verifiable with the snippet below)
  • Chunk size: 500 lines with a 50-line overlap
  • Search: Top 3 most similar chunks per query
  • Storage: In-memory (cleared when loading new repository)
  • Fallback: Graceful degradation to text-only analysis if vectorization fails
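
You can confirm the embedding dimensionality yourself with the sentence-transformers API:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.get_sentence_embedding_dimension())  # 384
```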

Benefits

  1. Better Context: Finds relevant code sections even with natural language queries
  2. Specific Examples: Provides actual code snippets related to your question
  3. Line References: Shows exactly where information comes from
  4. Semantic Understanding: Understands intent, not just keyword matching
  5. Fast Setup: Lightweight model downloads quickly on first use

Limitations

  • Single Repository: Vector store is cleared when loading a new repository
  • Memory Usage: Keeps all embeddings in memory (suitable for exploration use case)
  • Model Size: ~80MB download for the embedding model (one-time)
  • No Persistence: Vectors are recreated each time you load a repository

This simple vectorization approach significantly improves the chatbot's ability to provide relevant, code-specific answers while keeping the implementation straightforward and fast.