# Repository Explorer - Vectorization Feature
## Overview
The Repository Explorer now includes **simple vectorization** to enhance the chatbot's ability to answer questions about loaded repositories. This feature uses semantic search to find the most relevant code sections for user queries.
## How It Works
### 1. **Content Chunking**
- Repository content is split into overlapping chunks (~500 lines each, with a 50-line overlap)
- Each chunk maintains metadata (repo ID, line numbers, chunk index)
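The chunking step above can be sketched in plain Python. This is an illustrative sketch, not the actual implementation; the function and field names are assumptions:

```python
# Sketch of overlapping chunking with per-chunk metadata.
# chunk_size and overlap mirror the values described in this document.
def chunk_lines(lines, chunk_size=500, overlap=50):
    """Split a list of source lines into overlapping chunks with metadata."""
    chunks = []
    step = chunk_size - overlap
    for index, start in enumerate(range(0, len(lines), step)):
        end = min(start + chunk_size, len(lines))
        chunks.append({
            "chunk_index": index,
            "start_line": start + 1,  # 1-based line numbers for references
            "end_line": end,
            "text": "\n".join(lines[start:end]),
        })
        if end == len(lines):
            break
    return chunks

# Example: a 1200-line file yields chunks covering lines
# 1-500, 451-950, and 901-1200.
demo = chunk_lines([f"line {i}" for i in range(1, 1201)])
print([(c["start_line"], c["end_line"]) for c in demo])
```

The 50-line overlap means a function that straddles a chunk boundary still appears whole in at least one chunk.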
### 2. **Embedding Creation**
- Uses the lightweight `all-MiniLM-L6-v2` SentenceTransformer model
- Creates vector embeddings for each chunk
- Embeddings capture the semantic meaning of code content
### 3. **Semantic Search**
- When you ask a question, it searches for the 3 most relevant chunks
- Uses cosine similarity to rank chunks by relevance
- Returns both similarity scores and line number references
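The ranking step can be sketched in pure Python. This is an illustrative sketch with toy 2-D vectors; in the actual feature the vectors come from the embedding model:

```python
# Cosine-similarity ranking: score each chunk vector against the query
# vector and keep the top k matches.
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k=3):
    """Return (chunk_index, similarity) pairs, best match first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

chunks = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [0.9, 0.1]]
result = top_k([1.0, 0.0], chunks)
print(result)  # chunk 0 scores 1.0, chunk 3 close behind, chunk 1 third
```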
### 4. **Enhanced Responses**
- The chatbot combines both the general repository analysis AND the most relevant code sections
- Provides specific code examples and implementation details
- References exact line numbers for better context
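A hypothetical sketch of how the retrieved chunks might be folded into the chatbot's context, matching the "MOST RELEVANT CODE SECTIONS" format this document describes (function and parameter names are assumptions):

```python
# Combine the general analysis with the top-ranked chunks into one
# context string, annotated with similarity scores and line ranges.
def build_context(analysis, relevant):
    """relevant: list of (similarity, start_line, end_line, text) tuples."""
    parts = [analysis, "", "=== MOST RELEVANT CODE SECTIONS ==="]
    for rank, (score, start, end, text) in enumerate(relevant, start=1):
        parts.append(
            f"--- Relevant Section {rank} (similarity: {score:.3f}, lines {start}-{end}) ---"
        )
        parts.append(text)
    return "\n".join(parts)

context = build_context(
    "General repository analysis...",
    [(0.847, 25, 75, "# Installation and Usage")],
)
print(context)
```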
## Installation
The vectorization feature requires additional dependencies:
```bash
pip install sentence-transformers numpy
```
These are already included in the updated `requirements.txt`.
## Testing
Run the test script to verify everything is working:
```bash
python test_vectorization.py
```
This will test:
- ✅ Dependencies import correctly
- ✅ SentenceTransformer model loads
- ✅ Embedding creation works
- ✅ Similarity calculations function
- ✅ Integration with the repo explorer
## Features
### ✅ **What's Included**
- **Simple setup**: Uses a lightweight, fast embedding model
- **Automatic chunking**: Smart content splitting with overlap for context
- **Semantic search**: Finds relevant code based on meaning, not just keywords
- **Graceful fallback**: If vectorization fails, falls back to text-only analysis
- **Memory efficient**: In-memory storage suitable for single-repository exploration
- **Clear feedback**: Status messages show when vectorization is active
### **How to Use**
1. Load any repository in the Repository Explorer tab
2. Look for "Vector embeddings created" in the status message
3. Ask questions - the chatbot will automatically use vector search
4. Responses will include "MOST RELEVANT CODE SECTIONS" with similarity scores
### **Example Output**
When you ask "How do I use this repository?", you might get:
```
=== MOST RELEVANT CODE SECTIONS ===
--- Relevant Section 1 (similarity: 0.847, lines 25-75) ---
# Installation and Usage
...actual code from those lines...
--- Relevant Section 2 (similarity: 0.792, lines 150-200) ---
def main():
    """Main usage example"""
...actual code from those lines...
```
## Technical Details
- **Model**: `all-MiniLM-L6-v2` (384-dimensional embeddings)
- **Chunk size**: 500 lines with a 50-line overlap
- **Search**: Top 3 most similar chunks per query
- **Storage**: In-memory (cleared when loading a new repository)
- **Fallback**: Graceful degradation to text-only analysis if vectorization fails
## Benefits
1. **Better Context**: Finds relevant code sections even with natural language queries
2. **Specific Examples**: Provides actual code snippets related to your question
3. **Line References**: Shows exactly where information comes from
4. **Semantic Understanding**: Understands intent, not just keyword matching
5. **Fast Setup**: Lightweight model downloads quickly on first use
## Limitations
- **Single Repository**: The vector store is cleared when loading a new repository
- **Memory Usage**: Keeps all embeddings in memory (suitable for the exploration use case)
- **Model Size**: ~80MB download for the embedding model (one-time)
- **No Persistence**: Vectors are recreated each time you load a repository

This simple vectorization approach significantly improves the chatbot's ability to provide relevant, code-specific answers while keeping the implementation straightforward and fast.