# Repository Explorer - Vectorization Feature

## Overview

The Repository Explorer now includes **simple vectorization** to enhance the chatbot's ability to answer questions about loaded repositories. This feature uses semantic search to find the most relevant code sections for user queries.

## How It Works

### 1. **Content Chunking**
- Repository content is split into overlapping chunks (~500 lines each with 50 lines overlap)
- Each chunk maintains metadata (repo ID, line numbers, chunk index)
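The chunking scheme above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual implementation; the `chunk_lines` helper and the metadata dictionary shape are assumptions.

```python
def chunk_lines(lines, repo_id, chunk_size=500, overlap=50):
    """Split a file's lines into overlapping chunks with metadata.

    Hypothetical helper mirroring the scheme described above:
    ~500-line chunks, 50 lines of overlap, metadata per chunk.
    """
    chunks = []
    step = chunk_size - overlap
    for index, start in enumerate(range(0, len(lines), step)):
        end = min(start + chunk_size, len(lines))
        chunks.append({
            "repo_id": repo_id,
            "chunk_index": index,
            "start_line": start + 1,  # 1-based line numbers for display
            "end_line": end,
            "text": "\n".join(lines[start:end]),
        })
        if end == len(lines):
            break
    return chunks
```

The overlap means consecutive chunks share 50 lines, so a function split across a chunk boundary still appears whole in at least one chunk.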

### 2. **Embedding Creation** 
- Uses the lightweight `all-MiniLM-L6-v2` SentenceTransformer model
- Creates vector embeddings for each chunk 
- Embeddings capture semantic meaning of code content

### 3. **Semantic Search**
- When you ask a question, the system retrieves the 3 most relevant chunks
- Uses cosine similarity to rank chunks by relevance
- Returns both similarity scores and line number references
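The ranking step can be sketched with plain NumPy. The `top_k_chunks` helper below is hypothetical, but the math (normalize, dot product, sort descending, take the top 3) matches the description above:

```python
import numpy as np

def top_k_chunks(query_embedding, chunk_embeddings, k=3):
    """Rank chunks by cosine similarity to the query embedding.

    Returns (chunk_index, similarity) pairs, best first.
    Hypothetical helper illustrating the search described above.
    """
    q = query_embedding / np.linalg.norm(query_embedding)
    m = chunk_embeddings / np.linalg.norm(chunk_embeddings, axis=1, keepdims=True)
    sims = m @ q                          # cosine similarity per chunk
    order = np.argsort(sims)[::-1][:k]    # indices of the k highest scores
    return [(int(i), float(sims[i])) for i in order]
```

Cosine similarity ignores vector magnitude, so a long chunk and a short chunk about the same topic score similarly.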

### 4. **Enhanced Responses**
- The chatbot combines both the general repository analysis AND the most relevant code sections
- Provides specific code examples and implementation details
- References exact line numbers for better context
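Assembling the chatbot's context from the analysis plus the retrieved chunks might look like this. The `build_context` function and the shape of each result dict are assumptions for illustration; the section formatting follows the example output shown later in this document:

```python
def build_context(analysis, results):
    """Combine the general repository analysis with the top-ranked chunks.

    'results' is a list of dicts with 'similarity', 'start_line',
    'end_line', and 'text' keys (hypothetical shape).
    """
    parts = [analysis, "\n=== MOST RELEVANT CODE SECTIONS ===\n"]
    for i, r in enumerate(results, 1):
        parts.append(
            f"--- Relevant Section {i} (similarity: {r['similarity']:.3f}, "
            f"lines {r['start_line']}-{r['end_line']}) ---\n{r['text']}\n"
        )
    return "\n".join(parts)
```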

## Installation

The vectorization feature requires additional dependencies:

```bash
pip install sentence-transformers numpy
```

These are already included in the updated `requirements.txt`.

## Testing

Run the test script to verify everything is working:

```bash
python test_vectorization.py
```

This will test:
- βœ… Dependencies import correctly  
- βœ… SentenceTransformer model loads
- βœ… Embedding creation works
- βœ… Similarity calculations function
- βœ… Integration with repo explorer

## Features

### βœ… **What's Included**
- **Simple setup**: Uses a lightweight, fast embedding model
- **Automatic chunking**: Smart content splitting with overlap for context
- **Semantic search**: Find relevant code based on meaning, not just keywords  
- **Graceful fallback**: If vectorization fails, falls back to text-only analysis
- **Memory efficient**: In-memory storage suitable for single repository exploration
- **Clear feedback**: Status messages show when vectorization is active

### πŸ” **How to Use**
1. Load any repository in the Repository Explorer tab
2. Look for "Vector embeddings created" in the status message
3. Ask questions - the chatbot will automatically use vector search
4. Responses will include "MOST RELEVANT CODE SECTIONS" with similarity scores

### πŸ“Š **Example Output**
When you ask "How do I use this repository?", you might get:

```
=== MOST RELEVANT CODE SECTIONS ===

--- Relevant Section 1 (similarity: 0.847, lines 25-75) ---
# Installation and Usage
...actual code from those lines...

--- Relevant Section 2 (similarity: 0.792, lines 150-200) ---  
def main():
    """Main usage example"""
...actual code from those lines...
```

## Technical Details

- **Model**: `all-MiniLM-L6-v2` (384-dimensional embeddings)
- **Chunk size**: 500 lines with 50 line overlap
- **Search**: Top 3 most similar chunks per query
- **Storage**: In-memory (cleared when loading new repository)
- **Fallback**: Graceful degradation to text-only analysis if vectorization fails

## Benefits

1. **Better Context**: Finds relevant code sections even with natural language queries
2. **Specific Examples**: Provides actual code snippets related to your question  
3. **Line References**: Shows exactly where information comes from
4. **Semantic Understanding**: Understands intent, not just keyword matching
5. **Fast Setup**: Lightweight model downloads quickly on first use

## Limitations

- **Single Repository**: Vector store is cleared when loading a new repository
- **Memory Usage**: Keeps all embeddings in memory (suitable for exploration use case)
- **Model Size**: ~80MB download for the embedding model (one-time)
- **No Persistence**: Vectors are recreated each time you load a repository

This simple vectorization approach significantly improves the chatbot's ability to provide relevant, code-specific answers while keeping the implementation straightforward and fast.