vectorization
- VECTORIZATION_README.md +108 -0
- repo_explorer.py +193 -8
- repo_explorer_old.py +200 -0
- requirements.txt +3 -1
- test_vectorization.py +135 -0
VECTORIZATION_README.md
ADDED
@@ -0,0 +1,108 @@
+# Repository Explorer - Vectorization Feature
+
+## Overview
+
+The Repository Explorer now includes **simple vectorization** to enhance the chatbot's ability to answer questions about loaded repositories. This feature uses semantic search to find the most relevant code sections for user queries.
+
+## How It Works
+
+### 1. **Content Chunking**
+- Repository content is split into overlapping chunks (~500 lines each with 50 lines overlap)
+- Each chunk maintains metadata (repo ID, line numbers, chunk index)
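The chunking step described above can be sketched as follows. This is a minimal illustration that mirrors the sizes stated in the README (500-line chunks, 50-line overlap). The function name `chunk_lines` and the demo repository ID are hypothetical, and the `overlap >= chunk_size` guard is an addition for the sketch. Here `start_line` is a zero-based index and `end_line` is exclusive.

```python
from typing import Dict, List, Tuple


def chunk_lines(text: str, repo_id: str, chunk_size: int = 500,
                overlap: int = 50) -> Tuple[List[str], List[Dict]]:
    """Split text into overlapping line-based chunks with per-chunk metadata."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    lines = text.split("\n")
    chunks: List[str] = []
    metadata: List[Dict] = []
    step = chunk_size - overlap  # each chunk starts `step` lines after the previous one
    for start in range(0, len(lines), step):
        chunk_text = "\n".join(lines[start:start + chunk_size])
        if chunk_text.strip():  # skip whitespace-only chunks
            chunks.append(chunk_text)
            metadata.append({
                "repo_id": repo_id,
                "chunk_index": len(chunks) - 1,
                "start_line": start,
                "end_line": min(start + chunk_size, len(lines)),
            })
    return chunks, metadata


# Hypothetical 1000-line input: chunks start at lines 0, 450 and 900.
demo_text = "\n".join(f"line {n}" for n in range(1000))
chunks, meta = chunk_lines(demo_text, "user/demo-repo")
print([(m["start_line"], m["end_line"]) for m in meta])
# → [(0, 500), (450, 950), (900, 1000)]
```

Consecutive chunks share 50 lines, so a function that straddles a chunk boundary is likely to appear whole in at least one chunk.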
+
+### 2. **Embedding Creation**
+- Uses the lightweight `all-MiniLM-L6-v2` SentenceTransformer model
+- Creates vector embeddings for each chunk
+- Embeddings capture semantic meaning of code content
+
+### 3. **Semantic Search**
+- When you ask a question, it searches for the 3 most relevant chunks
+- Uses cosine similarity to rank chunks by relevance
+- Returns both similarity scores and line number references
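The ranking step can be illustrated without the embedding model. The sketch below uses plain Python lists as stand-ins for embeddings and computes cosine similarity, `dot(q, c) / (|q| * |c|)`, then keeps the top 3. The helper names `cosine_similarity` and `top_k` and the toy vectors are hypothetical. The zero-norm guard is an addition for the sketch and is not something the README describes.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], chunk_vecs: List[Sequence[float]],
          k: int = 3) -> List[Tuple[int, float]]:
    """Return (chunk index, similarity) pairs for the k most similar chunks."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-D "embeddings"; real ones would have 384 dimensions.
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
ranked = top_k([1.0, 0.0], toy_chunks, k=3)
print([index for index, _ in ranked])  # → [0, 2, 1]
```

Because cosine similarity depends only on direction, a long chunk and a short chunk about the same topic can score alike.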
+
+### 4. **Enhanced Responses**
+- The chatbot combines both the general repository analysis AND the most relevant code sections
+- Provides specific code examples and implementation details
+- References exact line numbers for better context
+
+## Installation
+
+The vectorization feature requires additional dependencies:
+
+```bash
+pip install sentence-transformers numpy
+```
+
+These are already included in the updated `requirements.txt`.
+
+## Testing
+
+Run the test script to verify everything is working:
+
+```bash
+python test_vectorization.py
+```
+
+This will test:
+- ✅ Dependencies import correctly
+- ✅ SentenceTransformer model loads
+- ✅ Embedding creation works
+- ✅ Similarity calculations function
+- ✅ Integration with repo explorer
+
+## Features
+
+### ✅ **What's Included**
+- **Simple setup**: Uses a lightweight, fast embedding model
+- **Automatic chunking**: Smart content splitting with overlap for context
+- **Semantic search**: Find relevant code based on meaning, not just keywords
+- **Graceful fallback**: If vectorization fails, falls back to text-only analysis
+- **Memory efficient**: In-memory storage suitable for single repository exploration
+- **Clear feedback**: Status messages show when vectorization is active
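The "graceful fallback" item can be sketched as a small wrapper. This is only an illustration of the pattern and does not claim to be the commit's exact control flow. `load_with_fallback` and `broken_vectorizer` are hypothetical names, and the status strings are adapted from the status messages in this commit.

```python
import logging
from typing import Callable, Tuple

logger = logging.getLogger(__name__)


def load_with_fallback(repo_content: str,
                       vectorize: Callable[[str], bool]) -> Tuple[bool, str]:
    """Try to vectorize; on failure, report text-only mode instead of raising."""
    try:
        ok = vectorize(repo_content)
    except Exception as exc:  # e.g. ImportError when sentence-transformers is missing
        logger.error("Vectorization failed: %s", exc)
        ok = False
    status = ("Vector embeddings created for semantic search."
              if ok else "Vectorization failed - using text-only analysis.")
    return ok, status


def broken_vectorizer(_: str) -> bool:
    """Stand-in for a vectorizer whose dependency is not installed."""
    raise ImportError("sentence-transformers package is required for vectorization")


print(load_with_fallback("print('hi')", broken_vectorizer))
print(load_with_fallback("print('hi')", lambda text: True))
```

For the fallback to be visible to the user, a failure inside the vectorizer has to surface as either an exception or a `False` return value.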
+
+### **How to Use**
+1. Load any repository in the Repository Explorer tab
+2. Look for "Vector embeddings created" in the status message
+3. Ask questions - the chatbot will automatically use vector search
+4. Responses will include "MOST RELEVANT CODE SECTIONS" with similarity scores
+
+### **Example Output**
+When you ask "How do I use this repository?", you might get:
+
+```
+=== MOST RELEVANT CODE SECTIONS ===
+
+--- Relevant Section 1 (similarity: 0.847, lines 25-75) ---
+# Installation and Usage
+...actual code from those lines...
+
+--- Relevant Section 2 (similarity: 0.792, lines 150-200) ---
+def main():
+    """Main usage example"""
+    ...actual code from those lines...
+```
+
+## Technical Details
+
+- **Model**: `all-MiniLM-L6-v2` (384-dimensional embeddings)
+- **Chunk size**: 500 lines with 50 line overlap
+- **Search**: Top 3 most similar chunks per query
+- **Storage**: In-memory (cleared when loading new repository)
+- **Fallback**: Graceful degradation to text-only analysis if vectorization fails
+
+## Benefits
+
+1. **Better Context**: Finds relevant code sections even with natural language queries
+2. **Specific Examples**: Provides actual code snippets related to your question
+3. **Line References**: Shows exactly where information comes from
+4. **Semantic Understanding**: Understands intent, not just keyword matching
+5. **Fast Setup**: Lightweight model downloads quickly on first use
+
+## Limitations
+
+- **Single Repository**: Vector store is cleared when loading a new repository
+- **Memory Usage**: Keeps all embeddings in memory (suitable for exploration use case)
+- **Model Size**: ~80MB download for the embedding model (one-time)
+- **No Persistence**: Vectors are recreated each time you load a repository
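To put the memory limitation in perspective, here is a rough back-of-the-envelope estimate. It takes the 384 dimensions and the 500/50 chunking from the README and assumes 4-byte (float32) embedding values, which is an assumption rather than something the README states. The function names and the 100,000-line repository size are hypothetical.

```python
import math

EMBEDDING_DIM = 384    # from the README's Technical Details
BYTES_PER_VALUE = 4    # assumption: embeddings are stored as float32


def estimate_chunks(total_lines: int, chunk_size: int = 500, overlap: int = 50) -> int:
    """Upper bound on the chunk count for a stride of chunk_size - overlap."""
    return math.ceil(total_lines / (chunk_size - overlap))


def estimate_embedding_bytes(total_lines: int) -> int:
    """Approximate memory held by the embedding vectors alone."""
    return estimate_chunks(total_lines) * EMBEDDING_DIM * BYTES_PER_VALUE


print(estimate_chunks(100_000), estimate_embedding_bytes(100_000))
# → 223 342528
```

Under these assumptions the vectors for a 100,000-line repository take roughly a third of a megabyte. The chunk texts, which are also kept in memory and overlap by 50 lines, will usually take more space than the embeddings.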
+
+This simple vectorization approach significantly improves the chatbot's ability to provide relevant, code-specific answers while keeping the implementation straightforward and fast.
repo_explorer.py
CHANGED
@@ -2,12 +2,136 @@ import gradio as gr
 import os
 import logging
 from typing import List, Dict, Tuple
+import numpy as np
 from analyzer import combine_repo_files_for_llm, handle_load_repository
 from hf_utils import download_filtered_space_files
 
 # Setup logger
 logger = logging.getLogger(__name__)
 
+class SimpleVectorStore:
+    """Simple in-memory vector store for repository chunks."""
+
+    def __init__(self):
+        self.chunks = []
+        self.embeddings = []
+        self.chunk_metadata = []
+        self.model = None
+
+    def _get_embedding_model(self):
+        """Lazy load the embedding model."""
+        if self.model is None:
+            try:
+                from sentence_transformers import SentenceTransformer
+                self.model = SentenceTransformer('all-MiniLM-L6-v2')  # Lightweight, fast model
+                logger.info("Loaded SentenceTransformer model for vectorization")
+            except ImportError:
+                logger.error("sentence-transformers not installed. Install with: pip install sentence-transformers")
+                raise ImportError("sentence-transformers package is required for vectorization")
+        return self.model
+
+    def add_chunks(self, chunks: List[str], metadata: List[Dict] = None):
+        """Add text chunks and create embeddings."""
+        try:
+            model = self._get_embedding_model()
+            embeddings = model.encode(chunks, convert_to_tensor=False)
+
+            self.chunks.extend(chunks)
+            self.embeddings.extend(embeddings)
+            self.chunk_metadata.extend(metadata or [{} for _ in chunks])
+
+            logger.info(f"Added {len(chunks)} chunks to vector store")
+        except Exception as e:
+            logger.error(f"Error adding chunks to vector store: {e}")
+
+    def search(self, query: str, top_k: int = 3) -> List[Tuple[str, float, Dict]]:
+        """Search for similar chunks using cosine similarity."""
+        if not self.chunks or not self.embeddings:
+            return []
+
+        try:
+            model = self._get_embedding_model()
+            query_embedding = model.encode([query], convert_to_tensor=False)[0]
+
+            # Calculate cosine similarities
+            similarities = []
+            for i, chunk_embedding in enumerate(self.embeddings):
+                similarity = np.dot(query_embedding, chunk_embedding) / (
+                    np.linalg.norm(query_embedding) * np.linalg.norm(chunk_embedding)
+                )
+                similarities.append((self.chunks[i], similarity, self.chunk_metadata[i]))
+
+            # Sort by similarity and return top_k
+            similarities.sort(key=lambda x: x[1], reverse=True)
+            return similarities[:top_k]
+
+        except Exception as e:
+            logger.error(f"Error searching vector store: {e}")
+            return []
+
+    def clear(self):
+        """Clear all stored data."""
+        self.chunks = []
+        self.embeddings = []
+        self.chunk_metadata = []
+
+    def get_stats(self) -> Dict:
+        """Get statistics about the vector store."""
+        return {
+            'total_chunks': len(self.chunks),
+            'total_embeddings': len(self.embeddings),
+            'model_loaded': self.model is not None
+        }
+
+# Global vector store instance
+vector_store = SimpleVectorStore()
+
+def vectorize_repository_content(repo_content: str, repo_id: str, chunk_size: int = 500) -> bool:
+    """
+    Vectorize repository content by splitting into chunks and creating embeddings.
+
+    Args:
+        repo_content: The combined repository content
+        repo_id: Repository identifier
+        chunk_size: Number of lines per chunk
+
+    Returns:
+        bool: True if vectorization was successful
+    """
+    try:
+        # Clear previous data
+        vector_store.clear()
+
+        lines = repo_content.split('\n')
+        chunks = []
+        metadata = []
+
+        # Split into chunks with overlap for better context
+        overlap = 50  # lines of overlap between chunks
+
+        for i in range(0, len(lines), chunk_size - overlap):
+            chunk_lines = lines[i:i + chunk_size]
+            chunk_text = '\n'.join(chunk_lines)
+
+            if chunk_text.strip():  # Only add non-empty chunks
+                chunks.append(chunk_text)
+                metadata.append({
+                    'repo_id': repo_id,
+                    'chunk_index': len(chunks) - 1,
+                    'start_line': i,
+                    'end_line': min(i + chunk_size, len(lines))
+                })
+
+        # Add chunks to vector store
+        vector_store.add_chunks(chunks, metadata)
+
+        logger.info(f"Successfully vectorized {len(chunks)} chunks for repository {repo_id}")
+        return True
+
+    except Exception as e:
+        logger.error(f"Error vectorizing repository content: {e}")
+        return False
+
 def create_repo_explorer_tab() -> Tuple[Dict[str, gr.components.Component], Dict[str, gr.State]]:
     """
     Creates the Repo Explorer tab content and returns the component references and state variables.
@@ -35,8 +159,8 @@ def create_repo_explorer_tab() -> Tuple[Dict[str, gr.components.Component], Dict
         repo_status_display = gr.Textbox(
             label="Repository Status",
             interactive=False,
-            lines=
-            info="Current repository loading status and
+            lines=4,
+            info="Current repository loading status and vectorization info"
         )
 
     with gr.Row():
@@ -101,13 +225,26 @@ def handle_repo_user_message(user_message: str, history: List[Dict[str, str]], r
     return history, ""
 
 def handle_repo_bot_response(history: List[Dict[str, str]], repo_context_summary: str, repo_id: str) -> List[Dict[str, str]]:
-    """Generate bot response for repo-specific questions using comprehensive context."""
+    """Generate bot response for repo-specific questions using comprehensive context and vector search."""
     if not history or history[-1]["role"] != "user" or not repo_context_summary.strip():
         return history
 
     user_message = history[-1]["content"]
 
-    #
+    # Use vector search to find relevant chunks
+    relevant_chunks = vector_store.search(user_message, top_k=3)
+
+    # Build enhanced context using vector search results
+    vector_context = ""
+    if relevant_chunks:
+        vector_context = "\n\n=== MOST RELEVANT CODE SECTIONS ===\n"
+        for i, (chunk, similarity, metadata) in enumerate(relevant_chunks):
+            chunk_id = metadata.get('chunk_index', i)
+            start_line = metadata.get('start_line', 'unknown')
+            end_line = metadata.get('end_line', 'unknown')
+            vector_context += f"\n--- Relevant Section {i+1} (similarity: {similarity:.3f}, lines {start_line}-{end_line}) ---\n{chunk}\n"
+
+    # Create a specialized prompt using both comprehensive context and vector search results
     repo_system_prompt = f"""You are an expert assistant for the Hugging Face repository '{repo_id}'.
 You have comprehensive knowledge about this repository based on detailed analysis of all its files and components.
 
@@ -115,11 +252,14 @@ Use the following comprehensive analysis to answer user questions accurately and
 
 {repo_context_summary}
 
+{vector_context}
+
 Instructions:
 - Answer questions clearly and conversationally about this specific repository
 - Reference specific components, functions, or features when relevant
 - Provide practical guidance on installation, usage, and implementation
-- If asked about code details, refer to the analysis above
+- If asked about code details, refer to the analysis above and the relevant code sections
+- Use the most relevant code sections to provide specific examples and implementation details
 - Be helpful and informative while staying focused on this repository
 - If something isn't covered in the analysis, acknowledge the limitation
 
@@ -150,11 +290,56 @@ Answer the user's question based on your comprehensive knowledge of this reposit
 
     return history
 
+def handle_load_repository_with_vectorization(repo_id: str) -> Tuple[str, str]:
+    """Load repository and create both context summary and vector embeddings."""
+    if not repo_id.strip():
+        return "Status: Please enter a repository ID.", ""
+
+    try:
+        logger.info(f"Loading repository with vectorization: {repo_id}")
+
+        # Download and process the repository (existing logic)
+        try:
+            download_filtered_space_files(repo_id, local_dir="repo_files", file_extensions=['.py', '.md', '.txt'])
+            combined_text_path = combine_repo_files_for_llm()
+        except Exception as e:
+            logger.error(f"Error downloading repository {repo_id}: {e}")
+            error_status = f"❌ Error downloading repository: {e}"
+            return error_status, ""
+
+        # Read the combined content
+        with open(combined_text_path, "r", encoding="utf-8") as f:
+            repo_content = f.read()
+
+        # Create vectorized representation
+        vectorization_success = vectorize_repository_content(repo_content, repo_id)
+
+        # Get the original context summary
+        from analyzer import create_repo_context_summary
+        context_summary = create_repo_context_summary(repo_content, repo_id)
+
+        # Update status message
+        if vectorization_success:
+            status = f"✅ Repository '{repo_id}' loaded successfully!\nFiles processed and ready for exploration.\nVector embeddings created for semantic search.\n💬 You can now ask questions about this repository."
+        else:
+            status = f"✅ Repository '{repo_id}' loaded successfully!\nFiles processed and ready for exploration.\n⚠️ Vectorization failed - using text-only analysis.\n💬 You can now ask questions about this repository."
+
+        logger.info(f"Repository {repo_id} loaded and processed successfully")
+        return status, context_summary
+
+    except Exception as e:
+        logger.error(f"Error loading repository {repo_id}: {e}")
+        error_status = f"❌ Error loading repository: {e}"
+        return error_status, ""
+
 def initialize_repo_chatbot(repo_status: str, repo_id: str, repo_context_summary: str) -> List[Dict[str, str]]:
     """Initialize the repository chatbot with a welcome message after successful repo loading."""
     # Only initialize if repository was loaded successfully
     if repo_context_summary.strip() and "successfully" in repo_status.lower():
-
+        # Check if vectorization was successful
+        vectorization_status = "**Enhanced with vector search** for finding relevant code sections" if "Vector embeddings created" in repo_status else "**Text-based analysis** (vector search unavailable)"
+
+        welcome_msg = f"Welcome! I've successfully analyzed the **{repo_id}** repository.\n\n🧠 **I now have comprehensive knowledge of:**\n• All files and code structure\n• Key features and capabilities\n• Installation and usage instructions\n• Architecture and implementation details\n• Dependencies and requirements\n\n{vectorization_status}\n\n💬 **Ask me anything about this repository!** \nFor example:\n• \"What does this repository do?\"\n• \"How do I install and use it?\"\n• \"What are the main components?\"\n• \"Show me usage examples\"\n\nWhat would you like to know?"
         return [{"role": "assistant", "content": welcome_msg}]
     else:
         # Keep chatbot empty if loading failed
@@ -163,9 +348,9 @@ def initialize_repo_chatbot(repo_status: str, repo_id: str, repo_context_summary
 def setup_repo_explorer_events(components: Dict[str, gr.components.Component], states: Dict[str, gr.State]):
     """Setup event handlers for the repo explorer components."""
 
-    # Load repository event
+    # Load repository event with vectorization
     components["load_repo_btn"].click(
-        fn=
+        fn=handle_load_repository_with_vectorization,
         inputs=[components["repo_explorer_input"]],
         outputs=[components["repo_status_display"], states["repo_context_summary"]]
     ).then(
repo_explorer_old.py
ADDED
@@ -0,0 +1,200 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
import gradio as gr
|
2 |
+
import os
|
3 |
+
import logging
|
4 |
+
from typing import List, Dict, Tuple
|
5 |
+
from analyzer import combine_repo_files_for_llm, handle_load_repository
|
6 |
+
from hf_utils import download_filtered_space_files
|
7 |
+
|
8 |
+
# Setup logger
|
9 |
+
logger = logging.getLogger(__name__)
|
10 |
+
|
11 |
+
def create_repo_explorer_tab() -> Tuple[Dict[str, gr.components.Component], Dict[str, gr.State]]:
|
12 |
+
"""
|
13 |
+
Creates the Repo Explorer tab content and returns the component references and state variables.
|
14 |
+
"""
|
15 |
+
|
16 |
+
# State variables for repo explorer
|
17 |
+
states = {
|
18 |
+
"repo_context_summary": gr.State(""),
|
19 |
+
"current_repo_id": gr.State("")
|
20 |
+
}
|
21 |
+
|
22 |
+
gr.Markdown("### ποΈ Deep Dive into a Specific Repository")
|
23 |
+
|
24 |
+
with gr.Row():
|
25 |
+
with gr.Column(scale=2):
|
26 |
+
repo_explorer_input = gr.Textbox(
|
27 |
+
label="π Repository ID",
|
28 |
+
placeholder="microsoft/DialoGPT-medium",
|
29 |
+
info="Enter a Hugging Face repository ID to explore"
|
30 |
+
)
|
31 |
+
with gr.Column(scale=1):
|
32 |
+
load_repo_btn = gr.Button("π Load Repository", variant="primary", size="lg")
|
33 |
+
|
34 |
+
with gr.Row():
|
35 |
+
repo_status_display = gr.Textbox(
|
36 |
+
label="π Repository Status",
|
37 |
+
interactive=False,
|
38 |
+
lines=3,
|
39 |
+
info="Current repository loading status and basic info"
|
40 |
+
)
|
41 |
+
|
42 |
+
with gr.Row():
|
43 |
+
with gr.Column(scale=2):
|
44 |
+
repo_chatbot = gr.Chatbot(
|
45 |
+
label="π€ Repository Assistant",
|
46 |
+
height=400,
|
47 |
+
type="messages",
|
48 |
+
avatar_images=(
|
49 |
+
"https://cdn-icons-png.flaticon.com/512/149/149071.png",
|
50 |
+
"https://huggingface.co/datasets/huggingface/brand-assets/resolve/main/hf-logo.png"
|
51 |
+
),
|
52 |
+
show_copy_button=True,
|
53 |
+
value=[] # Start empty - welcome message will appear only after repo is loaded
|
54 |
+
)
|
55 |
+
|
56 |
+
with gr.Row():
|
57 |
+
repo_msg_input = gr.Textbox(
|
58 |
+
label="π Ask about this repository",
|
59 |
+
placeholder="What does this repository do? How do I use it?",
|
60 |
+
lines=1,
|
61 |
+
scale=4,
|
62 |
+
info="Ask anything about the loaded repository"
|
63 |
+
)
|
64 |
+
repo_send_btn = gr.Button("π€ Send", variant="primary", scale=1)
|
65 |
+
|
66 |
+
# with gr.Column(scale=1):
|
67 |
+
# # Repository content preview
|
68 |
+
# repo_content_display = gr.Textbox(
|
69 |
+
# label="π Repository Content Preview",
|
70 |
+
# lines=20,
|
71 |
+
# show_copy_button=True,
|
72 |
+
# interactive=False,
|
73 |
+
# info="Overview of the loaded repository structure and content"
|
74 |
+
# )
|
75 |
+
|
76 |
+
# Component references
|
77 |
+
components = {
|
78 |
+
"repo_explorer_input": repo_explorer_input,
|
79 |
+
"load_repo_btn": load_repo_btn,
|
80 |
+
"repo_status_display": repo_status_display,
|
81 |
+
"repo_chatbot": repo_chatbot,
|
82 |
+
"repo_msg_input": repo_msg_input,
|
83 |
+
"repo_send_btn": repo_send_btn,
|
84 |
+
# "repo_content_display": repo_content_display
|
85 |
+
}
|
86 |
+
|
87 |
+
return components, states
|
88 |
+
|
89 |
+
def handle_repo_user_message(user_message: str, history: List[Dict[str, str]], repo_context_summary: str, repo_id: str) -> Tuple[List[Dict[str, str]], str]:
|
90 |
+
"""Handle user messages in the repo-specific chatbot."""
|
91 |
+
if not repo_context_summary.strip():
|
92 |
+
return history, ""
|
93 |
+
|
94 |
+
# Initialize with repository-specific welcome message if empty
|
95 |
+
if not history:
|
96 |
+
welcome_msg = f"Hello! I'm your assistant for the '{repo_id}' repository. I have analyzed all the files and created a comprehensive understanding of this repository. I'm ready to answer any questions about its functionality, usage, architecture, and more. What would you like to know?"
|
97 |
+
history = [{"role": "assistant", "content": welcome_msg}]
|
98 |
+
|
99 |
+
if user_message:
|
100 |
+
history.append({"role": "user", "content": user_message})
|
101 |
+
return history, ""
|
102 |
+
|
103 |
+
def handle_repo_bot_response(history: List[Dict[str, str]], repo_context_summary: str, repo_id: str) -> List[Dict[str, str]]:
|
104 |
+
"""Generate bot response for repo-specific questions using comprehensive context."""
|
105 |
+
if not history or history[-1]["role"] != "user" or not repo_context_summary.strip():
|
106 |
+
return history
|
107 |
+
|
108 |
+
user_message = history[-1]["content"]
|
109 |
+
|
110 |
+
# Create a specialized prompt using the comprehensive context summary
|
111 |
+
repo_system_prompt = f"""You are an expert assistant for the Hugging Face repository '{repo_id}'.
|
112 |
+
You have comprehensive knowledge about this repository based on detailed analysis of all its files and components.
|
113 |
+
|
114 |
+
Use the following comprehensive analysis to answer user questions accurately and helpfully:
|
115 |
+
|
116 |
+
{repo_context_summary}
|
117 |
+
|
118 |
+
Instructions:
|
119 |
+
- Answer questions clearly and conversationally about this specific repository
|
120 |
+
- Reference specific components, functions, or features when relevant
|
121 |
+
- Provide practical guidance on installation, usage, and implementation
|
122 |
+
- If asked about code details, refer to the analysis above
|
123 |
+
- Be helpful and informative while staying focused on this repository
|
124 |
+
- If something isn't covered in the analysis, acknowledge the limitation
|
125 |
+
|
126 |
+
Answer the user's question based on your comprehensive knowledge of this repository."""
|
127 |
+
|
128 |
+
try:
|
129 |
+
from openai import OpenAI
|
130 |
+
client = OpenAI(api_key=os.getenv("modal_api"))
|
131 |
+
client.base_url = os.getenv("base_url")
|
132 |
+
|
133 |
+
response = client.chat.completions.create(
|
134 |
+
model="Orion-zhen/Qwen2.5-Coder-7B-Instruct-AWQ",
|
135 |
+
messages=[
|
136 |
+
{"role": "system", "content": repo_system_prompt},
|
137 |
+
{"role": "user", "content": user_message}
|
138 |
+
],
|
139 |
+
max_tokens=1024,
|
140 |
+
temperature=0.7
|
141 |
+
)
|
142 |
+
|
143 |
+
bot_response = response.choices[0].message.content
|
144 |
+
history.append({"role": "assistant", "content": bot_response})
|
145 |
+
|
146 |
+
except Exception as e:
|
147 |
+
logger.error(f"Error generating repo bot response: {e}")
|
148 |
+
error_response = f"I apologize, but I encountered an error while processing your question: {e}"
|
149 |
+
history.append({"role": "assistant", "content": error_response})
|
150 |
+
|
151 |
+
return history
|
152 |
+
|
153 |
+
def initialize_repo_chatbot(repo_status: str, repo_id: str, repo_context_summary: str) -> List[Dict[str, str]]:
|
154 |
+
"""Initialize the repository chatbot with a welcome message after successful repo loading."""
|
155 |
+
# Only initialize if repository was loaded successfully
|
156 |
+
if repo_context_summary.strip() and "successfully" in repo_status.lower():
|
157 |
+
        welcome_msg = f"👋 Welcome! I've successfully analyzed the **{repo_id}** repository.\n\n🧠 **I now have comprehensive knowledge of:**\n• All files and code structure\n• Key features and capabilities\n• Installation and usage instructions\n• Architecture and implementation details\n• Dependencies and requirements\n\n💬 **Ask me anything about this repository!** \nFor example:\n• \"What does this repository do?\"\n• \"How do I install and use it?\"\n• \"What are the main components?\"\n• \"Show me usage examples\"\n\nWhat would you like to know? 🤔"
        return [{"role": "assistant", "content": welcome_msg}]
    else:
        # Keep chatbot empty if loading failed
        return []

def setup_repo_explorer_events(components: Dict[str, gr.components.Component], states: Dict[str, gr.State]):
    """Setup event handlers for the repo explorer components."""

    # Load repository event
    components["load_repo_btn"].click(
        fn=handle_load_repository,
        inputs=[components["repo_explorer_input"]],
        outputs=[components["repo_status_display"], states["repo_context_summary"]]
    ).then(
        fn=lambda repo_id: repo_id,
        inputs=[components["repo_explorer_input"]],
        outputs=[states["current_repo_id"]]
    ).then(
        fn=initialize_repo_chatbot,
        inputs=[components["repo_status_display"], states["current_repo_id"], states["repo_context_summary"]],
        outputs=[components["repo_chatbot"]]
    )

    # Chat message submission events
    components["repo_msg_input"].submit(
        fn=handle_repo_user_message,
        inputs=[components["repo_msg_input"], components["repo_chatbot"], states["repo_context_summary"], states["current_repo_id"]],
        outputs=[components["repo_chatbot"], components["repo_msg_input"]]
    ).then(
        fn=handle_repo_bot_response,
        inputs=[components["repo_chatbot"], states["repo_context_summary"], states["current_repo_id"]],
        outputs=[components["repo_chatbot"]]
    )

    components["repo_send_btn"].click(
        fn=handle_repo_user_message,
        inputs=[components["repo_msg_input"], components["repo_chatbot"], states["repo_context_summary"], states["current_repo_id"]],
        outputs=[components["repo_chatbot"], components["repo_msg_input"]]
    ).then(
        fn=handle_repo_bot_response,
        inputs=[components["repo_chatbot"], states["repo_context_summary"], states["current_repo_id"]],
        outputs=[components["repo_chatbot"]]
    )
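Note on the wiring above: the `submit` and `click` registrations attach the same two-step chain (user message, then bot response). A minimal sketch of factoring that into one helper follows. It assumes only what the code above already relies on, namely that each event listener accepts `fn`, `inputs`, and `outputs` keywords and returns an object exposing `.then(...)`. The name `wire_chat_events` is hypothetical and is not part of this commit.

```python
from typing import Any, Callable, Dict


def wire_chat_events(
    components: Dict[str, Any],
    states: Dict[str, Any],
    user_fn: Callable[..., Any],
    bot_fn: Callable[..., Any],
) -> None:
    """Attach the same user-message -> bot-response chain to both triggers.

    Hypothetical helper: it assumes each trigger is a listener method that
    takes fn/inputs/outputs keywords and returns an object with .then().
    """
    chat = components["repo_chatbot"]
    msg = components["repo_msg_input"]
    context = [states["repo_context_summary"], states["current_repo_id"]]

    # Pressing Enter in the textbox and clicking Send should behave identically.
    triggers = (msg.submit, components["repo_send_btn"].click)
    for trigger in triggers:
        trigger(
            fn=user_fn,
            inputs=[msg, chat, *context],
            outputs=[chat, msg],
        ).then(
            fn=bot_fn,
            inputs=[chat, *context],
            outputs=[chat],
        )
```

Called as `wire_chat_events(components, states, handle_repo_user_message, handle_repo_bot_response)`, this would register the same chains as the two explicit blocks, so a later change to the inputs only has to be made once.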
requirements.txt
CHANGED
@@ -2,4 +2,6 @@ gradio
 pandas
 openai
 regex
-huggingface_hub
+huggingface_hub
+sentence-transformers
+numpy
test_vectorization.py
ADDED
@@ -0,0 +1,135 @@
#!/usr/bin/env python3
"""
Simple test script to verify vectorization functionality.
Run this to check if sentence-transformers is working correctly.
"""

import os
import sys

def test_vectorization():
    """Test the vectorization functionality."""
    print("🧪 Testing vectorization functionality...")

    # Test 1: Import dependencies
    print("\n1. Testing imports...")
    try:
        import numpy as np
        print("✅ numpy imported successfully")
    except ImportError as e:
        print(f"❌ numpy import failed: {e}")
        return False

    try:
        from sentence_transformers import SentenceTransformer
        print("✅ sentence-transformers imported successfully")
    except ImportError as e:
        print(f"❌ sentence-transformers import failed: {e}")
        print("Install with: pip install sentence-transformers")
        return False

    # Test 2: Load model
    print("\n2. Testing model loading...")
    try:
        model = SentenceTransformer('all-MiniLM-L6-v2')
        print("✅ SentenceTransformer model loaded successfully")
    except Exception as e:
        print(f"❌ Model loading failed: {e}")
        return False

    # Test 3: Create embeddings
    print("\n3. Testing embedding creation...")
    try:
        test_texts = [
            "This is a Python function for machine learning",
            "Here's a repository configuration file",
            "Installation instructions for the project"
        ]
        embeddings = model.encode(test_texts)
        print(f"✅ Created embeddings with shape: {embeddings.shape}")
    except Exception as e:
        print(f"❌ Embedding creation failed: {e}")
        return False

    # Test 4: Test similarity calculation
    print("\n4. Testing similarity calculation...")
    try:
        query_embedding = model.encode(["Python code example"])
        similarities = []
        for embedding in embeddings:
            similarity = np.dot(query_embedding[0], embedding) / (
                np.linalg.norm(query_embedding[0]) * np.linalg.norm(embedding)
            )
            similarities.append(similarity)
        print(f"✅ Similarity scores: {[f'{s:.3f}' for s in similarities]}")
    except Exception as e:
        print(f"❌ Similarity calculation failed: {e}")
        return False

    # Test 5: Test repo_explorer integration
    print("\n5. Testing repo_explorer integration...")
    try:
        from repo_explorer import SimpleVectorStore, vectorize_repository_content

        # Create test repository content
        test_repo_content = """# Test Repository
import numpy as np
import pandas as pd

def main():
    print("Hello, world!")

class DataProcessor:
    def __init__(self):
        self.data = []

    def process(self, data):
        return data.upper()

if __name__ == "__main__":
    main()
"""

        # Test vectorization
        success = vectorize_repository_content(test_repo_content, "test/repo")
        if success:
            print("✅ Repository vectorization successful")

            # Test vector store
            from repo_explorer import vector_store
            stats = vector_store.get_stats()
            print(f"✅ Vector store stats: {stats}")

            # Test search
            results = vector_store.search("Python function", top_k=2)
            if results:
                print(f"✅ Vector search returned {len(results)} results")
                for i, (chunk, similarity, metadata) in enumerate(results):
                    print(f" Result {i+1}: similarity={similarity:.3f}")
            else:
                print("⚠️ Vector search returned no results")
        else:
            print("❌ Repository vectorization failed")
            return False

    except Exception as e:
        print(f"❌ repo_explorer integration test failed: {e}")
        return False

    print("\n🎉 All tests passed! Vectorization is working correctly.")
    return True

if __name__ == "__main__":
    print("Repository Explorer Vectorization Test")
    print("=" * 45)

    success = test_vectorization()

    if success:
        print("\n✅ Ready to use vectorization in repo explorer!")
        print(" The sentence-transformers model will be downloaded on first use.")
    else:
        print("\n❌ Vectorization setup incomplete.")
        print(" Make sure to install: pip install sentence-transformers numpy")

    sys.exit(0 if success else 1)
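Note on Test 4: the script scores each embedding against the query with cosine similarity, one vector at a time. The sketch below illustrates that scoring plus the top-k ranking step it would feed, using only the standard library so it can be run without the model. The names `cosine_similarity` and `top_k` are hypothetical; this is not the `SimpleVectorStore.search` implementation, which does not appear in this part of the diff.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product divided by the product of the two vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Treat a zero-length vector as having no similarity instead of dividing by zero.
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy 2-D vectors standing in for real embeddings.
    chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for index, score in top_k([1.0, 0.0], chunks, k=2):
        print(f"chunk {index}: similarity={score:.3f}")
```

The zero-norm guard is a design choice of this sketch. The loop in Test 4 divides directly, which is fine for model embeddings but would produce `nan` with a runtime warning for an all-zero vector.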