naman1102 committed on
Commit
feb8f14
·
1 Parent(s): c01151f

vectorization

VECTORIZATION_README.md ADDED
@@ -0,0 +1,108 @@
1
+ # Repository Explorer - Vectorization Feature
2
+
3
+ ## Overview
4
+
5
+ The Repository Explorer now includes **simple vectorization** to enhance the chatbot's ability to answer questions about loaded repositories. This feature uses semantic search to find the most relevant code sections for user queries.
6
+
7
+ ## How It Works
8
+
9
+ ### 1. **Content Chunking**
10
+ - Repository content is split into overlapping chunks (500 lines each, with a 50-line overlap between consecutive chunks)
11
+ - Each chunk maintains metadata (repo ID, line numbers, chunk index)
12
+
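The chunking step above can be sketched as follows. This is a minimal standalone illustration, not the committed code: the helper name `chunk_lines` is hypothetical, while the default sizes (500 lines, 50-line overlap) and the metadata keys mirror what the README and `vectorize_repository_content` describe.

```python
from typing import Dict, List, Tuple


def chunk_lines(text: str, chunk_size: int = 500, overlap: int = 50) -> List[Tuple[str, Dict[str, int]]]:
    """Split text into overlapping line-based chunks with line-range metadata."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    lines = text.split("\n")
    stride = chunk_size - overlap  # each chunk starts this many lines after the previous one
    chunks: List[Tuple[str, Dict[str, int]]] = []
    for start in range(0, len(lines), stride):
        body = "\n".join(lines[start:start + chunk_size])
        if body.strip():  # skip chunks that contain only whitespace
            chunks.append((body, {
                "chunk_index": len(chunks),
                "start_line": start,  # 0-based offset into the combined text
                "end_line": min(start + chunk_size, len(lines)),
            }))
    return chunks


sample = "\n".join(f"line {n}" for n in range(1000))
sample_chunks = chunk_lines(sample)
for _, meta in sample_chunks:
    print(meta)
```

With the defaults, a 1000-line input yields chunks starting at lines 0, 450, and 900. One caveat worth checking against the model card: `all-MiniLM-L6-v2` is documented as truncating inputs longer than 256 word pieces, so a 500-line chunk is probably embedded from its first few lines only.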
13
+ ### 2. **Embedding Creation**
14
+ - Uses the lightweight `all-MiniLM-L6-v2` SentenceTransformer model
15
+ - Creates vector embeddings for each chunk
16
+ - Embeddings capture semantic meaning of code content
17
+
18
+ ### 3. **Semantic Search**
19
+ - When you ask a question, it searches for the 3 most relevant chunks
20
+ - Uses cosine similarity to rank chunks by relevance
21
+ - Returns both similarity scores and line number references (line numbers are offsets into the combined repository text, not into individual source files)
22
+
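The ranking step can be illustrated without downloading the model. The sketch below assumes embeddings are already available as plain vectors; `top_k_cosine` is a hypothetical helper that computes the same cosine similarity as `SimpleVectorStore.search`, but vectorized over all chunks at once instead of looping.

```python
import numpy as np


def top_k_cosine(query_vec, chunk_vecs, k=3):
    """Return (chunk_index, cosine_similarity) pairs for the k best-matching chunks."""
    query = np.asarray(query_vec, dtype=np.float32)
    matrix = np.asarray(chunk_vecs, dtype=np.float32)
    # Product of norms for every (chunk, query) pair; guard against zero vectors.
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    scores = (matrix @ query) / np.maximum(norms, 1e-12)
    order = np.argsort(-scores)[:k]  # indices sorted by descending similarity
    return [(int(i), float(scores[i])) for i in order]


# Toy 2-dimensional "embeddings" stand in for the 384-dimensional model output.
toy_chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranked = top_k_cosine([1.0, 0.0], toy_chunks, k=2)
print(ranked)
```

Here the first toy chunk points in the same direction as the query and ranks first, and the diagonal vector ranks second with a similarity of about 0.707.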
23
+ ### 4. **Enhanced Responses**
24
+ - The chatbot combines both the general repository analysis AND the most relevant code sections
25
+ - Provides specific code examples and implementation details
26
+ - References exact line numbers for better context
27
+
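How the search results are folded into the prompt can be sketched as a small pure function. `format_vector_context` is a hypothetical name; the header and per-section format strings follow the ones used in `handle_repo_bot_response` in this commit.

```python
from typing import Dict, List, Tuple


def format_vector_context(results: List[Tuple[str, float, Dict]]) -> str:
    """Render (chunk, similarity, metadata) search results as a prompt section."""
    if not results:
        return ""  # no vector hits: the prompt falls back to the summary alone
    parts = ["\n\n=== MOST RELEVANT CODE SECTIONS ===\n"]
    for rank, (chunk, similarity, meta) in enumerate(results, start=1):
        start = meta.get("start_line", "unknown")
        end = meta.get("end_line", "unknown")
        parts.append(
            f"\n--- Relevant Section {rank} "
            f"(similarity: {similarity:.3f}, lines {start}-{end}) ---\n{chunk}\n"
        )
    return "".join(parts)


demo = format_vector_context(
    [("def main(): ...", 0.8471, {"start_line": 0, "end_line": 500})]
)
print(demo)
```

Keeping the formatting separate from the search call makes it easy to unit-test the prompt text without loading the embedding model.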
28
+ ## Installation
29
+
30
+ The vectorization feature requires additional dependencies:
31
+
32
+ ```bash
33
+ pip install sentence-transformers numpy
34
+ ```
35
+
36
+ These are already included in the updated `requirements.txt`.
37
+
38
+ ## Testing
39
+
40
+ Run the test script to verify everything is working:
41
+
42
+ ```bash
43
+ python test_vectorization.py
44
+ ```
45
+
46
+ This will test:
47
+ - ✅ Dependencies import correctly
47
+ - ✅ SentenceTransformer model loads
48
+ - ✅ Embedding creation works
49
+ - ✅ Similarity calculations function
50
+ - ✅ Integration with repo explorer
52
+
53
+ ## Features
54
+
55
+ ### ✅ **What's Included**
56
+ - **Simple setup**: Uses a lightweight, fast embedding model
57
+ - **Automatic chunking**: Smart content splitting with overlap for context
58
+ - **Semantic search**: Find relevant code based on meaning, not just keywords
59
+ - **Graceful fallback**: If vectorization fails, falls back to text-only analysis
60
+ - **Memory efficient**: In-memory storage suitable for single repository exploration
61
+ - **Clear feedback**: Status messages show when vectorization is active
62
+
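The graceful-fallback behaviour can be sketched as below. Note that `SimpleVectorStore.add_chunks` logs embedding errors rather than raising them, so the fallback only works if the caller checks the outcome explicitly. The function and embedder names here are hypothetical stand-ins, and the embedders are fakes so that the example runs without `sentence-transformers`.

```python
from typing import Callable, List, Sequence


def load_with_fallback(
    chunks: Sequence[str],
    embed: Callable[[Sequence[str]], List[List[float]]],
) -> str:
    """Try to embed the chunks; report text-only mode if anything goes wrong."""
    try:
        vectors = embed(chunks)
        if len(vectors) != len(chunks):
            raise ValueError("embedding count does not match chunk count")
        return "Vector embeddings created for semantic search."
    except Exception as exc:  # any failure degrades to text-only analysis
        return f"Vectorization failed - using text-only analysis. ({exc})"


def fake_embedder(chunks: Sequence[str]) -> List[List[float]]:
    """Stand-in for a real model: one 1-dimensional vector per chunk."""
    return [[float(len(c))] for c in chunks]


def broken_embedder(_chunks: Sequence[str]) -> List[List[float]]:
    """Simulates a missing dependency."""
    raise ImportError("sentence-transformers package is required for vectorization")


print(load_with_fallback(["a", "b"], fake_embedder))
print(load_with_fallback(["a"], broken_embedder))
```

The status strings matter because `initialize_repo_chatbot` decides which welcome text to show by looking for "Vector embeddings created" in the status message.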
63
+ ### 🔍 **How to Use**
64
+ 1. Load any repository in the Repository Explorer tab
65
+ 2. Look for "Vector embeddings created" in the status message
66
+ 3. Ask questions - the chatbot will automatically use vector search
67
+ 4. Responses will include "MOST RELEVANT CODE SECTIONS" with similarity scores
68
+
69
+ ### 📊 **Example Output**
70
+ When you ask "How do I use this repository?", you might get:
71
+
72
+ ```
73
+ === MOST RELEVANT CODE SECTIONS ===
74
+
75
+ --- Relevant Section 1 (similarity: 0.847, lines 0-500) ---
75
+ # Installation and Usage
76
+ ...actual code from those lines...
77
+
78
+ --- Relevant Section 2 (similarity: 0.792, lines 450-950) ---
80
+ def main():
81
+ """Main usage example"""
82
+ ...actual code from those lines...
83
+ ```
84
+
85
+ ## Technical Details
86
+
87
+ - **Model**: `all-MiniLM-L6-v2` (384-dimensional embeddings)
88
+ - **Chunk size**: 500 lines with 50 line overlap
89
+ - **Search**: Top 3 most similar chunks per query
90
+ - **Storage**: In-memory (cleared when loading new repository)
91
+ - **Fallback**: Graceful degradation to text-only analysis if vectorization fails
92
+
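A rough size estimate for the in-memory store follows from the numbers above. The sketch assumes embeddings are stored as 32-bit floats (which, as far as I know, is what `SentenceTransformer.encode` returns by default) and ignores the chunk text itself, which is also kept in memory and will usually take more space than the vectors. The function name is hypothetical.

```python
def embedding_memory_bytes(total_lines, chunk_size=500, overlap=50, dims=384, bytes_per_value=4):
    """Upper bound on chunk count and embedding memory for a combined file of total_lines lines."""
    stride = chunk_size - overlap
    # Ceiling division: matches len(range(0, total_lines, stride)).
    n_chunks = -(-total_lines // stride)
    return n_chunks, n_chunks * dims * bytes_per_value


chunks_needed, vector_bytes = embedding_memory_bytes(100_000)
print(chunks_needed, vector_bytes)
```

For a 100,000-line combined file this gives 223 chunks and roughly 0.34 MB of vectors, which supports the claim that in-memory storage is adequate for exploring a single repository.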
93
+ ## Benefits
94
+
95
+ 1. **Better Context**: Finds relevant code sections even with natural language queries
96
+ 2. **Specific Examples**: Provides actual code snippets related to your question
97
+ 3. **Line References**: Shows exactly where information comes from
98
+ 4. **Semantic Understanding**: Understands intent, not just keyword matching
99
+ 5. **Fast Setup**: Lightweight model downloads quickly on first use
100
+
101
+ ## Limitations
102
+
103
+ - **Single Repository**: Vector store is cleared when loading a new repository
104
+ - **Memory Usage**: Keeps all embeddings in memory (suitable for exploration use case)
105
+ - **Model Size**: ~80MB download for the embedding model (one-time)
106
+ - **No Persistence**: Vectors are recreated each time you load a repository
107
+
108
+ This simple vectorization approach helps the chatbot give more relevant, code-specific answers while keeping the implementation straightforward and fast.
repo_explorer.py CHANGED
@@ -2,12 +2,136 @@ import gradio as gr
2
  import os
3
  import logging
4
  from typing import List, Dict, Tuple
 
5
  from analyzer import combine_repo_files_for_llm, handle_load_repository
6
  from hf_utils import download_filtered_space_files
7
 
8
  # Setup logger
9
  logger = logging.getLogger(__name__)
10
 
11
  def create_repo_explorer_tab() -> Tuple[Dict[str, gr.components.Component], Dict[str, gr.State]]:
12
  """
13
  Creates the Repo Explorer tab content and returns the component references and state variables.
@@ -35,8 +159,8 @@ def create_repo_explorer_tab() -> Tuple[Dict[str, gr.components.Component], Dict
35
  repo_status_display = gr.Textbox(
36
  label="📊 Repository Status",
37
  interactive=False,
38
- lines=3,
39
- info="Current repository loading status and basic info"
40
  )
41
 
42
  with gr.Row():
@@ -101,13 +225,26 @@ def handle_repo_user_message(user_message: str, history: List[Dict[str, str]], r
101
  return history, ""
102
 
103
  def handle_repo_bot_response(history: List[Dict[str, str]], repo_context_summary: str, repo_id: str) -> List[Dict[str, str]]:
104
- """Generate bot response for repo-specific questions using comprehensive context."""
105
  if not history or history[-1]["role"] != "user" or not repo_context_summary.strip():
106
  return history
107
 
108
  user_message = history[-1]["content"]
109
 
110
- # Create a specialized prompt using the comprehensive context summary
111
  repo_system_prompt = f"""You are an expert assistant for the Hugging Face repository '{repo_id}'.
112
  You have comprehensive knowledge about this repository based on detailed analysis of all its files and components.
113
 
@@ -115,11 +252,14 @@ Use the following comprehensive analysis to answer user questions accurately and
115
 
116
  {repo_context_summary}
117
 
 
 
118
  Instructions:
119
  - Answer questions clearly and conversationally about this specific repository
120
  - Reference specific components, functions, or features when relevant
121
  - Provide practical guidance on installation, usage, and implementation
122
- - If asked about code details, refer to the analysis above
 
123
  - Be helpful and informative while staying focused on this repository
124
  - If something isn't covered in the analysis, acknowledge the limitation
125
 
@@ -150,11 +290,56 @@ Answer the user's question based on your comprehensive knowledge of this reposit
150
 
151
  return history
152
 
153
  def initialize_repo_chatbot(repo_status: str, repo_id: str, repo_context_summary: str) -> List[Dict[str, str]]:
154
  """Initialize the repository chatbot with a welcome message after successful repo loading."""
155
  # Only initialize if repository was loaded successfully
156
  if repo_context_summary.strip() and "successfully" in repo_status.lower():
157
- welcome_msg = f"👋 Welcome! I've successfully analyzed the **{repo_id}** repository.\n\n🧠 **I now have comprehensive knowledge of:**\n• All files and code structure\n• Key features and capabilities\n• Installation and usage instructions\n• Architecture and implementation details\n• Dependencies and requirements\n\n💬 **Ask me anything about this repository!** \nFor example:\n• \"What does this repository do?\"\n• \"How do I install and use it?\"\n• \"What are the main components?\"\n• \"Show me usage examples\"\n\nWhat would you like to know? 🤔"
 
 
 
158
  return [{"role": "assistant", "content": welcome_msg}]
159
  else:
160
  # Keep chatbot empty if loading failed
@@ -163,9 +348,9 @@ def initialize_repo_chatbot(repo_status: str, repo_id: str, repo_context_summary
163
  def setup_repo_explorer_events(components: Dict[str, gr.components.Component], states: Dict[str, gr.State]):
164
  """Setup event handlers for the repo explorer components."""
165
 
166
- # Load repository event
167
  components["load_repo_btn"].click(
168
- fn=handle_load_repository,
169
  inputs=[components["repo_explorer_input"]],
170
  outputs=[components["repo_status_display"], states["repo_context_summary"]]
171
  ).then(
 
2
  import os
3
  import logging
4
  from typing import List, Dict, Tuple
5
+ import numpy as np
6
  from analyzer import combine_repo_files_for_llm, handle_load_repository
7
  from hf_utils import download_filtered_space_files
8
 
9
  # Setup logger
10
  logger = logging.getLogger(__name__)
11
 
12
+ class SimpleVectorStore:
13
+     """Simple in-memory vector store for repository chunks."""
14
+
15
+     def __init__(self):
16
+         self.chunks = []
17
+         self.embeddings = []
18
+         self.chunk_metadata = []
19
+         self.model = None
20
+
21
+     def _get_embedding_model(self):
22
+         """Lazy load the embedding model."""
23
+         if self.model is None:
24
+             try:
25
+                 from sentence_transformers import SentenceTransformer
26
+                 self.model = SentenceTransformer('all-MiniLM-L6-v2')  # Lightweight, fast model
27
+                 logger.info("Loaded SentenceTransformer model for vectorization")
28
+             except ImportError:
29
+                 logger.error("sentence-transformers not installed. Install with: pip install sentence-transformers")
30
+                 raise ImportError("sentence-transformers package is required for vectorization")
31
+         return self.model
32
+
33
+     def add_chunks(self, chunks: List[str], metadata: List[Dict] = None):
34
+         """Add text chunks and create embeddings."""
35
+         try:
36
+             model = self._get_embedding_model()
37
+             embeddings = model.encode(chunks, convert_to_tensor=False)
38
+
39
+             self.chunks.extend(chunks)
40
+             self.embeddings.extend(embeddings)
41
+             self.chunk_metadata.extend(metadata or [{} for _ in chunks])
42
+
43
+             logger.info(f"Added {len(chunks)} chunks to vector store")
44
+         except Exception as e:
45
+             logger.error(f"Error adding chunks to vector store: {e}")
46
+
47
+     def search(self, query: str, top_k: int = 3) -> List[Tuple[str, float, Dict]]:
48
+         """Search for similar chunks using cosine similarity."""
49
+         if not self.chunks or not self.embeddings:
50
+             return []
51
+
52
+         try:
53
+             model = self._get_embedding_model()
54
+             query_embedding = model.encode([query], convert_to_tensor=False)[0]
55
+
56
+             # Calculate cosine similarities
57
+             similarities = []
58
+             for i, chunk_embedding in enumerate(self.embeddings):
59
+                 similarity = np.dot(query_embedding, chunk_embedding) / (
60
+                     np.linalg.norm(query_embedding) * np.linalg.norm(chunk_embedding)
61
+                 )
62
+                 similarities.append((self.chunks[i], similarity, self.chunk_metadata[i]))
63
+
64
+             # Sort by similarity and return top_k
65
+             similarities.sort(key=lambda x: x[1], reverse=True)
66
+             return similarities[:top_k]
67
+
68
+         except Exception as e:
69
+             logger.error(f"Error searching vector store: {e}")
70
+             return []
71
+
72
+     def clear(self):
73
+         """Clear all stored data."""
74
+         self.chunks = []
75
+         self.embeddings = []
76
+         self.chunk_metadata = []
77
+
78
+     def get_stats(self) -> Dict:
79
+         """Get statistics about the vector store."""
80
+         return {
81
+             'total_chunks': len(self.chunks),
82
+             'total_embeddings': len(self.embeddings),
83
+             'model_loaded': self.model is not None
84
+         }
85
+
86
+ # Global vector store instance
87
+ vector_store = SimpleVectorStore()
88
+
89
+ def vectorize_repository_content(repo_content: str, repo_id: str, chunk_size: int = 500) -> bool:
90
+     """
91
+     Vectorize repository content by splitting into chunks and creating embeddings.
92
+
93
+     Args:
94
+         repo_content: The combined repository content
95
+         repo_id: Repository identifier
96
+         chunk_size: Number of lines per chunk
97
+
98
+     Returns:
99
+         bool: True if vectorization was successful
100
+     """
101
+     try:
102
+         # Clear previous data
103
+         vector_store.clear()
104
+
105
+         lines = repo_content.split('\n')
106
+         chunks = []
107
+         metadata = []
108
+
109
+         # Split into chunks with overlap for better context
110
+         overlap = 50  # lines of overlap between chunks
111
+
112
+         for i in range(0, len(lines), chunk_size - overlap):
113
+             chunk_lines = lines[i:i + chunk_size]
114
+             chunk_text = '\n'.join(chunk_lines)
115
+
116
+             if chunk_text.strip():  # Only add non-empty chunks
117
+                 chunks.append(chunk_text)
118
+                 metadata.append({
119
+                     'repo_id': repo_id,
120
+                     'chunk_index': len(chunks) - 1,
121
+                     'start_line': i,
122
+                     'end_line': min(i + chunk_size, len(lines))
123
+                 })
124
+
125
+         # Add chunks to vector store
126
+         vector_store.add_chunks(chunks, metadata)
127
+
128
+         logger.info(f"Vectorization finished for repository {repo_id}: {len(vector_store.embeddings)}/{len(chunks)} chunks embedded")
129
+         return len(vector_store.embeddings) == len(chunks)  # add_chunks logs errors instead of raising, so verify here
130
+
131
+     except Exception as e:
132
+         logger.error(f"Error vectorizing repository content: {e}")
133
+         return False
134
+
135
  def create_repo_explorer_tab() -> Tuple[Dict[str, gr.components.Component], Dict[str, gr.State]]:
136
  """
137
  Creates the Repo Explorer tab content and returns the component references and state variables.
 
159
  repo_status_display = gr.Textbox(
160
  label="📊 Repository Status",
161
  interactive=False,
162
+ lines=4,
163
+ info="Current repository loading status and vectorization info"
164
  )
165
 
166
  with gr.Row():
 
225
  return history, ""
226
 
227
  def handle_repo_bot_response(history: List[Dict[str, str]], repo_context_summary: str, repo_id: str) -> List[Dict[str, str]]:
228
+ """Generate bot response for repo-specific questions using comprehensive context and vector search."""
229
  if not history or history[-1]["role"] != "user" or not repo_context_summary.strip():
230
  return history
231
 
232
  user_message = history[-1]["content"]
233
 
234
+ # Use vector search to find relevant chunks
235
+ relevant_chunks = vector_store.search(user_message, top_k=3)
236
+
237
+ # Build enhanced context using vector search results
238
+ vector_context = ""
239
+ if relevant_chunks:
240
+ vector_context = "\n\n=== MOST RELEVANT CODE SECTIONS ===\n"
241
+ for i, (chunk, similarity, metadata) in enumerate(relevant_chunks):
242
+ chunk_id = metadata.get('chunk_index', i)
243
+ start_line = metadata.get('start_line', 'unknown')
244
+ end_line = metadata.get('end_line', 'unknown')
245
+ vector_context += f"\n--- Relevant Section {i+1} (similarity: {similarity:.3f}, lines {start_line}-{end_line}) ---\n{chunk}\n"
246
+
247
+ # Create a specialized prompt using both comprehensive context and vector search results
248
  repo_system_prompt = f"""You are an expert assistant for the Hugging Face repository '{repo_id}'.
249
  You have comprehensive knowledge about this repository based on detailed analysis of all its files and components.
250
 
 
252
 
253
  {repo_context_summary}
254
 
255
+ {vector_context}
256
+
257
  Instructions:
258
  - Answer questions clearly and conversationally about this specific repository
259
  - Reference specific components, functions, or features when relevant
260
  - Provide practical guidance on installation, usage, and implementation
261
+ - If asked about code details, refer to the analysis above and the relevant code sections
262
+ - Use the most relevant code sections to provide specific examples and implementation details
263
  - Be helpful and informative while staying focused on this repository
264
  - If something isn't covered in the analysis, acknowledge the limitation
265
 
 
290
 
291
  return history
292
 
293
+ def handle_load_repository_with_vectorization(repo_id: str) -> Tuple[str, str]:
294
+     """Load repository and create both context summary and vector embeddings."""
295
+     if not repo_id.strip():
296
+         return "Status: Please enter a repository ID.", ""
297
+
298
+     try:
299
+         logger.info(f"Loading repository with vectorization: {repo_id}")
300
+
301
+         # Download and process the repository (existing logic)
302
+         try:
303
+             download_filtered_space_files(repo_id, local_dir="repo_files", file_extensions=['.py', '.md', '.txt'])
304
+             combined_text_path = combine_repo_files_for_llm()
305
+         except Exception as e:
306
+             logger.error(f"Error downloading repository {repo_id}: {e}")
307
+             error_status = f"❌ Error downloading repository: {e}"
308
+             return error_status, ""
309
+
310
+         # Read the combined content
311
+         with open(combined_text_path, "r", encoding="utf-8") as f:
312
+             repo_content = f.read()
313
+
314
+         # Create vectorized representation
315
+         vectorization_success = vectorize_repository_content(repo_content, repo_id)
316
+
317
+         # Get the original context summary
318
+         from analyzer import create_repo_context_summary
319
+         context_summary = create_repo_context_summary(repo_content, repo_id)
320
+
321
+         # Update status message
322
+         if vectorization_success:
323
+             status = f"✅ Repository '{repo_id}' loaded successfully!\n📁 Files processed and ready for exploration.\n🔍 Vector embeddings created for semantic search.\n💬 You can now ask questions about this repository."
324
+         else:
325
+             status = f"✅ Repository '{repo_id}' loaded successfully!\n📁 Files processed and ready for exploration.\n⚠️ Vectorization failed - using text-only analysis.\n💬 You can now ask questions about this repository."
326
+
327
+         logger.info(f"Repository {repo_id} loaded and processed successfully")
328
+         return status, context_summary
329
+
330
+     except Exception as e:
331
+         logger.error(f"Error loading repository {repo_id}: {e}")
332
+         error_status = f"❌ Error loading repository: {e}"
333
+         return error_status, ""
334
+
335
  def initialize_repo_chatbot(repo_status: str, repo_id: str, repo_context_summary: str) -> List[Dict[str, str]]:
336
  """Initialize the repository chatbot with a welcome message after successful repo loading."""
337
  # Only initialize if repository was loaded successfully
338
  if repo_context_summary.strip() and "successfully" in repo_status.lower():
339
+ # Check if vectorization was successful
340
+ vectorization_status = "🔍 **Enhanced with vector search** for finding relevant code sections" if "Vector embeddings created" in repo_status else "📄 **Text-based analysis** (vector search unavailable)"
341
+
342
+ welcome_msg = f"👋 Welcome! I've successfully analyzed the **{repo_id}** repository.\n\n🧠 **I now have comprehensive knowledge of:**\n• All files and code structure\n• Key features and capabilities\n• Installation and usage instructions\n• Architecture and implementation details\n• Dependencies and requirements\n\n{vectorization_status}\n\n💬 **Ask me anything about this repository!** \nFor example:\n• \"What does this repository do?\"\n• \"How do I install and use it?\"\n• \"What are the main components?\"\n• \"Show me usage examples\"\n\nWhat would you like to know? 🤔"
343
  return [{"role": "assistant", "content": welcome_msg}]
344
  else:
345
  # Keep chatbot empty if loading failed
 
348
  def setup_repo_explorer_events(components: Dict[str, gr.components.Component], states: Dict[str, gr.State]):
349
  """Setup event handlers for the repo explorer components."""
350
 
351
+ # Load repository event with vectorization
352
  components["load_repo_btn"].click(
353
+ fn=handle_load_repository_with_vectorization,
354
  inputs=[components["repo_explorer_input"]],
355
  outputs=[components["repo_status_display"], states["repo_context_summary"]]
356
  ).then(
repo_explorer_old.py ADDED
@@ -0,0 +1,200 @@
1
+ import gradio as gr
2
+ import os
3
+ import logging
4
+ from typing import List, Dict, Tuple
5
+ from analyzer import combine_repo_files_for_llm, handle_load_repository
6
+ from hf_utils import download_filtered_space_files
7
+
8
+ # Setup logger
9
+ logger = logging.getLogger(__name__)
10
+
11
+ def create_repo_explorer_tab() -> Tuple[Dict[str, gr.components.Component], Dict[str, gr.State]]:
12
+ """
13
+ Creates the Repo Explorer tab content and returns the component references and state variables.
14
+ """
15
+
16
+ # State variables for repo explorer
17
+ states = {
18
+ "repo_context_summary": gr.State(""),
19
+ "current_repo_id": gr.State("")
20
+ }
21
+
22
+ gr.Markdown("### 🗂️ Deep Dive into a Specific Repository")
23
+
24
+ with gr.Row():
25
+ with gr.Column(scale=2):
26
+ repo_explorer_input = gr.Textbox(
27
+ label="πŸ“ Repository ID",
28
+ placeholder="microsoft/DialoGPT-medium",
29
+ info="Enter a Hugging Face repository ID to explore"
30
+ )
31
+ with gr.Column(scale=1):
32
+ load_repo_btn = gr.Button("🚀 Load Repository", variant="primary", size="lg")
33
+
34
+ with gr.Row():
35
+ repo_status_display = gr.Textbox(
36
+ label="📊 Repository Status",
37
+ interactive=False,
38
+ lines=3,
39
+ info="Current repository loading status and basic info"
40
+ )
41
+
42
+ with gr.Row():
43
+ with gr.Column(scale=2):
44
+ repo_chatbot = gr.Chatbot(
45
+ label="🤖 Repository Assistant",
46
+ height=400,
47
+ type="messages",
48
+ avatar_images=(
49
+ "https://cdn-icons-png.flaticon.com/512/149/149071.png",
50
+ "https://huggingface.co/datasets/huggingface/brand-assets/resolve/main/hf-logo.png"
51
+ ),
52
+ show_copy_button=True,
53
+ value=[] # Start empty - welcome message will appear only after repo is loaded
54
+ )
55
+
56
+ with gr.Row():
57
+ repo_msg_input = gr.Textbox(
58
+ label="💭 Ask about this repository",
59
+ placeholder="What does this repository do? How do I use it?",
60
+ lines=1,
61
+ scale=4,
62
+ info="Ask anything about the loaded repository"
63
+ )
64
+ repo_send_btn = gr.Button("📤 Send", variant="primary", scale=1)
65
+
66
+ # with gr.Column(scale=1):
67
+ # # Repository content preview
68
+ # repo_content_display = gr.Textbox(
69
+ # label="πŸ“„ Repository Content Preview",
70
+ # lines=20,
71
+ # show_copy_button=True,
72
+ # interactive=False,
73
+ # info="Overview of the loaded repository structure and content"
74
+ # )
75
+
76
+ # Component references
77
+ components = {
78
+ "repo_explorer_input": repo_explorer_input,
79
+ "load_repo_btn": load_repo_btn,
80
+ "repo_status_display": repo_status_display,
81
+ "repo_chatbot": repo_chatbot,
82
+ "repo_msg_input": repo_msg_input,
83
+ "repo_send_btn": repo_send_btn,
84
+ # "repo_content_display": repo_content_display
85
+ }
86
+
87
+ return components, states
88
+
89
+ def handle_repo_user_message(user_message: str, history: List[Dict[str, str]], repo_context_summary: str, repo_id: str) -> Tuple[List[Dict[str, str]], str]:
90
+ """Handle user messages in the repo-specific chatbot."""
91
+ if not repo_context_summary.strip():
92
+ return history, ""
93
+
94
+ # Initialize with repository-specific welcome message if empty
95
+ if not history:
96
+ welcome_msg = f"Hello! I'm your assistant for the '{repo_id}' repository. I have analyzed all the files and created a comprehensive understanding of this repository. I'm ready to answer any questions about its functionality, usage, architecture, and more. What would you like to know?"
97
+ history = [{"role": "assistant", "content": welcome_msg}]
98
+
99
+ if user_message:
100
+ history.append({"role": "user", "content": user_message})
101
+ return history, ""
102
+
103
+ def handle_repo_bot_response(history: List[Dict[str, str]], repo_context_summary: str, repo_id: str) -> List[Dict[str, str]]:
104
+ """Generate bot response for repo-specific questions using comprehensive context."""
105
+ if not history or history[-1]["role"] != "user" or not repo_context_summary.strip():
106
+ return history
107
+
108
+ user_message = history[-1]["content"]
109
+
110
+ # Create a specialized prompt using the comprehensive context summary
111
+ repo_system_prompt = f"""You are an expert assistant for the Hugging Face repository '{repo_id}'.
112
+ You have comprehensive knowledge about this repository based on detailed analysis of all its files and components.
113
+
114
+ Use the following comprehensive analysis to answer user questions accurately and helpfully:
115
+
116
+ {repo_context_summary}
117
+
118
+ Instructions:
119
+ - Answer questions clearly and conversationally about this specific repository
120
+ - Reference specific components, functions, or features when relevant
121
+ - Provide practical guidance on installation, usage, and implementation
122
+ - If asked about code details, refer to the analysis above
123
+ - Be helpful and informative while staying focused on this repository
124
+ - If something isn't covered in the analysis, acknowledge the limitation
125
+
126
+ Answer the user's question based on your comprehensive knowledge of this repository."""
127
+
128
+ try:
129
+ from openai import OpenAI
130
+ client = OpenAI(api_key=os.getenv("modal_api"))
131
+ client.base_url = os.getenv("base_url")
132
+
133
+ response = client.chat.completions.create(
134
+ model="Orion-zhen/Qwen2.5-Coder-7B-Instruct-AWQ",
135
+ messages=[
136
+ {"role": "system", "content": repo_system_prompt},
137
+ {"role": "user", "content": user_message}
138
+ ],
139
+ max_tokens=1024,
140
+ temperature=0.7
141
+ )
142
+
143
+ bot_response = response.choices[0].message.content
144
+ history.append({"role": "assistant", "content": bot_response})
145
+
146
+ except Exception as e:
147
+ logger.error(f"Error generating repo bot response: {e}")
148
+ error_response = f"I apologize, but I encountered an error while processing your question: {e}"
149
+ history.append({"role": "assistant", "content": error_response})
150
+
151
+ return history
152
+
153
+ def initialize_repo_chatbot(repo_status: str, repo_id: str, repo_context_summary: str) -> List[Dict[str, str]]:
154
+ """Initialize the repository chatbot with a welcome message after successful repo loading."""
155
+ # Only initialize if repository was loaded successfully
156
+ if repo_context_summary.strip() and "successfully" in repo_status.lower():
157
+ welcome_msg = f"👋 Welcome! I've successfully analyzed the **{repo_id}** repository.\n\n🧠 **I now have comprehensive knowledge of:**\n• All files and code structure\n• Key features and capabilities\n• Installation and usage instructions\n• Architecture and implementation details\n• Dependencies and requirements\n\n💬 **Ask me anything about this repository!** \nFor example:\n• \"What does this repository do?\"\n• \"How do I install and use it?\"\n• \"What are the main components?\"\n• \"Show me usage examples\"\n\nWhat would you like to know? 🤔"
158
+ return [{"role": "assistant", "content": welcome_msg}]
159
+ else:
160
+ # Keep chatbot empty if loading failed
161
+ return []
162
+
163
+ def setup_repo_explorer_events(components: Dict[str, gr.components.Component], states: Dict[str, gr.State]):
164
+ """Setup event handlers for the repo explorer components."""
165
+
166
+ # Load repository event
167
+ components["load_repo_btn"].click(
168
+ fn=handle_load_repository,
169
+ inputs=[components["repo_explorer_input"]],
170
+ outputs=[components["repo_status_display"], states["repo_context_summary"]]
171
+ ).then(
172
+ fn=lambda repo_id: repo_id,
173
+ inputs=[components["repo_explorer_input"]],
174
+ outputs=[states["current_repo_id"]]
175
+ ).then(
176
+ fn=initialize_repo_chatbot,
177
+ inputs=[components["repo_status_display"], states["current_repo_id"], states["repo_context_summary"]],
178
+ outputs=[components["repo_chatbot"]]
179
+ )
180
+
181
+ # Chat message submission events
182
+ components["repo_msg_input"].submit(
183
+ fn=handle_repo_user_message,
184
+ inputs=[components["repo_msg_input"], components["repo_chatbot"], states["repo_context_summary"], states["current_repo_id"]],
185
+ outputs=[components["repo_chatbot"], components["repo_msg_input"]]
186
+ ).then(
187
+ fn=handle_repo_bot_response,
188
+ inputs=[components["repo_chatbot"], states["repo_context_summary"], states["current_repo_id"]],
189
+ outputs=[components["repo_chatbot"]]
190
+ )
191
+
192
+ components["repo_send_btn"].click(
193
+ fn=handle_repo_user_message,
194
+ inputs=[components["repo_msg_input"], components["repo_chatbot"], states["repo_context_summary"], states["current_repo_id"]],
195
+ outputs=[components["repo_chatbot"], components["repo_msg_input"]]
196
+ ).then(
197
+ fn=handle_repo_bot_response,
198
+ inputs=[components["repo_chatbot"], states["repo_context_summary"], states["current_repo_id"]],
199
+ outputs=[components["repo_chatbot"]]
200
+ )
requirements.txt CHANGED
@@ -2,4 +2,6 @@ gradio
2
  pandas
3
  openai
4
  regex
5
- huggingface_hub
 
 
 
2
  pandas
3
  openai
4
  regex
5
+ huggingface_hub
6
+ sentence-transformers
7
+ numpy
test_vectorization.py ADDED
@@ -0,0 +1,135 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ Simple test script to verify vectorization functionality.
4
+ Run this to check if sentence-transformers is working correctly.
5
+ """
6
+
7
+ import os
8
+ import sys
9
+
10
+ def test_vectorization():
11
+     """Test the vectorization functionality."""
12
+     print("🧪 Testing vectorization functionality...")
13
+
14
+     # Test 1: Import dependencies
15
+     print("\n1. Testing imports...")
16
+     try:
17
+         import numpy as np
18
+         print("✅ numpy imported successfully")
19
+     except ImportError as e:
20
+         print(f"❌ numpy import failed: {e}")
21
+         return False
22
+
23
+     try:
24
+         from sentence_transformers import SentenceTransformer
25
+         print("✅ sentence-transformers imported successfully")
26
+     except ImportError as e:
27
+         print(f"❌ sentence-transformers import failed: {e}")
28
+         print("Install with: pip install sentence-transformers")
29
+         return False
30
+
31
+     # Test 2: Load model
32
+     print("\n2. Testing model loading...")
33
+     try:
34
+         model = SentenceTransformer('all-MiniLM-L6-v2')
35
+         print("✅ SentenceTransformer model loaded successfully")
36
+     except Exception as e:
37
+         print(f"❌ Model loading failed: {e}")
38
+         return False
39
+
40
+     # Test 3: Create embeddings
41
+     print("\n3. Testing embedding creation...")
42
+     try:
43
+         test_texts = [
44
+             "This is a Python function for machine learning",
45
+             "Here's a repository configuration file",
46
+             "Installation instructions for the project"
47
+         ]
48
+         embeddings = model.encode(test_texts)
49
+         print(f"✅ Created embeddings with shape: {embeddings.shape}")
50
+     except Exception as e:
51
+         print(f"❌ Embedding creation failed: {e}")
52
+         return False
53
+
54
+     # Test 4: Test similarity calculation
55
+     print("\n4. Testing similarity calculation...")
56
+     try:
57
+         query_embedding = model.encode(["Python code example"])
58
+         similarities = []
59
+         for embedding in embeddings:
60
+             similarity = np.dot(query_embedding[0], embedding) / (
61
+                 np.linalg.norm(query_embedding[0]) * np.linalg.norm(embedding)
62
+             )
63
+             similarities.append(similarity)
64
+         print(f"✅ Similarity scores: {[f'{s:.3f}' for s in similarities]}")
65
+     except Exception as e:
66
+         print(f"❌ Similarity calculation failed: {e}")
67
+         return False
68
+
69
+     # Test 5: Test repo_explorer integration
70
+     print("\n5. Testing repo_explorer integration...")
71
+     try:
72
+         from repo_explorer import SimpleVectorStore, vectorize_repository_content
73
+
74
+         # Create test repository content
75
+         test_repo_content = """# Test Repository
76
+ import numpy as np
77
+ import pandas as pd
78
+
79
+ def main():
80
+     print("Hello, world!")
81
+
82
+ class DataProcessor:
83
+     def __init__(self):
84
+         self.data = []
85
+
86
+     def process(self, data):
87
+         return data.upper()
88
+
89
+ if __name__ == "__main__":
90
+     main()
91
+ """
92
+
93
+         # Test vectorization
94
+         success = vectorize_repository_content(test_repo_content, "test/repo")
95
+         if success:
96
+             print("✅ Repository vectorization successful")
97
+
98
+             # Test vector store
99
+             from repo_explorer import vector_store
100
+             stats = vector_store.get_stats()
101
+             print(f"✅ Vector store stats: {stats}")
102
+
103
+             # Test search
104
+             results = vector_store.search("Python function", top_k=2)
105
+             if results:
106
+                 print(f"✅ Vector search returned {len(results)} results")
107
+                 for i, (chunk, similarity, metadata) in enumerate(results):
108
+                     print(f" Result {i+1}: similarity={similarity:.3f}")
109
+             else:
110
+                 print("⚠️ Vector search returned no results")
111
+         else:
112
+             print("❌ Repository vectorization failed")
113
+             return False
114
+
115
+     except Exception as e:
116
+         print(f"❌ repo_explorer integration test failed: {e}")
117
+         return False
118
+
119
+     print("\n🎉 All tests passed! Vectorization is working correctly.")
120
+     return True
121
+
122
+ if __name__ == "__main__":
123
+     print("Repository Explorer Vectorization Test")
124
+     print("=" * 45)
125
+
126
+     success = test_vectorization()
127
+
128
+     if success:
129
+         print("\n✅ Ready to use vectorization in repo explorer!")
130
+         print(" The sentence-transformers model will be downloaded on first use.")
131
+     else:
132
+         print("\n❌ Vectorization setup incomplete.")
133
+         print(" Make sure to install: pip install sentence-transformers numpy")
134
+
135
+     sys.exit(0 if success else 1)