Kazel committed · Commit 0400df3 · 1 Parent(s): 7a57e5b
QUICK_START.md ADDED
@@ -0,0 +1,289 @@
1
+ # 🚀 Quick Start Guide - Collar Multimodal RAG Demo
2
+
3
+ Get your production-ready multimodal RAG system up and running in minutes!
4
+
5
+ ## ⚡ 5-Minute Setup
6
+
7
+ ### 1. **Install Dependencies**
8
+ ```bash
9
+ pip install -r requirements.txt
10
+ ```
11
+
12
+ ### 2. **Start the Application**
13
+ ```bash
14
+ python app.py
15
+ ```
16
+
17
+ ### 3. **Access the Application**
18
+ Open your browser and go to: `http://localhost:7860`
19
+
20
+ ### 4. **Login with Default Users**
21
+ - **Team A**: `admin_team_a` / `admin123_team_a`
22
+ - **Team B**: `admin_team_b` / `admin123_team_b`
23
+
24
+ ## 🎯 Key Features to Try
25
+
26
+ ### **Enhanced Multi-Page Citations**
27
+ 1. Upload multiple documents
28
+ 2. Ask complex queries like: "What are the different types of explosives and their safety procedures?"
29
+ 3. The system automatically detects complex queries and retrieves multiple relevant pages
30
+ 4. See intelligent citations grouped by document collections with relevance scores
31
+ 5. View multiple pages in the gallery display
32
+
33
+ ### **Team Repository Management**
34
+ 1. Login as Team A user
35
+ 2. Upload documents with a collection name like "Safety Manuals"
36
+ 3. Switch to Team B user - notice you can't see Team A's documents
37
+
38
+ ### **Chat History**
39
+ 1. Make several queries
40
+ 2. Go to "💬 Chat History" tab
41
+ 3. See your conversation history with timestamps and cited pages
42
+
43
+ ### **Advanced Querying**
44
+ 1. Set "Number of pages to retrieve" to 5
45
+ 2. Ask a complex question
46
+ 3. View multiple relevant pages and AI response with citations
47
+
48
+ ### **Enhanced Detailed Responses**
49
+ 1. Ask any question and receive comprehensive, detailed answers
50
+ 2. Get extensive background information and context
51
+ 3. See step-by-step explanations and practical applications
52
+ 4. Receive safety considerations and best practices
53
+ 5. Get technical specifications and measurements
54
+ 6. View quality assessment and recommendations for further research
55
+
56
+ ### **CSV Table Generation**
57
+ 1. Ask for data in table format: "Show me a table of safety procedures"
58
+ 2. Request CSV data: "Create a CSV with the comparison data"
59
+ 3. Get structured responses with downloadable CSV content
60
+ 4. View table information including rows, columns, and data sources
61
+ 5. Copy CSV content to use in Excel, Google Sheets, or other applications
62
+
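+ If you prefer to keep working in Python, the copied CSV content can be loaded with pandas (already listed in `requirements.txt`). This is a minimal sketch: the file name `table_export.csv` is just an example you save yourself, not something the app creates for you.
+
+ ```python
+ import pandas as pd
+
+ # Save the copied CSV content to a file first, e.g. table_export.csv (name is arbitrary).
+ df = pd.read_csv("table_export.csv")
+ print(df.head())
+
+ # Optional: hand it to spreadsheet users as an .xlsx file (openpyxl is in requirements.txt).
+ df.to_excel("table_export.xlsx", index=False)
+ ```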
63
+ ## 🔧 Configuration
64
+
65
+ ### Environment Variables (.env file)
66
+ ```env
67
+ # AI Models
68
+ colpali=colpali-v1.3
69
+ ollama=llama2
70
+
71
+ # Performance
72
+ flashattn=1
73
+ temperature=0.8
74
+ batchsize=5
75
+
76
+ # Database
77
+ metrictype=IP
78
+ mnum=16
79
+ efnum=500
80
+ topk=50
81
+ ```
82
+
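+ These values are read at startup via python-dotenv, the same pattern used by `colpali_manager.py` and `milvus_manager.py` in this commit. A minimal sketch of how the settings are consumed; the variable names on the left are illustrative, the `.env` keys match the block above:
+
+ ```python
+ import os
+
+ import dotenv
+
+ # Load the .env file sitting next to app.py.
+ dotenv.load_dotenv(dotenv.find_dotenv())
+
+ # Values arrive as strings, so numeric knobs need explicit casts.
+ batch_size = int(os.environ["batchsize"])
+ temperature = float(os.environ["temperature"])
+ top_k = int(os.environ["topk"])
+ print(f"batchsize={batch_size}, temperature={temperature}, topk={top_k}")
+ ```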
83
+ ### Customizing for Your Use Case
84
+
85
+ #### **For Large Document Collections**
86
+ ```env
87
+ batchsize=10
88
+ topk=100
89
+ efnum=1000
90
+ ```
91
+
92
+ #### **For Faster Processing**
93
+ ```env
94
+ batchsize=2
95
+ flashattn=0
96
+ ```
97
+
98
+ #### **For Higher Accuracy**
99
+ ```env
100
+ temperature=0.3
101
+ topk=200
102
+ ```
103
+
104
+ ## 📁 File Structure
105
+ ```
106
+ colpali-milvus-multimodal-rag-master/
107
+ ├── app.py # Main application
108
+ ├── requirements.txt # Dependencies
109
+ ├── README.md # Full documentation
110
+ ├── QUICK_START.md # This file
111
+ ├── test_production_features.py # Test suite
112
+ ├── deploy_production.py # Production deployment
113
+ ├── app_database.db # SQLite database (auto-created)
114
+ ├── pages/ # Document pages (auto-created)
115
+ ├── logs/ # Application logs
116
+ └── uploads/ # Uploaded files
117
+ ```
118
+
119
+ ## 🧪 Testing
120
+
121
+ Run the test suite to verify everything works:
122
+ ```bash
123
+ python test_production_features.py
124
+ ```
125
+
126
+ Test the multi-page citation system:
127
+ ```bash
128
+ python test_multipage_citations.py
129
+ ```
130
+
131
+ Test the page count fix:
132
+ ```bash
133
+ python test_page_count_fix.py
134
+ ```
135
+
136
+ Test the enhanced detailed responses:
137
+ ```bash
138
+ python test_detailed_responses.py
139
+ ```
140
+
141
+ Test the page usage fix:
142
+ ```bash
143
+ python test_page_usage_fix.py
144
+ ```
145
+
146
+ Test the table generation functionality:
147
+ ```bash
148
+ python test_table_generation.py
149
+ ```
150
+
151
+ ## 🚀 Production Deployment
152
+
153
+ For production deployment, run:
154
+ ```bash
155
+ python deploy_production.py
156
+ ```
157
+
158
+ This will:
159
+ - ✅ Check prerequisites
160
+ - ✅ Setup environment
161
+ - ✅ Install dependencies
162
+ - ✅ Create database
163
+ - ✅ Setup logging
164
+ - ✅ Create Docker configurations
165
+ - ✅ Run tests
166
+
167
+ ## 🔍 Troubleshooting
168
+
169
+ ### **Common Issues**
170
+
171
+ #### **"No module named 'bcrypt'"**
172
+ ```bash
173
+ pip install bcrypt
174
+ ```
175
+
176
+ #### **"Docker not running"**
177
+ - Start Docker Desktop
178
+ - Wait for it to fully initialize
179
+
180
+ #### **"Ollama not found"**
181
+ ```bash
182
+ # Install Ollama
183
+ curl -fsSL https://ollama.ai/install.sh | sh
184
+ ollama serve
185
+ ```
186
+
187
+ #### **"CUDA out of memory"**
188
+ Reduce batch size in .env:
189
+ ```env
190
+ batchsize=2
191
+ ```
192
+
193
+ #### **"Database locked"**
194
+ ```bash
195
+ # Stop the application and restart
196
+ # Or delete the database file to start fresh
197
+ rm app_database.db
198
+ ```
199
+
200
+ #### **"Getting fewer pages than requested"**
201
+ - The system now ensures that exactly the requested number of pages is returned
202
+ - Check the console logs for debugging information
203
+ - Run the page count test: `python test_page_count_fix.py`
204
+ - If issues persist, check that documents have enough content for the query
205
+
206
+ #### **"LLM only cites 2 pages when 3 are requested"**
207
+ - The system now verifies that LLM uses all provided pages
208
+ - Enhanced prompts explicitly instruct the model to use ALL provided pages
209
+ - Page usage verification detects missing references
210
+ - Run the page usage test: `python test_page_usage_fix.py`
211
+ - Check console logs for page usage verification messages
212
+
213
+ ### **Performance Optimization**
214
+
215
+ #### **For GPU Users**
216
+ ```env
217
+ flashattn=1
218
+ batchsize=8
219
+ ```
220
+
221
+ #### **For CPU Users**
222
+ ```env
223
+ flashattn=0
224
+ batchsize=2
225
+ ```
226
+
227
+ #### **For Large Datasets**
228
+ ```env
229
+ topk=200
230
+ efnum=1000
231
+ mnum=32
232
+ ```
233
+
234
+ ## 📊 Monitoring
235
+
236
+ ### **Check Application Status**
237
+ - View logs in `logs/app.log`
238
+ - Monitor database size: `ls -lh app_database.db`
239
+ - Check uploaded documents: `ls -la pages/`
240
+
241
+ ### **Performance Metrics**
242
+ - Query response time
243
+ - Document processing time
244
+ - Memory usage
245
+ - GPU utilization (if applicable)
246
+
247
+ ## 🔐 Security Best Practices
248
+
249
+ ### **For Development**
250
+ - Use default passwords (already configured)
251
+ - Run on localhost only
252
+
253
+ ### **For Production**
254
+ - Change default passwords
255
+ - Use HTTPS
256
+ - Set up proper firewall rules
257
+ - Regular database backups
258
+ - Monitor access logs
259
+
260
+ ## 📞 Support
261
+
262
+ ### **Getting Help**
263
+ 1. Check the troubleshooting section above
264
+ 2. Review the full README.md
265
+ 3. Run the test suite: `python test_production_features.py`
266
+ 4. Check application logs: `tail -f logs/app.log`
267
+
268
+ ### **Feature Requests**
269
+ - Multi-language support
270
+ - Advanced analytics dashboard
271
+ - API endpoints
272
+ - Mobile app
273
+ - Integration with external systems
274
+
275
+ ## 🎉 What's Next?
276
+
277
+ After getting familiar with the basic features:
278
+
279
+ 1. **Upload Your Documents**: Replace the sample documents with your own
280
+ 2. **Customize Models**: Experiment with different AI models
281
+ 3. **Scale Up**: Add more users and teams
282
+ 4. **Integrate**: Connect with your existing systems
283
+ 5. **Deploy**: Move to production with the deployment script
284
+
285
+ ---
286
+
287
+ **Happy RAG-ing! 🚀**
288
+
289
+ *Made by Collar - Enhanced with Team Management & Chat History*
README.md CHANGED
@@ -1,12 +1,330 @@
1
  ---
2
- title: Demo Updated
3
- emoji: 🔥
4
- colorFrom: purple
5
- colorTo: pink
6
- sdk: gradio
7
- sdk_version: 5.44.1
8
- app_file: app.py
9
- pinned: false
10
- ---
11
 
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
 
1
+ # Collar Multimodal RAG Demo - Production Ready
2
+
3
+ A production-ready multimodal RAG (Retrieval-Augmented Generation) system with team management, chat history, and advanced document processing capabilities.
4
+
5
+ ## 🚀 New Production Features
6
+
7
+ ### 1. **Multi-Page Citations**
8
+ - **Complex Query Support**: The AI can now retrieve and cite multiple pages when queries reference information across different documents
9
+ - **Smart Citation System**: Automatically identifies and displays which pages contain relevant information
10
+ - **Configurable Results**: Users can specify how many pages to retrieve (1-10 pages)
11
+
12
+ ### 2. **Team-Based Repository Management**
13
+ - **Folder Uploads**: Upload multiple documents as organized collections
14
+ - **Team Isolation**: Each team has access only to their own document collections
15
+ - **Master Repository**: Documents are organized in team-specific repositories for easy access
16
+ - **Collection Naming**: Optional custom names for document collections
17
+
18
+ ### 3. **Authentication & Team Management**
19
+ - **User Authentication**: Secure login system with bcrypt password hashing
20
+ - **Team-Based Access**: Separate entry points for Team A and Team B
21
+ - **Session Management**: Secure session handling with automatic timeout
22
+ - **Access Control**: Users can only access and manage their team's documents
23
+
24
+ ### 4. **Chat History & Persistence**
25
+ - **Conversation Tracking**: All queries and responses are saved to a SQLite database
26
+ - **Historical Context**: View previous conversations with timestamps
27
+ - **Cited Pages History**: Track which pages were referenced in each conversation
28
+ - **Team-Specific History**: Each team sees only their own conversation history
29
+
30
+ ### 5. **Advanced Relevance Scoring**
31
+ - **Multimodal Embeddings**: ColPali-based semantic understanding of text and visual content
32
+ - **Intelligent Ranking**: Sophisticated relevance scoring with cosine similarity and dot product
33
+ - **Quality Assessment**: Automatic evaluation of information relevance and completeness
34
+ - **Diversity Optimization**: Ensures comprehensive coverage across document collections
35
+
36
+ ## 🔧 Installation & Setup
37
+
38
+ ### Prerequisites
39
+ - Python 3.8+
40
+ - Docker Desktop
41
+ - Ollama
42
+ - CUDA-compatible GPU (recommended)
43
+
44
+ ### 1. Install Dependencies
45
+ ```bash
46
+ pip install -r requirements.txt
47
+ ```
48
+
49
+ ### 2. Environment Configuration
50
+ Create a `.env` file with the following variables:
51
+ ```env
52
+ colpali=your_colpali_model
53
+ ollama=your_ollama_model
54
+ flashattn=1
55
+ temperature=0.8
56
+ batchsize=5
57
+ metrictype=IP
58
+ mnum=16
59
+ efnum=500
60
+ topk=50
61
+ ```
62
+
63
+ ### 3. Start Services
64
+ The application will automatically:
65
+ - Start Docker Desktop (Windows)
66
+ - Start Ollama server
67
+ - Initialize Docker containers
68
+ - Create default users
69
+
70
+ ## 👥 Default Users
71
+
72
+ The system creates default users for each team:
73
+
74
+ | Team | Username | Password |
75
+ |------|----------|----------|
76
+ | Team A | admin_team_a | admin123_team_a |
77
+ | Team B | admin_team_b | admin123_team_b |
78
+
79
+ ## 📖 Usage Guide
80
+
81
+ ### 1. **Authentication**
82
+ 1. Navigate to the "🔐 Authentication" tab
83
+ 2. Enter your username and password
84
+ 3. Click "Login" to access team-specific features
85
+
86
+ ### 2. **Document Management**
87
+ 1. Go to "📁 Document Management" tab
88
+ 2. Optionally enter a collection name for organization
89
+ 3. Set the maximum pages to extract per document
90
+ 4. Upload multiple PPT/PDF files
91
+ 5. Click "Upload to Repository" to process documents
92
+ 6. Use "Refresh Collections" to see available document collections
93
+
94
+ ### 3. **Advanced Querying**
95
+ 1. Navigate to "🔍 Advanced Query" tab
96
+ 2. Enter your query in the text box
97
+ 3. Adjust the number of pages to retrieve (1-10)
98
+ 4. Click "Search Documents" to get AI response with citations
99
+ 5. View the cited pages and retrieved document images
100
+ 6. Check relevance scores to understand information quality (see "Relevance Score Calculation" section)
101
+
102
+ ### 4. **Chat History**
103
+ 1. Go to "💬 Chat History" tab
104
+ 2. Adjust the number of conversations to display
105
+ 3. Click "Refresh History" to view recent conversations
106
+ 4. Each entry shows query, response, cited pages, and timestamp
107
+
108
+ ### 5. **Data Management**
109
+ 1. Access "⚙️ Data Management" tab
110
+ 2. Select collections to delete (team-restricted)
111
+ 3. Configure database parameters for optimal performance
112
+ 4. Update settings as needed
113
+
114
+ ## 🏗️ Architecture
115
+
116
+ ### Database Schema
117
+ - **users**: User accounts with team assignments
118
+ - **chat_history**: Conversation tracking with citations
119
+ - **document_collections**: Team-specific document organization
120
+
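+ The authoritative table definitions live in `app.py` (its diff is too large to render in this commit view), so the following is only a plausible sketch of the three tables, assuming SQLite and the columns implied by the feature list above. Every column name here is a hypothetical stand-in; check `app.py` for the real schema.
+
+ ```python
+ import sqlite3
+
+ conn = sqlite3.connect("app_database.db")
+ # Hypothetical column layout -- see app.py for the actual definitions.
+ conn.executescript("""
+ CREATE TABLE IF NOT EXISTS users (
+     id INTEGER PRIMARY KEY,
+     username TEXT UNIQUE NOT NULL,
+     password_hash TEXT NOT NULL,   -- bcrypt hash
+     team TEXT NOT NULL             -- e.g. 'team_a' or 'team_b'
+ );
+ CREATE TABLE IF NOT EXISTS chat_history (
+     id INTEGER PRIMARY KEY,
+     team TEXT NOT NULL,
+     query TEXT NOT NULL,
+     response TEXT NOT NULL,
+     cited_pages TEXT,              -- serialized list of cited page paths
+     created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
+ );
+ CREATE TABLE IF NOT EXISTS document_collections (
+     id INTEGER PRIMARY KEY,
+     team TEXT NOT NULL,
+     collection_name TEXT NOT NULL
+ );
+ """)
+ conn.commit()
+ ```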
121
+ ### Security Features
122
+ - **Password Hashing**: bcrypt for secure password storage
123
+ - **Session Management**: UUID-based session tokens
124
+ - **Access Control**: Team-based document isolation
125
+ - **Input Validation**: Comprehensive error handling
126
+
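+ For reference, the bcrypt hashing and UUID-based session tokens mentioned above reduce to a few library calls. This is a generic sketch, not the exact code in `app.py`:
+
+ ```python
+ import uuid
+
+ import bcrypt
+
+ # Hash a password at registration time (the salt is embedded in the returned hash).
+ password_hash = bcrypt.hashpw("admin123_team_a".encode(), bcrypt.gensalt())
+
+ # Verify a login attempt against the stored hash.
+ assert bcrypt.checkpw("admin123_team_a".encode(), password_hash)
+
+ # Issue an opaque session token on successful login.
+ session_token = str(uuid.uuid4())
+ print(session_token)
+ ```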
127
+ ### Performance Optimizations
128
+ - **Multi-threading**: Concurrent document processing
129
+ - **Memory Management**: Efficient image and vector handling
130
+ - **Caching**: Session-based caching for improved response times
131
+ - **Batch Processing**: Configurable batch sizes for GPU optimization
132
+
133
+ ## 🔍 Relevance Score Calculation
134
+
135
+ The system uses sophisticated relevance scoring to determine how well retrieved documents align with user queries. This process is crucial for selecting the most pertinent information for generating accurate and contextually appropriate responses.
136
+
137
+ ### How Relevance Scores Work
138
+
139
+ #### 1. **Document Embedding Process**
140
+ - **Page Segmentation**: Each document page is processed as a complete unit
141
+ - **Multimodal Encoding**: Both text and visual elements are captured using ColPali embeddings
142
+ - **Vector Representation**: Each page is transformed into a set of patch-level embedding vectors (128 dimensions per vector in this setup, as configured in `milvus_manager.py`)
143
+ - **Semantic Capture**: The embedding captures semantic meaning, not just keyword matches
144
+
145
+ #### 2. **Query Embedding**
146
+ - **Query Processing**: User queries are converted into embeddings using the same ColPali model
147
+ - **Semantic Understanding**: The system understands query intent, not just literal words
148
+ - **Context Preservation**: Query context and meaning are maintained in the embedding
149
+
150
+ #### 3. **Similarity Computation**
151
+ - **Cosine Similarity**: Primary similarity measure between query and document embeddings
152
+ - **Dot Product**: Alternative similarity calculation for high-dimensional vectors
153
+ - **Normalized Scores**: Similarity scores are normalized to a 0-1 range
154
+ - **Distance Metrics**: Lower distances indicate higher relevance
155
+
156
+ #### 4. **Score Aggregation & Ranking**
157
+ - **Individual Page Scores**: Each page gets a relevance score based on similarity
158
+ - **Collection Diversity**: Scores are adjusted to promote diversity across document collections
159
+ - **Consecutive Page Optimization**: Adjacent pages are considered for better context
160
+ - **Final Ranking**: Pages are ranked by their aggregated relevance scores
161
+
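+ Concretely, the reranking step in `milvus_manager.py` scores a page with a ColBERT-style late-interaction sum: each query token vector keeps its best dot product over the page's patch vectors, and those maxima are summed. A small numpy sketch of that computation, with random 128-dimensional vectors standing in for real ColPali embeddings and the shapes chosen only for illustration:
+
+ ```python
+ import numpy as np
+
+ rng = np.random.default_rng(0)
+ query_vecs = rng.normal(size=(20, 128))   # one 128-d vector per query token
+ page_vecs = rng.normal(size=(1030, 128))  # one 128-d vector per page patch
+
+ # For every query vector, keep the best match among the page vectors, then sum.
+ score = np.dot(query_vecs, page_vecs.T).max(axis=1).sum()
+ print(f"relevance score: {score:.3f}")
+ ```
+
+ This raw late-interaction sum is not bounded to 0-1; the ranges in the table below apply after the normalization described in step 3.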
162
+ ### Relevance Score Interpretation
163
+
164
+ | Score Range | Relevance Level | Description |
165
+ |-------------|----------------|-------------|
166
+ | 0.90 - 1.00 | **Excellent** | Highly relevant, directly answers the query |
167
+ | 0.80 - 0.89 | **Very Good** | Very relevant, provides substantial information |
168
+ | 0.70 - 0.79 | **Good** | Relevant, contains useful information |
169
+ | 0.60 - 0.69 | **Moderate** | Somewhat relevant, may contain partial answers |
170
+ | 0.50 - 0.59 | **Basic** | Minimally relevant, limited usefulness |
171
+ | < 0.50 | **Poor** | Not relevant, unlikely to be useful |
172
+
173
+ ### Example Relevance Calculation
174
+
175
+ **Query**: "What are the safety procedures for handling explosives?"
176
+
177
+ **Document Pages**:
178
+ 1. **Page 15**: "Safety protocols for explosive materials" → Score: 0.95 (Excellent)
179
+ 2. **Page 23**: "Equipment requirements for explosive handling" → Score: 0.92 (Very Good)
180
+ 3. **Page 8**: "General laboratory safety guidelines" → Score: 0.88 (Very Good)
181
+ 4. **Page 45**: "Chemical storage procedures" → Score: 0.65 (Moderate)
182
+
183
+ **Selection Process**:
184
+ - Pages 15, 23, and 8 are selected for their high relevance
185
+ - Page 45 is excluded due to lower relevance
186
+ - The system ensures diversity across different aspects of safety procedures
187
+
188
+ ### Advanced Features
189
+
190
+ #### **Multi-Modal Relevance**
191
+ - **Visual Elements**: Images, charts, and diagrams contribute to relevance scores
192
+ - **Text-Vision Alignment**: ColPali captures relationships between text and visual content
193
+ - **Layout Understanding**: Document structure and formatting influence relevance
194
+
195
+ #### **Context-Aware Scoring**
196
+ - **Query Complexity**: Complex queries may retrieve more pages with varied scores
197
+ - **Cross-Reference Detection**: Pages that reference each other get boosted scores
198
+ - **Temporal Relevance**: Recent documents may receive slight score adjustments
199
+
200
+ #### **Quality Assurance**
201
+ - **Score Verification**: System validates that selected pages meet minimum relevance thresholds
202
+ - **Diversity Optimization**: Ensures selected pages provide comprehensive coverage
203
+ - **Redundancy Reduction**: Avoids selecting multiple pages with very similar content
204
+
205
+ ### Configuration Parameters
206
+
207
+ ```env
208
+ # Relevance scoring configuration
209
+ metrictype=IP # Inner Product similarity
210
+ mnum=16 # Number of connections in HNSW graph
211
+ efnum=500 # Search depth for high-quality results
212
+ topk=50 # Maximum results to consider
213
+ ```
214
+
215
+ ### Performance Impact
216
+
217
+ - **Search Speed**: Relevance scoring adds minimal overhead (~10-50ms per query)
218
+ - **Accuracy**: High-quality embeddings ensure accurate relevance assessment
219
+ - **Scalability**: Efficient vector operations support large document collections
220
+ - **Memory Usage**: Optimized to handle thousands of document pages efficiently
221
+
222
+ ## 🔒 Security Considerations
223
+
224
+ ### Production Deployment
225
+ 1. **HTTPS**: Always use HTTPS in production
226
+ 2. **Environment Variables**: Store sensitive data in environment variables
227
+ 3. **Database Security**: Use production-grade database (PostgreSQL/MySQL)
228
+ 4. **Rate Limiting**: Implement API rate limiting
229
+ 5. **Logging**: Add comprehensive logging for security monitoring
230
+
231
+ ### Recommended Security Enhancements
232
+ ```python
233
+ # Add to production deployment
234
+ import logging
235
+ from flask_limiter import Limiter
236
+ from flask_limiter.util import get_remote_address
237
+
238
+ # Rate limiting
239
+ limiter = Limiter(
240
+ app,
241
+ key_func=get_remote_address,
242
+ default_limits=["200 per day", "50 per hour"]
243
+ )
244
+
245
+ # Security headers
246
+ @app.after_request
247
+ def add_security_headers(response):
248
+ response.headers['X-Content-Type-Options'] = 'nosniff'
249
+ response.headers['X-Frame-Options'] = 'DENY'
250
+ response.headers['X-XSS-Protection'] = '1; mode=block'
251
+ return response
252
+ ```
253
+
254
+ ## 🚀 Deployment
255
+
256
+ ### Docker Deployment
257
+ ```dockerfile
258
+ FROM python:3.9-slim
259
+
260
+ WORKDIR /app
261
+ COPY requirements.txt .
262
+ RUN pip install -r requirements.txt
263
+
264
+ COPY . .
265
+ EXPOSE 7860
266
+
267
+ CMD ["python", "app.py"]
268
+ ```
269
+
270
+ ### Environment Variables for Production
271
+ ```env
272
+ # Database
273
+ DATABASE_URL=postgresql://user:password@localhost/dbname
274
+ SECRET_KEY=your-secret-key-here
275
+
276
+ # Security
277
+ BCRYPT_ROUNDS=12
278
+ SESSION_TIMEOUT=3600
279
+
280
+ # Performance
281
+ WORKER_THREADS=4
282
+ MAX_UPLOAD_SIZE=100MB
283
+ ```
284
+
285
+ ## 📊 Monitoring & Analytics
286
+
287
+ ### Key Metrics to Track
288
+ - **Query Response Time**: Average time for AI responses
289
+ - **Document Processing Time**: Time to index new documents
290
+ - **User Activity**: Login frequency and session duration
291
+ - **Error Rates**: Failed queries and system errors
292
+ - **Storage Usage**: Database and file system utilization
293
+
294
+ ### Logging Configuration
295
+ ```python
296
+ import logging
297
+
298
+ logging.basicConfig(
299
+ level=logging.INFO,
300
+ format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
301
+ handlers=[
302
+ logging.FileHandler('app.log'),
303
+ logging.StreamHandler()
304
+ ]
305
+ )
306
+ ```
307
+
308
+ ## 🤝 Contributing
309
+
310
+ 1. Fork the repository
311
+ 2. Create a feature branch
312
+ 3. Make your changes
313
+ 4. Add tests for new features
314
+ 5. Submit a pull request
315
+
316
+ ## 📄 License
317
+
318
+ This project is licensed under the MIT License - see the LICENSE file for details.
319
+
320
+ ## 🆘 Support
321
+
322
+ For support and questions:
323
+ - Create an issue in the repository
324
+ - Check the documentation
325
+ - Review the troubleshooting guide
326
+
327
  ---
328
 
329
+ **Made by Collar** - Enhanced with Team Management & Chat History
330
+
app.py ADDED
The diff for this file is too large to render. See raw diff
 
app_database.db ADDED
Binary file (41 kB).
 
colpali_manager.py ADDED
@@ -0,0 +1,186 @@
1
+ from colpali_engine.models import ColPali
2
+ from colpali_engine.models import ColPaliProcessor
3
+ from colpali_engine.utils.processing_utils import BaseVisualRetrieverProcessor
4
+ from colpali_engine.utils.torch_utils import ListDataset, get_torch_device
5
+ from torch.utils.data import DataLoader
6
+ import torch
7
+ from typing import List, cast
8
+ import matplotlib.pyplot as plt
9
+ #from colpali_engine.models import ColQwen2_5, ColQwen2_5_Processor
10
+ from colpali_engine.models import ColIdefics3, ColIdefics3Processor
11
+
12
+ from tqdm import tqdm
13
+ from PIL import Image
14
+ import os
15
+
16
+ import spaces
17
+
18
+
19
+ #this part is for local runs
20
+ torch.cuda.empty_cache()
21
+
22
+ #get model name from .env variable & set directory & processor dir as the model names!
23
+ import dotenv
24
+ # Load the .env file
25
+ dotenv_file = dotenv.find_dotenv()
26
+ dotenv.load_dotenv(dotenv_file)
27
+
28
+ model_name = os.environ['colpali'] #"vidore/colSmol-256M"
29
+ device = get_torch_device("cuda") #try using cpu instead of cuda?
30
+
31
+ #switch to locally downloading models & loading locally rather than from hf
32
+ #
33
+
34
+ current_working_directory = os.getcwd()
35
+ save_directory = model_name # Directory to save the specific model name
36
+ save_directory = os.path.join(current_working_directory, save_directory)
37
+
38
+ processor_directory = model_name+'_processor' # Directory to save the processor
39
+ processor_directory = os.path.join(current_working_directory, processor_directory)
40
+
41
+
42
+
43
+ if not os.path.exists(save_directory): #download if directory not created/model not loaded
44
+ # Directory does not exist; create it
45
+
46
+ if "colSmol-256M" in model_name: #if colsmol
47
+ model = ColIdefics3.from_pretrained(
48
+ model_name,
49
+ torch_dtype=torch.bfloat16,
50
+ device_map=device,
51
+ #attn_implementation="flash_attention_2",
52
+ ).eval()
53
+ processor = cast(ColIdefics3Processor, ColIdefics3Processor.from_pretrained(model_name))
54
+ else: #if colpali v1.3 etc
55
+ model = ColPali.from_pretrained(
56
+ model_name,
57
+ torch_dtype=torch.bfloat16,
58
+ device_map=device,
59
+ #attn_implementation="flash_attention_2",
60
+ ).eval()
61
+ processor = cast(ColPaliProcessor, ColPaliProcessor.from_pretrained(model_name))
62
+ os.makedirs(save_directory)
63
+ print(f"Directory '{save_directory}' created.")
64
+ model.save_pretrained(save_directory)
65
+ os.makedirs(processor_directory)
66
+ processor.save_pretrained(processor_directory)
67
+
68
+ else:
69
+ if "colSmol-256M" in model_name:
70
+ model = ColIdefics3.from_pretrained(save_directory)
71
+ processor = ColIdefics3Processor.from_pretrained(processor_directory, use_fast=True)
72
+ else:
73
+ model = ColPali.from_pretrained(save_directory)
74
+ processor = ColPaliProcessor.from_pretrained(processor_directory, use_fast=True)
75
+
76
+
77
+ class ColpaliManager:
78
+
79
+
80
+ def __init__(self, device = "cuda", model_name = model_name): #need to hot potato/use diff gpus between colpali & ollama
81
+
82
+ print(f"Initializing ColpaliManager with device {device} and model {model_name}")
83
+
84
+ # self.device = get_torch_device(device)
85
+
86
+ # self.model = ColPali.from_pretrained(
87
+ # model_name,
88
+ # torch_dtype=torch.bfloat16,
89
+ # device_map=self.device,
90
+ # ).eval()
91
+
92
+ # self.processor = cast(ColPaliProcessor, ColPaliProcessor.from_pretrained(model_name))
93
+
94
+ @spaces.GPU
95
+ def get_images(self, paths: list[str]) -> List[Image.Image]:
96
+ model.to("cuda")
97
+ return [Image.open(path) for path in paths]
98
+
99
+ @spaces.GPU
100
+ def process_images(self, image_paths:list[str], batch_size=int(os.environ['batchsize'])):
101
+ model.to("cuda")
102
+ print(f"Processing {len(image_paths)} image_paths")
103
+
104
+ images = self.get_images(image_paths)
105
+
106
+ dataloader = DataLoader(
107
+ dataset=ListDataset[str](images),
108
+ batch_size=batch_size,
109
+ shuffle=False,
110
+ collate_fn=lambda x: processor.process_images(x),
111
+ )
112
+
113
+ ds: List[torch.Tensor] = []
114
+ for batch_doc in tqdm(dataloader):
115
+ with torch.no_grad():
116
+ batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()}
117
+ embeddings_doc = model(**batch_doc)
118
+ ds.extend(list(torch.unbind(embeddings_doc.to(device))))
119
+
120
+ ds_np = [d.float().cpu().numpy() for d in ds]
121
+
122
+ return ds_np
123
+
124
+
125
+ @spaces.GPU
126
+ def process_text(self, texts: list[str]):
127
+
128
+ #current_working_directory = os.getcwd()
129
+ #save_directory = model_name # Directory to save the specific model name
130
+ #save_directory = os.path.join(current_working_directory, save_directory)
131
+
132
+ #processor_directory = model_name+'_processor' # Directory to save the processor
133
+ #processor_directory = os.path.join(current_working_directory, processor_directory)
134
+
135
+
136
+
137
+ if not os.path.exists(save_directory): #download if directory not created/model not loaded
138
+
139
+ #MUST USE colpali v1.3/1.2 etc, CANNOT USE SMOLCOLPALI! for queries AS NOT RELIABLE!
140
+ """
141
+ model = ColPali.from_pretrained(
142
+ model_name,
143
+ torch_dtype=torch.bfloat16,
144
+ device_map=device,
145
+ attn_implementation="flash_attention_2",
146
+ ).eval()
147
+ processor = cast(ColPaliProcessor, ColPaliProcessor.from_pretrained(model_name))
148
+ os.makedirs(save_directory)
149
+ print(f"Directory '{save_directory}' created.")
150
+ model.save_pretrained(save_directory)
151
+ os.makedirs(processor_directory)
152
+ processor.save_pretrained(processor_directory)
153
+ else:
154
+ model = ColPali.from_pretrained(save_directory)
155
+ processor = ColPaliProcessor.from_pretrained(processor_directory, use_fast=True)
156
+ """
157
+
158
+
159
+ model.to("cuda") #ensure this is commented out so ollama/multimodal llm can use gpu! (nah wrong, need to enable so that it can process multiple)
160
+ print(f"Processing {len(texts)} texts")
161
+
162
+ dataloader = DataLoader(
163
+ dataset=ListDataset[str](texts),
164
+ batch_size=int(os.environ['batchsize']), #OG is 5, try reducing batch size to maximise gpu use
165
+ shuffle=False,
166
+ collate_fn=lambda x: processor.process_queries(x),
167
+ )
168
+
169
+
170
+ qs: List[torch.Tensor] = []
171
+ for batch_query in dataloader:
172
+ with torch.no_grad():
173
+ batch_query = {k: v.to(model.device) for k, v in batch_query.items()}
174
+ embeddings_query = model(**batch_query)
175
+
176
+ qs.extend(list(torch.unbind(embeddings_query.to(device))))
177
+
178
+ qs_np = [q.float().cpu().numpy() for q in qs]
179
+ model.to("cpu") # Moves all model parameters and buffers to the CPU, freeing up gpu for ollama call after this process text call! (THIS WORKS!)
180
+
181
+ return qs_np
182
+
183
+ plt.close("all")
184
+
185
+
186
+
docker-compose.yml ADDED
@@ -0,0 +1,66 @@
1
+ version: '3.5'
2
+
3
+ services:
4
+ etcd:
5
+ container_name: milvus-etcd
6
+ image: quay.io/coreos/etcd:v3.5.16
7
+ environment:
8
+ - ETCD_AUTO_COMPACTION_MODE=revision
9
+ - ETCD_AUTO_COMPACTION_RETENTION=1000
10
+ - ETCD_QUOTA_BACKEND_BYTES=4294967296
11
+ - ETCD_SNAPSHOT_COUNT=50000
12
+ volumes:
13
+ - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
14
+ command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
15
+ healthcheck:
16
+ test: ["CMD", "etcdctl", "endpoint", "health"]
17
+ interval: 30s
18
+ timeout: 20s
19
+ retries: 3
20
+
21
+ minio:
22
+ container_name: milvus-minio
23
+ image: minio/minio:RELEASE.2023-03-20T20-16-18Z
24
+ environment:
25
+ MINIO_ACCESS_KEY: minioadmin
26
+ MINIO_SECRET_KEY: minioadmin
27
+ ports:
28
+ - "9001:9001"
29
+ - "9000:9000"
30
+ volumes:
31
+ - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
32
+ command: minio server /minio_data --console-address ":9001"
33
+ healthcheck:
34
+ test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
35
+ interval: 30s
36
+ timeout: 20s
37
+ retries: 3
38
+
39
+ standalone:
40
+ container_name: milvus-standalone
41
+ image: milvusdb/milvus:v2.5.4-gpu
42
+ command: ["milvus", "run", "standalone"]
43
+ security_opt:
44
+ - seccomp:unconfined
45
+ environment:
46
+ ETCD_ENDPOINTS: etcd:2379
47
+ MINIO_ADDRESS: minio:9000
48
+ volumes:
49
+ - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
50
+ ports:
51
+ - "19530:19530"
52
+ - "9091:9091"
53
+ deploy:
54
+ resources:
55
+ reservations:
56
+ devices:
57
+ - driver: nvidia
58
+ capabilities: ["gpu"]
59
+ device_ids: ["0"]
60
+ depends_on:
61
+ - "etcd"
62
+ - "minio"
63
+
64
+ networks:
65
+ default:
66
+ name: milvus
middleware.py ADDED
@@ -0,0 +1,62 @@
1
+ from colpali_manager import ColpaliManager
2
+ from milvus_manager import MilvusManager
3
+ from pdf_manager import PdfManager
4
+ import hashlib
5
+
6
+
7
+
8
+ pdf_manager = PdfManager()
9
+ colpali_manager = ColpaliManager()
10
+
11
+
12
+
13
+ class Middleware:
14
+ def __init__(self, id:str, create_collection=True):
15
+ #hashed_id = hashlib.md5(id.encode()).hexdigest()[:8]
16
+ hashed_id = 0 # switched to a persistent db; should use a different id per account
17
+ milvus_db_name = f"milvus_{hashed_id}.db"
18
+ self.milvus_manager = MilvusManager(milvus_db_name, id, create_collection) #create collections based on id rather than colpali
19
+
20
+ def index(self, pdf_path: str, id:str, max_pages: int, pages: list[int] = None):
21
+
22
+ if pdf_path is None: #for direct query without any upload to db
23
+ print("no docs")
24
+ return
25
+
26
+ print(f"Indexing {pdf_path}, id: {id}, max_pages: {max_pages}")
27
+
28
+ image_paths = pdf_manager.save_images(id, pdf_path, max_pages)
29
+
30
+ print(f"Saved {len(image_paths)} images")
31
+
32
+ colbert_vecs = colpali_manager.process_images(image_paths)
33
+
34
+ images_data = [{
35
+ "colbert_vecs": colbert_vecs[i],
36
+ "filepath": image_paths[i]
37
+ } for i in range(len(image_paths))]
38
+
39
+ print(f"Inserting {len(images_data)} images data to Milvus")
40
+
41
+ self.milvus_manager.insert_images_data(images_data)
42
+
43
+ print("Indexing completed")
44
+
45
+ return image_paths
46
+
47
+
48
+
49
+ def search(self, search_queries: list[str], topk: int = 10):
50
+ print(f"Searching for {len(search_queries)} queries with topk={topk}")
51
+
52
+ final_res = []
53
+
54
+ for query in search_queries:
55
+ print(f"Searching for query: {query}")
56
+ query_vec = colpali_manager.process_text([query])[0]
57
+ search_res = self.milvus_manager.search(query_vec, topk=topk)
58
+ print(f"Search result: {len(search_res)} results for query: {query}")
59
+ final_res.append(search_res)
60
+
61
+ return final_res
62
+
milvus_manager.py ADDED
@@ -0,0 +1,218 @@
1
+ from pymilvus import MilvusClient, DataType
2
+ try:
3
+ from milvus import default_server # Milvus Lite
4
+ except Exception:
5
+ default_server = None
6
+ import numpy as np
7
+ import concurrent.futures
8
+ from pymilvus import Collection
9
+ import os
10
+
11
+ class MilvusManager:
12
+ def __init__(self, milvus_uri, collection_name, create_collection, dim=128):
13
+
14
+ #import environ variables from .env
15
+ import dotenv
16
+ # Load the .env file
17
+ dotenv_file = dotenv.find_dotenv()
18
+ dotenv.load_dotenv(dotenv_file)
19
+
20
+ # Start embedded Milvus Lite server and connect locally
21
+ if default_server is not None:
22
+ try:
23
+ # Optionally set base dir here if desired, e.g. default_server.set_base_dir('volumes/milvus_lite')
24
+ default_server.start()
25
+ except Exception:
26
+ pass
27
+ local_uri = f"http://127.0.0.1:{default_server.listen_port}"
28
+ self.client = MilvusClient(uri=local_uri)
29
+ else:
30
+ # Fallback to standard local server (assumes docker-compose or system service)
31
+ self.client = MilvusClient(uri="http://127.0.0.1:19530")
32
+ self.collection_name = collection_name
33
+ self.dim = dim
34
+
35
+ if self.client.has_collection(collection_name=self.collection_name):
36
+ self.client.load_collection(collection_name=self.collection_name)
37
+ print("Loaded existing collection.")
38
+ elif create_collection:
39
+ self.create_collection()
40
+ self.create_index()
41
+
42
+ def create_collection(self):
43
+ if self.client.has_collection(collection_name=self.collection_name):
44
+ print("Collection already exists.")
45
+ return
46
+
47
+ schema = self.client.create_schema(
48
+ auto_id=True,
49
+ enable_dynamic_fields=True,
50
+ )
51
+ schema.add_field(field_name="pk", datatype=DataType.INT64, is_primary=True)
52
+ schema.add_field(
53
+ field_name="vector", datatype=DataType.FLOAT_VECTOR, dim=self.dim
54
+ )
55
+ schema.add_field(field_name="seq_id", datatype=DataType.INT16)
56
+ schema.add_field(field_name="doc_id", datatype=DataType.INT64)
57
+ schema.add_field(field_name="doc", datatype=DataType.VARCHAR, max_length=65535)
58
+
59
+ self.client.create_collection(
60
+ collection_name=self.collection_name, schema=schema
61
+ )
62
+
63
+ def create_index(self):
64
+ index_params = self.client.prepare_index_params()
65
+
66
+ index_params.add_index(
67
+ field_name="vector",
68
+ index_name="vector_index",
69
+ index_type="HNSW", #use HNSW option if got more mem, if not use IVF for faster processing
70
+ metric_type=os.environ["metrictype"], #"IP"
71
+ params={
72
+ "M": int(os.environ["mnum"]), #M:16 for HNSW, capital M
73
+ "efConstruction": int(os.environ["efnum"]), #500 for HNSW
74
+ },
75
+ )
76
+
77
+ self.client.create_index(
78
+ collection_name=self.collection_name, index_params=index_params, sync=True
79
+ )
80
+
81
+ def search(self, data, topk):
82
+ # Retrieve all collection names from the Milvus client.
83
+ collections = self.client.list_collections()
84
+
85
+ # Set search parameters (here, using Inner Product metric).
86
+ search_params = {"metric_type": os.environ["metrictype"], "params": {}} #default metric type is "IP"
87
+
88
+ # Set to store unique (doc_id, collection_name) pairs across all collections.
89
+ doc_collection_pairs = set()
90
+
91
+ # Query each collection individually
92
+ for collection in collections:
93
+ self.client.load_collection(collection_name=collection)
94
+ print("collection loaded:"+ collection)
95
+ results = self.client.search(
96
+ collection,
97
+ data,
98
+ limit=int(os.environ["topk"]), # Adjust limit per collection as needed. (default is 50)
99
+ output_fields=["vector", "seq_id", "doc_id"],
100
+ search_params=search_params,
101
+ )
102
+ # Accumulate document IDs along with their originating collection.
103
+ for r_id in range(len(results)):
104
+ for r in range(len(results[r_id])):
105
+ doc_id = results[r_id][r]["entity"]["doc_id"]
106
+ doc_collection_pairs.add((doc_id, collection))
107
+
108
+ scores = []
109
+
110
+ def rerank_single_doc(doc_id, data, client, collection_name):
111
+ # Query for detailed document vectors in the given collection.
112
+ doc_colbert_vecs = client.query(
113
+ collection_name=collection_name,
114
+ filter=f"doc_id in [{doc_id}, {doc_id + 1}]",
115
+ output_fields=["seq_id", "vector", "doc"],
116
+ limit=16380,
117
+ )
118
+ # Stack the vectors for dot product computation.
119
+ doc_vecs = np.vstack(
120
+ [doc_colbert_vecs[i]["vector"] for i in range(len(doc_colbert_vecs))]
121
+ )
122
+ # Compute a similarity score via dot product.
123
+ score = np.dot(data, doc_vecs.T).max(1).sum()
124
+ return (score, doc_id, collection_name)
125
+
126
+ # Use a thread pool to rerank each document concurrently.
127
+ with concurrent.futures.ThreadPoolExecutor(max_workers=300) as executor:
128
+ futures = {
129
+ executor.submit(rerank_single_doc, doc_id, data, self.client, collection): (doc_id, collection)
130
+ for doc_id, collection in doc_collection_pairs
131
+ }
132
+ for future in concurrent.futures.as_completed(futures):
133
+ score, doc_id, collection = future.result()
134
+ scores.append((score, doc_id, collection))
135
+ #doc_id is page number!
136
+
137
+ # Sort the reranked results by score in descending order.
138
+ scores.sort(key=lambda x: x[0], reverse=True)
139
+ # Unload the collection after search to free memory.
140
+ self.client.release_collection(collection_name=collection)
141
+
142
+ return scores[:topk] if len(scores) >= topk else scores #topk is the number of scores to return back
143
+ """
144
+ search_params = {"metric_type": "IP", "params": {}}
145
+ results = self.client.search(
146
+ self.collection_name,
147
+ data,
148
+ limit=50,
149
+ output_fields=["vector", "seq_id", "doc_id"],
150
+ search_params=search_params,
151
+ )
152
+ doc_ids = {result["entity"]["doc_id"] for result in results[0]}
153
+
154
+ scores = []
155
+
156
+ def rerank_single_doc(doc_id, data, client, collection_name):
157
+ doc_colbert_vecs = client.query(
158
+ collection_name=collection_name,
159
+ filter=f"doc_id in [{doc_id}, {doc_id + 1}]",
160
+ output_fields=["seq_id", "vector", "doc"],
161
+ limit=1000,
162
+ )
163
+ doc_vecs = np.vstack(
164
+ [doc["vector"] for doc in doc_colbert_vecs]
165
+ )
166
+ score = np.dot(data, doc_vecs.T).max(1).sum()
167
+ return score, doc_id
168
+
169
+ with concurrent.futures.ThreadPoolExecutor(max_workers=300) as executor:
170
+ futures = {
171
+ executor.submit(
172
+ rerank_single_doc, doc_id, data, self.client, self.collection_name
173
+ ): doc_id
174
+ for doc_id in doc_ids
175
+ }
176
+ for future in concurrent.futures.as_completed(futures):
177
+ score, doc_id = future.result()
178
+ scores.append((score, doc_id))
179
+
180
+ scores.sort(key=lambda x: x[0], reverse=True)
181
+ return scores[:topk]
182
+ """
183
+
184
+ def insert(self, data):
185
+ colbert_vecs = data["colbert_vecs"]
186
+ seq_length = len(colbert_vecs)
187
+ doc_ids = [data["doc_id"]] * seq_length
188
+ seq_ids = list(range(seq_length))
189
+ docs = [""] * seq_length
190
+ docs[0] = data["filepath"]
191
+
192
+ self.client.insert(
193
+ self.collection_name,
194
+ [
195
+ {
196
+ "vector": colbert_vecs[i],
197
+ "seq_id": seq_ids[i],
198
+ "doc_id": doc_ids[i],
199
+ "doc": docs[i],
200
+ }
201
+ for i in range(seq_length)
202
+ ],
203
+ )
204
+
205
+ def get_images_as_doc(self, images_with_vectors):
206
+ return [
207
+ {
208
+ "colbert_vecs": image["colbert_vecs"],
209
+ "doc_id": idx,
210
+ "filepath": image["filepath"],
211
+ }
212
+ for idx, image in enumerate(images_with_vectors)
213
+ ]
214
+
215
+ def insert_images_data(self, image_data):
216
+ data = self.get_images_as_doc(image_data)
217
+ for item in data:
218
+ self.insert(item)
packages.txt ADDED
@@ -0,0 +1 @@
1
+ poppler-utils
pdf_manager.py ADDED
@@ -0,0 +1,46 @@
1
+ from pdf2image import convert_from_path
2
+ import os
3
+ import shutil
4
+
5
+ class PdfManager:
6
+ def __init__(self):
7
+ pass
8
+
9
+ def clear_and_recreate_dir(self, output_folder):
10
+
11
+ print(f"Clearing output folder {output_folder}")
12
+
13
+ if os.path.exists(output_folder):
14
+ shutil.rmtree(output_folder)
15
+ #print("Clearing is unused for now for persistency")
16
+ else:
17
+ os.makedirs(output_folder)
18
+
19
+ #print("Clearing is unused for now for persistency")
20
+
21
+ def save_images(self, id, pdf_path, max_pages, pages: list[int] = None) -> list[str]:
22
+ output_folder = f"pages/{id}" #remove last backslash to avoid error,test this
23
+ images = convert_from_path(pdf_path)
24
+
25
+ print(f"Saving images from {pdf_path} to {output_folder}. Max pages: {max_pages}")
26
+
27
+ self.clear_and_recreate_dir(output_folder)
28
+
29
+ num_page_processed = 0
+ saved_paths = []
30
+
31
+ for i, image in enumerate(images):
32
+ if max_pages and num_page_processed >= max_pages:
33
+ break
34
+
35
+ if pages and i not in pages:
36
+ continue
37
+
38
+ full_save_path = f"{output_folder}/page_{i + 1}.png"
39
+
40
+ #print(f"Saving image to {full_save_path}")
41
+
42
+ image.save(full_save_path, "PNG")
+ saved_paths.append(full_save_path)
43
+
44
+ num_page_processed += 1
45
+
46
+ return [f"{output_folder}/page_{i + 1}.png" for i in range(num_page_processed)]
rag.py ADDED
@@ -0,0 +1,153 @@
1
+ import requests
2
+ import os
3
+ import re
4
+
5
+ from typing import List
6
+ from utils import encode_image
7
+ from PIL import Image
8
+ from google import genai
9
+ import torch
10
+ import subprocess
11
+ import psutil
12
+ from transformers import AutoModel, AutoTokenizer
15
+
16
+
17
+ class Rag:
18
+
19
+ def _clean_raw_token_response(self, response_text):
20
+ """
21
+ Clean raw token responses that contain undecoded token IDs
22
+ This handles cases where models return raw tokens instead of decoded text
23
+ """
24
+ if not response_text:
25
+ return response_text
26
+
27
+ # Check if response contains raw token patterns
28
+ token_patterns = [
29
+ r'<unused\d+>', # unused tokens
30
+ r'<bos>', # beginning of sequence
31
+ r'<eos>', # end of sequence
32
+ r'<unk>', # unknown tokens
33
+ r'<mask>', # mask tokens
34
+ r'<pad>', # padding tokens
35
+ r'\[multimodal\]', # multimodal tokens
36
+ ]
37
+
38
+ # If response contains raw tokens, try to clean them
39
+ has_raw_tokens = any(re.search(pattern, response_text) for pattern in token_patterns)
40
+
41
+ if has_raw_tokens:
42
+ print("⚠️ Detected raw token response, attempting to clean...")
43
+
44
+ # Remove common raw token patterns
45
+ cleaned_text = response_text
46
+
47
+ # Remove unused tokens
48
+ cleaned_text = re.sub(r'<unused\d+>', '', cleaned_text)
49
+
50
+ # Remove special tokens
51
+ cleaned_text = re.sub(r'<(bos|eos|unk|mask|pad)>', '', cleaned_text)
52
+
53
+ # Remove multimodal tokens
54
+ cleaned_text = re.sub(r'\[multimodal\]', '', cleaned_text)
55
+
56
+ # Clean up extra whitespace
57
+ cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
58
+
59
+ # If we still have mostly tokens, return an error message
60
+ if len(cleaned_text.strip()) < 10:
61
+ return "❌ **Model Response Error**: The model returned raw token IDs instead of decoded text. This may be due to model configuration issues. Please try:\n\n1. Restarting the Ollama server\n2. Using a different model\n3. Checking model compatibility with multimodal inputs"
62
+
63
+ return cleaned_text
64
+
65
+ return response_text
66
+
67
+ def get_answer_from_gemini(self, query: str, image_paths: List[str]) -> str:
68
+ print(f"Querying Gemini 2.5 Pro for query={query}, image_paths={image_paths}")
69
+ try:
70
+ # Use environment variable GEMINI_API_KEY
71
+ api_key = os.environ.get('GEMINI_API_KEY')
72
+ if not api_key:
73
+ return "Error: GEMINI_API_KEY is not set."
74
+
75
+ # The installed SDK (from google import genai) exposes a Client rather than configure()/GenerativeModel
+ client = genai.Client(api_key=api_key)
77
+
78
+ # Load images
79
+ images = []
80
+ for p in image_paths:
81
+ try:
82
+ images.append(Image.open(p))
83
+ except Exception:
84
+ pass
85
+
86
+ # Send the page images plus the query in a single generate_content call
+ response = client.models.generate_content(model="gemini-2.5-pro", contents=[*images, query])
88
+ return response.text
89
+ except Exception as e:
90
+ print(f"Gemini error: {e}")
91
+ return f"Error: {str(e)}"
92
+
93
+ #os.environ['OPENAI_API_KEY'] = "for the love of Jesus let this work"
94
+
95
+ def get_answer_from_openai(self, query, imagesPaths):
96
+ #import environ variables from .env
97
+ import dotenv
98
+
99
+ # Load the .env file
100
+ dotenv_file = dotenv.find_dotenv()
101
+ dotenv.load_dotenv(dotenv_file)
102
+
103
+ # This function formerly used Ollama. Replace with Gemini 2.5 Pro.
104
+ print(f"Querying Gemini (replacement for Ollama) for query={query}, imagesPaths={imagesPaths}")
105
+ try:
106
+ enhanced_query = f"Use all {len(imagesPaths)} pages to answer comprehensively.\n\nQuery: {query}"
107
+ return self.get_answer_from_gemini(enhanced_query, imagesPaths)
108
+ except Exception as e:
109
+ print(f"Gemini replacement error: {e}")
110
+ return None
111
+
112
+
113
+
114
+ def __get_openai_api_payload(self, query:str, imagesPaths:List[str]):
115
+ image_payload = []
116
+
117
+ for imagePath in imagesPaths:
118
+ base64_image = encode_image(imagePath)
119
+ image_payload.append({
120
+ "type": "image_url",
121
+ "image_url": {
122
+ "url": f"data:image/jpeg;base64,{base64_image}"
123
+ }
124
+ })
125
+
126
+ payload = {
127
+ "model": "Llama3.2-vision", #change model here as needed
128
+ "messages": [
129
+ {
130
+ "role": "user",
131
+ "content": [
132
+ {
133
+ "type": "text",
134
+ "text": query
135
+ },
136
+ *image_payload
137
+ ]
138
+ }
139
+ ],
140
+ "max_tokens": 1024 #reduce token size to reduce processing time
141
+ }
142
+
143
+ return payload
144
+
145
+
146
+
147
+ # if __name__ == "__main__":
148
+ # rag = Rag()
149
+
150
+ # query = "Based on attached images, how many new cases were reported during second wave peak"
151
+ # imagesPaths = ["covid_slides_page_8.png", "covid_slides_page_8.png"]
152
+
153
+ # rag.get_answer_from_gemini(query, imagesPaths)
requirements.txt ADDED
@@ -0,0 +1,19 @@
1
+ gradio>=4.0.0
2
+ pymilvus>=2.3.0
3
+ milvus>=2.4.13
4
+ PyMuPDF>=1.23.0
5
+ python-dotenv>=1.0.0
6
+ bcrypt>=4.0.0
7
+ numpy>=1.24.0
8
+ Pillow>=10.0.0
9
+ requests>=2.31.0
10
+ transformers>=4.35.0
11
+ torch>=2.0.0
12
+ torchvision>=0.15.0
13
+ opencv-python>=4.8.0
14
+ scikit-learn>=1.3.0
15
+ schemdraw
16
+ matplotlib
17
+ python-docx>=0.8.11
18
+ openpyxl>=3.1.2
19
+ pandas>=2.0.0
temp/comprehensive_report_generate_a_report_on_what_the_20250901_035723.docx ADDED
Binary file (36.8 kB).
 
temp/comprehensive_report_write_a_report_on_the_israel-i_20250901_041633.docx ADDED
Binary file (36.8 kB).
 
temp/enhanced_export_create_a_bar_chart_graph_showi_20250901_045645.xlsx ADDED
Binary file (8.62 kB).
 
temp/enhanced_export_create_a_bar_graph_showing_the_20250901_043912.xlsx ADDED
Binary file (8.57 kB).
 
temp/table_Present_a_structured_table_sum_20250817_141602.csv ADDED
@@ -0,0 +1,18 @@
1
+ Item Number,Description
2
+ 1,"**Israeli Strikes:** On this day, Israel launched Operation Rising Lion targeting nuclear and military sites across Iran[^1^][^4^]."
3
+ 2,"**Iranian Retaliation:** In response to the Israeli strikes on Tehran, an Iranian missile strike directly hit Soroka hospital in Beerseba, Southern Israel. The attack resulted in 32 casualties with injuries reported but no deaths."
4
+ 3,**Iranian Strikes:** Iran launched its first volley of retaliatory attacks towards Jerusalem at around midnight[^5^].
5
+ 4,"**Israeli Response:** In retaliation for the Iranian strikes on Israeli targets, an IDF (Israel Defense Forces) strike directly hit Soroka hospital in Beerseba, Southern Israel. The attack caused 32 casualties and significant damage to the facility."
6
+ 5,**Persistent Strikes:** The conflict continued with both sides exchanging blows through continuous attacks[^4^].
7
+ 6,**Damages/Casualties Continued:** Both countries reported ongoing exchanges of strikes causing further fatalities on either side.
8
+ 7,"On June 19th, an Iranian missile directly hit Soroka hospital in Beerseba[^5^]."
9
+ 8,This incident resulted in the death and injury of many people.
10
+ 9,Israel’s defense force launched Operation Rising Lion targeting various military sites within Iran.
11
+ 10,"The document mentions continuous exchanges between both countries, with casualties on either side being reported daily[^5^]."
12
+ 11,"Israeli Prime Minister Benjamin Netanyahu called the Iranian attacks a ""declaration of war"" and threatened severe punishment for Tehran's actions."
13
+ 12,Iran retaliated by targeting military centers and airbases in Israel.
14
+ 13,"The conflict extended into June, with both sides sustaining more casualties."
15
+ 14,Israeli Prime Minister Netanyahu reiterated his warnings about potential nuclear threats from Iran[^5^].
16
+ 15,"Iranian Supreme Leader Ayatollah Ali Khamenei stated that if not stopped, Israel could produce a new generation of terrorist leaders."
17
+ 16,The document continues detailing the ongoing conflict with daily reports on casualties and military sites being targeted.
18
+ 17,"Both sides continued launching attacks, causing significant damage and loss of life[^1^][^4^]."
temp/table_Present_a_structured_table_sum_20250817_141947.csv ADDED
@@ -0,0 +1,2 @@
1
+ Measurement Type,Value,Details
2
+ Measurement,13,"The timeline provided indicates a rapid escalation of violence between Israel and Iran over several days (June 13-15). The conflict began with Israeli Prime Minister Netanyahu declaring an ""Operation Rising Lion"" targeting nuclear facilities. In response, Iranian forces launched their own series of retaliatory attacks."
temp/table_Present_a_structured_table_sum_20250817_142357.csv ADDED
@@ -0,0 +1,2 @@
1
+ Measurement Type,Value,Details
2
+ Measurement,13,Here is a structured table summarizing the timeline of the conflict from June 13-15:
temp/table_Present_a_structured_table_sum_20250817_142801.csv ADDED
@@ -0,0 +1,4 @@
1
+ Page,Collection,Relevance Score,Content Summary
2
+ 2,Israel_Iran,39.623,Page 2 from Israel_Iran
3
+ 1,Israel_Iran,39.453,Page 1 from Israel_Iran
4
+ 3,Israel_Iran,37.398,Page 3 from Israel_Iran
temp/table_Present_a_structured_table_sum_20250817_143017.csv ADDED
@@ -0,0 +1,3 @@
1
+ Measurement Type,Value,Details
2
+ Measurement,2025,"In light of recent events surrounding Israel’s strikes on Iran’s nuclear sites in June 2025, this article provides detailed coverage of the conflict between Israel and Iran. The timeline outlined by BBC News reveals a series of escalations leading to an ongoing exchange of attacks from both sides over several days."
3
+ Measurement,14,"On Sunday (14th), Israel's Health Ministry announced damage caused by Iranian missile strikes at a hospital for about 200 injured. The Israeli Prime Minister emphasized that if Iran does not stop its nuclear program soon, it could produce a nuclear weapon in ""a very short time."""
temp/table_Present_a_structured_table_sum_20250817_171709.csv ADDED
@@ -0,0 +1,6 @@
1
+ Measurement Type,Value,Details
2
+ Measurement,13,## Israel-Iran Conflict Timeline: 13-15 June 2025
3
+ Measurement,1,"Based on the provided BBC news articles (pages 1, 2, & 3), here’s a structured timeline of the escalating conflict between Israel and Iran from June 13th to 15th, 2025. This analysis incorporates information from all three pages, detailing attackers, targets, locations, weapons, and resulting damage/casualties."
4
+ Measurement,1,"The conflict stems from a series of escalating actions initiated by Israel, targeting Iran’s nuclear program. Page 1 highlights that this began with Israeli strikes on nuclear and military sites in Iran, prompting retaliatory attacks by Iran targeting Israel. The situation is highly volatile, with both sides engaged in rhetorical threats and actual military action, and the US considering its involvement (Page 1). The Israeli Prime Minister, Benjamin Netanyahu, justified the attacks as a pre-emptive measure to prevent Iran from developing nuclear weapons, claiming they could produce one ""in a very short time"" if not stopped (Page 1). The conflict centers around concerns over Iran’s nuclear program and the potential for developing a nuclear weapon, and further details indicate that Iran is under investigation for possibly building these weapons (Page 3)."
5
+ Measurement,13,**Timeline: 13-15 June 2025**
6
+ Measurement,2,"* **Page 2:** This page provides a timeline of the immediate retaliatory actions. It details Iran launching around 100 missiles towards Israel on June 13th, with the majority being intercepted by Israel’s Iron Dome system. There is mention of the reporting of 78 people being injured by Friday evening. Page 2 also highlights coordination between Israel and Washington on Iran and notes a reported Iranian missile directly hitting a hospital in Beersheba, Southern Israel (Page 2)."
temp/table_Present_a_structured_table_sum_20250817_174625.csv ADDED
@@ -0,0 +1,14 @@
1
+ ,**Date**,**Attacker**,**Target**,**Location**,**Weapon Used**,**Damage/Casualties**,**Source Page**,
2
+ ,---,---,---,---,---,---,---,
3
+ ,**June 13**,Israel,Iranian nuclear sites,Throughout Iran (including Natanz – approximately 225km (140 miles) south of Tehran),Unspecified (likely airstrikes),Significant damage to the Natanz nuclear facility,Page 1,
4
+ ,**June 13**,Iran,Israel (Military & Civilian),Israel (including Beersheba in the south),"Ballistic missiles, drones","Dozens of targets hit (military centers, air bases); approximately 78 people injured, Iron Dome intercepted most missiles","Page 1, Page 2",
5
+ ,**June 13**,Iran,Israel,Beersheba,Iranian missile,At least 32 people injured at Soroka hospital,Page 2,
6
+ ,**June 13**,Iran,IDF command center and intelligence camp,Adjacent to hospital in Beersheba,Missile,"Intended target, but state media report possible intentional targeting of the hospital",Page 2,
7
+ ,**June 14-15**,Not detailed,Likely reciprocal attacks,Not detailed,Not detailed,Conflict is ongoing,"Page 1, Page 2",
8
+ ,**June 15**,Israel,Iranian nuclear facility,Not detailed,Unspecified (likely airstrikes),The Natanz facility is targeted to prevent Iran from reaching the capability to produce weapons-grade material,Page 3,
9
+ *,**Escalation:** The conflict has rapidly escalated from targeted strikes on nuclear facilities to widespread missile attacks on military and civilian targets.
10
+ *,**Nuclear Focus:** A central theme of the conflict is preventing Iran from acquiring nuclear weapons. Israel’s strikes on Natanz (Page 1 & 3) demonstrate a clear intention to disrupt Iran’s nuclear program.
11
+ *,"**Reciprocal Attacks:** The timeline indicates a clear pattern of reciprocal attacks, with Iran responding to Israeli strikes with missile launches, and Israel likely conducting follow-up strikes."
12
+ *,"**Civilian Impact:** The attacks have resulted in civilian casualties, with over 220 Palestinians killed in Israeli strikes (Page 1) and at least 32 injured in Israel (Page 2). The report of a possible deliberate targeting of a hospital (Page 2) is particularly concerning."
13
+ *,"**Threats and Warnings:** The rhetoric from both sides is aggressive, with warnings of “severe punishment” (Page 3) and a “declaration of war.”"
14
+ *,"**Iron Dome Effectiveness:** The Iron Dome defense system has intercepted most of the Iranian missiles (Page 1 & 2), but the sheer volume of attacks demonstrates the challenge of defending against such threats."
temp/table_create_a_bar_chart_graph_showi_20250901_045645.csv ADDED
@@ -0,0 +1,10 @@
1
+ ,**Date**,**Time (Local)**,**Event**,**Location**,**Impact/Details**,**Page Reference**,
2
+ ,---,---,---,---,---,---,
3
+ ,Days Prior,N/A,Reciprocal missile exchanges,Israel & Iran,Ongoing pattern of attacks,1,
4
+ ,June 12,Evening,Iran tells people to evacuate,Tehran's District 18,Evacuation order issued,2,
5
+ ,June 13,03:30,Initial Iranian Missile Attack,Towards Israel,"~100 missiles launched, most intercepted by Iron Dome","1, 2",
6
+ ,June 13,Shortly After,Israeli Operation Rising Lion,Iran,"Strikes on nuclear and military sites, significant damage to Natanz facility",2,
7
+ ,June 13,Hours Later,Iranian Missile Attack,Israel,"Dozens of targets, military centers and air bases targeted",2,
8
+ ,June 19,Morning,Missile Hits Hospital,"Beersheba, Southern Israel","At least 32 people injured, controversy over targeting a hospital",3,
9
+ ,Ongoing,N/A,Casualties,Israel & Iran,"Over 220 deaths in Iran, 24 deaths in Israel",3,
10
+ ,Ongoing,N/A,International Involvement,US,Trump considering joining strikes on Iranian nuclear sites,3,
utils.py ADDED
@@ -0,0 +1,5 @@
1
+ import base64
2
+
3
+ def encode_image(image_path):
4
+ with open(image_path, "rb") as image_file:
5
+ return base64.b64encode(image_file.read()).decode('utf-8')