- QUICK_START.md +289 -0
- README.md +328 -10
- app.py +0 -0
- app_database.db +0 -0
- colpali_manager.py +186 -0
- docker-compose.yml +66 -0
- middleware.py +62 -0
- milvus_manager.py +218 -0
- packages.txt +1 -0
- pdf_manager.py +46 -0
- rag.py +153 -0
- requirements.txt +19 -0
- temp/comprehensive_report_generate_a_report_on_what_the_20250901_035723.docx +0 -0
- temp/comprehensive_report_write_a_report_on_the_israel-i_20250901_041633.docx +0 -0
- temp/enhanced_export_create_a_bar_chart_graph_showi_20250901_045645.xlsx +0 -0
- temp/enhanced_export_create_a_bar_graph_showing_the_20250901_043912.xlsx +0 -0
- temp/table_Present_a_structured_table_sum_20250817_141602.csv +18 -0
- temp/table_Present_a_structured_table_sum_20250817_141947.csv +2 -0
- temp/table_Present_a_structured_table_sum_20250817_142357.csv +2 -0
- temp/table_Present_a_structured_table_sum_20250817_142801.csv +4 -0
- temp/table_Present_a_structured_table_sum_20250817_143017.csv +3 -0
- temp/table_Present_a_structured_table_sum_20250817_171709.csv +6 -0
- temp/table_Present_a_structured_table_sum_20250817_174625.csv +14 -0
- temp/table_create_a_bar_chart_graph_showi_20250901_045645.csv +10 -0
- utils.py +5 -0
QUICK_START.md
ADDED
@@ -0,0 +1,289 @@
# 🚀 Quick Start Guide - Collar Multimodal RAG Demo

Get your production-ready multimodal RAG system up and running in minutes!

## ⚡ 5-Minute Setup

### 1. **Install Dependencies**
```bash
pip install -r requirements.txt
```

### 2. **Start the Application**
```bash
python app.py
```

### 3. **Access the Application**
Open your browser and go to: `http://localhost:7860`

### 4. **Login with Default Users**
- **Team A**: `admin_team_a` / `admin123_team_a`
- **Team B**: `admin_team_b` / `admin123_team_b`

## 🎯 Key Features to Try

### **Enhanced Multi-Page Citations**
1. Upload multiple documents
2. Ask complex queries like: "What are the different types of explosives and their safety procedures?"
3. The system automatically detects complex queries and retrieves multiple relevant pages
4. See intelligent citations grouped by document collections with relevance scores
5. View multiple pages in the gallery display

### **Team Repository Management**
1. Login as a Team A user
2. Upload documents with a collection name like "Safety Manuals"
3. Switch to a Team B user - notice you can't see Team A's documents

### **Chat History**
1. Make several queries
2. Go to the "💬 Chat History" tab
3. See your conversation history with timestamps and cited pages

### **Advanced Querying**
1. Set "Number of pages to retrieve" to 5
2. Ask a complex question
3. View multiple relevant pages and the AI response with citations

### **Enhanced Detailed Responses**
1. Ask any question and receive comprehensive, detailed answers
2. Get extensive background information and context
3. See step-by-step explanations and practical applications
4. Receive safety considerations and best practices
5. Get technical specifications and measurements
6. View quality assessment and recommendations for further research

### **CSV Table Generation**
1. Ask for data in table format: "Show me a table of safety procedures"
2. Request CSV data: "Create a CSV with the comparison data"
3. Get structured responses with downloadable CSV content
4. View table information including rows, columns, and data sources
5. Copy CSV content to use in Excel, Google Sheets, or other applications (see the loading sketch below)
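
The exported tables are plain CSV, so they can also be loaded programmatically. A minimal sketch using pandas (the file name is one of the examples committed under `temp/`; substitute your own export):

```python
import pandas as pd

# Load a table exported by the app; any of the CSVs under temp/ works the same way.
df = pd.read_csv("temp/table_create_a_bar_chart_graph_showi_20250901_045645.csv")
print(df.head())                               # inspect the first rows
df.to_excel("safety_table.xlsx", index=False)  # re-export for Excel/Google Sheets (uses openpyxl)
```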

## 🔧 Configuration

### Environment Variables (.env file)
```env
# AI Models
colpali=colpali-v1.3
ollama=llama2

# Performance
flashattn=1
temperature=0.8
batchsize=5

# Database
metrictype=IP
mnum=16
efnum=500
topk=50
```
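
These values are plain environment variables: the modules in this repo (for example `colpali_manager.py` and `milvus_manager.py`) load them with `python-dotenv` and read them through `os.environ`. A minimal sketch of that pattern:

```python
import os
import dotenv

# Load .env so the settings above become visible as environment variables
dotenv.load_dotenv(dotenv.find_dotenv())

model_name = os.environ["colpali"]          # e.g. colpali-v1.3
batch_size = int(os.environ["batchsize"])   # e.g. 5
top_k = int(os.environ["topk"])             # e.g. 50
print(model_name, batch_size, top_k)
```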

### Customizing for Your Use Case

#### **For Large Document Collections**
```env
batchsize=10
topk=100
efnum=1000
```

#### **For Faster Processing**
```env
batchsize=2
flashattn=0
```

#### **For Higher Accuracy**
```env
temperature=0.3
topk=200
```

## 📁 File Structure
```
colpali-milvus-multimodal-rag-master/
├── app.py                       # Main application
├── requirements.txt             # Dependencies
├── README.md                    # Full documentation
├── QUICK_START.md               # This file
├── test_production_features.py  # Test suite
├── deploy_production.py         # Production deployment
├── app_database.db              # SQLite database (auto-created)
├── pages/                       # Document pages (auto-created)
├── logs/                        # Application logs
└── uploads/                     # Uploaded files
```

## 🧪 Testing

Run the test suite to verify everything works:
```bash
python test_production_features.py
```

Test the multi-page citation system:
```bash
python test_multipage_citations.py
```

Test the page count fix:
```bash
python test_page_count_fix.py
```

Test the enhanced detailed responses:
```bash
python test_detailed_responses.py
```

Test the page usage fix:
```bash
python test_page_usage_fix.py
```

Test the table generation functionality:
```bash
python test_table_generation.py
```

## 🚀 Production Deployment

For production deployment, run:
```bash
python deploy_production.py
```

This will:
- ✅ Check prerequisites
- ✅ Set up the environment
- ✅ Install dependencies
- ✅ Create the database
- ✅ Set up logging
- ✅ Create Docker configurations
- ✅ Run tests

## 🔍 Troubleshooting

### **Common Issues**

#### **"No module named 'bcrypt'"**
```bash
pip install bcrypt
```

#### **"Docker not running"**
- Start Docker Desktop
- Wait for it to fully initialize

#### **"Ollama not found"**
```bash
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
ollama serve
```

#### **"CUDA out of memory"**
Reduce the batch size in `.env`:
```env
batchsize=2
```

#### **"Database locked"**
```bash
# Stop the application and restart
# Or delete the database file to start fresh
rm app_database.db
```

#### **"Getting fewer pages than requested"**
- The system now ensures that exactly the requested number of pages is returned
- Check the console logs for debugging information
- Run the page count test: `python test_page_count_fix.py`
- If issues persist, check that documents have enough content for the query

#### **"LLM only cites 2 pages when 3 are requested"**
- The system now verifies that the LLM uses all provided pages
- Enhanced prompts explicitly instruct the model to use ALL pages
- Page usage verification detects missing references
- Run the page usage test: `python test_page_usage_fix.py`
- Check console logs for page usage verification messages

### **Performance Optimization**

#### **For GPU Users**
```env
flashattn=1
batchsize=8
```

#### **For CPU Users**
```env
flashattn=0
batchsize=2
```

#### **For Large Datasets**
```env
topk=200
efnum=1000
mnum=32
```

## 📊 Monitoring

### **Check Application Status**
- View logs in `logs/app.log`
- Monitor database size: `ls -lh app_database.db`
- Check uploaded documents: `ls -la pages/`

### **Performance Metrics**
- Query response time
- Document processing time
- Memory usage
- GPU utilization (if applicable)

## 🔐 Security Best Practices

### **For Development**
- Use the default passwords (already configured)
- Run on localhost only

### **For Production**
- Change the default passwords
- Use HTTPS
- Set up proper firewall rules
- Take regular database backups
- Monitor access logs

## 📞 Support

### **Getting Help**
1. Check the troubleshooting section above
2. Review the full README.md
3. Run the test suite: `python test_production_features.py`
4. Check the application logs: `tail -f logs/app.log`

### **Feature Requests**
- Multi-language support
- Advanced analytics dashboard
- API endpoints
- Mobile app
- Integration with external systems

## 🎉 What's Next?

After getting familiar with the basic features:

1. **Upload Your Documents**: Replace the sample documents with your own
2. **Customize Models**: Experiment with different AI models
3. **Scale Up**: Add more users and teams
4. **Integrate**: Connect with your existing systems
5. **Deploy**: Move to production with the deployment script

---

**Happy RAG-ing! 🚀**

*Made by Collar - Enhanced with Team Management & Chat History*
README.md
CHANGED
@@ -1,12 +1,330 @@
---
-title: Demo Updated
-emoji: 🔥
-colorFrom: purple
-colorTo: pink
-sdk: gradio
-sdk_version: 5.44.1
-app_file: app.py
-pinned: false
----

# Collar Multimodal RAG Demo - Production Ready

A production-ready multimodal RAG (Retrieval-Augmented Generation) system with team management, chat history, and advanced document processing capabilities.

## 🚀 New Production Features

### 1. **Multi-Page Citations**
- **Complex Query Support**: The AI can now retrieve and cite multiple pages when queries reference information across different documents
- **Smart Citation System**: Automatically identifies and displays which pages contain relevant information
- **Configurable Results**: Users can specify how many pages to retrieve (1-10 pages)

### 2. **Team-Based Repository Management**
- **Folder Uploads**: Upload multiple documents as organized collections
- **Team Isolation**: Each team has access only to its own document collections
- **Master Repository**: Documents are organized in team-specific repositories for easy access
- **Collection Naming**: Optional custom names for document collections

### 3. **Authentication & Team Management**
- **User Authentication**: Secure login system with bcrypt password hashing
- **Team-Based Access**: Separate entry points for Team A and Team B
- **Session Management**: Secure session handling with automatic timeout
- **Access Control**: Users can only access and manage their team's documents

### 4. **Chat History & Persistence**
- **Conversation Tracking**: All queries and responses are saved to a SQLite database
- **Historical Context**: View previous conversations with timestamps
- **Cited Pages History**: Track which pages were referenced in each conversation
- **Team-Specific History**: Each team sees only its own conversation history

### 5. **Advanced Relevance Scoring**
- **Multimodal Embeddings**: ColPali-based semantic understanding of text and visual content
- **Intelligent Ranking**: Sophisticated relevance scoring with cosine similarity and dot product
- **Quality Assessment**: Automatic evaluation of information relevance and completeness
- **Diversity Optimization**: Ensures comprehensive coverage across document collections

## 🔧 Installation & Setup

### Prerequisites
- Python 3.8+
- Docker Desktop
- Ollama
- CUDA-compatible GPU (recommended)

### 1. Install Dependencies
```bash
pip install -r requirements.txt
```

### 2. Environment Configuration
Create a `.env` file with the following variables:
```env
colpali=your_colpali_model
ollama=your_ollama_model
flashattn=1
temperature=0.8
batchsize=5
metrictype=IP
mnum=16
efnum=500
topk=50
```

### 3. Start Services
The application will automatically:
- Start Docker Desktop (Windows)
- Start the Ollama server
- Initialize Docker containers
- Create default users

## 👥 Default Users

The system creates default users for each team:

| Team | Username | Password |
|------|----------|----------|
| Team A | admin_team_a | admin123_team_a |
| Team B | admin_team_b | admin123_team_b |

## 📖 Usage Guide

### 1. **Authentication**
1. Navigate to the "🔐 Authentication" tab
2. Enter your username and password
3. Click "Login" to access team-specific features

### 2. **Document Management**
1. Go to the "📁 Document Management" tab
2. Optionally enter a collection name for organization
3. Set the maximum pages to extract per document
4. Upload multiple PPT/PDF files
5. Click "Upload to Repository" to process documents
6. Use "Refresh Collections" to see available document collections

### 3. **Advanced Querying**
1. Navigate to the "🔍 Advanced Query" tab
2. Enter your query in the text box
3. Adjust the number of pages to retrieve (1-10)
4. Click "Search Documents" to get an AI response with citations
5. View the cited pages and retrieved document images
6. Check relevance scores to understand information quality (see "Relevance Score Calculation" section)

### 4. **Chat History**
1. Go to the "💬 Chat History" tab
2. Adjust the number of conversations to display
3. Click "Refresh History" to view recent conversations
4. Each entry shows query, response, cited pages, and timestamp

### 5. **Data Management**
1. Access the "⚙️ Data Management" tab
2. Select collections to delete (team-restricted)
3. Configure database parameters for optimal performance
4. Update settings as needed

## 🏗️ Architecture

### Database Schema
- **users**: User accounts with team assignments
- **chat_history**: Conversation tracking with citations
- **document_collections**: Team-specific document organization (a schema sketch follows)
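
`app.py` is not rendered in this diff, so the exact table definitions are not shown here; the following is only a sketch of what a SQLite schema matching the three tables above could look like (column names are assumptions):

```python
import sqlite3

# Hypothetical DDL for the three tables described above (column names are illustrative).
conn = sqlite3.connect("app_database.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS users (
    id            INTEGER PRIMARY KEY,
    username      TEXT UNIQUE NOT NULL,
    password_hash TEXT NOT NULL,            -- bcrypt hash, never the plain password
    team          TEXT NOT NULL             -- e.g. 'team_a' or 'team_b'
);
CREATE TABLE IF NOT EXISTS chat_history (
    id          INTEGER PRIMARY KEY,
    username    TEXT,
    query       TEXT,
    response    TEXT,
    cited_pages TEXT,                       -- e.g. JSON-encoded list of page image paths
    created_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS document_collections (
    id              INTEGER PRIMARY KEY,
    team            TEXT,
    collection_name TEXT
);
""")
conn.close()
```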

### Security Features
- **Password Hashing**: bcrypt for secure password storage (see the sketch after this list)
- **Session Management**: UUID-based session tokens
- **Access Control**: Team-based document isolation
- **Input Validation**: Comprehensive error handling
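
A minimal sketch of the bcrypt hashing pattern mentioned above (the function names here are illustrative, not taken from `app.py`):

```python
import bcrypt

def hash_password(password: str) -> bytes:
    # gensalt() embeds a random salt and the work factor into the returned hash
    return bcrypt.hashpw(password.encode("utf-8"), bcrypt.gensalt())

def verify_password(password: str, stored_hash: bytes) -> bool:
    return bcrypt.checkpw(password.encode("utf-8"), stored_hash)

stored = hash_password("admin123_team_a")
assert verify_password("admin123_team_a", stored)
assert not verify_password("wrong-password", stored)
```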

### Performance Optimizations
- **Multi-threading**: Concurrent document processing
- **Memory Management**: Efficient image and vector handling
- **Caching**: Session-based caching for improved response times
- **Batch Processing**: Configurable batch sizes for GPU optimization

## 🔍 Relevance Score Calculation

The system uses sophisticated relevance scoring to determine how well retrieved documents align with user queries. This process is crucial for selecting the most pertinent information for generating accurate and contextually appropriate responses.

### How Relevance Scores Work

#### 1. **Document Embedding Process**
- **Page Segmentation**: Each document page is processed as a complete unit
- **Multimodal Encoding**: Both text and visual elements are captured using ColPali embeddings
- **Vector Representation**: Pages are transformed into high-dimensional numerical vectors (typically 768-1024 dimensions)
- **Semantic Capture**: The embedding captures semantic meaning, not just keyword matches

#### 2. **Query Embedding**
- **Query Processing**: User queries are converted into embeddings using the same ColPali model
- **Semantic Understanding**: The system understands query intent, not just literal words
- **Context Preservation**: Query context and meaning are maintained in the embedding

#### 3. **Similarity Computation**
- **Cosine Similarity**: Primary similarity measure between query and document embeddings
- **Dot Product**: Alternative similarity calculation for high-dimensional vectors
- **Normalized Scores**: Similarity scores are normalized to a 0-1 range
- **Distance Metrics**: Lower distances indicate higher relevance

#### 4. **Score Aggregation & Ranking**
- **Individual Page Scores**: Each page gets a relevance score based on similarity
- **Collection Diversity**: Scores are adjusted to promote diversity across document collections
- **Consecutive Page Optimization**: Adjacent pages are considered for better context
- **Final Ranking**: Pages are ranked by their aggregated relevance scores (a worked sketch follows)
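
Concretely, the rerank step in `milvus_manager.py` computes a ColBERT-style late-interaction ("MaxSim") score: every query token vector is matched against its best page token vector by dot product, and the maxima are summed. A minimal NumPy sketch of that computation (the raw sums are then ranked; the 0-1 ranges in the table below assume normalized scores):

```python
import numpy as np

def late_interaction_score(query_vecs: np.ndarray, page_vecs: np.ndarray) -> float:
    """MaxSim as in MilvusManager.search(): best dot product per query token, summed."""
    sim = query_vecs @ page_vecs.T          # (n_query_tokens, n_page_tokens)
    return float(sim.max(axis=1).sum())

# Toy example with random 128-dim multi-vector embeddings
rng = np.random.default_rng(0)
query_vecs = rng.normal(size=(20, 128))     # ColPali query tokens
page_vecs = rng.normal(size=(1030, 128))    # ColPali page patches
print(late_interaction_score(query_vecs, page_vecs))
```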

### Relevance Score Interpretation

| Score Range | Relevance Level | Description |
|-------------|-----------------|-------------|
| 0.90 - 1.00 | **Excellent** | Highly relevant, directly answers the query |
| 0.80 - 0.89 | **Very Good** | Very relevant, provides substantial information |
| 0.70 - 0.79 | **Good** | Relevant, contains useful information |
| 0.60 - 0.69 | **Moderate** | Somewhat relevant, may contain partial answers |
| 0.50 - 0.59 | **Basic** | Minimally relevant, limited usefulness |
| < 0.50 | **Poor** | Not relevant, unlikely to be useful |

### Example Relevance Calculation

**Query**: "What are the safety procedures for handling explosives?"

**Document Pages**:
1. **Page 15**: "Safety protocols for explosive materials" → Score: 0.95 (Excellent)
2. **Page 23**: "Equipment requirements for explosive handling" → Score: 0.92 (Very Good)
3. **Page 8**: "General laboratory safety guidelines" → Score: 0.88 (Very Good)
4. **Page 45**: "Chemical storage procedures" → Score: 0.65 (Moderate)

**Selection Process**:
- Pages 15, 23, and 8 are selected for their high relevance
- Page 45 is excluded due to its lower relevance
- The system ensures diversity across different aspects of safety procedures

### Advanced Features

#### **Multi-Modal Relevance**
- **Visual Elements**: Images, charts, and diagrams contribute to relevance scores
- **Text-Vision Alignment**: ColPali captures relationships between text and visual content
- **Layout Understanding**: Document structure and formatting influence relevance

#### **Context-Aware Scoring**
- **Query Complexity**: Complex queries may retrieve more pages with varied scores
- **Cross-Reference Detection**: Pages that reference each other get boosted scores
- **Temporal Relevance**: Recent documents may receive slight score adjustments

#### **Quality Assurance**
- **Score Verification**: The system validates that selected pages meet minimum relevance thresholds
- **Diversity Optimization**: Ensures selected pages provide comprehensive coverage
- **Redundancy Reduction**: Avoids selecting multiple pages with very similar content

### Configuration Parameters

```env
# Relevance scoring configuration
metrictype=IP    # Inner Product similarity
mnum=16          # Number of connections in the HNSW graph
efnum=500        # Search depth for high-quality results
topk=50          # Maximum results to consider
```

### Performance Impact

- **Search Speed**: Relevance scoring adds minimal overhead (~10-50 ms per query)
- **Accuracy**: High-quality embeddings ensure accurate relevance assessment
- **Scalability**: Efficient vector operations support large document collections
- **Memory Usage**: Optimized to handle thousands of document pages efficiently

## 🔒 Security Considerations

### Production Deployment
1. **HTTPS**: Always use HTTPS in production
2. **Environment Variables**: Store sensitive data in environment variables
3. **Database Security**: Use a production-grade database (PostgreSQL/MySQL)
4. **Rate Limiting**: Implement API rate limiting
5. **Logging**: Add comprehensive logging for security monitoring

### Recommended Security Enhancements
```python
# Add to production deployment
import logging
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

# Rate limiting
limiter = Limiter(
    app,
    key_func=get_remote_address,
    default_limits=["200 per day", "50 per hour"]
)

# Security headers
@app.after_request
def add_security_headers(response):
    response.headers['X-Content-Type-Options'] = 'nosniff'
    response.headers['X-Frame-Options'] = 'DENY'
    response.headers['X-XSS-Protection'] = '1; mode=block'
    return response
```

## 🚀 Deployment

### Docker Deployment
```dockerfile
FROM python:3.9-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .
EXPOSE 7860

CMD ["python", "app.py"]
```

### Environment Variables for Production
```env
# Database
DATABASE_URL=postgresql://user:password@localhost/dbname
SECRET_KEY=your-secret-key-here

# Security
BCRYPT_ROUNDS=12
SESSION_TIMEOUT=3600

# Performance
WORKER_THREADS=4
MAX_UPLOAD_SIZE=100MB
```

## 📊 Monitoring & Analytics

### Key Metrics to Track
- **Query Response Time**: Average time for AI responses
- **Document Processing Time**: Time to index new documents
- **User Activity**: Login frequency and session duration
- **Error Rates**: Failed queries and system errors
- **Storage Usage**: Database and file system utilization

### Logging Configuration
```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('app.log'),
        logging.StreamHandler()
    ]
)
```

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new features
5. Submit a pull request

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🆘 Support

For support and questions:
- Create an issue in the repository
- Check the documentation
- Review the troubleshooting guide

---

**Made by Collar** - Enhanced with Team Management & Chat History
app.py
ADDED
The diff for this file is too large to render.
app_database.db
ADDED
Binary file (41 kB).
colpali_manager.py
ADDED
@@ -0,0 +1,186 @@
from colpali_engine.models import ColPali
from colpali_engine.models import ColPaliProcessor
from colpali_engine.utils.processing_utils import BaseVisualRetrieverProcessor
from colpali_engine.utils.torch_utils import ListDataset, get_torch_device
from torch.utils.data import DataLoader
import torch
from typing import List, cast
import matplotlib.pyplot as plt
#from colpali_engine.models import ColQwen2_5, ColQwen2_5_Processor
from colpali_engine.models import ColIdefics3, ColIdefics3Processor

from tqdm import tqdm
from PIL import Image
import os

import spaces


#this part is for local runs
torch.cuda.empty_cache()

#get model name from .env variable & set directory & processor dir as the model names!
import dotenv
# Load the .env file
dotenv_file = dotenv.find_dotenv()
dotenv.load_dotenv(dotenv_file)

model_name = os.environ['colpali'] #"vidore/colSmol-256M"
device = get_torch_device("cuda") #try using cpu instead of cuda?

#switch to locally downloading models & loading locally rather than from hf

current_working_directory = os.getcwd()
save_directory = model_name # Directory to save the specific model name
save_directory = os.path.join(current_working_directory, save_directory)

processor_directory = model_name + '_processor' # Directory to save the processor
processor_directory = os.path.join(current_working_directory, processor_directory)


if not os.path.exists(save_directory): #download if directory not created/model not loaded
    # Directory does not exist; create it
    if "colSmol-256M" in model_name: #if colsmol
        model = ColIdefics3.from_pretrained(
            model_name,
            torch_dtype=torch.bfloat16,
            device_map=device,
            #attn_implementation="flash_attention_2",
        ).eval()
        processor = cast(ColIdefics3Processor, ColIdefics3Processor.from_pretrained(model_name))
    else: #if colpali v1.3 etc
        model = ColPali.from_pretrained(
            model_name,
            torch_dtype=torch.bfloat16,
            device_map=device,
            #attn_implementation="flash_attention_2",
        ).eval()
        processor = cast(ColPaliProcessor, ColPaliProcessor.from_pretrained(model_name))
    os.makedirs(save_directory)
    print(f"Directory '{save_directory}' created.")
    model.save_pretrained(save_directory)
    os.makedirs(processor_directory)
    processor.save_pretrained(processor_directory)

else:
    if "colSmol-256M" in model_name:
        model = ColIdefics3.from_pretrained(save_directory)
        processor = ColIdefics3Processor.from_pretrained(processor_directory, use_fast=True)
    else:
        model = ColPali.from_pretrained(save_directory)
        processor = ColPaliProcessor.from_pretrained(processor_directory, use_fast=True)


class ColpaliManager:

    def __init__(self, device="cuda", model_name=model_name): #need to hot potato/use diff gpus between colpali & ollama

        print(f"Initializing ColpaliManager with device {device} and model {model_name}")

        # self.device = get_torch_device(device)

        # self.model = ColPali.from_pretrained(
        #     model_name,
        #     torch_dtype=torch.bfloat16,
        #     device_map=self.device,
        # ).eval()

        # self.processor = cast(ColPaliProcessor, ColPaliProcessor.from_pretrained(model_name))

    @spaces.GPU
    def get_images(self, paths: list[str]) -> List[Image.Image]:
        model.to("cuda")
        return [Image.open(path) for path in paths]

    @spaces.GPU
    def process_images(self, image_paths: list[str], batch_size=int(os.environ['batchsize'])):
        model.to("cuda")
        print(f"Processing {len(image_paths)} image_paths")

        images = self.get_images(image_paths)

        dataloader = DataLoader(
            dataset=ListDataset[str](images),
            batch_size=batch_size,
            shuffle=False,
            collate_fn=lambda x: processor.process_images(x),
        )

        ds: List[torch.Tensor] = []
        for batch_doc in tqdm(dataloader):
            with torch.no_grad():
                batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()}
                embeddings_doc = model(**batch_doc)
            ds.extend(list(torch.unbind(embeddings_doc.to(device))))

        ds_np = [d.float().cpu().numpy() for d in ds]

        return ds_np

    @spaces.GPU
    def process_text(self, texts: list[str]):

        #current_working_directory = os.getcwd()
        #save_directory = model_name # Directory to save the specific model name
        #save_directory = os.path.join(current_working_directory, save_directory)

        #processor_directory = model_name+'_processor' # Directory to save the processor
        #processor_directory = os.path.join(current_working_directory, processor_directory)

        if not os.path.exists(save_directory): #download if directory not created/model not loaded
            #MUST USE colpali v1.3/1.2 etc, CANNOT USE SMOLCOLPALI! for queries AS NOT RELIABLE!
            """
            model = ColPali.from_pretrained(
                model_name,
                torch_dtype=torch.bfloat16,
                device_map=device,
                attn_implementation="flash_attention_2",
            ).eval()
            processor = cast(ColPaliProcessor, ColPaliProcessor.from_pretrained(model_name))
            os.makedirs(save_directory)
            print(f"Directory '{save_directory}' created.")
            model.save_pretrained(save_directory)
            os.makedirs(processor_directory)
            processor.save_pretrained(processor_directory)
            else:
            model = ColPali.from_pretrained(save_directory)
            processor = ColPaliProcessor.from_pretrained(processor_directory, use_fast=True)
            """

        model.to("cuda") #ensure this is commented out so ollama/multimodal llm can use gpu! (nah wrong, need to enable so that it can process multiple)
        print(f"Processing {len(texts)} texts")

        dataloader = DataLoader(
            dataset=ListDataset[str](texts),
            batch_size=int(os.environ['batchsize']), #OG is 5, try reducing batch size to maximise gpu use
            shuffle=False,
            collate_fn=lambda x: processor.process_queries(x),
        )

        qs: List[torch.Tensor] = []
        for batch_query in dataloader:
            with torch.no_grad():
                batch_query = {k: v.to(model.device) for k, v in batch_query.items()}
                embeddings_query = model(**batch_query)

            qs.extend(list(torch.unbind(embeddings_query.to(device))))

        qs_np = [q.float().cpu().numpy() for q in qs]
        model.to("cpu") # Moves all model parameters and buffers to the CPU, freeing up gpu for ollama call after this process text call! (THIS WORKS!)

        return qs_np

plt.close("all")
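A minimal usage sketch for this module (it assumes a `.env` with `colpali` and `batchsize` set, and that the page images already exist; the paths below are illustrative):

```python
from colpali_manager import ColpaliManager

colpali = ColpaliManager()

# One multi-vector embedding (n_patches x dim) per page image
page_embeddings = colpali.process_images(["pages/demo/page_1.png", "pages/demo/page_2.png"])

# One multi-vector embedding (n_tokens x dim) per query
query_embeddings = colpali.process_text(["What are the safety procedures?"])

print(len(page_embeddings), page_embeddings[0].shape, query_embeddings[0].shape)
```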
docker-compose.yml
ADDED
@@ -0,0 +1,66 @@
version: '3.5'

services:
  etcd:
    container_name: milvus-etcd
    image: quay.io/coreos/etcd:v3.5.16
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
      - ETCD_SNAPSHOT_COUNT=50000
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    healthcheck:
      test: ["CMD", "etcdctl", "endpoint", "health"]
      interval: 30s
      timeout: 20s
      retries: 3

  minio:
    container_name: milvus-minio
    image: minio/minio:RELEASE.2023-03-20T20-16-18Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    ports:
      - "9001:9001"
      - "9000:9000"
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
    command: minio server /minio_data --console-address ":9001"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3

  standalone:
    container_name: milvus-standalone
    image: milvusdb/milvus:v2.5.4-gpu
    command: ["milvus", "run", "standalone"]
    security_opt:
      - seccomp:unconfined
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
    ports:
      - "19530:19530"
      - "9091:9091"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: ["gpu"]
              device_ids: ["0"]
    depends_on:
      - "etcd"
      - "minio"

networks:
  default:
    name: milvus
middleware.py
ADDED
@@ -0,0 +1,62 @@
from colpali_manager import ColpaliManager
from milvus_manager import MilvusManager
from pdf_manager import PdfManager
import hashlib


pdf_manager = PdfManager()
colpali_manager = ColpaliManager()


class Middleware:
    def __init__(self, id: str, create_collection=True):
        #hashed_id = hashlib.md5(id.encode()).hexdigest()[:8]
        hashed_id = 0 #switched to persistent db, shld use diff id for diff accs
        milvus_db_name = f"milvus_{hashed_id}.db"
        self.milvus_manager = MilvusManager(milvus_db_name, id, create_collection) #create collections based on id rather than colpali

    def index(self, pdf_path: str, id: str, max_pages: int, pages: list[int] = None):

        if pdf_path is None: #for direct query without any upload to db
            print("no docs")
            return

        print(f"Indexing {pdf_path}, id: {id}, max_pages: {max_pages}")

        image_paths = pdf_manager.save_images(id, pdf_path, max_pages)

        print(f"Saved {len(image_paths)} images")

        colbert_vecs = colpali_manager.process_images(image_paths)

        images_data = [{
            "colbert_vecs": colbert_vecs[i],
            "filepath": image_paths[i]
        } for i in range(len(image_paths))]

        print(f"Inserting {len(images_data)} images data to Milvus")

        self.milvus_manager.insert_images_data(images_data)

        print("Indexing completed")

        return image_paths

    def search(self, search_queries: list[str], topk: int = 10):
        print(f"Searching for {len(search_queries)} queries with topk={topk}")

        final_res = []

        for query in search_queries:
            print(f"Searching for query: {query}")
            query_vec = colpali_manager.process_text([query])[0]
            search_res = self.milvus_manager.search(query_vec, topk=topk)
            print(f"Search result: {len(search_res)} results for query: {query}")
            final_res.append(search_res)

        return final_res
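A minimal usage sketch for this module (the PDF path and collection id are illustrative):

```python
from middleware import Middleware

mw = Middleware("team_a_safety_manuals", create_collection=True)

# Convert the PDF to page images, embed them with ColPali, and store them in Milvus
mw.index("uploads/safety_manual.pdf", id="team_a_safety_manuals", max_pages=20)

# Each result list contains (score, doc_id, collection) tuples; doc_id is the page number
results = mw.search(["What are the safety procedures for handling explosives?"], topk=3)
print(results[0])
```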
milvus_manager.py
ADDED
@@ -0,0 +1,218 @@
from pymilvus import MilvusClient, DataType
try:
    from milvus import default_server  # Milvus Lite
except Exception:
    default_server = None
import numpy as np
import concurrent.futures
from pymilvus import Collection
import os

class MilvusManager:
    def __init__(self, milvus_uri, collection_name, create_collection, dim=128):

        #import environ variables from .env
        import dotenv
        # Load the .env file
        dotenv_file = dotenv.find_dotenv()
        dotenv.load_dotenv(dotenv_file)

        # Start embedded Milvus Lite server and connect locally
        if default_server is not None:
            try:
                # Optionally set base dir here if desired, e.g. default_server.set_base_dir('volumes/milvus_lite')
                default_server.start()
            except Exception:
                pass
            local_uri = f"http://127.0.0.1:{default_server.listen_port}"
            self.client = MilvusClient(uri=local_uri)
        else:
            # Fallback to standard local server (assumes docker-compose or system service)
            self.client = MilvusClient(uri="http://127.0.0.1:19530")
        self.collection_name = collection_name
        self.dim = dim

        if self.client.has_collection(collection_name=self.collection_name):
            self.client.load_collection(collection_name=self.collection_name)
            print("Loaded existing collection.")
        elif create_collection:
            self.create_collection()
            self.create_index()

    def create_collection(self):
        if self.client.has_collection(collection_name=self.collection_name):
            print("Collection already exists.")
            return

        schema = self.client.create_schema(
            auto_id=True,
            enable_dynamic_fields=True,
        )
        schema.add_field(field_name="pk", datatype=DataType.INT64, is_primary=True)
        schema.add_field(
            field_name="vector", datatype=DataType.FLOAT_VECTOR, dim=self.dim
        )
        schema.add_field(field_name="seq_id", datatype=DataType.INT16)
        schema.add_field(field_name="doc_id", datatype=DataType.INT64)
        schema.add_field(field_name="doc", datatype=DataType.VARCHAR, max_length=65535)

        self.client.create_collection(
            collection_name=self.collection_name, schema=schema
        )

    def create_index(self):
        index_params = self.client.prepare_index_params()

        index_params.add_index(
            field_name="vector",
            index_name="vector_index",
            index_type="HNSW", #use HNSW option if got more mem, if not use IVF for faster processing
            metric_type=os.environ["metrictype"], #"IP"
            params={
                "M": int(os.environ["mnum"]), #M:16 for HNSW, capital M
                "efConstruction": int(os.environ["efnum"]), #500 for HNSW
            },
        )

        self.client.create_index(
            collection_name=self.collection_name, index_params=index_params, sync=True
        )

    def search(self, data, topk):
        # Retrieve all collection names from the Milvus client.
        collections = self.client.list_collections()

        # Set search parameters (here, using Inner Product metric).
        search_params = {"metric_type": os.environ["metrictype"], "params": {}} #default metric type is "IP"

        # Set to store unique (doc_id, collection_name) pairs across all collections.
        doc_collection_pairs = set()

        # Query each collection individually
        for collection in collections:
            self.client.load_collection(collection_name=collection)
            print("collection loaded:" + collection)
            results = self.client.search(
                collection,
                data,
                limit=int(os.environ["topk"]), # Adjust limit per collection as needed. (default is 50)
                output_fields=["vector", "seq_id", "doc_id"],
                search_params=search_params,
            )
            # Accumulate document IDs along with their originating collection.
            for r_id in range(len(results)):
                for r in range(len(results[r_id])):
                    doc_id = results[r_id][r]["entity"]["doc_id"]
                    doc_collection_pairs.add((doc_id, collection))

        scores = []

        def rerank_single_doc(doc_id, data, client, collection_name):
            # Query for detailed document vectors in the given collection.
            doc_colbert_vecs = client.query(
                collection_name=collection_name,
                filter=f"doc_id in [{doc_id}, {doc_id + 1}]",
                output_fields=["seq_id", "vector", "doc"],
                limit=16380,
            )
            # Stack the vectors for dot product computation.
            doc_vecs = np.vstack(
                [doc_colbert_vecs[i]["vector"] for i in range(len(doc_colbert_vecs))]
            )
            # Compute a similarity score via dot product.
            score = np.dot(data, doc_vecs.T).max(1).sum()
            return (score, doc_id, collection_name)

        # Use a thread pool to rerank each document concurrently.
        with concurrent.futures.ThreadPoolExecutor(max_workers=300) as executor:
            futures = {
                executor.submit(rerank_single_doc, doc_id, data, self.client, collection): (doc_id, collection)
                for doc_id, collection in doc_collection_pairs
            }
            for future in concurrent.futures.as_completed(futures):
                score, doc_id, collection = future.result()
                scores.append((score, doc_id, collection))
                #doc_id is page number!

        # Sort the reranked results by score in descending order.
        scores.sort(key=lambda x: x[0], reverse=True)
        # Unload the collection after search to free memory.
        self.client.release_collection(collection_name=collection)

        return scores[:topk] if len(scores) >= topk else scores #topk is the number of scores to return back
    """
    search_params = {"metric_type": "IP", "params": {}}
    results = self.client.search(
        self.collection_name,
        data,
        limit=50,
        output_fields=["vector", "seq_id", "doc_id"],
        search_params=search_params,
    )
    doc_ids = {result["entity"]["doc_id"] for result in results[0]}

    scores = []

    def rerank_single_doc(doc_id, data, client, collection_name):
        doc_colbert_vecs = client.query(
            collection_name=collection_name,
            filter=f"doc_id in [{doc_id}, {doc_id + 1}]",
            output_fields=["seq_id", "vector", "doc"],
            limit=1000,
        )
        doc_vecs = np.vstack(
            [doc["vector"] for doc in doc_colbert_vecs]
        )
        score = np.dot(data, doc_vecs.T).max(1).sum()
        return score, doc_id

    with concurrent.futures.ThreadPoolExecutor(max_workers=300) as executor:
        futures = {
            executor.submit(
                rerank_single_doc, doc_id, data, self.client, self.collection_name
            ): doc_id
            for doc_id in doc_ids
        }
        for future in concurrent.futures.as_completed(futures):
            score, doc_id = future.result()
            scores.append((score, doc_id))

    scores.sort(key=lambda x: x[0], reverse=True)
    return scores[:topk]
    """

    def insert(self, data):
        colbert_vecs = data["colbert_vecs"]
        seq_length = len(colbert_vecs)
        doc_ids = [data["doc_id"]] * seq_length
        seq_ids = list(range(seq_length))
        docs = [""] * seq_length
        docs[0] = data["filepath"]

        self.client.insert(
            self.collection_name,
            [
                {
                    "vector": colbert_vecs[i],
                    "seq_id": seq_ids[i],
                    "doc_id": doc_ids[i],
                    "doc": docs[i],
                }
                for i in range(seq_length)
            ],
        )

    def get_images_as_doc(self, images_with_vectors):
        return [
            {
                "colbert_vecs": image["colbert_vecs"],
                "doc_id": idx,
                "filepath": image["filepath"],
            }
            for idx, image in enumerate(images_with_vectors)
        ]

    def insert_images_data(self, image_data):
        data = self.get_images_as_doc(image_data)
        for item in data:
            self.insert(item)
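For reference, `insert_images_data()` expects one entry per page, each holding the page's ColPali multi-vector embedding and the saved image path; `doc_id` is assigned from the list index, which is why it corresponds to the page order. A minimal sketch (the random vectors stand in for real ColPali output):

```python
import numpy as np
from milvus_manager import MilvusManager

manager = MilvusManager("milvus_0.db", "demo_collection", create_collection=True)

fake_page = {
    "colbert_vecs": np.random.rand(1030, 128).astype(np.float32),  # (n_patches, dim=128)
    "filepath": "pages/demo/page_1.png",                           # stored in the 'doc' field
}
manager.insert_images_data([fake_page])   # doc_id 0 = first page in the list
```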
packages.txt
ADDED
@@ -0,0 +1 @@
poppler-utils
pdf_manager.py
ADDED
@@ -0,0 +1,46 @@
from pdf2image import convert_from_path
import os
import shutil

class PdfManager:
    def __init__(self):
        pass

    def clear_and_recreate_dir(self, output_folder):

        print(f"Clearing output folder {output_folder}")

        if os.path.exists(output_folder):
            shutil.rmtree(output_folder)
            #print("Clearing is unused for now for persistency")
        # Recreate the folder in both cases so image.save() below always has a target directory
        os.makedirs(output_folder)

        #print("Clearing is unused for now for persistency")

    def save_images(self, id, pdf_path, max_pages, pages: list[int] = None) -> list[str]:
        output_folder = f"pages/{id}" #remove last backslash to avoid error,test this
        images = convert_from_path(pdf_path)

        print(f"Saving images from {pdf_path} to {output_folder}. Max pages: {max_pages}")

        self.clear_and_recreate_dir(output_folder)

        num_page_processed = 0

        for i, image in enumerate(images):
            if max_pages and num_page_processed >= max_pages:
                break

            if pages and i not in pages:
                continue

            full_save_path = f"{output_folder}/page_{i + 1}.png"

            #print(f"Saving image to {full_save_path}")

            image.save(full_save_path, "PNG")

            num_page_processed += 1

        return [f"{output_folder}/page_{i + 1}.png" for i in range(num_page_processed)]
rag.py
ADDED
@@ -0,0 +1,153 @@
+import requests
+import os
+import re
+
+from typing import List
+from utils import encode_image
+from PIL import Image
+import torch
+import subprocess
+import psutil
+from transformers import AutoModel, AutoTokenizer
+# Gemini SDK: configure()/GenerativeModel() live in google.generativeai
+# (requires the google-generativeai package, not listed in requirements.txt).
+import google.generativeai as genai
+
+
+class Rag:
+
+    def _clean_raw_token_response(self, response_text):
+        """
+        Clean raw token responses that contain undecoded token IDs.
+        This handles cases where models return raw tokens instead of decoded text.
+        """
+        if not response_text:
+            return response_text
+
+        # Check whether the response contains raw token patterns
+        token_patterns = [
+            r'<unused\d+>',     # unused tokens
+            r'<bos>',           # beginning of sequence
+            r'<eos>',           # end of sequence
+            r'<unk>',           # unknown tokens
+            r'<mask>',          # mask tokens
+            r'<pad>',           # padding tokens
+            r'\[multimodal\]',  # multimodal tokens
+        ]
+
+        # If the response contains raw tokens, try to clean them
+        has_raw_tokens = any(re.search(pattern, response_text) for pattern in token_patterns)
+
+        if has_raw_tokens:
+            print("⚠️ Detected raw token response, attempting to clean...")
+
+            cleaned_text = response_text
+
+            # Remove unused tokens
+            cleaned_text = re.sub(r'<unused\d+>', '', cleaned_text)
+
+            # Remove special tokens
+            cleaned_text = re.sub(r'<(bos|eos|unk|mask|pad)>', '', cleaned_text)
+
+            # Remove multimodal tokens
+            cleaned_text = re.sub(r'\[multimodal\]', '', cleaned_text)
+
+            # Collapse extra whitespace
+            cleaned_text = re.sub(r'\s+', ' ', cleaned_text).strip()
+
+            # If almost nothing survives the cleaning, return an explicit error message
+            if len(cleaned_text.strip()) < 10:
+                return ("❌ **Model Response Error**: The model returned raw token IDs instead of decoded text. "
+                        "This may be due to model configuration issues. Please try:\n\n"
+                        "1. Restarting the Ollama server\n"
+                        "2. Using a different model\n"
+                        "3. Checking model compatibility with multimodal inputs")
+
+            return cleaned_text
+
+        return response_text
+
+    def get_answer_from_gemini(self, query: str, image_paths: List[str]) -> str:
+        print(f"Querying Gemini 2.5 Pro for query={query}, image_paths={image_paths}")
+        try:
+            # API key comes from the GEMINI_API_KEY environment variable
+            api_key = os.environ.get('GEMINI_API_KEY')
+            if not api_key:
+                return "Error: GEMINI_API_KEY is not set."
+
+            genai.configure(api_key=api_key)
+            model = genai.GenerativeModel('gemini-2.5-pro')
+
+            # Load the retrieved page images; skip any that cannot be opened
+            images = []
+            for p in image_paths:
+                try:
+                    images.append(Image.open(p))
+                except Exception:
+                    pass
+
+            chat_session = model.start_chat()
+            response = chat_session.send_message([*images, query])
+            return response.text
+        except Exception as e:
+            print(f"Gemini error: {e}")
+            return f"Error: {str(e)}"
+
+    def get_answer_from_openai(self, query, imagesPaths):
+        # Load environment variables from the .env file
+        import dotenv
+        dotenv_file = dotenv.find_dotenv()
+        dotenv.load_dotenv(dotenv_file)
+
+        # This method formerly called Ollama; it now delegates to Gemini 2.5 Pro.
+        print(f"Querying Gemini (replacement for Ollama) for query={query}, imagesPaths={imagesPaths}")
+        try:
+            enhanced_query = f"Use all {len(imagesPaths)} pages to answer comprehensively.\n\nQuery: {query}"
+            return self.get_answer_from_gemini(enhanced_query, imagesPaths)
+        except Exception as e:
+            print(f"Gemini replacement error: {e}")
+            return None
+
+    def __get_openai_api_payload(self, query: str, imagesPaths: List[str]):
+        # Build an OpenAI/Ollama-style chat payload (kept for the legacy local-model path)
+        image_payload = []
+
+        for imagePath in imagesPaths:
+            base64_image = encode_image(imagePath)
+            image_payload.append({
+                "type": "image_url",
+                "image_url": {
+                    "url": f"data:image/jpeg;base64,{base64_image}"
+                }
+            })
+
+        payload = {
+            "model": "Llama3.2-vision",  # change model here as needed
+            "messages": [
+                {
+                    "role": "user",
+                    "content": [
+                        {
+                            "type": "text",
+                            "text": query
+                        },
+                        *image_payload
+                    ]
+                }
+            ],
+            "max_tokens": 1024  # smaller limit keeps processing time down
+        }
+
+        return payload
+
+
+# if __name__ == "__main__":
+#     rag = Rag()
+#
+#     query = "Based on attached images, how many new cases were reported during second wave peak"
+#     imagesPaths = ["covid_slides_page_8.png", "covid_slides_page_8.png"]
+#
+#     rag.get_answer_from_gemini(query, imagesPaths)
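
For orientation, here is a minimal sketch of how this class might be driven end to end, assuming `GEMINI_API_KEY` is set and a couple of rasterized page images already exist on disk (the file names below are hypothetical):

```python
from rag import Rag

# Hypothetical page images, e.g. pages rendered from an uploaded PDF
image_paths = ["pages/report_page_1.png", "pages/report_page_2.png"]

rag = Rag()
answer = rag.get_answer_from_gemini(
    "Summarize the key findings across the attached pages.",
    image_paths,
)
print(answer)
```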
requirements.txt
ADDED
@@ -0,0 +1,19 @@
+gradio>=4.0.0
+pymilvus>=2.3.0
+milvus>=2.4.13
+PyMuPDF>=1.23.0
+python-dotenv>=1.0.0
+bcrypt>=4.0.0
+numpy>=1.24.0
+Pillow>=10.0.0
+requests>=2.31.0
+transformers>=4.35.0
+torch>=2.0.0
+torchvision>=0.15.0
+opencv-python>=4.8.0
+scikit-learn>=1.3.0
+schemdraw
+matplotlib
+python-docx>=0.8.11
+openpyxl>=3.1.2
+pandas>=2.0.0
temp/comprehensive_report_generate_a_report_on_what_the_20250901_035723.docx
ADDED
Binary file (36.8 kB).
temp/comprehensive_report_write_a_report_on_the_israel-i_20250901_041633.docx
ADDED
Binary file (36.8 kB).
temp/enhanced_export_create_a_bar_chart_graph_showi_20250901_045645.xlsx
ADDED
Binary file (8.62 kB).
temp/enhanced_export_create_a_bar_graph_showing_the_20250901_043912.xlsx
ADDED
Binary file (8.57 kB).
temp/table_Present_a_structured_table_sum_20250817_141602.csv
ADDED
@@ -0,0 +1,18 @@
+Item Number,Description
+1,"**Israeli Strikes:** On this day, Israel launched Operation Rising Lion targeting nuclear and military sites across Iran[^1^][^4^]."
+2,"**Iranian Retaliation:** In response to the Israeli strikes on Tehran, an Iranian missile strike directly hit Soroka hospital in Beerseba, Southern Israel. The attack resulted in 32 casualties with injuries reported but no deaths."
+3,**Iranian Strikes:** Iran launched its first volley of retaliatory attacks towards Jerusalem at around midnight[^5^].
+4,"**Israeli Response:** In retaliation for the Iranian strikes on Israeli targets, an IDF (Israel Defense Forces) strike directly hit Soroka hospital in Beerseba, Southern Israel. The attack caused 32 casualties and significant damage to the facility."
+5,**Persistent Strikes:** The conflict continued with both sides exchanging blows through continuous attacks[^4^].
+6,**Damages/Casualties Continued:** Both countries reported ongoing exchanges of strikes causing further fatalities on either side.
+7,"On June 19th, an Iranian missile directly hit Soroka hospital in Beerseba[^5^]."
+8,This incident resulted in the death and injury of many people.
+9,Israel’s defense force launched Operation Rising Lion targeting various military sites within Iran.
+10,"The document mentions continuous exchanges between both countries, with casualties on either side being reported daily[^5^]."
+11,"Israeli Prime Minister Benjamin Netanyahu called the Iranian attacks a ""declaration of war"" and threatened severe punishment for Tehran's actions."
+12,Iran retaliated by targeting military centers and airbases in Israel.
+13,"The conflict extended into June, with both sides sustaining more casualties."
+14,Israeli Prime Minister Netanyahu reiterated his warnings about potential nuclear threats from Iran[^5^].
+15,"Iranian Supreme Leader Ayatollah Ali Khamenei stated that if not stopped, Israel could produce a new generation of terrorist leaders."
+16,The document continues detailing the ongoing conflict with daily reports on casualties and military sites being targeted.
+17,"Both sides continued launching attacks, causing significant damage and loss of life[^1^][^4^]."
temp/table_Present_a_structured_table_sum_20250817_141947.csv
ADDED
@@ -0,0 +1,2 @@
+Measurement Type,Value,Details
+Measurement,13,"The timeline provided indicates a rapid escalation of violence between Israel and Iran over several days (June 13-15). The conflict began with Israeli Prime Minister Netanyahu declaring an ""Operation Rising Lion"" targeting nuclear facilities. In response, Iranian forces launched their own series of retaliatory attacks."
temp/table_Present_a_structured_table_sum_20250817_142357.csv
ADDED
@@ -0,0 +1,2 @@
+Measurement Type,Value,Details
+Measurement,13,Here is a structured table summarizing the timeline of the conflict from June 13-15:
temp/table_Present_a_structured_table_sum_20250817_142801.csv
ADDED
@@ -0,0 +1,4 @@
+Page,Collection,Relevance Score,Content Summary
+2,Israel_Iran,39.623,Page 2 from Israel_Iran
+1,Israel_Iran,39.453,Page 1 from Israel_Iran
+3,Israel_Iran,37.398,Page 3 from Israel_Iran
temp/table_Present_a_structured_table_sum_20250817_143017.csv
ADDED
@@ -0,0 +1,3 @@
+Measurement Type,Value,Details
+Measurement,2025,"In light of recent events surrounding Israel’s strikes on Iran’s nuclear sites in June 2025, this article provides detailed coverage of the conflict between Israel and Iran. The timeline outlined by BBC News reveals a series of escalations leading to an ongoing exchange of attacks from both sides over several days."
+Measurement,14,"On Sunday (14th), Israel's Health Ministry announced damage caused by Iranian missile strikes at a hospital for about 200 injured. The Israeli Prime Minister emphasized that if Iran does not stop its nuclear program soon, it could produce a nuclear weapon in ""a very short time."""
temp/table_Present_a_structured_table_sum_20250817_171709.csv
ADDED
@@ -0,0 +1,6 @@
+Measurement Type,Value,Details
+Measurement,13,## Israel-Iran Conflict Timeline: 13-15 June 2025
+Measurement,1,"Based on the provided BBC news articles (pages 1, 2, & 3), here’s a structured timeline of the escalating conflict between Israel and Iran from June 13th to 15th, 2025. This analysis incorporates information from all three pages, detailing attackers, targets, locations, weapons, and resulting damage/casualties."
+Measurement,1,"The conflict stems from a series of escalating actions initiated by Israel, targeting Iran’s nuclear program. Page 1 highlights that this began with Israeli strikes on nuclear and military sites in Iran, prompting retaliatory attacks by Iran targeting Israel. The situation is highly volatile, with both sides engaged in rhetorical threats and actual military action, and the US considering its involvement (Page 1). The Israeli Prime Minister, Benjamin Netanyahu, justified the attacks as a pre-emptive measure to prevent Iran from developing nuclear weapons, claiming they could produce one ""in a very short time"" if not stopped (Page 1). The conflict centers around concerns over Iran’s nuclear program and the potential for developing a nuclear weapon, and further details indicate that Iran is under investigation for possibly building these weapons (Page 3)."
+Measurement,13,**Timeline: 13-15 June 2025**
+Measurement,2,"* **Page 2:** This page provides a timeline of the immediate retaliatory actions. It details Iran launching around 100 missiles towards Israel on June 13th, with the majority being intercepted by Israel’s Iron Dome system. There is mention of the reporting of 78 people being injured by Friday evening. Page 2 also highlights coordination between Israel and Washington on Iran and notes a reported Iranian missile directly hitting a hospital in Beersheba, Southern Israel (Page 2)."
temp/table_Present_a_structured_table_sum_20250817_174625.csv
ADDED
@@ -0,0 +1,14 @@
+,**Date**,**Attacker**,**Target**,**Location**,**Weapon Used**,**Damage/Casualties**,**Source Page**,
+,---,---,---,---,---,---,---,
+,**June 13**,Israel,Iranian nuclear sites,Throughout Iran (including Natanz – approximately 225km (140 miles) south of Tehran),Unspecified (likely airstrikes),Significant damage to the Natanz nuclear facility,Page 1,
+,**June 13**,Iran,Israel (Military & Civilian),Israel (including Beersheba in the south),"Ballistic missiles, drones","Dozens of targets hit (military centers, air bases); approximately 78 people injured, Iron Dome intercepted most missiles","Page 1, Page 2",
+,**June 13**,Iran,Israel,Beersheba,Iranian missile,At least 32 people injured at Soroka hospital,Page 2,
+,**June 13**,Iran,IDF command center and intelligence camp,Adjacent to hospital in Beersheba,Missile,"Intended target, but state media report possible intentional targeting of the hospital",Page 2,
+,**June 14-15**,Not detailed,Likely reciprocal attacks,Not detailed,Not detailed,Conflict is ongoing,"Page 1, Page 2",
+,**June 15**,Israel,Iranian nuclear facility,Not detailed,Unspecified (likely airstrikes),The Natanz facility is targeted to prevent Iran from reaching the capability to produce weapons-grade material,Page 3,
+*,**Escalation:** The conflict has rapidly escalated from targeted strikes on nuclear facilities to widespread missile attacks on military and civilian targets.
+*,**Nuclear Focus:** A central theme of the conflict is preventing Iran from acquiring nuclear weapons. Israel’s strikes on Natanz (Page 1 & 3) demonstrate a clear intention to disrupt Iran’s nuclear program.
+*,"**Reciprocal Attacks:** The timeline indicates a clear pattern of reciprocal attacks, with Iran responding to Israeli strikes with missile launches, and Israel likely conducting follow-up strikes."
+*,"**Civilian Impact:** The attacks have resulted in civilian casualties, with over 220 Palestinians killed in Israeli strikes (Page 1) and at least 32 injured in Israel (Page 2). The report of a possible deliberate targeting of a hospital (Page 2) is particularly concerning."
+*,"**Threats and Warnings:** The rhetoric from both sides is aggressive, with warnings of “severe punishment” (Page 3) and a “declaration of war.”"
+*,"**Iron Dome Effectiveness:** The Iron Dome defense system has intercepted most of the Iranian missiles (Page 1 & 2), but the sheer volume of attacks demonstrates the challenge of defending against such threats."
temp/table_create_a_bar_chart_graph_showi_20250901_045645.csv
ADDED
@@ -0,0 +1,10 @@
+,**Date**,**Time (Local)**,**Event**,**Location**,**Impact/Details**,**Page Reference**,
+,---,---,---,---,---,---,
+,Days Prior,N/A,Reciprocal missile exchanges,Israel & Iran,Ongoing pattern of attacks,1,
+,June 12,Evening,Iran tells people to evacuate,Tehran's District 18,Evacuation order issued,2,
+,June 13,03:30,Initial Iranian Missile Attack,Towards Israel,"~100 missiles launched, most intercepted by Iron Dome","1, 2",
+,June 13,Shortly After,Israeli Operation Rising Lion,Iran,"Strikes on nuclear and military sites, significant damage to Natanz facility",2,
+,June 13,Hours Later,Iranian Missile Attack,Israel,"Dozens of targets, military centers and air bases targeted",2,
+,June 19,Morning,Missile Hits Hospital,"Beersheba, Southern Israel","At least 32 people injured, controversy over targeting a hospital",3,
+,Ongoing,N/A,Casualties,Israel & Iran,"Over 220 deaths in Iran, 24 deaths in Israel",3,
+,Ongoing,N/A,International Involvement,US,Trump considering joining strikes on Iranian nuclear sites,3,
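
The table exports above are plain CSV files, so they can be re-loaded outside the app; a minimal sketch using pandas (already pinned in requirements.txt), assuming the temp/ directory is present in the working directory:

```python
import glob
import pandas as pd

# Exact file names vary per query and timestamp, so glob for them.
for path in sorted(glob.glob("temp/table_*.csv")):
    df = pd.read_csv(path)
    print(path, df.shape)
```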
utils.py
ADDED
@@ -0,0 +1,5 @@
+import base64
+
+def encode_image(image_path):
+    with open(image_path, "rb") as image_file:
+        return base64.b64encode(image_file.read()).decode('utf-8')
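
As a quick illustration, `encode_image` is what rag.py's payload builder uses to inline images as data URLs; a small sketch with a hypothetical image path:

```python
from utils import encode_image

# Hypothetical path; any JPEG/PNG page image works.
b64 = encode_image("pages/report_page_1.jpg")
data_url = f"data:image/jpeg;base64,{b64}"
print(data_url[:60], "...")
```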