milwright committed on
Commit 59aaeae · 0 Parent(s):

Update Historical OCR with specified input files

.gitattributes ADDED
@@ -0,0 +1,4 @@
1
+ *.jpg filter=lfs diff=lfs merge=lfs -text
2
+ *.jpeg filter=lfs diff=lfs merge=lfs -text
3
+ *.png filter=lfs diff=lfs merge=lfs -text
4
+ *.pdf filter=lfs diff=lfs merge=lfs -text
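The four patterns above route image and PDF files through Git LFS. As a rough sketch, the check below approximates which file names those patterns would catch. It uses Python's `fnmatchcase` as a stand-in for Git's own attribute matching, which is an assumption; `LFS_PATTERNS` and `is_lfs_tracked` are hypothetical names introduced only for illustration.

```python
from fnmatch import fnmatchcase

# Patterns copied from the .gitattributes hunk above.
LFS_PATTERNS = ["*.jpg", "*.jpeg", "*.png", "*.pdf"]


def is_lfs_tracked(filename: str) -> bool:
    """Approximate check: does the file name match any LFS pattern?

    fnmatchcase is case-sensitive shell-style matching; Git's real
    matching rules may differ in edge cases (assumption).
    """
    return any(fnmatchcase(filename, pattern) for pattern in LFS_PATTERNS)


print(is_lfs_tracked("scan.pdf"))  # matches *.pdf
print(is_lfs_tracked("app.py"))    # matches nothing
```

Running `git check-attr filter -- <path>` in the repository remains the authoritative way to confirm how a given path is handled.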
README.md ADDED
@@ -0,0 +1,46 @@
1
+ ---
2
+ title: Historical OCR with Contextual Intelligence
3
+ emoji: 📜
4
+ colorFrom: indigo
5
+ colorTo: purple
6
+ sdk: streamlit
7
+ sdk_version: "1.28.0"
8
+ app_file: app.py
9
+ pinned: false
10
+ ---
11
+
12
+ # Historical OCR with Contextual Intelligence
13
+
14
+ An advanced OCR application for historical document analysis using Mistral AI.
15
+
16
+ ## Features
17
+
18
+ - **OCR with Context:** AI-enhanced OCR optimized for historical documents
19
+ - **Document Type Detection:** Automatically identifies handwritten letters, recipes, scientific texts, and more
20
+ - **Image Preprocessing:** Optimizes images for better text recognition
21
+ - **Custom Prompting:** Tailor the AI analysis with document-specific instructions
22
+ - **Structured Output:** Returns organized, structured information based on document type
23
+
24
+ ## Using This App
25
+
26
+ 1. Upload a historical document (image or PDF)
27
+ 2. Add optional context or special instructions
28
+ 3. Get detailed, structured OCR results with historical context
29
+
30
+ ## Supported Document Types
31
+
32
+ - Handwritten letters and correspondence
33
+ - Historical recipes and cookbooks
34
+ - Travel accounts and exploration logs
35
+ - Scientific papers and experiments
36
+ - Legal documents and certificates
37
+ - Historical newspaper articles
38
+ - General historical texts
39
+
40
+ ## Technical Details
41
+
42
+ Built with Streamlit and Mistral AI's OCR and large language model capabilities.
43
+
44
+ ---
45
+
46
+ Created by Zach Muhlbauer, CUNY Graduate Center
app.py ADDED
@@ -0,0 +1,1672 @@
1
+ import os
2
+ import streamlit as st
3
+ import json
4
+ import sys
5
+ import time
6
+ import base64
7
+ from pathlib import Path
8
+ import tempfile
9
+ import io
10
+ from pdf2image import convert_from_bytes
11
+ from PIL import Image, ImageEnhance, ImageFilter
12
+ import cv2
13
+ import numpy as np
14
+ from datetime import datetime
15
+
16
+ # Import the StructuredOCR class and config from the local files
17
+ from structured_ocr import StructuredOCR
18
+ from config import MISTRAL_API_KEY
19
+
20
+ # Import utilities for handling previous results
21
+ from ocr_utils import create_results_zip
22
+
23
+ def get_base64_from_image(image_path):
24
+ """Get base64 string from image file"""
25
+ with open(image_path, "rb") as img_file:
26
+ return base64.b64encode(img_file.read()).decode('utf-8')
27
+
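The helper above returns a base64 string that the app later embeds in an `<img>` tag. A minimal sketch of that step, working from raw bytes instead of a file path; `to_data_uri` is a hypothetical name, not part of the commit.

```python
import base64


def to_data_uri(data: bytes, mime_type: str = "image/png") -> str:
    """Wrap raw bytes in a data: URI suitable for an <img src=...> attribute."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# The eight-byte PNG signature stands in for real image bytes here.
print(to_data_uri(b"\x89PNG\r\n\x1a\n"))
```

Taking bytes rather than a path keeps the function easy to test and lets the caller decide what to do when the icon file is missing.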
28
+ # Set favicon path
29
+ favicon_path = os.path.join(os.path.dirname(__file__), "static/favicon.png")
30
+
31
+ # Set page configuration
32
+ st.set_page_config(
33
+ page_title="Historical OCR",
34
+ page_icon=favicon_path if os.path.exists(favicon_path) else "📜",
35
+ layout="wide",
36
+ initial_sidebar_state="expanded"
37
+ )
38
+
39
+ # Enable caching for expensive operations with longer TTL for better performance
40
+ @st.cache_data(ttl=24*3600, show_spinner=False) # Cache for 24 hours instead of 1 hour
41
+ def convert_pdf_to_images(pdf_bytes, dpi=150, rotation=0):
42
+ """Convert PDF bytes to a list of images with caching"""
43
+ try:
44
+ images = convert_from_bytes(pdf_bytes, dpi=dpi)
45
+
46
+ # Apply rotation if specified
47
+ if rotation != 0 and images:
48
+ rotated_images = []
49
+ for img in images:
50
+ rotated_img = img.rotate(rotation, expand=True, resample=Image.BICUBIC)
51
+ rotated_images.append(rotated_img)
52
+ return rotated_images
53
+
54
+ return images
55
+ except Exception as e:
56
+ st.error(f"Error converting PDF: {str(e)}")
57
+ return []
58
+
59
+ # Cache preprocessed images for better performance
60
+ @st.cache_data(ttl=24*3600, show_spinner=False) # Cache for 24 hours
61
+ def preprocess_image(image_bytes, preprocessing_options):
62
+ """Preprocess image with selected options optimized for historical document OCR quality"""
63
+ # Setup basic console logging
64
+ import logging
65
+ logger = logging.getLogger("image_preprocessor")
66
+ logger.setLevel(logging.INFO)
67
+
68
+ # Log which preprocessing options are being applied
69
+ logger.info(f"Preprocessing image with options: {preprocessing_options}")
70
+
71
+ # Convert bytes to PIL Image
72
+ image = Image.open(io.BytesIO(image_bytes))
73
+
74
+ # Check for alpha channel (RGBA) and convert to RGB if needed
75
+ if image.mode == 'RGBA':
76
+ # Convert RGBA to RGB by compositing the image onto a white background
77
+ background = Image.new('RGB', image.size, (255, 255, 255))
78
+ background.paste(image, mask=image.split()[3]) # 3 is the alpha channel
79
+ image = background
80
+ logger.info("Converted RGBA image to RGB")
81
+ elif image.mode not in ('RGB', 'L'):
82
+ # Convert other modes to RGB as well
83
+ image = image.convert('RGB')
84
+ logger.info("Converted non-RGB image to RGB")
85
+
86
+ # Apply rotation if specified
87
+ if preprocessing_options.get("rotation", 0) != 0:
88
+ rotation_degrees = preprocessing_options.get("rotation")
89
+ image = image.rotate(rotation_degrees, expand=True, resample=Image.BICUBIC)
90
+
91
+ # Resize large images while preserving details important for OCR
92
+ width, height = image.size
93
+ max_dimension = max(width, height)
94
+
95
+ # Less aggressive resizing to preserve document details
96
+ if max_dimension > 2500:
97
+ scale_factor = 2500 / max_dimension
98
+ new_width = int(width * scale_factor)
99
+ new_height = int(height * scale_factor)
100
+ # Use LANCZOS for better quality preservation
101
+ image = image.resize((new_width, new_height), Image.LANCZOS)
102
+
103
+ img_array = np.array(image)
104
+
105
+ # Apply preprocessing based on selected options with settings optimized for historical documents
106
+ document_type = preprocessing_options.get("document_type", "standard")
107
+
108
+ # Process grayscale option first as it's a common foundation
109
+ if preprocessing_options.get("grayscale", False):
110
+ if len(img_array.shape) == 3: # Only convert if it's not already grayscale
111
+ if document_type == "handwritten":
112
+ # Enhanced grayscale processing for handwritten documents
113
+ img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
114
+ # Apply adaptive histogram equalization to enhance handwriting
115
+ clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
116
+ img_array = clahe.apply(img_array)
117
+ else:
118
+ # Standard grayscale for printed documents
119
+ img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
120
+
121
+ # Convert back to RGB for further processing
122
+ img_array = cv2.cvtColor(img_array, cv2.COLOR_GRAY2RGB)
123
+
124
+ if preprocessing_options.get("contrast", 0) != 0:
125
+ contrast_factor = 1 + (preprocessing_options.get("contrast", 0) / 10)
126
+ image = Image.fromarray(img_array)
127
+ enhancer = ImageEnhance.Contrast(image)
128
+ image = enhancer.enhance(contrast_factor)
129
+ img_array = np.array(image)
130
+
131
+ if preprocessing_options.get("denoise", False):
132
+ try:
133
+ # Apply appropriate denoising based on document type
134
+ if document_type == "handwritten":
135
+ # Very light denoising for handwritten documents to preserve pen strokes
136
+ if len(img_array.shape) == 3 and img_array.shape[2] == 3: # Color image
137
+ img_array = cv2.fastNlMeansDenoisingColored(img_array, None, 3, 3, 5, 9)
138
+ else: # Grayscale image
139
+ img_array = cv2.fastNlMeansDenoising(img_array, None, 3, 7, 21)
140
+ else:
141
+ # Standard denoising for printed documents
142
+ if len(img_array.shape) == 3 and img_array.shape[2] == 3: # Color image
143
+ img_array = cv2.fastNlMeansDenoisingColored(img_array, None, 5, 5, 7, 21)
144
+ else: # Grayscale image
145
+ img_array = cv2.fastNlMeansDenoising(img_array, None, 5, 7, 21)
146
+ except Exception as e:
147
+ logger.warning(f"Denoising error: {str(e)}; continuing without denoising")
148
+
149
+ # Convert back to PIL Image
150
+ processed_image = Image.fromarray(img_array)
151
+
152
+ # Higher quality for OCR processing
153
+ byte_io = io.BytesIO()
154
+ try:
155
+ # Make sure the image is in RGB mode before saving as JPEG
156
+ if processed_image.mode not in ('RGB', 'L'):
157
+ processed_image = processed_image.convert('RGB')
158
+
159
+ processed_image.save(byte_io, format='JPEG', quality=92, optimize=True)
160
+ byte_io.seek(0)
161
+
162
+ logger.info(f"Preprocessing complete. Original image mode: {image.mode}, processed mode: {processed_image.mode}")
163
+ logger.info(f"Original size: {len(image_bytes)/1024:.1f}KB, processed size: {len(byte_io.getvalue())/1024:.1f}KB")
164
+
165
+ return byte_io.getvalue()
166
+ except Exception as e:
167
+ logger.error(f"Error saving processed image: {str(e)}")
168
+ # Fallback to original image
169
+ logger.info("Using original image as fallback")
170
+ image_io = io.BytesIO()
171
+ image.save(image_io, format='JPEG', quality=92)
172
+ image_io.seek(0)
173
+ return image_io.getvalue()
174
+
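Two pieces of arithmetic in `preprocess_image` are easy to check in isolation: the downscale that caps the longest side at 2500 pixels, and the mapping from the contrast slider (-5 to +5) to an enhancement factor. The sketch below restates both as pure functions; `fit_within` and `contrast_factor` are hypothetical names used only for illustration.

```python
from __future__ import annotations


def fit_within(width: int, height: int, max_dimension: int = 2500) -> tuple[int, int]:
    """Scale (width, height) so the longest side is at most max_dimension.

    The aspect ratio is preserved; images already within the limit are unchanged.
    """
    longest = max(width, height)
    if longest <= max_dimension:
        return width, height
    scale = max_dimension / longest
    return int(width * scale), int(height * scale)


def contrast_factor(slider_value: int) -> float:
    """Map a slider value in [-5, 5] to a contrast factor in [0.5, 1.5]."""
    return 1 + slider_value / 10


print(fit_within(5000, 2500))
print(contrast_factor(5), contrast_factor(-5))
```

A factor of 1.0 leaves contrast unchanged, which is why the function above skips the enhancement step when the slider is at 0.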
175
+ # Cache OCR results in memory to speed up repeated processing
176
+ @st.cache_data(ttl=24*3600, max_entries=20, show_spinner=False)
177
+ def process_file_cached(file_path, file_type, use_vision, file_size_mb, cache_key):
178
+ """Cached version of OCR processing to reuse results"""
179
+ # Initialize OCR processor
180
+ processor = StructuredOCR()
181
+
182
+ # Process the file
183
+ result = processor.process_file(
184
+ file_path,
185
+ file_type=file_type,
186
+ use_vision=use_vision,
187
+ file_size_mb=file_size_mb
188
+ )
189
+
190
+ return result
191
+
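The cached wrapper above receives both a temp-file path and a `cache_key`. One point worth checking: as far as the Streamlit documentation describes, `st.cache_data` hashes every argument unless its name begins with an underscore, so a freshly generated temp path would likely produce a cache miss even for identical uploads. The sketch below shows only the content-based key, under that assumption; `make_cache_key` is a hypothetical helper, not part of the commit.

```python
import hashlib


def make_cache_key(file_bytes: bytes, file_type: str, use_vision: bool, rotation: int = 0) -> str:
    """Build a key from file content and settings, independent of any temp path.

    MD5 is used here only as a fast content fingerprint, not for security.
    """
    digest = hashlib.md5(file_bytes).hexdigest()
    return f"{digest}_{file_type}_{use_vision}_{rotation}"


first = make_cache_key(b"same document", "pdf", True, 90)
second = make_cache_key(b"same document", "pdf", True, 90)
print(first == second)  # identical content and settings give the same key
```

If the path argument were renamed with a leading underscore (for example `_file_path`), the remaining arguments, including this key, would be the only inputs to the cache lookup.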
192
+ # Define functions
193
+ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, progress_container=None):
194
+ """Process the uploaded file and return the OCR results
195
+
196
+ Args:
197
+ uploaded_file: The uploaded file to process
198
+ use_vision: Whether to use vision model
199
+ preprocessing_options: Dictionary of preprocessing options
200
+ progress_container: Optional container for progress indicators
201
+ """
202
+ if preprocessing_options is None:
203
+ preprocessing_options = {}
204
+
205
+ # Create a container for progress indicators if not provided
206
+ if progress_container is None:
207
+ progress_container = st.empty()
208
+
209
+ with progress_container.container():
210
+ progress_bar = st.progress(0)
211
+ status_text = st.empty()
212
+ status_text.markdown('<div class="processing-status-container">Preparing file for processing...</div>', unsafe_allow_html=True)
213
+
214
+ try:
215
+ # Check if API key is available
216
+ if not MISTRAL_API_KEY:
217
+ # Return dummy data if no API key
218
+ progress_bar.progress(100)
219
+ status_text.empty()
220
+ return {
221
+ "file_name": uploaded_file.name,
222
+ "topics": ["Document"],
223
+ "languages": ["English"],
224
+ "ocr_contents": {
225
+ "title": "API Key Required",
226
+ "content": "Please set the MISTRAL_API_KEY environment variable to process documents."
227
+ }
228
+ }
229
+
230
+ # Update progress - more granular steps
231
+ progress_bar.progress(10)
232
+ status_text.markdown('<div class="processing-status-container">Initializing OCR processor...</div>', unsafe_allow_html=True)
233
+
234
+ # Determine file type from extension
235
+ file_ext = Path(uploaded_file.name).suffix.lower()
236
+ file_type = "pdf" if file_ext == ".pdf" else "image"
237
+ file_bytes = uploaded_file.getvalue()
238
+
239
+ # Create a temporary file for processing
240
+ with tempfile.NamedTemporaryFile(delete=False, suffix=file_ext) as tmp:
241
+ tmp.write(file_bytes)
242
+ temp_path = tmp.name
243
+
244
+ # Get PDF rotation value if available and file is a PDF
245
+ pdf_rotation_value = globals().get("pdf_rotation", 0) if file_type == "pdf" else 0  # pdf_rotation is module-level, so it never appears in locals()
246
+
247
+ progress_bar.progress(15)
248
+
249
+ # For PDFs, we need to handle differently
250
+ if file_type == "pdf":
251
+ status_text.markdown('<div class="processing-status-container">Converting PDF to images...</div>', unsafe_allow_html=True)
252
+ progress_bar.progress(20)
253
+
254
+ # Convert PDF to images
255
+ try:
256
+ # Use the PDF processing pipeline directly from the StructuredOCR class
257
+ processor = StructuredOCR()
258
+
259
+ # Process the file with direct PDF handling
260
+ progress_bar.progress(30)
261
+ status_text.markdown('<div class="processing-status-container">Processing PDF with OCR...</div>', unsafe_allow_html=True)
262
+
263
+ # Get file size in MB for API limits
264
+ file_size_mb = os.path.getsize(temp_path) / (1024 * 1024)
265
+
266
+ # Check if file exceeds API limits (50 MB)
267
+ if file_size_mb > 50:
268
+ os.unlink(temp_path) # Clean up temp file
269
+ progress_bar.progress(100)
270
+ status_text.empty()
271
+ progress_container.empty()
272
+ return {
273
+ "file_name": uploaded_file.name,
274
+ "topics": ["Document"],
275
+ "languages": ["English"],
276
+ "error": f"File size {file_size_mb:.2f} MB exceeds Mistral API limit of 50 MB",
277
+ "ocr_contents": {
278
+ "error": f"Failed to process file: File size {file_size_mb:.2f} MB exceeds Mistral API limit of 50 MB",
279
+ "partial_text": "Document could not be processed due to size limitations."
280
+ }
281
+ }
282
+
283
+ # Generate cache key
284
+ import hashlib
285
+ file_hash = hashlib.md5(file_bytes).hexdigest()
286
+ cache_key = f"{file_hash}_{file_type}_{use_vision}_{pdf_rotation_value}"
287
+
288
+ # Process with cached function if possible
289
+ try:
290
+ result = process_file_cached(temp_path, file_type, use_vision, file_size_mb, cache_key)
291
+ progress_bar.progress(90)
292
+ status_text.markdown('<div class="processing-status-container">Finalizing results...</div>', unsafe_allow_html=True)
293
+ except Exception as e:
294
+ status_text.markdown(f'<div class="processing-status-container">Processing error: {str(e)}. Retrying...</div>', unsafe_allow_html=True)
295
+ progress_bar.progress(60)
296
+ # If caching fails, process directly
297
+ result = processor.process_file(
298
+ temp_path,
299
+ file_type=file_type,
300
+ use_vision=use_vision,
301
+ file_size_mb=file_size_mb,
302
+ )
303
+ progress_bar.progress(90)
304
+ status_text.markdown('<div class="processing-status-container">Finalizing results...</div>', unsafe_allow_html=True)
305
+
306
+ except Exception as e:
307
+ os.unlink(temp_path) # Clean up temp file
308
+ progress_bar.progress(100)
309
+ status_text.empty()
310
+ progress_container.empty()
311
+ raise ValueError(f"Error processing PDF: {str(e)}")
312
+
313
+ else:
314
+ # For image files, apply preprocessing if needed
315
+ # Check if any preprocessing options with boolean values are True, or if any non-boolean values are non-default
316
+ has_preprocessing = (
317
+ preprocessing_options.get("grayscale", False) or
318
+ preprocessing_options.get("denoise", False) or
319
+ preprocessing_options.get("contrast", 0) != 0 or
320
+ preprocessing_options.get("rotation", 0) != 0 or
321
+ preprocessing_options.get("document_type", "standard") != "standard"
322
+ )
323
+
324
+ if has_preprocessing:
325
+ status_text.markdown('<div class="processing-status-container">Applying image preprocessing...</div>', unsafe_allow_html=True)
326
+ progress_bar.progress(20)
327
+ processed_bytes = preprocess_image(file_bytes, preprocessing_options)
328
+ progress_bar.progress(25)
329
+
330
+ # Save processed image to temp file
331
+ with tempfile.NamedTemporaryFile(delete=False, suffix=file_ext) as proc_tmp:
332
+ proc_tmp.write(processed_bytes)
333
+ # Clean up original temp file and use the processed one
334
+ if os.path.exists(temp_path):
335
+ os.unlink(temp_path)
336
+ temp_path = proc_tmp.name
337
+ progress_bar.progress(30)
338
+ else:
339
+ progress_bar.progress(30)
340
+
341
+ # Get file size in MB for API limits
342
+ file_size_mb = os.path.getsize(temp_path) / (1024 * 1024)
343
+
344
+ # Check if file exceeds API limits (50 MB)
345
+ if file_size_mb > 50:
346
+ os.unlink(temp_path) # Clean up temp file
347
+ progress_bar.progress(100)
348
+ status_text.empty()
349
+ progress_container.empty()
350
+ return {
351
+ "file_name": uploaded_file.name,
352
+ "topics": ["Document"],
353
+ "languages": ["English"],
354
+ "error": f"File size {file_size_mb:.2f} MB exceeds Mistral API limit of 50 MB",
355
+ "ocr_contents": {
356
+ "error": f"Failed to process file: File size {file_size_mb:.2f} MB exceeds Mistral API limit of 50 MB",
357
+ "partial_text": "Document could not be processed due to size limitations."
358
+ }
359
+ }
360
+
361
+ # Update progress - more granular steps
362
+ progress_bar.progress(40)
363
+ status_text.markdown('<div class="processing-status-container">Preparing document for OCR analysis...</div>', unsafe_allow_html=True)
364
+
365
+ # Generate a cache key based on file content, type and settings
366
+ import hashlib
367
+ file_hash = hashlib.md5(Path(temp_path).read_bytes()).hexdigest()
368
+ cache_key = f"{file_hash}_{file_type}_{use_vision}"
369
+
370
+ progress_bar.progress(50)
371
+ status_text.markdown('<div class="processing-status-container">Processing document with OCR...</div>', unsafe_allow_html=True)
372
+
373
+ # Process the file using cached function if possible
374
+ try:
375
+ result = process_file_cached(temp_path, file_type, use_vision, file_size_mb, cache_key)
376
+ progress_bar.progress(80)
377
+ status_text.markdown('<div class="processing-status-container">Analyzing document structure...</div>', unsafe_allow_html=True)
378
+ progress_bar.progress(90)
379
+ status_text.markdown('<div class="processing-status-container">Finalizing results...</div>', unsafe_allow_html=True)
380
+ except Exception as e:
381
+ progress_bar.progress(60)
382
+ status_text.markdown(f'<div class="processing-status-container">Processing error: {str(e)}. Retrying...</div>', unsafe_allow_html=True)
383
+ # If caching fails, process directly
384
+ processor = StructuredOCR()
385
+ result = processor.process_file(temp_path, file_type=file_type, use_vision=use_vision, file_size_mb=file_size_mb)
386
+ progress_bar.progress(90)
387
+ status_text.markdown('<div class="processing-status-container">Finalizing results...</div>', unsafe_allow_html=True)
388
+
389
+ # Complete progress
390
+ progress_bar.progress(100)
391
+ status_text.markdown('<div class="processing-status-container">Processing complete!</div>', unsafe_allow_html=True)
392
+ time.sleep(0.8) # Brief pause to show completion
393
+ status_text.empty()
394
+ progress_container.empty() # Remove progress indicators when done
395
+
396
+ # Clean up the temporary file
397
+ if os.path.exists(temp_path):
398
+ try:
399
+ os.unlink(temp_path)
400
+ except OSError:
401
+ pass # Ignore errors when cleaning up temporary files
402
+
403
+ return result
404
+ except Exception as e:
405
+ progress_bar.progress(100)
406
+ error_message = str(e)
407
+
408
+ # Check for specific error types and provide helpful user-facing messages
409
+ if "rate limit" in error_message.lower() or "429" in error_message or "requests rate limit exceeded" in error_message.lower():
410
+ friendly_message = "The AI service is currently experiencing high demand. Please try again in a few minutes."
411
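+ import logging  # otherwise imported only inside preprocess_image, so this branch would raise NameError
+ logger = logging.getLogger("app")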
412
+ logger.error(f"Rate limit error: {error_message}")
413
+ status_text.markdown(f'<div class="processing-status-container" style="border-left-color: #ff9800;">Rate Limit: {friendly_message}</div>', unsafe_allow_html=True)
414
+ elif "quota" in error_message.lower() or "credit" in error_message.lower() or "subscription" in error_message.lower():
415
+ friendly_message = "The API usage quota has been reached. Please check your API key and subscription limits."
416
+ status_text.markdown(f'<div class="processing-status-container" style="border-left-color: #ef5350;">API Quota: {friendly_message}</div>', unsafe_allow_html=True)
417
+ else:
418
+ status_text.markdown(f'<div class="processing-status-container" style="border-left-color: #ef5350;">Error: {error_message}</div>', unsafe_allow_html=True)
419
+
420
+ time.sleep(1.5) # Show error briefly
421
+ status_text.empty()
422
+ progress_container.empty()
423
+
424
+ # Display an appropriate error message based on the exception type
425
+ if "rate limit" in error_message.lower() or "429" in error_message or "requests rate limit exceeded" in error_message.lower():
426
+ st.warning(f"API Rate Limit: {friendly_message} This is a temporary issue and does not indicate any problem with your document.")
427
+ elif "quota" in error_message.lower() or "credit" in error_message.lower() or "subscription" in error_message.lower():
428
+ st.error(f"API Quota Exceeded: {friendly_message}")
429
+ else:
430
+ st.error(f"Error during processing: {error_message}")
431
+
432
+ # Clean up the temporary file
433
+ try:
434
+ if 'temp_path' in locals() and os.path.exists(temp_path):
435
+ os.unlink(temp_path)
436
+ except OSError:
437
+ pass # Ignore errors when cleaning up temporary files
438
+
439
+ raise
440
+
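The exception handler above tests the error message for rate-limit and quota keywords twice, once for the status banner and once for the final warning. A minimal sketch of factoring that check into one function, assuming the same keywords; `classify_api_error` and its return labels are hypothetical.

```python
def classify_api_error(message: str) -> str:
    """Return 'rate_limit', 'quota', or 'other' based on keywords in the message."""
    text = message.lower()
    # "requests rate limit exceeded" is already covered by the "rate limit" check.
    if "rate limit" in text or "429" in text:
        return "rate_limit"
    if any(word in text for word in ("quota", "credit", "subscription")):
        return "quota"
    return "other"


for sample in ("HTTP 429 Too Many Requests", "Monthly quota reached", "Unexpected EOF"):
    print(sample, "->", classify_api_error(sample))
```

Classifying once and branching on the label keeps the two display paths in sync and avoids referencing `friendly_message` in a branch where it was never assigned.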
441
+ # App title and description
442
+ favicon_base64 = get_base64_from_image(favicon_path)
443
+ st.markdown(f'<div style="display: flex; align-items: center; gap: 10px;"><img src="data:image/png;base64,{favicon_base64}" width="36" height="36" alt="Scroll Icon"/> <h1 style="margin: 0; padding: 0;">Historical Document OCR</h1></div>', unsafe_allow_html=True)
444
+ st.subheader("Powered by Mistral AI")
445
+
446
+ # Check if pytesseract is available for fallback
447
+ try:
448
+ import pytesseract
449
+ has_pytesseract = True
450
+ except ImportError:
451
+ has_pytesseract = False
452
+
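The try/except import above is the usual way to detect an optional dependency. As an alternative sketch, `importlib.util.find_spec` can report whether a top-level module is installed without importing it; `has_module` is a hypothetical helper name.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(name) is not None


print(has_module("json"))                            # standard library module
print(has_module("no_such_module_for_this_sketch"))  # not installed
```

The trade-off is that `find_spec` only locates the module. A package that is installed but fails at import time would still be reported as available, whereas the try/except form catches that case.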
453
+ # Initialize session state for storing previous results if not already present
454
+ if 'previous_results' not in st.session_state:
455
+ st.session_state.previous_results = []
456
+
457
+ # Create main layout with tabs and columns
458
+ main_tab1, main_tab2, main_tab3 = st.tabs(["Document Processing", "Previous Results", "About"])
459
+
460
+ with main_tab1:
461
+ # Create a two-column layout for file upload and results
462
+ left_col, right_col = st.columns([1, 1])
463
+
464
+ # File uploader in the left column
465
+ with left_col:
466
+ st.markdown("""
467
+ Upload an image or PDF file to get started.
468
+
469
+ Uses the `mistral-ocr-latest` model for advanced document understanding.
470
+ """)
471
+
472
+ uploaded_file = st.file_uploader("Choose a file", type=["pdf", "png", "jpg", "jpeg"])
473
+
474
+ # Removed seed prompt instructions from here, moving to sidebar
475
+
476
+ # Sidebar with options
477
+ with st.sidebar:
478
+ st.header("Options")
479
+
480
+ # Model options
481
+ st.subheader("Model Settings")
482
+ use_vision = st.checkbox("Use Vision Model", value=True,
483
+ help="For image files, use the vision model for improved analysis (may be slower)")
484
+
485
+ # Historical Context section moved up
486
+ st.subheader("Historical Context")
487
+
488
+ # Historical period selector
489
+ historical_periods = [
490
+ "Select period (if known)",
491
+ "Pre-1700s",
492
+ "18th Century (1700s)",
493
+ "19th Century (1800s)",
494
+ "Early 20th Century (1900-1950)",
495
+ "Modern (Post 1950)"
496
+ ]
497
+
498
+ selected_period = st.selectbox(
499
+ "Historical Period",
500
+ options=historical_periods,
501
+ index=0,
502
+ help="Select the time period of the document for better OCR processing"
503
+ )
504
+
505
+ # Document purpose selector
506
+ document_purposes = [
507
+ "Select purpose (if known)",
508
+ "Personal Letter/Correspondence",
509
+ "Official/Government Document",
510
+ "Business/Financial Record",
511
+ "Literary/Academic Work",
512
+ "News/Journalism",
513
+ "Religious Text",
514
+ "Legal Document"
515
+ ]
516
+
517
+ selected_purpose = st.selectbox(
518
+ "Document Purpose",
519
+ options=document_purposes,
520
+ index=0,
521
+ help="Select the purpose or type of the document for better OCR processing"
522
+ )
523
+
524
+ # Custom prompt field
525
+ custom_prompt_text = ""
526
+ if selected_period != "Select period (if known)":
527
+ custom_prompt_text += f"This is a {selected_period} document. "
528
+
529
+ if selected_purpose != "Select purpose (if known)":
530
+ custom_prompt_text += f"It appears to be a {selected_purpose}. "
531
+
532
+ custom_prompt = st.text_area(
533
+ "Additional Context",
534
+ value=custom_prompt_text,
535
+ placeholder="Example: This document has unusual handwriting with cursive script. Please identify any mentioned locations and dates.",
536
+ height=150,
537
+ max_chars=500,
538
+ key="custom_analysis_instructions",
539
+ help="Instructions that shape how the AI processes your document. Use this field to request translations, control image or table formatting, extract specific information, or flag challenging material. See the 'Prompting Instructions' section below for details."
540
+ )
541
+
542
+ # Enhanced instructions for Additional Context with more capabilities
543
+ with st.expander("Prompting Instructions"):
544
+ st.markdown("""
545
+ ### How Additional Context Affects Processing
546
+
547
+ The "Additional Context" field provides instructions directly to the AI to influence how it processes your document. Use it to:
548
+
549
+ #### Document Understanding
550
+ - **Specify handwriting styles**: "This document uses old-fashioned cursive with numerous flourishes and abbreviations"
551
+ - **Identify language features**: "The text contains archaic spellings common in 18th century documents"
552
+ - **Highlight focus areas**: "Look for mentions of financial transactions or dates of travel"
553
+
554
+ #### Output Formatting & Languages
555
+ - **Request translations**: "After extracting the text, translate the content into Spanish"
556
+ - **Format image orientation**: "Ensure images are displayed in the same orientation as they appear in the document"
557
+ - **Format tables**: "Convert any tables in the document to structured format with clear columns"
558
+
559
+ #### Special Processing
560
+ - **Handle challenges**: "Some portions may be faded; the page edges contain handwritten notes"
561
+ - **Technical terms**: "This is a medical document with specialized terminology about surgical procedures"
562
+ - **Organization**: "Separate the letter content from the address blocks and signature"
563
+
564
+ #### Example Combinations
565
+ ```
566
+ This is a handwritten letter from the 1850s. The writer uses archaic spellings and formal language.
567
+ Please preserve paragraph structure, identify any place names mentioned, and note any references
568
+ to historical events. Format any lists as bullet points.
569
+ ```
570
+ """)
571
+
572
+ # Image preprocessing options (collapsible)
573
+ st.subheader("Image Preprocessing")
574
+ with st.expander("Preprocessing Options"):
575
+ preprocessing_options = {}
576
+
577
+ # Document type selector - important for optimized processing
578
+ doc_type_options = ["standard", "handwritten", "typed", "printed"]
579
+ preprocessing_options["document_type"] = st.selectbox(
580
+ "Document Type",
581
+ options=doc_type_options,
582
+ index=0, # Default to standard
583
+ format_func=lambda x: x.capitalize(),
584
+ help="Select document type for optimized processing - choose 'Handwritten' for letters and manuscripts"
585
+ )
586
+
587
+ preprocessing_options["grayscale"] = st.checkbox("Convert to Grayscale",
588
+ help="Convert image to grayscale before OCR")
589
+ preprocessing_options["denoise"] = st.checkbox("Denoise Image",
590
+ help="Remove noise from the image")
591
+ preprocessing_options["contrast"] = st.slider("Adjust Contrast", -5, 5, 0,
592
+ help="Adjust image contrast (-5 to +5)")
593
+
594
+ # Add rotation options
595
+ rotation_options = [0, 90, 180, 270]
596
+ preprocessing_options["rotation"] = st.select_slider(
597
+ "Rotate Document",
598
+ options=rotation_options,
599
+ value=0,
600
+ format_func=lambda x: f"{x}° {'(No rotation)' if x == 0 else ''}",
601
+ help="Rotate the document to correct orientation"
602
+ )
603
+
604
+ # PDF options (collapsible)
605
+ st.subheader("PDF Options")
606
+ with st.expander("PDF Settings"):
607
+ pdf_dpi = st.slider("PDF Resolution (DPI)", 72, 300, 100,
608
+ help="Higher DPI gives better quality but slower processing. Try 100 for faster processing.")
609
+ max_pages = st.number_input("Maximum Pages to Process", 1, 20, 3,
610
+ help="Limit number of pages to process")
611
+
612
+ # Add PDF rotation option
613
+ rotation_options = [0, 90, 180, 270]
614
+ pdf_rotation = st.select_slider(
615
+ "Rotate PDF",
616
+ options=rotation_options,
617
+ value=0,
618
+ format_func=lambda x: f"{x}° {'(No rotation)' if x == 0 else ''}",
619
+ help="Rotate the PDF pages to correct orientation"
620
+ )
621
+
622
+ # Store PDF rotation separately instead of in preprocessing_options
623
+ # This prevents conflict with image preprocessing
624
+
625
+ # Previous Results tab content
626
+ with main_tab2:
627
+ st.markdown('<h2>Previous Results</h2>', unsafe_allow_html=True)
628
+
629
+ # Load custom CSS for Previous Results tab
630
+ from ui.layout import load_css
631
+ load_css()
632
+
633
+ # Display previous results if available
634
+ if not st.session_state.previous_results:
635
+ st.markdown("""
636
+ <div class="previous-results-container" style="text-align: center; padding: 40px 20px;">
637
+ <div style="font-size: 48px; margin-bottom: 20px; color: #757575;">📄</div>
638
+ <h3 style="color: #212121; margin-bottom: 10px;">No Previous Results</h3>
639
+ <p style="color: #616161;">Process a document to see it saved in your results history.</p>
640
+ </div>
641
+ """, unsafe_allow_html=True)
642
+ else:
643
+ # Create a container for the results list
644
+ st.markdown('<div class="previous-results-container">', unsafe_allow_html=True)
645
+ st.markdown(f'<h3>{len(st.session_state.previous_results)} Previous Results</h3>', unsafe_allow_html=True)
646
+
647
+ # Create two columns for filters and download buttons
648
+ filter_col, download_col = st.columns([2, 1])
649
+
650
+ with filter_col:
651
+ # Add filter options
652
+ filter_options = ["All Types"]
653
+ if any(result.get("file_name", "").lower().endswith(".pdf") for result in st.session_state.previous_results):
654
+ filter_options.append("PDF Documents")
655
+ if any(result.get("file_name", "").lower().endswith((".jpg", ".jpeg", ".png")) for result in st.session_state.previous_results):
656
+ filter_options.append("Images")
657
+
658
+ selected_filter = st.selectbox("Filter by Type:", filter_options)
659
+
660
+ with download_col:
661
+ # Add download all button for results
662
+ if len(st.session_state.previous_results) > 0:
663
+ try:
664
+ # Create buffer in memory instead of file on disk
665
+ import io
666
+ from ocr_utils import create_results_zip_in_memory
667
+
668
+ # Get zip data directly in memory
669
+ zip_data = create_results_zip_in_memory(st.session_state.previous_results)
670
+
671
+ st.download_button(
672
+ label="Download All Results",
673
+ data=zip_data,
674
+ file_name="all_ocr_results.zip",
675
+ mime="application/zip",
676
+ help="Download all previous results as a ZIP file containing HTML and JSON files"
677
+ )
678
+ except Exception as e:
679
+ st.error(f"Error creating download: {str(e)}")
680
+ st.info("Try with fewer results or individual downloads")
681
+
682
+ # Filter results based on selection
683
+ filtered_results = st.session_state.previous_results
684
+ if selected_filter == "PDF Documents":
685
+ filtered_results = [r for r in st.session_state.previous_results if r.get("file_name", "").lower().endswith(".pdf")]
686
+ elif selected_filter == "Images":
687
+ filtered_results = [r for r in st.session_state.previous_results if r.get("file_name", "").lower().endswith((".jpg", ".jpeg", ".png"))]
688
+
689
+ # Show a message if no results match the filter
690
+ if not filtered_results:
691
+ st.markdown("""
692
+ <div style="text-align: center; padding: 20px; background-color: #f9f9f9; border-radius: 5px; margin: 20px 0;">
693
+ <p>No results match the selected filter.</p>
694
+ </div>
695
+ """, unsafe_allow_html=True)
696
+
697
+ # Display each result as a card
698
+ for i, result in enumerate(filtered_results):
699
+ # Determine file type icon
700
+ file_name = result.get("file_name", f"Document {i+1}")
701
+ file_type_lower = file_name.lower()
702
+
703
+ if file_type_lower.endswith(".pdf"):
704
+ icon = "📄"
705
+ elif file_type_lower.endswith((".jpg", ".jpeg", ".png", ".gif")):
706
+ icon = "🖼️"
707
+ else:
708
+ icon = "📝"
709
+
710
+ # Create a card for each result
711
+ st.markdown(f"""
712
+ <div class="result-card">
713
+ <div class="result-header">
714
+ <div class="result-filename">{icon} {file_name}</div>
715
+ <div class="result-date">{result.get('timestamp', 'Unknown')}</div>
716
+ </div>
717
+ <div class="result-metadata">
718
+ <div class="result-tag">Languages: {', '.join(str(lang) for lang in result.get('languages', ['Unknown']) if lang is not None)}</div>
719
+ <div class="result-tag">Topics: {', '.join(result.get('topics', ['Unknown']))}</div>
720
+ </div>
721
+ """, unsafe_allow_html=True)
722
+
723
+ # Add view button inside the card with proper styling
724
+ st.markdown('<div class="result-action-button">', unsafe_allow_html=True)
725
+ if st.button("View Document", key=f"view_{i}"):
726
+ # Set the selected result in the session state
727
+ st.session_state.selected_previous_result = result  # i indexes filtered_results, not previous_results
728
+ # Force a rerun to show the selected result
729
+ st.rerun()
730
+ st.markdown('</div>', unsafe_allow_html=True)
731
+
732
+ # Close the result card
733
+ st.markdown('</div>', unsafe_allow_html=True)
734
+
735
+ # Close the container
736
+ st.markdown('</div>', unsafe_allow_html=True)
737
+
738
+ # Display the selected result if available
739
+ if 'selected_previous_result' in st.session_state and st.session_state.selected_previous_result:
740
+ selected_result = st.session_state.selected_previous_result
741
+
742
+ # Create a styled container for the selected result
743
+ st.markdown(f"""
744
+ <div class="selected-result-container">
745
+ <div class="result-header" style="margin-bottom: 20px;">
746
+ <div class="selected-result-title">Selected Document: {selected_result.get('file_name', 'Unknown')}</div>
747
+ <div class="result-date">{selected_result.get('timestamp', '')}</div>
748
+ </div>
749
+ """, unsafe_allow_html=True)
750
+
751
+ # Display metadata in a styled way
752
+ meta_col1, meta_col2 = st.columns(2)
753
+
754
+ with meta_col1:
755
+ # Display document metadata
756
+ if 'languages' in selected_result:
757
+ languages = [lang for lang in selected_result['languages'] if lang is not None]
758
+ if languages:
759
+ st.write(f"**Languages:** {', '.join(languages)}")
760
+
761
+ if 'topics' in selected_result and selected_result['topics']:
762
+ st.write(f"**Topics:** {', '.join(selected_result['topics'])}")
763
+
764
+ with meta_col2:
765
+ # Display processing metadata
766
+ if 'limited_pages' in selected_result:
767
+ st.info(f"Processed {selected_result['limited_pages']['processed']} of {selected_result['limited_pages']['total']} pages")
768
+
769
+ if 'processing_time' in selected_result:
770
+ proc_time = selected_result['processing_time']
771
+ st.write(f"**Processing Time:** {proc_time:.1f}s")
772
+
773
+ # Create tabs for content display
774
+ has_images = selected_result.get('has_images', False)
775
+ if has_images:
776
+ view_tab1, view_tab2, view_tab3 = st.tabs(["Structured View", "Raw JSON", "With Images"])
777
+ else:
778
+ view_tab1, view_tab2 = st.tabs(["Structured View", "Raw JSON"])
779
+
780
+ with view_tab1:
781
+ # Display structured content
782
+ if 'ocr_contents' in selected_result and isinstance(selected_result['ocr_contents'], dict):
783
+ for section, content in selected_result['ocr_contents'].items():
784
+ if content and section not in ['error', 'raw_text', 'partial_text']: # Skip error and raw text sections
785
+ st.markdown(f"#### {section.replace('_', ' ').title()}")
786
+
787
+ if isinstance(content, str):
788
+ st.write(content)
789
+ elif isinstance(content, list):
790
+ for item in content:
791
+ if isinstance(item, str):
792
+ st.write(f"- {item}")
793
+ else:
794
+ st.write(f"- {str(item)}")
795
+ elif isinstance(content, dict):
796
+ for k, v in content.items():
797
+ st.write(f"**{k}:** {v}")
798
+
799
+ with view_tab2:
800
+ # Show the raw JSON with an option to download it
801
+ st.json(selected_result)
802
+
803
+ # Add JSON download button
804
+ json_str = json.dumps(selected_result, indent=2)
805
+ filename = Path(selected_result.get('file_name', 'document')).stem
806
+ st.download_button(
807
+ label="Download JSON",
808
+ data=json_str,
809
+ file_name=f"{filename}_data.json",
810
+ mime="application/json"
811
+ )
812
+
813
+ if has_images and 'pages_data' in selected_result:
814
+ with view_tab3:
815
+ # Display content with images in a nicely formatted way
816
+ pages_data = selected_result.get('pages_data', [])
817
+
818
+ # Process and display each page
819
+ for page_idx, page in enumerate(pages_data):
820
+ # Add a page header if multi-page
821
+ if len(pages_data) > 1:
822
+ st.markdown(f"### Page {page_idx + 1}")
823
+
824
+ # Create columns for better layout
825
+ if page.get('images'):
826
+ # Extract images for this page
827
+ images = page.get('images', [])
828
+ for img in images:
829
+ if 'image_base64' in img:
830
+ st.image(img['image_base64'], width=600)
831
+
832
+ # Display text content if available
833
+ text_content = page.get('markdown', '')
834
+ if text_content:
835
+ with st.expander("View Page Text", expanded=True):
836
+ st.markdown(text_content)
837
+ else:
838
+ # Just display text if no images
839
+ text_content = page.get('markdown', '')
840
+ if text_content:
841
+ st.markdown(text_content)
842
+
843
+ # Add page separator
844
+ if page_idx < len(pages_data) - 1:
845
+ st.markdown("---")
846
+
847
+ # Add HTML download button if images are available
848
+ from ocr_utils import create_html_with_images
849
+ html_content = create_html_with_images(selected_result)
850
+ filename = Path(selected_result.get('file_name', 'document')).stem
851
+ st.download_button(
852
+ label="Download as HTML with Images",
853
+ data=html_content,
854
+ file_name=f"{filename}_with_images.html",
855
+ mime="text/html"
856
+ )
857
+
858
+ # Close the container
859
+ st.markdown('</div>', unsafe_allow_html=True)
860
+
861
+ # Add clear button outside the container with proper styling
862
+ col1, col2, col3 = st.columns([1, 1, 1])
863
+ with col2:
864
+ st.markdown('<div class="result-action-button" style="text-align: center;">', unsafe_allow_html=True)
865
+ if st.button("Close Selected Document", key="close_selected"):
866
+ # Clear the selected result from session state
867
+ del st.session_state.selected_previous_result
868
+ # Force a rerun to update the view
869
+ st.rerun()
870
+ st.markdown('</div>', unsafe_allow_html=True)
871
+
872
+ # About tab content
873
+ with main_tab3:
874
+ # Add a notice about local OCR fallback if available
875
+ fallback_notice = ""
876
+ if 'has_pytesseract' in locals() and has_pytesseract:
877
+ fallback_notice = """
878
+ **Local OCR Fallback:**
879
+ - Local OCR fallback using Tesseract is available if API rate limits are reached
880
+ - Provides basic text extraction when cloud OCR is unavailable
881
+ """
882
+
883
+ st.markdown(f"""
884
+ ### About This Application
885
+
886
+ This app uses [Mistral AI's Document OCR](https://docs.mistral.ai/capabilities/document/) to extract text and images from historical documents.
887
+
888
+ It can process:
889
+ - Image files (jpg, png, etc.)
890
+ - PDF documents (multi-page support)
891
+
892
+ The extracted content is processed into structured data based on the document type, combining:
893
+ - Text extraction with `mistral-ocr-latest`
894
+ - Analysis with language models
895
+ - Layout preservation with images
896
+
897
+ View results in three formats:
898
+ - Structured HTML view
899
+ - Raw JSON (for developers)
900
+ - Markdown with images (preserves document layout)
901
+
902
+ **New Features:**
903
+ - Image preprocessing for better OCR quality
904
+ - PDF resolution and page controls
905
+ - Document rotation (90°, 180°, 270°)
906
+ - Custom instructions for special document analysis
907
+ - Performance mode selection (Speed/Balance/Quality)
908
+ - Progress tracking during processing
909
+ - Previous Results tab to review processed documents
910
+ - Enhanced rate limit handling with automatic retry
911
+ {fallback_notice}
912
+ """)
913
+
914
+ with main_tab1:
915
+ if uploaded_file is not None:
916
+ # Check file size (cap at 50MB)
917
+ file_size_mb = len(uploaded_file.getvalue()) / (1024 * 1024)
918
+
919
+ if file_size_mb > 50:
920
+ with left_col:
921
+ st.error(f"File too large ({file_size_mb:.1f} MB). Maximum file size is 50MB.")
922
+ st.stop()
923
+
924
+ file_ext = Path(uploaded_file.name).suffix.lower()
925
+
926
+ # Process button - flush left with similar padding as file browser
927
+ with left_col:
928
+ process_button = st.button("Process Document")
929
+
930
+ # Image preprocessing preview in upload column, right after the process button
931
+ if any(preprocessing_options.values()) and uploaded_file.type.startswith('image/'):
932
+ with st.expander("Image Preprocessing Preview"):
933
+ preview_cols = st.columns(2)
934
+
935
+ with preview_cols[0]:
936
+ st.markdown("**Original Image**")
937
+ st.image(uploaded_file, width=600)
938
+
939
+ with preview_cols[1]:
940
+ st.markdown("**Preprocessed Image**")
941
+ try:
942
+ processed_bytes = preprocess_image(uploaded_file.getvalue(), preprocessing_options)
943
+ st.image(io.BytesIO(processed_bytes), width=600)
944
+ except Exception as e:
945
+ st.error(f"Error in preprocessing: {str(e)}")
946
+ st.info("Try using grayscale preprocessing for PNG images with transparency")
947
+
948
+ # Empty container for progress indicators - will be filled during processing
949
+ progress_placeholder = st.empty()
950
+
951
+ # Add vertical spacing (72px)
952
+ st.markdown("<div style='margin-top: 72px;'></div>", unsafe_allow_html=True)
953
+
954
+ # Container for document metadata (will be filled after processing)
955
+ metadata_placeholder = st.empty()
956
+
957
+ # Results section
958
+ if process_button:
959
+ # Move the progress indicator reference to just below the button
960
+ progress_container = progress_placeholder
961
+ try:
962
+ # Get max_pages or default if not available
963
+ max_pages_value = max_pages if 'max_pages' in locals() else None
964
+
965
+ # Apply performance mode settings
966
+ if 'perf_mode' in locals():
967
+ if perf_mode == "Speed":
968
+ # Override settings for faster processing
969
+ if 'preprocessing_options' in locals():
970
+ preprocessing_options["denoise"] = False # Skip denoising for speed
971
+ if 'pdf_dpi' in locals() and file_ext.lower() == '.pdf':
972
+ pdf_dpi = min(pdf_dpi, 100) # Lower DPI for speed
973
+
974
+ # Process file with or without custom prompt
975
+ if custom_prompt and custom_prompt.strip():
976
+ # Process with custom instructions for the AI
977
+ with progress_placeholder.container():
978
+ progress_bar = st.progress(0)
979
+ status_text = st.empty()
980
+ status_text.markdown('<div class="processing-status-container">Processing with custom instructions...</div>', unsafe_allow_html=True)
981
+ progress_bar.progress(30)
982
+
983
+ # Special handling for PDF files with custom prompts
984
+ if file_ext.lower() == ".pdf":
985
+ # For PDFs with custom prompts, we use a special two-step process
986
+ with progress_placeholder.container():
987
+ status_text.markdown('<div class="processing-status-container">Using special PDF processing for custom instructions...</div>', unsafe_allow_html=True)
988
+ progress_bar.progress(40)
989
+
990
+ try:
991
+ # Step 1: Process without custom prompt to get OCR text
992
+ processor = StructuredOCR()
993
+
994
+ # First save the PDF to a temp file
995
+ with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
996
+ tmp.write(uploaded_file.getvalue())
997
+ temp_path = tmp.name
998
+
999
+ # Process with NO custom prompt first
1000
+ base_result = processor.process_file(
1001
+ file_path=temp_path,
1002
+ file_type="pdf",
1003
+ use_vision=use_vision,
1004
+ custom_prompt=None, # No custom prompt in first step
1005
+ file_size_mb=len(uploaded_file.getvalue()) / (1024 * 1024)
1006
+ )
1007
+
1008
+ progress_bar.progress(70)
1009
+ status_text.markdown('<div class="processing-status-container">Applying custom analysis to extracted text...</div>', unsafe_allow_html=True)
1010
+
1011
+ # Step 2: Apply custom prompt to the extracted text using text-only LLM
1012
+ if 'ocr_contents' in base_result and isinstance(base_result['ocr_contents'], dict):
1013
+ # Get text from OCR result
1014
+ ocr_text = ""
1015
+ for section, content in base_result['ocr_contents'].items():
1016
+ if isinstance(content, str):
1017
+ ocr_text += content + "\n\n"
1018
+ elif isinstance(content, list):
1019
+ for item in content:
1020
+ if isinstance(item, str):
1021
+ ocr_text += item + "\n"
1022
+ ocr_text += "\n"
1023
+
1024
+ # Format the custom prompt for text-only processing
1025
+ formatted_prompt = f"USER INSTRUCTIONS: {custom_prompt.strip()}\nPay special attention to these instructions and respond accordingly."
1026
+
1027
+ # Apply custom prompt to extracted text
1028
+ enhanced_result = processor._extract_structured_data_text_only(ocr_text, uploaded_file.name, formatted_prompt)
1029
+
1030
+ # Merge results, keeping images from base_result
1031
+ result = base_result.copy()
1032
+ result['custom_prompt_applied'] = 'text_only'
1033
+
1034
+ # Update with enhanced analysis results, preserving image data
1035
+ for key, value in enhanced_result.items():
1036
+ if key not in ['raw_response_data', 'pages_data', 'has_images']:
1037
+ result[key] = value
1038
+ else:
1039
+ # If no OCR content, just use the base result
1040
+ result = base_result
1041
+ result['custom_prompt_applied'] = 'failed'
1042
+
1043
+ # Clean up temp file
1044
+ if os.path.exists(temp_path):
1045
+ os.unlink(temp_path)
1046
+
1047
+ except Exception as e:
1048
+ # If anything fails, revert to standard processing
1049
+ st.warning(f"Special PDF processing failed. Falling back to standard method: {str(e)}")
1050
+ result = process_file(uploaded_file, use_vision, {}, progress_container=progress_placeholder)
1051
+ else:
1052
+ # For non-PDF files, use normal processing with custom prompt
1053
+ # Save the uploaded file to a temporary file with preprocessing
1054
+ with tempfile.NamedTemporaryFile(delete=False, suffix=Path(uploaded_file.name).suffix) as tmp:
1055
+ # Apply preprocessing if any options are selected
1056
+ if any(preprocessing_options.values()):
1057
+ # Apply performance mode settings
1058
+ if 'perf_mode' in locals() and perf_mode == "Speed":
1059
+ # Skip denoising for speed in preprocessing
1060
+ speed_preprocessing = preprocessing_options.copy()
1061
+ speed_preprocessing["denoise"] = False
1062
+ processed_bytes = preprocess_image(uploaded_file.getvalue(), speed_preprocessing)
1063
+ else:
1064
+ processed_bytes = preprocess_image(uploaded_file.getvalue(), preprocessing_options)
1065
+ tmp.write(processed_bytes)
1066
+ else:
1067
+ tmp.write(uploaded_file.getvalue())
1068
+ temp_path = tmp.name
1069
+
1070
+ # Show progress
1071
+ with progress_placeholder.container():
1072
+ progress_bar.progress(50)
1073
+ status_text.markdown('<div class="processing-status-container">Analyzing with custom instructions...</div>', unsafe_allow_html=True)
1074
+
1075
+ # Initialize OCR processor and process with custom prompt
1076
+ processor = StructuredOCR()
1077
+
1078
+ # Format the custom prompt to ensure it has an impact
1079
+ formatted_prompt = f"USER INSTRUCTIONS: {custom_prompt.strip()}\nPay special attention to these instructions and respond accordingly."
1080
+
1081
+ try:
1082
+ result = processor.process_file(
1083
+ file_path=temp_path,
1084
+ file_type="image", # Always use image for non-PDFs
1085
+ use_vision=use_vision,
1086
+ custom_prompt=formatted_prompt,
1087
+ file_size_mb=len(uploaded_file.getvalue()) / (1024 * 1024)
1088
+ )
1089
+ except Exception as e:
1090
+ # For any error, fall back to standard processing
1091
+ st.warning(f"Custom prompt processing failed. Falling back to standard processing: {str(e)}")
1092
+ result = process_file(uploaded_file, use_vision, preprocessing_options, progress_container=progress_placeholder)
1093
+
1094
+ # Complete progress
1095
+ with progress_placeholder.container():
1096
+ progress_bar.progress(100)
1097
+ status_text.markdown('<div class="processing-status-container">Processing complete!</div>', unsafe_allow_html=True)
1098
+ time.sleep(0.8)
1099
+ progress_placeholder.empty()
1100
+
1101
+ # Clean up temporary file
1102
+ if os.path.exists(temp_path):
1103
+ try:
1104
+ os.unlink(temp_path)
1105
+ except OSError:
1106
+ pass
1107
+ else:
1108
+ # Standard processing without custom prompt
1109
+ result = process_file(uploaded_file, use_vision, preprocessing_options, progress_container=progress_placeholder)
1110
+
1111
+ # Display Document Contents in the right column
1112
+ with right_col:
1113
+ st.subheader("Document Contents")
1114
+ # Start document content div with consistent styling class
1115
+ st.markdown('<div class="document-content">', unsafe_allow_html=True)
1116
+ if 'ocr_contents' in result:
1117
+ # Check for has_images in the result
1118
+ has_images = result.get('has_images', False)
1119
+
1120
+ # Create tabs for different views
1121
+ if has_images:
1122
+ view_tab1, view_tab2, view_tab3 = st.tabs(["Structured View", "Raw JSON", "With Images"])
1123
+ else:
1124
+ view_tab1, view_tab2 = st.tabs(["Structured View", "Raw JSON"])
1125
+
1126
+ with view_tab1:
1127
+ # Display in a more user-friendly format based on the content structure
1128
+ html_content = ""
1129
+ if isinstance(result['ocr_contents'], dict):
1130
+ for section, content in result['ocr_contents'].items():
1131
+ if content: # Only display non-empty sections
1132
+ # Add consistent styling for each section
1133
+ section_title = f'<h4 style="font-family: Georgia, serif; font-size: 18px; margin-top: 20px; margin-bottom: 10px;">{section.replace("_", " ").title()}</h4>'
1134
+ html_content += section_title
1135
+
1136
+ if isinstance(content, str):
1137
+ # Optimize by using an expander for very long content
1138
+ if len(content) > 1000:
1139
+ # Format content for long text - bold everything after "... that"
1140
+ preview_content = content[:1000] + "..."
1141
+
1142
+ if "... that" in content:
1143
+ # For the preview (first 1000 chars)
1144
+ if "... that" in preview_content:
1145
+ parts = preview_content.split("... that", 1)
1146
+ formatted_preview = f"{parts[0]}... that<strong>{parts[1]}</strong>"
1147
+ html_content += f"<p style=\"font-size:16px;\">{formatted_preview}</p>"
1148
+ else:
1149
+ html_content += f"<p style=\"font-size:16px; font-weight:normal;\">{preview_content}</p>"
1150
+
1151
+ # For the full content in expander
1152
+ parts = content.split("... that", 1)
1153
+ formatted_full = f"{parts[0]}... that**{parts[1]}**"
1154
+
1155
+ st.markdown(f"#### {section.replace('_', ' ').title()}")
1156
+ with st.expander("Show full content"):
1157
+ st.markdown(formatted_full)
1158
+ else:
1159
+ html_content += f"<p style=\"font-size:16px; font-weight:normal;\">{preview_content}</p>"
1160
+ st.markdown(f"#### {section.replace('_', ' ').title()}")
1161
+ with st.expander("Show full content"):
1162
+ st.write(content)
1163
+ else:
1164
+ # Format content - bold everything after "... that"
1165
+ if "... that" in content:
1166
+ parts = content.split("... that", 1)
1167
+ formatted_content = f"{parts[0]}... that<strong>{parts[1]}</strong>"
1168
+ html_content += f"<p style=\"font-size:16px;\">{formatted_content}</p>"
1169
+ st.markdown(f"#### {section.replace('_', ' ').title()}")
1170
+ st.markdown(f"{parts[0]}... that**{parts[1]}**")
1171
+ else:
1172
+ html_content += f"<p style=\"font-size:16px; font-weight:normal;\">{content}</p>"
1173
+ st.markdown(f"#### {section.replace('_', ' ').title()}")
1174
+ st.write(content)
1175
+ elif isinstance(content, list):
1176
+ html_list = "<ul>"
1177
+ st.markdown(f"#### {section.replace('_', ' ').title()}")
1178
+ # Limit display for very long lists
1179
+ if len(content) > 20:
1180
+ with st.expander(f"Show all {len(content)} items"):
1181
+ for item in content:
1182
+ if isinstance(item, str):
1183
+ html_list += f"<li>{item}</li>"
1184
+ st.write(f"- {item}")
1185
+ elif isinstance(item, dict):
1186
+ st.json(item)
1187
+ else:
1188
+ for item in content:
1189
+ if isinstance(item, str):
1190
+ html_list += f"<li>{item}</li>"
1191
+ st.write(f"- {item}")
1192
+ elif isinstance(item, dict):
1193
+ st.json(item)
1194
+ html_list += "</ul>"
1195
+ html_content += html_list
1196
+ elif isinstance(content, dict):
1197
+ html_dict = "<dl>"
1198
+ st.markdown(f"#### {section.replace('_', ' ').title()}")
1199
+ for k, v in content.items():
1200
+ html_dict += f"<dt>{k}</dt><dd>{v}</dd>"
1201
+ st.write(f"**{k}:** {v}")
1202
+ html_dict += "</dl>"
1203
+ html_content += html_dict
1204
+
1205
+ # Add download button in a smaller section
1206
+ with st.expander("Export Content"):
1207
+ # Get original filename without extension
1208
+ original_name = Path(result.get('file_name', uploaded_file.name)).stem
1209
+ # HTML download button
1210
+ html_bytes = html_content.encode()
1211
+ st.download_button(
1212
+ label="Download as HTML",
1213
+ data=html_bytes,
1214
+ file_name=f"{original_name}_processed.html",
1215
+ mime="text/html"
1216
+ )
1217
+
1218
+ with view_tab2:
1219
+ # Show the raw JSON for developers, with an expander for large results
1220
+ if len(json.dumps(result)) > 5000:
1221
+ with st.expander("View full JSON"):
1222
+ st.json(result)
1223
+ else:
1224
+ st.json(result)
1225
+
1226
+ if has_images and 'pages_data' in result:
1227
+ with view_tab3:
1228
+ # Use pages_data directly instead of raw_response
1229
+ try:
1230
+ # Use the serialized pages data
1231
+ pages_data = result.get('pages_data', [])
1232
+ if not pages_data:
1233
+ st.warning("No image data found in the document.")
1234
+ st.stop()
1235
+
1236
+ # Construct markdown from pages_data directly
1237
+ from ocr_utils import replace_images_in_markdown
1238
+ combined_markdown = ""
1239
+
1240
+ for page in pages_data:
1241
+ page_markdown = page.get('markdown', '')
1242
+ images = page.get('images', [])
1243
+
1244
+ # Create image dictionary
1245
+ image_dict = {}
1246
+ for img in images:
1247
+ if 'id' in img and 'image_base64' in img:
1248
+ image_dict[img['id']] = img['image_base64']
1249
+
1250
+ # Replace image references in markdown
1251
+ if page_markdown and image_dict:
1252
+ page_markdown = replace_images_in_markdown(page_markdown, image_dict)
1253
+ combined_markdown += page_markdown + "\n\n---\n\n"
1254
+
1255
+ if not combined_markdown:
1256
+ st.warning("No content with images found.")
1257
+ st.stop()
1258
+
1259
+ # Add CSS for better image handling
1260
+ st.markdown("""
1261
+ <style>
1262
+ .image-container {
1263
+ margin: 20px 0;
1264
+ text-align: center;
1265
+ }
1266
+ .markdown-text-container {
1267
+ padding: 10px;
1268
+ background-color: #f9f9f9;
1269
+ border-radius: 5px;
1270
+ }
1271
+ .markdown-text-container img {
1272
+ margin: 15px auto;
1273
+ max-width: 90%;
1274
+ max-height: 500px;
1275
+ object-fit: contain;
1276
+ border: 1px solid #ddd;
1277
+ border-radius: 4px;
1278
+ display: block;
1279
+ }
1280
+ .markdown-text-container p {
1281
+ margin-bottom: 16px;
1282
+ line-height: 1.6;
1283
+ font-family: Georgia, serif;
1284
+ }
1285
+ .page-break {
1286
+ border-top: 1px solid #ddd;
1287
+ margin: 20px 0;
1288
+ padding-top: 20px;
1289
+ }
1290
+ .page-text-content {
1291
+ margin-bottom: 20px;
1292
+ }
1293
+ .text-block {
1294
+ background-color: #fff;
1295
+ padding: 15px;
1296
+ border-radius: 4px;
1297
+ border-left: 3px solid #546e7a;
1298
+ margin-bottom: 15px;
1299
+ color: #333;
1300
+ }
1301
+ .text-block p {
1302
+ margin: 8px 0;
1303
+ color: #333;
1304
+ }
1305
+ </style>
1306
+ """, unsafe_allow_html=True)
1307
+
1308
+ # Process and display content with images properly
1309
+ import re
1310
+
1311
+ # Process each page separately
1312
+ pages_content = []
1313
+
1314
+ # Check if this is from a PDF processed through pdf2image
1315
+ is_pdf2image = result.get('pdf_processing_method') == 'pdf2image'
1316
+
1317
+ for i, page in enumerate(pages_data):
1318
+ page_markdown = page.get('markdown', '')
1319
+ images = page.get('images', [])
1320
+
1321
+ if not page_markdown:
1322
+ continue
1323
+
1324
+ # Create image dictionary
1325
+ image_dict = {}
1326
+ for img in images:
1327
+ if 'id' in img and 'image_base64' in img:
1328
+ image_dict[img['id']] = img['image_base64']
1329
+
1330
+ # Create HTML content for this page
1331
+ page_html = f"<h3>Page {i+1}</h3>" if i > 0 else ""
1332
+
1333
+ # Display the raw text content first to ensure it's visible
1334
+ page_html += "<div class='page-text-content'>"
1335
+
1336
+ # Special handling for PDF2image processed documents
1337
+ if is_pdf2image and i == 0 and 'ocr_contents' in result:
1338
+ # Display all structured content from OCR for PDFs
1339
+ page_html += "<div class='text-block pdf-content'>"
1340
+
1341
+ # Check if custom prompt was applied
1342
+ if result.get('custom_prompt_applied') == 'text_only':
1343
+ page_html += "<div class='prompt-info'><i>Custom analysis applied using text-only processing</i></div>"
1344
+
1345
+ ocr_contents = result.get('ocr_contents', {})
1346
+ # Get a sorted list of sections to ensure consistent order
1347
+ section_keys = sorted(ocr_contents.keys())
1348
+
1349
+ # Place important sections first
1350
+ priority_sections = ['title', 'subtitle', 'header', 'publication', 'date', 'content', 'main_text']
1351
+ for important in reversed(priority_sections):  # reversed so 'title' ends up first after repeated insert(0)
1352
+ if important in ocr_contents and important in section_keys:
1353
+ section_keys.remove(important)
1354
+ section_keys.insert(0, important)
1355
+
1356
+ for section in section_keys:
1357
+ content = ocr_contents[section]
1358
+ if section in ['raw_text', 'error', 'partial_text']:
1359
+ continue # Skip these fields
1360
+
1361
+ section_title = section.replace('_', ' ').title()
1362
+ page_html += f"<h4>{section_title}</h4>"
1363
+
1364
+ if isinstance(content, str):
1365
+ # Convert newlines to <br> tags
1366
+ content_html = content.replace('\n', '<br>')
1367
+ page_html += f"<p>{content_html}</p>"
1368
+ elif isinstance(content, list):
1369
+ page_html += "<ul>"
1370
+ for item in content:
1371
+ if isinstance(item, str):
1372
+ page_html += f"<li>{item}</li>"
1373
+ elif isinstance(item, dict):
1374
+ page_html += "<li>"
1375
+ for k, v in item.items():
1376
+ page_html += f"<strong>{k}:</strong> {v}<br>"
1377
+ page_html += "</li>"
1378
+ else:
1379
+ page_html += f"<li>{str(item)}</li>"
1380
+ page_html += "</ul>"
1381
+ elif isinstance(content, dict):
1382
+ for k, v in content.items():
1383
+ if isinstance(v, str):
1384
+ page_html += f"<p><strong>{k}:</strong> {v}</p>"
1385
+ elif isinstance(v, list):
1386
+ page_html += f"<p><strong>{k}:</strong></p><ul>"
1387
+ for item in v:
1388
+ page_html += f"<li>{item}</li>"
1389
+ page_html += "</ul>"
1390
+ else:
1391
+ page_html += f"<p><strong>{k}:</strong> {str(v)}</p>"
1392
+
1393
+ page_html += "</div>"
1394
+ else:
1395
+ # Standard processing for regular documents
1396
+ # Get all text content that isn't an image and add it first
1397
+ text_content = []
1398
+ for line in page_markdown.split("\n"):
1399
+ if not re.search(r'!\[(.*?)\]\((.*?)\)', line) and line.strip():
1400
+ text_content.append(line)
1401
+
1402
+ # Add the text content as a block
1403
+ if text_content:
1404
+ page_html += "<div class='text-block'>"
1405
+ for line in text_content:
1406
+ page_html += f"<p>{line}</p>"
1407
+ page_html += "</div>"
1408
+
1409
+ page_html += "</div>"
1410
+
1411
+ # Then add images separately
1412
+ for line in page_markdown.split("\n"):
1413
+ # Handle image lines
1414
+ img_match = re.search(r'!\[(.*?)\]\((.*?)\)', line)
1415
+ if img_match:
1416
+ alt_text = img_match.group(1)
1417
+ img_ref = img_match.group(2)
1418
+
1419
+ # Get the base64 data for this image ID
1420
+ img_data = image_dict.get(img_ref, "")
1421
+ if img_data:
1422
+ img_html = f'<div class="image-container"><img src="{img_data}" alt="{alt_text}"></div>'
1423
+ page_html += img_html
1424
+
1425
+ # Add page separator if not the last page
1426
+ if i < len(pages_data) - 1:
1427
+ page_html += '<div class="page-break"></div>'
1428
+
1429
+ pages_content.append(page_html)
1430
+
1431
+ # Combine all pages HTML
1432
+ html_content = "\n".join(pages_content)
1433
+
1434
+ # Wrap the content in a div with the class for styling
1435
+ st.markdown(f"""
1436
+ <div class="markdown-text-container">
1437
+ {html_content}
1438
+ </div>
1439
+ """, unsafe_allow_html=True)
1440
+
1441
+ # Create download HTML content
1442
+ download_html = f"""
1443
+ <html>
1444
+ <head>
1445
+ <style>
1446
+ body {{
1447
+ font-family: Georgia, serif;
1448
+ line-height: 1.7;
1449
+ margin: 0 auto;
1450
+ max-width: 800px;
1451
+ padding: 20px;
1452
+ }}
1453
+ img {{
1454
+ max-width: 90%;
1455
+ max-height: 500px;
1456
+ object-fit: contain;
1457
+ margin: 20px auto;
1458
+ display: block;
1459
+ border: 1px solid #ddd;
1460
+ border-radius: 4px;
1461
+ }}
1462
+ .image-container {{
1463
+ margin: 20px 0;
1464
+ text-align: center;
1465
+ }}
1466
+ .page-break {{
1467
+ border-top: 1px solid #ddd;
1468
+ margin: 40px 0;
1469
+ padding-top: 40px;
1470
+ }}
1471
+ h3 {{
1472
+ color: #333;
1473
+ border-bottom: 1px solid #eee;
1474
+ padding-bottom: 10px;
1475
+ }}
1476
+ p {{
1477
+ margin: 12px 0;
1478
+ }}
1479
+ .page-text-content {{
1480
+ margin-bottom: 20px;
1481
+ }}
1482
+ .text-block {{
1483
+ background-color: #f9f9f9;
1484
+ padding: 15px;
1485
+ border-radius: 4px;
1486
+ border-left: 3px solid #546e7a;
1487
+ margin-bottom: 15px;
1488
+ color: #333;
1489
+ }}
1490
+ .text-block p {{
1491
+ margin: 8px 0;
1492
+ color: #333;
1493
+ }}
1494
+ </style>
1495
+ </head>
1496
+ <body>
1497
+ <div class="markdown-text-container">
1498
+ {html_content}
1499
+ </div>
1500
+ </body>
1501
+ </html>
1502
+ """
1503
+
1504
+ # Get original filename without extension
1505
+ original_name = Path(result.get('file_name', uploaded_file.name)).stem
1506
+
1507
+ # Add download button as an expander to prevent page reset
1508
+ with st.expander("Download Document with Images"):
1509
+ st.markdown("Click the button below to download the document with embedded images")
1510
+ st.download_button(
1511
+ label="Download as HTML",
1512
+ data=download_html,
1513
+ file_name=f"{original_name}_with_images.html",
1514
+ mime="text/html",
1515
+ key="download_with_images_button"
1516
+ )
1517
+
1518
+ except Exception as e:
1519
+ st.error(f"Could not display document with images: {str(e)}")
1520
+ st.info("Try refreshing or processing the document again.")
1521
+
1522
+ if 'ocr_contents' not in result:
1523
+ st.error("No OCR content was extracted from the document.")
1524
+
1525
+ # Close document content div
1526
+ st.markdown('</div>', unsafe_allow_html=True)
1527
+
1528
+ # Add Document Metadata in the left column placeholder
1529
+ with metadata_placeholder.container():
1530
+ st.subheader("Document Metadata")
1531
+ st.success("**Document processed successfully**")
1532
+
1533
+ # Display file info
1534
+ st.write(f"**File Name:** {result.get('file_name', uploaded_file.name)}")
1535
+
1536
+ # Display info if only limited pages were processed
1537
+ if 'limited_pages' in result:
1538
+ st.info(f"Processed {result['limited_pages']['processed']} of {result['limited_pages']['total']} pages")
1539
+
1540
+ # Display languages if available
1541
+ if 'languages' in result:
1542
+ languages = [lang for lang in result['languages'] if lang is not None]
1543
+ if languages:
1544
+ st.write(f"**Languages:** {', '.join(languages)}")
1545
+
1546
+ # Display topics if available
1547
+ if 'topics' in result and result['topics']:
1548
+ st.write(f"**Topics:** {', '.join(result['topics'])}")
1549
+
1550
+ # Processing time if available
1551
+ if 'processing_time' in result:
1552
+ proc_time = result['processing_time']
1553
+ st.write(f"**Processing Time:** {proc_time:.1f}s")
1554
+
1555
+ # Store the result in the previous results list
1556
+ # Add timestamp to result for history tracking
1557
+ result_copy = result.copy()
1558
+ result_copy['timestamp'] = datetime.now().strftime("%Y-%m-%d %H:%M")
1559
+
1560
+ # Add to session state, keeping the most recent 20 results
1561
+ st.session_state.previous_results.insert(0, result_copy)
1562
+ if len(st.session_state.previous_results) > 20:
1563
+ st.session_state.previous_results = st.session_state.previous_results[:20]
1564
+
1565
+ except Exception as e:
1566
+ st.error(f"Error processing document: {str(e)}")
1567
+ else:
1568
+ # Display basic info when no file is uploaded
1569
+ st.markdown('<div style="text-align: left; width: auto; display: inline-block;">Upload a document to get started using the file uploader above.</div>', unsafe_allow_html=True)
1570
+
1571
+ # Show example images in a grid
1572
+ st.subheader("Example Documents")
1573
+
1574
+ # Add a sample images container
1575
+ with st.container():
1576
+ # Find sample images from the input directory to display
1577
+ input_dir = Path(__file__).parent / "input"
1578
+ sample_images = []
1579
+ backup_dir = Path(__file__).parent / "backup" / "input"
1580
+
1581
+ if input_dir.exists():
1582
+ # Define images in specific order per requirements
1583
+ ordered_sample_images = []
1584
+
1585
+ # Define ordered list: magellan, americae, handwritten letter, milgram flier, recipe, magician
1586
+ ordered_image_names = [
1587
+ "magellan-travels.jpg",
1588
+ "americae-retectio.jpg",
1589
+ "handwritten-letter.jpg",
1590
+ "milgram-flier.png",
1591
+ "recipe.jpg",
1592
+ "The Magician, or Bottle Cungerer.jpeg"
1593
+ ]
1594
+
1595
+ # Create the image list in the desired order
1596
+ for img_name in ordered_image_names:
1597
+ img_path = input_dir / img_name
1598
+ if img_path.exists():
1599
+ ordered_sample_images.append(img_path)
1600
+
1601
+ # Organize for display: first 3 in top row, next 3 in bottom row
1602
+ sample_images = ordered_sample_images
1603
+
1604
+ # If we don't have enough samples, fill in with other available images
1605
+ if len(sample_images) < 6:
1606
+ # Get all remaining images from input directory
1607
+ all_images = set(
1608
+ list(input_dir.glob("*.jpg")) +
1609
+ list(input_dir.glob("*.jpeg")) +
1610
+ list(input_dir.glob("*.png")) +
1611
+ list(input_dir.glob("*.tif"))
1612
+ )
1613
+
1614
+ # Remove the already selected images
1615
+ remaining_images = [img for img in all_images if img not in sample_images]
1616
+
1617
+ # Add remaining images to fill the grid
1618
+ sample_images.extend(remaining_images[:6-len(sample_images)])
1619
+
1620
+ # If still not enough, try backup directory
1621
+ if len(sample_images) < 6 and backup_dir.exists():
1622
+ remaining = 6 - len(sample_images)
1623
+ backup_samples = (
1624
+ list(backup_dir.glob("*.jpg")) +
1625
+ list(backup_dir.glob("*.jpeg")) +
1626
+ list(backup_dir.glob("*.png"))
1627
+ )[:remaining]
1628
+ sample_images.extend(backup_samples)
1629
+
1630
+ if sample_images:
1631
+ # Create two rows of 3 columns each for the 6 examples
1632
+ if len(sample_images) > 3:
1633
+ # First row
1634
+ columns1 = st.columns(3)
1635
+ for i, img_path in enumerate(sample_images[:3]):
1636
+ with columns1[i]:
1637
+ if img_path.suffix.lower() in ['.jpg', '.jpeg', '.png', '.tif']:
1638
+ try:
1639
+ st.image(str(img_path), caption=img_path.name, width=300)
1640
+ except Exception:
1641
+ st.info(f"Example: {img_path.name}")
1642
+ else:
1643
+ # For PDFs, show an icon or info message
1644
+ st.info(f"PDF Example: {img_path.name}")
1645
+
1646
+ # Second row
1647
+ columns2 = st.columns(3)
1648
+ for i, img_path in enumerate(sample_images[3:6]):
1649
+ with columns2[i]:
1650
+ if img_path.suffix.lower() in ['.jpg', '.jpeg', '.png', '.tif']:
1651
+ try:
1652
+ st.image(str(img_path), caption=img_path.name, width=300)
1653
+ except Exception:
1654
+ st.info(f"Example: {img_path.name}")
1655
+ else:
1656
+ # For PDFs, show an icon or info message
1657
+ st.info(f"PDF Example: {img_path.name}")
1658
+ else:
1659
+ # If we have 3 or fewer samples, just use one row
1660
+ columns = st.columns(min(3, len(sample_images)))
1661
+ for i, img_path in enumerate(sample_images):
1662
+ with columns[i % len(columns)]:
1663
+ if img_path.suffix.lower() in ['.jpg', '.jpeg', '.png', '.tif']:
1664
+ try:
1665
+ st.image(str(img_path), caption=img_path.name, width=300)
1666
+ except Exception:
1667
+ st.info(f"Example: {img_path.name}")
1668
+ else:
1669
+ # For PDFs, show an icon or info message
1670
+ st.info(f"PDF Example: {img_path.name}")
1671
+ else:
1672
+ st.info("No example documents found. Upload your own document to get started.")
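Review note: the example gallery above prefers a fixed list of file names and then tops up from whatever else is in `input/`. The sketch below restates that selection as a pure function so it can be tested outside Streamlit. The names `pick_samples` and `rows_of` are hypothetical and not part of this commit. One deliberate difference: the leftovers are sorted, because the original collects them in a `set`, whose iteration order is not guaranteed.

```python
from pathlib import Path
from typing import Iterable, List, Sequence

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif"}


def pick_samples(available: Iterable[Path],
                 preferred_names: Sequence[str],
                 limit: int = 6) -> List[Path]:
    """Return up to `limit` paths: preferred names first, then sorted leftovers."""
    images = [p for p in available if p.suffix.lower() in IMAGE_SUFFIXES]
    by_name = {p.name: p for p in images}

    # Preferred files, in the order given, skipping any that are missing
    chosen = [by_name[name] for name in preferred_names if name in by_name]

    # Top up with the remaining images in a deterministic (sorted) order
    leftovers = sorted(p for p in images if p not in chosen)
    chosen.extend(leftovers[:max(0, limit - len(chosen))])
    return chosen[:limit]


def rows_of(items: Sequence[Path], per_row: int = 3) -> List[List[Path]]:
    """Split the selection into rows for a column grid."""
    return [list(items[i:i + per_row]) for i in range(0, len(items), per_row)]
```

With a helper like `rows_of`, the three near-identical rendering loops could in principle collapse into one loop over rows, each row calling `st.columns(len(row))`.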
config.py ADDED
@@ -0,0 +1,57 @@
1
+ # config.py
2
+ """
3
+ Configuration file for Mistral OCR processing.
4
+ Contains API key and other settings.
5
+ """
6
+ import os
7
+ import logging
8
+ from dotenv import load_dotenv
9
+
10
+ # Configure logging
11
+ logger = logging.getLogger("config")
12
+
13
+ # Load environment variables from .env file if it exists
14
+ load_dotenv()
15
+
16
+ # Mistral API key handling - get from Hugging Face secrets or environment variable
17
+ # The priority order is:
18
+ # 1. HF_MISTRAL_API_KEY environment var (for Hugging Face deployment)
19
+ # 2. MISTRAL_API_KEY environment var (standard environment variable)
20
+ # 3. Empty string (will show warning in app)
21
+ MISTRAL_API_KEY = (os.environ.get("HF_MISTRAL_API_KEY")
22
+                    or os.environ.get("MISTRAL_API_KEY", "")).strip()
23
+
24
+ # Check if we're in test mode (allows operation without valid API key)
25
+ TEST_MODE = False # Disable test mode for production use
26
+
27
+ # Just check if API key exists
28
+ if not MISTRAL_API_KEY and not TEST_MODE:
29
+ logger.warning("No Mistral API key found. OCR functionality will not work unless TEST_MODE is enabled.")
30
+
31
+ if TEST_MODE:
32
+ logger.info("TEST_MODE is enabled. Using mock responses instead of actual API calls.")
33
+
34
+ # Model settings with fallbacks
35
+ OCR_MODEL = os.environ.get("MISTRAL_OCR_MODEL", "mistral-ocr-latest")
36
+ TEXT_MODEL = os.environ.get("MISTRAL_TEXT_MODEL", "mistral-small-latest") # Updated from ministral-8b-latest
37
+ VISION_MODEL = os.environ.get("MISTRAL_VISION_MODEL", "mistral-large-latest") # Updated from pixtral-12b-latest
38
+
39
+ # Image preprocessing settings optimized for historical documents
40
+ # These can be customized from environment variables
41
+ IMAGE_PREPROCESSING = {
42
+ "enhance_contrast": float(os.environ.get("ENHANCE_CONTRAST", "1.8")), # Increased contrast for better text recognition
43
+ "sharpen": os.environ.get("SHARPEN", "True").lower() in ("true", "1", "yes"),
44
+ "denoise": os.environ.get("DENOISE", "True").lower() in ("true", "1", "yes"),
45
+ "max_size_mb": float(os.environ.get("MAX_IMAGE_SIZE_MB", "12.0")), # Increased size limit for better quality
46
+ "target_dpi": int(os.environ.get("TARGET_DPI", "300")), # Target DPI for scaling
47
+ "compression_quality": int(os.environ.get("COMPRESSION_QUALITY", "95")) # Higher quality for better OCR results
48
+ }
49
+
50
+ # OCR settings optimized for reliability and performance
51
+ OCR_SETTINGS = {
52
+ "timeout_ms": int(os.environ.get("OCR_TIMEOUT_MS", "120000")), # Extended timeout for larger documents
53
+ "max_retries": int(os.environ.get("OCR_MAX_RETRIES", "3")), # Increased retry attempts for better reliability
54
+ "retry_delay": int(os.environ.get("OCR_RETRY_DELAY", "2")), # Longer initial retry delay for better success rate
55
+ "include_image_base64": os.environ.get("INCLUDE_IMAGE_BASE64", "True").lower() in ("true", "1", "yes"),
56
+ "thread_count": int(os.environ.get("OCR_THREAD_COUNT", "4")) # Thread count for parallel processing
57
+ }
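Review note: the settings above call `int(...)` and `float(...)` directly on environment strings, so a malformed value such as `OCR_MAX_RETRIES=three` would raise at import time. Below is a minimal sketch of more forgiving parsing, assuming the same truthy spellings (`true`, `1`, `yes`) used in this file. The helper names `env_flag`, `env_int`, and `first_non_empty` are hypothetical and do not exist in this commit.

```python
import os
from typing import Mapping, Optional, Sequence

TRUTHY = {"true", "1", "yes"}


def env_flag(name: str, default: bool,
             env: Optional[Mapping[str, str]] = None) -> bool:
    """Parse a boolean setting; unset or blank values fall back to `default`."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None or not raw.strip():
        return default
    return raw.strip().lower() in TRUTHY


def env_int(name: str, default: int,
            env: Optional[Mapping[str, str]] = None) -> int:
    """Parse an integer setting; unset or malformed values fall back to `default`."""
    env = os.environ if env is None else env
    try:
        return int(env.get(name, "").strip())
    except ValueError:
        return default


def first_non_empty(names: Sequence[str],
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Return the first variable in `names` with a non-blank value, else ''."""
    env = os.environ if env is None else env
    for name in names:
        value = env.get(name, "").strip()
        if value:
            return value
    return ""
```

Passing `env` explicitly keeps the helpers testable without mutating `os.environ`; in the config module they would simply be called with the default.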
input/The Magician, or Bottle Cungerer.jpeg ADDED

Git LFS Details

  • SHA256: 3becaf6f5548a794436864885bb125f3fa09f1e6f7bdd76e8878f2d36ff26232
  • Pointer size: 132 Bytes
  • Size of remote file: 2.96 MB
input/americae-retectio.jpg ADDED

Git LFS Details

  • SHA256: 3ea42f6d3f7c0331a08321c26978c9011843965de99735a178de8167fdede544
  • Pointer size: 131 Bytes
  • Size of remote file: 452 kB
input/handwritten-letter.jpg ADDED

Git LFS Details

  • SHA256: 7fe2d81bb4e8bef7cdbf87c58a8cc180c49c313e5099de167ae37bbbfb895e88
  • Pointer size: 131 Bytes
  • Size of remote file: 231 kB
input/harpers.pdf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3c9030714b07bb5f7c9adf8b175975baa9b4f40402da62d69cad9b0d4ba61b94
3
+ size 14931299
input/magellan-travels.jpg ADDED

Git LFS Details

  • SHA256: ae3e860789e2c3c8032499e5326864294dbc1b01059169fd08203c980577010b
  • Pointer size: 131 Bytes
  • Size of remote file: 283 kB
input/milgram-flier.png ADDED

Git LFS Details

  • SHA256: 0e1ca2821304427dcf7e2c9e0a03de880f44146bf8fa6abc9a437249fda85486
  • Pointer size: 130 Bytes
  • Size of remote file: 88.5 kB
input/recipe.jpg ADDED

Git LFS Details

  • SHA256: 8bdb2a05dee10e4e181d8636714915f3055c664297e512f805fea180446624b2
  • Pointer size: 130 Bytes
  • Size of remote file: 70.8 kB
ocr_utils.py ADDED
@@ -0,0 +1,1255 @@
1
+ """
2
+ Utility functions for OCR processing with Mistral AI.
3
+ Contains helper functions for working with OCR responses and image handling.
4
+ """
5
+
6
+ import json
7
+ import base64
8
+ import io
9
+ import zipfile
10
+ import logging
+ import time
11
+ import numpy as np
12
+ from datetime import datetime
13
+ from pathlib import Path
14
+ from typing import Dict, List, Optional, Union, Any, Tuple
15
+ from functools import lru_cache
16
+
17
+ # Configure logging
18
+ logger = logging.getLogger("ocr_utils")
19
+
20
+ # Probe each optional image library separately so both flags are always defined
21
+ try:
22
+     from PIL import Image, ImageEnhance, ImageFilter, ImageOps
23
+     PILLOW_AVAILABLE = True
24
+ except ImportError:
25
+     PILLOW_AVAILABLE = False
26
+ try:
27
+     import cv2
28
+     CV2_AVAILABLE = True
29
+ except ImportError:
30
+     CV2_AVAILABLE = False
31
+
32
+ from mistralai import DocumentURLChunk, ImageURLChunk, TextChunk
33
+
34
+ # Import configuration
35
+ try:
36
+ from config import IMAGE_PREPROCESSING
37
+ except ImportError:
38
+ # Fallback defaults if config not available
39
+ IMAGE_PREPROCESSING = {
40
+ "enhance_contrast": 1.5,
41
+ "sharpen": True,
42
+ "denoise": True,
43
+ "max_size_mb": 8.0,
44
+ "target_dpi": 300,
45
+ "compression_quality": 92
46
+ }
47
+
48
+ def replace_images_in_markdown(markdown_str: str, images_dict: dict) -> str:
49
+ """
50
+ Replace image placeholders in markdown with base64-encoded images.
51
+
52
+ Args:
53
+ markdown_str: Markdown text containing image placeholders
54
+ images_dict: Dictionary mapping image IDs to base64 strings
55
+
56
+ Returns:
57
+ Markdown text with images replaced by base64 data
58
+ """
59
+ for img_name, base64_str in images_dict.items():
60
+ markdown_str = markdown_str.replace(
61
+ f"![{img_name}]({img_name})", f"![{img_name}]({base64_str})"
62
+ )
63
+ return markdown_str
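Review note: `replace_images_in_markdown` only rewrites references whose alt text equals the image ID, i.e. `![id](id)`. If the OCR markdown ever carries a different alt text, the placeholder would be left as-is. The sketch below is an alternative, regex-based variant that matches on the link target alone; `inline_images` and `IMAGE_LINK` are hypothetical names, and this is not the behavior of the committed function.

```python
import re
from typing import Dict

# Matches ![alt](ref) where ref contains no whitespace or closing parenthesis
IMAGE_LINK = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<ref>[^)\s]+)\)")


def inline_images(markdown: str, images: Dict[str, str]) -> str:
    """Swap image references for data URLs, leaving unknown references untouched."""
    def swap(match):
        data_url = images.get(match.group("ref"))
        if data_url is None:
            return match.group(0)  # keep the original placeholder
        return f"![{match.group('alt')}]({data_url})"

    return IMAGE_LINK.sub(swap, markdown)
```

A single pass with a callback also avoids re-scanning the whole string once per image, which may matter for pages with many embedded figures.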
64
+
65
+ def get_combined_markdown(ocr_response) -> str:
66
+ """
67
+ Combine OCR text and images into a single markdown document.
68
+
69
+ Args:
70
+ ocr_response: OCR response object from Mistral AI
71
+
72
+ Returns:
73
+ Combined markdown string with embedded images
74
+ """
75
+ markdowns = []
76
+
77
+ # Process each page of the OCR response
78
+ for page in ocr_response.pages:
79
+ # Extract image data if available
80
+ image_data = {}
81
+ if hasattr(page, "images"):
82
+ for img in page.images:
83
+ if hasattr(img, "id") and hasattr(img, "image_base64"):
84
+ image_data[img.id] = img.image_base64
85
+
86
+ # Replace image placeholders with base64 data
87
+ page_markdown = page.markdown if hasattr(page, "markdown") else ""
88
+ processed_markdown = replace_images_in_markdown(page_markdown, image_data)
89
+ markdowns.append(processed_markdown)
90
+
91
+ # Join all pages' markdown with double newlines
92
+ return "\n\n".join(markdowns)
93
+
94
+ def encode_image_for_api(image_path: Union[str, Path]) -> str:
95
+ """
96
+ Encode an image as base64 data URL for API submission.
97
+
98
+ Args:
99
+ image_path: Path to the image file
100
+
101
+ Returns:
102
+ Base64 data URL for the image
103
+ """
104
+ # Convert to Path object if string
105
+ image_file = Path(image_path) if isinstance(image_path, str) else image_path
106
+
107
+ # Verify image exists
108
+ if not image_file.is_file():
109
+ raise FileNotFoundError(f"Image file not found: {image_file}")
110
+
111
+ # Encode image as base64
112
+ encoded = base64.b64encode(image_file.read_bytes()).decode()
113
+ return f"data:image/jpeg;base64,{encoded}"
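Review note: `encode_image_for_api` labels every file as `image/jpeg`, including PNG samples such as `milgram-flier.png`. Whether the OCR endpoint cares about the declared type is not established here, so treat the following as a cautious sketch rather than a required fix. It uses the standard-library `mimetypes` module to pick the type from the file extension; `to_data_url` and its `fallback_mime` parameter are hypothetical names.

```python
import base64
import mimetypes
from pathlib import Path
from typing import Union


def to_data_url(image_path: Union[str, Path],
                fallback_mime: str = "image/jpeg") -> str:
    """Encode a file as a base64 data URL with a MIME type guessed from its name."""
    path = Path(image_path)
    if not path.is_file():
        raise FileNotFoundError(f"Image file not found: {path}")

    # guess_type looks only at the extension; fall back when it is unknown
    mime, _ = mimetypes.guess_type(path.name)
    if mime is None or not mime.startswith("image/"):
        mime = fallback_mime

    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```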
114
+
115
+ def process_image_with_ocr(client, image_path: Union[str, Path], model: str = "mistral-ocr-latest"):
116
+ """
117
+ Process an image with OCR and return the response.
118
+
119
+ Args:
120
+ client: Mistral AI client
121
+ image_path: Path to the image file
122
+ model: OCR model to use
123
+
124
+ Returns:
125
+ OCR response object
126
+ """
127
+ # Encode image as base64
128
+ base64_data_url = encode_image_for_api(image_path)
129
+
130
+ # Process image with OCR
131
+ image_response = client.ocr.process(
132
+ document=ImageURLChunk(image_url=base64_data_url),
133
+ model=model
134
+ )
135
+
136
+ return image_response
137
+
138
+ def ocr_response_to_json(ocr_response, indent: int = 4) -> str:
139
+ """
140
+ Convert OCR response to a formatted JSON string.
141
+
142
+ Args:
143
+ ocr_response: OCR response object
144
+ indent: Indentation level for JSON formatting
145
+
146
+ Returns:
147
+ Formatted JSON string
148
+ """
149
+ # Convert OCR response to a dictionary
150
+ response_dict = {
151
+ "text": ocr_response.text if hasattr(ocr_response, "text") else "",
152
+ "pages": []
153
+ }
154
+
155
+ # Process pages if available
156
+ if hasattr(ocr_response, "pages"):
157
+ for page in ocr_response.pages:
158
+ page_dict = {
159
+ "text": page.text if hasattr(page, "text") else "",
160
+ "markdown": page.markdown if hasattr(page, "markdown") else "",
161
+ "images": []
162
+ }
163
+
164
+ # Process images if available
165
+ if hasattr(page, "images"):
166
+ for img in page.images:
167
+ img_dict = {
168
+ "id": img.id if hasattr(img, "id") else "",
169
+ "base64": img.image_base64 if hasattr(img, "image_base64") else ""
170
+ }
171
+ page_dict["images"].append(img_dict)
172
+
173
+ response_dict["pages"].append(page_dict)
174
+
175
+ # Convert dictionary to JSON
176
+ return json.dumps(response_dict, indent=indent)
177
+
178
+ def create_results_zip_in_memory(results):
179
+ """
180
+ Create a zip file containing OCR results in memory.
181
+
182
+ Args:
183
+ results: Dictionary or list of OCR results
184
+
185
+ Returns:
186
+ Binary zip file data
187
+ """
188
+ # Create a BytesIO object
189
+ zip_buffer = io.BytesIO()
190
+
191
+ # Check if results is a list or a dictionary
192
+ is_list = isinstance(results, list)
193
+
194
+ # Create zip file in memory
195
+ with zipfile.ZipFile(zip_buffer, 'w', zipfile.ZIP_DEFLATED) as zipf:
196
+ if is_list:
197
+ # Handle list of results
198
+ for i, result in enumerate(results):
199
+ try:
200
+ # Add JSON results for each file
201
+ result_json = json.dumps(result, indent=2)
202
+ zipf.writestr(f"results_{i+1}.json", result_json)
203
+
204
+ # Add HTML content (generated from the result)
205
+ html_content = create_html_with_images(result)
206
+ filename = Path(result.get('file_name', f'document_{i+1}')).stem
207
+ zipf.writestr(f"{filename}_with_images.html", html_content)
208
+
209
+ # Add raw OCR text if available
210
+ if "ocr_contents" in result and "raw_text" in result["ocr_contents"]:
211
+ zipf.writestr(f"ocr_text_{i+1}.txt", result["ocr_contents"]["raw_text"])
212
+
213
+ # Add HTML visualization if available
214
+ if "html_visualization" in result:
215
+ zipf.writestr(f"visualization_{i+1}.html", result["html_visualization"])
216
+
217
+ # Add images if available (limit to conserve memory)
218
+ if "pages_data" in result:
219
+ for page_idx, page in enumerate(result["pages_data"]):
220
+ for img_idx, img in enumerate(page.get("images", [])[:3]): # Limit to first 3 images per page
221
+ img_base64 = img.get("image_base64", "")
222
+ if img_base64:
223
+ # Strip data URL prefix if present
224
+ if img_base64.startswith("data:image"):
225
+ img_base64 = img_base64.split(",", 1)[1]
226
+
227
+ # Decode base64 and add to zip
228
+ try:
229
+ img_data = base64.b64decode(img_base64)
230
+ zipf.writestr(f"images/result_{i+1}_page_{page_idx+1}_img_{img_idx+1}.jpg", img_data)
231
+ except Exception:
232
+ pass
233
+ except Exception:
234
+ # If any result fails, skip it and continue
235
+ continue
236
+ else:
237
+ # Handle single result
238
+ try:
239
+ # Add JSON results
240
+ results_json = json.dumps(results, indent=2)
241
+ zipf.writestr("results.json", results_json)
242
+
243
+ # Add HTML content
244
+ html_content = create_html_with_images(results)
245
+ filename = Path(results.get('file_name', 'document')).stem
246
+ zipf.writestr(f"{filename}_with_images.html", html_content)
247
+
248
+ # Add raw OCR text if available
249
+ if "ocr_contents" in results and "raw_text" in results["ocr_contents"]:
250
+ zipf.writestr("ocr_text.txt", results["ocr_contents"]["raw_text"])
251
+
252
+ # Add HTML visualization if available
253
+ if "html_visualization" in results:
254
+ zipf.writestr("visualization.html", results["html_visualization"])
255
+
256
+ # Add images if available
257
+ if "pages_data" in results:
258
+ for page_idx, page in enumerate(results["pages_data"]):
259
+ for img_idx, img in enumerate(page.get("images", [])):
260
+ img_base64 = img.get("image_base64", "")
261
+ if img_base64:
262
+ # Strip data URL prefix if present
263
+ if img_base64.startswith("data:image"):
264
+ img_base64 = img_base64.split(",", 1)[1]
265
+
266
+ # Decode base64 and add to zip
267
+ try:
268
+ img_data = base64.b64decode(img_base64)
269
+ zipf.writestr(f"images/page_{page_idx+1}_img_{img_idx+1}.jpg", img_data)
270
+ except Exception:
271
+ pass
272
+ except Exception:
273
+ # If processing fails, return empty zip
274
+ pass
275
+
276
+ # Seek to the beginning of the BytesIO object
277
+ zip_buffer.seek(0)
278
+
279
+ # Return the zip file bytes
280
+ return zip_buffer.getvalue()
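Review note: the core of `create_results_zip_in_memory` is "write JSON, strip any data-URL prefix, decode base64, add the image". The sketch below isolates that path for a single result dictionary, assuming the same `pages_data` / `images` / `image_base64` keys used above. `zip_result` and `strip_data_url` are hypothetical names. It catches only decode errors, so unrelated failures are not silently swallowed.

```python
import base64
import io
import json
import zipfile
from typing import Any, Dict


def strip_data_url(value: str) -> str:
    """Return the bare base64 payload of a data URL, or the input unchanged."""
    return value.partition(",")[2] if value.startswith("data:") else value


def zip_result(result: Dict[str, Any]) -> bytes:
    """Build a zip archive in memory from one OCR result dictionary."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("results.json", json.dumps(result, indent=2))
        for page_no, page in enumerate(result.get("pages_data", []), start=1):
            for img_no, img in enumerate(page.get("images", []), start=1):
                payload = strip_data_url(img.get("image_base64", ""))
                if not payload:
                    continue
                try:
                    data = base64.b64decode(payload, validate=True)
                except ValueError:  # binascii.Error is a ValueError subclass
                    continue  # skip images that are not valid base64
                zf.writestr(f"images/page_{page_no}_img_{img_no}.jpg", data)
    return buffer.getvalue()
```

The returned bytes can be passed straight to a download widget or written to disk, which is how `create_results_zip` uses the in-memory variant.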
281
+
282
+ def create_results_zip(results, output_dir=None, zip_name=None):
283
+ """
284
+ Create a zip file containing OCR results.
285
+
286
+ Args:
287
+ results: Dictionary or list of OCR results
288
+ output_dir: Optional output directory
289
+ zip_name: Optional zip file name
290
+
291
+ Returns:
292
+ Path to the created zip file
293
+ """
294
+ # Create temporary output directory if not provided
295
+ if output_dir is None:
296
+ output_dir = Path.cwd() / "output"
297
+ output_dir.mkdir(exist_ok=True)
298
+ else:
299
+ output_dir = Path(output_dir)
300
+ output_dir.mkdir(exist_ok=True)
301
+
302
+ # Check if results is a list or a dictionary
303
+ is_list = isinstance(results, list)
304
+
305
+ # Generate zip name if not provided
306
+ if zip_name is None:
307
+ if is_list:
308
+ # For list of results, use timestamp and generic name
309
+ timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
310
+ zip_name = f"ocr-results_{timestamp}.zip"
311
+ else:
312
+ # For single result, use original file's info
313
+ # Check if processed_at exists, otherwise use current timestamp
314
+ if "processed_at" in results:
315
+ timestamp = results.get("processed_at", "").replace(":", "-").replace(".", "-")
316
+ else:
317
+ timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
318
+ file_name = results.get("file_name", "ocr-results")
319
+ zip_name = f"{file_name}_{timestamp}.zip"
320
+
321
+ try:
322
+ # Get zip data in memory first
323
+ zip_data = create_results_zip_in_memory(results)
324
+
325
+ # Save to file
326
+ zip_path = output_dir / zip_name
327
+ with open(zip_path, 'wb') as f:
328
+ f.write(zip_data)
329
+
330
+ return zip_path
331
+ except Exception as e:
332
+ logger.warning(f"Could not build results archive ({e}); writing an empty zip as fallback")
333
+ zip_path = output_dir / zip_name
334
+ with zipfile.ZipFile(zip_path, 'w') as zipf:
335
+ zipf.writestr("info.txt", "Could not create complete archive")
336
+
337
+ return zip_path
338
+
339
+
340
+ # Advanced image preprocessing functions
341
+
342
+ def preprocess_image_for_ocr(image_path: Union[str, Path]) -> Tuple[Image.Image, str]:
343
+ """
344
+ Preprocess an image for optimal OCR performance with enhanced speed and memory optimization.
345
+
346
+ Args:
347
+ image_path: Path to the image file
348
+
349
+ Returns:
350
+ Tuple of (processed PIL Image, base64 string)
351
+ """
352
+ # Fast path: Skip all processing if PIL not available
353
+ if not PILLOW_AVAILABLE:
354
+ logger.info("PIL not available, skipping image preprocessing")
355
+ return None, encode_image_for_api(image_path)
356
+
357
+ # Convert to Path object if string
358
+ image_file = Path(image_path) if isinstance(image_path, str) else image_path
359
+
360
+ # Thread-safe caching with early exit for already processed images
361
+ try:
362
+ # Fast stat calls for file metadata - consolidate to reduce I/O
363
+ file_stat = image_file.stat()
364
+ file_size = file_stat.st_size
365
+ file_size_mb = file_size / (1024 * 1024)
366
+ mod_time = file_stat.st_mtime
367
+
368
+ # Create a cache key based on essential file properties
369
+ cache_key = f"{image_file.name}_{file_size}_{mod_time}"
370
+
371
+ # Fast path: Return cached result if available
372
+ if hasattr(preprocess_image_for_ocr, "_cache") and cache_key in preprocess_image_for_ocr._cache:
373
+ logger.debug(f"Using cached preprocessing result for {image_file.name}")
374
+ return preprocess_image_for_ocr._cache[cache_key]
375
+
376
+ # Optimization: Skip heavy processing for very small files
377
+ # Small images (less than 100KB) likely don't need preprocessing
378
+ if file_size < 100000: # 100KB
379
+ logger.info(f"Image {image_file.name} is small ({file_size/1024:.1f}KB), using minimal processing")
380
+ with Image.open(image_file) as img:
381
+ # Normalize mode only
382
+ if img.mode not in ('RGB', 'L'):
383
+ img = img.convert('RGB')
384
+
385
+ # Save with light optimization
386
+ buffer = io.BytesIO()
387
+ img.save(buffer, format="JPEG", quality=95, optimize=True)
388
+ buffer.seek(0)
389
+
390
+ # Get base64
391
+ encoded_image = base64.b64encode(buffer.getvalue()).decode()
392
+ base64_data_url = f"data:image/jpeg;base64,{encoded_image}"
393
+
394
+ # Cache and return
395
+ result = (img, base64_data_url)
396
+ if not hasattr(preprocess_image_for_ocr, "_cache"):
397
+ preprocess_image_for_ocr._cache = {}
398
+
399
+ # Clean cache if needed
400
+ if len(preprocess_image_for_ocr._cache) > 20: # Increased cache size for better performance
401
+ # Remove oldest 5 entries for better batch processing
402
+ for _ in range(5):
403
+ if preprocess_image_for_ocr._cache:
404
+ preprocess_image_for_ocr._cache.pop(next(iter(preprocess_image_for_ocr._cache)))
405
+
406
+ preprocess_image_for_ocr._cache[cache_key] = result
407
+ return result
408
+
409
+ except Exception as e:
410
+ # If stat or cache handling fails, log and continue with processing
411
+ logger.debug(f"Cache handling failed for {image_path}: {str(e)}")
412
+ # Ensure we have a valid file_size_mb for later decisions
413
+ try:
414
+ file_size_mb = image_file.stat().st_size / (1024 * 1024)
415
+ except Exception:
416
+ file_size_mb = 0 # Default if we can't determine size
417
+
418
+ try:
419
+ # Process start time for performance logging
420
+ start_time = time.time()
421
+
422
+ # Open and process the image with minimal memory footprint
423
+ with Image.open(image_file) as img:
424
+ # Normalize image mode
425
+ if img.mode not in ('RGB', 'L'):
426
+ img = img.convert('RGB')
427
+
428
+ # Fast path: Quick check of image properties to determine appropriate processing
429
+ width, height = img.size
430
+ image_area = width * height
431
+
432
+ # Detect document type only for medium to large images to save processing time
433
+ is_document = False
434
+ if image_area > 500000: # Approx 700x700 or larger
435
+ # Store image for document detection
436
+ _detect_document_type_impl._current_img = img
437
+ is_document = _detect_document_type_impl(None)
438
+ logger.debug(f"Document type detection for {image_file.name}: {'document' if is_document else 'photo'}")
439
+
440
+ # Resize large images for API efficiency
441
+ if file_size_mb > IMAGE_PREPROCESSING["max_size_mb"] or max(width, height) > 3000:
442
+ # Calculate target dimensions directly instead of using the heavier resize function
443
+ target_width, target_height = width, height
444
+ max_dimension = max(width, height)
445
+
446
+ # Use a sliding scale for reduction based on image size
447
+ if max_dimension > 5000:
448
+ scale_factor = 0.25 # Aggressive reduction for very large images
449
+ elif max_dimension > 3000:
450
+ scale_factor = 0.4 # Significant reduction for large images
451
+ else:
452
+ scale_factor = 0.6 # Moderate reduction for medium images
453
+
454
+ # Calculate new dimensions
455
+ new_width = int(width * scale_factor)
456
+ new_height = int(height * scale_factor)
457
+
458
+ # Use direct resize with optimized resampling filter based on image size
459
+ if image_area > 3000000: # Very large, use faster but lower quality
460
+ processed_img = img.resize((new_width, new_height), Image.BILINEAR)
461
+ else: # Medium size, use better quality
462
+ processed_img = img.resize((new_width, new_height), Image.LANCZOS)
463
+
464
+ logger.debug(f"Resized image from {width}x{height} to {new_width}x{new_height}")
465
+ else:
466
+ # Skip resizing for smaller images
467
+ processed_img = img
468
+
469
+ # Apply appropriate processing based on document type and size
470
+ if is_document:
471
+ # Process as document with optimized path based on size
472
+ if image_area > 1000000: # Full processing for larger documents
473
+ preprocess_document_image._current_img = processed_img
474
+ processed = _preprocess_document_image_impl()
475
+ else: # Lightweight processing for smaller documents
476
+ # Just enhance contrast for small documents to save time
477
+ enhancer = ImageEnhance.Contrast(processed_img)
478
+ processed = enhancer.enhance(1.3)
479
+ else:
480
+ # Process as photo with optimized path based on size
481
+ if image_area > 1000000: # Full processing for larger photos
482
+ preprocess_general_image._current_img = processed_img
483
+ processed = _preprocess_general_image_impl()
484
+ else: # Skip processing for smaller photos
485
+ processed = processed_img
486
+
487
+ # Optimize memory handling during encoding
488
+ buffer = io.BytesIO()
489
+
490
+ # Adjust quality based on image size to optimize API payload
491
+ if file_size_mb > 5:
492
+ quality = 85 # Lower quality for large files
493
+ else:
494
+ quality = IMAGE_PREPROCESSING["compression_quality"]
495
+
496
+ # Save with optimized parameters
497
+ processed.save(buffer, format="JPEG", quality=quality, optimize=True)
498
+ buffer.seek(0)
499
+
500
+ # Get base64 with minimal memory footprint
501
+ encoded_image = base64.b64encode(buffer.getvalue()).decode()
502
+ base64_data_url = f"data:image/jpeg;base64,{encoded_image}"
503
+
504
+ # Update the cache (a plain dict stored on the function object; no lock is taken)
505
+ result = (processed, base64_data_url)
506
+ if not hasattr(preprocess_image_for_ocr, "_cache"):
507
+ preprocess_image_for_ocr._cache = {}
508
+
509
+ # Bounded cache: evict the oldest-inserted entries in batches (insertion order, not true LRU)
510
+ if len(preprocess_image_for_ocr._cache) > 20:
511
+ try:
512
+ # Remove several entries to avoid frequent cache clearing
513
+ for _ in range(5):
514
+ if preprocess_image_for_ocr._cache:
515
+ preprocess_image_for_ocr._cache.pop(next(iter(preprocess_image_for_ocr._cache)))
516
+ except Exception:
517
+ # If removal fails, just continue
518
+ pass
519
+
520
+ # Add to cache
521
+ try:
522
+ preprocess_image_for_ocr._cache[cache_key] = result
523
+ except Exception:
524
+ # If caching fails, just proceed
525
+ pass
526
+
527
+ # Log performance metrics
528
+ processing_time = time.time() - start_time
529
+ logger.debug(f"Image preprocessing completed in {processing_time:.3f}s for {image_file.name}")
530
+
531
+ # Return both processed image and base64 string
532
+ return result
533
+
534
+ except Exception as e:
535
+ # If preprocessing fails, log error and use original image
536
+ logger.warning(f"Image preprocessing failed: {str(e)}. Using original image.")
537
+ return None, encode_image_for_api(image_path)
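The cache bookkeeping in the function above drops entries in insertion order once the dictionary grows past 20 items. A minimal sketch of the same bounded-cache idea follows. `BoundedCache` is a hypothetical helper, not part of this repository, and both the lock and the move-to-end step (which makes eviction least-recently-used) are assumptions added for illustration.

```python
import threading
from collections import OrderedDict


class BoundedCache:
    """Hypothetical helper: a small cache that evicts old entries in batches."""

    def __init__(self, max_entries=20, evict_batch=5):
        self._data = OrderedDict()
        self._lock = threading.Lock()
        self.max_entries = max_entries
        self.evict_batch = evict_batch

    def get(self, key):
        with self._lock:
            if key not in self._data:
                return None
            # Mark the entry as recently used so eviction is LRU rather than FIFO.
            self._data.move_to_end(key)
            return self._data[key]

    def put(self, key, value):
        with self._lock:
            if key not in self._data and len(self._data) >= self.max_entries:
                # Drop several of the least recently used entries at once.
                for _ in range(min(self.evict_batch, len(self._data))):
                    self._data.popitem(last=False)
            self._data[key] = value
            self._data.move_to_end(key)

    def __len__(self):
        with self._lock:
            return len(self._data)
```

Evicting in batches, as the original does, avoids trimming the cache on every call once it is full.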
538
+
539
+ # Removed caching decorator to fix unhashable type error
540
+ def detect_document_type(img: Image.Image) -> bool:
541
+ """
542
+ Detect if an image is likely a document (text-heavy) vs. a photo.
543
+
544
+ Args:
545
+ img: PIL Image object
546
+
547
+ Returns:
548
+ True if likely a document, False otherwise
549
+ """
550
+ _detect_document_type_impl._current_img = img  # the implementation reads the image from this attribute
551
+ return _detect_document_type_impl(None)
552
+
553
+ def _detect_document_type_impl(img_hash=None) -> bool:
554
+ """
555
+ Optimized implementation of document type detection for faster processing.
556
+ The img_hash parameter is unused but kept for backward compatibility.
557
+ """
558
+ # Get the image from the function attribute set by the caller (shared state, not thread-local)
559
+ if not hasattr(_detect_document_type_impl, "_current_img"):
560
+ return False # Fail safe in case image is not set
561
+
562
+ img = _detect_document_type_impl._current_img
563
+
564
+ # Skip processing for tiny images - just classify as non-documents
565
+ width, height = img.size
566
+ if width * height < 100000: # Approx 300x300 or smaller
567
+ return False
568
+
569
+ # Quick check: If image has many colors, it's likely not a document
570
+ # Sample a subset of pixels for color analysis (faster than full histogram)
571
+ try:
572
+ # Sample pixels in a grid pattern
573
+ color_samples = []
574
+ for x in range(0, width, max(1, width // 10)):
575
+ for y in range(0, height, max(1, height // 10)):
576
+ try:
577
+ color_samples.append(img.getpixel((x, y)))
578
+ except Exception:
579
+ pass
580
+
581
+ # Count unique colors in the sample
582
+ if img.mode == 'RGB':
583
+ unique_colors = len(set(color_samples))
584
+ if unique_colors > 1000: # Many unique colors suggest a photo; note the grid above yields at most a couple hundred samples, so this check cannot fire as written
585
+ return False
586
+ except Exception:
587
+ pass # If sampling fails, continue with regular analysis
588
+
589
+ # Convert to grayscale for analysis (using faster conversion)
590
+ gray_img = img.convert('L')
591
+
592
+ # PIL-only path for systems without OpenCV
593
+ if not CV2_AVAILABLE:
594
+ # Faster method: Sample a subset of the image for edge detection
595
+ # Downscale image for faster processing
596
+ sample_size = min(width, height, 1000)
597
+ scale_factor = sample_size / max(width, height)
598
+
599
+ if scale_factor < 0.9: # Only resize if significant reduction
600
+ sample_img = gray_img.resize(
601
+ (int(width * scale_factor), int(height * scale_factor)),
602
+ Image.NEAREST # Fastest resampling method
603
+ )
604
+ else:
605
+ sample_img = gray_img
606
+
607
+ # Fast edge detection on sample
608
+ edges = sample_img.filter(ImageFilter.FIND_EDGES)
609
+
610
+ # Count edge pixels using threshold (faster than summing individual pixels)
611
+ edge_data = edges.getdata()
612
+ edge_threshold = 50
613
+
614
+ # Count with a generator expression to avoid building an intermediate list
615
+ edge_count = sum(1 for p in edge_data if p > edge_threshold)
616
+ total_pixels = len(edge_data)
617
+ edge_ratio = edge_count / total_pixels
618
+
619
+ # Check if bright areas exist - simple approximation of text/background contrast
620
+ bright_count = sum(1 for p in gray_img.getdata() if p > 200)
621
+ bright_ratio = bright_count / (width * height)
622
+
623
+ # Documents typically have more edges (text boundaries) and bright areas (background)
624
+ return edge_ratio > 0.05 or bright_ratio > 0.4
625
+
626
+ # OpenCV path - optimized for speed
627
+ img_np = np.array(gray_img)
628
+
629
+ # Fast document detection heuristics
630
+
631
+ # 1. Fast check: standard deviation of pixel values
631
+ # Documents typically have a wide spread of values (black text on white background)
633
+ # Use numpy's fast statistical functions
634
+ std_dev = np.std(img_np)
635
+ if std_dev > 60: # High standard deviation suggests document
636
+ return True
637
+
638
+ # 2. Quick check using downsampled image for edges
639
+ # Downscale for faster processing on large images
640
+ if max(img_np.shape) > 1000:
641
+ scale = 1000 / max(img_np.shape)
642
+ small_img = cv2.resize(img_np, None, fx=scale, fy=scale, interpolation=cv2.INTER_NEAREST)
643
+ else:
644
+ small_img = img_np
645
+
646
+ # Use faster edge detection
647
+ edges = cv2.Canny(small_img, 50, 150, L2gradient=False)
648
+ edge_ratio = np.count_nonzero(edges) / edges.size
649
+
650
+ # 3. Fast histogram approximation using bins
651
+ # Instead of calculating full histogram, use bins for dark and light regions
652
+ dark_mask = img_np < 50
653
+ light_mask = img_np > 200
654
+
655
+ dark_ratio = np.count_nonzero(dark_mask) / img_np.size
656
+ light_ratio = np.count_nonzero(light_mask) / img_np.size
657
+
658
+ # Combine heuristics for final decision
659
+ # Documents typically have both dark (text) and light (background) regions,
660
+ # and/or well-defined edges
661
+ return (dark_ratio > 0.05 and light_ratio > 0.3) or edge_ratio > 0.04
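The final decision rule above can be exercised without OpenCV or Pillow. The sketch below applies the same thresholds to a flat list of gray values. `looks_like_document` is a hypothetical name, and the edge ratio is passed in as a precomputed number rather than derived from an edge detector.

```python
def looks_like_document(gray_pixels, edge_ratio):
    """Apply the dark/light/edge thresholds to a flat sequence of 0-255 gray values."""
    total = len(gray_pixels)
    if total == 0:
        return False
    dark_ratio = sum(1 for p in gray_pixels if p < 50) / total
    light_ratio = sum(1 for p in gray_pixels if p > 200) / total
    # Text on paper: some dark ink plus plenty of light background, or many edges.
    return (dark_ratio > 0.05 and light_ratio > 0.3) or edge_ratio > 0.04
```

For example, a page that is 10% ink and 90% paper passes on the dark/light test alone, while a uniformly mid-gray image passes only if its edge ratio exceeds 0.04.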
662
+
663
+ # Removed caching to fix unhashable type error
664
+ def preprocess_document_image(img: Image.Image) -> Image.Image:
665
+ """
666
+ Preprocess a document image for optimal OCR.
667
+
668
+ Args:
669
+ img: PIL Image object
670
+
671
+ Returns:
672
+ Processed PIL Image
673
+ """
674
+ # Store the image for the implementation function
675
+ preprocess_document_image._current_img = img
676
+ # The actual implementation is separated for cleaner code organization
677
+ return _preprocess_document_image_impl()
678
+
679
+ def _preprocess_document_image_impl() -> Image.Image:
680
+ """
681
+ Optimized implementation of document preprocessing with adaptive processing based on image size
682
+ """
683
+ # Get the image from the function attribute set by the caller (shared state, not thread-local)
684
+ if not hasattr(preprocess_document_image, "_current_img"):
685
+ raise ValueError("No image set for document preprocessing")
686
+
687
+ img = preprocess_document_image._current_img
688
+
689
+ # Analyze image size to determine processing strategy
690
+ width, height = img.size
691
+ img_size = width * height
692
+
693
+ # Ultra-fast path for tiny images - just convert to grayscale with contrast enhancement
694
+ if img_size < 300000: # ~500x600 or smaller
695
+ gray = img.convert('L')
696
+ enhancer = ImageEnhance.Contrast(gray)
697
+ return enhancer.enhance(IMAGE_PREPROCESSING["enhance_contrast"])
698
+
699
+ # Fast path for small images - minimal processing
700
+ if img_size < 1000000: # ~1000x1000 or smaller
701
+ gray = img.convert('L')
702
+ enhancer = ImageEnhance.Contrast(gray)
703
+ enhanced = enhancer.enhance(IMAGE_PREPROCESSING["enhance_contrast"])
704
+ # Light sharpening only if sharpen is enabled
705
+ if IMAGE_PREPROCESSING["sharpen"]:
706
+ enhanced = enhanced.filter(ImageFilter.SHARPEN)
707
+ return enhanced
708
+
709
+ # Standard path for medium images
710
+ # Convert to grayscale (faster processing)
711
+ gray = img.convert('L')
712
+
713
+ # Improve contrast - key for text visibility
714
+ enhancer = ImageEnhance.Contrast(gray)
715
+ enhanced = enhancer.enhance(IMAGE_PREPROCESSING["enhance_contrast"])
716
+
717
+ # Apply light sharpening for text clarity
718
+ if IMAGE_PREPROCESSING["sharpen"]:
719
+ enhanced = enhanced.filter(ImageFilter.SHARPEN)
720
+
721
+ # Advanced processing for larger images, only when OpenCV is available and denoising is enabled
722
+ # The following optimizations improve OCR accuracy significantly for complex documents
723
+ if img_size > 1500000 and CV2_AVAILABLE and IMAGE_PREPROCESSING["denoise"]:
724
+ try:
725
+ # Convert to numpy array for OpenCV processing
726
+ img_np = np.array(enhanced)
727
+
728
+ # Optimize denoising parameters based on image size
729
+ if img_size > 4000000: # Very large images (~2000x2000 or larger)
730
+ # More aggressive downsampling for very large images
731
+ scale_factor = 0.5
732
+ downsample = cv2.resize(img_np, None, fx=scale_factor, fy=scale_factor,
733
+ interpolation=cv2.INTER_AREA)
734
+
735
+ # Lighter denoising for downsampled image
736
+ h_value = 7 # Strength parameter
737
+ template_window = 5
738
+ search_window = 13
739
+
740
+ # Apply denoising on smaller image
741
+ denoised_np = cv2.fastNlMeansDenoising(downsample, None, h_value, template_window, search_window)
742
+
743
+ # Resize back to original size
744
+ denoised_np = cv2.resize(denoised_np, (width, height), interpolation=cv2.INTER_LINEAR)
745
+ else:
746
+ # Direct denoising for medium-large images
747
+ h_value = 8 # Balanced for speed and quality
748
+ template_window = 5
749
+ search_window = 15
750
+
751
+ # Apply denoising
752
+ denoised_np = cv2.fastNlMeansDenoising(img_np, None, h_value, template_window, search_window)
753
+
754
+ # Convert back to PIL Image
755
+ enhanced = Image.fromarray(denoised_np)
756
+
757
+ # Apply adaptive thresholding only if it improves text visibility
758
+ # Create a binarized version of the image
759
+ if img_size < 8000000: # Skip for extremely large images to save processing time
760
+ binary = cv2.adaptiveThreshold(denoised_np, 255,
761
+ cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
762
+ cv2.THRESH_BINARY, 11, 2)
763
+
764
+ # Quick verification that binarization preserves text information
765
+ # Use simplified check that works well for document images
766
+ white_pixels_binary = np.count_nonzero(binary > 200)
767
+ white_pixels_orig = np.count_nonzero(denoised_np > 200)
768
+
769
+ # Check if binary preserves reasonable amount of white pixels (background)
770
+ if white_pixels_binary > white_pixels_orig * 0.8:
771
+ # Binarization looks good, use it
772
+ return Image.fromarray(binary)
773
+ except Exception:
774
+ # If OpenCV processing fails, continue with PIL-enhanced image
775
+ pass
776
+
777
+ elif IMAGE_PREPROCESSING["denoise"]:
778
+ # PIL fallback denoising: used when OpenCV is unavailable or the image is at most 1.5 megapixels
779
+ # Use lighter median filter
780
+ enhanced = enhanced.filter(ImageFilter.MedianFilter(3))
781
+
782
+ # Return enhanced grayscale image
783
+ return enhanced
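The wrappers in this module hand the image to their implementation through an attribute on the function object, which every thread shares. If per-thread isolation is wanted, one option is the standard library's `threading.local`. The sketch below assumes hypothetical helper names (`set_current_image`, `get_current_image`, `seen_from_worker`). Passing the image as an ordinary argument would be simpler still.

```python
import threading

# Hypothetical per-thread holder; each thread sees only its own attributes.
_state = threading.local()


def set_current_image(img):
    _state.current_img = img


def get_current_image():
    # Returns None when the calling thread has not stored an image.
    return getattr(_state, "current_img", None)


def seen_from_worker():
    """Report what a freshly started thread finds in the holder."""
    seen = []
    worker = threading.Thread(target=lambda: seen.append(get_current_image()))
    worker.start()
    worker.join()
    return seen[0]
```

A value stored by the main thread is invisible to the worker, so two concurrent requests cannot overwrite each other's image.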
784
+
785
+ # Removed caching to fix unhashable type error
786
+ def preprocess_general_image(img: Image.Image) -> Image.Image:
787
+ """
788
+ Preprocess a general image for OCR.
789
+
790
+ Args:
791
+ img: PIL Image object
792
+
793
+ Returns:
794
+ Processed PIL Image
795
+ """
796
+ # Store the image for implementation function
797
+ preprocess_general_image._current_img = img
798
+ return _preprocess_general_image_impl()
799
+
800
+ def _preprocess_general_image_impl() -> Image.Image:
801
+ """
802
+ Optimized implementation of general image preprocessing with size-based processing paths
803
+ """
804
+ # Get the image from the function attribute set by the caller (shared state, not thread-local)
805
+ if not hasattr(preprocess_general_image, "_current_img"):
806
+ raise ValueError("No image set for general preprocessing")
807
+
808
+ img = preprocess_general_image._current_img
809
+
810
+ # Ultra-fast path: Skip processing completely for small images to improve performance
811
+ width, height = img.size
812
+ img_size = width * height
813
+ if img_size < 300000: # Skip for tiny images under ~0.3 megapixel
814
+ # Just ensure correct color mode
815
+ if img.mode != 'RGB':
816
+ return img.convert('RGB')
817
+ return img
818
+
819
+ # Fast path: Minimal processing for smaller images
820
+ if img_size < 600000: # ~800x750 or smaller
821
+ # Ensure RGB mode
822
+ if img.mode != 'RGB':
823
+ img = img.convert('RGB')
824
+
825
+ # Very light contrast enhancement only
826
+ enhancer = ImageEnhance.Contrast(img)
827
+ return enhancer.enhance(1.15) # Lighter enhancement for small images
828
+
829
+ # Standard path: Apply moderate enhancements for medium images
830
+ # Convert to RGB to ensure compatibility
831
+ if img.mode != 'RGB':
832
+ img = img.convert('RGB')
833
+
834
+ # Moderate enhancement only
835
+ enhancer = ImageEnhance.Contrast(img)
836
+ enhanced = enhancer.enhance(1.2) # Less aggressive than document enhancement
837
+
838
+ # Skip additional processing for medium-sized images
839
+ if img_size < 1000000: # Skip for images under ~1 megapixel
840
+ return enhanced
841
+
842
+ # Enhanced path: Additional processing for larger images
843
+ try:
844
+ # Apply optimized enhancement pipeline for large non-document images
845
+
846
+ # 1. Improve color saturation slightly for better feature extraction
847
+ saturation = ImageEnhance.Color(enhanced)
848
+ enhanced = saturation.enhance(1.1)
849
+
850
+ # 2. Apply adaptive sharpening based on image size
851
+ if img_size > 2500000: # Very large images (~1600x1600 or larger)
852
+ # Use EDGE_ENHANCE instead of SHARPEN for more subtle enhancement on large images
853
+ enhanced = enhanced.filter(ImageFilter.EDGE_ENHANCE)
854
+ else:
855
+ # Standard sharpening for regular large images
856
+ enhanced = enhanced.filter(ImageFilter.SHARPEN)
857
+
858
+ # 3. Apply additional processing with OpenCV if available (for largest images)
859
+ if CV2_AVAILABLE and img_size > 3000000:
860
+ # Convert to numpy array
861
+ img_np = np.array(enhanced)
862
+
863
+ # Apply subtle enhancement of details (CLAHE)
864
+ try:
865
+ # Convert to LAB color space for better processing
866
+ lab = cv2.cvtColor(img_np, cv2.COLOR_RGB2LAB)
867
+
868
+ # Only enhance the L channel (luminance)
869
+ l, a, b = cv2.split(lab)
870
+
871
+ # Create CLAHE object with optimal parameters for photos
872
+ clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
873
+
874
+ # Apply CLAHE to L channel
875
+ l = clahe.apply(l)
876
+
877
+ # Merge channels back and convert to RGB
878
+ lab = cv2.merge((l, a, b))
879
+ enhanced_np = cv2.cvtColor(lab, cv2.COLOR_LAB2RGB)
880
+
881
+ # Convert back to PIL
882
+ enhanced = Image.fromarray(enhanced_np)
883
+ except Exception:
884
+ # If CLAHE fails, continue with PIL-enhanced image
885
+ pass
886
+
887
+ except Exception:
888
+ # If any enhancement fails, fall back to basic contrast enhancement
889
+ if img.mode != 'RGB':
890
+ img = img.convert('RGB')
891
+ enhancer = ImageEnhance.Contrast(img)
892
+ enhanced = enhancer.enhance(1.2)
893
+
894
+ return enhanced
895
+
896
+ # Removed caching decorator to fix unhashable type error
897
+ def resize_image(img: Image.Image, target_dpi: int = 300) -> Image.Image:
898
+ """
899
+ Resize an image to an optimal size for OCR while preserving quality.
900
+
901
+ Args:
902
+ img: PIL Image object
903
+ target_dpi: Target DPI (dots per inch)
904
+
905
+ Returns:
906
+ Resized PIL Image
907
+ """
908
+ # Store the image for implementation function
909
+ resize_image._current_img = img
910
+ return resize_image_impl(target_dpi)
911
+
912
+ def resize_image_impl(target_dpi: int = 300) -> Image.Image:
913
+ """
914
+ Implementation of the resize function; reads the image from an attribute on resize_image.
915
+
916
+ Args:
917
+ target_dpi: Target DPI (dots per inch)
918
+
919
+ Returns:
920
+ Resized PIL Image
921
+ """
922
+ # Get the image from the function attribute set by the caller (shared state, not thread-local)
923
+ if not hasattr(resize_image, "_current_img"):
924
+ raise ValueError("No image set for resizing")
925
+
926
+ img = resize_image._current_img
927
+
928
+ # Calculate current dimensions
929
+ width, height = img.size
930
+
931
+ # Fixed target dimensions based on DPI
932
+ # Using 8.5x11 inches (standard paper size) as reference
933
+ max_width = int(8.5 * target_dpi)
934
+ max_height = int(11 * target_dpi)
935
+
936
+ # Check if resizing is needed - quick early return
937
+ if width <= max_width and height <= max_height:
938
+ return img # No resizing needed
939
+
940
+ # Calculate scaling factor once
941
+ scale_factor = min(max_width / width, max_height / height)
942
+
943
+ # Calculate new dimensions
944
+ new_width = int(width * scale_factor)
945
+ new_height = int(height * scale_factor)
946
+
947
+ # Use BICUBIC for better balance of speed and quality
948
+ return img.resize((new_width, new_height), Image.BICUBIC)
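The size arithmetic in `resize_image_impl` can be checked on its own. The sketch below reproduces only the dimension calculation, using a hypothetical function name (`fit_to_page`), and leaves the actual resampling to Pillow.

```python
def fit_to_page(width, height, target_dpi=300):
    """Return (width, height) scaled down to fit an 8.5 x 11 inch page at target_dpi."""
    max_width = int(8.5 * target_dpi)
    max_height = int(11 * target_dpi)
    if width <= max_width and height <= max_height:
        return width, height  # already small enough; never upscale
    # One factor for both axes keeps the aspect ratio intact.
    scale = min(max_width / width, max_height / height)
    return int(width * scale), int(height * scale)
```

At the default 300 DPI the bounding box is 2550 x 3300 pixels, so a 5100 x 6600 scan is halved and a 2000 x 3000 scan is returned unchanged.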
949
+
950
+ def calculate_image_entropy(img: Image.Image) -> float:
951
+ """
952
+ Calculate the entropy (information content) of an image.
953
+
954
+ Args:
955
+ img: PIL Image object
956
+
957
+ Returns:
958
+ Entropy value
959
+ """
960
+ # Convert to grayscale
961
+ if img.mode != 'L':
962
+ img = img.convert('L')
963
+
964
+ # Calculate histogram
965
+ histogram = img.histogram()
966
+ total_pixels = img.width * img.height
967
+
968
+ # Calculate entropy
969
+ entropy = 0
970
+ for h in histogram:
971
+ if h > 0:
972
+ probability = h / total_pixels
973
+ entropy -= probability * np.log2(probability)
974
+
975
+ return entropy
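The loop above computes the Shannon entropy of the grayscale histogram, H = -sum(p_i * log2(p_i)). A dependency-free sketch of the same formula follows. The function name `shannon_entropy` and the use of `math.log2` in place of NumPy are choices made for this example only.

```python
import math


def shannon_entropy(histogram):
    """Entropy in bits of a histogram given as a sequence of non-negative counts."""
    total = sum(histogram)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in histogram:
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy
```

A single-valued image has entropy 0, two equally likely values give 1 bit, and an 8-bit grayscale image tops out at 8 bits.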
976
+
977
+ def create_html_with_images(result):
978
+ """
979
+ Create an HTML document with embedded images from OCR results.
980
+
981
+ Args:
982
+ result: OCR result dictionary containing pages_data
983
+
984
+ Returns:
985
+ HTML content as string
986
+ """
987
+ # Create HTML document structure
988
+ html_content = """
989
+ <!DOCTYPE html>
990
+ <html>
991
+ <head>
992
+ <meta charset="UTF-8">
993
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
994
+ <title>OCR Document with Images</title>
995
+ <style>
996
+ body {
997
+ font-family: Georgia, serif;
998
+ line-height: 1.7;
999
+ margin: 0 auto;
1000
+ max-width: 800px;
1001
+ padding: 20px;
1002
+ }
1003
+ img {
1004
+ max-width: 90%;
1005
+ max-height: 500px;
1006
+ object-fit: contain;
1007
+ margin: 20px auto;
1008
+ display: block;
1009
+ border: 1px solid #ddd;
1010
+ border-radius: 4px;
1011
+ }
1012
+ .image-container {
1013
+ margin: 20px 0;
1014
+ text-align: center;
1015
+ }
1016
+ .page-break {
1017
+ border-top: 1px solid #ddd;
1018
+ margin: 40px 0;
1019
+ padding-top: 40px;
1020
+ }
1021
+ h3 {
1022
+ color: #333;
1023
+ border-bottom: 1px solid #eee;
1024
+ padding-bottom: 10px;
1025
+ }
1026
+ p {
1027
+ margin: 12px 0;
1028
+ }
1029
+ .page-text-content {
1030
+ margin-bottom: 20px;
1031
+ }
1032
+ .text-block {
1033
+ background-color: #f9f9f9;
1034
+ padding: 15px;
1035
+ border-radius: 4px;
1036
+ border-left: 3px solid #546e7a;
1037
+ margin-bottom: 15px;
1038
+ color: #333;
1039
+ }
1040
+ .text-block p {
1041
+ margin: 8px 0;
1042
+ color: #333;
1043
+ }
1044
+ .metadata {
1045
+ background-color: #f5f5f5;
1046
+ padding: 10px 15px;
1047
+ border-radius: 4px;
1048
+ margin-bottom: 20px;
1049
+ font-size: 14px;
1050
+ }
1051
+ .metadata p {
1052
+ margin: 5px 0;
1053
+ }
1054
+ </style>
1055
+ </head>
1056
+ <body>
1057
+ """
1058
+
1059
+ # Add document metadata
1060
+ html_content += f"""
1061
+ <div class="metadata">
1062
+ <h2>{result.get('file_name', 'Document')}</h2>
1063
+ <p><strong>Processed at:</strong> {result.get('timestamp', '')}</p>
1064
+ <p><strong>Languages:</strong> {', '.join(result.get('languages', ['Unknown']))}</p>
1065
+ <p><strong>Topics:</strong> {', '.join(result.get('topics', ['Unknown']))}</p>
1066
+ </div>
1067
+ """
1068
+
1069
+ # Check if we have pages_data
1070
+ if 'pages_data' in result and result['pages_data']:
1071
+ pages_data = result['pages_data']
1072
+
1073
+ # Process each page
1074
+ for i, page in enumerate(pages_data):
1075
+ page_markdown = page.get('markdown', '')
1076
+ images = page.get('images', [])
1077
+
1078
+ # Add page header if multi-page
1079
+ if len(pages_data) > 1:
1080
+ html_content += f"<h3>Page {i+1}</h3>"
1081
+
1082
+ # Create image dictionary
1083
+ image_dict = {}
1084
+ for img in images:
1085
+ if 'id' in img and 'image_base64' in img:
1086
+ image_dict[img['id']] = img['image_base64']
1087
+
1088
+ # Process the markdown content
1089
+ if page_markdown:
1090
+ # Extract text content (lines without images)
1091
+ text_content = []
1092
+ image_lines = []
1093
+
1094
+ for line in page_markdown.split('\n'):
1095
+ if '![' in line and '](' in line:
1096
+ image_lines.append(line)
1097
+ elif line.strip():
1098
+ text_content.append(line)
1099
+
1100
+ # Add text content
1101
+ if text_content:
1102
+ html_content += '<div class="text-block">'
1103
+ for line in text_content:
1104
+ html_content += f"<p>{line}</p>"
1105
+ html_content += '</div>'
1106
+
1107
+ # Add images
1108
+ for line in image_lines:
1109
+ # Extract image ID and alt text using simple parsing
1110
+ try:
1111
+ alt_start = line.find('![') + 2
1112
+ alt_end = line.find(']', alt_start)
1113
+ alt_text = line[alt_start:alt_end]
1114
+
1115
+ img_start = line.find('(', alt_end) + 1
1116
+ img_end = line.find(')', img_start)
1117
+ img_id = line[img_start:img_end]
1118
+
1119
+ if img_id in image_dict:
1120
+ html_content += '<div class="image-container">'
1121
+ html_content += f'<img src="{image_dict[img_id]}" alt="{alt_text}">'
1122
+ html_content += '</div>'
1123
+ except Exception:
1124
+ # If parsing fails, just skip this image
1125
+ continue
1126
+
1127
+ # Add page separator if not the last page
1128
+ if i < len(pages_data) - 1:
1129
+ html_content += '<div class="page-break"></div>'
1130
+
1131
+ # Add structured content if available
1132
+ if 'ocr_contents' in result and isinstance(result['ocr_contents'], dict):
1133
+ html_content += '<h3>Structured Content</h3>'
1134
+
1135
+ for section, content in result['ocr_contents'].items():
1136
+ if content and section not in ['error', 'raw_text', 'partial_text']:
1137
+ html_content += f'<h4>{section.replace("_", " ").title()}</h4>'
1138
+
1139
+ if isinstance(content, str):
1140
+ html_content += f'<p>{content}</p>'
1141
+ elif isinstance(content, list):
1142
+ html_content += '<ul>'
1143
+ for item in content:
1144
+ html_content += f'<li>{str(item)}</li>'
1145
+ html_content += '</ul>'
1146
+ elif isinstance(content, dict):
1147
+ html_content += '<dl>'
1148
+ for k, v in content.items():
1149
+ html_content += f'<dt>{k}</dt><dd>{v}</dd>'
1150
+ html_content += '</dl>'
1151
+
1152
+ # Close HTML document
1153
+ html_content += """
1154
+ </body>
1155
+ </html>
1156
+ """
1157
+
1158
+ return html_content
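`create_html_with_images` interpolates OCR text into the page as-is, so a stray `<` or `&` in a transcription would be parsed as markup. A minimal sketch of escaping the text block with the standard library's `html.escape` follows. `render_text_block` is a hypothetical helper, not a function from this module.

```python
import html


def render_text_block(lines):
    """Wrap non-empty lines in <p> tags inside a text-block div, escaping markup."""
    paragraphs = "".join(
        f"<p>{html.escape(line)}</p>" for line in lines if line.strip()
    )
    return f'<div class="text-block">{paragraphs}</div>' if paragraphs else ""
```

The same call could wrap the file name, section titles, and list items that the function writes into the document.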
1159
+
1160
+ def generate_document_thumbnail(image_path: Union[str, Path], max_size: int = 300) -> str:
1161
+ """
1162
+ Generate a thumbnail for document preview.
1163
+
1164
+ Args:
1165
+ image_path: Path to the image file
1166
+ max_size: Maximum dimension for thumbnail
1167
+
1168
+ Returns:
1169
+ Base64 encoded thumbnail
1170
+ """
1171
+ if not PILLOW_AVAILABLE:
1172
+ return None
1173
+
1174
+ try:
1175
+ # Open the image
1176
+ with Image.open(image_path) as img:
1177
+ # Calculate thumbnail size preserving aspect ratio
1178
+ width, height = img.size
1179
+ if width > height:
1180
+ new_width = max_size
1181
+ new_height = int(height * (max_size / width))
1182
+ else:
1183
+ new_height = max_size
1184
+ new_width = int(width * (max_size / height))
1185
+
1186
+ # Create thumbnail
1187
+ thumbnail = img.resize((new_width, new_height), Image.LANCZOS)
1188
+
1189
+ # Save to buffer
1190
+ buffer = io.BytesIO()
1191
+ thumbnail.save(buffer, format="JPEG", quality=85)
1192
+ buffer.seek(0)
1193
+
1194
+ # Encode as base64
1195
+ encoded = base64.b64encode(buffer.getvalue()).decode()
1196
+ return f"data:image/jpeg;base64,{encoded}"
1197
+ except Exception:
1198
+ # Return None if thumbnail generation fails
1199
+ return None
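The thumbnail dimensions above come from scaling the longer side to `max_size`. The sketch below isolates that calculation under a hypothetical name (`thumbnail_size`). The `max(1, ...)` guard is an addition for this example, so that extremely elongated images never produce a zero-pixel side.

```python
def thumbnail_size(width, height, max_size=300):
    """Scale (width, height) so the longer side equals max_size, keeping aspect ratio."""
    if width > height:
        return max_size, max(1, int(height * (max_size / width)))
    return max(1, int(width * (max_size / height))), max_size
```

Like the original, this also enlarges images smaller than `max_size`. Pillow's `Image.thumbnail` method is an alternative that resizes in place and does not enlarge.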
1200
+
1201
+ def try_local_ocr_fallback(image_path: Union[str, Path], base64_data_url: str = None) -> str:
1202
+ """
1203
+ Attempt to use local pytesseract OCR as a fallback when API fails
1204
+
1205
+ Args:
1206
+ image_path: Path to the image file
1207
+ base64_data_url: Optional base64 data URL if already available
1208
+
1209
+ Returns:
1210
+ OCR text string if successful, None if failed
1211
+ """
1212
+ logger.info("Attempting local OCR fallback using pytesseract...")
1213
+
1214
+ try:
1215
+ import pytesseract
1216
+ from PIL import Image
1217
+
1218
+ # Load image - either from path or from base64
1219
+ if base64_data_url and base64_data_url.startswith('data:image'):
1220
+ # Extract image from base64
1221
+ image_data = base64_data_url.split(',', 1)[1]
1222
+ image_bytes = base64.b64decode(image_data)
1223
+ image = Image.open(io.BytesIO(image_bytes))
1224
+ else:
1225
+ # Load from file path
1226
+ image_path = Path(image_path) if isinstance(image_path, str) else image_path
1227
+ image = Image.open(image_path)
1228
+
1229
+ # Normalize to RGB first so palette and alpha modes convert cleanly to grayscale below
1230
+ if image.mode != 'RGB':
1231
+ image = image.convert('RGB')
1232
+
1233
+ # Apply image enhancements for better OCR
1234
+ # Convert to grayscale for better text recognition
1235
+ image = image.convert('L')
1236
+
1237
+ # Enhance contrast
1238
+ enhancer = ImageEnhance.Contrast(image)
1239
+ image = enhancer.enhance(2.0) # Higher contrast for better OCR
1240
+
1241
+ # Run OCR
1242
+ ocr_text = pytesseract.image_to_string(image, lang='eng')
1243
+
1244
+ if ocr_text and len(ocr_text.strip()) > 50:
1245
+ logger.info(f"Local OCR successful: extracted {len(ocr_text)} characters")
1246
+ return ocr_text
1247
+ else:
1248
+ logger.warning("Local OCR produced minimal or no text")
1249
+ return None
1250
+ except ImportError:
1251
+ logger.warning("Pytesseract not installed - local OCR not available")
1252
+ return None
1253
+ except Exception as e:
1254
+ logger.error(f"Local OCR fallback failed: {str(e)}")
1255
+ return None
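The fallback above recovers image bytes from a base64 data URL by splitting on the first comma. A slightly stricter sketch of that step follows. `decode_data_url` is a hypothetical helper, and rejecting malformed input with `None` is an assumption about the desired behaviour.

```python
import base64


def decode_data_url(data_url):
    """Return (mime_type, raw_bytes) for a base64 data URL, or None if malformed."""
    if not data_url.startswith("data:") or "," not in data_url:
        return None
    header, payload = data_url.split(",", 1)
    if not header.endswith(";base64"):
        return None
    mime_type = header[len("data:"):-len(";base64")]
    try:
        # validate=True rejects characters outside the base64 alphabet.
        return mime_type, base64.b64decode(payload, validate=True)
    except ValueError:
        return None
```

The returned bytes can be wrapped in `io.BytesIO` and opened with `Image.open`, as the function above does.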
packages.txt ADDED
@@ -0,0 +1,2 @@
1
+ poppler-utils
2
+ tesseract-ocr
pdf_ocr.py ADDED
@@ -0,0 +1,76 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ PDFOCR - Module for processing PDF files with OCR and extracting structured data.
4
+ """
5
+
6
+ import json
7
+ from pathlib import Path
8
+ from structured_ocr import StructuredOCR
9
+
10
+ class PDFOCR:
11
+ """Class for processing PDF files with OCR and extracting structured data."""
12
+
13
+ def __init__(self, api_key=None):
14
+ """Initialize the PDF OCR processor."""
15
+ self.processor = StructuredOCR(api_key=api_key)
16
+
17
+ def process_pdf(self, pdf_path, use_vision=True):
18
+ """
19
+ Process a PDF file with OCR and extract structured data.
20
+
21
+ Args:
22
+ pdf_path: Path to the PDF file
23
+ use_vision: Whether to use vision model for improved analysis
24
+
25
+ Returns:
26
+ Dictionary with structured OCR results
27
+ """
28
+ pdf_path = Path(pdf_path)
29
+ if not pdf_path.exists():
30
+ raise FileNotFoundError(f"PDF file not found: {pdf_path}")
31
+
32
+ return self.processor.process_file(pdf_path, file_type="pdf", use_vision=use_vision)
33
+
34
+ def save_json_output(self, pdf_path, output_path, use_vision=True):
35
+ """
36
+ Process a PDF file and save the structured output as JSON.
37
+
38
+ Args:
39
+ pdf_path: Path to the PDF file
40
+ output_path: Path where to save the JSON output
41
+ use_vision: Whether to use vision model for improved analysis
42
+
43
+ Returns:
44
+ Path to the saved JSON file
45
+ """
46
+ # Process the PDF
47
+ result = self.process_pdf(pdf_path, use_vision=use_vision)
48
+
49
+ # Save the result to JSON
50
+ output_path = Path(output_path)
51
+ output_path.parent.mkdir(parents=True, exist_ok=True)
52
+
53
+ with open(output_path, 'w', encoding='utf-8') as f:
54
+ json.dump(result, f, indent=2)
55
+
56
+ return output_path
57
+
58
+ # For testing directly
59
+ if __name__ == "__main__":
60
+ import sys
61
+
62
+ if len(sys.argv) < 2:
63
+ print("Usage: python pdf_ocr.py <pdf_path> [output_path]")
64
+ sys.exit(1)
65
+
66
+ pdf_path = sys.argv[1]
67
+ output_path = sys.argv[2] if len(sys.argv) > 2 else None
68
+
69
+ processor = PDFOCR()
70
+
71
+ if output_path:
72
+ result_path = processor.save_json_output(pdf_path, output_path)
73
+ print(f"Results saved to: {result_path}")
74
+ else:
75
+ result = processor.process_pdf(pdf_path)
76
+ print(json.dumps(result, indent=2))
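`save_json_output` writes the OCR result with `json.dump`. Historical texts often contain accented characters, so a variant that pins the encoding and keeps non-ASCII text readable may be useful. The sketch below uses hypothetical helper names (`save_json`, `roundtrip_demo`), and `default=str` is an assumption to cover values such as `Path` or `datetime` that JSON cannot serialize directly.

```python
import json
import tempfile
from pathlib import Path


def save_json(result, output_path):
    """Write result as UTF-8 JSON, creating parent directories as needed."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w", encoding="utf-8") as f:
        # default=str is an assumption: it stringifies otherwise unserializable values.
        json.dump(result, f, indent=2, ensure_ascii=False, default=str)
    return output_path


def roundtrip_demo():
    """Save a small result into a temporary directory and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = save_json({"text": "café", "pages": 1}, Path(tmp) / "out" / "result.json")
        return json.loads(path.read_text(encoding="utf-8"))
```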
process_file.py ADDED
@@ -0,0 +1,68 @@
1
+ """
2
+ Utility function for processing files with OCR in the Historical OCR Workshop app.
3
+ """
4
+
5
+ import os
6
+ import tempfile
7
+ from pathlib import Path
8
+ from datetime import datetime
9
+
10
+ def process_file(uploaded_file, use_vision=True, processor=None, custom_prompt=None):
11
+ """Process the uploaded file and return the OCR results
12
+
13
+ Args:
14
+ uploaded_file: The uploaded file to process
15
+ use_vision: Whether to use vision model
16
+ processor: StructuredOCR processor (if None, it will be imported)
17
+ custom_prompt: Optional additional instructions for the model
18
+
19
+ Returns:
20
+ dict: The OCR results
21
+ """
22
+ # Import the processor if not provided
23
+ if processor is None:
24
+ from structured_ocr import StructuredOCR
25
+ processor = StructuredOCR()
26
+
27
+ # Save the uploaded file to a temporary file
28
+ with tempfile.NamedTemporaryFile(delete=False, suffix=Path(uploaded_file.name).suffix) as tmp:
29
+ tmp.write(uploaded_file.getvalue())
30
+ temp_path = tmp.name
31
+
32
+ try:
33
+ # Determine file type from extension
34
+ file_ext = Path(uploaded_file.name).suffix.lower()
35
+ file_type = "pdf" if file_ext == ".pdf" else "image"
36
+
37
+ # Get file size in MB
38
+ file_size_mb = os.path.getsize(temp_path) / (1024 * 1024)
39
+
40
+ # Process the file with file size information for automatic page limiting
41
+ result = processor.process_file(
42
+ temp_path,
43
+ file_type=file_type,
44
+ use_vision=use_vision,
45
+ file_size_mb=file_size_mb,
46
+ custom_prompt=custom_prompt
47
+ )
48
+
49
+ # Add processing metadata
50
+ result.update({
51
+ "file_name": uploaded_file.name,
52
+ "processed_at": datetime.now().isoformat(),
53
+ "file_size_mb": round(file_size_mb, 2),
54
+ "use_vision": use_vision
55
+ })
56
+
57
+ # No longer needed - removing confidence score
58
+
59
+ return result
60
+ except Exception as e:
61
+ return {
62
+ "error": str(e),
63
+ "file_name": uploaded_file.name
64
+ }
65
+ finally:
66
+ # Clean up the temporary file
67
+ if os.path.exists(temp_path):
68
+ os.unlink(temp_path)
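The temp-file handling in `process_file` can be exercised without Streamlit or an API key. The sketch below reproduces only that pattern (copy the upload to disk, process it, always delete the copy). `FakeUpload`, `EchoProcessor`, and `run_with_temp_copy` are hypothetical names introduced here for illustration; they are not part of this commit and do not call any OCR service.

```python
import os
import tempfile
from datetime import datetime
from pathlib import Path


class FakeUpload:
    """Hypothetical stand-in for an uploaded file exposing .name and .getvalue()."""

    def __init__(self, name, data):
        self.name = name
        self._data = data

    def getvalue(self):
        return self._data


class EchoProcessor:
    """Hypothetical processor that records the path it was given instead of running OCR."""

    def __init__(self):
        self.seen_path = None

    def process_file(self, path, file_type=None, **kwargs):
        self.seen_path = path
        return {"file_type": file_type, "bytes_read": os.path.getsize(path)}


def run_with_temp_copy(uploaded_file, processor):
    """Copy the upload to a temp file, process it, and always remove the copy."""
    suffix = Path(uploaded_file.name).suffix
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(uploaded_file.getvalue())
        temp_path = tmp.name
    try:
        # Extension check is case-insensitive, as in process_file
        file_type = "pdf" if suffix.lower() == ".pdf" else "image"
        result = processor.process_file(temp_path, file_type=file_type)
        result["file_name"] = uploaded_file.name
        result["processed_at"] = datetime.now().isoformat()
        return result
    except Exception as exc:
        # Errors are returned as data so a UI can display them
        return {"error": str(exc), "file_name": uploaded_file.name}
    finally:
        if os.path.exists(temp_path):
            os.unlink(temp_path)


processor = EchoProcessor()
result = run_with_temp_copy(FakeUpload("letter.PDF", b"%PDF-1.4 stub"), processor)
print(result["file_type"], result["bytes_read"])
print(os.path.exists(processor.seen_path))
```

Closing the temporary file before handing its path to the processor, and deleting it in `finally`, keeps the cleanup independent of whether processing succeeded.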
requirements.txt ADDED
@@ -0,0 +1,17 @@
1
+ # Generated requirements for Hugging Face Spaces deployment
2
+
3
+ streamlit>=1.28.0
4
+ mistralai>=1.0.0 # structured_ocr.py imports the Mistral client class, which the 0.x SDK does not provide
5
+ Pillow>=9.0.0
6
+ opencv-python-headless>=4.5.0
7
+ pdf2image>=1.16.0
8
+ python-dotenv>=0.19.0
9
+ pycountry>=22.1.10
10
+ pydantic>=1.9.0
11
+ numpy>=1.20.0
12
+ requests>=2.28.0
13
+
14
+ # Additional packages from original requirements
15
+ pillow>=10.0.0
16
+ python-multipart>=0.0.6
17
+ pytesseract>=0.3.10
static/favicon.ico ADDED
static/favicon.png ADDED

Git LFS Details

  • SHA256: 579585886ddea743aa3e212e698632f315c6130d5d6dd3287a015011dbb8fc3a
  • Pointer size: 128 Bytes
  • Size of remote file: 779 Bytes
static/scroll.svg ADDED
structured_ocr.py ADDED
@@ -0,0 +1,1718 @@
1
+ import os
2
+ import sys
3
+ import time
4
+ import random
5
+ from enum import Enum
6
+ from pathlib import Path
7
+ import json
8
+ import base64
9
+ import pycountry
10
+ import logging
11
+ from functools import lru_cache
12
+ from typing import Optional, Dict, Any, List, Union, Tuple
13
+ from pydantic import BaseModel
14
+ from mistralai import Mistral
15
+ from mistralai import DocumentURLChunk, ImageURLChunk, TextChunk
16
+ from mistralai.models import OCRImageObject
17
+
18
+ # Configure logging
19
+ logging.basicConfig(level=logging.INFO,
20
+ format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
21
+
22
+ # Import utilities for OCR processing
23
+ try:
24
+ from ocr_utils import replace_images_in_markdown, get_combined_markdown
25
+ except ImportError:
26
+ # Define fallback functions if module not found
27
+ def replace_images_in_markdown(markdown_str, images_dict):
28
+ for img_name, base64_str in images_dict.items():
29
+ markdown_str = markdown_str.replace(
30
+ f"![{img_name}]({img_name})", f"![{img_name}]({base64_str})"
31
+ )
32
+ return markdown_str
33
+
34
+ def get_combined_markdown(ocr_response):
35
+ markdowns = []
36
+ for page in ocr_response.pages:
37
+ image_data = {}
38
+ for img in page.images:
39
+ image_data[img.id] = img.image_base64
40
+ markdowns.append(replace_images_in_markdown(page.markdown, image_data))
41
+ return "\n\n".join(markdowns)
42
+
43
+ # Import config directly (now local to historical-ocr)
44
+ from config import MISTRAL_API_KEY, OCR_MODEL, TEXT_MODEL, VISION_MODEL, TEST_MODE
45
+
46
+ # Helper function to make OCR objects JSON serializable
47
+ # Removed caching to fix unhashable type error
48
+ def serialize_ocr_response(obj):
49
+ """
50
+ Convert OCR response objects to JSON serializable format
51
+ Optimized for speed and memory usage
52
+ """
53
+ # Fast path: Handle primitive types directly
54
+ if obj is None or isinstance(obj, (str, int, float, bool)):
55
+ return obj
56
+
57
+ # Handle collections with optimized recursion
58
+ if isinstance(obj, list):
59
+ return [serialize_ocr_response(item) for item in obj]
60
+ elif isinstance(obj, dict):
61
+ return {k: serialize_ocr_response(v) for k, v in obj.items()}
62
+ elif hasattr(obj, '__dict__'):
63
+ # For OCR objects with __dict__ attribute
64
+ result = {}
65
+ for key, value in obj.__dict__.items():
66
+ if key.startswith('_'):
67
+ continue # Skip private attributes
68
+
69
+ # Fast path for OCRImageObject - most common complex object
70
+ if isinstance(value, OCRImageObject):
71
+ # Special handling for OCRImageObject with direct attribute access
72
+ result[key] = {
73
+ 'id': value.id if hasattr(value, 'id') else None,
74
+ 'image_base64': value.image_base64 if hasattr(value, 'image_base64') else None
75
+ }
76
+ # Handle collections
77
+ elif isinstance(value, list):
78
+ result[key] = [serialize_ocr_response(item) for item in value]
79
+ # Handle nested objects
80
+ elif hasattr(value, '__dict__'):
81
+ result[key] = serialize_ocr_response(value)
82
+ # Handle primitives and other types
83
+ else:
84
+ result[key] = value
85
+ return result
86
+ else:
87
+ return obj
88
+
89
+ # Create language enum for structured output - cache language lookup to avoid repeated processing
90
+ @lru_cache(maxsize=1)
91
+ def get_language_dict():
92
+ return {lang.alpha_2: lang.name for lang in pycountry.languages if hasattr(lang, 'alpha_2')}
93
+
94
+ class LanguageMeta(Enum.__class__):
95
+ def __new__(metacls, cls, bases, classdict):
96
+ languages = get_language_dict()
97
+ for code, name in languages.items():
98
+ classdict[name.upper().replace(' ', '_')] = name
99
+ return super().__new__(metacls, cls, bases, classdict)
100
+
101
+ class Language(Enum, metaclass=LanguageMeta):
102
+ pass
103
+
104
+ class StructuredOCRModel(BaseModel):
105
+ file_name: str
106
+ topics: list[str]
107
+ languages: list[Language]
108
+ ocr_contents: dict
109
+
110
+ class StructuredOCR:
111
+ def __init__(self, api_key=None):
112
+ """Initialize the OCR processor with API key"""
113
+ # Check if we're running in test mode
114
+ self.test_mode = TEST_MODE
115
+
116
+ # Initialize API key - use provided key, or environment var
117
+ if self.test_mode and not api_key:
118
+ self.api_key = "placeholder_key"
119
+ else:
120
+ self.api_key = api_key or MISTRAL_API_KEY
121
+
122
+ # Ensure we have a valid API key when not in test mode
123
+ if not self.api_key and not self.test_mode:
124
+ raise ValueError("No Mistral API key provided. Please set the MISTRAL_API_KEY environment variable or enable TEST_MODE.")
125
+
126
+ # Clean the API key by removing any whitespace
127
+ self.api_key = self.api_key.strip()
128
+
129
+ # Check if API key exists but don't enforce length requirements
130
+ if not self.test_mode and not self.api_key:
131
+ logger = logging.getLogger("api_validator")
132
+ logger.warning("Warning: No API key provided")
133
+
134
+ # Initialize client with the API key
135
+ try:
136
+ self.client = Mistral(api_key=self.api_key)
137
+ # Skip validation to avoid unnecessary API calls
138
+ except Exception as e:
139
+ error_msg = str(e).lower()
140
+ if "unauthorized" in error_msg or "401" in error_msg:
141
+ raise ValueError(f"API key authentication failed. Please check your Mistral API key: {str(e)}")
142
+ else:
143
+ raise
144
+
145
+ def process_file(self, file_path, file_type=None, use_vision=True, max_pages=None, file_size_mb=None, custom_pages=None, custom_prompt=None):
146
+ """Process a file and return structured OCR results
147
+
148
+ Args:
149
+ file_path: Path to the file to process
150
+ file_type: 'pdf' or 'image' (will be auto-detected if None)
151
+ use_vision: Whether to use vision model for improved analysis
152
+ max_pages: Optional limit on number of pages to process
153
+ file_size_mb: Optional file size in MB (used for automatic page limiting)
154
+ custom_pages: Optional list of specific page numbers to process
155
+ custom_prompt: Optional instructions for the AI to handle unusual document formatting or specific extraction needs
156
+
157
+ Returns:
158
+ Dictionary with structured OCR results
159
+ """
160
+ # Convert file_path to Path object if it's a string
161
+ file_path = Path(file_path)
162
+
163
+ # Auto-detect file type if not provided
164
+ if file_type is None:
165
+ suffix = file_path.suffix.lower()
166
+ file_type = "pdf" if suffix == ".pdf" else "image"
167
+
168
+ # Get file size if not provided
169
+ if file_size_mb is None and file_path.exists():
170
+ file_size_mb = file_path.stat().st_size / (1024 * 1024) # Convert bytes to MB
171
+
172
+ # Check if file exceeds API limits (50 MB)
173
+ if file_size_mb and file_size_mb > 50:
174
+ logging.warning(f"File size {file_size_mb:.2f} MB exceeds Mistral API limit of 50 MB")
175
+ return {
176
+ "file_name": file_path.name,
177
+ "topics": ["Document"],
178
+ "languages": ["English"],
179
+ "confidence_score": 0.0,
180
+ "error": f"File size {file_size_mb:.2f} MB exceeds API limit of 50 MB",
181
+ "ocr_contents": {
182
+ "error": f"Failed to process file: File size {file_size_mb:.2f} MB exceeds Mistral API limit of 50 MB",
183
+ "partial_text": "Document could not be processed due to size limitations."
184
+ }
185
+ }
186
+
187
+ # For PDF files, limit pages based on file size if no explicit limit is given
188
+ if file_type == "pdf" and file_size_mb and max_pages is None and custom_pages is None:
189
+ if file_size_mb > 100: # Very large files (unreachable: files over 50 MB are rejected above)
190
+ max_pages = 3
191
+ elif file_size_mb > 50: # Large files (unreachable for the same reason)
192
+ max_pages = 5
193
+ elif file_size_mb > 20: # Medium files
194
+ max_pages = 10
195
+ else: # Small files
196
+ max_pages = None # Process all pages
197
+
198
+ # Start processing timer
199
+ start_time = time.time()
200
+
201
+ # Read and process the file
202
+ if file_type == "pdf":
203
+ result = self._process_pdf(file_path, use_vision, max_pages, custom_pages, custom_prompt)
204
+ else:
205
+ result = self._process_image(file_path, use_vision, custom_prompt)
206
+
207
+ # Add processing time information
208
+ processing_time = time.time() - start_time
209
+ result['processing_time'] = processing_time
210
+
211
+ # Add a default confidence score if not present
212
+ if 'confidence_score' not in result:
213
+ result['confidence_score'] = 0.85 # Default confidence
214
+
215
+ # Ensure the entire result is fully JSON serializable by running it through our serializer
216
+ try:
217
+ # First convert to a standard dict if it's not already
218
+ if not isinstance(result, dict):
219
+ result = serialize_ocr_response(result)
220
+
221
+ # Make a final pass to check for any remaining non-serializable objects
222
+ # Test JSON serialization to catch any remaining issues
223
+ json.dumps(result)
224
+ except TypeError as e:
225
+ # If there's a serialization error, run the whole result through our serializer
226
+ logger = logging.getLogger("serializer")
227
+ logger.warning(f"JSON serialization error in result: {str(e)}. Applying full serialization.")
228
+ result = serialize_ocr_response(result)
229
+
230
+ return result
231
+
232
+ def _process_pdf(self, file_path, use_vision=True, max_pages=None, custom_pages=None, custom_prompt=None):
233
+ """
234
+ Process a PDF file with OCR - optimized version with smart page handling and memory management
235
+
236
+ Args:
237
+ file_path: Path to the PDF file
238
+ use_vision: Whether to use vision model for enhanced analysis
239
+ max_pages: Optional limit on the number of pages to process
240
+ custom_pages: Optional list of specific page numbers to process
241
+ custom_prompt: Optional custom prompt for specialized extraction
242
+ """
243
+ logger = logging.getLogger("pdf_processor")
244
+ logger.info(f"Processing PDF: {file_path}")
245
+
246
+ # Track processing time
247
+ start_time = time.time()
248
+
249
+ # Fast path: Return placeholder if in test mode
250
+ if self.test_mode:
251
+ logger.info("Test mode active, returning placeholder response")
252
+ # Enhanced test mode placeholder that's more realistic
253
+ return {
254
+ "file_name": file_path.name,
255
+ "topics": ["Historical Document", "Literature", "American History"],
256
+ "languages": ["English"],
257
+ "ocr_contents": {
258
+ "title": "Harper's New Monthly Magazine",
259
+ "publication_date": "1855",
260
+ "publisher": "Harper & Brothers, New York",
261
+ "raw_text": "This is a test mode placeholder for Harper's New Monthly Magazine from 1855. The actual document contains articles on literature, politics, science, and culture from mid-19th century America.",
262
+ "content": "The magazine includes various literary pieces, poetry, political commentary, and illustrations typical of 19th-century periodicals. Known for publishing works by prominent authors including Herman Melville and Charles Dickens.",
263
+ "key_figures": ["Herman Melville", "Charles Dickens", "Henry Wadsworth Longfellow"],
264
+ "noted_articles": ["Continued serialization of popular novels", "Commentary on contemporary political events", "Scientific discoveries and technological advancements"]
265
+ },
266
+ "pdf_processing_method": "enhanced_test_mode",
267
+ "total_pages": 12,
268
+ "processed_pages": 3,
269
+ "processing_time": 0.5,
270
+ "confidence_score": 0.9
271
+ }
272
+
273
+ try:
274
+ # PDF processing strategy decision based on file size
275
+ file_size_mb = file_path.stat().st_size / (1024 * 1024)
276
+ logger.info(f"PDF size: {file_size_mb:.2f} MB")
277
+
278
+ # Always use pdf2image for better control and consistency across all PDF files
279
+ use_pdf2image = True
280
+
281
+ # First try local PDF processing for better performance and control
282
+ if use_pdf2image:
283
+ try:
284
+ import tempfile
285
+ from pdf2image import convert_from_path
286
+
287
+ logger.info("Processing PDF using pdf2image for better multi-page handling")
288
+
289
+ # Convert PDF to images with optimized parameters
290
+ conversion_start = time.time()
291
+
292
+ # Use consistent DPI for all files to ensure reliable results
293
+ dpi = 200 # Higher quality DPI for all files to ensure better text recognition
294
+
295
+ # Only convert first page initially to check document type
296
+ pdf_first_page = convert_from_path(file_path, dpi=dpi, first_page=1, last_page=1)
297
+ logger.info(f"First page converted in {time.time() - conversion_start:.2f}s")
298
+
299
+ # Quick check if PDF has readable content
300
+ if not pdf_first_page:
301
+ logger.warning("PDF conversion produced no images, falling back to API")
302
+ raise Exception("PDF conversion failed to produce images")
303
+
304
+ # Determine total pages in the document
305
+ # First, try simple estimate from first page conversion
306
+ total_pages = 1
307
+
308
+ # Try pdf2image info extraction
309
+ try:
310
+ # Try with pdf2image page counting - use simpler parameters
311
+ logger.info("Determining PDF page count...")
312
+ count_start = time.time()
313
+
314
+ # Read the page count from the PDF metadata; convert_from_path only returns a list of images
+ from pdf2image import pdfinfo_from_path
+ pdf_info = pdfinfo_from_path(str(file_path))
+
+ # Extract page count
+ if isinstance(pdf_info, dict) and "Pages" in pdf_info:
+ total_pages = int(pdf_info["Pages"])
+ elif len(pdf_first_page) > 0:
+ # Fall back to the single page already converted
+ total_pages = 1
334
+
335
+ logger.info(f"Page count determined in {time.time() - count_start:.2f}s")
336
+ except Exception as count_error:
337
+ logger.warning(f"Error determining page count: {str(count_error)}. Using default of 1")
338
+ total_pages = 1
339
+
340
+ logger.info(f"PDF has {total_pages} total pages")
341
+
342
+ # Determine which pages to process
343
+ pages_to_process = []
344
+
345
+ # Handle custom page selection if provided
346
+ if custom_pages and any(0 < p <= total_pages for p in custom_pages):
347
+ # Filter valid page numbers
348
+ pages_to_process = [p for p in custom_pages if 0 < p <= total_pages]
349
+ logger.info(f"Processing {len(pages_to_process)} custom-selected pages: {pages_to_process}")
350
+ # Otherwise use max_pages limit if provided
351
+ elif max_pages and max_pages < total_pages:
352
+ pages_to_process = list(range(1, max_pages + 1))
353
+ logger.info(f"Processing first {max_pages} pages of {total_pages} total")
354
+ # Or process all pages if reasonable count
355
+ elif total_pages <= 10:
356
+ pages_to_process = list(range(1, total_pages + 1))
357
+ logger.info(f"Processing all {total_pages} pages")
358
+ # For large documents without limits, process subset of pages
359
+ else:
360
+ # Smart sampling: first page, last page, and some pages in between
361
+ pages_to_process = [1] # Always include first page
362
+
363
+ if total_pages > 1:
364
+ if total_pages <= 5:
365
+ # For few pages, process all
366
+ pages_to_process = list(range(1, total_pages + 1))
367
+ else:
368
+ # For many pages, sample intelligently
369
+ # Add pages from the middle of the document
370
+ middle = total_pages // 2
371
+ # Add last page if more than 3 pages
372
+ if total_pages > 3:
373
+ pages_to_process.append(total_pages)
374
+ # Add up to 3 pages from middle if document is large
375
+ if total_pages > 5:
376
+ pages_to_process.append(middle)
377
+ if total_pages > 10:
378
+ pages_to_process.append(middle // 2)
379
+ pages_to_process.append(middle + (middle // 2))
380
+
381
+ # Sort pages for sequential processing
382
+ pages_to_process = sorted(list(set(pages_to_process)))
383
+ logger.info(f"Processing {len(pages_to_process)} sampled pages out of {total_pages} total: {pages_to_process}")
384
+
385
+ # Convert only the selected pages to minimize memory usage
386
+ selected_images = []
387
+ combined_text = []
388
+
389
+ # Process pages in larger batches for better efficiency
390
+ batch_size = 5 # Process 5 pages at a time for better throughput
391
+ for i in range(0, len(pages_to_process), batch_size):
392
+ batch_pages = pages_to_process[i:i+batch_size]
393
+ logger.info(f"Converting batch of pages {batch_pages}")
394
+
395
+ # Convert batch of pages with multi-threading for better performance
396
+ batch_start = time.time()
397
+ batch_images = convert_from_path(
398
+ file_path,
399
+ dpi=dpi,
400
+ first_page=min(batch_pages),
401
+ last_page=max(batch_pages),
402
+ thread_count=4, # Use multi-threading for faster PDF processing
403
+ fmt="jpeg" # Use JPEG format for better compatibility
404
+ )
405
+ logger.info(f"Batch conversion completed in {time.time() - batch_start:.2f}s")
406
+
407
+ # Map converted images to requested page numbers
408
+ for idx, page_num in enumerate(range(min(batch_pages), max(batch_pages) + 1)):
409
+ if page_num in pages_to_process and idx < len(batch_images):
410
+ if page_num == pages_to_process[0]: # First page to process
411
+ selected_images.append(batch_images[idx])
412
+
413
+ # Process each page individually
414
+ with tempfile.NamedTemporaryFile(suffix='.jpeg', delete=False) as tmp:
415
+ batch_images[idx].save(tmp.name, format='JPEG')
416
+ # Simple OCR to extract text
417
+ try:
418
+ page_result = self._process_image(Path(tmp.name), False, None)
419
+ if 'ocr_contents' in page_result and 'raw_text' in page_result['ocr_contents']:
420
+ # Add page text to combined text
421
+ page_text = page_result['ocr_contents']['raw_text']
422
+ combined_text.append(f"--- PAGE {page_num} ---\n{page_text}")
423
+ except Exception as page_e:
424
+ logger.warning(f"Error processing page {page_num}: {str(page_e)}")
425
+ # Clean up temp file
426
+ # os is already imported at module level
427
+ os.unlink(tmp.name)
428
+
429
+ # If we have processed pages
430
+ if selected_images and combined_text:
431
+ # Save first image to temp file for vision model
432
+ with tempfile.NamedTemporaryFile(suffix='.jpeg', delete=False) as tmp:
433
+ selected_images[0].save(tmp.name, format='JPEG', quality=95)
434
+ first_image_path = tmp.name
435
+
436
+ # Combine all extracted text
437
+ all_text = "\n\n".join(combined_text)
438
+
439
+ # For custom prompts, use specialized processing
440
+ if custom_prompt:
441
+ try:
442
+ # Process image with vision model
443
+ result = self._process_image(Path(first_image_path), use_vision, None)
444
+
445
+ # Enhance with text analysis using combined text from all pages
446
+ enhanced_result = self._extract_structured_data_text_only(all_text, file_path.name, custom_prompt)
447
+
448
+ # Merge results, keeping images from original result
449
+ for key, value in enhanced_result.items():
450
+ if key not in ('raw_response_data', 'pages_data', 'has_images'):
451
+ result[key] = value
452
+
453
+ # Update raw text with full document text
454
+ if 'ocr_contents' in result:
455
+ result['ocr_contents']['raw_text'] = all_text
456
+
457
+ except Exception as e:
458
+ logger.warning(f"Custom prompt processing failed: {str(e)}. Using standard processing.")
459
+ # Fall back to standard processing
460
+ result = self._process_image(Path(first_image_path), use_vision, None)
461
+ if 'ocr_contents' in result:
462
+ result['ocr_contents']['raw_text'] = all_text
463
+ else:
464
+ # Standard processing with combined text
465
+ result = self._process_image(Path(first_image_path), use_vision, None)
466
+ if 'ocr_contents' in result:
467
+ result['ocr_contents']['raw_text'] = all_text
468
+
469
+ # Add PDF metadata
470
+ result['file_name'] = file_path.name
471
+ result['pdf_processing_method'] = 'pdf2image_optimized'
472
+ result['total_pages'] = total_pages
473
+ result['processed_pages'] = len(pages_to_process)
474
+ result['pages_processed'] = pages_to_process
475
+
476
+ # Add processing info
477
+ result['processing_info'] = {
478
+ 'method': 'local_pdf_processing',
479
+ 'dpi': dpi,
480
+ 'pages_sampled': pages_to_process,
481
+ 'processing_time': time.time() - start_time
482
+ }
483
+
484
+ # Clean up
485
+ os.unlink(first_image_path)
486
+
487
+ return result
488
+ else:
489
+ logger.warning("No pages successfully processed with pdf2image, falling back to API")
490
+ raise Exception("Failed to process PDF pages locally")
491
+
492
+ except Exception as pdf2image_error:
493
+ logger.warning(f"Local PDF processing failed, falling back to API: {str(pdf2image_error)}")
494
+ # Fall back to API processing
495
+
496
+ # API-based PDF processing
497
+ logger.info("Processing PDF via Mistral API")
498
+
499
+ # Optimize file upload for faster processing
500
+ logger.info("Uploading PDF file to Mistral API")
501
+ upload_start = time.time()
502
+
503
+ # Set appropriate timeout based on file size
504
+ upload_timeout = max(60, min(300, int(file_size_mb * 5))) # 60s to 300s based on size
505
+
506
+ try:
507
+ # Upload the file (Mistral client doesn't support timeout parameter for upload)
508
+ uploaded_file = self.client.files.upload(
509
+ file={
510
+ "file_name": file_path.stem,
511
+ "content": file_path.read_bytes(),
512
+ },
513
+ purpose="ocr"
514
+ )
515
+
516
+ logger.info(f"PDF uploaded in {time.time() - upload_start:.2f}s")
517
+
518
+ # Get a signed URL for the uploaded file
519
+ signed_url = self.client.files.get_signed_url(file_id=uploaded_file.id, expiry=1)
520
+
521
+ # Process the PDF with OCR - use adaptive timeout based on file size
522
+ logger.info(f"Processing PDF with OCR using {OCR_MODEL}")
523
+
524
+ # Adaptive retry strategy based on file size
525
+ max_retries = 3 if file_size_mb < 20 else 2 # Fewer retries for large files
526
+ base_retry_delay = 1 if file_size_mb < 10 else 2 # Longer delays for large files
527
+
528
+ # Adaptive timeout based on file size
529
+ ocr_timeout_ms = min(180000, max(60000, int(file_size_mb * 3000))) # 60s to 180s
530
+
531
+ # Try processing with retries
532
+ for retry in range(max_retries):
533
+ try:
534
+ ocr_start = time.time()
535
+ pdf_response = self.client.ocr.process(
536
+ document=DocumentURLChunk(document_url=signed_url.url),
537
+ model=OCR_MODEL,
538
+ include_image_base64=True,
539
+ timeout_ms=ocr_timeout_ms
540
+ )
541
+ logger.info(f"PDF OCR processing completed in {time.time() - ocr_start:.2f}s")
542
+ break # Success, exit retry loop
543
+ except Exception as e:
544
+ error_msg = str(e)
545
+ logger.warning(f"API error on attempt {retry+1}/{max_retries}: {error_msg}")
546
+
547
+ # Handle errors with optimized retry logic
548
+ error_lower = error_msg.lower()
549
+
550
+ # Authentication errors - no point in retrying
551
+ if any(term in error_lower for term in ["unauthorized", "401", "403", "authentication"]):
552
+ logger.error("API authentication failed. Check your API key.")
553
+ raise ValueError(f"Authentication failed. Please verify your Mistral API key: {error_msg}")
554
+
555
+ # Connection or server errors - worth retrying
556
+ elif any(term in error_lower for term in ["connection", "timeout", "520", "server error", "502", "503", "504"]):
557
+ if retry < max_retries - 1:
558
+ # Exponential backoff with jitter for better retry behavior
559
+ wait_time = base_retry_delay * (2 ** retry) * (0.8 + 0.4 * random.random())
560
+ logger.info(f"Connection issue detected. Waiting {wait_time:.1f}s before retry...")
561
+ time.sleep(wait_time)
562
+ else:
563
+ # Last retry failed
564
+ logger.error("Maximum retries reached, API connection error persists.")
565
+ raise ValueError(f"Could not connect to Mistral API after {max_retries} attempts: {error_msg}")
566
+
567
+ # Rate limit errors - much longer wait
568
+ elif any(term in error_lower for term in ["rate limit", "429", "too many requests", "requests rate limit exceeded"]):
569
+ # Check specifically for token exhaustion vs temporary rate limit
570
+ if "quota" in error_lower or "credit" in error_lower or "subscription" in error_lower:
571
+ logger.error("API quota or credit limit reached. No retry will help.")
572
+ raise ValueError(f"Mistral API quota or credit limit reached. Please check your subscription: {error_msg}")
573
+ elif retry < max_retries - 1:
574
+ wait_time = base_retry_delay * (2 ** retry) * 6.0 # Significantly longer wait for rate limits
575
+ logger.info(f"Rate limit exceeded. Waiting {wait_time:.1f}s before retry...")
576
+ time.sleep(wait_time)
577
+ else:
578
+ logger.error("Maximum retries reached, rate limit error persists.")
579
+ raise ValueError(f"API rate limit exceeded. Please try again later: {error_msg}")
580
+
581
+ # Misc errors - typically no retry will help
582
+ else:
583
+ if retry < max_retries - 1 and any(term in error_lower for term in ["transient", "temporary"]):
584
+ # Only retry for errors explicitly marked as transient
585
+ wait_time = base_retry_delay * (2 ** retry)
586
+ logger.info(f"Transient error detected. Waiting {wait_time:.1f}s before retry...")
587
+ time.sleep(wait_time)
588
+ else:
589
+ logger.error(f"Unrecoverable API error: {error_msg}")
590
+ raise
591
+
592
+ # Calculate the number of pages to process
593
+ pages_to_process = pdf_response.pages
594
+ total_pages = len(pdf_response.pages)
595
+ limited_pages = False
596
+
597
+ logger.info(f"API returned {total_pages} total PDF pages")
598
+
599
+ # Smart page selection logic for better performance
600
+ if custom_pages:
601
+ # Convert to 0-based indexing and filter valid page numbers
602
+ valid_indices = [i-1 for i in custom_pages if 0 < i <= total_pages]
603
+ if valid_indices:
604
+ pages_to_process = [pdf_response.pages[i] for i in valid_indices]
605
+ limited_pages = True
606
+ logger.info(f"Processing {len(valid_indices)} custom-selected pages")
607
+ # Max pages limit with smart sampling
608
+ elif max_pages and total_pages > max_pages:
609
+ if max_pages == 1:
610
+ # Just first page
611
+ pages_to_process = pages_to_process[:1]
612
+ elif max_pages < 5 and total_pages > 10:
613
+ # For small max_pages on large docs, include first, last, and middle
614
+ indices = [0] # First page
615
+ if max_pages > 1:
616
+ indices.append(total_pages - 1) # Last page
617
+ if max_pages > 2:
618
+ indices.append(total_pages // 2) # Middle page
619
+ # Add more pages up to max_pages if needed
620
+ if max_pages > 3:
621
+ remaining = max_pages - len(indices)
622
+ step = total_pages // (remaining + 1)
623
+ for i in range(1, remaining + 1):
624
+ idx = i * step
625
+ if idx not in indices and 0 <= idx < total_pages:
626
+ indices.append(idx)
627
+ indices.sort()
628
+ pages_to_process = [pdf_response.pages[i] for i in indices]
629
+ else:
630
+ # Default: first max_pages
631
+ pages_to_process = pages_to_process[:max_pages]
632
+
633
+ limited_pages = True
634
+ logger.info(f"Processing {len(pages_to_process)} pages out of {total_pages} total")
635
+
636
+ # Calculate confidence score if available
637
+ try:
638
+ confidence_values = [page.confidence for page in pages_to_process if hasattr(page, 'confidence')]
639
+ confidence_score = sum(confidence_values) / len(confidence_values) if confidence_values else 0.89
640
+ except Exception:
641
+ confidence_score = 0.89 # Improved default
642
+
643
+ # Merge page content intelligently - include page numbers for better context
644
+ all_markdown = []
645
+ for idx, page in enumerate(pages_to_process):
646
+ # Try to determine actual page number
647
+ if custom_pages and len(custom_pages) == len(pages_to_process):
648
+ page_num = custom_pages[idx]
649
+ else:
650
+ # Estimate page number - may not be accurate with sampling
651
+ page_num = idx + 1
652
+
653
+ page_markdown = page.markdown if hasattr(page, 'markdown') else ""
654
+ # Add page header if content exists
655
+ if page_markdown.strip():
656
+ all_markdown.append(f"--- PAGE {page_num} ---\n{page_markdown}")
657
+
658
+ # Join all pages with separation
659
+ combined_markdown = "\n\n".join(all_markdown)
660
+
661
+ # Extract structured data with the appropriate model
662
+ if use_vision:
663
+ # Try to get a good image for vision model
664
+ vision_image = None
665
+
666
+ # Try first page with images
667
+ for page in pages_to_process:
668
+ if hasattr(page, 'images') and page.images:
669
+ vision_image = page.images[0].image_base64
670
+ break
671
+
672
+ if vision_image:
673
+ # Use vision model with enhanced prompt
674
+ logger.info(f"Using vision model: {VISION_MODEL}")
675
+ result = self._extract_structured_data_with_vision(
676
+ vision_image, combined_markdown, file_path.name, custom_prompt
677
+ )
678
+ else:
679
+ # Fall back to text-only if no images available
680
+ logger.info(f"No images in PDF, falling back to text model: {TEXT_MODEL}")
681
+ result = self._extract_structured_data_text_only(
682
+ combined_markdown, file_path.name, custom_prompt
683
+ )
684
+ else:
685
+ # Use text-only model as requested
686
+ logger.info(f"Using text-only model as specified: {TEXT_MODEL}")
687
+ result = self._extract_structured_data_text_only(
688
+ combined_markdown, file_path.name, custom_prompt
689
+ )
690
+
691
+ # Add metadata about pages
692
+ if limited_pages:
693
+ result['limited_pages'] = {
694
+ 'processed': len(pages_to_process),
695
+ 'total': total_pages
696
+ }
697
+
698
+ # Set confidence score from OCR
699
+ result['confidence_score'] = confidence_score
700
+
701
+ # Add processing method info
702
+ result['pdf_processing_method'] = 'api'
703
+ result['total_pages'] = total_pages
704
+ result['processed_pages'] = len(pages_to_process)
705
+
706
+ # Store serialized OCR response for rendering
707
+ serialized_response = serialize_ocr_response(pdf_response)
708
+ result['raw_response_data'] = serialized_response
709
+
710
+ # Check if there are images to include
711
+ has_images = hasattr(pdf_response, 'pages') and any(
712
+ hasattr(page, 'images') and page.images for page in pdf_response.pages
713
+ )
714
+ result['has_images'] = has_images
715
+
716
+ # Include image data for rendering if available
717
+ if has_images:
718
+ # Prepare pages data with image references
719
+ result['pages_data'] = []
720
+
721
+ # Get serialized pages - handle different formats
722
+ serialized_pages = None
723
+ try:
724
+ if hasattr(serialized_response, 'pages'):
725
+ serialized_pages = serialized_response.pages
726
+ elif isinstance(serialized_response, dict) and 'pages' in serialized_response:
727
+ serialized_pages = serialized_response.get('pages', [])
728
+ else:
729
+ # No pages found in response
730
+ logger.warning("No pages found in OCR response")
731
+ serialized_pages = []
732
+ except Exception as pages_err:
733
+ logger.warning(f"Error extracting pages from OCR response: {str(pages_err)}")
734
+ serialized_pages = []
735
+
736
+ # Process each page to extract images
737
+ for page_idx, page in enumerate(serialized_pages):
738
+ try:
739
+ # Keep only the first len(pages_to_process) serialized pages (an approximation: sampled or custom-selected pages are not matched by index)
740
+ if limited_pages and page_idx >= len(pages_to_process):
741
+ continue
742
+
743
+ # Extract page data with careful error handling
744
+ markdown = ""
745
+ images = []
746
+
747
+ # Handle different page formats safely
748
+ if isinstance(page, dict):
749
+ markdown = page.get('markdown', '')
750
+ images = page.get('images', [])
751
+ else:
752
+ # Try attribute access
753
+ if hasattr(page, 'markdown'):
754
+ markdown = page.markdown
755
+ if hasattr(page, 'images'):
756
+ images = page.images
757
+
758
+ # Create page data record
759
+ page_data = {
760
+ 'page_number': page_idx + 1,
761
+ 'markdown': markdown,
762
+ 'images': []
763
+ }
764
+
765
+ # Process images with careful error handling
766
+ for img_idx, img in enumerate(images):
767
+ try:
768
+ # Extract image ID and base64 data
769
+ img_id = None
770
+ img_base64 = None
771
+
772
+ if isinstance(img, dict):
773
+ img_id = img.get('id')
774
+ img_base64 = img.get('image_base64')
775
+ else:
776
+ # Try attribute access
777
+ if hasattr(img, 'id'):
778
+ img_id = img.id
779
+ if hasattr(img, 'image_base64'):
780
+ img_base64 = img.image_base64
781
+
782
+ # Only add if we have valid image data
783
+ if img_base64 and isinstance(img_base64, str):
784
+ # Ensure ID exists
785
+ safe_id = img_id if img_id else f"img_{page_idx}_{img_idx}"
786
+ page_data['images'].append({
787
+ 'id': safe_id,
788
+ 'image_base64': img_base64
789
+ })
790
+ except Exception as img_err:
791
+ logger.warning(f"Error processing image {img_idx} on page {page_idx+1}: {str(img_err)}")
792
+ continue # Skip this image
793
+
794
+ # Add page data if it has content
795
+ if page_data['markdown'] or page_data['images']:
796
+ result['pages_data'].append(page_data)
797
+
798
+ except Exception as page_err:
799
+ logger.warning(f"Error processing page {page_idx+1}: {str(page_err)}")
800
+ continue # Skip this page
801
+
802
+ # Record final processing time
803
+ total_time = time.time() - start_time
804
+ result['processing_time'] = total_time
805
+ logger.info(f"PDF API processing completed in {total_time:.2f}s")
806
+
807
+ return result
808
+
809
+ except Exception as api_e:
810
+ logger.error(f"Error in API-based PDF processing: {str(api_e)}")
811
+ # Re-raise to be caught by outer exception handler
812
+ raise
813
+
814
+ except Exception as e:
815
+ # Log the error and return a helpful error result
816
+ logger.error(f"Error processing PDF: {str(e)}")
817
+
818
+ # Return basic result on error
819
+ return {
820
+ "file_name": file_path.name,
821
+ "topics": ["Document"],
822
+ "languages": ["English"],
823
+ "confidence_score": 0.0,
824
+ "error": str(e),
825
+ "ocr_contents": {
826
+ "error": f"Failed to process PDF: {str(e)}",
827
+ "partial_text": "Document could not be fully processed."
828
+ },
829
+ "processing_time": time.time() - start_time
830
+ }
831
+
832
+ def _process_image(self, file_path, use_vision=True, custom_prompt=None):
833
+ """Process an image file with OCR"""
834
+ logger = logging.getLogger("image_processor")
835
+ logger.info(f"Processing image: {file_path}")
836
+
837
+ # Check if we're in test mode
838
+ if self.test_mode:
839
+ # Return a placeholder document response
840
+ return {
841
+ "file_name": file_path.name,
842
+ "topics": ["Document"],
843
+ "languages": ["English"],
844
+ "ocr_contents": {
845
+ "title": "Document",
846
+ "content": "Please set up API key to process documents."
847
+ },
848
+ "processing_time": 0.5,
849
+ "confidence_score": 0.0
850
+ }
851
+
852
+ try:
853
+ # Check file size
854
+ file_size_mb = file_path.stat().st_size / (1024 * 1024)
855
+ logger.info(f"Original image size: {file_size_mb:.2f} MB")
856
+
857
+ # Use enhanced preprocessing functions from ocr_utils
858
+ try:
859
+ from ocr_utils import preprocess_image_for_ocr, IMAGE_PREPROCESSING
860
+
861
+ logger.info("Applying advanced image preprocessing for OCR")
862
+
863
+ # Get preprocessing settings from config
864
+ max_size_mb = IMAGE_PREPROCESSING.get("max_size_mb", 8.0)
865
+
866
+ if file_size_mb > max_size_mb:
867
+ logger.info(f"Image is large ({file_size_mb:.2f} MB), optimizing for API submission")
868
+
869
+ # Preprocess image with document-type detection and appropriate enhancements
870
+ _, base64_data_url = preprocess_image_for_ocr(file_path)
871
+
872
+ logger.info("Image preprocessing completed successfully")
873
+
874
+ except (ImportError, AttributeError) as e:
875
+ # Fallback to basic processing if advanced functions not available
876
+ logger.warning(f"Advanced preprocessing not available: {str(e)}. Using basic image processing.")
877
+
878
+ # If image is larger than 8MB, resize it to reduce API payload size
879
+ if file_size_mb > 8:
880
+ logger.info("Image is large, resizing before API submission")
881
+ try:
882
+ from PIL import Image
883
+ import io
884
+
885
+ # Open and process the image
886
+ with Image.open(file_path) as img:
887
+ # Convert to RGB if not already (prevents mode errors)
888
+ if img.mode != 'RGB':
889
+ img = img.convert('RGB')
890
+
891
+ # Calculate new dimensions (maintain aspect ratio)
892
+ # Cap the longest side at 2000 pixels (target_dimension below) to balance OCR quality and payload size
893
+ width, height = img.size
894
+ max_dimension = max(width, height)
895
+ target_dimension = 2000 # Restored to 2000 for better image quality
896
+
897
+ if max_dimension > target_dimension:
898
+ scale_factor = target_dimension / max_dimension
899
+ resized_width = int(width * scale_factor)
900
+ resized_height = int(height * scale_factor)
901
+ # Use LANCZOS instead of BILINEAR for better quality
902
+ img = img.resize((resized_width, resized_height), Image.LANCZOS)
903
+
904
+ # Enhance contrast for better text recognition
905
+ from PIL import ImageEnhance
906
+ enhancer = ImageEnhance.Contrast(img)
907
+ img = enhancer.enhance(1.3)
908
+
909
+ # Save to bytes with compression
910
+ buffer = io.BytesIO()
911
+ img.save(buffer, format="JPEG", quality=92, optimize=True) # Higher quality for better OCR
912
+ buffer.seek(0)
913
+
914
+ # Get the base64
915
+ encoded_image = base64.b64encode(buffer.getvalue()).decode()
916
+ base64_data_url = f"data:image/jpeg;base64,{encoded_image}"
917
+
918
+ # Log the new size
919
+ new_size_mb = len(buffer.getvalue()) / (1024 * 1024)
920
+ logger.info(f"Resized image to {new_size_mb:.2f} MB")
921
+ except ImportError:
922
+ logger.warning("PIL not available for resizing. Using original image.")
923
+ encoded_image = base64.b64encode(file_path.read_bytes()).decode()
924
+ base64_data_url = f"data:image/jpeg;base64,{encoded_image}"
925
+ except Exception as e:
926
+ logger.warning(f"Image resize failed: {str(e)}. Using original image.")
927
+ encoded_image = base64.b64encode(file_path.read_bytes()).decode()
928
+ base64_data_url = f"data:image/jpeg;base64,{encoded_image}"
929
+ else:
930
+ # For smaller images, use as-is
931
+ encoded_image = base64.b64encode(file_path.read_bytes()).decode()
932
+ base64_data_url = f"data:image/jpeg;base64,{encoded_image}"
933
+ except Exception as e:
934
+ # Fallback to original image if any preprocessing fails
935
+ logger.warning(f"Image preprocessing failed: {str(e)}. Using original image.")
936
+ encoded_image = base64.b64encode(file_path.read_bytes()).decode()
937
+ base64_data_url = f"data:image/jpeg;base64,{encoded_image}"
938
+
939
+ # Process the image with OCR
940
+ logger.info(f"Processing image with OCR using {OCR_MODEL}")
941
+
942
+ # Add retry logic with more retries and longer backoff periods for rate limit issues
943
+ max_retries = 4 # Increased from 2 to give more chances to succeed
944
+ retry_delay = 2 # Increased from 1 to allow for longer backoff periods
945
+
946
+ for retry in range(max_retries):
947
+ try:
948
+ image_response = self.client.ocr.process(
949
+ document=ImageURLChunk(image_url=base64_data_url),
950
+ model=OCR_MODEL,
951
+ include_image_base64=True,
952
+ timeout_ms=90000 # 90 second timeout for better success rate
953
+ )
954
+ break # Success, exit retry loop
955
+ except Exception as e:
956
+ error_msg = str(e)
957
+ logger.warning(f"API error on attempt {retry+1}/{max_retries}: {error_msg}")
958
+
959
+ # Check specific error types to handle them appropriately
960
+ error_lower = error_msg.lower()
961
+
962
+ # Authentication errors - no point in retrying
963
+ if "unauthorized" in error_lower or "401" in error_lower:
964
+ logger.error("API authentication failed. Check your API key.")
965
+ raise ValueError(f"Authentication failed with API key. Please verify your Mistral API key is correct and active: {error_msg}")
966
+
967
+ # Connection errors - worth retrying
968
+ elif "connection" in error_lower or "timeout" in error_lower or "520" in error_msg or "server error" in error_lower:
969
+ if retry < max_retries - 1:
970
+ # Exponential backoff: wait retry_delay * 2**retry seconds before retrying
971
+ wait_time = retry_delay * (2 ** retry)
972
+ logger.info(f"Connection issue detected. Waiting {wait_time}s before retry...")
973
+ time.sleep(wait_time)
974
+ else:
975
+ # Last retry failed
976
+ logger.error("Maximum retries reached, API connection error persists.")
977
+ raise ValueError(f"Could not connect to Mistral API after {max_retries} attempts: {error_msg}")
978
+
979
+ # Rate limit errors
980
+ elif "rate limit" in error_lower or "429" in error_lower:
981
+ # Check specifically for token exhaustion vs temporary rate limit
982
+ if "quota" in error_lower or "credit" in error_lower or "subscription" in error_lower:
983
+ logger.error("API quota or credit limit reached. No retry will help.")
984
+ raise ValueError(f"Mistral API quota or credit limit reached. Please check your subscription: {error_msg}")
985
+ elif retry < max_retries - 1:
986
+ # More aggressive backoff for rate limits
987
+ wait_time = retry_delay * (2 ** retry) * 5 # 5x longer wait for rate limits
988
+ logger.info(f"Rate limit exceeded. Waiting {wait_time}s before retry...")
989
+ time.sleep(wait_time)
990
+ else:
991
+ # Last retry failed, try local OCR as fallback
992
+ logger.error("Maximum retries reached, rate limit error persists.")
993
+ try:
994
+ # Try to import the local OCR fallback function
995
+ from ocr_utils import try_local_ocr_fallback
996
+
997
+ # Attempt local OCR fallback
998
+ ocr_text = try_local_ocr_fallback(file_path, base64_data_url)
999
+
1000
+ if ocr_text:
1001
+ logger.info("Successfully used local OCR fallback")
1002
+ # Return a basic result with the local OCR text
1003
+ return {
1004
+ "file_name": file_path.name,
1005
+ "topics": ["Document"],
1006
+ "languages": ["English"],
1007
+ "ocr_contents": {
1008
+ "title": "Document (Local OCR)",
1009
+ "content": "This document was processed with local OCR due to API rate limiting.",
1010
+ "raw_text": ocr_text
1011
+ },
1012
+ "processing_method": "local_fallback",
1013
+ "processing_note": "Used local OCR due to API rate limit"
1014
+ }
1015
+ except Exception as local_err: # Also covers ImportError
1016
+ logger.warning(f"Local OCR fallback failed: {str(local_err)}")
1017
+
1018
+ # If we get here, both API and local OCR failed
1019
+ raise ValueError(f"Mistral API rate limit exceeded. Please try again later: {error_msg}")
1020
+
1021
+ # Other errors - no retry
1022
+ else:
1023
+ logger.error(f"Unrecoverable API error: {error_msg}")
1024
+ raise
1025
+
1026
+ # Get the OCR markdown from the first page
1027
+ image_ocr_markdown = image_response.pages[0].markdown if image_response.pages else ""
1028
+
1029
+ # Optimize: Skip vision model step if ocr_markdown is very small or empty
1030
+ if not image_ocr_markdown or len(image_ocr_markdown) < 50:
1031
+ logger.warning("OCR produced minimal or no text. Returning basic result.")
1032
+ return {
1033
+ "file_name": file_path.name,
1034
+ "topics": ["Document"],
1035
+ "languages": ["English"],
1036
+ "ocr_contents": {
1037
+ "raw_text": image_ocr_markdown if image_ocr_markdown else "No text could be extracted from the image."
1038
+ },
1039
+ "processing_note": "OCR produced minimal text content"
1040
+ }
1041
+
1042
+ # Extract structured data using the appropriate model, with a single API call
1043
+ if use_vision:
1044
+ logger.info(f"Using vision model: {VISION_MODEL}")
1045
+ result = self._extract_structured_data_with_vision(base64_data_url, image_ocr_markdown, file_path.name, custom_prompt)
1046
+ else:
1047
+ logger.info(f"Using text-only model: {TEXT_MODEL}")
1048
+ result = self._extract_structured_data_text_only(image_ocr_markdown, file_path.name, custom_prompt)
1049
+
1050
+ # Store the serialized OCR response for image rendering (for compatibility with original version)
1051
+ # Don't store raw_response directly as it's not JSON serializable
1052
+ serialized_response = serialize_ocr_response(image_response)
1053
+ result['raw_response_data'] = serialized_response
1054
+
1055
+ # Store key parts of the OCR response for image rendering
1056
+ # With serialized format that can be stored in JSON
1057
+ has_images = bool(hasattr(image_response, 'pages') and image_response.pages and hasattr(image_response.pages[0], 'images') and image_response.pages[0].images)
1058
+ result['has_images'] = has_images
1059
+
1060
+ if has_images:
1061
+ # Serialize the entire response to ensure it's JSON serializable
1062
+ serialized_response = serialize_ocr_response(image_response)
1063
+
1064
+ # Create a structured representation of images that can be serialized
1065
+ result['pages_data'] = []
1066
+
1067
+ if hasattr(serialized_response, 'pages'):
1068
+ serialized_pages = serialized_response.pages
1069
+ else:
1070
+ # Handle case where serialization returns a dict instead of an object
1071
+ serialized_pages = serialized_response.get('pages', [])
1072
+
1073
+ for page_idx, page in enumerate(serialized_pages):
1074
+ # Handle both object and dict forms
1075
+ if isinstance(page, dict):
1076
+ markdown = page.get('markdown', '')
1077
+ images = page.get('images', [])
1078
+ else:
1079
+ markdown = page.markdown if hasattr(page, 'markdown') else ''
1080
+ images = page.images if hasattr(page, 'images') else []
1081
+
1082
+ page_data = {
1083
+ 'page_number': page_idx + 1,
1084
+ 'markdown': markdown,
1085
+ 'images': []
1086
+ }
1087
+
1088
+ # Extract images if present
1089
+ for img_idx, img in enumerate(images):
1090
+ img_id = None
1091
+ img_base64 = None
1092
+
1093
+ if isinstance(img, dict):
1094
+ img_id = img.get('id')
1095
+ img_base64 = img.get('image_base64')
1096
+ else:
1097
+ img_id = img.id if hasattr(img, 'id') else None
1098
+ img_base64 = img.image_base64 if hasattr(img, 'image_base64') else None
1099
+
1100
+ if img_base64:
1101
+ page_data['images'].append({
1102
+ 'id': img_id if img_id else f"img_{page_idx}_{img_idx}",
1103
+ 'image_base64': img_base64
1104
+ })
1105
+
1106
+ result['pages_data'].append(page_data)
1107
+
1108
+ logger.info("Image processing completed successfully")
1109
+ return result
1110
+
1111
+ except Exception as e:
1112
+ logger.error(f"Error processing image: {str(e)}")
1113
+ # Return basic result on error
1114
+ return {
1115
+ "file_name": file_path.name,
1116
+ "topics": ["Document"],
1117
+ "languages": ["English"],
1118
+ "error": str(e),
1119
+ "ocr_contents": {
1120
+ "error": f"Failed to process image: {str(e)}",
1121
+ "partial_text": "Image could not be processed."
1122
+ }
1123
+ }
1124
+
1125
+ def _extract_structured_data_with_vision(self, image_base64, ocr_markdown, filename, custom_prompt=None):
1126
+ """
1127
+ Extract structured data using vision model with detailed historical context prompting
1128
+ Optimized for speed, accuracy, and resilience
1129
+ """
1130
+ logger = logging.getLogger("vision_processor")
1131
+
1132
+ try:
1133
+ # Fast path: Skip vision API for minimal OCR text (saves an API call)
1134
+ if not ocr_markdown or len(ocr_markdown.strip()) < 100: # Increased threshold for better detection
1135
+ logger.info("Minimal OCR text detected, skipping vision model processing")
1136
+ return {
1137
+ "file_name": filename,
1138
+ "topics": ["Document"],
1139
+ "languages": ["English"],
1140
+ "ocr_contents": {
1141
+ "raw_text": ocr_markdown if ocr_markdown else "No text could be extracted"
1142
+ }
1143
+ }
1144
+
1145
+ # Fast path: Skip if in test mode or no API key
1146
+ if self.test_mode or not self.api_key:
1147
+ logger.info("Test mode or no API key, using text-only processing")
1148
+ return self._extract_structured_data_text_only(ocr_markdown, filename, custom_prompt)
1149
+
1150
+ # Detect document type with optimized cached implementation
1151
+ doc_type = self._detect_document_type(custom_prompt, ocr_markdown)
1152
+ logger.info(f"Detected document type: {doc_type}")
1153
+
1154
+ # Optimize OCR text for processing - focus on the first part which usually contains
1155
+ # the most important information (title, metadata, etc.)
1156
+ if len(ocr_markdown) > 8000:
1157
+ # Start with first 5000 chars
1158
+ first_part = ocr_markdown[:5000]
1159
+
1160
+ # Then add representative samples from different parts of the document
1161
+ # This captures headings and key information throughout
1162
+ middle_start = len(ocr_markdown) // 2 - 1000
1163
+ middle_part = ocr_markdown[middle_start:middle_start+2000] if middle_start > 0 else ""
1164
+
1165
+ # Get ending section if large enough
1166
+ if len(ocr_markdown) > 15000:
1167
+ end_part = ocr_markdown[-1000:]
1168
+ truncated_ocr = f"{first_part}\n...\n{middle_part}\n...\n{end_part}"
1169
+ else:
1170
+ truncated_ocr = f"{first_part}\n...\n{middle_part}"
1171
+
1172
+ logger.info(f"Truncated OCR text from {len(ocr_markdown)} to {len(truncated_ocr)} chars")
1173
+ else:
1174
+ truncated_ocr = ocr_markdown
1175
+
1176
+ # Build an optimized prompt based on document type
1177
+ enhanced_prompt = self._build_enhanced_prompt(doc_type, truncated_ocr, custom_prompt)
1178
+
1179
+ # Measure API call time for optimization feedback
1180
+ start_time = time.time()
1181
+
1182
+ try:
1183
+ # Try with enhanced timing parameters based on document complexity
1184
+ # Use shorter timeout for smaller documents
1185
+ timeout_ms = min(120000, max(60000, len(truncated_ocr) * 10)) # 60-120 seconds based on text length
1186
+
1187
+ logger.info(f"Calling vision model with {timeout_ms}ms timeout and document type {doc_type}")
1188
+ chat_response = self.client.chat.parse(
1189
+ model=VISION_MODEL,
1190
+ messages=[
1191
+ {
1192
+ "role": "user",
1193
+ "content": [
1194
+ ImageURLChunk(image_url=image_base64),
1195
+ TextChunk(text=enhanced_prompt)
1196
+ ],
1197
+ },
1198
+ ],
1199
+ response_format=StructuredOCRModel,
1200
+ temperature=0,
1201
+ timeout_ms=timeout_ms
1202
+ )
1203
+
1204
+ api_time = time.time() - start_time
1205
+ logger.info(f"Vision model completed in {api_time:.2f}s with document type: {doc_type}")
1206
+
1207
+ except Exception as e:
1208
+ # If there's an error with the enhanced prompt, try progressively simpler approaches
1209
+ logger.warning(f"Enhanced prompt failed after {time.time() - start_time:.2f}s: {str(e)}")
1210
+
1211
+ # Try a simplified approach with less context
1212
+ try:
1213
+ # Shorter prompt with less contextual information
1214
+ simplified_prompt = (
1215
+ f"You are an expert in historical document analysis. "
1216
+ f"Analyze this document image and the OCR text below. "
1217
+ f"<BEGIN_OCR>\n{truncated_ocr[:4000]}\n<END_OCR>\n"
1218
+ f"Identify the document type, main topics, languages used, and extract key information "
1219
+ f"including names, dates, places, and events. Return a structured JSON response."
1220
+ )
1221
+
1222
+ # Add custom prompt if provided
1223
+ if custom_prompt:
1224
+ simplified_prompt += f"\n\nAdditional instructions: {custom_prompt}"
1225
+
1226
+ logger.info("Trying simplified prompt approach")
1227
+ chat_response = self.client.chat.parse(
1228
+ model=VISION_MODEL,
1229
+ messages=[
1230
+ {
1231
+ "role": "user",
1232
+ "content": [
1233
+ ImageURLChunk(image_url=image_base64),
1234
+ TextChunk(text=simplified_prompt)
1235
+ ],
1236
+ },
1237
+ ],
1238
+ response_format=StructuredOCRModel,
1239
+ temperature=0,
1240
+ timeout_ms=60000 # Shorter timeout for simplified approach
1241
+ )
1242
+
1243
+ logger.info("Simplified prompt approach succeeded")
1244
+
1245
+ except Exception as second_e:
1246
+ # If that fails, try with minimal prompt and just image analysis
1247
+ logger.warning(f"Simplified prompt failed: {str(second_e)}. Trying minimal prompt.")
1248
+
1249
+ try:
1250
+ # Minimal prompt focusing on just the image
1251
+ minimal_prompt = (
1252
+ f"Analyze this historical document image. "
1253
+ f"Extract the document type, main topics, languages, and key information. "
1254
+ f"Provide your analysis in a structured JSON format."
1255
+ )
1256
+
1257
+ logger.info("Trying minimal prompt with image-only focus")
1258
+ chat_response = self.client.chat.parse(
1259
+ model=VISION_MODEL,
1260
+ messages=[
1261
+ {
1262
+ "role": "user",
1263
+ "content": [
1264
+ ImageURLChunk(image_url=image_base64),
1265
+ TextChunk(text=minimal_prompt)
1266
+ ],
1267
+ },
1268
+ ],
1269
+ response_format=StructuredOCRModel,
1270
+ temperature=0,
1271
+ timeout_ms=45000 # Even shorter timeout for minimal approach
1272
+ )
1273
+
1274
+ logger.info("Minimal prompt approach succeeded")
1275
+
1276
+ except Exception as third_e:
1277
+ # If all vision attempts fail, fall back to text-only model
1278
+ logger.warning(f"All vision model attempts failed, falling back to text-only model: {str(third_e)}")
1279
+ return self._extract_structured_data_text_only(ocr_markdown, filename, custom_prompt)
1280
+
1281
+ # Convert the response to a dictionary
1282
+ result = json.loads(chat_response.choices[0].message.parsed.json())
1283
+
1284
+ # Ensure languages is a list of strings, not Language enum objects
1285
+ if 'languages' in result:
1286
+ result['languages'] = [str(lang) for lang in result.get('languages', [])]
1287
+
1288
+ # Add metadata about processing
1289
+ result['processing_info'] = {
1290
+ 'method': 'vision_model',
1291
+ 'document_type': doc_type,
1292
+ 'ocr_text_length': len(ocr_markdown),
1293
+ 'api_response_time': time.time() - start_time
1294
+ }
1295
+
1296
+ # Add confidence score if not present
1297
+ if 'confidence_score' not in result:
1298
+ result['confidence_score'] = 0.92 # Fixed placeholder default, not a measured value
1299
+
1300
+ except Exception as e:
1301
+ # Fall back to text-only model if vision model fails
1302
+ logger.warning(f"Vision model processing failed, falling back to text-only model: {str(e)}")
1303
+ result = self._extract_structured_data_text_only(ocr_markdown, filename, custom_prompt)
1304
+
1305
+ return result
1306
+
1307
+ # Class-level document type detection cache (plain dict, no lock; best-effort under concurrent access)
1308
+ _doc_type_cache = {}
1309
+ _doc_type_cache_size = 256
1310
+
1311
+ @staticmethod
1312
+ def _detect_document_type_cached(custom_prompt: Optional[str], ocr_text_sample: str) -> str:
1313
+ """
1314
+ Cached version of document type detection logic (best-effort under concurrent access)
1315
+ """
1316
+ # Cache key: first 50 chars of the prompt and of the OCR sample (keeps keys small; inputs sharing both prefixes share one cache entry)
1317
+ prompt_key = str(custom_prompt)[:50] if custom_prompt else ""
1318
+ text_key = ocr_text_sample[:50] if ocr_text_sample else ""
1319
+ cache_key = f"{prompt_key}::{text_key}"
1320
+
1321
+ # Check cache first (fast path)
1322
+ if cache_key in StructuredOCR._doc_type_cache:
1323
+ return StructuredOCR._doc_type_cache[cache_key]
1324
+
1325
+ # Set default document type
1326
+ doc_type = "general"
1327
+
1328
+ # Keyword lists per document type; the first type with a matching keyword wins
1329
+ doc_type_patterns = {
1330
+ "handwritten": ["handwritten", "handwriting", "cursive", "manuscript"],
1331
+ "letter": ["letter", "correspondence", "message", "dear sir", "dear madam", "sincerely", "yours truly"],
1332
+ "legal": ["form", "contract", "agreement", "legal", "certificate", "court", "attorney", "plaintiff", "defendant"],
1333
+ "recipe": ["recipe", "food", "ingredients", "directions", "tbsp", "tsp", "cup", "mix", "bake", "cooking"],
1334
+ "travel": ["travel", "expedition", "journey", "exploration", "voyage", "destination", "map"],
1335
+ "scientific": ["scientific", "experiment", "hypothesis", "research", "study", "analysis", "results", "procedure"],
1336
+ "newspaper": ["news", "newspaper", "article", "press", "headline", "column", "editor"]
1337
+ }
1338
+
1339
+ # Fast custom prompt matching
1340
+ if custom_prompt:
1341
+ prompt_lower = custom_prompt.lower()
1342
+
1343
+ # Optimized pattern matching with early exit
1344
+ for detected_type, patterns in doc_type_patterns.items():
1345
+ if any(term in prompt_lower for term in patterns):
1346
+ doc_type = detected_type
1347
+ break
1348
+
1349
+ # Fast OCR text matching if still general type
1350
+ if doc_type == "general" and ocr_text_sample:
1351
+ ocr_lower = ocr_text_sample.lower()
1352
+
1353
+ # Use the same patterns dictionary for consistency, but scan the OCR text
1354
+ for detected_type, patterns in doc_type_patterns.items():
1355
+ if any(term in ocr_lower for term in patterns):
1356
+ doc_type = detected_type
1357
+ break
1358
+
1359
+ # When the cache is full, evict the oldest entries (FIFO by insertion order) before storing
1360
+ if len(StructuredOCR._doc_type_cache) >= StructuredOCR._doc_type_cache_size:
1361
+ # Clear multiple entries at once for better performance
1362
+ try:
1363
+ # Remove up to 20 entries to avoid frequent cache clearing
1364
+ for _ in range(20):
1365
+ if StructuredOCR._doc_type_cache:
1366
+ StructuredOCR._doc_type_cache.pop(next(iter(StructuredOCR._doc_type_cache)))
1367
+ except Exception:
1368
+ # If concurrent modification causes issues, just proceed
1369
+ pass
1370
+
1371
+ # Store in cache
1372
+ StructuredOCR._doc_type_cache[cache_key] = doc_type
1373
+
1374
+ return doc_type
1375
+
1376
+ def _detect_document_type(self, custom_prompt: Optional[str], ocr_text: str) -> str:
1377
+ """
1378
+ Detect document type based on content and custom prompt.
1379
+
1380
+ Args:
1381
+ custom_prompt: User-provided custom prompt
1382
+ ocr_text: OCR-extracted text
1383
+
1384
+ Returns:
1385
+ Document type identifier ("handwritten", "letter", "recipe", etc., or "general" if nothing matches)
1386
+ """
1387
+ # Only sample first 1000 characters of OCR text for faster processing while maintaining accuracy
1388
+ ocr_sample = ocr_text[:1000] if ocr_text else ""
1389
+
1390
+ # Use the cached version for better performance
1391
+ return self._detect_document_type_cached(custom_prompt, ocr_sample)
1392
+
1393
+ def _build_enhanced_prompt(self, doc_type: str, ocr_text: str, custom_prompt: Optional[str]) -> str:
1394
+ """
1395
+ Build an enhanced prompt based on document type.
1396
+
1397
+ Args:
1398
+ doc_type: Detected document type
1399
+ ocr_text: OCR-extracted text
1400
+ custom_prompt: User-provided custom prompt
1401
+
1402
+ Returns:
1403
+ Enhanced prompt optimized for the document type
1404
+ """
1405
+ # Generic document section (included in all prompts)
1406
+ generic_section = (
1407
+ f"This is a historical document's OCR text:\n"
1408
+ f"<BEGIN_OCR>\n{ocr_text}\n<END_OCR>\n\n"
1409
+ )
1410
+
1411
+ # Document-specific prompting
1412
+ if doc_type == "handwritten":
1413
+ specific_section = (
1414
+ f"You are an expert historian specializing in handwritten document transcription and analysis. "
1415
+ f"The OCR system has attempted to capture the handwriting, but may have made errors with cursive script "
1416
+ f"or unusual letter formations.\n\n"
1417
+ f"Pay careful attention to:\n"
1418
+ f"- Correcting OCR errors common in handwriting recognition\n"
1419
+ f"- Preserving the original document structure\n"
1420
+ f"- Identifying topics, language(s), and document type accurately\n"
1421
+ f"- Detecting any names, dates, places, or events mentioned\n"
1422
+ )
1423
+
1424
+ elif doc_type == "letter":
1425
+ specific_section = (
1426
+ f"You are an expert in historical correspondence analysis. "
1427
+ f"Analyze this letter as a historian would, identifying:\n"
1428
+ f"- Sender and recipient (if mentioned)\n"
1429
+ f"- Date and location of writing (if present)\n"
1430
+ f"- Key topics discussed\n"
1431
+ f"- Historical context and significance\n"
1432
+ f"- Sentiment and tone of the communication\n"
1433
+ f"- Closing formulations and signature\n"
1434
+ )
1435
+
1436
+ elif doc_type == "recipe":
1437
+ specific_section = (
1438
+ f"You are a culinary historian specializing in historical recipes. "
1439
+ f"Analyze this recipe document to extract:\n"
1440
+ f"- Recipe name/title\n"
1441
+ f"- Complete list of ingredients with measurements\n"
1442
+ f"- Preparation instructions in correct order\n"
1443
+ f"- Cooking time and temperature if mentioned\n"
1444
+ f"- Serving suggestions or yield information\n"
1445
+ f"- Any cultural or historical context provided\n"
1446
+ )
1447
+
1448
+ elif doc_type == "travel":
1449
+ specific_section = (
1450
+ f"You are a historian specializing in historical travel and exploration accounts. "
1451
+ f"Analyze this document to extract:\n"
1452
+ f"- Geographical locations mentioned\n"
1453
+ f"- Names of explorers, ships, or expeditions\n"
1454
+ f"- Dates and timelines\n"
1455
+ f"- Descriptions of indigenous peoples, cultures, or local conditions\n"
1456
+ f"- Natural features, weather, or navigational details\n"
1457
+ f"- Historical significance of the journey described\n"
1458
+ )
1459
+
1460
+ elif doc_type == "scientific":
1461
+ specific_section = (
1462
+ f"You are a historian of science specializing in historical scientific documents. "
1463
+ f"Analyze this document to extract:\n"
1464
+ f"- Scientific methodology described\n"
1465
+ f"- Observations, measurements, or data presented\n"
1466
+ f"- Scientific terminology of the period\n"
1467
+ f"- Experimental apparatus or tools mentioned\n"
1468
+ f"- Conclusions or hypotheses presented\n"
1469
+ f"- Historical significance within scientific development\n"
1470
+ )
1471
+
1472
+ elif doc_type == "newspaper":
1473
+ specific_section = (
1474
+ f"You are a media historian specializing in historical newspapers and publications. "
1475
+ f"Analyze this document to extract:\n"
1476
+ f"- Publication name and date if present\n"
1477
+ f"- Headlines and article titles\n"
1478
+ f"- Main news content with focus on events, people, and places\n"
1479
+ f"- Advertisement content if present\n"
1480
+ f"- Historical context and significance\n"
1481
+ f"- Editorial perspective or bias if detectable\n"
1482
+ )
1483
+
1484
+ elif doc_type == "legal":
1485
+ specific_section = (
1486
+ f"You are a legal historian specializing in historical legal documents. "
1487
+ f"Analyze this document to extract:\n"
1488
+ f"- Document type (contract, certificate, will, deed, etc.)\n"
1489
+ f"- Parties involved and their roles\n"
1490
+ f"- Key terms, conditions, or declarations\n"
1491
+ f"- Dates, locations, and jurisdictions mentioned\n"
1492
+ f"- Legal terminology of the period\n"
1493
+ f"- Signatures, witnesses, or official markings\n"
1494
+ )
1495
+
1496
+ else:
1497
+ # General historical document
1498
+ specific_section = (
1499
+ f"You are a historian specializing in historical document analysis. "
1500
+ f"Analyze this document to extract:\n"
1501
+ f"- Document type and purpose\n"
1502
+ f"- Time period and historical context\n"
1503
+ f"- Key topics, themes, and subjects\n"
1504
+ f"- People, places, and events mentioned\n"
1505
+ f"- Languages used and writing style\n"
1506
+ f"- Historical significance and connections\n"
1507
+ )
1508
+
1509
+ # Output instructions
1510
+ output_section = (
1511
+ f"Create a structured JSON response with the following fields:\n"
1512
+ f"- file_name: The document's name\n"
1513
+ f"- topics: An array of topics covered in the document\n"
1514
+ f"- languages: An array of languages used in the document\n"
1515
+ f"- ocr_contents: A dictionary with the document's contents, organized logically\n"
1516
+ )
1517
+
1518
+ # Add custom prompt if provided
1519
+ custom_section = ""
1520
+ if custom_prompt:
1521
+ custom_section = f"\n\nADDITIONAL CONTEXT AND INSTRUCTIONS:\n{custom_prompt}\n"
1522
+
1523
+ # Combine all sections into complete prompt
1524
+ return generic_section + specific_section + output_section + custom_section
1525
+
1526
+ def _extract_structured_data_text_only(self, ocr_markdown, filename, custom_prompt=None):
1527
+ """
1528
+ Extract structured data with the text-only model, using document-type-specific
1528
+ prompts, retries on transient API errors, and a raw-text fallback on failure
1530
+ """
1531
+ logger = logging.getLogger("text_processor")
1532
+ start_time = time.time()
1533
+
1534
+ try:
1535
+ # Fast path: Skip for minimal OCR text
1536
+ if not ocr_markdown or len(ocr_markdown.strip()) < 50:
1537
+ logger.info("Minimal OCR text - returning basic result")
1538
+ return {
1539
+ "file_name": filename,
1540
+ "topics": ["Document"],
1541
+ "languages": ["English"],
1542
+ "ocr_contents": {
1543
+ "raw_text": ocr_markdown if ocr_markdown else "No text could be extracted"
1544
+ },
1545
+ "processing_method": "minimal_text"
1546
+ }
1547
+
1548
+ # Check for API key to avoid unnecessary processing
1549
+ if self.test_mode or not self.api_key:
1550
+ logger.info("Test mode or no API key - returning basic result")
1551
+ return {
1552
+ "file_name": filename,
1553
+ "topics": ["Document"],
1554
+ "languages": ["English"],
1555
+ "ocr_contents": {
1556
+ "raw_text": ocr_markdown[:10000] if ocr_markdown else "No text could be extracted",
1557
+ "note": "API key not provided - showing raw OCR text only"
1558
+ },
1559
+ "processing_method": "test_mode"
1560
+ }
1561
+
1562
+ # Detect document type and build enhanced prompt
1563
+ doc_type = self._detect_document_type(custom_prompt, ocr_markdown)
1564
+ logger.info(f"Detected document type: {doc_type}")
1565
+
1566
+ # If OCR text is very large, truncate it to avoid API limits
1567
+ truncated_text = ocr_markdown
1568
+ if len(ocr_markdown) > 25000:
1569
+ # Keep first 15000 chars and last 5000 chars
1570
+ truncated_text = ocr_markdown[:15000] + "\n...[content truncated]...\n" + ocr_markdown[-5000:]
1571
+ logger.info(f"OCR text truncated from {len(ocr_markdown)} to {len(truncated_text)} chars")
1572
+
1573
+ # Build the prompt with truncated text if needed
1574
+ enhanced_prompt = self._build_enhanced_prompt(doc_type, truncated_text, custom_prompt)
1575
+
1576
+ # Use enhanced prompt with text-only model - with retry logic
1577
+ max_retries = 2
1578
+ retry_delay = 1
1579
+
1580
+ for retry in range(max_retries):
1581
+ try:
1582
+ logger.info(f"Calling text model ({TEXT_MODEL})")
1583
+ api_start = time.time()
1584
+
1585
+ # Set appropriate timeout based on text length
1586
+ timeout_ms = min(120000, max(30000, len(truncated_text) * 5)) # 30-120s based on length
1587
+
1588
+ # Make API call with appropriate timeout
1589
+ chat_response = self.client.chat.parse(
1590
+ model=TEXT_MODEL,
1591
+ messages=[
1592
+ {
1593
+ "role": "user",
1594
+ "content": enhanced_prompt
1595
+ },
1596
+ ],
1597
+ response_format=StructuredOCRModel,
1598
+ temperature=0,
1599
+ timeout_ms=timeout_ms
1600
+ )
1601
+
1602
+ api_time = time.time() - api_start
1603
+ logger.info(f"Text model API call completed in {api_time:.2f}s")
1604
+
1605
+ # Convert the response to a dictionary
1606
+ result = json.loads(chat_response.choices[0].message.parsed.json())
1607
+
1608
+ # Ensure languages is a list of strings, not Language enum objects
1609
+ if 'languages' in result:
1610
+ result['languages'] = [str(lang) for lang in result.get('languages', [])]
1611
+
1612
+ # Add processing metadata
1613
+ result['processing_method'] = 'text_model'
1614
+ result['document_type'] = doc_type
1615
+ result['model_used'] = TEXT_MODEL
1616
+ result['processing_time'] = time.time() - start_time
1617
+
1618
+ # Add raw text for reference if not already present
1619
+ if 'ocr_contents' in result and 'raw_text' not in result['ocr_contents']:
1620
+ # Add truncated raw text if very large
1621
+ if len(ocr_markdown) > 50000:
1622
+ result['ocr_contents']['raw_text'] = ocr_markdown[:50000] + "\n...[content truncated]..."
1623
+ else:
1624
+ result['ocr_contents']['raw_text'] = ocr_markdown
1625
+
1626
+ return result
1627
+
1628
+ except Exception as api_error:
1629
+ error_msg = str(api_error).lower()
1630
+ logger.warning(f"API error on attempt {retry+1}/{max_retries}: {str(api_error)}")
1631
+
1632
+ # Check if retry would help
1633
+ if retry < max_retries - 1:
1634
+ # Rate limit errors - special handling with longer wait
1635
+ if any(term in error_msg for term in ["rate limit", "429", "too many requests", "requests rate limit exceeded"]):
1636
+ # Distinguish quota or credit exhaustion (not retryable) from a temporary rate limit
1637
+ if any(term in error_msg for term in ["quota", "credit", "subscription"]):
1638
+ logger.error("API quota or credit limit reached. No retry will help.")
1639
+ raise ValueError(f"Mistral API quota or credit limit reached. Please check your subscription: {error_msg}")
1640
+ # Longer backoff for rate limit errors
1641
+ wait_time = retry_delay * (2 ** retry) * 6.0 # 6x longer wait for rate limits
1642
+ logger.info(f"Rate limit exceeded. Waiting {wait_time:.1f}s before retry...")
1643
+ time.sleep(wait_time)
1644
+ # Other transient errors
1645
+ elif any(term in error_msg for term in ["timeout", "connection", "500", "503", "504"]):
1646
+ # Wait before retrying
1647
+ wait_time = retry_delay * (2 ** retry)
1648
+ logger.info(f"Transient error, retrying in {wait_time}s")
1649
+ time.sleep(wait_time)
1650
+ else:
1651
+ # Non-retryable error
1652
+ raise
1653
+ else:
1654
+ # Last retry failed
1655
+ raise
1656
+
1657
+ # This shouldn't be reached due to raise in the loop, but just in case
1658
+ raise Exception("All retries failed for text model")
1659
+
1660
+ except Exception as e:
1661
+ logger.error(f"Text model failed: {str(e)}. Creating basic result.")
1662
+
1663
+ # Create a basic result with available OCR text
1664
+ try:
1665
+ # Create a more informative fallback result
1666
+ result = {
1667
+ "file_name": filename,
1668
+ "topics": ["Document"],
1669
+ "languages": ["English"],
1670
+ "ocr_contents": {
1671
+ "raw_text": ocr_markdown[:50000] if ocr_markdown else "No text could be extracted",
1672
+ "error": f"AI processing failed: {str(e)}"
1673
+ },
1674
+ "processing_method": "fallback",
1675
+ "processing_error": str(e),
1676
+ "processing_time": time.time() - start_time
1677
+ }
1678
+
1679
+ # Try to extract some basic metadata even without AI
1680
+ if ocr_markdown:
1681
+ # Simple content analysis
1682
+ text_sample = ocr_markdown[:5000].lower()
1683
+
1684
+ # Try to guess the document type from common keywords
1685
+ if "dear" in text_sample and any(word in text_sample for word in ["sincerely", "regards", "truly"]):
1686
+ result["topics"].append("Letter")
1687
+ elif any(word in text_sample for word in ["recipe", "ingredients", "instructions", "cook", "bake"]):
1688
+ result["topics"].append("Recipe")
1689
+ elif any(word in text_sample for word in ["article", "report", "study", "analysis"]):
1690
+ result["topics"].append("Article")
1691
+
1692
+ except Exception as inner_e:
1693
+ logger.error(f"Error creating basic result: {str(inner_e)}")
1694
+ result = {
1695
+ "file_name": str(filename) if filename else "unknown",
1696
+ "topics": ["Document"],
1697
+ "languages": ["English"],
1698
+ "ocr_contents": {
1699
+ "error": "Processing failed completely",
1700
+ "partial_text": ocr_markdown[:1000] if ocr_markdown else "Document could not be processed."
1701
+ }
1702
+ }
1703
+
1704
+ return result
1705
+
1706
+ # For testing directly
1707
+ if __name__ == "__main__":
1708
+ import sys
1709
+
1710
+ if len(sys.argv) < 2:
1711
+ print("Usage: python structured_ocr.py <file_path>")
1712
+ sys.exit(1)
1713
+
1714
+ file_path = sys.argv[1]
1715
+ processor = StructuredOCR()
1716
+ result = processor.process_file(file_path)
1717
+
1718
+ print(json.dumps(result, indent=2))
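Note on the retry logic in `_extract_structured_data_text_only`: the policy can be read as two small pure functions, one that classifies an error message and one that computes the wait. The sketch below is illustrative only and is not part of this commit. The names `classify_error`, `backoff_seconds`, and `call_with_retry` are hypothetical, and unlike the committed code it re-raises the original exception on quota errors instead of wrapping it in a `ValueError`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Substrings taken from the checks in the diff above.
RATE_LIMIT_TERMS = ("rate limit", "429", "too many requests")
QUOTA_TERMS = ("quota", "credit", "subscription")
TRANSIENT_TERMS = ("timeout", "connection", "500", "503", "504")


def classify_error(message: str) -> str:
    """Return 'quota', 'rate_limit', 'transient', or 'fatal'."""
    msg = message.lower()
    if any(term in msg for term in RATE_LIMIT_TERMS):
        # A rate-limit error that mentions quota or credit will not clear by waiting.
        if any(term in msg for term in QUOTA_TERMS):
            return "quota"
        return "rate_limit"
    if any(term in msg for term in TRANSIENT_TERMS):
        return "transient"
    return "fatal"


def backoff_seconds(kind: str, attempt: int, base_delay: float = 1.0) -> float:
    """Exponential backoff; rate limits wait six times longer."""
    wait = base_delay * (2 ** attempt)
    return wait * 6.0 if kind == "rate_limit" else wait


def call_with_retry(
    fn: Callable[[], T],
    max_retries: int = 2,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying only on rate-limit and transient errors."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:
            kind = classify_error(str(exc))
            is_last = attempt == max_retries - 1
            if is_last or kind in ("quota", "fatal"):
                raise
            sleep(backoff_seconds(kind, attempt, base_delay))
    raise RuntimeError("call_with_retry requires max_retries >= 1")
```

Passing `sleep` as a parameter keeps the policy testable without real delays. With `max_retries=2`, as in the commit, a failed first attempt gets exactly one retry.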
ui/__pycache__/layout.cpython-312.pyc ADDED
Binary file (7.71 kB).
ui/__pycache__/layout.cpython-313.pyc ADDED
Binary file (7.62 kB).
ui/custom.css ADDED
@@ -0,0 +1,67 @@
1
+ /* Minimal essential styling */
2
+
3
+ /* Processing status container */
4
+ .processing-status-container {
5
+ margin: 10px 0;
6
+ padding: 8px 12px;
7
+ border-left: 3px solid #5c6bc0;
8
+ font-size: 0.9rem;
9
+ }
10
+
11
+ /* Result card styling */
12
+ .previous-results-container {
13
+ margin-bottom: 20px;
14
+ }
15
+
16
+ .result-card {
17
+ border: 1px solid #e0e0e0;
18
+ border-radius: 4px;
19
+ padding: 15px;
20
+ margin-bottom: 15px;
21
+ }
22
+
23
+ .result-header {
24
+ display: flex;
25
+ justify-content: space-between;
26
+ margin-bottom: 10px;
27
+ padding-bottom: 5px;
28
+ border-bottom: 1px solid #e0e0e0;
29
+ }
30
+
31
+ .result-filename {
32
+ font-weight: bold;
33
+ font-size: 1.1rem;
34
+ }
35
+
36
+ .result-date {
37
+ font-size: 0.9rem;
38
+ color: #666;
39
+ }
40
+
41
+ .result-metadata {
42
+ display: flex;
43
+ flex-wrap: wrap;
44
+ gap: 8px;
45
+ margin-bottom: 10px;
46
+ }
47
+
48
+ .result-tag {
49
+ background-color: #e3f2fd;
50
+ border-radius: 16px;
51
+ padding: 3px 10px;
52
+ font-size: 0.85rem;
53
+ color: #1565c0;
54
+ }
55
+
56
+ .selected-result-container {
57
+ border: 1px solid #e0e0e0;
58
+ border-radius: 4px;
59
+ padding: 20px;
60
+ margin: 15px 0;
61
+ }
62
+
63
+ .selected-result-title {
64
+ font-size: 1.3rem;
65
+ font-weight: bold;
66
+ margin-bottom: 15px;
67
+ }
ui/layout.py ADDED
@@ -0,0 +1,172 @@
1
+ import streamlit as st
2
+ from pathlib import Path
3
+ import os
4
+
5
+ # Load custom CSS
6
+ def load_css():
7
+ css_file = Path(__file__).parent / "custom.css"
8
+ if css_file.exists():
9
+ with open(css_file, encoding="utf-8") as f:
10
+ st.markdown(f"<style>{f.read()}</style>", unsafe_allow_html=True)
11
+ else:
12
+ st.warning("Custom CSS file not found. Some styles may be missing.")
13
+
14
+ # Header component
15
+ def header():
16
+ st.markdown("""
17
+ <div class="main-header">
18
+ <h1 class="title-text">Historical OCR Workshop</h1>
19
+ </div>
20
+ """, unsafe_allow_html=True)
21
+
22
+ # Create a page wrapper similar to the React component
23
+ def page_wrapper(content_function, current_module=1):
24
+ """
25
+ Creates a consistent page layout with navigation
26
+ Args:
27
+ content_function: Function that renders the page content
28
+ current_module: Current module number (1-6)
29
+ """
30
+ # Load custom CSS
31
+ load_css()
32
+
33
+ # Display header
34
+ header()
35
+
36
+ # Ensure session state for navigation
37
+ if 'current_module' not in st.session_state:
38
+ st.session_state.current_module = current_module
39
+
40
+ # Main content area with bottom padding for the nav
41
+ st.markdown('<div class="main-content">', unsafe_allow_html=True)
42
+
43
+ # Call the content function to render the module content
44
+ content_function()
45
+
46
+ # Add spacer for fixed nav
47
+ st.markdown('<div class="footer-spacer"></div>', unsafe_allow_html=True)
48
+
49
+ # Navigation
50
+ render_navigation(current_module)
51
+
52
+ st.markdown('</div>', unsafe_allow_html=True)
53
+
54
+ # Navigation component
55
+ def render_navigation(current_module):
56
+ # Define module names, mirroring the React version
57
+ modules = ['Introduction', 'Historical Context', 'Methodology', 'Case Studies', 'Interactive OCR', 'Conclusion']
58
+
59
+ # Navigation container
60
+ st.markdown(f"""
61
+ <div class="nav-container">
62
+ <div class="nav-buttons">
63
+ {prev_button_html(current_module, modules)}
64
+ {next_button_html(current_module, modules)}
65
+ </div>
66
+
67
+ <div class="nav-dots">
68
+ {nav_dots_html(current_module, modules)}
69
+ </div>
70
+ </div>
71
+ """, unsafe_allow_html=True)
72
+
73
+ # Previous button HTML
74
+ def prev_button_html(current_module, modules):
75
+ if current_module > 1:
76
+ prev_module = current_module - 1
77
+ return f"""
78
+ <button class="prev-button"
79
+ onclick="document.getElementById('nav_prev_{prev_module}').click()"
80
+ aria-label="Go to previous module: {modules[prev_module-1]}">
81
+ ← Previous
82
+ </button>
83
+ """
84
+ return ""
85
+
86
+ # Next button HTML
87
+ def next_button_html(current_module, modules):
88
+ if current_module < len(modules):
89
+ next_module = current_module + 1
90
+ return f"""
91
+ <button class="next-button"
92
+ onclick="document.getElementById('nav_next_{next_module}').click()"
93
+ aria-label="Go to next module: {modules[next_module-1]}">
94
+ Next →
95
+ </button>
96
+ """
97
+ return ""
98
+
99
+ # Navigation dots HTML
100
+ def nav_dots_html(current_module, modules):
101
+ dots_html = ""
102
+ for i, name in enumerate(modules, 1):
103
+ active_class = "active" if i == current_module else ""
104
+ dots_html += f"""
105
+ <a class="nav-dot {active_class}"
106
+ onclick="document.getElementById('nav_dot_{i}').click()"
107
+ aria-current="{str(i == current_module).lower()}"
108
+ aria-label="Go to module {i}: {name}">
109
+ {i}
110
+ </a>
111
+ """
112
+ return dots_html
113
+
114
+ # Helper functions for container styles
115
+ def gray_container(content, padding="1.5rem"):
116
+ """Renders content in a gray container with consistent styling"""
117
+ st.markdown(f'<div class="content-container" style="padding:{padding};">{content}</div>', unsafe_allow_html=True)
118
+
119
+ def blue_container(content, padding="1.5rem"):
120
+ """Renders content in a blue container with consistent styling"""
121
+ st.markdown(f'<div class="blue-container" style="padding:{padding};">{content}</div>', unsafe_allow_html=True)
122
+
123
+ def yellow_container(content, padding="1.5rem"):
124
+ """Renders content in a yellow container with consistent styling"""
125
+ st.markdown(f'<div class="yellow-container" style="padding:{padding};">{content}</div>', unsafe_allow_html=True)
126
+
127
+ def card_grid(cards):
128
+ """
129
+ Renders a responsive grid of cards
130
+ Args:
131
+ cards: List of HTML strings for each card
132
+ """
133
+ grid_html = '<div class="card-grid">'
134
+ for card in cards:
135
+ grid_html += f'<div class="card">{card}</div>'
136
+ grid_html += '</div>'
137
+
138
+ st.markdown(grid_html, unsafe_allow_html=True)
139
+
140
+ def module_card(number, title, description):
141
+ """Creates a styled module card"""
142
+ return f"""
143
+ <div class="module-card">
144
+ <div class="module-number">Module {number}</div>
145
+ <div class="module-title">{title}</div>
146
+ <p>{description}</p>
147
+ </div>
148
+ """
149
+
150
+ def key_concept(content):
151
+ """Renders a key concept box"""
152
+ st.markdown(f'<div class="key-concept">{content}</div>', unsafe_allow_html=True)
153
+
154
+ def research_question(content):
155
+ """Renders a research question box"""
156
+ st.markdown(f'<div class="research-question">{content}</div>', unsafe_allow_html=True)
157
+
158
+ def quote(content, author=""):
159
+ """Renders a quote with optional author"""
160
+ quote_html = f'<div class="quote-container">{content}'
161
+ if author:
162
+ quote_html += f'<br/><br/><span style="font-size:0.9rem; text-align:right; display:block;">— {author}</span>'
163
+ quote_html += '</div>'
164
+ st.markdown(quote_html, unsafe_allow_html=True)
165
+
166
+ def tool_container(content):
167
+ """Renders content in a tool container"""
168
+ st.markdown(f'<div class="tool-container">{content}</div>', unsafe_allow_html=True)
169
+
170
+ def upload_container(content):
171
+ """Renders content in an upload container"""
172
+ st.markdown(f'<div class="upload-container">{content}</div>', unsafe_allow_html=True)
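Note on the container helpers in `ui/layout.py`: they interpolate their arguments directly into HTML that is rendered with `unsafe_allow_html=True`, so any text that comes from a document or a user is treated as markup. One possible hardening, sketched here under that assumption, is to escape plain-text arguments first. The function `safe_module_card` is a hypothetical variant of `module_card` and is not part of this commit.

```python
import html


def safe_module_card(number: int, title: str, description: str) -> str:
    """Hypothetical variant of module_card that escapes its text arguments."""
    return (
        '<div class="module-card">'
        f'<div class="module-number">Module {int(number)}</div>'
        # html.escape converts &, <, > and quotes to entities.
        f'<div class="module-title">{html.escape(title)}</div>'
        f"<p>{html.escape(description)}</p>"
        "</div>"
    )


card = safe_module_card(1, "Letters & <Diaries>", "Reading 18th-century hands")
print(card)
```

This only suits callers that pass plain text. Helpers that are meant to receive ready-made HTML, such as `card_grid`, would need to stay unescaped.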