milwright committed
Commit c04ffe5 · 1 Parent(s): 836388f

Rolling out modular v2

.DS_Store CHANGED
Binary files a/.DS_Store and b/.DS_Store differ
 
.clinerules/apiDocumentation.md ADDED
@@ -0,0 +1,29 @@
+ apiDocumentation.md
+ API Interaction Documentation
+ Mistral OCR API
+
+ Endpoint: /v1/ocr
+
+ Payload:
+
+ image (binary)
+
+ prompt (optional contextual instructions)
+
+ Response:
+
+ structured_data: Hierarchical text + metadata output
+
+ raw_text: Plain extracted text
+
+ Error Handling:
+
+ Timeout retries (up to 3 attempts)
+
+ Local fallback to Tesseract if Mistral service unavailable
+
+ Tesseract Fallback
+
+ Only invoked if Mistral API fails after retries.
+
+ No structured output; raw text only.
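
The retry-then-fallback flow documented above maps to a small client wrapper. A minimal sketch follows, assuming a hypothetical base URL and bearer-token auth (neither is specified in this commit); the `/v1/ocr` path, payload fields, response keys, and retry/fallback behavior come from the doc, with `pytesseract` standing in for the local Tesseract fallback.

```python
# Minimal sketch of the documented flow. API_URL's host and the auth
# header are assumptions; /v1/ocr, the payload fields, and the
# retry/fallback behavior follow the documentation above.
import requests
from PIL import Image
import pytesseract

API_URL = "https://example.invalid/v1/ocr"  # hypothetical host

def ocr_with_fallback(image_path, prompt=None, api_key=""):
    with open(image_path, "rb") as f:
        image_bytes = f.read()
    data = {"prompt": prompt} if prompt else {}
    for _ in range(3):  # timeout retries (up to 3 attempts)
        try:
            resp = requests.post(
                API_URL,
                files={"image": image_bytes},
                data=data,
                headers={"Authorization": f"Bearer {api_key}"},
                timeout=30,
            )
            resp.raise_for_status()
            return resp.json()  # expected keys: structured_data, raw_text
        except requests.RequestException:
            continue
    # Tesseract fallback: raw text only, no structured output
    raw = pytesseract.image_to_string(Image.open(image_path))
    return {"structured_data": None, "raw_text": raw}
```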
.clinerules/projectBrief.md ADDED
@@ -0,0 +1,21 @@
+ # Foundation
+
+ Historical OCR is an advanced optical character recognition (OCR) application designed to support historical research. It leverages Mistral AI's OCR models alongside image preprocessing pipelines optimized for archival material.
+
+ High-Level Overview
+
+ Building a Streamlit-based web application to process historical documents (images or PDFs), optimize them for OCR using advanced preprocessing techniques, and extract structured text and metadata through Mistral's large language models.
+
+ Core Requirements and Goals
+
+ Upload and preprocess historical documents
+
+ Automatically detect document types (e.g., handwritten letters, scientific papers)
+
+ Apply tailored OCR prompting and structured output based on document type
+
+ Support user-defined contextual instructions to refine output
+
+ Provide downloadable structured transcripts and analysis
+
+ Example: "Building a Streamlit web app for OCR transcription and structured extraction from historical documents using Mistral AI."
.clinerules/systemPatterns.md ADDED
@@ -0,0 +1,31 @@
+ # System Architecture
+
+ Frontend: Streamlit app (app.py) for user interface and interactions.
+
+ Core Processing: ocr_processing.py orchestrates preprocessing, document type detection, and OCR operations.
+
+ Image Preprocessing: preprocessing.py and image_segmentation.py handle deskewing, thresholding, and cleaning.
+
+ OCR and Structuring: structured_ocr.py and ocr_utils.py manage API communication and format structured outputs.
+
+ Utilities and Detection: language_detection.py, utils.py, and constants.py provide language detection, helpers, and prompt templates.
+
+ Key Technical Decisions
+
+ Streamlit cache management for upload processing efficiency.
+
+ Modular design of preprocessing paths based on document type.
+
+ Mistral AI as the primary OCR processor, with Tesseract fallback for redundancy.
+
+ Design Patterns in Use
+
+ Delegation: Frontend delegates all processing to backend orchestrators.
+
+ Modularity: Preprocessing and OCR tasks divided into clean, testable modules.
+
+ State-driven Processing: Output dynamically reflects session state and user input.
+
+ Component Relationships
+
+ app.py ⇨ ocr_processing.py ⇨ preprocessing.py, structured_ocr.py, language_detection.py, etc.
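
As a concrete illustration of the delegation pattern described above, here is a minimal sketch of the frontend-to-orchestrator handoff. `process_file`'s parameters mirror how it appears elsewhere in this commit; the `st.json` rendering call is illustrative, not the app's actual display logic.

```python
# Sketch of the delegation pattern: the Streamlit frontend hands the
# upload to the backend orchestrator and only renders what comes back.
import streamlit as st
from ocr_processing import process_file  # backend orchestrator

uploaded = st.file_uploader("Upload a historical document")
if uploaded is not None:
    # All preprocessing, detection, and OCR happen behind this call
    result = process_file(uploaded, use_vision=True, preprocessing_options={})
    st.json(result)  # the frontend never touches OCR logic directly
```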
README.md CHANGED
@@ -21,7 +21,11 @@ An advanced OCR application for historical document analysis using Mistral AI.
 
  - **OCR with Context:** AI-enhanced OCR optimized for historical documents
  - **Document Type Detection:** Automatically identifies handwritten letters, recipes, scientific texts, and more
- - **Image Preprocessing:** Optimizes images for better text recognition
+ - **Advanced Image Preprocessing:**
+   - Automatic deskewing to correct document orientation
+   - Smart thresholding with Otsu and adaptive methods
+   - Morphological operations to clean up text
+   - Document-type specific optimization
  - **Custom Prompting:** Tailor the AI analysis with document-specific instructions
  - **Structured Output:** Returns organized, structured information based on document type
 
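The new preprocessing bullets correspond to standard OpenCV operations. A hedged sketch of how deskewing, Otsu thresholding, and a morphological clean-up might chain together (thresholds and the deskew heuristic are assumptions, not the repo's preprocessing.py):

```python
# Illustrative pipeline for the three bullets above; not the repo's code.
import cv2
import numpy as np

def preprocess_for_ocr(path: str) -> np.ndarray:
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Deskew: estimate rotation from the minimum-area rectangle of ink pixels
    coords = np.column_stack(np.where(gray < 128)).astype(np.float32)
    if coords.size:
        angle = cv2.minAreaRect(coords)[-1]
        angle = -(90 + angle) if angle < -45 else -angle
        h, w = gray.shape
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        gray = cv2.warpAffine(gray, M, (w, h), flags=cv2.INTER_CUBIC,
                              borderMode=cv2.BORDER_REPLICATE)
    # Smart thresholding: Otsu picks the global threshold automatically
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Morphological opening removes speckle noise around glyphs
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    return cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
```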
app.py CHANGED
@@ -41,7 +41,7 @@ from constants import (
 )
 from structured_ocr import StructuredOCR
 from config import MISTRAL_API_KEY
- from ocr_utils import create_results_zip
+ from utils.image_utils import create_results_zip
 
 # Set favicon path
 favicon_path = os.path.join(os.path.dirname(__file__), "static/favicon.png")
@@ -74,20 +74,47 @@ st.set_page_config(
 # Consult https://docs.streamlit.io/library/advanced-features/session-state for details.
 # ========================================================================================
 
+ def reset_document_state():
+     """Reset only document-specific state variables
+ 
+     This function explicitly resets all document-related variables to ensure
+     clean state between document processing, preventing cached data issues.
+     """
+     st.session_state.sample_document = None
+     st.session_state.original_sample_bytes = None
+     st.session_state.original_sample_name = None
+     st.session_state.original_sample_mime_type = None
+     st.session_state.is_sample_document = False
+     st.session_state.processed_document_active = False
+     st.session_state.sample_document_processed = False
+     st.session_state.sample_just_loaded = False
+     st.session_state.last_processed_file = None
+     st.session_state.selected_previous_result = None
+     # Keep temp_file_paths but ensure it's empty after cleanup
+     if 'temp_file_paths' in st.session_state:
+         st.session_state.temp_file_paths = []
+ 
 def init_session_state():
     """Initialize session state variables if they don't already exist
 
     This function follows Streamlit's recommended patterns for state initialization.
     It only creates variables if they don't exist yet and doesn't modify existing values.
     """
+     # Initialize persistent app state variables
     if 'previous_results' not in st.session_state:
         st.session_state.previous_results = []
     if 'temp_file_paths' not in st.session_state:
         st.session_state.temp_file_paths = []
-     if 'last_processed_file' not in st.session_state:
-         st.session_state.last_processed_file = None
     if 'auto_process_sample' not in st.session_state:
         st.session_state.auto_process_sample = False
+     if 'close_clicked' not in st.session_state:
+         st.session_state.close_clicked = False
+     if 'active_tab' not in st.session_state:
+         st.session_state.active_tab = 0
+ 
+     # Initialize document-specific state variables
+     if 'last_processed_file' not in st.session_state:
+         st.session_state.last_processed_file = None
     if 'sample_just_loaded' not in st.session_state:
         st.session_state.sample_just_loaded = False
     if 'processed_document_active' not in st.session_state:
@@ -104,10 +131,6 @@ def init_session_state():
         st.session_state.is_sample_document = False
     if 'selected_previous_result' not in st.session_state:
         st.session_state.selected_previous_result = None
-     if 'close_clicked' not in st.session_state:
-         st.session_state.close_clicked = False
-     if 'active_tab' not in st.session_state:
-         st.session_state.active_tab = 0
 
 def close_document():
     """Called when the Close Document button is clicked
@@ -120,24 +143,17 @@ def close_document():
     That approach breaks Streamlit's execution flow and causes UI artifacts.
     """
     logger.info("Close document button clicked")
-     # Save the previous results
-     previous_results = st.session_state.previous_results if 'previous_results' in st.session_state else []
 
-     # Clean up temp files
+     # Clean up temp files first
     if 'temp_file_paths' in st.session_state and st.session_state.temp_file_paths:
         logger.info(f"Cleaning up {len(st.session_state.temp_file_paths)} temporary files")
         handle_temp_files(st.session_state.temp_file_paths)
 
-     # Clear all state variables except previous_results
-     for key in list(st.session_state.keys()):
-         if key != 'previous_results' and key != 'close_clicked':
-             st.session_state.pop(key, None)
+     # Reset all document-specific state variables to prevent caching issues
+     reset_document_state()
 
-     # Set flag for having cleaned up
+     # Set flag for having cleaned up - this will trigger a rerun in main()
     st.session_state.close_clicked = True
- 
-     # Restore the previous results
-     st.session_state.previous_results = previous_results
 
 def show_example_documents():
     """Show example documents section"""
@@ -251,14 +267,12 @@ def show_example_documents():
 
     # Reset any document state before loading a new sample
     if st.session_state.processed_document_active:
-         # Clear previous document state
-         st.session_state.processed_document_active = False
-         st.session_state.last_processed_file = None
- 
         # Clean up any temporary files from previous processing
         if st.session_state.temp_file_paths:
             handle_temp_files(st.session_state.temp_file_paths)
-             st.session_state.temp_file_paths = []
+ 
+         # Reset all document-specific state variables
+         reset_document_state()
 
     # Save download info in session state
     st.session_state.sample_document = SampleDocument(
@@ -350,6 +364,7 @@ def process_document(uploaded_file, left_col, right_col, sidebar_options):
     progress_placeholder = st.empty()
 
     # Image preprocessing preview - show if image file and preprocessing options are set
+     # Remove the document active check to show preview immediately after selection
     if (any(sidebar_options["preprocessing_options"].values()) and
         uploaded_file.type.startswith('image/')):
 
@@ -530,13 +545,14 @@ def main():
     sidebar_options = create_sidebar_options()
 
     # Create main layout with tabs - simpler, more compact approach
-     tab_names = ["Document Processing", "Sample Documents", "Previous Results", "About"]
-     main_tab1, main_tab2, main_tab3, main_tab4 = st.tabs(tab_names)
+     tab_names = ["Document Processing", "Sample Documents", "Learn More"]
+     main_tab1, main_tab2, main_tab3 = st.tabs(tab_names)
 
     with main_tab1:
         # Create a two-column layout for file upload and results with minimal padding
         st.markdown('<style>.block-container{padding-top: 1rem; padding-bottom: 0;}</style>', unsafe_allow_html=True)
-         left_col, right_col = st.columns([1, 1])
+         # Using a 2:3 column ratio gives more space to the results column
+         left_col, right_col = st.columns([2, 3])
 
         with left_col:
             # Create file uploader
@@ -575,11 +591,9 @@ def main():
 
         show_example_documents()
 
-     with main_tab3:
-         # Previous results tab
-         display_previous_results()
+     # Previous results tab temporarily removed
 
-     with main_tab4:
+     with main_tab3:
         # About tab
         display_about_tab()
 
config.py CHANGED
@@ -40,22 +40,19 @@ VISION_MODEL = os.environ.get("MISTRAL_VISION_MODEL", "mistral-small-latest") #
 # Image preprocessing settings optimized for historical documents
 # These can be customized from environment variables
 IMAGE_PREPROCESSING = {
-     "enhance_contrast": float(os.environ.get("ENHANCE_CONTRAST", "1.2")),  # Reduced contrast for more natural image appearance
+     "enhance_contrast": float(os.environ.get("ENHANCE_CONTRAST", "1.8")),  # Increased contrast for better text recognition
     "sharpen": os.environ.get("SHARPEN", "True").lower() in ("true", "1", "yes"),
     "denoise": os.environ.get("DENOISE", "True").lower() in ("true", "1", "yes"),
     "max_size_mb": float(os.environ.get("MAX_IMAGE_SIZE_MB", "12.0")),  # Increased size limit for better quality
     "target_dpi": int(os.environ.get("TARGET_DPI", "300")),  # Target DPI for scaling
-     "compression_quality": int(os.environ.get("COMPRESSION_QUALITY", "95")),  # Higher quality for better OCR results
-     # Enhanced settings for handwritten documents
+     "compression_quality": int(os.environ.get("COMPRESSION_QUALITY", "100")),  # Higher quality for better OCR results
+     # # Enhanced settings for handwritten documents
     "handwritten": {
-         "contrast": float(os.environ.get("HANDWRITTEN_CONTRAST", "1.2")),  # Lower contrast for handwritten text
         "block_size": int(os.environ.get("HANDWRITTEN_BLOCK_SIZE", "21")),  # Larger block size for adaptive thresholding
         "constant": int(os.environ.get("HANDWRITTEN_CONSTANT", "5")),  # Lower constant for adaptive thresholding
         "use_dilation": os.environ.get("HANDWRITTEN_DILATION", "True").lower() in ("true", "1", "yes"),  # Connect broken strokes
-         "clahe_limit": float(os.environ.get("HANDWRITTEN_CLAHE_LIMIT", "2.0")),  # CLAHE limit for local contrast
-         "bilateral_d": int(os.environ.get("HANDWRITTEN_BILATERAL_D", "5")),  # Bilateral filter window size
-         "bilateral_sigma1": int(os.environ.get("HANDWRITTEN_BILATERAL_SIGMA1", "25")),  # Color sigma
-         "bilateral_sigma2": int(os.environ.get("HANDWRITTEN_BILATERAL_SIGMA2", "45"))  # Space sigma
+         "dilation_iterations": int(os.environ.get("HANDWRITTEN_DILATION_ITERATIONS", "2")),  # More iterations for better stroke connection
+         "dilation_kernel_size": int(os.environ.get("HANDWRITTEN_DILATION_KERNEL_SIZE", "3"))  # Larger kernel for dilation
     }
 }
 
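The new `dilation_iterations` and `dilation_kernel_size` keys suggest a dilation step downstream. A hedged sketch of how a preprocessing function might consume them (the function itself is illustrative; only the config keys come from this commit):

```python
# Illustrative consumer of the new handwritten-document settings.
import cv2
import numpy as np
from config import IMAGE_PREPROCESSING

def connect_strokes(binary: np.ndarray) -> np.ndarray:
    hw = IMAGE_PREPROCESSING["handwritten"]
    if not hw["use_dilation"]:
        return binary
    k = hw["dilation_kernel_size"]
    kernel = np.ones((k, k), np.uint8)
    # Dilation thickens strokes so broken handwriting reconnects
    return cv2.dilate(binary, kernel, iterations=hw["dilation_iterations"])
```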
constants.py CHANGED
@@ -138,17 +138,56 @@ CONTENT_THEMES = {
 }
 
 # Period tags based on year ranges
+ # These ranges are used to assign historical period tags to documents based on their year.
 PERIOD_TAGS = {
-     (0, 1799): "Pre-1800s",
-     (1800, 1849): "Early 19th Century",
-     (1850, 1899): "Late 19th Century",
-     (1900, 1949): "Early 20th Century",
-     (1950, 2099): "Modern Era"
+     (0, 499): "Ancient Era (to 500 CE)",
+     (500, 999): "Early Medieval (500–1000)",
+     (1000, 1299): "High Medieval (1000–1300)",
+     (1300, 1499): "Late Medieval (1300–1500)",
+     (1500, 1599): "Renaissance (1500–1600)",
+     (1600, 1699): "Early Modern (1600–1700)",
+     (1700, 1775): "Enlightenment (1700–1775)",
+     (1776, 1799): "Age of Revolutions (1776–1800)",
+     (1800, 1849): "Early 19th Century (1800–1850)",
+     (1850, 1899): "Late 19th Century (1850–1900)",
+     (1900, 1918): "Early 20th Century & WWI (1900–1918)",
+     (1919, 1938): "Interwar Period (1919–1938)",
+     (1939, 1945): "World War II (1939–1945)",
+     (1946, 1968): "Postwar & Mid-20th Century (1946–1968)",
+     (1969, 1989): "Late 20th Century (1969–1989)",
+     (1990, 2000): "Turn of the 21st Century (1990–2000)",
+     (2001, 2099): "Contemporary (21st Century)"
 }
 
- # Default fallback tags
- DEFAULT_TAGS = ["Document", "Historical", "Text"]
- GENERIC_TAGS = ["Archive", "Content", "Record"]
+ # Default fallback tags for documents when no specific tags are detected.
+ DEFAULT_TAGS = [
+     "Document",
+     "Historical",
+     "Text",
+     "Primary Source",
+     "Archival Material",
+     "Record",
+     "Manuscript",
+     "Printed Material",
+     "Correspondence",
+     "Publication"
+ ]
+ 
+ # Generic tags that can be used for broad categorization or as supplemental tags.
+ GENERIC_TAGS = [
+     "Archive",
+     "Content",
+     "Record",
+     "Source",
+     "Material",
+     "Page",
+     "Scan",
+     "Image",
+     "Transcription",
+     "Uncategorized",
+     "General",
+     "Miscellaneous"
+ ]
 
 # UI constants
 PROGRESS_DELAY = 0.8  # Seconds to show completion message
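
Since `PERIOD_TAGS` is keyed by inclusive `(start, end)` year ranges, assigning a tag is a linear scan over the table. A small sketch of such a lookup (the helper name is illustrative, not from the commit):

```python
# Year -> period lookup over the expanded PERIOD_TAGS table.
from constants import PERIOD_TAGS, DEFAULT_TAGS

def period_tag_for_year(year: int) -> str:
    for (start, end), tag in PERIOD_TAGS.items():
        if start <= year <= end:  # ranges are inclusive on both ends
            return tag
    return DEFAULT_TAGS[0]  # fall back to the generic "Document" tag

print(period_tag_for_year(1887))  # -> "Late 19th Century (1850–1900)"
```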
image_segmentation.py CHANGED
@@ -18,12 +18,13 @@ logging.basicConfig(level=logging.INFO,
                     format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
 logger = logging.getLogger(__name__)
 
- def segment_image_for_ocr(image_path: Union[str, Path]) -> Dict[str, Union[Image.Image, str]]:
+ def segment_image_for_ocr(image_path: Union[str, Path], vision_enabled: bool = True) -> Dict[str, Union[Image.Image, str]]:
     """
     Segment an image into text and image regions for improved OCR processing.
 
     Args:
         image_path: Path to the image file
+         vision_enabled: Whether the vision model is enabled
 
     Returns:
         Dict containing:
@@ -41,6 +42,23 @@ def segment_image_for_ocr(image_path: Union[str, Path]) -> Dict[str, Union[Image
     try:
         # Open original image with PIL for compatibility
         with Image.open(image_file) as pil_img:
+             # --- 2 · Stop "text page detected as image" when vision model is off ---
+             if not vision_enabled:
+                 # Import the entropy calculator from utils.image_utils
+                 from utils.image_utils import calculate_image_entropy
+ 
+                 # Calculate entropy to determine if this is line art or blank
+                 ent = calculate_image_entropy(pil_img)
+                 if ent < 3.5:  # Heuristically low → line-art or blank page
+                     logger.info(f"Low entropy image detected ({ent:.2f}), classifying as illustration")
+                     # Return minimal result for illustration
+                     return {
+                         'text_regions': None,
+                         'image_regions': pil_img,
+                         'text_mask_base64': None,
+                         'combined_result': None,
+                         'text_regions_coordinates': []
+                     }
             # Convert to RGB if not already
             if pil_img.mode != 'RGB':
                 pil_img = pil_img.convert('RGB')
@@ -89,7 +107,8 @@ def segment_image_for_ocr(image_path: Union[str, Path]) -> Dict[str, Union[Image
 
             # Additional check for text-like characteristics
             # Text typically has aspect ratio > 1 (wider than tall) and reasonable density
-             if (aspect_ratio > 1.5 or aspect_ratio < 0.5) and dark_pixel_density > 0.2:
+             # Relaxed aspect ratio constraints and lowered density threshold for better detection
+             if (aspect_ratio > 1.2 or aspect_ratio < 0.7) and dark_pixel_density > 0.15:
                 # Add to text regions list
                 text_regions.append((x, y, w, h))
                 # Add to text mask
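
`calculate_image_entropy` is imported from `utils.image_utils` but not shown in this diff. A plausible sketch computes Shannon entropy over the grayscale histogram (an assumption, not the repo's code); an 8-bit image maxes out at 8 bits, and dense text pages typically score well above the 3.5 cutoff used above:

```python
# Hypothetical sketch of the entropy calculator assumed by the diff.
import numpy as np
from PIL import Image

def calculate_image_entropy(img: Image.Image) -> float:
    hist = np.asarray(img.convert('L').histogram(), dtype=np.float64)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins; log2(0) is undefined
    return float(-(p * np.log2(p)).sum())  # bits per pixel, 0..8
```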
language_detection.py CHANGED
@@ -64,7 +64,6 @@ class LanguageDetector:
             "patterns": ['oi[ts]$', 'oi[re]$', 'f[^aeiou]', 'ff', 'ſ', 'auoit', 'eſtoit',
                          'ſi', 'ſur', 'ſa', 'cy', 'ayant', 'oy', 'uſ', 'auſ']
         },
-         "exclusivity": 2.0  # French indicators have higher weight in historical text detection
     },
     "German": {
         "chars": ['ä', 'ö', 'ü', 'ß'],
ocr_processing.py CHANGED
@@ -17,6 +17,9 @@ import streamlit as st
 
 # Local application imports
 from structured_ocr import StructuredOCR
+ # Import from updated utils directory
+ from utils.image_utils import clean_ocr_result
+ # Temporarily retain old utils imports until they are fully migrated
 from utils import generate_cache_key, timing, format_timestamp, create_descriptive_filename, extract_subject_tags
 from preprocessing import apply_preprocessing_to_file
 from error_handler import handle_ocr_error, check_file_size
@@ -239,7 +242,7 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
 
     try:
         # Perform image segmentation
-         segmentation_results = segment_image_for_ocr(temp_path)
+         segmentation_results = segment_image_for_ocr(temp_path, vision_enabled=use_vision)
 
         if segmentation_results['combined_result'] is not None:
             # Save the segmented result to a new temporary file
@@ -357,6 +360,13 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
         # Add additional metadata to result
         result = process_result(result, uploaded_file, preprocessing_options)
 
+         # 🔧 ALWAYS normalize result before returning
+         result = clean_ocr_result(
+             result,
+             use_segmentation=use_segmentation,
+             vision_enabled=use_vision
+         )
+ 
         # Complete progress
         progress_reporter.complete()
 
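`clean_ocr_result` also lives in the new `utils.image_utils` module and is not shown in this diff. A minimal sketch of such a normalizer, under the assumption that its job is to guarantee the keys downstream code expects and to record how the result was produced:

```python
# Hypothetical sketch only; the actual utils.image_utils implementation
# is not part of this diff.
def clean_ocr_result(result: dict, use_segmentation: bool = False,
                     vision_enabled: bool = True) -> dict:
    result.setdefault('ocr_contents', {})
    result['ocr_contents'].setdefault('raw_text', '')
    # Record the processing path so the UI can adapt its display
    result['processing_flags'] = {
        'segmentation': use_segmentation,
        'vision': vision_enabled,
    }
    return result
```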
ocr_utils.py CHANGED
@@ -1,110 +1,38 @@
 """
- Utility functions for OCR processing with Mistral AI.
- Contains helper functions for working with OCR responses and image handling.
 """
 
- # Standard library imports
- import json
 import base64
- import io
- import zipfile
 import logging
- import time
- from datetime import datetime
 from pathlib import Path
- from typing import Dict, List, Optional, Union, Any, Tuple
- from functools import lru_cache
 
 # Configure logging
 logging.basicConfig(level=logging.INFO,
-                     format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
 logger = logging.getLogger(__name__)
 
- # Third-party imports
- import numpy as np
 
- # Check for image processing libraries
 try:
-     from PIL import Image, ImageEnhance, ImageFilter, ImageOps
     PILLOW_AVAILABLE = True
 except ImportError:
     logger.warning("PIL not available - image preprocessing will be limited")
     PILLOW_AVAILABLE = False
 
- try:
-     import cv2
-     CV2_AVAILABLE = True
- except ImportError:
-     logger.warning("OpenCV (cv2) not available - advanced image processing will be limited")
-     CV2_AVAILABLE = False
- 
- # Mistral AI imports
- from mistralai import DocumentURLChunk, ImageURLChunk, TextChunk
- from mistralai.models import OCRImageObject
- 
- # Import configuration
- try:
-     from config import IMAGE_PREPROCESSING
- except ImportError:
-     # Fallback defaults if config not available
-     IMAGE_PREPROCESSING = {
-         "enhance_contrast": 1.5,
-         "sharpen": True,
-         "denoise": True,
-         "max_size_mb": 8.0,
-         "target_dpi": 300,
-         "compression_quality": 92
-     }
- 
- def replace_images_in_markdown(markdown_str: str, images_dict: dict) -> str:
-     """
-     Replace image placeholders in markdown with base64-encoded images.
- 
-     Args:
-         markdown_str: Markdown text containing image placeholders
-         images_dict: Dictionary mapping image IDs to base64 strings
- 
-     Returns:
-         Markdown text with images replaced by base64 data
-     """
-     for img_name, base64_str in images_dict.items():
-         markdown_str = markdown_str.replace(
-             f"![{img_name}]({img_name})", f"![{img_name}]({base64_str})"
-         )
-     return markdown_str
- 
- def get_combined_markdown(ocr_response) -> str:
-     """
-     Combine OCR text and images into a single markdown document.
- 
-     Args:
-         ocr_response: OCR response object from Mistral AI
- 
-     Returns:
-         Combined markdown string with embedded images
-     """
-     markdowns = []
- 
-     # Process each page of the OCR response
-     for page in ocr_response.pages:
-         # Extract image data if available
-         image_data = {}
-         if hasattr(page, "images"):
-             for img in page.images:
-                 if hasattr(img, "id") and hasattr(img, "image_base64"):
-                     image_data[img.id] = img.image_base64
- 
-         # Replace image placeholders with base64 data
-         page_markdown = page.markdown if hasattr(page, "markdown") else ""
-         processed_markdown = replace_images_in_markdown(page_markdown, image_data)
-         markdowns.append(processed_markdown)
- 
-     # Join all pages' markdown with double newlines
-     return "\n\n".join(markdowns)
 
 def encode_image_for_api(image_path: Union[str, Path]) -> str:
     """
-     Encode an image as base64 data URL for API submission.
 
     Args:
         image_path: Path to the image file
@@ -135,1703 +63,37 @@ def encode_image_for_api(image_path: Union[str, Path]) -> str:
     encoded = base64.b64encode(image_file.read_bytes()).decode()
     return f"data:{mime_type};base64,{encoded}"
 
- def encode_bytes_for_api(file_bytes: bytes, mime_type: str) -> str:
-     """
-     Encode binary data as base64 data URL for API submission.
- 
-     Args:
-         file_bytes: Binary file data
-         mime_type: MIME type of the file (e.g., 'image/jpeg', 'application/pdf')
- 
-     Returns:
-         Base64 data URL for the data
-     """
-     # Encode data as base64
-     encoded = base64.b64encode(file_bytes).decode()
-     return f"data:{mime_type};base64,{encoded}"
- 
- def process_image_with_ocr(client, image_path: Union[str, Path], model: str = "mistral-ocr-latest"):
-     """
-     Process an image with OCR and return the response.
- 
-     Args:
-         client: Mistral AI client
-         image_path: Path to the image file
-         model: OCR model to use
- 
-     Returns:
-         OCR response object
-     """
-     # Encode image as base64
-     base64_data_url = encode_image_for_api(image_path)
- 
-     # Process image with OCR
-     image_response = client.ocr.process(
-         document=ImageURLChunk(image_url=base64_data_url),
-         model=model
-     )
- 
-     return image_response
- 
- def ocr_response_to_json(ocr_response, indent: int = 4) -> str:
-     """
-     Convert OCR response to a formatted JSON string.
- 
-     Args:
-         ocr_response: OCR response object
-         indent: Indentation level for JSON formatting
- 
-     Returns:
-         Formatted JSON string
-     """
-     # Convert OCR response to a dictionary
-     response_dict = {
-         "text": ocr_response.text if hasattr(ocr_response, "text") else "",
-         "pages": []
-     }
- 
-     # Process pages if available
-     if hasattr(ocr_response, "pages"):
-         for page in ocr_response.pages:
-             page_dict = {
-                 "text": page.text if hasattr(page, "text") else "",
-                 "markdown": page.markdown if hasattr(page, "markdown") else "",
-                 "images": []
-             }
- 
-             # Process images if available
-             if hasattr(page, "images"):
-                 for img in page.images:
-                     img_dict = {
-                         "id": img.id if hasattr(img, "id") else "",
-                         "base64": img.image_base64 if hasattr(img, "image_base64") else ""
-                     }
-                     page_dict["images"].append(img_dict)
- 
-             response_dict["pages"].append(page_dict)
- 
-     # Convert dictionary to JSON
-     return json.dumps(response_dict, indent=indent)
- 
- def create_results_zip_in_memory(results):
-     """
-     Create a zip file containing OCR results in memory.
- 
-     Args:
-         results: Dictionary or list of OCR results
- 
-     Returns:
-         Binary zip file data
-     """
-     # Create a BytesIO object
-     zip_buffer = io.BytesIO()
- 
-     # Check if results is a list or a dictionary
-     is_list = isinstance(results, list)
- 
-     # Create zip file in memory
-     with zipfile.ZipFile(zip_buffer, 'w', zipfile.ZIP_DEFLATED) as zipf:
-         if is_list:
-             # Handle list of results
-             for i, result in enumerate(results):
-                 try:
-                     # Create a descriptive base filename for this result
-                     base_filename = result.get('file_name', f'document_{i+1}').split('.')[0]
- 
-                     # Add document type if available
-                     if 'topics' in result and result['topics']:
-                         topic = result['topics'][0].lower().replace(' ', '_')
-                         base_filename = f"{base_filename}_{topic}"
- 
-                     # Add language if available
-                     if 'languages' in result and result['languages']:
-                         lang = result['languages'][0].lower()
-                         # Only add if it's not already in the filename
-                         if lang not in base_filename.lower():
-                             base_filename = f"{base_filename}_{lang}"
- 
-                     # For PDFs, add page information
-                     if 'total_pages' in result and 'processed_pages' in result:
-                         base_filename = f"{base_filename}_p{result['processed_pages']}of{result['total_pages']}"
- 
-                     # Add timestamp if available
-                     if 'timestamp' in result:
-                         try:
-                             # Try to parse the timestamp and reformat it
-                             dt = datetime.strptime(result['timestamp'], "%Y-%m-%d %H:%M")
-                             timestamp = dt.strftime("%Y%m%d_%H%M%S")
-                             base_filename = f"{base_filename}_{timestamp}"
-                         except:
-                             pass
- 
-                     # Add JSON results for each file with descriptive name
-                     result_json = json.dumps(result, indent=2)
-                     zipf.writestr(f"{base_filename}.json", result_json)
- 
-                     # Add HTML content (generated from the result)
-                     html_content = create_html_with_images(result)
-                     zipf.writestr(f"{base_filename}_with_images.html", html_content)
- 
-                     # Add raw OCR text if available
-                     if "ocr_contents" in result and "raw_text" in result["ocr_contents"]:
-                         zipf.writestr(f"{base_filename}.txt", result["ocr_contents"]["raw_text"])
- 
-                     # Add HTML visualization if available
-                     if "html_visualization" in result:
-                         zipf.writestr(f"visualization_{i+1}.html", result["html_visualization"])
- 
-                     # Add images if available (limit to conserve memory)
-                     if "pages_data" in result:
-                         for page_idx, page in enumerate(result["pages_data"]):
-                             for img_idx, img in enumerate(page.get("images", [])[:3]):  # Limit to first 3 images per page
-                                 img_base64 = img.get("image_base64", "")
-                                 if img_base64:
-                                     # Strip data URL prefix if present
-                                     if img_base64.startswith("data:image"):
-                                         img_base64 = img_base64.split(",", 1)[1]
- 
-                                     # Decode base64 and add to zip
-                                     try:
-                                         img_data = base64.b64decode(img_base64)
-                                         zipf.writestr(f"images/result_{i+1}_page_{page_idx+1}_img_{img_idx+1}.jpg", img_data)
-                                     except:
-                                         pass
-                 except Exception:
-                     # If any result fails, skip it and continue
-                     continue
-         else:
-             # Handle single result
-             try:
-                 # Create a descriptive base filename for this result
-                 base_filename = results.get('file_name', 'document').split('.')[0]
- 
-                 # Add document type if available
-                 if 'topics' in results and results['topics']:
-                     topic = results['topics'][0].lower().replace(' ', '_')
-                     base_filename = f"{base_filename}_{topic}"
- 
-                 # Add language if available
-                 if 'languages' in results and results['languages']:
-                     lang = results['languages'][0].lower()
-                     # Only add if it's not already in the filename
-                     if lang not in base_filename.lower():
-                         base_filename = f"{base_filename}_{lang}"
- 
-                 # For PDFs, add page information
-                 if 'total_pages' in results and 'processed_pages' in results:
-                     base_filename = f"{base_filename}_p{results['processed_pages']}of{results['total_pages']}"
- 
-                 # Add timestamp if available
-                 if 'timestamp' in results:
-                     try:
-                         # Try to parse the timestamp and reformat it
-                         dt = datetime.strptime(results['timestamp'], "%Y-%m-%d %H:%M")
-                         timestamp = dt.strftime("%Y%m%d_%H%M%S")
-                         base_filename = f"{base_filename}_{timestamp}"
-                     except:
-                         # If parsing fails, create a new timestamp
-                         timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
-                         base_filename = f"{base_filename}_{timestamp}"
-                 else:
-                     # No timestamp in the result, create a new one
-                     timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
-                     base_filename = f"{base_filename}_{timestamp}"
- 
-                 # Add JSON results with descriptive name
-                 results_json = json.dumps(results, indent=2)
-                 zipf.writestr(f"{base_filename}.json", results_json)
- 
-                 # Add HTML content with descriptive name
-                 html_content = create_html_with_images(results)
-                 zipf.writestr(f"{base_filename}_with_images.html", html_content)
- 
-                 # Add raw OCR text if available
-                 if "ocr_contents" in results and "raw_text" in results["ocr_contents"]:
-                     zipf.writestr(f"{base_filename}.txt", results["ocr_contents"]["raw_text"])
- 
-                 # Add HTML visualization if available
-                 if "html_visualization" in results:
-                     zipf.writestr("visualization.html", results["html_visualization"])
- 
-                 # Add images if available
-                 if "pages_data" in results:
-                     for page_idx, page in enumerate(results["pages_data"]):
-                         for img_idx, img in enumerate(page.get("images", [])):
-                             img_base64 = img.get("image_base64", "")
-                             if img_base64:
-                                 # Strip data URL prefix if present
-                                 if img_base64.startswith("data:image"):
-                                     img_base64 = img_base64.split(",", 1)[1]
- 
-                                 # Decode base64 and add to zip
-                                 try:
-                                     img_data = base64.b64decode(img_base64)
-                                     zipf.writestr(f"images/page_{page_idx+1}_img_{img_idx+1}.jpg", img_data)
-                                 except:
-                                     pass
-             except Exception:
-                 # If processing fails, return empty zip
-                 pass
- 
-     # Seek to the beginning of the BytesIO object
-     zip_buffer.seek(0)
- 
-     # Return the zip file bytes
-     return zip_buffer.getvalue()
- 
- def create_results_zip(results, output_dir=None, zip_name=None):
-     """
-     Create a zip file containing OCR results.
- 
-     Args:
-         results: Dictionary or list of OCR results
-         output_dir: Optional output directory
-         zip_name: Optional zip file name
- 
-     Returns:
-         Path to the created zip file
-     """
-     # Create temporary output directory if not provided
-     if output_dir is None:
-         output_dir = Path.cwd() / "output"
-         output_dir.mkdir(exist_ok=True)
-     else:
-         output_dir = Path(output_dir)
-         output_dir.mkdir(exist_ok=True)
- 
-     # Check if results is a list or a dictionary
-     is_list = isinstance(results, list)
- 
-     # Generate zip name if not provided
-     if zip_name is None:
-         timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
- 
-         if is_list:
-             # For a list of results, create a more descriptive name based on the content
-             file_count = len(results)
- 
-             # Count document types
-             pdf_count = sum(1 for r in results if r.get('file_name', '').lower().endswith('.pdf'))
-             img_count = sum(1 for r in results if r.get('file_name', '').lower().endswith(('.jpg', '.jpeg', '.png')))
- 
-             # Create descriptive name based on contents
-             if pdf_count > 0 and img_count > 0:
-                 zip_name = f"historical_ocr_mixed_{pdf_count}pdf_{img_count}img_{timestamp}.zip"
-             elif pdf_count > 0:
-                 zip_name = f"historical_ocr_pdf_documents_{pdf_count}_{timestamp}.zip"
-             elif img_count > 0:
-                 zip_name = f"historical_ocr_images_{img_count}_{timestamp}.zip"
-             else:
-                 zip_name = f"historical_ocr_results_{file_count}_{timestamp}.zip"
-         else:
-             # For single result, create descriptive filename
-             base_name = results.get("file_name", "document").split('.')[0]
- 
-             # Add document type if available
-             if 'topics' in results and results['topics']:
-                 topic = results['topics'][0].lower().replace(' ', '_')
-                 base_name = f"{base_name}_{topic}"
- 
-             # Add language if available
-             if 'languages' in results and results['languages']:
-                 lang = results['languages'][0].lower()
-                 # Only add if it's not already in the filename
-                 if lang not in base_name.lower():
-                     base_name = f"{base_name}_{lang}"
- 
-             # For PDFs, add page information
-             if 'total_pages' in results and 'processed_pages' in results:
-                 base_name = f"{base_name}_p{results['processed_pages']}of{results['total_pages']}"
- 
-             # Add timestamp
-             zip_name = f"{base_name}_{timestamp}.zip"
- 
-     try:
-         # Get zip data in memory first
-         zip_data = create_results_zip_in_memory(results)
- 
-         # Save to file
-         zip_path = output_dir / zip_name
-         with open(zip_path, 'wb') as f:
-             f.write(zip_data)
- 
-         return zip_path
-     except Exception as e:
-         # Create an empty zip file as fallback
-         zip_path = output_dir / zip_name
-         with zipfile.ZipFile(zip_path, 'w') as zipf:
-             zipf.writestr("info.txt", "Could not create complete archive")
- 
-         return zip_path
- 
- 
- # Advanced image preprocessing functions
- 
- def preprocess_image_for_ocr(image_path: Union[str, Path]) -> Tuple[Image.Image, str]:
-     """
-     Preprocess an image for optimal OCR performance with enhanced speed and memory optimization.
-     Enhanced to handle large newspaper and document images.
- 
-     Args:
-         image_path: Path to the image file
- 
-     Returns:
-         Tuple of (processed PIL Image, base64 string)
-     """
-     # Fast path: Skip all processing if PIL not available
-     if not PILLOW_AVAILABLE:
-         logger.info("PIL not available, skipping image preprocessing")
-         return None, encode_image_for_api(image_path)
- 
-     # Convert to Path object if string
-     image_file = Path(image_path) if isinstance(image_path, str) else image_path
- 
-     # Thread-safe caching with early exit for already processed images
-     try:
-         # Fast stat calls for file metadata - consolidate to reduce I/O
-         file_stat = image_file.stat()
-         file_size = file_stat.st_size
-         file_size_mb = file_size / (1024 * 1024)
-         mod_time = file_stat.st_mtime
- 
-         # Create a cache key based on essential file properties
-         cache_key = f"{image_file.name}_{file_size}_{mod_time}"
- 
-         # Fast path: Return cached result if available
-         if hasattr(preprocess_image_for_ocr, "_cache") and cache_key in preprocess_image_for_ocr._cache:
-             logger.debug(f"Using cached preprocessing result for {image_file.name}")
-             return preprocess_image_for_ocr._cache[cache_key]
- 
-         # Optimization: Skip heavy processing for very small files
-         # Small images (less than 100KB) likely don't need preprocessing
-         if file_size < 100000:  # 100KB
-             logger.info(f"Image {image_file.name} is small ({file_size/1024:.1f}KB), using minimal processing")
-             with Image.open(image_file) as img:
-                 # Normalize mode only
-                 if img.mode not in ('RGB', 'L'):
-                     img = img.convert('RGB')
- 
-                 # Save with light optimization
-                 buffer = io.BytesIO()
-                 img.save(buffer, format="JPEG", quality=95, optimize=True)
-                 buffer.seek(0)
- 
-                 # Get base64
-                 encoded_image = base64.b64encode(buffer.getvalue()).decode()
-                 base64_data_url = f"data:image/jpeg;base64,{encoded_image}"
- 
-                 # Cache and return
-                 result = (img, base64_data_url)
-                 if not hasattr(preprocess_image_for_ocr, "_cache"):
-                     preprocess_image_for_ocr._cache = {}
- 
-                 # Clean cache if needed
-                 if len(preprocess_image_for_ocr._cache) > 20:  # Increased cache size for better performance
-                     # Remove oldest 5 entries for better batch processing
-                     for _ in range(5):
-                         if preprocess_image_for_ocr._cache:
-                             preprocess_image_for_ocr._cache.pop(next(iter(preprocess_image_for_ocr._cache)))
- 
-                 preprocess_image_for_ocr._cache[cache_key] = result
-                 return result
- 
-         # Special handling for large newspaper-style documents
-         if file_size_mb > 5 and image_file.name.lower().endswith(('.jpg', '.jpeg', '.png')):
-             logger.info(f"Large image detected ({file_size_mb:.2f}MB), checking for newspaper format")
-             try:
-                 # Quickly check dimensions without loading full image
-                 with Image.open(image_file) as img:
-                     width, height = img.size
-                     aspect_ratio = width / height
- 
-                     # Newspaper-style documents typically have width > height or are very large
-                     is_newspaper_format = (aspect_ratio > 1.15 and width > 2000) or (width > 3000 or height > 3000)
- 
-                     if is_newspaper_format:
-                         logger.info(f"Newspaper format detected: {width}x{height}, applying specialized processing")
- 
-             except Exception as dim_err:
-                 logger.debug(f"Error checking dimensions: {str(dim_err)}")
-                 is_newspaper_format = False
-         else:
-             is_newspaper_format = False
- 
-     except Exception as e:
-         # If stat or cache handling fails, log and continue with processing
-         logger.debug(f"Cache handling failed for {image_path}: {str(e)}")
-         # Ensure we have a valid file_size_mb for later decisions
-         try:
-             file_size_mb = image_file.stat().st_size / (1024 * 1024)
-         except:
-             file_size_mb = 0  # Default if we can't determine size
- 
-         # Default to not newspaper format on error
-         is_newspaper_format = False
- 
-     try:
-         # Process start time for performance logging
-         start_time = time.time()
- 
-         # Open and process the image with minimal memory footprint
-         with Image.open(image_file) as img:
-             # Normalize image mode
-             if img.mode not in ('RGB', 'L'):
-                 img = img.convert('RGB')
- 
-             # Fast path: Quick check of image properties to determine appropriate processing
-             width, height = img.size
-             image_area = width * height
- 
-             # Detect document type only for medium to large images to save processing time
-             is_document = False
-             is_newspaper = False
- 
-             # More aggressive document type detection for larger images
-             if image_area > 500000:  # Approx 700x700 or larger
-                 # Store image for document detection
-                 _detect_document_type_impl._current_img = img
-                 is_document = _detect_document_type_impl(None)
- 
-                 # Additional check for newspaper format
-                 if is_document:
-                     # Newspapers typically have wide formats or very large dimensions
-                     aspect_ratio = width / height
-                     is_newspaper = (aspect_ratio > 1.15 and width > 2000) or (width > 3000 or height > 3000)
- 
-                 logger.debug(f"Document type detection for {image_file.name}: " +
-                              f"{'newspaper' if is_newspaper else 'document' if is_document else 'photo'}")
- 
-             # Check for handwritten document characteristics
-             is_handwritten = False
-             if CV2_AVAILABLE and not is_newspaper:
-                 # Use more advanced detection for handwritten content
-                 try:
-                     gray_np = np.array(img.convert('L'))
-                     # Higher variance in edge strengths can indicate handwriting
-                     edges = cv2.Canny(gray_np, 30, 100)
-                     if np.count_nonzero(edges) / edges.size > 0.02:  # Low edge threshold for handwriting
-                         # Additional check with gradient magnitudes
-                         sobelx = cv2.Sobel(gray_np, cv2.CV_64F, 1, 0, ksize=3)
-                         sobely = cv2.Sobel(gray_np, cv2.CV_64F, 0, 1, ksize=3)
-                         magnitude = np.sqrt(sobelx**2 + sobely**2)
-                         # Handwriting typically has more variation in gradient magnitudes
-                         if np.std(magnitude) > 20:
-                             is_handwritten = True
-                             logger.info(f"Handwritten document detected: {image_file.name}")
-                 except Exception as e:
-                     logger.debug(f"Handwriting detection error: {str(e)}")
- 
-             # Special processing for very large images (newspapers and large documents)
-             if is_newspaper:
-                 # For newspaper format, we need more specialized processing
-                 logger.info(f"Processing newspaper format image: {width}x{height}")
- 
-                 # For newspapers, we prioritize text clarity over file size
-                 # Use higher target resolution to preserve small text common in newspapers
-                 # But still need to resize if extremely large to avoid API limits
-                 max_dimension = max(width, height)
- 
-                 if max_dimension > 6000:  # Extremely large
-                     scale_factor = 0.4  # Preserve more resolution for newspapers (increased from 0.35)
-                 elif max_dimension > 4000:
-                     scale_factor = 0.6  # Higher resolution for better text extraction (increased from 0.5)
-                 else:
-                     scale_factor = 0.8  # Minimal reduction for moderate newspaper size (increased from 0.7)
- 
-                 # Calculate new dimensions - maintain higher resolution
-                 new_width = int(width * scale_factor)
-                 new_height = int(height * scale_factor)
- 
-                 # Use high-quality resampling to preserve text clarity in newspapers
-                 processed_img = img.resize((new_width, new_height), Image.LANCZOS)
-                 logger.debug(f"Resized newspaper image from {width}x{height} to {new_width}x{new_height}")
- 
-                 # For newspapers, we also want to enhance the contrast and sharpen the image
-                 # before the main OCR processing for better text extraction
-                 if img.mode in ('RGB', 'RGBA'):
-                     # For color newspapers, enhance both the overall image and then convert to grayscale
-                     # This helps with mixed content newspapers that have both text and images
-                     enhancer = ImageEnhance.Contrast(processed_img)
-                     processed_img = enhancer.enhance(1.3)  # Boost contrast but not too aggressively
- 
-                     # Also enhance saturation to make colored text more visible
-                     enhancer_sat = ImageEnhance.Color(processed_img)
-                     processed_img = enhancer_sat.enhance(1.2)
-             # Special processing for handwritten documents
-             elif is_handwritten:
-                 logger.info(f"Processing handwritten document: {width}x{height}")
- 
-                 # For handwritten text, we need to preserve stroke details
-                 # Use gentle scaling to maintain handwriting characteristics
-                 max_dimension = max(width, height)
- 
-                 if max_dimension > 4000:  # Large handwritten document
-                     scale_factor = 0.6  # Less aggressive reduction for handwriting
-                 else:
-                     scale_factor = 0.8  # Minimal reduction for moderate size
- 
-                 # Calculate new dimensions
-                 new_width = int(width * scale_factor)
-                 new_height = int(height * scale_factor)
- 
-                 # Use high-quality resampling to preserve handwriting details
-                 processed_img = img.resize((new_width, new_height), Image.LANCZOS)
- 
-                 # Lower contrast enhancement for handwriting to preserve stroke details
-                 if img.mode in ('RGB', 'RGBA'):
-                     # Convert to grayscale for better text processing
-                     processed_img = processed_img.convert('L')
- 
-                     # Use reduced contrast enhancement to preserve subtle strokes
-                     enhancer = ImageEnhance.Contrast(processed_img)
-                     processed_img = enhancer.enhance(1.2)  # Lower contrast value for handwriting
- 
-             # Standard processing for other large images
-             elif file_size_mb > IMAGE_PREPROCESSING["max_size_mb"] or max(width, height) > 3000:
-                 # Calculate target dimensions directly instead of using the heavier resize function
-                 target_width, target_height = width, height
-                 max_dimension = max(width, height)
- 
-                 # Use a sliding scale for reduction based on image size
-                 if max_dimension > 5000:
-                     scale_factor = 0.3  # Slightly less aggressive reduction (was 0.25)
-                 elif max_dimension > 3000:
-                     scale_factor = 0.45  # Slightly less aggressive reduction (was 0.4)
-                 else:
-                     scale_factor = 0.65  # Slightly less aggressive reduction (was 0.6)
- 
-                 # Calculate new dimensions
-                 new_width = int(width * scale_factor)
-                 new_height = int(height * scale_factor)
- 
-                 # Use direct resize with optimized resampling filter based on image size
-                 if image_area > 3000000:  # Very large, use faster but lower quality
-                     processed_img = img.resize((new_width, new_height), Image.BILINEAR)
-                 else:  # Medium size, use better quality
-                     processed_img = img.resize((new_width, new_height), Image.LANCZOS)
- 
-                 logger.debug(f"Resized image from {width}x{height} to {new_width}x{new_height}")
-             else:
-                 # Skip resizing for smaller images
-                 processed_img = img
- 
-             # Apply appropriate processing based on document type and size
-             if is_document:
-                 # Process as document with optimized path based on size
-                 if image_area > 1000000:  # Full processing for larger documents
-                     preprocess_document_image._current_img = processed_img
-                     processed = _preprocess_document_image_impl()
-                 else:  # Lightweight processing for smaller documents
-                     # Just enhance contrast for small documents to save time
-                     enhancer = ImageEnhance.Contrast(processed_img)
-                     processed = enhancer.enhance(1.3)
-             else:
-                 # Process as photo with optimized path based on size
-                 if image_area > 1000000:  # Full processing for larger photos
-                     preprocess_general_image._current_img = processed_img
-                     processed = _preprocess_general_image_impl()
-                 else:  # Skip processing for smaller photos
-                     processed = processed_img
- 
-             # Optimize memory handling during encoding
-             buffer = io.BytesIO()
- 
-             # Adjust quality based on image size to optimize API payload
-             if file_size_mb > 5:
-                 quality = 85  # Lower quality for large files
-             else:
-                 quality = IMAGE_PREPROCESSING["compression_quality"]
- 
-             # Save with optimized parameters
-             processed.save(buffer, format="JPEG", quality=quality, optimize=True)
-             buffer.seek(0)
- 
-             # Get base64 with minimal memory footprint
-             encoded_image = base64.b64encode(buffer.getvalue()).decode()
-             # Always use image/jpeg MIME type since we explicitly save as JPEG above
-             base64_data_url = f"data:image/jpeg;base64,{encoded_image}"
- 
-             # Update cache thread-safely
-             result = (processed, base64_data_url)
-             if not hasattr(preprocess_image_for_ocr, "_cache"):
-                 preprocess_image_for_ocr._cache = {}
- 
-             # LRU-like cache management with improved clearing
-             if len(preprocess_image_for_ocr._cache) > 20:
-                 try:
-                     # Remove several entries to avoid frequent cache clearing
-                     for _ in range(5):
-                         if preprocess_image_for_ocr._cache:
-                             preprocess_image_for_ocr._cache.pop(next(iter(preprocess_image_for_ocr._cache)))
-                 except:
-                     # If removal fails, just continue
-                     pass
- 
-             # Add to cache
-             try:
-                 preprocess_image_for_ocr._cache[cache_key] = result
-             except Exception:
-                 # If caching fails, just proceed
-                 pass
- 
-             # Log performance metrics
-             processing_time = time.time() - start_time
-             logger.debug(f"Image preprocessing completed in {processing_time:.3f}s for {image_file.name}")
- 
-             # Return both processed image and base64 string
-             return result
- 
-     except Exception as e:
-         # If preprocessing fails, log error and use original image
-         logger.warning(f"Image preprocessing failed: {str(e)}. Using original image.")
-         return None, encode_image_for_api(image_path)
- 
- # Removed caching decorator to fix unhashable type error
- def detect_document_type(img: Image.Image) -> bool:
-     """
-     Detect if an image is likely a document (text-heavy) vs. a photo.
- 
-     Args:
-         img: PIL Image object
- 
-     Returns:
-         True if likely a document, False otherwise
-     """
-     # Direct implementation without caching
-     return _detect_document_type_impl(None)
- 
- def _detect_document_type_impl(img_hash=None) -> bool:
-     """
-     Optimized implementation of document type detection for faster processing.
-     The img_hash parameter is unused but kept for backward compatibility.
- 
-     Enhanced to better detect handwritten documents and newspaper formats.
-     """
-     # Fast path: Get the image from thread-local storage
-     if not hasattr(_detect_document_type_impl, "_current_img"):
-         return False  # Fail safe in case image is not set
- 
-     img = _detect_document_type_impl._current_img
- 
-     # Skip processing for tiny images - just classify as non-documents
-     width, height = img.size
-     if width * height < 100000:  # Approx 300x300 or smaller
-         return False
- 
-     # Convert to grayscale for analysis (using faster conversion)
-     gray_img = img.convert('L')
- 
-     # PIL-only path for systems without OpenCV
-     if not CV2_AVAILABLE:
-         # Faster method: Sample a subset of the image for edge detection
-         # Downscale image for faster processing
-         sample_size = min(width, height, 1000)
-         scale_factor = sample_size / max(width, height)
- 
-         if scale_factor < 0.9:  # Only resize if significant reduction
-             sample_img = gray_img.resize(
-                 (int(width * scale_factor), int(height * scale_factor)),
-                 Image.NEAREST  # Fastest resampling method
-             )
-         else:
-             sample_img = gray_img
- 
-         # Fast edge detection on sample
-         edges = sample_img.filter(ImageFilter.FIND_EDGES)
- 
-         # Count edge pixels using threshold (faster than summing individual pixels)
-         edge_data = edges.getdata()
-         edge_threshold = 40  # Lowered threshold to better detect handwritten texts
- 
-         # Use list comprehension for better performance
-         edge_count = sum(1 for p in edge_data if p > edge_threshold)
-         total_pixels = len(edge_data)
-         edge_ratio = edge_count / total_pixels
- 
-         # Check if bright areas exist - simple approximation of text/background contrast
-         bright_count = sum(1 for p in gray_img.getdata() if p > 200)
-         bright_ratio = bright_count / (width * height)
- 
-         # Documents typically have more edges (text boundaries) and bright areas (background)
-         # Lowered edge threshold to better detect handwritten documents
-         return edge_ratio > 0.035 or bright_ratio > 0.4
- 
-     # OpenCV path - optimized for speed and enhanced for handwritten documents
-     img_np = np.array(gray_img)
- 
-     # 1. Fast check: Variance of pixel values
-     # Documents typically have high variance (text on background)
-     # Handwritten documents may have less contrast than printed text
-     std_dev = np.std(img_np)
-     if std_dev > 40:  # Further lowered threshold to better detect handwritten documents with low contrast
-         return True
- 
-     # 2. Quick check using downsampled image for edges
-     # Downscale for faster processing on large images
-     if max(img_np.shape) > 1000:
-         scale = 1000 / max(img_np.shape)
-         small_img = cv2.resize(img_np, None, fx=scale, fy=scale, interpolation=cv2.INTER_NEAREST)
-     else:
-         small_img = img_np
- 
-     # Enhanced edge detection for handwritten documents
-     # Use multiple Canny thresholds to better capture both faint and bold strokes
-     edges_low = cv2.Canny(small_img, 20, 110, L2gradient=False)  # For faint handwriting
-     edges_high = cv2.Canny(small_img, 30, 150, L2gradient=False)  # For standard text
- 
-     # Combine edge detection results
-     edges = cv2.bitwise_or(edges_low, edges_high)
-     edge_ratio = np.count_nonzero(edges) / edges.size
- 
-     # Special handling for potential handwritten content - more sensitive detection
-     handwritten_indicator = False
-     if edge_ratio > 0.015:  # Lower threshold specifically for handwritten content
-         try:
890
- # Look for handwriting stroke characteristics using gradient analysis
891
- # Compute gradient magnitudes and directions
892
- sobelx = cv2.Sobel(small_img, cv2.CV_64F, 1, 0, ksize=3)
893
- sobely = cv2.Sobel(small_img, cv2.CV_64F, 0, 1, ksize=3)
894
- magnitude = np.sqrt(sobelx**2 + sobely**2)
895
-
896
- # Handwriting typically has higher variation in gradient magnitudes
897
- if np.std(magnitude) > 18: # Lower threshold for more sensitivity
898
- # Handwriting is indicated if we also have some line structure
899
- # Try to find line segments that could indicate text lines
900
- lines = cv2.HoughLinesP(edges, 1, np.pi/180,
901
- threshold=45, # Lower threshold for handwriting
902
- minLineLength=25, # Shorter minimum line length
903
- maxLineGap=25) # Larger gap for disconnected handwriting
904
-
905
- if lines is not None and len(lines) > 8: # Fewer line segments needed
906
- handwritten_indicator = True
907
- except Exception:
908
- # If analysis fails, continue with other checks
909
- pass
910
-
911
- # 3. Enhanced histogram analysis for handwritten content
912
- # Use more granular bins for better detection of varying stroke densities
913
- dark_mask = img_np < 65 # Increased threshold to capture lighter handwritten text
914
- medium_mask = (img_np >= 65) & (img_np < 170) # Medium gray range for handwriting
915
- light_mask = img_np > 175 # Slightly adjusted for aged paper
916
-
917
- dark_ratio = np.count_nonzero(dark_mask) / img_np.size
918
- medium_ratio = np.count_nonzero(medium_mask) / img_np.size
919
- light_ratio = np.count_nonzero(light_mask) / img_np.size
920
-
921
- # Handwritten documents often have more medium-gray content than printed text
922
- # This helps detect pencil or faded ink handwriting
923
- if medium_ratio > 0.3 and edge_ratio > 0.015:
924
- return True
925
-
926
- # Special analysis for handwritten documents
927
- # Return true immediately if handwriting characteristics detected
928
- if handwritten_indicator:
929
- return True
930
-
931
- # Combine heuristics for final decision with improved sensitivity
932
- # Lower thresholds for handwritten documents
933
- return (dark_ratio > 0.025 and light_ratio > 0.2) or edge_ratio > 0.025
934
-
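A note on the _current_img handoff used throughout this file: the wrapper above delegates via _detect_document_type_impl(None) without assigning _detect_document_type_impl._current_img itself, so unless a caller sets that attribute beforehand, the fail-safe branch always returns False. A sketch of the intended pattern (hypothetical repair, not the committed code):

def detect_document_type(img):
    # Stash the image on the implementation function, then delegate
    _detect_document_type_impl._current_img = img
    try:
        return _detect_document_type_impl()
    finally:
        # Drop the reference so the image can be garbage-collected
        del _detect_document_type_impl._current_img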
935
- # Removed caching to fix unhashable type error
936
- def preprocess_document_image(img: Image.Image) -> Image.Image:
937
- """
938
- Preprocess a document image for optimal OCR.
939
-
940
- Args:
941
- img: PIL Image object
942
-
943
- Returns:
944
- Processed PIL Image
945
- """
946
- # Store the image for the implementation function
947
- preprocess_document_image._current_img = img
948
- # The actual implementation is separated for cleaner code organization
949
- return _preprocess_document_image_impl()
950
-
951
- def _preprocess_document_image_impl() -> Image.Image:
952
- """
953
- Optimized implementation of document preprocessing with adaptive processing based on image size.
954
- Enhanced for better handwritten document processing and newspaper format.
955
- """
956
- # Fast path: Get image from thread-local storage
957
- if not hasattr(preprocess_document_image, "_current_img"):
958
- raise ValueError("No image set for document preprocessing")
959
-
960
- img = preprocess_document_image._current_img
961
-
962
- # Analyze image size to determine processing strategy
963
- width, height = img.size
964
- img_size = width * height
965
-
966
- # Detect special document types
967
- is_handwritten = False
968
- is_newspaper = False
969
-
970
- # Check for newspaper format first (takes precedence)
971
- aspect_ratio = width / height
972
- if (aspect_ratio > 1.15 and width > 2000) or (width > 3000 or height > 3000):
973
- is_newspaper = True
974
- logger.debug(f"Newspaper format detected: {width}x{height}, aspect ratio: {aspect_ratio:.2f}")
975
- else:
976
- # If not newspaper, check if handwritten
977
- try:
978
- # Simple check for handwritten document characteristics
979
- # Handwritten documents often have more varied strokes and less stark contrast
980
- if CV2_AVAILABLE:
981
- # Convert to grayscale and calculate local variance
982
- gray_np = np.array(img.convert('L'))
983
- # Higher variance in edge strengths can indicate handwriting
984
- edges = cv2.Canny(gray_np, 30, 100)
985
- if np.count_nonzero(edges) / edges.size > 0.02: # Low edge threshold for handwriting
986
- # Additional check with gradient magnitudes
987
- sobelx = cv2.Sobel(gray_np, cv2.CV_64F, 1, 0, ksize=3)
988
- sobely = cv2.Sobel(gray_np, cv2.CV_64F, 0, 1, ksize=3)
989
- magnitude = np.sqrt(sobelx**2 + sobely**2)
990
- # Handwriting typically has more variation in gradient magnitudes
991
- if np.std(magnitude) > 20:
992
- is_handwritten = True
993
- except:
994
- # If detection fails, assume it's not handwritten
995
- pass
996
-
997
- # Special processing for newspaper format
998
- if is_newspaper:
999
- # Convert to grayscale for better text extraction
1000
- gray = img.convert('L')
1001
-
1002
- # For newspapers, we need aggressive text enhancement to make small print readable
1003
- # First enhance contrast more aggressively for newspaper small text
1004
- enhancer = ImageEnhance.Contrast(gray)
1005
- enhanced = enhancer.enhance(2.0) # More aggressive contrast for newspaper text
1006
-
1007
- # Apply stronger sharpening to make small text more defined
1008
- if IMAGE_PREPROCESSING["sharpen"]:
1009
- # Apply multiple passes of sharpening for newspaper text
1010
- enhanced = enhanced.filter(ImageFilter.SHARPEN)
1011
- enhanced = enhanced.filter(ImageFilter.EDGE_ENHANCE_MORE) # Stronger edge enhancement
1012
-
1013
- # Enhanced processing for newspapers with OpenCV when available
1014
- if CV2_AVAILABLE:
1015
- try:
1016
- # Convert to numpy array
1017
- img_np = np.array(enhanced)
1018
-
1019
- # For newspaper text extraction, CLAHE (Contrast Limited Adaptive Histogram Equalization)
1020
- # works much better than simple contrast enhancement
1021
- clahe = cv2.createCLAHE(clipLimit=3.0, tileGridSize=(8, 8))
1022
- img_np = clahe.apply(img_np)
1023
-
1024
- # Apply different adaptive thresholding approaches and choose the best one
1025
-
1026
- # 1. Standard adaptive threshold with larger block size for newspaper columns
1027
- binary1 = cv2.adaptiveThreshold(img_np, 255,
1028
- cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
1029
- cv2.THRESH_BINARY, 15, 4)
1030
-
1031
- # 2. Otsu's method for global thresholding - works well for clean newspaper print
1032
- _, binary2 = cv2.threshold(img_np, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
1033
-
1034
- # Try to determine which method preserves text better
1035
- # Count white pixels and edges in each binary version
1036
- white_pixels1 = np.count_nonzero(binary1 > 200)
1037
- white_pixels2 = np.count_nonzero(binary2 > 200)
1038
-
1039
- # Calculate edge density to help determine which preserves text features better
1040
- edges1 = cv2.Canny(binary1, 100, 200)
1041
- edges2 = cv2.Canny(binary2, 100, 200)
1042
- edge_count1 = np.count_nonzero(edges1)
1043
- edge_count2 = np.count_nonzero(edges2)
1044
-
1045
- # For newspaper text, we want to preserve more edges while maintaining reasonable
1046
- # white space (typical of printed text on paper background)
1047
- if (edge_count1 > edge_count2 * 1.2 and white_pixels1 > white_pixels2 * 0.7) or \
1048
- (white_pixels1 < white_pixels2 * 0.5): # If Otsu removed too much content
1049
- # Adaptive thresholding usually better preserves small text in newspapers
1050
- logger.debug("Using adaptive thresholding for newspaper text")
1051
-
1052
- # Apply optional denoising to clean up small speckles
1053
- result = cv2.fastNlMeansDenoising(binary1, None, 7, 7, 21)
1054
- return Image.fromarray(result)
1055
- else:
1056
- # Otsu method was better
1057
- logger.debug("Using Otsu thresholding for newspaper text")
1058
- result = cv2.fastNlMeansDenoising(binary2, None, 7, 7, 21)
1059
- return Image.fromarray(result)
1060
-
1061
- except Exception as e:
1062
- logger.debug(f"Advanced newspaper processing failed: {str(e)}")
1063
- # Fall back to PIL processing
1064
- pass
1065
-
1066
- # If OpenCV not available or fails, apply additional PIL enhancements
1067
- # Create a more aggressive binary version to better separate text
1068
- binary_threshold = enhanced.point(lambda x: 0 if x < 150 else 255, '1')
1069
-
1070
- # Return enhanced binary image
1071
- return binary_threshold
1072
-
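The adaptive-versus-Otsu selection above reads more clearly as a small scoring function; a self-contained sketch using the same thresholds (constants copied from the code above):

import cv2
import numpy as np

def pick_newspaper_binarization(gray: np.ndarray) -> np.ndarray:
    # Candidate 1: adaptive threshold with a large block size for newspaper columns
    adaptive = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                     cv2.THRESH_BINARY, 15, 4)
    # Candidate 2: Otsu's global threshold for clean print
    _, otsu = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    white_a = np.count_nonzero(adaptive > 200)
    white_o = np.count_nonzero(otsu > 200)
    edges_a = np.count_nonzero(cv2.Canny(adaptive, 100, 200))
    edges_o = np.count_nonzero(cv2.Canny(otsu, 100, 200))
    # Keep adaptive when it preserves more text edges without losing the paper
    # background, or when Otsu has erased too much content
    if (edges_a > edges_o * 1.2 and white_a > white_o * 0.7) or white_a < white_o * 0.5:
        return adaptive
    return otsu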
1073
- # Ultra-fast path for tiny images - just convert to grayscale with contrast enhancement
1074
- if img_size < 300000: # ~500x600 or smaller
1075
- gray = img.convert('L')
1076
- # Lower contrast enhancement for handwritten documents
1077
- contrast_level = 1.4 if is_handwritten else IMAGE_PREPROCESSING["enhance_contrast"]
1078
- enhancer = ImageEnhance.Contrast(gray)
1079
- return enhancer.enhance(contrast_level)
1080
-
1081
- # Fast path for small images - minimal processing
1082
- if img_size < 1000000: # ~1000x1000 or smaller
1083
- gray = img.convert('L')
1084
- # Use gentler contrast enhancement for handwritten documents
1085
- contrast_level = 1.4 if is_handwritten else IMAGE_PREPROCESSING["enhance_contrast"]
1086
- enhancer = ImageEnhance.Contrast(gray)
1087
- enhanced = enhancer.enhance(contrast_level)
1088
-
1089
- # Light sharpening only if sharpen is enabled
1090
- # Use milder sharpening for handwritten documents to preserve stroke detail
1091
- if IMAGE_PREPROCESSING["sharpen"]:
1092
- if is_handwritten:
1093
- # Use edge enhancement which is gentler than SHARPEN for handwriting
1094
- enhanced = enhanced.filter(ImageFilter.EDGE_ENHANCE)
1095
- else:
1096
- enhanced = enhanced.filter(ImageFilter.SHARPEN)
1097
- return enhanced
1098
-
1099
- # Standard path for medium images
1100
- # Convert to grayscale (faster processing)
1101
- gray = img.convert('L')
1102
-
1103
- # Adaptive contrast enhancement based on document type
1104
- contrast_level = 1.4 if is_handwritten else IMAGE_PREPROCESSING["enhance_contrast"]
1105
- enhancer = ImageEnhance.Contrast(gray)
1106
- enhanced = enhancer.enhance(contrast_level)
1107
-
1108
- # Apply light sharpening for text clarity - adapt based on document type
1109
- if IMAGE_PREPROCESSING["sharpen"]:
1110
- if is_handwritten:
1111
- # Use edge enhancement which is gentler than SHARPEN for handwriting
1112
- enhanced = enhanced.filter(ImageFilter.EDGE_ENHANCE)
1113
- else:
1114
- enhanced = enhanced.filter(ImageFilter.SHARPEN)
1115
-
1116
- # Advanced processing with OpenCV if available
1117
- if CV2_AVAILABLE and IMAGE_PREPROCESSING["denoise"]:
1118
- try:
1119
- # Convert to numpy array for OpenCV processing
1120
- img_np = np.array(enhanced)
1121
-
1122
- if is_handwritten:
1123
- # Enhanced processing for handwritten documents
1124
- # Optimized for better stroke preservation and readability
1125
- if img_size > 3000000: # Large images - downsample first
1126
- scale_factor = 0.5
1127
- small_img = cv2.resize(img_np, None, fx=scale_factor, fy=scale_factor,
1128
- interpolation=cv2.INTER_AREA)
1129
-
1130
- # Apply CLAHE for better local contrast in handwriting
1131
- clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
1132
- enhanced_img = clahe.apply(small_img)
1133
-
1134
- # Apply bilateral filter with parameters optimized for handwriting
1135
- # Lower sigma values to preserve more detail
1136
- filtered = cv2.bilateralFilter(enhanced_img, 7, 30, 50)
1137
-
1138
- # Resize back
1139
- filtered = cv2.resize(filtered, (width, height), interpolation=cv2.INTER_LINEAR)
1140
- else:
1141
- # For smaller handwritten images
1142
- # Apply CLAHE for better local contrast
1143
- clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
1144
- enhanced_img = clahe.apply(img_np)
1145
-
1146
- # Apply bilateral filter with parameters optimized for handwriting
1147
- filtered = cv2.bilateralFilter(enhanced_img, 5, 25, 45)
1148
-
1149
- # Adaptive thresholding specific to handwriting
1150
- try:
1151
- # Use larger block size and lower constant for better stroke preservation
1152
- binary = cv2.adaptiveThreshold(
1153
- filtered, 255,
1154
- cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
1155
- cv2.THRESH_BINARY,
1156
- 21, # Larger block size for handwriting
1157
- 5 # Lower constant for better stroke preservation
1158
- )
1159
-
1160
- # Apply slight dilation to connect broken strokes
1161
- kernel = np.ones((2, 2), np.uint8)
1162
- binary = cv2.dilate(binary, kernel, iterations=1)
1163
-
1164
- # Convert back to PIL Image
1165
- return Image.fromarray(binary)
1166
- except Exception as e:
1167
- logger.debug(f"Adaptive threshold for handwriting failed: {str(e)}")
1168
- # Convert filtered image to PIL and return as fallback
1169
- return Image.fromarray(filtered)
1170
-
1171
- else:
1172
- # Standard document processing - optimized for printed text
1173
- # Optimize denoising parameters based on image size
1174
- if img_size > 4000000: # Very large images
1175
- # More aggressive downsampling for very large images
1176
- scale_factor = 0.5
1177
- downsample = cv2.resize(img_np, None, fx=scale_factor, fy=scale_factor,
1178
- interpolation=cv2.INTER_AREA)
1179
-
1180
- # Lighter denoising for downsampled image
1181
- h_value = 7 # Strength parameter
1182
- template_window = 5
1183
- search_window = 13
1184
-
1185
- # Apply denoising on smaller image
1186
- denoised_np = cv2.fastNlMeansDenoising(downsample, None, h_value, template_window, search_window)
1187
-
1188
- # Resize back to original size
1189
- denoised_np = cv2.resize(denoised_np, (width, height), interpolation=cv2.INTER_LINEAR)
1190
- else:
1191
- # Direct denoising for medium-large images
1192
- h_value = 8 # Balanced for speed and quality
1193
- template_window = 5
1194
- search_window = 15
1195
-
1196
- # Apply denoising
1197
- denoised_np = cv2.fastNlMeansDenoising(img_np, None, h_value, template_window, search_window)
1198
-
1199
- # Convert back to PIL Image
1200
- enhanced = Image.fromarray(denoised_np)
1201
-
1202
- # Apply adaptive thresholding only if it improves text visibility
1203
- # Create a binarized version of the image
1204
- if img_size < 8000000: # Skip for extremely large images to save processing time
1205
- binary = cv2.adaptiveThreshold(denoised_np, 255,
1206
- cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
1207
- cv2.THRESH_BINARY, 11, 2)
1208
-
1209
- # Quick verification that binarization preserves text information
1210
- # Use simplified check that works well for document images
1211
- white_pixels_binary = np.count_nonzero(binary > 200)
1212
- white_pixels_orig = np.count_nonzero(denoised_np > 200)
1213
-
1214
- # Check if binary preserves reasonable amount of white pixels (background)
1215
- if white_pixels_binary > white_pixels_orig * 0.8:
1216
- # Binarization looks good, use it
1217
- return Image.fromarray(binary)
1218
-
1219
- return enhanced
1220
-
1221
- except Exception as e:
1222
- # If OpenCV processing fails, continue with PIL-enhanced image
1223
- pass
1224
-
1225
- elif IMAGE_PREPROCESSING["denoise"]:
1226
- # Fallback PIL denoising for systems without OpenCV
1227
- if is_handwritten:
1228
- # Lighter filtering for handwritten text to preserve details
1229
- # Use a smaller median filter for handwritten documents
1230
- enhanced = enhanced.filter(ImageFilter.MedianFilter(1))
1231
- else:
1232
- # Standard filtering for printed documents
1233
- enhanced = enhanced.filter(ImageFilter.MedianFilter(3))
1234
-
1235
- # Return enhanced grayscale image
1236
- return enhanced
1237
-
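The white-pixel comparison used above before accepting a binarized result generalizes to a one-line invariant; a sketch (the 0.8 ratio mirrors the code above):

import numpy as np

def binarization_preserves_background(gray: np.ndarray, binary: np.ndarray,
                                      keep_ratio: float = 0.8) -> bool:
    # A safe binarization keeps most of the bright paper background
    white_binary = np.count_nonzero(binary > 200)
    white_gray = np.count_nonzero(gray > 200)
    return white_gray == 0 or white_binary > white_gray * keep_ratio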
1238
- # Removed caching to fix unhashable type error
1239
- def preprocess_general_image(img: Image.Image) -> Image.Image:
1240
- """
1241
- Preprocess a general image for OCR.
1242
-
1243
- Args:
1244
- img: PIL Image object
1245
-
1246
- Returns:
1247
- Processed PIL Image
1248
- """
1249
- # Store the image for implementation function
1250
- preprocess_general_image._current_img = img
1251
- return _preprocess_general_image_impl()
1252
-
1253
- def _preprocess_general_image_impl() -> Image.Image:
1254
- """
1255
- Optimized implementation of general image preprocessing with size-based processing paths
1256
- """
1257
- # Fast path: Get the image from thread-local storage
1258
- if not hasattr(preprocess_general_image, "_current_img"):
1259
- raise ValueError("No image set for general preprocessing")
1260
-
1261
- img = preprocess_general_image._current_img
1262
-
1263
- # Ultra-fast path: Skip processing completely for small images to improve performance
1264
- width, height = img.size
1265
- img_size = width * height
1266
- if img_size < 300000: # Skip for tiny images under ~0.3 megapixel
1267
- # Just ensure correct color mode
1268
- if img.mode != 'RGB':
1269
- return img.convert('RGB')
1270
- return img
1271
-
1272
- # Fast path: Minimal processing for smaller images
1273
- if img_size < 600000: # ~800x750 or smaller
1274
- # Ensure RGB mode
1275
- if img.mode != 'RGB':
1276
- img = img.convert('RGB')
1277
-
1278
- # Very light contrast enhancement only
1279
- enhancer = ImageEnhance.Contrast(img)
1280
- return enhancer.enhance(1.15) # Lighter enhancement for small images
1281
-
1282
- # Standard path: Apply moderate enhancements for medium images
1283
- # Convert to RGB to ensure compatibility
1284
- if img.mode != 'RGB':
1285
- img = img.convert('RGB')
1286
-
1287
- # Moderate enhancement only
1288
- enhancer = ImageEnhance.Contrast(img)
1289
- enhanced = enhancer.enhance(1.2) # Less aggressive than document enhancement
1290
-
1291
- # Skip additional processing for medium-sized images
1292
- if img_size < 1000000: # Skip for images under ~1 megapixel
1293
- return enhanced
1294
-
1295
- # Enhanced path: Additional processing for larger images
1296
- try:
1297
- # Apply optimized enhancement pipeline for large non-document images
1298
-
1299
- # 1. Improve color saturation slightly for better feature extraction
1300
- saturation = ImageEnhance.Color(enhanced)
1301
- enhanced = saturation.enhance(1.1)
1302
-
1303
- # 2. Apply adaptive sharpening based on image size
1304
- if img_size > 2500000: # Very large images (~1600x1600 or larger)
1305
- # Use EDGE_ENHANCE instead of SHARPEN for more subtle enhancement on large images
1306
- enhanced = enhanced.filter(ImageFilter.EDGE_ENHANCE)
1307
- else:
1308
- # Standard sharpening for regular large images
1309
- enhanced = enhanced.filter(ImageFilter.SHARPEN)
1310
-
1311
- # 3. Apply additional processing with OpenCV if available (for largest images)
1312
- if CV2_AVAILABLE and img_size > 3000000:
1313
- # Convert to numpy array
1314
- img_np = np.array(enhanced)
1315
-
1316
- # Apply subtle enhancement of details (CLAHE)
1317
- try:
1318
- # Convert to LAB color space for better processing
1319
- lab = cv2.cvtColor(img_np, cv2.COLOR_RGB2LAB)
1320
-
1321
- # Only enhance the L channel (luminance)
1322
- l, a, b = cv2.split(lab)
1323
-
1324
- # Create CLAHE object with optimal parameters for photos
1325
- clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
1326
-
1327
- # Apply CLAHE to L channel
1328
- l = clahe.apply(l)
1329
-
1330
- # Merge channels back and convert to RGB
1331
- lab = cv2.merge((l, a, b))
1332
- enhanced_np = cv2.cvtColor(lab, cv2.COLOR_LAB2RGB)
1333
-
1334
- # Convert back to PIL
1335
- enhanced = Image.fromarray(enhanced_np)
1336
- except:
1337
- # If CLAHE fails, continue with PIL-enhanced image
1338
- pass
1339
-
1340
- except Exception:
1341
- # If any enhancement fails, fall back to basic contrast enhancement
1342
- if img.mode != 'RGB':
1343
- img = img.convert('RGB')
1344
- enhancer = ImageEnhance.Contrast(img)
1345
- enhanced = enhancer.enhance(1.2)
1346
-
1347
- return enhanced
1348
-
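The LAB-space CLAHE step buried in the nested try/except above is the heart of the large-photo path; extracted as a standalone helper it is easier to test (a sketch, assuming OpenCV and Pillow are available):

import cv2
import numpy as np
from PIL import Image

def clahe_luminance(img: Image.Image, clip: float = 2.0) -> Image.Image:
    # Equalize local contrast on the L (luminance) channel only,
    # leaving the a/b color channels untouched
    lab = cv2.cvtColor(np.array(img.convert('RGB')), cv2.COLOR_RGB2LAB)
    l, a, b = cv2.split(lab)
    l = cv2.createCLAHE(clipLimit=clip, tileGridSize=(8, 8)).apply(l)
    return Image.fromarray(cv2.cvtColor(cv2.merge((l, a, b)), cv2.COLOR_LAB2RGB))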
1349
- # Removed caching decorator to fix unhashable type error
1350
- def resize_image(img: Image.Image, target_dpi: int = 300) -> Image.Image:
1351
- """
1352
- Resize an image to an optimal size for OCR while preserving quality.
1353
-
1354
- Args:
1355
- img: PIL Image object
1356
- target_dpi: Target DPI (dots per inch)
1357
-
1358
- Returns:
1359
- Resized PIL Image
1360
- """
1361
- # Store the image for implementation function
1362
- resize_image._current_img = img
1363
- return resize_image_impl(target_dpi)
1364
 
1365
- def resize_image_impl(target_dpi: int = 300) -> Image.Image:
1366
  """
1367
- Implementation of resize function that uses thread-local storage.
1368
 
1369
  Args:
1370
- target_dpi: Target DPI (dots per inch)
1371
-
1372
- Returns:
1373
- Resized PIL Image
1374
- """
1375
- # Get the image from thread-local storage (set by the caller)
1376
- if not hasattr(resize_image, "_current_img"):
1377
- raise ValueError("No image set for resizing")
1378
-
1379
- img = resize_image._current_img
1380
-
1381
- # Calculate current dimensions
1382
- width, height = img.size
1383
-
1384
- # Fixed target dimensions based on DPI
1385
- # Using larger dimensions to support newspapers and large documents
1386
- max_width = int(14 * target_dpi) # Increased from 8.5 to 14 inches
1387
- max_height = int(22 * target_dpi) # Increased from 11 to 22 inches
1388
-
1389
- # Check if resizing is needed - quick early return
1390
- if width <= max_width and height <= max_height:
1391
- return img # No resizing needed
1392
-
1393
- # Calculate scaling factor once
1394
- scale_factor = min(max_width / width, max_height / height)
1395
-
1396
- # Calculate new dimensions
1397
- new_width = int(width * scale_factor)
1398
- new_height = int(height * scale_factor)
1399
-
1400
- # Use BICUBIC for better balance of speed and quality
1401
- return img.resize((new_width, new_height), Image.BICUBIC)
1402
-
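Worked example: at the default 300 DPI the caps come to 4200×6600 px (14×22 in). An 8400×4400 scan gets scale_factor = min(4200/8400, 6600/4400) = min(0.5, 1.5) = 0.5 and is returned at 4200×2200, while a 4000×6000 scan is already within both caps and is returned unchanged.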
1403
- def calculate_image_entropy(img: Image.Image) -> float:
1404
- """
1405
- Calculate the entropy (information content) of an image.
1406
-
1407
- Args:
1408
- img: PIL Image object
1409
-
1410
- Returns:
1411
- Entropy value
1412
- """
1413
- # Convert to grayscale
1414
- if img.mode != 'L':
1415
- img = img.convert('L')
1416
-
1417
- # Calculate histogram
1418
- histogram = img.histogram()
1419
- total_pixels = img.width * img.height
1420
-
1421
- # Calculate entropy
1422
- entropy = 0
1423
- for h in histogram:
1424
- if h > 0:
1425
- probability = h / total_pixels
1426
- entropy -= probability * np.log2(probability)
1427
-
1428
- return entropy
1429
-
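The loop above computes the Shannon entropy H = -Σ p·log2(p) over the grayscale histogram, so a perfectly flat 256-level histogram yields 8.0 bits and a constant image yields 0.0. A vectorized equivalent (sketch):

import numpy as np
from PIL import Image

def image_entropy(img: Image.Image) -> float:
    hist = np.asarray(img.convert('L').histogram(), dtype=np.float64)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins; 0 * log2(0) is taken as 0
    return float(-(p * np.log2(p)).sum())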
1430
- def create_html_with_images(result):
1431
- """
1432
- Create an HTML document with embedded images from OCR results.
1433
- Handles serialization of complex OCR objects automatically.
1434
-
1435
- Args:
1436
- result: OCR result dictionary containing pages_data
1437
-
1438
- Returns:
1439
- HTML content as string
1440
- """
1441
- # Ensure result is fully serializable first
1442
- result = serialize_ocr_object(result)
1443
- # Create HTML document structure
1444
- html_content = """
1445
- <!DOCTYPE html>
1446
- <html>
1447
- <head>
1448
- <meta charset="UTF-8">
1449
- <meta name="viewport" content="width=device-width, initial-scale=1.0">
1450
- <title>OCR Document with Images</title>
1451
- <style>
1452
- body {
1453
- font-family: Georgia, serif;
1454
- line-height: 1.7;
1455
- margin: 0 auto;
1456
- max-width: 800px;
1457
- padding: 20px;
1458
- }
1459
- img {
1460
- max-width: 90%;
1461
- max-height: 500px;
1462
- object-fit: contain;
1463
- margin: 20px auto;
1464
- display: block;
1465
- border: 1px solid #ddd;
1466
- border-radius: 4px;
1467
- }
1468
- .image-container {
1469
- margin: 20px 0;
1470
- text-align: center;
1471
- }
1472
- .page-break {
1473
- border-top: 1px solid #ddd;
1474
- margin: 40px 0;
1475
- padding-top: 40px;
1476
- }
1477
- h3 {
1478
- color: #333;
1479
- border-bottom: 1px solid #eee;
1480
- padding-bottom: 10px;
1481
- }
1482
- p {
1483
- margin: 12px 0;
1484
- }
1485
- .page-text-content {
1486
- margin-bottom: 20px;
1487
- }
1488
- .text-block {
1489
- background-color: #f9f9f9;
1490
- padding: 15px;
1491
- border-radius: 4px;
1492
- border-left: 3px solid #546e7a;
1493
- margin-bottom: 15px;
1494
- color: #333;
1495
- }
1496
- .text-block p {
1497
- margin: 8px 0;
1498
- color: #333;
1499
- }
1500
- .metadata {
1501
- background-color: #f5f5f5;
1502
- padding: 10px 15px;
1503
- border-radius: 4px;
1504
- margin-bottom: 20px;
1505
- font-size: 14px;
1506
- }
1507
- .metadata p {
1508
- margin: 5px 0;
1509
- }
1510
- </style>
1511
- </head>
1512
- <body>
1513
- """
1514
-
1515
- # Add document metadata
1516
- html_content += f"""
1517
- <div class="metadata">
1518
- <h2>{result.get('file_name', 'Document')}</h2>
1519
- <p><strong>Processed at:</strong> {result.get('timestamp', '')}</p>
1520
- <p><strong>Languages:</strong> {', '.join(result.get('languages', ['Unknown']))}</p>
1521
- <p><strong>Topics:</strong> {', '.join(result.get('topics', ['Unknown']))}</p>
1522
- </div>
1523
- """
1524
-
1525
- # Check if we have pages_data
1526
- if 'pages_data' in result and result['pages_data']:
1527
- pages_data = result['pages_data']
1528
-
1529
- # Process each page
1530
- for i, page in enumerate(pages_data):
1531
- page_markdown = page.get('markdown', '')
1532
- images = page.get('images', [])
1533
-
1534
- # Add page header if multi-page
1535
- if len(pages_data) > 1:
1536
- html_content += f"<h3>Page {i+1}</h3>"
1537
-
1538
- # Create image dictionary
1539
- image_dict = {}
1540
- for img in images:
1541
- if 'id' in img and 'image_base64' in img:
1542
- image_dict[img['id']] = img['image_base64']
1543
-
1544
- # Process the markdown content
1545
- if page_markdown:
1546
- # Extract text content (lines without images)
1547
- text_content = []
1548
- image_lines = []
1549
-
1550
- for line in page_markdown.split('\n'):
1551
- if '![' in line and '](' in line:
1552
- image_lines.append(line)
1553
- elif line.strip():
1554
- text_content.append(line)
1555
-
1556
- # Add text content
1557
- if text_content:
1558
- html_content += '<div class="text-block">'
1559
- for line in text_content:
1560
- html_content += f"<p>{line}</p>"
1561
- html_content += '</div>'
1562
-
1563
- # Add images
1564
- for line in image_lines:
1565
- # Extract image ID and alt text using simple parsing
1566
- try:
1567
- alt_start = line.find('![') + 2
1568
- alt_end = line.find(']', alt_start)
1569
- alt_text = line[alt_start:alt_end]
1570
-
1571
- img_start = line.find('(', alt_end) + 1
1572
- img_end = line.find(')', img_start)
1573
- img_id = line[img_start:img_end]
1574
-
1575
- if img_id in image_dict:
1576
- html_content += f'<div class="image-container">'
1577
- html_content += f'<img src="{image_dict[img_id]}" alt="{alt_text}">'
1578
- html_content += f'</div>'
1579
- except:
1580
- # If parsing fails, just skip this image
1581
- continue
1582
-
1583
- # Add page separator if not the last page
1584
- if i < len(pages_data) - 1:
1585
- html_content += '<div class="page-break"></div>'
1586
-
1587
- # Add structured content if available
1588
- if 'ocr_contents' in result and isinstance(result['ocr_contents'], dict):
1589
- html_content += '<h3>Structured Content</h3>'
1590
-
1591
- for section, content in result['ocr_contents'].items():
1592
- if content and section not in ['error', 'raw_text', 'partial_text']:
1593
- html_content += f'<h4>{section.replace("_", " ").title()}</h4>'
1594
-
1595
- if isinstance(content, str):
1596
- html_content += f'<p>{content}</p>'
1597
- elif isinstance(content, list):
1598
- html_content += '<ul>'
1599
- for item in content:
1600
- html_content += f'<li>{str(item)}</li>'
1601
- html_content += '</ul>'
1602
- elif isinstance(content, dict):
1603
- html_content += '<dl>'
1604
- for k, v in content.items():
1605
- html_content += f'<dt>{k}</dt><dd>{v}</dd>'
1606
- html_content += '</dl>'
1607
-
1608
- # Close HTML document
1609
- html_content += """
1610
- </body>
1611
- </html>
1612
- """
1613
-
1614
- return html_content
1615
-
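The find()-based extraction of markdown image tags above is equivalent to a single regular expression; a sketch:

import re

# Matches ![alt](id) pairs on a markdown line
IMG_PATTERN = re.compile(r'!\[([^\]]*)\]\(([^)]+)\)')

for alt_text, img_id in IMG_PATTERN.findall('Intro ![Figure 1](img-0.jpeg) outro'):
    print(alt_text, img_id)  # -> Figure 1 img-0.jpeg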
1616
- def generate_document_thumbnail(image_path: Union[str, Path], max_size: int = 300) -> str:
1617
- """
1618
- Generate a thumbnail for document preview.
1619
-
1620
- Args:
1621
- image_path: Path to the image file
1622
- max_size: Maximum dimension for thumbnail
1623
-
1624
- Returns:
1625
- Base64 encoded thumbnail
1626
- """
1627
- if not PILLOW_AVAILABLE:
1628
- return None
1629
-
1630
- try:
1631
- # Open the image
1632
- with Image.open(image_path) as img:
1633
- # Calculate thumbnail size preserving aspect ratio
1634
- width, height = img.size
1635
- if width > height:
1636
- new_width = max_size
1637
- new_height = int(height * (max_size / width))
1638
- else:
1639
- new_height = max_size
1640
- new_width = int(width * (max_size / height))
1641
-
1642
- # Create thumbnail
1643
- thumbnail = img.resize((new_width, new_height), Image.LANCZOS)
1644
-
1645
- # Save to buffer
1646
- buffer = io.BytesIO()
1647
- thumbnail.save(buffer, format="JPEG", quality=85)
1648
- buffer.seek(0)
1649
-
1650
- # Encode as base64
1651
- encoded = base64.b64encode(buffer.getvalue()).decode()
1652
- return f"data:image/jpeg;base64,{encoded}"
1653
- except Exception:
1654
- # Return None if thumbnail generation fails
1655
- return None
1656
-
1657
- def serialize_ocr_object(obj):
1658
- """
1659
- Serialize OCR response objects to JSON serializable format.
1660
- Handles OCRImageObject specifically to prevent serialization errors.
1661
-
1662
- Args:
1663
- obj: The object to serialize
1664
-
1665
- Returns:
1666
- JSON serializable representation of the object
1667
- """
1668
- # Fast path: Handle primitive types directly
1669
- if obj is None or isinstance(obj, (str, int, float, bool)):
1670
- return obj
1671
-
1672
- # Handle collections
1673
- if isinstance(obj, list):
1674
- return [serialize_ocr_object(item) for item in obj]
1675
- elif isinstance(obj, dict):
1676
- return {k: serialize_ocr_object(v) for k, v in obj.items()}
1677
- elif isinstance(obj, OCRImageObject):
1678
- # Special handling for OCRImageObject
1679
- return {
1680
- 'id': obj.id if hasattr(obj, 'id') else None,
1681
- 'image_base64': obj.image_base64 if hasattr(obj, 'image_base64') else None
1682
- }
1683
- elif hasattr(obj, '__dict__'):
1684
- # For objects with __dict__ attribute
1685
- return {k: serialize_ocr_object(v) for k, v in obj.__dict__.items()
1686
- if not k.startswith('_')} # Skip private attributes
1687
- else:
1688
- # Try to convert to string as last resort
1689
- try:
1690
- return str(obj)
1691
- except:
1692
- return None
1693
-
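Typical use is to make an API response JSON-safe before writing it out (the response object named here is hypothetical):

import json

payload = serialize_ocr_object(ocr_response)  # ocr_response: any nested OCR result
with open('result.json', 'w') as f:
    json.dump(payload, f, indent=2)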
1694
- def try_local_ocr_fallback(image_path: Union[str, Path], base64_data_url: str = None) -> str:
1695
- """
1696
- Attempt to use local pytesseract OCR as a fallback when API fails
1697
- With enhanced processing optimized for handwritten content
1698
-
1699
- Args:
1700
- image_path: Path to the image file
1701
  base64_data_url: Optional base64 data URL if already available
1702
 
1703
  Returns:
1704
- OCR text string if successful, None if failed
1705
  """
1706
- logger.info("Attempting local OCR fallback using pytesseract...")
1707
 
1708
  try:
1709
- import pytesseract
1710
- from PIL import Image
1711
-
1712
- # Load image - either from path or from base64
1713
- if base64_data_url and base64_data_url.startswith('data:image'):
1714
- # Extract image from base64
1715
- image_data = base64_data_url.split(',', 1)[1]
1716
- image_bytes = base64.b64decode(image_data)
1717
- image = Image.open(io.BytesIO(image_bytes))
1718
- else:
1719
- # Load from file path
1720
- image_path = Path(image_path) if isinstance(image_path, str) else image_path
1721
- image = Image.open(image_path)
1722
-
1723
- # Auto-detect if this appears to be handwritten
1724
- is_handwritten = False
1725
 
1726
- # Use OpenCV for better detection and preprocessing if available
1727
- if CV2_AVAILABLE:
1728
- try:
1729
- # Convert image to numpy array
1730
- img_np = np.array(image.convert('L'))
1731
-
1732
- # Check for handwritten characteristics
1733
- edges = cv2.Canny(img_np, 30, 100)
1734
- edge_ratio = np.count_nonzero(edges) / edges.size
1735
-
1736
- # Typical handwritten documents have more varied edge patterns
1737
- if edge_ratio > 0.02:
1738
- # Additional check with gradient magnitudes
1739
- sobelx = cv2.Sobel(img_np, cv2.CV_64F, 1, 0, ksize=3)
1740
- sobely = cv2.Sobel(img_np, cv2.CV_64F, 0, 1, ksize=3)
1741
- magnitude = np.sqrt(sobelx**2 + sobely**2)
1742
- # Handwriting typically has more variation in gradient magnitudes
1743
- if np.std(magnitude) > 20:
1744
- is_handwritten = True
1745
- logger.info("Detected handwritten content for local OCR")
1746
-
1747
- # Enhanced preprocessing based on document type
1748
- if is_handwritten:
1749
- # Process for handwritten content
1750
- # Apply CLAHE for better local contrast
1751
- clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
1752
- img_np = clahe.apply(img_np)
1753
-
1754
- # Apply adaptive thresholding with optimized parameters for handwriting
1755
- binary = cv2.adaptiveThreshold(
1756
- img_np, 255,
1757
- cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
1758
- cv2.THRESH_BINARY,
1759
- 21, # Larger block size for handwriting
1760
- 5 # Lower constant for better stroke preservation
1761
- )
1762
-
1763
- # Optional: apply dilation to thicken strokes slightly
1764
- kernel = np.ones((2, 2), np.uint8)
1765
- binary = cv2.dilate(binary, kernel, iterations=1)
1766
-
1767
- # Convert back to PIL Image for tesseract
1768
- image = Image.fromarray(binary)
1769
-
1770
- # Set tesseract options for handwritten content
1771
- custom_config = r'--oem 1 --psm 6 -l eng'
1772
- else:
1773
- # Process for printed content
1774
- # Apply CLAHE for better contrast
1775
- clahe = cv2.createCLAHE(clipLimit=2.5, tileGridSize=(8, 8))
1776
- img_np = clahe.apply(img_np)
1777
-
1778
- # Apply bilateral filter to reduce noise while preserving edges
1779
- img_np = cv2.bilateralFilter(img_np, 9, 75, 75)
1780
-
1781
- # Apply Otsu's thresholding for printed text
1782
- _, binary = cv2.threshold(img_np, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
1783
-
1784
- # Convert back to PIL Image for tesseract
1785
- image = Image.fromarray(binary)
1786
-
1787
- # Set tesseract options for printed content
1788
- custom_config = r'--oem 3 --psm 6 -l eng'
1789
- except Exception as e:
1790
- logger.warning(f"OpenCV preprocessing failed: {str(e)}. Using PIL fallback.")
1791
-
1792
- # Convert to RGB if not already (pytesseract works best with RGB)
1793
- if image.mode != 'RGB':
1794
- image = image.convert('RGB')
1795
-
1796
- # Apply basic image enhancements
1797
- image = image.convert('L')
1798
- enhancer = ImageEnhance.Contrast(image)
1799
- image = enhancer.enhance(2.0)
1800
- custom_config = r'--oem 3 --psm 6 -l eng'
1801
- else:
1802
- # PIL-only path without OpenCV
1803
- # Convert to RGB if not already (pytesseract works best with RGB)
1804
- if image.mode != 'RGB':
1805
- image = image.convert('RGB')
1806
-
1807
- # Apply basic image enhancements
1808
- image = image.convert('L')
1809
- enhancer = ImageEnhance.Contrast(image)
1810
- image = enhancer.enhance(2.0)
1811
- custom_config = r'--oem 3 --psm 6 -l eng'
1812
 
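For reference on the Tesseract flags used here: --oem 1 selects the LSTM engine only while --oem 3 lets Tesseract choose; --psm 6 assumes a single uniform block of text, and the --psm 4 retry further down assumes a single column of text of variable sizes.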
1813
- # Run OCR with appropriate config
1814
- ocr_text = pytesseract.image_to_string(image, config=custom_config)
1815
 
1816
- if ocr_text and len(ocr_text.strip()) > 50:
1817
- logger.info(f"Local OCR successful: extracted {len(ocr_text)} characters")
1818
- return ocr_text
1819
  else:
1820
- # Try another psm mode as fallback
1821
- logger.warning("First OCR attempt produced minimal text, trying another mode")
1822
- # Try PSM mode 4 (assume single column of text)
1823
- fallback_config = r'--oem 3 --psm 4 -l eng'
1824
- ocr_text = pytesseract.image_to_string(image, config=fallback_config)
1825
-
1826
- if ocr_text and len(ocr_text.strip()) > 50:
1827
- logger.info(f"Local OCR fallback successful: extracted {len(ocr_text)} characters")
1828
- return ocr_text
1829
- else:
1830
- logger.warning("Local OCR produced minimal or no text")
1831
- return None
1832
- except ImportError:
1833
- logger.warning("Pytesseract not installed - local OCR not available")
1834
- return None
1835
  except Exception as e:
1836
- logger.error(f"Local OCR fallback failed: {str(e)}")
1837
- return None
 
1
  """
2
+ OCR utility functions for image processing and OCR operations.
3
+ This module provides helper functions used across the Historical OCR application.
4
  """
5
 
6
+ import os
 
7
  import base64
8
  import logging
9
  from pathlib import Path
10
+ from typing import Union, Optional
 
11
 
12
  # Configure logging
13
  logging.basicConfig(level=logging.INFO,
14
+ format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
15
  logger = logging.getLogger(__name__)
16
 
17
+ # Try to import optional dependencies
18
+ try:
19
+ import pytesseract
20
+ TESSERACT_AVAILABLE = True
21
+ except ImportError:
22
+ logger.warning("pytesseract not available - local OCR fallback will not work")
23
+ TESSERACT_AVAILABLE = False
24
 
25
  try:
26
+ from PIL import Image
27
  PILLOW_AVAILABLE = True
28
  except ImportError:
29
  logger.warning("PIL not available - image preprocessing will be limited")
30
  PILLOW_AVAILABLE = False
31
 
32
 
33
  def encode_image_for_api(image_path: Union[str, Path]) -> str:
34
  """
35
+ Encode an image as base64 data URL for API submission with proper MIME type.
36
 
37
  Args:
38
  image_path: Path to the image file
 
63
  encoded = base64.b64encode(image_file.read_bytes()).decode()
64
  return f"data:{mime_type};base64,{encoded}"
65
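Usage is a one-liner; the returned string can be dropped straight into an image payload (file path illustrative):

data_url = encode_image_for_api('samples/letter_1871.jpg')  # hypothetical file
assert data_url.startswith('data:image/jpeg;base64,')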
 
66
 
67
+ def try_local_ocr_fallback(file_path: Union[str, Path], base64_data_url: Optional[str] = None) -> Optional[str]:
68
  """
69
+ Try to perform OCR using local Tesseract as a fallback when the API is unavailable.
70
 
71
  Args:
72
+ file_path: Path to the image file
 
73
  base64_data_url: Optional base64 data URL if already available
74
 
75
  Returns:
76
+ Extracted text or None if extraction failed
77
  """
78
+ if not TESSERACT_AVAILABLE or not PILLOW_AVAILABLE:
79
+ logger.warning("Local OCR fallback is not available (missing dependencies)")
80
+ return None
81
 
82
  try:
83
+ logger.info("Using local Tesseract OCR as fallback")
 
84
 
85
+ # Use PIL to open the image
86
+ img = Image.open(file_path)
87
 
88
+ # Use Tesseract to extract text
89
+ text = pytesseract.image_to_string(img)
90
 
91
+ if text:
92
+ logger.info("Successfully extracted text using local Tesseract OCR")
93
+ return text
94
  else:
95
+ logger.warning("Tesseract extracted no text")
96
+ return None
97
  except Exception as e:
98
+ logger.error(f"Error using local OCR fallback: {str(e)}")
99
+ return None
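A typical call site tries the API first and only then falls back locally (the API wrapper named here is a stand-in, not this module's code):

text = None
try:
    text = call_mistral_ocr(image_path)  # hypothetical API call with retries
except Exception:
    pass
if not text:
    text = try_local_ocr_fallback(image_path)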
preprocessing.py CHANGED
@@ -3,15 +3,398 @@ import io
3
  import cv2
4
  import numpy as np
5
  import tempfile
6
  from PIL import Image, ImageEnhance, ImageFilter
7
  from pdf2image import convert_from_bytes
8
  import streamlit as st
9
  import logging
10
 
11
  # Configure logging
12
  logger = logging.getLogger("preprocessing")
13
  logger.setLevel(logging.INFO)
14
 
 
15
  @st.cache_data(ttl=24*3600, show_spinner=False) # Cache for 24 hours
16
  def convert_pdf_to_images(pdf_bytes, dpi=150, rotation=0):
17
  """Convert PDF bytes to a list of images with caching"""
@@ -34,94 +417,134 @@ def convert_pdf_to_images(pdf_bytes, dpi=150, rotation=0):
34
 
35
  @st.cache_data(ttl=24*3600, show_spinner=False, hash_funcs={dict: lambda x: str(sorted(x.items()))})
36
  def preprocess_image(image_bytes, preprocessing_options):
37
- """Preprocess image with selected options optimized for historical document OCR quality"""
38
  # Setup basic console logging
39
  logger = logging.getLogger("image_preprocessor")
40
  logger.setLevel(logging.INFO)
41
 
42
  # Log which preprocessing options are being applied
43
- logger.info(f"Preprocessing image with options: {preprocessing_options}")
44
 
45
  # Convert bytes to PIL Image
46
  image = Image.open(io.BytesIO(image_bytes))
47
 
48
- # Check for alpha channel (RGBA) and convert to RGB if needed
49
  if image.mode == 'RGBA':
50
- # Convert RGBA to RGB by compositing the image onto a white background
 
51
  background = Image.new('RGB', image.size, (255, 255, 255))
52
  background.paste(image, mask=image.split()[3]) # 3 is the alpha channel
53
  image = background
54
- logger.info("Converted RGBA image to RGB")
55
  elif image.mode not in ('RGB', 'L'):
56
- # Convert other modes to RGB as well
 
57
  image = image.convert('RGB')
58
- logger.info(f"Converted {image.mode} image to RGB")
59
-
60
- # Apply rotation if specified
61
- if preprocessing_options.get("rotation", 0) != 0:
62
- rotation_degrees = preprocessing_options.get("rotation")
63
- image = image.rotate(rotation_degrees, expand=True, resample=Image.BICUBIC)
64
-
65
- # Resize large images while preserving details important for OCR
66
- width, height = image.size
67
- max_dimension = max(width, height)
68
-
69
- # Less aggressive resizing to preserve document details
70
- if max_dimension > 2500:
71
- scale_factor = 2500 / max_dimension
72
- new_width = int(width * scale_factor)
73
- new_height = int(height * scale_factor)
74
- # Use LANCZOS for better quality preservation
75
- image = image.resize((new_width, new_height), Image.LANCZOS)
76
 
 
77
  img_array = np.array(image)
78
 
79
- # Apply preprocessing based on selected options with settings optimized for historical documents
80
- document_type = preprocessing_options.get("document_type", "standard")
81
-
82
- # Process grayscale option first as it's a common foundation
83
  if preprocessing_options.get("grayscale", False):
84
  if len(img_array.shape) == 3: # Only convert if it's not already grayscale
85
- if document_type == "handwritten":
86
- # Enhanced grayscale processing for handwritten documents
87
  img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
88
- # Apply adaptive histogram equalization to enhance handwriting
89
- clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
90
  img_array = clahe.apply(img_array)
91
  else:
92
  # Standard grayscale for printed documents
93
  img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
94
-
95
- # Convert back to RGB for further processing
96
- img_array = cv2.cvtColor(img_array, cv2.COLOR_GRAY2RGB)
97
-
98
- if preprocessing_options.get("contrast", 0) != 0:
99
- contrast_factor = 1 + (preprocessing_options.get("contrast", 0) / 150) # Reduced from /100 for a gentler effect
100
- image = Image.fromarray(img_array)
101
- enhancer = ImageEnhance.Contrast(image)
102
- image = enhancer.enhance(contrast_factor)
103
- img_array = np.array(image)
104
 
 
105
  if preprocessing_options.get("denoise", False):
106
  try:
107
- # Apply appropriate denoising based on document type (reduced parameters for gentler effect)
108
- if document_type == "handwritten":
109
- # Very light denoising for handwritten documents to preserve pen strokes
110
- if len(img_array.shape) == 3 and img_array.shape[2] == 3: # Color image
111
- img_array = cv2.fastNlMeansDenoisingColored(img_array, None, 2, 2, 3, 7) # Reduced from 3,3,5,9
112
- else: # Grayscale image
113
- img_array = cv2.fastNlMeansDenoising(img_array, None, 2, 5, 15) # Reduced from 3,7,21
114
  else:
115
- # Standard denoising for printed documents
116
- if len(img_array.shape) == 3 and img_array.shape[2] == 3: # Color image
117
- img_array = cv2.fastNlMeansDenoisingColored(img_array, None, 3, 3, 5, 15) # Reduced from 5,5,7,21
118
- else: # Grayscale image
119
- img_array = cv2.fastNlMeansDenoising(img_array, None, 3, 5, 15) # Reduced from 5,7,21
120
  except Exception as e:
121
- logger.error(f"Denoising error: {str(e)}, falling back to standard processing")
122
 
123
  # Convert back to PIL Image
124
- processed_image = Image.fromarray(img_array)
125
 
126
  # Higher quality for OCR processing
127
  byte_io = io.BytesIO()
@@ -135,16 +558,14 @@ def preprocess_image(image_bytes, preprocessing_options):
135
 
136
  logger.info(f"Preprocessing complete. Original image mode: {image.mode}, processed mode: {processed_image.mode}")
137
  logger.info(f"Original size: {len(image_bytes)/1024:.1f}KB, processed size: {len(byte_io.getvalue())/1024:.1f}KB")
 
138
 
139
  return byte_io.getvalue()
140
  except Exception as e:
141
  logger.error(f"Error saving processed image: {str(e)}")
142
  # Fallback to original image
143
  logger.info("Using original image as fallback")
144
- image_io = io.BytesIO()
145
- image.save(image_io, format='JPEG', quality=92)
146
- image_io.seek(0)
147
- return image_io.getvalue()
148
 
149
  def create_temp_file(content, suffix, temp_file_paths):
150
  """Create a temporary file and track it for cleanup"""
@@ -157,19 +578,53 @@ def create_temp_file(content, suffix, temp_file_paths):
157
  return temp_path
158
 
159
  def apply_preprocessing_to_file(file_bytes, file_ext, preprocessing_options, temp_file_paths):
160
- """Apply preprocessing to file and return path to processed file"""
161
- # Check if any preprocessing options with boolean values are True, or if any non-boolean values are non-default
162
- # Note: document_type is no longer used to determine if preprocessing should be applied
163
  has_preprocessing = (
164
  preprocessing_options.get("grayscale", False) or
165
  preprocessing_options.get("denoise", False) or
166
- preprocessing_options.get("contrast", 0) != 0 or
167
- preprocessing_options.get("rotation", 0) != 0
168
  )
169
 
170
- if has_preprocessing:
171
  # Apply preprocessing
172
  logger.info(f"Applying preprocessing with options: {preprocessing_options}")
173
  processed_bytes = preprocess_image(file_bytes, preprocessing_options)
174
 
175
  # Save processed image to temp file
 
3
  import cv2
4
  import numpy as np
5
  import tempfile
6
+ import time
7
+ import math
8
+ import json
+ import os  # used by the logging helpers below; harmless if already imported at the top of the module
9
  from PIL import Image, ImageEnhance, ImageFilter
10
  from pdf2image import convert_from_bytes
11
  import streamlit as st
12
  import logging
13
+ import concurrent.futures
14
+ from pathlib import Path
15
 
16
  # Configure logging
17
  logger = logging.getLogger("preprocessing")
18
  logger.setLevel(logging.INFO)
19
 
20
+ # Ensure logs directory exists
21
+ def ensure_log_directory(config):
22
+ """Create logs directory if it doesn't exist"""
23
+ if config.get("logging", {}).get("enabled", False):
24
+ log_path = config.get("logging", {}).get("output_path", "logs/preprocessing_metrics.json")
25
+ log_dir = os.path.dirname(log_path)
26
+ if log_dir:
27
+ Path(log_dir).mkdir(parents=True, exist_ok=True)
28
+
29
+ def log_preprocessing_metrics(metrics, config):
30
+ """Log preprocessing metrics to JSON file"""
31
+ if not config.get("enabled", False):
32
+ return
33
+
34
+ log_path = config.get("output_path", "logs/preprocessing_metrics.json")
35
+ ensure_log_directory({"logging": {"enabled": True, "output_path": log_path}})
36
+
37
+ # Add timestamp
38
+ metrics["timestamp"] = time.strftime("%Y-%m-%d %H:%M:%S")
39
+
40
+ # Append to log file
41
+ try:
42
+ existing_data = []
43
+ if os.path.exists(log_path):
44
+ with open(log_path, 'r') as f:
45
+ existing_data = json.load(f)
46
+ if not isinstance(existing_data, list):
47
+ existing_data = [existing_data]
48
+
49
+ existing_data.append(metrics)
50
+
51
+ with open(log_path, 'w') as f:
52
+ json.dump(existing_data, f, indent=2)
53
+
54
+ logger.info(f"Logged preprocessing metrics to {log_path}")
55
+ except Exception as e:
56
+ logger.error(f"Error logging preprocessing metrics: {str(e)}")
57
+
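These helpers rely on os.path, so import os must be present among the module's imports (see the import block above). A caller records per-image metrics like so (field names illustrative; only timestamp is added by the function itself):

metrics = {
    'file': 'letter_1871.jpg',      # illustrative
    'document_type': 'handwritten',
    'deskew_angle': -1.7,
    'elapsed_s': 0.42,
}
log_preprocessing_metrics(metrics, {'enabled': True,
                                    'output_path': 'logs/preprocessing_metrics.json'})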
58
+ def get_document_config(document_type, global_config):
59
+ """
60
+ Get document-specific preprocessing configuration by merging with global settings.
61
+
62
+ Args:
63
+ document_type: The type of document (e.g., 'standard', 'newspaper', 'handwritten')
64
+ global_config: The global preprocessing configuration
65
+
66
+ Returns:
67
+ A merged configuration dictionary with document-specific overrides
68
+ """
69
+ # Start with a copy of the global config
70
+ config = {
71
+ "deskew": global_config.get("deskew", {}),
72
+ "thresholding": global_config.get("thresholding", {}),
73
+ "morphology": global_config.get("morphology", {}),
74
+ "performance": global_config.get("performance", {}),
75
+ "logging": global_config.get("logging", {})
76
+ }
77
+
78
+ # Apply document-specific overrides if they exist
79
+ doc_types = global_config.get("document_types", {})
80
+ if document_type in doc_types:
81
+ doc_config = doc_types[document_type]
82
+
83
+ # Merge document-specific settings into the config
84
+ for section in doc_config:
85
+ if section in config:
86
+ config[section].update(doc_config[section])
87
+
88
+ return config
89
+
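The merge is shallow and per-section, and the sections are taken by reference, so update() also mutates global_config; a copy.deepcopy of each section would avoid that. Example (values illustrative):

global_config = {
    'deskew': {'enabled': True, 'max_angle': 45.0},
    'document_types': {
        'newspaper': {'deskew': {'max_angle': 5.0}},
    },
}
cfg = get_document_config('newspaper', global_config)
# cfg['deskew'] -> {'enabled': True, 'max_angle': 5.0}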
90
+ def deskew_image(img_array, config):
91
+ """
92
+ Detect and correct skew in document images.
93
+
94
+ Uses a combination of methods (minAreaRect and/or Hough transform)
95
+ to estimate the skew angle more robustly.
96
+
97
+ Args:
98
+ img_array: Input image as numpy array
99
+ config: Deskew configuration dict
100
+
101
+ Returns:
102
+ Deskewed image as numpy array, estimated angle, success flag
103
+ """
104
+ if not config.get("enabled", False):
105
+ return img_array, 0.0, True
106
+
107
+ # Convert to grayscale if needed
108
+ gray = img_array if len(img_array.shape) == 2 else cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
109
+
110
+ # Start with a threshold to get binary image for angle detection
111
+ _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
112
+
113
+ angles = []
114
+ angle_threshold = config.get("angle_threshold", 0.1)
115
+ max_angle = config.get("max_angle", 45.0)
116
+
117
+ # Method 1: minAreaRect approach
118
+ try:
119
+ # Find all contours
120
+ contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
121
+
122
+ # Filter contours by area to avoid noise
123
+ min_area = binary.shape[0] * binary.shape[1] * 0.0001 # 0.01% of image area
124
+ filtered_contours = [cnt for cnt in contours if cv2.contourArea(cnt) > min_area]
125
+
126
+ # Get angles from rotated rectangles around contours
127
+ for contour in filtered_contours:
128
+ rect = cv2.minAreaRect(contour)
129
+ width, height = rect[1]
130
+
131
+ # Calculate the angle based on the longer side
132
+ # (This is important for getting the orientation right)
133
+ angle = rect[2]
134
+ if width < height:
135
+ angle += 90
136
+
137
+ # Normalize angle to -45 to 45 range
138
+ if angle > 45:
139
+ angle -= 90
140
+ if angle < -45:
141
+ angle += 90
142
+
143
+ # Clamp angle to max limit
144
+ angle = max(min(angle, max_angle), -max_angle)
145
+ angles.append(angle)
146
+ except Exception as e:
147
+ logger.error(f"Error in minAreaRect skew detection: {str(e)}")
148
+
149
+ # Method 2: Hough Transform approach (if enabled)
150
+ if config.get("use_hough", True):
151
+ try:
152
+ # Apply Canny edge detection
153
+ edges = cv2.Canny(gray, 50, 150, apertureSize=3)
154
+
155
+ # Apply Hough lines
156
+ lines = cv2.HoughLinesP(edges, 1, np.pi/180,
157
+ threshold=100, minLineLength=100, maxLineGap=10)
158
+
159
+ if lines is not None:
160
+ for line in lines:
161
+ x1, y1, x2, y2 = line[0]
162
+ if x2 - x1 != 0: # Avoid division by zero
163
+ # Calculate line angle in degrees
164
+ angle = math.atan2(y2 - y1, x2 - x1) * 180.0 / np.pi
165
+
166
+ # Normalize angle to -45 to 45 range
167
+ if angle > 45:
168
+ angle -= 90
169
+ if angle < -45:
170
+ angle += 90
171
+
172
+ # Clamp angle to max limit
173
+ angle = max(min(angle, max_angle), -max_angle)
174
+ angles.append(angle)
175
+ except Exception as e:
176
+ logger.error(f"Error in Hough transform skew detection: {str(e)}")
177
+
178
+ # If no angles were detected, return original image
179
+ if not angles:
180
+ logger.warning("No skew angles detected, using original image")
181
+ return img_array, 0.0, False
182
+
183
+ # Combine angles using the specified consensus method
184
+ consensus_method = config.get("consensus_method", "average")
185
+ if consensus_method == "average":
186
+ final_angle = sum(angles) / len(angles)
187
+ elif consensus_method == "median":
188
+ final_angle = sorted(angles)[len(angles) // 2]
189
+ elif consensus_method == "min":
190
+ final_angle = min(angles, key=abs)
191
+ elif consensus_method == "max":
192
+ final_angle = max(angles, key=abs)
193
+ else:
194
+ final_angle = sum(angles) / len(angles) # Default to average
195
+
196
+ # If angle is below threshold, don't rotate
197
+ if abs(final_angle) < angle_threshold:
198
+ logger.info(f"Detected angle ({final_angle:.2f}°) is below threshold, skipping deskew")
199
+ return img_array, final_angle, True
200
+
201
+ # Log the detected angle
202
+ logger.info(f"Deskewing image with angle: {final_angle:.2f}°")
203
+
204
+ # Get image dimensions
205
+ h, w = img_array.shape[:2]
206
+ center = (w // 2, h // 2)
207
+
208
+ # Get rotation matrix
209
+ rotation_matrix = cv2.getRotationMatrix2D(center, final_angle, 1.0)
210
+
211
+ # Calculate new image dimensions
212
+ abs_cos = abs(rotation_matrix[0, 0])
213
+ abs_sin = abs(rotation_matrix[0, 1])
214
+ new_w = int(h * abs_sin + w * abs_cos)
215
+ new_h = int(h * abs_cos + w * abs_sin)
216
+
217
+ # Adjust the rotation matrix to account for new dimensions
218
+ rotation_matrix[0, 2] += (new_w / 2) - center[0]
219
+ rotation_matrix[1, 2] += (new_h / 2) - center[1]
220
+
221
+ # Perform the rotation
222
+ try:
223
+ # Determine the number of channels to create the correct output array
224
+ if len(img_array.shape) == 3:
225
+ rotated = cv2.warpAffine(img_array, rotation_matrix, (new_w, new_h),
226
+ flags=cv2.INTER_LINEAR, borderMode=cv2.BORDER_CONSTANT,
227
+ borderValue=(255, 255, 255))
228
+ else:
229
+ rotated = cv2.warpAffine(img_array, rotation_matrix, (new_w, new_h),
230
+ flags=cv2.INTER_LINEAR, borderMode=cv2.BORDER_CONSTANT,
231
+ borderValue=255)
232
+ return rotated, final_angle, True
233
+ except Exception as e:
234
+ logger.error(f"Error rotating image: {str(e)}")
235
+ if config.get("fallback", {}).get("enabled", True):
236
+ logger.info("Using original image as fallback after rotation failure")
237
+ return img_array, final_angle, False
238
+ return img_array, final_angle, False
239
+
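# --- Editor's illustrative sketch of a deskew_image call; "scan.png" and the
# config values are hypothetical, but every key shown is read by the function.
#
#   import cv2
#   example_cfg = {
#       "enabled": True,
#       "use_hough": True,             # add Hough-line angle estimates
#       "consensus_method": "median",  # "average" | "median" | "min" | "max"
#       "angle_threshold": 0.1,        # degrees; below this, no rotation
#       "max_angle": 45.0,
#       "fallback": {"enabled": True},
#   }
#   page = cv2.imread("scan.png")
#   deskewed, angle, ok = deskew_image(page, example_cfg)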
240
+ def preblur(img_array, config):
241
+ """
242
+ Apply pre-filtering blur to stabilize thresholding results.
243
+
244
+ Args:
245
+ img_array: Input image as numpy array
246
+ config: Pre-blur configuration dict
247
+
248
+ Returns:
249
+ Blurred image as numpy array
250
+ """
251
+ if not config.get("enabled", False):
252
+ return img_array
253
+
254
+ method = config.get("method", "gaussian")
255
+ kernel_size = config.get("kernel_size", 3)
256
+
257
+ # Ensure kernel size is odd
258
+ if kernel_size % 2 == 0:
259
+ kernel_size += 1
260
+
261
+ try:
262
+ if method == "gaussian":
263
+ return cv2.GaussianBlur(img_array, (kernel_size, kernel_size), 0)
264
+ elif method == "median":
265
+ return cv2.medianBlur(img_array, kernel_size)
266
+ else:
267
+ logger.warning(f"Unknown blur method: {method}, using gaussian")
268
+ return cv2.GaussianBlur(img_array, (kernel_size, kernel_size), 0)
269
+ except Exception as e:
270
+ logger.error(f"Error applying {method} blur: {str(e)}")
271
+ return img_array
272
+
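# --- Editor's illustrative note: preblur forces an odd kernel, so a requested
# kernel_size of 4 becomes a 5x5 kernel. The input array below is synthetic.
#
#   import numpy as np
#   noisy = (np.random.rand(64, 64) * 255).astype("uint8")
#   smoothed = preblur(noisy, {"enabled": True, "method": "gaussian", "kernel_size": 4})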
273
+ def apply_threshold(img_array, config):
274
+ """
275
+ Apply thresholding to create binary image.
276
+
277
+ Supports Otsu's method and adaptive thresholding.
278
+ Includes pre-filtering and fallback mechanisms.
279
+
280
+ Args:
281
+ img_array: Input image as numpy array
282
+ config: Thresholding configuration dict
283
+
284
+ Returns:
285
+ Binary image as numpy array, success flag
286
+ """
287
+ method = config.get("method", "adaptive")
288
+ if method == "none":
289
+ return img_array, True
290
+
291
+ # Convert to grayscale if needed
292
+ gray = img_array if len(img_array.shape) == 2 else cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
293
+
294
+ # Apply pre-blur if configured
295
+ preblur_config = config.get("preblur", {})
296
+ if preblur_config.get("enabled", False):
297
+ gray = preblur(gray, preblur_config)
298
+
299
+ binary = None
300
+ try:
301
+ if method == "otsu":
302
+ # Apply Otsu's thresholding
303
+ _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
304
+ elif method == "adaptive":
305
+ # Apply adaptive thresholding
306
+ block_size = config.get("adaptive_block_size", 11)
307
+ constant = config.get("adaptive_constant", 2)
308
+
309
+ # Ensure block size is odd
310
+ if block_size % 2 == 0:
311
+ block_size += 1
312
+
313
+ binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
314
+ cv2.THRESH_BINARY, block_size, constant)
315
+ else:
316
+ logger.warning(f"Unknown thresholding method: {method}, using adaptive")
317
+ block_size = config.get("adaptive_block_size", 11)
318
+ constant = config.get("adaptive_constant", 2)
319
+
320
+ # Ensure block size is odd
321
+ if block_size % 2 == 0:
322
+ block_size += 1
323
+
324
+ binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
325
+ cv2.THRESH_BINARY, block_size, constant)
326
+ except Exception as e:
327
+ logger.error(f"Error applying {method} thresholding: {str(e)}")
328
+ if config.get("fallback", {}).get("enabled", True):
329
+ logger.info("Using original grayscale image as fallback after thresholding failure")
330
+ return gray, False
331
+ return gray, False
332
+
333
+ # Calculate percentage of non-zero pixels for logging
334
+ nonzero_pct = np.count_nonzero(binary) / binary.size * 100
335
+ logger.info(f"Binary image has {nonzero_pct:.2f}% non-zero pixels")
336
+
337
+ # Check if thresholding was successful (crude check)
338
+ if nonzero_pct < 1 or nonzero_pct > 99:
339
+ logger.warning(f"Thresholding produced extreme result ({nonzero_pct:.2f}% non-zero)")
340
+ if config.get("fallback", {}).get("enabled", True):
341
+ logger.info("Using original grayscale image as fallback after poor thresholding")
342
+ return gray, False
343
+
344
+ return binary, True
345
+
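# --- Editor's illustrative sketch of an apply_threshold config combining a
# light median pre-blur with adaptive thresholding; values are hypothetical.
# If the binary output is degenerate (<1% or >99% non-zero pixels), the
# grayscale input comes back instead with success=False.
#
#   example_cfg = {
#       "method": "adaptive",       # "otsu" | "adaptive" | "none"
#       "adaptive_block_size": 11,  # forced odd
#       "adaptive_constant": 2,
#       "preblur": {"enabled": True, "method": "median", "kernel_size": 3},
#       "fallback": {"enabled": True},
#   }
#   binary, ok = apply_threshold(page_gray, example_cfg)  # page_gray: uint8 array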
346
+ def apply_morphology(binary_img, config):
347
+ """
348
+ Apply morphological operations to clean up binary image.
349
+
350
+ Supports opening, closing, or both operations.
351
+
352
+ Args:
353
+ binary_img: Binary image as numpy array
354
+ config: Morphology configuration dict
355
+
356
+ Returns:
357
+ Processed binary image as numpy array
358
+ """
359
+ if not config.get("enabled", False):
360
+ return binary_img
361
+
362
+ operation = config.get("operation", "close")
363
+ kernel_size = config.get("kernel_size", 1)
364
+ kernel_shape = config.get("kernel_shape", "rect")
365
+
366
+ # Create appropriate kernel
367
+ if kernel_shape == "rect":
368
+ kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_size*2+1, kernel_size*2+1))
369
+ elif kernel_shape == "ellipse":
370
+ kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size*2+1, kernel_size*2+1))
371
+ elif kernel_shape == "cross":
372
+ kernel = cv2.getStructuringElement(cv2.MORPH_CROSS, (kernel_size*2+1, kernel_size*2+1))
373
+ else:
374
+ logger.warning(f"Unknown kernel shape: {kernel_shape}, using rect")
375
+ kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_size*2+1, kernel_size*2+1))
376
+
377
+ result = binary_img
378
+ try:
379
+ if operation == "open":
380
+ # Opening: Erosion followed by dilation - removes small noise
381
+ result = cv2.morphologyEx(binary_img, cv2.MORPH_OPEN, kernel)
382
+ elif operation == "close":
383
+ # Closing: Dilation followed by erosion - fills small holes
384
+ result = cv2.morphologyEx(binary_img, cv2.MORPH_CLOSE, kernel)
385
+ elif operation == "both":
386
+ # Both operations in sequence
387
+ result = cv2.morphologyEx(binary_img, cv2.MORPH_OPEN, kernel)
388
+ result = cv2.morphologyEx(result, cv2.MORPH_CLOSE, kernel)
389
+ else:
390
+ logger.warning(f"Unknown morphological operation: {operation}, using close")
391
+ result = cv2.morphologyEx(binary_img, cv2.MORPH_CLOSE, kernel)
392
+ except Exception as e:
393
+ logger.error(f"Error applying morphological operation: {str(e)}")
394
+ return binary_img
395
+
396
+ return result
397
+
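# --- Editor's illustrative sketch: kernel_size acts as a radius (size*2+1),
# so kernel_size=1 builds a 3x3 element; "both" runs opening then closing.
#
#   cleaned = apply_morphology(binary, {
#       "enabled": True,
#       "operation": "both",        # "open" | "close" | "both"
#       "kernel_size": 1,
#       "kernel_shape": "ellipse",  # "rect" | "ellipse" | "cross"
#   })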
398
  @st.cache_data(ttl=24*3600, show_spinner=False) # Cache for 24 hours
399
  def convert_pdf_to_images(pdf_bytes, dpi=150, rotation=0):
400
  """Convert PDF bytes to a list of images with caching"""
 
417
 
418
  @st.cache_data(ttl=24*3600, show_spinner=False, hash_funcs={dict: lambda x: str(sorted(x.items()))})
419
  def preprocess_image(image_bytes, preprocessing_options):
420
+ """
421
+ Conservative preprocessing function for handwritten documents with early exit for clean scans.
422
+ Implements light processing: grayscale → denoise (gently) → contrast (conservative)
423
+
424
+ Args:
425
+ image_bytes: Image content as bytes
426
+ preprocessing_options: Dictionary with document_type, grayscale, denoise, contrast options
427
+
428
+ Returns:
429
+ Processed image bytes or original image bytes if no processing needed
430
+ """
431
  # Setup basic console logging
432
  logger = logging.getLogger("image_preprocessor")
433
  logger.setLevel(logging.INFO)
434
 
435
  # Log which preprocessing options are being applied
436
+ logger.info(f"Document type: {preprocessing_options.get('document_type', 'standard')}")
437
+
438
+ # Check if any preprocessing is actually requested
439
+ has_preprocessing = (
440
+ preprocessing_options.get("grayscale", False) or
441
+ preprocessing_options.get("denoise", False) or
442
+ preprocessing_options.get("contrast", 0) != 0
443
+ )
444
 
445
  # Convert bytes to PIL Image
446
  image = Image.open(io.BytesIO(image_bytes))
447
 
448
+ # Check for minimal skew and exit early if document is already straight
449
+ # This avoids unnecessary processing for clean scans
450
+ try:
451
+ from utils.image_utils import detect_skew
452
+ skew_angle = detect_skew(image)
453
+ if abs(skew_angle) < 0.5 and not has_preprocessing:
454
+ # Scan is already straight and no preprocessing was requested,
455
+ # so return the original image bytes unchanged
456
+ logger.info(f"Document has minimal skew ({skew_angle:.2f}°), skipping preprocessing")
457
+ return image_bytes
458
+ except Exception as e:
459
+ logger.warning(f"Error in skew detection: {str(e)}, continuing with preprocessing")
460
+
461
+ # If no preprocessing options are selected, return the original image
462
+ if not has_preprocessing:
463
+ logger.info("No preprocessing options selected, skipping preprocessing")
464
+ return image_bytes
465
+
466
+ # Initialize metrics for logging
467
+ metrics = {
468
+ "file": preprocessing_options.get("filename", "unknown"),
469
+ "document_type": preprocessing_options.get("document_type", "standard"),
470
+ "preprocessing_applied": []
471
+ }
472
+ start_time = time.time()
473
+
474
+ # Handle RGBA images (transparency) by converting to RGB
475
  if image.mode == 'RGBA':
476
+ # Convert RGBA to RGB by compositing onto white background
477
+ logger.info("Converting RGBA image to RGB")
478
  background = Image.new('RGB', image.size, (255, 255, 255))
479
  background.paste(image, mask=image.split()[3]) # 3 is the alpha channel
480
  image = background
481
+ metrics["preprocessing_applied"].append("alpha_conversion")
482
  elif image.mode not in ('RGB', 'L'):
483
+ # Convert other modes to RGB
484
+ logger.info(f"Converting {image.mode} image to RGB")
485
  image = image.convert('RGB')
486
+ metrics["preprocessing_applied"].append("format_conversion")
487
 
488
+ # Convert to NumPy array for OpenCV processing
489
  img_array = np.array(image)
490
 
491
+ # Apply grayscale if requested (useful for handwritten text)
492
  if preprocessing_options.get("grayscale", False):
493
  if len(img_array.shape) == 3: # Only convert if it's not already grayscale
494
+ # For handwritten documents, apply gentle CLAHE to enhance contrast locally
495
+ if preprocessing_options.get("document_type") == "handwritten":
496
  img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
497
+ clahe = cv2.createCLAHE(clipLimit=1.5, tileGridSize=(8,8)) # Conservative clip limit
 
498
  img_array = clahe.apply(img_array)
499
  else:
500
  # Standard grayscale for printed documents
501
  img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
502
+
503
+ metrics["preprocessing_applied"].append("grayscale")
504
 
505
+ # Apply light denoising if requested
506
  if preprocessing_options.get("denoise", False):
507
  try:
508
+ # Apply very gentle denoising
509
+ is_color = len(img_array.shape) == 3 and img_array.shape[2] == 3
510
+ if is_color:
511
+ # Very light color denoising with conservative parameters
512
+ img_array = cv2.fastNlMeansDenoisingColored(img_array, None, 2, 2, 3, 7)
513
  else:
514
+ # Very light grayscale denoising
515
+ img_array = cv2.fastNlMeansDenoising(img_array, None, 2, 3, 7)
516
+
517
+ metrics["preprocessing_applied"].append("light_denoise")
 
518
  except Exception as e:
519
+ logger.error(f"Denoising error: {str(e)}")
520
+
521
+ # Apply contrast adjustment if requested (conservative range)
522
+ contrast_value = preprocessing_options.get("contrast", 0)
523
+ if contrast_value != 0:
524
+ # Use a gentler contrast adjustment factor
525
+ contrast_factor = 1 + (contrast_value / 200) # Conservative scaling factor
526
 
527
+ # Convert NumPy array back to PIL Image for contrast adjustment
528
+ if len(img_array.shape) == 2: # If grayscale, convert to RGB for PIL
529
+ image = Image.fromarray(cv2.cvtColor(img_array, cv2.COLOR_GRAY2RGB))
530
+ else:
531
+ image = Image.fromarray(img_array)
532
+
533
+ enhancer = ImageEnhance.Contrast(image)
534
+ image = enhancer.enhance(contrast_factor)
535
+
536
+ # Convert back to NumPy array
537
+ img_array = np.array(image)
538
+ metrics["preprocessing_applied"].append(f"contrast_{contrast_value}")
539
+
540
  # Convert back to PIL Image
541
+ if len(img_array.shape) == 2: # If grayscale, convert to RGB for saving
542
+ processed_image = Image.fromarray(cv2.cvtColor(img_array, cv2.COLOR_GRAY2RGB))
543
+ else:
544
+ processed_image = Image.fromarray(img_array)
545
+
546
+ # Record total processing time
547
+ metrics["processing_time"] = (time.time() - start_time) * 1000 # ms
548
 
549
  # Higher quality for OCR processing
550
  byte_io = io.BytesIO()
 
558
 
559
  logger.info(f"Preprocessing complete. Original image mode: {image.mode}, processed mode: {processed_image.mode}")
560
  logger.info(f"Original size: {len(image_bytes)/1024:.1f}KB, processed size: {len(byte_io.getvalue())/1024:.1f}KB")
561
+ logger.info(f"Applied preprocessing steps: {', '.join(metrics['preprocessing_applied'])}")
562
 
563
  return byte_io.getvalue()
564
  except Exception as e:
565
  logger.error(f"Error saving processed image: {str(e)}")
566
  # Fallback to original image
567
  logger.info("Using original image as fallback")
568
+ return image_bytes
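# --- Editor's illustrative sketch of a preprocess_image call; "scan.jpg" is a
# hypothetical path. With all options off (or a near-zero skew angle and no
# options), the original bytes are returned untouched.
#
#   from pathlib import Path
#   raw = Path("scan.jpg").read_bytes()
#   processed = preprocess_image(raw, {
#       "document_type": "handwritten",  # enables the gentle CLAHE path
#       "grayscale": True,
#       "denoise": True,
#       "contrast": 10,                  # factor 1 + 10/200 = 1.05
#   })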
569
 
570
  def create_temp_file(content, suffix, temp_file_paths):
571
  """Create a temporary file and track it for cleanup"""
 
578
  return temp_path
579
 
580
  def apply_preprocessing_to_file(file_bytes, file_ext, preprocessing_options, temp_file_paths):
581
+ """
582
+ Apply conservative preprocessing to file and return path to the temporary file.
583
+ Handles format conversion and user-selected preprocessing options.
584
+
585
+ Args:
586
+ file_bytes: File content as bytes
587
+ file_ext: File extension (e.g., '.jpg', '.pdf')
588
+ preprocessing_options: Dictionary with document_type and preprocessing options
589
+ temp_file_paths: List to track temporary files for cleanup
590
+
591
+ Returns:
592
+ Tuple of (temp_file_path, was_processed_flag)
593
+ """
594
+ document_type = preprocessing_options.get("document_type", "standard")
595
+
596
+ # Check for user-selected preprocessing
597
  has_preprocessing = (
598
  preprocessing_options.get("grayscale", False) or
599
  preprocessing_options.get("denoise", False) or
600
+ preprocessing_options.get("contrast", 0) != 0
 
601
  )
602
 
603
+ # Check for RGBA/transparency that needs conversion
604
+ format_needs_conversion = False
605
+
606
+ # Only check formats that might have transparency
607
+ if file_ext.lower() in ['.png', '.tif', '.tiff']:
608
+ try:
609
+ # Check if image has transparency
610
+ image = Image.open(io.BytesIO(file_bytes))
611
+ if image.mode == 'RGBA' or image.mode not in ('RGB', 'L'):
612
+ format_needs_conversion = True
613
+ except Exception as e:
614
+ logger.warning(f"Error checking image format: {str(e)}")
615
+
616
+ # Process if user requested preprocessing OR format needs conversion
617
+ needs_processing = has_preprocessing or format_needs_conversion
618
+
619
+ if needs_processing:
620
  # Apply preprocessing
621
  logger.info(f"Applying preprocessing with options: {preprocessing_options}")
622
+ logger.info(f"Using document type '{document_type}' with advanced preprocessing options")
623
+
624
+ # Add filename to preprocessing options for logging when an upload object (with a .name attribute) is passed rather than raw bytes
625
+ if hasattr(file_bytes, 'name'):
626
+ preprocessing_options["filename"] = file_bytes.name
627
+
628
  processed_bytes = preprocess_image(file_bytes, preprocessing_options)
629
 
630
  # Save processed image to temp file
process_file.py CHANGED
@@ -53,9 +53,7 @@ def process_file(uploaded_file, use_vision=True, processor=None, custom_prompt=N
53
  "file_size_mb": round(file_size_mb, 2),
54
  "use_vision": use_vision
55
  })
56
-
57
- # No longer needed - removing confidence score
58
-
59
  return result
60
  except Exception as e:
61
  return {
@@ -65,4 +63,4 @@ def process_file(uploaded_file, use_vision=True, processor=None, custom_prompt=N
65
  finally:
66
  # Clean up the temporary file
67
  if os.path.exists(temp_path):
68
- os.unlink(temp_path)
 
53
  "file_size_mb": round(file_size_mb, 2),
54
  "use_vision": use_vision
55
  })
56
+
 
 
57
  return result
58
  except Exception as e:
59
  return {
 
63
  finally:
64
  # Clean up the temporary file
65
  if os.path.exists(temp_path):
66
+ os.unlink(temp_path)
requirements.txt CHANGED
@@ -10,6 +10,7 @@ Pillow>=10.0.0
10
  opencv-python-headless>=4.8.0.74
11
  pdf2image>=1.16.0
12
  pytesseract>=0.3.10 # For local OCR fallback
 
13
 
14
  # Data handling and utilities
15
  numpy>=1.24.0
 
10
  opencv-python-headless>=4.8.0.74
11
  pdf2image>=1.16.0
12
  pytesseract>=0.3.10 # For local OCR fallback
13
+ matplotlib>=3.7.0 # For visualization in preprocessing tests
14
 
15
  # Data handling and utilities
16
  numpy>=1.24.0
structured_ocr.py CHANGED
@@ -47,28 +47,38 @@ except ImportError:
47
 
48
  # Import utilities for OCR processing
49
  try:
50
- from ocr_utils import replace_images_in_markdown, get_combined_markdown
51
  except ImportError:
52
- # Define fallback functions if module not found
 
 
53
  def replace_images_in_markdown(markdown_str, images_dict):
54
- for img_name, base64_str in images_dict.items():
55
- markdown_str = markdown_str.replace(
56
- f"![{img_name}]({img_name})", f"![{img_name}]({base64_str})"
57
- )
 
 
 
58
  return markdown_str
59
 
60
  def get_combined_markdown(ocr_response):
 
61
  markdowns = []
62
  for page in ocr_response.pages:
63
  image_data = {}
64
- for img in page.images:
65
- image_data[img.id] = img.image_base64
66
- markdowns.append(replace_images_in_markdown(page.markdown, image_data))
 
 
 
 
67
  return "\n\n".join(markdowns)
68
 
69
  # Import config directly (now local to historical-ocr)
70
  try:
71
- from config import MISTRAL_API_KEY, OCR_MODEL, TEXT_MODEL, VISION_MODEL, TEST_MODE
72
  except ImportError:
73
  # Fallback defaults if config is not available
74
  import os
@@ -77,6 +87,14 @@ except ImportError:
77
  TEXT_MODEL = "mistral-large-latest"
78
  VISION_MODEL = "mistral-large-latest"
79
  TEST_MODE = True
80
  logging.warning("Config module not found. Using environment variables and defaults.")
81
 
82
  # Helper function to make OCR objects JSON serializable
@@ -127,6 +145,13 @@ def serialize_ocr_response(obj):
127
  is_valid_image = False
128
  logging.warning("Markdown image reference detected")
129
130
  # Case 3: Needs detailed text content detection
131
  else:
132
  # Use the same proven approach as in our tests
@@ -185,9 +210,27 @@ def serialize_ocr_response(obj):
185
  'image_base64': image_base64
186
  }
187
  else:
188
- # Process as text if validation fails - convert to string to prevent misclassification
189
  if image_base64 and isinstance(image_base64, str):
190
- result[key] = image_base64
191
  else:
192
  result[key] = str(value)
193
  # Handle collections
@@ -382,13 +425,47 @@ class StructuredOCR:
382
  result = serialize_ocr_response(result)
383
 
384
  # Make a final pass to check for any remaining non-serializable objects
385
- # Test JSON serialization to catch any remaining issues
386
- json.dumps(result)
 
387
  except TypeError as e:
388
- # If there's a serialization error, run the whole result through our serializer
389
  logger = logging.getLogger("serializer")
390
  logger.warning(f"JSON serialization error in result: {str(e)}. Applying full serialization.")
391
- result = serialize_ocr_response(result)
392
 
393
  return result
394
 
@@ -1104,9 +1181,10 @@ class StructuredOCR:
1104
 
1105
  # Use enhanced preprocessing functions from ocr_utils
1106
  try:
1107
- from ocr_utils import preprocess_image_for_ocr, IMAGE_PREPROCESSING
 
1108
 
1109
- logger.info(f"Applying advanced image preprocessing for OCR")
1110
 
1111
  # Get preprocessing settings from config
1112
  max_size_mb = IMAGE_PREPROCESSING.get("max_size_mb", 8.0)
@@ -1114,8 +1192,14 @@ class StructuredOCR:
1114
  if file_size_mb > max_size_mb:
1115
  logger.info(f"Image is large ({file_size_mb:.2f} MB), optimizing for API submission")
1116
 
1117
- # Preprocess image with document-type detection and appropriate enhancements
1118
- _, base64_data_url = preprocess_image_for_ocr(file_path)
1119
 
1120
  logger.info(f"Image preprocessing completed successfully")
1121
 
@@ -1169,7 +1253,7 @@ class StructuredOCR:
1169
  except ImportError:
1170
  logger.warning("PIL not available for resizing. Using original image.")
1171
  # Use enhanced encoder with proper MIME type detection
1172
- from ocr_utils import encode_image_for_api
1173
  base64_data_url = encode_image_for_api(file_path)
1174
  except Exception as e:
1175
  logger.warning(f"Image resize failed: {str(e)}. Using original image.")
@@ -1178,7 +1262,7 @@ class StructuredOCR:
1178
  base64_data_url = encode_image_for_api(file_path)
1179
  else:
1180
  # For smaller images, use as-is with proper MIME type
1181
- from ocr_utils import encode_image_for_api
1182
  base64_data_url = encode_image_for_api(file_path)
1183
  except Exception as e:
1184
  # Fallback to original image if any preprocessing fails
@@ -1243,7 +1327,7 @@ class StructuredOCR:
1243
  logger.error("Maximum retries reached, rate limit error persists.")
1244
  try:
1245
  # Try to import the local OCR fallback function
1246
- from ocr_utils import try_local_ocr_fallback
1247
 
1248
  # Attempt local OCR fallback
1249
  ocr_text = try_local_ocr_fallback(file_path, base64_data_url)
@@ -1455,7 +1539,14 @@ class StructuredOCR:
1455
  logger.info("Sufficient OCR text detected, analyzing language before using OCR text directly")
1456
 
1457
  # Perform language detection on the OCR text before returning
1458
- detected_languages = self._detect_text_language(ocr_markdown)
1459
 
1460
  return {
1461
  "file_name": filename,
@@ -1629,7 +1720,12 @@ class StructuredOCR:
1629
 
1630
  # If OCR text has clear French patterns but language is English or missing, fix it
1631
  if ocr_markdown and 'languages' in result:
1632
- result['languages'] = self._detect_text_language(ocr_markdown, result['languages'])
1633
 
1634
  except Exception as e:
1635
  # Fall back to text-only model if vision model fails
@@ -1639,22 +1735,25 @@ class StructuredOCR:
1639
  return result
1640
 
1641
  # We've removed document type detection entirely for simplicity
 
1642
 
1643
  # Create a prompt with enhanced language detection instructions
1644
  generic_section = (
1645
  f"You are an OCR specialist processing historical documents. "
1646
- f"Focus on accurately extracting text content while preserving structure and formatting. "
1647
  f"Pay attention to any historical features and document characteristics.\n\n"
1648
- f"IMPORTANT: Accurately identify the document's language(s). Look for language-specific characters, words, and phrases. "
1649
- f"Specifically check for French (accents like é, è, ç, words like 'le', 'la', 'et', 'est'), German (umlauts, words like 'und', 'der', 'das'), "
1650
- f"Latin, and other non-English languages. Carefully analyze the text before determining language.\n\n"
1651
  f"Create a structured JSON response with the following fields:\n"
1652
  f"- file_name: The document's name\n"
1653
  f"- topics: An array of topics covered in the document\n"
1654
  f"- languages: An array of languages used in the document (be precise and specific about language detection)\n"
1655
  f"- ocr_contents: A comprehensive dictionary with the document's contents including:\n"
1656
- f" * title: The main title or heading (if present)\n"
1657
- f" * content: The main body content\n"
1658
  f" * raw_text: The complete OCR text\n"
1659
  )
1660
 
@@ -1665,86 +1764,7 @@ class StructuredOCR:
1665
 
1666
  # Return the enhanced prompt
1667
  return generic_section + custom_section
1668
-
1669
- def _detect_text_language(self, text, current_languages=None):
1670
- """
1671
- Detect language from text content using the external language detector
1672
- or falling back to internal detection if needed
1673
-
1674
- Args:
1675
- text: The text to analyze
1676
- current_languages: Optional list of languages already detected
1677
-
1678
- Returns:
1679
- List of detected languages
1680
- """
1681
- logger = logging.getLogger("language_detector")
1682
-
1683
- # If no text provided, return current languages or default
1684
- if not text or len(text.strip()) < 10:
1685
- return current_languages if current_languages else ["English"]
1686
-
1687
- # Use the external language detector if available
1688
- if LANG_DETECTOR_AVAILABLE and self.language_detector:
1689
- logger.info("Using external language detector")
1690
- return self.language_detector.detect_languages(text,
1691
- filename=getattr(self, 'current_filename', None),
1692
- current_languages=current_languages)
1693
-
1694
- # Fallback for when the external module is not available
1695
- logger.info("Language detector not available, using simple detection")
1696
-
1697
- # Get all words from text (lowercase for comparison)
1698
- text_lower = text.lower()
1699
- words = text_lower.split()
1700
-
1701
- # Basic language markers - equal treatment of all languages
1702
- language_indicators = {
1703
- "French": {
1704
- "chars": ['é', 'è', 'ê', 'à', 'ç', 'ù', 'â', 'î', 'ô', 'û'],
1705
- "words": ['le', 'la', 'les', 'et', 'en', 'de', 'du', 'des', 'dans', 'ce', 'cette']
1706
- },
1707
- "Spanish": {
1708
- "chars": ['ñ', 'á', 'é', 'í', 'ó', 'ú', '¿', '¡'],
1709
- "words": ['el', 'la', 'los', 'las', 'y', 'en', 'por', 'que', 'con', 'del']
1710
- },
1711
- "German": {
1712
- "chars": ['ä', 'ö', 'ü', 'ß'],
1713
- "words": ['der', 'die', 'das', 'und', 'ist', 'von', 'mit', 'für', 'sich']
1714
- },
1715
- "Latin": {
1716
- "chars": [],
1717
- "words": ['et', 'in', 'ad', 'est', 'sunt', 'non', 'cum', 'sed', 'qui', 'quod']
1718
- }
1719
- }
1720
-
1721
- detected_languages = []
1722
-
1723
- # Simple detection logic - check for language markers
1724
- for language, indicators in language_indicators.items():
1725
- has_chars = any(char in text_lower for char in indicators["chars"])
1726
- has_words = any(word in words for word in indicators["words"])
1727
 
1728
- if has_chars and has_words:
1729
- detected_languages.append(language)
1730
-
1731
- # Check for English
1732
- english_words = ['the', 'and', 'of', 'to', 'in', 'a', 'is', 'that', 'for', 'it']
1733
- if sum(1 for word in words if word in english_words) >= 2:
1734
- detected_languages.append("English")
1735
-
1736
- # If no languages detected, default to English
1737
- if not detected_languages:
1738
- detected_languages = ["English"]
1739
-
1740
- # Limit to top 2 languages
1741
- detected_languages = detected_languages[:2]
1742
-
1743
- # Log what we found
1744
- logger.info(f"Simple fallback language detection results: {detected_languages}")
1745
-
1746
- return detected_languages
1747
-
1748
  def _extract_structured_data_text_only(self, ocr_markdown, filename, custom_prompt=None):
1749
  """
1750
  Extract structured data using text-only model with detailed historical context prompting
 
47
 
48
  # Import utilities for OCR processing
49
  try:
50
+ from utils.image_utils import replace_images_in_markdown, get_combined_markdown
51
  except ImportError:
52
+ # Define minimal fallback functions if module not found
53
+ logger.warning("Could not import utils.image_utils - using minimal fallback functions")
54
+
55
  def replace_images_in_markdown(markdown_str, images_dict):
56
+ """Minimal fallback implementation of replace_images_in_markdown"""
57
+ import re
58
+ for img_id, base64_str in images_dict.items():
59
+ # Match alt text OR link part, ignore extension
60
+ base_id = img_id.split('.')[0]
61
+ pattern = re.compile(rf"!\[[^\]]*{re.escape(base_id)}[^\]]*\]\([^\)]+\)")
62
+ markdown_str = pattern.sub(f"![{img_id}](data:image/jpeg;base64,{base64_str})", markdown_str)
63
  return markdown_str
64
 
65
  def get_combined_markdown(ocr_response):
66
+ """Minimal fallback implementation of get_combined_markdown"""
67
  markdowns = []
68
  for page in ocr_response.pages:
69
  image_data = {}
70
+ if hasattr(page, "images"):
71
+ for img in page.images:
72
+ if hasattr(img, "id") and hasattr(img, "image_base64"):
73
+ image_data[img.id] = img.image_base64
74
+ page_markdown = page.markdown if hasattr(page, "markdown") else ""
75
+ processed_markdown = replace_images_in_markdown(page_markdown, image_data)
76
+ markdowns.append(processed_markdown)
77
  return "\n\n".join(markdowns)
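# --- Editor's illustrative sketch of the fallback pair above; SimpleNamespace
# objects stand in for the real OCR response classes, which are assumptions.
#
#   from types import SimpleNamespace
#   img = SimpleNamespace(id="img-0.jpeg", image_base64="<BASE64>")
#   page = SimpleNamespace(markdown="![img-0.jpeg](img-0.jpeg)", images=[img])
#   print(get_combined_markdown(SimpleNamespace(pages=[page])))
#   # -> ![img-0.jpeg](data:image/jpeg;base64,<BASE64>)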
78
 
79
  # Import config directly (now local to historical-ocr)
80
  try:
81
+ from config import MISTRAL_API_KEY, OCR_MODEL, TEXT_MODEL, VISION_MODEL, TEST_MODE, IMAGE_PREPROCESSING
82
  except ImportError:
83
  # Fallback defaults if config is not available
84
  import os
 
87
  TEXT_MODEL = "mistral-large-latest"
88
  VISION_MODEL = "mistral-large-latest"
89
  TEST_MODE = True
90
+ # Default image preprocessing settings if config not available
91
+ IMAGE_PREPROCESSING = {
92
+ "max_size_mb": 8.0,
93
+ # Add basic defaults for preprocessing
94
+ "enhance_contrast": 1.2,
95
+ "denoise": True,
96
+ "compression_quality": 95
97
+ }
98
  logging.warning("Config module not found. Using environment variables and defaults.")
99
 
100
  # Helper function to make OCR objects JSON serializable
 
145
  is_valid_image = False
146
  logging.warning("Markdown image reference detected")
147
 
148
+ # Extract the image ID for logging
149
+ try:
150
+ img_id = image_base64.split('![')[1].split('](')[0]
151
+ logging.debug(f"Markdown reference for image: {img_id}")
152
+ except Exception:
153
+ img_id = "unknown"
154
+
155
  # Case 3: Needs detailed text content detection
156
  else:
157
  # Use the same proven approach as in our tests
 
210
  'image_base64': image_base64
211
  }
212
  else:
213
+ # Process as text if validation fails, but properly handle markdown references
214
  if image_base64 and isinstance(image_base64, str):
215
+ # Special handling for markdown image references
216
+ if image_base64.startswith('![') and '](' in image_base64 and image_base64.endswith(')'):
217
+ # Extract the image description (alt text) if available
218
+ try:
219
+ # Parse the alt text from ![alt_text](url)
220
+ alt_text = image_base64.split('![')[1].split('](')[0]
221
+ # Use the alt text or a placeholder if it's just the image name
222
+ if alt_text and not alt_text.endswith('.jpeg') and not alt_text.endswith('.jpg'):
223
+ result[key] = f"[Image: {alt_text}]"
224
+ else:
225
+ # Just note that there's an image without the reference
226
+ result[key] = "[Image]"
227
+ logging.info(f"Converted markdown reference to text placeholder: {result[key]}")
228
+ except:
229
+ # Fallback for parsing errors
230
+ result[key] = "[Image]"
231
+ else:
232
+ # Regular text content
233
+ result[key] = image_base64
234
  else:
235
  result[key] = str(value)
236
  # Handle collections
 
425
  result = serialize_ocr_response(result)
426
 
427
  # Make a final pass to check for any remaining non-serializable objects
428
+ # Proactively check for OCRImageObject instances to avoid serialization warnings
429
+ def has_ocr_image_objects(obj):
430
+ """Check if object contains any OCRImageObject instances recursively"""
431
+ if isinstance(obj, dict):
432
+ return any(has_ocr_image_objects(v) for v in obj.values())
433
+ elif isinstance(obj, list):
434
+ return any(has_ocr_image_objects(item) for item in obj)
435
+ else:
436
+ return 'OCRImageObject' in str(type(obj))
437
+
438
+ # Apply serialization preemptively if OCRImageObjects are detected
439
+ if has_ocr_image_objects(result):
440
+ # Quietly apply full serialization before any errors occur
441
+ result = serialize_ocr_response(result)
442
+ else:
443
+ # Test JSON serialization to catch any other issues
444
+ json.dumps(result)
445
  except TypeError as e:
446
+ # If there's still a serialization error, run the whole result through our serializer
447
  logger = logging.getLogger("serializer")
448
  logger.warning(f"JSON serialization error in result: {str(e)}. Applying full serialization.")
449
+ # Use a more robust approach to ensure complete serialization
450
+ try:
451
+ # First attempt with our custom serializer
452
+ result = serialize_ocr_response(result)
453
+ # Test if it's fully serializable now
454
+ json.dumps(result)
455
+ except Exception as inner_e:
456
+ # If still not serializable, convert to a simpler format
457
+ logger.warning(f"Secondary serialization error: {str(inner_e)}. Converting to basic format.")
458
+ # Create a simplified result with just the essential information
459
+ simplified_result = {
460
+ "file_name": result.get("file_name", "unknown"),
461
+ "topics": result.get("topics", ["Document"]),
462
+ "languages": [str(lang) for lang in result.get("languages", ["English"]) if lang is not None],
463
+ "ocr_contents": {
464
+ "raw_text": result.get("ocr_contents", {}).get("raw_text", "Text extraction failed due to serialization error")
465
+ },
466
+ "serialization_error": f"Original result could not be fully serialized: {str(e)}"
467
+ }
468
+ result = simplified_result
469
 
470
  return result
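# --- Editor's illustrative note: has_ocr_image_objects (defined inside the
# try block above) recurses through dicts and lists, so an OCRImageObject
# nested anywhere, e.g. {"pages": [{"images": [obj]}]}, triggers the proactive
# serialize_ocr_response pass before json.dumps is attempted.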
471
 
 
1181
 
1182
  # Use enhanced preprocessing functions from ocr_utils
1183
  try:
1184
+ from preprocessing import preprocess_image
1185
+ from utils.file_utils import get_base64_from_bytes
1186
 
1187
+ logger.info(f"Applying image preprocessing for OCR")
1188
 
1189
  # Get preprocessing settings from config
1190
  max_size_mb = IMAGE_PREPROCESSING.get("max_size_mb", 8.0)
 
1192
  if file_size_mb > max_size_mb:
1193
  logger.info(f"Image is large ({file_size_mb:.2f} MB), optimizing for API submission")
1194
 
1195
+ # Handwritten docs default to the conservative pipeline
1196
+ base64_data_url = get_base64_from_bytes(
1197
+ preprocess_image(file_path.read_bytes(),
1198
+ {"document_type": "handwritten",
1199
+ "grayscale": True,
1200
+ "denoise": True,
1201
+ "contrast": 0})
1202
+ )
1203
 
1204
  logger.info(f"Image preprocessing completed successfully")
1205
 
 
1253
  except ImportError:
1254
  logger.warning("PIL not available for resizing. Using original image.")
1255
  # Use enhanced encoder with proper MIME type detection
1256
+ from utils.image_utils import encode_image_for_api
1257
  base64_data_url = encode_image_for_api(file_path)
1258
  except Exception as e:
1259
  logger.warning(f"Image resize failed: {str(e)}. Using original image.")
 
1262
  base64_data_url = encode_image_for_api(file_path)
1263
  else:
1264
  # For smaller images, use as-is with proper MIME type
1265
+ from utils.image_utils import encode_image_for_api
1266
  base64_data_url = encode_image_for_api(file_path)
1267
  except Exception as e:
1268
  # Fallback to original image if any preprocessing fails
 
1327
  logger.error("Maximum retries reached, rate limit error persists.")
1328
  try:
1329
  # Try to import the local OCR fallback function
1330
+ from utils.image_utils import try_local_ocr_fallback
1331
 
1332
  # Attempt local OCR fallback
1333
  ocr_text = try_local_ocr_fallback(file_path, base64_data_url)
 
1539
  logger.info("Sufficient OCR text detected, analyzing language before using OCR text directly")
1540
 
1541
  # Perform language detection on the OCR text before returning
1542
+ if LANG_DETECTOR_AVAILABLE and self.language_detector:
1543
+ detected_languages = self.language_detector.detect_languages(
1544
+ ocr_markdown,
1545
+ filename=getattr(self, 'current_filename', None)
1546
+ )
1547
+ else:
1548
+ # If language detector is not available, use default English
1549
+ detected_languages = ["English"]
1550
 
1551
  return {
1552
  "file_name": filename,
 
1720
 
1721
  # If OCR text has clear French patterns but language is English or missing, fix it
1722
  if ocr_markdown and 'languages' in result:
1723
+ if LANG_DETECTOR_AVAILABLE and self.language_detector:
1724
+ result['languages'] = self.language_detector.detect_languages(
1725
+ ocr_markdown,
1726
+ filename=getattr(self, 'current_filename', None),
1727
+ current_languages=result['languages']
1728
+ )
1729
 
1730
  except Exception as e:
1731
  # Fall back to text-only model if vision model fails
 
1735
  return result
1736
 
1737
  # We've removed document type detection entirely for simplicity
1738
+
1739
 
1740
  # Create a prompt with enhanced language detection instructions
1741
  generic_section = (
1742
  f"You are an OCR specialist processing historical documents. "
1743
+ f"Focus on accurately extracting text content and image chunks while preserving structure and formatting. "
1744
  f"Pay attention to any historical features and document characteristics.\n\n"
 
 
 
1745
  f"Create a structured JSON response with the following fields:\n"
1746
  f"- file_name: The document's name\n"
1747
  f"- topics: An array of topics covered in the document\n"
1748
  f"- languages: An array of languages used in the document (be precise and specific about language detection)\n"
1749
  f"- ocr_contents: A comprehensive dictionary with the document's contents including:\n"
1750
+ f" * title: The title or heading (if present)\n"
1751
+ f" * transcript: The full text of the document\n"
1752
+ f" * text: The main text content (if different from transcript)\n"
1753
+ f" * content: The body content (if different than transcript)\n"
1754
+ f" * images: An array of image objects with their base64 data\n"
1755
+ f" * alt_text: The alt text or description of the images\n"
1756
+ f" * caption: The caption or title of the images\n"
1757
  f" * raw_text: The complete OCR text\n"
1758
  )
1759
 
 
1764
 
1765
  # Return the enhanced prompt
1766
  return generic_section + custom_section
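# --- Editor's illustrative sketch: the JSON shape the prompt above requests
# from the model (all values hypothetical).
#
#   {
#     "file_name": "letter_1872.jpg",
#     "topics": ["Correspondence", "19th Century"],
#     "languages": ["English", "French"],
#     "ocr_contents": {
#       "title": "Letter to M. Dupont",
#       "transcript": "...",
#       "images": [{"alt_text": "wax seal", "caption": "Seal detail", "base64": "..."}],
#       "raw_text": "..."
#     }
#   }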
1767
1768
  def _extract_structured_data_text_only(self, ocr_markdown, filename, custom_prompt=None):
1769
  """
1770
  Extract structured data using text-only model with detailed historical context prompting
test_magician.py → testing/test_magician.py RENAMED
File without changes
ui_components.py CHANGED
@@ -3,9 +3,21 @@ import os
3
  import io
4
  import base64
5
  import logging
 
6
  from datetime import datetime
7
  from pathlib import Path
8
  import json
9
  from constants import (
10
  DOCUMENT_TYPES,
11
  DOCUMENT_LAYOUTS,
@@ -19,7 +31,16 @@ from constants import (
19
  PREPROCESSING_DOC_TYPES,
20
  ROTATION_OPTIONS
21
  )
22
- from utils import get_base64_from_image, extract_subject_tags
23
 
24
  class ProgressReporter:
25
  """Class to handle progress reporting in the UI"""
@@ -69,12 +90,10 @@ def create_sidebar_options():
69
 
70
  # Create a container for the sidebar options
71
  with st.container():
72
- # Model selection
73
- st.markdown("### Model Selection")
74
- use_vision = st.toggle("Use Vision Model", value=True, help="Use vision model for better understanding of document structure")
75
 
76
  # Document type selection
77
- st.markdown("### Document Type")
78
  doc_type = st.selectbox("Document Type", DOCUMENT_TYPES,
79
  help="Select the type of document you're processing for better results")
80
 
@@ -91,8 +110,8 @@ def create_sidebar_options():
91
 
92
  # Custom prompt
93
  custom_prompt = ""
94
- if doc_type != DOCUMENT_TYPES[0]: # Not auto-detect
95
- # Get the template for the selected document type
96
  prompt_template = CUSTOM_PROMPT_TEMPLATES.get(doc_type, "")
97
 
98
  # Add layout information if not standard
@@ -103,53 +122,37 @@ def create_sidebar_options():
103
 
104
  # Set the custom prompt
105
  custom_prompt = prompt_template
106
-
107
- # Allow user to edit the prompt
108
- st.markdown("**Custom Processing Instructions**")
109
- custom_prompt = st.text_area("", value=custom_prompt,
110
- help="Customize the instructions for processing this document",
111
- height=80)
112
 
113
- # Image preprocessing options in an expandable section
114
- with st.expander("Image Preprocessing (Optional)"):
115
- # Add help text to clarify that preprocessing is optional
116
- st.info("Preprocessing is optional and only applied when options below are selected. Document type alone doesn't trigger preprocessing.")
117
-
118
- # Grayscale conversion
119
- grayscale = st.checkbox("Convert to Grayscale",
120
- value=False,
121
- help="Convert color images to grayscale for better OCR")
122
-
123
- # Denoise
124
- denoise = st.checkbox("Denoise Image",
125
- value=False,
126
- help="Remove noise from the image")
127
-
128
- # Contrast adjustment
129
- contrast = st.slider("Contrast Adjustment",
130
- min_value=-50,
131
- max_value=50,
132
- value=0,
133
- step=10,
134
- help="Adjust image contrast")
135
-
136
- # Rotation
137
- rotation = st.slider("Rotation",
138
- min_value=-45,
139
- max_value=45,
140
- value=0,
141
- step=5,
142
- help="Rotate image if needed")
143
-
144
- # Add image segmentation option
145
- st.markdown("### Advanced Options")
146
- use_segmentation = st.toggle("Enable Image Segmentation",
147
- value=False,
148
- help="Segment the image into text and image regions for better OCR results on complex documents")
149
-
150
- # Show explanation if segmentation is enabled
151
- if use_segmentation:
152
- st.info("Image segmentation identifies distinct text regions in complex documents, improving OCR accuracy. This is especially helpful for documents with mixed content like the Magician illustration.")
153
 
154
  # Create preprocessing options dictionary
155
  # Set document_type based on selection in UI
@@ -169,17 +172,17 @@ def create_sidebar_options():
169
  "rotation": rotation
170
  }
171
 
172
- # PDF-specific options in an expandable section
173
- with st.expander("PDF Options"):
174
- max_pages = st.number_input("Maximum Pages to Process",
175
- min_value=1,
176
- max_value=20,
177
- value=DEFAULT_MAX_PAGES,
178
- help="Limit the number of pages to process (for multi-page PDFs)")
179
-
180
- # Set default values for removed options
181
- pdf_dpi = DEFAULT_PDF_DPI
182
- pdf_rotation = 0
183
 
184
  # Create options dictionary
185
  options = {
@@ -219,471 +222,6 @@ def create_file_uploader():
219
  )
220
  return uploaded_file
221
 
222
- # Function removed - now using inline implementation in app.py
223
- def _unused_display_preprocessing_preview(uploaded_file, preprocessing_options):
224
- """Display a preview of image with preprocessing options applied"""
225
- if (any(preprocessing_options.values()) and
226
- uploaded_file.type.startswith('image/')):
227
-
228
- st.markdown("**Preprocessed Preview**")
229
- try:
230
- # Create a container for the preview
231
- with st.container():
232
- processed_bytes = preprocess_image(uploaded_file.getvalue(), preprocessing_options)
233
- # Convert image to base64 and display as HTML to avoid fullscreen button
234
- img_data = base64.b64encode(processed_bytes).decode()
235
- img_html = f'<img src="data:image/jpeg;base64,{img_data}" style="width:100%; border-radius:4px;">'
236
- st.markdown(img_html, unsafe_allow_html=True)
237
-
238
- # Show preprocessing metadata in a well-formatted caption
239
- meta_items = []
240
- if preprocessing_options.get("document_type", "standard") != "standard":
241
- meta_items.append(f"Document type ({preprocessing_options['document_type']})")
242
- if preprocessing_options.get("grayscale", False):
243
- meta_items.append("Grayscale")
244
- if preprocessing_options.get("denoise", False):
245
- meta_items.append("Denoise")
246
- if preprocessing_options.get("contrast", 0) != 0:
247
- meta_items.append(f"Contrast ({preprocessing_options['contrast']})")
248
- if preprocessing_options.get("rotation", 0) != 0:
249
- meta_items.append(f"Rotation ({preprocessing_options['rotation']}°)")
250
-
251
- # Only show "Applied:" if there are actual preprocessing steps
252
- if meta_items:
253
- meta_text = "Applied: " + ", ".join(meta_items)
254
- st.caption(meta_text)
255
- except Exception as e:
256
- st.error(f"Error in preprocessing: {str(e)}")
257
- st.info("Try using grayscale preprocessing for PNG images with transparency")
258
-
259
- def display_results(result, container, custom_prompt=""):
260
- """Display OCR results in the provided container"""
261
- with container:
262
- # Add heading for document metadata
263
- st.markdown("### Document Metadata")
264
-
265
- # Create a compact metadata section
266
- meta_html = '<div style="display: flex; flex-wrap: wrap; gap: 0.3rem; margin-bottom: 0.3rem;">'
267
-
268
- # Document type
269
- if 'detected_document_type' in result:
270
- meta_html += f'<div><strong>Type:</strong> {result["detected_document_type"]}</div>'
271
-
272
- # Processing time
273
- if 'processing_time' in result:
274
- meta_html += f'<div><strong>Time:</strong> {result["processing_time"]:.1f}s</div>'
275
-
276
- # Page information
277
- if 'limited_pages' in result:
278
- meta_html += f'<div><strong>Pages:</strong> {result["limited_pages"]["processed"]}/{result["limited_pages"]["total"]}</div>'
279
-
280
- meta_html += '</div>'
281
- st.markdown(meta_html, unsafe_allow_html=True)
282
-
283
- # Language metadata on a separate line, Subject Tags below
284
-
285
- # First show languages if available
286
- if 'languages' in result and result['languages']:
287
- languages = [lang for lang in result['languages'] if lang is not None]
288
- if languages:
289
- # Create a dedicated line for Languages
290
- lang_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
291
- lang_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Language:</div>'
292
-
293
- # Add language tags
294
- for lang in languages:
295
- # Clean language name if needed
296
- clean_lang = str(lang).strip()
297
- if clean_lang: # Only add if not empty
298
- lang_html += f'<span class="subject-tag tag-language">{clean_lang}</span>'
299
-
300
- lang_html += '</div>'
301
- st.markdown(lang_html, unsafe_allow_html=True)
302
-
303
- # Create a separate line for Time if we have time-related tags
304
- if 'topics' in result and result['topics']:
305
- time_tags = [topic for topic in result['topics']
306
- if any(term in topic.lower() for term in ["century", "pre-", "era", "historical"])]
307
- if time_tags:
308
- time_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
309
- time_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Time:</div>'
310
- for tag in time_tags:
311
- time_html += f'<span class="subject-tag tag-time-period">{tag}</span>'
312
- time_html += '</div>'
313
- st.markdown(time_html, unsafe_allow_html=True)
314
-
315
- # Then display remaining subject tags if available
316
- if 'topics' in result and result['topics']:
317
- # Filter out time-related tags which are already displayed
318
- subject_tags = [topic for topic in result['topics']
319
- if not any(term in topic.lower() for term in ["century", "pre-", "era", "historical"])]
320
-
321
- if subject_tags:
322
- # Create a separate line for Subject Tags
323
- tags_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
324
- tags_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Subject Tags:</div>'
325
- tags_html += '<div style="display: flex; flex-wrap: wrap; gap: 2px; align-items: center;">'
326
-
327
- # Generate a badge for each remaining tag
328
- for topic in subject_tags:
329
- # Determine tag category class
330
- tag_class = "subject-tag" # Default class
331
-
332
- # Add specialized class based on category
333
- if any(term in topic.lower() for term in ["language", "english", "french", "german", "latin"]):
334
- tag_class += " tag-language" # Languages
335
- elif any(term in topic.lower() for term in ["letter", "newspaper", "book", "form", "document", "recipe"]):
336
- tag_class += " tag-document-type" # Document types
337
- elif any(term in topic.lower() for term in ["travel", "military", "science", "medicine", "education", "art", "literature"]):
338
- tag_class += " tag-subject" # Subject domains
339
-
340
- # Add each tag as an inline span
341
- tags_html += f'<span class="{tag_class}">{topic}</span>'
342
-
343
- # Close the containers
344
- tags_html += '</div></div>'
345
-
346
- # Render the subject tags section
347
- st.markdown(tags_html, unsafe_allow_html=True)
348
-
349
- # No OCR content heading - start directly with tabs
350
-
351
- # Check if we have OCR content
352
- if 'ocr_contents' in result:
353
- # Create a single view instead of tabs
354
- content_tab1 = st.container()
355
-
356
- # Check for images in the result to use later
357
- has_images = result.get('has_images', False)
358
- has_image_data = ('pages_data' in result and any(page.get('images', []) for page in result.get('pages_data', [])))
359
- has_raw_images = ('raw_response_data' in result and 'pages' in result['raw_response_data'] and
360
- any('images' in page for page in result['raw_response_data']['pages']
361
- if isinstance(page, dict)))
362
-
363
- # Display structured content
364
- with content_tab1:
365
- # Display structured content with markdown formatting
366
- if isinstance(result['ocr_contents'], dict):
367
- # CSS is now handled in the main layout.py file
368
-
369
- # Function to process text with markdown support
370
- def format_markdown_text(text):
371
- """Format text with markdown and handle special patterns"""
372
- if not text:
373
- return ""
374
-
375
- import re
376
-
377
- # First, ensure we're working with a string
378
- if not isinstance(text, str):
379
- text = str(text)
380
-
381
- # Ensure newlines are preserved for proper spacing
382
- # Convert any Windows line endings to Unix
383
- text = text.replace('\r\n', '\n')
384
-
385
- # Format dates (MM/DD/YYYY or similar patterns)
386
- date_pattern = r'\b(0?[1-9]|1[0-2])[\/\-\.](0?[1-9]|[12][0-9]|3[01])[\/\-\.](\d{4}|\d{2})\b'
387
- text = re.sub(date_pattern, r'**\g<0>**', text)
388
-
389
- # Detect markdown tables and preserve them
390
- table_sections = []
391
- non_table_lines = []
392
- in_table = False
393
- table_buffer = []
394
-
395
- # Process text line by line, preserving tables
396
- lines = text.split('\n')
397
- for i, line in enumerate(lines):
398
- line_stripped = line.strip()
399
-
400
- # Detect table rows by pipe character
401
- if '|' in line_stripped and (line_stripped.startswith('|') or line_stripped.endswith('|')):
402
- if not in_table:
403
- in_table = True
404
- if table_buffer:
405
- table_buffer = []
406
- table_buffer.append(line)
407
-
408
- # Check if the next line is a table separator
409
- if i < len(lines) - 1 and '---' in lines[i+1] and '|' in lines[i+1]:
410
- table_buffer.append(lines[i+1])
411
-
412
- # Detect table separators (---|---|---)
413
- elif in_table and '---' in line_stripped and '|' in line_stripped:
414
- table_buffer.append(line)
415
-
416
- # End of table detection
417
- elif in_table:
418
- # Check if this is still part of the table
419
- next_line_is_table = False
420
- if i < len(lines) - 1:
421
- next_line = lines[i+1].strip()
422
- if '|' in next_line and (next_line.startswith('|') or next_line.endswith('|')):
423
- next_line_is_table = True
424
-
425
- if not next_line_is_table:
426
- in_table = False
427
- # Save the complete table
428
- if table_buffer:
429
- table_sections.append('\n'.join(table_buffer))
430
- table_buffer = []
431
- # Add current line to non-table lines
432
- non_table_lines.append(line)
433
- else:
434
- # Still part of the table
435
- table_buffer.append(line)
436
- else:
437
- # Not in a table
438
- non_table_lines.append(line)
439
-
440
- # Handle any remaining table buffer
441
- if in_table and table_buffer:
442
- table_sections.append('\n'.join(table_buffer))
443
-
444
- # Process non-table lines
445
- processed_lines = []
446
- for line in non_table_lines:
447
- line_stripped = line.strip()
448
-
449
- # Check if line is in ALL CAPS (and not just a short acronym)
450
- if line_stripped and line_stripped.isupper() and len(line_stripped) > 3:
451
- # ALL CAPS line - make bold instead of heading to prevent large display
452
- processed_lines.append(f"**{line_stripped}**")
453
- # Process potential headers (lines ending with colon)
454
- elif line_stripped and line_stripped.endswith(':') and len(line_stripped) < 40:
455
- # Likely a header - make it bold
456
- processed_lines.append(f"**{line_stripped}**")
457
- else:
458
- # Keep original line with its spacing
459
- processed_lines.append(line)
460
-
461
- # Join non-table lines
462
- processed_text = '\n'.join(processed_lines)
463
-
464
- # Reinsert tables in the right positions
465
- for table in table_sections:
466
- # Generate a unique marker for this table
467
- marker = f"__TABLE_MARKER_{hash(table) % 10000}__"
468
- # Find a good position to insert this table
469
- # For now, just append all tables at the end
470
- processed_text += f"\n\n{table}\n\n"
471
-
472
- # Make sure paragraphs have proper spacing but not excessive
473
- processed_text = re.sub(r'\n{3,}', '\n\n', processed_text)
474
-
475
- # Ensure two newlines between paragraphs for proper markdown rendering
476
- processed_text = re.sub(r'([^\n])\n([^\n])', r'\1\n\n\2', processed_text)
477
-
478
- return processed_text
479
-
480
- # Collect all available images from the result
481
- available_images = []
482
- if has_images and 'pages_data' in result:
483
- for page_idx, page in enumerate(result['pages_data']):
484
- if 'images' in page and len(page['images']) > 0:
485
- for img_idx, img in enumerate(page['images']):
486
- if 'image_base64' in img:
487
- available_images.append({
488
- 'source': 'pages_data',
489
- 'page': page_idx,
490
- 'index': img_idx,
491
- 'data': img['image_base64']
492
- })
493
-
494
- # Get images from raw response as well
495
- if 'raw_response_data' in result:
496
- raw_data = result['raw_response_data']
497
- if isinstance(raw_data, dict) and 'pages' in raw_data:
498
- for page_idx, page in enumerate(raw_data['pages']):
499
- if isinstance(page, dict) and 'images' in page:
500
- for img_idx, img in enumerate(page['images']):
501
- if isinstance(img, dict) and 'base64' in img:
502
- available_images.append({
503
- 'source': 'raw_response',
504
- 'page': page_idx,
505
- 'index': img_idx,
506
- 'data': img['base64']
507
- })
508
-
509
- # Extract images for display at the top
510
- images_to_display = []
511
-
512
- # First, collect all available images
513
- for img_idx, img in enumerate(available_images):
514
- if 'data' in img:
515
- images_to_display.append({
516
- 'data': img['data'],
517
- 'id': img.get('id', f"img_{img_idx}"),
518
- 'index': img_idx
519
- })
520
-
521
- # Simple display of image without dropdown or Document Image tab
522
- if images_to_display and len(images_to_display) > 0:
523
- # Just display the first image directly
524
- st.image(images_to_display[0]['data'], use_container_width=True)
525
-
526
- # Organize sections in a logical order
527
- section_order = ["title", "author", "date", "summary", "content", "transcript", "metadata"]
528
- ordered_sections = []
529
-
530
- # Add known sections first in preferred order
531
- for section_name in section_order:
532
- if section_name in result['ocr_contents'] and result['ocr_contents'][section_name]:
533
- ordered_sections.append(section_name)
534
-
535
- # Add any remaining sections
536
- for section in result['ocr_contents'].keys():
537
- if (section not in ordered_sections and
538
- section not in ['error', 'partial_text'] and
539
- result['ocr_contents'][section]):
540
- ordered_sections.append(section)
541
-
542
- # If only raw_text is available and no other content, add it last
543
- if ('raw_text' in result['ocr_contents'] and
544
- result['ocr_contents']['raw_text'] and
545
- len(ordered_sections) == 0):
546
- ordered_sections.append('raw_text')
547
-
548
- # Add minimal spacing before OCR results
549
- st.markdown("<div style='margin: 8px 0 4px 0;'></div>", unsafe_allow_html=True)
550
- st.markdown("### Document Content")
551
-
552
- # Process each section using expanders
553
- for i, section in enumerate(ordered_sections):
554
- content = result['ocr_contents'][section]
555
-
556
- # Skip empty content
557
- if not content:
558
- continue
559
-
560
- # Create an expander for each section
561
- # First section is expanded by default
562
- with st.expander(f"{section.replace('_', ' ').title()}", expanded=(i == 0)):
563
- if isinstance(content, str):
564
- # Handle image markdown
565
- if content.startswith("![") and content.endswith(")"):
566
- try:
567
- alt_text = content[2:content.index(']')]
568
- st.info(f"Image description: {alt_text if len(alt_text) > 5 else 'Image'}")
569
- except:
570
- st.info("Contains image reference")
571
- else:
572
- # Process text content
573
- formatted_content = format_markdown_text(content).strip()
574
-
575
- # Check if content contains markdown tables or complex text
576
- has_tables = '|' in formatted_content and '---' in formatted_content
577
- has_complex_structure = formatted_content.count('\n') > 5 or formatted_content.count('**') > 2
578
-
579
- # Use a container with minimal margins
580
- with st.container():
581
- # For text-only extractions or content with tables, ensure proper rendering
582
- if has_tables or has_complex_structure:
583
- # For text with tables or multiple paragraphs, use special handling
584
- # First ensure proper markdown spacing
585
- formatted_content = formatted_content.replace('\n\n\n', '\n\n')
586
-
587
- # Look for any all caps headers that might be misinterpreted
588
- import re
589
- formatted_content = re.sub(
590
- r'^([A-Z][A-Z\s]+)$',
591
- r'**\1**',
592
- formatted_content,
593
- flags=re.MULTILINE
594
- )
595
-
596
- # Preserve table formatting by adding proper spacing
597
- if has_tables:
598
- formatted_content = formatted_content.replace('\n|', '\n\n|')
599
-
600
- # Add proper paragraph spacing
601
- formatted_content = re.sub(r'([^\n])\n([^\n])', r'\1\n\n\2', formatted_content)
602
-
603
- # Use standard markdown with custom styling
604
- st.markdown(formatted_content, unsafe_allow_html=False)
605
- else:
606
- # For simpler content, use standard markdown
607
- st.markdown(formatted_content)
608
-
609
- elif isinstance(content, list):
610
- # Create markdown list
611
- list_items = []
612
- for item in content:
613
- if isinstance(item, str):
614
- item_text = format_markdown_text(item).strip()
615
- # Handle potential HTML special characters for proper rendering
616
- item_text = item_text.replace('<', '&lt;').replace('>', '&gt;')
617
- list_items.append(f"- {item_text}")
618
- else:
619
- list_items.append(f"- {str(item)}")
620
-
621
- list_content = "\n".join(list_items)
622
-
623
- # Use a container with minimal margins
624
- with st.container():
625
- # Use standard markdown for better rendering
626
- st.markdown(list_content)
627
-
628
- elif isinstance(content, dict):
629
- # Format dictionary content
630
- dict_items = []
631
- for k, v in content.items():
632
- key_formatted = k.replace('_', ' ').title()
633
-
634
- if isinstance(v, str):
635
- value_formatted = format_markdown_text(v).strip()
636
- dict_items.append(f"**{key_formatted}:** {value_formatted}")
637
- else:
638
- dict_items.append(f"**{key_formatted}:** {str(v)}")
639
-
640
- dict_content = "\n".join(dict_items)
641
-
642
- # Use a container with minimal margins
643
- with st.container():
644
- # Use standard markdown for better rendering
645
- st.markdown(dict_content)
646
-
647
- # Display custom prompt if provided
648
- if custom_prompt:
649
- with st.expander("Custom Processing Instructions"):
650
- st.write(custom_prompt)
651
-
652
- # No download heading - start directly with buttons
653
-
654
- # JSON download - use full width for buttons
655
- try:
656
- json_str = json.dumps(result, indent=2)
657
- st.download_button(
658
- label="Download JSON",
659
- data=json_str,
660
- file_name=f"{result.get('file_name', 'document').split('.')[0]}_ocr.json",
661
- mime="application/json"
662
- )
663
- except Exception as e:
664
- st.error(f"Error creating JSON download: {str(e)}")
665
-
666
- # Text download
667
- try:
668
- if 'ocr_contents' in result:
669
- if 'raw_text' in result['ocr_contents']:
670
- text_content = result['ocr_contents']['raw_text']
671
- elif 'content' in result['ocr_contents']:
672
- text_content = result['ocr_contents']['content']
673
- else:
674
- text_content = str(result['ocr_contents'])
675
- else:
676
- text_content = "No text content available."
677
-
678
- st.download_button(
679
- label="Download Text",
680
- data=text_content,
681
- file_name=f"{result.get('file_name', 'document').split('.')[0]}_ocr.txt",
682
- mime="text/plain"
683
- )
684
- except Exception as e:
685
- st.error(f"Error creating text download: {str(e)}")
686
-
687
  def display_document_with_images(result):
688
  """Display document with images"""
689
  # Check for pages_data first
@@ -759,7 +297,7 @@ def display_document_with_images(result):
759
  if isinstance(raw_page, dict) and 'images' in raw_page:
760
  for img in raw_page['images']:
761
  if isinstance(img, dict) and 'base64' in img:
762
- st.image(img['base64'])
763
  st.caption("Image from OCR response")
764
  image_displayed = True
765
  break
@@ -797,7 +335,7 @@ def display_previous_results():
797
  st.markdown("""
798
  <div style="text-align: center; padding: 30px 20px; background-color: #f8f9fa; border-radius: 6px; margin-top: 10px;">
799
  <div style="font-size: 36px; margin-bottom: 15px;">📄</div>
800
- <h4 style="margin-bottom: 8px; font-weight: 500;">No Previous Results</h4>
801
  <p style="font-size: 14px; color: #666;">Process a document to see your results history.</p>
802
  </div>
803
  """, unsafe_allow_html=True)
@@ -806,7 +344,7 @@ def display_previous_results():
806
  with col2:
807
  try:
808
  # Create download button for all results
809
- from ocr_utils import create_results_zip_in_memory
810
  zip_data = create_results_zip_in_memory(st.session_state.previous_results)
811
  timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
812
 
@@ -908,37 +446,22 @@ def display_previous_results():
908
  meta_html += '</div>'
909
  st.markdown(meta_html, unsafe_allow_html=True)
910
 
911
- # Simplified tabs - fewer options for cleaner interface
912
  has_images = selected_result.get('has_images', False)
913
  if has_images:
914
- view_tabs = st.tabs(["Document Content", "Raw Text", "Images"])
915
  view_tab1, view_tab2, view_tab3 = view_tabs
916
  else:
917
- view_tabs = st.tabs(["Document Content", "Raw Text"])
918
  view_tab1, view_tab2 = view_tabs
919
-
920
- # Define helper function for formatting text
921
- def format_text_display(text):
922
- if not isinstance(text, str):
923
- return text
924
-
925
- lines = text.split('\n')
926
- processed_lines = []
927
- for line in lines:
928
- line_stripped = line.strip()
929
- if line_stripped and line_stripped.isupper() and len(line_stripped) > 3:
930
- processed_lines.append(f"**{line_stripped}**")
931
- else:
932
- processed_lines.append(line)
933
-
934
- return '\n'.join(processed_lines)
935
 
936
  # First tab - Document Content (simplified structured view)
937
  with view_tab1:
938
  # Display content in a cleaner, more streamlined format
939
  if 'ocr_contents' in selected_result and isinstance(selected_result['ocr_contents'], dict):
940
  # Create a more focused list of important sections
941
- priority_sections = ["title", "content", "transcript", "summary", "raw_text"]
942
  displayed_sections = set()
943
 
944
  # First display priority sections
@@ -951,7 +474,7 @@ def display_previous_results():
951
  st.markdown(f"##### {section.replace('_', ' ').title()}")
952
 
953
  # Format and display content
954
- formatted_content = format_text_display(content)
955
  st.markdown(formatted_content)
956
  displayed_sections.add(section)
957
 
@@ -963,7 +486,7 @@ def display_previous_results():
963
  st.markdown(f"##### {section.replace('_', ' ').title()}")
964
 
965
  if isinstance(content, str):
966
- st.markdown(format_text_display(content))
967
  elif isinstance(content, list):
968
  for item in content:
969
  st.markdown(f"- {item}")
@@ -971,34 +494,42 @@ def display_previous_results():
971
  for k, v in content.items():
972
  st.markdown(f"**{k}:** {v}")
973
 
974
- # Second tab - Raw Text (simplified)
975
  with view_tab2:
976
- # Extract raw text or content
977
- raw_text = ""
 
 
 
 
 
 
 
978
  if 'ocr_contents' in selected_result:
979
- if 'raw_text' in selected_result['ocr_contents']:
980
- raw_text = selected_result['ocr_contents']['raw_text']
981
- elif 'content' in selected_result['ocr_contents']:
982
- raw_text = selected_result['ocr_contents']['content']
 
 
 
 
 
 
 
 
 
 
 
983
 
984
- # Display the text area with raw text
985
- edited_text = st.text_area("", raw_text, height=300, key="selected_raw_text")
986
 
987
- # Add buttons in a row
988
- col1, col2 = st.columns(2)
989
- with col1:
990
- st.button("Copy Text", key="selected_copy_btn")
991
- with col2:
992
- st.download_button(
993
- label="Download Text",
994
- data=edited_text,
995
- file_name=f"{file_name.split('.')[0]}_text.txt",
996
- mime="text/plain",
997
- key="selected_download_btn"
998
- )
999
 
1000
- # Third tab - With Images (simplified)
1001
- if has_images and 'pages_data' in selected_result:
1002
  with view_tab3:
1003
  # Simplified image display
1004
  if 'pages_data' in selected_result:
@@ -1007,7 +538,7 @@ def display_previous_results():
1007
  if 'images' in page_data and len(page_data['images']) > 0:
1008
  for img in page_data['images']:
1009
  if 'image_base64' in img:
1010
- st.image(img['image_base64'], use_column_width=True)
1011
 
1012
  # Get page text if available
1013
  page_text = ""
@@ -1018,21 +549,22 @@ def display_previous_results():
1018
  if page_text:
1019
  with st.expander(f"Page {i+1} Text", expanded=False):
1020
  st.text(page_text)
 
1021
 
1022
  def display_about_tab():
1023
- """Display about tab content"""
1024
- st.header("About")
1025
 
1026
  # Add app description
1027
  st.markdown("""
1028
- **Historical OCR** is a specialized tool for extracting text from historical documents, manuscripts, and printed materials.
1029
  """)
1030
 
1031
  # Purpose section with consistent formatting
1032
  st.markdown("### Purpose")
1033
  st.markdown("""
1034
  This tool is designed to assist scholars in historical research by extracting text from challenging documents.
1035
- While it may not achieve 100% accuracy for all materials, it serves as a valuable research aid for navigating
1036
  historical documents, particularly:
1037
  """)
1038
 
 
3
  import io
4
  import base64
5
  import logging
6
+ import re
7
  from datetime import datetime
8
  from pathlib import Path
9
  import json
10
+
11
+ # Define exports
12
+ __all__ = [
13
+ 'ProgressReporter',
14
+ 'create_sidebar_options',
15
+ 'create_file_uploader',
16
+ 'display_document_with_images',
17
+ 'display_previous_results',
18
+ 'display_about_tab',
19
+ 'display_results' # Re-export from utils.ui_utils
20
+ ]
21
  from constants import (
22
  DOCUMENT_TYPES,
23
  DOCUMENT_LAYOUTS,
 
31
  PREPROCESSING_DOC_TYPES,
32
  ROTATION_OPTIONS
33
  )
34
+ from utils.image_utils import format_ocr_text
35
+ from utils.content_utils import (
36
+ classify_document_content,
37
+ extract_document_text,
38
+ extract_image_description,
39
+ clean_raw_text,
40
+ format_markdown_text
41
+ )
42
+ from utils.ui_utils import display_results
43
+ from preprocessing import preprocess_image
44
 
45
  class ProgressReporter:
46
  """Class to handle progress reporting in the UI"""
 
90
 
91
  # Create a container for the sidebar options
92
  with st.container():
93
+ # Default to using vision model (removed selection from UI)
94
+ use_vision = True
 
95
 
96
  # Document type selection
 
97
  doc_type = st.selectbox("Document Type", DOCUMENT_TYPES,
98
  help="Select the type of document you're processing for better results")
99
 
 
110
 
111
  # Custom prompt
112
  custom_prompt = ""
113
+ # Get the template for the selected document type if not auto-detect
114
+ if doc_type != DOCUMENT_TYPES[0]:
115
  prompt_template = CUSTOM_PROMPT_TEMPLATES.get(doc_type, "")
116
 
117
  # Add layout information if not standard
 
122
 
123
  # Set the custom prompt
124
  custom_prompt = prompt_template
 
 
 
 
 
 
125
 
126
+ # Allow user to edit the prompt (always visible)
127
+ custom_prompt = st.text_area("Custom Processing Instructions", value=custom_prompt,
128
+ help="Customize the instructions for processing this document",
129
+ height=80)
130
+
131
+ # Image preprocessing options (always visible)
132
+ st.markdown("### Image Preprocessing")
133
+
134
+ # Grayscale conversion
135
+ grayscale = st.checkbox("Convert to Grayscale",
136
+ value=False,
137
+ help="Convert color images to grayscale for better text recognition")
138
+
139
+ # Light denoising option
140
+ denoise = st.checkbox("Light Denoising",
141
+ value=False,
142
+ help="Apply gentle denoising to improve text clarity")
143
+
144
+ # Contrast adjustment
145
+ contrast = st.slider("Contrast Adjustment",
146
+ min_value=-20,
147
+ max_value=20,
148
+ value=0,
149
+ step=5,
150
+ help="Adjust image contrast (limited range)")
151
+
152
+
153
+ # Initialize rotation (keeping it set to 0)
154
+ rotation = 0
155
+ use_segmentation = False
 
 
 
 
 
 
 
 
 
 
156
 
157
  # Create preprocessing options dictionary
158
  # Set document_type based on selection in UI
 
172
  "rotation": rotation
173
  }
174
 
175
+ # PDF-specific options
176
+ st.markdown("### PDF Options")
177
+ max_pages = st.number_input("Maximum Pages to Process",
178
+ min_value=1,
179
+ max_value=20,
180
+ value=DEFAULT_MAX_PAGES,
181
+ help="Limit the number of pages to process (for multi-page PDFs)")
182
+
183
+ # Set default values for removed options
184
+ pdf_dpi = DEFAULT_PDF_DPI
185
+ pdf_rotation = 0
186
 
187
  # Create options dictionary
188
  options = {
 
222
  )
223
  return uploaded_file
224
 
225
  def display_document_with_images(result):
226
  """Display document with images"""
227
  # Check for pages_data first
 
297
  if isinstance(raw_page, dict) and 'images' in raw_page:
298
  for img in raw_page['images']:
299
  if isinstance(img, dict) and 'base64' in img:
300
+ st.image(img['base64'], use_container_width=True)
301
  st.caption("Image from OCR response")
302
  image_displayed = True
303
  break
 
335
  st.markdown("""
336
  <div style="text-align: center; padding: 30px 20px; background-color: #f8f9fa; border-radius: 6px; margin-top: 10px;">
337
  <div style="font-size: 36px; margin-bottom: 15px;">📄</div>
338
+ <h3 style="margin-bottom: 16px; font-weight: 500;">No Previous Results</h3>
339
  <p style="font-size: 14px; color: #666;">Process a document to see your results history.</p>
340
  </div>
341
  """, unsafe_allow_html=True)
 
344
  with col2:
345
  try:
346
  # Create download button for all results
347
+ from utils.image_utils import create_results_zip_in_memory
348
  zip_data = create_results_zip_in_memory(st.session_state.previous_results)
349
  timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
350
 
 
446
  meta_html += '</div>'
447
  st.markdown(meta_html, unsafe_allow_html=True)
448
 
449
+ # Simplified tabs - using the same format as main view
450
  has_images = selected_result.get('has_images', False)
451
  if has_images:
452
+ view_tabs = st.tabs(["Document Content", "Raw JSON", "Images"])
453
  view_tab1, view_tab2, view_tab3 = view_tabs
454
  else:
455
+ view_tabs = st.tabs(["Document Content", "Raw JSON"])
456
  view_tab1, view_tab2 = view_tabs
457
+ view_tab3 = None
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
458
 
459
  # First tab - Document Content (simplified structured view)
460
  with view_tab1:
461
  # Display content in a cleaner, more streamlined format
462
  if 'ocr_contents' in selected_result and isinstance(selected_result['ocr_contents'], dict):
463
  # Create a more focused list of important sections
464
+ priority_sections = ["title", "content", "transcript", "summary"]
465
  displayed_sections = set()
466
 
467
  # First display priority sections
 
474
  st.markdown(f"##### {section.replace('_', ' ').title()}")
475
 
476
  # Format and display content
477
+ formatted_content = format_ocr_text(content)
478
  st.markdown(formatted_content)
479
  displayed_sections.add(section)
480
 
 
486
  st.markdown(f"##### {section.replace('_', ' ').title()}")
487
 
488
  if isinstance(content, str):
489
+ st.markdown(format_ocr_text(content))
490
  elif isinstance(content, list):
491
  for item in content:
492
  st.markdown(f"- {item}")
 
494
  for k, v in content.items():
495
  st.markdown(f"**{k}:** {v}")
496
 
497
+ # Second tab - Raw JSON (simplified)
498
  with view_tab2:
499
+ # Extract the relevant JSON data
500
+ json_data = {}
501
+
502
+ # Include important metadata
503
+ for field in ['file_name', 'timestamp', 'processing_time', 'languages', 'topics', 'subjects', 'detected_document_type', 'text']:
504
+ if field in selected_result:
505
+ json_data[field] = selected_result[field]
506
+
507
+ # Include OCR contents
508
  if 'ocr_contents' in selected_result:
509
+ json_data['ocr_contents'] = selected_result['ocr_contents']
510
+
511
+ # Exclude large binary data like base64 images to keep JSON clean
512
+ if 'pages_data' in selected_result:
513
+ # Create simplified pages_data without large binary content
514
+ simplified_pages = []
515
+ for page in selected_result['pages_data']:
516
+ simplified_page = {
517
+ 'page_number': page.get('page_number', 0),
518
+ 'has_text': bool(page.get('markdown', '')),
519
+ 'has_images': bool(page.get('images', [])),
520
+ 'image_count': len(page.get('images', []))
521
+ }
522
+ simplified_pages.append(simplified_page)
523
+ json_data['pages_summary'] = simplified_pages
524
 
525
+ # Format the JSON prettily
526
+ json_str = json.dumps(json_data, indent=2)
527
 
528
+ # Display in a monospace font with syntax highlighting
529
+ st.code(json_str, language="json")
 
 
 
 
 
 
 
 
 
 
530
 
531
+ # Third tab - Images (simplified)
532
+ if has_images and view_tab3 is not None:
533
  with view_tab3:
534
  # Simplified image display
535
  if 'pages_data' in selected_result:
 
538
  if 'images' in page_data and len(page_data['images']) > 0:
539
  for img in page_data['images']:
540
  if 'image_base64' in img:
541
+ st.image(img['image_base64'], use_container_width=True)
542
 
543
  # Get page text if available
544
  page_text = ""
 
549
  if page_text:
550
  with st.expander(f"Page {i+1} Text", expanded=False):
551
  st.text(page_text)
552
+
553
 
554
  def display_about_tab():
555
+ """Display learn more tab content"""
556
+ st.header("Learn More")
557
 
558
  # Add app description
559
  st.markdown("""
560
+ **Historical OCR** is a specialized academic tool for extracting text from historical documents, manuscripts, and printed materials.
561
  """)
562
 
563
  # Purpose section with consistent formatting
564
  st.markdown("### Purpose")
565
  st.markdown("""
566
  This tool is designed to assist scholars in historical research by extracting text from challenging documents.
567
+ While it may not achieve full accuracy for all materials, it serves as a practical research aid for navigating
568
  historical documents, particularly:
569
  """)
570
 
utils/content_utils.py ADDED
@@ -0,0 +1,189 @@
1
+ import re
2
+ import ast
3
+ from .text_utils import clean_raw_text, format_markdown_text
4
+
5
+ def classify_document_content(result):
6
+ """Classify document content based on structure and content"""
7
+ classification = {
8
+ 'has_title': False,
9
+ 'has_content': False,
10
+ 'has_sections': False,
11
+ 'is_structured': False
12
+ }
13
+
14
+ if 'ocr_contents' not in result or not isinstance(result['ocr_contents'], dict):
15
+ return classification
16
+
17
+ # Check for title
18
+ if 'title' in result['ocr_contents'] and result['ocr_contents']['title']:
19
+ classification['has_title'] = True
20
+
21
+ # Check for content
22
+ content_fields = ['content', 'transcript', 'text']
23
+ for field in content_fields:
24
+ if field in result['ocr_contents'] and result['ocr_contents'][field]:
25
+ classification['has_content'] = True
26
+ break
27
+
28
+ # Check for sections
29
+ section_count = 0
30
+ for key in result['ocr_contents'].keys():
31
+ if key not in ['raw_text', 'error'] and result['ocr_contents'][key]:
32
+ section_count += 1
33
+
34
+ classification['has_sections'] = section_count > 2
35
+
36
+ # Check if structured
37
+ classification['is_structured'] = (
38
+ classification['has_title'] and
39
+ classification['has_content'] and
40
+ classification['has_sections']
41
+ )
42
+
43
+ return classification
44
+
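A minimal usage sketch for `classify_document_content`; the `result` dict below is an invented example, not data from this commit:

    result = {"ocr_contents": {"title": "Letter, 1872",
                               "content": "Dear Sir...",
                               "summary": "A short note."}}
    classify_document_content(result)
    # -> {'has_title': True, 'has_content': True,
    #     'has_sections': True, 'is_structured': True}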
45
+ def extract_document_text(result):
46
+ """Extract main document text content"""
47
+ if 'ocr_contents' not in result or not isinstance(result['ocr_contents'], dict):
48
+ return ""
49
+
50
+ # Try to get the text from content fields in preferred order - prioritize main_text
51
+ for field in ['main_text', 'content', 'transcript', 'text', 'raw_text']:
52
+ if field in result['ocr_contents'] and result['ocr_contents'][field]:
53
+ content = result['ocr_contents'][field]
54
+ if isinstance(content, str):
55
+ return content
56
+
57
+ return ""
58
+
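The field priority above means a structured 'content' field wins over 'raw_text'; a hypothetical call:

    extract_document_text({"ocr_contents": {"raw_text": "fallback text",
                                            "content": "Main body text"}})
    # -> "Main body text"  ('main_text' and 'content' outrank 'raw_text')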
59
+ def extract_image_description(image_data):
60
+ """Extract image description from data"""
61
+ if not image_data or not isinstance(image_data, dict):
62
+ return ""
63
+
64
+ # Try different fields that might contain descriptions
65
+ for field in ['alt_text', 'caption', 'description']:
66
+ if field in image_data and image_data[field]:
67
+ return image_data[field]
68
+
69
+ return ""
70
+
71
+ def format_structured_data(content):
72
+ """Format structured data like lists and dictionaries into readable markdown
73
+
74
+ Args:
75
+ content: The content to format (str, list, dict)
76
+
77
+ Returns:
78
+ Formatted markdown text
79
+ """
80
+ if not content:
81
+ return ""
82
+
83
+ # If it's already a string, look for patterns that appear to be Python/JSON representations
84
+ if isinstance(content, str):
85
+ # Look for lists like ['item1', 'item2', 'item3']
86
+ list_pattern = r"(\[([^\[\]]*)\])"
87
+ dict_pattern = r"(\{([^\{\}]*)\})"
88
+
89
+ # First handle lists - ['item1', 'item2']
90
+ def replace_list(match):
91
+ try:
92
+ # Try to parse the match as a Python list
93
+ list_str = match.group(1)
94
+
95
+ # Quick check for empty list
96
+ if list_str == "[]":
97
+ return ""
98
+
99
+ # Safe evaluation of list-like string
100
+ try:
101
+ items = ast.literal_eval(list_str)
102
+ if isinstance(items, list):
103
+ # Convert to markdown bullet points
104
+ return "\n" + "\n".join([f"- {item}" for item in items])
105
+ else:
106
+ return list_str # Not a list, return unchanged
107
+ except (SyntaxError, ValueError):
108
+ # Try a simpler regex-based approach for common formats
109
+ # Handle simple comma-separated lists
110
+ items = re.findall(r"'([^']*)'|\"([^\"]*)\"", list_str)
111
+ if items:
112
+ # Extract the matched groups and handle both single and double quotes
113
+ clean_items = [item[0] if item[0] else item[1] for item in items]
114
+ return "\n" + "\n".join([f"- {item}" for item in clean_items])
115
+ return list_str # Couldn't parse, return unchanged
116
+ except Exception:
117
+ return match.group(0) # Return the original text if any error
118
+
119
+ # Handle dictionaries or structured fields like {key: value, key2: value2}
120
+ def replace_dict(match):
121
+ try:
122
+ dict_str = match.group(1)
123
+
124
+ # Quick check for empty dict
125
+ if dict_str == "{}":
126
+ return ""
127
+
128
+ # First try to parse as a Python dict
129
+ try:
130
+ data_dict = ast.literal_eval(dict_str)
131
+ if isinstance(data_dict, dict):
132
+ return "\n" + "\n".join([f"**{k}**: {v}" for k, v in data_dict.items()])
133
+ except (SyntaxError, ValueError):
134
+ # If that fails, use regex to extract key-value pairs
135
+ pairs = re.findall(r"'([^']*)':\s*'([^']*)'|\"([^\"]*)\":\s*\"([^\"]*)\"", dict_str)
136
+ if pairs:
137
+ formatted_pairs = []
138
+ for pair in pairs:
139
+ if pair[0] and pair[1]: # Single quotes
140
+ formatted_pairs.append(f"**{pair[0]}**: {pair[1]}")
141
+ elif pair[2] and pair[3]: # Double quotes
142
+ formatted_pairs.append(f"**{pair[2]}**: {pair[3]}")
143
+ return "\n" + "\n".join(formatted_pairs)
144
+ return dict_str # Return original if couldn't parse
145
+ except Exception:
146
+ return match.group(0) # Return original text if any error
147
+
148
+ # Check for keys with array values (common in OCR output)
149
+ key_array_pattern = r"([a-zA-Z_]+):\s*(\[.*?\])"
150
+
151
+ def replace_key_array(match):
152
+ try:
153
+ key = match.group(1)
154
+ array_str = match.group(2)
155
+
156
+ # Process the array part with our list replacer
157
+ formatted_array = replace_list(re.match(list_pattern, array_str))
158
+
159
+ # If we successfully formatted it, return with the key as a header
160
+ if formatted_array != array_str:
161
+ return f"**{key}**:{formatted_array}"
162
+ else:
163
+ return match.group(0) # Return original if no change
164
+ except Exception:
165
+ return match.group(0) # Return the original on error
166
+
167
+ # Apply all replacements
168
+ content = re.sub(key_array_pattern, replace_key_array, content)
169
+ content = re.sub(list_pattern, replace_list, content)
170
+ content = re.sub(dict_pattern, replace_dict, content)
171
+
172
+ return content
173
+
174
+ # Handle native Python lists
175
+ elif isinstance(content, list):
176
+ if not content:
177
+ return ""
178
+ # Convert to markdown bullet points
179
+ return "\n".join([f"- {item}" for item in content])
180
+
181
+ # Handle native Python dictionaries
182
+ elif isinstance(content, dict):
183
+ if not content:
184
+ return ""
185
+ # Convert to markdown key-value pairs
186
+ return "\n".join([f"**{k}**: {v}" for k, v in content.items()])
187
+
188
+ # Return as string for other types
189
+ return str(content)
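Two short examples of what `format_structured_data` produces; the outputs are traced by hand from the logic above, not captured from a run:

    format_structured_data("topics: ['Letters', 'Genealogy']")
    # -> "**topics**:\n- Letters\n- Genealogy"

    format_structured_data({"author": "Unknown", "date": "1872"})
    # -> "**author**: Unknown\n**date**: 1872"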
utils/file_utils.py ADDED
@@ -0,0 +1,100 @@
1
+ """
2
+ File utility functions for historical OCR processing.
3
+ """
4
+ import base64
5
+ import logging
6
+ from pathlib import Path
7
+
8
+ # Configure logging
9
+ logger = logging.getLogger("utils")
10
+ logger.setLevel(logging.INFO)
11
+
12
+ def get_base64_from_image(image_path):
13
+ """
14
+ Get base64 data URL from image file with proper MIME type.
15
+
16
+ Args:
17
+ image_path: Path to the image file
18
+
19
+ Returns:
20
+ Base64 data URL with appropriate MIME type prefix
21
+ """
22
+ try:
23
+ # Convert to Path object for better handling
24
+ path_obj = Path(image_path)
25
+
26
+ # Determine mime type based on file extension
27
+ mime_type = 'image/jpeg' # Default mime type
28
+ suffix = path_obj.suffix.lower()
29
+ if suffix == '.png':
30
+ mime_type = 'image/png'
31
+ elif suffix == '.gif':
32
+ mime_type = 'image/gif'
33
+ elif suffix in ['.jpg', '.jpeg']:
34
+ mime_type = 'image/jpeg'
35
+ elif suffix == '.pdf':
36
+ mime_type = 'application/pdf'
37
+
38
+ # Read and encode file
39
+ with open(path_obj, "rb") as file:
40
+ encoded = base64.b64encode(file.read()).decode('utf-8')
41
+ return f"data:{mime_type};base64,{encoded}"
42
+ except Exception as e:
43
+ logger.error(f"Error encoding file to base64: {str(e)}")
44
+ return ""
45
+
46
+ def get_base64_from_bytes(file_bytes, mime_type=None, file_name=None):
47
+ """
48
+ Get base64 data URL from file bytes with proper MIME type.
49
+
50
+ Args:
51
+ file_bytes: Binary file data
52
+ mime_type: MIME type of the file (optional)
53
+ file_name: Original file name for MIME type detection (optional)
54
+
55
+ Returns:
56
+ Base64 data URL with appropriate MIME type prefix
57
+ """
58
+ try:
59
+ # Determine mime type if not provided
60
+ if mime_type is None and file_name is not None:
61
+ # Get file extension
62
+ suffix = Path(file_name).suffix.lower()
63
+ if suffix == '.png':
64
+ mime_type = 'image/png'
65
+ elif suffix == '.gif':
66
+ mime_type = 'image/gif'
67
+ elif suffix in ['.jpg', '.jpeg']:
68
+ mime_type = 'image/jpeg'
69
+ elif suffix == '.pdf':
70
+ mime_type = 'application/pdf'
71
+ else:
72
+ # Default to image/jpeg for unknown types when processing images
73
+ mime_type = 'image/jpeg'
74
+ elif mime_type is None:
75
+ # Default MIME type if we can't determine it - use image/jpeg instead of application/octet-stream
76
+ # to ensure compatibility with Mistral AI OCR API
77
+ mime_type = 'image/jpeg'
78
+
79
+ # Encode and create data URL
80
+ encoded = base64.b64encode(file_bytes).decode('utf-8')
81
+ return f"data:{mime_type};base64,{encoded}"
82
+ except Exception as e:
83
+ logger.error(f"Error encoding bytes to base64: {str(e)}")
84
+ return ""
85
+
86
+ def handle_temp_files(temp_file_paths):
87
+ """
88
+ Clean up temporary files
89
+
90
+ Args:
91
+ temp_file_paths: List of temporary file paths to clean up
92
+ """
93
+ import os
94
+ for temp_path in temp_file_paths:
95
+ try:
96
+ if os.path.exists(temp_path):
97
+ os.unlink(temp_path)
98
+ logger.info(f"Removed temporary file: {temp_path}")
99
+ except Exception as e:
100
+ logger.warning(f"Failed to remove temporary file {temp_path}: {str(e)}")
utils/general_utils.py ADDED
@@ -0,0 +1,163 @@
1
+ """
2
+ General utility functions for historical OCR processing.
3
+ """
4
+ import os
5
+ import base64
6
+ import hashlib
7
+ import time
8
+ import logging
9
+ from datetime import datetime
10
+ from pathlib import Path
11
+ from functools import wraps
12
+
13
+ # Configure logging
14
+ logger = logging.getLogger("utils")
15
+ logger.setLevel(logging.INFO)
16
+
17
+ def generate_cache_key(file_bytes, file_type, use_vision, preprocessing_options=None, pdf_rotation=0, custom_prompt=None):
18
+ """
19
+ Generate a cache key for OCR processing
20
+
21
+ Args:
22
+ file_bytes: File content as bytes
23
+ file_type: Type of file (pdf or image)
24
+ use_vision: Whether to use vision model
25
+ preprocessing_options: Dictionary of preprocessing options
26
+ pdf_rotation: PDF rotation value
27
+ custom_prompt: Custom prompt for OCR
28
+
29
+ Returns:
30
+ str: Cache key
31
+ """
32
+ # Generate file hash
33
+ file_hash = hashlib.md5(file_bytes).hexdigest()
34
+
35
+ # Include preprocessing options in cache key
36
+ preprocessing_options_hash = ""
37
+ if preprocessing_options:
38
+ # Add pdf_rotation to preprocessing options to ensure it's part of the cache key
39
+ if pdf_rotation != 0:
40
+ preprocessing_options_with_rotation = preprocessing_options.copy()
41
+ preprocessing_options_with_rotation['pdf_rotation'] = pdf_rotation
42
+ preprocessing_str = str(sorted(preprocessing_options_with_rotation.items()))
43
+ else:
44
+ preprocessing_str = str(sorted(preprocessing_options.items()))
45
+ preprocessing_options_hash = hashlib.md5(preprocessing_str.encode()).hexdigest()
46
+ elif pdf_rotation != 0:
47
+ # If no preprocessing options but we have rotation, include that in the hash
48
+ preprocessing_options_hash = hashlib.md5(f"pdf_rotation_{pdf_rotation}".encode()).hexdigest()
49
+
50
+ # Create base cache key
51
+ cache_key = f"{file_hash}_{file_type}_{use_vision}_{preprocessing_options_hash}"
52
+
53
+ # Include custom prompt in cache key if provided
54
+ if custom_prompt:
55
+ custom_prompt_hash = hashlib.md5(str(custom_prompt).encode()).hexdigest()
56
+ cache_key = f"{cache_key}_{custom_prompt_hash}"
57
+
58
+ return cache_key
59
+
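An illustrative call (argument values invented): the key concatenates the MD5 of the file with the processing parameters, so changing any of them busts the cache:

    key = generate_cache_key(file_bytes=b"...", file_type="image", use_vision=True,
                             preprocessing_options={"grayscale": True},
                             custom_prompt="Transcribe the marginalia")
    # -> "<md5(file)>_image_True_<md5(options)>_<md5(prompt)>"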
60
+ def timing(description):
61
+ """Context manager for timing code execution"""
62
+ class TimingContext:
63
+ def __init__(self, description):
64
+ self.description = description
65
+
66
+ def __enter__(self):
67
+ self.start_time = time.time()
68
+ return self
69
+
70
+ def __exit__(self, exc_type, exc_val, exc_tb):
71
+ end_time = time.time()
72
+ execution_time = end_time - self.start_time
73
+ logger.info(f"{self.description} took {execution_time:.2f} seconds")
74
+ return False
75
+
76
+ return TimingContext(description)
77
+
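Usage sketch; `run_ocr` and `page_image` are stand-ins for whatever call is being timed:

    with timing("OCR pass"):
        result = run_ocr(page_image)
    # logs: "OCR pass took 3.42 seconds"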
78
+ def format_timestamp(timestamp=None):
79
+ """Format timestamp for display"""
80
+ if timestamp is None:
81
+ timestamp = datetime.now()
82
+ elif isinstance(timestamp, str):
83
+ try:
84
+ timestamp = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
85
+ except ValueError:
86
+ timestamp = datetime.now()
87
+
88
+ return timestamp.strftime("%Y-%m-%d %H:%M")
89
+
90
+ def create_descriptive_filename(original_filename, result, file_ext, preprocessing_options=None):
91
+ """
92
+ Create a descriptive filename for the result
93
+
94
+ Args:
95
+ original_filename: Original filename
96
+ result: OCR result dictionary
97
+ file_ext: File extension
98
+ preprocessing_options: Dictionary of preprocessing options
99
+
100
+ Returns:
101
+ str: Descriptive filename
102
+ """
103
+ # Get base name without extension
104
+ original_name = Path(original_filename).stem
105
+
106
+ # Add document type to filename if detected
107
+ doc_type_tag = ""
108
+ if 'detected_document_type' in result:
109
+ doc_type = result['detected_document_type'].lower()
110
+ doc_type_tag = f"_{doc_type.replace(' ', '_')}"
111
+ elif 'topics' in result and result['topics']:
112
+ # Use first tag as document type if not explicitly detected
113
+ doc_type_tag = f"_{result['topics'][0].lower().replace(' ', '_')}"
114
+
115
+ # Add period tag for historical context if available
116
+ period_tag = ""
117
+ if 'topics' in result and result['topics']:
118
+ for tag in result['topics']:
119
+ if "century" in tag.lower() or "pre-" in tag.lower() or "era" in tag.lower():
120
+ period_tag = f"_{tag.lower().replace(' ', '_')}"
121
+ break
122
+
123
+ # Generate final descriptive filename
124
+ descriptive_name = f"{original_name}{doc_type_tag}{period_tag}{file_ext}"
125
+ return descriptive_name
126
+
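A worked example, traced by hand through the tagging logic above:

    create_descriptive_filename("scan01.jpg",
                                {"detected_document_type": "Letter",
                                 "topics": ["19th Century"]},
                                ".json")
    # -> "scan01_letter_19th_century.json"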
127
+ def extract_subject_tags(result, raw_text, preprocessing_options=None):
128
+ """
129
+ Extract subject tags from OCR result
130
+
131
+ Args:
132
+ result: OCR result dictionary
133
+ raw_text: Raw text from OCR
134
+ preprocessing_options: Dictionary of preprocessing options
135
+
136
+ Returns:
137
+ list: Subject tags
138
+ """
139
+ subject_tags = []
140
+
141
+ # Use existing topics as starting point if available
142
+ if 'topics' in result and result['topics']:
143
+ subject_tags = list(result['topics'])
144
+
145
+ # Add document type if detected
146
+ if 'detected_document_type' in result:
147
+ doc_type = result['detected_document_type'].capitalize()
148
+ if doc_type not in subject_tags:
149
+ subject_tags.append(doc_type)
150
+
151
+ # If no tags were found, add some defaults
152
+ if not subject_tags:
153
+ subject_tags = ["Document", "Historical Document"]
154
+
155
+ # Try to infer content type
156
+ if "letter" in raw_text.lower()[:1000] or "dear" in raw_text.lower()[:200]:
157
+ subject_tags.append("Letter")
158
+
159
+ # Check if it might be a newspaper
160
+ if "newspaper" in raw_text.lower()[:1000] or "editor" in raw_text.lower()[:500]:
161
+ subject_tags.append("Newspaper")
162
+
163
+ return subject_tags
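A hand-traced example (the raw text is invented): with no topics or detected type, the defaults apply and the "dear" heuristic adds a Letter tag:

    extract_subject_tags({}, "Dear Mr. Whitfield, I write to inform you...")
    # -> ["Document", "Historical Document", "Letter"]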
utils/image_utils.py ADDED
@@ -0,0 +1,886 @@
1
+ """
2
+ Utility functions for OCR image processing with Mistral AI.
3
+ Contains helper functions for working with OCR responses and image handling.
4
+ """
5
+
6
+ # Standard library imports
7
+ import json
8
+ import base64
9
+ import io
10
+ import zipfile
11
+ import logging
12
+ import re
13
+ import time
14
+ import math
15
+ from datetime import datetime
16
+ from pathlib import Path
17
+ from typing import Dict, List, Optional, Union, Any, Tuple
18
+ from functools import lru_cache
19
+
20
+ # Configure logging
21
+ logging.basicConfig(level=logging.INFO,
22
+ format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
23
+ logger = logging.getLogger(__name__)
24
+
25
+ # Third-party imports
26
+ import numpy as np
27
+
28
+ # Mistral AI imports
29
+ from mistralai import DocumentURLChunk, ImageURLChunk, TextChunk
30
+ from mistralai.models import OCRImageObject
31
+
32
+ # Check for image processing libraries
33
+ try:
34
+ from PIL import Image, ImageEnhance, ImageFilter, ImageOps
35
+ PILLOW_AVAILABLE = True
36
+ except ImportError:
37
+ logger.warning("PIL not available - image preprocessing will be limited")
38
+ PILLOW_AVAILABLE = False
39
+
40
+ try:
41
+ import cv2
42
+ CV2_AVAILABLE = True
43
+ except ImportError:
44
+ logger.warning("OpenCV (cv2) not available - advanced image processing will be limited")
45
+ CV2_AVAILABLE = False
46
+
47
+ # Import configuration
48
+ try:
49
+ from config import IMAGE_PREPROCESSING
50
+ except ImportError:
51
+ # Fallback defaults if config not available
52
+ IMAGE_PREPROCESSING = {
53
+ "enhance_contrast": 1.5,
54
+ "sharpen": True,
55
+ "denoise": True,
56
+ "max_size_mb": 8.0,
57
+ "target_dpi": 300,
58
+ "compression_quality": 92
59
+ }
60
+
61
+ def detect_skew(image: Union[Image.Image, np.ndarray]) -> float:
62
+ """
63
+ Quick skew detection that returns angle in degrees.
64
+ Uses a computationally efficient approach by analyzing at 1% resolution.
65
+
66
+ Args:
67
+ image: PIL Image or numpy array
68
+
69
+ Returns:
70
+ Estimated skew angle in degrees (positive or negative)
71
+ """
72
+ # Convert PIL Image to numpy array if needed
73
+ if isinstance(image, Image.Image):
74
+ # Convert to grayscale for processing
75
+ if image.mode != 'L':
76
+ img_np = np.array(image.convert('L'))
77
+ else:
78
+ img_np = np.array(image)
79
+ else:
80
+ # If already numpy array, ensure it's grayscale
81
+ if len(image.shape) == 3:
82
+ if CV2_AVAILABLE:
83
+ img_np = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
84
+ else:
85
+ # Fallback grayscale conversion
86
+ img_np = np.mean(image, axis=2).astype(np.uint8)
87
+ else:
88
+ img_np = image
89
+
90
+ # Downsample to 1% resolution for faster processing
91
+ height, width = img_np.shape
92
+ target_size = int(min(width, height) * 0.01)
93
+
94
+ # Use a sane minimum size and ensure we have enough pixels to detect lines
95
+ target_size = max(target_size, 100)
96
+
97
+ if CV2_AVAILABLE:
98
+ # OpenCV-based implementation (faster)
99
+ # Resize the image to the target size
100
+ scale_factor = target_size / max(width, height)
101
+ small_img = cv2.resize(img_np, None, fx=scale_factor, fy=scale_factor, interpolation=cv2.INTER_AREA)
102
+
103
+ # Apply binary thresholding to get cleaner edges
104
+ _, binary = cv2.threshold(small_img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
105
+
106
+ # Use Hough Line Transform to detect lines
107
+ lines = cv2.HoughLinesP(binary, 1, np.pi/180, threshold=target_size//10,
108
+ minLineLength=target_size//5, maxLineGap=target_size//10)
109
+
110
+ if lines is None or len(lines) < 3:
111
+ # Not enough lines detected, assume no significant skew
112
+ return 0.0
113
+
114
+ # Calculate angles of lines
115
+ angles = []
116
+ for line in lines:
117
+ x1, y1, x2, y2 = line[0]
118
+ if x2 - x1 == 0: # Avoid division by zero
119
+ continue
120
+ angle = math.atan2(y2 - y1, x2 - x1) * 180.0 / np.pi
121
+
122
+ # Normalize angle to -45 to 45 range
123
+ angle = angle % 180
124
+ if angle > 90:
125
+ angle -= 180
126
+ if angle > 45:
127
+ angle -= 90
128
+ if angle < -45:
129
+ angle += 90
130
+
131
+ angles.append(angle)
132
+
133
+ if not angles:
134
+ return 0.0
135
+
136
+ # Use median to reduce impact of outliers
137
+ angles.sort()
138
+ median_angle = angles[len(angles) // 2]
139
+
140
+ return median_angle
141
+ else:
142
+ # PIL-only fallback implementation
143
+ # Resize using PIL
144
+ small_img = Image.fromarray(img_np).resize(
145
+ (int(width * target_size / max(width, height)),
146
+ int(height * target_size / max(width, height))),
147
+ Image.NEAREST
148
+ )
149
+
150
+ # Find edges
151
+ edges = small_img.filter(ImageFilter.FIND_EDGES)
152
+ edges_data = np.array(edges)
153
+
154
+ # Simple edge orientation analysis (less precise than OpenCV)
155
+ # Count horizontal vs vertical edges
156
+ h_edges = np.sum(np.abs(np.diff(edges_data, axis=1)))
157
+ v_edges = np.sum(np.abs(np.diff(edges_data, axis=0)))
158
+
159
+ # If horizontal edges dominate, no significant skew
160
+ if h_edges > v_edges * 1.2:
161
+ return 0.0
162
+
163
+ # Simple angle estimation based on edge distribution
164
+ # This is a simplified approach that works for slight skews
165
+ rows, cols = edges_data.shape
166
+ xs, ys = [], []
167
+
168
+ # Sample strong edge points
169
+ for r in range(0, rows, 2):
170
+ for c in range(0, cols, 2):
171
+ if edges_data[r, c] > 128:
172
+ xs.append(c)
173
+ ys.append(r)
174
+
175
+ if len(xs) < 10: # Not enough edge points
176
+ return 0.0
177
+
178
+ # Use simple linear regression to estimate the slope
179
+ n = len(xs)
180
+ mean_x = sum(xs) / n
181
+ mean_y = sum(ys) / n
182
+
183
+ # Calculate slope
184
+ numerator = sum((xs[i] - mean_x) * (ys[i] - mean_y) for i in range(n))
185
+ denominator = sum((xs[i] - mean_x) ** 2 for i in range(n))
186
+
187
+ if abs(denominator) < 1e-6: # Avoid division by zero
188
+ return 0.0
189
+
190
+ slope = numerator / denominator
191
+ angle = math.atan(slope) * 180.0 / math.pi
192
+
193
+ # Normalize to -45 to 45 degrees
194
+ if angle > 45:
195
+ angle -= 90
196
+ elif angle < -45:
197
+ angle += 90
198
+
199
+ return angle
200
+
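A deskewing sketch built on `detect_skew`; the threshold and file path are assumptions, and rotating by the negated angle is one plausible way to counteract the detected skew:

    from PIL import Image
    img = Image.open("samples/page.png")           # hypothetical scan
    angle = detect_skew(img)
    if abs(angle) > 0.5:                           # ignore negligible skew
        img = img.rotate(-angle, expand=True, fillcolor="white")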
201
+ def replace_images_in_markdown(md: str, images: dict[str, str]) -> str:
202
+ """
203
+ Replace image placeholders in markdown with base64-encoded images.
204
+ Uses regex-based matching to handle variations in image IDs and formats.
205
+
206
+ Args:
207
+ md: Markdown text containing image placeholders
208
+ images: Dictionary mapping image IDs to base64 strings
209
+
210
+ Returns:
211
+ Markdown text with images replaced by base64 data
212
+ """
213
+ # Process each image ID in the dictionary
214
+ for img_id, base64_str in images.items():
215
+ # Extract the base ID without extension for more flexible matching
216
+ base_id = img_id.split('.')[0]
217
+
218
+ # Match markdown image pattern where URL contains the base ID
219
+ # Using a single regex with groups to capture the full pattern
220
+ pattern = re.compile(rf'!\[([^\]]*)\]\(([^\)]*{base_id}[^\)]*)\)')
221
+
222
+ # Process all matches
223
+ matches = list(pattern.finditer(md))
224
+ for match in reversed(matches): # Process in reverse to avoid offset issues
225
+ # Replace the entire match with a properly formatted base64 image
226
+ md = md[:match.start()] + f"![{img_id}](data:image/jpeg;base64,{base64_str})" + md[match.end():]
227
+
228
+ return md
229
+
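A small hand-traced example: the base ID "img-0" is matched inside the URL, so the placeholder is swapped for a data URL:

    md = "Before ![fig](img-0.jpeg) after"
    replace_images_in_markdown(md, {"img-0.jpeg": "AAAA"})
    # -> "Before ![img-0.jpeg](data:image/jpeg;base64,AAAA) after"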
230
+ def get_combined_markdown(ocr_response) -> str:
231
+ """
232
+ Combine OCR text and images into a single markdown document.
233
+
234
+ Args:
235
+ ocr_response: OCR response object from Mistral AI
236
+
237
+ Returns:
238
+ Combined markdown string with embedded images
239
+ """
240
+ markdowns = []
241
+
242
+ # Process each page of the OCR response
243
+ for page in ocr_response.pages:
244
+ # Extract image data if available
245
+ image_data = {}
246
+ if hasattr(page, "images"):
247
+ for img in page.images:
248
+ if hasattr(img, "id") and hasattr(img, "image_base64"):
249
+ image_data[img.id] = img.image_base64
250
+
251
+ # Replace image placeholders with base64 data
252
+ page_markdown = page.markdown if hasattr(page, "markdown") else ""
253
+ processed_markdown = replace_images_in_markdown(page_markdown, image_data)
254
+ markdowns.append(processed_markdown)
255
+
256
+ # Join all pages' markdown with double newlines
257
+ return "\n\n".join(markdowns)
258
+
259
+ def encode_image_for_api(image_path: Union[str, Path]) -> str:
260
+ """
261
+ Encode an image as base64 data URL for API submission.
262
+
263
+ Args:
264
+ image_path: Path to the image file
265
+
266
+ Returns:
267
+ Base64 data URL for the image
268
+ """
269
+ # Convert to Path object if string
270
+ image_file = Path(image_path) if isinstance(image_path, str) else image_path
271
+
272
+ # Verify image exists
273
+ if not image_file.is_file():
274
+ raise FileNotFoundError(f"Image file not found: {image_file}")
275
+
276
+ # Determine mime type based on file extension
277
+ mime_type = 'image/jpeg' # Default mime type
278
+ suffix = image_file.suffix.lower()
279
+ if suffix == '.png':
280
+ mime_type = 'image/png'
281
+ elif suffix == '.gif':
282
+ mime_type = 'image/gif'
283
+ elif suffix in ['.jpg', '.jpeg']:
284
+ mime_type = 'image/jpeg'
285
+ elif suffix == '.pdf':
286
+ mime_type = 'application/pdf'
287
+
288
+ # Encode image as base64
289
+ encoded = base64.b64encode(image_file.read_bytes()).decode()
290
+ return f"data:{mime_type};base64,{encoded}"
291
+
292
+ def encode_bytes_for_api(file_bytes: bytes, mime_type: str) -> str:
293
+ """
294
+ Encode binary data as base64 data URL for API submission.
295
+
296
+ Args:
297
+ file_bytes: Binary file data
298
+ mime_type: MIME type of the file (e.g., 'image/jpeg', 'application/pdf')
299
+
300
+ Returns:
301
+ Base64 data URL for the data
302
+ """
303
+ # Encode data as base64
304
+ encoded = base64.b64encode(file_bytes).decode()
305
+ return f"data:{mime_type};base64,{encoded}"
306
+
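Usage sketch for PDF submission (the file name is hypothetical):

    with open("page.pdf", "rb") as f:
        url = encode_bytes_for_api(f.read(), "application/pdf")
    # -> "data:application/pdf;base64,JVBERi0x..."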
307
+ def calculate_image_entropy(pil_img: Image.Image) -> float:
308
+ """
309
+ Calculate the entropy of a PIL image.
310
+ Entropy is a measure of randomness; low entropy indicates a blank or simple image,
311
+ high entropy indicates more complex content (e.g., text or detailed images).
312
+
313
+ Args:
314
+ pil_img: PIL Image object
315
+
316
+ Returns:
317
+ float: Entropy value
318
+ """
319
+ # Convert to grayscale for entropy calculation
320
+ gray_img = pil_img.convert("L")
321
+ arr = np.array(gray_img)
322
+ # Compute histogram
323
+ hist, _ = np.histogram(arr, bins=256, range=(0, 255), density=True)
324
+ # Remove zero entries to avoid log(0)
325
+ hist = hist[hist > 0]
326
+ # Calculate entropy
327
+ entropy = -np.sum(hist * np.log2(hist))
328
+ return float(entropy)
329
+
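One way this could be used to skip blank scans before an OCR call; the 3.5 cutoff is an illustrative guess, not a tuned value:

    from PIL import Image
    page = Image.open("samples/page_07.png")       # hypothetical scan
    if calculate_image_entropy(page) < 3.5:
        print("Likely blank page - skipping OCR")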
330
+ def serialize_ocr_object(obj):
331
+ """
332
+ Serialize OCR response objects to JSON serializable format.
333
+ Handles OCRImageObject specifically to prevent serialization errors.
334
+
335
+ Args:
336
+ obj: The object to serialize
337
+
338
+ Returns:
339
+ JSON serializable representation of the object
340
+ """
341
+ # Fast path: Handle primitive types directly
342
+ if obj is None or isinstance(obj, (str, int, float, bool)):
343
+ return obj
344
+
345
+ # Handle collections
346
+ if isinstance(obj, list):
347
+ return [serialize_ocr_object(item) for item in obj]
348
+ elif isinstance(obj, dict):
349
+ return {k: serialize_ocr_object(v) for k, v in obj.items()}
350
+ elif isinstance(obj, OCRImageObject):
351
+ # Special handling for OCRImageObject
352
+ return {
353
+ 'id': obj.id if hasattr(obj, 'id') else None,
354
+ 'image_base64': obj.image_base64 if hasattr(obj, 'image_base64') else None
355
+ }
356
+ elif hasattr(obj, '__dict__'):
357
+ # For objects with __dict__ attribute
358
+ return {k: serialize_ocr_object(v) for k, v in obj.__dict__.items()
359
+ if not k.startswith('_')} # Skip private attributes
360
+ else:
361
+ # Try to convert to string as last resort
362
+ try:
363
+ return str(obj)
364
+ except Exception:
365
+ return None
366
+
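Typical use: make a raw Mistral OCR response JSON-safe before caching or download:

    serializable = serialize_ocr_object(ocr_response)  # OCRImageObjects become dicts
    json_str = json.dumps(serializable, indent=2)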
367
+ def format_ocr_text(text):
368
+ """
369
+ Format OCR text with simple, predictable rules that ensure consistency.
370
+ This formats ALL CAPS lines as bold markdown and preserves the rest.
371
+
372
+ Args:
373
+ text: Text content to format
374
+
375
+ Returns:
376
+ Formatted text with consistent styling
377
+ """
378
+ if not isinstance(text, str):
379
+ return text
380
+
381
+ lines = text.split('\n')
382
+ processed_lines = []
383
+ for line in lines:
384
+ line_stripped = line.strip()
385
+ if line_stripped and line_stripped.isupper() and len(line_stripped) > 3:
386
+ processed_lines.append(f"**{line_stripped}**")
387
+ else:
388
+ processed_lines.append(line)
389
+
390
+ return '\n'.join(processed_lines)
391
+
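A one-line example of the ALL CAPS rule:

    format_ocr_text("CHAPTER ONE\nIt was a dark night.")
    # -> "**CHAPTER ONE**\nIt was a dark night."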
392
+ def create_results_zip(results, output_dir=None, zip_name=None):
393
+ """
394
+ Create a zip file containing OCR results.
395
+
396
+ Args:
397
+ results: Dictionary or list of OCR results
398
+ output_dir: Optional output directory
399
+ zip_name: Optional zip file name
400
+
401
+ Returns:
402
+ Path to the created zip file
403
+ """
404
+ # Create temporary output directory if not provided
405
+ if output_dir is None:
406
+ output_dir = Path.cwd() / "output"
407
+ output_dir.mkdir(exist_ok=True)
408
+ else:
409
+ output_dir = Path(output_dir)
410
+ output_dir.mkdir(exist_ok=True)
411
+
412
+ # Generate zip name if not provided
413
+ if zip_name is None:
414
+ timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
415
+
416
+ if isinstance(results, list):
417
+ # For a list of results, create a descriptive name
418
+ file_count = len(results)
419
+ zip_name = f"ocr_results_{file_count}_{timestamp}.zip"
420
+ else:
421
+ # For single result, create descriptive filename
422
+ base_name = results.get('file_name', 'document').split('.')[0]
423
+ zip_name = f"{base_name}_{timestamp}.zip"
424
+
425
+ try:
426
+ # Get zip data in memory first
427
+ zip_data = create_results_zip_in_memory(results)
428
+
429
+ # Save to file
430
+ zip_path = output_dir / zip_name
431
+ with open(zip_path, 'wb') as f:
432
+ f.write(zip_data)
433
+
434
+ return zip_path
435
+ except Exception as e:
436
+ # Create an empty zip file as fallback
437
+ logger.error(f"Error creating zip file: {str(e)}")
438
+ zip_path = output_dir / zip_name
439
+ with zipfile.ZipFile(zip_path, 'w') as zipf:
440
+ zipf.writestr("info.txt", "Could not create complete archive")
441
+
442
+ return zip_path
443
+
+ def create_results_zip_in_memory(results):
+     """
+     Create a zip file containing OCR results in memory.
+
+     Args:
+         results: Dictionary or list of OCR results
+
+     Returns:
+         Binary zip file data
+     """
+     # Create a BytesIO object
+     zip_buffer = io.BytesIO()
+
+     # Check if results is a list or a dictionary
+     is_list = isinstance(results, list)
+
+     # Create the zip file in memory
+     with zipfile.ZipFile(zip_buffer, 'w', zipfile.ZIP_DEFLATED) as zipf:
+         if is_list:
+             # Handle a list of results
+             for i, result in enumerate(results):
+                 try:
+                     # Create a descriptive base filename for this result
+                     base_filename = result.get('file_name', f'document_{i+1}').split('.')[0]
+
+                     # Add the document type if available
+                     if 'topics' in result and result['topics']:
+                         topic = result['topics'][0].lower().replace(' ', '_')
+                         base_filename = f"{base_filename}_{topic}"
+
+                     # Add the language if available
+                     if 'languages' in result and result['languages']:
+                         lang = result['languages'][0].lower()
+                         # Only add if it's not already in the filename
+                         if lang not in base_filename.lower():
+                             base_filename = f"{base_filename}_{lang}"
+
+                     # For PDFs, add page information
+                     if 'limited_pages' in result:
+                         base_filename = f"{base_filename}_p{result['limited_pages']['processed']}of{result['limited_pages']['total']}"
+
+                     # Add a timestamp if available
+                     if 'timestamp' in result:
+                         try:
+                             # Try to parse the timestamp and reformat it
+                             dt = datetime.strptime(result['timestamp'], "%Y-%m-%d %H:%M")
+                             timestamp = dt.strftime("%Y%m%d_%H%M%S")
+                             base_filename = f"{base_filename}_{timestamp}"
+                         except Exception:
+                             pass
+
+                     # Add JSON results for each file with a descriptive name
+                     result_json = json.dumps(result, indent=2)
+                     zipf.writestr(f"{base_filename}.json", result_json)
+
+                     # Add HTML content (generated from the result)
+                     html_content = create_html_with_images(result)
+                     zipf.writestr(f"{base_filename}.html", html_content)
+
+                     # Add raw OCR text if available
+                     if "ocr_contents" in result and "raw_text" in result["ocr_contents"]:
+                         zipf.writestr(f"{base_filename}.txt", result["ocr_contents"]["raw_text"])
+
+                 except Exception as e:
+                     # If any result fails, skip it and continue
+                     logger.warning(f"Failed to process result for zip: {str(e)}")
+                     continue
+         else:
+             # Handle a single result
+             try:
+                 # Create a descriptive base filename for this result
+                 base_filename = results.get('file_name', 'document').split('.')[0]
+
+                 # Add the document type if available
+                 if 'topics' in results and results['topics']:
+                     topic = results['topics'][0].lower().replace(' ', '_')
+                     base_filename = f"{base_filename}_{topic}"
+
+                 # Add the language if available
+                 if 'languages' in results and results['languages']:
+                     lang = results['languages'][0].lower()
+                     # Only add if it's not already in the filename
+                     if lang not in base_filename.lower():
+                         base_filename = f"{base_filename}_{lang}"
+
+                 # For PDFs, add page information
+                 if 'limited_pages' in results:
+                     base_filename = f"{base_filename}_p{results['limited_pages']['processed']}of{results['limited_pages']['total']}"
+
+                 # Add a timestamp if available
+                 if 'timestamp' in results:
+                     try:
+                         # Try to parse the timestamp and reformat it
+                         dt = datetime.strptime(results['timestamp'], "%Y-%m-%d %H:%M")
+                         timestamp = dt.strftime("%Y%m%d_%H%M%S")
+                         base_filename = f"{base_filename}_{timestamp}"
+                     except Exception:
+                         # If parsing fails, create a new timestamp
+                         timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+                         base_filename = f"{base_filename}_{timestamp}"
+                 else:
+                     # No timestamp in the result, create a new one
+                     timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+                     base_filename = f"{base_filename}_{timestamp}"
+
+                 # Add JSON results with a descriptive name
+                 results_json = json.dumps(results, indent=2)
+                 zipf.writestr(f"{base_filename}.json", results_json)
+
+                 # Add HTML content with a descriptive name
+                 html_content = create_html_with_images(results)
+                 zipf.writestr(f"{base_filename}.html", html_content)
+
+                 # Add raw OCR text if available
+                 if "ocr_contents" in results and "raw_text" in results["ocr_contents"]:
+                     zipf.writestr(f"{base_filename}.txt", results["ocr_contents"]["raw_text"])
+
+             except Exception as e:
+                 # If processing fails, log the error and fall through
+                 logger.error(f"Failed to create zip file: {str(e)}")
+
+     # Seek to the beginning of the BytesIO object
+     zip_buffer.seek(0)
+
+     # Return the zip file bytes
+     return zip_buffer.getvalue()
+
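The in-memory variant pairs naturally with Streamlit's download widget, so no temporary file is needed; a sketch, assuming `result` is an OCR result dict as above:

    zip_bytes = create_results_zip_in_memory(result)
    st.download_button(
        label='Download ZIP',
        data=zip_bytes,
        file_name='ocr_results.zip',
        mime='application/zip',
    )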
+ def create_html_with_images(result):
+     """
+     Create a clean HTML document from OCR results that properly preserves page references
+     and text structure, without any document-specific special cases.
+
+     Args:
+         result: OCR result dictionary
+
+     Returns:
+         HTML content as a string
+     """
+     # Import content utils to use the classification functions
+     try:
+         from utils.content_utils import classify_document_content, extract_document_text, extract_image_description
+         content_utils_available = True
+     except ImportError:
+         content_utils_available = False
+
+     # Get the content classification
+     has_text = True
+     has_images = False
+     # Page-reference highlighting is currently disabled in both branches below
+     has_page_refs = False
+
+     if content_utils_available:
+         classification = classify_document_content(result)
+         has_text = classification['has_content']
+         has_images = result.get('has_images', False)
+     else:
+         # Minimal fallback detection
+         if 'has_images' in result:
+             has_images = result['has_images']
+
+         # Check for image data more thoroughly
+         if 'pages_data' in result and isinstance(result['pages_data'], list):
+             for page in result['pages_data']:
+                 if isinstance(page, dict) and 'images' in page and page['images']:
+                     has_images = True
+                     break
+
+     # Start building the HTML document
+     html = [
+         '<!DOCTYPE html>',
+         '<html lang="en">',
+         '<head>',
+         '    <meta charset="UTF-8">',
+         '    <meta name="viewport" content="width=device-width, initial-scale=1.0">',
+         f'    <title>{result.get("file_name", "Document")}</title>',
+         '    <style>',
+         '        body {',
+         '            font-family: Georgia, serif;',
+         '            line-height: 1.6;',
+         '            color: #333;',
+         '            max-width: 800px;',
+         '            margin: 0 auto;',
+         '            padding: 20px;',
+         '        }',
+         '        h1, h2, h3, h4 {',
+         '            color: #222;',
+         '            margin-top: 1.5em;',
+         '            margin-bottom: 0.5em;',
+         '        }',
+         '        h1 { font-size: 24px; }',
+         '        h2 { font-size: 22px; }',
+         '        h3 { font-size: 20px; }',
+         '        h4 { font-size: 18px; }',
+         '        p { margin: 1em 0; }',
+         '        .metadata {',
+         '            background-color: #f8f9fa;',
+         '            border: 1px solid #eaecef;',
+         '            border-radius: 6px;',
+         '            padding: 15px;',
+         '            margin-bottom: 20px;',
+         '        }',
+         '        .metadata p { margin: 5px 0; }',
+         '        img {',
+         '            max-width: 100%;',
+         '            height: auto;',
+         '            display: block;',
+         '            margin: 20px auto;',
+         '            border: 1px solid #ddd;',
+         '            border-radius: 4px;',
+         '        }',
+         '        .image-container {',
+         '            margin: 20px 0;',
+         '            text-align: center;',
+         '        }',
+         '        .image-caption {',
+         '            font-size: 0.9em;',
+         '            text-align: center;',
+         '            color: #666;',
+         '            margin-top: 5px;',
+         '        }',
+         '        .text-block {',
+         '            margin: 10px 0;',
+         '        }',
+         '        .page-ref {',
+         '            font-weight: bold;',
+         '            color: #555;',
+         '        }',
+         '        .separator {',
+         '            border-top: 1px solid #eaecef;',
+         '            margin: 30px 0;',
+         '        }',
+         '    </style>',
+         '</head>',
+         '<body>'
+     ]
+
+     # Add document metadata
+     html.append('<div class="metadata">')
+     html.append(f'<h1>{result.get("file_name", "Document")}</h1>')
+
+     # Add the timestamp
+     if 'timestamp' in result:
+         html.append(f'<p><strong>Processed:</strong> {result["timestamp"]}</p>')
+
+     # Add languages if available
+     if 'languages' in result and result['languages']:
+         languages = [lang for lang in result['languages'] if lang]
+         if languages:
+             html.append(f'<p><strong>Languages:</strong> {", ".join(languages)}</p>')
+
+     # Add the document type and topics
+     if 'detected_document_type' in result:
+         html.append(f'<p><strong>Document Type:</strong> {result["detected_document_type"]}</p>')
+
+     if 'topics' in result and result['topics']:
+         html.append(f'<p><strong>Topics:</strong> {", ".join(result["topics"])}</p>')
+
+     html.append('</div>')  # Close metadata div
+
+     # Document title - extract from the result if available
+     if 'ocr_contents' in result and 'title' in result['ocr_contents'] and result['ocr_contents']['title']:
+         title_content = result['ocr_contents']['title']
+         # No special handling for any specific document types
+         html.append(f'<h2>{title_content}</h2>')
+
+     # Add images if present
+     if has_images and 'pages_data' in result:
+         html.append('<h3>Images</h3>')
+
+         # Extract and display all images
+         for page_idx, page in enumerate(result['pages_data']):
+             if 'images' in page and isinstance(page['images'], list):
+                 for img_idx, img in enumerate(page['images']):
+                     if 'image_base64' in img and img['image_base64']:
+                         # Image container
+                         html.append('<div class="image-container">')
+                         html.append(f'<img src="{img["image_base64"]}" alt="Image {page_idx+1}-{img_idx+1}">')
+
+                         # Generic caption based on index
+                         html.append(f'<div class="image-caption">img-{img_idx}.jpeg</div>')
+                         html.append('</div>')
+
+         # Add an image description if available through utils
+         if content_utils_available:
+             description = extract_image_description(result)
+             if description:
+                 html.append('<div class="text-block">')
+                 html.append(f'<p>{description}</p>')
+                 html.append('</div>')
+
+         html.append('<hr class="separator">')
+
+     # Add the document text section
+     html.append('<h3>Text</h3>')
+
+     # Extract text content systematically
+     text_content = ""
+
+     if content_utils_available:
+         # Use the systematic utility function
+         text_content = extract_document_text(result)
+     else:
+         # Fallback extraction logic
+         if 'ocr_contents' in result:
+             for field in ["main_text", "content", "text", "transcript", "raw_text"]:
+                 if field in result['ocr_contents'] and result['ocr_contents'][field]:
+                     content = result['ocr_contents'][field]
+                     if isinstance(content, str) and content.strip():
+                         text_content = content
+                         break
+                     elif isinstance(content, dict):
+                         # Try to convert complex objects to a string
+                         try:
+                             text_content = json.dumps(content, indent=2)
+                             break
+                         except Exception:
+                             pass
+
+     # Process the text content for HTML display
+     if text_content:
+         # Clean the text but preserve page references
+         text_content = text_content.replace('\r\n', '\n')
+
+         # Preserve page references by wrapping them in HTML tags
+         if has_page_refs:
+             # Highlight common page reference patterns
+             page_patterns = [
+                 (r'(page\s+\d+)', r'<span class="page-ref">\1</span>'),
+                 (r'(p\.\s*\d+)', r'<span class="page-ref">\1</span>'),
+                 (r'(p\s+\d+)', r'<span class="page-ref">\1</span>'),
+                 (r'(\[\s*\d+\s*\])', r'<span class="page-ref">\1</span>'),
+                 (r'(\(\s*\d+\s*\))', r'<span class="page-ref">\1</span>'),
+                 (r'(folio\s+\d+)', r'<span class="page-ref">\1</span>'),
+                 (r'(f\.\s*\d+)', r'<span class="page-ref">\1</span>'),
+                 (r'(pg\.\s*\d+)', r'<span class="page-ref">\1</span>')
+             ]
+
+             for pattern, replacement in page_patterns:
+                 text_content = re.sub(pattern, replacement, text_content, flags=re.IGNORECASE)
+
+         # Convert newlines to paragraphs
+         paragraphs = text_content.split('\n\n')
+         paragraphs = [p for p in paragraphs if p.strip()]
+
+         html.append('<div class="text-block">')
+         for paragraph in paragraphs:
+             # Check if the paragraph contains multiple lines
+             if '\n' in paragraph:
+                 lines = paragraph.split('\n')
+                 lines = [line for line in lines if line.strip()]
+
+                 # Convert each line to a paragraph
+                 for line in lines:
+                     html.append(f'<p>{line}</p>')
+             else:
+                 html.append(f'<p>{paragraph}</p>')
+         html.append('</div>')
+     else:
+         html.append('<p>No text content available.</p>')
+
+     # Close the HTML document
+     html.append('</body>')
+     html.append('</html>')
+
+     return '\n'.join(html)
+
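A sketch of previewing the generated HTML inside the app (streamlit.components.v1 renders raw HTML in an iframe):

    import streamlit.components.v1 as components

    html = create_html_with_images(result)
    components.html(html, height=600, scrolling=True)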
+ def clean_ocr_result(result: dict,
+                      use_segmentation: bool = False,
+                      vision_enabled: bool = True) -> dict:
+     """
+     1. Replace or strip markdown image refs (![id](id))
+     2. Collapse pages that are *only* an illustration into a single
+        `illustrations` bucket when vision is off
+     3. Normalise `ocr_contents` keys to always have at least `raw_text`
+     """
+     if 'pages_data' in result:
+         # Build a dict {id: base64} for quick look-ups
+         image_dict = {
+             img['id']: img['image_base64']
+             for page in result['pages_data']
+             for img in page.get('images', [])
+         }
+
+         # --- 1 · replace or drop image placeholders ---
+         def _scrub(markdown: str) -> str:
+             if vision_enabled and image_dict:
+                 return replace_images_in_markdown(markdown, image_dict)
+             # no vision / no images → drop the line
+             return re.sub(r'!\[[^\]]*\]\(img-\d+\.\w+\)', '', markdown)
+
+         for page in result['pages_data']:
+             page['markdown'] = _scrub(page.get('markdown', ''))
+
+     # --- 2 · group illustration-only pages when vision is off ---
+     if not vision_enabled and 'pages_data' in result:
+         text_pages, art_pages = [], []
+         for p in result['pages_data']:
+             has_text = p.get('markdown', '').strip()
+             (text_pages if has_text else art_pages).append(p)
+         result['pages_data'] = text_pages
+         if art_pages:
+             # keep one thumbnail under metadata
+             result.setdefault('illustrations', []).extend(art_pages)
+
+     # --- 3 · ensure raw_text key ---
+     if 'ocr_contents' in result and 'raw_text' not in result['ocr_contents']:
+         # First, try to extract any embedded text from image references
+         raw_text_parts = []
+
+         for page in result.get('pages_data', []):
+             markdown = page.get('markdown', '')
+             # Check if the markdown contains image references
+             img_refs = re.findall(r'!\[([^\]]*)\]\(([^\)]*)\)', markdown)
+
+             # Process each image reference to extract text content
+             if img_refs:
+                 for alt_text, img_url in img_refs:
+                     # If the alt text contains actual text content (not just an image ID), add it
+                     if alt_text and not alt_text.endswith(('.jpeg', '.jpg', '.png')):
+                         # Clean up the alt text and add it as text content
+                         alt_text = alt_text.strip()
+                         if alt_text and len(alt_text) > 3:  # Only add if meaningful
+                             raw_text_parts.append(alt_text)
+
+             # Remove image references from the markdown
+             cleaned_markdown = re.sub(r'!\[[^\]]*\]\([^\)]*\)', '', markdown)
+
+             # Add any remaining text content
+             if cleaned_markdown.strip():
+                 raw_text_parts.append(cleaned_markdown.strip())
+
+         # Join all extracted text content
+         if raw_text_parts:
+             result['ocr_contents']['raw_text'] = "\n\n".join(raw_text_parts)
+         else:
+             # Fallback: use the original method if no text was extracted
+             joined = "\n".join(p.get('markdown', '') for p in result.get('pages_data', []))
+             # Final cleanup of image references
+             joined = re.sub(r'!\[[^\]]*\]\([^\)]*\)', '', joined)
+             result['ocr_contents']['raw_text'] = joined
+
+     return result
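Taken together, a plausible post-processing pipeline (the `result` dict is whatever the OCR step produced):

    cleaned = clean_ocr_result(result, vision_enabled=True)   # scrub image refs, ensure raw_text
    zip_bytes = create_results_zip_in_memory(cleaned)         # package everything for download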
utils/text_utils.py ADDED
@@ -0,0 +1,151 @@
+ """Text utility functions for OCR processing"""
+
+ import re
+
+ def clean_raw_text(text):
+     """Clean raw text by removing image references and serialized data.
+
+     Args:
+         text (str): The text to clean
+
+     Returns:
+         str: The cleaned text
+     """
+     if not text or not isinstance(text, str):
+         return ""
+
+     # Remove image references like ![image](data:image/...)
+     text = re.sub(r'!\[.*?\]\(data:image/[^)]+\)', '', text)
+
+     # Remove basic markdown image references like ![alt](img-1.jpg)
+     text = re.sub(r'!\[[^\]]*\]\([^)]+\)', '', text)
+
+     # Remove base64 encoded image data
+     text = re.sub(r'data:image/[^;]+;base64,[a-zA-Z0-9+/=]+', '', text)
+
+     # Remove image object references like [[OCRImageObject:...]]
+     text = re.sub(r'\[\[OCRImageObject:[^\]]+\]\]', '', text)
+
+     # Clean up any JSON-like image object references
+     text = re.sub(r'{"image(_data)?":("[^"]*"|null|true|false|\{[^}]*\}|\[[^\]]*\])}', '', text)
+
+     # Clean up excessive whitespace and line breaks created by the removals
+     text = re.sub(r'\n{3,}', '\n\n', text)
+     text = re.sub(r'\s{3,}', ' ', text)
+
+     return text.strip()
+
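Illustrative input and output, assuming the regex cleanup above:

    raw = 'Dear Sir,\n\n![img-0.jpeg](img-0.jpeg)\n\nYours faithfully'
    print(clean_raw_text(raw))   # -> 'Dear Sir,\n\nYours faithfully'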
+ def format_markdown_text(text):
+     """Format text with markdown and handle special patterns
+
+     Args:
+         text (str): The text to format
+
+     Returns:
+         str: The formatted markdown text
+     """
+     if not text:
+         return ""
+
+     # First, ensure we're working with a string
+     if not isinstance(text, str):
+         text = str(text)
+
+     # Ensure newlines are preserved for proper spacing
+     # Convert any Windows line endings to Unix
+     text = text.replace('\r\n', '\n')
+
+     # Format dates (MM/DD/YYYY or similar patterns)
+     date_pattern = r'\b(0?[1-9]|1[0-2])[\/\-\.](0?[1-9]|[12][0-9]|3[01])[\/\-\.](\d{4}|\d{2})\b'
+     text = re.sub(date_pattern, r'**\g<0>**', text)
+
+     # Detect markdown tables and preserve them
+     table_sections = []
+     non_table_lines = []
+     in_table = False
+     table_buffer = []
+
+     # Process the text line by line, preserving tables
+     lines = text.split('\n')
+     for i, line in enumerate(lines):
+         line_stripped = line.strip()
+
+         # Detect table rows by the pipe character
+         if '|' in line_stripped and (line_stripped.startswith('|') or line_stripped.endswith('|')):
+             if not in_table:
+                 in_table = True
+                 table_buffer = []
+             table_buffer.append(line)
+
+         # Detect table separators (---|---|---)
+         elif in_table and '---' in line_stripped and '|' in line_stripped:
+             table_buffer.append(line)
+
+         # End of table detection
+         elif in_table:
+             # Check if this is still part of the table
+             next_line_is_table = False
+             if i < len(lines) - 1:
+                 next_line = lines[i+1].strip()
+                 if '|' in next_line and (next_line.startswith('|') or next_line.endswith('|')):
+                     next_line_is_table = True
+
+             if not next_line_is_table:
+                 in_table = False
+                 # Save the complete table
+                 if table_buffer:
+                     table_sections.append('\n'.join(table_buffer))
+                     table_buffer = []
+                 # Add the current line to the non-table lines
+                 non_table_lines.append(line)
+             else:
+                 # Still part of the table
+                 table_buffer.append(line)
+         else:
+             # Not in a table
+             non_table_lines.append(line)
+
+     # Handle any remaining table buffer
+     if in_table and table_buffer:
+         table_sections.append('\n'.join(table_buffer))
+
+     # Process the non-table lines
+     processed_lines = []
+     for line in non_table_lines:
+         line_stripped = line.strip()
+
+         # Check if the line is in ALL CAPS (and not just a short acronym)
+         if line_stripped and line_stripped.isupper() and len(line_stripped) > 3:
+             # ALL CAPS line - make bold instead of a heading to prevent large display
+             processed_lines.append(f"**{line_stripped}**")
+         # Process potential headers (lines ending with a colon)
+         elif line_stripped and line_stripped.endswith(':') and len(line_stripped) < 40:
+             # Likely a header - make it bold
+             processed_lines.append(f"**{line_stripped}**")
+         else:
+             # Keep the original line with its spacing
+             processed_lines.append(line)
+
+     # Join the non-table lines
+     processed_text = '\n'.join(processed_lines)
+
+     # Normalise paragraph spacing on the non-table text first, so the
+     # substitutions below cannot split table rows apart
+     processed_text = re.sub(r'\n{3,}', '\n\n', processed_text)
+
+     # Ensure two newlines between paragraphs for proper markdown rendering
+     processed_text = re.sub(r'([^\n])\n([^\n])', r'\1\n\n\2', processed_text)
+
+     # Reinsert the preserved tables; for now they are appended at the end
+     for table in table_sections:
+         processed_text += f"\n\n{table}\n\n"
+
+     return processed_text
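Illustrative behavior on a fabricated snippet: ALL-CAPS lines become bold, detected dates are bolded, and single newlines are doubled for markdown rendering:

    sample = 'WAR DEPARTMENT\nDate: 03/15/1862\nOrders follow.'
    print(format_markdown_text(sample))
    # **WAR DEPARTMENT**
    #
    # Date: **03/15/1862**
    #
    # Orders follow.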
utils/ui_utils.py ADDED
@@ -0,0 +1,413 @@
+ """
+ UI utilities for OCR results display.
+ """
+ import json
+
+ import streamlit as st
+
+ from utils.image_utils import format_ocr_text, create_html_with_images
+ from utils.content_utils import format_structured_data
+
+ def display_results(result, container, custom_prompt=""):
+     """Display OCR results in the provided container"""
+     with container:
+         # Add a heading for document metadata
+         st.markdown("### Document Metadata")
+
+         # Filter out large data structures from the metadata display
+         meta = {k: v for k, v in result.items()
+                 if k not in ['pages_data', 'illustrations', 'ocr_contents', 'raw_response_data']}
+
+         # Create a compact metadata section
+         meta_html = '<div style="display: flex; flex-wrap: wrap; gap: 0.3rem; margin-bottom: 0.3rem;">'
+
+         # Document type
+         if 'detected_document_type' in meta:
+             meta_html += f'<div><strong>Type:</strong> {meta["detected_document_type"]}</div>'
+
+         # Processing time
+         if 'processing_time' in meta:
+             meta_html += f'<div><strong>Time:</strong> {meta["processing_time"]:.1f}s</div>'
+
+         # Page information
+         if 'limited_pages' in meta:
+             meta_html += f'<div><strong>Pages:</strong> {meta["limited_pages"]["processed"]}/{meta["limited_pages"]["total"]}</div>'
+
+         meta_html += '</div>'
+         st.markdown(meta_html, unsafe_allow_html=True)
+
+         # Language metadata on a separate line, Subject Tags below
+
+         # First show languages if available
+         if 'languages' in result and result['languages']:
+             languages = [lang for lang in result['languages'] if lang is not None]
+             if languages:
+                 # Create a dedicated line for Languages
+                 lang_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
+                 lang_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Language:</div>'
+
+                 # Add language tags
+                 for lang in languages:
+                     # Clean the language name if needed
+                     clean_lang = str(lang).strip()
+                     if clean_lang:  # Only add if not empty
+                         lang_html += f'<span class="subject-tag tag-language">{clean_lang}</span>'
+
+                 lang_html += '</div>'
+                 st.markdown(lang_html, unsafe_allow_html=True)
+
+         # Create a separate line for Time if we have time-related tags
+         if 'topics' in result and result['topics']:
+             time_tags = [topic for topic in result['topics']
+                          if any(term in topic.lower() for term in ["century", "pre-", "era", "historical"])]
+             if time_tags:
+                 time_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
+                 time_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Time:</div>'
+                 for tag in time_tags:
+                     time_html += f'<span class="subject-tag tag-time-period">{tag}</span>'
+                 time_html += '</div>'
+                 st.markdown(time_html, unsafe_allow_html=True)
+
+         # Then display the remaining subject tags if available
+         if 'topics' in result and result['topics']:
+             # Filter out time-related tags which are already displayed
+             subject_tags = [topic for topic in result['topics']
+                             if not any(term in topic.lower() for term in ["century", "pre-", "era", "historical"])]
+
+             if subject_tags:
+                 # Create a separate line for Subject Tags
+                 tags_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
+                 tags_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Subject Tags:</div>'
+                 tags_html += '<div style="display: flex; flex-wrap: wrap; gap: 2px; align-items: center;">'
+
+                 # Generate a badge for each remaining tag
+                 for topic in subject_tags:
+                     # Determine the tag category class
+                     tag_class = "subject-tag"  # Default class
+
+                     # Add a specialized class based on category
+                     if any(term in topic.lower() for term in ["language", "english", "french", "german", "latin"]):
+                         tag_class += " tag-language"  # Languages
+                     elif any(term in topic.lower() for term in ["letter", "newspaper", "book", "form", "document", "recipe"]):
+                         tag_class += " tag-document-type"  # Document types
+                     elif any(term in topic.lower() for term in ["travel", "military", "science", "medicine", "education", "art", "literature"]):
+                         tag_class += " tag-subject"  # Subject domains
+
+                     # Add each tag as an inline span
+                     tags_html += f'<span class="{tag_class}">{topic}</span>'
+
+                 # Close the containers
+                 tags_html += '</div></div>'
+
+                 # Render the subject tags section
+                 st.markdown(tags_html, unsafe_allow_html=True)
+
+         # Check if we have OCR content
+         if 'ocr_contents' in result:
+             # Create a single view instead of tabs
+             content_tab1 = st.container()
+
+             # Check for images in the result to use later
+             has_images = result.get('has_images', False)
+             has_image_data = ('pages_data' in result and any(page.get('images', []) for page in result.get('pages_data', [])))
+             has_raw_images = ('raw_response_data' in result and 'pages' in result['raw_response_data'] and
+                               any('images' in page for page in result['raw_response_data']['pages']
+                                   if isinstance(page, dict)))
+
+             # Display structured content
+             with content_tab1:
+                 # Display structured content with markdown formatting
+                 if isinstance(result['ocr_contents'], dict):
+                     # CSS is handled in the main layout.py file
+
+                     # Collect all available images from the result
+                     available_images = []
+                     if has_images and 'pages_data' in result:
+                         for page_idx, page in enumerate(result['pages_data']):
+                             if 'images' in page and len(page['images']) > 0:
+                                 for img_idx, img in enumerate(page['images']):
+                                     if 'image_base64' in img:
+                                         available_images.append({
+                                             'source': 'pages_data',
+                                             'page': page_idx,
+                                             'index': img_idx,
+                                             'data': img['image_base64']
+                                         })
+
+                     # Get images from the raw response as well
+                     if 'raw_response_data' in result:
+                         raw_data = result['raw_response_data']
+                         if isinstance(raw_data, dict) and 'pages' in raw_data:
+                             for page_idx, page in enumerate(raw_data['pages']):
+                                 if isinstance(page, dict) and 'images' in page:
+                                     for img_idx, img in enumerate(page['images']):
+                                         if isinstance(img, dict) and 'base64' in img:
+                                             available_images.append({
+                                                 'source': 'raw_response',
+                                                 'page': page_idx,
+                                                 'index': img_idx,
+                                                 'data': img['base64']
+                                             })
+
+                     # Extract images for display in the Images tab only
+                     images_to_display = []
+                     for img_idx, img in enumerate(available_images):
+                         if 'data' in img:
+                             images_to_display.append({
+                                 'data': img['data'],
+                                 'id': img.get('id', f"img_{img_idx}"),
+                                 'index': img_idx
+                             })
+
+                     # Organize sections in a logical order - prioritize main_text
+                     section_order = ["title", "author", "date", "summary", "main_text", "content", "transcript", "metadata"]
+                     ordered_sections = []
+
+                     # Add known sections first in the preferred order
+                     for section_name in section_order:
+                         if section_name in result['ocr_contents'] and result['ocr_contents'][section_name]:
+                             ordered_sections.append(section_name)
+
+                     # Add any remaining sections
+                     for section in result['ocr_contents'].keys():
+                         if (section not in ordered_sections and
+                             section not in ['error', 'partial_text'] and
+                             result['ocr_contents'][section]):
+                             ordered_sections.append(section)
+
+                     # If only raw_text is available and no other content, add it last
+                     if ('raw_text' in result['ocr_contents'] and
+                         result['ocr_contents']['raw_text'] and
+                         len(ordered_sections) == 0):
+                         ordered_sections.append('raw_text')
+
+                     # Add minimal spacing before the OCR results
+                     st.markdown("<div style='margin: 8px 0 4px 0;'></div>", unsafe_allow_html=True)
+
+                     # Create tabs for the different views
+                     if has_images:
+                         tabs = st.tabs(["Document Content", "Raw JSON", "Images"])
+                         doc_tab, json_tab, img_tab = tabs
+                     else:
+                         tabs = st.tabs(["Document Content", "Raw JSON"])
+                         doc_tab, json_tab = tabs
+                         img_tab = None
+
+                     # Document Content tab with simplified and systematic content handling
+                     with doc_tab:
+                         # Create a single unified content section
+                         st.markdown("#### Document Content")
+                         st.markdown("##### Title")
+
+                         # Use the same approach as the Previous Results tab for consistency:
+                         # a focused list of important sections, prioritizing main_text
+                         priority_sections = ["title", "main_text", "content", "transcript", "summary"]
+                         displayed_sections = set()
+
+                         # Display the first priority section that has content
+                         for section in priority_sections:
+                             if section in result['ocr_contents'] and result['ocr_contents'][section]:
+                                 content = result['ocr_contents'][section]
+                                 if isinstance(content, str) and content.strip():
+                                     # Only add a subheader for meaningful section names, not raw_text
+                                     if section != "raw_text" and section != "title":
+                                         st.markdown(f"##### {section.replace('_', ' ').title()}")
+
+                                     # Format and display the content: first format any
+                                     # structured data (lists, dicts), then apply the
+                                     # regular OCR text formatting
+                                     structured_content = format_structured_data(content)
+                                     formatted_content = format_ocr_text(structured_content)
+                                     st.markdown(formatted_content)
+                                     displayed_sections.add(section)
+                                     break
+                                 elif isinstance(content, dict):
+                                     # Display dictionary content as key-value pairs
+                                     for k, v in content.items():
+                                         if k not in ['error', 'partial_text'] and v:
+                                             st.markdown(f"**{k.replace('_', ' ').title()}**")
+                                             if isinstance(v, str):
+                                                 # Format any structured data in the string
+                                                 formatted_v = format_structured_data(v)
+                                                 st.markdown(format_ocr_text(formatted_v))
+                                             else:
+                                                 # Format non-string values (lists, dicts)
+                                                 formatted_v = format_structured_data(v)
+                                                 st.markdown(formatted_v)
+                                     displayed_sections.add(section)
+                                     break
+                                 elif isinstance(content, list):
+                                     # Format and display list items with the structured formatter
+                                     formatted_list = format_structured_data(content)
+                                     st.markdown(formatted_list)
+                                     displayed_sections.add(section)
+                                     break
+
+                         # Then display any remaining sections not already shown
+                         for section, content in result['ocr_contents'].items():
+                             if (section not in displayed_sections and
+                                 section not in ['error', 'partial_text'] and
+                                 content):
+                                 st.markdown(f"##### {section.replace('_', ' ').title()}")
+
+                                 if isinstance(content, str):
+                                     # Format any structured data in the string before display
+                                     structured_content = format_structured_data(content)
+                                     st.markdown(format_ocr_text(structured_content))
+                                 elif isinstance(content, list):
+                                     # Format the list with the structured formatter
+                                     formatted_list = format_structured_data(content)
+                                     st.markdown(formatted_list)
+                                 elif isinstance(content, dict):
+                                     # Format the dictionary with the structured formatter
+                                     formatted_dict = format_structured_data(content)
+                                     st.markdown(formatted_dict)
+
+                     # Raw JSON tab - for viewing the raw OCR response data
+                     with json_tab:
+                         # Extract the relevant JSON data
+                         json_data = {}
+
+                         # Include important metadata
+                         for field in ['file_name', 'timestamp', 'processing_time', 'detected_document_type', 'languages', 'topics']:
+                             if field in result:
+                                 json_data[field] = result[field]
+
+                         # Include OCR contents
+                         if 'ocr_contents' in result:
+                             json_data['ocr_contents'] = result['ocr_contents']
+
+                         # Exclude large binary data like base64 images to keep the JSON clean
+                         if 'pages_data' in result:
+                             # Create simplified pages_data without large binary content
+                             simplified_pages = []
+                             for page in result['pages_data']:
+                                 simplified_page = {
+                                     'page_number': page.get('page_number', 0),
+                                     'has_text': bool(page.get('markdown', '')),
+                                     'has_images': bool(page.get('images', [])),
+                                     'image_count': len(page.get('images', []))
+                                 }
+                                 simplified_pages.append(simplified_page)
+                             json_data['pages_summary'] = simplified_pages
+
+                         # Format the JSON prettily
+                         json_str = json.dumps(json_data, indent=2)
+
+                         # Display in a monospace font with syntax highlighting
+                         st.code(json_str, language="json")
+
+                     # Images tab - for viewing document images
+                     if has_images and img_tab:
+                         with img_tab:
+                             # Display each available image
+                             for i, img in enumerate(images_to_display):
+                                 st.image(img['data'], caption=f"Image {i+1}", use_container_width=True)
+
+         # Display the custom prompt if provided
+         if custom_prompt:
+             with st.expander("Custom Processing Instructions"):
+                 st.write(custom_prompt)
+
+         # Create the export section with a simple download menu
+         # (no heading - start directly with the buttons)
+         st.markdown("<div style='margin-top: 15px;'></div>", unsafe_allow_html=True)
+
+         # Prepare all download files at once to avoid rerun resets
+         try:
+             # 1. JSON download
+             json_str = json.dumps(result, indent=2)
+             json_filename = f"{result.get('file_name', 'document').split('.')[0]}_ocr.json"
+
+             # 2. Text download with improved structure
+             text_parts = []
+             filename = result.get('file_name', 'document')
+             text_parts.append(f"DOCUMENT: {filename}\n")
+
+             if 'timestamp' in result:
+                 text_parts.append(f"Processed: {result['timestamp']}\n")
+
+             if 'languages' in result and result['languages']:
+                 languages = [lang for lang in result['languages'] if lang is not None]
+                 if languages:
+                     text_parts.append(f"Languages: {', '.join(languages)}\n")
+
+             if 'topics' in result and result['topics']:
+                 text_parts.append(f"Topics: {', '.join(result['topics'])}\n")
+
+             text_parts.append("\n" + "="*50 + "\n\n")
+
+             if 'ocr_contents' in result and 'title' in result['ocr_contents'] and result['ocr_contents']['title']:
+                 text_parts.append(f"TITLE: {result['ocr_contents']['title']}\n\n")
+
+             content_added = False
+
+             if 'ocr_contents' in result:
+                 for field in ["main_text", "content", "text", "transcript", "raw_text"]:
+                     if field in result['ocr_contents'] and result['ocr_contents'][field]:
+                         text_parts.append(f"CONTENT:\n\n{result['ocr_contents'][field]}\n")
+                         content_added = True
+                         break
+
+             text_content = "\n".join(text_parts)
+             text_filename = f"{result.get('file_name', 'document').split('.')[0]}_ocr.txt"
+
+             # 3. HTML download (create_html_with_images is imported at the top of the module)
+             html_content = create_html_with_images(result)
+             html_filename = f"{result.get('file_name', 'document').split('.')[0]}_ocr.html"
+
+             # Offer the download options in an expander, stacked vertically
+             # with spacing between the buttons for readability
+             with st.expander("Download Options"):
+                 st.download_button(
+                     label="JSON",
+                     data=json_str,
+                     file_name=json_filename,
+                     mime="application/json",
+                     key="download_json_btn",
+                     use_container_width=True
+                 )
+
+                 st.markdown("<div style='margin-top: 8px;'></div>", unsafe_allow_html=True)
+
+                 st.download_button(
+                     label="Text",
+                     data=text_content,
+                     file_name=text_filename,
+                     mime="text/plain",
+                     key="download_text_btn",
+                     use_container_width=True
+                 )
+
+                 st.markdown("<div style='margin-top: 8px;'></div>", unsafe_allow_html=True)
+
+                 st.download_button(
+                     label="HTML",
+                     data=html_content,
+                     file_name=html_filename,
+                     mime="text/html",
+                     key="download_html_btn",
+                     use_container_width=True
+                 )
+
+         except Exception as e:
+             st.error(f"Error preparing download files: {str(e)}")