milwright committed
Commit c04ffe5 · 1 Parent(s): 836388f

Rolling out modular v2

.DS_Store CHANGED
Binary files a/.DS_Store and b/.DS_Store differ
 
.clinerules/apiDocumentation.md ADDED
@@ -0,0 +1,29 @@
+ apiDocumentation.md
+ API Interaction Documentation
+ Mistral OCR API
+
+ Endpoint: /v1/ocr
+
+ Payload:
+
+ image (binary)
+
+ prompt (optional contextual instructions)
+
+ Response:
+
+ structured_data: Hierarchical text + metadata output
+
+ raw_text: Plain extracted text
+
+ Error Handling:
+
+ Timeout retries (up to 3 attempts)
+
+ Local fallback to Tesseract if Mistral service unavailable
+
+ Tesseract Fallback
+
+ Only invoked if Mistral API fails after retries.
+
+ No structured output; raw text only.
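
The retry-then-fallback flow documented above maps to a small client wrapper. A minimal sketch follows, assuming a hypothetical base URL and bearer-token auth (neither is specified in this commit); the `/v1/ocr` path, payload fields, response keys, and retry/fallback behavior come from the doc, with `pytesseract` standing in for the local Tesseract fallback.

```python
# Minimal sketch of the documented flow. API_URL's host and the auth
# header are assumptions; /v1/ocr, the payload fields, and the
# retry/fallback behavior follow the documentation above.
import requests
from PIL import Image
import pytesseract

API_URL = "https://example.invalid/v1/ocr"  # hypothetical host

def ocr_with_fallback(image_path, prompt=None, api_key=""):
    with open(image_path, "rb") as f:
        image_bytes = f.read()
    data = {"prompt": prompt} if prompt else {}
    for _ in range(3):  # timeout retries (up to 3 attempts)
        try:
            resp = requests.post(
                API_URL,
                files={"image": image_bytes},
                data=data,
                headers={"Authorization": f"Bearer {api_key}"},
                timeout=30,
            )
            resp.raise_for_status()
            return resp.json()  # expected keys: structured_data, raw_text
        except requests.RequestException:
            continue
    # Tesseract fallback: raw text only, no structured output
    raw = pytesseract.image_to_string(Image.open(image_path))
    return {"structured_data": None, "raw_text": raw}
```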
.clinerules/projectBrief.md ADDED
@@ -0,0 +1,21 @@
+ # Foundation
+
+ Historical OCR is an advanced optical character recognition (OCR) application designed to support historical research. It leverages Mistral AI's OCR models alongside image preprocessing pipelines optimized for archival material.
+
+ High-Level Overview
+
+ Building a Streamlit-based web application to process historical documents (images or PDFs), optimize them for OCR using advanced preprocessing techniques, and extract structured text and metadata through Mistral's large language models.
+
+ Core Requirements and Goals
+
+ Upload and preprocess historical documents
+
+ Automatically detect document types (e.g., handwritten letters, scientific papers)
+
+ Apply tailored OCR prompting and structured output based on document type
+
+ Support user-defined contextual instructions to refine output
+
+ Provide downloadable structured transcripts and analysis
+
+ Example: "Building a Streamlit web app for OCR transcription and structured extraction from historical documents using Mistral AI."
.clinerules/systemPatterns.md ADDED
@@ -0,0 +1,31 @@
+ # System Architecture
+
+ Frontend: Streamlit app (app.py) for user interface and interactions.
+
+ Core Processing: ocr_processing.py orchestrates preprocessing, document type detection, and OCR operations.
+
+ Image Preprocessing: preprocessing.py and image_segmentation.py handle deskewing, thresholding, and cleaning.
+
+ OCR and Structuring: structured_ocr.py and ocr_utils.py manage API communication and format structured outputs.
+
+ Utilities and Detection: language_detection.py, utils.py, and constants.py provide language detection, helpers, and prompt templates.
+
+ Key Technical Decisions
+
+ Streamlit cache management for upload processing efficiency.
+
+ Modular design of preprocessing paths based on document type.
+
+ Mistral AI as the primary OCR processor, with Tesseract fallback for redundancy.
+
+ Design Patterns in Use
+
+ Delegation: Frontend delegates all processing to backend orchestrators.
+
+ Modularity: Preprocessing and OCR tasks divided into clean, testable modules.
+
+ State-driven Processing: Output dynamically reflects session state and user input.
+
+ Component Relationships
+
+ app.py ⇨ ocr_processing.py ⇨ preprocessing.py, structured_ocr.py, language_detection.py, etc.
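
As a concrete illustration of the delegation pattern described above, here is a minimal sketch of the frontend-to-orchestrator handoff. `process_file`'s parameters mirror how it appears elsewhere in this commit; the `st.json` rendering call is illustrative, not the app's actual display logic.

```python
# Sketch of the delegation pattern: the Streamlit frontend hands the
# upload to the backend orchestrator and only renders what comes back.
import streamlit as st
from ocr_processing import process_file  # backend orchestrator

uploaded = st.file_uploader("Upload a historical document")
if uploaded is not None:
    # All preprocessing, detection, and OCR happen behind this call
    result = process_file(uploaded, use_vision=True, preprocessing_options={})
    st.json(result)  # the frontend never touches OCR logic directly
```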
README.md CHANGED
@@ -21,7 +21,11 @@ An advanced OCR application for historical document analysis using Mistral AI.
 
  - **OCR with Context:** AI-enhanced OCR optimized for historical documents
  - **Document Type Detection:** Automatically identifies handwritten letters, recipes, scientific texts, and more
- - **Image Preprocessing:** Optimizes images for better text recognition
+ - **Advanced Image Preprocessing:**
+   - Automatic deskewing to correct document orientation
+   - Smart thresholding with Otsu and adaptive methods
+   - Morphological operations to clean up text
+   - Document-type specific optimization
  - **Custom Prompting:** Tailor the AI analysis with document-specific instructions
  - **Structured Output:** Returns organized, structured information based on document type
 
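The new preprocessing bullets correspond to standard OpenCV operations. A hedged sketch of how deskewing, Otsu thresholding, and a morphological clean-up might chain together (thresholds and the deskew heuristic are assumptions, not the repo's preprocessing.py):

```python
# Illustrative pipeline for the three bullets above; not the repo's code.
import cv2
import numpy as np

def preprocess_for_ocr(path: str) -> np.ndarray:
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # Deskew: estimate rotation from the minimum-area rectangle of ink pixels
    coords = np.column_stack(np.where(gray < 128)).astype(np.float32)
    if coords.size:
        angle = cv2.minAreaRect(coords)[-1]
        angle = -(90 + angle) if angle < -45 else -angle
        h, w = gray.shape
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        gray = cv2.warpAffine(gray, M, (w, h), flags=cv2.INTER_CUBIC,
                              borderMode=cv2.BORDER_REPLICATE)
    # Smart thresholding: Otsu picks the global threshold automatically
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Morphological opening removes speckle noise around glyphs
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    return cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)
```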
app.py CHANGED
@@ -41,7 +41,7 @@ from constants import (
 )
 from structured_ocr import StructuredOCR
 from config import MISTRAL_API_KEY
- from ocr_utils import create_results_zip
+ from utils.image_utils import create_results_zip
 
 # Set favicon path
 favicon_path = os.path.join(os.path.dirname(__file__), "static/favicon.png")
@@ -74,20 +74,47 @@ st.set_page_config(
 # Consult https://docs.streamlit.io/library/advanced-features/session-state for details.
 # ========================================================================================
 
+ def reset_document_state():
+     """Reset only document-specific state variables
+ 
+     This function explicitly resets all document-related variables to ensure
+     clean state between document processing, preventing cached data issues.
+     """
+     st.session_state.sample_document = None
+     st.session_state.original_sample_bytes = None
+     st.session_state.original_sample_name = None
+     st.session_state.original_sample_mime_type = None
+     st.session_state.is_sample_document = False
+     st.session_state.processed_document_active = False
+     st.session_state.sample_document_processed = False
+     st.session_state.sample_just_loaded = False
+     st.session_state.last_processed_file = None
+     st.session_state.selected_previous_result = None
+     # Keep temp_file_paths but ensure it's empty after cleanup
+     if 'temp_file_paths' in st.session_state:
+         st.session_state.temp_file_paths = []
+ 
 def init_session_state():
     """Initialize session state variables if they don't already exist
 
     This function follows Streamlit's recommended patterns for state initialization.
     It only creates variables if they don't exist yet and doesn't modify existing values.
     """
+     # Initialize persistent app state variables
     if 'previous_results' not in st.session_state:
         st.session_state.previous_results = []
     if 'temp_file_paths' not in st.session_state:
         st.session_state.temp_file_paths = []
-     if 'last_processed_file' not in st.session_state:
-         st.session_state.last_processed_file = None
     if 'auto_process_sample' not in st.session_state:
         st.session_state.auto_process_sample = False
+     if 'close_clicked' not in st.session_state:
+         st.session_state.close_clicked = False
+     if 'active_tab' not in st.session_state:
+         st.session_state.active_tab = 0
+ 
+     # Initialize document-specific state variables
+     if 'last_processed_file' not in st.session_state:
+         st.session_state.last_processed_file = None
     if 'sample_just_loaded' not in st.session_state:
         st.session_state.sample_just_loaded = False
     if 'processed_document_active' not in st.session_state:
@@ -104,10 +131,6 @@ def init_session_state():
         st.session_state.is_sample_document = False
     if 'selected_previous_result' not in st.session_state:
         st.session_state.selected_previous_result = None
-     if 'close_clicked' not in st.session_state:
-         st.session_state.close_clicked = False
-     if 'active_tab' not in st.session_state:
-         st.session_state.active_tab = 0
 
 def close_document():
     """Called when the Close Document button is clicked
@@ -120,24 +143,17 @@ def close_document():
     That approach breaks Streamlit's execution flow and causes UI artifacts.
     """
     logger.info("Close document button clicked")
-     # Save the previous results
-     previous_results = st.session_state.previous_results if 'previous_results' in st.session_state else []
 
-     # Clean up temp files
+     # Clean up temp files first
     if 'temp_file_paths' in st.session_state and st.session_state.temp_file_paths:
         logger.info(f"Cleaning up {len(st.session_state.temp_file_paths)} temporary files")
         handle_temp_files(st.session_state.temp_file_paths)
 
-     # Clear all state variables except previous_results
-     for key in list(st.session_state.keys()):
-         if key != 'previous_results' and key != 'close_clicked':
-             st.session_state.pop(key, None)
+     # Reset all document-specific state variables to prevent caching issues
+     reset_document_state()
 
-     # Set flag for having cleaned up
+     # Set flag for having cleaned up - this will trigger a rerun in main()
     st.session_state.close_clicked = True
- 
-     # Restore the previous results
-     st.session_state.previous_results = previous_results
 
 def show_example_documents():
     """Show example documents section"""
@@ -251,14 +267,12 @@ def show_example_documents():
 
     # Reset any document state before loading a new sample
     if st.session_state.processed_document_active:
-         # Clear previous document state
-         st.session_state.processed_document_active = False
-         st.session_state.last_processed_file = None
- 
         # Clean up any temporary files from previous processing
         if st.session_state.temp_file_paths:
             handle_temp_files(st.session_state.temp_file_paths)
-             st.session_state.temp_file_paths = []
+ 
+         # Reset all document-specific state variables
+         reset_document_state()
 
     # Save download info in session state
     st.session_state.sample_document = SampleDocument(
@@ -350,6 +364,7 @@ def process_document(uploaded_file, left_col, right_col, sidebar_options):
     progress_placeholder = st.empty()
 
     # Image preprocessing preview - show if image file and preprocessing options are set
+     # Remove the document active check to show preview immediately after selection
     if (any(sidebar_options["preprocessing_options"].values()) and
         uploaded_file.type.startswith('image/')):
 
@@ -530,13 +545,14 @@ def main():
     sidebar_options = create_sidebar_options()
 
     # Create main layout with tabs - simpler, more compact approach
-     tab_names = ["Document Processing", "Sample Documents", "Previous Results", "About"]
-     main_tab1, main_tab2, main_tab3, main_tab4 = st.tabs(tab_names)
+     tab_names = ["Document Processing", "Sample Documents", "Learn More"]
+     main_tab1, main_tab2, main_tab3 = st.tabs(tab_names)
 
     with main_tab1:
         # Create a two-column layout for file upload and results with minimal padding
         st.markdown('<style>.block-container{padding-top: 1rem; padding-bottom: 0;}</style>', unsafe_allow_html=True)
-         left_col, right_col = st.columns([1, 1])
+         # Using a 2:3 column ratio gives more space to the results column
+         left_col, right_col = st.columns([2, 3])
 
         with left_col:
             # Create file uploader
@@ -575,11 +591,9 @@ def main():
 
         show_example_documents()
 
-     with main_tab3:
-         # Previous results tab
-         display_previous_results()
+     # Previous results tab temporarily removed
 
-     with main_tab4:
+     with main_tab3:
         # About tab
         display_about_tab()
 
config.py CHANGED
@@ -40,22 +40,19 @@ VISION_MODEL = os.environ.get("MISTRAL_VISION_MODEL", "mistral-small-latest") #
 # Image preprocessing settings optimized for historical documents
 # These can be customized from environment variables
 IMAGE_PREPROCESSING = {
-     "enhance_contrast": float(os.environ.get("ENHANCE_CONTRAST", "1.2")),  # Reduced contrast for more natural image appearance
+     "enhance_contrast": float(os.environ.get("ENHANCE_CONTRAST", "1.8")),  # Increased contrast for better text recognition
     "sharpen": os.environ.get("SHARPEN", "True").lower() in ("true", "1", "yes"),
     "denoise": os.environ.get("DENOISE", "True").lower() in ("true", "1", "yes"),
     "max_size_mb": float(os.environ.get("MAX_IMAGE_SIZE_MB", "12.0")),  # Increased size limit for better quality
     "target_dpi": int(os.environ.get("TARGET_DPI", "300")),  # Target DPI for scaling
-     "compression_quality": int(os.environ.get("COMPRESSION_QUALITY", "95")),  # Higher quality for better OCR results
-     # Enhanced settings for handwritten documents
+     "compression_quality": int(os.environ.get("COMPRESSION_QUALITY", "100")),  # Higher quality for better OCR results
+     # # Enhanced settings for handwritten documents
     "handwritten": {
-         "contrast": float(os.environ.get("HANDWRITTEN_CONTRAST", "1.2")),  # Lower contrast for handwritten text
         "block_size": int(os.environ.get("HANDWRITTEN_BLOCK_SIZE", "21")),  # Larger block size for adaptive thresholding
         "constant": int(os.environ.get("HANDWRITTEN_CONSTANT", "5")),  # Lower constant for adaptive thresholding
         "use_dilation": os.environ.get("HANDWRITTEN_DILATION", "True").lower() in ("true", "1", "yes"),  # Connect broken strokes
-         "clahe_limit": float(os.environ.get("HANDWRITTEN_CLAHE_LIMIT", "2.0")),  # CLAHE limit for local contrast
-         "bilateral_d": int(os.environ.get("HANDWRITTEN_BILATERAL_D", "5")),  # Bilateral filter window size
-         "bilateral_sigma1": int(os.environ.get("HANDWRITTEN_BILATERAL_SIGMA1", "25")),  # Color sigma
-         "bilateral_sigma2": int(os.environ.get("HANDWRITTEN_BILATERAL_SIGMA2", "45"))  # Space sigma
+         "dilation_iterations": int(os.environ.get("HANDWRITTEN_DILATION_ITERATIONS", "2")),  # More iterations for better stroke connection
+         "dilation_kernel_size": int(os.environ.get("HANDWRITTEN_DILATION_KERNEL_SIZE", "3"))  # Larger kernel for dilation
     }
 }
 
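The new `dilation_iterations` and `dilation_kernel_size` keys suggest a dilation step downstream. A hedged sketch of how a preprocessing function might consume them (the function itself is illustrative; only the config keys come from this commit):

```python
# Illustrative consumer of the new handwritten-document settings.
import cv2
import numpy as np
from config import IMAGE_PREPROCESSING

def connect_strokes(binary: np.ndarray) -> np.ndarray:
    hw = IMAGE_PREPROCESSING["handwritten"]
    if not hw["use_dilation"]:
        return binary
    k = hw["dilation_kernel_size"]
    kernel = np.ones((k, k), np.uint8)
    # Dilation thickens strokes so broken handwriting reconnects
    return cv2.dilate(binary, kernel, iterations=hw["dilation_iterations"])
```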
constants.py CHANGED
@@ -138,17 +138,56 @@ CONTENT_THEMES = {
 }
 
 # Period tags based on year ranges
+ # These ranges are used to assign historical period tags to documents based on their year.
 PERIOD_TAGS = {
-     (0, 1799): "Pre-1800s",
-     (1800, 1849): "Early 19th Century",
-     (1850, 1899): "Late 19th Century",
-     (1900, 1949): "Early 20th Century",
-     (1950, 2099): "Modern Era"
+     (0, 499): "Ancient Era (to 500 CE)",
+     (500, 999): "Early Medieval (500–1000)",
+     (1000, 1299): "High Medieval (1000–1300)",
+     (1300, 1499): "Late Medieval (1300–1500)",
+     (1500, 1599): "Renaissance (1500–1600)",
+     (1600, 1699): "Early Modern (1600–1700)",
+     (1700, 1775): "Enlightenment (1700–1775)",
+     (1776, 1799): "Age of Revolutions (1776–1800)",
+     (1800, 1849): "Early 19th Century (1800–1850)",
+     (1850, 1899): "Late 19th Century (1850–1900)",
+     (1900, 1918): "Early 20th Century & WWI (1900–1918)",
+     (1919, 1938): "Interwar Period (1919–1938)",
+     (1939, 1945): "World War II (1939–1945)",
+     (1946, 1968): "Postwar & Mid-20th Century (1946–1968)",
+     (1969, 1989): "Late 20th Century (1969–1989)",
+     (1990, 2000): "Turn of the 21st Century (1990–2000)",
+     (2001, 2099): "Contemporary (21st Century)"
 }
 
- # Default fallback tags
- DEFAULT_TAGS = ["Document", "Historical", "Text"]
- GENERIC_TAGS = ["Archive", "Content", "Record"]
+ # Default fallback tags for documents when no specific tags are detected.
+ DEFAULT_TAGS = [
+     "Document",
+     "Historical",
+     "Text",
+     "Primary Source",
+     "Archival Material",
+     "Record",
+     "Manuscript",
+     "Printed Material",
+     "Correspondence",
+     "Publication"
+ ]
+ 
+ # Generic tags that can be used for broad categorization or as supplemental tags.
+ GENERIC_TAGS = [
+     "Archive",
+     "Content",
+     "Record",
+     "Source",
+     "Material",
+     "Page",
+     "Scan",
+     "Image",
+     "Transcription",
+     "Uncategorized",
+     "General",
+     "Miscellaneous"
+ ]
 
 # UI constants
 PROGRESS_DELAY = 0.8  # Seconds to show completion message
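
Since `PERIOD_TAGS` is keyed by inclusive `(start, end)` year ranges, assigning a tag is a linear scan over the table. A small sketch of such a lookup (the helper name is illustrative, not from the commit):

```python
# Year -> period lookup over the expanded PERIOD_TAGS table.
from constants import PERIOD_TAGS, DEFAULT_TAGS

def period_tag_for_year(year: int) -> str:
    for (start, end), tag in PERIOD_TAGS.items():
        if start <= year <= end:  # ranges are inclusive on both ends
            return tag
    return DEFAULT_TAGS[0]  # fall back to the generic "Document" tag

print(period_tag_for_year(1887))  # -> "Late 19th Century (1850–1900)"
```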
image_segmentation.py CHANGED
@@ -18,12 +18,13 @@ logging.basicConfig(level=logging.INFO,
                     format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
 logger = logging.getLogger(__name__)
 
- def segment_image_for_ocr(image_path: Union[str, Path]) -> Dict[str, Union[Image.Image, str]]:
+ def segment_image_for_ocr(image_path: Union[str, Path], vision_enabled: bool = True) -> Dict[str, Union[Image.Image, str]]:
     """
     Segment an image into text and image regions for improved OCR processing.
 
     Args:
         image_path: Path to the image file
+         vision_enabled: Whether the vision model is enabled
 
     Returns:
         Dict containing:
@@ -41,6 +42,23 @@ def segment_image_for_ocr(image_path: Union[str, Path]) -> Dict[str, Union[Image
     try:
         # Open original image with PIL for compatibility
         with Image.open(image_file) as pil_img:
+             # --- 2 · Stop "text page detected as image" when vision model is off ---
+             if not vision_enabled:
+                 # Import the entropy calculator from utils.image_utils
+                 from utils.image_utils import calculate_image_entropy
+ 
+                 # Calculate entropy to determine if this is line art or blank
+                 ent = calculate_image_entropy(pil_img)
+                 if ent < 3.5:  # Heuristically low → line-art or blank page
+                     logger.info(f"Low entropy image detected ({ent:.2f}), classifying as illustration")
+                     # Return minimal result for illustration
+                     return {
+                         'text_regions': None,
+                         'image_regions': pil_img,
+                         'text_mask_base64': None,
+                         'combined_result': None,
+                         'text_regions_coordinates': []
+                     }
             # Convert to RGB if not already
             if pil_img.mode != 'RGB':
                 pil_img = pil_img.convert('RGB')
@@ -89,7 +107,8 @@ def segment_image_for_ocr(image_path: Union[str, Path]) -> Dict[str, Union[Image
 
             # Additional check for text-like characteristics
             # Text typically has aspect ratio > 1 (wider than tall) and reasonable density
-             if (aspect_ratio > 1.5 or aspect_ratio < 0.5) and dark_pixel_density > 0.2:
+             # Relaxed aspect ratio constraints and lowered density threshold for better detection
+             if (aspect_ratio > 1.2 or aspect_ratio < 0.7) and dark_pixel_density > 0.15:
                 # Add to text regions list
                 text_regions.append((x, y, w, h))
                 # Add to text mask
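
`calculate_image_entropy` is imported from `utils.image_utils` but not shown in this diff. A plausible sketch computes Shannon entropy over the grayscale histogram (an assumption, not the repo's code); an 8-bit image maxes out at 8 bits, and dense text pages typically score well above the 3.5 cutoff used above:

```python
# Hypothetical sketch of the entropy calculator assumed by the diff.
import numpy as np
from PIL import Image

def calculate_image_entropy(img: Image.Image) -> float:
    hist = np.asarray(img.convert('L').histogram(), dtype=np.float64)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins; log2(0) is undefined
    return float(-(p * np.log2(p)).sum())  # bits per pixel, 0..8
```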
language_detection.py CHANGED
@@ -64,7 +64,6 @@ class LanguageDetector:
             "patterns": ['oi[ts]$', 'oi[re]$', 'f[^aeiou]', 'ff', 'ſ', 'auoit', 'eſtoit',
                          'ſi', 'ſur', 'ſa', 'cy', 'ayant', 'oy', 'uſ', 'auſ']
         },
-         "exclusivity": 2.0  # French indicators have higher weight in historical text detection
     },
     "German": {
         "chars": ['ä', 'ö', 'ü', 'ß'],
ocr_processing.py CHANGED
@@ -17,6 +17,9 @@ import streamlit as st
 
 # Local application imports
 from structured_ocr import StructuredOCR
+ # Import from updated utils directory
+ from utils.image_utils import clean_ocr_result
+ # Temporarily retain old utils imports until they are fully migrated
 from utils import generate_cache_key, timing, format_timestamp, create_descriptive_filename, extract_subject_tags
 from preprocessing import apply_preprocessing_to_file
 from error_handler import handle_ocr_error, check_file_size
@@ -239,7 +242,7 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
 
     try:
         # Perform image segmentation
-         segmentation_results = segment_image_for_ocr(temp_path)
+         segmentation_results = segment_image_for_ocr(temp_path, vision_enabled=use_vision)
 
         if segmentation_results['combined_result'] is not None:
             # Save the segmented result to a new temporary file
@@ -357,6 +360,13 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
         # Add additional metadata to result
         result = process_result(result, uploaded_file, preprocessing_options)
 
+         # 🔧 ALWAYS normalize result before returning
+         result = clean_ocr_result(
+             result,
+             use_segmentation=use_segmentation,
+             vision_enabled=use_vision
+         )
+ 
         # Complete progress
         progress_reporter.complete()
 
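`clean_ocr_result` also lives in the new `utils.image_utils` module and is not shown in this diff. A minimal sketch of such a normalizer, under the assumption that its job is to guarantee the keys downstream code expects and to record how the result was produced:

```python
# Hypothetical sketch only; the actual utils.image_utils implementation
# is not part of this diff.
def clean_ocr_result(result: dict, use_segmentation: bool = False,
                     vision_enabled: bool = True) -> dict:
    result.setdefault('ocr_contents', {})
    result['ocr_contents'].setdefault('raw_text', '')
    # Record the processing path so the UI can adapt its display
    result['processing_flags'] = {
        'segmentation': use_segmentation,
        'vision': vision_enabled,
    }
    return result
```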
ocr_utils.py CHANGED
@@ -1,110 +1,38 @@
 """
- Utility functions for OCR processing with Mistral AI.
- Contains helper functions for working with OCR responses and image handling.
 """
 
- # Standard library imports
- import json
 import base64
- import io
- import zipfile
 import logging
- import time
- from datetime import datetime
 from pathlib import Path
- from typing import Dict, List, Optional, Union, Any, Tuple
- from functools import lru_cache
 
 # Configure logging
 logging.basicConfig(level=logging.INFO,
-                     format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
 logger = logging.getLogger(__name__)
 
- # Third-party imports
- import numpy as np
 
- # Check for image processing libraries
 try:
-     from PIL import Image, ImageEnhance, ImageFilter, ImageOps
     PILLOW_AVAILABLE = True
 except ImportError:
     logger.warning("PIL not available - image preprocessing will be limited")
     PILLOW_AVAILABLE = False
 
- try:
-     import cv2
-     CV2_AVAILABLE = True
- except ImportError:
-     logger.warning("OpenCV (cv2) not available - advanced image processing will be limited")
-     CV2_AVAILABLE = False
- 
- # Mistral AI imports
- from mistralai import DocumentURLChunk, ImageURLChunk, TextChunk
- from mistralai.models import OCRImageObject
- 
- # Import configuration
- try:
-     from config import IMAGE_PREPROCESSING
- except ImportError:
-     # Fallback defaults if config not available
-     IMAGE_PREPROCESSING = {
-         "enhance_contrast": 1.5,
-         "sharpen": True,
-         "denoise": True,
-         "max_size_mb": 8.0,
-         "target_dpi": 300,
-         "compression_quality": 92
-     }
- 
- def replace_images_in_markdown(markdown_str: str, images_dict: dict) -> str:
-     """
-     Replace image placeholders in markdown with base64-encoded images.
- 
-     Args:
-         markdown_str: Markdown text containing image placeholders
-         images_dict: Dictionary mapping image IDs to base64 strings
- 
-     Returns:
-         Markdown text with images replaced by base64 data
-     """
-     for img_name, base64_str in images_dict.items():
-         markdown_str = markdown_str.replace(
-             f"![{img_name}]({img_name})", f"![{img_name}]({base64_str})"
-         )
-     return markdown_str
- 
- def get_combined_markdown(ocr_response) -> str:
-     """
-     Combine OCR text and images into a single markdown document.
- 
-     Args:
-         ocr_response: OCR response object from Mistral AI
- 
-     Returns:
-         Combined markdown string with embedded images
-     """
-     markdowns = []
- 
-     # Process each page of the OCR response
-     for page in ocr_response.pages:
-         # Extract image data if available
-         image_data = {}
-         if hasattr(page, "images"):
-             for img in page.images:
-                 if hasattr(img, "id") and hasattr(img, "image_base64"):
-                     image_data[img.id] = img.image_base64
- 
-         # Replace image placeholders with base64 data
-         page_markdown = page.markdown if hasattr(page, "markdown") else ""
-         processed_markdown = replace_images_in_markdown(page_markdown, image_data)
-         markdowns.append(processed_markdown)
- 
-     # Join all pages' markdown with double newlines
-     return "\n\n".join(markdowns)
 
 def encode_image_for_api(image_path: Union[str, Path]) -> str:
     """
-     Encode an image as base64 data URL for API submission.
 
     Args:
         image_path: Path to the image file
@@ -135,1703 +63,37 @@ def encode_image_for_api(image_path: Union[str, Path]) -> str:
     encoded = base64.b64encode(image_file.read_bytes()).decode()
     return f"data:{mime_type};base64,{encoded}"
 
- def encode_bytes_for_api(file_bytes: bytes, mime_type: str) -> str:
-     """
-     Encode binary data as base64 data URL for API submission.
- 
-     Args:
-         file_bytes: Binary file data
-         mime_type: MIME type of the file (e.g., 'image/jpeg', 'application/pdf')
- 
-     Returns:
-         Base64 data URL for the data
-     """
-     # Encode data as base64
-     encoded = base64.b64encode(file_bytes).decode()
-     return f"data:{mime_type};base64,{encoded}"
- 
- def process_image_with_ocr(client, image_path: Union[str, Path], model: str = "mistral-ocr-latest"):
-     """
-     Process an image with OCR and return the response.
- 
-     Args:
-         client: Mistral AI client
-         image_path: Path to the image file
-         model: OCR model to use
- 
-     Returns:
-         OCR response object
-     """
-     # Encode image as base64
-     base64_data_url = encode_image_for_api(image_path)
- 
-     # Process image with OCR
-     image_response = client.ocr.process(
-         document=ImageURLChunk(image_url=base64_data_url),
-         model=model
-     )
- 
-     return image_response
- 
- def ocr_response_to_json(ocr_response, indent: int = 4) -> str:
-     """
-     Convert OCR response to a formatted JSON string.
- 
-     Args:
-         ocr_response: OCR response object
-         indent: Indentation level for JSON formatting
- 
-     Returns:
-         Formatted JSON string
-     """
-     # Convert OCR response to a dictionary
-     response_dict = {
-         "text": ocr_response.text if hasattr(ocr_response, "text") else "",
-         "pages": []
-     }
- 
-     # Process pages if available
-     if hasattr(ocr_response, "pages"):
-         for page in ocr_response.pages:
-             page_dict = {
-                 "text": page.text if hasattr(page, "text") else "",
-                 "markdown": page.markdown if hasattr(page, "markdown") else "",
-                 "images": []
-             }
- 
-             # Process images if available
-             if hasattr(page, "images"):
-                 for img in page.images:
-                     img_dict = {
-                         "id": img.id if hasattr(img, "id") else "",
-                         "base64": img.image_base64 if hasattr(img, "image_base64") else ""
-                     }
-                     page_dict["images"].append(img_dict)
- 
-             response_dict["pages"].append(page_dict)
- 
-     # Convert dictionary to JSON
-     return json.dumps(response_dict, indent=indent)
- 
- def create_results_zip_in_memory(results):
-     """
-     Create a zip file containing OCR results in memory.
- 
-     Args:
-         results: Dictionary or list of OCR results
- 
-     Returns:
-         Binary zip file data
-     """
-     # Create a BytesIO object
-     zip_buffer = io.BytesIO()
- 
-     # Check if results is a list or a dictionary
-     is_list = isinstance(results, list)
- 
-     # Create zip file in memory
-     with zipfile.ZipFile(zip_buffer, 'w', zipfile.ZIP_DEFLATED) as zipf:
-         if is_list:
-             # Handle list of results
-             for i, result in enumerate(results):
-                 try:
-                     # Create a descriptive base filename for this result
-                     base_filename = result.get('file_name', f'document_{i+1}').split('.')[0]
- 
-                     # Add document type if available
-                     if 'topics' in result and result['topics']:
-                         topic = result['topics'][0].lower().replace(' ', '_')
-                         base_filename = f"{base_filename}_{topic}"
- 
-                     # Add language if available
-                     if 'languages' in result and result['languages']:
-                         lang = result['languages'][0].lower()
-                         # Only add if it's not already in the filename
-                         if lang not in base_filename.lower():
-                             base_filename = f"{base_filename}_{lang}"
- 
-                     # For PDFs, add page information
-                     if 'total_pages' in result and 'processed_pages' in result:
-                         base_filename = f"{base_filename}_p{result['processed_pages']}of{result['total_pages']}"
- 
-                     # Add timestamp if available
-                     if 'timestamp' in result:
-                         try:
-                             # Try to parse the timestamp and reformat it
-                             dt = datetime.strptime(result['timestamp'], "%Y-%m-%d %H:%M")
-                             timestamp = dt.strftime("%Y%m%d_%H%M%S")
-                             base_filename = f"{base_filename}_{timestamp}"
-                         except:
-                             pass
- 
-                     # Add JSON results for each file with descriptive name
-                     result_json = json.dumps(result, indent=2)
-                     zipf.writestr(f"{base_filename}.json", result_json)
- 
-                     # Add HTML content (generated from the result)
-                     html_content = create_html_with_images(result)
-                     zipf.writestr(f"{base_filename}_with_images.html", html_content)
- 
-                     # Add raw OCR text if available
-                     if "ocr_contents" in result and "raw_text" in result["ocr_contents"]:
-                         zipf.writestr(f"{base_filename}.txt", result["ocr_contents"]["raw_text"])
- 
-                     # Add HTML visualization if available
-                     if "html_visualization" in result:
-                         zipf.writestr(f"visualization_{i+1}.html", result["html_visualization"])
- 
-                     # Add images if available (limit to conserve memory)
-                     if "pages_data" in result:
-                         for page_idx, page in enumerate(result["pages_data"]):
-                             for img_idx, img in enumerate(page.get("images", [])[:3]):  # Limit to first 3 images per page
-                                 img_base64 = img.get("image_base64", "")
-                                 if img_base64:
-                                     # Strip data URL prefix if present
-                                     if img_base64.startswith("data:image"):
-                                         img_base64 = img_base64.split(",", 1)[1]
- 
-                                     # Decode base64 and add to zip
-                                     try:
-                                         img_data = base64.b64decode(img_base64)
-                                         zipf.writestr(f"images/result_{i+1}_page_{page_idx+1}_img_{img_idx+1}.jpg", img_data)
-                                     except:
-                                         pass
-                 except Exception:
-                     # If any result fails, skip it and continue
-                     continue
-         else:
-             # Handle single result
-             try:
-                 # Create a descriptive base filename for this result
-                 base_filename = results.get('file_name', 'document').split('.')[0]
- 
-                 # Add document type if available
-                 if 'topics' in results and results['topics']:
-                     topic = results['topics'][0].lower().replace(' ', '_')
-                     base_filename = f"{base_filename}_{topic}"
- 
-                 # Add language if available
-                 if 'languages' in results and results['languages']:
-                     lang = results['languages'][0].lower()
-                     # Only add if it's not already in the filename
-                     if lang not in base_filename.lower():
-                         base_filename = f"{base_filename}_{lang}"
- 
-                 # For PDFs, add page information
-                 if 'total_pages' in results and 'processed_pages' in results:
-                     base_filename = f"{base_filename}_p{results['processed_pages']}of{results['total_pages']}"
- 
-                 # Add timestamp if available
-                 if 'timestamp' in results:
-                     try:
-                         # Try to parse the timestamp and reformat it
-                         dt = datetime.strptime(results['timestamp'], "%Y-%m-%d %H:%M")
-                         timestamp = dt.strftime("%Y%m%d_%H%M%S")
-                         base_filename = f"{base_filename}_{timestamp}"
-                     except:
-                         # If parsing fails, create a new timestamp
-                         timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
-                         base_filename = f"{base_filename}_{timestamp}"
-                 else:
-                     # No timestamp in the result, create a new one
-                     timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
-                     base_filename = f"{base_filename}_{timestamp}"
- 
-                 # Add JSON results with descriptive name
-                 results_json = json.dumps(results, indent=2)
-                 zipf.writestr(f"{base_filename}.json", results_json)
- 
-                 # Add HTML content with descriptive name
-                 html_content = create_html_with_images(results)
-                 zipf.writestr(f"{base_filename}_with_images.html", html_content)
- 
-                 # Add raw OCR text if available
-                 if "ocr_contents" in results and "raw_text" in results["ocr_contents"]:
-                     zipf.writestr(f"{base_filename}.txt", results["ocr_contents"]["raw_text"])
- 
-                 # Add HTML visualization if available
-                 if "html_visualization" in results:
-                     zipf.writestr("visualization.html", results["html_visualization"])
- 
-                 # Add images if available
-                 if "pages_data" in results:
-                     for page_idx, page in enumerate(results["pages_data"]):
-                         for img_idx, img in enumerate(page.get("images", [])):
-                             img_base64 = img.get("image_base64", "")
-                             if img_base64:
-                                 # Strip data URL prefix if present
-                                 if img_base64.startswith("data:image"):
-                                     img_base64 = img_base64.split(",", 1)[1]
- 
-                                 # Decode base64 and add to zip
-                                 try:
-                                     img_data = base64.b64decode(img_base64)
-                                     zipf.writestr(f"images/page_{page_idx+1}_img_{img_idx+1}.jpg", img_data)
-                                 except:
-                                     pass
-             except Exception:
-                 # If processing fails, return empty zip
-                 pass
- 
-     # Seek to the beginning of the BytesIO object
-     zip_buffer.seek(0)
- 
-     # Return the zip file bytes
-     return zip_buffer.getvalue()
- 
- def create_results_zip(results, output_dir=None, zip_name=None):
-     """
-     Create a zip file containing OCR results.
- 
-     Args:
-         results: Dictionary or list of OCR results
-         output_dir: Optional output directory
-         zip_name: Optional zip file name
- 
-     Returns:
-         Path to the created zip file
-     """
-     # Create temporary output directory if not provided
-     if output_dir is None:
-         output_dir = Path.cwd() / "output"
-         output_dir.mkdir(exist_ok=True)
-     else:
-         output_dir = Path(output_dir)
-         output_dir.mkdir(exist_ok=True)
- 
-     # Check if results is a list or a dictionary
-     is_list = isinstance(results, list)
- 
-     # Generate zip name if not provided
-     if zip_name is None:
-         timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
- 
-         if is_list:
-             # For a list of results, create a more descriptive name based on the content
-             file_count = len(results)
- 
-             # Count document types
-             pdf_count = sum(1 for r in results if r.get('file_name', '').lower().endswith('.pdf'))
-             img_count = sum(1 for r in results if r.get('file_name', '').lower().endswith(('.jpg', '.jpeg', '.png')))
- 
-             # Create descriptive name based on contents
-             if pdf_count > 0 and img_count > 0:
-                 zip_name = f"historical_ocr_mixed_{pdf_count}pdf_{img_count}img_{timestamp}.zip"
-             elif pdf_count > 0:
-                 zip_name = f"historical_ocr_pdf_documents_{pdf_count}_{timestamp}.zip"
-             elif img_count > 0:
-                 zip_name = f"historical_ocr_images_{img_count}_{timestamp}.zip"
-             else:
-                 zip_name = f"historical_ocr_results_{file_count}_{timestamp}.zip"
-         else:
-             # For single result, create descriptive filename
-             base_name = results.get("file_name", "document").split('.')[0]
- 
-             # Add document type if available
-             if 'topics' in results and results['topics']:
-                 topic = results['topics'][0].lower().replace(' ', '_')
-                 base_name = f"{base_name}_{topic}"
- 
-             # Add language if available
-             if 'languages' in results and results['languages']:
-                 lang = results['languages'][0].lower()
-                 # Only add if it's not already in the filename
-                 if lang not in base_name.lower():
-                     base_name = f"{base_name}_{lang}"
- 
-             # For PDFs, add page information
-             if 'total_pages' in results and 'processed_pages' in results:
-                 base_name = f"{base_name}_p{results['processed_pages']}of{results['total_pages']}"
- 
-             # Add timestamp
-             zip_name = f"{base_name}_{timestamp}.zip"
- 
-     try:
-         # Get zip data in memory first
-         zip_data = create_results_zip_in_memory(results)
- 
-         # Save to file
-         zip_path = output_dir / zip_name
-         with open(zip_path, 'wb') as f:
-             f.write(zip_data)
- 
-         return zip_path
-     except Exception as e:
-         # Create an empty zip file as fallback
-         zip_path = output_dir / zip_name
-         with zipfile.ZipFile(zip_path, 'w') as zipf:
-             zipf.writestr("info.txt", "Could not create complete archive")
- 
-         return zip_path
- 
- 
- # Advanced image preprocessing functions
- 
- def preprocess_image_for_ocr(image_path: Union[str, Path]) -> Tuple[Image.Image, str]:
-     """
-     Preprocess an image for optimal OCR performance with enhanced speed and memory optimization.
-     Enhanced to handle large newspaper and document images.
- 
-     Args:
-         image_path: Path to the image file
- 
-     Returns:
-         Tuple of (processed PIL Image, base64 string)
-     """
-     # Fast path: Skip all processing if PIL not available
-     if not PILLOW_AVAILABLE:
-         logger.info("PIL not available, skipping image preprocessing")
-         return None, encode_image_for_api(image_path)
- 
-     # Convert to Path object if string
-     image_file = Path(image_path) if isinstance(image_path, str) else image_path
- 
-     # Thread-safe caching with early exit for already processed images
-     try:
-         # Fast stat calls for file metadata - consolidate to reduce I/O
-         file_stat = image_file.stat()
-         file_size = file_stat.st_size
-         file_size_mb = file_size / (1024 * 1024)
-         mod_time = file_stat.st_mtime
- 
-         # Create a cache key based on essential file properties
-         cache_key = f"{image_file.name}_{file_size}_{mod_time}"
- 
-         # Fast path: Return cached result if available
-         if hasattr(preprocess_image_for_ocr, "_cache") and cache_key in preprocess_image_for_ocr._cache:
-             logger.debug(f"Using cached preprocessing result for {image_file.name}")
-             return preprocess_image_for_ocr._cache[cache_key]
- 
-         # Optimization: Skip heavy processing for very small files
-         # Small images (less than 100KB) likely don't need preprocessing
-         if file_size < 100000:  # 100KB
-             logger.info(f"Image {image_file.name} is small ({file_size/1024:.1f}KB), using minimal processing")
-             with Image.open(image_file) as img:
-                 # Normalize mode only
-                 if img.mode not in ('RGB', 'L'):
-                     img = img.convert('RGB')
- 
-                 # Save with light optimization
-                 buffer = io.BytesIO()
-                 img.save(buffer, format="JPEG", quality=95, optimize=True)
-                 buffer.seek(0)
- 
-                 # Get base64
-                 encoded_image = base64.b64encode(buffer.getvalue()).decode()
-                 base64_data_url = f"data:image/jpeg;base64,{encoded_image}"
- 
-                 # Cache and return
-                 result = (img, base64_data_url)
-                 if not hasattr(preprocess_image_for_ocr, "_cache"):
-                     preprocess_image_for_ocr._cache = {}
- 
-                 # Clean cache if needed
-                 if len(preprocess_image_for_ocr._cache) > 20:  # Increased cache size for better performance
-                     # Remove oldest 5 entries for better batch processing
-                     for _ in range(5):
-                         if preprocess_image_for_ocr._cache:
-                             preprocess_image_for_ocr._cache.pop(next(iter(preprocess_image_for_ocr._cache)))
- 
-                 preprocess_image_for_ocr._cache[cache_key] = result
-                 return result
- 
-         # Special handling for large newspaper-style documents
-         if file_size_mb > 5 and image_file.name.lower().endswith(('.jpg', '.jpeg', '.png')):
-             logger.info(f"Large image detected ({file_size_mb:.2f}MB), checking for newspaper format")
-             try:
-                 # Quickly check dimensions without loading full image
-                 with Image.open(image_file) as img:
-                     width, height = img.size
-                     aspect_ratio = width / height
- 
-                     # Newspaper-style documents typically have width > height or are very large
-                     is_newspaper_format = (aspect_ratio > 1.15 and width > 2000) or (width > 3000 or height > 3000)
- 
-                     if is_newspaper_format:
-                         logger.info(f"Newspaper format detected: {width}x{height}, applying specialized processing")
- 
-             except Exception as dim_err:
-                 logger.debug(f"Error checking dimensions: {str(dim_err)}")
-                 is_newspaper_format = False
-         else:
-             is_newspaper_format = False
- 
-     except Exception as e:
-         # If stat or cache handling fails, log and continue with processing
-         logger.debug(f"Cache handling failed for {image_path}: {str(e)}")
-         # Ensure we have a valid file_size_mb for later decisions
-         try:
-             file_size_mb = image_file.stat().st_size / (1024 * 1024)
-         except:
-             file_size_mb = 0  # Default if we can't determine size
- 
-         # Default to not newspaper format on error
-         is_newspaper_format = False
- 
-     try:
-         # Process start time for performance logging
-         start_time = time.time()
- 
-         # Open and process the image with minimal memory footprint
-         with Image.open(image_file) as img:
-             # Normalize image mode
-             if img.mode not in ('RGB', 'L'):
-                 img = img.convert('RGB')
- 
-             # Fast path: Quick check of image properties to determine appropriate processing
-             width, height = img.size
-             image_area = width * height
- 
-             # Detect document type only for medium to large images to save processing time
-             is_document = False
-             is_newspaper = False
- 
-             # More aggressive document type detection for larger images
-             if image_area > 500000:  # Approx 700x700 or larger
-                 # Store image for document detection
-                 _detect_document_type_impl._current_img = img
-                 is_document = _detect_document_type_impl(None)
- 
-                 # Additional check for newspaper format
-                 if is_document:
-                     # Newspapers typically have wide formats or very large dimensions
-                     aspect_ratio = width / height
-                     is_newspaper = (aspect_ratio > 1.15 and width > 2000) or (width > 3000 or height > 3000)
- 
-                 logger.debug(f"Document type detection for {image_file.name}: " +
-                              f"{'newspaper' if is_newspaper else 'document' if is_document else 'photo'}")
- 
-             # Check for handwritten document characteristics
-             is_handwritten = False
-             if CV2_AVAILABLE and not is_newspaper:
-                 # Use more advanced detection for handwritten content
-                 try:
-                     gray_np = np.array(img.convert('L'))
-                     # Higher variance in edge strengths can indicate handwriting
-                     edges = cv2.Canny(gray_np, 30, 100)
-                     if np.count_nonzero(edges) / edges.size > 0.02:  # Low edge threshold for handwriting
-                         # Additional check with gradient magnitudes
-                         sobelx = cv2.Sobel(gray_np, cv2.CV_64F, 1, 0, ksize=3)
-                         sobely = cv2.Sobel(gray_np, cv2.CV_64F, 0, 1, ksize=3)
-                         magnitude = np.sqrt(sobelx**2 + sobely**2)
-                         # Handwriting typically has more variation in gradient magnitudes
-                         if np.std(magnitude) > 20:
-                             is_handwritten = True
-                             logger.info(f"Handwritten document detected: {image_file.name}")
-                 except Exception as e:
-                     logger.debug(f"Handwriting detection error: {str(e)}")
- 
-             # Special processing for very large images (newspapers and large documents)
-             if is_newspaper:
-                 # For newspaper format, we need more specialized processing
-                 logger.info(f"Processing newspaper format image: {width}x{height}")
- 
-                 # For newspapers, we prioritize text clarity over file size
-                 # Use higher target resolution to preserve small text common in newspapers
-                 # But still need to resize if extremely large to avoid API limits
-                 max_dimension = max(width, height)
- 
-                 if max_dimension > 6000:  # Extremely large
-                     scale_factor = 0.4  # Preserve more resolution for newspapers (increased from 0.35)
-                 elif max_dimension > 4000:
-                     scale_factor = 0.6  # Higher resolution for better text extraction (increased from 0.5)
-                 else:
-                     scale_factor = 0.8  # Minimal reduction for moderate newspaper size (increased from 0.7)
- 
-                 # Calculate new dimensions - maintain higher resolution
-                 new_width = int(width * scale_factor)
-                 new_height = int(height * scale_factor)
- 
-                 # Use high-quality resampling to preserve text clarity in newspapers
-                 processed_img = img.resize((new_width, new_height), Image.LANCZOS)
-                 logger.debug(f"Resized newspaper image from {width}x{height} to {new_width}x{new_height}")
- 
-                 # For newspapers, we also want to enhance the contrast and sharpen the image
-                 # before the main OCR processing for better text extraction
-                 if img.mode in ('RGB', 'RGBA'):
-                     # For color newspapers, enhance both the overall image and then convert to grayscale
-                     # This helps with mixed content newspapers that have both text and images
-                     enhancer = ImageEnhance.Contrast(processed_img)
-                     processed_img = enhancer.enhance(1.3)  # Boost contrast but not too aggressively
- 
-                     # Also enhance saturation to make colored text more visible
-                     enhancer_sat = ImageEnhance.Color(processed_img)
-                     processed_img = enhancer_sat.enhance(1.2)
-             # Special processing for handwritten documents
-             elif is_handwritten:
-                 logger.info(f"Processing handwritten document: {width}x{height}")
- 
-                 # For handwritten text, we need to preserve stroke details
-                 # Use gentle scaling to maintain handwriting characteristics
-                 max_dimension = max(width, height)
- 
-                 if max_dimension > 4000:  # Large handwritten document
-                     scale_factor = 0.6  # Less aggressive reduction for handwriting
-                 else:
-                     scale_factor = 0.8  # Minimal reduction for moderate size
- 
-                 # Calculate new dimensions
-                 new_width = int(width * scale_factor)
-                 new_height = int(height * scale_factor)
- 
-                 # Use high-quality resampling to preserve handwriting details
-                 processed_img = img.resize((new_width, new_height), Image.LANCZOS)
- 
-                 # Lower contrast enhancement for handwriting to preserve stroke details
-                 if img.mode in ('RGB', 'RGBA'):
-                     # Convert to grayscale for better text processing
-                     processed_img = processed_img.convert('L')
- 
-                     # Use reduced contrast enhancement to preserve subtle strokes
-                     enhancer = ImageEnhance.Contrast(processed_img)
-                     processed_img = enhancer.enhance(1.2)  # Lower contrast value for handwriting
- 
-             # Standard processing for other large images
-             elif file_size_mb > IMAGE_PREPROCESSING["max_size_mb"] or max(width, height) > 3000:
-                 # Calculate target dimensions directly instead of using the heavier resize function
-                 target_width, target_height = width, height
-                 max_dimension = max(width, height)
- 
-                 # Use a sliding scale for reduction based on image size
-                 if max_dimension > 5000:
-                     scale_factor = 0.3  # Slightly less aggressive reduction (was 0.25)
-                 elif max_dimension > 3000:
-                     scale_factor = 0.45  # Slightly less aggressive reduction (was 0.4)
-                 else:
-                     scale_factor = 0.65  # Slightly less aggressive reduction (was 0.6)
- 
-                 # Calculate new dimensions
-                 new_width = int(width * scale_factor)
-                 new_height = int(height * scale_factor)
- 
-                 # Use direct resize with optimized resampling filter based on image size
-                 if image_area > 3000000:  # Very large, use faster but lower quality
-                     processed_img = img.resize((new_width, new_height), Image.BILINEAR)
-                 else:  # Medium size, use better quality
-                     processed_img = img.resize((new_width, new_height), Image.LANCZOS)
- 
-                 logger.debug(f"Resized image from {width}x{height} to {new_width}x{new_height}")
-             else:
-                 # Skip resizing for smaller images
-                 processed_img = img
- 
-             # Apply appropriate processing based on document type and size
-             if is_document:
-                 # Process as document with optimized path based on size
-                 if image_area > 1000000:  # Full processing for larger documents
-                     preprocess_document_image._current_img = processed_img
-                     processed = _preprocess_document_image_impl()
-                 else:  # Lightweight processing for smaller documents
-                     # Just enhance contrast for small documents to save time
-                     enhancer = ImageEnhance.Contrast(processed_img)
-                     processed = enhancer.enhance(1.3)
-             else:
-                 # Process as photo with optimized path based on size
-                 if image_area > 1000000:  # Full processing for larger photos
-                     preprocess_general_image._current_img = processed_img
-                     processed = _preprocess_general_image_impl()
-                 else:  # Skip processing for smaller photos
-                     processed = processed_img
- 
-             # Optimize memory handling during encoding
-             buffer = io.BytesIO()
- 
-             # Adjust quality based on image size to optimize API payload
-             if file_size_mb > 5:
-                 quality = 85  # Lower quality for large files
-             else:
-                 quality = IMAGE_PREPROCESSING["compression_quality"]
- 
-             # Save with optimized parameters
-             processed.save(buffer, format="JPEG", quality=quality, optimize=True)
-             buffer.seek(0)
- 
-             # Get base64 with minimal memory footprint
-             encoded_image = base64.b64encode(buffer.getvalue()).decode()
-             # Always use image/jpeg MIME type since we explicitly save as JPEG above
-             base64_data_url = f"data:image/jpeg;base64,{encoded_image}"
- 
-             # Update cache thread-safely
-             result = (processed, base64_data_url)
-             if not hasattr(preprocess_image_for_ocr, "_cache"):
-                 preprocess_image_for_ocr._cache = {}
- 
-             # LRU-like cache management with improved clearing
-             if len(preprocess_image_for_ocr._cache) > 20:
-                 try:
-                     # Remove several entries to avoid frequent cache clearing
-                     for _ in range(5):
-                         if preprocess_image_for_ocr._cache:
-                             preprocess_image_for_ocr._cache.pop(next(iter(preprocess_image_for_ocr._cache)))
-                 except:
-                     # If removal fails, just continue
-                     pass
- 
-             # Add to cache
-             try:
-                 preprocess_image_for_ocr._cache[cache_key] = result
-             except Exception:
-                 # If caching fails, just proceed
-                 pass
- 
-             # Log performance metrics
-             processing_time = time.time() - start_time
-             logger.debug(f"Image preprocessing completed in {processing_time:.3f}s for {image_file.name}")
- 
-             # Return both processed image and base64 string
-             return result
- 
-     except Exception as e:
-         # If preprocessing fails, log error and use original image
-         logger.warning(f"Image preprocessing failed: {str(e)}. Using original image.")
-         return None, encode_image_for_api(image_path)
- 
- # Removed caching decorator to fix unhashable type error
- def detect_document_type(img: Image.Image) -> bool:
-     """
-     Detect if an image is likely a document (text-heavy) vs. a photo.
- 
-     Args:
-         img: PIL Image object
- 
-     Returns:
-         True if likely a document, False otherwise
-     """
-     # Direct implementation without caching
-     return _detect_document_type_impl(None)
- 
- def _detect_document_type_impl(img_hash=None) -> bool:
-     """
-     Optimized implementation of document type detection for faster processing.
-     The img_hash parameter is unused but kept for backward compatibility.
- 
-     Enhanced to better detect handwritten documents and newspaper formats.
-     """
-     # Fast path: Get the image from thread-local storage
-     if not hasattr(_detect_document_type_impl, "_current_img"):
-         return False  # Fail safe in case image is not set
- 
-     img = _detect_document_type_impl._current_img
- 
-     # Skip processing for tiny images - just classify as non-documents
-     width, height = img.size
-     if width * height < 100000:  # Approx 300x300 or smaller
-         return False
- 
-     # Convert to grayscale for analysis (using faster conversion)
-     gray_img = img.convert('L')
- 
-     # PIL-only path for systems without OpenCV
-     if not CV2_AVAILABLE:
-         # Faster method: Sample a subset of the image for edge detection
-         # Downscale image for faster processing
-         sample_size = min(width, height, 1000)
-         scale_factor = sample_size / max(width, height)
- 
-         if scale_factor < 0.9:  # Only resize if significant reduction
-             sample_img = gray_img.resize(
-                 (int(width * scale_factor), int(height * scale_factor)),
-                 Image.NEAREST  # Fastest resampling method
-             )
-         else:
-             sample_img = gray_img
- 
-         # Fast edge detection on sample
-         edges = sample_img.filter(ImageFilter.FIND_EDGES)
- 
-         # Count edge pixels using threshold (faster than summing individual pixels)
-         edge_data = edges.getdata()
-         edge_threshold = 40  # Lowered threshold to better detect handwritten texts
- 
-         # Use list comprehension for better performance
-         edge_count = sum(1 for p in edge_data if p > edge_threshold)
-         total_pixels = len(edge_data)
-         edge_ratio = edge_count / total_pixels
- 
-         # Check if bright areas exist - simple approximation of text/background contrast
-         bright_count = sum(1 for p in gray_img.getdata() if p > 200)
-         bright_ratio = bright_count / (width * height)
- 
-         # Documents typically have more edges (text boundaries) and bright areas (background)
-         # Lowered edge threshold to better detect handwritten documents
-         return edge_ratio > 0.035 or bright_ratio > 0.4
- 
-     # OpenCV path - optimized for speed and enhanced for handwritten documents
-     img_np = np.array(gray_img)
- 
-     # 1. Fast check: Variance of pixel values
-     # Documents typically have high variance (text on background)
-     # Handwritten documents may have less contrast than printed text
-     std_dev = np.std(img_np)
-     if std_dev > 40:  # Further lowered threshold to better detect handwritten documents with low contrast
-         return True
- 
-     # 2. Quick check using downsampled image for edges
-     # Downscale for faster processing on large images
-     if max(img_np.shape) > 1000:
-         scale = 1000 / max(img_np.shape)
-         small_img = cv2.resize(img_np, None, fx=scale, fy=scale, interpolation=cv2.INTER_NEAREST)
-     else:
-         small_img = img_np
- 
-     # Enhanced edge detection for handwritten documents
-     # Use multiple Canny thresholds to better capture both faint and bold strokes
-     edges_low = cv2.Canny(small_img, 20, 110, L2gradient=False)  # For faint handwriting
-     edges_high = cv2.Canny(small_img, 30, 150, L2gradient=False)  # For standard text
- 
-     # Combine edge detection results
-     edges = cv2.bitwise_or(edges_low, edges_high)
-     edge_ratio = np.count_nonzero(edges) / edges.size
- 
-     # Special handling for potential handwritten content - more sensitive detection
-     handwritten_indicator = False
-     if edge_ratio > 0.015:  # Lower threshold specifically for handwritten content
-         try:
890
- # Look for handwriting stroke characteristics using gradient analysis
891
- # Compute gradient magnitudes and directions
892
- sobelx = cv2.Sobel(small_img, cv2.CV_64F, 1, 0, ksize=3)
893
- sobely = cv2.Sobel(small_img, cv2.CV_64F, 0, 1, ksize=3)
894
- magnitude = np.sqrt(sobelx**2 + sobely**2)
895
-
896
- # Handwriting typically has higher variation in gradient magnitudes
897
- if np.std(magnitude) > 18: # Lower threshold for more sensitivity
898
- # Handwriting is indicated if we also have some line structure
899
- # Try to find line segments that could indicate text lines
900
- lines = cv2.HoughLinesP(edges, 1, np.pi/180,
901
- threshold=45, # Lower threshold for handwriting
902
- minLineLength=25, # Shorter minimum line length
903
- maxLineGap=25) # Larger gap for disconnected handwriting
904
-
905
- if lines is not None and len(lines) > 8: # Fewer line segments needed
906
- handwritten_indicator = True
907
- except Exception:
908
- # If analysis fails, continue with other checks
909
- pass
910
-
911
- # 3. Enhanced histogram analysis for handwritten content
912
- # Use more granular bins for better detection of varying stroke densities
913
- dark_mask = img_np < 65 # Increased threshold to capture lighter handwritten text
914
- medium_mask = (img_np >= 65) & (img_np < 170) # Medium gray range for handwriting
915
- light_mask = img_np > 175 # Slightly adjusted for aged paper
916
-
917
- dark_ratio = np.count_nonzero(dark_mask) / img_np.size
918
- medium_ratio = np.count_nonzero(medium_mask) / img_np.size
919
- light_ratio = np.count_nonzero(light_mask) / img_np.size
920
-
921
- # Handwritten documents often have more medium-gray content than printed text
922
- # This helps detect pencil or faded ink handwriting
923
- if medium_ratio > 0.3 and edge_ratio > 0.015:
924
- return True
925
-
926
- # Special analysis for handwritten documents
927
- # Return true immediately if handwriting characteristics detected
928
- if handwritten_indicator:
929
- return True
930
-
931
- # Combine heuristics for final decision with improved sensitivity
932
- # Lower thresholds for handwritten documents
933
- return (dark_ratio > 0.025 and light_ratio > 0.2) or edge_ratio > 0.025
934
-
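A note on the _current_img handoff used throughout this file: the wrapper above delegates via _detect_document_type_impl(None) without assigning _detect_document_type_impl._current_img itself, so unless a caller sets that attribute beforehand, the fail-safe branch always returns False. A sketch of the intended pattern (hypothetical repair, not the committed code):

def detect_document_type(img):
    # Stash the image on the implementation function, then delegate
    _detect_document_type_impl._current_img = img
    try:
        return _detect_document_type_impl()
    finally:
        # Drop the reference so the image can be garbage-collected
        del _detect_document_type_impl._current_img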
935
- # Removed caching to fix unhashable type error
936
- def preprocess_document_image(img: Image.Image) -> Image.Image:
937
- """
938
- Preprocess a document image for optimal OCR.
939
-
940
- Args:
941
- img: PIL Image object
942
-
943
- Returns:
944
- Processed PIL Image
945
- """
946
- # Store the image for the implementation function
947
- preprocess_document_image._current_img = img
948
- # The actual implementation is separated for cleaner code organization
949
- return _preprocess_document_image_impl()
950
-
951
- def _preprocess_document_image_impl() -> Image.Image:
952
- """
953
- Optimized implementation of document preprocessing with adaptive processing based on image size.
954
- Enhanced for better handwritten document processing and newspaper format.
955
- """
956
- # Fast path: Get image from thread-local storage
957
- if not hasattr(preprocess_document_image, "_current_img"):
958
- raise ValueError("No image set for document preprocessing")
959
-
960
- img = preprocess_document_image._current_img
961
-
962
- # Analyze image size to determine processing strategy
963
- width, height = img.size
964
- img_size = width * height
965
-
966
- # Detect special document types
967
- is_handwritten = False
968
- is_newspaper = False
969
-
970
- # Check for newspaper format first (takes precedence)
971
- aspect_ratio = width / height
972
- if (aspect_ratio > 1.15 and width > 2000) or (width > 3000 or height > 3000):
973
- is_newspaper = True
974
- logger.debug(f"Newspaper format detected: {width}x{height}, aspect ratio: {aspect_ratio:.2f}")
975
- else:
976
- # If not newspaper, check if handwritten
977
- try:
978
- # Simple check for handwritten document characteristics
979
- # Handwritten documents often have more varied strokes and less stark contrast
980
- if CV2_AVAILABLE:
981
- # Convert to grayscale and calculate local variance
982
- gray_np = np.array(img.convert('L'))
983
- # Higher variance in edge strengths can indicate handwriting
984
- edges = cv2.Canny(gray_np, 30, 100)
985
- if np.count_nonzero(edges) / edges.size > 0.02: # Low edge threshold for handwriting
986
- # Additional check with gradient magnitudes
987
- sobelx = cv2.Sobel(gray_np, cv2.CV_64F, 1, 0, ksize=3)
988
- sobely = cv2.Sobel(gray_np, cv2.CV_64F, 0, 1, ksize=3)
989
- magnitude = np.sqrt(sobelx**2 + sobely**2)
990
- # Handwriting typically has more variation in gradient magnitudes
991
- if np.std(magnitude) > 20:
992
- is_handwritten = True
993
- except:
994
- # If detection fails, assume it's not handwritten
995
- pass
996
-
997
- # Special processing for newspaper format
998
- if is_newspaper:
999
- # Convert to grayscale for better text extraction
1000
- gray = img.convert('L')
1001
-
1002
- # For newspapers, we need aggressive text enhancement to make small print readable
1003
- # First enhance contrast more aggressively for newspaper small text
1004
- enhancer = ImageEnhance.Contrast(gray)
1005
- enhanced = enhancer.enhance(2.0) # More aggressive contrast for newspaper text
1006
-
1007
- # Apply stronger sharpening to make small text more defined
1008
- if IMAGE_PREPROCESSING["sharpen"]:
1009
- # Apply multiple passes of sharpening for newspaper text
1010
- enhanced = enhanced.filter(ImageFilter.SHARPEN)
1011
- enhanced = enhanced.filter(ImageFilter.EDGE_ENHANCE_MORE) # Stronger edge enhancement
1012
-
1013
- # Enhanced processing for newspapers with OpenCV when available
1014
- if CV2_AVAILABLE:
1015
- try:
1016
- # Convert to numpy array
1017
- img_np = np.array(enhanced)
1018
-
1019
- # For newspaper text extraction, CLAHE (Contrast Limited Adaptive Histogram Equalization)
1020
- # works much better than simple contrast enhancement
1021
- clahe = cv2.createCLAHE(clipLimit=3.0, tileGridSize=(8, 8))
1022
- img_np = clahe.apply(img_np)
1023
-
1024
- # Apply different adaptive thresholding approaches and choose the best one
1025
-
1026
- # 1. Standard adaptive threshold with larger block size for newspaper columns
1027
- binary1 = cv2.adaptiveThreshold(img_np, 255,
1028
- cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
1029
- cv2.THRESH_BINARY, 15, 4)
1030
-
1031
- # 2. Otsu's method for global thresholding - works well for clean newspaper print
1032
- _, binary2 = cv2.threshold(img_np, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
1033
-
1034
- # Try to determine which method preserves text better
1035
- # Count white pixels and edges in each binary version
1036
- white_pixels1 = np.count_nonzero(binary1 > 200)
1037
- white_pixels2 = np.count_nonzero(binary2 > 200)
1038
-
1039
- # Calculate edge density to help determine which preserves text features better
1040
- edges1 = cv2.Canny(binary1, 100, 200)
1041
- edges2 = cv2.Canny(binary2, 100, 200)
1042
- edge_count1 = np.count_nonzero(edges1)
1043
- edge_count2 = np.count_nonzero(edges2)
1044
-
1045
- # For newspaper text, we want to preserve more edges while maintaining reasonable
1046
- # white space (typical of printed text on paper background)
1047
- if (edge_count1 > edge_count2 * 1.2 and white_pixels1 > white_pixels2 * 0.7) or \
1048
- (white_pixels1 < white_pixels2 * 0.5): # If Otsu removed too much content
1049
- # Adaptive thresholding usually better preserves small text in newspapers
1050
- logger.debug("Using adaptive thresholding for newspaper text")
1051
-
1052
- # Apply optional denoising to clean up small speckles
1053
- result = cv2.fastNlMeansDenoising(binary1, None, 7, 7, 21)
1054
- return Image.fromarray(result)
1055
- else:
1056
- # Otsu method was better
1057
- logger.debug("Using Otsu thresholding for newspaper text")
1058
- result = cv2.fastNlMeansDenoising(binary2, None, 7, 7, 21)
1059
- return Image.fromarray(result)
1060
-
1061
- except Exception as e:
1062
- logger.debug(f"Advanced newspaper processing failed: {str(e)}")
1063
- # Fall back to PIL processing
1064
- pass
1065
-
1066
- # If OpenCV not available or fails, apply additional PIL enhancements
1067
- # Create a more aggressive binary version to better separate text
1068
- binary_threshold = enhanced.point(lambda x: 0 if x < 150 else 255, '1')
1069
-
1070
- # Return enhanced binary image
1071
- return binary_threshold
1072
-
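The adaptive-versus-Otsu selection above reads more clearly as a small scoring function; a self-contained sketch using the same thresholds (constants copied from the code above):

import cv2
import numpy as np

def pick_newspaper_binarization(gray: np.ndarray) -> np.ndarray:
    # Candidate 1: adaptive threshold with a large block size for newspaper columns
    adaptive = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                     cv2.THRESH_BINARY, 15, 4)
    # Candidate 2: Otsu's global threshold for clean print
    _, otsu = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    white_a = np.count_nonzero(adaptive > 200)
    white_o = np.count_nonzero(otsu > 200)
    edges_a = np.count_nonzero(cv2.Canny(adaptive, 100, 200))
    edges_o = np.count_nonzero(cv2.Canny(otsu, 100, 200))
    # Keep adaptive when it preserves more text edges without losing the paper
    # background, or when Otsu has erased too much content
    if (edges_a > edges_o * 1.2 and white_a > white_o * 0.7) or white_a < white_o * 0.5:
        return adaptive
    return otsu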
1073
- # Ultra-fast path for tiny images - just convert to grayscale with contrast enhancement
1074
- if img_size < 300000: # ~500x600 or smaller
1075
- gray = img.convert('L')
1076
- # Lower contrast enhancement for handwritten documents
1077
- contrast_level = 1.4 if is_handwritten else IMAGE_PREPROCESSING["enhance_contrast"]
1078
- enhancer = ImageEnhance.Contrast(gray)
1079
- return enhancer.enhance(contrast_level)
1080
-
1081
- # Fast path for small images - minimal processing
1082
- if img_size < 1000000: # ~1000x1000 or smaller
1083
- gray = img.convert('L')
1084
- # Use gentler contrast enhancement for handwritten documents
1085
- contrast_level = 1.4 if is_handwritten else IMAGE_PREPROCESSING["enhance_contrast"]
1086
- enhancer = ImageEnhance.Contrast(gray)
1087
- enhanced = enhancer.enhance(contrast_level)
1088
-
1089
- # Light sharpening only if sharpen is enabled
1090
- # Use milder sharpening for handwritten documents to preserve stroke detail
1091
- if IMAGE_PREPROCESSING["sharpen"]:
1092
- if is_handwritten:
1093
- # Use edge enhancement which is gentler than SHARPEN for handwriting
1094
- enhanced = enhanced.filter(ImageFilter.EDGE_ENHANCE)
1095
- else:
1096
- enhanced = enhanced.filter(ImageFilter.SHARPEN)
1097
- return enhanced
1098
-
1099
- # Standard path for medium images
1100
- # Convert to grayscale (faster processing)
1101
- gray = img.convert('L')
1102
-
1103
- # Adaptive contrast enhancement based on document type
1104
- contrast_level = 1.4 if is_handwritten else IMAGE_PREPROCESSING["enhance_contrast"]
1105
- enhancer = ImageEnhance.Contrast(gray)
1106
- enhanced = enhancer.enhance(contrast_level)
1107
-
1108
- # Apply light sharpening for text clarity - adapt based on document type
1109
- if IMAGE_PREPROCESSING["sharpen"]:
1110
- if is_handwritten:
1111
- # Use edge enhancement which is gentler than SHARPEN for handwriting
1112
- enhanced = enhanced.filter(ImageFilter.EDGE_ENHANCE)
1113
- else:
1114
- enhanced = enhanced.filter(ImageFilter.SHARPEN)
1115
-
1116
- # Advanced processing with OpenCV if available
1117
- if CV2_AVAILABLE and IMAGE_PREPROCESSING["denoise"]:
1118
- try:
1119
- # Convert to numpy array for OpenCV processing
1120
- img_np = np.array(enhanced)
1121
-
1122
- if is_handwritten:
1123
- # Enhanced processing for handwritten documents
1124
- # Optimized for better stroke preservation and readability
1125
- if img_size > 3000000: # Large images - downsample first
1126
- scale_factor = 0.5
1127
- small_img = cv2.resize(img_np, None, fx=scale_factor, fy=scale_factor,
1128
- interpolation=cv2.INTER_AREA)
1129
-
1130
- # Apply CLAHE for better local contrast in handwriting
1131
- clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
1132
- enhanced_img = clahe.apply(small_img)
1133
-
1134
- # Apply bilateral filter with parameters optimized for handwriting
1135
- # Lower sigma values to preserve more detail
1136
- filtered = cv2.bilateralFilter(enhanced_img, 7, 30, 50)
1137
-
1138
- # Resize back
1139
- filtered = cv2.resize(filtered, (width, height), interpolation=cv2.INTER_LINEAR)
1140
- else:
1141
- # For smaller handwritten images
1142
- # Apply CLAHE for better local contrast
1143
- clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
1144
- enhanced_img = clahe.apply(img_np)
1145
-
1146
- # Apply bilateral filter with parameters optimized for handwriting
1147
- filtered = cv2.bilateralFilter(enhanced_img, 5, 25, 45)
1148
-
1149
- # Adaptive thresholding specific to handwriting
1150
- try:
1151
- # Use larger block size and lower constant for better stroke preservation
1152
- binary = cv2.adaptiveThreshold(
1153
- filtered, 255,
1154
- cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
1155
- cv2.THRESH_BINARY,
1156
- 21, # Larger block size for handwriting
1157
- 5 # Lower constant for better stroke preservation
1158
- )
1159
-
1160
- # Apply slight dilation to connect broken strokes
1161
- kernel = np.ones((2, 2), np.uint8)
1162
- binary = cv2.dilate(binary, kernel, iterations=1)
1163
-
1164
- # Convert back to PIL Image
1165
- return Image.fromarray(binary)
1166
- except Exception as e:
1167
- logger.debug(f"Adaptive threshold for handwriting failed: {str(e)}")
1168
- # Convert filtered image to PIL and return as fallback
1169
- return Image.fromarray(filtered)
1170
-
1171
- else:
1172
- # Standard document processing - optimized for printed text
1173
- # Optimize denoising parameters based on image size
1174
- if img_size > 4000000: # Very large images
1175
- # More aggressive downsampling for very large images
1176
- scale_factor = 0.5
1177
- downsample = cv2.resize(img_np, None, fx=scale_factor, fy=scale_factor,
1178
- interpolation=cv2.INTER_AREA)
1179
-
1180
- # Lighter denoising for downsampled image
1181
- h_value = 7 # Strength parameter
1182
- template_window = 5
1183
- search_window = 13
1184
-
1185
- # Apply denoising on smaller image
1186
- denoised_np = cv2.fastNlMeansDenoising(downsample, None, h_value, template_window, search_window)
1187
-
1188
- # Resize back to original size
1189
- denoised_np = cv2.resize(denoised_np, (width, height), interpolation=cv2.INTER_LINEAR)
1190
- else:
1191
- # Direct denoising for medium-large images
1192
- h_value = 8 # Balanced for speed and quality
1193
- template_window = 5
1194
- search_window = 15
1195
-
1196
- # Apply denoising
1197
- denoised_np = cv2.fastNlMeansDenoising(img_np, None, h_value, template_window, search_window)
1198
-
1199
- # Convert back to PIL Image
1200
- enhanced = Image.fromarray(denoised_np)
1201
-
1202
- # Apply adaptive thresholding only if it improves text visibility
1203
- # Create a binarized version of the image
1204
- if img_size < 8000000: # Skip for extremely large images to save processing time
1205
- binary = cv2.adaptiveThreshold(denoised_np, 255,
1206
- cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
1207
- cv2.THRESH_BINARY, 11, 2)
1208
-
1209
- # Quick verification that binarization preserves text information
1210
- # Use simplified check that works well for document images
1211
- white_pixels_binary = np.count_nonzero(binary > 200)
1212
- white_pixels_orig = np.count_nonzero(denoised_np > 200)
1213
-
1214
- # Check if binary preserves reasonable amount of white pixels (background)
1215
- if white_pixels_binary > white_pixels_orig * 0.8:
1216
- # Binarization looks good, use it
1217
- return Image.fromarray(binary)
1218
-
1219
- return enhanced
1220
-
1221
- except Exception as e:
1222
- # If OpenCV processing fails, continue with PIL-enhanced image
1223
- pass
1224
-
1225
- elif IMAGE_PREPROCESSING["denoise"]:
1226
- # Fallback PIL denoising for systems without OpenCV
1227
- if is_handwritten:
1228
- # Lighter filtering for handwritten text to preserve details
1229
- # Use a smaller median filter for handwritten documents
1230
- enhanced = enhanced.filter(ImageFilter.MedianFilter(1))
1231
- else:
1232
- # Standard filtering for printed documents
1233
- enhanced = enhanced.filter(ImageFilter.MedianFilter(3))
1234
-
1235
- # Return enhanced grayscale image
1236
- return enhanced
1237
-
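The white-pixel comparison used above before accepting a binarized result generalizes to a one-line invariant; a sketch (the 0.8 ratio mirrors the code above):

import numpy as np

def binarization_preserves_background(gray: np.ndarray, binary: np.ndarray,
                                      keep_ratio: float = 0.8) -> bool:
    # A safe binarization keeps most of the bright paper background
    white_binary = np.count_nonzero(binary > 200)
    white_gray = np.count_nonzero(gray > 200)
    return white_gray == 0 or white_binary > white_gray * keep_ratio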
1238
- # Removed caching to fix unhashable type error
1239
- def preprocess_general_image(img: Image.Image) -> Image.Image:
1240
- """
1241
- Preprocess a general image for OCR.
1242
-
1243
- Args:
1244
- img: PIL Image object
1245
-
1246
- Returns:
1247
- Processed PIL Image
1248
- """
1249
- # Store the image for implementation function
1250
- preprocess_general_image._current_img = img
1251
- return _preprocess_general_image_impl()
1252
-
1253
- def _preprocess_general_image_impl() -> Image.Image:
1254
- """
1255
- Optimized implementation of general image preprocessing with size-based processing paths
1256
- """
1257
- # Fast path: Get the image from thread-local storage
1258
- if not hasattr(preprocess_general_image, "_current_img"):
1259
- raise ValueError("No image set for general preprocessing")
1260
-
1261
- img = preprocess_general_image._current_img
1262
-
1263
- # Ultra-fast path: Skip processing completely for small images to improve performance
1264
- width, height = img.size
1265
- img_size = width * height
1266
- if img_size < 300000: # Skip for tiny images under ~0.3 megapixel
1267
- # Just ensure correct color mode
1268
- if img.mode != 'RGB':
1269
- return img.convert('RGB')
1270
- return img
1271
-
1272
- # Fast path: Minimal processing for smaller images
1273
- if img_size < 600000: # ~800x750 or smaller
1274
- # Ensure RGB mode
1275
- if img.mode != 'RGB':
1276
- img = img.convert('RGB')
1277
-
1278
- # Very light contrast enhancement only
1279
- enhancer = ImageEnhance.Contrast(img)
1280
- return enhancer.enhance(1.15) # Lighter enhancement for small images
1281
-
1282
- # Standard path: Apply moderate enhancements for medium images
1283
- # Convert to RGB to ensure compatibility
1284
- if img.mode != 'RGB':
1285
- img = img.convert('RGB')
1286
-
1287
- # Moderate enhancement only
1288
- enhancer = ImageEnhance.Contrast(img)
1289
- enhanced = enhancer.enhance(1.2) # Less aggressive than document enhancement
1290
-
1291
- # Skip additional processing for medium-sized images
1292
- if img_size < 1000000: # Skip for images under ~1 megapixel
1293
- return enhanced
1294
-
1295
- # Enhanced path: Additional processing for larger images
1296
- try:
1297
- # Apply optimized enhancement pipeline for large non-document images
1298
-
1299
- # 1. Improve color saturation slightly for better feature extraction
1300
- saturation = ImageEnhance.Color(enhanced)
1301
- enhanced = saturation.enhance(1.1)
1302
-
1303
- # 2. Apply adaptive sharpening based on image size
1304
- if img_size > 2500000: # Very large images (~1600x1600 or larger)
1305
- # Use EDGE_ENHANCE instead of SHARPEN for more subtle enhancement on large images
1306
- enhanced = enhanced.filter(ImageFilter.EDGE_ENHANCE)
1307
- else:
1308
- # Standard sharpening for regular large images
1309
- enhanced = enhanced.filter(ImageFilter.SHARPEN)
1310
-
1311
- # 3. Apply additional processing with OpenCV if available (for largest images)
1312
- if CV2_AVAILABLE and img_size > 3000000:
1313
- # Convert to numpy array
1314
- img_np = np.array(enhanced)
1315
-
1316
- # Apply subtle enhancement of details (CLAHE)
1317
- try:
1318
- # Convert to LAB color space for better processing
1319
- lab = cv2.cvtColor(img_np, cv2.COLOR_RGB2LAB)
1320
-
1321
- # Only enhance the L channel (luminance)
1322
- l, a, b = cv2.split(lab)
1323
-
1324
- # Create CLAHE object with optimal parameters for photos
1325
- clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
1326
-
1327
- # Apply CLAHE to L channel
1328
- l = clahe.apply(l)
1329
-
1330
- # Merge channels back and convert to RGB
1331
- lab = cv2.merge((l, a, b))
1332
- enhanced_np = cv2.cvtColor(lab, cv2.COLOR_LAB2RGB)
1333
-
1334
- # Convert back to PIL
1335
- enhanced = Image.fromarray(enhanced_np)
1336
- except:
1337
- # If CLAHE fails, continue with PIL-enhanced image
1338
- pass
1339
-
1340
- except Exception:
1341
- # If any enhancement fails, fall back to basic contrast enhancement
1342
- if img.mode != 'RGB':
1343
- img = img.convert('RGB')
1344
- enhancer = ImageEnhance.Contrast(img)
1345
- enhanced = enhancer.enhance(1.2)
1346
-
1347
- return enhanced
1348
-
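The LAB-space CLAHE step buried in the nested try/except above is the heart of the large-photo path; extracted as a standalone helper it is easier to test (a sketch, assuming OpenCV and Pillow are available):

import cv2
import numpy as np
from PIL import Image

def clahe_luminance(img: Image.Image, clip: float = 2.0) -> Image.Image:
    # Equalize local contrast on the L (luminance) channel only,
    # leaving the a/b color channels untouched
    lab = cv2.cvtColor(np.array(img.convert('RGB')), cv2.COLOR_RGB2LAB)
    l, a, b = cv2.split(lab)
    l = cv2.createCLAHE(clipLimit=clip, tileGridSize=(8, 8)).apply(l)
    return Image.fromarray(cv2.cvtColor(cv2.merge((l, a, b)), cv2.COLOR_LAB2RGB))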
1349
- # Removed caching decorator to fix unhashable type error
1350
- def resize_image(img: Image.Image, target_dpi: int = 300) -> Image.Image:
1351
- """
1352
- Resize an image to an optimal size for OCR while preserving quality.
1353
-
1354
- Args:
1355
- img: PIL Image object
1356
- target_dpi: Target DPI (dots per inch)
1357
-
1358
- Returns:
1359
- Resized PIL Image
1360
- """
1361
- # Store the image for implementation function
1362
- resize_image._current_img = img
1363
- return resize_image_impl(target_dpi)
1364
 
1365
- def resize_image_impl(target_dpi: int = 300) -> Image.Image:
1366
  """
1367
- Implementation of resize function that uses thread-local storage.
1368
 
1369
  Args:
1370
- target_dpi: Target DPI (dots per inch)
1371
-
1372
- Returns:
1373
- Resized PIL Image
1374
- """
1375
- # Get the image from thread-local storage (set by the caller)
1376
- if not hasattr(resize_image, "_current_img"):
1377
- raise ValueError("No image set for resizing")
1378
-
1379
- img = resize_image._current_img
1380
-
1381
- # Calculate current dimensions
1382
- width, height = img.size
1383
-
1384
- # Fixed target dimensions based on DPI
1385
- # Using larger dimensions to support newspapers and large documents
1386
- max_width = int(14 * target_dpi) # Increased from 8.5 to 14 inches
1387
- max_height = int(22 * target_dpi) # Increased from 11 to 22 inches
1388
-
1389
- # Check if resizing is needed - quick early return
1390
- if width <= max_width and height <= max_height:
1391
- return img # No resizing needed
1392
-
1393
- # Calculate scaling factor once
1394
- scale_factor = min(max_width / width, max_height / height)
1395
-
1396
- # Calculate new dimensions
1397
- new_width = int(width * scale_factor)
1398
- new_height = int(height * scale_factor)
1399
-
1400
- # Use BICUBIC for better balance of speed and quality
1401
- return img.resize((new_width, new_height), Image.BICUBIC)
1402
-
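Worked example: at the default 300 DPI the caps come to 4200×6600 px (14×22 in). An 8400×4400 scan gets scale_factor = min(4200/8400, 6600/4400) = min(0.5, 1.5) = 0.5 and is returned at 4200×2200, while a 4000×6000 scan is already within both caps and is returned unchanged.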
1403
- def calculate_image_entropy(img: Image.Image) -> float:
1404
- """
1405
- Calculate the entropy (information content) of an image.
1406
-
1407
- Args:
1408
- img: PIL Image object
1409
-
1410
- Returns:
1411
- Entropy value
1412
- """
1413
- # Convert to grayscale
1414
- if img.mode != 'L':
1415
- img = img.convert('L')
1416
-
1417
- # Calculate histogram
1418
- histogram = img.histogram()
1419
- total_pixels = img.width * img.height
1420
-
1421
- # Calculate entropy
1422
- entropy = 0
1423
- for h in histogram:
1424
- if h > 0:
1425
- probability = h / total_pixels
1426
- entropy -= probability * np.log2(probability)
1427
-
1428
- return entropy
1429
-
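The loop above computes the Shannon entropy H = -Σ p·log2(p) over the grayscale histogram, so a perfectly flat 256-level histogram yields 8.0 bits and a constant image yields 0.0. A vectorized equivalent (sketch):

import numpy as np
from PIL import Image

def image_entropy(img: Image.Image) -> float:
    hist = np.asarray(img.convert('L').histogram(), dtype=np.float64)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins; 0 * log2(0) is taken as 0
    return float(-(p * np.log2(p)).sum())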
1430
- def create_html_with_images(result):
1431
- """
1432
- Create an HTML document with embedded images from OCR results.
1433
- Handles serialization of complex OCR objects automatically.
1434
-
1435
- Args:
1436
- result: OCR result dictionary containing pages_data
1437
-
1438
- Returns:
1439
- HTML content as string
1440
- """
1441
- # Ensure result is fully serializable first
1442
- result = serialize_ocr_object(result)
1443
- # Create HTML document structure
1444
- html_content = """
1445
- <!DOCTYPE html>
1446
- <html>
1447
- <head>
1448
- <meta charset="UTF-8">
1449
- <meta name="viewport" content="width=device-width, initial-scale=1.0">
1450
- <title>OCR Document with Images</title>
1451
- <style>
1452
- body {
1453
- font-family: Georgia, serif;
1454
- line-height: 1.7;
1455
- margin: 0 auto;
1456
- max-width: 800px;
1457
- padding: 20px;
1458
- }
1459
- img {
1460
- max-width: 90%;
1461
- max-height: 500px;
1462
- object-fit: contain;
1463
- margin: 20px auto;
1464
- display: block;
1465
- border: 1px solid #ddd;
1466
- border-radius: 4px;
1467
- }
1468
- .image-container {
1469
- margin: 20px 0;
1470
- text-align: center;
1471
- }
1472
- .page-break {
1473
- border-top: 1px solid #ddd;
1474
- margin: 40px 0;
1475
- padding-top: 40px;
1476
- }
1477
- h3 {
1478
- color: #333;
1479
- border-bottom: 1px solid #eee;
1480
- padding-bottom: 10px;
1481
- }
1482
- p {
1483
- margin: 12px 0;
1484
- }
1485
- .page-text-content {
1486
- margin-bottom: 20px;
1487
- }
1488
- .text-block {
1489
- background-color: #f9f9f9;
1490
- padding: 15px;
1491
- border-radius: 4px;
1492
- border-left: 3px solid #546e7a;
1493
- margin-bottom: 15px;
1494
- color: #333;
1495
- }
1496
- .text-block p {
1497
- margin: 8px 0;
1498
- color: #333;
1499
- }
1500
- .metadata {
1501
- background-color: #f5f5f5;
1502
- padding: 10px 15px;
1503
- border-radius: 4px;
1504
- margin-bottom: 20px;
1505
- font-size: 14px;
1506
- }
1507
- .metadata p {
1508
- margin: 5px 0;
1509
- }
1510
- </style>
1511
- </head>
1512
- <body>
1513
- """
1514
-
1515
- # Add document metadata
1516
- html_content += f"""
1517
- <div class="metadata">
1518
- <h2>{result.get('file_name', 'Document')}</h2>
1519
- <p><strong>Processed at:</strong> {result.get('timestamp', '')}</p>
1520
- <p><strong>Languages:</strong> {', '.join(result.get('languages', ['Unknown']))}</p>
1521
- <p><strong>Topics:</strong> {', '.join(result.get('topics', ['Unknown']))}</p>
1522
- </div>
1523
- """
1524
-
1525
- # Check if we have pages_data
1526
- if 'pages_data' in result and result['pages_data']:
1527
- pages_data = result['pages_data']
1528
-
1529
- # Process each page
1530
- for i, page in enumerate(pages_data):
1531
- page_markdown = page.get('markdown', '')
1532
- images = page.get('images', [])
1533
-
1534
- # Add page header if multi-page
1535
- if len(pages_data) > 1:
1536
- html_content += f"<h3>Page {i+1}</h3>"
1537
-
1538
- # Create image dictionary
1539
- image_dict = {}
1540
- for img in images:
1541
- if 'id' in img and 'image_base64' in img:
1542
- image_dict[img['id']] = img['image_base64']
1543
-
1544
- # Process the markdown content
1545
- if page_markdown:
1546
- # Extract text content (lines without images)
1547
- text_content = []
1548
- image_lines = []
1549
-
1550
- for line in page_markdown.split('\n'):
1551
- if '![' in line and '](' in line:
1552
- image_lines.append(line)
1553
- elif line.strip():
1554
- text_content.append(line)
1555
-
1556
- # Add text content
1557
- if text_content:
1558
- html_content += '<div class="text-block">'
1559
- for line in text_content:
1560
- html_content += f"<p>{line}</p>"
1561
- html_content += '</div>'
1562
-
1563
- # Add images
1564
- for line in image_lines:
1565
- # Extract image ID and alt text using simple parsing
1566
- try:
1567
- alt_start = line.find('![') + 2
1568
- alt_end = line.find(']', alt_start)
1569
- alt_text = line[alt_start:alt_end]
1570
-
1571
- img_start = line.find('(', alt_end) + 1
1572
- img_end = line.find(')', img_start)
1573
- img_id = line[img_start:img_end]
1574
-
1575
- if img_id in image_dict:
1576
- html_content += f'<div class="image-container">'
1577
- html_content += f'<img src="{image_dict[img_id]}" alt="{alt_text}">'
1578
- html_content += f'</div>'
1579
- except:
1580
- # If parsing fails, just skip this image
1581
- continue
1582
-
1583
- # Add page separator if not the last page
1584
- if i < len(pages_data) - 1:
1585
- html_content += '<div class="page-break"></div>'
1586
-
1587
- # Add structured content if available
1588
- if 'ocr_contents' in result and isinstance(result['ocr_contents'], dict):
1589
- html_content += '<h3>Structured Content</h3>'
1590
-
1591
- for section, content in result['ocr_contents'].items():
1592
- if content and section not in ['error', 'raw_text', 'partial_text']:
1593
- html_content += f'<h4>{section.replace("_", " ").title()}</h4>'
1594
-
1595
- if isinstance(content, str):
1596
- html_content += f'<p>{content}</p>'
1597
- elif isinstance(content, list):
1598
- html_content += '<ul>'
1599
- for item in content:
1600
- html_content += f'<li>{str(item)}</li>'
1601
- html_content += '</ul>'
1602
- elif isinstance(content, dict):
1603
- html_content += '<dl>'
1604
- for k, v in content.items():
1605
- html_content += f'<dt>{k}</dt><dd>{v}</dd>'
1606
- html_content += '</dl>'
1607
-
1608
- # Close HTML document
1609
- html_content += """
1610
- </body>
1611
- </html>
1612
- """
1613
-
1614
- return html_content
1615
-
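The find()-based extraction of markdown image tags above is equivalent to a single regular expression; a sketch:

import re

# Matches ![alt](id) pairs on a markdown line
IMG_PATTERN = re.compile(r'!\[([^\]]*)\]\(([^)]+)\)')

for alt_text, img_id in IMG_PATTERN.findall('Intro ![Figure 1](img-0.jpeg) outro'):
    print(alt_text, img_id)  # -> Figure 1 img-0.jpeg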
1616
- def generate_document_thumbnail(image_path: Union[str, Path], max_size: int = 300) -> str:
1617
- """
1618
- Generate a thumbnail for document preview.
1619
-
1620
- Args:
1621
- image_path: Path to the image file
1622
- max_size: Maximum dimension for thumbnail
1623
-
1624
- Returns:
1625
- Base64 encoded thumbnail
1626
- """
1627
- if not PILLOW_AVAILABLE:
1628
- return None
1629
-
1630
- try:
1631
- # Open the image
1632
- with Image.open(image_path) as img:
1633
- # Calculate thumbnail size preserving aspect ratio
1634
- width, height = img.size
1635
- if width > height:
1636
- new_width = max_size
1637
- new_height = int(height * (max_size / width))
1638
- else:
1639
- new_height = max_size
1640
- new_width = int(width * (max_size / height))
1641
-
1642
- # Create thumbnail
1643
- thumbnail = img.resize((new_width, new_height), Image.LANCZOS)
1644
-
1645
- # Save to buffer
1646
- buffer = io.BytesIO()
1647
- thumbnail.save(buffer, format="JPEG", quality=85)
1648
- buffer.seek(0)
1649
-
1650
- # Encode as base64
1651
- encoded = base64.b64encode(buffer.getvalue()).decode()
1652
- return f"data:image/jpeg;base64,{encoded}"
1653
- except Exception:
1654
- # Return None if thumbnail generation fails
1655
- return None
1656
-
1657
- def serialize_ocr_object(obj):
1658
- """
1659
- Serialize OCR response objects to JSON serializable format.
1660
- Handles OCRImageObject specifically to prevent serialization errors.
1661
-
1662
- Args:
1663
- obj: The object to serialize
1664
-
1665
- Returns:
1666
- JSON serializable representation of the object
1667
- """
1668
- # Fast path: Handle primitive types directly
1669
- if obj is None or isinstance(obj, (str, int, float, bool)):
1670
- return obj
1671
-
1672
- # Handle collections
1673
- if isinstance(obj, list):
1674
- return [serialize_ocr_object(item) for item in obj]
1675
- elif isinstance(obj, dict):
1676
- return {k: serialize_ocr_object(v) for k, v in obj.items()}
1677
- elif isinstance(obj, OCRImageObject):
1678
- # Special handling for OCRImageObject
1679
- return {
1680
- 'id': obj.id if hasattr(obj, 'id') else None,
1681
- 'image_base64': obj.image_base64 if hasattr(obj, 'image_base64') else None
1682
- }
1683
- elif hasattr(obj, '__dict__'):
1684
- # For objects with __dict__ attribute
1685
- return {k: serialize_ocr_object(v) for k, v in obj.__dict__.items()
1686
- if not k.startswith('_')} # Skip private attributes
1687
- else:
1688
- # Try to convert to string as last resort
1689
- try:
1690
- return str(obj)
1691
- except:
1692
- return None
1693
-
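Typical use is to make an API response JSON-safe before writing it out (the response object named here is hypothetical):

import json

payload = serialize_ocr_object(ocr_response)  # ocr_response: any nested OCR result
with open('result.json', 'w') as f:
    json.dump(payload, f, indent=2)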
1694
- def try_local_ocr_fallback(image_path: Union[str, Path], base64_data_url: str = None) -> str:
1695
- """
1696
- Attempt to use local pytesseract OCR as a fallback when API fails
1697
- With enhanced processing optimized for handwritten content
1698
-
1699
- Args:
1700
- image_path: Path to the image file
1701
  base64_data_url: Optional base64 data URL if already available
1702
 
1703
  Returns:
1704
- OCR text string if successful, None if failed
1705
  """
1706
- logger.info("Attempting local OCR fallback using pytesseract...")
1707
 
1708
  try:
1709
- import pytesseract
1710
- from PIL import Image
1711
-
1712
- # Load image - either from path or from base64
1713
- if base64_data_url and base64_data_url.startswith('data:image'):
1714
- # Extract image from base64
1715
- image_data = base64_data_url.split(',', 1)[1]
1716
- image_bytes = base64.b64decode(image_data)
1717
- image = Image.open(io.BytesIO(image_bytes))
1718
- else:
1719
- # Load from file path
1720
- image_path = Path(image_path) if isinstance(image_path, str) else image_path
1721
- image = Image.open(image_path)
1722
-
1723
- # Auto-detect if this appears to be handwritten
1724
- is_handwritten = False
1725
 
1726
- # Use OpenCV for better detection and preprocessing if available
1727
- if CV2_AVAILABLE:
1728
- try:
1729
- # Convert image to numpy array
1730
- img_np = np.array(image.convert('L'))
1731
-
1732
- # Check for handwritten characteristics
1733
- edges = cv2.Canny(img_np, 30, 100)
1734
- edge_ratio = np.count_nonzero(edges) / edges.size
1735
-
1736
- # Typical handwritten documents have more varied edge patterns
1737
- if edge_ratio > 0.02:
1738
- # Additional check with gradient magnitudes
1739
- sobelx = cv2.Sobel(img_np, cv2.CV_64F, 1, 0, ksize=3)
1740
- sobely = cv2.Sobel(img_np, cv2.CV_64F, 0, 1, ksize=3)
1741
- magnitude = np.sqrt(sobelx**2 + sobely**2)
1742
- # Handwriting typically has more variation in gradient magnitudes
1743
- if np.std(magnitude) > 20:
1744
- is_handwritten = True
1745
- logger.info("Detected handwritten content for local OCR")
1746
-
1747
- # Enhanced preprocessing based on document type
1748
- if is_handwritten:
1749
- # Process for handwritten content
1750
- # Apply CLAHE for better local contrast
1751
- clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
1752
- img_np = clahe.apply(img_np)
1753
-
1754
- # Apply adaptive thresholding with optimized parameters for handwriting
1755
- binary = cv2.adaptiveThreshold(
1756
- img_np, 255,
1757
- cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
1758
- cv2.THRESH_BINARY,
1759
- 21, # Larger block size for handwriting
1760
- 5 # Lower constant for better stroke preservation
1761
- )
1762
-
1763
- # Optional: apply dilation to thicken strokes slightly
1764
- kernel = np.ones((2, 2), np.uint8)
1765
- binary = cv2.dilate(binary, kernel, iterations=1)
1766
-
1767
- # Convert back to PIL Image for tesseract
1768
- image = Image.fromarray(binary)
1769
-
1770
- # Set tesseract options for handwritten content
1771
- custom_config = r'--oem 1 --psm 6 -l eng'
1772
- else:
1773
- # Process for printed content
1774
- # Apply CLAHE for better contrast
1775
- clahe = cv2.createCLAHE(clipLimit=2.5, tileGridSize=(8, 8))
1776
- img_np = clahe.apply(img_np)
1777
-
1778
- # Apply bilateral filter to reduce noise while preserving edges
1779
- img_np = cv2.bilateralFilter(img_np, 9, 75, 75)
1780
-
1781
- # Apply Otsu's thresholding for printed text
1782
- _, binary = cv2.threshold(img_np, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
1783
-
1784
- # Convert back to PIL Image for tesseract
1785
- image = Image.fromarray(binary)
1786
-
1787
- # Set tesseract options for printed content
1788
- custom_config = r'--oem 3 --psm 6 -l eng'
1789
- except Exception as e:
1790
- logger.warning(f"OpenCV preprocessing failed: {str(e)}. Using PIL fallback.")
1791
-
1792
- # Convert to RGB if not already (pytesseract works best with RGB)
1793
- if image.mode != 'RGB':
1794
- image = image.convert('RGB')
1795
-
1796
- # Apply basic image enhancements
1797
- image = image.convert('L')
1798
- enhancer = ImageEnhance.Contrast(image)
1799
- image = enhancer.enhance(2.0)
1800
- custom_config = r'--oem 3 --psm 6 -l eng'
1801
- else:
1802
- # PIL-only path without OpenCV
1803
- # Convert to RGB if not already (pytesseract works best with RGB)
1804
- if image.mode != 'RGB':
1805
- image = image.convert('RGB')
1806
-
1807
- # Apply basic image enhancements
1808
- image = image.convert('L')
1809
- enhancer = ImageEnhance.Contrast(image)
1810
- image = enhancer.enhance(2.0)
1811
- custom_config = r'--oem 3 --psm 6 -l eng'
1812
 
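For reference on the Tesseract flags used here: --oem 1 selects the LSTM engine only while --oem 3 lets Tesseract choose; --psm 6 assumes a single uniform block of text, and the --psm 4 retry further down assumes a single column of text of variable sizes.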
1813
- # Run OCR with appropriate config
1814
- ocr_text = pytesseract.image_to_string(image, config=custom_config)
1815
 
1816
- if ocr_text and len(ocr_text.strip()) > 50:
1817
- logger.info(f"Local OCR successful: extracted {len(ocr_text)} characters")
1818
- return ocr_text
1819
  else:
1820
- # Try another psm mode as fallback
1821
- logger.warning("First OCR attempt produced minimal text, trying another mode")
1822
- # Try PSM mode 4 (assume single column of text)
1823
- fallback_config = r'--oem 3 --psm 4 -l eng'
1824
- ocr_text = pytesseract.image_to_string(image, config=fallback_config)
1825
-
1826
- if ocr_text and len(ocr_text.strip()) > 50:
1827
- logger.info(f"Local OCR fallback successful: extracted {len(ocr_text)} characters")
1828
- return ocr_text
1829
- else:
1830
- logger.warning("Local OCR produced minimal or no text")
1831
- return None
1832
- except ImportError:
1833
- logger.warning("Pytesseract not installed - local OCR not available")
1834
- return None
1835
  except Exception as e:
1836
- logger.error(f"Local OCR fallback failed: {str(e)}")
1837
- return None
 
1
  """
2
+ OCR utility functions for image processing and OCR operations.
3
+ This module provides helper functions used across the Historical OCR application.
4
  """
5
 
6
+ import os
 
7
  import base64
8
  import logging
9
  from pathlib import Path
10
+ from typing import Union, Optional
 
11
 
12
  # Configure logging
13
  logging.basicConfig(level=logging.INFO,
14
+ format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
15
  logger = logging.getLogger(__name__)
16
 
17
+ # Try to import optional dependencies
18
+ try:
19
+ import pytesseract
20
+ TESSERACT_AVAILABLE = True
21
+ except ImportError:
22
+ logger.warning("pytesseract not available - local OCR fallback will not work")
23
+ TESSERACT_AVAILABLE = False
24
 
25
  try:
26
+ from PIL import Image
27
  PILLOW_AVAILABLE = True
28
  except ImportError:
29
  logger.warning("PIL not available - image preprocessing will be limited")
30
  PILLOW_AVAILABLE = False
31
 
32
 
33
  def encode_image_for_api(image_path: Union[str, Path]) -> str:
34
  """
35
+ Encode an image as base64 data URL for API submission with proper MIME type.
36
 
37
  Args:
38
  image_path: Path to the image file
 
63
  encoded = base64.b64encode(image_file.read_bytes()).decode()
64
  return f"data:{mime_type};base64,{encoded}"
65
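Usage is a one-liner; the returned string can be dropped straight into an image payload (file path illustrative):

data_url = encode_image_for_api('samples/letter_1871.jpg')  # hypothetical file
assert data_url.startswith('data:image/jpeg;base64,')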
 
66
 
67
+ def try_local_ocr_fallback(file_path: Union[str, Path], base64_data_url: Optional[str] = None) -> Optional[str]:
68
  """
69
+ Try to perform OCR using local Tesseract as a fallback when the API is unavailable.
70
 
71
  Args:
72
+ file_path: Path to the image file
 
73
  base64_data_url: Optional base64 data URL if already available
74
 
75
  Returns:
76
+ Extracted text or None if extraction failed
77
  """
78
+ if not TESSERACT_AVAILABLE or not PILLOW_AVAILABLE:
79
+ logger.warning("Local OCR fallback is not available (missing dependencies)")
80
+ return None
81
 
82
  try:
83
+ logger.info("Using local Tesseract OCR as fallback")
 
84
 
85
+ # Use PIL to open the image
86
+ img = Image.open(file_path)
87
 
88
+ # Use Tesseract to extract text
89
+ text = pytesseract.image_to_string(img)
90
 
91
+ if text:
92
+ logger.info("Successfully extracted text using local Tesseract OCR")
93
+ return text
94
  else:
95
+ logger.warning("Tesseract extracted no text")
96
+ return None
97
  except Exception as e:
98
+ logger.error(f"Error using local OCR fallback: {str(e)}")
99
+ return None
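A typical call site tries the API first and only then falls back locally (the API wrapper named here is a stand-in, not this module's code):

text = None
try:
    text = call_mistral_ocr(image_path)  # hypothetical API call with retries
except Exception:
    pass
if not text:
    text = try_local_ocr_fallback(image_path)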
preprocessing.py CHANGED
@@ -3,15 +3,398 @@ import io
3
  import cv2
4
  import numpy as np
5
  import tempfile
6
  from PIL import Image, ImageEnhance, ImageFilter
7
  from pdf2image import convert_from_bytes
8
  import streamlit as st
9
  import logging
10
 
11
  # Configure logging
12
  logger = logging.getLogger("preprocessing")
13
  logger.setLevel(logging.INFO)
14
 
 
15
  @st.cache_data(ttl=24*3600, show_spinner=False) # Cache for 24 hours
16
  def convert_pdf_to_images(pdf_bytes, dpi=150, rotation=0):
17
  """Convert PDF bytes to a list of images with caching"""
@@ -34,94 +417,134 @@ def convert_pdf_to_images(pdf_bytes, dpi=150, rotation=0):
34
 
35
  @st.cache_data(ttl=24*3600, show_spinner=False, hash_funcs={dict: lambda x: str(sorted(x.items()))})
36
  def preprocess_image(image_bytes, preprocessing_options):
37
- """Preprocess image with selected options optimized for historical document OCR quality"""
38
  # Setup basic console logging
39
  logger = logging.getLogger("image_preprocessor")
40
  logger.setLevel(logging.INFO)
41
 
42
  # Log which preprocessing options are being applied
43
- logger.info(f"Preprocessing image with options: {preprocessing_options}")
44
 
45
  # Convert bytes to PIL Image
46
  image = Image.open(io.BytesIO(image_bytes))
47
 
48
- # Check for alpha channel (RGBA) and convert to RGB if needed
49
  if image.mode == 'RGBA':
50
- # Convert RGBA to RGB by compositing the image onto a white background
 
51
  background = Image.new('RGB', image.size, (255, 255, 255))
52
  background.paste(image, mask=image.split()[3]) # 3 is the alpha channel
53
  image = background
54
- logger.info("Converted RGBA image to RGB")
55
  elif image.mode not in ('RGB', 'L'):
56
- # Convert other modes to RGB as well
 
57
  image = image.convert('RGB')
58
- logger.info(f"Converted {image.mode} image to RGB")
59
-
60
- # Apply rotation if specified
61
- if preprocessing_options.get("rotation", 0) != 0:
62
- rotation_degrees = preprocessing_options.get("rotation")
63
- image = image.rotate(rotation_degrees, expand=True, resample=Image.BICUBIC)
64
-
65
- # Resize large images while preserving details important for OCR
66
- width, height = image.size
67
- max_dimension = max(width, height)
68
-
69
- # Less aggressive resizing to preserve document details
70
- if max_dimension > 2500:
71
- scale_factor = 2500 / max_dimension
72
- new_width = int(width * scale_factor)
73
- new_height = int(height * scale_factor)
74
- # Use LANCZOS for better quality preservation
75
- image = image.resize((new_width, new_height), Image.LANCZOS)
76
 
 
77
  img_array = np.array(image)
78
 
79
- # Apply preprocessing based on selected options with settings optimized for historical documents
80
- document_type = preprocessing_options.get("document_type", "standard")
81
-
82
- # Process grayscale option first as it's a common foundation
83
  if preprocessing_options.get("grayscale", False):
84
  if len(img_array.shape) == 3: # Only convert if it's not already grayscale
85
- if document_type == "handwritten":
86
- # Enhanced grayscale processing for handwritten documents
87
  img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
88
- # Apply adaptive histogram equalization to enhance handwriting
89
- clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
90
  img_array = clahe.apply(img_array)
91
  else:
92
  # Standard grayscale for printed documents
93
  img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
94
-
95
- # Convert back to RGB for further processing
96
- img_array = cv2.cvtColor(img_array, cv2.COLOR_GRAY2RGB)
97
-
98
- if preprocessing_options.get("contrast", 0) != 0:
99
- contrast_factor = 1 + (preprocessing_options.get("contrast", 0) / 150) # Reduced from /100 for a gentler effect
100
- image = Image.fromarray(img_array)
101
- enhancer = ImageEnhance.Contrast(image)
102
- image = enhancer.enhance(contrast_factor)
103
- img_array = np.array(image)
104
 
 
105
  if preprocessing_options.get("denoise", False):
106
  try:
107
- # Apply appropriate denoising based on document type (reduced parameters for gentler effect)
108
- if document_type == "handwritten":
109
- # Very light denoising for handwritten documents to preserve pen strokes
110
- if len(img_array.shape) == 3 and img_array.shape[2] == 3: # Color image
111
- img_array = cv2.fastNlMeansDenoisingColored(img_array, None, 2, 2, 3, 7) # Reduced from 3,3,5,9
112
- else: # Grayscale image
113
- img_array = cv2.fastNlMeansDenoising(img_array, None, 2, 5, 15) # Reduced from 3,7,21
114
  else:
115
- # Standard denoising for printed documents
116
- if len(img_array.shape) == 3 and img_array.shape[2] == 3: # Color image
117
- img_array = cv2.fastNlMeansDenoisingColored(img_array, None, 3, 3, 5, 15) # Reduced from 5,5,7,21
118
- else: # Grayscale image
119
- img_array = cv2.fastNlMeansDenoising(img_array, None, 3, 5, 15) # Reduced from 5,7,21
120
  except Exception as e:
121
- logger.error(f"Denoising error: {str(e)}, falling back to standard processing")
122
 
123
  # Convert back to PIL Image
124
- processed_image = Image.fromarray(img_array)
125
 
126
  # Higher quality for OCR processing
127
  byte_io = io.BytesIO()
@@ -135,16 +558,14 @@ def preprocess_image(image_bytes, preprocessing_options):
135
 
136
  logger.info(f"Preprocessing complete. Original image mode: {image.mode}, processed mode: {processed_image.mode}")
137
  logger.info(f"Original size: {len(image_bytes)/1024:.1f}KB, processed size: {len(byte_io.getvalue())/1024:.1f}KB")
 
138
 
139
  return byte_io.getvalue()
140
  except Exception as e:
141
  logger.error(f"Error saving processed image: {str(e)}")
142
  # Fallback to original image
143
  logger.info("Using original image as fallback")
144
- image_io = io.BytesIO()
145
- image.save(image_io, format='JPEG', quality=92)
146
- image_io.seek(0)
147
- return image_io.getvalue()
148
 
149
  def create_temp_file(content, suffix, temp_file_paths):
150
  """Create a temporary file and track it for cleanup"""
@@ -157,19 +578,53 @@ def create_temp_file(content, suffix, temp_file_paths):
157
  return temp_path
158
 
159
  def apply_preprocessing_to_file(file_bytes, file_ext, preprocessing_options, temp_file_paths):
160
- """Apply preprocessing to file and return path to processed file"""
161
- # Check if any preprocessing options with boolean values are True, or if any non-boolean values are non-default
162
- # Note: document_type is no longer used to determine if preprocessing should be applied
163
  has_preprocessing = (
164
  preprocessing_options.get("grayscale", False) or
165
  preprocessing_options.get("denoise", False) or
166
- preprocessing_options.get("contrast", 0) != 0 or
167
- preprocessing_options.get("rotation", 0) != 0
168
  )
169
 
170
- if has_preprocessing:
171
  # Apply preprocessing
172
  logger.info(f"Applying preprocessing with options: {preprocessing_options}")
173
  processed_bytes = preprocess_image(file_bytes, preprocessing_options)
174
 
175
  # Save processed image to temp file
 
3
  import cv2
4
  import numpy as np
5
  import tempfile
6
+ import time
7
+ import math
8
+ import json
+ import os  # used by the logging helpers below; harmless if already imported at the top of the module
9
  from PIL import Image, ImageEnhance, ImageFilter
10
  from pdf2image import convert_from_bytes
11
  import streamlit as st
12
  import logging
13
+ import concurrent.futures
14
+ from pathlib import Path
15
 
16
  # Configure logging
17
  logger = logging.getLogger("preprocessing")
18
  logger.setLevel(logging.INFO)
19
 
20
+ # Ensure logs directory exists
21
+ def ensure_log_directory(config):
22
+ """Create logs directory if it doesn't exist"""
23
+ if config.get("logging", {}).get("enabled", False):
24
+ log_path = config.get("logging", {}).get("output_path", "logs/preprocessing_metrics.json")
25
+ log_dir = os.path.dirname(log_path)
26
+ if log_dir:
27
+ Path(log_dir).mkdir(parents=True, exist_ok=True)
28
+
29
+ def log_preprocessing_metrics(metrics, config):
30
+ """Log preprocessing metrics to JSON file"""
31
+ if not config.get("enabled", False):
32
+ return
33
+
34
+ log_path = config.get("output_path", "logs/preprocessing_metrics.json")
35
+ ensure_log_directory({"logging": {"enabled": True, "output_path": log_path}})
36
+
37
+ # Add timestamp
38
+ metrics["timestamp"] = time.strftime("%Y-%m-%d %H:%M:%S")
39
+
40
+ # Append to log file
41
+ try:
42
+ existing_data = []
43
+ if os.path.exists(log_path):
44
+ with open(log_path, 'r') as f:
45
+ existing_data = json.load(f)
46
+ if not isinstance(existing_data, list):
47
+ existing_data = [existing_data]
48
+
49
+ existing_data.append(metrics)
50
+
51
+ with open(log_path, 'w') as f:
52
+ json.dump(existing_data, f, indent=2)
53
+
54
+ logger.info(f"Logged preprocessing metrics to {log_path}")
55
+ except Exception as e:
56
+ logger.error(f"Error logging preprocessing metrics: {str(e)}")
57
+
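These helpers rely on os.path, so import os must be present among the module's imports (see the import block above). A caller records per-image metrics like so (field names illustrative; only timestamp is added by the function itself):

metrics = {
    'file': 'letter_1871.jpg',      # illustrative
    'document_type': 'handwritten',
    'deskew_angle': -1.7,
    'elapsed_s': 0.42,
}
log_preprocessing_metrics(metrics, {'enabled': True,
                                    'output_path': 'logs/preprocessing_metrics.json'})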
58
+ def get_document_config(document_type, global_config):
59
+ """
60
+ Get document-specific preprocessing configuration by merging with global settings.
61
+
62
+ Args:
63
+ document_type: The type of document (e.g., 'standard', 'newspaper', 'handwritten')
64
+ global_config: The global preprocessing configuration
65
+
66
+ Returns:
67
+ A merged configuration dictionary with document-specific overrides
68
+ """
69
+ # Start with a copy of the global config
70
+ config = {
71
+ "deskew": global_config.get("deskew", {}),
72
+ "thresholding": global_config.get("thresholding", {}),
73
+ "morphology": global_config.get("morphology", {}),
74
+ "performance": global_config.get("performance", {}),
75
+ "logging": global_config.get("logging", {})
76
+ }
77
+
78
+ # Apply document-specific overrides if they exist
79
+ doc_types = global_config.get("document_types", {})
80
+ if document_type in doc_types:
81
+ doc_config = doc_types[document_type]
82
+
83
+ # Merge document-specific settings into the config
84
+ for section in doc_config:
85
+ if section in config:
86
+ config[section].update(doc_config[section])
87
+
88
+ return config
89
+
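The merge is shallow and per-section, and the sections are taken by reference, so update() also mutates global_config; a copy.deepcopy of each section would avoid that. Example (values illustrative):

global_config = {
    'deskew': {'enabled': True, 'max_angle': 45.0},
    'document_types': {
        'newspaper': {'deskew': {'max_angle': 5.0}},
    },
}
cfg = get_document_config('newspaper', global_config)
# cfg['deskew'] -> {'enabled': True, 'max_angle': 5.0}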
90
+ def deskew_image(img_array, config):
91
+ """
92
+ Detect and correct skew in document images.
93
+
94
+ Uses a combination of methods (minAreaRect and/or Hough transform)
95
+ to estimate the skew angle more robustly.
96
+
97
+ Args:
98
+ img_array: Input image as numpy array
99
+ config: Deskew configuration dict
100
+
101
+ Returns:
102
+ Deskewed image as numpy array, estimated angle, success flag
103
+ """
104
+ if not config.get("enabled", False):
105
+ return img_array, 0.0, True
106
+
107
+ # Convert to grayscale if needed
108
+ gray = img_array if len(img_array.shape) == 2 else cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
109
+
110
+ # Start with a threshold to get binary image for angle detection
111
+ _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
112
+
113
+ angles = []
114
+ angle_threshold = config.get("angle_threshold", 0.1)
115
+ max_angle = config.get("max_angle", 45.0)
116
+
117
+ # Method 1: minAreaRect approach
118
+ try:
119
+ # Find all contours
120
+ contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
121
+
122
+ # Filter contours by area to avoid noise
123
+ min_area = binary.shape[0] * binary.shape[1] * 0.0001 # 0.01% of image area
124
+ filtered_contours = [cnt for cnt in contours if cv2.contourArea(cnt) > min_area]
125
+
126
+ # Get angles from rotated rectangles around contours
127
+ for contour in filtered_contours:
128
+ rect = cv2.minAreaRect(contour)
129
+ width, height = rect[1]
130
+
131
+ # Calculate the angle based on the longer side
132
+ # (This is important for getting the orientation right)
133
+ angle = rect[2]
134
+ if width < height:
135
+ angle += 90
136
+
137
+ # Normalize angle to -45 to 45 range
138
+ if angle > 45:
139
+ angle -= 90
140
+ if angle < -45:
141
+ angle += 90
142
+
143
+ # Clamp angle to max limit
144
+ angle = max(min(angle, max_angle), -max_angle)
145
+ angles.append(angle)
146
+ except Exception as e:
147
+ logger.error(f"Error in minAreaRect skew detection: {str(e)}")
148
+
149
+ # Method 2: Hough Transform approach (if enabled)
150
+ if config.get("use_hough", True):
151
+ try:
152
+ # Apply Canny edge detection
153
+ edges = cv2.Canny(gray, 50, 150, apertureSize=3)
154
+
155
+ # Apply Hough lines
156
+ lines = cv2.HoughLinesP(edges, 1, np.pi/180,
157
+ threshold=100, minLineLength=100, maxLineGap=10)
158
+
159
+ if lines is not None:
160
+ for line in lines:
161
+ x1, y1, x2, y2 = line[0]
162
+ if x2 - x1 != 0: # Avoid division by zero
163
+ # Calculate line angle in degrees
164
+ angle = math.atan2(y2 - y1, x2 - x1) * 180.0 / np.pi
165
+
166
+ # Normalize angle to -45 to 45 range
167
+ if angle > 45:
168
+ angle -= 90
169
+ if angle < -45:
170
+ angle += 90
171
+
172
+ # Clamp angle to max limit
173
+ angle = max(min(angle, max_angle), -max_angle)
174
+ angles.append(angle)
175
+ except Exception as e:
176
+ logger.error(f"Error in Hough transform skew detection: {str(e)}")
177
+
178
+ # If no angles were detected, return original image
179
+ if not angles:
180
+ logger.warning("No skew angles detected, using original image")
181
+ return img_array, 0.0, False
182
+
183
+ # Combine angles using the specified consensus method
184
+ consensus_method = config.get("consensus_method", "average")
185
+ if consensus_method == "average":
186
+ final_angle = sum(angles) / len(angles)
187
+ elif consensus_method == "median":
188
+ final_angle = sorted(angles)[len(angles) // 2]
189
+ elif consensus_method == "min":
190
+ final_angle = min(angles, key=abs)
191
+ elif consensus_method == "max":
192
+ final_angle = max(angles, key=abs)
193
+ else:
194
+ final_angle = sum(angles) / len(angles) # Default to average
195
+
196
+ # If angle is below threshold, don't rotate
197
+ if abs(final_angle) < angle_threshold:
198
+ logger.info(f"Detected angle ({final_angle:.2f}°) is below threshold, skipping deskew")
199
+ return img_array, final_angle, True
200
+
201
+ # Log the detected angle
202
+ logger.info(f"Deskewing image with angle: {final_angle:.2f}°")
203
+
204
+ # Get image dimensions
205
+ h, w = img_array.shape[:2]
206
+ center = (w // 2, h // 2)
207
+
208
+ # Get rotation matrix
209
+ rotation_matrix = cv2.getRotationMatrix2D(center, final_angle, 1.0)
210
+
211
+ # Calculate new image dimensions
212
+ abs_cos = abs(rotation_matrix[0, 0])
213
+ abs_sin = abs(rotation_matrix[0, 1])
214
+ new_w = int(h * abs_sin + w * abs_cos)
215
+ new_h = int(h * abs_cos + w * abs_sin)
216
+
217
+ # Adjust the rotation matrix to account for new dimensions
218
+ rotation_matrix[0, 2] += (new_w / 2) - center[0]
219
+ rotation_matrix[1, 2] += (new_h / 2) - center[1]
220
+
221
+ # Perform the rotation
222
+ try:
223
+ # Determine the number of channels to create the correct output array
224
+ if len(img_array.shape) == 3:
225
+ rotated = cv2.warpAffine(img_array, rotation_matrix, (new_w, new_h),
226
+ flags=cv2.INTER_LINEAR, borderMode=cv2.BORDER_CONSTANT,
227
+ borderValue=(255, 255, 255))
228
+ else:
229
+ rotated = cv2.warpAffine(img_array, rotation_matrix, (new_w, new_h),
230
+ flags=cv2.INTER_LINEAR, borderMode=cv2.BORDER_CONSTANT,
231
+ borderValue=255)
232
+ return rotated, final_angle, True
233
+ except Exception as e:
234
+ logger.error(f"Error rotating image: {str(e)}")
235
+ if config.get("fallback", {}).get("enabled", True):
236
+ logger.info("Using original image as fallback after rotation failure")
237
+ return img_array, final_angle, False
238
+ return img_array, final_angle, False
239
+
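# --- Editor's illustrative sketch of a deskew_image call; "scan.png" and the
# config values are hypothetical, but every key shown is read by the function.
#
#   import cv2
#   example_cfg = {
#       "enabled": True,
#       "use_hough": True,             # add Hough-line angle estimates
#       "consensus_method": "median",  # "average" | "median" | "min" | "max"
#       "angle_threshold": 0.1,        # degrees; below this, no rotation
#       "max_angle": 45.0,
#       "fallback": {"enabled": True},
#   }
#   page = cv2.imread("scan.png")
#   deskewed, angle, ok = deskew_image(page, example_cfg)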
240
+ def preblur(img_array, config):
241
+ """
242
+ Apply pre-filtering blur to stabilize thresholding results.
243
+
244
+ Args:
245
+ img_array: Input image as numpy array
246
+ config: Pre-blur configuration dict
247
+
248
+ Returns:
249
+ Blurred image as numpy array
250
+ """
251
+ if not config.get("enabled", False):
252
+ return img_array
253
+
254
+ method = config.get("method", "gaussian")
255
+ kernel_size = config.get("kernel_size", 3)
256
+
257
+ # Ensure kernel size is odd
258
+ if kernel_size % 2 == 0:
259
+ kernel_size += 1
260
+
261
+ try:
262
+ if method == "gaussian":
263
+ return cv2.GaussianBlur(img_array, (kernel_size, kernel_size), 0)
264
+ elif method == "median":
265
+ return cv2.medianBlur(img_array, kernel_size)
266
+ else:
267
+ logger.warning(f"Unknown blur method: {method}, using gaussian")
268
+ return cv2.GaussianBlur(img_array, (kernel_size, kernel_size), 0)
269
+ except Exception as e:
270
+ logger.error(f"Error applying {method} blur: {str(e)}")
271
+ return img_array
272
+
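# --- Editor's illustrative note: preblur forces an odd kernel, so a requested
# kernel_size of 4 becomes a 5x5 kernel. The input array below is synthetic.
#
#   import numpy as np
#   noisy = (np.random.rand(64, 64) * 255).astype("uint8")
#   smoothed = preblur(noisy, {"enabled": True, "method": "gaussian", "kernel_size": 4})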
273
+ def apply_threshold(img_array, config):
274
+ """
275
+ Apply thresholding to create binary image.
276
+
277
+ Supports Otsu's method and adaptive thresholding.
278
+ Includes pre-filtering and fallback mechanisms.
279
+
280
+ Args:
281
+ img_array: Input image as numpy array
282
+ config: Thresholding configuration dict
283
+
284
+ Returns:
285
+ Binary image as numpy array, success flag
286
+ """
287
+ method = config.get("method", "adaptive")
288
+ if method == "none":
289
+ return img_array, True
290
+
291
+ # Convert to grayscale if needed
292
+ gray = img_array if len(img_array.shape) == 2 else cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
293
+
294
+ # Apply pre-blur if configured
295
+ preblur_config = config.get("preblur", {})
296
+ if preblur_config.get("enabled", False):
297
+ gray = preblur(gray, preblur_config)
298
+
299
+ binary = None
300
+ try:
301
+ if method == "otsu":
302
+ # Apply Otsu's thresholding
303
+ _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
304
+ elif method == "adaptive":
305
+ # Apply adaptive thresholding
306
+ block_size = config.get("adaptive_block_size", 11)
307
+ constant = config.get("adaptive_constant", 2)
308
+
309
+ # Ensure block size is odd
310
+ if block_size % 2 == 0:
311
+ block_size += 1
312
+
313
+ binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
314
+ cv2.THRESH_BINARY, block_size, constant)
315
+ else:
316
+ logger.warning(f"Unknown thresholding method: {method}, using adaptive")
317
+ block_size = config.get("adaptive_block_size", 11)
318
+ constant = config.get("adaptive_constant", 2)
319
+
320
+ # Ensure block size is odd
321
+ if block_size % 2 == 0:
322
+ block_size += 1
323
+
324
+ binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
325
+ cv2.THRESH_BINARY, block_size, constant)
326
+ except Exception as e:
327
+ logger.error(f"Error applying {method} thresholding: {str(e)}")
328
+ if config.get("fallback", {}).get("enabled", True):
329
+ logger.info("Using original grayscale image as fallback after thresholding failure")
330
+ return gray, False
331
+ return gray, False
332
+
333
+ # Calculate percentage of non-zero pixels for logging
334
+ nonzero_pct = np.count_nonzero(binary) / binary.size * 100
335
+ logger.info(f"Binary image has {nonzero_pct:.2f}% non-zero pixels")
336
+
337
+ # Check if thresholding was successful (crude check)
338
+ if nonzero_pct < 1 or nonzero_pct > 99:
339
+ logger.warning(f"Thresholding produced extreme result ({nonzero_pct:.2f}% non-zero)")
340
+ if config.get("fallback", {}).get("enabled", True):
341
+ logger.info("Using original grayscale image as fallback after poor thresholding")
342
+ return gray, False
343
+
344
+ return binary, True
345
+
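# --- Editor's illustrative sketch of an apply_threshold config combining a
# light median pre-blur with adaptive thresholding; values are hypothetical.
# If the binary output is degenerate (<1% or >99% non-zero pixels), the
# grayscale input comes back instead with success=False.
#
#   example_cfg = {
#       "method": "adaptive",       # "otsu" | "adaptive" | "none"
#       "adaptive_block_size": 11,  # forced odd
#       "adaptive_constant": 2,
#       "preblur": {"enabled": True, "method": "median", "kernel_size": 3},
#       "fallback": {"enabled": True},
#   }
#   binary, ok = apply_threshold(page_gray, example_cfg)  # page_gray: uint8 array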
346
+ def apply_morphology(binary_img, config):
347
+ """
348
+ Apply morphological operations to clean up binary image.
349
+
350
+ Supports opening, closing, or both operations.
351
+
352
+ Args:
353
+ binary_img: Binary image as numpy array
354
+ config: Morphology configuration dict
355
+
356
+ Returns:
357
+ Processed binary image as numpy array
358
+ """
359
+ if not config.get("enabled", False):
360
+ return binary_img
361
+
362
+ operation = config.get("operation", "close")
363
+ kernel_size = config.get("kernel_size", 1)
364
+ kernel_shape = config.get("kernel_shape", "rect")
365
+
366
+ # Create appropriate kernel
367
+ if kernel_shape == "rect":
368
+ kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_size*2+1, kernel_size*2+1))
369
+ elif kernel_shape == "ellipse":
370
+ kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size*2+1, kernel_size*2+1))
371
+ elif kernel_shape == "cross":
372
+ kernel = cv2.getStructuringElement(cv2.MORPH_CROSS, (kernel_size*2+1, kernel_size*2+1))
373
+ else:
374
+ logger.warning(f"Unknown kernel shape: {kernel_shape}, using rect")
375
+ kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_size*2+1, kernel_size*2+1))
376
+
377
+ result = binary_img
378
+ try:
379
+ if operation == "open":
380
+ # Opening: Erosion followed by dilation - removes small noise
381
+ result = cv2.morphologyEx(binary_img, cv2.MORPH_OPEN, kernel)
382
+ elif operation == "close":
383
+ # Closing: Dilation followed by erosion - fills small holes
384
+ result = cv2.morphologyEx(binary_img, cv2.MORPH_CLOSE, kernel)
385
+ elif operation == "both":
386
+ # Both operations in sequence
387
+ result = cv2.morphologyEx(binary_img, cv2.MORPH_OPEN, kernel)
388
+ result = cv2.morphologyEx(result, cv2.MORPH_CLOSE, kernel)
389
+ else:
390
+ logger.warning(f"Unknown morphological operation: {operation}, using close")
391
+ result = cv2.morphologyEx(binary_img, cv2.MORPH_CLOSE, kernel)
392
+ except Exception as e:
393
+ logger.error(f"Error applying morphological operation: {str(e)}")
394
+ return binary_img
395
+
396
+ return result
397
+
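# --- Editor's illustrative sketch: kernel_size acts as a radius (size*2+1),
# so kernel_size=1 builds a 3x3 element; "both" runs opening then closing.
#
#   cleaned = apply_morphology(binary, {
#       "enabled": True,
#       "operation": "both",        # "open" | "close" | "both"
#       "kernel_size": 1,
#       "kernel_shape": "ellipse",  # "rect" | "ellipse" | "cross"
#   })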
398
  @st.cache_data(ttl=24*3600, show_spinner=False) # Cache for 24 hours
399
  def convert_pdf_to_images(pdf_bytes, dpi=150, rotation=0):
400
  """Convert PDF bytes to a list of images with caching"""
 
417
 
418
  @st.cache_data(ttl=24*3600, show_spinner=False, hash_funcs={dict: lambda x: str(sorted(x.items()))})
419
  def preprocess_image(image_bytes, preprocessing_options):
420
+ """
421
+ Conservative preprocessing function for handwritten documents with early exit for clean scans.
422
+ Implements light processing: grayscale → denoise (gently) → contrast (conservative)
423
+
424
+ Args:
425
+ image_bytes: Image content as bytes
426
+ preprocessing_options: Dictionary with document_type, grayscale, denoise, contrast options
427
+
428
+ Returns:
429
+ Processed image bytes or original image bytes if no processing needed
430
+ """
431
  # Setup basic console logging
432
  logger = logging.getLogger("image_preprocessor")
433
  logger.setLevel(logging.INFO)
434
 
435
  # Log which preprocessing options are being applied
436
+ logger.info(f"Document type: {preprocessing_options.get('document_type', 'standard')}")
437
+
438
+ # Check if any preprocessing is actually requested
439
+ has_preprocessing = (
440
+ preprocessing_options.get("grayscale", False) or
441
+ preprocessing_options.get("denoise", False) or
442
+ preprocessing_options.get("contrast", 0) != 0
443
+ )
444
 
445
  # Convert bytes to PIL Image
446
  image = Image.open(io.BytesIO(image_bytes))
447
 
448
+ # Check for minimal skew and exit early if document is already straight
449
+ # This avoids unnecessary processing for clean scans
450
+ try:
451
+ from utils.image_utils import detect_skew
452
+ skew_angle = detect_skew(image)
453
+ if abs(skew_angle) < 0.5 and not has_preprocessing:
454
+ # Scan is already straight and no preprocessing was requested,
455
+ # so return the original image bytes unchanged
456
+ logger.info(f"Document has minimal skew ({skew_angle:.2f}°), skipping preprocessing")
457
+ return image_bytes
458
+ except Exception as e:
459
+ logger.warning(f"Error in skew detection: {str(e)}, continuing with preprocessing")
460
+
461
+ # If no preprocessing options are selected, return the original image
462
+ if not has_preprocessing:
463
+ logger.info("No preprocessing options selected, skipping preprocessing")
464
+ return image_bytes
465
+
466
+ # Initialize metrics for logging
467
+ metrics = {
468
+ "file": preprocessing_options.get("filename", "unknown"),
469
+ "document_type": preprocessing_options.get("document_type", "standard"),
470
+ "preprocessing_applied": []
471
+ }
472
+ start_time = time.time()
473
+
474
+ # Handle RGBA images (transparency) by converting to RGB
475
  if image.mode == 'RGBA':
476
+ # Convert RGBA to RGB by compositing onto white background
477
+ logger.info("Converting RGBA image to RGB")
478
  background = Image.new('RGB', image.size, (255, 255, 255))
479
  background.paste(image, mask=image.split()[3]) # 3 is the alpha channel
480
  image = background
481
+ metrics["preprocessing_applied"].append("alpha_conversion")
482
  elif image.mode not in ('RGB', 'L'):
483
+ # Convert other modes to RGB
484
+ logger.info(f"Converting {image.mode} image to RGB")
485
  image = image.convert('RGB')
486
+ metrics["preprocessing_applied"].append("format_conversion")
487
 
488
+ # Convert to NumPy array for OpenCV processing
489
  img_array = np.array(image)
490
 
491
+ # Apply grayscale if requested (useful for handwritten text)
492
  if preprocessing_options.get("grayscale", False):
493
  if len(img_array.shape) == 3: # Only convert if it's not already grayscale
494
+ # For handwritten documents, apply gentle CLAHE to enhance contrast locally
495
+ if preprocessing_options.get("document_type") == "handwritten":
496
  img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
497
+ clahe = cv2.createCLAHE(clipLimit=1.5, tileGridSize=(8,8)) # Conservative clip limit
 
498
  img_array = clahe.apply(img_array)
499
  else:
500
  # Standard grayscale for printed documents
501
  img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
502
+
503
+ metrics["preprocessing_applied"].append("grayscale")
504
 
505
+ # Apply light denoising if requested
506
  if preprocessing_options.get("denoise", False):
507
  try:
508
+ # Apply very gentle denoising
509
+ is_color = len(img_array.shape) == 3 and img_array.shape[2] == 3
510
+ if is_color:
511
+ # Very light color denoising with conservative parameters
512
+ img_array = cv2.fastNlMeansDenoisingColored(img_array, None, 2, 2, 3, 7)
513
  else:
514
+ # Very light grayscale denoising
515
+ img_array = cv2.fastNlMeansDenoising(img_array, None, 2, 3, 7)
516
+
517
+ metrics["preprocessing_applied"].append("light_denoise")
 
518
  except Exception as e:
519
+ logger.error(f"Denoising error: {str(e)}")
520
+
521
+ # Apply contrast adjustment if requested (conservative range)
522
+ contrast_value = preprocessing_options.get("contrast", 0)
523
+ if contrast_value != 0:
524
+ # Use a gentler contrast adjustment factor
525
+ contrast_factor = 1 + (contrast_value / 200) # Conservative scaling factor
526
 
527
+ # Convert NumPy array back to PIL Image for contrast adjustment
528
+ if len(img_array.shape) == 2: # If grayscale, convert to RGB for PIL
529
+ image = Image.fromarray(cv2.cvtColor(img_array, cv2.COLOR_GRAY2RGB))
530
+ else:
531
+ image = Image.fromarray(img_array)
532
+
533
+ enhancer = ImageEnhance.Contrast(image)
534
+ image = enhancer.enhance(contrast_factor)
535
+
536
+ # Convert back to NumPy array
537
+ img_array = np.array(image)
538
+ metrics["preprocessing_applied"].append(f"contrast_{contrast_value}")
539
+
540
  # Convert back to PIL Image
541
+ if len(img_array.shape) == 2: # If grayscale, convert to RGB for saving
542
+ processed_image = Image.fromarray(cv2.cvtColor(img_array, cv2.COLOR_GRAY2RGB))
543
+ else:
544
+ processed_image = Image.fromarray(img_array)
545
+
546
+ # Record total processing time
547
+ metrics["processing_time"] = (time.time() - start_time) * 1000 # ms
548
 
549
  # Higher quality for OCR processing
550
  byte_io = io.BytesIO()
 
558
 
559
  logger.info(f"Preprocessing complete. Original image mode: {image.mode}, processed mode: {processed_image.mode}")
560
  logger.info(f"Original size: {len(image_bytes)/1024:.1f}KB, processed size: {len(byte_io.getvalue())/1024:.1f}KB")
561
+ logger.info(f"Applied preprocessing steps: {', '.join(metrics['preprocessing_applied'])}")
562
 
563
  return byte_io.getvalue()
564
  except Exception as e:
565
  logger.error(f"Error saving processed image: {str(e)}")
566
  # Fallback to original image
567
  logger.info("Using original image as fallback")
568
+ return image_bytes
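# --- Editor's illustrative sketch of a preprocess_image call; "scan.jpg" is a
# hypothetical path. With all options off (or a near-zero skew angle and no
# options), the original bytes are returned untouched.
#
#   from pathlib import Path
#   raw = Path("scan.jpg").read_bytes()
#   processed = preprocess_image(raw, {
#       "document_type": "handwritten",  # enables the gentle CLAHE path
#       "grayscale": True,
#       "denoise": True,
#       "contrast": 10,                  # factor 1 + 10/200 = 1.05
#   })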
569
 
570
  def create_temp_file(content, suffix, temp_file_paths):
571
  """Create a temporary file and track it for cleanup"""
 
578
  return temp_path
579
 
580
  def apply_preprocessing_to_file(file_bytes, file_ext, preprocessing_options, temp_file_paths):
581
+ """
582
+ Apply conservative preprocessing to file and return path to the temporary file.
583
+ Handles format conversion and user-selected preprocessing options.
584
+
585
+ Args:
586
+ file_bytes: File content as bytes
587
+ file_ext: File extension (e.g., '.jpg', '.pdf')
588
+ preprocessing_options: Dictionary with document_type and preprocessing options
589
+ temp_file_paths: List to track temporary files for cleanup
590
+
591
+ Returns:
592
+ Tuple of (temp_file_path, was_processed_flag)
593
+ """
594
+ document_type = preprocessing_options.get("document_type", "standard")
595
+
596
+ # Check for user-selected preprocessing
597
  has_preprocessing = (
598
  preprocessing_options.get("grayscale", False) or
599
  preprocessing_options.get("denoise", False) or
600
+ preprocessing_options.get("contrast", 0) != 0
 
601
  )
602
 
603
+ # Check for RGBA/transparency that needs conversion
604
+ format_needs_conversion = False
605
+
606
+ # Only check formats that might have transparency
607
+ if file_ext.lower() in ['.png', '.tif', '.tiff']:
608
+ try:
609
+ # Check if image has transparency
610
+ image = Image.open(io.BytesIO(file_bytes))
611
+ if image.mode == 'RGBA' or image.mode not in ('RGB', 'L'):
612
+ format_needs_conversion = True
613
+ except Exception as e:
614
+ logger.warning(f"Error checking image format: {str(e)}")
615
+
616
+ # Process if user requested preprocessing OR format needs conversion
617
+ needs_processing = has_preprocessing or format_needs_conversion
618
+
619
+ if needs_processing:
620
  # Apply preprocessing
621
  logger.info(f"Applying preprocessing with options: {preprocessing_options}")
622
+ logger.info(f"Using document type '{document_type}' with advanced preprocessing options")
623
+
624
+ # Add filename to preprocessing options for logging when an upload object (with a .name attribute) is passed rather than raw bytes
625
+ if hasattr(file_bytes, 'name'):
626
+ preprocessing_options["filename"] = file_bytes.name
627
+
628
  processed_bytes = preprocess_image(file_bytes, preprocessing_options)
629
 
630
  # Save processed image to temp file
process_file.py CHANGED
@@ -53,9 +53,7 @@ def process_file(uploaded_file, use_vision=True, processor=None, custom_prompt=N
53
  "file_size_mb": round(file_size_mb, 2),
54
  "use_vision": use_vision
55
  })
56
-
57
- # No longer needed - removing confidence score
58
-
59
  return result
60
  except Exception as e:
61
  return {
@@ -65,4 +63,4 @@ def process_file(uploaded_file, use_vision=True, processor=None, custom_prompt=N
65
  finally:
66
  # Clean up the temporary file
67
  if os.path.exists(temp_path):
68
- os.unlink(temp_path)
 
53
  "file_size_mb": round(file_size_mb, 2),
54
  "use_vision": use_vision
55
  })
56
+
 
 
57
  return result
58
  except Exception as e:
59
  return {
 
63
  finally:
64
  # Clean up the temporary file
65
  if os.path.exists(temp_path):
66
+ os.unlink(temp_path)
requirements.txt CHANGED
@@ -10,6 +10,7 @@ Pillow>=10.0.0
10
  opencv-python-headless>=4.8.0.74
11
  pdf2image>=1.16.0
12
  pytesseract>=0.3.10 # For local OCR fallback
 
13
 
14
  # Data handling and utilities
15
  numpy>=1.24.0
 
10
  opencv-python-headless>=4.8.0.74
11
  pdf2image>=1.16.0
12
  pytesseract>=0.3.10 # For local OCR fallback
13
+ matplotlib>=3.7.0 # For visualization in preprocessing tests
14
 
15
  # Data handling and utilities
16
  numpy>=1.24.0
structured_ocr.py CHANGED
@@ -47,28 +47,38 @@ except ImportError:
47
 
48
  # Import utilities for OCR processing
49
  try:
50
- from ocr_utils import replace_images_in_markdown, get_combined_markdown
51
  except ImportError:
52
- # Define fallback functions if module not found
 
 
53
  def replace_images_in_markdown(markdown_str, images_dict):
54
- for img_name, base64_str in images_dict.items():
55
- markdown_str = markdown_str.replace(
56
- f"![{img_name}]({img_name})", f"![{img_name}]({base64_str})"
57
- )
 
 
 
58
  return markdown_str
59
 
60
  def get_combined_markdown(ocr_response):
 
61
  markdowns = []
62
  for page in ocr_response.pages:
63
  image_data = {}
64
- for img in page.images:
65
- image_data[img.id] = img.image_base64
66
- markdowns.append(replace_images_in_markdown(page.markdown, image_data))
 
 
 
 
67
  return "\n\n".join(markdowns)
68
 
69
  # Import config directly (now local to historical-ocr)
70
  try:
71
- from config import MISTRAL_API_KEY, OCR_MODEL, TEXT_MODEL, VISION_MODEL, TEST_MODE
72
  except ImportError:
73
  # Fallback defaults if config is not available
74
  import os
@@ -77,6 +87,14 @@ except ImportError:
77
  TEXT_MODEL = "mistral-large-latest"
78
  VISION_MODEL = "mistral-large-latest"
79
  TEST_MODE = True
80
  logging.warning("Config module not found. Using environment variables and defaults.")
81
 
82
  # Helper function to make OCR objects JSON serializable
@@ -127,6 +145,13 @@ def serialize_ocr_response(obj):
127
  is_valid_image = False
128
  logging.warning("Markdown image reference detected")
129
130
  # Case 3: Needs detailed text content detection
131
  else:
132
  # Use the same proven approach as in our tests
@@ -185,9 +210,27 @@ def serialize_ocr_response(obj):
185
  'image_base64': image_base64
186
  }
187
  else:
188
- # Process as text if validation fails - convert to string to prevent misclassification
189
  if image_base64 and isinstance(image_base64, str):
190
- result[key] = image_base64
191
  else:
192
  result[key] = str(value)
193
  # Handle collections
@@ -382,13 +425,47 @@ class StructuredOCR:
382
  result = serialize_ocr_response(result)
383
 
384
  # Make a final pass to check for any remaining non-serializable objects
385
- # Test JSON serialization to catch any remaining issues
386
- json.dumps(result)
 
387
  except TypeError as e:
388
- # If there's a serialization error, run the whole result through our serializer
389
  logger = logging.getLogger("serializer")
390
  logger.warning(f"JSON serialization error in result: {str(e)}. Applying full serialization.")
391
- result = serialize_ocr_response(result)
392
 
393
  return result
394
 
@@ -1104,9 +1181,10 @@ class StructuredOCR:
1104
 
1105
  # Use enhanced preprocessing functions from ocr_utils
1106
  try:
1107
- from ocr_utils import preprocess_image_for_ocr, IMAGE_PREPROCESSING
 
1108
 
1109
- logger.info(f"Applying advanced image preprocessing for OCR")
1110
 
1111
  # Get preprocessing settings from config
1112
  max_size_mb = IMAGE_PREPROCESSING.get("max_size_mb", 8.0)
@@ -1114,8 +1192,14 @@ class StructuredOCR:
1114
  if file_size_mb > max_size_mb:
1115
  logger.info(f"Image is large ({file_size_mb:.2f} MB), optimizing for API submission")
1116
 
1117
- # Preprocess image with document-type detection and appropriate enhancements
1118
- _, base64_data_url = preprocess_image_for_ocr(file_path)
1119
 
1120
  logger.info(f"Image preprocessing completed successfully")
1121
 
@@ -1169,7 +1253,7 @@ class StructuredOCR:
1169
  except ImportError:
1170
  logger.warning("PIL not available for resizing. Using original image.")
1171
  # Use enhanced encoder with proper MIME type detection
1172
- from ocr_utils import encode_image_for_api
1173
  base64_data_url = encode_image_for_api(file_path)
1174
  except Exception as e:
1175
  logger.warning(f"Image resize failed: {str(e)}. Using original image.")
@@ -1178,7 +1262,7 @@ class StructuredOCR:
1178
  base64_data_url = encode_image_for_api(file_path)
1179
  else:
1180
  # For smaller images, use as-is with proper MIME type
1181
- from ocr_utils import encode_image_for_api
1182
  base64_data_url = encode_image_for_api(file_path)
1183
  except Exception as e:
1184
  # Fallback to original image if any preprocessing fails
@@ -1243,7 +1327,7 @@ class StructuredOCR:
1243
  logger.error("Maximum retries reached, rate limit error persists.")
1244
  try:
1245
  # Try to import the local OCR fallback function
1246
- from ocr_utils import try_local_ocr_fallback
1247
 
1248
  # Attempt local OCR fallback
1249
  ocr_text = try_local_ocr_fallback(file_path, base64_data_url)
@@ -1455,7 +1539,14 @@ class StructuredOCR:
1455
  logger.info("Sufficient OCR text detected, analyzing language before using OCR text directly")
1456
 
1457
  # Perform language detection on the OCR text before returning
1458
- detected_languages = self._detect_text_language(ocr_markdown)
1459
 
1460
  return {
1461
  "file_name": filename,
@@ -1629,7 +1720,12 @@ class StructuredOCR:
1629
 
1630
  # If OCR text has clear French patterns but language is English or missing, fix it
1631
  if ocr_markdown and 'languages' in result:
1632
- result['languages'] = self._detect_text_language(ocr_markdown, result['languages'])
1633
 
1634
  except Exception as e:
1635
  # Fall back to text-only model if vision model fails
@@ -1639,22 +1735,25 @@ class StructuredOCR:
1639
  return result
1640
 
1641
  # We've removed document type detection entirely for simplicity
 
1642
 
1643
  # Create a prompt with enhanced language detection instructions
1644
  generic_section = (
1645
  f"You are an OCR specialist processing historical documents. "
1646
- f"Focus on accurately extracting text content while preserving structure and formatting. "
1647
  f"Pay attention to any historical features and document characteristics.\n\n"
1648
- f"IMPORTANT: Accurately identify the document's language(s). Look for language-specific characters, words, and phrases. "
1649
- f"Specifically check for French (accents like é, è, ç, words like 'le', 'la', 'et', 'est'), German (umlauts, words like 'und', 'der', 'das'), "
1650
- f"Latin, and other non-English languages. Carefully analyze the text before determining language.\n\n"
1651
  f"Create a structured JSON response with the following fields:\n"
1652
  f"- file_name: The document's name\n"
1653
  f"- topics: An array of topics covered in the document\n"
1654
  f"- languages: An array of languages used in the document (be precise and specific about language detection)\n"
1655
  f"- ocr_contents: A comprehensive dictionary with the document's contents including:\n"
1656
- f" * title: The main title or heading (if present)\n"
1657
- f" * content: The main body content\n"
1658
  f" * raw_text: The complete OCR text\n"
1659
  )
1660
 
@@ -1665,86 +1764,7 @@ class StructuredOCR:
1665
 
1666
  # Return the enhanced prompt
1667
  return generic_section + custom_section
1668
-
1669
- def _detect_text_language(self, text, current_languages=None):
1670
- """
1671
- Detect language from text content using the external language detector
1672
- or falling back to internal detection if needed
1673
-
1674
- Args:
1675
- text: The text to analyze
1676
- current_languages: Optional list of languages already detected
1677
-
1678
- Returns:
1679
- List of detected languages
1680
- """
1681
- logger = logging.getLogger("language_detector")
1682
-
1683
- # If no text provided, return current languages or default
1684
- if not text or len(text.strip()) < 10:
1685
- return current_languages if current_languages else ["English"]
1686
-
1687
- # Use the external language detector if available
1688
- if LANG_DETECTOR_AVAILABLE and self.language_detector:
1689
- logger.info("Using external language detector")
1690
- return self.language_detector.detect_languages(text,
1691
- filename=getattr(self, 'current_filename', None),
1692
- current_languages=current_languages)
1693
-
1694
- # Fallback for when the external module is not available
1695
- logger.info("Language detector not available, using simple detection")
1696
-
1697
- # Get all words from text (lowercase for comparison)
1698
- text_lower = text.lower()
1699
- words = text_lower.split()
1700
-
1701
- # Basic language markers - equal treatment of all languages
1702
- language_indicators = {
1703
- "French": {
1704
- "chars": ['é', 'è', 'ê', 'à', 'ç', 'ù', 'â', 'î', 'ô', 'û'],
1705
- "words": ['le', 'la', 'les', 'et', 'en', 'de', 'du', 'des', 'dans', 'ce', 'cette']
1706
- },
1707
- "Spanish": {
1708
- "chars": ['ñ', 'á', 'é', 'í', 'ó', 'ú', '¿', '¡'],
1709
- "words": ['el', 'la', 'los', 'las', 'y', 'en', 'por', 'que', 'con', 'del']
1710
- },
1711
- "German": {
1712
- "chars": ['ä', 'ö', 'ü', 'ß'],
1713
- "words": ['der', 'die', 'das', 'und', 'ist', 'von', 'mit', 'für', 'sich']
1714
- },
1715
- "Latin": {
1716
- "chars": [],
1717
- "words": ['et', 'in', 'ad', 'est', 'sunt', 'non', 'cum', 'sed', 'qui', 'quod']
1718
- }
1719
- }
1720
-
1721
- detected_languages = []
1722
-
1723
- # Simple detection logic - check for language markers
1724
- for language, indicators in language_indicators.items():
1725
- has_chars = any(char in text_lower for char in indicators["chars"])
1726
- has_words = any(word in words for word in indicators["words"])
1727
 
1728
- if has_chars and has_words:
1729
- detected_languages.append(language)
1730
-
1731
- # Check for English
1732
- english_words = ['the', 'and', 'of', 'to', 'in', 'a', 'is', 'that', 'for', 'it']
1733
- if sum(1 for word in words if word in english_words) >= 2:
1734
- detected_languages.append("English")
1735
-
1736
- # If no languages detected, default to English
1737
- if not detected_languages:
1738
- detected_languages = ["English"]
1739
-
1740
- # Limit to top 2 languages
1741
- detected_languages = detected_languages[:2]
1742
-
1743
- # Log what we found
1744
- logger.info(f"Simple fallback language detection results: {detected_languages}")
1745
-
1746
- return detected_languages
1747
-
1748
  def _extract_structured_data_text_only(self, ocr_markdown, filename, custom_prompt=None):
1749
  """
1750
  Extract structured data using text-only model with detailed historical context prompting
 
47
 
48
  # Import utilities for OCR processing
49
  try:
50
+ from utils.image_utils import replace_images_in_markdown, get_combined_markdown
51
  except ImportError:
52
+ # Define minimal fallback functions if module not found
53
+ logger.warning("Could not import utils.image_utils - using minimal fallback functions")
54
+
55
  def replace_images_in_markdown(markdown_str, images_dict):
56
+ """Minimal fallback implementation of replace_images_in_markdown"""
57
+ import re
58
+ for img_id, base64_str in images_dict.items():
59
+ # Match alt text OR link part, ignore extension
60
+ base_id = img_id.split('.')[0]
61
+ pattern = re.compile(rf"!\[[^\]]*{re.escape(base_id)}[^\]]*\]\([^\)]+\)")
62
+ markdown_str = pattern.sub(f"![{img_id}](data:image/jpeg;base64,{base64_str})", markdown_str)
63
  return markdown_str
64
 
65
  def get_combined_markdown(ocr_response):
66
+ """Minimal fallback implementation of get_combined_markdown"""
67
  markdowns = []
68
  for page in ocr_response.pages:
69
  image_data = {}
70
+ if hasattr(page, "images"):
71
+ for img in page.images:
72
+ if hasattr(img, "id") and hasattr(img, "image_base64"):
73
+ image_data[img.id] = img.image_base64
74
+ page_markdown = page.markdown if hasattr(page, "markdown") else ""
75
+ processed_markdown = replace_images_in_markdown(page_markdown, image_data)
76
+ markdowns.append(processed_markdown)
77
  return "\n\n".join(markdowns)
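# --- Editor's illustrative sketch of the fallback pair above; SimpleNamespace
# objects stand in for the real OCR response classes, which are assumptions.
#
#   from types import SimpleNamespace
#   img = SimpleNamespace(id="img-0.jpeg", image_base64="<BASE64>")
#   page = SimpleNamespace(markdown="![img-0.jpeg](img-0.jpeg)", images=[img])
#   print(get_combined_markdown(SimpleNamespace(pages=[page])))
#   # -> ![img-0.jpeg](data:image/jpeg;base64,<BASE64>)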
78
 
79
  # Import config directly (now local to historical-ocr)
80
  try:
81
+ from config import MISTRAL_API_KEY, OCR_MODEL, TEXT_MODEL, VISION_MODEL, TEST_MODE, IMAGE_PREPROCESSING
82
  except ImportError:
83
  # Fallback defaults if config is not available
84
  import os
 
87
  TEXT_MODEL = "mistral-large-latest"
88
  VISION_MODEL = "mistral-large-latest"
89
  TEST_MODE = True
90
+ # Default image preprocessing settings if config not available
91
+ IMAGE_PREPROCESSING = {
92
+ "max_size_mb": 8.0,
93
+ # Add basic defaults for preprocessing
94
+ "enhance_contrast": 1.2,
95
+ "denoise": True,
96
+ "compression_quality": 95
97
+ }
98
  logging.warning("Config module not found. Using environment variables and defaults.")
99
 
100
  # Helper function to make OCR objects JSON serializable
 
145
  is_valid_image = False
146
  logging.warning("Markdown image reference detected")
147
 
148
+ # Extract the image ID for logging
149
+ try:
150
+ img_id = image_base64.split('![')[1].split('](')[0]
151
+ logging.debug(f"Markdown reference for image: {img_id}")
152
+ except Exception:
153
+ img_id = "unknown"
154
+
155
  # Case 3: Needs detailed text content detection
156
  else:
157
  # Use the same proven approach as in our tests
 
210
  'image_base64': image_base64
211
  }
212
  else:
213
+ # Process as text if validation fails, but properly handle markdown references
214
  if image_base64 and isinstance(image_base64, str):
215
+ # Special handling for markdown image references
216
+ if image_base64.startswith('![') and '](' in image_base64 and image_base64.endswith(')'):
217
+ # Extract the image description (alt text) if available
218
+ try:
219
+ # Parse the alt text from ![alt_text](url)
220
+ alt_text = image_base64.split('![')[1].split('](')[0]
221
+ # Use the alt text or a placeholder if it's just the image name
222
+ if alt_text and not alt_text.endswith('.jpeg') and not alt_text.endswith('.jpg'):
223
+ result[key] = f"[Image: {alt_text}]"
224
+ else:
225
+ # Just note that there's an image without the reference
226
+ result[key] = "[Image]"
227
+ logging.info(f"Converted markdown reference to text placeholder: {result[key]}")
228
+ except:
229
+ # Fallback for parsing errors
230
+ result[key] = "[Image]"
231
+ else:
232
+ # Regular text content
233
+ result[key] = image_base64
234
  else:
235
  result[key] = str(value)
236
  # Handle collections
 
425
  result = serialize_ocr_response(result)
426
 
427
  # Make a final pass to check for any remaining non-serializable objects
428
+ # Proactively check for OCRImageObject instances to avoid serialization warnings
429
+ def has_ocr_image_objects(obj):
430
+ """Check if object contains any OCRImageObject instances recursively"""
431
+ if isinstance(obj, dict):
432
+ return any(has_ocr_image_objects(v) for v in obj.values())
433
+ elif isinstance(obj, list):
434
+ return any(has_ocr_image_objects(item) for item in obj)
435
+ else:
436
+ return 'OCRImageObject' in str(type(obj))
437
+
438
+ # Apply serialization preemptively if OCRImageObjects are detected
439
+ if has_ocr_image_objects(result):
440
+ # Quietly apply full serialization before any errors occur
441
+ result = serialize_ocr_response(result)
442
+ else:
443
+ # Test JSON serialization to catch any other issues
444
+ json.dumps(result)
445
  except TypeError as e:
446
+ # If there's still a serialization error, run the whole result through our serializer
447
  logger = logging.getLogger("serializer")
448
  logger.warning(f"JSON serialization error in result: {str(e)}. Applying full serialization.")
449
+ # Use a more robust approach to ensure complete serialization
450
+ try:
451
+ # First attempt with our custom serializer
452
+ result = serialize_ocr_response(result)
453
+ # Test if it's fully serializable now
454
+ json.dumps(result)
455
+ except Exception as inner_e:
456
+ # If still not serializable, convert to a simpler format
457
+ logger.warning(f"Secondary serialization error: {str(inner_e)}. Converting to basic format.")
458
+ # Create a simplified result with just the essential information
459
+ simplified_result = {
460
+ "file_name": result.get("file_name", "unknown"),
461
+ "topics": result.get("topics", ["Document"]),
462
+ "languages": [str(lang) for lang in result.get("languages", ["English"]) if lang is not None],
463
+ "ocr_contents": {
464
+ "raw_text": result.get("ocr_contents", {}).get("raw_text", "Text extraction failed due to serialization error")
465
+ },
466
+ "serialization_error": f"Original result could not be fully serialized: {str(e)}"
467
+ }
468
+ result = simplified_result
469
 
470
  return result
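# --- Editor's illustrative note: has_ocr_image_objects (defined inside the
# try block above) recurses through dicts and lists, so an OCRImageObject
# nested anywhere, e.g. {"pages": [{"images": [obj]}]}, triggers the proactive
# serialize_ocr_response pass before json.dumps is attempted.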
471
 
 
1181
 
1182
  # Use enhanced preprocessing functions from ocr_utils
1183
  try:
1184
+ from preprocessing import preprocess_image
1185
+ from utils.file_utils import get_base64_from_bytes
1186
 
1187
+ logger.info(f"Applying image preprocessing for OCR")
1188
 
1189
  # Get preprocessing settings from config
1190
  max_size_mb = IMAGE_PREPROCESSING.get("max_size_mb", 8.0)
 
1192
  if file_size_mb > max_size_mb:
1193
  logger.info(f"Image is large ({file_size_mb:.2f} MB), optimizing for API submission")
1194
 
1195
+ # Handwritten docs default to the conservative pipeline
1196
+ base64_data_url = get_base64_from_bytes(
1197
+ preprocess_image(file_path.read_bytes(),
1198
+ {"document_type": "handwritten",
1199
+ "grayscale": True,
1200
+ "denoise": True,
1201
+ "contrast": 0})
1202
+ )
1203
 
1204
  logger.info(f"Image preprocessing completed successfully")
1205
 
 
1253
  except ImportError:
1254
  logger.warning("PIL not available for resizing. Using original image.")
1255
  # Use enhanced encoder with proper MIME type detection
1256
+ from utils.image_utils import encode_image_for_api
1257
  base64_data_url = encode_image_for_api(file_path)
1258
  except Exception as e:
1259
  logger.warning(f"Image resize failed: {str(e)}. Using original image.")
 
1262
  base64_data_url = encode_image_for_api(file_path)
1263
  else:
1264
  # For smaller images, use as-is with proper MIME type
1265
+ from utils.image_utils import encode_image_for_api
1266
  base64_data_url = encode_image_for_api(file_path)
1267
  except Exception as e:
1268
  # Fallback to original image if any preprocessing fails
 
1327
  logger.error("Maximum retries reached, rate limit error persists.")
1328
  try:
1329
  # Try to import the local OCR fallback function
1330
+ from utils.image_utils import try_local_ocr_fallback
1331
 
1332
  # Attempt local OCR fallback
1333
  ocr_text = try_local_ocr_fallback(file_path, base64_data_url)
 
1539
  logger.info("Sufficient OCR text detected, analyzing language before using OCR text directly")
1540
 
1541
  # Perform language detection on the OCR text before returning
1542
+ if LANG_DETECTOR_AVAILABLE and self.language_detector:
1543
+ detected_languages = self.language_detector.detect_languages(
1544
+ ocr_markdown,
1545
+ filename=getattr(self, 'current_filename', None)
1546
+ )
1547
+ else:
1548
+ # If language detector is not available, use default English
1549
+ detected_languages = ["English"]
1550
 
1551
  return {
1552
  "file_name": filename,
 
1720
 
1721
  # If OCR text has clear French patterns but language is English or missing, fix it
1722
  if ocr_markdown and 'languages' in result:
1723
+ if LANG_DETECTOR_AVAILABLE and self.language_detector:
1724
+ result['languages'] = self.language_detector.detect_languages(
1725
+ ocr_markdown,
1726
+ filename=getattr(self, 'current_filename', None),
1727
+ current_languages=result['languages']
1728
+ )
1729
 
1730
  except Exception as e:
1731
  # Fall back to text-only model if vision model fails
 
1735
  return result
1736
 
1737
  # We've removed document type detection entirely for simplicity
1738
+
1739
 
1740
  # Create a prompt with enhanced language detection instructions
1741
  generic_section = (
1742
  f"You are an OCR specialist processing historical documents. "
1743
+ f"Focus on accurately extracting text content and image chunks while preserving structure and formatting. "
1744
  f"Pay attention to any historical features and document characteristics.\n\n"
 
 
 
1745
  f"Create a structured JSON response with the following fields:\n"
1746
  f"- file_name: The document's name\n"
1747
  f"- topics: An array of topics covered in the document\n"
1748
  f"- languages: An array of languages used in the document (be precise and specific about language detection)\n"
1749
  f"- ocr_contents: A comprehensive dictionary with the document's contents including:\n"
1750
+ f" * title: The title or heading (if present)\n"
1751
+ f" * transcript: The full text of the document\n"
1752
+ f" * text: The main text content (if different from transcript)\n"
1753
+ f" * content: The body content (if different than transcript)\n"
1754
+ f" * images: An array of image objects with their base64 data\n"
1755
+ f" * alt_text: The alt text or description of the images\n"
1756
+ f" * caption: The caption or title of the images\n"
1757
  f" * raw_text: The complete OCR text\n"
1758
  )
1759
 
 
1764
 
1765
  # Return the enhanced prompt
1766
  return generic_section + custom_section
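# --- Editor's illustrative sketch: the JSON shape the prompt above requests
# from the model (all values hypothetical).
#
#   {
#     "file_name": "letter_1872.jpg",
#     "topics": ["Correspondence", "19th Century"],
#     "languages": ["English", "French"],
#     "ocr_contents": {
#       "title": "Letter to M. Dupont",
#       "transcript": "...",
#       "images": [{"alt_text": "wax seal", "caption": "Seal detail", "base64": "..."}],
#       "raw_text": "..."
#     }
#   }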
1767
1768
  def _extract_structured_data_text_only(self, ocr_markdown, filename, custom_prompt=None):
1769
  """
1770
  Extract structured data using text-only model with detailed historical context prompting
test_magician.py → testing/test_magician.py RENAMED
File without changes
ui_components.py CHANGED
@@ -3,9 +3,21 @@ import os
3
  import io
4
  import base64
5
  import logging
 
6
  from datetime import datetime
7
  from pathlib import Path
8
  import json
9
  from constants import (
10
  DOCUMENT_TYPES,
11
  DOCUMENT_LAYOUTS,
@@ -19,7 +31,16 @@ from constants import (
19
  PREPROCESSING_DOC_TYPES,
20
  ROTATION_OPTIONS
21
  )
22
- from utils import get_base64_from_image, extract_subject_tags
23
 
24
  class ProgressReporter:
25
  """Class to handle progress reporting in the UI"""
@@ -69,12 +90,10 @@ def create_sidebar_options():
69
 
70
  # Create a container for the sidebar options
71
  with st.container():
72
- # Model selection
73
- st.markdown("### Model Selection")
74
- use_vision = st.toggle("Use Vision Model", value=True, help="Use vision model for better understanding of document structure")
75
 
76
  # Document type selection
77
- st.markdown("### Document Type")
78
  doc_type = st.selectbox("Document Type", DOCUMENT_TYPES,
79
  help="Select the type of document you're processing for better results")
80
 
@@ -91,8 +110,8 @@ def create_sidebar_options():
91
 
92
  # Custom prompt
93
  custom_prompt = ""
94
- if doc_type != DOCUMENT_TYPES[0]: # Not auto-detect
95
- # Get the template for the selected document type
96
  prompt_template = CUSTOM_PROMPT_TEMPLATES.get(doc_type, "")
97
 
98
  # Add layout information if not standard
@@ -103,53 +122,37 @@ def create_sidebar_options():
103
 
104
  # Set the custom prompt
105
  custom_prompt = prompt_template
106
-
107
- # Allow user to edit the prompt
108
- st.markdown("**Custom Processing Instructions**")
109
- custom_prompt = st.text_area("", value=custom_prompt,
110
- help="Customize the instructions for processing this document",
111
- height=80)
112
 
113
- # Image preprocessing options in an expandable section
114
- with st.expander("Image Preprocessing (Optional)"):
115
- # Add help text to clarify that preprocessing is optional
116
- st.info("Preprocessing is optional and only applied when options below are selected. Document type alone doesn't trigger preprocessing.")
117
-
118
- # Grayscale conversion
119
- grayscale = st.checkbox("Convert to Grayscale",
120
- value=False,
121
- help="Convert color images to grayscale for better OCR")
122
-
123
- # Denoise
124
- denoise = st.checkbox("Denoise Image",
125
- value=False,
126
- help="Remove noise from the image")
127
-
128
- # Contrast adjustment
129
- contrast = st.slider("Contrast Adjustment",
130
- min_value=-50,
131
- max_value=50,
132
- value=0,
133
- step=10,
134
- help="Adjust image contrast")
135
-
136
- # Rotation
137
- rotation = st.slider("Rotation",
138
- min_value=-45,
139
- max_value=45,
140
- value=0,
141
- step=5,
142
- help="Rotate image if needed")
143
-
144
- # Add image segmentation option
145
- st.markdown("### Advanced Options")
146
- use_segmentation = st.toggle("Enable Image Segmentation",
147
- value=False,
148
- help="Segment the image into text and image regions for better OCR results on complex documents")
149
-
150
- # Show explanation if segmentation is enabled
151
- if use_segmentation:
152
- st.info("Image segmentation identifies distinct text regions in complex documents, improving OCR accuracy. This is especially helpful for documents with mixed content like the Magician illustration.")
153
 
154
  # Create preprocessing options dictionary
155
  # Set document_type based on selection in UI
@@ -169,17 +172,17 @@ def create_sidebar_options():
169
  "rotation": rotation
170
  }
171
 
172
- # PDF-specific options in an expandable section
173
- with st.expander("PDF Options"):
174
- max_pages = st.number_input("Maximum Pages to Process",
175
- min_value=1,
176
- max_value=20,
177
- value=DEFAULT_MAX_PAGES,
178
- help="Limit the number of pages to process (for multi-page PDFs)")
179
-
180
- # Set default values for removed options
181
- pdf_dpi = DEFAULT_PDF_DPI
182
- pdf_rotation = 0
183
 
184
  # Create options dictionary
185
  options = {
@@ -219,471 +222,6 @@ def create_file_uploader():
219
  )
220
  return uploaded_file
221
 
222
- # Function removed - now using inline implementation in app.py
223
- def _unused_display_preprocessing_preview(uploaded_file, preprocessing_options):
224
- """Display a preview of image with preprocessing options applied"""
225
- if (any(preprocessing_options.values()) and
226
- uploaded_file.type.startswith('image/')):
227
-
228
- st.markdown("**Preprocessed Preview**")
229
- try:
230
- # Create a container for the preview
231
- with st.container():
232
- processed_bytes = preprocess_image(uploaded_file.getvalue(), preprocessing_options)
233
- # Convert image to base64 and display as HTML to avoid fullscreen button
234
- img_data = base64.b64encode(processed_bytes).decode()
235
- img_html = f'<img src="data:image/jpeg;base64,{img_data}" style="width:100%; border-radius:4px;">'
236
- st.markdown(img_html, unsafe_allow_html=True)
237
-
238
- # Show preprocessing metadata in a well-formatted caption
239
- meta_items = []
240
- if preprocessing_options.get("document_type", "standard") != "standard":
241
- meta_items.append(f"Document type ({preprocessing_options['document_type']})")
242
- if preprocessing_options.get("grayscale", False):
243
- meta_items.append("Grayscale")
244
- if preprocessing_options.get("denoise", False):
245
- meta_items.append("Denoise")
246
- if preprocessing_options.get("contrast", 0) != 0:
247
- meta_items.append(f"Contrast ({preprocessing_options['contrast']})")
248
- if preprocessing_options.get("rotation", 0) != 0:
249
- meta_items.append(f"Rotation ({preprocessing_options['rotation']}°)")
250
-
251
- # Only show "Applied:" if there are actual preprocessing steps
252
- if meta_items:
253
- meta_text = "Applied: " + ", ".join(meta_items)
254
- st.caption(meta_text)
255
- except Exception as e:
256
- st.error(f"Error in preprocessing: {str(e)}")
257
- st.info("Try using grayscale preprocessing for PNG images with transparency")
258
-
259
- def display_results(result, container, custom_prompt=""):
260
- """Display OCR results in the provided container"""
261
- with container:
262
- # Add heading for document metadata
263
- st.markdown("### Document Metadata")
264
-
265
- # Create a compact metadata section
266
- meta_html = '<div style="display: flex; flex-wrap: wrap; gap: 0.3rem; margin-bottom: 0.3rem;">'
267
-
268
- # Document type
269
- if 'detected_document_type' in result:
270
- meta_html += f'<div><strong>Type:</strong> {result["detected_document_type"]}</div>'
271
-
272
- # Processing time
273
- if 'processing_time' in result:
274
- meta_html += f'<div><strong>Time:</strong> {result["processing_time"]:.1f}s</div>'
275
-
276
- # Page information
277
- if 'limited_pages' in result:
278
- meta_html += f'<div><strong>Pages:</strong> {result["limited_pages"]["processed"]}/{result["limited_pages"]["total"]}</div>'
279
-
280
- meta_html += '</div>'
281
- st.markdown(meta_html, unsafe_allow_html=True)
282
-
283
- # Language metadata on a separate line, Subject Tags below
284
-
285
- # First show languages if available
286
- if 'languages' in result and result['languages']:
287
- languages = [lang for lang in result['languages'] if lang is not None]
288
- if languages:
289
- # Create a dedicated line for Languages
290
- lang_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
291
- lang_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Language:</div>'
292
-
293
- # Add language tags
294
- for lang in languages:
295
- # Clean language name if needed
296
- clean_lang = str(lang).strip()
297
- if clean_lang: # Only add if not empty
298
- lang_html += f'<span class="subject-tag tag-language">{clean_lang}</span>'
299
-
300
- lang_html += '</div>'
301
- st.markdown(lang_html, unsafe_allow_html=True)
302
-
303
- # Create a separate line for Time if we have time-related tags
304
- if 'topics' in result and result['topics']:
305
- time_tags = [topic for topic in result['topics']
306
- if any(term in topic.lower() for term in ["century", "pre-", "era", "historical"])]
307
- if time_tags:
308
- time_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
309
- time_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Time:</div>'
310
- for tag in time_tags:
311
- time_html += f'<span class="subject-tag tag-time-period">{tag}</span>'
312
- time_html += '</div>'
313
- st.markdown(time_html, unsafe_allow_html=True)
314
-
315
- # Then display remaining subject tags if available
316
- if 'topics' in result and result['topics']:
317
- # Filter out time-related tags which are already displayed
318
- subject_tags = [topic for topic in result['topics']
319
- if not any(term in topic.lower() for term in ["century", "pre-", "era", "historical"])]
320
-
321
- if subject_tags:
322
- # Create a separate line for Subject Tags
323
- tags_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
324
- tags_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Subject Tags:</div>'
325
- tags_html += '<div style="display: flex; flex-wrap: wrap; gap: 2px; align-items: center;">'
326
-
327
- # Generate a badge for each remaining tag
328
- for topic in subject_tags:
329
- # Determine tag category class
330
- tag_class = "subject-tag" # Default class
331
-
332
- # Add specialized class based on category
333
- if any(term in topic.lower() for term in ["language", "english", "french", "german", "latin"]):
334
- tag_class += " tag-language" # Languages
335
- elif any(term in topic.lower() for term in ["letter", "newspaper", "book", "form", "document", "recipe"]):
336
- tag_class += " tag-document-type" # Document types
337
- elif any(term in topic.lower() for term in ["travel", "military", "science", "medicine", "education", "art", "literature"]):
338
- tag_class += " tag-subject" # Subject domains
339
-
340
- # Add each tag as an inline span
341
- tags_html += f'<span class="{tag_class}">{topic}</span>'
342
-
343
- # Close the containers
344
- tags_html += '</div></div>'
345
-
346
- # Render the subject tags section
347
- st.markdown(tags_html, unsafe_allow_html=True)
348
-
349
- # No OCR content heading - start directly with tabs
350
-
351
- # Check if we have OCR content
352
- if 'ocr_contents' in result:
353
- # Create a single view instead of tabs
354
- content_tab1 = st.container()
355
-
356
- # Check for images in the result to use later
357
- has_images = result.get('has_images', False)
358
- has_image_data = ('pages_data' in result and any(page.get('images', []) for page in result.get('pages_data', [])))
359
- has_raw_images = ('raw_response_data' in result and 'pages' in result['raw_response_data'] and
360
- any('images' in page for page in result['raw_response_data']['pages']
361
- if isinstance(page, dict)))
362
-
363
- # Display structured content
364
- with content_tab1:
365
- # Display structured content with markdown formatting
366
- if isinstance(result['ocr_contents'], dict):
367
- # CSS is now handled in the main layout.py file
368
-
369
- # Function to process text with markdown support
370
- def format_markdown_text(text):
371
- """Format text with markdown and handle special patterns"""
372
- if not text:
373
- return ""
374
-
375
- import re
376
-
377
- # First, ensure we're working with a string
378
- if not isinstance(text, str):
379
- text = str(text)
380
-
381
- # Ensure newlines are preserved for proper spacing
382
- # Convert any Windows line endings to Unix
383
- text = text.replace('\r\n', '\n')
384
-
385
- # Format dates (MM/DD/YYYY or similar patterns)
386
- date_pattern = r'\b(0?[1-9]|1[0-2])[\/\-\.](0?[1-9]|[12][0-9]|3[01])[\/\-\.](\d{4}|\d{2})\b'
387
- text = re.sub(date_pattern, r'**\g<0>**', text)
388
-
389
- # Detect markdown tables and preserve them
390
- table_sections = []
391
- non_table_lines = []
392
- in_table = False
393
- table_buffer = []
394
-
395
- # Process text line by line, preserving tables
396
- lines = text.split('\n')
397
- for i, line in enumerate(lines):
398
- line_stripped = line.strip()
399
-
400
- # Detect table rows by pipe character
401
- if '|' in line_stripped and (line_stripped.startswith('|') or line_stripped.endswith('|')):
402
- if not in_table:
403
- in_table = True
404
- if table_buffer:
405
- table_buffer = []
406
- table_buffer.append(line)
407
-
408
- # Check if the next line is a table separator
409
- if i < len(lines) - 1 and '---' in lines[i+1] and '|' in lines[i+1]:
410
- table_buffer.append(lines[i+1])
411
-
412
- # Detect table separators (---|---|---)
413
- elif in_table and '---' in line_stripped and '|' in line_stripped:
414
- table_buffer.append(line)
415
-
416
- # End of table detection
417
- elif in_table:
418
- # Check if this is still part of the table
419
- next_line_is_table = False
420
- if i < len(lines) - 1:
421
- next_line = lines[i+1].strip()
422
- if '|' in next_line and (next_line.startswith('|') or next_line.endswith('|')):
423
- next_line_is_table = True
424
-
425
- if not next_line_is_table:
426
- in_table = False
427
- # Save the complete table
428
- if table_buffer:
429
- table_sections.append('\n'.join(table_buffer))
430
- table_buffer = []
431
- # Add current line to non-table lines
432
- non_table_lines.append(line)
433
- else:
434
- # Still part of the table
435
- table_buffer.append(line)
436
- else:
437
- # Not in a table
438
- non_table_lines.append(line)
439
-
440
- # Handle any remaining table buffer
441
- if in_table and table_buffer:
442
- table_sections.append('\n'.join(table_buffer))
443
-
444
- # Process non-table lines
445
- processed_lines = []
446
- for line in non_table_lines:
447
- line_stripped = line.strip()
448
-
449
- # Check if line is in ALL CAPS (and not just a short acronym)
450
- if line_stripped and line_stripped.isupper() and len(line_stripped) > 3:
451
- # ALL CAPS line - make bold instead of heading to prevent large display
452
- processed_lines.append(f"**{line_stripped}**")
453
- # Process potential headers (lines ending with colon)
454
- elif line_stripped and line_stripped.endswith(':') and len(line_stripped) < 40:
455
- # Likely a header - make it bold
456
- processed_lines.append(f"**{line_stripped}**")
457
- else:
458
- # Keep original line with its spacing
459
- processed_lines.append(line)
460
-
461
- # Join non-table lines
462
- processed_text = '\n'.join(processed_lines)
463
-
464
- # Reinsert tables in the right positions
465
- for table in table_sections:
466
- # Generate a unique marker for this table
467
- marker = f"__TABLE_MARKER_{hash(table) % 10000}__"
468
- # Find a good position to insert this table
469
- # For now, just append all tables at the end
470
- processed_text += f"\n\n{table}\n\n"
471
-
472
- # Make sure paragraphs have proper spacing but not excessive
473
- processed_text = re.sub(r'\n{3,}', '\n\n', processed_text)
474
-
475
- # Ensure two newlines between paragraphs for proper markdown rendering
476
- processed_text = re.sub(r'([^\n])\n([^\n])', r'\1\n\n\2', processed_text)
477
-
478
- return processed_text
479
-
480
- # Collect all available images from the result
481
- available_images = []
482
- if has_images and 'pages_data' in result:
483
- for page_idx, page in enumerate(result['pages_data']):
484
- if 'images' in page and len(page['images']) > 0:
485
- for img_idx, img in enumerate(page['images']):
486
- if 'image_base64' in img:
487
- available_images.append({
488
- 'source': 'pages_data',
489
- 'page': page_idx,
490
- 'index': img_idx,
491
- 'data': img['image_base64']
492
- })
493
-
494
- # Get images from raw response as well
495
- if 'raw_response_data' in result:
496
- raw_data = result['raw_response_data']
497
- if isinstance(raw_data, dict) and 'pages' in raw_data:
498
- for page_idx, page in enumerate(raw_data['pages']):
499
- if isinstance(page, dict) and 'images' in page:
500
- for img_idx, img in enumerate(page['images']):
501
- if isinstance(img, dict) and 'base64' in img:
502
- available_images.append({
503
- 'source': 'raw_response',
504
- 'page': page_idx,
505
- 'index': img_idx,
506
- 'data': img['base64']
507
- })
508
-
509
- # Extract images for display at the top
510
- images_to_display = []
511
-
512
- # First, collect all available images
513
- for img_idx, img in enumerate(available_images):
514
- if 'data' in img:
515
- images_to_display.append({
516
- 'data': img['data'],
517
- 'id': img.get('id', f"img_{img_idx}"),
518
- 'index': img_idx
519
- })
520
-
521
- # Simple display of image without dropdown or Document Image tab
522
- if images_to_display and len(images_to_display) > 0:
523
- # Just display the first image directly
524
- st.image(images_to_display[0]['data'], use_container_width=True)
525
-
526
- # Organize sections in a logical order
527
- section_order = ["title", "author", "date", "summary", "content", "transcript", "metadata"]
528
- ordered_sections = []
529
-
530
- # Add known sections first in preferred order
531
- for section_name in section_order:
532
- if section_name in result['ocr_contents'] and result['ocr_contents'][section_name]:
533
- ordered_sections.append(section_name)
534
-
535
- # Add any remaining sections
536
- for section in result['ocr_contents'].keys():
537
- if (section not in ordered_sections and
538
- section not in ['error', 'partial_text'] and
539
- result['ocr_contents'][section]):
540
- ordered_sections.append(section)
541
-
542
- # If only raw_text is available and no other content, add it last
543
- if ('raw_text' in result['ocr_contents'] and
544
- result['ocr_contents']['raw_text'] and
545
- len(ordered_sections) == 0):
546
- ordered_sections.append('raw_text')
547
-
548
- # Add minimal spacing before OCR results
549
- st.markdown("<div style='margin: 8px 0 4px 0;'></div>", unsafe_allow_html=True)
550
- st.markdown("### Document Content")
551
-
552
- # Process each section using expanders
553
- for i, section in enumerate(ordered_sections):
554
- content = result['ocr_contents'][section]
555
-
556
- # Skip empty content
557
- if not content:
558
- continue
559
-
560
- # Create an expander for each section
561
- # First section is expanded by default
562
- with st.expander(f"{section.replace('_', ' ').title()}", expanded=(i == 0)):
563
- if isinstance(content, str):
564
- # Handle image markdown
565
- if content.startswith("![") and content.endswith(")"):
566
- try:
567
- alt_text = content[2:content.index(']')]
568
- st.info(f"Image description: {alt_text if len(alt_text) > 5 else 'Image'}")
569
- except:
570
- st.info("Contains image reference")
571
- else:
572
- # Process text content
573
- formatted_content = format_markdown_text(content).strip()
574
-
575
- # Check if content contains markdown tables or complex text
576
- has_tables = '|' in formatted_content and '---' in formatted_content
577
- has_complex_structure = formatted_content.count('\n') > 5 or formatted_content.count('**') > 2
578
-
579
- # Use a container with minimal margins
580
- with st.container():
581
- # For text-only extractions or content with tables, ensure proper rendering
582
- if has_tables or has_complex_structure:
583
- # For text with tables or multiple paragraphs, use special handling
584
- # First ensure proper markdown spacing
585
- formatted_content = formatted_content.replace('\n\n\n', '\n\n')
586
-
587
- # Look for any all caps headers that might be misinterpreted
588
- import re
589
- formatted_content = re.sub(
590
- r'^([A-Z][A-Z\s]+)$',
591
- r'**\1**',
592
- formatted_content,
593
- flags=re.MULTILINE
594
- )
595
-
596
- # Preserve table formatting by adding proper spacing
597
- if has_tables:
598
- formatted_content = formatted_content.replace('\n|', '\n\n|')
599
-
600
- # Add proper paragraph spacing
601
- formatted_content = re.sub(r'([^\n])\n([^\n])', r'\1\n\n\2', formatted_content)
602
-
603
- # Use standard markdown with custom styling
604
- st.markdown(formatted_content, unsafe_allow_html=False)
605
- else:
606
- # For simpler content, use standard markdown
607
- st.markdown(formatted_content)
608
-
609
- elif isinstance(content, list):
610
- # Create markdown list
611
- list_items = []
612
- for item in content:
613
- if isinstance(item, str):
614
- item_text = format_markdown_text(item).strip()
615
- # Handle potential HTML special characters for proper rendering
616
- item_text = item_text.replace('<', '&lt;').replace('>', '&gt;')
617
- list_items.append(f"- {item_text}")
618
- else:
619
- list_items.append(f"- {str(item)}")
620
-
621
- list_content = "\n".join(list_items)
622
-
623
- # Use a container with minimal margins
624
- with st.container():
625
- # Use standard markdown for better rendering
626
- st.markdown(list_content)
627
-
628
- elif isinstance(content, dict):
629
- # Format dictionary content
630
- dict_items = []
631
- for k, v in content.items():
632
- key_formatted = k.replace('_', ' ').title()
633
-
634
- if isinstance(v, str):
635
- value_formatted = format_markdown_text(v).strip()
636
- dict_items.append(f"**{key_formatted}:** {value_formatted}")
637
- else:
638
- dict_items.append(f"**{key_formatted}:** {str(v)}")
639
-
640
- dict_content = "\n".join(dict_items)
641
-
642
- # Use a container with minimal margins
643
- with st.container():
644
- # Use standard markdown for better rendering
645
- st.markdown(dict_content)
646
-
647
- # Display custom prompt if provided
648
- if custom_prompt:
649
- with st.expander("Custom Processing Instructions"):
650
- st.write(custom_prompt)
651
-
652
- # No download heading - start directly with buttons
653
-
654
- # JSON download - use full width for buttons
655
- try:
656
- json_str = json.dumps(result, indent=2)
657
- st.download_button(
658
- label="Download JSON",
659
- data=json_str,
660
- file_name=f"{result.get('file_name', 'document').split('.')[0]}_ocr.json",
661
- mime="application/json"
662
- )
663
- except Exception as e:
664
- st.error(f"Error creating JSON download: {str(e)}")
665
-
666
- # Text download
667
- try:
668
- if 'ocr_contents' in result:
669
- if 'raw_text' in result['ocr_contents']:
670
- text_content = result['ocr_contents']['raw_text']
671
- elif 'content' in result['ocr_contents']:
672
- text_content = result['ocr_contents']['content']
673
- else:
674
- text_content = str(result['ocr_contents'])
675
- else:
676
- text_content = "No text content available."
677
-
678
- st.download_button(
679
- label="Download Text",
680
- data=text_content,
681
- file_name=f"{result.get('file_name', 'document').split('.')[0]}_ocr.txt",
682
- mime="text/plain"
683
- )
684
- except Exception as e:
685
- st.error(f"Error creating text download: {str(e)}")
686
-
687
  def display_document_with_images(result):
688
  """Display document with images"""
689
  # Check for pages_data first
@@ -759,7 +297,7 @@ def display_document_with_images(result):
759
  if isinstance(raw_page, dict) and 'images' in raw_page:
760
  for img in raw_page['images']:
761
  if isinstance(img, dict) and 'base64' in img:
762
- st.image(img['base64'])
763
  st.caption("Image from OCR response")
764
  image_displayed = True
765
  break
@@ -797,7 +335,7 @@ def display_previous_results():
797
  st.markdown("""
798
  <div style="text-align: center; padding: 30px 20px; background-color: #f8f9fa; border-radius: 6px; margin-top: 10px;">
799
  <div style="font-size: 36px; margin-bottom: 15px;">📄</div>
800
- <h4 style="margin-bottom: 8px; font-weight: 500;">No Previous Results</h4>
801
  <p style="font-size: 14px; color: #666;">Process a document to see your results history.</p>
802
  </div>
803
  """, unsafe_allow_html=True)
@@ -806,7 +344,7 @@ def display_previous_results():
806
  with col2:
807
  try:
808
  # Create download button for all results
809
- from ocr_utils import create_results_zip_in_memory
810
  zip_data = create_results_zip_in_memory(st.session_state.previous_results)
811
  timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
812
 
@@ -908,37 +446,22 @@ def display_previous_results():
908
  meta_html += '</div>'
909
  st.markdown(meta_html, unsafe_allow_html=True)
910
 
911
- # Simplified tabs - fewer options for cleaner interface
912
  has_images = selected_result.get('has_images', False)
913
  if has_images:
914
- view_tabs = st.tabs(["Document Content", "Raw Text", "Images"])
915
  view_tab1, view_tab2, view_tab3 = view_tabs
916
  else:
917
- view_tabs = st.tabs(["Document Content", "Raw Text"])
918
  view_tab1, view_tab2 = view_tabs
919
-
920
- # Define helper function for formatting text
921
- def format_text_display(text):
922
- if not isinstance(text, str):
923
- return text
924
-
925
- lines = text.split('\n')
926
- processed_lines = []
927
- for line in lines:
928
- line_stripped = line.strip()
929
- if line_stripped and line_stripped.isupper() and len(line_stripped) > 3:
930
- processed_lines.append(f"**{line_stripped}**")
931
- else:
932
- processed_lines.append(line)
933
-
934
- return '\n'.join(processed_lines)
935
 
936
  # First tab - Document Content (simplified structured view)
937
  with view_tab1:
938
  # Display content in a cleaner, more streamlined format
939
  if 'ocr_contents' in selected_result and isinstance(selected_result['ocr_contents'], dict):
940
  # Create a more focused list of important sections
941
- priority_sections = ["title", "content", "transcript", "summary", "raw_text"]
942
  displayed_sections = set()
943
 
944
  # First display priority sections
@@ -951,7 +474,7 @@ def display_previous_results():
951
  st.markdown(f"##### {section.replace('_', ' ').title()}")
952
 
953
  # Format and display content
954
- formatted_content = format_text_display(content)
955
  st.markdown(formatted_content)
956
  displayed_sections.add(section)
957
 
@@ -963,7 +486,7 @@ def display_previous_results():
963
  st.markdown(f"##### {section.replace('_', ' ').title()}")
964
 
965
  if isinstance(content, str):
966
- st.markdown(format_text_display(content))
967
  elif isinstance(content, list):
968
  for item in content:
969
  st.markdown(f"- {item}")
@@ -971,34 +494,42 @@ def display_previous_results():
971
  for k, v in content.items():
972
  st.markdown(f"**{k}:** {v}")
973
 
974
- # Second tab - Raw Text (simplified)
975
  with view_tab2:
976
- # Extract raw text or content
977
- raw_text = ""
 
 
 
 
 
 
 
978
  if 'ocr_contents' in selected_result:
979
- if 'raw_text' in selected_result['ocr_contents']:
980
- raw_text = selected_result['ocr_contents']['raw_text']
981
- elif 'content' in selected_result['ocr_contents']:
982
- raw_text = selected_result['ocr_contents']['content']
 
 
 
 
 
 
 
 
 
 
 
983
 
984
- # Display the text area with raw text
985
- edited_text = st.text_area("", raw_text, height=300, key="selected_raw_text")
986
 
987
- # Add buttons in a row
988
- col1, col2 = st.columns(2)
989
- with col1:
990
- st.button("Copy Text", key="selected_copy_btn")
991
- with col2:
992
- st.download_button(
993
- label="Download Text",
994
- data=edited_text,
995
- file_name=f"{file_name.split('.')[0]}_text.txt",
996
- mime="text/plain",
997
- key="selected_download_btn"
998
- )
999
 
1000
- # Third tab - With Images (simplified)
1001
- if has_images and 'pages_data' in selected_result:
1002
  with view_tab3:
1003
  # Simplified image display
1004
  if 'pages_data' in selected_result:
@@ -1007,7 +538,7 @@ def display_previous_results():
1007
  if 'images' in page_data and len(page_data['images']) > 0:
1008
  for img in page_data['images']:
1009
  if 'image_base64' in img:
1010
- st.image(img['image_base64'], use_column_width=True)
1011
 
1012
  # Get page text if available
1013
  page_text = ""
@@ -1018,21 +549,22 @@ def display_previous_results():
1018
  if page_text:
1019
  with st.expander(f"Page {i+1} Text", expanded=False):
1020
  st.text(page_text)
 
1021
 
1022
  def display_about_tab():
1023
- """Display about tab content"""
1024
- st.header("About")
1025
 
1026
  # Add app description
1027
  st.markdown("""
1028
- **Historical OCR** is a specialized tool for extracting text from historical documents, manuscripts, and printed materials.
1029
  """)
1030
 
1031
  # Purpose section with consistent formatting
1032
  st.markdown("### Purpose")
1033
  st.markdown("""
1034
  This tool is designed to assist scholars in historical research by extracting text from challenging documents.
1035
- While it may not achieve 100% accuracy for all materials, it serves as a valuable research aid for navigating
1036
  historical documents, particularly:
1037
  """)
1038
 
 
3
  import io
4
  import base64
5
  import logging
6
+ import re
7
  from datetime import datetime
8
  from pathlib import Path
9
  import json
10
+
11
+ # Define exports
12
+ __all__ = [
13
+ 'ProgressReporter',
14
+ 'create_sidebar_options',
15
+ 'create_file_uploader',
16
+ 'display_document_with_images',
17
+ 'display_previous_results',
18
+ 'display_about_tab',
19
+ 'display_results' # Re-export from utils.ui_utils
20
+ ]
21
  from constants import (
22
  DOCUMENT_TYPES,
23
  DOCUMENT_LAYOUTS,
 
31
  PREPROCESSING_DOC_TYPES,
32
  ROTATION_OPTIONS
33
  )
34
+ from utils.image_utils import format_ocr_text
35
+ from utils.content_utils import (
36
+ classify_document_content,
37
+ extract_document_text,
38
+ extract_image_description,
39
+ clean_raw_text,
40
+ format_markdown_text
41
+ )
42
+ from utils.ui_utils import display_results
43
+ from preprocessing import preprocess_image
44
 
45
  class ProgressReporter:
46
  """Class to handle progress reporting in the UI"""
 
90
 
91
  # Create a container for the sidebar options
92
  with st.container():
93
+ # Default to using vision model (removed selection from UI)
94
+ use_vision = True
 
95
 
96
  # Document type selection
 
97
  doc_type = st.selectbox("Document Type", DOCUMENT_TYPES,
98
  help="Select the type of document you're processing for better results")
99
 
 
110
 
111
  # Custom prompt
112
  custom_prompt = ""
113
+ # Get the template for the selected document type if not auto-detect
114
+ if doc_type != DOCUMENT_TYPES[0]:
115
  prompt_template = CUSTOM_PROMPT_TEMPLATES.get(doc_type, "")
116
 
117
  # Add layout information if not standard
 
122
 
123
  # Set the custom prompt
124
  custom_prompt = prompt_template
 
 
 
 
 
 
125
 
126
+ # Allow user to edit the prompt (always visible)
127
+ custom_prompt = st.text_area("Custom Processing Instructions", value=custom_prompt,
128
+ help="Customize the instructions for processing this document",
129
+ height=80)
130
+
131
+ # Image preprocessing options (always visible)
132
+ st.markdown("### Image Preprocessing")
133
+
134
+ # Grayscale conversion
135
+ grayscale = st.checkbox("Convert to Grayscale",
136
+ value=False,
137
+ help="Convert color images to grayscale for better text recognition")
138
+
139
+ # Light denoising option
140
+ denoise = st.checkbox("Light Denoising",
141
+ value=False,
142
+ help="Apply gentle denoising to improve text clarity")
143
+
144
+ # Contrast adjustment
145
+ contrast = st.slider("Contrast Adjustment",
146
+ min_value=-20,
147
+ max_value=20,
148
+ value=0,
149
+ step=5,
150
+ help="Adjust image contrast (limited range)")
151
+
152
+
153
+ # Initialize rotation (keeping it set to 0)
154
+ rotation = 0
155
+ use_segmentation = False
 
 
 
 
 
 
 
 
 
 
156
 
157
  # Create preprocessing options dictionary
158
  # Set document_type based on selection in UI
 
172
  "rotation": rotation
173
  }
174
 
175
+ # PDF-specific options
176
+ st.markdown("### PDF Options")
177
+ max_pages = st.number_input("Maximum Pages to Process",
178
+ min_value=1,
179
+ max_value=20,
180
+ value=DEFAULT_MAX_PAGES,
181
+ help="Limit the number of pages to process (for multi-page PDFs)")
182
+
183
+ # Set default values for removed options
184
+ pdf_dpi = DEFAULT_PDF_DPI
185
+ pdf_rotation = 0
186
 
187
  # Create options dictionary
188
  options = {
 
222
  )
223
  return uploaded_file
224
 
225
  def display_document_with_images(result):
226
  """Display document with images"""
227
  # Check for pages_data first
 
297
  if isinstance(raw_page, dict) and 'images' in raw_page:
298
  for img in raw_page['images']:
299
  if isinstance(img, dict) and 'base64' in img:
300
+ st.image(img['base64'], use_container_width=True)
301
  st.caption("Image from OCR response")
302
  image_displayed = True
303
  break
 
335
  st.markdown("""
336
  <div style="text-align: center; padding: 30px 20px; background-color: #f8f9fa; border-radius: 6px; margin-top: 10px;">
337
  <div style="font-size: 36px; margin-bottom: 15px;">📄</div>
338
+ <h3 style="margin-bottom: 16px; font-weight: 500;">No Previous Results</h3>
339
  <p style="font-size: 14px; color: #666;">Process a document to see your results history.</p>
340
  </div>
341
  """, unsafe_allow_html=True)
 
344
  with col2:
345
  try:
346
  # Create download button for all results
347
+ from utils.image_utils import create_results_zip_in_memory
348
  zip_data = create_results_zip_in_memory(st.session_state.previous_results)
349
  timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
350
 
 
446
  meta_html += '</div>'
447
  st.markdown(meta_html, unsafe_allow_html=True)
448
 
449
+ # Simplified tabs - using the same format as main view
450
  has_images = selected_result.get('has_images', False)
451
  if has_images:
452
+ view_tabs = st.tabs(["Document Content", "Raw JSON", "Images"])
453
  view_tab1, view_tab2, view_tab3 = view_tabs
454
  else:
455
+ view_tabs = st.tabs(["Document Content", "Raw JSON"])
456
  view_tab1, view_tab2 = view_tabs
457
+ view_tab3 = None
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
458
 
459
  # First tab - Document Content (simplified structured view)
460
  with view_tab1:
461
  # Display content in a cleaner, more streamlined format
462
  if 'ocr_contents' in selected_result and isinstance(selected_result['ocr_contents'], dict):
463
  # Create a more focused list of important sections
464
+ priority_sections = ["title", "content", "transcript", "summary"]
465
  displayed_sections = set()
466
 
467
  # First display priority sections
 
474
  st.markdown(f"##### {section.replace('_', ' ').title()}")
475
 
476
  # Format and display content
477
+ formatted_content = format_ocr_text(content)
478
  st.markdown(formatted_content)
479
  displayed_sections.add(section)
480
 
 
486
  st.markdown(f"##### {section.replace('_', ' ').title()}")
487
 
488
  if isinstance(content, str):
489
+ st.markdown(format_ocr_text(content))
490
  elif isinstance(content, list):
491
  for item in content:
492
  st.markdown(f"- {item}")
 
494
  for k, v in content.items():
495
  st.markdown(f"**{k}:** {v}")
496
 
497
+ # Second tab - Raw JSON (simplified)
498
  with view_tab2:
499
+ # Extract the relevant JSON data
500
+ json_data = {}
501
+
502
+ # Include important metadata
503
+ for field in ['file_name', 'timestamp', 'processing_time', 'languages', 'topics', 'subjects', 'detected_document_type', 'text']:
504
+ if field in selected_result:
505
+ json_data[field] = selected_result[field]
506
+
507
+ # Include OCR contents
508
  if 'ocr_contents' in selected_result:
509
+ json_data['ocr_contents'] = selected_result['ocr_contents']
510
+
511
+ # Exclude large binary data like base64 images to keep JSON clean
512
+ if 'pages_data' in selected_result:
513
+ # Create simplified pages_data without large binary content
514
+ simplified_pages = []
515
+ for page in selected_result['pages_data']:
516
+ simplified_page = {
517
+ 'page_number': page.get('page_number', 0),
518
+ 'has_text': bool(page.get('markdown', '')),
519
+ 'has_images': bool(page.get('images', [])),
520
+ 'image_count': len(page.get('images', []))
521
+ }
522
+ simplified_pages.append(simplified_page)
523
+ json_data['pages_summary'] = simplified_pages
524
 
525
+ # Format the JSON prettily
526
+ json_str = json.dumps(json_data, indent=2)
527
 
528
+ # Display in a monospace font with syntax highlighting
529
+ st.code(json_str, language="json")
 
 
 
 
 
 
 
 
 
 
530
 
531
+ # Third tab - Images (simplified)
532
+ if has_images and view_tab3 is not None:
533
  with view_tab3:
534
  # Simplified image display
535
  if 'pages_data' in selected_result:
 
538
  if 'images' in page_data and len(page_data['images']) > 0:
539
  for img in page_data['images']:
540
  if 'image_base64' in img:
541
+ st.image(img['image_base64'], use_container_width=True)
542
 
543
  # Get page text if available
544
  page_text = ""
 
549
  if page_text:
550
  with st.expander(f"Page {i+1} Text", expanded=False):
551
  st.text(page_text)
552
+
553
 
554
  def display_about_tab():
555
+ """Display learn more tab content"""
556
+ st.header("Learn More")
557
 
558
  # Add app description
559
  st.markdown("""
560
+ **Historical OCR** is a specialized academic tool for extracting text from historical documents, manuscripts, and printed materials.
561
  """)
562
 
563
  # Purpose section with consistent formatting
564
  st.markdown("### Purpose")
565
  st.markdown("""
566
  This tool is designed to assist scholars in historical research by extracting text from challenging documents.
567
+ While it may not achieve full accuracy for all materials, it serves as a practical research aid for navigating
568
  historical documents, particularly:
569
  """)
570
 
utils/content_utils.py ADDED
@@ -0,0 +1,189 @@
1
+ import re
2
+ import ast
3
+ from .text_utils import clean_raw_text, format_markdown_text
4
+
5
+ def classify_document_content(result):
6
+ """Classify document content based on structure and content"""
7
+ classification = {
8
+ 'has_title': False,
9
+ 'has_content': False,
10
+ 'has_sections': False,
11
+ 'is_structured': False
12
+ }
13
+
14
+ if 'ocr_contents' not in result or not isinstance(result['ocr_contents'], dict):
15
+ return classification
16
+
17
+ # Check for title
18
+ if 'title' in result['ocr_contents'] and result['ocr_contents']['title']:
19
+ classification['has_title'] = True
20
+
21
+ # Check for content
22
+ content_fields = ['content', 'transcript', 'text']
23
+ for field in content_fields:
24
+ if field in result['ocr_contents'] and result['ocr_contents'][field]:
25
+ classification['has_content'] = True
26
+ break
27
+
28
+ # Check for sections
29
+ section_count = 0
30
+ for key in result['ocr_contents'].keys():
31
+ if key not in ['raw_text', 'error'] and result['ocr_contents'][key]:
32
+ section_count += 1
33
+
34
+ classification['has_sections'] = section_count > 2
35
+
36
+ # Check if structured
37
+ classification['is_structured'] = (
38
+ classification['has_title'] and
39
+ classification['has_content'] and
40
+ classification['has_sections']
41
+ )
42
+
43
+ return classification
44
+
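A minimal usage sketch for `classify_document_content`; the `result` dict below is an invented example, not data from this commit:

    result = {"ocr_contents": {"title": "Letter, 1872",
                               "content": "Dear Sir...",
                               "summary": "A short note."}}
    classify_document_content(result)
    # -> {'has_title': True, 'has_content': True,
    #     'has_sections': True, 'is_structured': True}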
45
+ def extract_document_text(result):
46
+ """Extract main document text content"""
47
+ if 'ocr_contents' not in result or not isinstance(result['ocr_contents'], dict):
48
+ return ""
49
+
50
+ # Try to get the text from content fields in preferred order - prioritize main_text
51
+ for field in ['main_text', 'content', 'transcript', 'text', 'raw_text']:
52
+ if field in result['ocr_contents'] and result['ocr_contents'][field]:
53
+ content = result['ocr_contents'][field]
54
+ if isinstance(content, str):
55
+ return content
56
+
57
+ return ""
58
+
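The field priority above means a structured 'content' field wins over 'raw_text'; a hypothetical call:

    extract_document_text({"ocr_contents": {"raw_text": "fallback text",
                                            "content": "Main body text"}})
    # -> "Main body text"  ('main_text' and 'content' outrank 'raw_text')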
59
+ def extract_image_description(image_data):
60
+ """Extract image description from data"""
61
+ if not image_data or not isinstance(image_data, dict):
62
+ return ""
63
+
64
+ # Try different fields that might contain descriptions
65
+ for field in ['alt_text', 'caption', 'description']:
66
+ if field in image_data and image_data[field]:
67
+ return image_data[field]
68
+
69
+ return ""
70
+
71
+ def format_structured_data(content):
72
+ """Format structured data like lists and dictionaries into readable markdown
73
+
74
+ Args:
75
+ content: The content to format (str, list, dict)
76
+
77
+ Returns:
78
+ Formatted markdown text
79
+ """
80
+ if not content:
81
+ return ""
82
+
83
+ # If it's already a string, look for patterns that appear to be Python/JSON representations
84
+ if isinstance(content, str):
85
+ # Look for lists like ['item1', 'item2', 'item3']
86
+ list_pattern = r"(\[([^\[\]]*)\])"
87
+ dict_pattern = r"(\{([^\{\}]*)\})"
88
+
89
+ # First handle lists - ['item1', 'item2']
90
+ def replace_list(match):
91
+ try:
92
+ # Try to parse the match as a Python list
93
+ list_str = match.group(1)
94
+
95
+ # Quick check for empty list
96
+ if list_str == "[]":
97
+ return ""
98
+
99
+ # Safe evaluation of list-like string
100
+ try:
101
+ items = ast.literal_eval(list_str)
102
+ if isinstance(items, list):
103
+ # Convert to markdown bullet points
104
+ return "\n" + "\n".join([f"- {item}" for item in items])
105
+ else:
106
+ return list_str # Not a list, return unchanged
107
+ except (SyntaxError, ValueError):
108
+ # Try a simpler regex-based approach for common formats
109
+ # Handle simple comma-separated lists
110
+ items = re.findall(r"'([^']*)'|\"([^\"]*)\"", list_str)
111
+ if items:
112
+ # Extract the matched groups and handle both single and double quotes
113
+ clean_items = [item[0] if item[0] else item[1] for item in items]
114
+ return "\n" + "\n".join([f"- {item}" for item in clean_items])
115
+ return list_str # Couldn't parse, return unchanged
116
+ except Exception:
117
+ return match.group(0) # Return the original text if any error
118
+
119
+ # Handle dictionaries or structured fields like {key: value, key2: value2}
120
+ def replace_dict(match):
121
+ try:
122
+ dict_str = match.group(1)
123
+
124
+ # Quick check for empty dict
125
+ if dict_str == "{}":
126
+ return ""
127
+
128
+ # First try to parse as a Python dict
129
+ try:
130
+ data_dict = ast.literal_eval(dict_str)
131
+ if isinstance(data_dict, dict):
132
+ return "\n" + "\n".join([f"**{k}**: {v}" for k, v in data_dict.items()])
133
+ except (SyntaxError, ValueError):
134
+ # If that fails, use regex to extract key-value pairs
135
+ pairs = re.findall(r"'([^']*)':\s*'([^']*)'|\"([^\"]*)\":\s*\"([^\"]*)\"", dict_str)
136
+ if pairs:
137
+ formatted_pairs = []
138
+ for pair in pairs:
139
+ if pair[0] and pair[1]: # Single quotes
140
+ formatted_pairs.append(f"**{pair[0]}**: {pair[1]}")
141
+ elif pair[2] and pair[3]: # Double quotes
142
+ formatted_pairs.append(f"**{pair[2]}**: {pair[3]}")
143
+ return "\n" + "\n".join(formatted_pairs)
144
+ return dict_str # Return original if couldn't parse
145
+ except Exception:
146
+ return match.group(0) # Return original text if any error
147
+
148
+ # Check for keys with array values (common in OCR output)
149
+ key_array_pattern = r"([a-zA-Z_]+):\s*(\[.*?\])"
150
+
151
+ def replace_key_array(match):
152
+ try:
153
+ key = match.group(1)
154
+ array_str = match.group(2)
155
+
156
+ # Process the array part with our list replacer
157
+ formatted_array = replace_list(re.match(list_pattern, array_str))
158
+
159
+ # If we successfully formatted it, return with the key as a header
160
+ if formatted_array != array_str:
161
+ return f"**{key}**:{formatted_array}"
162
+ else:
163
+ return match.group(0) # Return original if no change
164
+ except Exception:
165
+ return match.group(0) # Return the original on error
166
+
167
+ # Apply all replacements
168
+ content = re.sub(key_array_pattern, replace_key_array, content)
169
+ content = re.sub(list_pattern, replace_list, content)
170
+ content = re.sub(dict_pattern, replace_dict, content)
171
+
172
+ return content
173
+
174
+ # Handle native Python lists
175
+ elif isinstance(content, list):
176
+ if not content:
177
+ return ""
178
+ # Convert to markdown bullet points
179
+ return "\n".join([f"- {item}" for item in content])
180
+
181
+ # Handle native Python dictionaries
182
+ elif isinstance(content, dict):
183
+ if not content:
184
+ return ""
185
+ # Convert to markdown key-value pairs
186
+ return "\n".join([f"**{k}**: {v}" for k, v in content.items()])
187
+
188
+ # Return as string for other types
189
+ return str(content)
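Two short examples of what `format_structured_data` produces; the outputs are traced by hand from the logic above, not captured from a run:

    format_structured_data("topics: ['Letters', 'Genealogy']")
    # -> "**topics**:\n- Letters\n- Genealogy"

    format_structured_data({"author": "Unknown", "date": "1872"})
    # -> "**author**: Unknown\n**date**: 1872"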
utils/file_utils.py ADDED
@@ -0,0 +1,100 @@
1
+ """
2
+ File utility functions for historical OCR processing.
3
+ """
4
+ import base64
5
+ import logging
6
+ from pathlib import Path
7
+
8
+ # Configure logging
9
+ logger = logging.getLogger("utils")
10
+ logger.setLevel(logging.INFO)
11
+
12
+ def get_base64_from_image(image_path):
13
+ """
14
+ Get base64 data URL from image file with proper MIME type.
15
+
16
+ Args:
17
+ image_path: Path to the image file
18
+
19
+ Returns:
20
+ Base64 data URL with appropriate MIME type prefix
21
+ """
22
+ try:
23
+ # Convert to Path object for better handling
24
+ path_obj = Path(image_path)
25
+
26
+ # Determine mime type based on file extension
27
+ mime_type = 'image/jpeg' # Default mime type
28
+ suffix = path_obj.suffix.lower()
29
+ if suffix == '.png':
30
+ mime_type = 'image/png'
31
+ elif suffix == '.gif':
32
+ mime_type = 'image/gif'
33
+ elif suffix in ['.jpg', '.jpeg']:
34
+ mime_type = 'image/jpeg'
35
+ elif suffix == '.pdf':
36
+ mime_type = 'application/pdf'
37
+
38
+ # Read and encode file
39
+ with open(path_obj, "rb") as file:
40
+ encoded = base64.b64encode(file.read()).decode('utf-8')
41
+ return f"data:{mime_type};base64,{encoded}"
42
+ except Exception as e:
43
+ logger.error(f"Error encoding file to base64: {str(e)}")
44
+ return ""
45
+
46
+ def get_base64_from_bytes(file_bytes, mime_type=None, file_name=None):
47
+ """
48
+ Get base64 data URL from file bytes with proper MIME type.
49
+
50
+ Args:
51
+ file_bytes: Binary file data
52
+ mime_type: MIME type of the file (optional)
53
+ file_name: Original file name for MIME type detection (optional)
54
+
55
+ Returns:
56
+ Base64 data URL with appropriate MIME type prefix
57
+ """
58
+ try:
59
+ # Determine mime type if not provided
60
+ if mime_type is None and file_name is not None:
61
+ # Get file extension
62
+ suffix = Path(file_name).suffix.lower()
63
+ if suffix == '.png':
64
+ mime_type = 'image/png'
65
+ elif suffix == '.gif':
66
+ mime_type = 'image/gif'
67
+ elif suffix in ['.jpg', '.jpeg']:
68
+ mime_type = 'image/jpeg'
69
+ elif suffix == '.pdf':
70
+ mime_type = 'application/pdf'
71
+ else:
72
+ # Default to image/jpeg for unknown types when processing images
73
+ mime_type = 'image/jpeg'
74
+ elif mime_type is None:
75
+ # Default MIME type if we can't determine it - use image/jpeg instead of application/octet-stream
76
+ # to ensure compatibility with Mistral AI OCR API
77
+ mime_type = 'image/jpeg'
78
+
79
+ # Encode and create data URL
80
+ encoded = base64.b64encode(file_bytes).decode('utf-8')
81
+ return f"data:{mime_type};base64,{encoded}"
82
+ except Exception as e:
83
+ logger.error(f"Error encoding bytes to base64: {str(e)}")
84
+ return ""
85
+
86
+ def handle_temp_files(temp_file_paths):
87
+ """
88
+ Clean up temporary files
89
+
90
+ Args:
91
+ temp_file_paths: List of temporary file paths to clean up
92
+ """
93
+ import os
94
+ for temp_path in temp_file_paths:
95
+ try:
96
+ if os.path.exists(temp_path):
97
+ os.unlink(temp_path)
98
+ logger.info(f"Removed temporary file: {temp_path}")
99
+ except Exception as e:
100
+ logger.warning(f"Failed to remove temporary file {temp_path}: {str(e)}")
utils/general_utils.py ADDED
@@ -0,0 +1,163 @@
1
+ """
2
+ General utility functions for historical OCR processing.
3
+ """
4
+ import os
5
+ import base64
6
+ import hashlib
7
+ import time
8
+ import logging
9
+ from datetime import datetime
10
+ from pathlib import Path
11
+ from functools import wraps
12
+
13
+ # Configure logging
14
+ logger = logging.getLogger("utils")
15
+ logger.setLevel(logging.INFO)
16
+
17
+ def generate_cache_key(file_bytes, file_type, use_vision, preprocessing_options=None, pdf_rotation=0, custom_prompt=None):
18
+ """
19
+ Generate a cache key for OCR processing
20
+
21
+ Args:
22
+ file_bytes: File content as bytes
23
+ file_type: Type of file (pdf or image)
24
+ use_vision: Whether to use vision model
25
+ preprocessing_options: Dictionary of preprocessing options
26
+ pdf_rotation: PDF rotation value
27
+ custom_prompt: Custom prompt for OCR
28
+
29
+ Returns:
30
+ str: Cache key
31
+ """
32
+ # Generate file hash
33
+ file_hash = hashlib.md5(file_bytes).hexdigest()
34
+
35
+ # Include preprocessing options in cache key
36
+ preprocessing_options_hash = ""
37
+ if preprocessing_options:
38
+ # Add pdf_rotation to preprocessing options to ensure it's part of the cache key
39
+ if pdf_rotation != 0:
40
+ preprocessing_options_with_rotation = preprocessing_options.copy()
41
+ preprocessing_options_with_rotation['pdf_rotation'] = pdf_rotation
42
+ preprocessing_str = str(sorted(preprocessing_options_with_rotation.items()))
43
+ else:
44
+ preprocessing_str = str(sorted(preprocessing_options.items()))
45
+ preprocessing_options_hash = hashlib.md5(preprocessing_str.encode()).hexdigest()
46
+ elif pdf_rotation != 0:
47
+ # If no preprocessing options but we have rotation, include that in the hash
48
+ preprocessing_options_hash = hashlib.md5(f"pdf_rotation_{pdf_rotation}".encode()).hexdigest()
49
+
50
+ # Create base cache key
51
+ cache_key = f"{file_hash}_{file_type}_{use_vision}_{preprocessing_options_hash}"
52
+
53
+ # Include custom prompt in cache key if provided
54
+ if custom_prompt:
55
+ custom_prompt_hash = hashlib.md5(str(custom_prompt).encode()).hexdigest()
56
+ cache_key = f"{cache_key}_{custom_prompt_hash}"
57
+
58
+ return cache_key
59
+
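An illustrative call (argument values invented): the key concatenates the MD5 of the file with the processing parameters, so changing any of them busts the cache:

    key = generate_cache_key(file_bytes=b"...", file_type="image", use_vision=True,
                             preprocessing_options={"grayscale": True},
                             custom_prompt="Transcribe the marginalia")
    # -> "<md5(file)>_image_True_<md5(options)>_<md5(prompt)>"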
60
+ def timing(description):
61
+ """Context manager for timing code execution"""
62
+ class TimingContext:
63
+ def __init__(self, description):
64
+ self.description = description
65
+
66
+ def __enter__(self):
67
+ self.start_time = time.time()
68
+ return self
69
+
70
+ def __exit__(self, exc_type, exc_val, exc_tb):
71
+ end_time = time.time()
72
+ execution_time = end_time - self.start_time
73
+ logger.info(f"{self.description} took {execution_time:.2f} seconds")
74
+ return False
75
+
76
+ return TimingContext(description)
77
+
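Usage sketch; `run_ocr` and `page_image` are stand-ins for whatever call is being timed:

    with timing("OCR pass"):
        result = run_ocr(page_image)
    # logs: "OCR pass took 3.42 seconds"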
78
+ def format_timestamp(timestamp=None):
79
+ """Format timestamp for display"""
80
+ if timestamp is None:
81
+ timestamp = datetime.now()
82
+ elif isinstance(timestamp, str):
83
+ try:
84
+ timestamp = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
85
+ except ValueError:
86
+ timestamp = datetime.now()
87
+
88
+ return timestamp.strftime("%Y-%m-%d %H:%M")
89
+
90
+ def create_descriptive_filename(original_filename, result, file_ext, preprocessing_options=None):
91
+ """
92
+ Create a descriptive filename for the result
93
+
94
+ Args:
95
+ original_filename: Original filename
96
+ result: OCR result dictionary
97
+ file_ext: File extension
98
+ preprocessing_options: Dictionary of preprocessing options
99
+
100
+ Returns:
101
+ str: Descriptive filename
102
+ """
103
+ # Get base name without extension
104
+ original_name = Path(original_filename).stem
105
+
106
+ # Add document type to filename if detected
107
+ doc_type_tag = ""
108
+ if 'detected_document_type' in result:
109
+ doc_type = result['detected_document_type'].lower()
110
+ doc_type_tag = f"_{doc_type.replace(' ', '_')}"
111
+ elif 'topics' in result and result['topics']:
112
+ # Use first tag as document type if not explicitly detected
113
+ doc_type_tag = f"_{result['topics'][0].lower().replace(' ', '_')}"
114
+
115
+ # Add period tag for historical context if available
116
+ period_tag = ""
117
+ if 'topics' in result and result['topics']:
118
+ for tag in result['topics']:
119
+ if "century" in tag.lower() or "pre-" in tag.lower() or "era" in tag.lower():
120
+ period_tag = f"_{tag.lower().replace(' ', '_')}"
121
+ break
122
+
123
+ # Generate final descriptive filename
124
+ descriptive_name = f"{original_name}{doc_type_tag}{period_tag}{file_ext}"
125
+ return descriptive_name
126
+
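A worked example, traced by hand through the tagging logic above:

    create_descriptive_filename("scan01.jpg",
                                {"detected_document_type": "Letter",
                                 "topics": ["19th Century"]},
                                ".json")
    # -> "scan01_letter_19th_century.json"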
127
+ def extract_subject_tags(result, raw_text, preprocessing_options=None):
128
+ """
129
+ Extract subject tags from OCR result
130
+
131
+ Args:
132
+ result: OCR result dictionary
133
+ raw_text: Raw text from OCR
134
+ preprocessing_options: Dictionary of preprocessing options
135
+
136
+ Returns:
137
+ list: Subject tags
138
+ """
139
+ subject_tags = []
140
+
141
+ # Use existing topics as starting point if available
142
+ if 'topics' in result and result['topics']:
143
+ subject_tags = list(result['topics'])
144
+
145
+ # Add document type if detected
146
+ if 'detected_document_type' in result:
147
+ doc_type = result['detected_document_type'].capitalize()
148
+ if doc_type not in subject_tags:
149
+ subject_tags.append(doc_type)
150
+
151
+ # If no tags were found, add some defaults
152
+ if not subject_tags:
153
+ subject_tags = ["Document", "Historical Document"]
154
+
155
+ # Try to infer content type
156
+ if "letter" in raw_text.lower()[:1000] or "dear" in raw_text.lower()[:200]:
157
+ subject_tags.append("Letter")
158
+
159
+ # Check if it might be a newspaper
160
+ if "newspaper" in raw_text.lower()[:1000] or "editor" in raw_text.lower()[:500]:
161
+ subject_tags.append("Newspaper")
162
+
163
+ return subject_tags
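A hand-traced example (the raw text is invented): with no topics or detected type, the defaults apply and the "dear" heuristic adds a Letter tag:

    extract_subject_tags({}, "Dear Mr. Whitfield, I write to inform you...")
    # -> ["Document", "Historical Document", "Letter"]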
utils/image_utils.py ADDED
@@ -0,0 +1,886 @@
1
+ """
2
+ Utility functions for OCR image processing with Mistral AI.
3
+ Contains helper functions for working with OCR responses and image handling.
4
+ """
5
+
6
+ # Standard library imports
7
+ import json
8
+ import base64
9
+ import io
10
+ import zipfile
11
+ import logging
12
+ import re
13
+ import time
14
+ import math
15
+ from datetime import datetime
16
+ from pathlib import Path
17
+ from typing import Dict, List, Optional, Union, Any, Tuple
18
+ from functools import lru_cache
19
+
20
+ # Configure logging
21
+ logging.basicConfig(level=logging.INFO,
22
+ format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
23
+ logger = logging.getLogger(__name__)
24
+
25
+ # Third-party imports
26
+ import numpy as np
27
+
28
+ # Mistral AI imports
29
+ from mistralai import DocumentURLChunk, ImageURLChunk, TextChunk
30
+ from mistralai.models import OCRImageObject
31
+
32
+ # Check for image processing libraries
33
+ try:
34
+ from PIL import Image, ImageEnhance, ImageFilter, ImageOps
35
+ PILLOW_AVAILABLE = True
36
+ except ImportError:
37
+ logger.warning("PIL not available - image preprocessing will be limited")
38
+ PILLOW_AVAILABLE = False
39
+
40
+ try:
41
+ import cv2
42
+ CV2_AVAILABLE = True
43
+ except ImportError:
44
+ logger.warning("OpenCV (cv2) not available - advanced image processing will be limited")
45
+ CV2_AVAILABLE = False
46
+
47
+ # Import configuration
48
+ try:
49
+ from config import IMAGE_PREPROCESSING
50
+ except ImportError:
51
+ # Fallback defaults if config not available
52
+ IMAGE_PREPROCESSING = {
53
+ "enhance_contrast": 1.5,
54
+ "sharpen": True,
55
+ "denoise": True,
56
+ "max_size_mb": 8.0,
57
+ "target_dpi": 300,
58
+ "compression_quality": 92
59
+ }
60
+
61
+ def detect_skew(image: Union[Image.Image, np.ndarray]) -> float:
62
+ """
63
+ Quick skew detection that returns angle in degrees.
64
+ Uses a computationally efficient approach by analyzing at 1% resolution.
65
+
66
+ Args:
67
+ image: PIL Image or numpy array
68
+
69
+ Returns:
70
+ Estimated skew angle in degrees (positive or negative)
71
+ """
72
+ # Convert PIL Image to numpy array if needed
73
+ if isinstance(image, Image.Image):
74
+ # Convert to grayscale for processing
75
+ if image.mode != 'L':
76
+ img_np = np.array(image.convert('L'))
77
+ else:
78
+ img_np = np.array(image)
79
+ else:
80
+ # If already numpy array, ensure it's grayscale
81
+ if len(image.shape) == 3:
82
+ if CV2_AVAILABLE:
83
+ img_np = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
84
+ else:
85
+ # Fallback grayscale conversion
86
+ img_np = np.mean(image, axis=2).astype(np.uint8)
87
+ else:
88
+ img_np = image
89
+
90
+ # Downsample to 1% resolution for faster processing
91
+ height, width = img_np.shape
92
+ target_size = int(min(width, height) * 0.01)
93
+
94
+ # Use a sane minimum size and ensure we have enough pixels to detect lines
95
+ target_size = max(target_size, 100)
96
+
97
+ if CV2_AVAILABLE:
98
+ # OpenCV-based implementation (faster)
99
+ # Resize the image to the target size
100
+ scale_factor = target_size / max(width, height)
101
+ small_img = cv2.resize(img_np, None, fx=scale_factor, fy=scale_factor, interpolation=cv2.INTER_AREA)
102
+
103
+ # Apply binary thresholding to get cleaner edges
104
+ _, binary = cv2.threshold(small_img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
105
+
106
+ # Use Hough Line Transform to detect lines
107
+ lines = cv2.HoughLinesP(binary, 1, np.pi/180, threshold=target_size//10,
108
+ minLineLength=target_size//5, maxLineGap=target_size//10)
109
+
110
+ if lines is None or len(lines) < 3:
111
+ # Not enough lines detected, assume no significant skew
112
+ return 0.0
113
+
114
+ # Calculate angles of lines
115
+ angles = []
116
+ for line in lines:
117
+ x1, y1, x2, y2 = line[0]
118
+ if x2 - x1 == 0: # Avoid division by zero
119
+ continue
120
+ angle = math.atan2(y2 - y1, x2 - x1) * 180.0 / np.pi
121
+
122
+ # Normalize angle to -45 to 45 range
123
+ angle = angle % 180
124
+ if angle > 90:
125
+ angle -= 180
126
+ if angle > 45:
127
+ angle -= 90
128
+ if angle < -45:
129
+ angle += 90
130
+
131
+ angles.append(angle)
132
+
133
+ if not angles:
134
+ return 0.0
135
+
136
+ # Use median to reduce impact of outliers
137
+ angles.sort()
138
+ median_angle = angles[len(angles) // 2]
139
+
140
+ return median_angle
141
+ else:
142
+ # PIL-only fallback implementation
143
+ # Resize using PIL
144
+ small_img = Image.fromarray(img_np).resize(
145
+ (int(width * target_size / max(width, height)),
146
+ int(height * target_size / max(width, height))),
147
+ Image.NEAREST
148
+ )
149
+
150
+ # Find edges
151
+ edges = small_img.filter(ImageFilter.FIND_EDGES)
152
+ edges_data = np.array(edges)
153
+
154
+ # Simple edge orientation analysis (less precise than OpenCV)
155
+ # Count horizontal vs vertical edges
156
+ h_edges = np.sum(np.abs(np.diff(edges_data, axis=1)))
157
+ v_edges = np.sum(np.abs(np.diff(edges_data, axis=0)))
158
+
159
+ # If horizontal edges dominate, no significant skew
160
+ if h_edges > v_edges * 1.2:
161
+ return 0.0
162
+
163
+ # Simple angle estimation based on edge distribution
164
+ # This is a simplified approach that works for slight skews
165
+ rows, cols = edges_data.shape
166
+ xs, ys = [], []
167
+
168
+ # Sample strong edge points
169
+ for r in range(0, rows, 2):
170
+ for c in range(0, cols, 2):
171
+ if edges_data[r, c] > 128:
172
+ xs.append(c)
173
+ ys.append(r)
174
+
175
+ if len(xs) < 10: # Not enough edge points
176
+ return 0.0
177
+
178
+ # Use simple linear regression to estimate the slope
179
+ n = len(xs)
180
+ mean_x = sum(xs) / n
181
+ mean_y = sum(ys) / n
182
+
183
+ # Calculate slope
184
+ numerator = sum((xs[i] - mean_x) * (ys[i] - mean_y) for i in range(n))
185
+ denominator = sum((xs[i] - mean_x) ** 2 for i in range(n))
186
+
187
+ if abs(denominator) < 1e-6: # Avoid division by zero
188
+ return 0.0
189
+
190
+ slope = numerator / denominator
191
+ angle = math.atan(slope) * 180.0 / math.pi
192
+
193
+ # Normalize to -45 to 45 degrees
194
+ if angle > 45:
195
+ angle -= 90
196
+ elif angle < -45:
197
+ angle += 90
198
+
199
+ return angle
200
+
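A deskewing sketch built on `detect_skew`; the threshold and file path are assumptions, and rotating by the negated angle is one plausible way to counteract the detected skew:

    from PIL import Image
    img = Image.open("samples/page.png")           # hypothetical scan
    angle = detect_skew(img)
    if abs(angle) > 0.5:                           # ignore negligible skew
        img = img.rotate(-angle, expand=True, fillcolor="white")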
201
+ def replace_images_in_markdown(md: str, images: dict[str, str]) -> str:
202
+ """
203
+ Replace image placeholders in markdown with base64-encoded images.
204
+ Uses regex-based matching to handle variations in image IDs and formats.
205
+
206
+ Args:
207
+ md: Markdown text containing image placeholders
208
+ images: Dictionary mapping image IDs to base64 strings
209
+
210
+ Returns:
211
+ Markdown text with images replaced by base64 data
212
+ """
213
+ # Process each image ID in the dictionary
214
+ for img_id, base64_str in images.items():
215
+ # Extract the base ID without extension for more flexible matching
216
+ base_id = img_id.split('.')[0]
217
+
218
+ # Match markdown image pattern where URL contains the base ID
219
+ # Using a single regex with groups to capture the full pattern
220
+ pattern = re.compile(rf'!\[([^\]]*)\]\(([^\)]*{base_id}[^\)]*)\)')
221
+
222
+ # Process all matches
223
+ matches = list(pattern.finditer(md))
224
+ for match in reversed(matches): # Process in reverse to avoid offset issues
225
+ # Replace the entire match with a properly formatted base64 image
226
+ md = md[:match.start()] + f"![{img_id}](data:image/jpeg;base64,{base64_str})" + md[match.end():]
227
+
228
+ return md
229
+
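A small hand-traced example: the base ID "img-0" is matched inside the URL, so the placeholder is swapped for a data URL:

    md = "Before ![fig](img-0.jpeg) after"
    replace_images_in_markdown(md, {"img-0.jpeg": "AAAA"})
    # -> "Before ![img-0.jpeg](data:image/jpeg;base64,AAAA) after"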
230
+ def get_combined_markdown(ocr_response) -> str:
231
+ """
232
+ Combine OCR text and images into a single markdown document.
233
+
234
+ Args:
235
+ ocr_response: OCR response object from Mistral AI
236
+
237
+ Returns:
238
+ Combined markdown string with embedded images
239
+ """
240
+ markdowns = []
241
+
242
+ # Process each page of the OCR response
243
+ for page in ocr_response.pages:
244
+ # Extract image data if available
245
+ image_data = {}
246
+ if hasattr(page, "images"):
247
+ for img in page.images:
248
+ if hasattr(img, "id") and hasattr(img, "image_base64"):
249
+ image_data[img.id] = img.image_base64
250
+
251
+ # Replace image placeholders with base64 data
252
+ page_markdown = page.markdown if hasattr(page, "markdown") else ""
253
+ processed_markdown = replace_images_in_markdown(page_markdown, image_data)
254
+ markdowns.append(processed_markdown)
255
+
256
+ # Join all pages' markdown with double newlines
257
+ return "\n\n".join(markdowns)
258
+
259
+ def encode_image_for_api(image_path: Union[str, Path]) -> str:
260
+ """
261
+ Encode an image as base64 data URL for API submission.
262
+
263
+ Args:
264
+ image_path: Path to the image file
265
+
266
+ Returns:
267
+ Base64 data URL for the image
268
+ """
269
+ # Convert to Path object if string
270
+ image_file = Path(image_path) if isinstance(image_path, str) else image_path
271
+
272
+ # Verify image exists
273
+ if not image_file.is_file():
274
+ raise FileNotFoundError(f"Image file not found: {image_file}")
275
+
276
+ # Determine mime type based on file extension
277
+ mime_type = 'image/jpeg' # Default mime type
278
+ suffix = image_file.suffix.lower()
279
+ if suffix == '.png':
280
+ mime_type = 'image/png'
281
+ elif suffix == '.gif':
282
+ mime_type = 'image/gif'
283
+ elif suffix in ['.jpg', '.jpeg']:
284
+ mime_type = 'image/jpeg'
285
+ elif suffix == '.pdf':
286
+ mime_type = 'application/pdf'
287
+
288
+ # Encode image as base64
289
+ encoded = base64.b64encode(image_file.read_bytes()).decode()
290
+ return f"data:{mime_type};base64,{encoded}"
291
+
292
+ def encode_bytes_for_api(file_bytes: bytes, mime_type: str) -> str:
293
+ """
294
+ Encode binary data as base64 data URL for API submission.
295
+
296
+ Args:
297
+ file_bytes: Binary file data
298
+ mime_type: MIME type of the file (e.g., 'image/jpeg', 'application/pdf')
299
+
300
+ Returns:
301
+ Base64 data URL for the data
302
+ """
303
+ # Encode data as base64
304
+ encoded = base64.b64encode(file_bytes).decode()
305
+ return f"data:{mime_type};base64,{encoded}"
306
+
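Usage sketch for PDF submission (the file name is hypothetical):

    with open("page.pdf", "rb") as f:
        url = encode_bytes_for_api(f.read(), "application/pdf")
    # -> "data:application/pdf;base64,JVBERi0x..."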
307
+ def calculate_image_entropy(pil_img: Image.Image) -> float:
308
+ """
309
+ Calculate the entropy of a PIL image.
310
+ Entropy is a measure of randomness; low entropy indicates a blank or simple image,
311
+ high entropy indicates more complex content (e.g., text or detailed images).
312
+
313
+ Args:
314
+ pil_img: PIL Image object
315
+
316
+ Returns:
317
+ float: Entropy value
318
+ """
319
+ # Convert to grayscale for entropy calculation
320
+ gray_img = pil_img.convert("L")
321
+ arr = np.array(gray_img)
322
+ # Compute histogram
323
+ hist, _ = np.histogram(arr, bins=256, range=(0, 255), density=True)
324
+ # Remove zero entries to avoid log(0)
325
+ hist = hist[hist > 0]
326
+ # Calculate entropy
327
+ entropy = -np.sum(hist * np.log2(hist))
328
+ return float(entropy)
329
+
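One way this could be used to skip blank scans before an OCR call; the 3.5 cutoff is an illustrative guess, not a tuned value:

    from PIL import Image
    page = Image.open("samples/page_07.png")       # hypothetical scan
    if calculate_image_entropy(page) < 3.5:
        print("Likely blank page - skipping OCR")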
330
+ def serialize_ocr_object(obj):
331
+ """
332
+ Serialize OCR response objects to JSON serializable format.
333
+ Handles OCRImageObject specifically to prevent serialization errors.
334
+
335
+ Args:
336
+ obj: The object to serialize
337
+
338
+ Returns:
339
+ JSON serializable representation of the object
340
+ """
341
+ # Fast path: Handle primitive types directly
342
+ if obj is None or isinstance(obj, (str, int, float, bool)):
343
+ return obj
344
+
345
+ # Handle collections
346
+ if isinstance(obj, list):
347
+ return [serialize_ocr_object(item) for item in obj]
348
+ elif isinstance(obj, dict):
349
+ return {k: serialize_ocr_object(v) for k, v in obj.items()}
350
+ elif isinstance(obj, OCRImageObject):
351
+ # Special handling for OCRImageObject
352
+ return {
353
+ 'id': obj.id if hasattr(obj, 'id') else None,
354
+ 'image_base64': obj.image_base64 if hasattr(obj, 'image_base64') else None
355
+ }
356
+ elif hasattr(obj, '__dict__'):
357
+ # For objects with __dict__ attribute
358
+ return {k: serialize_ocr_object(v) for k, v in obj.__dict__.items()
359
+ if not k.startswith('_')} # Skip private attributes
360
+ else:
361
+ # Try to convert to string as last resort
362
+ try:
363
+ return str(obj)
364
+ except Exception:
365
+ return None
366
+
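Typical use: make a raw Mistral OCR response JSON-safe before caching or download:

    serializable = serialize_ocr_object(ocr_response)  # OCRImageObjects become dicts
    json_str = json.dumps(serializable, indent=2)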
367
+ def format_ocr_text(text):
368
+ """
369
+ Format OCR text with simple, predictable rules that ensure consistency.
370
+ This formats ALL CAPS lines as bold markdown and preserves the rest.
371
+
372
+ Args:
373
+ text: Text content to format
374
+
375
+ Returns:
376
+ Formatted text with consistent styling
377
+ """
378
+ if not isinstance(text, str):
379
+ return text
380
+
381
+ lines = text.split('\n')
382
+ processed_lines = []
383
+ for line in lines:
384
+ line_stripped = line.strip()
385
+ if line_stripped and line_stripped.isupper() and len(line_stripped) > 3:
386
+ processed_lines.append(f"**{line_stripped}**")
387
+ else:
388
+ processed_lines.append(line)
389
+
390
+ return '\n'.join(processed_lines)
391
+
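A one-line example of the ALL CAPS rule:

    format_ocr_text("CHAPTER ONE\nIt was a dark night.")
    # -> "**CHAPTER ONE**\nIt was a dark night."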
392
+ def create_results_zip(results, output_dir=None, zip_name=None):
393
+ """
394
+ Create a zip file containing OCR results.
395
+
396
+ Args:
397
+ results: Dictionary or list of OCR results
398
+ output_dir: Optional output directory
399
+ zip_name: Optional zip file name
400
+
401
+ Returns:
402
+ Path to the created zip file
403
+ """
404
+ # Create temporary output directory if not provided
405
+ if output_dir is None:
406
+ output_dir = Path.cwd() / "output"
407
+ output_dir.mkdir(exist_ok=True)
408
+ else:
409
+ output_dir = Path(output_dir)
410
+ output_dir.mkdir(exist_ok=True)
411
+
412
+ # Generate zip name if not provided
413
+ if zip_name is None:
414
+ timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
415
+
416
+ if isinstance(results, list):
417
+ # For a list of results, create a descriptive name
418
+ file_count = len(results)
419
+ zip_name = f"ocr_results_{file_count}_{timestamp}.zip"
420
+ else:
421
+ # For single result, create descriptive filename
422
+ base_name = results.get('file_name', 'document').split('.')[0]
423
+ zip_name = f"{base_name}_{timestamp}.zip"
424
+
425
+ try:
426
+ # Get zip data in memory first
427
+ zip_data = create_results_zip_in_memory(results)
428
+
429
+ # Save to file
430
+ zip_path = output_dir / zip_name
431
+ with open(zip_path, 'wb') as f:
432
+ f.write(zip_data)
433
+
434
+ return zip_path
435
+ except Exception as e:
436
+ # Create an empty zip file as fallback
437
+ logger.error(f"Error creating zip file: {str(e)}")
438
+ zip_path = output_dir / zip_name
439
+ with zipfile.ZipFile(zip_path, 'w') as zipf:
440
+ zipf.writestr("info.txt", "Could not create complete archive")
441
+
442
+ return zip_path
443
+
+ def create_results_zip_in_memory(results):
+     """
+     Create a zip file containing OCR results in memory.
+
+     Args:
+         results: Dictionary or list of OCR results
+
+     Returns:
+         Binary zip file data
+     """
+     # Create a BytesIO object
+     zip_buffer = io.BytesIO()
+
+     # Check if results is a list or a dictionary
+     is_list = isinstance(results, list)
+
+     # Create the zip file in memory
+     with zipfile.ZipFile(zip_buffer, 'w', zipfile.ZIP_DEFLATED) as zipf:
+         if is_list:
+             # Handle a list of results
+             for i, result in enumerate(results):
+                 try:
+                     # Create a descriptive base filename for this result
+                     base_filename = result.get('file_name', f'document_{i+1}').split('.')[0]
+
+                     # Add the document type if available
+                     if 'topics' in result and result['topics']:
+                         topic = result['topics'][0].lower().replace(' ', '_')
+                         base_filename = f"{base_filename}_{topic}"
+
+                     # Add the language if available
+                     if 'languages' in result and result['languages']:
+                         lang = result['languages'][0].lower()
+                         # Only add if it's not already in the filename
+                         if lang not in base_filename.lower():
+                             base_filename = f"{base_filename}_{lang}"
+
+                     # For PDFs, add page information
+                     if 'limited_pages' in result:
+                         base_filename = f"{base_filename}_p{result['limited_pages']['processed']}of{result['limited_pages']['total']}"
+
+                     # Add a timestamp if available
+                     if 'timestamp' in result:
+                         try:
+                             # Try to parse the timestamp and reformat it
+                             dt = datetime.strptime(result['timestamp'], "%Y-%m-%d %H:%M")
+                             timestamp = dt.strftime("%Y%m%d_%H%M%S")
+                             base_filename = f"{base_filename}_{timestamp}"
+                         except Exception:
+                             pass
+
+                     # Add JSON results for each file with a descriptive name
+                     result_json = json.dumps(result, indent=2)
+                     zipf.writestr(f"{base_filename}.json", result_json)
+
+                     # Add HTML content (generated from the result)
+                     html_content = create_html_with_images(result)
+                     zipf.writestr(f"{base_filename}.html", html_content)
+
+                     # Add raw OCR text if available
+                     if "ocr_contents" in result and "raw_text" in result["ocr_contents"]:
+                         zipf.writestr(f"{base_filename}.txt", result["ocr_contents"]["raw_text"])
+
+                 except Exception as e:
+                     # If any result fails, skip it and continue
+                     logger.warning(f"Failed to process result for zip: {str(e)}")
+                     continue
+         else:
+             # Handle a single result
+             try:
+                 # Create a descriptive base filename for this result
+                 base_filename = results.get('file_name', 'document').split('.')[0]
+
+                 # Add the document type if available
+                 if 'topics' in results and results['topics']:
+                     topic = results['topics'][0].lower().replace(' ', '_')
+                     base_filename = f"{base_filename}_{topic}"
+
+                 # Add the language if available
+                 if 'languages' in results and results['languages']:
+                     lang = results['languages'][0].lower()
+                     # Only add if it's not already in the filename
+                     if lang not in base_filename.lower():
+                         base_filename = f"{base_filename}_{lang}"
+
+                 # For PDFs, add page information
+                 if 'limited_pages' in results:
+                     base_filename = f"{base_filename}_p{results['limited_pages']['processed']}of{results['limited_pages']['total']}"
+
+                 # Add a timestamp if available
+                 if 'timestamp' in results:
+                     try:
+                         # Try to parse the timestamp and reformat it
+                         dt = datetime.strptime(results['timestamp'], "%Y-%m-%d %H:%M")
+                         timestamp = dt.strftime("%Y%m%d_%H%M%S")
+                         base_filename = f"{base_filename}_{timestamp}"
+                     except Exception:
+                         # If parsing fails, create a new timestamp
+                         timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+                         base_filename = f"{base_filename}_{timestamp}"
+                 else:
+                     # No timestamp in the result, create a new one
+                     timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+                     base_filename = f"{base_filename}_{timestamp}"
+
+                 # Add JSON results with a descriptive name
+                 results_json = json.dumps(results, indent=2)
+                 zipf.writestr(f"{base_filename}.json", results_json)
+
+                 # Add HTML content with a descriptive name
+                 html_content = create_html_with_images(results)
+                 zipf.writestr(f"{base_filename}.html", html_content)
+
+                 # Add raw OCR text if available
+                 if "ocr_contents" in results and "raw_text" in results["ocr_contents"]:
+                     zipf.writestr(f"{base_filename}.txt", results["ocr_contents"]["raw_text"])
+
+             except Exception as e:
+                 # If processing fails, log the error and fall through
+                 logger.error(f"Failed to create zip file: {str(e)}")
+
+     # Seek to the beginning of the BytesIO object
+     zip_buffer.seek(0)
+
+     # Return the zip file bytes
+     return zip_buffer.getvalue()
+
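The in-memory variant pairs naturally with Streamlit's download widget, so no temporary file is needed; a sketch, assuming `result` is an OCR result dict as above:

    zip_bytes = create_results_zip_in_memory(result)
    st.download_button(
        label='Download ZIP',
        data=zip_bytes,
        file_name='ocr_results.zip',
        mime='application/zip',
    )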
+ def create_html_with_images(result):
+     """
+     Create a clean HTML document from OCR results that properly preserves page references
+     and text structure, without any document-specific special cases.
+
+     Args:
+         result: OCR result dictionary
+
+     Returns:
+         HTML content as a string
+     """
+     # Import content utils to use the classification functions
+     try:
+         from utils.content_utils import classify_document_content, extract_document_text, extract_image_description
+         content_utils_available = True
+     except ImportError:
+         content_utils_available = False
+
+     # Get the content classification
+     has_text = True
+     has_images = False
+     # Page-reference highlighting is currently disabled in both branches below
+     has_page_refs = False
+
+     if content_utils_available:
+         classification = classify_document_content(result)
+         has_text = classification['has_content']
+         has_images = result.get('has_images', False)
+     else:
+         # Minimal fallback detection
+         if 'has_images' in result:
+             has_images = result['has_images']
+
+         # Check for image data more thoroughly
+         if 'pages_data' in result and isinstance(result['pages_data'], list):
+             for page in result['pages_data']:
+                 if isinstance(page, dict) and 'images' in page and page['images']:
+                     has_images = True
+                     break
+
+     # Start building the HTML document
+     html = [
+         '<!DOCTYPE html>',
+         '<html lang="en">',
+         '<head>',
+         '    <meta charset="UTF-8">',
+         '    <meta name="viewport" content="width=device-width, initial-scale=1.0">',
+         f'    <title>{result.get("file_name", "Document")}</title>',
+         '    <style>',
+         '        body {',
+         '            font-family: Georgia, serif;',
+         '            line-height: 1.6;',
+         '            color: #333;',
+         '            max-width: 800px;',
+         '            margin: 0 auto;',
+         '            padding: 20px;',
+         '        }',
+         '        h1, h2, h3, h4 {',
+         '            color: #222;',
+         '            margin-top: 1.5em;',
+         '            margin-bottom: 0.5em;',
+         '        }',
+         '        h1 { font-size: 24px; }',
+         '        h2 { font-size: 22px; }',
+         '        h3 { font-size: 20px; }',
+         '        h4 { font-size: 18px; }',
+         '        p { margin: 1em 0; }',
+         '        .metadata {',
+         '            background-color: #f8f9fa;',
+         '            border: 1px solid #eaecef;',
+         '            border-radius: 6px;',
+         '            padding: 15px;',
+         '            margin-bottom: 20px;',
+         '        }',
+         '        .metadata p { margin: 5px 0; }',
+         '        img {',
+         '            max-width: 100%;',
+         '            height: auto;',
+         '            display: block;',
+         '            margin: 20px auto;',
+         '            border: 1px solid #ddd;',
+         '            border-radius: 4px;',
+         '        }',
+         '        .image-container {',
+         '            margin: 20px 0;',
+         '            text-align: center;',
+         '        }',
+         '        .image-caption {',
+         '            font-size: 0.9em;',
+         '            text-align: center;',
+         '            color: #666;',
+         '            margin-top: 5px;',
+         '        }',
+         '        .text-block {',
+         '            margin: 10px 0;',
+         '        }',
+         '        .page-ref {',
+         '            font-weight: bold;',
+         '            color: #555;',
+         '        }',
+         '        .separator {',
+         '            border-top: 1px solid #eaecef;',
+         '            margin: 30px 0;',
+         '        }',
+         '    </style>',
+         '</head>',
+         '<body>'
+     ]
+
+     # Add document metadata
+     html.append('<div class="metadata">')
+     html.append(f'<h1>{result.get("file_name", "Document")}</h1>')
+
+     # Add the timestamp
+     if 'timestamp' in result:
+         html.append(f'<p><strong>Processed:</strong> {result["timestamp"]}</p>')
+
+     # Add languages if available
+     if 'languages' in result and result['languages']:
+         languages = [lang for lang in result['languages'] if lang]
+         if languages:
+             html.append(f'<p><strong>Languages:</strong> {", ".join(languages)}</p>')
+
+     # Add the document type and topics
+     if 'detected_document_type' in result:
+         html.append(f'<p><strong>Document Type:</strong> {result["detected_document_type"]}</p>')
+
+     if 'topics' in result and result['topics']:
+         html.append(f'<p><strong>Topics:</strong> {", ".join(result["topics"])}</p>')
+
+     html.append('</div>')  # Close metadata div
+
+     # Document title - extract from the result if available
+     if 'ocr_contents' in result and 'title' in result['ocr_contents'] and result['ocr_contents']['title']:
+         title_content = result['ocr_contents']['title']
+         # No special handling for any specific document types
+         html.append(f'<h2>{title_content}</h2>')
+
+     # Add images if present
+     if has_images and 'pages_data' in result:
+         html.append('<h3>Images</h3>')
+
+         # Extract and display all images
+         for page_idx, page in enumerate(result['pages_data']):
+             if 'images' in page and isinstance(page['images'], list):
+                 for img_idx, img in enumerate(page['images']):
+                     if 'image_base64' in img and img['image_base64']:
+                         # Image container
+                         html.append('<div class="image-container">')
+                         html.append(f'<img src="{img["image_base64"]}" alt="Image {page_idx+1}-{img_idx+1}">')
+
+                         # Generic caption based on index
+                         html.append(f'<div class="image-caption">img-{img_idx}.jpeg</div>')
+                         html.append('</div>')
+
+         # Add an image description if available through utils
+         if content_utils_available:
+             description = extract_image_description(result)
+             if description:
+                 html.append('<div class="text-block">')
+                 html.append(f'<p>{description}</p>')
+                 html.append('</div>')
+
+         html.append('<hr class="separator">')
+
+     # Add the document text section
+     html.append('<h3>Text</h3>')
+
+     # Extract text content systematically
+     text_content = ""
+
+     if content_utils_available:
+         # Use the systematic utility function
+         text_content = extract_document_text(result)
+     else:
+         # Fallback extraction logic
+         if 'ocr_contents' in result:
+             for field in ["main_text", "content", "text", "transcript", "raw_text"]:
+                 if field in result['ocr_contents'] and result['ocr_contents'][field]:
+                     content = result['ocr_contents'][field]
+                     if isinstance(content, str) and content.strip():
+                         text_content = content
+                         break
+                     elif isinstance(content, dict):
+                         # Try to convert complex objects to a string
+                         try:
+                             text_content = json.dumps(content, indent=2)
+                             break
+                         except Exception:
+                             pass
+
+     # Process the text content for HTML display
+     if text_content:
+         # Clean the text but preserve page references
+         text_content = text_content.replace('\r\n', '\n')
+
+         # Preserve page references by wrapping them in HTML tags
+         if has_page_refs:
+             # Highlight common page reference patterns
+             page_patterns = [
+                 (r'(page\s+\d+)', r'<span class="page-ref">\1</span>'),
+                 (r'(p\.\s*\d+)', r'<span class="page-ref">\1</span>'),
+                 (r'(p\s+\d+)', r'<span class="page-ref">\1</span>'),
+                 (r'(\[\s*\d+\s*\])', r'<span class="page-ref">\1</span>'),
+                 (r'(\(\s*\d+\s*\))', r'<span class="page-ref">\1</span>'),
+                 (r'(folio\s+\d+)', r'<span class="page-ref">\1</span>'),
+                 (r'(f\.\s*\d+)', r'<span class="page-ref">\1</span>'),
+                 (r'(pg\.\s*\d+)', r'<span class="page-ref">\1</span>')
+             ]
+
+             for pattern, replacement in page_patterns:
+                 text_content = re.sub(pattern, replacement, text_content, flags=re.IGNORECASE)
+
+         # Convert newlines to paragraphs
+         paragraphs = text_content.split('\n\n')
+         paragraphs = [p for p in paragraphs if p.strip()]
+
+         html.append('<div class="text-block">')
+         for paragraph in paragraphs:
+             # Check if the paragraph contains multiple lines
+             if '\n' in paragraph:
+                 lines = paragraph.split('\n')
+                 lines = [line for line in lines if line.strip()]
+
+                 # Convert each line to a paragraph
+                 for line in lines:
+                     html.append(f'<p>{line}</p>')
+             else:
+                 html.append(f'<p>{paragraph}</p>')
+         html.append('</div>')
+     else:
+         html.append('<p>No text content available.</p>')
+
+     # Close the HTML document
+     html.append('</body>')
+     html.append('</html>')
+
+     return '\n'.join(html)
+
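A sketch of previewing the generated HTML inside the app (streamlit.components.v1 renders raw HTML in an iframe):

    import streamlit.components.v1 as components

    html = create_html_with_images(result)
    components.html(html, height=600, scrolling=True)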
+ def clean_ocr_result(result: dict,
+                      use_segmentation: bool = False,
+                      vision_enabled: bool = True) -> dict:
+     """
+     1. Replace or strip markdown image refs (![id](id))
+     2. Collapse pages that are *only* an illustration into a single
+        `illustrations` bucket when vision is off
+     3. Normalise `ocr_contents` keys to always have at least `raw_text`
+     """
+     if 'pages_data' in result:
+         # Build a dict {id: base64} for quick look-ups
+         image_dict = {
+             img['id']: img['image_base64']
+             for page in result['pages_data']
+             for img in page.get('images', [])
+         }
+
+         # --- 1 · replace or drop image placeholders ---
+         def _scrub(markdown: str) -> str:
+             if vision_enabled and image_dict:
+                 return replace_images_in_markdown(markdown, image_dict)
+             # no vision / no images → drop the line
+             return re.sub(r'!\[[^\]]*\]\(img-\d+\.\w+\)', '', markdown)
+
+         for page in result['pages_data']:
+             page['markdown'] = _scrub(page.get('markdown', ''))
+
+     # --- 2 · group illustration-only pages when vision is off ---
+     if not vision_enabled and 'pages_data' in result:
+         text_pages, art_pages = [], []
+         for p in result['pages_data']:
+             has_text = p.get('markdown', '').strip()
+             (text_pages if has_text else art_pages).append(p)
+         result['pages_data'] = text_pages
+         if art_pages:
+             # keep one thumbnail under metadata
+             result.setdefault('illustrations', []).extend(art_pages)
+
+     # --- 3 · ensure raw_text key ---
+     if 'ocr_contents' in result and 'raw_text' not in result['ocr_contents']:
+         # First, try to extract any embedded text from image references
+         raw_text_parts = []
+
+         for page in result.get('pages_data', []):
+             markdown = page.get('markdown', '')
+             # Check if the markdown contains image references
+             img_refs = re.findall(r'!\[([^\]]*)\]\(([^\)]*)\)', markdown)
+
+             # Process each image reference to extract text content
+             if img_refs:
+                 for alt_text, img_url in img_refs:
+                     # If the alt text contains actual text content (not just an image ID), add it
+                     if alt_text and not alt_text.endswith(('.jpeg', '.jpg', '.png')):
+                         # Clean up the alt text and add it as text content
+                         alt_text = alt_text.strip()
+                         if alt_text and len(alt_text) > 3:  # Only add if meaningful
+                             raw_text_parts.append(alt_text)
+
+             # Remove image references from the markdown
+             cleaned_markdown = re.sub(r'!\[[^\]]*\]\([^\)]*\)', '', markdown)
+
+             # Add any remaining text content
+             if cleaned_markdown.strip():
+                 raw_text_parts.append(cleaned_markdown.strip())
+
+         # Join all extracted text content
+         if raw_text_parts:
+             result['ocr_contents']['raw_text'] = "\n\n".join(raw_text_parts)
+         else:
+             # Fallback: use the original method if no text was extracted
+             joined = "\n".join(p.get('markdown', '') for p in result.get('pages_data', []))
+             # Final cleanup of image references
+             joined = re.sub(r'!\[[^\]]*\]\([^\)]*\)', '', joined)
+             result['ocr_contents']['raw_text'] = joined
+
+     return result
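Taken together, a plausible post-processing pipeline (the `result` dict is whatever the OCR step produced):

    cleaned = clean_ocr_result(result, vision_enabled=True)   # scrub image refs, ensure raw_text
    zip_bytes = create_results_zip_in_memory(cleaned)         # package everything for download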
utils/text_utils.py ADDED
@@ -0,0 +1,151 @@
+ """Text utility functions for OCR processing"""
+
+ import re
+
+ def clean_raw_text(text):
+     """Clean raw text by removing image references and serialized data.
+
+     Args:
+         text (str): The text to clean
+
+     Returns:
+         str: The cleaned text
+     """
+     if not text or not isinstance(text, str):
+         return ""
+
+     # Remove image references like ![image](data:image/...)
+     text = re.sub(r'!\[.*?\]\(data:image/[^)]+\)', '', text)
+
+     # Remove basic markdown image references like ![alt](img-1.jpg)
+     text = re.sub(r'!\[[^\]]*\]\([^)]+\)', '', text)
+
+     # Remove base64 encoded image data
+     text = re.sub(r'data:image/[^;]+;base64,[a-zA-Z0-9+/=]+', '', text)
+
+     # Remove image object references like [[OCRImageObject:...]]
+     text = re.sub(r'\[\[OCRImageObject:[^\]]+\]\]', '', text)
+
+     # Clean up any JSON-like image object references
+     text = re.sub(r'{"image(_data)?":("[^"]*"|null|true|false|\{[^}]*\}|\[[^\]]*\])}', '', text)
+
+     # Clean up excessive whitespace and line breaks created by the removals
+     text = re.sub(r'\n{3,}', '\n\n', text)
+     text = re.sub(r'\s{3,}', ' ', text)
+
+     return text.strip()
+
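Illustrative input and output, assuming the regex cleanup above:

    raw = 'Dear Sir,\n\n![img-0.jpeg](img-0.jpeg)\n\nYours faithfully'
    print(clean_raw_text(raw))   # -> 'Dear Sir,\n\nYours faithfully'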
+ def format_markdown_text(text):
+     """Format text with markdown and handle special patterns
+
+     Args:
+         text (str): The text to format
+
+     Returns:
+         str: The formatted markdown text
+     """
+     if not text:
+         return ""
+
+     # First, ensure we're working with a string
+     if not isinstance(text, str):
+         text = str(text)
+
+     # Ensure newlines are preserved for proper spacing
+     # Convert any Windows line endings to Unix
+     text = text.replace('\r\n', '\n')
+
+     # Format dates (MM/DD/YYYY or similar patterns)
+     date_pattern = r'\b(0?[1-9]|1[0-2])[\/\-\.](0?[1-9]|[12][0-9]|3[01])[\/\-\.](\d{4}|\d{2})\b'
+     text = re.sub(date_pattern, r'**\g<0>**', text)
+
+     # Detect markdown tables and preserve them
+     table_sections = []
+     non_table_lines = []
+     in_table = False
+     table_buffer = []
+
+     # Process the text line by line, preserving tables
+     lines = text.split('\n')
+     for i, line in enumerate(lines):
+         line_stripped = line.strip()
+
+         # Detect table rows by the pipe character
+         if '|' in line_stripped and (line_stripped.startswith('|') or line_stripped.endswith('|')):
+             if not in_table:
+                 in_table = True
+                 table_buffer = []
+             table_buffer.append(line)
+
+         # Detect table separators (---|---|---)
+         elif in_table and '---' in line_stripped and '|' in line_stripped:
+             table_buffer.append(line)
+
+         # End of table detection
+         elif in_table:
+             # Check if this is still part of the table
+             next_line_is_table = False
+             if i < len(lines) - 1:
+                 next_line = lines[i+1].strip()
+                 if '|' in next_line and (next_line.startswith('|') or next_line.endswith('|')):
+                     next_line_is_table = True
+
+             if not next_line_is_table:
+                 in_table = False
+                 # Save the complete table
+                 if table_buffer:
+                     table_sections.append('\n'.join(table_buffer))
+                     table_buffer = []
+                 # Add the current line to the non-table lines
+                 non_table_lines.append(line)
+             else:
+                 # Still part of the table
+                 table_buffer.append(line)
+         else:
+             # Not in a table
+             non_table_lines.append(line)
+
+     # Handle any remaining table buffer
+     if in_table and table_buffer:
+         table_sections.append('\n'.join(table_buffer))
+
+     # Process the non-table lines
+     processed_lines = []
+     for line in non_table_lines:
+         line_stripped = line.strip()
+
+         # Check if the line is in ALL CAPS (and not just a short acronym)
+         if line_stripped and line_stripped.isupper() and len(line_stripped) > 3:
+             # ALL CAPS line - make bold instead of a heading to prevent large display
+             processed_lines.append(f"**{line_stripped}**")
+         # Process potential headers (lines ending with a colon)
+         elif line_stripped and line_stripped.endswith(':') and len(line_stripped) < 40:
+             # Likely a header - make it bold
+             processed_lines.append(f"**{line_stripped}**")
+         else:
+             # Keep the original line with its spacing
+             processed_lines.append(line)
+
+     # Join the non-table lines
+     processed_text = '\n'.join(processed_lines)
+
+     # Normalise paragraph spacing on the non-table text first, so the
+     # substitutions below cannot split table rows apart
+     processed_text = re.sub(r'\n{3,}', '\n\n', processed_text)
+
+     # Ensure two newlines between paragraphs for proper markdown rendering
+     processed_text = re.sub(r'([^\n])\n([^\n])', r'\1\n\n\2', processed_text)
+
+     # Reinsert the preserved tables; for now they are appended at the end
+     for table in table_sections:
+         processed_text += f"\n\n{table}\n\n"
+
+     return processed_text
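Illustrative behavior on a fabricated snippet: ALL-CAPS lines become bold, detected dates are bolded, and single newlines are doubled for markdown rendering:

    sample = 'WAR DEPARTMENT\nDate: 03/15/1862\nOrders follow.'
    print(format_markdown_text(sample))
    # **WAR DEPARTMENT**
    #
    # Date: **03/15/1862**
    #
    # Orders follow.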
utils/ui_utils.py ADDED
@@ -0,0 +1,413 @@
+ """
+ UI utilities for OCR results display.
+ """
+ import json
+
+ import streamlit as st
+
+ from utils.image_utils import format_ocr_text, create_html_with_images
+ from utils.content_utils import format_structured_data
+
+ def display_results(result, container, custom_prompt=""):
+     """Display OCR results in the provided container"""
+     with container:
+         # Add a heading for document metadata
+         st.markdown("### Document Metadata")
+
+         # Filter out large data structures from the metadata display
+         meta = {k: v for k, v in result.items()
+                 if k not in ['pages_data', 'illustrations', 'ocr_contents', 'raw_response_data']}
+
+         # Create a compact metadata section
+         meta_html = '<div style="display: flex; flex-wrap: wrap; gap: 0.3rem; margin-bottom: 0.3rem;">'
+
+         # Document type
+         if 'detected_document_type' in meta:
+             meta_html += f'<div><strong>Type:</strong> {meta["detected_document_type"]}</div>'
+
+         # Processing time
+         if 'processing_time' in meta:
+             meta_html += f'<div><strong>Time:</strong> {meta["processing_time"]:.1f}s</div>'
+
+         # Page information
+         if 'limited_pages' in meta:
+             meta_html += f'<div><strong>Pages:</strong> {meta["limited_pages"]["processed"]}/{meta["limited_pages"]["total"]}</div>'
+
+         meta_html += '</div>'
+         st.markdown(meta_html, unsafe_allow_html=True)
+
+         # Language metadata on a separate line, Subject Tags below
+
+         # First show languages if available
+         if 'languages' in result and result['languages']:
+             languages = [lang for lang in result['languages'] if lang is not None]
+             if languages:
+                 # Create a dedicated line for Languages
+                 lang_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
+                 lang_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Language:</div>'
+
+                 # Add language tags
+                 for lang in languages:
+                     # Clean the language name if needed
+                     clean_lang = str(lang).strip()
+                     if clean_lang:  # Only add if not empty
+                         lang_html += f'<span class="subject-tag tag-language">{clean_lang}</span>'
+
+                 lang_html += '</div>'
+                 st.markdown(lang_html, unsafe_allow_html=True)
+
+         # Create a separate line for Time if we have time-related tags
+         if 'topics' in result and result['topics']:
+             time_tags = [topic for topic in result['topics']
+                          if any(term in topic.lower() for term in ["century", "pre-", "era", "historical"])]
+             if time_tags:
+                 time_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
+                 time_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Time:</div>'
+                 for tag in time_tags:
+                     time_html += f'<span class="subject-tag tag-time-period">{tag}</span>'
+                 time_html += '</div>'
+                 st.markdown(time_html, unsafe_allow_html=True)
+
+         # Then display the remaining subject tags if available
+         if 'topics' in result and result['topics']:
+             # Filter out time-related tags which are already displayed
+             subject_tags = [topic for topic in result['topics']
+                             if not any(term in topic.lower() for term in ["century", "pre-", "era", "historical"])]
+
+             if subject_tags:
+                 # Create a separate line for Subject Tags
+                 tags_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
+                 tags_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Subject Tags:</div>'
+                 tags_html += '<div style="display: flex; flex-wrap: wrap; gap: 2px; align-items: center;">'
+
+                 # Generate a badge for each remaining tag
+                 for topic in subject_tags:
+                     # Determine the tag category class
+                     tag_class = "subject-tag"  # Default class
+
+                     # Add a specialized class based on category
+                     if any(term in topic.lower() for term in ["language", "english", "french", "german", "latin"]):
+                         tag_class += " tag-language"  # Languages
+                     elif any(term in topic.lower() for term in ["letter", "newspaper", "book", "form", "document", "recipe"]):
+                         tag_class += " tag-document-type"  # Document types
+                     elif any(term in topic.lower() for term in ["travel", "military", "science", "medicine", "education", "art", "literature"]):
+                         tag_class += " tag-subject"  # Subject domains
+
+                     # Add each tag as an inline span
+                     tags_html += f'<span class="{tag_class}">{topic}</span>'
+
+                 # Close the containers
+                 tags_html += '</div></div>'
+
+                 # Render the subject tags section
+                 st.markdown(tags_html, unsafe_allow_html=True)
+
+         # Check if we have OCR content
+         if 'ocr_contents' in result:
+             # Create a single view instead of tabs
+             content_tab1 = st.container()
+
+             # Check for images in the result to use later
+             has_images = result.get('has_images', False)
+             has_image_data = ('pages_data' in result and any(page.get('images', []) for page in result.get('pages_data', [])))
+             has_raw_images = ('raw_response_data' in result and 'pages' in result['raw_response_data'] and
+                               any('images' in page for page in result['raw_response_data']['pages']
+                                   if isinstance(page, dict)))
+
+             # Display structured content
+             with content_tab1:
+                 # Display structured content with markdown formatting
+                 if isinstance(result['ocr_contents'], dict):
+                     # CSS is handled in the main layout.py file
+
+                     # Collect all available images from the result
+                     available_images = []
+                     if has_images and 'pages_data' in result:
+                         for page_idx, page in enumerate(result['pages_data']):
+                             if 'images' in page and len(page['images']) > 0:
+                                 for img_idx, img in enumerate(page['images']):
+                                     if 'image_base64' in img:
+                                         available_images.append({
+                                             'source': 'pages_data',
+                                             'page': page_idx,
+                                             'index': img_idx,
+                                             'data': img['image_base64']
+                                         })
+
+                     # Get images from the raw response as well
+                     if 'raw_response_data' in result:
+                         raw_data = result['raw_response_data']
+                         if isinstance(raw_data, dict) and 'pages' in raw_data:
+                             for page_idx, page in enumerate(raw_data['pages']):
+                                 if isinstance(page, dict) and 'images' in page:
+                                     for img_idx, img in enumerate(page['images']):
+                                         if isinstance(img, dict) and 'base64' in img:
+                                             available_images.append({
+                                                 'source': 'raw_response',
+                                                 'page': page_idx,
+                                                 'index': img_idx,
+                                                 'data': img['base64']
+                                             })
+
+                     # Extract images for display in the Images tab only
+                     images_to_display = []
+                     for img_idx, img in enumerate(available_images):
+                         if 'data' in img:
+                             images_to_display.append({
+                                 'data': img['data'],
+                                 'id': img.get('id', f"img_{img_idx}"),
+                                 'index': img_idx
+                             })
+
+                     # Organize sections in a logical order - prioritize main_text
+                     section_order = ["title", "author", "date", "summary", "main_text", "content", "transcript", "metadata"]
+                     ordered_sections = []
+
+                     # Add known sections first in the preferred order
+                     for section_name in section_order:
+                         if section_name in result['ocr_contents'] and result['ocr_contents'][section_name]:
+                             ordered_sections.append(section_name)
+
+                     # Add any remaining sections
+                     for section in result['ocr_contents'].keys():
+                         if (section not in ordered_sections and
+                             section not in ['error', 'partial_text'] and
+                             result['ocr_contents'][section]):
+                             ordered_sections.append(section)
+
+                     # If only raw_text is available and no other content, add it last
+                     if ('raw_text' in result['ocr_contents'] and
+                         result['ocr_contents']['raw_text'] and
+                         len(ordered_sections) == 0):
+                         ordered_sections.append('raw_text')
+
+                     # Add minimal spacing before the OCR results
+                     st.markdown("<div style='margin: 8px 0 4px 0;'></div>", unsafe_allow_html=True)
+
+                     # Create tabs for the different views
+                     if has_images:
+                         tabs = st.tabs(["Document Content", "Raw JSON", "Images"])
+                         doc_tab, json_tab, img_tab = tabs
+                     else:
+                         tabs = st.tabs(["Document Content", "Raw JSON"])
+                         doc_tab, json_tab = tabs
+                         img_tab = None
+
+                     # Document Content tab with simplified and systematic content handling
+                     with doc_tab:
+                         # Create a single unified content section
+                         st.markdown("#### Document Content")
+                         st.markdown("##### Title")
+
+                         # Use the same approach as the Previous Results tab for consistency:
+                         # a focused list of important sections, prioritizing main_text
+                         priority_sections = ["title", "main_text", "content", "transcript", "summary"]
+                         displayed_sections = set()
+
+                         # Display the first priority section that has content
+                         for section in priority_sections:
+                             if section in result['ocr_contents'] and result['ocr_contents'][section]:
+                                 content = result['ocr_contents'][section]
+                                 if isinstance(content, str) and content.strip():
+                                     # Only add a subheader for meaningful section names, not raw_text
+                                     if section != "raw_text" and section != "title":
+                                         st.markdown(f"##### {section.replace('_', ' ').title()}")
+
+                                     # Format and display the content: first format any
+                                     # structured data (lists, dicts), then apply the
+                                     # regular OCR text formatting
+                                     structured_content = format_structured_data(content)
+                                     formatted_content = format_ocr_text(structured_content)
+                                     st.markdown(formatted_content)
+                                     displayed_sections.add(section)
+                                     break
+                                 elif isinstance(content, dict):
+                                     # Display dictionary content as key-value pairs
+                                     for k, v in content.items():
+                                         if k not in ['error', 'partial_text'] and v:
+                                             st.markdown(f"**{k.replace('_', ' ').title()}**")
+                                             if isinstance(v, str):
+                                                 # Format any structured data in the string
+                                                 formatted_v = format_structured_data(v)
+                                                 st.markdown(format_ocr_text(formatted_v))
+                                             else:
+                                                 # Format non-string values (lists, dicts)
+                                                 formatted_v = format_structured_data(v)
+                                                 st.markdown(formatted_v)
+                                     displayed_sections.add(section)
+                                     break
+                                 elif isinstance(content, list):
+                                     # Format and display list items with the structured formatter
+                                     formatted_list = format_structured_data(content)
+                                     st.markdown(formatted_list)
+                                     displayed_sections.add(section)
+                                     break
+
+                         # Then display any remaining sections not already shown
+                         for section, content in result['ocr_contents'].items():
+                             if (section not in displayed_sections and
+                                 section not in ['error', 'partial_text'] and
+                                 content):
+                                 st.markdown(f"##### {section.replace('_', ' ').title()}")
+
+                                 if isinstance(content, str):
+                                     # Format any structured data in the string before display
+                                     structured_content = format_structured_data(content)
+                                     st.markdown(format_ocr_text(structured_content))
+                                 elif isinstance(content, list):
+                                     # Format the list with the structured formatter
+                                     formatted_list = format_structured_data(content)
+                                     st.markdown(formatted_list)
+                                 elif isinstance(content, dict):
+                                     # Format the dictionary with the structured formatter
+                                     formatted_dict = format_structured_data(content)
+                                     st.markdown(formatted_dict)
+
+                     # Raw JSON tab - for viewing the raw OCR response data
+                     with json_tab:
+                         # Extract the relevant JSON data
+                         json_data = {}
+
+                         # Include important metadata
+                         for field in ['file_name', 'timestamp', 'processing_time', 'detected_document_type', 'languages', 'topics']:
+                             if field in result:
+                                 json_data[field] = result[field]
+
+                         # Include OCR contents
+                         if 'ocr_contents' in result:
+                             json_data['ocr_contents'] = result['ocr_contents']
+
+                         # Exclude large binary data like base64 images to keep the JSON clean
+                         if 'pages_data' in result:
+                             # Create simplified pages_data without large binary content
+                             simplified_pages = []
+                             for page in result['pages_data']:
+                                 simplified_page = {
+                                     'page_number': page.get('page_number', 0),
+                                     'has_text': bool(page.get('markdown', '')),
+                                     'has_images': bool(page.get('images', [])),
+                                     'image_count': len(page.get('images', []))
+                                 }
+                                 simplified_pages.append(simplified_page)
+                             json_data['pages_summary'] = simplified_pages
+
+                         # Format the JSON prettily
+                         json_str = json.dumps(json_data, indent=2)
+
+                         # Display in a monospace font with syntax highlighting
+                         st.code(json_str, language="json")
+
+                     # Images tab - for viewing document images
+                     if has_images and img_tab:
+                         with img_tab:
+                             # Display each available image
+                             for i, img in enumerate(images_to_display):
+                                 st.image(img['data'], caption=f"Image {i+1}", use_container_width=True)
+
+         # Display the custom prompt if provided
+         if custom_prompt:
+             with st.expander("Custom Processing Instructions"):
+                 st.write(custom_prompt)
+
+         # Create the export section with a simple download menu
+         # (no heading - start directly with the buttons)
+         st.markdown("<div style='margin-top: 15px;'></div>", unsafe_allow_html=True)
+
+         # Prepare all download files at once to avoid rerun resets
+         try:
+             # 1. JSON download
+             json_str = json.dumps(result, indent=2)
+             json_filename = f"{result.get('file_name', 'document').split('.')[0]}_ocr.json"
+
+             # 2. Text download with improved structure
+             text_parts = []
+             filename = result.get('file_name', 'document')
+             text_parts.append(f"DOCUMENT: {filename}\n")
+
+             if 'timestamp' in result:
+                 text_parts.append(f"Processed: {result['timestamp']}\n")
+
+             if 'languages' in result and result['languages']:
+                 languages = [lang for lang in result['languages'] if lang is not None]
+                 if languages:
+                     text_parts.append(f"Languages: {', '.join(languages)}\n")
+
+             if 'topics' in result and result['topics']:
+                 text_parts.append(f"Topics: {', '.join(result['topics'])}\n")
+
+             text_parts.append("\n" + "="*50 + "\n\n")
+
+             if 'ocr_contents' in result and 'title' in result['ocr_contents'] and result['ocr_contents']['title']:
+                 text_parts.append(f"TITLE: {result['ocr_contents']['title']}\n\n")
+
+             content_added = False
+
+             if 'ocr_contents' in result:
+                 for field in ["main_text", "content", "text", "transcript", "raw_text"]:
+                     if field in result['ocr_contents'] and result['ocr_contents'][field]:
+                         text_parts.append(f"CONTENT:\n\n{result['ocr_contents'][field]}\n")
+                         content_added = True
+                         break
+
+             text_content = "\n".join(text_parts)
+             text_filename = f"{result.get('file_name', 'document').split('.')[0]}_ocr.txt"
+
+             # 3. HTML download (create_html_with_images is imported at the top of the module)
+             html_content = create_html_with_images(result)
+             html_filename = f"{result.get('file_name', 'document').split('.')[0]}_ocr.html"
+
+             # Offer the download options in an expander, stacked vertically
+             # with spacing between the buttons for readability
+             with st.expander("Download Options"):
+                 st.download_button(
+                     label="JSON",
+                     data=json_str,
+                     file_name=json_filename,
+                     mime="application/json",
+                     key="download_json_btn",
+                     use_container_width=True
+                 )
+
+                 st.markdown("<div style='margin-top: 8px;'></div>", unsafe_allow_html=True)
+
+                 st.download_button(
+                     label="Text",
+                     data=text_content,
+                     file_name=text_filename,
+                     mime="text/plain",
+                     key="download_text_btn",
+                     use_container_width=True
+                 )
+
+                 st.markdown("<div style='margin-top: 8px;'></div>", unsafe_allow_html=True)
+
+                 st.download_button(
+                     label="HTML",
+                     data=html_content,
+                     file_name=html_filename,
+                     mime="text/html",
+                     key="download_html_btn",
+                     use_container_width=True
+                 )
+
+         except Exception as e:
+             st.error(f"Error preparing download files: {str(e)}")