milwright commited on
Commit
94e74f0
·
1 Parent(s): 3030658

modularize + nest scripts; reduce technical debt

Browse files
.clinerules/hocr-basics-api.md ADDED
@@ -0,0 +1,106 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # HOCR Basics: API Integrations (Streamlit and Mistral OCR)
2
+
3
+ This rule defines the essential development standards for integrating the Mistral OCR API and using Streamlit components in the `milwright/historical-ocr` application.
4
+
5
+ ## 📌 Rule 1: Mistral OCR API Usage
6
+
7
+ * **Endpoint:**
8
+ `POST https://api.mistral.ai/v1/ocr`
9
+
10
+ * **Headers:**
11
+
12
+ ```http
13
+ Authorization: Bearer YOUR_API_KEY
14
+ Content-Type: application/json
15
+ ```
16
+
17
+ * **Required JSON Body Fields:**
18
+
19
+ ```json
20
+ {
21
+ "file_url": "https://example.com/your.pdf"
22
+ }
23
+ ```
24
+
25
+ * **Expected Response Fields:**
26
+
27
+ * `text`: Raw OCR output
28
+ * `metadata`: Document structure, language, layout information
29
+
30
+ > **Note:** Always validate presence of required fields and handle error codes gracefully.
31
+
32
+ ---
33
+
34
+ ## 🖼️ Rule 2: Streamlit Usage Standards
35
+
36
+ * Use these core components:
37
+
38
+ * `st.file_uploader()`
39
+ * `st.selectbox()`
40
+ * `st.image()`
41
+ * `st.markdown()`
42
+ * `st.download_button()`
43
+
44
+ * Always set:
45
+ `use_container_width=True` for responsive display where supported
46
+
47
+ * Avoid global state; prefer `st.session_state` for interactivity and stateful inputs
48
+
49
+ ## Mistral OCR Examples
50
+
51
+ ```json
52
+ {
53
+ "id": "string",
54
+ "object": "model",
55
+ "created": 0,
56
+ "model": "string",
57
+ "id": "string",
58
+ "document": {
59
+ "document_url": "string",
60
+ "document_name": "string",
61
+ "type": "document_url"
62
+ },
63
+ "pages": [
64
+ 0
65
+ ],
66
+ "include_image_base64": true,
67
+ "image_limit": 0,
68
+ "image_min_size": 0
69
+ }
70
+ ```
71
+
72
+ ```json
73
+ {
74
+ "pages": [
75
+ {
76
+ "index": 0,
77
+ "markdown": "string",
78
+ "images": [
79
+ {
80
+ "id": "string",
81
+ "top_left_x": 0,
82
+ "top_left_y": 0,
83
+ "bottom_right_x": 0,
84
+ "bottom_right_y": 0,
85
+ "image_base64": "string"
86
+ }
87
+ ],
88
+ "dimensions": {
89
+ "dpi": 0,
90
+ "height": 0,
91
+ "width": 0
92
+ }
93
+ }
94
+ ],
95
+ "model": "string",
96
+ "usage_info": {
97
+ "pages_processed": 0,
98
+ "doc_size_bytes": 0
99
+ }
100
+ }
101
+ ```
102
+
103
+ ### Links and Resources to Understand
104
+
105
+ * [URL to Mistral OCR API doc](https://docs.mistral.ai/api/#tag/batch/operation/jobs_api_routes_batch_cancel_batch_job)
106
+ * [URL to Streamlit API documentation](https://docs.streamlit.io/develop/api-reference)
.clinerules/project-brief.md ADDED
@@ -0,0 +1,21 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Project Brief
2
+
3
+ Historical OCR is an advanced optical character recognition (OCR) application designed to support historical research. It leverages Mistral AI's OCR models alongside image preprocessing pipelines optimized for archival material.
4
+
5
+ ## High-Level Overview
6
+
7
+ Building a Streamlit-based web application to process historical documents (images or PDFs), optimize them for OCR using advanced preprocessing techniques, and extract structured text and metadata through Mistral's large language models.
8
+
9
+ ## Core Requirements and Goals
10
+
11
+ Upload and preprocess historical documents
12
+
13
+ Automatically detect document types (e.g., handwritten letters, scientific papers)
14
+
15
+ Apply tailored OCR prompting and structured output based on document type
16
+
17
+ Support user-defined contextual instructions to refine output
18
+
19
+ Provide downloadable structured transcripts and analysis
20
+
21
+ Example: "Building a Streamlit web app for OCR transcription and structured extraction from historical documents using Mistral AI."
.gitignore CHANGED
@@ -32,3 +32,4 @@ input/*.pdf
32
 
33
  # Temporary documents
34
  Tmplf6xnkgr*
 
 
32
 
33
  # Temporary documents
34
  Tmplf6xnkgr*
35
+ .env
app.py CHANGED
@@ -20,7 +20,7 @@ import streamlit as st
20
  # Local application/module imports
21
  from preprocessing import convert_pdf_to_images, preprocess_image
22
  from ocr_processing import process_file
23
- from ui_components import (
24
  ProgressReporter,
25
  create_sidebar_options,
26
  display_results,
 
20
  # Local application/module imports
21
  from preprocessing import convert_pdf_to_images, preprocess_image
22
  from ocr_processing import process_file
23
+ from ui.ui_components import (
24
  ProgressReporter,
25
  create_sidebar_options,
26
  display_results,
config.py CHANGED
@@ -17,39 +17,34 @@ load_dotenv()
17
  # Priority order:
18
  # 1. HF_API_KEY environment variable (Hugging Face standard)
19
  # 2. HUGGING_FACE_API_KEY environment variable (alternative name)
20
- # 3. MISTRAL_API_KEY environment variable (fallback)
21
- # 4. Empty string (will show warning in app)
 
22
 
23
  MISTRAL_API_KEY = os.environ.get("HF_API_KEY",
24
  os.environ.get("HUGGING_FACE_API_KEY",
25
- os.environ.get("MISTRAL_API_KEY", ""))).strip()
 
26
 
27
  if not MISTRAL_API_KEY:
28
  logger.warning("No Mistral API key found in environment variables. API functionality will be limited.")
29
 
30
  # Check if we're in test mode (allows operation without valid API key)
31
- # Set to False to use actual API calls
32
  TEST_MODE = False
33
 
34
- # Just check if API key exists
35
- if not MISTRAL_API_KEY and not TEST_MODE:
36
- logger.warning("No Mistral API key found. OCR functionality will not work unless TEST_MODE is enabled.")
37
-
38
- if TEST_MODE:
39
- logger.info("TEST_MODE is enabled. Using mock responses instead of actual API calls.")
40
-
41
  # Model settings with fallbacks
42
  OCR_MODEL = os.environ.get("MISTRAL_OCR_MODEL", "mistral-ocr-latest")
43
  TEXT_MODEL = os.environ.get("MISTRAL_TEXT_MODEL", "mistral-small-latest") # Updated from ministral-8b-latest
44
- VISION_MODEL = os.environ.get("MISTRAL_VISION_MODEL", "mistral-small-latest") # Using faster model that supports vision
45
 
46
  # Image preprocessing settings optimized for historical documents
47
  # These can be customized from environment variables
48
  IMAGE_PREPROCESSING = {
49
- "enhance_contrast": float(os.environ.get("ENHANCE_CONTRAST", "1.8")), # Increased contrast for better text recognition
50
  "sharpen": os.environ.get("SHARPEN", "True").lower() in ("true", "1", "yes"),
51
  "denoise": os.environ.get("DENOISE", "True").lower() in ("true", "1", "yes"),
52
- "max_size_mb": float(os.environ.get("MAX_IMAGE_SIZE_MB", "12.0")), # Increased size limit for better quality
53
  "target_dpi": int(os.environ.get("TARGET_DPI", "300")), # Target DPI for scaling
54
  "compression_quality": int(os.environ.get("COMPRESSION_QUALITY", "100")), # Higher quality for better OCR results
55
  # # Enhanced settings for handwritten documents
 
17
  # Priority order:
18
  # 1. HF_API_KEY environment variable (Hugging Face standard)
19
  # 2. HUGGING_FACE_API_KEY environment variable (alternative name)
20
+ # 3. HF_MISTRAL_API_KEY environment variable (for Hugging Face deployment)
21
+ # 4. MISTRAL_API_KEY environment variable (fallback)
22
+ # 5. Empty string (will show warning in app)
23
 
24
  MISTRAL_API_KEY = os.environ.get("HF_API_KEY",
25
  os.environ.get("HUGGING_FACE_API_KEY",
26
+ os.environ.get("HF_MISTRAL_API_KEY",
27
+ os.environ.get("MISTRAL_API_KEY", "")))).strip()
28
 
29
  if not MISTRAL_API_KEY:
30
  logger.warning("No Mistral API key found in environment variables. API functionality will be limited.")
31
 
32
  # Check if we're in test mode (allows operation without valid API key)
33
+ # Set to False to use actual API calls with Mistral API
34
  TEST_MODE = False
35
 
 
 
 
 
 
 
 
36
  # Model settings with fallbacks
37
  OCR_MODEL = os.environ.get("MISTRAL_OCR_MODEL", "mistral-ocr-latest")
38
  TEXT_MODEL = os.environ.get("MISTRAL_TEXT_MODEL", "mistral-small-latest") # Updated from ministral-8b-latest
39
+ VISION_MODEL = os.environ.get("MISTRAL_VISION_MODEL", "mistral-small-latest") # faster model that supports vision
40
 
41
  # Image preprocessing settings optimized for historical documents
42
  # These can be customized from environment variables
43
  IMAGE_PREPROCESSING = {
44
+ "enhance_contrast": float(os.environ.get("ENHANCE_CONTRAST", "3.5")), # Increased contrast for better text recognition
45
  "sharpen": os.environ.get("SHARPEN", "True").lower() in ("true", "1", "yes"),
46
  "denoise": os.environ.get("DENOISE", "True").lower() in ("true", "1", "yes"),
47
+ "max_size_mb": float(os.environ.get("MAX_IMAGE_SIZE_MB", "200.0")), # Increased size limit for better quality
48
  "target_dpi": int(os.environ.get("TARGET_DPI", "300")), # Target DPI for scaling
49
  "compression_quality": int(os.environ.get("COMPRESSION_QUALITY", "100")), # Higher quality for better OCR results
50
  # # Enhanced settings for handwritten documents
ocr_processing.py CHANGED
@@ -82,7 +82,7 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
82
 
83
  # Create a container for progress indicators if not provided
84
  if progress_reporter is None:
85
- from ui_components import ProgressReporter
86
  progress_reporter = ProgressReporter(st.empty()).setup()
87
 
88
  # Initialize temporary file paths list
@@ -119,10 +119,7 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
119
 
120
  # For PDFs, we need to handle differently
121
  if file_type == "pdf":
122
- progress_reporter.update(20, "Converting PDF to images...")
123
-
124
- # Process PDF with direct handling
125
- progress_reporter.update(30, "Processing PDF with OCR...")
126
 
127
  # Create a temporary file for processing
128
  temp_path = tempfile.NamedTemporaryFile(delete=False, suffix=file_ext).name
@@ -145,91 +142,98 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
145
  custom_prompt
146
  )
147
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
148
  # Process with cached function if possible
149
  try:
150
- # Use the document type information from preprocessing options
151
- doc_type = preprocessing_options.get("document_type", "standard")
152
- modified_custom_prompt = custom_prompt
153
-
154
- # Add PDF-specific instructions
155
- if not modified_custom_prompt:
156
- modified_custom_prompt = "This is a multi-page PDF document."
157
- elif "pdf" not in modified_custom_prompt.lower() and "multi-page" not in modified_custom_prompt.lower():
158
- modified_custom_prompt += " This is a multi-page PDF document."
159
-
160
- # Update the cache key with the modified prompt
161
- if modified_custom_prompt != custom_prompt:
162
- cache_key = generate_cache_key(
163
- open(temp_path, 'rb').read(),
164
- file_type,
165
- use_vision,
166
- preprocessing_options,
167
- pdf_rotation,
168
- modified_custom_prompt
169
- )
170
-
171
- result = process_file_cached(temp_path, file_type, use_vision, file_size_mb, cache_key, str(preprocessing_options), modified_custom_prompt)
172
  progress_reporter.update(90, "Finalizing results...")
173
  except Exception as e:
174
- logger.warning(f"Cached processing failed: {str(e)}. Retrying with direct processing.")
175
- progress_reporter.update(60, f"Processing error: {str(e)}. Retrying...")
176
-
177
- # If caching fails, process directly
178
- processor = StructuredOCR()
179
-
180
 
181
- # Use the document type from preprocessing options
182
- doc_type = preprocessing_options.get("document_type", "standard")
183
- modified_custom_prompt = custom_prompt
184
-
185
- # Check for letterhead/marginalia document types with specialized handling
186
  try:
187
- from letterhead_handler import get_letterhead_prompt, is_likely_letterhead
188
- # Extract text density features if available
189
- features = None
190
- if 'text_density' in preprocessing_options:
191
- features = preprocessing_options['text_density']
192
-
193
- # Check if this looks like a letterhead document
194
- if is_likely_letterhead(temp_path, features):
195
- # Get specialized letterhead prompt
196
- letterhead_prompt = get_letterhead_prompt(temp_path, features)
197
- if letterhead_prompt:
198
- logger.info(f"Using specialized letterhead prompt for document")
199
- modified_custom_prompt = letterhead_prompt
200
- # Set document type for tracking
201
- preprocessing_options["document_type"] = "letterhead"
202
- doc_type = "letterhead"
203
  except ImportError:
204
- logger.debug("Letterhead handler not available")
205
-
206
- # Add document-type specific instructions based on preprocessing options
207
- if doc_type == "handwritten" and not modified_custom_prompt:
208
- modified_custom_prompt = "This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
209
- elif doc_type == "handwritten" and "handwritten" not in modified_custom_prompt.lower():
210
- modified_custom_prompt += " This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
211
- elif doc_type == "newspaper" and not modified_custom_prompt:
212
- modified_custom_prompt = "This is a newspaper or document with columns. Please extract all text content from each column, maintaining proper reading order."
213
- elif doc_type == "newspaper" and "column" not in modified_custom_prompt.lower() and "newspaper" not in modified_custom_prompt.lower():
214
- modified_custom_prompt += " This appears to be a newspaper or document with columns. Please extract all text content from each column."
215
- elif doc_type == "book" and not modified_custom_prompt:
216
- modified_custom_prompt = "This is a book page. Extract titles, headers, footnotes, and body text, preserving paragraph structure and formatting."
217
-
218
- # Add PDF-specific instructions if needed
219
- if "pdf" not in modified_custom_prompt.lower() and "multi-page" not in modified_custom_prompt.lower():
220
- modified_custom_prompt += " This is a multi-page PDF document."
221
-
222
- # Process directly with optimized settings
223
- result = processor.process_file(
224
- file_path=temp_path,
225
- file_type="pdf",
226
- use_vision=use_vision,
227
- custom_prompt=modified_custom_prompt,
228
- file_size_mb=file_size_mb,
229
- pdf_rotation=pdf_rotation
230
- )
231
-
232
- progress_reporter.update(90, "Finalizing results...")
233
  else:
234
  # For image files
235
  progress_reporter.update(20, "Preparing image for processing...")
@@ -390,7 +394,7 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
390
 
391
  # Check for letterhead/marginalia document types with specialized handling
392
  try:
393
- from letterhead_handler import get_letterhead_prompt, is_likely_letterhead
394
  # Extract text density features if available
395
  features = None
396
  if 'text_density' in preprocessing_options:
@@ -453,7 +457,7 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
453
 
454
  # Check for letterhead/marginalia document types with specialized handling
455
  try:
456
- from letterhead_handler import get_letterhead_prompt, is_likely_letterhead
457
  # Extract text density features if available
458
  features = None
459
  if 'text_density' in preprocessing_options:
@@ -503,7 +507,7 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
503
 
504
  # Check for duplicated text patterns that indicate handwritten text issues
505
  try:
506
- from ocr_text_repair import detect_duplicate_text_issues, get_enhanced_preprocessing_options, get_handwritten_specific_prompt, clean_duplicated_text
507
 
508
  # Check OCR output for duplication issues
509
  if result and 'ocr_contents' in result and 'raw_text' in result['ocr_contents']:
 
82
 
83
  # Create a container for progress indicators if not provided
84
  if progress_reporter is None:
85
+ from ui.ui_components import ProgressReporter
86
  progress_reporter = ProgressReporter(st.empty()).setup()
87
 
88
  # Initialize temporary file paths list
 
119
 
120
  # For PDFs, we need to handle differently
121
  if file_type == "pdf":
122
+ progress_reporter.update(20, "Preparing PDF document...")
 
 
 
123
 
124
  # Create a temporary file for processing
125
  temp_path = tempfile.NamedTemporaryFile(delete=False, suffix=file_ext).name
 
142
  custom_prompt
143
  )
144
 
145
+ # Use the document type information from preprocessing options
146
+ doc_type = preprocessing_options.get("document_type", "standard")
147
+ modified_custom_prompt = custom_prompt
148
+
149
+ # Enhance the prompt with document-type specific instructions
150
+ # Check for letterhead/marginalia document types with specialized handling
151
+ try:
152
+ from utils.helpers.letterhead_handler import get_letterhead_prompt, is_likely_letterhead
153
+ # Extract text density features if available
154
+ features = None
155
+ if 'text_density' in preprocessing_options:
156
+ features = preprocessing_options['text_density']
157
+
158
+ # Check if this looks like a letterhead document
159
+ if is_likely_letterhead(temp_path, features):
160
+ # Get specialized letterhead prompt
161
+ letterhead_prompt = get_letterhead_prompt(temp_path, features)
162
+ if letterhead_prompt:
163
+ logger.info(f"Using specialized letterhead prompt for document")
164
+ modified_custom_prompt = letterhead_prompt
165
+ # Set document type for tracking
166
+ preprocessing_options["document_type"] = "letterhead"
167
+ doc_type = "letterhead"
168
+ except ImportError:
169
+ logger.debug("Letterhead handler not available")
170
+
171
+ # Add document-type specific instructions based on preprocessing options
172
+ if doc_type == "handwritten" and not modified_custom_prompt:
173
+ modified_custom_prompt = "This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
174
+ elif doc_type == "handwritten" and "handwritten" not in modified_custom_prompt.lower():
175
+ modified_custom_prompt += " This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
176
+ elif doc_type == "newspaper" and not modified_custom_prompt:
177
+ modified_custom_prompt = "This is a newspaper or document with columns. Please extract all text content from each column, maintaining proper reading order."
178
+ elif doc_type == "newspaper" and "column" not in modified_custom_prompt.lower() and "newspaper" not in modified_custom_prompt.lower():
179
+ modified_custom_prompt += " This appears to be a newspaper or document with columns. Please extract all text content from each column."
180
+ elif doc_type == "book" and not modified_custom_prompt:
181
+ modified_custom_prompt = "This is a book page. Extract titles, headers, footnotes, and body text, preserving paragraph structure and formatting."
182
+
183
+ # Update the cache key with the modified prompt
184
+ if modified_custom_prompt != custom_prompt:
185
+ cache_key = generate_cache_key(
186
+ open(temp_path, 'rb').read(),
187
+ file_type,
188
+ use_vision,
189
+ preprocessing_options,
190
+ pdf_rotation,
191
+ modified_custom_prompt
192
+ )
193
+
194
+ progress_reporter.update(30, "Processing PDF with enhanced OCR...")
195
+
196
  # Process with cached function if possible
197
  try:
198
+ result = process_file_cached(temp_path, file_type, use_vision, file_size_mb, cache_key,
199
+ str(preprocessing_options), modified_custom_prompt)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
200
  progress_reporter.update(90, "Finalizing results...")
201
  except Exception as e:
202
+ logger.warning(f"Cached processing failed: {str(e)}. Using direct processing.")
203
+ progress_reporter.update(60, f"Processing error: {str(e)}. Using enhanced PDF processor...")
 
 
 
 
204
 
205
+ # Import the enhanced PDF processor
 
 
 
 
206
  try:
207
+ from utils.pdf_ocr import PDFOCR
208
+
209
+ # Use our specialized PDF processor
210
+ pdf_processor = PDFOCR()
211
+
212
+ # Process with the enhanced PDF processor
213
+ result = pdf_processor.process_pdf(
214
+ pdf_path=temp_path,
215
+ use_vision=use_vision,
216
+ max_pages=max_pages,
217
+ custom_prompt=modified_custom_prompt
218
+ )
219
+
220
+ logger.info("PDF successfully processed with enhanced PDF processor")
221
+ progress_reporter.update(90, "Finalizing results...")
 
222
  except ImportError:
223
+ logger.warning("Enhanced PDF processor not available. Falling back to standard processing.")
224
+ progress_reporter.update(70, "Falling back to standard PDF processing...")
225
+
226
+ # If enhanced processor is not available, fall back to direct StructuredOCR processing
227
+ processor = StructuredOCR()
228
+ result = processor.process_file(
229
+ file_path=temp_path,
230
+ file_type="pdf",
231
+ use_vision=use_vision,
232
+ custom_prompt=modified_custom_prompt,
233
+ file_size_mb=file_size_mb,
234
+ max_pages=max_pages
235
+ )
236
+ progress_reporter.update(90, "Finalizing results...")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
237
  else:
238
  # For image files
239
  progress_reporter.update(20, "Preparing image for processing...")
 
394
 
395
  # Check for letterhead/marginalia document types with specialized handling
396
  try:
397
+ from utils.helpers.letterhead_handler import get_letterhead_prompt, is_likely_letterhead
398
  # Extract text density features if available
399
  features = None
400
  if 'text_density' in preprocessing_options:
 
457
 
458
  # Check for letterhead/marginalia document types with specialized handling
459
  try:
460
+ from utils.helpers.letterhead_handler import get_letterhead_prompt, is_likely_letterhead
461
  # Extract text density features if available
462
  features = None
463
  if 'text_density' in preprocessing_options:
 
507
 
508
  # Check for duplicated text patterns that indicate handwritten text issues
509
  try:
510
+ from utils.helpers.ocr_text_repair import detect_duplicate_text_issues, get_enhanced_preprocessing_options, get_handwritten_specific_prompt, clean_duplicated_text
511
 
512
  # Check OCR output for duplication issues
513
  if result and 'ocr_contents' in result and 'raw_text' in result['ocr_contents']:
requirements.txt CHANGED
@@ -9,7 +9,7 @@ pydantic>=2.5.0 # Updated for better BaseModel support
9
  Pillow>=10.0.0
10
  opencv-python-headless>=4.8.0.74
11
  pdf2image>=1.16.0
12
- # pytesseract>=0.3.10 # For local OCR fallback
13
  matplotlib>=3.7.0 # For visualization in preprocessing tests
14
 
15
  # Data handling and utilities
 
9
  Pillow>=10.0.0
10
  opencv-python-headless>=4.8.0.74
11
  pdf2image>=1.16.0
12
+ pytesseract>=0.3.10 # For local OCR fallback
13
  matplotlib>=3.7.0 # For visualization in preprocessing tests
14
 
15
  # Data handling and utilities
structured_ocr.py CHANGED
The diff for this file is too large to render. See raw diff
 
ui/ui_components.py ADDED
@@ -0,0 +1,590 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import streamlit as st
2
+ import os
3
+ import io
4
+ import base64
5
+ import logging
6
+ import re
7
+ from datetime import datetime
8
+ from pathlib import Path
9
+ import json
10
+
11
+ # Define exports
12
+ __all__ = [
13
+ 'ProgressReporter',
14
+ 'create_sidebar_options',
15
+ 'create_file_uploader',
16
+ 'display_document_with_images',
17
+ 'display_previous_results',
18
+ 'display_about_tab',
19
+ 'display_results' # Re-export from utils.ui_utils
20
+ ]
21
+ from constants import (
22
+ DOCUMENT_TYPES,
23
+ DOCUMENT_LAYOUTS,
24
+ CUSTOM_PROMPT_TEMPLATES,
25
+ LAYOUT_PROMPT_ADDITIONS,
26
+ DEFAULT_PDF_DPI,
27
+ MIN_PDF_DPI,
28
+ MAX_PDF_DPI,
29
+ DEFAULT_MAX_PAGES,
30
+ PERFORMANCE_MODES,
31
+ PREPROCESSING_DOC_TYPES,
32
+ ROTATION_OPTIONS
33
+ )
34
+ from utils.text_utils import format_ocr_text, clean_raw_text, format_markdown_text # Import from text_utils
35
+ from utils.content_utils import (
36
+ classify_document_content,
37
+ extract_document_text,
38
+ extract_image_description
39
+ )
40
+ from utils.ui_utils import display_results
41
+ from preprocessing import preprocess_image
42
+
43
class ProgressReporter:
    """Reports OCR processing progress in the Streamlit UI.

    Wraps a Streamlit placeholder with a progress bar and a status-text
    line.  Call ``setup()`` once to create the widgets, ``update()`` as
    work advances (0-100 scale), and ``complete()`` when processing ends.
    """

    def __init__(self, placeholder):
        """Store the Streamlit placeholder; widgets are created lazily in setup()."""
        self.placeholder = placeholder
        self.progress_bar = None
        self.status_text = None

    def setup(self):
        """Create the progress bar and status text inside the placeholder.

        Returns:
            self, so callers can chain: ``ProgressReporter(st.empty()).setup()``.
        """
        with self.placeholder.container():
            self.progress_bar = st.progress(0)
            self.status_text = st.empty()
        return self

    def update(self, percent, status_text):
        """Update the progress bar and status message.

        Args:
            percent: Progress on a 0-100 scale.  Fix: the value is clamped
                to [0, 100] before rescaling to the [0.0, 1.0] range that
                ``st.progress`` expects -- out-of-range input would
                otherwise raise a StreamlitAPIException.
            status_text: Message to show beneath the bar.
        """
        if self.progress_bar is not None:
            clamped = max(0, min(100, percent))
            self.progress_bar.progress(clamped / 100)
        if self.status_text is not None:
            self.status_text.text(status_text)

    def complete(self, success=True):
        """Show a final success/failure message, then clear the widgets.

        The widgets are emptied after a short delay so the completion
        message stays briefly visible before the indicators disappear.
        """
        if success:
            if self.progress_bar is not None:
                self.progress_bar.progress(100)
            if self.status_text is not None:
                self.status_text.text("Processing complete!")
        else:
            if self.status_text is not None:
                self.status_text.text("Processing failed.")

        # Local import keeps this UI-only dependency out of module import time.
        import time
        time.sleep(0.8)  # short delay so the user can read the final status
        if self.progress_bar is not None:
            self.progress_bar.empty()
        if self.status_text is not None:
            self.status_text.empty()
83
+
84
def create_sidebar_options():
    """Build the sidebar controls and return the selected OCR options.

    Returns:
        dict with keys: ``use_vision``, ``perf_mode``, ``pdf_dpi``,
        ``max_pages``, ``pdf_rotation``, ``custom_prompt``,
        ``preprocessing_options``, ``use_segmentation``.
    """
    with st.sidebar:
        st.markdown("## OCR Settings")

        # Container groups all option widgets in the sidebar.
        with st.container():
            # Vision model is always used; the toggle was removed from the UI.
            use_vision = True

            doc_type = st.selectbox("Document Type", DOCUMENT_TYPES,
                                    help="Select the type of document you're processing for better results")

            doc_layout = st.selectbox("Document Layout", DOCUMENT_LAYOUTS,
                                      help="Select the layout of your document")

            # Rotation and segmentation are currently fixed (no UI controls).
            # Fix: the original initialized these twice; bind them once here.
            rotation = 0
            use_segmentation = False

            # Build the custom prompt from the document-type template,
            # optionally extended with layout-specific instructions.
            custom_prompt = ""
            if doc_type != DOCUMENT_TYPES[0]:  # skip for the auto-detect entry
                prompt_template = CUSTOM_PROMPT_TEMPLATES.get(doc_type, "")

                if doc_layout != DOCUMENT_LAYOUTS[0]:  # not standard layout
                    layout_addition = LAYOUT_PROMPT_ADDITIONS.get(doc_layout, "")
                    if layout_addition:
                        prompt_template += " " + layout_addition

                custom_prompt = prompt_template

            # Always let the user edit the (possibly pre-filled) prompt.
            custom_prompt = st.text_area("Custom Processing Instructions", value=custom_prompt,
                                         help="Customize the instructions for processing this document",
                                         height=80)

            # Image preprocessing options (always visible)
            st.markdown("### Image Preprocessing")

            grayscale = st.checkbox("Convert to Grayscale",
                                    value=True,
                                    help="Convert color images to grayscale for better text recognition")

            denoise = st.checkbox("Light Denoising",
                                  value=True,
                                  help="Apply gentle denoising to improve text clarity")

            contrast = st.slider("Contrast Adjustment",
                                 min_value=-20,
                                 max_value=20,
                                 value=5,
                                 step=5,
                                 help="Adjust image contrast (limited range)")

            # Map UI document types onto the preprocessing pipeline's types.
            doc_type_for_preprocessing = "standard"
            if "Handwritten" in doc_type:
                doc_type_for_preprocessing = "handwritten"
            elif "Newspaper" in doc_type or "Magazine" in doc_type:
                doc_type_for_preprocessing = "newspaper"
            elif "Book" in doc_type or "Publication" in doc_type:
                doc_type_for_preprocessing = "book"  # match the actual preprocessing type

            preprocessing_options = {
                "document_type": doc_type_for_preprocessing,
                "grayscale": grayscale,
                "denoise": denoise,
                "contrast": contrast,
                "rotation": rotation
            }

            # PDF-specific options
            st.markdown("### PDF Options")
            max_pages = st.number_input("Maximum Pages to Process",
                                        min_value=1,
                                        max_value=20,
                                        value=DEFAULT_MAX_PAGES,
                                        help="Limit the number of pages to process (for multi-page PDFs)")

            # DPI and rotation controls were removed from the UI; keep fixed defaults.
            pdf_dpi = DEFAULT_PDF_DPI
            pdf_rotation = 0

            options = {
                "use_vision": use_vision,
                "perf_mode": "Quality",  # performance-mode selector removed from UI
                "pdf_dpi": pdf_dpi,
                "max_pages": max_pages,
                "pdf_rotation": pdf_rotation,
                "custom_prompt": custom_prompt,
                "preprocessing_options": preprocessing_options,
                # Fix: use_segmentation is always bound above, so the previous
                # "'use_segmentation' in locals()" guard was dead code.
                "use_segmentation": use_segmentation
            }

            return options
198
+
199
def create_file_uploader():
    """Render the app header, project framing, and the document uploader.

    Returns:
        The Streamlit UploadedFile for the chosen document, or None when
        nothing has been uploaded yet.
    """
    # Header banner: scroll icon plus the app title, as inline HTML.
    st.markdown(
        '<div style="display: flex; align-items: center; gap: 10px;">'
        '<div style="font-size: 32px;">📜</div>'
        '<div><h2 style="margin: 0; padding: 10px 0 0 0;">Historical OCR</h2></div>'
        '</div>',
        unsafe_allow_html=True,
    )
    st.markdown(
        "<p style='font-size: 0.8em; color: #666; text-align: left;'>Made possible by Mistral AI</p>",
        unsafe_allow_html=True,
    )

    # Short framing blurb explaining what the tool is for.
    st.markdown("""
    This tool assists scholars in historical research by extracting text from challenging documents. While it may not achieve 100% accuracy, it helps navigate:
    - **Historical newspapers** with complex layouts
    - **Handwritten documents** from various periods
    - **Photos of archival materials**

    Upload a document to begin, or explore the examples.
    """)

    # The uploader itself; accepted formats mirror what the OCR backend supports.
    return st.file_uploader(
        "Select file",
        type=["pdf", "png", "jpg"],
        help="Upload a PDF or image file for OCR processing",
    )
222
+
223
def display_document_with_images(result):
    """Render each page of an OCR result as an image, plus any alt text.

    Args:
        result: OCR result dict. Prefers 'pages_data'; otherwise rebuilds
            page entries from result['raw_response_data']['pages']. If
            neither source is present, shows an info message and returns.
    """
    # Check for pages_data first
    if 'pages_data' in result and result['pages_data']:
        pages_data = result['pages_data']
    # If pages_data not available, try to extract from raw_response_data
    elif 'raw_response_data' in result and isinstance(result['raw_response_data'], dict) and 'pages' in result['raw_response_data']:
        # Build pages_data from raw_response_data
        pages_data = []
        raw_pages = result['raw_response_data']['pages']

        for page_idx, page in enumerate(raw_pages):
            if not isinstance(page, dict):
                continue

            page_data = {
                'page_number': page_idx + 1,
                'markdown': page.get('markdown', ''),
                'images': []
            }

            # Extract images if present; the API may use either base64 key
            if 'images' in page and isinstance(page['images'], list):
                for img_idx, img in enumerate(page['images']):
                    if isinstance(img, dict) and ('base64' in img or 'image_base64' in img):
                        img_base64 = img.get('image_base64', img.get('base64', ''))
                        if img_base64:
                            page_data['images'].append({
                                'id': img.get('id', f"img_{page_idx}_{img_idx}"),
                                'image_base64': img_base64
                            })

            # Keep only pages that contribute text or images
            if page_data['markdown'] or page_data['images']:
                pages_data.append(page_data)
    else:
        st.info("No image data available.")
        return

    # Display each page
    for i, page_data in enumerate(pages_data):
        st.markdown(f"### Page {i+1}")

        # Display only the image - check multiple possible field names
        image_displayed = False

        # Try 'image_data' field first
        if 'image_data' in page_data:
            try:
                # Convert base64 to image
                image_data = base64.b64decode(page_data['image_data'])
                st.image(io.BytesIO(image_data), use_container_width=True)
                image_displayed = True
            except Exception as e:
                st.error(f"Error displaying image from image_data: {str(e)}")

        # Try 'images' array if image_data didn't work
        if not image_displayed and 'images' in page_data and page_data['images']:
            for img in page_data['images']:
                if 'image_base64' in img:
                    try:
                        st.image(img['image_base64'], use_container_width=True)
                        image_displayed = True
                        break
                    except Exception as e:
                        st.error(f"Error displaying image from images array: {str(e)}")

        # Try alternative image source (raw response) if still not displayed
        if not image_displayed and 'raw_response_data' in result:
            raw_data = result['raw_response_data']
            if isinstance(raw_data, dict) and 'pages' in raw_data:
                for raw_page in raw_data['pages']:
                    if isinstance(raw_page, dict) and 'images' in raw_page:
                        for img in raw_page['images']:
                            if isinstance(img, dict) and 'base64' in img:
                                st.image(img['base64'], use_container_width=True)
                                st.caption("Image from OCR response")
                                image_displayed = True
                                break
                        if image_displayed:
                            break

        if not image_displayed:
            st.info("No image available for this page.")

        # Extract and display alt text if available
        page_text = ""
        if 'text' in page_data:
            page_text = page_data['text']
        elif 'markdown' in page_data:
            page_text = page_data['markdown']

        # If the page content is a bare markdown image tag "![alt](url)",
        # surface its alt text as a caption.
        if page_text and page_text.startswith("![") and page_text.endswith(")"):
            try:
                alt_text = page_text[2:page_text.index(']')]
                if alt_text and len(alt_text) > 5:  # Only show if alt text is meaningful
                    st.caption(f"Image description: {alt_text}")
            except ValueError:
                # Malformed image syntax (no closing ']'); was a bare except
                # that silently hid every error — now only this case is ignored.
                pass
322
+
323
def display_previous_results():
    """Display previous results tab content in a simplified, structured view.

    Renders a card grid of past OCR runs from
    st.session_state.previous_results, offers a zip download of all results,
    and shows a tabbed detail view for the currently selected result.
    """

    # Use a simple header without the button column
    st.header("Previous Results")

    # Display previous results if available
    if not st.session_state.previous_results:
        # Empty state. NOTE: the <h3> tag was previously malformed
        # ('<h3="margin-bottom...">'), dropping its style attribute.
        st.markdown("""
        <div style="text-align: center; padding: 30px 20px; background-color: #f8f9fa; border-radius: 6px; margin-top: 10px;">
            <div style="font-size: 36px; margin-bottom: 15px;">📄</div>
            <h3 style="margin-bottom: 16px; font-weight: 500;">No Previous Results</h3>
            <p style="font-size: 14px; color: #666;">Process a document to see your results history.</p>
        </div>
        """, unsafe_allow_html=True)
    else:
        # Prepare zip download outside of the UI flow
        try:
            # Create download link for all results
            from utils.image_utils import create_results_zip_in_memory
            zip_data = create_results_zip_in_memory(st.session_state.previous_results)
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

            # Simplified filename
            zip_filename = f"ocr_results_{timestamp}.zip"

            # Encode the zip data for direct download link
            zip_b64 = base64.b64encode(zip_data).decode()

            # Add styled download tag in the metadata section
            download_html = '<div style="display: flex; align-items: center; margin: 0.5rem 0; flex-wrap: wrap;">'
            download_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Download:</div>'
            download_html += f'<a href="data:application/zip;base64,{zip_b64}" download="{zip_filename}" class="subject-tag tag-download">All Results</a>'
            download_html += '</div>'
            st.markdown(download_html, unsafe_allow_html=True)
        except Exception:
            # Silent fail - no error message to keep UI clean
            pass

        # Create a cleaner, more minimal grid for results using Streamlit columns
        num_columns = 2  # Two columns for most screens

        # Create rows of result cards
        for i in range(0, len(st.session_state.previous_results), num_columns):
            # Create a row of columns
            cols = st.columns(num_columns)

            # Fill each column with a result card
            for j in range(num_columns):
                index = i + j
                if index < len(st.session_state.previous_results):
                    result = st.session_state.previous_results[index]

                    # Get basic info for the card
                    file_name = result.get("file_name", f"Document {index+1}")
                    timestamp = result.get("timestamp", "")

                    # Determine file type icon
                    if file_name.lower().endswith(".pdf"):
                        icon = "📄"
                    elif any(file_name.lower().endswith(ext) for ext in [".jpg", ".jpeg", ".png", ".gif"]):
                        icon = "🖼️"
                    else:
                        icon = "📝"

                    # Display a simplified card in each column
                    with cols[j]:
                        # Use a container for better styling control
                        with st.container():
                            # Create visually cleaner card with less vertical space
                            st.markdown(f"""
                            <div style="padding: 10px; border: 1px solid #e0e0e0; border-radius: 6px; margin-bottom: 10px;">
                                <div style="display: flex; justify-content: space-between; align-items: center; margin-bottom: 5px;">
                                    <div style="font-weight: 500; font-size: 14px; overflow: hidden; text-overflow: ellipsis; white-space: nowrap;">{icon} {file_name}</div>
                                    <div style="color: #666; font-size: 12px;">{timestamp.split()[0] if timestamp else ""}</div>
                                </div>
                            </div>
                            """, unsafe_allow_html=True)

                            # Add a simple button below each card
                            # (plain label: no placeholders, so no f-string)
                            if st.button("View", key=f"view_{index}", help=f"View {file_name}"):
                                st.session_state.selected_previous_result = st.session_state.previous_results[index]
                                st.rerun()

    # Display the selected result if available
    if 'selected_previous_result' in st.session_state and st.session_state.selected_previous_result:
        selected_result = st.session_state.selected_previous_result

        # Draw a separator between results list and selected document
        st.markdown("<hr style='margin: 20px 0 15px 0; border: none; height: 1px; background-color: #eee;'>", unsafe_allow_html=True)

        # Create a cleaner header for the selected document
        file_name = selected_result.get('file_name', 'Document')
        st.subheader(f"{file_name}")

        # Add a simple back button at the top
        if st.button("← Back to Results", key="back_to_results"):
            if 'selected_previous_result' in st.session_state:
                del st.session_state.selected_previous_result
            st.session_state.perform_reset = True
            st.rerun()

        # Simplified metadata display - just one line with essential info
        meta_html = '<div style="display: flex; flex-wrap: wrap; gap: 12px; margin: 8px 0 15px 0; font-size: 14px; color: #666;">'

        # Add timestamp
        if 'timestamp' in selected_result:
            meta_html += f'<div>{selected_result["timestamp"]}</div>'

        # Add languages if available (simplified)
        if 'languages' in selected_result and selected_result['languages']:
            languages = [lang for lang in selected_result['languages'] if lang is not None]
            if languages:
                meta_html += f'<div>Language: {", ".join(languages)}</div>'

        # Add page count if available (simplified)
        if 'limited_pages' in selected_result:
            meta_html += f'<div>Pages: {selected_result["limited_pages"]["processed"]}/{selected_result["limited_pages"]["total"]}</div>'

        meta_html += '</div>'
        st.markdown(meta_html, unsafe_allow_html=True)

        # Simplified tabs - using the same format as main view
        has_images = selected_result.get('has_images', False)
        if has_images:
            view_tabs = st.tabs(["Document Content", "Raw JSON", "Images"])
            view_tab1, view_tab2, view_tab3 = view_tabs
        else:
            view_tabs = st.tabs(["Document Content", "Raw JSON"])
            view_tab1, view_tab2 = view_tabs
            view_tab3 = None

        # First tab - Document Content (simplified structured view)
        with view_tab1:
            # Display content in a cleaner, more streamlined format
            if 'ocr_contents' in selected_result and isinstance(selected_result['ocr_contents'], dict):
                # Create a more focused list of important sections
                priority_sections = ["title", "content", "transcript", "summary"]
                displayed_sections = set()

                # First display priority sections
                for section in priority_sections:
                    if section in selected_result['ocr_contents'] and selected_result['ocr_contents'][section]:
                        content = selected_result['ocr_contents'][section]
                        if isinstance(content, str) and content.strip():
                            # Only add a subheader for meaningful section names, not raw_text
                            if section != "raw_text":
                                st.markdown(f"##### {section.replace('_', ' ').title()}")

                            # Format and display content
                            formatted_content = format_ocr_text(content, for_display=True)
                            st.markdown(formatted_content)
                            displayed_sections.add(section)

                # Then display any remaining sections not already shown
                for section, content in selected_result['ocr_contents'].items():
                    if (section not in displayed_sections and
                        section not in ['error', 'partial_text'] and
                        content):
                        st.markdown(f"##### {section.replace('_', ' ').title()}")

                        if isinstance(content, str):
                            st.markdown(format_ocr_text(content, for_display=True))
                        elif isinstance(content, list):
                            for item in content:
                                st.markdown(f"- {item}")
                        elif isinstance(content, dict):
                            for k, v in content.items():
                                st.markdown(f"**{k}:** {v}")

        # Second tab - Raw JSON (simplified)
        with view_tab2:
            # Extract the relevant JSON data
            json_data = {}

            # Include important metadata.
            # FIX: the list previously contained "'text',' raw_text'" — the
            # stray space/quoting made ' raw_text' an unmatchable key.
            for field in ['file_name', 'timestamp', 'processing_time', 'title', 'languages',
                          'topics', 'subjects', 'text', 'raw_text']:
                if field in selected_result:
                    json_data[field] = selected_result[field]

            # Include OCR contents
            if 'ocr_contents' in selected_result:
                json_data['ocr_contents'] = selected_result['ocr_contents']

            # Format the JSON prettily
            json_str = json.dumps(json_data, indent=2)

            # Display in a monospace font with syntax highlighting
            st.code(json_str, language="json")

        # Third tab - Images (simplified)
        if has_images and view_tab3 is not None:
            with view_tab3:
                # Simplified image display
                if 'pages_data' in selected_result:
                    for i, page_data in enumerate(selected_result['pages_data']):
                        # Display each page's images
                        if 'images' in page_data and len(page_data['images']) > 0:
                            for img in page_data['images']:
                                if 'image_base64' in img:
                                    st.image(img['image_base64'], use_container_width=True)

                        # Get page text if available
                        page_text = ""
                        if 'markdown' in page_data:
                            page_text = page_data['markdown']

                        # Display text if available
                        if page_text:
                            with st.expander(f"Page {i+1} Text", expanded=False):
                                st.text(page_text)
535
+
536
def display_about_tab():
    """Display learn more tab content.

    Static informational page: app description, purpose, feature list,
    usage steps, underlying technologies, and version. All content is
    emitted as markdown; nothing here reads or writes session state.
    """
    st.header("Learn More")

    # Add app description
    st.markdown("""
    **Historical OCR** is a tailored academic tool for extracting text from historical documents, manuscripts, and printed materials.
    """)

    # Purpose section with consistent formatting
    st.markdown("### Purpose")
    st.markdown("""
    This tool is designed to assist scholars in historical research by extracting text from challenging documents.
    While it may not achieve full accuracy for all materials, it serves as a tailored research aid for navigating
    historical documents, particularly:
    """)

    st.markdown("""
    - **Historical newspapers** with complex layouts and aged text
    - **Handwritten documents** from various time periods
    - **Photos of archival materials** that may be difficult to read
    """)

    # Features section with consistent formatting
    st.markdown("### Features")
    st.markdown("""
    - **Advanced Image Preprocessing**: Optimize historical documents for better OCR results
    - **Custom Document Type Processing**: Specialized handling for newspapers, letters, books, and more
    - **Editable Results**: Review and edit extracted text directly in the interface
    - **Structured Content Analysis**: Automatic organization of document content
    - **Multi-language Support**: Process documents in various languages
    - **PDF Processing**: Handle multi-page historical documents
    """)

    # How to Use section with consistent formatting
    st.markdown("### How to Use")
    st.markdown("""
    1. Upload a document (PDF or image)
    2. Select the document type and adjust preprocessing options if needed
    3. Add custom processing instructions for specialized documents
    4. Process the document
    5. Review, edit, and download the results
    """)

    # Technologies section with consistent formatting
    st.markdown("### Technologies")
    st.markdown("""
    - OCR processing using Mistral AI's advanced document understanding capabilities
    - Image preprocessing with OpenCV
    - PDF handling with pdf2image
    - Web interface with Streamlit
    """)

    # Add version information
    st.markdown("**Version:** 2.0.0")
utils/helpers/language_detection.py ADDED
@@ -0,0 +1,373 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Standard library imports
2
+ import logging
3
+ import re
4
+ from typing import List, Dict, Set, Tuple, Optional, Union, Any
5
+ from functools import lru_cache
6
+
7
+ # Configure logging
8
+ logging.basicConfig(level=logging.INFO,
9
+ format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
10
+ logger = logging.getLogger(__name__)
11
+
12
+ class LanguageDetector:
13
+ """
14
+ A language detection system that provides balanced detection across multiple languages
15
+ using an enhanced statistical approach.
16
+ """
17
+
18
+ def __init__(self):
19
+ """Initialize the language detector with statistical language models"""
20
+ logger.info("Initializing language detector with statistical models")
21
+
22
+ # Initialize language indicators dictionary for statistical detection
23
+ self._init_language_indicators()
24
+ # Set thresholds for language detection confidence
25
+ self.single_lang_confidence = 65 # Minimum score to consider a language detected
26
+ self.secondary_lang_threshold = 0.75 # Secondary language must be at least this fraction of primary score
27
+
28
    def _init_language_indicators(self):
        """Initialize language indicators for statistical detection with historical markers.

        Populates self.language_indicators, mapping a language name to:
        - "chars": distinctive characters for that language
        - "words": common words (including historical forms)
        - "ngrams": frequent character sequences
        - "historical" (optional): chars/words/regex patterns marking older
          forms of the language, scored separately by _detect_statistically
        """
        # Define indicators for all supported languages with equal detail level
        self.language_indicators = {
            "English": {
                "chars": [], # English uses basic Latin alphabet without special chars
                "words": ['the', 'and', 'of', 'to', 'in', 'a', 'is', 'that', 'for', 'it',
                          'with', 'as', 'be', 'on', 'by', 'at', 'this', 'have', 'from', 'or',
                          'an', 'but', 'not', 'what', 'all', 'were', 'when', 'we', 'there', 'can',
                          'would', 'who', 'you', 'been', 'one', 'their', 'has', 'more', 'if', 'no'],
                "ngrams": ['th', 'he', 'in', 'er', 'an', 're', 'on', 'at', 'en', 'nd', 'ti', 'es', 'or',
                           'ing', 'tion', 'the', 'and', 'tha', 'ent', 'ion'],
                "historical": {
                    "chars": ['þ', 'ȝ', 'æ', 'ſ'], # Thorn, yogh, ash, long s
                    "words": ['thou', 'thee', 'thy', 'thine', 'hath', 'doth', 'ere', 'whilom', 'betwixt',
                              'ye', 'art', 'wast', 'dost', 'hast', 'shalt', 'mayst', 'verily'],
                    "patterns": ['eth$', '^y[^a-z]', 'ck$', 'aught', 'ought'] # -eth endings, y- prefixes
                }
            },
            "French": {
                "chars": ['é', 'è', 'ê', 'à', 'ç', 'ù', 'â', 'î', 'ô', 'û', 'ë', 'ï', 'ü'],
                "words": ['le', 'la', 'les', 'et', 'en', 'de', 'du', 'des', 'un', 'une', 'ce', 'cette',
                          'ces', 'dans', 'par', 'pour', 'sur', 'qui', 'que', 'quoi', 'où', 'quand', 'comment',
                          'est', 'sont', 'ont', 'nous', 'vous', 'ils', 'elles', 'avec', 'sans', 'mais', 'ou'],
                "ngrams": ['es', 'le', 'de', 'en', 'on', 'nt', 'qu', 'ai', 'an', 'ou', 'ur', 're', 'me',
                           'les', 'ent', 'que', 'des', 'ons', 'ant', 'ion'],
                "historical": {
                    "chars": ['ſ', 'æ', 'œ'], # Long s and ligatures
                    # Includes deliberate OCR-confusion forms (long-s read as f):
                    # e.g. 'felon' (selon), 'chofe' (chose), 'fcience' (science)
                    "words": ['aultre', 'avecq', 'icelluy', 'oncques', 'moult', 'estre', 'mesme', 'ceste',
                              'ledict', 'celuy', 'ceulx', 'aulcun', 'ainſi', 'touſiours', 'eſtre',
                              'eſt', 'meſme', 'felon', 'auec', 'iufques', 'chofe', 'fcience'],
                    "patterns": ['oi[ts]$', 'oi[re]$', 'f[^aeiou]', 'ff', 'ſ', 'auoit', 'eſtoit',
                                 'ſi', 'ſur', 'ſa', 'cy', 'ayant', 'oy', 'uſ', 'auſ']
                },
            },
            "German": {
                "chars": ['ä', 'ö', 'ü', 'ß'],
                "words": ['der', 'die', 'das', 'und', 'in', 'zu', 'den', 'ein', 'eine', 'mit', 'ist', 'von',
                          'des', 'sich', 'auf', 'für', 'als', 'auch', 'werden', 'bei', 'durch', 'aus', 'sind',
                          'nicht', 'nur', 'wurde', 'wie', 'wenn', 'aber', 'noch', 'nach', 'so', 'sein', 'über'],
                "ngrams": ['en', 'er', 'ch', 'de', 'ei', 'in', 'te', 'nd', 'ie', 'ge', 'un', 'sch', 'ich',
                           'den', 'die', 'und', 'der', 'ein', 'ung', 'cht'],
                "historical": {
                    "chars": ['ſ', 'ů', 'ė', 'ÿ'],
                    # NOTE(review): 'vnnd' appears twice — harmless for membership
                    # tests, but likely unintended
                    "words": ['vnnd', 'vnnd', 'vnter', 'vnd', 'seyn', 'thun', 'auff', 'auß', 'deß', 'diß'],
                    "patterns": ['^v[nd]', 'th', 'vnter', 'ſch']
                }
            },
            "Spanish": {
                "chars": ['á', 'é', 'í', 'ó', 'ú', 'ñ', 'ü', '¿', '¡'],
                "words": ['el', 'la', 'los', 'las', 'de', 'en', 'y', 'a', 'que', 'por', 'un', 'una', 'no',
                          'es', 'con', 'para', 'su', 'al', 'se', 'del', 'como', 'más', 'pero', 'lo', 'mi',
                          'si', 'ya', 'todo', 'esta', 'cuando', 'hay', 'muy', 'bien', 'sin', 'así'],
                "ngrams": ['de', 'en', 'os', 'es', 'la', 'ar', 'el', 'er', 'ra', 'as', 'an', 'do', 'or',
                           'que', 'nte', 'los', 'ado', 'con', 'ent', 'ien'],
                "historical": {
                    "chars": ['ſ', 'ç', 'ñ'],
                    "words": ['facer', 'fijo', 'fermoso', 'agora', 'asaz', 'aver', 'caſa', 'deſde', 'eſte',
                              'eſta', 'eſto', 'deſto', 'deſta', 'eſſo', 'muger', 'dixo', 'fazer'],
                    "patterns": ['^f[aei]', 'ſſ', 'ſc', '^deſ', 'xo$', 'xe$']
                },
            },
            "Italian": {
                "chars": ['à', 'è', 'é', 'ì', 'í', 'ò', 'ó', 'ù', 'ú'],
                "words": ['il', 'la', 'i', 'le', 'e', 'di', 'a', 'in', 'che', 'non', 'per', 'con', 'un',
                          'una', 'del', 'della', 'è', 'sono', 'da', 'si', 'come', 'anche', 'più', 'ma', 'ci',
                          'se', 'ha', 'mi', 'lo', 'ti', 'al', 'tu', 'questo', 'questi'],
                "ngrams": ['di', 'la', 'er', 'to', 're', 'co', 'de', 'in', 'ra', 'on', 'li', 'no', 'ri',
                           'che', 'ent', 'con', 'per', 'ion', 'ato', 'lla']
            },
            "Portuguese": {
                "chars": ['á', 'â', 'ã', 'à', 'é', 'ê', 'í', 'ó', 'ô', 'õ', 'ú', 'ç'],
                "words": ['o', 'a', 'os', 'as', 'de', 'em', 'e', 'do', 'da', 'dos', 'das', 'no', 'na',
                          'para', 'que', 'um', 'uma', 'por', 'com', 'se', 'não', 'mais', 'como', 'mas',
                          'você', 'eu', 'este', 'isso', 'ele', 'seu', 'sua', 'ou', 'já', 'me'],
                "ngrams": ['de', 'os', 'em', 'ar', 'es', 'ra', 'do', 'da', 'en', 'co', 'nt', 'ad', 'to',
                           'que', 'nto', 'ent', 'com', 'ção', 'ado', 'ment']
            },
            "Dutch": {
                "chars": ['ë', 'ï', 'ö', 'ü', 'é', 'è', 'ê', 'ç', 'á', 'à', 'ä', 'ó', 'ô', 'ú', 'ù', 'û', 'ij'],
                "words": ['de', 'het', 'een', 'en', 'van', 'in', 'is', 'dat', 'op', 'te', 'zijn', 'met',
                          'voor', 'niet', 'aan', 'er', 'die', 'maar', 'dan', 'ik', 'je', 'hij', 'zij', 'we',
                          'kunnen', 'wordt', 'nog', 'door', 'over', 'als', 'uit', 'bij', 'om', 'ook'],
                "ngrams": ['en', 'de', 'er', 'ee', 'ge', 'an', 'aa', 'in', 'te', 'et', 'ng', 'ee', 'or',
                           'van', 'het', 'een', 'ing', 'ver', 'den', 'sch']
            },
            "Russian": {
                # Russian (Cyrillic alphabet) characters
                "chars": ['а', 'б', 'в', 'г', 'д', 'е', 'ё', 'ж', 'з', 'и', 'й', 'к', 'л', 'м', 'н', 'о', 'п',
                          'р', 'с', 'т', 'у', 'ф', 'х', 'ц', 'ч', 'ш', 'щ', 'ъ', 'ы', 'ь', 'э', 'ю', 'я'],
                "words": ['и', 'в', 'не', 'на', 'что', 'я', 'с', 'а', 'то', 'он', 'как', 'этот', 'по',
                          'но', 'из', 'к', 'у', 'за', 'вы', 'все', 'так', 'же', 'от', 'для', 'о', 'его',
                          'мы', 'было', 'она', 'бы', 'мне', 'еще', 'есть', 'быть', 'был'],
                "ngrams": ['о', 'е', 'а', 'н', 'и', 'т', 'р', 'с', 'в', 'л', 'к', 'м', 'д',
                           'ст', 'но', 'то', 'ни', 'на', 'по', 'ет']
            },
            "Chinese": {
                "chars": ['的', '是', '不', '了', '在', '和', '有', '我', '们', '人', '这', '上', '中',
                          '个', '大', '来', '到', '国', '时', '要', '地', '出', '会', '可', '也', '就',
                          '年', '生', '对', '能', '自', '那', '都', '得', '说', '过', '子', '家', '后', '多'],
                # Chinese doesn't have "words" in the same way as alphabetic languages
                "words": ['的', '是', '不', '了', '在', '和', '有', '我', '们', '人', '这', '上', '中',
                          '个', '大', '来', '到', '国', '时', '要', '地', '出', '会', '可', '也', '就'],
                "ngrams": ['的', '是', '不', '了', '在', '我', '有', '和', '人', '这', '中', '大', '来', '上',
                           '国', '个', '到', '说', '们', '为']
            },
            "Japanese": {
                # A mix of hiragana, katakana, and common kanji
                "chars": ['あ', 'い', 'う', 'え', 'お', 'か', 'き', 'く', 'け', 'こ', 'さ', 'し', 'す', 'せ', 'そ',
                          'ア', 'イ', 'ウ', 'エ', 'オ', 'カ', 'キ', 'ク', 'ケ', 'コ', 'サ', 'シ', 'ス', 'セ', 'ソ',
                          '日', '本', '人', '大', '小', '中', '山', '川', '田', '子', '女', '男', '月', '火', '水'],
                "words": ['は', 'を', 'に', 'の', 'が', 'で', 'へ', 'から', 'より', 'まで', 'だ', 'です', 'した',
                          'ます', 'ません', 'です', 'これ', 'それ', 'あれ', 'この', 'その', 'あの', 'わたし'],
                "ngrams": ['の', 'は', 'た', 'が', 'を', 'に', 'て', 'で', 'と', 'し', 'か', 'ま', 'こ', 'い',
                           'する', 'いる', 'れる', 'なる', 'れて', 'した']
            },
            "Korean": {
                "chars": ['가', '나', '다', '라', '마', '바', '사', '아', '자', '차', '카', '타', '파', '하',
                          '그', '는', '을', '이', '에', '에서', '로', '으로', '와', '과', '또는', '하지만'],
                "words": ['이', '그', '저', '나', '너', '우리', '그들', '이것', '그것', '저것', '은', '는',
                          '이', '가', '을', '를', '에', '에서', '으로', '로', '와', '과', '의', '하다', '되다'],
                "ngrams": ['이', '다', '는', '에', '하', '고', '지', '서', '의', '가', '을', '로', '을', '으',
                           '니다', '습니', '하는', '이다', '에서', '하고']
            },
            "Arabic": {
                "chars": ['ا', 'ب', 'ت', 'ث', 'ج', 'ح', 'خ', 'د', 'ذ', 'ر', 'ز', 'س', 'ش', 'ص', 'ض',
                          'ط', 'ظ', 'ع', 'غ', 'ف', 'ق', 'ك', 'ل', 'م', 'ن', 'ه', 'و', 'ي', 'ء', 'ة', 'ى'],
                "words": ['في', 'من', 'على', 'إلى', 'هذا', 'هذه', 'ذلك', 'تلك', 'هو', 'هي', 'هم', 'أنا',
                          'أنت', 'نحن', 'كان', 'كانت', 'يكون', 'لا', 'لم', 'ما', 'أن', 'و', 'أو', 'ثم', 'بعد'],
                "ngrams": ['ال', 'ان', 'في', 'من', 'ون', 'ين', 'ات', 'ار', 'ور', 'ما', 'لا', 'ها', 'ان',
                           'الم', 'لان', 'علا', 'الح', 'الس', 'الع', 'الت']
            },
            "Hindi": {
                "chars": ['अ', 'आ', 'इ', 'ई', 'उ', 'ऊ', 'ए', 'ऐ', 'ओ', 'औ', 'क', 'ख', 'ग', 'घ', 'ङ',
                          'च', 'छ', 'ज', 'झ', 'ञ', 'ट', 'ठ', 'ड', 'ढ', 'ण', 'त', 'थ', 'द', 'ध', 'न',
                          'प', 'फ', 'ब', 'भ', 'म', 'य', 'र', 'ल', 'व', 'श', 'ष', 'स', 'ह', 'ा', 'ि', 'ी',
                          'ु', 'ू', 'े', 'ै', 'ो', 'ौ', '्', 'ं', 'ः'],
                "words": ['और', 'का', 'के', 'की', 'एक', 'में', 'है', 'यह', 'हैं', 'से', 'को', 'पर', 'इस',
                          'हो', 'गया', 'कर', 'मैं', 'या', 'हुआ', 'था', 'वह', 'अपने', 'सकता', 'ने', 'बहुत'],
                "ngrams": ['का', 'के', 'की', 'है', 'ने', 'से', 'मे', 'को', 'पर', 'हा', 'रा', 'ता', 'या',
                           'ार', 'ान', 'कार', 'राज', 'ारा', 'जाए', 'ेजा']
            },
            "Latin": {
                "chars": [], # Latin uses basic Latin alphabet
                "words": ['et', 'in', 'ad', 'est', 'sunt', 'non', 'cum', 'sed', 'qui', 'quod', 'ut', 'si',
                          'nec', 'ex', 'per', 'quam', 'pro', 'iam', 'hoc', 'aut', 'esse', 'enim', 'de',
                          'atque', 'ac', 'ante', 'post', 'sub', 'ab'],
                "ngrams": ['us', 'is', 'um', 'er', 'it', 'nt', 'am', 'em', 're', 'at', 'ti', 'es', 'ur',
                           'tur', 'que', 'ere', 'ent', 'ius', 'rum', 'tus']
            },
            "Greek": {
                "chars": ['α', 'β', 'γ', 'δ', 'ε', 'ζ', 'η', 'θ', 'ι', 'κ', 'λ', 'μ', 'ν', 'ξ', 'ο', 'π',
                          'ρ', 'σ', 'ς', 'τ', 'υ', 'φ', 'χ', 'ψ', 'ω', 'ά', 'έ', 'ή', 'ί', 'ό', 'ύ', 'ώ'],
                "words": ['και', 'του', 'της', 'των', 'στο', 'στη', 'με', 'από', 'για', 'είναι', 'να',
                          'ότι', 'δεν', 'στον', 'μια', 'που', 'ένα', 'έχει', 'θα', 'το', 'ο', 'η', 'τον'],
                "ngrams": ['αι', 'τα', 'ου', 'τη', 'οι', 'το', 'ης', 'αν', 'ος', 'ον', 'ις', 'ει', 'ερ',
                           'και', 'την', 'τον', 'ους', 'νου', 'εντ', 'μεν']
            }
        }
192
+
193
+ def detect_languages(self, text: str, filename: str = None, current_languages: List[str] = None) -> List[str]:
194
+ """
195
+ Detect languages in text using an enhanced statistical approach
196
+
197
+ Args:
198
+ text: Text to analyze
199
+ filename: Optional filename to provide additional context
200
+ current_languages: Optional list of languages already detected
201
+
202
+ Returns:
203
+ List of detected languages
204
+ """
205
+ logger = logging.getLogger("language_detector")
206
+
207
+ # If no text provided, return current languages or default
208
+ if not text or len(text.strip()) < 10:
209
+ return current_languages if current_languages else ["English"]
210
+
211
+ # If we already have detected languages, use them
212
+ if current_languages and len(current_languages) > 0:
213
+ logger.info(f"Using already detected languages: {current_languages}")
214
+ return current_languages
215
+
216
+ # Use enhanced statistical detection
217
+ detected_languages = self._detect_statistically(text, filename)
218
+ logger.info(f"Statistical language detection results: {detected_languages}")
219
+ return detected_languages
220
+
221
+ def _detect_statistically(self, text: str, filename: str = None) -> List[str]:
222
+ """
223
+ Detect languages using enhanced statistical analysis with historical language indicators
224
+
225
+ Args:
226
+ text: Text to analyze
227
+ filename: Optional filename for additional context
228
+
229
+ Returns:
230
+ List of detected languages
231
+ """
232
+ logger = logging.getLogger("language_detector")
233
+
234
+ # Normalize text to lowercase for consistent analysis
235
+ text_lower = text.lower()
236
+ words = re.findall(r'\b\w+\b', text_lower) # Extract words
237
+
238
+ # Score each language based on characters, words, n-grams, and historical markers
239
+ language_scores = {}
240
+ historical_bonus = {}
241
+
242
+ # PHASE 1: Special character analysis
243
+ # Count special characters for each language
244
+ special_char_counts = {}
245
+ total_special_chars = 0
246
+
247
+ for language, indicators in self.language_indicators.items():
248
+ chars = indicators["chars"]
249
+ count = 0
250
+ for char in chars:
251
+ if char in text_lower:
252
+ count += text_lower.count(char)
253
+ special_char_counts[language] = count
254
+ total_special_chars += count
255
+
256
+ # Normalize character scores (0-30 points)
257
+ for language, count in special_char_counts.items():
258
+ if total_special_chars > 0:
259
+ # Scale score to 0-30 range (reduced from 35 to make room for historical)
260
+ normalized_score = (count / total_special_chars) * 30
261
+ language_scores[language] = normalized_score
262
+ else:
263
+ language_scores[language] = 0
264
+
265
+ # PHASE 2: Word analysis (0-30 points)
266
+ # Count common words for each language
267
+ for language, indicators in self.language_indicators.items():
268
+ word_list = indicators["words"]
269
+ word_matches = sum(1 for word in words if word in word_list)
270
+
271
+ # Normalize word score based on text length and word list size
272
+ word_score_factor = min(1.0, word_matches / (len(words) * 0.1)) # Max 1.0 if 10% match
273
+ language_scores[language] = language_scores.get(language, 0) + (word_score_factor * 30)
274
+
275
+ # PHASE 3: N-gram analysis (0-20 points)
276
+ for language, indicators in self.language_indicators.items():
277
+ ngram_list = indicators["ngrams"]
278
+ ngram_matches = 0
279
+
280
+ # Count ngram occurrences
281
+ for ngram in ngram_list:
282
+ ngram_matches += text_lower.count(ngram)
283
+
284
+ # Normalize ngram score based on text length
285
+ if len(text_lower) > 0:
286
+ ngram_score_factor = min(1.0, ngram_matches / (len(text_lower) * 0.05)) # Max 1.0 if 5% match
287
+ language_scores[language] = language_scores.get(language, 0) + (ngram_score_factor * 20)
288
+
289
+ # PHASE 4: Historical language markers (0-20 points)
290
+ for language, indicators in self.language_indicators.items():
291
+ if "historical" in indicators:
292
+ historical_indicators = indicators["historical"]
293
+ historical_score = 0
294
+
295
+ # Check for historical chars
296
+ if "chars" in historical_indicators:
297
+ for char in historical_indicators["chars"]:
298
+ if char in text_lower:
299
+ historical_score += text_lower.count(char) * 0.5
300
+
301
+ # Check for historical words
302
+ if "words" in historical_indicators:
303
+ hist_words = historical_indicators["words"]
304
+ hist_word_matches = sum(1 for word in words if word in hist_words)
305
+ if hist_word_matches > 0:
306
+ # Historical words are strong indicators
307
+ historical_score += min(10, hist_word_matches * 2)
308
+
309
+ # Check for historical patterns
310
+ if "patterns" in historical_indicators:
311
+ for pattern in historical_indicators["patterns"]:
312
+ matches = len(re.findall(pattern, text_lower))
313
+ if matches > 0:
314
+ historical_score += min(5, matches * 0.5)
315
+
316
+ # Cap historical score at 20 points
317
+ historical_score = min(20, historical_score)
318
+ historical_bonus[language] = historical_score
319
+
320
+ # Apply historical bonus
321
+ language_scores[language] += historical_score
322
+
323
+ # Apply language-specific exclusivity multiplier if present
324
+ if "exclusivity" in indicators:
325
+ exclusivity = indicators["exclusivity"]
326
+ language_scores[language] *= exclusivity
327
+ logger.info(f"Applied exclusivity multiplier {exclusivity} to {language}")
328
+
329
+ # Print historical bonus for debugging
330
+ for language, bonus in historical_bonus.items():
331
+ if bonus > 0:
332
+ logger.info(f"Historical language bonus for {language}: {bonus} points")
333
+
334
+ # Final language selection with more stringent criteria
335
+ # Get languages with scores above threshold
336
+ threshold = self.single_lang_confidence # Higher minimum score
337
+ candidates = [(lang, score) for lang, score in language_scores.items() if score >= threshold]
338
+ candidates.sort(key=lambda x: x[1], reverse=True)
339
+
340
+ logger.info(f"Language candidates: {candidates}")
341
+
342
+ # If we have candidate languages, return top 1-2 with higher threshold for secondary
343
+ if candidates:
344
+ # Always take top language
345
+ result = [candidates[0][0]]
346
+
347
+ # Add second language only if it's significantly strong compared to primary
348
+ # and doesn't have a historical/exclusivity conflict
349
+ if len(candidates) > 1:
350
+ primary_lang = candidates[0][0]
351
+ secondary_lang = candidates[1][0]
352
+ primary_score = candidates[0][1]
353
+ secondary_score = candidates[1][1]
354
+
355
+ # Only add secondary if it meets threshold and doesn't conflict
356
+ ratio = secondary_score / primary_score
357
+
358
+ # Check for French and Spanish conflict (historical French often gets misidentified)
359
+ historical_conflict = False
360
+ if (primary_lang == "French" and secondary_lang == "Spanish" and
361
+ historical_bonus.get("French", 0) > 5):
362
+ historical_conflict = True
363
+ logger.info("Historical French markers detected, suppressing Spanish detection")
364
+
365
+ if ratio >= self.secondary_lang_threshold and not historical_conflict:
366
+ result.append(secondary_lang)
367
+ logger.info(f"Added secondary language {secondary_lang} (score ratio: {ratio:.2f})")
368
+ else:
369
+ logger.info(f"Rejected secondary language {secondary_lang} (score ratio: {ratio:.2f})")
370
+
371
+ return result
372
+
373
+ # Default to English if no clear signals
utils/helpers/letterhead_handler.py ADDED
@@ -0,0 +1,82 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Standard library imports
2
+ import os
3
+ import logging
4
+ from pathlib import Path
5
+
6
+ # Configure logging
7
+ logging.basicConfig(level=logging.INFO,
8
+ format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
9
+ logger = logging.getLogger(__name__)
10
+
11
def is_likely_letterhead(file_path, features=None):
    """
    Heuristically decide whether a document likely contains letterhead
    or marginalia.

    Args:
        file_path: Path to the document image
        features: Optional dictionary of pre-extracted features like text density

    Returns:
        bool: True if the document likely contains letterhead, False otherwise
    """
    name_lower = Path(file_path).name.lower()

    # Cheap filename heuristic: any of these keywords marks the document.
    if any(keyword in name_lower for keyword in ('letter', 'letterhead', 'correspondence', 'memo')):
        logger.info(f"Letterhead detected based on filename: {name_lower}")
        return True

    if features:
        # Dense text at the top of the page is a typical letterhead signature.
        if features.get('top_density', 0) > 0.5:
            logger.info(f"Letterhead detected based on top text density: {features['top_density']}")
            return True

        # Strongly uneven text distribution suggests marginalia.
        if features.get('density_variance', 0) > 0.3:
            logger.info(f"Possible marginalia detected based on text density variance")
            return True

    # No signal: treat as a standard document.
    return False
46
+
47
def get_letterhead_prompt(file_path, features=None):
    """
    Generate a specialized prompt for letterhead document OCR.

    Args:
        file_path: Path to the document image (currently unused; kept for
            interface symmetry with is_likely_letterhead)
        features: Optional dictionary of pre-extracted features

    Returns:
        str: Specialized prompt for letterhead document OCR
    """
    # Core instructions shared by every letterhead document.
    sections = [
        "This document appears to be a letter or includes letterhead elements. "
        "Please extract the following components separately if present:\n"
        "1. Letterhead (header with logo, organization name, address, etc.)\n"
        "2. Date\n"
        "3. Recipient information (address, name, title)\n"
        "4. Salutation (e.g., 'Dear Sir/Madam')\n"
        "5. Main body text\n"
        "6. Closing (e.g., 'Sincerely')\n"
        "7. Signature\n"
        "8. Any footnotes, marginalia, or annotations\n\n"
        "Preserve the original formatting and structure as much as possible."
    ]

    feats = features or {}

    # Feature-driven refinements appended after the base instructions.
    if feats.get('is_historical'):
        sections.append(
            "\n\nThis appears to be a historical document. Pay special attention to older "
            "letterhead styles, formal language patterns, and period-specific formatting."
        )
    if feats.get('has_marginalia'):
        sections.append(
            "\n\nThe document contains marginalia or handwritten notes in the margins. "
            "Please extract these separately from the main text and indicate their position."
        )

    return "".join(sections)
utils/helpers/ocr_text_repair.py ADDED
@@ -0,0 +1,270 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Standard library imports
2
+ import re
3
+ import logging
4
+ from difflib import SequenceMatcher
5
+ from typing import Tuple, Dict, Any, List, Optional
6
+
7
+ # Configure logging
8
+ logging.basicConfig(level=logging.INFO,
9
+ format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
10
+ logger = logging.getLogger(__name__)
11
+
12
def detect_duplicate_text_issues(text: str) -> Tuple[bool, Dict[str, Any]]:
    """
    Detect whether OCR output shows the text-duplication artifacts that
    commonly appear when OCR is run over handwritten documents.

    Three independent signals are measured — exact repeated lines,
    near-duplicate 100-character blocks, and immediately repeated words —
    and the worst of the three becomes the overall duplication rate.

    Args:
        text: OCR text to analyze

    Returns:
        Tuple of (has_duplication_issues, details_dict)
    """
    # Texts shorter than 100 chars carry too little signal to analyze.
    if not text or len(text) < 100:
        return False, {"duplication_rate": 0.0, "details": "Text too short for analysis"}

    text_lines = text.split('\n')
    total_lines = len(text_lines)

    # --- Signal 1: exact line repetitions ---------------------------------
    first_seen = {}
    repeated_lines = 0
    line_repetition_indices = []
    for idx, raw_line in enumerate(text_lines):
        candidate = raw_line.strip()
        if len(candidate) < 5:  # skip empty / trivially short lines
            continue
        earlier = first_seen.get(candidate)
        if earlier is None:
            first_seen[candidate] = idx
        else:
            repeated_lines += 1
            line_repetition_indices.append((earlier, idx))

    line_repetition_rate = repeated_lines / max(1, total_lines)

    # --- Signal 2: near-duplicate 100-char blocks -------------------------
    blocks = [text[pos:pos + 100] for pos in range(0, len(text), 100) if pos + 100 <= len(text)]
    total_blocks = len(blocks)

    repeated_blocks = 0
    duplicate_sections = []
    for left in range(total_blocks):
        # Only compare against the next few blocks to stay cheap.
        for right in range(left + 1, min(left + 10, total_blocks)):
            similarity = SequenceMatcher(None, blocks[left], blocks[right]).ratio()
            if similarity > 0.8:  # high-similarity threshold
                repeated_blocks += 1
                duplicate_sections.append((left, right, similarity))
                break

    block_repetition_rate = repeated_blocks / max(1, total_blocks)

    # --- Signal 3: immediately repeated words ("the the") -----------------
    repeated_words = len(re.findall(r'\b(\w+)\s+\1\b', text))
    repeated_words_rate = repeated_words / max(1, len(text.split()))

    # Overall rate is the strongest of the three signals.
    duplication_rate = max(line_repetition_rate, block_repetition_rate, repeated_words_rate)

    logger.info(f"OCR duplication analysis: line_repetition={line_repetition_rate:.2f}, "
                f"block_repetition={block_repetition_rate:.2f}, "
                f"word_repetition={repeated_words_rate:.2f}, "
                f"final_rate={duplication_rate:.2f}")

    return duplication_rate > 0.1, {
        "duplication_rate": duplication_rate,
        "line_repetition_rate": line_repetition_rate,
        "block_repetition_rate": block_repetition_rate,
        "word_repetition_rate": repeated_words_rate,
        "repeated_lines": repeated_lines,
        "repeated_blocks": repeated_blocks,
        "repeated_words": repeated_words,
        "duplicate_sections": duplicate_sections[:10],  # cap for brevity
        "repetition_indices": line_repetition_indices[:10],
    }
101
+
102
def get_enhanced_preprocessing_options(current_options: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """
    Build preprocessing options tuned for OCR of handwritten documents.

    The caller's options (if any) are copied, then overridden with
    handwriting-oriented settings; the input dict is never mutated.

    Args:
        current_options: Current preprocessing options (if available)

    Returns:
        Dict of enhanced options
    """
    # Work on a copy so the caller's dict is left untouched.
    options: Dict[str, Any] = dict(current_options) if current_options else {}

    # Handwriting-oriented overrides: boosted contrast, grayscale,
    # adaptive thresholding with a larger block size, and denoising.
    # Plain binarization is disabled because it tends to destroy
    # stroke detail in handwriting.
    options.update({
        "document_type": "handwritten",
        "contrast": 1.4,
        "grayscale": True,
        "adaptive_threshold": True,
        "threshold_block_size": 25,
        "threshold_c": 10,
        "binarize": False,
        "denoise": True,
        "handwriting_mode": True,
    })

    # Sharpening can amplify noise around pen strokes; switch it off
    # if the caller had it configured.
    if "sharpen" in options:
        options["sharpen"] = False

    logger.info(f"Enhanced handwriting preprocessing options generated: {options}")
    return options
144
+
145
def get_handwritten_specific_prompt(current_prompt: Optional[str] = None) -> str:
    """
    Build a specialized OCR prompt for handwritten documents, merging in
    any caller-supplied instructions without duplicating handwriting advice.

    Args:
        current_prompt: Current prompt (if available)

    Returns:
        str: Enhanced prompt for handwritten documents
    """
    # Instructions applied to every handwritten document.
    base_prompt = ("This is a handwritten document that requires careful transcription. "
                   "Please transcribe all visible handwritten text, preserving the original "
                   "line breaks, paragraph structure, and any special formatting or indentation. "
                   "Pay special attention to:\n"
                   "1. Words that may be difficult to read due to handwriting style\n"
                   "2. Any crossed-out text (indicate with [crossed out: possible text])\n"
                   "3. Insertions or annotations between lines or in margins\n"
                   "4. Maintain the spatial layout of the text as much as possible\n"
                   "5. If there are multiple columns or non-linear text, preserve the reading order\n\n"
                   "If you cannot read a word with confidence, indicate with [?] or provide your best guess as [word?].")

    # No caller prompt: the base instructions stand alone.
    if not current_prompt:
        return base_prompt

    lowered = current_prompt.lower()
    if "handwritten" not in lowered and "handwriting" not in lowered:
        # Caller prompt says nothing about handwriting: append it verbatim.
        return f"{base_prompt}\n\nAdditional context from user:\n{current_prompt}"

    # Caller prompt already mentions handwriting; keep only the sentences
    # that add something beyond that, to avoid redundant instructions.
    # (Naive '.'-based sentence split — simplified, may need improvement.)
    extras = [
        sentence.strip()
        for sentence in current_prompt.split('.')
        if sentence.strip()
        and "handwritten" not in sentence.lower()
        and "handwriting" not in sentence.lower()
    ]
    if extras:
        return base_prompt + "\n\nAdditional instructions:\n" + ". ".join(extras) + "."

    return base_prompt
189
+
190
def clean_duplicated_text(text: str) -> str:
    """
    Clean up duplicated text often found in OCR output for handwritten documents.

    Three passes are applied:
      1. consecutive duplicate lines (and runs of blank lines) are collapsed,
      2. immediately repeated words ("the the") are collapsed, and
      3. immediately repeated phrases of 3-6 words are collapsed.

    Note: the phrase pass re-joins the text with single spaces, so line
    breaks surviving pass 1 are flattened in the final output.

    Args:
        text: OCR text to clean

    Returns:
        str: Cleaned text with duplications removed
    """
    # Nothing to do for empty input (also covers None).
    if not text:
        return text

    # Pass 1: drop consecutive duplicate lines; collapse blank-line runs
    # down to a single blank line.
    lines = text.split('\n')
    deduped_lines = []
    prev_line = None
    for line in lines:
        stripped = line.strip()
        if not stripped:
            if not deduped_lines or deduped_lines[-1].strip():
                deduped_lines.append(line)  # keep only the first empty line
            continue
        if stripped == prev_line:
            continue
        deduped_lines.append(line)
        prev_line = stripped
    deduped_text = '\n'.join(deduped_lines)

    # Pass 2: collapse immediately repeated words. A single re.sub pass
    # leaves residue for 3+ consecutive repeats ("the the the" -> "the the",
    # because each match consumes both words), so substitute to a fixpoint.
    word_pattern = r'\b(\w+)\s+\1\b'
    while True:
        collapsed = re.sub(word_pattern, r'\1', deduped_text)
        if collapsed == deduped_text:
            break
        deduped_text = collapsed

    # Pass 3: collapse immediately repeated phrases of 3 to 6 words.
    # This is a simplified approach and might need improvement.
    words = deduped_text.split()
    cleaned_words = []
    i = 0
    while i < len(words):
        found_repeat = False
        for phrase_len in range(3, min(7, len(words) - i)):
            next_pos = i + phrase_len
            if next_pos + phrase_len <= len(words):
                phrase = ' '.join(words[i:next_pos])
                next_phrase = ' '.join(words[next_pos:next_pos + phrase_len])
                if phrase.lower() == next_phrase.lower():
                    # Keep the first occurrence, skip the duplicate.
                    cleaned_words.extend(words[i:next_pos])
                    i = next_pos + phrase_len
                    found_repeat = True
                    break
        if not found_repeat:
            cleaned_words.append(words[i])
            i += 1

    final_text = ' '.join(cleaned_words)

    # Report how much was removed.
    original_len = len(text)
    cleaned_len = len(final_text)
    reduction = 100 * (original_len - cleaned_len) / max(1, original_len)
    logger.info(f"Text cleaning: removed {original_len - cleaned_len} chars ({reduction:.1f}% reduction)")

    return final_text
utils/pdf_ocr.py ADDED
@@ -0,0 +1,457 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ PDFOCR - Module for processing PDF files with OCR and extracting structured data.
4
+ Provides robust PDF to image conversion before OCR processing.
5
+ """
6
+
7
+ import json
8
+ import os
9
+ import tempfile
10
+ import logging
11
+ from pathlib import Path
12
+ from typing import Optional, Dict, List, Union, Tuple, Any
13
+
14
+ # Configure logging
15
+ logging.basicConfig(level=logging.INFO,
16
+ format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
17
+ logger = logging.getLogger("pdf_ocr")
18
+
19
+ # Import StructuredOCR for OCR processing
20
+ from structured_ocr import StructuredOCR
21
+
22
class PDFConversionResult:
    """Holds the outcome of a PDF-to-image conversion."""

    def __init__(self,
                 success: bool,
                 images: List[Path] = None,
                 error: str = None,
                 page_count: int = 0,
                 temp_files: List[str] = None):
        """Initialize the conversion result.

        Args:
            success: Whether the conversion was successful
            images: List of paths to the converted images
            error: Error message if conversion failed
            page_count: Total number of pages in the PDF
            temp_files: List of temporary files that should be cleaned up
        """
        self.success = success          # overall success flag
        self.images = images or []      # per-page image paths, in page order
        self.error = error              # human-readable failure reason
        self.page_count = page_count    # total pages in the source PDF
        self.temp_files = temp_files or []  # files owned by this result

    def __bool__(self):
        """Truthiness mirrors the success flag, so `if result:` works."""
        return self.success

    def cleanup(self):
        """Delete any temporary files produced during conversion.

        Safe to call repeatedly; the tracked list is emptied afterwards.
        """
        for path in self.temp_files:
            try:
                if os.path.exists(path):
                    os.unlink(path)
                    logger.debug(f"Removed temporary file: {path}")
            except Exception as e:
                # Best effort only: a stale temp file is not fatal.
                logger.warning(f"Failed to remove temporary file {path}: {e}")
        self.temp_files = []
60
+
61
+
62
class PDFOCR:
    """Class for processing PDF files with OCR and extracting structured data.

    PDFs are converted to per-page JPEG images with pdf2image, each page is
    OCR'd through StructuredOCR, and the per-page results are merged into a
    single payload. If conversion or per-image processing fails, the PDF is
    handed to StructuredOCR directly as a fallback.
    """

    def __init__(self, api_key=None):
        """Initialize the PDF OCR processor.

        Args:
            api_key: Optional API key forwarded to the underlying StructuredOCR.
        """
        self.processor = StructuredOCR(api_key=api_key)
        # Temporary page images produced during conversion; see cleanup().
        self.temp_files = []

    def __del__(self):
        """Clean up resources when the object is destroyed."""
        self.cleanup()

    def cleanup(self):
        """Clean up any temporary files tracked by this processor."""
        for temp_file in self.temp_files:
            try:
                if os.path.exists(temp_file):
                    os.unlink(temp_file)
                    logger.debug(f"Removed temporary file: {temp_file}")
            except Exception as e:
                # Best effort: a leftover temp file is not fatal.
                logger.warning(f"Failed to remove temporary file {temp_file}: {e}")
        self.temp_files = []

    def _get_page_count(self, pdf_path: Path) -> int:
        """Return the total page count of the PDF, or 1 if undeterminable.

        Reads pypdf metadata only — this avoids rasterizing a page just to
        learn the page count. (The previous implementation converted page 1
        at low DPI and then checked `hasattr(list, 'n_pages')`, which is
        always False for pdf2image's list return value, so that conversion
        was pure wasted work.)
        """
        try:
            from pypdf import PdfReader
            return len(PdfReader(pdf_path).pages)
        except Exception as e:  # narrowed from a bare except
            logger.warning(f"Failed to determine page count: {e}")
            return 1

    def convert_pdf_to_images(self,
                              pdf_path: Union[str, Path],
                              dpi: int = 200,
                              max_pages: Optional[int] = None,
                              page_numbers: Optional[List[int]] = None) -> PDFConversionResult:
        """
        Convert a PDF file to images.

        Args:
            pdf_path: Path to the PDF file
            dpi: DPI for the output images
            max_pages: Maximum number of pages to convert (None for all)
            page_numbers: Specific page numbers to convert (1-based indexing)

        Returns:
            PDFConversionResult object with conversion results
        """
        pdf_path = Path(pdf_path)
        if not pdf_path.exists():
            return PDFConversionResult(
                success=False,
                error=f"PDF file not found: {pdf_path}"
            )

        # Log file size for diagnostics on large documents.
        file_size_mb = pdf_path.stat().st_size / (1024 * 1024)
        logger.info(f"PDF size: {file_size_mb:.2f} MB")

        try:
            # Import pdf2image for conversion (optional dependency).
            import pdf2image

            temp_files = []
            # Cap worker threads at 4; pdf2image gains little beyond that.
            thread_count = min(4, os.cpu_count() or 2)

            logger.info("Determining PDF page count...")
            total_pages = self._get_page_count(pdf_path)
            logger.info(f"PDF has {total_pages} total pages")

            # Decide which pages to rasterize.
            if page_numbers and any(1 <= p <= total_pages for p in page_numbers):
                # Specific pages requested: keep only the valid ones.
                pages_to_process = [p for p in page_numbers if 1 <= p <= total_pages]
                logger.info(f"Converting {len(pages_to_process)} specified pages: {pages_to_process}")
            elif max_pages and max_pages < total_pages:
                pages_to_process = list(range(1, max_pages + 1))
                logger.info(f"Converting first {max_pages} pages of {total_pages} total")
            else:
                pages_to_process = list(range(1, total_pages + 1))
                logger.info(f"Converting all {total_pages} pages")

            # Convert in batches of up to 5 pages to bound peak memory use.
            converted_images = []
            batch_size = min(5, len(pages_to_process))
            for i in range(0, len(pages_to_process), batch_size):
                batch_pages = pages_to_process[i:i + batch_size]
                logger.info(f"Converting batch of pages {batch_pages}")

                try:
                    batch_images = pdf2image.convert_from_path(
                        pdf_path,
                        dpi=dpi,
                        first_page=min(batch_pages),
                        last_page=max(batch_pages),
                        thread_count=thread_count,
                        fmt="jpeg"
                    )

                    # pdf2image returns the full contiguous range
                    # [min(batch_pages), max(batch_pages)]; keep only pages
                    # that were actually requested.
                    for idx, page_num in enumerate(range(min(batch_pages), max(batch_pages) + 1)):
                        if page_num in pages_to_process and idx < len(batch_images):
                            img_temp_path = tempfile.NamedTemporaryFile(
                                suffix=f'_page{page_num}.jpg', delete=False).name
                            batch_images[idx].save(img_temp_path, format='JPEG', quality=95)

                            converted_images.append((page_num, Path(img_temp_path)))
                            temp_files.append(img_temp_path)
                except Exception as e:
                    # One failed batch must not abort the remaining batches.
                    logger.error(f"Failed to convert batch {batch_pages}: {e}")

            # Restore page order regardless of batch completion order.
            converted_images.sort(key=lambda x: x[0])
            image_paths = [img_path for _, img_path in converted_images]

            if not image_paths:
                return PDFConversionResult(
                    success=False,
                    error="Failed to convert PDF to images",
                    page_count=total_pages,
                    temp_files=temp_files
                )

            # Track temp files on the instance too, so self.cleanup() can
            # remove them even if the caller never calls result.cleanup().
            self.temp_files.extend(temp_files)

            return PDFConversionResult(
                success=True,
                images=image_paths,
                page_count=total_pages,
                temp_files=temp_files
            )

        except ImportError:
            return PDFConversionResult(
                success=False,
                error="pdf2image module not available. Please install with: pip install pdf2image"
            )
        except Exception as e:
            logger.error(f"PDF conversion error: {str(e)}")
            return PDFConversionResult(
                success=False,
                error=f"Failed to convert PDF to images: {str(e)}"
            )

    def process_pdf(self, pdf_path, use_vision=True, max_pages=None, custom_pages=None, custom_prompt=None):
        """
        Process a PDF file with OCR and extract structured data.

        Args:
            pdf_path: Path to the PDF file
            use_vision: Whether to use vision model for improved analysis
            max_pages: Maximum number of pages to process
            custom_pages: Specific page numbers to process (1-based indexing)
            custom_prompt: Custom instructions for processing

        Returns:
            Dictionary with structured OCR results

        Raises:
            FileNotFoundError: If the PDF file does not exist.
        """
        pdf_path = Path(pdf_path)
        if not pdf_path.exists():
            raise FileNotFoundError(f"PDF file not found: {pdf_path}")

        # Normalize custom_pages to a list of ints; accepts a list/tuple or
        # a comma-separated string such as "1,3,5".
        page_numbers = None
        if custom_pages:
            if isinstance(custom_pages, (list, tuple)):
                page_numbers = custom_pages
            else:
                try:
                    page_numbers = [int(p.strip()) for p in str(custom_pages).split(',')]
                except ValueError:  # narrowed from a bare except: only int() can fail here
                    logger.warning(f"Invalid custom_pages format: {custom_pages}. Should be list or comma-separated string.")

        # First try our optimized PDF to image conversion.
        conversion_result = self.convert_pdf_to_images(
            pdf_path=pdf_path,
            max_pages=max_pages,
            page_numbers=page_numbers
        )

        if conversion_result.success and conversion_result.images:
            logger.info(f"Successfully converted PDF to {len(conversion_result.images)} images")

            # Make sure the prompt tells the model this is a multi-page PDF.
            modified_prompt = custom_prompt
            if not modified_prompt:
                modified_prompt = f"This is a multi-page PDF document with {conversion_result.page_count} total pages, of which {len(conversion_result.images)} were processed."
            elif "pdf" not in modified_prompt.lower() and "multi-page" not in modified_prompt.lower():
                modified_prompt += f" This is a multi-page PDF document with {conversion_result.page_count} total pages, of which {len(conversion_result.images)} were processed."

            try:
                # Page 1 gets the full (optionally vision-assisted) treatment.
                first_page_result = self.processor.process_file(
                    file_path=conversion_result.images[0],
                    file_type="image",
                    use_vision=use_vision,
                    custom_prompt=modified_prompt
                )

                all_pages_text = []
                all_languages = set()

                # Collect text and languages from the first page.
                if 'ocr_contents' in first_page_result and 'raw_text' in first_page_result['ocr_contents']:
                    all_pages_text.append(first_page_result['ocr_contents']['raw_text'])
                if 'languages' in first_page_result:
                    for lang in first_page_result['languages']:
                        all_languages.add(str(lang))

                # Remaining pages get cheaper, text-only processing.
                for i, img_path in enumerate(conversion_result.images[1:], 1):
                    try:
                        page_result = self.processor.process_file(
                            file_path=img_path,
                            file_type="image",
                            use_vision=False,  # simpler processing for later pages
                            custom_prompt=f"This is page {i+1} of a {conversion_result.page_count}-page document."
                        )

                        if 'ocr_contents' in page_result and 'raw_text' in page_result['ocr_contents']:
                            all_pages_text.append(page_result['ocr_contents']['raw_text'])
                        if 'languages' in page_result:
                            for lang in page_result['languages']:
                                all_languages.add(str(lang))
                    except Exception as e:
                        # One bad page should not abort the whole document.
                        logger.warning(f"Error processing page {i+1}: {e}")

                # Merge per-page text back into the first page's payload.
                combined_text = "\n\n".join(all_pages_text)
                if 'ocr_contents' in first_page_result:
                    first_page_result['ocr_contents']['raw_text'] = combined_text

                if all_languages:
                    first_page_result['languages'] = list(all_languages)

                # PDF-level metadata.
                first_page_result['file_name'] = pdf_path.name
                first_page_result['file_type'] = "pdf"
                first_page_result['total_pages'] = conversion_result.page_count
                first_page_result['processed_pages'] = len(conversion_result.images)
                first_page_result['pdf_conversion'] = {
                    "method": "pdf2image",
                    "pages_converted": len(conversion_result.images),
                    "pages_requested": len(page_numbers) if page_numbers else (max_pages or conversion_result.page_count)
                }

                return first_page_result
            except Exception as e:
                logger.error(f"Error processing converted images: {e}")
                # Fall through to direct StructuredOCR processing below.
            finally:
                # Always remove the temporary page images.
                conversion_result.cleanup()

        # Conversion (or image processing) failed: let StructuredOCR handle
        # the PDF directly.
        logger.info(f"Using direct StructuredOCR processing for PDF")
        return self.processor.process_file(
            file_path=pdf_path,
            file_type="pdf",
            use_vision=use_vision,
            max_pages=max_pages,
            custom_pages=custom_pages,
            custom_prompt=custom_prompt
        )

    def save_json_output(self, pdf_path, output_path, use_vision=True, max_pages=None, custom_pages=None, custom_prompt=None):
        """
        Process a PDF file and save the structured output as JSON.

        Args:
            pdf_path: Path to the PDF file
            output_path: Path where to save the JSON output
            use_vision: Whether to use vision model for improved analysis
            max_pages: Maximum number of pages to process
            custom_pages: Specific page numbers to process (1-based indexing)
            custom_prompt: Custom instructions for processing

        Returns:
            Path to the saved JSON file
        """
        # Process the PDF first; any FileNotFoundError propagates to the caller.
        result = self.process_pdf(
            pdf_path,
            use_vision=use_vision,
            max_pages=max_pages,
            custom_pages=custom_pages,
            custom_prompt=custom_prompt
        )

        # Ensure the target directory exists before writing.
        output_path = Path(output_path)
        output_path.parent.mkdir(parents=True, exist_ok=True)

        # Explicit encoding keeps output byte-identical across platforms.
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(result, f, indent=2)

        return output_path
411
+
412
# For testing directly: command-line entry point for ad-hoc PDF OCR runs.
if __name__ == "__main__":
    import sys
    import argparse

    parser = argparse.ArgumentParser(description="Process PDF files with OCR.")
    parser.add_argument("pdf_path", help="Path to the PDF file to process")
    parser.add_argument("--output", "-o", help="Path to save the output JSON")
    parser.add_argument("--no-vision", dest="use_vision", action="store_false",
                        help="Disable vision model for processing")
    parser.add_argument("--max-pages", type=int, help="Maximum number of pages to process")
    parser.add_argument("--pages", help="Specific pages to process (comma-separated)")
    parser.add_argument("--prompt", help="Custom prompt for processing")

    args = parser.parse_args()

    processor = PDFOCR()

    # Parse --pages into a list of ints; exit with a clear message on
    # malformed input instead of a traceback.
    custom_pages = None
    if args.pages:
        try:
            custom_pages = [int(p.strip()) for p in args.pages.split(',')]
        except ValueError:  # narrowed from a bare except: only int() can fail here
            print(f"Error parsing pages: {args.pages}. Should be comma-separated list of numbers.")
            sys.exit(1)

    if args.output:
        # Write structured results to the requested JSON file.
        result_path = processor.save_json_output(
            args.pdf_path,
            args.output,
            use_vision=args.use_vision,
            max_pages=args.max_pages,
            custom_pages=custom_pages,
            custom_prompt=args.prompt
        )
        print(f"Results saved to: {result_path}")
    else:
        # No output file requested: print the JSON result to stdout.
        result = processor.process_pdf(
            args.pdf_path,
            use_vision=args.use_vision,
            max_pages=args.max_pages,
            custom_pages=custom_pages,
            custom_prompt=args.prompt
        )
        print(json.dumps(result, indent=2))