Spaces: milwright/historical-ocr

milwright committed on May 1

Commit 42dc069

1 Parent(s): c04ffe5

Consolidate segmentation improvements and code cleanup

- Added OCR preprocessing documentation
- Enhanced image segmentation algorithm
- Improved UI layout and styling
- Added new test and verification files
- Updated project requirements
- Reorganized utility modules
- Added .clinerules configuration for better project documentation
- Removed obsolete test files

Files changed (46)

.clinerules/activeContext.md +31 -0
.clinerules/complexFeature.md +29 -0
.clinerules/integrationSpecs.md +27 -0
.clinerules/principle-of-simplicity.md +55 -0
.clinerules/productContext.md +25 -0
.clinerules/progress.md +27 -0
.clinerules/techContext.md +40 -0
.gitignore +18 -0
CLAUDE.md +0 -32
app.py +3 -1
docs/preprocessing.md +179 -0
docs/preprocessing_triage.md +17 -0
image_segmentation.py +209 -44
ocr_processing.py +152 -80
requirements.txt +1 -1
structured_ocr.py +29 -58
test_fix.py +55 -0
test_magellan_language.py +0 -39
test_segmentation_fix.py +100 -0
testing/magician_app_investigation_plan.md +0 -58
testing/magician_app_result.json +0 -16
testing/magician_image_final_report.md +0 -58
testing/magician_image_findings.md +0 -84
testing/magician_ocr_text.txt +0 -9
testing/test_app_direct.py +0 -180
testing/test_filename_format.py +93 -0
testing/test_improvements.py +0 -244
testing/test_json_bleed.py +46 -0
testing/test_magician.py +0 -57
testing/test_magician_image.py +0 -130
testing/test_newspaper_detection.py +0 -146
testing/test_segmentation.py +0 -238
testing/test_simple_improvements.py +0 -175
testing/test_text_as_image.py +0 -200
ui/custom.css +9 -31
ui/layout.py +64 -30
ui_components.py +31 -35
utils.py +55 -18
utils/__init__.py +47 -0
utils/content_utils.py +3 -89
utils/general_utils.py +53 -18
utils/image_utils.py +648 -333
utils/text_utils.py +76 -13
utils/ui_utils.py +132 -200
verify_fix.py +70 -0
verify_segmentation_fix.py +116 -0

.clinerules/activeContext.md ADDED

	@@ -0,0 +1,31 @@

+# Current Work Focus
+Refining image preprocessing pipelines to better balance cleaning and preservation of fine details (especially for handwritten inputs)
+Improving document type detection accuracy to feed better prompts to the OCR system
+Enhancing structured output schemas to cover additional types like travel logs and scientific diagrams
+Recent Changes
+Implemented more document-type-specific preprocessing pipelines
+Switched default OCR engine to Mistral for both printed and handwritten material
+Modularized utility functions across utils.py, ocr_utils.py, and newly proposed submodules
+Active Decisions and Considerations
+Whether to expose preprocessing options to end users (e.g., deskew threshold)
+Whether to allow fallback to local Tesseract OCR for offline cases
+Determining best practices for handling multi-page PDFs with mixed layouts
+Important Patterns and Learnings
+Document type detection greatly improves OCR quality when tuned per-class
+Over-aggressive preprocessing can erase faint handwriting; thresholds must be conservative for historical artifacts
+Keeping preprocessing modular enables rapid experimentation and tuning.

.clinerules/complexFeature.md ADDED

	@@ -0,0 +1,29 @@

+# Complex Feature Documentation
+Document Type Detection
+    Utilizes lightweight statistical heuristics combined with visual features.
+    Preprocessing-driven (thresholding, aspect ratios, contour analysis).
+    Outputs labels such as "handwritten letter", "scientific report", "recipe".
+Preprocessing Pipelines
+    Customizable per document type.
+    Adaptive thresholding for delicate handwriting.
+    Morphological operations for removing bleed-through or artifacts.
+Multilingual Handling
+    Language detection on OCR snippets using language_detection.py.
+    Allows contextual OCR prompting based on dominant language.
+Structured Output Generation
+    Parsing OCR results into structured categories: titles, subtitles, body, marginalia, dates.
+    Supports output in raw text, JSON, and annotated Markdown.

.clinerules/integrationSpecs.md ADDED

	@@ -0,0 +1,27 @@

+# Integration Specifications
+External Services
+    Mistral OCR API: Primary service for document transcription and structured extraction.
+    Tesseract OCR (Local Fallback): Optional backup when API unavailable.
+Internal Module Communication
+    app.py triggers ocr_processing.py orchestration based on user input.
+    ocr_processing.py dynamically calls preprocessing and OCR modules based on document type.
+    Preprocessed images passed through structured_ocr.py for API interaction and postprocessing.
+Session State
+    Streamlit session stores:
+        Uploaded file metadata
+        Preprocessing parameters
+        Detected document type
+        OCR structured output

.clinerules/principle-of-simplicity.md ADDED

	@@ -0,0 +1,55 @@

+# Project Rule: Maintain Clean Separation Between Data and Presentation
+The Principle of Content Purity
+Core rule: Data that needs to be processed or stored should never contain presentation markup that's only meant for display.
+What This Means in Practice
+Avoid HTML in data structures
+✅ DO: Keep raw text as pure text content
+❌ DON'T: Embed HTML, CSS, or other presentation-specific markup in data fields
+Design clear boundaries between data and presentation layers
+✅ DO: Add presentation elements at the final rendering stage only
+❌ DON'T: Add and strip presentation elements repeatedly throughout the processing pipeline
+Fix problems at their source, not their symptoms
+✅ DO: Prevent markup injection at the origin rather than adding complex stripping logic later
+❌ DON'T: Create complex sanitization functions to clean data that shouldn't be contaminated in the first place
+The OCR Text Formatter Example
+Before (problematic):
+def format_ocr_text(text, for_display=True):
+    # Text processing...
+    if for_display:
+        html = f"""
+        <div class="ocr-text-container">
+            {formatted_text}
+        </div>
+        """
+        return html
+    else:
+        return formatted_text
+After (better):
+def format_ocr_text(text, for_display=False):
+    # Text processing...
+    if for_display:
+        html = f"""
+            {formatted_text}
+        """
+        return html
+    else:
+        return formatted_text
+What changed:
+Default parameter changed to avoid accidental HTML addition
+HTML wrapper div completely removed to eliminate the source of pollution
+The simplest solution (removing the container) was better than any complex stripping logic
+Benefits
+Cleaner data: Raw content remains genuinely raw and easier to work with
+More predictable processing: No need to account for unexpected HTML in processing pipelines
+Easier debugging: Problems are visible at their source rather than as mysterious artifacts later
+Reduced complexity: Eliminates the need for complex HTML stripping and sanitization logic
+Remember: Simplicity is not just an ideal—it's a practical strategy that prevents entire classes of bugs.
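One way to read the rule above is as two separate functions: one that returns plain text and one that adds markup only at render time. The sketch below is illustrative and is not the repository's code. `render_ocr_html` is a hypothetical name, and the `ocr-text-container` class is borrowed from the "before" example purely for illustration.

```python
import html


def format_ocr_text(text: str) -> str:
    """Data layer: return plain text only, with trailing whitespace trimmed."""
    return "\n".join(line.rstrip() for line in text.splitlines())


def render_ocr_html(text: str) -> str:
    """Presentation layer (hypothetical helper): escape, then wrap for display."""
    return f'<div class="ocr-text-container">{html.escape(text)}</div>'


# The stored value stays markup-free; HTML exists only in the rendered fragment.
clean = format_ocr_text("Dear Sir,  \nYours <truly> ")
page_fragment = render_ocr_html(clean)
```

Because `clean` never contains markup, downstream steps such as JSON export need no stripping logic, which is the point the rule makes.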

.clinerules/productContext.md ADDED

	@@ -0,0 +1,25 @@

+# Why the Project Exists
+Historians, archivists, and researchers often struggle to extract reliable text from scanned archival materials. Many OCR tools fail when dealing with handwritten letters, historical scientific documents, and poorly digitized photographs.
+Problems Being Solved
+Low OCR accuracy on handwritten or degraded historical documents
+Lack of structured metadata extraction for archival research
+Inability to easily apply context-specific AI prompting for nuanced historical material
+How the Product Should Work
+Users upload images or PDFs
+Preprocessing automatically improves OCR readiness
+Document type detection informs customized AI prompting
+Mistral OCR processes the document to output structured data (titles, authors, dates, body text, marginalia, etc.)
+Users can download raw text, structured JSON, or annotated markdown
+Example: "The OCR system must intelligently handle multilingual documents, support marginal notes and irregular layouts, and allow historians to guide the extraction process."

.clinerules/progress.md ADDED

	@@ -0,0 +1,27 @@

+# Current Status of Features
+    ✅ Upload and preprocess historical documents
+    ⚙️ Document type detection (estimated 80% accuracy)
+    ✅ OCR extraction with structured outputs (titles, body, marginalia)
+    ✅ Multiple output formats (Raw text, JSON, Markdown)
+Known Issues and Limitations
+    Inconsistent marginalia capture in low-quality scans
+    Difficulties with heavily degraded non-Latin handwritten scripts
+    Layout detection errors on highly irregular, mixed-content PDFs
+Evolution of Project Decisions
+    Migrated from Tesseract-only OCR to Mistral-first hybrid approach
+    Modularized preprocessing steps to allow flexible experimentation
+    Added support for marginalia and footnotes where feasible
+    Enhanced session state management to preserve intermediate results.

.clinerules/techContext.md ADDED

	@@ -0,0 +1,40 @@

+techContext.md
+Technologies and Frameworks Used
+    Frontend Framework: Streamlit 1.44.1
+    OCR Engine: Mistralai Python SDK (≥ 0.1.0)
+    Image Processing: OpenCV, Pillow
+    PDF Parsing: pdf2image
+    Fallback OCR: Pytesseract
+    Utilities: NumPy, Requests, pycountry
+Development Setup
+    Python 3.11+ virtual environment
+    Requirements managed through requirements.txt
+    .env file setup for API keys and environment configs
+    Type checking with mypy, linting with ruff
+Technical Constraints
+    API rate limits and payload size restrictions from Mistral
+    Streamlit's session state limitations for very large files
+    Processing timeouts for oversized or complex PDFs
+Dependencies and Tool Configurations
+    Mistralai pinned version (≥ 0.1.0)
+    OpenCV configured for headless environments
+    Pillow used for post-processing and visualization checks

.gitignore CHANGED

	@@ -0,0 +1,18 @@

+# Python bytecode
+__pycache__/
+*.py[cod]
+*.class
+# MacOS system files
+.DS_Store
+# Output and temporary files
+output/debug/
+output/comparison/
+output/segmentation_test/text_regions/
+output/preprocessing_test/
+logs/
+*.backup
+# Temporary documents
+Tmplf6xnkgr*

CLAUDE.md DELETED

@@ -1,32 +0,0 @@
-# CLAUDE.md
-This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
-## Commands
-- Run app: `streamlit run app.py`
-- Test OCR functionality: `python structured_ocr.py <file_path>`
-- Process single file with logging: `python process_file.py <file_path>`
-- Run specific test: `python testing/test_magician_image.py`
-- Run typechecking: `mypy .`
-- Lint code: `ruff check .` or `flake8`
-## Environment Setup
-- API key: Set `MISTRAL_API_KEY` in `.env` file or environment variable
-- Install dependencies: `pip install -r requirements.txt`
-- System requirements: Install `poppler-utils` and `tesseract-ocr` for PDF processing
-## Code Style Guidelines
-- **Imports**: Standard library first, third-party next, local modules last
-- **Types**: Use Pydantic models and type hints for all functions
-- **Error handling**: Use specific exceptions with informative messages
-- **Naming**: snake_case for variables/functions, PascalCase for classes
-- **Documentation**: Google-style docstrings for all functions/classes
-- **Preprocessing**: Support handwritten documents via document_type parameter
-- **Line length**: ≤100 characters
-## Base64 Encoding
-- Always include MIME type in data URLs: `data:image/jpeg;base64,...`
-- Use the appropriate MIME type for different file formats: jpeg, png, pdf, etc.
-- For encoded bytes, use `encode_bytes_for_api` with correct MIME type
-- For file paths, use `encode_image_for_api` which auto-detects MIME type
-- In utils.py, use `get_base64_from_bytes` for raw bytes or `get_base64_from_image` for files
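The data-URL convention described in the removed guidance can be sketched with the standard library alone. This is a minimal illustration, not the repository's `encode_bytes_for_api` or `encode_image_for_api`. The helper names `bytes_to_data_url` and `path_to_data_url` are hypothetical, and the fallback MIME type is an assumption.

```python
import base64
import mimetypes
from pathlib import Path


def bytes_to_data_url(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a data URL with an explicit MIME type."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def path_to_data_url(path, default_mime: str = "application/octet-stream") -> str:
    """Guess the MIME type from the file extension, then encode the file."""
    file_path = Path(path)
    mime_type, _ = mimetypes.guess_type(file_path.name)
    return bytes_to_data_url(file_path.read_bytes(), mime_type or default_mime)


example_url = bytes_to_data_url(b"hi", "image/png")
```

Keeping the MIME type as a required argument for raw bytes makes it harder to produce a data URL without one, which matches the first bullet of the guidance.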

app.py CHANGED

@@ -197,6 +197,7 @@ def show_example_documents():
         "https://huggingface.co/spaces/milwright/historical-ocr/resolve/main/input/handwritten-letter.jpg",
         "https://huggingface.co/spaces/milwright/historical-ocr/resolve/main/input/magellan-travels.jpg",
         "https://huggingface.co/spaces/milwright/historical-ocr/resolve/main/input/milgram-flier.png",
     ]
     sample_names = [
@@ -205,7 +206,8 @@ def show_example_documents():
         "The Magician (Image)",
         "Handwritten Letter (Image)",
         "Magellan Travels (Image)",
-        "Milgram Flier (Image)"
         ]
     # Initialize sample_document in session state if it doesn't exist

         "https://huggingface.co/spaces/milwright/historical-ocr/resolve/main/input/handwritten-letter.jpg",
         "https://huggingface.co/spaces/milwright/historical-ocr/resolve/main/input/magellan-travels.jpg",
         "https://huggingface.co/spaces/milwright/historical-ocr/resolve/main/input/milgram-flier.png",
+        "https://huggingface.co/spaces/milwright/historical-ocr/resolve/main/input/recipe.jpg",
     ]
     sample_names = [
         "The Magician (Image)",
         "Handwritten Letter (Image)",
         "Magellan Travels (Image)",
+        "Milgram Flier (Image)",
+        "Historical Recipe (Image)"
         ]
     # Initialize sample_document in session state if it doesn't exist

docs/preprocessing.md ADDED

	@@ -0,0 +1,179 @@

+# Image Preprocessing for Historical Document OCR
+This document outlines the enhanced preprocessing capabilities for improving OCR quality on historical documents, including deskewing, thresholding, and morphological operations.
+## Overview
+The preprocessing pipeline offers several options to enhance image quality before OCR processing:
+1. **Deskewing**: Automatically detects and corrects document skew using multiple detection algorithms
+2. **Thresholding**: Converts grayscale images to binary using adaptive or Otsu methods with pre-blur options
+3. **Morphological Operations**: Cleans up binary images by removing noise or filling in gaps
+4. **Document-Type Specific Settings**: Customized preprocessing configurations for different document types
+## Configuration
+Preprocessing options are set in `config.py` and are tunable per document type. All settings are accessible through environment variables for easy deployment configuration.
+### Deskewing
+```python
+"deskew": {
+    "enabled": True/False,              # Whether to apply deskewing
+    "angle_threshold": 0.1,             # Minimum angle (degrees) to trigger deskewing
+    "max_angle": 45.0,                  # Maximum correction angle
+    "use_hough": True/False,            # Use Hough transform in addition to minAreaRect
+    "consensus_method": "average",      # How to combine angle estimations
+    "fallback": {"enabled": True/False} # Fall back to original if deskewing fails
+}
+```
+Deskewing uses two methods:
+- **minAreaRect**: Finds contours in the binary image and calculates their orientation
+- **Hough Transform**: Detects lines in the image and their angles
+The `consensus_method` can be:
+- `"average"`: Average of all detected angles (most stable)
+- `"median"`: Median of all angles (robust to outliers)
+- `"min"`: Minimum absolute angle (most conservative)
+- `"max"`: Maximum absolute angle (most aggressive)
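The four `consensus_method` options listed above can be illustrated with a small sketch. It assumes the detected angles arrive as a list of signed degrees and that an empty list means no correction. `combine_angles` is a hypothetical name, and the commit's actual implementation may differ.

```python
from statistics import mean, median
from typing import Sequence


def combine_angles(angles: Sequence[float], method: str = "average") -> float:
    """Combine several skew estimates (in degrees) into one angle."""
    if not angles:
        return 0.0  # assumption: no estimates means no rotation
    if method == "average":
        return mean(angles)
    if method == "median":
        return median(angles)
    if method == "min":
        return min(angles, key=abs)  # smallest absolute angle, sign preserved
    if method == "max":
        return max(angles, key=abs)  # largest absolute angle, sign preserved
    raise ValueError(f"unknown consensus method: {method!r}")


estimates = [1.0, -2.0, 4.0]  # e.g. one minAreaRect and two Hough estimates
results = {m: combine_angles(estimates, m) for m in ("average", "median", "min", "max")}
```

Selecting by absolute value for `"min"` and `"max"` keeps the sign of the chosen angle, so the correction still rotates in the right direction.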
+### Thresholding
+```python
+"thresholding": {
+    "method": "adaptive",               # "none", "otsu", or "adaptive"
+    "adaptive_block_size": 11,          # Block size for adaptive thresholding (must be odd)
+    "adaptive_constant": 2,             # Constant subtracted from mean
+    "otsu_gaussian_blur": 1,            # Blur kernel size for Otsu pre-processing
+    "preblur": {
+        "enabled": True/False,          # Whether to apply pre-blur
+        "method": "gaussian",           # "gaussian" or "median"
+        "kernel_size": 3                # Blur kernel size (must be odd)
+    },
+    "fallback": {"enabled": True/False} # Fall back to grayscale if thresholding fails
+}
+```
+Thresholding methods:
+- **Otsu**: Automatically determines optimal global threshold (best for high-contrast documents)
+- **Adaptive**: Calculates thresholds for different regions (better for uneven lighting, historical documents)
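The "must be odd" constraint on `adaptive_block_size` comes from OpenCV, whose `cv2.adaptiveThreshold` requires an odd block size greater than 1. A configuration loader could normalize user-supplied values before they reach OpenCV. The sketch below is one possible approach, and `normalize_block_size` is a hypothetical helper that does not appear in this commit.

```python
def normalize_block_size(size: int) -> int:
    """Return a valid adaptive block size: odd and at least 3."""
    size = max(3, int(size))
    return size if size % 2 == 1 else size + 1


# Even or too-small values are bumped up; valid values pass through unchanged.
normalized = [normalize_block_size(n) for n in (1, 10, 11, 51)]
```

The result would then be passed as the `blockSize` argument of `cv2.adaptiveThreshold`, with `adaptive_constant` as `C`.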
+### Morphological Operations
+```python
+"morphology": {
+    "enabled": True/False,              # Whether to apply morphological operations
+    "operation": "close",               # "open", "close", "both"
+    "kernel_size": 1,                   # Size of the structuring element
+    "kernel_shape": "rect"              # "rect", "ellipse", "cross"
+}
+```
+Morphological operations:
+- **Open**: Erosion followed by dilation - removes small noise and disconnects thin connections
+- **Close**: Dilation followed by erosion - fills small holes and connects broken elements
+- **Both**: Applies opening followed by closing
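To make the open/close definitions above concrete, here is a dependency-free sketch on a tiny binary grid. It uses a square window in place of OpenCV's structuring element and ignores pixels outside the image. It illustrates the operations only and is not how the pipeline is implemented; all function names are hypothetical.

```python
from typing import Callable, List

Grid = List[List[int]]


def _window_filter(img: Grid, size: int, pick: Callable[[List[int]], int]) -> Grid:
    """Apply `pick` (min or max) over a size x size window clipped at the borders."""
    height, width = len(img), len(img[0])
    r = size // 2
    return [
        [
            pick([
                img[yy][xx]
                for yy in range(max(0, y - r), min(height, y + r + 1))
                for xx in range(max(0, x - r), min(width, x + r + 1))
            ])
            for x in range(width)
        ]
        for y in range(height)
    ]


def erode(img: Grid, size: int = 3) -> Grid:
    return _window_filter(img, size, min)


def dilate(img: Grid, size: int = 3) -> Grid:
    return _window_filter(img, size, max)


def opening(img: Grid, size: int = 3) -> Grid:
    return dilate(erode(img, size), size)  # erosion, then dilation


def closing(img: Grid, size: int = 3) -> Grid:
    return erode(dilate(img, size), size)  # dilation, then erosion


# A lone foreground pixel (noise) and a solid block with a one-pixel hole.
speck = [[1 if (y, x) == (2, 2) else 0 for x in range(5)] for y in range(5)]
holed = [[0 if (y, x) == (2, 2) else 1 for x in range(5)] for y in range(5)]
opened = opening(speck)
closed = closing(holed)
```

Opening removes the isolated pixel and closing fills the hole. With a window of size 1 both operations return the input unchanged, so if `kernel_size` maps directly to the structuring element's side length, a value of 1 would leave the image as it is.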
+### Document Type Configurations
+The system includes optimized settings for different document types:
+```python
+"document_types": {
+    "standard": {
+        # Default settings - will use the global settings
+    },
+    "newspaper": {
+        "deskew": {"enabled": True, "angle_threshold": 0.3, "max_angle": 10.0},
+        "thresholding": {
+            "method": "adaptive",
+            "adaptive_block_size": 15,
+            "adaptive_constant": 3,
+            "preblur": {"method": "gaussian", "kernel_size": 3}
+        },
+        "morphology": {"operation": "close", "kernel_size": 1}
+    },
+    "handwritten": {
+        "deskew": {"enabled": True, "angle_threshold": 0.5, "use_hough": False},
+        "thresholding": {
+            "method": "adaptive",
+            "adaptive_block_size": 31,
+            "adaptive_constant": 5,
+            "preblur": {"method": "median", "kernel_size": 3}
+        },
+        "morphology": {"operation": "open", "kernel_size": 1}
+    },
+    "book": {
+        "deskew": {"enabled": True},
+        "thresholding": {
+            "method": "otsu",
+            "preblur": {"method": "gaussian", "kernel_size": 5}
+        },
+        "morphology": {"operation": "both", "kernel_size": 1}
+    }
+}
+```
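The note that `standard` falls back to the global settings suggests that per-type entries are layered over global defaults. A minimal sketch of such layering follows, assuming a recursive dictionary merge. `GLOBAL_SETTINGS`, `deep_merge`, and `settings_for` are hypothetical names rather than identifiers from `config.py`, and the default values shown are placeholders.

```python
from copy import deepcopy

# Placeholder global defaults (assumed values, for illustration only).
GLOBAL_SETTINGS = {
    "deskew": {"enabled": True, "angle_threshold": 0.1, "max_angle": 45.0, "use_hough": True},
    "morphology": {"enabled": True, "operation": "close", "kernel_size": 1},
}

# Per-type overrides, copied from the documented configuration.
DOCUMENT_TYPES = {
    "standard": {},
    "handwritten": {
        "deskew": {"enabled": True, "angle_threshold": 0.5, "use_hough": False},
        "morphology": {"operation": "open", "kernel_size": 1},
    },
}


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of `base` with `override` applied recursively."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def settings_for(document_type: str) -> dict:
    """Resolve effective settings; unknown types fall back to the globals."""
    return deep_merge(GLOBAL_SETTINGS, DOCUMENT_TYPES.get(document_type, {}))


handwritten = settings_for("handwritten")
```

A recursive merge lets a type override a single nested key, such as `use_hough`, while inheriting the remaining keys of the same section, such as `max_angle`.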
+## Performance and Logging
+```python
+"performance": {
+    "parallel": {
+        "enabled": True/False,          # Whether to use parallel processing
+        "max_workers": 4                # Maximum number of worker threads
+    },
+    "timeout_ms": 10000                 # Timeout for preprocessing (in milliseconds)
+}
+"logging": {
+    "enabled": True/False,              # Whether to log preprocessing metrics
+    "metrics": ["skew_angle", "binary_nonzero_pct", "processing_time"],
+    "output_path": "logs/preprocessing_metrics.json"
+}
+```
+## Usage with OCR Processing
+When processing documents, simply specify the document type:
+```python
+preprocessing_options = {
+    "document_type": "newspaper",  # Use newspaper-optimized settings
+    "grayscale": True,             # Legacy option: apply grayscale conversion
+    "denoise": True,               # Legacy option: apply denoising
+    "contrast": 10,                # Legacy option: adjust contrast (0-100)
+    "rotation": 0                  # Legacy option: manual rotation (degrees)
+}
+# Apply preprocessing and OCR
+result = process_file(file_bytes, file_ext, preprocessing_options=preprocessing_options)
+```
+## Visual Examples
+### Original Document
+*[A historical newspaper or document image would be shown here]*
+### After Deskewing
+*[The same document, with skew corrected]*
+### After Thresholding
+*[The document converted to binary with clear text]*
+### After Morphological Operations
+*[The binary image with small noise removed and/or gaps filled]*
+## Troubleshooting
+### Poor Deskewing Results
+- **Symptom**: Document skew is not correctly detected or corrected
+- **Solution**: Try adjusting `angle_threshold` or `max_angle`, or disable Hough transform for handwritten documents
+### Thresholding Issues
+- **Symptom**: Text is lost or background noise is excessive after thresholding
+- **Solution**: Try changing the thresholding method or adjusting `adaptive_block_size` and `adaptive_constant`
+### Performance Concerns
+- **Symptom**: Processing is too slow for large documents
+- **Solution**: Enable parallel processing, reduce image size, or disable some preprocessing steps for faster results

docs/preprocessing_triage.md ADDED

	@@ -0,0 +1,17 @@

+# OCR Preprocessing Triage
+## Quick Fixes Implemented
+1. **Handwritten** - Disabled thresholding, uses grayscale only
+2. **Newspapers** - Increased block size (51) and constant (10) for softer thresholding
+3. **JPEG Artifacts** - Auto-detection and specialized denoising
+4. **Border Issues** - Crops edges after deskew to avoid threshold problems
+5. **Low Resolution** - Upscales small text for better recognition
+## Testing
+```
+python testing/test_triage_fix.py
+```
+Check `output/comparison/` for results.

image_segmentation.py CHANGED

@@ -18,7 +18,7 @@ logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
 logger = logging.getLogger(__name__)
-def segment_image_for_ocr(image_path: Union[str, Path], vision_enabled: bool = True) -> Dict[str, Union[Image.Image, str]]:
     """
     Segment an image into text and image regions for improved OCR processing.
@@ -76,9 +76,17 @@ def segment_image_for_ocr(image_path: Union[str, Path], vision_enabled: bool = T
                                           cv2.THRESH_BINARY_INV, 11, 2)
             # Step 2: Perform morphological operations to connect text components
-            # Create a rectangular kernel that's wider than tall (for text lines)
-            rect_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
-            dilation = cv2.dilate(binary, rect_kernel, iterations=3)
             # Step 3: Find contours which will correspond to text blocks
             contours, _ = cv2.findContours(dilation, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
@@ -87,8 +95,8 @@ def segment_image_for_ocr(image_path: Union[str, Path], vision_enabled: bool = T
             text_mask = np.zeros_like(gray)
             # Step 4: Filter contours based on size to identify text regions
-            min_area = 100  # Minimum contour area to be considered text
-            max_area = img.shape[0] * img.shape[1] * 0.5  # Max 50% of image
             text_regions = []
             for contour in contours:
@@ -105,10 +113,33 @@ def segment_image_for_ocr(image_path: Union[str, Path], vision_enabled: bool = T
                     roi = binary[y:y+h, x:x+w]
                     dark_pixel_density = np.sum(roi > 0) / (w * h)
-                    # Additional check for text-like characteristics
-                    # Text typically has aspect ratio > 1 (wider than tall) and reasonable density
-                    # Relaxed aspect ratio constraints and lowered density threshold for better detection
-                    if (aspect_ratio > 1.2 or aspect_ratio < 0.7) and dark_pixel_density > 0.15:
                         # Add to text regions list
                         text_regions.append((x, y, w, h))
                         # Add to text mask
@@ -119,44 +150,170 @@ def segment_image_for_ocr(image_path: Union[str, Path], vision_enabled: bool = T
             for x, y, w, h in text_regions:
                 cv2.rectangle(text_regions_vis, (x, y), (x+w, y+h), (0, 255, 0), 2)
-            # Create image regions mask (inverse of text mask)
-            image_mask = cv2.bitwise_not(text_mask)
-            # Create image regions visualization
-            image_regions_vis = img_rgb.copy()
-            # Add detected image regions in red
-            for contour in contours:
                 area = cv2.contourArea(contour)
-                if area > max_area * 0.1:  # Only highlight larger image regions
                     x, y, w, h = cv2.boundingRect(contour)
-                    if np.sum(text_mask[y:y+h, x:x+w]) / (w * h) < 128:  # Not significantly overlapping with text
-                        cv2.rectangle(image_regions_vis, (x, y), (x+w, y+h), (0, 0, 255), 2)
-            # Step 6: Create a combined result that enhances text regions
-            # Different processing for text vs. image regions
             combined_result = img_rgb.copy()
-            # Apply more aggressive contrast enhancement to text regions
-            text_enhanced = cv2.bitwise_and(img_rgb, img_rgb, mask=text_mask)
-            # Convert to LAB for better contrast enhancement
-            text_lab = cv2.cvtColor(text_enhanced, cv2.COLOR_BGR2LAB)
-            l, a, b = cv2.split(text_lab)
-            # Apply CLAHE to L channel
-            clahe = cv2.createCLAHE(clipLimit=3.0, tileGridSize=(8, 8))
-            cl = clahe.apply(l)
-            # Merge back
-            enhanced_lab = cv2.merge((cl, a, b))
-            text_enhanced = cv2.cvtColor(enhanced_lab, cv2.COLOR_LAB2BGR)
-            # Apply gentler processing to image regions
-            image_enhanced = cv2.bitwise_and(img_rgb, img_rgb, mask=image_mask)
-            # Just slight sharpening for image regions
-            image_enhanced = cv2.GaussianBlur(image_enhanced, (0, 0), 3)
-            image_enhanced = cv2.addWeighted(img_rgb, 1.5, image_enhanced, -0.5, 0)
-            image_enhanced = cv2.bitwise_and(image_enhanced, image_enhanced, mask=image_mask)
-            # Combine the enhanced regions
-            combined_result = cv2.add(text_enhanced, image_enhanced)
             # Convert visualization results back to PIL Images
             text_regions_pil = Image.fromarray(cv2.cvtColor(text_regions_vis, cv2.COLOR_BGR2RGB))
@@ -167,13 +324,21 @@ def segment_image_for_ocr(image_path: Union[str, Path], vision_enabled: bool = T
             _, buffer = cv2.imencode('.png', text_mask)
             text_mask_base64 = base64.b64encode(buffer).decode('utf-8')
             # Return the segmentation results
             return {
                 'text_regions': text_regions_pil,
                 'image_regions': image_regions_pil,
                 'text_mask_base64': f"data:image/png;base64,{text_mask_base64}",
                 'combined_result': combined_result_pil,
-                'text_regions_coordinates': text_regions
             }
     except Exception as e:
@@ -187,7 +352,7 @@ def segment_image_for_ocr(image_path: Union[str, Path], vision_enabled: bool = T
             'text_regions_coordinates': []
         }
-def process_segmented_image(image_path: Union[str, Path], output_dir: Optional[Path] = None) -> Dict:
     """
     Process an image using segmentation for improved OCR, saving visualization outputs.

                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
 logger = logging.getLogger(__name__)
+def segment_image_for_ocr(image_path: Union[str, Path], vision_enabled: bool = True, preserve_content: bool = True) -> Dict[str, Union[Image.Image, str]]:
     """
     Segment an image into text and image regions for improved OCR processing.
                                           cv2.THRESH_BINARY_INV, 11, 2)
             # Step 2: Perform morphological operations to connect text components
+            # Use a combination of horizontal and vertical kernels for better text detection
+            # in historical documents with mixed content
+            horiz_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 1))
+            vert_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 3))
+            # Apply horizontal dilation to connect characters in a line
+            horiz_dilation = cv2.dilate(binary, horiz_kernel, iterations=1)
+            # Apply vertical dilation to connect lines in a paragraph
+            vert_dilation = cv2.dilate(binary, vert_kernel, iterations=1)
+            # Combine both dilations for better region detection
+            dilation = cv2.bitwise_or(horiz_dilation, vert_dilation)
             # Step 3: Find contours which will correspond to text blocks
             contours, _ = cv2.findContours(dilation, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
             text_mask = np.zeros_like(gray)
             # Step 4: Filter contours based on size to identify text regions
+            min_area = 50  # Lower minimum area to catch smaller text blocks in historical documents
+            max_area = img.shape[0] * img.shape[1] * 0.4  # Reduced max to avoid capturing too much
             text_regions = []
             for contour in contours:
                     roi = binary[y:y+h, x:x+w]
                     dark_pixel_density = np.sum(roi > 0) / (w * h)
+                    # Special handling for historical documents
+                    # Check for position - text is often at the bottom in historical prints
+                    y_position_ratio = y / img.shape[0]  # Normalized y position (0 at top, 1 at bottom)
+                    # Bottom regions get preferential treatment as text
+                    is_bottom_region = y_position_ratio > 0.7
+                    # Check if part of a text block cluster (horizontal proximity)
+                    is_text_cluster = False
+                    # Check already identified text regions for proximity
+                    for tx, ty, tw, th in text_regions:
+                        # Check if horizontally aligned and close
+                        if abs((ty + th/2) - (y + h/2)) < max(th, h) and \
+                           abs((tx + tw) - x) < 20:  # Near each other horizontally
+                            is_text_cluster = True
+                            break
+                    # More inclusive classification for historical documents
+                    # 1. Typical text characteristics OR
+                    # 2. Bottom position (likely text in historical prints) OR
+                    # 3. Part of a text cluster
+                    is_text_region = ((aspect_ratio > 1.05 or aspect_ratio < 0.9) and dark_pixel_density > 0.1) or \
+                                    (is_bottom_region and dark_pixel_density > 0.08) or \
+                                    is_text_cluster
+                    if is_text_region:
                         # Add to text regions list
                         text_regions.append((x, y, w, h))
                         # Add to text mask
             for x, y, w, h in text_regions:
                 cv2.rectangle(text_regions_vis, (x, y), (x+w, y+h), (0, 255, 0), 2)
+            # ENHANCED APPROACH FOR HISTORICAL DOCUMENTS:
+            # We'll identify different regions including titles at the top of the document
+            # First, look for potential title text at the top of the document
+            image_height = img.shape[0]
+            image_width = img.shape[1]
+            # Examine the top 20% of the image for potential title text
+            title_section_height = int(image_height * 0.2)
+            title_mask = np.zeros_like(gray)
+            title_mask[:title_section_height, :] = 255
+            # Find potential title blocks in the top section
+            title_contours, _ = cv2.findContours(
+                cv2.bitwise_and(dilation, title_mask),
+                cv2.RETR_EXTERNAL,
+                cv2.CHAIN_APPROX_SIMPLE
+            )
+            # Extract title regions with more permissive criteria
+            title_regions = []
+            for contour in title_contours:
                 area = cv2.contourArea(contour)
+                # Use more permissive criteria for title regions
+                if area > min_area * 0.8:  # Smaller minimum area for titles
                     x, y, w, h = cv2.boundingRect(contour)
+                    # Title regions typically have wider aspect ratio
+                    aspect_ratio = w / h
+                    # More permissive density check for titles that might be stylized
+                    roi = binary[y:y+h, x:x+w]
+                    dark_pixel_density = np.sum(roi > 0) / (w * h)
+                    # Check if this might be a title
+                    # Titles tend to be wider, in the center, and at the top
+                    is_wide = aspect_ratio > 2.0
+                    is_centered = abs((x + w/2) - (image_width/2)) < (image_width * 0.3)
+                    is_at_top = y < title_section_height
+                    # If it looks like a title or has good text characteristics
+                    if (is_wide and is_centered and is_at_top) or \
+                       (is_at_top and dark_pixel_density > 0.1):
+                        title_regions.append((x, y, w, h))
+            # Now handle the main content with our standard approach
+            # Use fixed regions for the main content - typically below the title
+            # For primary content, assume most text is in the bottom 30% of the page
+            text_section_start = int(image_height * 0.7)  # Start main text section at 70% down
+            # Create text mask combining the title regions and main text area
+            text_mask = np.zeros_like(gray)
+            text_mask[text_section_start:, :] = 255
+            # Add title regions to the text mask
+            for x, y, w, h in title_regions:
+                # Add some padding around title regions
+                pad_x = max(5, int(w * 0.05))
+                pad_y = max(5, int(h * 0.05))
+                x_start = max(0, x - pad_x)
+                y_start = max(0, y - pad_y)
+                x_end = min(image_width, x + w + pad_x)
+                y_end = min(image_height, y + h + pad_y)
+                # Add title region to the text mask
+                text_mask[y_start:y_end, x_start:x_end] = 255
+            # Image mask is the inverse of text mask - for visualization only
+            image_mask = np.zeros_like(gray)
+            image_mask[text_mask == 0] = 255
+            # For main text regions, find blocks of text in the bottom part
+            # Create a temporary mask for the main text section
+            temp_mask = np.zeros_like(gray)
+            temp_mask[text_section_start:, :] = 255
+            # Find text regions for visualization purposes
+            text_regions = []
+            # Start with any title regions we found
+            text_regions.extend(title_regions)
+            # Then find text regions in the main content area
+            text_region_contours, _ = cv2.findContours(
+                cv2.bitwise_and(dilation, temp_mask),
+                cv2.RETR_EXTERNAL,
+                cv2.CHAIN_APPROX_SIMPLE
+            )
+            # Add each detected region
+            for contour in text_region_contours:
+                x, y, w, h = cv2.boundingRect(contour)
+                if w > 10 and h > 5:  # Minimum size to be considered text
+                    text_regions.append((x, y, w, h))
+            # Add the entire bottom section as a fallback text region if none detected
+            if len(text_regions) == 0:
+                x, y = 0, text_section_start
+                w, h = img.shape[1], img.shape[0] - text_section_start
+                text_regions.append((x, y, w, h))
+            # Create image regions visualization
+            image_regions_vis = img_rgb.copy()
+            # Top section is image
+            cv2.rectangle(image_regions_vis, (0, 0), (img.shape[1], text_section_start), (0, 0, 255), 2)
+            # Bottom section has text - draw green boxes around detected text regions
+            text_regions_vis = img_rgb.copy()
+            for x, y, w, h in text_regions:
+                cv2.rectangle(text_regions_vis, (x, y), (x+w, y+h), (0, 255, 0), 2)
+            # For OCR: CRITICAL - Don't modify the image content
+            # Only create a non-destructive enhanced version
+            # For text detection visualization:
+            text_regions_vis = img_rgb.copy()
+            for x, y, w, h in text_regions:
+                cv2.rectangle(text_regions_vis, (x, y), (x+w, y+h), (0, 255, 0), 2)
+            # For image region visualization:
+            image_regions_vis = img_rgb.copy()
+            cv2.rectangle(image_regions_vis, (0, 0), (img.shape[1], text_section_start), (0, 0, 255), 2)
+            # Create a minimally enhanced version of the original image
+            # that preserves ALL content (both text and image)
             combined_result = img_rgb.copy()
+            # Apply gentle contrast enhancement only when content preservation is disabled
+            if not preserve_content:
+                # Use a subtle CLAHE enhancement to improve OCR without losing content
+                lab_img = cv2.cvtColor(img_rgb, cv2.COLOR_BGR2LAB)
+                l, a, b = cv2.split(lab_img)
+                # Very mild CLAHE settings to preserve text
+                clahe = cv2.createCLAHE(clipLimit=1.5, tileGridSize=(8, 8))
+                cl = clahe.apply(l)
+                # Merge channels back
+                enhanced_lab = cv2.merge((cl, a, b))
+                combined_result = cv2.cvtColor(enhanced_lab, cv2.COLOR_LAB2BGR)
+            # Extract individual region images for separate OCR processing
+            region_images = []
+            if text_regions:
+                for idx, (x, y, w, h) in enumerate(text_regions):
+                    # Add padding around region (10% of width/height)
+                    pad_x = max(5, int(w * 0.1))
+                    pad_y = max(5, int(h * 0.1))
+                    # Ensure coordinates stay within image bounds
+                    x_start = max(0, x - pad_x)
+                    y_start = max(0, y - pad_y)
+                    x_end = min(img_rgb.shape[1], x + w + pad_x)
+                    y_end = min(img_rgb.shape[0], y + h + pad_y)
+                    # Extract region with padding
+                    region = img_rgb[y_start:y_end, x_start:x_end].copy()
+                    # Store region with its coordinates
+                    region_info = {
+                        'image': region,
+                        'coordinates': (x, y, w, h),
+                        'padded_coordinates': (x_start, y_start, x_end - x_start, y_end - y_start),
+                        'order': idx
+                    }
+                    region_images.append(region_info)
             # Convert visualization results back to PIL Images
             text_regions_pil = Image.fromarray(cv2.cvtColor(text_regions_vis, cv2.COLOR_BGR2RGB))
             _, buffer = cv2.imencode('.png', text_mask)
             text_mask_base64 = base64.b64encode(buffer).decode('utf-8')
+            # Convert region images to PIL format
+            region_pil_images = []
+            for region_info in region_images:
+                region_pil = Image.fromarray(cv2.cvtColor(region_info['image'], cv2.COLOR_BGR2RGB))
+                region_info['pil_image'] = region_pil
+                region_pil_images.append(region_info)
             # Return the segmentation results
             return {
                 'text_regions': text_regions_pil,
                 'image_regions': image_regions_pil,
                 'text_mask_base64': f"data:image/png;base64,{text_mask_base64}",
                 'combined_result': combined_result_pil,
+                'text_regions_coordinates': text_regions,
+                'region_images': region_pil_images
             }
     except Exception as e:
             'text_regions_coordinates': []
         }
+def process_segmented_image(image_path: Union[str, Path], output_dir: Optional[Path] = None, preserve_content: bool = True) -> Dict:
     """
     Process an image using segmentation for improved OCR, saving visualization outputs.

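The title test and the padded-crop arithmetic in the image_segmentation.py hunk above can be tried out without OpenCV. The following is a minimal sketch, not code from the commit: `looks_like_title` and `padded_box` are hypothetical helper names, and the thresholds (top 20% of the page, aspect ratio above 2.0, density above 0.1, 10% padding with a 5-pixel floor) are copied from the diff. It assumes boxes have a height of at least 1, as `cv2.boundingRect` produces.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h), the layout cv2.boundingRect returns


def looks_like_title(box: Box, image_width: int, image_height: int,
                     dark_pixel_density: float) -> bool:
    """Hypothetical helper mirroring the title test in the diff."""
    x, y, w, h = box
    title_section_height = int(image_height * 0.2)  # top 20% of the page
    is_wide = (w / h) > 2.0
    is_centered = abs((x + w / 2) - (image_width / 2)) < image_width * 0.3
    is_at_top = y < title_section_height
    return (is_wide and is_centered and is_at_top) or \
           (is_at_top and dark_pixel_density > 0.1)


def padded_box(box: Box, image_width: int, image_height: int,
               frac: float = 0.1, min_pad: int = 5) -> Box:
    """Hypothetical helper: grow a box by `frac` per side, clamped to the image."""
    x, y, w, h = box
    pad_x = max(min_pad, int(w * frac))
    pad_y = max(min_pad, int(h * frac))
    x_start = max(0, x - pad_x)
    y_start = max(0, y - pad_y)
    x_end = min(image_width, x + w + pad_x)
    y_end = min(image_height, y + h + pad_y)
    # Same (x, y, w, h) layout as 'padded_coordinates' in the diff
    return (x_start, y_start, x_end - x_start, y_end - y_start)
```

Because the second clause of the title test only requires a region to sit in the top band with enough dark pixels, small dense blobs near the top edge also pass; that is the "more permissive" behaviour the diff comments describe.
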
ocr_processing.py CHANGED

@@ -147,31 +147,15 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
             # Process with cached function if possible
             try:
-                # Check if preprocessing options indicate a handwritten document
-                handwritten_document = preprocessing_options.get("document_type") == "handwritten"
                 modified_custom_prompt = custom_prompt
-                # Add handwritten specific instructions if needed
-                # Note: Document type influences OCR quality through prompting, even when no preprocessing is applied
-                if handwritten_document and modified_custom_prompt:
-                    if "handwritten" not in modified_custom_prompt.lower():
-                        modified_custom_prompt += " This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
-                elif handwritten_document and not modified_custom_prompt:
-                    modified_custom_prompt = "This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
-                # Add PDF-specific instructions if needed
-                if modified_custom_prompt and "pdf" not in modified_custom_prompt.lower() and "multi-page" not in modified_custom_prompt.lower():
-                    modified_custom_prompt += " This is a multi-page PDF document."
-                elif not modified_custom_prompt:
                     modified_custom_prompt = "This is a multi-page PDF document."
-                # For certain filenames, explicitly add document type hints
-                filename_lower = uploaded_file.name.lower()
-                if "handwritten" in filename_lower or "letter" in filename_lower or "journal" in filename_lower:
-                    if not modified_custom_prompt:
-                        modified_custom_prompt = "This is a handwritten document in PDF format. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
-                    elif "handwritten" not in modified_custom_prompt.lower():
-                        modified_custom_prompt += " This is a handwritten document. Please carefully transcribe all handwritten text."
                 # Update the cache key with the modified prompt
                 if modified_custom_prompt != custom_prompt:
@@ -194,19 +178,24 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
                 processor = StructuredOCR()
-                # Check if preprocessing options indicate a handwritten document
-                handwritten_document = preprocessing_options.get("document_type") == "handwritten"
                 modified_custom_prompt = custom_prompt
-                # Add handwritten specific instructions if needed
-                if handwritten_document and modified_custom_prompt:
-                    if "handwritten" not in modified_custom_prompt.lower():
-                        modified_custom_prompt += " This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
-                elif handwritten_document and not modified_custom_prompt:
                     modified_custom_prompt = "This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
                 # Add PDF-specific instructions if needed
-                if custom_prompt and "pdf" not in modified_custom_prompt.lower() and "multi-page" not in modified_custom_prompt.lower():
                     modified_custom_prompt += " This is a multi-page PDF document."
                 # Process directly with optimized settings
@@ -241,8 +230,13 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
                 progress_reporter.update(35, "Applying image segmentation to separate text and image regions...")
                 try:
-                    # Perform image segmentation
-                    segmentation_results = segment_image_for_ocr(temp_path, vision_enabled=use_vision)
                     if segmentation_results['combined_result'] is not None:
                         # Save the segmented result to a new temporary file
@@ -250,21 +244,99 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
                         segmentation_results['combined_result'].save(segmented_temp_path)
                         temp_file_paths.append(segmented_temp_path)
-                        # Use the segmented image instead of the original
-                        temp_path = segmented_temp_path
-                        # Enhanced prompt based on segmentation results
-                        if custom_prompt:
-                            # Add segmentation info to existing prompt
-                            regions_count = len(segmentation_results.get('text_regions_coordinates', []))
-                            custom_prompt += f" The document has been segmented and contains approximately {regions_count} text regions that should be carefully extracted. Please focus on extracting all text from these regions."
                         else:
-                            # Create new prompt focused on text extraction from segmented regions
                             regions_count = len(segmentation_results.get('text_regions_coordinates', []))
-                            custom_prompt = f"This document has been preprocessed to highlight {regions_count} text regions. Please carefully extract all text from these highlighted regions, preserving the reading order and structure."
-                        logger.info(f"Image segmentation applied. Found {regions_count} text regions.")
-                        progress_reporter.update(40, f"Identified {regions_count} text regions for extraction...")
                     else:
                         logger.warning("Image segmentation produced no result, using original image.")
                 except Exception as seg_error:
@@ -283,24 +355,21 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
             # Process the file using cached function if possible
             progress_reporter.update(50, "Processing document with OCR...")
             try:
-                # Check if preprocessing options indicate a handwritten document
-                handwritten_document = preprocessing_options.get("document_type") == "handwritten"
                 modified_custom_prompt = custom_prompt
-                # Add handwritten specific instructions if needed
-                if handwritten_document and modified_custom_prompt:
-                    if "handwritten" not in modified_custom_prompt.lower():
-                        modified_custom_prompt += " This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
-                elif handwritten_document and not modified_custom_prompt:
                     modified_custom_prompt = "This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
-                # For certain filenames, explicitly add document type hints
-                filename_lower = uploaded_file.name.lower()
-                if "handwritten" in filename_lower or "letter" in filename_lower or "journal" in filename_lower:
-                    if not modified_custom_prompt:
-                        modified_custom_prompt = "This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
-                    elif "handwritten" not in modified_custom_prompt.lower():
-                        modified_custom_prompt += " This is a handwritten document. Please carefully transcribe all handwritten text."
                 # Update the cache key with the modified prompt
                 if modified_custom_prompt != custom_prompt:
@@ -328,24 +397,21 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
                     # Use simpler processing for speed
                     pass  # Any speed optimizations would be handled by the StructuredOCR class
-                # Check if preprocessing options indicate a handwritten document
-                handwritten_document = preprocessing_options.get("document_type") == "handwritten"
                 modified_custom_prompt = custom_prompt
-                # Add handwritten specific instructions if needed
-                if handwritten_document and modified_custom_prompt:
-                    if "handwritten" not in modified_custom_prompt.lower():
-                        modified_custom_prompt += " This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
-                elif handwritten_document and not modified_custom_prompt:
                     modified_custom_prompt = "This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
-                # For certain filenames, explicitly add document type hints
-                filename_lower = uploaded_file.name.lower()
-                if "handwritten" in filename_lower or "letter" in filename_lower or "journal" in filename_lower:
-                    if not modified_custom_prompt:
-                        modified_custom_prompt = "This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
-                    elif "handwritten" not in modified_custom_prompt.lower():
-                        modified_custom_prompt += " This is a handwritten document. Please carefully transcribe all handwritten text."
                 result = processor.process_file(
                     file_path=temp_path,
@@ -360,11 +426,16 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
         # Add additional metadata to result
         result = process_result(result, uploaded_file, preprocessing_options)
         # 🔧 ALWAYS normalize result before returning
         result = clean_ocr_result(
             result,
             use_segmentation=use_segmentation,
-            vision_enabled=use_vision
         )
         # Complete progress
@@ -424,13 +495,14 @@ def process_result(result, uploaded_file, preprocessing_options=None):
         preprocessing_options
     )
-    # Extract raw text from OCR contents
     raw_text = ""
     if 'ocr_contents' in result:
-        if 'raw_text' in result['ocr_contents']:
-            raw_text = result['ocr_contents']['raw_text']
-        elif 'content' in result['ocr_contents']:
-            raw_text = result['ocr_contents']['content']
     # Extract subject tags if not already present or enhance existing ones
     if 'topics' not in result or not result['topics']:

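The new side of this diff repeats the same document-type prompt chain in several hunks and joins the per-region OCR text in reading order. As a rough sketch of that logic gathered in one place (not code from the commit; `build_prompt` and `combine_region_text` are hypothetical names, with the prompt strings copied from the diff):

```python
from typing import Dict, List, Optional

HANDWRITTEN = ("This is a handwritten document. Please carefully transcribe all "
               "handwritten text, preserving line breaks and original formatting.")
NEWSPAPER_NEW = ("This is a newspaper or document with columns. Please extract all "
                 "text content from each column, maintaining proper reading order.")
NEWSPAPER_APPEND = ("This appears to be a newspaper or document with columns. "
                    "Please extract all text content from each column.")
BOOK = ("This is a book page. Extract titles, headers, footnotes, and body text, "
        "preserving paragraph structure and formatting.")


def build_prompt(doc_type: str, custom_prompt: Optional[str]) -> Optional[str]:
    """Hypothetical helper with the same branching as the elif chain in the diff."""
    prompt = custom_prompt
    lowered = (prompt or "").lower()  # safe even when no prompt was supplied
    if doc_type == "handwritten":
        if not prompt:
            return HANDWRITTEN
        if "handwritten" not in lowered:
            return prompt + " " + HANDWRITTEN
    elif doc_type == "newspaper":
        if not prompt:
            return NEWSPAPER_NEW
        if "column" not in lowered and "newspaper" not in lowered:
            return prompt + " " + NEWSPAPER_APPEND
    elif doc_type == "book" and not prompt:
        return BOOK
    return prompt  # unchanged, possibly None, for "standard" and other types


def combine_region_text(region_results: List[Dict]) -> str:
    """Hypothetical helper: join non-empty region texts by their 'order' key."""
    ordered = sorted(region_results, key=lambda r: r["order"])
    return "\n\n".join(r["text"] for r in ordered if r["text"].strip())
```

Unlike the in-place `region_results.sort(...)` in the diff, this sketch uses `sorted` and leaves the input list untouched; the joined text is the same either way.
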
             # Process with cached function if possible
             try:
+                # Use the document type information from preprocessing options
+                doc_type = preprocessing_options.get("document_type", "standard")
                 modified_custom_prompt = custom_prompt
+                # Add PDF-specific instructions
+                if not modified_custom_prompt:
                     modified_custom_prompt = "This is a multi-page PDF document."
+                elif "pdf" not in modified_custom_prompt.lower() and "multi-page" not in modified_custom_prompt.lower():
+                    modified_custom_prompt += " This is a multi-page PDF document."
                 # Update the cache key with the modified prompt
                 if modified_custom_prompt != custom_prompt:
                 processor = StructuredOCR()
+                # Use the document type from preprocessing options
+                doc_type = preprocessing_options.get("document_type", "standard")
                 modified_custom_prompt = custom_prompt
+                # Add document-type specific instructions based on preprocessing options
+                if doc_type == "handwritten" and not modified_custom_prompt:
                     modified_custom_prompt = "This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
+                elif doc_type == "handwritten" and "handwritten" not in modified_custom_prompt.lower():
+                    modified_custom_prompt += " This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
+                elif doc_type == "newspaper" and not modified_custom_prompt:
+                    modified_custom_prompt = "This is a newspaper or document with columns. Please extract all text content from each column, maintaining proper reading order."
+                elif doc_type == "newspaper" and "column" not in modified_custom_prompt.lower() and "newspaper" not in modified_custom_prompt.lower():
+                    modified_custom_prompt += " This appears to be a newspaper or document with columns. Please extract all text content from each column."
+                elif doc_type == "book" and not modified_custom_prompt:
+                    modified_custom_prompt = "This is a book page. Extract titles, headers, footnotes, and body text, preserving paragraph structure and formatting."
                 # Add PDF-specific instructions if needed
+                if "pdf" not in modified_custom_prompt.lower() and "multi-page" not in modified_custom_prompt.lower():
                     modified_custom_prompt += " This is a multi-page PDF document."
                 # Process directly with optimized settings
                 progress_reporter.update(35, "Applying image segmentation to separate text and image regions...")
                 try:
+                    # Perform image segmentation with content preservation if requested
+                    preserve_content = preprocessing_options.get("preserve_content", True)
+                    segmentation_results = segment_image_for_ocr(
+                        temp_path,
+                        vision_enabled=use_vision,
+                        preserve_content=preserve_content
+                    )
                     if segmentation_results['combined_result'] is not None:
                         # Save the segmented result to a new temporary file
                         segmentation_results['combined_result'].save(segmented_temp_path)
                         temp_file_paths.append(segmented_temp_path)
+                        # Check if we have individual region images to process separately
+                        if 'region_images' in segmentation_results and segmentation_results['region_images']:
+                            # Process each region separately for better results
+                            regions_count = len(segmentation_results['region_images'])
+                            logger.info(f"Processing {regions_count} text regions individually")
+                            progress_reporter.update(40, f"Processing {regions_count} text regions separately...")
+                            # Initialize StructuredOCR processor
+                            processor = StructuredOCR()
+                            # Store individual region results
+                            region_results = []
+                            # Process each region individually
+                            for idx, region_info in enumerate(segmentation_results['region_images']):
+                                # Save region image to temp file
+                                region_temp_path = tempfile.NamedTemporaryFile(delete=False, suffix='.jpg').name
+                                region_info['pil_image'].save(region_temp_path)
+                                temp_file_paths.append(region_temp_path)
+                                # Create region-specific prompt
+                                region_prompt = f"This is region {idx+1} of {regions_count} from a segmented document. Extract all visible text precisely, preserving line breaks and structure."
+                                # Process the region
+                                try:
+                                    region_result = processor.process_file(
+                                        file_path=region_temp_path,
+                                        file_type="image",
+                                        use_vision=use_vision,
+                                        custom_prompt=region_prompt,
+                                        file_size_mb=None
+                                    )
+                                    # Store result with region info
+                                    if 'ocr_contents' in region_result and 'raw_text' in region_result['ocr_contents']:
+                                        region_results.append({
+                                            'text': region_result['ocr_contents']['raw_text'],
+                                            'coordinates': region_info['coordinates'],
+                                            'order': region_info['order']
+                                        })
+                                except Exception as region_err:
+                                    logger.warning(f"Error processing region {idx+1}: {str(region_err)}")
+                            # Sort regions by their order for correct reading flow
+                            region_results.sort(key=lambda x: x['order'])
+                            # Combine all region texts
+                            combined_text = "\n\n".join([r['text'] for r in region_results if r['text'].strip()])
+                            # Store combined results for later use
+                            preprocessing_options['segmentation_data'] = {
+                                'text_regions_coordinates': segmentation_results.get('text_regions_coordinates', []),
+                                'regions_count': regions_count,
+                                'segmentation_applied': True,
+                                'combined_text': combined_text,
+                                'region_results': region_results
+                            }
+                            logger.info(f"Successfully processed {len(region_results)} text regions")
+                            # Set up the temp path to use the segmented image
+                            temp_path = segmented_temp_path
+                            # IMPORTANT: Text has already been extracted from the individual regions,
+                            # so emphasize that text in the prompt
+                            if custom_prompt:
+                                # Add strong emphasis on using the already extracted text
+                                custom_prompt += f" IMPORTANT: The document has been segmented into {regions_count} text regions that have been processed individually. The text from these regions should be given HIGHEST PRIORITY and used as the primary source of text for the document. The combined image is provided only as supplementary context."
+                            else:
+                                # Create explicit prompt prioritizing region text
+                                custom_prompt = f"CRITICAL: This document has been preprocessed to highlight {regions_count} text regions that have been individually processed. The text from these regions is the PRIMARY source of content and should be prioritized over any text extracted from the combined image. Use the combined image only for context and layout understanding."
                         else:
+                            # No individual regions found, use combined result
+                            temp_path = segmented_temp_path
+                            # Enhanced prompt based on segmentation results
                             regions_count = len(segmentation_results.get('text_regions_coordinates', []))
+                            if custom_prompt:
+                                # Add segmentation info to existing prompt
+                                custom_prompt += f" The document has been segmented and contains approximately {regions_count} text regions that should be carefully extracted. Please focus on extracting all text from these regions."
+                            else:
+                                # Create new prompt focused on text extraction from segmented regions
+                                custom_prompt = f"This document has been preprocessed to highlight {regions_count} text regions. Please carefully extract all text from these highlighted regions, preserving the reading order and structure."
+                            # Store segmentation data in preprocessing options for later use
+                            preprocessing_options['segmentation_data'] = {
+                                'text_regions_coordinates': segmentation_results.get('text_regions_coordinates', []),
+                                'regions_count': regions_count,
+                                'segmentation_applied': True
+                            }
+                        logger.info(f"Image segmentation applied. Found {len(segmentation_results.get('text_regions_coordinates', []))} text regions.")
+                        progress_reporter.update(40, f"Identified {len(segmentation_results.get('text_regions_coordinates', []))} text regions for extraction...")
                     else:
                         logger.warning("Image segmentation produced no result, using original image.")
                 except Exception as seg_error:
             # Process the file using cached function if possible
             progress_reporter.update(50, "Processing document with OCR...")
             try:
+                # Use the document type from preprocessing options
+                doc_type = preprocessing_options.get("document_type", "standard")
                 modified_custom_prompt = custom_prompt
+                # Add document-type specific instructions based on preprocessing options
+                if doc_type == "handwritten" and not modified_custom_prompt:
                     modified_custom_prompt = "This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
+                elif doc_type == "handwritten" and "handwritten" not in modified_custom_prompt.lower():
+                    modified_custom_prompt += " This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
+                elif doc_type == "newspaper" and not modified_custom_prompt:
+                    modified_custom_prompt = "This is a newspaper or document with columns. Please extract all text content from each column, maintaining proper reading order."
+                elif doc_type == "newspaper" and "column" not in modified_custom_prompt.lower() and "newspaper" not in modified_custom_prompt.lower():
+                    modified_custom_prompt += " This appears to be a newspaper or document with columns. Please extract all text content from each column."
+                elif doc_type == "book" and not modified_custom_prompt:
+                    modified_custom_prompt = "This is a book page. Extract titles, headers, footnotes, and body text, preserving paragraph structure and formatting."
                 # Update the cache key with the modified prompt
                 if modified_custom_prompt != custom_prompt:
                     # Use simpler processing for speed
                     pass  # Any speed optimizations would be handled by the StructuredOCR class
+                # Use the document type from preprocessing options
+                doc_type = preprocessing_options.get("document_type", "standard")
                 modified_custom_prompt = custom_prompt
+                # Add document-type specific instructions based on preprocessing options
+                if doc_type == "handwritten" and not modified_custom_prompt:
                     modified_custom_prompt = "This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
+                elif doc_type == "handwritten" and "handwritten" not in modified_custom_prompt.lower():
+                    modified_custom_prompt += " This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
+                elif doc_type == "newspaper" and not modified_custom_prompt:
+                    modified_custom_prompt = "This is a newspaper or document with columns. Please extract all text content from each column, maintaining proper reading order."
+                elif doc_type == "newspaper" and "column" not in modified_custom_prompt.lower() and "newspaper" not in modified_custom_prompt.lower():
+                    modified_custom_prompt += " This appears to be a newspaper or document with columns. Please extract all text content from each column."
+                elif doc_type == "book" and not modified_custom_prompt:
+                    modified_custom_prompt = "This is a book page. Extract titles, headers, footnotes, and body text, preserving paragraph structure and formatting."
                 result = processor.process_file(
                     file_path=temp_path,
         # Add additional metadata to result
         result = process_result(result, uploaded_file, preprocessing_options)
+        # Make sure file_type is explicitly set for PDFs
+        if file_type == "pdf":
+            result['file_type'] = "pdf"
         # 🔧 ALWAYS normalize result before returning
         result = clean_ocr_result(
             result,
             use_segmentation=use_segmentation,
+            vision_enabled=use_vision,
+            preprocessing_options=preprocessing_options
         )
         # Complete progress
         preprocessing_options
     )
+    # Extract raw text from OCR contents for tag extraction without duplicating content
     raw_text = ""
     if 'ocr_contents' in result:
+        # Try fields in order of preference
+        for field in ["raw_text", "content", "text", "transcript", "main_text"]:
+            if field in result['ocr_contents'] and result['ocr_contents'][field]:
+                raw_text = result['ocr_contents'][field]
+                break
     # Extract subject tags if not already present or enhance existing ones
     if 'topics' not in result or not result['topics']:
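
The document-type branches added to `ocr_processing.py` above follow one rule: use a full hint when the user gave no prompt, append a shorter hint when the prompt does not already mention the document type, and otherwise leave the prompt alone. A minimal sketch of that rule as a table-driven helper follows. The names `_HINTS` and `apply_doc_type_hint` are hypothetical and do not appear in the commit, and unlike the inline code, this version returns an empty string instead of `None` when no prompt is set.

```python
from typing import Optional

# Hypothetical lookup table (not in the commit). Hint strings are copied from the diff.
_HINTS = {
    "handwritten": {
        "keywords": ("handwritten",),
        "fresh": "This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting.",
        "append": "This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting.",
    },
    "newspaper": {
        "keywords": ("column", "newspaper"),
        "fresh": "This is a newspaper or document with columns. Please extract all text content from each column, maintaining proper reading order.",
        "append": "This appears to be a newspaper or document with columns. Please extract all text content from each column.",
    },
    "book": {
        "keywords": (),
        "fresh": "This is a book page. Extract titles, headers, footnotes, and body text, preserving paragraph structure and formatting.",
        "append": None,  # the diff never appends a book hint to an existing prompt
    },
}


def apply_doc_type_hint(custom_prompt: Optional[str], doc_type: str) -> str:
    """Return the prompt with a document-type hint added where the diff would add one."""
    prompt = custom_prompt or ""
    hint = _HINTS.get(doc_type)
    if hint is None:
        return prompt  # e.g. "standard": nothing to add
    if not prompt:
        return hint["fresh"]
    already_covered = any(word in prompt.lower() for word in hint["keywords"])
    if hint["append"] and not already_covered:
        return f"{prompt} {hint['append']}"
    return prompt
```

Keeping the strings in one table would let the PDF and image code paths share a single implementation instead of repeating the `if`/`elif` chain.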

requirements.txt CHANGED

@@ -9,7 +9,7 @@ pydantic>=2.5.0  # Updated for better BaseModel support
 Pillow>=10.0.0
 opencv-python-headless>=4.8.0.74
 pdf2image>=1.16.0
-pytesseract>=0.3.10  # For local OCR fallback
 matplotlib>=3.7.0    # For visualization in preprocessing tests
 # Data handling and utilities

 Pillow>=10.0.0
 opencv-python-headless>=4.8.0.74
 pdf2image>=1.16.0
+# pytesseract>=0.3.10  # For local OCR fallback
 matplotlib>=3.7.0    # For visualization in preprocessing tests
 # Data handling and utilities

structured_ocr.py CHANGED

@@ -1135,44 +1135,8 @@ class StructuredOCR:
                 "confidence_score": 0.0
             }
-        # Check if this is likely a newspaper or handwritten document by filename
-        is_likely_newspaper = False
-        is_likely_handwritten = False
-        newspaper_keywords = ["newspaper", "gazette", "herald", "times", "journal",
-                            "chronicle", "post", "tribune", "news", "press", "gender"]
-        handwritten_keywords = ["handwritten", "manuscript", "letter", "correspondence", "journal", "diary"]
-        # Check filename for document type indicators
-        filename_lower = file_path.name.lower()
-        # First check for handwritten documents
-        for keyword in handwritten_keywords:
-            if keyword in filename_lower:
-                is_likely_handwritten = True
-                logger.info(f"Likely handwritten document detected from filename: {file_path.name}")
-                # Add handwritten-specific processing hint to custom_prompt if not already present
-                if custom_prompt:
-                    if "handwritten" not in custom_prompt.lower():
-                        custom_prompt = custom_prompt + " This appears to be a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks. Note any unclear sections or annotations."
-                else:
-                    custom_prompt = "This is a handwritten document. Carefully transcribe all handwritten text, preserving line breaks. Note any unclear sections or annotations."
-                break
-        # Then check for newspaper if not handwritten
-        if not is_likely_handwritten:
-            for keyword in newspaper_keywords:
-                if keyword in filename_lower:
-                    is_likely_newspaper = True
-                    logger.info(f"Likely newspaper document detected from filename: {file_path.name}")
-                    # Add newspaper-specific processing hint to custom_prompt if not already present
-                    if custom_prompt:
-                        if "column" not in custom_prompt.lower() and "newspaper" not in custom_prompt.lower():
-                            custom_prompt = custom_prompt + " This appears to be a newspaper or document with columns. Please extract all text content from each column."
-                    else:
-                        custom_prompt = "This appears to be a newspaper or document with columns. Please extract all text content from each column, maintaining proper reading order."
-                    break
         try:
             # Check file size
@@ -1192,10 +1156,11 @@ class StructuredOCR:
                 if file_size_mb > max_size_mb:
                     logger.info(f"Image is large ({file_size_mb:.2f} MB), optimizing for API submission")
-                # Handwritten docs default to the conservative pipeline
                 base64_data_url = get_base64_from_bytes(
                     preprocess_image(file_path.read_bytes(),
-                                   {"document_type": "handwritten",
                                     "grayscale": True,
                                     "denoise": True,
                                     "contrast": 0})
@@ -1391,9 +1356,9 @@ class StructuredOCR:
                             logger.info(f"Found language in page: {lang}")
             # Optimize: Skip vision model step if ocr_markdown is very small or empty
-            # BUT make an exception for newspapers or if custom_prompt is provided
             # OR if the image has visual content worth preserving
-            if (not is_likely_newspaper and not custom_prompt and not has_images) and (not image_ocr_markdown or len(image_ocr_markdown) < 50):
                 logger.warning("OCR produced minimal text with no images. Returning basic result.")
                 return {
                     "file_name": file_path.name,
@@ -1407,14 +1372,6 @@ class StructuredOCR:
                     "raw_response_data": serialize_ocr_response(image_response)
                 }
-            # For newspapers with little text in OCR, set a more explicit prompt
-            if is_likely_newspaper and (not image_ocr_markdown or len(image_ocr_markdown) < 100):
-                logger.info("Newspaper with minimal OCR text detected. Using enhanced prompt.")
-                if not custom_prompt:
-                    custom_prompt = "This is a newspaper or document with columns. The OCR may not have captured all text. Please examine the image carefully and extract ALL text content visible in the document, reading each column from top to bottom."
-                elif "extract all text" not in custom_prompt.lower():
-                    custom_prompt += " Please examine the image carefully and extract ALL text content visible in the document."
             # For images with minimal text but visual content, enhance the prompt
             elif has_images and (not image_ocr_markdown or len(image_ocr_markdown) < 100):
                 logger.info("Document with images but minimal text detected. Using enhanced prompt for mixed media.")
@@ -1575,16 +1532,25 @@ class StructuredOCR:
             else:
                 truncated_ocr = ocr_markdown
-            # Build a comprehensive prompt with OCR text and detailed instructions for language detection and image handling
             enhanced_prompt = f"This is a document's OCR text:\n<BEGIN_OCR>\n{truncated_ocr}\n<END_OCR>\n\n"
             # Add custom prompt if provided
             if custom_prompt:
                 enhanced_prompt += f"User instructions: {custom_prompt}\n\n"
-            # Add comprehensive extraction instructions with language detection guidance
-            enhanced_prompt += "Extract all text content accurately from this document, including any text visible in the image that may not have been captured by OCR.\n\n"
-            enhanced_prompt += "IMPORTANT: First thoroughly extract and analyze all text content, THEN determine the languages present.\n"
             enhanced_prompt += "Precisely identify and list ALL languages present in the document separately. Look closely for multiple languages that might appear together.\n"
             enhanced_prompt += "For language detection, examine these specific indicators:\n"
             enhanced_prompt += "- French: accents (é, è, ê, à, ç, â, î, ô, û), words like 'le', 'la', 'les', 'et', 'en', 'de', 'du', 'des', 'dans', 'ce', 'cette', 'ces', 'par', 'pour', 'qui', 'que', 'où', 'avec'\n"
@@ -1866,15 +1832,20 @@ class StructuredOCR:
                 truncated_text = ocr_markdown[:15000] + "\n...[content truncated]...\n" + ocr_markdown[-5000:]
                 logger.info(f"OCR text truncated from {len(ocr_markdown)} to {len(truncated_text)} chars")
-            # Build a prompt with enhanced language detection instructions
             enhanced_prompt = f"This is a document's OCR text:\n<BEGIN_OCR>\n{truncated_text}\n<END_OCR>\n\n"
             # Add custom prompt if provided
             if custom_prompt:
                 enhanced_prompt += f"User instructions: {custom_prompt}\n\n"
-            # Add thorough extraction instructions with enhanced language detection and metadata requirements
-            enhanced_prompt += "Extract all text content accurately from this document. Return structured data with the document's contents.\n\n"
             enhanced_prompt += "IMPORTANT: Precisely identify and list ALL languages present in the document separately. Look closely for multiple languages that might appear together.\n"
             enhanced_prompt += "For language detection, examine these specific indicators:\n"
             enhanced_prompt += "- French: accents (é, è, ê, à, ç), words like 'le', 'la', 'les', 'et', 'en', 'de', 'du'\n"

                 "confidence_score": 0.0
             }
+        # No automatic document type detection - rely on the document type specified in the custom prompt
+        # The document type is passed from the UI through the custom prompt in ocr_processing.py
         try:
             # Check file size
                 if file_size_mb > max_size_mb:
                     logger.info(f"Image is large ({file_size_mb:.2f} MB), optimizing for API submission")
+                # Use standard preprocessing - document type will be handled by preprocessing.py
+                # based on the options passed from the UI
                 base64_data_url = get_base64_from_bytes(
                     preprocess_image(file_path.read_bytes(),
+                                   {"document_type": "standard",
                                     "grayscale": True,
                                     "denoise": True,
                                     "contrast": 0})
                             logger.info(f"Found language in page: {lang}")
             # Optimize: Skip vision model step if ocr_markdown is very small or empty
+            # BUT make an exception if custom_prompt is provided
             # OR if the image has visual content worth preserving
+            if (not custom_prompt and not has_images) and (not image_ocr_markdown or len(image_ocr_markdown) < 50):
                 logger.warning("OCR produced minimal text with no images. Returning basic result.")
                 return {
                     "file_name": file_path.name,
                     "raw_response_data": serialize_ocr_response(image_response)
                 }
             # For images with minimal text but visual content, enhance the prompt
             elif has_images and (not image_ocr_markdown or len(image_ocr_markdown) < 100):
                 logger.info("Document with images but minimal text detected. Using enhanced prompt for mixed media.")
             else:
                 truncated_ocr = ocr_markdown
+            # Build a comprehensive prompt with OCR text and detailed instructions for title detection and language handling
             enhanced_prompt = f"This is a document's OCR text:\n<BEGIN_OCR>\n{truncated_ocr}\n<END_OCR>\n\n"
             # Add custom prompt if provided
             if custom_prompt:
                 enhanced_prompt += f"User instructions: {custom_prompt}\n\n"
+            # Primary focus on document structure and title detection
+            enhanced_prompt += "You are analyzing a historical document. Follow these extraction priorities:\n"
+            enhanced_prompt += "1. FIRST PRIORITY: Identify and extract the TITLE of the document. Look for large text at the top, decorative typography, or centered text that appears to be a title. The title is often one of the first elements in historical documents.\n"
+            enhanced_prompt += "2. SECOND: Extract all text content accurately from this document, including any text visible in the image that may not have been captured by OCR.\n\n"
+            enhanced_prompt += "Document Title Guidelines:\n"
+            enhanced_prompt += "- For printed historical works: Look for primary heading at top of the document, all-caps text, or larger font size text\n"
+            enhanced_prompt += "- For newspapers/periodicals: Extract both newspaper name and article title if present\n"
+            enhanced_prompt += "- For handwritten documents: Look for centered text at the top or underlined headings\n"
+            enhanced_prompt += "- For engravings/illustrations: Include the title or caption, which often appears below the image\n\n"
+            # Language detection guidance
+            enhanced_prompt += "IMPORTANT: After extracting the title and text content, determine the languages present.\n"
             enhanced_prompt += "Precisely identify and list ALL languages present in the document separately. Look closely for multiple languages that might appear together.\n"
             enhanced_prompt += "For language detection, examine these specific indicators:\n"
             enhanced_prompt += "- French: accents (é, è, ê, à, ç, â, î, ô, û), words like 'le', 'la', 'les', 'et', 'en', 'de', 'du', 'des', 'dans', 'ce', 'cette', 'ces', 'par', 'pour', 'qui', 'que', 'où', 'avec'\n"
                 truncated_text = ocr_markdown[:15000] + "\n...[content truncated]...\n" + ocr_markdown[-5000:]
                 logger.info(f"OCR text truncated from {len(ocr_markdown)} to {len(truncated_text)} chars")
+            # Build a prompt with enhanced title detection and language detection instructions
             enhanced_prompt = f"This is a document's OCR text:\n<BEGIN_OCR>\n{truncated_text}\n<END_OCR>\n\n"
             # Add custom prompt if provided
             if custom_prompt:
                 enhanced_prompt += f"User instructions: {custom_prompt}\n\n"
+            # Add title detection focus
+            enhanced_prompt += "You are analyzing a historical document. Please follow these extraction priorities:\n"
+            enhanced_prompt += "1. FIRST PRIORITY: Identify and extract the TITLE of the document. Look for prominent text at the top, decorative typography, or centered text that appears to be a title.\n"
+            enhanced_prompt += "   - For historical documents with prominent headings at the top\n"
+            enhanced_prompt += "   - For newspapers or periodicals, extract both the publication name and article title\n"
+            enhanced_prompt += "   - For manuscripts or letters, identify any heading or subject line\n"
+            enhanced_prompt += "2. SECOND PRIORITY: Extract all text content accurately and return structured data with the document's contents.\n\n"
             enhanced_prompt += "IMPORTANT: Precisely identify and list ALL languages present in the document separately. Look closely for multiple languages that might appear together.\n"
             enhanced_prompt += "For language detection, examine these specific indicators:\n"
             enhanced_prompt += "- French: accents (é, è, ê, à, ç), words like 'le', 'la', 'les', 'et', 'en', 'de', 'du'\n"
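
The prompt assembly in the `structured_ocr.py` hunks above has a fixed order: OCR text wrapped in `<BEGIN_OCR>`/`<END_OCR>`, then any user instructions, then the title-first extraction priorities. A rough sketch of that order, with the head-and-tail truncation from the text-only path, is below. The function names are hypothetical. The condition that triggers truncation is not visible in these hunks, so the sketch assumes it applies once the text is longer than the head and tail combined.

```python
def truncate_ocr_text(ocr_markdown: str, head: int = 15000, tail: int = 5000) -> str:
    """Keep the first `head` and last `tail` characters of long OCR output."""
    # Assumption: truncate only when the text exceeds head + tail characters.
    if len(ocr_markdown) <= head + tail:
        return ocr_markdown
    return ocr_markdown[:head] + "\n...[content truncated]...\n" + ocr_markdown[-tail:]


def build_enhanced_prompt(ocr_markdown: str, custom_prompt: str = "") -> str:
    """Assemble the prompt in the order the diff uses (abbreviated instructions)."""
    parts = [
        "This is a document's OCR text:\n<BEGIN_OCR>\n"
        f"{truncate_ocr_text(ocr_markdown)}\n<END_OCR>\n\n"
    ]
    if custom_prompt:
        parts.append(f"User instructions: {custom_prompt}\n\n")
    parts.append("You are analyzing a historical document. Please follow these extraction priorities:\n")
    parts.append("1. FIRST PRIORITY: Identify and extract the TITLE of the document.\n")
    parts.append("2. SECOND PRIORITY: Extract all text content accurately.\n\n")
    return "".join(parts)
```

Because the user instructions sit between the OCR text and the built-in priorities, a document-type hint passed in as `custom_prompt` reaches the model before the title guidance does.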

test_fix.py ADDED

	@@ -0,0 +1,55 @@

+#!/usr/bin/env python3
+import streamlit as st
+from ocr_processing import process_file
+# Mock a file upload
+class MockFile:
+    def __init__(self, name, content):
+        self.name = name
+        self._content = content
+    def getvalue(self):
+        return self._content
+def main():
+    # Load the test image - using the problematic image from the original task
+    with open('input/magician-or-bottle-cungerer.jpg', 'rb') as f:
+        file_bytes = f.read()
+    # Create mock file
+    uploaded_file = MockFile('magician-or-bottle-cungerer.jpg', file_bytes)
+    # Process the file
+    result = process_file(uploaded_file)
+    # Display results
+    print("\nDocument Content")
+    print("Title")
+    if 'title' in result['ocr_contents']:
+        print(result['ocr_contents']['title'])
+    print("\nMain")
+    if 'main_text' in result['ocr_contents']:
+        print(result['ocr_contents']['main_text'])
+    print("\nRaw Text")
+    if 'raw_text' in result['ocr_contents']:
+        print(result['ocr_contents']['raw_text'][:300] + "...")
+    # Debug: Print all keys in ocr_contents
+    print("\nAll OCR Content Keys:")
+    for key in result['ocr_contents'].keys():
+        print(f"- {key}")
+    # Debug: Display content of all keys
+    print("\nContent of each key:")
+    for key in result['ocr_contents'].keys():
+        print(f"\n--- {key} ---")
+        content = result['ocr_contents'][key]
+        if isinstance(content, str):
+            print(content[:150] + "..." if len(content) > 150 else content)
+        else:
+            print(f"Type: {type(content)}")
+if __name__ == "__main__":
+    main()

test_magellan_language.py DELETED

@@ -1,39 +0,0 @@
-import sys
-import json
-from pathlib import Path
-from structured_ocr import StructuredOCR
-def main():
-    """Test language detection on the Magellan document"""
-    # Path to the Magellan document
-    file_path = Path("input/magellan-travels.jpg")
-    if not file_path.exists():
-        print(f"Error: File {file_path} not found")
-        return
-    print(f"Testing language detection on {file_path}")
-    # Process the file
-    processor = StructuredOCR()
-    result = processor.process_file(file_path)
-    # Print language detection results
-    if 'languages' in result:
-        print(f"\nDetected languages: {result['languages']}")
-    else:
-        print("\nNo languages detected")
-    # Save the full result for inspection
-    output_path = "output/magellan_test_result.json"
-    Path("output").mkdir(exist_ok=True)
-    with open(output_path, "w") as f:
-        json.dump(result, f, indent=2)
-    print(f"\nFull result saved to {output_path}")
-    return result
-if __name__ == "__main__":
-    main()

test_segmentation_fix.py ADDED

	@@ -0,0 +1,100 @@

+"""
+Test script to verify the segmentation and OCR improvements.
+This script will process an image using the updated segmentation algorithm
+and show how text recognition is prioritized over images.
+"""
+import os
+import json
+import tempfile
+from pathlib import Path
+from PIL import Image
+# Import the key components we modified
+from image_segmentation import segment_image_for_ocr
+from ocr_processing import process_file, process_result
+from utils.image_utils import clean_ocr_result
+import logging
+# Configure logging
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+def run_test(image_path):
+    """Run a test on the specified image to verify our fixes"""
+    print(f"Testing image segmentation and OCR prioritization on: {image_path}")
+    print("-" * 80)
+    # Make sure the image exists
+    if not os.path.exists(image_path):
+        print(f"Error: Image not found at {image_path}")
+        return
+    # 1. First run image segmentation directly
+    try:
+        print("Step 1: Running image segmentation...")
+        segmentation_results = segment_image_for_ocr(
+            image_path,
+            vision_enabled=True,
+            preserve_content=True
+        )
+        # Print segmentation info
+        text_regions_count = len(segmentation_results.get('text_regions_coordinates', []))
+        print(f"Detected {text_regions_count} text regions in the image")
+        # Save output images for inspection
+        output_dir = Path("output/segmentation_test")
+        output_dir.mkdir(parents=True, exist_ok=True)
+        if segmentation_results['text_regions'] is not None:
+            output_path = output_dir / "text_regions_improved.jpg"
+            segmentation_results['text_regions'].save(output_path)
+            print(f"Saved text regions visualization to: {output_path}")
+        if segmentation_results['image_regions'] is not None:
+            output_path = output_dir / "image_regions_improved.jpg"
+            segmentation_results['image_regions'].save(output_path)
+            print(f"Saved image regions visualization to: {output_path}")
+        if segmentation_results['combined_result'] is not None:
+            output_path = output_dir / "combined_result_improved.jpg"
+            segmentation_results['combined_result'].save(output_path)
+            print(f"Saved combined result to: {output_path}")
+        # Extract individual text regions if available
+        if 'region_images' in segmentation_results and segmentation_results['region_images']:
+            region_dir = output_dir / "text_regions"
+            region_dir.mkdir(exist_ok=True)
+            for idx, region_info in enumerate(segmentation_results['region_images']):
+                region_path = region_dir / f"region_{idx+1}.jpg"
+                region_info['pil_image'].save(region_path)
+            print(f"Saved {len(segmentation_results['region_images'])} individual text regions to {region_dir}")
+    except Exception as e:
+        print(f"Error during segmentation: {str(e)}")
+    print("-" * 80)
+    print("Test complete. Check the output directory for results.")
+    print("The text regions should now properly include all text content in the document.")
+    print("Image regions should be minimal and not contain text.")
+if __name__ == "__main__":
+    # Test with an image that has mixed text and image content
+    # You can change this to any image path you want to test
+    test_image = "input/baldwin-letter.jpg"
+    if not os.path.exists(test_image):
+        print(f"Test image not found at {test_image}, looking for alternatives...")
+        # Try to find an alternative test image
+        for potential_img in ["input/harpers.pdf", "input/magician-or-bottle-cungerer.jpg", "input/magellan-travels.jpg"]:
+            if os.path.exists(potential_img):
+                test_image = potential_img
+                print(f"Using alternative test image: {test_image}")
+                break
+    if os.path.exists(test_image):
+        run_test(test_image)
+    else:
+        print("No suitable test images found. Please place an image in the input directory.")

testing/magician_app_investigation_plan.md DELETED

@@ -1,58 +0,0 @@
-# Investigation Plan: App.py Image Processing Issues
-## Background
-- The `ocr_utils.py` in the reconcile-improvements branch successfully processes the magician image with specialized handling for illustrations/etchings
-- However, there appears to be an issue with app.py's ability to process this image file
-## Investigation Steps
-### 1. Trace the Image Processing Flow in app.py
-- Analyze how app.py calls the image processing functions
-- Identify which components are involved in the processing pipeline:
-  - File upload handling
-  - Preprocessing steps
-  - OCR processing
-  - Result handling
-### 2. Check for Integration Issues
-- Verify that app.py correctly imports and uses the enhanced functions from ocr_utils.py
-- Check if there are any version mismatches or import issues
-- Examine if app.py is using a different processing path that bypasses the enhanced illustration detection
-### 3. Test Direct Processing vs. App Processing
-- Create a test script that mimics app.py's processing flow but with more logging
-- Compare the processing steps between direct usage (as in our test) and through the app
-- Identify any differences in how parameters are passed or how results are handled
-### 4. Debug Specific Failure Points
-- Add detailed logging at key points in the processing pipeline
-- Focus on:
-  - File loading
-  - Preprocessing options application
-  - Illustration detection logic
-  - Error handling
-### 5. Check for Environment or Configuration Issues
-- Verify that all required dependencies are available in the app environment
-- Check if there are any configuration settings that might be overriding the enhanced processing
-- Examine if there are any resource constraints (memory, CPU) affecting the app's processing
-### 6. Implement Potential Fixes
-Based on findings, implement one of these approaches:
-1. **Fix Integration Issues**: Ensure app.py correctly uses the enhanced functions
-2. **Add Explicit Handling**: Add explicit handling for illustration/etching files in app.py
-3. **Update Preprocessing Options**: Modify default preprocessing options to better handle illustrations
-4. **Improve Error Handling**: Enhance error handling to provide better diagnostics for processing failures
-## Testing the Fix
-1. Create a test case that reproduces the issue in app.py
-2. Apply the proposed fix
-3. Verify that the magician image processes correctly
-4. Check that other image types still process correctly
-5. Document the fix and update the branch comparison documentation
-## Metrics to Collect
-- Processing time with and without the fix
-- Success rate for different image types
-- Memory usage during processing
-- File size reduction and quality preservation metrics

testing/magician_app_result.json DELETED

@@ -1,16 +0,0 @@
-{
-  "file_name": "tmp87m8g0ib.jpg",
-  "topics": [
-    "Document"
-  ],
-  "languages": [
-    "English"
-  ],
-  "ocr_contents": {
-    "raw_text": "![img-0.jpeg](img-0.jpeg)"
-  },
-  "processing_note": "OCR produced minimal text content",
-  "processing_time": 4.831024169921875,
-  "timestamp": "2025-04-23 20:29",
-  "descriptive_file_name": "magician-or-bottle-cungerer_document.jpg"
-}

testing/magician_image_final_report.md DELETED

@@ -1,58 +0,0 @@
-# Magician Image Processing - Final Report
-## Summary of Changes and Testing
-We've made significant improvements to the `ocr_utils.py` file in the reconcile-improvements branch to better handle the magician image. The key changes were:
-1. **Modified Document Type Detection Logic**:
-   - Removed "magician" from the illustration keywords list
-   - Changed the detection order to check for newspaper format first, then illustration format
-   - Added a special case for the magician image to prioritize newspaper processing
-   - Lowered the aspect ratio threshold for newspaper detection from 1.2 to 1.15
-2. **Testing Results**:
-   - The magician image is now correctly detected as a handwritten document instead of an illustration
-   - The image is processed using the handwritten document processing path
-   - The processed image size is reduced from 2500x2116 to 2000x1692 (a 36.03% reduction in pixel count)
-   - The processing time is slightly increased (0.71 seconds vs 0.58 seconds)
-3. **OCR Results**:
-   - Despite the improved image processing, the OCR system still produces minimal text output
-   - The extracted text is still just "![img-0.jpeg](img-0.jpeg)" (25 characters)
-   - This suggests the OCR API is treating the content as an image to be embedded rather than text to be extracted
-## Output Formatting Analysis
-After comparing the main branch version of `ocr_utils.py` with our modified version, we confirmed that our changes are focused on the image detection and processing logic. The output formatting functions like `create_html_with_images`, `serialize_ocr_object`, etc. remain unchanged.
-The issue with the OCR producing minimal text is likely due to how the OCR API is processing the image, not due to our changes in `ocr_utils.py`. The API appears to be treating the magician image as primarily visual content rather than text content, regardless of the preprocessing applied.
-## Recommendations for Further Improvement
-1. **OCR API Configuration**:
-   - Experiment with different OCR API parameters to better handle mixed content (images and text)
-   - Consider using a different OCR model or service that might better handle this specific type of document
-2. **Image Segmentation**:
-   - Implement a preprocessing step that segments the image into text and non-text regions
-   - Process the text regions with specialized OCR settings
-3. **Custom Document Type**:
-   - Create a new document type specifically for mixed content like the magician image
-   - Implement specialized processing that handles both the illustration and text components
-4. **Local OCR Fallback**:
-   - Enhance the `try_local_ocr_fallback` function to better handle newspaper-style documents
-   - Use different Tesseract PSM (Page Segmentation Mode) settings for column detection
-## Conclusion
-The changes we've made to `ocr_utils.py` have successfully improved the image preprocessing for the magician image, changing it from being processed as an illustration to being processed as a handwritten document. However, the OCR API still struggles with extracting the text content from this particular image.
-The output formatting of the OCR results is working as expected, but the input to the formatting functions (the OCR API results) contains minimal text. To fully resolve the issue, further work is needed on how the OCR API processes mixed content documents like the magician image.
-All testing artifacts have been organized in the `/testing` directory for future reference, including:
-- Test scripts
-- Processed images
-- Test reports
-- Investigation plans

testing/magician_image_findings.md DELETED

@@ -1,84 +0,0 @@
-# Magician Image Processing Analysis
-## Summary of Findings
-After thorough testing of the magician image processing in both direct usage and through app.py's processing flow, we've identified the following key findings:
-1. **Image Classification Issue**:
-   - The magician image (dimensions: 2500x2116, aspect ratio: 1.18) is being classified as an **illustration/etching** rather than a **newspaper** format.
-   - This classification is primarily based on the filename containing "magician" which triggers the illustration detection logic.
-   - The image falls just short of the newspaper detection criteria: (aspect ratio > 1.2 and width > 2000) or (width > 3000 or height > 3000).
-2. **Processing Approach**:
-   - When processed as an illustration/etching, the focus is on preserving fine details rather than enhancing text readability.
-   - This is suboptimal for the magician image which contains three columns of text in the lower half.
-   - The OCR system produces minimal text output when processing the image this way.
-3. **OCR Results**:
-   - The OCR system returns primarily image references rather than extracted text.
-   - The extracted text is minimal: "![img-0.jpeg](img-0.jpeg)" (25 characters).
-   - This suggests the OCR system is treating the content as an image to be embedded rather than text to be extracted.
-## Root Cause Analysis
-The root cause appears to be a conflict between two detection mechanisms in the reconcile-improvements branch:
-1. **Filename-based detection**: The filename "magician-or-bottle-cungerer.jpg" triggers the illustration/etching detection.
-2. **Dimension-based detection**: The image's aspect ratio (1.18) falls just below the newspaper threshold (1.2).
-Since the filename-based detection takes precedence, the image is processed as an illustration/etching, which is not optimal for extracting the text from the newspaper columns.
-## Recommendations
-Based on our findings, we recommend the following improvements:
-1. **Enhance Detection Logic**:
-   - Modify the detection logic to consider both the content structure and the filename.
-   - Add a secondary check that looks for column structures even in images classified as illustrations.
-   - Lower the aspect ratio threshold for newspaper detection from 1.2 to 1.15 to catch more newspaper-like formats.
-2. **Hybrid Processing Approach**:
-   - Implement a hybrid processing approach for images that have characteristics of both illustrations and newspapers.
-   - Process the upper half (illustration) and lower half (text columns) differently.
-   - Apply illustration processing to the image portion and newspaper processing to the text portion.
-3. **OCR Configuration**:
-   - Adjust OCR settings to better handle mixed content (images and text columns).
-   - Add specific handling for multi-column text layouts even when the overall document is classified as an illustration.
-4. **Preprocessing Options in app.py**:
-   - Add an explicit option in app.py's preprocessing options to force newspaper/column processing.
-   - This would allow users to override the automatic detection when needed.
-## Implementation Plan
-1. **Short-term Fix**:
-   ```python
-   # Modify the newspaper detection criteria in ocr_utils.py
-   is_newspaper_format = (aspect_ratio > 1.15 and width > 2000) or (width > 3000 or height > 3000)
-   ```
-2. **Medium-term Enhancement**:
-   ```python
-   # Add column detection logic
-   def detect_columns(img):
-       # Implementation to detect vertical text columns
-       # Return True if columns are detected
-       pass
-   # Modify the processing path selection
-   if is_illustration_format and detect_columns(img):
-       # Apply hybrid processing
-       pass
-   ```
-3. **Long-term Solution**:
-   - Implement a more sophisticated document layout analysis that can identify different regions (images, text, columns) within a document.
-   - Apply specialized processing to each region based on its content type.
-   - Train a machine learning model to better classify document types based on visual features rather than just dimensions or filenames.
-## Conclusion
-The reconcile-improvements branch has made significant enhancements to the image processing capabilities, particularly for illustrations and etchings. However, the current implementation has a limitation when handling mixed-content documents like the magician image that contains both an illustration and columns of text.
-By implementing the recommended changes, we can improve the OCR results for such mixed-content documents while maintaining the benefits of the specialized processing for pure illustrations and etchings.
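
Note on the removed findings above: the precedence problem they describe can be sketched as follows. This is a minimal illustration, not the code from `ocr_utils.py`. The thresholds and the image dimensions (2500x2116) are quoted from the findings, and the keywords `magician` and `illustration` are the ones named in the removed branch comparison. The function names `is_newspaper_format` and `classify`, and the `"standard"` fallback label, are assumptions made for this example.

```python
# Hypothetical keyword list; only these two keywords are named in the diff.
ILLUSTRATION_KEYWORDS = ("magician", "illustration")


def is_newspaper_format(width, height, ratio_threshold=1.2):
    """Dimension-based rule as quoted in the findings."""
    aspect_ratio = width / height
    return (aspect_ratio > ratio_threshold and width > 2000) or (
        width > 3000 or height > 3000
    )


def classify(filename, width, height, ratio_threshold=1.2):
    """Assumed ordering: the filename check runs first and wins."""
    if any(keyword in filename.lower() for keyword in ILLUSTRATION_KEYWORDS):
        return "illustration"
    if is_newspaper_format(width, height, ratio_threshold):
        return "newspaper"
    return "standard"  # placeholder label for everything else


# The magician image: 2500 / 2116 is roughly 1.18.
print(is_newspaper_format(2500, 2116))                        # threshold 1.2
print(is_newspaper_format(2500, 2116, ratio_threshold=1.15))  # threshold 1.15
print(classify("magician-or-bottle-cungerer.jpg", 2500, 2116, 1.15))
```

Under these assumptions, lowering the threshold to 1.15 changes the dimension check for this image, but the result of `classify` stays `"illustration"` as long as the filename check runs first. That is consistent with the findings' recommendation to add a column check alongside the threshold change.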

testing/magician_ocr_text.txt DELETED

@@ -1,9 +0,0 @@
-THE MAGICIAN OR BOTTLE CONJURER.
-This historical illustration shows "The Magician or Bottle Conjurer" - a popular form of entertainment in the 18th and 19th centuries. The image depicts a performer demonstrating illusions and magic tricks related to bottles and other objects.
-The magician stands behind a table on which various props are displayed. He appears to be dressed in period costume typical of traveling entertainers of the era.
-Below the illustration is text that describes the performance and the mystical nature of these displays that captivated audiences during this period in history.
-This type of entertainment was common at fairs, theaters, and public gatherings, showcasing the fascination with illusion and "supernatural" demonstrations that were popular before modern understanding of science.

testing/test_app_direct.py DELETED

@@ -1,180 +0,0 @@
-"""
-Direct test of app.py's image processing logic with the magician image.
-This script extracts and uses the actual processing logic from app.py.
-"""
-import os
-import sys
-# Add the parent directory to the Python path so we can import the modules
-sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
-import logging
-from pathlib import Path
-import io
-import time
-from datetime import datetime
-# Configure detailed logging
-logging.basicConfig(
-    level=logging.DEBUG,
-    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
-)
-logger = logging.getLogger("app_direct_test")
-# Import the actual processing function from app.py's dependencies
-from ocr_processing import process_file
-from ui_components import ProgressReporter
-class MockProgressReporter(ProgressReporter):
-    """Mock progress reporter that logs instead of updating Streamlit"""
-    def __init__(self):
-        self.progress = 0
-        self.message = ""
-    def update(self, progress, message):
-        self.progress = progress
-        self.message = message
-        logger.info(f"Progress: {progress}% - {message}")
-        return self
-    def complete(self, success=True):
-        if success:
-            logger.info("Processing completed successfully")
-        else:
-            logger.warning("Processing completed with errors")
-        return self
-    def setup(self):
-        return self
-def test_app_processing():
-    """Test the actual processing logic from app.py"""
-    logger.info("=== Testing app.py's actual processing logic ===")
-    # Path to the magician image
-    image_path = Path("input/magician-or-bottle-cungerer.jpg")
-    if not image_path.exists():
-        logger.error(f"Image file not found: {image_path}")
-        return False
-    # Create a mock uploaded file object similar to what Streamlit would provide
-    class MockUploadedFile:
-        def __init__(self, path):
-            self.path = path
-            self.name = os.path.basename(path)
-            self.type = "image/jpeg"
-            with open(path, 'rb') as f:
-                self._content = f.read()
-        def getvalue(self):
-            return self._content
-        def read(self):
-            return self._content
-        def seek(self, position):
-            # Implement seek for compatibility with some file operations
-            return
-        def tell(self):
-            # Implement tell for compatibility
-            return 0
-    # Create the mock uploaded file
-    uploaded_file = MockUploadedFile(str(image_path))
-    # Create a progress reporter
-    progress_reporter = MockProgressReporter()
-    # Define preprocessing options - using the exact same defaults as app.py
-    preprocessing_options = {
-        "grayscale": True,
-        "denoise": True,
-        "contrast": 1.5,
-        "document_type": "auto"  # This should trigger illustration detection
-    }
-    try:
-        start_time = time.time()
-        logger.info(f"Processing file with app.py logic: {uploaded_file.name}")
-        # Process the file using the EXACT SAME function that app.py uses
-        result = process_file(
-            uploaded_file=uploaded_file,
-            use_vision=True,
-            preprocessing_options=preprocessing_options,
-            progress_reporter=progress_reporter,
-            pdf_dpi=150,
-            max_pages=3,
-            pdf_rotation=0,
-            custom_prompt=None,
-            perf_mode="Quality"
-        )
-        processing_time = time.time() - start_time
-        if result:
-            logger.info(f"Processing successful in {processing_time:.2f} seconds")
-            # Log key parts of the result
-            if "error" in result and result["error"]:
-                logger.error(f"Error in result: {result['error']}")
-                return False
-            logger.info(f"File name: {result.get('file_name', 'Unknown')}")
-            logger.info(f"Topics: {result.get('topics', [])}")
-            logger.info(f"Languages: {result.get('languages', [])}")
-            # Check if OCR contents are present
-            if "ocr_contents" in result:
-                if "raw_text" in result["ocr_contents"]:
-                    text_length = len(result["ocr_contents"]["raw_text"])
-                    logger.info(f"Extracted text length: {text_length} characters")
-                    # Save the extracted text
-                    output_dir = Path("testing")
-                    output_dir.mkdir(exist_ok=True)
-                    with open(output_dir / "magician_ocr_text.txt", "w") as f:
-                        f.write(result["ocr_contents"]["raw_text"])
-                    logger.info(f"Saved extracted text to testing/magician_ocr_text.txt")
-                else:
-                    logger.warning("No raw_text in OCR contents")
-            else:
-                logger.warning("No OCR contents in result")
-            # Save the result to a file for inspection
-            import json
-            output_dir = Path("testing")
-            output_dir.mkdir(exist_ok=True)
-            # Remove large base64 data to make the file manageable
-            result_copy = result.copy()
-            if "raw_response_data" in result_copy:
-                if "pages" in result_copy["raw_response_data"]:
-                    for page in result_copy["raw_response_data"]["pages"]:
-                        if "images" in page:
-                            for img in page["images"]:
-                                if "image_base64" in img:
-                                    img["image_base64"] = "[BASE64 DATA REMOVED]"
-            with open(output_dir / "magician_app_result.json", "w") as f:
-                json.dump(result_copy, f, indent=2)
-            logger.info(f"Saved result to testing/magician_app_result.json")
-            return True
-        else:
-            logger.error("Processing failed - no result returned")
-            return False
-    except Exception as e:
-        logger.exception(f"Error in processing: {str(e)}")
-        return False
-if __name__ == "__main__":
-    # Run the test
-    success = test_app_processing()
-    # Print final result
-    if success:
-        print("\n✅ Test completed successfully. Check the logs for details.")
-    else:
-        print("\n❌ Test failed. Check the logs for error details.")

testing/test_filename_format.py ADDED

@@ -0,0 +1,93 @@

+"""Test the new filename formatting"""
+import os
+import sys
+import datetime
+import inspect
+# Add the project root to the path so we can import modules
+sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
+# Import the main utils.py file directly
+import utils as root_utils
+print(f"Imported utils from: {root_utils.__file__}")
+print("Current create_descriptive_filename implementation:")
+print(inspect.getsource(root_utils.create_descriptive_filename))
+def main():
+    """Test the filename formatting"""
+    # Sample inputs
+    sample_files = [
+        "handwritten-letter.jpg",
+        "magician-or-bottle-cungerer.jpg",
+        "baldwin_15th_north.jpg",
+        "harpers.pdf",
+        "recipe.jpg"
+    ]
+    # Sample OCR results for testing
+    sample_results = [
+        {
+            "detected_document_type": "handwritten",
+            "topics": ["Letter", "Handwritten", "19th Century", "Personal Correspondence"]
+        },
+        {
+            "topics": ["Newspaper", "Print", "19th Century", "Illustration", "Advertisement"]
+        },
+        {
+            "detected_document_type": "letter",
+            "topics": ["Correspondence", "Early Modern", "English Language"]
+        },
+        {
+            "detected_document_type": "magazine",
+            "topics": ["Publication", "Late 19th Century", "Magazine", "Historical"]
+        },
+        {
+            "detected_document_type": "recipe",
+            "topics": ["Food", "Culinary", "Historical", "Instruction"]
+        }
+    ]
+    print("\nIMPROVED FILENAME FORMATTING TEST")
+    print("=" * 50)
+    # Format current date manually
+    current_date = datetime.datetime.now().strftime("%b %d, %Y")
+    print(f"Current date for filenames: {current_date}")
+    print("\nBEFORE vs AFTER Examples:\n")
+    for i, (original_file, result) in enumerate(zip(sample_files, sample_results)):
+        # Get file extension from original file
+        file_ext = os.path.splitext(original_file)[1]
+        # Generate the old style filename manually
+        original_name = os.path.splitext(original_file)[0]
+        doc_type_tag = ""
+        if 'detected_document_type' in result:
+            doc_type = result['detected_document_type'].lower()
+            doc_type_tag = f"_{doc_type.replace(' ', '_')}"
+        elif 'topics' in result and result['topics']:
+            doc_type_tag = f"_{result['topics'][0].lower().replace(' ', '_')}"
+        period_tag = ""
+        if 'topics' in result and result['topics']:
+            for tag in result['topics']:
+                if "century" in tag.lower() or "pre-" in tag.lower() or "era" in tag.lower():
+                    period_tag = f"_{tag.lower().replace(' ', '_')}"
+                    break
+        old_filename = f"{original_name}{doc_type_tag}{period_tag}{file_ext}"
+        # Generate the new descriptive filename with our improved formatter
+        new_filename = root_utils.create_descriptive_filename(original_file, result, file_ext)
+        print(f"Example {i+1}:")
+        print(f"  Original: {original_file}")
+        print(f"  Old Format: {old_filename}")
+        print(f"  New Format: {new_filename}")
+        print()
+if __name__ == "__main__":
+    main()

testing/test_improvements.py DELETED

@@ -1,244 +0,0 @@
-import sys
-import os
-import logging
-from pathlib import Path
-# Add parent directory to path to import local modules
-sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
-import streamlit as st
-from ocr_processing import process_file
-from utils import extract_subject_tags
-from preprocessing import preprocess_image, apply_preprocessing_to_file
-from ui_components import ProgressReporter
-# Configure logging
-logging.basicConfig(level=logging.INFO,
-                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
-logger = logging.getLogger("test_improvements")
-class MockUploadedFile:
-    """Mock implementation of streamlit's UploadedFile"""
-    def __init__(self, path):
-        self.path = path
-        self.name = os.path.basename(path)
-        self._content = None
-    def getvalue(self):
-        if self._content is None:
-            with open(self.path, 'rb') as f:
-                self._content = f.read()
-        return self._content
-def test_preprocessing_fix():
-    """Test that preprocessing is only applied when explicit options are selected"""
-    print("\n--- TESTING PREPROCESSING FIX ---")
-    # Path to test image
-    test_image_path = os.path.join('input', 'americae-retectio.jpg')
-    if not os.path.exists(test_image_path):
-        print(f"Test file not found: {test_image_path}")
-        return False
-    # Create mock file
-    mock_file = MockUploadedFile(test_image_path)
-    # Read original file to compare sizes
-    with open(test_image_path, 'rb') as f:
-        original_bytes = f.read()
-        original_size = len(original_bytes)
-    print(f"Original file size: {original_size / 1024:.1f} KB")
-    # Test case 1: Document type only - should NOT trigger preprocessing
-    preprocessing_options = {
-        "document_type": "printed",  # Set document type
-        "grayscale": False,
-        "denoise": False,
-        "contrast": 0,
-        "rotation": 0
-    }
-    temp_files = []
-    result_path, preprocessed = apply_preprocessing_to_file(
-        original_bytes,
-        '.jpg',
-        preprocessing_options,
-        temp_files
-    )
-    # Check if preprocessing was applied
-    print(f"Test 1 (Document type only) - Preprocessing applied: {preprocessed}")
-    if preprocessed:
-        print("❌ FAIL: Preprocessing was applied when only document type was set")
-    else:
-        print("✅ PASS: Preprocessing was NOT applied when only document type was set")
-    # Test case 2: With actual preprocessing options - SHOULD trigger preprocessing
-    preprocessing_options = {
-        "document_type": "printed",
-        "grayscale": True,  # Enable an actual preprocessing option
-        "denoise": False,
-        "contrast": 0,
-        "rotation": 0
-    }
-    temp_files = []
-    result_path, preprocessed = apply_preprocessing_to_file(
-        original_bytes,
-        '.jpg',
-        preprocessing_options,
-        temp_files
-    )
-    # Check if preprocessing was applied
-    print(f"Test 2 (With grayscale option) - Preprocessing applied: {preprocessed}")
-    if preprocessed:
-        print("✅ PASS: Preprocessing WAS applied when grayscale option was enabled")
-    else:
-        print("❌ FAIL: Preprocessing was NOT applied when grayscale option was enabled")
-    # Clean up temp files
-    for path in temp_files:
-        try:
-            if os.path.exists(path):
-                os.unlink(path)
-        except:
-            pass
-    return True
-def test_historical_theme_detection():
-    """Test the enhanced historical theme detection"""
-    print("\n--- TESTING HISTORICAL THEME DETECTION ---")
-    # Test case 1: Medieval historical text
-    medieval_text = """
-    In the 12th century, during the Crusades, the knights of the Holy Roman Empire traveled across
-    feudal Europe. These medieval warriors sought adventure and glory in Byzantine lands, and many found
-    themselves face to face with Islamic armies. The monasteries of the time kept detailed records of these
-    campaigns, though many were lost during the great plague that devastated much of Europe.
-    """
-    # Extract themes with our enhanced algorithm
-    themes = extract_subject_tags({}, medieval_text)
-    print("\nTest 1 (Medieval text):")
-    print(f"Extracted themes: {themes}")
-    # Check if key medieval themes were detected
-    medieval_keywords = ["Medieval", "Holy Roman Empire", "Crusades", "Byzantine"]
-    detected = [theme for theme in themes if any(keyword in theme for keyword in medieval_keywords)]
-    if detected:
-        print(f"✅ PASS: Detected appropriate medieval themes: {detected}")
-    else:
-        print("❌ FAIL: Failed to detect appropriate medieval themes")
-    # Test case 2: 19th century American history
-    american_text = """
-    Following the Civil War, the Reconstruction era marked a significant period in American history.
-    In the late 19th century, westward expansion and manifest destiny drove settlers across the frontier.
-    Native American communities faced displacement as the transcontinental railroad facilitated this massive
-    migration. The industrial revolution transformed eastern cities while Victorian values shaped social norms.
-    """
-    # Extract themes with our enhanced algorithm
-    themes = extract_subject_tags({}, american_text)
-    print("\nTest 2 (19th century American text):")
-    print(f"Extracted themes: {themes}")
-    # Check if key 19th century American themes were detected
-    american_keywords = ["19th Century", "American", "Civil War", "Victorian", "Native American",
-                       "Industrial Revolution"]
-    detected = [theme for theme in themes if any(keyword in theme for keyword in american_keywords)]
-    if detected:
-        print(f"✅ PASS: Detected appropriate American history themes: {detected}")
-    else:
-        print("❌ FAIL: Failed to detect appropriate American history themes")
-    return True
-def test_actual_document():
-    """Test with an actual document from the input folder"""
-    print("\n--- TESTING WITH ACTUAL DOCUMENT ---")
-    # Path to Magellan's travels document
-    test_image_path = os.path.join('input', 'magellan-travels.jpg')
-    if not os.path.exists(test_image_path):
-        print(f"Test file not found: {test_image_path}")
-        return False
-    # Create mock file
-    mock_file = MockUploadedFile(test_image_path)
-    # Mock progress reporter
-    class MockProgressReporter:
-        def update(self, percent, text):
-            pass
-        def complete(self, success=True):
-            pass
-    # Set up minimal processing options
-    preprocessing_options = {
-        "document_type": "printed",
-        "grayscale": False,
-        "denoise": False,
-        "contrast": 0,
-        "rotation": 0
-    }
-    # Process the document
-    print("Processing Magellan's travels document...")
-    # Use st.session_state in a way that doesn't require streamlit
-    if not hasattr(st, 'session_state'):
-        st.session_state = type('obj', (object,), {'temp_file_paths': []})
-    try:
-        # Use non-interactive mode for test
-        result = process_file(
-            uploaded_file=mock_file,
-            use_vision=True,
-            preprocessing_options=preprocessing_options,
-            progress_reporter=MockProgressReporter(),
-            custom_prompt="This is a historical document about exploration and travel."
-        )
-        # Check the results
-        if 'topics' in result:
-            print("\nDetected topics:")
-            for topic in result['topics']:
-                print(f"  - {topic}")
-            # Look for exploration/travel/geographic themes
-            relevant_keywords = ["Travel", "Exploration", "Maritime", "Voyage",
-                              "Expedition", "Geographic", "European", "Map"]
-            detected = [topic for topic in result['topics']
-                      if any(keyword.lower() in topic.lower() for keyword in relevant_keywords)]
-            if detected:
-                print(f"\n✅ PASS: Detected appropriate exploration themes: {detected}")
-            else:
-                print("\n❌ FAIL: Failed to detect appropriate exploration themes")
-        else:
-            print("❌ FAIL: No topics detected in result")
-    except Exception as e:
-        print(f"❌ ERROR processing document: {str(e)}")
-    return True
-if __name__ == "__main__":
-    print("Running tests for Historical OCR improvements...\n")
-    # Test preprocessing fix
-    test_preprocessing_fix()
-    # Test historical theme detection
-    test_historical_theme_detection()
-    # Test with an actual document
-    test_actual_document()
-    print("\nTests completed!")

testing/test_json_bleed.py ADDED

@@ -0,0 +1,46 @@

+"""
+Test case to verify the fix for JSON bleed-through in historical text.
+"""
+import sys
+import os
+from pathlib import Path
+# Add parent directory to path
+sys.path.append(str(Path(__file__).parent.parent))
+from utils.content_utils import format_structured_data
+from utils.text_utils import clean_raw_text, format_markdown_text
+# Sample text with JSON-like content (historical text with curly braces)
+SAMPLE_TEXT = """# ENGLISH Credulity; or Ye're all Bottled.
+O magnus pofldac Inimicis Rifus! Hor. Sat. WITH Grief, Refentment, and averted Eyes, Britannia droops to fee her Sons, (once Wile So fam'd for Arms, for Conduct fo renown'd With ev'ry Virtue ev'ry Glory crown'd) Now fink ignoble, and to nothing fall; Obedient marching forth at Folly's Call.
+Text containing curly braces like these: { and } should not be parsed as JSON.
+Even this text with a JSON-like pattern {"key": "value"} should be preserved as-is.
+"""
+def test_format_structured_data():
+    """Test that format_structured_data preserves text content"""
+    result = format_structured_data(SAMPLE_TEXT)
+    # Verify the text is returned as-is without attempting to parse JSON-like structures
+    assert result == SAMPLE_TEXT
+    print("✓ format_structured_data correctly preserves text content")
+    # Make sure the output doesn't have any JSON code blocks
+    assert "```json" not in result
+    print("✓ format_structured_data does not create JSON code blocks")
+    return True
+if __name__ == "__main__":
+    # Run the test
+    print("Running JSON bleed-through fix tests...\n")
+    success = test_format_structured_data()
+    if success:
+        print("\nAll tests passed! The JSON bleed-through issue is fixed.")
+    else:
+        print("\nSome tests failed.")

testing/test_magician.py DELETED

@@ -1,57 +0,0 @@
-import io
-import base64
-from pathlib import Path
-from PIL import Image
-# Import the application components
-from structured_ocr import StructuredOCR
-from ocr_utils import preprocess_image_for_ocr
-def test_magician_image():
-    # Path to the magician image
-    image_path = Path("/Users/zacharymuhlbauer/Desktop/tools/hocr/input/magician-or-bottle-cungerer.jpg")
-    # Process through ocr_utils preprocessing
-    print(f"Testing preprocessing on {image_path}")
-    processed_img, base64_data = preprocess_image_for_ocr(image_path)
-    if processed_img:
-        print(f"Successfully preprocessed image: {processed_img.size}")
-        # Get details about newspaper detection
-        width, height = processed_img.size
-        aspect_ratio = width / height
-        print(f"Image dimensions: {width}x{height}, aspect ratio: {aspect_ratio:.2f}")
-        print(f"Newspaper detection threshold: aspect_ratio > 1.15 and width > 2000")
-        is_newspaper = (aspect_ratio > 1.15 and width > 2000) or (width > 3000 or height > 3000)
-        print(f"Would be detected as newspaper: {is_newspaper}")
-        # Now test structured_ocr processing
-        print("\nTesting through StructuredOCR pipeline...")
-        processor = StructuredOCR()
-        # Process with explicit newspaper handling via custom prompt
-        custom_prompt = "This is a newspaper with columns. Extract all text from each column top to bottom."
-        result = processor.process_file(image_path, file_type="image", custom_prompt=custom_prompt)
-        # Check if the result has pages_data for image display
-        has_pages_data = 'pages_data' in result
-        has_images = result.get('has_images', False)
-        print(f"Result has pages_data: {has_pages_data}")
-        print(f"Result has_images flag: {has_images}")
-        # Check raw text content
-        if 'ocr_contents' in result and 'raw_text' in result['ocr_contents']:
-            raw_text = result['ocr_contents']['raw_text']
-            print(f"Raw text length: {len(raw_text)} chars")
-            print(f"Raw text preview: {raw_text[:100]}...")
-        else:
-            print("No raw_text found in result")
-        return result
-    else:
-        print("Preprocessing failed")
-        return None
-if __name__ == "__main__":
-    result = test_magician_image()

testing/test_magician_image.py DELETED

@@ -1,130 +0,0 @@
-import os
-import shutil
-from pathlib import Path
-import time
-from PIL import Image
-import logging
-# Configure logging to see debug messages
-logging.basicConfig(level=logging.DEBUG)
-logger = logging.getLogger("test")
-# Import the function we want to test
-from ocr_utils import preprocess_image_for_ocr
-def test_magician_image():
-    # Path to the magician image
-    image_path = Path("input/magician-or-bottle-cungerer.jpg")
-    # Ensure the file exists
-    if not image_path.exists():
-        print(f"Error: File not found at {image_path}")
-        return
-    print(f"Testing image preprocessing on {image_path.name}")
-    # Process the image
-    start_time = time.time()
-    processed_img, base64_data = preprocess_image_for_ocr(image_path)
-    processing_time = time.time() - start_time
-    # Print processing information
-    print(f"Processing completed in {processing_time:.2f} seconds")
-    if processed_img:
-        # Get original and processed image dimensions
-        with Image.open(image_path) as original_img:
-            original_size = original_img.size
-        processed_size = processed_img.size
-        print(f"Original image size: {original_size}")
-        print(f"Processed image size: {processed_size}")
-        # Create output directory
-        output_dir = Path("output")
-        output_dir.mkdir(exist_ok=True)
-        # Save the processed image for visual inspection
-        output_path = output_dir / "processed_magician.jpg"
-        processed_img.save(output_path)
-        print(f"Saved processed image to {output_path}")
-        # Create a test report
-        report_path = output_dir / "test_report.txt"
-        with open(report_path, "w") as f:
-            f.write(f"Test Report: Magician Image Processing\n")
-            f.write(f"=====================================\n\n")
-            f.write(f"Original image: {image_path}\n")
-            f.write(f"Original size: {original_size[0]}x{original_size[1]}\n")
-            f.write(f"Processed size: {processed_size[0]}x{processed_size[1]}\n")
-            f.write(f"Processing time: {processing_time:.2f} seconds\n")
-            # Calculate size reduction
-            original_pixels = original_size[0] * original_size[1]
-            processed_pixels = processed_size[0] * processed_size[1]
-            reduction = (1 - (processed_pixels / original_pixels)) * 100
-            f.write(f"Size reduction: {reduction:.2f}%\n")
-            # Check if illustration detection worked
-            f.write(f"\nIllustration Detection:\n")
-            f.write(f"- Filename contains 'magician': {'magician' in image_path.name.lower()}\n")
-            # Note about visual inspection
-            f.write(f"\nVisual Inspection Notes:\n")
-            f.write(f"- Check processed_magician.jpg for preservation of fine details\n")
-            f.write(f"- Verify that etching lines are clear and not over-processed\n")
-            f.write(f"- Confirm that contrast enhancement is appropriate for this illustration\n")
-        print(f"Created test report at {report_path}")
-        return output_path, report_path
-    else:
-        print("Processing failed - no image returned")
-        return None, None
-def relocate_test_files(output_path, report_path):
-    """Relocate test files to the testing folder"""
-    if not output_path or not report_path:
-        print("No test files to relocate")
-        return
-    # Create testing directory if it doesn't exist
-    testing_dir = Path("testing")
-    testing_dir.mkdir(exist_ok=True)
-    # Create a subdirectory for this specific test
-    test_dir = testing_dir / "magician_test"
-    test_dir.mkdir(exist_ok=True)
-    # Copy the files to the testing directory
-    shutil.copy(output_path, test_dir / output_path.name)
-    shutil.copy(report_path, test_dir / report_path.name)
-    # Create a comparison file that documents the differences between branches
-    comparison_path = test_dir / "branch_comparison.txt"
-    with open(comparison_path, "w") as f:
-        f.write("Comparison of ocr_utils.py between main and reconcile-improvements branches\n")
-        f.write("==================================================================\n\n")
-        f.write("Key improvements in reconcile-improvements branch:\n\n")
-        f.write("1. Enhanced illustration/etching detection:\n")
-        f.write("   - Added detection based on filename keywords (e.g., 'magician', 'illustration')\n")
-        f.write("   - Implemented image-based detection using edge density analysis\n\n")
-        f.write("2. Specialized processing for illustrations:\n")
-        f.write("   - Gentler scaling to preserve fine details\n")
-        f.write("   - Mild contrast enhancement (1.3 vs. higher values for other documents)\n")
-        f.write("   - Specialized sharpening for fine lines in etchings\n")
-        f.write("   - Higher quality settings (95 vs. 85) to prevent detail loss\n\n")
-        f.write("3. Performance optimizations:\n")
-        f.write("   - More efficient processing paths for different image types\n")
-        f.write("   - Better memory management for large images\n\n")
-        f.write("Test results for magician-or-bottle-cungerer.jpg demonstrate these improvements.\n")
-    print(f"Relocated test files to {test_dir}")
-    print(f"Created branch comparison document at {comparison_path}")
-if __name__ == "__main__":
-    # Run the test
-    output_path, report_path = test_magician_image()
-    # Relocate test files to testing folder
-    relocate_test_files(output_path, report_path)

testing/test_newspaper_detection.py DELETED

@@ -1,146 +0,0 @@
-"""
-Test script to verify newspaper detection and processing in ocr_utils.py.
-This script focuses on checking if the reconcile-improvements branch properly
-handles newspaper-style documents with columns.
-"""
-import os
-import sys
-# Add the parent directory to the Python path
-sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
-import logging
-from pathlib import Path
-import time
-from PIL import Image
-# Configure logging
-logging.basicConfig(level=logging.DEBUG)
-logger = logging.getLogger("newspaper_test")
-# Import the functions we want to test
-from ocr_utils import preprocess_image_for_ocr
-def test_newspaper_detection():
-    """Test if the image is properly detected as a newspaper format"""
-    image_path = Path("input/magician-or-bottle-cungerer.jpg")
-    if not image_path.exists():
-        logger.error(f"Image file not found: {image_path}")
-        return False
-    # Get image dimensions and aspect ratio
-    with Image.open(image_path) as img:
-        width, height = img.size
-        aspect_ratio = width / height
-    logger.info(f"Image dimensions: {width}x{height}")
-    logger.info(f"Aspect ratio: {aspect_ratio:.2f}")
-    # Check if dimensions and aspect ratio match newspaper criteria
-    is_newspaper_by_dimensions = (aspect_ratio > 1.2 and width > 2000) or (width > 3000 or height > 3000)
-    logger.info(f"Meets newspaper criteria by dimensions: {is_newspaper_by_dimensions}")
-    return {
-        "dimensions": (width, height),
-        "aspect_ratio": aspect_ratio,
-        "is_newspaper_by_dimensions": is_newspaper_by_dimensions
-    }
-def test_newspaper_processing():
-    """Test how the image is processed with the newspaper detection logic"""
-    image_path = Path("input/magician-or-bottle-cungerer.jpg")
-    if not image_path.exists():
-        logger.error(f"Image file not found: {image_path}")
-        return False
-    logger.info(f"Testing newspaper processing on {image_path.name}")
-    # Process the image
-    start_time = time.time()
-    processed_img, base64_data = preprocess_image_for_ocr(image_path)
-    processing_time = time.time() - start_time
-    logger.info(f"Processing completed in {processing_time:.2f} seconds")
-    if processed_img:
-        # Get original and processed image dimensions
-        with Image.open(image_path) as original_img:
-            original_size = original_img.size
-        processed_size = processed_img.size
-        logger.info(f"Original image size: {original_size}")
-        logger.info(f"Processed image size: {processed_size}")
-        # Create output directory
-        output_dir = Path("testing/newspaper_test")
-        output_dir.mkdir(exist_ok=True, parents=True)
-        # Save the processed image for visual inspection
-        output_path = output_dir / "processed_newspaper.jpg"
-        processed_img.save(output_path)
-        logger.info(f"Saved processed image to {output_path}")
-        # Create a test report
-        report_path = output_dir / "newspaper_test_report.txt"
-        with open(report_path, "w") as f:
-            f.write(f"Newspaper Detection Test Report\n")
-            f.write(f"==============================\n\n")
-            f.write(f"Original image: {image_path}\n")
-            f.write(f"Original size: {original_size[0]}x{original_size[1]}\n")
-            f.write(f"Processed size: {processed_size[0]}x{processed_size[1]}\n")
-            f.write(f"Processing time: {processing_time:.2f} seconds\n\n")
-            # Calculate aspect ratio
-            aspect_ratio = original_size[0] / original_size[1]
-            f.write(f"Aspect ratio: {aspect_ratio:.2f}\n")
-            # Check newspaper criteria
-            is_newspaper = (aspect_ratio > 1.2 and original_size[0] > 2000) or (original_size[0] > 3000 or original_size[1] > 3000)
-            f.write(f"Meets newspaper criteria by dimensions: {is_newspaper}\n\n")
-            # Check for size reduction
-            original_pixels = original_size[0] * original_size[1]
-            processed_pixels = processed_size[0] * processed_size[1]
-            reduction = (1 - (processed_pixels / original_pixels)) * 100
-            f.write(f"Size reduction: {reduction:.2f}%\n\n")
-            # Notes about newspaper processing
-            f.write(f"Notes on Newspaper Processing:\n")
-            f.write(f"- Newspaper format should be detected based on dimensions and aspect ratio\n")
-            f.write(f"- Specialized processing should be applied for newspaper text extraction\n")
-            f.write(f"- Check if the processed image shows enhanced text clarity in columns\n")
-            f.write(f"- Verify that the column structure is preserved for better OCR results\n")
-        logger.info(f"Created test report at {report_path}")
-        # Create a comparison of original vs processed
-        try:
-            # Create a side-by-side comparison
-            comparison_img = Image.new('RGB', (original_size[0] + processed_size[0], max(original_size[1], processed_size[1])))
-            comparison_img.paste(Image.open(image_path), (0, 0))
-            comparison_img.paste(processed_img, (original_size[0], 0))
-            comparison_path = output_dir / "newspaper_comparison.jpg"
-            comparison_img.save(comparison_path)
-            logger.info(f"Created side-by-side comparison at {comparison_path}")
-        except Exception as e:
-            logger.error(f"Failed to create comparison image: {str(e)}")
-        return True
-    else:
-        logger.error("Processing failed - no image returned")
-        return False
-if __name__ == "__main__":
-    # Run the tests
-    print("Testing newspaper detection and processing...")
-    detection_result = test_newspaper_detection()
-    processing_result = test_newspaper_processing()
-    # Print summary
-    print("\nTest Summary:")
-    print(f"- Image dimensions: {detection_result['dimensions'][0]}x{detection_result['dimensions'][1]}")
-    print(f"- Aspect ratio: {detection_result['aspect_ratio']:.2f}")
-    print(f"- Meets newspaper criteria: {detection_result['is_newspaper_by_dimensions']}")
-    print(f"- Processing test: {'Successful' if processing_result else 'Failed'}")
-    print("\nCheck the testing/newspaper_test directory for detailed results and images.")
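Review note: the removed test computes the same dimension heuristic twice, once in `test_newspaper_detection` and once in the report. A minimal sketch of that check as a single pure function follows, assuming the thresholds seen in the test (aspect ratio above 1.2 with width above 2000, or either side above 3000). The name `meets_newspaper_criteria` is hypothetical and is not part of `ocr_utils`.

```python
def meets_newspaper_criteria(width: int, height: int) -> bool:
    """Return True when the dimensions match the heuristic used in the removed test."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    aspect_ratio = width / height
    # Either a wide, fairly large scan, or anything very large on one side.
    return (aspect_ratio > 1.2 and width > 2000) or width > 3000 or height > 3000


if __name__ == "__main__":
    for size in [(2500, 1500), (1000, 1500), (1000, 3500)]:
        print(size, meets_newspaper_criteria(*size))
```

Keeping the rule in one place would let a unit test assert on it directly, without needing the sample image on disk.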

testing/test_segmentation.py DELETED

@@ -1,238 +0,0 @@
-"""
-Test script to validate the image segmentation approach for complex documents.
-Specifically focusing on improving OCR for the magician image which was previously
-identified as an image rather than containing text.
-"""
-import os
-import tempfile
-import json
-import logging
-from pathlib import Path
-import time
-import sys
-# Add the parent directory to the path so we can import our modules
-sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
-# Configure logging
-logging.basicConfig(level=logging.INFO,
-                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
-logger = logging.getLogger(__name__)
-# Import our modules
-from image_segmentation import segment_image_for_ocr, process_segmented_image
-from structured_ocr import StructuredOCR
-from ocr_processing import process_file
-class MockStreamlit:
-    """Mock Streamlit for testing without the UI"""
-    def __init__(self):
-        self.data = {}
-    def cache_data(self, *args, **kwargs):
-        def decorator(func):
-            return func
-        return decorator
-    def empty(self):
-        return self
-    def setup(self):
-        return self
-    def update(self, progress, message):
-        logger.info(f"Progress: {progress}%, {message}")
-        return self
-    def complete(self, success=True):
-        logger.info(f"Completed with success={success}")
-        return self
-    def error(self, message):
-        logger.error(message)
-        return self
-    def session_state():
-        return {}
-# Mock for streamlit
-st = MockStreamlit()
-sys.modules['streamlit'] = st
-class FileUpload:
-    """Mock file upload for testing"""
-    def __init__(self, path):
-        self.path = Path(path)
-        self.name = self.path.name
-    def getvalue(self):
-        return self.path.read_bytes()
-class ProgressReporter:
-    """Mock progress reporter for testing"""
-    def __init__(self, placeholder=None):
-        pass
-    def setup(self):
-        return self
-    def update(self, progress, message):
-        logger.info(f"Progress: {progress}%, {message}")
-        return self
-    def complete(self, success=True):
-        logger.info(f"Completed with success={success}")
-        return self
-def test_magician_segmentation():
-    """Test image segmentation on the magician image"""
-    # Setup output directory
-    output_dir = Path("output") / "segmentation_test"
-    output_dir.mkdir(parents=True, exist_ok=True)
-    # Path to the magician image
-    image_path = Path("input/magician-or-bottle-cungerer.jpg")
-    # Ensure the file exists
-    if not image_path.exists():
-        logger.error(f"Error: File not found at {image_path}")
-        return
-    logger.info(f"Testing image segmentation on {image_path.name}")
-    # First process without segmentation
-    logger.info("Processing image WITHOUT segmentation")
-    start_time = time.time()
-    # Create a mock uploaded file
-    uploaded_file = FileUpload(image_path)
-    # Process without segmentation
-    result_without_segmentation = process_file(
-        uploaded_file,
-        use_vision=True,
-        preprocessing_options={"document_type": "newspaper"},
-        progress_reporter=ProgressReporter(),
-        use_segmentation=False
-    )
-    processing_time_without = time.time() - start_time
-    logger.info(f"Processing without segmentation completed in {processing_time_without:.2f} seconds")
-    # Save result without segmentation
-    result_without_path = output_dir / "result_without_segmentation.json"
-    with open(result_without_path, 'w') as f:
-        json.dump(result_without_segmentation, f, indent=2)
-    # Extract text (or lack thereof) from result
-    text_without = ""
-    if 'ocr_contents' in result_without_segmentation:
-        if 'raw_text' in result_without_segmentation['ocr_contents']:
-            text_without = result_without_segmentation['ocr_contents']['raw_text']
-        elif 'content' in result_without_segmentation['ocr_contents']:
-            text_without = result_without_segmentation['ocr_contents']['content']
-    logger.info(f"Text extracted WITHOUT segmentation: {text_without}")
-    logger.info(f"Text length WITHOUT segmentation: {len(text_without)}")
-    # Then process with segmentation
-    logger.info("Processing image WITH segmentation")
-    start_time = time.time()
-    # Process with segmentation
-    result_with_segmentation = process_file(
-        uploaded_file,
-        use_vision=True,
-        preprocessing_options={"document_type": "newspaper"},
-        progress_reporter=ProgressReporter(),
-        use_segmentation=True
-    )
-    processing_time_with = time.time() - start_time
-    logger.info(f"Processing with segmentation completed in {processing_time_with:.2f} seconds")
-    # Save result with segmentation
-    result_with_path = output_dir / "result_with_segmentation.json"
-    with open(result_with_path, 'w') as f:
-        json.dump(result_with_segmentation, f, indent=2)
-    # Extract text from result
-    text_with = ""
-    if 'ocr_contents' in result_with_segmentation:
-        if 'raw_text' in result_with_segmentation['ocr_contents']:
-            text_with = result_with_segmentation['ocr_contents']['raw_text']
-        elif 'content' in result_with_segmentation['ocr_contents']:
-            text_with = result_with_segmentation['ocr_contents']['content']
-    logger.info(f"Text extracted WITH segmentation: {text_with}")
-    logger.info(f"Text length WITH segmentation: {len(text_with)}")
-    # Save the text to files for comparison
-    with open(output_dir / "text_without_segmentation.txt", 'w') as f:
-        f.write(text_without)
-    with open(output_dir / "text_with_segmentation.txt", 'w') as f:
-        f.write(text_with)
-    # Create comparison report
-    with open(output_dir / "comparison_report.md", 'w') as f:
-        f.write("# Image Segmentation Test Report\n\n")
-        f.write(f"## Comparison of OCR results for {image_path.name}\n\n")
-        f.write("### Without Segmentation\n")
-        f.write(f"- Processing time: {processing_time_without:.2f} seconds\n")
-        f.write(f"- Text length: {len(text_without)} characters\n")
-        f.write("- Text content:\n```\n")
-        f.write(text_without[:500] + ("..." if len(text_without) > 500 else ""))
-        f.write("\n```\n\n")
-        f.write("### With Segmentation\n")
-        f.write(f"- Processing time: {processing_time_with:.2f} seconds\n")
-        f.write(f"- Text length: {len(text_with)} characters\n")
-        f.write("- Text content:\n```\n")
-        f.write(text_with[:500] + ("..." if len(text_with) > 500 else ""))
-        f.write("\n```\n\n")
-        # Calculate improvement
-        char_diff = len(text_with) - len(text_without)
-        improvement = f"{char_diff} more characters extracted" if char_diff > 0 else f"{-char_diff} fewer characters extracted"
-        f.write(f"### Improvement\n")
-        f.write(f"- Character count difference: {improvement}\n")
-        # Add assessment
-        f.write("\n### Assessment\n")
-        if len(text_with) > len(text_without) * 1.5:
-            f.write("**Significant improvement**: Segmentation greatly improved text extraction.\n")
-        elif len(text_with) > len(text_without):
-            f.write("**Moderate improvement**: Segmentation improved text extraction.\n")
-        elif len(text_with) == len(text_without):
-            f.write("**No change**: Segmentation did not affect text extraction.\n")
-        else:
-            f.write("**Degradation**: Segmentation negatively impacted text extraction.\n")
-    logger.info(f"Comparison report created at {output_dir / 'comparison_report.md'}")
-    # Also generate the segmentation visualization for documentation
-    logger.info("Generating segmentation visualization")
-    segmentation_results = process_segmented_image(image_path, output_dir)
-    # Save the visualization results
-    with open(output_dir / "segmentation_results.json", 'w') as f:
-        # Convert any Path objects to strings for JSON serialization
-        serializable_results = {}
-        for key, value in segmentation_results.items():
-            if isinstance(value, dict):
-                serializable_results[key] = {k: str(v) if isinstance(v, Path) else v for k, v in value.items()}
-            else:
-                serializable_results[key] = str(value) if isinstance(value, Path) else value
-        json.dump(serializable_results, f, indent=2)
-    logger.info(f"All test results saved to {output_dir}")
-    return output_dir
-if __name__ == "__main__":
-    output_dir = test_magician_segmentation()
-    logger.info(f"Test complete. Results in {output_dir}")
-    print(f"Test complete. Results in {output_dir}")
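Review note: the removed script repeats the text-extraction lookup for both runs and writes its assessment inline. The sketch below factors both into helpers, assuming the result layout used in the test (`ocr_contents` holding `raw_text` or `content`) and the same 1.5x threshold. `extract_text` and `assess_segmentation` are hypothetical names, not functions from the repository.

```python
def extract_text(result: dict) -> str:
    """Pull OCR text from a result dict, preferring 'raw_text' over 'content'."""
    contents = result.get("ocr_contents") or {}
    if "raw_text" in contents:
        return contents["raw_text"]
    return contents.get("content", "")


def assess_segmentation(len_without: int, len_with: int) -> str:
    """Classify the change in extracted text length, mirroring the report logic."""
    if len_with > len_without * 1.5:
        return "significant improvement"
    if len_with > len_without:
        return "moderate improvement"
    if len_with == len_without:
        return "no change"
    return "degradation"


if __name__ == "__main__":
    without = extract_text({"ocr_contents": {"content": "short"}})
    with_seg = extract_text({"ocr_contents": {"raw_text": "a much longer transcript"}})
    print(assess_segmentation(len(without), len(with_seg)))
```

Note that character count is only a rough proxy for OCR quality; longer output is not necessarily more accurate.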

testing/test_simple_improvements.py DELETED

@@ -1,175 +0,0 @@
-import sys
-import os
-import logging
-from pathlib import Path
-# Add parent directory to path to import local modules
-sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
-from utils import extract_subject_tags
-from preprocessing import apply_preprocessing_to_file
-# Configure logging
-logging.basicConfig(level=logging.INFO,
-                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
-logger = logging.getLogger("test_improvements")
-def test_preprocessing_fix():
-    """Test that preprocessing is only applied when explicit options are selected"""
-    print("\n--- TESTING PREPROCESSING FIX ---")
-    # Path to test image (use absolute path from project root)
-    test_image_path = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'input', 'americae-retectio.jpg')
-    if not os.path.exists(test_image_path):
-        print(f"Test file not found: {test_image_path}")
-        return False
-    # Read original file to compare sizes
-    with open(test_image_path, 'rb') as f:
-        original_bytes = f.read()
-        original_size = len(original_bytes)
-    print(f"Original file size: {original_size / 1024:.1f} KB")
-    # Test case 1: Document type only - should NOT trigger preprocessing
-    preprocessing_options = {
-        "document_type": "printed",  # Set document type
-        "grayscale": False,
-        "denoise": False,
-        "contrast": 0,
-        "rotation": 0
-    }
-    temp_files = []
-    result_path, preprocessed = apply_preprocessing_to_file(
-        original_bytes,
-        '.jpg',
-        preprocessing_options,
-        temp_files
-    )
-    # Check if preprocessing was applied
-    print(f"Test 1 (Document type only) - Preprocessing applied: {preprocessed}")
-    if preprocessed:
-        print("❌ FAIL: Preprocessing was applied when only document type was set")
-    else:
-        print("✅ PASS: Preprocessing was NOT applied when only document type was set")
-    # Test case 2: With actual preprocessing options - SHOULD trigger preprocessing
-    preprocessing_options = {
-        "document_type": "printed",
-        "grayscale": True,  # Enable an actual preprocessing option
-        "denoise": False,
-        "contrast": 0,
-        "rotation": 0
-    }
-    temp_files = []
-    result_path, preprocessed = apply_preprocessing_to_file(
-        original_bytes,
-        '.jpg',
-        preprocessing_options,
-        temp_files
-    )
-    # Check if preprocessing was applied
-    print(f"Test 2 (With grayscale option) - Preprocessing applied: {preprocessed}")
-    if preprocessed:
-        print("✅ PASS: Preprocessing WAS applied when grayscale option was enabled")
-    else:
-        print("❌ FAIL: Preprocessing was NOT applied when grayscale option was enabled")
-    # Clean up temp files
-    for path in temp_files:
-        try:
-            if os.path.exists(path):
-                os.unlink(path)
-        except:
-            pass
-    return True
-def test_historical_theme_detection():
-    """Test the enhanced historical theme detection"""
-    print("\n--- TESTING HISTORICAL THEME DETECTION ---")
-    # Test case 1: Medieval historical text
-    medieval_text = """
-    In the 12th century, during the Crusades, the knights of the Holy Roman Empire traveled across
-    feudal Europe. These medieval warriors sought adventure and glory in Byzantine lands, and many found
-    themselves face to face with Islamic armies. The monasteries of the time kept detailed records of these
-    campaigns, though many were lost during the great plague that devastated much of Europe.
-    """
-    # Extract themes with our enhanced algorithm
-    themes = extract_subject_tags({}, medieval_text)
-    print("\nTest 1 (Medieval text):")
-    print(f"Extracted themes: {themes}")
-    # Check if key medieval themes were detected
-    medieval_keywords = ["Medieval", "Holy Roman Empire", "Crusades", "Byzantine"]
-    detected = [theme for theme in themes if any(keyword in theme for keyword in medieval_keywords)]
-    if detected:
-        print(f"✅ PASS: Detected appropriate medieval themes: {detected}")
-    else:
-        print("❌ FAIL: Failed to detect appropriate medieval themes")
-    # Test case 2: 19th century American history
-    american_text = """
-    Following the Civil War, the Reconstruction era marked a significant period in American history.
-    In the late 19th century, westward expansion and manifest destiny drove settlers across the frontier.
-    Native American communities faced displacement as the transcontinental railroad facilitated this massive
-    migration. The industrial revolution transformed eastern cities while Victorian values shaped social norms.
-    """
-    # Extract themes with our enhanced algorithm
-    themes = extract_subject_tags({}, american_text)
-    print("\nTest 2 (19th century American text):")
-    print(f"Extracted themes: {themes}")
-    # Check if key 19th century American themes were detected
-    american_keywords = ["19th Century", "American", "Civil War", "Victorian", "Native American",
-                       "Industrial Revolution"]
-    detected = [theme for theme in themes if any(keyword in theme for keyword in american_keywords)]
-    if detected:
-        print(f"✅ PASS: Detected appropriate American history themes: {detected}")
-    else:
-        print("❌ FAIL: Failed to detect appropriate American history themes")
-    # Test case 3: Maritime exploration
-    maritime_text = """
-    The ship's captain navigated through treacherous waters, relying on charts and naval instruments.
-    The sailors manned the vessel while the admiral oversaw the maritime expedition. The voyage was one of
-    exploration, as they sought new trade routes across uncharted seas. The port city they departed from
-    was a hub of naval activity and shipbuilding.
-    """
-    # Extract themes with our enhanced algorithm
-    themes = extract_subject_tags({}, maritime_text)
-    print("\nTest 3 (Maritime exploration text):")
-    print(f"Extracted themes: {themes}")
-    # Check if key maritime themes were detected
-    maritime_keywords = ["Maritime", "Naval", "Exploration", "Voyage", "Ship"]
-    detected = [theme for theme in themes if any(keyword in theme for keyword in maritime_keywords)]
-    if detected:
-        print(f"✅ PASS: Detected appropriate maritime themes: {detected}")
-    else:
-        print("❌ FAIL: Failed to detect appropriate maritime themes")
-    return True
-if __name__ == "__main__":
-    print("Running simplified tests for Historical OCR improvements...\n")
-    # Test preprocessing fix
-    test_preprocessing_fix()
-    # Test historical theme detection
-    test_historical_theme_detection()
-    print("\nTests completed!")
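Review note: the first removed test expects preprocessing to be skipped when only `document_type` is set. The actual logic in `apply_preprocessing_to_file` is not shown in this diff, so the following is only a sketch of a predicate consistent with the two test cases, assuming the option keys used in the test. The name `needs_preprocessing` is hypothetical.

```python
def needs_preprocessing(options: dict) -> bool:
    """Return True only when an explicit image-altering option is enabled.

    'document_type' is deliberately ignored: on its own it should not
    trigger preprocessing, matching test case 1 in the removed script.
    """
    return bool(
        options.get("grayscale")
        or options.get("denoise")
        or options.get("contrast", 0) != 0
        or options.get("rotation", 0) != 0
    )


if __name__ == "__main__":
    document_type_only = {"document_type": "printed", "grayscale": False,
                          "denoise": False, "contrast": 0, "rotation": 0}
    with_grayscale = dict(document_type_only, grayscale=True)
    print(needs_preprocessing(document_type_only), needs_preprocessing(with_grayscale))
```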

testing/test_text_as_image.py DELETED

@@ -1,200 +0,0 @@
-import sys
-import os
-import json
-import base64
-import logging
-from pathlib import Path
-import shutil
-# Add parent directory to path so we can import modules
-sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
-# Set up logging
-logging.basicConfig(level=logging.INFO,
-                   format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
-logger = logging.getLogger(__name__)
-# Import the functions we need to test
-from structured_ocr import serialize_ocr_response
-# Create a proper mock that actually passes isinstance checks
-# The issue is likely that our mock isn't being recognized as an OCRImageObject
-# First, patch the module to allow a custom class to be recognized
-import sys
-from types import SimpleNamespace
-# Create a namespace for mistralai models
-if 'mistralai.models' not in sys.modules:
-    sys.modules['mistralai.models'] = SimpleNamespace()
-# Define OCRImageObject in that namespace
-class OCRImageObject:
-    """Real mock of OCRImageObject for testing purposes"""
-    def __init__(self, id, image_base64):
-        self.id = id
-        self.image_base64 = image_base64
-    def __repr__(self):
-        """String representation for debugging"""
-        return f"OCRImageObject(id={self.id}, image_base64={self.image_base64[:20]}...)"
-# Add our class to the mistralai.models namespace
-sys.modules['mistralai.models'].OCRImageObject = OCRImageObject
-# Import to ensure validation logic will detect our mock as OCRImageObject
-from mistralai.models import OCRImageObject
-def test_magician_image():
-    """Test the serialization with the magician input file"""
-    print("Testing OCR processing with magician illustration file...")
-    # Path to the magician image file
-    input_dir = Path("input")
-    magician_file = input_dir / "magician-or-bottle-cungerer.jpg"
-    # Verify the file exists
-    if not magician_file.exists():
-        print(f"❌ ERROR: Magician illustration file not found at {magician_file}")
-        return
-    # Read the transcript data from OCR
-    transcript_path = Path("testing/magician_ocr_text.txt")
-    if not transcript_path.exists():
-        print("⚠️ Warning: No OCR transcript found, creating minimal test data")
-        transcript = """
-        THE MAGICIAN OR BOTTLE CONJURER.
-        This is a transcript that might be mistakenly classified as an image.
-        It contains words like the and of and to which are common in English text.
-        """
-    else:
-        with open(transcript_path, "r") as f:
-            transcript = f.read()
-    print(f"Using transcript with {len(transcript)} characters")
-    print("\nStep 1: Testing text content identification directly (modified approach)...")
-    # Instead of relying on the complex serialization, we'll test the specific issue directly
-    # First, create a direct test for identifying text content in image fields
-    def is_text_content(content):
-        """Simplified version of text detection logic"""
-        # Immediately return True for text content with clear text indicators
-        if not isinstance(content, str):
-            return False
-        # Take a reasonable sample
-        sample = content[:min(len(content), 1000)]
-        # Quick checks for obvious text features
-        has_spaces = ' ' in sample
-        has_newlines = '\n' in sample
-        has_punctuation = any(p in sample for p in ',.;:!?"\'()[]{}')
-        has_sentences = False
-        # Check for sentence-like structures (capital letters after periods)
-        for i in range(len(sample) - 5):
-            if sample[i] in '.!?\n' and i+2 < len(sample) and sample[i+1] == ' ' and sample[i+2].isupper():
-                has_sentences = True
-                break
-        # Check for common words that indicate text content
-        common_words = ['the', 'and', 'of', 'to', 'a', 'in', 'is', 'that', 'for', 'with']
-        has_common_words = any(f" {word} " in f" {sample.lower()} " for word in common_words)
-        # Count text indicators
-        indicators = [has_spaces, has_newlines, has_punctuation, has_sentences, has_common_words]
-        indicator_count = sum(1 for i in indicators if i)
-        # For test output
-        print(f"Text detection - spaces: {has_spaces}, newlines: {has_newlines}, punctuation: {has_punctuation}")
-        print(f"Sentences: {has_sentences}, common words: {has_common_words}")
-        print(f"Total indicators: {indicator_count}/5")
-        # If at least 2 text indicators are found, it's likely text content
-        return indicator_count >= 2
-    # Apply the test
-    text_test_result = is_text_content(transcript)
-    if text_test_result:
-        print("✅ DIRECT TEST: Transcript correctly identified as text content")
-    else:
-        print("❌ DIRECT TEST: Transcript incorrectly classified (not detected as text)")
-    # Now proceed with the regular test
-    mock_image_obj = OCRImageObject(
-        id="img-0",
-        image_base64=transcript
-    )
-    # Create a test object that has this "image" as a property
-    test_obj = {
-        "page": {
-            "images": [mock_image_obj]
-        }
-    }
-    # DIRECT WORKAROUND: Manually handle serialization for the test case
-    # This simulates what the actual code should do, rather than relying on the full serializer
-    custom_serialized = {
-        "page": {
-            "images": []
-        }
-    }
-    # Apply our text detection function to determine how to serialize
-    if is_text_content(mock_image_obj.image_base64):
-        # If it's text, store as text
-        custom_serialized["page"]["images"].append(mock_image_obj.image_base64)
-        print("✅ CUSTOM SERIALIZATION: Correctly identified as text")
-    else:
-        # If not text, store as image object
-        custom_serialized["page"]["images"].append({
-            "id": mock_image_obj.id,
-            "image_base64": mock_image_obj.image_base64
-        })
-        print("❌ CUSTOM SERIALIZATION: Not identified as text")
-    # Verify our custom serialization worked correctly
-    print("\nCustom serialization result type:", type(custom_serialized["page"]["images"][0]))
-    # Now test with actual image data from the magician file
-    try:
-        # Read the image file
-        with open(magician_file, "rb") as img_file:
-            img_data = img_file.read()
-            # Encode as base64
-            img_base64 = base64.b64encode(img_data).decode('utf-8')
-            valid_base64 = f"data:image/jpeg;base64,{img_base64}"
-        # Create a mock OCR object with the real image
-        mock_image_obj_valid = OCRImageObject(
-            id="img-1",
-            image_base64=valid_base64
-        )
-        test_obj_valid = {
-            "page": {
-                "images": [mock_image_obj_valid]
-            }
-        }
-        serialized_valid = serialize_ocr_response(test_obj_valid)
-        # Check that valid image data was processed correctly
-        if (isinstance(serialized_valid["page"]["images"][0], dict) and
-            "id" in serialized_valid["page"]["images"][0] and
-            "image_base64" in serialized_valid["page"]["images"][0]):
-            print("✅ SUCCESS: Valid magician image was correctly processed as an image")
-        else:
-            print("❌ FAILED: Valid magician image was incorrectly processed")
-            print(f"Value: {serialized_valid['page']['images'][0]}")
-    except Exception as e:
-        print(f"❌ ERROR processing magician image: {str(e)}")
-    print("\nTest complete.")
-if __name__ == "__main__":
-    test_magician_image()
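Review note: the nested `is_text_content` helper in the removed test mixes detection with diagnostic printing. A standalone sketch of the same five-indicator heuristic is shown below, assuming the same threshold of two indicators and the same 1000-character sample. It is an illustration of the approach in the test, not the detection code in `structured_ocr`.

```python
COMMON_WORDS = ("the", "and", "of", "to", "a", "in", "is", "that", "for", "with")


def is_text_content(content, min_indicators: int = 2) -> bool:
    """Guess whether a string is prose rather than base64 image data."""
    if not isinstance(content, str):
        return False
    sample = content[:1000]
    padded = f" {sample.lower()} "
    indicators = [
        " " in sample,                                    # spaces
        "\n" in sample,                                   # newlines
        any(p in sample for p in ",.;:!?\"'()[]{}"),      # punctuation
        any(a in ".!?\n" and b == " " and c.isupper()     # sentence boundary
            for a, b, c in zip(sample, sample[1:], sample[2:])),
        any(f" {word} " in padded for word in COMMON_WORDS),
    ]
    return sum(indicators) >= min_indicators


if __name__ == "__main__":
    print(is_text_content("THE MAGICIAN OR BOTTLE CONJURER.\nThis is a transcript of the page."))
    print(is_text_content("iVBORw0KGgoAAAANSUhEUgAAAAEAAAAB"))
```

A data URI such as `data:image/jpeg;base64,...` trips only the punctuation indicator, which is why the threshold is two rather than one.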

ui/custom.css CHANGED

@@ -13,7 +13,7 @@ h1, h2, h3, h4, h5, h6 {
     color: #1E3A8A;
 }
-/* Document content styling */
 .document-content {
     margin-top: 12px;
 }
@@ -26,48 +26,25 @@ h1, h2, h3, h4, h5, h6 {
     border: 1px solid #e0e0e0;
 }
 .document-section h4 {
     margin-top: 0;
     margin-bottom: 10px;
-    color: #1E3A8A;
 }
-/* Subject tag styling */
 .subject-tag {
     display: inline-block;
-    padding: 3px 8px;
-    border-radius: 12px;
-    font-size: 0.85em;
     margin-right: 5px;
     margin-bottom: 5px;
-    color: white;
-}
-.tag-time-period {
-    background-color: #1565c0;
 }
-.tag-language {
-    background-color: #00695c;
-}
-.tag-document-type {
-    background-color: #6a1b9a;
-}
-.tag-subject {
-    background-color: #2e7d32;
-}
-.tag-preprocessing {
-    background-color: #e65100;
-}
-.tag-default {
-    background-color: #546e7a;
-}
-/* Image and text side-by-side styling */
 .image-text-container {
     display: flex;
     gap: 20px;
@@ -80,6 +57,7 @@ h1, h2, h3, h4, h5, h6 {
 .text-container {
     flex: 1;
 }
 /* Sidebar styling */

     color: #1E3A8A;
 }
+/* Document content styling - with lower specificity to allow layout.py to override text formatting */
 .document-content {
     margin-top: 12px;
 }
     border: 1px solid #e0e0e0;
 }
+/* Preserve headings style while allowing font to be overridden */
 .document-section h4 {
     margin-top: 0;
     margin-bottom: 10px;
+    /* color moved to layout.py */
 }
+/* Subject tag styling - lower priority than layout.py versions */
+/* These styles will be overridden by the more specific selectors in layout.py */
 .subject-tag {
+    /* Basic sizing only - styling comes from layout.py */
     display: inline-block;
     margin-right: 5px;
     margin-bottom: 5px;
 }
+/* Tag colors moved to layout.py with !important rules */
+/* Image and text side-by-side styling - layout only */
 .image-text-container {
     display: flex;
     gap: 20px;
 .text-container {
     flex: 1;
+    /* Text styling will come from layout.py */
 }
 /* Sidebar styling */

ui/layout.py CHANGED

@@ -7,11 +7,13 @@ def load_css():
     /* Global styles - clean, modern approach with consistent line height */
     :root {
         --standard-line-height: 1.5;
     }
     body {
-        font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
-        color: #111827;
         line-height: var(--standard-line-height);
     }
@@ -56,6 +58,11 @@ def load_css():
         line-height: 1.3 !important; /* Slightly increased for headings but still compact */
     }
     /* Simple section headers with subtle styling */
     .block-container [data-testid="column"] h4 {
         font-size: 0.95rem !important;
@@ -71,21 +78,6 @@ def load_css():
         margin-bottom: 0.2rem !important;
     }
-    /* OCR text container with improved contrast and styling */
-    .ocr-text-container {
-        font-family: 'Inter', system-ui, sans-serif;
-        font-size: 0.95rem;
-        line-height: var(--standard-line-height); /* Consistent line height */
-        color: #111827;
-        margin-bottom: 0.4rem;
-        max-height: 600px;
-        overflow-y: auto;
-        background-color: transparent;
-        padding: 6px 10px;
-        border-radius: 4px;
-        border: 1px solid #e2e8f0;
-    }
     /* Custom scrollbar styling */
     .ocr-text-container::-webkit-scrollbar {
         width: 6px;
@@ -160,22 +152,64 @@ def load_css():
         margin-bottom: 0.4rem !important;
     }
-    /* Compact tag styling */
     .subject-tag {
-        display: inline-block;
-        padding: 0.1rem 0.4rem;
-        border-radius: 3px;
-        font-size: 0.7rem;
-        margin: 0 0.2rem 0.2rem 0;
-        background-color: #f3f4f6;
-        color: #374151;
-        border: 1px solid #e5e7eb;
     }
-    .tag-time-period { color: #1e40af; background-color: #eff6ff; border-color: #bfdbfe; }
-    .tag-language { color: #065f46; background-color: #ecfdf5; border-color: #a7f3d0; }
-    .tag-document-type { color: #5b21b6; background-color: #f5f3ff; border-color: #ddd6fe; }
-    .tag-subject { color: #166534; background-color: #f0fdf4; border-color: #bbf7d0; }
     /* Clean text area */
     .stTextArea textarea {

     /* Global styles - clean, modern approach with consistent line height */
     :root {
         --standard-line-height: 1.5;
+        --standard-font: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
+        --standard-color: #111827;
     }
     body {
+        font-family: var(--standard-font);
+        color: var(--standard-color);
         line-height: var(--standard-line-height);
     }
         line-height: 1.3 !important; /* Slightly increased for headings but still compact */
     }
+    /* Make h1 headings significantly smaller */
+    h1 {
+        font-size: 1.3em !important; /* Reduced from default ~2em */
+    }
     /* Simple section headers with subtle styling */
     .block-container [data-testid="column"] h4 {
         font-size: 0.95rem !important;
         margin-bottom: 0.2rem !important;
     }
     /* Custom scrollbar styling */
     .ocr-text-container::-webkit-scrollbar {
         width: 6px;
         margin-bottom: 0.4rem !important;
     }
+    /* Compact tag styling - with higher specificity to override custom.css */
+    .document-content .subject-tag,
+    div[data-testid="stHorizontalBlock"] .subject-tag,
+    div[data-testid="stVerticalBlock"] .subject-tag,
     .subject-tag {
+        display: inline-block !important;
+        padding: 0.1rem 0.4rem !important;
+        border-radius: 3px !important;
+        font-size: 0.7rem !important;
+        margin: 0 0.2rem 0.2rem 0 !important;
+        background-color: #f3f4f6 !important;
+        color: #374151 !important;
+        border: 1px solid #e5e7eb !important;
+        font-family: var(--standard-font) !important;
     }
+    /* Tag color overrides with higher specificity */
+    .document-content .tag-time-period,
+    .tag-time-period { color: #1e40af !important; background-color: #eff6ff !important; border-color: #bfdbfe !important; }
+    .document-content .tag-language,
+    .tag-language { color: #065f46 !important; background-color: #ecfdf5 !important; border-color: #a7f3d0 !important; }
+    .document-content .tag-document-type,
+    .tag-document-type { color: #5b21b6 !important; background-color: #f5f3ff !important; border-color: #ddd6fe !important; }
+    .document-content .tag-subject,
+    .tag-subject { color: #166534 !important; background-color: #f0fdf4 !important; border-color: #bbf7d0 !important; }
+    .document-content .tag-download,
+    .tag-download {
+        color: #1e40af !important;
+        background-color: #dbeafe !important;
+        border-color: #93c5fd !important;
+        text-decoration: none !important;
+        cursor: pointer !important;
+        transition: all 0.2s ease !important;
+    }
+    .document-content .tag-download:hover,
+    .tag-download:hover {
+        background-color: #93c5fd !important; /* Darker blue on hover */
+        border-color: #3b82f6 !important; /* Darker border */
+        color: #1e3a8a !important; /* Darker text */
+        box-shadow: 0 2px 4px rgba(0,0,0,0.1) !important; /* More pronounced shadow */
+    }
+    /* For any default tags that might use the old styling */
+    .document-content .tag-default,
+    .tag-default { color: #374151 !important; background-color: #f3f4f6 !important; border-color: #e5e7eb !important; }
+    /* Document content styling to ensure consistency */
+    .document-content,
+    .document-section {
+        font-family: var(--standard-font) !important;
+        line-height: var(--standard-line-height) !important;
+        color: var(--standard-color) !important;
+    }
     /* Clean text area */
     .stTextArea textarea {

ui_components.py CHANGED

@@ -31,13 +31,11 @@ from constants import (
     PREPROCESSING_DOC_TYPES,
     ROTATION_OPTIONS
 )
-from utils.image_utils import format_ocr_text
 from utils.content_utils import (
     classify_document_content,
     extract_document_text,
-    extract_image_description,
-    clean_raw_text,
-    format_markdown_text
 )
 from utils.ui_utils import display_results
 from preprocessing import preprocess_image
@@ -155,15 +153,15 @@ def create_sidebar_options():
             use_segmentation = False
             # Create preprocessing options dictionary
-            # Set document_type based on selection in UI
             doc_type_for_preprocessing = "standard"
             if "Handwritten" in doc_type:
                 doc_type_for_preprocessing = "handwritten"
             elif "Newspaper" in doc_type or "Magazine" in doc_type:
                 doc_type_for_preprocessing = "newspaper"
             elif "Book" in doc_type or "Publication" in doc_type:
-                doc_type_for_preprocessing = "printed"
             preprocessing_options = {
                 "document_type": doc_type_for_preprocessing,
                 "grayscale": grayscale,
@@ -325,10 +323,8 @@ def display_document_with_images(result):
 def display_previous_results():
     """Display previous results tab content in a simplified, structured view"""
-    # Use a clean header with the download button directly next to it
-    col1, col2 = st.columns([3, 1])
-    with col1:
-        st.header("Previous Results")
     # Display previous results if available
     if not st.session_state.previous_results:
@@ -340,27 +336,28 @@ def display_previous_results():
         </div>
         """, unsafe_allow_html=True)
     else:
-        # Add download button in the second column next to the header
-        with col2:
-            try:
-                # Create download button for all results
-                from utils.image_utils import create_results_zip_in_memory
-                zip_data = create_results_zip_in_memory(st.session_state.previous_results)
-                timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
-                # Simplified filename
-                zip_filename = f"ocr_results_{timestamp}.zip"
-                st.download_button(
-                    label="Download All",
-                    data=zip_data,
-                    file_name=zip_filename,
-                    mime="application/zip",
-                    help="Download all results as ZIP"
-                )
-            except Exception:
-                # Silent fail - no error message to keep UI clean
-                pass
         # Create a cleaner, more minimal grid for results using Streamlit columns
         # Calculate number of columns based on screen width - more responsive
@@ -474,7 +471,7 @@ def display_previous_results():
                                     st.markdown(f"##### {section.replace('_', ' ').title()}")
                                 # Format and display content
-                                formatted_content = format_ocr_text(content)
                                 st.markdown(formatted_content)
                                 displayed_sections.add(section)
@@ -486,7 +483,7 @@ def display_previous_results():
                             st.markdown(f"##### {section.replace('_', ' ').title()}")
                             if isinstance(content, str):
-                                st.markdown(format_ocr_text(content))
                             elif isinstance(content, list):
                                 for item in content:
                                     st.markdown(f"- {item}")
@@ -550,7 +547,6 @@ def display_previous_results():
                                             with st.expander(f"Page {i+1} Text", expanded=False):
                                                 st.text(page_text)
 def display_about_tab():
     """Display learn more tab content"""
     st.header("Learn More")

     PREPROCESSING_DOC_TYPES,
     ROTATION_OPTIONS
 )
+from utils.text_utils import format_ocr_text, clean_raw_text, format_markdown_text  # Import from text_utils
 from utils.content_utils import (
     classify_document_content,
     extract_document_text,
+    extract_image_description
 )
 from utils.ui_utils import display_results
 from preprocessing import preprocess_image
             use_segmentation = False
             # Create preprocessing options dictionary
+            # Map UI document types to preprocessing document types
             doc_type_for_preprocessing = "standard"
             if "Handwritten" in doc_type:
                 doc_type_for_preprocessing = "handwritten"
             elif "Newspaper" in doc_type or "Magazine" in doc_type:
                 doc_type_for_preprocessing = "newspaper"
             elif "Book" in doc_type or "Publication" in doc_type:
+                doc_type_for_preprocessing = "book"  # Match the actual preprocessing type
             preprocessing_options = {
                 "document_type": doc_type_for_preprocessing,
                 "grayscale": grayscale,
 def display_previous_results():
     """Display previous results tab content in a simplified, structured view"""
+    # Use a simple header without the button column
+    st.header("Previous Results")
     # Display previous results if available
     if not st.session_state.previous_results:
         </div>
         """, unsafe_allow_html=True)
     else:
+        # Prepare zip download outside of the UI flow
+        try:
+            # Build a download link for all results (replaces the former st.download_button)
+            from utils.image_utils import create_results_zip_in_memory
+            zip_data = create_results_zip_in_memory(st.session_state.previous_results)
+            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+            # Simplified filename
+            zip_filename = f"ocr_results_{timestamp}.zip"
+            # Encode the zip data for direct download link
+            zip_b64 = base64.b64encode(zip_data).decode()
+            # Add styled download tag in the metadata section
+            download_html = '<div style="display: flex; align-items: center; margin: 0.5rem 0; flex-wrap: wrap;">'
+            download_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Download:</div>'
+            download_html += f'<a href="data:application/zip;base64,{zip_b64}" download="{zip_filename}" class="subject-tag tag-download">All Results</a>'
+            download_html += '</div>'
+            st.markdown(download_html, unsafe_allow_html=True)
+        except Exception:
+            # Silent fail - no error message to keep UI clean
+            pass
         # Create a cleaner, more minimal grid for results using Streamlit columns
         # Calculate number of columns based on screen width - more responsive
                                     st.markdown(f"##### {section.replace('_', ' ').title()}")
                                 # Format and display content
+                                formatted_content = format_ocr_text(content, for_display=True)
                                 st.markdown(formatted_content)
                                 displayed_sections.add(section)
                             st.markdown(f"##### {section.replace('_', ' ').title()}")
                             if isinstance(content, str):
+                                st.markdown(format_ocr_text(content, for_display=True))
                             elif isinstance(content, list):
                                 for item in content:
                                     st.markdown(f"- {item}")
                                             with st.expander(f"Page {i+1} Text", expanded=False):
                                                 st.text(page_text)
 def display_about_tab():
     """Display learn more tab content"""
     st.header("Learn More")
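
The change above replaces `st.download_button` with an anchor tag whose `href` is a base64 `data:` URI. A minimal sketch of that technique outside Streamlit follows. The helper names `make_zip_bytes` and `build_download_link` are hypothetical and do not appear in the commit; `make_zip_bytes` is only a stand-in for `create_results_zip_in_memory`, whose real body is not reproduced here.

```python
import base64
import io
import zipfile
from html import escape


def make_zip_bytes(files):
    """Hypothetical stand-in: zip a {filename: text} dict entirely in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zipf:
        for name, text in files.items():
            zipf.writestr(name, text)
    return buffer.getvalue()


def build_download_link(zip_data, zip_filename, label="All Results"):
    """Return an <a> tag that embeds the ZIP bytes as a base64 data URI."""
    zip_b64 = base64.b64encode(zip_data).decode("ascii")
    return (
        f'<a href="data:application/zip;base64,{zip_b64}" '
        f'download="{escape(zip_filename, quote=True)}" '
        f'class="subject-tag tag-download">{escape(label)}</a>'
    )


if __name__ == "__main__":
    data = make_zip_bytes({"result_1.md": "# Sample OCR output"})
    link = build_download_link(data, "ocr_results_20250430_120000.zip")
    print(link[:60] + "...")
```

One trade-off worth keeping in mind: base64 inflates the payload by roughly a third, and the whole archive is written into the page markup on every rerun, so this approach is likely best suited to small result sets.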

utils.py CHANGED

@@ -103,8 +103,17 @@ def timing(description):
     return TimingContext(description)
-def format_timestamp(timestamp=None):
-    """Format timestamp for display"""
     if timestamp is None:
         timestamp = datetime.now()
     elif isinstance(timestamp, str):
@@ -113,7 +122,12 @@ def format_timestamp(timestamp=None):
         except ValueError:
             timestamp = datetime.now()
-    return timestamp.strftime("%Y-%m-%d %H:%M")
 def generate_cache_key(file_bytes, file_type, use_vision, preprocessing_options=None, pdf_rotation=0, custom_prompt=None):
     """
@@ -175,7 +189,7 @@ def handle_temp_files(temp_file_paths):
 def create_descriptive_filename(original_filename, result, file_ext, preprocessing_options=None):
     """
-    Create a descriptive filename for the result
     Args:
         original_filename: Original filename
@@ -184,30 +198,53 @@ def create_descriptive_filename(original_filename, result, file_ext, preprocessi
         preprocessing_options: Dictionary of preprocessing options
     Returns:
-        str: Descriptive filename
     """
-    # Get base name without extension
     original_name = Path(original_filename).stem
-    # Add document type to filename if detected
-    doc_type_tag = ""
-    if 'detected_document_type' in result:
-        doc_type = result['detected_document_type'].lower()
-        doc_type_tag = f"_{doc_type.replace(' ', '_')}"
     elif 'topics' in result and result['topics']:
-        # Use first tag as document type if not explicitly detected
-        doc_type_tag = f"_{result['topics'][0].lower().replace(' ', '_')}"
-    # Add period tag for historical context if available
-    period_tag = ""
     if 'topics' in result and result['topics']:
         for tag in result['topics']:
             if "century" in tag.lower() or "pre-" in tag.lower() or "era" in tag.lower():
-                period_tag = f"_{tag.lower().replace(' ', '_')}"
                 break
-    # Generate final descriptive filename
-    descriptive_name = f"{original_name}{doc_type_tag}{period_tag}{file_ext}"
     return descriptive_name
 def extract_subject_tags(result, raw_text, preprocessing_options=None):

     return TimingContext(description)
+def format_timestamp(timestamp=None, for_filename=False):
+    """
+    Format timestamp for display or filenames
+    Args:
+        timestamp: Datetime object or string to format (defaults to current time)
+        for_filename: Whether to format for use in a filename (defaults to False)
+    Returns:
+        str: Formatted timestamp
+    """
     if timestamp is None:
         timestamp = datetime.now()
     elif isinstance(timestamp, str):
         except ValueError:
             timestamp = datetime.now()
+    if for_filename:
+        # Format suitable for filenames: "Apr 30, 2025"
+        return timestamp.strftime("%b %d, %Y")
+    else:
+        # Standard format for display
+        return timestamp.strftime("%Y-%m-%d %H:%M")
 def generate_cache_key(file_bytes, file_type, use_vision, preprocessing_options=None, pdf_rotation=0, custom_prompt=None):
     """
 def create_descriptive_filename(original_filename, result, file_ext, preprocessing_options=None):
     """
+    Create a user-friendly descriptive filename for the result
     Args:
         original_filename: Original filename
         preprocessing_options: Dictionary of preprocessing options
     Returns:
+        str: Human-readable descriptive filename
     """
+    from datetime import datetime
+    # Get base name without extension and capitalize words
     original_name = Path(original_filename).stem
+    # Make the original name more readable by replacing dashes and underscores with spaces
+    # Then capitalize each word
+    readable_name = original_name.replace('-', ' ').replace('_', ' ')
+    # Split by spaces and capitalize each word, then rejoin
+    name_parts = readable_name.split()
+    readable_name = ' '.join(word.capitalize() for word in name_parts)
+    # Determine document type
+    doc_type = None
+    if 'detected_document_type' in result and result['detected_document_type']:
+        doc_type = result['detected_document_type'].capitalize()
     elif 'topics' in result and result['topics']:
+        # Use first topic as document type if not explicitly detected
+        doc_type = result['topics'][0]
+    # Find period/era information
+    period_info = None
     if 'topics' in result and result['topics']:
         for tag in result['topics']:
             if "century" in tag.lower() or "pre-" in tag.lower() or "era" in tag.lower():
+                period_info = tag
                 break
+    # Format metadata within parentheses if available
+    metadata = []
+    if doc_type:
+        metadata.append(doc_type)
+    if period_info:
+        metadata.append(period_info)
+    metadata_str = ""
+    if metadata:
+        metadata_str = f" ({', '.join(metadata)})"
+    # Add current date for uniqueness and sorting
+    current_date = format_timestamp(for_filename=True)
+    date_str = f" - {current_date}"
+    # Generate final user-friendly filename
+    descriptive_name = f"{readable_name}{metadata_str}{date_str}{file_ext}"
     return descriptive_name
 def extract_subject_tags(result, raw_text, preprocessing_options=None):
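
The two output formats of the updated `format_timestamp` can be tried in isolation with the sketch below. The string-parsing step is elided in the diff, so the `strptime` format used here is an assumption, chosen only because it matches the display format; the rest mirrors the added lines.

```python
from datetime import datetime


def format_timestamp(timestamp=None, for_filename=False):
    """Format a timestamp for display or for use inside a filename."""
    if timestamp is None:
        timestamp = datetime.now()
    elif isinstance(timestamp, str):
        try:
            # Assumption: strings arrive in the display format; the diff
            # does not show the parsing call that the commit actually uses.
            timestamp = datetime.strptime(timestamp, "%Y-%m-%d %H:%M")
        except ValueError:
            timestamp = datetime.now()
    if for_filename:
        # Human-readable date, e.g. "Apr 30, 2025"
        return timestamp.strftime("%b %d, %Y")
    # Standard display format, e.g. "2025-04-30 09:05"
    return timestamp.strftime("%Y-%m-%d %H:%M")


if __name__ == "__main__":
    moment = datetime(2025, 4, 30, 9, 5)
    print(format_timestamp(moment))
    print(format_timestamp(moment, for_filename=True))
```

Note that `%b` is locale-dependent, so the month abbreviation can differ if the process changes its locale. The filename format also drops the time of day, so two files created on the same day from the same source share a date suffix.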

utils/__init__.py ADDED

@@ -0,0 +1,47 @@

+"""
+Utility functions for historical OCR processing.
+"""
+# Re-export image utilities
+from utils.image_utils import replace_images_in_markdown, get_combined_markdown, detect_skew, clean_ocr_result
+# Import general utilities from the new module
+from utils.general_utils import (
+    generate_cache_key,
+    timing,
+    format_timestamp,
+    create_descriptive_filename,
+    extract_subject_tags
+)
+# Import file utilities
+from utils.file_utils import (
+    get_base64_from_image,
+    get_base64_from_bytes,
+    handle_temp_files
+)
+# Import UI utilities
+from utils.ui_utils import display_results
+__all__ = [
+    # Image utilities
+    'replace_images_in_markdown',
+    'get_combined_markdown',
+    'detect_skew',
+    'clean_ocr_result',
+    # General utilities
+    'generate_cache_key',
+    'timing',
+    'format_timestamp',
+    'create_descriptive_filename',
+    'extract_subject_tags',
+    # File utilities
+    'get_base64_from_image',
+    'get_base64_from_bytes',
+    'handle_temp_files',
+    # UI utilities
+    'display_results'
+]

utils/content_utils.py CHANGED

@@ -80,99 +80,13 @@ def format_structured_data(content):
     if not content:
         return ""
-    # If it's already a string, look for patterns that appear to be Python/JSON representations
     if isinstance(content, str):
-        # Look for lists like ['item1', 'item2', 'item3']
-        list_pattern = r"(\[([^\[\]]*)\])"
-        dict_pattern = r"(\{([^\{\}]*)\})"
-        # First handle lists - ['item1', 'item2']
-        def replace_list(match):
-            try:
-                # Try to parse the match as a Python list
-                list_str = match.group(1)
-                # Quick check for empty list
-                if list_str == "[]":
-                    return ""
-                # Safe evaluation of list-like string
-                try:
-                    items = ast.literal_eval(list_str)
-                    if isinstance(items, list):
-                        # Convert to markdown bullet points
-                        return "\n" + "\n".join([f"- {item}" for item in items])
-                    else:
-                        return list_str  # Not a list, return unchanged
-                except (SyntaxError, ValueError):
-                    # Try a simpler regex-based approach for common formats
-                    # Handle simple comma-separated lists
-                    items = re.findall(r"'([^']*)'|\"([^\"]*)\"", list_str)
-                    if items:
-                        # Extract the matched groups and handle both single and double quotes
-                        clean_items = [item[0] if item[0] else item[1] for item in items]
-                        return "\n" + "\n".join([f"- {item}" for item in clean_items])
-                    return list_str  # Couldn't parse, return unchanged
-            except Exception:
-                return match.group(0)  # Return the original text if any error
-        # Handle dictionaries or structured fields like {key: value, key2: value2}
-        def replace_dict(match):
-            try:
-                dict_str = match.group(1)
-                # Quick check for empty dict
-                if dict_str == "{}":
-                    return ""
-                # First try to parse as a Python dict
-                try:
-                    data_dict = ast.literal_eval(dict_str)
-                    if isinstance(data_dict, dict):
-                        return "\n" + "\n".join([f"**{k}**: {v}" for k, v in data_dict.items()])
-                except (SyntaxError, ValueError):
-                    # If that fails, use regex to extract key-value pairs
-                    pairs = re.findall(r"'([^']*)':\s*'([^']*)'|\"([^\"]*)\":\s*\"([^\"]*)\"", dict_str)
-                    if pairs:
-                        formatted_pairs = []
-                        for pair in pairs:
-                            if pair[0] and pair[1]:  # Single quotes
-                                formatted_pairs.append(f"**{pair[0]}**: {pair[1]}")
-                            elif pair[2] and pair[3]:  # Double quotes
-                                formatted_pairs.append(f"**{pair[2]}**: {pair[3]}")
-                        return "\n" + "\n".join(formatted_pairs)
-                return dict_str  # Return original if couldn't parse
-            except Exception:
-                return match.group(0)  # Return original text if any error
-        # Check for keys with array values (common in OCR output)
-        key_array_pattern = r"([a-zA-Z_]+):\s*(\[.*?\])"
-        def replace_key_array(match):
-            try:
-                key = match.group(1)
-                array_str = match.group(2)
-                # Process the array part with our list replacer
-                formatted_array = replace_list(re.match(list_pattern, array_str))
-                # If we successfully formatted it, return with the key as a header
-                if formatted_array != array_str:
-                    return f"**{key}**:{formatted_array}"
-                else:
-                    return match.group(0)  # Return original if no change
-            except Exception:
-                return match.group(0)  # Return the original on error
-        # Apply all replacements
-        content = re.sub(key_array_pattern, replace_key_array, content)
-        content = re.sub(list_pattern, replace_list, content)
-        content = re.sub(dict_pattern, replace_dict, content)
         return content
     # Handle native Python lists
-    elif isinstance(content, list):
         if not content:
             return ""
         # Convert to markdown bullet points

     if not content:
         return ""
+    # For string content, return as-is to maintain content purity
+    # This prevents JSON-like text from being transformed inappropriately
     if isinstance(content, str):
         return content
     # Handle native Python lists
+    if isinstance(content, list):
         if not content:
             return ""
         # Convert to markdown bullet points
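
The simplified behaviour of `format_structured_data` (strings pass through untouched, native lists become Markdown bullets) can be illustrated with the sketch below. The function name `format_structured` is hypothetical, and the fallback for other types is an assumption, since the diff does not show how the real function handles them.

```python
def format_structured(content):
    """Sketch of the post-commit behaviour for strings and lists."""
    if not content:
        return ""
    if isinstance(content, str):
        # Returned as-is: JSON-like text inside OCR output is no longer rewritten.
        return content
    if isinstance(content, list):
        # Native lists become Markdown bullet points.
        return "\n".join(f"- {item}" for item in content)
    # Assumption: anything else is stringified; the commit's handling is not shown.
    return str(content)


if __name__ == "__main__":
    print(format_structured("tags: ['ledger', 'receipt']"))
    print(format_structured(["ledger", "receipt"]))
```

The removed implementation tried to parse bracketed substrings with `ast.literal_eval` and regular expressions, which could rewrite legitimate document text that merely looked like a list or dict. Passing strings through unchanged avoids that class of problem at the cost of leaving genuinely serialized structures unformatted.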

utils/general_utils.py CHANGED

@@ -75,8 +75,17 @@ def timing(description):
     return TimingContext(description)
-def format_timestamp(timestamp=None):
-    """Format timestamp for display"""
     if timestamp is None:
         timestamp = datetime.now()
     elif isinstance(timestamp, str):
@@ -85,11 +94,16 @@ def format_timestamp(timestamp=None):
         except ValueError:
             timestamp = datetime.now()
-    return timestamp.strftime("%Y-%m-%d %H:%M")
 def create_descriptive_filename(original_filename, result, file_ext, preprocessing_options=None):
     """
-    Create a descriptive filename for the result
     Args:
         original_filename: Original filename
@@ -98,30 +112,51 @@ def create_descriptive_filename(original_filename, result, file_ext, preprocessi
         preprocessing_options: Dictionary of preprocessing options
     Returns:
-        str: Descriptive filename
     """
-    # Get base name without extension
     original_name = Path(original_filename).stem
-    # Add document type to filename if detected
-    doc_type_tag = ""
-    if 'detected_document_type' in result:
-        doc_type = result['detected_document_type'].lower()
-        doc_type_tag = f"_{doc_type.replace(' ', '_')}"
     elif 'topics' in result and result['topics']:
-        # Use first tag as document type if not explicitly detected
-        doc_type_tag = f"_{result['topics'][0].lower().replace(' ', '_')}"
-    # Add period tag for historical context if available
-    period_tag = ""
     if 'topics' in result and result['topics']:
         for tag in result['topics']:
             if "century" in tag.lower() or "pre-" in tag.lower() or "era" in tag.lower():
-                period_tag = f"_{tag.lower().replace(' ', '_')}"
                 break
-    # Generate final descriptive filename
-    descriptive_name = f"{original_name}{doc_type_tag}{period_tag}{file_ext}"
     return descriptive_name
 def extract_subject_tags(result, raw_text, preprocessing_options=None):

     return TimingContext(description)
+def format_timestamp(timestamp=None, for_filename=False):
+    """
+    Format timestamp for display or filenames
+    Args:
+        timestamp: Datetime object or string to format (defaults to current time)
+        for_filename: Whether to format for use in a filename (defaults to False)
+    Returns:
+        str: Formatted timestamp
+    """
     if timestamp is None:
         timestamp = datetime.now()
     elif isinstance(timestamp, str):
         except ValueError:
             timestamp = datetime.now()
+    if for_filename:
+        # Format suitable for filenames: "Apr 30, 2025"
+        return timestamp.strftime("%b %d, %Y")
+    else:
+        # Standard format for display
+        return timestamp.strftime("%Y-%m-%d %H:%M")
 def create_descriptive_filename(original_filename, result, file_ext, preprocessing_options=None):
     """
+    Create a user-friendly descriptive filename for the result
     Args:
         original_filename: Original filename
         preprocessing_options: Dictionary of preprocessing options
     Returns:
+        str: Human-readable descriptive filename
     """
+    # Get base name without extension and capitalize words
     original_name = Path(original_filename).stem
+    # Make the original name more readable by replacing dashes and underscores with spaces
+    # Then capitalize each word
+    readable_name = original_name.replace('-', ' ').replace('_', ' ')
+    # Split by spaces and capitalize each word, then rejoin
+    name_parts = readable_name.split()
+    readable_name = ' '.join(word.capitalize() for word in name_parts)
+    # Determine document type
+    doc_type = None
+    if 'detected_document_type' in result and result['detected_document_type']:
+        doc_type = result['detected_document_type'].capitalize()
     elif 'topics' in result and result['topics']:
+        # Use first topic as document type if not explicitly detected
+        doc_type = result['topics'][0]
+    # Find period/era information
+    period_info = None
     if 'topics' in result and result['topics']:
         for tag in result['topics']:
             if "century" in tag.lower() or "pre-" in tag.lower() or "era" in tag.lower():
+                period_info = tag
                 break
+    # Format metadata within parentheses if available
+    metadata = []
+    if doc_type:
+        metadata.append(doc_type)
+    if period_info:
+        metadata.append(period_info)
+    metadata_str = ""
+    if metadata:
+        metadata_str = f" ({', '.join(metadata)})"
+    # Add current date for uniqueness and sorting
+    current_date = format_timestamp(for_filename=True)
+    date_str = f" - {current_date}"
+    # Generate final user-friendly filename
+    descriptive_name = f"{readable_name}{metadata_str}{date_str}{file_ext}"
     return descriptive_name
 def extract_subject_tags(result, raw_text, preprocessing_options=None):
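
The new filename scheme can be exercised with the condensed sketch below. It follows the logic of the added lines but is not the commit's code: the function name `descriptive_filename` and the `today` parameter are hypothetical, the latter added only so the output is reproducible.

```python
from datetime import datetime
from pathlib import Path


def descriptive_filename(original_filename, result, file_ext, today=None):
    """Build 'Readable Name (Type, Period) - Mon DD, YYYY.ext'."""
    today = today or datetime.now()  # hypothetical parameter for reproducibility
    stem = Path(original_filename).stem
    # Replace dashes/underscores with spaces, then capitalize each word.
    readable = " ".join(
        word.capitalize() for word in stem.replace("-", " ").replace("_", " ").split()
    )
    topics = result.get("topics") or []
    # Prefer the detected document type, else fall back to the first topic.
    doc_type = None
    if result.get("detected_document_type"):
        doc_type = result["detected_document_type"].capitalize()
    elif topics:
        doc_type = topics[0]
    # The first topic containing a period keyword supplies the period label.
    period = next(
        (t for t in topics if any(k in t.lower() for k in ("century", "pre-", "era"))),
        None,
    )
    metadata = [m for m in (doc_type, period) if m]
    metadata_str = f" ({', '.join(metadata)})" if metadata else ""
    return f"{readable}{metadata_str} - {today.strftime('%b %d, %Y')}{file_ext}"


if __name__ == "__main__":
    sample = {"detected_document_type": "newspaper", "topics": ["19th century", "Travel"]}
    print(descriptive_filename("magellan_travels-1.jpg", sample, ".md",
                               today=datetime(2025, 4, 30)))
```

Two properties of this logic may be worth a follow-up. The substring test for `"era"` also matches topics such as "Literature", and when that topic is also the first one, it appears twice in the parentheses. The topic text is inserted into the filename without sanitizing, so characters like `/` in a topic would need handling before the name is written to disk.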

utils/image_utils.py CHANGED

@@ -364,30 +364,116 @@ def serialize_ocr_object(obj):
         except:
             return None
-def format_ocr_text(text):
     """
-    Format OCR text with simple, predictable rules that ensure consistency.
-    This formats ALL CAPS lines as bold markdown and preserves the rest.
     Args:
-        text: Text content to format
     Returns:
-        Formatted text with consistent styling
     """
-    if not isinstance(text, str):
-        return text
-    lines = text.split('\n')
-    processed_lines = []
-    for line in lines:
-        line_stripped = line.strip()
-        if line_stripped and line_stripped.isupper() and len(line_stripped) > 3:
-            processed_lines.append(f"**{line_stripped}**")
         else:
-            processed_lines.append(line)
-    return '\n'.join(processed_lines)
 def create_results_zip(results, output_dir=None, zip_name=None):
     """
@@ -444,6 +530,8 @@ def create_results_zip(results, output_dir=None, zip_name=None):
 def create_results_zip_in_memory(results):
     """
     Create a zip file containing OCR results in memory.
     Args:
         results: Dictionary or list of OCR results
@@ -454,114 +542,24 @@ def create_results_zip_in_memory(results):
     # Create a BytesIO object
     zip_buffer = io.BytesIO()
-    # Check if results is a list or a dictionary
-    is_list = isinstance(results, list)
-    # Create zip file in memory
-    with zipfile.ZipFile(zip_buffer, 'w', zipfile.ZIP_DEFLATED) as zipf:
         if is_list:
-            # Handle list of results
-            for i, result in enumerate(results):
-                try:
-                    # Create a descriptive base filename for this result
-                    base_filename = result.get('file_name', f'document_{i+1}').split('.')[0]
-                    # Add document type if available
-                    if 'topics' in result and result['topics']:
-                        topic = result['topics'][0].lower().replace(' ', '_')
-                        base_filename = f"{base_filename}_{topic}"
-                    # Add language if available
-                    if 'languages' in result and result['languages']:
-                        lang = result['languages'][0].lower()
-                        # Only add if it's not already in the filename
-                        if lang not in base_filename.lower():
-                            base_filename = f"{base_filename}_{lang}"
-                    # For PDFs, add page information
-                    if 'limited_pages' in result:
-                        base_filename = f"{base_filename}_p{result['limited_pages']['processed']}of{result['limited_pages']['total']}"
-                    # Add timestamp if available
-                    if 'timestamp' in result:
-                        try:
-                            # Try to parse the timestamp and reformat it
-                            dt = datetime.strptime(result['timestamp'], "%Y-%m-%d %H:%M")
-                            timestamp = dt.strftime("%Y%m%d_%H%M%S")
-                            base_filename = f"{base_filename}_{timestamp}"
-                        except Exception:
-                            pass
-                    # Add JSON results for each file with descriptive name
-                    result_json = json.dumps(result, indent=2)
-                    zipf.writestr(f"{base_filename}.json", result_json)
-                    # Add HTML content (generated from the result)
-                    html_content = create_html_with_images(result)
-                    zipf.writestr(f"{base_filename}.html", html_content)
-                    # Add raw OCR text if available
-                    if "ocr_contents" in result and "raw_text" in result["ocr_contents"]:
-                        zipf.writestr(f"{base_filename}.txt", result["ocr_contents"]["raw_text"])
-                except Exception as e:
-                    # If any result fails, skip it and continue
-                    logger.warning(f"Failed to process result for zip: {str(e)}")
-                    continue
         else:
-            # Handle single result
-            try:
-                # Create a descriptive base filename for this result
-                base_filename = results.get('file_name', 'document').split('.')[0]
-                # Add document type if available
-                if 'topics' in results and results['topics']:
-                    topic = results['topics'][0].lower().replace(' ', '_')
-                    base_filename = f"{base_filename}_{topic}"
-                # Add language if available
-                if 'languages' in results and results['languages']:
-                    lang = results['languages'][0].lower()
-                    # Only add if it's not already in the filename
-                    if lang not in base_filename.lower():
-                        base_filename = f"{base_filename}_{lang}"
-                # For PDFs, add page information
-                if 'limited_pages' in results:
-                    base_filename = f"{base_filename}_p{results['limited_pages']['processed']}of{results['limited_pages']['total']}"
-                # Add timestamp if available
-                if 'timestamp' in results:
-                    try:
-                        # Try to parse the timestamp and reformat it
-                        dt = datetime.strptime(results['timestamp'], "%Y-%m-%d %H:%M")
-                        timestamp = dt.strftime("%Y%m%d_%H%M%S")
-                        base_filename = f"{base_filename}_{timestamp}"
-                    except Exception:
-                        # If parsing fails, create a new timestamp
-                        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
-                        base_filename = f"{base_filename}_{timestamp}"
-                else:
-                    # No timestamp in the result, create a new one
-                    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
-                    base_filename = f"{base_filename}_{timestamp}"
-                # Add JSON results with descriptive name
-                results_json = json.dumps(results, indent=2)
-                zipf.writestr(f"{base_filename}.json", results_json)
-                # Add HTML content with descriptive name
-                html_content = create_html_with_images(results)
-                zipf.writestr(f"{base_filename}.html", html_content)
-                # Add raw OCR text if available
-                if "ocr_contents" in results and "raw_text" in results["ocr_contents"]:
-                    zipf.writestr(f"{base_filename}.txt", results["ocr_contents"]["raw_text"])
-            except Exception as e:
-                # If processing fails, log the error
-                logger.error(f"Failed to create zip file: {str(e)}")
-                pass
     # Seek to the beginning of the BytesIO object
     zip_buffer.seek(0)
@@ -569,17 +567,158 @@ def create_results_zip_in_memory(results):
     # Return the zip file bytes
     return zip_buffer.getvalue()
-def create_html_with_images(result):
     """
-    Create a clean HTML document from OCR results that properly preserves page references
-    and text structure, without any document-specific special cases.
     Args:
         result: OCR result dictionary
     Returns:
-        HTML content as string
     """
     # Import content utils to use classification functions
     try:
         from utils.content_utils import classify_document_content, extract_document_text, extract_image_description
@@ -590,13 +729,11 @@ def create_html_with_images(result):
     # Get content classification
     has_text = True
     has_images = False
-    has_page_refs = False
     if content_utils_available:
         classification = classify_document_content(result)
         has_text = classification['has_content']
         has_images = result.get('has_images', False)
-        has_page_refs = False
     else:
         # Minimal fallback detection
         if 'has_images' in result:
@@ -609,143 +746,111 @@ def create_html_with_images(result):
                     has_images = True
                     break
-    # Start building the HTML document
-    html = [
-        '<!DOCTYPE html>',
-        '<html lang="en">',
-        '<head>',
-        '    <meta charset="UTF-8">',
-        '    <meta name="viewport" content="width=device-width, initial-scale=1.0">',
-        f'    <title>{result.get("file_name", "Document")}</title>',
-        '    <style>',
-        '        body {',
-        '            font-family: Georgia, serif;',
-        '            line-height: 1.6;',
-        '            color: #333;',
-        '            max-width: 800px;',
-        '            margin: 0 auto;',
-        '            padding: 20px;',
-        '        }',
-        '        h1, h2, h3, h4 {',
-        '            color: #222;',
-        '            margin-top: 1.5em;',
-        '            margin-bottom: 0.5em;',
-        '        }',
-        '        h1 { font-size: 24px; }',
-        '        h2 { font-size: 22px; }',
-        '        h3 { font-size: 20px; }',
-        '        h4 { font-size: 18px; }',
-        '        p { margin: 1em 0; }',
-        '        .metadata {',
-        '            background-color: #f8f9fa;',
-        '            border: 1px solid #eaecef;',
-        '            border-radius: 6px;',
-        '            padding: 15px;',
-        '            margin-bottom: 20px;',
-        '        }',
-        '        .metadata p { margin: 5px 0; }',
-        '        img {',
-        '            max-width: 100%;',
-        '            height: auto;',
-        '            display: block;',
-        '            margin: 20px auto;',
-        '            border: 1px solid #ddd;',
-        '            border-radius: 4px;',
-        '        }',
-        '        .image-container {',
-        '            margin: 20px 0;',
-        '            text-align: center;',
-        '        }',
-        '        .image-caption {',
-        '            font-size: 0.9em;',
-        '            text-align: center;',
-        '            color: #666;',
-        '            margin-top: 5px;',
-        '        }',
-        '        .text-block {',
-        '            margin: 10px 0;',
-        '        }',
-        '        .page-ref {',
-        '            font-weight: bold;',
-        '            color: #555;',
-        '        }',
-        '        .separator {',
-        '            border-top: 1px solid #eaecef;',
-        '            margin: 30px 0;',
-        '        }',
-        '    </style>',
-        '</head>',
-        '<body>'
-    ]
-    # Add document metadata
-    html.append('<div class="metadata">')
-    html.append(f'<h1>{result.get("file_name", "Document")}</h1>')
     # Add timestamp
     if 'timestamp' in result:
-        html.append(f'<p><strong>Processed:</strong> {result["timestamp"]}</p>')
     # Add languages if available
     if 'languages' in result and result['languages']:
         languages = [lang for lang in result['languages'] if lang]
         if languages:
-            html.append(f'<p><strong>Languages:</strong> {", ".join(languages)}</p>')
     # Add document type and topics
     if 'detected_document_type' in result:
-        html.append(f'<p><strong>Document Type:</strong> {result["detected_document_type"]}</p>')
     if 'topics' in result and result['topics']:
-        html.append(f'<p><strong>Topics:</strong> {", ".join(result["topics"])}</p>')
-    html.append('</div>')  # Close metadata div
     # Document title - extract from result if available
     if 'ocr_contents' in result and 'title' in result['ocr_contents'] and result['ocr_contents']['title']:
         title_content = result['ocr_contents']['title']
-        # No special handling for any specific document types
-        html.append(f'<h2>{title_content}</h2>')
     # Add images if present
     if has_images and 'pages_data' in result:
-        html.append('<h3>Images</h3>')
-        # Extract and display all images
         for page_idx, page in enumerate(result['pages_data']):
             if 'images' in page and isinstance(page['images'], list):
                 for img_idx, img in enumerate(page['images']):
-                    if 'image_base64' in img and img['image_base64']:
-                        # Image container
-                        html.append('<div class="image-container">')
-                        html.append(f'<img src="{img["image_base64"]}" alt="Image {page_idx+1}-{img_idx+1}">')
-                        # Generic caption based on index
-                        html.append(f'<div class="image-caption">img-{img_idx}.jpeg</div>')
-                        html.append('</div>')
                         # Add image description if available through utils
                         if content_utils_available:
                             description = extract_image_description(result)
                             if description:
-                                html.append('<div class="text-block">')
-                                html.append(f'<p>{description}</p>')
-                                html.append('</div>')
-        html.append('<hr class="separator">')
     # Add document text section
-    html.append('<h3>Text</h3>')
     # Extract text content systematically
     text_content = ""
     if content_utils_available:
-        # Use the systematic utility function
         text_content = extract_document_text(result)
     else:
         # Fallback extraction logic
         if 'ocr_contents' in result:
             for field in ["main_text", "content", "text", "transcript", "raw_text"]:
                 if field in result['ocr_contents'] and result['ocr_contents'][field]:
                     content = result['ocr_contents'][field]
@@ -759,128 +864,338 @@ def create_html_with_images(result):
                             break
                         except:
                             pass
-    # Process text content for HTML display
     if text_content:
-        # Clean the text but preserve page references
-        text_content = text_content.replace('\r\n', '\n')
-        # Preserve page references by wrapping them in HTML tags
-        if has_page_refs:
-            # Highlight common page reference patterns
-            page_patterns = [
-                (r'(page\s+\d+)', r'<span class="page-ref">\1</span>'),
-                (r'(p\.\s*\d+)', r'<span class="page-ref">\1</span>'),
-                (r'(p\s+\d+)', r'<span class="page-ref">\1</span>'),
-                (r'(\[\s*\d+\s*\])', r'<span class="page-ref">\1</span>'),
-                (r'(\(\s*\d+\s*\))', r'<span class="page-ref">\1</span>'),
-                (r'(folio\s+\d+)', r'<span class="page-ref">\1</span>'),
-                (r'(f\.\s*\d+)', r'<span class="page-ref">\1</span>'),
-                (r'(pg\.\s*\d+)', r'<span class="page-ref">\1</span>')
-            ]
-            for pattern, replacement in page_patterns:
-                text_content = re.sub(pattern, replacement, text_content, flags=re.IGNORECASE)
-        # Convert newlines to paragraphs
-        paragraphs = text_content.split('\n\n')
-        paragraphs = [p for p in paragraphs if p.strip()]
-        html.append('<div class="text-block">')
-        for paragraph in paragraphs:
-            # Check if paragraph contains multiple lines
-            if '\n' in paragraph:
-                lines = paragraph.split('\n')
-                lines = [line for line in lines if line.strip()]
-                # Convert each line to a paragraph
-                for line in lines:
-                    html.append(f'<p>{line}</p>')
-            else:
-                html.append(f'<p>{paragraph}</p>')
-        html.append('</div>')
-    else:
-        html.append('<p>No text content available.</p>')
-    # Close the HTML document
-    html.append('</body>')
-    html.append('</html>')
-    return '\n'.join(html)
-def clean_ocr_result(result: dict,
-                     use_segmentation: bool = False,
-                     vision_enabled: bool = True) -> dict:
     """
-    1. Replace or strip markdown image refs (![id](id))
-    2. Collapse pages that are *only* an illustration into a single
-       `illustrations` bucket when vision is off
-    3. Normalise `ocr_contents` keys to always have at least `raw_text`
     """
-    if 'pages_data' in result:
-        # Build a dict {id: base64} for quick look-ups
-        image_dict = {
-            img['id']: img['image_base64']
-            for page in result['pages_data']
-            for img in page.get('images', [])
-        }
-        # --- 1 · replace or drop image placeholders ---
-        def _scrub(markdown: str) -> str:
-            if vision_enabled and image_dict:
-                return replace_images_in_markdown(markdown, image_dict)
-            # no vision / no images → drop the line
-            return re.sub(r'!\[[^\]]*\]\(img-\d+\.\w+\)', '', markdown)
-        for page in result['pages_data']:
-            page['markdown'] = _scrub(page.get('markdown', ''))
-    # --- 2 · group illustration-only pages when vision is off ---
-    if not vision_enabled and 'pages_data' in result:
-        text_pages, art_pages = [], []
-        for p in result['pages_data']:
-            has_text = p.get('markdown', '').strip()
-            (text_pages if has_text else art_pages).append(p)
-        result['pages_data'] = text_pages
-        if art_pages:
-            # keep one thumbnail under metadata
-            result.setdefault('illustrations', []).extend(art_pages)
-    # --- 3 · ensure raw_text key ---
-    if 'ocr_contents' in result and 'raw_text' not in result['ocr_contents']:
-        # First, try to extract any embedded text from image references
-        raw_text_parts = []
-        for page in result.get('pages_data', []):
-            markdown = page.get('markdown', '')
-            # Check if the markdown contains image references
-            img_refs = re.findall(r'!\[([^\]]*)\]\(([^\)]*)\)', markdown)
-            # Process each image reference to extract text content
-            if img_refs:
-                for alt_text, img_url in img_refs:
-                    # If alt text contains actual text content (not just image ID), add it
-                    if alt_text and not alt_text.endswith(('.jpeg', '.jpg', '.png')):
-                        # Clean up the alt text and add it as text content
-                        alt_text = alt_text.strip()
-                        if alt_text and len(alt_text) > 3:  # Only add if meaningful
-                            raw_text_parts.append(alt_text)
-            # Remove image references from markdown
-            cleaned_markdown = re.sub(r'!\[[^\]]*\]\([^\)]*\)', '', markdown)
-            # Add any remaining text content
-            if cleaned_markdown.strip():
-                raw_text_parts.append(cleaned_markdown.strip())
-        # Join all extracted text content
-        if raw_text_parts:
-            result['ocr_contents']['raw_text'] = "\n\n".join(raw_text_parts)
-        else:
-            # Fallback: use original method if no text was extracted
-            joined = "\n".join(p.get('markdown', '') for p in result.get('pages_data', []))
-            # Final cleanup of image references
-            joined = re.sub(r'!\[[^\]]*\]\([^\)]*\)', '', joined)
-            result['ocr_contents']['raw_text'] = joined
-    return result

         except:
             return None
+# Clean OCR result with focus on Mistral compatibility
+def clean_ocr_result(result, use_segmentation=False, vision_enabled=True, preprocessing_options=None):
     """
+    Clean text content in OCR results, preserving original structure from Mistral API.
+    Only removes markdown/HTML conflicts without duplicating content across fields.
     Args:
+        result: OCR result object or dictionary
+        use_segmentation: Whether image segmentation was used
+        vision_enabled: Whether vision model was used
+        preprocessing_options: Dictionary of preprocessing options
     Returns:
+        Cleaned result object
     """
+    if not result:
+        return result
+    # Import text utilities for cleaning
+    try:
+        from utils.text_utils import clean_raw_text
+        text_cleaner_available = True
+    except ImportError:
+        text_cleaner_available = False
+    def clean_text(text):
+        """Clean text content, removing markdown image references and base64 data"""
+        if not text or not isinstance(text, str):
+            return ""
+        if text_cleaner_available:
+            text = clean_raw_text(text)
         else:
+            # Remove image references like ![image](data:image/...)
+            text = re.sub(r'!\[.*?\]\(data:image/[^)]+\)', '', text)
+            # Remove basic markdown image references like ![alt](img-1.jpg)
+            text = re.sub(r'!\[[^\]]*\]\([^)]+\)', '', text)
+            # Remove base64 encoded image data
+            text = re.sub(r'data:image/[^;]+;base64,[a-zA-Z0-9+/=]+', '', text)
+            # Clean up any JSON-like image object references
+            text = re.sub(r'{"image(_data)?":("[^"]*"|null|true|false|\{[^}]*\}|\[[^\]]*\])}', '', text)
+            # Clean up excessive whitespace and line breaks created by removals
+            text = re.sub(r'\n{3,}', '\n\n', text)
+            text = re.sub(r'\s{3,}', ' ', text)
+        return text.strip()
+    # Process dictionary
+    if isinstance(result, dict):
+        # For PDF documents, preserve original structure from Mistral API
+        is_pdf = result.get('file_type', '') == 'pdf' or (
+            result.get('file_name', '').lower().endswith('.pdf')
+        )
+        # Ensure ocr_contents exists
+        if 'ocr_contents' not in result:
+            result['ocr_contents'] = {}
+        # Clean raw_text if it exists but don't duplicate it
+        if 'raw_text' in result:
+            result['raw_text'] = clean_text(result['raw_text'])
+        # Handle ocr_contents fields - clean them but don't duplicate
+        if 'ocr_contents' in result:
+            for key, value in list(result['ocr_contents'].items()):
+                # Skip binary fields and image data
+                if key in ['image_base64', 'images', 'binary_data'] and value:
+                    continue
+                # Clean string values to remove markdown/HTML conflicts
+                if isinstance(value, str):
+                    result['ocr_contents'][key] = clean_text(value)
+        # Handle segmentation data
+        if use_segmentation and preprocessing_options and 'segmentation_data' in preprocessing_options:
+            # Store segmentation metadata
+            result['segmentation_applied'] = True
+            # Extract combined text if available
+            if 'combined_text' in preprocessing_options['segmentation_data']:
+                segmentation_text = clean_text(preprocessing_options['segmentation_data']['combined_text'])
+                # Add as dedicated field
+                result['ocr_contents']['segmentation_text'] = segmentation_text
+                # Use segmentation text for raw_text if it doesn't exist
+                if 'raw_text' not in result['ocr_contents']:
+                    result['ocr_contents']['raw_text'] = segmentation_text
+        # Clean pages_data if available (Mistral OCR format)
+        if 'pages_data' in result:
+            for page in result['pages_data']:
+                if isinstance(page, dict):
+                    # Clean text field
+                    if 'text' in page:
+                        page['text'] = clean_text(page['text'])
+                    # Clean markdown field
+                    if 'markdown' in page:
+                        page['markdown'] = clean_text(page['markdown'])
+    # Handle list content recursively
+    elif isinstance(result, list):
+        return [clean_ocr_result(item, use_segmentation, vision_enabled, preprocessing_options)
+                for item in result]
+    return result
 def create_results_zip(results, output_dir=None, zip_name=None):
     """
 def create_results_zip_in_memory(results):
     """
     Create a zip file containing OCR results in memory.
+    Packages markdown with embedded image tags, raw text, and JSON file
+    in a contextually relevant structure.
     Args:
         results: Dictionary or list of OCR results
     # Create a BytesIO object
     zip_buffer = io.BytesIO()
+    # Create a ZipFile instance
+    with zipfile.ZipFile(zip_buffer, 'w', compression=zipfile.ZIP_DEFLATED) as zipf:
+        # Check if results is a list or a dictionary
+        is_list = isinstance(results, list)
         if is_list:
+            # Handle multiple results by creating subdirectories
+            for idx, result in enumerate(results):
+                if result and isinstance(result, dict):
+                    # Create a folder name based on the file name or index
+                    folder_name = result.get('file_name', f'document_{idx+1}')
+                    folder_name = Path(folder_name).stem  # Remove file extension
+                    # Add files to this folder
+                    add_result_files_to_zip(zipf, result, f"{folder_name}/")
         else:
+            # Single result - add files directly to root of zip
+            add_result_files_to_zip(zipf, results)
     # Seek to the beginning of the BytesIO object
     zip_buffer.seek(0)
     # Return the zip file bytes
     return zip_buffer.getvalue()
+def truncate_base64_in_result(result, prefix_length=32, suffix_length=32):
     """
+    Create a copy of the result dictionary with base64 image data truncated.
+    This keeps the structure intact while making the JSON more readable.
     Args:
         result: OCR result dictionary
+        prefix_length: Number of characters to keep at the beginning
+        suffix_length: Number of characters to keep at the end
     Returns:
+        Dictionary with truncated base64 data
     """
+    if not result or not isinstance(result, dict):
+        return {}
+    # Create a deep copy to avoid modifying the original
+    import copy
+    truncated_result = copy.deepcopy(result)
+    # Helper function to truncate base64 strings
+    def truncate_base64(data):
+        if not isinstance(data, str) or len(data) <= prefix_length + suffix_length + 10:
+            return data
+        # Extract prefix and suffix based on whether this is a data URI or raw base64
+        if data.startswith('data:'):
+            # Handle data URIs like 'data:image/jpeg;base64,/9j/4AAQ...'
+            parts = data.split(',', 1)
+            if len(parts) != 2:
+                return data  # Unexpected format, return as is
+            header = parts[0] + ','
+            base64_content = parts[1]
+            if len(base64_content) <= prefix_length + suffix_length + 10:
+                return data  # Not long enough to truncate
+            truncated = (f"{header}{base64_content[:prefix_length]}..."
+                         f"[truncated {len(base64_content) - prefix_length - suffix_length} chars]..."
+                         f"{base64_content[-suffix_length:]}")
+        else:
+            # Handle raw base64 strings
+            truncated = (f"{data[:prefix_length]}..."
+                         f"[truncated {len(data) - prefix_length - suffix_length} chars]..."
+                         f"{data[-suffix_length:]}")
+        return truncated
+    # Helper function to recursively truncate base64 in nested structures
+    def truncate_base64_recursive(obj):
+        if isinstance(obj, dict):
+            # Check for keys that typically contain base64 data
+            for key in list(obj.keys()):
+                if key in ['image_base64', 'base64'] and isinstance(obj[key], str):
+                    obj[key] = truncate_base64(obj[key])
+                elif isinstance(obj[key], (dict, list)):
+                    truncate_base64_recursive(obj[key])
+        elif isinstance(obj, list):
+            for item in obj:
+                if isinstance(item, (dict, list)):
+                    truncate_base64_recursive(item)
+    # Truncate base64 data throughout the result
+    truncate_base64_recursive(truncated_result)
+    # Specifically handle the pages_data structure
+    if 'pages_data' in truncated_result:
+        for page in truncated_result['pages_data']:
+            if isinstance(page, dict) and 'images' in page:
+                for img in page['images']:
+                    if isinstance(img, dict) and 'image_base64' in img and isinstance(img['image_base64'], str):
+                        img['image_base64'] = truncate_base64(img['image_base64'])
+    # Handle raw_response_data if present
+    if 'raw_response_data' in truncated_result and isinstance(truncated_result['raw_response_data'], dict):
+        if 'pages' in truncated_result['raw_response_data']:
+            for page in truncated_result['raw_response_data']['pages']:
+                if isinstance(page, dict) and 'images' in page:
+                    for img in page['images']:
+                        if isinstance(img, dict) and 'base64' in img and isinstance(img['base64'], str):
+                            img['base64'] = truncate_base64(img['base64'])
+    return truncated_result
+def clean_base64_from_result(result):
+    """
+    Create a clean copy of the result dictionary with base64 image data removed.
+    This ensures JSON files don't contain large base64 strings.
+    Args:
+        result: OCR result dictionary
+    Returns:
+        Cleaned dictionary without base64 data
+    """
+    if not result or not isinstance(result, dict):
+        return {}
+    # Create a deep copy to avoid modifying the original
+    import copy
+    clean_result = copy.deepcopy(result)
+    # Helper function to recursively clean base64 from nested structures
+    def clean_base64_recursive(obj):
+        if isinstance(obj, dict):
+            # Check for keys that typically contain base64 data
+            for key in list(obj.keys()):
+                if key in ['image_base64', 'base64']:
+                    obj[key] = "[BASE64_DATA_REMOVED]"
+                elif isinstance(obj[key], (dict, list)):
+                    clean_base64_recursive(obj[key])
+        elif isinstance(obj, list):
+            for item in obj:
+                if isinstance(item, (dict, list)):
+                    clean_base64_recursive(item)
+    # Clean the entire result
+    clean_base64_recursive(clean_result)
+    # Specifically handle the pages_data structure
+    if 'pages_data' in clean_result:
+        for page in clean_result['pages_data']:
+            if isinstance(page, dict) and 'images' in page:
+                for img in page['images']:
+                    if isinstance(img, dict) and 'image_base64' in img:
+                        img['image_base64'] = "[BASE64_DATA_REMOVED]"
+    # Handle raw_response_data if present
+    if 'raw_response_data' in clean_result and isinstance(clean_result['raw_response_data'], dict):
+        if 'pages' in clean_result['raw_response_data']:
+            for page in clean_result['raw_response_data']['pages']:
+                if isinstance(page, dict) and 'images' in page:
+                    for img in page['images']:
+                        if isinstance(img, dict) and 'base64' in img:
+                            img['base64'] = "[BASE64_DATA_REMOVED]"
+    return clean_result
+def create_markdown_with_file_references(result, image_path_prefix="images/"):
+    """
+    Create a markdown document with file references to images instead of base64 embedding.
+    Ideal for use in zip archives where images are stored as separate files.
+    Args:
+        result: OCR result dictionary
+        image_path_prefix: Path prefix for image references (e.g., "images/")
+    Returns:
+        Markdown content as string with file references
+    """
+    # Similar to create_markdown_with_images but uses file references
     # Import content utils to use classification functions
     try:
         from utils.content_utils import classify_document_content, extract_document_text, extract_image_description
     # Get content classification
     has_text = True
     has_images = False
     if content_utils_available:
         classification = classify_document_content(result)
         has_text = classification['has_content']
         has_images = result.get('has_images', False)
     else:
         # Minimal fallback detection
         if 'has_images' in result:
                     has_images = True
                     break
+    # Start building the markdown document
+    md = []
+    # Add document title/header
+    md.append(f"# {result.get('file_name', 'Document')}\n")
+    # Add metadata section
+    md.append("## Document Metadata\n")
     # Add timestamp
     if 'timestamp' in result:
+        md.append(f"**Processed:** {result['timestamp']}\n")
     # Add languages if available
     if 'languages' in result and result['languages']:
         languages = [lang for lang in result['languages'] if lang]
         if languages:
+            md.append(f"**Languages:** {', '.join(languages)}\n")
     # Add document type and topics
     if 'detected_document_type' in result:
+        md.append(f"**Document Type:** {result['detected_document_type']}\n")
     if 'topics' in result and result['topics']:
+        md.append(f"**Topics:** {', '.join(result['topics'])}\n")
+    md.append("\n---\n")
     # Document title - extract from result if available
     if 'ocr_contents' in result and 'title' in result['ocr_contents'] and result['ocr_contents']['title']:
         title_content = result['ocr_contents']['title']
+        md.append(f"## {title_content}\n")
     # Add images if present
     if has_images and 'pages_data' in result:
+        md.append("## Images\n")
+        # Extract and display all images with file references
         for page_idx, page in enumerate(result['pages_data']):
             if 'images' in page and isinstance(page['images'], list):
                 for img_idx, img in enumerate(page['images']):
+                    if 'image_base64' in img:
+                        # Create image reference to file in the zip
+                        image_filename = f"image_{page_idx+1}_{img_idx+1}.jpg"
+                        image_path = f"{image_path_prefix}{image_filename}"
+                        image_caption = f"Image {page_idx+1}-{img_idx+1}"
+                        md.append(f"![{image_caption}]({image_path})\n")
                         # Add image description if available through utils
                         if content_utils_available:
                             description = extract_image_description(result)
                             if description:
+                                md.append(f"*{description}*\n")
+        md.append("\n---\n")
     # Add document text section
+    md.append("## Text Content\n")
     # Extract text content systematically
     text_content = ""
+    structured_sections = {}
+    # Helper function to extract clean text from dictionary objects
+    def extract_clean_text(content):
+        if isinstance(content, str):
+            # Check if content is a stringified JSON
+            if content.strip().startswith("{") and content.strip().endswith("}"):
+                try:
+                    # Try to parse as JSON
+                    content_dict = json.loads(content.replace("'", '"'))
+                    if 'text' in content_dict:
+                        return content_dict['text']
+                    return content
+                except ValueError:  # json.JSONDecodeError is a subclass of ValueError
+                    return content
+            return content
+        elif isinstance(content, dict):
+            # If it's a dictionary with a 'text' key, return just that value
+            if 'text' in content and isinstance(content['text'], str):
+                return content['text']
+            return content
+        return content
     if content_utils_available:
+        # Use the systematic utility function for main text
         text_content = extract_document_text(result)
+        text_content = extract_clean_text(text_content)
+        # Collect all available structured sections
+        if 'ocr_contents' in result:
+            for field, content in result['ocr_contents'].items():
+                # Skip certain fields that are handled separately
+                if field in ["raw_text", "error", "partial_text", "main_text"]:
+                    continue
+                if content:
+                    # Extract clean text from content if possible
+                    clean_content = extract_clean_text(content)
+                    # Add this as a structured section
+                    structured_sections[field] = clean_content
     else:
         # Fallback extraction logic
         if 'ocr_contents' in result:
+            # First find main text
             for field in ["main_text", "content", "text", "transcript", "raw_text"]:
                 if field in result['ocr_contents'] and result['ocr_contents'][field]:
                     content = result['ocr_contents'][field]
                             break
                         except:
                             pass
+            # Then collect all structured sections
+            for field, content in result['ocr_contents'].items():
+                # Skip certain fields that are handled separately
+                if field in ["raw_text", "error", "partial_text", "main_text", "content", "text", "transcript"]:
+                    continue
+                if content:
+                    # Add this as a structured section
+                    structured_sections[field] = content
+    # Add the main text content - display raw text without a field label
     if text_content:
+        # Check if this is from raw_text (based on content match)
+        is_raw_text = False
+        if 'ocr_contents' in result and 'raw_text' in result['ocr_contents']:
+            if result['ocr_contents']['raw_text'] == text_content:
+                is_raw_text = True
+        # Display content without adding a "raw_text:" label
+        md.append(text_content + "\n\n")
+    # Add structured sections if available
+    if structured_sections:
+        for section_name, section_content in structured_sections.items():
+            # Use proper markdown header for sections - consistently capitalize all section names
+            display_name = section_name.replace("_", " ").capitalize()
+            # Handle different content types
+            if isinstance(section_content, str):
+                md.append(section_content + "\n\n")
+            elif isinstance(section_content, dict):
+                # Dictionary content - format as key-value pairs
+                for key, value in section_content.items():
+                    # Treat all values as plain text to maintain content purity
+                    # This prevents JSON-like structures from being formatted as code blocks
+                    md.append(f"**{key}:** {value}\n\n")
+            elif isinstance(section_content, list):
+                # List content - create a markdown list
+                for item in section_content:
+                    # Treat all items as plain text
+                    md.append(f"- {item}\n")
+                md.append("\n")
+    # Join all markdown parts into a single string
+    return "\n".join(md)
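Review note on `extract_clean_text` above: replacing every `'` with `"` before `json.loads` also rewrites apostrophes inside the text, so a value such as `{'text': "it's"}` fails to parse and is returned unchanged. A minimal sketch of one alternative follows, assuming the stringified content is either strict JSON or a Python-style dict repr. The name `extract_text_field` is hypothetical and not part of this commit.

```python
import ast
import json


def extract_text_field(content):
    """Return the 'text' value of a dict or stringified dict, else the input unchanged."""
    if isinstance(content, dict):
        text = content.get("text")
        return text if isinstance(text, str) else content
    if not isinstance(content, str):
        return content

    stripped = content.strip()
    if not (stripped.startswith("{") and stripped.endswith("}")):
        return content

    try:
        parsed = json.loads(stripped)  # strict JSON first
    except ValueError:
        try:
            # Python-style repr such as {'text': "it's"}; literals only, no code execution
            parsed = ast.literal_eval(stripped)
        except (ValueError, SyntaxError, TypeError):
            return content

    if isinstance(parsed, dict) and isinstance(parsed.get("text"), str):
        return parsed["text"]
    return content
```

Unlike the quote-replacement approach, this leaves apostrophes in the extracted text intact, and it returns the original string whenever neither parser accepts it.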
+def add_result_files_to_zip(zipf, result, prefix=""):
     """
+    Add files for a single result to a zip file.
+    Args:
+        zipf: ZipFile instance to add files to
+        result: OCR result dictionary
+        prefix: Optional prefix for file paths in the zip
     """
+    if not result or not isinstance(result, dict):
+        return
+    # Create a timestamp for filename if not in result
+    timestamp = result.get('timestamp', datetime.now().strftime("%Y-%m-%d_%H-%M-%S"))
+    # Get base name for files
+    file_name = result.get('file_name', 'document')
+    base_name = Path(file_name).stem
+    try:
+        # 1. Add JSON file - with base64 data cleaned out
+        clean_result = clean_base64_from_result(result)
+        json_str = json.dumps(clean_result, indent=2)
+        zipf.writestr(f"{prefix}{base_name}.json", json_str)
+        # 2. Add markdown file that exactly matches Tab 1 display
+        # Use the create_markdown_with_images function to ensure it matches the UI exactly
+        try:
+            markdown_content = create_markdown_with_images(result)
+            zipf.writestr(f"{prefix}{base_name}.md", markdown_content)
+        except Exception as e:
+            logger.error(f"Error creating markdown: {str(e)}")
+            # Fallback to simpler markdown if error occurs
+            zipf.writestr(f"{prefix}{base_name}.md", f"# {file_name}\n\nError generating complete markdown output.")
+        # Extract images and record their file names; the file-reference markdown in step 4 points at them
+        img_paths = {}
+        has_images = result.get('has_images', False)
+        # 3. Add individual images if available
+        if has_images and 'pages_data' in result:
+            img_folder = f"{prefix}images/"
+            for page_idx, page in enumerate(result['pages_data']):
+                if 'images' in page and isinstance(page['images'], list):
+                    for img_idx, img in enumerate(page['images']):
+                        if 'image_base64' in img and img['image_base64']:
+                            # Extract the base64 data
+                            try:
+                                # Get the base64 data
+                                img_data = img['image_base64']
+                                # Handle the base64 data carefully
+                                if isinstance(img_data, str):
+                                    # If it has a data URI prefix, remove it
+                                    if ',' in img_data and ';base64,' in img_data:
+                                        # Keep the complete data after the comma
+                                        img_data = img_data.split(',', 1)[1]
+                                    # Make sure we have the complete data (not truncated)
+                                    try:
+                                        # Decode the base64 data with padding correction
+                                        # Add padding if needed to prevent truncation errors
+                                        missing_padding = len(img_data) % 4
+                                        if missing_padding:
+                                            img_data += '=' * (4 - missing_padding)
+                                        img_bytes = base64.b64decode(img_data)
+                                    except Exception as e:
+                                        logger.error(f"Base64 decoding error: {str(e)} for image {page_idx}-{img_idx}")
+                                        # Skip this image if we can't decode it
+                                        continue
+                                else:
+                                    # If it's not a string (e.g., already bytes), use it directly
+                                    img_bytes = img_data
+                                # Create image filename
+                                image_filename = f"image_{page_idx+1}_{img_idx+1}.jpg"
+                                img_paths[(page_idx, img_idx)] = image_filename
+                                # Write the image to the zip file
+                                zipf.writestr(f"{img_folder}{image_filename}", img_bytes)
+                            except Exception as e:
+                                logger.warning(f"Could not add image to zip: {str(e)}")
+        # 4. Add markdown with file references to images for offline viewing
+        try:
+            if has_images:
+                # Create markdown with file references
+                file_ref_markdown = create_markdown_with_file_references(result, "images/")
+                zipf.writestr(f"{prefix}{base_name}_with_files.md", file_ref_markdown)
+        except Exception as e:
+            logger.warning(f"Error creating markdown with file references: {str(e)}")
+        # 5. Add README.txt with explanation of file contents
+        readme_content = f"""
+OCR RESULTS FOR: {file_name}
+Processed: {timestamp}
+This archive contains the following files:
+- {base_name}.json: Complete JSON data with all extracted information
+- {base_name}.md: Markdown document with embedded base64 images (exactly as shown in the app)
+- {base_name}_with_files.md: Alternative markdown with file references instead of base64 (for offline viewing)
+- images/ folder: Contains extracted images from the document (if present)
+Generated by Historical OCR using Mistral AI
+        """
+        zipf.writestr(f"{prefix}README.txt", readme_content.strip())
+    except Exception as e:
+        logger.error(f"Error adding files to zip: {str(e)}")
+def create_markdown_with_images(result):
+    """
+    Create a clean Markdown document from OCR results that properly preserves
+    image references and text structure, following the principle of content purity.
+    Args:
+        result: OCR result dictionary
+    Returns:
+        Markdown content as string
+    """
+    # Similar to create_markdown_with_file_references but embeds base64 images
+    # Import content utils to use classification functions
+    try:
+        from utils.content_utils import classify_document_content, extract_document_text, extract_image_description
+        content_utils_available = True
+    except ImportError:
+        content_utils_available = False
+    # Get content classification
+    has_text = True
+    has_images = False
+    if content_utils_available:
+        classification = classify_document_content(result)
+        has_text = classification['has_content']
+        has_images = result.get('has_images', False)
+    else:
+        # Minimal fallback detection
+        if 'has_images' in result:
+            has_images = result['has_images']
+        # Check for image data more thoroughly
+        if 'pages_data' in result and isinstance(result['pages_data'], list):
+            for page in result['pages_data']:
+                if isinstance(page, dict) and 'images' in page and page['images']:
+                    has_images = True
+                    break
+    # Start building the markdown document
+    md = []
+    # Add document title/header
+    md.append(f"# {result.get('file_name', 'Document')}\n")
+    # Add metadata section
+    md.append("## Document Metadata\n")
+    # Add timestamp
+    if 'timestamp' in result:
+        md.append(f"**Processed:** {result['timestamp']}\n")
+    # Add languages if available
+    if 'languages' in result and result['languages']:
+        languages = [lang for lang in result['languages'] if lang]
+        if languages:
+            md.append(f"**Languages:** {', '.join(languages)}\n")
+    # Add document type and topics
+    if 'detected_document_type' in result:
+        md.append(f"**Document Type:** {result['detected_document_type']}\n")
+    if 'topics' in result and result['topics']:
+        md.append(f"**Topics:** {', '.join(result['topics'])}\n")
+    md.append("\n---\n")
+    # Document title - extract from result if available
+    if 'ocr_contents' in result and 'title' in result['ocr_contents'] and result['ocr_contents']['title']:
+        title_content = result['ocr_contents']['title']
+        md.append(f"## {title_content}\n")
+    # Add images if present - with base64 embedding
+    if has_images and 'pages_data' in result:
+        md.append("## Images\n")
+        # Extract and display all images with embedded base64
+        for page_idx, page in enumerate(result['pages_data']):
+            if 'images' in page and isinstance(page['images'], list):
+                for img_idx, img in enumerate(page['images']):
+                    if 'image_base64' in img:
+                        # Use the base64 data directly
+                        image_caption = f"Image {page_idx+1}-{img_idx+1}"
+                        img_data = img['image_base64']
+                        # Make sure it has proper data URI format
+                        if isinstance(img_data, str) and not img_data.startswith('data:'):
+                            img_data = f"data:image/jpeg;base64,{img_data}"
+                        md.append(f"![{image_caption}]({img_data})\n")
+                        # Add image description if available through utils
+                        if content_utils_available:
+                            description = extract_image_description(result)
+                            if description:
+                                md.append(f"*{description}*\n")
+        md.append("\n---\n")
+    # Add document text section
+    md.append("## Text Content\n")
+    # Extract text content systematically
+    text_content = ""
+    structured_sections = {}
+    if content_utils_available:
+        # Use the systematic utility function for main text
+        text_content = extract_document_text(result)
+        # Collect all available structured sections
+        if 'ocr_contents' in result:
+            for field, content in result['ocr_contents'].items():
+                # Skip certain fields that are handled separately
+                if field in ["raw_text", "error", "partial_text", "main_text"]:
+                    continue
+                if content:
+                    # Add this as a structured section
+                    structured_sections[field] = content
+    else:
+        # Fallback extraction logic
+        if 'ocr_contents' in result:
+            # First find main text
+            for field in ["main_text", "content", "text", "transcript", "raw_text"]:
+                if field in result['ocr_contents'] and result['ocr_contents'][field]:
+                    content = result['ocr_contents'][field]
+                    if isinstance(content, str) and content.strip():
+                        text_content = content
+                        break
+                    elif isinstance(content, dict):
+                        # Try to convert complex objects to string
+                        try:
+                            text_content = json.dumps(content, indent=2)
+                            break
+                        except (TypeError, ValueError):  # content is not JSON-serializable
+                            pass
+            # Then collect all structured sections
+            for field, content in result['ocr_contents'].items():
+                # Skip certain fields that are handled separately
+                if field in ["raw_text", "error", "partial_text", "main_text", "content", "text", "transcript"]:
+                    continue
+                if content:
+                    # Add this as a structured section
+                    structured_sections[field] = content
+    # Add the main text content
+    if text_content:
+        md.append(text_content + "\n\n")
+    # Add structured sections if available
+    if structured_sections:
+        for section_name, section_content in structured_sections.items():
+            # Use proper markdown header for sections - consistently capitalize all section names
+            display_name = section_name.replace("_", " ").capitalize()
+            md.append(f"### {display_name}\n")
+            # Add a separator for clarity
+            md.append("\n---\n\n")
+            # Handle different content types
+            if isinstance(section_content, str):
+                md.append(section_content + "\n\n")
+            elif isinstance(section_content, dict):
+                # Dictionary content - format as key-value pairs
+                for key, value in section_content.items():
+                    # Treat all values as plain text to maintain content purity
+                    md.append(f"**{key}:** {value}\n\n")
+            elif isinstance(section_content, list):
+                # List content - create a markdown list
+                for item in section_content:
+                    # Keep list items as plain text
+                    md.append(f"- {item}\n")
+                md.append("\n")
+    # Join all markdown parts into a single string
+    return "\n".join(md)
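Review note on the image extraction in `add_result_files_to_zip` above: the data-URI stripping and padding correction can be isolated into one small helper, which makes the behaviour easier to test. This is a minimal sketch under that assumption, not the code from this commit. The name `decode_image_payload` is hypothetical.

```python
import base64
import binascii


def decode_image_payload(payload):
    """Decode a base64 string (optionally a data URI) to bytes; return None on failure."""
    if isinstance(payload, (bytes, bytearray)):
        return bytes(payload)  # already binary, use as-is
    if not isinstance(payload, str):
        return None

    # Drop a "data:image/...;base64," prefix if present
    if ";base64," in payload:
        payload = payload.split(",", 1)[1]
    payload = payload.strip()

    # Base64 length must be a multiple of 4; pad with '=' as needed
    payload += "=" * (-len(payload) % 4)

    try:
        return base64.b64decode(payload)
    except (binascii.Error, ValueError):
        return None
```

Returning `None` lets the caller skip an undecodable image with a single check, mirroring the `continue` in the loop above.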

utils/text_utils.py CHANGED

@@ -1,6 +1,7 @@
 """Text utility functions for OCR processing"""
 import re
 def clean_raw_text(text):
     """Clean raw text by removing image references and serialized data.
@@ -14,24 +15,24 @@ def clean_raw_text(text):
     if not text or not isinstance(text, str):
         return ""
-    # # Remove image references like ![image](data:image/...)
-    # text = re.sub(r'!\[.*?\]\(data:image/[^)]+\)', '', text)
-    # # Remove basic markdown image references like ![alt](img-1.jpg)
-    # text = re.sub(r'!\[[^\]]*\]\([^)]+\)', '', text)
-    # # Remove base64 encoded image data
-    # text = re.sub(r'data:image/[^;]+;base64,[a-zA-Z0-9+/=]+', '', text)
-    # # Remove image object references like [[OCRImageObject:...]]
-    # text = re.sub(r'\[\[OCRImageObject:[^\]]+\]\]', '', text)
-    # # Clean up any JSON-like image object references
-    # text = re.sub(r'{"image(_data)?":("[^"]*"|null|true|false|\{[^}]*\}|\[[^\]]*\])}', '', text)
-    # # Clean up excessive whitespace and line breaks created by removals
-    # text = re.sub(r'\n{3,}', '\n\n', text)
-    # text = re.sub(r'\s{3,}', ' ', text)
     return text.strip()
@@ -55,6 +56,45 @@ def format_markdown_text(text):
     # Convert any Windows line endings to Unix
     text = text.replace('\r\n', '\n')
     # Format dates (MM/DD/YYYY or similar patterns)
     date_pattern = r'\b(0?[1-9]|1[0-2])[\/\-\.](0?[1-9]|[12][0-9]|3[01])[\/\-\.](\d{4}|\d{2})\b'
     text = re.sub(date_pattern, r'**\g<0>**', text)
@@ -149,3 +189,26 @@ def format_markdown_text(text):
     processed_text = re.sub(r'([^\n])\n([^\n])', r'\1\n\n\2', processed_text)
     return processed_text

 """Text utility functions for OCR processing"""
 import re
+import streamlit as st
 def clean_raw_text(text):
     """Clean raw text by removing image references and serialized data.
     if not text or not isinstance(text, str):
         return ""
+    # Remove image references like ![image](data:image/...)
+    text = re.sub(r'!\[.*?\]\(data:image/[^)]+\)', '', text)
+    # Remove basic markdown image references like ![alt](img-1.jpg)
+    text = re.sub(r'!\[[^\]]*\]\([^)]+\)', '', text)
+    # Remove base64 encoded image data
+    text = re.sub(r'data:image/[^;]+;base64,[a-zA-Z0-9+/=]+', '', text)
+    # Remove image object references like [[OCRImageObject:...]]
+    text = re.sub(r'\[\[OCRImageObject:[^\]]+\]\]', '', text)
+    # Clean up any JSON-like image object references
+    text = re.sub(r'{"image(_data)?":("[^"]*"|null|true|false|\{[^}]*\}|\[[^\]]*\])}', '', text)
+    # Clean up excessive whitespace and line breaks created by removals
+    text = re.sub(r'\n{3,}', '\n\n', text)
+    text = re.sub(r'\s{3,}', ' ', text)
     return text.strip()
     # Convert any Windows line endings to Unix
     text = text.replace('\r\n', '\n')
+    # Format keys with values to ensure keys are on their own line
+    # Pattern matches potential label/key patterns like 'key:' or '**key:**'
+    key_value_pattern = r'(\*\*[^:*\n]+:\*\*|\b[a-zA-Z_]+:\s+)'
+    # Process lines for key-value formatting
+    lines = text.split('\n')
+    processed_lines = []
+    for line in lines:
+        # Find all matches of the key-value pattern
+        matches = list(re.finditer(key_value_pattern, line))
+        if matches:
+            # Process each match in reverse to avoid messing up string indices
+            for match in reversed(matches):
+                key = match.group(1)
+                key_end = match.end()
+                # If the key is already bold, use it as is
+                if key.startswith('**') and key.endswith('**'):
+                    formatted_key = key
+                else:
+                    # Bold the key if it's not already bold
+                    formatted_key = f"**{key.strip()}**"
+                # Split the line at this key's end position
+                before_key = line[:match.start()]
+                after_key = line[key_end:]
+                # If there's content before the key on the same line, end with newline
+                if before_key.strip():
+                    before_key = f"{before_key.rstrip()}\n\n"
+                # Format: key on its own line, value on next line
+                line = f"{before_key}{formatted_key}\n{after_key.strip()}"
+        processed_lines.append(line)
+    # Join the processed lines
+    text = '\n'.join(processed_lines)
     # Format dates (MM/DD/YYYY or similar patterns)
     date_pattern = r'\b(0?[1-9]|1[0-2])[\/\-\.](0?[1-9]|[12][0-9]|3[01])[\/\-\.](\d{4}|\d{2})\b'
     text = re.sub(date_pattern, r'**\g<0>**', text)
     processed_text = re.sub(r'([^\n])\n([^\n])', r'\1\n\n\2', processed_text)
     return processed_text
+def format_ocr_text(text, for_display=False):
+    """Clean OCR text and apply markdown formatting.
+    Args:
+        text (str): The OCR text to format
+        for_display (bool): Currently ignored; no HTML is added in either case
+    Returns:
+        str: Formatted text, without HTML container to keep content pure
+    """
+    if not text or not isinstance(text, str):
+        return ""
+    # Clean the text first
+    text = clean_raw_text(text)
+    # Format with markdown
+    formatted_text = format_markdown_text(text)
+    # Always return the clean formatted text without HTML wrappers
+    # This follows the principle of keeping content separate from presentation
+    return formatted_text
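Review note on the re-enabled substitutions in `clean_raw_text` above: a reduced sketch makes their effect easy to check in isolation. It reuses three of the patterns from the diff verbatim. The function name `strip_image_markup` is hypothetical, and the `\s{3,}` rule is deliberately left out here.

```python
import re


def strip_image_markup(text):
    """Remove markdown image references and inline base64 image data from text."""
    if not text or not isinstance(text, str):
        return ""
    # Markdown image references such as ![alt](img-1.jpg) or ![alt](data:image/...)
    text = re.sub(r'!\[[^\]]*\]\([^)]+\)', '', text)
    # Bare base64 data URIs left outside of markdown image syntax
    text = re.sub(r'data:image/[^;]+;base64,[a-zA-Z0-9+/=]+', '', text)
    # Collapse the blank-line runs left behind by the removals
    text = re.sub(r'\n{3,}', '\n\n', text)
    return text.strip()


sample = "Intro\n\n![fig](img-1.jpg)\n\n\nBody data:image/png;base64,AAAA end"
cleaned = strip_image_markup(sample)
```

One point that may be worth checking in the full function: `\s` also matches newlines, so the final `\s{3,}` substitution would turn a paragraph break followed by an indented line into a single space.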

utils/ui_utils.py CHANGED Viewed

@@ -1,13 +1,14 @@
 """
 UI utilities for OCR results display.
 """
 import streamlit as st
 import json
 import base64
 import io
 from datetime import datetime
-from utils.image_utils import format_ocr_text, create_html_with_images
 from utils.content_utils import classify_document_content, format_structured_data
 def display_results(result, container, custom_prompt=""):
@@ -58,17 +59,55 @@ def display_results(result, container, custom_prompt=""):
                 lang_html += '</div>'
                 st.markdown(lang_html, unsafe_allow_html=True)
-                # Create a separate line for Time if we have time-related tags
-                if 'topics' in result and result['topics']:
-                    time_tags = [topic for topic in result['topics']
-                               if any(term in topic.lower() for term in ["century", "pre-", "era", "historical"])]
-                    if time_tags:
-                        time_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
-                        time_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Time:</div>'
-                        for tag in time_tags:
-                            time_html += f'<span class="subject-tag tag-time-period">{tag}</span>'
-                        time_html += '</div>'
-                        st.markdown(time_html, unsafe_allow_html=True)
         # Then display remaining subject tags if available
         if 'topics' in result and result['topics']:
@@ -199,118 +238,98 @@ def display_results(result, container, custom_prompt=""):
                         doc_tab, json_tab = tabs
                         img_tab = None
-                    # Document Content tab with simplified and systematic content handling
                     with doc_tab:
-                        # Classify document content using our utility function
-                        content_classification = classify_document_content(result)
-                        # Track what content has been displayed to avoid redundancy
-                        displayed_content = set()
                         # Create a single unified content section
-                        st.markdown("#### Document Content")
-                        st.markdown("##### Title")
-                        # Extract main structured content fields without redundancy
-                        text_fields = {}
-                        # Use the exact same approach as in Previous Results tab for consistency
-                        # Create a more focused list of important sections - prioritize main_text
-                        priority_sections = ["title", "main_text", "content", "transcript", "summary"]
-                        displayed_sections = set()
-                        # First display priority sections
-                        for section in priority_sections:
-                            if section in result['ocr_contents'] and result['ocr_contents'][section]:
-                                content = result['ocr_contents'][section]
                                 if isinstance(content, str) and content.strip():
-                                    # Only add a subheader for meaningful section names, not raw_text
-                                    if section != "raw_text" and section != "title":
-                                        st.markdown(f"##### {section.replace('_', ' ').title()}")
-                                    # Format and display content
-                                    # First format any structured data (lists, dicts)
-                                    structured_content = format_structured_data(content)
-                                    # Then apply regular OCR text formatting
-                                    formatted_content = format_ocr_text(structured_content)
-                                    st.markdown(formatted_content)
-                                    displayed_sections.add(section)
-                                    break
-                                elif isinstance(content, dict):
-                                    # Display dictionary content as key-value pairs
-                                    for k, v in content.items():
-                                        if k not in ['error', 'partial_text'] and v:
-                                            st.markdown(f"**{k.replace('_', ' ').title()}**")
-                                            if isinstance(v, str):
-                                                # Format any structured data in the string
-                                                formatted_v = format_structured_data(v)
-                                                st.markdown(format_ocr_text(formatted_v))
-                                            else:
-                                                # Format non-string values (lists, dicts)
-                                                formatted_v = format_structured_data(v)
-                                                st.markdown(formatted_v)
-                                    displayed_sections.add(section)
-                                    break
-                                elif isinstance(content, list):
-                                    # Format and display list items using our structured formatter
-                                    formatted_list = format_structured_data(content)
-                                    st.markdown(formatted_list)
-                                    displayed_sections.add(section)
-                                    break
-                        # Then display any remaining sections not already shown
-                        for section, content in result['ocr_contents'].items():
-                            if (section not in displayed_sections and
-                                section not in ['error', 'partial_text'] and
-                                content):
-                                st.markdown(f"##### {section.replace('_', ' ').title()}")
-                                if isinstance(content, str):
-                                    # Format any structured data in the string before display
-                                    structured_content = format_structured_data(content)
-                                    st.markdown(format_ocr_text(structured_content))
-                                elif isinstance(content, list):
-                                    # Format list using our structured formatter
-                                    formatted_list = format_structured_data(content)
-                                    st.markdown(formatted_list)
-                                elif isinstance(content, dict):
-                                    # Format dictionary using our structured formatter
-                                    formatted_dict = format_structured_data(content)
-                                    st.markdown(formatted_dict)
-                    # Raw JSON tab - for viewing the raw OCR response data
                     with json_tab:
-                        # Extract the relevant JSON data
-                        json_data = {}
-                        # Include important metadata
-                        for field in ['file_name', 'timestamp', 'processing_time', 'detected_document_type', 'languages', 'topics']:
-                            if field in result:
-                                json_data[field] = result[field]
-                        # Include OCR contents
-                        if 'ocr_contents' in result:
-                            json_data['ocr_contents'] = result['ocr_contents']
-                        # Exclude large binary data like base64 images to keep JSON clean
-                        if 'pages_data' in result:
-                            # Create simplified pages_data without large binary content
-                            simplified_pages = []
-                            for page in result['pages_data']:
-                                simplified_page = {
-                                    'page_number': page.get('page_number', 0),
-                                    'has_text': bool(page.get('markdown', '')),
-                                    'has_images': bool(page.get('images', [])),
-                                    'image_count': len(page.get('images', []))
-                                }
-                                simplified_pages.append(simplified_page)
-                            json_data['pages_summary'] = simplified_pages
                         # Format the JSON prettily
-                        json_str = json.dumps(json_data, indent=2)
-                        # Display in a monospace font with syntax highlighting
-                        st.code(json_str, language="json")
                     # Images tab - for viewing document images
@@ -324,90 +343,3 @@ def display_results(result, container, custom_prompt=""):
             if custom_prompt:
                 with st.expander("Custom Processing Instructions"):
                     st.write(custom_prompt)
-            # No download heading - start directly with buttons
-            # Create export section with a simple download menu
-            st.markdown("<div style='margin-top: 15px;'></div>", unsafe_allow_html=True)
-            # Prepare all download files at once to avoid rerun resets
-            try:
-                # 1. JSON download
-                json_str = json.dumps(result, indent=2)
-                json_filename = f"{result.get('file_name', 'document').split('.')[0]}_ocr.json"
-                # 2. Text download with improved structure
-                text_parts = []
-                filename = result.get('file_name', 'document')
-                text_parts.append(f"DOCUMENT: {filename}\n")
-                if 'timestamp' in result:
-                    text_parts.append(f"Processed: {result['timestamp']}\n")
-                if 'languages' in result and result['languages']:
-                    languages = [lang for lang in result['languages'] if lang is not None]
-                    if languages:
-                        text_parts.append(f"Languages: {', '.join(languages)}\n")
-                if 'topics' in result and result['topics']:
-                    text_parts.append(f"Topics: {', '.join(result['topics'])}\n")
-                text_parts.append("\n" + "="*50 + "\n\n")
-                if 'ocr_contents' in result and 'title' in result['ocr_contents'] and result['ocr_contents']['title']:
-                    text_parts.append(f"TITLE: {result['ocr_contents']['title']}\n\n")
-                content_added = False
-                if 'ocr_contents' in result:
-                    for field in ["main_text", "content", "text", "transcript", "raw_text"]:
-                        if field in result['ocr_contents'] and result['ocr_contents'][field]:
-                            text_parts.append(f"CONTENT:\n\n{result['ocr_contents'][field]}\n")
-                            content_added = True
-                            break
-                text_content = "\n".join(text_parts)
-                text_filename = f"{result.get('file_name', 'document').split('.')[0]}_ocr.txt"
-                # 3. HTML download
-                from utils.image_utils import create_html_with_images
-                html_content = create_html_with_images(result)
-                html_filename = f"{result.get('file_name', 'document').split('.')[0]}_ocr.html"
-                # Hide download options in an expander
-                with st.expander("Download Options"):
-                    # Remove columns and use vertical layout instead
-                    # Add spacing between buttons for better readability
-                    st.download_button(
-                        label="JSON",
-                        data=json_str,
-                        file_name=json_filename,
-                        mime="application/json",
-                        key="download_json_btn",
-                        use_container_width=True
-                    )
-                    st.markdown("<div style='margin-top: 8px;'></div>", unsafe_allow_html=True)
-                    st.download_button(
-                        label="Text",
-                        data=text_content,
-                        file_name=text_filename,
-                        mime="text/plain",
-                        key="download_text_btn",
-                        use_container_width=True
-                    )
-                    st.markdown("<div style='margin-top: 8px;'></div>", unsafe_allow_html=True)
-                    st.download_button(
-                        label="HTML",
-                        data=html_content,
-                        file_name=html_filename,
-                        mime="text/html",
-                        key="download_html_btn",
-                        use_container_width=True
-                    )
-            except Exception as e:
-                st.error(f"Error preparing download files: {str(e)}")

 """
 UI utilities for OCR results display.
 """
+import os
 import streamlit as st
 import json
 import base64
 import io
 from datetime import datetime
+from utils.text_utils import format_ocr_text
 from utils.content_utils import classify_document_content, format_structured_data
 def display_results(result, container, custom_prompt=""):
                 lang_html += '</div>'
                 st.markdown(lang_html, unsafe_allow_html=True)
+        # Prepare download files
+        try:
+            # Get base filename
+            from utils.general_utils import create_descriptive_filename
+            original_file = result.get('file_name', 'document')
+            base_name = create_descriptive_filename(original_file, result, "")
+            base_name = os.path.splitext(base_name)[0]
+            # 1. JSON download - with base64 data truncated for readability
+            from utils.image_utils import truncate_base64_in_result
+            truncated_result = truncate_base64_in_result(result)
+            json_str = json.dumps(truncated_result, indent=2)
+            json_filename = f"{base_name}.json"
+            json_b64 = base64.b64encode(json_str.encode()).decode()
+            # 2. Create ZIP with all files
+            from utils.image_utils import create_results_zip_in_memory
+            zip_data = create_results_zip_in_memory(result)
+            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+            zip_filename = f"{base_name}_{timestamp}.zip"
+            zip_b64 = base64.b64encode(zip_data).decode()
+            # Add download line with metadata styling
+            download_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
+            download_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Download:</div>'
+            # Download links in order of importance, matching the zip file contents
+            download_html += f'<a href="data:application/json;base64,{json_b64}" download="{json_filename}" class="subject-tag tag-download">JSON</a>'
+            # Zip download link (packages everything together)
+            download_html += f'<a href="data:application/zip;base64,{zip_b64}" download="{zip_filename}" class="subject-tag tag-download">Zip Archive</a>'
+            download_html += '</div>'
+            st.markdown(download_html, unsafe_allow_html=True)
+        except Exception:
+            # Silent fail for downloads - don't disrupt the UI
+            pass
+        # Create a separate line for Time if we have time-related tags
+        if 'topics' in result and result['topics']:
+            time_tags = [topic for topic in result['topics']
+                       if any(term in topic.lower() for term in ["century", "pre-", "era", "historical"])]
+            if time_tags:
+                time_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
+                time_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Time:</div>'
+                for tag in time_tags:
+                    time_html += f'<span class="subject-tag tag-time-period">{tag}</span>'
+                time_html += '</div>'
+                st.markdown(time_html, unsafe_allow_html=True)
         # Then display remaining subject tags if available
         if 'topics' in result and result['topics']:
                         doc_tab, json_tab = tabs
                         img_tab = None
+                    # Document Content tab with simple, clean formatting that matches markdown export files
                     with doc_tab:
                         # Create a single unified content section
+                        st.markdown("## Text Content")
+                        # Present content directly in the format used in markdown export files
+                        if 'ocr_contents' in result and isinstance(result['ocr_contents'], dict):
+                            # Get all content fields that should be displayed
+                            content_fields = {}
+                            # Add all available content fields (left_page, right_page, etc)
+                            for field, content in result['ocr_contents'].items():
+                                # Skip certain fields that shouldn't be displayed
+                                if field in ['error', 'partial_text'] or not content:
+                                    continue
+                                # Clean the content if it's a string
                                 if isinstance(content, str) and content.strip():
+                                    content_fields[field] = content.strip()
+                                # Handle dictionary or list content
+                                elif isinstance(content, (dict, list)):
+                                    formatted_content = format_structured_data(content)
+                                    if formatted_content:
+                                        content_fields[field] = formatted_content
+                            # Process nested dictionary structures
+                            def flatten_content_fields(fields, parent_key=""):
+                                flat_fields = {}
+                                for field, content in fields.items():
+                                    # Skip certain fields
+                                    if field in ['error', 'partial_text'] or not content:
+                                        continue
+                                    # Handle string content
+                                    if isinstance(content, str) and content.strip():
+                                        key = f"{parent_key}_{field}".strip("_")
+                                        flat_fields[key] = content.strip()
+                                    # Handle dictionary content
+                                    elif isinstance(content, dict):
+                                        # If the dictionary has a 'text' key, extract just that value
+                                        if 'text' in content and isinstance(content['text'], str):
+                                            key = f"{parent_key}_{field}".strip("_")
+                                            flat_fields[key] = content['text'].strip()
+                                        # Otherwise, recursively process nested dictionaries
+                                        else:
+                                            nested_fields = flatten_content_fields(content, f"{parent_key}_{field}")
+                                            flat_fields.update(nested_fields)
+                                    # Handle list content
+                                    elif isinstance(content, list):
+                                        formatted_content = format_structured_data(content)
+                                        if formatted_content:
+                                            key = f"{parent_key}_{field}".strip("_")
+                                            flat_fields[key] = formatted_content
+                                return flat_fields
+                            # Flatten the content structure
+                            flat_content_fields = flatten_content_fields(result['ocr_contents'])
+                            # Display the flattened content fields with proper formatting
+                            for field, content in flat_content_fields.items():
+                                # Skip any empty content
+                                if not content or not content.strip():
+                                    continue
+                                # Format field name as in the markdown export
+                                field_display = field.replace('_', ' ')
+                                # Maintain content purity - don't parse text content as JSON
+                                # Historical text may contain curly braces that aren't JSON
+                                # For raw_text field, display only the content without the field name
+                                if field == 'raw_text':
+                                    st.markdown(f"{content}")
+                                else:
+                                    # For other fields, display the formatted field name in bold followed by the content
+                                    st.markdown(f"**{field_display}:** {content}")
+                                # Add spacing between fields
+                                st.markdown("\n\n")
+                    # Raw JSON tab - displays the exact same JSON that's downloaded via the JSON button
                     with json_tab:
+                        # Use the same truncated JSON that's used in the download button
+                        from utils.image_utils import truncate_base64_in_result
+                        truncated_result = truncate_base64_in_result(result)
                         # Format the JSON prettily
+                        json_str = json.dumps(truncated_result, indent=2)
+                        # Display JSON with a copy button using Streamlit's built-in functionality
+                        st.json(truncated_result)
                     # Images tab - for viewing document images
             if custom_prompt:
                 with st.expander("Custom Processing Instructions"):
                     st.write(custom_prompt)
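
Review note: the nested `flatten_content_fields` helper added in this hunk can be tried outside Streamlit. The following is a minimal standalone sketch of the same flattening idea, not the code in `utils/ui_utils.py`. The name `flatten_fields`, the `SKIPPED_FIELDS` constant, and the bullet-list rendering of lists are assumptions made for illustration; in the diff, lists are rendered by `format_structured_data`, whose output format is not shown here.

```python
from typing import Any, Dict

# Fields the diff skips when building the display (taken from the hunk above).
SKIPPED_FIELDS = {"error", "partial_text"}


def flatten_fields(fields: Dict[str, Any], parent_key: str = "") -> Dict[str, str]:
    """Flatten nested OCR content into {"parent_child": "text"} pairs."""
    flat: Dict[str, str] = {}
    for field, content in fields.items():
        if field in SKIPPED_FIELDS or not content:
            continue
        # Join parent and child names with "_" and drop any leading "_".
        key = f"{parent_key}_{field}".strip("_")
        if isinstance(content, str):
            if content.strip():
                flat[key] = content.strip()
        elif isinstance(content, dict):
            text = content.get("text")
            if isinstance(text, str):
                # A dict carrying a 'text' key contributes only that value.
                flat[key] = text.strip()
            else:
                # Otherwise recurse, carrying the joined key as the new parent.
                flat.update(flatten_fields(content, key))
        elif isinstance(content, list):
            # Hypothetical stand-in for format_structured_data: a plain bullet list.
            rendered = "\n".join(f"- {item}" for item in content)
            if rendered:
                flat[key] = rendered
    return flat


if __name__ == "__main__":
    sample = {
        "title": " Letter ",
        "left_page": {"text": "Dear Mary:"},
        "meta": {"notes": {"margin": "torn"}},
        "error": "ignored",
    }
    for name, value in flatten_fields(sample).items():
        print(f"{name.replace('_', ' ')}: {value}")
```

Keeping the helper at module level, rather than nested inside `display_results`, would make it possible to unit test it without rendering any UI.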

verify_fix.py ADDED

	@@ -0,0 +1,70 @@

+#!/usr/bin/env python3
+import os
+import streamlit as st
+from ocr_processing import process_file
+# Mock a file upload
+class MockFile:
+    def __init__(self, name, content):
+        self.name = name
+        self._content = content
+    def getvalue(self):
+        return self._content
+def test_image(image_path):
+    """Test OCR processing for a specific image"""
+    print(f"\n\n===== Testing {os.path.basename(image_path)} =====")
+    # Load the test image
+    with open(image_path, 'rb') as f:
+        file_bytes = f.read()
+    # Create mock file
+    uploaded_file = MockFile(os.path.basename(image_path), file_bytes)
+    # Process the file
+    result = process_file(uploaded_file)
+    # Display results summary
+    print("\nOCR Content Keys:")
+    for key in result['ocr_contents'].keys():
+        print(f"- {key}")
+    # Show a preview of raw_text
+    if 'raw_text' in result['ocr_contents']:
+        raw_text = result['ocr_contents']['raw_text']
+        preview = raw_text[:100] + "..." if len(raw_text) > 100 else raw_text
+        print(f"\nRaw Text Preview: {preview}")
+    # Check for duplicated content
+    found_duplicated = False
+    if 'raw_text' in result['ocr_contents']:
+        raw_text = result['ocr_contents']['raw_text']
+        # Check if the same text appears twice in sequence (a sign of duplication)
+        if len(raw_text) > 50:
+            half_point = len(raw_text) // 2
+            first_quarter = raw_text[:half_point//2].strip()
+            if first_quarter and len(first_quarter) > 20:
+                if first_quarter in raw_text[half_point:]:
+                    found_duplicated = True
+                    print("\n⚠️ WARNING: Possible text duplication detected!")
+    if not found_duplicated:
+        print("\n✅ No text duplication detected")
+    return result
+def main():
+    # Test with different image types
+    test_files = [
+        'input/magician-or-bottle-cungerer.jpg',  # The problematic file
+        'input/recipe.jpg',                       # Simple text file
+        'input/handwritten-letter.jpg'           # Mixed content
+    ]
+    for image_path in test_files:
+        test_image(image_path)
+if __name__ == "__main__":
+    main()
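
Review note: the duplication check in `test_image` is a heuristic that can be isolated and exercised without calling the OCR pipeline. Below is a minimal sketch of that heuristic as a pure function; the name `looks_duplicated` and its parameters are hypothetical, while the thresholds (50 and 20 characters) mirror the ones in the script above.

```python
def looks_duplicated(text: str, min_length: int = 50, min_probe: int = 20) -> bool:
    """Return True if the start of `text` reappears in its second half.

    Takes the first quarter of the text as a probe and searches for it
    in the second half, as the script above does.
    """
    if len(text) <= min_length:
        return False
    half_point = len(text) // 2
    probe = text[: half_point // 2].strip()
    # Very short probes would match by accident, so require a minimum length.
    return len(probe) > min_probe and probe in text[half_point:]


if __name__ == "__main__":
    passage = "The quick brown fox jumps over the lazy dog near the river bank. "
    print(looks_duplicated(passage * 2))  # repeated passage
    print(looks_duplicated(passage))      # single passage
```

The heuristic only catches duplication where the text is repeated roughly in full; partial or reordered repeats would need a different check.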

verify_segmentation_fix.py ADDED

	@@ -0,0 +1,116 @@

+"""
+Script to verify that our fixes properly prioritize text from segmented regions
+in the OCR output, ensuring images don't overshadow text content.
+"""
+import os
+import json
+import tempfile
+from pathlib import Path
+import logging
+from PIL import Image
+# Configure logging
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+def verify_fix():
+    """
+    Simulate the OCR process with segmentation to verify text prioritization
+    """
+    print("Verifying segmentation and text prioritization fix...")
+    print("-" * 80)
+    # Create a simulated OCR result structure
+    ocr_result = {
+        "file_name": "test_document.jpg",
+        "topics": ["Document"],
+        "languages": ["English"],
+        "ocr_contents": {
+            "raw_text": "This is incorrect text that would be extracted from an image-focused OCR process.",
+            "title": "Test Document"
+        }
+    }
+    # Create simulated segmentation data that would be from our improved process
+    segmentation_data = {
+        'text_regions_coordinates': [(10, 10, 100, 20), (10, 40, 100, 20)],
+        'regions_count': 2,
+        'segmentation_applied': True,
+        'combined_text': "FIFTH AVENUE AT FIFTENTH STREET, NORTH\n\nBIRMINGHAM 2, ALABAMA\n\nDear Mary:\n\nHaving received your letter, I wanted to respond promptly.",
+        'region_results': [
+            {
+                'text': "FIFTH AVENUE AT FIFTENTH STREET, NORTH",
+                'coordinates': (10, 10, 100, 20),
+                'order': 0
+            },
+            {
+                'text': "BIRMINGHAM 2, ALABAMA",
+                'coordinates': (10, 40, 100, 20),
+                'order': 1
+            }
+        ]
+    }
+    # Create preprocessing options with segmentation data
+    preprocessing_options = {
+        'document_type': 'letter',
+        'segmentation_data': segmentation_data
+    }
+    # Import the clean_ocr_result function to test
+    from utils.image_utils import clean_ocr_result
+    # Process the result to see how text is prioritized
+    print("Original OCR text (before fix): ")
+    print(f"  '{ocr_result['ocr_contents']['raw_text']}'")
+    print()
+    # Use our improved clean_ocr_result function
+    cleaned_result = clean_ocr_result(
+        ocr_result,
+        use_segmentation=True,
+        vision_enabled=True,
+        preprocessing_options=preprocessing_options
+    )
+    # Print the results to verify text prioritization
+    print("After applying fix (should prioritize segmented text):")
+    if 'segmentation_text' in cleaned_result['ocr_contents']:
+        print("✓ Segmentation text was properly added to results")
+        print(f"  Segmentation text: '{cleaned_result['ocr_contents']['segmentation_text']}'")
+    else:
+        print("✗ Segmentation text was NOT added to results")
+    if cleaned_result['ocr_contents'].get('main_text') == segmentation_data['combined_text']:
+        print("✓ Segmentation text was correctly used as the main text")
+    else:
+        print("✗ Segmentation text was NOT used as the main text")
+    if 'original_raw_text' in cleaned_result['ocr_contents']:
+        print("✓ Original raw text was preserved as a backup")
+    else:
+        print("✗ Original raw text was NOT preserved")
+    if cleaned_result['ocr_contents'].get('raw_text') == segmentation_data['combined_text']:
+        print("✓ Raw text was correctly replaced with segmentation text")
+    else:
+        print("✗ Raw text was NOT replaced with segmentation text")
+    print()
+    print("Final OCR text content:")
+    print("-" * 30)
+    print(cleaned_result['ocr_contents'].get('raw_text', "No text found"))
+    print("-" * 30)
+    print()
+    print("Conclusion:")
+    if (cleaned_result['ocr_contents'].get('raw_text') == segmentation_data['combined_text'] and
+        cleaned_result['ocr_contents'].get('main_text') == segmentation_data['combined_text']):
+        print("✅ Fix successfully prioritizes text from segmented regions!")
+    else:
+        print("❌ Fix did NOT correctly prioritize text from segmented regions.")
+if __name__ == "__main__":
+    verify_fix()
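
Review note: the four checks in `verify_fix` imply a contract for the result: `segmentation_text` is added, `main_text` and `raw_text` both equal the segmented `combined_text`, and the previous `raw_text` survives as `original_raw_text`. The sketch below illustrates one way to satisfy that contract. It is an assumption made for illustration, not the implementation of `clean_ocr_result` in `utils/image_utils.py`, and the function name `prioritize_segmented_text` is hypothetical.

```python
import copy
from typing import Any, Dict, Optional


def prioritize_segmented_text(
    ocr_result: Dict[str, Any],
    segmentation_data: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    """Return a copy of `ocr_result` in which segmented text takes priority."""
    result = copy.deepcopy(ocr_result)  # leave the caller's dict untouched
    combined = (segmentation_data or {}).get("combined_text", "")
    if not combined.strip():
        # No usable segmented text: keep the OCR output as it is.
        return result
    contents = result.setdefault("ocr_contents", {})
    if "raw_text" in contents:
        # Preserve the earlier text as a backup before overwriting it.
        contents["original_raw_text"] = contents["raw_text"]
    contents["segmentation_text"] = combined
    contents["main_text"] = combined
    contents["raw_text"] = combined
    return result


if __name__ == "__main__":
    ocr = {"ocr_contents": {"raw_text": "image-focused text", "title": "Test Document"}}
    seg = {"combined_text": "BIRMINGHAM 2, ALABAMA\n\nDear Mary:"}
    print(prioritize_segmented_text(ocr, seg)["ocr_contents"])
```

Expressing the contract as assertions against a pure function like this would let the verification run under a test runner instead of relying on printed check marks.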