milwright committed on
Commit
42dc069
·
1 Parent(s): c04ffe5

Consolidate segmentation improvements and code cleanup

- Added OCR preprocessing documentation
- Enhanced image segmentation algorithm
- Improved UI layout and styling
- Added new test and verification files
- Updated project requirements
- Reorganized utility modules
- Added .clinerules configuration for better project documentation
- Removed obsolete test files

Files changed (46)
  1. .clinerules/activeContext.md +31 -0
  2. .clinerules/complexFeature.md +29 -0
  3. .clinerules/integrationSpecs.md +27 -0
  4. .clinerules/principle-of-simplicity.md +55 -0
  5. .clinerules/productContext.md +25 -0
  6. .clinerules/progress.md +27 -0
  7. .clinerules/techContext.md +40 -0
  8. .gitignore +18 -0
  9. CLAUDE.md +0 -32
  10. app.py +3 -1
  11. docs/preprocessing.md +179 -0
  12. docs/preprocessing_triage.md +17 -0
  13. image_segmentation.py +209 -44
  14. ocr_processing.py +152 -80
  15. requirements.txt +1 -1
  16. structured_ocr.py +29 -58
  17. test_fix.py +55 -0
  18. test_magellan_language.py +0 -39
  19. test_segmentation_fix.py +100 -0
  20. testing/magician_app_investigation_plan.md +0 -58
  21. testing/magician_app_result.json +0 -16
  22. testing/magician_image_final_report.md +0 -58
  23. testing/magician_image_findings.md +0 -84
  24. testing/magician_ocr_text.txt +0 -9
  25. testing/test_app_direct.py +0 -180
  26. testing/test_filename_format.py +93 -0
  27. testing/test_improvements.py +0 -244
  28. testing/test_json_bleed.py +46 -0
  29. testing/test_magician.py +0 -57
  30. testing/test_magician_image.py +0 -130
  31. testing/test_newspaper_detection.py +0 -146
  32. testing/test_segmentation.py +0 -238
  33. testing/test_simple_improvements.py +0 -175
  34. testing/test_text_as_image.py +0 -200
  35. ui/custom.css +9 -31
  36. ui/layout.py +64 -30
  37. ui_components.py +31 -35
  38. utils.py +55 -18
  39. utils/__init__.py +47 -0
  40. utils/content_utils.py +3 -89
  41. utils/general_utils.py +53 -18
  42. utils/image_utils.py +648 -333
  43. utils/text_utils.py +76 -13
  44. utils/ui_utils.py +132 -200
  45. verify_fix.py +70 -0
  46. verify_segmentation_fix.py +116 -0
.clinerules/activeContext.md ADDED
@@ -0,0 +1,31 @@
1
+ # Current Work Focus
2
+
3
+ Refining image preprocessing pipelines to better balance cleaning and preservation of fine details (especially for handwritten inputs)
4
+
5
+ Improving document type detection accuracy to feed better prompts to the OCR system
6
+
7
+ Enhancing structured output schemas to cover additional types like travel logs and scientific diagrams
8
+
9
+ Recent Changes
10
+
11
+ Implemented more document-type-specific preprocessing pipelines
12
+
13
+ Switched default OCR engine to Mistral for both printed and handwritten material
14
+
15
+ Modularized utility functions across utils.py, ocr_utils.py, and newly proposed submodules
16
+
17
+ Active Decisions and Considerations
18
+
19
+ Whether to expose preprocessing options to end users (e.g., deskew threshold)
20
+
21
+ Whether to allow fallback to local Tesseract OCR for offline cases
22
+
23
+ Determining best practices for handling multi-page PDFs with mixed layouts
24
+
25
+ Important Patterns and Learnings
26
+
27
+ Document type detection greatly improves OCR quality when tuned per-class
28
+
29
+ Over-aggressive preprocessing can erase faint handwriting; thresholds must be conservative for historical artifacts
30
+
31
+ Keeping preprocessing modular enables rapid experimentation and tuning.
.clinerules/complexFeature.md ADDED
@@ -0,0 +1,29 @@
1
+ # Complex Feature Documentation
2
+
3
+ Document Type Detection
4
+
5
+ Utilizes lightweight statistical heuristics combined with visual features.
6
+
7
+ Preprocessing-driven (thresholding, aspect ratios, contour analysis).
8
+
9
+ Outputs labels such as "handwritten letter", "scientific report", "recipe".
10
+
11
+ Preprocessing Pipelines
12
+
13
+ Customizable per document type.
14
+
15
+ Adaptive thresholding for delicate handwriting.
16
+
17
+ Morphological operations for removing bleed-through or artifacts.
18
+
19
+ Multilingual Handling
20
+
21
+ Language detection on OCR snippets using language_detection.py.
22
+
23
+ Allows contextual OCR prompting based on dominant language.
24
+
25
+ Structured Output Generation
26
+
27
+ Parsing OCR results into structured categories: titles, subtitles, body, marginalia, dates.
28
+
29
+ Supports output in raw text, JSON, and annotated Markdown.
.clinerules/integrationSpecs.md ADDED
@@ -0,0 +1,27 @@
1
+ # Integration Specifications
2
+
3
+ External Services
4
+
5
+ Mistral OCR API: Primary service for document transcription and structured extraction.
6
+
7
+ Tesseract OCR (Local Fallback): Optional backup when API unavailable.
8
+
9
+ Internal Module Communication
10
+
11
+ app.py triggers ocr_processing.py orchestration based on user input.
12
+
13
+ ocr_processing.py dynamically calls preprocessing and OCR modules based on document type.
14
+
15
+ Preprocessed images passed through structured_ocr.py for API interaction and postprocessing.
16
+
17
+ Session State
18
+
19
+ Streamlit session stores:
20
+
21
+ Uploaded file metadata
22
+
23
+ Preprocessing parameters
24
+
25
+ Detected document type
26
+
27
+ OCR structured output
.clinerules/principle-of-simplicity.md ADDED
@@ -0,0 +1,55 @@
1
+ # Project Rule: Maintain Clean Separation Between Data and Presentation
2
+ The Principle of Content Purity
3
+ Core rule: Data that needs to be processed or stored should never contain presentation markup that's only meant for display.
4
+
5
+ What This Means in Practice
6
+ Avoid HTML in data structures
7
+
8
+ ✅ DO: Keep raw text as pure text content
9
+ ❌ DON'T: Embed HTML, CSS, or other presentation-specific markup in data fields
10
+ Design clear boundaries between data and presentation layers
11
+
12
+ ✅ DO: Add presentation elements at the final rendering stage only
13
+ ❌ DON'T: Add and strip presentation elements repeatedly throughout the processing pipeline
14
+ Fix problems at their source, not their symptoms
15
+
16
+ ✅ DO: Prevent markup injection at the origin rather than adding complex stripping logic later
17
+ ❌ DON'T: Create complex sanitization functions to clean data that shouldn't be contaminated in the first place
18
+ The OCR Text Formatter Example
19
+ Before (problematic):
20
+
21
+ def format_ocr_text(text, for_display=True):
22
+ # Text processing...
23
+
24
+ if for_display:
25
+ html = f"""
26
+ <div class="ocr-text-container">
27
+ {formatted_text}
28
+ </div>
29
+ """
30
+ return html
31
+ else:
32
+ return formatted_text
33
+ After (better):
34
+
35
+ def format_ocr_text(text, for_display=False):
36
+ # Text processing...
37
+
38
+ if for_display:
39
+ html = f"""
40
+ {formatted_text}
41
+ """
42
+ return html
43
+ else:
44
+ return formatted_text
45
+ What changed:
46
+
47
+ Default parameter changed to avoid accidental HTML addition
48
+ HTML wrapper div completely removed to eliminate the source of pollution
49
+ The simplest solution (removing the container) was better than any complex stripping logic
50
+ Benefits
51
+ Cleaner data: Raw content remains genuinely raw and easier to work with
52
+ More predictable processing: No need to account for unexpected HTML in processing pipelines
53
+ Easier debugging: Problems are visible at their source rather than as mysterious artifacts later
54
+ Reduced complexity: Eliminates the need for complex HTML stripping and sanitization logic
55
+ Remember: Simplicity is not just an ideal—it's a practical strategy that prevents entire classes of bugs.
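One possible further step in the same direction, as a minimal sketch: keep the formatter free of markup entirely and add markup in a separate render function. The name `render_ocr_html` and the `ocr-text` class are hypothetical and do not come from the repository; the text cleanup shown is only a placeholder for the real processing.

```python
import html


def format_ocr_text(text: str) -> str:
    """Return cleaned plain text; this layer never adds markup."""
    # Normalize line endings and strip trailing whitespace on each line.
    lines = [line.rstrip() for line in text.replace("\r\n", "\n").split("\n")]
    return "\n".join(lines).strip()


def render_ocr_html(text: str) -> str:
    """Wrap already-formatted text in markup at the final rendering stage only."""
    # Escape so OCR output containing '<' or '&' cannot inject markup.
    return f'<div class="ocr-text">{html.escape(text)}</div>'


raw = "Dear Sir,  \r\nprice < 5 & rising  "
clean = format_ocr_text(raw)   # stored and processed as plain text
page = render_ocr_html(clean)  # markup exists only in the display layer
```

With this split, anything stored or passed down the pipeline is `clean`, and no stripping logic is needed later because markup is never added before rendering.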
.clinerules/productContext.md ADDED
@@ -0,0 +1,25 @@
1
+ # Why the Project Exists
2
+
3
+ Historians, archivists, and researchers often struggle to extract reliable text from scanned archival materials. Many OCR tools fail when dealing with handwritten letters, historical scientific documents, and poorly digitized photographs.
4
+
5
+ Problems Being Solved
6
+
7
+ Low OCR accuracy on handwritten or degraded historical documents
8
+
9
+ Lack of structured metadata extraction for archival research
10
+
11
+ Inability to easily apply context-specific AI prompting for nuanced historical material
12
+
13
+ How the Product Should Work
14
+
15
+ Users upload images or PDFs
16
+
17
+ Preprocessing automatically improves OCR readiness
18
+
19
+ Document type detection informs customized AI prompting
20
+
21
+ Mistral OCR processes the document to output structured data (titles, authors, dates, body text, marginalia, etc.)
22
+
23
+ Users can download raw text, structured JSON, or annotated markdown
24
+
25
+ Example: "The OCR system must intelligently handle multilingual documents, support marginal notes and irregular layouts, and allow historians to guide the extraction process."
.clinerules/progress.md ADDED
@@ -0,0 +1,27 @@
1
+ # Current Status of Features
2
+
3
+ ✅ Upload and preprocess historical documents
4
+
5
+ ⚙️ Document type detection (estimated 80% accuracy)
6
+
7
+ ✅ OCR extraction with structured outputs (titles, body, marginalia)
8
+
9
+ ✅ Multiple output formats (Raw text, JSON, Markdown)
10
+
11
+ Known Issues and Limitations
12
+
13
+ Inconsistent marginalia capture in low-quality scans
14
+
15
+ Difficulties with heavily degraded non-Latin handwritten scripts
16
+
17
+ Layout detection errors on highly irregular, mixed-content PDFs
18
+
19
+ Evolution of Project Decisions
20
+
21
+ Migrated from Tesseract-only OCR to Mistral-first hybrid approach
22
+
23
+ Modularized preprocessing steps to allow flexible experimentation
24
+
25
+ Added support for marginalia and footnotes where feasible
26
+
27
+ Enhanced session state management to preserve intermediate results.
.clinerules/techContext.md ADDED
@@ -0,0 +1,40 @@
1
+ techContext.md
2
+ Technologies and Frameworks Used
3
+
4
+ Frontend Framework: Streamlit 1.44.1
5
+
6
+ OCR Engine: Mistralai Python SDK (≥ 0.1.0)
7
+
8
+ Image Processing: OpenCV, Pillow
9
+
10
+ PDF Parsing: pdf2image
11
+
12
+ Fallback OCR: Pytesseract
13
+
14
+ Utilities: NumPy, Requests, pycountry
15
+
16
+ Development Setup
17
+
18
+ Python 3.11+ virtual environment
19
+
20
+ Requirements managed through requirements.txt
21
+
22
+ .env file setup for API keys and environment configs
23
+
24
+ Type checking with mypy, linting with ruff
25
+
26
+ Technical Constraints
27
+
28
+ API rate limits and payload size restrictions from Mistral
29
+
30
+ Streamlit's session state limitations for very large files
31
+
32
+ Processing timeouts for oversized or complex PDFs
33
+
34
+ Dependencies and Tool Configurations
35
+
36
+ Mistralai pinned version (≥ 0.1.0)
37
+
38
+ OpenCV configured for headless environments
39
+
40
+ Pillow used for post-processing and visualization checks
.gitignore CHANGED
@@ -0,0 +1,18 @@
1
+ # Python bytecode
2
+ __pycache__/
3
+ *.py[cod]
4
+ *.class
5
+
6
+ # MacOS system files
7
+ .DS_Store
8
+
9
+ # Output and temporary files
10
+ output/debug/
11
+ output/comparison/
12
+ output/segmentation_test/text_regions/
13
+ output/preprocessing_test/
14
+ logs/
15
+ *.backup
16
+
17
+ # Temporary documents
18
+ Tmplf6xnkgr*
CLAUDE.md DELETED
@@ -1,32 +0,0 @@
1
- # CLAUDE.md
2
-
3
- This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
4
-
5
- ## Commands
6
- - Run app: `streamlit run app.py`
7
- - Test OCR functionality: `python structured_ocr.py <file_path>`
8
- - Process single file with logging: `python process_file.py <file_path>`
9
- - Run specific test: `python testing/test_magician_image.py`
10
- - Run typechecking: `mypy .`
11
- - Lint code: `ruff check .` or `flake8`
12
-
13
- ## Environment Setup
14
- - API key: Set `MISTRAL_API_KEY` in `.env` file or environment variable
15
- - Install dependencies: `pip install -r requirements.txt`
16
- - System requirements: Install `poppler-utils` and `tesseract-ocr` for PDF processing
17
-
18
- ## Code Style Guidelines
19
- - **Imports**: Standard library first, third-party next, local modules last
20
- - **Types**: Use Pydantic models and type hints for all functions
21
- - **Error handling**: Use specific exceptions with informative messages
22
- - **Naming**: snake_case for variables/functions, PascalCase for classes
23
- - **Documentation**: Google-style docstrings for all functions/classes
24
- - **Preprocessing**: Support handwritten documents via document_type parameter
25
- - **Line length**: ≤100 characters
26
-
27
- ## Base64 Encoding
28
- - Always include MIME type in data URLs: `data:image/jpeg;base64,...`
29
- - Use the appropriate MIME type for different file formats: jpeg, png, pdf, etc.
30
- - For encoded bytes, use `encode_bytes_for_api` with correct MIME type
31
- - For file paths, use `encode_image_for_api` which auto-detects MIME type
32
- - In utils.py, use `get_base64_from_bytes` for raw bytes or `get_base64_from_image` for files
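The Base64 guidance in the removed file can be illustrated with a small sketch using only the standard library. The helpers `to_data_url` and `guess_mime_type` are hypothetical stand-ins, not the repository's `encode_bytes_for_api` or `encode_image_for_api`, whose exact signatures are not shown in this commit.

```python
import base64
import mimetypes


def to_data_url(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a data URL that carries its MIME type."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def guess_mime_type(filename: str, default: str = "application/octet-stream") -> str:
    """Guess a MIME type from the file extension, with a fallback."""
    guessed, _ = mimetypes.guess_type(filename)
    return guessed or default


# The MIME type is derived from the file name rather than hard-coded.
url = to_data_url(b"hello", guess_mime_type("scan.png"))
```

Deriving the MIME type from the extension keeps JPEG, PNG, and PDF inputs on one code path, which appears to be the intent of the auto-detecting helper described above.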
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
app.py CHANGED
@@ -197,6 +197,7 @@ def show_example_documents():
  "https://huggingface.co/spaces/milwright/historical-ocr/resolve/main/input/handwritten-letter.jpg",
  "https://huggingface.co/spaces/milwright/historical-ocr/resolve/main/input/magellan-travels.jpg",
  "https://huggingface.co/spaces/milwright/historical-ocr/resolve/main/input/milgram-flier.png",
+ "https://huggingface.co/spaces/milwright/historical-ocr/resolve/main/input/recipe.jpg",
  ]
 
  sample_names = [
@@ -205,7 +206,8 @@ def show_example_documents():
  "The Magician (Image)",
  "Handwritten Letter (Image)",
  "Magellan Travels (Image)",
- "Milgram Flier (Image)"
+ "Milgram Flier (Image)",
+ "Historical Recipe (Image)"
  ]
 
  # Initialize sample_document in session state if it doesn't exist
docs/preprocessing.md ADDED
@@ -0,0 +1,179 @@
1
+ # Image Preprocessing for Historical Document OCR
2
+
3
+ This document outlines the enhanced preprocessing capabilities for improving OCR quality on historical documents, including deskewing, thresholding, and morphological operations.
4
+
5
+ ## Overview
6
+
7
+ The preprocessing pipeline offers several options to enhance image quality before OCR processing:
8
+
9
+ 1. **Deskewing**: Automatically detects and corrects document skew using multiple detection algorithms
10
+ 2. **Thresholding**: Converts grayscale images to binary using adaptive or Otsu methods with pre-blur options
11
+ 3. **Morphological Operations**: Cleans up binary images by removing noise or filling in gaps
12
+ 4. **Document-Type Specific Settings**: Customized preprocessing configurations for different document types
13
+
14
+ ## Configuration
15
+
16
+ Preprocessing options are set in `config.py` and are tunable per document type. All settings are accessible through environment variables for easy deployment configuration.
17
+
18
+ ### Deskewing
19
+
20
+ ```python
21
+ "deskew": {
22
+ "enabled": True/False, # Whether to apply deskewing
23
+ "angle_threshold": 0.1, # Minimum angle (degrees) to trigger deskewing
24
+ "max_angle": 45.0, # Maximum correction angle
25
+ "use_hough": True/False, # Use Hough transform in addition to minAreaRect
26
+ "consensus_method": "average", # How to combine angle estimations
27
+ "fallback": {"enabled": True/False} # Fall back to original if deskewing fails
28
+ }
29
+ ```
30
+
31
+ Deskewing uses two methods:
32
+ - **minAreaRect**: Finds contours in the binary image and calculates their orientation
33
+ - **Hough Transform**: Detects lines in the image and their angles
34
+
35
+ The `consensus_method` can be:
36
+ - `"average"`: Average of all detected angles (most stable)
37
+ - `"median"`: Median of all angles (robust to outliers)
38
+ - `"min"`: Minimum absolute angle (most conservative)
39
+ - `"max"`: Maximum absolute angle (most aggressive)
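The four consensus options listed above can be sketched as a single function. This is an illustration of the described behavior, not the repository's implementation; `combine_angles` is a hypothetical name, and returning `0.0` for an empty list is an assumption.

```python
from statistics import mean, median


def combine_angles(angles: list[float], method: str = "average") -> float:
    """Combine skew-angle estimates (degrees) into a single correction angle."""
    if not angles:
        return 0.0  # assumption: no estimates means no correction
    if method == "average":
        return mean(angles)
    if method == "median":
        return median(angles)
    if method == "min":
        return min(angles, key=abs)  # smallest magnitude, sign preserved
    if method == "max":
        return max(angles, key=abs)  # largest magnitude, sign preserved
    raise ValueError(f"unknown consensus method: {method!r}")


# Example estimates, e.g. from minAreaRect and Hough line detection.
estimates = [1.0, -3.0, 2.0, 12.0]
```

Here the single outlier (12.0) pulls the average to 3.0 while the median stays at 1.5, which is the trade-off the list describes.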
40
+
41
+ ### Thresholding
42
+
43
+ ```python
44
+ "thresholding": {
45
+ "method": "adaptive", # "none", "otsu", or "adaptive"
46
+ "adaptive_block_size": 11, # Block size for adaptive thresholding (must be odd)
47
+ "adaptive_constant": 2, # Constant subtracted from mean
48
+ "otsu_gaussian_blur": 1, # Blur kernel size for Otsu pre-processing
49
+ "preblur": {
50
+ "enabled": True/False, # Whether to apply pre-blur
51
+ "method": "gaussian", # "gaussian" or "median"
52
+ "kernel_size": 3 # Blur kernel size (must be odd)
53
+ },
54
+ "fallback": {"enabled": True/False} # Fall back to grayscale if thresholding fails
55
+ }
56
+ ```
57
+
58
+ Thresholding methods:
59
+ - **Otsu**: Automatically determines optimal global threshold (best for high-contrast documents)
60
+ - **Adaptive**: Calculates thresholds for different regions (better for uneven lighting, historical documents)
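To make the "optimal global threshold" of the Otsu method concrete, here is a pure-Python sketch that picks the gray level maximizing between-class variance. It is for illustration only: the function name is hypothetical, and a real pipeline would more likely rely on an image library such as OpenCV.

```python
def otsu_threshold(pixels: list[int]) -> int:
    """Return the gray level t (0-255) that maximizes between-class variance.

    Pixels <= t form the lower class and pixels > t the upper class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the lower class
    sum0 = 0  # sum of gray levels in the lower class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so there is no split to score
        mean0 = sum0 / w0
        mean1 = (total_sum - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


# Dark ink near 30-40, bright paper near 200-210.
sample = [30] * 40 + [40] * 10 + [200] * 30 + [210] * 20
t = otsu_threshold(sample)
binary = [255 if p > t else 0 for p in sample]
```

A single global threshold works here because the sample is cleanly bimodal. Under uneven lighting no single `t` separates ink from paper everywhere, which is the case the adaptive method addresses.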
61
+
62
+ ### Morphological Operations
63
+
64
+ ```python
65
+ "morphology": {
66
+ "enabled": True/False, # Whether to apply morphological operations
67
+ "operation": "close", # "open", "close", "both"
68
+ "kernel_size": 1, # Size of the structuring element
69
+ "kernel_shape": "rect" # "rect", "ellipse", "cross"
70
+ }
71
+ ```
72
+
73
+ Morphological operations:
74
+ - **Open**: Erosion followed by dilation - removes small noise and disconnects thin connections
75
+ - **Close**: Dilation followed by erosion - fills small holes and connects broken elements
76
+ - **Both**: Applies opening followed by closing
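The open and close definitions above can be checked on a tiny binary grid. The sketch below uses a square structuring element and plain lists; the function names are hypothetical, and clipping the window at the image border is a simplification chosen for this example.

```python
Grid = list[list[int]]


def _apply(grid: Grid, size: int, combine) -> Grid:
    """Slide a size x size square window over the grid, clipped at the borders."""
    rows, cols = len(grid), len(grid[0])
    r = size // 2
    out = []
    for y in range(rows):
        row = []
        for x in range(cols):
            window = [
                grid[j][i]
                for j in range(max(0, y - r), min(rows, y + r + 1))
                for i in range(max(0, x - r), min(cols, x + r + 1))
            ]
            row.append(combine(window))
        out.append(row)
    return out


def erode(grid: Grid, size: int = 3) -> Grid:
    return _apply(grid, size, min)  # a pixel survives only if its whole window is set


def dilate(grid: Grid, size: int = 3) -> Grid:
    return _apply(grid, size, max)  # a pixel is set if any pixel in its window is set


def opening(grid: Grid, size: int = 3) -> Grid:
    return dilate(erode(grid, size), size)  # erosion, then dilation


def closing(grid: Grid, size: int = 3) -> Grid:
    return erode(dilate(grid, size), size)  # dilation, then erosion


noisy = [[0] * 5 for _ in range(5)]
noisy[2][2] = 1  # one isolated speck
holed = [[1] * 5 for _ in range(5)]
holed[2][2] = 0  # one small hole

opened = opening(noisy)  # the speck is removed
closed = closing(holed)  # the hole is filled
```

In this sketch a `size` of 1 leaves the grid unchanged, since each window then contains only the pixel itself.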
77
+
78
+ ### Document Type Configurations
79
+
80
+ The system includes optimized settings for different document types:
81
+
82
+ ```python
83
+ "document_types": {
84
+ "standard": {
85
+ # Default settings - will use the global settings
86
+ },
87
+ "newspaper": {
88
+ "deskew": {"enabled": True, "angle_threshold": 0.3, "max_angle": 10.0},
89
+ "thresholding": {
90
+ "method": "adaptive",
91
+ "adaptive_block_size": 15,
92
+ "adaptive_constant": 3,
93
+ "preblur": {"method": "gaussian", "kernel_size": 3}
94
+ },
95
+ "morphology": {"operation": "close", "kernel_size": 1}
96
+ },
97
+ "handwritten": {
98
+ "deskew": {"enabled": True, "angle_threshold": 0.5, "use_hough": False},
99
+ "thresholding": {
100
+ "method": "adaptive",
101
+ "adaptive_block_size": 31,
102
+ "adaptive_constant": 5,
103
+ "preblur": {"method": "median", "kernel_size": 3}
104
+ },
105
+ "morphology": {"operation": "open", "kernel_size": 1}
106
+ },
107
+ "book": {
108
+ "deskew": {"enabled": True},
109
+ "thresholding": {
110
+ "method": "otsu",
111
+ "preblur": {"method": "gaussian", "kernel_size": 5}
112
+ },
113
+ "morphology": {"operation": "both", "kernel_size": 1}
114
+ }
115
+ }
116
+ ```
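The per-type blocks above list only the keys that differ from the global settings, so some form of overlay is implied. A minimal sketch of one way to do that is a recursive merge; `merge_settings` and the two dictionaries below are hypothetical, with values copied from the examples in this document.

```python
from copy import deepcopy


def merge_settings(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay overrides on defaults without mutating either."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_settings(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical global defaults and a newspaper override.
GLOBAL = {
    "deskew": {"enabled": True, "angle_threshold": 0.1, "max_angle": 45.0},
    "thresholding": {"method": "adaptive", "adaptive_block_size": 11},
}
NEWSPAPER = {
    "deskew": {"angle_threshold": 0.3, "max_angle": 10.0},
    "thresholding": {"adaptive_block_size": 15},
}

settings = merge_settings(GLOBAL, NEWSPAPER)
```

Keys absent from the override (for example `"enabled"` under `"deskew"`) keep their global values, and an empty override such as `"standard"` yields the global settings unchanged.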
117
+
118
+ ## Performance and Logging
119
+
120
+ ```python
121
+ "performance": {
122
+ "parallel": {
123
+ "enabled": True/False, # Whether to use parallel processing
124
+ "max_workers": 4 # Maximum number of worker threads
125
+ },
126
+ "timeout_ms": 10000 # Timeout for preprocessing (in milliseconds)
127
+ }
128
+
129
+ "logging": {
130
+ "enabled": True/False, # Whether to log preprocessing metrics
131
+ "metrics": ["skew_angle", "binary_nonzero_pct", "processing_time"],
132
+ "output_path": "logs/preprocessing_metrics.json"
133
+ }
134
+ ```
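How `max_workers` and `timeout_ms` might be applied can be sketched with the standard library's thread pool. This is an assumption about usage, not the repository's code: `preprocess_all` is a hypothetical helper, and falling back to the unprocessed page on timeout or error is a design choice made for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def preprocess_all(pages, preprocess, max_workers=4, timeout_ms=10000):
    """Preprocess pages in a thread pool, keeping the original page on failure."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(preprocess, page) for page in pages]
        for page, future in zip(pages, futures):
            try:
                # Wait up to timeout_ms for this page's result.
                results.append(future.result(timeout=timeout_ms / 1000))
            except FutureTimeout:
                results.append(page)  # too slow: keep the original page
            except Exception:
                results.append(page)  # preprocessing failed: keep the original
    return results


def shout(page):
    """Stand-in for a real preprocessing step."""
    if page == "bad":
        raise ValueError("cannot process")
    return page.upper()


out = preprocess_all(["a", "bad", "c"], shout)
```

Note that a timed-out task is not cancelled here. The pool still waits for running tasks when the `with` block exits, so the timeout bounds how long each result is waited for rather than the total run time.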
135
+
136
+ ## Usage with OCR Processing
137
+
138
+ When processing documents, simply specify the document type:
139
+
140
+ ```python
141
+ preprocessing_options = {
142
+ "document_type": "newspaper", # Use newspaper-optimized settings
143
+ "grayscale": True, # Legacy option: apply grayscale conversion
144
+ "denoise": True, # Legacy option: apply denoising
145
+ "contrast": 10, # Legacy option: adjust contrast (0-100)
146
+ "rotation": 0 # Legacy option: manual rotation (degrees)
147
+ }
148
+
149
+ # Apply preprocessing and OCR
150
+ result = process_file(file_bytes, file_ext, preprocessing_options=preprocessing_options)
151
+ ```
152
+
153
+ ## Visual Examples
154
+
155
+ ### Original Document
156
+ *[A historical newspaper or document image would be shown here]*
157
+
158
+ ### After Deskewing
159
+ *[The same document, with skew corrected]*
160
+
161
+ ### After Thresholding
162
+ *[The document converted to binary with clear text]*
163
+
164
+ ### After Morphological Operations
165
+ *[The binary image with small noise removed and/or gaps filled]*
166
+
167
+ ## Troubleshooting
168
+
169
+ ### Poor Deskewing Results
170
+ - **Symptom**: Document skew is not correctly detected or corrected
171
+ - **Solution**: Try adjusting `angle_threshold` or `max_angle`, or disable Hough transform for handwritten documents
172
+
173
+ ### Thresholding Issues
174
+ - **Symptom**: Text is lost or background noise is excessive after thresholding
175
+ - **Solution**: Try changing the thresholding method or adjusting `adaptive_block_size` and `adaptive_constant`
176
+
177
+ ### Performance Concerns
178
+ - **Symptom**: Processing is too slow for large documents
179
+ - **Solution**: Enable parallel processing, reduce image size, or disable some preprocessing steps for faster results
docs/preprocessing_triage.md ADDED
@@ -0,0 +1,17 @@
1
+ # OCR Preprocessing Triage
2
+
3
+ ## Quick Fixes Implemented
4
+
5
+ 1. **Handwritten** - Disabled thresholding, uses grayscale only
6
+ 2. **Newspapers** - Increased block size (51) and constant (10) for softer thresholding
7
+ 3. **JPEG Artifacts** - Auto-detection and specialized denoising
8
+ 4. **Border Issues** - Crops edges after deskew to avoid threshold problems
9
+ 5. **Low Resolution** - Upscales small text for better recognition
10
+
11
+ ## Testing
12
+
13
+ ```
14
+ python testing/test_triage_fix.py
15
+ ```
16
+
17
+ Check `output/comparison/` for results.
image_segmentation.py CHANGED
@@ -18,7 +18,7 @@ logging.basicConfig(level=logging.INFO,
18
  format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
19
  logger = logging.getLogger(__name__)
20
 
21
- def segment_image_for_ocr(image_path: Union[str, Path], vision_enabled: bool = True) -> Dict[str, Union[Image.Image, str]]:
22
  """
23
  Segment an image into text and image regions for improved OCR processing.
24
 
@@ -76,9 +76,17 @@ def segment_image_for_ocr(image_path: Union[str, Path], vision_enabled: bool = T
76
  cv2.THRESH_BINARY_INV, 11, 2)
77
 
78
  # Step 2: Perform morphological operations to connect text components
79
- # Create a rectangular kernel that's wider than tall (for text lines)
80
- rect_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 3))
81
- dilation = cv2.dilate(binary, rect_kernel, iterations=3)
82
 
83
  # Step 3: Find contours which will correspond to text blocks
84
  contours, _ = cv2.findContours(dilation, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
@@ -87,8 +95,8 @@ def segment_image_for_ocr(image_path: Union[str, Path], vision_enabled: bool = T
87
  text_mask = np.zeros_like(gray)
88
 
89
  # Step 4: Filter contours based on size to identify text regions
90
- min_area = 100 # Minimum contour area to be considered text
91
- max_area = img.shape[0] * img.shape[1] * 0.5 # Max 50% of image
92
 
93
  text_regions = []
94
  for contour in contours:
@@ -105,10 +113,33 @@ def segment_image_for_ocr(image_path: Union[str, Path], vision_enabled: bool = T
105
  roi = binary[y:y+h, x:x+w]
106
  dark_pixel_density = np.sum(roi > 0) / (w * h)
107
 
108
- # Additional check for text-like characteristics
109
- # Text typically has aspect ratio > 1 (wider than tall) and reasonable density
110
- # Relaxed aspect ratio constraints and lowered density threshold for better detection
111
- if (aspect_ratio > 1.2 or aspect_ratio < 0.7) and dark_pixel_density > 0.15:
112
  # Add to text regions list
113
  text_regions.append((x, y, w, h))
114
  # Add to text mask
@@ -119,44 +150,170 @@ def segment_image_for_ocr(image_path: Union[str, Path], vision_enabled: bool = T
119
  for x, y, w, h in text_regions:
120
  cv2.rectangle(text_regions_vis, (x, y), (x+w, y+h), (0, 255, 0), 2)
121
 
122
- # Create image regions mask (inverse of text mask)
123
- image_mask = cv2.bitwise_not(text_mask)
124
 
125
- # Create image regions visualization
126
- image_regions_vis = img_rgb.copy()
127
- # Add detected image regions in red
128
- for contour in contours:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
129
  area = cv2.contourArea(contour)
130
- if area > max_area * 0.1: # Only highlight larger image regions
 
131
  x, y, w, h = cv2.boundingRect(contour)
132
- if np.sum(text_mask[y:y+h, x:x+w]) / (w * h) < 128: # Not significantly overlapping with text
133
- cv2.rectangle(image_regions_vis, (x, y), (x+w, y+h), (0, 0, 255), 2)
134
 
135
- # Step 6: Create a combined result that enhances text regions
136
- # Different processing for text vs. image regions
137
  combined_result = img_rgb.copy()
138
 
139
- # Apply more aggressive contrast enhancement to text regions
140
- text_enhanced = cv2.bitwise_and(img_rgb, img_rgb, mask=text_mask)
141
- # Convert to LAB for better contrast enhancement
142
- text_lab = cv2.cvtColor(text_enhanced, cv2.COLOR_BGR2LAB)
143
- l, a, b = cv2.split(text_lab)
144
- # Apply CLAHE to L channel
145
- clahe = cv2.createCLAHE(clipLimit=3.0, tileGridSize=(8, 8))
146
- cl = clahe.apply(l)
147
- # Merge back
148
- enhanced_lab = cv2.merge((cl, a, b))
149
- text_enhanced = cv2.cvtColor(enhanced_lab, cv2.COLOR_LAB2BGR)
150
-
151
- # Apply gentler processing to image regions
152
- image_enhanced = cv2.bitwise_and(img_rgb, img_rgb, mask=image_mask)
153
- # Just slight sharpening for image regions
154
- image_enhanced = cv2.GaussianBlur(image_enhanced, (0, 0), 3)
155
- image_enhanced = cv2.addWeighted(img_rgb, 1.5, image_enhanced, -0.5, 0)
156
- image_enhanced = cv2.bitwise_and(image_enhanced, image_enhanced, mask=image_mask)
157
-
158
- # Combine the enhanced regions
159
- combined_result = cv2.add(text_enhanced, image_enhanced)
160
 
161
  # Convert visualization results back to PIL Images
162
  text_regions_pil = Image.fromarray(cv2.cvtColor(text_regions_vis, cv2.COLOR_BGR2RGB))
@@ -167,13 +324,21 @@ def segment_image_for_ocr(image_path: Union[str, Path], vision_enabled: bool = T
167
  _, buffer = cv2.imencode('.png', text_mask)
168
  text_mask_base64 = base64.b64encode(buffer).decode('utf-8')
169
 
 
 
 
 
 
 
 
170
  # Return the segmentation results
171
  return {
172
  'text_regions': text_regions_pil,
173
  'image_regions': image_regions_pil,
174
  'text_mask_base64': f"data:image/png;base64,{text_mask_base64}",
175
  'combined_result': combined_result_pil,
176
- 'text_regions_coordinates': text_regions
 
177
  }
178
 
179
  except Exception as e:
@@ -187,7 +352,7 @@ def segment_image_for_ocr(image_path: Union[str, Path], vision_enabled: bool = T
187
  'text_regions_coordinates': []
188
  }
189
 
190
- def process_segmented_image(image_path: Union[str, Path], output_dir: Optional[Path] = None) -> Dict:
191
  """
192
  Process an image using segmentation for improved OCR, saving visualization outputs.
193
 
 
18
  format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
19
  logger = logging.getLogger(__name__)
20
 
21
+ def segment_image_for_ocr(image_path: Union[str, Path], vision_enabled: bool = True, preserve_content: bool = True) -> Dict[str, Union[Image.Image, str]]:
22
  """
23
  Segment an image into text and image regions for improved OCR processing.
24
 
 
76
  cv2.THRESH_BINARY_INV, 11, 2)
77
 
78
  # Step 2: Perform morphological operations to connect text components
79
+ # Use a combination of horizontal and vertical kernels for better text detection
80
+ # in historical documents with mixed content
81
+ horiz_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15, 1))
82
+ vert_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 3))
83
+
84
+ # Apply horizontal dilation to connect characters in a line
85
+ horiz_dilation = cv2.dilate(binary, horiz_kernel, iterations=1)
86
+ # Apply vertical dilation to connect lines in a paragraph
87
+ vert_dilation = cv2.dilate(binary, vert_kernel, iterations=1)
88
+ # Combine both dilations for better region detection
89
+ dilation = cv2.bitwise_or(horiz_dilation, vert_dilation)
90
 
91
  # Step 3: Find contours which will correspond to text blocks
92
  contours, _ = cv2.findContours(dilation, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
 
95
  text_mask = np.zeros_like(gray)
96
 
97
  # Step 4: Filter contours based on size to identify text regions
98
+ min_area = 50 # Lower minimum area to catch smaller text blocks in historical documents
99
+ max_area = img.shape[0] * img.shape[1] * 0.4 # Reduced max to avoid capturing too much
100
 
101
  text_regions = []
102
  for contour in contours:
 
113
  roi = binary[y:y+h, x:x+w]
114
  dark_pixel_density = np.sum(roi > 0) / (w * h)
115
 
116
+ # Special handling for historical documents
117
+ # Check for position - text is often at the bottom in historical prints
118
+ y_position_ratio = y / img.shape[0] # Normalized y position (0 at top, 1 at bottom)
119
+
120
+ # Bottom regions get preferential treatment as text
121
+ is_bottom_region = y_position_ratio > 0.7
122
+
123
+ # Check if part of a text block cluster (horizontal proximity)
124
+ is_text_cluster = False
125
+ # Check already identified text regions for proximity
126
+ for tx, ty, tw, th in text_regions:
127
+ # Check if horizontally aligned and close
128
+ if abs((ty + th/2) - (y + h/2)) < max(th, h) and \
129
+ abs((tx + tw) - x) < 20: # Near each other horizontally
130
+ is_text_cluster = True
131
+ break
132
+
133
+ # More inclusive classification for historical documents
134
+ # 1. Typical text characteristics OR
135
+ # 2. Bottom position (likely text in historical prints) OR
136
+ # 3. Part of a text cluster OR
137
+ # 4. Surrounded by other text
138
+ is_text_region = ((aspect_ratio > 1.05 or aspect_ratio < 0.9) and dark_pixel_density > 0.1) or \
139
+ (is_bottom_region and dark_pixel_density > 0.08) or \
140
+ is_text_cluster
141
+
142
+ if is_text_region:
143
  # Add to text regions list
144
  text_regions.append((x, y, w, h))
145
  # Add to text mask
 
          for x, y, w, h in text_regions:
              cv2.rectangle(text_regions_vis, (x, y), (x+w, y+h), (0, 255, 0), 2)

+         # ENHANCED APPROACH FOR HISTORICAL DOCUMENTS:
+         # We'll identify different regions including titles at the top of the document

+         # First, look for potential title text at the top of the document
+         image_height = img.shape[0]
+         image_width = img.shape[1]
+
+         # Examine the top 20% of the image for potential title text
+         title_section_height = int(image_height * 0.2)
+         title_mask = np.zeros_like(gray)
+         title_mask[:title_section_height, :] = 255
+
+         # Find potential title blocks in the top section
+         title_contours, _ = cv2.findContours(
+             cv2.bitwise_and(dilation, title_mask),
+             cv2.RETR_EXTERNAL,
+             cv2.CHAIN_APPROX_SIMPLE
+         )
+
+         # Extract title regions with more permissive criteria
+         title_regions = []
+         for contour in title_contours:
              area = cv2.contourArea(contour)
+             # Use more permissive criteria for title regions
+             if area > min_area * 0.8: # Smaller minimum area for titles
                  x, y, w, h = cv2.boundingRect(contour)
+                 # Title regions typically have wider aspect ratio
+                 aspect_ratio = w / h
+                 # More permissive density check for titles that might be stylized
+                 roi = binary[y:y+h, x:x+w]
+                 dark_pixel_density = np.sum(roi > 0) / (w * h)
+
+                 # Check if this might be a title
+                 # Titles tend to be wider, in the center, and at the top
+                 is_wide = aspect_ratio > 2.0
+                 is_centered = abs((x + w/2) - (image_width/2)) < (image_width * 0.3)
+                 is_at_top = y < title_section_height
+
+                 # If it looks like a title or has good text characteristics
+                 if (is_wide and is_centered and is_at_top) or \
+                    (is_at_top and dark_pixel_density > 0.1):
+                     title_regions.append((x, y, w, h))
+
+         # Now handle the main content with our standard approach
+         # Use fixed regions for the main content - typically below the title
+         # For primary content, assume most text is in the bottom 70%
+         text_section_start = int(image_height * 0.7) # Start main text section at 70% down
+
+         # Create text mask combining the title regions and main text area
+         text_mask = np.zeros_like(gray)
+         text_mask[text_section_start:, :] = 255
+
+         # Add title regions to the text mask
+         for x, y, w, h in title_regions:
+             # Add some padding around title regions
+             pad_x = max(5, int(w * 0.05))
+             pad_y = max(5, int(h * 0.05))
+             x_start = max(0, x - pad_x)
+             y_start = max(0, y - pad_y)
+             x_end = min(image_width, x + w + pad_x)
+             y_end = min(image_height, y + h + pad_y)
+
+             # Add title region to the text mask
+             text_mask[y_start:y_end, x_start:x_end] = 255
+
+         # Image mask is the inverse of text mask - for visualization only
+         image_mask = np.zeros_like(gray)
+         image_mask[text_mask == 0] = 255
+
+         # For main text regions, find blocks of text in the bottom part
+         # Create a temporary mask for the main text section
+         temp_mask = np.zeros_like(gray)
+         temp_mask[text_section_start:, :] = 255
+
+         # Find text regions for visualization purposes
+         text_regions = []
+         # Start with any title regions we found
+         text_regions.extend(title_regions)
+
+         # Then find text regions in the main content area
+         text_region_contours, _ = cv2.findContours(
+             cv2.bitwise_and(dilation, temp_mask),
+             cv2.RETR_EXTERNAL,
+             cv2.CHAIN_APPROX_SIMPLE
+         )

+         # Add each detected region
+         for contour in text_region_contours:
+             x, y, w, h = cv2.boundingRect(contour)
+             if w > 10 and h > 5: # Minimum size to be considered text
+                 text_regions.append((x, y, w, h))
+
+         # Add the entire bottom section as a fallback text region if none detected
+         if len(text_regions) == 0:
+             x, y = 0, text_section_start
+             w, h = img.shape[1], img.shape[0] - text_section_start
+             text_regions.append((x, y, w, h))
+
+         # Create image regions visualization
+         image_regions_vis = img_rgb.copy()
+
+         # Top section is image
+         cv2.rectangle(image_regions_vis, (0, 0), (img.shape[1], text_section_start), (0, 0, 255), 2)
+
+         # Bottom section has text - draw green boxes around detected text regions
+         text_regions_vis = img_rgb.copy()
+         for x, y, w, h in text_regions:
+             cv2.rectangle(text_regions_vis, (x, y), (x+w, y+h), (0, 255, 0), 2)
+
+         # For OCR: CRITICAL - Don't modify the image content
+         # Only create a non-destructive enhanced version
+
+         # For text detection visualization:
+         text_regions_vis = img_rgb.copy()
+         for x, y, w, h in text_regions:
+             cv2.rectangle(text_regions_vis, (x, y), (x+w, y+h), (0, 255, 0), 2)
+
+         # For image region visualization:
+         image_regions_vis = img_rgb.copy()
+         cv2.rectangle(image_regions_vis, (0, 0), (img.shape[1], text_section_start), (0, 0, 255), 2)
+
+         # Create a minimally enhanced version of the original image
+         # that preserves ALL content (both text and image)
          combined_result = img_rgb.copy()

+         # Apply gentle contrast enhancement if requested
+         if not preserve_content:
+             # Use a subtle CLAHE enhancement to improve OCR without losing content
+             lab_img = cv2.cvtColor(img_rgb, cv2.COLOR_BGR2LAB)
+             l, a, b = cv2.split(lab_img)
+
+             # Very mild CLAHE settings to preserve text
+             clahe = cv2.createCLAHE(clipLimit=1.5, tileGridSize=(8, 8))
+             cl = clahe.apply(l)
+
+             # Merge channels back
+             enhanced_lab = cv2.merge((cl, a, b))
+             combined_result = cv2.cvtColor(enhanced_lab, cv2.COLOR_LAB2BGR)
+
+         # Extract individual region images for separate OCR processing
+         region_images = []
+         if text_regions:
+             for idx, (x, y, w, h) in enumerate(text_regions):
+                 # Add padding around region (10% of width/height)
+                 pad_x = max(5, int(w * 0.1))
+                 pad_y = max(5, int(h * 0.1))
+
+                 # Ensure coordinates stay within image bounds
+                 x_start = max(0, x - pad_x)
+                 y_start = max(0, y - pad_y)
+                 x_end = min(img_rgb.shape[1], x + w + pad_x)
+                 y_end = min(img_rgb.shape[0], y + h + pad_y)
+
+                 # Extract region with padding
+                 region = img_rgb[y_start:y_end, x_start:x_end].copy()
+
+                 # Store region with its coordinates
+                 region_info = {
+                     'image': region,
+                     'coordinates': (x, y, w, h),
+                     'padded_coordinates': (x_start, y_start, x_end - x_start, y_end - y_start),
+                     'order': idx
+                 }
+                 region_images.append(region_info)

          # Convert visualization results back to PIL Images
          text_regions_pil = Image.fromarray(cv2.cvtColor(text_regions_vis, cv2.COLOR_BGR2RGB))
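The same pad-and-clamp arithmetic appears twice in this hunk: once for title regions (5% padding) and once for the per-region crops (10% padding). A minimal sketch of that calculation as a standalone function follows; the name `padded_bounds` and its parameters are hypothetical and only illustrate the arithmetic shown in the diff.

```python
from typing import Tuple


def padded_bounds(x: int, y: int, w: int, h: int,
                  image_width: int, image_height: int,
                  ratio: float = 0.1, min_pad: int = 5) -> Tuple[int, int, int, int]:
    """Grow a bounding box by `ratio` of its size (at least `min_pad` pixels)
    on every side, then clamp the result to the image.

    Returns (x_start, y_start, x_end, y_end), suitable for slicing an
    array as img[y_start:y_end, x_start:x_end].
    """
    pad_x = max(min_pad, int(w * ratio))
    pad_y = max(min_pad, int(h * ratio))
    x_start = max(0, x - pad_x)
    y_start = max(0, y - pad_y)
    x_end = min(image_width, x + w + pad_x)
    y_end = min(image_height, y + h + pad_y)
    return x_start, y_start, x_end, y_end
```

With `ratio=0.05` this reproduces the title-region padding, and with the default `ratio=0.1` the padding used for the OCR crops. The `padded_coordinates` entry in the diff stores the same box as `(x_start, y_start, x_end - x_start, y_end - y_start)`.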
 
          _, buffer = cv2.imencode('.png', text_mask)
          text_mask_base64 = base64.b64encode(buffer).decode('utf-8')

+         # Convert region images to PIL format
+         region_pil_images = []
+         for region_info in region_images:
+             region_pil = Image.fromarray(cv2.cvtColor(region_info['image'], cv2.COLOR_BGR2RGB))
+             region_info['pil_image'] = region_pil
+             region_pil_images.append(region_info)
+
          # Return the segmentation results
          return {
              'text_regions': text_regions_pil,
              'image_regions': image_regions_pil,
              'text_mask_base64': f"data:image/png;base64,{text_mask_base64}",
              'combined_result': combined_result_pil,
+             'text_regions_coordinates': text_regions,
+             'region_images': region_pil_images
          }

      except Exception as e:
 
              'text_regions_coordinates': []
          }

+ def process_segmented_image(image_path: Union[str, Path], output_dir: Optional[Path] = None, preserve_content: bool = True) -> Dict:
      """
      Process an image using segmentation for improved OCR, saving visualization outputs.
ocr_processing.py CHANGED
@@ -147,31 +147,15 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
147
 
148
  # Process with cached function if possible
149
  try:
150
- # Check if preprocessing options indicate a handwritten document
151
- handwritten_document = preprocessing_options.get("document_type") == "handwritten"
152
  modified_custom_prompt = custom_prompt
153
 
154
- # Add handwritten specific instructions if needed
155
- # Note: Document type influences OCR quality through prompting, even when no preprocessing is applied
156
- if handwritten_document and modified_custom_prompt:
157
- if "handwritten" not in modified_custom_prompt.lower():
158
- modified_custom_prompt += " This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
159
- elif handwritten_document and not modified_custom_prompt:
160
- modified_custom_prompt = "This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
161
-
162
- # Add PDF-specific instructions if needed
163
- if modified_custom_prompt and "pdf" not in modified_custom_prompt.lower() and "multi-page" not in modified_custom_prompt.lower():
164
- modified_custom_prompt += " This is a multi-page PDF document."
165
- elif not modified_custom_prompt:
166
  modified_custom_prompt = "This is a multi-page PDF document."
167
-
168
- # For certain filenames, explicitly add document type hints
169
- filename_lower = uploaded_file.name.lower()
170
- if "handwritten" in filename_lower or "letter" in filename_lower or "journal" in filename_lower:
171
- if not modified_custom_prompt:
172
- modified_custom_prompt = "This is a handwritten document in PDF format. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
173
- elif "handwritten" not in modified_custom_prompt.lower():
174
- modified_custom_prompt += " This is a handwritten document. Please carefully transcribe all handwritten text."
175
 
176
  # Update the cache key with the modified prompt
177
  if modified_custom_prompt != custom_prompt:
@@ -194,19 +178,24 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
194
  processor = StructuredOCR()
195
 
196
 
197
- # Check if preprocessing options indicate a handwritten document
198
- handwritten_document = preprocessing_options.get("document_type") == "handwritten"
199
  modified_custom_prompt = custom_prompt
200
 
201
- # Add handwritten specific instructions if needed
202
- if handwritten_document and modified_custom_prompt:
203
- if "handwritten" not in modified_custom_prompt.lower():
204
- modified_custom_prompt += " This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
205
- elif handwritten_document and not modified_custom_prompt:
206
  modified_custom_prompt = "This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
207
 
208
  # Add PDF-specific instructions if needed
209
- if custom_prompt and "pdf" not in modified_custom_prompt.lower() and "multi-page" not in modified_custom_prompt.lower():
210
  modified_custom_prompt += " This is a multi-page PDF document."
211
 
212
  # Process directly with optimized settings
@@ -241,8 +230,13 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
241
  progress_reporter.update(35, "Applying image segmentation to separate text and image regions...")
242
 
243
  try:
244
- # Perform image segmentation
245
- segmentation_results = segment_image_for_ocr(temp_path, vision_enabled=use_vision)
246
 
247
  if segmentation_results['combined_result'] is not None:
248
  # Save the segmented result to a new temporary file
@@ -250,21 +244,99 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
250
  segmentation_results['combined_result'].save(segmented_temp_path)
251
  temp_file_paths.append(segmented_temp_path)
252
 
253
- # Use the segmented image instead of the original
254
- temp_path = segmented_temp_path
255
-
256
- # Enhanced prompt based on segmentation results
257
- if custom_prompt:
258
- # Add segmentation info to existing prompt
259
- regions_count = len(segmentation_results.get('text_regions_coordinates', []))
260
- custom_prompt += f" The document has been segmented and contains approximately {regions_count} text regions that should be carefully extracted. Please focus on extracting all text from these regions."
261
  else:
262
- # Create new prompt focused on text extraction from segmented regions
263
  regions_count = len(segmentation_results.get('text_regions_coordinates', []))
264
- custom_prompt = f"This document has been preprocessed to highlight {regions_count} text regions. Please carefully extract all text from these highlighted regions, preserving the reading order and structure."
265
 
266
- logger.info(f"Image segmentation applied. Found {regions_count} text regions.")
267
- progress_reporter.update(40, f"Identified {regions_count} text regions for extraction...")
268
  else:
269
  logger.warning("Image segmentation produced no result, using original image.")
270
  except Exception as seg_error:
@@ -283,24 +355,21 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
283
  # Process the file using cached function if possible
284
  progress_reporter.update(50, "Processing document with OCR...")
285
  try:
286
- # Check if preprocessing options indicate a handwritten document
287
- handwritten_document = preprocessing_options.get("document_type") == "handwritten"
288
  modified_custom_prompt = custom_prompt
289
 
290
- # Add handwritten specific instructions if needed
291
- if handwritten_document and modified_custom_prompt:
292
- if "handwritten" not in modified_custom_prompt.lower():
293
- modified_custom_prompt += " This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
294
- elif handwritten_document and not modified_custom_prompt:
295
  modified_custom_prompt = "This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
296
-
297
- # For certain filenames, explicitly add document type hints
298
- filename_lower = uploaded_file.name.lower()
299
- if "handwritten" in filename_lower or "letter" in filename_lower or "journal" in filename_lower:
300
- if not modified_custom_prompt:
301
- modified_custom_prompt = "This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
302
- elif "handwritten" not in modified_custom_prompt.lower():
303
- modified_custom_prompt += " This is a handwritten document. Please carefully transcribe all handwritten text."
304
 
305
  # Update the cache key with the modified prompt
306
  if modified_custom_prompt != custom_prompt:
@@ -328,24 +397,21 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
328
  # Use simpler processing for speed
329
  pass # Any speed optimizations would be handled by the StructuredOCR class
330
 
331
- # Check if preprocessing options indicate a handwritten document
332
- handwritten_document = preprocessing_options.get("document_type") == "handwritten"
333
  modified_custom_prompt = custom_prompt
334
 
335
- # Add handwritten specific instructions if needed
336
- if handwritten_document and modified_custom_prompt:
337
- if "handwritten" not in modified_custom_prompt.lower():
338
- modified_custom_prompt += " This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
339
- elif handwritten_document and not modified_custom_prompt:
340
  modified_custom_prompt = "This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
341
-
342
- # For certain filenames, explicitly add document type hints
343
- filename_lower = uploaded_file.name.lower()
344
- if "handwritten" in filename_lower or "letter" in filename_lower or "journal" in filename_lower:
345
- if not modified_custom_prompt:
346
- modified_custom_prompt = "This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
347
- elif "handwritten" not in modified_custom_prompt.lower():
348
- modified_custom_prompt += " This is a handwritten document. Please carefully transcribe all handwritten text."
349
 
350
  result = processor.process_file(
351
  file_path=temp_path,
@@ -360,11 +426,16 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
360
  # Add additional metadata to result
361
  result = process_result(result, uploaded_file, preprocessing_options)
362
 
 
 
 
 
363
  # 🔧 ALWAYS normalize result before returning
364
  result = clean_ocr_result(
365
  result,
366
  use_segmentation=use_segmentation,
367
- vision_enabled=use_vision
 
368
  )
369
 
370
  # Complete progress
@@ -424,13 +495,14 @@ def process_result(result, uploaded_file, preprocessing_options=None):
424
  preprocessing_options
425
  )
426
 
427
- # Extract raw text from OCR contents
428
  raw_text = ""
429
  if 'ocr_contents' in result:
430
- if 'raw_text' in result['ocr_contents']:
431
- raw_text = result['ocr_contents']['raw_text']
432
- elif 'content' in result['ocr_contents']:
433
- raw_text = result['ocr_contents']['content']
 
434
 
435
  # Extract subject tags if not already present or enhance existing ones
436
  if 'topics' not in result or not result['topics']:
 
147
 
148
  # Process with cached function if possible
149
  try:
150
+ # Use the document type information from preprocessing options
151
+ doc_type = preprocessing_options.get("document_type", "standard")
152
  modified_custom_prompt = custom_prompt
153
 
154
+ # Add PDF-specific instructions
155
+ if not modified_custom_prompt:
156
  modified_custom_prompt = "This is a multi-page PDF document."
157
+ elif "pdf" not in modified_custom_prompt.lower() and "multi-page" not in modified_custom_prompt.lower():
158
+ modified_custom_prompt += " This is a multi-page PDF document."
159
 
160
  # Update the cache key with the modified prompt
161
  if modified_custom_prompt != custom_prompt:
 
178
  processor = StructuredOCR()
179
 
180
 
181
+ # Use the document type from preprocessing options
182
+ doc_type = preprocessing_options.get("document_type", "standard")
183
  modified_custom_prompt = custom_prompt
184
 
185
+ # Add document-type specific instructions based on preprocessing options
186
+ if doc_type == "handwritten" and not modified_custom_prompt:
187
  modified_custom_prompt = "This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
188
+ elif doc_type == "handwritten" and "handwritten" not in modified_custom_prompt.lower():
189
+ modified_custom_prompt += " This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
190
+ elif doc_type == "newspaper" and not modified_custom_prompt:
191
+ modified_custom_prompt = "This is a newspaper or document with columns. Please extract all text content from each column, maintaining proper reading order."
192
+ elif doc_type == "newspaper" and "column" not in modified_custom_prompt.lower() and "newspaper" not in modified_custom_prompt.lower():
193
+ modified_custom_prompt += " This appears to be a newspaper or document with columns. Please extract all text content from each column."
194
+ elif doc_type == "book" and not modified_custom_prompt:
195
+ modified_custom_prompt = "This is a book page. Extract titles, headers, footnotes, and body text, preserving paragraph structure and formatting."
196
 
197
  # Add PDF-specific instructions if needed
198
+ if "pdf" not in modified_custom_prompt.lower() and "multi-page" not in modified_custom_prompt.lower():
199
  modified_custom_prompt += " This is a multi-page PDF document."
200
 
201
  # Process directly with optimized settings
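The document-type branches in this hunk are repeated almost verbatim in later hunks of the same file. One way to express them once is a small helper; the sketch below is an illustration under assumptions, not part of the commit. The name `build_prompt`, the `is_pdf` flag, and the module-level constants are hypothetical, while the prompt strings are copied from the diff. The sketch normalizes a missing prompt to an empty string before any `.lower()` call, so a `None` prompt with a "standard" document type does not raise.

```python
from typing import Optional

HANDWRITTEN = ("This is a handwritten document. Please carefully transcribe all "
               "handwritten text, preserving line breaks and original formatting.")
NEWSPAPER_NEW = ("This is a newspaper or document with columns. Please extract all "
                 "text content from each column, maintaining proper reading order.")
NEWSPAPER_APPEND = ("This appears to be a newspaper or document with columns. "
                    "Please extract all text content from each column.")
BOOK = ("This is a book page. Extract titles, headers, footnotes, and body text, "
        "preserving paragraph structure and formatting.")
PDF = "This is a multi-page PDF document."


def build_prompt(custom_prompt: Optional[str], doc_type: str = "standard",
                 is_pdf: bool = False) -> str:
    """Hypothetical helper: extend a user prompt with document-type hints."""
    prompt = (custom_prompt or "").strip()  # treat None as an empty prompt
    lowered = prompt.lower()

    if doc_type == "handwritten":
        if not prompt:
            prompt = HANDWRITTEN
        elif "handwritten" not in lowered:
            prompt += " " + HANDWRITTEN
    elif doc_type == "newspaper":
        if not prompt:
            prompt = NEWSPAPER_NEW
        elif "column" not in lowered and "newspaper" not in lowered:
            prompt += " " + NEWSPAPER_APPEND
    elif doc_type == "book" and not prompt:
        prompt = BOOK

    if is_pdf:
        lowered = prompt.lower()
        if not prompt:
            prompt = PDF
        elif "pdf" not in lowered and "multi-page" not in lowered:
            prompt += " " + PDF
    return prompt
```

A caller could then write `modified_custom_prompt = build_prompt(custom_prompt, doc_type, is_pdf=True)` in the PDF path and omit `is_pdf` in the image path, which would keep the repeated branches in sync.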
 
230
  progress_reporter.update(35, "Applying image segmentation to separate text and image regions...")
231
 
232
  try:
233
+ # Perform image segmentation with content preservation if requested
234
+ preserve_content = preprocessing_options.get("preserve_content", True)
235
+ segmentation_results = segment_image_for_ocr(
236
+ temp_path,
237
+ vision_enabled=use_vision,
238
+ preserve_content=preserve_content
239
+ )
240
 
241
  if segmentation_results['combined_result'] is not None:
242
  # Save the segmented result to a new temporary file
 
244
  segmentation_results['combined_result'].save(segmented_temp_path)
245
  temp_file_paths.append(segmented_temp_path)
246
 
247
+ # Check if we have individual region images to process separately
248
+ if 'region_images' in segmentation_results and segmentation_results['region_images']:
249
+ # Process each region separately for better results
250
+ regions_count = len(segmentation_results['region_images'])
251
+ logger.info(f"Processing {regions_count} text regions individually")
252
+ progress_reporter.update(40, f"Processing {regions_count} text regions separately...")
253
+
254
+ # Initialize StructuredOCR processor
255
+ processor = StructuredOCR()
256
+
257
+ # Store individual region results
258
+ region_results = []
259
+
260
+ # Process each region individually
261
+ for idx, region_info in enumerate(segmentation_results['region_images']):
262
+ # Save region image to temp file
263
+ region_temp_path = tempfile.NamedTemporaryFile(delete=False, suffix='.jpg').name
264
+ region_info['pil_image'].save(region_temp_path)
265
+ temp_file_paths.append(region_temp_path)
266
+
267
+ # Create region-specific prompt
268
+ region_prompt = f"This is region {idx+1} of {regions_count} from a segmented document. Extract all visible text precisely, preserving line breaks and structure."
269
+
270
+ # Process the region
271
+ try:
272
+ region_result = processor.process_file(
273
+ file_path=region_temp_path,
274
+ file_type="image",
275
+ use_vision=use_vision,
276
+ custom_prompt=region_prompt,
277
+ file_size_mb=None
278
+ )
279
+
280
+ # Store result with region info
281
+ if 'ocr_contents' in region_result and 'raw_text' in region_result['ocr_contents']:
282
+ region_results.append({
283
+ 'text': region_result['ocr_contents']['raw_text'],
284
+ 'coordinates': region_info['coordinates'],
285
+ 'order': region_info['order']
286
+ })
287
+ except Exception as region_err:
288
+ logger.warning(f"Error processing region {idx+1}: {str(region_err)}")
289
+
290
+ # Sort regions by their order for correct reading flow
291
+ region_results.sort(key=lambda x: x['order'])
292
+
293
+ # Combine all region texts
294
+ combined_text = "\n\n".join([r['text'] for r in region_results if r['text'].strip()])
295
+
296
+ # Store combined results for later use
297
+ preprocessing_options['segmentation_data'] = {
298
+ 'text_regions_coordinates': segmentation_results.get('text_regions_coordinates', []),
299
+ 'regions_count': regions_count,
300
+ 'segmentation_applied': True,
301
+ 'combined_text': combined_text,
302
+ 'region_results': region_results
303
+ }
304
+
305
+ logger.info(f"Successfully processed {len(region_results)} text regions")
306
+
307
+ # Set up the temp path to use the segmented image
308
+ temp_path = segmented_temp_path
309
+
310
+ # IMPORTANT: We've already extracted text from individual regions,
311
+ # emphasize their importance in our prompt
312
+ if custom_prompt:
313
+ # Add strong emphasis on using the already extracted text
314
+ custom_prompt += f" IMPORTANT: The document has been segmented into {regions_count} text regions that have been processed individually. The text from these regions should be given HIGHEST PRIORITY and used as the primary source of text for the document. The combined image is provided only as supplementary context."
315
+ else:
316
+ # Create explicit prompt prioritizing region text
317
+ custom_prompt = f"CRITICAL: This document has been preprocessed to highlight {regions_count} text regions that have been individually processed. The text from these regions is the PRIMARY source of content and should be prioritized over any text extracted from the combined image. Use the combined image only for context and layout understanding."
318
  else:
319
+ # No individual regions found, use combined result
320
+ temp_path = segmented_temp_path
321
+
322
+ # Enhanced prompt based on segmentation results
323
  regions_count = len(segmentation_results.get('text_regions_coordinates', []))
324
+ if custom_prompt:
325
+ # Add segmentation info to existing prompt
326
+ custom_prompt += f" The document has been segmented and contains approximately {regions_count} text regions that should be carefully extracted. Please focus on extracting all text from these regions."
327
+ else:
328
+ # Create new prompt focused on text extraction from segmented regions
329
+ custom_prompt = f"This document has been preprocessed to highlight {regions_count} text regions. Please carefully extract all text from these highlighted regions, preserving the reading order and structure."
330
+
331
+ # Store segmentation data in preprocessing options for later use
332
+ preprocessing_options['segmentation_data'] = {
333
+ 'text_regions_coordinates': segmentation_results.get('text_regions_coordinates', []),
334
+ 'regions_count': regions_count,
335
+ 'segmentation_applied': True
336
+ }
337
 
338
+ logger.info(f"Image segmentation applied. Found {len(segmentation_results.get('text_regions_coordinates', []))} text regions.")
339
+ progress_reporter.update(40, f"Identified {len(segmentation_results.get('text_regions_coordinates', []))} text regions for extraction...")
340
  else:
341
  logger.warning("Image segmentation produced no result, using original image.")
342
  except Exception as seg_error:
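The per-region results collected in this hunk are sorted by their `order` field and joined with blank lines, skipping regions whose text is empty. A minimal sketch of that step as a pure function is shown below. The name `combine_region_text` is hypothetical, and the `by_position` option is an assumption that is not in the commit: it sorts top to bottom, then left to right, which may be closer to reading order when contour order does not follow the page layout.

```python
from typing import Dict, List


def combine_region_text(region_results: List[Dict], by_position: bool = False) -> str:
    """Join OCR text from individual regions into one string.

    Each entry is expected to carry 'text', 'coordinates' as (x, y, w, h),
    and 'order', matching the dictionaries built in the diff.
    """
    if by_position:
        # Assumption, not in the commit: sort by top edge, then left edge.
        key = lambda r: (r["coordinates"][1], r["coordinates"][0])
    else:
        # As in the commit: keep the order assigned during segmentation.
        key = lambda r: r["order"]
    ordered = sorted(region_results, key=key)
    return "\n\n".join(r["text"] for r in ordered if r["text"].strip())
```

In the diff, `order` is the enumeration index of the detected regions, so the default mode returns text in detection order rather than in a layout-derived order.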
 
355
  # Process the file using cached function if possible
356
  progress_reporter.update(50, "Processing document with OCR...")
357
  try:
358
+ # Use the document type from preprocessing options
359
+ doc_type = preprocessing_options.get("document_type", "standard")
360
  modified_custom_prompt = custom_prompt
361
 
362
+ # Add document-type specific instructions based on preprocessing options
363
+ if doc_type == "handwritten" and not modified_custom_prompt:
364
  modified_custom_prompt = "This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
365
+ elif doc_type == "handwritten" and "handwritten" not in modified_custom_prompt.lower():
366
+ modified_custom_prompt += " This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
367
+ elif doc_type == "newspaper" and not modified_custom_prompt:
368
+ modified_custom_prompt = "This is a newspaper or document with columns. Please extract all text content from each column, maintaining proper reading order."
369
+ elif doc_type == "newspaper" and "column" not in modified_custom_prompt.lower() and "newspaper" not in modified_custom_prompt.lower():
370
+ modified_custom_prompt += " This appears to be a newspaper or document with columns. Please extract all text content from each column."
371
+ elif doc_type == "book" and not modified_custom_prompt:
372
+ modified_custom_prompt = "This is a book page. Extract titles, headers, footnotes, and body text, preserving paragraph structure and formatting."
373
 
374
  # Update the cache key with the modified prompt
375
  if modified_custom_prompt != custom_prompt:
 
397
  # Use simpler processing for speed
398
  pass # Any speed optimizations would be handled by the StructuredOCR class
399
 
400
+ # Use the document type from preprocessing options
401
+ doc_type = preprocessing_options.get("document_type", "standard")
402
  modified_custom_prompt = custom_prompt
403
 
404
+ # Add document-type specific instructions based on preprocessing options
405
+ if doc_type == "handwritten" and not modified_custom_prompt:
406
  modified_custom_prompt = "This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
407
+ elif doc_type == "handwritten" and "handwritten" not in modified_custom_prompt.lower():
408
+ modified_custom_prompt += " This is a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks and original formatting."
409
+ elif doc_type == "newspaper" and not modified_custom_prompt:
410
+ modified_custom_prompt = "This is a newspaper or document with columns. Please extract all text content from each column, maintaining proper reading order."
411
+ elif doc_type == "newspaper" and "column" not in modified_custom_prompt.lower() and "newspaper" not in modified_custom_prompt.lower():
412
+ modified_custom_prompt += " This appears to be a newspaper or document with columns. Please extract all text content from each column."
413
+ elif doc_type == "book" and not modified_custom_prompt:
414
+ modified_custom_prompt = "This is a book page. Extract titles, headers, footnotes, and body text, preserving paragraph structure and formatting."
415
 
416
  result = processor.process_file(
417
  file_path=temp_path,
 
426
  # Add additional metadata to result
427
  result = process_result(result, uploaded_file, preprocessing_options)
428
 
429
+ # Make sure file_type is explicitly set for PDFs
430
+ if file_type == "pdf":
431
+ result['file_type'] = "pdf"
432
+
433
  # 🔧 ALWAYS normalize result before returning
434
  result = clean_ocr_result(
435
  result,
436
  use_segmentation=use_segmentation,
437
+ vision_enabled=use_vision,
438
+ preprocessing_options=preprocessing_options
439
  )
440
 
441
  # Complete progress
 
495
  preprocessing_options
496
  )
497
 
498
+ # Extract raw text from OCR contents for tag extraction without duplicating content
499
  raw_text = ""
500
  if 'ocr_contents' in result:
501
+ # Try fields in order of preference
502
+ for field in ["raw_text", "content", "text", "transcript", "main_text"]:
503
+ if field in result['ocr_contents'] and result['ocr_contents'][field]:
504
+ raw_text = result['ocr_contents'][field]
505
+ break
506
 
507
  # Extract subject tags if not already present or enhance existing ones
508
  if 'topics' not in result or not result['topics']:
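The field lookup added to `process_result` tries several keys in order of preference and takes the first one that holds a non-empty value. A minimal sketch of that lookup as a reusable function follows; the name `first_nonempty_field` and the `PREFERRED_FIELDS` constant are hypothetical, and the field list is the one shown in the diff.

```python
from typing import Any, Dict, Iterable

# Field names in order of preference, as listed in the diff.
PREFERRED_FIELDS = ("raw_text", "content", "text", "transcript", "main_text")


def first_nonempty_field(ocr_contents: Dict[str, Any],
                         fields: Iterable[str] = PREFERRED_FIELDS) -> Any:
    """Return the first truthy value among `fields`, or "" if none is set."""
    for field in fields:
        value = ocr_contents.get(field)
        if value:
            return value
    return ""
```

For example, `first_nonempty_field(result.get("ocr_contents", {}))` would give the text used for tag extraction. Like the loop in the diff, the sketch returns the stored value unchanged, so a caller that needs a string should check the type when a field may hold a list or dictionary.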
requirements.txt CHANGED
@@ -9,7 +9,7 @@ pydantic>=2.5.0 # Updated for better BaseModel support
  Pillow>=10.0.0
  opencv-python-headless>=4.8.0.74
  pdf2image>=1.16.0
- pytesseract>=0.3.10 # For local OCR fallback
+ # pytesseract>=0.3.10 # For local OCR fallback
  matplotlib>=3.7.0 # For visualization in preprocessing tests

  # Data handling and utilities
structured_ocr.py CHANGED
@@ -1135,44 +1135,8 @@ class StructuredOCR:
                  "confidence_score": 0.0
              }
  
-         # Check if this is likely a newspaper or handwritten document by filename
-         is_likely_newspaper = False
-         is_likely_handwritten = False
-
-         newspaper_keywords = ["newspaper", "gazette", "herald", "times", "journal",
-                               "chronicle", "post", "tribune", "news", "press", "gender"]
-
-         handwritten_keywords = ["handwritten", "manuscript", "letter", "correspondence", "journal", "diary"]
-
-         # Check filename for document type indicators
-         filename_lower = file_path.name.lower()
-
-         # First check for handwritten documents
-         for keyword in handwritten_keywords:
-             if keyword in filename_lower:
-                 is_likely_handwritten = True
-                 logger.info(f"Likely handwritten document detected from filename: {file_path.name}")
-                 # Add handwritten-specific processing hint to custom_prompt if not already present
-                 if custom_prompt:
-                     if "handwritten" not in custom_prompt.lower():
-                         custom_prompt = custom_prompt + " This appears to be a handwritten document. Please carefully transcribe all handwritten text, preserving line breaks. Note any unclear sections or annotations."
-                 else:
-                     custom_prompt = "This is a handwritten document. Carefully transcribe all handwritten text, preserving line breaks. Note any unclear sections or annotations."
-                 break
-
-         # Then check for newspaper if not handwritten
-         if not is_likely_handwritten:
-             for keyword in newspaper_keywords:
-                 if keyword in filename_lower:
-                     is_likely_newspaper = True
-                     logger.info(f"Likely newspaper document detected from filename: {file_path.name}")
-                     # Add newspaper-specific processing hint to custom_prompt if not already present
-                     if custom_prompt:
-                         if "column" not in custom_prompt.lower() and "newspaper" not in custom_prompt.lower():
-                             custom_prompt = custom_prompt + " This appears to be a newspaper or document with columns. Please extract all text content from each column."
-                     else:
-                         custom_prompt = "This appears to be a newspaper or document with columns. Please extract all text content from each column, maintaining proper reading order."
-                     break
+         # No automatic document type detection - rely on the document type specified in the custom prompt
+         # The document type is passed from the UI through the custom prompt in ocr_processing.py
  
          try:
              # Check file size
@@ -1192,10 +1156,11 @@ class StructuredOCR:
              if file_size_mb > max_size_mb:
                  logger.info(f"Image is large ({file_size_mb:.2f} MB), optimizing for API submission")
  
-                 # Handwritten docs default to the conservative pipeline
+                 # Use standard preprocessing - document type will be handled by preprocessing.py
+                 # based on the options passed from the UI
                  base64_data_url = get_base64_from_bytes(
                      preprocess_image(file_path.read_bytes(),
-                                      {"document_type": "handwritten",
+                                      {"document_type": "standard",
                                        "grayscale": True,
                                        "denoise": True,
                                        "contrast": 0})
@@ -1391,9 +1356,9 @@ class StructuredOCR:
                      logger.info(f"Found language in page: {lang}")
  
              # Optimize: Skip vision model step if ocr_markdown is very small or empty
-             # BUT make an exception for newspapers or if custom_prompt is provided
+             # BUT make an exception if custom_prompt is provided
              # OR if the image has visual content worth preserving
-             if (not is_likely_newspaper and not custom_prompt and not has_images) and (not image_ocr_markdown or len(image_ocr_markdown) < 50):
+             if (not custom_prompt and not has_images) and (not image_ocr_markdown or len(image_ocr_markdown) < 50):
                  logger.warning("OCR produced minimal text with no images. Returning basic result.")
                  return {
                      "file_name": file_path.name,
@@ -1407,14 +1372,6 @@ class StructuredOCR:
                      "raw_response_data": serialize_ocr_response(image_response)
                  }
  
-             # For newspapers with little text in OCR, set a more explicit prompt
-             if is_likely_newspaper and (not image_ocr_markdown or len(image_ocr_markdown) < 100):
-                 logger.info("Newspaper with minimal OCR text detected. Using enhanced prompt.")
-                 if not custom_prompt:
-                     custom_prompt = "This is a newspaper or document with columns. The OCR may not have captured all text. Please examine the image carefully and extract ALL text content visible in the document, reading each column from top to bottom."
-                 elif "extract all text" not in custom_prompt.lower():
-                     custom_prompt += " Please examine the image carefully and extract ALL text content visible in the document."
-
              # For images with minimal text but visual content, enhance the prompt
              elif has_images and (not image_ocr_markdown or len(image_ocr_markdown) < 100):
                  logger.info("Document with images but minimal text detected. Using enhanced prompt for mixed media.")
@@ -1575,16 +1532,25 @@ class StructuredOCR:
          else:
              truncated_ocr = ocr_markdown
  
-         # Build a comprehensive prompt with OCR text and detailed instructions for language detection and image handling
+         # Build a comprehensive prompt with OCR text and detailed instructions for title detection and language handling
          enhanced_prompt = f"This is a document's OCR text:\n<BEGIN_OCR>\n{truncated_ocr}\n<END_OCR>\n\n"
  
          # Add custom prompt if provided
          if custom_prompt:
              enhanced_prompt += f"User instructions: {custom_prompt}\n\n"
  
-         # Add comprehensive extraction instructions with language detection guidance
-         enhanced_prompt += "Extract all text content accurately from this document, including any text visible in the image that may not have been captured by OCR.\n\n"
-         enhanced_prompt += "IMPORTANT: First thoroughly extract and analyze all text content, THEN determine the languages present.\n"
+         # Primary focus on document structure and title detection
+         enhanced_prompt += "You are analyzing a historical document. Follow these extraction priorities:\n"
+         enhanced_prompt += "1. FIRST PRIORITY: Identify and extract the TITLE of the document. Look for large text at the top, decorative typography, or centered text that appears to be a title. The title is often one of the first elements in historical documents.\n"
+         enhanced_prompt += "2. SECOND: Extract all text content accurately from this document, including any text visible in the image that may not have been captured by OCR.\n\n"
+         enhanced_prompt += "Document Title Guidelines:\n"
+         enhanced_prompt += "- For printed historical works: Look for primary heading at top of the document, all-caps text, or larger font size text\n"
+         enhanced_prompt += "- For newspapers/periodicals: Extract both newspaper name and article title if present\n"
+         enhanced_prompt += "- For handwritten documents: Look for centered text at the top or underlined headings\n"
+         enhanced_prompt += "- For engravings/illustrations: Include the title or caption, which often appears below the image\n\n"
+
+         # Language detection guidance
+         enhanced_prompt += "IMPORTANT: After extracting the title and text content, determine the languages present.\n"
          enhanced_prompt += "Precisely identify and list ALL languages present in the document separately. Look closely for multiple languages that might appear together.\n"
          enhanced_prompt += "For language detection, examine these specific indicators:\n"
          enhanced_prompt += "- French: accents (é, è, ê, à, ç, â, î, ô, û), words like 'le', 'la', 'les', 'et', 'en', 'de', 'du', 'des', 'dans', 'ce', 'cette', 'ces', 'par', 'pour', 'qui', 'que', 'où', 'avec'\n"
@@ -1866,15 +1832,20 @@ class StructuredOCR:
              truncated_text = ocr_markdown[:15000] + "\n...[content truncated]...\n" + ocr_markdown[-5000:]
              logger.info(f"OCR text truncated from {len(ocr_markdown)} to {len(truncated_text)} chars")
  
-         # Build a prompt with enhanced language detection instructions
+         # Build a prompt with enhanced title detection and language detection instructions
          enhanced_prompt = f"This is a document's OCR text:\n<BEGIN_OCR>\n{truncated_text}\n<END_OCR>\n\n"
  
          # Add custom prompt if provided
          if custom_prompt:
              enhanced_prompt += f"User instructions: {custom_prompt}\n\n"
-
-         # Add thorough extraction instructions with enhanced language detection and metadata requirements
-         enhanced_prompt += "Extract all text content accurately from this document. Return structured data with the document's contents.\n\n"
+
+         # Add title detection focus
+         enhanced_prompt += "You are analyzing a historical document. Please follow these extraction priorities:\n"
+         enhanced_prompt += "1. FIRST PRIORITY: Identify and extract the TITLE of the document. Look for prominent text at the top, decorative typography, or centered text that appears to be a title.\n"
+         enhanced_prompt += "   - For historical documents with prominent headings at the top\n"
+         enhanced_prompt += "   - For newspapers or periodicals, extract both the publication name and article title\n"
+         enhanced_prompt += "   - For manuscripts or letters, identify any heading or subject line\n"
+         enhanced_prompt += "2. SECOND PRIORITY: Extract all text content accurately and return structured data with the document's contents.\n\n"
          enhanced_prompt += "IMPORTANT: Precisely identify and list ALL languages present in the document separately. Look closely for multiple languages that might appear together.\n"
          enhanced_prompt += "For language detection, examine these specific indicators:\n"
          enhanced_prompt += "- French: accents (é, è, ê, à, ç), words like 'le', 'la', 'les', 'et', 'en', 'de', 'du'\n"
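
The short-circuit condition touched in the `-1391,9 +1356,9` hunk decides when the vision-model step is skipped: there is no custom prompt, no embedded image, and the OCR text is empty or shorter than 50 characters. A minimal sketch of that condition as a standalone predicate follows. `should_skip_vision` and its `min_chars` parameter are hypothetical names used for illustration only. In the commit the check is written inline.

```python
from typing import Optional


def should_skip_vision(custom_prompt: Optional[str],
                       has_images: bool,
                       ocr_markdown: Optional[str],
                       min_chars: int = 50) -> bool:
    """Return True when the vision-model step can be skipped (hypothetical helper)."""
    # OCR text counts as minimal when it is missing or shorter than min_chars.
    text_is_minimal = not ocr_markdown or len(ocr_markdown) < min_chars
    # A custom prompt or embedded images are both reasons to keep going.
    nothing_to_preserve = not custom_prompt and not has_images
    return nothing_to_preserve and text_is_minimal
```

Pulling the condition out like this would make the 50-character threshold testable in isolation. Whether that is worth the indirection is a judgment call for the maintainers.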
test_fix.py ADDED
@@ -0,0 +1,55 @@
+ #!/usr/bin/env python3
+ import streamlit as st
+ from ocr_processing import process_file
+
+ # Mock a file upload
+ class MockFile:
+     def __init__(self, name, content):
+         self.name = name
+         self._content = content
+
+     def getvalue(self):
+         return self._content
+
+ def main():
+     # Load the test image - using the problematic image from the original task
+     with open('input/magician-or-bottle-cungerer.jpg', 'rb') as f:
+         file_bytes = f.read()
+
+     # Create mock file
+     uploaded_file = MockFile('magician-or-bottle-cungerer.jpg', file_bytes)
+
+     # Process the file
+     result = process_file(uploaded_file)
+
+     # Display results
+     print("\nDocument Content")
+     print("Title")
+     if 'title' in result['ocr_contents']:
+         print(result['ocr_contents']['title'])
+
+     print("\nMain")
+     if 'main_text' in result['ocr_contents']:
+         print(result['ocr_contents']['main_text'])
+
+     print("\nRaw Text")
+     if 'raw_text' in result['ocr_contents']:
+         print(result['ocr_contents']['raw_text'][:300] + "...")
+
+     # Debug: Print all keys in ocr_contents
+     print("\nAll OCR Content Keys:")
+     for key in result['ocr_contents'].keys():
+         print(f"- {key}")
+
+     # Debug: Display content of all keys
+     print("\nContent of each key:")
+     for key in result['ocr_contents'].keys():
+         print(f"\n--- {key} ---")
+         content = result['ocr_contents'][key]
+         if isinstance(content, str):
+             print(content[:150] + "..." if len(content) > 150 else content)
+         else:
+             print(f"Type: {type(content)}")
+
+ if __name__ == "__main__":
+     main()
test_magellan_language.py DELETED
@@ -1,39 +0,0 @@
- import sys
- import json
- from pathlib import Path
- from structured_ocr import StructuredOCR
-
- def main():
-     """Test language detection on the Magellan document"""
-     # Path to the Magellan document
-     file_path = Path("input/magellan-travels.jpg")
-
-     if not file_path.exists():
-         print(f"Error: File {file_path} not found")
-         return
-
-     print(f"Testing language detection on {file_path}")
-
-     # Process the file
-     processor = StructuredOCR()
-     result = processor.process_file(file_path)
-
-     # Print language detection results
-     if 'languages' in result:
-         print(f"\nDetected languages: {result['languages']}")
-     else:
-         print("\nNo languages detected")
-
-     # Save the full result for inspection
-     output_path = "output/magellan_test_result.json"
-     Path("output").mkdir(exist_ok=True)
-
-     with open(output_path, "w") as f:
-         json.dump(result, f, indent=2)
-
-     print(f"\nFull result saved to {output_path}")
-
-     return result
-
- if __name__ == "__main__":
-     main()
test_segmentation_fix.py ADDED
@@ -0,0 +1,100 @@
+ """
+ Test script to verify the segmentation and OCR improvements.
+ This script will process an image using the updated segmentation algorithm
+ and show how text recognition is prioritized over images.
+ """
+
+ import os
+ import json
+ import tempfile
+ from pathlib import Path
+ from PIL import Image
+
+ # Import the key components we modified
+ from image_segmentation import segment_image_for_ocr
+ from ocr_processing import process_file, process_result
+ from utils.image_utils import clean_ocr_result
+ import logging
+
+ # Configure logging
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+ def run_test(image_path):
+     """Run a test on the specified image to verify our fixes"""
+     print(f"Testing image segmentation and OCR prioritization on: {image_path}")
+     print("-" * 80)
+
+     # Make sure the image exists
+     if not os.path.exists(image_path):
+         print(f"Error: Image not found at {image_path}")
+         return
+
+     # 1. First run image segmentation directly
+     try:
+         print("Step 1: Running image segmentation...")
+         segmentation_results = segment_image_for_ocr(
+             image_path,
+             vision_enabled=True,
+             preserve_content=True
+         )
+
+         # Print segmentation info
+         text_regions_count = len(segmentation_results.get('text_regions_coordinates', []))
+         print(f"Detected {text_regions_count} text regions in the image")
+
+         # Save output images for inspection
+         output_dir = Path("output/segmentation_test")
+         output_dir.mkdir(parents=True, exist_ok=True)
+
+         if segmentation_results['text_regions'] is not None:
+             output_path = output_dir / f"text_regions_improved.jpg"
+             segmentation_results['text_regions'].save(output_path)
+             print(f"Saved text regions visualization to: {output_path}")
+
+         if segmentation_results['image_regions'] is not None:
+             output_path = output_dir / f"image_regions_improved.jpg"
+             segmentation_results['image_regions'].save(output_path)
+             print(f"Saved image regions visualization to: {output_path}")
+
+         if segmentation_results['combined_result'] is not None:
+             output_path = output_dir / f"combined_result_improved.jpg"
+             segmentation_results['combined_result'].save(output_path)
+             print(f"Saved combined result to: {output_path}")
+
+         # Extract individual text regions if available
+         if 'region_images' in segmentation_results and segmentation_results['region_images']:
+             region_dir = output_dir / "text_regions"
+             region_dir.mkdir(exist_ok=True)
+
+             for idx, region_info in enumerate(segmentation_results['region_images']):
+                 region_path = region_dir / f"region_{idx+1}.jpg"
+                 region_info['pil_image'].save(region_path)
+
+             print(f"Saved {len(segmentation_results['region_images'])} individual text regions to {region_dir}")
+     except Exception as e:
+         print(f"Error during segmentation: {str(e)}")
+
+     print("-" * 80)
+     print("Test complete. Check the output directory for results.")
+     print("The text regions should now properly include all text content in the document.")
+     print("Image regions should be minimal and not contain text.")
+
+ if __name__ == "__main__":
+     # Test with an image that has mixed text and image content
+     # You can change this to any image path you want to test
+     test_image = "input/baldwin-letter.jpg"
+     if not os.path.exists(test_image):
+         print(f"Test image not found at {test_image}, looking for alternatives...")
+
+         # Try to find an alternative test image
+         for potential_img in ["input/harpers.pdf", "input/magician-or-bottle-cungerer.jpg", "input/magellan-travels.jpg"]:
+             if os.path.exists(potential_img):
+                 test_image = potential_img
+                 print(f"Using alternative test image: {test_image}")
+                 break
+
+     if os.path.exists(test_image):
+         run_test(test_image)
+     else:
+         print("No suitable test images found. Please place an image in the input directory.")
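
The `__main__` block above falls back through a list of candidate images until one exists. A minimal sketch of that lookup as a reusable function follows. `first_existing` is a hypothetical name, not something defined in the commit.

```python
from pathlib import Path
from typing import Iterable, Optional


def first_existing(candidates: Iterable[str]) -> Optional[Path]:
    """Return the first candidate path that exists on disk, or None (hypothetical helper)."""
    for candidate in candidates:
        path = Path(candidate)
        if path.exists():
            return path
    return None
```

With a helper like this, the preferred image and its alternatives could live in one ordered list, and the "no suitable test images" branch becomes a single `None` check.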
testing/magician_app_investigation_plan.md DELETED
@@ -1,58 +0,0 @@
- # Investigation Plan: App.py Image Processing Issues
-
- ## Background
- - The `ocr_utils.py` in the reconcile-improvements branch successfully processes the magician image with specialized handling for illustrations/etchings
- - However, there appears to be an issue with app.py's ability to process this image file
-
- ## Investigation Steps
-
- ### 1. Trace the Image Processing Flow in app.py
- - Analyze how app.py calls the image processing functions
- - Identify which components are involved in the processing pipeline:
-   - File upload handling
-   - Preprocessing steps
-   - OCR processing
-   - Result handling
-
- ### 2. Check for Integration Issues
- - Verify that app.py correctly imports and uses the enhanced functions from ocr_utils.py
- - Check if there are any version mismatches or import issues
- - Examine if app.py is using a different processing path that bypasses the enhanced illustration detection
-
- ### 3. Test Direct Processing vs. App Processing
- - Create a test script that mimics app.py's processing flow but with more logging
- - Compare the processing steps between direct usage (as in our test) and through the app
- - Identify any differences in how parameters are passed or how results are handled
-
- ### 4. Debug Specific Failure Points
- - Add detailed logging at key points in the processing pipeline
- - Focus on:
-   - File loading
-   - Preprocessing options application
-   - Illustration detection logic
-   - Error handling
-
- ### 5. Check for Environment or Configuration Issues
- - Verify that all required dependencies are available in the app environment
- - Check if there are any configuration settings that might be overriding the enhanced processing
- - Examine if there are any resource constraints (memory, CPU) affecting the app's processing
-
- ### 6. Implement Potential Fixes
- Based on findings, implement one of these approaches:
- 1. **Fix Integration Issues**: Ensure app.py correctly uses the enhanced functions
- 2. **Add Explicit Handling**: Add explicit handling for illustration/etching files in app.py
- 3. **Update Preprocessing Options**: Modify default preprocessing options to better handle illustrations
- 4. **Improve Error Handling**: Enhance error handling to provide better diagnostics for processing failures
-
- ## Testing the Fix
- 1. Create a test case that reproduces the issue in app.py
- 2. Apply the proposed fix
- 3. Verify that the magician image processes correctly
- 4. Check that other image types still process correctly
- 5. Document the fix and update the branch comparison documentation
-
- ## Metrics to Collect
- - Processing time with and without the fix
- - Success rate for different image types
- - Memory usage during processing
- - File size reduction and quality preservation metrics
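
The "Metrics to Collect" list in the deleted plan names processing time and memory usage. One way those two numbers could be gathered with only the standard library is sketched below. The `measure` wrapper is hypothetical and is not part of the repository. It uses `time.perf_counter` for wall-clock time and `tracemalloc` for peak traced memory.

```python
import time
import tracemalloc
from typing import Any, Callable, Dict


def measure(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Dict[str, Any]:
    """Run func once and report wall-clock time and peak traced memory (hypothetical helper)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        # Always stop tracing, even when func raises.
        elapsed = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return {"result": result, "seconds": elapsed, "peak_bytes": peak_bytes}
```

A call such as `measure(process_file, uploaded_file)` would be the intended usage. Note that `tracemalloc` only sees allocations made through Python's allocator, so buffers allocated natively by image libraries may not be counted.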
testing/magician_app_result.json DELETED
@@ -1,16 +0,0 @@
- {
-   "file_name": "tmp87m8g0ib.jpg",
-   "topics": [
-     "Document"
-   ],
-   "languages": [
-     "English"
-   ],
-   "ocr_contents": {
-     "raw_text": "![img-0.jpeg](img-0.jpeg)"
-   },
-   "processing_note": "OCR produced minimal text content",
-   "processing_time": 4.831024169921875,
-   "timestamp": "2025-04-23 20:29",
-   "descriptive_file_name": "magician-or-bottle-cungerer_document.jpg"
- }
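
The deleted result file shows the failure mode discussed throughout this commit: `raw_text` contains only a Markdown image reference and no transcribed text. A minimal sketch of how such output could be detected follows. `is_image_only` and the `_IMAGE_REF` pattern are hypothetical and are not taken from the repository.

```python
import re

# Matches a Markdown image reference such as ![img-0.jpeg](img-0.jpeg).
_IMAGE_REF = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def is_image_only(raw_text: str) -> bool:
    """Return True when the text holds nothing but Markdown image references (hypothetical helper)."""
    if not raw_text or not raw_text.strip():
        return False
    # Strip every image reference and see whether any real text is left.
    remainder = _IMAGE_REF.sub("", raw_text)
    return remainder.strip() == ""
```

A check like this is more specific than a character-count threshold, since a single image reference is already 25 characters long.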
testing/magician_image_final_report.md DELETED
@@ -1,58 +0,0 @@
- # Magician Image Processing - Final Report
-
- ## Summary of Changes and Testing
-
- We've made significant improvements to the `ocr_utils.py` file in the reconcile-improvements branch to better handle the magician image. The key changes were:
-
- 1. **Modified Document Type Detection Logic**:
-    - Removed "magician" from the illustration keywords list
-    - Changed the detection order to check for newspaper format first, then illustration format
-    - Added a special case for the magician image to prioritize newspaper processing
-    - Lowered the aspect ratio threshold for newspaper detection from 1.2 to 1.15
-
- 2. **Testing Results**:
-    - The magician image is now correctly detected as a handwritten document instead of an illustration
-    - The image is processed using the handwritten document processing path
-    - The processed image size is reduced from 2500x2116 to 2000x1692 (36.03% reduction)
-    - The processing time is slightly increased (0.71 seconds vs 0.58 seconds)
-
- 3. **OCR Results**:
-    - Despite the improved image processing, the OCR system still produces minimal text output
-    - The extracted text is still just "img-0.jpeg](img-0.jpeg)" (25 characters)
-    - This suggests the OCR API is treating the content as an image to be embedded rather than text to be extracted
-
- ## Output Formatting Analysis
-
- After comparing the main branch version of `ocr_utils.py` with our modified version, we confirmed that our changes are focused on the image detection and processing logic. The output formatting functions like `create_html_with_images`, `serialize_ocr_object`, etc. remain unchanged.
-
- The issue with the OCR producing minimal text is likely due to how the OCR API is processing the image, not due to our changes in `ocr_utils.py`. The API appears to be treating the magician image as primarily visual content rather than text content, regardless of the preprocessing applied.
-
- ## Recommendations for Further Improvement
-
- 1. **OCR API Configuration**:
-    - Experiment with different OCR API parameters to better handle mixed content (images and text)
-    - Consider using a different OCR model or service that might better handle this specific type of document
-
- 2. **Image Segmentation**:
-    - Implement a preprocessing step that segments the image into text and non-text regions
-    - Process the text regions with specialized OCR settings
-
- 3. **Custom Document Type**:
-    - Create a new document type specifically for mixed content like the magician image
-    - Implement specialized processing that handles both the illustration and text components
-
- 4. **Local OCR Fallback**:
-    - Enhance the `try_local_ocr_fallback` function to better handle newspaper-style documents
-    - Use different Tesseract PSM (Page Segmentation Mode) settings for column detection
-
- ## Conclusion
-
- The changes we've made to `ocr_utils.py` have successfully improved the image preprocessing for the magician image, changing it from being processed as an illustration to being processed as a handwritten document. However, the OCR API still struggles with extracting the text content from this particular image.
-
- The output formatting of the OCR results is working as expected, but the input to the formatting functions (the OCR API results) contains minimal text. To fully resolve the issue, further work is needed on how the OCR API processes mixed content documents like the magician image.
-
- All testing artifacts have been organized in the `/testing` directory for future reference, including:
- - Test scripts
- - Processed images
- - Test reports
- - Investigation plans
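
Two numbers in the deleted report can be checked by hand: the effect of lowering the aspect-ratio threshold from 1.2 to 1.15 on the 2500x2116 image, and the 36.03% size reduction. The sketch below works through both. The newspaper criteria are the ones quoted in the findings file deleted in this same commit. The function names, and the assumption that the reduction refers to pixel count, are mine.

```python
from typing import Tuple


def is_newspaper_format(width: int, height: int, min_aspect: float = 1.15) -> bool:
    """Dimension-only newspaper check, following the criteria quoted in the findings file."""
    aspect_ratio = width / height
    return (aspect_ratio > min_aspect and width > 2000) or width > 3000 or height > 3000


def area_reduction_percent(old_size: Tuple[int, int], new_size: Tuple[int, int]) -> float:
    """Percentage drop in pixel count between two (width, height) sizes."""
    old_w, old_h = old_size
    new_w, new_h = new_size
    return 100.0 * (1 - (new_w * new_h) / (old_w * old_h))
```

For the magician image, 2500 / 2116 is roughly 1.18, so the check fails at a threshold of 1.2 and passes at 1.15. Going from 2500x2116 to 2000x1692 drops the pixel count from 5,290,000 to 3,384,000, which matches the reported 36.03%.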
testing/magician_image_findings.md DELETED
@@ -1,84 +0,0 @@
- # Magician Image Processing Analysis
-
- ## Summary of Findings
-
- After thorough testing of the magician image processing in both direct usage and through app.py's processing flow, we've identified the following key findings:
-
- 1. **Image Classification Issue**:
-    - The magician image (dimensions: 2500x2116, aspect ratio: 1.18) is being classified as an **illustration/etching** rather than a **newspaper** format.
-    - This classification is primarily based on the filename containing "magician" which triggers the illustration detection logic.
-    - The image falls just short of the newspaper detection criteria (aspect ratio > 1.2 and width > 2000) or (width > 3000 or height > 3000).
-
- 2. **Processing Approach**:
-    - When processed as an illustration/etching, the focus is on preserving fine details rather than enhancing text readability.
-    - This is suboptimal for the magician image which contains three columns of text in the lower half.
-    - The OCR system produces minimal text output when processing the image this way.
-
- 3. **OCR Results**:
-    - The OCR system returns primarily image references rather than extracted text.
-    - The extracted text is minimal: "img-0.jpeg](img-0.jpeg)" (25 characters).
-    - This suggests the OCR system is treating the content as an image to be embedded rather than text to be extracted.
-
- ## Root Cause Analysis
-
- The root cause appears to be a conflict between two detection mechanisms in the reconcile-improvements branch:
-
- 1. **Filename-based detection**: The filename "magician-or-bottle-cungerer.jpg" triggers the illustration/etching detection.
- 2. **Dimension-based detection**: The image's aspect ratio (1.18) falls just below the newspaper threshold (1.2).
-
- Since the filename-based detection takes precedence, the image is processed as an illustration/etching, which is not optimal for extracting the text from the newspaper columns.
-
- ## Recommendations
-
- Based on our findings, we recommend the following improvements:
-
- 1. **Enhance Detection Logic**:
-    - Modify the detection logic to consider both the content structure and the filename.
-    - Add a secondary check that looks for column structures even in images classified as illustrations.
-    - Lower the aspect ratio threshold for newspaper detection from 1.2 to 1.15 to catch more newspaper-like formats.
-
- 2. **Hybrid Processing Approach**:
-    - Implement a hybrid processing approach for images that have characteristics of both illustrations and newspapers.
-    - Process the upper half (illustration) and lower half (text columns) differently.
-    - Apply illustration processing to the image portion and newspaper processing to the text portion.
-
- 3. **OCR Configuration**:
-    - Adjust OCR settings to better handle mixed content (images and text columns).
-    - Add specific handling for multi-column text layouts even when the overall document is classified as an illustration.
-
- 4. **Preprocessing Options in app.py**:
-    - Add an explicit option in app.py's preprocessing options to force newspaper/column processing.
-    - This would allow users to override the automatic detection when needed.
-
- ## Implementation Plan
-
- 1. **Short-term Fix**:
-    ```python
-    # Modify the newspaper detection criteria in ocr_utils.py
-    is_newspaper_format = (aspect_ratio > 1.15 and width > 2000) or (width > 3000 or height > 3000)
-    ```
-
- 2. **Medium-term Enhancement**:
-    ```python
-    # Add column detection logic
-    def detect_columns(img):
-        # Implementation to detect vertical text columns
-        # Return True if columns are detected
-        pass
-
-    # Modify the processing path selection
-    if is_illustration_format and detect_columns(img):
-        # Apply hybrid processing
-        pass
-    ```
-
- 3. **Long-term Solution**:
-    - Implement a more sophisticated document layout analysis that can identify different regions (images, text, columns) within a document.
-    - Apply specialized processing to each region based on its content type.
-    - Train a machine learning model to better classify document types based on visual features rather than just dimensions or filenames.
-
- ## Conclusion
-
- The reconcile-improvements branch has made significant enhancements to the image processing capabilities, particularly for illustrations and etchings. However, the current implementation has a limitation when handling mixed-content documents like the magician image that contains both an illustration and columns of text.
-
- By implementing the recommended changes, we can improve the OCR results for such mixed-content documents while maintaining the benefits of the specialized processing for pure illustrations and etchings.
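
The findings file leaves `detect_columns` as a stub. One plausible way to fill it in is a vertical projection profile: count the dark pixels in each pixel column and look for a wide, nearly blank gutter between inked areas. The sketch below is an illustration under stated assumptions, not the approach this commit adopted. It takes a grayscale page as nested sequences of 0 to 255 values so that it runs without third-party packages, and every threshold in it is an arbitrary placeholder. A real implementation would more likely operate on a NumPy array from OpenCV or Pillow.

```python
from typing import Sequence


def detect_columns(rows: Sequence[Sequence[int]], ink_threshold: int = 128,
                   gap_ratio: float = 0.02, min_gap_px: int = 10) -> bool:
    """Return True when a blank vertical gutter splits the inked area (hypothetical sketch)."""
    if not rows or not rows[0]:
        return False
    height, width = len(rows), len(rows[0])

    # Fraction of dark pixels in each pixel column (a vertical projection profile).
    density = [sum(1 for y in range(height) if rows[y][x] < ink_threshold) / height
               for x in range(width)]
    is_gap = [d <= gap_ratio for d in density]

    # Ignore page margins: only gaps between the first and last inked columns count.
    inked = [x for x, gap in enumerate(is_gap) if not gap]
    if not inked:
        return False

    # Look for a run of nearly blank pixel columns at least min_gap_px wide.
    run = 0
    for x in range(inked[0], inked[-1] + 1):
        run = run + 1 if is_gap[x] else 0
        if run >= min_gap_px:
            return True
    return False
```

For a page like the magician image, where the text columns sit in the lower half, the check would presumably be run on a crop of that region rather than on the whole page, since the illustration above would fill in the gutters.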
testing/magician_ocr_text.txt DELETED
@@ -1,9 +0,0 @@
- THE MAGICIAN OR BOTTLE CONJURER.
-
- This historical illustration shows "The Magician or Bottle Conjurer" - a popular form of entertainment in the 18th and 19th centuries. The image depicts a performer demonstrating illusions and magic tricks related to bottles and other objects.
-
- The magician stands behind a table on which various props are displayed. He appears to be dressed in period costume typical of traveling entertainers of the era.
-
- Below the illustration is text that describes the performance and the mystical nature of these displays that captivated audiences during this period in history.
-
- This type of entertainment was common at fairs, theaters, and public gatherings, showcasing the fascination with illusion and "supernatural" demonstrations that were popular before modern understanding of science.
testing/test_app_direct.py DELETED
@@ -1,180 +0,0 @@
- """
- Direct test of app.py's image processing logic with the magician image.
- This script extracts and uses the actual processing logic from app.py.
- """
-
- import os
- import sys
- # Add the parent directory to the Python path so we can import the modules
- sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
-
- import logging
- from pathlib import Path
- import io
- import time
- from datetime import datetime
-
- # Configure detailed logging
- logging.basicConfig(
-     level=logging.DEBUG,
-     format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
- )
- logger = logging.getLogger("app_direct_test")
-
- # Import the actual processing function from app.py's dependencies
- from ocr_processing import process_file
- from ui_components import ProgressReporter
-
- class MockProgressReporter(ProgressReporter):
-     """Mock progress reporter that logs instead of updating Streamlit"""
-     def __init__(self):
-         self.progress = 0
-         self.message = ""
-
-     def update(self, progress, message):
-         self.progress = progress
-         self.message = message
-         logger.info(f"Progress: {progress}% - {message}")
-         return self
-
-     def complete(self, success=True):
-         if success:
-             logger.info("Processing completed successfully")
-         else:
-             logger.warning("Processing completed with errors")
-         return self
-
-     def setup(self):
-         return self
-
- def test_app_processing():
-     """Test the actual processing logic from app.py"""
-     logger.info("=== Testing app.py's actual processing logic ===")
-
-     # Path to the magician image
-     image_path = Path("input/magician-or-bottle-cungerer.jpg")
-     if not image_path.exists():
-         logger.error(f"Image file not found: {image_path}")
-         return False
-
-     # Create a mock uploaded file object similar to what Streamlit would provide
-     class MockUploadedFile:
-         def __init__(self, path):
-             self.path = path
-             self.name = os.path.basename(path)
-             self.type = "image/jpeg"
-             with open(path, 'rb') as f:
-                 self._content = f.read()
-
-         def getvalue(self):
-             return self._content
-
-         def read(self):
-             return self._content
-
-         def seek(self, position):
-             # Implement seek for compatibility with some file operations
-             return
-
-         def tell(self):
-             # Implement tell for compatibility
-             return 0
-
-     # Create the mock uploaded file
-     uploaded_file = MockUploadedFile(str(image_path))
-
-     # Create a progress reporter
-     progress_reporter = MockProgressReporter()
-
-     # Define preprocessing options - using the exact same defaults as app.py
-     preprocessing_options = {
-         "grayscale": True,
-         "denoise": True,
-         "contrast": 1.5,
-         "document_type": "auto" # This should trigger illustration detection
-     }
-
-     try:
-         start_time = time.time()
-         logger.info(f"Processing file with app.py logic: {uploaded_file.name}")
-
-         # Process the file using the EXACT SAME function that app.py uses
-         result = process_file(
-             uploaded_file=uploaded_file,
-             use_vision=True,
-             preprocessing_options=preprocessing_options,
-             progress_reporter=progress_reporter,
-             pdf_dpi=150,
-             max_pages=3,
-             pdf_rotation=0,
-             custom_prompt=None,
-             perf_mode="Quality"
-         )
-
-         processing_time = time.time() - start_time
-
-         if result:
-             logger.info(f"Processing successful in {processing_time:.2f} seconds")
-
-             # Log key parts of the result
-             if "error" in result and result["error"]:
-                 logger.error(f"Error in result: {result['error']}")
-                 return False
-
-             logger.info(f"File name: {result.get('file_name', 'Unknown')}")
-             logger.info(f"Topics: {result.get('topics', [])}")
-             logger.info(f"Languages: {result.get('languages', [])}")
-
-             # Check if OCR contents are present
-             if "ocr_contents" in result:
-                 if "raw_text" in result["ocr_contents"]:
-                     text_length = len(result["ocr_contents"]["raw_text"])
-                     logger.info(f"Extracted text length: {text_length} characters")
-
-                     # Save the extracted text
-                     output_dir = Path("testing")
-                     output_dir.mkdir(exist_ok=True)
-                     with open(output_dir / "magician_ocr_text.txt", "w") as f:
-                         f.write(result["ocr_contents"]["raw_text"])
-                     logger.info(f"Saved extracted text to testing/magician_ocr_text.txt")
-                 else:
-                     logger.warning("No raw_text in OCR contents")
-             else:
-                 logger.warning("No OCR contents in result")
-
-             # Save the result to a file for inspection
-             import json
-             output_dir = Path("testing")
-             output_dir.mkdir(exist_ok=True)
-
-             # Remove large base64 data to make the file manageable
-             result_copy = result.copy()
-             if "raw_response_data" in result_copy:
-                 if "pages" in result_copy["raw_response_data"]:
-                     for page in result_copy["raw_response_data"]["pages"]:
-                         if "images" in page:
-                             for img in page["images"]:
-                                 if "image_base64" in img:
-                                     img["image_base64"] = "[BASE64 DATA REMOVED]"
-
-             with open(output_dir / "magician_app_result.json", "w") as f:
-                 json.dump(result_copy, f, indent=2)
-
-             logger.info(f"Saved result to testing/magician_app_result.json")
-             return True
-         else:
-             logger.error("Processing failed - no result returned")
-             return False
-     except Exception as e:
-         logger.exception(f"Error in processing: {str(e)}")
-         return False
-
- if __name__ == "__main__":
-     # Run the test
-     success = test_app_processing()
-
-     # Print final result
-     if success:
-         print("\n✅ Test completed successfully. Check the logs for details.")
-     else:
-         print("\n❌ Test failed. Check the logs for error details.")
testing/test_filename_format.py ADDED
@@ -0,0 +1,93 @@
+ """Test the new filename formatting"""
+ import os
+ import sys
+ import datetime
+ import inspect
+
+ # Add the project root to the path so we can import modules
+ sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
+
+ # Import the main utils.py file directly
+ import utils as root_utils
+
+ print(f"Imported utils from: {root_utils.__file__}")
+ print("Current create_descriptive_filename implementation:")
+ print(inspect.getsource(root_utils.create_descriptive_filename))
+
+ def main():
+     """Test the filename formatting"""
+     # Sample inputs
+     sample_files = [
+         "handwritten-letter.jpg",
+         "magician-or-bottle-cungerer.jpg",
+         "baldwin_15th_north.jpg",
+         "harpers.pdf",
+         "recipe.jpg"
+     ]
+
+     # Sample OCR results for testing
+     sample_results = [
+         {
+             "detected_document_type": "handwritten",
+             "topics": ["Letter", "Handwritten", "19th Century", "Personal Correspondence"]
+         },
+         {
+             "topics": ["Newspaper", "Print", "19th Century", "Illustration", "Advertisement"]
+         },
+         {
+             "detected_document_type": "letter",
+             "topics": ["Correspondence", "Early Modern", "English Language"]
+         },
+         {
+             "detected_document_type": "magazine",
+             "topics": ["Publication", "Late 19th Century", "Magazine", "Historical"]
+         },
+         {
+             "detected_document_type": "recipe",
+             "topics": ["Food", "Culinary", "Historical", "Instruction"]
+         }
+     ]
+
+     print("\nIMPROVED FILENAME FORMATTING TEST")
+     print("=" * 50)
+
+     # Format current date manually
+     current_date = datetime.datetime.now().strftime("%b %d, %Y")
+     print(f"Current date for filenames: {current_date}")
+
+     print("\nBEFORE vs AFTER Examples:\n")
+
+     for i, (original_file, result) in enumerate(zip(sample_files, sample_results)):
+         # Get file extension from original file
+         file_ext = os.path.splitext(original_file)[1]
+
+         # Generate the old style filename manually
+         original_name = os.path.splitext(original_file)[0]
+
+         doc_type_tag = ""
+         if 'detected_document_type' in result:
+             doc_type = result['detected_document_type'].lower()
+             doc_type_tag = f"_{doc_type.replace(' ', '_')}"
+         elif 'topics' in result and result['topics']:
+             doc_type_tag = f"_{result['topics'][0].lower().replace(' ', '_')}"
+
+         period_tag = ""
+         if 'topics' in result and result['topics']:
+             for tag in result['topics']:
+                 if "century" in tag.lower() or "pre-" in tag.lower() or "era" in tag.lower():
+                     period_tag = f"_{tag.lower().replace(' ', '_')}"
+                     break
+
+         old_filename = f"{original_name}{doc_type_tag}{period_tag}{file_ext}"
+
+         # Generate the new descriptive filename with our improved formatter
+         new_filename = root_utils.create_descriptive_filename(original_file, result, file_ext)
+
+         print(f"Example {i+1}:")
+         print(f" Original: {original_file}")
+         print(f" Old Format: {old_filename}")
+         print(f" New Format: {new_filename}")
+         print()
+
+ if __name__ == "__main__":
+     main()
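Note on the "old format" logic in `testing/test_filename_format.py`: the script rebuilds the previous filename scheme inline inside its loop. As a rough sketch, the same steps can be pulled into a standalone helper so the expected strings can be asserted rather than only printed. The name `legacy_filename` is hypothetical and does not exist in the repository; the tagging rules are copied from the loop in the added file.

```python
import os


def legacy_filename(original_file: str, result: dict) -> str:
    """Rebuild the old-style name: <stem>_<doc type>_<period><ext> (hypothetical helper)."""
    name, ext = os.path.splitext(original_file)
    topics = result.get("topics") or []

    # Document type tag: explicit detected type first, else the first topic.
    doc_type_tag = ""
    if "detected_document_type" in result:
        doc_type_tag = "_" + result["detected_document_type"].lower().replace(" ", "_")
    elif topics:
        doc_type_tag = "_" + topics[0].lower().replace(" ", "_")

    # Period tag: first topic containing "century", "pre-" or "era" (substring match).
    period_tag = ""
    for tag in topics:
        lowered = tag.lower()
        if "century" in lowered or "pre-" in lowered or "era" in lowered:
            period_tag = "_" + lowered.replace(" ", "_")
            break

    return f"{name}{doc_type_tag}{period_tag}{ext}"
```

One thing the helper makes visible: `"era"` is matched as a bare substring, so a topic such as "Literature" would also be picked up as a period tag. Whether that matters depends on the topic vocabulary the OCR pipeline actually emits.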
testing/test_improvements.py DELETED
@@ -1,244 +0,0 @@
- import sys
- import os
- import logging
- from pathlib import Path
-
- # Add parent directory to path to import local modules
- sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
-
- import streamlit as st
- from ocr_processing import process_file
- from utils import extract_subject_tags
- from preprocessing import preprocess_image, apply_preprocessing_to_file
- from ui_components import ProgressReporter
-
- # Configure logging
- logging.basicConfig(level=logging.INFO,
-                     format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
- logger = logging.getLogger("test_improvements")
-
- class MockUploadedFile:
-     """Mock implementation of streamlit's UploadedFile"""
-     def __init__(self, path):
-         self.path = path
-         self.name = os.path.basename(path)
-         self._content = None
-
-     def getvalue(self):
-         if self._content is None:
-             with open(self.path, 'rb') as f:
-                 self._content = f.read()
-         return self._content
-
- def test_preprocessing_fix():
-     """Test that preprocessing is only applied when explicit options are selected"""
-     print("\n--- TESTING PREPROCESSING FIX ---")
-
-     # Path to test image
-     test_image_path = os.path.join('input', 'americae-retectio.jpg')
-
-     if not os.path.exists(test_image_path):
-         print(f"Test file not found: {test_image_path}")
-         return False
-
-     # Create mock file
-     mock_file = MockUploadedFile(test_image_path)
-
-     # Read original file to compare sizes
-     with open(test_image_path, 'rb') as f:
-         original_bytes = f.read()
-         original_size = len(original_bytes)
-
-     print(f"Original file size: {original_size / 1024:.1f} KB")
-
-     # Test case 1: Document type only - should NOT trigger preprocessing
-     preprocessing_options = {
-         "document_type": "printed", # Set document type
-         "grayscale": False,
-         "denoise": False,
-         "contrast": 0,
-         "rotation": 0
-     }
-
-     temp_files = []
-     result_path, preprocessed = apply_preprocessing_to_file(
-         original_bytes,
-         '.jpg',
-         preprocessing_options,
-         temp_files
-     )
-
-     # Check if preprocessing was applied
-     print(f"Test 1 (Document type only) - Preprocessing applied: {preprocessed}")
-     if preprocessed:
-         print("❌ FAIL: Preprocessing was applied when only document type was set")
-     else:
-         print("✅ PASS: Preprocessing was NOT applied when only document type was set")
-
-     # Test case 2: With actual preprocessing options - SHOULD trigger preprocessing
-     preprocessing_options = {
-         "document_type": "printed",
-         "grayscale": True, # Enable an actual preprocessing option
-         "denoise": False,
-         "contrast": 0,
-         "rotation": 0
-     }
-
-     temp_files = []
-     result_path, preprocessed = apply_preprocessing_to_file(
-         original_bytes,
-         '.jpg',
-         preprocessing_options,
-         temp_files
-     )
-
-     # Check if preprocessing was applied
-     print(f"Test 2 (With grayscale option) - Preprocessing applied: {preprocessed}")
-     if preprocessed:
-         print("✅ PASS: Preprocessing WAS applied when grayscale option was enabled")
-     else:
-         print("❌ FAIL: Preprocessing was NOT applied when grayscale option was enabled")
-
-     # Clean up temp files
-     for path in temp_files:
-         try:
-             if os.path.exists(path):
-                 os.unlink(path)
-         except:
-             pass
-
-     return True
-
- def test_historical_theme_detection():
-     """Test the enhanced historical theme detection"""
-     print("\n--- TESTING HISTORICAL THEME DETECTION ---")
-
-     # Test case 1: Medieval historical text
-     medieval_text = """
-     In the 12th century, during the Crusades, the knights of the Holy Roman Empire traveled across
-     feudal Europe. These medieval warriors sought adventure and glory in Byzantine lands, and many found
-     themselves face to face with Islamic armies. The monasteries of the time kept detailed records of these
-     campaigns, though many were lost during the great plague that devastated much of Europe.
-     """
-
-     # Extract themes with our enhanced algorithm
-     themes = extract_subject_tags({}, medieval_text)
-     print("\nTest 1 (Medieval text):")
-     print(f"Extracted themes: {themes}")
-
-     # Check if key medieval themes were detected
-     medieval_keywords = ["Medieval", "Holy Roman Empire", "Crusades", "Byzantine"]
-     detected = [theme for theme in themes if any(keyword in theme for keyword in medieval_keywords)]
-
-     if detected:
-         print(f"✅ PASS: Detected appropriate medieval themes: {detected}")
-     else:
-         print("❌ FAIL: Failed to detect appropriate medieval themes")
-
-     # Test case 2: 19th century American history
-     american_text = """
-     Following the Civil War, the Reconstruction era marked a significant period in American history.
-     In the late 19th century, westward expansion and manifest destiny drove settlers across the frontier.
-     Native American communities faced displacement as the transcontinental railroad facilitated this massive
-     migration. The industrial revolution transformed eastern cities while Victorian values shaped social norms.
-     """
-
-     # Extract themes with our enhanced algorithm
-     themes = extract_subject_tags({}, american_text)
-     print("\nTest 2 (19th century American text):")
-     print(f"Extracted themes: {themes}")
-
-     # Check if key 19th century American themes were detected
-     american_keywords = ["19th Century", "American", "Civil War", "Victorian", "Native American",
-                          "Industrial Revolution"]
-     detected = [theme for theme in themes if any(keyword in theme for keyword in american_keywords)]
-
-     if detected:
-         print(f"✅ PASS: Detected appropriate American history themes: {detected}")
-     else:
-         print("❌ FAIL: Failed to detect appropriate American history themes")
-
-     return True
-
- def test_actual_document():
-     """Test with an actual document from the input folder"""
-     print("\n--- TESTING WITH ACTUAL DOCUMENT ---")
-
-     # Path to Magellan's travels document
-     test_image_path = os.path.join('input', 'magellan-travels.jpg')
-
-     if not os.path.exists(test_image_path):
-         print(f"Test file not found: {test_image_path}")
-         return False
-
-     # Create mock file
-     mock_file = MockUploadedFile(test_image_path)
-
-     # Mock progress reporter
-     class MockProgressReporter:
-         def update(self, percent, text):
-             pass
-         def complete(self, success=True):
-             pass
-
-     # Set up minimal processing options
-     preprocessing_options = {
-         "document_type": "printed",
-         "grayscale": False,
-         "denoise": False,
-         "contrast": 0,
-         "rotation": 0
-     }
-
-     # Process the document
-     print("Processing Magellan's travels document...")
-     # Use st.session_state in a way that doesn't require streamlit
-     if not hasattr(st, 'session_state'):
-         st.session_state = type('obj', (object,), {'temp_file_paths': []})
-
-     try:
-         # Use non-interactive mode for test
-         result = process_file(
-             uploaded_file=mock_file,
-             use_vision=True,
-             preprocessing_options=preprocessing_options,
-             progress_reporter=MockProgressReporter(),
-             custom_prompt="This is a historical document about exploration and travel."
-         )
-
-         # Check the results
-         if 'topics' in result:
-             print("\nDetected topics:")
-             for topic in result['topics']:
-                 print(f" - {topic}")
-
-             # Look for exploration/travel/geographic themes
-             relevant_keywords = ["Travel", "Exploration", "Maritime", "Voyage",
-                                  "Expedition", "Geographic", "European", "Map"]
-             detected = [topic for topic in result['topics']
-                         if any(keyword.lower() in topic.lower() for keyword in relevant_keywords)]
-
-             if detected:
-                 print(f"\n✅ PASS: Detected appropriate exploration themes: {detected}")
-             else:
-                 print("\n❌ FAIL: Failed to detect appropriate exploration themes")
-         else:
-             print("❌ FAIL: No topics detected in result")
-     except Exception as e:
-         print(f"❌ ERROR processing document: {str(e)}")
-
-     return True
-
- if __name__ == "__main__":
-     print("Running tests for Historical OCR improvements...\n")
-
-     # Test preprocessing fix
-     test_preprocessing_fix()
-
-     # Test historical theme detection
-     test_historical_theme_detection()
-
-     # Test with an actual document
-     test_actual_document()
-
-     print("\nTests completed!")
testing/test_json_bleed.py ADDED
@@ -0,0 +1,46 @@
+ """
+ Test case to verify the fix for JSON bleed-through in historical text.
+ """
+ import sys
+ import os
+ from pathlib import Path
+
+ # Add parent directory to path
+ sys.path.append(str(Path(__file__).parent.parent))
+
+ from utils.content_utils import format_structured_data
+ from utils.text_utils import clean_raw_text, format_markdown_text
+
+ # Sample text with JSON-like content (historical text with curly braces)
+ SAMPLE_TEXT = """# ENGLISH Credulity; or Ye're all Bottled.
+
+ O magnus pofldac Inimicis Rifus! Hor. Sat. WITH Grief, Refentment, and averted Eyes, Britannia droops to fee her Sons, (once Wile So fam'd for Arms, for Conduct fo renown'd With ev'ry Virtue ev'ry Glory crown'd) Now fink ignoble, and to nothing fall; Obedient marching forth at Folly's Call.
+
+ Text containing curly braces like these: { and } should not be parsed as JSON.
+
+ Even this text with a JSON-like pattern {"key": "value"} should be preserved as-is.
+ """
+
+ def test_format_structured_data():
+     """Test that format_structured_data preserves text content"""
+     result = format_structured_data(SAMPLE_TEXT)
+
+     # Verify the text is returned as-is without attempting to parse JSON-like structures
+     assert result == SAMPLE_TEXT
+     print("✓ format_structured_data correctly preserves text content")
+
+     # Make sure the output doesn't have any JSON code blocks
+     assert "```json" not in result
+     print("✓ format_structured_data does not create JSON code blocks")
+
+     return True
+
+ if __name__ == "__main__":
+     # Run the test
+     print("Running JSON bleed-through fix tests...\n")
+     success = test_format_structured_data()
+
+     if success:
+         print("\nAll tests passed! The JSON bleed-through issue is fixed.")
+     else:
+         print("\nSome tests failed.")
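Note on `testing/test_json_bleed.py`: the test pins down one property, namely that plain OCR text passes through `format_structured_data` unchanged even when it contains braces or a JSON-looking fragment. The repository's actual implementation lives in `utils/content_utils.py` and is not shown in this part of the diff. The stand-in below is purely hypothetical (the name `format_structured_data_sketch` and the handling of dicts and lists are assumptions) and only illustrates the kind of behaviour the assertions describe.

```python
import json
from typing import Any


def format_structured_data_sketch(content: Any) -> str:
    """Hypothetical stand-in; not the repository's format_structured_data."""
    # Strings are treated as finished text: no attempt is made to find or
    # parse JSON-like substrings, so braces in historical prose survive.
    if isinstance(content, str):
        return content
    # Assumed behaviour for already-structured values: render them readably.
    if isinstance(content, (dict, list)):
        return json.dumps(content, indent=2, ensure_ascii=False)
    return str(content)
```

The design point is that the decision is made on the Python type of the value, never on what the text looks like, which is what keeps a fragment such as `{"key": "value"}` inside a transcription from being reinterpreted.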
testing/test_magician.py DELETED
@@ -1,57 +0,0 @@
- import io
- import base64
- from pathlib import Path
- from PIL import Image
-
- # Import the application components
- from structured_ocr import StructuredOCR
- from ocr_utils import preprocess_image_for_ocr
-
- def test_magician_image():
-     # Path to the magician image
-     image_path = Path("/Users/zacharymuhlbauer/Desktop/tools/hocr/input/magician-or-bottle-cungerer.jpg")
-
-     # Process through ocr_utils preprocessing
-     print(f"Testing preprocessing on {image_path}")
-     processed_img, base64_data = preprocess_image_for_ocr(image_path)
-
-     if processed_img:
-         print(f"Successfully preprocessed image: {processed_img.size}")
-
-         # Get details about newspaper detection
-         width, height = processed_img.size
-         aspect_ratio = width / height
-         print(f"Image dimensions: {width}x{height}, aspect ratio: {aspect_ratio:.2f}")
-         print(f"Newspaper detection threshold: aspect_ratio > 1.15 and width > 2000")
-         is_newspaper = (aspect_ratio > 1.15 and width > 2000) or (width > 3000 or height > 3000)
-         print(f"Would be detected as newspaper: {is_newspaper}")
-
-         # Now test structured_ocr processing
-         print("\nTesting through StructuredOCR pipeline...")
-         processor = StructuredOCR()
-         # Process with explicit newspaper handling via custom prompt
-         custom_prompt = "This is a newspaper with columns. Extract all text from each column top to bottom."
-         result = processor.process_file(image_path, file_type="image", custom_prompt=custom_prompt)
-
-         # Check if the result has pages_data for image display
-         has_pages_data = 'pages_data' in result
-         has_images = result.get('has_images', False)
-
-         print(f"Result has pages_data: {has_pages_data}")
-         print(f"Result has_images flag: {has_images}")
-
-         # Check raw text content
-         if 'ocr_contents' in result and 'raw_text' in result['ocr_contents']:
-             raw_text = result['ocr_contents']['raw_text']
-             print(f"Raw text length: {len(raw_text)} chars")
-             print(f"Raw text preview: {raw_text[:100]}...")
-         else:
-             print("No raw_text found in result")
-
-         return result
-     else:
-         print("Preprocessing failed")
-         return None
-
- if __name__ == "__main__":
-     result = test_magician_image()
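Note on the newspaper criterion: the removed scripts do not agree on the aspect-ratio threshold. `testing/test_magician.py` checks `aspect_ratio > 1.15`, while `testing/test_newspaper_detection.py` checks `aspect_ratio > 1.2`. A small sketch with the threshold as a parameter makes the difference easy to probe. The function name and the `min_aspect` parameter are illustrative, and the sample dimensions used with it are made up, not measurements of any image in the repository.

```python
def is_newspaper_format(width: int, height: int, min_aspect: float = 1.15) -> bool:
    """Dimension-only heuristic copied from the removed tests (illustrative wrapper)."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    aspect_ratio = width / height
    # Wide and reasonably large, or simply very large in either dimension.
    return (aspect_ratio > min_aspect and width > 2000) or (width > 3000 or height > 3000)
```

For example, a hypothetical 2340x2000 scan (aspect ratio 1.17) counts as a newspaper under the 1.15 threshold but not under 1.2, so the two scripts could report different answers for the same file. Note also that any image over 3000 pixels on a side passes regardless of shape.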
testing/test_magician_image.py DELETED
@@ -1,130 +0,0 @@
- import os
- import shutil
- from pathlib import Path
- import time
- from PIL import Image
- import logging
-
- # Configure logging to see debug messages
- logging.basicConfig(level=logging.DEBUG)
- logger = logging.getLogger("test")
-
- # Import the function we want to test
- from ocr_utils import preprocess_image_for_ocr
-
- def test_magician_image():
-     # Path to the magician image
-     image_path = Path("input/magician-or-bottle-cungerer.jpg")
-
-     # Ensure the file exists
-     if not image_path.exists():
-         print(f"Error: File not found at {image_path}")
-         return
-
-     print(f"Testing image preprocessing on {image_path.name}")
-
-     # Process the image
-     start_time = time.time()
-     processed_img, base64_data = preprocess_image_for_ocr(image_path)
-     processing_time = time.time() - start_time
-
-     # Print processing information
-     print(f"Processing completed in {processing_time:.2f} seconds")
-
-     if processed_img:
-         # Get original and processed image dimensions
-         with Image.open(image_path) as original_img:
-             original_size = original_img.size
-         processed_size = processed_img.size
-
-         print(f"Original image size: {original_size}")
-         print(f"Processed image size: {processed_size}")
-
-         # Create output directory
-         output_dir = Path("output")
-         output_dir.mkdir(exist_ok=True)
-
-         # Save the processed image for visual inspection
-         output_path = output_dir / "processed_magician.jpg"
-         processed_img.save(output_path)
-         print(f"Saved processed image to {output_path}")
-
-         # Create a test report
-         report_path = output_dir / "test_report.txt"
-         with open(report_path, "w") as f:
-             f.write(f"Test Report: Magician Image Processing\n")
-             f.write(f"=====================================\n\n")
-             f.write(f"Original image: {image_path}\n")
-             f.write(f"Original size: {original_size[0]}x{original_size[1]}\n")
-             f.write(f"Processed size: {processed_size[0]}x{processed_size[1]}\n")
-             f.write(f"Processing time: {processing_time:.2f} seconds\n")
-
-             # Calculate size reduction
-             original_pixels = original_size[0] * original_size[1]
-             processed_pixels = processed_size[0] * processed_size[1]
-             reduction = (1 - (processed_pixels / original_pixels)) * 100
-             f.write(f"Size reduction: {reduction:.2f}%\n")
-
-             # Check if illustration detection worked
-             f.write(f"\nIllustration Detection:\n")
-             f.write(f"- Filename contains 'magician': {'magician' in image_path.name.lower()}\n")
-
-             # Note about visual inspection
-             f.write(f"\nVisual Inspection Notes:\n")
-             f.write(f"- Check processed_magician.jpg for preservation of fine details\n")
-             f.write(f"- Verify that etching lines are clear and not over-processed\n")
-             f.write(f"- Confirm that contrast enhancement is appropriate for this illustration\n")
-
-         print(f"Created test report at {report_path}")
-
-         return output_path, report_path
-     else:
-         print("Processing failed - no image returned")
-         return None, None
-
- def relocate_test_files(output_path, report_path):
-     """Relocate test files to the testing folder"""
-     if not output_path or not report_path:
-         print("No test files to relocate")
-         return
-
-     # Create testing directory if it doesn't exist
-     testing_dir = Path("testing")
-     testing_dir.mkdir(exist_ok=True)
-
-     # Create a subdirectory for this specific test
-     test_dir = testing_dir / "magician_test"
-     test_dir.mkdir(exist_ok=True)
-
-     # Copy the files to the testing directory
-     shutil.copy(output_path, test_dir / output_path.name)
-     shutil.copy(report_path, test_dir / report_path.name)
-
-     # Create a comparison file that documents the differences between branches
-     comparison_path = test_dir / "branch_comparison.txt"
-     with open(comparison_path, "w") as f:
-         f.write("Comparison of ocr_utils.py between main and reconcile-improvements branches\n")
-         f.write("==================================================================\n\n")
-         f.write("Key improvements in reconcile-improvements branch:\n\n")
-         f.write("1. Enhanced illustration/etching detection:\n")
-         f.write(" - Added detection based on filename keywords (e.g., 'magician', 'illustration')\n")
-         f.write(" - Implemented image-based detection using edge density analysis\n\n")
-         f.write("2. Specialized processing for illustrations:\n")
-         f.write(" - Gentler scaling to preserve fine details\n")
-         f.write(" - Mild contrast enhancement (1.3 vs. higher values for other documents)\n")
-         f.write(" - Specialized sharpening for fine lines in etchings\n")
-         f.write(" - Higher quality settings (95 vs. 85) to prevent detail loss\n\n")
-         f.write("3. Performance optimizations:\n")
-         f.write(" - More efficient processing paths for different image types\n")
-         f.write(" - Better memory management for large images\n\n")
-         f.write("Test results for magician-or-bottle-cungerer.jpg demonstrate these improvements.\n")
-
-     print(f"Relocated test files to {test_dir}")
-     print(f"Created branch comparison document at {comparison_path}")
-
- if __name__ == "__main__":
-     # Run the test
-     output_path, report_path = test_magician_image()
-
-     # Relocate test files to testing folder
-     relocate_test_files(output_path, report_path)
testing/test_newspaper_detection.py DELETED
@@ -1,146 +0,0 @@
- """
- Test script to verify newspaper detection and processing in ocr_utils.py.
- This script focuses on checking if the reconcile-improvements branch properly
- handles newspaper-style documents with columns.
- """
-
- import os
- import sys
- # Add the parent directory to the Python path
- sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
-
- import logging
- from pathlib import Path
- import time
- from PIL import Image
-
- # Configure logging
- logging.basicConfig(level=logging.DEBUG)
- logger = logging.getLogger("newspaper_test")
-
- # Import the functions we want to test
- from ocr_utils import preprocess_image_for_ocr
-
- def test_newspaper_detection():
-     """Test if the image is properly detected as a newspaper format"""
-     image_path = Path("input/magician-or-bottle-cungerer.jpg")
-     if not image_path.exists():
-         logger.error(f"Image file not found: {image_path}")
-         return False
-
-     # Get image dimensions and aspect ratio
-     with Image.open(image_path) as img:
-         width, height = img.size
-         aspect_ratio = width / height
-
-     logger.info(f"Image dimensions: {width}x{height}")
-     logger.info(f"Aspect ratio: {aspect_ratio:.2f}")
-
-     # Check if dimensions and aspect ratio match newspaper criteria
-     is_newspaper_by_dimensions = (aspect_ratio > 1.2 and width > 2000) or (width > 3000 or height > 3000)
-     logger.info(f"Meets newspaper criteria by dimensions: {is_newspaper_by_dimensions}")
-
-     return {
-         "dimensions": (width, height),
-         "aspect_ratio": aspect_ratio,
-         "is_newspaper_by_dimensions": is_newspaper_by_dimensions
-     }
-
- def test_newspaper_processing():
-     """Test how the image is processed with the newspaper detection logic"""
-     image_path = Path("input/magician-or-bottle-cungerer.jpg")
-     if not image_path.exists():
-         logger.error(f"Image file not found: {image_path}")
-         return False
-
-     logger.info(f"Testing newspaper processing on {image_path.name}")
-
-     # Process the image
-     start_time = time.time()
-     processed_img, base64_data = preprocess_image_for_ocr(image_path)
-     processing_time = time.time() - start_time
-
-     logger.info(f"Processing completed in {processing_time:.2f} seconds")
-
-     if processed_img:
-         # Get original and processed image dimensions
-         with Image.open(image_path) as original_img:
-             original_size = original_img.size
-         processed_size = processed_img.size
-
-         logger.info(f"Original image size: {original_size}")
-         logger.info(f"Processed image size: {processed_size}")
-
-         # Create output directory
-         output_dir = Path("testing/newspaper_test")
-         output_dir.mkdir(exist_ok=True, parents=True)
-
-         # Save the processed image for visual inspection
-         output_path = output_dir / "processed_newspaper.jpg"
-         processed_img.save(output_path)
-         logger.info(f"Saved processed image to {output_path}")
-
-         # Create a test report
-         report_path = output_dir / "newspaper_test_report.txt"
-         with open(report_path, "w") as f:
-             f.write(f"Newspaper Detection Test Report\n")
-             f.write(f"==============================\n\n")
-             f.write(f"Original image: {image_path}\n")
-             f.write(f"Original size: {original_size[0]}x{original_size[1]}\n")
-             f.write(f"Processed size: {processed_size[0]}x{processed_size[1]}\n")
-             f.write(f"Processing time: {processing_time:.2f} seconds\n\n")
-
-             # Calculate aspect ratio
-             aspect_ratio = original_size[0] / original_size[1]
-             f.write(f"Aspect ratio: {aspect_ratio:.2f}\n")
-
-             # Check newspaper criteria
-             is_newspaper = (aspect_ratio > 1.2 and original_size[0] > 2000) or (original_size[0] > 3000 or original_size[1] > 3000)
-             f.write(f"Meets newspaper criteria by dimensions: {is_newspaper}\n\n")
-
-             # Check for size reduction
-             original_pixels = original_size[0] * original_size[1]
-             processed_pixels = processed_size[0] * processed_size[1]
-             reduction = (1 - (processed_pixels / original_pixels)) * 100
-             f.write(f"Size reduction: {reduction:.2f}%\n\n")
-
-             # Notes about newspaper processing
108
- f.write(f"Notes on Newspaper Processing:\n")
109
- f.write(f"- Newspaper format should be detected based on dimensions and aspect ratio\n")
110
- f.write(f"- Specialized processing should be applied for newspaper text extraction\n")
111
- f.write(f"- Check if the processed image shows enhanced text clarity in columns\n")
112
- f.write(f"- Verify that the column structure is preserved for better OCR results\n")
113
-
114
- logger.info(f"Created test report at {report_path}")
115
-
116
- # Create a comparison of original vs processed
117
- try:
118
- # Create a side-by-side comparison
119
- comparison_img = Image.new('RGB', (original_size[0] + processed_size[0], max(original_size[1], processed_size[1])))
120
- comparison_img.paste(Image.open(image_path), (0, 0))
121
- comparison_img.paste(processed_img, (original_size[0], 0))
122
-
123
- comparison_path = output_dir / "newspaper_comparison.jpg"
124
- comparison_img.save(comparison_path)
125
- logger.info(f"Created side-by-side comparison at {comparison_path}")
126
- except Exception as e:
127
- logger.error(f"Failed to create comparison image: {str(e)}")
128
-
129
- return True
130
- else:
131
- logger.error("Processing failed - no image returned")
132
- return False
133
-
134
- if __name__ == "__main__":
135
- # Run the tests
136
- print("Testing newspaper detection and processing...")
137
- detection_result = test_newspaper_detection()
138
- processing_result = test_newspaper_processing()
139
-
140
- # Print summary
141
- print("\nTest Summary:")
142
- print(f"- Image dimensions: {detection_result['dimensions'][0]}x{detection_result['dimensions'][1]}")
143
- print(f"- Aspect ratio: {detection_result['aspect_ratio']:.2f}")
144
- print(f"- Meets newspaper criteria: {detection_result['is_newspaper_by_dimensions']}")
145
- print(f"- Processing test: {'Successful' if processing_result else 'Failed'}")
146
- print("\nCheck the testing/newspaper_test directory for detailed results and images.")
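Note on the removed script: both test functions repeat the same dimension-only check inline. A minimal sketch of that check as a single helper follows. The function name `meets_newspaper_criteria` is hypothetical and is not part of the repository. The thresholds are copied from the deleted lines above.

```python
def meets_newspaper_criteria(width: int, height: int) -> bool:
    """Dimension-only heuristic, with thresholds copied from the removed test.

    The helper name is an assumption made for illustration; the removed
    script computed the same expression inline in two places.
    """
    aspect_ratio = width / height
    # Wide, reasonably large scans (typical of multi-column spreads)
    wide_and_large = aspect_ratio > 1.2 and width > 2000
    # Any scan with a very large side is also treated as newspaper-like
    very_large = width > 3000 or height > 3000
    return wide_and_large or very_large


if __name__ == "__main__":
    for size in [(2400, 1800), (1200, 1600), (1000, 3200)]:
        print(size, meets_newspaper_criteria(*size))
```

Keeping the rule in one place would have avoided the duplicated expression in `test_newspaper_detection` and `test_newspaper_processing`.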
testing/test_segmentation.py DELETED
@@ -1,238 +0,0 @@
1
- """
2
- Test script to validate the image segmentation approach for complex documents.
3
- Specifically focusing on improving OCR for the magician image which was previously
4
- identified as an image rather than containing text.
5
- """
6
-
7
- import os
8
- import tempfile
9
- import json
10
- import logging
11
- from pathlib import Path
12
- import time
13
- import sys
14
-
15
- # Add the parent directory to the path so we can import our modules
16
- sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
17
-
18
- # Configure logging
19
- logging.basicConfig(level=logging.INFO,
20
- format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
21
- logger = logging.getLogger(__name__)
22
-
23
- # Import our modules
24
- from image_segmentation import segment_image_for_ocr, process_segmented_image
25
- from structured_ocr import StructuredOCR
26
- from ocr_processing import process_file
27
-
28
- class MockStreamlit:
29
- """Mock Streamlit for testing without the UI"""
30
- def __init__(self):
31
- self.data = {}
32
-
33
- def cache_data(self, *args, **kwargs):
34
- def decorator(func):
35
- return func
36
- return decorator
37
-
38
- def empty(self):
39
- return self
40
-
41
- def setup(self):
42
- return self
43
-
44
- def update(self, progress, message):
45
- logger.info(f"Progress: {progress}%, {message}")
46
- return self
47
-
48
- def complete(self, success=True):
49
- logger.info(f"Completed with success={success}")
50
- return self
51
-
52
- def error(self, message):
53
- logger.error(message)
54
- return self
55
-
56
- def session_state():
57
- return {}
58
-
59
- # Mock for streamlit
60
- st = MockStreamlit()
61
- sys.modules['streamlit'] = st
62
-
63
- class FileUpload:
64
- """Mock file upload for testing"""
65
- def __init__(self, path):
66
- self.path = Path(path)
67
- self.name = self.path.name
68
-
69
- def getvalue(self):
70
- return self.path.read_bytes()
71
-
72
- class ProgressReporter:
73
- """Mock progress reporter for testing"""
74
- def __init__(self, placeholder=None):
75
- pass
76
-
77
- def setup(self):
78
- return self
79
-
80
- def update(self, progress, message):
81
- logger.info(f"Progress: {progress}%, {message}")
82
- return self
83
-
84
- def complete(self, success=True):
85
- logger.info(f"Completed with success={success}")
86
- return self
87
-
88
- def test_magician_segmentation():
89
- """Test image segmentation on the magician image"""
90
- # Setup output directory
91
- output_dir = Path("output") / "segmentation_test"
92
- output_dir.mkdir(parents=True, exist_ok=True)
93
-
94
- # Path to the magician image
95
- image_path = Path("input/magician-or-bottle-cungerer.jpg")
96
-
97
- # Ensure the file exists
98
- if not image_path.exists():
99
- logger.error(f"Error: File not found at {image_path}")
100
- return
101
-
102
- logger.info(f"Testing image segmentation on {image_path.name}")
103
-
104
- # First process without segmentation
105
- logger.info("Processing image WITHOUT segmentation")
106
- start_time = time.time()
107
-
108
- # Create a mock uploaded file
109
- uploaded_file = FileUpload(image_path)
110
-
111
- # Process without segmentation
112
- result_without_segmentation = process_file(
113
- uploaded_file,
114
- use_vision=True,
115
- preprocessing_options={"document_type": "newspaper"},
116
- progress_reporter=ProgressReporter(),
117
- use_segmentation=False
118
- )
119
-
120
- processing_time_without = time.time() - start_time
121
- logger.info(f"Processing without segmentation completed in {processing_time_without:.2f} seconds")
122
-
123
- # Save result without segmentation
124
- result_without_path = output_dir / "result_without_segmentation.json"
125
- with open(result_without_path, 'w') as f:
126
- json.dump(result_without_segmentation, f, indent=2)
127
-
128
- # Extract text (or lack thereof) from result
129
- text_without = ""
130
- if 'ocr_contents' in result_without_segmentation:
131
- if 'raw_text' in result_without_segmentation['ocr_contents']:
132
- text_without = result_without_segmentation['ocr_contents']['raw_text']
133
- elif 'content' in result_without_segmentation['ocr_contents']:
134
- text_without = result_without_segmentation['ocr_contents']['content']
135
-
136
- logger.info(f"Text extracted WITHOUT segmentation: {text_without}")
137
- logger.info(f"Text length WITHOUT segmentation: {len(text_without)}")
138
-
139
- # Then process with segmentation
140
- logger.info("Processing image WITH segmentation")
141
- start_time = time.time()
142
-
143
- # Process with segmentation
144
- result_with_segmentation = process_file(
145
- uploaded_file,
146
- use_vision=True,
147
- preprocessing_options={"document_type": "newspaper"},
148
- progress_reporter=ProgressReporter(),
149
- use_segmentation=True
150
- )
151
-
152
- processing_time_with = time.time() - start_time
153
- logger.info(f"Processing with segmentation completed in {processing_time_with:.2f} seconds")
154
-
155
- # Save result with segmentation
156
- result_with_path = output_dir / "result_with_segmentation.json"
157
- with open(result_with_path, 'w') as f:
158
- json.dump(result_with_segmentation, f, indent=2)
159
-
160
- # Extract text from result
161
- text_with = ""
162
- if 'ocr_contents' in result_with_segmentation:
163
- if 'raw_text' in result_with_segmentation['ocr_contents']:
164
- text_with = result_with_segmentation['ocr_contents']['raw_text']
165
- elif 'content' in result_with_segmentation['ocr_contents']:
166
- text_with = result_with_segmentation['ocr_contents']['content']
167
-
168
- logger.info(f"Text extracted WITH segmentation: {text_with}")
169
- logger.info(f"Text length WITH segmentation: {len(text_with)}")
170
-
171
- # Save the text to files for comparison
172
- with open(output_dir / "text_without_segmentation.txt", 'w') as f:
173
- f.write(text_without)
174
-
175
- with open(output_dir / "text_with_segmentation.txt", 'w') as f:
176
- f.write(text_with)
177
-
178
- # Create comparison report
179
- with open(output_dir / "comparison_report.md", 'w') as f:
180
- f.write("# Image Segmentation Test Report\n\n")
181
- f.write(f"## Comparison of OCR results for {image_path.name}\n\n")
182
-
183
- f.write("### Without Segmentation\n")
184
- f.write(f"- Processing time: {processing_time_without:.2f} seconds\n")
185
- f.write(f"- Text length: {len(text_without)} characters\n")
186
- f.write("- Text content:\n```\n")
187
- f.write(text_without[:500] + ("..." if len(text_without) > 500 else ""))
188
- f.write("\n```\n\n")
189
-
190
- f.write("### With Segmentation\n")
191
- f.write(f"- Processing time: {processing_time_with:.2f} seconds\n")
192
- f.write(f"- Text length: {len(text_with)} characters\n")
193
- f.write("- Text content:\n```\n")
194
- f.write(text_with[:500] + ("..." if len(text_with) > 500 else ""))
195
- f.write("\n```\n\n")
196
-
197
- # Calculate improvement
198
- char_diff = len(text_with) - len(text_without)
199
- improvement = f"{char_diff} more characters extracted" if char_diff > 0 else f"{-char_diff} fewer characters extracted"
200
- f.write(f"### Improvement\n")
201
- f.write(f"- Character count difference: {improvement}\n")
202
-
203
- # Add assessment
204
- f.write("\n### Assessment\n")
205
- if len(text_with) > len(text_without) * 1.5:
206
- f.write("**Significant improvement**: Segmentation greatly improved text extraction.\n")
207
- elif len(text_with) > len(text_without):
208
- f.write("**Moderate improvement**: Segmentation improved text extraction.\n")
209
- elif len(text_with) == len(text_without):
210
- f.write("**No change**: Segmentation did not affect text extraction.\n")
211
- else:
212
- f.write("**Degradation**: Segmentation negatively impacted text extraction.\n")
213
-
214
- logger.info(f"Comparison report created at {output_dir / 'comparison_report.md'}")
215
-
216
- # Also generate the segmentation visualization for documentation
217
- logger.info("Generating segmentation visualization")
218
- segmentation_results = process_segmented_image(image_path, output_dir)
219
-
220
- # Save the visualization results
221
- with open(output_dir / "segmentation_results.json", 'w') as f:
222
- # Convert any Path objects to strings for JSON serialization
223
- serializable_results = {}
224
- for key, value in segmentation_results.items():
225
- if isinstance(value, dict):
226
- serializable_results[key] = {k: str(v) if isinstance(v, Path) else v for k, v in value.items()}
227
- else:
228
- serializable_results[key] = str(value) if isinstance(value, Path) else value
229
-
230
- json.dump(serializable_results, f, indent=2)
231
-
232
- logger.info(f"All test results saved to {output_dir}")
233
- return output_dir
234
-
235
- if __name__ == "__main__":
236
- output_dir = test_magician_segmentation()
237
- logger.info(f"Test complete. Results in {output_dir}")
238
- print(f"Test complete. Results in {output_dir}")
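Note on the removed script: the assessment written to `comparison_report.md` depends only on the two text lengths. A minimal sketch of that rule as a pure function follows. The name `assess_segmentation` and the short return labels are assumptions for illustration, not repository code.

```python
def assess_segmentation(len_without: int, len_with: int) -> str:
    """Classify the effect of segmentation from extracted character counts.

    Mirrors the branch order of the removed report code; the function name
    and the short labels are hypothetical.
    """
    if len_with > len_without * 1.5:
        return "significant improvement"
    if len_with > len_without:
        return "moderate improvement"
    if len_with == len_without:
        return "no change"
    return "degradation"


if __name__ == "__main__":
    print(assess_segmentation(100, 200))
    print(assess_segmentation(100, 50))
```

Character count is only a rough proxy for OCR quality, so a longer output is not necessarily a more accurate one.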
testing/test_simple_improvements.py DELETED
@@ -1,175 +0,0 @@
1
- import sys
2
- import os
3
- import logging
4
- from pathlib import Path
5
-
6
- # Add parent directory to path to import local modules
7
- sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
8
-
9
- from utils import extract_subject_tags
10
- from preprocessing import apply_preprocessing_to_file
11
-
12
- # Configure logging
13
- logging.basicConfig(level=logging.INFO,
14
- format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
15
- logger = logging.getLogger("test_improvements")
16
-
17
- def test_preprocessing_fix():
18
- """Test that preprocessing is only applied when explicit options are selected"""
19
- print("\n--- TESTING PREPROCESSING FIX ---")
20
-
21
- # Path to test image (use absolute path from project root)
22
- test_image_path = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'input', 'americae-retectio.jpg')
23
-
24
- if not os.path.exists(test_image_path):
25
- print(f"Test file not found: {test_image_path}")
26
- return False
27
-
28
- # Read original file to compare sizes
29
- with open(test_image_path, 'rb') as f:
30
- original_bytes = f.read()
31
- original_size = len(original_bytes)
32
-
33
- print(f"Original file size: {original_size / 1024:.1f} KB")
34
-
35
- # Test case 1: Document type only - should NOT trigger preprocessing
36
- preprocessing_options = {
37
- "document_type": "printed", # Set document type
38
- "grayscale": False,
39
- "denoise": False,
40
- "contrast": 0,
41
- "rotation": 0
42
- }
43
-
44
- temp_files = []
45
- result_path, preprocessed = apply_preprocessing_to_file(
46
- original_bytes,
47
- '.jpg',
48
- preprocessing_options,
49
- temp_files
50
- )
51
-
52
- # Check if preprocessing was applied
53
- print(f"Test 1 (Document type only) - Preprocessing applied: {preprocessed}")
54
- if preprocessed:
55
- print("❌ FAIL: Preprocessing was applied when only document type was set")
56
- else:
57
- print("✅ PASS: Preprocessing was NOT applied when only document type was set")
58
-
59
- # Test case 2: With actual preprocessing options - SHOULD trigger preprocessing
60
- preprocessing_options = {
61
- "document_type": "printed",
62
- "grayscale": True, # Enable an actual preprocessing option
63
- "denoise": False,
64
- "contrast": 0,
65
- "rotation": 0
66
- }
67
-
68
- temp_files = []
69
- result_path, preprocessed = apply_preprocessing_to_file(
70
- original_bytes,
71
- '.jpg',
72
- preprocessing_options,
73
- temp_files
74
- )
75
-
76
- # Check if preprocessing was applied
77
- print(f"Test 2 (With grayscale option) - Preprocessing applied: {preprocessed}")
78
- if preprocessed:
79
- print("✅ PASS: Preprocessing WAS applied when grayscale option was enabled")
80
- else:
81
- print("❌ FAIL: Preprocessing was NOT applied when grayscale option was enabled")
82
-
83
- # Clean up temp files
84
- for path in temp_files:
85
- try:
86
- if os.path.exists(path):
87
- os.unlink(path)
88
- except:
89
- pass
90
-
91
- return True
92
-
93
- def test_historical_theme_detection():
94
- """Test the enhanced historical theme detection"""
95
- print("\n--- TESTING HISTORICAL THEME DETECTION ---")
96
-
97
- # Test case 1: Medieval historical text
98
- medieval_text = """
99
- In the 12th century, during the Crusades, the knights of the Holy Roman Empire traveled across
100
- feudal Europe. These medieval warriors sought adventure and glory in Byzantine lands, and many found
101
- themselves face to face with Islamic armies. The monasteries of the time kept detailed records of these
102
- campaigns, though many were lost during the great plague that devastated much of Europe.
103
- """
104
-
105
- # Extract themes with our enhanced algorithm
106
- themes = extract_subject_tags({}, medieval_text)
107
- print("\nTest 1 (Medieval text):")
108
- print(f"Extracted themes: {themes}")
109
-
110
- # Check if key medieval themes were detected
111
- medieval_keywords = ["Medieval", "Holy Roman Empire", "Crusades", "Byzantine"]
112
- detected = [theme for theme in themes if any(keyword in theme for keyword in medieval_keywords)]
113
-
114
- if detected:
115
- print(f"✅ PASS: Detected appropriate medieval themes: {detected}")
116
- else:
117
- print("❌ FAIL: Failed to detect appropriate medieval themes")
118
-
119
- # Test case 2: 19th century American history
120
- american_text = """
121
- Following the Civil War, the Reconstruction era marked a significant period in American history.
122
- In the late 19th century, westward expansion and manifest destiny drove settlers across the frontier.
123
- Native American communities faced displacement as the transcontinental railroad facilitated this massive
124
- migration. The industrial revolution transformed eastern cities while Victorian values shaped social norms.
125
- """
126
-
127
- # Extract themes with our enhanced algorithm
128
- themes = extract_subject_tags({}, american_text)
129
- print("\nTest 2 (19th century American text):")
130
- print(f"Extracted themes: {themes}")
131
-
132
- # Check if key 19th century American themes were detected
133
- american_keywords = ["19th Century", "American", "Civil War", "Victorian", "Native American",
134
- "Industrial Revolution"]
135
- detected = [theme for theme in themes if any(keyword in theme for keyword in american_keywords)]
136
-
137
- if detected:
138
- print(f"✅ PASS: Detected appropriate American history themes: {detected}")
139
- else:
140
- print("❌ FAIL: Failed to detect appropriate American history themes")
141
-
142
- # Test case 3: Maritime exploration
143
- maritime_text = """
144
- The ship's captain navigated through treacherous waters, relying on charts and naval instruments.
145
- The sailors manned the vessel while the admiral oversaw the maritime expedition. The voyage was one of
146
- exploration, as they sought new trade routes across uncharted seas. The port city they departed from
147
- was a hub of naval activity and shipbuilding.
148
- """
149
-
150
- # Extract themes with our enhanced algorithm
151
- themes = extract_subject_tags({}, maritime_text)
152
- print("\nTest 3 (Maritime exploration text):")
153
- print(f"Extracted themes: {themes}")
154
-
155
- # Check if key maritime themes were detected
156
- maritime_keywords = ["Maritime", "Naval", "Exploration", "Voyage", "Ship"]
157
- detected = [theme for theme in themes if any(keyword in theme for keyword in maritime_keywords)]
158
-
159
- if detected:
160
- print(f"✅ PASS: Detected appropriate maritime themes: {detected}")
161
- else:
162
- print("❌ FAIL: Failed to detect appropriate maritime themes")
163
-
164
- return True
165
-
166
- if __name__ == "__main__":
167
- print("Running simplified tests for Historical OCR improvements...\n")
168
-
169
- # Test preprocessing fix
170
- test_preprocessing_fix()
171
-
172
- # Test historical theme detection
173
- test_historical_theme_detection()
174
-
175
- print("\nTests completed!")
testing/test_text_as_image.py DELETED
@@ -1,200 +0,0 @@
1
- import sys
2
- import os
3
- import json
4
- import base64
5
- import logging
6
- from pathlib import Path
7
- import shutil
8
-
9
- # Add parent directory to path so we can import modules
10
- sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
11
-
12
- # Set up logging
13
- logging.basicConfig(level=logging.INFO,
14
- format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
15
- logger = logging.getLogger(__name__)
16
-
17
- # Import the functions we need to test
18
- from structured_ocr import serialize_ocr_response
19
-
20
- # Create a proper mock that actually passes isinstance checks
21
- # The issue is likely that our mock isn't being recognized as an OCRImageObject
22
- # First, patch the module to allow a custom class to be recognized
23
- import sys
24
- from types import SimpleNamespace
25
-
26
- # Create a namespace for mistralai models
27
- if 'mistralai.models' not in sys.modules:
28
- sys.modules['mistralai.models'] = SimpleNamespace()
29
-
30
- # Define OCRImageObject in that namespace
31
- class OCRImageObject:
32
- """Real mock of OCRImageObject for testing purposes"""
33
- def __init__(self, id, image_base64):
34
- self.id = id
35
- self.image_base64 = image_base64
36
-
37
- def __repr__(self):
38
- """String representation for debugging"""
39
- return f"OCRImageObject(id={self.id}, image_base64={self.image_base64[:20]}...)"
40
-
41
- # Add our class to the mistralai.models namespace
42
- sys.modules['mistralai.models'].OCRImageObject = OCRImageObject
43
-
44
- # Import to ensure validation logic will detect our mock as OCRImageObject
45
- from mistralai.models import OCRImageObject
46
-
47
- def test_magician_image():
48
- """Test the serialization with the magician input file"""
49
-
50
- print("Testing OCR processing with magician illustration file...")
51
-
52
- # Path to the magician image file
53
- input_dir = Path("input")
54
- magician_file = input_dir / "magician-or-bottle-cungerer.jpg"
55
-
56
- # Verify the file exists
57
- if not magician_file.exists():
58
- print(f"❌ ERROR: Magician illustration file not found at {magician_file}")
59
- return
60
-
61
- # Read the transcript data from OCR
62
- transcript_path = Path("testing/magician_ocr_text.txt")
63
- if not transcript_path.exists():
64
- print("⚠️ Warning: No OCR transcript found, creating minimal test data")
65
- transcript = """
66
- THE MAGICIAN OR BOTTLE CONJURER.
67
-
68
- This is a transcript that might be mistakenly classified as an image.
69
- It contains words like the and of and to which are common in English text.
70
- """
71
- else:
72
- with open(transcript_path, "r") as f:
73
- transcript = f.read()
74
-
75
- print(f"Using transcript with {len(transcript)} characters")
76
-
77
- print("\nStep 1: Testing text content identification directly (modified approach)...")
78
- # Instead of relying on the complex serialization, we'll test the specific issue directly
79
-
80
- # First, create a direct test for identifying text content in image fields
81
- def is_text_content(content):
82
- """Simplified version of text detection logic"""
83
- # Immediately return True for text content with clear text indicators
84
- if not isinstance(content, str):
85
- return False
86
-
87
- # Take a reasonable sample
88
- sample = content[:min(len(content), 1000)]
89
-
90
- # Quick checks for obvious text features
91
- has_spaces = ' ' in sample
92
- has_newlines = '\n' in sample
93
- has_punctuation = any(p in sample for p in ',.;:!?"\'()[]{}')
94
- has_sentences = False
95
-
96
- # Check for sentence-like structures (capital letters after periods)
97
- for i in range(len(sample) - 5):
98
- if sample[i] in '.!?\n' and i+2 < len(sample) and sample[i+1] == ' ' and sample[i+2].isupper():
99
- has_sentences = True
100
- break
101
-
102
- # Check for common words that indicate text content
103
- common_words = ['the', 'and', 'of', 'to', 'a', 'in', 'is', 'that', 'for', 'with']
104
- has_common_words = any(f" {word} " in f" {sample.lower()} " for word in common_words)
105
-
106
- # Count text indicators
107
- indicators = [has_spaces, has_newlines, has_punctuation, has_sentences, has_common_words]
108
- indicator_count = sum(1 for i in indicators if i)
109
-
110
- # For test output
111
- print(f"Text detection - spaces: {has_spaces}, newlines: {has_newlines}, punctuation: {has_punctuation}")
112
- print(f"Sentences: {has_sentences}, common words: {has_common_words}")
113
- print(f"Total indicators: {indicator_count}/5")
114
-
115
- # If at least 2 text indicators are found, it's likely text content
116
- return indicator_count >= 2
117
-
118
- # Apply the test
119
- text_test_result = is_text_content(transcript)
120
-
121
- if text_test_result:
122
- print("✅ DIRECT TEST: Transcript correctly identified as text content")
123
- else:
124
- print("❌ DIRECT TEST: Transcript incorrectly classified (not detected as text)")
125
-
126
- # Now proceed with the regular test
127
- mock_image_obj = OCRImageObject(
128
- id="img-0",
129
- image_base64=transcript
130
- )
131
-
132
- # Create a test object that has this "image" as a property
133
- test_obj = {
134
- "page": {
135
- "images": [mock_image_obj]
136
- }
137
- }
138
-
139
- # DIRECT WORKAROUND: Manually handle serialization for the test case
140
- # This simulates what the actual code should do, rather than relying on the full serializer
141
- custom_serialized = {
142
- "page": {
143
- "images": []
144
- }
145
- }
146
-
147
- # Apply our text detection function to determine how to serialize
148
- if is_text_content(mock_image_obj.image_base64):
149
- # If it's text, store as text
150
- custom_serialized["page"]["images"].append(mock_image_obj.image_base64)
151
- print("✅ CUSTOM SERIALIZATION: Correctly identified as text")
152
- else:
153
- # If not text, store as image object
154
- custom_serialized["page"]["images"].append({
155
- "id": mock_image_obj.id,
156
- "image_base64": mock_image_obj.image_base64
157
- })
158
- print("❌ CUSTOM SERIALIZATION: Not identified as text")
159
-
160
- # Verify our custom serialization worked correctly
161
- print("\nCustom serialization result type:", type(custom_serialized["page"]["images"][0]))
162
-
163
- # Now test with actual image data from the magician file
164
- try:
165
- # Read the image file
166
- with open(magician_file, "rb") as img_file:
167
- img_data = img_file.read()
168
- # Encode as base64
169
- img_base64 = base64.b64encode(img_data).decode('utf-8')
170
- valid_base64 = f"data:image/jpeg;base64,{img_base64}"
171
-
172
- # Create a mock OCR object with the real image
173
- mock_image_obj_valid = OCRImageObject(
174
- id="img-1",
175
- image_base64=valid_base64
176
- )
177
-
178
- test_obj_valid = {
179
- "page": {
180
- "images": [mock_image_obj_valid]
181
- }
182
- }
183
-
184
- serialized_valid = serialize_ocr_response(test_obj_valid)
185
-
186
- # Check that valid image data was processed correctly
187
- if (isinstance(serialized_valid["page"]["images"][0], dict) and
188
- "id" in serialized_valid["page"]["images"][0] and
189
- "image_base64" in serialized_valid["page"]["images"][0]):
190
- print("✅ SUCCESS: Valid magician image was correctly processed as an image")
191
- else:
192
- print("❌ FAILED: Valid magician image was incorrectly processed")
193
- print(f"Value: {serialized_valid['page']['images'][0]}")
194
- except Exception as e:
195
- print(f"❌ ERROR processing magician image: {str(e)}")
196
-
197
- print("\nTest complete.")
198
-
199
- if __name__ == "__main__":
200
- test_magician_image()
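Note on the removed script: the nested `is_text_content` helper mixes detection with printing. A simplified sketch of the same five-indicator heuristic without the debug output follows. The name `count_text_indicators` and the `threshold` parameter are assumptions for illustration. The sentence check here scans to the end of the sample, whereas the removed loop stopped five characters early.

```python
COMMON_WORDS = ("the", "and", "of", "to", "a", "in", "is", "that", "for", "with")


def count_text_indicators(content: str, sample_size: int = 1000) -> int:
    """Count how many of five plain-text indicators appear in a sample."""
    sample = content[:sample_size]
    has_spaces = " " in sample
    has_newlines = "\n" in sample
    has_punctuation = any(p in sample for p in ",.;:!?\"'()[]{}")
    # Sentence-like structure: terminator, then a space, then a capital letter
    has_sentences = any(
        sample[i] in ".!?\n" and sample[i + 1] == " " and sample[i + 2].isupper()
        for i in range(len(sample) - 2)
    )
    padded = f" {sample.lower()} "
    has_common_words = any(f" {word} " in padded for word in COMMON_WORDS)
    return sum(
        [has_spaces, has_newlines, has_punctuation, has_sentences, has_common_words]
    )


def is_text_content(content, threshold: int = 2) -> bool:
    """Treat content as text when at least `threshold` indicators are present."""
    if not isinstance(content, str):
        return False
    return count_text_indicators(content) >= threshold


if __name__ == "__main__":
    print(is_text_content("THE MAGICIAN OR BOTTLE CONJURER.\nThis is a transcript."))
    print(is_text_content("data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD"))
```

A base64 data URL trips only the punctuation indicator, which is why the threshold of two from the removed test separates the two cases.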
ui/custom.css CHANGED
@@ -13,7 +13,7 @@ h1, h2, h3, h4, h5, h6 {
13
  color: #1E3A8A;
14
  }
15
 
16
- /* Document content styling */
17
  .document-content {
18
  margin-top: 12px;
19
  }
@@ -26,48 +26,25 @@ h1, h2, h3, h4, h5, h6 {
26
  border: 1px solid #e0e0e0;
27
  }
28
 
 
29
  .document-section h4 {
30
  margin-top: 0;
31
  margin-bottom: 10px;
32
- color: #1E3A8A;
33
  }
34
 
35
- /* Subject tag styling */
 
36
  .subject-tag {
 
37
  display: inline-block;
38
- padding: 3px 8px;
39
- border-radius: 12px;
40
- font-size: 0.85em;
41
  margin-right: 5px;
42
  margin-bottom: 5px;
43
- color: white;
44
- }
45
-
46
- .tag-time-period {
47
- background-color: #1565c0;
48
  }
49
 
50
- .tag-language {
51
- background-color: #00695c;
52
- }
53
-
54
- .tag-document-type {
55
- background-color: #6a1b9a;
56
- }
57
-
58
- .tag-subject {
59
- background-color: #2e7d32;
60
- }
61
-
62
- .tag-preprocessing {
63
- background-color: #e65100;
64
- }
65
-
66
- .tag-default {
67
- background-color: #546e7a;
68
- }
69
 
70
- /* Image and text side-by-side styling */
71
  .image-text-container {
72
  display: flex;
73
  gap: 20px;
@@ -80,6 +57,7 @@ h1, h2, h3, h4, h5, h6 {
80
 
81
  .text-container {
82
  flex: 1;
 
83
  }
84
 
85
  /* Sidebar styling */
 
13
  color: #1E3A8A;
14
  }
15
 
16
+ /* Document content styling - with lower specificity to allow layout.py to override text formatting */
17
  .document-content {
18
  margin-top: 12px;
19
  }
 
26
  border: 1px solid #e0e0e0;
27
  }
28
 
29
+ /* Preserve headings style while allowing font to be overridden */
30
  .document-section h4 {
31
  margin-top: 0;
32
  margin-bottom: 10px;
33
+ /* color moved to layout.py */
34
  }
35
 
36
+ /* Subject tag styling - lower priority than layout.py versions */
37
+ /* These styles will be overridden by the more specific selectors in layout.py */
38
  .subject-tag {
39
+ /* Basic sizing only - styling comes from layout.py */
40
  display: inline-block;
41
  margin-right: 5px;
42
  margin-bottom: 5px;
43
  }
44
 
45
+ /* Tag colors moved to layout.py with !important rules */
46
 
47
+ /* Image and text side-by-side styling - layout only */
48
  .image-text-container {
49
  display: flex;
50
  gap: 20px;
 
57
 
58
  .text-container {
59
  flex: 1;
60
+ /* Text styling will come from layout.py */
61
  }
62
 
63
  /* Sidebar styling */
ui/layout.py CHANGED
@@ -7,11 +7,13 @@ def load_css():
7
  /* Global styles - clean, modern approach with consistent line height */
8
  :root {
9
  --standard-line-height: 1.5;
 
 
10
  }
11
 
12
  body {
13
- font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
14
- color: #111827;
15
  line-height: var(--standard-line-height);
16
  }
17
 
@@ -56,6 +58,11 @@ def load_css():
56
  line-height: 1.3 !important; /* Slightly increased for headings but still compact */
57
  }
58
 
59
  /* Simple section headers with subtle styling */
60
  .block-container [data-testid="column"] h4 {
61
  font-size: 0.95rem !important;
@@ -71,21 +78,6 @@ def load_css():
71
  margin-bottom: 0.2rem !important;
72
  }
73
 
74
- /* OCR text container with improved contrast and styling */
75
- .ocr-text-container {
76
- font-family: 'Inter', system-ui, sans-serif;
77
- font-size: 0.95rem;
78
- line-height: var(--standard-line-height); /* Consistent line height */
79
- color: #111827;
80
- margin-bottom: 0.4rem;
81
- max-height: 600px;
82
- overflow-y: auto;
83
- background-color: transparent;
84
- padding: 6px 10px;
85
- border-radius: 4px;
86
- border: 1px solid #e2e8f0;
87
- }
88
-
89
  /* Custom scrollbar styling */
90
  .ocr-text-container::-webkit-scrollbar {
91
  width: 6px;
@@ -160,22 +152,64 @@ def load_css():
160
  margin-bottom: 0.4rem !important;
161
  }
162
 
163
- /* Compact tag styling */
 
 
 
164
  .subject-tag {
165
- display: inline-block;
166
- padding: 0.1rem 0.4rem;
167
- border-radius: 3px;
168
- font-size: 0.7rem;
169
- margin: 0 0.2rem 0.2rem 0;
170
- background-color: #f3f4f6;
171
- color: #374151;
172
- border: 1px solid #e5e7eb;
 
173
  }
174
 
175
- .tag-time-period { color: #1e40af; background-color: #eff6ff; border-color: #bfdbfe; }
176
- .tag-language { color: #065f46; background-color: #ecfdf5; border-color: #a7f3d0; }
177
- .tag-document-type { color: #5b21b6; background-color: #f5f3ff; border-color: #ddd6fe; }
178
- .tag-subject { color: #166534; background-color: #f0fdf4; border-color: #bbf7d0; }
179
 
180
  /* Clean text area */
181
  .stTextArea textarea {
 
7
  /* Global styles - clean, modern approach with consistent line height */
8
  :root {
9
  --standard-line-height: 1.5;
10
+ --standard-font: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
11
+ --standard-color: #111827;
12
  }
13
 
14
  body {
15
+ font-family: var(--standard-font);
16
+ color: var(--standard-color);
17
  line-height: var(--standard-line-height);
18
  }
19
 
 
58
  line-height: 1.3 !important; /* Slightly increased for headings but still compact */
59
  }
60
 
61
+ /* Make h1 headings significantly smaller */
62
+ h1 {
63
+ font-size: 1.3em !important; /* Reduced from default ~2em */
64
+ }
65
+
66
  /* Simple section headers with subtle styling */
67
  .block-container [data-testid="column"] h4 {
68
  font-size: 0.95rem !important;
 
78
  margin-bottom: 0.2rem !important;
79
  }
80
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
81
  /* Custom scrollbar styling */
82
  .ocr-text-container::-webkit-scrollbar {
83
  width: 6px;
 
152
  margin-bottom: 0.4rem !important;
153
  }
154
 
155
+ /* Compact tag styling - with higher specificity to override custom.css */
156
+ .document-content .subject-tag,
157
+ div[data-testid="stHorizontalBlock"] .subject-tag,
158
+ div[data-testid="stVerticalBlock"] .subject-tag,
159
  .subject-tag {
160
+ display: inline-block !important;
161
+ padding: 0.1rem 0.4rem !important;
162
+ border-radius: 3px !important;
163
+ font-size: 0.7rem !important;
164
+ margin: 0 0.2rem 0.2rem 0 !important;
165
+ background-color: #f3f4f6 !important;
166
+ color: #374151 !important;
167
+ border: 1px solid #e5e7eb !important;
168
+ font-family: var(--standard-font) !important;
169
  }
170
 
171
+ /* Tag color overrides with higher specificity */
172
+ .document-content .tag-time-period,
173
+ .tag-time-period { color: #1e40af !important; background-color: #eff6ff !important; border-color: #bfdbfe !important; }
174
+
175
+ .document-content .tag-language,
176
+ .tag-language { color: #065f46 !important; background-color: #ecfdf5 !important; border-color: #a7f3d0 !important; }
177
+
178
+ .document-content .tag-document-type,
179
+ .tag-document-type { color: #5b21b6 !important; background-color: #f5f3ff !important; border-color: #ddd6fe !important; }
180
+
181
+ .document-content .tag-subject,
182
+ .tag-subject { color: #166534 !important; background-color: #f0fdf4 !important; border-color: #bbf7d0 !important; }
183
+
184
+ .document-content .tag-download,
185
+ .tag-download {
186
+ color: #1e40af !important;
187
+ background-color: #dbeafe !important;
188
+ border-color: #93c5fd !important;
189
+ text-decoration: none !important;
190
+ cursor: pointer !important;
191
+ transition: all 0.2s ease !important;
192
+ }
193
+
194
+ .document-content .tag-download:hover,
195
+ .tag-download:hover {
196
+ background-color: #93c5fd !important; /* Darker blue on hover */
197
+ border-color: #3b82f6 !important; /* Darker border */
198
+ color: #1e3a8a !important; /* Darker text */
199
+ box-shadow: 0 2px 4px rgba(0,0,0,0.1) !important; /* More pronounced shadow */
200
+ }
201
+
202
+ /* For any default tags that might use the old styling */
203
+ .document-content .tag-default,
204
+ .tag-default { color: #374151 !important; background-color: #f3f4f6 !important; border-color: #e5e7eb !important; }
205
+
206
+ /* Document content styling to ensure consistency */
207
+ .document-content,
208
+ .document-section {
209
+ font-family: var(--standard-font) !important;
210
+ line-height: var(--standard-line-height) !important;
211
+ color: var(--standard-color) !important;
212
+ }
213
 
214
  /* Clean text area */
215
  .stTextArea textarea {
ui_components.py CHANGED
@@ -31,13 +31,11 @@ from constants import (
31
  PREPROCESSING_DOC_TYPES,
32
  ROTATION_OPTIONS
33
  )
34
- from utils.image_utils import format_ocr_text
35
  from utils.content_utils import (
36
  classify_document_content,
37
  extract_document_text,
38
- extract_image_description,
39
- clean_raw_text,
40
- format_markdown_text
41
  )
42
  from utils.ui_utils import display_results
43
  from preprocessing import preprocess_image
@@ -155,15 +153,15 @@ def create_sidebar_options():
155
  use_segmentation = False
156
 
157
  # Create preprocessing options dictionary
158
- # Set document_type based on selection in UI
159
  doc_type_for_preprocessing = "standard"
160
  if "Handwritten" in doc_type:
161
  doc_type_for_preprocessing = "handwritten"
162
  elif "Newspaper" in doc_type or "Magazine" in doc_type:
163
  doc_type_for_preprocessing = "newspaper"
164
  elif "Book" in doc_type or "Publication" in doc_type:
165
- doc_type_for_preprocessing = "printed"
166
-
167
  preprocessing_options = {
168
  "document_type": doc_type_for_preprocessing,
169
  "grayscale": grayscale,
@@ -325,10 +323,8 @@ def display_document_with_images(result):
325
  def display_previous_results():
326
  """Display previous results tab content in a simplified, structured view"""
327
 
328
- # Use a clean header with the download button directly next to it
329
- col1, col2 = st.columns([3, 1])
330
- with col1:
331
- st.header("Previous Results")
332
 
333
  # Display previous results if available
334
  if not st.session_state.previous_results:
@@ -340,27 +336,28 @@ def display_previous_results():
340
  </div>
341
  """, unsafe_allow_html=True)
342
  else:
343
- # Add download button in the second column next to the header
344
- with col2:
345
- try:
346
- # Create download button for all results
347
- from utils.image_utils import create_results_zip_in_memory
348
- zip_data = create_results_zip_in_memory(st.session_state.previous_results)
349
- timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
350
-
351
- # Simplified filename
352
- zip_filename = f"ocr_results_{timestamp}.zip"
353
-
354
- st.download_button(
355
- label="Download All",
356
- data=zip_data,
357
- file_name=zip_filename,
358
- mime="application/zip",
359
- help="Download all results as ZIP"
360
- )
361
- except Exception:
362
- # Silent fail - no error message to keep UI clean
363
- pass
 
364
 
365
  # Create a cleaner, more minimal grid for results using Streamlit columns
366
  # Calculate number of columns based on screen width - more responsive
@@ -474,7 +471,7 @@ def display_previous_results():
474
  st.markdown(f"##### {section.replace('_', ' ').title()}")
475
 
476
  # Format and display content
477
- formatted_content = format_ocr_text(content)
478
  st.markdown(formatted_content)
479
  displayed_sections.add(section)
480
 
@@ -486,7 +483,7 @@ def display_previous_results():
486
  st.markdown(f"##### {section.replace('_', ' ').title()}")
487
 
488
  if isinstance(content, str):
489
- st.markdown(format_ocr_text(content))
490
  elif isinstance(content, list):
491
  for item in content:
492
  st.markdown(f"- {item}")
@@ -550,7 +547,6 @@ def display_previous_results():
550
  with st.expander(f"Page {i+1} Text", expanded=False):
551
  st.text(page_text)
552
 
553
-
554
  def display_about_tab():
555
  """Display learn more tab content"""
556
  st.header("Learn More")
 
31
  PREPROCESSING_DOC_TYPES,
32
  ROTATION_OPTIONS
33
  )
34
+ from utils.text_utils import format_ocr_text, clean_raw_text, format_markdown_text # Import from text_utils
35
  from utils.content_utils import (
36
  classify_document_content,
37
  extract_document_text,
38
+ extract_image_description
 
 
39
  )
40
  from utils.ui_utils import display_results
41
  from preprocessing import preprocess_image
 
153
  use_segmentation = False
154
 
155
  # Create preprocessing options dictionary
156
+ # Map UI document types to preprocessing document types
157
  doc_type_for_preprocessing = "standard"
158
  if "Handwritten" in doc_type:
159
  doc_type_for_preprocessing = "handwritten"
160
  elif "Newspaper" in doc_type or "Magazine" in doc_type:
161
  doc_type_for_preprocessing = "newspaper"
162
  elif "Book" in doc_type or "Publication" in doc_type:
163
+ doc_type_for_preprocessing = "book" # Match the actual preprocessing type
164
+
165
  preprocessing_options = {
166
  "document_type": doc_type_for_preprocessing,
167
  "grayscale": grayscale,
 
323
  def display_previous_results():
324
  """Display previous results tab content in a simplified, structured view"""
325
 
326
+ # Use a simple header without the button column
327
+ st.header("Previous Results")
 
 
328
 
329
  # Display previous results if available
330
  if not st.session_state.previous_results:
 
336
  </div>
337
  """, unsafe_allow_html=True)
338
  else:
339
+ # Prepare zip download outside of the UI flow
340
+ try:
341
+ # Create download button for all results
342
+ from utils.image_utils import create_results_zip_in_memory
343
+ zip_data = create_results_zip_in_memory(st.session_state.previous_results)
344
+ timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
345
+
346
+ # Simplified filename
347
+ zip_filename = f"ocr_results_{timestamp}.zip"
348
+
349
+ # Encode the zip data for direct download link
350
+ zip_b64 = base64.b64encode(zip_data).decode()
351
+
352
+ # Add styled download tag in the metadata section
353
+ download_html = '<div style="display: flex; align-items: center; margin: 0.5rem 0; flex-wrap: wrap;">'
354
+ download_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Download:</div>'
355
+ download_html += f'<a href="data:application/zip;base64,{zip_b64}" download="{zip_filename}" class="subject-tag tag-download">All Results</a>'
356
+ download_html += '</div>'
357
+ st.markdown(download_html, unsafe_allow_html=True)
358
+ except Exception:
359
+ # Silent fail - no error message to keep UI clean
360
+ pass
361
 
362
  # Create a cleaner, more minimal grid for results using Streamlit columns
363
  # Calculate number of columns based on screen width - more responsive
 
471
  st.markdown(f"##### {section.replace('_', ' ').title()}")
472
 
473
  # Format and display content
474
+ formatted_content = format_ocr_text(content, for_display=True)
475
  st.markdown(formatted_content)
476
  displayed_sections.add(section)
477
 
 
483
  st.markdown(f"##### {section.replace('_', ' ').title()}")
484
 
485
  if isinstance(content, str):
486
+ st.markdown(format_ocr_text(content, for_display=True))
487
  elif isinstance(content, list):
488
  for item in content:
489
  st.markdown(f"- {item}")
 
547
  with st.expander(f"Page {i+1} Text", expanded=False):
548
  st.text(page_text)
549
 
 
550
  def display_about_tab():
551
  """Display learn more tab content"""
552
  st.header("Learn More")
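
Note on the `ui_components.py` change above: the "Download All" button is replaced by a styled `<a>` tag that embeds the whole archive in the page as a base64 `data:` URI. The sketch below shows that technique with the standard library only. `build_zip_bytes` and `build_download_link` are hypothetical helpers, not functions from this repository, and the HTML escaping is an addition that the diff does not have.

```python
import base64
import html
import io
import zipfile


def build_zip_bytes(files: dict[str, str]) -> bytes:
    """Write {archive_name: text} entries into an in-memory ZIP and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, text in files.items():
            archive.writestr(name, text)
    return buffer.getvalue()


def build_download_link(zip_bytes: bytes, filename: str, label: str = "All Results") -> str:
    """Return an <a> tag whose href carries the ZIP as a base64 data URI."""
    payload = base64.b64encode(zip_bytes).decode("ascii")
    # Escape values that end up inside HTML attributes or text (not done in the diff).
    safe_name = html.escape(filename, quote=True)
    safe_label = html.escape(label)
    return (
        f'<a href="data:application/zip;base64,{payload}" '
        f'download="{safe_name}" class="subject-tag tag-download">{safe_label}</a>'
    )


# Hypothetical sample data, only to exercise the two helpers.
zip_bytes = build_zip_bytes({"result.txt": "sample OCR text"})
link = build_download_link(zip_bytes, "ocr_results_20250430_120000.zip")
```

Base64 inflates the payload by roughly a third and the link is rebuilt on every rerun, so this approach probably suits small result sets best. Also worth checking: the import hunk shown for this file does not add `import base64`. If the module does not already import it elsewhere, the surrounding `except Exception: pass` would swallow the resulting `NameError` and the link would silently never render.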
utils.py CHANGED
@@ -103,8 +103,17 @@ def timing(description):
103
 
104
  return TimingContext(description)
105
 
106
- def format_timestamp(timestamp=None):
107
- """Format timestamp for display"""
 
 
 
 
 
 
 
 
 
108
  if timestamp is None:
109
  timestamp = datetime.now()
110
  elif isinstance(timestamp, str):
@@ -113,7 +122,12 @@ def format_timestamp(timestamp=None):
113
  except ValueError:
114
  timestamp = datetime.now()
115
 
116
- return timestamp.strftime("%Y-%m-%d %H:%M")
 
 
 
 
 
117
 
118
  def generate_cache_key(file_bytes, file_type, use_vision, preprocessing_options=None, pdf_rotation=0, custom_prompt=None):
119
  """
@@ -175,7 +189,7 @@ def handle_temp_files(temp_file_paths):
175
 
176
  def create_descriptive_filename(original_filename, result, file_ext, preprocessing_options=None):
177
  """
178
- Create a descriptive filename for the result
179
 
180
  Args:
181
  original_filename: Original filename
@@ -184,30 +198,53 @@ def create_descriptive_filename(original_filename, result, file_ext, preprocessi
184
  preprocessing_options: Dictionary of preprocessing options
185
 
186
  Returns:
187
- str: Descriptive filename
188
  """
189
- # Get base name without extension
 
 
190
  original_name = Path(original_filename).stem
191
 
192
- # Add document type to filename if detected
193
- doc_type_tag = ""
194
- if 'detected_document_type' in result:
195
- doc_type = result['detected_document_type'].lower()
196
- doc_type_tag = f"_{doc_type.replace(' ', '_')}"
 
 
 
 
 
 
197
  elif 'topics' in result and result['topics']:
198
- # Use first tag as document type if not explicitly detected
199
- doc_type_tag = f"_{result['topics'][0].lower().replace(' ', '_')}"
200
 
201
- # Add period tag for historical context if available
202
- period_tag = ""
203
  if 'topics' in result and result['topics']:
204
  for tag in result['topics']:
205
  if "century" in tag.lower() or "pre-" in tag.lower() or "era" in tag.lower():
206
- period_tag = f"_{tag.lower().replace(' ', '_')}"
207
  break
208
 
209
- # Generate final descriptive filename
210
- descriptive_name = f"{original_name}{doc_type_tag}{period_tag}{file_ext}"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
211
  return descriptive_name
212
 
213
  def extract_subject_tags(result, raw_text, preprocessing_options=None):
 
103
 
104
  return TimingContext(description)
105
 
106
+ def format_timestamp(timestamp=None, for_filename=False):
107
+ """
108
+ Format timestamp for display or filenames
109
+
110
+ Args:
111
+ timestamp: Datetime object or string to format (defaults to current time)
112
+ for_filename: Whether to format for use in a filename (defaults to False)
113
+
114
+ Returns:
115
+ str: Formatted timestamp
116
+ """
117
  if timestamp is None:
118
  timestamp = datetime.now()
119
  elif isinstance(timestamp, str):
 
122
  except ValueError:
123
  timestamp = datetime.now()
124
 
125
+ if for_filename:
126
+ # Format suitable for filenames: "Apr 30, 2025"
127
+ return timestamp.strftime("%b %d, %Y")
128
+ else:
129
+ # Standard format for display
130
+ return timestamp.strftime("%Y-%m-%d %H:%M")
131
 
132
  def generate_cache_key(file_bytes, file_type, use_vision, preprocessing_options=None, pdf_rotation=0, custom_prompt=None):
133
  """
 
189
 
190
  def create_descriptive_filename(original_filename, result, file_ext, preprocessing_options=None):
191
  """
192
+ Create a user-friendly descriptive filename for the result
193
 
194
  Args:
195
  original_filename: Original filename
 
198
  preprocessing_options: Dictionary of preprocessing options
199
 
200
  Returns:
201
+ str: Human-readable descriptive filename
202
  """
203
+ from datetime import datetime
204
+
205
+ # Get base name without extension and capitalize words
206
  original_name = Path(original_filename).stem
207
 
208
+ # Make the original name more readable by replacing dashes and underscores with spaces
209
+ # Then capitalize each word
210
+ readable_name = original_name.replace('-', ' ').replace('_', ' ')
211
+ # Split by spaces and capitalize each word, then rejoin
212
+ name_parts = readable_name.split()
213
+ readable_name = ' '.join(word.capitalize() for word in name_parts)
214
+
215
+ # Determine document type
216
+ doc_type = None
217
+ if 'detected_document_type' in result and result['detected_document_type']:
218
+ doc_type = result['detected_document_type'].capitalize()
219
  elif 'topics' in result and result['topics']:
220
+ # Use first topic as document type if not explicitly detected
221
+ doc_type = result['topics'][0]
222
 
223
+ # Find period/era information
224
+ period_info = None
225
  if 'topics' in result and result['topics']:
226
  for tag in result['topics']:
227
  if "century" in tag.lower() or "pre-" in tag.lower() or "era" in tag.lower():
228
+ period_info = tag
229
  break
230
 
231
+ # Format metadata within parentheses if available
232
+ metadata = []
233
+ if doc_type:
234
+ metadata.append(doc_type)
235
+ if period_info:
236
+ metadata.append(period_info)
237
+
238
+ metadata_str = ""
239
+ if metadata:
240
+ metadata_str = f" ({', '.join(metadata)})"
241
+
242
+ # Add current date for uniqueness and sorting
243
+ current_date = format_timestamp(for_filename=True)
244
+ date_str = f" - {current_date}"
245
+
246
+ # Generate final user-friendly filename
247
+ descriptive_name = f"{readable_name}{metadata_str}{date_str}{file_ext}"
248
  return descriptive_name
249
 
250
  def extract_subject_tags(result, raw_text, preprocessing_options=None):
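
Note on the `utils.py` change above: the new filename scheme is easier to judge with a concrete input. The following is a condensed sketch of the logic in the added lines, not the repository code itself. The function name `describe` and its `now` parameter are hypothetical (the parameter is added only so the output is reproducible), and string timestamps are left out because their parsing is not visible in this hunk.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def format_timestamp(timestamp: Optional[datetime] = None, for_filename: bool = False) -> str:
    """Mirror of the two output formats shown in the diff (datetime input only)."""
    if timestamp is None:
        timestamp = datetime.now()
    if for_filename:
        return timestamp.strftime("%b %d, %Y")  # e.g. "Apr 30, 2025" in the default locale
    return timestamp.strftime("%Y-%m-%d %H:%M")


def describe(original_filename: str, result: dict, file_ext: str,
             now: Optional[datetime] = None) -> str:
    """Condensed version of the new create_descriptive_filename logic."""
    # "magellan_letter-1520" -> "Magellan Letter 1520"
    stem = Path(original_filename).stem.replace("-", " ").replace("_", " ")
    readable = " ".join(word.capitalize() for word in stem.split())

    topics = result.get("topics") or []
    doc_type = None
    if result.get("detected_document_type"):
        doc_type = result["detected_document_type"].capitalize()
    elif topics:
        doc_type = topics[0]  # fall back to the first topic

    # Same substring test as the diff: first topic mentioning a period or era.
    period = next(
        (t for t in topics if any(k in t.lower() for k in ("century", "pre-", "era"))),
        None,
    )

    metadata = [m for m in (doc_type, period) if m]
    metadata_str = f" ({', '.join(metadata)})" if metadata else ""
    return f"{readable}{metadata_str} - {format_timestamp(now, for_filename=True)}{file_ext}"


fixed = datetime(2025, 4, 30)
name = describe(
    "magellan_letter-1520.jpg",
    {"detected_document_type": "letter", "topics": ["Exploration", "16th Century"]},
    ".json",
    now=fixed,
)
# name == "Magellan Letter 1520 (Letter, 16th Century) - Apr 30, 2025.json"

# The substring check for "era" also matches unrelated words such as "Literature".
odd = describe("poem.png", {"topics": ["Literature"]}, ".txt", now=fixed)
# odd == "Poem (Literature, Literature) - Apr 30, 2025.txt"
```

Two things stand out when running it. The date has day granularity and a month-name prefix, so names created on the same day can collide and do not sort chronologically. And the `"era"` substring test can label a non-period topic as the period, which then appears twice in the parentheses.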
utils/__init__.py ADDED
@@ -0,0 +1,47 @@
1
+ """
2
+ Utility functions for historical OCR processing.
3
+ """
4
+ # Re-export image utilities
5
+ from utils.image_utils import replace_images_in_markdown, get_combined_markdown, detect_skew, clean_ocr_result
6
+
7
+ # Import general utilities from the new module
8
+ from utils.general_utils import (
9
+ generate_cache_key,
10
+ timing,
11
+ format_timestamp,
12
+ create_descriptive_filename,
13
+ extract_subject_tags
14
+ )
15
+
16
+ # Import file utilities
17
+ from utils.file_utils import (
18
+ get_base64_from_image,
19
+ get_base64_from_bytes,
20
+ handle_temp_files
21
+ )
22
+
23
+ # Import UI utilities
24
+ from utils.ui_utils import display_results
25
+
26
+ __all__ = [
27
+ # Image utilities
28
+ 'replace_images_in_markdown',
29
+ 'get_combined_markdown',
30
+ 'detect_skew',
31
+ 'clean_ocr_result',
32
+
33
+ # General utilities
34
+ 'generate_cache_key',
35
+ 'timing',
36
+ 'format_timestamp',
37
+ 'create_descriptive_filename',
38
+ 'extract_subject_tags',
39
+
40
+ # File utilities
41
+ 'get_base64_from_image',
42
+ 'get_base64_from_bytes',
43
+ 'handle_temp_files',
44
+
45
+ # UI utilities
46
+ 'display_results'
47
+ ]
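
Note on the new `utils/__init__.py`: the file list for this commit shows a top-level `utils.py` being changed while a `utils/` package is added next to it. Assuming both sit in the same directory, `import utils` resolves to the package, so the edits to `utils.py` would only take effect if something loads that file by another route. The sketch below reproduces the situation in a temporary directory; the module name `shadow_demo` is made up for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def resolve_shadowed_import(name: str = "shadow_demo") -> str:
    """Create <name>.py and <name>/__init__.py side by side and report which one is imported."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / f"{name}.py").write_text("KIND = 'module'\n")
        (root / name).mkdir()
        (root / name / "__init__.py").write_text("KIND = 'package'\n")

        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()  # make sure the finder sees the new files
            module = importlib.import_module(name)
            return module.KIND
        finally:
            # Leave the interpreter state as we found it.
            sys.path.remove(tmp)
            sys.modules.pop(name, None)


winner = resolve_shadowed_import()  # 'package'
```

If the duplication is intentional during a migration, it may be worth stating that in the commit message; otherwise the identical changes to `utils.py` and `utils/general_utils.py` in this commit look like candidates for consolidation.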
utils/content_utils.py CHANGED
@@ -80,99 +80,13 @@ def format_structured_data(content):
80
  if not content:
81
  return ""
82
 
83
- # If it's already a string, look for patterns that appear to be Python/JSON representations
 
84
  if isinstance(content, str):
85
- # Look for lists like ['item1', 'item2', 'item3']
86
- list_pattern = r"(\[([^\[\]]*)\])"
87
- dict_pattern = r"(\{([^\{\}]*)\})"
88
-
89
- # First handle lists - ['item1', 'item2']
90
- def replace_list(match):
91
- try:
92
- # Try to parse the match as a Python list
93
- list_str = match.group(1)
94
-
95
- # Quick check for empty list
96
- if list_str == "[]":
97
- return ""
98
-
99
- # Safe evaluation of list-like string
100
- try:
101
- items = ast.literal_eval(list_str)
102
- if isinstance(items, list):
103
- # Convert to markdown bullet points
104
- return "\n" + "\n".join([f"- {item}" for item in items])
105
- else:
106
- return list_str # Not a list, return unchanged
107
- except (SyntaxError, ValueError):
108
- # Try a simpler regex-based approach for common formats
109
- # Handle simple comma-separated lists
110
- items = re.findall(r"'([^']*)'|\"([^\"]*)\"", list_str)
111
- if items:
112
- # Extract the matched groups and handle both single and double quotes
113
- clean_items = [item[0] if item[0] else item[1] for item in items]
114
- return "\n" + "\n".join([f"- {item}" for item in clean_items])
115
- return list_str # Couldn't parse, return unchanged
116
- except Exception:
117
- return match.group(0) # Return the original text if any error
118
-
119
- # Handle dictionaries or structured fields like {key: value, key2: value2}
120
- def replace_dict(match):
121
- try:
122
- dict_str = match.group(1)
123
-
124
- # Quick check for empty dict
125
- if dict_str == "{}":
126
- return ""
127
-
128
- # First try to parse as a Python dict
129
- try:
130
- data_dict = ast.literal_eval(dict_str)
131
- if isinstance(data_dict, dict):
132
- return "\n" + "\n".join([f"**{k}**: {v}" for k, v in data_dict.items()])
133
- except (SyntaxError, ValueError):
134
- # If that fails, use regex to extract key-value pairs
135
- pairs = re.findall(r"'([^']*)':\s*'([^']*)'|\"([^\"]*)\":\s*\"([^\"]*)\"", dict_str)
136
- if pairs:
137
- formatted_pairs = []
138
- for pair in pairs:
139
- if pair[0] and pair[1]: # Single quotes
140
- formatted_pairs.append(f"**{pair[0]}**: {pair[1]}")
141
- elif pair[2] and pair[3]: # Double quotes
142
- formatted_pairs.append(f"**{pair[2]}**: {pair[3]}")
143
- return "\n" + "\n".join(formatted_pairs)
144
- return dict_str # Return original if couldn't parse
145
- except Exception:
146
- return match.group(0) # Return original text if any error
147
-
148
- # Check for keys with array values (common in OCR output)
149
- key_array_pattern = r"([a-zA-Z_]+):\s*(\[.*?\])"
150
-
151
- def replace_key_array(match):
152
- try:
153
- key = match.group(1)
154
- array_str = match.group(2)
155
-
156
- # Process the array part with our list replacer
157
- formatted_array = replace_list(re.match(list_pattern, array_str))
158
-
159
- # If we successfully formatted it, return with the key as a header
160
- if formatted_array != array_str:
161
- return f"**{key}**:{formatted_array}"
162
- else:
163
- return match.group(0) # Return original if no change
164
- except Exception:
165
- return match.group(0) # Return the original on error
166
-
167
- # Apply all replacements
168
- content = re.sub(key_array_pattern, replace_key_array, content)
169
- content = re.sub(list_pattern, replace_list, content)
170
- content = re.sub(dict_pattern, replace_dict, content)
171
-
172
  return content
173
 
174
  # Handle native Python lists
175
- elif isinstance(content, list):
176
  if not content:
177
  return ""
178
  # Convert to markdown bullet points
 
80
  if not content:
81
  return ""
82
 
83
+ # For string content, return as-is to maintain content purity
84
+ # This prevents JSON-like text from being transformed inappropriately
85
  if isinstance(content, str):
86
  return content
87
 
88
  # Handle native Python lists
89
+ if isinstance(content, list):
90
  if not content:
91
  return ""
92
  # Convert to markdown bullet points
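
Note on the `utils/content_utils.py` change above: the regex and `ast.literal_eval` rewriting of strings is removed, so OCR text that merely looks like a Python list or dict is no longer turned into bullets. A minimal sketch of the resulting behaviour, covering only the branches visible in this hunk; the bullet format is taken from the removed code, and the final `str(content)` fallback is an assumption standing in for the branches not shown.

```python
def format_structured_data(content):
    """Pass strings through untouched; render native lists as markdown bullets."""
    if not content:
        return ""

    # Strings are returned as-is so JSON-like OCR text is not reinterpreted.
    if isinstance(content, str):
        return content

    # Native Python lists become one markdown bullet per item.
    if isinstance(content, list):
        return "\n".join(f"- {item}" for item in content)

    # Assumption: anything else is stringified (the real function may handle dicts differently).
    return str(content)


as_text = format_structured_data("topics: ['maps', 'voyages']")
as_list = format_structured_data(["maps", "voyages"])
```

With the old code the first call would have produced a bold key followed by bullets; with the new behaviour the string comes back unchanged, which appears to be what the added `testing/test_json_bleed.py` is meant to guard.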
utils/general_utils.py CHANGED
@@ -75,8 +75,17 @@ def timing(description):
75
 
76
  return TimingContext(description)
77
 
78
- def format_timestamp(timestamp=None):
79
- """Format timestamp for display"""
 
 
 
 
 
 
 
 
 
80
  if timestamp is None:
81
  timestamp = datetime.now()
82
  elif isinstance(timestamp, str):
@@ -85,11 +94,16 @@ def format_timestamp(timestamp=None):
85
  except ValueError:
86
  timestamp = datetime.now()
87
 
88
- return timestamp.strftime("%Y-%m-%d %H:%M")
 
 
 
 
 
89
 
90
  def create_descriptive_filename(original_filename, result, file_ext, preprocessing_options=None):
91
  """
92
- Create a descriptive filename for the result
93
 
94
  Args:
95
  original_filename: Original filename
@@ -98,30 +112,51 @@ def create_descriptive_filename(original_filename, result, file_ext, preprocessi
98
  preprocessing_options: Dictionary of preprocessing options
99
 
100
  Returns:
101
- str: Descriptive filename
102
  """
103
- # Get base name without extension
104
  original_name = Path(original_filename).stem
105
 
106
- # Add document type to filename if detected
107
- doc_type_tag = ""
108
- if 'detected_document_type' in result:
109
- doc_type = result['detected_document_type'].lower()
110
- doc_type_tag = f"_{doc_type.replace(' ', '_')}"
 
 
 
 
 
 
111
  elif 'topics' in result and result['topics']:
112
- # Use first tag as document type if not explicitly detected
113
- doc_type_tag = f"_{result['topics'][0].lower().replace(' ', '_')}"
114
 
115
- # Add period tag for historical context if available
116
- period_tag = ""
117
  if 'topics' in result and result['topics']:
118
  for tag in result['topics']:
119
  if "century" in tag.lower() or "pre-" in tag.lower() or "era" in tag.lower():
120
- period_tag = f"_{tag.lower().replace(' ', '_')}"
121
  break
122
 
123
- # Generate final descriptive filename
124
- descriptive_name = f"{original_name}{doc_type_tag}{period_tag}{file_ext}"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
125
  return descriptive_name
126
 
127
  def extract_subject_tags(result, raw_text, preprocessing_options=None):
 
75
 
76
  return TimingContext(description)
77
 
78
+ def format_timestamp(timestamp=None, for_filename=False):
79
+ """
80
+ Format timestamp for display or filenames
81
+
82
+ Args:
83
+ timestamp: Datetime object or string to format (defaults to current time)
84
+ for_filename: Whether to format for use in a filename (defaults to False)
85
+
86
+ Returns:
87
+ str: Formatted timestamp
88
+ """
89
  if timestamp is None:
90
  timestamp = datetime.now()
91
  elif isinstance(timestamp, str):
 
94
  except ValueError:
95
  timestamp = datetime.now()
96
 
97
+ if for_filename:
98
+ # Format suitable for filenames: "Apr 30, 2025"
99
+ return timestamp.strftime("%b %d, %Y")
100
+ else:
101
+ # Standard format for display
102
+ return timestamp.strftime("%Y-%m-%d %H:%M")
103
 
104
  def create_descriptive_filename(original_filename, result, file_ext, preprocessing_options=None):
105
  """
106
+ Create a user-friendly descriptive filename for the result
107
 
108
  Args:
109
  original_filename: Original filename
 
112
  preprocessing_options: Dictionary of preprocessing options
113
 
114
  Returns:
115
+ str: Human-readable descriptive filename
116
  """
117
+ # Get base name without extension and capitalize words
118
  original_name = Path(original_filename).stem
119
 
120
+ # Make the original name more readable by replacing dashes and underscores with spaces
121
+ # Then capitalize each word
122
+ readable_name = original_name.replace('-', ' ').replace('_', ' ')
123
+ # Split by spaces and capitalize each word, then rejoin
124
+ name_parts = readable_name.split()
125
+ readable_name = ' '.join(word.capitalize() for word in name_parts)
126
+
127
+ # Determine document type
128
+ doc_type = None
129
+ if 'detected_document_type' in result and result['detected_document_type']:
130
+ doc_type = result['detected_document_type'].capitalize()
131
  elif 'topics' in result and result['topics']:
132
+ # Use first topic as document type if not explicitly detected
133
+ doc_type = result['topics'][0]
134
 
135
+ # Find period/era information
136
+ period_info = None
137
  if 'topics' in result and result['topics']:
138
  for tag in result['topics']:
139
  if "century" in tag.lower() or "pre-" in tag.lower() or "era" in tag.lower():
140
+ period_info = tag
141
  break
142
 
143
+ # Format metadata within parentheses if available
144
+ metadata = []
145
+ if doc_type:
146
+ metadata.append(doc_type)
147
+ if period_info:
148
+ metadata.append(period_info)
149
+
150
+ metadata_str = ""
151
+ if metadata:
152
+ metadata_str = f" ({', '.join(metadata)})"
153
+
154
+ # Add current date for uniqueness and sorting
155
+ current_date = format_timestamp(for_filename=True)
156
+ date_str = f" - {current_date}"
157
+
158
+ # Generate final user-friendly filename
159
+ descriptive_name = f"{readable_name}{metadata_str}{date_str}{file_ext}"
160
  return descriptive_name
161
 
162
  def extract_subject_tags(result, raw_text, preprocessing_options=None):
utils/image_utils.py CHANGED
@@ -364,30 +364,116 @@ def serialize_ocr_object(obj):
364
  except:
365
  return None
366
 
367
- def format_ocr_text(text):
 
368
  """
369
- Format OCR text with simple, predictable rules that ensure consistency.
370
- This formats ALL CAPS lines as bold markdown and preserves the rest.
371
 
372
  Args:
373
- text: Text content to format
 
 
 
374
 
375
  Returns:
376
- Formatted text with consistent styling
377
  """
378
- if not isinstance(text, str):
379
- return text
380
-
381
- lines = text.split('\n')
382
- processed_lines = []
383
- for line in lines:
384
- line_stripped = line.strip()
385
- if line_stripped and line_stripped.isupper() and len(line_stripped) > 3:
386
- processed_lines.append(f"**{line_stripped}**")
 
 
 
 
 
 
 
 
387
  else:
388
- processed_lines.append(line)
389
 
390
- return '\n'.join(processed_lines)
391
 
392
  def create_results_zip(results, output_dir=None, zip_name=None):
393
  """
@@ -444,6 +530,8 @@ def create_results_zip(results, output_dir=None, zip_name=None):
444
  def create_results_zip_in_memory(results):
445
  """
446
  Create a zip file containing OCR results in memory.
 
 
447
 
448
  Args:
449
  results: Dictionary or list of OCR results
@@ -454,114 +542,24 @@ def create_results_zip_in_memory(results):
454
  # Create a BytesIO object
455
  zip_buffer = io.BytesIO()
456
 
457
- # Check if results is a list or a dictionary
458
- is_list = isinstance(results, list)
459
-
460
- # Create zip file in memory
461
- with zipfile.ZipFile(zip_buffer, 'w', zipfile.ZIP_DEFLATED) as zipf:
462
  if is_list:
463
- # Handle list of results
464
- for i, result in enumerate(results):
465
- try:
466
- # Create a descriptive base filename for this result
467
- base_filename = result.get('file_name', f'document_{i+1}').split('.')[0]
468
-
469
- # Add document type if available
470
- if 'topics' in result and result['topics']:
471
- topic = result['topics'][0].lower().replace(' ', '_')
472
- base_filename = f"{base_filename}_{topic}"
473
-
474
- # Add language if available
475
- if 'languages' in result and result['languages']:
476
- lang = result['languages'][0].lower()
477
- # Only add if it's not already in the filename
478
- if lang not in base_filename.lower():
479
- base_filename = f"{base_filename}_{lang}"
480
-
481
- # For PDFs, add page information
482
- if 'limited_pages' in result:
483
- base_filename = f"{base_filename}_p{result['limited_pages']['processed']}of{result['limited_pages']['total']}"
484
-
485
- # Add timestamp if available
486
- if 'timestamp' in result:
487
- try:
488
- # Try to parse the timestamp and reformat it
489
- dt = datetime.strptime(result['timestamp'], "%Y-%m-%d %H:%M")
490
- timestamp = dt.strftime("%Y%m%d_%H%M%S")
491
- base_filename = f"{base_filename}_{timestamp}"
492
- except Exception:
493
- pass
494
 
495
- # Add JSON results for each file with descriptive name
496
- result_json = json.dumps(result, indent=2)
497
- zipf.writestr(f"{base_filename}.json", result_json)
498
-
499
- # Add HTML content (generated from the result)
500
- html_content = create_html_with_images(result)
501
- zipf.writestr(f"{base_filename}.html", html_content)
502
-
503
- # Add raw OCR text if available
504
- if "ocr_contents" in result and "raw_text" in result["ocr_contents"]:
505
- zipf.writestr(f"{base_filename}.txt", result["ocr_contents"]["raw_text"])
506
-
507
- except Exception as e:
508
- # If any result fails, skip it and continue
509
- logger.warning(f"Failed to process result for zip: {str(e)}")
510
- continue
511
  else:
512
- # Handle single result
513
- try:
514
- # Create a descriptive base filename for this result
515
- base_filename = results.get('file_name', 'document').split('.')[0]
516
-
517
- # Add document type if available
518
- if 'topics' in results and results['topics']:
519
- topic = results['topics'][0].lower().replace(' ', '_')
520
- base_filename = f"{base_filename}_{topic}"
521
-
522
- # Add language if available
523
- if 'languages' in results and results['languages']:
524
- lang = results['languages'][0].lower()
525
- # Only add if it's not already in the filename
526
- if lang not in base_filename.lower():
527
- base_filename = f"{base_filename}_{lang}"
528
-
529
- # For PDFs, add page information
530
- if 'limited_pages' in results:
531
- base_filename = f"{base_filename}_p{results['limited_pages']['processed']}of{results['limited_pages']['total']}"
532
-
533
- # Add timestamp if available
534
- if 'timestamp' in results:
535
- try:
536
- # Try to parse the timestamp and reformat it
537
- dt = datetime.strptime(results['timestamp'], "%Y-%m-%d %H:%M")
538
- timestamp = dt.strftime("%Y%m%d_%H%M%S")
539
- base_filename = f"{base_filename}_{timestamp}"
540
- except Exception:
541
- # If parsing fails, create a new timestamp
542
- timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
543
- base_filename = f"{base_filename}_{timestamp}"
544
- else:
545
- # No timestamp in the result, create a new one
546
- timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
547
- base_filename = f"{base_filename}_{timestamp}"
548
-
549
- # Add JSON results with descriptive name
550
- results_json = json.dumps(results, indent=2)
551
- zipf.writestr(f"{base_filename}.json", results_json)
552
-
553
- # Add HTML content with descriptive name
554
- html_content = create_html_with_images(results)
555
- zipf.writestr(f"{base_filename}.html", html_content)
556
-
557
- # Add raw OCR text if available
558
- if "ocr_contents" in results and "raw_text" in results["ocr_contents"]:
559
- zipf.writestr(f"{base_filename}.txt", results["ocr_contents"]["raw_text"])
560
-
561
- except Exception as e:
562
- # If processing fails, log the error
563
- logger.error(f"Failed to create zip file: {str(e)}")
564
- pass
565
 
566
  # Seek to the beginning of the BytesIO object
567
  zip_buffer.seek(0)
@@ -569,17 +567,158 @@ def create_results_zip_in_memory(results):
569
  # Return the zip file bytes
570
  return zip_buffer.getvalue()
571
 
572
- def create_html_with_images(result):
573
  """
574
- Create a clean HTML document from OCR results that properly preserves page references
575
- and text structure, without any document-specific special cases.
576
 
577
  Args:
578
  result: OCR result dictionary
 
 
579
 
580
  Returns:
581
- HTML content as string
582
  """
583
  # Import content utils to use classification functions
584
  try:
585
  from utils.content_utils import classify_document_content, extract_document_text, extract_image_description
@@ -590,13 +729,11 @@ def create_html_with_images(result):
590
  # Get content classification
591
  has_text = True
592
  has_images = False
593
- has_page_refs = False
594
 
595
  if content_utils_available:
596
  classification = classify_document_content(result)
597
  has_text = classification['has_content']
598
  has_images = result.get('has_images', False)
599
- has_page_refs = False
600
  else:
601
  # Minimal fallback detection
602
  if 'has_images' in result:
@@ -609,143 +746,111 @@ def create_html_with_images(result):
609
  has_images = True
610
  break
611
 
612
- # Start building the HTML document
613
- html = [
614
- '<!DOCTYPE html>',
615
- '<html lang="en">',
616
- '<head>',
617
- ' <meta charset="UTF-8">',
618
- ' <meta name="viewport" content="width=device-width, initial-scale=1.0">',
619
- f' <title>{result.get("file_name", "Document")}</title>',
620
- ' <style>',
621
- ' body {',
622
- ' font-family: Georgia, serif;',
623
- ' line-height: 1.6;',
624
- ' color: #333;',
625
- ' max-width: 800px;',
626
- ' margin: 0 auto;',
627
- ' padding: 20px;',
628
- ' }',
629
- ' h1, h2, h3, h4 {',
630
- ' color: #222;',
631
- ' margin-top: 1.5em;',
632
- ' margin-bottom: 0.5em;',
633
- ' }',
634
- ' h1 { font-size: 24px; }',
635
- ' h2 { font-size: 22px; }',
636
- ' h3 { font-size: 20px; }',
637
- ' h4 { font-size: 18px; }',
638
- ' p { margin: 1em 0; }',
639
- ' .metadata {',
640
- ' background-color: #f8f9fa;',
641
- ' border: 1px solid #eaecef;',
642
- ' border-radius: 6px;',
643
- ' padding: 15px;',
644
- ' margin-bottom: 20px;',
645
- ' }',
646
- ' .metadata p { margin: 5px 0; }',
647
- ' img {',
648
- ' max-width: 100%;',
649
- ' height: auto;',
650
- ' display: block;',
651
- ' margin: 20px auto;',
652
- ' border: 1px solid #ddd;',
653
- ' border-radius: 4px;',
654
- ' }',
655
- ' .image-container {',
656
- ' margin: 20px 0;',
657
- ' text-align: center;',
658
- ' }',
659
- ' .image-caption {',
660
- ' font-size: 0.9em;',
661
- ' text-align: center;',
662
- ' color: #666;',
663
- ' margin-top: 5px;',
664
- ' }',
665
- ' .text-block {',
666
- ' margin: 10px 0;',
667
- ' }',
668
- ' .page-ref {',
669
- ' font-weight: bold;',
670
- ' color: #555;',
671
- ' }',
672
- ' .separator {',
673
- ' border-top: 1px solid #eaecef;',
674
- ' margin: 30px 0;',
675
- ' }',
676
- ' </style>',
677
- '</head>',
678
- '<body>'
679
- ]
680
-
681
- # Add document metadata
682
- html.append('<div class="metadata">')
683
- html.append(f'<h1>{result.get("file_name", "Document")}</h1>')
684
 
685
  # Add timestamp
686
  if 'timestamp' in result:
687
- html.append(f'<p><strong>Processed:</strong> {result["timestamp"]}</p>')
688
 
689
  # Add languages if available
690
  if 'languages' in result and result['languages']:
691
  languages = [lang for lang in result['languages'] if lang]
692
  if languages:
693
- html.append(f'<p><strong>Languages:</strong> {", ".join(languages)}</p>')
694
 
695
  # Add document type and topics
696
  if 'detected_document_type' in result:
697
- html.append(f'<p><strong>Document Type:</strong> {result["detected_document_type"]}</p>')
698
 
699
  if 'topics' in result and result['topics']:
700
- html.append(f'<p><strong>Topics:</strong> {", ".join(result["topics"])}</p>')
701
 
702
- html.append('</div>') # Close metadata div
703
 
704
  # Document title - extract from result if available
705
  if 'ocr_contents' in result and 'title' in result['ocr_contents'] and result['ocr_contents']['title']:
706
  title_content = result['ocr_contents']['title']
707
- # No special handling for any specific document types
708
- html.append(f'<h2>{title_content}</h2>')
709
 
710
  # Add images if present
711
  if has_images and 'pages_data' in result:
712
- html.append('<h3>Images</h3>')
713
 
714
- # Extract and display all images
715
  for page_idx, page in enumerate(result['pages_data']):
716
  if 'images' in page and isinstance(page['images'], list):
717
  for img_idx, img in enumerate(page['images']):
718
- if 'image_base64' in img and img['image_base64']:
719
- # Image container
720
- html.append('<div class="image-container">')
721
- html.append(f'<img src="{img["image_base64"]}" alt="Image {page_idx+1}-{img_idx+1}">')
722
-
723
- # Generic caption based on index
724
- html.append(f'<div class="image-caption">img-{img_idx}.jpeg</div>')
725
- html.append('</div>')
726
 
727
  # Add image description if available through utils
728
  if content_utils_available:
729
  description = extract_image_description(result)
730
  if description:
731
- html.append('<div class="text-block">')
732
- html.append(f'<p>{description}</p>')
733
- html.append('</div>')
734
 
735
- html.append('<hr class="separator">')
736
 
737
  # Add document text section
738
- html.append('<h3>Text</h3>')
739
 
740
  # Extract text content systematically
741
  text_content = ""
742
 
743
  if content_utils_available:
744
- # Use the systematic utility function
745
  text_content = extract_document_text(result)
746
  else:
747
  # Fallback extraction logic
748
  if 'ocr_contents' in result:
 
749
  for field in ["main_text", "content", "text", "transcript", "raw_text"]:
750
  if field in result['ocr_contents'] and result['ocr_contents'][field]:
751
  content = result['ocr_contents'][field]
@@ -759,128 +864,338 @@ def create_html_with_images(result):
759
  break
760
  except:
761
  pass
762
 
763
- # Process text content for HTML display
764
  if text_content:
765
- # Clean the text but preserve page references
766
- text_content = text_content.replace('\r\n', '\n')
767
-
768
- # Preserve page references by wrapping them in HTML tags
769
- if has_page_refs:
770
- # Highlight common page reference patterns
771
- page_patterns = [
772
- (r'(page\s+\d+)', r'<span class="page-ref">\1</span>'),
773
- (r'(p\.\s*\d+)', r'<span class="page-ref">\1</span>'),
774
- (r'(p\s+\d+)', r'<span class="page-ref">\1</span>'),
775
- (r'(\[\s*\d+\s*\])', r'<span class="page-ref">\1</span>'),
776
- (r'(\(\s*\d+\s*\))', r'<span class="page-ref">\1</span>'),
777
- (r'(folio\s+\d+)', r'<span class="page-ref">\1</span>'),
778
- (r'(f\.\s*\d+)', r'<span class="page-ref">\1</span>'),
779
- (r'(pg\.\s*\d+)', r'<span class="page-ref">\1</span>')
780
- ]
781
-
782
- for pattern, replacement in page_patterns:
783
- text_content = re.sub(pattern, replacement, text_content, flags=re.IGNORECASE)
784
-
785
- # Convert newlines to paragraphs
786
- paragraphs = text_content.split('\n\n')
787
- paragraphs = [p for p in paragraphs if p.strip()]
788
-
789
- html.append('<div class="text-block">')
790
- for paragraph in paragraphs:
791
- # Check if paragraph contains multiple lines
792
- if '\n' in paragraph:
793
- lines = paragraph.split('\n')
794
- lines = [line for line in lines if line.strip()]
795
 
796
- # Convert each line to a paragraph
797
- for line in lines:
798
- html.append(f'<p>{line}</p>')
799
- else:
800
- html.append(f'<p>{paragraph}</p>')
801
- html.append('</div>')
802
- else:
803
- html.append('<p>No text content available.</p>')
804
 
805
- # Close the HTML document
806
- html.append('</body>')
807
- html.append('</html>')
808
 
809
- return '\n'.join(html)
 
810
 
811
- def clean_ocr_result(result: dict,
812
- use_segmentation: bool = False,
813
- vision_enabled: bool = True) -> dict:
814
  """
815
- 1. Replace or strip markdown image refs (![id](id))
816
- 2. Collapse pages that are *only* an illustration into a single
817
- `illustrations` bucket when vision is off
818
- 3. Normalise `ocr_contents` keys to always have at least `raw_text`
 
 
819
  """
820
- if 'pages_data' in result:
821
- # Build a dict {id: base64} for quick look-ups
822
- image_dict = {
823
- img['id']: img['image_base64']
824
- for page in result['pages_data']
825
- for img in page.get('images', [])
826
- }
827
 
828
- # --- 1 · replace or drop image placeholders ---
829
- def _scrub(markdown: str) -> str:
830
- if vision_enabled and image_dict:
831
- return replace_images_in_markdown(markdown, image_dict)
832
- # no vision / no images → drop the line
833
- return re.sub(r'!\[[^\]]*\]\(img-\d+\.\w+\)', '', markdown)
834
 
835
- for page in result['pages_data']:
836
- page['markdown'] = _scrub(page.get('markdown', ''))
 
 
837
 
838
- # --- 2 · group illustration-only pages when vision is off ---
839
- if not vision_enabled and 'pages_data' in result:
840
- text_pages, art_pages = [], []
841
- for p in result['pages_data']:
842
- has_text = p.get('markdown', '').strip()
843
- (text_pages if has_text else art_pages).append(p)
844
- result['pages_data'] = text_pages
845
- if art_pages:
846
- # keep one thumbnail under metadata
847
- result.setdefault('illustrations', []).extend(art_pages)
848
 
849
- # --- 3 · ensure raw_text key ---
850
- if 'ocr_contents' in result and 'raw_text' not in result['ocr_contents']:
851
- # First, try to extract any embedded text from image references
852
- raw_text_parts = []
853
-
854
- for page in result.get('pages_data', []):
855
- markdown = page.get('markdown', '')
856
- # Check if the markdown contains image references
857
- img_refs = re.findall(r'!\[([^\]]*)\]\(([^\)]*)\)', markdown)
858
-
859
- # Process each image reference to extract text content
860
- if img_refs:
861
- for alt_text, img_url in img_refs:
862
- # If alt text contains actual text content (not just image ID), add it
863
- if alt_text and not alt_text.endswith(('.jpeg', '.jpg', '.png')):
864
- # Clean up the alt text and add it as text content
865
- alt_text = alt_text.strip()
866
- if alt_text and len(alt_text) > 3: # Only add if meaningful
867
- raw_text_parts.append(alt_text)
868
 
869
- # Remove image references from markdown
870
- cleaned_markdown = re.sub(r'!\[[^\]]*\]\([^\)]*\)', '', markdown)
871
 
872
- # Add any remaining text content
873
- if cleaned_markdown.strip():
874
- raw_text_parts.append(cleaned_markdown.strip())
875
-
876
- # Join all extracted text content
877
- if raw_text_parts:
878
- result['ocr_contents']['raw_text'] = "\n\n".join(raw_text_parts)
879
- else:
880
- # Fallback: use original method if no text was extracted
881
- joined = "\n".join(p.get('markdown', '') for p in result.get('pages_data', []))
882
- # Final cleanup of image references
883
- joined = re.sub(r'!\[[^\]]*\]\([^\)]*\)', '', joined)
884
- result['ocr_contents']['raw_text'] = joined
885
-
886
- return result
 
 
 
364
  except:
365
  return None
366
 
367
+ # Clean OCR result with focus on Mistral compatibility
368
+ def clean_ocr_result(result, use_segmentation=False, vision_enabled=True, preprocessing_options=None):
369
  """
370
+ Clean text content in OCR results, preserving original structure from Mistral API.
371
+ Only removes markdown/HTML conflicts without duplicating content across fields.
372
 
373
  Args:
374
+ result: OCR result object or dictionary
375
+ use_segmentation: Whether image segmentation was used
376
+ vision_enabled: Whether vision model was used
377
+ preprocessing_options: Dictionary of preprocessing options
378
 
379
  Returns:
380
+ Cleaned result object
381
  """
382
+ if not result:
383
+ return result
384
+
385
+ # Import text utilities for cleaning
386
+ try:
387
+ from utils.text_utils import clean_raw_text
388
+ text_cleaner_available = True
389
+ except ImportError:
390
+ text_cleaner_available = False
391
+
392
+ def clean_text(text):
393
+ """Clean text content, removing markdown image references and base64 data"""
394
+ if not text or not isinstance(text, str):
395
+ return ""
396
+
397
+ if text_cleaner_available:
398
+ text = clean_raw_text(text)
399
  else:
400
+ # Remove image references like ![image](data:image/...)
401
+ text = re.sub(r'!\[.*?\]\(data:image/[^)]+\)', '', text)
402
+
403
+ # Remove basic markdown image references like ![alt](img-1.jpg)
404
+ text = re.sub(r'!\[[^\]]*\]\([^)]+\)', '', text)
405
+
406
+ # Remove base64 encoded image data
407
+ text = re.sub(r'data:image/[^;]+;base64,[a-zA-Z0-9+/=]+', '', text)
408
+
409
+ # Clean up any JSON-like image object references
410
+ text = re.sub(r'{"image(_data)?":("[^"]*"|null|true|false|\{[^}]*\}|\[[^\]]*\])}', '', text)
411
+
412
+ # Clean up excessive whitespace and line breaks created by removals
413
+ text = re.sub(r'\n{3,}', '\n\n', text)
414
+ text = re.sub(r'\s{3,}', ' ', text)
415
+
416
+ return text.strip()
417
+
418
+ # Process dictionary
419
+ if isinstance(result, dict):
420
+ # For PDF documents, preserve original structure from Mistral API
421
+ is_pdf = result.get('file_type', '') == 'pdf' or (
422
+ result.get('file_name', '').lower().endswith('.pdf')
423
+ )
424
+
425
+ # Ensure ocr_contents exists
426
+ if 'ocr_contents' not in result:
427
+ result['ocr_contents'] = {}
428
+
429
+ # Clean raw_text if it exists but don't duplicate it
430
+ if 'raw_text' in result:
431
+ result['raw_text'] = clean_text(result['raw_text'])
432
+
433
+ # Handle ocr_contents fields - clean them but don't duplicate
434
+ if 'ocr_contents' in result:
435
+ for key, value in list(result['ocr_contents'].items()):
436
+ # Skip binary fields and image data
437
+ if key in ['image_base64', 'images', 'binary_data'] and value:
438
+ continue
439
+
440
+ # Clean string values to remove markdown/HTML conflicts
441
+ if isinstance(value, str):
442
+ result['ocr_contents'][key] = clean_text(value)
443
+
444
+ # Handle segmentation data
445
+ if use_segmentation and preprocessing_options and 'segmentation_data' in preprocessing_options:
446
+ # Store segmentation metadata
447
+ result['segmentation_applied'] = True
448
+
449
+ # Extract combined text if available
450
+ if 'combined_text' in preprocessing_options['segmentation_data']:
451
+ segmentation_text = clean_text(preprocessing_options['segmentation_data']['combined_text'])
452
+ # Add as dedicated field
453
+ result['ocr_contents']['segmentation_text'] = segmentation_text
454
+
455
+ # Use segmentation text for raw_text if it doesn't exist
456
+ if 'raw_text' not in result['ocr_contents']:
457
+ result['ocr_contents']['raw_text'] = segmentation_text
458
+
459
+ # Clean pages_data if available (Mistral OCR format)
460
+ if 'pages_data' in result:
461
+ for page in result['pages_data']:
462
+ if isinstance(page, dict):
463
+ # Clean text field
464
+ if 'text' in page:
465
+ page['text'] = clean_text(page['text'])
466
+
467
+ # Clean markdown field
468
+ if 'markdown' in page:
469
+ page['markdown'] = clean_text(page['markdown'])
470
+
471
+ # Handle list content recursively
472
+ elif isinstance(result, list):
473
+ return [clean_ocr_result(item, use_segmentation, vision_enabled, preprocessing_options)
474
+ for item in result]
475
 
476
+ return result
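
Review note: the regex fallback inside `clean_text` can be exercised on its own. The sketch below is not part of this commit; `strip_image_markup` is a hypothetical name, and it assumes `utils.text_utils.clean_raw_text` is unavailable so that only the fallback path matters. One deliberate difference: it collapses runs of spaces and tabs only, whereas the diff uses `\s{3,}`, which also matches newlines.

```python
import re

def strip_image_markup(text):
    """Hypothetical helper mirroring the regex fallback in clean_text."""
    if not text or not isinstance(text, str):
        return ""
    # Drop markdown image tags, whether they point at a data URI or a file
    text = re.sub(r'!\[[^\]]*\]\([^)]+\)', '', text)
    # Drop bare base64 data URIs left outside image tags
    text = re.sub(r'data:image/[^;]+;base64,[a-zA-Z0-9+/=]+', '', text)
    # Collapse the blank lines and space runs that the removals leave behind
    text = re.sub(r'\n{3,}', '\n\n', text)
    text = re.sub(r'[ \t]{3,}', ' ', text)
    return text.strip()

sample = (
    "Title\n\n![img-0.jpeg](img-0.jpeg)\n\n\n"
    "Body text with data:image/png;base64,AAAA inline."
)
cleaned = strip_image_markup(sample)
print(cleaned)
```

Keeping the whitespace rule narrower than `\s{3,}` is a design choice in this sketch, made so that paragraph breaks survive the cleanup.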
477
 
478
  def create_results_zip(results, output_dir=None, zip_name=None):
479
  """
 
530
  def create_results_zip_in_memory(results):
531
  """
532
  Create a zip file containing OCR results in memory.
533
+ Packages markdown with embedded image tags, raw text, and JSON file
534
+ in a contextually relevant structure.
535
 
536
  Args:
537
  results: Dictionary or list of OCR results
 
542
  # Create a BytesIO object
543
  zip_buffer = io.BytesIO()
544
 
545
+ # Create a ZipFile instance
546
+ with zipfile.ZipFile(zip_buffer, 'w', compression=zipfile.ZIP_DEFLATED) as zipf:
547
+ # Check if results is a list or a dictionary
548
+ is_list = isinstance(results, list)
549
+
550
  if is_list:
551
+ # Handle multiple results by creating subdirectories
552
+ for idx, result in enumerate(results):
553
+ if result and isinstance(result, dict):
554
+ # Create a folder name based on the file name or index
555
+ folder_name = result.get('file_name', f'document_{idx+1}')
556
+ folder_name = Path(folder_name).stem # Remove file extension
557
 
558
+ # Add files to this folder
559
+ add_result_files_to_zip(zipf, result, f"{folder_name}/")
560
  else:
561
+ # Single result - add files directly to root of zip
562
+ add_result_files_to_zip(zipf, results)
563
 
564
  # Seek to the beginning of the BytesIO object
565
  zip_buffer.seek(0)
 
567
  # Return the zip file bytes
568
  return zip_buffer.getvalue()
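
Review note: the new archive layout (one folder per result for a list, files at the zip root for a single result) can be illustrated with a small standalone sketch. This is not the commit's code; `zip_results_in_memory` is a hypothetical stand-in that writes only a JSON file per result, where the real `add_result_files_to_zip` writes more.

```python
import io
import json
import zipfile
from pathlib import Path

def zip_results_in_memory(results):
    """Hypothetical stand-in: folder per result for lists, zip root for a dict."""
    buffer = io.BytesIO()
    is_list = isinstance(results, list)
    items = results if is_list else [results]
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zipf:
        for idx, result in enumerate(items):
            if not isinstance(result, dict):
                continue  # skip empty or malformed entries
            stem = Path(result.get("file_name", f"document_{idx + 1}")).stem
            prefix = f"{stem}/" if is_list else ""
            zipf.writestr(f"{prefix}{stem}.json", json.dumps(result, indent=2))
    return buffer.getvalue()

data = zip_results_in_memory([{"file_name": "a.pdf"}, {"file_name": "b.png"}])
names = zipfile.ZipFile(io.BytesIO(data)).namelist()
print(names)
```

Two inputs that share a file stem would collide in this layout; the sketch does not attempt to resolve that.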
569
 
570
+ def truncate_base64_in_result(result, prefix_length=32, suffix_length=32):
571
  """
572
+ Create a copy of the result dictionary with base64 image data truncated.
573
+ This keeps the structure intact while making the JSON more readable.
574
 
575
  Args:
576
  result: OCR result dictionary
577
+ prefix_length: Number of characters to keep at the beginning
578
+ suffix_length: Number of characters to keep at the end
579
 
580
  Returns:
581
+ Dictionary with truncated base64 data
582
  """
583
+ if not result or not isinstance(result, dict):
584
+ return {}
585
+
586
+ # Create a deep copy to avoid modifying the original
587
+ import copy
588
+ truncated_result = copy.deepcopy(result)
589
+
590
+ # Helper function to truncate base64 strings
591
+ def truncate_base64(data):
592
+ if not isinstance(data, str) or len(data) <= prefix_length + suffix_length + 10:
593
+ return data
594
+
595
+ # Extract prefix and suffix based on whether this is a data URI or raw base64
596
+ if data.startswith('data:'):
597
+ # Handle data URIs like 'data:image/jpeg;base64,/9j/4AAQ...'
598
+ parts = data.split(',', 1)
599
+ if len(parts) != 2:
600
+ return data # Unexpected format, return as is
601
+
602
+ header = parts[0] + ','
603
+ base64_content = parts[1]
604
+
605
+ if len(base64_content) <= prefix_length + suffix_length + 10:
606
+ return data # Not long enough to truncate
607
+
608
+ truncated = (f"{header}{base64_content[:prefix_length]}..."
609
+ f"[truncated {len(base64_content) - prefix_length - suffix_length} chars]..."
610
+ f"{base64_content[-suffix_length:]}")
611
+ else:
612
+ # Handle raw base64 strings
613
+ truncated = (f"{data[:prefix_length]}..."
614
+ f"[truncated {len(data) - prefix_length - suffix_length} chars]..."
615
+ f"{data[-suffix_length:]}")
616
+
617
+ return truncated
618
+
619
+ # Helper function to recursively truncate base64 in nested structures
620
+ def truncate_base64_recursive(obj):
621
+ if isinstance(obj, dict):
622
+ # Check for keys that typically contain base64 data
623
+ for key in list(obj.keys()):
624
+ if key in ['image_base64', 'base64'] and isinstance(obj[key], str):
625
+ obj[key] = truncate_base64(obj[key])
626
+ elif isinstance(obj[key], (dict, list)):
627
+ truncate_base64_recursive(obj[key])
628
+ elif isinstance(obj, list):
629
+ for item in obj:
630
+ if isinstance(item, (dict, list)):
631
+ truncate_base64_recursive(item)
632
+
633
+ # Truncate base64 data throughout the result
634
+ truncate_base64_recursive(truncated_result)
635
+
636
+ # Specifically handle the pages_data structure
637
+ if 'pages_data' in truncated_result:
638
+ for page in truncated_result['pages_data']:
639
+ if isinstance(page, dict) and 'images' in page:
640
+ for img in page['images']:
641
+ if isinstance(img, dict) and 'image_base64' in img and isinstance(img['image_base64'], str):
642
+ img['image_base64'] = truncate_base64(img['image_base64'])
643
+
644
+ # Handle raw_response_data if present
645
+ if 'raw_response_data' in truncated_result and isinstance(truncated_result['raw_response_data'], dict):
646
+ if 'pages' in truncated_result['raw_response_data']:
647
+ for page in truncated_result['raw_response_data']['pages']:
648
+ if isinstance(page, dict) and 'images' in page:
649
+ for img in page['images']:
650
+ if isinstance(img, dict) and 'base64' in img and isinstance(img['base64'], str):
651
+ img['base64'] = truncate_base64(img['base64'])
652
+
653
+ return truncated_result
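
Review note: the nested `truncate_base64` helper is easier to reason about in isolation. The sketch below is a hypothetical standalone version written for illustration, not the commit's code; it keeps the same prefix/suffix idea and the same `+ 10` slack before truncating.

```python
def truncate_base64(data, prefix_length=32, suffix_length=32):
    """Hypothetical standalone version of the nested helper in the diff."""
    if not isinstance(data, str):
        return data
    header, body = "", data
    if data.startswith("data:"):
        head, sep, body = data.partition(",")
        if not sep:
            return data  # unexpected data URI format, leave untouched
        header = head + ","
    if len(body) <= prefix_length + suffix_length + 10:
        return data  # too short to be worth truncating
    removed = len(body) - prefix_length - suffix_length
    return (
        f"{header}{body[:prefix_length]}..."
        f"[truncated {removed} chars]..."
        f"{body[-suffix_length:]}"
    )

uri = "data:image/jpeg;base64," + "A" * 200
short = truncate_base64(uri, prefix_length=8, suffix_length=8)
print(short)
```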
654
+
655
+ def clean_base64_from_result(result):
656
+ """
657
+ Create a clean copy of the result dictionary with base64 image data removed.
658
+ This ensures JSON files don't contain large base64 strings.
659
+
660
+ Args:
661
+ result: OCR result dictionary
662
+
663
+ Returns:
664
+ Cleaned dictionary without base64 data
665
+ """
666
+ if not result or not isinstance(result, dict):
667
+ return {}
668
+
669
+ # Create a deep copy to avoid modifying the original
670
+ import copy
671
+ clean_result = copy.deepcopy(result)
672
+
673
+ # Helper function to recursively clean base64 from nested structures
674
+ def clean_base64_recursive(obj):
675
+ if isinstance(obj, dict):
676
+ # Check for keys that typically contain base64 data
677
+ for key in list(obj.keys()):
678
+ if key in ['image_base64', 'base64']:
679
+ obj[key] = "[BASE64_DATA_REMOVED]"
680
+ elif isinstance(obj[key], (dict, list)):
681
+ clean_base64_recursive(obj[key])
682
+ elif isinstance(obj, list):
683
+ for item in obj:
684
+ if isinstance(item, (dict, list)):
685
+ clean_base64_recursive(item)
686
+
687
+ # Clean the entire result
688
+ clean_base64_recursive(clean_result)
689
+
690
+ # Specifically handle the pages_data structure
691
+ if 'pages_data' in clean_result:
692
+ for page in clean_result['pages_data']:
693
+ if isinstance(page, dict) and 'images' in page:
694
+ for img in page['images']:
695
+ if isinstance(img, dict) and 'image_base64' in img:
696
+ img['image_base64'] = "[BASE64_DATA_REMOVED]"
697
+
698
+ # Handle raw_response_data if present
699
+ if 'raw_response_data' in clean_result and isinstance(clean_result['raw_response_data'], dict):
700
+ if 'pages' in clean_result['raw_response_data']:
701
+ for page in clean_result['raw_response_data']['pages']:
702
+ if isinstance(page, dict) and 'images' in page:
703
+ for img in page['images']:
704
+ if isinstance(img, dict) and 'base64' in img:
705
+ img['base64'] = "[BASE64_DATA_REMOVED]"
706
+
707
+ return clean_result
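
Review note: the recursive walk in `clean_base64_from_result` appears to reach `pages_data` and `raw_response_data` already, so the two explicit loops after it look redundant. A minimal sketch of the same redaction, assuming only the two key names used in the diff; `redact_base64` is a hypothetical name, and it builds a new structure instead of deep-copying and mutating.

```python
PLACEHOLDER = "[BASE64_DATA_REMOVED]"

def redact_base64(obj, keys=("image_base64", "base64")):
    """Return a copy of obj with values under the given keys replaced."""
    if isinstance(obj, dict):
        return {
            k: PLACEHOLDER if k in keys else redact_base64(v, keys)
            for k, v in obj.items()
        }
    if isinstance(obj, list):
        return [redact_base64(item, keys) for item in obj]
    return obj  # scalars pass through unchanged

result = {
    "file_name": "scan.pdf",
    "pages_data": [{"images": [{"id": "img-0", "image_base64": "AAAA"}]}],
    "raw_response_data": {"pages": [{"images": [{"base64": "BBBB"}]}]},
}
clean = redact_base64(result)
print(clean["pages_data"][0]["images"][0])
```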
708
+
709
+ def create_markdown_with_file_references(result, image_path_prefix="images/"):
710
+ """
711
+ Create a markdown document with file references to images instead of base64 embedding.
712
+ Ideal for use in zip archives where images are stored as separate files.
713
+
714
+ Args:
715
+ result: OCR result dictionary
716
+ image_path_prefix: Path prefix for image references (e.g., "images/")
717
+
718
+ Returns:
719
+ Markdown content as string with file references
720
+ """
721
+ # Similar to create_markdown_with_images but uses file references
722
  # Import content utils to use classification functions
723
  try:
724
  from utils.content_utils import classify_document_content, extract_document_text, extract_image_description
 
729
  # Get content classification
730
  has_text = True
731
  has_images = False
 
732
 
733
  if content_utils_available:
734
  classification = classify_document_content(result)
735
  has_text = classification['has_content']
736
  has_images = result.get('has_images', False)
 
737
  else:
738
  # Minimal fallback detection
739
  if 'has_images' in result:
 
746
  has_images = True
747
  break
748
 
749
+ # Start building the markdown document
750
+ md = []
751
+
752
+ # Add document title/header
753
+ md.append(f"# {result.get('file_name', 'Document')}\n")
754
+
755
+ # Add metadata section
756
+ md.append("## Document Metadata\n")
757
 
758
  # Add timestamp
759
  if 'timestamp' in result:
760
+ md.append(f"**Processed:** {result['timestamp']}\n")
761
 
762
  # Add languages if available
763
  if 'languages' in result and result['languages']:
764
  languages = [lang for lang in result['languages'] if lang]
765
  if languages:
766
+ md.append(f"**Languages:** {', '.join(languages)}\n")
767
 
768
  # Add document type and topics
769
  if 'detected_document_type' in result:
770
+ md.append(f"**Document Type:** {result['detected_document_type']}\n")
771
 
772
  if 'topics' in result and result['topics']:
773
+ md.append(f"**Topics:** {', '.join(result['topics'])}\n")
774
 
775
+ md.append("\n---\n")
776
 
777
  # Document title - extract from result if available
778
  if 'ocr_contents' in result and 'title' in result['ocr_contents'] and result['ocr_contents']['title']:
779
  title_content = result['ocr_contents']['title']
780
+ md.append(f"## {title_content}\n")
 
781
 
782
  # Add images if present
783
  if has_images and 'pages_data' in result:
784
+ md.append("## Images\n")
785
 
786
+ # Extract and display all images with file references
787
  for page_idx, page in enumerate(result['pages_data']):
788
  if 'images' in page and isinstance(page['images'], list):
789
  for img_idx, img in enumerate(page['images']):
790
+ if 'image_base64' in img:
791
+ # Create image reference to file in the zip
792
+ image_filename = f"image_{page_idx+1}_{img_idx+1}.jpg"
793
+ image_path = f"{image_path_prefix}{image_filename}"
794
+ image_caption = f"Image {page_idx+1}-{img_idx+1}"
795
+ md.append(f"![{image_caption}]({image_path})\n")
 
 
796
 
797
  # Add image description if available through utils
798
  if content_utils_available:
799
  description = extract_image_description(result)
800
  if description:
801
+ md.append(f"*{description}*\n")
 
 
802
 
803
+ md.append("\n---\n")
804
 
805
  # Add document text section
806
+ md.append("## Text Content\n")
807
 
808
  # Extract text content systematically
809
  text_content = ""
810
+ structured_sections = {}
811
+
812
+ # Helper function to extract clean text from dictionary objects
813
+ def extract_clean_text(content):
814
+ if isinstance(content, str):
815
+ # Check if content is a stringified JSON
816
+ if content.strip().startswith("{") and content.strip().endswith("}"):
817
+ try:
818
+ # Try to parse as JSON
819
+ content_dict = json.loads(content.replace("'", '"'))
820
+ if 'text' in content_dict:
821
+ return content_dict['text']
822
+ return content
823
+ except Exception:
824
+ return content
825
+ return content
826
+ elif isinstance(content, dict):
827
+ # If it's a dictionary with a 'text' key, return just that value
828
+ if 'text' in content and isinstance(content['text'], str):
829
+ return content['text']
830
+ return content
831
+ return content
832
 
833
  if content_utils_available:
834
+ # Use the systematic utility function for main text
835
  text_content = extract_document_text(result)
836
+ text_content = extract_clean_text(text_content)
837
+
838
+ # Collect all available structured sections
839
+ if 'ocr_contents' in result:
840
+ for field, content in result['ocr_contents'].items():
841
+ # Skip certain fields that are handled separately
842
+ if field in ["raw_text", "error", "partial_text", "main_text"]:
843
+ continue
844
+
845
+ if content:
846
+ # Extract clean text from content if possible
847
+ clean_content = extract_clean_text(content)
848
+ # Add this as a structured section
849
+ structured_sections[field] = clean_content
850
  else:
851
  # Fallback extraction logic
852
  if 'ocr_contents' in result:
853
+ # First find main text
854
  for field in ["main_text", "content", "text", "transcript", "raw_text"]:
855
  if field in result['ocr_contents'] and result['ocr_contents'][field]:
856
  content = result['ocr_contents'][field]
 
864
  break
865
  except:
866
  pass
867
+
868
+ # Then collect all structured sections
869
+ for field, content in result['ocr_contents'].items():
870
+ # Skip certain fields that are handled separately
871
+ if field in ["raw_text", "error", "partial_text", "main_text", "content", "text", "transcript"]:
872
+ continue
873
+
874
+ if content:
875
+ # Add this as a structured section
876
+ structured_sections[field] = content
877
 
878
+ # Add the main text content - display raw text without a field label
879
  if text_content:
880
+ # Check if this is from raw_text (based on content match)
881
+ is_raw_text = False
882
+ if 'ocr_contents' in result and 'raw_text' in result['ocr_contents']:
883
+ if result['ocr_contents']['raw_text'] == text_content:
884
+ is_raw_text = True
885
 
886
+ # Display content without adding a "raw_text:" label
887
+ md.append(text_content + "\n\n")
888
 
889
+ # Add structured sections if available
890
+ if structured_sections:
891
+ for section_name, section_content in structured_sections.items():
892
+ # Build a display name for the section (note: no header is emitted below, so display_name is currently unused)
893
+ display_name = section_name.replace("_", " ").capitalize()
894
+ # Handle different content types
895
+ if isinstance(section_content, str):
896
+ md.append(section_content + "\n\n")
897
+ elif isinstance(section_content, dict):
898
+ # Dictionary content - format as key-value pairs
899
+ for key, value in section_content.items():
900
+ # Treat all values as plain text to maintain content purity
901
+ # This prevents JSON-like structures from being formatted as code blocks
902
+ md.append(f"**{key}:** {value}\n\n")
903
+ elif isinstance(section_content, list):
904
+ # List content - create a markdown list
905
+ for item in section_content:
906
+ # Treat all items as plain text
907
+ md.append(f"- {item}\n")
908
+ md.append("\n")
909
 
910
+ # Join all markdown parts into a single string
911
+ return "\n".join(md)
912
 
913
+ def add_result_files_to_zip(zipf, result, prefix=""):
 
 
914
  """
915
+ Add files for a single result to a zip file.
916
+
917
+ Args:
918
+ zipf: ZipFile instance to add files to
919
+ result: OCR result dictionary
920
+ prefix: Optional prefix for file paths in the zip
921
  """
922
+ if not result or not isinstance(result, dict):
923
+ return
924
+
925
+ # Create a timestamp for filename if not in result
926
+ timestamp = result.get('timestamp', datetime.now().strftime("%Y-%m-%d_%H-%M-%S"))
927
+
928
+ # Get base name for files
929
+ file_name = result.get('file_name', 'document')
930
+ base_name = Path(file_name).stem
931
+
932
+ try:
933
+ # 1. Add JSON file - with base64 data cleaned out
934
+ clean_result = clean_base64_from_result(result)
935
+ json_str = json.dumps(clean_result, indent=2)
936
+ zipf.writestr(f"{prefix}{base_name}.json", json_str)
937
+
938
+ # 2. Add markdown file that exactly matches Tab 1 display
939
+ # Use the create_markdown_with_images function to ensure it matches the UI exactly
940
+ try:
941
+ markdown_content = create_markdown_with_images(result)
942
+ zipf.writestr(f"{prefix}{base_name}.md", markdown_content)
943
+ except Exception as e:
944
+ logger.error(f"Error creating markdown: {str(e)}")
945
+ # Fallback to simpler markdown if error occurs
946
+ zipf.writestr(f"{prefix}{base_name}.md", f"# {file_name}\n\nError generating complete markdown output.")
947
+
948
+ # Extract and save images first to ensure they exist before creating markdown
949
+ img_paths = {}
950
+ has_images = result.get('has_images', False)
951
+
952
+ # 3. Add individual images if available
953
+ if has_images and 'pages_data' in result:
954
+ img_folder = f"{prefix}images/"
955
+ for page_idx, page in enumerate(result['pages_data']):
956
+ if 'images' in page and isinstance(page['images'], list):
957
+ for img_idx, img in enumerate(page['images']):
958
+ if 'image_base64' in img and img['image_base64']:
959
+ # Extract the base64 data
960
+ try:
961
+ # Get the base64 data
962
+ img_data = img['image_base64']
963
+
964
+ # Handle the base64 data carefully
965
+ if isinstance(img_data, str):
966
+ # If it has a data URI prefix, remove it
967
+ if ',' in img_data and ';base64,' in img_data:
968
+ # Keep the complete data after the comma
969
+ img_data = img_data.split(',', 1)[1]
970
+
971
+ # Make sure we have the complete data (not truncated)
972
+ try:
973
+ # Decode the base64 data with padding correction
974
+ # Add padding if needed to prevent truncation errors
975
+ missing_padding = len(img_data) % 4
976
+ if missing_padding:
977
+ img_data += '=' * (4 - missing_padding)
978
+ img_bytes = base64.b64decode(img_data)
979
+ except Exception as e:
980
+ logger.error(f"Base64 decoding error: {str(e)} for image {page_idx}-{img_idx}")
981
+ # Skip this image if we can't decode it
982
+ continue
983
+ else:
984
+ # If it's not a string (e.g., already bytes), use it directly
985
+ img_bytes = img_data
986
+
987
+ # Create image filename
988
+ image_filename = f"image_{page_idx+1}_{img_idx+1}.jpg"
989
+ img_paths[(page_idx, img_idx)] = image_filename
990
+
991
+ # Write the image to the zip file
992
+ zipf.writestr(f"{img_folder}{image_filename}", img_bytes)
993
+ except Exception as e:
994
+ logger.warning(f"Could not add image to zip: {str(e)}")
995
+
996
+ # 4. Add markdown with file references to images for offline viewing
997
+ try:
998
+ if has_images:
999
+ # Create markdown with file references
1000
+ file_ref_markdown = create_markdown_with_file_references(result, "images/")
1001
+ zipf.writestr(f"{prefix}{base_name}_with_files.md", file_ref_markdown)
1002
+ except Exception as e:
1003
+ logger.warning(f"Error creating markdown with file references: {str(e)}")
1004
+
1005
+ # 5. Add README.txt with explanation of file contents
1006
+ readme_content = f"""
1007
+ OCR RESULTS FOR: {file_name}
1008
+ Processed: {timestamp}
1009
 
1010
+ This archive contains the following files:
1011
 
1012
+ - {base_name}.json: Complete JSON data with all extracted information
1013
+ - {base_name}.md: Markdown document with embedded base64 images (exactly as shown in the app)
1014
+ - {base_name}_with_files.md: Alternative markdown with file references instead of base64 (for offline viewing)
1015
+ - images/ folder: Contains extracted images from the document (if present)
1016
 
1017
+ Generated by Historical OCR using Mistral AI
1018
+ """
1019
+ zipf.writestr(f"{prefix}README.txt", readme_content.strip())
1020
+
1021
+ except Exception as e:
1022
+ logger.error(f"Error adding files to zip: {str(e)}")
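The base64 handling inside `add_result_files_to_zip` (strip an optional data-URI prefix, restore missing `=` padding, then decode) can be illustrated in isolation. The following is a minimal sketch under stated assumptions, not the project's code: `decode_image_payload` is a hypothetical helper name, and it assumes the payload is either a base64 string or raw bytes, as the diff above does.

```python
import base64


def decode_image_payload(img_data):
    """Return raw image bytes from a base64 string, data URI, or bytes.

    Hypothetical helper mirroring the steps in the diff above.
    """
    # Already-decoded input is passed through unchanged.
    if isinstance(img_data, (bytes, bytearray)):
        return bytes(img_data)

    # Drop a "data:<mime>;base64," prefix if one is present.
    if ';base64,' in img_data:
        img_data = img_data.split(',', 1)[1]

    # Base64 text must have a length that is a multiple of 4;
    # append '=' characters to make up the difference.
    remainder = len(img_data) % 4
    if remainder:
        img_data += '=' * (4 - remainder)

    return base64.b64decode(img_data)
```

A payload whose length leaves a remainder of 1 is not valid base64 even after padding, so `b64decode` may still raise; the diff handles that case by logging the error and skipping the image.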
1023
 
1024
+ def create_markdown_with_images(result):
1025
+ """
1026
+ Create a clean Markdown document from OCR results that properly preserves
1027
+ image references and text structure, following the principle of content purity.
1028
+
1029
+ Args:
1030
+ result: OCR result dictionary
1031
+
1032
+ Returns:
1033
+ Markdown content as string
1034
+ """
1035
+ # Similar to create_markdown_with_file_references but embeds base64 images
1036
+ # Import content utils to use classification functions
1037
+ try:
1038
+ from utils.content_utils import classify_document_content, extract_document_text, extract_image_description
1039
+ content_utils_available = True
1040
+ except ImportError:
1041
+ content_utils_available = False
1042
+
1043
+ # Get content classification
1044
+ has_text = True
1045
+ has_images = False
1046
+
1047
+ if content_utils_available:
1048
+ classification = classify_document_content(result)
1049
+ has_text = classification['has_content']
1050
+ has_images = result.get('has_images', False)
1051
+ else:
1052
+ # Minimal fallback detection
1053
+ if 'has_images' in result:
1054
+ has_images = result['has_images']
1055
+
1056
+ # Check for image data more thoroughly
1057
+ if 'pages_data' in result and isinstance(result['pages_data'], list):
1058
+ for page in result['pages_data']:
1059
+ if isinstance(page, dict) and 'images' in page and page['images']:
1060
+ has_images = True
1061
+ break
1062
+
1063
+ # Start building the markdown document
1064
+ md = []
1065
+
1066
+ # Add document title/header
1067
+ md.append(f"# {result.get('file_name', 'Document')}\n")
1068
+
1069
+ # Add metadata section
1070
+ md.append("## Document Metadata\n")
1071
+
1072
+ # Add timestamp
1073
+ if 'timestamp' in result:
1074
+ md.append(f"**Processed:** {result['timestamp']}\n")
1075
+
1076
+ # Add languages if available
1077
+ if 'languages' in result and result['languages']:
1078
+ languages = [lang for lang in result['languages'] if lang]
1079
+ if languages:
1080
+ md.append(f"**Languages:** {', '.join(languages)}\n")
1081
+
1082
+ # Add document type and topics
1083
+ if 'detected_document_type' in result:
1084
+ md.append(f"**Document Type:** {result['detected_document_type']}\n")
1085
+
1086
+ if 'topics' in result and result['topics']:
1087
+ md.append(f"**Topics:** {', '.join(result['topics'])}\n")
1088
+
1089
+ md.append("\n---\n")
1090
+
1091
+ # Document title - extract from result if available
1092
+ if 'ocr_contents' in result and 'title' in result['ocr_contents'] and result['ocr_contents']['title']:
1093
+ title_content = result['ocr_contents']['title']
1094
+ md.append(f"## {title_content}\n")
1095
+
1096
+ # Add images if present - with base64 embedding
1097
+ if has_images and 'pages_data' in result:
1098
+ md.append("## Images\n")
1099
+
1100
+ # Extract and display all images with embedded base64
1101
+ for page_idx, page in enumerate(result['pages_data']):
1102
+ if 'images' in page and isinstance(page['images'], list):
1103
+ for img_idx, img in enumerate(page['images']):
1104
+ if 'image_base64' in img:
1105
+ # Use the base64 data directly
1106
+ image_caption = f"Image {page_idx+1}-{img_idx+1}"
1107
+ img_data = img['image_base64']
1108
+
1109
+ # Make sure it has proper data URI format
1110
+ if isinstance(img_data, str) and not img_data.startswith('data:'):
1111
+ img_data = f"data:image/jpeg;base64,{img_data}"
1112
+
1113
+ md.append(f"![{image_caption}]({img_data})\n")
1114
+
1115
+ # Add image description if available through utils
1116
+ if content_utils_available:
1117
+ description = extract_image_description(result)
1118
+ if description:
1119
+ md.append(f"*{description}*\n")
1120
+
1121
+ md.append("\n---\n")
1122
+
1123
+ # Add document text section
1124
+ md.append("## Text Content\n")
1125
+
1126
+ # Extract text content systematically
1127
+ text_content = ""
1128
+ structured_sections = {}
1129
+
1130
+ if content_utils_available:
1131
+ # Use the systematic utility function for main text
1132
+ text_content = extract_document_text(result)
1133
+
1134
+ # Collect all available structured sections
1135
+ if 'ocr_contents' in result:
1136
+ for field, content in result['ocr_contents'].items():
1137
+ # Skip certain fields that are handled separately
1138
+ if field in ["raw_text", "error", "partial_text", "main_text"]:
1139
+ continue
1140
+
1141
+ if content:
1142
+ # Add this as a structured section
1143
+ structured_sections[field] = content
1144
+ else:
1145
+ # Fallback extraction logic
1146
+ if 'ocr_contents' in result:
1147
+ # First find main text
1148
+ for field in ["main_text", "content", "text", "transcript", "raw_text"]:
1149
+ if field in result['ocr_contents'] and result['ocr_contents'][field]:
1150
+ content = result['ocr_contents'][field]
1151
+ if isinstance(content, str) and content.strip():
1152
+ text_content = content
1153
+ break
1154
+ elif isinstance(content, dict):
1155
+ # Try to convert complex objects to string
1156
+ try:
1157
+ text_content = json.dumps(content, indent=2)
1158
+ break
1159
+ except Exception:
1160
+ pass
1161
 
1162
+ # Then collect all structured sections
1163
+ for field, content in result['ocr_contents'].items():
1164
+ # Skip certain fields that are handled separately
1165
+ if field in ["raw_text", "error", "partial_text", "main_text", "content", "text", "transcript"]:
1166
+ continue
1167
+
1168
+ if content:
1169
+ # Add this as a structured section
1170
+ structured_sections[field] = content
1171
+
1172
+ # Add the main text content
1173
+ if text_content:
1174
+ md.append(text_content + "\n\n")
1175
+
1176
+ # Add structured sections if available
1177
+ if structured_sections:
1178
+ for section_name, section_content in structured_sections.items():
1179
+ # Use proper markdown header for sections - consistently capitalize all section names
1180
+ display_name = section_name.replace("_", " ").capitalize()
1181
+ md.append(f"### {display_name}\n")
1182
+ # Add a separator for clarity
1183
+ md.append("\n---\n\n")
1184
 
1185
+ # Handle different content types
1186
+ if isinstance(section_content, str):
1187
+ md.append(section_content + "\n\n")
1188
+ elif isinstance(section_content, dict):
1189
+ # Dictionary content - format as key-value pairs
1190
+ for key, value in section_content.items():
1191
+ # Treat all values as plain text to maintain content purity
1192
+ md.append(f"**{key}:** {value}\n\n")
1193
+ elif isinstance(section_content, list):
1194
+ # List content - create a markdown list
1195
+ for item in section_content:
1196
+ # Keep list items as plain text
1197
+ md.append(f"- {item}\n")
1198
+ md.append("\n")
1199
+
1200
+ # Join all markdown parts into a single string
1201
+ return "\n".join(md)
utils/text_utils.py CHANGED
@@ -1,6 +1,7 @@
1
  """Text utility functions for OCR processing"""
2
 
3
  import re
 
4
 
5
  def clean_raw_text(text):
6
  """Clean raw text by removing image references and serialized data.
@@ -14,24 +15,24 @@ def clean_raw_text(text):
14
  if not text or not isinstance(text, str):
15
  return ""
16
 
17
- # # Remove image references like ![image](data:image/...)
18
- # text = re.sub(r'!\[.*?\]\(data:image/[^)]+\)', '', text)
19
 
20
- # # Remove basic markdown image references like ![alt](img-1.jpg)
21
- # text = re.sub(r'!\[[^\]]*\]\([^)]+\)', '', text)
22
 
23
- # # Remove base64 encoded image data
24
- # text = re.sub(r'data:image/[^;]+;base64,[a-zA-Z0-9+/=]+', '', text)
25
 
26
- # # Remove image object references like [[OCRImageObject:...]]
27
- # text = re.sub(r'\[\[OCRImageObject:[^\]]+\]\]', '', text)
28
 
29
- # # Clean up any JSON-like image object references
30
- # text = re.sub(r'{"image(_data)?":("[^"]*"|null|true|false|\{[^}]*\}|\[[^\]]*\])}', '', text)
31
 
32
- # # Clean up excessive whitespace and line breaks created by removals
33
- # text = re.sub(r'\n{3,}', '\n\n', text)
34
- # text = re.sub(r'\s{3,}', ' ', text)
35
 
36
  return text.strip()
37
 
@@ -55,6 +56,45 @@ def format_markdown_text(text):
55
  # Convert any Windows line endings to Unix
56
  text = text.replace('\r\n', '\n')
57
 
58
  # Format dates (MM/DD/YYYY or similar patterns)
59
  date_pattern = r'\b(0?[1-9]|1[0-2])[\/\-\.](0?[1-9]|[12][0-9]|3[01])[\/\-\.](\d{4}|\d{2})\b'
60
  text = re.sub(date_pattern, r'**\g<0>**', text)
@@ -149,3 +189,26 @@ def format_markdown_text(text):
149
  processed_text = re.sub(r'([^\n])\n([^\n])', r'\1\n\n\2', processed_text)
150
 
151
  return processed_text
1
  """Text utility functions for OCR processing"""
2
 
3
  import re
4
+ import streamlit as st
5
 
6
  def clean_raw_text(text):
7
  """Clean raw text by removing image references and serialized data.
 
15
  if not text or not isinstance(text, str):
16
  return ""
17
 
18
+ # Remove image references like ![image](data:image/...)
19
+ text = re.sub(r'!\[.*?\]\(data:image/[^)]+\)', '', text)
20
 
21
+ # Remove basic markdown image references like ![alt](img-1.jpg)
22
+ text = re.sub(r'!\[[^\]]*\]\([^)]+\)', '', text)
23
 
24
+ # Remove base64 encoded image data
25
+ text = re.sub(r'data:image/[^;]+;base64,[a-zA-Z0-9+/=]+', '', text)
26
 
27
+ # Remove image object references like [[OCRImageObject:...]]
28
+ text = re.sub(r'\[\[OCRImageObject:[^\]]+\]\]', '', text)
29
 
30
+ # Clean up any JSON-like image object references
31
+ text = re.sub(r'{"image(_data)?":("[^"]*"|null|true|false|\{[^}]*\}|\[[^\]]*\])}', '', text)
32
 
33
+ # Clean up excessive whitespace and line breaks created by removals
34
+ text = re.sub(r'\n{3,}', '\n\n', text)
35
+ text = re.sub(r'\s{3,}', ' ', text)
36
 
37
  return text.strip()
38
 
 
56
  # Convert any Windows line endings to Unix
57
  text = text.replace('\r\n', '\n')
58
 
59
+ # Format keys with values to ensure keys are on their own line
60
+ # Pattern matches potential label/key patterns like 'key:' or '**key:**'
61
+ key_value_pattern = r'(\*\*[^:*\n]+:\*\*|\b[a-zA-Z_]+:\s+)'
62
+
63
+ # Process lines for key-value formatting
64
+ lines = text.split('\n')
65
+ processed_lines = []
66
+ for line in lines:
67
+ # Find all matches of the key-value pattern
68
+ matches = list(re.finditer(key_value_pattern, line))
69
+ if matches:
70
+ # Process each match in reverse to avoid messing up string indices
71
+ for match in reversed(matches):
72
+ key = match.group(1)
73
+ key_end = match.end()
74
+
75
+ # If the key is already bold, use it as is
76
+ if key.startswith('**') and key.endswith('**'):
77
+ formatted_key = key
78
+ else:
79
+ # Bold the key if it's not already bold
80
+ formatted_key = f"**{key.strip()}**"
81
+
82
+ # Split the line at this key's end position
83
+ before_key = line[:match.start()]
84
+ after_key = line[key_end:]
85
+
86
+ # If there's content before the key on the same line, end with newline
87
+ if before_key.strip():
88
+ before_key = f"{before_key.rstrip()}\n\n"
89
+
90
+ # Format: key on its own line, value on next line
91
+ line = f"{before_key}{formatted_key}\n{after_key.strip()}"
92
+
93
+ processed_lines.append(line)
94
+
95
+ # Join the processed lines
96
+ text = '\n'.join(processed_lines)
97
+
98
  # Format dates (MM/DD/YYYY or similar patterns)
99
  date_pattern = r'\b(0?[1-9]|1[0-2])[\/\-\.](0?[1-9]|[12][0-9]|3[01])[\/\-\.](\d{4}|\d{2})\b'
100
  text = re.sub(date_pattern, r'**\g<0>**', text)
 
189
  processed_text = re.sub(r'([^\n])\n([^\n])', r'\1\n\n\2', processed_text)
190
 
191
  return processed_text
192
+
193
+ def format_ocr_text(text, for_display=False):
194
+ """Format OCR text with optional HTML styling
195
+
196
+ Args:
197
+ text (str): The OCR text to format
198
+ for_display (bool): Currently ignored; the text is returned without any HTML wrapper either way
199
+
200
+ Returns:
201
+ str: Formatted text, without HTML container to keep content pure
202
+ """
203
+ if not text or not isinstance(text, str):
204
+ return ""
205
+
206
+ # Clean the text first
207
+ text = clean_raw_text(text)
208
+
209
+ # Format with markdown
210
+ formatted_text = format_markdown_text(text)
211
+
212
+ # Always return the clean formatted text without HTML wrappers
213
+ # This follows the principle of keeping content separate from presentation
214
+ return formatted_text
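The effect of the re-enabled substitutions in `clean_raw_text` is easier to see on a small input. The sketch below reuses three of the regular expressions shown in the diff (the data-URI image pattern, the plain Markdown image pattern, and the blank-line collapse); `strip_image_references` is a hypothetical name chosen for illustration and covers only that subset of the cleanup.

```python
import re


def strip_image_references(text):
    """Remove Markdown image references and collapse leftover blank lines.

    Hypothetical, reduced version of the cleanup shown in the diff.
    """
    if not text or not isinstance(text, str):
        return ""

    # Images embedded as data URIs: ![alt](data:image/...)
    text = re.sub(r'!\[.*?\]\(data:image/[^)]+\)', '', text)

    # Images referenced by file name: ![alt](img-1.jpg)
    text = re.sub(r'!\[[^\]]*\]\([^)]+\)', '', text)

    # Removals can leave runs of empty lines; keep at most one blank line.
    text = re.sub(r'\n{3,}', '\n\n', text)

    return text.strip()
```

Note that the second pattern matches any Markdown image, so text produced by this step no longer contains references that a later stage could resolve to files; whether that is desirable depends on where in the pipeline the cleanup runs.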
utils/ui_utils.py CHANGED
@@ -1,13 +1,14 @@
1
  """
2
  UI utilities for OCR results display.
3
  """
 
4
  import streamlit as st
5
  import json
6
  import base64
7
  import io
8
  from datetime import datetime
9
 
10
- from utils.image_utils import format_ocr_text, create_html_with_images
11
  from utils.content_utils import classify_document_content, format_structured_data
12
 
13
  def display_results(result, container, custom_prompt=""):
@@ -58,17 +59,55 @@ def display_results(result, container, custom_prompt=""):
58
  lang_html += '</div>'
59
  st.markdown(lang_html, unsafe_allow_html=True)
60
 
61
- # Create a separate line for Time if we have time-related tags
62
- if 'topics' in result and result['topics']:
63
- time_tags = [topic for topic in result['topics']
64
- if any(term in topic.lower() for term in ["century", "pre-", "era", "historical"])]
65
- if time_tags:
66
- time_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
67
- time_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Time:</div>'
68
- for tag in time_tags:
69
- time_html += f'<span class="subject-tag tag-time-period">{tag}</span>'
70
- time_html += '</div>'
71
- st.markdown(time_html, unsafe_allow_html=True)
72
 
73
  # Then display remaining subject tags if available
74
  if 'topics' in result and result['topics']:
@@ -199,118 +238,98 @@ def display_results(result, container, custom_prompt=""):
199
  doc_tab, json_tab = tabs
200
  img_tab = None
201
 
202
- # Document Content tab with simplified and systematic content handling
203
  with doc_tab:
204
- # Classify document content using our utility function
205
- content_classification = classify_document_content(result)
206
-
207
- # Track what content has been displayed to avoid redundancy
208
- displayed_content = set()
209
-
210
  # Create a single unified content section
211
- st.markdown("#### Document Content")
212
- st.markdown("##### Title")
213
 
214
- # Extract main structured content fields without redundancy
215
- text_fields = {}
216
-
217
- # Use the exact same approach as in Previous Results tab for consistency
218
- # Create a more focused list of important sections - prioritize main_text
219
- priority_sections = ["title", "main_text", "content", "transcript", "summary"]
220
- displayed_sections = set()
221
-
222
- # First display priority sections
223
- for section in priority_sections:
224
- if section in result['ocr_contents'] and result['ocr_contents'][section]:
225
- content = result['ocr_contents'][section]
226
  if isinstance(content, str) and content.strip():
227
- # Only add a subheader for meaningful section names, not raw_text
228
- if section != "raw_text" and section != "title":
229
- st.markdown(f"##### {section.replace('_', ' ').title()}")
230
 
231
- # Format and display content
232
- # First format any structured data (lists, dicts)
233
- structured_content = format_structured_data(content)
234
- # Then apply regular OCR text formatting
235
- formatted_content = format_ocr_text(structured_content)
236
- st.markdown(formatted_content)
237
- displayed_sections.add(section)
238
- break
239
- elif isinstance(content, dict):
240
- # Display dictionary content as key-value pairs
241
- for k, v in content.items():
242
- if k not in ['error', 'partial_text'] and v:
243
- st.markdown(f"**{k.replace('_', ' ').title()}**")
244
- if isinstance(v, str):
245
- # Format any structured data in the string
246
- formatted_v = format_structured_data(v)
247
- st.markdown(format_ocr_text(formatted_v))
248
- else:
249
- # Format non-string values (lists, dicts)
250
- formatted_v = format_structured_data(v)
251
- st.markdown(formatted_v)
252
- displayed_sections.add(section)
253
- break
254
- elif isinstance(content, list):
255
- # Format and display list items using our structured formatter
256
- formatted_list = format_structured_data(content)
257
- st.markdown(formatted_list)
258
- displayed_sections.add(section)
259
- break
260
-
261
- # Then display any remaining sections not already shown
262
- for section, content in result['ocr_contents'].items():
263
- if (section not in displayed_sections and
264
- section not in ['error', 'partial_text'] and
265
- content):
266
- st.markdown(f"##### {section.replace('_', ' ').title()}")
267
 
268
- if isinstance(content, str):
269
- # Format any structured data in the string before display
270
- structured_content = format_structured_data(content)
271
- st.markdown(format_ocr_text(structured_content))
272
- elif isinstance(content, list):
273
- # Format list using our structured formatter
274
- formatted_list = format_structured_data(content)
275
- st.markdown(formatted_list)
276
- elif isinstance(content, dict):
277
- # Format dictionary using our structured formatter
278
- formatted_dict = format_structured_data(content)
279
- st.markdown(formatted_dict)
280
 
281
- # Raw JSON tab - for viewing the raw OCR response data
282
  with json_tab:
283
- # Extract the relevant JSON data
284
- json_data = {}
285
-
286
- # Include important metadata
287
- for field in ['file_name', 'timestamp', 'processing_time', 'detected_document_type', 'languages', 'topics']:
288
- if field in result:
289
- json_data[field] = result[field]
290
-
291
- # Include OCR contents
292
- if 'ocr_contents' in result:
293
- json_data['ocr_contents'] = result['ocr_contents']
294
-
295
- # Exclude large binary data like base64 images to keep JSON clean
296
- if 'pages_data' in result:
297
- # Create simplified pages_data without large binary content
298
- simplified_pages = []
299
- for page in result['pages_data']:
300
- simplified_page = {
301
- 'page_number': page.get('page_number', 0),
302
- 'has_text': bool(page.get('markdown', '')),
303
- 'has_images': bool(page.get('images', [])),
304
- 'image_count': len(page.get('images', []))
305
- }
306
- simplified_pages.append(simplified_page)
307
- json_data['pages_summary'] = simplified_pages
308
 
309
  # Format the JSON prettily
310
- json_str = json.dumps(json_data, indent=2)
311
 
312
- # Display in a monospace font with syntax highlighting
313
- st.code(json_str, language="json")
314
 
315
 
316
  # Images tab - for viewing document images
@@ -324,90 +343,3 @@ def display_results(result, container, custom_prompt=""):
324
  if custom_prompt:
325
  with st.expander("Custom Processing Instructions"):
326
  st.write(custom_prompt)
327
-
328
- # No download heading - start directly with buttons
329
-
330
- # Create export section with a simple download menu
331
- st.markdown("<div style='margin-top: 15px;'></div>", unsafe_allow_html=True)
332
-
333
- # Prepare all download files at once to avoid rerun resets
334
- try:
335
- # 1. JSON download
336
- json_str = json.dumps(result, indent=2)
337
- json_filename = f"{result.get('file_name', 'document').split('.')[0]}_ocr.json"
338
-
339
- # 2. Text download with improved structure
340
- text_parts = []
341
- filename = result.get('file_name', 'document')
342
- text_parts.append(f"DOCUMENT: {filename}\n")
343
-
344
- if 'timestamp' in result:
345
- text_parts.append(f"Processed: {result['timestamp']}\n")
346
-
347
- if 'languages' in result and result['languages']:
348
- languages = [lang for lang in result['languages'] if lang is not None]
349
- if languages:
350
- text_parts.append(f"Languages: {', '.join(languages)}\n")
351
-
352
- if 'topics' in result and result['topics']:
353
- text_parts.append(f"Topics: {', '.join(result['topics'])}\n")
354
-
355
- text_parts.append("\n" + "="*50 + "\n\n")
356
-
357
- if 'ocr_contents' in result and 'title' in result['ocr_contents'] and result['ocr_contents']['title']:
358
- text_parts.append(f"TITLE: {result['ocr_contents']['title']}\n\n")
359
-
360
- content_added = False
361
-
362
- if 'ocr_contents' in result:
363
- for field in ["main_text", "content", "text", "transcript", "raw_text"]:
364
- if field in result['ocr_contents'] and result['ocr_contents'][field]:
365
- text_parts.append(f"CONTENT:\n\n{result['ocr_contents'][field]}\n")
366
- content_added = True
367
- break
368
-
369
- text_content = "\n".join(text_parts)
370
- text_filename = f"{result.get('file_name', 'document').split('.')[0]}_ocr.txt"
371
-
372
- # 3. HTML download
373
- from utils.image_utils import create_html_with_images
374
- html_content = create_html_with_images(result)
375
- html_filename = f"{result.get('file_name', 'document').split('.')[0]}_ocr.html"
376
-
377
- # Hide download options in an expander
378
- with st.expander("Download Options"):
379
- # Remove columns and use vertical layout instead
380
- # Add spacing between buttons for better readability
381
- st.download_button(
382
- label="JSON",
383
- data=json_str,
384
- file_name=json_filename,
385
- mime="application/json",
386
- key="download_json_btn",
387
- use_container_width=True
388
- )
389
-
390
- st.markdown("<div style='margin-top: 8px;'></div>", unsafe_allow_html=True)
391
-
392
- st.download_button(
393
- label="Text",
394
- data=text_content,
395
- file_name=text_filename,
396
- mime="text/plain",
397
- key="download_text_btn",
398
- use_container_width=True
399
- )
400
-
401
- st.markdown("<div style='margin-top: 8px;'></div>", unsafe_allow_html=True)
402
-
403
- st.download_button(
404
- label="HTML",
405
- data=html_content,
406
- file_name=html_filename,
407
- mime="text/html",
408
- key="download_html_btn",
409
- use_container_width=True
410
- )
411
-
412
- except Exception as e:
413
- st.error(f"Error preparing download files: {str(e)}")
 
1
  """
2
  UI utilities for OCR results display.
3
  """
4
+ import os
5
  import streamlit as st
6
  import json
7
  import base64
8
  import io
9
  from datetime import datetime
10
 
11
+ from utils.text_utils import format_ocr_text
12
  from utils.content_utils import classify_document_content, format_structured_data
13
 
14
  def display_results(result, container, custom_prompt=""):
 
59
  lang_html += '</div>'
60
  st.markdown(lang_html, unsafe_allow_html=True)
61
 
62
+ # Prepare download files
63
+ try:
64
+ # Get base filename
65
+ from utils.general_utils import create_descriptive_filename
66
+ original_file = result.get('file_name', 'document')
67
+ base_name = create_descriptive_filename(original_file, result, "")
68
+ base_name = os.path.splitext(base_name)[0]
69
+
70
+ # 1. JSON download - with base64 data truncated for readability
71
+ from utils.image_utils import truncate_base64_in_result
72
+ truncated_result = truncate_base64_in_result(result)
73
+ json_str = json.dumps(truncated_result, indent=2)
74
+ json_filename = f"{base_name}.json"
75
+ json_b64 = base64.b64encode(json_str.encode()).decode()
76
+
77
+ # 2. Create ZIP with all files
78
+ from utils.image_utils import create_results_zip_in_memory
79
+ zip_data = create_results_zip_in_memory(result)
80
+ timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
81
+ zip_filename = f"{base_name}_{timestamp}.zip"
82
+ zip_b64 = base64.b64encode(zip_data).decode()
83
+
84
+ # Add download line with metadata styling
85
+ download_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
86
+ download_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Download:</div>'
87
+
88
+ # Download links in order of importance, matching the zip file contents
89
+ download_html += f'<a href="data:application/json;base64,{json_b64}" download="{json_filename}" class="subject-tag tag-download">JSON</a>'
90
+
91
+ # Zip download link (packages everything together)
92
+ download_html += f'<a href="data:application/zip;base64,{zip_b64}" download="{zip_filename}" class="subject-tag tag-download">Zip Archive</a>'
93
+
94
+ download_html += '</div>'
95
+ st.markdown(download_html, unsafe_allow_html=True)
96
+ except Exception as e:
97
+ # Silent fail for downloads - don't disrupt the UI
98
+ pass
99
+
100
+ # Create a separate line for Time if we have time-related tags
101
+ if 'topics' in result and result['topics']:
102
+ time_tags = [topic for topic in result['topics']
103
+ if any(term in topic.lower() for term in ["century", "pre-", "era", "historical"])]
104
+ if time_tags:
105
+ time_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
106
+ time_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Time:</div>'
107
+ for tag in time_tags:
108
+ time_html += f'<span class="subject-tag tag-time-period">{tag}</span>'
109
+ time_html += '</div>'
110
+ st.markdown(time_html, unsafe_allow_html=True)
111
 
112
  # Then display remaining subject tags if available
113
  if 'topics' in result and result['topics']:
 
238
  doc_tab, json_tab = tabs
239
  img_tab = None
240
 
241
+ # Document Content tab with simple, clean formatting that matches markdown export files
242
  with doc_tab:
243
  # Create a single unified content section
244
+ st.markdown("## Text Content")
 
245
 
246
+ # Present content directly in the format used in markdown export files
247
+ if 'ocr_contents' in result and isinstance(result['ocr_contents'], dict):
248
+ # Get all content fields that should be displayed
249
+ content_fields = {}
250
+
251
+ # Add all available content fields (left_page, right_page, etc)
252
+ for field, content in result['ocr_contents'].items():
253
+ # Skip certain fields that shouldn't be displayed
254
+ if field in ['error', 'partial_text'] or not content:
255
+ continue
256
+
257
+ # Clean the content if it's a string
258
  if isinstance(content, str) and content.strip():
259
+ content_fields[field] = content.strip()
260
+ # Handle dictionary or list content
261
+ elif isinstance(content, (dict, list)):
262
+ formatted_content = format_structured_data(content)
263
+ if formatted_content:
264
+ content_fields[field] = formatted_content
265
+
266
+ # Process nested dictionary structures
267
+ def flatten_content_fields(fields, parent_key=""):
268
+ flat_fields = {}
269
+ for field, content in fields.items():
270
+ # Skip certain fields
271
+ if field in ['error', 'partial_text'] or not content:
272
+ continue
273
+
274
+ # Handle string content
275
+ if isinstance(content, str) and content.strip():
276
+ key = f"{parent_key}_{field}".strip("_")
277
+ flat_fields[key] = content.strip()
278
+ # Handle dictionary content
279
+ elif isinstance(content, dict):
280
+ # If the dictionary has a 'text' key, extract just that value
281
+ if 'text' in content and isinstance(content['text'], str):
282
+ key = f"{parent_key}_{field}".strip("_")
283
+ flat_fields[key] = content['text'].strip()
284
+ # Otherwise, recursively process nested dictionaries
285
+ else:
286
+ nested_fields = flatten_content_fields(content, f"{parent_key}_{field}")
287
+ flat_fields.update(nested_fields)
288
+ # Handle list content
289
+ elif isinstance(content, list):
290
+ formatted_content = format_structured_data(content)
291
+ if formatted_content:
292
+ key = f"{parent_key}_{field}".strip("_")
293
+ flat_fields[key] = formatted_content
294
+
295
+ return flat_fields
296
+
297
+ # Flatten the content structure
298
+ flat_content_fields = flatten_content_fields(result['ocr_contents'])
299
+
300
+ # Display the flattened content fields with proper formatting
301
+ for field, content in flat_content_fields.items():
302
+ # Skip any empty content
303
+ if not content or not content.strip():
304
+ continue
305
+
306
+ # Format field name as in the markdown export
307
+ field_display = field.replace('_', ' ')
308
 
309
+ # Maintain content purity - don't parse text content as JSON
310
+ # Historical text may contain curly braces that aren't JSON
311
+
312
+ # For raw_text field, display only the content without the field name
313
+ if field == 'raw_text':
314
+ st.markdown(f"{content}")
315
+ else:
316
+ # For other fields, display the field name in bold followed by the content
317
+ st.markdown(f"**{field}:** {content}")
 
+             # Add spacing between fields
+             st.markdown("\n\n")
 
+ # Raw JSON tab - displays the exact same JSON that's downloaded via the JSON button
  with json_tab:
+     # Use the same truncated JSON that's used in the download button
+     from utils.image_utils import truncate_base64_in_result
+     truncated_result = truncate_base64_in_result(result)
 
      # Format the JSON prettily
+     json_str = json.dumps(truncated_result, indent=2)
 
+     # Display JSON with a copy button using Streamlit's built-in functionality
+     st.json(truncated_result)
 
 
  # Images tab - for viewing document images
 
  if custom_prompt:
      with st.expander("Custom Processing Instructions"):
          st.write(custom_prompt)
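
The nested-field flattening that this hunk adds to the Document Content tab can be tried outside Streamlit. The sketch below is a minimal standalone version, not the project's code: it assumes list values can be rendered as a simple bullet list, because the real `format_structured_data` helper is not shown in this diff, and the `sample` dictionary is invented for illustration.

```python
def flatten_content_fields(fields, parent_key=""):
    """Flatten nested OCR content into {"parent_child": text} pairs."""
    flat_fields = {}
    for field, content in fields.items():
        # Skip diagnostic fields and empty values, as the diff does
        if field in ("error", "partial_text") or not content:
            continue

        key = f"{parent_key}_{field}".strip("_")
        if isinstance(content, str):
            if content.strip():
                flat_fields[key] = content.strip()
        elif isinstance(content, dict):
            if isinstance(content.get("text"), str):
                # A dict with a 'text' key collapses to just that text
                flat_fields[key] = content["text"].strip()
            else:
                # Otherwise recurse, carrying the joined key as the prefix
                flat_fields.update(flatten_content_fields(content, key))
        elif isinstance(content, list):
            # Hypothetical stand-in for format_structured_data (assumption)
            flat_fields[key] = "\n".join(f"- {item}" for item in content)
    return flat_fields


# Invented sample input, for illustration only
sample = {
    "raw_text": "  Dear Mary:  ",
    "error": "boom",
    "pages": {"left": {"text": " L "}, "right": {"notes": "R"}},
    "tags": ["a", "b"],
    "empty": "",
}
flat = flatten_content_fields(sample)
```

Keys from nested dictionaries are joined with underscores, so a display layer can turn `pages_right_notes` into a readable label with a single `replace('_', ' ')`.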
verify_fix.py ADDED
@@ -0,0 +1,70 @@
+ #!/usr/bin/env python3
+ import os
+ import streamlit as st
+ from ocr_processing import process_file
+
+ # Mock a file upload
+ class MockFile:
+     def __init__(self, name, content):
+         self.name = name
+         self._content = content
+
+     def getvalue(self):
+         return self._content
+
+ def test_image(image_path):
+     """Test OCR processing for a specific image"""
+     print(f"\n\n===== Testing {os.path.basename(image_path)} =====")
+
+     # Load the test image
+     with open(image_path, 'rb') as f:
+         file_bytes = f.read()
+
+     # Create mock file
+     uploaded_file = MockFile(os.path.basename(image_path), file_bytes)
+
+     # Process the file
+     result = process_file(uploaded_file)
+
+     # Display results summary
+     print("\nOCR Content Keys:")
+     for key in result['ocr_contents'].keys():
+         print(f"- {key}")
+
+     # Show a preview of raw_text
+     if 'raw_text' in result['ocr_contents']:
+         raw_text = result['ocr_contents']['raw_text']
+         preview = raw_text[:100] + "..." if len(raw_text) > 100 else raw_text
+         print(f"\nRaw Text Preview: {preview}")
+
+     # Check for duplicated content
+     found_duplicated = False
+     if 'raw_text' in result['ocr_contents']:
+         raw_text = result['ocr_contents']['raw_text']
+         # Check if the same text appears twice in sequence (a sign of duplication)
+         if len(raw_text) > 50:
+             half_point = len(raw_text) // 2
+             first_quarter = raw_text[:half_point//2].strip()
+             if first_quarter and len(first_quarter) > 20:
+                 if first_quarter in raw_text[half_point:]:
+                     found_duplicated = True
+                     print("\n⚠️ WARNING: Possible text duplication detected!")
+
+     if not found_duplicated:
+         print("\n✅ No text duplication detected")
+
+     return result
+
+ def main():
+     # Test with different image types
+     test_files = [
+         'input/magician-or-bottle-cungerer.jpg', # The problematic file
+         'input/recipe.jpg', # Simple text file
+         'input/handwritten-letter.jpg' # Mixed content
+     ]
+
+     for image_path in test_files:
+         test_image(image_path)
+
+ if __name__ == "__main__":
+     main()
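
The duplication check in `test_image` is a substring heuristic: it takes the first quarter of the text and looks for it again in the second half. A minimal sketch of that heuristic as a standalone function follows, so it can be exercised without calling `process_file`. The function name `looks_duplicated` and its keyword parameters are assumptions introduced here; the thresholds mirror the constants in the script.

```python
def looks_duplicated(raw_text, min_length=50, min_probe=20):
    """Return True if the opening of raw_text reappears in its second half.

    Hypothetical helper mirroring the check in verify_fix.py: texts of
    min_length characters or fewer are never flagged, and the probe (the
    stripped first quarter) must be longer than min_probe characters.
    """
    if len(raw_text) <= min_length:
        return False
    half_point = len(raw_text) // 2
    probe = raw_text[: half_point // 2].strip()
    return len(probe) > min_probe and probe in raw_text[half_point:]


block = "The quick brown fox jumps over the lazy dog near the river bank. "
doubled = block * 2
letter = (
    "Dear Mary: Having received your letter, "
    "I wanted to respond promptly and with thanks."
)
```

Here `looks_duplicated(doubled)` is true and `looks_duplicated(letter)` is false. The heuristic only catches duplication where the second copy begins at or after the midpoint, so a repeated block followed by a long tail of other text may go undetected.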
verify_segmentation_fix.py ADDED
@@ -0,0 +1,116 @@
+ """
+ Script to verify that our fixes properly prioritize text from segmented regions
+ in the OCR output, ensuring images don't overshadow text content.
+ """
+
+ import os
+ import json
+ import tempfile
+ from pathlib import Path
+ import logging
+ from PIL import Image
+
+ # Configure logging
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+ def verify_fix():
+     """
+     Simulate the OCR process with segmentation to verify text prioritization
+     """
+     print("Verifying segmentation and text prioritization fix...")
+     print("-" * 80)
+
+     # Create a simulated OCR result structure
+     ocr_result = {
+         "file_name": "test_document.jpg",
+         "topics": ["Document"],
+         "languages": ["English"],
+         "ocr_contents": {
+             "raw_text": "This is incorrect text that would be extracted from an image-focused OCR process.",
+             "title": "Test Document"
+         }
+     }
+
+     # Create simulated segmentation data that would be from our improved process
+     segmentation_data = {
+         'text_regions_coordinates': [(10, 10, 100, 20), (10, 40, 100, 20)],
+         'regions_count': 2,
+         'segmentation_applied': True,
+         'combined_text': "FIFTH AVENUE AT FIFTENTH STREET, NORTH\n\nBIRMINGHAM 2, ALABAMA\n\nDear Mary:\n\nHaving received your letter, I wanted to respond promptly.",
+         'region_results': [
+             {
+                 'text': "FIFTH AVENUE AT FIFTENTH STREET, NORTH",
+                 'coordinates': (10, 10, 100, 20),
+                 'order': 0
+             },
+             {
+                 'text': "BIRMINGHAM 2, ALABAMA",
+                 'coordinates': (10, 40, 100, 20),
+                 'order': 1
+             }
+         ]
+     }
+
+     # Create preprocessing options with segmentation data
+     preprocessing_options = {
+         'document_type': 'letter',
+         'segmentation_data': segmentation_data
+     }
+
+     # Import the clean_ocr_result function to test
+     from utils.image_utils import clean_ocr_result
+
+     # Process the result to see how text is prioritized
+     print("Original OCR text (before fix): ")
+     print(f" '{ocr_result['ocr_contents']['raw_text']}'")
+     print()
+
+     # Use our improved clean_ocr_result function
+     cleaned_result = clean_ocr_result(
+         ocr_result,
+         use_segmentation=True,
+         vision_enabled=True,
+         preprocessing_options=preprocessing_options
+     )
+
+     # Print the results to verify text prioritization
+     print("After applying fix (should prioritize segmented text):")
+
+     if 'segmentation_text' in cleaned_result['ocr_contents']:
+         print("✓ Segmentation text was properly added to results")
+         print(f" Segmentation text: '{cleaned_result['ocr_contents']['segmentation_text']}'")
+     else:
+         print("✗ Segmentation text was NOT added to results")
+
+     if cleaned_result['ocr_contents'].get('main_text') == segmentation_data['combined_text']:
+         print("✓ Segmentation text was correctly used as the main text")
+     else:
+         print("✗ Segmentation text was NOT used as the main text")
+
+     if 'original_raw_text' in cleaned_result['ocr_contents']:
+         print("✓ Original raw text was preserved as a backup")
+     else:
+         print("✗ Original raw text was NOT preserved")
+
+     if cleaned_result['ocr_contents'].get('raw_text') == segmentation_data['combined_text']:
+         print("✓ Raw text was correctly replaced with segmentation text")
+     else:
+         print("✗ Raw text was NOT replaced with segmentation text")
+
+     print()
+     print("Final OCR text content:")
+     print("-" * 30)
+     print(cleaned_result['ocr_contents'].get('raw_text', "No text found"))
+     print("-" * 30)
+
+     print()
+     print("Conclusion:")
+     if (cleaned_result['ocr_contents'].get('raw_text') == segmentation_data['combined_text'] and
+         cleaned_result['ocr_contents'].get('main_text') == segmentation_data['combined_text']):
+         print("✅ Fix successfully prioritizes text from segmented regions!")
+     else:
+         print("❌ Fix did NOT correctly prioritize text from segmented regions.")
+
+ if __name__ == "__main__":
+     verify_fix()
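
The script above checks four properties of the cleaned result: `segmentation_text` is present, `main_text` and `raw_text` both equal the combined segmentation text, and the previous `raw_text` survives as `original_raw_text`. The sketch below shows one way a function could satisfy that contract. It is an assumption for illustration only: `prioritize_segmented_text` is a hypothetical name, and the real `clean_ocr_result` in `utils.image_utils` is not shown in this diff and may behave differently.

```python
import copy


def prioritize_segmented_text(ocr_result, segmentation_data):
    """Return a copy of ocr_result in which segmented text takes priority.

    Hypothetical sketch of the contract verified above, not the project's
    clean_ocr_result implementation.
    """
    result = copy.deepcopy(ocr_result)  # leave the caller's dict untouched
    contents = result.setdefault("ocr_contents", {})

    combined = (segmentation_data or {}).get("combined_text", "")
    if not combined.strip():
        # Nothing usable from segmentation: keep the OCR text as it is
        return result

    if "raw_text" in contents:
        contents["original_raw_text"] = contents["raw_text"]  # backup
    contents["segmentation_text"] = combined
    contents["main_text"] = combined
    contents["raw_text"] = combined
    return result


# Invented inputs, for illustration only
ocr = {"ocr_contents": {"raw_text": "image-focused text", "title": "Test"}}
seg = {"combined_text": "BIRMINGHAM 2, ALABAMA\n\nDear Mary:"}
cleaned = prioritize_segmented_text(ocr, seg)
```

Working on a deep copy keeps the original OCR result available for comparison, which is what lets a verification script print the text before and after the fix.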