Rolling out modular v2

Changed files:

- .DS_Store +0 -0
- .clinerules/apiDocumentation.md +29 -0
- .clinerules/projectBrief.md +21 -0
- .clinerules/systemPatterns.md +31 -0
- README.md +5 -1
- app.py +44 -30
- config.py +5 -8
- constants.py +47 -8
- image_segmentation.py +21 -2
- language_detection.py +0 -1
- ocr_processing.py +11 -1
- ocr_utils.py +33 -1771
- preprocessing.py +521 -66
- process_file.py +2 -4
- requirements.txt +1 -0
- structured_ocr.py +130 -110
- test_magician.py → testing/test_magician.py +0 -0
- ui_components.py +114 -582
- utils/content_utils.py +189 -0
- utils/file_utils.py +100 -0
- utils/general_utils.py +163 -0
- utils/image_utils.py +886 -0
- utils/text_utils.py +151 -0
- utils/ui_utils.py +413 -0
.DS_Store
CHANGED
Binary files a/.DS_Store and b/.DS_Store differ
.clinerules/apiDocumentation.md
ADDED
@@ -0,0 +1,29 @@
+apiDocumentation.md
+API Interaction Documentation
+Mistral OCR API
+
+Endpoint: /v1/ocr
+
+Payload:
+
+image (binary)
+
+prompt (optional contextual instructions)
+
+Response:
+
+structured_data: Hierarchical text + metadata output
+
+raw_text: Plain extracted text
+
+Error Handling:
+
+Timeout retries (up to 3 attempts)
+
+Local fallback to Tesseract if Mistral service unavailable
+
+Tesseract Fallback
+
+Only invoked if Mistral API fails after retries.
+
+No structured output; raw text only.
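For orientation, the retry-and-fallback policy documented above could be wired up like this minimal sketch; `call_mistral_ocr` is a hypothetical stand-in for the app's real API wrapper, and only the 3-attempt limit and raw-text-only fallback come from the notes above.

```python
import time

import pytesseract
from PIL import Image


def ocr_with_fallback(image_path: str, call_mistral_ocr, max_attempts: int = 3):
    """Retry the Mistral OCR endpoint, then fall back to local Tesseract."""
    for attempt in range(1, max_attempts + 1):
        try:
            # Expected shape per the doc: structured_data + raw_text
            return call_mistral_ocr(image_path)
        except Exception:
            if attempt < max_attempts:
                time.sleep(2 ** attempt)  # back off before retrying
    # Tesseract fallback: raw text only, no structured output
    raw_text = pytesseract.image_to_string(Image.open(image_path))
    return {"structured_data": None, "raw_text": raw_text}
```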
.clinerules/projectBrief.md
ADDED
@@ -0,0 +1,21 @@
+# Foundation
+
+Historical OCR is an advanced optical character recognition (OCR) application designed to support historical research. It leverages Mistral AI's OCR models alongside image preprocessing pipelines optimized for archival material.
+
+High-Level Overview
+
+Building a Streamlit-based web application to process historical documents (images or PDFs), optimize them for OCR using advanced preprocessing techniques, and extract structured text and metadata through Mistral's large language models.
+
+Core Requirements and Goals
+
+Upload and preprocess historical documents
+
+Automatically detect document types (e.g., handwritten letters, scientific papers)
+
+Apply tailored OCR prompting and structured output based on document type
+
+Support user-defined contextual instructions to refine output
+
+Provide downloadable structured transcripts and analysis
+
+Example: "Building a Streamlit web app for OCR transcription and structured extraction from historical documents using Mistral AI."
.clinerules/systemPatterns.md
ADDED
@@ -0,0 +1,31 @@
+# System Architecture
+
+Frontend: Streamlit app (app.py) for user interface and interactions.
+
+Core Processing: ocr_processing.py orchestrates preprocessing, document type detection, and OCR operations.
+
+Image Preprocessing: preprocessing.py, image_segmentation.py handle deskewing, thresholding, and cleaning.
+
+OCR and Structuring: structured_ocr.py and ocr_utils.py manage API communication and formatting structured outputs.
+
+Utilities and Detection: language_detection.py, utils.py, and constants.py provide language detection, helpers, and prompt templates.
+
+Key Technical Decisions
+
+Streamlit cache management for upload processing efficiency.
+
+Modular design of preprocessing paths based on document type.
+
+Mistral AI as the primary OCR processor, with Tesseract fallback for redundancy.
+
+Design Patterns in Use
+
+Delegation: Frontend delegates all processing to backend orchestrators.
+
+Modularity: Preprocessing and OCR tasks divided into clean, testable modules.
+
+State-driven Processing: Output dynamically reflects session state and user input.
+
+Component Relationships
+
+app.py ⇨ ocr_processing.py ⇨ preprocessing.py, structured_ocr.py, language_detection.py, etc.
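A minimal sketch of the delegation pattern above: the frontend hands the upload to the backend orchestrator and only renders what comes back. The `process_file(uploaded_file, use_vision=True, ...)` signature matches the ocr_processing.py diff later in this commit; the rendering call is illustrative.

```python
import streamlit as st

from ocr_processing import process_file  # backend orchestrator

uploaded_file = st.file_uploader("Upload a historical document")
if uploaded_file is not None:
    # Frontend delegates all processing to the backend
    result = process_file(uploaded_file, use_vision=True)
    st.json(result)  # state-driven output rendering
```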
README.md
CHANGED
@@ -21,7 +21,11 @@ An advanced OCR application for historical document analysis using Mistral AI.
 
 - **OCR with Context:** AI-enhanced OCR optimized for historical documents
 - **Document Type Detection:** Automatically identifies handwritten letters, recipes, scientific texts, and more
-- **Image Preprocessing:**
+- **Advanced Image Preprocessing:**
+  - Automatic deskewing to correct document orientation
+  - Smart thresholding with Otsu and adaptive methods
+  - Morphological operations to clean up text
+  - Document-type specific optimization
 - **Custom Prompting:** Tailor the AI analysis with document-specific instructions
 - **Structured Output:** Returns organized, structured information based on document type
 
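The new preprocessing bullets map onto standard OpenCV operations; a hedged sketch follows (the block size 21 and constant 5 echo the handwritten defaults in config.py below, the remaining values are illustrative):

```python
import cv2

gray = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)

# Otsu thresholding: one global threshold derived from the histogram
_, otsu = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Adaptive thresholding: per-neighborhood thresholds for uneven lighting
adaptive = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, 21, 5)

# Morphological opening to clean speckle noise around the text
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
cleaned = cv2.morphologyEx(adaptive, cv2.MORPH_OPEN, kernel)
```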
app.py
CHANGED
@@ -41,7 +41,7 @@ from constants import (
 )
 from structured_ocr import StructuredOCR
 from config import MISTRAL_API_KEY
-from ocr_utils import create_results_zip
+from utils.image_utils import create_results_zip
 
 # Set favicon path
 favicon_path = os.path.join(os.path.dirname(__file__), "static/favicon.png")
@@ -74,20 +74,47 @@ st.set_page_config(
 # Consult https://docs.streamlit.io/library/advanced-features/session-state for details.
 # ========================================================================================
 
+def reset_document_state():
+    """Reset only document-specific state variables
+
+    This function explicitly resets all document-related variables to ensure
+    clean state between document processing, preventing cached data issues.
+    """
+    st.session_state.sample_document = None
+    st.session_state.original_sample_bytes = None
+    st.session_state.original_sample_name = None
+    st.session_state.original_sample_mime_type = None
+    st.session_state.is_sample_document = False
+    st.session_state.processed_document_active = False
+    st.session_state.sample_document_processed = False
+    st.session_state.sample_just_loaded = False
+    st.session_state.last_processed_file = None
+    st.session_state.selected_previous_result = None
+    # Keep temp_file_paths but ensure it's empty after cleanup
+    if 'temp_file_paths' in st.session_state:
+        st.session_state.temp_file_paths = []
+
 def init_session_state():
     """Initialize session state variables if they don't already exist
 
     This function follows Streamlit's recommended patterns for state initialization.
     It only creates variables if they don't exist yet and doesn't modify existing values.
     """
+    # Initialize persistent app state variables
     if 'previous_results' not in st.session_state:
         st.session_state.previous_results = []
     if 'temp_file_paths' not in st.session_state:
         st.session_state.temp_file_paths = []
-    if 'last_processed_file' not in st.session_state:
-        st.session_state.last_processed_file = None
     if 'auto_process_sample' not in st.session_state:
         st.session_state.auto_process_sample = False
+    if 'close_clicked' not in st.session_state:
+        st.session_state.close_clicked = False
+    if 'active_tab' not in st.session_state:
+        st.session_state.active_tab = 0
+
+    # Initialize document-specific state variables
+    if 'last_processed_file' not in st.session_state:
+        st.session_state.last_processed_file = None
     if 'sample_just_loaded' not in st.session_state:
         st.session_state.sample_just_loaded = False
     if 'processed_document_active' not in st.session_state:
@@ -104,10 +131,6 @@ def init_session_state():
         st.session_state.is_sample_document = False
     if 'selected_previous_result' not in st.session_state:
         st.session_state.selected_previous_result = None
-    if 'close_clicked' not in st.session_state:
-        st.session_state.close_clicked = False
-    if 'active_tab' not in st.session_state:
-        st.session_state.active_tab = 0
 
 def close_document():
     """Called when the Close Document button is clicked
@@ -120,24 +143,17 @@ def close_document():
     That approach breaks Streamlit's execution flow and causes UI artifacts.
     """
     logger.info("Close document button clicked")
-    # Save the previous results
-    previous_results = st.session_state.previous_results if 'previous_results' in st.session_state else []
 
-    # Clean up temp files
+    # Clean up temp files first
     if 'temp_file_paths' in st.session_state and st.session_state.temp_file_paths:
         logger.info(f"Cleaning up {len(st.session_state.temp_file_paths)} temporary files")
         handle_temp_files(st.session_state.temp_file_paths)
 
-    #
-    for key in list(st.session_state.keys()):
-        if key != 'previous_results' and key != 'close_clicked':
-            st.session_state.pop(key, None)
+    # Reset all document-specific state variables to prevent caching issues
+    reset_document_state()
 
-    # Set flag for having cleaned up
+    # Set flag for having cleaned up - this will trigger a rerun in main()
     st.session_state.close_clicked = True
-
-    # Restore the previous results
-    st.session_state.previous_results = previous_results
 
 def show_example_documents():
     """Show example documents section"""
@@ -251,14 +267,12 @@ def show_example_documents():
 
     # Reset any document state before loading a new sample
     if st.session_state.processed_document_active:
-        # Clear previous document state
-        st.session_state.processed_document_active = False
-        st.session_state.last_processed_file = None
-
         # Clean up any temporary files from previous processing
        if st.session_state.temp_file_paths:
             handle_temp_files(st.session_state.temp_file_paths)
+
+        # Reset all document-specific state variables
+        reset_document_state()
 
     # Save download info in session state
     st.session_state.sample_document = SampleDocument(
@@ -350,6 +364,7 @@ def process_document(uploaded_file, left_col, right_col, sidebar_options):
     progress_placeholder = st.empty()
 
     # Image preprocessing preview - show if image file and preprocessing options are set
+    # Remove the document active check to show preview immediately after selection
     if (any(sidebar_options["preprocessing_options"].values()) and
         uploaded_file.type.startswith('image/')):
 
@@ -530,13 +545,14 @@ def main():
     sidebar_options = create_sidebar_options()
 
     # Create main layout with tabs - simpler, more compact approach
-    tab_names = ["Document Processing", "Sample Documents", "
-    main_tab1, main_tab2, main_tab3
+    tab_names = ["Document Processing", "Sample Documents", "Learn More"]
+    main_tab1, main_tab2, main_tab3 = st.tabs(tab_names)
 
     with main_tab1:
         # Create a two-column layout for file upload and results with minimal padding
         st.markdown('<style>.block-container{padding-top: 1rem; padding-bottom: 0;}</style>', unsafe_allow_html=True)
-        left_col, right_col = st.columns(
+        # Using a 2:3 column ratio gives more space to the results column
+        left_col, right_col = st.columns([2, 3])
 
         with left_col:
             # Create file uploader
@@ -575,11 +591,9 @@
 
         show_example_documents()
 
-
-        # Previous results tab
-        display_previous_results()
+        # Previous results tab temporarily removed
 
-    with
+    with main_tab3:
         # About tab
         display_about_tab()
 
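The close_clicked flow above follows Streamlit's callback conventions: the on_click callback only mutates session state, and main() reacts on the next script run. A stripped-down sketch of that flag-then-rerun pattern (assuming a recent Streamlit with st.rerun; not the app's literal code):

```python
import streamlit as st

if "close_clicked" not in st.session_state:
    st.session_state.close_clicked = False

def close_document():
    # Callbacks only set flags; no UI work here (avoids mid-run artifacts)
    st.session_state.close_clicked = True

st.button("Close Document", on_click=close_document)

if st.session_state.close_clicked:
    st.session_state.close_clicked = False
    st.rerun()  # re-execute the script from the top with the reset state
```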
config.py
CHANGED
@@ -40,22 +40,19 @@ VISION_MODEL = os.environ.get("MISTRAL_VISION_MODEL", "mistral-small-latest") #
 # Image preprocessing settings optimized for historical documents
 # These can be customized from environment variables
 IMAGE_PREPROCESSING = {
-    "enhance_contrast": float(os.environ.get("ENHANCE_CONTRAST", "1.
+    "enhance_contrast": float(os.environ.get("ENHANCE_CONTRAST", "1.8")), # Increased contrast for better text recognition
     "sharpen": os.environ.get("SHARPEN", "True").lower() in ("true", "1", "yes"),
     "denoise": os.environ.get("DENOISE", "True").lower() in ("true", "1", "yes"),
     "max_size_mb": float(os.environ.get("MAX_IMAGE_SIZE_MB", "12.0")), # Increased size limit for better quality
     "target_dpi": int(os.environ.get("TARGET_DPI", "300")), # Target DPI for scaling
-    "compression_quality": int(os.environ.get("COMPRESSION_QUALITY", "
-    # Enhanced settings for handwritten documents
+    "compression_quality": int(os.environ.get("COMPRESSION_QUALITY", "100")), # Higher quality for better OCR results
+    # # Enhanced settings for handwritten documents
     "handwritten": {
-        "contrast": float(os.environ.get("HANDWRITTEN_CONTRAST", "1.2")), # Lower contrast for handwritten text
         "block_size": int(os.environ.get("HANDWRITTEN_BLOCK_SIZE", "21")), # Larger block size for adaptive thresholding
         "constant": int(os.environ.get("HANDWRITTEN_CONSTANT", "5")), # Lower constant for adaptive thresholding
         "use_dilation": os.environ.get("HANDWRITTEN_DILATION", "True").lower() in ("true", "1", "yes"), # Connect broken strokes
-        "
-        "
-        "bilateral_sigma1": int(os.environ.get("HANDWRITTEN_BILATERAL_SIGMA1", "25")), # Color sigma
-        "bilateral_sigma2": int(os.environ.get("HANDWRITTEN_BILATERAL_SIGMA2", "45")) # Space sigma
+        "dilation_iterations": int(os.environ.get("HANDWRITTEN_DILATION_ITERATIONS", "2")), # More iterations for better stroke connection
+        "dilation_kernel_size": int(os.environ.get("HANDWRITTEN_DILATION_KERNEL_SIZE", "3")) # Larger kernel for dilation
     }
 }
 
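Because every setting reads from os.environ with a default, deployments can tune preprocessing without code changes; for example (the override values here are arbitrary):

```python
import os

# Must be set before config.py is imported
os.environ["ENHANCE_CONTRAST"] = "2.0"
os.environ["HANDWRITTEN_DILATION_ITERATIONS"] = "3"

from config import IMAGE_PREPROCESSING

print(IMAGE_PREPROCESSING["enhance_contrast"])  # -> 2.0
```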
constants.py
CHANGED
@@ -138,17 +138,56 @@ CONTENT_THEMES = {
 }
 
 # Period tags based on year ranges
+# These ranges are used to assign historical period tags to documents based on their year.
 PERIOD_TAGS = {
-    (0,
-    (
-    (
-    (
-    (
+    (0, 499): "Ancient Era (to 500 CE)",
+    (500, 999): "Early Medieval (500–1000)",
+    (1000, 1299): "High Medieval (1000–1300)",
+    (1300, 1499): "Late Medieval (1300–1500)",
+    (1500, 1599): "Renaissance (1500–1600)",
+    (1600, 1699): "Early Modern (1600–1700)",
+    (1700, 1775): "Enlightenment (1700–1775)",
+    (1776, 1799): "Age of Revolutions (1776–1800)",
+    (1800, 1849): "Early 19th Century (1800–1850)",
+    (1850, 1899): "Late 19th Century (1850–1900)",
+    (1900, 1918): "Early 20th Century & WWI (1900–1918)",
+    (1919, 1938): "Interwar Period (1919–1938)",
+    (1939, 1945): "World War II (1939–1945)",
+    (1946, 1968): "Postwar & Mid-20th Century (1946–1968)",
+    (1969, 1989): "Late 20th Century (1969–1989)",
+    (1990, 2000): "Turn of the 21st Century (1990–2000)",
+    (2001, 2099): "Contemporary (21st Century)"
 }
 
-# Default fallback tags
-DEFAULT_TAGS = [
-
+# Default fallback tags for documents when no specific tags are detected.
+DEFAULT_TAGS = [
+    "Document",
+    "Historical",
+    "Text",
+    "Primary Source",
+    "Archival Material",
+    "Record",
+    "Manuscript",
+    "Printed Material",
+    "Correspondence",
+    "Publication"
+]
+
+# Generic tags that can be used for broad categorization or as supplemental tags.
+GENERIC_TAGS = [
+    "Archive",
+    "Content",
+    "Record",
+    "Source",
+    "Material",
+    "Page",
+    "Scan",
+    "Image",
+    "Transcription",
+    "Uncategorized",
+    "General",
+    "Miscellaneous"
+]
 
 # UI constants
 PROGRESS_DELAY = 0.8 # Seconds to show completion message
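The PERIOD_TAGS keys are inclusive (start, end) year ranges, so tagging a document reduces to a range scan; a hypothetical helper (not part of this commit):

```python
from constants import DEFAULT_TAGS, PERIOD_TAGS

def period_tag_for_year(year: int) -> str:
    """Map a year to its historical period tag, falling back to a generic tag."""
    for (start, end), tag in PERIOD_TAGS.items():
        if start <= year <= end:
            return tag
    return DEFAULT_TAGS[0]

print(period_tag_for_year(1792))  # -> "Age of Revolutions (1776–1800)"
```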
image_segmentation.py
CHANGED
@@ -18,12 +18,13 @@ logging.basicConfig(level=logging.INFO,
                     format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
 logger = logging.getLogger(__name__)
 
-def segment_image_for_ocr(image_path: Union[str, Path]) -> Dict[str, Union[Image.Image, str]]:
+def segment_image_for_ocr(image_path: Union[str, Path], vision_enabled: bool = True) -> Dict[str, Union[Image.Image, str]]:
     """
     Segment an image into text and image regions for improved OCR processing.
 
     Args:
         image_path: Path to the image file
+        vision_enabled: Whether the vision model is enabled
 
     Returns:
         Dict containing:
@@ -41,6 +42,23 @@ def segment_image_for_ocr(image_path: Union[str, Path]) -> Dict[str, Union[Image
     try:
         # Open original image with PIL for compatibility
         with Image.open(image_file) as pil_img:
+            # --- 2 · Stop "text page detected as image" when vision model is off ---
+            if not vision_enabled:
+                # Import the entropy calculator from utils.image_utils
+                from utils.image_utils import calculate_image_entropy
+
+                # Calculate entropy to determine if this is line art or blank
+                ent = calculate_image_entropy(pil_img)
+                if ent < 3.5:  # Heuristically low → line-art or blank page
+                    logger.info(f"Low entropy image detected ({ent:.2f}), classifying as illustration")
+                    # Return minimal result for illustration
+                    return {
+                        'text_regions': None,
+                        'image_regions': pil_img,
+                        'text_mask_base64': None,
+                        'combined_result': None,
+                        'text_regions_coordinates': []
+                    }
             # Convert to RGB if not already
             if pil_img.mode != 'RGB':
                 pil_img = pil_img.convert('RGB')
@@ -89,7 +107,8 @@ def segment_image_for_ocr(image_path: Union[str, Path]) -> Dict[str, Union[Image
 
             # Additional check for text-like characteristics
             # Text typically has aspect ratio > 1 (wider than tall) and reasonable density
-
+            # Relaxed aspect ratio constraints and lowered density threshold for better detection
+            if (aspect_ratio > 1.2 or aspect_ratio < 0.7) and dark_pixel_density > 0.15:
                 # Add to text regions list
                 text_regions.append((x, y, w, h))
                 # Add to text mask
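calculate_image_entropy is imported from utils/image_utils.py, but its body isn't shown in this commit; Shannon entropy over the grayscale histogram would satisfy the < 3.5 heuristic above. A sketch under that assumption, not the repo's actual implementation:

```python
import numpy as np
from PIL import Image

def calculate_image_entropy(img: Image.Image) -> float:
    """Shannon entropy (in bits) of the grayscale pixel distribution."""
    hist = np.bincount(np.asarray(img.convert("L")).ravel(), minlength=256)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins so log2 is defined
    return float(-(p * np.log2(p)).sum())
```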
language_detection.py
CHANGED
@@ -64,7 +64,6 @@ class LanguageDetector:
             "patterns": ['oi[ts]$', 'oi[re]$', 'f[^aeiou]', 'ff', 'ſ', 'auoit', 'eſtoit',
                          'ſi', 'ſur', 'ſa', 'cy', 'ayant', 'oy', 'uſ', 'auſ']
         },
-        "exclusivity": 2.0 # French indicators have higher weight in historical text detection
     },
     "German": {
         "chars": ['ä', 'ö', 'ü', 'ß'],
ocr_processing.py
CHANGED
@@ -17,6 +17,9 @@ import streamlit as st
 
 # Local application imports
 from structured_ocr import StructuredOCR
+# Import from updated utils directory
+from utils.image_utils import clean_ocr_result
+# Temporarily retain old utils imports until they are fully migrated
 from utils import generate_cache_key, timing, format_timestamp, create_descriptive_filename, extract_subject_tags
 from preprocessing import apply_preprocessing_to_file
 from error_handler import handle_ocr_error, check_file_size
@@ -239,7 +242,7 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
 
     try:
         # Perform image segmentation
-        segmentation_results = segment_image_for_ocr(temp_path)
+        segmentation_results = segment_image_for_ocr(temp_path, vision_enabled=use_vision)
 
         if segmentation_results['combined_result'] is not None:
             # Save the segmented result to a new temporary file
@@ -357,6 +360,13 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
     # Add additional metadata to result
     result = process_result(result, uploaded_file, preprocessing_options)
 
+    # 🔧 ALWAYS normalize result before returning
+    result = clean_ocr_result(
+        result,
+        use_segmentation=use_segmentation,
+        vision_enabled=use_vision
+    )
+
     # Complete progress
     progress_reporter.complete()
 
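clean_ocr_result lives in utils/image_utils.py and its body isn't shown here; judging from the call site above, it normalizes the result dict so downstream UI code sees a stable shape. A guess at the contract, clearly hypothetical:

```python
def clean_ocr_result(result: dict, use_segmentation: bool = False,
                     vision_enabled: bool = True) -> dict:
    """Ensure the OCR result always exposes the keys the UI relies on."""
    result.setdefault("ocr_contents", {}).setdefault("raw_text", "")
    # Record how the result was produced so rendering can adapt
    processing = result.setdefault("processing", {})
    processing["segmentation"] = use_segmentation
    processing["vision_enabled"] = vision_enabled
    return result
```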
ocr_utils.py
CHANGED
@@ -1,110 +1,38 @@
 """
-
-
 """
 
-
-import json
 import base64
-import io
-import zipfile
 import logging
-import time
-from datetime import datetime
 from pathlib import Path
-from typing import
-from functools import lru_cache
 
 # Configure logging
 logging.basicConfig(level=logging.INFO,
-
 logger = logging.getLogger(__name__)
 
-#
-
 
-# Check for image processing libraries
 try:
-    from PIL import Image
     PILLOW_AVAILABLE = True
 except ImportError:
     logger.warning("PIL not available - image preprocessing will be limited")
     PILLOW_AVAILABLE = False
 
-try:
-    import cv2
-    CV2_AVAILABLE = True
-except ImportError:
-    logger.warning("OpenCV (cv2) not available - advanced image processing will be limited")
-    CV2_AVAILABLE = False
-
-# Mistral AI imports
-from mistralai import DocumentURLChunk, ImageURLChunk, TextChunk
-from mistralai.models import OCRImageObject
-
-# Import configuration
-try:
-    from config import IMAGE_PREPROCESSING
-except ImportError:
-    # Fallback defaults if config not available
-    IMAGE_PREPROCESSING = {
-        "enhance_contrast": 1.5,
-        "sharpen": True,
-        "denoise": True,
-        "max_size_mb": 8.0,
-        "target_dpi": 300,
-        "compression_quality": 92
-    }
-
-def replace_images_in_markdown(markdown_str: str, images_dict: dict) -> str:
-    """
-    Replace image placeholders in markdown with base64-encoded images.
-
-    Args:
-        markdown_str: Markdown text containing image placeholders
-        images_dict: Dictionary mapping image IDs to base64 strings
-
-    Returns:
-        Markdown text with images replaced by base64 data
-    """
-    for img_name, base64_str in images_dict.items():
-        markdown_str = markdown_str.replace(
-            f"![{img_name}]({img_name})", f"![{img_name}]({base64_str})"
-        )
-    return markdown_str
-
-def get_combined_markdown(ocr_response) -> str:
-    """
-    Combine OCR text and images into a single markdown document.
-
-    Args:
-        ocr_response: OCR response object from Mistral AI
-
-    Returns:
-        Combined markdown string with embedded images
-    """
-    markdowns = []
-
-    # Process each page of the OCR response
-    for page in ocr_response.pages:
-        # Extract image data if available
-        image_data = {}
-        if hasattr(page, "images"):
-            for img in page.images:
-                if hasattr(img, "id") and hasattr(img, "image_base64"):
-                    image_data[img.id] = img.image_base64
-
-        # Replace image placeholders with base64 data
-        page_markdown = page.markdown if hasattr(page, "markdown") else ""
-        processed_markdown = replace_images_in_markdown(page_markdown, image_data)
-        markdowns.append(processed_markdown)
-
-    # Join all pages' markdown with double newlines
-    return "\n\n".join(markdowns)
 
 def encode_image_for_api(image_path: Union[str, Path]) -> str:
     """
-    Encode an image as base64 data URL for API submission.
 
     Args:
         image_path: Path to the image file
@@ -135,1703 +63,37 @@ def encode_image_for_api(image_path: Union[str, Path]) -> str:
     encoded = base64.b64encode(image_file.read_bytes()).decode()
     return f"data:{mime_type};base64,{encoded}"
 
-def encode_bytes_for_api(file_bytes: bytes, mime_type: str) -> str:
-    """
-    Encode binary data as base64 data URL for API submission.
-
-    Args:
-        file_bytes: Binary file data
-        mime_type: MIME type of the file (e.g., 'image/jpeg', 'application/pdf')
-
-    Returns:
-        Base64 data URL for the data
-    """
-    # Encode data as base64
-    encoded = base64.b64encode(file_bytes).decode()
-    return f"data:{mime_type};base64,{encoded}"
-
-def process_image_with_ocr(client, image_path: Union[str, Path], model: str = "mistral-ocr-latest"):
-    """
-    Process an image with OCR and return the response.
-
-    Args:
-        client: Mistral AI client
-        image_path: Path to the image file
-        model: OCR model to use
-
-    Returns:
-        OCR response object
-    """
-    # Encode image as base64
-    base64_data_url = encode_image_for_api(image_path)
-
-    # Process image with OCR
-    image_response = client.ocr.process(
-        document=ImageURLChunk(image_url=base64_data_url),
-        model=model
-    )
-
-    return image_response
-
-def ocr_response_to_json(ocr_response, indent: int = 4) -> str:
-    """
-    Convert OCR response to a formatted JSON string.
-
-    Args:
-        ocr_response: OCR response object
-        indent: Indentation level for JSON formatting
-
-    Returns:
-        Formatted JSON string
-    """
-    # Convert OCR response to a dictionary
-    response_dict = {
-        "text": ocr_response.text if hasattr(ocr_response, "text") else "",
-        "pages": []
-    }
-
-    # Process pages if available
-    if hasattr(ocr_response, "pages"):
-        for page in ocr_response.pages:
-            page_dict = {
-                "text": page.text if hasattr(page, "text") else "",
-                "markdown": page.markdown if hasattr(page, "markdown") else "",
-                "images": []
-            }
-
-            # Process images if available
-            if hasattr(page, "images"):
-                for img in page.images:
-                    img_dict = {
-                        "id": img.id if hasattr(img, "id") else "",
-                        "base64": img.image_base64 if hasattr(img, "image_base64") else ""
-                    }
-                    page_dict["images"].append(img_dict)
-
-            response_dict["pages"].append(page_dict)
-
-    # Convert dictionary to JSON
-    return json.dumps(response_dict, indent=indent)
-
-def create_results_zip_in_memory(results):
-    """
-    Create a zip file containing OCR results in memory.
-
-    Args:
-        results: Dictionary or list of OCR results
-
-    Returns:
-        Binary zip file data
-    """
-    # Create a BytesIO object
-    zip_buffer = io.BytesIO()
-
-    # Check if results is a list or a dictionary
-    is_list = isinstance(results, list)
-
-    # Create zip file in memory
-    with zipfile.ZipFile(zip_buffer, 'w', zipfile.ZIP_DEFLATED) as zipf:
-        if is_list:
-            # Handle list of results
-            for i, result in enumerate(results):
-                try:
-                    # Create a descriptive base filename for this result
-                    base_filename = result.get('file_name', f'document_{i+1}').split('.')[0]
-
-                    # Add document type if available
-                    if 'topics' in result and result['topics']:
-                        topic = result['topics'][0].lower().replace(' ', '_')
-                        base_filename = f"{base_filename}_{topic}"
-
-                    # Add language if available
-                    if 'languages' in result and result['languages']:
-                        lang = result['languages'][0].lower()
-                        # Only add if it's not already in the filename
-                        if lang not in base_filename.lower():
-                            base_filename = f"{base_filename}_{lang}"
-
-                    # For PDFs, add page information
-                    if 'total_pages' in result and 'processed_pages' in result:
-                        base_filename = f"{base_filename}_p{result['processed_pages']}of{result['total_pages']}"
-
-                    # Add timestamp if available
-                    if 'timestamp' in result:
-                        try:
-                            # Try to parse the timestamp and reformat it
-                            dt = datetime.strptime(result['timestamp'], "%Y-%m-%d %H:%M")
-                            timestamp = dt.strftime("%Y%m%d_%H%M%S")
-                            base_filename = f"{base_filename}_{timestamp}"
-                        except:
-                            pass
-
-                    # Add JSON results for each file with descriptive name
-                    result_json = json.dumps(result, indent=2)
-                    zipf.writestr(f"{base_filename}.json", result_json)
-
-                    # Add HTML content (generated from the result)
-                    html_content = create_html_with_images(result)
-                    zipf.writestr(f"{base_filename}_with_images.html", html_content)
-
-                    # Add raw OCR text if available
-                    if "ocr_contents" in result and "raw_text" in result["ocr_contents"]:
-                        zipf.writestr(f"{base_filename}.txt", result["ocr_contents"]["raw_text"])
-
-                    # Add HTML visualization if available
-                    if "html_visualization" in result:
-                        zipf.writestr(f"visualization_{i+1}.html", result["html_visualization"])
-
-                    # Add images if available (limit to conserve memory)
-                    if "pages_data" in result:
-                        for page_idx, page in enumerate(result["pages_data"]):
-                            for img_idx, img in enumerate(page.get("images", [])[:3]):  # Limit to first 3 images per page
-                                img_base64 = img.get("image_base64", "")
-                                if img_base64:
-                                    # Strip data URL prefix if present
-                                    if img_base64.startswith("data:image"):
-                                        img_base64 = img_base64.split(",", 1)[1]
-
-                                    # Decode base64 and add to zip
-                                    try:
-                                        img_data = base64.b64decode(img_base64)
-                                        zipf.writestr(f"images/result_{i+1}_page_{page_idx+1}_img_{img_idx+1}.jpg", img_data)
-                                    except:
-                                        pass
-                except Exception:
-                    # If any result fails, skip it and continue
-                    continue
-        else:
-            # Handle single result
-            try:
-                # Create a descriptive base filename for this result
-                base_filename = results.get('file_name', 'document').split('.')[0]
-
-                # Add document type if available
-                if 'topics' in results and results['topics']:
-                    topic = results['topics'][0].lower().replace(' ', '_')
-                    base_filename = f"{base_filename}_{topic}"
-
-                # Add language if available
-                if 'languages' in results and results['languages']:
-                    lang = results['languages'][0].lower()
-                    # Only add if it's not already in the filename
-                    if lang not in base_filename.lower():
-                        base_filename = f"{base_filename}_{lang}"
-
-                # For PDFs, add page information
-                if 'total_pages' in results and 'processed_pages' in results:
-                    base_filename = f"{base_filename}_p{results['processed_pages']}of{results['total_pages']}"
-
-                # Add timestamp if available
-                if 'timestamp' in results:
-                    try:
-                        # Try to parse the timestamp and reformat it
-                        dt = datetime.strptime(results['timestamp'], "%Y-%m-%d %H:%M")
-                        timestamp = dt.strftime("%Y%m%d_%H%M%S")
-                        base_filename = f"{base_filename}_{timestamp}"
-                    except:
-                        # If parsing fails, create a new timestamp
-                        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
-                        base_filename = f"{base_filename}_{timestamp}"
-                else:
-                    # No timestamp in the result, create a new one
-                    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
-                    base_filename = f"{base_filename}_{timestamp}"
-
-                # Add JSON results with descriptive name
-                results_json = json.dumps(results, indent=2)
-                zipf.writestr(f"{base_filename}.json", results_json)
-
-                # Add HTML content with descriptive name
-                html_content = create_html_with_images(results)
-                zipf.writestr(f"{base_filename}_with_images.html", html_content)
-
-                # Add raw OCR text if available
-                if "ocr_contents" in results and "raw_text" in results["ocr_contents"]:
-                    zipf.writestr(f"{base_filename}.txt", results["ocr_contents"]["raw_text"])
-
-                # Add HTML visualization if available
-                if "html_visualization" in results:
-                    zipf.writestr("visualization.html", results["html_visualization"])
-
-                # Add images if available
-                if "pages_data" in results:
-                    for page_idx, page in enumerate(results["pages_data"]):
-                        for img_idx, img in enumerate(page.get("images", [])):
-                            img_base64 = img.get("image_base64", "")
-                            if img_base64:
-                                # Strip data URL prefix if present
-                                if img_base64.startswith("data:image"):
-                                    img_base64 = img_base64.split(",", 1)[1]
-
-                                # Decode base64 and add to zip
-                                try:
-                                    img_data = base64.b64decode(img_base64)
-                                    zipf.writestr(f"images/page_{page_idx+1}_img_{img_idx+1}.jpg", img_data)
-                                except:
-                                    pass
-            except Exception:
-                # If processing fails, return empty zip
-                pass
-
-    # Seek to the beginning of the BytesIO object
-    zip_buffer.seek(0)
-
-    # Return the zip file bytes
-    return zip_buffer.getvalue()
-
-def create_results_zip(results, output_dir=None, zip_name=None):
-    """
-    Create a zip file containing OCR results.
-
-    Args:
-        results: Dictionary or list of OCR results
-        output_dir: Optional output directory
-        zip_name: Optional zip file name
-
-    Returns:
-        Path to the created zip file
-    """
-    # Create temporary output directory if not provided
-    if output_dir is None:
-        output_dir = Path.cwd() / "output"
-        output_dir.mkdir(exist_ok=True)
-    else:
-        output_dir = Path(output_dir)
-        output_dir.mkdir(exist_ok=True)
-
-    # Check if results is a list or a dictionary
-    is_list = isinstance(results, list)
-
-    # Generate zip name if not provided
-    if zip_name is None:
-        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
-
-        if is_list:
-            # For a list of results, create a more descriptive name based on the content
-            file_count = len(results)
-
-            # Count document types
-            pdf_count = sum(1 for r in results if r.get('file_name', '').lower().endswith('.pdf'))
-            img_count = sum(1 for r in results if r.get('file_name', '').lower().endswith(('.jpg', '.jpeg', '.png')))
-
-            # Create descriptive name based on contents
-            if pdf_count > 0 and img_count > 0:
-                zip_name = f"historical_ocr_mixed_{pdf_count}pdf_{img_count}img_{timestamp}.zip"
-            elif pdf_count > 0:
-                zip_name = f"historical_ocr_pdf_documents_{pdf_count}_{timestamp}.zip"
-            elif img_count > 0:
-                zip_name = f"historical_ocr_images_{img_count}_{timestamp}.zip"
-            else:
-                zip_name = f"historical_ocr_results_{file_count}_{timestamp}.zip"
-        else:
-            # For single result, create descriptive filename
-            base_name = results.get("file_name", "document").split('.')[0]
-
-            # Add document type if available
-            if 'topics' in results and results['topics']:
-                topic = results['topics'][0].lower().replace(' ', '_')
-                base_name = f"{base_name}_{topic}"
-
-            # Add language if available
-            if 'languages' in results and results['languages']:
-                lang = results['languages'][0].lower()
-                # Only add if it's not already in the filename
-                if lang not in base_name.lower():
-                    base_name = f"{base_name}_{lang}"
-
-            # For PDFs, add page information
-            if 'total_pages' in results and 'processed_pages' in results:
-                base_name = f"{base_name}_p{results['processed_pages']}of{results['total_pages']}"
-
-            # Add timestamp
-            zip_name = f"{base_name}_{timestamp}.zip"
-
-    try:
-        # Get zip data in memory first
-        zip_data = create_results_zip_in_memory(results)
-
-        # Save to file
-        zip_path = output_dir / zip_name
-        with open(zip_path, 'wb') as f:
-            f.write(zip_data)
-
-        return zip_path
-    except Exception as e:
-        # Create an empty zip file as fallback
-        zip_path = output_dir / zip_name
-        with zipfile.ZipFile(zip_path, 'w') as zipf:
-            zipf.writestr("info.txt", "Could not create complete archive")
-
-        return zip_path
-
-
-# Advanced image preprocessing functions
-
-def preprocess_image_for_ocr(image_path: Union[str, Path]) -> Tuple[Image.Image, str]:
-    """
-    Preprocess an image for optimal OCR performance with enhanced speed and memory optimization.
-    Enhanced to handle large newspaper and document images.
-
-    Args:
-        image_path: Path to the image file
-
-    Returns:
-        Tuple of (processed PIL Image, base64 string)
-    """
-    # Fast path: Skip all processing if PIL not available
-    if not PILLOW_AVAILABLE:
-        logger.info("PIL not available, skipping image preprocessing")
-        return None, encode_image_for_api(image_path)
-
-    # Convert to Path object if string
-    image_file = Path(image_path) if isinstance(image_path, str) else image_path
-
-    # Thread-safe caching with early exit for already processed images
-    try:
-        # Fast stat calls for file metadata - consolidate to reduce I/O
-        file_stat = image_file.stat()
-        file_size = file_stat.st_size
-        file_size_mb = file_size / (1024 * 1024)
-        mod_time = file_stat.st_mtime
-
-        # Create a cache key based on essential file properties
-        cache_key = f"{image_file.name}_{file_size}_{mod_time}"
-
-        # Fast path: Return cached result if available
-        if hasattr(preprocess_image_for_ocr, "_cache") and cache_key in preprocess_image_for_ocr._cache:
-            logger.debug(f"Using cached preprocessing result for {image_file.name}")
-            return preprocess_image_for_ocr._cache[cache_key]
-
-        # Optimization: Skip heavy processing for very small files
-        # Small images (less than 100KB) likely don't need preprocessing
-        if file_size < 100000:  # 100KB
-            logger.info(f"Image {image_file.name} is small ({file_size/1024:.1f}KB), using minimal processing")
-            with Image.open(image_file) as img:
-                # Normalize mode only
-                if img.mode not in ('RGB', 'L'):
-                    img = img.convert('RGB')
-
-                # Save with light optimization
-                buffer = io.BytesIO()
-                img.save(buffer, format="JPEG", quality=95, optimize=True)
-                buffer.seek(0)
-
-                # Get base64
-                encoded_image = base64.b64encode(buffer.getvalue()).decode()
-                base64_data_url = f"data:image/jpeg;base64,{encoded_image}"
-
-                # Cache and return
-                result = (img, base64_data_url)
-                if not hasattr(preprocess_image_for_ocr, "_cache"):
-                    preprocess_image_for_ocr._cache = {}
-
-                # Clean cache if needed
-                if len(preprocess_image_for_ocr._cache) > 20:  # Increased cache size for better performance
-                    # Remove oldest 5 entries for better batch processing
-                    for _ in range(5):
-                        if preprocess_image_for_ocr._cache:
-                            preprocess_image_for_ocr._cache.pop(next(iter(preprocess_image_for_ocr._cache)))
-
-                preprocess_image_for_ocr._cache[cache_key] = result
-                return result
-
-        # Special handling for large newspaper-style documents
-        if file_size_mb > 5 and image_file.name.lower().endswith(('.jpg', '.jpeg', '.png')):
-            logger.info(f"Large image detected ({file_size_mb:.2f}MB), checking for newspaper format")
-            try:
-                # Quickly check dimensions without loading full image
-                with Image.open(image_file) as img:
-                    width, height = img.size
-                    aspect_ratio = width / height
-
-                    # Newspaper-style documents typically have width > height or are very large
-                    is_newspaper_format = (aspect_ratio > 1.15 and width > 2000) or (width > 3000 or height > 3000)
-
-                    if is_newspaper_format:
-                        logger.info(f"Newspaper format detected: {width}x{height}, applying specialized processing")
-
-            except Exception as dim_err:
-                logger.debug(f"Error checking dimensions: {str(dim_err)}")
-                is_newspaper_format = False
-        else:
-            is_newspaper_format = False
-
-    except Exception as e:
-        # If stat or cache handling fails, log and continue with processing
-        logger.debug(f"Cache handling failed for {image_path}: {str(e)}")
-        # Ensure we have a valid file_size_mb for later decisions
-        try:
-            file_size_mb = image_file.stat().st_size / (1024 * 1024)
-        except:
-            file_size_mb = 0  # Default if we can't determine size
-
-        # Default to not newspaper format on error
-        is_newspaper_format = False
-
-    try:
-        # Process start time for performance logging
-        start_time = time.time()
-
-        # Open and process the image with minimal memory footprint
-        with Image.open(image_file) as img:
-            # Normalize image mode
-            if img.mode not in ('RGB', 'L'):
-                img = img.convert('RGB')
-
-            # Fast path: Quick check of image properties to determine appropriate processing
-            width, height = img.size
-            image_area = width * height
-
-            # Detect document type only for medium to large images to save processing time
-            is_document = False
-            is_newspaper = False
-
-            # More aggressive document type detection for larger images
-            if image_area > 500000:  # Approx 700x700 or larger
-                # Store image for document detection
-                _detect_document_type_impl._current_img = img
-                is_document = _detect_document_type_impl(None)
-
-                # Additional check for newspaper format
-                if is_document:
-                    # Newspapers typically have wide formats or very large dimensions
-                    aspect_ratio = width / height
-                    is_newspaper = (aspect_ratio > 1.15 and width > 2000) or (width > 3000 or height > 3000)
-
-                logger.debug(f"Document type detection for {image_file.name}: " +
-                             f"{'newspaper' if is_newspaper else 'document' if is_document else 'photo'}")
-
-            # Check for handwritten document characteristics
-            is_handwritten = False
-            if CV2_AVAILABLE and not is_newspaper:
-                # Use more advanced detection for handwritten content
-                try:
-                    gray_np = np.array(img.convert('L'))
-                    # Higher variance in edge strengths can indicate handwriting
-                    edges = cv2.Canny(gray_np, 30, 100)
-                    if np.count_nonzero(edges) / edges.size > 0.02:  # Low edge threshold for handwriting
-                        # Additional check with gradient magnitudes
-                        sobelx = cv2.Sobel(gray_np, cv2.CV_64F, 1, 0, ksize=3)
-                        sobely = cv2.Sobel(gray_np, cv2.CV_64F, 0, 1, ksize=3)
-                        magnitude = np.sqrt(sobelx**2 + sobely**2)
-                        # Handwriting typically has more variation in gradient magnitudes
-                        if np.std(magnitude) > 20:
-                            is_handwritten = True
-                            logger.info(f"Handwritten document detected: {image_file.name}")
-                except Exception as e:
-                    logger.debug(f"Handwriting detection error: {str(e)}")
-
-            # Special processing for very large images (newspapers and large documents)
-            if is_newspaper:
-                # For newspaper format, we need more specialized processing
-                logger.info(f"Processing newspaper format image: {width}x{height}")
-
-                # For newspapers, we prioritize text clarity over file size
-                # Use higher target resolution to preserve small text common in newspapers
-                # But still need to resize if extremely large to avoid API limits
-                max_dimension = max(width, height)
-
-                if max_dimension > 6000:  # Extremely large
-                    scale_factor = 0.4  # Preserve more resolution for newspapers (increased from 0.35)
-                elif max_dimension > 4000:
-                    scale_factor = 0.6  # Higher resolution for better text extraction (increased from 0.5)
-                else:
-                    scale_factor = 0.8  # Minimal reduction for moderate newspaper size (increased from 0.7)
-
-                # Calculate new dimensions - maintain higher resolution
-                new_width = int(width * scale_factor)
-                new_height = int(height * scale_factor)
-
-                # Use high-quality resampling to preserve text clarity in newspapers
-                processed_img = img.resize((new_width, new_height), Image.LANCZOS)
-                logger.debug(f"Resized newspaper image from {width}x{height} to {new_width}x{new_height}")
-
-                # For newspapers, we also want to enhance the contrast and sharpen the image
-                # before the main OCR processing for better text extraction
-                if img.mode in ('RGB', 'RGBA'):
-                    # For color newspapers, enhance both the overall image and then convert to grayscale
-                    # This helps with mixed content newspapers that have both text and images
-                    enhancer = ImageEnhance.Contrast(processed_img)
-                    processed_img = enhancer.enhance(1.3)  # Boost contrast but not too aggressively
-
-                    # Also enhance saturation to make colored text more visible
-                    enhancer_sat = ImageEnhance.Color(processed_img)
-                    processed_img = enhancer_sat.enhance(1.2)
-            # Special processing for handwritten documents
-            elif is_handwritten:
-                logger.info(f"Processing handwritten document: {width}x{height}")
-
-                # For handwritten text, we need to preserve stroke details
-                # Use gentle scaling to maintain handwriting characteristics
-                max_dimension = max(width, height)
-
-                if max_dimension > 4000:  # Large handwritten document
-                    scale_factor = 0.6  # Less aggressive reduction for handwriting
-                else:
-                    scale_factor = 0.8  # Minimal reduction for moderate size
-
-                # Calculate new dimensions
-                new_width = int(width * scale_factor)
-                new_height = int(height * scale_factor)
-
-                # Use high-quality resampling to preserve handwriting details
-                processed_img = img.resize((new_width, new_height), Image.LANCZOS)
-
-                # Lower contrast enhancement for handwriting to preserve stroke details
-                if img.mode in ('RGB', 'RGBA'):
-                    # Convert to grayscale for better text processing
-                    processed_img = processed_img.convert('L')
-
-                    # Use reduced contrast enhancement to preserve subtle strokes
-                    enhancer = ImageEnhance.Contrast(processed_img)
-                    processed_img = enhancer.enhance(1.2)  # Lower contrast value for handwriting
-
-            # Standard processing for other large images
-            elif file_size_mb > IMAGE_PREPROCESSING["max_size_mb"] or max(width, height) > 3000:
-                # Calculate target dimensions directly instead of using the heavier resize function
-                target_width, target_height = width, height
-                max_dimension = max(width, height)
-
-                # Use a sliding scale for reduction based on image size
-                if max_dimension > 5000:
-                    scale_factor = 0.3  # Slightly less aggressive reduction (was 0.25)
-                elif max_dimension > 3000:
-                    scale_factor = 0.45  # Slightly less aggressive reduction (was 0.4)
-                else:
-                    scale_factor = 0.65  # Slightly less aggressive reduction (was 0.6)
-
-                # Calculate new dimensions
-                new_width = int(width * scale_factor)
-                new_height = int(height * scale_factor)
-
-                # Use direct resize with optimized resampling filter based on image size
-                if image_area > 3000000:  # Very large, use faster but lower quality
-                    processed_img = img.resize((new_width, new_height), Image.BILINEAR)
-                else:  # Medium size, use better quality
-                    processed_img = img.resize((new_width, new_height), Image.LANCZOS)
-
-                logger.debug(f"Resized image from {width}x{height} to {new_width}x{new_height}")
-            else:
-                # Skip resizing for smaller images
-                processed_img = img
-
-            # Apply appropriate processing based on document type and size
-            if is_document:
-                # Process as document with optimized path based on size
-                if image_area > 1000000:  # Full processing for larger documents
-                    preprocess_document_image._current_img = processed_img
-                    processed = _preprocess_document_image_impl()
-                else:  # Lightweight processing for smaller documents
-                    # Just enhance contrast for small documents to save time
-                    enhancer = ImageEnhance.Contrast(processed_img)
-                    processed = enhancer.enhance(1.3)
-            else:
-                # Process as photo with optimized path based on size
-                if image_area > 1000000:  # Full processing for larger photos
-                    preprocess_general_image._current_img = processed_img
-                    processed = _preprocess_general_image_impl()
-                else:  # Skip processing for smaller photos
-                    processed = processed_img
-
-            # Optimize memory handling during encoding
-            buffer = io.BytesIO()
-
-            # Adjust quality based on image size to optimize API payload
-            if file_size_mb > 5:
-                quality = 85  # Lower quality for large files
-            else:
-                quality = IMAGE_PREPROCESSING["compression_quality"]
-
-            # Save with optimized parameters
-            processed.save(buffer, format="JPEG", quality=quality, optimize=True)
-            buffer.seek(0)
-
-            # Get base64 with minimal memory footprint
-            encoded_image = base64.b64encode(buffer.getvalue()).decode()
-            # Always use image/jpeg MIME type since we explicitly save as JPEG above
-            base64_data_url = f"data:image/jpeg;base64,{encoded_image}"
-
-            # Update cache thread-safely
-            result = (processed, base64_data_url)
-            if not hasattr(preprocess_image_for_ocr, "_cache"):
-                preprocess_image_for_ocr._cache = {}
-
-            # LRU-like cache management with improved clearing
-            if len(preprocess_image_for_ocr._cache) > 20:
-                try:
-                    # Remove several entries to avoid frequent cache clearing
-                    for _ in range(5):
-                        if preprocess_image_for_ocr._cache:
-                            preprocess_image_for_ocr._cache.pop(next(iter(preprocess_image_for_ocr._cache)))
-                except:
-                    # If removal fails, just continue
-                    pass
-
-            # Add to cache
-            try:
-                preprocess_image_for_ocr._cache[cache_key] = result
-            except Exception:
-                # If caching fails, just proceed
-                pass
-
-            # Log performance metrics
-            processing_time = time.time() - start_time
-            logger.debug(f"Image preprocessing completed in {processing_time:.3f}s for {image_file.name}")
-
-            # Return both processed image and base64 string
-            return result
-
-    except Exception as e:
|
785 |
-
# If preprocessing fails, log error and use original image
|
786 |
-
logger.warning(f"Image preprocessing failed: {str(e)}. Using original image.")
|
787 |
-
return None, encode_image_for_api(image_path)
|
788 |
-
|
789 |
-
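The `_cache` function attribute above is a lightweight substitute for `functools.lru_cache`, which cannot be used here because the arguments (PIL images, option dicts) are unhashable. A minimal, self-contained sketch of the same pattern; `cached_op`, `expensive_op`, and the `hash()`-based key are hypothetical stand-ins, not names from this module:

def expensive_op(data: bytes) -> int:
    return len(data)  # placeholder for real work

def cached_op(data: bytes) -> int:
    # The cache lives on the function object itself, so no decorator is needed
    if not hasattr(cached_op, "_cache"):
        cached_op._cache = {}
    key = hash(data)  # the caller must pick a hashable key for unhashable inputs
    if key in cached_op._cache:
        return cached_op._cache[key]
    result = expensive_op(data)
    # Evict a few of the oldest-inserted entries once the cache passes 20 items,
    # mirroring the bound used in preprocess_image_for_ocr above
    if len(cached_op._cache) > 20:
        for _ in range(5):
            cached_op._cache.pop(next(iter(cached_op._cache)))
    cached_op._cache[key] = result
    return result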
# Removed caching decorator to fix unhashable type error
def detect_document_type(img: Image.Image) -> bool:
    """
    Detect if an image is likely a document (text-heavy) vs. a photo.

    Args:
        img: PIL Image object

    Returns:
        True if likely a document, False otherwise
    """
    # Store the image for the implementation function, then run the
    # direct implementation without caching
    _detect_document_type_impl._current_img = img
    return _detect_document_type_impl(None)

def _detect_document_type_impl(img_hash=None) -> bool:
    """
    Optimized implementation of document type detection for faster processing.
    The img_hash parameter is unused but kept for backward compatibility.

    Enhanced to better detect handwritten documents and newspaper formats.
    """
    # Fast path: Get the image from thread-local storage
    if not hasattr(_detect_document_type_impl, "_current_img"):
        return False  # Fail safe in case image is not set

    img = _detect_document_type_impl._current_img

    # Skip processing for tiny images - just classify as non-documents
    width, height = img.size
    if width * height < 100000:  # Approx 300x300 or smaller
        return False

    # Convert to grayscale for analysis (using faster conversion)
    gray_img = img.convert('L')

    # PIL-only path for systems without OpenCV
    if not CV2_AVAILABLE:
        # Faster method: Sample a subset of the image for edge detection
        # Downscale image for faster processing
        sample_size = min(width, height, 1000)
        scale_factor = sample_size / max(width, height)

        if scale_factor < 0.9:  # Only resize if significant reduction
            sample_img = gray_img.resize(
                (int(width * scale_factor), int(height * scale_factor)),
                Image.NEAREST  # Fastest resampling method
            )
        else:
            sample_img = gray_img

        # Fast edge detection on sample
        edges = sample_img.filter(ImageFilter.FIND_EDGES)

        # Count edge pixels using threshold (faster than summing individual pixels)
        edge_data = edges.getdata()
        edge_threshold = 40  # Lowered threshold to better detect handwritten texts

        # Use a generator expression for better performance
        edge_count = sum(1 for p in edge_data if p > edge_threshold)
        total_pixels = len(edge_data)
        edge_ratio = edge_count / total_pixels

        # Check if bright areas exist - simple approximation of text/background contrast
        bright_count = sum(1 for p in gray_img.getdata() if p > 200)
        bright_ratio = bright_count / (width * height)

        # Documents typically have more edges (text boundaries) and bright areas (background)
        # Lowered edge threshold to better detect handwritten documents
        return edge_ratio > 0.035 or bright_ratio > 0.4

    # OpenCV path - optimized for speed and enhanced for handwritten documents
    img_np = np.array(gray_img)

    # 1. Fast check: Variance of pixel values
    # Documents typically have high variance (text on background)
    # Handwritten documents may have less contrast than printed text
    std_dev = np.std(img_np)
    if std_dev > 40:  # Further lowered threshold to better detect handwritten documents with low contrast
        return True

    # 2. Quick check using downsampled image for edges
    # Downscale for faster processing on large images
    if max(img_np.shape) > 1000:
        scale = 1000 / max(img_np.shape)
        small_img = cv2.resize(img_np, None, fx=scale, fy=scale, interpolation=cv2.INTER_NEAREST)
    else:
        small_img = img_np

    # Enhanced edge detection for handwritten documents
    # Use multiple Canny thresholds to better capture both faint and bold strokes
    edges_low = cv2.Canny(small_img, 20, 110, L2gradient=False)   # For faint handwriting
    edges_high = cv2.Canny(small_img, 30, 150, L2gradient=False)  # For standard text

    # Combine edge detection results
    edges = cv2.bitwise_or(edges_low, edges_high)
    edge_ratio = np.count_nonzero(edges) / edges.size

    # Special handling for potential handwritten content - more sensitive detection
    handwritten_indicator = False
    if edge_ratio > 0.015:  # Lower threshold specifically for handwritten content
        try:
            # Look for handwriting stroke characteristics using gradient analysis
            # Compute gradient magnitudes and directions
            sobelx = cv2.Sobel(small_img, cv2.CV_64F, 1, 0, ksize=3)
            sobely = cv2.Sobel(small_img, cv2.CV_64F, 0, 1, ksize=3)
            magnitude = np.sqrt(sobelx**2 + sobely**2)

            # Handwriting typically has higher variation in gradient magnitudes
            if np.std(magnitude) > 18:  # Lower threshold for more sensitivity
                # Handwriting is indicated if we also have some line structure
                # Try to find line segments that could indicate text lines
                lines = cv2.HoughLinesP(edges, 1, np.pi/180,
                                        threshold=45,      # Lower threshold for handwriting
                                        minLineLength=25,  # Shorter minimum line length
                                        maxLineGap=25)     # Larger gap for disconnected handwriting

                if lines is not None and len(lines) > 8:  # Fewer line segments needed
                    handwritten_indicator = True
        except Exception:
            # If analysis fails, continue with other checks
            pass

    # 3. Enhanced histogram analysis for handwritten content
    # Use more granular bins for better detection of varying stroke densities
    dark_mask = img_np < 65                         # Increased threshold to capture lighter handwritten text
    medium_mask = (img_np >= 65) & (img_np < 170)   # Medium gray range for handwriting
    light_mask = img_np > 175                       # Slightly adjusted for aged paper

    dark_ratio = np.count_nonzero(dark_mask) / img_np.size
    medium_ratio = np.count_nonzero(medium_mask) / img_np.size
    light_ratio = np.count_nonzero(light_mask) / img_np.size

    # Handwritten documents often have more medium-gray content than printed text
    # This helps detect pencil or faded ink handwriting
    if medium_ratio > 0.3 and edge_ratio > 0.015:
        return True

    # Special analysis for handwritten documents
    # Return true immediately if handwriting characteristics detected
    if handwritten_indicator:
        return True

    # Combine heuristics for final decision with improved sensitivity
    # Lower thresholds for handwritten documents
    return (dark_ratio > 0.025 and light_ratio > 0.2) or edge_ratio > 0.025
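A hedged usage sketch of the detector; the file name is illustrative. The public wrapper stores the image on the implementation function before delegating, so callers only pass the PIL image:

from PIL import Image

page = Image.open("sample_page.jpg")  # hypothetical input
if detect_document_type(page):
    print("Text-heavy document: use the document preprocessing path")
else:
    print("Photo-like image: use the general preprocessing path")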
# Removed caching to fix unhashable type error
def preprocess_document_image(img: Image.Image) -> Image.Image:
    """
    Preprocess a document image for optimal OCR.

    Args:
        img: PIL Image object

    Returns:
        Processed PIL Image
    """
    # Store the image for the implementation function
    preprocess_document_image._current_img = img
    # The actual implementation is separated for cleaner code organization
    return _preprocess_document_image_impl()

def _preprocess_document_image_impl() -> Image.Image:
    """
    Optimized implementation of document preprocessing with adaptive processing based on image size.
    Enhanced for better handwritten document processing and newspaper format.
    """
    # Fast path: Get image from thread-local storage
    if not hasattr(preprocess_document_image, "_current_img"):
        raise ValueError("No image set for document preprocessing")

    img = preprocess_document_image._current_img

    # Analyze image size to determine processing strategy
    width, height = img.size
    img_size = width * height

    # Detect special document types
    is_handwritten = False
    is_newspaper = False

    # Check for newspaper format first (takes precedence)
    aspect_ratio = width / height
    if (aspect_ratio > 1.15 and width > 2000) or (width > 3000 or height > 3000):
        is_newspaper = True
        logger.debug(f"Newspaper format detected: {width}x{height}, aspect ratio: {aspect_ratio:.2f}")
    else:
        # If not newspaper, check if handwritten
        try:
            # Simple check for handwritten document characteristics
            # Handwritten documents often have more varied strokes and less stark contrast
            if CV2_AVAILABLE:
                # Convert to grayscale and calculate local variance
                gray_np = np.array(img.convert('L'))
                # Higher variance in edge strengths can indicate handwriting
                edges = cv2.Canny(gray_np, 30, 100)
                if np.count_nonzero(edges) / edges.size > 0.02:  # Low edge threshold for handwriting
                    # Additional check with gradient magnitudes
                    sobelx = cv2.Sobel(gray_np, cv2.CV_64F, 1, 0, ksize=3)
                    sobely = cv2.Sobel(gray_np, cv2.CV_64F, 0, 1, ksize=3)
                    magnitude = np.sqrt(sobelx**2 + sobely**2)
                    # Handwriting typically has more variation in gradient magnitudes
                    if np.std(magnitude) > 20:
                        is_handwritten = True
        except Exception:
            # If detection fails, assume it's not handwritten
            pass

    # Special processing for newspaper format
    if is_newspaper:
        # Convert to grayscale for better text extraction
        gray = img.convert('L')

        # For newspapers, we need aggressive text enhancement to make small print readable
        # First enhance contrast more aggressively for newspaper small text
        enhancer = ImageEnhance.Contrast(gray)
        enhanced = enhancer.enhance(2.0)  # More aggressive contrast for newspaper text

        # Apply stronger sharpening to make small text more defined
        if IMAGE_PREPROCESSING["sharpen"]:
            # Apply multiple passes of sharpening for newspaper text
            enhanced = enhanced.filter(ImageFilter.SHARPEN)
            enhanced = enhanced.filter(ImageFilter.EDGE_ENHANCE_MORE)  # Stronger edge enhancement

        # Enhanced processing for newspapers with OpenCV when available
        if CV2_AVAILABLE:
            try:
                # Convert to numpy array
                img_np = np.array(enhanced)

                # For newspaper text extraction, CLAHE (Contrast Limited Adaptive Histogram Equalization)
                # works much better than simple contrast enhancement
                clahe = cv2.createCLAHE(clipLimit=3.0, tileGridSize=(8, 8))
                img_np = clahe.apply(img_np)

                # Apply different adaptive thresholding approaches and choose the best one

                # 1. Standard adaptive threshold with larger block size for newspaper columns
                binary1 = cv2.adaptiveThreshold(img_np, 255,
                                                cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                                cv2.THRESH_BINARY, 15, 4)

                # 2. Otsu's method for global thresholding - works well for clean newspaper print
                _, binary2 = cv2.threshold(img_np, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

                # Try to determine which method preserves text better
                # Count white pixels and edges in each binary version
                white_pixels1 = np.count_nonzero(binary1 > 200)
                white_pixels2 = np.count_nonzero(binary2 > 200)

                # Calculate edge density to help determine which preserves text features better
                edges1 = cv2.Canny(binary1, 100, 200)
                edges2 = cv2.Canny(binary2, 100, 200)
                edge_count1 = np.count_nonzero(edges1)
                edge_count2 = np.count_nonzero(edges2)

                # For newspaper text, we want to preserve more edges while maintaining reasonable
                # white space (typical of printed text on paper background)
                if (edge_count1 > edge_count2 * 1.2 and white_pixels1 > white_pixels2 * 0.7) or \
                   (white_pixels1 < white_pixels2 * 0.5):  # If Otsu removed too much content
                    # Adaptive thresholding usually better preserves small text in newspapers
                    logger.debug("Using adaptive thresholding for newspaper text")

                    # Apply optional denoising to clean up small speckles
                    result = cv2.fastNlMeansDenoising(binary1, None, 7, 7, 21)
                    return Image.fromarray(result)
                else:
                    # Otsu method was better
                    logger.debug("Using Otsu thresholding for newspaper text")
                    result = cv2.fastNlMeansDenoising(binary2, None, 7, 7, 21)
                    return Image.fromarray(result)

            except Exception as e:
                logger.debug(f"Advanced newspaper processing failed: {str(e)}")
                # Fall back to PIL processing

        # If OpenCV not available or fails, apply additional PIL enhancements
        # Create a more aggressive binary version to better separate text
        binary_threshold = enhanced.point(lambda x: 0 if x < 150 else 255, '1')

        # Return enhanced binary image
        return binary_threshold

    # Ultra-fast path for tiny images - just convert to grayscale with contrast enhancement
    if img_size < 300000:  # ~500x600 or smaller
        gray = img.convert('L')
        # Lower contrast enhancement for handwritten documents
        contrast_level = 1.4 if is_handwritten else IMAGE_PREPROCESSING["enhance_contrast"]
        enhancer = ImageEnhance.Contrast(gray)
        return enhancer.enhance(contrast_level)

    # Fast path for small images - minimal processing
    if img_size < 1000000:  # ~1000x1000 or smaller
        gray = img.convert('L')
        # Use gentler contrast enhancement for handwritten documents
        contrast_level = 1.4 if is_handwritten else IMAGE_PREPROCESSING["enhance_contrast"]
        enhancer = ImageEnhance.Contrast(gray)
        enhanced = enhancer.enhance(contrast_level)

        # Light sharpening only if sharpen is enabled
        # Use milder sharpening for handwritten documents to preserve stroke detail
        if IMAGE_PREPROCESSING["sharpen"]:
            if is_handwritten:
                # Use edge enhancement which is gentler than SHARPEN for handwriting
                enhanced = enhanced.filter(ImageFilter.EDGE_ENHANCE)
            else:
                enhanced = enhanced.filter(ImageFilter.SHARPEN)
        return enhanced

    # Standard path for medium images
    # Convert to grayscale (faster processing)
    gray = img.convert('L')

    # Adaptive contrast enhancement based on document type
    contrast_level = 1.4 if is_handwritten else IMAGE_PREPROCESSING["enhance_contrast"]
    enhancer = ImageEnhance.Contrast(gray)
    enhanced = enhancer.enhance(contrast_level)

    # Apply light sharpening for text clarity - adapt based on document type
    if IMAGE_PREPROCESSING["sharpen"]:
        if is_handwritten:
            # Use edge enhancement which is gentler than SHARPEN for handwriting
            enhanced = enhanced.filter(ImageFilter.EDGE_ENHANCE)
        else:
            enhanced = enhanced.filter(ImageFilter.SHARPEN)

    # Advanced processing with OpenCV if available
    if CV2_AVAILABLE and IMAGE_PREPROCESSING["denoise"]:
        try:
            # Convert to numpy array for OpenCV processing
            img_np = np.array(enhanced)

            if is_handwritten:
                # Enhanced processing for handwritten documents
                # Optimized for better stroke preservation and readability
                if img_size > 3000000:  # Large images - downsample first
                    scale_factor = 0.5
                    small_img = cv2.resize(img_np, None, fx=scale_factor, fy=scale_factor,
                                           interpolation=cv2.INTER_AREA)

                    # Apply CLAHE for better local contrast in handwriting
                    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
                    enhanced_img = clahe.apply(small_img)

                    # Apply bilateral filter with parameters optimized for handwriting
                    # Lower sigma values to preserve more detail
                    filtered = cv2.bilateralFilter(enhanced_img, 7, 30, 50)

                    # Resize back
                    filtered = cv2.resize(filtered, (width, height), interpolation=cv2.INTER_LINEAR)
                else:
                    # For smaller handwritten images
                    # Apply CLAHE for better local contrast
                    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
                    enhanced_img = clahe.apply(img_np)

                    # Apply bilateral filter with parameters optimized for handwriting
                    filtered = cv2.bilateralFilter(enhanced_img, 5, 25, 45)

                # Adaptive thresholding specific to handwriting
                try:
                    # Use larger block size and lower constant for better stroke preservation
                    binary = cv2.adaptiveThreshold(
                        filtered, 255,
                        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                        cv2.THRESH_BINARY,
                        21,  # Larger block size for handwriting
                        5    # Lower constant for better stroke preservation
                    )

                    # Apply slight dilation to connect broken strokes
                    kernel = np.ones((2, 2), np.uint8)
                    binary = cv2.dilate(binary, kernel, iterations=1)

                    # Convert back to PIL Image
                    return Image.fromarray(binary)
                except Exception as e:
                    logger.debug(f"Adaptive threshold for handwriting failed: {str(e)}")
                    # Convert filtered image to PIL and return as fallback
                    return Image.fromarray(filtered)

            else:
                # Standard document processing - optimized for printed text
                # Optimize denoising parameters based on image size
                if img_size > 4000000:  # Very large images
                    # More aggressive downsampling for very large images
                    scale_factor = 0.5
                    downsample = cv2.resize(img_np, None, fx=scale_factor, fy=scale_factor,
                                            interpolation=cv2.INTER_AREA)

                    # Lighter denoising for downsampled image
                    h_value = 7  # Strength parameter
                    template_window = 5
                    search_window = 13

                    # Apply denoising on smaller image
                    denoised_np = cv2.fastNlMeansDenoising(downsample, None, h_value, template_window, search_window)

                    # Resize back to original size
                    denoised_np = cv2.resize(denoised_np, (width, height), interpolation=cv2.INTER_LINEAR)
                else:
                    # Direct denoising for medium-large images
                    h_value = 8  # Balanced for speed and quality
                    template_window = 5
                    search_window = 15

                    # Apply denoising
                    denoised_np = cv2.fastNlMeansDenoising(img_np, None, h_value, template_window, search_window)

                # Convert back to PIL Image
                enhanced = Image.fromarray(denoised_np)

                # Apply adaptive thresholding only if it improves text visibility
                # Create a binarized version of the image
                if img_size < 8000000:  # Skip for extremely large images to save processing time
                    binary = cv2.adaptiveThreshold(denoised_np, 255,
                                                   cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                                   cv2.THRESH_BINARY, 11, 2)

                    # Quick verification that binarization preserves text information
                    # Use simplified check that works well for document images
                    white_pixels_binary = np.count_nonzero(binary > 200)
                    white_pixels_orig = np.count_nonzero(denoised_np > 200)

                    # Check if binary preserves reasonable amount of white pixels (background)
                    if white_pixels_binary > white_pixels_orig * 0.8:
                        # Binarization looks good, use it
                        return Image.fromarray(binary)

                return enhanced

        except Exception:
            # If OpenCV processing fails, continue with PIL-enhanced image
            pass

    elif IMAGE_PREPROCESSING["denoise"]:
        # Fallback PIL denoising for systems without OpenCV
        if is_handwritten:
            # Lighter filtering for handwritten text to preserve details
            # Use a smaller median filter for handwritten documents
            enhanced = enhanced.filter(ImageFilter.MedianFilter(1))
        else:
            # Standard filtering for printed documents
            enhanced = enhanced.filter(ImageFilter.MedianFilter(3))

    # Return enhanced grayscale image
    return enhanced
# Removed caching to fix unhashable type error
def preprocess_general_image(img: Image.Image) -> Image.Image:
    """
    Preprocess a general image for OCR.

    Args:
        img: PIL Image object

    Returns:
        Processed PIL Image
    """
    # Store the image for implementation function
    preprocess_general_image._current_img = img
    return _preprocess_general_image_impl()

def _preprocess_general_image_impl() -> Image.Image:
    """
    Optimized implementation of general image preprocessing with size-based processing paths
    """
    # Fast path: Get the image from thread-local storage
    if not hasattr(preprocess_general_image, "_current_img"):
        raise ValueError("No image set for general preprocessing")

    img = preprocess_general_image._current_img

    # Ultra-fast path: Skip processing completely for small images to improve performance
    width, height = img.size
    img_size = width * height
    if img_size < 300000:  # Skip for tiny images under ~0.3 megapixel
        # Just ensure correct color mode
        if img.mode != 'RGB':
            return img.convert('RGB')
        return img

    # Fast path: Minimal processing for smaller images
    if img_size < 600000:  # ~800x750 or smaller
        # Ensure RGB mode
        if img.mode != 'RGB':
            img = img.convert('RGB')

        # Very light contrast enhancement only
        enhancer = ImageEnhance.Contrast(img)
        return enhancer.enhance(1.15)  # Lighter enhancement for small images

    # Standard path: Apply moderate enhancements for medium images
    # Convert to RGB to ensure compatibility
    if img.mode != 'RGB':
        img = img.convert('RGB')

    # Moderate enhancement only
    enhancer = ImageEnhance.Contrast(img)
    enhanced = enhancer.enhance(1.2)  # Less aggressive than document enhancement

    # Skip additional processing for medium-sized images
    if img_size < 1000000:  # Skip for images under ~1 megapixel
        return enhanced

    # Enhanced path: Additional processing for larger images
    try:
        # Apply optimized enhancement pipeline for large non-document images

        # 1. Improve color saturation slightly for better feature extraction
        saturation = ImageEnhance.Color(enhanced)
        enhanced = saturation.enhance(1.1)

        # 2. Apply adaptive sharpening based on image size
        if img_size > 2500000:  # Very large images (~1600x1600 or larger)
            # Use EDGE_ENHANCE instead of SHARPEN for more subtle enhancement on large images
            enhanced = enhanced.filter(ImageFilter.EDGE_ENHANCE)
        else:
            # Standard sharpening for regular large images
            enhanced = enhanced.filter(ImageFilter.SHARPEN)

        # 3. Apply additional processing with OpenCV if available (for largest images)
        if CV2_AVAILABLE and img_size > 3000000:
            # Convert to numpy array
            img_np = np.array(enhanced)

            # Apply subtle enhancement of details (CLAHE)
            try:
                # Convert to LAB color space for better processing
                lab = cv2.cvtColor(img_np, cv2.COLOR_RGB2LAB)

                # Only enhance the L channel (luminance)
                l, a, b = cv2.split(lab)

                # Create CLAHE object with optimal parameters for photos
                clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))

                # Apply CLAHE to L channel
                l = clahe.apply(l)

                # Merge channels back and convert to RGB
                lab = cv2.merge((l, a, b))
                enhanced_np = cv2.cvtColor(lab, cv2.COLOR_LAB2RGB)

                # Convert back to PIL
                enhanced = Image.fromarray(enhanced_np)
            except Exception:
                # If CLAHE fails, continue with PIL-enhanced image
                pass

    except Exception:
        # If any enhancement fails, fall back to basic contrast enhancement
        if img.mode != 'RGB':
            img = img.convert('RGB')
        enhancer = ImageEnhance.Contrast(img)
        enhanced = enhancer.enhance(1.2)

    return enhanced
# Removed caching decorator to fix unhashable type error
def resize_image(img: Image.Image, target_dpi: int = 300) -> Image.Image:
    """
    Resize an image to an optimal size for OCR while preserving quality.

    Args:
        img: PIL Image object
        target_dpi: Target DPI (dots per inch)

    Returns:
        Resized PIL Image
    """
    # Store the image for implementation function
    resize_image._current_img = img
    return resize_image_impl(target_dpi)

def resize_image_impl(target_dpi: int = 300) -> Image.Image:
    """
    Implementation of DPI-based image resizing.

    Args:
        target_dpi: Target DPI (dots per inch)

    Returns:
        Resized PIL Image
    """
    # Get the image from thread-local storage (set by the caller)
    if not hasattr(resize_image, "_current_img"):
        raise ValueError("No image set for resizing")

    img = resize_image._current_img

    # Calculate current dimensions
    width, height = img.size

    # Fixed target dimensions based on DPI
    # Using larger dimensions to support newspapers and large documents
    max_width = int(14 * target_dpi)   # Increased from 8.5 to 14 inches
    max_height = int(22 * target_dpi)  # Increased from 11 to 22 inches

    # Check if resizing is needed - quick early return
    if width <= max_width and height <= max_height:
        return img  # No resizing needed

    # Calculate scaling factor once
    scale_factor = min(max_width / width, max_height / height)

    # Calculate new dimensions
    new_width = int(width * scale_factor)
    new_height = int(height * scale_factor)

    # Use BICUBIC for better balance of speed and quality
    return img.resize((new_width, new_height), Image.BICUBIC)
def calculate_image_entropy(img: Image.Image) -> float:
    """
    Calculate the entropy (information content) of an image.

    Args:
        img: PIL Image object

    Returns:
        Entropy value
    """
    # Convert to grayscale
    if img.mode != 'L':
        img = img.convert('L')

    # Calculate histogram
    histogram = img.histogram()
    total_pixels = img.width * img.height

    # Calculate entropy
    entropy = 0
    for h in histogram:
        if h > 0:
            probability = h / total_pixels
            entropy -= probability * np.log2(probability)

    return entropy
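This is the Shannon entropy H = -sum(p_i * log2(p_i)) over the 256 grayscale histogram bins: a perfectly uniform histogram yields the maximum of 8 bits, while a blank page approaches 0. A hedged usage sketch; the file name and the 4.0 cutoff are illustrative, not values taken from this codebase:

from PIL import Image

scan = Image.open("scan.png")  # hypothetical input
if calculate_image_entropy(scan) < 4.0:
    print("Low information content - likely a blank or near-blank page")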
def create_html_with_images(result):
    """
    Create an HTML document with embedded images from OCR results.
    Handles serialization of complex OCR objects automatically.

    Args:
        result: OCR result dictionary containing pages_data

    Returns:
        HTML content as string
    """
    # Ensure result is fully serializable first
    result = serialize_ocr_object(result)
    # Create HTML document structure
    html_content = """
    <!DOCTYPE html>
    <html>
    <head>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>OCR Document with Images</title>
        <style>
            body {
                font-family: Georgia, serif;
                line-height: 1.7;
                margin: 0 auto;
                max-width: 800px;
                padding: 20px;
            }
            img {
                max-width: 90%;
                max-height: 500px;
                object-fit: contain;
                margin: 20px auto;
                display: block;
                border: 1px solid #ddd;
                border-radius: 4px;
            }
            .image-container {
                margin: 20px 0;
                text-align: center;
            }
            .page-break {
                border-top: 1px solid #ddd;
                margin: 40px 0;
                padding-top: 40px;
            }
            h3 {
                color: #333;
                border-bottom: 1px solid #eee;
                padding-bottom: 10px;
            }
            p {
                margin: 12px 0;
            }
            .page-text-content {
                margin-bottom: 20px;
            }
            .text-block {
                background-color: #f9f9f9;
                padding: 15px;
                border-radius: 4px;
                border-left: 3px solid #546e7a;
                margin-bottom: 15px;
                color: #333;
            }
            .text-block p {
                margin: 8px 0;
                color: #333;
            }
            .metadata {
                background-color: #f5f5f5;
                padding: 10px 15px;
                border-radius: 4px;
                margin-bottom: 20px;
                font-size: 14px;
            }
            .metadata p {
                margin: 5px 0;
            }
        </style>
    </head>
    <body>
    """

    # Add document metadata
    html_content += f"""
    <div class="metadata">
        <h2>{result.get('file_name', 'Document')}</h2>
        <p><strong>Processed at:</strong> {result.get('timestamp', '')}</p>
        <p><strong>Languages:</strong> {', '.join(result.get('languages', ['Unknown']))}</p>
        <p><strong>Topics:</strong> {', '.join(result.get('topics', ['Unknown']))}</p>
    </div>
    """

    # Check if we have pages_data
    if 'pages_data' in result and result['pages_data']:
        pages_data = result['pages_data']

        # Process each page
        for i, page in enumerate(pages_data):
            page_markdown = page.get('markdown', '')
            images = page.get('images', [])

            # Add page header if multi-page
            if len(pages_data) > 1:
                html_content += f"<h3>Page {i+1}</h3>"

            # Create image dictionary
            image_dict = {}
            for img in images:
                if 'id' in img and 'image_base64' in img:
                    image_dict[img['id']] = img['image_base64']

            # Process the markdown content
            if page_markdown:
                # Extract text content (lines without images)
                text_content = []
                image_lines = []

                for line in page_markdown.split('\n'):
                    if '![' in line and '](' in line:
                        image_lines.append(line)
                    elif line.strip():
                        text_content.append(line)

                # Add text content
                if text_content:
                    html_content += '<div class="text-block">'
                    for line in text_content:
                        html_content += f"<p>{line}</p>"
                    html_content += '</div>'

                # Add images
                for line in image_lines:
                    # Extract image ID and alt text using simple parsing
                    try:
                        alt_start = line.find('![') + 2
                        alt_end = line.find(']', alt_start)
                        alt_text = line[alt_start:alt_end]

                        img_start = line.find('(', alt_end) + 1
                        img_end = line.find(')', img_start)
                        img_id = line[img_start:img_end]

                        if img_id in image_dict:
                            html_content += '<div class="image-container">'
                            html_content += f'<img src="{image_dict[img_id]}" alt="{alt_text}">'
                            html_content += '</div>'
                    except Exception:
                        # If parsing fails, just skip this image
                        continue

            # Add page separator if not the last page
            if i < len(pages_data) - 1:
                html_content += '<div class="page-break"></div>'

    # Add structured content if available
    if 'ocr_contents' in result and isinstance(result['ocr_contents'], dict):
        html_content += '<h3>Structured Content</h3>'

        for section, content in result['ocr_contents'].items():
            if content and section not in ['error', 'raw_text', 'partial_text']:
                html_content += f'<h4>{section.replace("_", " ").title()}</h4>'

                if isinstance(content, str):
                    html_content += f'<p>{content}</p>'
                elif isinstance(content, list):
                    html_content += '<ul>'
                    for item in content:
                        html_content += f'<li>{str(item)}</li>'
                    html_content += '</ul>'
                elif isinstance(content, dict):
                    html_content += '<dl>'
                    for k, v in content.items():
                        html_content += f'<dt>{k}</dt><dd>{v}</dd>'
                    html_content += '</dl>'

    # Close HTML document
    html_content += """
    </body>
    </html>
    """

    return html_content
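A hedged usage sketch for the exporter; the result variable and output path are illustrative:

html = create_html_with_images(ocr_result)  # ocr_result: dict produced by the OCR pipeline
with open("ocr_output.html", "w", encoding="utf-8") as f:
    f.write(html)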
def generate_document_thumbnail(image_path: Union[str, Path], max_size: int = 300) -> str:
    """
    Generate a thumbnail for document preview.

    Args:
        image_path: Path to the image file
        max_size: Maximum dimension for thumbnail

    Returns:
        Base64 encoded thumbnail
    """
    if not PILLOW_AVAILABLE:
        return None

    try:
        # Open the image
        with Image.open(image_path) as img:
            # Calculate thumbnail size preserving aspect ratio
            width, height = img.size
            if width > height:
                new_width = max_size
                new_height = int(height * (max_size / width))
            else:
                new_height = max_size
                new_width = int(width * (max_size / height))

            # Create thumbnail
            thumbnail = img.resize((new_width, new_height), Image.LANCZOS)

            # Save to buffer
            buffer = io.BytesIO()
            thumbnail.save(buffer, format="JPEG", quality=85)
            buffer.seek(0)

            # Encode as base64
            encoded = base64.b64encode(buffer.getvalue()).decode()
            return f"data:image/jpeg;base64,{encoded}"
    except Exception:
        # Return None if thumbnail generation fails
        return None
def serialize_ocr_object(obj):
    """
    Serialize OCR response objects to JSON serializable format.
    Handles OCRImageObject specifically to prevent serialization errors.

    Args:
        obj: The object to serialize

    Returns:
        JSON serializable representation of the object
    """
    # Fast path: Handle primitive types directly
    if obj is None or isinstance(obj, (str, int, float, bool)):
        return obj

    # Handle collections
    if isinstance(obj, list):
        return [serialize_ocr_object(item) for item in obj]
    elif isinstance(obj, dict):
        return {k: serialize_ocr_object(v) for k, v in obj.items()}
    elif isinstance(obj, OCRImageObject):
        # Special handling for OCRImageObject
        return {
            'id': obj.id if hasattr(obj, 'id') else None,
            'image_base64': obj.image_base64 if hasattr(obj, 'image_base64') else None
        }
    elif hasattr(obj, '__dict__'):
        # For objects with __dict__ attribute
        return {k: serialize_ocr_object(v) for k, v in obj.__dict__.items()
                if not k.startswith('_')}  # Skip private attributes
    else:
        # Try to convert to string as last resort
        try:
            return str(obj)
        except Exception:
            return None
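A hedged usage sketch; the input dict stands in for a real OCR response object:

import json

api_response = {"file_name": "letter.jpg", "topics": ["Letter"]}  # stand-in for a real response
serializable = serialize_ocr_object(api_response)
print(json.dumps(serializable))  # safe: OCRImageObject instances become plain dicts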
def try_local_ocr_fallback(image_path: Union[str, Path], base64_data_url: str = None) -> str:
    """
    Attempt to use local pytesseract OCR as a fallback when API fails
    With enhanced processing optimized for handwritten content

    Args:
        image_path: Path to the image file
        base64_data_url: Optional base64 data URL if already available

    Returns:
        Extracted text, or None if local OCR is unavailable or fails
    """
    try:
        import pytesseract
        from PIL import Image

        # Load image - either from path or from base64
        if base64_data_url and base64_data_url.startswith('data:image'):
            # Extract image from base64
            image_data = base64_data_url.split(',', 1)[1]
            image_bytes = base64.b64decode(image_data)
            image = Image.open(io.BytesIO(image_bytes))
        else:
            # Load from file path
            image_path = Path(image_path) if isinstance(image_path, str) else image_path
            image = Image.open(image_path)

        # Auto-detect if this appears to be handwritten
        is_handwritten = False

        # Use OpenCV-based preprocessing when available
        if CV2_AVAILABLE:
            try:
                # Convert image to numpy array
                img_np = np.array(image.convert('L'))

                # Check for handwritten characteristics
                edges = cv2.Canny(img_np, 30, 100)
                edge_ratio = np.count_nonzero(edges) / edges.size

                # Typical handwritten documents have more varied edge patterns
                if edge_ratio > 0.02:
                    # Additional check with gradient magnitudes
                    sobelx = cv2.Sobel(img_np, cv2.CV_64F, 1, 0, ksize=3)
                    sobely = cv2.Sobel(img_np, cv2.CV_64F, 0, 1, ksize=3)
                    magnitude = np.sqrt(sobelx**2 + sobely**2)
                    # Handwriting typically has more variation in gradient magnitudes
                    if np.std(magnitude) > 20:
                        is_handwritten = True
                        logger.info("Detected handwritten content for local OCR")

                # Enhanced preprocessing based on document type
                if is_handwritten:
                    # Process for handwritten content
                    # Apply CLAHE for better local contrast
                    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
                    img_np = clahe.apply(img_np)

                    # Apply adaptive thresholding with optimized parameters for handwriting
                    binary = cv2.adaptiveThreshold(
                        img_np, 255,
                        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                        cv2.THRESH_BINARY,
                        21,  # Larger block size for handwriting
                        5    # Lower constant for better stroke preservation
                    )

                    # Optional: apply dilation to thicken strokes slightly
                    kernel = np.ones((2, 2), np.uint8)
                    binary = cv2.dilate(binary, kernel, iterations=1)

                    # Convert back to PIL Image for tesseract
                    image = Image.fromarray(binary)

                    # Set tesseract options for handwritten content
                    custom_config = r'--oem 1 --psm 6 -l eng'
                else:
                    # Process for printed content
                    # Apply CLAHE for better contrast
                    clahe = cv2.createCLAHE(clipLimit=2.5, tileGridSize=(8, 8))
                    img_np = clahe.apply(img_np)

                    # Apply bilateral filter to reduce noise while preserving edges
                    img_np = cv2.bilateralFilter(img_np, 9, 75, 75)

                    # Apply Otsu's thresholding for printed text
                    _, binary = cv2.threshold(img_np, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

                    # Convert back to PIL Image for tesseract
                    image = Image.fromarray(binary)

                    # Set tesseract options for printed content
                    custom_config = r'--oem 3 --psm 6 -l eng'
            except Exception as e:
                logger.warning(f"OpenCV preprocessing failed: {str(e)}. Using PIL fallback.")

                # Convert to RGB if not already (pytesseract works best with RGB)
                if image.mode != 'RGB':
                    image = image.convert('RGB')

                # Apply basic image enhancements
                image = image.convert('L')
                enhancer = ImageEnhance.Contrast(image)
                image = enhancer.enhance(2.0)
                custom_config = r'--oem 3 --psm 6 -l eng'
        else:
            # PIL-only path without OpenCV
            # Convert to RGB if not already (pytesseract works best with RGB)
            if image.mode != 'RGB':
                image = image.convert('RGB')

            # Apply basic image enhancements
            image = image.convert('L')
            enhancer = ImageEnhance.Contrast(image)
            image = enhancer.enhance(2.0)
            custom_config = r'--oem 3 --psm 6 -l eng'

        # Run OCR with the selected configuration
        ocr_text = pytesseract.image_to_string(image, config=custom_config)

        if ocr_text and len(ocr_text.strip()) > 50:
            logger.info(f"Local OCR fallback successful: extracted {len(ocr_text)} characters")
            return ocr_text
        else:
            # Retry with a different page segmentation mode
            # Try PSM mode 4 (assume single column of text)
            fallback_config = r'--oem 3 --psm 4 -l eng'
            ocr_text = pytesseract.image_to_string(image, config=fallback_config)

            if ocr_text and len(ocr_text.strip()) > 50:
                logger.info(f"Local OCR fallback successful: extracted {len(ocr_text)} characters")
                return ocr_text
            else:
                logger.warning("Local OCR produced minimal or no text")
                return None
    except ImportError:
        logger.warning("Pytesseract not installed - local OCR not available")
        return None
    except Exception as e:
        logger.error(f"Local OCR fallback failed: {str(e)}")
        return None
"""
OCR utility functions for image processing and OCR operations.
This module provides helper functions used across the Historical OCR application.
"""

import os
import base64
import logging
from pathlib import Path
from typing import Union, Optional

# Configure logging
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Try to import optional dependencies
try:
    import pytesseract
    TESSERACT_AVAILABLE = True
except ImportError:
    logger.warning("pytesseract not available - local OCR fallback will not work")
    TESSERACT_AVAILABLE = False

try:
    from PIL import Image
    PILLOW_AVAILABLE = True
except ImportError:
    logger.warning("PIL not available - image preprocessing will be limited")
    PILLOW_AVAILABLE = False
31 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
def encode_image_for_api(image_path: Union[str, Path]) -> str:
    """
    Encode an image as base64 data URL for API submission with proper MIME type.

    Args:
        image_path: Path to the image file

    ...

    encoded = base64.b64encode(image_file.read_bytes()).decode()
    return f"data:{mime_type};base64,{encoded}"
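A hedged usage sketch of the encoder; the file name is illustrative. The returned string is a standard data URL that can be passed wherever the pipeline expects a base64-encoded image:

data_url = encode_image_for_api("letter_page1.jpg")  # hypothetical file
# data_url looks like "data:image/jpeg;base64,/9j/4AAQ..." with the MIME type
# inferred from the file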
def try_local_ocr_fallback(file_path: Union[str, Path], base64_data_url: Optional[str] = None) -> Optional[str]:
    """
    Try to perform OCR using local Tesseract as a fallback when the API is unavailable.

    Args:
        file_path: Path to the image file
        base64_data_url: Optional base64 data URL if already available

    Returns:
        Extracted text or None if extraction failed
    """
    if not TESSERACT_AVAILABLE or not PILLOW_AVAILABLE:
        logger.warning("Local OCR fallback is not available (missing dependencies)")
        return None

    try:
        logger.info("Using local Tesseract OCR as fallback")

        # Use PIL to open the image
        img = Image.open(file_path)

        # Use Tesseract to extract text
        text = pytesseract.image_to_string(img)

        if text:
            logger.info("Successfully extracted text using local Tesseract OCR")
            return text
        else:
            logger.warning("Tesseract extracted no text")
            return None
    except Exception as e:
        logger.error(f"Error using local OCR fallback: {str(e)}")
        return None
|
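As a brief aside (not part of the diff): a minimal sketch of how a caller might chain API retries with this Tesseract fallback. `call_api` is a hypothetical stand-in for the Mistral OCR client; only `try_local_ocr_fallback` above is real.

    import logging
    from pathlib import Path
    from typing import Callable, Optional

    def ocr_with_retries(call_api: Callable[[Path], str], file_path: Path,
                         retries: int = 3) -> Optional[str]:
        for attempt in range(retries):
            try:
                return call_api(file_path)          # primary OCR service
            except Exception as e:
                logging.warning(f"OCR attempt {attempt + 1} failed: {e}")
        # Retries exhausted: local Tesseract returns raw text only, or None
        return try_local_ocr_fallback(file_path)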
preprocessing.py
CHANGED
@@ -3,15 +3,398 @@ import io
 import cv2
 import numpy as np
 import tempfile
 from PIL import Image, ImageEnhance, ImageFilter
 from pdf2image import convert_from_bytes
 import streamlit as st
 import logging

 # Configure logging
 logger = logging.getLogger("preprocessing")
 logger.setLevel(logging.INFO)

 @st.cache_data(ttl=24*3600, show_spinner=False)  # Cache for 24 hours
 def convert_pdf_to_images(pdf_bytes, dpi=150, rotation=0):
     """Convert PDF bytes to a list of images with caching"""
@@ -34,94 +417,134 @@ def convert_pdf_to_images(pdf_bytes, dpi=150, rotation=0):

 @st.cache_data(ttl=24*3600, show_spinner=False, hash_funcs={dict: lambda x: str(sorted(x.items()))})
 def preprocess_image(image_bytes, preprocessing_options):
-    """
     # Setup basic console logging
     logger = logging.getLogger("image_preprocessor")
     logger.setLevel(logging.INFO)

     # Log which preprocessing options are being applied
-    logger.info(f"

     # Convert bytes to PIL Image
     image = Image.open(io.BytesIO(image_bytes))

-    # Check for
     if image.mode == 'RGBA':
-        # Convert RGBA to RGB by compositing
         background = Image.new('RGB', image.size, (255, 255, 255))
         background.paste(image, mask=image.split()[3])  # 3 is the alpha channel
         image = background
-
     elif image.mode not in ('RGB', 'L'):
-        # Convert other modes to RGB
         image = image.convert('RGB')
-
-
-    # Apply rotation if specified
-    if preprocessing_options.get("rotation", 0) != 0:
-        rotation_degrees = preprocessing_options.get("rotation")
-        image = image.rotate(rotation_degrees, expand=True, resample=Image.BICUBIC)
-
-    # Resize large images while preserving details important for OCR
-    width, height = image.size
-    max_dimension = max(width, height)
-
-    # Less aggressive resizing to preserve document details
-    if max_dimension > 2500:
-        scale_factor = 2500 / max_dimension
-        new_width = int(width * scale_factor)
-        new_height = int(height * scale_factor)
-        # Use LANCZOS for better quality preservation
-        image = image.resize((new_width, new_height), Image.LANCZOS)

     img_array = np.array(image)

-    # Apply
-    document_type = preprocessing_options.get("document_type", "standard")
-
-    # Process grayscale option first as it's a common foundation
     if preprocessing_options.get("grayscale", False):
         if len(img_array.shape) == 3:  # Only convert if it's not already grayscale
-
-
             img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
-
-            clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
             img_array = clahe.apply(img_array)
         else:
             # Standard grayscale for printed documents
             img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
-
-
-        img_array = cv2.cvtColor(img_array, cv2.COLOR_GRAY2RGB)
-
-    if preprocessing_options.get("contrast", 0) != 0:
-        contrast_factor = 1 + (preprocessing_options.get("contrast", 0) / 150)  # Reduced from /100 for a gentler effect
-        image = Image.fromarray(img_array)
-        enhancer = ImageEnhance.Contrast(image)
-        image = enhancer.enhance(contrast_factor)
-        img_array = np.array(image)

     if preprocessing_options.get("denoise", False):
         try:
-            # Apply
-
-
-
-            else:  # Grayscale image
-                img_array = cv2.fastNlMeansDenoising(img_array, None, 2, 5, 15)  # Reduced from 3,7,21
         else:
-            #
-
-
-
-            img_array = cv2.fastNlMeansDenoising(img_array, None, 3, 5, 15)  # Reduced from 5,7,21
         except Exception as e:
-            logger.error(f"Denoising error: {str(e)}

     # Convert back to PIL Image
-

     # Higher quality for OCR processing
     byte_io = io.BytesIO()
@@ -135,16 +558,14 @@ def preprocess_image(image_bytes, preprocessing_options):

     logger.info(f"Preprocessing complete. Original image mode: {image.mode}, processed mode: {processed_image.mode}")
     logger.info(f"Original size: {len(image_bytes)/1024:.1f}KB, processed size: {len(byte_io.getvalue())/1024:.1f}KB")

     return byte_io.getvalue()
 except Exception as e:
     logger.error(f"Error saving processed image: {str(e)}")
     # Fallback to original image
     logger.info("Using original image as fallback")
-
-    image.save(image_io, format='JPEG', quality=92)
-    image_io.seek(0)
-    return image_io.getvalue()

 def create_temp_file(content, suffix, temp_file_paths):
     """Create a temporary file and track it for cleanup"""
@@ -157,19 +578,53 @@ def create_temp_file(content, suffix, temp_file_paths):
     return temp_path

 def apply_preprocessing_to_file(file_bytes, file_ext, preprocessing_options, temp_file_paths):
-    """
-
-
     has_preprocessing = (
         preprocessing_options.get("grayscale", False) or
         preprocessing_options.get("denoise", False) or
-        preprocessing_options.get("contrast", 0) != 0
-        preprocessing_options.get("rotation", 0) != 0
     )

-
     # Apply preprocessing
     logger.info(f"Applying preprocessing with options: {preprocessing_options}")
     processed_bytes = preprocess_image(file_bytes, preprocessing_options)

     # Save processed image to temp file
 import cv2
 import numpy as np
 import tempfile
+import time
+import math
+import json
 from PIL import Image, ImageEnhance, ImageFilter
 from pdf2image import convert_from_bytes
 import streamlit as st
 import logging
+import concurrent.futures
+from pathlib import Path

 # Configure logging
 logger = logging.getLogger("preprocessing")
 logger.setLevel(logging.INFO)

+# Ensure logs directory exists
+def ensure_log_directory(config):
+    """Create logs directory if it doesn't exist"""
+    if config.get("logging", {}).get("enabled", False):
+        log_path = config.get("logging", {}).get("output_path", "logs/preprocessing_metrics.json")
+        log_dir = os.path.dirname(log_path)
+        if log_dir:
+            Path(log_dir).mkdir(parents=True, exist_ok=True)
+
+def log_preprocessing_metrics(metrics, config):
+    """Log preprocessing metrics to JSON file"""
+    if not config.get("enabled", False):
+        return
+
+    log_path = config.get("output_path", "logs/preprocessing_metrics.json")
+    ensure_log_directory({"logging": {"enabled": True, "output_path": log_path}})
+
+    # Add timestamp
+    metrics["timestamp"] = time.strftime("%Y-%m-%d %H:%M:%S")
+
+    # Append to log file
+    try:
+        existing_data = []
+        if os.path.exists(log_path):
+            with open(log_path, 'r') as f:
+                existing_data = json.load(f)
+            if not isinstance(existing_data, list):
+                existing_data = [existing_data]
+
+        existing_data.append(metrics)
+
+        with open(log_path, 'w') as f:
+            json.dump(existing_data, f, indent=2)
+
+        logger.info(f"Logged preprocessing metrics to {log_path}")
+    except Exception as e:
+        logger.error(f"Error logging preprocessing metrics: {str(e)}")
+
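For illustration, one call to the metrics logger above could look like this (file name and timing are invented; the config keys are the ones the function reads):

    metrics = {
        "file": "letter_1897.jpg",
        "document_type": "handwritten",
        "preprocessing_applied": ["grayscale", "light_denoise"],
        "processing_time": 182.4,  # milliseconds
    }
    log_preprocessing_metrics(metrics, {"enabled": True,
                                        "output_path": "logs/preprocessing_metrics.json"})
    # Appends a timestamped entry to logs/preprocessing_metrics.json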
+def get_document_config(document_type, global_config):
+    """
+    Get document-specific preprocessing configuration by merging with global settings.
+
+    Args:
+        document_type: The type of document (e.g., 'standard', 'newspaper', 'handwritten')
+        global_config: The global preprocessing configuration
+
+    Returns:
+        A merged configuration dictionary with document-specific overrides
+    """
+    # Start with a copy of the global config
+    config = {
+        "deskew": global_config.get("deskew", {}),
+        "thresholding": global_config.get("thresholding", {}),
+        "morphology": global_config.get("morphology", {}),
+        "performance": global_config.get("performance", {}),
+        "logging": global_config.get("logging", {})
+    }
+
+    # Apply document-specific overrides if they exist
+    doc_types = global_config.get("document_types", {})
+    if document_type in doc_types:
+        doc_config = doc_types[document_type]
+
+        # Merge document-specific settings into the config
+        for section in doc_config:
+            if section in config:
+                config[section].update(doc_config[section])
+
+    return config
+
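A sketch of the merge behavior with an assumed global config (only keys the function reads):

    global_config = {
        "deskew": {"enabled": True, "max_angle": 45.0},
        "thresholding": {"method": "adaptive", "adaptive_block_size": 11},
        "morphology": {"enabled": False},
        "performance": {},
        "logging": {"enabled": False},
        "document_types": {
            "newspaper": {"thresholding": {"adaptive_block_size": 25}},  # override
        },
    }
    cfg = get_document_config("newspaper", global_config)
    # cfg["thresholding"] is now {"method": "adaptive", "adaptive_block_size": 25}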
+def deskew_image(img_array, config):
+    """
+    Detect and correct skew in document images.
+
+    Uses a combination of methods (minAreaRect and/or Hough transform)
+    to estimate the skew angle more robustly.
+
+    Args:
+        img_array: Input image as numpy array
+        config: Deskew configuration dict
+
+    Returns:
+        Deskewed image as numpy array, estimated angle, success flag
+    """
+    if not config.get("enabled", False):
+        return img_array, 0.0, True
+
+    # Convert to grayscale if needed
+    gray = img_array if len(img_array.shape) == 2 else cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
+
+    # Start with a threshold to get binary image for angle detection
+    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
+
+    angles = []
+    angle_threshold = config.get("angle_threshold", 0.1)
+    max_angle = config.get("max_angle", 45.0)
+
+    # Method 1: minAreaRect approach
+    try:
+        # Find all contours
+        contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
+
+        # Filter contours by area to avoid noise
+        min_area = binary.shape[0] * binary.shape[1] * 0.0001  # 0.01% of image area
+        filtered_contours = [cnt for cnt in contours if cv2.contourArea(cnt) > min_area]
+
+        # Get angles from rotated rectangles around contours
+        for contour in filtered_contours:
+            rect = cv2.minAreaRect(contour)
+            width, height = rect[1]
+
+            # Calculate the angle based on the longer side
+            # (This is important for getting the orientation right)
+            angle = rect[2]
+            if width < height:
+                angle += 90
+
+            # Normalize angle to -45 to 45 range
+            if angle > 45:
+                angle -= 90
+            if angle < -45:
+                angle += 90
+
+            # Clamp angle to max limit
+            angle = max(min(angle, max_angle), -max_angle)
+            angles.append(angle)
+    except Exception as e:
+        logger.error(f"Error in minAreaRect skew detection: {str(e)}")
+
+    # Method 2: Hough Transform approach (if enabled)
+    if config.get("use_hough", True):
+        try:
+            # Apply Canny edge detection
+            edges = cv2.Canny(gray, 50, 150, apertureSize=3)
+
+            # Apply Hough lines
+            lines = cv2.HoughLinesP(edges, 1, np.pi/180,
+                                    threshold=100, minLineLength=100, maxLineGap=10)
+
+            if lines is not None:
+                for line in lines:
+                    x1, y1, x2, y2 = line[0]
+                    if x2 - x1 != 0:  # Avoid division by zero
+                        # Calculate line angle in degrees
+                        angle = math.atan2(y2 - y1, x2 - x1) * 180.0 / np.pi
+
+                        # Normalize angle to -45 to 45 range
+                        if angle > 45:
+                            angle -= 90
+                        if angle < -45:
+                            angle += 90
+
+                        # Clamp angle to max limit
+                        angle = max(min(angle, max_angle), -max_angle)
+                        angles.append(angle)
+        except Exception as e:
+            logger.error(f"Error in Hough transform skew detection: {str(e)}")
+
+    # If no angles were detected, return original image
+    if not angles:
+        logger.warning("No skew angles detected, using original image")
+        return img_array, 0.0, False
+
+    # Combine angles using the specified consensus method
+    consensus_method = config.get("consensus_method", "average")
+    if consensus_method == "average":
+        final_angle = sum(angles) / len(angles)
+    elif consensus_method == "median":
+        final_angle = sorted(angles)[len(angles) // 2]
+    elif consensus_method == "min":
+        final_angle = min(angles, key=abs)
+    elif consensus_method == "max":
+        final_angle = max(angles, key=abs)
+    else:
+        final_angle = sum(angles) / len(angles)  # Default to average
+
+    # If angle is below threshold, don't rotate
+    if abs(final_angle) < angle_threshold:
+        logger.info(f"Detected angle ({final_angle:.2f}°) is below threshold, skipping deskew")
+        return img_array, final_angle, True
+
+    # Log the detected angle
+    logger.info(f"Deskewing image with angle: {final_angle:.2f}°")
+
+    # Get image dimensions
+    h, w = img_array.shape[:2]
+    center = (w // 2, h // 2)
+
+    # Get rotation matrix
+    rotation_matrix = cv2.getRotationMatrix2D(center, final_angle, 1.0)
+
+    # Calculate new image dimensions
+    abs_cos = abs(rotation_matrix[0, 0])
+    abs_sin = abs(rotation_matrix[0, 1])
+    new_w = int(h * abs_sin + w * abs_cos)
+    new_h = int(h * abs_cos + w * abs_sin)
+
+    # Adjust the rotation matrix to account for new dimensions
+    rotation_matrix[0, 2] += (new_w / 2) - center[0]
+    rotation_matrix[1, 2] += (new_h / 2) - center[1]
+
+    # Perform the rotation
+    try:
+        # Determine the number of channels to create the correct output array
+        if len(img_array.shape) == 3:
+            rotated = cv2.warpAffine(img_array, rotation_matrix, (new_w, new_h),
+                                     flags=cv2.INTER_LINEAR, borderMode=cv2.BORDER_CONSTANT,
+                                     borderValue=(255, 255, 255))
+        else:
+            rotated = cv2.warpAffine(img_array, rotation_matrix, (new_w, new_h),
+                                     flags=cv2.INTER_LINEAR, borderMode=cv2.BORDER_CONSTANT,
+                                     borderValue=255)
+        return rotated, final_angle, True
+    except Exception as e:
+        logger.error(f"Error rotating image: {str(e)}")
+        if config.get("fallback", {}).get("enabled", True):
+            logger.info("Using original image as fallback after rotation failure")
+            return img_array, final_angle, False
+        return img_array, final_angle, False
+
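Example invocation (config values assumed; "page.png" is a hypothetical scan):

    import cv2

    deskew_cfg = {"enabled": True, "angle_threshold": 0.1, "max_angle": 45.0,
                  "use_hough": True, "consensus_method": "median",
                  "fallback": {"enabled": True}}
    img = cv2.imread("page.png")  # BGR array; None if the file is missing
    deskewed, angle, ok = deskew_image(img, deskew_cfg)
    print(f"estimated skew: {angle:.2f} degrees, corrected: {ok}")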
+def preblur(img_array, config):
+    """
+    Apply pre-filtering blur to stabilize thresholding results.
+
+    Args:
+        img_array: Input image as numpy array
+        config: Pre-blur configuration dict
+
+    Returns:
+        Blurred image as numpy array
+    """
+    if not config.get("enabled", False):
+        return img_array
+
+    method = config.get("method", "gaussian")
+    kernel_size = config.get("kernel_size", 3)
+
+    # Ensure kernel size is odd
+    if kernel_size % 2 == 0:
+        kernel_size += 1
+
+    try:
+        if method == "gaussian":
+            return cv2.GaussianBlur(img_array, (kernel_size, kernel_size), 0)
+        elif method == "median":
+            return cv2.medianBlur(img_array, kernel_size)
+        else:
+            logger.warning(f"Unknown blur method: {method}, using gaussian")
+            return cv2.GaussianBlur(img_array, (kernel_size, kernel_size), 0)
+    except Exception as e:
+        logger.error(f"Error applying {method} blur: {str(e)}")
+        return img_array
+
+def apply_threshold(img_array, config):
+    """
+    Apply thresholding to create binary image.
+
+    Supports Otsu's method and adaptive thresholding.
+    Includes pre-filtering and fallback mechanisms.
+
+    Args:
+        img_array: Input image as numpy array
+        config: Thresholding configuration dict
+
+    Returns:
+        Binary image as numpy array, success flag
+    """
+    method = config.get("method", "adaptive")
+    if method == "none":
+        return img_array, True
+
+    # Convert to grayscale if needed
+    gray = img_array if len(img_array.shape) == 2 else cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
+
+    # Apply pre-blur if configured
+    preblur_config = config.get("preblur", {})
+    if preblur_config.get("enabled", False):
+        gray = preblur(gray, preblur_config)
+
+    binary = None
+    try:
+        if method == "otsu":
+            # Apply Otsu's thresholding
+            _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
+        elif method == "adaptive":
+            # Apply adaptive thresholding
+            block_size = config.get("adaptive_block_size", 11)
+            constant = config.get("adaptive_constant", 2)
+
+            # Ensure block size is odd
+            if block_size % 2 == 0:
+                block_size += 1
+
+            binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
+                                           cv2.THRESH_BINARY, block_size, constant)
+        else:
+            logger.warning(f"Unknown thresholding method: {method}, using adaptive")
+            block_size = config.get("adaptive_block_size", 11)
+            constant = config.get("adaptive_constant", 2)
+
+            # Ensure block size is odd
+            if block_size % 2 == 0:
+                block_size += 1
+
+            binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
+                                           cv2.THRESH_BINARY, block_size, constant)
+    except Exception as e:
+        logger.error(f"Error applying {method} thresholding: {str(e)}")
+        if config.get("fallback", {}).get("enabled", True):
+            logger.info("Using original grayscale image as fallback after thresholding failure")
+            return gray, False
+        return gray, False
+
+    # Calculate percentage of non-zero pixels for logging
+    nonzero_pct = np.count_nonzero(binary) / binary.size * 100
+    logger.info(f"Binary image has {nonzero_pct:.2f}% non-zero pixels")
+
+    # Check if thresholding was successful (crude check)
+    if nonzero_pct < 1 or nonzero_pct > 99:
+        logger.warning(f"Thresholding produced extreme result ({nonzero_pct:.2f}% non-zero)")
+        if config.get("fallback", {}).get("enabled", True):
+            logger.info("Using original grayscale image as fallback after poor thresholding")
+            return gray, False
+
+    return binary, True
+
+def apply_morphology(binary_img, config):
+    """
+    Apply morphological operations to clean up binary image.
+
+    Supports opening, closing, or both operations.
+
+    Args:
+        binary_img: Binary image as numpy array
+        config: Morphology configuration dict
+
+    Returns:
+        Processed binary image as numpy array
+    """
+    if not config.get("enabled", False):
+        return binary_img
+
+    operation = config.get("operation", "close")
+    kernel_size = config.get("kernel_size", 1)
+    kernel_shape = config.get("kernel_shape", "rect")
+
+    # Create appropriate kernel
+    if kernel_shape == "rect":
+        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_size*2+1, kernel_size*2+1))
+    elif kernel_shape == "ellipse":
+        kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size*2+1, kernel_size*2+1))
+    elif kernel_shape == "cross":
+        kernel = cv2.getStructuringElement(cv2.MORPH_CROSS, (kernel_size*2+1, kernel_size*2+1))
+    else:
+        logger.warning(f"Unknown kernel shape: {kernel_shape}, using rect")
+        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_size*2+1, kernel_size*2+1))
+
+    result = binary_img
+    try:
+        if operation == "open":
+            # Opening: Erosion followed by dilation - removes small noise
+            result = cv2.morphologyEx(binary_img, cv2.MORPH_OPEN, kernel)
+        elif operation == "close":
+            # Closing: Dilation followed by erosion - fills small holes
+            result = cv2.morphologyEx(binary_img, cv2.MORPH_CLOSE, kernel)
+        elif operation == "both":
+            # Both operations in sequence
+            result = cv2.morphologyEx(binary_img, cv2.MORPH_OPEN, kernel)
+            result = cv2.morphologyEx(result, cv2.MORPH_CLOSE, kernel)
+        else:
+            logger.warning(f"Unknown morphological operation: {operation}, using close")
+            result = cv2.morphologyEx(binary_img, cv2.MORPH_CLOSE, kernel)
+    except Exception as e:
+        logger.error(f"Error applying morphological operation: {str(e)}")
+        return binary_img
+
+    return result
+
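Together, preblur, apply_threshold, and apply_morphology form the binarization path. A sketch with assumed configs:

    import cv2

    gray = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)  # hypothetical scan
    thresh_cfg = {"method": "adaptive", "adaptive_block_size": 15, "adaptive_constant": 3,
                  "preblur": {"enabled": True, "method": "median", "kernel_size": 3},
                  "fallback": {"enabled": True}}
    morph_cfg = {"enabled": True, "operation": "open",
                 "kernel_size": 1, "kernel_shape": "ellipse"}

    binary, ok = apply_threshold(gray, thresh_cfg)    # pre-blur runs inside
    if ok:
        binary = apply_morphology(binary, morph_cfg)  # opening removes speckle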
 @st.cache_data(ttl=24*3600, show_spinner=False)  # Cache for 24 hours
 def convert_pdf_to_images(pdf_bytes, dpi=150, rotation=0):
     """Convert PDF bytes to a list of images with caching"""
...

 @st.cache_data(ttl=24*3600, show_spinner=False, hash_funcs={dict: lambda x: str(sorted(x.items()))})
 def preprocess_image(image_bytes, preprocessing_options):
+    """
+    Conservative preprocessing function for handwritten documents with early exit for clean scans.
+    Implements light processing: grayscale → denoise (gently) → contrast (conservative)
+
+    Args:
+        image_bytes: Image content as bytes
+        preprocessing_options: Dictionary with document_type, grayscale, denoise, contrast options
+
+    Returns:
+        Processed image bytes or original image bytes if no processing needed
+    """
     # Setup basic console logging
     logger = logging.getLogger("image_preprocessor")
     logger.setLevel(logging.INFO)

     # Log which preprocessing options are being applied
+    logger.info(f"Document type: {preprocessing_options.get('document_type', 'standard')}")
+
+    # Check if any preprocessing is actually requested
+    has_preprocessing = (
+        preprocessing_options.get("grayscale", False) or
+        preprocessing_options.get("denoise", False) or
+        preprocessing_options.get("contrast", 0) != 0
+    )

     # Convert bytes to PIL Image
     image = Image.open(io.BytesIO(image_bytes))

+    # Check for minimal skew and exit early if document is already straight
+    # This avoids unnecessary processing for clean scans
+    try:
+        from utils.image_utils import detect_skew
+        skew_angle = detect_skew(image)
+        if abs(skew_angle) < 0.5:
+            logger.info(f"Document has minimal skew ({skew_angle:.2f}°), skipping preprocessing")
+            # Return original image bytes as is for perfectly straight documents
+            if not has_preprocessing:
+                return image_bytes
+    except Exception as e:
+        logger.warning(f"Error in skew detection: {str(e)}, continuing with preprocessing")
+
+    # If no preprocessing options are selected, return the original image
+    if not has_preprocessing:
+        logger.info("No preprocessing options selected, skipping preprocessing")
+        return image_bytes
+
+    # Initialize metrics for logging
+    metrics = {
+        "file": preprocessing_options.get("filename", "unknown"),
+        "document_type": preprocessing_options.get("document_type", "standard"),
+        "preprocessing_applied": []
+    }
+    start_time = time.time()
+
+    # Handle RGBA images (transparency) by converting to RGB
     if image.mode == 'RGBA':
+        # Convert RGBA to RGB by compositing onto white background
+        logger.info("Converting RGBA image to RGB")
         background = Image.new('RGB', image.size, (255, 255, 255))
         background.paste(image, mask=image.split()[3])  # 3 is the alpha channel
         image = background
+        metrics["preprocessing_applied"].append("alpha_conversion")
     elif image.mode not in ('RGB', 'L'):
+        # Convert other modes to RGB
+        logger.info(f"Converting {image.mode} image to RGB")
         image = image.convert('RGB')
+        metrics["preprocessing_applied"].append("format_conversion")

+    # Convert to NumPy array for OpenCV processing
     img_array = np.array(image)

+    # Apply grayscale if requested (useful for handwritten text)
     if preprocessing_options.get("grayscale", False):
         if len(img_array.shape) == 3:  # Only convert if it's not already grayscale
+            # For handwritten documents, apply gentle CLAHE to enhance contrast locally
+            if preprocessing_options.get("document_type") == "handwritten":
                 img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
+                clahe = cv2.createCLAHE(clipLimit=1.5, tileGridSize=(8,8))  # Conservative clip limit
                 img_array = clahe.apply(img_array)
             else:
                 # Standard grayscale for printed documents
                 img_array = cv2.cvtColor(img_array, cv2.COLOR_RGB2GRAY)
+
+        metrics["preprocessing_applied"].append("grayscale")

+    # Apply light denoising if requested
     if preprocessing_options.get("denoise", False):
         try:
+            # Apply very gentle denoising
+            is_color = len(img_array.shape) == 3 and img_array.shape[2] == 3
+            if is_color:
+                # Very light color denoising with conservative parameters
+                img_array = cv2.fastNlMeansDenoisingColored(img_array, None, 2, 2, 3, 7)
             else:
+                # Very light grayscale denoising
+                img_array = cv2.fastNlMeansDenoising(img_array, None, 2, 3, 7)
+
+            metrics["preprocessing_applied"].append("light_denoise")
         except Exception as e:
+            logger.error(f"Denoising error: {str(e)}")
+
+    # Apply contrast adjustment if requested (conservative range)
+    contrast_value = preprocessing_options.get("contrast", 0)
+    if contrast_value != 0:
+        # Use a gentler contrast adjustment factor
+        contrast_factor = 1 + (contrast_value / 200)  # Conservative scaling factor

+        # Convert NumPy array back to PIL Image for contrast adjustment
+        if len(img_array.shape) == 2:  # If grayscale, convert to RGB for PIL
+            image = Image.fromarray(cv2.cvtColor(img_array, cv2.COLOR_GRAY2RGB))
+        else:
+            image = Image.fromarray(img_array)
+
+        enhancer = ImageEnhance.Contrast(image)
+        image = enhancer.enhance(contrast_factor)
+
+        # Convert back to NumPy array
+        img_array = np.array(image)
+        metrics["preprocessing_applied"].append(f"contrast_{contrast_value}")
+
     # Convert back to PIL Image
+    if len(img_array.shape) == 2:  # If grayscale, convert to RGB for saving
+        processed_image = Image.fromarray(cv2.cvtColor(img_array, cv2.COLOR_GRAY2RGB))
+    else:
+        processed_image = Image.fromarray(img_array)
+
+    # Record total processing time
+    metrics["processing_time"] = (time.time() - start_time) * 1000  # ms

     # Higher quality for OCR processing
     byte_io = io.BytesIO()
...

     logger.info(f"Preprocessing complete. Original image mode: {image.mode}, processed mode: {processed_image.mode}")
     logger.info(f"Original size: {len(image_bytes)/1024:.1f}KB, processed size: {len(byte_io.getvalue())/1024:.1f}KB")
+    logger.info(f"Applied preprocessing steps: {', '.join(metrics['preprocessing_applied'])}")

     return byte_io.getvalue()
 except Exception as e:
     logger.error(f"Error saving processed image: {str(e)}")
     # Fallback to original image
     logger.info("Using original image as fallback")
+    return image_bytes
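A sketch of exercising the rewritten preprocess_image directly (path and option values invented):

    with open("scans/letter_1897.jpg", "rb") as f:  # hypothetical input
        raw = f.read()

    options = {"document_type": "handwritten",
               "grayscale": True, "denoise": True, "contrast": 10}
    processed = preprocess_image(raw, options)
    # Re-encoded image bytes, or the original bytes when the scan is already
    # straight and no options are selected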

 def create_temp_file(content, suffix, temp_file_paths):
     """Create a temporary file and track it for cleanup"""
...
     return temp_path

 def apply_preprocessing_to_file(file_bytes, file_ext, preprocessing_options, temp_file_paths):
+    """
+    Apply conservative preprocessing to file and return path to the temporary file.
+    Handles format conversion and user-selected preprocessing options.
+
+    Args:
+        file_bytes: File content as bytes
+        file_ext: File extension (e.g., '.jpg', '.pdf')
+        preprocessing_options: Dictionary with document_type and preprocessing options
+        temp_file_paths: List to track temporary files for cleanup
+
+    Returns:
+        Tuple of (temp_file_path, was_processed_flag)
+    """
+    document_type = preprocessing_options.get("document_type", "standard")
+
+    # Check for user-selected preprocessing
     has_preprocessing = (
         preprocessing_options.get("grayscale", False) or
         preprocessing_options.get("denoise", False) or
+        preprocessing_options.get("contrast", 0) != 0
     )

+    # Check for RGBA/transparency that needs conversion
+    format_needs_conversion = False
+
+    # Only check formats that might have transparency
+    if file_ext.lower() in ['.png', '.tif', '.tiff']:
+        try:
+            # Check if image has transparency
+            image = Image.open(io.BytesIO(file_bytes))
+            if image.mode == 'RGBA' or image.mode not in ('RGB', 'L'):
+                format_needs_conversion = True
+        except Exception as e:
+            logger.warning(f"Error checking image format: {str(e)}")
+
+    # Process if user requested preprocessing OR format needs conversion
+    needs_processing = has_preprocessing or format_needs_conversion
+
+    if needs_processing:
         # Apply preprocessing
         logger.info(f"Applying preprocessing with options: {preprocessing_options}")
+        logger.info(f"Using document type '{document_type}' with advanced preprocessing options")
+
+        # Add filename to preprocessing options for logging if available
+        if hasattr(file_bytes, 'name'):
+            preprocessing_options["filename"] = file_bytes.name
+
         processed_bytes = preprocess_image(file_bytes, preprocessing_options)

         # Save processed image to temp file
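The caller-side contract, following the docstring above (inputs invented; `raw` is image bytes as in the previous sketch):

    temp_paths = []  # populated so the caller can clean up later
    tmp_path, was_processed = apply_preprocessing_to_file(
        raw, ".png",
        {"document_type": "standard", "grayscale": True,
         "denoise": False, "contrast": 0},
        temp_paths,
    )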
process_file.py
CHANGED
@@ -53,9 +53,7 @@ def process_file(uploaded_file, use_vision=True, processor=None, custom_prompt=N
             "file_size_mb": round(file_size_mb, 2),
             "use_vision": use_vision
         })
-
-        # No longer needed - removing confidence score
-
+
         return result
     except Exception as e:
         return {
@@ -65,4 +63,4 @@ def process_file(uploaded_file, use_vision=True, processor=None, custom_prompt=N
     finally:
         # Clean up the temporary file
         if os.path.exists(temp_path):
-            os.unlink(temp_path)
+            os.unlink(temp_path)
requirements.txt
CHANGED
@@ -10,6 +10,7 @@ Pillow>=10.0.0
 opencv-python-headless>=4.8.0.74
 pdf2image>=1.16.0
 pytesseract>=0.3.10  # For local OCR fallback
+matplotlib>=3.7.0  # For visualization in preprocessing tests

 # Data handling and utilities
 numpy>=1.24.0
structured_ocr.py
CHANGED
@@ -47,28 +47,38 @@ except ImportError:

 # Import utilities for OCR processing
 try:
-    from
 except ImportError:
-    # Define fallback functions if module not found
     def replace_images_in_markdown(markdown_str, images_dict):
-
-
-
         return markdown_str

     def get_combined_markdown(ocr_response):
         markdowns = []
         for page in ocr_response.pages:
             image_data = {}
-
-
-
         return "\n\n".join(markdowns)

 # Import config directly (now local to historical-ocr)
 try:
-    from config import MISTRAL_API_KEY, OCR_MODEL, TEXT_MODEL, VISION_MODEL, TEST_MODE
 except ImportError:
     # Fallback defaults if config is not available
     import os
@@ -77,6 +87,14 @@ except ImportError:
     TEXT_MODEL = "mistral-large-latest"
     VISION_MODEL = "mistral-large-latest"
     TEST_MODE = True
     logging.warning("Config module not found. Using environment variables and defaults.")

 # Helper function to make OCR objects JSON serializable
@@ -127,6 +145,13 @@ def serialize_ocr_response(obj):
             is_valid_image = False
             logging.warning("Markdown image reference detected")

         # Case 3: Needs detailed text content detection
         else:
             # Use the same proven approach as in our tests
@@ -185,9 +210,27 @@ def serialize_ocr_response(obj):
                 'image_base64': image_base64
             }
         else:
-            # Process as text if validation fails
             if image_base64 and isinstance(image_base64, str):
-
             else:
                 result[key] = str(value)
         # Handle collections
@@ -382,13 +425,47 @@ class StructuredOCR:
             result = serialize_ocr_response(result)

             # Make a final pass to check for any remaining non-serializable objects
-            #
-
         except TypeError as e:
-            # If there's a serialization error, run the whole result through our serializer
             logger = logging.getLogger("serializer")
             logger.warning(f"JSON serialization error in result: {str(e)}. Applying full serialization.")
-

         return result

@@ -1104,9 +1181,10 @@ class StructuredOCR:

         # Use enhanced preprocessing functions from ocr_utils
         try:
-            from

-            logger.info(f"Applying

             # Get preprocessing settings from config
             max_size_mb = IMAGE_PREPROCESSING.get("max_size_mb", 8.0)
@@ -1114,8 +1192,14 @@ class StructuredOCR:
             if file_size_mb > max_size_mb:
                 logger.info(f"Image is large ({file_size_mb:.2f} MB), optimizing for API submission")

-                #
-

                 logger.info(f"Image preprocessing completed successfully")

@@ -1169,7 +1253,7 @@ class StructuredOCR:
                 except ImportError:
                     logger.warning("PIL not available for resizing. Using original image.")
                     # Use enhanced encoder with proper MIME type detection
-                    from
                     base64_data_url = encode_image_for_api(file_path)
             except Exception as e:
                 logger.warning(f"Image resize failed: {str(e)}. Using original image.")
@@ -1178,7 +1262,7 @@ class StructuredOCR:
                 base64_data_url = encode_image_for_api(file_path)
             else:
                 # For smaller images, use as-is with proper MIME type
-                from
                 base64_data_url = encode_image_for_api(file_path)
         except Exception as e:
             # Fallback to original image if any preprocessing fails
@@ -1243,7 +1327,7 @@ class StructuredOCR:
                 logger.error("Maximum retries reached, rate limit error persists.")
                 try:
                     # Try to import the local OCR fallback function
-                    from

                     # Attempt local OCR fallback
                     ocr_text = try_local_ocr_fallback(file_path, base64_data_url)
@@ -1455,7 +1539,14 @@ class StructuredOCR:
             logger.info("Sufficient OCR text detected, analyzing language before using OCR text directly")

             # Perform language detection on the OCR text before returning
-

             return {
                 "file_name": filename,
@@ -1629,7 +1720,12 @@ class StructuredOCR:

             # If OCR text has clear French patterns but language is English or missing, fix it
             if ocr_markdown and 'languages' in result:
-

         except Exception as e:
             # Fall back to text-only model if vision model fails
@@ -1639,22 +1735,25 @@ class StructuredOCR:
             return result

         # We've removed document type detection entirely for simplicity

         # Create a prompt with enhanced language detection instructions
         generic_section = (
             f"You are an OCR specialist processing historical documents. "
-            f"Focus on accurately extracting text content while preserving structure and formatting. "
             f"Pay attention to any historical features and document characteristics.\n\n"
-            f"IMPORTANT: Accurately identify the document's language(s). Look for language-specific characters, words, and phrases. "
-            f"Specifically check for French (accents like é, è, ç, words like 'le', 'la', 'et', 'est'), German (umlauts, words like 'und', 'der', 'das'), "
-            f"Latin, and other non-English languages. Carefully analyze the text before determining language.\n\n"
             f"Create a structured JSON response with the following fields:\n"
             f"- file_name: The document's name\n"
             f"- topics: An array of topics covered in the document\n"
             f"- languages: An array of languages used in the document (be precise and specific about language detection)\n"
             f"- ocr_contents: A comprehensive dictionary with the document's contents including:\n"
-            f" * title: The
-            f" *
             f" * raw_text: The complete OCR text\n"
         )
@@ -1665,86 +1764,7 @@ class StructuredOCR:

         # Return the enhanced prompt
         return generic_section + custom_section
-
-    def _detect_text_language(self, text, current_languages=None):
-        """
-        Detect language from text content using the external language detector
-        or falling back to internal detection if needed
-
-        Args:
-            text: The text to analyze
-            current_languages: Optional list of languages already detected
-
-        Returns:
-            List of detected languages
-        """
-        logger = logging.getLogger("language_detector")
-
-        # If no text provided, return current languages or default
-        if not text or len(text.strip()) < 10:
-            return current_languages if current_languages else ["English"]
-
-        # Use the external language detector if available
-        if LANG_DETECTOR_AVAILABLE and self.language_detector:
-            logger.info("Using external language detector")
-            return self.language_detector.detect_languages(text,
-                                                           filename=getattr(self, 'current_filename', None),
-                                                           current_languages=current_languages)
-
-        # Fallback for when the external module is not available
-        logger.info("Language detector not available, using simple detection")
-
-        # Get all words from text (lowercase for comparison)
-        text_lower = text.lower()
-        words = text_lower.split()
-
-        # Basic language markers - equal treatment of all languages
-        language_indicators = {
-            "French": {
-                "chars": ['é', 'è', 'ê', 'à', 'ç', 'ù', 'â', 'î', 'ô', 'û'],
-                "words": ['le', 'la', 'les', 'et', 'en', 'de', 'du', 'des', 'dans', 'ce', 'cette']
-            },
-            "Spanish": {
-                "chars": ['ñ', 'á', 'é', 'í', 'ó', 'ú', '¿', '¡'],
-                "words": ['el', 'la', 'los', 'las', 'y', 'en', 'por', 'que', 'con', 'del']
-            },
-            "German": {
-                "chars": ['ä', 'ö', 'ü', 'ß'],
-                "words": ['der', 'die', 'das', 'und', 'ist', 'von', 'mit', 'für', 'sich']
-            },
-            "Latin": {
-                "chars": [],
-                "words": ['et', 'in', 'ad', 'est', 'sunt', 'non', 'cum', 'sed', 'qui', 'quod']
-            }
-        }
-
-        detected_languages = []
-
-        # Simple detection logic - check for language markers
-        for language, indicators in language_indicators.items():
-            has_chars = any(char in text_lower for char in indicators["chars"])
-            has_words = any(word in words for word in indicators["words"])

-            if has_chars and has_words:
-                detected_languages.append(language)
-
-        # Check for English
-        english_words = ['the', 'and', 'of', 'to', 'in', 'a', 'is', 'that', 'for', 'it']
-        if sum(1 for word in words if word in english_words) >= 2:
-            detected_languages.append("English")
-
-        # If no languages detected, default to English
-        if not detected_languages:
-            detected_languages = ["English"]
-
-        # Limit to top 2 languages
-        detected_languages = detected_languages[:2]
-
-        # Log what we found
-        logger.info(f"Simple fallback language detection results: {detected_languages}")
-
-        return detected_languages
-
     def _extract_structured_data_text_only(self, ocr_markdown, filename, custom_prompt=None):
         """
         Extract structured data using text-only model with detailed historical context prompting

 # Import utilities for OCR processing
 try:
+    from utils.image_utils import replace_images_in_markdown, get_combined_markdown
 except ImportError:
+    # Define minimal fallback functions if module not found
+    logger.warning("Could not import utils.image_utils - using minimal fallback functions")
+
     def replace_images_in_markdown(markdown_str, images_dict):
+        """Minimal fallback implementation of replace_images_in_markdown"""
+        import re
+        for img_id, base64_str in images_dict.items():
+            # Match alt text OR link part, ignore extension
+            base_id = img_id.split('.')[0]
+            pattern = re.compile(rf"!\[[^\]]*{base_id}[^\]]*\]\([^\)]+\)")
+            markdown_str = pattern.sub(f"", markdown_str)
         return markdown_str

     def get_combined_markdown(ocr_response):
+        """Minimal fallback implementation of get_combined_markdown"""
         markdowns = []
         for page in ocr_response.pages:
             image_data = {}
+            if hasattr(page, "images"):
+                for img in page.images:
+                    if hasattr(img, "id") and hasattr(img, "image_base64"):
+                        image_data[img.id] = img.image_base64
+            page_markdown = page.markdown if hasattr(page, "markdown") else ""
+            processed_markdown = replace_images_in_markdown(page_markdown, image_data)
+            markdowns.append(processed_markdown)
         return "\n\n".join(markdowns)

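Note that the fallback replace_images_in_markdown strips matching image references rather than inlining base64 data; for example:

    md = "Before ![img-0.jpeg](img-0.jpeg) after"
    images = {"img-0.jpeg": "data:image/jpeg;base64,..."}  # value unused by the fallback
    print(replace_images_in_markdown(md, images))
    # -> "Before  after"  (the reference is removed)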
 # Import config directly (now local to historical-ocr)
 try:
+    from config import MISTRAL_API_KEY, OCR_MODEL, TEXT_MODEL, VISION_MODEL, TEST_MODE, IMAGE_PREPROCESSING
 except ImportError:
     # Fallback defaults if config is not available
     import os
...
     TEXT_MODEL = "mistral-large-latest"
     VISION_MODEL = "mistral-large-latest"
     TEST_MODE = True
+    # Default image preprocessing settings if config not available
+    IMAGE_PREPROCESSING = {
+        "max_size_mb": 8.0,
+        # Add basic defaults for preprocessing
+        "enhance_contrast": 1.2,
+        "denoise": True,
+        "compression_quality": 95
+    }
     logging.warning("Config module not found. Using environment variables and defaults.")

 # Helper function to make OCR objects JSON serializable
...
             is_valid_image = False
             logging.warning("Markdown image reference detected")

+            # Extract the image ID for logging
+            try:
+                img_id = image_base64.split('![')[1].split('](')[0]
+                logging.debug(f"Markdown reference for image: {img_id}")
+            except:
+                img_id = "unknown"
+
         # Case 3: Needs detailed text content detection
         else:
             # Use the same proven approach as in our tests
...
                 'image_base64': image_base64
             }
         else:
+            # Process as text if validation fails, but properly handle markdown references
             if image_base64 and isinstance(image_base64, str):
+                # Special handling for markdown image references
+                if image_base64.startswith('!['):
+                    # Extract the image description (alt text) if available
+                    try:
+                        # Parse the alt text from the markdown image reference
+                        alt_text = image_base64.split('![')[1].split('](')[0]
+                        # Use the alt text or a placeholder if it's just the image name
+                        if alt_text and not alt_text.endswith('.jpeg') and not alt_text.endswith('.jpg'):
+                            result[key] = f"[Image: {alt_text}]"
+                        else:
+                            # Just note that there's an image without the reference
+                            result[key] = "[Image]"
+                        logging.info(f"Converted markdown reference to text placeholder: {result[key]}")
+                    except:
+                        # Fallback for parsing errors
+                        result[key] = "[Image]"
+                else:
+                    # Regular text content
+                    result[key] = image_base64
             else:
                 result[key] = str(value)
         # Handle collections
             result = serialize_ocr_response(result)

             # Make a final pass to check for any remaining non-serializable objects
+            # Proactively check for OCRImageObject instances to avoid serialization warnings
+            def has_ocr_image_objects(obj):
+                """Check if object contains any OCRImageObject instances recursively"""
+                if isinstance(obj, dict):
+                    return any(has_ocr_image_objects(v) for v in obj.values())
+                elif isinstance(obj, list):
+                    return any(has_ocr_image_objects(item) for item in obj)
+                else:
+                    return 'OCRImageObject' in str(type(obj))
+
+            # Apply serialization preemptively if OCRImageObjects are detected
+            if has_ocr_image_objects(result):
+                # Quietly apply full serialization before any errors occur
+                result = serialize_ocr_response(result)
+            else:
+                # Test JSON serialization to catch any other issues
+                json.dumps(result)
         except TypeError as e:
+            # If there's still a serialization error, run the whole result through our serializer
             logger = logging.getLogger("serializer")
             logger.warning(f"JSON serialization error in result: {str(e)}. Applying full serialization.")
+            # Use a more robust approach to ensure complete serialization
+            try:
+                # First attempt with our custom serializer
+                result = serialize_ocr_response(result)
+                # Test if it's fully serializable now
+                json.dumps(result)
+            except Exception as inner_e:
+                # If still not serializable, convert to a simpler format
+                logger.warning(f"Secondary serialization error: {str(inner_e)}. Converting to basic format.")
+                # Create a simplified result with just the essential information
+                simplified_result = {
+                    "file_name": result.get("file_name", "unknown"),
+                    "topics": result.get("topics", ["Document"]),
+                    "languages": [str(lang) for lang in result.get("languages", ["English"]) if lang is not None],
+                    "ocr_contents": {
+                        "raw_text": result.get("ocr_contents", {}).get("raw_text", "Text extraction failed due to serialization error")
+                    },
+                    "serialization_error": f"Original result could not be fully serialized: {str(e)}"
+                }
+                result = simplified_result

         return result

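The guard pattern above in miniature: probe with json.dumps and only fall back to the custom serializer when the probe raises (payload is illustrative):

    import json

    payload = {"ocr_contents": {"raw_text": "..."}}  # whatever the pipeline built
    try:
        json.dumps(payload)  # cheap serializability probe
    except TypeError:
        payload = serialize_ocr_response(payload)  # full conversion pass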

         # Use enhanced preprocessing functions from ocr_utils
         try:
+            from preprocessing import preprocess_image
+            from utils.file_utils import get_base64_from_bytes

+            logger.info(f"Applying image preprocessing for OCR")

             # Get preprocessing settings from config
             max_size_mb = IMAGE_PREPROCESSING.get("max_size_mb", 8.0)
...
             if file_size_mb > max_size_mb:
                 logger.info(f"Image is large ({file_size_mb:.2f} MB), optimizing for API submission")

+                # Handwritten docs default to the conservative pipeline
+                base64_data_url = get_base64_from_bytes(
+                    preprocess_image(file_path.read_bytes(),
+                                     {"document_type": "handwritten",
+                                      "grayscale": True,
+                                      "denoise": True,
+                                      "contrast": 0})
+                )

                 logger.info(f"Image preprocessing completed successfully")

...
                 except ImportError:
                     logger.warning("PIL not available for resizing. Using original image.")
                     # Use enhanced encoder with proper MIME type detection
+                    from utils.image_utils import encode_image_for_api
                     base64_data_url = encode_image_for_api(file_path)
             except Exception as e:
                 logger.warning(f"Image resize failed: {str(e)}. Using original image.")
...
                 base64_data_url = encode_image_for_api(file_path)
             else:
                 # For smaller images, use as-is with proper MIME type
+                from utils.image_utils import encode_image_for_api
                 base64_data_url = encode_image_for_api(file_path)
         except Exception as e:
             # Fallback to original image if any preprocessing fails
...
                 logger.error("Maximum retries reached, rate limit error persists.")
                 try:
                     # Try to import the local OCR fallback function
+                    from utils.image_utils import try_local_ocr_fallback

                     # Attempt local OCR fallback
                     ocr_text = try_local_ocr_fallback(file_path, base64_data_url)
             logger.info("Sufficient OCR text detected, analyzing language before using OCR text directly")

             # Perform language detection on the OCR text before returning
+            if LANG_DETECTOR_AVAILABLE and self.language_detector:
+                detected_languages = self.language_detector.detect_languages(
+                    ocr_markdown,
+                    filename=getattr(self, 'current_filename', None)
+                )
+            else:
+                # If language detector is not available, use default English
+                detected_languages = ["English"]

             return {
                 "file_name": filename,
...

             # If OCR text has clear French patterns but language is English or missing, fix it
             if ocr_markdown and 'languages' in result:
+                if LANG_DETECTOR_AVAILABLE and self.language_detector:
+                    result['languages'] = self.language_detector.detect_languages(
+                        ocr_markdown,
+                        filename=getattr(self, 'current_filename', None),
+                        current_languages=result['languages']
+                    )

         except Exception as e:
             # Fall back to text-only model if vision model fails
|
1737 |
# We've removed document type detection entirely for simplicity
|
1738 |
+
|
1739 |
|
1740 |
# Create a prompt with enhanced language detection instructions
|
1741 |
generic_section = (
|
1742 |
f"You are an OCR specialist processing historical documents. "
|
1743 |
+
f"Focus on accurately extracting text content and image chunks while preserving structure and formatting. "
|
1744 |
f"Pay attention to any historical features and document characteristics.\n\n"
|
|
|
|
|
|
|
1745 |
f"Create a structured JSON response with the following fields:\n"
|
1746 |
f"- file_name: The document's name\n"
|
1747 |
f"- topics: An array of topics covered in the document\n"
|
1748 |
f"- languages: An array of languages used in the document (be precise and specific about language detection)\n"
|
1749 |
f"- ocr_contents: A comprehensive dictionary with the document's contents including:\n"
|
1750 |
+
f" * title: The title or heading (if present)\n"
|
1751 |
+
f" * transcript: The full text of the document\n"
|
1752 |
+
f" * text: The main text content (if different from transcript)\n"
|
1753 |
+
f" * content: The body content (if different than transcript)\n"
|
1754 |
+
f" * images: An array of image objects with their base64 data\n"
|
1755 |
+
f" * alt_text: The alt text or description of the images\n"
|
1756 |
+
f" * caption: The caption or title of the images\n"
|
1757 |
f" * raw_text: The complete OCR text\n"
|
1758 |
)
|
1759 |
|
|
|
1764 |
|
1765 |
# Return the enhanced prompt
|
1766 |
return generic_section + custom_section
|
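For reference, a response following this prompt might be shaped like the dictionary below (all values invented):

    expected_shape = {
        "file_name": "letter_1897.jpg",
        "topics": ["correspondence", "19th century"],
        "languages": ["French"],
        "ocr_contents": {
            "title": "Lettre du 3 mai 1897",
            "transcript": "Ma chère Marie, ...",
            "raw_text": "Ma chère Marie, ...",
        },
    }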

     def _extract_structured_data_text_only(self, ocr_markdown, filename, custom_prompt=None):
         """
         Extract structured data using text-only model with detailed historical context prompting
test_magician.py → testing/test_magician.py
RENAMED
File without changes
ui_components.py
CHANGED
@@ -3,9 +3,21 @@ import os
 import io
 import base64
 import logging
 from datetime import datetime
 from pathlib import Path
 import json
 from constants import (
     DOCUMENT_TYPES,
     DOCUMENT_LAYOUTS,
@@ -19,7 +31,16 @@ from constants import (
     PREPROCESSING_DOC_TYPES,
     ROTATION_OPTIONS
 )
-from utils import

 class ProgressReporter:
     """Class to handle progress reporting in the UI"""
@@ -69,12 +90,10 @@ def create_sidebar_options():

     # Create a container for the sidebar options
     with st.container():
-        #
-
-        use_vision = st.toggle("Use Vision Model", value=True, help="Use vision model for better understanding of document structure")

         # Document type selection
-        st.markdown("### Document Type")
         doc_type = st.selectbox("Document Type", DOCUMENT_TYPES,
                                 help="Select the type of document you're processing for better results")

@@ -91,8 +110,8 @@

         # Custom prompt
         custom_prompt = ""
-
-
         prompt_template = CUSTOM_PROMPT_TEMPLATES.get(doc_type, "")

         # Add layout information if not standard
@@ -103,53 +122,37 @@

         # Set the custom prompt
         custom_prompt = prompt_template
-
-        # Allow user to edit the prompt
-        st.markdown("**Custom Processing Instructions**")
-        custom_prompt = st.text_area("", value=custom_prompt,
-                                     help="Customize the instructions for processing this document",
-                                     height=80)

-        (removed lines 113-143; their content is not shown in this view)
-        # Add image segmentation option
-        st.markdown("### Advanced Options")
-        use_segmentation = st.toggle("Enable Image Segmentation",
-                                     value=False,
-                                     help="Segment the image into text and image regions for better OCR results on complex documents")
-
-        # Show explanation if segmentation is enabled
-        if use_segmentation:
-            st.info("Image segmentation identifies distinct text regions in complex documents, improving OCR accuracy. This is especially helpful for documents with mixed content like the Magician illustration.")

         # Create preprocessing options dictionary
         # Set document_type based on selection in UI
@@ -169,17 +172,17 @@
         "rotation": rotation
     }

-    # PDF-specific options
-    (removed lines 173-183; their content is not shown in this view)

     # Create options dictionary
     options = {
@@ -219,471 +222,6 @@
     )
     return uploaded_file

-# Function removed - now using inline implementation in app.py
-def _unused_display_preprocessing_preview(uploaded_file, preprocessing_options):
-    """Display a preview of image with preprocessing options applied"""
-    if (any(preprocessing_options.values()) and
-        uploaded_file.type.startswith('image/')):
-
-        st.markdown("**Preprocessed Preview**")
-
st.markdown("**Preprocessed Preview**")
|
229 |
-
try:
|
230 |
-
# Create a container for the preview
|
231 |
-
with st.container():
|
232 |
-
processed_bytes = preprocess_image(uploaded_file.getvalue(), preprocessing_options)
|
233 |
-
# Convert image to base64 and display as HTML to avoid fullscreen button
|
234 |
-
img_data = base64.b64encode(processed_bytes).decode()
|
235 |
-
img_html = f'<img src="data:image/jpeg;base64,{img_data}" style="width:100%; border-radius:4px;">'
|
236 |
-
st.markdown(img_html, unsafe_allow_html=True)
|
237 |
-
|
238 |
-
# Show preprocessing metadata in a well-formatted caption
|
239 |
-
meta_items = []
|
240 |
-
if preprocessing_options.get("document_type", "standard") != "standard":
|
241 |
-
meta_items.append(f"Document type ({preprocessing_options['document_type']})")
|
242 |
-
if preprocessing_options.get("grayscale", False):
|
243 |
-
meta_items.append("Grayscale")
|
244 |
-
if preprocessing_options.get("denoise", False):
|
245 |
-
meta_items.append("Denoise")
|
246 |
-
if preprocessing_options.get("contrast", 0) != 0:
|
247 |
-
meta_items.append(f"Contrast ({preprocessing_options['contrast']})")
|
248 |
-
if preprocessing_options.get("rotation", 0) != 0:
|
249 |
-
meta_items.append(f"Rotation ({preprocessing_options['rotation']}°)")
|
250 |
-
|
251 |
-
# Only show "Applied:" if there are actual preprocessing steps
|
252 |
-
if meta_items:
|
253 |
-
meta_text = "Applied: " + ", ".join(meta_items)
|
254 |
-
st.caption(meta_text)
|
255 |
-
except Exception as e:
|
256 |
-
st.error(f"Error in preprocessing: {str(e)}")
|
257 |
-
st.info("Try using grayscale preprocessing for PNG images with transparency")
|
258 |
-
|
259 |
-
def display_results(result, container, custom_prompt=""):
|
260 |
-
"""Display OCR results in the provided container"""
|
261 |
-
with container:
|
262 |
-
# Add heading for document metadata
|
263 |
-
st.markdown("### Document Metadata")
|
264 |
-
|
265 |
-
# Create a compact metadata section
|
266 |
-
meta_html = '<div style="display: flex; flex-wrap: wrap; gap: 0.3rem; margin-bottom: 0.3rem;">'
|
267 |
-
|
268 |
-
# Document type
|
269 |
-
if 'detected_document_type' in result:
|
270 |
-
meta_html += f'<div><strong>Type:</strong> {result["detected_document_type"]}</div>'
|
271 |
-
|
272 |
-
# Processing time
|
273 |
-
if 'processing_time' in result:
|
274 |
-
meta_html += f'<div><strong>Time:</strong> {result["processing_time"]:.1f}s</div>'
|
275 |
-
|
276 |
-
# Page information
|
277 |
-
if 'limited_pages' in result:
|
278 |
-
meta_html += f'<div><strong>Pages:</strong> {result["limited_pages"]["processed"]}/{result["limited_pages"]["total"]}</div>'
|
279 |
-
|
280 |
-
meta_html += '</div>'
|
281 |
-
st.markdown(meta_html, unsafe_allow_html=True)
|
282 |
-
|
283 |
-
# Language metadata on a separate line, Subject Tags below
|
284 |
-
|
285 |
-
# First show languages if available
|
286 |
-
if 'languages' in result and result['languages']:
|
287 |
-
languages = [lang for lang in result['languages'] if lang is not None]
|
288 |
-
if languages:
|
289 |
-
# Create a dedicated line for Languages
|
290 |
-
lang_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
|
291 |
-
lang_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Language:</div>'
|
292 |
-
|
293 |
-
# Add language tags
|
294 |
-
for lang in languages:
|
295 |
-
# Clean language name if needed
|
296 |
-
clean_lang = str(lang).strip()
|
297 |
-
if clean_lang: # Only add if not empty
|
298 |
-
lang_html += f'<span class="subject-tag tag-language">{clean_lang}</span>'
|
299 |
-
|
300 |
-
lang_html += '</div>'
|
301 |
-
st.markdown(lang_html, unsafe_allow_html=True)
|
302 |
-
|
303 |
-
# Create a separate line for Time if we have time-related tags
|
304 |
-
if 'topics' in result and result['topics']:
|
305 |
-
time_tags = [topic for topic in result['topics']
|
306 |
-
if any(term in topic.lower() for term in ["century", "pre-", "era", "historical"])]
|
307 |
-
if time_tags:
|
308 |
-
time_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
|
309 |
-
time_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Time:</div>'
|
310 |
-
for tag in time_tags:
|
311 |
-
time_html += f'<span class="subject-tag tag-time-period">{tag}</span>'
|
312 |
-
time_html += '</div>'
|
313 |
-
st.markdown(time_html, unsafe_allow_html=True)
|
314 |
-
|
315 |
-
# Then display remaining subject tags if available
|
316 |
-
if 'topics' in result and result['topics']:
|
317 |
-
# Filter out time-related tags which are already displayed
|
318 |
-
subject_tags = [topic for topic in result['topics']
|
319 |
-
if not any(term in topic.lower() for term in ["century", "pre-", "era", "historical"])]
|
320 |
-
|
321 |
-
if subject_tags:
|
322 |
-
# Create a separate line for Subject Tags
|
323 |
-
tags_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
|
324 |
-
tags_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Subject Tags:</div>'
|
325 |
-
tags_html += '<div style="display: flex; flex-wrap: wrap; gap: 2px; align-items: center;">'
|
326 |
-
|
327 |
-
# Generate a badge for each remaining tag
|
328 |
-
for topic in subject_tags:
|
329 |
-
# Determine tag category class
|
330 |
-
tag_class = "subject-tag" # Default class
|
331 |
-
|
332 |
-
# Add specialized class based on category
|
333 |
-
if any(term in topic.lower() for term in ["language", "english", "french", "german", "latin"]):
|
334 |
-
tag_class += " tag-language" # Languages
|
335 |
-
elif any(term in topic.lower() for term in ["letter", "newspaper", "book", "form", "document", "recipe"]):
|
336 |
-
tag_class += " tag-document-type" # Document types
|
337 |
-
elif any(term in topic.lower() for term in ["travel", "military", "science", "medicine", "education", "art", "literature"]):
|
338 |
-
tag_class += " tag-subject" # Subject domains
|
339 |
-
|
340 |
-
# Add each tag as an inline span
|
341 |
-
tags_html += f'<span class="{tag_class}">{topic}</span>'
|
342 |
-
|
343 |
-
# Close the containers
|
344 |
-
tags_html += '</div></div>'
|
345 |
-
|
346 |
-
# Render the subject tags section
|
347 |
-
st.markdown(tags_html, unsafe_allow_html=True)
|
348 |
-
|
349 |
-
# No OCR content heading - start directly with tabs
|
350 |
-
|
351 |
-
# Check if we have OCR content
|
352 |
-
if 'ocr_contents' in result:
|
353 |
-
# Create a single view instead of tabs
|
354 |
-
content_tab1 = st.container()
|
355 |
-
|
356 |
-
# Check for images in the result to use later
|
357 |
-
has_images = result.get('has_images', False)
|
358 |
-
has_image_data = ('pages_data' in result and any(page.get('images', []) for page in result.get('pages_data', [])))
|
359 |
-
has_raw_images = ('raw_response_data' in result and 'pages' in result['raw_response_data'] and
|
360 |
-
any('images' in page for page in result['raw_response_data']['pages']
|
361 |
-
if isinstance(page, dict)))
|
362 |
-
|
363 |
-
# Display structured content
|
364 |
-
with content_tab1:
|
365 |
-
# Display structured content with markdown formatting
|
366 |
-
if isinstance(result['ocr_contents'], dict):
|
367 |
-
# CSS is now handled in the main layout.py file
|
368 |
-
|
369 |
-
# Function to process text with markdown support
|
370 |
-
def format_markdown_text(text):
|
371 |
-
"""Format text with markdown and handle special patterns"""
|
372 |
-
if not text:
|
373 |
-
return ""
|
374 |
-
|
375 |
-
import re
|
376 |
-
|
377 |
-
# First, ensure we're working with a string
|
378 |
-
if not isinstance(text, str):
|
379 |
-
text = str(text)
|
380 |
-
|
381 |
-
# Ensure newlines are preserved for proper spacing
|
382 |
-
# Convert any Windows line endings to Unix
|
383 |
-
text = text.replace('\r\n', '\n')
|
384 |
-
|
385 |
-
# Format dates (MM/DD/YYYY or similar patterns)
|
386 |
-
date_pattern = r'\b(0?[1-9]|1[0-2])[\/\-\.](0?[1-9]|[12][0-9]|3[01])[\/\-\.](\d{4}|\d{2})\b'
|
387 |
-
text = re.sub(date_pattern, r'**\g<0>**', text)
|
388 |
-
|
389 |
-
# Detect markdown tables and preserve them
|
390 |
-
table_sections = []
|
391 |
-
non_table_lines = []
|
392 |
-
in_table = False
|
393 |
-
table_buffer = []
|
394 |
-
|
395 |
-
# Process text line by line, preserving tables
|
396 |
-
lines = text.split('\n')
|
397 |
-
for i, line in enumerate(lines):
|
398 |
-
line_stripped = line.strip()
|
399 |
-
|
400 |
-
# Detect table rows by pipe character
|
401 |
-
if '|' in line_stripped and (line_stripped.startswith('|') or line_stripped.endswith('|')):
|
402 |
-
if not in_table:
|
403 |
-
in_table = True
|
404 |
-
if table_buffer:
|
405 |
-
table_buffer = []
|
406 |
-
table_buffer.append(line)
|
407 |
-
|
408 |
-
# Check if the next line is a table separator
|
409 |
-
if i < len(lines) - 1 and '---' in lines[i+1] and '|' in lines[i+1]:
|
410 |
-
table_buffer.append(lines[i+1])
|
411 |
-
|
412 |
-
# Detect table separators (---|---|---)
|
413 |
-
elif in_table and '---' in line_stripped and '|' in line_stripped:
|
414 |
-
table_buffer.append(line)
|
415 |
-
|
416 |
-
# End of table detection
|
417 |
-
elif in_table:
|
418 |
-
# Check if this is still part of the table
|
419 |
-
next_line_is_table = False
|
420 |
-
if i < len(lines) - 1:
|
421 |
-
next_line = lines[i+1].strip()
|
422 |
-
if '|' in next_line and (next_line.startswith('|') or next_line.endswith('|')):
|
423 |
-
next_line_is_table = True
|
424 |
-
|
425 |
-
if not next_line_is_table:
|
426 |
-
in_table = False
|
427 |
-
# Save the complete table
|
428 |
-
if table_buffer:
|
429 |
-
table_sections.append('\n'.join(table_buffer))
|
430 |
-
table_buffer = []
|
431 |
-
# Add current line to non-table lines
|
432 |
-
non_table_lines.append(line)
|
433 |
-
else:
|
434 |
-
# Still part of the table
|
435 |
-
table_buffer.append(line)
|
436 |
-
else:
|
437 |
-
# Not in a table
|
438 |
-
non_table_lines.append(line)
|
439 |
-
|
440 |
-
# Handle any remaining table buffer
|
441 |
-
if in_table and table_buffer:
|
442 |
-
table_sections.append('\n'.join(table_buffer))
|
443 |
-
|
444 |
-
# Process non-table lines
|
445 |
-
processed_lines = []
|
446 |
-
for line in non_table_lines:
|
447 |
-
line_stripped = line.strip()
|
448 |
-
|
449 |
-
# Check if line is in ALL CAPS (and not just a short acronym)
|
450 |
-
if line_stripped and line_stripped.isupper() and len(line_stripped) > 3:
|
451 |
-
# ALL CAPS line - make bold instead of heading to prevent large display
|
452 |
-
processed_lines.append(f"**{line_stripped}**")
|
453 |
-
# Process potential headers (lines ending with colon)
|
454 |
-
elif line_stripped and line_stripped.endswith(':') and len(line_stripped) < 40:
|
455 |
-
# Likely a header - make it bold
|
456 |
-
processed_lines.append(f"**{line_stripped}**")
|
457 |
-
else:
|
458 |
-
# Keep original line with its spacing
|
459 |
-
processed_lines.append(line)
|
460 |
-
|
461 |
-
# Join non-table lines
|
462 |
-
processed_text = '\n'.join(processed_lines)
|
463 |
-
|
464 |
-
# Reinsert tables in the right positions
|
465 |
-
for table in table_sections:
|
466 |
-
# Generate a unique marker for this table
|
467 |
-
marker = f"__TABLE_MARKER_{hash(table) % 10000}__"
|
468 |
-
# Find a good position to insert this table
|
469 |
-
# For now, just append all tables at the end
|
470 |
-
processed_text += f"\n\n{table}\n\n"
|
471 |
-
|
472 |
-
# Make sure paragraphs have proper spacing but not excessive
|
473 |
-
processed_text = re.sub(r'\n{3,}', '\n\n', processed_text)
|
474 |
-
|
475 |
-
# Ensure two newlines between paragraphs for proper markdown rendering
|
476 |
-
processed_text = re.sub(r'([^\n])\n([^\n])', r'\1\n\n\2', processed_text)
|
477 |
-
|
478 |
-
return processed_text
|
479 |
-
|
480 |
-
# Collect all available images from the result
|
481 |
-
available_images = []
|
482 |
-
if has_images and 'pages_data' in result:
|
483 |
-
for page_idx, page in enumerate(result['pages_data']):
|
484 |
-
if 'images' in page and len(page['images']) > 0:
|
485 |
-
for img_idx, img in enumerate(page['images']):
|
486 |
-
if 'image_base64' in img:
|
487 |
-
available_images.append({
|
488 |
-
'source': 'pages_data',
|
489 |
-
'page': page_idx,
|
490 |
-
'index': img_idx,
|
491 |
-
'data': img['image_base64']
|
492 |
-
})
|
493 |
-
|
494 |
-
# Get images from raw response as well
|
495 |
-
if 'raw_response_data' in result:
|
496 |
-
raw_data = result['raw_response_data']
|
497 |
-
if isinstance(raw_data, dict) and 'pages' in raw_data:
|
498 |
-
for page_idx, page in enumerate(raw_data['pages']):
|
499 |
-
if isinstance(page, dict) and 'images' in page:
|
500 |
-
for img_idx, img in enumerate(page['images']):
|
501 |
-
if isinstance(img, dict) and 'base64' in img:
|
502 |
-
available_images.append({
|
503 |
-
'source': 'raw_response',
|
504 |
-
'page': page_idx,
|
505 |
-
'index': img_idx,
|
506 |
-
'data': img['base64']
|
507 |
-
})
|
508 |
-
|
509 |
-
# Extract images for display at the top
|
510 |
-
images_to_display = []
|
511 |
-
|
512 |
-
# First, collect all available images
|
513 |
-
for img_idx, img in enumerate(available_images):
|
514 |
-
if 'data' in img:
|
515 |
-
images_to_display.append({
|
516 |
-
'data': img['data'],
|
517 |
-
'id': img.get('id', f"img_{img_idx}"),
|
518 |
-
'index': img_idx
|
519 |
-
})
|
520 |
-
|
521 |
-
# Simple display of image without dropdown or Document Image tab
|
522 |
-
if images_to_display and len(images_to_display) > 0:
|
523 |
-
# Just display the first image directly
|
524 |
-
st.image(images_to_display[0]['data'], use_container_width=True)
|
525 |
-
|
526 |
-
# Organize sections in a logical order
|
527 |
-
section_order = ["title", "author", "date", "summary", "content", "transcript", "metadata"]
|
528 |
-
ordered_sections = []
|
529 |
-
|
530 |
-
# Add known sections first in preferred order
|
531 |
-
for section_name in section_order:
|
532 |
-
if section_name in result['ocr_contents'] and result['ocr_contents'][section_name]:
|
533 |
-
ordered_sections.append(section_name)
|
534 |
-
|
535 |
-
# Add any remaining sections
|
536 |
-
for section in result['ocr_contents'].keys():
|
537 |
-
if (section not in ordered_sections and
|
538 |
-
section not in ['error', 'partial_text'] and
|
539 |
-
result['ocr_contents'][section]):
|
540 |
-
ordered_sections.append(section)
|
541 |
-
|
542 |
-
# If only raw_text is available and no other content, add it last
|
543 |
-
if ('raw_text' in result['ocr_contents'] and
|
544 |
-
result['ocr_contents']['raw_text'] and
|
545 |
-
len(ordered_sections) == 0):
|
546 |
-
ordered_sections.append('raw_text')
|
547 |
-
|
548 |
-
# Add minimal spacing before OCR results
|
549 |
-
st.markdown("<div style='margin: 8px 0 4px 0;'></div>", unsafe_allow_html=True)
|
550 |
-
st.markdown("### Document Content")
|
551 |
-
|
552 |
-
# Process each section using expanders
|
553 |
-
for i, section in enumerate(ordered_sections):
|
554 |
-
content = result['ocr_contents'][section]
|
555 |
-
|
556 |
-
# Skip empty content
|
557 |
-
if not content:
|
558 |
-
continue
|
559 |
-
|
560 |
-
# Create an expander for each section
|
561 |
-
# First section is expanded by default
|
562 |
-
with st.expander(f"{section.replace('_', ' ').title()}", expanded=(i == 0)):
|
563 |
-
if isinstance(content, str):
|
564 |
-
# Handle image markdown
|
565 |
-
if content.startswith("![") and content.endswith(")"):
|
566 |
-
try:
|
567 |
-
alt_text = content[2:content.index(']')]
|
568 |
-
st.info(f"Image description: {alt_text if len(alt_text) > 5 else 'Image'}")
|
569 |
-
except:
|
570 |
-
st.info("Contains image reference")
|
571 |
-
else:
|
572 |
-
# Process text content
|
573 |
-
formatted_content = format_markdown_text(content).strip()
|
574 |
-
|
575 |
-
# Check if content contains markdown tables or complex text
|
576 |
-
has_tables = '|' in formatted_content and '---' in formatted_content
|
577 |
-
has_complex_structure = formatted_content.count('\n') > 5 or formatted_content.count('**') > 2
|
578 |
-
|
579 |
-
# Use a container with minimal margins
|
580 |
-
with st.container():
|
581 |
-
# For text-only extractions or content with tables, ensure proper rendering
|
582 |
-
if has_tables or has_complex_structure:
|
583 |
-
# For text with tables or multiple paragraphs, use special handling
|
584 |
-
# First ensure proper markdown spacing
|
585 |
-
formatted_content = formatted_content.replace('\n\n\n', '\n\n')
|
586 |
-
|
587 |
-
# Look for any all caps headers that might be misinterpreted
|
588 |
-
import re
|
589 |
-
formatted_content = re.sub(
|
590 |
-
r'^([A-Z][A-Z\s]+)$',
|
591 |
-
r'**\1**',
|
592 |
-
formatted_content,
|
593 |
-
flags=re.MULTILINE
|
594 |
-
)
|
595 |
-
|
596 |
-
# Preserve table formatting by adding proper spacing
|
597 |
-
if has_tables:
|
598 |
-
formatted_content = formatted_content.replace('\n|', '\n\n|')
|
599 |
-
|
600 |
-
# Add proper paragraph spacing
|
601 |
-
formatted_content = re.sub(r'([^\n])\n([^\n])', r'\1\n\n\2', formatted_content)
|
602 |
-
|
603 |
-
# Use standard markdown with custom styling
|
604 |
-
st.markdown(formatted_content, unsafe_allow_html=False)
|
605 |
-
else:
|
606 |
-
# For simpler content, use standard markdown
|
607 |
-
st.markdown(formatted_content)
|
608 |
-
|
609 |
-
elif isinstance(content, list):
|
610 |
-
# Create markdown list
|
611 |
-
list_items = []
|
612 |
-
for item in content:
|
613 |
-
if isinstance(item, str):
|
614 |
-
item_text = format_markdown_text(item).strip()
|
615 |
-
# Handle potential HTML special characters for proper rendering
|
616 |
-
item_text = item_text.replace('<', '<').replace('>', '>')
|
617 |
-
list_items.append(f"- {item_text}")
|
618 |
-
else:
|
619 |
-
list_items.append(f"- {str(item)}")
|
620 |
-
|
621 |
-
list_content = "\n".join(list_items)
|
622 |
-
|
623 |
-
# Use a container with minimal margins
|
624 |
-
with st.container():
|
625 |
-
# Use standard markdown for better rendering
|
626 |
-
st.markdown(list_content)
|
627 |
-
|
628 |
-
elif isinstance(content, dict):
|
629 |
-
# Format dictionary content
|
630 |
-
dict_items = []
|
631 |
-
for k, v in content.items():
|
632 |
-
key_formatted = k.replace('_', ' ').title()
|
633 |
-
|
634 |
-
if isinstance(v, str):
|
635 |
-
value_formatted = format_markdown_text(v).strip()
|
636 |
-
dict_items.append(f"**{key_formatted}:** {value_formatted}")
|
637 |
-
else:
|
638 |
-
dict_items.append(f"**{key_formatted}:** {str(v)}")
|
639 |
-
|
640 |
-
dict_content = "\n".join(dict_items)
|
641 |
-
|
642 |
-
# Use a container with minimal margins
|
643 |
-
with st.container():
|
644 |
-
# Use standard markdown for better rendering
|
645 |
-
st.markdown(dict_content)
|
646 |
-
|
647 |
-
# Display custom prompt if provided
|
648 |
-
if custom_prompt:
|
649 |
-
with st.expander("Custom Processing Instructions"):
|
650 |
-
st.write(custom_prompt)
|
651 |
-
|
652 |
-
# No download heading - start directly with buttons
|
653 |
-
|
654 |
-
# JSON download - use full width for buttons
|
655 |
-
try:
|
656 |
-
json_str = json.dumps(result, indent=2)
|
657 |
-
st.download_button(
|
658 |
-
label="Download JSON",
|
659 |
-
data=json_str,
|
660 |
-
file_name=f"{result.get('file_name', 'document').split('.')[0]}_ocr.json",
|
661 |
-
mime="application/json"
|
662 |
-
)
|
663 |
-
except Exception as e:
|
664 |
-
st.error(f"Error creating JSON download: {str(e)}")
|
665 |
-
|
666 |
-
# Text download
|
667 |
-
try:
|
668 |
-
if 'ocr_contents' in result:
|
669 |
-
if 'raw_text' in result['ocr_contents']:
|
670 |
-
text_content = result['ocr_contents']['raw_text']
|
671 |
-
elif 'content' in result['ocr_contents']:
|
672 |
-
text_content = result['ocr_contents']['content']
|
673 |
-
else:
|
674 |
-
text_content = str(result['ocr_contents'])
|
675 |
-
else:
|
676 |
-
text_content = "No text content available."
|
677 |
-
|
678 |
-
st.download_button(
|
679 |
-
label="Download Text",
|
680 |
-
data=text_content,
|
681 |
-
file_name=f"{result.get('file_name', 'document').split('.')[0]}_ocr.txt",
|
682 |
-
mime="text/plain"
|
683 |
-
)
|
684 |
-
except Exception as e:
|
685 |
-
st.error(f"Error creating text download: {str(e)}")
|
686 |
-
|
687 |
def display_document_with_images(result):
|
688 |
"""Display document with images"""
|
689 |
# Check for pages_data first
|
@@ -759,7 +297,7 @@ def display_document_with_images(result):
|
|
759 |
if isinstance(raw_page, dict) and 'images' in raw_page:
|
760 |
for img in raw_page['images']:
|
761 |
if isinstance(img, dict) and 'base64' in img:
|
762 |
-
st.image(img['base64'])
|
763 |
st.caption("Image from OCR response")
|
764 |
image_displayed = True
|
765 |
break
|
@@ -797,7 +335,7 @@ def display_previous_results():
|
|
797 |
st.markdown("""
|
798 |
<div style="text-align: center; padding: 30px 20px; background-color: #f8f9fa; border-radius: 6px; margin-top: 10px;">
|
799 |
<div style="font-size: 36px; margin-bottom: 15px;">📄</div>
|
800 |
-
<
|
801 |
<p style="font-size: 14px; color: #666;">Process a document to see your results history.</p>
|
802 |
</div>
|
803 |
""", unsafe_allow_html=True)
|
@@ -806,7 +344,7 @@ def display_previous_results():
|
|
806 |
with col2:
|
807 |
try:
|
808 |
# Create download button for all results
|
809 |
-
from
|
810 |
zip_data = create_results_zip_in_memory(st.session_state.previous_results)
|
811 |
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
|
812 |
|
@@ -908,37 +446,22 @@ def display_previous_results():
|
|
908 |
meta_html += '</div>'
|
909 |
st.markdown(meta_html, unsafe_allow_html=True)
|
910 |
|
911 |
-
# Simplified tabs -
|
912 |
has_images = selected_result.get('has_images', False)
|
913 |
if has_images:
|
914 |
-
view_tabs = st.tabs(["Document Content", "Raw
|
915 |
view_tab1, view_tab2, view_tab3 = view_tabs
|
916 |
else:
|
917 |
-
view_tabs = st.tabs(["Document Content", "Raw
|
918 |
view_tab1, view_tab2 = view_tabs
|
919 |
-
|
920 |
-
# Define helper function for formatting text
|
921 |
-
def format_text_display(text):
|
922 |
-
if not isinstance(text, str):
|
923 |
-
return text
|
924 |
-
|
925 |
-
lines = text.split('\n')
|
926 |
-
processed_lines = []
|
927 |
-
for line in lines:
|
928 |
-
line_stripped = line.strip()
|
929 |
-
if line_stripped and line_stripped.isupper() and len(line_stripped) > 3:
|
930 |
-
processed_lines.append(f"**{line_stripped}**")
|
931 |
-
else:
|
932 |
-
processed_lines.append(line)
|
933 |
-
|
934 |
-
return '\n'.join(processed_lines)
|
935 |
|
936 |
# First tab - Document Content (simplified structured view)
|
937 |
with view_tab1:
|
938 |
# Display content in a cleaner, more streamlined format
|
939 |
if 'ocr_contents' in selected_result and isinstance(selected_result['ocr_contents'], dict):
|
940 |
# Create a more focused list of important sections
|
941 |
-
priority_sections = ["title", "content", "transcript", "summary"
|
942 |
displayed_sections = set()
|
943 |
|
944 |
# First display priority sections
|
@@ -951,7 +474,7 @@ def display_previous_results():
|
|
951 |
st.markdown(f"##### {section.replace('_', ' ').title()}")
|
952 |
|
953 |
# Format and display content
|
954 |
-
formatted_content =
|
955 |
st.markdown(formatted_content)
|
956 |
displayed_sections.add(section)
|
957 |
|
@@ -963,7 +486,7 @@ def display_previous_results():
|
|
963 |
st.markdown(f"##### {section.replace('_', ' ').title()}")
|
964 |
|
965 |
if isinstance(content, str):
|
966 |
-
st.markdown(
|
967 |
elif isinstance(content, list):
|
968 |
for item in content:
|
969 |
st.markdown(f"- {item}")
|
@@ -971,34 +494,42 @@ def display_previous_results():
|
|
971 |
for k, v in content.items():
|
972 |
st.markdown(f"**{k}:** {v}")
|
973 |
|
974 |
-
# Second tab - Raw
|
975 |
with view_tab2:
|
976 |
-
# Extract
|
977 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
978 |
if 'ocr_contents' in selected_result:
|
979 |
-
|
980 |
-
|
981 |
-
|
982 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
983 |
|
984 |
-
#
|
985 |
-
|
986 |
|
987 |
-
#
|
988 |
-
|
989 |
-
with col1:
|
990 |
-
st.button("Copy Text", key="selected_copy_btn")
|
991 |
-
with col2:
|
992 |
-
st.download_button(
|
993 |
-
label="Download Text",
|
994 |
-
data=edited_text,
|
995 |
-
file_name=f"{file_name.split('.')[0]}_text.txt",
|
996 |
-
mime="text/plain",
|
997 |
-
key="selected_download_btn"
|
998 |
-
)
|
999 |
|
1000 |
-
# Third tab -
|
1001 |
-
if has_images and
|
1002 |
with view_tab3:
|
1003 |
# Simplified image display
|
1004 |
if 'pages_data' in selected_result:
|
@@ -1007,7 +538,7 @@ def display_previous_results():
|
|
1007 |
if 'images' in page_data and len(page_data['images']) > 0:
|
1008 |
for img in page_data['images']:
|
1009 |
if 'image_base64' in img:
|
1010 |
-
st.image(img['image_base64'],
|
1011 |
|
1012 |
# Get page text if available
|
1013 |
page_text = ""
|
@@ -1018,21 +549,22 @@ def display_previous_results():
|
|
1018 |
if page_text:
|
1019 |
with st.expander(f"Page {i+1} Text", expanded=False):
|
1020 |
st.text(page_text)
|
|
|
1021 |
|
1022 |
def display_about_tab():
|
1023 |
-
"""Display
|
1024 |
-
st.header("
|
1025 |
|
1026 |
# Add app description
|
1027 |
st.markdown("""
|
1028 |
-
**Historical OCR** is a
|
1029 |
""")
|
1030 |
|
1031 |
# Purpose section with consistent formatting
|
1032 |
st.markdown("### Purpose")
|
1033 |
st.markdown("""
|
1034 |
This tool is designed to assist scholars in historical research by extracting text from challenging documents.
|
1035 |
-
While it may not achieve
|
1036 |
historical documents, particularly:
|
1037 |
""")
|
1038 |
|
|
|
 import io
 import base64
 import logging
+import re
 from datetime import datetime
 from pathlib import Path
 import json
+
+# Define exports
+__all__ = [
+    'ProgressReporter',
+    'create_sidebar_options',
+    'create_file_uploader',
+    'display_document_with_images',
+    'display_previous_results',
+    'display_about_tab',
+    'display_results'  # Re-export from utils.ui_utils
+]
 from constants import (
     DOCUMENT_TYPES,
     DOCUMENT_LAYOUTS,
     PREPROCESSING_DOC_TYPES,
     ROTATION_OPTIONS
 )
+from utils.image_utils import format_ocr_text
+from utils.content_utils import (
+    classify_document_content,
+    extract_document_text,
+    extract_image_description,
+    clean_raw_text,
+    format_markdown_text
+)
+from utils.ui_utils import display_results
+from preprocessing import preprocess_image

 class ProgressReporter:
     """Class to handle progress reporting in the UI"""

     # Create a container for the sidebar options
     with st.container():
+        # Default to using vision model (removed selection from UI)
+        use_vision = True

         # Document type selection
         doc_type = st.selectbox("Document Type", DOCUMENT_TYPES,
                                 help="Select the type of document you're processing for better results")

         # Custom prompt
         custom_prompt = ""
+        # Get the template for the selected document type if not auto-detect
+        if doc_type != DOCUMENT_TYPES[0]:
             prompt_template = CUSTOM_PROMPT_TEMPLATES.get(doc_type, "")

             # Add layout information if not standard

             # Set the custom prompt
             custom_prompt = prompt_template

+        # Allow user to edit the prompt (always visible)
+        custom_prompt = st.text_area("Custom Processing Instructions", value=custom_prompt,
+                                     help="Customize the instructions for processing this document",
+                                     height=80)
+
+        # Image preprocessing options (always visible)
+        st.markdown("### Image Preprocessing")
+
+        # Grayscale conversion
+        grayscale = st.checkbox("Convert to Grayscale",
+                                value=False,
+                                help="Convert color images to grayscale for better text recognition")
+
+        # Light denoising option
+        denoise = st.checkbox("Light Denoising",
+                              value=False,
+                              help="Apply gentle denoising to improve text clarity")
+
+        # Contrast adjustment
+        contrast = st.slider("Contrast Adjustment",
+                             min_value=-20,
+                             max_value=20,
+                             value=0,
+                             step=5,
+                             help="Adjust image contrast (limited range)")
+
+        # Initialize rotation (keeping it set to 0)
+        rotation = 0
+        use_segmentation = False
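These widgets feed the preprocessing options dictionary assembled just below; a sketch of the resulting structure (key names taken from this diff, the full dictionary literal is not captured here):

    # Sketch: preprocessing options assembled from the sidebar widgets above.
    preprocessing_options = {
        "grayscale": grayscale,  # bool from st.checkbox
        "denoise": denoise,      # bool from st.checkbox
        "contrast": contrast,    # int in [-20, 20] from st.slider
        "rotation": rotation,    # fixed at 0 in this version
    }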

     # Create preprocessing options dictionary
     # Set document_type based on selection in UI

         "rotation": rotation
     }

+    # PDF-specific options
+    st.markdown("### PDF Options")
+    max_pages = st.number_input("Maximum Pages to Process",
+                                min_value=1,
+                                max_value=20,
+                                value=DEFAULT_MAX_PAGES,
+                                help="Limit the number of pages to process (for multi-page PDFs)")
+
+    # Set default values for removed options
+    pdf_dpi = DEFAULT_PDF_DPI
+    pdf_rotation = 0

     # Create options dictionary
     options = {

     )
     return uploaded_file

 def display_document_with_images(result):
     """Display document with images"""
     # Check for pages_data first

                 if isinstance(raw_page, dict) and 'images' in raw_page:
                     for img in raw_page['images']:
                         if isinstance(img, dict) and 'base64' in img:
+                            st.image(img['base64'], use_container_width=True)
                             st.caption("Image from OCR response")
                             image_displayed = True
                             break

         st.markdown("""
         <div style="text-align: center; padding: 30px 20px; background-color: #f8f9fa; border-radius: 6px; margin-top: 10px;">
             <div style="font-size: 36px; margin-bottom: 15px;">📄</div>
+            <h3 style="margin-bottom: 16px; font-weight: 500;">No Previous Results</h3>
             <p style="font-size: 14px; color: #666;">Process a document to see your results history.</p>
         </div>
         """, unsafe_allow_html=True)

     with col2:
         try:
             # Create download button for all results
+            from utils.image_utils import create_results_zip_in_memory
             zip_data = create_results_zip_in_memory(st.session_state.previous_results)
             timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
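create_results_zip_in_memory now lives in utils/image_utils per this import; its body is not shown in this diff, so the following is only a plausible minimal sketch of such a helper (the function name comes from the diff, everything else is assumed):

    import io
    import json
    import zipfile

    def create_results_zip_in_memory(results):
        # Build a zip archive in RAM and return its raw bytes.
        buffer = io.BytesIO()
        with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
            for idx, result in enumerate(results):
                name = result.get("file_name", f"result_{idx}")
                archive.writestr(f"{name}_ocr.json", json.dumps(result, indent=2, default=str))
        return buffer.getvalue()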

     meta_html += '</div>'
     st.markdown(meta_html, unsafe_allow_html=True)

+    # Simplified tabs - using the same format as main view
     has_images = selected_result.get('has_images', False)
     if has_images:
+        view_tabs = st.tabs(["Document Content", "Raw JSON", "Images"])
         view_tab1, view_tab2, view_tab3 = view_tabs
     else:
+        view_tabs = st.tabs(["Document Content", "Raw JSON"])
         view_tab1, view_tab2 = view_tabs
+        view_tab3 = None

     # First tab - Document Content (simplified structured view)
     with view_tab1:
         # Display content in a cleaner, more streamlined format
         if 'ocr_contents' in selected_result and isinstance(selected_result['ocr_contents'], dict):
             # Create a more focused list of important sections
+            priority_sections = ["title", "content", "transcript", "summary"]
             displayed_sections = set()

             # First display priority sections

                     st.markdown(f"##### {section.replace('_', ' ').title()}")

                     # Format and display content
+                    formatted_content = format_ocr_text(content)
                     st.markdown(formatted_content)
                     displayed_sections.add(section)

                     st.markdown(f"##### {section.replace('_', ' ').title()}")

                     if isinstance(content, str):
+                        st.markdown(format_ocr_text(content))
                     elif isinstance(content, list):
                         for item in content:
                             st.markdown(f"- {item}")

                         for k, v in content.items():
                             st.markdown(f"**{k}:** {v}")

+    # Second tab - Raw JSON (simplified)
     with view_tab2:
+        # Extract the relevant JSON data
+        json_data = {}
+
+        # Include important metadata
+        for field in ['file_name', 'timestamp', 'processing_time', 'languages', 'topics', 'subjects', 'detected_document_type', 'text']:
+            if field in selected_result:
+                json_data[field] = selected_result[field]
+
+        # Include OCR contents
         if 'ocr_contents' in selected_result:
+            json_data['ocr_contents'] = selected_result['ocr_contents']
+
+        # Exclude large binary data like base64 images to keep JSON clean
+        if 'pages_data' in selected_result:
+            # Create simplified pages_data without large binary content
+            simplified_pages = []
+            for page in selected_result['pages_data']:
+                simplified_page = {
+                    'page_number': page.get('page_number', 0),
+                    'has_text': bool(page.get('markdown', '')),
+                    'has_images': bool(page.get('images', [])),
+                    'image_count': len(page.get('images', []))
+                }
+                simplified_pages.append(simplified_page)
+            json_data['pages_summary'] = simplified_pages

+        # Format the JSON prettily
+        json_str = json.dumps(json_data, indent=2)

+        # Display in a monospace font with syntax highlighting
+        st.code(json_str, language="json")
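For a two-page result, the simplified JSON rendered by this tab would look roughly like this (illustrative values):

    {
      "file_name": "ledger_1823.pdf",
      "processing_time": 6.4,
      "languages": ["English"],
      "ocr_contents": { "title": "...", "raw_text": "..." },
      "pages_summary": [
        { "page_number": 1, "has_text": true, "has_images": true, "image_count": 1 },
        { "page_number": 2, "has_text": true, "has_images": false, "image_count": 0 }
      ]
    }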

+    # Third tab - Images (simplified)
+    if has_images and view_tab3 is not None:
         with view_tab3:
             # Simplified image display
             if 'pages_data' in selected_result:

                 if 'images' in page_data and len(page_data['images']) > 0:
                     for img in page_data['images']:
                         if 'image_base64' in img:
+                            st.image(img['image_base64'], use_container_width=True)

                     # Get page text if available
                     page_text = ""

                 if page_text:
                     with st.expander(f"Page {i+1} Text", expanded=False):
                         st.text(page_text)
+

 def display_about_tab():
+    """Display learn more tab content"""
+    st.header("Learn More")

     # Add app description
     st.markdown("""
+    **Historical OCR** is a tailored academic tool for extracting text from historical documents, manuscripts, and printed materials.
     """)

     # Purpose section with consistent formatting
     st.markdown("### Purpose")
     st.markdown("""
     This tool is designed to assist scholars in historical research by extracting text from challenging documents.
+    While it may not achieve full accuracy for all materials, it serves as a tailored research aid for navigating
     historical documents, particularly:
     """)
utils/content_utils.py
ADDED
@@ -0,0 +1,189 @@
+import re
+import ast
+from .text_utils import clean_raw_text, format_markdown_text
+
+def classify_document_content(result):
+    """Classify document content based on structure and content"""
+    classification = {
+        'has_title': False,
+        'has_content': False,
+        'has_sections': False,
+        'is_structured': False
+    }
+
+    if 'ocr_contents' not in result or not isinstance(result['ocr_contents'], dict):
+        return classification
+
+    # Check for title
+    if 'title' in result['ocr_contents'] and result['ocr_contents']['title']:
+        classification['has_title'] = True
+
+    # Check for content
+    content_fields = ['content', 'transcript', 'text']
+    for field in content_fields:
+        if field in result['ocr_contents'] and result['ocr_contents'][field]:
+            classification['has_content'] = True
+            break
+
+    # Check for sections
+    section_count = 0
+    for key in result['ocr_contents'].keys():
+        if key not in ['raw_text', 'error'] and result['ocr_contents'][key]:
+            section_count += 1
+
+    classification['has_sections'] = section_count > 2
+
+    # Check if structured
+    classification['is_structured'] = (
+        classification['has_title'] and
+        classification['has_content'] and
+        classification['has_sections']
+    )
+
+    return classification
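A quick check of the classifier on a toy result (values made up):

    result = {
        "ocr_contents": {
            "title": "Field Notes, 1871",
            "transcript": "April 2nd. Clear skies...",
            "summary": "Daily weather observations.",
            "raw_text": "April 2nd. Clear skies...",
        }
    }
    # title + transcript + summary -> three non-raw sections, so has_sections is True
    print(classify_document_content(result))
    # {'has_title': True, 'has_content': True, 'has_sections': True, 'is_structured': True}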
+
+def extract_document_text(result):
+    """Extract main document text content"""
+    if 'ocr_contents' not in result or not isinstance(result['ocr_contents'], dict):
+        return ""
+
+    # Try to get the text from content fields in preferred order - prioritize main_text
+    for field in ['main_text', 'content', 'transcript', 'text', 'raw_text']:
+        if field in result['ocr_contents'] and result['ocr_contents'][field]:
+            content = result['ocr_contents'][field]
+            if isinstance(content, str):
+                return content
+
+    return ""
+
+def extract_image_description(image_data):
+    """Extract image description from data"""
+    if not image_data or not isinstance(image_data, dict):
+        return ""
+
+    # Try different fields that might contain descriptions
+    for field in ['alt_text', 'caption', 'description']:
+        if field in image_data and image_data[field]:
+            return image_data[field]
+
+    return ""
+
+def format_structured_data(content):
+    """Format structured data like lists and dictionaries into readable markdown
+
+    Args:
+        content: The content to format (str, list, dict)
+
+    Returns:
+        Formatted markdown text
+    """
+    if not content:
+        return ""
+
+    # If it's already a string, look for patterns that appear to be Python/JSON representations
+    if isinstance(content, str):
+        # Look for lists like ['item1', 'item2', 'item3']
+        list_pattern = r"(\[([^\[\]]*)\])"
+        dict_pattern = r"(\{([^\{\}]*)\})"
+
+        # First handle lists - ['item1', 'item2']
+        def replace_list(match):
+            try:
+                # Try to parse the match as a Python list
+                list_str = match.group(1)
+
+                # Quick check for empty list
+                if list_str == "[]":
+                    return ""
+
+                # Safe evaluation of list-like string
+                try:
+                    items = ast.literal_eval(list_str)
+                    if isinstance(items, list):
+                        # Convert to markdown bullet points
+                        return "\n" + "\n".join([f"- {item}" for item in items])
+                    else:
+                        return list_str  # Not a list, return unchanged
+                except (SyntaxError, ValueError):
+                    # Try a simpler regex-based approach for common formats
+                    # Handle simple comma-separated lists
+                    items = re.findall(r"'([^']*)'|\"([^\"]*)\"", list_str)
+                    if items:
+                        # Extract the matched groups and handle both single and double quotes
+                        clean_items = [item[0] if item[0] else item[1] for item in items]
+                        return "\n" + "\n".join([f"- {item}" for item in clean_items])
+                    return list_str  # Couldn't parse, return unchanged
+            except Exception:
+                return match.group(0)  # Return the original text if any error
+
+        # Handle dictionaries or structured fields like {key: value, key2: value2}
+        def replace_dict(match):
+            try:
+                dict_str = match.group(1)
+
+                # Quick check for empty dict
+                if dict_str == "{}":
+                    return ""
+
+                # First try to parse as a Python dict
+                try:
+                    data_dict = ast.literal_eval(dict_str)
+                    if isinstance(data_dict, dict):
+                        return "\n" + "\n".join([f"**{k}**: {v}" for k, v in data_dict.items()])
+                except (SyntaxError, ValueError):
+                    # If that fails, use regex to extract key-value pairs
+                    pairs = re.findall(r"'([^']*)':\s*'([^']*)'|\"([^\"]*)\":\s*\"([^\"]*)\"", dict_str)
+                    if pairs:
+                        formatted_pairs = []
+                        for pair in pairs:
+                            if pair[0] and pair[1]:  # Single quotes
+                                formatted_pairs.append(f"**{pair[0]}**: {pair[1]}")
+                            elif pair[2] and pair[3]:  # Double quotes
+                                formatted_pairs.append(f"**{pair[2]}**: {pair[3]}")
+                        return "\n" + "\n".join(formatted_pairs)
+                return dict_str  # Return original if couldn't parse
+            except Exception:
+                return match.group(0)  # Return original text if any error
+
+        # Check for keys with array values (common in OCR output)
+        key_array_pattern = r"([a-zA-Z_]+):\s*(\[.*?\])"
+
+        def replace_key_array(match):
+            try:
+                key = match.group(1)
+                array_str = match.group(2)
+
+                # Process the array part with our list replacer
+                formatted_array = replace_list(re.match(list_pattern, array_str))
+
+                # If we successfully formatted it, return with the key as a header
+                if formatted_array != array_str:
+                    return f"**{key}**:{formatted_array}"
+                else:
+                    return match.group(0)  # Return original if no change
+            except Exception:
+                return match.group(0)  # Return the original on error
+
+        # Apply all replacements
+        content = re.sub(key_array_pattern, replace_key_array, content)
+        content = re.sub(list_pattern, replace_list, content)
+        content = re.sub(dict_pattern, replace_dict, content)
+
+        return content
+
+    # Handle native Python lists
+    elif isinstance(content, list):
+        if not content:
+            return ""
+        # Convert to markdown bullet points
+        return "\n".join([f"- {item}" for item in content])
+
+    # Handle native Python dictionaries
+    elif isinstance(content, dict):
+        if not content:
+            return ""
+        # Convert to markdown key-value pairs
+        return "\n".join([f"**{k}**: {v}" for k, v in content.items()])
+
+    # Return as string for other types
+    return str(content)
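Example of the string branch in action: a field that came back as a Python-style list representation is turned into a markdown bullet list with its key bolded (illustrative input):

    text = "Ingredients: ['flour', 'sugar', 'salt']"
    print(format_structured_data(text))
    # **Ingredients**:
    # - flour
    # - sugar
    # - salt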
utils/file_utils.py
ADDED
@@ -0,0 +1,100 @@
+"""
+File utility functions for historical OCR processing.
+"""
+import base64
+import logging
+from pathlib import Path
+
+# Configure logging
+logger = logging.getLogger("utils")
+logger.setLevel(logging.INFO)
+
+def get_base64_from_image(image_path):
+    """
+    Get base64 data URL from image file with proper MIME type.
+
+    Args:
+        image_path: Path to the image file
+
+    Returns:
+        Base64 data URL with appropriate MIME type prefix
+    """
+    try:
+        # Convert to Path object for better handling
+        path_obj = Path(image_path)
+
+        # Determine mime type based on file extension
+        mime_type = 'image/jpeg'  # Default mime type
+        suffix = path_obj.suffix.lower()
+        if suffix == '.png':
+            mime_type = 'image/png'
+        elif suffix == '.gif':
+            mime_type = 'image/gif'
+        elif suffix in ['.jpg', '.jpeg']:
+            mime_type = 'image/jpeg'
+        elif suffix == '.pdf':
+            mime_type = 'application/pdf'
+
+        # Read and encode file
+        with open(path_obj, "rb") as file:
+            encoded = base64.b64encode(file.read()).decode('utf-8')
+            return f"data:{mime_type};base64,{encoded}"
+    except Exception as e:
+        logger.error(f"Error encoding file to base64: {str(e)}")
+        return ""
+
+def get_base64_from_bytes(file_bytes, mime_type=None, file_name=None):
+    """
+    Get base64 data URL from file bytes with proper MIME type.
+
+    Args:
+        file_bytes: Binary file data
+        mime_type: MIME type of the file (optional)
+        file_name: Original file name for MIME type detection (optional)
+
+    Returns:
+        Base64 data URL with appropriate MIME type prefix
+    """
+    try:
+        # Determine mime type if not provided
+        if mime_type is None and file_name is not None:
+            # Get file extension
+            suffix = Path(file_name).suffix.lower()
+            if suffix == '.png':
+                mime_type = 'image/png'
+            elif suffix == '.gif':
+                mime_type = 'image/gif'
+            elif suffix in ['.jpg', '.jpeg']:
+                mime_type = 'image/jpeg'
+            elif suffix == '.pdf':
+                mime_type = 'application/pdf'
+            else:
+                # Default to image/jpeg for unknown types when processing images
+                mime_type = 'image/jpeg'
+        elif mime_type is None:
+            # Default MIME type if we can't determine it - use image/jpeg instead of application/octet-stream
+            # to ensure compatibility with Mistral AI OCR API
+            mime_type = 'image/jpeg'
+
+        # Encode and create data URL
+        encoded = base64.b64encode(file_bytes).decode('utf-8')
+        return f"data:{mime_type};base64,{encoded}"
+    except Exception as e:
+        logger.error(f"Error encoding bytes to base64: {str(e)}")
+        return ""
+
+def handle_temp_files(temp_file_paths):
+    """
+    Clean up temporary files
+
+    Args:
+        temp_file_paths: List of temporary file paths to clean up
+    """
+    import os
+    for temp_path in temp_file_paths:
+        try:
+            if os.path.exists(temp_path):
+                os.unlink(temp_path)
+                logger.info(f"Removed temporary file: {temp_path}")
+        except Exception as e:
+            logger.warning(f"Failed to remove temporary file {temp_path}: {str(e)}")
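Quick check of the data-URL helper on toy bytes (not a real image, just sample data):

    png_bytes = b"\x89PNG\r\n\x1a\n"
    url = get_base64_from_bytes(png_bytes, file_name="scan.png")
    print(url[:30])
    # data:image/png;base64,iVBORw0K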
utils/general_utils.py
ADDED
@@ -0,0 +1,163 @@
+"""
+General utility functions for historical OCR processing.
+"""
+import os
+import base64
+import hashlib
+import time
+import logging
+from datetime import datetime
+from pathlib import Path
+from functools import wraps
+
+# Configure logging
+logger = logging.getLogger("utils")
+logger.setLevel(logging.INFO)
+
+def generate_cache_key(file_bytes, file_type, use_vision, preprocessing_options=None, pdf_rotation=0, custom_prompt=None):
+    """
+    Generate a cache key for OCR processing
+
+    Args:
+        file_bytes: File content as bytes
+        file_type: Type of file (pdf or image)
+        use_vision: Whether to use vision model
+        preprocessing_options: Dictionary of preprocessing options
+        pdf_rotation: PDF rotation value
+        custom_prompt: Custom prompt for OCR
+
+    Returns:
+        str: Cache key
+    """
+    # Generate file hash
+    file_hash = hashlib.md5(file_bytes).hexdigest()
+
+    # Include preprocessing options in cache key
+    preprocessing_options_hash = ""
+    if preprocessing_options:
+        # Add pdf_rotation to preprocessing options to ensure it's part of the cache key
+        if pdf_rotation != 0:
+            preprocessing_options_with_rotation = preprocessing_options.copy()
+            preprocessing_options_with_rotation['pdf_rotation'] = pdf_rotation
+            preprocessing_str = str(sorted(preprocessing_options_with_rotation.items()))
+        else:
+            preprocessing_str = str(sorted(preprocessing_options.items()))
+        preprocessing_options_hash = hashlib.md5(preprocessing_str.encode()).hexdigest()
+    elif pdf_rotation != 0:
+        # If no preprocessing options but we have rotation, include that in the hash
+        preprocessing_options_hash = hashlib.md5(f"pdf_rotation_{pdf_rotation}".encode()).hexdigest()
+
+    # Create base cache key
+    cache_key = f"{file_hash}_{file_type}_{use_vision}_{preprocessing_options_hash}"
+
+    # Include custom prompt in cache key if provided
+    if custom_prompt:
+        custom_prompt_hash = hashlib.md5(str(custom_prompt).encode()).hexdigest()
+        cache_key = f"{cache_key}_{custom_prompt_hash}"
+
+    return cache_key
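The key is deterministic for identical inputs, which is what makes it usable as a cache index:

    key1 = generate_cache_key(b"same bytes", "image", True, {"grayscale": True})
    key2 = generate_cache_key(b"same bytes", "image", True, {"grayscale": True})
    key3 = generate_cache_key(b"same bytes", "image", True, {"grayscale": False})
    assert key1 == key2  # identical inputs -> identical key
    assert key1 != key3  # any changed option -> different key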
+
+def timing(description):
+    """Context manager for timing code execution"""
+    class TimingContext:
+        def __init__(self, description):
+            self.description = description
+
+        def __enter__(self):
+            self.start_time = time.time()
+            return self
+
+        def __exit__(self, exc_type, exc_val, exc_tb):
+            end_time = time.time()
+            execution_time = end_time - self.start_time
+            logger.info(f"{self.description} took {execution_time:.2f} seconds")
+            return False
+
+    return TimingContext(description)
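Usage is the standard with-statement pattern; the elapsed time is logged on exit:

    with timing("OCR pass"):
        time.sleep(0.2)  # stand-in for real work
    # logs: "OCR pass took 0.20 seconds"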

def format_timestamp(timestamp=None):
    """Format timestamp for display"""
    if timestamp is None:
        timestamp = datetime.now()
    elif isinstance(timestamp, str):
        try:
            timestamp = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        except ValueError:
            timestamp = datetime.now()

    return timestamp.strftime("%Y-%m-%d %H:%M")

def create_descriptive_filename(original_filename, result, file_ext, preprocessing_options=None):
    """
    Create a descriptive filename for the result

    Args:
        original_filename: Original filename
        result: OCR result dictionary
        file_ext: File extension
        preprocessing_options: Dictionary of preprocessing options

    Returns:
        str: Descriptive filename
    """
    # Get base name without extension
    original_name = Path(original_filename).stem

    # Add document type to filename if detected
    doc_type_tag = ""
    if 'detected_document_type' in result:
        doc_type = result['detected_document_type'].lower()
        doc_type_tag = f"_{doc_type.replace(' ', '_')}"
    elif 'topics' in result and result['topics']:
        # Use first tag as document type if not explicitly detected
        doc_type_tag = f"_{result['topics'][0].lower().replace(' ', '_')}"

    # Add period tag for historical context if available
    period_tag = ""
    if 'topics' in result and result['topics']:
        for tag in result['topics']:
            if "century" in tag.lower() or "pre-" in tag.lower() or "era" in tag.lower():
                period_tag = f"_{tag.lower().replace(' ', '_')}"
                break

    # Generate final descriptive filename
    descriptive_name = f"{original_name}{doc_type_tag}{period_tag}{file_ext}"
    return descriptive_name
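
A quick illustration of the filename this yields (input values invented for the example):

    result = {"detected_document_type": "handwritten letter",
              "topics": ["19th century", "Travel"]}
    create_descriptive_filename("scan01.jpg", result, ".json")
    # -> "scan01_handwritten_letter_19th_century.json"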

def extract_subject_tags(result, raw_text, preprocessing_options=None):
    """
    Extract subject tags from OCR result

    Args:
        result: OCR result dictionary
        raw_text: Raw text from OCR
        preprocessing_options: Dictionary of preprocessing options

    Returns:
        list: Subject tags
    """
    subject_tags = []

    # Use existing topics as starting point if available
    if 'topics' in result and result['topics']:
        subject_tags = list(result['topics'])

    # Add document type if detected
    if 'detected_document_type' in result:
        doc_type = result['detected_document_type'].capitalize()
        if doc_type not in subject_tags:
            subject_tags.append(doc_type)

    # If no tags were found, add some defaults
    if not subject_tags:
        subject_tags = ["Document", "Historical Document"]

    # Try to infer content type
    if "letter" in raw_text.lower()[:1000] or "dear" in raw_text.lower()[:200]:
        subject_tags.append("Letter")

    # Check if it might be a newspaper
    if "newspaper" in raw_text.lower()[:1000] or "editor" in raw_text.lower()[:500]:
        subject_tags.append("Newspaper")

    return subject_tags
utils/image_utils.py
ADDED
@@ -0,0 +1,886 @@
"""
Utility functions for OCR image processing with Mistral AI.
Contains helper functions for working with OCR responses and image handling.
"""

# Standard library imports
import json
import base64
import io
import zipfile
import logging
import re
import time
import math
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional, Union, Any, Tuple
from functools import lru_cache

# Configure logging
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

# Third-party imports
import numpy as np

# Mistral AI imports
from mistralai import DocumentURLChunk, ImageURLChunk, TextChunk
from mistralai.models import OCRImageObject

# Check for image processing libraries
try:
    from PIL import Image, ImageEnhance, ImageFilter, ImageOps
    PILLOW_AVAILABLE = True
except ImportError:
    logger.warning("PIL not available - image preprocessing will be limited")
    PILLOW_AVAILABLE = False

try:
    import cv2
    CV2_AVAILABLE = True
except ImportError:
    logger.warning("OpenCV (cv2) not available - advanced image processing will be limited")
    CV2_AVAILABLE = False

# Import configuration
try:
    from config import IMAGE_PREPROCESSING
except ImportError:
    # Fallback defaults if config not available
    IMAGE_PREPROCESSING = {
        "enhance_contrast": 1.5,
        "sharpen": True,
        "denoise": True,
        "max_size_mb": 8.0,
        "target_dpi": 300,
        "compression_quality": 92
    }

def detect_skew(image: Union[Image.Image, np.ndarray]) -> float:
    """
    Quick skew detection that returns angle in degrees.
    Uses a computationally efficient approach by analyzing at 1% resolution.

    Args:
        image: PIL Image or numpy array

    Returns:
        Estimated skew angle in degrees (positive or negative)
    """
    # Convert PIL Image to numpy array if needed
    if isinstance(image, Image.Image):
        # Convert to grayscale for processing
        if image.mode != 'L':
            img_np = np.array(image.convert('L'))
        else:
            img_np = np.array(image)
    else:
        # If already numpy array, ensure it's grayscale
        if len(image.shape) == 3:
            if CV2_AVAILABLE:
                img_np = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
            else:
                # Fallback grayscale conversion
                img_np = np.mean(image, axis=2).astype(np.uint8)
        else:
            img_np = image

    # Downsample to 1% resolution for faster processing
    height, width = img_np.shape
    target_size = int(min(width, height) * 0.01)

    # Use a sane minimum size and ensure we have enough pixels to detect lines
    target_size = max(target_size, 100)

    if CV2_AVAILABLE:
        # OpenCV-based implementation (faster)
        # Resize the image to the target size
        scale_factor = target_size / max(width, height)
        small_img = cv2.resize(img_np, None, fx=scale_factor, fy=scale_factor, interpolation=cv2.INTER_AREA)

        # Apply binary thresholding to get cleaner edges
        _, binary = cv2.threshold(small_img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

        # Use Hough Line Transform to detect lines
        lines = cv2.HoughLinesP(binary, 1, np.pi/180, threshold=target_size//10,
                                minLineLength=target_size//5, maxLineGap=target_size//10)

        if lines is None or len(lines) < 3:
            # Not enough lines detected, assume no significant skew
            return 0.0

        # Calculate angles of lines
        angles = []
        for line in lines:
            x1, y1, x2, y2 = line[0]
            if x2 - x1 == 0:  # Avoid division by zero
                continue
            angle = math.atan2(y2 - y1, x2 - x1) * 180.0 / np.pi

            # Normalize angle to -45 to 45 range
            angle = angle % 180
            if angle > 90:
                angle -= 180
            if angle > 45:
                angle -= 90
            if angle < -45:
                angle += 90

            angles.append(angle)

        if not angles:
            return 0.0

        # Use median to reduce impact of outliers
        angles.sort()
        median_angle = angles[len(angles) // 2]

        return median_angle
    else:
        # PIL-only fallback implementation
        # Resize using PIL
        small_img = Image.fromarray(img_np).resize(
            (int(width * target_size / max(width, height)),
             int(height * target_size / max(width, height))),
            Image.NEAREST
        )

        # Find edges
        edges = small_img.filter(ImageFilter.FIND_EDGES)
        edges_data = np.array(edges)

        # Simple edge orientation analysis (less precise than OpenCV)
        # Count horizontal vs vertical edges
        h_edges = np.sum(np.abs(np.diff(edges_data, axis=1)))
        v_edges = np.sum(np.abs(np.diff(edges_data, axis=0)))

        # If horizontal edges dominate, no significant skew
        if h_edges > v_edges * 1.2:
            return 0.0

        # Simple angle estimation based on edge distribution
        # This is a simplified approach that works for slight skews
        rows, cols = edges_data.shape
        xs, ys = [], []

        # Sample strong edge points
        for r in range(0, rows, 2):
            for c in range(0, cols, 2):
                if edges_data[r, c] > 128:
                    xs.append(c)
                    ys.append(r)

    if len(xs) < 10:  # Not enough edge points
        return 0.0

        # Use simple linear regression to estimate the slope
        n = len(xs)
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n

        # Calculate slope
        numerator = sum((xs[i] - mean_x) * (ys[i] - mean_y) for i in range(n))
        denominator = sum((xs[i] - mean_x) ** 2 for i in range(n))

        if abs(denominator) < 1e-6:  # Avoid division by zero
            return 0.0

        slope = numerator / denominator
        angle = math.atan(slope) * 180.0 / math.pi

        # Normalize to -45 to 45 degrees
        if angle > 45:
            angle -= 90
        elif angle < -45:
            angle += 90

        return angle
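
A sketch of feeding the detected angle into a deskew step with PIL (the 0.5° threshold is an assumption, not from this commit):

    from PIL import Image

    img = Image.open("page.png")
    angle = detect_skew(img)
    if abs(angle) > 0.5:  # ignore negligible skew
        img = img.rotate(-angle, expand=True, fillcolor="white")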

def replace_images_in_markdown(md: str, images: dict[str, str]) -> str:
    """
    Replace image placeholders in markdown with base64-encoded images.
    Uses regex-based matching to handle variations in image IDs and formats.

    Args:
        md: Markdown text containing image placeholders
        images: Dictionary mapping image IDs to base64 strings

    Returns:
        Markdown text with images replaced by base64 data
    """
    # Process each image ID in the dictionary
    for img_id, base64_str in images.items():
        # Extract the base ID without extension for more flexible matching
        base_id = img_id.split('.')[0]

        # Match markdown image pattern where URL contains the base ID
        # Using a single regex with groups to capture the full pattern
        pattern = re.compile(rf'!\[([^\]]*)\]\(([^\)]*{base_id}[^\)]*)\)')

        # Process all matches
        matches = list(pattern.finditer(md))
        for match in reversed(matches):  # Process in reverse to avoid offset issues
            # Replace the entire match with a properly formatted base64 image
            md = md[:match.start()] + f"![{img_id}]({base64_str})" + md[match.end():]

    return md

def get_combined_markdown(ocr_response) -> str:
    """
    Combine OCR text and images into a single markdown document.

    Args:
        ocr_response: OCR response object from Mistral AI

    Returns:
        Combined markdown string with embedded images
    """
    markdowns = []

    # Process each page of the OCR response
    for page in ocr_response.pages:
        # Extract image data if available
        image_data = {}
        if hasattr(page, "images"):
            for img in page.images:
                if hasattr(img, "id") and hasattr(img, "image_base64"):
                    image_data[img.id] = img.image_base64

        # Replace image placeholders with base64 data
        page_markdown = page.markdown if hasattr(page, "markdown") else ""
        processed_markdown = replace_images_in_markdown(page_markdown, image_data)
        markdowns.append(processed_markdown)

    # Join all pages' markdown with double newlines
    return "\n\n".join(markdowns)

def encode_image_for_api(image_path: Union[str, Path]) -> str:
    """
    Encode an image as base64 data URL for API submission.

    Args:
        image_path: Path to the image file

    Returns:
        Base64 data URL for the image
    """
    # Convert to Path object if string
    image_file = Path(image_path) if isinstance(image_path, str) else image_path

    # Verify image exists
    if not image_file.is_file():
        raise FileNotFoundError(f"Image file not found: {image_file}")

    # Determine mime type based on file extension
    mime_type = 'image/jpeg'  # Default mime type
    suffix = image_file.suffix.lower()
    if suffix == '.png':
        mime_type = 'image/png'
    elif suffix == '.gif':
        mime_type = 'image/gif'
    elif suffix in ['.jpg', '.jpeg']:
        mime_type = 'image/jpeg'
    elif suffix == '.pdf':
        mime_type = 'application/pdf'

    # Encode image as base64
    encoded = base64.b64encode(image_file.read_bytes()).decode()
    return f"data:{mime_type};base64,{encoded}"

def encode_bytes_for_api(file_bytes: bytes, mime_type: str) -> str:
    """
    Encode binary data as base64 data URL for API submission.

    Args:
        file_bytes: Binary file data
        mime_type: MIME type of the file (e.g., 'image/jpeg', 'application/pdf')

    Returns:
        Base64 data URL for the data
    """
    # Encode data as base64
    encoded = base64.b64encode(file_bytes).decode()
    return f"data:{mime_type};base64,{encoded}"

def calculate_image_entropy(pil_img: Image.Image) -> float:
    """
    Calculate the entropy of a PIL image.
    Entropy is a measure of randomness; low entropy indicates a blank or simple image,
    high entropy indicates more complex content (e.g., text or detailed images).

    Args:
        pil_img: PIL Image object

    Returns:
        float: Entropy value
    """
    # Convert to grayscale for entropy calculation
    gray_img = pil_img.convert("L")
    arr = np.array(gray_img)
    # Compute histogram
    hist, _ = np.histogram(arr, bins=256, range=(0, 255), density=True)
    # Remove zero entries to avoid log(0)
    hist = hist[hist > 0]
    # Calculate entropy
    entropy = -np.sum(hist * np.log2(hist))
    return float(entropy)
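
A sketch of the blank-page filter this enables (the 3.5 cutoff is an assumed threshold, not taken from this commit):

    page = Image.open("page.png")
    if calculate_image_entropy(page) < 3.5:
        logger.info("Skipping near-blank page")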

def serialize_ocr_object(obj):
    """
    Serialize OCR response objects to JSON serializable format.
    Handles OCRImageObject specifically to prevent serialization errors.

    Args:
        obj: The object to serialize

    Returns:
        JSON serializable representation of the object
    """
    # Fast path: Handle primitive types directly
    if obj is None or isinstance(obj, (str, int, float, bool)):
        return obj

    # Handle collections
    if isinstance(obj, list):
        return [serialize_ocr_object(item) for item in obj]
    elif isinstance(obj, dict):
        return {k: serialize_ocr_object(v) for k, v in obj.items()}
    elif isinstance(obj, OCRImageObject):
        # Special handling for OCRImageObject
        return {
            'id': obj.id if hasattr(obj, 'id') else None,
            'image_base64': obj.image_base64 if hasattr(obj, 'image_base64') else None
        }
    elif hasattr(obj, '__dict__'):
        # For objects with __dict__ attribute
        return {k: serialize_ocr_object(v) for k, v in obj.__dict__.items()
                if not k.startswith('_')}  # Skip private attributes
    else:
        # Try to convert to string as last resort
        try:
            return str(obj)
        except Exception:
            return None
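
This makes a raw OCR response safe for json.dumps; a sketch:

    serializable = serialize_ocr_object(ocr_response)
    Path("response.json").write_text(json.dumps(serializable, indent=2))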

def format_ocr_text(text):
    """
    Format OCR text with simple, predictable rules that ensure consistency.
    This formats ALL CAPS lines as bold markdown and preserves the rest.

    Args:
        text: Text content to format

    Returns:
        Formatted text with consistent styling
    """
    if not isinstance(text, str):
        return text

    lines = text.split('\n')
    processed_lines = []
    for line in lines:
        line_stripped = line.strip()
        if line_stripped and line_stripped.isupper() and len(line_stripped) > 3:
            processed_lines.append(f"**{line_stripped}**")
        else:
            processed_lines.append(line)

    return '\n'.join(processed_lines)
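
A small before/after illustration:

    format_ocr_text("CHAPTER ONE\nIt was a dark night.")
    # -> "**CHAPTER ONE**\nIt was a dark night."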

def create_results_zip(results, output_dir=None, zip_name=None):
    """
    Create a zip file containing OCR results.

    Args:
        results: Dictionary or list of OCR results
        output_dir: Optional output directory
        zip_name: Optional zip file name

    Returns:
        Path to the created zip file
    """
    # Create temporary output directory if not provided
    if output_dir is None:
        output_dir = Path.cwd() / "output"
        output_dir.mkdir(exist_ok=True)
    else:
        output_dir = Path(output_dir)
        output_dir.mkdir(exist_ok=True)

    # Generate zip name if not provided
    if zip_name is None:
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

        if isinstance(results, list):
            # For a list of results, create a descriptive name
            file_count = len(results)
            zip_name = f"ocr_results_{file_count}_{timestamp}.zip"
        else:
            # For single result, create descriptive filename
            base_name = results.get('file_name', 'document').split('.')[0]
            zip_name = f"{base_name}_{timestamp}.zip"

    try:
        # Get zip data in memory first
        zip_data = create_results_zip_in_memory(results)

        # Save to file
        zip_path = output_dir / zip_name
        with open(zip_path, 'wb') as f:
            f.write(zip_data)

        return zip_path
    except Exception as e:
        # Create an empty zip file as fallback
        logger.error(f"Error creating zip file: {str(e)}")
        zip_path = output_dir / zip_name
        with zipfile.ZipFile(zip_path, 'w') as zipf:
            zipf.writestr("info.txt", "Could not create complete archive")

        return zip_path

def create_results_zip_in_memory(results):
    """
    Create a zip file containing OCR results in memory.

    Args:
        results: Dictionary or list of OCR results

    Returns:
        Binary zip file data
    """
    # Create a BytesIO object
    zip_buffer = io.BytesIO()

    # Check if results is a list or a dictionary
    is_list = isinstance(results, list)

    # Create zip file in memory
    with zipfile.ZipFile(zip_buffer, 'w', zipfile.ZIP_DEFLATED) as zipf:
        if is_list:
            # Handle list of results
            for i, result in enumerate(results):
                try:
                    # Create a descriptive base filename for this result
                    base_filename = result.get('file_name', f'document_{i+1}').split('.')[0]

                    # Add document type if available
                    if 'topics' in result and result['topics']:
                        topic = result['topics'][0].lower().replace(' ', '_')
                        base_filename = f"{base_filename}_{topic}"

                    # Add language if available
                    if 'languages' in result and result['languages']:
                        lang = result['languages'][0].lower()
                        # Only add if it's not already in the filename
                        if lang not in base_filename.lower():
                            base_filename = f"{base_filename}_{lang}"

                    # For PDFs, add page information
                    if 'limited_pages' in result:
                        base_filename = f"{base_filename}_p{result['limited_pages']['processed']}of{result['limited_pages']['total']}"

                    # Add timestamp if available
                    if 'timestamp' in result:
                        try:
                            # Try to parse the timestamp and reformat it
                            dt = datetime.strptime(result['timestamp'], "%Y-%m-%d %H:%M")
                            timestamp = dt.strftime("%Y%m%d_%H%M%S")
                            base_filename = f"{base_filename}_{timestamp}"
                        except Exception:
                            pass

                    # Add JSON results for each file with descriptive name
                    result_json = json.dumps(result, indent=2)
                    zipf.writestr(f"{base_filename}.json", result_json)

                    # Add HTML content (generated from the result)
                    html_content = create_html_with_images(result)
                    zipf.writestr(f"{base_filename}.html", html_content)

                    # Add raw OCR text if available
                    if "ocr_contents" in result and "raw_text" in result["ocr_contents"]:
                        zipf.writestr(f"{base_filename}.txt", result["ocr_contents"]["raw_text"])

                except Exception as e:
                    # If any result fails, skip it and continue
                    logger.warning(f"Failed to process result for zip: {str(e)}")
                    continue
        else:
            # Handle single result
            try:
                # Create a descriptive base filename for this result
                base_filename = results.get('file_name', 'document').split('.')[0]

                # Add document type if available
                if 'topics' in results and results['topics']:
                    topic = results['topics'][0].lower().replace(' ', '_')
                    base_filename = f"{base_filename}_{topic}"

                # Add language if available
                if 'languages' in results and results['languages']:
                    lang = results['languages'][0].lower()
                    # Only add if it's not already in the filename
                    if lang not in base_filename.lower():
                        base_filename = f"{base_filename}_{lang}"

                # For PDFs, add page information
                if 'limited_pages' in results:
                    base_filename = f"{base_filename}_p{results['limited_pages']['processed']}of{results['limited_pages']['total']}"

                # Add timestamp if available
                if 'timestamp' in results:
                    try:
                        # Try to parse the timestamp and reformat it
                        dt = datetime.strptime(results['timestamp'], "%Y-%m-%d %H:%M")
                        timestamp = dt.strftime("%Y%m%d_%H%M%S")
                        base_filename = f"{base_filename}_{timestamp}"
                    except Exception:
                        # If parsing fails, create a new timestamp
                        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
                        base_filename = f"{base_filename}_{timestamp}"
                else:
                    # No timestamp in the result, create a new one
                    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
                    base_filename = f"{base_filename}_{timestamp}"

                # Add JSON results with descriptive name
                results_json = json.dumps(results, indent=2)
                zipf.writestr(f"{base_filename}.json", results_json)

                # Add HTML content with descriptive name
                html_content = create_html_with_images(results)
                zipf.writestr(f"{base_filename}.html", html_content)

                # Add raw OCR text if available
                if "ocr_contents" in results and "raw_text" in results["ocr_contents"]:
                    zipf.writestr(f"{base_filename}.txt", results["ocr_contents"]["raw_text"])

            except Exception as e:
                # If processing fails, log the error
                logger.error(f"Failed to create zip file: {str(e)}")

    # Seek to the beginning of the BytesIO object
    zip_buffer.seek(0)

    # Return the zip file bytes
    return zip_buffer.getvalue()
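
In the Streamlit app the in-memory variant pairs naturally with a download button; a sketch (widget labels are illustrative):

    zip_bytes = create_results_zip_in_memory(result)
    st.download_button("Download results", data=zip_bytes,
                       file_name="ocr_results.zip", mime="application/zip")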

def create_html_with_images(result):
    """
    Create a clean HTML document from OCR results that properly preserves page references
    and text structure, without any document-specific special cases.

    Args:
        result: OCR result dictionary

    Returns:
        HTML content as string
    """
    # Import content utils to use classification functions
    try:
        from utils.content_utils import classify_document_content, extract_document_text, extract_image_description
        content_utils_available = True
    except ImportError:
        content_utils_available = False

    # Get content classification
    has_text = True
    has_images = False
    has_page_refs = False

    if content_utils_available:
        classification = classify_document_content(result)
        has_text = classification['has_content']
        has_images = result.get('has_images', False)
        has_page_refs = False
    else:
        # Minimal fallback detection
        if 'has_images' in result:
            has_images = result['has_images']

        # Check for image data more thoroughly
        if 'pages_data' in result and isinstance(result['pages_data'], list):
            for page in result['pages_data']:
                if isinstance(page, dict) and 'images' in page and page['images']:
                    has_images = True
                    break

    # Start building the HTML document
    html = [
        '<!DOCTYPE html>',
        '<html lang="en">',
        '<head>',
        '    <meta charset="UTF-8">',
        '    <meta name="viewport" content="width=device-width, initial-scale=1.0">',
        f'    <title>{result.get("file_name", "Document")}</title>',
        '    <style>',
        '        body {',
        '            font-family: Georgia, serif;',
        '            line-height: 1.6;',
        '            color: #333;',
        '            max-width: 800px;',
        '            margin: 0 auto;',
        '            padding: 20px;',
        '        }',
        '        h1, h2, h3, h4 {',
        '            color: #222;',
        '            margin-top: 1.5em;',
        '            margin-bottom: 0.5em;',
        '        }',
        '        h1 { font-size: 24px; }',
        '        h2 { font-size: 22px; }',
        '        h3 { font-size: 20px; }',
        '        h4 { font-size: 18px; }',
        '        p { margin: 1em 0; }',
        '        .metadata {',
        '            background-color: #f8f9fa;',
        '            border: 1px solid #eaecef;',
        '            border-radius: 6px;',
        '            padding: 15px;',
        '            margin-bottom: 20px;',
        '        }',
        '        .metadata p { margin: 5px 0; }',
        '        img {',
        '            max-width: 100%;',
        '            height: auto;',
        '            display: block;',
        '            margin: 20px auto;',
        '            border: 1px solid #ddd;',
        '            border-radius: 4px;',
        '        }',
        '        .image-container {',
        '            margin: 20px 0;',
        '            text-align: center;',
        '        }',
        '        .image-caption {',
        '            font-size: 0.9em;',
        '            text-align: center;',
        '            color: #666;',
        '            margin-top: 5px;',
        '        }',
        '        .text-block {',
        '            margin: 10px 0;',
        '        }',
        '        .page-ref {',
        '            font-weight: bold;',
        '            color: #555;',
        '        }',
        '        .separator {',
        '            border-top: 1px solid #eaecef;',
        '            margin: 30px 0;',
        '        }',
        '    </style>',
        '</head>',
        '<body>'
    ]

    # Add document metadata
    html.append('<div class="metadata">')
    html.append(f'<h1>{result.get("file_name", "Document")}</h1>')

    # Add timestamp
    if 'timestamp' in result:
        html.append(f'<p><strong>Processed:</strong> {result["timestamp"]}</p>')

    # Add languages if available
    if 'languages' in result and result['languages']:
        languages = [lang for lang in result['languages'] if lang]
        if languages:
            html.append(f'<p><strong>Languages:</strong> {", ".join(languages)}</p>')

    # Add document type and topics
    if 'detected_document_type' in result:
        html.append(f'<p><strong>Document Type:</strong> {result["detected_document_type"]}</p>')

    if 'topics' in result and result['topics']:
        html.append(f'<p><strong>Topics:</strong> {", ".join(result["topics"])}</p>')

    html.append('</div>')  # Close metadata div

    # Document title - extract from result if available
    if 'ocr_contents' in result and 'title' in result['ocr_contents'] and result['ocr_contents']['title']:
        title_content = result['ocr_contents']['title']
        # No special handling for any specific document types
        html.append(f'<h2>{title_content}</h2>')

    # Add images if present
    if has_images and 'pages_data' in result:
        html.append('<h3>Images</h3>')

        # Extract and display all images
        for page_idx, page in enumerate(result['pages_data']):
            if 'images' in page and isinstance(page['images'], list):
                for img_idx, img in enumerate(page['images']):
                    if 'image_base64' in img and img['image_base64']:
                        # Image container
                        html.append('<div class="image-container">')
                        html.append(f'<img src="{img["image_base64"]}" alt="Image {page_idx+1}-{img_idx+1}">')

                        # Generic caption based on index
                        html.append(f'<div class="image-caption">img-{img_idx}.jpeg</div>')
                        html.append('</div>')

        # Add image description if available through utils
        if content_utils_available:
            description = extract_image_description(result)
            if description:
                html.append('<div class="text-block">')
                html.append(f'<p>{description}</p>')
                html.append('</div>')

        html.append('<hr class="separator">')

    # Add document text section
    html.append('<h3>Text</h3>')

    # Extract text content systematically
    text_content = ""

    if content_utils_available:
        # Use the systematic utility function
        text_content = extract_document_text(result)
    else:
        # Fallback extraction logic
        if 'ocr_contents' in result:
            for field in ["main_text", "content", "text", "transcript", "raw_text"]:
                if field in result['ocr_contents'] and result['ocr_contents'][field]:
                    content = result['ocr_contents'][field]
                    if isinstance(content, str) and content.strip():
                        text_content = content
                        break
                    elif isinstance(content, dict):
                        # Try to convert complex objects to string
                        try:
                            text_content = json.dumps(content, indent=2)
                            break
                        except Exception:
                            pass

    # Process text content for HTML display
    if text_content:
        # Clean the text but preserve page references
        text_content = text_content.replace('\r\n', '\n')

        # Preserve page references by wrapping them in HTML tags
        if has_page_refs:
            # Highlight common page reference patterns
            page_patterns = [
                (r'(page\s+\d+)', r'<span class="page-ref">\1</span>'),
                (r'(p\.\s*\d+)', r'<span class="page-ref">\1</span>'),
                (r'(p\s+\d+)', r'<span class="page-ref">\1</span>'),
                (r'(\[\s*\d+\s*\])', r'<span class="page-ref">\1</span>'),
                (r'(\(\s*\d+\s*\))', r'<span class="page-ref">\1</span>'),
                (r'(folio\s+\d+)', r'<span class="page-ref">\1</span>'),
                (r'(f\.\s*\d+)', r'<span class="page-ref">\1</span>'),
                (r'(pg\.\s*\d+)', r'<span class="page-ref">\1</span>')
            ]

            for pattern, replacement in page_patterns:
                text_content = re.sub(pattern, replacement, text_content, flags=re.IGNORECASE)

        # Convert newlines to paragraphs
        paragraphs = text_content.split('\n\n')
        paragraphs = [p for p in paragraphs if p.strip()]

        html.append('<div class="text-block">')
        for paragraph in paragraphs:
            # Check if paragraph contains multiple lines
            if '\n' in paragraph:
                lines = paragraph.split('\n')
                lines = [line for line in lines if line.strip()]

                # Convert each line to a paragraph
                for line in lines:
                    html.append(f'<p>{line}</p>')
            else:
                html.append(f'<p>{paragraph}</p>')
        html.append('</div>')
    else:
        html.append('<p>No text content available.</p>')

    # Close the HTML document
    html.append('</body>')
    html.append('</html>')

    return '\n'.join(html)

def clean_ocr_result(result: dict,
                     use_segmentation: bool = False,
                     vision_enabled: bool = True) -> dict:
    """
    1. Replace or strip markdown image refs (![id](url))
    2. Collapse pages that are *only* an illustration into a single
       `illustrations` bucket when vision is off
    3. Normalise `ocr_contents` keys to always have at least `raw_text`
    """
    if 'pages_data' in result:
        # Build a dict {id: base64} for quick look-ups
        image_dict = {
            img['id']: img['image_base64']
            for page in result['pages_data']
            for img in page.get('images', [])
        }

        # --- 1 · replace or drop image placeholders ---
        def _scrub(markdown: str) -> str:
            if vision_enabled and image_dict:
                return replace_images_in_markdown(markdown, image_dict)
            # no vision / no images → drop the line
            return re.sub(r'!\[[^\]]*\]\(img-\d+\.\w+\)', '', markdown)

        for page in result['pages_data']:
            page['markdown'] = _scrub(page.get('markdown', ''))

    # --- 2 · group illustration-only pages when vision is off ---
    if not vision_enabled and 'pages_data' in result:
        text_pages, art_pages = [], []
        for p in result['pages_data']:
            has_text = p.get('markdown', '').strip()
            (text_pages if has_text else art_pages).append(p)
        result['pages_data'] = text_pages
        if art_pages:
            # keep one thumbnail under metadata
            result.setdefault('illustrations', []).extend(art_pages)

    # --- 3 · ensure raw_text key ---
    if 'ocr_contents' in result and 'raw_text' not in result['ocr_contents']:
        # First, try to extract any embedded text from image references
        raw_text_parts = []

        for page in result.get('pages_data', []):
            markdown = page.get('markdown', '')
            # Check if the markdown contains image references
            img_refs = re.findall(r'!\[([^\]]*)\]\(([^\)]*)\)', markdown)

            # Process each image reference to extract text content
            if img_refs:
                for alt_text, img_url in img_refs:
                    # If alt text contains actual text content (not just image ID), add it
                    if alt_text and not alt_text.endswith(('.jpeg', '.jpg', '.png')):
                        # Clean up the alt text and add it as text content
                        alt_text = alt_text.strip()
                        if alt_text and len(alt_text) > 3:  # Only add if meaningful
                            raw_text_parts.append(alt_text)

            # Remove image references from markdown
            cleaned_markdown = re.sub(r'!\[[^\]]*\]\([^\)]*\)', '', markdown)

            # Add any remaining text content
            if cleaned_markdown.strip():
                raw_text_parts.append(cleaned_markdown.strip())

        # Join all extracted text content
        if raw_text_parts:
            result['ocr_contents']['raw_text'] = "\n\n".join(raw_text_parts)
        else:
            # Fallback: use original method if no text was extracted
            joined = "\n".join(p.get('markdown', '') for p in result.get('pages_data', []))
            # Final cleanup of image references
            joined = re.sub(r'!\[[^\]]*\]\([^\)]*\)', '', joined)
            result['ocr_contents']['raw_text'] = joined

    return result
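
A sketch of where this sits in the pipeline (the upstream result variable is assumed):

    result = clean_ocr_result(result, vision_enabled=False)
    raw_text = result["ocr_contents"]["raw_text"]  # present after cleanup when ocr_contents exists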
utils/text_utils.py
ADDED
@@ -0,0 +1,151 @@
"""Text utility functions for OCR processing"""

import re

def clean_raw_text(text):
    """Clean raw text by removing image references and serialized data.

    Args:
        text (str): The text to clean

    Returns:
        str: The cleaned text
    """
    if not text or not isinstance(text, str):
        return ""

    # # Remove image references like ![image](data:image/...)
    # text = re.sub(r'!\[.*?\]\(data:image/[^)]+\)', '', text)

    # # Remove basic markdown image references like ![alt](url)
    # text = re.sub(r'!\[[^\]]*\]\([^)]+\)', '', text)

    # # Remove base64 encoded image data
    # text = re.sub(r'data:image/[^;]+;base64,[a-zA-Z0-9+/=]+', '', text)

    # # Remove image object references like [[OCRImageObject:...]]
    # text = re.sub(r'\[\[OCRImageObject:[^\]]+\]\]', '', text)

    # # Clean up any JSON-like image object references
    # text = re.sub(r'{"image(_data)?":("[^"]*"|null|true|false|\{[^}]*\}|\[[^\]]*\])}', '', text)

    # # Clean up excessive whitespace and line breaks created by removals
    # text = re.sub(r'\n{3,}', '\n\n', text)
    # text = re.sub(r'\s{3,}', ' ', text)

    return text.strip()

def format_markdown_text(text):
    """Format text with markdown and handle special patterns

    Args:
        text (str): The text to format

    Returns:
        str: The formatted markdown text
    """
    if not text:
        return ""

    # First, ensure we're working with a string
    if not isinstance(text, str):
        text = str(text)

    # Ensure newlines are preserved for proper spacing
    # Convert any Windows line endings to Unix
    text = text.replace('\r\n', '\n')

    # Format dates (MM/DD/YYYY or similar patterns)
    date_pattern = r'\b(0?[1-9]|1[0-2])[\/\-\.](0?[1-9]|[12][0-9]|3[01])[\/\-\.](\d{4}|\d{2})\b'
    text = re.sub(date_pattern, r'**\g<0>**', text)

    # Detect markdown tables and preserve them
    table_sections = []
    non_table_lines = []
    in_table = False
    table_buffer = []

    # Process text line by line, preserving tables
    lines = text.split('\n')
    for i, line in enumerate(lines):
        line_stripped = line.strip()

        # Detect table rows by pipe character
        if '|' in line_stripped and (line_stripped.startswith('|') or line_stripped.endswith('|')):
            if not in_table:
                in_table = True
                if table_buffer:
                    table_buffer = []
            table_buffer.append(line)

            # Check if the next line is a table separator
            if i < len(lines) - 1 and '---' in lines[i+1] and '|' in lines[i+1]:
                table_buffer.append(lines[i+1])

        # Detect table separators (---|---|---)
        elif in_table and '---' in line_stripped and '|' in line_stripped:
            table_buffer.append(line)

        # End of table detection
        elif in_table:
            # Check if this is still part of the table
            next_line_is_table = False
            if i < len(lines) - 1:
                next_line = lines[i+1].strip()
                if '|' in next_line and (next_line.startswith('|') or next_line.endswith('|')):
                    next_line_is_table = True

            if not next_line_is_table:
                in_table = False
                # Save the complete table
                if table_buffer:
                    table_sections.append('\n'.join(table_buffer))
                    table_buffer = []
                # Add current line to non-table lines
                non_table_lines.append(line)
            else:
                # Still part of the table
                table_buffer.append(line)
        else:
            # Not in a table
            non_table_lines.append(line)

    # Handle any remaining table buffer
    if in_table and table_buffer:
        table_sections.append('\n'.join(table_buffer))

    # Process non-table lines
    processed_lines = []
    for line in non_table_lines:
        line_stripped = line.strip()

        # Check if line is in ALL CAPS (and not just a short acronym)
        if line_stripped and line_stripped.isupper() and len(line_stripped) > 3:
            # ALL CAPS line - make bold instead of heading to prevent large display
            processed_lines.append(f"**{line_stripped}**")
        # Process potential headers (lines ending with colon)
        elif line_stripped and line_stripped.endswith(':') and len(line_stripped) < 40:
            # Likely a header - make it bold
            processed_lines.append(f"**{line_stripped}**")
        else:
            # Keep original line with its spacing
            processed_lines.append(line)

    # Join non-table lines
    processed_text = '\n'.join(processed_lines)

    # Reinsert tables in the right positions
    for table in table_sections:
        # Generate a unique marker for this table
        marker = f"__TABLE_MARKER_{hash(table) % 10000}__"
        # Find a good position to insert this table
        # For now, just append all tables at the end
        processed_text += f"\n\n{table}\n\n"

    # Make sure paragraphs have proper spacing but not excessive
    processed_text = re.sub(r'\n{3,}', '\n\n', processed_text)

    # Ensure two newlines between paragraphs for proper markdown rendering
    processed_text = re.sub(r'([^\n])\n([^\n])', r'\1\n\n\2', processed_text)

    return processed_text
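
A before/after illustration of the date and ALL-CAPS rules:

    format_markdown_text("MINUTES OF THE MEETING\nRecorded 03/14/1892.")
    # -> "**MINUTES OF THE MEETING**\n\nRecorded **03/14/1892**."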
utils/ui_utils.py
ADDED
@@ -0,0 +1,413 @@
1 |
+
"""
|
2 |
+
UI utilities for OCR results display.
|
3 |
+
"""
|
4 |
+
import streamlit as st
|
5 |
+
import json
|
6 |
+
import base64
|
7 |
+
import io
|
8 |
+
from datetime import datetime
|
9 |
+
|
10 |
+
from utils.image_utils import format_ocr_text, create_html_with_images
|
11 |
+
from utils.content_utils import classify_document_content, format_structured_data
|
12 |
+
|
13 |
+
def display_results(result, container, custom_prompt=""):
|
14 |
+
"""Display OCR results in the provided container"""
|
15 |
+
with container:
|
16 |
+
# Add heading for document metadata
|
17 |
+
st.markdown("### Document Metadata")
|
18 |
+
|
19 |
+
# Filter out large data structures from metadata display
|
20 |
+
meta = {k: v for k, v in result.items()
|
21 |
+
if k not in ['pages_data', 'illustrations', 'ocr_contents', 'raw_response_data']}
|
22 |
+
|
23 |
+
# Create a compact metadata section
|
24 |
+
meta_html = '<div style="display: flex; flex-wrap: wrap; gap: 0.3rem; margin-bottom: 0.3rem;">'
|
25 |
+
|
26 |
+
# Document type
|
27 |
+
if 'detected_document_type' in meta:
|
28 |
+
meta_html += f'<div><strong>Type:</strong> {meta["detected_document_type"]}</div>'
|
29 |
+
|
30 |
+
# Processing time
|
31 |
+
if 'processing_time' in meta:
|
32 |
+
meta_html += f'<div><strong>Time:</strong> {meta["processing_time"]:.1f}s</div>'
|
33 |
+
|
34 |
+
# Page information
|
35 |
+
if 'limited_pages' in meta:
|
36 |
+
meta_html += f'<div><strong>Pages:</strong> {meta["limited_pages"]["processed"]}/{meta["limited_pages"]["total"]}</div>'
|
37 |
+
|
38 |
+
meta_html += '</div>'
|
39 |
+
st.markdown(meta_html, unsafe_allow_html=True)
|
40 |
+
|
41 |
+
# Language metadata on a separate line, Subject Tags below
|
42 |
+
|
43 |
+
# First show languages if available
|
44 |
+
if 'languages' in result and result['languages']:
|
45 |
+
languages = [lang for lang in result['languages'] if lang is not None]
|
46 |
+
if languages:
|
47 |
+
# Create a dedicated line for Languages
|
48 |
+
lang_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
|
49 |
+
lang_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Language:</div>'
|
50 |
+
|
51 |
+
# Add language tags
|
52 |
+
for lang in languages:
|
53 |
+
# Clean language name if needed
|
54 |
+
clean_lang = str(lang).strip()
|
55 |
+
if clean_lang: # Only add if not empty
|
56 |
+
lang_html += f'<span class="subject-tag tag-language">{clean_lang}</span>'
|
57 |
+
|
58 |
+
lang_html += '</div>'
|
59 |
+
st.markdown(lang_html, unsafe_allow_html=True)
|
60 |
+
|
61 |
+
# Create a separate line for Time if we have time-related tags
|
62 |
+
if 'topics' in result and result['topics']:
|
63 |
+
time_tags = [topic for topic in result['topics']
|
64 |
+
if any(term in topic.lower() for term in ["century", "pre-", "era", "historical"])]
|
65 |
+
if time_tags:
|
66 |
+
time_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
|
67 |
+
time_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Time:</div>'
|
68 |
+
for tag in time_tags:
|
69 |
+
time_html += f'<span class="subject-tag tag-time-period">{tag}</span>'
|
70 |
+
time_html += '</div>'
|
71 |
+
st.markdown(time_html, unsafe_allow_html=True)
|
72 |
+
|
73 |
+
# Then display remaining subject tags if available
|
74 |
+
if 'topics' in result and result['topics']:
|
75 |
+
# Filter out time-related tags which are already displayed
|
76 |
+
subject_tags = [topic for topic in result['topics']
|
77 |
+
if not any(term in topic.lower() for term in ["century", "pre-", "era", "historical"])]
|
78 |
+
|
79 |
+
if subject_tags:
|
80 |
+
# Create a separate line for Subject Tags
|
81 |
+
tags_html = '<div style="display: flex; align-items: center; margin: 0.2rem 0; flex-wrap: wrap;">'
|
82 |
+
tags_html += '<div style="margin-right: 0.3rem; font-weight: bold;">Subject Tags:</div>'
|
83 |
+
tags_html += '<div style="display: flex; flex-wrap: wrap; gap: 2px; align-items: center;">'
|
84 |
+
|
85 |
+
# Generate a badge for each remaining tag
|
86 |
+
for topic in subject_tags:
|
87 |
+
# Determine tag category class
|
88 |
+
tag_class = "subject-tag" # Default class
|
89 |
+
|
90 |
+
# Add specialized class based on category
|
91 |
+
if any(term in topic.lower() for term in ["language", "english", "french", "german", "latin"]):
|
92 |
+
tag_class += " tag-language" # Languages
|
93 |
+
elif any(term in topic.lower() for term in ["letter", "newspaper", "book", "form", "document", "recipe"]):
|
94 |
+
tag_class += " tag-document-type" # Document types
|
95 |
+
elif any(term in topic.lower() for term in ["travel", "military", "science", "medicine", "education", "art", "literature"]):
|
96 |
+
tag_class += " tag-subject" # Subject domains
|
97 |
+
|
98 |
+
# Add each tag as an inline span
|
99 |
+
tags_html += f'<span class="{tag_class}">{topic}</span>'
|
100 |
+
|
101 |
+
# Close the containers
|
102 |
+
tags_html += '</div></div>'
|
103 |
+
|
104 |
+
# Render the subject tags section
|
105 |
+
st.markdown(tags_html, unsafe_allow_html=True)
|
106 |
+
|
107 |
+
# Check if we have OCR content
|
108 |
+
if 'ocr_contents' in result:
|
109 |
+
# Create a single view instead of tabs
|
110 |
+
content_tab1 = st.container()
|
111 |
+
|
112 |
+
# Check for images in the result to use later
|
113 |
+
has_images = result.get('has_images', False)
|
114 |
+
has_image_data = ('pages_data' in result and any(page.get('images', []) for page in result.get('pages_data', [])))
|
115 |
+
has_raw_images = ('raw_response_data' in result and 'pages' in result['raw_response_data'] and
|
116 |
+
any('images' in page for page in result['raw_response_data']['pages']
|
117 |
+
if isinstance(page, dict)))
|
118 |
+
|
119 |
+
# Display structured content
|
120 |
+
with content_tab1:
|
121 |
+
# Display structured content with markdown formatting
|
122 |
+
if isinstance(result['ocr_contents'], dict):
|
123 |
+
# CSS is now handled in the main layout.py file
|
124 |
+
|
125 |
+
# Collect all available images from the result
|
126 |
+
available_images = []
|
127 |
+
if has_images and 'pages_data' in result:
|
128 |
+
for page_idx, page in enumerate(result['pages_data']):
|
129 |
+
if 'images' in page and len(page['images']) > 0:
|
130 |
+
for img_idx, img in enumerate(page['images']):
|
131 |
+
if 'image_base64' in img:
|
132 |
+
available_images.append({
|
133 |
+
'source': 'pages_data',
|
134 |
+
'page': page_idx,
|
135 |
+
'index': img_idx,
|
136 |
+
'data': img['image_base64']
|
137 |
+
})
|
138 |
+
|
139 |
+
# Get images from raw response as well
|
140 |
+
if 'raw_response_data' in result:
|
141 |
+
raw_data = result['raw_response_data']
|
142 |
+
if isinstance(raw_data, dict) and 'pages' in raw_data:
|
143 |
+
for page_idx, page in enumerate(raw_data['pages']):
|
144 |
+
if isinstance(page, dict) and 'images' in page:
|
145 |
+
for img_idx, img in enumerate(page['images']):
|
146 |
+
if isinstance(img, dict) and 'base64' in img:
|
147 |
+
available_images.append({
|
148 |
+
'source': 'raw_response',
|
149 |
+
'page': page_idx,
|
150 |
+
'index': img_idx,
|
151 |
+
'data': img['base64']
|
152 |
+
})
|
153 |
+
|
154 |
+
# Extract images for display at the top
|
155 |
+
images_to_display = []
|
156 |
+
|
157 |
+
# First, collect all available images
|
158 |
+
for img_idx, img in enumerate(available_images):
|
159 |
+
if 'data' in img:
|
160 |
+
images_to_display.append({
|
161 |
+
'data': img['data'],
|
162 |
+
'id': img.get('id', f"img_{img_idx}"),
|
163 |
+
'index': img_idx
|
164 |
+
})
|
165 |
+
|
166 |
+
                # Image display now only happens in the Images tab

                # Organize sections in a logical order - prioritize main_text
                section_order = ["title", "author", "date", "summary", "main_text", "content", "transcript", "metadata"]
                ordered_sections = []

                # Add known sections first in preferred order
                for section_name in section_order:
                    if section_name in result['ocr_contents'] and result['ocr_contents'][section_name]:
                        ordered_sections.append(section_name)

                # Add any remaining sections
                for section in result['ocr_contents'].keys():
                    if (section not in ordered_sections and
                            section not in ['error', 'partial_text'] and
                            result['ocr_contents'][section]):
                        ordered_sections.append(section)

                # If only raw_text is available and no other content, add it last
                if ('raw_text' in result['ocr_contents'] and
                        result['ocr_contents']['raw_text'] and
                        len(ordered_sections) == 0):
                    ordered_sections.append('raw_text')
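                # Example: for a result whose ocr_contents holds non-empty
                # 'title', 'main_text', and 'marginalia' keys, the loops above
                # yield ordered_sections == ['title', 'main_text', 'marginalia']:
                # known names first in their preferred order, then any extras
                # in dict order.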
                # Add minimal spacing before OCR results
                st.markdown("<div style='margin: 8px 0 4px 0;'></div>", unsafe_allow_html=True)

                # Create tabs for different views
                if has_images:
                    tabs = st.tabs(["Document Content", "Raw JSON", "Images"])
                    doc_tab, json_tab, img_tab = tabs
                else:
                    tabs = st.tabs(["Document Content", "Raw JSON"])
                    doc_tab, json_tab = tabs
                    img_tab = None
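                # st.tabs returns one container per label, so the tuple
                # unpacking above mirrors the label list; img_tab is set to
                # None as a sentinel so the Images tab is only rendered when
                # images actually exist.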
                # Document Content tab with simplified and systematic content handling
                with doc_tab:
                    # Classify document content using our utility function
                    content_classification = classify_document_content(result)

                    # Track what content has been displayed to avoid redundancy
                    displayed_content = set()

                    # Create a single unified content section
                    st.markdown("#### Document Content")
                    st.markdown("##### Title")

                    # Extract the main structured content fields without redundancy
                    text_fields = {}

                    # Use the same approach as in the Previous Results tab for
                    # consistency: a focused list of important sections,
                    # prioritizing main_text
                    priority_sections = ["title", "main_text", "content", "transcript", "summary"]
                    displayed_sections = set()

                    # First display priority sections
                    for section in priority_sections:
                        if section in result['ocr_contents'] and result['ocr_contents'][section]:
                            content = result['ocr_contents'][section]
                            if isinstance(content, str) and content.strip():
                                # Only add a subheader for meaningful section names, not raw_text
                                if section != "raw_text" and section != "title":
                                    st.markdown(f"##### {section.replace('_', ' ').title()}")

                                # Format and display the content:
                                # first format any structured data (lists, dicts)
                                structured_content = format_structured_data(content)
                                # then apply regular OCR text formatting
                                formatted_content = format_ocr_text(structured_content)
                                st.markdown(formatted_content)
                                displayed_sections.add(section)
                                break
                            elif isinstance(content, dict):
                                # Display dictionary content as key-value pairs
                                for k, v in content.items():
                                    if k not in ['error', 'partial_text'] and v:
                                        st.markdown(f"**{k.replace('_', ' ').title()}**")
                                        if isinstance(v, str):
                                            # Format any structured data in the string
                                            formatted_v = format_structured_data(v)
                                            st.markdown(format_ocr_text(formatted_v))
                                        else:
                                            # Format non-string values (lists, dicts)
                                            formatted_v = format_structured_data(v)
                                            st.markdown(formatted_v)
                                displayed_sections.add(section)
                                break
                            elif isinstance(content, list):
                                # Format and display list items using our structured formatter
                                formatted_list = format_structured_data(content)
                                st.markdown(formatted_list)
                                displayed_sections.add(section)
                                break
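                    # Each branch above ends in `break`, so only the first
                    # non-empty priority section is rendered here; everything
                    # else is picked up by the catch-all loop below.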
                    # Then display any remaining sections not already shown
                    for section, content in result['ocr_contents'].items():
                        if (section not in displayed_sections and
                                section not in ['error', 'partial_text'] and
                                content):
                            st.markdown(f"##### {section.replace('_', ' ').title()}")

                            if isinstance(content, str):
                                # Format any structured data in the string before display
                                structured_content = format_structured_data(content)
                                st.markdown(format_ocr_text(structured_content))
                            elif isinstance(content, list):
                                # Format the list using our structured formatter
                                formatted_list = format_structured_data(content)
                                st.markdown(formatted_list)
                            elif isinstance(content, dict):
                                # Format the dictionary using our structured formatter
                                formatted_dict = format_structured_data(content)
                                st.markdown(formatted_dict)

                # Raw JSON tab - for viewing the raw OCR response data
                with json_tab:
                    # Extract the relevant JSON data
                    json_data = {}

                    # Include important metadata
                    for field in ['file_name', 'timestamp', 'processing_time', 'detected_document_type', 'languages', 'topics']:
                        if field in result:
                            json_data[field] = result[field]

                    # Include OCR contents
                    if 'ocr_contents' in result:
                        json_data['ocr_contents'] = result['ocr_contents']

                    # Exclude large binary data like base64 images to keep the JSON clean
                    if 'pages_data' in result:
                        # Create simplified pages_data without large binary content
                        simplified_pages = []
                        for page in result['pages_data']:
                            simplified_page = {
                                'page_number': page.get('page_number', 0),
                                'has_text': bool(page.get('markdown', '')),
                                'has_images': bool(page.get('images', [])),
                                'image_count': len(page.get('images', []))
                            }
                            simplified_pages.append(simplified_page)
                        json_data['pages_summary'] = simplified_pages
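                    # A pages_summary entry looks like, e.g.:
                    #   {'page_number': 1, 'has_text': True, 'has_images': False, 'image_count': 0}
                    # (json.dumps below will raise a TypeError if ocr_contents
                    # holds values that are not JSON-serializable)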
                    # Format the JSON prettily
                    json_str = json.dumps(json_data, indent=2)

                    # Display in a monospace font with syntax highlighting
                    st.code(json_str, language="json")

                # Images tab - for viewing document images
                if has_images and img_tab:
                    with img_tab:
                        # Display each available image
                        for i, img in enumerate(images_to_display):
                            st.image(img['data'], caption=f"Image {i+1}", use_container_width=True)

    # Display custom prompt if provided
    if custom_prompt:
        with st.expander("Custom Processing Instructions"):
            st.write(custom_prompt)

    # No download heading - start directly with buttons

    # Create export section with a simple download menu
    st.markdown("<div style='margin-top: 15px;'></div>", unsafe_allow_html=True)

    # Prepare all download files at once to avoid rerun resets
    try:
        # 1. JSON download
        json_str = json.dumps(result, indent=2)
        json_filename = f"{result.get('file_name', 'document').split('.')[0]}_ocr.json"
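        # Note: split('.')[0] keeps only the text before the first dot, so a
        # file named 'scan.v2.pdf' would download as 'scan_ocr.json'.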
        # 2. Text download with improved structure
        text_parts = []
        filename = result.get('file_name', 'document')
        text_parts.append(f"DOCUMENT: {filename}\n")

        if 'timestamp' in result:
            text_parts.append(f"Processed: {result['timestamp']}\n")

        if 'languages' in result and result['languages']:
            languages = [lang for lang in result['languages'] if lang is not None]
            if languages:
                text_parts.append(f"Languages: {', '.join(languages)}\n")

        if 'topics' in result and result['topics']:
            text_parts.append(f"Topics: {', '.join(result['topics'])}\n")

        text_parts.append("\n" + "="*50 + "\n\n")

        if 'ocr_contents' in result and 'title' in result['ocr_contents'] and result['ocr_contents']['title']:
            text_parts.append(f"TITLE: {result['ocr_contents']['title']}\n\n")

        content_added = False

        if 'ocr_contents' in result:
            for field in ["main_text", "content", "text", "transcript", "raw_text"]:
                if field in result['ocr_contents'] and result['ocr_contents'][field]:
                    text_parts.append(f"CONTENT:\n\n{result['ocr_contents'][field]}\n")
                    content_added = True
                    break

        text_content = "\n".join(text_parts)
        text_filename = f"{result.get('file_name', 'document').split('.')[0]}_ocr.txt"
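        # Because every part above already ends in '\n' and join() inserts
        # another, the text export gets a blank line between entries, e.g.:
        #   DOCUMENT: letter_1865.png
        #
        #   Processed: 2024-01-01 12:00:00
        # (filename and timestamp shown here are illustrative only)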
        # 3. HTML download
        from utils.image_utils import create_html_with_images
        html_content = create_html_with_images(result)
        html_filename = f"{result.get('file_name', 'document').split('.')[0]}_ocr.html"

        # Hide download options in an expander
        with st.expander("Download Options"):
            # Use a vertical layout instead of columns, with spacing
            # between the buttons for better readability
            st.download_button(
                label="JSON",
                data=json_str,
                file_name=json_filename,
                mime="application/json",
                key="download_json_btn",
                use_container_width=True
            )

            st.markdown("<div style='margin-top: 8px;'></div>", unsafe_allow_html=True)

            st.download_button(
                label="Text",
                data=text_content,
                file_name=text_filename,
                mime="text/plain",
                key="download_text_btn",
                use_container_width=True
            )

            st.markdown("<div style='margin-top: 8px;'></div>", unsafe_allow_html=True)

            st.download_button(
                label="HTML",
                data=html_content,
                file_name=html_filename,
                mime="text/html",
                key="download_html_btn",
                use_container_width=True
            )

    except Exception as e:
        st.error(f"Error preparing download files: {str(e)}")