Spaces: milwright/historical-ocr

milwright committed on Apr 4

Commit 75ead00 (1 parent: 2f2eb30)

enhanced OCR functionality and efficiency, simplified preprompting, etc

Files changed (5)

CLAUDE.md +6 -3
app.py +493 -103
config.py +7 -7
ocr_utils.py +341 -52
structured_ocr.py +298 -146

CLAUDE.md CHANGED

@@ -7,12 +7,14 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 - Test OCR functionality: `python structured_ocr.py <file_path>`
 - Process PDF files: `python pdf_ocr.py <file_path>`
 - Process single file with logging: `python process_file.py <file_path>`
 - Run typechecking: `mypy .`
 ## Environment Setup
 - API key: Set `MISTRAL_API_KEY` in `.env` file or environment variable
 - Install dependencies: `pip install -r requirements.txt`
-- System requirements: `apt-get install poppler-utils tesseract-ocr` (or equivalent for your OS)
 ## Code Style Guidelines
 - **Imports**: Standard library first, third-party next, local modules last
@@ -21,10 +23,11 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 - **Naming**: snake_case for variables/functions, PascalCase for classes
 - **Documentation**: Google-style docstrings for all functions/classes
 - **Logging**: Use module-level loggers with appropriate log levels
 ## Architecture
 - Core: `structured_ocr.py` - Main OCR processing with Mistral AI integration
-- Utils: `ocr_utils.py` - Utility functions for OCR text and image processing
-- PDF handling: `pdf_ocr.py` - PDF-specific processing functionality
 - Config: `config.py` - Configuration settings and API keys
 - Web: `app.py` - Streamlit interface with UI components in `/ui` directory

 - Test OCR functionality: `python structured_ocr.py <file_path>`
 - Process PDF files: `python pdf_ocr.py <file_path>`
 - Process single file with logging: `python process_file.py <file_path>`
+- Run newspaper test: `python test_newspaper.py <file_path>`
 - Run typechecking: `mypy .`
+- Lint code: `ruff check .` or `flake8`
 ## Environment Setup
 - API key: Set `MISTRAL_API_KEY` in `.env` file or environment variable
 - Install dependencies: `pip install -r requirements.txt`
+- System requirements: Install `poppler-utils` and `tesseract-ocr` for PDF processing and OCR
 ## Code Style Guidelines
 - **Imports**: Standard library first, third-party next, local modules last
 - **Naming**: snake_case for variables/functions, PascalCase for classes
 - **Documentation**: Google-style docstrings for all functions/classes
 - **Logging**: Use module-level loggers with appropriate log levels
+- **Line length**: ≤100 characters
 ## Architecture
 - Core: `structured_ocr.py` - Main OCR processing with Mistral AI integration
+- Utils: `ocr_utils.py` - OCR text and image processing utilities
+- PDF handling: `pdf_ocr.py` - PDF-specific processing functionality
 - Config: `config.py` - Configuration settings and API keys
 - Web: `app.py` - Streamlit interface with UI components in `/ui` directory
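
The code style guidelines in CLAUDE.md (import ordering, snake_case and PascalCase naming, Google-style docstrings, module-level loggers, lines of at most 100 characters) can be illustrated with a minimal sketch. The names `OcrPage` and `normalize_whitespace` are hypothetical and do not come from the repository; they only show what a module following the guidelines might look like.

```python
import logging
import re
from dataclasses import dataclass

# Module-level logger, as the Logging guideline suggests.
logger = logging.getLogger(__name__)


@dataclass
class OcrPage:
    """One page of OCR output (hypothetical container, not from the repository).

    Attributes:
        index: Zero-based page index.
        text: Raw text extracted for the page.
    """

    index: int
    text: str


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces.

    Args:
        text: Raw OCR text, possibly containing line breaks and repeated spaces.

    Returns:
        The text with every whitespace run replaced by one space and both ends trimmed.
    """
    cleaned = re.sub(r"\s+", " ", text).strip()
    logger.debug("Normalized %d characters to %d", len(text), len(cleaned))
    return cleaned
```

A file like this should pass the `mypy .` and `ruff check .` commands listed above, assuming default configurations for both tools.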

app.py CHANGED

@@ -322,6 +322,15 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
                 preprocessing_options.get("document_type", "standard") != "standard"
             )
             if has_preprocessing:
                 status_text.markdown('<div class="processing-status-container">Applying image preprocessing...</div>', unsafe_allow_html=True)
                 progress_bar.progress(20)
@@ -371,7 +380,12 @@ def process_file(uploaded_file, use_vision=True, preprocessing_options=None, pro
             cache_key = f"{file_hash}_{file_type}_{use_vision}_{pdf_rotation_value}"
             progress_bar.progress(50)
-            status_text.markdown('<div class="processing-status-container">Processing document with OCR...</div>', unsafe_allow_html=True)
             # Process the file using cached function if possible
             try:
@@ -563,73 +577,115 @@ with st.sidebar:
     # Add spacing between sections
     st.markdown("<div style='margin: 10px 0;'></div>", unsafe_allow_html=True)
-    # Document Context section
-    st.markdown("##### Document Context", help="Add context information")
-    # Historical period selector
-    historical_periods = [
-        "Select period (if known)",
-        "Pre-1700s",
-        "18th Century (1700s)",
-        "19th Century (1800s)",
-        "Early 20th Century (1900-1950)",
-        "Modern (Post 1950)"
     ]
-    selected_period = st.selectbox(
-        "Time Period",
-        options=historical_periods,
         index=0,
-        help="Select the time period of the document"
     )
-    # Document purpose selector
-    document_purposes = [
-        "Select purpose (if known)",
-        "Personal Letter/Correspondence",
-        "Official/Government Document",
-        "Business/Financial Record",
-        "Literary/Academic Work",
-        "News/Journalism",
-        "Religious Text",
-        "Legal Document"
     ]
-    selected_purpose = st.selectbox(
-        "Document Type",
-        options=document_purposes,
         index=0,
-        help="Select the purpose or type of the document"
     )
-    # Dynamic custom prompt field
     custom_prompt_text = ""
-    if selected_period != "Select period (if known)":
-        custom_prompt_text += f"This is a {selected_period} document. "
-    if selected_purpose != "Select purpose (if known)":
-        custom_prompt_text += f"It appears to be a {selected_purpose}. "
     # Add spacing between sections
     st.markdown("<div style='margin: 10px 0;'></div>", unsafe_allow_html=True)
     custom_prompt = st.text_area(
-        "Special Instructions",
         value=custom_prompt_text,
-        placeholder="Example: Document has unusual cursive handwriting.",
-        height=90,
-        max_chars=500,
         key="custom_analysis_instructions",
-        help="Specify document features or extraction needs"
     )
-    # Compact instructions expander
-    with st.expander("Instruction Examples"):
         st.markdown("""
-        - "Has faded text in corners"
-        - "Extract dates and locations"
-        - "Translate text to English"
-        - "Preserve tabular format"
         """)
     # Add spacing between sections
@@ -733,10 +789,28 @@ with main_tab2:
                     # Get zip data directly in memory
                     zip_data = create_results_zip_in_memory(st.session_state.previous_results)
                     st.download_button(
                         label="Download All Results",
                         data=zip_data,
-                        file_name="all_ocr_results.zip",
                         mime="application/zip",
                         help="Download all previous results as a ZIP file containing HTML and JSON files"
                     )
@@ -776,12 +850,12 @@ with main_tab2:
             st.markdown(f"""
             <div class="result-card">
                 <div class="result-header">
-                    <div class="result-filename">{icon} {file_name}</div>
                     <div class="result-date">{result.get('timestamp', 'Unknown')}</div>
                 </div>
                 <div class="result-metadata">
                     <div class="result-tag">Languages: {', '.join(result.get('languages', ['Unknown']))}</div>
-                    <div class="result-tag">Topics: {', '.join(result.get('topics', ['Unknown']))}</div>
                 </div>
             """, unsafe_allow_html=True)
@@ -824,7 +898,34 @@ with main_tab2:
                         st.write(f"**Languages:** {', '.join(languages)}")
                 if 'topics' in selected_result and selected_result['topics']:
-                    st.write(f"**Topics:** {', '.join(selected_result['topics'])}")
             with meta_col2:
                 # Display processing metadata
@@ -870,23 +971,68 @@ with main_tab2:
                     # Try a safer approach with string representation
                     st.code(str(selected_result))
-                # Add JSON download button
                 try:
                     json_str = json.dumps(selected_result, indent=2)
-                    filename = selected_result.get('file_name', 'document').split('.')[0]
                     st.download_button(
                         label="Download JSON",
                         data=json_str,
-                        file_name=f"{filename}_data.json",
                         mime="application/json"
                     )
                 except Exception as e:
                     st.error(f"Error creating JSON download: {str(e)}")
-                    # Fallback to string representation for download
                     st.download_button(
                         label="Download as Text",
                         data=str(selected_result),
-                        file_name=f"{filename}_data.txt",
                         mime="text/plain"
                     )
@@ -924,14 +1070,57 @@ with main_tab2:
                         if page_idx < len(pages_data) - 1:
                             st.markdown("---")
-                    # Add HTML download button if images are available
                     from ocr_utils import create_html_with_images
                     html_content = create_html_with_images(selected_result)
-                    filename = selected_result.get('file_name', 'document').split('.')[0]
                     st.download_button(
                         label="Download as HTML with Images",
                         data=html_content,
-                        file_name=f"{filename}_with_images.html",
                         mime="text/html"
                     )
@@ -1092,7 +1281,7 @@ with main_tab1:
                             progress_bar.progress(40)
                             try:
-                                # Step 1: Process without custom prompt to get OCR text
                                 processor = StructuredOCR()
                                 # First save the PDF to a temp file
@@ -1100,53 +1289,60 @@ with main_tab1:
                                     tmp.write(uploaded_file.getvalue())
                                     temp_path = tmp.name
-                                # Process with NO custom prompt first
                                 # Apply PDF rotation if specified
                                 pdf_rotation_value = pdf_rotation if 'pdf_rotation' in locals() else 0
-                                base_result = processor.process_file(
                                     file_path=temp_path,
                                     file_type="pdf",
                                     use_vision=use_vision,
-                                    custom_prompt=None,  # No custom prompt in first step
                                     file_size_mb=len(uploaded_file.getvalue()) / (1024 * 1024),
-                                    pdf_rotation=pdf_rotation_value  # Pass rotation value to processor
                                 )
-                                progress_bar.progress(70)
-                                status_text.markdown('<div class="processing-status-container">Applying custom analysis to extracted text...</div>', unsafe_allow_html=True)
-                                # Step 2: Apply custom prompt to the extracted text using text-only LLM
-                                if 'ocr_contents' in base_result and isinstance(base_result['ocr_contents'], dict):
-                                    # Get text from OCR result
-                                    ocr_text = ""
-                                    for section, content in base_result['ocr_contents'].items():
-                                        if isinstance(content, str):
-                                            ocr_text += content + "\n\n"
-                                        elif isinstance(content, list):
-                                            for item in content:
-                                                if isinstance(item, str):
-                                                    ocr_text += item + "\n"
-                                            ocr_text += "\n"
-                                    # Format the custom prompt for text-only processing
-                                    formatted_prompt = f"USER INSTRUCTIONS: {custom_prompt.strip()}\nPay special attention to these instructions and respond accordingly."
-                                    # Apply custom prompt to extracted text
-                                    enhanced_result = processor._extract_structured_data_text_only(ocr_text, uploaded_file.name, formatted_prompt)
-                                    # Merge results, keeping images from base_result
-                                    result = base_result.copy()
-                                    result['custom_prompt_applied'] = 'text_only'
-                                    # Update with enhanced analysis results, preserving image data
-                                    for key, value in enhanced_result.items():
-                                        if key not in ['raw_response_data', 'pages_data', 'has_images']:
-                                            result[key] = value
-                                else:
-                                    # If no OCR content, just use the base result
-                                    result = base_result
-                                    result['custom_prompt_applied'] = 'failed'
                                 # Clean up temp file
                                 if os.path.exists(temp_path):
@@ -1183,8 +1379,21 @@ with main_tab1:
                         # Initialize OCR processor and process with custom prompt
                         processor = StructuredOCR()
-                        # Format the custom prompt to ensure it has an impact
-                        formatted_prompt = f"USER INSTRUCTIONS: {custom_prompt.strip()}\nPay special attention to these instructions and respond accordingly."
                         try:
                             result = processor.process_file(
@@ -1238,15 +1447,39 @@ with main_tab1:
                         if languages:
                             metadata_html += f'<p><strong>Languages:</strong> {", ".join(languages)}</p>'
-                    # Topics
                     if 'topics' in result and result['topics']:
-                        metadata_html += f'<p><strong>Topics:</strong> {", ".join(result["topics"])}</p>'
                     # Processing time
                     if 'processing_time' in result:
                         proc_time = result['processing_time']
                         metadata_html += f'<p><strong>Processing Time:</strong> {proc_time:.1f}s</p>'
                     # Close the metadata card
                     metadata_html += '</div>'
@@ -1664,16 +1897,35 @@ with main_tab1:
                                 </html>
                                 """
-                                # Get original filename without extension
                                 original_name = Path(result.get('file_name', uploaded_file.name)).stem
                                 # Add download button as an expander to prevent page reset
                                 with st.expander("Download Document with Images"):
                                     st.markdown("Click the button below to download the document with embedded images")
                                     st.download_button(
                                         label="Download as HTML",
                                         data=download_html,
-                                        file_name=f"{original_name}_with_images.html",
                                         mime="text/html",
                                         key="download_with_images_button"
                                     )
@@ -1696,6 +1948,144 @@ with main_tab1:
                 result_copy = result.copy()
                 result_copy['timestamp'] = datetime.now().strftime("%Y-%m-%d %H:%M")
                 # Add to session state, keeping the most recent 20 results
                 st.session_state.previous_results.insert(0, result_copy)
                 if len(st.session_state.previous_results) > 20:
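
The added side of this diff replaces the period and purpose selectors with a document type selector and a layout selector, and composes the default instruction text from both through a chain of `if`/`elif` branches. The same composition rule can be sketched as a table lookup. This is an illustrative rewrite, not the code in the commit: `build_prompt`, `DOC_TYPE_PROMPTS`, and `LAYOUT_PROMPTS` are hypothetical names, and only two of the nine document types are listed to keep the sketch short (the prompt strings themselves are copied from the diff).

```python
AUTO_DETECT = "Auto-detect (standard processing)"

# Abbreviated: only two of the nine document types from the diff are listed here.
DOC_TYPE_PROMPTS = {
    "Recipe": (
        "This is a recipe. Extract title, ingredients list with measurements, and "
        "preparation instructions. Maintain the distinction between ingredients and "
        "preparation steps."
    ),
    "Table or Spreadsheet": (
        "This is a table/spreadsheet. Preserve row and column structure, maintaining "
        "alignment of data. Extract headers and all cell values."
    ),
}

# Each layout maps to (standalone sentence, short suffix appended to a type prompt).
LAYOUT_PROMPTS = {
    "Multiple columns": (
        "Document has multiple columns. Read each column from top to bottom, "
        "then move to the next column.",
        " Document has multiple columns.",
    ),
    "Table/grid format": (
        "Document contains table data. Preserve row and column structure during extraction.",
        " Contains table/grid formatting.",
    ),
    "Mixed layout with images": (
        "Document has mixed text layout with images. Extract text in proper reading order.",
        " Has mixed text layout with images.",
    ),
}


def build_prompt(doc_type: str, layout: str) -> str:
    """Compose the default instruction text from a document type and a layout."""
    prompt = DOC_TYPE_PROMPTS.get(doc_type, "") if doc_type != AUTO_DETECT else ""
    if layout in LAYOUT_PROMPTS:
        standalone, suffix = LAYOUT_PROMPTS[layout]
        # A layout alone yields the full sentence; with a type prompt only the suffix is added.
        prompt = prompt + suffix if prompt else standalone
    return prompt
```

Because the sketch is abbreviated, a document type missing from `DOC_TYPE_PROMPTS` falls back to an empty prompt, which differs from the commit, where every selector option has its own branch.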

                 preprocessing_options.get("document_type", "standard") != "standard"
             )
+            # Add document type hints to custom prompt if available from document type selector - with safety checks
+            if ('custom_prompt' in locals() and custom_prompt and
+                'selected_doc_type' in locals() and selected_doc_type != "Auto-detect (standard processing)" and
+                "This is a" not in str(custom_prompt)):
+                # Extract just the document type from the selector
+                doc_type_hint = selected_doc_type.split(" or ")[0].lower()
+                # Prepend to the custom prompt
+                custom_prompt = f"This is a {doc_type_hint}. {custom_prompt}"
             if has_preprocessing:
                 status_text.markdown('<div class="processing-status-container">Applying image preprocessing...</div>', unsafe_allow_html=True)
                 progress_bar.progress(20)
             cache_key = f"{file_hash}_{file_type}_{use_vision}_{pdf_rotation_value}"
             progress_bar.progress(50)
+            # Check if we have custom instructions
+            has_custom_prompt = 'custom_prompt' in locals() and custom_prompt and len(str(custom_prompt).strip()) > 0
+            if has_custom_prompt:
+                status_text.markdown('<div class="processing-status-container">Processing document with custom instructions...</div>', unsafe_allow_html=True)
+            else:
+                status_text.markdown('<div class="processing-status-container">Processing document with OCR...</div>', unsafe_allow_html=True)
             # Process the file using cached function if possible
             try:
     # Add spacing between sections
     st.markdown("<div style='margin: 10px 0;'></div>", unsafe_allow_html=True)
+    # Document Processing section
+    st.markdown("##### OCR Instructions", help="Optimize text extraction")
+    # Document type selector
+    document_types = [
+        "Auto-detect (standard processing)",
+        "Newspaper or Magazine",
+        "Letter or Correspondence",
+        "Book or Publication",
+        "Form or Legal Document",
+        "Recipe",
+        "Handwritten Document",
+        "Map or Illustration",
+        "Table or Spreadsheet",
+        "Other (specify in instructions)"
     ]
+    selected_doc_type = st.selectbox(
+        "Document Type",
+        options=document_types,
         index=0,
+        help="Select document type to optimize OCR processing for specific document formats and layouts. For documents with specialized features, also provide details in the instructions field below."
     )
+    # Document layout selector
+    document_layouts = [
+        "Standard layout",
+        "Multiple columns",
+        "Table/grid format",
+        "Mixed layout with images"
     ]
+    selected_layout = st.selectbox(
+        "Document Layout",
+        options=document_layouts,
         index=0,
+        help="Select the document's text layout for better OCR"
     )
+    # Generate dynamic prompt based on both document type and layout
     custom_prompt_text = ""
+    # First add document type specific instructions (simplified)
+    if selected_doc_type != "Auto-detect (standard processing)":
+        if selected_doc_type == "Newspaper or Magazine":
+            custom_prompt_text = "This is a newspaper/magazine. Process columns from top to bottom, capture headlines, bylines, article text and captions."
+        elif selected_doc_type == "Letter or Correspondence":
+            custom_prompt_text = "This is a letter/correspondence. Capture letterhead, date, greeting, body, closing and signature. Note any handwritten annotations."
+        elif selected_doc_type == "Book or Publication":
+            custom_prompt_text = "This is a book/publication. Extract titles, headers, footnotes, page numbers and body text. Preserve paragraph structure and any special formatting."
+        elif selected_doc_type == "Form or Legal Document":
+            custom_prompt_text = "This is a form/legal document. Extract all field labels and values, preserving the structure. Pay special attention to signature lines, dates, and any official markings."
+        elif selected_doc_type == "Recipe":
+            custom_prompt_text = "This is a recipe. Extract title, ingredients list with measurements, and preparation instructions. Maintain the distinction between ingredients and preparation steps."
+        elif selected_doc_type == "Handwritten Document":
+            custom_prompt_text = "This is a handwritten document. Carefully transcribe all handwritten text, preserving line breaks. Note any unclear sections or annotations."
+        elif selected_doc_type == "Map or Illustration":
+            custom_prompt_text = "This is a map or illustration. Transcribe all labels, legends, captions, and annotations. Note any scale indicators or directional markings."
+        elif selected_doc_type == "Table or Spreadsheet":
+            custom_prompt_text = "This is a table/spreadsheet. Preserve row and column structure, maintaining alignment of data. Extract headers and all cell values."
+        elif selected_doc_type == "Other (specify in instructions)":
+            custom_prompt_text = "Please describe the document type and any special processing requirements here."
+    # Then add layout specific instructions if needed
+    if selected_layout != "Standard layout" and not custom_prompt_text:
+        if selected_layout == "Multiple columns":
+            custom_prompt_text = "Document has multiple columns. Read each column from top to bottom, then move to the next column."
+        elif selected_layout == "Table/grid format":
+            custom_prompt_text = "Document contains table data. Preserve row and column structure during extraction."
+        elif selected_layout == "Mixed layout with images":
+            custom_prompt_text = "Document has mixed text layout with images. Extract text in proper reading order."
+    # If both document type and non-standard layout are selected, add layout info
+    elif selected_layout != "Standard layout" and custom_prompt_text:
+        if selected_layout == "Multiple columns":
+            custom_prompt_text += " Document has multiple columns."
+        elif selected_layout == "Table/grid format":
+            custom_prompt_text += " Contains table/grid formatting."
+        elif selected_layout == "Mixed layout with images":
+            custom_prompt_text += " Has mixed text layout with images."
     # Add spacing between sections
     st.markdown("<div style='margin: 10px 0;'></div>", unsafe_allow_html=True)
     custom_prompt = st.text_area(
+        "Additional OCR Instructions",
         value=custom_prompt_text,
+        placeholder="Example: Small text at bottom needs special attention",
+        height=100,
+        max_chars=300,
         key="custom_analysis_instructions",
+        help="Specify document type and special OCR requirements. Detailed instructions activate Mistral AI's advanced document analysis."
     )
+    # Custom instructions expander
+    with st.expander("Custom Instruction Examples"):
         st.markdown("""
+        **Document Format Instructions:**
+        - "This newspaper has multiple columns - read each column from top to bottom"
+        - "This letter has a formal heading, main body, and signature section at bottom"
+        - "This form has fields with labels and filled-in values that should be paired"
+        - "This recipe has ingredient list at top and preparation steps below"
+        **Special Processing Instructions:**
+        - "Pay attention to footnotes at the bottom of each page"
+        - "Some text is faded - please attempt to reconstruct unclear passages"
+        - "There are handwritten annotations in the margins that should be included"
+        - "Document has table data that should preserve row and column alignment"
+        - "Text continues across pages and should be connected into a single flow"
+        - "This document uses special symbols and mathematical notation"
         """)
     # Add spacing between sections
                     # Get zip data directly in memory
                     zip_data = create_results_zip_in_memory(st.session_state.previous_results)
+                    # Create more informative ZIP filename with timestamp
+                    from datetime import datetime
+                    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+                    # Count document types for a more descriptive filename
+                    pdf_count = sum(1 for r in st.session_state.previous_results if r.get('file_name', '').lower().endswith('.pdf'))
+                    img_count = sum(1 for r in st.session_state.previous_results if r.get('file_name', '').lower().endswith(('.jpg', '.jpeg', '.png')))
+                    # Create more descriptive filename
+                    if pdf_count > 0 and img_count > 0:
+                        zip_filename = f"historical_ocr_mixed_{pdf_count}pdf_{img_count}img_{timestamp}.zip"
+                    elif pdf_count > 0:
+                        zip_filename = f"historical_ocr_pdf_documents_{pdf_count}_{timestamp}.zip"
+                    elif img_count > 0:
+                        zip_filename = f"historical_ocr_images_{img_count}_{timestamp}.zip"
+                    else:
+                        zip_filename = f"historical_ocr_results_{timestamp}.zip"
                     st.download_button(
                         label="Download All Results",
                         data=zip_data,
+                        file_name=zip_filename,
                         mime="application/zip",
                         help="Download all previous results as a ZIP file containing HTML and JSON files"
                     )
             st.markdown(f"""
             <div class="result-card">
                 <div class="result-header">
+                    <div class="result-filename">{icon} {result.get('descriptive_file_name', file_name)}</div>
                     <div class="result-date">{result.get('timestamp', 'Unknown')}</div>
                 </div>
                 <div class="result-metadata">
                     <div class="result-tag">Languages: {', '.join(result.get('languages', ['Unknown']))}</div>
+                    <div class="result-tag">Topics: {', '.join(result.get('topics', ['Unknown'])[:5])} {' + ' + str(len(result.get('topics', [])) - 5) + ' more' if len(result.get('topics', [])) > 5 else ''}</div>
                 </div>
             """, unsafe_allow_html=True)
                         st.write(f"**Languages:** {', '.join(languages)}")
                 if 'topics' in selected_result and selected_result['topics']:
+                    # Show topics in a more organized way with badges
+                    st.markdown("**Subject Tags:**")
+                    # Create a container with flex display for the tags
+                    st.markdown('<div style="display: flex; flex-wrap: wrap; gap: 5px; margin-top: 5px;">', unsafe_allow_html=True)
+                    # Generate a badge for each tag
+                    for topic in selected_result['topics']:
+                        # Create colored badge based on tag category
+                        badge_color = "#546e7a"  # Default color
+                        # Assign colors by category
+                        if any(term in topic.lower() for term in ["century", "pre-", "era", "historical"]):
+                            badge_color = "#1565c0"  # Blue for time periods
+                        elif any(term in topic.lower() for term in ["language", "english", "french", "german", "latin"]):
+                            badge_color = "#00695c"  # Teal for languages
+                        elif any(term in topic.lower() for term in ["letter", "newspaper", "book", "form", "document", "recipe"]):
+                            badge_color = "#6a1b9a"  # Purple for document types
+                        elif any(term in topic.lower() for term in ["travel", "military", "science", "medicine", "education", "art", "literature"]):
+                            badge_color = "#2e7d32"  # Green for subject domains
+                        st.markdown(
+                            f'<span style="background-color: {badge_color}; color: white; padding: 3px 8px; '
+                            f'border-radius: 12px; font-size: 0.85em; display: inline-block; margin-bottom: 5px;">{topic}</span>',
+                            unsafe_allow_html=True
+                        )
+                    # Close the container
+                    st.markdown('</div>', unsafe_allow_html=True)
             with meta_col2:
                 # Display processing metadata
                     # Try a safer approach with string representation
                     st.code(str(selected_result))
+                # Create more informative JSON download button with better naming
                 try:
                     json_str = json.dumps(selected_result, indent=2)
+                    # Use the descriptive filename if available, otherwise build one
+                    if 'descriptive_file_name' in selected_result:
+                        # Get base name without extension
+                        base_filename = Path(selected_result['descriptive_file_name']).stem
+                    else:
+                        # Fall back to old method of building filename
+                        base_filename = selected_result.get('file_name', 'document').split('.')[0]
+                    # Add document type if available
+                    if 'topics' in selected_result and selected_result['topics']:
+                        topic = selected_result['topics'][0].lower().replace(' ', '_')
+                        base_filename = f"{base_filename}_{topic}"
+                    # Add language if available
+                    if 'languages' in selected_result and selected_result['languages']:
+                        lang = selected_result['languages'][0].lower()
+                        # Only add if it's not already in the filename
+                        if lang not in base_filename.lower():
+                            base_filename = f"{base_filename}_{lang}"
+                    # For PDFs, add page information
+                    if 'total_pages' in selected_result and 'processed_pages' in selected_result:
+                        base_filename = f"{base_filename}_p{selected_result['processed_pages']}of{selected_result['total_pages']}"
+                    # Get date from timestamp if available
+                    timestamp = ""
+                    if 'timestamp' in selected_result:
+                        try:
+                            # Try to parse the timestamp and reformat it
+                            from datetime import datetime
+                            dt = datetime.strptime(selected_result['timestamp'], "%Y-%m-%d %H:%M")
+                            timestamp = dt.strftime("%Y%m%d_%H%M%S")
+                        except:
+                            # If parsing fails, create a new timestamp
+                            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+                    else:
+                        # No timestamp in the result, create a new one
+                        from datetime import datetime
+                        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+                    # Create final filename
+                    json_filename = f"{base_filename}_{timestamp}.json"
                     st.download_button(
                         label="Download JSON",
                         data=json_str,
+                        file_name=json_filename,
                         mime="application/json"
                     )
                 except Exception as e:
                     st.error(f"Error creating JSON download: {str(e)}")
+                    # Fallback to string representation for download with simple naming
+                    from datetime import datetime
+                    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
                     st.download_button(
                         label="Download as Text",
                         data=str(selected_result),
+                        file_name=f"document_{timestamp}.txt",
                         mime="text/plain"
                     )
                         if page_idx < len(pages_data) - 1:
                             st.markdown("---")
+                    # Add HTML download button with improved, more descriptive filename
                     from ocr_utils import create_html_with_images
                     html_content = create_html_with_images(selected_result)
+                    # Use the descriptive filename if available, otherwise build one
+                    if 'descriptive_file_name' in selected_result:
+                        # Get base name without extension
+                        base_filename = Path(selected_result['descriptive_file_name']).stem
+                    else:
+                        # Fall back to old method of building filename
+                        base_filename = selected_result.get('file_name', 'document').split('.')[0]
+                    # Add the primary topic if available
+                    if 'topics' in selected_result and selected_result['topics']:
+                        topic = selected_result['topics'][0].lower().replace(' ', '_')
+                        base_filename = f"{base_filename}_{topic}"
+                    # Add language if available
+                    if 'languages' in selected_result and selected_result['languages']:
+                        lang = selected_result['languages'][0].lower()
+                        # Only add if it's not already in the filename
+                        if lang not in base_filename.lower():
+                            base_filename = f"{base_filename}_{lang}"
+                    # For PDFs, add page information
+                    if 'total_pages' in selected_result and 'processed_pages' in selected_result:
+                        base_filename = f"{base_filename}_p{selected_result['processed_pages']}of{selected_result['total_pages']}"
+                    # Get date from timestamp if available
+                    timestamp = ""
+                    if 'timestamp' in selected_result:
+                        try:
+                            # Try to parse the timestamp and reformat it
+                            from datetime import datetime
+                            dt = datetime.strptime(selected_result['timestamp'], "%Y-%m-%d %H:%M")
+                            timestamp = dt.strftime("%Y%m%d_%H%M%S")
+                        except (ValueError, TypeError):
+                            # If parsing fails, create a new timestamp
+                            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+                    else:
+                        # No timestamp in the result, create a new one
+                        from datetime import datetime
+                        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+                    # Create final filename
+                    html_filename = f"{base_filename}_{timestamp}_with_images.html"
                     st.download_button(
                         label="Download as HTML with Images",
                         data=html_content,
+                        file_name=html_filename,
                         mime="text/html"
                     )
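The JSON and HTML download branches above build their filenames with the same sequence of steps (base name, first topic, first language, page range, timestamp). A minimal sketch of how that sequence could be factored into one function follows. `build_download_basename` and its `now` parameter are hypothetical names introduced for illustration and are not part of this commit; the dictionary keys and format strings are the ones used in the diff.

```python
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, Optional


def build_download_basename(result: Dict[str, Any], now: Optional[datetime] = None) -> str:
    """Hypothetical helper: returns '<base>[_topic][_lang][_pXofY]_<timestamp>'."""
    # Prefer the descriptive name stored with the result, else the original file name.
    if "descriptive_file_name" in result:
        base = Path(result["descriptive_file_name"]).stem
    else:
        base = result.get("file_name", "document").split(".")[0]

    # Append the first topic, normalised for use in a filename.
    topics = result.get("topics") or []
    if topics:
        base = f"{base}_{topics[0].lower().replace(' ', '_')}"

    # Append the first language unless the name already contains it.
    languages = result.get("languages") or []
    if languages:
        lang = languages[0].lower()
        if lang not in base.lower():
            base = f"{base}_{lang}"

    # For PDFs, record how many pages were processed.
    if "total_pages" in result and "processed_pages" in result:
        base = f"{base}_p{result['processed_pages']}of{result['total_pages']}"

    # Reuse the stored timestamp when it parses, otherwise fall back to "now".
    stamp = now or datetime.now()
    if "timestamp" in result:
        try:
            stamp = datetime.strptime(result["timestamp"], "%Y-%m-%d %H:%M")
        except (ValueError, TypeError):
            pass
    return f"{base}_{stamp.strftime('%Y%m%d_%H%M%S')}"


example = {
    "file_name": "letter.jpg",
    "topics": ["Civil War"],
    "languages": ["English"],
    "timestamp": "2025-04-04 10:30",
}
print(build_download_basename(example) + ".json")
```

With a helper of this shape, each branch would only append its own suffix (`.json` or `_with_images.html`), so the two copies could not drift apart.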
                             progress_bar.progress(40)
                             try:
+                                # Process directly in one step for better performance
                                 processor = StructuredOCR()
                                 # First save the PDF to a temp file
                                     tmp.write(uploaded_file.getvalue())
                                     temp_path = tmp.name
                                 # Apply PDF rotation if specified
                                 pdf_rotation_value = pdf_rotation if 'pdf_rotation' in locals() else 0
+                                # Add document type hints to custom prompt if available from document type selector
+                                if custom_prompt and 'selected_doc_type' in locals() and selected_doc_type != "Auto-detect (standard processing)" and "This is a" not in str(custom_prompt):
+                                    # Extract just the document type from the selector
+                                    doc_type_hint = selected_doc_type.split(" or ")[0].lower()
+                                    # Prepend to the custom prompt
+                                    custom_prompt = f"This is a {doc_type_hint}. {custom_prompt}"
+                                # Process in a single step with simplified custom prompt
+                                if custom_prompt:
+                                    # Detect document type from custom prompt
+                                    doc_type = "general"
+                                    if any(keyword in custom_prompt.lower() for keyword in ["newspaper", "column", "article", "magazine"]):
+                                        doc_type = "newspaper"
+                                    elif any(keyword in custom_prompt.lower() for keyword in ["letter", "correspondence", "handwritten"]):
+                                        doc_type = "letter"
+                                    elif any(keyword in custom_prompt.lower() for keyword in ["book", "publication"]):
+                                        doc_type = "book"
+                                    elif any(keyword in custom_prompt.lower() for keyword in ["form", "certificate", "legal"]):
+                                        doc_type = "form"
+                                    elif any(keyword in custom_prompt.lower() for keyword in ["recipe", "ingredients"]):
+                                        doc_type = "recipe"
+                                    # Format the custom prompt for better Mistral processing
+                                    if len(custom_prompt) > 250:
+                                        # Truncate long custom prompts but preserve essential info
+                                        simplified_prompt = f"DOCUMENT TYPE: {doc_type}\nINSTRUCTIONS: {custom_prompt[:250]}..."
+                                    else:
+                                        simplified_prompt = f"DOCUMENT TYPE: {doc_type}\nINSTRUCTIONS: {custom_prompt}"
+                                else:
+                                    simplified_prompt = custom_prompt
+                                progress_bar.progress(50)
+                                # Check if we have custom instructions
+                                has_custom_prompt = custom_prompt is not None and len(str(custom_prompt).strip()) > 0
+                                if has_custom_prompt:
+                                    status_text.markdown('<div class="processing-status-container">Processing PDF with custom instructions...</div>', unsafe_allow_html=True)
+                                else:
+                                    status_text.markdown('<div class="processing-status-container">Processing PDF with optimized settings...</div>', unsafe_allow_html=True)
+                                # Process directly with optimized settings
+                                result = processor.process_file(
                                     file_path=temp_path,
                                     file_type="pdf",
                                     use_vision=use_vision,
+                                    custom_prompt=simplified_prompt,
                                     file_size_mb=len(uploaded_file.getvalue()) / (1024 * 1024),
+                                    pdf_rotation=pdf_rotation_value
                                 )
+                                progress_bar.progress(90)
+                                status_text.markdown('<div class="processing-status-container">Finalizing results...</div>', unsafe_allow_html=True)
                                 # Clean up temp file
                                 if os.path.exists(temp_path):
                         # Initialize OCR processor and process with custom prompt
                         processor = StructuredOCR()
+                        # Detect document type from custom prompt
+                        doc_type = "general"
+                        if any(keyword in custom_prompt.lower() for keyword in ["newspaper", "column", "article", "magazine"]):
+                            doc_type = "newspaper"
+                        elif any(keyword in custom_prompt.lower() for keyword in ["letter", "correspondence", "handwritten"]):
+                            doc_type = "letter"
+                        elif any(keyword in custom_prompt.lower() for keyword in ["book", "publication"]):
+                            doc_type = "book"
+                        elif any(keyword in custom_prompt.lower() for keyword in ["form", "certificate", "legal"]):
+                            doc_type = "form"
+                        elif any(keyword in custom_prompt.lower() for keyword in ["recipe", "ingredients"]):
+                            doc_type = "recipe"
+                        # Format the custom prompt for better Mistral processing
+                        formatted_prompt = f"DOCUMENT TYPE: {doc_type}\nUSER INSTRUCTIONS: {custom_prompt.strip()}\nPay special attention to these instructions and respond accordingly."
                         try:
                             result = processor.process_file(
                         if languages:
                             metadata_html += f'<p><strong>Languages:</strong> {", ".join(languages)}</p>'
+                    # Topics - show at most 8 subject tags
                     if 'topics' in result and result['topics']:
+                        topics_display = result['topics'][:8]
+                        topics_str = ", ".join(topics_display)
+                        # Add indicator if there are more tags
+                        if len(result['topics']) > 8:
+                            topics_str += f" + {len(result['topics']) - 8} more"
+                        metadata_html += f'<p><strong>Subject Tags:</strong> {topics_str}</p>'
+                    # Document type - using simplified labeling consistent with user instructions
+                    if 'detected_document_type' in result:
+                        # Get clean document type label - removing "historical" prefix if present
+                        doc_type = result['detected_document_type'].lower()
+                        if doc_type.startswith("historical "):
+                            doc_type = doc_type[len("historical "):]
+                        # Capitalize first letter of each word for display
+                        doc_type = ' '.join(word.capitalize() for word in doc_type.split())
+                        metadata_html += f'<p><strong>Document Type:</strong> {doc_type}</p>'
                     # Processing time
                     if 'processing_time' in result:
                         proc_time = result['processing_time']
                         metadata_html += f'<p><strong>Processing Time:</strong> {proc_time:.1f}s</p>'
+                    # Custom prompt indicator with special styling - simplified and only showing when there are actual instructions
+                    # Only show when custom_prompt exists in the session AND has content, or when the result explicitly states it was applied
+                    has_instructions = ('custom_prompt' in locals() and custom_prompt and len(str(custom_prompt).strip()) > 0)
+                    if has_instructions or 'custom_prompt_applied' in result:
+                        # Use a simpler message that just shows custom instructions were applied
+                        metadata_html += '<p style="margin-top:10px; padding:5px 8px; background-color:#f0f8ff; border-left:3px solid #4ba3e3; border-radius:3px; color:#333;"><strong>Advanced Analysis:</strong> Custom instructions applied</p>'
                     # Close the metadata card
                     metadata_html += '</div>'
                                 </html>
                                 """
+                                # Create a more descriptive filename
                                 original_name = Path(result.get('file_name', uploaded_file.name)).stem
+                                # Add the primary topic if available
+                                if 'topics' in result and result['topics']:
+                                    topic = result['topics'][0].lower().replace(' ', '_')
+                                    original_name = f"{original_name}_{topic}"
+                                # Add language if available
+                                if 'languages' in result and result['languages']:
+                                    lang = result['languages'][0].lower()
+                                    # Only add if it's not already in the filename
+                                    if lang not in original_name.lower():
+                                        original_name = f"{original_name}_{lang}"
+                                # Get current date for uniqueness
+                                from datetime import datetime
+                                timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+                                # Create final filename
+                                download_filename = f"{original_name}_{timestamp}_with_images.html"
                                 # Add download button as an expander to prevent page reset
                                 with st.expander("Download Document with Images"):
                                     st.markdown("Click the button below to download the document with embedded images")
                                     st.download_button(
                                         label="Download as HTML",
                                         data=download_html,
+                                        file_name=download_filename,
                                         mime="text/html",
                                         key="download_with_images_button"
                                     )
                 result_copy = result.copy()
                 result_copy['timestamp'] = datetime.now().strftime("%Y-%m-%d %H:%M")
+                # Generate more descriptive file name for the result
+                original_name = Path(result.get('file_name', uploaded_file.name)).stem
+                # Extract subject tags from content
+                subject_tags = []
+                # First check if we already have topics in the result
+                if 'topics' in result and result['topics'] and len(result['topics']) >= 3:
+                    subject_tags = result['topics']
+                else:
+                    # Generate tags based on document content
+                    try:
+                        # Extract text from OCR contents
+                        raw_text = ""
+                        if 'ocr_contents' in result:
+                            if 'raw_text' in result['ocr_contents']:
+                                raw_text = result['ocr_contents']['raw_text']
+                            elif 'content' in result['ocr_contents']:
+                                raw_text = result['ocr_contents']['content']
+                        # Use existing topics as starting point if available
+                        if 'topics' in result and result['topics']:
+                            subject_tags = list(result['topics'])
+                        # Add document type if detected
+                        if 'detected_document_type' in result:
+                            doc_type = result['detected_document_type'].capitalize()
+                            if doc_type not in subject_tags:
+                                subject_tags.append(doc_type)
+                        # Analyze content for common themes based on keywords
+                        content_themes = {
+                            "Historical": ["century", "ancient", "historical", "history", "vintage", "archive", "heritage"],
+                            "Travel": ["travel", "journey", "expedition", "exploration", "voyage", "map", "location"],
+                            "Science": ["experiment", "research", "study", "analysis", "scientific", "laboratory"],
+                            "Literature": ["book", "novel", "poetry", "author", "literary", "chapter", "story"],
+                            "Art": ["painting", "illustration", "drawing", "artist", "exhibit", "gallery", "portrait"],
+                            "Education": ["education", "school", "university", "college", "learning", "student", "teach"],
+                            "Politics": ["government", "political", "policy", "administration", "election", "legislature"],
+                            "Business": ["business", "company", "corporation", "market", "industry", "commercial", "trade"],
+                            "Social": ["society", "community", "social", "culture", "tradition", "customs"],
+                            "Technology": ["technology", "invention", "device", "mechanical", "machine", "technical"],
+                            "Military": ["military", "army", "navy", "war", "battle", "soldier", "weapon"],
+                            "Religion": ["religion", "church", "temple", "spiritual", "sacred", "ritual"],
+                            "Medicine": ["medical", "medicine", "health", "hospital", "treatment", "disease", "doctor"],
+                            "Legal": ["legal", "law", "court", "justice", "attorney", "judicial", "statute"],
+                            "Correspondence": ["letter", "mail", "correspondence", "message", "communication"]
+                        }
+                        # Search for keywords in content
+                        if raw_text:
+                            raw_text_lower = raw_text.lower()
+                            for theme, keywords in content_themes.items():
+                                if any(keyword in raw_text_lower for keyword in keywords):
+                                    if theme not in subject_tags:
+                                        subject_tags.append(theme)
+                        # Add document period tag if date patterns are detected
+                        if raw_text:
+                            # Look for years in content
+                            import re
+                            year_matches = re.findall(r'\b1[0-9]{3}\b|\b20[0-1][0-9]\b', raw_text)
+                            if year_matches:
+                                # Convert to integers
+                                years = [int(y) for y in year_matches]
+                                # Use the earliest year found
+                                earliest = min(years)
+                                # Add period tag based on earliest year
+                                if earliest < 1800:
+                                    period_tag = "Pre-1800s"
+                                elif earliest < 1850:
+                                    period_tag = "Early 19th Century"
+                                elif earliest < 1900:
+                                    period_tag = "Late 19th Century"
+                                elif earliest < 1950:
+                                    period_tag = "Early 20th Century"
+                                else:
+                                    period_tag = "Modern Era"
+                                if period_tag not in subject_tags:
+                                    subject_tags.append(period_tag)
+                        # Add languages as topics if available
+                        if 'languages' in result and result['languages']:
+                            for lang in result['languages']:
+                                lang_tag = f"{lang} Language"
+                                if lang and lang not in subject_tags and lang_tag not in subject_tags:
+                                    subject_tags.append(lang_tag)
+                    except Exception as e:
+                        logger.warning(f"Error generating subject tags: {str(e)}")
+                        # Fallback tags if extraction fails
+                        if not subject_tags:
+                            subject_tags = ["Document", "Historical", "Text"]
+                # Ensure we have at least 3 tags
+                while len(subject_tags) < 3:
+                    if "Document" not in subject_tags:
+                        subject_tags.append("Document")
+                    elif "Historical" not in subject_tags:
+                        subject_tags.append("Historical")
+                    elif "Text" not in subject_tags:
+                        subject_tags.append("Text")
+                    else:
+                        # If we still need tags, add generic ones
+                        generic_tags = ["Archive", "Content", "Record"]
+                        for tag in generic_tags:
+                            if tag not in subject_tags:
+                                subject_tags.append(tag)
+                                break
+                # Update the result with enhanced tags
+                result_copy['topics'] = subject_tags
+                # Create a more descriptive file name
+                file_type = Path(result.get('file_name', uploaded_file.name)).suffix.lower()
+                doc_type_tag = ""
+                # Add document type to filename if detected
+                if 'detected_document_type' in result:
+                    doc_type = result['detected_document_type'].lower().replace(' ', '_')
+                    doc_type_tag = f"_{doc_type}"
+                elif len(subject_tags) > 0:
+                    # Use first tag as document type if not explicitly detected
+                    doc_type_tag = f"_{subject_tags[0].lower().replace(' ', '_')}"
+                # Add period tag for historical context if available
+                period_tag = ""
+                for tag in subject_tags:
+                    if "century" in tag.lower() or "pre-" in tag.lower() or "era" in tag.lower():
+                        period_tag = f"_{tag.lower().replace(' ', '_')}"
+                        break
+                # Generate final descriptive file name
+                descriptive_name = f"{original_name}{doc_type_tag}{period_tag}{file_type}"
+                result_copy['descriptive_file_name'] = descriptive_name
                 # Add to session state, keeping the most recent 20 results
                 st.session_state.previous_results.insert(0, result_copy)
                 if len(st.session_state.previous_results) > 20:
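The subject-tag hunk above derives a period tag from the earliest year found in the OCR text and matches themes by keyword. The sketch below isolates those two steps so they can be tried on sample strings. `period_tag_for` and `match_themes` are hypothetical names, not functions from this commit. The year regex and the period boundaries are copied from the diff; the whole-word matching in `match_themes` is a swapped-in variant, since the diff itself uses plain substring checks.

```python
import re
from typing import Dict, List, Optional

# Same pattern as in the diff: any 1000-1999 or 2000-2019 four-digit number.
YEAR_PATTERN = re.compile(r"\b1[0-9]{3}\b|\b20[0-1][0-9]\b")


def period_tag_for(text: str) -> Optional[str]:
    """Hypothetical helper: map the earliest year-like number in text to a period tag."""
    years = [int(y) for y in YEAR_PATTERN.findall(text)]
    if not years:
        return None
    earliest = min(years)
    if earliest < 1800:
        return "Pre-1800s"
    if earliest < 1850:
        return "Early 19th Century"
    if earliest < 1900:
        return "Late 19th Century"
    if earliest < 1950:
        return "Early 20th Century"
    return "Modern Era"


def match_themes(text: str, themes: Dict[str, List[str]]) -> List[str]:
    """Hypothetical variant: whole-word keyword matching instead of substring checks."""
    lowered = text.lower()
    found = []
    for theme, keywords in themes.items():
        pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
        if re.search(pattern, lowered):
            found.append(theme)
    return found


print(period_tag_for("Printed in 1862, reissued 1901"))
print(match_themes("We walked toward the church", {"Military": ["war"], "Religion": ["church"]}))
```

Two behaviours are worth keeping in mind when reading the hunk. The year pattern accepts any four-digit number beginning with 1, so a street or catalogue number such as 1204 would pull the tag back to "Pre-1800s". Substring matching also lets "war" match inside "toward"; whole-word matching avoids that, but it would stop stem-style keywords such as "teach" from matching "teacher".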

config.py CHANGED Viewed

@@ -19,7 +19,7 @@ load_dotenv()
 # 2. MISTRAL_API_KEY environment var (standard environment variable)
 # 3. Empty string (will show warning in app)
 MISTRAL_API_KEY = os.environ.get("HF_MISTRAL_API_KEY",
-                  os.environ.get("MISTRAL_API_KEY", "")).strip()
 # Check if we're in test mode (allows operation without valid API key)
 # Set to False to use actual API calls
@@ -35,7 +35,7 @@ if TEST_MODE:
 # Model settings with fallbacks
 OCR_MODEL = os.environ.get("MISTRAL_OCR_MODEL", "mistral-ocr-latest")
 TEXT_MODEL = os.environ.get("MISTRAL_TEXT_MODEL", "mistral-small-latest")  # Updated from ministral-8b-latest
-VISION_MODEL = os.environ.get("MISTRAL_VISION_MODEL", "mistral-large-latest")  # Updated from pixtral-12b-latest
 # Image preprocessing settings optimized for historical documents
 # These can be customized from environment variables
@@ -48,11 +48,11 @@ IMAGE_PREPROCESSING = {
     "compression_quality": int(os.environ.get("COMPRESSION_QUALITY", "95"))  # Higher quality for better OCR results
 }
-# OCR settings optimized for reliability and performance
 OCR_SETTINGS = {
-    "timeout_ms": int(os.environ.get("OCR_TIMEOUT_MS", "120000")),        # Extended timeout for larger documents
-    "max_retries": int(os.environ.get("OCR_MAX_RETRIES", "3")),           # Increased retry attempts for better reliability
-    "retry_delay": int(os.environ.get("OCR_RETRY_DELAY", "2")),           # Longer initial retry delay for better success rate
     "include_image_base64": os.environ.get("INCLUDE_IMAGE_BASE64", "True").lower() in ("true", "1", "yes"),
-    "thread_count": int(os.environ.get("OCR_THREAD_COUNT", "4"))          # Thread count for parallel processing
 }

 # 2. MISTRAL_API_KEY environment var (standard environment variable)
 # 3. Empty string (will show warning in app)
 MISTRAL_API_KEY = os.environ.get("HF_MISTRAL_API_KEY",
+                  os.environ.get("MISTRAL_API_KEY", "")).strip()
 # Check if we're in test mode (allows operation without valid API key)
 # Set to False to use actual API calls
 # Model settings with fallbacks
 OCR_MODEL = os.environ.get("MISTRAL_OCR_MODEL", "mistral-ocr-latest")
 TEXT_MODEL = os.environ.get("MISTRAL_TEXT_MODEL", "mistral-small-latest")  # Updated from ministral-8b-latest
+VISION_MODEL = os.environ.get("MISTRAL_VISION_MODEL", "mistral-small-latest")  # Using faster model that supports vision
 # Image preprocessing settings optimized for historical documents
 # These can be customized from environment variables
     "compression_quality": int(os.environ.get("COMPRESSION_QUALITY", "95"))  # Higher quality for better OCR results
 }
+# OCR settings optimized for single-page performance
 OCR_SETTINGS = {
+    "timeout_ms": int(os.environ.get("OCR_TIMEOUT_MS", "45000")),         # Shorter timeout for single pages (45 seconds)
+    "max_retries": int(os.environ.get("OCR_MAX_RETRIES", "2")),           # Fewer retries to avoid rate-limiting
+    "retry_delay": int(os.environ.get("OCR_RETRY_DELAY", "1")),           # Shorter initial retry delay for faster execution
     "include_image_base64": os.environ.get("INCLUDE_IMAGE_BASE64", "True").lower() in ("true", "1", "yes"),
+    "thread_count": int(os.environ.get("OCR_THREAD_COUNT", "2"))          # Lower thread count to prevent API rate limiting
 }
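The comments in `config.py` describe a three-step fallback for the API key that ends in an empty string, plus integer settings read from environment variables. A minimal sketch of that lookup order follows, written against an injectable mapping so it can be exercised without touching the real environment. `resolve_api_key` and `env_int` are hypothetical helpers, not part of this commit. Unlike the nested `os.environ.get` calls in the diff, this sketch also falls through when a variable is set but blank, and it returns the default when an integer setting does not parse.

```python
import os
from typing import Mapping

# Variable names taken from config.py, highest priority first.
API_KEY_VARS = ("HF_MISTRAL_API_KEY", "MISTRAL_API_KEY")


def resolve_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Hypothetical helper: first non-blank key wins; empty string if none is set."""
    for name in API_KEY_VARS:
        value = env.get(name, "").strip()
        if value:
            return value
    return ""  # The app is expected to warn when no key is configured.


def env_int(name: str, default: int, env: Mapping[str, str] = os.environ) -> int:
    """Hypothetical helper: read an integer setting, keeping the default on bad input."""
    raw = env.get(name)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        return default


sample_env = {"MISTRAL_API_KEY": "  example-key  ", "OCR_MAX_RETRIES": "2"}
print(resolve_api_key(sample_env))
print(env_int("OCR_TIMEOUT_MS", 45000, sample_env))
```

Keeping the final fallback empty means a missing key surfaces as a visible warning rather than a credential stored in the repository.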

ocr_utils.py CHANGED Viewed

@@ -31,6 +31,7 @@ except ImportError as e:
         CV2_AVAILABLE = False
 from mistralai import DocumentURLChunk, ImageURLChunk, TextChunk
 # Import configuration
 try:
@@ -198,18 +199,46 @@ def create_results_zip_in_memory(results):
             # Handle list of results
             for i, result in enumerate(results):
                 try:
-                    # Add JSON results for each file
                     result_json = json.dumps(result, indent=2)
-                    zipf.writestr(f"results_{i+1}.json", result_json)
                     # Add HTML content (generated from the result)
                     html_content = create_html_with_images(result)
-                    filename = result.get('file_name', f'document_{i+1}').split('.')[0]
-                    zipf.writestr(f"{filename}_with_images.html", html_content)
                     # Add raw OCR text if available
                     if "ocr_contents" in result and "raw_text" in result["ocr_contents"]:
-                        zipf.writestr(f"ocr_text_{i+1}.txt", result["ocr_contents"]["raw_text"])
                     # Add HTML visualization if available
                     if "html_visualization" in result:
@@ -237,18 +266,52 @@ def create_results_zip_in_memory(results):
         else:
             # Handle single result
             try:
-                # Add JSON results
                 results_json = json.dumps(results, indent=2)
-                zipf.writestr("results.json", results_json)
-                # Add HTML content
                 html_content = create_html_with_images(results)
-                filename = results.get('file_name', 'document').split('.')[0]
-                zipf.writestr(f"{filename}_with_images.html", html_content)
                 # Add raw OCR text if available
                 if "ocr_contents" in results and "raw_text" in results["ocr_contents"]:
-                    zipf.writestr("ocr_text.txt", results["ocr_contents"]["raw_text"])
                 # Add HTML visualization if available
                 if "html_visualization" in results:
@@ -305,19 +368,47 @@ def create_results_zip(results, output_dir=None, zip_name=None):
     # Generate zip name if not provided
     if zip_name is None:
         if is_list:
-            # For list of results, use timestamp and generic name
-            timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
-            zip_name = f"ocr-results_{timestamp}.zip"
-        else:
-            # For single result, use original file's info
-            # Check if processed_at exists, otherwise use current timestamp
-            if "processed_at" in results:
-                timestamp = results.get("processed_at", "").replace(":", "-").replace(".", "-")
             else:
-                timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
-            file_name = results.get("file_name", "ocr-results")
-            zip_name = f"{file_name}_{timestamp}.zip"
     try:
         # Get zip data in memory first
@@ -343,6 +434,7 @@ def create_results_zip(results, output_dir=None, zip_name=None):
 def preprocess_image_for_ocr(image_path: Union[str, Path]) -> Tuple[Image.Image, str]:
     """
     Preprocess an image for optimal OCR performance with enhanced speed and memory optimization.
     Args:
         image_path: Path to the image file
@@ -406,6 +498,27 @@ def preprocess_image_for_ocr(image_path: Union[str, Path]) -> Tuple[Image.Image,
                 preprocess_image_for_ocr._cache[cache_key] = result
                 return result
     except Exception as e:
         # If stat or cache handling fails, log and continue with processing
@@ -416,6 +529,9 @@ def preprocess_image_for_ocr(image_path: Union[str, Path]) -> Tuple[Image.Image,
         except:
             file_size_mb = 0  # Default if we can't determine size
     try:
         # Process start time for performance logging
         start_time = time.time()
@@ -432,25 +548,73 @@ def preprocess_image_for_ocr(image_path: Union[str, Path]) -> Tuple[Image.Image,
             # Detect document type only for medium to large images to save processing time
             is_document = False
             if image_area > 500000:  # Approx 700x700 or larger
                 # Store image for document detection
                 _detect_document_type_impl._current_img = img
                 is_document = _detect_document_type_impl(None)
-                logger.debug(f"Document type detection for {image_file.name}: {'document' if is_document else 'photo'}")
-            # Resize large images for API efficiency
-            if file_size_mb > IMAGE_PREPROCESSING["max_size_mb"] or max(width, height) > 3000:
                 # Calculate target dimensions directly instead of using the heavier resize function
                 target_width, target_height = width, height
                 max_dimension = max(width, height)
                 # Use a sliding scale for reduction based on image size
                 if max_dimension > 5000:
-                    scale_factor = 0.25  # Aggressive reduction for very large images
                 elif max_dimension > 3000:
-                    scale_factor = 0.4   # Significant reduction for large images
                 else:
-                    scale_factor = 0.6   # Moderate reduction for medium images
                 # Calculate new dimensions
                 new_width = int(width * scale_factor)
@@ -556,7 +720,7 @@ def _detect_document_type_impl(img_hash=None) -> bool:
     Optimized implementation of document type detection for faster processing.
     The img_hash parameter is unused but kept for backward compatibility.
-    Enhanced to better detect handwritten documents.
     """
     # Fast path: Get the image stashed on the function attribute (not truly thread-local)
     if not hasattr(_detect_document_type_impl, "_current_img"):
@@ -677,7 +841,7 @@ def preprocess_document_image(img: Image.Image) -> Image.Image:
 def _preprocess_document_image_impl() -> Image.Image:
     """
     Optimized implementation of document preprocessing with adaptive processing based on image size.
-    Enhanced for better handwritten document processing.
     """
     # Fast path: Get the image stashed on the function attribute (not truly thread-local)
     if not hasattr(preprocess_document_image, "_current_img"):
@@ -689,28 +853,113 @@ def _preprocess_document_image_impl() -> Image.Image:
     width, height = img.size
     img_size = width * height
-    # Check if the image might be a handwritten document - use special processing
     is_handwritten = False
-    try:
-        # Simple check for handwritten document characteristics
-        # Handwritten documents often have more varied strokes and less stark contrast
         if CV2_AVAILABLE:
-            # Convert to grayscale and calculate local variance
-            gray_np = np.array(img.convert('L'))
-            # Higher variance in edge strengths can indicate handwriting
-            edges = cv2.Canny(gray_np, 30, 100)
-            if np.count_nonzero(edges) / edges.size > 0.02:  # Low edge threshold for handwriting
-                # Additional check with gradient magnitudes
-                sobelx = cv2.Sobel(gray_np, cv2.CV_64F, 1, 0, ksize=3)
-                sobely = cv2.Sobel(gray_np, cv2.CV_64F, 0, 1, ksize=3)
-                magnitude = np.sqrt(sobelx**2 + sobely**2)
-                # Handwriting typically has more variation in gradient magnitudes
-                if np.std(magnitude) > 20:
-                    is_handwritten = True
-    except:
-        # If detection fails, assume it's not handwritten
-        pass
     # Ultra-fast path for tiny images - just convert to grayscale with contrast enhancement
     if img_size < 300000:  # ~500x600 or smaller
         gray = img.convert('L')
@@ -996,9 +1245,9 @@ def resize_image_impl(target_dpi: int = 300) -> Image.Image:
     width, height = img.size
     # Fixed target dimensions based on DPI
-    # Using 8.5x11 inches (standard paper size) as reference
-    max_width = int(8.5 * target_dpi)
-    max_height = int(11 * target_dpi)
     # Check if resizing is needed - quick early return
     if width <= max_width and height <= max_height:
@@ -1044,6 +1293,7 @@ def calculate_image_entropy(img: Image.Image) -> float:
 def create_html_with_images(result):
     """
     Create an HTML document with embedded images from OCR results.
     Args:
         result: OCR result dictionary containing pages_data
@@ -1051,6 +1301,8 @@ def create_html_with_images(result):
     Returns:
         HTML content as string
     """
     # Create HTML document structure
     html_content = """
     <!DOCTYPE html>
@@ -1265,6 +1517,43 @@ def generate_document_thumbnail(image_path: Union[str, Path], max_size: int = 30
         # Return None if thumbnail generation fails
         return None
 def try_local_ocr_fallback(image_path: Union[str, Path], base64_data_url: str = None) -> str:
     """
     Attempt to use local pytesseract OCR as a fallback when API fails
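
Reviewer note on the `resize_image_impl` hunk: the page-size reference translates into pixel bounds by plain multiplication with the target DPI. A minimal sketch of that arithmetic, assuming a hypothetical helper `page_bounds_px` that is not part of `ocr_utils.py`; the 8.5x11 inch and 14x22 inch figures are the ones quoted in the diff:

```python
from typing import Tuple


def page_bounds_px(width_in: float, height_in: float, target_dpi: int = 300) -> Tuple[int, int]:
    """Hypothetical helper: convert a page size in inches to pixel bounds."""
    return int(width_in * target_dpi), int(height_in * target_dpi)


def needs_resize(width: int, height: int, bounds: Tuple[int, int]) -> bool:
    """Mirror of the early-return check: resize only if either side exceeds its bound."""
    max_width, max_height = bounds
    return not (width <= max_width and height <= max_height)


letter_bounds = page_bounds_px(8.5, 11)    # reference size before the change
large_bounds = page_bounds_px(14, 22)      # reference size after the change

# A 4000x6000 scan is resized under the old bounds but passes through under the new ones.
scan = (4000, 6000)
resized_before = needs_resize(*scan, letter_bounds)
resized_after = needs_resize(*scan, large_bounds)
```

At 300 DPI the bounds grow from 2550x3300 to 4200x6600 pixels, so many large scans that used to be downscaled now skip the resize step entirely.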

         CV2_AVAILABLE = False
 from mistralai import DocumentURLChunk, ImageURLChunk, TextChunk
+from mistralai.models import OCRImageObject
 # Import configuration
 try:
             # Handle list of results
             for i, result in enumerate(results):
                 try:
+                    # Create a descriptive base filename for this result
+                    base_filename = result.get('file_name', f'document_{i+1}').split('.')[0]
+                    # Add document type if available
+                    if 'topics' in result and result['topics']:
+                        topic = result['topics'][0].lower().replace(' ', '_')
+                        base_filename = f"{base_filename}_{topic}"
+                    # Add language if available
+                    if 'languages' in result and result['languages']:
+                        lang = result['languages'][0].lower()
+                        # Only add if it's not already in the filename
+                        if lang not in base_filename.lower():
+                            base_filename = f"{base_filename}_{lang}"
+                    # For PDFs, add page information
+                    if 'total_pages' in result and 'processed_pages' in result:
+                        base_filename = f"{base_filename}_p{result['processed_pages']}of{result['total_pages']}"
+                    # Add timestamp if available
+                    if 'timestamp' in result:
+                        try:
+                            # Try to parse the timestamp and reformat it
+                            dt = datetime.strptime(result['timestamp'], "%Y-%m-%d %H:%M")
+                            timestamp = dt.strftime("%Y%m%d_%H%M%S")
+                            base_filename = f"{base_filename}_{timestamp}"
+                        except (ValueError, TypeError):
+                            pass
+                    # Add JSON results for each file with descriptive name
                     result_json = json.dumps(result, indent=2)
+                    zipf.writestr(f"{base_filename}.json", result_json)
                     # Add HTML content (generated from the result)
                     html_content = create_html_with_images(result)
+                    zipf.writestr(f"{base_filename}_with_images.html", html_content)
                     # Add raw OCR text if available
                     if "ocr_contents" in result and "raw_text" in result["ocr_contents"]:
+                        zipf.writestr(f"{base_filename}.txt", result["ocr_contents"]["raw_text"])
                     # Add HTML visualization if available
                     if "html_visualization" in result:
         else:
             # Handle single result
             try:
+                # Create a descriptive base filename for this result
+                base_filename = results.get('file_name', 'document').split('.')[0]
+                # Add document type if available
+                if 'topics' in results and results['topics']:
+                    topic = results['topics'][0].lower().replace(' ', '_')
+                    base_filename = f"{base_filename}_{topic}"
+                # Add language if available
+                if 'languages' in results and results['languages']:
+                    lang = results['languages'][0].lower()
+                    # Only add if it's not already in the filename
+                    if lang not in base_filename.lower():
+                        base_filename = f"{base_filename}_{lang}"
+                # For PDFs, add page information
+                if 'total_pages' in results and 'processed_pages' in results:
+                    base_filename = f"{base_filename}_p{results['processed_pages']}of{results['total_pages']}"
+                # Add timestamp if available
+                if 'timestamp' in results:
+                    try:
+                        # Try to parse the timestamp and reformat it
+                        dt = datetime.strptime(results['timestamp'], "%Y-%m-%d %H:%M")
+                        timestamp = dt.strftime("%Y%m%d_%H%M%S")
+                        base_filename = f"{base_filename}_{timestamp}"
+                    except (ValueError, TypeError):
+                        # If parsing fails, create a new timestamp
+                        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+                        base_filename = f"{base_filename}_{timestamp}"
+                else:
+                    # No timestamp in the result, create a new one
+                    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
+                    base_filename = f"{base_filename}_{timestamp}"
+                # Add JSON results with descriptive name
                 results_json = json.dumps(results, indent=2)
+                zipf.writestr(f"{base_filename}.json", results_json)
+                # Add HTML content with descriptive name
                 html_content = create_html_with_images(results)
+                zipf.writestr(f"{base_filename}_with_images.html", html_content)
                 # Add raw OCR text if available
                 if "ocr_contents" in results and "raw_text" in results["ocr_contents"]:
+                    zipf.writestr(f"{base_filename}.txt", results["ocr_contents"]["raw_text"])
                 # Add HTML visualization if available
                 if "html_visualization" in results:
     # Generate zip name if not provided
     if zip_name is None:
+        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
         if is_list:
+            # For a list of results, create a more descriptive name based on the content
+            file_count = len(results)
+            # Count document types
+            pdf_count = sum(1 for r in results if r.get('file_name', '').lower().endswith('.pdf'))
+            img_count = sum(1 for r in results if r.get('file_name', '').lower().endswith(('.jpg', '.jpeg', '.png')))
+            # Create descriptive name based on contents
+            if pdf_count > 0 and img_count > 0:
+                zip_name = f"historical_ocr_mixed_{pdf_count}pdf_{img_count}img_{timestamp}.zip"
+            elif pdf_count > 0:
+                zip_name = f"historical_ocr_pdf_documents_{pdf_count}_{timestamp}.zip"
+            elif img_count > 0:
+                zip_name = f"historical_ocr_images_{img_count}_{timestamp}.zip"
             else:
+                zip_name = f"historical_ocr_results_{file_count}_{timestamp}.zip"
+        else:
+            # For single result, create descriptive filename
+            base_name = results.get("file_name", "document").split('.')[0]
+            # Add document type if available
+            if 'topics' in results and results['topics']:
+                topic = results['topics'][0].lower().replace(' ', '_')
+                base_name = f"{base_name}_{topic}"
+            # Add language if available
+            if 'languages' in results and results['languages']:
+                lang = results['languages'][0].lower()
+                # Only add if it's not already in the filename
+                if lang not in base_name.lower():
+                    base_name = f"{base_name}_{lang}"
+            # For PDFs, add page information
+            if 'total_pages' in results and 'processed_pages' in results:
+                base_name = f"{base_name}_p{results['processed_pages']}of{results['total_pages']}"
+            # Add timestamp
+            zip_name = f"{base_name}_{timestamp}.zip"
     try:
         # Get zip data in memory first
 def preprocess_image_for_ocr(image_path: Union[str, Path]) -> Tuple[Image.Image, str]:
     """
     Preprocess an image for optimal OCR performance with enhanced speed and memory optimization.
+    Enhanced to handle large newspaper and document images.
     Args:
         image_path: Path to the image file
                 preprocess_image_for_ocr._cache[cache_key] = result
                 return result
+        # Special handling for large newspaper-style documents
+        if file_size_mb > 5 and image_file.name.lower().endswith(('.jpg', '.jpeg', '.png')):
+            logger.info(f"Large image detected ({file_size_mb:.2f}MB), checking for newspaper format")
+            try:
+                # Quickly check dimensions without loading full image
+                with Image.open(image_file) as img:
+                    width, height = img.size
+                    aspect_ratio = width / height
+                    # Newspaper-style documents typically have width > height or are very large
+                    is_newspaper_format = (aspect_ratio > 1.2 and width > 2000) or (width > 3000 or height > 3000)
+                    if is_newspaper_format:
+                        logger.info(f"Newspaper format detected: {width}x{height}, applying specialized processing")
+            except Exception as dim_err:
+                logger.debug(f"Error checking dimensions: {str(dim_err)}")
+                is_newspaper_format = False
+        else:
+            is_newspaper_format = False
     except Exception as e:
         # If stat or cache handling fails, log and continue with processing
         except:
             file_size_mb = 0  # Default if we can't determine size
+        # Default to not newspaper format on error
+        is_newspaper_format = False
     try:
         # Process start time for performance logging
         start_time = time.time()
             # Detect document type only for medium to large images to save processing time
             is_document = False
+            is_newspaper = False
+            # More aggressive document type detection for larger images
             if image_area > 500000:  # Approx 700x700 or larger
                 # Store image for document detection
                 _detect_document_type_impl._current_img = img
                 is_document = _detect_document_type_impl(None)
+                # Additional check for newspaper format
+                if is_document:
+                    # Newspapers typically have wide formats or very large dimensions
+                    aspect_ratio = width / height
+                    is_newspaper = (aspect_ratio > 1.2 and width > 2000) or (width > 3000 or height > 3000)
+                logger.debug(f"Document type detection for {image_file.name}: " +
+                           f"{'newspaper' if is_newspaper else 'document' if is_document else 'photo'}")
+            # Special processing for very large images (newspapers and large documents)
+            if is_newspaper:
+                # For newspaper format, we need more specialized processing
+                logger.info(f"Processing newspaper format image: {width}x{height}")
+                # For newspapers, we prioritize text clarity over file size
+                # Use higher target resolution to preserve small text common in newspapers
+                # But still need to resize if extremely large to avoid API limits
+                max_dimension = max(width, height)
+                if max_dimension > 6000:  # Extremely large
+                    scale_factor = 0.4   # Preserve more resolution for newspapers (increased from 0.35)
+                elif max_dimension > 4000:
+                    scale_factor = 0.6   # Higher resolution for better text extraction (increased from 0.5)
+                else:
+                    scale_factor = 0.8   # Minimal reduction for moderate newspaper size (increased from 0.7)
+                # Calculate new dimensions - maintain higher resolution
+                new_width = int(width * scale_factor)
+                new_height = int(height * scale_factor)
+                # Use high-quality resampling to preserve text clarity in newspapers
+                processed_img = img.resize((new_width, new_height), Image.LANCZOS)
+                logger.debug(f"Resized newspaper image from {width}x{height} to {new_width}x{new_height}")
+                # For newspapers, we also want to enhance the contrast and sharpen the image
+                # before the main OCR processing for better text extraction
+                if img.mode in ('RGB', 'RGBA'):
+                    # For color newspapers, enhance both the overall image and then convert to grayscale
+                    # This helps with mixed content newspapers that have both text and images
+                    enhancer = ImageEnhance.Contrast(processed_img)
+                    processed_img = enhancer.enhance(1.3)  # Boost contrast but not too aggressively
+                    # Also enhance saturation to make colored text more visible
+                    enhancer_sat = ImageEnhance.Color(processed_img)
+                    processed_img = enhancer_sat.enhance(1.2)
+            # Standard processing for other large images
+            elif file_size_mb > IMAGE_PREPROCESSING["max_size_mb"] or max(width, height) > 3000:
                 # Calculate target dimensions directly instead of using the heavier resize function
                 target_width, target_height = width, height
                 max_dimension = max(width, height)
                 # Use a sliding scale for reduction based on image size
                 if max_dimension > 5000:
+                    scale_factor = 0.3   # Slightly less aggressive reduction (was 0.25)
                 elif max_dimension > 3000:
+                    scale_factor = 0.45  # Slightly less aggressive reduction (was 0.4)
                 else:
+                    scale_factor = 0.65  # Slightly less aggressive reduction (was 0.6)
                 # Calculate new dimensions
                 new_width = int(width * scale_factor)
     Optimized implementation of document type detection for faster processing.
     The img_hash parameter is unused but kept for backward compatibility.
+    Enhanced to better detect handwritten documents and newspaper formats.
     """
     # Fast path: Get the image stashed on the function attribute (not truly thread-local)
     if not hasattr(_detect_document_type_impl, "_current_img"):
 def _preprocess_document_image_impl() -> Image.Image:
     """
     Optimized implementation of document preprocessing with adaptive processing based on image size.
+    Enhanced for better handwritten document processing and newspaper format.
     """
     # Fast path: Get the image stashed on the function attribute (not truly thread-local)
     if not hasattr(preprocess_document_image, "_current_img"):
     width, height = img.size
     img_size = width * height
+    # Detect special document types
     is_handwritten = False
+    is_newspaper = False
+    # Check for newspaper format first (takes precedence)
+    aspect_ratio = width / height
+    if (aspect_ratio > 1.2 and width > 2000) or (width > 3000 or height > 3000):
+        is_newspaper = True
+        logger.debug(f"Newspaper format detected: {width}x{height}, aspect ratio: {aspect_ratio:.2f}")
+    else:
+        # If not newspaper, check if handwritten
+        try:
+            # Simple check for handwritten document characteristics
+            # Handwritten documents often have more varied strokes and less stark contrast
+            if CV2_AVAILABLE:
+                # Convert to grayscale for edge analysis
+                gray_np = np.array(img.convert('L'))
+                # A higher density of Canny edges can indicate handwriting
+                edges = cv2.Canny(gray_np, 30, 100)
+                if np.count_nonzero(edges) / edges.size > 0.02:  # Low edge threshold for handwriting
+                    # Additional check with gradient magnitudes
+                    sobelx = cv2.Sobel(gray_np, cv2.CV_64F, 1, 0, ksize=3)
+                    sobely = cv2.Sobel(gray_np, cv2.CV_64F, 0, 1, ksize=3)
+                    magnitude = np.sqrt(sobelx**2 + sobely**2)
+                    # Handwriting typically has more variation in gradient magnitudes
+                    if np.std(magnitude) > 20:
+                        is_handwritten = True
+        except Exception:
+            # If detection fails, assume it's not handwritten
+            pass
+    # Special processing for newspaper format
+    if is_newspaper:
+        # Convert to grayscale for better text extraction
+        gray = img.convert('L')
+        # For newspapers, we need aggressive text enhancement to make small print readable
+        # First enhance contrast more aggressively for newspaper small text
+        enhancer = ImageEnhance.Contrast(gray)
+        enhanced = enhancer.enhance(2.0)  # More aggressive contrast for newspaper text
+        # Apply stronger sharpening to make small text more defined
+        if IMAGE_PREPROCESSING["sharpen"]:
+            # Apply multiple passes of sharpening for newspaper text
+            enhanced = enhanced.filter(ImageFilter.SHARPEN)
+            enhanced = enhanced.filter(ImageFilter.EDGE_ENHANCE_MORE)  # Stronger edge enhancement
+        # Enhanced processing for newspapers with OpenCV when available
         if CV2_AVAILABLE:
+            try:
+                # Convert to numpy array
+                img_np = np.array(enhanced)
+                # For newspaper text extraction, CLAHE (Contrast Limited Adaptive Histogram Equalization)
+                # works much better than simple contrast enhancement
+                clahe = cv2.createCLAHE(clipLimit=3.0, tileGridSize=(8, 8))
+                img_np = clahe.apply(img_np)
+                # Apply different adaptive thresholding approaches and choose the best one
+                # 1. Standard adaptive threshold with larger block size for newspaper columns
+                binary1 = cv2.adaptiveThreshold(img_np, 255,
+                                            cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
+                                            cv2.THRESH_BINARY, 15, 4)
+                # 2. Otsu's method for global thresholding - works well for clean newspaper print
+                _, binary2 = cv2.threshold(img_np, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
+                # Try to determine which method preserves text better
+                # Count white pixels and edges in each binary version
+                white_pixels1 = np.count_nonzero(binary1 > 200)
+                white_pixels2 = np.count_nonzero(binary2 > 200)
+                # Calculate edge density to help determine which preserves text features better
+                edges1 = cv2.Canny(binary1, 100, 200)
+                edges2 = cv2.Canny(binary2, 100, 200)
+                edge_count1 = np.count_nonzero(edges1)
+                edge_count2 = np.count_nonzero(edges2)
+                # For newspaper text, we want to preserve more edges while maintaining reasonable
+                # white space (typical of printed text on paper background)
+                if (edge_count1 > edge_count2 * 1.2 and white_pixels1 > white_pixels2 * 0.7) or \
+                   (white_pixels1 < white_pixels2 * 0.5):  # If Otsu removed too much content
+                    # Adaptive thresholding usually better preserves small text in newspapers
+                    logger.debug("Using adaptive thresholding for newspaper text")
+                    # Apply optional denoising to clean up small speckles
+                    result = cv2.fastNlMeansDenoising(binary1, None, 7, 7, 21)
+                    return Image.fromarray(result)
+                else:
+                    # Otsu method was better
+                    logger.debug("Using Otsu thresholding for newspaper text")
+                    result = cv2.fastNlMeansDenoising(binary2, None, 7, 7, 21)
+                    return Image.fromarray(result)
+            except Exception as e:
+                logger.debug(f"Advanced newspaper processing failed: {str(e)}")
+                # Fall back to PIL processing
+                pass
+        # If OpenCV not available or fails, apply additional PIL enhancements
+        # Create a more aggressive binary version to better separate text
+        binary_threshold = enhanced.point(lambda x: 0 if x < 150 else 255, '1')
+        # Return enhanced binary image
+        return binary_threshold
     # Ultra-fast path for tiny images - just convert to grayscale with contrast enhancement
     if img_size < 300000:  # ~500x600 or smaller
         gray = img.convert('L')
     width, height = img.size
     # Fixed target dimensions based on DPI
+    # Using larger dimensions to support newspapers and large documents
+    max_width = int(14 * target_dpi)  # Increased from 8.5 to 14 inches
+    max_height = int(22 * target_dpi)  # Increased from 11 to 22 inches
     # Check if resizing is needed - quick early return
     if width <= max_width and height <= max_height:
 def create_html_with_images(result):
     """
     Create an HTML document with embedded images from OCR results.
+    Handles serialization of complex OCR objects automatically.
     Args:
         result: OCR result dictionary containing pages_data
     Returns:
         HTML content as string
     """
+    # Ensure result is fully serializable first
+    result = serialize_ocr_object(result)
     # Create HTML document structure
     html_content = """
     <!DOCTYPE html>
         # Return None if thumbnail generation fails
         return None
+def serialize_ocr_object(obj):
+    """
+    Serialize OCR response objects to JSON serializable format.
+    Handles OCRImageObject specifically to prevent serialization errors.
+    Args:
+        obj: The object to serialize
+    Returns:
+        JSON serializable representation of the object
+    """
+    # Fast path: Handle primitive types directly
+    if obj is None or isinstance(obj, (str, int, float, bool)):
+        return obj
+    # Handle collections
+    if isinstance(obj, list):
+        return [serialize_ocr_object(item) for item in obj]
+    elif isinstance(obj, dict):
+        return {k: serialize_ocr_object(v) for k, v in obj.items()}
+    elif isinstance(obj, OCRImageObject):
+        # Special handling for OCRImageObject
+        return {
+            'id': obj.id if hasattr(obj, 'id') else None,
+            'image_base64': obj.image_base64 if hasattr(obj, 'image_base64') else None
+        }
+    elif hasattr(obj, '__dict__'):
+        # For objects with __dict__ attribute
+        return {k: serialize_ocr_object(v) for k, v in obj.__dict__.items()
+                if not k.startswith('_')}  # Skip private attributes
+    else:
+        # Try to convert to string as last resort
+        try:
+            return str(obj)
+        except Exception:
+            return None
 def try_local_ocr_fallback(image_path: Union[str, Path], base64_data_url: str = None) -> str:
     """
     Attempt to use local pytesseract OCR as a fallback when API fails
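
Reviewer note on the newspaper hunks: the same dimension test appears three times in this diff (pre-check, `preprocess_image_for_ocr`, and `_preprocess_document_image_impl`), followed by a sliding scale for the resize factor. A minimal sketch of both rules as pure functions, assuming hypothetical names `looks_like_newspaper` and `newspaper_scale_factor` that do not exist in `ocr_utils.py`; the thresholds and factors are copied from the diff, and the zero-height guard is an addition for illustration:

```python
def looks_like_newspaper(width: int, height: int) -> bool:
    """Hypothetical helper: the dimension heuristic used in the diff."""
    if height <= 0:
        # Guard added for this sketch; the diff divides without checking.
        return False
    aspect_ratio = width / height
    return (aspect_ratio > 1.2 and width > 2000) or (width > 3000 or height > 3000)


def newspaper_scale_factor(width: int, height: int) -> float:
    """Hypothetical helper: the sliding scale applied to newspaper-format images."""
    max_dimension = max(width, height)
    if max_dimension > 6000:
        return 0.4
    if max_dimension > 4000:
        return 0.6
    return 0.8


# A wide 2600x2000 page matches through the aspect-ratio clause.
wide_page = looks_like_newspaper(2600, 2000)
# A tall 1000x3500 page also matches, purely because one side exceeds 3000 px.
tall_page = looks_like_newspaper(1000, 3500)
# A 7000x5000 scan would be resized to 40% of its original dimensions.
factor = newspaper_scale_factor(7000, 5000)
new_size = (int(7000 * factor), int(5000 * factor))
```

Note that the second clause of the heuristic flags any image with a side longer than 3000 pixels, regardless of aspect ratio, so large portrait documents also take the newspaper path. Factoring the test into one function would keep the three call sites from drifting apart.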

structured_ocr.py CHANGED Viewed

@@ -506,6 +506,32 @@ class StructuredOCR:
                                 if 'ocr_contents' in result:
                                     result['ocr_contents']['raw_text'] = all_text
                             except Exception as e:
                                 logger.warning(f"Custom prompt processing failed: {str(e)}. Using standard processing.")
                                 # Fall back to standard processing
@@ -901,6 +927,25 @@ class StructuredOCR:
                 "confidence_score": 0.0
             }
         try:
             # Check file size
             file_size_mb = file_path.stat().st_size / (1024 * 1024)
@@ -992,8 +1037,8 @@ class StructuredOCR:
             logger.info(f"Processing image with OCR using {OCR_MODEL}")
             # Add retry logic with more retries and longer backoff periods for rate limit issues
-            max_retries = 4  # Increased from 2 to give more chances to succeed
-            retry_delay = 2  # Increased from 1 to allow for longer backoff periods
             for retry in range(max_retries):
                 try:
@@ -1001,7 +1046,7 @@ class StructuredOCR:
                         document=ImageURLChunk(image_url=base64_data_url),
                         model=OCR_MODEL,
                         include_image_base64=True,
-                        timeout_ms=90000  # 90 second timeout for better success rate
                     )
                     break  # Success, exit retry loop
                 except Exception as e:
@@ -1079,7 +1124,8 @@ class StructuredOCR:
             image_ocr_markdown = image_response.pages[0].markdown if image_response.pages else ""
             # Optimize: Skip vision model step if ocr_markdown is very small or empty
-            if not image_ocr_markdown or len(image_ocr_markdown) < 50:
                 logger.warning("OCR produced minimal or no text. Returning basic result.")
                 return {
                     "file_name": file_path.name,
@@ -1090,6 +1136,14 @@ class StructuredOCR:
                     },
                     "processing_note": "OCR produced minimal text content"
                 }
             # Extract structured data using the appropriate model, with a single API call
             if use_vision:
@@ -1182,17 +1236,37 @@ class StructuredOCR:
         logger = logging.getLogger("vision_processor")
         try:
-            # Fast path: Skip vision API for minimal OCR text (saves an API call)
-            if not ocr_markdown or len(ocr_markdown.strip()) < 100:  # Increased threshold for better detection
-                logger.info("Minimal OCR text detected, skipping vision model processing")
                 return {
                     "file_name": filename,
                     "topics": ["Document"],
                     "languages": ["English"],
                     "ocr_contents": {
-                        "raw_text": ocr_markdown if ocr_markdown else "No text could be extracted"
                     }
                 }
             # Fast path: Skip if in test mode or no API key
             if self.test_mode or not self.api_key:
@@ -1203,25 +1277,10 @@ class StructuredOCR:
             doc_type = self._detect_document_type(custom_prompt, ocr_markdown)
             logger.info(f"Detected document type: {doc_type}")
-            # Optimize OCR text for processing - focus on the first part which usually contains
-            # the most important information (title, metadata, etc.)
-            if len(ocr_markdown) > 8000:
-                # Start with first 5000 chars
-                first_part = ocr_markdown[:5000]
-                # Then add representative samples from different parts of the document
-                # This captures headings and key information throughout
-                middle_start = len(ocr_markdown) // 2 - 1000
-                middle_part = ocr_markdown[middle_start:middle_start+2000] if middle_start > 0 else ""
-                # Get ending section if large enough
-                if len(ocr_markdown) > 15000:
-                    end_part = ocr_markdown[-1000:]
-                    truncated_ocr = f"{first_part}\n...\n{middle_part}\n...\n{end_part}"
-                else:
-                    truncated_ocr = f"{first_part}\n...\n{middle_part}"
-                logger.info(f"Truncated OCR text from {len(ocr_markdown)} to {len(truncated_ocr)} chars")
             else:
                 truncated_ocr = ocr_markdown
@@ -1232,9 +1291,8 @@ class StructuredOCR:
             start_time = time.time()
             try:
-                # Try with enhanced timing parameters based on document complexity
-                # Use shorter timeout for smaller documents
-                timeout_ms = min(120000, max(60000, len(truncated_ocr) * 10))  # 60-120 seconds based on text length
                 logger.info(f"Calling vision model with {timeout_ms}ms timeout and document type {doc_type}")
                 chat_response = self.client.chat.parse(
@@ -1260,20 +1318,18 @@ class StructuredOCR:
                 # If there's an error with the enhanced prompt, try progressively simpler approaches
                 logger.warning(f"Enhanced prompt failed after {time.time() - start_time:.2f}s: {str(e)}")
-                # Try a simplified approach with less context
                 try:
-                    # Shorter prompt with less contextual information
                     simplified_prompt = (
-                        f"You are an expert in historical document analysis. "
-                        f"Analyze this document image and the OCR text below. "
-                        f"<BEGIN_OCR>\n{truncated_ocr[:4000]}\n<END_OCR>\n"
-                        f"Identify the document type, main topics, languages used, and extract key information "
-                        f"including names, dates, places, and events. Return a structured JSON response."
                     )
-                    # Add custom prompt if provided
-                    if custom_prompt:
-                        simplified_prompt += f"\n\nAdditional instructions: {custom_prompt}"
                     logger.info(f"Trying simplified prompt approach")
                     chat_response = self.client.chat.parse(
@@ -1289,7 +1345,7 @@ class StructuredOCR:
                         ],
                         response_format=StructuredOCRModel,
                         temperature=0,
-                        timeout_ms=60000  # Shorter timeout for simplified approach
                     )
                     logger.info(f"Simplified prompt approach succeeded")
@@ -1299,11 +1355,10 @@ class StructuredOCR:
                     logger.warning(f"Simplified prompt failed: {str(second_e)}. Trying minimal prompt.")
                     try:
-                        # Minimal prompt focusing on just the image
                         minimal_prompt = (
-                            f"Analyze this historical document image. "
-                            f"Extract the document type, main topics, languages, and key information. "
-                            f"Provide your analysis in a structured JSON format."
                         )
                         logger.info(f"Trying minimal prompt with image-only focus")
@@ -1320,7 +1375,7 @@ class StructuredOCR:
                             ],
                             response_format=StructuredOCRModel,
                             temperature=0,
-                            timeout_ms=45000  # Even shorter timeout for minimal approach
                         )
                         logger.info(f"Minimal prompt approach succeeded")
@@ -1345,6 +1400,35 @@ class StructuredOCR:
                 'api_response_time': time.time() - start_time
             }
             # Add confidence score if not present
             if 'confidence_score' not in result:
                 result['confidence_score'] = 0.92  # Vision model typically has higher confidence
@@ -1444,7 +1528,8 @@ class StructuredOCR:
     def _build_enhanced_prompt(self, doc_type: str, ocr_text: str, custom_prompt: Optional[str]) -> str:
         """
-        Build an enhanced prompt based on document type.
         Args:
             doc_type: Detected document type
@@ -1452,125 +1537,163 @@ class StructuredOCR:
             custom_prompt: User-provided custom prompt
         Returns:
-            Enhanced prompt optimized for the document type
         """
         # Generic document section (included in all prompts)
         generic_section = (
-            f"This is a historical document's OCR text:\n"
             f"<BEGIN_OCR>\n{ocr_text}\n<END_OCR>\n\n"
         )
-        # Document-specific prompting
-        if doc_type == "handwritten":
-            specific_section = (
-                f"You are an expert historian specializing in handwritten document transcription and analysis. "
-                f"The OCR system has attempted to capture the handwriting, but may have made errors with cursive script "
-                f"or unusual letter formations.\n\n"
-                f"Pay careful attention to:\n"
-                f"- Correcting OCR errors common in handwriting recognition\n"
-                f"- Preserving the original document structure\n"
-                f"- Identifying topics, language(s), and document type accurately\n"
-                f"- Detecting any names, dates, places, or events mentioned\n"
-            )
-        elif doc_type == "letter":
-            specific_section = (
-                f"You are an expert in historical correspondence analysis. "
-                f"Analyze this letter as a historian would, identifying:\n"
-                f"- Sender and recipient (if mentioned)\n"
-                f"- Date and location of writing (if present)\n"
-                f"- Key topics discussed\n"
-                f"- Historical context and significance\n"
-                f"- Sentiment and tone of the communication\n"
-                f"- Closing formulations and signature\n"
-            )
-        elif doc_type == "recipe":
             specific_section = (
-                f"You are a culinary historian specializing in historical recipes. "
-                f"Analyze this recipe document to extract:\n"
-                f"- Recipe name/title\n"
-                f"- Complete list of ingredients with measurements\n"
-                f"- Preparation instructions in correct order\n"
-                f"- Cooking time and temperature if mentioned\n"
-                f"- Serving suggestions or yield information\n"
-                f"- Any cultural or historical context provided\n"
             )
-        elif doc_type == "travel":
-            specific_section = (
-                f"You are a historian specializing in historical travel and exploration accounts. "
-                f"Analyze this document to extract:\n"
-                f"- Geographical locations mentioned\n"
-                f"- Names of explorers, ships, or expeditions\n"
-                f"- Dates and timelines\n"
-                f"- Descriptions of indigenous peoples, cultures, or local conditions\n"
-                f"- Natural features, weather, or navigational details\n"
-                f"- Historical significance of the journey described\n"
-            )
-        elif doc_type == "scientific":
-            specific_section = (
-                f"You are a historian of science specializing in historical scientific documents. "
-                f"Analyze this document to extract:\n"
-                f"- Scientific methodology described\n"
-                f"- Observations, measurements, or data presented\n"
-                f"- Scientific terminology of the period\n"
-                f"- Experimental apparatus or tools mentioned\n"
-                f"- Conclusions or hypotheses presented\n"
-                f"- Historical significance within scientific development\n"
             )
-        elif doc_type == "newspaper":
             specific_section = (
-                f"You are a media historian specializing in historical newspapers and publications. "
-                f"Analyze this document to extract:\n"
-                f"- Publication name and date if present\n"
-                f"- Headlines and article titles\n"
-                f"- Main news content with focus on events, people, and places\n"
-                f"- Advertisement content if present\n"
-                f"- Historical context and significance\n"
-                f"- Editorial perspective or bias if detectable\n"
             )
-        elif doc_type == "legal":
-            specific_section = (
-                f"You are a legal historian specializing in historical legal documents. "
-                f"Analyze this document to extract:\n"
-                f"- Document type (contract, certificate, will, deed, etc.)\n"
-                f"- Parties involved and their roles\n"
-                f"- Key terms, conditions, or declarations\n"
-                f"- Dates, locations, and jurisdictions mentioned\n"
-                f"- Legal terminology of the period\n"
-                f"- Signatures, witnesses, or official markings\n"
-            )
-        else:
-            # General historical document
-            specific_section = (
-                f"You are a historian specializing in historical document analysis. "
-                f"Analyze this document to extract:\n"
-                f"- Document type and purpose\n"
-                f"- Time period and historical context\n"
-                f"- Key topics, themes, and subjects\n"
-                f"- People, places, and events mentioned\n"
-                f"- Languages used and writing style\n"
-                f"- Historical significance and connections\n"
             )
-        # Output instructions
-        output_section = (
-            f"Create a structured JSON response with the following fields:\n"
-            f"- file_name: The document's name\n"
-            f"- topics: An array of topics covered in the document\n"
-            f"- languages: An array of languages used in the document\n"
-            f"- ocr_contents: A dictionary with the document's contents, organized logically\n"
-        )
         # Add custom prompt if provided
         custom_section = ""
         if custom_prompt:
-            custom_section = f"\n\nADDITIONAL CONTEXT AND INSTRUCTIONS:\n{custom_prompt}\n"
         # Combine all sections into complete prompt
         return generic_section + specific_section + output_section + custom_section
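The removed `if`/`elif` chain maps each document type to a fixed role description. One way to express the same dispatch is a dictionary lookup with a default. This is only a sketch: the strings are abbreviated from the removed branches, the names `ROLE_BY_DOC_TYPE`, `DEFAULT_ROLE`, and `role_for` are hypothetical, and the whitespace and case normalization is an addition the removed code did not perform.

```python
# Role descriptions abbreviated from the removed per-type branches.
ROLE_BY_DOC_TYPE = {
    "handwritten": "You are an expert historian specializing in handwritten document transcription and analysis.",
    "letter": "You are an expert in historical correspondence analysis.",
    "recipe": "You are a culinary historian specializing in historical recipes.",
    "travel": "You are a historian specializing in historical travel and exploration accounts.",
    "scientific": "You are a historian of science specializing in historical scientific documents.",
    "newspaper": "You are a media historian specializing in historical newspapers and publications.",
    "legal": "You are a legal historian specializing in historical legal documents.",
}

# Fallback used for any type without a dedicated entry.
DEFAULT_ROLE = "You are a historian specializing in historical document analysis."


def role_for(doc_type: str) -> str:
    """Return the role description for a document type, or the general default."""
    return ROLE_BY_DOC_TYPE.get(doc_type.strip().lower(), DEFAULT_ROLE)
```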
@@ -1667,6 +1790,35 @@ class StructuredOCR:
                     result['model_used'] = TEXT_MODEL
                     result['processing_time'] = time.time() - start_time
                     # Add raw text for reference if not already present
                     if 'ocr_contents' in result and 'raw_text' not in result['ocr_contents']:
                         # Add truncated raw text if very large

                                 if 'ocr_contents' in result:
                                     result['ocr_contents']['raw_text'] = all_text
+                                # Add flag to indicate custom prompt was applied
+                                result['custom_prompt_applied'] = 'text_only'
+                                # Detect document type from custom prompt if available
+                                if custom_prompt:
+                                    # Extract document type if specified
+                                    doc_type = "general"
+                                    if "DOCUMENT TYPE:" in custom_prompt:
+                                        doc_type_line = custom_prompt.split("\n")[0]
+                                        if "DOCUMENT TYPE:" in doc_type_line:
+                                            doc_type = doc_type_line.split("DOCUMENT TYPE:")[1].strip().lower()
+                                    # Keyword-based detection as fallback
+                                    elif any(keyword in custom_prompt.lower() for keyword in ["newspaper", "column", "article", "magazine"]):
+                                        doc_type = "newspaper"
+                                    elif any(keyword in custom_prompt.lower() for keyword in ["letter", "correspondence", "handwritten"]):
+                                        doc_type = "letter"
+                                    elif any(keyword in custom_prompt.lower() for keyword in ["book", "publication"]):
+                                        doc_type = "book"
+                                    elif any(keyword in custom_prompt.lower() for keyword in ["form", "certificate", "legal"]):
+                                        doc_type = "form"
+                                    elif any(keyword in custom_prompt.lower() for keyword in ["recipe", "ingredients"]):
+                                        doc_type = "recipe"
+                                    # Store detected document type in result
+                                    result['detected_document_type'] = doc_type
                             except Exception as e:
                                 logger.warning(f"Custom prompt processing failed: {str(e)}. Using standard processing.")
                                 # Fall back to standard processing
                 "confidence_score": 0.0
             }
+        # Check if this is likely a newspaper or document with columns by filename
+        is_likely_newspaper = False
+        newspaper_keywords = ["newspaper", "gazette", "herald", "times", "journal",
+                            "chronicle", "post", "tribune", "news", "press", "gender"]
+        # Check filename for newspaper indicators
+        filename_lower = file_path.name.lower()
+        for keyword in newspaper_keywords:
+            if keyword in filename_lower:
+                is_likely_newspaper = True
+                logger.info(f"Likely newspaper document detected from filename: {file_path.name}")
+                # Add newspaper-specific processing hint to custom_prompt if not already present
+                if custom_prompt:
+                    if "column" not in custom_prompt.lower() and "newspaper" not in custom_prompt.lower():
+                        custom_prompt = custom_prompt + " This appears to be a newspaper or document with columns. Please extract all text content from each column."
+                else:
+                    custom_prompt = "This appears to be a newspaper or document with columns. Please extract all text content from each column, maintaining proper reading order."
+                break
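Because the loop above uses substring matching, a keyword such as `post` also matches a filename like `postcard_1901.jpg`. A hedged sketch of a stricter variant that compares whole filename tokens is shown below. The helper name `looks_like_newspaper` is hypothetical, and the keyword set is abbreviated from the list above.

```python
import re
from pathlib import Path

# Abbreviated from the keyword list in the diff.
NEWSPAPER_KEYWORDS = {"newspaper", "gazette", "herald", "times", "journal",
                      "chronicle", "post", "tribune", "news", "press"}


def looks_like_newspaper(file_path: Path) -> bool:
    """Return True when any alphabetic token of the filename stem is a keyword."""
    # Split on anything that is not a lowercase letter (digits, '_', '-', spaces).
    tokens = re.split(r"[^a-z]+", file_path.stem.lower())
    return any(token in NEWSPAPER_KEYWORDS for token in tokens)
```

The trade-off is that run-together names such as `dailynews.jpg` no longer match, so token matching suits collections with delimiter-separated filenames.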
         try:
             # Check file size
             file_size_mb = file_path.stat().st_size / (1024 * 1024)
             logger.info(f"Processing image with OCR using {OCR_MODEL}")
             # Retry with a small retry budget and a short delay to avoid aggravating rate limits
+            max_retries = 2  # Reduced to prevent rate limiting
+            retry_delay = 1  # Shorter delay between retries
             for retry in range(max_retries):
                 try:
                         document=ImageURLChunk(image_url=base64_data_url),
                         model=OCR_MODEL,
                         include_image_base64=True,
+                        timeout_ms=45000  # 45-second timeout so a stalled request fails fast
                     )
                     break  # Success, exit retry loop
                 except Exception as e:
             image_ocr_markdown = image_response.pages[0].markdown if image_response.pages else ""
             # Optimize: Skip vision model step if ocr_markdown is very small or empty
+            # BUT make an exception for newspapers or if custom_prompt is provided
+            if (not is_likely_newspaper and not custom_prompt) and (not image_ocr_markdown or len(image_ocr_markdown) < 50):
                 logger.warning("OCR produced minimal or no text. Returning basic result.")
                 return {
                     "file_name": file_path.name,
                     },
                     "processing_note": "OCR produced minimal text content"
                 }
+            # For newspapers with little text in OCR, set a more explicit prompt
+            if is_likely_newspaper and (not image_ocr_markdown or len(image_ocr_markdown) < 100):
+                logger.info("Newspaper with minimal OCR text detected. Using enhanced prompt.")
+                if not custom_prompt:
+                    custom_prompt = "This is a newspaper or document with columns. The OCR may not have captured all text. Please examine the image carefully and extract ALL text content visible in the document, reading each column from top to bottom."
+                elif "extract all text" not in custom_prompt.lower():
+                    custom_prompt += " Please examine the image carefully and extract ALL text content visible in the document."
             # Extract structured data using the appropriate model, with a single API call
             if use_vision:
         logger = logging.getLogger("vision_processor")
         try:
+            # Check if this is a newspaper or document with columns by filename
+            is_likely_newspaper = False
+            newspaper_keywords = ["newspaper", "gazette", "herald", "times", "journal",
+                                "chronicle", "post", "tribune", "news", "press", "gender"]
+            # Check filename for newspaper indicators
+            filename_lower = filename.lower()
+            for keyword in newspaper_keywords:
+                if keyword in filename_lower:
+                    is_likely_newspaper = True
+                    logger.info(f"Likely newspaper document detected in vision processing: {filename}")
+                    break
+            # Fast path: skip the vision API when OCR already produced reasonable text,
+            # defined here as more than 300 characters after stripping whitespace
+            if len(ocr_markdown.strip()) > 300:
+                logger.info("Sufficient OCR text detected, using OCR text directly")
                 return {
                     "file_name": filename,
                     "topics": ["Document"],
                     "languages": ["English"],
                     "ocr_contents": {
+                        "raw_text": ocr_markdown
                     }
                 }
+            # Below the fast-path threshold: give likely newspapers a column-aware prompt if none was supplied
+            if is_likely_newspaper and (not ocr_markdown or len(ocr_markdown.strip()) < 300):
+                logger.info("Using vision model for newspaper with minimal OCR text")
+                if not custom_prompt:
+                    custom_prompt = "Document has columns. Extract text by reading each column top to bottom."
             # Fast path: Skip if in test mode or no API key
             if self.test_mode or not self.api_key:
             doc_type = self._detect_document_type(custom_prompt, ocr_markdown)
             logger.info(f"Detected document type: {doc_type}")
+            # Use only the first part of OCR text to keep prompts small and processing fast
+            if len(ocr_markdown) > 1000:
+                truncated_ocr = ocr_markdown[:1000]
+                logger.info(f"Truncated OCR text from {len(ocr_markdown)} to 1000 chars for faster processing")
             else:
                 truncated_ocr = ocr_markdown
             start_time = time.time()
             try:
+                # Use a fixed, shorter timeout for single-page documents
+                timeout_ms = 45000  # 45 seconds, intended to cover most single-page documents
                 logger.info(f"Calling vision model with {timeout_ms}ms timeout and document type {doc_type}")
                 chat_response = self.client.chat.parse(
                 # If there's an error with the enhanced prompt, try progressively simpler approaches
                 logger.warning(f"Enhanced prompt failed after {time.time() - start_time:.2f}s: {str(e)}")
+                # Try a heavily simplified approach with minimal context
                 try:
+                    # Ultra-short prompt for faster processing
                     simplified_prompt = (
+                        f"Extract text from this document image. "
+                        f"<BEGIN_OCR>\n{truncated_ocr[:500]}\n<END_OCR>\n"
+                        f"Return a JSON with file_name, topics, languages, and ocr_contents fields."
                     )
+                    # Append the custom prompt only when it is short (under 100 characters)
+                    if custom_prompt and len(custom_prompt) < 100:
+                        simplified_prompt += f"\n{custom_prompt}"
                     logger.info(f"Trying simplified prompt approach")
                     chat_response = self.client.chat.parse(
                         ],
                         response_format=StructuredOCRModel,
                         temperature=0,
+                        timeout_ms=30000  # Very short timeout for simplified approach (30 seconds)
                     )
                     logger.info(f"Simplified prompt approach succeeded")
                     logger.warning(f"Simplified prompt failed: {str(second_e)}. Trying minimal prompt.")
                     try:
+                        # Minimal prompt focusing only on OCR task
                         minimal_prompt = (
+                            f"Extract the text from this image. "
+                            f"Return JSON with file_name, topics, languages, and ocr_contents.raw_text fields."
                         )
                         logger.info(f"Trying minimal prompt with image-only focus")
                             ],
                             response_format=StructuredOCRModel,
                             temperature=0,
+                            timeout_ms=25000  # Minimal timeout for last attempt (25 seconds)
                         )
                         logger.info(f"Minimal prompt approach succeeded")
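The three nested `try` blocks above form a fallback chain: enhanced, simplified, then minimal prompt, each with a shorter timeout (45 s, 30 s, 25 s). The same control flow can be sketched as a flat loop. Here `call_model` stands in for the `self.client.chat.parse(...)` call and is hypothetical, as is the name `run_with_fallbacks`.

```python
from typing import Callable, Optional, Sequence, Tuple


def run_with_fallbacks(
    attempts: Sequence[Tuple[str, int]],
    call_model: Callable[[str, int], dict],
) -> dict:
    """Try each (prompt, timeout_ms) pair in order and return the first success."""
    last_error: Optional[Exception] = None
    for prompt, timeout_ms in attempts:
        try:
            return call_model(prompt, timeout_ms)
        except Exception as exc:  # broad catch mirrors the nested try/except blocks
            last_error = exc
    # Every attempt failed; surface the last underlying error as the cause.
    raise RuntimeError("all prompt attempts failed") from last_error
```

A flat list of attempts makes it easier to add or reorder fallback stages without deepening the indentation.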
                 'api_response_time': time.time() - start_time
             }
+            # Flag when custom prompt has been successfully applied
+            if custom_prompt:
+                result['custom_prompt_applied'] = 'vision_model'
+                # Attempt to detect document type from custom prompt
+                if "DOCUMENT TYPE:" in custom_prompt:
+                    doc_type_line = custom_prompt.split("\n")[0]
+                    if "DOCUMENT TYPE:" in doc_type_line:
+                        custom_doc_type = doc_type_line.split("DOCUMENT TYPE:")[1].strip().lower()
+                        result['detected_document_type'] = custom_doc_type
+                # Keyword-based detection as fallback
+                elif any(keyword in custom_prompt.lower() for keyword in ["newspaper", "column", "article", "magazine"]):
+                    result['detected_document_type'] = "newspaper"
+                elif any(keyword in custom_prompt.lower() for keyword in ["letter", "correspondence", "handwritten"]):
+                    result['detected_document_type'] = "letter"
+                elif any(keyword in custom_prompt.lower() for keyword in ["book", "publication"]):
+                    result['detected_document_type'] = "book"
+                elif any(keyword in custom_prompt.lower() for keyword in ["form", "certificate", "legal"]):
+                    result['detected_document_type'] = "form"
+                elif any(keyword in custom_prompt.lower() for keyword in ["recipe", "ingredients"]):
+                    result['detected_document_type'] = "recipe"
+                elif "this is a" in custom_prompt.lower():
+                    # Extract document type from "This is a [type]" format
+                    this_is_parts = custom_prompt.lower().split("this is a ")
+                    if len(this_is_parts) > 1:
+                        extracted_type = this_is_parts[1].split(".")[0].strip()
+                        if extracted_type:
+                            result['detected_document_type'] = extracted_type
             # Add confidence score if not present
             if 'confidence_score' not in result:
                 result['confidence_score'] = 0.92  # Vision model typically has higher confidence
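The document-type detection above repeats logic that also appears in the text-only path. A sketch of one shared helper is given below, assuming the same `DOCUMENT TYPE:` first-line convention and the same keyword groups. The name `detect_type_from_prompt` is hypothetical. Unlike the code above, this variant falls through to the keyword check when the marker is not on the first line, and it omits the "this is a ..." extraction.

```python
from typing import Optional

# Keyword groups copied from the diff; order determines precedence.
KEYWORD_GROUPS = [
    ("newspaper", ("newspaper", "column", "article", "magazine")),
    ("letter", ("letter", "correspondence", "handwritten")),
    ("book", ("book", "publication")),
    ("form", ("form", "certificate", "legal")),
    ("recipe", ("recipe", "ingredients")),
]


def detect_type_from_prompt(custom_prompt: str) -> Optional[str]:
    """Return a document type inferred from the custom prompt, or None."""
    # An explicit marker on the first line takes priority.
    first_line = custom_prompt.split("\n")[0]
    if "DOCUMENT TYPE:" in first_line:
        explicit = first_line.split("DOCUMENT TYPE:")[1].strip().lower()
        if explicit:
            return explicit

    # Otherwise fall back to substring keyword matching.
    lowered = custom_prompt.lower()
    for doc_type, keywords in KEYWORD_GROUPS:
        if any(keyword in lowered for keyword in keywords):
            return doc_type
    return None
```

Both call sites could then store the helper's return value in `result['detected_document_type']` when it is not `None`.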
     def _build_enhanced_prompt(self, doc_type: str, ocr_text: str, custom_prompt: Optional[str]) -> str:
         """
+        Build an optimized prompt focused on OCR accuracy with specialized attention to
+        historical typography, manuscript conventions, and document deterioration patterns.
         Args:
             doc_type: Detected document type
             custom_prompt: User-provided custom prompt
         Returns:
+            Optimized prompt focused on text extraction with historical document expertise
         """
         # Generic document section (included in all prompts)
         generic_section = (
+            f"This is a document's OCR text:\n"
             f"<BEGIN_OCR>\n{ocr_text}\n<END_OCR>\n\n"
         )
+        # Check if custom prompt contains document type information
+        has_custom_doc_type = False
+        custom_doc_type = ""
+        if custom_prompt and "DOCUMENT TYPE:" in custom_prompt:
+            # Extract the document type from the custom prompt
+            doc_type_line = custom_prompt.split("\n")[0]
+            if "DOCUMENT TYPE:" in doc_type_line:
+                custom_doc_type = doc_type_line.split("DOCUMENT TYPE:")[1].strip()
+                has_custom_doc_type = True
+                # If we have a custom doc type, use it instead
+                if custom_doc_type:
+                    doc_type = custom_doc_type.lower()
+        # If user has provided detailed instructions, provide more elaborate prompting
+        if custom_prompt and (has_custom_doc_type or len(custom_prompt.strip()) > 20):
+            # Enhanced prompt for documents with custom instructions and historical expertise
             specific_section = (
+                f"You are an advanced OCR specialist with expertise in historical documents, typography, and manuscript conventions. "
+                f"Below is a document that requires specialized analysis with attention to historical characteristics. "
+                f"Pay particular attention to:\n"
+                f"- Historical typography features (long s 'ſ', ligatures, obsolete letter forms)\n"
+                f"- Manuscript conventions of the period (abbreviations, contractions, marginalia)\n"
+                f"- Document deterioration patterns (faded ink, foxing, water damage, paper degradation)\n"
+                f"- Accurately capturing ALL text content visible in the image with historical context\n"
+                f"- Following the specific user instructions for processing this document type\n"
+                f"- Identifying key information, structure, and historical formatting conventions\n"
+                f"- Providing comprehensive analysis with attention to historical context\n"
             )
+            # Add specialized instructions based on document type
+            if doc_type == "newspaper":
+                specific_section += (
+                    f"\nThis appears to be a newspaper or document with columns. "
+                    f"Please read each column from top to bottom, then move to the next column. "
+                    f"Extract all article titles, headings, bylines, and body text in the correct reading order. "
+                    f"Pay special attention to section headers, page numbers, publication date, and newspaper name. "
+                    f"For historical newspapers, be aware of period-specific typography such as the long s (ſ), "
+                    f"unique ligatures (æ, œ, ct, st), and decorative fonts. Account for paper degradation around "
+                    f"fold lines and edges. Recognize archaic abbreviations and typesetting conventions of the period.\n"
+                )
+            elif doc_type == "letter":
+                specific_section += (
+                    f"\nThis appears to be a letter or correspondence. "
+                    f"Pay special attention to the letterhead, date, greeting, body content, closing, and signature. "
+                    f"Preserve the original formatting including paragraph breaks and indentation. "
+                    f"Note any handwritten annotations or marginalia separately. "
+                    f"For historical letters, carefully transcribe historical scripts and handwriting styles, "
+                    f"noting unclear or damaged sections. Identify period-specific salutations, closings, and "
+                    f"formalities. Watch for ink fading, bleeding, and seepage through pages. "
+                    f"Recognize period-specific abbreviations (ye, yr, inst, ult, prox) and long s (ſ) in older printed correspondence.\n"
+                )
+            elif doc_type == "book":
+                specific_section += (
+                    f"\nThis appears to be a book or publication page. "
+                    f"Pay attention to chapter titles, headers, page numbers, footnotes, and main body text. "
+                    f"Preserve paragraph structure and any special formatting. "
+                    f"Note any images, tables, or figures that might be referenced in the text. "
+                    f"For historical books, attend to period typography including the long s (ſ), ligatures (æ, œ, ct, ſt), "
+                    f"archaic letter forms, and decorative initials/drop caps. Account for foxing (brown spotting), "
+                    f"bleed-through from opposite pages, and binding damage. Recognize period-specific typographic "
+                    f"conventions like catchwords, signatures, obsolete punctuation, and historical spelling variants "
+                    f"(e.g., -ize/-ise, past tense 'd for -ed). Note bookplates, ownership marks, and marginalia.\n"
+                )
+            elif doc_type == "form":
+                specific_section += (
+                    f"\nThis appears to be a form or legal document. "
+                    f"Carefully extract all field labels and their corresponding values. "
+                    f"Preserve the structure of form fields and sections. "
+                    f"Pay special attention to signature lines, dates, and any official markings. "
+                    f"For historical forms and legal documents, recognize period-specific legal terminology and "
+                    f"formulaic phrases. Note seals, stamps, watermarks, and official emblems. Watch for faded ink "
+                    f"in signatures and filled fields. Identify period handwriting styles in completed sections. "
+                    f"Account for specialized legal abbreviations (e.g., SS., Esq., inst., wit.) and archaic "
+                    f"measurement units. Note folding patterns and worn edges common in frequently handled legal documents.\n"
+                )
+            elif doc_type == "recipe":
+                specific_section += (
+                    f"\nThis appears to be a recipe or food-related document. "
+                    f"Extract the recipe title, ingredient list (with measurements), preparation steps, "
+                    f"cooking times, serving information, and any notes or tips. "
+                    f"Maintain the distinction between ingredients and preparation instructions. "
+                    f"For historical recipes, attend to archaic measurements (gill, dram, peck, firkin), obsolete "
+                    f"cooking terminology, and period-specific ingredients and their modern equivalents. Note handwritten "
+                    f"annotations and personal modifications. Identify period-specific cooking methods and tools that "
+                    f"might need explanation. Watch for liquid stains and food residue common on well-used recipe pages. "
+                    f"Recognize unclear fractions and temperature instructions (e.g., 'slow oven', 'quick fire').\n"
+                )
+            # Output instructions (enhanced for custom requests)
+            output_section = (
+                f"Create a detailed structured JSON response with the following fields:\n"
+                f"- file_name: The document's name\n"
+                f"- topics: An array of specific topics, themes, or subjects covered in the document\n"
+                f"- languages: An array of languages used in the document\n"
+                f"- ocr_contents: A comprehensive dictionary with the document's contents including:\n"
+                f"  * title: The main title or heading\n"
+                f"  * subtitle: Any subtitle or secondary heading (if present)\n"
+                f"  * date: Publication or document date (if present)\n"
+                f"  * author: Author or creator information (if present)\n"
+                f"  * content: The main body content, properly formatted\n"
+                f"  * additional sections as appropriate for this document type\n"
+                f"  * raw_text: The complete OCR text\n"
             )
+        else:
+            # Default processing with basic historical document awareness
             specific_section = (
+                f"You are an OCR specialist with knowledge of historical documents and typography. "
+                f"Focus on accurately extracting text content with attention to historical features. "
+                f"Pay special attention to:\n"
+                f"- Accurately capturing ALL text content visible in the image\n"
+                f"- Maintaining the correct reading order and structure\n"
+                f"- Preserving paragraph breaks and text layout\n"
+                f"- Identifying the main document type, time period, and language\n"
+                f"- Recognizing historical typography features (long s 'ſ', ligatures, archaic characters)\n"
+                f"- Accounting for document deterioration (faded ink, stains, foxing, physical damage)\n"
             )
+            # Only add specialized instructions for newspapers with columns
+            if doc_type == "newspaper":
+                specific_section += (
+                    f"\nThis appears to be a document with columns. "
+                    f"Be sure to read each column from top to bottom, then move to the next column. "
+                    f"Extract all article titles, headings, and body text.\n"
+                )
+            # Simple output instructions for default cases
+            output_section = (
+                f"Create a structured JSON response with the following fields:\n"
+                f"- file_name: The document's name\n"
+                f"- topics: An array of topics covered in the document\n"
+                f"- languages: An array of languages used in the document\n"
+                f"- ocr_contents: A dictionary with the document's contents, with the focus on complete text extraction\n"
             )
         # Add custom prompt if provided
         custom_section = ""
         if custom_prompt:
+            # Process custom prompt to extract just the instructions part if available
+            if "USER INSTRUCTIONS:" in custom_prompt:
+                instructions_part = custom_prompt.split("USER INSTRUCTIONS:", 1)[1].strip()
+                custom_section = f"\n\nUser-provided instructions: {instructions_part}\n"
+            elif "INSTRUCTIONS:" in custom_prompt:
+                instructions_part = custom_prompt.split("INSTRUCTIONS:", 1)[1].strip()
+                custom_section = f"\n\nUser-provided instructions: {instructions_part}\n"
+            else:
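+                # Strip lead-in phrases; the trailing space keeps "This is an article" from becoming "n article"
+                stripped_prompt = custom_prompt
+                for phrase in ("This is an ", "This is a ", "It appears to be an ", "It appears to be a "):
+                    stripped_prompt = stripped_prompt.replace(phrase, "")
+                custom_section = f"\n\nUser-provided instructions: {stripped_prompt.strip()}\n"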
         # Combine all sections into complete prompt
         return generic_section + specific_section + output_section + custom_section
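Review note: the custom-prompt branch above can be read as a small pure function, which makes the marker handling easier to test in isolation. The sketch below is a hypothetical helper (`extract_instructions` is not a name from this commit), and it assumes the marker strings are exactly the ones checked in the diff. It looks for the longer marker first because "INSTRUCTIONS:" is a substring of "USER INSTRUCTIONS:", and it splits only once so that text after a repeated marker is kept.

```python
def extract_instructions(custom_prompt: str) -> str:
    """Return the instruction part of a custom prompt (hypothetical helper)."""
    # Longer marker first: "INSTRUCTIONS:" also occurs inside "USER INSTRUCTIONS:".
    for marker in ("USER INSTRUCTIONS:", "INSTRUCTIONS:"):
        if marker in custom_prompt:
            # maxsplit=1 keeps everything after the first occurrence of the marker.
            return custom_prompt.split(marker, 1)[1].strip()

    # No marker found: drop descriptive lead-in phrases instead.
    # The trailing space in each phrase avoids leaving a stray "n" from "This is an".
    stripped = custom_prompt
    for phrase in ("This is an ", "This is a ",
                   "It appears to be an ", "It appears to be a "):
        stripped = stripped.replace(phrase, "")
    return stripped.strip()


if __name__ == "__main__":
    print(extract_instructions("DOCUMENT TYPE: letter\nUSER INSTRUCTIONS: Extract all dates."))
```

The result would then be interpolated into `custom_section` in the same way as in the diff.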
                     result['model_used'] = TEXT_MODEL
                     result['processing_time'] = time.time() - start_time
+                    # Flag when custom prompt has been successfully applied
+                    if custom_prompt:
+                        result['custom_prompt_applied'] = 'text_model'
+                        # Attempt to detect document type from custom prompt
+                        if "DOCUMENT TYPE:" in custom_prompt:
+                            # Use the first line carrying the marker, which need not be the prompt's first line
+                            doc_type_line = next(line for line in custom_prompt.split("\n") if "DOCUMENT TYPE:" in line)
+                            custom_doc_type = doc_type_line.split("DOCUMENT TYPE:", 1)[1].strip().lower()
+                            result['detected_document_type'] = custom_doc_type
+                        # Keyword-based detection as fallback
+                        elif any(keyword in custom_prompt.lower() for keyword in ["newspaper", "column", "article", "magazine"]):
+                            result['detected_document_type'] = "newspaper"
+                        elif any(keyword in custom_prompt.lower() for keyword in ["letter", "correspondence", "handwritten"]):
+                            result['detected_document_type'] = "letter"
+                        elif any(keyword in custom_prompt.lower() for keyword in ["book", "publication"]):
+                            result['detected_document_type'] = "book"
+                        elif any(keyword in custom_prompt.lower() for keyword in ["form", "certificate", "legal"]):
+                            result['detected_document_type'] = "form"
+                        elif any(keyword in custom_prompt.lower() for keyword in ["recipe", "ingredients"]):
+                            result['detected_document_type'] = "recipe"
+                        elif "this is a" in custom_prompt.lower():
+                            # Extract document type from "This is a [type]" format
+                            this_is_parts = custom_prompt.lower().split("this is a ")
+                            if len(this_is_parts) > 1:
+                                extracted_type = this_is_parts[1].split(".")[0].strip()
+                                if extracted_type:
+                                    result['detected_document_type'] = extracted_type
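Review note: the keyword fallback above uses plain substring checks, so a prompt containing "information" matches "form" and "notebook" matches "book". One possible variant, sketched here under assumptions, keeps the same keyword groups and priority order but matches on word boundaries. The names `DOC_TYPE_KEYWORDS` and `detect_document_type` are hypothetical and do not appear in this commit; unlike the diff, the sketch also accepts "this is an" in the final fallback.

```python
import re
from typing import Optional

# Hypothetical table mirroring the keyword groups and their order in the diff.
DOC_TYPE_KEYWORDS = {
    "newspaper": ("newspaper", "column", "article", "magazine"),
    "letter": ("letter", "correspondence", "handwritten"),
    "book": ("book", "publication"),
    "form": ("form", "certificate", "legal"),
    "recipe": ("recipe", "ingredients"),
}


def detect_document_type(custom_prompt: str) -> Optional[str]:
    """Guess a document type from a custom prompt, or return None."""
    text = custom_prompt.lower()

    # 1. An explicit "DOCUMENT TYPE: ..." marker wins.
    explicit = re.search(r"document type:\s*([^\n]+)", text)
    if explicit:
        return explicit.group(1).strip()

    # 2. Keyword groups, checked in insertion order; \b restricts matches to
    #    whole words, and the optional "s" allows simple plurals.
    for doc_type, keywords in DOC_TYPE_KEYWORDS.items():
        if any(re.search(rf"\b{re.escape(k)}s?\b", text) for k in keywords):
            return doc_type

    # 3. Fall back to the "This is a [type]." phrasing.
    phrase = re.search(r"this is an? ([^.\n]+)", text)
    if phrase:
        return phrase.group(1).strip()
    return None
```

Whether whole-word matching is preferable depends on how free-form the prompts are; it trades a few missed matches on compound words for fewer false positives.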
                     # Add raw text for reference if not already present
                     if 'ocr_contents' in result and 'raw_text' not in result['ocr_contents']:
                         # Add truncated raw text if very large