OCR Utilities

This directory contains utility modules for the Historical OCR project.

PDF OCR Processing

The pdf_ocr.py module provides specialized functionality for processing PDF documents with OCR.

Features

Robust PDF-to-Image Conversion: Converts PDF documents to images using optimized settings before OCR processing
Multi-Page Support: Handles multi-page documents and can process either specific pages (custom_pages) or a limited number of pages (max_pages)
Memory-Efficient Processing: Processes PDFs in batches to prevent memory issues with large documents
Fallback Mechanism: Falls back to structured_ocr's internal processing if direct conversion fails
Cleanup Management: Automatically cleans up temporary files after processing
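
As a rough illustration of the batching idea, the sketch below splits a document into 1-based page ranges so that only one batch of page images needs to be held in memory at a time. The function name page_batches and the default batch size are assumptions made for this example and are not taken from pdf_ocr.py.

```python
from typing import Iterator, Tuple


def page_batches(total_pages: int, batch_size: int = 5) -> Iterator[Tuple[int, int]]:
    """Yield inclusive, 1-based (first_page, last_page) ranges covering the document."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


# Each range could then be handed to pdf2image, for example:
#   from pdf2image import convert_from_path
#   images = convert_from_path(pdf_path, first_page=first, last_page=last)
# so that only one batch of page images is in memory at a time.
print(list(page_batches(12, batch_size=5)))  # [(1, 5), (6, 10), (11, 12)]
```

The batch size that works best depends on page resolution and available memory, so it is left as a parameter here.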

Key Components

PDFOCR: Main class for processing PDF files with OCR
PDFConversionResult: Helper class that holds PDF conversion results and manages cleanup
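
To make the "holds results and manages cleanup" role more concrete, here is a minimal sketch of how such a helper could be built as a context manager. ConversionResultSketch, its fields, and the temporary directory prefix are hypothetical; the real PDFConversionResult may be organized differently.

```python
import shutil
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class ConversionResultSketch:
    """Hypothetical stand-in for PDFConversionResult; not the module's actual class."""
    temp_dir: Path
    image_paths: List[Path] = field(default_factory=list)

    def cleanup(self) -> None:
        # Remove the temporary directory and its contents; safe to call more than once.
        shutil.rmtree(self.temp_dir, ignore_errors=True)

    def __enter__(self) -> "ConversionResultSketch":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Clean up even if OCR processing raised an exception.
        self.cleanup()


temp_dir = Path(tempfile.mkdtemp(prefix="pdf_ocr_"))
with ConversionResultSketch(temp_dir) as result:
    page = temp_dir / "page_1.png"
    page.write_bytes(b"")  # placeholder for a rendered page image
    result.image_paths.append(page)

print(temp_dir.exists())  # False: the directory is removed on exit
```

Tying cleanup to a with block means temporary images are removed even when processing fails partway through.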

Basic Usage

from utils.pdf_ocr import PDFOCR

# Initialize the processor
processor = PDFOCR()

# Process a PDF file (all pages; the vision model is enabled by default)
result = processor.process_pdf('document.pdf')

# Process a PDF file (specific pages, 1-based; vision model enabled)
result = processor.process_pdf('document.pdf', custom_pages=[1, 3, 5])

# Process a PDF file (first N pages, without vision model)
result = processor.process_pdf('document.pdf', max_pages=3, use_vision=False)

# Process a PDF file with custom prompt
result = processor.process_pdf(
    'document.pdf',
    custom_prompt="This is a historical newspaper with multiple columns."
)

# Save results to JSON
output_path = processor.save_json_output('document.pdf', 'results.json')

Command Line Usage

The module can also be used directly from the command line:

python utils/pdf_ocr.py document.pdf --output results.json
python utils/pdf_ocr.py document.pdf --max-pages 3
python utils/pdf_ocr.py document.pdf --pages 1,3,5
python utils/pdf_ocr.py document.pdf --prompt "This is a historical newspaper with multiple columns."
python utils/pdf_ocr.py document.pdf --no-vision
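
The flags above could be parsed with argparse roughly as follows. This is only a sketch: the flag names come from the commands above, but the helper names, help strings, and destination attributes are assumptions rather than the module's actual code.

```python
import argparse
from typing import List


def parse_pages(value: str) -> List[int]:
    """Turn a comma-separated string such as '1,3,5' into a list of integers."""
    return [int(part) for part in value.split(",") if part.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Process a PDF file with OCR")
    parser.add_argument("pdf_path", help="Path to the PDF file")
    parser.add_argument("--output", help="Path for the JSON output")
    parser.add_argument("--max-pages", type=int, help="Maximum number of pages to process")
    parser.add_argument("--pages", type=parse_pages, help="Comma-separated 1-based page numbers")
    parser.add_argument("--prompt", help="Custom instructions for OCR processing")
    parser.add_argument("--no-vision", action="store_true", help="Disable the vision model")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["document.pdf", "--pages", "1,3,5", "--no-vision"])
print(args.pages, args.max_pages, not args.no_vision)  # [1, 3, 5] None False
```

Under this layout, --pages would map onto custom_pages and the negation of --no-vision onto use_vision.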

How It Works

The module first attempts to convert the PDF to images using pdf2image.
If use_vision is enabled, the first page is processed with the vision model for detailed analysis.
Additional pages are processed with the text model for efficiency.
The text from all pages is combined into a single result with appropriate metadata.
If direct conversion fails, the module falls back to structured_ocr.py for PDF processing.
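
The flow described above can be sketched with placeholder callables standing in for the real converter and models. Everything here (ocr_pdf_sketch, the callable parameters, and the keys of the returned dictionary) is hypothetical, and the sketch assumes the first page is sent to the text model when the vision model is not requested.

```python
from typing import Callable, Dict, List, Sequence


def ocr_pdf_sketch(
    pdf_path: str,
    convert: Callable[[str], Sequence[str]],
    vision_ocr: Callable[[str], str],
    text_ocr: Callable[[str], str],
    fallback: Callable[[str], Dict],
    use_vision: bool = True,
) -> Dict:
    """Mirror the described flow; every callable is a hypothetical placeholder."""
    try:
        pages = convert(pdf_path)  # step 1: PDF -> page images
    except Exception:
        return fallback(pdf_path)  # step 5: direct conversion failed

    texts: List[str] = []
    for index, page in enumerate(pages):
        # Step 2: the first page goes to the vision model if requested.
        # Step 3: the remaining pages use the text model.
        ocr = vision_ocr if (index == 0 and use_vision) else text_ocr
        texts.append(ocr(page))

    # Step 4: combine all text and attach basic metadata (keys are assumed).
    return {"text": "\n\n".join(texts), "pages_processed": len(texts), "fallback": False}


def failing_convert(path: str) -> Sequence[str]:
    raise RuntimeError("conversion failed")


result = ocr_pdf_sketch(
    "document.pdf",
    convert=lambda path: ["p1.png", "p2.png"],
    vision_ocr=lambda page: f"vision:{page}",
    text_ocr=lambda page: f"text:{page}",
    fallback=lambda path: {"text": "", "pages_processed": 0, "fallback": True},
)
print(result["text"])
```

Passing the converter and models in as arguments keeps the control flow testable without pdf2image or an OCR backend installed.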

Parameters

pdf_path: Path to the PDF file to process
use_vision: Whether to use vision model for improved analysis (default: True)
max_pages: Maximum number of pages to process (default: all pages)
custom_pages: Specific page numbers to process, 1-based indexing (e.g., [1, 3, 5])
custom_prompt: Custom instructions for OCR processing
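
One plausible way max_pages and custom_pages could be resolved into the pages to process is sketched below. The function select_page_indices is hypothetical, and two behaviors are assumptions: out-of-range page numbers are silently skipped, and max_pages is applied after custom_pages when both are given.

```python
from typing import List, Optional, Sequence


def select_page_indices(
    total_pages: int,
    max_pages: Optional[int] = None,
    custom_pages: Optional[Sequence[int]] = None,
) -> List[int]:
    """Return 0-based indices of the pages to process."""
    if custom_pages:
        # Convert 1-based page numbers to 0-based indices, dropping out-of-range values.
        indices = [page - 1 for page in custom_pages if 1 <= page <= total_pages]
    else:
        indices = list(range(total_pages))
    if max_pages is not None:
        # Assumed precedence: the limit applies to whatever was selected above.
        indices = indices[:max_pages]
    return indices


print(select_page_indices(10, custom_pages=[1, 3, 5]))  # [0, 2, 4]
print(select_page_indices(10, max_pages=3))             # [0, 1, 2]
```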