OCR Utilities

This directory contains utility modules for the Historical OCR project.

PDF OCR Processing

The pdf_ocr.py module converts PDF documents to page images and runs OCR on them, falling back to structured_ocr.py when the conversion fails.

Features

  • Robust PDF-to-Image Conversion: Converts PDF documents to images using optimized settings before OCR processing
  • Multi-Page Support: Handles multi-page documents and can restrict processing to specific pages or page ranges
  • Memory-Efficient Processing: Processes PDFs in batches to prevent memory issues with large documents
  • Fallback Mechanism: Falls back to structured_ocr's internal processing if direct conversion fails
  • Cleanup Management: Automatically cleans up temporary files after processing
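
The batching idea behind memory-efficient processing can be sketched in a few lines. This is a minimal illustration and not the module's actual code: `page_batches` and `batch_size` are hypothetical names. The 1-based inclusive ranges it yields have the shape accepted by pdf2image's `convert_from_path(..., first_page=..., last_page=...)`, so only one batch of page images would need to be held in memory at a time.

```python
from typing import Iterator, Tuple


def page_batches(total_pages: int, batch_size: int = 5) -> Iterator[Tuple[int, int]]:
    """Yield (first_page, last_page) ranges, 1-based and inclusive.

    Hypothetical helper: the real module may batch pages differently.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        # Clamp the last page so the final batch does not run past the document.
        yield first, min(first + batch_size - 1, total_pages)


# Example: a 12-page document split into batches of up to 5 pages.
batches = list(page_batches(12, batch_size=5))
print(batches)
```

Each range could then be converted and passed to OCR before the next batch is loaded, which keeps peak memory roughly proportional to the batch size rather than to the document length.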

Key Components

  • PDFOCR: Main class for processing PDF files with OCR
  • PDFConversionResult: Helper class that holds PDF conversion results and manages cleanup
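
To illustrate the cleanup role described for PDFConversionResult, here is a hedged sketch of a result holder that owns a temporary directory and removes it when it goes out of scope. The class name `ConversionResultSketch`, its attributes, and its methods are all hypothetical; the real class may be organized quite differently.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


class ConversionResultSketch:
    """Hypothetical holder for converted page images and their temp directory."""

    def __init__(self) -> None:
        # Each result owns its own temporary directory.
        self.temp_dir = Path(tempfile.mkdtemp(prefix="pdf_ocr_"))
        self.image_paths: List[Path] = []

    def add_image(self, name: str, data: bytes) -> Path:
        """Write one page image into the temp directory and track its path."""
        path = self.temp_dir / name
        path.write_bytes(data)
        self.image_paths.append(path)
        return path

    def cleanup(self) -> None:
        """Delete the temp directory and everything in it."""
        shutil.rmtree(self.temp_dir, ignore_errors=True)
        self.image_paths.clear()

    def __enter__(self) -> "ConversionResultSketch":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Clean up even if OCR raised; do not swallow the exception.
        self.cleanup()
        return False


# Example: files exist inside the block and are gone afterwards.
with ConversionResultSketch() as result:
    page = result.add_image("page_1.png", b"placeholder bytes")
    temp_dir = result.temp_dir
    existed_inside = page.exists()
existed_after = temp_dir.exists()
```

Using a context manager means temporary files are removed even when an error interrupts processing.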

Basic Usage

from utils.pdf_ocr import PDFOCR

# Initialize the processor
processor = PDFOCR()

# Process a PDF file (all pages, with vision model)
result = processor.process_pdf('document.pdf')

# Process a PDF file (specific pages, with vision model)
result = processor.process_pdf('document.pdf', custom_pages=[1, 3, 5])

# Process a PDF file (first N pages, without vision model)
result = processor.process_pdf('document.pdf', max_pages=3, use_vision=False)

# Process a PDF file with custom prompt
result = processor.process_pdf(
    'document.pdf',
    custom_prompt="This is a historical newspaper with multiple columns."
)

# Save results to JSON
output_path = processor.save_json_output('document.pdf', 'results.json')

Command Line Usage

The module can also be used directly from the command line:

python utils/pdf_ocr.py document.pdf --output results.json
python utils/pdf_ocr.py document.pdf --max-pages 3
python utils/pdf_ocr.py document.pdf --pages 1,3,5
python utils/pdf_ocr.py document.pdf --prompt "This is a historical newspaper with multiple columns."
python utils/pdf_ocr.py document.pdf --no-vision
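
For readers who want to see how flags like these map onto the Python parameters, the following is a minimal argparse sketch. It mirrors the flags shown above but is not the module's actual parser: `build_parser` and `parse_pages` are hypothetical names, and the real defaults and help strings may differ.

```python
import argparse
from typing import List


def parse_pages(value: str) -> List[int]:
    """Turn a comma-separated string such as '1,3,5' into [1, 3, 5]."""
    return [int(part) for part in value.split(",") if part.strip()]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command line flags."""
    parser = argparse.ArgumentParser(description="Process a PDF with OCR")
    parser.add_argument("pdf_path", help="Path to the PDF file")
    parser.add_argument("--output", default=None, help="Write results to this JSON file")
    parser.add_argument("--max-pages", type=int, default=None, help="Process at most this many pages")
    parser.add_argument("--pages", type=parse_pages, default=None, help="Comma-separated 1-based page numbers")
    parser.add_argument("--prompt", default=None, help="Custom instructions for OCR processing")
    parser.add_argument("--no-vision", action="store_true", help="Disable the vision model")
    return parser


# Example: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["document.pdf", "--pages", "1,3,5", "--no-vision"])
use_vision = not args.no_vision  # --no-vision inverts the use_vision default
print(args.pdf_path, args.pages, use_vision)
```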

How It Works

  1. The module first attempts to convert the PDF to images using pdf2image
  2. If use_vision is enabled, it processes the first page with the vision model for detailed analysis
  3. Additional pages are processed with the text model for efficiency
  4. The text from all pages is combined into a single result with appropriate metadata
  5. If the pdf2image conversion in step 1 fails, the module falls back to structured_ocr.py for PDF processing
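
The control flow in the steps above can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: `run_pipeline` is a hypothetical function, the conversion, OCR, and fallback steps are passed in as plain callables, and the result keys are invented for the example. It also assumes that with vision disabled the first page goes through the text model like the rest.

```python
from typing import Callable, Dict, List, Sequence


def run_pipeline(
    pdf_path: str,
    convert: Callable[[str], Sequence[str]],
    ocr_vision: Callable[[str], str],
    ocr_text: Callable[[str], str],
    fallback: Callable[[str], Dict],
    use_vision: bool = True,
) -> Dict:
    """Hypothetical sketch: convert, OCR page by page, fall back on failure."""
    try:
        images = convert(pdf_path)  # step 1: PDF to page images
    except Exception:
        return fallback(pdf_path)  # step 5: conversion failed

    texts: List[str] = []
    for index, image in enumerate(images):
        # Steps 2 and 3: vision model for the first page only, text model otherwise.
        if use_vision and index == 0:
            texts.append(ocr_vision(image))
        else:
            texts.append(ocr_text(image))

    # Step 4: combine all page text into one result with simple metadata.
    return {"text": "\n\n".join(texts), "pages_processed": len(texts), "fallback": False}


def failing_convert(path: str) -> Sequence[str]:
    raise RuntimeError("conversion failed")


def stub_fallback(path: str) -> Dict:
    return {"text": "", "pages_processed": 0, "fallback": True}


# Example with stub callables standing in for the real converters and models.
result = run_pipeline(
    "document.pdf",
    convert=lambda path: ["page1.png", "page2.png"],
    ocr_vision=lambda image: f"vision:{image}",
    ocr_text=lambda image: f"text:{image}",
    fallback=stub_fallback,
)
fallback_result = run_pipeline(
    "document.pdf",
    convert=failing_convert,
    ocr_vision=lambda image: "",
    ocr_text=lambda image: "",
    fallback=stub_fallback,
)
```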

Parameters

  • pdf_path: Path to the PDF file to process
  • use_vision: Whether to use the vision model for improved analysis (default: True)
  • max_pages: Maximum number of pages to process (default: all pages)
  • custom_pages: Specific page numbers to process, 1-based indexing (e.g., [1, 3, 5])
  • custom_prompt: Custom instructions for OCR processing
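
As a rough illustration of how max_pages and custom_pages might interact, the sketch below resolves them into a concrete list of 1-based page numbers. The function `select_pages` is hypothetical, and two behaviors are assumptions rather than documented facts: custom_pages takes precedence over max_pages, and page numbers outside the document are silently dropped.

```python
from typing import List, Optional, Sequence


def select_pages(
    total_pages: int,
    max_pages: Optional[int] = None,
    custom_pages: Optional[Sequence[int]] = None,
) -> List[int]:
    """Return the 1-based page numbers to process (hypothetical helper)."""
    if custom_pages:
        # Assumption: explicit pages win; drop duplicates and out-of-range values.
        return sorted({page for page in custom_pages if 1 <= page <= total_pages})
    if max_pages is not None:
        # First N pages, capped at the document length.
        return list(range(1, min(max_pages, total_pages) + 1))
    # Default: every page in the document.
    return list(range(1, total_pages + 1))


# Examples for a 6-page document.
all_pages = select_pages(6)
first_three = select_pages(6, max_pages=3)
chosen = select_pages(6, custom_pages=[1, 3, 5, 9])
print(all_pages, first_three, chosen)
```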