OCR Utilities
This directory contains utility modules for the Historical OCR project.
PDF OCR Processing
The pdf_ocr.py module provides specialized functionality for processing PDF documents with OCR.
Features
- Robust PDF-to-Image Conversion: Converts PDF documents to images using optimized settings before OCR processing
- Multi-Page Support: Handles multi-page documents and can process specific pages or page ranges
- Memory-Efficient Processing: Processes PDFs in batches to prevent memory issues with large documents
- Fallback Mechanism: Falls back to structured_ocr's internal processing if direct conversion fails
- Cleanup Management: Automatically cleans up temporary files after processing
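As a rough illustration of the batching idea, the sketch below splits a document into inclusive, 1-based page ranges. The helper name page_batches and the batch size are hypothetical and are not taken from pdf_ocr.py; the ranges are shaped so that they could be handed to a converter that accepts a first and last page, such as the first_page and last_page arguments of pdf2image's convert_from_path.

```python
from typing import Iterator, Tuple


def page_batches(total_pages: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive, 1-based (first_page, last_page) ranges covering the document."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


# Each range could be converted and OCR'd on its own, so that only
# batch_size page images need to be held in memory at any one time.
for first, last in page_batches(total_pages=10, batch_size=4):
    print(f"convert pages {first}-{last}")
```

The last batch is simply shorter when the page count is not a multiple of the batch size.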
Key Components
- PDFOCR: Main class for processing PDF files with OCR
- PDFConversionResult: Helper class that holds PDF conversion results and manages cleanup
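To make the role of a result holder with cleanup more concrete, here is a minimal sketch of one way such a helper could look. ConversionResultSketch, its fields, and the context-manager interface are assumptions for illustration only; the actual PDFConversionResult class may be organized differently.

```python
import shutil
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class ConversionResultSketch:
    """Hypothetical stand-in: page image paths plus the temp dir that holds them."""
    temp_dir: Path
    image_paths: List[Path] = field(default_factory=list)

    def cleanup(self) -> None:
        # Remove the temporary directory and its contents; safe to call twice.
        shutil.rmtree(self.temp_dir, ignore_errors=True)

    def __enter__(self) -> "ConversionResultSketch":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.cleanup()


with ConversionResultSketch(Path(tempfile.mkdtemp(prefix="pdf_ocr_"))) as result:
    page = result.temp_dir / "page_1.png"
    page.write_bytes(b"")  # placeholder for a rendered page image
    result.image_paths.append(page)

# Once the with-block exits, the temporary directory has been removed.
```

Tying cleanup to a context manager means temporary files are removed even if OCR raises partway through.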
Basic Usage
from utils.pdf_ocr import PDFOCR
# Initialize the processor
processor = PDFOCR()
# Process a PDF file (all pages, with vision model)
result = processor.process_pdf('document.pdf')
# Process a PDF file (specific pages, with vision model)
result = processor.process_pdf('document.pdf', custom_pages=[1, 3, 5])
# Process a PDF file (first N pages, without vision model)
result = processor.process_pdf('document.pdf', max_pages=3, use_vision=False)
# Process a PDF file with custom prompt
result = processor.process_pdf(
    'document.pdf',
    custom_prompt="This is a historical newspaper with multiple columns."
)
# Save results to JSON
output_path = processor.save_json_output('document.pdf', 'results.json')
Command Line Usage
The module can also be used directly from the command line:
python utils/pdf_ocr.py document.pdf --output results.json
python utils/pdf_ocr.py document.pdf --max-pages 3
python utils/pdf_ocr.py document.pdf --pages 1,3,5
python utils/pdf_ocr.py document.pdf --prompt "This is a historical newspaper with multiple columns."
python utils/pdf_ocr.py document.pdf --no-vision
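For readers curious how flags like these are typically wired up, the following is a sketch using the standard argparse module. The flag names come from the commands above; the helper names build_parser and parse_pages, and the way values are mapped to attributes, are assumptions and may not match the module's real parser.

```python
import argparse
from typing import List


def parse_pages(value: str) -> List[int]:
    """Turn a comma-separated string such as '1,3,5' into a list of ints."""
    return [int(part) for part in value.split(",") if part.strip()]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="OCR a PDF file")
    parser.add_argument("pdf_path")
    parser.add_argument("--output")
    parser.add_argument("--max-pages", type=int)
    parser.add_argument("--pages", type=parse_pages)
    parser.add_argument("--prompt")
    parser.add_argument("--no-vision", action="store_true")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["document.pdf", "--pages", "1,3,5", "--no-vision"])
use_vision = not args.no_vision
```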
How It Works
- The module first attempts to convert the PDF to images using pdf2image
- It processes the first page with the vision model (if requested) for detailed analysis
- Additional pages are processed with the text model for efficiency
- All text is combined into a single result with appropriate metadata
- If direct conversion fails, it falls back to using structured_ocr.py for PDF processing
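The control flow described above can be sketched roughly as follows. This is not the module's implementation: ocr_with_fallback and all of its callable parameters are hypothetical stand-ins for the conversion step, the vision and text models, and the structured_ocr.py fallback, injected as plain functions so the flow can be run without any OCR backend.

```python
from typing import Callable, Dict, List, Sequence


def ocr_with_fallback(
    pdf_path: str,
    convert: Callable[[str], Sequence[str]],
    ocr_vision: Callable[[str], str],
    ocr_text: Callable[[str], str],
    fallback: Callable[[str], Dict],
    use_vision: bool = True,
) -> Dict:
    try:
        images = convert(pdf_path)
    except Exception:
        # Direct conversion failed: hand the whole PDF to the fallback path.
        return fallback(pdf_path)

    texts: List[str] = []
    for index, image in enumerate(images):
        # Only the first page goes to the vision model, and only if requested.
        reader = ocr_vision if (index == 0 and use_vision) else ocr_text
        texts.append(reader(image))
    # Combine all page text into one result with simple metadata.
    return {"text": "\n\n".join(texts), "pages_processed": len(texts)}


def broken_convert(path: str) -> Sequence[str]:
    raise RuntimeError("conversion failed")


def stub_fallback(path: str) -> Dict:
    return {"text": "", "pages_processed": 0, "fallback": True}


result = ocr_with_fallback(
    "document.pdf",
    convert=lambda path: ["page_1.png", "page_2.png"],
    ocr_vision=lambda image: f"vision:{image}",
    ocr_text=lambda image: f"text:{image}",
    fallback=stub_fallback,
)
fallback_result = ocr_with_fallback(
    "document.pdf", broken_convert, str, str, stub_fallback
)
```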
Parameters
- pdf_path: Path to the PDF file to process
- use_vision: Whether to use the vision model for improved analysis (default: True)
- max_pages: Maximum number of pages to process (default: all pages)
- custom_pages: Specific page numbers to process, 1-based indexing (e.g., [1, 3, 5])
- custom_prompt: Custom instructions for OCR processing
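One way max_pages and custom_pages could be resolved into concrete pages is sketched below. The select_pages helper is hypothetical, and the rules it applies (custom_pages taking precedence over max_pages, out-of-range page numbers being dropped) are assumptions that the source does not confirm.

```python
from typing import List, Optional, Sequence


def select_pages(
    total_pages: int,
    max_pages: Optional[int] = None,
    custom_pages: Optional[Sequence[int]] = None,
) -> List[int]:
    """Return 0-based indices of the pages to process."""
    if custom_pages:
        # Convert 1-based page numbers to 0-based indices, dropping out-of-range ones.
        return [p - 1 for p in custom_pages if 1 <= p <= total_pages]
    if max_pages is not None:
        # Take the first max_pages pages, capped at the document length.
        return list(range(min(max_pages, total_pages)))
    # Default: every page in the document.
    return list(range(total_pages))


pages = select_pages(total_pages=10, custom_pages=[1, 3, 5])
```

Converting to 0-based indices in one place keeps the 1-based convention confined to the public parameters.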