# OCR Utilities
This directory contains utility modules for the Historical OCR project.
## PDF OCR Processing
The `pdf_ocr.py` module runs OCR on PDF documents.
### Features
- **Robust PDF-to-Image Conversion**: Converts PDF documents to images using optimized settings before OCR processing
- **Multi-Page Support**: Handles multi-page documents and can process all pages, the first N pages, or a specific list of pages
- **Memory-Efficient Processing**: Processes PDFs in batches to prevent memory issues with large documents
- **Fallback Mechanism**: Falls back to the internal PDF processing in `structured_ocr.py` if direct conversion fails
- **Cleanup Management**: Automatically cleans up temporary files after processing
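
The batching idea behind memory-efficient processing can be illustrated with a small helper. This is a minimal sketch, not the code in `pdf_ocr.py`: the name `batch_ranges` and the default batch size of 5 are assumptions made for illustration.

```python
from typing import Iterator, Tuple


def batch_ranges(total_pages: int, batch_size: int = 5) -> Iterator[Tuple[int, int]]:
    """Yield (first_page, last_page) pairs, 1-based and inclusive.

    Hypothetical helper: the real module may batch pages differently.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for first in range(1, total_pages + 1, batch_size):
        # The last batch may be shorter than batch_size
        yield first, min(first + batch_size - 1, total_pages)


# A 12-page document split into batches of 5 pages
print(list(batch_ranges(12, batch_size=5)))  # [(1, 5), (6, 10), (11, 12)]
```

Each pair could then be passed to `pdf2image.convert_from_path(pdf_path, first_page=first, last_page=last)`, so that only one batch of page images is held in memory at a time.
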
### Key Components
- **PDFOCR**: Main class for processing PDF files with OCR
- **PDFConversionResult**: Helper class that holds PDF conversion results and manages cleanup
### Basic Usage
```python
from utils.pdf_ocr import PDFOCR

# Initialize the processor
processor = PDFOCR()

# Process a PDF file (all pages, with vision model)
result = processor.process_pdf('document.pdf')

# Process a PDF file (specific pages, with vision model)
result = processor.process_pdf('document.pdf', custom_pages=[1, 3, 5])

# Process a PDF file (first N pages, without vision model)
result = processor.process_pdf('document.pdf', max_pages=3, use_vision=False)

# Process a PDF file with custom prompt
result = processor.process_pdf(
    'document.pdf',
    custom_prompt="This is a historical newspaper with multiple columns."
)

# Save results to JSON
output_path = processor.save_json_output('document.pdf', 'results.json')
```
### Command Line Usage
The module can also be used directly from the command line:
```bash
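# Process all pages and save the results to JSON
python utils/pdf_ocr.py document.pdf --output results.json
# Process only the first 3 pages
python utils/pdf_ocr.py document.pdf --max-pages 3
# Process specific pages (1-based)
python utils/pdf_ocr.py document.pdf --pages 1,3,5
# Process with a custom prompt
python utils/pdf_ocr.py document.pdf --prompt "This is a historical newspaper with multiple columns."
# Process without the vision model
python utils/pdf_ocr.py document.pdf --no-vision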
```
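
For readers who want to see how flags like these map onto Python values, here is a minimal `argparse` sketch. It mirrors the flags shown above, but it is an illustration only: `parse_pages` and `build_parser` are hypothetical names, and the actual parser in `pdf_ocr.py` may differ.

```python
import argparse
from typing import List


def parse_pages(value: str) -> List[int]:
    """Turn a comma-separated string such as '1,3,5' into [1, 3, 5]."""
    return [int(part) for part in value.split(",") if part.strip()]


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags used in the examples (hypothetical layout)."""
    parser = argparse.ArgumentParser(description="Run OCR on a PDF file.")
    parser.add_argument("pdf_path", help="Path to the PDF file to process")
    parser.add_argument("--output", default=None, help="Path of the JSON file to write")
    parser.add_argument("--max-pages", type=int, default=None, help="Process only the first N pages")
    parser.add_argument("--pages", type=parse_pages, default=None, help="Comma-separated 1-based page numbers")
    parser.add_argument("--prompt", default=None, help="Custom instructions for OCR processing")
    parser.add_argument("--no-vision", action="store_true", help="Do not use the vision model")
    return parser


args = build_parser().parse_args(["document.pdf", "--pages", "1,3,5", "--no-vision"])
print(args.pages, args.max_pages, args.no_vision)  # [1, 3, 5] None True
```
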
### How It Works
1. The module first attempts to convert the PDF to images using `pdf2image`
2. It processes the first page with the vision model (if requested) for detailed analysis
3. Additional pages are processed with the text model for efficiency
4. All text is combined into a single result with appropriate metadata
5. If direct conversion fails, it falls back to using `structured_ocr.py` for PDF processing
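
The control flow in the steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the module's implementation: `ocr_pdf` and its callable parameters (`convert`, `vision_ocr`, `text_ocr`, `fallback`) are hypothetical stand-ins for the `pdf2image` conversion, the two models, and the `structured_ocr.py` fallback, and the result keys are invented for the example.

```python
from typing import Callable, Dict, List, Sequence


def ocr_pdf(
    pdf_path: str,
    convert: Callable[[str], Sequence[object]],
    vision_ocr: Callable[[object], str],
    text_ocr: Callable[[object], str],
    fallback: Callable[[str], Dict[str, object]],
    use_vision: bool = True,
) -> Dict[str, object]:
    """Hypothetical outline of the conversion, OCR, and fallback steps."""
    try:
        images = convert(pdf_path)  # step 1: PDF pages to images
    except Exception:
        return fallback(pdf_path)  # step 5: direct conversion failed

    texts: List[str] = []
    for index, image in enumerate(images):
        if index == 0 and use_vision:
            texts.append(vision_ocr(image))  # step 2: first page, vision model
        else:
            texts.append(text_ocr(image))  # step 3: remaining pages, text model

    # Step 4: combine all page text into one result with simple metadata
    return {
        "text": "\n\n".join(texts),
        "pages_processed": len(texts),
        "used_fallback": False,
    }


# Usage with stub callables standing in for the real converter and models
result = ocr_pdf(
    "document.pdf",
    convert=lambda path: ["page-1", "page-2"],
    vision_ocr=lambda image: f"[vision] {image}",
    text_ocr=lambda image: f"[text] {image}",
    fallback=lambda path: {"text": "", "pages_processed": 0, "used_fallback": True},
)
print(result["pages_processed"])  # 2
```

Passing the steps in as callables keeps the sketch runnable without `pdf2image` or an OCR backend installed.
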
### Parameters
- **pdf_path**: Path to the PDF file to process
- **use_vision**: Whether to use the vision model for improved analysis (default: True)
- **max_pages**: Maximum number of pages to process (default: all pages)
- **custom_pages**: Specific page numbers to process, 1-based indexing (e.g., [1, 3, 5])
- **custom_prompt**: Custom instructions for OCR processing
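
To make the interaction between `max_pages` and `custom_pages` concrete, here is a small sketch of how the two might be resolved into page indices. The helper `select_pages` is hypothetical, and two behaviors are assumptions rather than documented facts: `custom_pages` takes precedence over `max_pages`, and page numbers outside the document are silently dropped.

```python
from typing import List, Optional, Sequence


def select_pages(
    total_pages: int,
    max_pages: Optional[int] = None,
    custom_pages: Optional[Sequence[int]] = None,
) -> List[int]:
    """Return the 0-based indices of the pages to process (hypothetical helper)."""
    if custom_pages:
        # Assumption: custom_pages wins over max_pages.
        # Convert 1-based page numbers to 0-based indices, dropping out-of-range pages.
        return [page - 1 for page in custom_pages if 1 <= page <= total_pages]
    if max_pages is not None:
        # First N pages, capped at the document length
        return list(range(min(max_pages, total_pages)))
    # Default: all pages
    return list(range(total_pages))


print(select_pages(10, custom_pages=[1, 3, 5]))  # [0, 2, 4]
print(select_pages(10, max_pages=3))             # [0, 1, 2]
print(select_pages(2, custom_pages=[1, 3, 5]))   # [0]
```
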