# OCR Utilities
This directory contains utility modules for the Historical OCR project.
## PDF OCR Processing
The `pdf_ocr.py` module provides specialized functionality for processing PDF documents with OCR.
### Features
- **Robust PDF-to-Image Conversion**: Converts PDF documents to images using optimized settings before OCR processing
- **Multi-Page Support**: Handles multi-page documents and can process specific pages or page ranges
- **Memory-Efficient Processing**: Processes PDFs in batches to avoid running out of memory on large documents
- **Fallback Mechanism**: Falls back to the internal PDF processing in `structured_ocr` if direct conversion fails
- **Cleanup Management**: Automatically removes temporary files after processing
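
The batching idea behind the memory-efficient processing can be sketched as follows. This is only an illustration: the helper name `batch_ranges` and the batch size are assumptions, not part of `pdf_ocr.py`. It produces inclusive, 1-based page windows of the kind that `pdf2image.convert_from_path` accepts through its `first_page` and `last_page` arguments.

```python
from typing import Iterator, Tuple


def batch_ranges(total_pages: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive, 1-based (first_page, last_page) windows covering the document.

    Hypothetical helper for illustration; not taken from pdf_ocr.py.
    """
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size must be >= 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


# Each window can be converted and processed on its own, so only one
# batch of page images needs to be held in memory at a time.
ranges = list(batch_ranges(total_pages=7, batch_size=3))
print(ranges)  # [(1, 3), (4, 6), (7, 7)]
```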
### Key Components
- **PDFOCR**: Main class for processing PDF files with OCR
- **PDFConversionResult**: Helper class that holds PDF conversion results and manages cleanup
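
To show what "holds conversion results and manages cleanup" can look like in practice, here is a minimal sketch of a result holder that owns a temporary directory and deletes it when done. The class `TempConversionResult` and its methods are hypothetical stand-ins; the real `PDFConversionResult` interface may differ.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


class TempConversionResult:
    """Hypothetical stand-in: tracks page images in a temp dir and removes them on cleanup."""

    def __init__(self) -> None:
        self.temp_dir = Path(tempfile.mkdtemp(prefix="pdf_ocr_"))
        self.image_paths: List[Path] = []

    def add_image(self, name: str, data: bytes) -> Path:
        """Write one page image into the temp dir and remember its path."""
        path = self.temp_dir / name
        path.write_bytes(data)
        self.image_paths.append(path)
        return path

    def cleanup(self) -> None:
        # ignore_errors makes cleanup safe to call more than once
        shutil.rmtree(self.temp_dir, ignore_errors=True)
        self.image_paths.clear()

    def __enter__(self) -> "TempConversionResult":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Runs even if OCR raised, so temporary files never outlive the block
        self.cleanup()


with TempConversionResult() as result:
    page = result.add_image("page_1.png", b"placeholder bytes")
    existed_inside = page.exists()

print(existed_inside, page.exists())  # True False
```

Using a context manager means the temporary files are removed even when processing fails partway through.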
### Basic Usage
```python
from utils.pdf_ocr import PDFOCR

# Initialize the processor
processor = PDFOCR()

# Process a PDF file (all pages, with vision model)
result = processor.process_pdf('document.pdf')

# Process a PDF file (specific pages, with vision model)
result = processor.process_pdf('document.pdf', custom_pages=[1, 3, 5])

# Process a PDF file (first N pages, without vision model)
result = processor.process_pdf('document.pdf', max_pages=3, use_vision=False)

# Process a PDF file with custom prompt
result = processor.process_pdf(
    'document.pdf',
    custom_prompt="This is a historical newspaper with multiple columns."
)

# Save results to JSON
output_path = processor.save_json_output('document.pdf', 'results.json')
```
### Command Line Usage
The module can also be used directly from the command line:
```bash
python utils/pdf_ocr.py document.pdf --output results.json
python utils/pdf_ocr.py document.pdf --max-pages 3
python utils/pdf_ocr.py document.pdf --pages 1,3,5
python utils/pdf_ocr.py document.pdf --prompt "This is a historical newspaper with multiple columns."
python utils/pdf_ocr.py document.pdf --no-vision
```
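
For readers who want to wrap or extend the CLI, the flags above could be declared with `argparse` roughly as below. The flag names come from the commands shown; everything else (the `parse_pages` helper, the destination names, the absence of defaults) is an assumption rather than the module's actual parser.

```python
import argparse
from typing import List


def parse_pages(value: str) -> List[int]:
    """Turn '1,3,5' into [1, 3, 5]; page numbers are 1-based. Hypothetical helper."""
    pages = [int(part) for part in value.split(",") if part.strip()]
    if any(p < 1 for p in pages):
        raise argparse.ArgumentTypeError("page numbers start at 1")
    return pages


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a parser that accepts the flags shown in the commands above."""
    parser = argparse.ArgumentParser(description="Process a PDF with OCR")
    parser.add_argument("pdf_path", help="Path to the PDF file")
    parser.add_argument("--output", help="Where to write the JSON result")
    parser.add_argument("--max-pages", type=int, help="Process only the first N pages")
    parser.add_argument("--pages", type=parse_pages, help="Comma-separated 1-based pages")
    parser.add_argument("--prompt", help="Custom instructions for OCR")
    parser.add_argument("--no-vision", action="store_true", help="Disable the vision model")
    return parser


args = build_parser().parse_args(["document.pdf", "--pages", "1,3,5", "--no-vision"])
print(args.pages, args.max_pages, args.no_vision)  # [1, 3, 5] None True
```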
### How It Works
1. The module first attempts to convert the PDF to images using `pdf2image`.
2. If `use_vision` is enabled, the first page is processed with the vision model for detailed analysis.
3. Additional pages are processed with the text model for efficiency.
4. The text from all pages is combined into a single result with appropriate metadata.
5. If the direct conversion in step 1 fails, the module falls back to `structured_ocr.py` for PDF processing.
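
The control flow of these steps can be sketched with plain callables standing in for the real converters and models. All names here (`ocr_with_fallback`, the callable parameters, the result keys) are illustrative assumptions, as is the choice to use the text model for the first page when vision is disabled.

```python
from typing import Callable, Dict, List, Sequence


def ocr_with_fallback(
    pdf_path: str,
    convert: Callable[[str], Sequence[str]],
    vision_ocr: Callable[[str], str],
    text_ocr: Callable[[str], str],
    fallback: Callable[[str], Dict],
    use_vision: bool = True,
) -> Dict:
    """Hypothetical sketch of the convert -> OCR -> combine flow with a fallback path."""
    try:
        images = convert(pdf_path)  # step 1: PDF to page images
    except Exception:
        return fallback(pdf_path)  # step 5: direct conversion failed
    texts: List[str] = []
    for index, image in enumerate(images):
        if index == 0 and use_vision:
            texts.append(vision_ocr(image))  # step 2: first page, vision model
        else:
            texts.append(text_ocr(image))  # step 3: remaining pages, text model
    # step 4: combine page texts and attach simple metadata
    return {"text": "\n\n".join(texts), "pages_processed": len(texts), "used_fallback": False}


def failing_convert(path: str) -> Sequence[str]:
    raise RuntimeError("conversion failed")


stubs = dict(
    vision_ocr=lambda img: f"vision:{img}",
    text_ocr=lambda img: f"text:{img}",
    fallback=lambda path: {"text": "", "pages_processed": 0, "used_fallback": True},
)
ok = ocr_with_fallback("document.pdf", convert=lambda p: ["p1.png", "p2.png"], **stubs)
failed = ocr_with_fallback("document.pdf", convert=failing_convert, **stubs)
print(ok["pages_processed"], failed["used_fallback"])  # 2 True
```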
### Parameters
- **pdf_path**: Path to the PDF file to process
- **use_vision**: Whether to use the vision model for improved analysis (default: `True`)
- **max_pages**: Maximum number of pages to process (default: all pages)
- **custom_pages**: Specific page numbers to process, using 1-based indexing (e.g., `[1, 3, 5]`)
- **custom_prompt**: Custom instructions for OCR processing
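
How `max_pages` and `custom_pages` interact is not documented here. One plausible reading, shown purely as an assumption, is that explicit page numbers take precedence and out-of-range entries are ignored. The helper `select_pages` is hypothetical and not part of the module.

```python
from typing import List, Optional, Sequence


def select_pages(
    total_pages: int,
    max_pages: Optional[int] = None,
    custom_pages: Optional[Sequence[int]] = None,
) -> List[int]:
    """Return the 1-based page numbers to process (hypothetical selection rule)."""
    if custom_pages:
        # Assumption: explicit pages win over max_pages; invalid numbers are dropped
        return [p for p in custom_pages if 1 <= p <= total_pages]
    if max_pages is not None:
        # First N pages, capped at the document length
        return list(range(1, min(max_pages, total_pages) + 1))
    # Default: every page
    return list(range(1, total_pages + 1))


print(select_pages(10, custom_pages=[1, 3, 5, 12]))  # [1, 3, 5]
print(select_pages(10, max_pages=3))                 # [1, 2, 3]
print(select_pages(2))                               # [1, 2]
```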