# Image Preprocessing for Historical Document OCR

This document describes the preprocessing options available for improving OCR quality on historical documents: deskewing, thresholding, and morphological operations.

## Overview

The preprocessing pipeline offers several options for enhancing image quality before OCR:

1. **Deskewing**: Automatically detects and corrects document skew using multiple detection algorithms
2. **Thresholding**: Converts grayscale images to binary using adaptive or Otsu methods, with optional pre-blur
3. **Morphological Operations**: Cleans up binary images by removing noise or filling in gaps
4. **Document-Type Specific Settings**: Customized preprocessing configurations for different document types

## Configuration

Preprocessing options are set in `config.py` and can be tuned per document type. All settings are also accessible through environment variables for deployment configuration.

### Deskewing

```python
"deskew": {
    "enabled": True,                 # True or False: whether to apply deskewing
    "angle_threshold": 0.1,          # Minimum angle (degrees) to trigger deskewing
    "max_angle": 45.0,               # Maximum correction angle (degrees)
    "use_hough": True,               # True or False: use Hough transform in addition to minAreaRect
    "consensus_method": "average",   # How to combine angle estimates
    "fallback": {"enabled": True}    # True or False: fall back to the original image if deskewing fails
}
```

Deskewing uses two methods:

- **minAreaRect**: Finds contours in the binary image and calculates their orientation
- **Hough Transform**: Detects lines in the image and measures their angles

The `consensus_method` can be:

- `"average"`: Average of all detected angles (most stable)
- `"median"`: Median of all angles (robust to outliers)
- `"min"`: Minimum absolute angle (most conservative)
- `"max"`: Maximum absolute angle (most aggressive)

### Thresholding

```python
"thresholding": {
    "method": "adaptive",            # "none", "otsu", or "adaptive"
    "adaptive_block_size": 11,       # Block size for adaptive thresholding (must be odd)
    "adaptive_constant": 2,          # Constant subtracted from the mean
    "otsu_gaussian_blur": 1,         # Blur kernel size for Otsu pre-processing
    "preblur": {
        "enabled": True,             # True or False: whether to apply pre-blur
        "method": "gaussian",        # "gaussian" or "median"
        "kernel_size": 3             # Blur kernel size (must be odd)
    },
    "fallback": {"enabled": True}    # True or False: fall back to grayscale if thresholding fails
}
```

Thresholding methods:

- **Otsu**: Automatically determines a single global threshold (best for high-contrast documents)
- **Adaptive**: Calculates a separate threshold for each region (better for uneven lighting and historical documents)

### Morphological Operations

```python
"morphology": {
    "enabled": True,                 # True or False: whether to apply morphological operations
    "operation": "close",            # "open", "close", or "both"
    "kernel_size": 1,                # Size of the structuring element
    "kernel_shape": "rect"           # "rect", "ellipse", or "cross"
}
```

Morphological operations:

- **Open**: Erosion followed by dilation; removes small noise and breaks thin connections
- **Close**: Dilation followed by erosion; fills small holes and connects broken elements
- **Both**: Applies opening followed by closing

### Document Type Configurations

The system includes tuned settings for different document types:

```python
"document_types": {
    "standard": {
        # Default settings: uses the global settings
    },
    "newspaper": {
        "deskew": {"enabled": True, "angle_threshold": 0.3, "max_angle": 10.0},
        "thresholding": {
            "method": "adaptive",
            "adaptive_block_size": 15,
            "adaptive_constant": 3,
            "preblur": {"method": "gaussian", "kernel_size": 3}
        },
        "morphology": {"operation": "close", "kernel_size": 1}
    },
    "handwritten": {
        "deskew": {"enabled": True, "angle_threshold": 0.5, "use_hough": False},
        "thresholding": {
            "method": "adaptive",
            "adaptive_block_size": 31,
            "adaptive_constant": 5,
            "preblur": {"method": "median", "kernel_size": 3}
        },
        "morphology": {"operation": "open", "kernel_size": 1}
    },
    "book": {
        "deskew": {"enabled": True},
        "thresholding": {
            "method": "otsu",
            "preblur": {"method": "gaussian", "kernel_size": 5}
        },
        "morphology": {"operation": "both", "kernel_size": 1}
    }
}
```

## Performance and Logging

```python
"performance": {
    "parallel": {
        "enabled": True,             # True or False: whether to use parallel processing
        "max_workers": 4             # Maximum number of worker threads
    },
    "timeout_ms": 10000              # Timeout for preprocessing (in milliseconds)
},
"logging": {
    "enabled": True,                 # True or False: whether to log preprocessing metrics
    "metrics": ["skew_angle", "binary_nonzero_pct", "processing_time"],
    "output_path": "logs/preprocessing_metrics.json"
}
```

## Usage with OCR Processing

When processing documents, specify the document type in the preprocessing options:

```python
preprocessing_options = {
    "document_type": "newspaper",  # Use newspaper-optimized settings
    "grayscale": True,             # Legacy option: apply grayscale conversion
    "denoise": True,               # Legacy option: apply denoising
    "contrast": 10,                # Legacy option: adjust contrast (0-100)
    "rotation": 0                  # Legacy option: manual rotation (degrees)
}

# Apply preprocessing and OCR
result = process_file(file_bytes, file_ext, preprocessing_options=preprocessing_options)
```

## Visual Examples

### Original Document

*[A historical newspaper or document image would be shown here]*

### After Deskewing

*[The same document, with skew corrected]*

### After Thresholding

*[The document converted to binary with clear text]*

### After Morphological Operations

*[The binary image with small noise removed and/or gaps filled]*

## Troubleshooting

### Poor Deskewing Results

- **Symptom**: Document skew is not correctly detected or corrected
- **Solution**: Adjust `angle_threshold` or `max_angle`, or disable the Hough transform (`use_hough`) for handwritten documents

### Thresholding Issues

- **Symptom**: Text is lost or background noise is excessive after thresholding
- **Solution**: Change the thresholding method, or adjust `adaptive_block_size` and `adaptive_constant`

### Performance Concerns

- **Symptom**: Processing is too slow for large documents
- **Solution**: Enable parallel processing, reduce the image size, or disable some preprocessing steps
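
### Example: How the Deskew Settings Interact

When tuning `angle_threshold`, `max_angle`, and `consensus_method`, it can help to see how they might combine. The sketch below is illustrative only and is not taken from the pipeline's source. The function names `combine_angles` and `correction_angle` are hypothetical, and the choice to skip correction when the combined angle exceeds `max_angle` is an assumption; the real pipeline may clamp the angle or fall back in another way.

```python
from statistics import mean, median


def combine_angles(angles, method="average"):
    """Combine per-detector skew estimates (in degrees) into a single angle.

    Hypothetical helper: mirrors the documented consensus_method values.
    """
    if not angles:
        raise ValueError("at least one angle estimate is required")
    if method == "average":
        return mean(angles)
    if method == "median":
        return median(angles)
    if method == "min":
        # Smallest absolute angle, keeping its sign (most conservative).
        return min(angles, key=abs)
    if method == "max":
        # Largest absolute angle, keeping its sign (most aggressive).
        return max(angles, key=abs)
    raise ValueError(f"unknown consensus_method: {method!r}")


def correction_angle(angles, method="average", angle_threshold=0.1, max_angle=45.0):
    """Return the angle to correct by, or 0.0 when no correction should run.

    Assumption: angles below angle_threshold or above max_angle are skipped.
    """
    angle = combine_angles(angles, method)
    if abs(angle) < angle_threshold:
        return 0.0  # skew too small to be worth correcting
    if abs(angle) > max_angle:
        return 0.0  # estimate is implausibly large; leave the image as-is
    return angle


if __name__ == "__main__":
    # Made-up estimates, e.g. from minAreaRect and Hough detection.
    estimates = [1.8, 2.2, -0.4]
    for method in ("average", "median", "min", "max"):
        print(method, correction_angle(estimates, method))
```

With estimates that disagree, as in the made-up list above, `"median"` ignores the outlier while `"min"` and `"max"` pick opposite extremes, which is why switching `consensus_method` is worth trying alongside the threshold adjustments when deskewing results are poor.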