Spaces: milwright/historical-ocr

milwright committed on Jun 12

Commit e680bda · verified · 1 Parent(s): 72f6723

Delete docs

Files changed (3)

docs/config_refactoring.md +0 -47
docs/preprocessing.md +0 -179
docs/preprocessing_triage.md +0 -17

docs/config_refactoring.md DELETED

@@ -1,47 +0,0 @@
-# Configuration Refactoring
-## Overview
-This document outlines the changes made to centralize configuration parameters and reduce technical debt in the OCR processing system.
-## Key Changes
-### Centralized Configuration
-All previously hard-coded parameters have been moved to `config.py` and organized by functional category:
-- **PDF_SETTINGS**: Parameters for PDF processing
-- **SEGMENTATION_SETTINGS**: Image segmentation configuration
-- **CACHE_SETTINGS**: Cache TTL and capacity settings
-- **TEXT_REPAIR_SETTINGS**: Duplication detection and repair thresholds
-### Environment Variable Support
-All configuration parameters can now be overridden via environment variables:
-```bash
-# Example: Override PDF DPI
-export PDF_DEFAULT_DPI=200
-# Example: Increase cache size
-export CACHE_MAX_ENTRIES=50
-```
-### Import Strategy
-To prevent circular dependencies, configuration is imported at function level where needed:
-```python
-def process_image():
-    from config import SEGMENTATION_SETTINGS
-    # Function implementation using settings
-```
-## Benefits
-- **Maintainability**: Settings are centralized and documented
-- **Flexibility**: Configuration can be adjusted without code changes
-- **Consistency**: Standardized approach to configuration across modules
-- **Traceability**: Clear overview of all configurable parameters
-## Future Improvements
-- Add configuration schema validation
-- Support for configuration profiles (dev/test/prod)
-- Add detailed documentation for each parameter
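
For readers of the deleted `docs/config_refactoring.md`, the environment-variable override it describes could be sketched roughly as follows. This is a minimal illustration, not the project's actual `config.py`: the helper names `env_int` and `load_settings`, the dictionary keys, and the default values are all assumptions. Only the variable names `PDF_DEFAULT_DPI` and `CACHE_MAX_ENTRIES` come from the document.

```python
import os

# Hypothetical defaults; the real values in config.py are not shown in this diff.
_DEFAULT_DPI = 150
_DEFAULT_CACHE_ENTRIES = 20


def env_int(name, default, environ=None):
    """Return an integer environment override, or the default if unset or malformed."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return int(raw)
    except ValueError:
        # A malformed override falls back to the default instead of failing at import.
        return default


def load_settings(environ=None):
    """Build the settings groups, applying any environment overrides."""
    return {
        "PDF_SETTINGS": {
            "default_dpi": env_int("PDF_DEFAULT_DPI", _DEFAULT_DPI, environ),
        },
        "CACHE_SETTINGS": {
            "max_entries": env_int("CACHE_MAX_ENTRIES", _DEFAULT_CACHE_ENTRIES, environ),
        },
    }
```

Passing the environment as a parameter keeps the sketch easy to test. With `export PDF_DEFAULT_DPI=200` set, `load_settings()` would report a DPI of 200 and leave the cache size at its default.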

docs/preprocessing.md DELETED

@@ -1,179 +0,0 @@
-# Image Preprocessing for Historical Document OCR
-This document outlines the enhanced preprocessing capabilities for improving OCR quality on historical documents, including deskewing, thresholding, and morphological operations.
-## Overview
-The preprocessing pipeline offers several options to enhance image quality before OCR processing:
-1. **Deskewing**: Automatically detects and corrects document skew using multiple detection algorithms
-2. **Thresholding**: Converts grayscale images to binary using adaptive or Otsu methods with pre-blur options
-3. **Morphological Operations**: Cleans up binary images by removing noise or filling in gaps
-4. **Document-Type Specific Settings**: Customized preprocessing configurations for different document types
-## Configuration
-Preprocessing options are set in `config.py` and are tunable per document type. All settings are accessible through environment variables for easy deployment configuration.
-### Deskewing
-```python
-"deskew": {
-    "enabled": True/False,              # Whether to apply deskewing
-    "angle_threshold": 0.1,             # Minimum angle (degrees) to trigger deskewing
-    "max_angle": 45.0,                  # Maximum correction angle
-    "use_hough": True/False,            # Use Hough transform in addition to minAreaRect
-    "consensus_method": "average",      # How to combine angle estimations
-    "fallback": {"enabled": True/False} # Fall back to original if deskewing fails
-}
-```
-Deskewing uses two methods:
-- **minAreaRect**: Finds contours in the binary image and calculates their orientation
-- **Hough Transform**: Detects lines in the image and their angles
-The `consensus_method` can be:
-- `"average"`: Average of all detected angles (most stable)
-- `"median"`: Median of all angles (robust to outliers)
-- `"min"`: Minimum absolute angle (most conservative)
-- `"max"`: Maximum absolute angle (most aggressive)
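
The four `consensus_method` options listed above can be sketched in plain Python. The function name `combine_angles`, the empty-list behaviour, and the choice to keep the sign of the selected angle for `"min"` and `"max"` are assumptions for illustration. The deleted document does not show how the project actually combines the estimates.

```python
import statistics


def combine_angles(angles, method="average"):
    """Combine several skew estimates (in degrees) into one angle.

    Hypothetical helper: mirrors the documented options, not the project's code.
    """
    if not angles:
        # Assumption: with no estimates, report no skew.
        return 0.0
    if method == "average":
        return statistics.mean(angles)
    if method == "median":
        return statistics.median(angles)
    if method == "min":
        # Smallest absolute angle, sign preserved (most conservative).
        return min(angles, key=abs)
    if method == "max":
        # Largest absolute angle, sign preserved (most aggressive).
        return max(angles, key=abs)
    raise ValueError(f"unknown consensus method: {method!r}")
```

With estimates of `[1.0, -3.0, 2.0]`, `"median"` and `"min"` both select `1.0`, while `"max"` selects `-3.0`.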
-### Thresholding
-```python
-"thresholding": {
-    "method": "adaptive",               # "none", "otsu", or "adaptive"
-    "adaptive_block_size": 11,          # Block size for adaptive thresholding (must be odd)
-    "adaptive_constant": 2,             # Constant subtracted from mean
-    "otsu_gaussian_blur": 1,            # Blur kernel size for Otsu pre-processing
-    "preblur": {
-        "enabled": True/False,          # Whether to apply pre-blur
-        "method": "gaussian",           # "gaussian" or "median"
-        "kernel_size": 3                # Blur kernel size (must be odd)
-    },
-    "fallback": {"enabled": True/False} # Fall back to grayscale if thresholding fails
-}
-```
-Thresholding methods:
-- **Otsu**: Automatically determines optimal global threshold (best for high-contrast documents)
-- **Adaptive**: Calculates thresholds for different regions (better for uneven lighting, historical documents)
-### Morphological Operations
-```python
-"morphology": {
-    "enabled": True/False,              # Whether to apply morphological operations
-    "operation": "close",               # "open", "close", "both"
-    "kernel_size": 1,                   # Size of the structuring element
-    "kernel_shape": "rect"              # "rect", "ellipse", "cross"
-}
-```
-Morphological operations:
-- **Open**: Erosion followed by dilation - removes small noise and disconnects thin connections
-- **Close**: Dilation followed by erosion - fills small holes and connects broken elements
-- **Both**: Applies opening followed by closing
-### Document Type Configurations
-The system includes optimized settings for different document types:
-```python
-"document_types": {
-    "standard": {
-        # Default settings - will use the global settings
-    },
-    "newspaper": {
-        "deskew": {"enabled": True, "angle_threshold": 0.3, "max_angle": 10.0},
-        "thresholding": {
-            "method": "adaptive",
-            "adaptive_block_size": 15,
-            "adaptive_constant": 3,
-            "preblur": {"method": "gaussian", "kernel_size": 3}
-        },
-        "morphology": {"operation": "close", "kernel_size": 1}
-    },
-    "handwritten": {
-        "deskew": {"enabled": True, "angle_threshold": 0.5, "use_hough": False},
-        "thresholding": {
-            "method": "adaptive",
-            "adaptive_block_size": 31,
-            "adaptive_constant": 5,
-            "preblur": {"method": "median", "kernel_size": 3}
-        },
-        "morphology": {"operation": "open", "kernel_size": 1}
-    },
-    "book": {
-        "deskew": {"enabled": True},
-        "thresholding": {
-            "method": "otsu",
-            "preblur": {"method": "gaussian", "kernel_size": 5}
-        },
-        "morphology": {"operation": "both", "kernel_size": 1}
-    }
-}
-```
-## Performance and Logging
-```python
-"performance": {
-    "parallel": {
-        "enabled": True/False,          # Whether to use parallel processing
-        "max_workers": 4                # Maximum number of worker threads
-    },
-    "timeout_ms": 10000                 # Timeout for preprocessing (in milliseconds)
-}
-"logging": {
-    "enabled": True/False,              # Whether to log preprocessing metrics
-    "metrics": ["skew_angle", "binary_nonzero_pct", "processing_time"],
-    "output_path": "logs/preprocessing_metrics.json"
-}
-```
-## Usage with OCR Processing
-When processing documents, simply specify the document type:
-```python
-preprocessing_options = {
-    "document_type": "newspaper",  # Use newspaper-optimized settings
-    "grayscale": True,             # Legacy option: apply grayscale conversion
-    "denoise": True,               # Legacy option: apply denoising
-    "contrast": 10,                # Legacy option: adjust contrast (0-100)
-    "rotation": 0                  # Legacy option: manual rotation (degrees)
-}
-# Apply preprocessing and OCR
-result = process_file(file_bytes, file_ext, preprocessing_options=preprocessing_options)
-```
-## Visual Examples
-### Original Document
-*[A historical newspaper or document image would be shown here]*
-### After Deskewing
-*[The same document, with skew corrected]*
-### After Thresholding
-*[The document converted to binary with clear text]*
-### After Morphological Operations
-*[The binary image with small noise removed and/or gaps filled]*
-## Troubleshooting
-### Poor Deskewing Results
-- **Symptom**: Document skew is not correctly detected or corrected
-- **Solution**: Try adjusting `angle_threshold` or `max_angle`, or disable Hough transform for handwritten documents
-### Thresholding Issues
-- **Symptom**: Text is lost or background noise is excessive after thresholding
-- **Solution**: Try changing the thresholding method or adjusting `adaptive_block_size` and `adaptive_constant`
-### Performance Concerns
-- **Symptom**: Processing is too slow for large documents
-- **Solution**: Enable parallel processing, reduce image size, or disable some preprocessing steps for faster results
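
The deleted `docs/preprocessing.md` says the `"standard"` type uses the global settings while the other document types override only some keys. One plausible way to resolve that is a recursive merge, sketched below. The names `deep_merge` and `settings_for` are hypothetical, the fallback for unknown types is an assumption, and the `True` values stand in for the `True/False` placeholders in the document. The numeric values are copied from its examples.

```python
import copy


def deep_merge(base, overrides):
    """Return a copy of base with overrides applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections key by key instead of replacing them whole.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Illustrative global section; boolean values are assumed.
GLOBAL_SETTINGS = {
    "deskew": {"enabled": True, "angle_threshold": 0.1, "max_angle": 45.0, "use_hough": True},
    "thresholding": {"method": "adaptive", "adaptive_block_size": 11, "adaptive_constant": 2},
}

DOCUMENT_TYPES = {
    "standard": {},  # no overrides: global settings apply unchanged
    "handwritten": {
        "deskew": {"enabled": True, "angle_threshold": 0.5, "use_hough": False},
        "thresholding": {"adaptive_block_size": 31, "adaptive_constant": 5},
    },
}


def settings_for(document_type):
    """Resolve the effective settings; unknown types fall back to the globals."""
    return deep_merge(GLOBAL_SETTINGS, DOCUMENT_TYPES.get(document_type, {}))
```

Under these assumptions, `settings_for("handwritten")` disables the Hough transform but still inherits `max_angle` of 45.0 from the global section, and `GLOBAL_SETTINGS` itself is never mutated.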

docs/preprocessing_triage.md DELETED

@@ -1,17 +0,0 @@
-# OCR Preprocessing Triage
-## Quick Fixes Implemented
-1. **Handwritten** - Disabled thresholding, uses grayscale only
-2. **Newspapers** - Increased block size (51) and constant (10) for softer thresholding
-3. **JPEG Artifacts** - Auto-detection and specialized denoising
-4. **Border Issues** - Crops edges after deskew to avoid threshold problems
-5. **Low Resolution** - Upscales small text for better recognition
-## Testing
-```
-python testing/test_triage_fix.py
-```
-Check `output/comparison/` for results.