milwright committed on
Commit e680bda · verified · 1 Parent(s): 72f6723

Delete docs

docs/config_refactoring.md DELETED
@@ -1,47 +0,0 @@
- # Configuration Refactoring
-
- ## Overview
- This document outlines the changes made to centralize configuration parameters and reduce technical debt in the OCR processing system.
-
- ## Key Changes
-
- ### Centralized Configuration
- All previously hard-coded parameters have been moved to `config.py` and organized by functional category:
-
- - **PDF_SETTINGS**: Parameters for PDF processing
- - **SEGMENTATION_SETTINGS**: Image segmentation configuration
- - **CACHE_SETTINGS**: Cache TTL and capacity settings
- - **TEXT_REPAIR_SETTINGS**: Duplication detection and repair thresholds
-
- ### Environment Variable Support
- All configuration parameters can now be overridden via environment variables:
-
- ```bash
- # Example: Override PDF DPI
- export PDF_DEFAULT_DPI=200
-
- # Example: Increase cache size
- export CACHE_MAX_ENTRIES=50
- ```
-
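The deleted document does not show how `config.py` reads these variables. The following is a minimal sketch of one way it could work, assuming integer-valued settings. The helper `env_int`, the dictionary keys, and the fallback values 150 and 20 are illustrative assumptions, not taken from the project.

```python
import os


def env_int(name, default, environ=None):
    """Return an integer setting from the environment, or `default`.

    `environ` is injectable so the helper can be exercised without
    touching the real process environment.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return int(raw)
    except ValueError:
        # Assumption: an unparsable value falls back to the default
        # instead of aborting startup.
        return default


# Hypothetical keys and defaults; only the variable names come from the doc.
PDF_SETTINGS = {"default_dpi": env_int("PDF_DEFAULT_DPI", 150)}
CACHE_SETTINGS = {"max_entries": env_int("CACHE_MAX_ENTRIES", 20)}
```

With `export PDF_DEFAULT_DPI=200` in the shell, `PDF_SETTINGS["default_dpi"]` would be 200 under this sketch; without it, the fallback applies.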
- ### Import Strategy
- To prevent circular dependencies, configuration is imported at function level where needed:
-
- ```python
- def process_image():
-     from config import SEGMENTATION_SETTINGS
-     # Function implementation using settings
- ```
-
- ## Benefits
-
- - **Maintainability**: Settings are centralized and documented
- - **Flexibility**: Configuration can be adjusted without code changes
- - **Consistency**: Standardized approach to configuration across modules
- - **Traceability**: Clear overview of all configurable parameters
-
- ## Future Improvements
-
- - Add configuration schema validation
- - Support for configuration profiles (dev/test/prod)
- - Add detailed documentation for each parameter
 
docs/preprocessing.md DELETED
@@ -1,179 +0,0 @@
- # Image Preprocessing for Historical Document OCR
-
- This document outlines the enhanced preprocessing capabilities for improving OCR quality on historical documents, including deskewing, thresholding, and morphological operations.
-
- ## Overview
-
- The preprocessing pipeline offers several options to enhance image quality before OCR processing:
-
- 1. **Deskewing**: Automatically detects and corrects document skew using multiple detection algorithms
- 2. **Thresholding**: Converts grayscale images to binary using adaptive or Otsu methods with pre-blur options
- 3. **Morphological Operations**: Cleans up binary images by removing noise or filling in gaps
- 4. **Document-Type Specific Settings**: Customized preprocessing configurations for different document types
-
- ## Configuration
-
- Preprocessing options are set in `config.py` and are tunable per document type. All settings are accessible through environment variables for easy deployment configuration.
-
- ### Deskewing
-
- ```python
- "deskew": {
-     "enabled": True/False,  # Whether to apply deskewing
-     "angle_threshold": 0.1,  # Minimum angle (degrees) to trigger deskewing
-     "max_angle": 45.0,  # Maximum correction angle
-     "use_hough": True/False,  # Use Hough transform in addition to minAreaRect
-     "consensus_method": "average",  # How to combine angle estimations
-     "fallback": {"enabled": True/False}  # Fall back to original if deskewing fails
- }
- ```
-
- Deskewing uses two methods:
- - **minAreaRect**: Finds contours in the binary image and calculates their orientation
- - **Hough Transform**: Detects lines in the image and their angles
-
- The `consensus_method` can be:
- - `"average"`: Average of all detected angles (most stable)
- - `"median"`: Median of all angles (robust to outliers)
- - `"min"`: Minimum absolute angle (most conservative)
- - `"max"`: Maximum absolute angle (most aggressive)
-
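The four `consensus_method` options above can be sketched as a single function. This is an illustration only: the name `combine_angles` and the choice to return 0.0 when no angle was detected are assumptions, and the project's actual implementation is not shown in this diff.

```python
from statistics import mean, median


def combine_angles(angles, method="average"):
    """Combine skew estimates (in degrees) from several detectors."""
    if not angles:
        # Assumption: no estimate means no rotation is applied.
        return 0.0
    if method == "average":
        return mean(angles)
    if method == "median":
        return median(angles)
    if method == "min":
        # Smallest magnitude, sign preserved (most conservative).
        return min(angles, key=abs)
    if method == "max":
        # Largest magnitude, sign preserved (most aggressive).
        return max(angles, key=abs)
    raise ValueError(f"unknown consensus method: {method!r}")
```

For example, given the estimates `[1.0, -3.0, 2.0]`, `"min"` selects `1.0` and `"max"` selects `-3.0`, since both compare by absolute value and keep the sign.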
- ### Thresholding
-
- ```python
- "thresholding": {
-     "method": "adaptive",  # "none", "otsu", or "adaptive"
-     "adaptive_block_size": 11,  # Block size for adaptive thresholding (must be odd)
-     "adaptive_constant": 2,  # Constant subtracted from mean
-     "otsu_gaussian_blur": 1,  # Blur kernel size for Otsu pre-processing
-     "preblur": {
-         "enabled": True/False,  # Whether to apply pre-blur
-         "method": "gaussian",  # "gaussian" or "median"
-         "kernel_size": 3  # Blur kernel size (must be odd)
-     },
-     "fallback": {"enabled": True/False}  # Fall back to grayscale if thresholding fails
- }
- ```
-
- Thresholding methods:
- - **Otsu**: Automatically determines optimal global threshold (best for high-contrast documents)
- - **Adaptive**: Calculates thresholds for different regions (better for uneven lighting, historical documents)
-
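To show what `adaptive_block_size` and `adaptive_constant` control, here is a dependency-free sketch of mean-based adaptive thresholding on a grayscale image stored as a list of rows. It is a simplified illustration, not the project's code: a real pipeline would more likely call a library routine such as OpenCV's `cv2.adaptiveThreshold`, and the function name and border handling (neighborhoods clipped at the image edge) are assumptions.

```python
def adaptive_threshold(gray, block_size=11, constant=2):
    """Binarize `gray` (rows of 0-255 ints) using a local mean threshold.

    Each pixel becomes 255 if it is brighter than the mean of its
    block_size x block_size neighborhood minus `constant`, else 0.
    """
    if block_size < 3 or block_size % 2 == 0:
        raise ValueError("block_size must be an odd integer >= 3")
    height, width = len(gray), len(gray[0])
    half = block_size // 2
    binary = []
    for y in range(height):
        row = []
        for x in range(width):
            # Neighborhood clipped to the image bounds (an assumption;
            # libraries often replicate border pixels instead).
            ys = range(max(0, y - half), min(height, y + half + 1))
            xs = range(max(0, x - half), min(width, x + half + 1))
            total = sum(gray[j][i] for j in ys for i in xs)
            local_mean = total / (len(ys) * len(xs))
            row.append(255 if gray[y][x] > local_mean - constant else 0)
        binary.append(row)
    return binary
```

A larger block averages over a wider area, which tolerates slow lighting changes; a larger constant lowers the threshold, so fewer pixels are classified as dark.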
- ### Morphological Operations
-
- ```python
- "morphology": {
-     "enabled": True/False,  # Whether to apply morphological operations
-     "operation": "close",  # "open", "close", "both"
-     "kernel_size": 1,  # Size of the structuring element
-     "kernel_shape": "rect"  # "rect", "ellipse", "cross"
- }
- ```
-
- Morphological operations:
- - **Open**: Erosion followed by dilation - removes small noise and disconnects thin connections
- - **Close**: Dilation followed by erosion - fills small holes and connects broken elements
- - **Both**: Applies opening followed by closing
-
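The open and close operations described above can be sketched without any imaging library, using a square ("rect") structuring element on a binary image stored as rows of 0/1 values. The function names are hypothetical and out-of-bounds pixels are simply ignored; this illustrates the definitions rather than the project's implementation.

```python
def _filter(binary, size, reducer):
    """Apply min (erosion) or max (dilation) over a size x size window."""
    half = size // 2
    height, width = len(binary), len(binary[0])
    return [
        [
            reducer(
                binary[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            )
            for x in range(width)
        ]
        for y in range(height)
    ]


def erode(binary, size=3):
    return _filter(binary, size, min)


def dilate(binary, size=3):
    return _filter(binary, size, max)


def opening(binary, size=3):
    # Erosion then dilation: removes specks smaller than the kernel.
    return dilate(erode(binary, size), size)


def closing(binary, size=3):
    # Dilation then erosion: fills holes smaller than the kernel.
    return erode(dilate(binary, size), size)
```

Note that with a kernel size of 1 the window covers only the pixel itself, so under this sketch both operations leave the image unchanged.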
- ### Document Type Configurations
-
- The system includes optimized settings for different document types:
-
- ```python
- "document_types": {
-     "standard": {
-         # Default settings - will use the global settings
-     },
-     "newspaper": {
-         "deskew": {"enabled": True, "angle_threshold": 0.3, "max_angle": 10.0},
-         "thresholding": {
-             "method": "adaptive",
-             "adaptive_block_size": 15,
-             "adaptive_constant": 3,
-             "preblur": {"method": "gaussian", "kernel_size": 3}
-         },
-         "morphology": {"operation": "close", "kernel_size": 1}
-     },
-     "handwritten": {
-         "deskew": {"enabled": True, "angle_threshold": 0.5, "use_hough": False},
-         "thresholding": {
-             "method": "adaptive",
-             "adaptive_block_size": 31,
-             "adaptive_constant": 5,
-             "preblur": {"method": "median", "kernel_size": 3}
-         },
-         "morphology": {"operation": "open", "kernel_size": 1}
-     },
-     "book": {
-         "deskew": {"enabled": True},
-         "thresholding": {
-             "method": "otsu",
-             "preblur": {"method": "gaussian", "kernel_size": 5}
-         },
-         "morphology": {"operation": "both", "kernel_size": 1}
-     }
- }
- ```
-
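The per-type entries above list only some keys, and the `standard` entry says it uses the global settings, which suggests that overrides are layered on top of global defaults. A minimal sketch of such a recursive merge follows. The function names and the treatment of unknown document types (fall back to the globals) are assumptions; the diff does not show how the project resolves these settings.

```python
import copy


def deep_merge(base, overrides):
    """Return a new dict: `base` with `overrides` applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections key by key instead of replacing them.
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def settings_for(document_type, global_settings, document_types):
    """Resolve effective settings for one document type."""
    # Assumption: an unknown type behaves like "standard" (no overrides).
    overrides = document_types.get(document_type, {})
    return deep_merge(global_settings, overrides)
```

Under this sketch, the `book` entry's `"preblur": {"method": "gaussian", "kernel_size": 5}` would change only those two keys and inherit any other pre-blur keys, such as `enabled`, from the global section.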
- ## Performance and Logging
-
- ```python
- "performance": {
-     "parallel": {
-         "enabled": True/False,  # Whether to use parallel processing
-         "max_workers": 4  # Maximum number of worker threads
-     },
-     "timeout_ms": 10000  # Timeout for preprocessing (in milliseconds)
- }
-
- "logging": {
-     "enabled": True/False,  # Whether to log preprocessing metrics
-     "metrics": ["skew_angle", "binary_nonzero_pct", "processing_time"],
-     "output_path": "logs/preprocessing_metrics.json"
- }
- ```
-
- ## Usage with OCR Processing
-
- When processing documents, simply specify the document type:
-
- ```python
- preprocessing_options = {
-     "document_type": "newspaper",  # Use newspaper-optimized settings
-     "grayscale": True,  # Legacy option: apply grayscale conversion
-     "denoise": True,  # Legacy option: apply denoising
-     "contrast": 10,  # Legacy option: adjust contrast (0-100)
-     "rotation": 0  # Legacy option: manual rotation (degrees)
- }
-
- # Apply preprocessing and OCR
- result = process_file(file_bytes, file_ext, preprocessing_options=preprocessing_options)
- ```
-
- ## Visual Examples
-
- ### Original Document
- *[A historical newspaper or document image would be shown here]*
-
- ### After Deskewing
- *[The same document, with skew corrected]*
-
- ### After Thresholding
- *[The document converted to binary with clear text]*
-
- ### After Morphological Operations
- *[The binary image with small noise removed and/or gaps filled]*
-
- ## Troubleshooting
-
- ### Poor Deskewing Results
- - **Symptom**: Document skew is not correctly detected or corrected
- - **Solution**: Try adjusting `angle_threshold` or `max_angle`, or disable Hough transform for handwritten documents
-
- ### Thresholding Issues
- - **Symptom**: Text is lost or background noise is excessive after thresholding
- - **Solution**: Try changing the thresholding method or adjusting `adaptive_block_size` and `adaptive_constant`
-
- ### Performance Concerns
- - **Symptom**: Processing is too slow for large documents
- - **Solution**: Enable parallel processing, reduce image size, or disable some preprocessing steps for faster results
 
docs/preprocessing_triage.md DELETED
@@ -1,17 +0,0 @@
- # OCR Preprocessing Triage
-
- ## Quick Fixes Implemented
-
- 1. **Handwritten** - Disabled thresholding, uses grayscale only
- 2. **Newspapers** - Increased block size (51) and constant (10) for softer thresholding
- 3. **JPEG Artifacts** - Auto-detection and specialized denoising
- 4. **Border Issues** - Crops edges after deskew to avoid threshold problems
- 5. **Low Resolution** - Upscales small text for better recognition
-
- ## Testing
-
- ```
- python testing/test_triage_fix.py
- ```
-
- Check `output/comparison/` for results.