Delete docs
- docs/config_refactoring.md +0 -47
- docs/preprocessing.md +0 -179
- docs/preprocessing_triage.md +0 -17
docs/config_refactoring.md
DELETED
@@ -1,47 +0,0 @@
# Configuration Refactoring

## Overview
This document outlines the changes made to centralize configuration parameters and reduce technical debt in the OCR processing system.

## Key Changes

### Centralized Configuration
All previously hard-coded parameters have been moved to `config.py` and organized by functional category:

- **PDF_SETTINGS**: Parameters for PDF processing
- **SEGMENTATION_SETTINGS**: Image segmentation configuration
- **CACHE_SETTINGS**: Cache TTL and capacity settings
- **TEXT_REPAIR_SETTINGS**: Duplication detection and repair thresholds

### Environment Variable Support
All configuration parameters can now be overridden via environment variables:

```bash
# Example: Override PDF DPI
export PDF_DEFAULT_DPI=200

# Example: Increase cache size
export CACHE_MAX_ENTRIES=50
```
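
The overrides above could be read on the Python side roughly as follows. This is a minimal sketch, not the project's actual `config.py`: the helper `env_int`, the function `load_settings`, the nested key names (`default_dpi`, `max_entries`), and the default values 150 and 20 are all assumptions made for illustration. Only the variable names `PDF_DEFAULT_DPI` and `CACHE_MAX_ENTRIES` come from the text.

```python
import os


def env_int(name: str, default: int) -> int:
    """Return the integer value of environment variable `name`, or `default`.

    Falls back to `default` when the variable is unset or not a valid integer.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        return default


def load_settings() -> dict:
    """Build the settings dicts, letting the environment override defaults."""
    # Key names and default values here are illustrative assumptions.
    return {
        "PDF_SETTINGS": {"default_dpi": env_int("PDF_DEFAULT_DPI", 150)},
        "CACHE_SETTINGS": {"max_entries": env_int("CACHE_MAX_ENTRIES", 20)},
    }
```

Falling back to the default on an unparsable value keeps a typo in the environment from crashing startup. A stricter deployment might prefer to raise instead.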

### Import Strategy
To prevent circular dependencies, configuration is imported at function level where needed:

```python
def process_image():
    from config import SEGMENTATION_SETTINGS
    # Function implementation using settings
```

## Benefits

- **Maintainability**: Settings are centralized and documented
- **Flexibility**: Configuration can be adjusted without code changes
- **Consistency**: Standardized approach to configuration across modules
- **Traceability**: Clear overview of all configurable parameters

## Future Improvements

- Add configuration schema validation
- Support configuration profiles (dev/test/prod)
- Add detailed documentation for each parameter
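
As an illustration of what the schema validation item might look like, here is a minimal sketch. The schema format, the name `CACHE_SCHEMA`, the keys `ttl_seconds` and `max_entries`, and the function `validate_settings` are hypothetical. The source only says that validation is planned, not how.

```python
from typing import Any

# Hypothetical schema: setting name -> (expected type, minimum allowed value).
CACHE_SCHEMA = {
    "ttl_seconds": (int, 1),
    "max_entries": (int, 1),
}


def validate_settings(settings: dict[str, Any], schema: dict[str, tuple]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for key, (expected_type, minimum) in schema.items():
        if key not in settings:
            problems.append(f"missing key: {key}")
            continue
        value = settings[key]
        # bool is a subclass of int, so reject it explicitly for numeric fields.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            problems.append(f"{key}: expected {expected_type.__name__}")
        elif value < minimum:
            problems.append(f"{key}: must be >= {minimum}")
    for key in settings:
        if key not in schema:
            problems.append(f"unknown key: {key}")
    return problems
```

Returning every problem at once, rather than raising on the first, lets a single startup log line report all misconfigured values.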
docs/preprocessing.md
DELETED
@@ -1,179 +0,0 @@
# Image Preprocessing for Historical Document OCR

This document outlines the enhanced preprocessing capabilities for improving OCR quality on historical documents, including deskewing, thresholding, and morphological operations.

## Overview

The preprocessing pipeline offers several options to enhance image quality before OCR processing:

1. **Deskewing**: Automatically detects and corrects document skew using multiple detection algorithms
2. **Thresholding**: Converts grayscale images to binary using adaptive or Otsu methods with pre-blur options
3. **Morphological Operations**: Cleans up binary images by removing noise or filling in gaps
4. **Document-Type Specific Settings**: Customized preprocessing configurations for different document types

## Configuration

Preprocessing options are set in `config.py` and are tunable per document type. All settings are accessible through environment variables for easy deployment configuration.

### Deskewing

```python
"deskew": {
    "enabled": True/False,  # Whether to apply deskewing
    "angle_threshold": 0.1,  # Minimum angle (degrees) to trigger deskewing
    "max_angle": 45.0,  # Maximum correction angle
    "use_hough": True/False,  # Use Hough transform in addition to minAreaRect
    "consensus_method": "average",  # How to combine angle estimations
    "fallback": {"enabled": True/False}  # Fall back to original if deskewing fails
}
```

Deskewing uses two methods:
- **minAreaRect**: Finds contours in the binary image and calculates their orientation
- **Hough Transform**: Detects lines in the image and their angles

The `consensus_method` can be:
- `"average"`: Average of all detected angles (most stable)
- `"median"`: Median of all angles (robust to outliers)
- `"min"`: Minimum absolute angle (most conservative)
- `"max"`: Maximum absolute angle (most aggressive)
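
The four consensus methods can be sketched in a few lines of standard-library Python. The function names `combine_angles` and `should_deskew` are hypothetical. Treating angles above `max_angle` as unreliable and skipping them is also an assumption, since the source does not say whether such angles are skipped or clamped.

```python
import statistics


def combine_angles(angles: list[float], method: str = "average") -> float:
    """Combine per-detector skew estimates (in degrees) into one angle."""
    if not angles:
        raise ValueError("at least one angle estimate is required")
    if method == "average":
        return statistics.fmean(angles)
    if method == "median":
        return statistics.median(angles)
    if method == "min":
        # Estimate with the smallest magnitude, sign preserved.
        return min(angles, key=abs)
    if method == "max":
        # Estimate with the largest magnitude, sign preserved.
        return max(angles, key=abs)
    raise ValueError(f"unknown consensus method: {method!r}")


def should_deskew(angle: float, angle_threshold: float = 0.1,
                  max_angle: float = 45.0) -> bool:
    """Correct only when the angle is large enough to matter but still plausible."""
    return angle_threshold <= abs(angle) <= max_angle
```

For example, with estimates `[1.0, 2.0, -6.0]` the median is `1.0`, while the average is pulled to `-1.0` by the single outlier.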

### Thresholding

```python
"thresholding": {
    "method": "adaptive",  # "none", "otsu", or "adaptive"
    "adaptive_block_size": 11,  # Block size for adaptive thresholding (must be odd)
    "adaptive_constant": 2,  # Constant subtracted from mean
    "otsu_gaussian_blur": 1,  # Blur kernel size for Otsu pre-processing
    "preblur": {
        "enabled": True/False,  # Whether to apply pre-blur
        "method": "gaussian",  # "gaussian" or "median"
        "kernel_size": 3  # Blur kernel size (must be odd)
    },
    "fallback": {"enabled": True/False}  # Fall back to grayscale if thresholding fails
}
```

Thresholding methods:
- **Otsu**: Automatically determines optimal global threshold (best for high-contrast documents)
- **Adaptive**: Calculates thresholds for different regions (better for uneven lighting, historical documents)
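
To make "automatically determines optimal global threshold" concrete, here is a pure-Python sketch of Otsu's method on a flat list of 8-bit grayscale values. It is illustrative only: the source does not show its implementation, and the names `otsu_threshold` and `binarize` are hypothetical. Otsu picks the threshold that maximizes the between-class variance of the dark and light pixel groups.

```python
def otsu_threshold(pixels: list[int]) -> int:
    """Return the Otsu threshold t for 8-bit values; pixels <= t form the dark class."""
    if not pixels:
        raise ValueError("pixels must not be empty")
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(i * count for i, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for t in range(256):
        weight_dark += hist[t]
        sum_dark += t * hist[t]
        weight_light = total - weight_dark
        if weight_dark == 0 or weight_light == 0:
            continue  # One class is empty, so there is nothing to separate.
        mean_dark = sum_dark / weight_dark
        mean_light = (total_sum - sum_dark) / weight_light
        # Between-class variance, up to a constant factor of 1 / total**2.
        between_var = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels: list[int], threshold: int) -> list[int]:
    """Map values above the threshold to 255 (white) and the rest to 0 (black)."""
    return [255 if p > threshold else 0 for p in pixels]
```

Adaptive thresholding differs in that it computes a separate threshold per neighborhood (of `adaptive_block_size` pixels) instead of one value for the whole image, which is why it copes better with uneven lighting.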

### Morphological Operations

```python
"morphology": {
    "enabled": True/False,  # Whether to apply morphological operations
    "operation": "close",  # "open", "close", "both"
    "kernel_size": 1,  # Size of the structuring element
    "kernel_shape": "rect"  # "rect", "ellipse", "cross"
}
```

Morphological operations:
- **Open**: Erosion followed by dilation - removes small noise and disconnects thin connections
- **Close**: Dilation followed by erosion - fills small holes and connects broken elements
- **Both**: Applies opening followed by closing
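
The open and close operations can be sketched without any image library. This sketch assumes a binary image stored as a list of rows with foreground pixels equal to 1, a square ("rect") kernel, and a window clipped at the image border. The function names are hypothetical, and a production pipeline would normally use an image-processing library instead.

```python
Grid = list[list[int]]


def _filter(img: Grid, size: int, reducer) -> Grid:
    """Apply `reducer` (min or max) over a size x size square window at every pixel."""
    h, w, r = len(img), len(img[0]), size // 2
    return [
        [
            reducer(
                img[j][i]
                # The window is clipped at the image border.
                for j in range(max(0, y - r), min(h, y + r + 1))
                for i in range(max(0, x - r), min(w, x + r + 1))
            )
            for x in range(w)
        ]
        for y in range(h)
    ]


def erode(img: Grid, size: int = 3) -> Grid:
    """A pixel stays 1 only if every pixel in its window is 1."""
    return _filter(img, size, min)


def dilate(img: Grid, size: int = 3) -> Grid:
    """A pixel becomes 1 if any pixel in its window is 1."""
    return _filter(img, size, max)


def opening(img: Grid, size: int = 3) -> Grid:
    """Erosion then dilation: removes specks smaller than the kernel."""
    return dilate(erode(img, size), size)


def closing(img: Grid, size: int = 3) -> Grid:
    """Dilation then erosion: fills holes smaller than the kernel."""
    return erode(dilate(img, size), size)
```

In this sketch the "both" operation corresponds to `closing(opening(img))`, and a kernel of size 1 leaves the image unchanged because each window contains only the pixel itself.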

### Document Type Configurations

The system includes optimized settings for different document types:

```python
"document_types": {
    "standard": {
        # Default settings - will use the global settings
    },
    "newspaper": {
        "deskew": {"enabled": True, "angle_threshold": 0.3, "max_angle": 10.0},
        "thresholding": {
            "method": "adaptive",
            "adaptive_block_size": 15,
            "adaptive_constant": 3,
            "preblur": {"method": "gaussian", "kernel_size": 3}
        },
        "morphology": {"operation": "close", "kernel_size": 1}
    },
    "handwritten": {
        "deskew": {"enabled": True, "angle_threshold": 0.5, "use_hough": False},
        "thresholding": {
            "method": "adaptive",
            "adaptive_block_size": 31,
            "adaptive_constant": 5,
            "preblur": {"method": "median", "kernel_size": 3}
        },
        "morphology": {"operation": "open", "kernel_size": 1}
    },
    "book": {
        "deskew": {"enabled": True},
        "thresholding": {
            "method": "otsu",
            "preblur": {"method": "gaussian", "kernel_size": 5}
        },
        "morphology": {"operation": "both", "kernel_size": 1}
    }
}
```
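
Since each document type lists only the keys it changes, the effective settings presumably come from layering the type-specific values over the global ones. The sketch below shows one way to do that with a recursive merge. The names `deep_merge` and `settings_for` are hypothetical, and the source does not confirm that the project resolves settings this way.

```python
import copy


def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Return a new dict where overrides win, merging nested dicts key by key."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def settings_for(document_type: str, global_settings: dict, document_types: dict) -> dict:
    """Resolve effective settings; unknown types fall back to the global defaults."""
    return deep_merge(global_settings, document_types.get(document_type, {}))
```

With this approach the empty `"standard"` entry resolves to the global settings unchanged, and `"handwritten"` would keep the global `max_angle` while overriding `angle_threshold` and `use_hough`.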

## Performance and Logging

```python
"performance": {
    "parallel": {
        "enabled": True/False,  # Whether to use parallel processing
        "max_workers": 4  # Maximum number of worker threads
    },
    "timeout_ms": 10000  # Timeout for preprocessing (in milliseconds)
}

"logging": {
    "enabled": True/False,  # Whether to log preprocessing metrics
    "metrics": ["skew_angle", "binary_nonzero_pct", "processing_time"],
    "output_path": "logs/preprocessing_metrics.json"
}
```
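
A minimal sketch of how `max_workers` and `timeout_ms` might be applied, using the standard library's `ThreadPoolExecutor`. The function `preprocess_pages` and its `preprocess` callback are hypothetical. The sketch also assumes the timeout covers the whole batch, since the source does not say whether `timeout_ms` applies per image or per run.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def preprocess_pages(pages, preprocess, max_workers=4, timeout_ms=10000):
    """Run `preprocess` on each page in a thread pool and collect timing metrics.

    Returns a list of (result, metrics) tuples in input order. Raises
    concurrent.futures.TimeoutError if a result is not ready in time.
    """
    def timed(page):
        start = time.perf_counter()
        result = preprocess(page)
        return result, {"processing_time": time.perf_counter() - start}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order; the timeout is measured from this call.
        return list(pool.map(timed, pages, timeout=timeout_ms / 1000))
```

The per-page `processing_time` value matches one of the metric names in the logging block. Writing the collected metrics to `output_path` is left out of the sketch.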

## Usage with OCR Processing

When processing documents, simply specify the document type:

```python
preprocessing_options = {
    "document_type": "newspaper",  # Use newspaper-optimized settings
    "grayscale": True,  # Legacy option: apply grayscale conversion
    "denoise": True,  # Legacy option: apply denoising
    "contrast": 10,  # Legacy option: adjust contrast (0-100)
    "rotation": 0  # Legacy option: manual rotation (degrees)
}

# Apply preprocessing and OCR
result = process_file(file_bytes, file_ext, preprocessing_options=preprocessing_options)
```

## Visual Examples

### Original Document
*[A historical newspaper or document image would be shown here]*

### After Deskewing
*[The same document, with skew corrected]*

### After Thresholding
*[The document converted to binary with clear text]*

### After Morphological Operations
*[The binary image with small noise removed and/or gaps filled]*

## Troubleshooting

### Poor Deskewing Results
- **Symptom**: Document skew is not correctly detected or corrected
- **Solution**: Try adjusting `angle_threshold` or `max_angle`, or disable Hough transform for handwritten documents

### Thresholding Issues
- **Symptom**: Text is lost or background noise is excessive after thresholding
- **Solution**: Try changing the thresholding method or adjusting `adaptive_block_size` and `adaptive_constant`

### Performance Concerns
- **Symptom**: Processing is too slow for large documents
- **Solution**: Enable parallel processing, reduce image size, or disable some preprocessing steps for faster results
docs/preprocessing_triage.md
DELETED
@@ -1,17 +0,0 @@
# OCR Preprocessing Triage

## Quick Fixes Implemented

1. **Handwritten** - Disabled thresholding, uses grayscale only
2. **Newspapers** - Increased block size (51) and constant (10) for softer thresholding
3. **JPEG Artifacts** - Auto-detection and specialized denoising
4. **Border Issues** - Crops edges after deskew to avoid threshold problems
5. **Low Resolution** - Upscales small text for better recognition
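
As an illustration of the border fix (item 4), the sketch below trims a fixed fraction from each edge of an image stored as a list of rows. The function name `crop_border` and the default fraction of 2% are assumptions. The source does not state how much is cropped or how the crop is computed.

```python
def crop_border(img: list[list[int]], fraction: float = 0.02) -> list[list[int]]:
    """Trim `fraction` of the height and width from each edge of a 2D image."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    h, w = len(img), len(img[0])
    dy, dx = int(h * fraction), int(w * fraction)
    return [row[dx:w - dx] for row in img[dy:h - dy]]
```

Rotating an image during deskew typically leaves filler pixels along the edges, so trimming a thin margin keeps those from being picked up by the thresholding step.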

## Testing

```bash
python testing/test_triage_fix.py
```

Check `output/comparison/` for results.