# Historical OCR Application Improvements

Based on a thorough code review of the Historical OCR application, I've identified several areas for improvement to reduce technical debt and enhance the application's functionality, maintainability, and performance.

## 1. Code Organization and Structure

### 1.1 Modularize Large Functions

Several functions in the codebase are excessively long and handle multiple responsibilities:

- **Issue**: `process_file()` in ocr_processing.py is over 400 lines and handles file validation, preprocessing, OCR processing, and result formatting.
- **Solution**: Break it down into smaller, focused functions:

```python
def process_file(uploaded_file, options):
    # Validate and prepare file
    file_info = validate_and_prepare_file(uploaded_file)

    # Apply preprocessing based on document type
    preprocessed_file = preprocess_document(file_info, options)

    # Perform OCR processing
    ocr_result = perform_ocr(preprocessed_file, options)

    # Format and enhance results
    return format_and_enhance_results(ocr_result, file_info)
```

### 1.2 Consistent Error Handling

Error handling approaches vary across modules:

- **Issue**: Some functions use try/except blocks with detailed logging, while others return error dictionaries or raise exceptions.
- **Solution**: Implement a consistent error handling strategy:

```python
import functools
import logging

logger = logging.getLogger(__name__)


class OCRError(Exception):
    def __init__(self, message, error_code=None, details=None):
        self.message = message
        self.error_code = error_code
        self.details = details
        super().__init__(self.message)


def handle_error(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except OCRError as e:
            logger.error(f"OCR Error: {e.message} (Code: {e.error_code})")
            return {"error": e.message, "error_code": e.error_code, "details": e.details}
        except Exception as e:
            logger.error(f"Unexpected error: {str(e)}")
            return {"error": "An unexpected error occurred", "details": str(e)}
    return wrapper
```

## 2. API Integration and Performance

### 2.1 API Client Optimization

The Mistral API client initialization and usage can be improved:

- **Issue**: A new client is initialized for every request, and error handling is duplicated across call sites.
- **Solution**: Create a singleton API client with centralized error handling:

```python
import os

from mistralai import Mistral


class MistralClient:
    _instance = None

    @classmethod
    def get_instance(cls, api_key=None):
        if cls._instance is None:
            cls._instance = cls(api_key)
        return cls._instance

    def __init__(self, api_key=None):
        self.api_key = api_key or os.environ.get("MISTRAL_API_KEY", "")
        self.client = Mistral(api_key=self.api_key)

    def process_ocr(self, document, **kwargs):
        try:
            return self.client.ocr.process(document=document, **kwargs)
        except Exception as e:
            # Centralized error handling
            return self._handle_api_error(e)
```
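With the singleton in place, call sites reuse one client rather than constructing a new `Mistral` object per request. A minimal usage sketch, assuming the class above; the sample path, the base64 `image_url` payload shape, and the `mistral-ocr-latest` model name are illustrative rather than taken from the existing code:

```python
import base64

# Hypothetical call site: every request shares the same underlying client.
client = MistralClient.get_instance()

# Assumed sample file; encode it as a data URL for the OCR endpoint.
with open("samples/handwritten_letter.jpg", "rb") as f:
    encoded_image = base64.b64encode(f.read()).decode("utf-8")

result = client.process_ocr(
    document={"type": "image_url", "image_url": f"data:image/jpeg;base64,{encoded_image}"},
    model="mistral-ocr-latest",  # assumed model identifier
)
```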
### 2.2 Caching Strategy

The current caching approach can be improved:

- **Issue**: Cache keys don't always account for all relevant parameters, and TTL is fixed at 24 hours.
- **Solution**: Implement a more sophisticated caching strategy:

```python
def generate_cache_key(file_content, options):
    # Create a comprehensive hash of all relevant parameters
    options_str = json.dumps(options, sort_keys=True)
    content_hash = hashlib.md5(file_content).hexdigest()
    return f"{content_hash}_{hashlib.md5(options_str.encode()).hexdigest()}"


# Adaptive TTL based on document type
def get_cache_ttl(document_type):
    ttl_map = {
        "handwritten": 48 * 3600,  # 48 hours for handwritten docs
        "newspaper": 24 * 3600,    # 24 hours for newspapers
        "standard": 12 * 3600      # 12 hours for standard docs
    }
    return ttl_map.get(document_type, 24 * 3600)
```

## 3. State Management

### 3.1 Streamlit Session State

The application uses a complex state management approach:

- **Issue**: Many session state variables with unclear relationships and reset logic.
- **Solution**: Implement a more structured state management approach:

```python
class DocumentState:
    def __init__(self):
        self.document = None
        self.original_bytes = None
        self.name = None
        self.mime_type = None
        self.is_sample = False
        self.processed = False
        self.temp_files = []

    def reset(self):
        # Clean up temp files
        for temp_file in self.temp_files:
            if os.path.exists(temp_file):
                os.unlink(temp_file)
        # Reset state
        self.__init__()


# Initialize in session state
if 'document_state' not in st.session_state:
    st.session_state.document_state = DocumentState()
```

### 3.2 Result History Management

The current approach to managing result history can be improved:

- **Issue**: Results are stored directly in session state with limited management.
- **Solution**: Create a dedicated class for result history:

```python
class ResultHistory:
    def __init__(self, max_results=20):
        self.results = []
        self.max_results = max_results

    def add_result(self, result):
        # Add timestamp and ensure result is serializable
        result = self._prepare_result(result)
        self.results.insert(0, result)
        # Trim to max size
        if len(self.results) > self.max_results:
            self.results = self.results[:self.max_results]

    def _prepare_result(self, result):
        # Add timestamp and ensure result is serializable
        result = result.copy()
        result['timestamp'] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        # Ensure result is serializable
        return json.loads(json.dumps(result, default=str))
```

## 4. Image Processing Pipeline

### 4.1 Preprocessing Configuration

The preprocessing configuration can be improved:

- **Issue**: Preprocessing options are scattered across different parts of the code.
- **Solution**: Create a centralized preprocessing configuration:

```python
PREPROCESSING_CONFIGS = {
    "standard": {
        "grayscale": True,
        "denoise": True,
        "contrast": 5,
        "deskew": True
    },
    "handwritten": {
        "grayscale": True,
        "denoise": True,
        "contrast": 10,
        "deskew": True,
        "adaptive_threshold": {
            "block_size": 21,
            "constant": 5
        }
    },
    "newspaper": {
        "grayscale": True,
        "denoise": True,
        "contrast": 5,
        "deskew": True,
        "column_detection": True
    }
}
```
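To illustrate how such a table could drive the pipeline, here is a minimal Pillow-based sketch of a hypothetical `apply_preprocessing_config` helper that works on raw image bytes. The mapping from the `contrast` value to an enhancement factor is an assumption, and deskewing, column detection, and adaptive thresholding (which would need OpenCV) are deliberately omitted:

```python
from io import BytesIO

from PIL import Image, ImageEnhance, ImageFilter, ImageOps


def apply_preprocessing_config(image_bytes, options):
    """Apply the steps from PREPROCESSING_CONFIGS to raw image bytes (illustrative sketch)."""
    doc_type = options.get("document_type", "standard")
    config = PREPROCESSING_CONFIGS.get(doc_type, PREPROCESSING_CONFIGS["standard"])

    image = Image.open(BytesIO(image_bytes))
    if config.get("grayscale"):
        image = ImageOps.grayscale(image)
    if config.get("denoise"):
        # Simple median-filter denoising
        image = image.filter(ImageFilter.MedianFilter(size=3))
    if config.get("contrast"):
        # Assumed mapping: a contrast setting of 5 becomes a 1.5x enhancement, 10 becomes 2.0x
        image = ImageEnhance.Contrast(image).enhance(1 + config["contrast"] / 10)
    # Deskew, column detection, and adaptive thresholding are left to the full pipeline.

    output = BytesIO()
    image.save(output, format="PNG")
    return output.getvalue()
```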
### 4.2 Image Segmentation

The image segmentation approach can be improved:

- **Issue**: Segmentation is optional and not well-integrated with the preprocessing pipeline.
- **Solution**: Make segmentation a standard part of the preprocessing pipeline for certain document types:

```python
def preprocess_document(file_info, options):
    # Apply basic preprocessing
    preprocessed_file = apply_basic_preprocessing(file_info, options)

    # Apply segmentation for specific document types
    if options["document_type"] in ["newspaper", "book", "multi_column"]:
        return apply_segmentation(preprocessed_file, options)

    return preprocessed_file
```

## 5. User Experience Enhancements

### 5.1 Progressive Loading

Improve the user experience during processing:

- **Issue**: The UI can appear frozen during long-running operations.
- **Solution**: Implement progressive loading and feedback:

```python
def process_with_feedback(file, options, progress_callback):
    # Update progress at each step
    progress_callback(10, "Validating document...")
    file_info = validate_and_prepare_file(file)

    progress_callback(30, "Preprocessing document...")
    preprocessed_file = preprocess_document(file_info, options)

    progress_callback(50, "Performing OCR...")
    ocr_result = perform_ocr(preprocessed_file, options)

    progress_callback(80, "Enhancing results...")
    final_result = format_and_enhance_results(ocr_result, file_info)

    progress_callback(100, "Complete!")
    return final_result
```

### 5.2 Result Visualization

Enhance the visualization of OCR results:

- **Issue**: Results are displayed in a basic format with limited visualization.
- **Solution**: Implement enhanced visualization options:

```python
def display_enhanced_results(result):
    # Create tabs for different views
    tabs = st.tabs(["Text", "Annotated", "Side-by-Side", "JSON"])

    with tabs[0]:
        # Display formatted text
        st.markdown(format_ocr_text(result["ocr_contents"]["raw_text"]))

    with tabs[1]:
        # Display annotated image with bounding boxes
        display_annotated_image(result)

    with tabs[2]:
        # Display side-by-side comparison
        col1, col2 = st.columns(2)
        with col1:
            st.image(result["original_image"])
        with col2:
            st.markdown(format_ocr_text(result["ocr_contents"]["raw_text"]))

    with tabs[3]:
        # Display raw JSON
        st.json(result)
```

## 6. Testing and Reliability

### 6.1 Automated Testing

Implement comprehensive testing:

- **Issue**: Limited or no automated testing.
- **Solution**: Implement unit and integration tests:

```python
# Unit test for preprocessing
def test_preprocess_image():
    # Test with various document types
    for doc_type in ["standard", "handwritten", "newspaper"]:
        # Load test image
        with open(f"test_data/{doc_type}_sample.jpg", "rb") as f:
            image_bytes = f.read()

        # Apply preprocessing
        options = {"document_type": doc_type, "grayscale": True, "denoise": True}
        result = preprocess_image(image_bytes, options)

        # Assert result is not None and different from original
        assert result is not None
        assert result != image_bytes
```

### 6.2 Error Recovery

Implement better error recovery mechanisms:

- **Issue**: Errors in one part of the pipeline can cause the entire process to fail.
- **Solution**: Implement graceful degradation:

```python
def process_with_fallbacks(file, options):
    try:
        # Try full processing pipeline
        return full_processing_pipeline(file, options)
    except OCRError as e:
        logger.warning(f"Full pipeline failed: {e.message}. Trying simplified pipeline.")
        try:
            # Try simplified pipeline
            return simplified_processing_pipeline(file, options)
        except Exception as e2:
            logger.error(f"Simplified pipeline failed: {str(e2)}. Falling back to basic OCR.")
            # Fall back to basic OCR
            return basic_ocr_only(file)
```
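The simplified and basic pipelines are referenced above but not defined. As a rough sketch of the last resort, `basic_ocr_only` could send the untouched file straight to the OCR endpoint and wrap whatever comes back; the `prepare_document` helper and the way text is pulled out of the response are assumptions, not existing code:

```python
def basic_ocr_only(file):
    """Last-resort fallback: raw OCR with no preprocessing or enhancement (illustrative sketch)."""
    client = MistralClient.get_instance()  # singleton from section 2.1
    document = prepare_document(file)      # assumed helper that builds the API document payload

    response = client.process_ocr(document=document)

    # Assumed response handling: join whatever per-page text the provider returned.
    raw_text = "\n\n".join(getattr(page, "markdown", "") for page in getattr(response, "pages", []))
    return {
        "ocr_contents": {"raw_text": raw_text},
        "degraded": True,  # lets the UI flag that a reduced pipeline was used
    }
```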
## 7. Documentation and Maintainability

### 7.1 Code Documentation

Improve code documentation:

- **Issue**: Inconsistent documentation across modules.
- **Solution**: Implement a consistent docstring format and add module-level documentation:

```python
"""
OCR Processing Module

This module handles the core OCR processing functionality, including:
- File validation and preparation
- Image preprocessing
- OCR processing with Mistral AI
- Result formatting and enhancement

The main entry point is the `process_file` function.
"""

def process_file(file, options):
    """
    Process a file with OCR.

    Args:
        file: The file to process (UploadedFile or bytes)
        options: Dictionary of processing options
            - document_type: Type of document (standard, handwritten, etc.)
            - preprocessing: Dictionary of preprocessing options
            - use_vision: Whether to use vision model

    Returns:
        Dictionary containing OCR results and metadata

    Raises:
        OCRError: If OCR processing fails
    """
    # Implementation
```

### 7.2 Configuration Management

Improve configuration management:

- **Issue**: Configuration is scattered across multiple files.
- **Solution**: Implement a centralized configuration system:

```python
"""
Configuration Module

This module provides a centralized configuration system for the application.
"""

import os
import yaml
from pathlib import Path

class Config:
    _instance = None

    @classmethod
    def get_instance(cls):
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def __init__(self):
        self.config = {}
        self.load_config()

    def load_config(self):
        # Load from config file
        config_path = Path(__file__).parent / "config.yaml"
        if config_path.exists():
            with open(config_path, "r") as f:
                self.config = yaml.safe_load(f)

        # Override with environment variables
        for key, value in os.environ.items():
            if key.startswith("OCR_"):
                config_key = key[4:].lower()
                self.config[config_key] = value

    def get(self, key, default=None):
        return self.config.get(key, default)
```

## 8. Security Enhancements

### 8.1 API Key Management

Improve API key management:

- **Issue**: API keys are stored in environment variables with limited validation.
- **Solution**: Implement secure API key management:

```python
def get_api_key():
    # Try to get from secure storage first
    api_key = get_from_secure_storage("mistral_api_key")

    # Fall back to environment variable
    if not api_key:
        api_key = os.environ.get("MISTRAL_API_KEY", "")

    # Validate key format
    if api_key and not re.match(r'^[A-Za-z0-9_-]{30,}$', api_key):
        logger.warning("API key format appears invalid")

    return api_key
```

### 8.2 Input Validation

Improve input validation:

- **Issue**: Limited validation of user inputs.
- **Solution**: Implement comprehensive input validation:

```python
def validate_file(file):
    # Check file size
    if len(file.getvalue()) > MAX_FILE_SIZE:
        raise OCRError("File too large", "FILE_TOO_LARGE")

    # Check file type
    file_type = get_file_type(file)
    if file_type not in ALLOWED_FILE_TYPES:
        raise OCRError(f"Unsupported file type: {file_type}", "UNSUPPORTED_FILE_TYPE")

    # Check for malicious content
    if is_potentially_malicious(file):
        raise OCRError("File appears to be malicious", "SECURITY_RISK")

    return file_type
```
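`validate_file` relies on constants and helpers that would live alongside it. A possible companion sketch is below; the size limit, the allowed types, the magic-byte table, and the `get_file_type` implementation are illustrative choices rather than values taken from the existing code (and `is_potentially_malicious` is left out):

```python
MAX_FILE_SIZE = 50 * 1024 * 1024  # 50 MB, illustrative limit
ALLOWED_FILE_TYPES = {"jpeg", "png", "tiff", "pdf"}

# Leading-byte signatures for the supported formats
_SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"II*\x00": "tiff",  # little-endian TIFF
    b"MM\x00*": "tiff",  # big-endian TIFF
    b"%PDF": "pdf",
}


def get_file_type(file):
    """Identify the upload by its magic bytes instead of trusting the filename extension."""
    header = file.getvalue()[:16]
    for signature, file_type in _SIGNATURES.items():
        if header.startswith(signature):
            return file_type
    return "unknown"
```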
## 9. Performance Optimizations

### 9.1 Parallel Processing

Implement parallel processing for multi-page documents:

- **Issue**: Pages are processed sequentially, which can be slow for large documents.
- **Solution**: Implement parallel processing:

```python
import concurrent.futures


def process_pdf_pages(pdf_path, options):
    # Extract pages
    pages = extract_pdf_pages(pdf_path)

    # Process pages in parallel
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        future_to_page = {executor.submit(process_page, page, options): i for i, page in enumerate(pages)}

        results = []
        for future in concurrent.futures.as_completed(future_to_page):
            page_idx = future_to_page[future]
            try:
                result = future.result()
                results.append((page_idx, result))
            except Exception as e:
                logger.error(f"Error processing page {page_idx}: {str(e)}")

    # Sort results by page index
    results.sort(key=lambda x: x[0])

    # Combine results
    return combine_page_results([r[1] for r in results])
```

### 9.2 Resource Management

Improve resource management:

- **Issue**: Temporary files are not always cleaned up properly.
- **Solution**: Implement better resource management:

```python
class TempFileManager:
    def __init__(self):
        self.temp_files = []

    def create_temp_file(self, content, suffix=".tmp"):
        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
            tmp.write(content)
            self.temp_files.append(tmp.name)
            return tmp.name

    def cleanup(self):
        for temp_file in self.temp_files:
            try:
                if os.path.exists(temp_file):
                    os.unlink(temp_file)
            except Exception as e:
                logger.warning(f"Failed to remove temp file {temp_file}: {str(e)}")
        self.temp_files = []

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.cleanup()
```

## 10. Extensibility

### 10.1 Plugin System

Implement a plugin system for extensibility:

- **Issue**: Adding new document types or processing methods requires code changes.
- **Solution**: Implement a plugin system:

```python
class OCRPlugin:
    def __init__(self, name, description):
        self.name = name
        self.description = description

    def can_handle(self, file_info):
        """Return True if this plugin can handle the file"""
        raise NotImplementedError

    def process(self, file_info, options):
        """Process the file and return results"""
        raise NotImplementedError


# Example plugin
class HandwrittenDocumentPlugin(OCRPlugin):
    def __init__(self):
        super().__init__("handwritten", "Handwritten document processor")

    def can_handle(self, file_info):
        # Check if this is a handwritten document
        return file_info.get("document_type") == "handwritten"

    def process(self, file_info, options):
        # Specialized processing for handwritten documents
        ...
```

### 10.2 API Abstraction

Create an abstraction layer for the OCR API:

- **Issue**: The application is tightly coupled to the Mistral AI API.
- **Solution**: Implement an abstraction layer:

```python
class OCRProvider:
    def process_image(self, image_path, options):
        """Process an image and return OCR results"""
        raise NotImplementedError

    def process_pdf(self, pdf_path, options):
        """Process a PDF and return OCR results"""
        raise NotImplementedError


class MistralOCRProvider(OCRProvider):
    def __init__(self, api_key=None):
        self.client = MistralClient.get_instance(api_key)

    def process_image(self, image_path, options):
        # Implementation using Mistral API
        ...

    def process_pdf(self, pdf_path, options):
        # Implementation using Mistral API
        ...


# Factory function to get the appropriate provider
def get_ocr_provider(provider_name="mistral"):
    if provider_name == "mistral":
        return MistralOCRProvider()
    # Add more providers as needed
    raise ValueError(f"Unknown OCR provider: {provider_name}")
```

## Implementation Priority