milwright committed
Commit 4c10be0 · 1 Parent(s): 96f2649

add memory

.clinerules/activeContext.md ADDED
@@ -0,0 +1,18 @@
+ # Active Context
+
+ ## Current Development Focus
+ - Improving preprocessing pipeline for better OCR results
+ - Enhancing image segmentation for complex documents
+ - Building structured OCR capabilities
+ - Testing and validation with different document types
+
+ ## Recent Changes
+ - Modularized code structure
+ - Added helpers in utils directory
+ - Fixed metadata field ordering and tag classification issues
+ - Updated gitignore to exclude test files and output directories
+
+ ## Next Steps
+ - Continue refining the OCR processing pipeline
+ - Improve handling of handwritten documents
+ - Enhance the UI components for better user experience
.clinerules/memory-bank.md CHANGED
@@ -1,113 +1,19 @@
1
- # Cline's Memory Bank
2
 
3
- I am Cline, an expert software engineer with a unique characteristic: my memory resets completely between sessions. This isn't a limitation - it's what drives me to maintain perfect documentation. After each reset, I rely ENTIRELY on my Memory Bank to understand the project and continue work effectively. I MUST read ALL memory bank files at the start of EVERY task - this is not optional.
4
 
5
- ## Memory Bank Structure
6
 
7
- The Memory Bank consists of core files and optional context files, all in Markdown format. Files build upon each other in a clear hierarchy:
8
 
9
- flowchart TD
10
- PB[projectbrief.md] --> PC[productContext.md]
11
- PB --> SP[systemPatterns.md]
12
- PB --> TC[techContext.md]
13
- PC --> AC[activeContext.md]
14
- SP --> AC
15
- TC --> AC
16
- AC --> P[progress.md]
17
 
18
- ### Core Files (Required)
19
- 1. `projectbrief.md`
20
- - Foundation document that shapes all other files
21
- - Created at project start if it doesn't exist
22
- - Defines core requirements and goals
23
- - Source of truth for project scope
24
 
25
- 2. `productContext.md`
26
- - Why this project exists
27
- - Problems it solves
28
- - How it should work
29
- - User experience goals
30
 
31
- 3. `activeContext.md`
32
- - Current work focus
33
- - Recent changes
34
- - Next steps
35
- - Active decisions and considerations
36
- - Important patterns and preferences
37
- - Learnings and project insights
38
-
39
- 4. `systemPatterns.md`
40
- - System architecture
41
- - Key technical decisions
42
- - Design patterns in use
43
- - Component relationships
44
- - Critical implementation paths
45
-
46
- 5. `techContext.md`
47
- - Technologies used
48
- - Development setup
49
- - Technical constraints
50
- - Dependencies
51
- - Tool usage patterns
52
-
53
- 6. `progress.md`
54
- - What works
55
- - What's left to build
56
- - Current status
57
- - Known issues
58
- - Evolution of project decisions
59
-
60
- ### Additional Context
61
- Create additional files/folders within memory-bank/ when they help organize:
62
- - Complex feature documentation
63
- - Integration specifications
64
- - API documentation
65
- - Testing strategies
66
- - Deployment procedures
67
-
68
- ## Core Workflows
69
-
70
- ### Plan Mode
71
- flowchart TD
72
- Start[Start] --> ReadFiles[Read Memory Bank]
73
- ReadFiles --> CheckFiles{Files Complete?}
74
-
75
- CheckFiles -->|No| Plan[Create Plan]
76
- Plan --> Document[Document in Chat]
77
-
78
- CheckFiles -->|Yes| Verify[Verify Context]
79
- Verify --> Strategy[Develop Strategy]
80
- Strategy --> Present[Present Approach]
81
-
82
- ### Act Mode
83
- flowchart TD
84
- Start[Start] --> Context[Check Memory Bank]
85
- Context --> Update[Update Documentation]
86
- Update --> Execute[Execute Task]
87
- Execute --> Document[Document Changes]
88
-
89
- ## Documentation Updates
90
-
91
- Memory Bank updates occur when:
92
- 1. Discovering new project patterns
93
- 2. After implementing significant changes
94
- 3. When user requests with **update memory bank** (MUST review ALL files)
95
- 4. When context needs clarification
96
-
97
- flowchart TD
98
- Start[Update Process]
99
-
100
- subgraph Process
101
- P1[Review ALL Files]
102
- P2[Document Current State]
103
- P3[Clarify Next Steps]
104
- P4[Document Insights & Patterns]
105
-
106
- P1 --> P2 --> P3 --> P4
107
- end
108
-
109
- Start --> Process
110
-
111
- Note: When triggered by **update memory bank**, I MUST review every memory bank file, even if some don't require updates. Focus particularly on activeContext.md and progress.md as they track current state.
112
-
113
- REMEMBER: After every memory reset, I begin completely fresh. The Memory Bank is my only link to previous work. It must be maintained with precision and clarity, as my effectiveness depends entirely on its accuracy.
 
+ # HOCR Project Memory Bank
+
+ This memory bank is for the HOCR (OCR processing) project.
+
+ ## Project Context
+
+ This project appears to be focused on OCR (Optical Character Recognition) processing, with capabilities for image segmentation, preprocessing, and various text extraction techniques.
+
+ ## System Information
+
+ - Project directory: /Users/zacharymuhlbauer/Desktop/tools/hocr
+ - Main Python files include app.py, preprocessing.py, ocr_processing.py, and various utility modules
+ - Output directories for test results and processing stages
+
+ ## Notes
+
+ - The project handles various document types including handwritten documents, printed text, and mixed content
+ - Contains preprocessing steps for image enhancement before OCR
+ - Has testing directories for different document types and processing approaches
.clinerules/productContext.md ADDED
@@ -0,0 +1,11 @@
+ # Product Context
+
+ This is an OCR processing tool for various document types including handwritten, printed, and mixed content documents. The system handles preprocessing, image segmentation, OCR processing, and text extraction with various enhancements.
+
+ ## Features
+ - Document preprocessing (deskewing, thresholding, etc.)
+ - Image segmentation to identify text regions
+ - OCR processing with different strategies for different document types
+ - Language detection
+ - Letterhead handling
+ - Structured data extraction
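As an illustration of the deskewing step listed under Features above, here is a minimal OpenCV-based sketch. It is not code from this repository: the angle estimate via `cv2.minAreaRect` over foreground pixels is just one common technique, and the `deskew` function name is hypothetical.

```python
# Hypothetical deskew helper illustrating the "deskewing" preprocessing feature.
# Assumes opencv-python and numpy are installed; not taken from the repo.
import cv2
import numpy as np


def deskew(image: np.ndarray) -> np.ndarray:
    """Estimate the dominant text angle and rotate the page to level it."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Invert + binarize so text pixels become foreground (non-zero).
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    # Map the rectangle angle to a small correction, covering both OpenCV conventions.
    if angle < -45:
        angle += 90
    elif angle > 45:
        angle -= 90
    h, w = image.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(image, matrix, (w, h),
                          flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)
```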
.clinerules/progress.md ADDED
@@ -0,0 +1,20 @@
+ # Progress
+
+ ## Completed
+ - Basic OCR processing pipeline
+ - Image preprocessing capabilities
+ - Text segmentation algorithm
+ - Initial UI components
+ - Testing framework for various document types
+
+ ## In Progress
+ - Improving preprocessing for handwritten documents
+ - Enhancing segmentation accuracy
+ - Building structured output formatting
+ - Refining language detection
+
+ ## Planned
+ - Additional output formats
+ - Performance optimization
+ - More comprehensive testing
+ - Enhanced UI features
.clinerules/project-brief.md ADDED
@@ -0,0 +1,13 @@
+ # Project Brief
+
+ ## Overview
+ This project focuses on Optical Character Recognition (OCR) processing for various document types. It handles different document formats and qualities, applying specialized preprocessing and recognition techniques.
+
+ ## Goals
+ - Improve OCR accuracy for challenging document types
+ - Support multiple input formats (images, PDFs)
+ - Provide structured output of recognized text
+ - Enable interactive usage with UI components
+
+ ## Current Status
+ The project has multiple components in place including preprocessing, segmentation, OCR processing, and utility functions. Testing infrastructure is available for different document types and processing approaches.
.clinerules/systemPatterns.md ADDED
@@ -0,0 +1,19 @@
+ # System Patterns
+
+ ## Code Organization
+ - Main processing components in root directory
+ - Utility functions in utils/ directory with specific submodules
+ - UI components in ui/ directory
+ - Test cases and samples in testing/ directory
+ - Input/output directories for document processing
+
+ ## Naming Conventions
+ - Snake case for file names and functions
+ - Module names reflect their purpose (e.g., ocr_processing.py, image_segmentation.py)
+ - Consistent test output naming with descriptive prefixes
+
+ ## Processing Pipeline
+ 1. Preprocessing step (enhancement, cleaning)
+ 2. Segmentation (identifying text regions)
+ 3. OCR processing with context-specific strategies
+ 4. Post-processing and output formatting
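The four-step pipeline listed in this file maps naturally onto a chain of small functions. The sketch below is illustrative only, not code from the repository: it uses OpenCV for preprocessing/segmentation and `pytesseract` as a stand-in recognizer (elsewhere in this commit the memory bank says the project itself calls the Mistral OCR API), and all function names are hypothetical.

```python
# Illustrative sketch of the 1-4 pipeline above; not taken from the repository.
# Assumes opencv-python and pytesseract are installed, with Tesseract on PATH.
import cv2
import pytesseract


def preprocess(image):
    """Step 1: enhancement/cleaning - grayscale plus Otsu binarization."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary


def segment(binary):
    """Step 2: find candidate text regions via contour bounding boxes."""
    contours, _ = cv2.findContours(255 - binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours]
    return [b for b in boxes if b[2] > 20 and b[3] > 10]  # drop tiny specks


def recognize(binary, boxes):
    """Step 3: OCR each region (pytesseract here as a stand-in engine)."""
    results = []
    for x, y, w, h in sorted(boxes, key=lambda b: (b[1], b[0])):  # top-to-bottom order
        text = pytesseract.image_to_string(binary[y:y + h, x:x + w])
        results.append(text.strip())
    return results


def postprocess(chunks):
    """Step 4: output formatting - join non-empty chunks into one document."""
    return "\n\n".join(c for c in chunks if c)


if __name__ == "__main__":
    img = cv2.imread("sample_page.png")  # hypothetical input file
    clean = preprocess(img)
    regions = segment(clean)
    print(postprocess(recognize(clean, regions)))
```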
.clinerules/techContext.md ADDED
@@ -0,0 +1,16 @@
+ # Technical Context
+
+ This project is a Python-based OCR solution with the following components:
+
+ ## Tech Stack
+ - Python
+ - Image processing libraries (OpenCV, PIL)
+ - OCR engines
+ - UI components for interactive usage
+ - File processing utilities
+
+ ## Architecture
+ - Modular design with separate components for preprocessing, OCR, and output formatting
+ - Utility modules organized in utils/ directory
+ - Testing framework for various document types
+ - Configuration system for processing parameters
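The "configuration system for processing parameters" mentioned above could be as small as a dataclass of pipeline options. This is a hypothetical sketch of that shape, not the repository's actual `config.py`; every field name is an assumption.

```python
# Hypothetical shape for a processing-parameter config; the real config.py may differ.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessingConfig:
    language: str = "eng"            # OCR language hint
    deskew: bool = True              # apply deskewing during preprocessing
    binarize: bool = True            # apply thresholding during preprocessing
    output_format: str = "json"      # "json", "hocr", or "txt"
    extra_steps: List[str] = field(default_factory=list)


# Usage: a pipeline function would receive one of these and branch on its fields.
default_config = ProcessingConfig()
handwriting_config = ProcessingConfig(binarize=False, output_format="txt")
```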
memory-bank/activeContext.md CHANGED
@@ -1,31 +1,36 @@
1
- # Active Context
2
 
3
- ## Current Work Focus
4
 
5
- * Initializing the core Memory Bank files (`projectbrief.md`, `productContext.md`, `systemPatterns.md`, `techContext.md`, `activeContext.md`, `progress.md`) based on the existing project structure, defined `.clinerules`, and the global `memory-bank.md` specification.
 
6
 
7
- ## Recent Changes
8
 
9
- * This is the initial population of the Memory Bank. No prior changes within this system.
 
 
 
10
 
11
- ## Next Steps
12
 
13
- * Complete the creation of the initial Memory Bank files (specifically `progress.md`).
14
- * Await further instructions from the user regarding the next development task for the `historical-ocr` project.
 
 
15
 
16
- ## Active Decisions, Patterns, and Preferences
17
 
18
- * **Memory Bank Structure:** Adopting the 6-core-file structure defined in the global `memory-bank.md`.
19
- * **Content Derivation:** Populating initial Memory Bank content by analyzing the existing file structure and `.clinerules`.
20
- * **Iterative Improvement Workflow:** Implementing a post-task review step after completing tasks in Act Mode. This involves:
21
- 1. Completing the assigned task.
22
- 2. Presenting the result using `attempt_completion`.
23
- 3. Explicitly asking the user: "Are there any takeaways from this interaction that should be added to the project's `.clinerules`?"
24
- 4. If takeaways are provided, update the relevant `.clinerules` file(s).
25
- 5. Recursively update the Memory Bank (especially `activeContext.md` and `progress.md`) to reflect the new rule or pattern.
26
 
27
- ## Learnings and Project Insights
28
 
29
- * The project emphasizes clear documentation and rule definition from the start (`.clinerules`).
30
- * Maintaining synchronization between `.clinerules` and the Memory Bank is crucial for consistent development.
31
- * The iterative feedback loop (post-task review) is a key requirement for evolving project standards.
1
+ # Active Context: HOCR Processing Tool - Initial Setup
2
 
3
+ ## 1. Current Focus
4
 
5
+ * **Initial Project Setup:** Establishing the core Memory Bank documentation structure for the HOCR project as per Cline's requirements.
6
+ * Creating the foundational markdown files (`projectbrief.md`, `productContext.md`, `activeContext.md`, `systemPatterns.md`, `techContext.md`, `progress.md`).
7
 
8
+ ## 2. Recent Changes
9
 
10
+ * Created `memory-bank` directory within the `hocr` project folder.
11
+ * Created `projectbrief.md` with initial project goals and scope.
12
+ * Created `productContext.md` outlining the problem space, vision, and user goals.
13
+ * Created this `activeContext.md` file.
14
 
15
+ ## 3. Next Steps
16
 
17
+ * Create `systemPatterns.md` to document the high-level architecture and design patterns (based on initial analysis of existing code).
18
+ * Create `techContext.md` to detail the technologies, libraries, and setup requirements.
19
+ * Create `progress.md` to establish the baseline for tracking project status.
20
+ * Once the initial Memory Bank is set up, await further instructions or tasks from the user regarding the HOCR project itself.
21
 
22
+ ## 4. Active Decisions & Considerations
23
 
24
+ * Following Cline's standard Memory Bank structure.
25
+ * Populating initial files with baseline information derived from the project's file structure and general OCR principles. These will need refinement as the project is explored further.
 
 
 
 
 
 
26
 
27
+ ## 5. Important Patterns & Preferences
28
 
29
+ * *(To be filled in as patterns are identified in the codebase or specified by the user)*
30
+
31
+ ## 6. Learnings & Insights
32
+
33
+ * The HOCR project appears to be a substantial Python application with distinct modules for different OCR stages (preprocessing, segmentation, OCR processing, etc.) and includes a UI component.
34
+ * Initial setup requires creating the standard Memory Bank files.
35
+
36
+ *(This file tracks the immediate state and short-term plans. It should be updated frequently.)*
memory-bank/productContext.md CHANGED
@@ -1,31 +1,44 @@
1
- # Product Context
2
 
3
- ## Why This Project Exists
4
 
5
- Historical documents often contain invaluable information but are locked away in formats (scans, photos, complex PDFs) that are difficult to search, analyze, or integrate into digital research workflows. Standard OCR tools may struggle with the unique challenges presented by archival materials, such as varying print quality, handwriting, complex layouts, and archaic language.
 
 
 
6
 
7
- ## Problems It Solves
8
 
9
- This application aims to bridge the gap between physical historical archives and modern digital research by:
 
 
 
10
 
11
- * Providing high-accuracy text extraction from challenging historical documents.
12
- * Structuring the extracted text and metadata in a usable format.
13
- * Making advanced OCR capabilities accessible to researchers without requiring deep technical expertise.
14
- * Optimizing documents specifically for OCR to improve accuracy.
15
 
16
- ## How It Should Work
 
 
 
17
 
18
- The core workflow involves:
19
 
20
- 1. **Upload:** Users upload historical documents (images or PDFs) via a web interface.
21
- 2. **Preprocessing:** The application automatically applies image enhancement and optimization techniques tailored to historical materials.
22
- 3. **OCR:** Processed documents are sent to the Mistral AI OCR API for text extraction. Document type detection may inform specific OCR prompting.
23
- 4. **Structuring:** The raw OCR output is processed to extract structured information (e.g., paragraphs, headings, metadata) based on document type and potentially user instructions.
24
- 5. **Output:** Users can view the extracted text and download structured transcripts and analysis.
25
 
26
- ## User Experience Goals
27
 
28
- * **Intuitive Interface:** A clean, straightforward Streamlit web application that is easy for researchers to use.
29
- * **Clear Feedback:** Provide users with status updates during processing and clear presentation of results.
30
- * **Flexibility:** Allow users some control over the process (e.g., contextual instructions) where appropriate.
31
- * **Reliability:** Ensure consistent and accurate results.
1
+ # Product Context: HOCR Processing Tool
2
 
3
+ ## 1. Problem Space
4
 
5
+ * Extracting text from images and scanned documents (PDFs) is a common but often challenging task.
6
+ * Documents vary widely in quality, layout, language, and format.
7
+ * Manual transcription is time-consuming, error-prone, and not scalable.
8
+ * Existing OCR tools may lack flexibility, specific preprocessing capabilities, or produce output that isn't easily usable for downstream tasks.
9
 
10
+ ## 2. Solution Vision
11
 
12
+ * Provide a reliable, configurable, and extensible OCR processing pipeline.
13
+ * Empower users to convert document images and PDFs into usable, structured text data.
14
+ * Offer fine-grained control over the OCR process, from preprocessing to output formatting.
15
+ * Serve as a foundational tool that can be integrated into larger document processing workflows.
16
 
17
+ ## 3. Target Audience & Needs
 
 
 
18
 
19
+ * **Researchers/Academics:** Need to extract text from historical documents, scanned books, or research papers for analysis. Require high accuracy and potentially language-specific handling.
20
+ * **Archivists/Librarians:** Need to digitize large collections of documents, making them searchable and accessible. Require batch processing capabilities and robust handling of varied document types.
21
+ * **Developers:** Need an OCR component to integrate into applications (e.g., document management systems, data extraction tools). Require a clear API or command-line interface and structured output (like JSON or hOCR).
22
+ * **General Users:** May need to occasionally extract text from a scanned form, receipt, or image. Require a simple interface (if applicable) and reasonable default settings.
23
 
24
+ ## 4. User Experience Goals
25
 
26
+ * **Accuracy:** The primary goal is to maximize the accuracy of the extracted text.
27
+ * **Configurability:** Users should be able to tailor the processing steps and parameters to their specific document types and needs.
28
+ * **Transparency:** The tool should provide feedback on the processing steps and allow for debugging (e.g., viewing intermediate images, logs).
29
+ * **Performance:** Processing should be reasonably efficient, especially for batch operations.
30
+ * **Ease of Use:** While powerful, the tool should be approachable, whether through a command-line interface or a potential GUI. Configuration should be clear and well-documented.
31
 
32
+ ## 5. How it Should Work (High-Level Flow)
33
 
34
+ 1. User provides input file(s) (image or PDF).
35
+ 2. User specifies configuration options (or uses defaults).
36
+ 3. The tool executes the configured pipeline:
37
+ * Preprocessing (optional steps like deskew, binarize).
38
+ * Segmentation (detecting text regions).
39
+ * OCR (applying Tesseract or another engine).
40
+ * Post-processing (e.g., text correction, structuring output).
41
+ 4. The tool outputs the extracted text in the desired format (plain text, hOCR, JSON).
42
+ 5. Logs and potentially intermediate results are generated for review.
43
+
44
+ *(This context provides the 'why' and 'how' from a user perspective. Technical details are in systemPatterns.md and techContext.md.)*
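To make step 4 of the flow above concrete: with Tesseract via `pytesseract` (one plausible engine choice, not confirmed for this repository), word-level boxes can be turned into JSON or hOCR directly. A minimal sketch under that assumption, with hypothetical function names:

```python
# Illustrative only: emit JSON and hOCR from a single image using pytesseract.
# Assumes Tesseract is installed and the input path points to a readable image.
import json

import pytesseract
from PIL import Image
from pytesseract import Output


def extract_json(path: str) -> str:
    """Return a JSON string of recognized words with confidences and boxes."""
    image = Image.open(path)
    data = pytesseract.image_to_data(image, output_type=Output.DICT)
    words = [
        {
            "text": data["text"][i],
            "conf": data["conf"][i],
            "box": [data["left"][i], data["top"][i], data["width"][i], data["height"][i]],
        }
        for i in range(len(data["text"]))
        if data["text"][i].strip()
    ]
    return json.dumps({"words": words}, indent=2)


def extract_hocr(path: str) -> bytes:
    """Return the hOCR (HTML) representation produced by Tesseract."""
    return pytesseract.image_to_pdf_or_hocr(Image.open(path), extension="hocr")
```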
memory-bank/progress.md CHANGED
@@ -1,34 +1,30 @@
1
- # Project Progress
2
 
3
- ## Current Status
4
 
5
- * **Overall:** Initial project setup phase. Core application structure exists, and foundational documentation (Memory Bank, `.clinerules`) is being established.
6
- * **Memory Bank:** The six core Memory Bank files have just been created with initial content derived from the project structure and rules.
7
 
8
- ## What Works
9
 
10
- * The basic file and directory structure for a modular Python/Streamlit application is in place.
11
- * Project rules (`.clinerules`) defining the brief, API usage, simplicity principle, and technical debt priorities exist.
12
- * The Memory Bank system is now initialized.
 
 
 
 
13
 
14
- ## What's Left to Build / Next Steps
15
 
16
- * Implement the actual functionality described in the project brief (file upload, preprocessing, OCR integration, structuring, UI interactions).
17
- * Address the technical debt items listed below.
18
- * Refine and expand Memory Bank documentation as development progresses.
19
- * Specific next development tasks are pending user direction.
20
 
21
- ## Known Issues / Technical Debt
22
 
23
- The following technical debt items have been identified in `.clinerules/technical-debt.md` and should be addressed during development:
24
 
25
- 1. **Modularize large functions:** Break down functions exceeding 100 lines into smaller, focused units.
26
- 2. **Consistent Error Handling:** Implement a uniform error handling strategy across all modules.
27
- 3. **Preprocessing Pipeline Improvement:** Enhance the preprocessing steps to better handle diverse historical document types.
28
- 4. **Image Segmentation Enhancement:** Improve the current approach for identifying text regions.
29
- 5. **Documentation:** Create comprehensive documentation (docstrings, comments) for public functions and APIs.
30
 
31
- ## Evolution of Project Decisions
32
 
33
- * **[YYYY-MM-DD]:** Initialized Memory Bank structure based on global rules.
34
- * **[YYYY-MM-DD]:** Adopted a post-task review workflow to iteratively update `.clinerules` and the Memory Bank.
 
1
+ # Project Progress: HOCR Processing Tool
2
 
3
+ ## 1. What Works
4
 
5
+ * **Initial Memory Bank Setup:** The core documentation structure (projectbrief, productContext, activeContext, systemPatterns, techContext, progress) has been established in the `memory-bank/` directory.
 
6
 
7
+ ## 2. What's Left to Build / Verify
8
 
9
+ * **Verify Core Functionality:** Need to run the application and test its basic OCR capabilities on sample images and PDFs.
10
+ * **Confirm Technical Assumptions:** Validate the libraries and dependencies outlined in `techContext.md` by checking `requirements.txt` and relevant code sections.
11
+ * **Understand Configuration:** Investigate `config.py` to determine how users configure the pipeline.
12
+ * **Test UI Layer:** If `app.py` provides a UI (Streamlit/Flask), test its usability and connection to the backend pipeline.
13
+ * **Review Existing Code:** Deeper dive into the modules (`preprocessing.py`, `ocr_processing.py`, etc.) to understand implementation details.
14
+ * **Assess Test Coverage:** Examine the tests in `testing/` to understand what is currently covered.
15
+ * **Address Specific User Goals:** Once the baseline is understood, tackle any specific feature requests, bug fixes, or improvements requested by the user.
16
 
17
+ ## 3. Current Status
18
 
19
+ * **Baseline Established (Memory Bank):** As of 2025-05-05, the initial Memory Bank structure is in place.
20
+ * **Code Functionality:** The operational status of the HOCR tool itself is yet to be verified.
 
 
21
 
22
+ ## 4. Known Issues / Bugs
23
 
24
+ * *(None identified yet. To be populated as testing and development proceed.)*
25
 
26
+ ## 5. Evolution of Project Decisions (Decision Log)
 
 
 
 
27
 
28
+ * **2025-05-05:** Decided to create the standard Cline Memory Bank structure for the `hocr` project upon user request to check configuration. Found no existing `memory-bank` directory and proceeded with creation of core files.
29
 
30
+ *(This document tracks the overall progress and state of the project.)*
 
memory-bank/project-brief.md DELETED
@@ -1,19 +0,0 @@
- # Project Brief
-
- Historical OCR is an advanced optical character recognition (OCR) application designed to support historical research. It leverages Mistral AI's OCR models alongside image preprocessing pipelines optimized for archival material.
-
- ## High-Level Overview
-
- Building a Streamlit-based web application to process historical documents (images or PDFs), optimize them for OCR using advanced preprocessing techniques, and extract structured text and metadata through Mistral's large language models.
-
- ## Core Requirements and Goals
-
- * Upload and preprocess historical documents
-
- * Apply tailored OCR prompting and structured output based on document type
-
- * Support user-defined contextual instructions to refine output
-
- * Provide downloadable structured transcripts and analysis
-
- * Example: "Building a Streamlit web app for OCR transcription and structured extraction from historical documents using Mistral AI."
memory-bank/projectbrief.md CHANGED
@@ -1,21 +1,39 @@
1
- # Project Brief
2
 
3
- Historical OCR is an advanced optical character recognition (OCR) application designed to support historical research. It leverages Mistral AI's OCR models alongside image preprocessing pipelines optimized for archival material.
4
 
5
- High-Level Overview
 
6
 
7
- Building a Streamlit-based web application to process historical documents (images or PDFs), optimize them for OCR using advanced preprocessing techniques, and extract structured text and metadata through Mistral's large language models.
8
 
9
- Core Requirements and Goals
 
10
 
11
- Upload and preprocess historical documents
12
 
13
- Automatically detect document types (e.g., handwritten letters, scientific papers)
 
14
 
15
- Apply tailored OCR prompting and structured output based on document type
16
 
17
- Support user-defined contextual instructions to refine output
 
 
18
 
19
- Provide downloadable structured transcripts and analysis
20
 
21
- Example: "Building a Streamlit web app for OCR transcription and structured extraction from historical documents using Mistral AI."
 
1
+ # Project Brief: HOCR Processing Tool
2
 
3
+ ## 1. Project Goal
4
 
5
+ * **Primary Objective:** To develop and maintain a robust tool for performing Optical Character Recognition (OCR) on various document types (images, PDFs), extracting structured text, and handling potential complexities like diverse layouts, languages, and image quality issues.
6
+ * **Target Users:** Researchers, archivists, developers, or anyone needing to extract text from scanned documents or images.
7
 
8
+ ## 2. Core Requirements
9
 
10
+ * **Input:** Accept image files (PNG, JPG, TIFF, etc.) and PDF documents.
11
+ * **Processing Pipeline:**
12
+ * Image Preprocessing (deskewing, noise reduction, binarization, etc.)
13
+ * Layout Analysis / Image Segmentation (identifying text blocks, columns, images)
14
+ * OCR Engine Integration (e.g., Tesseract)
15
+ * Language Detection
16
+ * Structured Output Generation (e.g., hOCR format, JSON, plain text)
17
+ * Error Handling and Logging
18
+ * **Configuration:** Allow users to configure processing parameters (e.g., language, preprocessing steps, output format).
19
+ * **Extensibility:** Design for potential future enhancements (e.g., handwriting recognition, specific template handling).
20
 
21
+ ## 3. Scope
22
 
23
+ * **In Scope:** Core OCR pipeline, configuration management, basic UI (if applicable), testing framework, documentation.
24
+ * **Out of Scope (Initially):** Advanced AI-driven layout analysis beyond standard segmentation, real-time processing for video streams, integration with specific external databases or workflows unless specified.
25
 
26
+ ## 4. Key Technologies (Initial Assessment - *To be refined in techContext.md*)
27
 
28
+ * **Language:** Python
29
+ * **Core Libraries:** OpenCV, Tesseract (pytesseract), Pillow, pdf2image, potentially others for specific tasks.
30
+ * **Framework (if UI exists):** Flask/Streamlit (based on existing files like `app.py`, `ui/`)
31
 
32
+ ## 5. Success Metrics
33
 
34
+ * Accuracy of text extraction (measured against ground truth data).
35
+ * Robustness across different document types and qualities.
36
+ * Ease of configuration and use.
37
+ * Maintainability and extensibility of the codebase.
38
+
39
+ *(This is a foundational brief. Details will be expanded in other Memory Bank documents.)*
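For the PDF input requirement in the brief above, its own guess at `pdf2image` would look roughly like the per-page loop below. This is a sketch under that assumption only (the repository's actual `pdf_ocr.py` may use a different library), and `run_pipeline` is a hypothetical stand-in for the main image flow.

```python
# Sketch: convert each PDF page to an image, then feed it to the image pipeline.
# Assumes pdf2image (and its Poppler dependency) is installed; run_pipeline is hypothetical.
from pdf2image import convert_from_path


def run_pipeline(page_image) -> str:
    """Placeholder for the image flow (preprocess -> segment -> OCR -> structure)."""
    return ""  # the real project would return recognized text for this page


def ocr_pdf(path: str, dpi: int = 300) -> str:
    """OCR every page of a PDF and join the per-page results."""
    pages = convert_from_path(path, dpi=dpi)  # returns a list of PIL.Image objects
    results = [run_pipeline(page) for page in pages]
    return "\n\f\n".join(results)  # form-feed between pages, mirroring Tesseract's habit
```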
memory-bank/systemPatterns.md CHANGED
@@ -1,66 +1,79 @@
1
- # System Patterns
2
 
3
- ## Architecture Overview
4
 
5
- The application follows a modular Python structure, orchestrated by a main Streamlit application script (`app.py`). Key architectural components include:
6
 
7
- * **Entry Point:** `app.py` likely initializes the Streamlit application and coordinates calls to other modules.
8
- * **UI Layer:** Managed by Streamlit. Core UI elements are defined in `ui/layout.py` and potentially reusable components in `ui/ui_components.py` (or the root `ui_components.py`). Custom styling is applied via `ui/custom.css`.
9
- * **Processing Modules:** Functionality is separated into distinct Python modules:
10
- * `preprocessing.py`: Handles image optimization and preparation for OCR.
11
- * `image_segmentation.py`: Deals with identifying regions of interest within documents.
12
- * `ocr_processing.py`: Manages the interaction with the OCR engine (Mistral API).
13
- * `structured_ocr.py`: Focuses on interpreting raw OCR output and structuring it.
14
- * `language_detection.py`: Detects the language of the document content.
15
- * `letterhead_handler.py`: Specific logic for dealing with letterheads.
16
- * `pdf_ocr.py`: Handles OCR specific to PDF inputs (likely coordinating other modules).
17
- * `process_file.py`: A potential high-level orchestrator for the entire file processing pipeline.
18
- * **Configuration:** `config.py` likely holds application settings, potentially including API keys or processing parameters. `constants.py` holds fixed values used across the application.
19
- * **Utilities:** Common functions are grouped in the `utils/` directory, further categorized (e.g., `image_utils.py`, `text_utils.py`, `file_utils.py`).
20
- * **Error Handling:** A dedicated `error_handler.py` suggests a centralized approach to managing exceptions.
21
 
22
- ## Key Technical Decisions & Patterns
 
 
 
23
 
24
- * **Modularity:** Code is organized into feature-specific modules, promoting separation of concerns.
25
- * **External API Integration:** Relies on the Mistral AI OCR API for core text extraction (`.clinerules/hocr-basics-api.md`). API interaction logic is likely within `ocr_processing.py` or related utilities.
26
- * **Streamlit Framework:** Leverages Streamlit for the web interface, using standard components (`.clinerules/hocr-basics-api.md`). State management likely uses `st.session_state`.
27
- * **Content Purity:** Adheres to the principle of separating data from presentation markup (`.clinerules/principle-of-simplicity.md`). Presentation logic should reside primarily in the UI layer.
28
- * **Configuration Management:** Centralized configuration likely managed through `config.py`.
29
-
30
- ## Component Relationships (Conceptual)
31
 
32
  ```mermaid
33
  graph TD
34
- A[User via Streamlit UI] --> B(app.py);
35
- B --> C{process_file.py?};
36
- C --> D[preprocessing.py];
37
- C --> E[image_segmentation.py];
38
- C --> F[language_detection.py];
39
- C --> G[pdf_ocr.py / ocr_processing.py];
40
- G -- Mistral API --> H((External Mistral OCR));
41
- H -- OCR Result --> G;
42
- G --> I[structured_ocr.py];
43
- I --> J[Output Generation];
44
- J --> A;
45
-
46
- subgraph Modules
47
- D; E; F; G; I; J;
48
  end
49
 
50
- subgraph Configuration & Utils
51
- K[config.py];
52
- L[constants.py];
53
- M[utils/];
54
- N[error_handler.py];
55
  end
56
 
57
- B --> K; B --> L; B --> M; B --> N;
58
- Modules --> K; Modules --> L; Modules --> M; Modules --> N;
59
  ```
60
 
61
- ## Critical Implementation Paths
62
 
63
- * The end-to-end flow from file upload (`st.file_uploader`) through preprocessing, OCR API call, structuring, and displaying results (`st.markdown`, `st.download_button`).
64
- * Handling different input file types (Images vs. PDFs).
65
- * Integration and error handling for the Mistral API calls.
66
- * Implementation of specific preprocessing steps relevant to historical documents.
 
1
+ # System Patterns: HOCR Processing Tool
2
 
3
+ ## 1. High-Level Architecture
4
 
5
+ * **Modular Pipeline:** The system appears structured as a pipeline with distinct modules for different stages of OCR processing. Key modules suggested by filenames include:
6
+ * `preprocessing.py`: Handles initial image adjustments.
7
+ * `image_segmentation.py`: Identifies regions of interest (text blocks).
8
+ * `ocr_processing.py`: Manages the core OCR engine interaction.
9
+ * `language_detection.py`: Determines the language of the text.
10
+ * `pdf_ocr.py`: Specific handling for PDF inputs.
11
+ * `structured_ocr.py`: Likely involved in formatting the output.
12
+ * **Configuration Driven:** `config.py` suggests a centralized configuration management approach, allowing pipeline behavior to be customized.
13
+ * **Entry Point / Orchestration:** `app.py` likely serves as the main entry point or orchestrator, possibly for a web UI or API, coordinating the pipeline execution based on user input and configuration. `process_file.py` might be an alternative entry point or a core processing function called by `app.py`.
14
+ * **UI Layer:** The `ui/` directory (`ui/layout.py`, `ui/ui_components.py`) indicates a dedicated user interface layer, possibly built with Streamlit or Flask (as suggested in `projectbrief.md`).
15
+ * **Utility Functions:** The `utils/` directory (`utils/image_utils.py`, `utils/text_utils.py`, etc.) points to a pattern of encapsulating reusable helper functions.
16
+ * **Error Handling:** `error_handler.py` suggests a dedicated mechanism for managing and reporting errors during processing.
17
 
18
+ ## 2. Key Design Patterns (Inferred)
19
 
20
+ * **Pipeline Pattern:** The core processing flow seems to follow a pipeline pattern, where data (image/document) passes through sequential processing stages.
21
+ * **Configuration Management:** Centralized configuration (`config.py`) allows for decoupling settings from code.
22
+ * **Separation of Concerns:** Different functionalities (UI, core processing, utilities, configuration) appear to be separated into distinct modules/files.
23
+ * **Utility/Helper Modules:** Common, reusable functions are grouped into utility modules.
24
 
25
+ ## 3. Component Relationships (Initial Diagram - Mermaid)
 
27
  ```mermaid
28
  graph TD
29
+ subgraph User Interface / Entry Point
30
+ A[app.py / UI Layer] --> B(process_file.py);
31
+ end
32
+
33
+ subgraph Configuration
34
+ C[config.py];
35
+ end
36
+
37
+ subgraph Core OCR Pipeline
38
+ B --> D(preprocessing.py);
39
+ D --> E(image_segmentation.py);
40
+ E --> F(ocr_processing.py);
41
+ F --> G(language_detection.py);
42
+ G --> H(structured_ocr.py);
43
  end
44
 
45
+ subgraph Input Handling
46
+ I[pdf_ocr.py] --> B;
47
+ J[Image Input] --> B;
 
 
48
  end
49
 
50
+ subgraph Utilities
51
+ K[utils/];
52
+ L[error_handler.py];
53
+ end
54
+
55
+ A --> C;
56
+ B --> C;
57
+ D --> K;
58
+ E --> K;
59
+ F --> K;
60
+ G --> K;
61
+ H --> K;
62
+ I --> K;
63
+ B --> L;
64
+
65
+ style User Interface / Entry Point fill:#f9f,stroke:#333,stroke-width:2px
66
+ style Configuration fill:#ccf,stroke:#333,stroke-width:2px
67
+ style Core OCR Pipeline fill:#cfc,stroke:#333,stroke-width:2px
68
+ style Input Handling fill:#ffc,stroke:#333,stroke-width:2px
69
+ style Utilities fill:#eee,stroke:#333,stroke-width:2px
70
+
71
  ```
72
 
73
+ ## 4. Critical Implementation Paths
74
+
75
+ * **Image Input -> Preprocessing -> Segmentation -> OCR -> Structured Output:** The main flow for image files.
76
+ * **PDF Input -> PDF Extraction -> Image Conversion (per page) -> [Main Flow] -> Aggregated Output:** The likely path for PDF documents.
77
+ * **Configuration Loading -> Pipeline Execution:** How settings influence the process.
78
 
79
+ *(This document outlines the observed structure. It will be refined as the codebase is analyzed in more detail.)*
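The "Pipeline Pattern" inferred in this file can be expressed as little more than an ordered list of callables. The sketch below is a generic illustration of that pattern, not the repository's implementation; the placeholder stages only mirror the inferred module order.

```python
# Generic pipeline-pattern sketch: each stage consumes the previous stage's output.
from typing import Callable, Iterable


def run_stages(data, stages: Iterable[Callable]):
    """Apply each stage in order, passing the result forward."""
    for stage in stages:
        data = stage(data)
    return data


# Placeholder stages mirroring the inferred module order (not real project code).
pipeline = [
    lambda img: img,                 # preprocessing.py
    lambda img: [img],               # image_segmentation.py (regions)
    lambda regions: ["..."],         # ocr_processing.py (raw text per region)
    lambda texts: "\n".join(texts),  # structured_ocr.py (formatted output)
]
```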
memory-bank/techContext.md CHANGED
@@ -1,35 +1,37 @@
1
- # Technical Context
2
 
3
- ## Technologies Used
4
 
5
- * **Primary Language:** Python 3.x
6
- * **Web Framework:** Streamlit
7
- * **Core API:** Mistral AI OCR API (via HTTPS requests)
8
- * **Potential Libraries:**
9
- * `requests`: For making HTTP calls to the Mistral API.
10
- * `streamlit`: For the web UI framework.
11
- * `Pillow` (PIL Fork): For basic image loading and manipulation.
12
- * `OpenCV` (`cv2`): Likely used for more advanced image preprocessing tasks (e.g., thresholding, noise reduction, deskewing).
13
- * `python-dotenv`: Potentially used for managing environment variables like API keys (especially if `config.py` loads from a `.env` file).
14
- * `PyMuPDF` or similar: If PDF processing involves direct text/image extraction from PDF structures beyond just sending to OCR.
15
 
16
- *(Note: Specific libraries beyond Streamlit and requests need confirmation, e.g., by inspecting `requirements.txt` or import statements in the code).*
17
 
18
- ## Development Setup
 
 
 
 
 
19
 
20
- * **Environment:** Standard Python environment (virtual environment recommended, e.g., `venv` or `conda`).
21
- * **Dependencies:** Install required packages (likely via `pip install -r requirements.txt` if a requirements file exists).
22
- * **API Keys:** Requires a Mistral AI API key, which needs to be configured securely (likely via environment variables loaded in `config.py`).
23
- * **Running the App:** Typically run using `streamlit run app.py` from the project root directory.
24
 
25
- ## Technical Constraints
 
 
 
26
 
27
- * **API Limits:** Subject to Mistral AI API usage limits, rate limits, and potential costs. Error handling for API responses (e.g., 429 Too Many Requests, 401 Unauthorized, 5xx Server Errors) is crucial.
28
- * **Processing Time:** OCR and complex image preprocessing can be time-consuming, especially for large documents or high-resolution images. Streamlit's execution model needs to be considered for long-running tasks (e.g., using background processes or providing user feedback).
29
- * **Resource Usage:** Image processing can be memory and CPU intensive.
30
 
31
- ## Tool Usage Patterns
 
 
 
32
 
33
- * **Streamlit Components:** Primarily use core components as specified in `.clinerules/hocr-basics-api.md` (`st.file_uploader`, `st.selectbox`, `st.image`, `st.markdown`, `st.download_button`).
34
- * **State Management:** Use `st.session_state` for managing user interactions and state across reruns.
35
- * **API Interaction:** Follow standard practices for REST API calls (headers, JSON body, error checking) as defined in `.clinerules/hocr-basics-api.md`.
1
+ # Technical Context: HOCR Processing Tool
2
 
3
+ ## 1. Core Language
4
 
5
+ * **Python:** The project is primarily written in Python, as indicated by the `.py` files. The specific version should be confirmed (e.g., via `requirements.txt` or environment setup).
6
 
7
+ ## 2. Key Libraries & Frameworks
8
 
9
+ * **OCR Engine:** Likely **Tesseract OCR**, potentially accessed via the `pytesseract` wrapper (common practice). This needs confirmation by inspecting `ocr_processing.py` or dependencies.
10
+ * **Image Processing:** **OpenCV (`cv2`)** and/or **Pillow (PIL)** are highly probable for tasks in `preprocessing.py` and `image_segmentation.py`.
11
+ * **PDF Handling:** **`pdf2image`** (which often relies on Poppler) is a common choice for converting PDF pages to images, relevant for `pdf_ocr.py`. Other PDF libraries like PyMuPDF or PyPDF2 might also be used.
12
+ * **Web Framework/UI:** Based on `app.py` and the `ui/` directory, **Flask** or **Streamlit** are potential candidates for the user interface or API layer.
13
+ * **Configuration:** Standard Python mechanisms (e.g., `.ini` files with `configparser`, `.json` files, or custom Python modules like `config.py`).
14
+ * **Dependency Management:** Likely uses `pip` with a `requirements.txt` file (observed in the file listing). Virtual environments (like `venv` or `conda`) are standard practice.
15
 
16
+ ## 3. External Dependencies & Setup
 
 
 
17
 
18
+ * **Tesseract OCR Engine:** Requires separate installation on the host system. The path to the Tesseract executable might need configuration.
19
+ * **Poppler:** Often required by `pdf2image` for PDF processing; needs separate installation.
20
+ * **Python Environment:** A specific Python version and installed packages via `requirements.txt`.
21
+ * **Environment Variables:** Potential use of environment variables for configuration (e.g., API keys, paths), possibly managed via a `.env` file (observed in the file listing).
22
 
23
+ ## 4. Development Environment
 
 
24
 
25
+ * **Standard Python Setup:** Requires a Python interpreter, `pip`, and likely `virtualenv`.
26
+ * **Code Editor/IDE:** VS Code is being used (based on environment details). Settings might be stored in `.vscode/`.
27
+ * **Version Control:** Git is likely used (indicated by `.gitignore`, `.gitattributes`). The `.git_disabled` directory suggests Git might have been temporarily disabled or renamed.
28
+ * **Testing:** The `testing/` directory and `pytest_cache` suggest **pytest** is used for running tests.
29
 
30
+ ## 5. Technical Constraints & Considerations
31
+
32
+ * **Performance:** OCR, especially on large documents or batches, can be computationally intensive. Image processing steps add overhead.
33
+ * **Tesseract Limitations:** Tesseract's accuracy depends heavily on image quality, preprocessing, and language model availability.
34
+ * **Dependency Hell:** Managing Python dependencies and external binaries (Tesseract, Poppler) across different operating systems can be challenging.
35
+ * **Layout Complexity:** Handling complex layouts (multi-column, tables, embedded images) requires sophisticated segmentation logic.
36
+
37
+ *(This document provides an initial technical overview based on file structure and common practices. It requires verification by examining code and configuration files like requirements.txt.)*
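Since Tesseract and Poppler are external binaries (if the library guesses in this file hold), a quick environment check helps catch missing installs early. A hedged sketch, assuming `pytesseract` and `pdf2image` are the wrappers in use; nothing here is taken from the repository.

```python
# Sanity-check external OCR dependencies; only relevant if pytesseract/pdf2image are used.
import shutil
from typing import Optional

import pytesseract
from pdf2image import pdfinfo_from_path
from pdf2image.exceptions import PDFInfoNotInstalledError


def check_environment(sample_pdf: Optional[str] = None) -> None:
    """Print whether the Tesseract and Poppler binaries are reachable."""
    try:
        print("Tesseract version:", pytesseract.get_tesseract_version())
    except pytesseract.TesseractNotFoundError:
        print("Tesseract not found; install it or set pytesseract.pytesseract.tesseract_cmd")

    # Poppler ships pdftoppm/pdfinfo, which pdf2image shells out to.
    if shutil.which("pdfinfo") is None:
        print("Poppler's pdfinfo is not on PATH; PDF conversion will fail")
    elif sample_pdf:
        try:
            print("PDF pages:", pdfinfo_from_path(sample_pdf)["Pages"])
        except PDFInfoNotInstalledError:
            print("pdf2image could not run pdfinfo")
```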