add memory
- .clinerules/activeContext.md +18 -0
- .clinerules/memory-bank.md +12 -106
- .clinerules/productContext.md +11 -0
- .clinerules/progress.md +20 -0
- .clinerules/project-brief.md +13 -0
- .clinerules/systemPatterns.md +19 -0
- .clinerules/techContext.md +16 -0
- memory-bank/activeContext.md +26 -21
- memory-bank/productContext.md +34 -21
- memory-bank/progress.md +19 -23
- memory-bank/project-brief.md +0 -19
- memory-bank/projectbrief.md +29 -11
- memory-bank/systemPatterns.md +64 -51
- memory-bank/techContext.md +28 -26
.clinerules/activeContext.md
ADDED
@@ -0,0 +1,18 @@
+# Active Context
+
+## Current Development Focus
+- Improving preprocessing pipeline for better OCR results
+- Enhancing image segmentation for complex documents
+- Building structured OCR capabilities
+- Testing and validation with different document types
+
+## Recent Changes
+- Modularized code structure
+- Added helpers in utils directory
+- Fixed metadata field ordering and tag classification issues
+- Updated gitignore to exclude test files and output directories
+
+## Next Steps
+- Continue refining the OCR processing pipeline
+- Improve handling of handwritten documents
+- Enhance the UI components for better user experience
.clinerules/memory-bank.md
CHANGED
@@ -1,113 +1,19 @@
-#
+# HOCR Project Memory Bank

-
+This memory bank is for the HOCR (OCR processing) project.

-##
+## Project Context

-
+This project appears to be focused on OCR (Optical Character Recognition) processing, with capabilities for image segmentation, preprocessing, and various text extraction techniques.

-
-PB[projectbrief.md] --> PC[productContext.md]
-PB --> SP[systemPatterns.md]
-PB --> TC[techContext.md]
-PC --> AC[activeContext.md]
-SP --> AC
-TC --> AC
-AC --> P[progress.md]
+## System Information

-
-
-
-- Created at project start if it doesn't exist
-- Defines core requirements and goals
-- Source of truth for project scope
+- Project directory: /Users/zacharymuhlbauer/Desktop/tools/hocr
+- Main Python files include app.py, preprocessing.py, ocr_processing.py, and various utility modules
+- Output directories for test results and processing stages

-
-- Why this project exists
-- Problems it solves
-- How it should work
-- User experience goals
+## Notes

-
-
-
-- Next steps
-- Active decisions and considerations
-- Important patterns and preferences
-- Learnings and project insights
-
-4. `systemPatterns.md`
-- System architecture
-- Key technical decisions
-- Design patterns in use
-- Component relationships
-- Critical implementation paths
-
-5. `techContext.md`
-- Technologies used
-- Development setup
-- Technical constraints
-- Dependencies
-- Tool usage patterns
-
-6. `progress.md`
-- What works
-- What's left to build
-- Current status
-- Known issues
-- Evolution of project decisions
-
-### Additional Context
-Create additional files/folders within memory-bank/ when they help organize:
-- Complex feature documentation
-- Integration specifications
-- API documentation
-- Testing strategies
-- Deployment procedures
-
-## Core Workflows
-
-### Plan Mode
-flowchart TD
-Start[Start] --> ReadFiles[Read Memory Bank]
-ReadFiles --> CheckFiles{Files Complete?}
-
-CheckFiles -->|No| Plan[Create Plan]
-Plan --> Document[Document in Chat]
-
-CheckFiles -->|Yes| Verify[Verify Context]
-Verify --> Strategy[Develop Strategy]
-Strategy --> Present[Present Approach]
-
-### Act Mode
-flowchart TD
-Start[Start] --> Context[Check Memory Bank]
-Context --> Update[Update Documentation]
-Update --> Execute[Execute Task]
-Execute --> Document[Document Changes]
-
-## Documentation Updates
-
-Memory Bank updates occur when:
-1. Discovering new project patterns
-2. After implementing significant changes
-3. When user requests with **update memory bank** (MUST review ALL files)
-4. When context needs clarification
-
-flowchart TD
-Start[Update Process]
-
-subgraph Process
-P1[Review ALL Files]
-P2[Document Current State]
-P3[Clarify Next Steps]
-P4[Document Insights & Patterns]
-
-P1 --> P2 --> P3 --> P4
-end
-
-Start --> Process
-
-Note: When triggered by **update memory bank**, I MUST review every memory bank file, even if some don't require updates. Focus particularly on activeContext.md and progress.md as they track current state.
-
-REMEMBER: After every memory reset, I begin completely fresh. The Memory Bank is my only link to previous work. It must be maintained with precision and clarity, as my effectiveness depends entirely on its accuracy.
+- The project handles various document types including handwritten documents, printed text, and mixed content
+- Contains preprocessing steps for image enhancement before OCR
+- Has testing directories for different document types and processing approaches
.clinerules/productContext.md
ADDED
@@ -0,0 +1,11 @@
+# Product Context
+
+This is an OCR processing tool for various document types including handwritten, printed, and mixed content documents. The system handles preprocessing, image segmentation, OCR processing, and text extraction with various enhancements.
+
+## Features
+- Document preprocessing (deskewing, thresholding, etc.)
+- Image segmentation to identify text regions
+- OCR processing with different strategies for different document types
+- Language detection
+- Letterhead handling
+- Structured data extraction
.clinerules/progress.md
ADDED
@@ -0,0 +1,20 @@
+# Progress
+
+## Completed
+- Basic OCR processing pipeline
+- Image preprocessing capabilities
+- Text segmentation algorithm
+- Initial UI components
+- Testing framework for various document types
+
+## In Progress
+- Improving preprocessing for handwritten documents
+- Enhancing segmentation accuracy
+- Building structured output formatting
+- Refining language detection
+
+## Planned
+- Additional output formats
+- Performance optimization
+- More comprehensive testing
+- Enhanced UI features
.clinerules/project-brief.md
ADDED
@@ -0,0 +1,13 @@
+# Project Brief
+
+## Overview
+This project focuses on Optical Character Recognition (OCR) processing for various document types. It handles different document formats and qualities, applying specialized preprocessing and recognition techniques.
+
+## Goals
+- Improve OCR accuracy for challenging document types
+- Support multiple input formats (images, PDFs)
+- Provide structured output of recognized text
+- Enable interactive usage with UI components
+
+## Current Status
+The project has multiple components in place including preprocessing, segmentation, OCR processing, and utility functions. Testing infrastructure is available for different document types and processing approaches.
.clinerules/systemPatterns.md
ADDED
@@ -0,0 +1,19 @@
+# System Patterns
+
+## Code Organization
+- Main processing components in root directory
+- Utility functions in utils/ directory with specific submodules
+- UI components in ui/ directory
+- Test cases and samples in testing/ directory
+- Input/output directories for document processing
+
+## Naming Conventions
+- Snake case for file names and functions
+- Module names reflect their purpose (e.g., ocr_processing.py, image_segmentation.py)
+- Consistent test output naming with descriptive prefixes
+
+## Processing Pipeline
+1. Preprocessing step (enhancement, cleaning)
+2. Segmentation (identifying text regions)
+3. OCR processing with context-specific strategies
+4. Post-processing and output formatting
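The four-stage Processing Pipeline listed in `.clinerules/systemPatterns.md` could be wired together roughly as follows. This is a minimal sketch under stated assumptions, not the project's code: `Document`, `run_pipeline`, and the four stage functions are hypothetical names, and the stage bodies are placeholders that operate on plain strings instead of real images.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class Document:
    """Hypothetical container handed from stage to stage (not from the repo)."""
    raw: str                                   # stand-in for image data
    regions: List[str] = field(default_factory=list)
    text: str = ""
    log: List[str] = field(default_factory=list)


Stage = Callable[[Document], Document]


def preprocess(doc: Document) -> Document:
    # Placeholder for enhancement/cleaning: only trims surrounding whitespace.
    doc.raw = doc.raw.strip()
    doc.log.append("preprocess")
    return doc


def segment(doc: Document) -> Document:
    # Placeholder for text-region detection: split on blank lines.
    doc.regions = [region for region in doc.raw.split("\n\n") if region]
    doc.log.append("segment")
    return doc


def recognize(doc: Document) -> Document:
    # Placeholder for the OCR engine call: regions are already text here.
    doc.text = "\n".join(doc.regions)
    doc.log.append("recognize")
    return doc


def postprocess(doc: Document) -> Document:
    # Placeholder for output formatting: collapse runs of spaces per line.
    doc.text = "\n".join(" ".join(line.split()) for line in doc.text.splitlines())
    doc.log.append("postprocess")
    return doc


DEFAULT_STAGES: Sequence[Stage] = (preprocess, segment, recognize, postprocess)


def run_pipeline(doc: Document, stages: Sequence[Stage] = DEFAULT_STAGES) -> Document:
    """Pass the document through each stage in order."""
    for stage in stages:
        doc = stage(doc)
    return doc
```

Keeping each stage as a plain function with the same signature means a stage can be swapped or skipped by passing a different `stages` sequence, which is one way to support the "context-specific strategies" mentioned in step 3.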
.clinerules/techContext.md
ADDED
@@ -0,0 +1,16 @@
+# Technical Context
+
+This project is a Python-based OCR solution with the following components:
+
+## Tech Stack
+- Python
+- Image processing libraries (OpenCV, PIL)
+- OCR engines
+- UI components for interactive usage
+- File processing utilities
+
+## Architecture
+- Modular design with separate components for preprocessing, OCR, and output formatting
+- Utility modules organized in utils/ directory
+- Testing framework for various document types
+- Configuration system for processing parameters
memory-bank/activeContext.md
CHANGED
@@ -1,31 +1,36 @@
-# Active Context
+# Active Context: HOCR Processing Tool - Initial Setup

-## Current
+## 1. Current Focus

-*
+* **Initial Project Setup:** Establishing the core Memory Bank documentation structure for the HOCR project as per Cline's requirements.
+* Creating the foundational markdown files (`projectbrief.md`, `productContext.md`, `activeContext.md`, `systemPatterns.md`, `techContext.md`, `progress.md`).

-## Recent Changes
+## 2. Recent Changes

-*
+* Created `memory-bank` directory within the `hocr` project folder.
+* Created `projectbrief.md` with initial project goals and scope.
+* Created `productContext.md` outlining the problem space, vision, and user goals.
+* Created this `activeContext.md` file.

-## Next Steps
+## 3. Next Steps

-*
-*
+* Create `systemPatterns.md` to document the high-level architecture and design patterns (based on initial analysis of existing code).
+* Create `techContext.md` to detail the technologies, libraries, and setup requirements.
+* Create `progress.md` to establish the baseline for tracking project status.
+* Once the initial Memory Bank is set up, await further instructions or tasks from the user regarding the HOCR project itself.

-## Active Decisions
+## 4. Active Decisions & Considerations

-*
-*
-* **Iterative Improvement Workflow:** Implementing a post-task review step after completing tasks in Act Mode. This involves:
-1. Completing the assigned task.
-2. Presenting the result using `attempt_completion`.
-3. Explicitly asking the user: "Are there any takeaways from this interaction that should be added to the project's `.clinerules`?"
-4. If takeaways are provided, update the relevant `.clinerules` file(s).
-5. Recursively update the Memory Bank (especially `activeContext.md` and `progress.md`) to reflect the new rule or pattern.
+* Following Cline's standard Memory Bank structure.
+* Populating initial files with baseline information derived from the project's file structure and general OCR principles. These will need refinement as the project is explored further.

-##
+## 5. Important Patterns & Preferences

-*
-
-
+* *(To be filled in as patterns are identified in the codebase or specified by the user)*
+
+## 6. Learnings & Insights
+
+* The HOCR project appears to be a substantial Python application with distinct modules for different OCR stages (preprocessing, segmentation, OCR processing, etc.) and includes a UI component.
+* Initial setup requires creating the standard Memory Bank files.
+
+*(This file tracks the immediate state and short-term plans. It should be updated frequently.)*
memory-bank/productContext.md
CHANGED
@@ -1,31 +1,44 @@
-# Product Context
+# Product Context: HOCR Processing Tool

-##
+## 1. Problem Space

-
+* Extracting text from images and scanned documents (PDFs) is a common but often challenging task.
+* Documents vary widely in quality, layout, language, and format.
+* Manual transcription is time-consuming, error-prone, and not scalable.
+* Existing OCR tools may lack flexibility, specific preprocessing capabilities, or produce output that isn't easily usable for downstream tasks.

-##
+## 2. Solution Vision

-
+* Provide a reliable, configurable, and extensible OCR processing pipeline.
+* Empower users to convert document images and PDFs into usable, structured text data.
+* Offer fine-grained control over the OCR process, from preprocessing to output formatting.
+* Serve as a foundational tool that can be integrated into larger document processing workflows.

-
-* Structuring the extracted text and metadata in a usable format.
-* Making advanced OCR capabilities accessible to researchers without requiring deep technical expertise.
-* Optimizing documents specifically for OCR to improve accuracy.
+## 3. Target Audience & Needs

-
+* **Researchers/Academics:** Need to extract text from historical documents, scanned books, or research papers for analysis. Require high accuracy and potentially language-specific handling.
+* **Archivists/Librarians:** Need to digitize large collections of documents, making them searchable and accessible. Require batch processing capabilities and robust handling of varied document types.
+* **Developers:** Need an OCR component to integrate into applications (e.g., document management systems, data extraction tools). Require a clear API or command-line interface and structured output (like JSON or hOCR).
+* **General Users:** May need to occasionally extract text from a scanned form, receipt, or image. Require a simple interface (if applicable) and reasonable default settings.

-
+## 4. User Experience Goals

-
-
-
-
-
+* **Accuracy:** The primary goal is to maximize the accuracy of the extracted text.
+* **Configurability:** Users should be able to tailor the processing steps and parameters to their specific document types and needs.
+* **Transparency:** The tool should provide feedback on the processing steps and allow for debugging (e.g., viewing intermediate images, logs).
+* **Performance:** Processing should be reasonably efficient, especially for batch operations.
+* **Ease of Use:** While powerful, the tool should be approachable, whether through a command-line interface or a potential GUI. Configuration should be clear and well-documented.

-##
+## 5. How it Should Work (High-Level Flow)

-
-
-
-*
+1. User provides input file(s) (image or PDF).
+2. User specifies configuration options (or uses defaults).
+3. The tool executes the configured pipeline:
+* Preprocessing (optional steps like deskew, binarize).
+* Segmentation (detecting text regions).
+* OCR (applying Tesseract or another engine).
+* Post-processing (e.g., text correction, structuring output).
+4. The tool outputs the extracted text in the desired format (plain text, hOCR, JSON).
+5. Logs and potentially intermediate results are generated for review.
+
+*(This context provides the 'why' and 'how' from a user perspective. Technical details are in systemPatterns.md and techContext.md.)*
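Step 2 of the flow in `memory-bank/productContext.md` ("User specifies configuration options (or uses defaults)") could be modelled with a small immutable settings object. This is a sketch only: the field names (`language`, `deskew`, `binarize`, `output_format`), their default values, and the `load_config` helper are assumptions made for illustration, since the contents of the project's `config.py` are not shown in this commit.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Mapping, Optional

# Output formats named in the high-level flow (plain text, hOCR, JSON).
ALLOWED_FORMATS = ("text", "hocr", "json")


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical pipeline settings; every default here is an assumption."""
    language: str = "eng"
    deskew: bool = True
    binarize: bool = True
    output_format: str = "text"


def load_config(overrides: Optional[Mapping[str, Any]] = None) -> PipelineConfig:
    """Start from the defaults and apply user overrides, rejecting bad input."""
    overrides = dict(overrides or {})
    known = {f.name for f in fields(PipelineConfig)}
    unknown = set(overrides) - known
    if unknown:
        raise KeyError(f"unknown option(s): {sorted(unknown)}")
    config = replace(PipelineConfig(), **overrides)
    if config.output_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported output format: {config.output_format!r}")
    return config
```

Rejecting unknown keys early keeps a typo in an option name from silently falling back to a default, which supports the "Configuration should be clear" goal listed under Ease of Use.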
memory-bank/progress.md
CHANGED
@@ -1,34 +1,30 @@
-# Project Progress
+# Project Progress: HOCR Processing Tool

-##
+## 1. What Works

-* **
-* **Memory Bank:** The six core Memory Bank files have just been created with initial content derived from the project structure and rules.
+* **Initial Memory Bank Setup:** The core documentation structure (projectbrief, productContext, activeContext, systemPatterns, techContext, progress) has been established in the `memory-bank/` directory.

-## What
+## 2. What's Left to Build / Verify

-*
-*
-*
+* **Verify Core Functionality:** Need to run the application and test its basic OCR capabilities on sample images and PDFs.
+* **Confirm Technical Assumptions:** Validate the libraries and dependencies outlined in `techContext.md` by checking `requirements.txt` and relevant code sections.
+* **Understand Configuration:** Investigate `config.py` to determine how users configure the pipeline.
+* **Test UI Layer:** If `app.py` provides a UI (Streamlit/Flask), test its usability and connection to the backend pipeline.
+* **Review Existing Code:** Deeper dive into the modules (`preprocessing.py`, `ocr_processing.py`, etc.) to understand implementation details.
+* **Assess Test Coverage:** Examine the tests in `testing/` to understand what is currently covered.
+* **Address Specific User Goals:** Once the baseline is understood, tackle any specific feature requests, bug fixes, or improvements requested by the user.

-##
+## 3. Current Status

-*
-*
-* Refine and expand Memory Bank documentation as development progresses.
-* Specific next development tasks are pending user direction.
+* **Baseline Established (Memory Bank):** As of 2025-05-05, the initial Memory Bank structure is in place.
+* **Code Functionality:** The operational status of the HOCR tool itself is yet to be verified.

-## Known Issues /
+## 4. Known Issues / Bugs

-
+* *(None identified yet. To be populated as testing and development proceed.)*

-
-2. **Consistent Error Handling:** Implement a uniform error handling strategy across all modules.
-3. **Preprocessing Pipeline Improvement:** Enhance the preprocessing steps to better handle diverse historical document types.
-4. **Image Segmentation Enhancement:** Improve the current approach for identifying text regions.
-5. **Documentation:** Create comprehensive documentation (docstrings, comments) for public functions and APIs.
+## 5. Evolution of Project Decisions (Decision Log)

-
+* **2025-05-05:** Decided to create the standard Cline Memory Bank structure for the `hocr` project upon user request to check configuration. Found no existing `memory-bank` directory and proceeded with creation of core files.

-*
-* **[YYYY-MM-DD]:** Adopted a post-task review workflow to iteratively update `.clinerules` and the Memory Bank.
+*(This document tracks the overall progress and state of the project.)*
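The "Confirm Technical Assumptions" item in `memory-bank/progress.md` (checking `requirements.txt` against the libraries the Memory Bank expects) could be partly automated. The sketch below is an assumption-laden illustration: `declared_packages` and `missing_packages` are hypothetical helpers, the parser only handles simple `name` or `name<specifier>` lines, and the expected package list would have to come from whatever `techContext.md` ends up naming.

```python
import re
from typing import Iterable, List, Set

# Leading distribution name of a requirement line, e.g. "opencv-python>=4.8".
_NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def normalize(name: str) -> str:
    """Normalize a distribution name: lowercase, runs of -_. become '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def declared_packages(requirements_text: str) -> Set[str]:
    """Collect normalized package names from simple requirements lines."""
    names: Set[str] = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()     # drop comments
        if not line or line.startswith("-"):      # skip blanks and pip options
            continue
        match = _NAME.match(line)
        if match:
            names.add(normalize(match.group(1)))
    return names


def missing_packages(requirements_text: str, expected: Iterable[str]) -> List[str]:
    """Return expected packages that the requirements text does not declare."""
    wanted = {normalize(name) for name in expected}
    return sorted(wanted - declared_packages(requirements_text))
```

A check like this only confirms that a package is declared; whether the code actually imports and uses it still needs the manual review described in the same list.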
memory-bank/project-brief.md
DELETED
@@ -1,19 +0,0 @@
-# Project Brief
-
-Historical OCR is an advanced optical character recognition (OCR) application designed to support historical research. It leverages Mistral AI's OCR models alongside image preprocessing pipelines optimized for archival material.
-
-## High-Level Overview
-
-Building a Streamlit-based web application to process historical documents (images or PDFs), optimize them for OCR using advanced preprocessing techniques, and extract structured text and metadata through Mistral's large language models.
-
-## Core Requirements and Goals
-
-* Upload and preprocess historical documents
-
-* Apply tailored OCR prompting and structured output based on document type
-
-* Support user-defined contextual instructions to refine output
-
-* Provide downloadable structured transcripts and analysis
-
-* Example: "Building a Streamlit web app for OCR transcription and structured extraction from historical documents using Mistral AI."
memory-bank/projectbrief.md
CHANGED
@@ -1,21 +1,39 @@
-# Project Brief
+# Project Brief: HOCR Processing Tool

-
+## 1. Project Goal

-
+* **Primary Objective:** To develop and maintain a robust tool for performing Optical Character Recognition (OCR) on various document types (images, PDFs), extracting structured text, and handling potential complexities like diverse layouts, languages, and image quality issues.
+* **Target Users:** Researchers, archivists, developers, or anyone needing to extract text from scanned documents or images.

-
+## 2. Core Requirements

-
+* **Input:** Accept image files (PNG, JPG, TIFF, etc.) and PDF documents.
+* **Processing Pipeline:**
+* Image Preprocessing (deskewing, noise reduction, binarization, etc.)
+* Layout Analysis / Image Segmentation (identifying text blocks, columns, images)
+* OCR Engine Integration (e.g., Tesseract)
+* Language Detection
+* Structured Output Generation (e.g., hOCR format, JSON, plain text)
+* Error Handling and Logging
+* **Configuration:** Allow users to configure processing parameters (e.g., language, preprocessing steps, output format).
+* **Extensibility:** Design for potential future enhancements (e.g., handwriting recognition, specific template handling).

-
+## 3. Scope

-
+* **In Scope:** Core OCR pipeline, configuration management, basic UI (if applicable), testing framework, documentation.
+* **Out of Scope (Initially):** Advanced AI-driven layout analysis beyond standard segmentation, real-time processing for video streams, integration with specific external databases or workflows unless specified.

-
+## 4. Key Technologies (Initial Assessment - *To be refined in techContext.md*)

-
+* **Language:** Python
+* **Core Libraries:** OpenCV, Tesseract (pytesseract), Pillow, pdf2image, potentially others for specific tasks.
+* **Framework (if UI exists):** Flask/Streamlit (based on existing files like `app.py`, `ui/`)

-
+## 5. Success Metrics

-
+* Accuracy of text extraction (measured against ground truth data).
+* Robustness across different document types and qualities.
+* Ease of configuration and use.
+* Maintainability and extensibility of the codebase.
+
+*(This is a foundational brief. Details will be expanded in other Memory Bank documents.)*
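The first success metric in `memory-bank/projectbrief.md`, accuracy measured against ground truth, is often reported as character error rate (CER): the edit distance between the OCR output and the reference transcription, divided by the reference length. The brief does not say which metric the project uses, so CER is an assumed choice here, and `levenshtein` and `character_error_rate` are illustrative names.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) using two rolling rows."""
    if len(a) < len(b):
        a, b = b, a                      # keep the inner row as short as possible
    previous = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length; lower is better."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

For example, `character_error_rate("abcd", "abxd")` is 0.25 because one of four reference characters was substituted. CER can exceed 1.0 when the OCR output contains many spurious characters.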
memory-bank/systemPatterns.md
CHANGED
@@ -1,66 +1,79 @@
|
|
1 |
-
# System Patterns
|
2 |
|
3 |
-
## Architecture
|
4 |
|
5 |
-
The
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
6 |
|
7 |
-
|
8 |
-
* **UI Layer:** Managed by Streamlit. Core UI elements are defined in `ui/layout.py` and potentially reusable components in `ui/ui_components.py` (or the root `ui_components.py`). Custom styling is applied via `ui/custom.css`.
|
9 |
-
* **Processing Modules:** Functionality is separated into distinct Python modules:
|
10 |
-
* `preprocessing.py`: Handles image optimization and preparation for OCR.
|
11 |
-
* `image_segmentation.py`: Deals with identifying regions of interest within documents.
|
12 |
-
* `ocr_processing.py`: Manages the interaction with the OCR engine (Mistral API).
|
13 |
-
* `structured_ocr.py`: Focuses on interpreting raw OCR output and structuring it.
|
14 |
-
* `language_detection.py`: Detects the language of the document content.
|
15 |
-
* `letterhead_handler.py`: Specific logic for dealing with letterheads.
|
16 |
-
* `pdf_ocr.py`: Handles OCR specific to PDF inputs (likely coordinating other modules).
|
17 |
-
* `process_file.py`: A potential high-level orchestrator for the entire file processing pipeline.
|
18 |
-
* **Configuration:** `config.py` likely holds application settings, potentially including API keys or processing parameters. `constants.py` holds fixed values used across the application.
|
19 |
-
* **Utilities:** Common functions are grouped in the `utils/` directory, further categorized (e.g., `image_utils.py`, `text_utils.py`, `file_utils.py`).
|
20 |
-
* **Error Handling:** A dedicated `error_handler.py` suggests a centralized approach to managing exceptions.
|
21 |
|
22 |
-
|
|
|
|
|
|
|
23 |
|
24 |
-
|
25 |
-
* **External API Integration:** Relies on the Mistral AI OCR API for core text extraction (`.clinerules/hocr-basics-api.md`). API interaction logic is likely within `ocr_processing.py` or related utilities.
|
26 |
-
* **Streamlit Framework:** Leverages Streamlit for the web interface, using standard components (`.clinerules/hocr-basics-api.md`). State management likely uses `st.session_state`.
|
27 |
-
* **Content Purity:** Adheres to the principle of separating data from presentation markup (`.clinerules/principle-of-simplicity.md`). Presentation logic should reside primarily in the UI layer.
|
28 |
-
* **Configuration Management:** Centralized configuration likely managed through `config.py`.
|
29 |
-
|
30 |
-
## Component Relationships (Conceptual)
|
31 |
|
32 |
```mermaid
|
33 |
graph TD
|
34 |
-
|
35 |
-
|
36 |
-
|
37 |
-
|
38 |
-
|
39 |
-
|
40 |
-
|
41 |
-
|
42 |
-
|
43 |
-
|
44 |
-
|
45 |
-
|
46 |
-
|
47 |
-
|
48 |
end
|
49 |
|
50 |
-
subgraph
|
51 |
-
|
52 |
-
|
53 |
-
M[utils/];
|
54 |
-
N[error_handler.py];
|
55 |
end
|
56 |
|
57 |
-
|
58 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
59 |
```
|
60 |
|
61 |
-
## Critical Implementation Paths
|
|
|
|
|
|
|
|
|
62 |
|
63 |
-
*
|
64 |
-
* Handling different input file types (Images vs. PDFs).
|
65 |
-
* Integration and error handling for the Mistral API calls.
|
66 |
-
* Implementation of specific preprocessing steps relevant to historical documents.
|
|
|
1 |
+
# System Patterns: HOCR Processing Tool
|
2 |
|
3 |
+
## 1. High-Level Architecture
|
4 |
|
5 |
+
* **Modular Pipeline:** The system appears structured as a pipeline with distinct modules for different stages of OCR processing. Key modules suggested by filenames include:
|
6 |
+
* `preprocessing.py`: Handles initial image adjustments.
|
7 |
+
* `image_segmentation.py`: Identifies regions of interest (text blocks).
|
8 |
+
* `ocr_processing.py`: Manages the core OCR engine interaction.
|
9 |
+
* `language_detection.py`: Determines the language of the text.
|
10 |
+
* `pdf_ocr.py`: Specific handling for PDF inputs.
|
11 |
+
* `structured_ocr.py`: Likely involved in formatting the output.
|
12 |
+
* **Configuration Driven:** `config.py` suggests a centralized configuration management approach, allowing pipeline behavior to be customized.
|
13 |
+
* **Entry Point / Orchestration:** `app.py` likely serves as the main entry point or orchestrator, possibly for a web UI or API, coordinating the pipeline execution based on user input and configuration. `process_file.py` might be an alternative entry point or a core processing function called by `app.py`.
|
14 |
+
* **UI Layer:** The `ui/` directory (`ui/layout.py`, `ui/ui_components.py`) indicates a dedicated user interface layer, possibly built with Streamlit or Flask (as suggested in `projectbrief.md`).
|
15 |
+
* **Utility Functions:** The `utils/` directory (`utils/image_utils.py`, `utils/text_utils.py`, etc.) points to a pattern of encapsulating reusable helper functions.
|
16 |
+
* **Error Handling:** `error_handler.py` suggests a dedicated mechanism for managing and reporting errors during processing.
|
17 |
|
18 |
+
## 2. Key Design Patterns (Inferred)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
19 |
|
20 |
+
* **Pipeline Pattern:** The core processing flow seems to follow a pipeline pattern, where data (image/document) passes through sequential processing stages.
|
21 |
+
* **Configuration Management:** Centralized configuration (`config.py`) allows for decoupling settings from code.
|
22 |
+
* **Separation of Concerns:** Different functionalities (UI, core processing, utilities, configuration) appear to be separated into distinct modules/files.
|
23 |
+
* **Utility/Helper Modules:** Common, reusable functions are grouped into utility modules.
|
24 |
|
25 |
+
## 3. Component Relationships (Initial Diagram - Mermaid)
|
|
|
|
|
|
|
|
|
|
|
|
|
26 |
|
27 |
```mermaid
|
28 |
graph TD
|
29 |
+
subgraph User Interface / Entry Point
|
30 |
+
A[app.py / UI Layer] --> B(process_file.py);
|
31 |
+
end
|
32 |
+
|
33 |
+
subgraph Configuration
|
34 |
+
C[config.py];
|
35 |
+
end
|
36 |
+
|
37 |
+
subgraph Core OCR Pipeline
|
38 |
+
B --> D(preprocessing.py);
|
39 |
+
D --> E(image_segmentation.py);
|
40 |
+
E --> F(ocr_processing.py);
|
41 |
+
F --> G(language_detection.py);
|
42 |
+
G --> H(structured_ocr.py);
|
43 |
end
|
44 |
|
45 |
+
subgraph Input Handling
|
46 |
+
I[pdf_ocr.py] --> B;
|
47 |
+
J[Image Input] --> B;
|
|
|
|
|
48 |
end
|
49 |
|
50 |
+
subgraph Utilities
|
51 |
+
K[utils/];
|
52 |
+
L[error_handler.py];
|
53 |
+
end
|
54 |
+
|
55 |
+
A --> C;
|
56 |
+
B --> C;
|
57 |
+
D --> K;
|
58 |
+
E --> K;
|
59 |
+
F --> K;
|
60 |
+
G --> K;
|
61 |
+
H --> K;
|
62 |
+
I --> K;
|
63 |
+
B --> L;
|
64 |
+
|
65 |
+
style User Interface / Entry Point fill:#f9f,stroke:#333,stroke-width:2px
|
66 |
+
style Configuration fill:#ccf,stroke:#333,stroke-width:2px
|
67 |
+
style Core OCR Pipeline fill:#cfc,stroke:#333,stroke-width:2px
|
68 |
+
style Input Handling fill:#ffc,stroke:#333,stroke-width:2px
|
69 |
+
style Utilities fill:#eee,stroke:#333,stroke-width:2px
|
70 |
+
|
71 |
```
|
72 |
|
73 |
+
## 4. Critical Implementation Paths
|
74 |
+
|
75 |
+
* **Image Input -> Preprocessing -> Segmentation -> OCR -> Structured Output:** The main flow for image files.
|
76 |
+
* **PDF Input -> PDF Extraction -> Image Conversion (per page) -> [Main Flow] -> Aggregated Output:** The likely path for PDF documents.
|
77 |
+
* **Configuration Loading -> Pipeline Execution:** How settings influence the process.
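The image and PDF paths listed above can be sketched as a single dispatcher. This is an assumption-laden illustration: `process_document`, `run_main_flow`, and `split_pdf` are hypothetical names, and the real page-splitting and OCR logic is passed in as callables rather than guessed at.

```python
from pathlib import Path
from typing import Callable, List


def process_document(
    path: str,
    run_main_flow: Callable[[str], str],
    split_pdf: Callable[[str], List[str]],
) -> str:
    """Route a file through the image or PDF path (hypothetical helper).

    run_main_flow: stands in for preprocessing -> segmentation -> OCR on one image.
    split_pdf: stands in for converting a PDF into one image path per page.
    """
    if Path(path).suffix.lower() == ".pdf":
        page_images = split_pdf(path)
        # Run the main flow per page, then aggregate in page order.
        return "\n\n".join(run_main_flow(page) for page in page_images)
    return run_main_flow(path)


if __name__ == "__main__":
    fake_ocr = lambda image: f"text from {image}"
    fake_split = lambda pdf: [f"{pdf}#page{i}" for i in (1, 2)]
    print(process_document("report.pdf", fake_ocr, fake_split))
```

Injecting the two callables keeps the routing logic testable without Tesseract, Poppler, or any real files.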
|
78 |
|
79 |
+
*(This document outlines the observed structure. It will be refined as the codebase is analyzed in more detail.)*
|
memory-bank/techContext.md
CHANGED
@@ -1,35 +1,37 @@
|
|
1 |
-
# Technical Context
|
2 |
|
3 |
-
##
|
4 |
|
5 |
-
* **
|
6 |
-
* **Web Framework:** Streamlit
|
7 |
-
* **Core API:** Mistral AI OCR API (via HTTPS requests)
|
8 |
-
* **Potential Libraries:**
|
9 |
-
* `requests`: For making HTTP calls to the Mistral API.
|
10 |
-
* `streamlit`: For the web UI framework.
|
11 |
-
* `Pillow` (PIL Fork): For basic image loading and manipulation.
|
12 |
-
* `OpenCV` (`cv2`): Likely used for more advanced image preprocessing tasks (e.g., thresholding, noise reduction, deskewing).
|
13 |
-
* `python-dotenv`: Potentially used for managing environment variables like API keys (especially if `config.py` loads from a `.env` file).
|
14 |
-
* `PyMuPDF` or similar: If PDF processing involves direct text/image extraction from PDF structures beyond just sending to OCR.
|
15 |
|
16 |
-
|
17 |
|
18 |
-
|
19 |
|
20 |
-
|
21 |
-
* **Dependencies:** Install required packages (likely via `pip install -r requirements.txt` if a requirements file exists).
|
22 |
-
* **API Keys:** Requires a Mistral AI API key, which needs to be configured securely (likely via environment variables loaded in `config.py`).
|
23 |
-
* **Running the App:** Typically run using `streamlit run app.py` from the project root directory.
|
24 |
|
25 |
-
|
26 |
|
27 |
-
|
28 |
-
* **Processing Time:** OCR and complex image preprocessing can be time-consuming, especially for large documents or high-resolution images. Streamlit's execution model needs to be considered for long-running tasks (e.g., using background processes or providing user feedback).
|
29 |
-
* **Resource Usage:** Image processing can be memory and CPU intensive.
|
30 |
|
31 |
-
|
32 |
|
33 |
-
|
34 |
-
|
35 |
-
* **
|
1 |
+
# Technical Context: HOCR Processing Tool
|
2 |
|
3 |
+
## 1. Core Language
|
4 |
|
5 |
+
* **Python:** The project is primarily written in Python, as indicated by the `.py` files. The specific version should be confirmed (e.g., via `requirements.txt` or environment setup).
|
6 |
|
7 |
+
## 2. Key Libraries & Frameworks
|
8 |
|
9 |
+
* **OCR Engine:** Likely **Tesseract OCR**, potentially accessed via the `pytesseract` wrapper (common practice). This needs confirmation by inspecting `ocr_processing.py` or dependencies.
|
10 |
+
* **Image Processing:** **OpenCV (`cv2`)** and/or **Pillow (PIL)** are highly probable for tasks in `preprocessing.py` and `image_segmentation.py`.
|
11 |
+
* **PDF Handling:** **`pdf2image`** (a wrapper around Poppler's `pdftoppm` and `pdftocairo` tools) is a common choice for converting PDF pages to images, relevant for `pdf_ocr.py`. Other PDF libraries like PyMuPDF or PyPDF2 might also be used.
|
12 |
+
* **Web Framework/UI:** Based on `app.py` and the `ui/` directory, **Flask** or **Streamlit** are potential candidates for the user interface or API layer.
|
13 |
+
* **Configuration:** Likely handled through standard Python mechanisms: `.ini` files read with `configparser`, `.json` files, or a custom Python module such as `config.py`.
|
14 |
+
* **Dependency Management:** Likely uses `pip` with a `requirements.txt` file (observed in the file listing). Virtual environments (like `venv` or `conda`) are standard practice.
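As one illustration of the configuration mechanisms listed above, the sketch below reads `.ini`-style settings with `configparser` and lets environment variables override them. The `[ocr]` section, its keys, and the `OCR_` variable prefix are assumptions made for the example, not settings confirmed in the project.

```python
import configparser
import os
from typing import Any, Dict, Mapping

# Hypothetical defaults; the real project may keep these in config.py instead.
DEFAULT_INI = """
[ocr]
language = eng
dpi = 300
"""


def load_settings(
    ini_text: str = DEFAULT_INI,
    env: Mapping[str, str] = os.environ,
) -> Dict[str, Any]:
    """Parse INI text, then apply OCR_<KEY> environment overrides."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    settings: Dict[str, Any] = dict(parser["ocr"])

    for key in list(settings):
        override = env.get(f"OCR_{key.upper()}")
        if override is not None:
            settings[key] = override

    # configparser returns strings, so convert numeric settings explicitly.
    settings["dpi"] = int(settings["dpi"])
    return settings


if __name__ == "__main__":
    print(load_settings())
```

Passing `env` as a parameter keeps the function deterministic in tests while defaulting to the real environment in normal use.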
|
15 |
|
16 |
+
## 3. External Dependencies & Setup
|
17 |
|
18 |
+
* **Tesseract OCR Engine:** Requires separate installation on the host system. The path to the Tesseract executable might need configuration.
|
19 |
+
* **Poppler:** Required if `pdf2image` is used, since that library calls Poppler's command-line tools; needs separate installation on the host system.
|
20 |
+
* **Python Environment:** A specific Python version, with packages installed from `requirements.txt`.
|
21 |
+
* **Environment Variables:** Potential use of environment variables for configuration (e.g., API keys, paths), possibly managed via a `.env` file (observed in the file listing).
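Because the external binaries above are installed outside of `pip`, a startup check can report missing tools early. The sketch below is a minimal example using only the standard library; it assumes the executables are named `tesseract` and `pdftoppm` (Poppler's PDF-to-image tool) and are expected on `PATH`. The function names are hypothetical.

```python
import shutil
from typing import Dict, Iterable, List, Optional

# Assumed executable names for Tesseract and Poppler.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def find_external_tools(
    names: Iterable[str] = REQUIRED_TOOLS,
) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names: Iterable[str] = REQUIRED_TOOLS) -> List[str]:
    """Return the names of tools that could not be found."""
    return [name for name, path in find_external_tools(names).items() if path is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing external tools:", ", ".join(missing))
    else:
        print("All external tools found.")
```

If a tool lives outside `PATH`, its location would need to be supplied through configuration instead, which this sketch does not cover.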
|
22 |
|
23 |
+
## 4. Development Environment
|
24 |
|
25 |
+
* **Standard Python Setup:** Requires a Python interpreter, `pip`, and likely `virtualenv`.
|
26 |
+
* **Code Editor/IDE:** VS Code is being used (based on environment details). Settings might be stored in `.vscode/`.
|
27 |
+
* **Version Control:** Git is likely used (indicated by `.gitignore`, `.gitattributes`). The `.git_disabled` directory suggests the `.git` directory may have been renamed to disable version control temporarily.
|
28 |
+
* **Testing:** The `testing/` directory and the `.pytest_cache` directory suggest **pytest** is used for running tests.
|
29 |
|
30 |
+
## 5. Technical Constraints & Considerations
|
31 |
+
|
32 |
+
* **Performance:** OCR, especially on large documents or batches, can be computationally intensive. Image processing steps add overhead.
|
33 |
+
* **Tesseract Limitations:** Tesseract's accuracy depends heavily on image quality, preprocessing, and language model availability.
|
34 |
+
* **Dependency Hell:** Managing Python dependencies and external binaries (Tesseract, Poppler) across different operating systems can be challenging.
|
35 |
+
* **Layout Complexity:** Handling complex layouts (multi-column, tables, embedded images) requires sophisticated segmentation logic.
|
36 |
+
|
37 |
+
*(This document provides an initial technical overview based on file structure and common practices. It requires verification by examining code and configuration files like requirements.txt.)*
|