milwright committed on
Commit 791731a · verified · 1 Parent(s): 4748d85

Delete memory-bank

memory-bank/activeContext.md DELETED
@@ -1,36 +0,0 @@
- # Active Context: HOCR Processing Tool - Initial Setup
-
- ## 1. Current Focus
-
- * **Initial Project Setup:** Establishing the core Memory Bank documentation structure for the HOCR project as per Cline's requirements.
- * Creating the foundational markdown files (`projectbrief.md`, `productContext.md`, `activeContext.md`, `systemPatterns.md`, `techContext.md`, `progress.md`).
-
- ## 2. Recent Changes
-
- * Created `memory-bank` directory within the `hocr` project folder.
- * Created `projectbrief.md` with initial project goals and scope.
- * Created `productContext.md` outlining the problem space, vision, and user goals.
- * Created this `activeContext.md` file.
-
- ## 3. Next Steps
-
- * Create `systemPatterns.md` to document the high-level architecture and design patterns (based on initial analysis of existing code).
- * Create `techContext.md` to detail the technologies, libraries, and setup requirements.
- * Create `progress.md` to establish the baseline for tracking project status.
- * Once the initial Memory Bank is set up, await further instructions or tasks from the user regarding the HOCR project itself.
-
- ## 4. Active Decisions & Considerations
-
- * Following Cline's standard Memory Bank structure.
- * Populating initial files with baseline information derived from the project's file structure and general OCR principles. These will need refinement as the project is explored further.
-
- ## 5. Important Patterns & Preferences
-
- * *(To be filled in as patterns are identified in the codebase or specified by the user)*
-
- ## 6. Learnings & Insights
-
- * The HOCR project appears to be a substantial Python application with distinct modules for different OCR stages (preprocessing, segmentation, OCR processing, etc.) and includes a UI component.
- * Initial setup requires creating the standard Memory Bank files.
-
- *(This file tracks the immediate state and short-term plans. It should be updated frequently.)*
memory-bank/productContext.md DELETED
@@ -1,44 +0,0 @@
- # Product Context: HOCR Processing Tool
-
- ## 1. Problem Space
-
- * Extracting text from images and scanned documents (PDFs) is a common but often challenging task.
- * Documents vary widely in quality, layout, language, and format.
- * Manual transcription is time-consuming, error-prone, and not scalable.
- * Existing OCR tools may lack flexibility, specific preprocessing capabilities, or produce output that isn't easily usable for downstream tasks.
-
- ## 2. Solution Vision
-
- * Provide a reliable, configurable, and extensible OCR processing pipeline.
- * Empower users to convert document images and PDFs into usable, structured text data.
- * Offer fine-grained control over the OCR process, from preprocessing to output formatting.
- * Serve as a foundational tool that can be integrated into larger document processing workflows.
-
- ## 3. Target Audience & Needs
-
- * **Researchers/Academics:** Need to extract text from historical documents, scanned books, or research papers for analysis. Require high accuracy and potentially language-specific handling.
- * **Archivists/Librarians:** Need to digitize large collections of documents, making them searchable and accessible. Require batch processing capabilities and robust handling of varied document types.
- * **Developers:** Need an OCR component to integrate into applications (e.g., document management systems, data extraction tools). Require a clear API or command-line interface and structured output (like JSON or hOCR).
- * **General Users:** May need to occasionally extract text from a scanned form, receipt, or image. Require a simple interface (if applicable) and reasonable default settings.
-
- ## 4. User Experience Goals
-
- * **Accuracy:** The primary goal is to maximize the accuracy of the extracted text.
- * **Configurability:** Users should be able to tailor the processing steps and parameters to their specific document types and needs.
- * **Transparency:** The tool should provide feedback on the processing steps and allow for debugging (e.g., viewing intermediate images, logs).
- * **Performance:** Processing should be reasonably efficient, especially for batch operations.
- * **Ease of Use:** While powerful, the tool should be approachable, whether through a command-line interface or a potential GUI. Configuration should be clear and well-documented.
-
- ## 5. How it Should Work (High-Level Flow)
-
- 1. User provides input file(s) (image or PDF).
- 2. User specifies configuration options (or uses defaults).
- 3. The tool executes the configured pipeline:
-     * Preprocessing (optional steps like deskew, binarize).
-     * Segmentation (detecting text regions).
-     * OCR (applying Tesseract or another engine).
-     * Post-processing (e.g., text correction, structuring output).
- 4. The tool outputs the extracted text in the desired format (plain text, hOCR, JSON).
- 5. Logs and potentially intermediate results are generated for review.
-
- *(This context provides the 'why' and 'how' from a user perspective. Technical details are in systemPatterns.md and techContext.md.)*
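
The high-level flow in the deleted `productContext.md` (input, configuration, staged pipeline, formatted output, logs) can be sketched as a list of stage functions applied in order. This is a minimal illustration, not the project's code: `PipelineConfig`, `run_pipeline`, and every stage body are hypothetical placeholders, and hOCR output is left out of the sketch.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# A stage receives the working state and returns it updated.
Stage = Callable[[Dict], Dict]


@dataclass
class PipelineConfig:
    """Hypothetical options; the real ones would be defined in config.py."""
    preprocess_steps: List[str] = field(default_factory=lambda: ["deskew", "binarize"])
    output_format: str = "text"  # "text" or "json" in this sketch


def preprocess(state: Dict) -> Dict:
    # Placeholder: only records which optional steps were requested.
    steps = state["config"].preprocess_steps
    state["log"].append("preprocess: " + (", ".join(steps) or "none"))
    return state


def segment(state: Dict) -> Dict:
    # Placeholder: treat the whole page as a single text region.
    state["regions"] = ["full-page"]
    state["log"].append("segment: 1 region")
    return state


def recognize(state: Dict) -> Dict:
    # Placeholder for the OCR engine call (Tesseract or another engine).
    state["text"] = "  placeholder text  "
    state["log"].append("ocr: recognized %d region(s)" % len(state["regions"]))
    return state


def postprocess(state: Dict) -> Dict:
    # Minimal cleanup standing in for text correction and structuring.
    state["text"] = state["text"].strip()
    state["log"].append("postprocess: stripped whitespace")
    return state


STAGES: List[Stage] = [preprocess, segment, recognize, postprocess]


def run_pipeline(path: str, config: Optional[PipelineConfig] = None) -> Dict:
    """Run each stage in order, then format the output and keep the log."""
    config = config or PipelineConfig()
    state: Dict = {"source": path, "config": config, "log": []}
    for stage in STAGES:
        state = stage(state)
    if config.output_format == "json":
        state["output"] = json.dumps({"source": path, "text": state["text"]})
    elif config.output_format == "text":
        state["output"] = state["text"]
    else:
        raise ValueError("unsupported output format: " + config.output_format)
    return state
```

Calling `run_pipeline("scan.png")` uses the defaults, while `run_pipeline("scan.png", PipelineConfig(output_format="json"))` switches the output format. Keeping the stages in a plain list is one way to make optional steps easy to add or drop from configuration.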
memory-bank/progress.md DELETED
@@ -1,30 +0,0 @@
- # Project Progress: HOCR Processing Tool
-
- ## 1. What Works
-
- * **Initial Memory Bank Setup:** The core documentation structure (projectbrief, productContext, activeContext, systemPatterns, techContext, progress) has been established in the `memory-bank/` directory.
-
- ## 2. What's Left to Build / Verify
-
- * **Verify Core Functionality:** Need to run the application and test its basic OCR capabilities on sample images and PDFs.
- * **Confirm Technical Assumptions:** Validate the libraries and dependencies outlined in `techContext.md` by checking `requirements.txt` and relevant code sections.
- * **Understand Configuration:** Investigate `config.py` to determine how users configure the pipeline.
- * **Test UI Layer:** If `app.py` provides a UI (Streamlit/Flask), test its usability and connection to the backend pipeline.
- * **Review Existing Code:** Deeper dive into the modules (`preprocessing.py`, `ocr_processing.py`, etc.) to understand implementation details.
- * **Assess Test Coverage:** Examine the tests in `testing/` to understand what is currently covered.
- * **Address Specific User Goals:** Once the baseline is understood, tackle any specific feature requests, bug fixes, or improvements requested by the user.
-
- ## 3. Current Status
-
- * **Baseline Established (Memory Bank):** As of 2025-05-05, the initial Memory Bank structure is in place.
- * **Code Functionality:** The operational status of the HOCR tool itself is yet to be verified.
-
- ## 4. Known Issues / Bugs
-
- * *(None identified yet. To be populated as testing and development proceed.)*
-
- ## 5. Evolution of Project Decisions (Decision Log)
-
- * **2025-05-05:** Decided to create the standard Cline Memory Bank structure for the `hocr` project upon user request to check configuration. Found no existing `memory-bank` directory and proceeded with creation of core files.
-
- *(This document tracks the overall progress and state of the project.)*
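
The "Confirm Technical Assumptions" item in the deleted `progress.md` calls for checking `requirements.txt` against the libraries guessed in `techContext.md`. A small sketch of that check follows, assuming a plain pip-style requirements file. The function names are hypothetical, and the sample text in the usage note is invented for illustration, not taken from the repository.

```python
import re
from typing import Iterable, List, Set


def normalize(name: str) -> str:
    # Package names compare case-insensitively, with "-", "_" and "." equivalent.
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_requirement_names(text: str) -> Set[str]:
    """Collect package names from pip-style requirement lines."""
    names: Set[str] = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and options like "-r"
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.add(normalize(match.group(0)))
    return names


def missing_packages(text: str, expected: Iterable[str]) -> List[str]:
    """Return the expected packages that the requirements text does not list."""
    present = parse_requirement_names(text)
    return sorted(normalize(n) for n in expected if normalize(n) not in present)
```

For example, `missing_packages("opencv-python==4.9.0\nPillow>=10\npytesseract\n", ["pytesseract", "pdf2image", "Pillow"])` reports only `pdf2image` as absent. The parser is deliberately simple and does not handle URL or VCS requirement lines.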
memory-bank/projectbrief.md DELETED
@@ -1,39 +0,0 @@
- # Project Brief: HOCR Processing Tool
-
- ## 1. Project Goal
-
- * **Primary Objective:** To develop and maintain a robust tool for performing Optical Character Recognition (OCR) on various document types (images, PDFs), extracting structured text, and handling potential complexities like diverse layouts, languages, and image quality issues.
- * **Target Users:** Researchers, archivists, developers, or anyone needing to extract text from scanned documents or images.
-
- ## 2. Core Requirements
-
- * **Input:** Accept image files (PNG, JPG, TIFF, etc.) and PDF documents.
- * **Processing Pipeline:**
-     * Image Preprocessing (deskewing, noise reduction, binarization, etc.)
-     * Layout Analysis / Image Segmentation (identifying text blocks, columns, images)
-     * OCR Engine Integration (e.g., Tesseract)
-     * Language Detection
-     * Structured Output Generation (e.g., hOCR format, JSON, plain text)
-     * Error Handling and Logging
- * **Configuration:** Allow users to configure processing parameters (e.g., language, preprocessing steps, output format).
- * **Extensibility:** Design for potential future enhancements (e.g., handwriting recognition, specific template handling).
-
- ## 3. Scope
-
- * **In Scope:** Core OCR pipeline, configuration management, basic UI (if applicable), testing framework, documentation.
- * **Out of Scope (Initially):** Advanced AI-driven layout analysis beyond standard segmentation, real-time processing for video streams, integration with specific external databases or workflows unless specified.
-
- ## 4. Key Technologies (Initial Assessment - *To be refined in techContext.md*)
-
- * **Language:** Python
- * **Core Libraries:** OpenCV, Tesseract (pytesseract), Pillow, pdf2image, potentially others for specific tasks.
- * **Framework (if UI exists):** Flask/Streamlit (based on existing files like `app.py`, `ui/`)
-
- ## 5. Success Metrics
-
- * Accuracy of text extraction (measured against ground truth data).
- * Robustness across different document types and qualities.
- * Ease of configuration and use.
- * Maintainability and extensibility of the codebase.
-
- *(This is a foundational brief. Details will be expanded in other Memory Bank documents.)*
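
The first success metric in the deleted `projectbrief.md`, accuracy measured against ground truth, is not tied to a specific formula in the source. One common choice is the character error rate (CER): the Levenshtein edit distance between the OCR output and the reference, divided by the reference length. The sketch below assumes that metric; the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text: str, ground_truth: str) -> float:
    """Edit distance normalized by the length of the ground-truth text."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)
```

With the reference "hello world" and the OCR output "helo world", one character is missing out of eleven, so the CER is 1/11. A CER of 0.0 means the texts match exactly, and values above 1.0 are possible when the OCR output is much longer than the reference.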
memory-bank/systemPatterns.md DELETED
@@ -1,79 +0,0 @@
- # System Patterns: HOCR Processing Tool
-
- ## 1. High-Level Architecture
-
- * **Modular Pipeline:** The system appears structured as a pipeline with distinct modules for different stages of OCR processing. Key modules suggested by filenames include:
-     * `preprocessing.py`: Handles initial image adjustments.
-     * `image_segmentation.py`: Identifies regions of interest (text blocks).
-     * `ocr_processing.py`: Manages the core OCR engine interaction.
-     * `language_detection.py`: Determines the language of the text.
-     * `pdf_ocr.py`: Specific handling for PDF inputs.
-     * `structured_ocr.py`: Likely involved in formatting the output.
- * **Configuration Driven:** `config.py` suggests a centralized configuration management approach, allowing pipeline behavior to be customized.
- * **Entry Point / Orchestration:** `app.py` likely serves as the main entry point or orchestrator, possibly for a web UI or API, coordinating the pipeline execution based on user input and configuration. `process_file.py` might be an alternative entry point or a core processing function called by `app.py`.
- * **UI Layer:** The `ui/` directory (`ui/layout.py`, `ui/ui_components.py`) indicates a dedicated user interface layer, possibly built with Streamlit or Flask (as suggested in `projectbrief.md`).
- * **Utility Functions:** The `utils/` directory (`utils/image_utils.py`, `utils/text_utils.py`, etc.) points to a pattern of encapsulating reusable helper functions.
- * **Error Handling:** `error_handler.py` suggests a dedicated mechanism for managing and reporting errors during processing.
-
- ## 2. Key Design Patterns (Inferred)
-
- * **Pipeline Pattern:** The core processing flow seems to follow a pipeline pattern, where data (image/document) passes through sequential processing stages.
- * **Configuration Management:** Centralized configuration (`config.py`) allows for decoupling settings from code.
- * **Separation of Concerns:** Different functionalities (UI, core processing, utilities, configuration) appear to be separated into distinct modules/files.
- * **Utility/Helper Modules:** Common, reusable functions are grouped into utility modules.
-
- ## 3. Component Relationships (Initial Diagram - Mermaid)
-
- ```mermaid
- graph TD
-     subgraph User Interface / Entry Point
-         A[app.py / UI Layer] --> B(process_file.py);
-     end
-
-     subgraph Configuration
-         C[config.py];
-     end
-
-     subgraph Core OCR Pipeline
-         B --> D(preprocessing.py);
-         D --> E(image_segmentation.py);
-         E --> F(ocr_processing.py);
-         F --> G(language_detection.py);
-         G --> H(structured_ocr.py);
-     end
-
-     subgraph Input Handling
-         I[pdf_ocr.py] --> B;
-         J[Image Input] --> B;
-     end
-
-     subgraph Utilities
-         K[utils/];
-         L[error_handler.py];
-     end
-
-     A --> C;
-     B --> C;
-     D --> K;
-     E --> K;
-     F --> K;
-     G --> K;
-     H --> K;
-     I --> K;
-     B --> L;
-
-     style User Interface / Entry Point fill:#f9f,stroke:#333,stroke-width:2px
-     style Configuration fill:#ccf,stroke:#333,stroke-width:2px
-     style Core OCR Pipeline fill:#cfc,stroke:#333,stroke-width:2px
-     style Input Handling fill:#ffc,stroke:#333,stroke-width:2px
-     style Utilities fill:#eee,stroke:#333,stroke-width:2px
-
- ```
-
- ## 4. Critical Implementation Paths
-
- * **Image Input -> Preprocessing -> Segmentation -> OCR -> Structured Output:** The main flow for image files.
- * **PDF Input -> PDF Extraction -> Image Conversion (per page) -> [Main Flow] -> Aggregated Output:** The likely path for PDF documents.
- * **Configuration Loading -> Pipeline Execution:** How settings influence the process.
-
- *(This document outlines the observed structure. It will be refined as the codebase is analyzed in more detail.)*
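
The PDF path listed under "Critical Implementation Paths" (convert each page to an image, run the main flow per page, aggregate the results) can be sketched without committing to any particular library. In the sketch below, `process_pdf_pages` and its parameters are hypothetical. In a real setup, the pages might come from `pdf2image.convert_from_path` and the per-page callable from the OCR pipeline, but neither is confirmed by the source.

```python
from typing import Callable, Dict, Iterable, List


def process_pdf_pages(pages: Iterable[object],
                      process_page: Callable[[object], str],
                      page_separator: str = "\n\n") -> Dict[str, object]:
    """Run the per-page flow on each page and aggregate the text.

    `pages` is any iterable of page images; `process_page` stands in for the
    main image flow (preprocessing, segmentation, OCR, structuring).
    """
    page_texts: List[str] = []
    for page in pages:
        page_texts.append(process_page(page))
    return {
        "page_count": len(page_texts),
        "pages": page_texts,                       # per-page results, in order
        "text": page_separator.join(page_texts),   # aggregated output
    }
```

Passing the per-page function in as an argument keeps the PDF handling separate from the image flow, which matches the separation of concerns the document infers from the file layout.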
memory-bank/techContext.md DELETED
@@ -1,37 +0,0 @@
- # Technical Context: HOCR Processing Tool
-
- ## 1. Core Language
-
- * **Python:** The project is primarily written in Python, as indicated by the `.py` files. The specific version should be confirmed (e.g., via `requirements.txt` or environment setup).
-
- ## 2. Key Libraries & Frameworks
-
- * **OCR Engine:** Likely **Tesseract OCR**, potentially accessed via the `pytesseract` wrapper (common practice). This needs confirmation by inspecting `ocr_processing.py` or dependencies.
- * **Image Processing:** **OpenCV (`cv2`)** and/or **Pillow (PIL)** are highly probable for tasks in `preprocessing.py` and `image_segmentation.py`.
- * **PDF Handling:** **`pdf2image`** (which often relies on Poppler) is a common choice for converting PDF pages to images, relevant for `pdf_ocr.py`. Other PDF libraries like PyMuPDF or PyPDF2 might also be used.
- * **Web Framework/UI:** Based on `app.py` and the `ui/` directory, **Flask** or **Streamlit** are potential candidates for the user interface or API layer.
- * **Configuration:** Standard Python mechanisms (e.g., `.ini` files with `configparser`, `.json` files, or custom Python modules like `config.py`).
- * **Dependency Management:** Likely uses `pip` with a `requirements.txt` file (observed in the file listing). Virtual environments (like `venv` or `conda`) are standard practice.
-
- ## 3. External Dependencies & Setup
-
- * **Tesseract OCR Engine:** Requires separate installation on the host system. The path to the Tesseract executable might need configuration.
- * **Poppler:** Often required by `pdf2image` for PDF processing; needs separate installation.
- * **Python Environment:** A specific Python version and installed packages via `requirements.txt`.
- * **Environment Variables:** Potential use of environment variables for configuration (e.g., API keys, paths), possibly managed via a `.env` file (observed in the file listing).
-
- ## 4. Development Environment
-
- * **Standard Python Setup:** Requires a Python interpreter, `pip`, and likely `virtualenv`.
- * **Code Editor/IDE:** VS Code is being used (based on environment details). Settings might be stored in `.vscode/`.
- * **Version Control:** Git is likely used (indicated by `.gitignore`, `.gitattributes`). The `.git_disabled` directory suggests Git might have been temporarily disabled or renamed.
- * **Testing:** The `testing/` directory and `pytest_cache` suggest **pytest** is used for running tests.
-
- ## 5. Technical Constraints & Considerations
-
- * **Performance:** OCR, especially on large documents or batches, can be computationally intensive. Image processing steps add overhead.
- * **Tesseract Limitations:** Tesseract's accuracy depends heavily on image quality, preprocessing, and language model availability.
- * **Dependency Hell:** Managing Python dependencies and external binaries (Tesseract, Poppler) across different operating systems can be challenging.
- * **Layout Complexity:** Handling complex layouts (multi-column, tables, embedded images) requires sophisticated segmentation logic.
-
- *(This document provides an initial technical overview based on file structure and common practices. It requires verification by examining code and configuration files like requirements.txt.)*
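
The external dependencies named in the deleted `techContext.md` (the Tesseract executable and Poppler) are installed outside of `pip`, so a quick presence check on the `PATH` can catch setup problems early. The sketch below uses only the standard library. The helper names are hypothetical, and `pdftoppm` is used on the assumption that it is the Poppler tool of interest, since it is one of the Poppler utilities `pdf2image` relies on.

```python
import shutil
from typing import Dict, Iterable, List, Optional


def find_binaries(names: Iterable[str] = ("tesseract", "pdftoppm")) -> Dict[str, Optional[str]]:
    """Map each executable name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_binaries(found: Dict[str, Optional[str]]) -> List[str]:
    """List the executables that could not be located."""
    return sorted(name for name, path in found.items() if path is None)


if __name__ == "__main__":
    found = find_binaries()
    for name, path in found.items():
        print(f"{name}: {path or 'NOT FOUND'}")
    if missing_binaries(found):
        print("Install the missing tools or add them to PATH before running OCR.")
```

A check like this does not replace configuring an explicit executable path where the tools live outside `PATH`; it only reports what the current environment can resolve.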