Delete memory-bank
- memory-bank/activeContext.md +0 -36
- memory-bank/productContext.md +0 -44
- memory-bank/progress.md +0 -30
- memory-bank/projectbrief.md +0 -39
- memory-bank/systemPatterns.md +0 -79
- memory-bank/techContext.md +0 -37
memory-bank/activeContext.md
DELETED
@@ -1,36 +0,0 @@
-# Active Context: HOCR Processing Tool - Initial Setup
-
-## 1. Current Focus
-
-* **Initial Project Setup:** Establishing the core Memory Bank documentation structure for the HOCR project as per Cline's requirements.
-* Creating the foundational markdown files (`projectbrief.md`, `productContext.md`, `activeContext.md`, `systemPatterns.md`, `techContext.md`, `progress.md`).
-
-## 2. Recent Changes
-
-* Created `memory-bank` directory within the `hocr` project folder.
-* Created `projectbrief.md` with initial project goals and scope.
-* Created `productContext.md` outlining the problem space, vision, and user goals.
-* Created this `activeContext.md` file.
-
-## 3. Next Steps
-
-* Create `systemPatterns.md` to document the high-level architecture and design patterns (based on initial analysis of existing code).
-* Create `techContext.md` to detail the technologies, libraries, and setup requirements.
-* Create `progress.md` to establish the baseline for tracking project status.
-* Once the initial Memory Bank is set up, await further instructions or tasks from the user regarding the HOCR project itself.
-
-## 4. Active Decisions & Considerations
-
-* Following Cline's standard Memory Bank structure.
-* Populating initial files with baseline information derived from the project's file structure and general OCR principles. These will need refinement as the project is explored further.
-
-## 5. Important Patterns & Preferences
-
-* *(To be filled in as patterns are identified in the codebase or specified by the user)*
-
-## 6. Learnings & Insights
-
-* The HOCR project appears to be a substantial Python application with distinct modules for different OCR stages (preprocessing, segmentation, OCR processing, etc.) and includes a UI component.
-* Initial setup requires creating the standard Memory Bank files.
-
-*(This file tracks the immediate state and short-term plans. It should be updated frequently.)*
memory-bank/productContext.md
DELETED
@@ -1,44 +0,0 @@
-# Product Context: HOCR Processing Tool
-
-## 1. Problem Space
-
-* Extracting text from images and scanned documents (PDFs) is a common but often challenging task.
-* Documents vary widely in quality, layout, language, and format.
-* Manual transcription is time-consuming, error-prone, and not scalable.
-* Existing OCR tools may lack flexibility, specific preprocessing capabilities, or produce output that isn't easily usable for downstream tasks.
-
-## 2. Solution Vision
-
-* Provide a reliable, configurable, and extensible OCR processing pipeline.
-* Empower users to convert document images and PDFs into usable, structured text data.
-* Offer fine-grained control over the OCR process, from preprocessing to output formatting.
-* Serve as a foundational tool that can be integrated into larger document processing workflows.
-
-## 3. Target Audience & Needs
-
-* **Researchers/Academics:** Need to extract text from historical documents, scanned books, or research papers for analysis. Require high accuracy and potentially language-specific handling.
-* **Archivists/Librarians:** Need to digitize large collections of documents, making them searchable and accessible. Require batch processing capabilities and robust handling of varied document types.
-* **Developers:** Need an OCR component to integrate into applications (e.g., document management systems, data extraction tools). Require a clear API or command-line interface and structured output (like JSON or hOCR).
-* **General Users:** May need to occasionally extract text from a scanned form, receipt, or image. Require a simple interface (if applicable) and reasonable default settings.
-
-## 4. User Experience Goals
-
-* **Accuracy:** The primary goal is to maximize the accuracy of the extracted text.
-* **Configurability:** Users should be able to tailor the processing steps and parameters to their specific document types and needs.
-* **Transparency:** The tool should provide feedback on the processing steps and allow for debugging (e.g., viewing intermediate images, logs).
-* **Performance:** Processing should be reasonably efficient, especially for batch operations.
-* **Ease of Use:** While powerful, the tool should be approachable, whether through a command-line interface or a potential GUI. Configuration should be clear and well-documented.
-
-## 5. How it Should Work (High-Level Flow)
-
-1. User provides input file(s) (image or PDF).
-2. User specifies configuration options (or uses defaults).
-3. The tool executes the configured pipeline:
-    * Preprocessing (optional steps like deskew, binarize).
-    * Segmentation (detecting text regions).
-    * OCR (applying Tesseract or another engine).
-    * Post-processing (e.g., text correction, structuring output).
-4. The tool outputs the extracted text in the desired format (plain text, hOCR, JSON).
-5. Logs and potentially intermediate results are generated for review.
-
-*(This context provides the 'why' and 'how' from a user perspective. Technical details are in systemPatterns.md and techContext.md.)*
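The high-level flow in section 5 of `productContext.md` can be sketched as a chain of stage functions driven by a configuration object. This is a minimal illustration only: `PipelineConfig`, `STAGES`, `run_pipeline`, and the stage functions are hypothetical names that do not come from the HOCR codebase, and each stage body is a stand-in rather than real image processing or OCR.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PipelineConfig:
    """Hypothetical configuration; defaults apply when the user passes nothing."""
    preprocessing: List[str] = field(default_factory=lambda: ["deskew", "binarize"])
    output_format: str = "text"  # "text", "hocr", or "json"


def preprocess(doc: dict, config: PipelineConfig) -> dict:
    # Stand-in: only records which optional steps would have been applied.
    doc["applied"] = list(config.preprocessing)
    return doc


def segment(doc: dict, config: PipelineConfig) -> dict:
    # Stand-in: treats the whole input as a single text region.
    doc["regions"] = [doc["source"]]
    return doc


def recognize(doc: dict, config: PipelineConfig) -> dict:
    # Stand-in for a call into Tesseract or another OCR engine.
    doc["text"] = [f"<text of {region}>" for region in doc["regions"]]
    return doc


def postprocess(doc: dict, config: PipelineConfig) -> dict:
    # Structures the output in the format requested by the configuration.
    doc["output"] = {"format": config.output_format, "text": "\n".join(doc["text"])}
    return doc


STAGES: List[Callable[[dict, PipelineConfig], dict]] = [
    preprocess, segment, recognize, postprocess,
]


def run_pipeline(path: str, config: Optional[PipelineConfig] = None) -> dict:
    """Run every stage in order and keep a log of the stages for review."""
    config = config or PipelineConfig()
    doc = {"source": path, "log": []}
    for stage in STAGES:
        doc = stage(doc, config)
        doc["log"].append(stage.__name__)
    return doc
```

Calling `run_pipeline("scan.png")` uses the defaults, while `run_pipeline("scan.png", PipelineConfig(preprocessing=[], output_format="json"))` skips the optional preprocessing steps and requests JSON output.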
memory-bank/progress.md
DELETED
@@ -1,30 +0,0 @@
-# Project Progress: HOCR Processing Tool
-
-## 1. What Works
-
-* **Initial Memory Bank Setup:** The core documentation structure (projectbrief, productContext, activeContext, systemPatterns, techContext, progress) has been established in the `memory-bank/` directory.
-
-## 2. What's Left to Build / Verify
-
-* **Verify Core Functionality:** Need to run the application and test its basic OCR capabilities on sample images and PDFs.
-* **Confirm Technical Assumptions:** Validate the libraries and dependencies outlined in `techContext.md` by checking `requirements.txt` and relevant code sections.
-* **Understand Configuration:** Investigate `config.py` to determine how users configure the pipeline.
-* **Test UI Layer:** If `app.py` provides a UI (Streamlit/Flask), test its usability and connection to the backend pipeline.
-* **Review Existing Code:** Deeper dive into the modules (`preprocessing.py`, `ocr_processing.py`, etc.) to understand implementation details.
-* **Assess Test Coverage:** Examine the tests in `testing/` to understand what is currently covered.
-* **Address Specific User Goals:** Once the baseline is understood, tackle any specific feature requests, bug fixes, or improvements requested by the user.
-
-## 3. Current Status
-
-* **Baseline Established (Memory Bank):** As of 2025-05-05, the initial Memory Bank structure is in place.
-* **Code Functionality:** The operational status of the HOCR tool itself is yet to be verified.
-
-## 4. Known Issues / Bugs
-
-* *(None identified yet. To be populated as testing and development proceed.)*
-
-## 5. Evolution of Project Decisions (Decision Log)
-
-* **2025-05-05:** Decided to create the standard Cline Memory Bank structure for the `hocr` project upon user request to check configuration. Found no existing `memory-bank` directory and proceeded with creation of core files.
-
-*(This document tracks the overall progress and state of the project.)*
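The "Confirm Technical Assumptions" item in `progress.md` could be partly automated by comparing the libraries the Memory Bank guesses at with what `requirements.txt` actually lists. The sketch below is an assumption-laden illustration: `parse_requirements` and `missing_packages` are hypothetical helpers, the parser handles only simple `name==version` style lines, and the sample text is invented rather than taken from the project's real requirements file.

```python
import re
from typing import Iterable, List, Set


def _normalize(name: str) -> str:
    # Package names compare case-insensitively, with "_" treated like "-".
    return name.lower().replace("_", "-")


def parse_requirements(text: str) -> Set[str]:
    """Collect package names from simple requirements.txt content."""
    names = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.add(_normalize(match.group(0)))
    return names


def missing_packages(text: str, expected: Iterable[str]) -> List[str]:
    """Return the expected packages that the requirements text does not list."""
    return sorted({_normalize(name) for name in expected} - parse_requirements(text))


# Invented sample content, used only to demonstrate the helpers.
sample = "opencv-python==4.9.0\npytesseract>=0.3\n# a comment\nPillow\n"
print(missing_packages(sample, ["pytesseract", "Pillow", "pdf2image"]))
```

A non-empty result flags an assumption in `techContext.md` that still needs checking against the code.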
memory-bank/projectbrief.md
DELETED
@@ -1,39 +0,0 @@
-# Project Brief: HOCR Processing Tool
-
-## 1. Project Goal
-
-* **Primary Objective:** To develop and maintain a robust tool for performing Optical Character Recognition (OCR) on various document types (images, PDFs), extracting structured text, and handling potential complexities like diverse layouts, languages, and image quality issues.
-* **Target Users:** Researchers, archivists, developers, or anyone needing to extract text from scanned documents or images.
-
-## 2. Core Requirements
-
-* **Input:** Accept image files (PNG, JPG, TIFF, etc.) and PDF documents.
-* **Processing Pipeline:**
-    * Image Preprocessing (deskewing, noise reduction, binarization, etc.)
-    * Layout Analysis / Image Segmentation (identifying text blocks, columns, images)
-    * OCR Engine Integration (e.g., Tesseract)
-    * Language Detection
-    * Structured Output Generation (e.g., hOCR format, JSON, plain text)
-    * Error Handling and Logging
-* **Configuration:** Allow users to configure processing parameters (e.g., language, preprocessing steps, output format).
-* **Extensibility:** Design for potential future enhancements (e.g., handwriting recognition, specific template handling).
-
-## 3. Scope
-
-* **In Scope:** Core OCR pipeline, configuration management, basic UI (if applicable), testing framework, documentation.
-* **Out of Scope (Initially):** Advanced AI-driven layout analysis beyond standard segmentation, real-time processing for video streams, integration with specific external databases or workflows unless specified.
-
-## 4. Key Technologies (Initial Assessment - *To be refined in techContext.md*)
-
-* **Language:** Python
-* **Core Libraries:** OpenCV, Tesseract (pytesseract), Pillow, pdf2image, potentially others for specific tasks.
-* **Framework (if UI exists):** Flask/Streamlit (based on existing files like `app.py`, `ui/`)
-
-## 5. Success Metrics
-
-* Accuracy of text extraction (measured against ground truth data).
-* Robustness across different document types and qualities.
-* Ease of configuration and use.
-* Maintainability and extensibility of the codebase.
-
-*(This is a foundational brief. Details will be expanded in other Memory Bank documents.)*
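The brief names accuracy against ground truth as a success metric but does not say how it is measured. One common choice, assumed here and not stated in the brief, is the character error rate (CER): the Levenshtein edit distance between the OCR output and the ground truth, divided by the ground-truth length. The function names below are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text: str, ground_truth: str) -> float:
    """CER = edit distance / ground-truth length; lower is better."""
    if not ground_truth:
        # Convention chosen for this sketch: any output against empty truth is all error.
        return 0.0 if not ocr_text else 1.0
    return levenshtein(ocr_text, ground_truth) / len(ground_truth)
```

For example, `character_error_rate("hella world", "hello world")` has one substitution over eleven characters. A word-level variant of the same idea would split both strings on whitespace before comparing.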
memory-bank/systemPatterns.md
DELETED
@@ -1,79 +0,0 @@
-# System Patterns: HOCR Processing Tool
-
-## 1. High-Level Architecture
-
-* **Modular Pipeline:** The system appears structured as a pipeline with distinct modules for different stages of OCR processing. Key modules suggested by filenames include:
-    * `preprocessing.py`: Handles initial image adjustments.
-    * `image_segmentation.py`: Identifies regions of interest (text blocks).
-    * `ocr_processing.py`: Manages the core OCR engine interaction.
-    * `language_detection.py`: Determines the language of the text.
-    * `pdf_ocr.py`: Specific handling for PDF inputs.
-    * `structured_ocr.py`: Likely involved in formatting the output.
-* **Configuration Driven:** `config.py` suggests a centralized configuration management approach, allowing pipeline behavior to be customized.
-* **Entry Point / Orchestration:** `app.py` likely serves as the main entry point or orchestrator, possibly for a web UI or API, coordinating the pipeline execution based on user input and configuration. `process_file.py` might be an alternative entry point or a core processing function called by `app.py`.
-* **UI Layer:** The `ui/` directory (`ui/layout.py`, `ui/ui_components.py`) indicates a dedicated user interface layer, possibly built with Streamlit or Flask (as suggested in `projectbrief.md`).
-* **Utility Functions:** The `utils/` directory (`utils/image_utils.py`, `utils/text_utils.py`, etc.) points to a pattern of encapsulating reusable helper functions.
-* **Error Handling:** `error_handler.py` suggests a dedicated mechanism for managing and reporting errors during processing.
-
-## 2. Key Design Patterns (Inferred)
-
-* **Pipeline Pattern:** The core processing flow seems to follow a pipeline pattern, where data (image/document) passes through sequential processing stages.
-* **Configuration Management:** Centralized configuration (`config.py`) allows for decoupling settings from code.
-* **Separation of Concerns:** Different functionalities (UI, core processing, utilities, configuration) appear to be separated into distinct modules/files.
-* **Utility/Helper Modules:** Common, reusable functions are grouped into utility modules.
-
-## 3. Component Relationships (Initial Diagram - Mermaid)
-
-```mermaid
-graph TD
-    subgraph User Interface / Entry Point
-        A[app.py / UI Layer] --> B(process_file.py);
-    end
-
-    subgraph Configuration
-        C[config.py];
-    end
-
-    subgraph Core OCR Pipeline
-        B --> D(preprocessing.py);
-        D --> E(image_segmentation.py);
-        E --> F(ocr_processing.py);
-        F --> G(language_detection.py);
-        G --> H(structured_ocr.py);
-    end
-
-    subgraph Input Handling
-        I[pdf_ocr.py] --> B;
-        J[Image Input] --> B;
-    end
-
-    subgraph Utilities
-        K[utils/];
-        L[error_handler.py];
-    end
-
-    A --> C;
-    B --> C;
-    D --> K;
-    E --> K;
-    F --> K;
-    G --> K;
-    H --> K;
-    I --> K;
-    B --> L;
-
-    style User Interface / Entry Point fill:#f9f,stroke:#333,stroke-width:2px
-    style Configuration fill:#ccf,stroke:#333,stroke-width:2px
-    style Core OCR Pipeline fill:#cfc,stroke:#333,stroke-width:2px
-    style Input Handling fill:#ffc,stroke:#333,stroke-width:2px
-    style Utilities fill:#eee,stroke:#333,stroke-width:2px
-
-```
-
-## 4. Critical Implementation Paths
-
-* **Image Input -> Preprocessing -> Segmentation -> OCR -> Structured Output:** The main flow for image files.
-* **PDF Input -> PDF Extraction -> Image Conversion (per page) -> [Main Flow] -> Aggregated Output:** The likely path for PDF documents.
-* **Configuration Loading -> Pipeline Execution:** How settings influence the process.
-
-*(This document outlines the observed structure. It will be refined as the codebase is analyzed in more detail.)*
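The PDF path listed under "Critical Implementation Paths" (per-page conversion, main flow, aggregated output) can be sketched as a loop that runs one OCR callable per page and then joins the results. This is a hedged illustration: `ocr_pdf` and its return shape are hypothetical, and the pages are plain stand-in values. In a real setup they might be the page images returned by `pdf2image.convert_from_path`, which the Memory Bank lists only as a likely, unconfirmed dependency.

```python
from typing import Callable, Dict, Iterable, List


def ocr_pdf(pages: Iterable[object], ocr_page: Callable[[object], str]) -> Dict[str, object]:
    """Apply the main OCR flow to each page and aggregate the results.

    `pages` is any iterable of page images; `ocr_page` is the per-image
    main flow (preprocessing, segmentation, OCR) wrapped in one callable.
    """
    results: List[Dict[str, object]] = []
    for number, page in enumerate(pages, start=1):
        results.append({"page": number, "text": ocr_page(page)})
    return {
        "page_count": len(results),
        "pages": results,
        # Pages are separated by a blank line in the combined text.
        "text": "\n\n".join(str(r["text"]) for r in results),
    }


# Stand-in usage: strings play the role of page images and str.upper
# plays the role of the OCR flow, so the sketch runs without Tesseract.
document = ocr_pdf(["page one", "page two"], str.upper)
print(document["text"])
```

Keeping the per-page callable as a parameter means the same aggregation works whether the main flow is the real pipeline or a test double.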
memory-bank/techContext.md
DELETED
@@ -1,37 +0,0 @@
-# Technical Context: HOCR Processing Tool
-
-## 1. Core Language
-
-* **Python:** The project is primarily written in Python, as indicated by the `.py` files. The specific version should be confirmed (e.g., via `requirements.txt` or environment setup).
-
-## 2. Key Libraries & Frameworks
-
-* **OCR Engine:** Likely **Tesseract OCR**, potentially accessed via the `pytesseract` wrapper (common practice). This needs confirmation by inspecting `ocr_processing.py` or dependencies.
-* **Image Processing:** **OpenCV (`cv2`)** and/or **Pillow (PIL)** are highly probable for tasks in `preprocessing.py` and `image_segmentation.py`.
-* **PDF Handling:** **`pdf2image`** (which often relies on Poppler) is a common choice for converting PDF pages to images, relevant for `pdf_ocr.py`. Other PDF libraries like PyMuPDF or PyPDF2 might also be used.
-* **Web Framework/UI:** Based on `app.py` and the `ui/` directory, **Flask** or **Streamlit** are potential candidates for the user interface or API layer.
-* **Configuration:** Standard Python mechanisms (e.g., `.ini` files with `configparser`, `.json` files, or custom Python modules like `config.py`).
-* **Dependency Management:** Likely uses `pip` with a `requirements.txt` file (observed in the file listing). Virtual environments (like `venv` or `conda`) are standard practice.
-
-## 3. External Dependencies & Setup
-
-* **Tesseract OCR Engine:** Requires separate installation on the host system. The path to the Tesseract executable might need configuration.
-* **Poppler:** Often required by `pdf2image` for PDF processing; needs separate installation.
-* **Python Environment:** A specific Python version and installed packages via `requirements.txt`.
-* **Environment Variables:** Potential use of environment variables for configuration (e.g., API keys, paths), possibly managed via a `.env` file (observed in the file listing).
-
-## 4. Development Environment
-
-* **Standard Python Setup:** Requires a Python interpreter, `pip`, and likely `virtualenv`.
-* **Code Editor/IDE:** VS Code is being used (based on environment details). Settings might be stored in `.vscode/`.
-* **Version Control:** Git is likely used (indicated by `.gitignore`, `.gitattributes`). The `.git_disabled` directory suggests Git might have been temporarily disabled or renamed.
-* **Testing:** The `testing/` directory and `pytest_cache` suggest **pytest** is used for running tests.
-
-## 5. Technical Constraints & Considerations
-
-* **Performance:** OCR, especially on large documents or batches, can be computationally intensive. Image processing steps add overhead.
-* **Tesseract Limitations:** Tesseract's accuracy depends heavily on image quality, preprocessing, and language model availability.
-* **Dependency Hell:** Managing Python dependencies and external binaries (Tesseract, Poppler) across different operating systems can be challenging.
-* **Layout Complexity:** Handling complex layouts (multi-column, tables, embedded images) requires sophisticated segmentation logic.
-
-*(This document provides an initial technical overview based on file structure and common practices. It requires verification by examining code and configuration files like requirements.txt.)*
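Because Tesseract and Poppler are installed separately from the Python packages, a small startup check can report whether their executables are reachable before any OCR is attempted. The sketch below assumes the usual binary names: `tesseract` for the OCR engine and `pdftoppm`, one of the Poppler utilities that `pdf2image` calls. `locate_tool`, `check_external_tools`, and the `overrides` mapping are hypothetical and stand in for whatever path configuration the project actually uses.

```python
import os
import shutil
from typing import Dict, Mapping, Optional


def locate_tool(name: str, override: Optional[str] = None) -> Optional[str]:
    """Return the path to an executable, or None if it cannot be found.

    An explicit override (for example a configured Tesseract path) takes
    precedence over the PATH lookup.
    """
    if override:
        return override if os.path.isfile(override) else None
    return shutil.which(name)


def check_external_tools(
    overrides: Optional[Mapping[str, str]] = None,
) -> Dict[str, Optional[str]]:
    """Map each required external binary to its resolved path (or None)."""
    overrides = overrides or {}
    return {
        name: locate_tool(name, overrides.get(name))
        for name in ("tesseract", "pdftoppm")
    }


if __name__ == "__main__":
    for tool, path in check_external_tools().items():
        print(f"{tool}: {path or 'NOT FOUND (install separately)'}")
```

Reporting both tools at once, rather than failing on the first missing one, gives a complete picture of what still needs installing on a new machine.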