# Technical Context: HOCR Processing Tool
## 1. Core Language
* **Python:** The project is primarily written in Python, as indicated by the `.py` files. The interpreter version is unconfirmed: `requirements.txt` normally pins packages rather than the Python version, so check the environment setup or deployment configuration.
## 2. Key Libraries & Frameworks
* **OCR Engine:** Likely **Tesseract OCR**, potentially accessed via the `pytesseract` wrapper (common practice). This needs confirmation by inspecting `ocr_processing.py` or dependencies.
* **Image Processing:** **OpenCV (`cv2`)** and/or **Pillow (PIL)** are highly probable for tasks in `preprocessing.py` and `image_segmentation.py`.
* **PDF Handling:** **`pdf2image`** (which often relies on Poppler) is a common choice for converting PDF pages to images, relevant for `pdf_ocr.py`. Other PDF libraries like PyMuPDF or PyPDF2 might also be used.
* **Web Framework/UI:** Based on `app.py` and the `ui/` directory, **Flask** or **Streamlit** are potential candidates for the user interface or API layer.
* **Configuration:** Standard Python mechanisms (e.g., `.ini` files with `configparser`, `.json` files, or custom Python modules like `config.py`).
* **Dependency Management:** Likely uses `pip` with a `requirements.txt` file (observed in the file listing). Virtual environments (like `venv` or `conda`) are standard practice.
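
Because the libraries above are inferred rather than confirmed, one quick check is to ask the active environment which of them are importable. The following is a minimal sketch, not part of the project: the `CANDIDATES` mapping and `probe_candidates` function are hypothetical names, and the import names (`cv2` for OpenCV, `PIL` for Pillow, `fitz` for PyMuPDF) are only the guesses listed in this section.

```python
from importlib.util import find_spec

# Import names for the libraries guessed at above. None of these are
# confirmed dependencies of the project; adjust after reading requirements.txt.
CANDIDATES = {
    "OCR wrapper": ["pytesseract"],
    "Image processing": ["cv2", "PIL"],
    "PDF handling": ["pdf2image", "fitz", "PyPDF2"],
    "Web/UI": ["flask", "streamlit"],
}


def probe_candidates(candidates=CANDIDATES):
    """Return {category: [module names importable in this environment]}."""
    found = {}
    for category, modules in candidates.items():
        # find_spec returns None for a top-level module that is not installed.
        found[category] = [name for name in modules if find_spec(name) is not None]
    return found


if __name__ == "__main__":
    for category, modules in probe_candidates().items():
        print(f"{category}: {', '.join(modules) or 'none found'}")
```

An importable module only shows that the package is installed in the environment, not that the project uses it. Inspecting the imports in `ocr_processing.py`, `preprocessing.py`, and `pdf_ocr.py` is still needed.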
## 3. External Dependencies & Setup
* **Tesseract OCR Engine:** Requires separate installation on the host system. The path to the Tesseract executable might need configuration.
* **Poppler:** Often required by `pdf2image` for PDF processing; needs separate installation.
* **Python Environment:** Requires a compatible Python version and the packages listed in `requirements.txt`.
* **Environment Variables:** Potential use of environment variables for configuration (e.g., API keys, paths), possibly managed via a `.env` file (observed in the file listing).
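
The external requirements above can be checked before any OCR run. This is a minimal sketch under stated assumptions: `locate_binaries` and `tesseract_path` are hypothetical helpers, `TESSERACT_CMD` is an illustrative variable name rather than one the project is known to read, and `pdftoppm` is checked because it is one of the Poppler executables that `pdf2image` calls.

```python
import os
import shutil


def locate_binaries(names=("tesseract", "pdftoppm")):
    """Map each executable name to its absolute path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def tesseract_path(env=None):
    """Prefer an explicit override from the environment, else search PATH.

    TESSERACT_CMD is a hypothetical variable name used for illustration.
    """
    env = os.environ if env is None else env
    return env.get("TESSERACT_CMD") or shutil.which("tesseract")


if __name__ == "__main__":
    for name, path in locate_binaries().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

If the project does use `pytesseract`, the resolved path can be assigned to `pytesseract.pytesseract.tesseract_cmd` when the executable is not on `PATH`.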
## 4. Development Environment
* **Standard Python Setup:** Requires a Python interpreter, `pip`, and a virtual environment tool (e.g., `venv`, `virtualenv`, or `conda`).
* **Code Editor/IDE:** VS Code is being used (based on environment details). Settings might be stored in `.vscode/`.
* **Version Control:** Git is likely used (indicated by `.gitignore` and `.gitattributes`). The `.git_disabled` directory suggests the `.git` directory was renamed, possibly to disable Git temporarily.
* **Testing:** The `testing/` directory and `pytest_cache` suggest **pytest** is used for running tests.
## 5. Technical Constraints & Considerations
* **Performance:** OCR, especially on large documents or batches, can be computationally intensive. Image processing steps add overhead.
* **Tesseract Limitations:** Tesseract's accuracy depends heavily on image quality, preprocessing, and language model availability.
* **Dependency Hell:** Managing Python dependencies and external binaries (Tesseract, Poppler) across different operating systems can be challenging.
* **Layout Complexity:** Handling complex layouts (multi-column, tables, embedded images) requires sophisticated segmentation logic.
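
For the performance constraint, one common mitigation is to process pages concurrently. The sketch below is illustrative only: `ocr_page` is a placeholder standing in for the project's real per-page OCR call, and `ocr_pages` and `max_workers` are hypothetical names. Threads are assumed to be adequate here because wrappers such as `pytesseract` run the Tesseract binary as a subprocess, so the Python thread mostly waits. CPU-bound preprocessing in pure Python would need processes instead.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_page(page_number):
    """Placeholder for the real per-page OCR call (hypothetical)."""
    return f"text of page {page_number}"


def ocr_pages(page_numbers, worker=ocr_page, max_workers=4):
    """Run `worker` over the pages concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs,
        # regardless of which task finishes first.
        return list(pool.map(worker, page_numbers))


if __name__ == "__main__":
    for text in ocr_pages(range(1, 4)):
        print(text)
```

Whether this helps depends on the host: each Tesseract process can itself use several cores, so `max_workers` should be tuned by measurement rather than assumed.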
*(This document provides an initial technical overview based on file structure and common practices. It requires verification by examining code and configuration files like requirements.txt.)*