---
title: Historical OCR
emoji: 📜
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: 1.44.1
app_file: app.py
pinned: false
license: gpl-3.0
short_description: advanced OCR application for historical document analysis
---
# Historical OCR
An advanced OCR application for historical document analysis using Mistral AI.
> **Note:** This tool is designed to assist scholars in historical research by extracting text from challenging documents. While it may not achieve 100% accuracy for all materials, it serves as a valuable research aid for navigating historical documents, particularly historical newspapers, handwritten documents, and photos of archival materials.
## Features
- **OCR with Context:** AI-enhanced OCR optimized for historical documents
- **Document Type Detection:** Automatically identifies handwritten letters, recipes, scientific texts, and more
- **Advanced Image Preprocessing:**
  - Automatic deskewing to correct document orientation
  - Smart thresholding with Otsu and adaptive methods
  - Morphological operations to clean up text
  - Document-type specific optimization
- **Custom Prompting:** Tailor the AI analysis with document-specific instructions
- **Structured Output:** Returns organized, structured information based on document type
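
To illustrate the Otsu thresholding mentioned under image preprocessing, here is a minimal pure-Python sketch. It is not the app's own code: the function names `otsu_threshold` and `binarize` are hypothetical, and a real pipeline would more likely rely on an image library. It assumes 8-bit grayscale pixels with dark ink on lighter paper.

```python
def otsu_threshold(pixels):
    """Pick the 8-bit threshold that maximizes between-class variance (Otsu's method)."""
    # Histogram of gray levels 0..255.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t (dark class)
    sum0 = 0  # sum of gray levels in the dark class
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0  # number of pixels with value > t (light class)
        if w0 == 0:
            continue
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels, threshold):
    """Map dark pixels (<= threshold) to 0 (ink) and the rest to 255 (paper)."""
    return [0 if p <= threshold else 255 for p in pixels]


# Hypothetical sample: two dark ink tones and two light paper tones.
sample = [20] * 50 + [30] * 50 + [200] * 50 + [210] * 50
t = otsu_threshold(sample)
clean = binarize(sample, t)
```

Otsu's method picks one global threshold per image, which works well on evenly lit scans. Adaptive methods instead compute a threshold per neighborhood, which tends to suit photographs of archival pages with uneven lighting.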
## Using This App
1. Upload a historical document (image or PDF)
2. Add optional context or special instructions
3. Get detailed, structured OCR results with historical context
## Supported Document Types
- Handwritten letters and correspondence
- Historical recipes and cookbooks
- Travel accounts and exploration logs
- Scientific papers and experiments
- Legal documents and certificates
- Historical newspaper articles
- General historical texts
## Technical Details
Built with Streamlit and Mistral AI's OCR and large language model capabilities.
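
As a rough sketch of how an uploaded image might be packaged for Mistral's OCR endpoint, the helper below builds a request payload with a base64 data URL. The helper name `build_ocr_request` is hypothetical, the model name `mistral-ocr-latest` and the payload shape are assumptions based on Mistral's public OCR documentation, and the app's actual integration may differ.

```python
import base64


def build_ocr_request(image_bytes, mime_type="image/jpeg", model="mistral-ocr-latest"):
    """Build an OCR request payload that embeds the image as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,  # assumed model identifier
        "document": {
            "type": "image_url",
            "image_url": f"data:{mime_type};base64,{encoded}",
        },
    }


# Placeholder bytes standing in for a real scanned page.
request = build_ocr_request(b"fake-image-bytes", mime_type="image/png")
```

With the `mistralai` Python client installed and an API key configured, a payload like this would presumably be passed as keyword arguments to the client's OCR call; check the current client documentation for the exact method signature.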
---
Created by Zach Muhlbauer, CUNY Graduate Center