metadata
title: Historical OCR
emoji: 📜
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: 1.44.1
app_file: app.py
pinned: false
license: gpl-3.0
short_description: advanced OCR application for historical document analysis

Historical OCR

An advanced OCR application for historical document analysis using Mistral AI.

Note: This tool is designed to assist scholars in historical research by extracting text from challenging documents. It will not be fully accurate for every source, but it serves as a research aid for navigating historical documents, particularly newspapers, handwritten documents, and photos of archival materials.

Features

  • OCR with Context: AI-enhanced OCR optimized for historical documents
  • Document Type Detection: Automatically identifies handwritten letters, recipes, scientific texts, and more
  • Advanced Image Preprocessing:
    • Automatic deskewing to straighten tilted or rotated page scans
    • Smart thresholding with Otsu and adaptive methods
    • Morphological operations to clean up text
    • Document-type specific optimization
  • Custom Prompting: Tailor the AI analysis with document-specific instructions
  • Structured Output: Returns organized, structured information based on document type

Using This App

  1. Upload a historical document (image or PDF)
  2. Add optional context or special instructions
  3. Get detailed, structured OCR results with historical context
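As a rough illustration of step 2, the sketch below shows one way optional context could be merged with a document-type hint into a single prompt. Everything here is an assumption made for the example: `TYPE_HINTS`, `build_prompt`, and the hint wording are hypothetical and are not taken from the app.

```python
# Hypothetical per-type hints; the app's real prompts are not shown in this README.
TYPE_HINTS = {
    "handwritten letter": "Preserve salutations, dates, and signatures.",
    "recipe": "List ingredients and steps separately.",
    "newspaper": "Keep headlines and column order.",
}

BASE_PROMPT = "Transcribe this historical document and return structured output."


def build_prompt(document_type, user_instructions=""):
    """Combine the base prompt, a type-specific hint, and user instructions."""
    parts = [BASE_PROMPT]

    hint = TYPE_HINTS.get(document_type.strip().lower())
    if hint:
        parts.append(hint)

    extra = user_instructions.strip()
    if extra:
        parts.append("Additional instructions: " + extra)

    return "\n".join(parts)


if __name__ == "__main__":
    print(build_prompt("Recipe", "Quantities use 18th-century English measures."))
```

Keeping the user's text in its own labelled line makes it easy to see in logs which part of the prompt came from the scholar and which from the application.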

Supported Document Types

  • Handwritten letters and correspondence
  • Historical recipes and cookbooks
  • Travel accounts and exploration logs
  • Scientific papers and experiments
  • Legal documents and certificates
  • Historical newspaper articles
  • General historical texts

Technical Details

Built with Streamlit and Mistral AI's OCR and large language model capabilities.
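For readers curious how an uploaded image might be handed to an OCR service, here is a minimal sketch that packs image bytes into a base64 data URL. The helper name `image_document` is hypothetical. The dictionary keys (`type`, `image_url`) reflect the request shape Mistral's OCR documentation has described for images, but treat that shape as an assumption and check it against the current SDK documentation before relying on it. No network call is made here.

```python
import base64
import mimetypes


def image_document(filename, data):
    """Build a document payload holding the image as a base64 data URL.

    filename is used only to guess the MIME type; data is the raw file bytes.
    """
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image file: {filename}")

    encoded = base64.b64encode(data).decode("ascii")
    # Assumed payload shape; verify against the OCR provider's current docs.
    return {"type": "image_url", "image_url": f"data:{mime};base64,{encoded}"}


if __name__ == "__main__":
    doc = image_document("page.png", b"abc")
    print(doc["image_url"])
```

In a Streamlit app, the bytes would typically come from the object returned by `st.file_uploader`, for example via its `getvalue()` method.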


Created by Zach Muhlbauer, CUNY Graduate Center