---
title: Pdf2markdown (Flask)
emoji: 👁️
colorFrom: green
colorTo: blue
sdk: docker
pinned: false
# For Docker Spaces, app_port in README.md informs Hugging Face which internal port your app listens on.
# This should match the port Gunicorn (or your app server) binds to.
app_port: 7860
---
## PDF to Markdown Converter (Flask Version)
This application converts PDF files, either uploaded directly or fetched from a URL, into Markdown.
It extracts text and applies basic formatting, identifies tables, and extracts images.
Extracted images are uploaded to a Hugging Face Dataset repository, named "pdf-images-extracted" by default (the name is configurable).
**Important:** For image uploading to work, you **must** add an `HF_TOKEN` secret to your Hugging Face Space, and the token must have write access to datasets.
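To illustrate how the token and the dataset repository fit together, here is a minimal sketch of an upload helper built on `huggingface_hub`. It is not necessarily how this app does it: the function names `get_hf_token` and `upload_image` are hypothetical, and the sketch assumes you pass a full repository id such as `your-username/pdf-images-extracted`. Space secrets are exposed to the running container as environment variables, so the token is read from `os.environ`.

```python
import os


def get_hf_token() -> str:
    """Return the token from the HF_TOKEN environment variable (hypothetical helper)."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Add it as a Space secret with write access to datasets."
        )
    return token


def upload_image(local_path: str, path_in_repo: str, repo_id: str) -> str:
    """Upload one image to a dataset repo and return a direct link to it.

    Assumption: repo_id is a full id like "your-username/pdf-images-extracted".
    """
    # Imported lazily so get_hf_token() works even without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi(token=get_hf_token())
    # Create the dataset repository on first use; a no-op if it already exists.
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=path_in_repo,
        repo_id=repo_id,
        repo_type="dataset",
    )
    # Link that can be embedded in the generated Markdown.
    return f"https://huggingface.co/datasets/{repo_id}/resolve/main/{path_in_repo}"
```

Failing early with a clear error when `HF_TOKEN` is missing makes a misconfigured Space easier to diagnose than a rejected upload later on.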
### Features
- Upload PDF files directly.
- Process PDFs from a publicly accessible URL.
- Extracts plain text and attempts to preserve some layout.
- Detects and formats tables into Markdown.
- Extracts images from the PDF.
- Runs OCR on extracted images so that text embedded in images is included in the output.
- Uploads extracted images to a Hugging Face Dataset.
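As an illustration of the table formatting step listed above, the sketch below renders a detected table as a Markdown pipe table. It assumes the table arrives as a list of rows, with the first row as the header and `None` for empty cells. The function name `rows_to_markdown` is hypothetical and this is not necessarily the app's own implementation.

```python
def rows_to_markdown(rows):
    """Render a list of rows (first row is the header) as a Markdown pipe table."""
    if not rows:
        return ""

    # Rows from PDF extraction can be ragged, so pad every row to the widest one.
    width = max(len(row) for row in rows)

    def clean(cell):
        # Empty cells become "", newlines are flattened, and pipes are escaped
        # so that cell content cannot break the table structure.
        text = "" if cell is None else str(cell)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    padded = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    header, *body = padded

    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


if __name__ == "__main__":
    print(rows_to_markdown([["Item", "Qty"], ["Apples", 3], ["Pears", None]]))
```

Escaping `|` and flattening newlines matters because cell text extracted from a PDF often contains both, and either one would otherwise corrupt the rendered table.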
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference