TengLongAI2026

Posted on May 29

Zerox: Drop an Image, Get Markdown — The AI OCR That Actually Understands

#webdev #ai #productivity #opensource

Summary

Traditional OCR reads characters. Zerox understands content. It uses vision LLMs (GPT-4o, Claude) to turn any image or PDF into clean Markdown — preserving tables, code blocks, handwriting, and layout. Think of it as "OCR 2.0."

The Problem OCR Never Solved

I spent an afternoon last week trying to extract a table from a scanned PDF.

First, I tried Tesseract. The text came through okay, but the table? Complete chaos. Columns merged. Rows shifted. Numbers ended up in the wrong cells. I spent another 30 minutes manually realigning everything.

This is the dirty secret of traditional OCR: it reads characters, but it doesn't understand structure.

Then I tried Zerox. Same PDF. Same table. Output was a perfect Markdown table — columns aligned, rows intact, numbers where they belonged.

Aspect	Traditional OCR	Zerox (AI Vision OCR)
Engine	Character recognition	Vision LLM (GPT-4o/Claude)
Tables	Often broken	Structure preserved
Handwriting	Poor	Good
Layout	Needs extra analysis	Naturally understood
Speed	Fast (local)	Slow (API call per page)
Cost	Free	API costs apply

What Is Zerox?

Zerox is a Python tool that uses vision-capable AI models to convert PDFs and images into Markdown.

pip install py-zerox

Its flow is dead simple:

PDF/Image → Split into pages → Send each page to vision LLM → Return Markdown
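To make that flow concrete, here is a minimal Python sketch of the last two steps. It is not Zerox's actual code: `pages_to_markdown` and `fake_transcribe` are hypothetical names, and the vision-model call is replaced by a stand-in function so the example runs without an API key.

```python
from typing import Callable, List


def pages_to_markdown(
    page_images: List[bytes],
    transcribe_page: Callable[[bytes], str],
) -> str:
    """Send each page image to a transcriber and join the Markdown results."""
    chunks = []
    for image in page_images:
        # One model call per page; this is where the latency and cost come from.
        chunks.append(transcribe_page(image).strip())
    # Separate pages with a blank line so Markdown blocks do not run together.
    return "\n\n".join(chunks)


# Hypothetical stand-in for a real vision-LLM call, so the sketch runs offline.
def fake_transcribe(image: bytes) -> str:
    return f"# Page with {len(image)} bytes\n"


markdown = pages_to_markdown([b"abc", b"de"], fake_transcribe)
print(markdown)
```

Passing the transcriber in as a function keeps the loop independent of any one provider, which is roughly why swapping GPT-4o for Claude is a configuration change rather than a rewrite.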

Code Example

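import asyncio
import os

from pyzerox import zerox  # the module is `pyzerox`; the pip package is `py-zerox`

# The Python package reads provider keys from environment variables
os.environ["OPENAI_API_KEY"] = "sk-..."


async def main():
    # Convert a PDF to Markdown (one vision-model call per page)
    result = await zerox(file_path="invoice.pdf", model="gpt-4o")
    print(result.pages[0].content)  # Markdown for the first page


# zerox() is a coroutine, so it has to run inside an event loop
asyncio.run(main())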
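The example above prints only the first page. If you want the whole document as one Markdown string, a small helper along these lines should do it. This is a sketch: `join_pages` is a hypothetical name, and it assumes only what the example already relies on, namely that the result has a `pages` list whose items expose `content`. A stand-in object is used here so the snippet runs without calling any API.

```python
from types import SimpleNamespace


def join_pages(result, separator="\n\n---\n\n"):
    """Concatenate the Markdown of every page in a Zerox-style result."""
    return separator.join(page.content.strip() for page in result.pages)


# Stand-in with the same shape as the result used above
# (an object with .pages, each page having .content).
demo = SimpleNamespace(pages=[
    SimpleNamespace(content="# Invoice\n"),
    SimpleNamespace(content="| Item | Price |\n|---|---|\n| Pen | 2 |\n"),
])

full_markdown = join_pages(demo)
print(full_markdown)
```

The `---` separator renders as a horizontal rule between pages; pass `separator="\n\n"` if you would rather have the pages flow together.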

Real-World Use Cases

1. Table Extraction (The Obvious One)

Invoices, financial reports, data sheets — anything with structured data in PDF. Zerox preserves the table structure that traditional OCR destroys.
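Once the table arrives as Markdown, getting it into rows you can work with takes only a few lines. The parser below is a rough sketch for simple pipe tables. `parse_markdown_table` is my own hypothetical helper, not part of Zerox, and it does not handle escaped pipes or empty cells at the edges of a row.

```python
from typing import Dict, List


def parse_markdown_table(markdown: str) -> List[Dict[str, str]]:
    """Parse a simple pipe table into a list of {header: cell} dicts."""
    lines = [ln.strip() for ln in markdown.splitlines() if ln.strip().startswith("|")]
    if len(lines) < 2:
        return []

    def split_row(line: str) -> List[str]:
        # Drop the outer pipes, then split on the inner ones.
        return [cell.strip() for cell in line.strip("|").split("|")]

    headers = split_row(lines[0])
    rows = []
    for line in lines[2:]:  # lines[1] is the |---|---| separator row
        rows.append(dict(zip(headers, split_row(line))))
    return rows


sample = """
| Item | Qty | Price |
|------|-----|-------|
| Pen  | 3   | 1.50  |
| Ink  | 1   | 7.00  |
"""

rows = parse_markdown_table(sample)
print(rows[0]["Price"])
```

Cells come back as strings, so convert numeric columns yourself, and spot-check them against the source document, since a vision model can misread digits.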

2. Handwritten Notes → Digital

I have a notebook full of meeting notes. Zerox turned them into searchable Markdown. Not perfect (handwriting is hard), but way better than Tesseract's attempts.

3. Screenshot → Documentation

Taking screenshots of UI dashboards and turning them into documentation? Zerox handles this well — it understands the visual hierarchy, not just the text.

4. PDF Books → Knowledge Base

For technical PDFs you want to ingest into a personal wiki or knowledge base, Zerox produces cleaner output than traditional PDF text extractors.

Limitations

Slow: each page is a separate API call, so a 50-page PDF means 50 model calls.
Cost: every one of those calls is billed by the API provider; GPT-4o vision isn't free.
Overkill for plain text: if your PDF is just paragraphs of text, traditional extraction is faster and free.
Vision model dependent: output quality varies by model.
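The per-page latency is easier to live with if several pages are in flight at once. The sketch below shows the general pattern with `asyncio`, capping concurrency with a semaphore so you stay under provider rate limits. It illustrates the technique rather than how Zerox schedules its calls: `transcribe_all` and `fake_transcribe_async` are hypothetical names, and the default limit of 4 is an arbitrary assumption.

```python
import asyncio
from typing import Awaitable, Callable, List


async def transcribe_all(
    page_images: List[bytes],
    transcribe_page: Callable[[bytes], Awaitable[str]],
    max_concurrent: int = 4,
) -> List[str]:
    """Run per-page calls concurrently, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(image: bytes) -> str:
        async with semaphore:  # cap the number of in-flight requests
            return await transcribe_page(image)

    # gather() returns results in input order, so page order is preserved.
    return await asyncio.gather(*(worker(img) for img in page_images))


# Hypothetical stand-in for a slow API call, so the sketch runs offline.
async def fake_transcribe_async(image: bytes) -> str:
    await asyncio.sleep(0.01)
    return image.decode().upper()


pages = asyncio.run(
    transcribe_all([b"a", b"b", b"c"], fake_transcribe_async, max_concurrent=2)
)
print(pages)
```

Concurrency shortens the wait but not the bill: the number of calls stays the same.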

FAQ

Q: Is Zerox free?
A: The tool is open-source (MIT). But it calls paid vision APIs (GPT-4o, Claude).

Q: Can it handle Chinese/Japanese?
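A: Generally, yes. Vision LLMs read the page as an image rather than through a language-specific OCR engine, so there is no language pack to install. Accuracy still depends on how well the model you pick handles that language.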

Q: Privacy concerns?
A: Your images go to the API provider. For sensitive docs, you'd want a local vision model.

Should You Use It?

Use Zerox when: You have complex PDFs with tables, handwriting, or mixed formatting that traditional OCR butchers.

Skip it when: You have clean text-only PDFs and want speed for free.

For me? I keep both. Tesseract for quick text. Zerox for the hard stuff.
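If you want to automate that split, one rough heuristic is to look at how much text the PDF already carries in its embedded text layer. The function below is a sketch under stated assumptions: `needs_vision_ocr` is a hypothetical helper, the 200-characters-per-page threshold is a guess you should tune, and you have to supply the text layer yourself from whatever PDF text extractor you already use.

```python
def needs_vision_ocr(
    text_layer: str,
    page_count: int,
    min_chars_per_page: int = 200,  # assumed threshold, tune for your documents
) -> bool:
    """Guess whether a PDF needs vision OCR based on its embedded text layer."""
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    # Count non-whitespace characters only.
    chars = len("".join(text_layer.split()))
    return chars / page_count < min_chars_per_page


# A scanned PDF usually has little or no text layer.
print(needs_vision_ocr("", page_count=3))
# A born-digital PDF with plenty of extractable text.
print(needs_vision_ocr("word " * 500, page_count=2))
```

This only catches scans and image-only pages. A born-digital PDF with complex tables will pass the check and may still come out better through a vision model.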

Have you tried AI-based OCR? Drop your experience in the comments.
