DEV Community

TengLongAI2026

Zerox: Drop an Image, Get Markdown — The AI OCR That Actually Understands

Summary

Traditional OCR reads characters. Zerox understands content. It uses vision LLMs (GPT-4o, Claude) to turn any image or PDF into clean Markdown — preserving tables, code blocks, handwriting, and layout. Think of it as "OCR 2.0."


The Problem OCR Never Solved

I spent an afternoon last week trying to extract a table from a scanned PDF.

First, I tried Tesseract. The text came through okay, but the table? Complete chaos. Columns merged. Rows shifted. Numbers ended up in the wrong cells. I spent another 30 minutes manually realigning everything.

This is the dirty secret of traditional OCR: it reads characters, but it doesn't understand structure.

Then I tried Zerox. Same PDF. Same table. Output was a perfect Markdown table — columns aligned, rows intact, numbers where they belonged.

| | Traditional OCR | Zerox (AI Vision OCR) |
|---|---|---|
| Engine | Character recognition | Vision LLM (GPT-4o/Claude) |
| Tables | Often broken | Structure preserved |
| Handwriting | Poor | Good |
| Layout | Needs extra analysis | Naturally understood |
| Speed | Fast (local) | Slow (API call per page) |
| Cost | Free | API costs apply |

What Is Zerox?

Zerox is a Python tool that uses vision-capable AI models to convert PDFs and images into Markdown.

pip install py-zerox

Its flow is dead simple:

PDF/Image → Split into pages → Send each page to vision LLM → Return Markdown
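To make that flow concrete, here is a minimal sketch of the same loop in plain Python. It is not Zerox's actual implementation: `convert_document` and `fake_vision_model` are hypothetical names, and the model call is a stand-in that returns canned Markdown so the example runs offline.

```python
from typing import Callable, List


def convert_document(pages: List[bytes], describe_page: Callable[[bytes], str]) -> str:
    """Send each page image to a vision model and join the Markdown results."""
    markdown_pages = []
    for page_image in pages:
        # One model call per page, which is why long documents are slow.
        markdown_pages.append(describe_page(page_image))
    return "\n\n".join(markdown_pages)


def fake_vision_model(page_image: bytes) -> str:
    """Hypothetical stand-in for a real vision-model API call."""
    return f"# Page with {len(page_image)} bytes"


# Two fake "page images" stand in for the pages split out of a PDF.
markdown = convert_document([b"abc", b"defgh"], fake_vision_model)
print(markdown)
```

Swapping `fake_vision_model` for a function that calls a real vision API gives the general shape of the pipeline.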

Code Example

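import asyncio
import os

from pyzerox import zerox  # the py-zerox package is imported as "pyzerox"

# py-zerox reads the API key from the environment rather than a credentials argument
os.environ["OPENAI_API_KEY"] = "sk-..."


async def main():
    # Convert a PDF to Markdown, one vision-model call per page
    result = await zerox(file_path="invoice.pdf", model="gpt-4o")
    print(result.pages[0].content)  # Markdown for the first page


asyncio.run(main())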
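The example prints only the first page. To keep the whole document, a small helper can join every page into one file. This is a sketch that assumes only the `result.pages[i].content` shape shown above; `save_markdown` is a hypothetical helper, and the `demo` object is a stand-in built by hand so the code runs without an API key.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def save_markdown(result, out_path):
    """Join each page's Markdown with blank lines and write one file."""
    combined = "\n\n".join(page.content for page in result.pages)
    Path(out_path).write_text(combined, encoding="utf-8")
    return combined


# Stand-in for a real result, built only to show the shape (hypothetical data).
demo = SimpleNamespace(pages=[
    SimpleNamespace(content="# Invoice 001"),
    SimpleNamespace(content="| Item | Total |\n|---|---|\n| Widget | 19.00 |"),
])

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "invoice.md"
    combined = save_markdown(demo, out_file)
    written = out_file.read_text(encoding="utf-8")
```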

Real-World Use Cases

1. Table Extraction (The Obvious One)

Invoices, financial reports, data sheets — anything with structured data in PDF. Zerox preserves the table structure that traditional OCR destroys.
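Once the table arrives as Markdown, getting it into usable data is a short step. Below is a minimal sketch of a parser for simple pipe tables. `parse_markdown_table` is a hypothetical helper, not part of Zerox, the sample invoice rows are made up, and it does not handle escaped pipes inside cells.

```python
from typing import Dict, List


def parse_markdown_table(markdown: str) -> List[Dict[str, str]]:
    """Turn a simple pipe table into a list of row dicts keyed by header."""
    lines = [ln.strip() for ln in markdown.strip().splitlines()
             if ln.strip().startswith("|")]
    if len(lines) < 2:
        return []

    def cells(line: str) -> List[str]:
        # Drop the outer pipes, then split on the inner ones.
        return [c.strip() for c in line.strip("|").split("|")]

    header = cells(lines[0])
    # lines[1] is the |---|---| separator row, so data starts at lines[2].
    return [dict(zip(header, cells(line))) for line in lines[2:]]


sample = """
| Item | Qty | Price |
|------|-----|-------|
| Widget | 2 | 9.50 |
| Gadget | 1 | 20.00 |
"""
rows = parse_markdown_table(sample)
```

Every value comes back as a string, so numeric columns still need converting before any arithmetic.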

2. Handwritten Notes → Digital

I have a notebook full of meeting notes. Zerox turned them into searchable Markdown. Not perfect (handwriting is hard), but way better than Tesseract's attempts.

3. Screenshot → Documentation

Taking screenshots of UI dashboards and turning them into documentation? Zerox handles this well — it understands the visual hierarchy, not just the text.

4. PDF Books → Knowledge Base

For technical PDFs you want to ingest into a personal wiki or knowledge base, Zerox produces cleaner output than traditional PDF text extractors.
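For a knowledge base, the Markdown usually gets split into sections before ingestion. The sketch below shows one simple way to do that, cutting at heading lines. `split_by_headings` is a hypothetical helper, and it would misread a `#` comment inside a code block as a heading, so treat it as a starting point.

```python
import re
from typing import List, Tuple


def split_by_headings(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) sections at lines starting with '#'."""
    sections = []
    title, body = "", []
    for line in markdown.splitlines():
        if re.match(r"#{1,6}\s", line):
            # Close the previous section before starting a new one.
            if title or any(s.strip() for s in body):
                sections.append((title, "\n".join(body).strip()))
            title, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    if title or any(s.strip() for s in body):
        sections.append((title, "\n".join(body).strip()))
    return sections


doc = "# Intro\nHello\n\n## Setup\nInstall it\n"
sections = split_by_headings(doc)
```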


Limitations

  1. Slow — Each page makes a separate API call. A 50-page PDF takes a while.
  2. Cost — GPT-4o vision isn't free.
  3. Overkill for plain text — If your PDF is just paragraphs of text, traditional extraction is faster and free.
  4. Vision model dependent — Quality varies by model.
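The first two limitations are easy to estimate before committing to a large job. Here is a back-of-the-envelope sketch: `estimate_run` is a hypothetical helper, and the 6 seconds and $0.01 per page are placeholder figures, not measured numbers. Substitute your own from a small test run.

```python
import math


def estimate_run(pages, seconds_per_page, cost_per_page, concurrency=1):
    """Rough wall-clock time and API cost for a page-per-call conversion."""
    # Pages run in parallel batches, so time scales with the number of batches.
    batches = math.ceil(pages / concurrency)
    return batches * seconds_per_page, pages * cost_per_page


# Placeholder figures for a 50-page PDF with 5 pages in flight at a time.
seconds, dollars = estimate_run(50, seconds_per_page=6.0, cost_per_page=0.01, concurrency=5)
print(f"about {seconds:.0f} s and ${dollars:.2f}")
```

Concurrency shortens the wait but not the bill: cost depends only on the page count.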

FAQ

Q: Is Zerox free?
A: The tool is open-source (MIT). But it calls paid vision APIs (GPT-4o, Claude).

Q: Can it handle Chinese/Japanese?
A: Generally, yes. Vision LLMs read the page image directly instead of relying on language-specific OCR engines, though accuracy can vary by language and model.

Q: Privacy concerns?
A: Your images go to the API provider. For sensitive docs, you'd want a local vision model.


Should You Use It?

Use Zerox when: You have complex PDFs with tables, handwriting, or mixed formatting that traditional OCR butchers.

Skip it when: You have clean text-only PDFs and want speed for free.

For me? I keep both. Tesseract for quick text. Zerox for the hard stuff.
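Keeping both tools means deciding per page which one to use. A rough way to automate that choice is sketched below. `pick_tool` is a hypothetical helper, the 200-character threshold is an arbitrary assumption, and `text_layer` is whatever a conventional extractor returned for the page.

```python
def pick_tool(text_layer: str, has_tables: bool = False, min_chars: int = 200) -> str:
    """Return 'zerox' for pages that look scanned or table-heavy, else 'plain'."""
    # A nearly empty text layer usually means the page is a scanned image.
    looks_scanned = len(text_layer.strip()) < min_chars
    return "zerox" if (looks_scanned or has_tables) else "plain"


choices = [
    pick_tool(""),                          # scanned page, no text layer
    pick_tool("x" * 500),                   # clean text-only page
    pick_tool("x" * 500, has_tables=True),  # text page with a table
]
print(choices)
```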

Have you tried AI-based OCR? Drop your experience in the comments.
