Summary
Traditional OCR reads characters. Zerox understands content. It uses vision LLMs (GPT-4o, Claude) to turn any image or PDF into clean Markdown — preserving tables, code blocks, handwriting, and layout. Think of it as "OCR 2.0."
The Problem OCR Never Solved
I spent an afternoon last week trying to extract a table from a scanned PDF.
First, I tried Tesseract. The text came through okay, but the table? Complete chaos. Columns merged. Rows shifted. Numbers ended up in the wrong cells. I spent another 30 minutes manually realigning everything.
This is the dirty secret of traditional OCR: it reads characters, but it doesn't understand structure.
Then I tried Zerox. Same PDF. Same table. Output was a perfect Markdown table — columns aligned, rows intact, numbers where they belonged.
| | Traditional OCR | Zerox (AI Vision OCR) |
|---|---|---|
| Engine | Character recognition | Vision LLM (GPT-4o/Claude) |
| Tables | Often broken | Structure preserved |
| Handwriting | Poor | Good |
| Layout | Needs extra analysis | Naturally understood |
| Speed | Fast (local) | Slow (API call per page) |
| Cost | Free | API costs apply |
What Is Zerox?
Zerox is a Python tool that uses vision-capable AI models to convert PDFs and images into Markdown.
pip install py-zerox
Its flow is dead simple:
PDF/Image → Split into pages → Send each page to vision LLM → Return Markdown
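To make that flow concrete, here is a minimal sketch of the per-page loop. It is not Zerox's actual implementation: `convert_document` and `fake_transcribe` are hypothetical names, and the "model call" is a stub so the snippet runs offline.

```python
from typing import Callable, List


def convert_document(pages: List[bytes], transcribe_page: Callable[[bytes], str]) -> str:
    """Run the per-page flow: one model call per page, results joined in order."""
    markdown_pages = []
    for page_image in pages:
        # In a real tool, this is where the page image goes to the vision LLM.
        markdown_pages.append(transcribe_page(page_image).strip())
    # Separate pages with a blank line so Markdown blocks do not run together.
    return "\n\n".join(markdown_pages)


def fake_transcribe(page_image: bytes) -> str:
    # Hypothetical stand-in for a vision-model call; it only reports the size.
    return f"Page with {len(page_image)} bytes"


document = convert_document([b"abc", b"defgh"], fake_transcribe)
print(document)
```

The useful thing to notice is that the work scales linearly with page count, which is where the speed and cost trade-offs later in this post come from.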
Code Example
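import asyncio
import os

from pyzerox import zerox  # the py-zerox package is imported as "pyzerox"

# Credentials come from the environment rather than a function argument
os.environ["OPENAI_API_KEY"] = "sk-..."

async def main():
    # Convert a PDF to Markdown
    result = await zerox(file_path="invoice.pdf", model="gpt-4o")
    print(result.pages[0].content)  # Clean Markdown

asyncio.run(main())  # zerox() is a coroutine, so it needs an event loop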
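The example above prints only the first page. If you want the whole document as one Markdown string, a small helper like the one below works on anything with the same `.pages[i].content` shape. `join_pages` is my own hypothetical helper, not part of the library, and the `SimpleNamespace` object is a stand-in so the snippet runs without an API key.

```python
from types import SimpleNamespace


def join_pages(result) -> str:
    """Concatenate every page's Markdown, separated by a horizontal rule."""
    return "\n\n---\n\n".join(page.content.strip() for page in result.pages)


# Stand-in object mimicking the result shape used in the example above.
fake_result = SimpleNamespace(
    pages=[
        SimpleNamespace(content="# Invoice\n"),
        SimpleNamespace(content="| Item | Price |\n"),
    ]
)

combined = join_pages(fake_result)
print(combined)
```

The `---` rule between pages is just one choice; a blank line works too if you don't care where page breaks fell.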
Real-World Use Cases
1. Table Extraction (The Obvious One)
Invoices, financial reports, data sheets — anything with structured data in PDF. Zerox preserves the table structure that traditional OCR destroys.
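Once the table is Markdown, getting it into actual data is a few lines of Python. This is a rough sketch that assumes a simple pipe-delimited table with a header row and a separator row; `parse_markdown_table` is a hypothetical helper, and it does not handle escaped pipes inside cells.

```python
from typing import Dict, List


def parse_markdown_table(markdown: str) -> List[Dict[str, str]]:
    """Turn a simple pipe-delimited Markdown table into a list of row dicts."""
    lines = [line.strip() for line in markdown.strip().splitlines() if line.strip()]
    if len(lines) < 2:
        return []

    def split_row(line: str) -> List[str]:
        # Drop the outer pipes, then split on the inner ones.
        return [cell.strip() for cell in line.strip("|").split("|")]

    header = split_row(lines[0])
    rows = []
    for line in lines[2:]:  # lines[1] is the |---|---| separator row
        rows.append(dict(zip(header, split_row(line))))
    return rows


sample = """
| Item | Qty | Price |
|---|---|---|
| Widget | 2 | 9.50 |
| Gadget | 1 | 20.00 |
"""

rows = parse_markdown_table(sample)
total = sum(int(row["Qty"]) * float(row["Price"]) for row in rows)
print(rows[0], total)
```

All values come back as strings, so convert the numeric columns yourself, and spot-check them against the source PDF before trusting any totals.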
2. Handwritten Notes → Digital
I have a notebook full of meeting notes. Zerox turned them into searchable Markdown. Not perfect (handwriting is hard), but way better than Tesseract's attempts.
3. Screenshot → Documentation
Taking screenshots of UI dashboards and turning them into documentation? Zerox handles this well — it understands the visual hierarchy, not just the text.
4. PDF Books → Knowledge Base
For technical PDFs you want to ingest into a personal wiki or knowledge base, Zerox produces cleaner output than traditional PDF text extractors.
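For the knowledge-base case, you usually want the Markdown chopped into sections before ingesting it. Here is one way that could look, assuming the output uses `#`-style headings. `split_by_headings` is a hypothetical helper, and it will misread a `# comment` line inside a code block as a heading.

```python
import re
from typing import List, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> List[Tuple[str, str]]:
    """Return (heading, body) pairs; text before the first heading gets an empty title."""
    sections: List[Tuple[str, str]] = []
    title, body = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section if it has a title or any real text.
            if title or any(part.strip() for part in body):
                sections.append((title, "\n".join(body).strip()))
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    if title or any(part.strip() for part in body):
        sections.append((title, "\n".join(body).strip()))
    return sections


sample = "# Intro\nHello.\n\n## Setup\nInstall it.\nRun it.\n"
sections = split_by_headings(sample)
print(sections)
```

Each pair can then become one note or one wiki page, with the heading as its title.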
Limitations
- Slow: Each page makes a separate API call, so processing time grows with page count. A 50-page PDF means 50 calls.
- Cost: Every page is a billed vision API call, and GPT-4o vision isn't free.
- Overkill for plain text: If your PDF is just paragraphs of text, traditional extraction is faster and free.
- Vision model dependent: Quality varies by model.
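To get a feel for the first two points, a back-of-envelope estimate helps. The per-page figures below (6 seconds, $0.01) are placeholders I made up for illustration, not measurements; plug in your own numbers. `estimate_run` is a hypothetical helper, and the `concurrency` argument assumes you are able to send several pages at once.

```python
def estimate_run(pages: int, seconds_per_page: float, cost_per_page: float, concurrency: int = 1):
    """Back-of-envelope wall-clock time and spend for a page-per-call pipeline."""
    # Pages are processed in batches of `concurrency`; round the batch count up.
    batches = -(-pages // concurrency)
    return batches * seconds_per_page, pages * cost_per_page


# Placeholder figures, not measurements: 6 s and $0.01 per page.
sequential = estimate_run(50, seconds_per_page=6.0, cost_per_page=0.01)
parallel = estimate_run(50, seconds_per_page=6.0, cost_per_page=0.01, concurrency=5)
print(sequential, parallel)
```

Running pages in parallel shortens the wait but not the bill: the cost depends only on how many pages you send.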
FAQ
Q: Is Zerox free?
A: The tool is open-source (MIT). But it calls paid vision APIs (GPT-4o, Claude).
Q: Can it handle Chinese/Japanese?
A: Generally, yes. Vision LLMs read the page as an image rather than through a language-specific OCR engine, so no per-language setup is needed. Accuracy still depends on the model, so test it on your own documents.
Q: Privacy concerns?
A: Your images go to the API provider. For sensitive docs, you'd want a local vision model.
Should You Use It?
Use Zerox when: You have complex PDFs with tables, handwriting, or mixed formatting that traditional OCR butchers.
Skip it when: You have clean text-only PDFs and want speed for free.
For me? I keep both. Tesseract for quick text. Zerox for the hard stuff.
Have you tried AI-based OCR? Drop your experience in the comments.