OCR is back: replacing Tesseract with PP-OCRv5 in my document pipelines

#ai #automation #machinelearning #tooling

I've been wrangling OCR pipelines for years: Tesseract for plain text, Google Vision when CJK comes up, AWS Textract for tables. Each has its own pain: Tesseract drops handwritten characters, Vision gets pricey at scale, and Textract's bounding-box layout is opinionated.

Recently I've been quietly piping a lot of work through ScanRead.ai instead. It's a free OCR tool built on PP-OCRv5 and the new PaddleOCR-VL model. Here's what changed for me.

What it actually does

Image → text in 100+ languages (including Arabic, Japanese, Chinese, Hindi, Thai)
22 specialized tools: image-to-text, PDF-to-Word, screenshot-to-text, handwriting recognition, math-to-LaTeX, receipt OCR
Outputs to .txt, .md, or .docx — Markdown export is great for pipelines into Notion or Obsidian
Free tier is generous: 20 pages/day, no signup
Pro is $10/mo for 3,000 pages with batch (up to 20 files at once)
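
To make the pricing concrete, here's a quick back-of-the-envelope sketch. It uses only the plan numbers listed above; the 30-day month and the function names are my own assumptions, and the free-tier figure is an upper bound that only holds if your pages are spread evenly across days.

```python
# Plan numbers come from the list above; everything else is an assumption.
FREE_PAGES_PER_DAY = 20
PRO_PRICE_USD = 10.00
PRO_PAGES_PER_MONTH = 3_000
DAYS_PER_MONTH = 30  # assumption, for a rough estimate only


def free_pages_per_month(days: int = DAYS_PER_MONTH) -> int:
    """Upper bound on pages the free tier covers in a month."""
    return FREE_PAGES_PER_DAY * days


def pro_cost_per_page() -> float:
    """Effective Pro price per page if the full quota is used."""
    return PRO_PRICE_USD / PRO_PAGES_PER_MONTH


def cheapest_plan(pages_per_month: int) -> str:
    """Pick the smallest plan that covers a monthly page volume."""
    if pages_per_month <= free_pages_per_month():
        return "free"
    if pages_per_month <= PRO_PAGES_PER_MONTH:
        return "pro"
    return "over quota"


if __name__ == "__main__":
    print(f"Free tier ceiling: {free_pages_per_month()} pages/month")
    print(f"Pro: {pro_cost_per_page() * 100:.2f} cents/page at full quota")
    print(f"2,000 pages/month -> {cheapest_plan(2_000)}")
```

The per-page figure only applies if you actually use the whole quota; at lower volumes the effective Pro price per page goes up.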

Where it shined for me

Handwritten meeting notes. Tesseract gives me garbage on cursive. ScanRead reconstructed three pages of a colleague's whiteboard photos with maybe two errors per page. That's the difference between "useful" and "I'll just retype it."

CJK receipts. I had a folder of Japanese receipts to reconcile. PaddleOCR-VL handles vertical text and mixed kanji/kana way better than I expected — competitive with Google Vision in my spot-check, at zero cost.
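
If you want to put a number on a spot-check like this, one common approach is character error rate (CER): the edit distance between the OCR output and a hand-typed reference, divided by the reference length. The sketch below is a minimal, dependency-free version and not the exact procedure I followed; the function names are mine, and the NFKC normalization step is an assumption that suits CJK text, where full-width and half-width forms would otherwise count as errors.

```python
import unicodedata


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length, after NFKC normalization."""
    ref = unicodedata.normalize("NFKC", reference)
    hyp = unicodedata.normalize("NFKC", hypothesis)
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Toy strings, not real OCR output.
    print(char_error_rate("領収書 合計 1200円", "領収書 合計 1200丹"))
```

Run the same reference text against each engine's output and compare the rates. A handful of receipts is enough for a sanity check, not for a benchmark.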

Math → LaTeX. Pasting screenshots of equations from PDFs and getting back LaTeX source is the kind of small thing that saves a real amount of time over a week.

Where it's weaker

Layout reconstruction for complex multi-column PDFs is okay, but Textract is still better for forms with deeply nested tables.
The free tier is rate-limited per day, not per minute — fine for humans, awkward for batch jobs.
No public API yet (as of writing); Pro batch UI is the workaround.
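
Since there's no API, the most automation I can do is on my side of the upload. Here's a rough sketch of how a folder of scans could be split into groups of 20 files for the batch UI, plus a rough count of how many days the free tier would need for a given page total. The folder layout, file patterns, and function names are all hypothetical, and note that the batch limit is per file while the quotas are per page, so a multi-page PDF counts once in the first and many times in the second.

```python
from pathlib import Path
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

MAX_FILES_PER_BATCH = 20  # Pro batch limit quoted above
FREE_PAGES_PER_DAY = 20   # free-tier daily limit quoted above


def chunk(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive groups of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def plan_batches(folder: str,
                 patterns: Sequence[str] = ("*.png", "*.jpg", "*.pdf"),
                 ) -> List[List[Path]]:
    """Group the scans in `folder` into upload-sized batches."""
    files = sorted(p for pattern in patterns for p in Path(folder).glob(pattern))
    return list(chunk(files, MAX_FILES_PER_BATCH))


def days_on_free_tier(total_pages: int) -> int:
    """Days needed to push `total_pages` through the free daily limit."""
    return -(-total_pages // FREE_PAGES_PER_DAY)  # ceiling division


if __name__ == "__main__":
    for n, batch in enumerate(plan_batches("scans"), start=1):
        print(f"batch {n}: {len(batch)} files, starting at {batch[0].name}")
    print(f"45 pages on the free tier: {days_on_free_tier(45)} days")
```

The glob patterns are case-sensitive on most Linux filesystems, so add uppercase variants if your scanner writes `.PNG` or `.JPG`.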

Why I'm sharing

If you're paying for Vision/Textract for occasional OCR, try the free tier first. If you do batch scans, the $10/mo Pro plan undercuts both. Link: https://scanread.ai

Curious if anyone else has switched off Tesseract for handwriting. What's your stack?