Ashish Kumar

Posted on Apr 25

OCR in the Browser: How Tesseract.js Makes PDF Text Extraction Free

#javascript #webdev #machinelearning #productivity

You've got a 200-page PDF that someone scanned years ago. It's just images of pages — Cmd-F finds nothing. You need to extract the text, search through it, maybe paste a paragraph into a doc.

Five years ago, this meant a cloud OCR API at $1.50 per 1,000 pages, plus uploading your potentially-sensitive PDF to a third-party service. Now it means dropping the file into a tab and waiting two minutes. The thing that made the difference is Tesseract.js. Understanding what it does, where it shines, and where it falls short is worthwhile whether you're building a tool or just trying to get text out of a scan.

This post walks through how browser-based OCR actually works, what to expect from the open-source state of the art, and the engineering decisions that go into shipping it well.

What OCR is, briefly

Optical character recognition takes an image of text and produces actual text characters. Modern OCR engines do this in two stages:

1. Layout analysis: figure out where the text regions are on the page, in what order they should be read, and where lines and words break.
2. Character recognition: for each detected word, classify the visual pattern as one of the characters it could be.

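The recognition stage used to rely on rule-based image processing (look at the shape, match against templates). Modern engines, including Tesseract, use neural networks (LSTMs, mostly) trained on huge corpora of text in different fonts and conditions. Tesseract's LSTM engine reads a whole text line as a sequence rather than classifying isolated characters.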

On clean printed text, a modern OCR engine typically gets 95–99% of characters right. On handwriting, multi-column layouts, tables, or low-quality scans, that drops fast. We'll come back to that.
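Accuracy figures like these are usually character-level. One common way to measure it on your own documents is the character error rate: the edit distance between the OCR output and a hand-checked transcription, divided by the length of the transcription. The sketch below is a minimal version of that idea; the function names are illustrative and not part of Tesseract.js.

```javascript
// Levenshtein edit distance: minimum insertions, deletions, and substitutions
// needed to turn string a into string b.
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
    }
    prev = curr
  }
  return prev[b.length]
}

// Character error rate: edits needed per character of the reference text.
function characterErrorRate(ocrText, referenceText) {
  if (referenceText.length === 0) return ocrText.length === 0 ? 0 : 1
  return editDistance(ocrText, referenceText) / referenceText.length
}

// "rn" misread as "m" is a classic OCR confusion: 2 edits over 12 characters.
console.log(characterErrorRate('modem times', 'modern times').toFixed(3)) // 0.167
```

A "97% accurate" page corresponds to an error rate of about 0.03, which on a 2,000-character page is still around 60 characters to fix.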

Tesseract: decades of OCR, now in your browser

Tesseract is an OCR engine that's been around since 1985. HP developed it through the mid-1990s, then it sat largely untouched for about a decade until HP released it as open source in 2005 and Google took over sponsoring its development. Version 4, released in 2018, added an LSTM-based recognition engine, and it has kept improving since. It supports 100+ languages, runs as a command-line tool, and is the engine behind a huge fraction of OCR products you've used.

Tesseract.js is Tesseract compiled to WebAssembly. It's the same recognition engine as the desktop version, it runs in any modern browser, and it needs no server. The download is about 8MB compressed (engine plus a single language pack), it loads on demand, and it processes pages at roughly 1–3 seconds each on a typical laptop.

The basic usage is comically simple:

import Tesseract from 'tesseract.js'

const { data: { text } } = await Tesseract.recognize(
  imageOrCanvasOrUrl,
  'eng',
  { logger: m => console.log(m) }
)

console.log(text)

That's it. Pass an image, get back text. The logger is useful because OCR isn't instant — typical pages take a few seconds, and you want a progress bar.
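As a rough sketch of the progress-bar idea: the logger receives messages that carry a status string and a progress value between 0 and 1, so a small formatter is enough to drive a label. The helper name `progressLabel` and the `statusEl` element in the comment are hypothetical.

```javascript
// Turn a Tesseract.js logger message into a progress label.
// Messages look like { status: 'recognizing text', progress: 0.42 };
// earlier statuses cover loading the engine and the language data.
function progressLabel(message) {
  if (!message || typeof message.progress !== 'number') return null
  const clamped = Math.min(1, Math.max(0, message.progress))
  return `${message.status}: ${Math.round(clamped * 100)}%`
}

// Possible wiring (statusEl is a hypothetical DOM element):
// { logger: m => { const label = progressLabel(m); if (label) statusEl.textContent = label } }
```

Returning null for messages without a numeric progress lets the caller skip updates instead of flashing an empty label.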

The PDF-to-text pipeline

Tesseract operates on images, not PDFs. So the full pipeline for "OCR a scanned PDF" is:

1. Open the PDF (use pdf.js)
2. Render each page to a canvas
3. Pass the canvas to Tesseract.js
4. Concatenate the results

The skeleton looks like:

import * as pdfjs from 'pdfjs-dist'
import Tesseract from 'tesseract.js'

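// Assumes pdfjs.GlobalWorkerOptions.workerSrc already points at the pdf.js worker script for your build
async function ocrPdf(file) {
  const buffer = await file.arrayBuffer()
  const pdf = await pdfjs.getDocument({ data: buffer }).promise
  const pages = []

  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i)
    const viewport = page.getViewport({ scale: 2 })  // 2x for OCR accuracy

    // Render the page into an off-screen canvas sized to the scaled viewport
    const canvas = document.createElement('canvas')
    canvas.width = Math.ceil(viewport.width)
    canvas.height = Math.ceil(viewport.height)
    await page.render({ canvasContext: canvas.getContext('2d'), viewport }).promise

    // One-shot recognize: simple, but it starts a fresh worker for every page
    const { data: { text } } = await Tesseract.recognize(canvas, 'eng')
    pages.push(text)

    // Release the canvas backing store before moving on to the next page
    canvas.width = 0
    canvas.height = 0
  }

  return pages.join('\n\n')
}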

A few details that matter:

Render at 2× or 3× scale. OCR accuracy correlates strongly with resolution. At scale 1, pdf.js renders one pixel per PDF point, which works out to 72 DPI and is usually too low; 2× gives 144 DPI and 3× gives 216 DPI, and the difference is noticeable.
Process pages sequentially. Tesseract.js can run multiple workers in parallel, but each worker holds its own copy of the engine and language data (several megabytes), so on memory-constrained devices, sequential is safer.
Show progress. OCR is slow. A 50-page document at 2 seconds/page is 100 seconds — without a progress indicator, users think the page froze.
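The scale and progress points above come down to a little arithmetic. The sketch below assumes a target of about 300 DPI, which is the figure commonly recommended for Tesseract input, and an arbitrary pixel budget to keep canvases from getting huge; both defaults and both function names are assumptions for illustration.

```javascript
const PDF_POINTS_PER_INCH = 72 // PDF user space: 1 point = 1/72 inch

// Scale factor for page.getViewport({ scale }) that yields roughly targetDpi,
// capped so the rendered canvas stays under maxPixels (an assumed memory budget).
function scaleForDpi(pageWidthPt, pageHeightPt, targetDpi = 300, maxPixels = 16e6) {
  const wanted = targetDpi / PDF_POINTS_PER_INCH
  const largest = Math.sqrt(maxPixels / (pageWidthPt * pageHeightPt))
  return Math.min(wanted, largest)
}

// Overall progress for a multi-page job: finished pages plus the
// fraction (0..1) of the page currently being recognized.
function overallProgress(pageIndex, pageCount, pageProgress) {
  return (pageIndex + pageProgress) / pageCount
}

// US Letter is 612 x 792 points, so 300 DPI needs a scale of about 4.17.
console.log(scaleForDpi(612, 792).toFixed(2))
```

The page size in points is available from `page.getViewport({ scale: 1 })`, so the scale can be chosen per page rather than hard-coded.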

Web Workers for the UI

Tesseract.recognize looks like a plain function call, but Tesseract.js already does the recognition in a Web Worker, so the OCR itself doesn't block the main thread. The catch is that the one-shot form creates a worker, loads the language data, and tears the worker down again on every call. You can instead create a worker once and reuse it:

const worker = await Tesseract.createWorker('eng')
for (const canvas of canvases) {
  const { data: { text } } = await worker.recognize(canvas)
  // ...
}
await worker.terminate()

Reusing one worker for multiple pages avoids repeated language-data loading. For batches, this can be 3–5× faster than the simple form above.
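One thing the short loop above leaves open is cleanup when a page fails partway through. A minimal sketch of a batch helper follows; `recognizeAll` and its `onPage` callback are hypothetical names, and the worker is passed in so the helper works with whatever `Tesseract.createWorker` returns.

```javascript
// OCR a batch of images with one shared worker, always releasing it afterwards.
// `worker` is anything with recognize(image) and terminate(), such as the
// object returned by Tesseract.createWorker('eng').
async function recognizeAll(worker, images, onPage = () => {}) {
  const texts = []
  try {
    for (let i = 0; i < images.length; i++) {
      const { data: { text } } = await worker.recognize(images[i])
      texts.push(text)
      onPage(i + 1, images.length) // hook for a "page 3 of 50" indicator
    }
  } finally {
    await worker.terminate() // runs even if a page throws
  }
  return texts
}
```

The try/finally matters because a leaked worker keeps its WebAssembly memory alive until the tab closes.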

Language packs

Tesseract supports 100+ languages, but each language is a separate trained data file (5–25MB compressed). You don't bundle them — you download on demand:

const worker = await Tesseract.createWorker(['eng', 'spa'])  // English + Spanish

For a multilingual OCR app, the data-loading strategy matters. Don't ship all 100 language packs in your bundle; let users select languages and lazy-load.
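One way to lazy-load is to cache a worker per language combination, so a language pack is only fetched the first time a user picks it. This is a sketch under assumptions: `makeWorkerCache` is a hypothetical helper, and the worker factory is injected (in an app it would be `Tesseract.createWorker`).

```javascript
// Cache one worker per language combination.
// `createWorker` is injected; it receives an array of language codes.
function makeWorkerCache(createWorker) {
  const cache = new Map()
  return function getWorker(langs) {
    const list = [...new Set(langs)] // drop duplicates, keep the caller's order
    if (list.length === 0) throw new Error('at least one language code is required')
    const key = list.join('+')
    if (!cache.has(key)) {
      const pending = Promise.resolve().then(() => createWorker(list))
      // Forget failed loads so a later call can retry the download
      pending.catch(() => cache.delete(key))
      cache.set(key, pending)
    }
    return cache.get(key) // a promise, shared by concurrent callers
  }
}
```

Caching the promise rather than the resolved worker means two clicks in quick succession share one download. The language order is kept as given because it can affect how Tesseract weighs the languages.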

A few practical notes:

English is by far the best-tuned. CJK languages (Chinese, Japanese, Korean) work but are slower and slightly less accurate.
Mixed-language documents are tricky. Tesseract supports passing multiple languages, but it tries to apply all of them to every word — this is slower and sometimes less accurate than running it in single-language mode.
Math and code are recognized poorly. OCR engines were trained on natural-language text. Variable names, equations, and code samples often come out scrambled.

Where OCR falls down

Even with a good engine, some inputs reliably cause trouble:

Low-resolution scans. Anything below 200 DPI gets unreliable, and below 150 DPI accuracy can drop to 70–80%. If your input is a phone photo of a printed page taken in dim light, OCR will struggle.

Multi-column layouts. Tesseract has a layout analyzer, but it sometimes reads across columns instead of down them, producing scrambled output. Newspapers and academic papers are the classic problem cases.
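If you already know the layout, one workaround is to recognize each column on its own using the `rectangle` option of `recognize`. The sketch below assumes equal-width columns and a known column count, which real documents often violate; `columnRectangles` and `recognizeColumns` are hypothetical helpers.

```javascript
// Split a page of the given pixel size into equal-width column rectangles,
// in left-to-right reading order. Each rectangle has the shape Tesseract.js
// accepts as recognize(image, { rectangle }).
function columnRectangles(pageWidth, pageHeight, columnCount, gutter = 0) {
  const columnWidth = (pageWidth - gutter * (columnCount - 1)) / columnCount
  return Array.from({ length: columnCount }, (_, i) => ({
    left: Math.round(i * (columnWidth + gutter)),
    top: 0,
    width: Math.round(columnWidth),
    height: pageHeight,
  }))
}

// Recognize each column separately and join the results in reading order.
async function recognizeColumns(worker, image, rectangles) {
  const parts = []
  for (const rectangle of rectangles) {
    const { data: { text } } = await worker.recognize(image, { rectangle })
    parts.push(text.trim())
  }
  return parts.join('\n\n')
}
```

This trades automatic layout analysis for a fixed assumption, so it fits batches of same-template documents better than arbitrary uploads.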

Tables. Tesseract can extract the text from a table, but it loses the structure. You get a flat stream of cell contents in some order. For real tabular data extraction you need a different tool entirely (or a model fine-tuned on table layouts).

Handwriting. Out of scope for Tesseract. Use a model trained for handwriting (Google Cloud Vision, AWS Textract, or specialized libraries). The accuracy gap is enormous.

Low-contrast or skewed pages. Pre-processing helps a lot. Convert to grayscale, increase contrast, deskew. There are JavaScript libraries (opencv.js, cv-tools) that do these transformations in the browser before OCR.
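For the grayscale and contrast steps you may not need a library at all. Here is a minimal sketch that works directly on canvas pixel data; it uses a simple min/max contrast stretch, which is one choice among several, and it does not attempt deskewing. The function name is hypothetical.

```javascript
// Convert RGBA pixels to grayscale and stretch contrast so the darkest pixel
// becomes 0 and the lightest 255. Modifies the array in place; alpha is untouched.
function grayscaleAndStretch(pixels) {
  let min = 255
  let max = 0
  for (let i = 0; i < pixels.length; i += 4) {
    // Rec. 601 luma weights; stash the gray value in the red channel for pass two
    const gray = Math.round(0.299 * pixels[i] + 0.587 * pixels[i + 1] + 0.114 * pixels[i + 2])
    pixels[i] = gray
    if (gray < min) min = gray
    if (gray > max) max = gray
  }
  for (let i = 0; i < pixels.length; i += 4) {
    // A completely flat image has nothing to stretch, so keep its gray value
    const value = max === min ? pixels[i] : Math.round(((pixels[i] - min) / (max - min)) * 255)
    pixels[i] = pixels[i + 1] = pixels[i + 2] = value
  }
  return pixels
}

// In the browser, applied to a canvas before OCR:
// const image = ctx.getImageData(0, 0, canvas.width, canvas.height)
// grayscaleAndStretch(image.data)
// ctx.putImageData(image, 0, 0)
```

A global stretch like this helps faded scans but not uneven lighting, where an adaptive threshold from a library such as opencv.js is the better fit.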

Forms and structured documents. OCR gives you text. It doesn't tell you "this string is the patient's date of birth and this one is the diagnosis." For structured extraction, you need OCR + a separate parsing step (regex, NER models, or templated extraction).
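The regex route for that parsing step can start very small. The sketch below pulls date-like strings and "label: value" fields out of OCR text; the patterns and function names are illustrative assumptions, and real forms will need tolerance for OCR misreads (for example `O` for `0`).

```javascript
// Matches ISO dates (2021-03-14) and slash dates (14/03/2021 or 3/14/21).
// It does not decide whether a slash date is day-first or month-first.
const DATE_PATTERN = /\b(\d{4}-\d{2}-\d{2}|\d{1,2}\/\d{1,2}\/(?:\d{4}|\d{2}))\b/g

function extractDates(ocrText) {
  return ocrText.match(DATE_PATTERN) || []
}

// Find the value that follows a label on the same line,
// e.g. "Date of birth: 14/03/1984". Matching is case-insensitive.
function fieldAfterLabel(ocrText, label) {
  const escaped = label.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
  const match = ocrText.match(new RegExp(`${escaped}[ \\t]*[:\\-]?[ \\t]*(.+)`, 'i'))
  return match ? match[1].trim() : null
}
```

For example, `fieldAfterLabel(text, 'Date of birth')` returns the rest of that line, which can then be validated with `extractDates`.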

Cloud OCR vs Tesseract.js — when to pick which

Cloud OCR services (Google Cloud Vision, AWS Textract, Azure Computer Vision) are still better than Tesseract on hard cases. They handle handwriting, complex tables, multilingual documents, and edge cases that Tesseract struggles with. They're trained on far more data.

But Tesseract.js wins on three axes:

Privacy — files never leave the browser
Cost — free, no per-page pricing, no API keys
Simplicity — no signup, no auth, no rate limits

The decision rule: for printed-text OCR where you control the inputs (PDFs, screenshots, document scans), Tesseract.js is good enough most of the time. For high-stakes accuracy on edge-case inputs (handwritten forms, mixed handwriting/print, low-quality phone photos of receipts), use a cloud API.

A practical caveat: PDFs that already have text

A surprising fraction of "scanned" PDFs actually have text in them — they were scanned, then put through an OCR pass at the printer or by another tool, and the text is embedded but invisible because the visual layer is the scan. Before running expensive OCR, check:

const page = await pdf.getPage(1)
const textContent = await page.getTextContent()
if (textContent.items.length > 0) {
  // PDF already has text — skip OCR
}

Saves a lot of CPU when the work is already done.
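A slightly more defensive variant checks every page and counts actual characters, since a stray page number or stamp can make `items.length > 0` true on a page that still needs OCR. This is a sketch: the 50-character threshold is an assumed value to tune, and the helper names are hypothetical.

```javascript
// True if the page's embedded text has at least minChars non-whitespace characters.
// `page` is a pdf.js page (or anything with getTextContent()).
async function pageHasText(page, minChars = 50) {
  const { items } = await page.getTextContent()
  let chars = 0
  for (const item of items) {
    chars += (item.str || '').replace(/\s/g, '').length
  }
  return chars >= minChars
}

// Return the 1-based page numbers that lack a usable text layer,
// so only those pages go through OCR.
async function pagesNeedingOcr(pdf, minChars = 50) {
  const needOcr = []
  for (let n = 1; n <= pdf.numPages; n++) {
    const page = await pdf.getPage(n)
    if (!(await pageHasText(page, minChars))) needOcr.push(n)
  }
  return needOcr
}
```

Deciding per page also handles mixed documents, such as a typed report with a few scanned appendices.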

Where to try OCR right now

For a one-off (you have a scanned PDF, you need the text, you don't want to write code), imagetools.renderlog.in and pdftools.renderlog.in both run Tesseract.js client-side. The PDF tool's OCR feature handles the pdf.js → canvas → Tesseract pipeline described above. Drop a PDF in, get text out, file never leaves the browser.

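For the actual implementation in your own app, the Tesseract.js GitHub repo is well-documented. Check which major version you install: the createWorker('eng') form used in this post is the v5-style API, and older releases needed separate loadLanguage and initialize calls.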

TL;DR

Tesseract.js puts a battle-tested OCR engine in any modern browser at no per-page cost.
The pipeline for OCRing a PDF is: pdf.js renders pages to canvases → Tesseract.js recognizes each canvas → concatenate the text.
Render at 2–3× scale for accuracy. Use Web Workers (built in) to keep the UI responsive.
Pre-check if PDFs already have a text layer before running OCR; many do.
Tesseract is excellent on clean printed text, weak on handwriting, tables, and very low-quality inputs.
For privacy-sensitive documents (medical, legal, contracts), client-side OCR removes the cloud-API trust problem entirely.

Browser-based OCR went from "tech demo" to "ship this in production" in about three years. If you're still uploading scans to a paid API for printed-text extraction, it's worth a re-evaluation.

If this was useful, I've also built a handful of other free, browser-based tools — no signup, no uploads, everything runs client-side:

JSON Tools — https://json.renderlog.in (formatter, validator, JWT decoder, JSONPath tester, 40+ converters)
Text Tools — https://text.renderlog.in (case converters, slug generator, HTML/markdown utilities, 70+ tools)
PDF Tools — https://pdftools.renderlog.in (merge, split, OCR, compress to exact size, 40+ tools)
Image Tools — https://imagetools.renderlog.in (compress, convert, resize, background remover, 50+ tools)
QR Tools — https://qrtools.renderlog.in (WiFi, vCard, UPI, bulk QR codes with logos)
Calc Tools — https://calctool.renderlog.in (60+ calculators for finance, health, math, dates)
Notepad — https://notepad.renderlog.in (private, offline-first notes, no signup)
