Run OCR 100% in the browser: images & scanned PDFs to text, no server

#javascript #webdev #webassembly #ocr

OCR usually means uploading your document to someone's server. For a receipt or a contract, that's a privacy cost for a one-line task. But you don't need a server — a full OCR engine runs in the browser via WebAssembly. Here's how to turn images and scanned PDFs into editable text fully client-side.

The pieces

Tesseract.js — the Tesseract OCR engine compiled to WASM. Recognises text in an image.
pdf.js — to render scanned PDF pages to a canvas first (a scanned PDF is just images, so we OCR each page).

<script src="https://cdn.jsdelivr.net/npm/tesseract.js@5/dist/tesseract.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/3.11.174/pdf.min.js"></script>

OCR an image

The modern API is a worker you create once, reuse, then terminate:

async function imageToText(file, lang = "eng") {
  // Downloads the WASM core and language data on first run, then uses the cache.
  const worker = await Tesseract.createWorker(lang);
  try {
    // recognize() accepts a File, Blob, canvas, or <img> element.
    const { data: { text } } = await worker.recognize(file);
    return text;
  } finally {
    await worker.terminate(); // free the worker even if recognition throws
  }
}

That's it for images. recognize() takes a File directly, so a file input is all you need, and the image itself never leaves the page.
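As a rough sketch of the file-input wiring, something like the following could work. The name wireOcrInput and the element selectors in the comment are hypothetical, not part of Tesseract.js; the OCR function is passed in as an argument, so imageToText from above can be plugged in unchanged.

```javascript
// Hypothetical helper: run an OCR function whenever the user picks a file.
// `input` is an <input type="file">, `output` is any element with textContent,
// and `ocr` is an async function (file) => text, such as imageToText above.
function wireOcrInput(input, output, ocr) {
  input.addEventListener("change", async () => {
    const file = input.files && input.files[0];
    if (!file) return; // the user cancelled the picker

    output.textContent = "Recognising…";
    try {
      output.textContent = await ocr(file);
    } catch (err) {
      output.textContent = "OCR failed: " + err.message;
    }
  });
}

// In the page (assumed markup: <input id="file" type="file"> and <pre id="out">):
// wireOcrInput(document.querySelector("#file"), document.querySelector("#out"), imageToText);
```

Passing the OCR function in keeps the helper independent of Tesseract.js, so the same wiring can later drive a PDF path too.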

OCR a scanned PDF

A scanned PDF has no text layer, only page images. So render each page to a canvas (at a generous scale for accuracy) and OCR the canvas:

async function scannedPdfToText(bytes, lang = "eng") {
  // pdf.js parses PDFs in its own worker; point it at the matching worker build.
  pdfjsLib.GlobalWorkerOptions.workerSrc =
    "https://cdnjs.cloudflare.com/ajax/libs/pdf.js/3.11.174/pdf.worker.min.js";

  // bytes: the PDF contents as a Uint8Array or ArrayBuffer
  const pdf = await pdfjsLib.getDocument({ data: bytes }).promise;
  const worker = await Tesseract.createWorker(lang); // one worker, reused for every page
  let out = "";

  try {
    for (let n = 1; n <= pdf.numPages; n++) {
      const page = await pdf.getPage(n);
      const viewport = page.getViewport({ scale: 2.5 }); // 72 DPI × 2.5 = 180 DPI
      const canvas = document.createElement("canvas");
      canvas.width = Math.floor(viewport.width);
      canvas.height = Math.floor(viewport.height);
      const ctx = canvas.getContext("2d");
      ctx.fillStyle = "#fff"; // white background so transparent regions don't OCR as noise
      ctx.fillRect(0, 0, canvas.width, canvas.height);

      await page.render({ canvasContext: ctx, viewport }).promise;
      const { data: { text } } = await worker.recognize(canvas);
      out += text + "\n\n";
      canvas.width = canvas.height = 0; // release the page bitmap before the next page
    }
  } finally {
    await worker.terminate(); // free the worker even if a page fails
  }
  return out;
}
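If one file input should accept both images and PDFs, a small dispatcher can decide which of the two functions to call. This is only a sketch: ocrRouteFor and fileToText are hypothetical names, and the MIME type and extension checks are an assumption about what a browser reports for picked files.

```javascript
// Hypothetical helper: decide how to OCR a picked file from its MIME type,
// falling back to the file extension when the type is missing.
function ocrRouteFor(file) {
  const type = (file.type || "").toLowerCase();
  const name = (file.name || "").toLowerCase();
  if (type === "application/pdf" || name.endsWith(".pdf")) return "pdf";
  if (type.startsWith("image/")) return "image";
  return "unsupported";
}

// Dispatch to the two functions defined above (browser only).
async function fileToText(file, lang = "eng") {
  switch (ocrRouteFor(file)) {
    case "pdf": {
      // pdf.js wants raw bytes, so read the File into a Uint8Array first.
      const bytes = new Uint8Array(await file.arrayBuffer());
      return scannedPdfToText(bytes, lang);
    }
    case "image":
      return imageToText(file, lang);
    default:
      throw new Error("Unsupported file type: " + (file.type || file.name));
  }
}
```

Reading the file with arrayBuffer() happens entirely in the page, so the PDF path stays as local as the image path.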

Gotchas worth knowing

First run is heavy. Tesseract.js downloads its WASM core (~2–4 MB) and the language traineddata (a few MB) on first use, then caches them. Show a "Loading OCR engine…" state so the wait isn't mysterious.
Scale matters for PDFs. Rendering pages at scale: 1 gives ~72 DPI and poor accuracy. Use ~2.5 (≈180 DPI). Higher is sharper but uses more memory.
Match the language. createWorker('eng') vs 'spa' etc. — the right language massively improves accuracy on accented text. You can combine: 'eng+spa'.
Reuse the worker. Creating a worker per page re-loads everything each time. Make one, loop, then terminate().
Set a white background before rendering the page onto the canvas: transparent PDF regions otherwise OCR as noise.
OCR isn't magic. Clean, straight, high-contrast scans read well; blurry phone photos and handwriting don't. Set that expectation in the UI.
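Two of these gotchas lend themselves to small helpers; here is a minimal sketch under stated assumptions. The names progressLabel, createWorkerWithProgress and renderScale are hypothetical. The logger option and the { status, progress } message shape follow the Tesseract.js v5 createWorker(langs, oem, options) signature as I understand it, and the exact status strings may differ between versions, so anything unrecognised falls back to the generic loading label. The 16-megapixel cap is an arbitrary illustrative budget, not a documented browser limit.

```javascript
// Turn a Tesseract.js logger message into a short UI label.
// Assumes messages look like { status: string, progress: number in 0..1 }.
function progressLabel(m) {
  if (m.status === "recognizing text" && typeof m.progress === "number") {
    return "Recognising… " + Math.round(m.progress * 100) + "%";
  }
  return "Loading OCR engine…"; // core download, language data, initialisation
}

// Create a worker that reports progress through onLabel (browser only).
// The second argument is the OCR engine mode; 1 selects the LSTM engine.
async function createWorkerWithProgress(lang, onLabel) {
  return Tesseract.createWorker(lang, 1, {
    logger: (m) => onLabel(progressLabel(m)),
  });
}

// Pick a pdf.js render scale for a target DPI (scale 1 is 72 DPI),
// capped so the canvas never exceeds maxPixels (an assumed memory budget).
// widthPt and heightPt are the page size at scale 1, e.g. from
// page.getViewport({ scale: 1 }).
function renderScale(targetDpi, widthPt, heightPt, maxPixels = 16e6) {
  const wanted = targetDpi / 72;
  const cap = Math.sqrt(maxPixels / (widthPt * heightPt));
  return Math.min(wanted, cap);
}
```

For a US Letter page (612 × 792 points), a 180 DPI target gives a scale of 2.5, well under the cap; only very large pages or very high targets get scaled back.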

Why client-side

Privacy: the document never leaves the device — ideal for IDs, contracts, statements.
$0 backend: it's static hosting; OCR runs on the user's CPU.
Offline-friendly: once the WASM + lang data are cached, it works without a connection.

The trade-offs are the first-load download and the fact that recognition runs on the user's CPU, which is fine for everyday documents.

I built this into a free tool — PDFNest's Image to Text (OCR) does exactly this (images + scanned PDFs, multiple languages, copy/download), all in the browser. There's also a companion guide on extracting text from PDFs. Happy to talk through the Tesseract/pdf.js details in the comments.
