OCR usually means uploading your document to someone's server. For a receipt or a contract, that's a privacy cost for a one-line task. But you don't need a server — a full OCR engine runs in the browser via WebAssembly. Here's how to turn images and scanned PDFs into editable text fully client-side.
The pieces
- Tesseract.js — the Tesseract OCR engine compiled to WASM. Recognises text in an image.
- pdf.js — to render scanned PDF pages to a canvas first (a scanned PDF is just images, so we OCR each page).
<script src="https://cdn.jsdelivr.net/npm/tesseract.js@5/dist/tesseract.min.js"></script>
<script src="https://cdnjs.cloudflare.com/ajax/libs/pdf.js/3.11.174/pdf.min.js"></script>
OCR an image
The modern API is a worker you create once, reuse, then terminate:
async function imageToText(file, lang = "eng") {
  const worker = await Tesseract.createWorker(lang); // downloads core + lang data on first run
  try {
    const { data: { text } } = await worker.recognize(file); // accepts File/Blob/canvas/<img>
    return text;
  } finally {
    await worker.terminate(); // free the worker even if recognition throws
  }
}
That's it for images. recognize() takes a File directly, so a file input is enough — nothing leaves the page.
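As a minimal sketch of that wiring, the helper below connects a file input to an OCR function. The names `wireOcrInput`, `input`, `output` and `recognizeFn` are illustrative assumptions, not part of Tesseract.js; in a page you would pass `imageToText` as `recognizeFn`.

```javascript
// Hypothetical helper: runs OCR whenever the user picks a file.
// `input` is an <input type="file">, `output` is any element with textContent,
// `recognizeFn` is an async (file) => text function such as imageToText.
function wireOcrInput(input, output, recognizeFn) {
  async function onChange() {
    const file = input.files && input.files[0];
    if (!file) return; // user cancelled the picker
    output.textContent = "Loading OCR engine…";
    try {
      output.textContent = await recognizeFn(file);
    } catch (err) {
      output.textContent = "OCR failed: " + err.message;
    }
  }
  input.addEventListener("change", onChange);
  return onChange; // returned so callers can also trigger it manually
}

// In a page (the element IDs are assumptions):
// wireOcrInput(document.querySelector("#file"), document.querySelector("#out"), imageToText);
```

Passing the OCR function in as a parameter keeps the helper independent of Tesseract.js, so the same wiring can serve the PDF path as well.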
OCR a scanned PDF
A scanned PDF has no text layer, only page images. So render each page to a canvas (at a generous scale for accuracy) and OCR the canvas:
async function scannedPdfToText(bytes, lang = "eng") {
  pdfjsLib.GlobalWorkerOptions.workerSrc =
    "https://cdnjs.cloudflare.com/ajax/libs/pdf.js/3.11.174/pdf.worker.min.js";
  const pdf = await pdfjsLib.getDocument({ data: bytes }).promise;
  const worker = await Tesseract.createWorker(lang); // one worker, reused for every page
  let out = "";
  try {
    for (let n = 1; n <= pdf.numPages; n++) {
      const page = await pdf.getPage(n);
      const viewport = page.getViewport({ scale: 2.5 }); // 2.5 × 72 = ~180 DPI, good OCR accuracy
      const canvas = document.createElement("canvas");
      canvas.width = Math.floor(viewport.width);
      canvas.height = Math.floor(viewport.height);
      const ctx = canvas.getContext("2d");
      ctx.fillStyle = "#fff"; // white background so transparent regions don't OCR as noise
      ctx.fillRect(0, 0, canvas.width, canvas.height);
      await page.render({ canvasContext: ctx, viewport }).promise;
      const { data: { text } } = await worker.recognize(canvas);
      out += text + "\n\n";
    }
  } finally {
    await worker.terminate(); // runs even if a page fails to render or recognise
  }
  return out;
}
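To feed either function from a single file input, a small dispatcher can pick the path by MIME type. This is a sketch under assumptions: `fileToText` and its `handlers` argument are hypothetical names, and it treats every PDF as scanned.

```javascript
// Hypothetical dispatcher: routes a picked file to the right OCR path.
// handlers.pdf receives a Uint8Array (what scannedPdfToText takes as `bytes`);
// handlers.image receives the File itself (what imageToText takes).
async function fileToText(file, handlers, lang = "eng") {
  const isPdf =
    file.type === "application/pdf" || /\.pdf$/i.test(file.name || "");
  if (isPdf) {
    const bytes = new Uint8Array(await file.arrayBuffer()); // read locally, no upload
    return handlers.pdf(bytes, lang);
  }
  return handlers.image(file, lang);
}

// In a page:
// const text = await fileToText(file, { pdf: scannedPdfToText, image: imageToText });
```

The file-name fallback is there because some browsers report an empty `type` for files they cannot classify.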
Gotchas worth knowing
- First run is heavy. Tesseract.js downloads its WASM core (~2–4 MB) and the language traineddata (a few MB) on first use, then caches them. Show a "Loading OCR engine…" state so the wait isn't mysterious.
- Scale matters for PDFs. Rendering pages at scale: 1 gives ~72 DPI and poor accuracy. Use ~2.5 (≈180 DPI). Higher is sharper but uses more memory.
- Match the language. createWorker('eng') vs 'spa' etc.: the right language massively improves accuracy on accented text. You can combine: 'eng+spa'.
- Reuse the worker. Creating a worker per page re-loads everything each time. Make one, loop, then terminate().
- Set a white background before page.render() draws onto the canvas. Transparent PDF regions otherwise OCR as noise.
- OCR isn't magic. Clean, straight, high-contrast scans read well; blurry phone photos and handwriting don't. Set that expectation in the UI.
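The scale and memory trade-off in the list above comes down to simple arithmetic: a PDF point is 1/72 inch, so scale = DPI / 72. A minimal sketch, assuming a hypothetical `maxPixels` budget (the 16-megapixel default is an arbitrary illustration, not a browser limit) to keep very large pages from exhausting memory:

```javascript
// Hypothetical helper: pick a pdf.js render scale for a target DPI,
// capped so the canvas never exceeds maxPixels (an assumed budget).
// widthPt/heightPt are the page size at scale 1, e.g. taken from
// page.getViewport({ scale: 1 }).
function ocrScale(widthPt, heightPt, targetDpi = 180, maxPixels = 16e6) {
  const wanted = targetDpi / 72; // 180 DPI -> 2.5
  const cap = Math.sqrt(maxPixels / (widthPt * heightPt)); // largest scale within budget
  return Math.min(wanted, cap);
}

// US Letter (612 x 792 pt) at 180 DPI fits the default budget:
// ocrScale(612, 792) -> 2.5
```

The result can be passed straight to `page.getViewport({ scale })` in place of the fixed 2.5.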
Why client-side
- Privacy: the document never leaves the device — ideal for IDs, contracts, statements.
- $0 backend: it's static hosting; OCR runs on the user's CPU.
- Offline-friendly: once the WASM + lang data are cached, it works without a connection.
The trade-off is the first-load download and that recognition uses the user's CPU — fine for everyday documents.
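One way to make that first-load wait visible is the `logger` option of `createWorker`. The sketch below assumes logger messages carry a `status` string and a `progress` number between 0 and 1, as in the Tesseract.js examples; `describeProgress` is a hypothetical formatter, and the exact status strings may vary between versions.

```javascript
// Hypothetical formatter for Tesseract.js logger messages.
// Assumes messages look like { status: "recognizing text", progress: 0.5 }.
function describeProgress(message) {
  const status = message.status || "working";
  const pct = Math.round((message.progress || 0) * 100);
  return status + ": " + pct + "%";
}

// In a page (assumes Tesseract.js v5 and an element with id "status"):
// const worker = await Tesseract.createWorker("eng", 1, {
//   logger: (m) => {
//     document.querySelector("#status").textContent = describeProgress(m);
//   },
// });
```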
I built this into a free tool — PDFNest's Image to Text (OCR) does exactly this (images + scanned PDFs, multiple languages, copy/download), all in the browser. There's also a companion guide on extracting text from PDFs. Happy to talk through the Tesseract/pdf.js details in the comments.