Ubed Sheikh

Posted on Jun 8

How I Built a Browser-Based OCR Tool That Never Uploads Your Files (Using Tesseract.js)

#webdev #opensource #javascript #buildinpublic

If you've ever tried to extract text from an Aadhaar card photo, a scanned invoice, or a government document screenshot — you've probably used an online OCR tool.

Hopefully it deleted your file afterwards.

This post explains why that's a privacy problem, and how browser-based OCR with Tesseract.js avoids it by running entirely on your device, with no server involved.

Why Most Online OCR Tools Are a Privacy Risk

When you upload an image to a typical OCR website, this is what actually happens:

Your file travels over the internet to a remote server
The server runs OCR software (for example, Tesseract or the Google Cloud Vision API)
The extracted text is returned to you
The file is, ideally, deleted afterwards

The problem is that last step. You have no way to verify it.

This matters especially in India, where people regularly process:

Aadhaar card photos
PAN card scans
Bank statement screenshots
Salary slips and private contracts
Government portal documents (SSC, UPSC, Railway)

Uploading these to an unverified third-party server is a real risk that most tools don't acknowledge.

What Most Online OCR Tools Get Wrong

Most online OCR tools are server-side. You upload your file → it goes to their server → gets processed → you download the result.

Three problems with this:

1. Privacy risk.
Your Aadhaar photo, passport scan, or ID document just got uploaded to a random server you know nothing about.

2. No transparency.
You can't verify whether files are stored, logged, or sold. They say they delete it. Maybe they do.

3. They're the wrong fit for sensitive documents.
The moment you're dealing with something private — a contract, a salary slip, API credentials in a screenshot — you shouldn't be uploading it anywhere.

The Browser-Based Approach: Tesseract.js

Modern browsers can handle OCR entirely client-side using Tesseract.js. No server needed.

Tesseract.js is a JavaScript library that wraps the open-source Tesseract OCR engine (long sponsored by Google), compiled to WebAssembly. It runs entirely inside the browser tab.

Here's the core of how it works:

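import Tesseract from 'tesseract.js';

// imageFile is the user's image, e.g. the File from an <input type="file">
const result = await Tesseract.recognize(
  imageFile,
  'eng',                        // language code
  {
    logger: m => console.log(m) // progress messages: { status, progress }
  }
);

// The recognised text as a single string
console.log(result.data.text);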

The key method is Tesseract.recognize(). It accepts:

imageFile — your image (File object, URL, base64, or canvas element)
'eng' — the language code (supports 100+ languages)
logger — a callback for tracking progress phases

Your file never leaves the device. The only network traffic is the download of the OCR engine and language data.

How the Architecture Differs

Normal OCR tool:
Your image → Internet → Their server → OCR runs → Text returns → (File "deleted"?)

Browser-based OCR:
Your image → Browser tab → WebAssembly OCR runs → Text appears → Tab closed = everything gone

No server to maintain. No infrastructure cost. No document ever leaves the device.

The 5 Things I Had to Build Beyond the Core

1. Multi-language support

Tesseract.js supports 100+ languages via separate trained data files. For Indian users, Hindi support was non-negotiable. Language packs download on demand — you don't load everything upfront.

// English + Hindi combined recognition
const result = await Tesseract.recognize(imageFile, 'eng+hin');

Currently supported in the tool: English, Hindi, Arabic, Chinese, Japanese, French, German, Spanish, Portuguese.
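To make the 'eng+hin' convention concrete, here is a minimal sketch of turning a user's language selection into that string. The `LANGUAGE_CODES` table and `buildLangString` are hypothetical names, not part of Tesseract.js, and mapping "Chinese" to `chi_sim` (Simplified) is an assumption; Traditional Chinese would be `chi_tra`.

```javascript
// Hypothetical lookup: display name -> Tesseract traineddata code.
// 'chi_sim' assumes Simplified Chinese.
const LANGUAGE_CODES = new Map([
  ['English', 'eng'],
  ['Hindi', 'hin'],
  ['Arabic', 'ara'],
  ['Chinese', 'chi_sim'],
  ['Japanese', 'jpn'],
  ['French', 'fra'],
  ['German', 'deu'],
  ['Spanish', 'spa'],
  ['Portuguese', 'por'],
]);

// Join the selected languages with '+', the format Tesseract expects
// for combined recognition. Duplicates are dropped, order is kept.
function buildLangString(selectedNames) {
  const codes = [];
  for (const name of selectedNames) {
    const code = LANGUAGE_CODES.get(name);
    if (!code) throw new Error(`Unsupported language: ${name}`);
    if (!codes.includes(code)) codes.push(code);
  }
  if (codes.length === 0) throw new Error('Select at least one language');
  return codes.join('+');
}
```

The result can be passed straight through as the second argument, for example `Tesseract.recognize(imageFile, buildLangString(['English', 'Hindi']))`.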

2. Progress tracking

Tesseract.js has distinct processing phases — loading the engine, loading language data, initialising, recognising. Without progress feedback, users assume the tool is broken.

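// updateProgressBar is your own UI helper; progress is a fraction from 0 to 1
const result = await Tesseract.recognize(file, lang, {
  logger: ({ status, progress }) => {
    updateProgressBar(status, Math.round(progress * 100));
  }
});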
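The raw status strings are not very user-friendly, so one option is to translate them before they reach the progress bar. The sketch below is an illustration under assumptions: `describeProgress` and the label text are made up for this example, and the exact status strings can differ between Tesseract.js versions, which is why unknown statuses fall back to the raw value.

```javascript
// Friendly labels for status strings Tesseract.js has reported.
// Treat the exact strings as an assumption; they may vary by version.
const STATUS_LABELS = new Map([
  ['loading tesseract core', 'Loading OCR engine'],
  ['initializing tesseract', 'Starting OCR engine'],
  ['loading language traineddata', 'Loading language data'],
  ['initializing api', 'Preparing recognizer'],
  ['recognizing text', 'Reading text'],
]);

// Convert one logger message into something safe to show in the UI.
function describeProgress(message) {
  const status = (message && message.status) || '';
  const raw = Number(message && message.progress);
  // Clamp to [0, 1] and treat missing or invalid progress as 0.
  const fraction = Number.isFinite(raw) ? Math.min(1, Math.max(0, raw)) : 0;
  return {
    label: STATUS_LABELS.get(status) || status || 'Working',
    percent: Math.round(fraction * 100),
  };
}
```

Inside the logger callback this would be used as `const { label, percent } = describeProgress(m);` before updating the UI.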

3. Reading the file without uploading it

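The browser's FileReader API reads the image into local memory, so nothing is sent over the network. (Tesseract.recognize() also accepts a File object directly; a data URL is mainly useful when you want to preview or preprocess the image first.)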

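const reader = new FileReader();
reader.onload = (e) => {
  const imageData = e.target.result; // data URL held in browser memory
  runOCR(imageData);                 // your function that calls Tesseract.recognize()
};
reader.onerror = () => console.error('Could not read file', reader.error);
reader.readAsDataURL(file);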
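Before reading the file at all, it can help to reject inputs the tool cannot handle. This is a rough sketch, not the tool's actual code: `checkImageFile` is a hypothetical helper, the accepted types follow the JPG/PNG/WebP formats mentioned later in this post, and the 10 MB cap is an arbitrary assumption.

```javascript
// MIME types for JPG, PNG, and WebP.
const ACCEPTED_TYPES = ['image/jpeg', 'image/png', 'image/webp'];
// Hypothetical size cap; very large photos slow down in-browser OCR.
const MAX_BYTES = 10 * 1024 * 1024;

// Works with a browser File, or any object exposing `type` and `size`.
function checkImageFile(file) {
  if (!file) return { ok: false, reason: 'No file selected' };
  if (!ACCEPTED_TYPES.includes(file.type)) {
    return { ok: false, reason: 'Use a JPG, PNG, or WebP image' };
  }
  if (file.size > MAX_BYTES) {
    return { ok: false, reason: 'Image is larger than 10 MB' };
  }
  return { ok: true, reason: '' };
}
```

The check only looks at metadata the browser already has, so it runs locally like everything else.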

4. Camera support on mobile

One HTML attribute opens the rear camera directly on most mobile browsers (desktop browsers ignore it and show the normal file picker):

<input type="file" accept="image/*" capture="environment" />

5. Image preprocessing to improve accuracy

This is the part most Tesseract.js tutorials skip. Raw phone camera photos give poor results. Preprocessing helps significantly.

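function preprocessImage(imageElement) {
  const canvas = document.createElement('canvas');
  const ctx = canvas.getContext('2d');

  // Use the image's intrinsic size; .width on an <img> is its rendered size
  const width = imageElement.naturalWidth || imageElement.width;
  const height = imageElement.naturalHeight || imageElement.height;

  // Upscale 2x so small text has more pixels to work with
  canvas.width = width * 2;
  canvas.height = height * 2;
  ctx.drawImage(imageElement, 0, 0, canvas.width, canvas.height);

  // Convert to grayscale by averaging the R, G, B channels of each pixel
  const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height);
  const data = imageData.data;
  for (let i = 0; i < data.length; i += 4) {
    const avg = (data[i] + data[i + 1] + data[i + 2]) / 3;
    data[i] = data[i + 1] = data[i + 2] = avg;
  }
  ctx.putImageData(imageData, 0, 0);

  // PNG data URL, ready to pass to Tesseract.recognize()
  return canvas.toDataURL();
}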

Run this before passing the image to Tesseract. You'll see a meaningful accuracy improvement on phone camera photos.
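A further step that is often tried on uneven phone photos is thresholding: turning the grayscale image into pure black and white. The sketch below uses Otsu's method to pick the cut-off automatically. It is not part of the function above and whether it helps depends on the image, so treat it as an experiment. `otsuThreshold` and `binarize` are illustrative names, and both assume RGBA pixel data (a Uint8ClampedArray such as `imageData.data`) that has already been converted to grayscale.

```javascript
// Pick a threshold with Otsu's method: the value that maximises the
// variance between the "dark" and "light" pixel groups.
// Reads only the red channel, since R = G = B after grayscale conversion.
function otsuThreshold(rgba) {
  const hist = new Array(256).fill(0);
  const total = rgba.length / 4;
  let sumAll = 0;
  for (let i = 0; i < rgba.length; i += 4) {
    hist[rgba[i]]++;
    sumAll += rgba[i];
  }
  let sumBg = 0;
  let weightBg = 0;
  let best = 0;
  let bestVariance = -1;
  for (let t = 0; t < 256; t++) {
    weightBg += hist[t];
    if (weightBg === 0) continue;          // no dark pixels yet
    const weightFg = total - weightBg;
    if (weightFg === 0) break;             // no light pixels left
    sumBg += t * hist[t];
    const meanBg = sumBg / weightBg;
    const meanFg = (sumAll - sumBg) / weightFg;
    const variance = weightBg * weightFg * (meanBg - meanFg) ** 2;
    if (variance > bestVariance) {
      bestVariance = variance;
      best = t;
    }
  }
  return best;
}

// Set every pixel to black or white in place; alpha is left untouched.
// Returns the threshold that was used.
function binarize(rgba) {
  const threshold = otsuThreshold(rgba);
  for (let i = 0; i < rgba.length; i += 4) {
    const value = rgba[i] > threshold ? 255 : 0;
    rgba[i] = rgba[i + 1] = rgba[i + 2] = value;
  }
  return threshold;
}
```

In `preprocessImage`, this would slot in as `binarize(data)` just before `ctx.putImageData(imageData, 0, 0)`.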

The Gotcha: First Load Time

One thing that catches developers off guard — Tesseract.js downloads the WebAssembly engine and language pack on first use (~10MB for English).

Users will think the tool is broken if you don't handle this.

Tesseract.recognize(file, lang, {
  logger: ({ status, progress }) => {
    if (status === 'loading tesseract core') {
      showMessage('Loading OCR engine for the first time...');
    }
    if (status === 'loading language traineddata') {
      showMessage('Loading language pack...');
    }
  }
});

After the first load, the browser caches the engine and language data, so subsequent runs start much faster.

Real-World Use Case: India's Government Portals

This is the gap that pushed me to build the tool.

Almost every government portal in India — Aadhaar, SSC, UPSC, Railway, passport applications — requires documents in specific formats. People need to extract text from scanned copies, screenshot confirmations, and printed notices.

But sending those documents to an unverified OCR server is genuinely risky.

The India-specific angle is baked into the tool — Hindi support, mobile camera input, and JPG/PNG/WebP support for exactly the file types these portals deal with.

Accuracy Benchmarks

Document type	Accuracy
Clean printed text (PDF screenshot)	~97%
Printed document photo, good lighting	~92%
Printed document photo, average lighting	~78%
Handwritten text, neat	~65%
Handwritten text, casual	~40%

For everyday use cases — screenshots, scanned forms, photos of printed documents — it works well.
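The table does not say how accuracy was measured. One common way to get a comparable number for your own documents is character-level accuracy: one minus the edit distance between the OCR output and a hand-typed reference, divided by the reference length. The sketch below shows that calculation; `editDistance` and `charAccuracy` are illustrative names, and this is not necessarily the method behind the figures above.

```javascript
// Levenshtein distance: minimum number of single-character insertions,
// deletions, and substitutions needed to turn `a` into `b`.
function editDistance(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Character accuracy in the range 0..1, relative to the reference text.
function charAccuracy(reference, ocrText) {
  if (reference.length === 0) return ocrText.length === 0 ? 1 : 0;
  const errors = editDistance(reference, ocrText);
  return Math.max(0, 1 - errors / reference.length);
}
```

Note that this compares UTF-16 code units, so scripts with combining marks such as Hindi may need the strings normalised or split into graphemes first for a fair score.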

Honest Limitations

Handwriting accuracy is lower than printed text.
Tesseract was trained primarily on printed fonts. Don't expect perfect results from handwritten notes.
Workaround: take photos in bright, even lighting and hold the camera directly above the paper.

Low contrast images struggle.
A photo taken in bad lighting gives poor output regardless of the OCR engine.
Workaround: the preprocessing function above (grayscale + upscale) helps considerably.

No layout preservation.
It extracts raw text, not formatted tables or columns.
Workaround: for complex PDFs with tables, you need Adobe Acrobat or a dedicated PDF parser.
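For simple tables, a partial workaround is to rebuild rows from word positions rather than relying on the plain text. The sketch below assumes you already have word objects shaped like `{ text, bbox: { x0, y0, x1, y1 } }`, which is the bounding-box shape Tesseract.js uses for word results; how you obtain that list depends on the library version. `groupWordsIntoRows` and the default `tolerance` of 10 pixels are illustrative choices.

```javascript
// Group words whose vertical centres are within `tolerance` pixels
// into the same row, then order each row left to right.
function groupWordsIntoRows(words, tolerance = 10) {
  const sorted = [...words].sort((a, b) => a.bbox.y0 - b.bbox.y0);
  const rows = [];
  for (const word of sorted) {
    const centerY = (word.bbox.y0 + word.bbox.y1) / 2;
    const row = rows.find(r => Math.abs(r.centerY - centerY) <= tolerance);
    if (row) {
      row.words.push(word);
    } else {
      rows.push({ centerY, words: [word] });
    }
  }
  // Return plain text per row, sorted by horizontal position.
  return rows.map(r =>
    r.words.sort((a, b) => a.bbox.x0 - b.bbox.x0).map(w => w.text)
  );
}
```

This only recovers rows, not column boundaries or merged cells, so it does not replace a dedicated table parser.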

Summary

Most OCR tools send your image to a server you can't verify
Tesseract.js runs the full OCR engine inside the browser via WebAssembly
FileReader API reads images locally — nothing touches a network
Preprocessing (grayscale + upscale) meaningfully improves accuracy on phone photos
For India specifically — Aadhaar, SSC, UPSC document processing is a real daily use case

If you're building anything that handles sensitive document uploads, browser-based OCR with Tesseract.js is worth serious consideration.

I built this as part of EasyToolkit, a free browser-based toolkit for document and image utilities. All processing happens locally. Feedback welcome.

One question for the comments: What preprocessing techniques have you used to improve Tesseract.js accuracy on low-quality images? Thresholding, deskewing, sharpening? Drop them below and let's build a reference thread.
