OCR demos usually stop too early.
They show recognize(), print some text, and imply that automatic redaction is basically done. In a real product, that is maybe 20 percent of the job.
What users actually need is a safer pipeline:
- Run OCR on the image.
- Classify risky spans such as emails, phone numbers, account references, dates, and IDs.
- Map those matched spans back to OCR word boxes.
- Pad the boxes so the text edges are fully covered.
- Insert them as editable regions instead of exporting immediately.
That is the pattern we use in a browser-first redaction flow built around Tesseract.js.
The full companion guide is here:
https://happyimg.com/guides/how-ocr-assisted-redaction-works-with-tesseract-js
Why we kept OCR in the browser
Sensitive screenshots are exactly the wrong kind of asset to upload to a server by default just to detect an email address or account number.
Running OCR in the browser gave us a cleaner privacy boundary:
- the image stays local by default
- the user can review the result immediately
- the OCR pass can feed directly into the editor without waiting on a round trip
That still leaves the hardest part unsolved: turning OCR output into something safe enough to help with redaction.
Geometry matters more than text
For redaction, plain OCR text is not enough. The editor needs coordinates.
So instead of treating Tesseract.js as a text extractor, we ask it for structured layout data:
```js
const result = await worker.recognize(
  asset.ocrSource,
  { rotateAuto: true },
  { blocks: true }
);
```
That gives us paragraphs, lines, and words with bounding boxes. Without those word-level bounds, there are no usable redaction candidates. There is only text.
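Flattening that nested output into word-level candidates is a short walk down the tree. A minimal sketch, assuming the Tesseract.js v5 result shape of `blocks → paragraphs → lines → words`, where each word carries a `bbox` with `x0/y0/x1/y1`:

```javascript
// Flatten Tesseract.js block output into a flat list of word candidates.
// Assumes the v5 result shape: blocks -> paragraphs -> lines -> words,
// where each word has { text, bbox: { x0, y0, x1, y1 } }.
function collectWords(blocks) {
  const words = [];
  for (const block of blocks ?? []) {
    for (const para of block.paragraphs ?? []) {
      for (const line of para.lines ?? []) {
        for (const word of line.words ?? []) {
          words.push({ text: word.text, bbox: word.bbox });
        }
      }
    }
  }
  return words;
}
```

Everything downstream, from pattern matching to region drawing, works off this flat list rather than the raw nested result.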
We also lazily create and reuse the worker instead of rebuilding it on every scan:
```js
if (!ocrWorkerRef.current) {
  const createWorker = await loadOcrWorkerFactory();
  ocrWorkerRef.current = await createWorker("eng", 1, { logger });
  await ocrWorkerRef.current.setParameters({
    tessedit_pageseg_mode: "11",
    preserve_interword_spaces: "1",
  });
}
```
That keeps the editor responsive across repeated scans and makes the OCR step feel more like a tool and less like a blocking batch job.
The useful trick: match text, then map back to words
The main implementation trick was simple and practical.
For each OCR line, we rebuild a single line string, but we also keep the character offsets of every OCR word inside that string. That gives us a bridge between pattern matching and image geometry.
So the flow becomes:
- Reconstruct the OCR line as plain text.
- Run regexes for categories like email, phone, URL, date, or ID.
- Find which OCR words overlap each matched character range.
- Merge those word bounds into one redaction region.
That lets us keep the matching logic simple while still ending up with coordinates we can draw and edit.
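The bridge between text matching and geometry fits in one function. A sketch of the idea, again assuming words shaped like `{ text, bbox: { x0, y0, x1, y1 } }`; the pattern list here is illustrative, not the exhaustive set a real product would ship:

```javascript
// Rebuild an OCR line as one string while recording every word's
// character range, then map regex matches back onto word bounding
// boxes and merge them into one region per match.
const PATTERNS = [
  { category: "email", regex: /[\w.+-]+@[\w-]+\.[\w.-]+/g },
  { category: "phone", regex: /\+?\d[\d\s().-]{7,}\d/g },
];

function findRegions(words) {
  // 1. Reconstruct the line and remember each word's offsets.
  let text = "";
  const spans = [];
  for (const word of words) {
    if (text) text += " ";
    spans.push({
      start: text.length,
      end: text.length + word.text.length,
      bbox: word.bbox,
    });
    text += word.text;
  }

  // 2. Run category regexes, find overlapping words, merge their boxes.
  const regions = [];
  for (const { category, regex } of PATTERNS) {
    for (const match of text.matchAll(regex)) {
      const start = match.index;
      const end = start + match[0].length;
      const hit = spans.filter((s) => s.start < end && s.end > start);
      if (hit.length === 0) continue;
      regions.push({
        category,
        x0: Math.min(...hit.map((s) => s.bbox.x0)),
        y0: Math.min(...hit.map((s) => s.bbox.y0)),
        x1: Math.max(...hit.map((s) => s.bbox.x1)),
        y1: Math.max(...hit.map((s) => s.bbox.y1)),
      });
    }
  }
  return regions;
}
```

Note that the regexes never see coordinates and the geometry code never sees patterns; the character-offset table is the only thing connecting the two.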
Tight boxes are risky
One thing that became obvious very quickly: exact glyph bounds look precise in demos, but they are risky in real privacy tooling.
If the box is too tight, the export can still leak fragments of the text around the edges. So after merging the matched word boxes, we expand the region with padding before inserting it into the editor.
That padding step ended up being one of the most important product decisions in the whole flow:
- too little padding leaves readable fragments
- too much padding hides useful surrounding context
So OCR quality alone is not the main issue. Region construction is just as important.
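The padding step itself is only a few lines; the product decision is the margin value. A sketch, where the default of 4px is an assumption to tune, not a value taken from any particular library, and the box is clamped to the image so padding never spills outside it:

```javascript
// Expand a merged region by a fixed pixel margin, clamped to the
// image bounds. The default margin is a tunable assumption.
function padRegion(box, imageWidth, imageHeight, pad = 4) {
  return {
    x0: Math.max(0, box.x0 - pad),
    y0: Math.max(0, box.y0 - pad),
    x1: Math.min(imageWidth, box.x1 + pad),
    y1: Math.min(imageHeight, box.y1 + pad),
  };
}
```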
OCR should propose, not finalize
This was the biggest product lesson.
OCR-assisted redaction should not silently modify an image and export the result. It should insert reviewable regions into the editor and let the user confirm, delete, resize, or add more regions before saving.
For privacy tools, review is not a fallback. It is part of the feature.
That design also helped with the predictable OCR failure cases:
- low-contrast screenshots
- dense tables with tiny text
- mixed-language content
- broken OCR segmentation
- labels like "ID" or "Total" that match patterns but are not always sensitive
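For that last case, one cheap mitigation is to drop matches that are exactly a known non-sensitive label before they ever become suggestions. A sketch with an illustrative label set:

```javascript
// Filter out candidate matches that are exactly a known form label.
// The label set is illustrative; tune it for your own documents.
const LABELS = new Set(["id", "total", "date", "no."]);

function isLikelySensitive(matchText) {
  return !LABELS.has(matchText.trim().toLowerCase());
}
```

Because the candidates are reviewable anyway, this filter can afford to be conservative: a label that slips through is just one more suggestion for the user to dismiss.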
Once you accept that OCR is a candidate generator instead of a perfect decision-maker, the whole interaction model gets better.
The real implementation boundary
Tesseract.js is only the OCR engine. The hard part is the boundary around it.
What actually made the feature useful was:
- keeping the scan client-side
- reusing the worker efficiently
- preserving stable geometry
- matching only the categories we cared about
- padding regions conservatively
- requiring review before export
That is the difference between an OCR demo and a privacy tool.
If you are building something similar, I would strongly recommend optimizing for reviewable suggestions instead of "one-click automatic redaction." The first approach ships. The second usually overpromises.
More implementation details:
https://happyimg.com/guides/how-ocr-assisted-redaction-works-with-tesseract-js