<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Paul</title>
    <description>The latest articles on DEV Community by Paul (@snakelizzard).</description>
    <link>https://dev.to/snakelizzard</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3927064%2Ff1544cd0-8741-4e2e-89b2-808c336c2179.jpg</url>
      <title>DEV Community: Paul</title>
      <link>https://dev.to/snakelizzard</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/snakelizzard"/>
    <language>en</language>
    <item>
      <title>How I Built a Document Detector in the Browser</title>
      <dc:creator>Paul</dc:creator>
      <pubDate>Tue, 12 May 2026 11:49:04 +0000</pubDate>
      <link>https://dev.to/snakelizzard/how-i-built-a-document-detector-in-the-browser-2g0c</link>
      <guid>https://dev.to/snakelizzard/how-i-built-a-document-detector-in-the-browser-2g0c</guid>
      <description>&lt;p&gt;Scanning a document with your phone is one of those small tasks that comes up all the time. You need a decent photo of a page, and you need to send it somewhere quickly - by email, in a messenger, wherever. Usually that means reaching for an app. Modern browsers can run fairly serious code with WASM, so in some cases &lt;a href="https://phonescan.me" rel="noopener noreferrer"&gt;opening a site&lt;/a&gt; is easier than installing yet another app.&lt;/p&gt;

&lt;p&gt;It turned out to be a good computer vision problem. Real photos of documents are messy in all the usual ways: perspective distortion, poor lighting, glare, shadows, and cluttered backgrounds. Sometimes the page is partly out of frame. Sometimes the background itself is full of straight lines and rectangles that are easy to mistake for the document.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq3k8utp2dfgb1s5gvykz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq3k8utp2dfgb1s5gvykz.png" alt=" " width="300" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Overall Approach&lt;/h2&gt;

&lt;p&gt;I wanted to keep the whole thing relatively simple and run everything on the client. So instead of going down the neural network route, I built the detector with classical computer vision methods.&lt;/p&gt;

&lt;p&gt;The core idea is simple: don't trust any single detection method. On real phone photos, no single method is robust enough on its own.&lt;/p&gt;

&lt;p&gt;Instead, I run the image through several different processing paths. Each one tries, in its own way, to separate the document from the rest of the scene. From those results, I collect regions that look like a sheet of paper, build candidates, and only then choose the best one.&lt;/p&gt;

&lt;p&gt;In practice, this worked much better than trying to find the document with one "main" method. Some approaches look good under clean, even lighting and then immediately fall apart on glare. Others survive shadows better but get confused by a busy background. What helped most was not finding one perfect method, but combining several imperfect ones that fail differently.&lt;/p&gt;

&lt;p&gt;My first pass relies on the visual traits that often make paper stand out from the background. Another tries to pull out borders through local contrast. A third focuses more directly on sharp transitions and lines. After each pass, I extract contours, and those contours become the raw material for the candidate selection stage.&lt;/p&gt;
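&lt;p&gt;To make that fan-out concrete, here is a toy sketch of the multi-pass idea in plain Python. The real pipeline is built on OpenCV-style primitives (HSV/Lab paper masks, top-hat + Otsu, adaptive thresholds, Canny); the three passes and thresholds below are illustrative stand-ins, not the production code.&lt;/p&gt;

```python
# Toy sketch of the multi-pass idea: several independent binarization
# passes over one grayscale image (list of rows of 0..255 values), each
# producing its own mask. The thresholds here are illustrative.

def brightness_mask(img, thresh=180):
    """Pass 1: bright regions that could plausibly be paper."""
    return [[1 if px > thresh else 0 for px in row] for row in img]

def inverted_mask(img, thresh=80):
    """Pass 2: dark regions, useful when the page is darker than the scene."""
    return [[1 if thresh > px else 0 for px in row] for row in img]

def gradient_mask(img, thresh=60):
    """Pass 3: strong horizontal transitions, a crude stand-in for Canny."""
    out = []
    for row in img:
        mrow = [0]  # no left neighbor for the first column
        for a, b in zip(row, row[1:]):
            mrow.append(1 if abs(b - a) > thresh else 0)
        out.append(mrow)
    return out

def run_passes(img):
    """Each pass is a separate hypothesis; downstream stages extract
    contours from every mask instead of trusting a single one."""
    return [brightness_mask(img), inverted_mask(img), gradient_mask(img)]
```

&lt;p&gt;The point is the shape of the code, not the thresholds: every pass produces its own mask, and nothing downstream depends on any one of them being right.&lt;/p&gt;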

&lt;h2&gt;How the Detector Chooses the Document Candidate&lt;/h2&gt;

&lt;p&gt;Once I have a set of contours, the next problem is figuring out which ones really look like a document.&lt;/p&gt;

&lt;p&gt;This is where geometry starts to matter.&lt;/p&gt;

&lt;p&gt;For each contour, I try to recover a shape with four corners. Sometimes that works right away. Sometimes I need to simplify the contour a little first. And when the contour is too messy, I fall back to a rough rectangular estimate.&lt;/p&gt;
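&lt;p&gt;That simplify-then-fall-back loop can be sketched as an epsilon sweep. This is a simplified pure-Python stand-in: &lt;code&gt;rdp&lt;/code&gt; mimics what OpenCV's &lt;code&gt;approxPolyDP&lt;/code&gt; does, and the axis-aligned bounding box stands in for the rotated &lt;code&gt;minAreaRect&lt;/code&gt; the real code uses.&lt;/p&gt;

```python
import math

def _point_line_dist(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / math.hypot(dx, dy)

def rdp(points, eps):
    """Ramer-Douglas-Peucker simplification of a polyline."""
    if len(points) == 2:
        return points[:]
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_line_dist(points[i], points[0], points[-1])
        if d > dmax:
            idx, dmax = i, d
    if dmax > eps:
        left = rdp(points[:idx + 1], eps)
        return left[:-1] + rdp(points[idx:], eps)
    return [points[0], points[-1]]

def quad_from_contour(contour, eps_steps=(1.0, 2.0, 4.0, 8.0)):
    """Try increasingly aggressive simplification until exactly 4 corners
    remain; fall back to the axis-aligned bounding box (a crude stand-in
    for minAreaRect) when no epsilon produces a quadrilateral."""
    closed = contour + [contour[0]]      # treat the contour as a closed loop
    for eps in eps_steps:
        approx = rdp(closed, eps)[:-1]   # drop the duplicated endpoint
        if len(approx) == 4:
            return approx
    xs = [x for x, _ in contour]
    ys = [y for _, y in contour]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    return [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
```

&lt;p&gt;A rectangle-ish contour collapses to its four corners at a small epsilon; an L-shaped contour never yields exactly four and lands on the fallback.&lt;/p&gt;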

&lt;p&gt;From there, I check each candidate against a few fairly intuitive rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It should look like a convex quadrilateral;&lt;/li&gt;
&lt;li&gt;Its sides should have plausible proportions;&lt;/li&gt;
&lt;li&gt;Its angles and diagonals should not look too distorted;&lt;/li&gt;
&lt;li&gt;If it leans too heavily on the image borders, that's usually suspicious.&lt;/li&gt;
&lt;/ul&gt;
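&lt;p&gt;A minimal sketch of checks along those lines, in plain Python. The corner ordering is the classic sum/difference trick; the thresholds are illustrative assumptions, not the detector's actual cutoffs.&lt;/p&gt;

```python
import math

def order_corners(pts):
    """Order four points TL, TR, BR, BL: TL has the smallest x+y,
    BR the largest; TR has the smallest y-x, BL the largest."""
    s = sorted(pts, key=lambda p: p[0] + p[1])
    tl, br = s[0], s[-1]
    d = sorted(pts, key=lambda p: p[1] - p[0])
    tr, bl = d[0], d[-1]
    return [tl, tr, br, bl]

def is_convex(quad):
    """Cross products of consecutive edges must all share one sign."""
    signs = []
    for i in range(4):
        ax, ay = quad[i]
        bx, by = quad[(i + 1) % 4]
        cx, cy = quad[(i + 2) % 4]
        cross = (bx - ax) * (cy - by) - (by - ay) * (cx - bx)
        signs.append(cross > 0)
    return all(signs) or not any(signs)

def side_lengths(quad):
    return [math.dist(quad[i], quad[(i + 1) % 4]) for i in range(4)]

def looks_like_document(pts, max_aspect=3.0, max_side_ratio=2.5):
    """Heuristic filter: convexity, plausible proportions, and
    consistency of opposite sides (illustrative thresholds)."""
    quad = order_corners(pts)
    if not is_convex(quad):
        return False
    top, right, bottom, left = side_lengths(quad)
    w = (top + bottom) / 2
    h = (left + right) / 2
    if max(w, h) / max(min(w, h), 1e-9) > max_aspect:
        return False  # implausible proportions for a sheet of paper
    if max(top, bottom) / max(min(top, bottom), 1e-9) > max_side_ratio:
        return False  # opposite sides too inconsistent even for perspective
    if max(left, right) / max(min(left, right), 1e-9) > max_side_ratio:
        return False
    return True
```

&lt;p&gt;Note that the side-ratio check is deliberately loose: a perspective-distorted page passes, while degenerate shapes and extreme aspect ratios get cut.&lt;/p&gt;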

&lt;p&gt;This does not mean the document has to look like a perfect rectangle. In mobile photos, perspective distortion is almost guaranteed. But even with perspective, you can usually tell the difference between a real sheet of paper and some random background object that only vaguely resembles one.&lt;/p&gt;

&lt;p&gt;This stage turned out to be one of the most useful ones for cutting down false positives. Many obvious troublemakers tend to drop out here: boxes, table edges, frames, screens, and other rectangular objects in the scene.&lt;/p&gt;

&lt;h2&gt;About the Pipeline&lt;/h2&gt;

&lt;p&gt;Here's the short version of the pipeline.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Stage&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;What happens&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Algorithms and details&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Why it helps&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Preprocessing&lt;/td&gt;
&lt;td&gt;I generate several versions of the image where the document may stand out in different ways&lt;/td&gt;
&lt;td&gt;Paper mask based on HSV and Lab features, grayscale normalization via top-hat + Otsu, separate adaptive-threshold passes in normal and inverted form, plus Canny&lt;/td&gt;
&lt;td&gt;This gives me multiple hypotheses instead of forcing everything to depend on one fragile segmentation method&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mask cleanup&lt;/td&gt;
&lt;td&gt;I remove small noise, close gaps, and merge nearby regions&lt;/td&gt;
&lt;td&gt;Morphological operations: open / close / dilate&lt;/td&gt;
&lt;td&gt;This makes document regions more coherent and cuts down on random fragments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Region search&lt;/td&gt;
&lt;td&gt;On each pass, I look for contours and turn them into raw candidates&lt;/td&gt;
&lt;td&gt;Contour detection on multiple masks and binarization variants&lt;/td&gt;
&lt;td&gt;Different passes find the document under different conditions, so this helps build a better candidate pool&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Candidate construction&lt;/td&gt;
&lt;td&gt;Each contour is converted into a quadrilateral, or at least something close to one&lt;/td&gt;
&lt;td&gt;approxPolyDP on the original contour with different epsilon values, then on the convex hull; fallback: minAreaRect&lt;/td&gt;
&lt;td&gt;This gives more stable candidates even when contours are noisy, jagged, or partly broken&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shape validation&lt;/td&gt;
&lt;td&gt;I keep only the shapes that still plausibly look like a document&lt;/td&gt;
&lt;td&gt;Corner ordering TL / TR / BR / BL, convexity checks, aspect ratio, consistency of opposite sides and diagonals, handling for frame-border contact&lt;/td&gt;
&lt;td&gt;This filters out a lot of rectangular background clutter and reduces false positives&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Overall scoring&lt;/td&gt;
&lt;td&gt;Each candidate gets a final score, and I pick the best one&lt;/td&gt;
&lt;td&gt;Score built from several metrics: angle rectangularity, photometric contrast inside vs. outside the polygon, closeness to center, preferred area, border-touch penalty, then normalization into confidence&lt;/td&gt;
&lt;td&gt;This makes the final decision less brittle and keeps the detector reasonably explainable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
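&lt;p&gt;The last row, the overall scoring, can be sketched as a weighted sum over normalized metrics with a border-touch penalty. The metric names and weights below are illustrative assumptions, not the detector's actual values; each metric is assumed to be pre-normalized into [0, 1].&lt;/p&gt;

```python
def score_candidate(metrics, weights=None):
    """Combine per-candidate metrics into one confidence value.
    Names and weights are illustrative stand-ins."""
    weights = weights or {
        "rectangularity": 0.35,  # how close the four angles are to right angles
        "contrast": 0.30,        # photometric contrast inside vs. outside
        "centrality": 0.15,      # closeness of the polygon to the image center
        "area": 0.20,            # closeness to a preferred relative area
    }
    score = sum(weights[k] * metrics.get(k, 0.0) for k in weights)
    score -= metrics.get("border_touch_penalty", 0.0)
    return max(0.0, min(1.0, score))  # clamp into a [0, 1] confidence

def pick_best(candidates):
    """Return the candidate with the highest confidence, or None."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: score_candidate(c["metrics"]))
```

&lt;p&gt;Keeping the decision as one continuous score, instead of a chain of hard yes/no gates, is what makes borderline cases degrade gracefully: a slightly imperfect page still outranks a box or a screen.&lt;/p&gt;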

&lt;p&gt;The important part here is that no individual stage has to be perfect. The detector becomes much more reliable because these steps reinforce one another.&lt;/p&gt;

&lt;h2&gt;Why This Approach Worked Better Than I Expected&lt;/h2&gt;

&lt;p&gt;One thing I liked about this pipeline is that it stayed fairly lightweight while still being flexible enough for messy real photos.&lt;/p&gt;

&lt;p&gt;I wasn't trying to make the detector overly clever. I mostly wanted something predictable, explainable, and fast enough to run entirely in the browser. That pushed me toward classical CV methods from the start. They are less fashionable than neural nets, but for this kind of problem they still give you a lot to work with.&lt;/p&gt;

&lt;p&gt;Another thing that helped was treating the final choice as a scoring problem instead of a hard yes-or-no decision at every step. A candidate does not need to be perfect in every way. It just needs to look better overall than the alternatives. In practice, that made the detector behave much more reasonably on borderline cases.&lt;/p&gt;

&lt;h2&gt;Where I'd Love a Second Opinion&lt;/h2&gt;

&lt;p&gt;What I'm most interested in now is where this kind of pipeline still feels brittle: which parts look solid, which checks seem genuinely useful, and which ones feel questionable or overly heuristic.&lt;/p&gt;

&lt;p&gt;And, honestly, real photos are where all of this gets tested properly. Synthetic examples are neat, but actual phone shots tend to reveal weak spots much faster. That's usually where the most useful feedback comes from.&lt;/p&gt;

&lt;p&gt;So, would you like to &lt;a href="https://phonescan.me" rel="noopener noreferrer"&gt;give it a try&lt;/a&gt; with your own samples?&lt;/p&gt;

</description>
      <category>computervision</category>
      <category>algorithms</category>
    </item>
  </channel>
</rss>
