
Ashish Kumar

Finding Duplicate Photos in the Browser (Without Uploading Your Library)

My Downloads folder had the same vacation photos three times — once from Google Photos export, once from a WhatsApp forward, once from a backup script that renamed everything IMG_20240315 (1).jpg. Different names. Same bytes. About 4 GB of waste I only noticed when my laptop started complaining about disk space.

The obvious fix is a "duplicate photo finder." The obvious problem is what most of those tools ask you to do first: upload your entire library.

That felt wrong for photos I would never put on a random SaaS server. So I spent a few weeks building a client-side alternative — DupShelf — and this post is the engineering story behind it: what browsers can actually do today, where they still can't, and how to verify that your files never leave your machine.

DupShelf landing page — private duplicate photo finder that runs in the browser

Why cloud duplicate scanners are a bad default for personal photos

Server-side duplicate finders follow a simple pipeline:

  1. You upload thousands of files.
  2. Their backend hashes or perceptually compares them.
  3. You get a report (and sometimes a "delete all" button).

That made sense in 2012 when JavaScript couldn't read a folder from disk. In 2026 it's mostly inertia.

The costs for you:

  • Privacy — wedding albums, kids' photos, medical scans, work screenshots. Uploading is an act of trust with unclear retention.
  • Time — home upload speeds are asymmetric. Shipping 30 GB up often takes longer than hashing locally.
  • Control — many tools optimize for "free up space now" with aggressive delete flows. One mis-click on a similar-looking burst is worse than keeping a duplicate.

Client-side duplicate finding flips the model: your CPU does the work, your disk is the only storage that matters, and nothing is deleted unless you decide to.

Exact duplicates vs "similar" photos (be precise about what you promise)

Users say "find duplicate photos" but mean two different things:

| What users imagine | What most engineers build first |
| --- | --- |
| Same file copied twice | Exact duplicate — identical bytes |
| Burst shots, crops, re-saved JPEGs | Similar image — perceptual hash, ML embeddings |

Exact matching is deterministic: hash the file, compare hashes, done. Renamed copies, re-exports, and "Copy of…" files still match even when metadata differs.

Similar matching is fuzzy: two portraits taken a second apart might group together; a heavily edited version might not. It needs different algorithms (pHash, CLIP, etc.), more CPU, and more false positives.

DupShelf ships exact duplicates only for now — SHA-256 over file content, grouped by hash. That's intentional: it's the safest first cleanup pass. You only remove files that are provably identical to another file in the set.

When similar-image mode exists, it should be optional and loudly labeled. Mixing "similar" results into an exact-duplicate UI is how tools earn angry reviews.

How local duplicate detection actually works

At a high level the pipeline is boring on purpose:

  1. Enumerate files in a folder (recursive) or from a file picker batch.
  2. Filter to image types the browser can treat as blobs (jpeg, png, webp, etc.).
  3. Hash each file's bytes (DupShelf uses SHA-256 via Web Crypto).
  4. Group files that share a hash.
  5. Review — human picks one keeper per group.
  6. Act — move extras to a subfolder or export a CSV for manual cleanup.

No magic. The interesting parts are browser APIs and UX around large folders.
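
Here's a minimal sketch of steps 3 and 4, assuming you already have File objects from a picker or folder walk. The function names are illustrative, not DupShelf's actual code.

```ts
// Step 3: hash file bytes with Web Crypto's SHA-256.
async function sha256Hex(file: File): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", await file.arrayBuffer());
  return [...new Uint8Array(digest)]
    .map((byte) => byte.toString(16).padStart(2, "0"))
    .join("");
}

// Step 4: group files that share a hash; only buckets with 2+ files are duplicates.
async function groupByHash(files: File[]): Promise<Map<string, File[]>> {
  const groups = new Map<string, File[]>();
  for (const file of files) {
    const hash = await sha256Hex(file);
    groups.set(hash, [...(groups.get(hash) ?? []), file]);
  }
  return new Map([...groups].filter(([, bucket]) => bucket.length > 1));
}
```

Reading each file fully with arrayBuffer() is fine for typical photos; a streaming hash would be the safer choice for very large files.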

Hashing at scale

SHA-256 in the browser is fast enough for real libraries on a desktop. The slow part is often reading files from disk through the File System Access API, not the hash itself.

Practical patterns:

  • Stream or chunk large files if memory is tight (DupShelf reads blobs sized for typical photos).
  • Run hashing off the main thread with Web Workers so the UI stays responsive (see the worker sketch after this list).
  • Show honest progress — "hashing" is the slow phase users feel on 8k+ file folders; hiding that behind a spinner breeds distrust.
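
A minimal worker sketch, assuming a dedicated module worker; the file name hash-worker.ts, the message shape, and updateProgress are made up for illustration.

```ts
// hash-worker.ts (hypothetical): receives a File, replies with its SHA-256.
// Assumes the file is compiled against TypeScript's webworker lib;
// crypto.subtle is available in workers on secure contexts.
self.onmessage = async (event: MessageEvent<File>) => {
  const file = event.data;
  const digest = await crypto.subtle.digest("SHA-256", await file.arrayBuffer());
  const hash = [...new Uint8Array(digest)]
    .map((byte) => byte.toString(16).padStart(2, "0"))
    .join("");
  self.postMessage({ name: file.name, size: file.size, hash });
};

// Main thread (bundler-resolved worker URL): hashing happens off-thread,
// and each reply can drive an honest per-file progress counter.
const worker = new Worker(new URL("./hash-worker.ts", import.meta.url), { type: "module" });
worker.onmessage = ({ data }) => updateProgress(data); // e.g. "2,431 / 8,112 files hashed"
```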

Scanning progress while DupShelf hashes thousands of images locally

Why filenames lie

Backup tools, messengers, and sync clients rename files constantly:

  • photo.jpg → photo (1).jpg
  • DSC_0042.NEF → DSC_0042-copy.NEF
  • Same bytes, new path

Content hashing ignores names. That's the whole point.

The File System Access API: folder scan and safe moves

Chrome and Edge on desktop support picking a directory handle with showDirectoryPicker(). DupShelf walks it recursively, builds a list of image refs, and keeps the handle so you can move non-keepers later.
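
A minimal recursive walk sketch, assuming a Chromium browser with the File System Access API; the ImageRef shape and extension list are mine, not DupShelf's.

```ts
const IMAGE_EXT = /\.(jpe?g|png|webp|gif|avif|heic)$/i;

interface ImageRef {
  name: string;
  handle: FileSystemFileHandle;
  parent: FileSystemDirectoryHandle; // kept so the file can be moved later
}

async function walk(
  dir: FileSystemDirectoryHandle,
  out: ImageRef[] = [],
): Promise<ImageRef[]> {
  // values() is the async iterator over a directory's entries in Chromium;
  // the cast is only needed if your TS lib doesn't include it yet.
  for await (const entry of (dir as any).values() as AsyncIterable<FileSystemHandle>) {
    if (entry.kind === "directory") {
      await walk(entry as FileSystemDirectoryHandle, out);
    } else if (IMAGE_EXT.test(entry.name)) {
      out.push({ name: entry.name, handle: entry as FileSystemFileHandle, parent: dir });
    }
  }
  return out;
}

// Usage: requires a user gesture and a secure context.
// const root = await window.showDirectoryPicker();
// const refs = await walk(root);
```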

Important nuances:

  • Read vs write — scanning needs read access. Moving duplicates into a subfolder needs write permission; the browser may prompt again. That's good — explicit consent beats silent disk writes.
  • Safari / Firefox — no full-folder picker today. Users can still add batches via file input, drag-and-drop, or paste. The tool should say that plainly instead of failing mysteriously.
  • Secure context — folder access requires HTTPS (or localhost during development).

DupShelf never auto-deletes. The write path creates a folder named dupshelf-duplicate-images inside your library and moves extras there in grouped subfolders. You verify in Finder or Explorer, then delete when ready.
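
As a sketch, that write path can be "copy first, remove only after the copy succeeds"; the function name and grouping scheme below are illustrative, not DupShelf's implementation.

```ts
async function moveToDuplicatesFolder(
  root: FileSystemDirectoryHandle,   // the folder the user picked
  parent: FileSystemDirectoryHandle, // directory that currently contains the file
  file: FileSystemFileHandle,        // the non-keeper to move
  groupId: string,                   // e.g. a short prefix of the group's hash
): Promise<void> {
  const dupes = await root.getDirectoryHandle("dupshelf-duplicate-images", { create: true });
  const group = await dupes.getDirectoryHandle(groupId, { create: true });

  // Copy bytes into the duplicates subfolder. The first write triggers the
  // browser's write-permission prompt, which is the explicit consent step.
  const source = await file.getFile();
  const target = await group.getFileHandle(file.name, { create: true });
  const writable = await target.createWritable();
  await writable.write(source);
  await writable.close();

  // Only remove the original once the copy is safely on disk.
  await parent.removeEntry(file.name);
}
```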

OPFS for session restore (optional but nice)

If you've read about the Origin Private File System, it's a good fit for tool state:

  • Scan results (groups, keeper choices, file metadata) can be large.
  • localStorage is the wrong tool — size limits and synchronous JSON stringify hurt.
  • OPFS gives you private, origin-scoped file storage for structured session snapshots.

DupShelf persists completed scans to OPFS so a tab refresh doesn't throw away an hour of hashing. File handles don't survive reloads — you reconnect the same folder to move files again. That's a browser security rule, not a product bug.
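
A minimal OPFS snapshot sketch, assuming a single session.json file; the file name and snapshot shape are made up, and a real tool would version and chunk this.

```ts
interface SessionSnapshot {
  scannedAt: string;
  groups: { hash: string; files: { path: string; size: number }[] }[];
}

async function saveSession(snapshot: SessionSnapshot): Promise<void> {
  const opfsRoot = await navigator.storage.getDirectory(); // origin-private, not user-visible
  const handle = await opfsRoot.getFileHandle("session.json", { create: true });
  const writable = await handle.createWritable();
  await writable.write(JSON.stringify(snapshot));
  await writable.close();
}

async function loadSession(): Promise<SessionSnapshot | null> {
  const opfsRoot = await navigator.storage.getDirectory();
  try {
    const handle = await opfsRoot.getFileHandle("session.json");
    return JSON.parse(await (await handle.getFile()).text());
  } catch {
    return null; // no snapshot saved yet
  }
}
```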

Review UX: the product is the grouping UI

Finding duplicates is table stakes. Reviewing them is the product.

Review screen with duplicate groups, space to free, and move/export actions

What worked in testing:

  • Virtualize the group list — a 900-group library must not render 900 DOM cards at once.
  • One keeper per group — tap a thumbnail to keep it; everything else in the group is a move candidate.
  • Show recoverable space — sum file sizes minus keepers (sketched after this list). People need a number to justify an afternoon of cleanup.
  • CSV export — for users who want a checklist in Excel or a script, not in-app moves.
  • Verify folder — re-enumerate and re-hash after manual changes on disk.
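
The recoverable-space number is a simple sum; here is a sketch with made-up types, not DupShelf's data model.

```ts
interface GroupFile { path: string; size: number }
interface DuplicateGroup { files: GroupFile[]; keeperIndex: number }

// Everything in a group except the keeper is reclaimable.
function recoverableBytes(groups: DuplicateGroup[]): number {
  return groups.reduce((total, group) => {
    const groupBytes = group.files.reduce((sum, f) => sum + f.size, 0);
    return total + groupBytes - group.files[group.keeperIndex].size;
  }, 0);
}
```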

Real use cases (where local exact dedup wins)

Messy Downloads folders — screenshots, memes, and attachments with (1) in the name. Low risk, high reward.

Pre-backup cleanup — dedupe before Time Machine or copying to an external drive. Smaller archives, faster restores.

WhatsApp / Telegram exports — forwarded photos land with new names daily. Exact dedup catches true copies; it won't merge burst shots.

Photographer delivery folders — verify you didn't zip the same export twice before sending to a client.

Privacy-sensitive libraries — anything you legally or personally shouldn't upload to a third-party "free scanner."

DupShelf has longer guides on these scenarios at dupshelf.renderlog.in if you want SEO-shaped deep dives; the workbench itself is at dupshelf.renderlog.in/app.

How to verify a tool really runs locally

Before you point any duplicate finder at a sensitive folder:

  1. Network tab — start a scan, filter by your domain. Image bytes should not POST to the server. Marketing pages may load analytics; the scan itself should not upload files.
  2. Offline test — load the app once, disconnect, refresh. Client-side tools should still open (scanning may need the tab that already has permission).
  3. Read the move contract — does it delete in-app or move to a folder you control?

DupShelf is static-hosted (Next.js on Vercel) with no backend that accepts your photos. The privacy policy describes folder permissions in plain language: dupshelf.renderlog.in/privacy.

Trade-offs (honest list)

Browser memory — hashing 20k RAW files in one tab on an 8 GB machine can hurt. Start with a subfolder.

Exact only — re-saved JPEGs at different quality settings are different bytes. Similar detection is a different product.

Desktop-first — folder scan + move is built for Chrome/Edge on a laptop. Mobile can add files manually; it's not the primary workflow.

Initial JS bundle — a serious tool ships workers, virtualization, and OPFS helpers. First visit costs more than a landing-page-only site. Cache helps after that.

What I shipped (and what I deliberately didn't)

DupShelf is free, no account, no upload. Core loop:

  • Choose folder or add batches
  • Scan with visible progress
  • Review groups, pick keepers
  • Move to dupshelf-duplicate-images or export CSV
  • Session restore + undo last move for folder scans

Deliberately not in v1: cloud sync, AI similarity, auto-delete, mobile folder access, or upsell walls. Those are either privacy regressions or scope explosions.

If you maintain a photo library on disk and you've been putting off cleanup because cloud tools feel gross, try a local pass first. Exact duplicates are the safest win — you'll know the bytes matched before you move anything.

Try it: dupshelf.renderlog.in/app

How it works (marketing): dupshelf.renderlog.in


I build browser-based tools under the Renderlog umbrella — client-side when the platform allows, honest about limits when it doesn't. DupShelf is the duplicate-photo piece; if this post helped, the workflow might save you a few gigabytes this weekend.


If this was useful, I've also built a handful of other free, browser-based tools under the same umbrella — no signup, no uploads, everything runs client-side.
