Ilya Sib

Posted on Apr 21

How I Built a Zero-Server PII Scrubber for ChatGPT (It Works in Airplane Mode)

#security #webdev #ai #privacy

I got tired of seeing this workflow in every company I consulted for:

1. Employee has a sensitive document (client contract, patient report, HR file)
2. Employee needs AI help analyzing it
3. Employee pastes the entire thing into ChatGPT
4. Compliance team has a heart attack

The usual advice — "just remove sensitive data before pasting" — doesn't work. It's too manual, too slow, and people simply don't do it under deadline pressure.

So I built PrivacyScrubber — a 100% local, browser-based PII redactor that works as a step between your document and your AI model.

The Core Architecture

The entire engine runs in client-side JavaScript. No server. No API calls. No logs.

Why this matters: You can literally disconnect from the internet and it still works. I call this the "Airplane Mode test": if a tool breaks offline, it depends on a server somewhere, and your data is probably going there too.

Two-Pass Tokenization

The PII engine uses a two-pass approach:

Pass 1 — Detection: Scan the text for all pattern matches across 60+ entity types.

Pass 2 — Conflict Resolution: Overlapping spans are resolved by specificity priority. If an SSN match overlaps a phone number match, the more specific type (SSN) wins. A confidence score then filters out likely false positives.
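
To make the two passes concrete, here is a minimal sketch of the idea, not the actual PrivacyScrubber engine. The detector table, the priority numbers, and the function names `detect` and `resolve` are all assumptions made up for illustration, and the PHONE pattern is deliberately loose so that it overlaps with the SSN pattern. Confidence scoring is left out of this sketch.

```javascript
// Hypothetical detector table: patterns and priority values are illustrative only.
// PHONE is deliberately loose so that it also fires on SSN-shaped strings.
const DETECTORS = [
  { type: "SSN", priority: 3, pattern: /\b\d{3}-\d{2}-\d{4}\b/g },
  { type: "PHONE", priority: 2, pattern: /\b\d[\d-]{8,12}\d\b/g },
];

// Pass 1: collect every match from every detector, overlaps included.
function detect(text) {
  const spans = [];
  for (const { type, priority, pattern } of DETECTORS) {
    for (const m of text.matchAll(pattern)) {
      spans.push({
        type,
        priority,
        start: m.index,
        end: m.index + m[0].length,
        value: m[0],
      });
    }
  }
  return spans;
}

// Pass 2: walk the spans from most to least specific and keep a span only
// if it does not overlap one that was already kept.
function resolve(spans) {
  const ranked = [...spans].sort(
    (a, b) =>
      b.priority - a.priority ||               // higher priority first
      (b.end - b.start) - (a.end - a.start) || // then longer spans
      a.start - b.start                        // then earlier spans
  );
  const kept = [];
  for (const s of ranked) {
    const overlaps = kept.some((k) => s.start < k.end && k.start < s.end);
    if (!overlaps) kept.push(s);
  }
  return kept.sort((a, b) => a.start - b.start);
}

const sample = "SSN 123-45-6789, phone 415-555-0192";
console.log(resolve(detect(sample)).map((s) => `${s.type}:${s.value}`));
// → [ 'SSN:123-45-6789', 'PHONE:415-555-0192' ]
```

In this sample the loose PHONE pattern also matches the SSN string, so pass 1 returns three spans and pass 2 drops the lower-priority duplicate.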

Reversible Redaction

This is the key feature that makes it actually useful for AI workflows rather than just document archiving.

Instead of permanently deleting PII, we replace it with typed tokens:

Input:  "Patient John Smith (DOB 03/15/1978, SSN 123-45-6789) visited Dr. Chen at 415-555-0192"

Output: "Patient [NAME_1] (DOB [DATE_1], SSN [SSN_1]) visited [NAME_2] at [PHONE_1]"

The mapping is stored locally in the browser. After the AI responds, paste the response into the "Reveal" panel and every token is swapped back for its original value, again locally with no server involved.
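
The round trip can be sketched in a few lines. This is an illustration under assumptions, not the shipped code: `redact`, `reveal`, and the `spanOf` helper are hypothetical names, the spans are built by hand instead of coming from a detector, and the mapping lives in an in-memory `Map` rather than in browser storage.

```javascript
// Replace each detected span with a typed, numbered token and remember
// which original value each token stands for.
function redact(text, spans) {
  const counters = {};            // per-type running counter, e.g. NAME -> 2
  const tokenByValue = new Map(); // a repeated value reuses its first token
  const mapping = new Map();      // token -> original value
  let out = "";
  let cursor = 0;
  for (const s of [...spans].sort((a, b) => a.start - b.start)) {
    const key = `${s.type}:${s.value}`;
    let token = tokenByValue.get(key);
    if (!token) {
      counters[s.type] = (counters[s.type] || 0) + 1;
      token = `[${s.type}_${counters[s.type]}]`;
      tokenByValue.set(key, token);
      mapping.set(token, s.value);
    }
    out += text.slice(cursor, s.start) + token;
    cursor = s.end;
  }
  return { redacted: out + text.slice(cursor), mapping };
}

// Swap known tokens back to their originals; unknown tokens are left alone.
function reveal(text, mapping) {
  return text.replace(/\[[A-Z]+_\d+\]/g, (t) =>
    mapping.has(t) ? mapping.get(t) : t
  );
}

// Hypothetical helper that stands in for a real detector in this example.
function spanOf(text, type, value) {
  const start = text.indexOf(value);
  return { type, value, start, end: start + value.length };
}

const input =
  "Patient John Smith (DOB 03/15/1978, SSN 123-45-6789) visited Dr. Chen at 415-555-0192";
const spans = [
  spanOf(input, "NAME", "John Smith"),
  spanOf(input, "DATE", "03/15/1978"),
  spanOf(input, "SSN", "123-45-6789"),
  spanOf(input, "NAME", "Dr. Chen"),
  spanOf(input, "PHONE", "415-555-0192"),
];

const { redacted, mapping } = redact(input, spans);
console.log(redacted);
// → Patient [NAME_1] (DOB [DATE_1], SSN [SSN_1]) visited [NAME_2] at [PHONE_1]
console.log(reveal("Follow up with [NAME_1] at [PHONE_1].", mapping));
// → Follow up with John Smith at 415-555-0192.
```

Because the tokens carry their type, the model still knows that `[NAME_1]` is a person and `[DATE_1]` is a date, which is what keeps its answer usable after the reveal step.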

The Hard Parts

False Positive Hell

The biggest challenge wasn't detecting PII — it was avoiding false positives.

Examples of things that pattern-match as PII but aren't:

"Version 1.2.3" → looks like a partial SSN pattern
"Node.js" → matches partial email patterns
"Dr. Smith Act" → "Dr." prefix triggers name detection

Solution: a confidence scoring system that weights context, entity type frequency, and co-occurrence patterns.
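
As a rough illustration of the context part only, the sketch below raises or lowers a base score depending on the words around a match. Everything in it is an assumption: the keyword lists, the weights, the window size, the threshold, and the names `score` and `filterByConfidence` are invented for the example, and entity type frequency and co-occurrence are not modeled.

```javascript
// Made-up context hints per entity type. A real system would need far
// richer signals; these lists only serve the two examples below.
const CONTEXT_HINTS = {
  SSN: {
    boost: /\b(ssn|social security)\b/i,
    penalize: /\b(version|build|release)\b/i,
  },
  NAME: {
    boost: /\b(patient|mr|mrs|ms|visited)\b/i,
    penalize: /\b(act|law|bill)\b/i,
  },
};

// Score a span from the text within `window` characters on either side.
function score(text, span, base = 0.5, window = 20) {
  const hints = CONTEXT_HINTS[span.type];
  if (!hints) return base;
  const before = text.slice(Math.max(0, span.start - window), span.start);
  const after = text.slice(span.end, span.end + window);
  const context = before + " " + after;
  let s = base;
  if (hints.boost.test(context)) s += 0.3;    // supporting words nearby
  if (hints.penalize.test(context)) s -= 0.3; // words that suggest a non-PII use
  return Math.min(1, Math.max(0, s));
}

// Keep only the spans whose score reaches the threshold.
function filterByConfidence(text, spans, threshold = 0.5) {
  return spans.filter((span) => score(text, span) >= threshold);
}

const lawText = "The Dr. Smith Act passed in 1999";
const lawSpan = { type: "NAME", start: 4, end: 13, value: "Dr. Smith" };
console.log(filterByConfidence(lawText, [lawSpan]).length); // → 0

const clinicText = "Patient John Smith visited Dr. Chen";
const clinicSpan = { type: "NAME", start: 8, end: 18, value: "John Smith" };
console.log(filterByConfidence(clinicText, [clinicSpan]).length); // → 1
```

The "Dr. Smith Act" match is dropped because "Act" follows it, while "John Smith" survives because "Patient" precedes it.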

The Chrome Extension Problem

The Chrome extension intercepts the text in ChatGPT's input field when you press Enter, before the message is sent. ChatGPT uses React with a contenteditable div, not a standard <input>, so the text must be intercepted at the keydown event level.

The solution uses a MutationObserver on the input container plus a keydown listener registered in the capture phase, which catches the Enter key before React's synthetic event system handles it.
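
A content script along these lines could look like the sketch below. Treat it as a sketch under assumptions: the selector is a guess (ChatGPT's real markup is not given here and changes over time), `scrub` is a placeholder for whatever redaction function you plug in, and overwriting `textContent` may not be enough to update the state of a real rich-text editor. Blocking the first Enter so the user can review the scrubbed text and press Enter again is a design choice of this example, not a description of the extension.

```javascript
// Hypothetical selector for the composer element.
const COMPOSER_SELECTOR = "[contenteditable='true']";

// Build a keydown handler that scrubs the composer text before it is sent.
function makeEnterInterceptor(scrub) {
  return function onKeydown(event) {
    // Shift+Enter is a newline, and IME composition also produces Enter.
    if (event.key !== "Enter" || event.shiftKey || event.isComposing) return;
    const target = event.target;
    const composer =
      target && target.closest ? target.closest(COMPOSER_SELECTOR) : null;
    if (!composer) return;

    const original = composer.textContent;
    const cleaned = scrub(original);
    if (cleaned === original) return; // nothing to redact, let the send proceed

    // Stop this Enter from reaching the page's own handlers.
    event.preventDefault();
    event.stopImmediatePropagation();
    composer.textContent = cleaned;
    // Notify the page that the content changed.
    composer.dispatchEvent(new Event("input", { bubbles: true }));
  };
}

// Register on the document in the capture phase (third argument `true`),
// which runs before listeners attached further down the tree.
function install(doc, scrub) {
  doc.addEventListener("keydown", makeEnterInterceptor(scrub), true);
}

// Report the composer whenever the page re-renders and it (re)appears.
// MutationObserver is a browser API, so this part only runs in a page.
function watchComposer(doc, onFound) {
  const observer = new MutationObserver(() => {
    const el = doc.querySelector(COMPOSER_SELECTOR);
    if (el) onFound(el);
  });
  observer.observe(doc.body, { childList: true, subtree: true });
  return observer;
}
```

In a content script this would be wired up with something like `install(document, myScrubFunction)`, where `myScrubFunction` is your own redaction routine. On the second Enter the text contains only tokens, so the handler lets the event through untouched.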

What I Learned

Ship the boring compliance stuff first. The HIPAA/GDPR angle unlocks enterprise conversations.
The Airplane Mode test is your marketing. "Try it with WiFi off" is viscerally convincing.
False positive rate matters more than recall. A PII tool that replaces half the words is useless for AI workflows.

Current Stack

Vanilla JS (no framework — Chrome extensions hate React)
WebAssembly Tesseract for offline PDF OCR
OffscreenCanvas API for in-browser image processing
IndexedDB for PRO token persistence between sessions

If you're building something similar or have questions about the conflict resolution algorithm, I'm happy to dig in.

Try PrivacyScrubber — free, no signup, works offline.

DEV Community