I got tired of seeing this workflow in every company I consulted for:
- Employee has a sensitive document (client contract, patient report, HR file)
- Employee needs AI help analyzing it
- Employee pastes the entire thing into ChatGPT
- Compliance team has a heart attack
The usual advice — "just remove sensitive data before pasting" — doesn't work. It's too manual, too slow, and people simply don't do it under deadline pressure.
So I built PrivacyScrubber — a 100% local, browser-based PII redactor that works as a step between your document and your AI model.
The Core Architecture
The entire engine runs in client-side JavaScript. No server. No API calls. No logs.
Why this matters: You can literally disconnect from the internet and it still works. I call this the "Airplane Mode test": if a tool stops working offline, it depends on a server, which means your data may be leaving your machine.
Two-Pass Tokenization
The PII engine uses a two-pass approach:
Pass 1 (Detection): Scan the text for all pattern matches across 60+ entity types.
Pass 2 (Conflict Resolution): Overlapping spans are resolved by specificity priority. For example, when an SSN match sits inside a looser phone number match, the more specific type (SSN) wins. A confidence score then filters out likely false positives.
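Here's a minimal sketch of the two passes, not the production engine. The three-field detector shape, the `priority` numbers, and both regexes are assumptions for illustration; the phone pattern is deliberately loose so that it also matches SSN-shaped strings and creates a conflict to resolve.

```javascript
// Hypothetical detectors: a higher priority means a more specific entity type.
const DETECTORS = [
  { type: "SSN", priority: 3, regex: /\b\d{3}-\d{2}-\d{4}\b/g },
  // Deliberately loose: also matches the 3-2-4 digit shape of an SSN.
  { type: "PHONE", priority: 2, regex: /\b\d{3}[-.]?\d{2,3}[-.]?\d{4}\b/g },
];

// Pass 1: collect every match from every detector, overlaps included.
function detect(text) {
  const spans = [];
  for (const { type, priority, regex } of DETECTORS) {
    for (const m of text.matchAll(regex)) {
      spans.push({
        type,
        priority,
        start: m.index,
        end: m.index + m[0].length,
        value: m[0],
      });
    }
  }
  return spans;
}

// Pass 2: rank by priority (then length, then position) and keep a span
// only if it does not overlap one that has already been accepted.
function resolve(spans) {
  const ranked = [...spans].sort(
    (a, b) =>
      b.priority - a.priority ||
      (b.end - b.start) - (a.end - a.start) ||
      a.start - b.start
  );
  const kept = [];
  for (const s of ranked) {
    const overlaps = kept.some((k) => s.start < k.end && k.start < s.end);
    if (!overlaps) kept.push(s);
  }
  return kept.sort((a, b) => a.start - b.start);
}

const resolved = resolve(detect("SSN 123-45-6789, call 415-555-0192"));
console.log(resolved.map((s) => `${s.type}:${s.value}`));
```

The greedy "highest priority first" pass is quadratic in the number of spans, which is fine for a pasted document; an interval tree would be the next step if that ever became a bottleneck.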
Reversible Redaction
This is the key feature that makes it actually useful for AI workflows rather than just document archiving.
Instead of permanently deleting PII, we replace it with typed tokens:
Input: "Patient John Smith (DOB 03/15/1978, SSN 123-45-6789) visited Dr. Chen at 415-555-0192"
Output: "Patient [NAME_1] (DOB [DATE_1], SSN [SSN_1]) visited [NAME_2] at [PHONE_1]"
The mapping is stored locally in the browser. After the AI responds, paste the response into the "Reveal" panel and each token is swapped back for its original value, again locally with no server involved.
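A rough sketch of the redact/reveal round trip, assuming the spans have already been detected. The function names `redact` and `reveal`, the span shape, and the in-memory `Map` are illustrative assumptions, not the actual storage layer.

```javascript
// Replace each span with a typed, numbered token and remember the original.
// The same value of the same type always gets the same token.
function redact(text, spans) {
  const counters = {};
  const tokenByValue = new Map(); // "TYPE:value" -> token
  const mapping = new Map(); // token -> original value
  let out = "";
  let cursor = 0;
  for (const s of [...spans].sort((a, b) => a.start - b.start)) {
    const value = text.slice(s.start, s.end);
    const key = `${s.type}:${value}`;
    let token = tokenByValue.get(key);
    if (!token) {
      counters[s.type] = (counters[s.type] || 0) + 1;
      token = `[${s.type}_${counters[s.type]}]`;
      tokenByValue.set(key, token);
      mapping.set(token, value);
    }
    out += text.slice(cursor, s.start) + token;
    cursor = s.end;
  }
  return { text: out + text.slice(cursor), mapping };
}

// Swap known tokens back; unknown bracketed strings are left untouched.
function reveal(text, mapping) {
  return text.replace(/\[[A-Z]+_\d+\]/g, (t) => mapping.get(t) ?? t);
}

const input =
  "Patient John Smith (DOB 03/15/1978, SSN 123-45-6789) visited Dr. Chen at 415-555-0192";

// Hand-built spans for the demo; in practice these come from detection.
const span = (type, value) => {
  const start = input.indexOf(value);
  return { type, start, end: start + value.length };
};
const spans = [
  span("NAME", "John Smith"),
  span("DATE", "03/15/1978"),
  span("SSN", "123-45-6789"),
  span("NAME", "Dr. Chen"),
  span("PHONE", "415-555-0192"),
];

const { text: redacted, mapping } = redact(input, spans);
console.log(redacted);
console.log(reveal("Follow up with [NAME_2] about [NAME_1].", mapping));
```

Reusing one token per distinct value matters for AI workflows: the model can tell that two mentions refer to the same person without ever seeing the name.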
The Hard Parts
False Positive Hell
The biggest challenge wasn't detecting PII — it was avoiding false positives.
Examples of things that pattern-match as PII but aren't:
- "Version 1.2.3" → looks like a partial SSN pattern
- "Node.js" → matches partial email patterns
- "Dr. Smith Act" → "Dr." prefix triggers name detection
Solution: a confidence scoring system that weights context, entity type frequency, and co-occurrence patterns.
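To make the idea concrete, here's a simplified sketch that covers only the context part of the scoring (no frequency or co-occurrence signals). The cue words, the weights, the 20-character window, and the 0.5 threshold are all made-up values for illustration.

```javascript
// Hypothetical context cues per entity type. Nearby "boost" words raise
// confidence, nearby "penalty" words lower it.
const CUES = {
  SSN: { boost: /\b(ssn|social security)\b/i, penalty: /\b(version|build|release)\b/i },
  NAME: { boost: /\b(patient|visited|mr|mrs|ms)\b/i, penalty: /\b(act|law|bill)\b/i },
};

// Score one candidate span by looking at the text just before and after it.
function confidence(text, span, base = 0.5, windowSize = 20) {
  const before = text.slice(Math.max(0, span.start - windowSize), span.start);
  const after = text.slice(span.end, span.end + windowSize);
  const context = `${before} ${after}`;
  const cue = CUES[span.type];
  let score = base;
  if (cue && cue.boost.test(context)) score += 0.3;
  if (cue && cue.penalty.test(context)) score -= 0.4;
  return Math.min(1, Math.max(0, score));
}

// Keep only candidates that clear the threshold.
function filterByConfidence(text, spans, threshold = 0.5) {
  return spans.filter((s) => confidence(text, s) >= threshold);
}

// "Dr. Smith" followed by "Act" is penalized and dropped.
const statute = "Dr. Smith Act";
console.log(filterByConfidence(statute, [{ type: "NAME", start: 0, end: 9 }]).length);

// "Dr. Chen" preceded by "Patient visited" is boosted and kept.
const visit = "Patient visited Dr. Chen today";
const at = visit.indexOf("Dr. Chen");
console.log(filterByConfidence(visit, [{ type: "NAME", start: at, end: at + 8 }]).length);
```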
The Chrome Extension Problem
The Chrome Extension intercepts text in ChatGPT's input field before the message is sent. ChatGPT uses React with a contenteditable div, not a standard <input>, so the text has to be intercepted at the keydown event level.
The solution uses a MutationObserver on the input container plus a capture-phase keydown listener that catches the Enter key before React's synthetic event system sees it.
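A stripped-down sketch of that interception pattern, for a content script that runs after the page has loaded. The `[contenteditable="true"]` selector, the `makeEnterInterceptor` helper, and the one-regex `scrub` stand-in are assumptions for illustration; the real page structure can change at any time.

```javascript
// Build a keydown handler. `scrub` is a stand-in for the redaction engine;
// `getField` returns the current prompt element, or null if none is found.
function makeEnterInterceptor(scrub, getField) {
  return function onKeydown(event) {
    // Ignore everything except a plain Enter (not Shift+Enter, not IME input).
    if (event.key !== "Enter" || event.shiftKey || event.isComposing) return;
    const field = getField();
    if (!field || !field.contains(event.target)) return;

    const original = field.textContent;
    const cleaned = scrub(original);
    if (cleaned === original) return; // nothing to redact, let the send proceed

    // Block this send and show the scrubbed text so the user can review it
    // and press Enter again. Whether the page's editor picks up a direct
    // textContent change depends on how that editor is implemented.
    event.preventDefault();
    event.stopImmediatePropagation();
    field.textContent = cleaned;
  };
}

// Browser-only wiring; skipped when there is no DOM.
if (typeof document !== "undefined" && document.body) {
  let promptField = null;
  const findField = () => {
    // Assumed selector, for illustration only.
    promptField = document.querySelector('[contenteditable="true"]');
  };
  findField();

  // The input container can be re-rendered at any time, so re-locate it
  // whenever the DOM changes.
  new MutationObserver(findField).observe(document.body, {
    childList: true,
    subtree: true,
  });

  const scrub = (text) => text.replace(/\b\d{3}-\d{2}-\d{4}\b/g, "[SSN_1]");

  // A capture-phase listener on window runs before listeners registered
  // further down the tree, which is where the page's own handlers live.
  window.addEventListener(
    "keydown",
    makeEnterInterceptor(scrub, () => promptField),
    true
  );
}
```

Keeping the handler as a plain function that takes `scrub` and `getField` makes it easy to unit test with a fake event object, without loading a browser.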
What I Learned
Ship the boring compliance stuff first. The HIPAA/GDPR angle unlocks enterprise conversations.
The Airplane Mode test is your marketing. "Try it with WiFi off" is viscerally convincing.
False positive rate matters more than recall. A PII tool that replaces half the words is useless for AI workflows.
Current Stack
- Vanilla JS (no framework — Chrome extensions hate React)
- WebAssembly Tesseract for offline PDF OCR
- OffscreenCanvas API for in-browser image processing
- IndexedDB for PRO token persistence between sessions
If you're building something similar or have questions about the conflict resolution algorithm, I'm happy to dig in.
Try PrivacyScrubber — free, no signup, works offline.