<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ilya Sib</title>
    <description>The latest articles on DEV Community by Ilya Sib (@moxno).</description>
    <link>https://dev.to/moxno</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3839014%2Fda567137-35bc-41fa-8e71-0d4b71cf58cd.jpeg</url>
      <title>DEV Community: Ilya Sib</title>
      <link>https://dev.to/moxno</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/moxno"/>
    <language>en</language>
    <item>
      <title>How I Built a Zero-Server PII Scrubber for ChatGPT (It Works in Airplane Mode)</title>
      <dc:creator>Ilya Sib</dc:creator>
      <pubDate>Tue, 21 Apr 2026 05:22:16 +0000</pubDate>
      <link>https://dev.to/moxno/how-i-built-a-zero-server-pii-scrubber-for-chatgpt-it-works-in-airplane-mode-21gl</link>
      <guid>https://dev.to/moxno/how-i-built-a-zero-server-pii-scrubber-for-chatgpt-it-works-in-airplane-mode-21gl</guid>
      <description>&lt;p&gt;I got tired of seeing this workflow in every company I consulted for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Employee has a sensitive document (client contract, patient report, HR file)&lt;/li&gt;
&lt;li&gt;Employee needs AI help analyzing it&lt;/li&gt;
&lt;li&gt;Employee pastes the entire thing into ChatGPT&lt;/li&gt;
&lt;li&gt;Compliance team has a heart attack&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The usual advice — "just remove sensitive data before pasting" — doesn't work. It's too manual, too slow, and people simply don't do it under deadline pressure.&lt;/p&gt;

&lt;p&gt;So I built &lt;a href="https://www.privacyscrubber.com" rel="noopener noreferrer"&gt;PrivacyScrubber&lt;/a&gt; — a 100% local, browser-based PII redactor that works as a step between your document and your AI model.&lt;/p&gt;

&lt;h2&gt;The Core Architecture&lt;/h2&gt;

&lt;p&gt;The entire engine runs in client-side JavaScript. No server. No API calls. No logs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; You can literally disconnect from the internet and it still works. I call this the "Airplane Mode test" — if the tool breaks offline, it's sending your data somewhere.&lt;/p&gt;

&lt;h3&gt;Two-Pass Tokenization&lt;/h3&gt;

&lt;p&gt;The PII engine uses a two-pass approach:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pass 1 — Detection:&lt;/strong&gt; Scan the text for all pattern matches across 60+ entity types.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pass 2 — Conflict Resolution:&lt;/strong&gt; Overlapping spans are resolved by specificity priority: when an SSN match falls inside a phone-number match, the more specific type (SSN) wins. Confidence scoring then filters out false positives.&lt;/p&gt;
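&lt;p&gt;A minimal sketch of that resolution step, assuming a priority table per entity type (the types and numbers here are invented for illustration; the real engine also weighs confidence):&lt;/p&gt;

```javascript
// Illustrative sketch of Pass 2: resolve overlapping spans by specificity.
// Priority values are invented for the example; higher means more specific.
const PRIORITY = { SSN: 3, PHONE: 2, DATE: 1, NAME: 1 };

function resolveOverlaps(matches) {
  // Most specific first, so specific types claim their span before generic ones.
  const ordered = [...matches].sort(
    (a, b) => (PRIORITY[b.type] ?? 0) - (PRIORITY[a.type] ?? 0)
  );
  const kept = [];
  for (const m of ordered) {
    // Two spans overlap when the smaller end exceeds the larger start.
    const overlaps = kept.some(
      (k) => Math.min(m.end, k.end) > Math.max(m.start, k.start)
    );
    if (!overlaps) kept.push(m);
  }
  return kept.sort((a, b) => a.start - b.start);
}
```

&lt;p&gt;Sorting by specificity first means a generic match can never shadow a more specific one, regardless of the order the detectors ran in.&lt;/p&gt;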

&lt;h3&gt;Reversible Redaction&lt;/h3&gt;

&lt;p&gt;This is the key feature that makes it actually useful for AI workflows rather than just document archiving.&lt;/p&gt;

&lt;p&gt;Instead of permanently deleting PII, we replace it with typed tokens:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input:  "Patient John Smith (DOB 03/15/1978, SSN 123-45-6789) visited Dr. Chen at 415-555-0192"

Output: "Patient [NAME_1] (DOB [DATE_1], SSN [SSN_1]) visited [NAME_2] at [PHONE_1]"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The mapping is stored locally in the browser. After the AI responds, paste the response into the "Reveal" panel and all tokens are replaced back with originals — locally, no server involved.&lt;/p&gt;
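&lt;p&gt;The redact/reveal round trip can be sketched like this. The single email regex stands in for the full engine's 60+ entity types, and the function names are mine, not the actual internals:&lt;/p&gt;

```javascript
// Minimal sketch of reversible redaction: replace PII with typed tokens,
// keep the mapping locally, and restore originals from the AI's response.
// Detection here is a toy email regex; the real engine covers 60+ types.
function redact(text) {
  const map = new Map();            // token -> original, never leaves the browser
  const counters = {};
  const emailRe = /[\w.+-]+@[\w-]+\.[\w.]+/g;
  const scrubbed = text.replace(emailRe, (original) => {
    counters.EMAIL = (counters.EMAIL ?? 0) + 1;
    const token = `[EMAIL_${counters.EMAIL}]`;
    map.set(token, original);
    return token;
  });
  return { scrubbed, map };
}

function reveal(response, map) {
  let out = response;
  for (const [token, original] of map) {
    out = out.split(token).join(original);  // replace every occurrence
  }
  return out;
}
```

&lt;p&gt;Because the &lt;code&gt;Map&lt;/code&gt; is an ordinary in-memory object, closing the tab destroys it, which is exactly the behavior you want here.&lt;/p&gt;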

&lt;h2&gt;The Hard Parts&lt;/h2&gt;

&lt;h3&gt;False Positive Hell&lt;/h3&gt;

&lt;p&gt;The biggest challenge wasn't detecting PII — it was avoiding false positives.&lt;/p&gt;

&lt;p&gt;Examples of things that pattern-match as PII but aren't:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Version 1.2.3" → looks like a partial SSN pattern&lt;/li&gt;
&lt;li&gt;"Node.js" → matches partial email patterns&lt;/li&gt;
&lt;li&gt;"Dr. Smith Act" → "Dr." prefix triggers name detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Solution: a confidence scoring system that weights context, entity type frequency, and co-occurrence patterns.&lt;/p&gt;
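&lt;p&gt;The real scoring model isn't shown here, but a toy version of the idea, with invented weights, looks like:&lt;/p&gt;

```javascript
// Toy confidence scorer: a raw pattern hit starts below the accept threshold
// and only passes if the surrounding context supports it. Weights are invented.
function ssnConfidence(candidate, context) {
  let score = 0.5;                                      // bare digit pattern alone
  if (/\b(ssn|social security)\b/i.test(context)) score += 0.3;
  if (/\b(version|v\d)/i.test(context)) score -= 0.4;   // "Version 1.2.3" lookalikes
  if (/^\d{3}-\d{2}-\d{4}$/.test(candidate)) score += 0.1;
  return Math.max(0, Math.min(1, score));
}

const THRESHOLD = 0.6;   // below this, the hit is dropped as a false positive
```

&lt;p&gt;Raising the threshold trades recall for precision; for AI workflows, precision usually wins.&lt;/p&gt;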

&lt;h3&gt;The Chrome Extension Problem&lt;/h3&gt;

&lt;p&gt;The Chrome Extension intercepts text in ChatGPT's input field &lt;em&gt;before&lt;/em&gt; you hit Enter. ChatGPT uses React with a &lt;code&gt;contenteditable&lt;/code&gt; div, not a standard &lt;code&gt;&amp;lt;input&amp;gt;&lt;/code&gt;. Text must be intercepted at the &lt;code&gt;keydown&lt;/code&gt; event level.&lt;/p&gt;

&lt;p&gt;The solution uses a MutationObserver on the input container plus &lt;code&gt;keydown&lt;/code&gt; event capture to catch the Enter key before React's synthetic event system.&lt;/p&gt;
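&lt;p&gt;A stripped-down version of that interception. The selector and the &lt;code&gt;scrub&lt;/code&gt; callback are placeholders, and ChatGPT's DOM will drift:&lt;/p&gt;

```javascript
// Pure decision logic, kept separate so it is testable outside a browser.
function shouldIntercept(key, shiftKey, insidePromptBox) {
  if (key !== "Enter") return false;
  if (shiftKey) return false;            // Shift+Enter still inserts a newline
  return insidePromptBox;
}

// Browser-side wiring (sketch). The selector is a guess at ChatGPT's current
// DOM; the real extension also re-locates the composer with a MutationObserver
// whenever React re-renders it.
function attachInterceptor(doc, promptSelector, scrub) {
  doc.addEventListener("keydown", (event) => {
    const box = event.target.closest(promptSelector);
    if (!shouldIntercept(event.key, event.shiftKey, Boolean(box))) return;
    event.stopImmediatePropagation();    // React's root listener never fires
    event.preventDefault();
    box.innerText = scrub(box.innerText);
  }, true);                              // capture phase: runs before bubbling
}
```

&lt;p&gt;The third argument to &lt;code&gt;addEventListener&lt;/code&gt; is what makes this work: capture-phase listeners fire before React's delegated listeners at the root.&lt;/p&gt;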

&lt;h2&gt;What I Learned&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ship the boring compliance stuff first.&lt;/strong&gt; The HIPAA/GDPR angle unlocks enterprise conversations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Airplane Mode test is your marketing.&lt;/strong&gt; "Try it with WiFi off" is viscerally convincing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;False positive rate matters more than recall.&lt;/strong&gt; A PII tool that replaces half the words is useless for AI workflows.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;Current Stack&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Vanilla JS (no framework — Chrome extensions hate React)&lt;/li&gt;
&lt;li&gt;WebAssembly Tesseract for offline PDF OCR&lt;/li&gt;
&lt;li&gt;OffscreenCanvas API for in-browser image processing&lt;/li&gt;
&lt;li&gt;IndexedDB for PRO token persistence between sessions&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you're building something similar or have questions about the conflict resolution algorithm, I'm happy to dig in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.privacyscrubber.com" rel="noopener noreferrer"&gt;Try PrivacyScrubber&lt;/a&gt; — free, no signup, works offline.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>privacy</category>
      <category>security</category>
    </item>
    <item>
      <title>Why Pasting Client Data into ChatGPT is a GDPR Liability (and the Fix)</title>
      <dc:creator>Ilya Sib</dc:creator>
      <pubDate>Mon, 23 Mar 2026 09:32:21 +0000</pubDate>
      <link>https://dev.to/moxno/why-pasting-client-data-into-chatgpt-is-a-gdpr-liability-and-the-fix-2ajm</link>
      <guid>https://dev.to/moxno/why-pasting-client-data-into-chatgpt-is-a-gdpr-liability-and-the-fix-2ajm</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Published as:&lt;/strong&gt; Ilya, Founder of PrivacyScrubber — privacyscrubber.com&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every week, I watch legal teams, HR professionals, and developers do something that makes compliance officers lose sleep: they paste client files — contracts, resumes, medical records, support tickets — straight into ChatGPT to summarize, draft, or analyze them.&lt;/p&gt;

&lt;p&gt;I get it. ChatGPT is genuinely useful. The problem isn't the AI. The problem is the data that rides along with your prompt.&lt;/p&gt;




&lt;h2&gt;The Legal Reality Nobody Talks About&lt;/h2&gt;

&lt;p&gt;Let me be specific. Under &lt;strong&gt;GDPR Article 28&lt;/strong&gt;, if you use an AI assistant to process personal data on behalf of clients or employees, you need a &lt;strong&gt;Data Processing Agreement (DPA)&lt;/strong&gt; with that AI provider. OpenAI offers a DPA — but only on their API (not ChatGPT Free), and you still bear the burden of proving lawful processing.&lt;/p&gt;

&lt;p&gt;More critically: &lt;strong&gt;Article 5(1)(f)&lt;/strong&gt; requires that personal data be processed with "appropriate security... and protection against unauthorised or unlawful processing." Pasting an unredacted client contract into a third-party AI system is hard to square with that requirement.&lt;/p&gt;

&lt;p&gt;This isn't hypothetical. The Italian data protection authority (the Garante) temporarily banned ChatGPT in 2023 specifically over data processing transparency concerns. The EU AI Act, coming into force through 2026, adds enforcement teeth to AI data processing requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Even if you have a DPA&lt;/strong&gt;, you're still on the hook for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensuring the data you send is appropriate to send at all&lt;/li&gt;
&lt;li&gt;Logging and auditing what personal data left your environment&lt;/li&gt;
&lt;li&gt;Honoring data subject rights (erasure, portability) for anything processed in the AI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The moment a client's name, email, or ID lands in someone else's inference pipeline, your compliance posture weakens.&lt;/p&gt;




&lt;h2&gt;The "Incognito Chat" Illusion&lt;/h2&gt;

&lt;p&gt;"But I turned on ChatGPT's privacy mode / temporary chat / incognito mode — my data isn't used for training."&lt;/p&gt;

&lt;p&gt;True: OpenAI's temporary chat doesn't use your conversation for model training. But that's a different claim from "your data never touches their servers."&lt;/p&gt;

&lt;p&gt;Every message you send — regardless of privacy settings — is processed server-side. It travels over the network, sits in RAM during inference, and is handled by OpenAI's infrastructure. For data that falls under GDPR, HIPAA, or SOC 2 requirements, "not used for training" is a much weaker guarantee than "never left the browser."&lt;/p&gt;

&lt;p&gt;This distinction matters enormously when your clients ask: &lt;em&gt;"How are you handling our data when you use AI?"&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;What Zero-Trust Data Sanitization Means in Practice&lt;/h2&gt;

&lt;p&gt;The approach I've been building toward — I call it &lt;strong&gt;Zero-Trust Data Sanitization (ZTDS)&lt;/strong&gt; — treats every AI session as potentially hostile to your clients' privacy. The rule is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;No personal data should leave your device before you send it to an AI model.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This means scrubbing PII from your text &lt;em&gt;before&lt;/em&gt; it becomes a prompt. Not after. Not with server-side filters you don't control. Before.&lt;/p&gt;

&lt;p&gt;Here's how ZTDS works in practice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Input&lt;/strong&gt;: You paste a client contract, support ticket, or HR document&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detect&lt;/strong&gt;: Every name, email, phone number, and ID is identified via regex (runs locally)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replace&lt;/strong&gt;: Detected PII is swapped for tokens — &lt;code&gt;[NAME_1]&lt;/code&gt;, &lt;code&gt;[EMAIL_1]&lt;/code&gt;, &lt;code&gt;[PHONE_1]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Send&lt;/strong&gt;: You paste the &lt;em&gt;sanitized&lt;/em&gt; version into ChatGPT — no real data crosses the wire&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reverse&lt;/strong&gt;: When you get the AI's response, swap the tokens back to the originals (also locally)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The AI never sees the actual name — it sees &lt;code&gt;[NAME_1]&lt;/code&gt;. It still understands the structure, intent, and context of your document. And the token map that decodes &lt;code&gt;[NAME_1]&lt;/code&gt; back to the original? It lives only in your browser's session memory, wiped the moment you close the tab.&lt;/p&gt;
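&lt;p&gt;Steps 2 and 3 above can be sketched with a small, local-only detector table (the patterns are simplified; production ones are stricter and far more numerous):&lt;/p&gt;

```javascript
// Local-only regex detectors plus token substitution: the core of a ZTDS
// scrubber. Patterns are deliberately simplified for the sketch.
const DETECTORS = {
  EMAIL: /[\w.+-]+@[\w-]+\.[\w.]+/g,
  PHONE: /\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b/g,
  SSN:   /\b\d{3}-\d{2}-\d{4}\b/g,
};

function sanitize(text) {
  let out = text;
  const tokenMap = new Map();     // stays in session memory only (step 5)
  for (const [type, re] of Object.entries(DETECTORS)) {
    let n = 0;
    out = out.replace(re, (hit) => {
      n += 1;
      const token = `[${type}_${n}]`;
      tokenMap.set(token, hit);
      return token;
    });
  }
  return { out, tokenMap };
}
```

&lt;p&gt;Everything above runs in the page; nothing in this flow needs a network call, which is what makes the Airplane Mode test below passable.&lt;/p&gt;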




&lt;h2&gt;The Airplane Mode Test (Your Compliance Proof)&lt;/h2&gt;

&lt;p&gt;Here's the clearest way to verify whether a privacy tool actually honors zero-trust principles:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load the page → Disconnect from the internet → Try to use it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If the tool still works offline: the processing is genuinely local. No data left your device. Your client data never touched a server.&lt;/p&gt;

&lt;p&gt;If the tool breaks offline: it's making network calls. Which means your data is traveling somewhere, regardless of what the privacy policy says.&lt;/p&gt;

&lt;p&gt;This is the test I require every privacy tool to pass before I recommend it. It's also the test I built PrivacyScrubber to satisfy — you can confirm it yourself, right now, by switching to airplane mode after the page loads.&lt;/p&gt;




&lt;h2&gt;The GDPR Audit Checklist for AI Sessions&lt;/h2&gt;

&lt;p&gt;If you're using AI tools with client data today, here's a 5-step framework to reduce your exposure:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Classify before you paste&lt;/strong&gt;&lt;br&gt;
Does this document contain personal data (names, contacts, IDs, health info, financial data)? If yes, it needs scrubbing before it enters any AI prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Check your DPA coverage&lt;/strong&gt;&lt;br&gt;
Do you have a valid DPA with every AI vendor whose models process your prompts? ChatGPT Free = no DPA. Claude API = yes if you signed it. Copilot Enterprise = covered under Microsoft's DPA.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Verify client-side processing claims&lt;/strong&gt;&lt;br&gt;
Apply the Airplane Mode test to any tool claiming to process data locally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Log what you send&lt;/strong&gt;&lt;br&gt;
Even when using scrubbed data, keep a log of session types (not content). "Summarized HR onboarding document, no PII in prompt" is defensible. Mystery AI sessions are not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Honor data subject rights&lt;/strong&gt;&lt;br&gt;
If a client asks "what data did you process about me using AI?" — can you answer? If you've been sending raw documents, you probably can't.&lt;/p&gt;




&lt;h2&gt;A Request from the Dev Community&lt;/h2&gt;

&lt;p&gt;This article focuses on ChatGPT because it's ubiquitous, but the same logic applies to Claude, Gemini, Copilot, and every other hosted AI model.&lt;/p&gt;

&lt;p&gt;There are real solutions here — both technical (regex scrubbers, local models, differential privacy) and procedural (DPAs, audit logs, data classification workflows). I'm biased toward client-side tools because I've found no other approach that satisfies the Airplane Mode test, but the broader conversation matters.&lt;/p&gt;

&lt;p&gt;If you're building in this space — privacy-preserving AI pipelines, local inference, differential privacy for LLMs — I'd genuinely like to hear from you in the comments. And if you're using AI with client data today without a sanitization step, I'd encourage a second look.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Ilya is the founder of &lt;a href="https://privacyscrubber.com" rel="noopener noreferrer"&gt;PrivacyScrubber.com&lt;/a&gt; — a browser-based PII sanitizer for AI workflows that specifically passes the airplane mode test. A deeper technical breakdown for enterprise teams is available in the &lt;a href="https://privacyscrubber.com/ciso-ai-guide" rel="noopener noreferrer"&gt;CISO AI Data Security Guide&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>privacy</category>
      <category>security</category>
      <category>gdpr</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
