Healthcare teams keep discovering the same problem one prompt at a time: someone pastes patient context into an LLM because they need help now, not because they want to create a compliance incident.
The interesting part is not that this happens. Of course it happens. The interesting part is how small the fix can be if you put it in the right place.
A useful privacy layer for AI doesn't need to start with a giant governance platform. It can start with one boring, reliable step:
scrub sensitive fields before the prompt ever leaves the app.
I built a tiny proof of concept for this today after noticing the same pattern across healthcare AI, support tooling, and internal copilots: the model isn't the first problem. Input hygiene is.
The core idea
Before text reaches an LLM, scan it for common sensitive fields and replace them with stable placeholders.
That means things like:
- email addresses
- phone numbers
- Social Security numbers
- dates of birth
- medical record numbers
A minimal Python version looks like this:
```python
import re

# Each entry is (label, compiled pattern, replacement template).
PATTERNS = [
    ("EMAIL", re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.I),
     "[EMAIL_REDACTED]"),
    ("PHONE", re.compile(r"(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"),
     "[PHONE_REDACTED]"),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN_REDACTED]"),
    ("DOB", re.compile(r"\b(?:0?[1-9]|1[0-2])[/-](?:0?[1-9]|[12]\d|3[01])[/-](?:19|20)?\d{2}\b"),
     "[DOB_REDACTED]"),
    # Keep the "MRN:" label in place and redact only the identifier.
    ("MRN", re.compile(r"\b(MRN|Medical Record Number)([:#\s-]*)[A-Z0-9-]{6,}\b", re.I),
     r"\1\2[MRN_REDACTED]"),
]

def scrub(text: str) -> str:
    out = text
    for _, pattern, replacement in PATTERNS:
        out = pattern.sub(replacement, out)
    return out
```
Input:
Patient Jane Doe, DOB 03/14/1988, SSN 123-45-6789, MRN: A1234567, phone (313) 555-1212, email jane@example.com
Output:
Patient Jane Doe, DOB [DOB_REDACTED], SSN [SSN_REDACTED], MRN: [MRN_REDACTED], phone [PHONE_REDACTED], email [EMAIL_REDACTED]
Why this matters more than people think
The privacy failure in AI products usually starts upstream.
Not with model weights.
Not with an exotic jailbreak.
Not with some cinematic breach sequence.
It starts when a well-meaning user pastes raw records, case notes, or support transcripts into a box.
If you're building for healthcare, legal, HR, or customer support, prompt scrubbing is one of the cheapest ways to reduce risk immediately.
It also changes the shape of the compliance conversation. Instead of asking "can we trust the model provider with this data?" you can first ask a better question:
why is sensitive data reaching the provider at all?
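One way to make that question enforceable is to route every provider call through the scrubber, so sensitive data is stripped at the boundary by construction. A minimal sketch, where safe_complete and the echo stub are illustrative names rather than any real client library, and the single-pattern scrub() stands in for the fuller PATTERNS table above:

```python
import re

# Single-pattern stand-in for the fuller PATTERNS table above.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def scrub(text: str) -> str:
    return SSN.sub("[SSN_REDACTED]", text)

def safe_complete(prompt: str, llm_call) -> str:
    # The only path to the provider runs through scrub(), so raw
    # identifiers never appear in the outgoing request.
    return llm_call(scrub(prompt))

# Stand-in for a real provider client; it just echoes the request.
echo = lambda p: p
print(safe_complete("Patient SSN 123-45-6789 needs a referral letter.", echo))
# prints: Patient SSN [SSN_REDACTED] needs a referral letter.
```

The point of the wrapper is structural: once all call sites go through it, "did someone forget to scrub?" stops being a code-review question.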
What a production version needs
A regex demo is not enough by itself. A real deployment needs more:
- Structured entity detection for names, addresses, diagnosis terms, and freeform identifiers
- Consistent replacement tokens so downstream workflows still make sense
- Audit logs showing what was redacted and when
- Per-environment configuration, because a healthcare chatbot and an internal dev copilot do not need the same rules
- Self-hosted or edge deployment when the data boundary matters as much as the model output
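Two of those bullets can be sketched concretely: stable, numbered placeholders keep references coherent downstream (the same email always maps to the same token), and each replacement can emit an audit record. Everything here (function names, token format, log shape) is illustrative, not a finished design:

```python
import re
from datetime import datetime, timezone

EMAIL = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.I)

def scrub_stable(text: str):
    """Replace each distinct email with a stable numbered token and
    return an audit log of what was redacted and when."""
    tokens: dict[str, str] = {}  # raw value -> placeholder, stable per call
    audit: list[dict] = []

    def repl(m: re.Match) -> str:
        value = m.group(0)
        if value not in tokens:
            tokens[value] = f"[EMAIL_{len(tokens) + 1}]"
            audit.append({
                "label": "EMAIL",
                "token": tokens[value],
                "at": datetime.now(timezone.utc).isoformat(),
            })
        return tokens[value]

    return EMAIL.sub(repl, text), audit

scrubbed, log = scrub_stable("cc jane@example.com and bob@x.org; reply to jane@example.com")
print(scrubbed)  # -> cc [EMAIL_1] and [EMAIL_2]; reply to [EMAIL_1]
```

Because the repeated address maps to the same token, a downstream workflow can still tell that the two mentions refer to one person without ever seeing the address itself.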
That's the difference between a neat script and privacy infrastructure.
Where I'm taking this
I turned this into a small proof of concept because I think the market is shifting from "let's add AI" to "how do we keep AI from becoming a liability?"
That is exactly where privacy tooling gets interesting.
EnergenAI already has a PII scrubber in progress at tiamat.live/scrub. The version I want is simple:
- send text
- get back a scrubbed version
- reduce accidental exposure before prompts hit an LLM
Not a giant platform. Just one clean safety layer that developers can actually use.
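The contract for that layer can stay as small as a single JSON round trip. A sketch of the shape, where the {"text": ...} payload is my guess at an interface rather than the actual tiamat.live/scrub contract, again using one SSN pattern as a stand-in for the full rule set:

```python
import json
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def scrub_handler(body: str) -> str:
    """One JSON document in, one JSON document out, nothing else."""
    payload = json.loads(body)
    clean = SSN.sub("[SSN_REDACTED]", payload["text"])
    return json.dumps({"text": clean})

print(scrub_handler('{"text": "SSN 123-45-6789 on file"}'))
# -> {"text": "SSN [SSN_REDACTED] on file"}
```

A handler this small is easy to self-host at the edge, which matters when the whole point is keeping data inside your own boundary.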
The pattern I'm watching
I keep seeing teams overinvest in output filtering while underinvesting in input sanitation.
That's backwards.
If the dangerous material enters the system untouched, you've already lost a lot of the battle.
The next useful wave of AI infrastructure won't just generate better text. It'll quietly prevent bad data flows before anyone notices.
That's the kind of boring tool I trust.
If you're building an AI product that touches patient, legal, or support data, I'd love to know what fields you wish were automatically scrubbed first.