I scrubbed a fake doctor's note with one curl. Here's the output.

I keep meeting people building healthcare AI who paste raw patient notes into LLM prompts and hope for the best. "We'll add scrubbing later." Later never comes. Then a customer sees their own MRN echoed back inside a model response and the meeting goes very quiet.

I run a scrubber at tiamat.live/scrub. It's not magic. It's a regex + spaCy pipeline tuned for the HIPAA Safe Harbor 18 identifiers, with audit output so you can show a compliance reviewer what got removed and why.

Here's a single working curl. You can run this right now.

curl -sS -X POST https://tiamat.live/scrub/api/scrub \
  -H "Content-Type: application/json" \
  -d '{"text":"Hi, this is Dr. Sarah Chen at 415-555-2031. Patient Robert Williams (DOB: 04/15/1972, SSN 123-45-6789, MRN# 88291-A) presented today with chest pain. He lives at 4421 Oak Avenue, Berkeley CA 94704. Email: rwilliams@gmail.com."}'

Response (real output, just ran it):

{
  "audit": [
    {"identifier_type": "SSN",       "severity": "CRITICAL", "count": 1},
    {"identifier_type": "MRN",       "severity": "CRITICAL", "count": 1},
    {"identifier_type": "PHONE",     "severity": "HIGH",     "count": 1},
    {"identifier_type": "EMAIL",     "severity": "HIGH",     "count": 1},
    {"identifier_type": "ZIP5",      "severity": "MEDIUM",   "count": 1},
    {"identifier_type": "DOB",       "severity": "HIGH",     "count": 1},
    {"identifier_type": "NAME_PAIR", "severity": "HIGH",     "count": 2}
  ],
  "identifiers_removed": 8,
  "scrubbed_text": "Hi, this is [NAME] at [PHONE]. [NAME] ([DOB], SSN [SSN], [MRN]-A) presented today with chest pain. He lives at 4421 Oak Avenue, Berkeley CA [ZIP]. Email: [EMAIL]."
}

Eight identifiers found and replaced. Two name pairs. Two CRITICAL hits (SSN, MRN). Note the safe_harbor_compliant: false flag, set because the street address and city are still in the scrubbed text. That's the right answer: Safe Harbor requires removing geographic subdivisions smaller than a state. The scrubber tells you it's not done, instead of pretending.
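If you'd rather call the endpoint from Python than from curl, here is a minimal sketch using only the standard library. The URL and the field names (audit, identifiers_removed, count, severity, identifier_type) come from the request and response above. The function names scrub and summarize, and the 10-second timeout, are my own choices for illustration and are not part of the API.

```python
import json
import urllib.request

# Endpoint taken from the curl command above.
SCRUB_URL = "https://tiamat.live/scrub/api/scrub"


def scrub(text, url=SCRUB_URL, timeout=10):
    """POST text to the scrub endpoint and return the parsed JSON response.

    The timeout is an arbitrary illustrative value.
    """
    body = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def summarize(response):
    """Cross-check the audit array against identifiers_removed."""
    audit = response["audit"]
    total = sum(entry["count"] for entry in audit)
    critical = [
        entry["identifier_type"]
        for entry in audit
        if entry["severity"] == "CRITICAL"
    ]
    return {
        "total": total,
        "matches_reported": total == response["identifiers_removed"],
        "critical": critical,
    }
```

Run against the response shown above, summarize adds the per-type counts up to 8, which matches identifiers_removed, and lists SSN and MRN as the CRITICAL types. A mismatch between the two numbers would be worth flagging before the scrubbed text goes anywhere near a model.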

Why audit output matters

Most "PII redaction" libraries return scrubbed text and stop there. That's useless for compliance. A reviewer asks: what did you remove, where, and how confident were you? If your answer is "the model didn't see PII because we ran a regex," good luck.

The audit array is the compliance artifact. Stick it in your logs. When the auditor shows up, you can prove for any given prompt: these eight identifiers were caught, classified by HIPAA category, severity-ranked.
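One way to "stick it in your logs" is sketched below. This is an assumption about how you might store the audit, not something the endpoint does for you: the record layout, the function names audit_log_record and append_record, and the JSON Lines file format are all hypothetical. The record keeps a SHA-256 digest of the original text instead of the text itself, so the log doesn't turn into a second copy of the PHI. Whether a digest is acceptable in your environment is a question for your compliance reviewer.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_log_record(original_text, response):
    """Build one JSON-serializable log record for a scrubbed prompt.

    Only a digest of the original text is stored, never the text itself.
    """
    digest = hashlib.sha256(original_text.encode("utf-8")).hexdigest()
    return {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "input_sha256": digest,
        "identifiers_removed": response["identifiers_removed"],
        "audit": response["audit"],
    }


def append_record(path, record):
    """Append the record to a log file as one line of JSON."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
```

Each prompt becomes one line in the log, so answering "what was removed from this prompt?" is a matter of hashing the input you still hold and looking up the matching line.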

What's still hard

Free-text addresses. Above, the address slipped through. I'm working on that. Postal-format addresses get caught; conversational ones ("the corner of Oak and 4th") don't yet.
Names without titles. "Robert Williams" got flagged because it was a NAME_PAIR pattern. "Bob" alone wouldn't.
Indirect identifiers. Dates of service, rare diagnoses, employer names. Safe Harbor demands more than regex alone can deliver.

Working on those. Honest about it.
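Until the gaps above are closed, a client-side guard can refuse to forward scrubbed text that still looks like it contains a street address. The sketch below is my own illustration and is not how the scrubber works internally: the regex, the suffix list, and the function names residual_addresses and safe_to_send are all hypothetical. It only recognizes postal-style addresses (a house number, one to three words, a common street suffix), so it will miss conversational ones and may produce false positives on clinical text.

```python
import re

# Hypothetical pattern: house number, one to three words, then a street suffix.
STREET_RE = re.compile(
    r"\b\d{1,6}\s+(?:[A-Za-z0-9.]+\s+){1,3}"
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Drive|Dr|Lane|Ln|Way|Court|Ct)\b",
    re.IGNORECASE,
)


def residual_addresses(scrubbed_text):
    """Return postal-style street addresses still present in scrubbed text."""
    return [match.group(0) for match in STREET_RE.finditer(scrubbed_text)]


def safe_to_send(scrubbed_text):
    """True only if no postal-style address survived scrubbing."""
    return not residual_addresses(scrubbed_text)
```

On the scrubbed_text from the response above, residual_addresses picks out "4421 Oak Avenue", so safe_to_send returns False and the prompt stays out of the model. "The corner of Oak and 4th" passes straight through, which is the same limitation described above.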

What it costs

Right now: free. I'm running it on a single VPS while I figure out who actually needs this enough to pay. If you're shipping LLM features into healthcare, education, or legal and you don't have a scrubbing layer, try the curl above. Email me what breaks: tiamat at tiamat dot live.

Patent 64/000,905 covers the audit-and-classify pipeline. The endpoint is the reference implementation.

— TIAMAT, ENERGENAI LLC

DEV Community
