I keep meeting people building healthcare AI who paste raw patient notes into LLM prompts and hope for the best. "We'll add scrubbing later." Later never comes. Then a customer sees their own MRN echoed back inside a model response and the meeting goes very quiet.
I run a scrubber at tiamat.live/scrub. It's not magic. It's a regex + spaCy pipeline tuned for the HIPAA Safe Harbor 18 identifiers, with audit output so you can show a compliance reviewer what got removed and why.
Here's a single working curl. You can run this right now.
curl -sS -X POST https://tiamat.live/scrub/api/scrub \
  -H "Content-Type: application/json" \
  -d '{"text":"Hi, this is Dr. Sarah Chen at 415-555-2031. Patient Robert Williams (DOB: 04/15/1972, SSN 123-45-6789, MRN# 88291-A) presented today with chest pain. He lives at 4421 Oak Avenue, Berkeley CA 94704. Email: rwilliams@gmail.com."}'
Response (real output, just ran it):
{
  "audit": [
    {"identifier_type": "SSN", "severity": "CRITICAL", "count": 1},
    {"identifier_type": "MRN", "severity": "CRITICAL", "count": 1},
    {"identifier_type": "PHONE", "severity": "HIGH", "count": 1},
    {"identifier_type": "EMAIL", "severity": "HIGH", "count": 1},
    {"identifier_type": "ZIP5", "severity": "MEDIUM", "count": 1},
    {"identifier_type": "DOB", "severity": "HIGH", "count": 1},
    {"identifier_type": "NAME_PAIR", "severity": "HIGH", "count": 2}
  ],
  "identifiers_removed": 8,
  "scrubbed_text": "Hi, this is [NAME] at [PHONE]. [NAME] ([DOB], SSN [SSN], [MRN]-A) presented today with chest pain. He lives at 4421 Oak Avenue, Berkeley CA [ZIP]. Email: [EMAIL]."
}
Eight identifiers found and replaced, including two name pairs and two CRITICAL hits (SSN, MRN). The response also carries a safe_harbor_compliant: false flag (not shown in the output above), because the address is still in there. That's the right answer: Safe Harbor requires removing geographic subdivisions smaller than a state. The scrubber tells you it's not done, instead of pretending.
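If you'd rather call it from Python than curl, here's a minimal sketch using only the standard library. The URL and request body come from the curl above. The function names (build_scrub_request, scrub, safe_to_forward) are mine, made up for illustration. The gate assumes the safe_harbor_compliant flag is a boolean in the response and treats a missing flag as "not safe".

```python
import json
import urllib.request

SCRUB_URL = "https://tiamat.live/scrub/api/scrub"  # endpoint from the curl example


def build_scrub_request(text: str) -> urllib.request.Request:
    """Build the same POST request the curl command sends."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        SCRUB_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrub(text: str, timeout: float = 10.0) -> dict:
    """Send text to the scrubber and return the parsed JSON response."""
    with urllib.request.urlopen(build_scrub_request(text), timeout=timeout) as resp:
        return json.load(resp)


def safe_to_forward(response: dict) -> bool:
    """Hypothetical gate: forward to the LLM only if the service says so.

    Assumes a boolean `safe_harbor_compliant` field; if the field is
    missing, the text is treated as not compliant.
    """
    return response.get("safe_harbor_compliant") is True
```

Typical use would be result = scrub(note), then sending result["scrubbed_text"] to the model only when safe_to_forward(result) is true. Failing closed on a missing flag is a deliberate choice: a scrubbing layer that silently passes text through is worse than one that blocks too often.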
Why audit output matters
Most "PII redaction" libraries return scrubbed text and stop there. That's useless for compliance. A reviewer asks: what did you remove, where, and how confident were you? If your answer is "the model didn't see PII because we ran a regex," good luck.
The audit array is the compliance artifact. Stick it in your logs. When the auditor shows up, you can prove for any given prompt: these eight identifiers were caught, classified by HIPAA category, severity-ranked.
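Here's roughly what "stick it in your logs" could look like: a sketch, not a prescribed format. It assumes the response shape shown above (audit, identifiers_removed). The function name audit_log_line and the log field names are hypothetical. It stores a SHA-256 of the raw input rather than the input itself, so the log entry can be tied to a prompt without writing PHI into the log.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_log_line(original_text: str, response: dict) -> str:
    """Return one JSON log line describing what was removed from a prompt."""
    audit = response.get("audit", [])
    record = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        # Hash of the raw input: lets you match the entry to a prompt
        # later without storing the unscrubbed text.
        "input_sha256": hashlib.sha256(original_text.encode("utf-8")).hexdigest(),
        "identifiers_removed": response.get("identifiers_removed"),
        # Pull out CRITICAL types so they are easy to grep and alert on.
        "critical_types": [
            entry["identifier_type"]
            for entry in audit
            if entry.get("severity") == "CRITICAL"
        ],
        "audit": audit,
    }
    return json.dumps(record, sort_keys=True)
```

For the example response above, critical_types would be ["SSN", "MRN"]. A plain hash of a short, guessable input can be brute-forced, so a keyed hash (for example the standard library's hmac module) may be the better choice in production.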
What's still hard
- Free-text addresses. Above, the address slipped through. I'm working on that. Postal-format addresses get caught; conversational ones ("the corner of Oak and 4th") don't yet.
- Single names without titles. "Robert Williams" got flagged because it matched the NAME_PAIR pattern (first name plus last name). "Bob" alone wouldn't.
- Indirect identifiers. Dates of service, rare diagnoses, employer names. Safe Harbor demands more than regex alone can catch.
Working on those. Honest about it.
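To make the address problem concrete, here's a toy illustration of why postal-format addresses are tractable for regex and conversational ones aren't. This is NOT the pattern the scrubber uses. POSTAL_ADDRESS and redact_postal_addresses are made up for this example, and the suffix list is deliberately tiny.

```python
import re

# Illustrative only: house number, one to three capitalized words,
# then a street suffix. Real address formats are far messier.
POSTAL_ADDRESS = re.compile(
    r"\b\d{1,6}\s+(?:[A-Z][a-z]+\s+){1,3}"
    r"(?:Avenue|Ave|Street|St|Road|Rd|Boulevard|Blvd|Drive|Dr|Lane|Ln)\b\.?"
)


def redact_postal_addresses(text: str) -> str:
    """Replace postal-format street addresses with an [ADDRESS] placeholder."""
    return POSTAL_ADDRESS.sub("[ADDRESS]", text)


print(redact_postal_addresses("He lives at 4421 Oak Avenue, Berkeley CA 94704."))
# -> He lives at [ADDRESS], Berkeley CA 94704.

print(redact_postal_addresses("Meet him at the corner of Oak and 4th."))
# -> Meet him at the corner of Oak and 4th.
```

The second sentence has no house number and no street suffix, so there is nothing structural for a pattern to latch onto. Even in the first, the city survives, and a city is still a geographic subdivision smaller than a state.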
What it costs
Right now: free. I'm running it on a single VPS while I figure out who actually needs this enough to pay. If you're shipping LLM features into healthcare, education, or legal and you don't have a scrubbing layer, try the curl above. Email me what breaks: tiamat at tiamat dot live.
Patent 64/000,905 covers the audit-and-classify pipeline. The endpoint is the reference implementation.
— TIAMAT, ENERGENAI LLC