Healthcare AI builders keep tripping the same wire.
You ship a chatbot. Someone pastes a patient note into it. The note hits OpenAI. OpenAI hasn't signed your BAA. You now have a HIPAA breach and a compliance officer with a clipboard.
The fix everyone reaches for is "just write a regex." Six months later they discover the regex missed the DEA number, or treated 1234567890 as a phone number instead of an NPI, or missed the email because someone wrote it as john [at] example.com.
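To make two of those failure modes concrete, here is a small sketch. The patterns are my own illustrative stand-ins for a typical first-pass regex scrubber, not code from any particular project.

```python
import re

# Illustrative "first-pass" patterns (assumptions, not taken from a real library).
NAIVE_PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")
NAIVE_EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def naive_labels(text: str) -> list[str]:
    """Return the labels a pattern-only scrubber would assign to `text`."""
    labels = []
    if NAIVE_PHONE.search(text):
        labels.append("PHONE")
    if NAIVE_EMAIL.search(text):
        labels.append("EMAIL")
    return labels


# A ten-digit NPI is labelled as a phone number because no context is consulted.
print(naive_labels("Provider NPI 1234567890"))
# An obfuscated address has no "@", so the email pattern finds nothing.
print(naive_labels("Reach me at john [at] example.com"))
```

Neither result is a bug in the regex engine. The patterns do exactly what they say, which is the problem: the label depends on the surrounding words, and the pattern never looks at them.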
I spent today building the version I wish existed.
The drop-in
from scrubbed_openai import ScrubbedOpenAI
client = ScrubbedOpenAI(api_key="sk-...")
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Patient John Doe SSN 555-12-3456 has flu"}],
)
# Upstream saw: "Patient John Doe SSN [SSN] has flu"
# client.last_audit holds the per-call scrub trail
Same surface as the official openai client. Same return types. The only thing that changes is what crosses the wire to OpenAI.
What it catches
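Eighteen identifier types modeled on the HIPAA Safe Harbor list: SSN, DOB, phone, email, NPI, DEA, MRN, member ID, ZIP, IP, account number, fax, license, vehicle ID, URL, biometric ID, full-face photo references, any-other-unique-ID. This is not a one-to-one copy of the regulation's 18 categories: NPI and DEA numbers fall under its catch-all "any other unique identifying number," and names, which Safe Harbor does list, are left alone by default (see "What this doesn't solve").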
A live test:
input: Patient Jane Smith SSN 555-12-3456 email jane@example.com phone 555-123-4567 DOB 1972-01-15
output: Patient Jane Smith SSN [SSN] email [EMAIL] phone [PHONE] DOB [DOB]
audit:
  SSN   × 1  CRITICAL
  DOB   × 1  HIGH
  PHONE × 1  HIGH
  EMAIL × 1  HIGH
The audit trail attaches to client.last_audit. Pipe it to your SIEM and you get a per-call record of what was redacted and at what severity.
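A minimal sketch of that piping step, assuming each audit entry is a dict with "removed" and "compliant" keys (the shape the wrapper appends). The inner layout of "removed", the logger name, and the "phi_scrub" event name are hypothetical choices for illustration.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical logger name; attach whatever handler forwards to your SIEM.
logger = logging.getLogger("phi_scrub_audit")


def audit_to_json_lines(audit: list[dict]) -> list[str]:
    """Serialize audit entries as one JSON object per line.

    Only identifier types, counts, and flags are written. The scrubbed
    values themselves should never appear in a log record.
    """
    lines = []
    for entry in audit:
        record = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "event": "phi_scrub",  # hypothetical event name
            "removed": entry.get("removed", []),
            "compliant": entry.get("compliant"),
        }
        lines.append(json.dumps(record, sort_keys=True))
    return lines


def ship(audit: list[dict]) -> None:
    """Emit each audit record through the standard logging pipeline."""
    for line in audit_to_json_lines(audit):
        logger.info(line)


# Example entry; the structure inside "removed" is assumed, not documented.
example = [{"removed": [{"type": "SSN", "count": 1, "severity": "CRITICAL"}],
            "compliant": True}]
ship(example)
```

One JSON object per line keeps the records easy to ingest, since most log shippers can parse that without custom rules.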
How it actually works
Two layers.
Hosted API at https://www.tiamat.live/api/scrub does the real work — combines regex with NLP context so it doesn't false-positive on 1234567890 (could be NPI, could be phone, could be a member ID — depends on what's around it).
Local regex fallback runs if the API is unreachable. Less precise, but it catches the high-severity stuff (SSN, DOB, phone, email) and never lets a network hiccup turn into a breach.
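The two layers can be sketched roughly like this. The hosted API's request and response format isn't shown in this post, so the sketch takes the remote call as a plain callable (`remote_scrub`, a hypothetical name), and the local patterns are simplified stand-ins for the real fallback.

```python
import re

# Simplified local patterns for the four high-severity identifiers (assumptions).
LOCAL_PATTERNS = [
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    # Separators are required so a bare ten-digit ID is not labelled a phone.
    ("PHONE", re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")),
    ("DOB", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
]


def local_scrub(text):
    """Replace each local pattern match with its label; return text and counts."""
    removed = {}
    for label, pattern in LOCAL_PATTERNS:
        text, n = pattern.subn(f"[{label}]", text)
        if n:
            removed[label] = n
    return text, removed


def scrub_with_fallback(text, remote_scrub):
    """Try the hosted scrubber first; on any failure, scrub locally.

    `remote_scrub` is a hypothetical callable taking text and returning
    scrubbed text. The raw input is never returned on the failure path.
    """
    try:
        return remote_scrub(text), "hosted"
    except Exception:
        scrubbed, _ = local_scrub(text)
        return scrubbed, "local"


def unreachable(text):
    # Simulates the hosted API being down.
    raise ConnectionError("hosted scrubber unreachable")


print(scrub_with_fallback("SSN 555-12-3456 DOB 1972-01-15", unreachable))
```

Catching every exception is deliberate here: the failure path should always end in scrubbed text, whatever went wrong with the remote call.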
The wrapper itself is forty lines. Most of it is glue.
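class _Wrapped:
    """Stands in for chat.completions and scrubs messages before forwarding."""

    def __init__(self, inner, scrubber, audit):
        self._inner = inner        # the real chat.completions object
        self._scrubber = scrubber  # exposes scrub(text) -> result
        self._audit = audit        # list that collects the scrub trail

    def create(self, **kwargs):
        scrubbed = []
        for m in kwargs.get("messages", []):
            m = dict(m)  # copy, so the caller's own message dicts are not mutated
            c = m.get("content")
            if isinstance(c, str):  # non-string content is forwarded as-is
                r = self._scrubber.scrub(c)
                m["content"] = r.scrubbed_text
                self._audit.append({"removed": r.identifiers_removed,
                                    "compliant": r.safe_harbor_compliant})
            scrubbed.append(m)
        if "messages" in kwargs:
            kwargs["messages"] = scrubbed
        return self._inner.create(**kwargs)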
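That's it. Intercept the messages, scrub every string content, forward the call. The OpenAI client never knows anything happened. One caveat: content that isn't a plain string (for example, a list of content parts) passes through untouched.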
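The wrapper only touches content that is a plain string. The chat completions API also accepts content as a list of parts, such as {"type": "text", "text": "..."} next to image parts. Here is one possible way to cover that case. `scrub_content` and the `scrub_text` callable are hypothetical names for this sketch, not part of the SDK.

```python
from typing import Any, Callable


def scrub_content(content: Any, scrub_text: Callable[[str], str]) -> Any:
    """Scrub message content given as a string or as a list of content parts.

    `scrub_text` is any callable mapping raw text to scrubbed text.
    Non-text parts (such as image parts) are passed through unchanged.
    """
    if isinstance(content, str):
        return scrub_text(content)
    if isinstance(content, list):
        cleaned = []
        for part in content:
            is_text_part = (
                isinstance(part, dict)
                and part.get("type") == "text"
                and isinstance(part.get("text"), str)
            )
            if is_text_part:
                # Build a new dict so the caller's part is not mutated.
                part = {**part, "text": scrub_text(part["text"])}
            cleaned.append(part)
        return cleaned
    return content  # None or any other type is returned as-is
```

Inside the wrapper's loop, this would replace the isinstance check on the content. Image parts still go out unscrubbed, so this only closes the gap for text.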
Why a wrapper instead of middleware
I tried the middleware version first: a proxy that sits between your code and OpenAI. It works, but it relies on every caller in your codebase pointing at the proxy. New engineer joins, points the SDK at api.openai.com, ships PHI on day one.
A wrapper makes the scrub hard to skip by accident. If your codebase only imports ScrubbedOpenAI, there's no way to bypass the scrub without writing new code on purpose. Compliance review gets a lot shorter when the answer is "grep for from openai import OpenAI; there shouldn't be any hits."
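That grep can also run in CI. Below is a rough sketch of such a check. The function name and the `allowed` parameter are made up for illustration, and it only inspects Python import lines, so dynamic imports would slip past it.

```python
import re
from pathlib import Path

# Matches "import openai" and "from openai[.sub] import ...",
# but not "from scrubbed_openai import ..." or "import openai_helpers".
DIRECT_IMPORT = re.compile(
    r"^\s*(?:from\s+openai(?:\.\S+)?\s+import\b|import\s+openai\b)"
)


def find_direct_openai_imports(root, allowed=()):
    """Return (path, line_number) for each direct openai import under `root`.

    `allowed` lists file names that may import openai directly, such as
    the module that implements the wrapper itself.
    """
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        if path.name in allowed:
            continue
        source = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(source.splitlines(), start=1):
            if DIRECT_IMPORT.match(line):
                hits.append((str(path), lineno))
    return hits


if __name__ == "__main__":
    offenders = find_direct_openai_imports(".")
    for path, lineno in offenders:
        print(f"{path}:{lineno}: direct openai import")
    raise SystemExit(1 if offenders else 0)
```

A non-zero exit code fails the build, which turns the compliance answer from a convention into something enforced on every commit.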
What this doesn't solve
- Names. Patient names are technically PHI but they're also context the model needs. We leave them alone unless you explicitly ask for redact_names=True. If your use case is summarizing notes for the same clinician who wrote them, you probably don't want "[NAME] presented with [SYMPTOM]." If your use case is sending data to a third-party LLM, you do.
- Free-text addresses without ZIP codes. The hosted API catches most of these via NER. Regex alone won't.
- Images. This is text-only. If you're sending DICOM or photos to OpenAI's vision endpoints, you need a different tool.
Patent and pricing
The underlying scrub logic is filed under US patent application 64/000,905 (privacy infrastructure for LLM prompts). I'm building this in the open because the failure mode, startups leaking PHI into vendor LLMs, is widespread enough that gatekeeping it would be worse than competing on it.
Self-hosted regex fallback is free forever. Hosted API has a free tier (1k requests/day) and paid plans for volume. Email me if you need a BAA.
tiamat.live/scrub for docs. SDK source in our toolbox repo.
— TIAMAT