
Tiamat

Originally published at dev.to

A drop-in OpenAI wrapper that scrubs PHI before it leaves your VPC

Healthcare AI builders keep tripping the same wire.

You ship a chatbot. Someone pastes a patient note into it. The note hits OpenAI. OpenAI hasn't signed your BAA. You now have a HIPAA breach and a compliance officer with a clipboard.

The fix everyone reaches for is "just write a regex." Six months later they discover the regex didn't catch the DEA number, or treated 1234567890 as a phone number instead of an NPI, or missed the email because someone wrote it as john [at] example.com.
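Two of those failures are easy to reproduce. A minimal sketch, assuming the kind of first-pass patterns people tend to write (the patterns and the naive_scrub name are mine, for illustration, not from any real codebase):

```python
import re

# Deliberately naive patterns of the kind a first-pass scrubber tends to use.
NAIVE_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
NAIVE_PHONE = re.compile(r"\b\d{10}\b")


def naive_scrub(text: str) -> str:
    """Replace anything the naive patterns match with a placeholder."""
    text = NAIVE_EMAIL.sub("[EMAIL]", text)
    return NAIVE_PHONE.sub("[PHONE]", text)


# The obfuscated email has no "@", so it passes through untouched.
print(naive_scrub("Contact john [at] example.com"))

# The 10-digit NPI is redacted, but labeled as a phone number.
print(naive_scrub("Provider NPI 1234567890"))
```

The second case is the sneaky one: the value is gone, but the audit trail now records the wrong identifier type.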

I spent today building the version I wish existed.

The drop-in

from scrubbed_openai import ScrubbedOpenAI

client = ScrubbedOpenAI(api_key="sk-...")

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role":"user","content":"Patient John Doe SSN 555-12-3456 has flu"}],
)
# Upstream saw: "Patient John Doe SSN [SSN] has flu"
# client.last_audit holds the per-call scrub trail

Same surface as the official openai client. Same return types. The only thing that changes is what crosses the wire to OpenAI.

What it catches

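Eighteen identifier types, modeled on the HIPAA Safe Harbor list: SSN, DOB, phone, email, NPI, DEA, MRN, member ID, ZIP, IP, account number, fax, license, vehicle ID, URL, biometric ID, full-face photo references, and any other unique ID. This is not a one-to-one copy of the regulation's list: NPI and DEA numbers are called out separately as provider identifiers, and patient names, which Safe Harbor does include, are opt-in (see "What this doesn't solve").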

A live test:

input:  Patient Jane Smith SSN 555-12-3456 email jane@example.com phone 555-123-4567 DOB 1972-01-15
output: Patient Jane Smith SSN [SSN] email [EMAIL] phone [PHONE] DOB [DOB]

audit:
  SSN    × 1   CRITICAL
  DOB    × 1   HIGH
  PHONE  × 1   HIGH
  EMAIL  × 1   HIGH

The audit trail attaches to client.last_audit. Pipe it to your SIEM and every request carries its own redaction record for HIPAA audit logging.
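One way to do the piping, as a rough sketch. It assumes each audit entry is a dict with "removed" and "compliant" keys (the shape the wrapper code later in this post appends); the logger name, the request_id field, and both function names are hypothetical:

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("phi_audit")  # hypothetical logger name


def audit_to_json_lines(audit_entries, request_id):
    """Turn audit entries into one JSON string per entry, ready for a log shipper."""
    lines = []
    for entry in audit_entries:
        record = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "request_id": request_id,  # assumed correlation ID supplied by the caller
            "removed": entry.get("removed"),
            "compliant": entry.get("compliant"),
        }
        # default=str keeps serialization from failing on unexpected value types.
        lines.append(json.dumps(record, sort_keys=True, default=str))
    return lines


def ship(audit_entries, request_id):
    """Emit each record through the standard logging pipeline."""
    for line in audit_to_json_lines(audit_entries, request_id):
        logger.info(line)
```

Before shipping anything, it is worth confirming that the entries hold only identifier types and counts, never the raw values that were removed.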

How it actually works

Two layers.

Hosted API at https://www.tiamat.live/api/scrub does the real work. It combines regex with NLP context so it doesn't mislabel an ambiguous number such as 1234567890, which could be an NPI, a phone number, or a member ID depending on what's around it.
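To make "depends on what's around it" concrete, here is a toy stand-in for the context step. It is not the hosted API's logic: the cue words, the labels, and the label_ten_digit_numbers name are all assumptions, and a keyword window is far cruder than real NLP.

```python
import re

# Hypothetical cue words mapped to labels; a real system would use a richer model.
CUE_TO_LABEL = {
    "npi": "NPI",
    "provider": "NPI",
    "phone": "PHONE",
    "tel": "PHONE",
    "call": "PHONE",
    "member": "MEMBER_ID",
    "subscriber": "MEMBER_ID",
}
TEN_DIGITS = re.compile(r"\b\d{10}\b")


def label_ten_digit_numbers(text, window=40):
    """Label each 10-digit number using the nearest cue word that precedes it."""
    results = []
    for match in TEN_DIGITS.finditer(text):
        before = text[max(0, match.start() - window):match.start()].lower()
        label = "UNKNOWN_ID"
        # Walk backwards through the preceding words; the closest cue wins.
        for word in reversed(re.findall(r"[a-z]+", before)):
            if word in CUE_TO_LABEL:
                label = CUE_TO_LABEL[word]
                break
        results.append((match.group(), label))
    return results
```

The same digits come back with different labels depending on the words in front of them, and anything without a cue falls into a catch-all bucket that should still be redacted.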

Local regex fallback runs if the API is unreachable. Less precise, but it catches the high-severity stuff (SSN, DOB, phone, email) and never lets a network hiccup turn into a breach.
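The fallback pattern can be sketched like this. The patterns below are illustrative, not the SDK's actual ones, and remote_scrub is a hypothetical callable standing in for the hosted API call, since the request format isn't shown in this post:

```python
import re

# Local patterns for the high-severity identifiers named above. Illustrative only.
LOCAL_PATTERNS = [
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("DOB", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),  # ISO-style dates only
    ("PHONE", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")),
]


def local_scrub(text):
    """Regex-only scrub; returns (scrubbed_text, counts_by_identifier_type)."""
    counts = {}
    for label, pattern in LOCAL_PATTERNS:
        text, n = pattern.subn(f"[{label}]", text)
        if n:
            counts[label] = n
    return text, counts


def scrub_with_fallback(text, remote_scrub):
    """Try the remote scrubber; on any failure, scrub locally instead."""
    try:
        return remote_scrub(text)
    except Exception:
        # Fail closed: the raw text is never returned when the remote call fails.
        return local_scrub(text)
```

The important design choice is that the except branch returns scrubbed text rather than the original. If the local scrub itself raises, the exception propagates and nothing is sent.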

The wrapper itself is forty lines. Most of it is glue.

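class _Wrapped:
    """Wraps chat.completions and scrubs message content before forwarding."""

    def __init__(self, inner, scrubber, audit):
        self._inner = inner        # the real chat.completions object
        self._scrubber = scrubber  # provides scrub(text) -> result
        self._audit = audit        # list that collects the scrub trail

    def create(self, **kwargs):
        scrubbed = []
        for m in kwargs.get("messages", []):
            m = dict(m)  # copy, so the caller's message dicts are not mutated
            c = m.get("content")
            if isinstance(c, str):  # non-string content passes through unchanged
                r = self._scrubber.scrub(c)
                m["content"] = r.scrubbed_text
                self._audit.append({"removed": r.identifiers_removed,
                                    "compliant": r.safe_harbor_compliant})
            scrubbed.append(m)
        if "messages" in kwargs:
            kwargs["messages"] = scrubbed
        return self._inner.create(**kwargs)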

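That's it. Intercept the messages, scrub each string content, forward the call. The OpenAI client never knows anything happened. One caveat: content that isn't a plain string (a multi-part message list, for example) passes through this snippet untouched.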
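For a sense of how the pieces could hang together, here is a sketch of the assembly, runnable without a network. Everything in it is a stand-in: ScrubbedClient is not the real ScrubbedOpenAI, ScrubbingCompletions is a condensed take on the wrapper above, and the recording inner client and stub scrubber exist only so you can see what would cross the wire.

```python
from types import SimpleNamespace


class ScrubbingCompletions:
    """Condensed stand-in for the wrapper class shown above."""

    def __init__(self, inner, scrubber, audit):
        self._inner, self._scrubber, self._audit = inner, scrubber, audit

    def create(self, **kwargs):
        cleaned = []
        for message in kwargs.get("messages", []):
            message = dict(message)  # work on a copy of each message
            if isinstance(message.get("content"), str):
                result = self._scrubber.scrub(message["content"])
                message["content"] = result.scrubbed_text
                self._audit.append({"removed": result.identifiers_removed,
                                    "compliant": result.safe_harbor_compliant})
            cleaned.append(message)
        return self._inner.create(**{**kwargs, "messages": cleaned})


class ScrubbedClient:
    """Hypothetical facade exposing client.chat.completions and client.last_audit."""

    def __init__(self, inner_client, scrubber):
        self.last_audit = []
        self.chat = SimpleNamespace(completions=ScrubbingCompletions(
            inner_client.chat.completions, scrubber, self.last_audit))


class RecordingCompletions:
    """Fake upstream that records what it was asked to send."""

    def __init__(self):
        self.seen = None

    def create(self, **kwargs):
        self.seen = kwargs
        return "fake-response"


class StubScrubber:
    """Hypothetical scrubber returning the three fields the wrapper reads."""

    def scrub(self, text):
        return SimpleNamespace(scrubbed_text=text.replace("555-12-3456", "[SSN]"),
                               identifiers_removed={"SSN": 1},
                               safe_harbor_compliant=True)


inner = SimpleNamespace(chat=SimpleNamespace(completions=RecordingCompletions()))
client = ScrubbedClient(inner, StubScrubber())
original = [{"role": "user", "content": "SSN 555-12-3456"}]
client.chat.completions.create(model="gpt-4o-mini", messages=original)
```

A fake upstream like this also doubles as a unit test: assert on what it recorded, and you have proof that the raw SSN never reached the inner client.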

Why a wrapper instead of middleware

I tried the middleware version first. It works, but it forces every caller in your codebase to know about the proxy. New engineer joins, points the SDK at api.openai.com, ships PHI on day one.

A wrapper makes the scrub hard to skip by accident. If your codebase only imports ScrubbedOpenAI, there's no way to bypass the scrub without writing new code on purpose. Compliance review gets a lot shorter when the answer is "grep for from openai import OpenAI; there shouldn't be any hits."
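That grep can also run as a CI check. A minimal sketch using Python's ast module, so it catches import openai and from openai.x import y as well. The function names and the allowed file name are assumptions; the module that defines the wrapper has to import openai itself, so it needs an exemption.

```python
import ast
from pathlib import Path


def find_direct_openai_imports(source: str):
    """Return sorted line numbers of statements that import the openai package."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            names = [node.module or ""]
        else:
            continue
        if any(n == "openai" or n.startswith("openai.") for n in names):
            hits.append(node.lineno)
    return sorted(hits)


def scan_tree(root, allowed=("scrubbed_openai.py",)):
    """Map file path -> offending line numbers for every .py file under root."""
    violations = {}
    for path in Path(root).rglob("*.py"):
        if path.name in allowed:  # assumed file name of the wrapper module
            continue
        hits = find_direct_openai_imports(path.read_text(encoding="utf-8"))
        if hits:
            violations[str(path)] = hits
    return violations
```

Fail the build when scan_tree returns anything. Dynamic imports (importlib, __import__) would slip past this, so treat it as a guardrail rather than a guarantee.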

What this doesn't solve

  • Names. Patient names are technically PHI but they're also context the model needs. We leave them alone unless you explicitly ask for redact_names=True. If your use case is summarizing notes for the same clinician who wrote them, you probably don't want "[NAME] presented with [SYMPTOM]." If your use case is sending data to a third-party LLM, you do.
  • Free-text addresses without ZIP codes. The hosted API catches most of these via NER. Regex alone won't.
  • Images. This is text-only. If you're sending DICOM or photos to OpenAI's vision endpoints, you need a different tool.

Patent and pricing

The underlying scrub logic is filed under US patent application 64/000,905 (privacy infrastructure for LLM prompts). I'm building this in the open because the failure mode (startups leaking PHI into vendor LLMs) is widespread enough that gatekeeping it would be worse than competing on it.

Self-hosted regex fallback is free forever. Hosted API has a free tier (1k requests/day) and paid plans for volume. Email me if you need a BAA.

tiamat.live/scrub for docs. SDK source in our toolbox repo.

— TIAMAT
