Strip PHI Before It Hits Your LLM: A Free De-identification API for Clinical Text

#healthcare #ai #hipaa #python

Most healthcare AI teams hit the same wall: you want to use GPT or Claude to summarize clinical notes, but PHI can't leave your environment without a business associate agreement (BAA), and many LLM API plans don't offer one.

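The workaround that actually works: de-identify the text before the LLM call. The LLM never sees PHI, so the LLM provider no longer needs to be covered by a BAA. The de-identification step itself still handles raw PHI, so whatever runs it has to sit inside your compliance boundary.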

Quick Example

curl -X POST https://tiamat.live/api/scrub \
  -H 'Content-Type: application/json' \
  -d '{"text": "Patient seen by Dr. Williams, DOB 03/22/1975, MRN 8827410, SSN 234-56-7890, call 555-234-5678"}'

Returns:

{
  "scrubbed": "Patient seen by [NAME_1], DOB [DATE_1], MRN [MRN_1], SSN [SSN_1], call [PHONE_1]",
  "count": 5,
  "entities": {
    "NAME_1": "Dr. Williams",
    "DATE_1": "03/22/1975",
    "MRN_1": "8827410",
    "SSN_1": "234-56-7890",
    "PHONE_1": "555-234-5678"
  }
}

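The restore tokens ([NAME_1], [DATE_1], etc.) let you map the LLM's output back to real values if your downstream use case needs re-identification. The entities map is itself PHI: keep it inside your environment and never include it in the prompt.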
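As a rough illustration of that re-identification step, the sketch below swaps tokens in an LLM response back to their original values. The function name restore_entities and the sample strings are hypothetical; the only thing taken from the response format above is the token-to-value shape of the entities map.

```python
import re

def restore_entities(text: str, entities: dict[str, str]) -> str:
    """Replace [TOKEN_n] placeholders with their original values."""
    # One pass over bracketed tokens, so [NAME_1] can never clobber part of [NAME_10].
    pattern = re.compile(r"\[([A-Z]+_\d+)\]")
    # Tokens missing from the map (e.g. ones the LLM invented) are left untouched.
    return pattern.sub(lambda m: entities.get(m.group(1), m.group(0)), text)

entity_map = {"NAME_1": "Dr. Williams", "DATE_1": "03/22/1975"}
llm_output = "[NAME_1] saw the patient (DOB [DATE_1]); follow up with [NAME_2]."
restored = restore_entities(llm_output, entity_map)
```

Leaving unknown tokens in place is a deliberate choice here: a leftover [NAME_2] in the final text is easier to spot and debug than a silently dropped placeholder.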

What HIPAA Safe Harbor Requires

HIPAA Safe Harbor (45 CFR §164.514(b)(2)) requires removing 18 categories of identifiers before text is considered de-identified:

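Names, geographic subdivisions smaller than a state, dates more specific than the year (and ages over 89), phone numbers, fax numbers
Email addresses, SSNs, MRNs, health plan beneficiary numbers
Account numbers, certificate/license numbers, vehicle identifiers (VINs, license plates)
Device identifiers and serial numbers, URLs, IP addresses, biometric identifiers
Full-face photos, any other unique identifying number, characteristic, or code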

Remove all 18, with no actual knowledge that what remains could identify the patient, and the text is no longer PHI. You can then send it to any LLM without a BAA covering that LLM call.

What This API Detects

SSNs: \d{3}-\d{2}-\d{4} pattern
MRNs and account numbers: structured numeric identifiers
Phone numbers: US formats including extensions
Email addresses
Dates: MM/DD/YYYY, YYYY-MM-DD, written forms (March 22, 1975)
IP addresses and URLs
ZIP codes
Titled names: Dr. Smith, Mr. Jones, Ms. Davis (prefix detection)
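
To make the list above concrete, here is a minimal sketch of how a regex layer of this kind could work. The patterns, the PATTERNS table, and the scrub function are illustrative assumptions covering only a few of the categories; they are not the service's actual rules, which this post does not publish.

```python
import re

# Illustrative patterns only (assumed, not the API's real rule set).
PATTERNS = [
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
    ("MRN", re.compile(r"(?<=MRN )\d{6,10}\b")),
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b|\b\d{4}-\d{2}-\d{2}\b")),
    ("NAME", re.compile(r"\b(?:Dr|Mr|Mrs|Ms)\.\s+[A-Z][a-z]+")),
]

def scrub(text: str) -> tuple[str, dict[str, str]]:
    """Replace matches with numbered tokens and return the token map."""
    entities: dict[str, str] = {}
    counts: dict[str, int] = {}
    for label, pattern in PATTERNS:
        def replace(match, label=label):
            # Number tokens per category: SSN_1, SSN_2, ...
            counts[label] = counts.get(label, 0) + 1
            token = f"{label}_{counts[label]}"
            entities[token] = match.group(0)
            return f"[{token}]"
        text = pattern.sub(replace, text)
    return text, entities

scrubbed, entities = scrub(
    "Patient seen by Dr. Williams, DOB 03/22/1975, MRN 8827410, "
    "SSN 234-56-7890, call 555-234-5678"
)
```

In this sketch a value that appears twice gets two different tokens; a production scrubber would more likely reuse one token per distinct value so the LLM can tell the mentions refer to the same thing.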

Honest limitation: Bare names without titles ("Jane Doe") require named-entity recognition (NER), which adds latency. For full Safe Harbor coverage on unstructured clinical notes, combine this API with a local spaCy model that emits PERSON entities, such as en_core_web_lg. (scispaCy's en_core_sci_md tags biomedical entity mentions, not person names, so it won't do this job on its own.) The regex layer handles the structured identifiers; the NER layer handles bare names.
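
A minimal sketch of what that NER layer might look like follows. The function scrub_person_names and the PERSON_n token prefix are hypothetical (the prefix is chosen so it does not collide with the API's NAME_n tokens). The nlp argument is assumed to be a loaded spaCy pipeline; the code relies only on doc.ents and the label_, text, start_char, and end_char attributes of each entity span.

```python
def scrub_person_names(text: str, nlp) -> tuple[str, dict[str, str]]:
    """Replace PERSON entities found by a spaCy pipeline with [PERSON_n] tokens."""
    doc = nlp(text)
    # doc.ents is in document order and non-overlapping, so one left-to-right pass works.
    people = [ent for ent in doc.ents if ent.label_ == "PERSON"]

    entities: dict[str, str] = {}
    pieces: list[str] = []
    cursor = 0
    for i, ent in enumerate(people, start=1):
        token = f"PERSON_{i}"
        entities[token] = ent.text
        pieces.append(text[cursor:ent.start_char])  # text before the name
        pieces.append(f"[{token}]")                 # placeholder for the name
        cursor = ent.end_char
    pieces.append(text[cursor:])                    # trailing text
    return "".join(pieces), entities

# Typical wiring (assumes spaCy and a general English model are installed):
#   import spacy
#   nlp = spacy.load("en_core_web_lg")
#   clean, people = scrub_person_names(note_text, nlp)
```

Running this locally before the API call keeps bare names inside your environment, and merging its token map with the API's entities map gives you one lookup table for re-identification.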

Integration Pattern

import requests

SCRUB_URL = "https://tiamat.live/api/scrub"

def analyze_note(clinical_text: str, llm_client, task: str) -> dict:
    # Step 1: strip PHI identifiers; raise on failure rather than fall back to raw text
    response = requests.post(SCRUB_URL, json={"text": clinical_text}, timeout=10)
    response.raise_for_status()
    scrub = response.json()

    # Step 2: LLM call with scrubbed text only
    # (llm_client.complete stands in for whatever your LLM SDK exposes)
    analysis = llm_client.complete(
        f"Task: {task}\n\nClinical note:\n{scrub['scrubbed']}"
    )

    return {
        "analysis": analysis,
        "phi_removed": scrub["count"],
        "entity_map": scrub["entities"],  # contains PHI; for re-identification if needed
    }

Average latency for the scrub call: ~7ms. Negligible compared to the LLM call.
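
Since any pattern-based scrubber can miss an unusual format, one option is a cheap fail-closed check between Step 1 and Step 2. The sketch below is an assumption about how such a guard might look, not part of the API; the names assert_scrubbed, ResidualPHIError, and RESIDUAL_PATTERNS are hypothetical, and the two patterns are examples rather than a complete list.

```python
import re

# Example patterns for identifiers that should never survive scrubbing (not exhaustive).
RESIDUAL_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
}

class ResidualPHIError(ValueError):
    """Raised when scrubbed text still looks like it contains an identifier."""

def assert_scrubbed(text: str) -> str:
    """Return text unchanged, or raise if an obvious identifier pattern remains."""
    hits = [name for name, pattern in RESIDUAL_PATTERNS.items() if pattern.search(text)]
    if hits:
        # Report only the category names, never the matched value itself.
        raise ResidualPHIError(f"possible PHI left after scrubbing: {', '.join(hits)}")
    return text
```

Wrapping scrub['scrubbed'] in assert_scrubbed(...) before building the prompt means a miss stops the pipeline instead of reaching the LLM.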

Free Tier

100 requests/day, no authentication, no signup. Just POST to the endpoint and verify it works for your note format.

Production tiers with higher rate limits start at $9/month.

Docs and live demo: tiamat.live/docs

The Compliance Framing

When your compliance officer asks how you're handling PHI in the LLM pipeline, the answer becomes: "Our LLM receives no PHI. Text is de-identified using Safe Harbor methodology before any LLM call. Here is the de-identification service and its methodology." That answer only holds once bare names are covered as well, so have the NER layer in place before you give it.

That's a much easier conversation than "our LLM vendor has signed our BAA" — and it works with any LLM provider, open source or commercial.

If you're building on top of this or have edge cases in your note format, feel free to ask in the comments.