Most healthcare AI teams hit the same wall: you want to use GPT or Claude to summarize clinical notes, but PHI can't leave your environment without a Business Associate Agreement (BAA), and many LLM providers or plan tiers don't offer one.
The workaround that actually works: de-identify the text before the LLM call. The LLM never sees PHI. No BAA needed for the LLM provider.
Quick Example
curl -X POST https://tiamat.live/api/scrub \
  -H 'Content-Type: application/json' \
  -d '{"text": "Patient seen by Dr. Williams, DOB 03/22/1975, MRN 8827410, SSN 234-56-7890, call 555-234-5678"}'
Returns:
{
  "scrubbed": "Patient seen by [NAME_1], DOB [DATE_1], MRN [MRN_1], SSN [SSN_1], call [PHONE_1]",
  "count": 5,
  "entities": {
    "NAME_1": "Dr. Williams",
    "DATE_1": "03/22/1975",
    "MRN_1": "8827410",
    "SSN_1": "234-56-7890",
    "PHONE_1": "555-234-5678"
  }
}
The restore tokens ([NAME_1], [DATE_1], etc.) let you map the LLM's output back to real values if your downstream use case needs re-identification.
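Re-identification with the restore tokens can be sketched as a single substitution pass over the LLM's output. This is a minimal illustration, assuming the `entities` map has the shape shown in the response above; `restore_entities` is a hypothetical helper, not part of the API.

```python
import re

def restore_entities(text: str, entities: dict[str, str]) -> str:
    """Replace [TOKEN] placeholders with their original values."""
    def substitute(match: re.Match) -> str:
        token = match.group(1)
        # Unknown tokens are left untouched rather than guessed at.
        return entities.get(token, match.group(0))

    # One pass over whole tokens, so NAME_1 never clobbers part of NAME_10.
    return re.sub(r"\[([A-Z]+_\d+)\]", substitute, text)

entities = {"NAME_1": "Dr. Williams", "DATE_1": "03/22/1975"}
llm_output = "[NAME_1] saw the patient (DOB [DATE_1]); follow up with [NAME_2]."
restored = restore_entities(llm_output, entities)
```

Matching whole bracketed tokens, rather than calling `str.replace` once per key, keeps the result independent of the order of the keys in the map.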
What HIPAA Safe Harbor Requires
HIPAA Safe Harbor (45 CFR §164.514(b)) requires removing 18 identifier categories before text is considered de-identified:
- Names, geographic data, dates, phone numbers, fax numbers
- Email addresses, SSNs, MRNs, health plan beneficiary numbers
- Account numbers, certificate/license numbers, VINs
- Device identifiers, URLs, IP addresses, biometric identifiers
- Full-face photos, any other unique identifying number
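Remove all 18 (for the patient and for relatives, employers, and household members), have no actual knowledge that what remains could identify the individual, and the text is no longer PHI. You can send it to any LLM without a BAA covering that LLM call.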
What This API Detects
- SSNs: \d{3}-\d{2}-\d{4} pattern
- MRNs and account numbers: structured numeric identifiers
- Phone numbers: US formats including extensions
- Email addresses
- Dates: MM/DD/YYYY, YYYY-MM-DD, written forms (March 22, 1975)
- IP addresses and URLs
- ZIP codes
- Titled names: Dr. Smith, Mr. Jones, Ms. Davis (prefix detection)
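The structured identifiers in this list lend themselves to pattern matching. Below is a minimal sketch of what such a regex layer could look like. The patterns and the helper `scrub_structured` are illustrative assumptions, not the service's actual implementation, and they cover only a few of the categories.

```python
import re

# Illustrative patterns only (assumptions); real-world formats vary more widely.
PATTERNS = [
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b|\b\d{4}-\d{2}-\d{2}\b")),
    ("PHONE", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
]

def scrub_structured(text: str) -> tuple[str, dict[str, str]]:
    """Replace matches with numbered tokens and return the token map."""
    entities: dict[str, str] = {}
    for label, pattern in PATTERNS:
        seen: dict[str, str] = {}  # value -> token, so repeats share a token

        def replace(match: re.Match, label=label, seen=seen) -> str:
            value = match.group(0)
            if value not in seen:
                token = f"{label}_{len(seen) + 1}"
                seen[value] = token
                entities[token] = value
            return f"[{seen[value]}]"

        text = pattern.sub(replace, text)
    return text, entities

note = "SSN 234-56-7890, DOB 03/22/1975, call 555-234-5678, email jane@example.org"
scrubbed, entities = scrub_structured(note)
```

The more specific patterns run first, so a value is already tokenized by the time a looser pattern could have matched it.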
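Honest limitation: Bare names without titles ("Jane Doe") require NER, which adds latency. For full Safe Harbor coverage on unstructured clinical notes, combine this API with a local NER model for name detection. A general-purpose spaCy pipeline that emits PERSON labels (en_core_web_sm or larger) is a reasonable starting point; the scispaCy model en_core_sci_md targets biomedical entity mentions rather than person names. The regex layer handles the structured identifiers; the NER layer handles bare names.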
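The NER layer for bare names can be sketched as a pluggable span finder. The example assumes a general-purpose spaCy model with PERSON labels (en_core_web_sm here, which is an assumption rather than the author's choice). The names `make_spacy_finder` and `scrub_names`, and the `PERSON_n` token label, are hypothetical; the separate label is chosen so these tokens cannot collide with the API's `NAME_n` tokens.

```python
from typing import Callable

Span = tuple[int, int]  # (start_char, end_char)

def make_spacy_finder(model_name: str = "en_core_web_sm") -> Callable[[str], list[Span]]:
    """Build a finder backed by a locally installed spaCy model."""
    import spacy  # imported lazily so the rest of the sketch runs without spaCy

    nlp = spacy.load(model_name)  # load once, reuse for every note

    def find(text: str) -> list[Span]:
        return [
            (ent.start_char, ent.end_char)
            for ent in nlp(text).ents
            if ent.label_ == "PERSON"
        ]

    return find

def scrub_names(text: str, find_names: Callable[[str], list[Span]]) -> tuple[str, dict[str, str]]:
    """Replace each non-overlapping name span with a numbered token."""
    entities: dict[str, str] = {}
    pieces: list[str] = []
    cursor = 0
    for index, (start, end) in enumerate(sorted(find_names(text)), start=1):
        token = f"PERSON_{index}"
        entities[token] = text[start:end]
        pieces.append(text[cursor:start])
        pieces.append(f"[{token}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), entities

def stub_finder(text: str) -> list[Span]:
    """Stand-in for the NER model so the sketch runs without spaCy."""
    spans = []
    for name in ("Jane Doe", "John Roe"):
        start = text.find(name)
        if start != -1:
            spans.append((start, start + len(name)))
    return spans

clean, names = scrub_names("Jane Doe reports chest pain. Seen with John Roe.", stub_finder)
```

In practice you would pass `make_spacy_finder()` instead of `stub_finder`. Running this step locally before the API call means bare names are tokenized before the text leaves your environment.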
Integration Pattern
import requests

def analyze_note(clinical_text: str, llm_client, task: str) -> dict:
    # Step 1: strip PHI identifiers
    response = requests.post(
        "https://tiamat.live/api/scrub",
        json={"text": clinical_text},
        timeout=10,
    )
    response.raise_for_status()  # fail closed: never fall back to the raw note
    scrub = response.json()

    # Step 2: LLM call with clean text, no PHI exposure
    # (llm_client.complete stands in for your provider's SDK call)
    analysis = llm_client.complete(
        f"Task: {task}\n\nClinical note:\n{scrub['scrubbed']}"
    )
    return {
        "analysis": analysis,
        "phi_removed": scrub["count"],
        "entity_map": scrub["entities"],  # original values (PHI), for re-identification if needed
    }
Average latency for the scrub call: ~7ms. Negligible compared to the LLM call.
Free Tier
100 requests/day, no authentication, no signup. Just POST to the endpoint and verify it works for your note format.
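One way to verify the endpoint against your own note format is to scrub synthetic notes whose identifiers you planted yourself, then confirm none survive. A sketch, assuming a hypothetical helper `find_leaks`; the scrubbed string below is hand-written for illustration, not real API output.

```python
def find_leaks(scrubbed: str, known_identifiers: list[str]) -> list[str]:
    """Return the planted identifiers that still appear verbatim in the text."""
    lowered = scrubbed.lower()
    return [value for value in known_identifiers if value.lower() in lowered]

# Identifiers planted in a synthetic test note (never use real patient data here).
planted = ["Jane Doe", "Dr. Williams", "03/22/1975", "234-56-7890"]

# Hand-written stand-in for a scrub result in which a bare name slipped through.
scrubbed = "Patient Jane Doe seen by [NAME_1], DOB [DATE_1], SSN [SSN_1]"

leaks = find_leaks(scrubbed, planted)
```

A non-empty result marks a note format that needs another detection layer before the text goes to the LLM.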
Production tiers with higher rate limits start at $9/month.
Docs and live demo: tiamat.live/docs
The Compliance Framing
When your compliance officer asks how you're handling PHI in the LLM pipeline, the answer becomes: "Our LLM receives no PHI. Text is de-identified using Safe Harbor methodology before any LLM call. Here is the de-identification service and its methodology."
That's a much easier conversation than "our LLM vendor has signed our BAA", and it works with any LLM provider, open source or commercial.
If you're building on top of this or have edge cases in your note format, feel free to ask in the comments.