You're building a clinical AI tool. Your pipeline looks like this:
patient transcript → LLM → structured output
The problem: that transcript contains the patient's name, phone number, date of birth, and SSN. Every time you send it to an LLM, you're potentially violating HIPAA Safe Harbor.
Here's how to fix it in two lines.
## The Problem
HIPAA Safe Harbor requires removing 18 types of identifiers before sharing patient data. Most teams handle this with:
- Full de-identification libraries (heavy, slow to integrate)
- Manual regex (fragile, incomplete)
- Hoping the LLM won't memorize it (beside the point: sending the identifiers is already the disclosure)
What you actually need is a fast pre-processing step that strips identifiers before the text leaves your system.
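To make the shape of that step concrete, here is a deliberately minimal local sketch covering just two identifier types. The patterns and the `scrub_local` name are illustrative, not part of any real library, and as the list above says, hand-rolled regexes like these are exactly the fragile approach a dedicated service replaces:

```python
import re

# Two illustrative rules only — nowhere near the coverage
# HIPAA Safe Harbor's 18 identifier types require.
PATTERNS = [
    (re.compile(r'\b\d{3}-\d{2}-\d{4}\b'), '[SSN]'),    # 123-45-6789
    (re.compile(r'\b\d{3}-\d{3}-\d{4}\b'), '[PHONE]'),  # 415-555-1234
]

def scrub_local(text: str) -> str:
    """Replace matched identifiers with placeholder tokens."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text
```

It works for the happy path and silently misses everything else, which is why the rest of this post uses a hosted API instead.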
## The Fix
```python
import requests

def scrub_phi(text: str) -> str:
    """Strip PHI before sending to any LLM."""
    result = requests.post(
        'https://the-service.live/api/scrub',
        json={'text': text}
    ).json()
    return result['scrubbed_text']
```
```python
# Before: risky
response = llm.complete(raw_transcript)

# After: safe
clean = scrub_phi(raw_transcript)
response = llm.complete(clean)
```
## What Gets Stripped
The API catches 12 identifier types, including:

| Input | Output |
|---|---|
| 415-555-1234 | [PHONE] |
| jane@hospital.com | [EMAIL] |
| SSN 123-45-6789 | SSN [SSN] |
| DOB 07/22/1960 | DOB [DOB] |
| MRN 98765 | MRN [MRN] |
| NPI 1234567890 | NPI [NPI] |
| 192.168.1.1 | [IP] |
| ZIP 94105 | [ZIP] |
Full example:

```python
raw = "Patient Jane called 415-555-1234. DOB 07/22/1960. MRN 98765. SSN 123-45-6789."
clean = scrub_phi(raw)
# → "Patient Jane called [PHONE]. DOB [DOB]. MRN [MRN]. SSN [SSN]."
```
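The bare `scrub_phi` above has no timeout and no error handling; in a clinical pipeline you almost certainly want to fail closed, so a network hiccup can never silently fall back to sending raw text. Here is one hedged sketch. The retry count and timeout are assumptions, not documented API behavior, and the HTTP client is passed in (use `client=requests` in production) so the fail-closed path is easy to test:

```python
def scrub_phi_safe(text: str, client, retries: int = 2,
                   timeout: float = 5.0) -> str:
    """Scrub PHI, failing closed: raise rather than return raw text."""
    last_error = None
    for _ in range(retries + 1):
        try:
            resp = client.post(
                'https://the-service.live/api/scrub',
                json={'text': text},
                timeout=timeout,
            )
            resp.raise_for_status()
            return resp.json()['scrubbed_text']
        except Exception as err:  # network error, bad status, bad JSON
            last_error = err
    raise RuntimeError('PHI scrub failed; refusing to send raw text') from last_error
```

The key design choice is the final `raise`: if scrubbing fails, the pipeline stops instead of forwarding PHI.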
## LangChain Integration
If you're using LangChain, wrap it as a preprocessing step:
```python
import requests
from langchain.schema.runnable import RunnableLambda
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

def scrub_phi(inputs: dict) -> dict:
    text = inputs['text']
    result = requests.post(
        'https://the-service.live/api/scrub',
        json={'text': text}
    ).json()
    return {'text': result['scrubbed_text']}

scrubber = RunnableLambda(scrub_phi)
llm = ChatOpenAI(model='gpt-4o-mini')
prompt = ChatPromptTemplate.from_template(
    'Summarize this clinical note: {text}'
)

# Chain: scrub → prompt → llm
# scrub_phi already returns {'text': ...}, so it feeds the prompt directly
chain = scrubber | prompt | llm
result = chain.invoke({'text': raw_clinical_note})
```

## Batch Processing
For bulk de-identification:
```python
def scrub_batch(texts: list[str]) -> list[str]:
    results = []
    for text in texts:
        r = requests.post(
            'https://the-service.live/api/scrub',
            json={'text': text}
        ).json()
        results.append(r['scrubbed_text'])
    return results

# De-identify 1000 notes before training
clean_notes = scrub_batch(raw_notes)
```
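The loop above makes one request at a time, so 1,000 notes means 1,000 serialized round trips. Assuming the service tolerates concurrent requests (not confirmed by anything quoted here, so check before turning up `max_workers`), a thread-pool variant that preserves input order looks like this; pass the `scrub_phi` function from earlier as `scrub`:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def scrub_batch_concurrent(texts: list[str],
                           scrub: Callable[[str], str],
                           max_workers: int = 8) -> list[str]:
    """Scrub many notes concurrently; pool.map preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scrub, texts))
```

Threads (not processes) are the right fit because each call is I/O-bound: the worker spends its time waiting on the network.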
## Pricing
- Free tier: 100 calls/day, no key required
- Production: $0.005/call (pay-per-use, no monthly minimums)
Demo: the-service.live/playground
Docs: the-service.live/docs
## What This Doesn't Do
This is not a complete HIPAA compliance solution. It's a fast pre-processing step for stripping structured identifiers from text. You still need:
- Business Associate Agreements with your LLM providers
- Proper audit logging
- Access controls on who can query patient data
- Expert legal review of your specific use case
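Of those, audit logging is the easiest to bolt onto the scrub step itself. A hedged sketch only: the field names and helper are mine, not the API's, and the one hard rule is to log a fingerprint of the input, never the raw text:

```python
import hashlib
import logging
import time
from typing import Callable

audit_log = logging.getLogger('phi_scrub_audit')

def scrub_with_audit(text: str, scrub: Callable[[str], str],
                     caller: str) -> str:
    """Scrub text and record who scrubbed what, without logging raw PHI."""
    scrubbed = scrub(text)
    audit_log.info(
        'scrub caller=%s sha256=%s chars_in=%d chars_out=%d ts=%d',
        caller,
        hashlib.sha256(text.encode()).hexdigest()[:16],  # fingerprint, not content
        len(text),
        len(scrubbed),
        int(time.time()),
    )
    return scrubbed
```

What your compliance team actually requires in those records is a question for them, not for this sketch.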
But for teams building clinical AI tools who need a quick, reliable way to strip identifiers before LLM calls — this solves that specific problem.
Built by EnergenAI — autonomous AI infrastructure