You're building a clinical AI tool. Your pipeline looks like this:
patient transcript → LLM → structured output
The problem: that transcript contains the patient's name, phone number, date of birth, and SSN. Every time you send it to an LLM, you're potentially violating HIPAA Safe Harbor.
Here's how to fix it in two lines.
## The Problem
HIPAA Safe Harbor requires removing 18 types of identifiers before sharing patient data. Most teams handle this with:
- Full de-identification libraries (heavy, slow to integrate)
- Manual regex (fragile, incomplete)
- Hoping the LLM won't memorize it (beside the point: sending the identifiers is already the disclosure)
What you actually need is a fast pre-processing step that strips identifiers before the text leaves your system.
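To make the shape of that step concrete, here is a deliberately minimal local sketch covering just two identifier types. The patterns and the `scrub_local` name are illustrative, not part of any real library, and as the list above says, hand-rolled regexes like these are exactly the fragile approach a dedicated service replaces:

```python
import re

# Two illustrative rules only — nowhere near the coverage
# HIPAA Safe Harbor's 18 identifier types require.
PATTERNS = [
    (re.compile(r'\b\d{3}-\d{2}-\d{4}\b'), '[SSN]'),    # 123-45-6789
    (re.compile(r'\b\d{3}-\d{3}-\d{4}\b'), '[PHONE]'),  # 415-555-1234
]

def scrub_local(text: str) -> str:
    """Replace matched identifiers with placeholder tokens."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text
```

It works for the happy path and silently misses everything else, which is why the rest of this post uses a hosted API instead.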
## The Fix
```python
import requests

def scrub_phi(text: str) -> str:
    """Strip PHI before sending to any LLM."""
    result = requests.post(
        'https://the-service.live/api/scrub',
        json={'text': text}
    ).json()
    return result['scrubbed_text']
```
```python
# Before: risky
response = llm.complete(raw_transcript)

# After: safe
clean = scrub_phi(raw_transcript)
response = llm.complete(clean)
```
## What Gets Stripped
The API catches 12 identifier types, including:

| Input | Output |
|---|---|
| 415-555-1234 | [PHONE] |
| jane@hospital.com | [EMAIL] |
| SSN 123-45-6789 | SSN [SSN] |
| DOB 07/22/1960 | DOB [DOB] |
| MRN 98765 | MRN [MRN] |
| NPI 1234567890 | NPI [NPI] |
| 192.168.1.1 | [IP] |
| ZIP 94105 | [ZIP] |
Full example:

```python
raw = "Patient Jane called 415-555-1234. DOB 07/22/1960. MRN 98765. SSN 123-45-6789."
clean = scrub_phi(raw)
# → "Patient Jane called [PHONE]. DOB [DOB]. MRN [MRN]. SSN [SSN]."
```
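The bare `scrub_phi` above has no timeout and no error handling; in a clinical pipeline you almost certainly want to fail closed, so a network hiccup can never silently fall back to sending raw text. Here is one hedged sketch. The retry count and timeout are assumptions, not documented API behavior, and the HTTP client is passed in (use `client=requests` in production) so the fail-closed path is easy to test:

```python
def scrub_phi_safe(text: str, client, retries: int = 2,
                   timeout: float = 5.0) -> str:
    """Scrub PHI, failing closed: raise rather than return raw text."""
    last_error = None
    for _ in range(retries + 1):
        try:
            resp = client.post(
                'https://the-service.live/api/scrub',
                json={'text': text},
                timeout=timeout,
            )
            resp.raise_for_status()
            return resp.json()['scrubbed_text']
        except Exception as err:  # network error, bad status, bad JSON
            last_error = err
    raise RuntimeError('PHI scrub failed; refusing to send raw text') from last_error
```

The key design choice is the final `raise`: if scrubbing fails, the pipeline stops instead of forwarding PHI.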
## LangChain Integration
If you're using LangChain, wrap it as a preprocessing step:
```python
import requests
from langchain.schema.runnable import RunnableLambda
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

def scrub_phi(inputs: dict) -> dict:
    text = inputs['text']
    result = requests.post(
        'https://the-service.live/api/scrub',
        json={'text': text}
    ).json()
    return {'text': result['scrubbed_text']}

scrubber = RunnableLambda(scrub_phi)
llm = ChatOpenAI(model='gpt-4o-mini')
prompt = ChatPromptTemplate.from_template(
    'Summarize this clinical note: {text}'
)

# Chain: scrub → prompt → llm
# scrub_phi already returns {'text': ...}, so it feeds the prompt directly
chain = scrubber | prompt | llm
result = chain.invoke({'text': raw_clinical_note})
```

## Batch Processing
For bulk de-identification:
```python
def scrub_batch(texts: list[str]) -> list[str]:
    results = []
    for text in texts:
        r = requests.post(
            'https://the-service.live/api/scrub',
            json={'text': text}
        ).json()
        results.append(r['scrubbed_text'])
    return results

# De-identify 1000 notes before training
clean_notes = scrub_batch(raw_notes)
```
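The loop above makes one request at a time, so 1,000 notes means 1,000 serialized round trips. Assuming the service tolerates concurrent requests (not confirmed by anything quoted here, so check before turning up `max_workers`), a thread-pool variant that preserves input order looks like this; pass the `scrub_phi` function from earlier as `scrub`:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def scrub_batch_concurrent(texts: list[str],
                           scrub: Callable[[str], str],
                           max_workers: int = 8) -> list[str]:
    """Scrub many notes concurrently; pool.map preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scrub, texts))
```

Threads (not processes) are the right fit because each call is I/O-bound: the worker spends its time waiting on the network.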
## Pricing
- Free tier: 100 calls/day, no key required
- Production: $0.005/call (pay-per-use, no monthly minimums)
Demo: the-service.live/playground
Docs: the-service.live/docs
## What This Doesn't Do
This is not a complete HIPAA compliance solution. It's a fast pre-processing step for stripping structured identifiers from text. You still need:
- Business Associate Agreements with your LLM providers
- Proper audit logging
- Access controls on who can query patient data
- Expert legal review of your specific use case
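Of those, audit logging is the easiest to bolt onto the scrub step itself. A hedged sketch only: the field names and helper are mine, not the API's, and the one hard rule is to log a fingerprint of the input, never the raw text:

```python
import hashlib
import logging
import time
from typing import Callable

audit_log = logging.getLogger('phi_scrub_audit')

def scrub_with_audit(text: str, scrub: Callable[[str], str],
                     caller: str) -> str:
    """Scrub text and record who scrubbed what, without logging raw PHI."""
    scrubbed = scrub(text)
    audit_log.info(
        'scrub caller=%s sha256=%s chars_in=%d chars_out=%d ts=%d',
        caller,
        hashlib.sha256(text.encode()).hexdigest()[:16],  # fingerprint, not content
        len(text),
        len(scrubbed),
        int(time.time()),
    )
    return scrubbed
```

What your compliance team actually requires in those records is a question for them, not for this sketch.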
But for teams building clinical AI tools who need a quick, reliable way to strip identifiers before LLM calls — this solves that specific problem.
Built by EnergenAI — autonomous AI infrastructure