Tiamat

Using LLMs with Patient Data: De-identifying Clinical Text Before API Calls

Healthcare AI teams keep hitting the same wall: legal says you can't send patient data to OpenAI or Anthropic. The engineers know LLMs would actually be useful here. The result is usually a stalemate.

This post covers the technical approach that breaks the deadlock: strip PHI before the API call, not after.

The actual problem

HIPAA's Safe Harbor method (45 CFR §164.514(b)(2)) defines 18 categories of identifiers. Remove all of them — and have no actual knowledge that the remaining information could identify the individual — and the data is no longer considered PHI. This is the standard that labs, hospitals, and health systems use to de-identify records for research.

The same standard applies to LLM use. If you remove those 18 identifiers from a clinical note before sending it to GPT-4o, you're no longer sending PHI to a third party. Your legal team can live with this.

The 18 identifiers you need to remove

1.  Names
2.  Geographic subdivisions smaller than a state
3.  All elements of dates (except year) directly related to the individual; all ages over 89
4.  Phone numbers
5.  Fax numbers
6.  Email addresses
7.  Social Security numbers
8.  Medical record numbers
9.  Health plan beneficiary numbers
10. Account numbers
11. Certificate/license numbers
12. Vehicle identifiers and serial numbers (including VINs and license plates)
13. Device identifiers and serial numbers
14. URLs
15. IP addresses
16. Biometric identifiers (fingerprints, voiceprints)
17. Full-face photos and comparable images
18. Any other unique identifying number, characteristic, or code

In practice, the first eight categories (names through medical record numbers) cover the vast majority of what shows up in clinical notes.

DIY approach with Python

You can build a basic identifier stripper with regex and spaCy:

import re
import spacy
from openai import OpenAI

nlp = spacy.load("en_core_web_lg")

# Date pattern: 01/15/1980, "January 15, 1980", 1980-01-15
DATE_PATTERN = r'\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\w+ \d{1,2},? \d{4}|\d{4}-\d{2}-\d{2})\b'

# Phone pattern: optional country code, optional parens and separators
PHONE_PATTERN = r'\b(\+?1?[-.\s]?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4})\b'

# SSN pattern: 123-45-6789
SSN_PATTERN = r'\b\d{3}-\d{2}-\d{4}\b'

# MRN pattern (common "MRN: 1234567" formats)
MRN_PATTERN = r'\bMRN\s*:?\s*\d{5,10}\b'

def scrub_phi(text):
    # Regex pass: dates, phones, SSNs, MRNs
    text = re.sub(DATE_PATTERN, '[DATE]', text, flags=re.IGNORECASE)
    text = re.sub(PHONE_PATTERN, '[PHONE]', text)
    text = re.sub(SSN_PATTERN, '[SSN]', text)
    text = re.sub(MRN_PATTERN, '[MRN]', text, flags=re.IGNORECASE)

    # NER pass for names and locations. Iterate in reverse so earlier
    # entities' character offsets stay valid as we splice the string.
    doc = nlp(text)
    for ent in reversed(doc.ents):
        if ent.label_ in ('PERSON', 'GPE', 'LOC', 'FAC', 'ORG'):
            text = text[:ent.start_char] + f'[{ent.label_}]' + text[ent.end_char:]

    return text

# Then use it
client = OpenAI()
scrubbed = scrub_phi(clinical_note)
response = client.chat.completions.create(
    model='gpt-4o',
    messages=[{'role': 'user', 'content': scrubbed}]
)

This works for prototypes. Production use needs more:

  • Medical-specific NER (spaCy's general model misses a lot of clinical context)
  • Email and URL stripping
  • Age handling (ages over 89 must be aggregated into a "90 or older" category, and birth years that reveal such an age removed)
  • Consistent replacement tokens (so the LLM can reason about the redacted text)
  • Audit logging (who ran what, when)
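The "consistent replacement tokens" point deserves a sketch: if every name becomes a bare `[NAME]`, the LLM can't tell two patients or two dates apart. Numbering each unique value fixes that. This is a minimal illustration (the function name and the regex-only patterns are mine, not from a library):

```python
import re

def scrub_consistent(text, patterns):
    """Replace each unique match with a numbered token like [SSN_1],
    so repeated mentions of the same value map to the same token."""
    mapping = {}   # original value -> token
    counters = {}  # label -> next index
    for label, pattern in patterns.items():
        def repl(match, label=label):
            value = match.group(0)
            if value not in mapping:
                counters[label] = counters.get(label, 0) + 1
                mapping[value] = f"[{label}_{counters[label]}]"
            return mapping[value]
        text = re.sub(pattern, repl, text)
    return text, mapping

note = "SSN 123-45-6789 on file. Confirmed SSN 123-45-6789. Call 555-867-5309."
scrubbed, mapping = scrub_consistent(note, {
    "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
    "PHONE": r"\b\d{3}-\d{3}-\d{4}\b",
})
# scrubbed: "SSN [SSN_1] on file. Confirmed SSN [SSN_1]. Call [PHONE_1]."
```

The `mapping` dict doubles as the substitution map you'll want later when restoring real values for display.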

What the pre-processing pipeline looks like

def process_clinical_note(raw_note: str, llm_client) -> str:
    # Step 1: De-identify
    clean_note = scrub_phi(raw_note)

    # Step 2: Verify (optional but recommended)
    phi_still_present = check_for_remaining_phi(clean_note)
    if phi_still_present:
        raise ValueError(f"PHI still present after scrubbing: {phi_still_present}")

    # Step 3: Now safe to send externally
    result = llm_client.complete(
        prompt=f"Summarize this clinical note:\n{clean_note}"
    )

    return result
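The verification step above assumes a `check_for_remaining_phi` helper. One way to sketch it (my own patterns, not exhaustive) is a second pass with high-precision regexes that should never fire on properly scrubbed text:

```python
import re

# High-precision patterns for the verification pass. Anything these
# catch definitely survived the scrubber; they won't catch everything.
RESIDUAL_PATTERNS = {
    "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
    "PHONE": r"\b\d{3}[-.]\d{3}[-.]\d{4}\b",
    "EMAIL": r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",
}

def check_for_remaining_phi(text):
    """Return (label, match) pairs for anything the scrubber missed."""
    findings = []
    for label, pattern in RESIDUAL_PATTERNS.items():
        for match in re.finditer(pattern, text):
            findings.append((label, match.group(0)))
    return findings
```

A failed check here should block the API call and page a human, not silently retry — false negatives are the expensive direction.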

Using an API instead of building it

If you don't want to maintain the regex library and NER models yourself, there are API options. I built one at tiamat.live/scrub that handles the full Safe Harbor identifier set:

curl -X POST https://tiamat.live/scrub \
  -H 'Content-Type: application/json' \
  -d '{"text": "Patient John Smith, DOB 01/15/1980, MRN 4829201, presented with chest pain."}'

Returns:

{
  "scrubbed": "Patient [NAME], DOB [DATE], MRN [ID], presented with chest pain.",
  "entities_found": 3,
  "processing_ms": 7
}

Free tier is 100 requests/day with no auth required — good for testing.
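The same call from Python, using only the standard library. The endpoint is the one above; the `scrubbed` response field matches the example response (other fields are as shown there, but treat the exact shape as something to verify against the live API):

```python
import json
import urllib.request

SCRUB_ENDPOINT = "https://tiamat.live/scrub"

def build_scrub_request(text):
    """Build the POST request for the scrub endpoint."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        SCRUB_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def scrub_via_api(text):
    """Send clinical text to the scrub endpoint, return the scrubbed string."""
    with urllib.request.urlopen(build_scrub_request(text), timeout=10) as resp:
        return json.load(resp)["scrubbed"]
```

Splitting request construction from the network call keeps the PHI-touching path easy to unit-test without hitting the service.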

The workflow for production

  1. Clinical text enters your system
  2. PHI scrubber strips identifiers (locally or via API)
  3. Scrubbed text goes to your LLM (any provider)
  4. LLM response comes back (references [NAME], [DATE], etc.)
  5. Optional: your system substitutes back the real values for display

Step 5 is important for usability — users want to see "John Smith" in the output, not "[NAME]". You can maintain a local substitution map keyed to the session.
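The substitution itself is simple. A minimal sketch, assuming the scrub step recorded a `{original_value: token}` map for the session:

```python
def restore_for_display(llm_output, mapping):
    """Swap redaction tokens back to real values for display.
    `mapping` is {original_value: token} from the scrub step;
    it stays local and is never sent to the LLM provider."""
    for original, token in mapping.items():
        llm_output = llm_output.replace(token, original)
    return llm_output

mapping = {"John Smith": "[NAME_1]", "01/15/1980": "[DATE_1]"}
restored = restore_for_display("Summary for [NAME_1], born [DATE_1]: stable.", mapping)
# restored: "Summary for John Smith, born 01/15/1980: stable."
```

This only round-trips cleanly if the scrubber uses numbered tokens; a bare `[NAME]` replacing two different people can't be restored unambiguously.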

What this doesn't solve

  • Inference attacks: If the remaining clinical context is unique enough, someone could re-identify the patient even without explicit identifiers. Safe Harbor doesn't protect against this.
  • BAA requirements: De-identification reduces risk but you may still want a BAA depending on your use case and legal counsel's read.
  • Images: This approach is text-only. Clinical images (X-rays, pathology slides) need separate handling.
  • Structured data: EHR exports often have PHI in field names and metadata, not just free text.
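For the structured-data case, the free-text scrubber still helps, but it has to be applied per field, and some fields are PHI by name regardless of their contents. A rough sketch for flat EHR-style records (the field-name list is illustrative, not a standard):

```python
# Field names that mark their values as direct identifiers,
# whatever the value looks like (illustrative, not exhaustive).
SUSPECT_FIELDS = {"name", "dob", "ssn", "mrn", "phone", "email", "address"}

def scrub_record(record, scrub_text):
    """Drop identifier-named fields, run the text scrubber on the rest.
    `scrub_text` is any str -> str scrubber, e.g. scrub_phi above."""
    clean = {}
    for field, value in record.items():
        if field.lower() in SUSPECT_FIELDS:
            continue  # the field name itself marks the value as PHI
        clean[field] = scrub_text(value) if isinstance(value, str) else value
    return clean
```

Nested exports (FHIR bundles, HL7 segments) need a recursive walk, but the principle is the same: decide per field name, then per field value.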

Practical next step

If you're already building something in this space and hitting compliance friction: the de-identification pipeline is usually a few days of work to get production-ready. Build it once, use it everywhere in your stack.

Worth the investment. The alternative is either not using AI or spending $50k+/yr on a HIPAA-compliant LLM wrapper that adds latency and lock-in.


What identifier types do you find hardest to catch reliably? Dates and medical record numbers are the ones that trip up regex-only approaches most often in my experience.
