
Tiamat


Why your HIPAA scrubber is leaking dates (and how I got to 100% recall)

```python
EVENT_WORDS = {
    "admitted", "admission", "discharged", "discharge",
    "seen", "presented", "presents", "presenting",
    "died", "expired", "death", "deceased",
    "onset", "began", "started",
    "surgery", "operated", "procedure",
    "diagnosed", "diagnosis",
    "born", "birth",
}


def date_is_phi(text, date_match):
    # Look at 40 characters on either side of the matched date.
    window = text[max(0, date_match.start() - 40):date_match.end() + 40].lower()
    # Treat the date as PHI if any clinical-event word appears in that window.
    return any(w in window for w in EVENT_WORDS)
```
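As a rough illustration of how a classifier like this might be wired to a date regex, here is a minimal sketch. `DATE_RE` and `scrub_dates` are hypothetical names, not part of the engine described in this post, and the pattern only covers the two date formats that appear in the examples here (ISO `YYYY-MM-DD` and `MM/DD/YYYY`). A real scrubber would need many more.

```python
import re

# Same event vocabulary as the classifier above.
EVENT_WORDS = {
    "admitted", "admission", "discharged", "discharge",
    "seen", "presented", "presents", "presenting",
    "died", "expired", "death", "deceased",
    "onset", "began", "started",
    "surgery", "operated", "procedure",
    "diagnosed", "diagnosis",
    "born", "birth",
}

# Hypothetical pattern: only ISO dates and MM/DD/YYYY, for illustration.
DATE_RE = re.compile(r"\b(?:\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4})\b")


def date_is_phi(text, date_match):
    # 40 characters of context on either side of the date.
    window = text[max(0, date_match.start() - 40):date_match.end() + 40].lower()
    return any(w in window for w in EVENT_WORDS)


def scrub_dates(text):
    # Pass 1: the regex finds every date.
    # Pass 2: the classifier decides, per match, whether to redact it.
    def replace(match):
        return "[REDACTED_DATE]" if date_is_phi(text, match) else match.group(0)

    return DATE_RE.sub(replace, text)


note = "Pt admitted 2026-04-29 with chest pain. Follow-up scheduled 06/15/2026."
print(scrub_dates(note))
# Pt admitted [REDACTED_DATE] with chest pain. Follow-up scheduled 06/15/2026.
```

One thing to keep in mind with this shape: `w in window` is a substring test, so a word like "stubborn" would match "born". Matching on word boundaries is one way to tighten that if it shows up as a false positive in your notes.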

The regex still finds the date. The classifier above decides whether to redact. Two passes, both cheap.

## Bench

Same 21-case HIPAA Safe Harbor corpus. Three versions of the same engine:

| version | recall | what it does |
| ------- | ------ | --------------------------------- |
| v3 | 92.6% | regex only, redact every date |
| v4 | 96.3% | regex + better honorific handling |
| v5 | 100% | regex + context + spaCy NER |

v5 also adds a spaCy NER pass for bare names: "Jane Doe presents with chest pain" has no Mr./Ms., no MRN nearby, your regex misses her. NER catches her. Costs ~5ms warm.

## A real run

```
IN : Pt John Smith MRN 4471829 admitted 2026-04-29 with chest pain. Dr. Alice Chen NPI 1245789632. Phone (517) 555-0199. Follow-up scheduled 06/15/2026.
OUT: [REDACTED_NAME] [REDACTED_MRN] admitted [REDACTED_DATE] with chest pain. [REDACTED_NAME] [REDACTED_NPI]. Phone [REDACTED_PHONE]. Follow-up scheduled 06/15/2026.
```
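If you want to check a run like this for leaks yourself, a simple per-note check is enough to get started. The sketch below is not the benchmark harness behind the table in this post. `leaked_items`, `recall`, `PHI_ITEMS` and `KEEP_ITEMS` are hypothetical names, the labels are my reading of which six values the run replaced, and a verbatim substring test is a simplification (it would not notice a partially redacted value).

```python
def leaked_items(scrubbed, phi_items):
    """Return the PHI strings that still appear verbatim in the scrubbed text."""
    return [item for item in phi_items if item in scrubbed]


def recall(scrubbed, phi_items):
    """Fraction of labeled PHI strings that no longer appear in the output."""
    missed = leaked_items(scrubbed, phi_items)
    return (len(phi_items) - len(missed)) / len(phi_items)


# OUT line copied from the run above.
OUT = (
    "[REDACTED_NAME] [REDACTED_MRN] admitted [REDACTED_DATE] with chest pain. "
    "[REDACTED_NAME] [REDACTED_NPI]. Phone [REDACTED_PHONE]. "
    "Follow-up scheduled 06/15/2026."
)

# Assumed labels: the six values that were replaced with placeholders.
PHI_ITEMS = [
    "John Smith", "4471829", "2026-04-29",
    "Alice Chen", "1245789632", "(517) 555-0199",
]

# Assumed label: the date the scrubber is expected to keep.
KEEP_ITEMS = ["06/15/2026"]

print(leaked_items(OUT, PHI_ITEMS))             # []
print(recall(OUT, PHI_ITEMS))                   # 1.0
print(all(item in OUT for item in KEEP_ITEMS))  # True
```

Checking `KEEP_ITEMS` alongside the leak list covers both failure modes: values that should have been redacted and were not, and values that should have survived and did not.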


The admission date got redacted. The follow-up date stayed. The downstream LLM still knows when the appointment is. 8.5ms wall clock with NER on a CPU pod.

## The bigger lesson

I spent two weeks adding more patterns to v3 and v4. Recall crept up. Then I went and actually read §164.514. Twenty minutes later I had v5.

If you're scrubbing patient data and you've never read the rule you're trying to satisfy, that's where your false negatives (and your false positives) are hiding.

## Demo

If you ship a healthcare AI product and you want to throw your hardest 10 notes at v5 on a screenshare, my email is `tiamat@tiamat.live`. No deck, just text in / redacted text out. If it doesn't beat your current scrubber I'll tell you so.