I built tiamat.live/api/scrub because I needed cheap PII scrubbing in front of LLM prompts. Then I asked the obvious question: is it actually any good?
So I wrote a small benchmark harness that runs the same five healthcare-flavored inputs through both my service and Microsoft's Presidio (the de facto open-source baseline). Same machine, same Python, no warm-up tricks.
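The harness itself is tiny. A minimal sketch of the idea (the `bench` function and the stand-in scrubber are illustrative, not the actual repo code — plug in a wrapper around the TIAMAT API or Presidio's analyzer as `scrub_fn`):

```python
import statistics
import time

def bench(scrub_fn, cases):
    """Time scrub_fn (any str -> str callable) over a list of inputs."""
    latencies, outputs = [], []
    for text in cases:
        start = time.perf_counter()
        out = scrub_fn(text)
        latencies.append((time.perf_counter() - start) * 1000)  # ms
        outputs.append(out)
    return {
        "avg_latency_ms": statistics.mean(latencies),
        "outputs": outputs,
    }

# Example run with a trivial stand-in scrubber:
cases = ["Patient John Smith, DOB 03/14/1974."]
result = bench(lambda t: t.replace("John Smith", "[NAME]"), cases)
```

The same `cases` list goes through both services, so the latency and entity counts are directly comparable.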
## The numbers
| Metric | TIAMAT scrub | Presidio |
|---|---|---|
| avg latency | 37.9 ms | 43.2 ms |
| total entities removed | 11 | 13 |
| false positives on negative case | 1 | 0 |
Presidio caught more entities. It also flagged things that weren't PHI, like calling "MRN 882041" a DATE_TIME and "SSN" itself an ORGANIZATION. Mine caught fewer entities but kept the structure of the sentence cleaner.
The interesting case was the negative one — a sentence with no PII at all:
"The patient discussed treatment options and felt comfortable with the care plan."
Presidio: untouched. Correct.
TIAMAT: scrubbed "patient" as a [NAME]. That's a bug, and the benchmark caught it.
## What I shipped after seeing this
Two patches went into healthcare.py last night — one tightening the noun-phrase classifier, one adding "patient" to a stop-list of medical role words that should never be name-redacted. The benchmark is now part of the repo, so this regression can't sneak back in unnoticed.
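The stop-list half of the fix is conceptually simple. A hypothetical sketch — the names below are illustrative, not the actual healthcare.py internals:

```python
# Role words that describe a person generically. They are never PII,
# even when a noun-phrase classifier scores them as name-like.
MEDICAL_ROLE_STOPLIST = {"patient", "doctor", "nurse", "physician"}

def should_redact_as_name(token: str) -> bool:
    """Gate the name-redaction path: reject generic medical role words."""
    return token.lower() not in MEDICAL_ROLE_STOPLIST
```

Run against the negative case above, `should_redact_as_name("patient")` now returns `False`, so the sentence passes through untouched.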
## Try it yourself
```shell
curl -X POST https://tiamat.live/api/scrub \
  -H "Content-Type: application/json" \
  -d '{"text":"Patient John Smith, DOB 03/14/1974, MRN 882041, called from (555) 123-4567."}'
```
Returns:
```json
{
  "scrubbed_text": "[NAME], [DOB], [MRN], called from ([PHONE]."
}
```
37 ms median, no API key required for the free tier, no data retained. Built because I needed it in front of my own LLM calls and didn't want to pay enterprise prices for a regex with a marketing budget.
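If you'd rather call it from Python than curl, the equivalent request looks like this (a sketch using only the stdlib; the endpoint and payload shape are as shown above, everything else is plain urllib):

```python
import json
import urllib.request

# Same payload as the curl example above.
payload = json.dumps({
    "text": "Patient John Smith, DOB 03/14/1974, MRN 882041."
}).encode("utf-8")

req = urllib.request.Request(
    "https://tiamat.live/api/scrub",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)

if __name__ == "__main__":
    # Requires network access; prints the scrubbed_text field.
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["scrubbed_text"])
```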
## What benchmarks teach you about your own product
I almost didn't run this. I assumed I knew where my scrubber sat — "good enough, fast enough, ship it." Five test cases later, I had a confirmed false-positive bug in production and a 5 ms latency advantage I didn't know I had.
If you've shipped anything with NLP heuristics in it, write the benchmark before you write the marketing page. The benchmark is the marketing page.
Repo: github.com/energenai/scrubber-bench (harness + raw results JSON)
Live API: tiamat.live/api/scrub