We accidentally committed real user emails into test fixtures.
More than once.
Not because we didn’t know better—but because the system allowed it.
## Why this keeps happening
If you’re working with real data pipelines, this is pretty easy to fall into:
- someone copies production data “just for testing”
- CSV fixtures get reused across environments
- test data slowly drifts toward real data over time
- everyone assumes someone else cleaned it
Nothing malicious—just normal workflow shortcuts.
## What didn’t work
We tried the obvious things:
- manual review
- “be careful” reminders
- catching it in PR comments
None of that held up. By the time bad data shows up in a PR, it’s already in the repo’s history—too late.
## What actually worked
We stopped treating this as a review problem and started treating it as a build-time failure.
Specifically:
- scan for high-risk patterns (emails, tokens, etc.)
- fail CI on detection
- require an explicit override if someone really needs to push
Once it breaks the build, people fix it immediately.
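A minimal sketch of that kind of check in Python. The patterns and names here are illustrative, not our actual ruleset—real scanners use far broader rules:

```python
import re

# Illustrative high-risk patterns; a real ruleset would be much broader.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "aws_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}

def scan_text(text):
    """Return a list of (rule_name, matched_string) findings."""
    findings = []
    for name, pattern in PATTERNS.items():
        for match in pattern.findall(text):
            findings.append((name, match))
    return findings
```

The important property is determinism: the same fixture always produces the same findings, so a red build is reproducible locally.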
## The bigger issue
The deeper problem isn’t just PII.
It’s that most systems don’t have a way to enforce or prove what data is flowing through them.
This shows up in a lot of places:
- test data
- training data
- AI inputs/outputs
- downstream systems
PII leakage is just one visible symptom.
## What we ended up doing
We built a small local CLI to enforce this:
- deterministic pattern matching
- no network calls
- exits non-zero on high-risk findings
Runs locally and in CI so nothing slips through.
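The CLI is not much more than a scanner plus an exit code. A rough Python sketch of the shape—`SCAN_OVERRIDE` is a hypothetical escape hatch standing in for the explicit-override step, not our actual flag:

```python
import os
import re
import sys

# One example rule; the real tool carries a full ruleset.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def scan_file(path):
    """Return email-like matches found in a file. No network calls."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return EMAIL.findall(f.read())

def main(paths):
    hits = {p: m for p in paths if (m := scan_file(p))}
    for path, matches in hits.items():
        print(f"{path}: {len(matches)} high-risk match(es)")
    # Hypothetical override: set SCAN_OVERRIDE=1 to push anyway.
    if hits and os.environ.get("SCAN_OVERRIDE") != "1":
        return 1  # non-zero exit fails CI and pre-commit hooks
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```

Wiring the same entry point into both a pre-commit hook and a CI step is what closes the gap: the check a developer sees locally is byte-for-byte the one the build runs.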
One pattern we saw: even after cleaning things up once, PII tends to creep back in over time.
Usually through “just this one test” or copying real payloads under pressure.
Feels less like a one-time fix and more like something that needs continuous enforcement.