<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Drew Kitchell</title>
    <description>The latest articles on DEV Community by Drew Kitchell (@dk970).</description>
    <link>https://dev.to/dk970</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3866546%2F98c059e0-4a6e-4360-9a55-7995a47ca0ff.jpg</url>
      <title>DEV Community: Drew Kitchell</title>
      <link>https://dev.to/dk970</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dk970"/>
    <language>en</language>
    <item>
      <title>We kept leaking PII into test data. Here’s what actually fixed it.</title>
      <dc:creator>Drew Kitchell</dc:creator>
      <pubDate>Tue, 07 Apr 2026 21:10:31 +0000</pubDate>
      <link>https://dev.to/dk970/we-kept-leaking-pii-into-test-data-heres-what-actually-fixed-it-51gm</link>
      <guid>https://dev.to/dk970/we-kept-leaking-pii-into-test-data-heres-what-actually-fixed-it-51gm</guid>
      <description>&lt;p&gt;We accidentally committed real user emails into test fixtures.&lt;/p&gt;

&lt;p&gt;More than once.&lt;/p&gt;

&lt;p&gt;Not because we didn’t know better—but because the system allowed it.&lt;/p&gt;

&lt;h2&gt;Why this keeps happening&lt;/h2&gt;

&lt;p&gt;If you’re working with real data pipelines, this is pretty easy to fall into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;someone copies production data “just for testing”&lt;/li&gt;
&lt;li&gt;CSV fixtures get reused across environments&lt;/li&gt;
&lt;li&gt;test data slowly drifts toward real data over time&lt;/li&gt;
&lt;li&gt;everyone assumes someone else cleaned it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Nothing malicious—just normal workflow shortcuts.&lt;/p&gt;

&lt;h2&gt;What didn’t work&lt;/h2&gt;

&lt;p&gt;We tried the obvious things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;manual review&lt;/li&gt;
&lt;li&gt;“be careful” reminders&lt;/li&gt;
&lt;li&gt;catching it in PR comments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of that held up.&lt;/p&gt;

&lt;p&gt;If it makes it into a PR, it’s already too late.&lt;/p&gt;

&lt;h2&gt;What actually worked&lt;/h2&gt;

&lt;p&gt;We stopped treating this as a review problem and started treating it as a build-time failure.&lt;/p&gt;

&lt;p&gt;Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scan for high-risk patterns (emails, tokens, etc.)&lt;/li&gt;
&lt;li&gt;fail CI on detection&lt;/li&gt;
&lt;li&gt;require an explicit override if someone really needs to push&lt;/li&gt;
&lt;/ul&gt;
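
&lt;p&gt;The scan / fail-CI / override steps can be sketched in a few lines of Python. This is a minimal illustration, not the actual tool: the regexes, the file handling, and the &lt;code&gt;PII_SCAN_OVERRIDE&lt;/code&gt; environment variable are all assumptions made for the example.&lt;/p&gt;

```python
# Sketch of a build-time PII gate. Deterministic: same input, same result;
# no ML, no network calls. The pattern set and override env var are
# illustrative choices, not a real tool's interface.
import os
import re
import sys
from pathlib import Path

# High-risk patterns. Real scanners carry many more (phone numbers,
# national IDs, cloud credentials, private keys, ...).
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
}

def scan_file(path):
    """Return (pattern_name, line_number) for every high-risk match in a file."""
    findings = []
    text = Path(path).read_text(errors="ignore")
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((name, lineno))
    return findings

def main(paths):
    findings = []
    for p in paths:
        for name, lineno in scan_file(p):
            findings.append((p, name, lineno))
            print(f"{p}:{lineno}: high-risk pattern: {name}")
    # Fail the build on any finding, unless someone explicitly overrides.
    if findings and os.environ.get("PII_SCAN_OVERRIDE") != "1":
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```

&lt;p&gt;Run it over tracked files (e.g. &lt;code&gt;python scan.py $(git ls-files)&lt;/code&gt;) and any finding without the override makes the job exit non-zero.&lt;/p&gt;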

&lt;p&gt;Once it breaks the build, people fix it immediately.&lt;/p&gt;

&lt;h2&gt;The bigger issue&lt;/h2&gt;

&lt;p&gt;The deeper problem isn’t just PII.&lt;/p&gt;

&lt;p&gt;It’s that most systems don’t have a way to enforce or prove what data is flowing through them.&lt;/p&gt;

&lt;p&gt;This shows up in a lot of places:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;test data&lt;/li&gt;
&lt;li&gt;training data&lt;/li&gt;
&lt;li&gt;AI inputs/outputs&lt;/li&gt;
&lt;li&gt;downstream systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PII leakage is just one visible symptom.&lt;/p&gt;

&lt;h2&gt;What we ended up doing&lt;/h2&gt;

&lt;p&gt;We built a small local CLI to enforce this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;deterministic pattern matching&lt;/li&gt;
&lt;li&gt;no network calls&lt;/li&gt;
&lt;li&gt;exits non-zero on high-risk findings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Runs locally and in CI so nothing slips through.&lt;/p&gt;
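
&lt;p&gt;As one possible CI wiring (a sketch, not the repo’s documented setup; &lt;code&gt;scan.py&lt;/code&gt; is a placeholder name for whatever local scanner you run):&lt;/p&gt;

```yaml
# Hypothetical GitHub Actions job: run the local scanner over tracked files.
# A non-zero exit from the scanner fails the build.
pii-gate:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - name: Scan tracked files for high-risk patterns
      run: python scan.py $(git ls-files)
```

&lt;p&gt;The same command doubles as a local pre-commit check, so the gate behaves identically on a laptop and in the pipeline.&lt;/p&gt;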

&lt;p&gt;Repo: &lt;a href="https://github.com/certifieddata/pii-scan.git" rel="noopener noreferrer"&gt;https://github.com/certifieddata/pii-scan.git&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>privacy</category>
    </item>
  </channel>
</rss>
