We accidentally committed real user emails into test fixtures.
More than once.
Not because we didn’t know better—but because the system allowed it.
## Why this keeps happening
If you’re working with real data pipelines, this is pretty easy to fall into:
- someone copies production data “just for testing”
- CSV fixtures get reused across environments
- test data slowly drifts toward real data over time
- everyone assumes someone else cleaned it
Nothing malicious—just normal workflow shortcuts.
## What didn’t work
We tried the obvious things:
- manual review
- “be careful” reminders
- catching it in PR comments
None of that held up. By the time bad data shows up in a PR, it’s already in the repo’s history—too late.
## What actually worked
We stopped treating this as a review problem and started treating it as a build-time failure.
Specifically:
- scan for high-risk patterns (emails, tokens, etc.)
- fail CI on detection
- require an explicit override if someone really needs to push
Once it breaks the build, people fix it immediately.
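A minimal sketch of that kind of check in Python. The patterns and names here are illustrative, not our actual ruleset—real scanners use far broader rules:

```python
import re

# Illustrative high-risk patterns; a real ruleset would be much broader.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "aws_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}

def scan_text(text):
    """Return a list of (rule_name, matched_string) findings."""
    findings = []
    for name, pattern in PATTERNS.items():
        for match in pattern.findall(text):
            findings.append((name, match))
    return findings
```

The important property is determinism: the same fixture always produces the same findings, so a red build is reproducible locally.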
## The bigger issue
The deeper problem isn’t just PII.
It’s that most systems don’t have a way to enforce or prove what data is flowing through them.
This shows up in a lot of places:
- test data
- training data
- AI inputs/outputs
- downstream systems
PII leakage is just one visible symptom.
## What we ended up doing
We built a small local CLI to enforce this:
- deterministic pattern matching
- no network calls
- exits non-zero on high-risk findings
Runs locally and in CI so nothing slips through.
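The CLI is not much more than a scanner plus an exit code. A rough Python sketch of the shape—`SCAN_OVERRIDE` is a hypothetical escape hatch standing in for the explicit-override step, not our actual flag:

```python
import os
import re
import sys

# One example rule; the real tool carries a full ruleset.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def scan_file(path):
    """Return email-like matches found in a file. No network calls."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return EMAIL.findall(f.read())

def main(paths):
    hits = {p: m for p in paths if (m := scan_file(p))}
    for path, matches in hits.items():
        print(f"{path}: {len(matches)} high-risk match(es)")
    # Hypothetical override: set SCAN_OVERRIDE=1 to push anyway.
    if hits and os.environ.get("SCAN_OVERRIDE") != "1":
        return 1  # non-zero exit fails CI and pre-commit hooks
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
```

Wiring the same entry point into both a pre-commit hook and a CI step is what closes the gap: the check a developer sees locally is byte-for-byte the one the build runs.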
One pattern we saw: even after cleaning things up once, PII tends to creep back in over time.
Usually through “just this one test” or copying real payloads under pressure.
Feels less like a one-time fix and more like something that needs continuous enforcement.