Alex Hayward

Posted on Oct 28 • Edited on Oct 29

Building a Test Data Platform After Watching Teams Secretly Use Production for Years

#devops #testing #cicd #dataengineering

Two weeks ago I asked data engineers on Reddit how much time they lose getting test data ready. The top response was brutally honest:

"Everywhere I've worked with sensitive data, everybody ended up secretly working off prod."

Lots of upvotes. Loads of people agreeing in the comments.

That's not a compliance problem. That's a tooling problem.

The Actual Problem

It's not that teams don't want to do things properly. Mock data isn't realistic enough to debug with. Snapshots go stale within weeks. Manual masking takes forever and breaks referential integrity, so test environments become useless.

So everyone takes the path of least resistance: quietly use production data and hope compliance doesn't dig too deep.

I've watched this happen everywhere. Fintech companies. Healthcare providers. E-commerce platforms. The tech stack changes but the pattern doesn't.

Why The Old Tools Don't Work Anymore

Traditional test data management tools were built for waterfall projects and monthly releases. Now we've got dozens of microservices, CI/CD deploying multiple times daily, and distributed teams needing self-serve access.

The "submit a ticket and wait three days" approach doesn't work when you're shipping continuously.

Technical Problems We Had To Solve

Schema-Aware Masking That Actually Works

Simply replacing values breaks everything. If you mask user_id in one table but not the foreign key in another, joins fail and tests become meaningless.

We built an engine that discovers relationships automatically and maintains referential integrity during masking. It works across both relational and NoSQL databases. The masking isn't just "replace with random string"; it's "maintain the graph structure whilst protecting sensitive fields".
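To make the referential integrity point concrete, here's a minimal sketch of one common technique: deterministic pseudonymisation with a keyed hash. This isn't GoMask's engine, and the names (pseudonymise, mask_rows, SECRET_KEY) are made up for illustration. The idea is that the same user_id always maps to the same token, so joins still line up after masking.

```python
import hashlib
import hmac

# Hypothetical demo key; a real key would live in a secret store.
SECRET_KEY = b"demo-key-rotate-me"


def pseudonymise(value, key=SECRET_KEY):
    """Map a value to a stable token: same input, same output, in every table."""
    digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
    return "u_" + digest[:16]


def mask_rows(rows, key_columns, redact_columns=()):
    """Return masked copies of rows.

    Key columns are pseudonymised (so they stay joinable);
    redacted columns are replaced with a fixed placeholder.
    """
    masked = []
    for row in rows:
        new_row = dict(row)  # never mutate the caller's data
        for col in key_columns:
            if new_row.get(col) is not None:
                new_row[col] = pseudonymise(new_row[col])
        for col in redact_columns:
            if col in new_row:
                new_row[col] = "REDACTED"
        masked.append(new_row)
    return masked


users = [
    {"user_id": 101, "email": "a@example.com"},
    {"user_id": 102, "email": "b@example.com"},
]
orders = [
    {"order_id": 1, "user_id": 101},
    {"order_id": 2, "user_id": 101},
    {"order_id": 3, "user_id": 102},
]

masked_users = mask_rows(users, key_columns=["user_id"], redact_columns=["email"])
masked_orders = mask_rows(orders, key_columns=["user_id"])
```

Every masked order still points at a masked user, because both tables went through the same keyed function. Swap the key and you get a completely different, equally consistent mapping.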

Synthetic Data That Doesn't Feel Fake

Generating random data is easy. Generating data that behaves like production is hard.

Our synthetic engine uses ML models trained on schema patterns, not your actual data. It generates records that match real-world distributions, include edge cases, and maintain realistic relationships.

That solves the "we can't reproduce this bug because we don't have test data that triggers it" problem.
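For a rough feel of what "realistic distributions plus edge cases" means, here's a deliberately simple stand-in. It is not the ML approach described above: it just samples amounts from a skewed distribution, injects a few boundary values, and only ever references user IDs that exist. The function name, the edge-case list, and the distribution parameters are all assumptions for the sake of the example.

```python
import random

# Hypothetical boundary values that tend to expose bugs.
EDGE_CASE_AMOUNTS = [0.0, 0.01, 999999.99]


def generate_orders(user_ids, count, seed=42, edge_case_rate=0.05):
    """Generate synthetic orders that only reference existing user IDs.

    Amounts follow a log-normal distribution (many small orders, a few
    large ones), with a small share replaced by explicit edge cases.
    """
    rng = random.Random(seed)  # seeded so a failing test can be replayed
    orders = []
    for order_id in range(1, count + 1):
        if rng.random() < edge_case_rate:
            amount = rng.choice(EDGE_CASE_AMOUNTS)
        else:
            amount = round(min(rng.lognormvariate(3.5, 1.0), 999999.99), 2)
        orders.append(
            {
                "order_id": order_id,
                "user_id": rng.choice(user_ids),  # keeps the foreign key valid
                "amount": amount,
            }
        )
    return orders


sample_orders = generate_orders(user_ids=[1, 2, 3], count=1000)
```

The fixed seed matters more than it looks: when a generated record triggers a bug, you can regenerate the exact same dataset to reproduce it.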

CI/CD Integration That's Actually Usable

Test data provisioning needs to be as automated as your deployments.

There's an API for anything we didn't anticipate. Processing is parallelised so it doesn't become the bottleneck.

Data provisioning becomes a pipeline step rather than a manual process.
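As a sketch of what a parallelised provisioning step could look like, here's a small script using only the Python standard library. It doesn't call the GoMask API; provision_table is a placeholder for whatever does the real work, and the table names are invented.

```python
from concurrent.futures import ThreadPoolExecutor


def provision_table(table_name):
    """Placeholder for the real work (extract, mask, load) on one table."""
    if not table_name:
        raise ValueError("table name must not be empty")
    return {"table": table_name, "status": "ok"}


def provision_all(table_names, max_workers=4):
    """Provision tables in parallel and return (results, failures)."""
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(provision_table, name): name for name in table_names}
        for future, name in futures.items():
            try:
                results.append(future.result())
            except Exception as exc:
                # Collect failures instead of stopping at the first one.
                failures.append((name, str(exc)))
    return results, failures


def main(table_names):
    """Return a process exit code: 0 on success, 1 if any table failed."""
    results, failures = provision_all(table_names)
    for name, error in failures:
        print(f"FAILED {name!r}: {error}")
    print(f"provisioned {len(results)} table(s), {len(failures)} failure(s)")
    return 1 if failures else 0
```

In a pipeline you'd end the script with something like sys.exit(main(["users", "orders"])), so a failed table fails the build rather than leaving a half-provisioned environment for the tests to trip over.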

Compliance Without The Pain

Audit trails shouldn't be something you bolt on when legal asks.

Every operation logs what data was accessed, what transformations were applied, who triggered it, and when. You can export compliance reports for GDPR, HIPAA, PCI, and SOX with one click.
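Here's a minimal sketch of what one audit record could look like, covering the four things listed above: what, which transformations, who, and when. The field names and the JSON-lines format are assumptions for illustration, not GoMask's actual log schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    """One immutable record per provisioning operation."""

    actor: str              # who triggered it
    operation: str          # e.g. "mask" or "generate"
    tables: tuple           # what data was accessed
    transformations: tuple  # what was applied to it
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )  # when, in UTC


def to_log_line(event):
    """Serialise an event as one JSON line, ready to append to a log."""
    return json.dumps(asdict(event), sort_keys=True)


event = AuditEvent(
    actor="ci-bot",
    operation="mask",
    tables=("users", "orders"),
    transformations=("pseudonymise:user_id", "redact:email"),
)
log_line = to_log_line(event)
```

Writing the record at the moment the operation runs, rather than reconstructing it later, is what makes the report trustworthy when someone asks for it.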

Architecture Stuff

A few technical choices that mattered:

We process in your environment. We don't require shipping production data to our servers; the masking engine runs in your infrastructure.

Database-agnostic connectors. Rather than building for one database, we abstracted the connection layer.

Stateless operations. Provisioning jobs don't maintain state between runs, which makes scaling and recovery simpler.

Usage-based pricing. No per-seat licensing. You pay for data rows processed.
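Two of those choices, the abstracted connection layer and stateless jobs, are easier to see in code. This is a minimal sketch under my own assumptions: the Connector interface, InMemoryConnector, and run_job are hypothetical names, not the product's API.

```python
from typing import Callable, Dict, Iterable, List, Protocol

Row = Dict[str, object]


class Connector(Protocol):
    """What a job needs from any database, and nothing more."""

    def read_rows(self, table: str) -> Iterable[Row]: ...

    def write_rows(self, table: str, rows: Iterable[Row]) -> None: ...


class InMemoryConnector:
    """Toy connector; a real one would wrap a database driver."""

    def __init__(self, tables=None):
        self.tables: Dict[str, List[Row]] = {
            name: list(rows) for name, rows in (tables or {}).items()
        }

    def read_rows(self, table):
        return list(self.tables.get(table, []))

    def write_rows(self, table, rows):
        # Overwrite rather than append, so a rerun gives the same result.
        self.tables[table] = list(rows)


def run_job(
    source: Connector,
    target: Connector,
    table: str,
    transform: Callable[[Row], Row],
) -> int:
    """Stateless job: everything it needs arrives as arguments."""
    rows = [transform(dict(row)) for row in source.read_rows(table)]
    target.write_rows(table, rows)
    return len(rows)


source = InMemoryConnector({"users": [{"user_id": 1, "email": "a@example.com"}]})
target = InMemoryConnector()
row_count = run_job(source, target, "users", lambda row: {**row, "email": "REDACTED"})
```

Because run_job keeps nothing between calls and the write overwrites, recovering from a crash is just running the job again, and scaling is just running more of them.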

What We Actually Learned

The technical problem is easier than the organisational one. Building schema-aware masking is hard but solvable. Getting legal, security, engineering, and compliance to agree on a process is harder.

Self-service is non-negotiable. Centralised gatekeepers become bottlenecks. Teams need autonomy with guardrails.

Speed isn't just convenience; it's compliance. When provisioning takes minutes instead of days, teams stop taking shortcuts that create security risks.

We have launched!

GoMask is live.

Check it out: gomask.ai

Free tier to get started. Would love feedback, especially on database connectors we should prioritise, CI/CD integrations that matter most, and edge cases we might have missed.

Happy to answer technical questions in the comments.

Built by engineers who got tired of waiting for test data.
