Every development team eventually confronts the same problem: the data needed to build and test a system is either unavailable, sensitive, incomplete, or all three. The instinctive solution is to copy some production data, sanitize the obvious fields, and move on. It is understandable. It is also, in most cases, the wrong answer. A more deliberate approach, one grounded in understanding the business domain and constructing data from first principles, produces better software and fewer surprises in production.
This is the argument for synthetic data as a primary development and testing strategy, with replicated production data reserved for the specific, bounded purpose it actually serves: final verification that nothing was missed.
The Foundation: Understanding Precedes Generation
There is a persistent myth that synthetic data is technically difficult or that it requires specialized AI tooling before teams can get started. This misunderstands what synthetic data actually demands. The hard part is not the generation; it is the understanding that must precede it.
If an engineering team can articulate precisely what their system does with data (what a valid order looks like, what distinguishes an active customer from a lapsed one, what the lifecycle of a financial transaction is from initiation to settlement), then creating synthetic records that represent those concepts is straightforward. The domain knowledge is the hard work. The data generation follows naturally from it.
"Understanding the source data, its characteristics, distributions, and dependencies, is the essential prerequisite to generating synthetic data that maintains the statistical properties necessary for valid testing."
— Tonic.ai, Guide to Synthetic Test Data Generation
This observation has a useful corollary: if a team struggles to define what valid synthetic data should look like, that struggle is a signal. It reveals gaps in the team's understanding of the domain, gaps that will produce bugs regardless of the testing approach used. Forcing that conversation early, during data design, is itself a quality-improvement activity.
Rule-based synthesis (generating records according to explicit business rules, ranges, and relationships) is the natural starting point. For most transactional business domains, it is also sufficient. More sophisticated statistical or ML-based generation adds value later, when fidelity to complex distributions matters. But the entry point is simply: know your domain, codify its rules, generate records that obey them.
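As a concrete illustration, a rule-based generator can be a few dozen lines of ordinary code. The order schema, field names, and rules below are invented for illustration; the point is that every record is constructed to satisfy explicitly codified domain rules:

```python
import random
from datetime import date, timedelta

ORDER_STATUSES = ["pending", "shipped", "delivered", "cancelled"]

def generate_order(rng: random.Random) -> dict:
    """Generate one synthetic order that obeys the stated (hypothetical) rules."""
    placed = date(2024, 1, 1) + timedelta(days=rng.randrange(365))
    status = rng.choice(ORDER_STATUSES)
    return {
        "order_id": f"ORD-{rng.randrange(10**6):06d}",
        "placed_on": placed.isoformat(),
        "status": status,
        # Rule: order totals fall between 5.00 and 500.00, two decimal places.
        "total": round(rng.uniform(5.0, 500.0), 2),
        # Rule: only delivered orders carry a delivery date,
        # and it must follow the order date by 1-14 days.
        "delivered_on": (
            (placed + timedelta(days=rng.randrange(1, 15))).isoformat()
            if status == "delivered" else None
        ),
    }

rng = random.Random(42)  # seeded, so every run produces the same data set
orders = [generate_order(rng) for _ in range(1000)]
```

Nothing here requires specialized tooling; the effort went into stating the rules, not into the generation code.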
Cause and Effect: Designing Tests That Actually Test
The most powerful characteristic of synthetic data is one that is rarely discussed: it enables genuine experimental design. When you construct data rather than inherit it, you control the inputs precisely. That control is the precondition for meaningful testing.
Consider what it means to test a refund processing workflow. With production-derived data, you find refunds that happened to exist in the snapshot. You test against the outcomes that were already embedded in that data. With synthetic data, you construct the exact scenario you intend to test: a refund requested after the return window closes, a partial refund on an order with a promotional discount, a refund on a subscription that has already renewed. Each scenario has a known input and an expected output. The test is a controlled experiment, not an archaeological dig.
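The contrast above can be sketched as a table of constructed scenarios, each pairing a known input with the output the policy demands. The refund policy here (a 30-day return window, refunds capped at the net amount paid) is invented for illustration:

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical policy: refunds allowed within 30 days, capped at net paid.
RETURN_WINDOW_DAYS = 30

@dataclass
class Order:
    paid: float
    discount: float
    purchased_on: date

def process_refund(order: Order, requested: float, requested_on: date) -> float:
    """Toy refund logic: reject late requests, cap at the net amount paid."""
    if (requested_on - order.purchased_on).days > RETURN_WINDOW_DAYS:
        return 0.0
    return round(min(requested, order.paid - order.discount), 2)

# Each synthetic scenario is a controlled experiment:
# a constructed input paired with the output the policy demands.
day0 = date(2024, 3, 1)
scenarios = [
    # Refund requested after the return window closes -> rejected.
    (Order(100.0, 0.0, day0), 100.0, day0 + timedelta(days=45), 0.0),
    # Partial refund on a discounted order -> capped at net paid (80.00).
    (Order(100.0, 20.0, day0), 90.0, day0 + timedelta(days=10), 80.0),
    # In-window full refund -> granted in full.
    (Order(50.0, 0.0, day0), 50.0, day0 + timedelta(days=5), 50.0),
]
for order, requested, when, expected in scenarios:
    assert process_refund(order, requested, when) == expected
```

No snapshot needs to be mined for these cases; they exist because someone decided they must be tested.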
Scenario Coverage
This matters enormously for edge cases and failure paths. Production data snapshots reflect what has already happened: the distribution of events that occurred in the past under real conditions. They systematically under-represent rare events: the unusual transaction pattern, the concurrent write conflict, the workflow state that only arises when three conditions coincide. These are precisely the cases that cause production incidents.
Synthetic data inverts this. You can generate exactly as many rare-event scenarios as you need, not because they happen frequently in production, but because you have decided they are important to test. The frequency of a scenario in a test suite should reflect its risk and complexity, not its historical prevalence in production data.
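One way to sketch this inversion: sample test scenarios from weights that encode assessed risk, not production frequency. The scenario names and weights below are illustrative assumptions:

```python
import random

# Illustrative weights: high-risk, rare-in-production scenarios are
# deliberately over-represented relative to the happy path.
SCENARIO_WEIGHTS = {
    "happy_path": 2,           # common in production, cheap to cover
    "concurrent_write": 5,     # rare in production, high incident risk
    "window_edge_refund": 5,
    "triple_condition_state": 8,
}

def build_suite(n: int, seed: int = 0) -> list:
    """Draw n scenario labels, weighted by risk rather than prevalence."""
    rng = random.Random(seed)
    names = list(SCENARIO_WEIGHTS)
    weights = [SCENARIO_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

suite = build_suite(200)
# Rare-in-production scenarios dominate the suite by design.
assert suite.count("triple_condition_state") > suite.count("happy_path")
```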
The Problem with Production Snapshots
A replicated subset of production data feels safe because it is real. That intuition is worth examining. Real data carries real distributions, real constraints, and real history. But those properties are exactly what makes it a poor primary testing tool, not a good one.
What production snapshots miss
The most significant limitation is coverage of critical business events. A snapshot taken on any given day captures a cross-section of normal operations. Month-end close behavior, fiscal year rollovers, promotional surge processing, first-time-purchaser workflows, and account reactivation paths: these events happen infrequently and are statistically unlikely to be well-represented in any particular snapshot. Yet these are precisely the flows that accumulate the most technical debt and surface the most defects.
A related problem is that production data encodes past behavior, not future requirements. When a business adds a new product category, changes its pricing model, or enters a new market, the production snapshot contains no records of these new patterns. The team is forced to either hand-craft a handful of test records (ad hoc and not versioned) or test with data that does not reflect the scenarios the new code must handle.
Finally, production data is bounded by its own scale. A subset that represents normal daily volume tells you little about behavior under peak load, bulk import operations, or the long tail of a large customer's data set. The snapshot is, by definition, a sample, and samples are not the right tool for performance and scalability testing.
"Replicated production data systematically under-represents the rare, high-risk events most likely to cause production incidents: the precise scenarios a test suite most needs to cover."
— Engineering Practice Observation
Scalability: Data on Demand
One of synthetic data's practical advantages is deceptively simple: you can have as much of it as you need, in exactly the shape you need, when you need it. This sounds obvious, but its implications run deep through a development workflow.
Performance testing requires volumes that no sanitized production subset can provide. Load simulations, database query optimization, and index strategy validation all depend on data at production scale or beyond. Synthetic data generators can produce millions of statistically coherent records in minutes. The same specification that generates a thousand records for a developer's local environment generates ten million for a stress test, with no new configuration and no waiting for a DBA to provision a larger snapshot.
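A minimal sketch of that property, assuming a hypothetical customer schema: the volume is a single parameter, and nothing else about the specification changes between a developer's local run and a load test:

```python
import csv
import io
import random

def generate_customers(n: int, seed: int = 7):
    """Yield n synthetic customers from one specification (fields are illustrative)."""
    rng = random.Random(seed)
    for i in range(n):
        yield {
            "customer_id": i,
            "segment": rng.choice(["retail", "smb", "enterprise"]),
            "lifetime_value": round(rng.lognormvariate(4.0, 1.0), 2),
        }

def write_csv(n: int) -> str:
    """Render the data set as CSV; only n varies between environments."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["customer_id", "segment", "lifetime_value"]
    )
    writer.writeheader()
    writer.writerows(generate_customers(n))
    return buf.getvalue()

dev_data = write_csv(1_000)        # local development volume
# load_data = write_csv(10_000_000)  # same spec, stress-test volume
```

The generator streams records, so scaling to millions of rows is a matter of runtime, not of new configuration.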
This scalability also applies along a different axis: time. Development teams can generate fresh data sets for every test run, eliminating the accumulated drift that affects shared test databases. Tests that depend on specific data states do not corrupt the environment for subsequent runs. Every pipeline starts clean. The feedback loop tightens because the test environment is deterministic.
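Seeded generation is the usual mechanism behind that determinism. A sketch, with an invented account schema: the same seed reproduces an identical environment for every run, while a new seed yields a clean, independent one:

```python
import random

def fresh_accounts(seed: int, n: int = 50) -> list:
    """Build a fresh, self-contained data set for one test run (schema is illustrative)."""
    rng = random.Random(seed)
    return [
        {"account_id": i, "balance": round(rng.uniform(0, 10_000), 2)}
        for i in range(n)
    ]

run_a = fresh_accounts(seed=1234)
run_b = fresh_accounts(seed=1234)
assert run_a == run_b  # same seed -> identical, reproducible environment

run_c = fresh_accounts(seed=99)
assert run_c != run_a  # new seed -> clean, independent data, no inherited state
```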
For teams working in CI/CD pipelines, this is not a convenience; it is an architectural requirement. A test environment that depends on a pre-provisioned, manually refreshed copy of production data is a bottleneck with a scheduled maintenance window. A test environment backed by a data generation specification is a pipeline input that scales horizontally.
The Recommended Pipeline: Synthetic First, Production Last
This is not an argument for abandoning production data in testing entirely. It is an argument for placing it correctly in the pipeline: at the end, as final verification, rather than at the beginning, as the primary test environment.
- Phase 1: Domain Modeling & Spec
- Phase 2: Synthetic Data Generation
- Phase 3: Scenario & Edge Case Testing
- Phase 4: Scale & Performance Runs
- Phase 5: Final Verification with Masked Production Data
The recommended workflow proceeds as follows. Development and unit testing use fully synthetic data, generated from domain specifications. Integration and system testing use synthetic data sized and shaped to the scenario requirements. Performance and load testing use synthetic data at production scale or beyond. Only the final pre-release verification stage (the stage intended to confirm that the real-world data distribution does not contain surprises the synthetic specification missed) uses a masked or anonymized production subset.
This final verification stage is legitimate and important. No domain specification is perfect. Production data will occasionally reveal a pattern or relationship that the synthetic specification did not capture: an unusual character encoding in a legacy customer record, a historical data migration artifact, an edge case in a third-party data feed. The production subset exists to catch those cases, not to serve as the primary testing substrate for all the cases that were already well-understood.
Compliance as a benefit, not an afterthought
This pipeline structure also resolves the compliance problem that haunts every team that relies on production data for development. Synthetic data does not contain personally identifiable information, protected health records, financial account details, or any other regulated data class, because it was never derived from real individuals. Development environments built on synthetic data satisfy GDPR, HIPAA, PCI-DSS, and similar frameworks structurally, without depending on masking pipelines that require ongoing maintenance and audit. The compliance posture is embedded in the architecture.
Tooling: A Mature Ecosystem
Teams considering this approach will find the tooling landscape well-developed. The choice of tool depends on data type, scale requirements, and team preferences:
- Tonic.ai — Structured and unstructured synthesis; strong referential integrity preservation for relational databases
- Gretel.ai — Developer-first APIs for generation, transformation, and privacy at scale
- MOSTLY AI — Privacy-preserving synthesis with fairness controls; strong for regulated data sharing
- Synthetic Data Vault (SDV) — Open-source Python ecosystem for tabular, relational, and time-series data; widely used in enterprise and academia
- K2View — Enterprise-grade entity-based synthesis combined with test data management and masking
- Mockaroo — Lightweight, schema-driven generation for smaller-scale development use cases; accessible free tier
For teams with primarily tabular, rule-governed data (the common case in transactional business systems), open-source options like SDV or direct programmatic generation using libraries like Python's Faker and factory_boy are often sufficient to start. The investment in domain specification pays dividends regardless of the generation mechanism used.
Comparison: Approaches at a Glance
| Criterion | Synthetic Data | Production Subset | Hand-crafted |
|---|---|---|---|
| Privacy & compliance | ✓ Structural | △ Requires masking | ✓ If no real data |
| Edge case coverage | ✓ By design | ✗ Historically rare | △ Labor-intensive |
| Scalability | ✓ On demand | △ Snapshot-bounded | ✗ Not practical |
| Controlled cause & effect | ✓ Precise | ✗ Inherited state | ✓ Precise |
| Real-world distribution | △ Approximated | ✓ Actual | ✗ Minimal |
| CI/CD integration | ✓ Native | ✗ Refresh bottleneck | △ Fragile |
| New feature coverage | ✓ Specifiable | ✗ Not yet in data | △ Manual effort |
A Closing Argument
The teams that do this well share a common trait: they have invested in understanding their domain deeply enough to specify it. That investment is not a cost unique to synthetic data; it is a cost that any team building reliable software must eventually pay. The difference is whether that investment happens at the beginning of development, where it shapes the design, or at the end, after a production incident forces a reckoning.
Synthetic data does not replace production data. It replaces the false confidence that production snapshots provide: the feeling that testing against "real" data means testing against reality. A snapshot of the past, stripped of its most sensitive content and frozen at a point in time, is not reality. It is an artifact. The team that understands its domain well enough to construct the right data from scratch understands its system better than the team that inherited a database dump.
Build with intention. Verify against reality. In that order.
Sources: Gartner Market Guide for Data Masking · MIT Sloan Management Review · Perforce 2025 State of Synthetic Data Report · Tonic.ai · Accutive Security · Netguru · Enov8