Jitendra Devabhaktuni

Stop Generating Synthetic Datasets. Start Generating Synthetic Systems.

If you’re building AI for BFSI, insurance, or healthtech, you’ve probably evaluated synthetic data platforms. You upload a table. You get a table back. The distributions look right. The privacy report is green. You move to training.

Then production happens.

Your fraud model misses edge cases it should have caught. Your risk engine drifts after two weeks. Your QA team ships a bug because the test data didn’t reflect how users actually behave across multiple tables.
You didn’t build a bad model. You built it on a lie.

The Problem Nobody Talks About

Here’s the uncomfortable truth: most synthetic data platforms generate datasets, not systems.

A dataset is a single table with plausible rows. A system is a network of interconnected tables where:
• A user’s transaction history actually belongs to that user
• Claims link to valid policies with realistic timestamps
• Event sequences follow allowed state transitions
• Foreign keys, constraints, and referential integrity just… work
• Edge cases span multiple tables the way they do in production

When you generate isolated datasets, you break all of that. You get tables that look real individually but behave nothing like production when joined, queried, or fed into a model.

And that’s where AI pilots go to die.
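The difference shows up the moment a real database engine gets involved. Here’s a minimal sketch using Python’s built-in sqlite3 (the two-table schema is invented for illustration): with foreign-key enforcement on, the kind of orphan row that isolated generation produces is rejected outright.

```python
import sqlite3

# Hypothetical two-table schema. Production databases enforce the
# relationships that independently generated tables routinely violate.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE transactions (
    txn_id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(user_id)
)""")

conn.execute("INSERT INTO users VALUES (1)")
conn.execute("INSERT INTO transactions VALUES (100, 1)")  # valid: user 1 exists

try:
    # A transaction generated in isolation, pointing at a user that was
    # never generated -- the database refuses it.
    conn.execute("INSERT INTO transactions VALUES (101, 99)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(orphan_rejected)  # True
```

If your synthetic tables can’t survive a load into a schema with real constraints, they won’t survive production either.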

Why Dataset-Level Generation Fails in Production

I’ve audited synthetic data deployments across fintech, insurance, and healthtech.

The pattern is always the same:
1. Team generates synthetic tables for users, transactions, and events separately
2. Each table passes univariate fidelity checks
3. Tables are joined for model training
4. Correlations collapse. Referential integrity breaks. Temporal sequences become impossible
5. Model looks great in the notebook, degrades silently in production

The root cause isn’t the model architecture. It’s the assumption that you can generate tables in isolation and expect them to work together.

You can’t.
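The collapse in step 4 is easy to reproduce. A toy sketch (the table shapes and ID ranges are invented): two tables sampled independently produce foreign keys that point nowhere, and a join silently drops the orphans.

```python
import random

random.seed(0)

# Tables generated in isolation: each looks fine on its own.
users = [{"user_id": i} for i in range(100)]

# A hypothetical generator that samples user_ids from a wider range
# than the users table actually covers -- nothing ties the two together.
transactions = [
    {"txn_id": t, "user_id": random.randrange(200)} for t in range(1000)
]

valid_ids = {u["user_id"] for u in users}
orphans = sum(1 for t in transactions if t["user_id"] not in valid_ids)
print(f"orphaned transactions: {orphans} / {len(transactions)}")
# Roughly half the foreign keys point at users that don't exist.
```

Every orphan is a row your model never sees after the join, and a behavior pattern it never learns.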

What AI Products Actually Need

AI products don’t run on CSVs. They run on databases.

If you’re building a fraud detection system, you’re not just modeling transactions. You’re modeling:

• Users with histories, risk profiles, and behavioral patterns
• Transactions that link to those users with valid timestamps and merchant contexts
• Events that follow sequences (login → transaction → alert → investigation)
• Policies, claims, denials, and appeals that span multiple entities

If any of those relationships break in your test data, your model learns patterns that don’t exist in production.

You don’t need more data. You need structurally coherent data — a synthetic system that mirrors production complexity end-to-end.
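One way to get that coherence, sketched here with invented column names: generate children from their parents, so foreign keys and timestamps are valid by construction rather than by luck.

```python
import random
from datetime import datetime, timedelta

random.seed(42)

# Parents first: every user gets a signup date.
users = [
    {"user_id": i, "signup": datetime(2024, 1, 1) + timedelta(days=random.randrange(90))}
    for i in range(50)
]

# Children derived from parents: the user_id always exists, and each
# user's transaction timestamps advance monotonically from signup.
transactions = []
for u in users:
    ts = u["signup"]
    for _ in range(random.randrange(1, 6)):
        ts += timedelta(hours=random.randrange(1, 48))
        transactions.append({"user_id": u["user_id"], "timestamp": ts})

valid_ids = {u["user_id"] for u in users}
signup = {u["user_id"]: u["signup"] for u in users}
assert all(t["user_id"] in valid_ids for t in transactions)          # no orphans
assert all(t["timestamp"] > signup[t["user_id"]] for t in transactions)  # no time travel
print(f"{len(transactions)} transactions, all structurally valid")
```

Real platforms do this with learned models rather than uniform sampling, but the ordering principle is the same: the system graph drives generation, not the other way around.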

The Shift: From Datasets to Databases

This is the mental model change the industry needs right now.

Instead of asking “Can this platform generate a high-fidelity dataset?”, ask:
• Can it generate a complete synthetic database with my full schema intact?
• Does it preserve foreign keys and referential integrity automatically?
• Do cross-table correlations match production, not just within-table distributions?
• Are temporal sequences and state transitions logically valid?
• Can I generate millions of rows across dozens of tables without structural collapse?
• Can I reproduce this exact database on demand for audit or debugging?

If the answer to any of these is “no” or “we don’t measure that,” you’re not ready for production.
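The cross-table correlation question can be spot-checked in a few lines. A toy sketch (the risk-score/amount relationship is invented for illustration): per-table generation preserves each marginal distribution perfectly, yet the joint signal collapses.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no external libraries."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(7)
# "Production": transaction amounts are driven by user risk scores.
risk = [random.random() for _ in range(1000)]
amount = [r * 100 + random.gauss(0, 5) for r in risk]

# Independent per-table generation: identical marginal distribution of
# amounts, but the link to the risk column is gone.
amount_independent = random.sample(amount, len(amount))

print(round(pearson(risk, amount), 2))              # strong cross-table signal
print(round(pearson(risk, amount_independent), 2))  # collapses toward zero
```

Both versions would pass a univariate fidelity check. Only one of them is usable for training.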

A Better Way to Think About Synthetic Data

Here’s the framework I use when evaluating synthetic data infrastructure now:

Level 1: Dataset Generation
• Single-table fidelity
• Univariate distributions match
• Privacy checks pass
• Good for: Notebooks, proofs of concept, early prototyping

Level 2: Multi-Table Coherence

• Cross-table correlations preserved
• Foreign keys intact
• Joint distributions match production
• Good for: Model training, integration testing, QA environments

Level 3: Synthetic Systems
• Full schema fidelity (constraints, triggers, indexes)
• Temporal consistency across entities
• Realistic user journeys and event sequences
• Audit-ready generation logs with full reproducibility
• Good for: Production-safe testing, compliance reviews, realistic demos, load testing

Most platforms stop at Level 1. A few attempt Level 2. Almost nobody is building for Level 3.

But Level 3 is exactly what enterprise AI teams need to move from pilot to production.
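Temporal consistency at Level 3 usually reduces to validating event sequences against an explicit state machine. A minimal sketch, with the transition table assumed from the login → transaction → alert → investigation example above:

```python
# Assumed allowed state transitions for a fraud-monitoring flow.
# The exact graph is domain-specific; this one is illustrative.
ALLOWED = {
    "login": {"transaction", "logout"},
    "transaction": {"transaction", "alert", "logout"},
    "alert": {"investigation"},
    "investigation": {"logout"},
}

def sequence_is_valid(events):
    """Return True if every consecutive pair is an allowed transition."""
    return all(b in ALLOWED.get(a, set()) for a, b in zip(events, events[1:]))

print(sequence_is_valid(["login", "transaction", "alert", "investigation"]))  # True
print(sequence_is_valid(["alert", "login"]))  # False: an alert can't precede a login
```

Running every generated event sequence through a validator like this is the difference between "the timestamps look plausible" and "the journeys are actually possible."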

Why This Matters for Regulated Industries

If you’re in BFSI, insurance, or healthtech, you’re not just trying to train a model.

You’re trying to:

• Build and test AI applications end-to-end without touching production data
• Run product demos that feel real without exposing customer records
• Simulate production load for performance and QA testing
• Pass model risk review with traceability and privacy guarantees

You can’t do any of that with isolated datasets. You need synthetic systems.

And you can’t get there with prompt engineering or LLM-generated rows. This is infrastructure work — statistical fidelity, referential integrity, temporal modeling, and audit trail engineering.
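The audit-trail piece is mostly seed discipline. A minimal sketch (the schema and fingerprinting scheme are illustrative): record the seed used for generation, and the exact database can be regenerated and verified byte-for-byte during a review.

```python
import hashlib
import json
import random

def generate_database(seed):
    """Deterministically generate a (tiny, hypothetical) synthetic table set."""
    rng = random.Random(seed)  # isolated RNG: no global state involved
    users = [{"user_id": i, "risk": round(rng.random(), 4)} for i in range(10)]
    return {"users": users}

def fingerprint(db):
    """Stable content hash: canonical JSON, then SHA-256."""
    return hashlib.sha256(json.dumps(db, sort_keys=True).encode()).hexdigest()

run1 = fingerprint(generate_database(seed=1234))
run2 = fingerprint(generate_database(seed=1234))
print(run1 == run2)  # True: same seed, bit-identical database for audit
```

Log the seed, generator version, and fingerprint alongside every generation run, and "reproduce this exact database on demand" stops being an aspiration.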

The Conversation We Should Be Having

Instead of debating “synthetic data vs. real data,” let’s talk about:
• How do we measure cross-table fidelity, not just within-table similarity?
• What does referential integrity preservation actually look like in practice?
• How do we validate temporal consistency for event-driven systems?
• What audit logs do model risk teams actually need to sign off?
• When is dataset-level generation enough, and when do we need full synthetic systems?

Because the future of enterprise AI isn’t about generating more data.
It’s about generating data infrastructure that behaves like production — structurally, statistically, and temporally.

Over to You

If you’ve shipped AI to production in a regulated environment:
• What broke when you moved from synthetic data to real data?
• Did your synthetic tables hold up when joined and queried together?
• What metrics did your model risk team actually care about?

If you’re evaluating synthetic data platforms right now:
• Are you testing single-table fidelity or multi-table coherence?
• Have you validated referential integrity and temporal consistency?
• Can you reproduce your test database on demand for audit?

Let’s talk about what it actually takes to build production-safe AI — not just in notebooks, but in the real world.

What’s your experience been? Drop a comment — especially if you’ve hit the dataset trap and had to climb out of it.
