# Stop Writing INSERT Scripts for Test Data
If you’ve been building products for a while, you’ve probably done this dance:
- New feature.
- New tables or columns.
- Empty staging database.
- “I’ll just write a few INSERT scripts to fake some rows…”
An hour later, you’ve got a wall of SQL, a half‑realistic dataset, and the quiet feeling that none of this is going to look like production anyway.
For years this was just “how it’s done.” Today, it’s a tax.
In this post, I want to lay out why hand‑crafted test data is breaking your velocity (and your tests), and what a better default looks like.
## The hidden cost of hand‑written test data
The obvious cost is time: senior engineers spending hours writing INSERTs, CSVs, or seed scripts instead of shipping features.
But the deeper costs are more dangerous:
**Your data is too clean.**
Synthetic in the worst way: perfect dates, perfect enums, no NULL hell. Your tests pass beautifully on this happy‑path dataset and then fall over in production.

**Your relationships drift.**
You add a new table, a new foreign key, or a new join. Did you remember to update every seed script, every fixture, every CSV? If not, you end up with orphan rows and tests that silently stop covering real flows.

**Nobody owns it.**
Test data becomes tribal knowledge. One person “just knows” which script to run or which dump to restore. When they’re busy (or leave), test environments quietly rot.

**Compliance risk.**
To avoid writing data by hand, teams often copy masked production snapshots into staging. Masking is rarely complete. A few columns slip through, and now PII is sitting in places it shouldn’t.
Individually these feel like minor annoyances. Together, they create a slow, constant drag on every release.
## What “good” test data actually means
When people say “we need realistic test data,” they usually mean more than just random rows.
A useful test database has at least three properties:
1. **Referential integrity.**
   Foreign keys are valid, constraints are respected, and joins behave the way they do in production.
2. **Realistic distributions.**
   Data “feels” like production: skewed, messy, correlated. Not everyone signs up on the last day of the month. Not every account has exactly three users.
3. **Designed edge cases.**
   You see the weird stuff on purpose:
   - users with 0 orders,
   - accounts with 1000+ invoices,
   - subscriptions with overlapping billing periods.
Most hand‑written test data does okay on (1), fails on (2), and completely ignores (3). You get just enough to demo the happy path, but not enough to trust your system.
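To make properties (2) and (3) concrete, here is a minimal Python sketch (all field names are invented for illustration) that draws order counts from a skewed, long‑tailed distribution and then forces the edge cases in, rather than hoping randomness produces them:

```python
import random

random.seed(42)  # deterministic: regenerating yields the same "world"

def make_users(n):
    """Mostly typical users with a long-tailed order count, plus forced edge cases."""
    users = []
    for uid in range(n):
        # Skewed, production-ish distribution: many small counts, a few large ones.
        orders = int(random.expovariate(1 / 5))
        users.append({"id": uid, "orders": orders})
    # Designed edge cases, guaranteed present in every generated dataset:
    users[0]["orders"] = 0      # a user with no orders at all
    users[1]["orders"] = 1500   # a pathologically heavy user
    return users

users = make_users(200)
```

The point is the last two lines: edge cases are pinned by construction, so every regenerated dataset exercises them.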
## Why staging snapshots aren’t the answer
The usual response is: “We’ll just use a masked copy of production.”
That sounds great until:
- Masking doesn’t catch everything, and suddenly you have PII in non‑prod.
- Schema changes make your anonymization scripts brittle.
- Refreshing the snapshot becomes a mini‑project every time you want to test a new flow.
- You can’t easily generate new edge cases on demand, because the data is whatever production happened to look like last week.
Staging snapshots are a snapshot of the past. Most teams need a generator for the future.
## A better default: synthetic relational test databases
The alternative is to treat test data as something you generate on demand, not something you “hope is still usable.”
The workflow looks more like this:
- Describe the domain you care about (in schema or plain English).
- Generate a full relational database that respects your constraints.
- Tune volumes, distributions, and edge cases.
- Regenerate whenever the schema changes.
You get:
- Consistent, repeatable datasets for local dev, CI, demos, and staging.
- No real customer records outside production.
- The ability to intentionally create “weird” worlds to stress your system.
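The workflow above can be sketched in a few lines of Python. The spec format and names here are invented for illustration (not any real tool’s API); the key idea is that foreign keys are valid by construction because rows are generated from the relationships, not typed in by hand:

```python
import random

random.seed(1)  # step 4: regenerate deterministically whenever the spec changes

# Step 1: describe the domain (illustrative spec format, not a real tool's API).
SPEC = {
    "companies": {"count": 10},
    "users": {"per_company": (1, 5)},  # min/max users per company
}

# Step 2: generate a relational dataset that respects the constraints.
def generate(spec):
    companies = [{"id": c} for c in range(spec["companies"]["count"])]
    users = []
    lo, hi = spec["users"]["per_company"]
    for c in companies:
        for _ in range(random.randint(lo, hi)):
            # The FK is valid by construction: it always points at a real company.
            users.append({"id": len(users), "company_id": c["id"]})
    return {"companies": companies, "users": users}

world = generate(SPEC)
# Step 3: tune volumes and distributions by editing SPEC and re-running.
```

Because the seed is fixed, every run produces the same world, which is exactly what you want for repeatable CI runs and demos.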
This is the mental model behind SyntheholDB: describe the database you wish you had for testing, then generate it instead of hand‑coding it.
## What this looks like in practice
Here’s a simple example.
Imagine you’re testing a B2B SaaS app. You might say:
“I need 200 companies, 1–25 users per company, a mix of free and paid plans, and at least 20 companies with more than 50 invoices each.”
With the traditional approach, you’d:
- Create CSVs for `companies`, `users`, `subscriptions`, `invoices`.
- Write scripts to import them.
- Fix foreign keys when something doesn’t line up.
- Iterate until the data “looks okay”.
With a synthetic test database generator, you:
- Express that requirement once.
- Let the tool generate all the tables and relationships.
- Re‑run when you change your schema or want a different scenario.
The output becomes an asset: you can spin up identical worlds for local dev, QA, and demos, without anyone touching INSERT scripts.
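As a rough sketch of what “express that requirement once” could look like in plain Python (purely illustrative; this is not SyntheholDB’s actual interface), the requirement from above maps almost line for line onto code:

```python
import random
from collections import Counter

random.seed(7)  # the same "world" every time you regenerate

# Illustrative sketch of the stated requirement; names and structure are invented.
companies = [{"id": c, "plan": random.choice(["free", "paid"])} for c in range(200)]
heavy = set(random.sample(range(200), 20))  # 20 companies guaranteed 50+ invoices

users, invoices = [], []
for c in companies:
    for _ in range(random.randint(1, 25)):                    # 1-25 users per company
        users.append({"id": len(users), "company_id": c["id"]})
    n = random.randint(51, 120) if c["id"] in heavy else random.randint(0, 50)
    for _ in range(n):
        invoices.append({"id": len(invoices), "company_id": c["id"]})

per_company = Counter(i["company_id"] for i in invoices)
```

Every constraint in the requirement ("200 companies", "1–25 users", "at least 20 companies with more than 50 invoices") is guaranteed by construction rather than checked after the fact.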
## How to start (even without a fancy tool)
Even if you don’t use SyntheholDB or any specific product, you can still move towards this pattern.
A few practical steps:
**Define your core entities and relationships explicitly.**
Write down the tables and constraints that matter most for testing. This becomes your “test world” spec.

**Stop editing data directly in the DB.**
Always go through a generator, script, or seeding process. No more manual tweaks in staging.

**Design edge-case scenarios as first-class citizens.**
Don’t wait for production to surprise you. Decide up front which “weird” configurations your system must handle and encode them.

**Separate test data from real data in your mental model.**
Production is for truth. Testing environments are for exploring possibilities.
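One cheap way to enforce the first two steps is a referential-integrity check that runs over any dataset before it is loaded, so manual tweaks that break relationships get caught immediately. A generic sketch (the table and field names are placeholders):

```python
def find_orphans(child_rows, fk_field, parent_ids):
    """Return child rows whose foreign key points at no existing parent."""
    parent_ids = set(parent_ids)
    return [row for row in child_rows if row[fk_field] not in parent_ids]

# Example with placeholder data: user 11 references a company that doesn't exist.
companies = [{"id": 1}, {"id": 2}]
users = [{"id": 10, "company_id": 1}, {"id": 11, "company_id": 99}]

orphans = find_orphans(users, "company_id", (c["id"] for c in companies))
```

Wire a check like this into CI and orphan rows stop being something you discover from a confusing test failure weeks later.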
Once you think in generators instead of snapshots, the value of synthetic relational test data becomes obvious.
## Closing thought
If you’re still writing INSERT scripts by hand in 2026, it’s not because you enjoy it.
It’s because the alternative feels like “too much work right now.”
The truth is the opposite: the more your product grows, the more expensive hand‑crafted test data becomes.
Whether you roll your own generator or opt for a tool, it’s worth asking:
What would it look like if test databases were never a bottleneck again?