Most teams do not have a testing problem. They have a test data realism problem.
Locally, the app runs against test@example.com, User 1, and a seed script nobody wants to maintain. In CI, fixtures slowly drift away from reality. Then the bugs show up after deploy:
- a customer name has an apostrophe or accent
- a field is NULL where your code assumed a string
- an account has 47 related records instead of 2
- a query that worked on 20 rows falls over on 20,000
- shared staging data gets mutated by three people at once
If that sounds familiar, the answer is usually not "write more tests." It is "stop testing against fake data."
What teams actually want
What most teams actually want is not a full copy of production, not a giant pg_dump, and not another 400-line seed script. They want production-like data:
- realistic enough to expose bugs
- small enough to restore locally and in CI
- safe enough to use outside production
- reproducible enough that every developer can get the same result
That is the gap most dev workflows never solve cleanly.
Why the usual approaches break down
Seed scripts rot
Seed scripts are fine when your app has five tables. They get painful when your schema grows:
- every migration breaks something
- relationships get harder to maintain
- the data gets less realistic over time
- nobody wants to own the script anymore
You end up with a setup that is reproducible, but not especially useful. We wrote a deeper comparison of seed scripts vs production snapshots.
pg_dump is great for backups, not dev environments
pg_dump solves a different problem. It copies everything:
- all rows
- all tables
- all PII
- all the size and baggage of production
That is useful for backup and recovery. It is usually overkill for local development and CI.
For dev workflows, full dumps create new problems:
- slow restores
- bloated local databases
- longer CI jobs
- sensitive data showing up in places it should not
Most of the time, you do not need the entire database. You need the right slice of it. We wrote a full comparison of pg_dump vs Basecut if you want the details.
The better pattern: subset, anonymize, restore
The workflow that makes sense looks more like this:
- Start from one or more root tables.
- Filter to the rows you actually care about.
- Follow foreign keys to pull in the connected data.
- Anonymize sensitive fields inline.
- Save the snapshot.
- Restore it anywhere you need it.
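The six steps above can be sketched in a few lines of Python. This is a toy in-memory model with made-up table and field names, not how Basecut (or any real tool) works against live Postgres, but it shows the shape of the pipeline:

```python
import json

# Toy tables with hypothetical fields; a real tool reads these from Postgres.
users = [
    {"id": 1, "email": "jane@company.com", "created_at": "2026-02-01"},
    {"id": 2, "email": "old@company.com", "created_at": "2024-01-01"},
]
orders = [
    {"id": 10, "user_id": 1, "total": 42},
    {"id": 11, "user_id": 2, "total": 7},
]

# 1-2. Start from a root table and filter to the rows you care about.
roots = [u for u in users if u["created_at"] > "2026-01-01"]
root_ids = {u["id"] for u in roots}

# 3. Follow the foreign key to pull in the connected child rows.
kept_orders = [o for o in orders if o["user_id"] in root_ids]

# 4. Anonymize sensitive fields inline, before anything is written out.
masked = [{**u, "email": f"user{u['id']}@example.test"} for u in roots]

# 5. Save the snapshot as a single artifact.
snapshot = json.dumps({"users": masked, "orders": kept_orders})

# 6. "Restore" it anywhere: here, just load the artifact back.
restored = json.loads(snapshot)
print(len(restored["users"]), len(restored["orders"]))  # 1 1
```

The point of the sketch is the ordering: filtering and anonymization happen before the snapshot exists, so nothing downstream ever sees raw PII.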
That gives you a connected, realistic, privacy-safe subset of production instead of a raw copy. This is the workflow we built Basecut around for PostgreSQL: FK-aware extraction, automatic PII anonymization, and one-command restores for local dev, CI, and debugging.
The reason this approach works is simple: it treats test data as a repeatable snapshot problem, not a hand-crafted fixture problem.
What "production-like" should actually mean
The phrase gets used loosely. In practice, production-like data should have four properties.
1. Realistic structure
It should reflect the real relationships, optional fields, and edge cases in your schema.
2. Referential integrity
If you copy one row from orders, you usually also need related rows from users, line_items, shipments, and whatever else your app expects to exist together.
3. Privacy safety
Emails, names, phone numbers, addresses, and other sensitive fields need to be anonymized before the data lands on laptops, CI runners, or logs.
4. Repeatability
Developers need a predictable way to recreate the same kind of dataset without asking someone to send them a dump.
If any one of those is missing, the workflow gets shaky fast.
Why FK-aware extraction matters
This is the part many DIY approaches get wrong. Randomly sampling rows from each table sounds easy until you restore them. Then you get:
- orders pointing to missing users
- line items pointing to missing products
- child rows without their parents
- failed restores or strange app behavior after restore
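The first failure mode is easy to reproduce in miniature. Here is a toy SQLite schema (unrelated to Basecut) showing that restoring a child row without its parent breaks as soon as foreign keys are enforced:

```python
import sqlite3

# In-memory database standing in for a freshly restored dev DB.
db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
db.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, "
    "user_id INTEGER REFERENCES users(id))"
)

# A naive sample copied this order but not the user it points to.
try:
    db.execute("INSERT INTO orders VALUES (10, 1)")  # user 1 was never copied
except sqlite3.IntegrityError as e:
    print("broken reference:", e)
```

Postgres behaves the same way on restore, except the failure usually surfaces mid-way through a large `INSERT` batch instead of in a tidy one-liner.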
A useful snapshot has to behave like a self-contained mini-version of production.
That is why FK-aware extraction matters. In Basecut, snapshots are built by following foreign keys in both directions and collecting a connected subgraph of your data. The result is something you can restore into an empty database without ending up with broken references.
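The idea can be illustrated with a small breadth-first search over a toy FK graph. This is a hand-rolled sketch with a made-up schema, not Basecut's actual algorithm, but it shows what "following foreign keys in both directions" means:

```python
from collections import deque

# child table -> list of (fk_column, parent table); hypothetical schema.
FKS = {"orders": [("user_id", "users")], "line_items": [("order_id", "orders")]}

# table -> {primary key: row}; only FK columns shown for brevity.
TABLES = {
    "users": {1: {}, 2: {}},
    "orders": {10: {"user_id": 1}, 11: {"user_id": 2}},
    "line_items": {100: {"order_id": 10}, 101: {"order_id": 11}},
}

def connected_subset(root_table, root_ids):
    """BFS over FK edges, collecting a closed set of (table, id) pairs."""
    seen = set()
    queue = deque((root_table, rid) for rid in root_ids)
    while queue:
        table, rid = queue.popleft()
        if (table, rid) in seen:
            continue
        seen.add((table, rid))
        # Walk up: every parent a kept row points at must come along.
        for col, parent in FKS.get(table, []):
            queue.append((parent, TABLES[table][rid][col]))
        # Walk down: children that point back at this row come along too.
        for child, fks in FKS.items():
            for col, parent in fks:
                if parent == table:
                    for cid, row in TABLES[child].items():
                        if row[col] == rid:
                            queue.append((child, cid))
    return seen

print(sorted(connected_subset("users", {1})))
```

Because the returned set is closed under FK edges, restoring exactly those rows into an empty database cannot produce a dangling reference.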
That matters more than people think. It is the difference between "the data loaded" and "the app actually behaves like it does in production."
What the workflow looks like in practice
The nice part is that this can stay simple. Basecut starts with a small YAML config that tells it:
- where to start
- how far to traverse relationships
- how much data to include
- how anonymization should work
Example:
```yaml
version: '1'
name: 'dev-snapshot'
from:
  - table: users
    where: 'created_at > :since'
    params:
      since: '2026-01-01'
traverse:
  parents: 5
  children: 10
limits:
  rows:
    per_table: 1000
    total: 50000
anonymize:
  mode: auto
```
Then the workflow becomes:
```bash
basecut snapshot create \
  --config basecut.yml \
  --name "dev-snapshot" \
  --source "$DATABASE_URL"

basecut snapshot restore dev-snapshot:latest \
  --target "$LOCAL_DATABASE_URL"
```
That is the whole loop:
- inspect schema
- create snapshot
- restore anywhere
In practice, most teams can get from install to first snapshot in a few minutes. You can try this workflow with Basecut — the CLI is free for small teams.
The privacy part is not optional
One more requirement: privacy. If you are moving production-like data into dev and CI, PII handling cannot be a manual cleanup step.
At minimum, your workflow should:
- detect common PII automatically
- allow explicit masking rules
- preserve join integrity where needed
- anonymize during extraction, not afterward
Basecut handles this with automatic PII detection plus 30+ anonymization strategies. It also supports deterministic masking, which matters when the same source value needs to map to the same fake value across related tables.
If jane@company.com turns into one fake email in users and a different fake email somewhere else, your data stops behaving like the real system. That is exactly the sort of detail that makes fake dev data feel fine right up until it is not.
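The technique itself is easy to sketch with an HMAC. This is a generic illustration of deterministic masking, not Basecut's implementation, and the key name is made up:

```python
import hashlib
import hmac

SECRET = b"per-project-masking-key"  # hypothetical key; keep it out of the repo

def mask_email(email: str) -> str:
    """Deterministic masking: the same input always yields the same fake email."""
    digest = hmac.new(SECRET, email.lower().encode(), hashlib.sha256)
    return f"user-{digest.hexdigest()[:10]}@example.test"

# The same source value maps to the same fake value everywhere it appears,
# so joins on (anonymized) email keep working across tables.
print(mask_email("jane@company.com") == mask_email("jane@company.com"))  # True
```

Using a keyed HMAC rather than a plain hash matters: without the secret, an attacker who knows a real email cannot simply hash it and look up the masked row.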
Why this works well in CI too
This pattern is just as useful in CI as it is locally. Instead of checking brittle fixtures into the repo, you restore a realistic snapshot before the test suite runs.
For example:
```yaml
name: Test
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_USER: postgres
          POSTGRES_PASSWORD: postgres
          POSTGRES_DB: test_db
        ports:
          - 5432:5432
    steps:
      - uses: actions/checkout@v4
      - name: Install Basecut CLI
        run: |
          curl -fsSL https://basecut.dev/install.sh | sh
          echo "$HOME/.local/bin" >> $GITHUB_PATH
      - name: Restore snapshot
        env:
          BASECUT_API_KEY: ${{ secrets.BASECUT_API_KEY }}
        run: |
          basecut snapshot restore test-data:latest \
            --target "postgresql://postgres:postgres@localhost:5432/test_db"
      - name: Run tests
        run: npm test
```
That gives your pipeline real shapes, real relationships, and realistic edge cases without restoring an entire production dump on every run. It also keeps snapshots small enough that restores stay fast. We go deeper in our CI/CD test data guide.
When this is worth doing
You probably want production-like snapshots if:
- your app has more than a handful of tables
- your bugs are often data-dependent
- you need realistic data in local dev
- your CI pipeline should test against something closer to reality
- you handle meaningful PII
- your team is tired of maintaining fixtures or seed scripts
You might not need it yet if:
- the product is brand new
- you do not have real production data yet
- the schema is tiny
- completely fictional demo data is the goal
This is not about replacing every fixture in your test suite. Unit tests still benefit from tiny, explicit test data. The value here is in integration tests, local development, CI, onboarding, and debugging where data shape matters.
A practical rollout
If you want to adopt this without overcomplicating it, start small.
- Pick one painful workflow. Usually local dev onboarding, shared staging, or CI integration tests.
- Define a small snapshot. Keep the restore fast: start with a few root tables and sensible row limits.
- Turn on anonymization from day one. Do not leave this for later.
- Restore it somewhere useful immediately. A local dev DB or a CI test DB is usually enough to prove the value.
- Expand gradually. Add more tables, better filters, and refresh automation once the loop is working.
That gets you to useful production-like data quickly without turning the whole thing into a platform project.
Final thought
Most teams are not under-testing. They are testing against data that makes them feel safe. That is not the same thing.
If your local environments and CI pipelines run against tiny, stale, or fake data, they will keep giving you false confidence. Production-like snapshots are one of the highest-leverage ways to make development and testing feel closer to the real system without dragging raw production data everywhere.
If you want to try this with PostgreSQL, Basecut is free for small teams. Or dig into the quickstart guide and how FK-aware extraction works.