Jake Lazarus

Posted on • Originally published at basecut.dev

Why Fake PostgreSQL Test Data Misses Real Bugs

Most teams do not have a testing problem. They have a test data realism problem.

Locally, the app runs against test@example.com, User 1, and a seed script nobody wants to maintain. In CI, fixtures slowly drift away from reality. Then the bugs show up after deploy:

  • a customer name has an apostrophe or accent
  • a field is NULL where your code assumed a string
  • an account has 47 related records instead of 2
  • a query that worked on 20 rows falls over on 20,000
  • shared staging data gets mutated by three people at once

If that sounds familiar, the answer is usually not "write more tests." It is "stop testing against fake data."
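The NULL bullet above is the classic shape of this failure. Seed data never exercises the optional field, so the bug only appears on real rows. A minimal sketch, with a hypothetical `display_name` helper and field names that are invented for illustration:

```python
# Hypothetical formatter that assumes every row has a string "name".
def display_name(row: dict) -> str:
    # Fine on seed data, where "name" is always "user 1", "user 2", ...
    return row["name"].strip().title()

seed_row = {"name": "user 1"}   # what fixtures contain
prod_row = {"name": None}       # what production actually contains

print(display_name(seed_row))   # works

try:
    display_name(prod_row)      # None has no .strip()
except AttributeError as e:
    print("crashes on real data:", e)
```

The test suite passes forever, because the fixture never contains the row shape that production does.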

What teams actually want

What most teams actually want is not a full copy of production, not a giant pg_dump, and not another 400-line seed script. They want production-like data:

  • realistic enough to expose bugs
  • small enough to restore locally and in CI
  • safe enough to use outside production
  • reproducible enough that every developer can get the same result

That is the gap most dev workflows never solve cleanly.

Why the usual approaches break down

Seed scripts rot

Seed scripts are fine when your app has five tables. They get painful when your schema grows:

  • every migration breaks something
  • relationships get harder to maintain
  • the data gets less realistic over time
  • nobody wants to own the script anymore

You end up with a setup that is reproducible, but not especially useful. We wrote a deeper comparison of seed scripts vs production snapshots.

pg_dump is great for backups, not dev environments

pg_dump solves a different problem. It copies everything:

  • all rows
  • all tables
  • all PII
  • all the size and baggage of production

That is useful for backup and recovery. It is usually overkill for local development and CI.

For dev workflows, full dumps create new problems:

  • slow restores
  • bloated local databases
  • longer CI jobs
  • sensitive data showing up in places it should not

Most of the time, you do not need the entire database. You need the right slice of it. We wrote a full comparison of pg_dump vs Basecut if you want the details.

The better pattern: subset, anonymize, restore

The workflow that makes sense looks more like this:

  1. Start from one or more root tables.
  2. Filter to the rows you actually care about.
  3. Follow foreign keys to pull in the connected data.
  4. Anonymize sensitive fields inline.
  5. Save the snapshot.
  6. Restore it anywhere you need it.

That gives you a connected, realistic, privacy-safe subset of production instead of a raw copy. This is the workflow we built Basecut around for PostgreSQL: FK-aware extraction, automatic PII anonymization, and one-command restores for local dev, CI, and debugging.

The reason this approach works is simple: it treats test data as a repeatable snapshot problem, not a hand-crafted fixture problem.
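The six steps can be sketched end to end over in-memory tables. This is a toy model under an invented two-table schema, not Basecut's API; the point is only the shape of the loop: filter roots, follow the foreign key, anonymize inline, then serialize and restore.

```python
import json

# Invented tables: users is the root, orders references users via user_id.
users = {1: {"id": 1, "email": "jane@company.com", "created_at": "2026-02-01"},
         2: {"id": 2, "email": "old@company.com", "created_at": "2024-01-01"}}
orders = {10: {"id": 10, "user_id": 1, "total": 99},
          11: {"id": 11, "user_id": 2, "total": 12}}

# Steps 1-2: root table plus a filter (recent users only).
roots = [u for u in users.values() if u["created_at"] > "2026-01-01"]

# Step 3: follow the orders.user_id -> users.id foreign key.
picked_users = {u["id"]: dict(u) for u in roots}
picked_orders = {o["id"]: dict(o) for o in orders.values()
                 if o["user_id"] in picked_users}

# Step 4: anonymize inline, before anything is written out.
for u in picked_users.values():
    u["email"] = f"user{u['id']}@example.test"

# Steps 5-6: "save" and "restore" are just serialization in this sketch.
snapshot = json.dumps({"users": list(picked_users.values()),
                       "orders": list(picked_orders.values())})
restored = json.loads(snapshot)
print(restored["users"][0]["email"])
```

The restored subset is small, FK-consistent, and contains no real email addresses, which is the whole point of doing the anonymization during extraction rather than after.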

What "production-like" should actually mean

The phrase gets used loosely. In practice, production-like data should have four properties.

1. Realistic structure

It should reflect the real relationships, optional fields, and edge cases in your schema.

2. Referential integrity

If you copy one row from orders, you usually also need related rows from users, line_items, shipments, and whatever else your app expects to exist together.
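The database itself enforces this. A toy SQLite demonstration (stdlib only, standing in for PostgreSQL, with an invented two-table schema) of what happens when a "subset" copies an order but not the user it points at:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, "
           "user_id INTEGER NOT NULL REFERENCES users(id))")

orphan_error = None
try:
    # A "subset" that copied an order but not its parent user.
    db.execute("INSERT INTO orders VALUES (10, 1)")
except sqlite3.IntegrityError as e:
    orphan_error = e
print("restore without the parent:", orphan_error)

# Copy the parent row first and the same insert succeeds.
db.execute("INSERT INTO users VALUES (1)")
db.execute("INSERT INTO orders VALUES (10, 1)")
```

With constraints enabled the restore fails loudly; with constraints deferred or dropped, it fails quietly later, inside the app.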

3. Privacy safety

Emails, names, phone numbers, addresses, and other sensitive fields need to be anonymized before the data lands on laptops, CI runners, or logs.

4. Repeatability

Developers need a predictable way to recreate the same kind of dataset without asking someone to send them a dump.

If any one of those is missing, the workflow gets shaky fast.

Why FK-aware extraction matters

This is the part many DIY approaches get wrong. Randomly sampling rows from each table sounds easy until you restore them. Then you get:

  • orders pointing to missing users
  • line items pointing to missing products
  • child rows without their parents
  • failed restores or strange app behavior after restore

A useful snapshot has to behave like a self-contained mini-version of production.

That is why FK-aware extraction matters. In Basecut, snapshots are built by following foreign keys in both directions and collecting a connected subgraph of your data. The result is something you can restore into an empty database without ending up with broken references.

That matters more than people think. It is the difference between:

  • "the data loaded"
  • "the app actually behaves like it does in production"

What the workflow looks like in practice

The nice part is that this can stay simple. Basecut starts with a small YAML config that tells it:

  • where to start
  • how far to traverse relationships
  • how much data to include
  • how anonymization should work

Example:

```yaml
version: '1'
name: 'dev-snapshot'

from:
  - table: users
    where: 'created_at > :since'
    params:
      since: '2026-01-01'

traverse:
  parents: 5
  children: 10

limits:
  rows:
    per_table: 1000
    total: 50000

anonymize:
  mode: auto
```

Then the workflow becomes:

```shell
basecut snapshot create \
  --config basecut.yml \
  --name "dev-snapshot" \
  --source "$DATABASE_URL"

basecut snapshot restore dev-snapshot:latest \
  --target "$LOCAL_DATABASE_URL"
```

That is the whole loop:

  • define a small config
  • create a snapshot
  • restore it anywhere

In practice, most teams can get from install to first snapshot in a few minutes. You can try this workflow with Basecut — the CLI is free for small teams.

The privacy part is not optional

One more requirement: privacy. If you are moving production-like data into dev and CI, PII handling cannot be a manual cleanup step.

At minimum, your workflow should:

  • detect common PII automatically
  • allow explicit masking rules
  • preserve join integrity where needed
  • anonymize during extraction, not afterward

Basecut handles this with automatic PII detection plus 30+ anonymization strategies. It also supports deterministic masking, which matters when the same source value needs to map to the same fake value across related tables.

If jane@company.com turns into one fake email in users and a different fake email somewhere else, your data stops behaving like the real system. That is exactly the sort of detail that makes fake dev data feel fine right up until it is not.
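One common way to get deterministic masking is keyed hashing: the same source value always maps to the same fake value, without the fake value being reversible to the original. This sketch names the technique, not Basecut's implementation; the key and the address format are assumptions for illustration.

```python
import hashlib
import hmac

# In practice the key lives in a secret store, not in source control.
MASK_KEY = b"per-project-secret"

def mask_email(email: str) -> str:
    # HMAC-SHA256 keyed hash: deterministic for one key, not reversible.
    digest = hmac.new(MASK_KEY, email.lower().encode(), hashlib.sha256)
    return f"user_{digest.hexdigest()[:12]}@example.test"

# The same input in two different tables yields the same masked value,
# so joins and FK-style relationships keep lining up after masking.
a = mask_email("jane@company.com")   # e.g. in users.email
b = mask_email("jane@company.com")   # e.g. in audit_log.actor_email
print(a == b)
print(a != mask_email("john@company.com"))
```

The trade-off is that determinism preserves linkability across the dataset, which is exactly what you want for test data and exactly what you must account for in any privacy review.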

Why this works well in CI too

This pattern is just as useful in CI as it is locally. Instead of checking brittle fixtures into the repo, you restore a realistic snapshot before the test suite runs.

For example:

```yaml
name: Test
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_USER: postgres
          POSTGRES_PASSWORD: postgres
          POSTGRES_DB: test_db
        ports:
          - 5432:5432

    steps:
      - uses: actions/checkout@v4

      - name: Install Basecut CLI
        run: |
          curl -fsSL https://basecut.dev/install.sh | sh
          echo "$HOME/.local/bin" >> $GITHUB_PATH

      - name: Restore snapshot
        env:
          BASECUT_API_KEY: ${{ secrets.BASECUT_API_KEY }}
        run: |
          basecut snapshot restore test-data:latest \
            --target "postgresql://postgres:postgres@localhost:5432/test_db"

      - name: Run tests
        run: npm test
```

That gives your pipeline real shapes, real relationships, and realistic edge cases without restoring an entire production dump on every run. It also keeps snapshots small enough that restores stay fast. We go deeper in our CI/CD test data guide.

When this is worth doing

You probably want production-like snapshots if:

  • your app has more than a handful of tables
  • your bugs are often data-dependent
  • you need realistic data in local dev
  • your CI pipeline should test against something closer to reality
  • you handle meaningful PII
  • your team is tired of maintaining fixtures or seed scripts

You might not need it yet if:

  • the product is brand new
  • you do not have real production data yet
  • the schema is tiny
  • completely fictional demo data is the goal

This is not about replacing every fixture in your test suite. Unit tests still benefit from tiny, explicit test data. The value here is in integration tests, local development, CI, onboarding, and debugging where data shape matters.

A practical rollout

If you want to adopt this without overcomplicating it, start small.

  1. Pick one painful workflow.
    Usually local dev onboarding, shared staging, or CI integration tests.

  2. Define a small snapshot.
    Keep the restore fast. Start with a few root tables and sensible row limits.

  3. Turn on anonymization from day one.
    Do not leave this for later.

  4. Restore it somewhere useful immediately.
    Local dev DB or CI test DB is usually enough to prove the value.

  5. Expand gradually.
    Add more tables, better filters, and refresh automation once the loop is working.

That gets you to useful production-like data quickly without turning the whole thing into a platform project.

Final thought

Most teams are not under-testing. They are testing against data that makes them feel safe. That is not the same thing.

If your local environments and CI pipelines run against tiny, stale, or fake data, they will keep giving you false confidence. Production-like snapshots are one of the highest-leverage ways to make development and testing feel closer to the real system without dragging raw production data everywhere.

If you want to try this with PostgreSQL, Basecut is free for small teams. Or dig into the quickstart guide and how FK-aware extraction works.
