Jake Lazarus

Posted on • Originally published at basecut.dev

Why Fake PostgreSQL Test Data Misses Real Bugs

Most teams do not have a testing problem. They have a test data realism problem.

Locally, the app runs against test@example.com, User 1, and a seed script nobody wants to maintain. In CI, fixtures slowly drift away from reality. Then the bugs show up after deploy:

  • a customer name has an apostrophe or accent
  • a field is NULL where your code assumed a string
  • an account has 47 related records instead of 2
  • a query that worked on 20 rows falls over on 20,000
  • shared staging data gets mutated by three people at once

If that sounds familiar, the answer is usually not "write more tests." It is "stop testing against fake data."
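The NULL bullet above is the classic shape of this failure. Seed data never exercises the optional field, so the bug only appears on real rows. A minimal sketch, with a hypothetical `display_name` helper and field names that are invented for illustration:

```python
# Hypothetical formatter that assumes every row has a string "name".
def display_name(row: dict) -> str:
    # Fine on seed data, where "name" is always "user 1", "user 2", ...
    return row["name"].strip().title()

seed_row = {"name": "user 1"}   # what fixtures contain
prod_row = {"name": None}       # what production actually contains

print(display_name(seed_row))   # works

try:
    display_name(prod_row)      # None has no .strip()
except AttributeError as e:
    print("crashes on real data:", e)
```

The test suite passes forever, because the fixture never contains the row shape that production does.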

What teams actually want

What most teams actually want is not a full copy of production, not a giant pg_dump, and not another 400-line seed script. They want production-like data:

  • realistic enough to expose bugs
  • small enough to restore locally and in CI
  • safe enough to use outside production
  • reproducible enough that every developer can get the same result

That is the gap most dev workflows never solve cleanly.

Why the usual approaches break down

Seed scripts rot

Seed scripts are fine when your app has five tables. They get painful when your schema grows:

  • every migration breaks something
  • relationships get harder to maintain
  • the data gets less realistic over time
  • nobody wants to own the script anymore

You end up with a setup that is reproducible, but not especially useful. We wrote a deeper comparison of seed scripts vs production snapshots.

pg_dump is great for backups, not dev environments

pg_dump solves a different problem. It copies everything:

  • all rows
  • all tables
  • all PII
  • all the size and baggage of production

That is useful for backup and recovery. It is usually overkill for local development and CI.

For dev workflows, full dumps create new problems:

  • slow restores
  • bloated local databases
  • longer CI jobs
  • sensitive data showing up in places it should not

Most of the time, you do not need the entire database. You need the right slice of it. We wrote a full comparison of pg_dump vs Basecut if you want the details.

The better pattern: subset, anonymize, restore

The workflow that makes sense looks more like this:

  1. Start from one or more root tables.
  2. Filter to the rows you actually care about.
  3. Follow foreign keys to pull in the connected data.
  4. Anonymize sensitive fields inline.
  5. Save the snapshot.
  6. Restore it anywhere you need it.

That gives you a connected, realistic, privacy-safe subset of production instead of a raw copy. This is the workflow we built Basecut around for PostgreSQL: FK-aware extraction, automatic PII anonymization, and one-command restores for local dev, CI, and debugging.

The reason this approach works is simple: it treats test data as a repeatable snapshot problem, not a hand-crafted fixture problem.
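The six steps can be sketched end to end over in-memory tables. This is a toy model under an invented two-table schema, not Basecut's API; the point is only the shape of the loop: filter roots, follow the foreign key, anonymize inline, then serialize and restore.

```python
import json

# Invented tables: users is the root, orders references users via user_id.
users = {1: {"id": 1, "email": "jane@company.com", "created_at": "2026-02-01"},
         2: {"id": 2, "email": "old@company.com", "created_at": "2024-01-01"}}
orders = {10: {"id": 10, "user_id": 1, "total": 99},
          11: {"id": 11, "user_id": 2, "total": 12}}

# Steps 1-2: root table plus a filter (recent users only).
roots = [u for u in users.values() if u["created_at"] > "2026-01-01"]

# Step 3: follow the orders.user_id -> users.id foreign key.
picked_users = {u["id"]: dict(u) for u in roots}
picked_orders = {o["id"]: dict(o) for o in orders.values()
                 if o["user_id"] in picked_users}

# Step 4: anonymize inline, before anything is written out.
for u in picked_users.values():
    u["email"] = f"user{u['id']}@example.test"

# Steps 5-6: "save" and "restore" are just serialization in this sketch.
snapshot = json.dumps({"users": list(picked_users.values()),
                       "orders": list(picked_orders.values())})
restored = json.loads(snapshot)
print(restored["users"][0]["email"])
```

The restored subset is small, FK-consistent, and contains no real email addresses, which is the whole point of doing the anonymization during extraction rather than after.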

What "production-like" should actually mean

The phrase gets used loosely. In practice, production-like data should have four properties.

1. Realistic structure

It should reflect the real relationships, optional fields, and edge cases in your schema.

2. Referential integrity

If you copy one row from orders, you usually also need related rows from users, line_items, shipments, and whatever else your app expects to exist together.
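The database itself enforces this. A toy SQLite demonstration (stdlib only, standing in for PostgreSQL, with an invented two-table schema) of what happens when a "subset" copies an order but not the user it points at:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, "
           "user_id INTEGER NOT NULL REFERENCES users(id))")

orphan_error = None
try:
    # A "subset" that copied an order but not its parent user.
    db.execute("INSERT INTO orders VALUES (10, 1)")
except sqlite3.IntegrityError as e:
    orphan_error = e
print("restore without the parent:", orphan_error)

# Copy the parent row first and the same insert succeeds.
db.execute("INSERT INTO users VALUES (1)")
db.execute("INSERT INTO orders VALUES (10, 1)")
```

With constraints enabled the restore fails loudly; with constraints deferred or dropped, it fails quietly later, inside the app.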

3. Privacy safety

Emails, names, phone numbers, addresses, and other sensitive fields need to be anonymized before the data lands on laptops, CI runners, or logs.

4. Repeatability

Developers need a predictable way to recreate the same kind of dataset without asking someone to send them a dump.

If any one of those is missing, the workflow gets shaky fast.

Why FK-aware extraction matters

This is the part many DIY approaches get wrong. Randomly sampling rows from each table sounds easy until you restore them. Then you get:

  • orders pointing to missing users
  • line items pointing to missing products
  • child rows without their parents
  • failed restores or strange app behavior after restore

A useful snapshot has to behave like a self-contained mini-version of production.

That is why FK-aware extraction matters. In Basecut, snapshots are built by following foreign keys in both directions and collecting a connected subgraph of your data. The result is something you can restore into an empty database without ending up with broken references.

That matters more than people think. It is the difference between:

  • "the data loaded"
  • "the app actually behaves like it does in production"

What the workflow looks like in practice

The nice part is that this can stay simple. Basecut starts with a small YAML config that tells it:

  • where to start
  • how far to traverse relationships
  • how much data to include
  • how anonymization should work

Example:

```yaml
version: '1'
name: 'dev-snapshot'

from:
  - table: users
    where: 'created_at > :since'
    params:
      since: '2026-01-01'

traverse:
  parents: 5
  children: 10

limits:
  rows:
    per_table: 1000
    total: 50000

anonymize:
  mode: auto
```

Then the workflow becomes:

```shell
basecut snapshot create \
  --config basecut.yml \
  --name "dev-snapshot" \
  --source "$DATABASE_URL"

basecut snapshot restore dev-snapshot:latest \
  --target "$LOCAL_DATABASE_URL"
```

That is the whole loop:

  • define a small config
  • create a snapshot
  • restore it anywhere

In practice, most teams can get from install to first snapshot in a few minutes. You can try this workflow with Basecut — the CLI is free for small teams.

The privacy part is not optional

One more requirement: privacy. If you are moving production-like data into dev and CI, PII handling cannot be a manual cleanup step.

At minimum, your workflow should:

  • detect common PII automatically
  • allow explicit masking rules
  • preserve join integrity where needed
  • anonymize during extraction, not afterward

Basecut handles this with automatic PII detection plus 30+ anonymization strategies. It also supports deterministic masking, which matters when the same source value needs to map to the same fake value across related tables.

If jane@company.com turns into one fake email in users and a different fake email somewhere else, your data stops behaving like the real system. That is exactly the sort of detail that makes fake dev data feel fine right up until it is not.
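One common way to get deterministic masking is keyed hashing: the same source value always maps to the same fake value, without the fake value being reversible to the original. This sketch names the technique, not Basecut's implementation; the key and the address format are assumptions for illustration.

```python
import hashlib
import hmac

# In practice the key lives in a secret store, not in source control.
MASK_KEY = b"per-project-secret"

def mask_email(email: str) -> str:
    # HMAC-SHA256 keyed hash: deterministic for one key, not reversible.
    digest = hmac.new(MASK_KEY, email.lower().encode(), hashlib.sha256)
    return f"user_{digest.hexdigest()[:12]}@example.test"

# The same input in two different tables yields the same masked value,
# so joins and FK-style relationships keep lining up after masking.
a = mask_email("jane@company.com")   # e.g. in users.email
b = mask_email("jane@company.com")   # e.g. in audit_log.actor_email
print(a == b)
print(a != mask_email("john@company.com"))
```

The trade-off is that determinism preserves linkability across the dataset, which is exactly what you want for test data and exactly what you must account for in any privacy review.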

Why this works well in CI too

This pattern is just as useful in CI as it is locally. Instead of checking brittle fixtures into the repo, you restore a realistic snapshot before the test suite runs.

For example:

```yaml
name: Test
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_USER: postgres
          POSTGRES_PASSWORD: postgres
          POSTGRES_DB: test_db
        ports:
          - 5432:5432

    steps:
      - uses: actions/checkout@v4

      - name: Install Basecut CLI
        run: |
          curl -fsSL https://basecut.dev/install.sh | sh
          echo "$HOME/.local/bin" >> $GITHUB_PATH

      - name: Restore snapshot
        env:
          BASECUT_API_KEY: ${{ secrets.BASECUT_API_KEY }}
        run: |
          basecut snapshot restore test-data:latest \
            --target "postgresql://postgres:postgres@localhost:5432/test_db"

      - name: Run tests
        run: npm test
```

That gives your pipeline real shapes, real relationships, and realistic edge cases without restoring an entire production dump on every run. It also keeps snapshots small enough that restores stay fast. We go deeper in our CI/CD test data guide.

When this is worth doing

You probably want production-like snapshots if:

  • your app has more than a handful of tables
  • your bugs are often data-dependent
  • you need realistic data in local dev
  • your CI pipeline should test against something closer to reality
  • you handle meaningful PII
  • your team is tired of maintaining fixtures or seed scripts

You might not need it yet if:

  • the product is brand new
  • you do not have real production data yet
  • the schema is tiny
  • completely fictional demo data is the goal

This is not about replacing every fixture in your test suite. Unit tests still benefit from tiny, explicit test data. The value here is in integration tests, local development, CI, onboarding, and debugging where data shape matters.

A practical rollout

If you want to adopt this without overcomplicating it, start small.

  1. Pick one painful workflow.
    Usually local dev onboarding, shared staging, or CI integration tests.

  2. Define a small snapshot.
    Keep the restore fast. Start with a few root tables and sensible row limits.

  3. Turn on anonymization from day one.
    Do not leave this for later.

  4. Restore it somewhere useful immediately.
    Local dev DB or CI test DB is usually enough to prove the value.

  5. Expand gradually.
    Add more tables, better filters, and refresh automation once the loop is working.

That gets you to useful production-like data quickly without turning the whole thing into a platform project.

Final thought

Most teams are not under-testing. They are testing against data that makes them feel safe. That is not the same thing.

If your local environments and CI pipelines run against tiny, stale, or fake data, they will keep giving you false confidence. Production-like snapshots are one of the highest-leverage ways to make development and testing feel closer to the real system without dragging raw production data everywhere.

If you want to try this with PostgreSQL, Basecut is free for small teams. Or dig into the quickstart guide and how FK-aware extraction works.
