How to Set Up a PostgreSQL Test Database in GitHub Actions (Without pg_dump)

Jake Lazarus

Originally published at basecut.dev.

The standard GitHub Actions job for PostgreSQL tests looks something like this:

- name: Set up test database
  run: psql "$TEST_DB_URL" -f scripts/seed.sql

Or worse, a pg_dump from production piped straight into the test container. It runs. The tests pass. Nobody questions it.

Until a developer checks the CI logs and finds real customer emails in the output. Or a dump that used to take two minutes now takes fifteen. Or the seed script breaks on a migration and half the CI matrix fails with ERROR: column "created_by" of relation "orders" does not exist.

This guide covers a better pattern: restore a pre-built, anonymized database snapshot instead. For small-to-medium snapshots, restores typically take seconds rather than minutes, the data contains no PII, and the process never breaks on schema changes.

How do you set up a PostgreSQL test database in GitHub Actions?

The cleanest way to set up a PostgreSQL test database in GitHub Actions is to run a postgres service container and restore a pre-built snapshot before your test step runs. This is faster than restoring a pg_dump, safer than fixture SQL files, and does not require maintaining seed data by hand.

The three-step pattern is:

  1. Spin up a postgres service container in the workflow.
  2. Install a CLI that can restore a versioned snapshot.
  3. Restore the snapshot before your test suite runs.

The rest of this guide shows how to do that with Basecut, with full YAML you can copy directly.

Why the pg_dump approach breaks down in CI

The obvious approach — dump production, restore in CI — causes three problems that compound over time.

Speed. By default, pg_dump captures the entire database, so the restore replays all of it on every CI run. Production databases grow. A restore that takes 90 seconds today takes 8 minutes in two years. That is a lot of developer time spent waiting before tests even start.

PII. A full dump copies real emails, real names, real addresses. Those land in CI artifacts, appear in test output, and end up in logs. That is a compliance problem even if nobody reads the logs. The Thoughtworks Technology Radar explicitly calls out raw production data in test pipelines as a risk to address.

Fragility. Fixture SQL files and seed scripts break the moment a migration adds a NOT NULL column they do not know about. Someone commits a quick fix, the fix produces slightly different data than everyone else's local setup, and now "works on my machine" is a data problem, not a code problem.

The snapshot approach

Instead of restoring a raw dump or running a seed script, you restore a named, versioned snapshot — a small, FK-complete, pre-anonymized subset of production.

The key properties:

  • Pre-built. The snapshot is created once (or on a schedule), not on every CI run. For most team-sized datasets, restore time is measured in seconds rather than minutes — though this depends on snapshot size and network.
  • Already anonymized. PII is masked during snapshot creation, before any data leaves production. Nothing sensitive ever travels to CI.
  • Referentially complete. Every foreign key resolves. The snapshot is a consistent subgraph of production, not a random sample of disconnected rows.
  • Versioned. dev-snapshot:latest always points to the most recent run. You can also pin to dev-snapshot:v12 to keep tests stable while multiple PRs are in flight.

Full GitHub Actions setup

Here is a complete workflow you can adapt directly:

name: Test

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest

    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_USER: postgres
          POSTGRES_PASSWORD: postgres
          POSTGRES_DB: test_db
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

    steps:
      - uses: actions/checkout@v4

      - name: Install Basecut CLI
        run: |
          curl -fsSL https://basecut.dev/install.sh | sh
          echo "$HOME/.local/bin" >> $GITHUB_PATH
          # If the installer uses a different path, add that path instead

      - name: Restore test snapshot
        env:
          BASECUT_API_KEY: ${{ secrets.BASECUT_API_KEY }}
        run: |
          basecut snapshot restore dev-snapshot:latest \
            --target "postgresql://postgres:postgres@localhost:5432/test_db"

      - name: Run tests
        run: npm test

A few things worth noting:

  • --health-cmd pg_isready ensures the Postgres container is accepting connections before the restore step runs. Without this, you get intermittent failures on container startup.
  • dev-snapshot:latest is the snapshot name. Change this to match whatever you named your snapshot in the Basecut config. Tag a specific version (dev-snapshot:v3) if you want tests to run against a pinned dataset across multiple branches.
  • BASECUT_API_KEY goes in GitHub repo secrets (Settings → Secrets and variables → Actions), not in the workflow YAML.
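One optional addition: a quick sanity check between the restore and the test run, so a silently empty database fails fast with a clear message instead of surfacing as dozens of confusing test failures. A minimal sketch — the `users` table here is a placeholder for any table you know exists in your snapshot:

```yaml
- name: Smoke-check restored data
  run: |
    # Fail fast if the restore left the database empty.
    # Replace "users" with a table your snapshot actually contains.
    psql "postgresql://postgres:postgres@localhost:5432/test_db" \
      -tAc "SELECT count(*) FROM users;" | grep -qv '^0$'
```

`-tAc` prints just the bare count, and `grep -qv '^0$'` makes the step exit non-zero when that count is 0.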

GitLab CI equivalent

The same pattern works in GitLab CI:

test:
  image: node:20

  services:
    - name: postgres:16
      alias: postgres
      variables:
        POSTGRES_USER: postgres
        POSTGRES_PASSWORD: postgres
        POSTGRES_DB: test_db

  variables:
    PGHOST: postgres
    PGUSER: postgres
    PGPASSWORD: postgres

  before_script:
    - curl -fsSL https://basecut.dev/install.sh | sh
    - export PATH="$HOME/.local/bin:$PATH"
    - |
      basecut snapshot restore dev-snapshot:latest \
        --target "postgresql://postgres:postgres@postgres:5432/test_db"

  script:
    - npm test

Set BASECUT_API_KEY as a CI/CD variable in GitLab (Settings → CI/CD → Variables) with "Masked" enabled so it does not appear in job logs.

Keeping snapshots fresh without breaking CI

A snapshot that is three months old starts drifting from production. New columns, new relationships, new edge cases — none of them exist in it yet. The fix depends on how stable your CI needs to be.

Refresh on a schedule (recommended for most teams):

name: Refresh test snapshot

on:
  schedule:
    - cron: '0 3 * * 1' # Every Monday at 3am

jobs:
  refresh:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install Basecut CLI
        run: |
          curl -fsSL https://basecut.dev/install.sh | sh
          echo "$HOME/.local/bin" >> $GITHUB_PATH

      - name: Create fresh snapshot
        env:
          BASECUT_API_KEY: ${{ secrets.BASECUT_API_KEY }}
        run: |
          basecut snapshot create \
            --config basecut.yml \
            --source "${{ secrets.PRODUCTION_DATABASE_URL }}"

This keeps dev-snapshot:latest current week-to-week. Existing test workflows keep working — latest only advances when the refresh job runs.

Pin for stability, bump when needed:

If your test suite is sensitive to data changes, pin the snapshot version in the test workflow (dev-snapshot:v3) and bump the pin manually when you want the new data. This gives you explicit control over when CI picks up fresh data shapes.
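Concretely, pinning is just a one-line change to the restore step from the earlier workflow — same flags, pinned tag:

```yaml
- name: Restore pinned test snapshot
  env:
    BASECUT_API_KEY: ${{ secrets.BASECUT_API_KEY }}
  run: |
    # Pinned tag: CI keeps using v3 until someone bumps it in a PR,
    # so adopting new data shapes shows up as a reviewable diff.
    basecut snapshot restore dev-snapshot:v3 \
      --target "postgresql://postgres:postgres@localhost:5432/test_db"
```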

PII in CI: the part most teams skip

If your pipeline is restoring anything from production — a full dump, a partial export, a seed script that grabs real rows — real customer data is in your CI artifacts. That is the kind of thing commonly discussed in the context of GDPR Article 25 (data protection by design) and SOC 2 data handling controls, and most teams have not thought through whether their CI pipeline is in scope.

The snapshot approach sidesteps this entirely because anonymization happens at extraction time, before data leaves production. The snapshot that arrives in CI already has fake emails, fake names, and fake phone numbers — not real data with a cleanup script someone may or may not have run.

Deterministic masking ensures the same source value maps to the same fake value across all tables, so joins still work and the app behaves like it does in production.
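As an illustration of the idea — not Basecut's actual implementation — a deterministic mask can be built from a keyed hash: combine the source value with a fixed secret, hash it, and derive the fake value from the digest. A minimal shell sketch:

```shell
#!/bin/sh
# Deterministic masking sketch: the same source value always maps to the
# same fake value, so foreign-key joins still line up after masking.
# "snapshot-secret" stands in for a per-snapshot masking key.
mask_email() {
  hash=$(printf 'snapshot-secret:%s' "$1" | sha256sum | cut -c1-10)
  printf 'user_%s@example.com\n' "$hash"
}

mask_email "alice@realcorp.com"   # same fake address on every run
mask_email "alice@realcorp.com"   # identical output: joins still resolve
mask_email "bob@realcorp.com"     # different source, different fake value
```

Because the mapping depends only on the input and the key, a `users.email` value and the matching `orders.customer_email` value mask to the same fake address, wherever they appear.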

When this pattern matters most

This is overkill for a three-table CRUD app. It becomes the right default when:

  • Your test suite has data-dependent integration tests
  • Your schema has more than a handful of related tables
  • CI job time is starting to affect developer feedback loops
  • You handle PII and need to demonstrate anonymization in CI for compliance audits
  • Onboarding means "restore the snapshot and start working" instead of "ask around for a dump"

If your CI pipeline currently has a slow pg_dump step or a seed script that breaks every few sprints, this is the thing worth replacing first.

FAQ

How do I set up a PostgreSQL test database in GitHub Actions?
Spin up a postgres service container in the workflow, install a CLI that can restore a versioned snapshot, then restore a pre-built anonymized snapshot before your test step runs. This is faster than restoring a pg_dump, safer than fixture SQL files, and does not require maintaining seed data by hand. The snapshot is built once from production with PII masked at extraction time, then reused across every CI run until the next refresh.

Why is pg_dump slow in GitHub Actions CI?
A full pg_dump restore scales with the size of production and runs in full on every CI job. A restore that takes 90 seconds today takes 8 minutes in two years as production grows. Because the dump is neither subsetted nor anonymized, it also copies real PII into CI logs and artifacts. Pre-built snapshots replace this with a small, FK-complete, pre-anonymized dataset that typically restores in seconds for team-sized schemas.

Can I cache my PostgreSQL test database between GitHub Actions runs?
Caching a live database across runs is fragile — the postgres service container is fresh every job, and seeding state into caches tends to produce flaky tests. A better pattern is to cache the snapshot artifact itself (small, static, versioned) and restore it into a fresh container each run. That gives you fast startup without carrying mutable state between jobs.
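If your snapshot tool can export the snapshot to a local file, the standard actions/cache action handles the rest. The path below is illustrative — check your CLI's docs for its real export and restore-from-file commands:

```yaml
# Illustrative only: assumes the snapshot has been exported to a local
# file at this path. Keying the cache on the version tag means a bumped
# pin automatically fetches the new artifact.
- name: Cache snapshot artifact
  uses: actions/cache@v4
  with:
    path: ~/.basecut/snapshots/dev-snapshot-v3.tar
    key: basecut-dev-snapshot-v3
```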

How do I keep real customer PII out of CI logs?
Anonymize at extraction time, not after restore. If a pg_dump or seed script ever runs against production data in CI, real emails and names will appear in logs, artifacts, and error output. Pre-built snapshots with deterministic masking applied during creation guarantee that no raw PII ever travels through the CI pipeline. Any snapshot a developer can restore is already safe.

Does this pattern work with GitLab CI or CircleCI?
Yes. The pattern — service container for PostgreSQL, CLI install step, snapshot restore before tests — translates directly to GitLab CI, CircleCI, Buildkite, and Jenkins. Only the YAML syntax changes. The post includes a GitLab CI example; the same three steps apply to any runner that can start a PostgreSQL service and run a shell command.

How do I keep the test snapshot from going stale?
Refresh it on a schedule with a separate workflow — typically a weekly cron job that runs basecut snapshot create against production or a read replica. The :latest tag always points at the most recent run, so existing test workflows pick up fresh data automatically. If your test suite is sensitive to data changes, pin a specific version tag (e.g. dev-snapshot:v12) and bump it manually when you want to adopt new data shapes.


Get started with Basecut. The CLI is free for small teams — create your first snapshot and restore it in CI in under 10 minutes. Try Basecut free →
