Setting up a PostgreSQL staging database that actually reflects production is one of those tasks that sounds simple and turns into a day of cleanup. The obvious approach — pg_dump production to staging — breaks down immediately: the dump is too large, it contains real customer PII, and it produces a shared environment nobody wants to touch.
This guide walks through a better pattern: extract a connected, anonymized subset of production and restore it as your staging database. You get a realistic, production-like environment without the size, privacy risk, or shared-state headaches of a full copy.
What you'll learn:
- Why pg_dump fails as a staging database solution
- How FK-aware subsetting produces a referentially complete extract
- How to anonymize PII before it ever leaves production
- How to automate staging refreshes so it never goes stale
How do you create a staging database from production PostgreSQL safely?
Create a staging database by extracting a referentially complete subset of production, anonymizing PII during extraction, and restoring that snapshot into staging. This keeps the environment realistic without copying the entire database, leaking customer data, or forcing developers to maintain manual cleanup scripts after every refresh.
The shortest version of the workflow is:
- Choose a representative slice of production
- Follow foreign keys so the extract stays complete
- Mask sensitive fields before data leaves production
- Restore the snapshot into staging
- Refresh it on a schedule so it stays current
Why pg_dump Fails for Staging Environments
The first instinct is usually pg_dump:
```shell
pg_dump "$PRODUCTION_URL" | psql "$STAGING_URL"
```
This works in the sense that it runs. The issues come later:
- Size. Production databases grow. A dump that takes 5 minutes today takes 30 minutes in a year. Restores slow down CI, slow down onboarding, and make "refresh staging" a thing people avoid.
- PII. A full dump copies everything — real emails, real names, real addresses, real payment details. That data is now in your staging environment, which means it is probably on developer laptops, in logs, and reachable by anyone with staging credentials. Thoughtworks' Technology Radar on production data in test calls out exactly these privacy and security tradeoffs.
- Shared state. One staging environment with real-ish data usually becomes a place where everyone makes changes at once. It gets out of sync with production constantly. Nobody owns keeping it clean.
We wrote a full comparison of pg_dump vs Basecut if you want the detailed breakdown. The short version: pg_dump is the right tool for backups and disaster recovery. For dev and staging environments, you usually want something smaller and safer.
The Right Pattern: Subset, Anonymize, Restore
Instead of copying the whole database, the better approach is:
- Start from one or more root tables (usually users, accounts, or whatever your primary entities are).
- Filter to a representative slice — recent signups, a specific account tier, a date range.
- Follow foreign keys to pull in all the related data those rows depend on.
- Anonymize sensitive fields during extraction, before anything leaves the production environment.
- Save the result as a named snapshot.
- Restore that snapshot to staging.
What you end up with is a self-contained, realistic subset of production — with real relationships, real data shapes, real edge cases — but no raw PII and a manageable size.
This is what the industry calls database subsetting, and it is the same pattern that powers local dev environments and CI test data pipelines. Staging is just another restore target.
Step 1: Define What "Representative" Means for Your Database
Before you can extract anything, you need to decide what data to include.
For most applications this means picking a root table and a sensible filter:
- Recent users (past 30–90 days)
- A specific cohort (paid accounts, a particular plan tier)
- Accounts associated with specific test scenarios
You do not need the whole database. You need enough data that staging behaves like production — correct relationships, realistic distributions, enough rows to surface data-dependent bugs.
A starting point for a typical SaaS PostgreSQL database:
- 500–2,000 user accounts
- All their related records (orders, subscriptions, events, etc.)
- Enough to make the app behave realistically, small enough to restore in under a minute
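To make the root-table filter concrete, here is a minimal sketch. It uses an in-memory SQLite database as a stand-in for PostgreSQL, and the table, columns, and values are hypothetical — but the WHERE clause is the same shape you would use against production:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, plan TEXT, created_at TEXT);
INSERT INTO users VALUES
  (1, 'pro',  '2025-11-01'),
  (2, 'free', '2025-11-05'),
  (3, 'pro',  '2024-01-15');
""")

# Root-table filter: recent signups on a paid plan, capped at a sane row count.
seed_ids = [row[0] for row in conn.execute(
    "SELECT id FROM users "
    "WHERE created_at > :since AND plan != :plan "
    "ORDER BY created_at DESC LIMIT 2000",
    {"since": "2025-10-01", "plan": "free"},
)]
print(seed_ids)  # only user 1 is both recent and paid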
Step 2: Extract with Foreign Key Awareness
The mistake most DIY approaches make is sampling rows naively — take 500 rows from users, take 500 rows from orders, call it done. Then you restore it and get:
- orders pointing to users who are not in the snapshot
- line items pointing to products that were not included
- foreign key violations on restore
- an app that half-works or fails in strange ways
A useful staging database has to be referentially complete. Every foreign key must resolve. Every parent row must exist before its children.
This is why FK-aware extraction matters. The extraction process traverses the schema — if you include an order, you also need the user who placed it, the products on the order, the shipping address, and whatever else your schema requires. The result is a subgraph that can be restored into an empty database without broken references.
See the official PostgreSQL docs on foreign key constraints for the underlying mechanics if you want to understand what the extractor is navigating.
Step 3: Anonymize PII Before It Leaves Production
The natural next question is: what do you do about PII?
The common answer is a cleanup script that runs after restore:
```sql
UPDATE users SET
  email = 'user' || id || '@example.com',
  first_name = 'Test',
  last_name = 'User';
```
This has two problems. First, someone has to remember to run it. Second, when a new PII column gets added to the schema, someone has to update the script — and they usually forget.
More importantly, it is already too late by the time this runs. The data traveled through your restore pipeline with real values in it.
The better approach is to anonymize at extraction time, before the data leaves production. The snapshot that gets created already has fake emails, fake names, and fake addresses. Nothing sensitive ever travels to staging.
For columns that need specific handling — free-text fields, external IDs, JSONB blobs — you add explicit rules. Everything else gets auto-detected by column name and type.
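A simplified sketch of that rule model — illustrative only: the column names, the name-based auto-detection heuristic, and the masking helpers are assumptions, not Basecut's actual code:

```python
import hashlib

# Explicit rules win; anything that looks like PII by column name gets a default mask.
RULES = {
    "notes": lambda v, pk: None,  # null out free text entirely
    "stripe_customer_id": lambda v, pk: hashlib.sha256(v.encode()).hexdigest()[:12],
    "email": lambda v, pk: f"user{pk}@example.com",
}
PII_HINTS = ("email", "name", "phone", "address", "ip")

def anonymize_row(row, pk):
    out = {}
    for col, val in row.items():
        if col in RULES:
            out[col] = RULES[col](val, pk)          # explicit rule
        elif any(h in col for h in PII_HINTS):
            out[col] = f"redacted-{pk}"             # auto-detected by column name
        else:
            out[col] = val                          # non-sensitive, pass through
    return out

row = {"email": "jane@real.com", "first_name": "Jane",
       "stripe_customer_id": "cus_123", "plan": "pro", "notes": "called support"}
print(anonymize_row(row, pk=42))
```

The key property is that this runs during extraction, so the snapshot on disk never contains `jane@real.com` in the first place.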
We go deep on this in How to Anonymize PII in PostgreSQL for Development.
Step 4: Full Example with Basecut
Here is a complete staging database setup using Basecut. Create a basecut.yml at the root of your repo:
```yaml
version: '1'
name: 'staging'

from:
  - table: users
    where: 'created_at > :since AND plan != :plan'
    params:
      since: '2025-10-01'
      plan: 'free'

limits:
  rows:
    per_table: 2000
    total: 100000

anonymize:
  mode: auto
  rules:
    users:
      notes: null
      stripe_customer_id: hash
    audit_logs:
      ip_address: fake_ip
```
Then create the snapshot from production (or a read replica):
```shell
basecut snapshot create \
  --config basecut.yml \
  --source "$PRODUCTION_DATABASE_URL"
```
And restore it to staging:
```shell
basecut snapshot restore staging:latest \
  --target "$STAGING_DATABASE_URL"
```
That is the entire workflow. The snapshot is named and versioned — staging:latest always points to the most recent one, and you can also restore a specific tagged version like staging:v2 if you need to pin a particular snapshot.
Step 5: Keep Your Staging Database Fresh
A staging database that is three months old is almost as bad as one that does not exist. Edge cases you care about — new billing flows, new user types, new schema columns — are not in there yet.
The simplest way to keep staging fresh is to trigger a snapshot refresh on a schedule. Wire it into an existing cron job, a scheduled GitHub Actions workflow, or your platform's scheduler. For example, to refresh every Monday at 2am:
```shell
# crontab entry (or a scheduled CI job)
0 2 * * 1 basecut snapshot create --config basecut.yml --source "$PRODUCTION_DATABASE_URL" && \
  basecut snapshot restore staging:latest --target "$STAGING_DATABASE_URL"
```
Or trigger it as part of your CI pipeline whenever you cut a release branch:
```yaml
- name: Refresh staging database
  env:
    BASECUT_API_KEY: ${{ secrets.BASECUT_API_KEY }}
  run: |
    basecut snapshot create --config basecut.yml --source "$PRODUCTION_DATABASE_URL"
    basecut snapshot restore staging:latest --target "$STAGING_DATABASE_URL"
```
Weekly refreshes are usually enough. For teams doing frequent releases, a refresh on every release branch works well too.
Shared Staging vs Per-Developer Environments
Staging environments fall into two patterns, and this approach works for both.
Shared staging (one environment, multiple developers): the workflow above applies directly. Refresh it on a schedule. Everyone gets the same anonymized, realistic baseline. When someone corrupts state for testing, you restore again.
Per-developer environments (each developer has their own): this is actually easier. Each developer restores the same snapshot to their own local PostgreSQL instance. They can make whatever changes they need without affecting anyone else. When they want a fresh start, one command resets it. We cover this more in the local development guide.
The main advantage of per-developer environments is independence — nobody is waiting for staging to settle down before they can test. The tradeoff is that each developer needs somewhere to run their own PostgreSQL instance, which is easy locally but less obvious for teams that rely entirely on remote environments.
Staging Database Setup Checklist
Before you call it done, verify:
- [ ] Snapshot restores cleanly (no FK violations, no missing extension errors)
- [ ] Application runs against it without crashing
- [ ] No real PII visible in common queries (SELECT email FROM users LIMIT 10)
- [ ] Referential integrity intact (spot-check a few joined queries)
- [ ] Row counts are reasonable (not 12 rows, not 5 million rows)
- [ ] Refresh process is automated and nobody is doing it manually
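For the referential-integrity spot-check, the classic pattern is a LEFT JOIN that hunts for orphaned child rows. A minimal sketch, again with in-memory SQLite standing in for PostgreSQL (the SQL itself works unchanged there):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER);
INSERT INTO users  VALUES (1);
INSERT INTO orders VALUES (100, 1), (101, 999);  -- user 999 missing: an orphan
""")

# Run one of these per foreign key after a restore;
# a healthy snapshot returns zero rows.
orphans = conn.execute("""
    SELECT o.id FROM orders o
    LEFT JOIN users u ON u.id = o.user_id
    WHERE u.id IS NULL
""").fetchall()
print(orphans)  # [(101,)]
```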
When a Full pg_dump Is Still the Right Answer
To be fair: there are cases where a full dump is the right approach.
- You need to test schema migrations against production-exact data before running them.
- You are debugging a specific production incident and need the exact rows to reproduce it.
- Your compliance requirements demand a production-identical environment for specific tests.
In those cases, the official pg_dump reference is the right place to understand what a logical dump includes. The anonymization and subsetting workflow described here is for the other 95% of staging use cases — where you want something fast, safe, and realistic, not a forensic copy of production.
Final Thought
Staging databases are usually either out of date, full of PII, or both. The reason is that setting them up properly was never made easy enough to do right.
A scripted subset + anonymize + restore workflow fixes all of this at once. The result is a staging environment that is fast to restore, safe to share, realistic enough to catch real bugs, and easy to keep fresh.
Originally published at basecut.dev. If you found this useful, Basecut is the tool described here — free for small teams.