<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jake Lazarus</title>
    <description>The latest articles on DEV Community by Jake Lazarus (@jakelaz).</description>
    <link>https://dev.to/jakelaz</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3821271%2F86e1f840-dd57-4a1b-8df6-6274201606fb.png</url>
      <title>DEV Community: Jake Lazarus</title>
      <link>https://dev.to/jakelaz</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jakelaz"/>
    <language>en</language>
    <item>
      <title>How to Set Up a Staging Database from Production PostgreSQL (2026 Guide)</title>
      <dc:creator>Jake Lazarus</dc:creator>
      <pubDate>Tue, 07 Apr 2026 14:30:00 +0000</pubDate>
      <link>https://dev.to/jakelaz/how-to-set-up-a-staging-database-from-production-postgresql-2026-guide-1ckg</link>
      <guid>https://dev.to/jakelaz/how-to-set-up-a-staging-database-from-production-postgresql-2026-guide-1ckg</guid>
      <description>&lt;p&gt;Setting up a PostgreSQL staging database that actually reflects production is one of those tasks that sounds simple and turns into a day of cleanup. The obvious approach — &lt;code&gt;pg_dump&lt;/code&gt; production to staging — breaks down immediately: the dump is too large, it contains real customer PII, and it produces a shared environment nobody wants to touch.&lt;/p&gt;

&lt;p&gt;This guide walks through a better pattern: extract a connected, anonymized subset of production and restore it as your staging database. You get a realistic, production-like environment without the size, privacy risk, or shared-state headaches of a full copy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll learn:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why &lt;code&gt;pg_dump&lt;/code&gt; fails as a staging database solution&lt;/li&gt;
&lt;li&gt;How FK-aware subsetting produces a referentially complete extract&lt;/li&gt;
&lt;li&gt;How to anonymize PII before it ever leaves production&lt;/li&gt;
&lt;li&gt;How to automate staging refreshes so it never goes stale&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How do you create a staging database from production PostgreSQL safely?
&lt;/h2&gt;

&lt;p&gt;Create a staging database by extracting a referentially complete subset of production, anonymizing PII during extraction, and restoring that snapshot into staging. This keeps the environment realistic without copying the entire database, leaking customer data, or forcing developers to maintain manual cleanup scripts after every refresh.&lt;/p&gt;

&lt;p&gt;The shortest version of the workflow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Choose a representative slice of production&lt;/li&gt;
&lt;li&gt;Follow foreign keys so the extract stays complete&lt;/li&gt;
&lt;li&gt;Mask sensitive fields before data leaves production&lt;/li&gt;
&lt;li&gt;Restore the snapshot into staging&lt;/li&gt;
&lt;li&gt;Refresh it on a schedule so it stays current&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Why pg_dump Fails for Staging Environments
&lt;/h2&gt;

&lt;p&gt;The first instinct is usually &lt;code&gt;pg_dump&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pg_dump &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PRODUCTION_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | psql &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$STAGING_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works in the sense that it runs. The issues come later:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Size.&lt;/strong&gt; Production databases grow. A dump that takes 5 minutes today takes 30 minutes in a year. Restores slow down CI, slow down onboarding, and make "refresh staging" a thing people avoid.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PII.&lt;/strong&gt; A full dump copies everything — real emails, real names, real addresses, real payment details. That data is now in your staging environment, which means it is probably on developer laptops, in logs, and reachable by anyone with staging credentials. &lt;a href="https://www.thoughtworks.com/en-us/radar/techniques/production-data-in-test-environments" rel="noopener noreferrer"&gt;Thoughtworks' Technology Radar on production data in test&lt;/a&gt; calls out exactly these privacy and security tradeoffs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared state.&lt;/strong&gt; One staging environment with real-ish data usually becomes a place where everyone makes changes at once. It gets out of sync with production constantly. Nobody owns keeping it clean.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We wrote a &lt;a href="https://basecut.dev/vs/pg-dump" rel="noopener noreferrer"&gt;full comparison of pg_dump vs Basecut&lt;/a&gt; if you want the detailed breakdown. The short version: &lt;code&gt;pg_dump&lt;/code&gt; is the right tool for backups and disaster recovery. For dev and staging environments, you usually want something smaller and safer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Right Pattern: Subset, Anonymize, Restore
&lt;/h2&gt;

&lt;p&gt;Instead of copying the whole database, the better approach is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start from one or more root tables (usually &lt;code&gt;users&lt;/code&gt;, &lt;code&gt;accounts&lt;/code&gt;, or whatever your primary entities are).&lt;/li&gt;
&lt;li&gt;Filter to a representative slice — recent signups, a specific account tier, a date range.&lt;/li&gt;
&lt;li&gt;Follow foreign keys to pull in all the related data those rows depend on.&lt;/li&gt;
&lt;li&gt;Anonymize sensitive fields during extraction, before anything leaves the production environment.&lt;/li&gt;
&lt;li&gt;Save the result as a named snapshot.&lt;/li&gt;
&lt;li&gt;Restore that snapshot to staging.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What you end up with is a self-contained, realistic subset of production — with real relationships, real data shapes, real edge cases — but no raw PII and a manageable size.&lt;/p&gt;

&lt;p&gt;This is what the industry calls &lt;strong&gt;database subsetting&lt;/strong&gt;, and it is the same pattern that powers &lt;a href="https://basecut.dev/use-cases/local-development" rel="noopener noreferrer"&gt;local dev environments&lt;/a&gt; and &lt;a href="https://basecut.dev/use-cases/ci-cd-test-data" rel="noopener noreferrer"&gt;CI test data pipelines&lt;/a&gt;. Staging is just another restore target.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Define What "Representative" Means for Your Database
&lt;/h2&gt;

&lt;p&gt;Before you can extract anything, you need to decide what data to include.&lt;/p&gt;

&lt;p&gt;For most applications this means picking a root table and a sensible filter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recent users (past 30–90 days)&lt;/li&gt;
&lt;li&gt;A specific cohort (paid accounts, a particular plan tier)&lt;/li&gt;
&lt;li&gt;Accounts associated with specific test scenarios&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You do not need the whole database. You need enough data that staging behaves like production — correct relationships, realistic distributions, enough rows to surface data-dependent bugs.&lt;/p&gt;

&lt;p&gt;A starting point for a typical SaaS PostgreSQL database:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;500–2,000 user accounts&lt;/li&gt;
&lt;li&gt;All their related records (orders, subscriptions, events, etc.)&lt;/li&gt;
&lt;li&gt;Enough to make the app behave realistically, small enough to restore in under a minute&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 2: Extract with Foreign Key Awareness
&lt;/h2&gt;

&lt;p&gt;The mistake most DIY approaches make is sampling rows naively — take 500 rows from &lt;code&gt;users&lt;/code&gt;, take 500 rows from &lt;code&gt;orders&lt;/code&gt;, call it done. Then you restore it and get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;orders pointing to users who are not in the snapshot&lt;/li&gt;
&lt;li&gt;line items pointing to products that were not included&lt;/li&gt;
&lt;li&gt;foreign key violations on restore&lt;/li&gt;
&lt;li&gt;an app that half-works or fails in strange ways&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A useful staging database has to be &lt;strong&gt;referentially complete&lt;/strong&gt;. Every foreign key must resolve. Every parent row must exist before its children.&lt;/p&gt;

&lt;p&gt;This is why FK-aware extraction matters. The extraction process traverses the schema — if you include an order, you also need the user who placed it, the products on the order, the shipping address, and whatever else your schema requires. The result is a subgraph that can be restored into an empty database without broken references.&lt;/p&gt;

&lt;p&gt;See the &lt;a href="https://www.postgresql.org/docs/current/ddl-constraints.html#DDL-CONSTRAINTS-FK" rel="noopener noreferrer"&gt;official PostgreSQL docs on foreign key constraints&lt;/a&gt; for the underlying mechanics if you want to understand what the extractor is navigating.&lt;/p&gt;
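&lt;p&gt;The traversal itself is just a graph walk over the foreign key catalog. Here is a minimal sketch in Python — purely illustrative, not Basecut's implementation, and the table names and FK map are hypothetical — that starts from seed rows and keeps following parent references until the selected set is closed under every foreign key:&lt;/p&gt;

```python
from collections import deque

# Hypothetical FK map: table -> [(fk_column, parent_table)]; parent key is "id".
FKS = {
    "orders": [("user_id", "users")],
    "line_items": [("order_id", "orders"), ("product_id", "products")],
}

def fk_closure(db, seed, fks):
    """db: {table: {id: row}} standing in for production tables.
    seed: iterable of (table, id) root rows.
    Returns every (table, id) required so all foreign keys resolve."""
    selected, queue = set(), deque(seed)
    while queue:
        table, pk = queue.popleft()
        if (table, pk) in selected:
            continue
        selected.add((table, pk))
        # Follow each FK on this row up to its parent row.
        for col, parent in fks.get(table, []):
            ref = db[table][pk].get(col)
            if ref is not None:
                queue.append((parent, ref))
    return selected

db = {
    "users": {1: {"email": "a@example.com"}},
    "products": {7: {"sku": "SKU-7"}},
    "orders": {10: {"user_id": 1}},
    "line_items": {100: {"order_id": 10, "product_id": 7}},
}
# Selecting one line item pulls in its order, the order's user, and the product.
print(sorted(fk_closure(db, [("line_items", 100)], FKS)))
```

&lt;p&gt;Note this only walks child-to-parent references, which is what referential completeness requires. A real subsetter also decides which child rows of each selected parent to include, but that is a sizing policy, not a correctness requirement.&lt;/p&gt;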

&lt;h2&gt;
  
  
  Step 3: Anonymize PII Before It Leaves Production
&lt;/h2&gt;

&lt;p&gt;The natural next question is: what do you do about PII?&lt;/p&gt;

&lt;p&gt;The common answer is a cleanup script that runs after restore:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt;
  &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'user'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;first_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Test'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;last_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'User'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This has two problems. First, someone has to remember to run it. Second, when a new PII column gets added to the schema, someone has to update the script — and they usually forget.&lt;/p&gt;

&lt;p&gt;More importantly, it is already too late by the time this runs. The data traveled through your restore pipeline with real values in it.&lt;/p&gt;

&lt;p&gt;The better approach is to anonymize &lt;strong&gt;at extraction time&lt;/strong&gt;, before the data leaves production. The snapshot that gets created already has fake emails, fake names, and fake addresses. Nothing sensitive ever travels to staging.&lt;/p&gt;

&lt;p&gt;For columns that need specific handling — free-text fields, external IDs, JSONB blobs — you add explicit rules. Everything else gets auto-detected by column name and type.&lt;/p&gt;
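&lt;p&gt;Name-based detection plus deterministic masking can be sketched in a few lines. This is an illustrative toy, not Basecut's detector, and the column list and replacement formats are assumptions: columns are matched by name, and the replacement is derived from a hash of the original so the same input always maps to the same fake value.&lt;/p&gt;

```python
import hashlib

# Hypothetical name-based PII detection; real tools also consider column types.
PII_COLUMNS = {"email", "first_name", "last_name", "phone", "ip_address"}

def mask(column, value):
    """Deterministically replace PII values; pass everything else through."""
    if column not in PII_COLUMNS or value is None:
        return value
    digest = hashlib.sha256(str(value).encode()).hexdigest()[:10]
    if column == "email":
        return "user_" + digest + "@example.com"
    return "masked_" + digest

row = {"id": 42, "email": "jane@realmail.com", "plan": "pro"}
masked = {col: mask(col, val) for col, val in row.items()}
# id and plan pass through unchanged; email becomes a stable fake address.
```

&lt;p&gt;Determinism is the important property: if the same email appears in two tables, both copies mask to the same fake value, so joins and lookups on that field still behave in staging.&lt;/p&gt;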

&lt;p&gt;We go deep on this in &lt;a href="https://basecut.dev/blog/how-to-anonymize-pii-in-postgresql-for-development" rel="noopener noreferrer"&gt;How to Anonymize PII in PostgreSQL for Development&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Full Example with Basecut
&lt;/h2&gt;

&lt;p&gt;Here is a complete staging database setup using &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Basecut&lt;/a&gt;. Create a &lt;code&gt;basecut.yml&lt;/code&gt; at the root of your repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;staging'&lt;/span&gt;

&lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:since&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;AND&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;plan&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;!=&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:plan'&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;since&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2025-10-01'&lt;/span&gt;
      &lt;span class="na"&gt;plan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;free'&lt;/span&gt;

&lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;per_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2000&lt;/span&gt;
    &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100000&lt;/span&gt;

&lt;span class="na"&gt;anonymize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auto&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;users&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;notes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
      &lt;span class="na"&gt;stripe_customer_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hash&lt;/span&gt;
    &lt;span class="na"&gt;audit_logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;ip_address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fake_ip&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then create the snapshot from production (or a read replica):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;basecut snapshot create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; basecut.yml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PRODUCTION_DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And restore it to staging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;basecut snapshot restore staging:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$STAGING_DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the entire workflow. The snapshot is named and versioned — &lt;code&gt;staging:latest&lt;/code&gt; always points to the most recent one, and you can also restore a specific tagged version like &lt;code&gt;staging:v2&lt;/code&gt; if you need to pin a particular snapshot.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Keep Your Staging Database Fresh
&lt;/h2&gt;

&lt;p&gt;A staging database that is three months old is almost as bad as one that does not exist. Edge cases you care about — new billing flows, new user types, new schema columns — are not in there yet.&lt;/p&gt;

&lt;p&gt;The simplest way to keep staging fresh is to trigger a snapshot refresh on a schedule. Wire it into an existing cron job, a scheduled GitHub Actions workflow, or your platform's scheduler. For example, to refresh every Monday at 2am:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# crontab entry (or a scheduled CI job)&lt;/span&gt;
0 2 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; 1 basecut snapshot create &lt;span class="nt"&gt;--config&lt;/span&gt; basecut.yml &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PRODUCTION_DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
          basecut snapshot restore staging:latest &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$STAGING_DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or trigger it as part of your CI pipeline whenever you cut a release branch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Refresh staging database&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;BASECUT_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.BASECUT_API_KEY }}&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;basecut snapshot create --config basecut.yml --source "$PRODUCTION_DATABASE_URL"&lt;/span&gt;
    &lt;span class="s"&gt;basecut snapshot restore staging:latest --target "$STAGING_DATABASE_URL"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Weekly refreshes are usually enough. For teams doing frequent releases, a refresh on every release branch works well too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Shared Staging vs Per-Developer Environments
&lt;/h2&gt;

&lt;p&gt;Staging environments fall into two patterns, and this approach works for both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shared staging&lt;/strong&gt; (one environment, multiple developers): the workflow above applies directly. Refresh it on a schedule. Everyone gets the same anonymized, realistic baseline. When someone corrupts state for testing, you restore again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Per-developer environments&lt;/strong&gt; (each developer has their own): this is actually easier. Each developer restores the same snapshot to their own local PostgreSQL instance. They can make whatever changes they need without affecting anyone else. When they want a fresh start, one command resets it. We cover this more in the &lt;a href="https://basecut.dev/use-cases/local-development" rel="noopener noreferrer"&gt;local development guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The main advantage of per-developer environments is independence — nobody is waiting for staging to settle down before they can test. The tradeoff is that each developer needs somewhere to run their own PostgreSQL instance, which is easy locally but less obvious for teams that rely entirely on remote environments.&lt;/p&gt;

&lt;h2&gt;
  
  
  Staging Database Setup Checklist
&lt;/h2&gt;

&lt;p&gt;Before you call it done, verify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Snapshot restores cleanly (no FK violations, no missing extension errors)&lt;/li&gt;
&lt;li&gt;[ ] Application runs against it without crashing&lt;/li&gt;
&lt;li&gt;[ ] No real PII visible in common queries (&lt;code&gt;SELECT email FROM users LIMIT 10&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;[ ] Referential integrity intact (spot-check a few joined queries)&lt;/li&gt;
&lt;li&gt;[ ] Row counts are reasonable (not 12 rows, not 5 million rows)&lt;/li&gt;
&lt;li&gt;[ ] Refresh process is automated and nobody is doing it manually&lt;/li&gt;
&lt;/ul&gt;
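&lt;p&gt;The referential-integrity item is easy to automate. A hedged sketch, where the FK map is a hypothetical stand-in for what you would read from &lt;code&gt;information_schema&lt;/code&gt;: scan child rows and report any foreign key that points at a missing parent.&lt;/p&gt;

```python
# Hypothetical FK map: table -> [(fk_column, parent_table)]; parent key is "id".
FKS = {"orders": [("user_id", "users")]}

def find_orphans(db, fks):
    """db: {table: {id: row}}. Returns (table, id, column) triples whose
    foreign key does not resolve to an existing parent row."""
    bad = []
    for table, rows in db.items():
        for pk, row in rows.items():
            for col, parent in fks.get(table, []):
                ref = row.get(col)
                if ref is not None and ref not in db.get(parent, {}):
                    bad.append((table, pk, col))
    return bad

restored = {
    "users": {1: {}},
    "orders": {10: {"user_id": 1}, 11: {"user_id": 99}},  # user 99 is missing
}
print(find_orphans(restored, FKS))  # flags order 11
```

&lt;p&gt;An empty result means every foreign key in the restored snapshot resolves, which is exactly what the checklist is asking you to verify.&lt;/p&gt;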

&lt;h2&gt;
  
  
  When a Full pg_dump Is Still the Right Answer
&lt;/h2&gt;

&lt;p&gt;To be fair: there are cases where a full dump is the right approach.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need to test schema migrations against production-exact data before running them.&lt;/li&gt;
&lt;li&gt;You are debugging a specific production incident and need the exact rows to reproduce it.&lt;/li&gt;
&lt;li&gt;Your compliance requirements demand a production-identical environment for specific tests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In those cases, the &lt;a href="https://www.postgresql.org/docs/current/app-pgdump.html" rel="noopener noreferrer"&gt;official &lt;code&gt;pg_dump&lt;/code&gt; reference&lt;/a&gt; is the right place to understand what a logical dump includes. The anonymization and subsetting workflow described here is for the other 95% of staging use cases — where you want something fast, safe, and realistic, not a forensic copy of production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;Staging databases are usually either out of date, full of PII, or both. The reason is that setting them up properly was never made easy enough to do right.&lt;/p&gt;

&lt;p&gt;A scripted subset + anonymize + restore workflow fixes all of this at once. The result is a staging environment that is fast to restore, safe to share, realistic enough to catch real bugs, and easy to keep fresh.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://basecut.dev/blog/how-to-set-up-a-staging-database-from-production-postgresql" rel="noopener noreferrer"&gt;basecut.dev&lt;/a&gt;. If you found this useful, &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Basecut&lt;/a&gt; is the tool described here — free for small teams.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>devops</category>
      <category>database</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Replace Seed Scripts with Production Snapshots</title>
      <dc:creator>Jake Lazarus</dc:creator>
      <pubDate>Thu, 26 Mar 2026 14:35:00 +0000</pubDate>
      <link>https://dev.to/jakelaz/how-to-replace-seed-scripts-with-production-snapshots-40ng</link>
      <guid>https://dev.to/jakelaz/how-to-replace-seed-scripts-with-production-snapshots-40ng</guid>
      <description>&lt;p&gt;Seed scripts are technical debt that nobody tracks.&lt;/p&gt;

&lt;p&gt;They start as a convenience — a few &lt;code&gt;INSERT&lt;/code&gt; statements so the app boots locally — and they end up as a 400-line file that touches 30 tables, breaks on every third migration, and produces data that looks nothing like production. Everyone knows the seed script is bad. Nobody wants to fix it, because fixing it means rewriting it, and it will just rot again.&lt;/p&gt;

&lt;p&gt;The usual response is to invest in a better seed script — more tables, better relationships, more realistic values. But the underlying issue is not the quality of the script. It is that hand-crafting test data stops scaling as schema complexity grows.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; Define what data you need in a YAML config, extract a subset from production with PII anonymized, and restore it anywhere. No INSERT statements to maintain — the snapshot stays current automatically.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why database seed scripts break as projects grow
&lt;/h2&gt;

&lt;p&gt;Seed scripts have real advantages early on: they are version-controlled, deterministic, and easy to understand. But those advantages erode as the schema grows, and three problems start compounding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Schema drift
&lt;/h3&gt;

&lt;p&gt;Every migration is a chance for the seed script to break. A new &lt;code&gt;NOT NULL&lt;/code&gt; column, a renamed FK, a dropped table — each one needs a corresponding update to the seed file. Those updates happen late or not at all. The person writing the migration is thinking about the migration, not about whether &lt;code&gt;seed.sql&lt;/code&gt; still runs.&lt;/p&gt;

&lt;p&gt;The script does not fail loudly. It just produces increasingly stale data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Manual referential integrity
&lt;/h3&gt;

&lt;p&gt;In production, an order belongs to a user, references line items, connects to shipments and payments. In a seed script, you maintain all of those relationships by hand. Every ID, every FK, every cross-table reference. Miss one and you get constraint violations or, worse, data that loads fine but makes the app behave in ways it never would in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Flat data
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;Test User 1&lt;/code&gt; with &lt;code&gt;test@example.com&lt;/code&gt; and two orders is structurally valid. It is not useful. Real users have Unicode in their names. Real accounts have nullable fields that are actually null. Real customers have 47 orders accumulated over two years, with edge cases nobody thought to fabricate.&lt;/p&gt;

&lt;p&gt;The bugs that reach production are usually triggered by data shapes that did not exist in the seed script, because nobody anticipated them. We covered this in more detail in &lt;a href="https://basecut.dev/blog/why-fake-postgresql-test-data-misses-bugs" rel="noopener noreferrer"&gt;Why Fake PostgreSQL Test Data Misses Real Bugs&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production database snapshots: the seed script alternative
&lt;/h2&gt;

&lt;p&gt;Instead of fabricating data, extract it.&lt;/p&gt;

&lt;p&gt;Not with &lt;code&gt;pg_dump&lt;/code&gt; — that copies everything, including all the PII you should not have in dev and all the volume you do not need. The &lt;a href="https://www.thoughtworks.com/en-us/radar/techniques/production-data-in-test-environments" rel="noopener noreferrer"&gt;Thoughtworks Technology Radar&lt;/a&gt; explicitly recommends against using raw production data in test environments for exactly this reason — the privacy and security risks outweigh the convenience. Instead, extract a &lt;strong&gt;subset&lt;/strong&gt;: a small, connected slice of production data with sensitive fields anonymized during extraction.&lt;/p&gt;

&lt;p&gt;What you get is a snapshot that reflects the real schema, has valid relationships because they were followed rather than hand-coded, and contains real data shapes because they came from production. It also requires no maintenance, because the next snapshot picks up schema changes automatically.&lt;/p&gt;

&lt;p&gt;This is the workflow &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Basecut&lt;/a&gt; was built for. Define what to extract, run one command, restore to any database.&lt;/p&gt;

&lt;h2&gt;
  
  
  How database subsetting works in practice
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Define what to extract
&lt;/h3&gt;

&lt;p&gt;Instead of INSERT statements, you describe the shape of the data you want:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dev-data'&lt;/span&gt;

&lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:since'&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;since&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2026-01-01'&lt;/span&gt;

&lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;per_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
    &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50000&lt;/span&gt;

&lt;span class="na"&gt;anonymize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auto&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start from recent users, follow FKs to collect related data, cap the size, auto-detect and anonymize PII. You can add explicit anonymization rules when you need them — we cover that in &lt;a href="https://basecut.dev/blog/how-to-anonymize-pii-in-postgresql-for-development" rel="noopener noreferrer"&gt;How to Anonymize PII in PostgreSQL for Development&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The important difference from a seed script: this config describes &lt;em&gt;what to extract&lt;/em&gt;, not &lt;em&gt;what to insert&lt;/em&gt;. New columns, new tables, and new relationships get picked up on the next snapshot without touching the config.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Create a snapshot
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;basecut snapshot create &lt;span class="nt"&gt;--config&lt;/span&gt; basecut.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Basecut connects to your database (or a read replica), traverses relationships, anonymizes PII inline, and writes the result. No real PII ever leaves production.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Restore locally
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;basecut snapshot restore dev-data:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LOCAL_DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the local dev setup. A new developer joining the team runs two commands and has a working database with realistic, already-anonymized test data, and no seed script to debug.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Basecut handles all of this in one CLI command.&lt;/strong&gt; &lt;a href="https://docs.basecut.dev/getting-started/quick-start" rel="noopener noreferrer"&gt;See the quickstart →&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  4. Share it
&lt;/h3&gt;

&lt;p&gt;Once a snapshot exists, anyone on the team can restore it. Everyone works against the same fixture data, which means bugs are reproducible across machines and "works on my machine" stops being about data differences.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Seed script&lt;/th&gt;
&lt;th&gt;Production snapshot&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Schema changes&lt;/td&gt;
&lt;td&gt;Breaks until someone updates it&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;FK integrity&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Followed from the database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data realism&lt;/td&gt;
&lt;td&gt;Fabricated&lt;/td&gt;
&lt;td&gt;Real, anonymized&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PII risk&lt;/td&gt;
&lt;td&gt;None (but no realism either)&lt;/td&gt;
&lt;td&gt;Handled at extraction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maintenance&lt;/td&gt;
&lt;td&gt;Grows with the schema&lt;/td&gt;
&lt;td&gt;Near zero&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Onboarding&lt;/td&gt;
&lt;td&gt;Run, debug, ask for help&lt;/td&gt;
&lt;td&gt;Restore, start working&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Edge cases&lt;/td&gt;
&lt;td&gt;Only what someone added&lt;/td&gt;
&lt;td&gt;Whatever production has&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;More detail in our &lt;a href="https://basecut.dev/vs/seed-scripts" rel="noopener noreferrer"&gt;seed scripts vs Basecut comparison&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using production snapshots in CI/CD pipelines
&lt;/h2&gt;

&lt;p&gt;The same snapshot works in CI. Replace the seed script step with a restore:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Test&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:15&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test_db&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;5432:5432&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Basecut CLI&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;curl -fsSL https://basecut.dev/install.sh | sh&lt;/span&gt;
          &lt;span class="s"&gt;echo "$HOME/.local/bin" &amp;gt;&amp;gt; $GITHUB_PATH&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Restore snapshot&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;BASECUT_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.BASECUT_API_KEY }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;basecut snapshot restore dev-data:latest \&lt;/span&gt;
            &lt;span class="s"&gt;--target "postgresql://postgres:postgres@localhost:5432/test_db"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run tests&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;New tables and columns from migrations show up in the next snapshot. No CI config changes needed.&lt;/p&gt;

&lt;p&gt;More on this in our &lt;a href="https://basecut.dev/use-cases/ci-cd-test-data" rel="noopener noreferrer"&gt;CI/CD test data guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  When seed scripts still make sense
&lt;/h2&gt;

&lt;p&gt;Seed scripts are the right tool in some situations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pre-launch projects&lt;/strong&gt; with no production data yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intentionally fictional demos&lt;/strong&gt; where you need a specific scenario.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unit tests&lt;/strong&gt; that need three rows in one table. A snapshot is overkill there.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small schemas&lt;/strong&gt; where the maintenance cost is genuinely low.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If your schema has fewer than ten tables and no PII, a seed script is probably the right choice. The crossover point is usually obvious: it arrives when maintaining the seed file costs more effort than it saves.&lt;/p&gt;

&lt;h2&gt;
  
  
  Migrating off the seed script
&lt;/h2&gt;

&lt;p&gt;You do not have to rip it out in one PR.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with one workflow.&lt;/strong&gt; Pick the place where the seed script hurts most — usually dev environment setup or CI. Set up a snapshot and run it alongside the seed script.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Compare.&lt;/strong&gt; Run the app against both datasets. The snapshot will usually expose things the seed script missed: edge cases, data shapes that only exist in production, relationships that only worked because the script inserted rows in a specific order.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Switch gradually.&lt;/strong&gt; Replace the seed script where the snapshot is better. Keep it for unit tests or demos if it still makes sense.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Let it go.&lt;/strong&gt; Once the snapshot covers your main workflows, stop maintaining the seed script. Do not delete it if people reference it — just stop investing in keeping it current.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;The seed script is one of those things that works well enough to never get fixed. It seeds the database. The app boots. Nobody wants to touch it.&lt;/p&gt;

&lt;p&gt;The problem is that "well enough" slowly gets worse. The schema changes, the data drifts, the edge cases multiply, and the gap between what the seed script produces and what production looks like gets wider every quarter.&lt;/p&gt;

&lt;p&gt;Production snapshots close that gap by removing the maintenance entirely. The data stays current because it comes from the real database. The relationships stay valid because they are followed, not written by hand. And anonymization is part of the process rather than a separate step someone has to remember.&lt;/p&gt;

&lt;p&gt;If your seed script is the file nobody wants to own, maybe the right move is making sure nobody has to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Get started in minutes.&lt;/strong&gt; Basecut extracts FK-aware, anonymized snapshots from PostgreSQL with one CLI command. Free for small teams. &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Try Basecut free →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or explore first: &lt;a href="https://docs.basecut.dev/getting-started/quick-start" rel="noopener noreferrer"&gt;quickstart guide&lt;/a&gt; · &lt;a href="https://docs.basecut.dev/configuration/snapshots" rel="noopener noreferrer"&gt;snapshot config reference&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>testing</category>
      <category>database</category>
      <category>devtools</category>
    </item>
    <item>
      <title>How to Anonymize PII in PostgreSQL for Development</title>
      <dc:creator>Jake Lazarus</dc:creator>
      <pubDate>Tue, 24 Mar 2026 14:43:00 +0000</pubDate>
      <link>https://dev.to/jakelaz/how-to-anonymize-pii-in-postgresql-for-development-hb2</link>
      <guid>https://dev.to/jakelaz/how-to-anonymize-pii-in-postgresql-for-development-hb2</guid>
      <description>&lt;p&gt;Ask any developer whether their local database has real customer data in it, and most will say no.&lt;/p&gt;

&lt;p&gt;Ask them to check, and most will find that it does.&lt;/p&gt;

&lt;p&gt;Real emails in &lt;code&gt;users&lt;/code&gt;. Real names in &lt;code&gt;profiles&lt;/code&gt;. Real billing addresses in &lt;code&gt;payments&lt;/code&gt;. Real IP addresses in &lt;code&gt;audit_logs&lt;/code&gt;. Data that landed in production, got copied somewhere for debugging, and has been sitting in local databases and CI pipelines ever since.&lt;/p&gt;

&lt;p&gt;This is not a hypothetical compliance problem. It is a real one, and it gets messier the longer it goes unaddressed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What counts as PII in a PostgreSQL database
&lt;/h2&gt;

&lt;p&gt;PII is broader than most developers expect. The obvious fields are easy to spot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;email&lt;/code&gt;, &lt;code&gt;email_address&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;first_name&lt;/code&gt;, &lt;code&gt;last_name&lt;/code&gt;, &lt;code&gt;full_name&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;phone&lt;/code&gt;, &lt;code&gt;phone_number&lt;/code&gt;, &lt;code&gt;mobile&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;address&lt;/code&gt;, &lt;code&gt;street_address&lt;/code&gt;, &lt;code&gt;city&lt;/code&gt;, &lt;code&gt;postal_code&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;date_of_birth&lt;/code&gt;, &lt;code&gt;dob&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ssn&lt;/code&gt;, &lt;code&gt;national_id&lt;/code&gt;, &lt;code&gt;tax_id&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But in real production schemas, PII hides in less obvious places:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;free-text fields like &lt;code&gt;notes&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;, &lt;code&gt;bio&lt;/code&gt; that users fill in&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ip_address&lt;/code&gt; columns in event logs and audit tables&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;stripe_customer_id&lt;/code&gt;, &lt;code&gt;paypal_email&lt;/code&gt; — identifiers that link back to real people&lt;/li&gt;
&lt;li&gt;JSONB columns that store user-submitted form data&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;metadata&lt;/code&gt; fields that accumulate whatever the app was logging at the time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have been copying your production database to dev environments without systematically anonymizing those fields, that data is on developer laptops, in CI logs, and has probably passed through Slack at some point.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters beyond just being careful
&lt;/h2&gt;

&lt;p&gt;GDPR, CCPA, HIPAA, and most other data protection frameworks have something in common: they do not distinguish between production and non-production environments. If you are processing personal data in a development environment without appropriate controls, you are in scope.&lt;/p&gt;

&lt;p&gt;In practice, the consequences are usually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GDPR Article 25&lt;/strong&gt;: "data protection by design and by default" — development tooling is explicitly in scope&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SOC 2 Type II&lt;/strong&gt;: data handling controls are audited across environments, not just production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HIPAA minimum necessary rule&lt;/strong&gt;: PHI should only be available to the systems that need it for the purpose it was collected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Beyond compliance, there is a simpler reason: real customer data in dev environments is one of the most common sources of accidental exposure. A developer shares a failing test case on Slack. A CI artifact gets retained with real names in it. A staging database backup ends up in a public S3 bucket.&lt;/p&gt;

&lt;p&gt;The fix is not more policies. It is removing the real data from the environments where it should not be.&lt;/p&gt;

&lt;h2&gt;
  
  
  The naive approach: UPDATE statements after restore
&lt;/h2&gt;

&lt;p&gt;The most common first attempt at anonymization looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Run after restoring a pg_dump to dev&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt;
  &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'user'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="s1"&gt;'@example.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;first_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Test'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;last_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'User'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt;
  &lt;span class="n"&gt;shipping_address&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'123 Test St'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works well enough until it does not.&lt;/p&gt;

&lt;p&gt;The problems start to accumulate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Someone forgets to run the script, and real data ends up in a dev environment anyway.&lt;/li&gt;
&lt;li&gt;The script is not versioned with the schema, so it breaks when new PII columns are added.&lt;/li&gt;
&lt;li&gt;It replaces data inconsistently — the same customer gets a different fake email in &lt;code&gt;users&lt;/code&gt; than in &lt;code&gt;audit_logs&lt;/code&gt;, breaking join-based queries.&lt;/li&gt;
&lt;li&gt;It runs after the fact, which means real data has already traveled through the restore pipeline.&lt;/li&gt;
&lt;li&gt;It has no automated detection — every new PII column has to be added manually.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the pattern that eventually lands teams in trouble. It feels like a solution because it works most of the time. It fails when someone does not run it, or when a new field gets added and nobody updates the script. It shares the same fundamental problem as &lt;a href="https://basecut.dev/vs/seed-scripts" rel="noopener noreferrer"&gt;seed scripts&lt;/a&gt; — manual upkeep that silently falls behind.&lt;/p&gt;

&lt;h2&gt;
  
  
  The better approach: anonymize at extraction time
&lt;/h2&gt;

&lt;p&gt;The more reliable pattern is to anonymize the data before it ever leaves the production environment, not after it arrives in dev. This applies whether you are setting up a &lt;a href="https://basecut.dev/use-cases/local-development" rel="noopener noreferrer"&gt;local development environment&lt;/a&gt; or populating CI databases.&lt;/p&gt;

&lt;p&gt;That means the anonymization step is baked into the snapshot process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Connect to production (or a read replica).&lt;/li&gt;
&lt;li&gt;Extract the rows you need.&lt;/li&gt;
&lt;li&gt;Anonymize sensitive fields inline, during extraction.&lt;/li&gt;
&lt;li&gt;Write the already-anonymized snapshot to wherever it will be stored.&lt;/li&gt;
&lt;/ol&gt;
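&lt;p&gt;The four steps above can be sketched end to end in a few lines. This is a toy pipeline, not Basecut's code: the row list stands in for a real production query result, and the masking function is a stand-in for a proper strategy.&lt;/p&gt;

```python
import hashlib

def mask(value: str) -> str:
    """Stand-in masking strategy: derive a fake email from the real one."""
    return "user_" + hashlib.sha256(value.encode()).hexdigest()[:8] + "@example.com"

# Step 2: extract (simulated production query result).
production_rows = [
    {"id": 1, "email": "jane@company.com"},
    {"id": 2, "email": "bob@company.com"},
]

# Step 3: anonymize inline, during extraction — before anything is written.
snapshot = [{**row, "email": mask(row["email"])} for row in production_rows]

# Step 4: the artifact that leaves production contains no real PII.
assert all("company.com" not in row["email"] for row in snapshot)
print(snapshot)
```

&lt;p&gt;The ordering is the whole point: masking sits between extraction and storage, so there is no window where unmasked data exists outside production.&lt;/p&gt;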

&lt;p&gt;The result is that no real PII ever travels to dev environments. What gets restored is already masked.&lt;/p&gt;

&lt;p&gt;This matters because it removes the "forget to run the script" failure mode entirely. There is no post-restore step to forget.&lt;/p&gt;

&lt;h2&gt;
  
  
  What good anonymization actually requires
&lt;/h2&gt;

&lt;p&gt;Replacing real values with fake ones is straightforward. Making the fake values behave like real data is harder.&lt;/p&gt;

&lt;p&gt;A few requirements come up in practice:&lt;/p&gt;

&lt;h3&gt;
  
  
  Realistic fake values
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;email = 'test@example.com'&lt;/code&gt; is easy to write and easy to spot. It does not behave like real email data in filtering, search, or display.&lt;/p&gt;

&lt;p&gt;Better: generate realistic-looking fake emails that follow the same structure. &lt;code&gt;lmitchell@example.com&lt;/code&gt; is harder to accidentally mistake for test data, and it exercises the same code paths as real emails. We cover why this realism matters for catching bugs in &lt;a href="https://basecut.dev/blog/why-fake-postgresql-test-data-misses-bugs" rel="noopener noreferrer"&gt;Why Fake PostgreSQL Test Data Misses Real Bugs&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deterministic masking
&lt;/h3&gt;

&lt;p&gt;If &lt;code&gt;jane@company.com&lt;/code&gt; maps to &lt;code&gt;lmitchell@example.com&lt;/code&gt; in &lt;code&gt;users&lt;/code&gt;, it should map to the same fake email everywhere it appears — in &lt;code&gt;audit_logs&lt;/code&gt;, &lt;code&gt;notifications&lt;/code&gt;, &lt;code&gt;email_events&lt;/code&gt;, and anywhere else it is referenced.&lt;/p&gt;

&lt;p&gt;Without deterministic masking, the same source value produces different fake values in different tables. Queries that join across tables start returning mismatched or missing results. The data looks restored but does not behave like the real system.&lt;/p&gt;
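&lt;p&gt;Deterministic masking is usually implemented by deriving the fake value from the real one through a keyed hash. A minimal sketch (the &lt;code&gt;MASKING_SEED&lt;/code&gt; constant and function name are illustrative, not a real Basecut setting):&lt;/p&gt;

```python
import hashlib
import hmac

# Illustrative per-project secret: the same seed must be used for every
# table so identical source values map to identical fake values.
MASKING_SEED = b"stable-per-project-seed"

def mask_email(real_email: str) -> str:
    """Keyed hash of the source value: deterministic, but not reversible
    without the seed."""
    digest = hmac.new(MASKING_SEED, real_email.lower().encode(), hashlib.sha256)
    return f"user_{digest.hexdigest()[:10]}@example.com"

# Same source value -> same fake value, so users and audit_logs stay joinable.
assert mask_email("jane@company.com") == mask_email("JANE@company.com")
assert mask_email("jane@company.com") != mask_email("bob@company.com")
```

&lt;p&gt;Because the mapping is a function of the input rather than a random draw, it holds across tables, across snapshots, and across machines, as long as the seed stays fixed.&lt;/p&gt;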

&lt;h3&gt;
  
  
  FK-aware scope
&lt;/h3&gt;

&lt;p&gt;Anonymization cannot happen table-by-table in isolation. If &lt;code&gt;order_id = 1001&lt;/code&gt; belongs to a user whose real name you are masking, the anonymization needs to be consistent across &lt;code&gt;users&lt;/code&gt;, &lt;code&gt;orders&lt;/code&gt;, &lt;code&gt;billing_addresses&lt;/code&gt;, and everything else connected to that user.&lt;/p&gt;

&lt;h3&gt;
  
  
  Coverage of unknown columns
&lt;/h3&gt;

&lt;p&gt;Manually listing every PII column to anonymize only works until someone adds a new one and forgets to update the anonymization config. Auto-detection — pattern matching on column names, types, and values — catches fields that have not been explicitly listed yet.&lt;/p&gt;
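&lt;p&gt;The name-matching half of auto-detection is simple enough to sketch. The pattern list below is illustrative (a real detector also inspects column types and sampled values), and it will over-flag on purpose: false positives get reviewed, false negatives leak.&lt;/p&gt;

```python
import re

# Illustrative name patterns only; a real detector also checks types and values.
PII_NAME_PATTERNS = [
    r"e[-_]?mail",
    r"(first|last|full)?_?name$",
    r"phone|mobile",
    r"address",
    r"dob|birth",
    r"ssn|national_id|tax_id",
]

def looks_like_pii(column_name: str) -> bool:
    """Flag column names that match any known PII naming pattern."""
    name = column_name.lower()
    return any(re.search(p, name) for p in PII_NAME_PATTERNS)

columns = ["email", "order_total", "billing_address", "created_at", "ip_address"]
print([c for c in columns if looks_like_pii(c)])
# → ['email', 'billing_address', 'ip_address']
```

&lt;p&gt;Running this kind of check against &lt;code&gt;information_schema.columns&lt;/code&gt; on every snapshot is what catches the PII column someone added last sprint and never listed in the config.&lt;/p&gt;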

&lt;h2&gt;
  
  
  What Basecut's anonymization config looks like
&lt;/h2&gt;

&lt;p&gt;In Basecut, anonymization is part of the snapshot config. You can enable automatic detection, add explicit rules for specific columns, or mix both.&lt;/p&gt;

&lt;p&gt;The simplest version uses auto-detection:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dev-snapshot'&lt;/span&gt;

&lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:since'&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;since&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2026-01-01'&lt;/span&gt;

&lt;span class="na"&gt;traverse&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;parents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;children&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

&lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;per_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
    &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50000&lt;/span&gt;

&lt;span class="na"&gt;anonymize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auto&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;mode: auto&lt;/code&gt;, Basecut scans column names and data patterns to detect likely PII fields and applies sensible defaults. Emails become realistic fake emails. Names become realistic fake names. Phone numbers become valid-format fake phone numbers.&lt;/p&gt;

&lt;p&gt;For fields that need explicit control:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;anonymize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auto&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users.notes&lt;/span&gt;
      &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;clear&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users.profile_image_url&lt;/span&gt;
      &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;clear&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments.card_last_four&lt;/span&gt;
      &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;preserve&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;audit_logs.ip_address&lt;/span&gt;
      &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fake_ip&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users.external_id&lt;/span&gt;
      &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hash&lt;/span&gt;
      &lt;span class="na"&gt;deterministic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The strategies map to real behaviors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;fake_*&lt;/code&gt; — generate realistic-looking fake values (email, name, phone, address, IP)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;clear&lt;/code&gt; — replace with &lt;code&gt;NULL&lt;/code&gt; or empty string&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;preserve&lt;/code&gt; — keep as-is (for fields that are not sensitive but look like they might be)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;hash&lt;/code&gt; — consistent one-way hash, useful for IDs that need to be consistent but not reversible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deterministic masking is on by default for most strategies. Basecut uses a stable seed so the same source value always produces the same output, which keeps join queries working correctly across tables.&lt;/p&gt;

&lt;h2&gt;
  
  
  Org-wide policies
&lt;/h2&gt;

&lt;p&gt;One problem with per-snapshot anonymization configs is that each team or developer can configure their own rules — which means someone will configure them wrong.&lt;/p&gt;

&lt;p&gt;For teams with compliance requirements, Basecut supports org-wide anonymization policies: rules defined once at the organization level that apply to every snapshot, regardless of who creates it. An individual snapshot config can add rules but cannot remove or override org-level ones.&lt;/p&gt;

&lt;p&gt;This is how you get consistent anonymization without relying on every developer to configure it correctly every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this looks like in CI
&lt;/h2&gt;

&lt;p&gt;The same anonymization config that runs locally runs in CI. If you are restoring a snapshot before your test suite, the data arriving in your CI pipeline is already masked.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Test&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:15&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test_db&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;5432:5432&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Basecut CLI&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;curl -fsSL https://basecut.dev/install.sh | sh&lt;/span&gt;
          &lt;span class="s"&gt;echo "$HOME/.local/bin" &amp;gt;&amp;gt; $GITHUB_PATH&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Restore anonymized snapshot&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;BASECUT_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.BASECUT_API_KEY }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;basecut snapshot restore dev-snapshot:latest \&lt;/span&gt;
            &lt;span class="s"&gt;--target "postgresql://postgres:postgres@localhost:5432/test_db"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run tests&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CI pipeline gets realistic data shapes and relationships without ever touching real PII. That is also worth noting for audits: you can show that your CI environment runs against masked data by design, not by policy.&lt;/p&gt;

&lt;p&gt;We go deeper on the CI setup in our &lt;a href="https://basecut.dev/use-cases/ci-cd-test-data" rel="noopener noreferrer"&gt;CI/CD test data guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to use pg_anonymizer (and when not to)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://postgresql-anonymizer.readthedocs.io/" rel="noopener noreferrer"&gt;postgresql_anonymizer&lt;/a&gt; is a PostgreSQL extension worth knowing about. It provides declarative masking rules at the database level, which is useful in some situations — particularly if you want to expose a masked view of production data to specific roles.&lt;/p&gt;

&lt;p&gt;A few places where it fits less well for dev workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It requires installing an extension in production, which many teams are unable or unwilling to do.&lt;/li&gt;
&lt;li&gt;It anonymizes in-place or via views, not during extraction — the data still travels to dev before masking.&lt;/li&gt;
&lt;li&gt;It does not handle subsetting, so you are still copying the full database.&lt;/li&gt;
&lt;li&gt;It requires explicit rule definitions for every column with no auto-detection.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For teams who want anonymization as part of a broader snapshot + subset + restore workflow, extraction-time masking is usually cleaner. You can see more in our &lt;a href="https://basecut.dev/vs/pg-dump" rel="noopener noreferrer"&gt;comparison with pg_dump&lt;/a&gt;. For a broader look at how Basecut compares to commercial alternatives like &lt;a href="https://basecut.dev/vs/tonic" rel="noopener noreferrer"&gt;Tonic&lt;/a&gt; and &lt;a href="https://basecut.dev/vs/delphix" rel="noopener noreferrer"&gt;Delphix&lt;/a&gt;, see our comparison pages.&lt;/p&gt;

&lt;h2&gt;
  
  
  A practical rollout
&lt;/h2&gt;

&lt;p&gt;If you have never done systematic anonymization before, the cleanest way to start is gradually.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Audit first.&lt;/strong&gt;&lt;br&gt;
Run a query across your schema to find columns that look like PII: &lt;code&gt;email&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;phone&lt;/code&gt;, &lt;code&gt;address&lt;/code&gt;, any JSONB column with user-submitted data. You will find more than you expect.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start with auto-detection.&lt;/strong&gt;&lt;br&gt;
Let the tooling make a first pass at detection. Review what it finds, add explicit rules for anything it missed or got wrong.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Run a test snapshot.&lt;/strong&gt;&lt;br&gt;
Restore it to a local dev database and check: does the data look anonymized? Do join queries still work? Does the app behave normally with it?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Commit the config.&lt;/strong&gt;&lt;br&gt;
Your anonymization rules should live in version control alongside your schema. When someone adds a new PII column, the config update is part of the same PR.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Enforce at the org level.&lt;/strong&gt;&lt;br&gt;
Once your rules are stable, promote the critical ones to org-wide policies so they apply regardless of who runs the snapshot.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
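&lt;p&gt;The step-1 audit can start as simple pattern-matching over column metadata, for example the output of a query against &lt;code&gt;information_schema.columns&lt;/code&gt;. A minimal sketch; the patterns and the sample schema below are illustrative, not exhaustive:&lt;/p&gt;

```python
import re

# Column-name patterns that commonly indicate PII; extend for your domain
PII_PATTERN = re.compile(
    r"(email|name|phone|address|ssn|birth|dob|ip_addr)", re.IGNORECASE
)

def audit_columns(columns):
    """Flag likely-PII columns from (table, column, data_type) rows,
    e.g. rows fetched from information_schema.columns."""
    flagged = []
    for table, column, data_type in columns:
        # JSONB columns often hide user-submitted data, so flag them too
        if PII_PATTERN.search(column) or data_type == "jsonb":
            flagged.append((table, column))
    return flagged

# Hypothetical schema sample standing in for a live catalog query
sample = [
    ("users", "id", "bigint"),
    ("users", "email", "text"),
    ("users", "full_name", "text"),
    ("orders", "total_cents", "integer"),
    ("events", "payload", "jsonb"),
]
print(audit_columns(sample))
# [('users', 'email'), ('users', 'full_name'), ('events', 'payload')]
```

&lt;p&gt;Treat the output as a starting list for human review, not a verdict: name-based matching misses free-text columns that contain PII and flags harmless ones like &lt;code&gt;hostname&lt;/code&gt;.&lt;/p&gt;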

&lt;p&gt;The common mistake is treating anonymization as a post-restore step rather than part of the data pipeline. Once it is baked into snapshot creation, the risk of it being skipped drops to near zero.&lt;/p&gt;

&lt;h2&gt;Final thought&lt;/h2&gt;

&lt;p&gt;The reason most dev databases have real PII in them is not malice. It is that anonymization was not built into the default workflow — it was an afterthought, a script someone ran sometimes, a step in a doc nobody updated.&lt;/p&gt;

&lt;p&gt;The fix is simple in principle: make anonymization happen at extraction time, not after the fact. Every snapshot that gets restored to dev, CI, or staging should arrive already masked.&lt;/p&gt;

&lt;p&gt;If your team is still relying on manual cleanup scripts, or skipping anonymization entirely, that is the thing worth fixing first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Get started with Basecut's anonymization.&lt;/strong&gt; The CLI auto-detects common PII fields and applies masking during snapshot creation. Free for small teams. &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Try Basecut free →&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or dig into the details first: &lt;a href="https://docs.basecut.dev/configuration/anonymization" rel="noopener noreferrer"&gt;anonymization config reference&lt;/a&gt; · &lt;a href="https://docs.basecut.dev/core-concepts/how-it-works" rel="noopener noreferrer"&gt;how FK-aware extraction works&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>security</category>
      <category>devops</category>
      <category>database</category>
    </item>
    <item>
      <title>Why Fake PostgreSQL Test Data Misses Real Bugs</title>
      <dc:creator>Jake Lazarus</dc:creator>
      <pubDate>Wed, 18 Mar 2026 05:00:00 +0000</pubDate>
      <link>https://dev.to/jakelaz/why-fake-postgresql-test-data-misses-real-bugs-3jco</link>
      <guid>https://dev.to/jakelaz/why-fake-postgresql-test-data-misses-real-bugs-3jco</guid>
      <description>&lt;p&gt;Most teams do not have a testing problem. They have a &lt;strong&gt;test data realism&lt;/strong&gt; problem.&lt;/p&gt;

&lt;p&gt;Locally, the app runs against &lt;code&gt;test@example.com&lt;/code&gt;, &lt;code&gt;User 1&lt;/code&gt;, and a seed script nobody wants to maintain. In CI, fixtures slowly drift away from reality. Then the bugs show up after deploy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a customer name has an apostrophe or accent&lt;/li&gt;
&lt;li&gt;a field is &lt;code&gt;NULL&lt;/code&gt; where your code assumed a string&lt;/li&gt;
&lt;li&gt;an account has 47 related records instead of 2&lt;/li&gt;
&lt;li&gt;a query that worked on 20 rows falls over on 20,000&lt;/li&gt;
&lt;li&gt;shared staging data gets mutated by three people at once&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If that sounds familiar, the answer is usually not "write more tests." It is "stop testing against fake data."&lt;/p&gt;

&lt;h2&gt;What teams actually want&lt;/h2&gt;

&lt;p&gt;What most teams actually want is not a full copy of production, not a giant &lt;code&gt;pg_dump&lt;/code&gt;, and not another 400-line seed script. They want &lt;strong&gt;production-like data&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;realistic enough to expose bugs&lt;/li&gt;
&lt;li&gt;small enough to restore locally and in CI&lt;/li&gt;
&lt;li&gt;safe enough to use outside production&lt;/li&gt;
&lt;li&gt;reproducible enough that every developer can get the same result&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the gap most dev workflows never solve cleanly.&lt;/p&gt;

&lt;h2&gt;Why the usual approaches break down&lt;/h2&gt;

&lt;h3&gt;Seed scripts rot&lt;/h3&gt;

&lt;p&gt;Seed scripts are fine when your app has five tables. They get painful when your schema grows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;every migration breaks something&lt;/li&gt;
&lt;li&gt;relationships get harder to maintain&lt;/li&gt;
&lt;li&gt;the data gets less realistic over time&lt;/li&gt;
&lt;li&gt;nobody wants to own the script anymore&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You end up with a setup that is reproducible, but not especially useful. We wrote a deeper &lt;a href="https://basecut.dev/vs/seed-scripts" rel="noopener noreferrer"&gt;comparison of seed scripts vs production snapshots&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;&lt;code&gt;pg_dump&lt;/code&gt; is great for backups, not dev environments&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;pg_dump&lt;/code&gt; solves a different problem. It copies everything:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;all rows&lt;/li&gt;
&lt;li&gt;all tables&lt;/li&gt;
&lt;li&gt;all PII&lt;/li&gt;
&lt;li&gt;all the size and baggage of production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is useful for backup and recovery. It is usually overkill for local development and CI.&lt;/p&gt;

&lt;p&gt;For dev workflows, full dumps create new problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;slow restores&lt;/li&gt;
&lt;li&gt;bloated local databases&lt;/li&gt;
&lt;li&gt;longer CI jobs&lt;/li&gt;
&lt;li&gt;sensitive data showing up in places it should not&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most of the time, you do not need the entire database. You need the &lt;strong&gt;right slice&lt;/strong&gt; of it. We wrote a &lt;a href="https://basecut.dev/vs/pg-dump" rel="noopener noreferrer"&gt;full comparison of pg_dump vs Basecut&lt;/a&gt; if you want the details.&lt;/p&gt;

&lt;h2&gt;The better pattern: subset, anonymize, restore&lt;/h2&gt;

&lt;p&gt;The workflow that makes sense looks more like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start from one or more root tables.&lt;/li&gt;
&lt;li&gt;Filter to the rows you actually care about.&lt;/li&gt;
&lt;li&gt;Follow foreign keys to pull in the connected data.&lt;/li&gt;
&lt;li&gt;Anonymize sensitive fields inline.&lt;/li&gt;
&lt;li&gt;Save the snapshot.&lt;/li&gt;
&lt;li&gt;Restore it anywhere you need it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That gives you a connected, realistic, privacy-safe subset of production instead of a raw copy. This is the workflow we built &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Basecut&lt;/a&gt; around for PostgreSQL: FK-aware extraction, automatic PII anonymization, and one-command restores for local dev, CI, and debugging.&lt;/p&gt;

&lt;p&gt;The reason this approach works is simple: it treats test data as a &lt;strong&gt;repeatable snapshot problem&lt;/strong&gt;, not a hand-crafted fixture problem.&lt;/p&gt;

&lt;h2&gt;What "production-like" should actually mean&lt;/h2&gt;

&lt;p&gt;The phrase gets used loosely. In practice, production-like data should have four properties.&lt;/p&gt;

&lt;h3&gt;1. Realistic structure&lt;/h3&gt;

&lt;p&gt;It should reflect the real relationships, optional fields, and edge cases in your schema.&lt;/p&gt;

&lt;h3&gt;2. Referential integrity&lt;/h3&gt;

&lt;p&gt;If you copy one row from &lt;code&gt;orders&lt;/code&gt;, you usually also need related rows from &lt;code&gt;users&lt;/code&gt;, &lt;code&gt;line_items&lt;/code&gt;, &lt;code&gt;shipments&lt;/code&gt;, and whatever else your app expects to exist together.&lt;/p&gt;

&lt;h3&gt;3. Privacy safety&lt;/h3&gt;

&lt;p&gt;Emails, names, phone numbers, addresses, and other sensitive fields need to be anonymized before the data lands on laptops, CI runners, or logs.&lt;/p&gt;

&lt;h3&gt;4. Repeatability&lt;/h3&gt;

&lt;p&gt;Developers need a predictable way to recreate the same kind of dataset without asking someone to send them a dump.&lt;/p&gt;

&lt;p&gt;If any one of those is missing, the workflow gets shaky fast.&lt;/p&gt;

&lt;h2&gt;Why FK-aware extraction matters&lt;/h2&gt;

&lt;p&gt;This is the part many DIY approaches get wrong. Randomly sampling rows from each table sounds easy until you restore them. Then you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;orders pointing to missing users&lt;/li&gt;
&lt;li&gt;line items pointing to missing products&lt;/li&gt;
&lt;li&gt;child rows without their parents&lt;/li&gt;
&lt;li&gt;failed restores or strange app behavior after restore&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A useful snapshot has to behave like a self-contained mini-version of production.&lt;/p&gt;

&lt;p&gt;That is why FK-aware extraction matters. In Basecut, snapshots are built by following foreign keys in both directions and collecting a connected subgraph of your data. The result is something you can restore into an empty database without ending up with broken references.&lt;/p&gt;
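&lt;p&gt;To make that concrete, here is a toy sketch of the idea (not Basecut's implementation, and the tables are hypothetical): start from seed rows and walk parent foreign keys until every reference in the subset resolves. Basecut also traverses child relationships; this sketch walks only parents for brevity.&lt;/p&gt;

```python
from collections import deque

# Toy FK metadata: child table -> [(fk_column, parent_table)]
# (in PostgreSQL you would read this from the system catalogs)
FKS = {
    "orders": [("user_id", "users")],
    "line_items": [("order_id", "orders"), ("product_id", "products")],
}

# Toy data: table -> {primary_key: row}
DATA = {
    "users": {1: {"id": 1}},
    "products": {7: {"id": 7}},
    "orders": {10: {"id": 10, "user_id": 1}},
    "line_items": {100: {"id": 100, "order_id": 10, "product_id": 7}},
}

def collect(seed_table, seed_ids):
    """BFS from seed rows, pulling in every parent row they reference,
    so the subset restores without broken foreign keys."""
    picked = {t: set() for t in DATA}
    queue = deque((seed_table, pk) for pk in seed_ids)
    while queue:
        table, pk = queue.popleft()
        if pk in picked[table]:
            continue
        picked[table].add(pk)
        row = DATA[table][pk]
        for fk_col, parent in FKS.get(table, []):
            queue.append((parent, row[fk_col]))
    return {t: ids for t, ids in picked.items() if ids}

print(collect("line_items", [100]))
# {'users': {1}, 'products': {7}, 'orders': {10}, 'line_items': {100}}
```

&lt;p&gt;One seed row pulls in its order, user, and product, which is exactly what random per-table sampling fails to do.&lt;/p&gt;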

&lt;p&gt;That matters more than people think. It is the difference between "the data loaded" and "the app actually behaves like it does in production."&lt;/p&gt;

&lt;h2&gt;What the workflow looks like in practice&lt;/h2&gt;

&lt;p&gt;The nice part is that this can stay simple. Basecut starts with a small YAML config that tells it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;where to start&lt;/li&gt;
&lt;li&gt;how far to traverse relationships&lt;/li&gt;
&lt;li&gt;how much data to include&lt;/li&gt;
&lt;li&gt;how anonymization should work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;dev-snapshot'&lt;/span&gt;

&lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;users&lt;/span&gt;
    &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;:since'&lt;/span&gt;
    &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;since&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2026-01-01'&lt;/span&gt;

&lt;span class="na"&gt;traverse&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;parents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;children&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;

&lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;rows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;per_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
    &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50000&lt;/span&gt;

&lt;span class="na"&gt;anonymize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auto&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then the workflow becomes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;basecut snapshot create &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; basecut.yml &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; &lt;span class="s2"&gt;"dev-snapshot"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

basecut snapshot restore dev-snapshot:latest &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$LOCAL_DATABASE_URL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the whole loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;define the config&lt;/li&gt;
&lt;li&gt;create a snapshot&lt;/li&gt;
&lt;li&gt;restore it anywhere&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, most teams can get from install to first snapshot in a few minutes. You can try this workflow with &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Basecut&lt;/a&gt; — the CLI is free for small teams.&lt;/p&gt;

&lt;h2&gt;The privacy part is not optional&lt;/h2&gt;

&lt;p&gt;One more requirement: privacy. If you are moving production-like data into dev and CI, PII handling cannot be a manual cleanup step.&lt;/p&gt;

&lt;p&gt;At minimum, your workflow should:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;detect common PII automatically&lt;/li&gt;
&lt;li&gt;allow explicit masking rules&lt;/li&gt;
&lt;li&gt;preserve join integrity where needed&lt;/li&gt;
&lt;li&gt;anonymize during extraction, not afterward&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Basecut handles this with automatic PII detection plus 30+ anonymization strategies. It also supports deterministic masking, which matters when the same source value needs to map to the same fake value across related tables.&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;jane@company.com&lt;/code&gt; turns into one fake email in &lt;code&gt;users&lt;/code&gt; and a different fake email somewhere else, your data stops behaving like the real system. That is exactly the sort of detail that makes fake dev data feel fine right up until it is not.&lt;/p&gt;
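&lt;p&gt;The underlying technique, independent of any particular tool, is deterministic pseudonymization: derive the replacement from a keyed hash of the source value, so every table maps the same input to the same output. A minimal sketch, with a made-up secret:&lt;/p&gt;

```python
import hashlib
import hmac

SECRET = b"rotate-me"  # per-environment key; keep it out of the repo

def mask_email(email: str) -> str:
    """Map the same source email to the same fake address everywhere."""
    digest = hmac.new(SECRET, email.lower().encode(), hashlib.sha256)
    return f"user_{digest.hexdigest()[:10]}@example.com"

a = mask_email("jane@company.com")
b = mask_email("jane@company.com")   # e.g. the same value in another table
assert a == b                        # joins on email still line up
assert a != mask_email("joe@company.com")
print(a)
```

&lt;p&gt;A keyed hash rather than a plain one matters: without the secret, an attacker with a dev snapshot could hash candidate emails and reverse the pseudonyms.&lt;/p&gt;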

&lt;h2&gt;Why this works well in CI too&lt;/h2&gt;

&lt;p&gt;This pattern is just as useful in CI as it is locally. Instead of checking brittle fixtures into the repo, you restore a realistic snapshot before the test suite runs.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Test&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;postgres&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:15&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
          &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test_db&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;5432:5432&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install Basecut CLI&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;curl -fsSL https://basecut.dev/install.sh | sh&lt;/span&gt;
          &lt;span class="s"&gt;echo "$HOME/.local/bin" &amp;gt;&amp;gt; $GITHUB_PATH&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Restore snapshot&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;BASECUT_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.BASECUT_API_KEY }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;basecut snapshot restore test-data:latest \&lt;/span&gt;
            &lt;span class="s"&gt;--target "postgresql://postgres:postgres@localhost:5432/test_db"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run tests&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That gives your pipeline real shapes, real relationships, and realistic edge cases without restoring an entire production dump on every run. It also keeps snapshots small enough that restores stay fast. We go deeper in our &lt;a href="https://basecut.dev/use-cases/ci-cd-test-data" rel="noopener noreferrer"&gt;CI/CD test data guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;When this is worth doing&lt;/h2&gt;

&lt;p&gt;You probably want production-like snapshots if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;your app has more than a handful of tables&lt;/li&gt;
&lt;li&gt;your bugs are often data-dependent&lt;/li&gt;
&lt;li&gt;you need realistic data in local dev&lt;/li&gt;
&lt;li&gt;your CI pipeline should test against something closer to reality&lt;/li&gt;
&lt;li&gt;you handle meaningful PII&lt;/li&gt;
&lt;li&gt;your team is tired of maintaining fixtures or seed scripts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You might not need it yet if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the product is brand new&lt;/li&gt;
&lt;li&gt;you do not have real production data yet&lt;/li&gt;
&lt;li&gt;the schema is tiny&lt;/li&gt;
&lt;li&gt;completely fictional demo data is the goal&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not about replacing every fixture in your test suite. Unit tests still benefit from tiny, explicit test data. The value here is in &lt;strong&gt;integration tests, &lt;a href="https://basecut.dev/use-cases/local-development" rel="noopener noreferrer"&gt;local development&lt;/a&gt;, CI, onboarding, and debugging&lt;/strong&gt; where data shape matters.&lt;/p&gt;

&lt;h2&gt;A practical rollout&lt;/h2&gt;

&lt;p&gt;If you want to adopt this without overcomplicating it, start small.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Pick one painful workflow.&lt;br&gt;
Usually local dev onboarding, shared staging, or CI integration tests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Define a small snapshot.&lt;br&gt;
Keep the restore fast. Start with a few root tables and sensible row limits.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Turn on anonymization from day one.&lt;br&gt;
Do not leave this for later.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Restore it somewhere useful immediately.&lt;br&gt;
Local dev DB or CI test DB is usually enough to prove the value.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Expand gradually.&lt;br&gt;
Add more tables, better filters, and refresh automation once the loop is working.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That gets you to useful production-like data quickly without turning the whole thing into a platform project.&lt;/p&gt;

&lt;h2&gt;Final thought&lt;/h2&gt;

&lt;p&gt;Most teams are not under-testing. They are testing against data that makes them feel safe. That is not the same thing.&lt;/p&gt;

&lt;p&gt;If your local environments and CI pipelines run against tiny, stale, or fake data, they will keep giving you false confidence. Production-like snapshots are one of the highest-leverage ways to make development and testing feel closer to the real system without dragging raw production data everywhere.&lt;/p&gt;

&lt;p&gt;If you want to try this with PostgreSQL, &lt;a href="https://basecut.dev" rel="noopener noreferrer"&gt;Basecut&lt;/a&gt; is free for small teams. Or dig into the &lt;a href="https://docs.basecut.dev/getting-started/quick-start" rel="noopener noreferrer"&gt;quickstart guide&lt;/a&gt; and &lt;a href="https://docs.basecut.dev/core-concepts/how-it-works" rel="noopener noreferrer"&gt;how FK-aware extraction works&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>testing</category>
      <category>devops</category>
      <category>database</category>
    </item>
  </channel>
</rss>
