Schema Diff: Catch Drift Between Staging and Production

#cicd #database #devops #sql

Schema Diff: Catch Drift Between Staging and Production

Your migration tool says everything is applied. Your CI pipeline is green. But production has a column that staging does not, and nobody knows who added it or when. Sound familiar? Schema drift is the gap between what your migration history says should exist and what actually exists in the database.

The Problem

A deployment fails in production because a migration expects a column that exists in staging but not production. Someone added it manually during debugging weeks ago and forgot to create a proper migration. Or the deployment succeeds, but a query starts failing because the index it depends on was only created in the staging environment. Or a constraint was dropped in production to fix a data issue and never re-added.

Schema drift between environments is one of those problems everyone accumulates and nobody actively monitors. Each drift is individually small — a missing column, a different default value, an index present in one environment but not the other. But they compound. By the time you discover them, it is during a deployment or an incident, when the cost of surprise is highest.

The drift is not always between staging and production. Replicas can drift from their primary if DDL is applied directly to one node. Development databases diverge as engineers apply ad-hoc changes. Even within a single environment, a failed migration that partially applied can leave the schema in an inconsistent state — half the tables have the new column, half do not.

Migration tools like Flyway, Liquibase, and Alembic track what migrations were applied, but they do not verify the actual schema matches the expected state. A migration marked as "applied" in the history table says nothing about whether someone later modified the schema by hand.

How to Detect It

Comparing schemas manually means querying information_schema on both databases and diffing the results. Start with tables and columns:

-- Compare columns between two schemas
-- Run on EACH database and diff the output
SELECT
    table_name,
    column_name,
    data_type,
    column_default,
    is_nullable,
    character_maximum_length
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position;

Then indexes:

-- List all indexes with their definitions
SELECT
    tablename,
    indexname,
    indexdef
FROM pg_indexes
WHERE schemaname = 'public'
ORDER BY tablename, indexname;

Then constraints, functions, triggers, and sequences. Each object type requires its own query, its own export, and its own diff. For a schema with 50 tables, this is hundreds of rows to compare across six or more categories. You can script it, but maintaining the script and interpreting the diff requires ongoing effort.

The deeper problem is that nobody does this proactively. Schema comparison is a reactive activity — you do it after something breaks, not before. The gap between "last time someone checked" and "now" is where drift accumulates undetected.

A Practical Approach to Schema Comparison

Whether you build custom tooling or use a monitoring platform, an effective schema diff should categorize differences by type: missing tables, missing columns, different column types or defaults, missing indexes, missing constraints, and missing functions. Color-coding by severity helps with triage — a missing column that exists in production but not staging is a deployment risk, while a different column default may be intentional.

Even better, generate the exact ALTER statements needed to reconcile each difference. A missing column should produce ALTER TABLE ADD COLUMN. A different default should produce ALTER COLUMN SET DEFAULT. A missing index should produce CREATE INDEX CONCURRENTLY. This eliminates the manual step of writing migration SQL.

Safe patterns for generated statements include IF NOT EXISTS where supported, CONCURRENTLY for index operations, and explicit data types to avoid ambiguity.

How to Fix It

Walk through the most common drift scenarios and their fixes:

Missing column — A column exists in staging but not production:

-- Add the missing column with a safe default
ALTER TABLE orders
    ADD COLUMN IF NOT EXISTS status_code integer DEFAULT 0;

Use IF NOT EXISTS (PostgreSQL 9.6+) so the statement is idempotent — safe to run even if someone already added the column manually.

Different default value — The column exists in both environments but defaults differ:

-- Align the default value
ALTER TABLE orders
    ALTER COLUMN status_code SET DEFAULT 1;

Missing index — An index exists in production but not staging (or vice versa):

-- Create the missing index without blocking writes
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_status_code
    ON orders (status_code);

Always use CONCURRENTLY in production. A regular CREATE INDEX takes a ShareLock on the table, blocking all inserts and updates until the index is built.

Missing constraint — A foreign key or check constraint was dropped or never created:

-- Re-add a missing foreign key constraint
ALTER TABLE order_items
    ADD CONSTRAINT fk_order_items_order_id
    FOREIGN KEY (order_id) REFERENCES orders (order_id);

Adding a foreign key validates all existing rows, which can be slow on large tables. Use NOT VALID to add the constraint without validating existing data, then validate separately:

ALTER TABLE order_items
    ADD CONSTRAINT fk_order_items_order_id
    FOREIGN KEY (order_id) REFERENCES orders (order_id)
    NOT VALID;

-- Validate existing rows separately (does not block writes)
ALTER TABLE order_items
    VALIDATE CONSTRAINT fk_order_items_order_id;

How to Prevent It

Run schema diff as part of your deployment pipeline. Before applying migrations to production, compare the production schema against what your migration tool expects. This catches manual changes that were applied outside the migration system — the single most common source of drift.

Compare your primary against each replica periodically. Physical replication keeps schemas in sync, but logical replication does not — DDL is not replicated by default in logical replication setups. If you use logical replication, schema drift between publisher and subscriber is almost guaranteed without explicit management.

Establish a monthly schema comparison between staging and production even when no deployment is planned. Drift accumulates in the gaps between deployments.

Treat schema as code, but verify the code matches reality. Migration history says what should have been applied. Schema diff tells you what actually exists. The gap between the two is where incidents start.