137Foundry

Posted on Jun 13

How to Add a NOT NULL Column Without Locking the Table in Postgres

#database #productivity #webdev #programming

The naive version of "add a NOT NULL column to an existing Postgres table" is one of the classic ways to lock production for an extended window. The migration looks innocent. The downtime is real.

This is the step-by-step for doing it safely, the reason each step matters, and the Postgres version differences that change the answer.

Photo by Massimo Botturi on Unsplash

Step 0: Know Your Postgres Version

The single biggest factor in how risky this migration is, is the Postgres major version. The behavior of ALTER TABLE ... ADD COLUMN with a default value changed meaningfully in Postgres 11.

Before Postgres 11: Adding a column with any default value rewrites the entire table. The rewrite takes an ACCESS EXCLUSIVE lock for the duration. On a multi-million-row table, this is minutes of effective downtime.

Postgres 11 and later: Adding a column with a constant default does not rewrite the table. The default is stored in the catalog and applied to existing rows virtually. The operation is fast and takes a brief lock only.

Confirm your version with SELECT version(). If you are on 11 or later, the rest of this gets simpler. If you are on 10 or earlier, you need the full multi-step pattern.

The behavior is documented in the Postgres 11 release notes, and the broader rules are in the ALTER TABLE documentation.

The Safe Pattern (Works on All Versions)

The multi-step pattern that works regardless of version. Each step is one deploy or one separate operation. None of them holds a long lock.

Step 1: Add the Column as Nullable

ALTER TABLE users ADD COLUMN email_verified BOOLEAN;

No default. No NOT NULL. The column exists. Existing rows have NULL in it. The operation is a metadata-only change in modern Postgres versions and takes a brief lock.

This is the cheapest possible schema change. Run it whenever you want.

Step 2: Update the Application to Write the New Column

Deploy a version of the application code that writes a value to email_verified on every insert and on relevant updates. New rows from this point forward have a value. Old rows still have NULL.

If you are dual-writing to support a rollback, also keep writing to whatever column or system the new column is replacing. The dual-write stage of the broader migration playbook covers this.

Step 3: Backfill the Existing Rows in Batches

Update the existing NULL rows in batches. The batched script pattern:

DO $$
DECLARE
  batch_size INT := 5000;
  last_id BIGINT := 0;
  max_id BIGINT;
  rows_updated INT;
BEGIN
  SELECT MAX(id) INTO max_id FROM users;
  WHILE last_id < max_id LOOP
    UPDATE users
    SET email_verified = false
    WHERE id > last_id AND id <= last_id + batch_size
      AND email_verified IS NULL;
    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    last_id := last_id + batch_size;
    PERFORM pg_sleep(0.05);
  END LOOP;
END $$;

In practice, this lives in a Python or Ruby script with progress tracking and replication-lag monitoring, not in a PL/pgSQL block. But the shape of the operation is the same: bounded range scan on the primary key, small sleep between batches, idempotent update.

The reason for the sleep: it gives the WAL stream a chance to catch up to read replicas and lets concurrent workloads get a fair share of the database's attention.

Step 4: Add the NOT NULL Constraint With NOT VALID

ALTER TABLE users
  ADD CONSTRAINT users_email_verified_not_null
  CHECK (email_verified IS NOT NULL) NOT VALID;

NOT VALID adds the constraint without scanning the existing data. The operation is fast and takes a brief lock. New rows are required to have a non-NULL value.

The constraint is technically a check constraint, not a NOT NULL column attribute. The behavior is similar (new rows must have a value) but the catalog representation is different. The next steps convert it to a proper NOT NULL attribute.

Step 5: Validate the Constraint

ALTER TABLE users
  VALIDATE CONSTRAINT users_email_verified_not_null;

The validate step scans the table to confirm every row satisfies the constraint. It takes a SHARE UPDATE EXCLUSIVE lock, which allows reads and writes but blocks other schema changes. This is the longest operation in the sequence on a large table, but it is concurrent with normal traffic.

If the validation fails, it means the backfill missed rows. Find them, backfill them, re-run the validate.

Step 6: Convert the Constraint to a Column NOT NULL Attribute

This step is optional but cleaner. In Postgres 12 and later, once the CHECK constraint is validated, you can add the NOT NULL attribute to the column without a full table scan:

ALTER TABLE users ALTER COLUMN email_verified SET NOT NULL;

Postgres uses the existing validated CHECK constraint to satisfy the requirement, so this takes a brief lock instead of a full scan. After that, you can drop the CHECK constraint:

ALTER TABLE users DROP CONSTRAINT users_email_verified_not_null;

On Postgres 11 and earlier, the SET NOT NULL step requires a full table scan. The safer alternative on older versions is to keep the CHECK constraint instead of converting.

Photo by Negative Space on Pexels

What Goes Wrong If You Skip Steps

Skipping the nullable add and trying to do it with a default in one shot, on an older Postgres version. Full table rewrite, ACCESS EXCLUSIVE lock for the duration, production downtime.

Skipping the application dual-write before the backfill. New rows arrive during the backfill with NULL in the new column. The backfill thinks it is done. The validate step fails on the rows that arrived after the backfill started.

Skipping the NOT VALID constraint approach and going straight to SET NOT NULL. Full table scan with ACCESS EXCLUSIVE on older versions. Extended lock.

Skipping the batched backfill and using a single UPDATE. Long lock, WAL flood, replication lag spike. The longer the table, the worse the symptoms.

Each of these is a real production incident pattern. The defensive process is to assume every step matters until proven otherwise on your specific version.

A Note on Default Values

If the new column has a sensible default (false, 0, empty string), include it in the backfill. The application code that writes the new column should also write the default for new rows.

If you want the database to enforce the default for new rows so application bugs do not write NULL, add the default in a separate ALTER:

ALTER TABLE users ALTER COLUMN email_verified SET DEFAULT false;

This is a metadata-only change in modern Postgres versions. It does not rewrite the table. It only affects rows inserted without an explicit value going forward.

Putting It Together

The whole sequence on a modern Postgres version:

Add nullable column (cheap).
Deploy app code that dual-writes.
Backfill in batches.
Add NOT NULL CHECK with NOT VALID (cheap).
Validate the constraint (long but concurrent).
Convert to column NOT NULL attribute (cheap).
Drop the CHECK constraint.

Seven operations, none of them holding a long exclusive lock. The total wall-clock time is dominated by the backfill in step 3 and the validate in step 5, both of which run concurrent with production traffic.

The broader migration playbook this fits into (the dual-write stage, the gradual cutover, the contract step) is covered at https://137foundry.com/articles/how-to-plan-zero-downtime-database-schema-migration. The free 137Foundry web development overview at the services hub covers the broader engineering practice of putting these patterns in place, including the operational scaffolding around backfills and gradual rollouts.

For Postgres-specific deep dives on lock behavior, the pgexperts blog and PostgresWeekly newsletter archive are both good references that cover the version-by-version evolution of these operations.

The Single Takeaway

The migration that fails badly is the one that tries to do all of this in one statement. The migration that ships safely is the one that takes seven small, well-understood steps. The total engineering time is similar; the operational risk profile is wildly different.

The shape of database migration work in production is mostly choosing the boring path that does not hold long locks. Once that habit is routine, the migrations stop being interesting. Boring migrations are the goal.

DEV Community