How to add a NOT NULL column to a large table safely in Rails

#rails #postgressql #migrations #database

A migration that runs in two milliseconds on your laptop can lock a production table for thirty seconds and pile up every request behind it. The gap is data. Your dev table has twelve rows. The production table has twelve million. An interviewer asked me how I'd add a required status column to a large orders table, and the real answer is that you don't do it in one migration, you do it in four small ones.

First, the part people skip.

What a migration actually is

A migration is a versioned schema change written as Ruby. Each one has a timestamp, Rails records which have run in a schema_migrations table, and db/schema.rb always reflects the current shape of the database. The win is that schema changes become reviewable code that every environment applies in the same order, instead of someone SSHing into production and running ALTER TABLE by hand at 11pm.

class AddEmailToUsers < ActiveRecord::Migration[8.0]
  def change
    add_column :users, :email, :string
    add_index  :users, :email, unique: true
  end
end

That migration is totally fine on a small table. The trouble only shows up at scale, so that's where the interesting decisions live.

What changes when the table is big and live

On your machine the migration runs against an empty table while nothing else is happening. In production it runs against millions of rows while real traffic hits the same table. Four things start to matter that never mattered in dev.

Locks. Some operations grab a lock that blocks reads or writes while they run. Hold that lock on a busy table for a few seconds and requests stack up behind it until something times out.

Table size. Anything that has to touch every row, like a backfill or a column rewrite, takes time proportional to row count. Twelve million rows is a different animal from twelve.

Rolling deploys. During a deploy, old and new versions of your code run at the same time against the same database. The schema has to work for both versions at every moment. This is the part that catches people, and it's why a single migration often can't be safe.

Reversibility. If a deploy goes wrong you want to roll back cleanly, so I want each step to be reversible on its own.

The trap

Here's the migration almost everyone writes first:

add_column :orders, :status, :string, null: false, default: "pending"

One line, reads great, and it's a landmine on a large table. Two problems. On older Postgres, adding a column with a default rewrites every single row while holding a lock, so the table is frozen for the length of that rewrite. And the null: false breaks your rolling deploy the instant the old code, which knows nothing about status, inserts an order without it.

The fix is to stop thinking in one migration and start thinking in phases, each one safe on its own and shipped separately.

Step 1: add the column, nullable

class AddStatusToOrders < ActiveRecord::Migration[8.0]
  def change
    add_column :orders, :status, :string, default: "pending"
  end
end

On Postgres 11 and up, adding a column with a constant default is a metadata change. It does not rewrite the table, so it returns almost instantly. New rows get "pending". Old rows stay NULL for now, and that's fine, because the column is still nullable and old code can keep inserting without touching it.

Step 2: backfill the old rows in batches

class BackfillOrderStatus < ActiveRecord::Migration[8.0]
  disable_ddl_transaction!

  def up
    Order.where(status: nil).in_batches(of: 10_000) do |batch|
      batch.update_all(status: "pending")
      sleep(0.1)
    end
  end

  def down
    # nothing to undo
  end
end

The point of batching is to never hold a long lock. Ten thousand rows at a time, each in its own quick transaction, with a short pause so other queries get a turn. Updating all twelve million in one statement would be one enormous transaction sitting on a lock, which is the thing we're trying to avoid.

Step 3: ship the code that sets status

Deploy your application change so every path that creates an order sets status explicitly. After this point, nothing writes a NULL anymore. This is a code deploy, not a migration, and it has to land before the next step.

Step 4: enforce NOT NULL safely

Now that the column is fully populated and nobody writes NULL, you can add the constraint. Doing it the blunt way still scans the whole table under a lock to verify, so on Postgres I validate it in two moves:

class MakeOrderStatusNotNull < ActiveRecord::Migration[8.0]
  def up
    add_check_constraint :orders, "status IS NOT NULL",
      name: "orders_status_not_null", validate: false
    validate_check_constraint :orders, name: "orders_status_not_null"
  end

  def down
    remove_check_constraint :orders, name: "orders_status_not_null"
  end
end

Adding the constraint as validate: false is instant because it doesn't check existing rows yet. Then validate_check_constraint checks them without taking the heavy lock that a plain NOT NULL would. Once it passes, you have the guarantee you wanted and nobody noticed.

The short version

A migration is versioned schema-as-code, and the moment a table gets big and busy you care about locks, table size, rolling deploys, and clean rollbacks. To add a required column: add it nullable, backfill in batches, deploy code that sets it, then enforce NOT NULL. The one rule worth tattooing on your hand is never add a NOT NULL column with a backfill in a single migration on a large table.

One tool that makes this automatic: the strong_migrations gem. It flags unsafe migrations in code review and prints the safe version for you, so you catch the landmine before it ships instead of during the incident.