53/60 Days System Design Questions

#abotwrotethis #database #backend #software

Your migration ran fine in staging.

Then you ran it in production.

The app went down.

Not because the SQL was wrong. Because you ran it on a live table with 40 million rows while 8 services were actively writing to it.

Your setup:
→ PostgreSQL. users table. 40M rows. Active writes from 8 services.
→ Product request: split full_name into first_name + last_name.
→ You have a 2-hour maintenance window tonight.

The engineering question: how do you ship this without downtime?

A) Run ALTER TABLE to drop full_name and add first_name + last_name in a single migration during the maintenance window.

B) Add first_name + last_name as nullable columns first → backfill → update all services to write to both → drop full_name only after everything is migrated.

C) Create a new users_v2 table with the target schema → dual-write to both tables → flip the read pointer → drain the old table.

D) Add a DB view that aliases full_name as first_name || ' ' || last_name → let each service migrate off it at its own pace.

Drop your answer 👇

Top comments (4)

Joud Awad • Jun 28

Why A will get you paged

ALTER TABLE on a live table acquires an ACCESS EXCLUSIVE lock in PostgreSQL. It blocks every read and write for the duration. On a 40M row table with active traffic, that lock holds for minutes — not seconds. Your 8 services time out. The queue backs up. Alerts fire. CONCURRENTLY helps for index creation. It doesn't help for column drops or renames. The maintenance window doesn't save you — the lock is the problem, not the timing.

Joud Awad • Jun 28

Why B is correct (expand/contract)

Two phases. Phase 1 — Expand: add first_name and last_name as nullable columns. On PostgreSQL, adding a nullable column with no default is a metadata-only operation — near-instant, no lock. Backfill in small batches (rate-limited, no row lock escalation). Update services to write to both columns. Phase 2 — Contract: once every service is writing to the new columns and reads are fully migrated, drop full_name. You control the timing. Every step is independently reversible. If something breaks after Phase 1, you stop — the old column still exists, old services still work. Slower than A? Yes. 3am pages? Zero.

Joud Awad • Jun 28

Why C is overkill ⚠️

Shadow table + dual-write works. But it requires a dual-write layer in every service, an atomic read-pointer flip, and a drain/cleanup phase. Worth it for major structural changes: new primary key, re-partitioning, changing storage engines. For a column split? That's three weeks of work for a one-week problem. Stripe and GitHub use this for genuine rewrites, not column renames.

Joud Awad • Jun 28

Why D defers the problem 🟡

A view aliasing full_name buys time. Useful as a transition layer inside B. But PostgreSQL write-through views only work on simple single-table views with no joins or aggregates. You still need to backfill the real columns. It defers the migration — it doesn't replace it.