DEV Community

Cover image for 53/60 Days System Design Questions
Joud Awad
Joud Awad

Posted on

53/60 Days System Design Questions

Your migration ran fine in staging.

Then you ran it in production.

The app went down.

Not because the SQL was wrong. Because you ran it on a live table with 40 million rows while 8 services were actively writing to it.

Your setup:
→ PostgreSQL. users table. 40M rows. Active writes from 8 services.
→ Product request: split full_name into first_name + last_name.
→ You have a 2-hour maintenance window tonight.

The engineering question: how do you ship this without downtime?

A) Run ALTER TABLE to drop full_name and add first_name + last_name in a single migration during the maintenance window.

B) Add first_name + last_name as nullable columns first → backfill → update all services to write to both → drop full_name only after everything is migrated.

C) Create a new users_v2 table with the target schema → dual-write to both tables → flip the read pointer → drain the old table.

D) Add a DB view that aliases full_name as first_name || ' ' || last_name → let each service migrate off it at its own pace.

Drop your answer 👇

Top comments (4)

Collapse
 
thejoud1997 profile image
Joud Awad

Why A will get you paged

ALTER TABLE on a live table acquires an ACCESS EXCLUSIVE lock in PostgreSQL. It blocks every read and write for the duration. On a 40M row table with active traffic, that lock holds for minutes — not seconds. Your 8 services time out. The queue backs up. Alerts fire. CONCURRENTLY helps for index creation. It doesn't help for column drops or renames. The maintenance window doesn't save you — the lock is the problem, not the timing.

Collapse
 
thejoud1997 profile image
Joud Awad

Why B is correct (expand/contract)

Two phases. Phase 1 — Expand: add first_name and last_name as nullable columns. On PostgreSQL, adding a nullable column with no default is a metadata-only operation — near-instant, no lock. Backfill in small batches (rate-limited, no row lock escalation). Update services to write to both columns. Phase 2 — Contract: once every service is writing to the new columns and reads are fully migrated, drop full_name. You control the timing. Every step is independently reversible. If something breaks after Phase 1, you stop — the old column still exists, old services still work. Slower than A? Yes. 3am pages? Zero.

Collapse
 
thejoud1997 profile image
Joud Awad

Why C is overkill ⚠️

Shadow table + dual-write works. But it requires a dual-write layer in every service, an atomic read-pointer flip, and a drain/cleanup phase. Worth it for major structural changes: new primary key, re-partitioning, changing storage engines. For a column split? That's three weeks of work for a one-week problem. Stripe and GitHub use this for genuine rewrites, not column renames.

Collapse
 
thejoud1997 profile image
Joud Awad

Why D defers the problem 🟡

A view aliasing full_name buys time. Useful as a transition layer inside B. But PostgreSQL write-through views only work on simple single-table views with no joins or aggregates. You still need to backfill the real columns. It defers the migration — it doesn't replace it.