Rohit Gavali

Lessons from Zero-Downtime Postgres Migrations That Nearly Took Prod Down

The migration was supposed to be routine. Add an index, update some constraints, deploy the new application code. Zero downtime, zero risk. We'd done this dozens of times.

Then at 2:47 PM on a Wednesday, our production database locked up. API response times spiked from 50ms to 30 seconds. User sessions started timing out. The queue of pending requests grew exponentially. Within ninety seconds, our entire platform was effectively down.

The migration was still running. The index creation we thought would take two minutes had been holding an exclusive lock for three minutes and counting. Every query waiting for that lock was blocking other queries. The cascade failure was complete.

We had to make a choice: kill the migration and restore service, or wait for it to complete and hope the platform survived. We killed it. Service restored in fifteen seconds. But the damage was done—users had experienced downtime we promised would never happen.

The irony: we had followed best practices. We'd tested the migration in staging. We'd verified the execution plan. We'd even calculated the expected lock time. Everything looked safe.

We were wrong about what "safe" meant.

The Confidence That Kills You

Zero-downtime migrations sound straightforward in theory. You design schema changes that don't require locking tables. You deploy code that works with both old and new schemas. You migrate data in small batches. You verify everything in staging.

This works beautifully until production has ten times the data volume, different query patterns, and active connections you can't replicate in testing.

Our staging database had 2 million rows in the table we were indexing. Production had 40 million. The index creation we tested took 90 seconds in staging. In production, it took over 5 minutes—and held an exclusive lock the entire time.

The lock wasn't technically required for index creation. Postgres supports CREATE INDEX CONCURRENTLY, which builds indexes without blocking writes. We knew this. We used it.

What we didn't account for: the table had active long-running transactions when the migration started. CREATE INDEX CONCURRENTLY waits for existing transactions to complete before it can proceed without blocking. In staging, there were no long-running transactions. In production, there were three.

One was an analytics query someone had kicked off five minutes earlier. Another was a batch job that had been running for eight minutes. The third was a zombie connection that had been idle in transaction for over an hour.

Our "concurrent" index creation waited for these transactions to complete. While waiting, it held locks that blocked new queries. The cascade began.

We had tested the migration. We just hadn't tested it under production conditions.
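The check that would have caught all three of those transactions is a single query. A minimal sketch against pg_stat_activity, with the five-minute threshold as an illustrative value:

```sql
-- Sessions with a transaction open longer than five minutes, including
-- "idle in transaction" zombies. Any row here is a session that
-- CREATE INDEX CONCURRENTLY may end up waiting on.
SELECT pid,
       state,
       now() - xact_start AS transaction_age,
       left(query, 80)    AS last_query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '5 minutes'
ORDER BY transaction_age DESC;
```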

What "Zero-Downtime" Actually Means

The term "zero-downtime migration" creates a dangerous illusion: that you can change database schemas without affecting system performance or availability.

This is technically possible. It's also rarely what actually happens.

Real zero-downtime migrations aren't about eliminating all impact. They're about controlling and minimizing impact in ways that users don't notice. There's a difference between "no user-facing downtime" and "no database impact."

Every schema change has impact. The question is whether that impact stays within acceptable boundaries or cascades into user-visible failures.

Acceptable impact: Slightly elevated CPU during index creation. Temporary increase in replication lag. Brief moments where query plans are suboptimal.

Unacceptable impact: Queries timing out. Connections refused. Response times degrading to the point where features stop working.

The line between these isn't clear until you cross it. And in production, you often don't know you've crossed it until the alerts start firing.

The Patterns That Fail

We analyzed our near-disaster and five other problematic migrations from the previous year. Patterns emerged—not in what we did wrong technically, but in what we assumed incorrectly.

Assumption one: Staging matches production. It never does. Production has more data, different data distribution, different query patterns, different connection behavior, different resource contention. A migration that runs perfectly in staging can behave completely differently in production.

We started actually measuring production conditions before migrations. Not just table sizes—connection counts, active transaction lengths, query patterns during the deployment window, disk I/O patterns. We'd use tools to analyze our database performance metrics over the previous week to understand what "normal" looked like.
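Most of the Postgres side of that baseline comes straight from pg_stat_activity. A rough sketch of the kind of snapshot we take during the planned deployment window:

```sql
-- Connection counts by state: how many sessions are active, idle, or
-- stuck "idle in transaction" during the window we plan to use.
SELECT state,
       count(*)                  AS sessions,
       max(now() - state_change) AS longest_in_state
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY state
ORDER BY sessions DESC;
```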

Assumption two: Lock duration is predictable. It's not. Even with CREATE INDEX CONCURRENTLY, locks can persist longer than expected. Even with carefully designed multi-phase migrations, unexpected locks can emerge.

We stopped trusting execution time estimates and started setting hard timeouts. If a migration step runs longer than expected, kill it. Better to abort cleanly than let it cascade into a full outage.
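In Postgres, the hard timeout can live in the migration session itself rather than in a wrapper script. A sketch with illustrative thresholds and a hypothetical index name:

```sql
-- Give up after 5 seconds of waiting on any lock, and cancel the whole
-- statement if it runs past 15 minutes. Both settings apply only to
-- this session.
SET lock_timeout = '5s';
SET statement_timeout = '15min';

-- Hypothetical index. CREATE INDEX CONCURRENTLY cannot run inside a
-- transaction block, so these statements are issued one at a time.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_customer_id
    ON orders (customer_id);

-- If the statement is cancelled partway through, Postgres leaves an
-- INVALID index behind. Clean it up before retrying:
-- DROP INDEX CONCURRENTLY IF EXISTS idx_orders_customer_id;
```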

Assumption three: You can test everything in advance. You can't. Production has edge cases you can't replicate. The combination of active queries, concurrent transactions, and resource contention creates scenarios that don't exist in testing.

We started treating every migration as a potential incident. Not pessimistically—pragmatically. We had rollback plans. We had monitoring during execution. We had clear criteria for when to abort.

Assumption four: If it worked before, it's safe now. Previous success doesn't guarantee future safety. Table sizes grow. Data distributions change. Query patterns evolve. A migration strategy that worked six months ago can fail today because conditions have changed.

We started reviewing migration approaches every quarter, not just reusing patterns that had worked previously.

The Multi-Phase Approach That Actually Works

After our production incident, we redesigned our migration process. Not the technical implementation—the operational approach.

Phase one: Make the schema compatible. Add new columns, tables, or indexes without removing anything old. The database now supports both old and new application code. This phase might degrade performance slightly, but it doesn't break anything.

We'd deploy this during low-traffic periods and monitor closely. If anything looked wrong, rollback was simple—just drop the new schema elements.
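In SQL terms, phase one is deliberately boring. A sketch with hypothetical names:

```sql
-- Additive change: a nullable column with no default is a metadata-only
-- change in modern Postgres. It still needs a brief ACCESS EXCLUSIVE
-- lock, so the same lock_timeout guard from earlier applies here too.
ALTER TABLE orders ADD COLUMN archived_at timestamptz;

-- Rollback is symmetrical and just as cheap:
-- ALTER TABLE orders DROP COLUMN archived_at;
```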

Phase two: Deploy application code that uses new schema. The application starts writing to new columns or using new indexes, but still maintains compatibility with old schema. Both versions of the application can coexist.

This is where we'd use AI to help review our code changes for potential edge cases—having a fresh set of eyes (even artificial ones) often caught assumptions we'd embedded in the migration logic.

Phase three: Migrate existing data. In small batches, during low-traffic periods, with extensive monitoring. If migration causes problems, we can pause or rollback without affecting new data.

For complex data transformations, we'd sometimes use Claude Sonnet 3.7 to help verify that our migration scripts caught all edge cases in the data—it's surprisingly good at spotting scenarios you didn't consider.
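The scripts themselves almost always reduce to the same batch loop. A simplified sketch, reusing the hypothetical archived_at column from phase one and an equally hypothetical legacy_archived source column:

```sql
-- One batch of the backfill. Application code or a small script runs
-- this in a loop, sleeping between batches, until it updates 0 rows.
WITH batch AS (
    SELECT id
    FROM orders
    WHERE archived_at IS NULL
      AND legacy_archived IS NOT NULL
    ORDER BY id
    LIMIT 5000
    FOR UPDATE SKIP LOCKED
)
UPDATE orders o
SET archived_at = o.legacy_archived
FROM batch
WHERE o.id = batch.id;
```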

Phase four: Remove old schema elements. Only after verifying that nothing is using them. This is often weeks after the migration started.

This approach is slower than "deploy everything at once." It's also far more reliable.

The Monitoring You Actually Need

Standard database monitoring tells you when things have already gone wrong. You need monitoring that tells you when things are about to go wrong.

Lock monitoring during migrations. We built custom tooling that watches for locks during migration execution. If any lock lasts longer than expected, if queries are queuing behind locks, if transaction wait times spike—abort the migration immediately.
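The heart of that tooling is one query run every few seconds while the migration executes. A minimal sketch:

```sql
-- Who is blocked, who is blocking them, and for how long. Any blocked
-- session whose blocker is the migration connection is an abort signal.
SELECT blocked.pid                 AS blocked_pid,
       now() - blocked.query_start AS blocked_for,
       left(blocked.query, 60)     AS blocked_query,
       blocking.pid                AS blocking_pid,
       left(blocking.query, 60)    AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid))
ORDER BY blocked_for DESC;
```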

Query performance tracking before and during migrations. We baseline query performance in the hours before a migration, then monitor for regressions during execution. A 2x slowdown might not trigger alerts, but it's a signal that something isn't working as expected.
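For the baseline itself we lean on pg_stat_statements; the column names below assume Postgres 13 or later:

```sql
-- Baseline: the twenty statements with the highest average runtime
-- (requires the pg_stat_statements extension). Re-run during the
-- migration and compare; a 2x jump in mean runtime is a warning even
-- if nothing is alerting yet.
SELECT left(query, 80)                   AS query,
       calls,
       round(mean_exec_time::numeric, 2) AS mean_ms,
       rows
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
```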

Connection pool monitoring. Migrations can exhaust connection pools in subtle ways. We watch for increasing connection acquisition times and pool exhaustion patterns.

Replication lag tracking. Schema changes can cause replication lag spikes. For systems relying on read replicas, this can cascade into user-visible issues even if the primary database is fine.
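On the primary, the lag numbers are already exposed in pg_stat_replication (Postgres 10 and later):

```sql
-- Per-replica lag as seen from the primary. replay_lag is the figure
-- that turns into stale reads on the replicas.
SELECT application_name,
       state,
       write_lag,
       flush_lag,
       replay_lag
FROM pg_stat_replication;
```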

We'd use analytical tools to quickly parse and visualize metrics during migrations, helping us spot patterns that would take too long to notice manually.

The Rollback Plan You Need Before You Start

The worst time to figure out rollback is when things are failing. We learned this by nearly making our outage worse.

When our index creation locked up production, we panicked briefly trying to remember the correct way to kill it safely. Could we just terminate the migration connection? Would that leave the database in a corrupted state? Should we wait for it to complete?

These questions should have been answered before we started.

Now every migration has a documented rollback procedure written before execution begins:

Immediate abort criteria. Clear thresholds for when to kill the migration. Lock duration exceeding X seconds. Query queue depth exceeding Y. Response time degradation beyond Z. No judgment calls during an incident—just follow the criteria.

Abort procedure. Exact commands to safely stop the migration. Not "kill the connection"—the specific SQL commands, in order, with expected outcomes for each.
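As a concrete example, here's roughly what that section looks like for an index migration. It assumes the migration connection sets application_name = 'schema_migration', which is our convention, not a Postgres default:

```sql
-- Step 1: ask the migration session to cancel its current statement.
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE application_name = 'schema_migration';

-- Step 2: if the session is still there after a few seconds, terminate it.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE application_name = 'schema_migration';

-- Step 3: a cancelled CREATE INDEX CONCURRENTLY leaves an INVALID index
-- behind. Find it, then remove it with DROP INDEX CONCURRENTLY.
SELECT indexrelid::regclass AS invalid_index
FROM pg_index
WHERE NOT indisvalid;
```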

Verification steps. How to confirm the database is in a stable state after abort. What queries to run, what metrics to check, what behaviors indicate success.

Rollback procedure if needed. If aborting the migration isn't enough, how to roll back schema changes. This is especially critical for multi-phase migrations where partial completion might leave the schema in an unexpected state.

Communication plan. Who to notify when aborting, what to tell them, how to coordinate with application deployments if needed.

Writing this before the migration forces you to think through failure modes clearly. You're not optimizing for success—you're optimizing for surviving failure.

What We Changed Permanently

The near-outage changed how we think about database migrations fundamentally.

We stopped doing migrations during business hours. Even "safe" migrations. The risk isn't worth the convenience. Nights and weekends aren't fun, but they give you breathing room if something goes wrong.

We started doing dry runs in production. Not the actual migration—test runs that verify assumptions. Check for long-running transactions before scheduling the migration. Verify connection counts are within expected ranges. Confirm query patterns match what we planned for.

We built in mandatory waiting periods. After deploying code that supports new schema, we wait at least 24 hours before migrating data. After migrating data, we wait at least a week before removing old schema. Rushing migrations causes problems.

We created a migration review process. Every migration gets reviewed by someone who didn't write it. Fresh eyes catch assumptions the original author embedded without realizing.

Using platforms like Crompt AI, where you can compare different AI model outputs, helped us during reviews—we'd ask multiple AIs to review migration scripts and identify potential issues. Different models caught different edge cases.

We started measuring migration success differently. Success isn't "the migration completed." Success is "the migration completed without user impact." We track metrics during every migration and review them afterward, even for successful migrations.

The Hard Truth

Zero-downtime database migrations are possible. They're also fragile, complex, and dependent on conditions you can't fully control.

Every migration carries risk. The question isn't whether to avoid risk—it's whether you've done enough to survive when things go wrong.

Testing in staging helps but doesn't eliminate uncertainty. Following best practices helps but doesn't guarantee success. Having smart engineers helps but doesn't prevent mistakes.

What helps most is accepting that migrations can fail and building systems that handle failure gracefully. Have rollback plans. Have abort criteria. Have monitoring that tells you when to bail out before user impact becomes severe.

The migration that nearly took down our production system followed all the best practices we knew at the time. It still almost failed catastrophically. Not because we were careless, but because production is different than staging in ways you can't fully predict.

The lesson isn't "don't do database migrations." It's "respect the complexity and plan for failure."

Your migrations will work ninety-nine times. It's the hundredth time—when production conditions align in ways you didn't anticipate—that determines whether your "zero-downtime" approach actually delivers zero downtime or just a slower disaster.

Managing complex database migrations? Use Crompt AI to review migration scripts, analyze patterns, and catch edge cases before they hit production—because the best outages are the ones you prevent.

-ROHIT
