Leena Malhotra

Lessons from Migrating a Live Postgres Schema Without Downtime

We had 47 tables, 280 million rows, and a promise we couldn't break: zero downtime during the migration.

The schema redesign was necessary. Our original database structure made sense when we launched two years ago with 5,000 users. Now we had 200,000 active users, and queries that once took milliseconds were timing out. Joins were crossing six tables to fetch basic user data. Indexes were bloated. Our data model had become a performance bottleneck we couldn't ignore.

But we couldn't just flip a switch. Our application served requests 24/7 across multiple time zones. A single second of downtime meant failed transactions, interrupted user sessions, and angry customers demanding refunds. The business made it clear: migrate the schema, but keep the lights on.

This is the story of how we pulled it off—and the lessons that only come from doing it wrong the first time.

The Migration Nobody Plans For

Most database migration guides assume you can take downtime. They walk you through elegant solutions involving maintenance windows, schema dumps, and clean cutovers. Real-world migrations aren't like that.

You can't stop the application. You can't pause incoming writes. You can't coordinate a global "quiet period" when everyone agrees to stop using your product for an hour. The database has to keep serving traffic while you fundamentally restructure how data is stored and accessed.

Our first attempt failed spectacularly. We tried a dual-write approach: write to both old and new schemas simultaneously, backfill historical data, then cut over when they were in sync. Simple in theory. Catastrophic in practice.

The dual writes created race conditions we hadn't anticipated. Data written to the old schema didn't always propagate to the new schema before being read. Users saw stale data, then fresh data, then stale data again. Cache invalidation became a nightmare. Database locks started piling up. Query performance degraded because every write was now hitting two schemas.

We rolled back after six hours of chaos, restored from backups, and accepted that we didn't actually know how to do this.

What We Learned the Hard Way

Lesson one: You can't migrate everything at once. We initially tried to move all 47 tables in a coordinated big-bang migration. The complexity was unmanageable. Instead, we broke it into 12 phases, each handling a cluster of related tables. Some phases took days. Some took weeks. But each was small enough to reason about and roll back independently.

Lesson two: Your application needs to speak both languages. The killer insight was building an abstraction layer that could read from either schema and write to both. We created a repository pattern that hid schema differences from application code. When we started migration, the code could handle requests regardless of which schema held the authoritative data.
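To make that concrete, here's a stripped-down sketch of what that repository layer looks like in spirit. The table and column names (`users_v2`, `contacts`, `profiles`) are illustrative, not our actual schema:

```python
# Minimal sketch of a dual-schema repository (psycopg2; names are illustrative).


class UserRepository:
    """Hides schema differences from the rest of the application."""

    def __init__(self, conn, read_from_new: bool):
        self.conn = conn
        self.read_from_new = read_from_new  # driven by a feature flag

    def get_user(self, user_id: int):
        with self.conn.cursor() as cur:
            if self.read_from_new:
                # New schema: one denormalized row per user.
                cur.execute(
                    "SELECT id, email, display_name FROM users_v2 WHERE id = %s",
                    (user_id,),
                )
            else:
                # Old schema: the same data spread across three tables.
                cur.execute(
                    """
                    SELECT u.id, c.email, p.display_name
                    FROM users u
                    JOIN contacts c ON c.user_id = u.id
                    JOIN profiles p ON p.user_id = u.id
                    WHERE u.id = %s
                    """,
                    (user_id,),
                )
            row = cur.fetchone()
        if row is None:
            return None
        return {"id": row[0], "email": row[1], "display_name": row[2]}
```

Because every caller goes through this interface, flipping `read_from_new` later is a configuration change rather than a code change.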

Lesson three: Backfilling is harder than forward migration. Moving new data is straightforward—you control the writes. Historical data is the nightmare. We had years of records to migrate, and doing it all at once would lock tables for hours. We built a backfill system that processed data in small batches during low-traffic periods, tracking progress and resuming after interruptions.
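The shape of that backfill step is roughly the sketch below. The checkpoint table and the transform query are illustrative; the real system also handled per-table configuration and conflict investigation.

```python
# One resumable backfill step: a checkpoint row records the last primary key
# copied, so the job can stop and restart at any time.
BATCH_SIZE = 10_000


def backfill_batch(conn) -> int:
    """Copy the next batch of historical rows; return how many ids were handled."""
    with conn.cursor() as cur:
        cur.execute(
            "SELECT last_id FROM backfill_checkpoint WHERE table_name = 'users'"
        )
        last_id = cur.fetchone()[0]

        # Pick the next slice of source ids after the checkpoint.
        cur.execute(
            "SELECT id FROM users WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, BATCH_SIZE),
        )
        ids = [row[0] for row in cur.fetchall()]
        if not ids:
            return 0  # nothing left: backfill for this table is done

        # Transform old-schema rows into the new shape. Rows already copied
        # by live writes are skipped instead of overwritten.
        cur.execute(
            """
            INSERT INTO users_v2 (id, email, display_name)
            SELECT u.id, c.email, p.display_name
            FROM users u
            JOIN contacts c ON c.user_id = u.id
            JOIN profiles p ON p.user_id = u.id
            WHERE u.id = ANY(%s)
            ON CONFLICT (id) DO NOTHING
            """,
            (ids,),
        )
        cur.execute(
            "UPDATE backfill_checkpoint SET last_id = %s WHERE table_name = 'users'",
            (ids[-1],),
        )
    conn.commit()
    return len(ids)
```

Keeping the batch copy and the checkpoint update in one transaction means an interrupted run simply picks up where it left off.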

Lesson four: Testing in production is the only testing that matters. We had staging environments. We had test databases with production data snapshots. None of it prepared us for real production behavior. The query patterns were different. The lock contention was different. The edge cases were different. We ended up using feature flags and gradual rollouts to test migration phases against real traffic with the ability to roll back instantly.

The Architecture That Worked

After our failed first attempt, we designed a migration architecture that could handle the reality of a live system.

Shadow writing became our foundation. Instead of dual-writing to both schemas simultaneously, we wrote to the old schema (the source of truth) and asynchronously propagated changes to the new schema. This eliminated race conditions and kept database locks from stacking up.
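In code, one way to express that flow is an outbox-style sketch like the one below: the old-schema write and a change record go into the same transaction, and a background worker drains the queue into the new schema. Table names and the single-column update are simplified stand-ins.

```python
import json


def update_email(conn, user_id: int, email: str):
    """Synchronous path: write the old schema and record the change atomically."""
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE contacts SET email = %s WHERE user_id = %s",
            (email, user_id),
        )
        # Assumed outbox shape: schema_outbox(id bigserial, entity text,
        # entity_id bigint, payload jsonb, created_at timestamptz default now())
        cur.execute(
            "INSERT INTO schema_outbox (entity, entity_id, payload) VALUES (%s, %s, %s)",
            ("user", user_id, json.dumps({"email": email})),
        )
    conn.commit()


def drain_outbox(conn, batch_size: int = 500):
    """Async worker: apply pending changes to the new schema, oldest first."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, entity_id, payload FROM schema_outbox
            ORDER BY id
            LIMIT %s
            FOR UPDATE SKIP LOCKED
            """,
            (batch_size,),
        )
        for outbox_id, user_id, payload in cur.fetchall():
            change = payload if isinstance(payload, dict) else json.loads(payload)
            cur.execute(
                "UPDATE users_v2 SET email = %s WHERE id = %s",
                (change["email"], user_id),
            )
            cur.execute("DELETE FROM schema_outbox WHERE id = %s", (outbox_id,))
    conn.commit()
```

`FOR UPDATE SKIP LOCKED` lets several workers drain the queue without blocking each other, and because the outbox row is written in the same transaction as the old-schema write, a change can be delayed but never lost.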

Read routing logic let us gradually shift traffic from old to new schema. We started by routing 1% of read queries to the new schema. If metrics looked good, we increased to 5%, then 10%, then 50%. When something broke—and things did break—we could route traffic back to the old schema while we debugged.
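The routing decision itself can be tiny. Here's a sketch; the flag lookup is a stand-in for whatever feature-flag client you use:

```python
import zlib


def use_new_schema(user_id: int, rollout_percent: int) -> bool:
    """Deterministically bucket users so each one always hits the same schema."""
    bucket = zlib.crc32(str(user_id).encode("utf-8")) % 100
    return bucket < rollout_percent


# In the request path (flag client shown here is hypothetical):
#   rollout = flags.get_int("users_read_from_v2_percent", default=0)
#   repo = UserRepository(conn, read_from_new=use_new_schema(user_id, rollout))
```

Hashing on the user id instead of rolling a random number per request means a given user never flip-flops between schemas mid-session, and because the percentage lives in a flag, dropping it back to zero routes everything to the old schema without a deploy.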

Continuous validation ran in the background, comparing old and new schemas for consistency. We sampled random records, compared their representations across both schemas, and flagged discrepancies. This caught data transformation bugs that would have been invisible until users complained.
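The checker itself doesn't need to be sophisticated. A sketch of the idea, with queries simplified:

```python
def validate_sample(conn, sample_percent: float = 0.001):
    """Compare a random sample of users across both schemas; return mismatches."""
    mismatches = []
    with conn.cursor() as cur:
        # TABLESAMPLE keeps the sampling cheap on a large table.
        cur.execute(
            "SELECT id FROM users TABLESAMPLE SYSTEM (%s)",
            (sample_percent,),
        )
        ids = [row[0] for row in cur.fetchall()]
        for user_id in ids:
            cur.execute(
                """
                SELECT c.email, p.display_name
                FROM users u
                JOIN contacts c ON c.user_id = u.id
                JOIN profiles p ON p.user_id = u.id
                WHERE u.id = %s
                """,
                (user_id,),
            )
            old_row = cur.fetchone()
            cur.execute(
                "SELECT email, display_name FROM users_v2 WHERE id = %s",
                (user_id,),
            )
            new_row = cur.fetchone()
            # Very recent writes may still be in flight to the new schema;
            # a persistent mismatch is the signal, not a transient one.
            if old_row != new_row:
                mismatches.append((user_id, old_row, new_row))
    conn.commit()
    return mismatches  # feed these into alerting or a reconciliation queue
```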

Incremental backfill processed historical data in 10,000-row batches with built-in throttling. If database CPU spiked or query latency increased, the backfill paused automatically. We used task prioritization to schedule backfill jobs during off-peak hours, ensuring migration work didn't degrade user experience.
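The throttle wraps the batch function from the earlier backfill sketch: measure how long each batch takes, peek at a cheap load signal, and back off when the database looks busy. Our real check combined CPU and query-latency metrics; the thresholds and the `pg_stat_activity` signal here are illustrative stand-ins.

```python
import time


def run_backfill(conn, max_batch_seconds: float = 2.0, max_active_queries: int = 50):
    """Drive backfill_batch() (from the earlier sketch) with adaptive throttling."""
    while True:
        started = time.monotonic()
        moved = backfill_batch(conn)
        if moved == 0:
            break  # this table is fully backfilled

        batch_seconds = time.monotonic() - started

        # Cheap load signal: how many queries are currently executing.
        with conn.cursor() as cur:
            cur.execute(
                "SELECT count(*) FROM pg_stat_activity WHERE state = 'active'"
            )
            active = cur.fetchone()[0]
        conn.commit()

        # Slow batches or a busy database mean user traffic comes first.
        if batch_seconds > max_batch_seconds or active > max_active_queries:
            time.sleep(30)
        else:
            time.sleep(1)
```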

The Unexpected Problems

We anticipated most of the technical challenges. What surprised us were the second-order effects.

Monitoring became unreliable. Our observability stack tracked metrics based on schema structure. During migration, we had two schemas with different table names, different column names, different indexes. Half our dashboards stopped making sense. We had to rebuild monitoring to understand both schemas simultaneously and eventually created custom analytics to track migration progress and data consistency across both systems.

Database backups doubled in size and duration. We were running both schemas in parallel, effectively duplicating our entire dataset. Backup windows that used to take 45 minutes stretched to two hours. Storage costs ballooned. We had to negotiate emergency budget approval because we hadn't accounted for the temporary doubling of database footprint.

Foreign key constraints became migration blockers. Tables with foreign key relationships couldn't be migrated independently. We had to carefully orchestrate migration order, sometimes temporarily dropping constraints, migrating data, then recreating them. Each constraint violation had to be investigated and resolved before we could proceed.
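Where a constraint did have to come back, recreating it as `NOT VALID` and validating it in a separate step kept the table usable while Postgres checked existing rows. A sketch, with invented names:

```python
def recreate_user_fk(conn):
    """Drop, later recreate, then validate a foreign key without a long lock."""
    with conn.cursor() as cur:
        cur.execute(
            "ALTER TABLE orders_v2 DROP CONSTRAINT IF EXISTS orders_v2_user_id_fkey"
        )
    conn.commit()

    # ... migrate the data for this phase here ...

    with conn.cursor() as cur:
        # NOT VALID skips checking existing rows, so this returns quickly.
        cur.execute(
            """
            ALTER TABLE orders_v2
            ADD CONSTRAINT orders_v2_user_id_fkey
            FOREIGN KEY (user_id) REFERENCES users_v2 (id) NOT VALID
            """
        )
    conn.commit()

    with conn.cursor() as cur:
        # VALIDATE scans the table but takes a weaker lock than a plain
        # ADD CONSTRAINT, so reads and writes keep flowing.
        cur.execute(
            "ALTER TABLE orders_v2 VALIDATE CONSTRAINT orders_v2_user_id_fkey"
        )
    conn.commit()
```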

Application deployment dependencies multiplied. Code that worked with the old schema had to be deployed before we could migrate those tables. Code that worked with the new schema couldn't be deployed until migration was complete. We created a complex deployment choreography that had to be executed in precise order.

The Rollback Strategy Nobody Wants to Use

Every migration guide tells you to have a rollback plan. Nobody tells you what that actually looks like when you're three weeks into a six-week migration with half your data in each schema.

We built three levels of rollback capability:

Instant routing rollback could redirect all traffic back to the old schema in seconds using feature flags. This saved us twice when bugs in the new schema caused production incidents.

Table-level migration reversal let us undo individual table migrations without affecting others. Each migration phase was reversible independently, so a problem with user authentication tables didn't force us to roll back unrelated payment data migrations.

Full disaster recovery involved point-in-time recovery to before migration started, but we designed this as the nuclear option we'd only use if everything else failed. We never needed it, but knowing we could recover from catastrophic failure made the entire team more willing to take calculated risks.

The psychological safety of comprehensive rollback plans meant we could be aggressive about pushing migration forward, knowing we could retreat if necessary.

What Actually Takes the Time

The actual database migration—moving data from old schema to new—was maybe 20% of the effort. The rest was operational overhead that nobody warns you about.

Building dual-schema application code consumed weeks. Every database interaction had to be abstracted behind interfaces that could work with either schema. We used AI code assistance to help refactor our repository layer, but even with tooling support, touching every database query in a large codebase is tedious, error-prone work.

Data validation and reconciliation never ended. Even after backfill completed, we ran continuous comparison jobs to catch drift between schemas. Small bugs in transformation logic caused subtle inconsistencies that took days to track down.

Coordination across teams became a project in itself. Frontend engineers needed to know when API contracts would change. DevOps needed to manage database resources and deployment sequencing. Customer support needed to understand potential issues and how to escalate them. Product management needed to know which features might behave strangely during migration.

Documentation and runbooks multiplied because normal operational procedures didn't apply during migration. How do you restore from backup when you have two schemas? How do you investigate a bug when you don't know which schema served the request? We created comprehensive documentation covering every scenario we could think of, and still got surprised.

The Metrics That Mattered

Standard database metrics told us almost nothing useful during migration. CPU utilization, disk I/O, connection pool usage—all of these were elevated and stayed elevated for weeks. We needed different signals.

Schema sync lag measured how far behind the new schema was from the old schema. If a write to the old schema took more than five seconds to propagate to the new schema, something was wrong with our replication pipeline.
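With an outbox-style pipeline like the sketch earlier, lag is simply the age of the oldest change that hasn't been applied yet (column names assumed from that sketch):

```python
def sync_lag_seconds(conn) -> float:
    """Age in seconds of the oldest unapplied change (0 if the outbox is empty)."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT COALESCE(EXTRACT(EPOCH FROM now() - min(created_at)), 0)
            FROM schema_outbox
            """
        )
        lag = float(cur.fetchone()[0])
    conn.commit()
    return lag  # alert when this stays above the five-second threshold
```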

Validation error rate tracked how often old and new schemas disagreed. Early in migration this sat at 15-20% because backfill was incomplete; as backfill caught up, we expected it to drop toward zero. When it spiked instead, we knew the transformation logic had bugs.

Read routing distribution showed what percentage of traffic was served by each schema. This let us gradually increase load on the new schema while monitoring for degradation.

Backfill throughput measured rows migrated per hour. When this dropped, it meant either we hit a data inconsistency that required manual intervention, or database load was high enough that we needed to throttle.

User-reported issues became our ultimate validation metric. We tracked support tickets, user complaints, and bug reports. If any metric spiked during a migration phase, we paused and investigated before proceeding.

The Lessons That Transferred

This migration taught me patterns that apply beyond database schemas.

Make reversibility a first-class requirement. Every change should be undoable. Every deployment should be safe to roll back. Every migration step should have a tested reverse procedure. The confidence to move forward comes from knowing you can move backward.

Build observability before you need it. We should have had schema-agnostic monitoring from day one. Instead, we built it frantically during migration. The best time to add comprehensive instrumentation is before chaos, not during it.

Test the rollback as thoroughly as the migration. We ran rollback drills weekly, timing how long each reversal took and noting what broke along the way. This caught bugs in our rollback procedures that would have caused disasters if we'd discovered them during a real incident.

Communicate relentlessly. We posted daily migration updates in Slack. We held weekly migration review meetings. We maintained a dashboard showing migration progress. Over-communication prevented surprises and kept everyone aligned on status and risks.

Accept that perfect planning is impossible. Despite months of preparation, we still encountered unexpected problems weekly. The goal isn't to anticipate everything—it's to build systems robust enough to handle the unanticipated.

The Final Push

Six weeks after starting, we finally served 100% of traffic from the new schema. The old schema sat idle, ready as a fallback if disaster struck. We kept it running for another two weeks, monitoring everything, before finally declaring victory and beginning cleanup.

Total migration duration: eight weeks from first dual-write to complete cutover. Zero seconds of user-facing downtime. Zero data loss. Dozens of lessons learned the hard way.

The new schema performs beautifully. Queries that used to take 800ms now complete in 40ms. The data model is cleaner, more maintainable, and ready to scale to the next million users. We're already planning the next migration because database schemas, like all software, eventually accumulate enough technical debt that restructuring becomes necessary.

But next time, we'll start with the lessons from this migration. We'll build dual-schema support from day one. We'll implement shadow writing before we need it. We'll have robust rollback procedures tested and ready. We'll know that the hard part isn't the database migration—it's keeping the system running while we change its foundation.

What You Should Remember

If you're facing a similar migration, here's what actually matters:

Break it into phases small enough to understand and reverse. Build application code that can speak both old and new dialects. Test rollback procedures as rigorously as migration procedures. Use tools like Gemini 2.5 Flash to help analyze complex data transformations and verify migration logic before running it in production.

Accept that planning eliminates some surprises but not all of them. The goal is building systems resilient enough to handle what you didn't anticipate.

Your migration will take longer than you think. It will surface problems you didn't know existed. It will require more coordination and communication than seems reasonable. And at the end, when users don't notice anything changed, you'll know you did it right.

Zero downtime isn't about perfection—it's about building enough safety mechanisms that imperfection doesn't cause catastrophe.


Planning a zero-downtime migration? Use Crompt AI to help validate transformation logic, generate test cases, and analyze edge cases before they hit production.
