Yash Pritwani

Posted on • Originally published at techsaas.cloud

Zero-Downtime Database Migration: Shadow Writes, Dual-Read, and the 12-Second Cutover


Database migrations are the scariest infrastructure change you can make. Your data is the one thing you absolutely cannot lose, corrupt, or make unavailable.

We migrated a 2TB PostgreSQL database to CockroachDB for a SaaS client with zero downtime, zero data loss, and a cutover that took 12 seconds. Here's the complete playbook.

Why Not Just pg_dump and Restore?

For a 2TB database, pg_dump takes roughly 4-8 hours depending on your hardware. During that time, your application is either down or writing data that won't be in the dump. You'd need a maintenance window, and for a SaaS product with global users, "maintenance windows" mean lost revenue and broken SLAs.

The shadow-write approach eliminates the maintenance window entirely.

Phase 1: Dual-Write Setup

The core idea: write every mutation to BOTH the old database (Postgres) and the new database (CockroachDB) simultaneously.

import asyncio
import logging
from collections import deque

log = logging.getLogger(__name__)

class DualWriteMiddleware:
    def __init__(self, primary_db, shadow_db):
        self.primary = primary_db    # Postgres (source of truth)
        self.shadow = shadow_db      # CockroachDB (catching up)
        self.shadow_failure_queue = deque()  # failed shadow writes, for replay

    async def execute_write(self, query, params):
        # Primary write — this is the source of truth
        result = await self.primary.execute(query, params)

        # Shadow write: capped at 5s; failures are logged and queued,
        # never surfaced to the user
        try:
            await asyncio.wait_for(
                self.shadow.execute(query, params),
                timeout=5.0
            )
        except Exception as e:
            log.warning(f"Shadow write failed: {e}")
            self.shadow_failure_queue.append((query, params))

        return result

Key rules:

  • Primary database (Postgres) is always the source of truth
  • Shadow writes are best-effort, capped by a short timeout; failures are logged and retried, never shown to users
  • A failure queue captures any shadow writes that fail, for replay later (a minimal replay worker is sketched below)
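
The replay side of that failure queue isn't shown in the post; here's a minimal sketch, assuming the (query, params) tuples the middleware appends to a deque and a shared CockroachDB handle. The worker itself is illustrative, not the client's actual code:

import asyncio
import logging

log = logging.getLogger(__name__)

async def replay_shadow_failures(queue, shadow_db, interval=30.0):
    # Periodically drain failed shadow writes back into CockroachDB.
    # Replay FIFO so a later UPDATE never lands before an earlier one.
    while True:
        await asyncio.sleep(interval)
        while queue:
            query, params = queue[0]
            try:
                await shadow_db.execute(query, params)
                queue.popleft()  # drop only after a successful replay
            except Exception as e:
                log.warning(f"Replay failed, will retry next cycle: {e}")
                break  # preserve ordering; back off until the next pass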

Duration: We ran dual-write for 2 weeks before moving to Phase 2.

Phase 2: Historical Data Migration

While dual-writes handle new data, you need to backfill historical data. We used a chunked migration approach:

# Migrate in 10,000-row chunks with checkpointing
python migrate.py \
  --source postgres://prod-primary:5432/app \
  --target cockroach://cockroach-cluster:26257/app \
  --table orders \
  --chunk-size 10000 \
  --checkpoint-file /tmp/migration-orders.checkpoint

The checkpoint file tracks the last migrated primary key, so you can restart the migration without re-processing. For a 2TB database, this took about 18 hours running at low priority (to avoid impacting production reads).
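
The internals of migrate.py aren't shown in the post; below is a minimal sketch of the chunked loop, assuming asyncpg-style connections (CockroachDB speaks the Postgres wire protocol), an integer primary key named id, and asyncpg's copy_records_to_table for the bulk insert. All names are illustrative:

import json
import os

def load_checkpoint(path):
    # Resume from the last migrated primary key, or start from zero
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["last_id"]
    return 0

def save_checkpoint(path, last_id):
    with open(path, "w") as f:
        json.dump({"last_id": last_id}, f)

async def migrate_table(source, target, table, chunk_size, checkpoint_file):
    last_id = load_checkpoint(checkpoint_file)
    while True:
        # Keyset pagination: WHERE id > $1 stays fast where OFFSET would not
        rows = await source.fetch(
            f"SELECT * FROM {table} WHERE id > $1 ORDER BY id LIMIT $2",
            last_id, chunk_size,
        )
        if not rows:
            break
        await target.copy_records_to_table(
            table, records=[tuple(r.values()) for r in rows]
        )
        last_id = rows[-1]["id"]
        save_checkpoint(checkpoint_file, last_id)  # restart-safe progress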

Phase 3: Shadow-Read Validation

This is where most migration guides stop, and where most migrations fail. Before cutting over reads, you need to validate that CockroachDB returns the same results as Postgres.

import logging

from prometheus_client import Counter

log = logging.getLogger(__name__)

class ShadowReadValidator:
    def __init__(self, primary_db, shadow_db):
        self.primary = primary_db   # Postgres
        self.shadow = shadow_db     # CockroachDB
        self.mismatch_counter = Counter(
            "shadow_read_mismatches_total",
            "Reads where CockroachDB disagreed with Postgres",
        )

    async def validate_read(self, query, params):
        # Read from both databases
        primary_result = await self.primary.execute(query, params)
        shadow_result = await self.shadow.execute(query, params)

        # Compare; mismatches are logged and counted, never user-facing
        if primary_result != shadow_result:
            log.error(f"READ MISMATCH: query={query}")
            log.error(f"  Postgres: {primary_result[:100]}")
            log.error(f"  CockroachDB: {shadow_result[:100]}")
            self.mismatch_counter.inc()

        # Always return the primary result
        return primary_result

We ran shadow-read validation on 10% of production read traffic for one week. Results:

  • 47 query incompatibilities found (mostly around timestamp precision and JSON operator differences)
  • 3 data mismatches (all from shadow-write failures that hadn't been replayed yet)
  • 0 correctness bugs in CockroachDB itself

Each incompatibility was fixed by updating the application query or adding a compatibility layer. This validation phase is the most valuable part of the entire migration — it catches problems before they affect users.
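
As one concrete illustration of the timestamp-precision class of incompatibility (the post doesn't list the individual fixes, so this normalization layer is an assumption): the two engines can return fractional seconds at different precisions, so the comparison can normalize both result sets before diffing:

import json
from datetime import datetime

def normalize(value):
    # Truncate timestamps to millisecond precision so precision-only
    # differences between the two engines don't count as mismatches
    if isinstance(value, datetime):
        return value.replace(microsecond=(value.microsecond // 1000) * 1000)
    # Parse JSON text into objects so key-ordering differences compare equal
    if isinstance(value, str) and value[:1] in ("{", "["):
        try:
            return json.loads(value)
        except ValueError:
            return value
    return value

def rows_match(primary_rows, shadow_rows):
    norm = lambda rows: [[normalize(v) for v in row] for row in rows]
    return norm(primary_rows) == norm(shadow_rows)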

Phase 4: Traffic Shifting

Once shadow-reads show zero mismatches for 48 hours, gradually shift read traffic:

# Feature flag configuration
database_read_routing:
  cockroach_percentage: 5          # start at 5%
  escalation_schedule:             # each step after 24h of clean metrics
    - { after: 24h, percentage: 20 }
    - { after: 24h, percentage: 50 }
    - { after: 24h, percentage: 80 }
    - { after: 24h, percentage: 100 }
  rollback_trigger:
    error_rate_threshold: 0.1%
    latency_p99_threshold: 500ms
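
One way application code can consume that flag, assuming a deterministic hash so a given request always hits the same database while the percentage holds (names here are illustrative, not the client's actual router):

import hashlib

class ReadRouter:
    def __init__(self, primary_db, cockroach_db, cockroach_percentage):
        self.primary = primary_db
        self.cockroach = cockroach_db
        self.percentage = cockroach_percentage  # from the feature flag

    def pick(self, request_id: str):
        # Deterministic bucketing: the same request id always maps to the
        # same database, which keeps per-user behavior stable mid-rollout
        bucket = int(hashlib.md5(request_id.encode()).hexdigest(), 16) % 100
        return self.cockroach if bucket < self.percentage else self.primary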

At each stage, monitor:

  • Error rates (should be identical or better)
  • Latency p50/p95/p99 (CockroachDB was 15% faster for our read patterns)
  • Data consistency (shadow-read mismatches should stay at 0)

Phase 5: The 12-Second Cutover

Once 100% of reads are going to CockroachDB successfully:

  1. Stop dual-writes (Postgres stops receiving new data)
  2. Drain any remaining shadow-write failure queue
  3. Final consistency check (compare row counts, checksums on critical tables)
  4. Update connection strings to point to CockroachDB
  5. Restart application pools

Steps 1-5 took 12 seconds in our case. The application experienced zero errors during cutover because reads were already going to CockroachDB.
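
For concreteness, here's a sketch of what those five steps can look like scripted; the flag client, pool manager, and table list are assumptions, not the actual tooling used:

async def cutover(flags, shadow_failure_queue, pg, crdb, pools):
    # 1. Stop dual-writes: new mutations now go to CockroachDB only
    await flags.set("dual_write_enabled", False)

    # 2. Drain any remaining failed shadow writes
    while shadow_failure_queue:
        query, params = shadow_failure_queue.popleft()
        await crdb.execute(query, params)

    # 3. Final consistency check on critical tables (asyncpg-style fetchval)
    for table in ("orders", "users", "payments"):
        pg_count = await pg.fetchval(f"SELECT count(*) FROM {table}")
        crdb_count = await crdb.fetchval(f"SELECT count(*) FROM {table}")
        assert pg_count == crdb_count, f"{table}: {pg_count} != {crdb_count}"

    # 4. Point the application at CockroachDB
    await flags.set("database_url", "cockroach://cockroach-cluster:26257/app")

    # 5. Recycle connection pools so every worker picks up the new URL
    await pools.restart()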

Post-Migration

Keep Postgres running in read-only mode for 30 days as a safety net. If anything goes wrong, you can revert by switching connection strings back. After 30 days with no issues, decommission the Postgres instance.
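
Flipping the old primary to read-only is standard Postgres; a minimal way to do it (superuser required, connection string is a placeholder):

import psycopg2

conn = psycopg2.connect("postgresql://admin@prod-primary:5432/app")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction
with conn.cursor() as cur:
    cur.execute("ALTER SYSTEM SET default_transaction_read_only = on")
    cur.execute("SELECT pg_reload_conf()")  # apply without a restart
# Note: sessions can still override this setting, so it's a guard against
# well-behaved writers rather than a hard lock.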

Lessons Learned

  1. Shadow-read validation catches 95% of migration bugs. Don't skip it.
  2. The failure queue is critical. Without it, your shadow database will have data gaps.
  3. Run dual-write for at least 2 weeks. One week isn't enough to catch all edge cases.
  4. Monitor CockroachDB performance during the migration. Backfilling 2TB while handling dual-writes is a significant load.
  5. Test rollback before you need it. We practiced the rollback procedure three times before the actual migration.

Planning a database migration? We've done zero-downtime migrations for databases from 100GB to 5TB. Book a consultation or explore our infrastructure services.
