Mateen Anjum

I Built a Zero-Downtime Database Migration Pipeline (PostgreSQL to Aurora)

TL;DR: I created a production-ready framework for migrating PostgreSQL databases to Aurora with zero downtime using AWS DMS, Terraform, and automated validation scripts.

The Problem

Last year, I inherited a project running on self-managed PostgreSQL. The database had grown to 500GB, and we were spending too much time on maintenance: patching, backups, replication issues, the usual headaches. The decision was made to move to Aurora PostgreSQL for managed operations, better scalability, and native AWS integration.

The catch? This was a production database serving 50,000 daily active users. Any downtime meant lost revenue and angry customers. The business gave us a maintenance window of... zero minutes. No pressure.

What I Tried First (And Why It Didn't Work)

My initial thought was simple: pg_dump and pg_restore during a quiet period. Classic approach, right?

# The naive approach
pg_dump -Fc production_db > backup.dump
pg_restore -h <aurora-endpoint> -d aurora_target backup.dump

For a 500GB database, this would take roughly 4 to 6 hours depending on network and instance size. That's 4 to 6 hours of new writes piling up on the source that the dump would never capture. Unacceptable.

I also looked at PostgreSQL's native logical replication, but setting up publication/subscription across AWS accounts with proper security controls was turning into a monster. Plus, the operational tooling around it was basically DIY.

Then I found AWS DMS. It handles the heavy lifting: full load plus change data capture in a managed service. The challenge was building everything around it to make migrations repeatable and safe.

The Solution

I built a complete migration framework with four major components:

  1. Terraform modules for all AWS infrastructure
  2. Python automation scripts for validation and cutover
  3. GitHub Actions workflows for CI/CD
  4. CloudWatch monitoring for full observability

Architecture Overview


The blue-green strategy works like this: DMS performs a full load of all existing data, then switches to CDC mode to capture ongoing changes. Both databases stay in sync until we're ready to cut over.
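
If you want to keep an eye on that task from Python rather than the console, a minimal sketch with boto3 looks roughly like this (the task identifier and region are placeholders, not values from the actual project):

import boto3

# Hypothetical status check for the DMS task driving the blue-green sync.
# "prod-pg-to-aurora" and the region are placeholder values.
dms = boto3.client("dms", region_name="us-east-1")

tasks = dms.describe_replication_tasks(
    Filters=[{"Name": "replication-task-id", "Values": ["prod-pg-to-aurora"]}]
)["ReplicationTasks"]

for task in tasks:
    stats = task.get("ReplicationTaskStats", {})
    # Status is e.g. "running"; MigrationType should be "full-load-and-cdc"
    print(task["Status"], task["MigrationType"])
    print("Tables loaded:", stats.get("TablesLoaded"))
    print("Full load progress:", stats.get("FullLoadProgressPercent"), "%")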

Terraform Infrastructure

I created modular Terraform for reproducibility across environments. Here's how the DMS module looks:

module "dms" {
  source = "./modules/dms"

  project     = "db-migration"
  environment = "prod"

  subnet_ids         = var.private_subnet_ids
  security_group_ids = [module.networking.dms_security_group_id]

  replication_instance_class = "dms.r5.4xlarge"
  multi_az                   = true

  source_db_host     = var.source_db_host
  source_db_username = var.source_db_username
  source_db_password = var.source_db_password

  target_db_host     = module.aurora.cluster_endpoint
  target_db_username = var.aurora_master_username
  target_db_password = var.aurora_master_password
}

The DMS module creates:

  • Replication instance with appropriate sizing
  • Source and target endpoints with proper SSL configuration
  • Replication task with CDC enabled
  • CloudWatch alarms for monitoring lag and errors (see the sketch below)
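
The lag alarm is the piece I'd call non-negotiable. As a rough illustration of what the module provisions, here's the boto3 equivalent (alarm name, identifiers, and the SNS topic ARN are placeholders):

import boto3

# Hypothetical CDC lag alarm; names, identifiers, and the topic ARN are placeholders.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="dms-cdc-latency-source",
    Namespace="AWS/DMS",
    MetricName="CDCLatencySource",
    Dimensions=[
        {"Name": "ReplicationInstanceIdentifier", "Value": "db-migration-prod"},
        {"Name": "ReplicationTaskIdentifier", "Value": "prod-pg-to-aurora"},
    ],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=30,  # seconds of CDC lag before the alarm fires
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:db-migration-alerts"],
)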

Validation Scripts

Before any cutover, you need confidence that the data matches. I wrote a Python validation tool that checks multiple dimensions:

# Quick validation (uses table statistics for fast estimates)
python validation.py --quick

# Full validation (exact counts, checksums, sequence values)
python validation.py --full

# Just check DMS status
python validation.py --dms

The full validation performs:

| Check | Description |
| --- | --- |
| Row Counts | Compare exact row counts between source and target |
| Checksums | MD5 hash of sample data from each table |
| Sequences | Verify sequence values are synchronized |
| Primary Keys | Ensure all tables have PKs (required for CDC) |
| DMS Status | Task running, replication lag below threshold |
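
The row-count check is the simplest of the bunch; stripped down, it's little more than this (connection strings and the table list are placeholders, and the real script pulls table names from information_schema):

import psycopg2

# Hypothetical row-count comparison; connection details are placeholders
# and credentials are assumed to come from PGPASSWORD / .pgpass.
source = psycopg2.connect("host=source-pg dbname=production_db user=migrator")
target = psycopg2.connect("host=aurora-endpoint dbname=production_db user=migrator")

def count_rows(conn, table: str) -> int:
    with conn.cursor() as cur:
        cur.execute(f"SELECT count(*) FROM {table}")
        return cur.fetchone()[0]

mismatches = []
for table in ["users", "orders", "events"]:  # placeholder table list
    src, tgt = count_rows(source, table), count_rows(target, table)
    if src != tgt:
        mismatches.append((table, src, tgt))

print("Row count mismatches:", mismatches or "none")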

Here's a snippet from the checksum validation:

def calculate_checksum(self, table: str, columns: list, limit: int = 1000) -> str:
    """Calculate MD5 checksum of sample rows."""
    cols = ", ".join(columns)
    query = f"""
        SELECT md5(string_agg(row_hash, '' ORDER BY row_hash))
        FROM (
            SELECT md5(ROW({cols})::text) as row_hash
            FROM {table}
            ORDER BY {columns[0]}
            LIMIT {limit}
        ) t
    """
    result = self.execute_query(query)
    return result[0][0] if result else None
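
The method runs once against each database and the two hashes get compared; roughly like this (the validator instances, table, and column list are assumptions for illustration):

source_sum = source_db.calculate_checksum("orders", ["id", "status", "total"])
target_sum = target_db.calculate_checksum("orders", ["id", "status", "total"])
assert source_sum == target_sum, "checksum mismatch on orders"

Ordering by the first column (in practice the primary key) keeps the sampled rows deterministic on both sides, so the two hashes are actually comparable.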

The Cutover Process

Cutover is where things get nerve-wracking. I built a multi-phase process with automatic rollback capability at each stage:

| Phase | Action | Rollback Available |
| --- | --- | --- |
| 1 | Pre-validation (verify DMS, row counts) | Yes |
| 2 | Wait for sync (CDC lag under threshold) | Yes |
| 3 | Drain connections (terminate source connections) | Yes |
| 4 | Final sync (wait for remaining changes) | Yes |
| 5 | Stop replication | Manual only |
| 6 | Post-validation | Manual only |
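
Phase 3 is the part that makes people nervous, so for illustration, here's roughly what draining source connections can look like (database and user names are placeholders; the DMS replication user has to stay connected so CDC can finish):

import psycopg2

# Hypothetical connection drain against the source (phase 3); names are placeholders.
admin = psycopg2.connect("host=source-pg dbname=postgres user=postgres")
admin.autocommit = True

with admin.cursor() as cur:
    # Block new application connections, then terminate the existing ones.
    cur.execute("ALTER DATABASE production_db CONNECTION LIMIT 0")
    cur.execute("""
        SELECT pg_terminate_backend(pid)
        FROM pg_stat_activity
        WHERE datname = 'production_db'
          AND pid <> pg_backend_pid()
          AND usename <> 'dms_user'  -- placeholder: keep the DMS user connected
    """)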

The cutover script saves state to JSON after each phase, so you can resume if something fails:

# Always do a dry run first
python cutover.py --dry-run

# Execute when ready
python cutover.py --execute

# Resume from saved state if interrupted
python cutover.py --execute --resume
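
The real script does a lot more, but the checkpoint/resume mechanics boil down to something like this (the file name, phase keys, and the run_phase stub are illustrative, not lifted from cutover.py):

import json
from pathlib import Path

# Illustrative checkpoint/resume skeleton; the real cutover script is more involved.
STATE_FILE = Path("cutover_state.json")

PHASES = [
    "pre_validation",
    "wait_for_sync",
    "drain_connections",
    "final_sync",
    "stop_replication",
    "post_validation",
]

def run_phase(phase: str) -> None:
    # Stand-in for the real per-phase logic (DMS calls, validation, etc.).
    print(f"Running phase: {phase}")

def load_state() -> dict:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"completed": []}

def save_state(state: dict) -> None:
    STATE_FILE.write_text(json.dumps(state, indent=2))

def run_cutover(resume: bool = False) -> None:
    state = load_state() if resume else {"completed": []}
    for phase in PHASES:
        if phase in state["completed"]:
            continue                      # already done on a previous run
        run_phase(phase)                  # a failure here raises and stops the loop
        state["completed"].append(phase)
        save_state(state)                 # checkpoint after every phase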

GitHub Actions Integration

Everything is automated through GitHub Actions. The cutover workflow requires manual approval for production:

jobs:
  approval:
    name: Approve Cutover
    runs-on: ubuntu-latest
    if: github.event.inputs.mode == 'execute'
    environment: prod-cutover  # Requires manual approval
    steps:
      - name: Cutover Approved
        run: echo "Cutover approved"

  cutover:
    name: Database Cutover
    needs: [approval]
    # ... actual cutover steps

The workflow pulls credentials from AWS Secrets Manager, runs the cutover script, uploads state artifacts for auditing, and sends SNS notifications on failure.
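
Neither of those steps is exotic; the Secrets Manager and SNS pieces are just standard boto3 calls, roughly like this (secret name, secret shape, and topic ARN are placeholders):

import json
import boto3

# Hypothetical credential fetch and failure notification; all names are placeholders.
secrets = boto3.client("secretsmanager", region_name="us-east-1")
creds = json.loads(
    secrets.get_secret_value(SecretId="db-migration/prod/aurora")["SecretString"]
)  # e.g. {"username": "...", "password": "..."}

sns = boto3.client("sns", region_name="us-east-1")
sns.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:db-migration-alerts",
    Subject="Database cutover failed",
    Message="Cutover halted before completion; check the workflow run and saved state.",
)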

Results

The migration completed successfully with these metrics:

| Metric | Value |
| --- | --- |
| Total data migrated | 512 GB |
| Migration time (full load) | 3 hours 22 minutes |
| CDC lag during cutover | 2.1 seconds |
| Application downtime | 0 seconds |
| Data validation errors | 0 |

Post-migration, we observed:

  • 40% reduction in read latency (Aurora read replicas)
  • Zero time spent on database maintenance
  • Automatic backups and point-in-time recovery

Lessons Learned

1. Test your validation scripts thoroughly. I initially had a bug where the checksum query didn't handle NULL values correctly. Caught it in staging, thankfully.

2. Size your DMS instance appropriately. We started with a dms.r5.2xlarge and hit CPU limits during full load. Upgrading to 4xlarge cut migration time in half.

3. Monitor CDC lag obsessively. I set up CloudWatch alarms for lag exceeding 30 seconds. During our migration, we had a spike to 45 seconds when someone ran a batch job on the source. Knowing about it immediately let us delay cutover until it settled.

4. Have a rollback plan that you've actually tested. We kept the source PostgreSQL running for 48 hours post-cutover. When a minor bug surfaced (unrelated to the migration), having the option to roll back gave everyone peace of mind while we investigated.

5. Communicate more than you think is necessary. We sent updates every hour during the migration. Stakeholders appreciated the transparency.

Try It Yourself

The complete framework is available on GitHub:

Repository: github.com/mateenali66/zero-downtime-db-migration

It includes:

  • Terraform modules for DMS, Aurora, networking, and monitoring
  • Python validation and cutover scripts
  • GitHub Actions workflows
  • Full documentation and runbook

Clone it, adapt the configuration for your environment, and run through staging before touching production. Happy migrating.
