TL;DR: I created a production-ready framework for migrating PostgreSQL databases to Aurora with zero downtime using AWS DMS, Terraform, and automated validation scripts.
The Problem
Last year, I inherited a project running on self-managed PostgreSQL. The database had grown to 500GB, and we were spending too much time on maintenance: patching, backups, replication issues, the usual headaches. The decision was made to move to Aurora PostgreSQL for managed operations, better scalability, and native AWS integration.
The catch? This was a production database serving 50,000 daily active users. Any downtime meant lost revenue and angry customers. The business gave us a maintenance window of... zero minutes. No pressure.
What I Tried First (And Why It Didn't Work)
My initial thought was simple: pg_dump and pg_restore during a quiet period. Classic approach, right?
```bash
# The naive approach
pg_dump -Fc production_db > backup.dump
pg_restore -d aurora_target backup.dump
```
For a 500 GB database, this would take roughly 4 to 6 hours depending on network throughput and instance sizes. That's 4 to 6 hours of new writes landing on the source that the dump will never capture. Unacceptable.
I also looked at logical replication native to PostgreSQL, but setting up publication/subscription across AWS accounts with proper security controls was turning into a monster. Plus, the operational tooling around it was basically DIY.
Then I found AWS DMS. It handles the heavy lifting: full load plus change data capture in a managed service. The challenge was building everything around it to make migrations repeatable and safe.
The Solution
I built a complete migration framework with four major components:
- Terraform modules for all AWS infrastructure
- Python automation scripts for validation and cutover
- GitHub Actions workflows for CI/CD
- CloudWatch monitoring for full observability
Architecture Overview
The blue-green strategy works like this: DMS performs a full load of all existing data, then switches to CDC mode to capture ongoing changes. Both databases stay in sync until we're ready to cut over.
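To make that concrete, here's a minimal sketch (not from the repo) of watching that transition with boto3: the task sits in full load until every table is copied, then keeps running in CDC mode. The task identifier and region are placeholders.

```python
# Sketch: poll the DMS task until full load completes and CDC takes over.
# "pg-to-aurora-task" and the region are placeholders, not the repo's values.
import time
import boto3

dms = boto3.client("dms", region_name="us-east-1")

def wait_for_cdc(task_id: str, poll_seconds: int = 60) -> None:
    while True:
        task = dms.describe_replication_tasks(
            Filters=[{"Name": "replication-task-id", "Values": [task_id]}]
        )["ReplicationTasks"][0]
        stats = task.get("ReplicationTaskStats", {})
        print(f"status={task['Status']} "
              f"full_load={stats.get('FullLoadProgressPercent', 0)}% "
              f"errored_tables={stats.get('TablesErrored', 0)}")
        # Once full load reports 100% and the task is still running, DMS is applying CDC changes.
        if task["Status"] == "running" and stats.get("FullLoadProgressPercent") == 100:
            return
        time.sleep(poll_seconds)

wait_for_cdc("pg-to-aurora-task")
```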
Terraform Infrastructure
I created modular Terraform for reproducibility across environments. Here's how the DMS module looks:
module "dms" {
source = "./modules/dms"
project = "db-migration"
environment = "prod"
subnet_ids = var.private_subnet_ids
security_group_ids = [module.networking.dms_security_group_id]
replication_instance_class = "dms.r5.4xlarge"
multi_az = true
source_db_host = var.source_db_host
source_db_username = var.source_db_username
source_db_password = var.source_db_password
target_db_host = module.aurora.cluster_endpoint
target_db_username = var.aurora_master_username
target_db_password = var.aurora_master_password
}
The DMS module creates:
- Replication instance with appropriate sizing
- Source and target endpoints with proper SSL configuration
- Replication task with CDC enabled
- CloudWatch alarms for monitoring lag and errors
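The alarms live in Terraform inside the module, but for a sense of what they watch, here's an equivalent CDC-latency alarm sketched with boto3; the identifiers, SNS topic, and 30-second threshold are illustrative rather than the module's exact values.

```python
# Sketch of the kind of alarm the Terraform module sets up, expressed with boto3.
# Names, the topic ARN, and the threshold are illustrative placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="dms-cdc-latency-target",
    AlarmDescription="CDC apply latency on the Aurora target exceeded 30s",
    Namespace="AWS/DMS",
    MetricName="CDCLatencyTarget",  # seconds of lag applying changes to the target
    Dimensions=[
        {"Name": "ReplicationInstanceIdentifier", "Value": "db-migration-prod"},
        {"Name": "ReplicationTaskIdentifier", "Value": "pg-to-aurora-task"},
    ],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=3,  # three consecutive minutes above threshold
    Threshold=30,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:dms-alerts"],  # placeholder SNS topic
)
```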
Validation Scripts
Before any cutover, you need confidence that the data matches. I wrote a Python validation tool that checks multiple dimensions:
```bash
# Quick validation (uses table statistics for fast estimates)
python validation.py --quick

# Full validation (exact counts, checksums, sequence values)
python validation.py --full

# Just check DMS status
python validation.py --dms
```
The full validation performs:
| Check | Description |
|---|---|
| Row Counts | Compare exact row counts between source and target |
| Checksums | MD5 hash of sample data from each table |
| Sequences | Verify sequence values are synchronized |
| Primary Keys | Ensure all tables have PKs (required for CDC) |
| DMS Status | Task running, replication lag below threshold |
Here's a snippet from the checksum validation:
```python
def calculate_checksum(self, table: str, columns: list, limit: int = 1000) -> str:
    """Calculate MD5 checksum of sample rows."""
    cols = ", ".join(columns)
    query = f"""
        SELECT md5(string_agg(row_hash, '' ORDER BY row_hash))
        FROM (
            SELECT md5(ROW({cols})::text) AS row_hash
            FROM {table}
            ORDER BY {columns[0]}
            LIMIT {limit}
        ) t
    """
    result = self.execute_query(query)
    return result[0][0] if result else None
```
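The row-count check follows the same shape: run the identical query against both sides and flag any difference. A stripped-down sketch, assuming psycopg2 and leaving connection details and the table list to configuration:

```python
# Sketch of the row-count comparison: exact COUNT(*) on both sides, flag any mismatch.
# The DSNs and table list are stand-ins for the real tool's configuration.
import psycopg2

def count_rows(dsn: str, table: str) -> int:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(f"SELECT count(*) FROM {table}")
        return cur.fetchone()[0]

def compare_counts(source_dsn: str, target_dsn: str, tables: list) -> bool:
    all_match = True
    for table in tables:
        src, tgt = count_rows(source_dsn, table), count_rows(target_dsn, table)
        status = "OK" if src == tgt else "MISMATCH"
        print(f"{table}: source={src} target={tgt} [{status}]")
        all_match = all_match and src == tgt
    return all_match
```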
The Cutover Process
Cutover is where things get nerve-wracking. I built a multi-phase process with automatic rollback capability at each stage:
| Phase | Action | Rollback Available |
|---|---|---|
| 1 | Pre-validation (verify DMS, row counts) | Yes |
| 2 | Wait for sync (CDC lag under threshold) | Yes |
| 3 | Drain connections (terminate source connections) | Yes |
| 4 | Final sync (wait for remaining changes) | Yes |
| 5 | Stop replication | Manual only |
| 6 | Post-validation | Manual only |
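Phase 3 is the only step that actively touches the source: it blocks new sessions and terminates the remaining ones so no writes can sneak in after the final sync. Roughly, assuming psycopg2 and placeholder database/role names:

```python
# Sketch of the connection-drain step (phase 3): block new sessions, then terminate
# everything still connected to the source database. Database and role names are placeholders.
import psycopg2

def drain_connections(admin_dsn: str, dbname: str, app_role: str) -> None:
    with psycopg2.connect(admin_dsn) as conn, conn.cursor() as cur:
        # Stop the application role from opening new sessions.
        cur.execute(f"REVOKE CONNECT ON DATABASE {dbname} FROM {app_role}")
        # Terminate every remaining session on that database except our own.
        cur.execute(
            """
            SELECT pg_terminate_backend(pid)
            FROM pg_stat_activity
            WHERE datname = %s AND pid <> pg_backend_pid()
            """,
            (dbname,),
        )
```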
The cutover script saves state to JSON after each phase, so you can resume if something fails:
```bash
# Always do a dry run first
python cutover.py --dry-run

# Execute when ready
python cutover.py --execute

# Resume from saved state if interrupted
python cutover.py --execute --resume
```
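The resume behavior is nothing exotic: a JSON checkpoint is written after every completed phase, and the next run skips anything already marked done. A minimal sketch of that pattern; the file name and schema here are my own stand-ins, not the repo's exact format:

```python
# Sketch of phase checkpointing: persist completed phases to JSON so --resume can skip them.
import json
from pathlib import Path
from typing import Callable

STATE_FILE = Path("cutover_state.json")  # placeholder file name

def load_state() -> dict:
    return json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {"completed": []}

def run_phases(phases: "dict[str, Callable[[], None]]", resume: bool = False) -> None:
    state = load_state() if resume else {"completed": []}
    for name, phase_fn in phases.items():
        if name in state["completed"]:
            print(f"skipping {name} (already completed)")
            continue
        phase_fn()  # raises on failure, leaving completed phases recorded in the state file
        state["completed"].append(name)
        STATE_FILE.write_text(json.dumps(state, indent=2))
```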
GitHub Actions Integration
Everything is automated through GitHub Actions. The cutover workflow requires manual approval for production:
```yaml
jobs:
  approval:
    name: Approve Cutover
    runs-on: ubuntu-latest
    if: github.event.inputs.mode == 'execute'
    environment: prod-cutover  # Requires manual approval
    steps:
      - name: Cutover Approved
        run: echo "Cutover approved"

  cutover:
    name: Database Cutover
    needs: [approval]
    # ... actual cutover steps
```
The workflow pulls credentials from AWS Secrets Manager, runs the cutover script, uploads state artifacts for auditing, and sends SNS notifications on failure.
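Inside the workflow, the AWS side boils down to a couple of boto3 calls. A hedged sketch of what the credential fetch and failure notification amount to; the secret name and topic ARN are placeholders:

```python
# Sketch of the workflow's AWS touchpoints: read DB credentials from Secrets Manager,
# run the cutover script, and publish to SNS if it fails. Names/ARNs are placeholders.
import json
import os
import subprocess
import boto3

secrets = boto3.client("secretsmanager", region_name="us-east-1")
sns = boto3.client("sns", region_name="us-east-1")

creds = json.loads(
    secrets.get_secret_value(SecretId="prod/db-migration/aurora")["SecretString"]
)

result = subprocess.run(
    ["python", "cutover.py", "--execute"],
    env={**os.environ, "PGUSER": creds["username"], "PGPASSWORD": creds["password"]},
)
if result.returncode != 0:
    sns.publish(
        TopicArn="arn:aws:sns:us-east-1:123456789012:migration-alerts",
        Subject="Database cutover FAILED",
        Message="cutover.py exited non-zero; check the uploaded state artifact.",
    )
```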
Results
The migration completed successfully with these metrics:
| Metric | Value |
|---|---|
| Total data migrated | 512 GB |
| Migration time (full load) | 3 hours 22 minutes |
| CDC lag during cutover | 2.1 seconds |
| Application downtime | 0 seconds |
| Data validation errors | 0 |
Post-migration, we observed:
- 40% reduction in read latency (Aurora read replicas)
- Zero time spent on database maintenance
- Automatic backups and point-in-time recovery
Lessons Learned
1. Test your validation scripts thoroughly. I initially had a bug where the checksum query didn't handle NULL values correctly. Caught it in staging, thankfully.
2. Size your DMS instance appropriately. We started with a dms.r5.2xlarge and hit CPU limits during full load. Upgrading to 4xlarge cut migration time in half.
3. Monitor CDC lag obsessively. I set up CloudWatch alarms for lag exceeding 30 seconds. During our migration, we had a spike to 45 seconds when someone ran a batch job on the source. Knowing about it immediately let us delay cutover until it settled (a sketch of the underlying lag check follows this list).
4. Have a rollback plan that you've actually tested. We kept the source PostgreSQL running for 48 hours post-cutover. When a minor bug surfaced (unrelated to the migration), having the option to roll back gave everyone peace of mind while we investigated.
5. Communicate more than you think is necessary. We sent updates every hour during the migration. Stakeholders appreciated the transparency.
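For lesson 3, the lag check itself is a single CloudWatch query. A sketch, assuming the CDCLatencyTarget metric and placeholder identifiers and threshold:

```python
# Sketch of the CDC-lag gate behind lesson 3: pull the last few minutes of
# CDCLatencyTarget from CloudWatch and refuse to cut over while it's elevated.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def current_cdc_lag_seconds() -> float:
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/DMS",
        MetricName="CDCLatencyTarget",
        Dimensions=[
            {"Name": "ReplicationInstanceIdentifier", "Value": "db-migration-prod"},
            {"Name": "ReplicationTaskIdentifier", "Value": "pg-to-aurora-task"},
        ],
        StartTime=now - timedelta(minutes=5),
        EndTime=now,
        Period=60,
        Statistics=["Maximum"],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    return points[-1]["Maximum"] if points else 0.0

if current_cdc_lag_seconds() > 30:
    raise SystemExit("CDC lag above threshold; hold the cutover")
```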
Try It Yourself
The complete framework is available on GitHub:
Repository: github.com/mateenali66/zero-downtime-db-migration
It includes:
- Terraform modules for DMS, Aurora, networking, and monitoring
- Python validation and cutover scripts
- GitHub Actions workflows
- Full documentation and runbook
Clone it, adapt the configuration for your environment, and run through staging before touching production. Happy migrating.
