Ajay Devineni
Zero Data Loss Migration: Moving Billions of Rows from SQL Server to Aurora RDS — Architecture, Predictive CDC Monitoring & Lessons from Production

Migrating a live financial database with billions of rows, zero tolerance for data loss, and a strict cutover window is not a data transfer problem.
It is a resource isolation problem, a risk prediction problem, and a compliance documentation problem — all running simultaneously.
This article documents the architecture and lessons from a production SQL Server → AWS Aurora RDS migration I executed across multiple credit union banking environments. The core contribution is a framework I built called DMS-PredictLagNet — combining parallel DMS instance isolation with Holt-Winters predictive CDC lag forecasting for autonomous scaling.
The Challenge
The source environment was on-premises SQL Server across two separate data centers. Hundreds of tables. Two tables with billions of rows each. Continuous live transaction traffic — no maintenance window available. SOC 2 Type II and PCI DSS compliance required throughout.
The hardest constraint: cutover had to happen within a documented change window measured in hours. If CDC replication lag was not at zero when that window opened, the entire migration had to defer to the next available window.
Network Architecture: Dual VPN → Transit Gateway
I established Site-to-Site VPN tunnels (IPsec/IKEv2) from both on-premises data centers into AWS, terminating at AWS Transit Gateway with dedicated route tables per client VPC. This guaranteed complete traffic isolation between the two migration streams — data from one client's pipeline could not traverse the other's route domain under any circumstances.
Critical lesson learned the hard way: The source network team provided their internal LAN CIDR (192.x.x.x) for VPN configuration. What AWS actually sees is the post-NAT translated address — a completely different range. Every AWS-side configuration (route tables, security groups, network ACLs, VPN Phase 2 proxy ID selectors) must be built around the post-NAT address, not the internal LAN address. This mistake caused millions of connection timeouts before I identified the root cause. The fastest way to avoid it: ask "what IP address does AWS actually see when traffic leaves your environment?" before touching any configuration.
Before starting any DMS task, I ran AWS Reachability Analyzer to validate end-to-end connectivity from each DMS replication instance to its source endpoint. This caught a missing route table entry that would have caused a task failure mid-window. I now treat this as a mandatory pre-migration gate.
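A minimal sketch of that pre-migration gate using boto3's Reachability Analyzer APIs. The ENI IDs are placeholders, and in practice the analysis runs asynchronously, so you would poll until it completes before reading the result:

```python
SQL_SERVER_PORT = 1433  # default SQL Server listener port

def path_spec(source_eni: str, dest_eni: str, port: int = SQL_SERVER_PORT) -> dict:
    """Build the Reachability Analyzer path request for one DMS-instance → source pair."""
    return {
        "Source": source_eni,       # DMS replication instance ENI (placeholder)
        "Destination": dest_eni,    # VPN-facing ENI toward the source endpoint (placeholder)
        "Protocol": "tcp",
        "DestinationPort": port,
    }

def connectivity_gate(source_eni: str, dest_eni: str) -> bool:
    """Return True only if the analyzed path is reachable; block the DMS task otherwise."""
    import boto3  # assumes credentials/region from the environment or an IAM role
    ec2 = boto3.client("ec2")
    path = ec2.create_network_insights_path(**path_spec(source_eni, dest_eni))
    path_id = path["NetworkInsightsPath"]["NetworkInsightsPathId"]
    analysis = ec2.start_network_insights_analysis(NetworkInsightsPathId=path_id)
    analysis_id = analysis["NetworkInsightsAnalysis"]["NetworkInsightsAnalysisId"]
    # In practice: poll describe_network_insights_analyses until Status == "succeeded".
    result = ec2.describe_network_insights_analyses(
        NetworkInsightsAnalysisIds=[analysis_id]
    )["NetworkInsightsAnalyses"][0]
    return result.get("NetworkPathFound", False)
```

Running this per DMS-instance/endpoint pair before task creation is what catches the missing-route-table-entry class of failure.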
Schema Conversion with AWS SCT
I ran AWS Schema Conversion Tool on a Windows EC2 instance inside the VPC — giving it direct connectivity to Aurora through the VPC network and to SQL Server through the VPN tunnel. Running SCT on a local laptop introduces latency variability that causes timeouts on large schema assessments.
Credentials were stored in AWS Secrets Manager and accessed via IAM role — never stored in configuration files. This is a SOC 2 control requirement, not just a best practice.
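The access pattern looks roughly like this — the secret name is a placeholder, and authentication comes from the attached IAM role rather than any stored key:

```python
import json

def parse_secret(secret_string: str) -> dict:
    """Secrets Manager stores DB credentials as a JSON string; parse it to a dict."""
    return json.loads(secret_string)

def get_db_credentials(secret_id: str) -> dict:
    """Fetch credentials at runtime via the instance's IAM role — nothing on disk."""
    import boto3  # uses the EC2 instance profile; no access keys in config files
    resp = boto3.client("secretsmanager").get_secret_value(SecretId=secret_id)
    return parse_secret(resp["SecretString"])
```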
Two transformation rules were configured before assessment:

Database remapping rule for naming convention differences
Drop-schema rule to remove the SQL Server dbo prefix from all migrated objects

Every incompatibility was resolved before a single row of data moved. Starting the full load before schema validation is complete is a common mistake with expensive consequences.
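For illustration, the same two transformations expressed as DMS table-mapping rules (SCT uses its own rule format, but the concepts map directly). The target schema name here is a placeholder:

```python
import json

# Hypothetical table-mapping JSON mirroring the two transformation rules:
# include everything under dbo, then rename the schema so the dbo prefix disappears.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": "dbo", "table-name": "%"},
            "rule-action": "include",
        },
        {
            "rule-type": "transformation",
            "rule-id": "2",
            "rule-name": "drop-dbo-prefix",
            "rule-target": "schema",
            "object-locator": {"schema-name": "dbo"},
            "rule-action": "rename",
            "value": "app_core",  # placeholder target schema name
        },
    ]
}

print(json.dumps(table_mappings, indent=2))
```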
The Core Architectural Decision: Parallel DMS Instance Isolation
This was the most important design decision in the migration.
A single DMS replication instance handling both the billion-row table and everything else creates resource contention. The billion-row table's CDC competes with hundreds of other tables for memory, CPU, and network bandwidth. Under peak transaction volume, that contention manifests as lag accumulation across the entire pipeline — and lag on a billion-row table takes the longest to clear.
My solution: complete workload isolation.

Instance 1 — dedicated exclusively to CDC replication for the single billion-row table. Nothing else ran on this instance.
Instance 2 — handled full load and then CDC for all remaining tables.

Both instances ran on the latest available DMS instance type with high-memory configuration. Standard sizing guidance does not account for sustained 14-day CDC workloads in live financial environments. The newer instance generation provided lower baseline CPU utilization under CDC load, more memory for the transaction log decoder, and better network throughput — all of which directly improved the predictive monitor's accuracy by providing more headroom before threshold triggers.
LOB settings required per-table tuning. Tables with large text columns used Full LOB mode. Tables without LOB columns used Limited LOB mode with appropriate size limits. Mixing these without table-level configuration would have degraded throughput across the entire non-LOB majority of the table estate.
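One way to express that split, assuming the table estate is partitioned across tasks (sizes and values below are illustrative — tune LobMaxSize to your largest real LOB):

```python
# Hypothetical DMS task settings (TargetMetadata) for the two workload profiles.
# Full LOB mode for the tasks covering tables with large text columns;
# limited mode with a size cap for everything else.
full_lob_task_settings = {
    "TargetMetadata": {
        "FullLobMode": True,
        "LobChunkSize": 64,        # KB per chunk when migrating full LOBs
        "LimitedSizeLobMode": False,
        "LobMaxSize": 0,
    }
}

limited_lob_task_settings = {
    "TargetMetadata": {
        "FullLobMode": False,
        "LobChunkSize": 0,
        "LimitedSizeLobMode": True,
        "LobMaxSize": 32,          # KB cap; LOBs beyond this would be truncated
    }
}
```

The truncation risk in limited mode is exactly why the checksum validation described later matters — row counts alone would not catch a silently truncated LOB.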
The Foreign Key Pre-Assessment Fix
The DMS pre-assessment failed on the first run — foreign key constraint violations because DMS loads tables in parallel and does not guarantee parent tables are loaded before child table inserts begin.
Fix: add initstmt=set foreign_key_checks=0 to the Aurora target endpoint extra connection attributes. This disables foreign key enforcement for the DMS session only — it does not affect any other connections to Aurora. Post-load referential integrity validation then confirms consistency was achieved through the migration process rather than enforced during loading.
In a SOC 2 environment: document this in the change control request and retain validation script output as audit evidence.
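Applied via boto3, this is a one-line endpoint modification — the ARN below is a placeholder:

```python
# The extra connection attribute is session-scoped: it affects only connections
# DMS opens to Aurora, not any other client of the database.
FK_BYPASS_ECA = "initstmt=SET FOREIGN_KEY_CHECKS=0"

def disable_fk_checks(dms_client, endpoint_arn: str):
    """Attach the FK bypass to the Aurora MySQL target endpoint (placeholder ARN)."""
    return dms_client.modify_endpoint(
        EndpointArn=endpoint_arn,
        ExtraConnectionAttributes=FK_BYPASS_ECA,
    )
```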
DMS-PredictLagNet: Predictive CDC Lag Monitoring
The standard reactive approach — CloudWatch alarm fires when lag exceeds a threshold — is insufficient in a live financial environment for two reasons. By the time an alarm fires, the backlog may already require hours to clear. And financial transaction volume is non-linear: payroll processing, end-of-day settlement, and batch jobs create predictable but sharp spikes that static thresholds do not adapt to.
I built a predictive monitoring system using Holt-Winters triple exponential smoothing trained on 90 days of source transaction volume patterns.
The model captures three components:

Level — baseline transaction rate
Trend — directional change over time
Seasonality — recurring patterns (daily and weekly cycles)

The seasonal period was set to m=168 (hourly observations over a 7-day weekly cycle) — the dominant periodicity in credit union banking, driven by business-day versus weekend patterns and weekly payroll cycles.
Rather than forecasting lag directly, I predicted transaction volume 30 minutes ahead and translated the forecast into predicted lag via an empirically calibrated throughput model for the specific DMS instance sizes in use. This two-stage approach produced more reliable results because CDC lag is affected by DMS internal buffer state that is not observable from CloudWatch metrics alone.
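The forecasting core can be sketched in pure Python. This is a minimal additive Holt-Winters implementation, not the production code: the smoothing parameters are illustrative, the demo below uses a small seasonal period for clarity (production used m=168), and the second stage — translating the volume forecast into predicted lag via the calibrated throughput model — is omitted:

```python
def holt_winters_forecast(series, m, horizon, alpha=0.3, beta=0.05, gamma=0.1):
    """Additive Holt-Winters triple exponential smoothing.
    series: historical observations (e.g. hourly transaction counts)
    m: seasonal period (168 for an hourly weekly cycle)
    horizon: number of steps to forecast ahead."""
    if len(series) < 2 * m:
        raise ValueError("need at least two full seasons of history")
    # Initialize level, trend, and seasonal components from the first two seasons.
    season1_mean = sum(series[:m]) / m
    season2_mean = sum(series[m:2 * m]) / m
    level = season1_mean
    trend = (season2_mean - season1_mean) / m
    seasonal = [series[i] - season1_mean for i in range(m)]
    for t, x in enumerate(series):
        last_level = level
        level = alpha * (x - seasonal[t % m]) + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
        seasonal[t % m] = gamma * (x - level) + (1 - gamma) * seasonal[t % m]
    n = len(series)
    return [level + h * trend + seasonal[(n + h - 1) % m]
            for h in range(1, horizon + 1)]

# Demo with a toy 4-step cycle standing in for the weekly (m=168) pattern:
history = [10, 20, 30, 40] * 10
print(holt_winters_forecast(history, m=4, horizon=3))  # ≈ [10, 20, 30]
```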
The autonomous scaling response operated on two tiers:
When forecast indicated predicted lag would reach 60% of critical threshold within 30 minutes → AWS Lambda triggered DMS instance scale-up automatically.
When forecast indicated 85% of critical threshold → AWS Systems Manager automation executed emergency scale-up to maximum pre-approved instance size and paged the on-call engineer via PagerDuty.
All automated actions wrote to the S3 audit log before execution — satisfying SOC 2 requirements for immutable evidence of automated control actions.
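The two-tier decision itself reduces to a small pure function. The threshold ratios are the ones described above; the action names are illustrative labels for the Lambda and Systems Manager paths:

```python
def scaling_action(predicted_lag_s: float, critical_threshold_s: float) -> str:
    """Map a 30-minute-ahead lag forecast to the tiered response.
    An audit record is written to S3 before any returned action executes."""
    ratio = predicted_lag_s / critical_threshold_s
    if ratio >= 0.85:
        return "emergency-scale-and-page"  # SSM automation + PagerDuty escalation
    if ratio >= 0.60:
        return "lambda-scale-up"           # autonomous DMS instance scale-up
    return "none"
```

Keeping the decision logic pure makes it trivially unit-testable, which in turn makes the SOC 2 evidence trail easier to reason about: every automated action maps to one deterministic input.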
Results
Across the 14-day CDC replication window:

7 high-risk lag events identified by the predictive monitor
5 resolved autonomously by Lambda-triggered scale-up — no human intervention
2 required engineer engagement (one unscheduled batch job outside training distribution, one DMS task restart requiring SOC 2 change authorization)
Zero engineer pages for predictable, pattern-driven lag events

Post-migration outcomes:

Zero data loss across all tables
Cutover window met
41% query performance improvement on Aurora within 48 hours post-cutover

Post-CDC Validation Before Cutover
Three-level validation executed across all tables before cutover authorization:

Row count parity — exact match between source and Aurora at validation timestamp
Checksum validation — hash comparison over critical column sets to detect corruption that row counts alone would not reveal
Referential integrity validation — all foreign key relationships confirmed satisfied in Aurora

Two tables had minor row count discrepancies on first run — both traced to in-flight transactions committed in the milliseconds between source and target count queries. Rerunning during a low-transaction period confirmed equivalence. Run validation during known low-traffic windows, not during peak processing.
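A sketch of the comparison logic, assuming rows for the critical column sets have been fetched from both sides (the fetching and the exact hash construction are simplified here — the fingerprint is summed per-row so it is order-independent, since source and target need not return rows in the same order):

```python
import hashlib

def table_fingerprint(rows) -> str:
    """Order-independent checksum over critical columns: hash each row,
    sum the digests modulo 2**128 so row ordering does not matter."""
    acc = 0
    for row in rows:
        digest = hashlib.md5("|".join(map(str, row)).encode()).digest()
        acc = (acc + int.from_bytes(digest, "big")) % (1 << 128)
    return f"{acc:032x}"

def validate_table(source_rows, target_rows) -> dict:
    """Levels 1 and 2 of the validation: row count parity, then checksum parity."""
    return {
        "row_count_match": len(source_rows) == len(target_rows),
        "checksum_match": table_fingerprint(source_rows) == table_fingerprint(target_rows),
    }
```

Level 3 (referential integrity) runs as SQL against Aurora — for each foreign key, confirming no child row lacks its parent.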
The 14-Day CDC Window
The 14-day validation period served three purposes simultaneously:

Application teams ran full regression testing against Aurora using real production data
The CDC pipeline's behavior was observed across a complete two-week transaction cycle including payroll, weekends, and month-end batch
Validation scripts were executed and verified before the cutover decision was made

Key Takeaways for Engineers Planning Similar Migrations
Ask the right network question first. What IP address does AWS actually see when traffic leaves your environment? Build everything around the post-NAT address.
Run Reachability Analyzer before any DMS task starts. The cost is negligible. The cost of discovering a routing gap after migration tasks have started is not.
Isolate your highest-volume table CDC on a dedicated instance. Do not let it compete for resources with your bulk load.
Validate content, not just row counts. Checksum validation caught LOB truncation that row count checks would have missed entirely.
Pre-assessment is not optional in regulated environments. Discovering the foreign_key_checks issue after a full load has started on a billion-row table is not recoverable within an eight-hour window.
Predictive monitoring is not about preventing every lag event. It is about converting unpredictable events into manageable ones — autonomous handling of known patterns, human escalation for genuinely novel ones.
The full framework — including the Holt-Winters forecasting methodology, parallel DMS partition design, and SOC 2 audit trail architecture — is written up as peer-reviewed research for the SRE and cloud engineering community. Migration patterns like this should be documented, not just passed around as tribal knowledge.
What's the hardest part of large database migrations for your team — data volume, CDC lag management, cutover coordination, or post-migration validation?

I also shared a high-level architecture overview of this migration on LinkedIn — you can find it here https://www.linkedin.com/posts/ajay-devineni_aws-databasemigration-aurorards-activity-7438712828808548352-rz76?utm_source=share&utm_medium=member_desktop&rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU
