<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ajay Devineni</title>
    <description>The latest articles on DEV Community by Ajay Devineni (@ajaydevineni).</description>
    <link>https://dev.to/ajaydevineni</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3862822%2Fddbc52cd-519d-4344-bea2-effb2a513786.png</url>
      <title>DEV Community: Ajay Devineni</title>
      <link>https://dev.to/ajaydevineni</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ajaydevineni"/>
    <language>en</language>
    <item>
      <title>Zero Data Loss Migration: Moving Billions of Rows from SQL Server to Aurora RDS — Architecture, Predictive CDC Monitoring &amp; Lessons from Production</title>
      <dc:creator>Ajay Devineni</dc:creator>
      <pubDate>Sun, 05 Apr 2026 22:03:10 +0000</pubDate>
      <link>https://dev.to/ajaydevineni/zero-data-loss-migration-moving-billions-of-rows-from-sql-server-to-aurora-rds-architecture-4g56</link>
      <guid>https://dev.to/ajaydevineni/zero-data-loss-migration-moving-billions-of-rows-from-sql-server-to-aurora-rds-architecture-4g56</guid>
      <description>&lt;p&gt;Migrating a live financial database with billions of rows, zero tolerance for data loss, and a strict cutover window is not a data transfer problem.&lt;br&gt;
It is a resource isolation problem, a risk prediction problem, and a compliance documentation problem — all running simultaneously.&lt;br&gt;
This article documents the architecture and lessons from a production SQL Server → AWS Aurora RDS migration I executed across multiple credit union banking environments. The core contribution is a framework I built called DMS-PredictLagNet — combining parallel DMS instance isolation with Holt-Winters predictive CDC lag forecasting for autonomous scaling.&lt;br&gt;
The Challenge&lt;br&gt;
The source environment was on-premises SQL Server across two separate data centers. Hundreds of tables. Two tables with billions of rows each. Continuous live transaction traffic — no maintenance window available. SOC 2 Type II and PCI DSS compliance required throughout.&lt;br&gt;
The hardest constraint: cutover had to happen within a documented change window measured in hours. If CDC replication lag was not at zero when that window opened, the entire migration had to defer to the next available window.&lt;br&gt;
Network Architecture: Dual VPN → Transit Gateway&lt;br&gt;
I established Site-to-Site VPN tunnels (IPSec/IKEv2) from both on-premises data centers into AWS, terminating at AWS Transit Gateway with dedicated route tables per client VPC. This guaranteed complete traffic isolation between the two migration streams — data from one client's pipeline could not traverse the other's route domain under any circumstances.&lt;br&gt;
Critical lesson learned the hard way: The source network team provided their internal LAN CIDR (192.x.x.x) for VPN configuration. What AWS actually sees is the post-NAT translated address — a completely different range. Every AWS-side configuration (route tables, security groups, network ACLs, VPN Phase 2 proxy ID selectors) must be built around the post-NAT address, not the internal LAN address. This mistake caused millions of connection timeouts before I identified the root cause. The fastest way to avoid it: ask "what IP address does AWS actually see when traffic leaves your environment?" before touching any configuration.&lt;br&gt;
Before starting any DMS task, I ran AWS Reachability Analyzer to validate end-to-end connectivity from each DMS replication instance to its source endpoint. This caught a missing route table entry that would have caused a task failure mid-window. I now treat this as a mandatory pre-migration gate.&lt;br&gt;
Schema Conversion with AWS SCT&lt;br&gt;
I ran AWS Schema Conversion Tool on a Windows EC2 instance inside the VPC — giving it direct connectivity to Aurora through the VPC network and to SQL Server through the VPN tunnel. Running SCT on a local laptop introduces latency variability that can cause timeouts on large schema assessments.&lt;br&gt;
Credentials were stored in AWS Secrets Manager and accessed via IAM role — never stored in configuration files. This is a SOC 2 control requirement, not just a best practice.&lt;br&gt;
Two transformation rules were configured before assessment:&lt;/p&gt;

&lt;p&gt;Database remapping rule for naming convention differences&lt;br&gt;
Drop-schema rule to remove the SQL Server dbo prefix from all migrated objects&lt;/p&gt;
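&lt;p&gt;For readers scripting the equivalent remapping at the DMS layer, table-mapping JSON expresses the same two ideas. A minimal sketch in Python — the "app" target schema name is illustrative, not from the original migration:&lt;/p&gt;

```python
import json

# Sketch of DMS table-mapping rules analogous to the two SCT rules above.
# The target schema name "app" is hypothetical.
table_mappings = {
    "rules": [
        {
            # Remap the SQL Server "dbo" schema to the target naming convention.
            "rule-type": "transformation",
            "rule-id": "1",
            "rule-name": "rename-dbo-schema",
            "rule-action": "rename",
            "rule-target": "schema",
            "object-locator": {"schema-name": "dbo"},
            "value": "app",
        },
        {
            # Select every table under the source schema for migration.
            "rule-type": "selection",
            "rule-id": "2",
            "rule-name": "select-all-tables",
            "object-locator": {"schema-name": "dbo", "table-name": "%"},
            "rule-action": "include",
        },
    ]
}

print(json.dumps(table_mappings, indent=2))
```

&lt;p&gt;The same JSON can be attached to a DMS task, so the remapping is versioned alongside the rest of the migration configuration.&lt;/p&gt;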

&lt;p&gt;Every incompatibility was resolved before a single row of data moved. Starting the full load before schema validation is complete is a common mistake with expensive consequences.&lt;br&gt;
The Core Architectural Decision: Parallel DMS Instance Isolation&lt;br&gt;
This was the most important design decision in the migration.&lt;br&gt;
A single DMS replication instance handling both the billion-row table and everything else creates resource contention. The billion-row table's CDC competes with hundreds of other tables for memory, CPU, and network bandwidth. Under peak transaction volume, that contention manifests as lag accumulation across the entire pipeline — and lag on a billion-row table takes the longest to clear.&lt;br&gt;
My solution: complete workload isolation.&lt;/p&gt;

&lt;p&gt;Instance 1 — dedicated exclusively to CDC replication for the single billion-row table. Nothing else ran on this instance.&lt;br&gt;
Instance 2 — handled full load and then CDC for all remaining tables.&lt;/p&gt;
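&lt;p&gt;The assignment rule is deliberately trivial — which is the point of the design. A sketch with hypothetical table and instance names:&lt;/p&gt;

```python
# Sketch: route each table to its DMS replication instance under the
# isolation policy above. All identifiers here are hypothetical.
HOT_TABLES = {"dbo.transactions"}  # the billion-row, CDC-only table(s)

def replication_instance_for(table_name):
    """Dedicated instance for the hot table; shared instance for the rest."""
    if table_name in HOT_TABLES:
        return "dms-instance-1-cdc-hot"   # CDC only, nothing else runs here
    return "dms-instance-2-bulk"          # full load plus CDC for all others

assignments = {
    t: replication_instance_for(t)
    for t in ["dbo.transactions", "dbo.members", "dbo.ledger_entries"]
}
```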

&lt;p&gt;Both instances ran on the latest available DMS instance type with high-memory configuration. Standard sizing guidance does not account for sustained 14-day CDC workloads in live financial environments. The newer instance generation provided lower baseline CPU utilization under CDC load, more memory for the transaction log decoder, and better network throughput — all of which directly improved the predictive monitor's accuracy by providing more headroom before threshold triggers.&lt;br&gt;
LOB settings required per-table tuning. Tables with large text columns used Full LOB mode. Tables without LOB columns used Limited LOB mode with appropriate size limits. Mixing these without table-level configuration would have degraded throughput across the entire non-LOB majority of the table estate.&lt;br&gt;
The Foreign Key Pre-Assessment Fix&lt;br&gt;
The DMS pre-assessment failed on the first run — foreign key constraint violations because DMS loads tables in parallel and does not guarantee parent tables are loaded before child table inserts begin.&lt;br&gt;
Fix: add initstmt=set foreign_key_checks=0 to the Aurora target endpoint extra connection attributes. This disables foreign key enforcement for the DMS session only — it does not affect any other connections to Aurora. Post-load referential integrity validation then confirms consistency was achieved through the migration process rather than enforced during loading.&lt;br&gt;
In a SOC 2 environment: document this in the change control request and retain validation script output as audit evidence.&lt;br&gt;
DMS-PredictLagNet: Predictive CDC Lag Monitoring&lt;br&gt;
The standard reactive approach — CloudWatch alarm fires when lag exceeds a threshold — is insufficient in a live financial environment for two reasons. By the time an alarm fires, the backlog may already require hours to clear. And financial transaction volume is non-linear: payroll processing, end-of-day settlement, and batch jobs create predictable but sharp spikes that static thresholds do not adapt to.&lt;br&gt;
I built a predictive monitoring system using Holt-Winters triple exponential smoothing trained on 90 days of source transaction volume patterns.&lt;br&gt;
The model captures three components:&lt;/p&gt;

&lt;p&gt;Level — baseline transaction rate&lt;br&gt;
Trend — directional change over time&lt;br&gt;
Seasonality — recurring patterns (daily and weekly cycles)&lt;/p&gt;
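&lt;p&gt;For intuition, here is a minimal additive Holt-Winters recurrence over those three components — a from-scratch sketch rather than the production code, with illustrative smoothing parameters:&lt;/p&gt;

```python
def holt_winters_forecast(series, m, horizon, alpha=0.3, beta=0.05, gamma=0.2):
    """Additive triple exponential smoothing: level plus trend plus seasonality.

    series  -- observed values (e.g. hourly transaction counts)
    m       -- seasonal period (168 for hourly data with a weekly cycle)
    horizon -- number of steps ahead to forecast
    """
    # Initialize level from the first season, trend from the first two
    # seasons, and seasonal offsets relative to the initial level.
    level = sum(series[:m]) / m
    trend = sum((series[m + i] - series[i]) / m for i in range(m)) / m
    seasonal = [series[i] - level for i in range(m)]

    for i in range(m, len(series)):
        prev_level = level
        level = alpha * (series[i] - seasonal[i % m]) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        seasonal[i % m] = gamma * (series[i] - level) + (1 - gamma) * seasonal[i % m]

    n = len(series)
    return [level + (h + 1) * trend + seasonal[(n + h) % m] for h in range(horizon)]
```

&lt;p&gt;With m=168 and 90 days of hourly counts, the same recurrence yields the short-horizon volume forecast that a calibrated throughput model then translates into predicted lag.&lt;/p&gt;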

&lt;p&gt;The seasonal period was set to m=168 (hourly observations over a 7-day weekly cycle: 24 × 7 = 168) — the dominant periodicity in credit union banking, driven by business-day versus weekend patterns and weekly payroll cycles.&lt;br&gt;
Rather than forecasting lag directly, I predicted transaction volume 30 minutes ahead and translated the forecast into predicted lag via an empirically calibrated throughput model for the specific DMS instance sizes in use. This two-stage approach produced more reliable results because CDC lag is affected by DMS internal buffer state that is not observable from CloudWatch metrics alone.&lt;br&gt;
The autonomous scaling response operated on two tiers:&lt;br&gt;
When forecast indicated predicted lag would reach 60% of critical threshold within 30 minutes → AWS Lambda triggered DMS instance scale-up automatically.&lt;br&gt;
When forecast indicated 85% of critical threshold → AWS Systems Manager automation executed emergency scale-up to maximum pre-approved instance size and paged the on-call engineer via PagerDuty.&lt;br&gt;
All automated actions wrote to the S3 audit log before execution — satisfying SOC 2 requirements for immutable evidence of automated control actions.&lt;br&gt;
Results&lt;br&gt;
Across the 14-day CDC replication window:&lt;/p&gt;

&lt;p&gt;7 high-risk lag events identified by the predictive monitor&lt;br&gt;
5 resolved autonomously by Lambda-triggered scale-up — no human intervention&lt;br&gt;
2 required engineer engagement (one unscheduled batch job outside training distribution, one DMS task restart requiring SOC 2 change authorization)&lt;br&gt;
Zero engineer pages for predictable, pattern-driven lag events&lt;/p&gt;
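&lt;p&gt;The two-tier response described earlier reduces to a small decision function. A sketch — the 60% and 85% thresholds are from the migration, the action names are hypothetical:&lt;/p&gt;

```python
import bisect

# Escalation tiers as fractions of the critical lag threshold.
# bisect_right maps a fraction to a tier index: 0 means no action,
# 1 triggers the Lambda scale-up, 2 triggers the SSM emergency
# scale-up plus the PagerDuty page.
TIERS = [0.60, 0.85]
ACTIONS = ["no_action", "lambda_scale_up", "ssm_emergency_scale_up_and_page"]

def scaling_action(predicted_lag, critical_threshold):
    """Map a 30-minute-ahead predicted lag to its escalation tier."""
    fraction = predicted_lag / critical_threshold
    return ACTIONS[bisect.bisect_right(TIERS, fraction)]
```

&lt;p&gt;In the production pipeline, each non-trivial action also wrote its decision record to the S3 audit log before executing, per the SOC 2 control described above.&lt;/p&gt;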

&lt;p&gt;Post-migration outcomes:&lt;/p&gt;

&lt;p&gt;Zero data loss across all tables&lt;br&gt;
Cutover window met&lt;br&gt;
41% query performance improvement on Aurora within 48 hours post-cutover&lt;/p&gt;

&lt;p&gt;Post-CDC Validation Before Cutover&lt;br&gt;
Three-level validation executed across all tables before cutover authorization:&lt;/p&gt;

&lt;p&gt;Row count parity — exact match between source and Aurora at validation timestamp&lt;br&gt;
Checksum validation — hash comparison over critical column sets to detect corruption that row counts alone would not reveal&lt;br&gt;
Referential integrity validation — all foreign key relationships confirmed satisfied in Aurora&lt;/p&gt;
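&lt;p&gt;Level two — the checksum comparison — can be sketched as an order-independent hash over each table's critical column set; the rows and columns below are illustrative:&lt;/p&gt;

```python
import hashlib

def table_checksum(rows):
    """Order-independent digest over the critical column values of a table.

    rows -- iterable of tuples, one per row, in any fetch order.
    """
    digest = hashlib.sha256()
    # Sort first so the digest does not depend on row order, which
    # differs between SQL Server and Aurora result sets.
    for row in sorted(rows):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()

source_rows = [(1, "alice", "2026-01-03"), (2, "bob", "2026-01-04")]
target_rows = [(2, "bob", "2026-01-04"), (1, "alice", "2026-01-03")]
match = table_checksum(source_rows) == table_checksum(target_rows)
```

&lt;p&gt;Any single-value drift — such as a truncated LOB — changes the digest even when row counts agree on both sides.&lt;/p&gt;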

&lt;p&gt;Two tables had minor row count discrepancies on first run — both traced to in-flight transactions committed in the milliseconds between source and target count queries. Rerunning during a low-transaction period confirmed equivalence. Run validation during known low-traffic windows, not during peak processing.&lt;br&gt;
The 14-Day CDC Window&lt;br&gt;
The 14-day validation period served three purposes simultaneously:&lt;/p&gt;

&lt;p&gt;Application teams ran full regression testing against Aurora using real production data&lt;br&gt;
The CDC pipeline's behavior was observed across a complete two-week transaction cycle including payroll, weekends, and month-end batch&lt;br&gt;
Validation scripts were executed and verified before the cutover decision was made&lt;/p&gt;

&lt;p&gt;Key Takeaways for Engineers Planning Similar Migrations&lt;br&gt;
Ask the right network question first. What IP address does AWS actually see when traffic leaves your environment? Build everything around the post-NAT address.&lt;br&gt;
Run Reachability Analyzer before any DMS task starts. The cost is negligible. The cost of discovering a routing gap after migration tasks have started is not.&lt;br&gt;
Isolate your highest-volume table CDC on a dedicated instance. Do not let it compete for resources with your bulk load.&lt;br&gt;
Validate content, not just row counts. Checksum validation caught LOB truncation that row count checks would have missed entirely.&lt;br&gt;
Pre-assessment is not optional in regulated environments. Discovering the foreign_key_checks issue after a full load has started on a billion-row table is not recoverable within an eight-hour window.&lt;br&gt;
Predictive monitoring is not about preventing every lag event. It is about converting unpredictable events into manageable ones — autonomous handling of known patterns, human escalation for genuinely novel ones.&lt;br&gt;
The full framework — including the Holt-Winters forecasting methodology, parallel DMS partition design, and SOC 2 audit trail architecture — is written up as peer-reviewed research for the SRE and cloud engineering community. Migration patterns like this should be documented, not just passed around as tribal knowledge.&lt;br&gt;
What's the hardest part of large database migrations for your team — data volume, CDC lag management, cutover coordination, or post-migration validation?&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fieutspzuwckoi2fvjze3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fieutspzuwckoi2fvjze3.jpg" alt=" " width="800" height="537"&gt;&lt;/a&gt;&lt;br&gt;
I also shared a high-level architecture overview of this migration on LinkedIn — you can find it here &lt;a href="https://www.linkedin.com/posts/ajay-devineni_aws-databasemigration-aurorards-activity-7438712828808548352-rz76?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU" rel="noopener noreferrer"&gt;https://www.linkedin.com/posts/ajay-devineni_aws-databasemigration-aurorards-activity-7438712828808548352-rz76?utm_source=share&amp;amp;utm_medium=member_desktop&amp;amp;rcm=ACoAACIp55QBRGVmAcEbf0D-1PaR5vEbm2yMcJU&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aws</category>
      <category>database</category>
      <category>sre</category>
    </item>
  </channel>
</rss>
