<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pavan Bhatia</title>
    <description>The latest articles on DEV Community by Pavan Bhatia (@pavanbhatia).</description>
    <link>https://dev.to/pavanbhatia</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3955051%2F3df47b7a-16aa-41af-90bc-89ecb7e14b35.png</url>
      <title>DEV Community: Pavan Bhatia</title>
      <link>https://dev.to/pavanbhatia</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pavanbhatia"/>
    <language>en</language>
    <item>
      <title>How a 2.8ms Network Delta Nearly Broke Our 7 TB Oracle to Amazon RDS Migration</title>
      <dc:creator>Pavan Bhatia</dc:creator>
      <pubDate>Fri, 29 May 2026 12:31:53 +0000</pubDate>
      <link>https://dev.to/pavanbhatia/how-a-28ms-network-delta-nearly-broke-our-7-tb-oracle-to-amazon-rds-migration-4h69</link>
      <guid>https://dev.to/pavanbhatia/how-a-28ms-network-delta-nearly-broke-our-7-tb-oracle-to-amazon-rds-migration-4h69</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;🎯 &lt;strong&gt;Target Audience:&lt;/strong&gt; Intermediate to Advanced Cloud Architects, DBAs, and DevOps Engineers who manage enterprise migrations, analyze Oracle AWR reports, or tune hybrid AWS network architectures.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Cross-posted from my infrastructure postmortem series at &lt;a href="https://pavanbhatia.hashnode.dev/oracle-to-amazon-rds-migration-latency-terraform" rel="noopener noreferrer"&gt;pavanbhatia.hashnode.dev&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At 1:40 AM on Sunday, our 7 TB Oracle-to-Amazon RDS migration was on the verge of collapse.&lt;/p&gt;

&lt;p&gt;What looked like a routine cutover had turned into a system-wide latency failure with no obvious root cause.&lt;/p&gt;

&lt;p&gt;Database CPU sat below 15%. Storage I/O looked healthy. Application logs showed zero errors.&lt;/p&gt;

&lt;p&gt;Yet user-facing latency had exploded by nearly 800%.&lt;/p&gt;

&lt;p&gt;Our final UAT validation had stalled completely, and we had less than four hours before business traffic resumed.&lt;/p&gt;

&lt;p&gt;As the lead cloud architect driving the cutover, I had to decide whether to continue debugging live under extreme time pressure — or abort the migration and execute a controlled rollback.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Data Pump Concurrency Bottleneck
&lt;/h2&gt;

&lt;p&gt;Our initial staging runs with Oracle Data Pump (&lt;code&gt;impdp&lt;/code&gt;) showed that the 7 TB data payload was tracking toward a 24-hour import window. In a strict 48-hour cutover timeline, spending half of our entire allocation moving raw bytes was terrifying. It left zero margin for validation, error remediation, or a clean rollback if things went sideways. &lt;/p&gt;

&lt;p&gt;Vertical scaling did not solve the bottleneck; the issue was process- and I/O-level concurrency. We provisioned a memory-optimized RDS instance class and maximized storage IOPS, but the import throughput remained unchanged. To find the stall, we pulled an Automatic Workload Repository (AWR) report during the import run. &lt;/p&gt;

&lt;p&gt;The metrics told us something critical immediately: compute wasn't the bottleneck. The import workers were completely serializing around index and constraint operations while the database sat mostly idle.&lt;/p&gt;

&lt;p&gt;The AWR report immediately ruled out infrastructure saturation:&lt;/p&gt;

&lt;h3&gt;
  
  
  Top AWR Wait Events During Import
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Event                         Waits      Avg Wait  % DB time
---------------------------  ---------  --------  ---------
db file sequential read      4,120,500    3.6ms      52.4%
resmgr:cpu quantum             842,110    3.7ms      11.0%
SQL*Net message from client  9,104,220    0.2ms       8.0%

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The absence of sustained CPU or storage waits confirmed that the bottleneck was orchestration overhead rather than raw infrastructure capacity.&lt;/p&gt;

&lt;p&gt;We refactored our ingestion pipeline to focus on &lt;strong&gt;schema deconstruction&lt;/strong&gt; rather than raw hardware scaling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parallelism Tuning:&lt;/strong&gt; We increased the parallel workers incrementally during test runs until throughput plateaued efficiently around &lt;code&gt;32&lt;/code&gt; workers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema Deconstruction:&lt;/strong&gt; We ran an initial import execution with index and constraint exclusions (&lt;code&gt;EXCLUDE=INDEX,CONSTRAINT&lt;/code&gt;), allowing flat tables to ingest via rapid, direct loads. We deferred all foreign keys and constraints to be validated later.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrent Indexing:&lt;/strong&gt; Once the raw data rows were loaded, we executed a multi-threaded script to rebuild indexes and constraints concurrently.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;h3&gt;
  
  
  🎉 The Result
&lt;/h3&gt;

&lt;p&gt;Total import time dropped from 24 hours to &lt;strong&gt;8 hours and 12 minutes&lt;/strong&gt;—a 66% performance gain that secured our ingestion window.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  In-Flight Infrastructure: Catching the IaC Deadlock
&lt;/h2&gt;

&lt;p&gt;With data ingestion optimized, we used subsequent test cycles to validate our infrastructure-as-code (IaC) deployment via Terraform. The automated pipeline consistently failed when attempting to provision our secondary read replica.&lt;/p&gt;

&lt;p&gt;Our pipeline threw a generic AWS API &lt;code&gt;InvalidDBInstanceState&lt;/code&gt; error, stating that the primary database was not in an available state to spin up a replica. Digging into the RDS engine events, we discovered that Oracle's &lt;code&gt;MAX_STRING_SIZE&lt;/code&gt; parameter was the culprit. We had set it to &lt;code&gt;EXTENDED&lt;/code&gt; to support 32,767-byte columns in our legacy schema.&lt;/p&gt;

&lt;p&gt;Enabling &lt;code&gt;EXTENDED&lt;/code&gt; requires the database instance to boot in upgrade mode and execute internal data dictionary conversion scripts (&lt;code&gt;utl32k.sql&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Terraform's default concurrency created a race condition: &lt;strong&gt;the replica was being created before the primary had finished its upgrade.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This caused intermittent provisioning failures that were difficult to reproduce in staging but consistently triggered during full cutover runs.&lt;/p&gt;

&lt;p&gt;What finally exposed the issue was noticing the primary instance repeatedly entering an internal upgrade state while Terraform simultaneously attempted replica creation. To bypass this timing limitation, we modified our deployment runbook into a &lt;strong&gt;two-phase execution&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Phase 1:&lt;/strong&gt; We bootstrapped the cluster resources with a baseline parameter group utilizing the default &lt;code&gt;STANDARD&lt;/code&gt; string setting, allowing the AWS API to establish the replication topology successfully.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 2:&lt;/strong&gt; Once the resources were registered in our Terraform state file, we ran a targeted pipeline execution to apply the &lt;code&gt;EXTENDED&lt;/code&gt; parameter group to the primary database alone.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The secondary replica automatically inherited and synchronized the data dictionary upgrades from the primary instance over the wire, stabilizing our deployments and removing the parallel provisioning race condition.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpp5rilbecy0awh0vdrhh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpp5rilbecy0awh0vdrhh.png" alt="System architecture diagram showing bi-directional data replication between on-premises Oracle and AWS RDS using Oracle GoldenGate for zero-downtime fallback." width="800" height="668"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🔍 &lt;strong&gt;The Synchronization Loop:&lt;/strong&gt; To ensure a "no-regrets" migration, we established a bi-directional synchronization path. As shown in the diagram, Oracle GoldenGate acted as our insurance policy. By keeping the on-premises legacy database in an active, up-to-the-second state with AWS RDS, we ensured that the decision to abort the cutover at 5:00 AM resulted in a seamless fallback rather than a data recovery crisis.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Cutover Crisis: The 2.8ms Network Tax
&lt;/h2&gt;

&lt;p&gt;The real failure surfaced during our live cutover validation. Our testing showed immediate performance degradation on our core dashboards, measured as P95 end-to-end API latency at the API gateway.&lt;/p&gt;

&lt;p&gt;For about 20 minutes, the war room was convinced our AWS Direct Connect link was saturating under validation load, which briefly sent our investigation down a rabbit hole of network packet analysis. However, once we looked at the application traces, the true bottleneck emerged.&lt;/p&gt;

&lt;p&gt;The issue was not inside the Oracle engine; it was the physical distance between our remaining on-premises application tier and the new cloud environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[On-Premises App Tier] ---&amp;gt; (0.4ms RTT) ---&amp;gt; [On-Prem Legacy Oracle]
[On-Premises App Tier] ---&amp;gt; (3.2ms RTT via Direct Connect) ---&amp;gt; [AWS RDS Oracle]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On-premises, our application servers and the legacy Oracle hardware shared the same local data center fabric, yielding a network Round-Trip Time (RTT) of &lt;strong&gt;0.4ms&lt;/strong&gt;. Moving the database to Amazon RDS via AWS Direct Connect introduced a hybrid network hop, increasing that RTT to &lt;strong&gt;3.2ms&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;The Hidden Problem:&lt;/strong&gt; A delta of 2.8ms appears negligible on an architectural diagram. At scale, however, latency multiplies across every application round-trip, turning small inefficiencies into system-wide failures.&lt;/p&gt;

&lt;p&gt;💡 &lt;strong&gt;Key Realization:&lt;/strong&gt; The cloud network link wasn't the constraint. Our query amplification was.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  1. The Read Bottleneck (N+1 Query Chattiness)
&lt;/h3&gt;

&lt;p&gt;Many of our core user dashboards relied on un-batched, sequential loops that executed thousands of individual &lt;code&gt;SELECT&lt;/code&gt; queries to render a single interface view. For every user request, our dashboard looped over a list of items, firing one &lt;code&gt;SELECT&lt;/code&gt; per item instead of a single &lt;code&gt;IN&lt;/code&gt;-clause query.&lt;/p&gt;

&lt;p&gt;The math broke our performance requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;On-Premises Latency:&lt;/strong&gt; 4,800 queries × 0.4ms RTT = &lt;strong&gt;1.92 seconds&lt;/strong&gt; network overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Cloud Latency:&lt;/strong&gt; 4,800 queries × 3.2ms RTT = &lt;strong&gt;15.36 seconds&lt;/strong&gt; network overhead &lt;em&gt;(Resulting in a monitored P95 response of 6.1 seconds)&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The database completed each query quickly, then sat idle waiting for the application layer to request the next record over the network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fws2y3j8vy4vfkk4nf6p9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fws2y3j8vy4vfkk4nf6p9.png" alt="Sequence diagram comparing low-latency local network database queries with high-latency hybrid cloud network queries, showcasing N+1 query amplification." width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🔍 &lt;strong&gt;Architectural Breakdown:&lt;/strong&gt; The diagram above captures exactly how our code loops behaved across both environments. On-premises, the tight local network fabric masked inefficient coding patterns. When stretched over a hybrid cloud link, the synchronous nature of the 4,800 sequential queries weaponized the 2.8ms delta against our application layer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  2. The Write Bottleneck (Sequence Fetch Allocation)
&lt;/h3&gt;

&lt;p&gt;Unfortunately, the read path was only half the problem. The network tax similarly paralyzed our bulk data-entry processes.&lt;/p&gt;

&lt;p&gt;The culprit was our legacy ORM primary key configuration, which utilized an Oracle sequence with an allocation size of 1 (&lt;code&gt;INCREMENT BY 1&lt;/code&gt;). On-premises, the local fabric completely masked the fact that the application was making a dedicated network round-trip to ask the database for a new ID sequence number for every single row before executing the corresponding &lt;code&gt;INSERT&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Over the 3.2ms cloud link, inserting 5,000 records forced 10,000 sequential round-trips (5,000 sequence fetches + 5,000 inserts), translating to &lt;strong&gt;over 30 seconds of pure network wait time&lt;/strong&gt; per batch.&lt;/p&gt;

&lt;p&gt;Modifying ORM behavior and data-access loops under that intense time pressure would have violated our change-control policy and risked data corruption. I made the call to abort the cutover and run a controlled rollback.&lt;/p&gt;

&lt;p&gt;The migration was technically recoverable, but the operational risk window had closed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Reverse Oracle GoldenGate Safety Net
&lt;/h2&gt;

&lt;p&gt;Because this entire validation loop was executed within our isolated testing environment, production data remained completely untouched. However, the simulation proved that our fallback mechanics were sound.&lt;/p&gt;

&lt;p&gt;We ran &lt;strong&gt;Oracle GoldenGate in reverse&lt;/strong&gt;: data changes flowed continuously from AWS RDS back to the on-premises database, keeping it configured as a live, active backup. Dropping back to on-premises during this window was entirely seamless. By 5:00 AM Sunday, we had safely rerouted testing traffic back to the legacy database. The fallback process was fully automated, with &lt;strong&gt;zero data loss&lt;/strong&gt; and no disruption to our ongoing business operations.&lt;/p&gt;

&lt;p&gt;We spent the subsequent workweek executing targeted application code fixes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query Batching:&lt;/strong&gt; We refactored three key endpoints, replacing nested loops with batched &lt;code&gt;SELECT&lt;/code&gt;s. This consolidated our 4,800 iterative, single-record queries into 300 single batched SQL queries using &lt;code&gt;IN&lt;/code&gt; clauses. Dashboard latency dropped from 4.5 seconds down to &lt;strong&gt;under 400ms&lt;/strong&gt; under identical workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JDBC Fetch Tuning:&lt;/strong&gt; We bumped the default Oracle JDBC driver fetch size from its conservative default up to &lt;code&gt;100&lt;/code&gt;. This ensured that when the database processed one of our 300 consolidated batch queries, the entire dataset was returned to the application server in a single round-trip.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sequence Refactoring:&lt;/strong&gt; We updated our sequence definitions to allocate IDs in batches (&lt;code&gt;INCREMENT BY 50&lt;/code&gt;) and aligned our ORM generators accordingly. This enabled HiLo ID generation, allowing the application server to pull a pool of IDs in a single wire trip and assign them to rows entirely in-memory—reducing primary-key network requests by two orders of magnitude.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following Saturday, we initiated the cutover again. The dashboard loops that had previously fired 4,800 sequential requests now fired a clean combined total of 350 net round-trips. After the fixes, our 5,000-record batch inserts completed in &lt;strong&gt;under 2 seconds&lt;/strong&gt; instead of 30+.&lt;/p&gt;

&lt;p&gt;We completed validation ahead of schedule and were fully live by 3:00 AM with no application performance bottlenecks. Our monitored user-facing P95 response time fell from 6.1 seconds to a &lt;strong&gt;crisp 900ms&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Post-Go-Live Validation &amp;amp; The Real-World Failover
&lt;/h2&gt;

&lt;p&gt;To mitigate the risk of an unforeseen infrastructure failure during our first week in the cloud, we kept our reverse Oracle GoldenGate replication pipeline active for two weeks. This ensured that our decommissioned on-premises database remained an up-to-the-second replica of our production cloud environment, providing an immediate fallback option if a critical defect surfaced.&lt;/p&gt;

&lt;p&gt;Three weeks after go-live, we got the validation every migration team quietly fears: &lt;strong&gt;a real infrastructure failure.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An Amazon EventBridge rule captured an RDS infrastructure event notification indicating that the primary instance in the active Availability Zone (AZ) had encountered a hardware fault. This triggered an automated RDS Multi-AZ failover, promoting the standby instance in the secondary AZ to primary.&lt;/p&gt;

&lt;p&gt;Because our application connection pools recycled cleanly, the secondary instance assumed the active workload within &lt;strong&gt;two minutes&lt;/strong&gt;. CloudWatch alarms and synthetic checks confirmed zero impact on user performance. We had successfully survived a production database failure in the cloud with zero downtime—prompting us to permanently decommission the legacy on-premises synchronization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Outcomes
&lt;/h2&gt;

&lt;p&gt;Six months after that second cutover weekend, the system processes over &lt;strong&gt;12 million database transactions per day&lt;/strong&gt;. Moving to Amazon RDS for Oracle eliminated our legacy hardware maintenance overhead, and after the application-side optimizations, overall end-to-end P95 latency is now &lt;strong&gt;35% faster&lt;/strong&gt; than our previous on-premises baseline.&lt;/p&gt;

&lt;p&gt;We spent months planning storage throughput, replication pipelines, rollback mechanics, and failover scenarios. In the end, the migration nearly failed because our application had been built around a network assumption nobody realized existed until the database moved 40 miles away.&lt;/p&gt;

&lt;p&gt;Every migration exposes a different constraint. Ours exposed latency amplification, ORM query chattiness, and infrastructure sequencing failures that our on-premises environment had masked for years. The most important lesson wasn't the specific Oracle or AWS tuning itself—it was &lt;strong&gt;validating architectural assumptions early enough&lt;/strong&gt; that rollback remained controlled when those assumptions broke.&lt;/p&gt;

&lt;h3&gt;
  
  
  Thanks for reading!
&lt;/h3&gt;

&lt;p&gt;If you enjoyed this infrastructure breakdown, &lt;strong&gt;follow me here on DEV.to&lt;/strong&gt; for more deep-dives into real-world production failures and cloud architecture.&lt;/p&gt;

&lt;p&gt;You can also find me on &lt;a href="https://www.linkedin.com/in/pavan-bhatia-94557b6/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; to discuss distributed systems and large-scale AWS migrations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Question for readers
&lt;/h2&gt;

&lt;p&gt;Have you seen latency amplification or N+1 query issues surface only after moving to cloud or distributed systems?&lt;/p&gt;

&lt;p&gt;Would love to hear similar migration war stories.&lt;/p&gt;

&lt;p&gt;Some implementation details and timelines have been generalized slightly to respect internal enterprise confidentiality requirements while preserving the technical architecture and operational lessons.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>database</category>
      <category>devops</category>
      <category>cloud</category>
    </item>
  </channel>
</rss>
