DEV Community: Pavan Bhatia

Everything Was Green. Production Was Failing.

Pavan Bhatia — Sat, 20 Jun 2026 20:12:53 +0000

Originally published on my Hashnode blog. This version has been adapted for the DEV community.

Confidentiality Notice: To protect intellectual property, certain architectural values, network configurations, and specific company identifiers have been generalized or obfuscated.

ALB Targets: Healthy.

NLB Targets: Healthy.

ECS Tasks: Running Normally.

CloudWatch: Fully Green.

To any engineer, this view is the ultimate green light. But at 2:00 AM on a Sunday, during the live cutover of our first major migration release, those pristine green lines were masking a critical failure. We had absolutely no instrumentation to see it.

Our maintenance window was rapidly shrinking, and production traffic was only a few hours away. Out at the ingress edge, client connections were dropping silently. Traffic wasn't degraded or throttled; it was completely absent.

As the Lead Cloud Architect steering this deployment, I watched our operational window slam shut. We spent 14 hours trapped in a grueling troubleshooting loop:

The security group audit? Clean.
Container logs? Showed healthy application responses.
VPC configurations? Spotless.

Every layer we checked told us the system was working perfectly—which made the silence at the client side completely inexplicable.

What we ultimately uncovered wasn’t a cloud provider outage. It was a brutal lesson in metric rollups and a hidden architectural trap born from our own unverified infrastructure assumptions.

We were observing control plane correctness — not data plane reality.

1. The Ingress Architecture

Our architecture was driven by a strict upstream constraint: our B2B partners operated legacy perimeter firewalls that enforced strict static IP allowlisting.

Because an AWS Application Load Balancer (ALB) dynamically rotates its underlying IP addresses as it scales, partner firewalls would drop traffic after any scaling event. To resolve this, we implemented a dual-layer ingress strategy:

A public Network Load Balancer (NLB) at the edge to provide fixed Elastic IPs.
An internal Application Load Balancer (ALB) downstream to handle Layer 7 routing, SSL termination, and WAF rules.

Figure 1: The dual-hop ingress path — NLB provides static Elastic IPs at the edge, ALB handles Layer 7 routing, SSL termination, and WAF behind it.

Because our existing setup predated native ALB-type target groups, we maintained ALB node registration through a custom internal synchronization script. The NLB utilized an IP-type target group tracking the private IP addresses of the ALB nodes.

The public SSL/TLS session terminated entirely at the ALB tier. We configured the NLB target group to execute HTTP health checks on Port 80, targeting the root path (/). This decision—made to verify actual application availability rather than simple TCP socket health—became the exact source of our failure.

2. The Fault Mechanism: The Host-Header Trap

Concurrently, our security team had hardened the downstream ALB with strict Host-header listener rules to drop unauthorized background internet scanning. Our default listener action was configured to return a fixed HTTP 400 Bad Request. Any request arriving at the ALB without our specific, approved application domain in the HTTP Host header hit this default action and was instantly rejected.

This is where our structural assumptions collided with AWS realities:

The Catch: NLB HTTP health checks are deliberately basic probes. They do not let you inject custom HTTP headers (such as a specific Host header) into the probe request.

Consequently, when the NLB probed the ALB nodes, the health-check requests arrived without the required application host header. The ALB’s hardened listener evaluated the request, failed to find the approved domain header, matched the default fallback rule, and returned an HTTP 400.
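The trap is easier to see when the listener logic is reduced to a few lines. The sketch below is a toy model, not the real ALB rule engine: the domain `app.example.com` and the probe's Host values are hypothetical stand-ins, and the model only assumes that the probe cannot carry the approved domain.

```python
from dataclasses import dataclass
from typing import Optional

APPROVED_HOST = "app.example.com"  # hypothetical approved application domain


@dataclass
class Request:
    path: str
    host: Optional[str]  # value of the HTTP Host header, if any


def alb_listener(request: Request) -> int:
    """Toy model of the hardened listener: forward on Host match, else default 400."""
    if request.host == APPROVED_HOST:
        return 200  # host-header rule matched, request reaches the application
    return 400      # default action: fixed-response Bad Request


client = Request(path="/", host=APPROVED_HOST)
probe_no_host = Request(path="/", host=None)        # probe without the approved domain
probe_ip_host = Request(path="/", host="10.0.12.34")  # probe addressed by target IP

for name, req in [("client", client), ("probe", probe_no_host), ("probe-ip", probe_ip_host)]:
    print(name, alb_listener(req))
```

A real client passes, while every variant of the probe lands on the default action, which is exactly the response the NLB interprets as an unhealthy target.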

3. Why It Survived Staging But Blew Up Under Load

This architectural flaw had quietly existed in our templates for months. It managed to survive staging due to three factors: low node turnover, specific deployment sequencing, and target registration delays.

In staging, the original ALB nodes registered before the host-header rules were deployed, and traffic volumes were never high enough to trigger scale-out events. The system maintained a deceptive steady state.

The live migration cutover shattered this illusion:

The sudden influx of validation traffic triggered an immediate scale-out event on the ALB tier.
AWS dynamically provisioned brand-new ALB IP endpoints to handle the load.
These newly provisioned ALB nodes consistently returned HTTP 400 responses to the NLB health checks.
Because they failed health checks, they never entered service.

This created a catastrophic loop: available routing capacity at the perimeter steadily collapsed, causing the NLB to drop client TCP connections at the edge while the remaining nodes choked.

4. The Observability Blind Spot

Why didn't our alerts fire? Our primary CloudWatch dashboard tracked the HealthyHostCount metric in the NLB's CloudWatch namespace (AWS/NetworkELB). The graph used a standard 1-minute period and displayed the Average statistic.

Because the ALB nodes were rapidly scaling, cycling through IP targets, and oscillating between initialization, brief timeouts, and de-registration, the 1-minute aggregation window completely smoothed out these sharp, localized drops. The macro-level graph averaged the numbers out and rendered a flat, beautifully healthy line.

Figure 2: Sub-minute target-state churn versus the smoothed 1-minute CloudWatch rollup — the gap between control plane reporting and data plane reality.

We were evaluating infrastructure metrics on a macro-level timeline while the data plane was failing on a sub-minute, second-level timeline.
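A few lines of arithmetic show how an average can hide this kind of churn. The per-second samples below are invented for illustration (four targets, two brief drops to zero); they are not data from the incident.

```python
from statistics import mean

# Hypothetical per-second healthy-target counts for one minute:
# four targets normally, with two short windows where none were healthy.
samples = [4] * 60
for second in (20, 21, 22, 45, 46):
    samples[second] = 0

average = mean(samples)   # what an Average rollup would plot
minimum = min(samples)    # what a Minimum rollup would plot

print(f"average={average:.2f} minimum={minimum}")
```

On a graph scaled to four hosts, an average of about 3.67 still looks healthy, while the minimum over the same window exposes the outage. Plotting or alarming on the Minimum statistic alongside the Average is one way to surface that gap.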

By 3:00 AM, we bypassed the global graphs entirely. We hopped into the AWS CLI, queried the raw Target Health State History for the NLB registry, and correlated the timestamps with our VPC Flow Logs.

That correlation revealed a sharp spike in the TCP_Target_Reset_Count metric at the NLB tier. The NLB was actively dropping the ALB nodes out of rotation faster than CloudWatch could update its averages.

5. The Technical Fix

We resolved the loop by implementing an explicit, high-priority listener rule on the internal ALB to intercept the infrastructure probes before they ever reached the security host-filtering rules.

Using Terraform, we added a priority-1 rule to the ALB’s Port 80 listener utilizing a Source IPs condition:

# High-priority rule to catch NLB health checks by Source IP
resource "aws_lb_listener_rule" "nlb_health_check_bypass" {
  listener_arn = aws_lb_listener.internal_alb_80.arn
  priority     = 1

  action {
    type = "fixed-response"
    fixed_response {
      content_type = "text/plain"
      message_body = "HEALTHY"
      status_code  = "200"
    }
  }

  condition {
    source_ip {
      values = ["10.0.0.0/16"] # Generalized CIDR; in practice, scope this to the internal NLB subnets
    }
  }
}

By explicitly evaluating the source network topology rather than the Layer 7 payload, the ALB safely verified the internal health checks. Target synchronization stabilized immediately, allowing newly scaled ALB nodes to remain healthy and receive traffic.
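The evaluation order can be modeled in a few lines. This is a simplified sketch of priority-ordered rule matching, not the ALB implementation; the domain `app.example.com` and the sample addresses are hypothetical, and the CIDR is the generalized value from the rule above.

```python
import ipaddress
from typing import Optional, Tuple

HEALTH_CHECK_CIDR = ipaddress.ip_network("10.0.0.0/16")  # generalized value from the rule
APPROVED_HOST = "app.example.com"                        # hypothetical approved domain


def evaluate(source_ip: str, host: Optional[str]) -> Tuple[int, str]:
    """Evaluate rules in priority order and return (status, body)."""
    # Priority 1: source-IP bypass for infrastructure probes.
    if ipaddress.ip_address(source_ip) in HEALTH_CHECK_CIDR:
        return 200, "HEALTHY"
    # Lower priority: host-header allow rule.
    if host == APPROVED_HOST:
        return 200, "forwarded"
    # Default action.
    return 400, "Bad Request"


print(evaluate("10.0.3.7", None))                # probe from inside the CIDR
print(evaluate("203.0.113.9", None))             # scanner without the approved Host
print(evaluate("203.0.113.9", APPROVED_HOST))    # legitimate client
```

Note what the model makes obvious: anything whose source address falls inside the CIDR receives the fixed response regardless of its Host header, so the narrower the range, the smaller the bypass.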

6. The Organizational Blind Spot

While the technical fix allowed us to cross the migration finish line, the real postmortem occurred eight months later.

During a routine architectural review with our B2B partner’s infrastructure security team, we mapped out our edge architecture. We brought up the static IP allowlisting requirement and explained the immense operational complexity of running our custom NLB-to-ALB private IP synchronization loop just to accommodate them.

The partner’s Lead Security Engineer stopped us mid-sentence.

"Oh, we upgraded our edge proxies to support dynamic FQDN allowlisting two years ago," he said. "We only asked you for static IPs because that’s what was listed on your project questionnaire from 2021."

I had accepted an upstream constraint early in the design phase as an unmovable law of physics. We spent weeks writing automation code, building synchronization loops, and ultimately burning a 14-hour production outage window troubleshooting a brilliant technical solution to an organizational problem that didn’t actually exist. The NLB-to-ALB synchronization loop still runs in production today—because by the time we knew it was unnecessary, removing it had become its own complex migration project.

7. The Deeper Lesson: Security vs. Probes

The engineering takeaway here goes deeper than load balancer configurations. It exposes a recurring architectural friction point: Security hardening policies and operational health probes frequently operate at cross-purposes.

As engineers, we are trained to enforce zero-trust paradigms: validate headers, restrict paths, drop unauthenticated traffic at the absolute perimeter, and enforce strict Web Application Firewall (WAF) rule sets. But infrastructure probes are simple by design; they verify basic reachability, completely unaware of your application's security context.

Failing to account for how infrastructure health checks interact with security controls can cause otherwise healthy systems to remove themselves from service under load.

Our New Ingress Playbook

To ensure this never happens again, we changed three core aspects of our platform architecture:

Dedicated Health Paths: We now provide explicit, isolated fast-paths for infrastructure probes that bypass application-level validation.
Granular Alerting: We alert on target-state churn and sub-minute metric variations rather than relying on static 1-minute averages.
Synthetic Monitoring: We continuously validate the entire ingress path using external synthetic transactions that mimic real user behavior.
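As an illustration of the churn-based alerting idea, the sketch below counts state transitions per target over a window of observations. The observation format, target addresses, and threshold are assumptions made for the example, not the alerting we run in production.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def count_state_changes(observations: Iterable[Tuple[str, str]]) -> Counter:
    """Count state transitions per target from time-ordered (target, state) pairs."""
    last_state: Dict[str, str] = {}
    changes: Counter = Counter()
    for target, state in observations:
        if target in last_state and last_state[target] != state:
            changes[target] += 1
        last_state[target] = state
    return changes


def churning_targets(observations: Iterable[Tuple[str, str]], threshold: int = 3) -> List[str]:
    """Return targets whose state flipped at least `threshold` times in the window."""
    changes = count_state_changes(observations)
    return sorted(target for target, n in changes.items() if n >= threshold)


# Hypothetical window of polled target states.
window = [
    ("10.0.1.5", "healthy"), ("10.0.1.9", "initial"),
    ("10.0.1.9", "unhealthy"), ("10.0.1.5", "healthy"),
    ("10.0.1.9", "initial"), ("10.0.1.9", "unhealthy"),
    ("10.0.1.5", "healthy"),
]
print(churning_targets(window))
```

A stable target produces zero transitions no matter how long it stays healthy, so this signal stays quiet in steady state and fires only when targets oscillate.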

Final Takeaway

The outage exposed two profound blind spots. We trusted health signals that didn’t reflect user reality, and we built an increasingly complex architecture around a requirement nobody had revalidated.

The technical fix took hours. Discovering the unnecessary requirement took eight months. Validate your metrics, but more importantly, validate your assumptions.

💬 What's your take?

Have you ever run into a situation where standard infrastructure metrics looked perfectly healthy, but the data plane was completely dead for your users? Or have you ever spent weeks engineering a complex fix for a business requirement that turned out to be completely obsolete?

Let's exchange war stories—drop a comment down below!

Enjoyed this deep dive? Follow me here on DEV.to for more cloud architecture postmortems, production horror stories, and real-world infrastructure lessons.

About the Author: Pavan Bhatia is a Lead Cloud Architect specializing in highly available distributed systems, AWS networking, and robust observability platforms.

How a 2.8ms Network Delta Nearly Broke Our 7 TB Oracle to Amazon RDS Migration

Pavan Bhatia — Fri, 29 May 2026 12:31:53 +0000

🎯 Target Audience: Intermediate to Advanced Cloud Architects, DBAs, and DevOps Engineers who manage enterprise migrations, analyze Oracle AWR reports, or tune hybrid AWS network architectures.

Cross-posted from my infrastructure postmortem series at pavanbhatia.hashnode.dev.

At 1:40 AM on Sunday, our 7 TB Oracle-to-Amazon RDS migration was on the verge of collapse.

What looked like a routine cutover had turned into a system-wide latency failure with no obvious root cause.

Database CPU sat below 15%. Storage I/O looked healthy. Application logs showed zero errors.

Yet user-facing latency had exploded by nearly 800%.

Our final UAT validation had stalled completely, and we had less than four hours before business traffic resumed.

As the lead cloud architect driving the cutover, I had to decide whether to continue debugging live under extreme time pressure — or abort the migration and execute a controlled rollback.

The Data Pump Concurrency Bottleneck

Our initial staging runs with Oracle Data Pump (impdp) showed that the 7 TB data payload was tracking toward a 24-hour import window. In a strict 48-hour cutover timeline, spending half of our entire allocation moving raw bytes was terrifying. It left zero margin for validation, error remediation, or a clean rollback if things went sideways.

Vertical scaling did not solve the bottleneck; the issue was process- and I/O-level concurrency. We provisioned a memory-optimized RDS instance class and maximized storage IOPS, but the import throughput remained unchanged. To find the stall, we pulled an Automatic Workload Repository (AWR) report during the import run.

The metrics told us something critical immediately: compute wasn't the bottleneck. The import workers were completely serializing around index and constraint operations while the database sat mostly idle.

The top wait events from that report:

Top AWR Wait Events During Import

Event                         Waits      Avg Wait  % DB time
---------------------------  ---------  --------  ---------
db file sequential read      4,120,500    3.6ms      52.4%
resmgr:cpu quantum             842,110    3.7ms      11.0%
SQL*Net message from client  9,104,220    0.2ms       8.0%

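Single-block reads (db file sequential read), the access pattern typical of index maintenance, dominated DB time, while average waits stayed in the low single-digit milliseconds. That profile pointed to how the import was orchestrated rather than to a shortage of raw infrastructure capacity.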

We refactored our ingestion pipeline to focus on schema deconstruction rather than raw hardware scaling:

Parallelism Tuning: We increased the parallel workers incrementally during test runs until throughput plateaued at around 32 workers.
Schema Deconstruction: We ran an initial import execution with index and constraint exclusions (EXCLUDE=INDEX,CONSTRAINT), allowing flat tables to ingest via rapid, direct loads. We deferred all foreign keys and constraints to be validated later.
Concurrent Indexing: Once the raw data rows were loaded, we executed a multi-threaded script to rebuild indexes and constraints concurrently.
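The concurrent rebuild step can be sketched with a thread pool. This is a minimal illustration rather than our actual script: `run_ddl` is a hypothetical callable that executes one statement against the database, and the table and index names are made up.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def rebuild_concurrently(
    statements: List[str],
    run_ddl: Callable[[str], None],
    max_workers: int = 8,
) -> Dict[str, str]:
    """Run independent DDL statements in parallel and record the outcome of each."""
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_ddl, stmt): stmt for stmt in statements}
        for future in as_completed(futures):
            stmt = futures[future]
            try:
                future.result()
                results[stmt] = "ok"
            except Exception as exc:  # keep going so one failure does not hide the rest
                results[stmt] = f"failed: {exc}"
    return results


# Stand-in executor so the sketch runs without a database.
def fake_run_ddl(stmt: str) -> None:
    if "BROKEN" in stmt:
        raise RuntimeError("simulated failure")


ddl = [
    "CREATE INDEX idx_orders_customer ON orders (customer_id)",
    "CREATE INDEX idx_items_order ON order_items (order_id)",
    "CREATE INDEX BROKEN ON missing_table (id)",
]
for stmt, outcome in sorted(rebuild_concurrently(ddl, fake_run_ddl).items()):
    print(outcome, "<-", stmt)
```

Only statements that are independent of each other belong in one batch. Foreign keys, for example, need the parent table's primary or unique key in place first, so constraints are better run as a second pass after the index pass completes.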

🎉 The Result

Total import time dropped from 24 hours to 8 hours and 12 minutes, a 66% reduction that secured our ingestion window.

In-Flight Infrastructure: Catching the IaC Deadlock

With data ingestion optimized, we used subsequent test cycles to validate our infrastructure-as-code (IaC) deployment via Terraform. The automated pipeline consistently failed when attempting to provision our secondary read replica.

Our pipeline threw a generic AWS API InvalidDBInstanceState error, stating that the primary database was not in an available state to spin up a replica. Digging into the RDS engine events, we discovered that Oracle's MAX_STRING_SIZE parameter was the culprit. We had set it to EXTENDED to support 32,767-byte columns in our legacy schema.

Enabling EXTENDED requires the database instance to boot in upgrade mode and execute internal data dictionary conversion scripts (utl32k.sql).

Terraform's default concurrency created a race condition: the replica was being created before the primary had finished its upgrade.

This caused intermittent provisioning failures that were difficult to reproduce in staging but consistently triggered during full cutover runs.

What finally exposed the issue was noticing the primary instance repeatedly entering an internal upgrade state while Terraform simultaneously attempted replica creation. To bypass this timing limitation, we modified our deployment runbook into a two-phase execution:

Phase 1: We bootstrapped the cluster resources with a baseline parameter group utilizing the default STANDARD string setting, allowing the AWS API to establish the replication topology successfully.
Phase 2: Once the resources were registered in our Terraform state file, we ran a targeted pipeline execution to apply the EXTENDED parameter group to the primary database alone.

The secondary replica automatically inherited and synchronized the data dictionary upgrades from the primary instance over the wire, stabilizing our deployments and removing the parallel provisioning race condition.
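Stripped of Terraform, the race is an ordering problem: nothing may touch the replica until the primary reports that it is available again. The sketch below shows a polling guard of that kind. It is an alternative illustration, not the two-phase runbook described above, and `get_status` is a hypothetical callable that in practice would wrap an RDS describe call and return the instance status string.

```python
import time
from typing import Callable


def wait_until_available(
    get_status: Callable[[], str],
    timeout_s: float = 3600,
    poll_s: float = 30,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Block until get_status() returns 'available', or raise TimeoutError."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == "available":
            return
        if clock() >= deadline:
            raise TimeoutError(f"primary still in state {status!r}")
        sleep(poll_s)


# Simulated status sequence for a primary going through its upgrade.
states = iter(["modifying", "upgrading", "available"])
naps = []  # records each sleep instead of actually sleeping
wait_until_available(lambda: next(states), poll_s=30, sleep=naps.append)
print("polls that had to wait:", len(naps))
```

Injecting `sleep` and `clock` keeps the guard testable without real waiting, which matters for a check that might otherwise block a pipeline for an hour.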

🔍 The Synchronization Loop: To ensure a "no-regrets" migration, we established a bi-directional synchronization path. As shown in the diagram, Oracle GoldenGate acted as our insurance policy. By keeping the on-premises legacy database in an active, up-to-the-second state with AWS RDS, we ensured that the decision to abort the cutover at 5:00 AM resulted in a seamless fallback rather than a data recovery crisis.

The Cutover Crisis: The 2.8ms Network Tax

The real failure surfaced during our live cutover validation. Our testing showed immediate performance degradation on our core dashboards, measured as P95 end-to-end API latency at the API gateway.

For about 20 minutes, the war room was convinced our AWS Direct Connect link was saturating under validation load, which briefly sent our investigation down a rabbit hole of network packet analysis. However, once we looked at the application traces, the true bottleneck emerged.

The issue was not inside the Oracle engine; it was the physical distance between our remaining on-premises application tier and the new cloud environment.

[On-Premises App Tier] ---> (0.4ms RTT) ---> [On-Prem Legacy Oracle]
[On-Premises App Tier] ---> (3.2ms RTT via Direct Connect) ---> [AWS RDS Oracle]

On-premises, our application servers and the legacy Oracle hardware shared the same local data center fabric, yielding a network Round-Trip Time (RTT) of 0.4ms. Moving the database to Amazon RDS via AWS Direct Connect introduced a hybrid network hop, increasing that RTT to 3.2ms.

⚠️ The Hidden Problem: A delta of 2.8ms appears negligible on an architectural diagram. At scale, however, latency multiplies across every application round-trip, turning small inefficiencies into system-wide failures.

💡 Key Realization: The cloud network link wasn't the constraint. Our query amplification was.

1. The Read Bottleneck (N+1 Query Chattiness)

Many of our core user dashboards relied on un-batched, sequential loops that executed thousands of individual SELECT queries to render a single interface view. For every user request, our dashboard looped over a list of items, firing one SELECT per item instead of a single IN-clause query.

The math broke our performance requirements:

On-Premises Latency: 4,800 queries × 0.4ms RTT = 1.92 seconds network overhead
AWS Cloud Latency: 4,800 queries × 3.2ms RTT = 15.36 seconds network overhead for a 4,800-query view (the monitored P95 across all API requests was 6.1 seconds)

The database completed each query quickly, then sat idle waiting for the application layer to request the next record over the network.
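The arithmetic above generalizes to a one-line model: sequential round-trips multiplied by RTT. The sketch below reproduces the two figures, using integer microseconds to avoid floating-point noise; the function name is made up for the example.

```python
def network_overhead_seconds(round_trips: int, rtt_microseconds: int) -> float:
    """Pure network wait for strictly sequential round-trips, in seconds."""
    return round_trips * rtt_microseconds / 1_000_000


on_prem = network_overhead_seconds(4_800, 400)    # 0.4ms RTT on the local fabric
cloud = network_overhead_seconds(4_800, 3_200)    # 3.2ms RTT over Direct Connect

print(f"on-prem: {on_prem} s, cloud: {cloud} s")
```

The model ignores query execution time entirely, which is the point: with sequential calls, the wait scales with the number of round-trips, not with how fast the database answers each one.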

🔍 Architectural Breakdown: The diagram above captures exactly how our code loops behaved across both environments. On-premises, the tight local network fabric masked inefficient coding patterns. When stretched over a hybrid cloud link, the synchronous nature of the 4,800 sequential queries weaponized the 2.8ms delta against our application layer.

2. The Write Bottleneck (Sequence Fetch Allocation)

Unfortunately, the read path was only half the problem. The network tax similarly paralyzed our bulk data-entry processes.

The culprit was our legacy ORM primary key configuration, which utilized an Oracle sequence with an allocation size of 1 (INCREMENT BY 1). On-premises, the local fabric completely masked the fact that the application was making a dedicated network round-trip to ask the database for a new ID sequence number for every single row before executing the corresponding INSERT.

Over the 3.2ms cloud link, inserting 5,000 records forced 10,000 sequential round-trips (5,000 sequence fetches + 5,000 inserts), translating to over 30 seconds of pure network wait time per batch.

Modifying ORM behavior and data-access loops under that intense time pressure would have violated our change-control policy and risked data corruption. I made the call to abort the cutover and run a controlled rollback.

The migration was technically recoverable, but the operational risk window had closed.

The Reverse Oracle GoldenGate Safety Net

Because this entire validation loop was executed within our isolated testing environment, production data remained completely untouched. However, the simulation proved that our fallback mechanics were sound.

We ran Oracle GoldenGate in reverse: data changes flowed continuously from AWS RDS back to the on-premises database, keeping it configured as a live, active backup. Dropping back to on-premises during this window was entirely seamless. By 5:00 AM Sunday, we had safely rerouted testing traffic back to the legacy database. The fallback process was fully automated, with zero data loss and no disruption to our ongoing business operations.

We spent the subsequent workweek executing targeted application code fixes:

Query Batching: We refactored three key endpoints, replacing nested loops with batched SELECTs. This consolidated our 4,800 iterative, single-record queries into 300 batched queries using IN clauses. Dashboard latency dropped from 4.5 seconds down to under 400ms under identical workloads.
JDBC Fetch Tuning: We raised the Oracle JDBC driver fetch size from its conservative default of 10 rows to 100. With that setting, a result set of up to 100 rows from one of our 300 consolidated batch queries comes back to the application server in a single round-trip.
Sequence Refactoring: We updated our sequence definitions to allocate IDs in batches (INCREMENT BY 50) and aligned our ORM generators accordingly. This enabled HiLo ID generation, allowing the application server to pull a pool of IDs in a single wire trip and assign them to rows entirely in-memory, cutting primary-key network requests by a factor of 50.
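The query-batching change can be sketched as chunking the ID list and building one parameterized IN query per chunk. The table and column names below are hypothetical, and the batch size of 16 is an assumption derived from the 4,800-to-300 ratio rather than a value taken from our code.

```python
from typing import Iterator, List, Sequence, Tuple


def chunked(ids: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def build_in_query(table: str, column: str, batch: Sequence[int]) -> Tuple[str, List[int]]:
    """Build a SELECT with positional bind placeholders (:1, :2, ...) for one batch.

    `table` and `column` must be trusted constants; only the IDs are bound.
    """
    placeholders = ", ".join(f":{i + 1}" for i in range(len(batch)))
    sql = f"SELECT * FROM {table} WHERE {column} IN ({placeholders})"
    return sql, list(batch)


item_ids = list(range(1, 4_801))  # stand-in for the IDs one dashboard view needs
queries = [build_in_query("dashboard_items", "item_id", b) for b in chunked(item_ids, 16)]

print(len(queries), "queries instead of", len(item_ids))
print(queries[0][0])
```

Binding the IDs instead of inlining them keeps the statement text identical across batches of the same size, so the database can reuse the parsed cursor. Batch size is a tuning knob; Oracle has historically capped IN lists at 1,000 expressions, so chunks should stay below that.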

The following Saturday, we initiated the cutover again. The dashboard loops that had previously fired 4,800 sequential requests now fired a clean combined total of 350 net round-trips. After the fixes, our 5,000-record batch inserts completed in under 2 seconds instead of 30+.
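The sequence change is easiest to appreciate by counting round-trips. The sketch below uses an in-memory stand-in for the Oracle sequence and a simple pooled allocator; both classes are illustrative and follow one of several possible conventions (the fetched value is treated as the start of the block), which may differ from what a given ORM does.

```python
class FakeSequence:
    """In-memory stand-in for an Oracle sequence with a given INCREMENT BY."""

    def __init__(self, increment: int, start: int = 1) -> None:
        self.increment = increment
        self._next = start
        self.round_trips = 0  # each nextval() models one network round-trip

    def nextval(self) -> int:
        self.round_trips += 1
        value = self._next
        self._next += self.increment
        return value


class PooledIdGenerator:
    """Hand out IDs from a locally held block, refilling from the sequence when empty."""

    def __init__(self, sequence: FakeSequence) -> None:
        self._sequence = sequence
        self._current = 0
        self._limit = 0  # exclusive upper bound of the current block

    def next_id(self) -> int:
        if self._current >= self._limit:
            base = self._sequence.nextval()
            self._current = base
            self._limit = base + self._sequence.increment
        value = self._current
        self._current += 1
        return value


for increment in (1, 50):
    seq = FakeSequence(increment)
    gen = PooledIdGenerator(seq)
    ids = [gen.next_id() for _ in range(5_000)]
    print(f"INCREMENT BY {increment}: {seq.round_trips} sequence fetches, "
          f"{len(set(ids))} unique IDs")
```

The sequence increment and the application-side pool size must agree. If the ORM assumes 50 IDs per fetch while the sequence still increments by 1, consecutive blocks overlap and inserts collide on the primary key.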

We completed validation ahead of schedule and were fully live by 3:00 AM with no application performance bottlenecks. Our monitored user-facing P95 response time fell from 6.1 seconds to a crisp 900ms.

Post-Go-Live Validation & The Real-World Failover

To mitigate the risk of an unforeseen infrastructure failure during our first weeks in the cloud, we kept our reverse Oracle GoldenGate replication pipeline active for two weeks. This ensured that our retired on-premises database remained an up-to-the-second replica of our production cloud environment, providing an immediate fallback option if a critical defect surfaced.

Three weeks after go-live, we got the validation every migration team quietly fears: a real infrastructure failure.

An Amazon EventBridge rule captured an RDS infrastructure event notification indicating that the primary instance in the active Availability Zone (AZ) had encountered a hardware fault. This triggered an automated RDS Multi-AZ failover, promoting the standby instance in the secondary AZ to primary.

Because our application connection pools recycled cleanly, the secondary instance assumed the active workload within two minutes. CloudWatch alarms and synthetic checks confirmed zero impact on user performance. We had successfully survived a production database failure in the cloud with no user-visible downtime, which prompted us to permanently decommission the legacy on-premises synchronization.

Real-World Outcomes

Six months after that second cutover weekend, the system processes over 12 million database transactions per day. Moving to Amazon RDS for Oracle eliminated our legacy hardware maintenance overhead, and after the application-side optimizations, overall end-to-end P95 latency is now 35% faster than our previous on-premises baseline.

We spent months planning storage throughput, replication pipelines, rollback mechanics, and failover scenarios. In the end, the migration nearly failed because our application had been built around a network assumption nobody realized existed until the database moved 40 miles away.

Every migration exposes a different constraint. Ours exposed latency amplification, ORM query chattiness, and infrastructure sequencing failures that our on-premises environment had masked for years. The most important lesson wasn't the specific Oracle or AWS tuning itself—it was validating architectural assumptions early enough that rollback remained controlled when those assumptions broke.

Thanks for reading!

If you enjoyed this infrastructure breakdown, follow me here on DEV.to for more deep-dives into real-world production failures and cloud architecture.

You can also find me on LinkedIn to discuss distributed systems and large-scale AWS migrations.

Question for readers

Have you seen latency amplification or N+1 query issues surface only after moving to cloud or distributed systems?

Would love to hear similar migration war stories.

Some implementation details and timelines have been generalized slightly to respect internal enterprise confidentiality requirements while preserving the technical architecture and operational lessons.