Pavan Bhatia

Posted on Jun 20 • Originally published at pavanbhatia.hashnode.dev

Everything Was Green. Production Was Failing.

#aws #devops #architecture #cloud

Originally published on my Hashnode blog. This version has been adapted for the DEV community.

Confidentiality Notice: To protect intellectual property, certain architectural values, network configurations, and specific company identifiers have been generalized or obfuscated.

ALB Targets: Healthy.

NLB Targets: Healthy.

ECS Tasks: Running Normally.

CloudWatch: Fully Green.

To any engineer, this view is the ultimate green light. But at 2:00 AM on a Sunday, during the live cutover of our first major migration release, those pristine green lines were masking a critical failure. We had absolutely no instrumentation to see it.

Our maintenance window was rapidly shrinking, and production traffic was only a few hours away. Out at the ingress edge, client connections were dropping silently. Traffic wasn't degraded or throttled; it was completely absent.

As the Lead Cloud Architect steering this deployment, I watched our operational window slam shut. We spent 14 hours trapped in a grueling troubleshooting loop:

The security group audit? Clean.
Container logs? Showed healthy application responses.
VPC configurations? Spotless.

Every layer we checked told us the system was working perfectly—which made the silence at the client side completely inexplicable.

What we ultimately uncovered wasn’t a cloud provider outage. It was a brutal lesson in metric rollups and a hidden architectural trap born from our own unverified infrastructure assumptions.

We were observing control plane correctness — not data plane reality.

1. The Ingress Architecture

Our architecture was driven by a strict upstream constraint: our B2B partners operated legacy perimeter firewalls that enforced strict static IP allowlisting.

Because an AWS Application Load Balancer (ALB) dynamically rotates its underlying IP addresses as it scales, partner firewalls would drop traffic after any scaling event. To resolve this, we implemented a dual-layer ingress strategy:

A public Network Load Balancer (NLB) at the edge to provide fixed Elastic IPs.
An internal Application Load Balancer (ALB) downstream to handle Layer 7 routing, SSL termination, and WAF rules.

Figure 1: The dual-hop ingress path — NLB provides static Elastic IPs at the edge, ALB handles Layer 7 routing, SSL termination, and WAF behind it.

Because our existing setup predated native ALB-type target groups, we maintained ALB node registration through a custom internal synchronization script. The NLB utilized an IP-type target group tracking the private IP addresses of the ALB nodes.

The public SSL/TLS session terminated entirely at the ALB tier. We configured the NLB target group to execute HTTP health checks on Port 80, targeting the root path (/). This decision—made to verify actual application availability rather than simple TCP socket health—became the exact source of our failure.

2. The Fault Mechanism: The Host-Header Trap

Concurrently, our security team had hardened the downstream ALB with strict Host-header listener rules to drop unauthorized background internet scanning. Our default listener action was configured to return a fixed HTTP 400 Bad Request. Any request arriving at the ALB without our specific, approved application domain in the HTTP Host header hit this default action and was instantly rejected.

This is where our structural assumptions crashed with AWS realities:

The Catch: AWS NLB automated HTTP health checks are basic, automated probes. They do not allow you to inject custom HTTP headers (like a specific Host header) into the probe payload.

Consequently, when the NLB probed the ALB nodes, the health-check requests arrived without the required application host header. The ALB’s hardened listener evaluated the request, failed to find the approved domain header, matched the default fallback rule, and returned an HTTP 400.

3. Why It Survived Staging But Blew Up Under Load

This architectural flaw had quietly existed in our templates for months. It managed to survive staging due to three factors: low node turnover, specific deployment sequencing, and target registration delays.

In staging, the original ALB nodes registered before the host-header rules were deployed, and traffic volumes were never high enough to trigger scale-out events. The system maintained a deceptive steady state.

The live migration cutover shattered this illusion:

The sudden influx of validation traffic triggered an immediate scale-out event on the ALB tier.
AWS dynamically provisioned brand-new ALB IP endpoints to handle the load.
These newly provisioned ALB nodes consistently returned HTTP 400 responses to the NLB health checks.
Because they failed health checks, they never entered service.

This created a catastrophic loop: available routing capacity at the perimeter steadily collapsed, causing the NLB to drop client TCP connections at the edge while the remaining nodes choked.

4. The Observability Blind Spot

Why didn't our alerts fire? Our primary CloudWatch dashboard was configured to track the HealthyHostCount metric inside our NLB namespace. The graph was mapped using a standard 1-minute rollup period displaying the Average statistic.

Because the ALB nodes were rapidly scaling, cycling through IP targets, and oscillating between initialization, brief timeouts, and de-registration, the 1-minute aggregation window completely smoothed out these sharp, localized drops. The macro-level graph averaged the numbers out and rendered a flat, beautifully healthy line.

Figure 2: Sub-minute target-state churn versus the smoothed 1-minute CloudWatch rollup — the gap between control plane reporting and data plane reality.

We were evaluating infrastructure metrics on a macro-level timeline while the data plane was failing on a sub-minute, second-level timeline.

By 3:00 AM, we bypassed the global graphs entirely. We hopped into the AWS CLI, queried the raw Target Health State History for the NLB registry, and correlated the timestamps with our VPC Flow Logs.

The logs revealed a sharp spike in the TCP_Target_Reset_Count metric at the NLB tier. The NLB was actively dropping the ALB nodes out of rotation faster than CloudWatch could update its averages.

5. The Technical Fix

We resolved the loop by implementing an explicit, high-priority listener rule on the internal ALB to intercept the infrastructure probes before they ever reached the security host-filtering rules.

Using Terraform, we added a priority-1 rule to the ALB’s Port 80 listener utilizing a Source IPs condition:

# High-priority rule to catch NLB health checks by Source IP
resource "aws_lb_listener_rule" "nlb_health_check_bypass" {
  listener_arn = aws_lb_listener.internal_alb_80.arn
  priority     = 1

  action {
    type = "fixed-response"
    fixed_response {
      content_type = "text/plain"
      message_body = "HEALTHY"
      status_code  = "200"
    }
  }

  condition {
    source_ip {
      values = ["10.0.0.0/16"] # Restricted to our internal VPC NLB Subnets
    }
  }
}

By explicitly evaluating the source network topology rather than the Layer 7 payload, the ALB safely verified the internal health checks. Target synchronization stabilized immediately, allowing newly scaled ALB nodes to remain healthy and receive traffic.

6. The Organizational Blind Spot

While the technical fix allowed us to cross the migration finish line, the real postmortem occurred eight months later.

During a routine architectural review with our B2B partner’s infrastructure security team, we mapped out our edge architecture. We brought up the static IP allowlisting requirement and explained the immense operational complexity of running our custom NLB-to-ALB private IP synchronization loop just to accommodate them.

The partner’s Lead Security Engineer stopped us mid-sentence.

"Oh, we upgraded our edge proxies to support dynamic FQDN allowlisting two years ago," he said. "We only asked you for static IPs because that’s what was listed on your project questionnaire from 2021."

I had accepted an upstream constraint early in the design phase as an unmovable law of physics. We spent weeks writing automation code, building synchronization loops, and ultimately burning a 14-hour production outage window troubleshooting a brilliant technical solution to an organizational problem that didn’t actually exist. The NLB-to-ALB synchronization loop still runs in production today—because by the time we knew it was unnecessary, removing it had become its own complex migration project.

7. The Deeper Lesson: Security vs. Probes

The engineering takeaway here goes deeper than load balancer configurations. It exposes a recurring architectural friction point: Security hardening policies and operational health probes frequently operate at cross-purposes.

As engineers, we are trained to enforce zero-trust paradigms: validate headers, restrict paths, drop unauthenticated traffic at the absolute perimeter, and enforce strict Web Application Firewall (WAF) rule sets. But infrastructure probes are simple by design; they verify transport layer health, completely unaware of your application's security context.

Failing to account for how infrastructure health checks interact with security controls can cause otherwise healthy systems to remove themselves from service under load.

Our New Ingress Playbook

To ensure this never happens again, we changed three core aspects of our platform architecture:

Dedicated Health Paths: We now provide explicit, isolated fast-paths for infrastructure probes that bypass application-level validation.
Granular Alerting: We alert on target-state churn and sub-minute metric variations rather than relying on static 1-minute averages.
Synthetic Monitoring: We continuously validate the entire ingress path using external synthetic transactions that mimic real user behavior.

Final Takeaway

The outage exposed two profound blind spots. We trusted health signals that didn’t reflect user reality, and we built an increasingly complex architecture around a requirement nobody had revalidated.

The technical fix took hours. Discovering the unnecessary requirement took eight months. Validate your metrics, but more importantly, validate your assumptions.

💬 What's your take?

Have you ever run into a situation where standard infrastructure metrics looked perfectly healthy, but the data plane was completely dead for your users? Or have you ever spent weeks engineering a complex fix for a business requirement that turned out to be completely obsolete?

Let's exchange war stories—drop a comment down below!

Enjoyed this deep dive? Follow me here on DEV.to for more cloud architecture postmortems, production horror stories, and real-world infrastructure lessons.

About the Author: Bhatia Pavan is a Lead Cloud Architect specializing in highly available distributed systems, AWS networking, and robust observability platforms.

Top comments (1)

Pavan Bhatia • Jun 24

Author here - If you found this useful, I also wrote a detailed postmortem on a 7 TB Oracle to Amazon RDS migration — covering a 2.8ms network tax that nearly killed the cutover, a Terraform race condition, and a controlled rollback.
dev.to/pavanbhatia/how-a-28ms-netw...