prateeks007

Posted on • Originally published at prateeks007.hashnode.dev

The Case of the 40-Second Logins: Debugging an ALB Gone Wrong

It was supposed to be a smooth EKS migration. Instead, a handful of users started complaining about painfully slow logins — 20 to 40 seconds long. Oddly, others saw no issue at all. What followed was a three-hour debugging marathon that took us through Cloudflare cache rabbit holes, pod benchmarking, and finally, an AWS ALB subnet misconfiguration.

What makes this story worth telling is not just the root cause (a bad ALB IP), but the full forensic process — every command, every false lead, and the eventual breakthrough. This is an engineering playbook, shared so you don’t waste the same hours the next time you face an intermittent latency mystery.

Summary

After an EKS + ALB migration we saw intermittent 20–40s delays for users hitting SSO and API XHR endpoints. Some users were unaffected. Host- and pod-level checks looked healthy. The root cause turned out to be one ALB node (an Availability Zone / subnet) that was misbehaving — traffic routed to that node would hang. DNS round-robin / Cloudflare was returning the bad ALB IP to a subset of users, causing the intermittent slow behavior. Removing the faulty subnet from the ALB fixed the problem.

This post walks the entire forensic process: the commands we ran (copy/paste-ready, with safe placeholders), the exact timings and outputs we observed (sensitive values redacted), the conclusions at each step, the mistakes and detours, and the final mitigation + runbook you can reuse.


Background

  • Service migrated to a new EKS cluster fronted by ALB + Cloudflare.

  • Immediately after migration: some users saw smooth logins, others saw 20–50s hangs during login/API fetch.

  • HAR logs from affected users: static JS/CSS fast, XHR requests (API) stuck.

  • Classic symptom of multi-node load balancer inconsistency.


Step 1: First suspicion — Cloudflare caching (our biggest detour)

Because HAR logs from affected users showed long waits on API/XHR calls (while static JS/CSS looked fine), our first instinct was: maybe Cloudflare is the culprit.

Specifically, we thought Cloudflare might be mishandling cache rules or proxying traffic inconsistently.

What we tried

  1. Checked response headers for static assets to confirm whether caching was working as expected:

    curl -I "https://<MY_DOMAIN>/main.css" | egrep -i "CF-Cache-Status|Age|Cache-Control|Content-Encoding"
    

    Example output:

    CF-Cache-Status: HIT
    Cache-Control: max-age=31536000
    Age: 1284
    

    This showed caching was behaving normally — assets were cached and HITs increased over time.

  2. Toggled DNS/Proxy settings in Cloudflare Dashboard


    We experimented with switching the API and frontend DNS records between:

    • Proxied (orange cloud) → traffic routed through Cloudflare’s edge.

    • DNS-only (grey cloud) → traffic went directly to the ALB.


Observation: regardless of proxy setting, users still hit intermittent 20–40s delays. This suggested Cloudflare wasn’t the true bottleneck.
  3. Played with Page Rules & Cache Rules

    We tried temporarily bypassing cache for /api/* paths (setting “Cache Level: Bypass” in rules). API requests still hung for some clients — proving cache wasn’t the issue.

Why we wasted time here

HAR logs are notorious for being misleading when “Waiting (TTFB)” dominates. At first glance it looked like a cache miss problem (especially since static assets behaved differently from API calls).

We spent valuable time tweaking Cloudflare rules, proxy toggles, and headers — only to realize later that the problem lay downstream in the ALB node itself.

Lesson: when HAR shows long wait (StartTransfer), always double-check server-side timings (pods, ingress) before diving too deep into CDN/cache layers.




Step 2: Pod-level health checks

We needed to confirm whether the pods themselves were slow.

ApacheBench inside API pods

What is ApacheBench (ab)?

ab is a lightweight command-line HTTP benchmarking tool that fires a specified number of HTTP requests at a server with a specified concurrency. It measures requests per second, average latency, and distribution of response times. We use it for quick sanity-checks of basic HTTP responsiveness from inside the pod.

Commands we ran:

kubectl exec -it -n <NAMESPACE> <API_POD_1> -- ab -n 10 -c 2 http://localhost/validatetoken
kubectl exec -it -n <NAMESPACE> <API_POD_2> -- ab -n 10 -c 2 http://localhost/validatetoken

Representative output:

Requests per second:    ~9 [#/sec] (mean)
Time per request:       ~220 ms (mean)

Both pods responded in <250ms.

Test from ingress

kubectl exec -it -n ingress-nginx <INGRESS_POD> -- \
  curl -w "\nConnect:%{time_connect} StartTransfer:%{time_starttransfer} Total:%{time_total}\n" \
  -o /dev/null -s http://api-service.<NAMESPACE>.svc.cluster.local/validatetoken

Output:

Connect:0.005s StartTransfer:0.136s Total:0.136s

Conclusion: pods and ingress path were fine.


Step 3: End-to-end curl timings from different clients

Three clients ran the command below: the Investigator (unaffected), User A (slow), and User B (slow).

curl -w "DNS:%{time_namelookup} Connect:%{time_connect} SSL:%{time_appconnect} StartTransfer:%{time_starttransfer} Total:%{time_total}\n" \
  -o /dev/null -s "https://<API_DOMAIN>/validatetoken"

Outputs:

  • Investigator (fast):

    DNS:0.013 Connect:0.022 SSL:0.152 StartTransfer:0.629 Total:0.629
    
  • User A (slow):

    DNS:0.103 Connect:0.206 SSL:0.523 StartTransfer:20.995 Total:20.996
    
  • User B (slow): similar ~20s waits.

Observation: DNS and SSL were fine. The delay was in StartTransfer (server response).
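This pattern (healthy Connect/SSL, huge StartTransfer) is worth checking mechanically once you are collecting timings from several users. As a minimal sketch — the function name and the 5-second threshold are ours, not part of the original investigation — this bash helper flags any curl -w line where the server-side wait dominates:

```shell
# flag_slow: given one "curl -w" timing line and a threshold in seconds,
# prints SLOW if (StartTransfer - Connect) exceeds the threshold, else OK.
flag_slow() {
  local line="$1" threshold="${2:-5}" start conn
  # pull the numeric values out of the labeled fields
  start=$(printf '%s' "$line" | grep -o 'StartTransfer:[0-9.]*' | cut -d: -f2)
  conn=$(printf '%s' "$line" | grep -o 'Connect:[0-9.]*' | cut -d: -f2)
  # awk does the floating-point comparison that bash cannot
  if awk -v s="$start" -v c="$conn" -v t="$threshold" 'BEGIN { exit !(s - c > t) }'; then
    echo "SLOW"
  else
    echo "OK"
  fi
}

flag_slow "DNS:0.103 Connect:0.206 SSL:0.523 StartTransfer:20.995 Total:20.996" 5  # prints "SLOW"
```

Feeding User A's line through it immediately separates "network was slow" from "server sat on the request".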


Step 4: Investigating ALB IPs

The API domain resolved to multiple ALB IPs. We checked with:

dig +short <API_DOMAIN>

Output:

<ALB_IP_1>
<ALB_IP_2>

Step 5: Per-IP forced testing

We forced curl to each IP with --resolve:

# Test ALB IP 1
curl -w "Connect:%{time_connect} SSL:%{time_appconnect} StartTransfer:%{time_starttransfer} Total:%{time_total}\n" \
  -o /dev/null -s --resolve <API_DOMAIN>:443:<ALB_IP_1> https://<API_DOMAIN>/validatetoken

# Test ALB IP 2
curl -w "Connect:%{time_connect} SSL:%{time_appconnect} StartTransfer:%{time_starttransfer} Total:%{time_total}\n" \
  -o /dev/null -s --resolve <API_DOMAIN>:443:<ALB_IP_2> https://<API_DOMAIN>/validatetoken

Results:

  • <ALB_IP_1> → fast (1s total)

  • <ALB_IP_2> → hung (20–120s total, sometimes timeout)

We now had a smoking gun: one ALB IP (node/AZ) was bad.


Step 6: Automating per-IP checks (conncheck.sh)

To be sure, we scripted it:

#!/usr/bin/env bash
# conncheck.sh
DOMAIN="<API_DOMAIN>"
IPS=(<ALB_IP_1> <ALB_IP_2>)
for ip in "${IPS[@]}"; do
  echo "--- Testing $ip ---"
  for i in {1..10}; do
    curl -s -o /dev/null --resolve $DOMAIN:443:$ip \
      -w "Run:$i Connect:%{time_connect} SSL:%{time_appconnect} StartTransfer:%{time_starttransfer} Total:%{time_total}\n" \
      https://$DOMAIN/validatetoken
  done
done

Output (redacted):

--- Testing <ALB_IP_1> ---
Run:1 Connect:0.305 SSL:0.799 StartTransfer:1.187 Total:1.187
...

--- Testing <ALB_IP_2> ---
Run:1 Connect:0.000 SSL:0.000 StartTransfer:0.000 Total:132.969
Run:2 Connect:0.000 SSL:0.000 StartTransfer:0.000 Total:133.002
...

Every run confirmed: <ALB_IP_2> was consistently bad.
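With ten runs per IP, eyeballing the raw output gets tedious. A small awk post-processor (hypothetical, our addition; it assumes the exact "Run:… Total:X" line format that conncheck.sh emits) makes the bad node jump out:

```shell
# summarize_conncheck: reads conncheck.sh output on stdin and prints
# the worst (max) Total seen for each "--- Testing <ip> ---" section.
summarize_conncheck() {
  awk '
    /^--- Testing / { ip = $3; next }                 # 3rd field is the IP
    /^Run:/ {
      split($NF, kv, ":")                             # last field is Total:<secs>
      if (kv[2] + 0 > max[ip]) max[ip] = kv[2] + 0    # keep the maximum
    }
    END { for (i in max) printf "%s worst Total: %ss\n", i, max[i] }
  '
}

printf -- '--- Testing 203.0.113.1 ---\nRun:1 Connect:0.305 SSL:0.799 StartTransfer:1.187 Total:1.187\n--- Testing 203.0.113.2 ---\nRun:1 Connect:0.000 SSL:0.000 StartTransfer:0.000 Total:132.969\n' \
  | summarize_conncheck | sort
```

The 203.0.113.x addresses are documentation placeholders; pipe the real conncheck.sh output in instead.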


Step 7: Mapping the bad IP in AWS — how we actually did it

We needed to find which AZ/subnet the bad IP belonged to.

1) Tried the CLI mapping (it failed to return info for us):

aws ec2 describe-network-interfaces \
  --region <REGION> \
  --filters "Name=association.public-ip,Values=<ALB_IP_BAD>" \
  --query "NetworkInterfaces[*].{ENI:NetworkInterfaceId,Subnet:SubnetId,AZ:AvailabilityZone}" \
  --output table

In our case this returned no rows. That can happen depending on how the ALB’s front-end IPs are represented in the account, or because of IAM permission or environment differences.

2) Used the AWS Console to find the ENI / subnet / AZ (this is what actually worked for us):

  • Open EC2 in the AWS Console.

  • Go to Network Interfaces (EC2 → Network Interfaces).

  • Search for the bad public IP <ALB_IP_BAD> in the Network Interfaces list (or inspect the ALB → Description → Subnets to see which subnets are attached).

  • When you find the ENI that has that Public IP, open it — the console shows Subnet ID and Availability Zone (e.g., us-east-2c).

  • Cross-check by looking at ALB → Description → Subnets and the public IPs that the ALB resolved to (from dig) to be confident which subnet/ENI mapped to the bad IP.

What we saw in the Console: one subnet’s configuration visibly differed from the other two. That subnet corresponded to <ALB_IP_BAD> and the ENI attached to it.

3) Action: remove that subnet from the ALB (ALB → Edit subnets) and let the ALB reprovision its nodes across the remaining subnets.
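For reference, the same inspection and fix can also be scripted with the AWS CLI. We did this in the console, so treat the commands below as a hedged sketch (the ALB name, ARN, and subnet IDs are placeholders). Note that set-subnets replaces the entire subnet list, so you must pass every subnet you intend to keep:

```shell
# List the ALB's current AZ/subnet attachments
aws elbv2 describe-load-balancers \
  --names <ALB_NAME> \
  --query "LoadBalancers[0].AvailabilityZones[*].{AZ:ZoneName,Subnet:SubnetId}" \
  --output table

# Re-attach only the healthy subnets (this drops the bad one)
aws elbv2 set-subnets \
  --load-balancer-arn <ALB_ARN> \
  --subnets <GOOD_SUBNET_1> <GOOD_SUBNET_2>
```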

Quick notes

What is an ENI?

An ENI (Elastic Network Interface) is the AWS virtual network interface attached to resources (ALB nodes, EC2 instances, etc.). An ENI has a Public/Private IP, belongs to a particular subnet, and therefore an AZ. Finding which ENI has a public IP lets you see the subnet and AZ where a problematic node lives.

What is an AZ?

An Availability Zone (AZ) is one or more discrete data centers within an AWS Region (for example us-east-2a, us-east-2b, us-east-2c). An ALB typically spans multiple AZs and exposes one IP per node; if one AZ’s node or subnet is misconfigured, only that node (and clients routed to its IP) will suffer.


Step 8: Fix

We removed the bad subnet from the ALB (console action).

After that:

  • dig +short <API_DOMAIN> → only returned healthy IPs.

  • Re-ran conncheck.sh → all IPs fast (~1s).

  • User A and User B confirmed API requests were instant again.


Root cause

  • Cause: ALB node in one AZ was unhealthy/misconfigured. Any client that got that ALB IP suffered 20–40s hangs.

  • Why only some users? DNS round-robin gave different IPs to different clients. If you got the bad IP → you suffered.

Why the subnet mismatch broke the ALB (public vs private)

During post-mortem verification we discovered the real infra mismatch: one of the ALB’s node subnets (AZ 2c) was a private subnet that routed outbound via a NAT gateway, while the other two ALB subnets were public and routed directly to the Internet Gateway (IGW).

Key facts:

  • An internet-facing ALB must place its load-balancer nodes (the ALB front-end ENIs) in public subnets — that is, subnets whose route table has 0.0.0.0/0 → an Internet Gateway (IGW). This gives the ALB node a public path to receive and respond to client traffic.

  • A private subnet normally uses a NAT gateway for outbound traffic (0.0.0.0/0 → NAT). NAT gateways forward outbound connections from private hosts to the internet, but they do not provide a stable inbound path that an internet-facing ALB needs.

  • In our case the ALB got a node into the private subnet in AZ 2c (the subnet had 0.0.0.0/0 → nat-…). Any client that resolved DNS to that ALB IP hit a node that could not service incoming internet requests properly, producing the 20–40s hangs.

Why this matters:

  • DNS / Cloudflare / recursive resolvers will hand out different ALB public IPs to different clients. If one of those public IPs corresponds to an ALB node that lives in a private subnet (no IGW), those clients experience timeouts/hangs while other clients are fine — exactly the intermittent, user-specific behavior we saw.

Remediation and best practice:

  • For internet-facing ALBs, ensure you configure public subnets (with IGW route) in every AZ you attach to the ALB. If you need HA across three AZs, create public subnets in all three AZs.

  • If you require internal-only load balancing (no direct internet access), use an internal ALB (scheme internal) and place it in private subnets — but then it won’t be reachable from the public internet.

  • If you accidentally attach a private subnet to an internet ALB, detach it or replace it with a properly routed public subnet. Tag subnets clearly (public-<az>, private-<az>) to avoid future mistakes.

This is the precise mismatch that caused the outage: the ALB node in us-east-2c was in a private subnet (NAT only), not a public subnet with an IGW, so it could not reliably serve external client requests.
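One way to catch this class of mistake before it bites: classify each ALB subnet by its default route. The sketch below is ours (the function name and the grep-based JSON matching are illustrative; a real version would use jq). It takes the JSON returned by `aws ec2 describe-route-tables --filters Name=association.subnet-id,Values=<SUBNET_ID>` and reports whether the subnet is public or private:

```shell
# subnet_kind: crude classifier over describe-route-tables JSON.
#   public  = a route points at an Internet Gateway (igw-*)
#   private = a route points at a NAT gateway (nat-*)
subnet_kind() {
  local json="$1"
  if printf '%s' "$json" | grep -q '"GatewayId": *"igw-'; then
    echo "public"
  elif printf '%s' "$json" | grep -q '"NatGatewayId": *"nat-'; then
    echo "private"
  else
    echo "unknown"
  fi
}

subnet_kind '{"Routes":[{"DestinationCidrBlock":"0.0.0.0/0","NatGatewayId":"nat-0abc"}]}'  # prints "private"
```

Run it over every subnet attached to an internet-facing ALB; any "private" result is exactly the us-east-2c situation described above.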


Lessons learned

  1. False start: We wasted time chasing Cloudflare cache headers. Static files were never the problem — API responses were.

  2. Best tool: curl --resolve is the single best way to isolate per-IP ALB issues.

  3. Automation: A simple script like conncheck.sh can catch bad nodes quickly.

  4. AWS mapping: If CLI doesn’t show subnet info, check the console. Visual inspection caught the misconfigured subnet.

  5. Checklist item: Never assume a reused ALB is healthy. Always test each AZ/IP after migration.


Suggested runbook

If you hit intermittent API latency after ALB migration:

  1. Collect HARs — look for long wait times.

  2. Pod-level check with ab — confirm pods are fast.

  3. Run curl -w from multiple clients.

  4. Use dig +short to list ALB IPs.

  5. Use curl --resolve or conncheck.sh to test each IP.

  6. Identify bad IP → map in AWS console → remove offending subnet/AZ.

  7. Re-test and confirm.


Final thoughts

This was a wild debugging ride:

  • We started with Cloudflare cache rules.

  • Dug through pod and ingress latency.

  • Only by comparing outputs from multiple clients did we realize it was external only.

  • Finally isolated a single bad ALB IP and removed its subnet.

Three hours later, we had a fix — and a playbook to never fall into the same trap again.


Join the Conversation 🚀

I’d love to hear from you:

  • Have you ever chased a bug that turned out to be an infra misconfiguration?

  • Any war stories with ALBs, NATs, or subnets that ate up hours of your life?

  • Would you like more deep-dive posts like this (commands, outputs, root cause), or shorter “just the fix” write-ups?

Drop your thoughts in the comments — I’m excited to learn from this community and share more DevOps/infra stories in the future.
