koushik

πŸ’‘ Real-world AWS NLB Troubleshooting Story: Lessons from the Trenches πŸ’‘

Last week reminded me why I love (and occasionally hate) working with AWS and Kubernetes. What started as a "quick setup" turned into a deep dive through configuration flags and silent defaults.

The Challenge

Expose a Broadcom Layer7 API Gateway running on EKS through an AWS Network Load Balancer, targeting pod IPs directly. Should be straightforward, right?
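
For context, here's roughly what the first attempt looked like: a LoadBalancer Service annotated for an external NLB with IP targets. This is a minimal sketch; the name, selector, and port are placeholders rather than the real gateway configuration.

apiVersion: v1
kind: Service
metadata:
  name: layer7-gateway        # placeholder name
  annotations:
    # Ask the AWS Load Balancer Controller for an NLB that targets pod IPs
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
spec:
  type: LoadBalancer
  selector:
    app: layer7-gateway       # placeholder selector
  ports:
    - port: 8443              # placeholder port
      targetPort: 8443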

What I Expected vs. Reality

βœ… Expected: Configure service annotations, deploy, celebrate

❌ Reality: Hours of connection resets and head-scratching

The Investigation Journey

Problem #1: The Sneaky NodePort

Symptom: The AWS console showed TargetType: instance despite the service specifying IP targeting

Root Cause: Kubernetes was silently allocating NodePorts by default, causing AWS to fall back to instance mode

Solution: Added allocateLoadBalancerNodePorts: false to the service spec

spec:
  allocateLoadBalancerNodePorts: false
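A quick way to confirm the change landed (the service name is a placeholder): with NodePort allocation disabled, the nodePort fields should come back empty.

kubectl get svc layer7-gateway -o jsonpath='{.spec.ports[*].nodePort}'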

Problem #2: The Silent Controller

Symptom: No logs from AWS Load Balancer Controller, no NLB listener created

Root Cause: The controller was running v2.7.2 with --ingress-class=alb, so it only watched ALB Ingress resources and completely ignored NLB services

Solution: Either upgrade to v2.8.2 and configure dual-mode support:

--ingress-class=alb
--controller-class=service.k8s.aws/nlb
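For reference, these flags end up on the controller's container args. A minimal sketch of the relevant Deployment excerpt (the cluster name is a placeholder, and the other args are omitted):

# Excerpt from the aws-load-balancer-controller Deployment
containers:
  - name: aws-load-balancer-controller
    args:
      - --cluster-name=my-cluster               # placeholder
      - --ingress-class=alb                     # keep watching ALB Ingresses
      - --controller-class=service.k8s.aws/nlb  # also reconcile NLB services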

OR add the following to the Kubernetes Service manifest:

spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  loadBalancerClass: service.k8s.aws/nlb
  allocateLoadBalancerNodePorts: false
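Putting the pieces together, the working Service looked roughly like this sketch (name, selector, and port are placeholders). With loadBalancerClass set, the target type annotation still requests IP targets:

apiVersion: v1
kind: Service
metadata:
  name: layer7-gateway        # placeholder name
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  loadBalancerClass: service.k8s.aws/nlb    # hand the Service to the AWS controller
  allocateLoadBalancerNodePorts: false      # no NodePorts, so no instance-mode fallback
  selector:
    app: layer7-gateway       # placeholder selector
  ports:
    - port: 8443              # placeholder port
      targetPort: 8443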

The Breakthrough

Within minutes of the controller update, everything clicked into place:

  • NLB listener appeared automatically
  • Target groups attached directly to pod IPs (see the check after this list)
  • Connection errors vanished completely
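
To double-check the registration, the AWS CLI can list the registered target IDs; with IP targets these should be pod IPs (the target group ARN is a placeholder):

aws elbv2 describe-target-health \
  --target-group-arn <target-group-arn> \
  --query 'TargetHealthDescriptions[*].Target.Id'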

Key Lessons Learned

  1. Always disable NodePorts when targeting pod IPs - Kubernetes defaults can silently override your intentions
  2. No controller logs = your resources aren't being watched - Check your controller scope carefully (a quick check is sketched after this list)
  3. Version matters - Newer controller versions (v2.8.2+) handle multi-mode ALB/NLB scenarios much better
  4. Read the fine print - ingress-class vs controller-class flags have very different behaviors
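
On lesson 2: the quickest sanity check is tailing the controller's logs while you apply the Service. The namespace and deployment name below assume a standard Helm install:

kubectl logs -n kube-system deployment/aws-load-balancer-controller --tail=50 -f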

The Bigger Picture

This wasn't just debugging - it was a reminder that even "simple" cloud-native setups involve multiple layers of abstraction, each with its own defaults and assumptions. The key is understanding how these layers interact.

Have you run into similar AWS + Kubernetes integration surprises? I'd love to hear about your troubleshooting adventures in the comments!
