DEV Community

Kene Ojiteli

Debugging a Self-Managed Kubernetes Platform on AWS: Lessons from Calico, Gateway API, AWS Load Balancer Controller, and Cloud Integrations

Introduction

One of the biggest misconceptions about Kubernetes projects is that success comes from getting the cluster running.

In reality, the most valuable learning often comes from the failures encountered along the way.

While building and operating a self-managed Kubernetes platform on AWS using kubeadm, I encountered networking failures, controller reconciliation issues, cloud integration problems, DNS confusion, load balancer misconfigurations, and target registration failures that forced me to move beyond deployment commands and develop a deeper understanding of how Kubernetes behaves operationally.

Many of these issues appeared at one layer of the stack but originated somewhere completely different.

A DNS issue turned out to be a networking issue.

A load balancer issue turned out to be a cloud metadata issue.

A healthy application turned out to be unreachable because of infrastructure configuration.

This article documents the troubleshooting journey, root causes, debugging process, and operational lessons learned while bringing the platform to a healthy state.

The goal is twofold:

  • Document real-world troubleshooting experiences for long-term technical recall.
  • Demonstrate the operational thinking required when working with production Kubernetes environments.

Platform Overview

The environment consisted of:

  • AWS VPC provisioned using Terraform.
  • Public and private subnets.
  • Bastion host.
  • 1 Kubernetes control plane node.
  • 2 worker nodes.
  • Calico CNI.
  • Gateway API.
  • AWS Load Balancer Controller.
  • AWS Application Load Balancer.

[Figure: architecture diagram]

My Troubleshooting Approach

One of the most important lessons from this project was learning how to troubleshoot layer by layer.

Rather than changing multiple variables simultaneously, I gradually adopted the following troubleshooting workflow:

Infrastructure → Nodes → Networking → DNS → Controllers → Load Balancers → Applications

This approach became critical because many Kubernetes failures surfaced far away from their actual root cause.
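The workflow above can be sketched as a small shell helper that runs one check per layer, in order, and stops at the first layer that fails. The function name run_layers and the "layer:command" argument format are illustrative assumptions, not part of the original setup.

```shell
# Run named checks in order and stop at the first failing layer.
# Each argument has the form "layer:command".
run_layers() {
  for entry in "$@"; do
    layer="${entry%%:*}"   # text before the first colon
    cmd="${entry#*:}"      # everything after the first colon
    if sh -c "$cmd" >/dev/null 2>&1; then
      echo "PASS $layer"
    else
      echo "FAIL $layer (stop here and investigate)"
      return 1
    fi
  done
}

# Example against a live cluster (the checks themselves are assumptions):
#   run_layers \
#     "nodes:kubectl wait --for=condition=Ready nodes --all --timeout=10s" \
#     "dns:kubectl get pods -n kube-system -l k8s-app=kube-dns"
```

Stopping at the first failure keeps later layers from adding noise, which is the point of working from infrastructure upwards.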

  • Security Groups Were Too Strict

    • Symptoms: Immediately after cluster deployment, nodes could not communicate reliably, some cluster components behaved inconsistently, and network-related issues were difficult to isolate.
    • Initial Assumption: I assumed Kubernetes was misconfigured.
    • Investigation: I reviewed security group rules, node connectivity, cluster events, and kubelet logs.
    • Root Cause: The security groups were locked down too early. I had tried to implement production-grade security controls before confirming that the cluster itself was healthy.
      This put several variables in play at once:

      • Kubernetes configuration.
      • Calico networking.
      • Security group restrictions.

      This made troubleshooting unnecessarily difficult.

    • Fix: I temporarily relaxed the security group restrictions so that cluster communication could function correctly. Once the cluster was healthy and stable, I gradually tightened the ingress and egress rules again.

    • Verification

      • Nodes communicated correctly.
      • Cluster components stabilized.
      • Networking troubleshooting became easier.
    • Lesson Learned: When building a platform from scratch:

      • Validate functionality first.
      • Harden security second.

      This issue taught me the importance of reducing troubleshooting variables. Over-securing an unvalidated platform can make root-cause analysis significantly more difficult.
      [Screenshots: strict-sg, temp-relaxed-sg]
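When tightening the rules again, it helps to have the required node-to-node ports written down and a quick way to probe them. The sketch below lists the ports documented upstream for kubeadm clusters and for Calico in VXLAN mode. The function names are hypothetical, the port list is an assumption to check against your own versions, and the probe relies on nc supporting -z and -w.

```shell
# Ports a kubeadm cluster with Calico (VXLAN) typically needs open between nodes.
required_ports() {
  case "$1" in
    control-plane)
      # API server, etcd, kubelet, controller-manager, scheduler
      echo "6443/tcp 2379-2380/tcp 10250/tcp 10257/tcp 10259/tcp" ;;
    worker)
      # kubelet, kube-proxy health, NodePort range
      echo "10250/tcp 10256/tcp 30000-32767/tcp" ;;
    calico)
      # VXLAN overlay and Typha (179/tcp is only needed in BGP mode)
      echo "4789/udp 5473/tcp" ;;
    *)
      echo "unknown role: $1" >&2
      return 1 ;;
  esac
}

# Probe one TCP port from the current node (UDP is not checked here).
check_tcp_port() {
  host="$1"; port="$2"
  if nc -z -w 2 "$host" "$port" 2>/dev/null; then
    echo "open   $host:$port"
  else
    echo "closed $host:$port"
  fi
}

# Example with a hypothetical control plane IP:
#   check_tcp_port 10.0.1.10 6443
```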

  • CoreDNS Stuck in Pending State

    • Symptoms: CoreDNS remained Pending after cluster initialisation.
      [Screenshot: coredns-pending-pod]
    • Investigation: I checked the kubelet status, node readiness, and cluster events.
    • Root Cause: No Container Network Interface (CNI) plugin had been installed. The control plane was functioning, but pod networking was unavailable.

      Without a CNI:

      • Nodes remain NotReady.
      • Pod networking is unavailable.
      • CoreDNS pods cannot be scheduled onto the NotReady nodes, so they stay Pending.
    • Fix: I installed Calico CNI.

    • Verification: After the Calico installation, kubectl get nodes reported every node as Ready and CoreDNS became healthy.
      [Screenshot: coredns-running-pod]

    • Lesson Learned: Kubernetes networking is not optional. A functioning CNI is a prerequisite for cluster health.
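A quick way to confirm this root cause on a node is to look for a CNI configuration file. The sketch below assumes the conventional directory /etc/cni/net.d, and the function name is hypothetical.

```shell
# Report whether a node has any CNI configuration installed.
# Nodes stay NotReady until the container runtime finds a config here.
has_cni_config() {
  dir="${1:-/etc/cni/net.d}"
  # Any .conf, .conflist or .json file counts as a CNI configuration.
  for f in "$dir"/*.conf "$dir"/*.conflist "$dir"/*.json; do
    [ -e "$f" ] && return 0
  done
  return 1
}

# Cluster-side view of the same problem:
#   kubectl get nodes
#   kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
```

On a node, has_cni_config || echo "no CNI config found" gives a yes/no answer before digging into kubelet logs.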

  • Calico Readiness Probe Failures

    • Symptoms: Calico node pods showed readiness failures, including:

      felix is not ready
      readiness probe reporting 503 
      dial tcp 127.0.0.1:9099: connect: connection refused
      
    • Investigation: I described the failing pods, checked their logs, and reviewed the Felix logs, interface states, and VXLAN configuration.

    • Root Cause: There were multiple contributing factors, including:

      • Security group restrictions: Early security group restrictions complicated node communication.
      • Incorrect node interface detection: Calico initially struggled to identify the correct node interface.
      • AWS private subnet design: The cluster operated entirely within private subnets, so Calico needed explicit guidance regarding which interface/IP range to use.
    • Fix: I configured Calico's node address autodetection to use the private subnet CIDRs, so each node selects its address from the intended range.

    • Verification

      • Felix became healthy.
      • Readiness probe passed.
      • Nodes remained ready.
    • Lesson Learned: In cloud environments, never assume the CNI will automatically detect the desired node interface. Explicit configuration is often safer.
      [Screenshot: calico-custom-resource-installation]
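For an operator-based Calico install, CIDR-based autodetection lives in the Installation resource. The sketch below prints such a fragment for review; it assumes the operator.tigera.io/v1 Installation API with its nodeAddressAutodetectionV4.cidrs field. The function name and the CIDRs in the example are hypothetical, and the output would need merging into your existing Installation spec rather than replacing it.

```shell
# Print an Installation fragment that pins Calico's node address
# autodetection to the given subnet CIDRs.
render_calico_installation() {
  echo "apiVersion: operator.tigera.io/v1"
  echo "kind: Installation"
  echo "metadata:"
  echo "  name: default"
  echo "spec:"
  echo "  calicoNetwork:"
  echo "    nodeAddressAutodetectionV4:"
  echo "      cidrs:"
  for cidr in "$@"; do
    echo "        - \"$cidr\""
  done
}

# Example with hypothetical private subnets:
#   render_calico_installation 10.0.10.0/24 10.0.11.0/24
```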

  • Kubernetes DNS Validation and NXDOMAIN Confusion

    • Symptoms: DNS behaviour appeared inconsistent: some queries worked while others returned NXDOMAIN. For example, nslookup kubernetes.default returned NXDOMAIN, while nslookup kubernetes.default.svc.cluster.local resolved successfully.
    • Initial Assumption: I suspected CoreDNS was still unhealthy.
    • Investigation: I validated:
      • Pod-to-pod communication.
      • DNS resolution.
      • Service discovery.
    • Root Cause: DNS was functioning correctly. I had misunderstood Kubernetes DNS search domains: a short name such as kubernetes.default only resolves when the resolver appends the cluster search domains listed in the pod's /etc/resolv.conf, whereas the fully qualified name does not depend on them.
    • Fix: No infrastructure fix was required. This was a knowledge gap rather than a platform failure.
    • Lesson Learned: Not every apparent DNS issue is a DNS failure. Sometimes it is simply a misunderstanding of name resolution behaviour.
      [Screenshot: k8s-dns-validation]
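To make the search-domain behaviour concrete, the sketch below lists the names a glibc-style resolver would try for a given query, in order. It is a simplified model, not the resolver itself: the function name is hypothetical, and other resolvers (musl, or some nslookup builds) handle the search list differently, which is one plausible way to see NXDOMAIN for a short name.

```shell
# List the names a glibc-style resolver would try for NAME, in order.
# Usage: dns_candidates NAME NDOTS SEARCH_DOMAIN...
dns_candidates() {
  name="$1"; ndots="$2"; shift 2
  case "$name" in
    *.) echo "$name"; return 0 ;;   # trailing dot: already fully qualified
  esac
  # Count the dots in the name.
  dots_only="$(printf '%s' "$name" | tr -cd '.')"
  dots="${#dots_only}"
  # Enough dots: try the name as-is before the search list.
  [ "$dots" -ge "$ndots" ] && echo "$name."
  for domain in "$@"; do
    echo "$name.$domain."
  done
  # Too few dots: the as-is lookup comes last.
  [ "$dots" -lt "$ndots" ] && echo "$name."
  return 0
}

# Typical values for a pod in the default namespace (ndots:5):
#   dns_candidates kubernetes.default 5 \
#     default.svc.cluster.local svc.cluster.local cluster.local
```

With those values the second candidate is kubernetes.default.svc.cluster.local., which is the name that actually exists. A client that skips the search list only ever tries the bare kubernetes.default.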
  • Gateway API Stuck Waiting for Controller

    • Symptoms: Gateway status showed:

      Accepted: Unknown
      Programmed: Unknown
      Waiting for controller 
      
    • Investigation: I executed:

      kubectl describe gateway
      kubectl get gatewayclass
      # assumes the controller runs as a Deployment in kube-system
      kubectl logs -n kube-system deployment/aws-load-balancer-controller
      
    • Root Cause: The GatewayClass had an incorrect controllerName. I had configured controllerName: ingress.k8s.aws/gateway instead of controllerName: gateway.k8s.aws/alb.

    • Fix: I corrected the controllerName in the GatewayClass configuration.

    • Verification: The GatewayClass became Accepted=True and Gateway resources began reconciling.

    • Lesson learned: Gateway API resources are controller-driven. Without a matching controller, reconciliation never occurs.
      [Screenshots: gateway-waiting-for-controller, gateway-unknown]
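A mismatch like this is easy to spot once the controllerName of every GatewayClass is printed next to the value the controller watches. A minimal sketch, assuming the expected value is gateway.k8s.aws/alb as described above; the function name and class names are hypothetical.

```shell
# Read "NAME CONTROLLER" lines on stdin and flag GatewayClasses whose
# controllerName differs from the one the installed controller watches.
flag_gatewayclasses() {
  expected="$1"
  while read -r name controller; do
    [ -z "$name" ] && continue
    if [ "$controller" = "$expected" ]; then
      echo "ok       $name -> $controller"
    else
      echo "MISMATCH $name -> $controller (expected $expected)"
    fi
  done
}

# Against a live cluster:
#   kubectl get gatewayclass --no-headers \
#     -o custom-columns=NAME:.metadata.name,CONTROLLER:.spec.controllerName \
#     | flag_gatewayclasses gateway.k8s.aws/alb
```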

  • AWS Load Balancer Controller IAM and IMDS Failures

    • Symptoms: Controller logs showed:

      FailedBuildModel
      No EC2 IMDS role found
      failed to refresh cached credentials 
      
    • Investigation: I reviewed IAM roles, instance profiles, IMDS access, and worker node permissions.

    • Root Cause: The controller could not retrieve AWS credentials through IMDS.

    • Fix: I validated IAM role attachment and IMDS accessibility.

    • Verification: Controller successfully communicated with AWS APIs.

    • Lesson learned: Cloud integrations often fail because of identity and permissions rather than Kubernetes configuration.
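IMDS access can be checked directly from a worker node, or from a pod, with two IMDSv2 requests: one for a session token and one for the attached role name. This is a sketch with hypothetical function names, not the author's exact procedure. A common related cause, offered here as an assumption to check, is an IMDSv2 hop limit of 1, which blocks pods that do not use host networking.

```shell
IMDS_BASE="${IMDS_BASE:-http://169.254.169.254}"

# Fetch an IMDSv2 session token (fails fast if IMDS is unreachable).
imds_token() {
  curl -sf --max-time 2 -X PUT "$IMDS_BASE/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60"
}

# Print the IAM role name exposed through the instance profile, if any.
imds_role_name() {
  token="$(imds_token)" || { echo "IMDS unreachable" >&2; return 1; }
  curl -sf --max-time 2 -H "X-aws-ec2-metadata-token: $token" \
    "$IMDS_BASE/latest/meta-data/iam/security-credentials/" \
    || { echo "no instance role found" >&2; return 1; }
}

# Example: run on a worker node, then again from inside a pod.
#   imds_role_name
```

If the call succeeds on the node but fails inside a pod, the instance metadata options are a more likely culprit than the IAM role itself.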

  • Missing providerID Preventing Target Registration

    • Symptoms: AWS Load Balancer Controller failed to register worker nodes. Logs showed providerID is not specified, and target groups remained empty.
      [Screenshot: missing-provider-ID]
    • Root Cause: Nodes lacked spec.providerID, so AWS LBC could not map Kubernetes nodes to EC2 instances.
    • Fix: I investigated the AWS Cloud Controller Manager, IMDS, and node metadata, then patched the providerID values on the nodes.
      [Screenshot: patch-providerID]
    • Verification: Targets successfully registered.
    • Lesson learned: Cloud-native integrations rely heavily on metadata. Healthy nodes do not necessarily mean healthy cloud integration.
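On AWS the providerID follows the form aws:///AVAILABILITY_ZONE/INSTANCE_ID. The sketch below builds that string and the matching patch body. The helper names, node name, zone, and instance ID are hypothetical placeholders.

```shell
# Build the providerID format used by the AWS integrations:
# aws:///AVAILABILITY_ZONE/INSTANCE_ID
aws_provider_id() {
  printf 'aws:///%s/%s\n' "$1" "$2"
}

# Emit the patch body for a node's spec.providerID.
provider_id_patch() {
  printf '{"spec":{"providerID":"%s"}}\n' "$(aws_provider_id "$1" "$2")"
}

# List nodes with their current providerID:
#   kubectl get nodes -o custom-columns=NAME:.metadata.name,PROVIDER_ID:.spec.providerID
# Patch one node (hypothetical values):
#   kubectl patch node worker-1 -p "$(provider_id_patch us-east-1a i-0123456789abcdef0)"
```

The field can be set while it is empty but not changed afterwards, so it is worth double-checking the zone and instance ID before patching.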
  • Internal Application Load Balancer Instead of Internet-Facing

    • Symptoms: The Application Load Balancer was created successfully. Targets were healthy and pods were healthy, but browser access still failed.
    • Investigation: I checked security groups, target group health, gateway configuration, and route configuration.
    • Root Cause: The ALB was created with the default internal scheme. I had assumed subnet tags alone would produce an internet-facing ALB. They did not.
    • Fix: I explicitly configured scheme: internet-facing through LoadBalancerConfiguration.
    • Lesson learned: Never assume controller defaults match your intended architecture. Explicitly define desired behaviour.
      [Screenshots: internal-lb, lb-config, internet-facing-lb]
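The scheme the controller actually chose can be read back from AWS instead of inferred from subnet tags. A minimal sketch, assuming the AWS CLI is configured for the right account and region; the function name and load balancer names are hypothetical.

```shell
# Read "NAME SCHEME" lines on stdin and report load balancers that are
# not internet-facing.
flag_internal_albs() {
  while read -r name scheme; do
    [ -z "$name" ] && continue
    if [ "$scheme" != "internet-facing" ]; then
      echo "$name is $scheme: not reachable from the public internet"
    fi
  done
}

# Against a live account:
#   aws elbv2 describe-load-balancers \
#     --query 'LoadBalancers[].[LoadBalancerName,Scheme]' --output text \
#     | flag_internal_albs
```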
  • Healthy Targets but Browser Access Failed

    • Symptoms: The target group was healthy and curl worker-ip:nodePort worked, but browser access still failed.
    • Root Cause: The ALB itself was internal. An internal ALB has only private IP addresses, so traffic from the public internet could not reach it.
    • Fix: I converted the ALB to internet-facing and validated public subnet selection.
    • Lesson learned: Healthy targets do not automatically mean traffic is reachable. It is best to validate the entire request path: Browser → ALB → Target Group → NodePort → Service → Pod
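The first hop of that request path can be checked without a browser: the DNS name of an internal ALB resolves only to private addresses. The helper below is a hypothetical sketch that classifies an IPv4 address against the RFC 1918 private ranges.

```shell
# Return 0 when the IPv4 address is in an RFC 1918 private range.
is_private_ipv4() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[0-1].*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (ALB_DNS_NAME is a placeholder for your load balancer's DNS name):
#   for ip in $(dig +short "$ALB_DNS_NAME"); do
#     is_private_ipv4 "$ip" && echo "$ip is private: the ALB is internal"
#   done
```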

Major Operational Lessons

This project reinforced several key principles.

  • Validate Layer by Layer: Never debug the entire platform simultaneously. Validate infrastructure, nodes, networking, DNS, services, controllers, load balancers, and applications in turn.

  • Networking Is Often the Real Problem: Many failures surfaced as pending pods, readiness probe failures, missing targets, or gateway reconciliation issues, but ultimately traced back to networking.

  • Cloud Integrations Depend on Metadata: Components such as AWS Load Balancer Controller, Gateway API integrations, and target registration rely heavily on providerID, subnet tags, IAM permissions, and IMDS access.

  • Production Troubleshooting Is About Isolation: The most valuable skill developed during this project was learning how to isolate variables and validate each layer independently.

What's Next

The next phase of the platform includes:

  • Production-style 3-tier application deployment.
  • CI/CD implementation.
  • Observability.
  • Cluster upgrades.
  • AI-assisted Kubernetes troubleshooting.

Find the GitHub repo here and Kubernetes cluster build article here

The platform eventually became healthy.

However, the most valuable outcome was not the working cluster itself.

It was developing a systematic approach to troubleshooting distributed systems and a much deeper understanding of how Kubernetes behaves operationally than a successful deployment ever could.
