Introduction
One of the biggest misconceptions about Kubernetes projects is that success comes from getting the cluster running.
In reality, the most valuable learning often comes from the failures encountered along the way.
While building and operating a self-managed Kubernetes platform on AWS using kubeadm, I encountered networking failures, controller reconciliation issues, cloud integration problems, DNS confusion, load balancer misconfigurations, and target registration failures that forced me to move beyond deployment commands and develop a deeper understanding of how Kubernetes behaves operationally.
Many of these issues appeared at one layer of the stack but originated somewhere completely different.
A DNS issue turned out to be a networking issue.
A load balancer issue turned out to be a cloud metadata issue.
A healthy application turned out to be unreachable because of infrastructure configuration.
This article documents the troubleshooting journey, root causes, debugging process, and operational lessons learned while bringing the platform to a healthy state.
The goal is twofold:
- Document real-world troubleshooting experiences for long-term technical recall.
- Demonstrate the operational thinking required when working with production Kubernetes environments.
Platform Overview
The environment consisted of:
- AWS VPC provisioned using Terraform.
- Public and private subnets.
- Bastion host.
- 1 Kubernetes control plane node.
- 2 worker nodes.
- Calico CNI.
- Gateway API.
- AWS Load Balancer Controller.
- AWS Application Load Balancer.
My Troubleshooting Approach
One of the most important lessons from this project was learning how to troubleshoot layer by layer.
Rather than changing multiple variables simultaneously, I gradually adopted the following troubleshooting workflow:
Infrastructure → Nodes → Networking → DNS → Controllers → Load Balancers → Applications
This approach became critical because many Kubernetes failures surfaced far away from their actual root cause.
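The layer order above can be sketched as an ordered list of checks that stops at the first failing layer. This is a minimal illustration rather than the tooling used in this project: `first_failing_layer` and the example check functions are hypothetical stand-ins for real probes such as node readiness or DNS lookups.

```python
from typing import Callable, List, Optional, Tuple

# Layers in the order the workflow validates them.
LAYERS = [
    "Infrastructure", "Nodes", "Networking", "DNS",
    "Controllers", "Load Balancers", "Applications",
]

Check = Callable[[], bool]


def first_failing_layer(checks: List[Tuple[str, Check]]) -> Optional[str]:
    """Run checks in order and return the first layer that fails, or None."""
    for layer, check in checks:
        if not check():
            return layer  # stop here: later layers depend on this one
    return None


# Hypothetical results: networking is broken, every other layer reports healthy.
example = [(layer, (lambda l=layer: l != "Networking")) for layer in LAYERS]
print(first_failing_layer(example))  # Networking
```

Stopping at the first failure is the point of the design: a failing DNS or controller check is not worth investigating while a lower layer is still unhealthy.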
Security Groups Were Too Strict
- Symptoms: Immediately after cluster deployment, nodes could not communicate reliably, some cluster components behaved inconsistently, and the network-related issues were difficult to isolate.
- Initial Assumption: I assumed Kubernetes was misconfigured.
- Investigation: I reviewed security group rules, node connectivity, cluster events and kubelet logs.
- Root Cause: The security groups were locked down too early. I had tried to implement production-grade security controls before confirming that the cluster itself was healthy. That left three variables in play at once (Kubernetes configuration, Calico networking and security group restrictions), which made troubleshooting unnecessarily difficult.
- Fix: I temporarily relaxed the security group restrictions so that cluster communication could function. Once the platform was stable, I gradually tightened the ingress and egress rules.
- Verification: Nodes communicated correctly, cluster components stabilized, and networking troubleshooting became easier.
- Lesson Learned: When building a platform from scratch, validate functionality first and harden security second. This issue taught me the importance of reducing troubleshooting variables: over-securing an unvalidated platform can make root-cause analysis significantly more difficult.
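One way to tighten rules without reintroducing the problem is to compare a proposed rule set against the ports the cluster needs. The sketch below assumes the kubeadm default ports (6443, 2379-2380, 10250) and Calico in VXLAN mode (UDP 4789). The `Rule` shape is a simplified, hypothetical model rather than the AWS API format, and it ignores node roles and source ranges.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    protocol: str  # "tcp" or "udp"
    from_port: int
    to_port: int


# Ports assumed from kubeadm defaults and Calico VXLAN mode; adjust to your setup.
REQUIRED = {
    "Kubernetes API server": Rule("tcp", 6443, 6443),
    "etcd client and peer": Rule("tcp", 2379, 2380),
    "kubelet API": Rule("tcp", 10250, 10250),
    "Calico VXLAN": Rule("udp", 4789, 4789),
}


def covers(rule: Rule, need: Rule) -> bool:
    """True if an allow rule fully covers the needed protocol and port range."""
    return (rule.protocol == need.protocol
            and rule.from_port <= need.from_port
            and rule.to_port >= need.to_port)


def missing_ports(allowed: List[Rule]) -> List[str]:
    """Names of required services not covered by any allow rule."""
    return [name for name, need in REQUIRED.items()
            if not any(covers(rule, need) for rule in allowed)]


# Hypothetical locked-down rule set that forgot the VXLAN overlay port.
strict = [Rule("tcp", 6443, 6443), Rule("tcp", 2379, 2380), Rule("tcp", 10250, 10250)]
print(missing_ports(strict))  # ['Calico VXLAN']
```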


CoreDNS Stuck in Pending State
- Symptoms: CoreDNS remained Pending after cluster initialisation.
- Investigation: I checked kubelet status, node readiness and cluster events.
- Root Cause: No Container Network Interface (CNI) plugin had been installed. Kubernetes was functioning, but pod networking was unavailable. Without a CNI, nodes remain NotReady, pod networking is unavailable, and CoreDNS cannot start.
- Fix: I installed Calico CNI.
- Verification: After Calico was installed, kubectl get nodes returned Ready and CoreDNS became healthy.
- Lesson Learned: Kubernetes networking is not optional. A functioning CNI is a prerequisite for cluster health.
Calico Readiness Probe Failures
- Symptoms: Calico node pods showed readiness failures, including:
    felix is not ready
    readiness probe reporting 503
    dial tcp 127.0.0.1:9099: connect: connection refused
- Investigation: I described the pod and checked its logs, then reviewed the Felix logs, interface states and VXLAN configuration.
- Root Cause: There were multiple contributing factors:
  - Security group restrictions: Early security group restrictions complicated node communication.
  - Incorrect node interface detection: Calico initially struggled to identify the correct node interface.
  - AWS private subnet design: The cluster operated entirely within private subnets, so Calico needed explicit guidance on which interface or IP range to use.
- Fix: I added node autodetection, configuring Calico to select its node address from the private subnet CIDRs.
- Verification: Felix became healthy, the readiness probe passed, and nodes remained Ready.
- Lesson Learned: In cloud environments, never assume the CNI will automatically detect the desired node interface. Explicit configuration is often safer.
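CIDR-based autodetection amounts to choosing the node address that falls inside a known range instead of trusting interface order. The sketch below illustrates that selection logic with the standard `ipaddress` module. It is not Calico's implementation, and the interface names, addresses and subnet CIDRs are hypothetical.

```python
import ipaddress
from typing import Dict, List, Optional


def pick_node_address(interfaces: Dict[str, str], cidrs: List[str]) -> Optional[str]:
    """Return the first interface address that falls inside one of the CIDRs."""
    networks = [ipaddress.ip_network(cidr) for cidr in cidrs]
    for address in interfaces.values():
        ip = ipaddress.ip_address(address)
        if any(ip in network for network in networks):
            return address
    return None


# Hypothetical node: a Docker bridge listed before the VPC-facing interface.
node_interfaces = {"docker0": "172.17.0.1", "ens5": "10.0.11.23"}
private_subnets = ["10.0.11.0/24", "10.0.12.0/24"]  # assumed private subnet CIDRs

print(pick_node_address(node_interfaces, private_subnets))  # 10.0.11.23
```

Choosing by CIDR skips the bridge address even though it is listed first. Calico exposes this style of selection through its `cidr=` autodetection method. Check the Calico documentation for the exact setting that matches your install method.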

Kubernetes DNS Validation and NXDOMAIN Confusion
- Symptoms: DNS behaviour appeared inconsistent; some queries worked while others returned NXDOMAIN. For example, nslookup kubernetes.default returned NXDOMAIN, while nslookup kubernetes.default.svc.cluster.local resolved successfully.
- Initial Assumption: I suspected CoreDNS was still unhealthy.
- Investigation: I validated pod-to-pod communication, DNS resolution and service discovery.
- Root Cause: DNS was functioning correctly. I had misunderstood Kubernetes DNS search domains: short names are expanded through the search list in the pod's /etc/resolv.conf, so whether they resolve depends on that list and on how the lookup tool applies it, while a fully qualified name does not.
- Fix: No infrastructure fix was required. This was a knowledge gap rather than a platform failure.
- Lesson Learned: Not every apparent DNS issue is a DNS failure. Sometimes it is simply a misunderstanding of name resolution behaviour.
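To make the search-domain behaviour concrete, the sketch below lists the names a standard resolver would try for a short name. It assumes the search list and `ndots:5` that Kubernetes typically writes into a pod's /etc/resolv.conf in the default namespace. `candidate_names` is a hypothetical helper that only builds the list and performs no lookups.

```python
from typing import List


def candidate_names(name: str, search: List[str], ndots: int = 5) -> List[str]:
    """List the fully qualified names a resolver would try, in order."""
    if name.endswith("."):
        return [name]  # trailing dot: already absolute, search list is skipped
    expanded = [f"{name}.{domain}." for domain in search]
    absolute = f"{name}."
    # Names with fewer than `ndots` dots go through the search list first.
    if name.count(".") < ndots:
        return expanded + [absolute]
    return [absolute] + expanded


# Assumed search list for a pod in the "default" namespace.
search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for candidate in candidate_names("kubernetes.default", search):
    print(candidate)
```

Only the second candidate, kubernetes.default.svc.cluster.local, names a real Service, so NXDOMAIN answers for the other candidates are expected. A tool that skips the search list, or that reports an intermediate answer, can make a healthy DNS setup look broken.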
Gateway API Stuck Waiting for Controller
- Symptoms: Gateway status showed:
    Accepted: Unknown
    Programmed: Unknown
    Waiting for controller
- Investigation: I executed:
    kubectl describe gateway
    kubectl get gatewayclass
    kubectl logs aws-load-balancer-controller
- Root Cause: Incorrect GatewayClass controllerName. I initially configured controllerName: ingress.k8s.aws/gateway instead of controllerName: gateway.k8s.aws/alb.
- Fix: I corrected the controllerName in the GatewayClass configuration.
- Verification: GatewayClass became Accepted=True and Gateway resources began reconciling.
- Lesson Learned: Gateway API resources are controller-driven. Without a matching controller, reconciliation never occurs.
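Because a controller only claims GatewayClasses whose controllerName matches its own exactly, a mismatch can be caught before anything is applied. The sketch below is a minimal pre-flight check over manifests loaded as dictionaries. The two controller names come from this article, while `unclaimed_gateway_classes` and the class names are hypothetical.

```python
from typing import Dict, List

# Controller name taken from the article; a GatewayClass is only reconciled
# when its spec.controllerName matches this value exactly.
ALB_CONTROLLER_NAME = "gateway.k8s.aws/alb"


def unclaimed_gateway_classes(classes: List[Dict], controller_name: str) -> List[str]:
    """Return names of GatewayClasses that the given controller would ignore."""
    return [
        gc["metadata"]["name"]
        for gc in classes
        if gc["spec"]["controllerName"] != controller_name
    ]


# Hypothetical manifests, trimmed to the relevant fields.
gateway_classes = [
    {"metadata": {"name": "alb-wrong"}, "spec": {"controllerName": "ingress.k8s.aws/gateway"}},
    {"metadata": {"name": "alb"}, "spec": {"controllerName": "gateway.k8s.aws/alb"}},
]

print(unclaimed_gateway_classes(gateway_classes, ALB_CONTROLLER_NAME))  # ['alb-wrong']
```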


AWS Load Balancer Controller IAM and IMDS Failures
- Symptoms: Controller logs showed:
    FailedBuildModel
    No EC2 IMDS role found
    failed to refresh cached credentials
- Investigation: I reviewed IAM roles, instance profiles, IMDS access and worker node permissions.
- Root Cause: The controller could not retrieve AWS credentials through IMDS.
- Fix: I validated the IAM role attachment and IMDS accessibility.
- Verification: The controller successfully communicated with AWS APIs.
- Lesson Learned: Cloud integrations often fail because of identity and permissions rather than Kubernetes configuration.
Missing providerID Preventing Target Registration
- Symptoms: The AWS Load Balancer Controller failed to register worker nodes. Logs showed providerID is not specified, and target groups remained empty.
- Investigation: I reviewed the AWS Cloud Controller Manager, IMDS and node metadata.
- Root Cause: Nodes lacked spec.providerID, so the AWS Load Balancer Controller could not map Kubernetes nodes to EC2 instances.
- Fix: I patched the providerID values on the nodes.
- Verification: Targets successfully registered.
- Lesson Learned: Cloud-native integrations rely heavily on metadata. Healthy nodes do not necessarily mean healthy cloud integration.
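On AWS, a node's providerID conventionally takes the form aws:///<availability-zone>/<instance-id>. The sketch below builds that value and a matching JSON patch body. The helper names, zone and instance ID are hypothetical, and the article does not state the exact values or command the author used.

```python
import json


def aws_provider_id(availability_zone: str, instance_id: str) -> str:
    """Build a providerID in the aws:///<zone>/<instance-id> form."""
    if not instance_id.startswith("i-"):
        raise ValueError(f"not an EC2 instance ID: {instance_id!r}")
    return f"aws:///{availability_zone}/{instance_id}"


def provider_id_patch(availability_zone: str, instance_id: str) -> str:
    """Return a JSON patch body that sets spec.providerID on a Node."""
    patch = {"spec": {"providerID": aws_provider_id(availability_zone, instance_id)}}
    return json.dumps(patch)


# Hypothetical zone and instance ID, for illustration only.
print(provider_id_patch("eu-west-1a", "i-0123456789abcdef0"))
```

A body like this could be passed to `kubectl patch node <node-name> -p '<json>'`. Kubernetes does not allow providerID to be changed once it is set, so it is worth checking the zone and instance ID against the EC2 console before patching.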
Internal Application Load Balancer Instead of Internet-Facing
- Symptoms: The Application Load Balancer was created successfully. Targets and pods were healthy, but browser access still failed.
- Investigation: I checked security groups, target group health, gateway configuration and route configuration.
- Root Cause: The ALB was created using the default internal scheme. I had assumed subnet tags alone would produce an internet-facing ALB. They did not.
- Fix: I explicitly configured scheme: internet-facing through LoadBalancerConfiguration.
- Lesson Learned: Never assume controller defaults match your intended architecture. Explicitly define desired behaviour.
Healthy Targets but Browser Access Failed
- Symptoms: The target group was healthy and curl worker-ip:nodePort worked, but browser access still failed.
- Root Cause: The ALB itself was internal, so traffic from the public internet could not reach it.
- Fix: I converted the ALB to internet-facing and validated public subnet selection.
- Lesson Learned: Healthy targets do not automatically mean the application is reachable. Validate the entire request path: Browser → ALB → Target Group → NodePort → Service → Pod
Major Operational Lessons
This project reinforced several key principles.
Validate Layer by Layer: Never debug the entire platform at once. Validate infrastructure, nodes, networking, DNS, services, controllers, load balancers and applications in turn.
Networking Is Often the Real Problem: Many failures surfaced as pending pods, readiness probe failures, missing targets or gateway reconciliation issues, but ultimately traced back to networking.
Cloud Integrations Depend on Metadata: components such as AWS Load Balancer Controller, Gateway API integrations and Target registration rely heavily on providerID, subnet tags, IAM permissions, and IMDS access.
Production Troubleshooting Is About Isolation: The most valuable skill developed during this project was learning how to isolate variables and validate each layer independently.
What's Next
The next phase of the platform includes:
- Production-style 3-tier application deployment.
- CI/CD implementation.
- Observability.
- Cluster upgrades.
- AI-assisted Kubernetes troubleshooting.
Find the GitHub repo here and Kubernetes cluster build article here
The platform eventually became healthy.
However, the most valuable outcome was not the working cluster itself.
It was developing a systematic approach to troubleshooting distributed systems and a much deeper understanding of how Kubernetes behaves operationally than a successful deployment ever could.
