Kene Ojiteli

Posted on Jun 2

Building and Operating a Production-Style Kubernetes Platform on AWS Using kubeadm

#devops #aws #cloud #kubernetes

Introduction

Managed Kubernetes platforms such as Amazon EKS, Google Kubernetes Engine (GKE), and Azure Kubernetes Service (AKS) abstract away much of the operational complexity involved in running Kubernetes clusters. While this significantly improves developer productivity, it also hides many of the internal systems responsible for cluster orchestration, networking, node registration, and workload scheduling.

As a result, many engineers interact with Kubernetes daily without fully understanding the components that keep a cluster operational behind the scenes.

To better understand Kubernetes from an operational perspective, I set out to build and operate a self-managed Kubernetes platform on AWS using kubeadm. Unlike lightweight local environments such as Minikube or kind, kubeadm bootstraps Kubernetes in a way that closely resembles how real-world self-managed clusters are provisioned and operated.

The objective of this project was not simply to install Kubernetes, but to explore:

How the control plane components interact.
How worker nodes register with the cluster.
How Kubernetes networking behaves.
How cloud integrations work.
How traffic reaches workloads running inside the cluster.
How operational failures surface during deployment and runtime.
How production-style systems behave beneath managed abstractions.

This article documents the architecture, implementation process, engineering decisions, operational lessons, and troubleshooting insights encountered during the effort to bring the platform to a healthy operational state.

Project Objectives

The primary objectives of this project were to:

Provision infrastructure on AWS using Terraform.
Bootstrap a self-managed Kubernetes cluster using kubeadm.
Configure Kubernetes networking using Calico.
Integrate Gateway API with AWS Load Balancer Controller.
Expose workloads externally using AWS Application Load Balancers.
Validate cluster functionality through application deployment.
Understand the operational mechanics typically abstracted away by managed Kubernetes services.

Beyond simply executing commands, the project focused on understanding why each component exists and how the individual systems interact within the broader Kubernetes ecosystem.

Architecture Overview

The environment was designed to simulate a simplified production-style Kubernetes deployment.

The infrastructure consisted of:

1 control plane node deployed within a private subnet.
2 worker nodes deployed within private subnets.
1 bastion host deployed within a public subnet.
Public subnets for internet-facing load balancers and NAT gateway routing.
Private networking between Kubernetes nodes.
No direct public internet exposure for cluster nodes.

The operational access flow was structured as: Local Machine → Bastion Host → Kubernetes Nodes.

This approach reflects common production security practices where internal infrastructure resources are isolated from direct public access.

[Figure: Infrastructure Architecture]

Engineering Decisions

Several architectural and operational decisions were intentionally made during this project to better simulate how production Kubernetes environments are designed and operated.

Why kubeadm Instead of Managed Kubernetes?

Managed Kubernetes services significantly reduce operational burden by automating:

Control plane provisioning.
Networking setup.
Upgrades.
Cloud integrations.

However, this abstraction can make troubleshooting difficult when engineers do not understand the underlying systems.

Using kubeadm exposed the internal mechanics involved in:

Control plane initialisation.
Node registration.
Cluster networking.
Certificate generation.
Cloud controller integrations.
Workload routing.
Service exposure.

Private Cluster Nodes

The Kubernetes nodes were deployed within private subnets to reduce direct internet exposure.

Administrative access was restricted through a bastion host acting as the controlled entry point into the infrastructure.

This mirrors how many production Kubernetes environments isolate internal infrastructure from the public internet.

Calico as the Container Network Interface (CNI)

Calico was selected as the Container Network Interface (CNI) plugin to provide pod-to-pod networking using VXLAN encapsulation.

This enabled communication between workloads running across multiple worker nodes while maintaining network isolation and routing consistency.

Gateway API Instead of Traditional Ingress

Instead of using traditional Ingress resources, this project explored Kubernetes Gateway API together with AWS Load Balancer Controller.

Gateway API provides a more expressive and extensible traffic management model than legacy Ingress resources and reflects the direction in which Kubernetes networking is evolving.

This phase of the project also became one of the most operationally valuable sections due to the cloud integration and reconciliation issues encountered during implementation.

Step-by-step implementation

Infrastructure Provisioning with Terraform

The infrastructure resources were provisioned using Terraform. The Terraform configuration created:

VPC.
Public and private subnets.
Internet gateway.
NAT gateway.
Route tables.
Security groups.
EC2 instances.
IAM roles and instance profiles.

Before provisioning resources, the AWS CLI was configured locally.

Terraform was then used to provision the infrastructure.
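A minimal sketch of that workflow, assuming the AWS CLI and Terraform are installed and the configuration lives in the current directory. The `run` helper and the `tfplan` file name are illustrative and not taken from the project. The helper prints each command and only executes it when `APPLY=1` is set, so the sequence can be reviewed before anything is created.

```shell
# Dry-run helper (hypothetical): print the command, run it only if APPLY=1.
run() {
  echo "+ $*"
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  fi
}

run aws sts get-caller-identity   # confirm which AWS identity is in use
run terraform init                # download providers and set up the backend
run terraform plan -out=tfplan    # preview the changes and save the plan
run terraform apply tfplan        # apply exactly the saved plan
```

Saving the plan and applying that file means the changes that were reviewed are the changes that get applied.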

Important Resource Tags

Several subnet resource tags were also required to enable AWS Load Balancer Controller subnet discovery.

Public Subnet Tags

"kubernetes.io/role/elb" = "1"
"kubernetes.io/cluster/kubeadm-cluster" = "shared"

Private Subnet Tags

"kubernetes.io/role/internal-elb" = "1"
"kubernetes.io/cluster/kubeadm-cluster" = "shared"

These tags allowed the AWS Load Balancer Controller to automatically discover eligible subnets during Application Load Balancer provisioning.

This later became particularly important when troubleshooting why an Application Load Balancer was unexpectedly provisioned as internal rather than internet-facing.
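When a load balancer lands in the wrong subnets, it helps to check which subnets actually carry each role tag. The sketch below builds the AWS CLI filter for a role tag. The `subnet_tag_filter` helper is a hypothetical name, and the commented CLI call assumes credentials for the account that owns the VPC.

```shell
# Build the --filters argument for one of the two role tags.
# $1 is "elb" (public subnets) or "internal-elb" (private subnets).
subnet_tag_filter() {
  echo "Name=tag:kubernetes.io/role/$1,Values=1"
}

echo "public filter:  $(subnet_tag_filter elb)"
echo "private filter: $(subnet_tag_filter internal-elb)"

# With AWS credentials configured, the filter can be passed to the CLI:
#   aws ec2 describe-subnets \
#     --filters "$(subnet_tag_filter elb)" \
#     --query 'Subnets[].SubnetId' --output text
```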

Accessing the Cluster Environment

Since the Kubernetes nodes were deployed within private subnets, direct SSH access was not possible.

A bastion host deployed in a public subnet was used as the administrative entry point into the infrastructure.

The access flow was: Local Machine → Bastion Host → Kubernetes Nodes

Once connected to the bastion host, SSH access to the control plane and worker nodes was established using private IP addresses.
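The two hops can also be combined into one command with OpenSSH's ProxyJump option (`-J`). This is a sketch rather than the exact commands used in the project: the `ubuntu` login, both IP addresses, and the `jump_cmd` helper are placeholders.

```shell
# Print an ssh command that hops through the bastion with ProxyJump (-J).
# The "ubuntu" login and both IP addresses are placeholders.
jump_cmd() {
  bastion_ip="$1"
  node_ip="$2"
  echo "ssh -J ubuntu@${bastion_ip} ubuntu@${node_ip}"
}

jump_cmd 203.0.113.10 10.0.1.20
```

With `-J`, authentication to both hosts happens from the local machine, so the private key does not need to be copied onto the bastion.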

Preparing the Kubernetes Nodes

Before initialising the cluster, all nodes required baseline operating system and Kubernetes configuration.

The following preparation steps were performed on all Kubernetes nodes:

Updating package repositories.
Disabling swap.
Enabling required kernel modules.
Configuring container runtime dependencies.
Installing Kubernetes repositories.
Configuring networking prerequisites.

These preparation steps are critical because Kubernetes expects consistent runtime and networking behaviour across all participating nodes.
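As a rough illustration of the kernel module and networking prerequisites, the sketch below renders the two settings files into a scratch directory. The file names are assumptions. On a real node they would typically live at `/etc/modules-load.d/k8s.conf` and `/etc/sysctl.d/k8s.conf`, and the root-only steps are left as comments.

```shell
# Render the kernel-module and sysctl settings into a scratch directory.
out_dir="$(mktemp -d)"

# Modules needed for overlay filesystems and bridged pod traffic.
cat > "$out_dir/modules-k8s.conf" <<'EOF'
overlay
br_netfilter
EOF

# Let iptables see bridged traffic and allow IPv4 forwarding.
cat > "$out_dir/sysctl-k8s.conf" <<'EOF'
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF

echo "Rendered node settings in $out_dir"

# On each node (requires root), the remaining steps would look like:
#   sudo swapoff -a                                   # disable swap now
#   sudo modprobe overlay && sudo modprobe br_netfilter
#   sudo sysctl --system                              # reload sysctl settings
```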

Installing Kubernetes Components

After preparing the operating systems, the Kubernetes components were installed on all nodes. The following components were installed:

kubeadm: responsible for bootstrapping the Kubernetes cluster.
kubelet: the node agent responsible for managing workloads.
kubectl: the Kubernetes command-line client.
container runtime: responsible for running containers on each node. containerd was used as the container runtime.

The kubelet service was then enabled to start automatically during system boot.

Initialising the Control Plane

Once the common node configuration was complete, the Kubernetes control plane was initialised on the control plane node. The initialisation process bootstrapped the cluster by:

Generating cluster certificates.
Creating the API server.
Initialising etcd.
Deploying control plane components.

After the cluster initialisation completed successfully, kubeconfig was configured to allow kubectl communication with the Kubernetes API server.
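A minimal sketch of the initialisation step, assuming a kubeadm configuration file is used. The `podSubnet` value is an assumption and must match the address pool the CNI is configured with. The `v1beta3` API version is also an assumption, since the supported version depends on the kubeadm release. The commands that need root on the control plane node are shown as comments.

```shell
# Render a minimal kubeadm configuration to a scratch file.
# podSubnet is an assumed value; it must match the CNI's pod address pool.
kubeadm_config="$(mktemp)"

cat > "$kubeadm_config" <<'EOF'
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
networking:
  podSubnet: 192.168.0.0/16
EOF

echo "kubeadm config written to $kubeadm_config"

# On the control plane node (requires root):
#   sudo kubeadm init --config "$kubeadm_config"
# Then make kubectl usable for the current user:
#   mkdir -p "$HOME/.kube"
#   sudo cp /etc/kubernetes/admin.conf "$HOME/.kube/config"
#   sudo chown "$(id -u):$(id -g)" "$HOME/.kube/config"
```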

At this stage, the CoreDNS pods remained in a Pending state because a Container Network Interface had not yet been installed.

This was an important operational lesson: Kubernetes networking is not optional. Without a functioning CNI plugin, nodes remain in a NotReady state and pod networking cannot function correctly.

Configuring Cluster Networking with Calico

To enable pod-to-pod communication across nodes, Calico was deployed as the cluster networking solution.

Once the networking layer became operational:

Nodes transitioned to Ready.
Pending system pods became healthy.
Pod networking became functional.
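One quick way to confirm the transition is to count the nodes that are not yet Ready. The `count_not_ready` helper below is a hypothetical convenience that reads the default table printed by `kubectl get nodes`. It treats any status other than exactly `Ready` (including `Ready,SchedulingDisabled`) as not ready.

```shell
# Count nodes whose STATUS column is not exactly "Ready".
# Expects the default `kubectl get nodes` table on stdin (header on line 1).
count_not_ready() {
  awk 'NR > 1 && $2 != "Ready" { n++ } END { print n + 0 }'
}

# Usage against a live cluster:
#   kubectl get nodes | count_not_ready
#   kubectl -n kube-system get pods    # CoreDNS should now be Running
```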

Joining Worker Nodes

After the control plane became operational, the worker nodes were joined to the cluster using the token generated during cluster initialisation.

The join process established secure trust between the worker nodes and the Kubernetes API server.

Once completed successfully, the worker nodes appeared within the cluster.

Validating Cluster Networking

Before exposing workloads externally, I validated internal Kubernetes networking. I deployed temporary test pods and verified:

Pod-to-pod communication.
DNS resolution.
Service discovery.
Cluster networking behaviour.

This phase was particularly valuable because it exposed how Kubernetes networking failures often surface indirectly through symptoms such as readiness probe failures, DNS inconsistencies, or Pending pods.
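A small sketch of a DNS check, assuming the cluster uses the default `cluster.local` domain. The `svc_fqdn` helper, the `dns-test` pod name, and the busybox image tag are illustrative choices, not the exact ones used in the project.

```shell
# Build the fully qualified DNS name of a Service.
# cluster.local is the default cluster domain; adjust if yours differs.
svc_fqdn() {
  name="$1"
  namespace="${2:-default}"
  echo "${name}.${namespace}.svc.cluster.local"
}

svc_fqdn kubernetes default

# From a throwaway pod, the name can then be resolved, for example:
#   kubectl run dns-test --rm -it --restart=Never --image=busybox:1.36 \
#     -- nslookup "$(svc_fqdn kubernetes default)"
```

If the lookup fails while the pod itself is Running, the problem is more likely in CoreDNS or the CNI than in the workload.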

Deploying a Sample Workload

To validate cluster functionality, I deployed a sample NGINX workload with multiple replicas. This validation confirmed:

Workload scheduling.
Pod networking.
Service exposure.
Node communication.
Replica distribution across worker nodes.
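A workload along these lines could look like the sketch below. The replica count, image tag, resource names, and the NodePort Service type are all assumptions chosen for illustration. The manifest is written to a scratch file and the apply step is left as a comment.

```shell
# Write a sample Deployment and Service to a scratch file.
# Names, replica count, image tag, and Service type are assumptions.
workload="$(mktemp)"

cat > "$workload" <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.27
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
spec:
  type: NodePort
  selector:
    app: nginx
  ports:
    - port: 80
      targetPort: 80
EOF

echo "manifest written to $workload"

# Against a live cluster:
#   kubectl apply -f "$workload"
#   kubectl get pods -l app=nginx -o wide   # NODE column shows replica spread
```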

Exposing the Application with Gateway API and AWS Load Balancer Controller

To expose the application externally, I implemented AWS Load Balancer Controller alongside Gateway API. The implementation involved:

Installing the AWS Load Balancer Controller and Gateway API CRDs.
Installing AWS Load Balancer Controller.
Creating a GatewayClass.
Creating a Gateway.
Configuring HTTPRoutes.
Integrating an AWS Application Load Balancer.

Installing the AWS Load Balancer Controller and Gateway API CRDs

Creating a GatewayClass

A GatewayClass resource was created to reference the AWS Load Balancer Controller implementation responsible for reconciling Gateway API resources into AWS infrastructure.
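A GatewayClass of this kind might look like the sketch below. The `aws-alb` name is hypothetical, and the `controllerName` value is my understanding of what the AWS Load Balancer Controller expects for ALBs. It should be checked against the documentation for the installed controller version, since a mismatch leaves the GatewayClass unaccepted.

```shell
# Write a GatewayClass manifest to a scratch file.
# The name is hypothetical; controllerName is an assumption to verify
# against the installed controller's documentation.
gatewayclass="$(mktemp)"

cat > "$gatewayclass" <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: aws-alb
spec:
  controllerName: gateway.k8s.aws/alb
EOF

echo "GatewayClass written to $gatewayclass"

# Against a live cluster:
#   kubectl apply -f "$gatewayclass"
#   kubectl get gatewayclass    # ACCEPTED should become True
```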

Creating a Gateway

The Gateway resource defined the desired external entry point into the cluster.

The AWS Load Balancer Controller reconciled this Gateway into an AWS Application Load Balancer.
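As an illustration, a Gateway with a single HTTP listener could be declared as follows. The `web-gateway` name and the `aws-alb` class name are hypothetical and must match whatever GatewayClass exists in the cluster.

```shell
# Write a Gateway manifest with one HTTP listener to a scratch file.
# Resource and class names are hypothetical.
gateway="$(mktemp)"

cat > "$gateway" <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: web-gateway
spec:
  gatewayClassName: aws-alb
  listeners:
    - name: http
      protocol: HTTP
      port: 80
EOF

echo "Gateway written to $gateway"

# Against a live cluster:
#   kubectl apply -f "$gateway"
#   kubectl get gateway web-gateway   # ADDRESS shows the ALB DNS name once provisioned
```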

Configuring HTTPRoutes

HTTPRoutes were then configured to route external traffic into backend Kubernetes services.
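A route that sends all paths to one backend Service might look like this sketch. The route, gateway, and Service names are hypothetical, and the Service is assumed to listen on port 80.

```shell
# Write an HTTPRoute that forwards every path to one Service.
# All resource names are hypothetical.
httproute="$(mktemp)"

cat > "$httproute" <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: web-route
spec:
  parentRefs:
    - name: web-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: nginx
          port: 80
EOF

echo "HTTPRoute written to $httproute"

# Against a live cluster:
#   kubectl apply -f "$httproute"
#   kubectl describe httproute web-route   # check the Accepted and ResolvedRefs conditions
```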

Browser Validation

Once the Application Load Balancer was provisioned successfully, the application became accessible externally through the ALB DNS endpoint.

Key Operational Challenges Encountered

Several operational issues were encountered during the project. Rather than setbacks, these failures became some of the most valuable learning experiences of the platform build.

CoreDNS Pending Before CNI Installation

Before installing Calico, CoreDNS pods remained in a Pending state because Kubernetes networking had not yet been initialised.

This reinforced how heavily Kubernetes depends on a functioning networking layer.

Gateway API Controller Discovery Issues

While integrating Gateway API with AWS Load Balancer Controller, the GatewayClass initially remained in a Pending state because of an incorrect controllerName configuration.

This highlighted the importance of understanding how Gateway API resources map to specific controller implementations.

Missing providerID Preventing Target Registration

AWS Load Balancer Controller initially failed to register Kubernetes worker nodes into the AWS Target Group because the Kubernetes nodes lacked providerID values.

This exposed the relationship between Kubernetes node metadata and cloud-provider integrations.
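For reference, the AWS providerID has the form `aws:///<availability-zone>/<instance-id>`. The sketch below builds that string and shows, as comments, one way to inspect and set it. The `provider_id` helper and the example zone and instance ID are placeholders, and the patch approach is an assumption rather than the fix used in the project.

```shell
# Build a providerID in the form used for AWS nodes:
#   aws:///<availability-zone>/<instance-id>
provider_id() {
  az="$1"
  instance_id="$2"
  echo "aws:///${az}/${instance_id}"
}

provider_id us-east-1a i-0123456789abcdef0

# Inspect what each node currently reports:
#   kubectl get nodes -o custom-columns=NAME:.metadata.name,PROVIDER_ID:.spec.providerID
# If the field is still empty, it can be set by patching the Node object:
#   kubectl patch node <node-name> \
#     -p "{\"spec\":{\"providerID\":\"$(provider_id us-east-1a i-0123456789abcdef0)\"}}"
```

The value can also be supplied when the kubelet starts, through its `--provider-id` flag, which avoids patching nodes after they join.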

Internal vs Internet-Facing Load Balancer Behaviour

An Application Load Balancer was initially provisioned as internal rather than internet-facing because subnet selection and load balancer scheme configuration were not explicitly defined.

This reinforced how AWS networking behaviour influences Kubernetes ingress architecture.
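A quick way to spot this condition is to list each load balancer with its scheme. The `internal_lbs` helper below is hypothetical. It reads name and scheme pairs, as produced by the commented AWS CLI query, and prints the names of load balancers whose scheme is `internal`. The sample input uses made-up names.

```shell
# Read "name scheme" pairs on stdin and print names of internal load balancers.
internal_lbs() {
  awk '$2 == "internal" { print $1 }'
}

# Demonstration with made-up names:
printf 'demo-alb\tinternal\nother-alb\tinternet-facing\n' | internal_lbs

# With AWS credentials configured:
#   aws elbv2 describe-load-balancers \
#     --query 'LoadBalancers[].[LoadBalancerName,Scheme]' --output text | internal_lbs
```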

Kubernetes Networking and DNS Debugging

Several cluster issues surfaced indirectly through networking symptoms rather than explicit infrastructure-level failures. Debugging involved validating:

Pod-to-pod communication.
Kubernetes DNS resolution.
Service discovery.
NodePort reachability.
Target group health.
ALB routing behaviour.

This reinforced how deeply Kubernetes operations depend on healthy networking behaviour across both cluster and cloud infrastructure layers.

Key Lessons Learned

Several operational insights emerged throughout this project.

First, Kubernetes clusters rely heavily on networking. Many cluster failures originate from CNI misconfiguration, DNS inconsistencies, or node communication issues.

Second, Kubernetes cloud integrations depend heavily on correct metadata, resource tagging, IAM configuration and controller configuration. Small configuration issues can prevent controllers from reconciling infrastructure correctly.

Third, many Kubernetes issues surface indirectly through symptoms such as Pending pods, readiness probe failures, DNS inconsistencies, or missing ALB targets rather than explicit infrastructure-level failures.

Most importantly, building and operating a self-managed Kubernetes platform provided significantly deeper operational understanding than simply consuming Kubernetes through a managed service.

What Comes Next

This project now serves as the foundation for the next phases of the platform.

The next stage involves deploying a production-style 3-tier application onto the cluster while implementing a more complete end-to-end DevOps lifecycle, including:

CI/CD pipelines.
Observability.
Centralised logging.
Rolling deployments.
Cluster upgrades.
Reliability testing.

The platform will also evolve toward a highly available Kubernetes control plane to better simulate production-grade resiliency and operational continuity during upgrades and node failures.

Future phases will additionally explore AI-assisted operational tooling for Kubernetes troubleshooting and log analysis.

Conclusion

Building and operating a self-managed Kubernetes platform using kubeadm provided significantly deeper operational visibility into how Kubernetes functions internally compared to managed Kubernetes environments.

Beyond simply deploying workloads, the project exposed the interactions between networking, control plane components, node registration, cloud integrations, and traffic management.

Many of the most valuable lessons emerged not from successful deployments, but from troubleshooting failures involving DNS resolution, Gateway API reconciliation, target registration, and application load balancer behaviour.

Several of these troubleshooting scenarios will be explored in more detail in a follow-up article focused specifically on operational debugging, root-cause analysis, and Kubernetes troubleshooting workflows.

The complete project repository, Terraform configuration, Kubernetes manifests, and future project updates can be found here

Building systems is valuable.

Understanding how to operate, troubleshoot, and evolve them is where deeper engineering growth happens.