When people talk about Kubernetes projects, they often focus on the final deployment screenshots.
What they rarely talk about are:
the broken IAM trust relationships,
worker nodes failing to register,
EBS CSI drivers crashing,
NAT Gateway mistakes,
cluster connectivity issues,
and the countless troubleshooting sessions behind the scenes.
Recently, I worked as both the Infrastructure Engineer and Team Lead on a collaborative cloud-native project where we deployed the Spring PetClinic Microservices application on Amazon EKS using Terraform, Docker, and Kubernetes.
This wasn’t a simple localhost deployment.
We built a production-style Kubernetes environment on AWS — complete with networking, ingress, persistent storage, IAM integration, node scaling, and multi-service orchestration.
In this article, I’ll walk through:
the infrastructure architecture,
what I implemented,
the major challenges we faced,
and the lessons I learned while managing the infrastructure side of the project.
Project Architecture
The application followed a microservices architecture deployed on Kubernetes.
Core Components
Config Server
Discovery Server
API Gateway
Customers Service
Vets Service
Visits Service
GenAI Service
MySQL Stateful Databases
AWS Services and Tools Used
| Service | Purpose |
| --- | --- |
| Amazon EKS | Kubernetes orchestration |
| EC2 | Worker nodes |
| IAM | Access management |
| VPC | Networking |
| NAT Gateway | Internet access for private nodes |
| ALB | External traffic routing |
| EBS CSI Driver | Persistent storage |
| ECR | Container image registry |
| Terraform | Infrastructure provisioning |
My Role as Infrastructure Engineer
My responsibilities included:
Provisioning AWS infrastructure using Terraform
Managing the Amazon EKS cluster
Configuring VPC networking
Managing worker nodes and scaling
Installing Kubernetes add-ons
Configuring AWS Load Balancer Controller
Managing IAM roles and OIDC integration
Enabling persistent storage using EBS CSI Driver
Supporting deployment teams
Troubleshooting infrastructure and Kubernetes issues
I also ended up coordinating infrastructure operations across the team whenever deployment blockers occurred.
Provisioning the Infrastructure with Terraform
The environment was provisioned using Terraform.
Core Infrastructure Created
Networking
VPC
Public Subnets
Private Subnets
Internet Gateway
NAT Gateway
Route Tables
Kubernetes Infrastructure
EKS Cluster
Managed Node Groups
IAM Roles
Security Groups
Terraform Workflow
terraform init
terraform validate
terraform plan
terraform apply
One of the biggest advantages of using Terraform was reproducibility.
Instead of manually provisioning infrastructure from the AWS Console, the environment could be recreated consistently using Infrastructure as Code.
Configuring the EKS Cluster
After provisioning the cluster, I connected locally using:
aws eks update-kubeconfig \
--region us-east-1 \
--name petclinic-cluster
Verification:
kubectl get nodes
At one point, kubectl returned:
No resources found
After investigation, I discovered the node group had previously been scaled down to zero to reduce costs.
Scaling the node group back up restored cluster functionality.
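A minimal sketch of that diagnosis, assuming the node group's scaling settings have already been fetched (for example with `aws eks describe-nodegroup`, whose output contains a `scalingConfig` object with `minSize`, `maxSize`, and `desiredSize`). The function name and the sample values are hypothetical and only illustrate the check.

```python
from typing import Dict


def diagnose_node_group(scaling_config: Dict[str, int], ready_nodes: int) -> str:
    """Explain why a cluster may show no nodes, based on node group scaling settings.

    scaling_config mirrors the scalingConfig object (minSize, maxSize,
    desiredSize) reported for an EKS managed node group.
    """
    desired = scaling_config.get("desiredSize", 0)
    if desired == 0:
        return "node group is scaled to zero; raise desiredSize to restore capacity"
    if ready_nodes < desired:
        return f"only {ready_nodes} of {desired} desired nodes are ready; check node registration"
    return "node group is at its desired size"


# Hypothetical values resembling a node group that was scaled down to save costs.
print(diagnose_node_group({"minSize": 0, "maxSize": 3, "desiredSize": 0}, ready_nodes=0))
```

Checking the desired size first separates "there are no nodes because none were requested" from "nodes were requested but failed to register", which need different fixes.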
Installing the AWS Load Balancer Controller
To expose Kubernetes Ingress resources externally, I installed the AWS Load Balancer Controller.
This involved:
configuring OIDC,
creating IAM policies,
creating IAM service accounts,
and deploying the controller using Helm.
OIDC Configuration
eksctl utils associate-iam-oidc-provider \
--region us-east-1 \
--cluster petclinic-cluster \
--approve
Helm Installation
helm repo add eks https://aws.github.io/eks-charts
helm repo update
helm install aws-load-balancer-controller eks/aws-load-balancer-controller \
-n kube-system \
--set clusterName=petclinic-cluster \
--set serviceAccount.create=false \
--set serviceAccount.name=aws-load-balancer-controller
The OIDC Trust Policy Problem
One major issue completely broke the ALB controller deployment.
The controller continuously failed with AccessDenied.
After several troubleshooting sessions, I discovered:
the IAM trust relationship referenced the wrong OIDC provider ID.
This was one of the most important lessons from the project:
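In Amazon EKS, IAM/OIDC integration is extremely sensitive to trust policy configuration: the role's trust policy must reference the cluster's own OIDC provider, and a mismatched provider ID is enough to make role assumption fail with AccessDenied.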
Once the correct OIDC provider was configured, the controller immediately became healthy.
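The kind of mismatch described above can be caught by comparing the role's trust policy against the cluster's OIDC issuer URL. The sketch below is one way to do that, assuming both have already been retrieved as plain data. The issuer IDs, the account ID, and the helper names are all hypothetical placeholders, not values from this project.

```python
from typing import Any, Dict, List


def oidc_provider_path(issuer_url: str) -> str:
    """Strip the scheme from an OIDC issuer URL, leaving host/id/<ID>."""
    return issuer_url.split("://", 1)[-1].rstrip("/")


def trust_policy_problems(trust_policy: Dict[str, Any], issuer_url: str) -> List[str]:
    """List places where a trust policy references a different OIDC provider."""
    expected = oidc_provider_path(issuer_url)
    problems: List[str] = []
    for index, statement in enumerate(trust_policy.get("Statement", [])):
        federated = statement.get("Principal", {}).get("Federated", "")
        if not federated:
            continue  # not a web-identity statement
        # The principal ARN looks like arn:aws:iam::<account>:oidc-provider/<provider path>
        actual = federated.split(":oidc-provider/", 1)[-1]
        if actual != expected:
            problems.append(f"statement {index}: principal references {actual}, expected {expected}")
        # Condition keys are prefixed with the provider path, e.g. "<provider path>:sub"
        for operator_block in statement.get("Condition", {}).values():
            for key in operator_block:
                if not key.startswith(expected + ":"):
                    problems.append(f"statement {index}: condition key {key} does not match {expected}")
    return problems


def make_trust_policy(provider_path: str) -> Dict[str, Any]:
    """Build a sample trust policy for illustration (account ID is a placeholder)."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Federated": f"arn:aws:iam::111122223333:oidc-provider/{provider_path}"},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                f"{provider_path}:sub": "system:serviceaccount:kube-system:aws-load-balancer-controller",
                f"{provider_path}:aud": "sts.amazonaws.com",
            }},
        }],
    }


# Hypothetical issuer URL for the current cluster, and a policy left over from an older one.
ISSUER_URL = "https://oidc.eks.us-east-1.amazonaws.com/id/EXAMPLEAAAA"
stale_policy = make_trust_policy("oidc.eks.us-east-1.amazonaws.com/id/EXAMPLEBBBB")
for problem in trust_policy_problems(stale_policy, ISSUER_URL):
    print(problem)
```

The provider ID appears both in the federated principal and in every condition key, so a stale ID has to be corrected in all of those places, not just one.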
Persistent Storage with EBS CSI Driver
The database team required persistent storage for MySQL StatefulSets.
To support this, I installed the AWS EBS CSI Driver.

IAM Role Creation
eksctl create iamserviceaccount \
--name ebs-csi-controller-sa \
--namespace kube-system \
--cluster petclinic-cluster \
--role-name AmazonEKS_EBS_CSI_DriverRole \
--role-only \
--attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
--approve
Initially, the CSI driver entered:
CrashLoopBackOff
Root cause:
incorrect IAM permissions.
After correcting the IAM role attachment, the storage driver became healthy and PersistentVolumeClaims successfully bound to EBS volumes.
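The exact misconfiguration is not recorded here, but one common thing to check in this situation is whether the driver's service account carries the `eks.amazonaws.com/role-arn` annotation pointing at the intended role. The sketch below assumes the service account has been exported as JSON (for example with `kubectl get sa ebs-csi-controller-sa -n kube-system -o json`). The function name and the account ID are hypothetical.

```python
from typing import Any, Dict

# Annotation that links a Kubernetes service account to an IAM role.
ROLE_ARN_ANNOTATION = "eks.amazonaws.com/role-arn"


def check_role_annotation(service_account: Dict[str, Any], expected_role_arn: str) -> str:
    """Return 'ok', 'missing', or 'mismatch' for the service account's role annotation."""
    annotations = service_account.get("metadata", {}).get("annotations") or {}
    actual = annotations.get(ROLE_ARN_ANNOTATION)
    if actual is None:
        return "missing"
    if actual != expected_role_arn:
        return "mismatch"
    return "ok"


# Placeholder account ID; the role name matches the one created with eksctl above.
EXPECTED_ROLE = "arn:aws:iam::111122223333:role/AmazonEKS_EBS_CSI_DriverRole"

sample_service_account = {
    "metadata": {
        "name": "ebs-csi-controller-sa",
        "namespace": "kube-system",
        "annotations": {},  # no role attached yet
    }
}
print(check_role_annotation(sample_service_account, EXPECTED_ROLE))
```

A "missing" or "mismatch" result would mean the controller pods are running without the intended role, which is consistent with permission errors followed by restarts.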
Worker Node Capacity Problems
As more microservices were deployed, our single worker node became overloaded.
Observed issues included:
Pending pods
High CPU utilisation
Scheduling failures
We eventually scaled the node group to multiple worker nodes.
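A rough way to see why one node was not enough is to add up the CPU requests of all workloads and compare the total with what a single node can allocate. The sketch below does that arithmetic. The per-service requests and the allocatable figure are hypothetical numbers chosen for illustration, not measurements from this cluster.

```python
import math
from typing import Dict


def minimum_nodes(cpu_requests_millicores: Dict[str, int], allocatable_per_node: int) -> int:
    """Lower bound on node count needed to fit the summed CPU requests.

    This ignores memory, DaemonSets, and per-pod placement, so the real
    requirement can be higher.
    """
    if allocatable_per_node <= 0:
        raise ValueError("allocatable_per_node must be positive")
    total = sum(cpu_requests_millicores.values())
    return max(1, math.ceil(total / allocatable_per_node))


# Hypothetical CPU requests in millicores for the services listed earlier.
requests = {
    "config-server": 250,
    "discovery-server": 250,
    "api-gateway": 500,
    "customers-service": 250,
    "vets-service": 250,
    "visits-service": 250,
    "genai-service": 250,
    "mysql": 500,
}

# Assumed allocatable CPU on one small worker node, in millicores.
print(minimum_nodes(requests, allocatable_per_node=1930))
```

With these assumed numbers the requests total 2500 millicores, which cannot fit on a single node with 1930 millicores allocatable, so the scheduler would leave some pods Pending until a second node joins.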
At one point, one EC2 worker instance became unhealthy.
To recover:
I drained the node,
deleted it,
and allowed the Auto Scaling Group to recreate a replacement instance automatically.
This was one of the most realistic operational experiences in the project.
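The recovery steps above can be sketched as a small helper that builds the kubectl commands in order. This is a dry-run illustration only: the node name is a placeholder, the drain flags shown are common choices rather than the ones necessarily used in this project, and "deleted it" is interpreted here as removing the Node object from the cluster.

```python
import shlex
from typing import List


def node_replacement_commands(node_name: str) -> List[List[str]]:
    """Build the kubectl commands to drain a node and remove it from the cluster."""
    return [
        # Evict workloads; DaemonSet pods are skipped and emptyDir data is discarded.
        ["kubectl", "drain", node_name, "--ignore-daemonsets", "--delete-emptydir-data"],
        # Remove the Node object once it is empty.
        ["kubectl", "delete", "node", node_name],
    ]


# Placeholder node name in the style of EKS worker nodes.
for command in node_replacement_commands("ip-10-0-1-23.ec2.internal"):
    print(shlex.join(command))
```

Printing the commands instead of running them keeps the sketch safe to try. The replacement instance itself comes from the Auto Scaling Group, not from kubectl.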
Networking Challenges
Earlier in the project, we deleted the NAT Gateway to reduce AWS costs.
That introduced multiple failures:
worker nodes lost outbound internet access,
pods could not pull images,
API connectivity became unstable.
This reinforced another important lesson:
EKS worker nodes in private subnets depend on a NAT Gateway (or an equivalent outbound path such as VPC endpoints) to pull images and reach AWS APIs.
Terraform was later used to recreate the NAT Gateway and stabilise the environment.
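A failure like this shows up in the private subnets' route tables: when a NAT Gateway is deleted, the default route that pointed at it is left in a `blackhole` state. The sketch below inspects a route table in the shape returned by `aws ec2 describe-route-tables`. The function name and the resource IDs are hypothetical.

```python
from typing import Any, Dict


def default_route_status(route_table: Dict[str, Any]) -> str:
    """Classify the 0.0.0.0/0 route of a route table.

    Returns 'nat', 'internet-gateway', 'blackhole', 'other', or 'missing'.
    """
    for route in route_table.get("Routes", []):
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        if route.get("State") == "blackhole":
            return "blackhole"  # target no longer exists, e.g. a deleted NAT Gateway
        if route.get("NatGatewayId"):
            return "nat"
        if str(route.get("GatewayId", "")).startswith("igw-"):
            return "internet-gateway"
        return "other"
    return "missing"


# Hypothetical private route table after its NAT Gateway was deleted.
broken_table = {
    "RouteTableId": "rtb-0example",
    "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
        {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0example", "State": "blackhole"},
    ],
}
print(default_route_status(broken_table))
```

For a private subnet, anything other than `nat` on the default route would explain nodes losing outbound access and pods failing to pull images.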
Kubernetes Troubleshooting Experience
This project involved far more troubleshooting than I initially expected.
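Some major issues included:
kubectl get nodes returning "No resources found" because the node group had been scaled to zero,
the AWS Load Balancer Controller failing with AccessDenied because the IAM trust relationship referenced the wrong OIDC provider ID,
the EBS CSI driver entering CrashLoopBackOff because of incorrect IAM permissions,
pending pods and scheduling failures on an overloaded single worker node,
and lost outbound connectivity and failed image pulls after the NAT Gateway was deleted.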
This project taught me that Kubernetes engineering is often less about deployment and more about troubleshooting distributed systems.
Leadership Beyond Infrastructure
Although my primary responsibility was infrastructure engineering, I also helped coordinate:
IAM access for teammates,
cluster access troubleshooting,
deployment support,
infrastructure recovery,
and operational decisions during failures.
One of the biggest lessons I learned is this:
Infrastructure engineering is not just about provisioning resources.
It’s also about ownership, communication, stability, and helping teams move forward when systems fail.
Key Lessons Learned
This project gave me hands-on experience with:
Amazon EKS administration
Kubernetes operations
IAM and OIDC integration
Terraform Infrastructure as Code
Kubernetes networking
Load balancing and ingress
Persistent storage management
Worker node recovery
Cluster troubleshooting
Cloud-native architecture
Most importantly, it taught me how real-world infrastructure behaves under operational pressure.
Final Thoughts
This project was one of the most challenging and rewarding cloud engineering experiences I’ve had so far.
We successfully built and managed:
a production-style EKS environment,
persistent Kubernetes workloads,
scalable worker nodes,
ingress-based traffic routing,
and cloud-native infrastructure on AWS.
And while the deployment itself was important, the real value came from:
the troubleshooting,
the operational decision-making,
and the engineering lessons learned along the way.
If you’re learning Kubernetes or AWS:
don’t avoid the hard problems.
That’s where the real growth happens.
Technologies Used
AWS
Amazon EKS
Kubernetes
Terraform
Docker
Helm
IAM
Amazon ECR
Amazon EBS
kubectl
eksctl
Author
Chioma Nwosu
Infrastructure Engineer | Cloud & DevOps Enthusiast
If you enjoyed this article or have suggestions, feel free to connect with me on LinkedIn.










