Introduction: EKS Observability Platform
As cloud-native systems continue to grow in complexity, many organizations depend on Kubernetes to run and scale their containerized applications. While Kubernetes is powerful, managing distributed workloads often comes with limited visibility, making it difficult to detect issues before they impact the business. The real challenge isn't just deploying applications to a Kubernetes cluster; it's understanding how those applications are performing, how resources are being used, and whether the system is healthy overall. In today's blog we look at how building an observability-first Amazon Elastic Kubernetes Service (EKS) platform can help solve these challenges through better monitoring, automated scaling, and early detection of potential incidents.
This project demonstrates how to design, provision, and operate a production-style, observability-first Kubernetes platform on Amazon EKS, using Terraform as the platform definition layer.
The focus is Day-2 operations: metrics, autoscaling, failure recovery, and clean platform boundaries, not just deploying containers.
Project Goals
- Build a true 3-tier architecture on EKS (Frontend → API → Platform Services)
- Provision infrastructure using modular Terraform
- Deploy Prometheus + Grafana before workloads
- Demonstrate autoscaling and self-healing with live metrics
- Serve as a GitHub portfolio project for platform engineering
What This Project Demonstrates
- Production-style Terraform design
- Observabilityโdriven operations
- Safe autoscaling practices
- Kubernetes selfโhealing behavior
- Platform engineering mindset
Project Structure
.
├── providers.tf
├── versions.tf
├── variables.tf
├── main.tf
├── outputs.tf
└── modules/
    ├── vpc/
    ├── eks/
    ├── observability/
    └── apps/
Each Terraform module represents a platform responsibility boundary.
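To make that boundary concrete, here is a minimal sketch of what the root main.tf wiring could look like. The module outputs (vpc_id, private_subnet_ids) and the explicit depends_on ordering are illustrative assumptions, not necessarily the repo's exact interface:
# Root main.tf (illustrative sketch): one module per platform responsibility.
module "vpc" {
  source = "./modules/vpc"
}

module "eks" {
  source     = "./modules/eks"
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnet_ids
}

module "observability" {
  source     = "./modules/observability"
  depends_on = [module.eks] # monitoring lands right after the cluster, before apps
}

module "apps" {
  source     = "./modules/apps"
  depends_on = [module.observability]
}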
Infrastructure (Terraform)
- VPC Module: Creates the networking foundation with public/private subnets
- EKS Module: Deploys a managed Kubernetes cluster (v1.32) with:
  - 2-4 worker nodes (t3.medium instances)
  - Public/private API endpoint access
  - IAM Roles for Service Accounts enabled
  - Cluster creator admin permissions
Application Stack
- Frontend: containous/whoami service showing request details
- API: hashicorp/http-echo returning "Hello from API"
- Resource Limits: CPU/memory constraints for autoscaling
- HPA: Horizontal Pod Autoscaler (2-6 replicas, 50% CPU threshold)
Observability Stack
- Prometheus: Metrics collection via the kube-prometheus-stack Helm chart
- Grafana: Visualization dashboards for:
  - Kubernetes cluster metrics
  - Pod/deployment monitoring
  - CPU/memory utilization
  - Autoscaling events
Implementation Guide
Step 1: Network Foundation (VPC Module)
Why: Worker nodes should run in private subnets for production-grade security.
- Create VPC with public + private subnets
- Enable NAT Gateway for outbound traffic
- Keep networking isolated from workloads
Key takeaway: Networking is platform infrastructure, not an application concern.
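A minimal sketch of the VPC layer, assuming the community terraform-aws-modules/vpc module is used (CIDRs, AZs, and names are placeholders; the actual module in the repo may use raw aws_vpc/aws_subnet resources instead):
# Illustrative VPC definition (assumption: community module, placeholder values).
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "eks-observability-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]     # worker nodes live here
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"] # load balancers / NAT

  enable_nat_gateway = true # outbound-only internet access for private subnets
  single_nat_gateway = true # one NAT gateway keeps demo costs down
}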
Step 2: EKS Cluster Provisioning
Why: Managed control plane + managed node groups reduce operational load.
- Provision EKS cluster
- Create managed node group
- Expose cluster endpoint and credentials for providers
Key takeaway: Platform teams optimize for operability, not customization.
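As a hedged sketch, the cluster described earlier (v1.32, 2-4 t3.medium nodes, IRSA, creator admin permissions) maps roughly onto the community terraform-aws-modules/eks module like this; the repo's own module may wire the same settings differently:
# Illustrative EKS cluster definition (assumption: community module).
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "eks-observability"
  cluster_version = "1.32"

  vpc_id     = var.vpc_id             # passed in from the VPC module
  subnet_ids = var.private_subnet_ids # nodes stay in private subnets

  cluster_endpoint_public_access           = true
  enable_cluster_creator_admin_permissions = true
  enable_irsa                              = true # IAM Roles for Service Accounts

  eks_managed_node_groups = {
    default = {
      instance_types = ["t3.medium"]
      min_size       = 2
      desired_size   = 2
      max_size       = 4
    }
  }
}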
Step 3: ObservabilityโFirst Setup
Why: You cannot safely scale or debug what you cannot see.
- Create a dedicated observability namespace
- Install kube-prometheus-stack via Helm
- Deploy Grafana and Prometheus before apps
Key takeaway: Observability is foundational infrastructure.
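In Terraform terms, the observability module boils down to a namespace plus a Helm release. A minimal sketch (release name and chart settings are assumptions) looks like this; because the apps module depends on this one, Prometheus and Grafana exist before any workload is scheduled:
# Illustrative observability module: namespace first, then kube-prometheus-stack.
resource "kubernetes_namespace" "observability" {
  metadata {
    name = "observability"
  }
}

resource "helm_release" "kube_prometheus_stack" {
  name       = "kube-prometheus" # yields service names like kube-prometheus-grafana
  repository = "https://prometheus-community.github.io/helm-charts"
  chart      = "kube-prometheus-stack"
  namespace  = kubernetes_namespace.observability.metadata[0].name
}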
Step 4: Application Namespaces
Why: Namespace isolation simplifies ownership and RBAC later.
- Create an apps namespace
- Keep workloads separate from platform tooling
Key takeaway: Logical isolation improves long-term operability.
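The apps namespace itself is a one-resource sketch (the label is an illustrative assumption):
# Illustrative apps namespace, kept separate from platform tooling.
resource "kubernetes_namespace" "apps" {
  metadata {
    name = "apps"
    labels = {
      "app.kubernetes.io/managed-by" = "terraform"
    }
  }
}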
Step 5: Frontend Tier Deployment
Why: Complete the 3-tier story, even with a simple UI.
- Deploy NGINX frontend
- Expose via ClusterIP service
- Define resource requests and limits
Key takeaway: Even simple workloads deserve resource boundaries.
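A hedged sketch of the frontend tier using the Terraform kubernetes provider. The post mentions both nginx and containous/whoami for this tier, so whoami is assumed here, and the request/limit values come from the cost section later in the post:
# Illustrative frontend deployment and ClusterIP service (values are assumptions).
resource "kubernetes_deployment_v1" "frontend" {
  metadata {
    name      = "frontend"
    namespace = "apps"
    labels    = { app = "frontend" }
  }
  spec {
    replicas = 2
    selector {
      match_labels = { app = "frontend" }
    }
    template {
      metadata {
        labels = { app = "frontend" }
      }
      spec {
        container {
          name  = "frontend"
          image = "containous/whoami" # echoes request details back to the caller
          resources {
            requests = { cpu = "50m", memory = "64Mi" }
            limits   = { cpu = "200m", memory = "128Mi" }
          }
        }
      }
    }
  }
}

resource "kubernetes_service_v1" "frontend" {
  metadata {
    name      = "frontend"
    namespace = "apps"
  }
  spec {
    type     = "ClusterIP"
    selector = { app = "frontend" }
    port {
      port        = 80
      target_port = 80
    }
  }
}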
Step 6: Backend API Tier Deployment
Why: This tier demonstrates autoscaling and failure recovery.
- Deploy a lightweight API (http-echo)
- Apply CPU requests/limits
- Expose internally via service
Key takeaway: Backend services are the primary scaling surface.
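The API tier follows the same pattern. A minimal sketch (the port and args are assumptions based on http-echo defaults); its ClusterIP service mirrors the frontend's:
# Illustrative API deployment; its ClusterIP service is identical in shape to the frontend's.
resource "kubernetes_deployment_v1" "api" {
  metadata {
    name      = "api"
    namespace = "apps"
    labels    = { app = "api" }
  }
  spec {
    replicas = 2
    selector {
      match_labels = { app = "api" }
    }
    template {
      metadata {
        labels = { app = "api" }
      }
      spec {
        container {
          name  = "api"
          image = "hashicorp/http-echo"
          args  = ["-text=Hello from API", "-listen=:5678"]
          resources {
            requests = { cpu = "100m", memory = "128Mi" } # the HPA scales on this request
            limits   = { cpu = "500m", memory = "256Mi" }
          }
        }
      }
    }
  }
}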
Step 7: Horizontal Pod Autoscaling (HPA)
Why: Scaling without metrics is dangerous.
- Configure HPA based on CPU utilization
- Define min/max replicas
- Observe behavior in Grafana
Key takeaway: Autoscaling is a control system, not a checkbox.
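A hedged sketch of the HPA as a Terraform resource, matching the 2-6 replica range and 50% CPU target described earlier (note that CPU-based HPA also needs the Kubernetes metrics-server, or an equivalent metrics API, available in the cluster):
# Illustrative HPA: scale the api Deployment between 2 and 6 replicas at 50% CPU.
resource "kubernetes_horizontal_pod_autoscaler_v2" "api" {
  metadata {
    name      = "api"
    namespace = "apps"
  }
  spec {
    min_replicas = 2
    max_replicas = 6

    scale_target_ref {
      api_version = "apps/v1"
      kind        = "Deployment"
      name        = "api"
    }

    metric {
      type = "Resource"
      resource {
        name = "cpu"
        target {
          type                = "Utilization"
          average_utilization = 50 # 50% of the container's CPU request
        }
      }
    }
  }
}
With a 100m CPU request on the API container, scale-out starts once the pods average more than roughly 50m of CPU each, which the busy-loop in the testing section triggers quickly.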
Step 8: Failure Injection (Day-2 Operations)
Pod Failure
- Manually delete an API pod
- Observe:
- No frontend impact
- New pod scheduled automatically
- Metrics reflect recovery
Node Failure
- Drain a worker node
- Observe:
- Pods rescheduled
- No service interruption
Key takeaway: Resilience is observable, not assumed.
Implementation
terraform init
terraform validate
terraform plan
terraform apply
Testing
Log in to the AWS Management Console and view the newly created EKS cluster.
Explore a bit to identify the resources created, the networking layer, and so on.
Let us check the current frontend service:
kubectl get svc -n apps
If the services are still of type ClusterIP, expose them externally by patching them to LoadBalancer:
kubectl patch svc frontend -n apps -p '{"spec": {"type": "LoadBalancer"}}'
kubectl patch svc api -n apps -p '{"spec": {"type": "LoadBalancer"}}'
Get the external URL:
kubectl get svc frontend -n apps
kubectl get svc api -n apps
Wait for EXTERNAL-IP to show the AWS ELB hostname (takes 2-3 minutes).
Get the hostname directly:
kubectl get svc frontend -n apps -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
kubectl get svc api -n apps -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
Access your frontend
Once you have the hostname, access it in your browser:
http://your-elb-hostname
Access Grafana
kubectl port-forward -n observability svc/kube-prometheus-grafana 3000:80
Open browser:
http://localhost:3000
Login:
- Username: admin
- Password: retrieve from Kubernetes secret
Decode the Grafana admin password:
kubectl get secret -n observability kube-prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 -d
Then configure and view the dashboards. Follow the same port-forward approach for Prometheus if you want to query raw metrics.
Observe Baseline Metrics
Generate Load (Autoscaling Demo)
Exec into the API pod:
kubectl exec -it deploy/api -n apps -- sh
Generate CPU load:
while true; do :; done
Pod Failure Injection
kubectl delete pod -n apps -l app=api
Node Failure Injection
List nodes:
kubectl get nodes
Drain one node:
kubectl drain <node-name> --ignore-daemonsets
GitHub Repository
aggarwal-tanushree/eks-observability-first-platform
Personal platform engineering project demonstrating an observability-first 3-tier architecture on Amazon EKS using Terraform, Prometheus, and Grafana.
Directory Structure
eks-observability-first-platform/
├── modules/
│   ├── vpc/                 # VPC module
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── variables.tf
│   ├── eks/                 # EKS cluster module
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── variables.tf
│   ├── observability/       # Prometheus/Grafana stack
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── versions.tf
│   └── apps/                # Sample applications
│       ├── main.tf
│       ├── variables.tf
│       └── versions.tf
└── main.tf                  # Root module
Key Cost Controls in This Platform:
Right-Sized Node Groups
Why: Over-provisioning nodes is the most common EKS cost mistake. Here the managed node group is capped at 2-4 t3.medium instances.
Resource Requests & Limits
- Frontend: requests 50m CPU / 64Mi memory; limits 200m CPU / 128Mi memory
- Backend API: requests 100m CPU / 128Mi memory; limits 500m CPU / 256Mi memory
Why:
- Enables accurate scheduling
- Prevents noisyโneighbor problems
- Improves HPA decision quality
Horizontal Pod Autoscaling
- Minimum replicas: 2
- Maximum replicas: 6
- Scaling driven by CPU utilization
Cost Benefit:
- Low load → minimal replicas → lower cost
- High load → scale only when needed
Observability Reduces Hidden Costs
Metrics help avoid:
- Over-scaling due to guesswork
- Long outages with high blast radius
- Manual firefighting (human cost)
Key Insight:
Observability is a cost-control mechanism, not just a debugging tool.
This demo proves that EKS platforms must be observable before they are scalable.
Conclusion
The Operational Challenge:
As organizations adopt Kubernetes, many run into an unexpected contradiction. While containers and orchestration make it easier to scale and move faster, they also add complexity that makes systems harder to understand. As a result, teams often end up reacting to problems after something breaks instead of preventing issues through better visibility and automation.
The Observability-First Approach
The answer is to adopt an observability-first approach: one where monitoring and visibility are built into the platform from day one, not added later as an afterthought. When teams can clearly see what's happening inside their systems, they're able to spot issues early, make smarter decisions automatically, and continuously improve performance. This shift allows organizations to move from constantly reacting to problems to predicting and preventing them.
In an observability-first platform, monitoring is woven directly into the infrastructure as it's being created. Every component is instrumented and visible as soon as it goes live. This creates a strong foundation for automatic scaling, meaningful alerts, and data-driven capacity planning. Over time, the platform becomes more self-aware, able to understand how it's performing and adjust on its own as conditions change.
So to summarize, in this (lengthy, but hopefully insightful) blog, we:
✅ Built a production-style Amazon EKS platform using modular Terraform, separating networking, cluster, observability, and application concerns.
✅ Implemented Prometheus and Grafana before workloads, enabling safe CPU-based autoscaling and rapid failure detection.
✅ Validated Day-2 operations by demonstrating pod and node failure recovery with real-time metrics.
Remember: "If you can't observe it, you can't operate it."
Future Enhancements
- ALB Ingress + path-based routing
- An interactive web UI for the frontend
- Distributed tracing (OpenTelemetry)
- RBAC per namespace
- CI/CD pipeline integration
- Cost dashboards
References
https://developer.hashicorp.com/terraform/tutorials/kubernetes/eks
https://registry.terraform.io/providers/hashicorp/aws/latest