Introduction: EKS Observability Platform
As cloud-native systems continue to grow in complexity, many organizations depend on Kubernetes to run and scale their containerized applications. While Kubernetes is powerful, managing distributed workloads often comes with limited visibility, making it difficult to detect issues before they impact the business. The real challenge isn't just deploying applications to a Kubernetes cluster; it's understanding how those applications are performing, how resources are being used, and whether the system is healthy overall. In today's blog we look at how building an observability-first Amazon Elastic Kubernetes Service (EKS) platform can help solve these challenges through better monitoring, automated scaling, and early detection of potential incidents.
This project demonstrates how to design, provision, and operate a production-style, observability-first Kubernetes platform on Amazon EKS, using Terraform as the platform definition layer.
The focus is Day-2 operations: metrics, autoscaling, failure recovery, and clean platform boundaries, not just deploying containers.
Project Goals
- Build a true 3-tier architecture on EKS (Frontend → API → Platform Services)
- Provision infrastructure using modular Terraform
- Deploy Prometheus + Grafana before workloads
- Demonstrate autoscaling and self-healing with live metrics
- Serve as a GitHub portfolio project for platform engineering
What This Project Demonstrates
- Production-style Terraform design
- Observabilityโdriven operations
- Safe autoscaling practices
- Kubernetes selfโhealing behavior
- Platform engineering mindset
Project Structure
.
├── providers.tf
├── versions.tf
├── variables.tf
├── main.tf
├── outputs.tf
└── modules/
    ├── vpc/
    ├── eks/
    ├── observability/
    └── apps/
Each Terraform module represents a platform responsibility boundary.
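To make that boundary concrete, here is a minimal sketch of what the root main.tf wiring could look like. The module outputs (vpc_id, private_subnet_ids) and the explicit depends_on ordering are illustrative assumptions, not necessarily the repo's exact interface:
# Root main.tf (illustrative sketch): one module per platform responsibility.
module "vpc" {
  source = "./modules/vpc"
}

module "eks" {
  source     = "./modules/eks"
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnet_ids
}

module "observability" {
  source     = "./modules/observability"
  depends_on = [module.eks] # monitoring lands right after the cluster, before apps
}

module "apps" {
  source     = "./modules/apps"
  depends_on = [module.observability]
}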
Infrastructure (Terraform)
- VPC Module: Creates the networking foundation with public/private subnets
- EKS Module: Deploys a managed Kubernetes cluster (v1.32) with:
  - 2-4 worker nodes (t3.medium instances)
  - Public/private API endpoint access
  - IAM Roles for Service Accounts enabled
  - Cluster creator admin permissions
Application Stack
- Frontend: containous/whoami service showing request details
- API: hashicorp/http-echo returning "Hello from API"
- Resource Limits: CPU/memory constraints for autoscaling
- HPA: Horizontal Pod Autoscaler (2-6 replicas, 50% CPU threshold)
Observability Stack
- Prometheus: Metrics collection via the kube-prometheus-stack Helm chart
- Grafana: Visualization dashboards for:
  - Kubernetes cluster metrics
  - Pod/deployment monitoring
  - CPU/memory utilization
  - Autoscaling events
Implementation Guide
Step 1: Network Foundation (VPC Module)
Why: Worker nodes should run in private subnets for production-grade security.
- Create VPC with public + private subnets
- Enable NAT Gateway for outbound traffic
- Keep networking isolated from workloads
Key takeaway: Networking is platform infrastructure, not an application concern.
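A minimal sketch of the VPC layer, assuming the community terraform-aws-modules/vpc module is used (CIDRs, AZs, and names are placeholders; the actual module in the repo may use raw aws_vpc/aws_subnet resources instead):
# Illustrative VPC definition (assumption: community module, placeholder values).
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "eks-observability-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]     # worker nodes live here
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"] # load balancers / NAT

  enable_nat_gateway = true # outbound-only internet access for private subnets
  single_nat_gateway = true # one NAT gateway keeps demo costs down
}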
Step 2: EKS Cluster Provisioning
Why: Managed control plane + managed node groups reduce operational load.
- Provision EKS cluster
- Create managed node group
- Expose cluster endpoint and credentials for providers
Key takeaway: Platform teams optimize for operability, not customization.
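As a hedged sketch, the cluster described earlier (v1.32, 2-4 t3.medium nodes, IRSA, creator admin permissions) maps roughly onto the community terraform-aws-modules/eks module like this; the repo's own module may wire the same settings differently:
# Illustrative EKS cluster definition (assumption: community module).
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "eks-observability"
  cluster_version = "1.32"

  vpc_id     = var.vpc_id             # passed in from the VPC module
  subnet_ids = var.private_subnet_ids # nodes stay in private subnets

  cluster_endpoint_public_access           = true
  enable_cluster_creator_admin_permissions = true
  enable_irsa                              = true # IAM Roles for Service Accounts

  eks_managed_node_groups = {
    default = {
      instance_types = ["t3.medium"]
      min_size       = 2
      desired_size   = 2
      max_size       = 4
    }
  }
}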
Step 3: ObservabilityโFirst Setup
Why: You cannot safely scale or debug what you cannot see.
- Create a dedicated observability namespace
- Install kube-prometheus-stack via Helm
- Deploy Grafana and Prometheus before apps
Key takeaway: Observability is foundational infrastructure.
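In Terraform terms, the observability module boils down to a namespace plus a Helm release. A minimal sketch (release name and chart settings are assumptions) looks like this; because the apps module depends on this one, Prometheus and Grafana exist before any workload is scheduled:
# Illustrative observability module: namespace first, then kube-prometheus-stack.
resource "kubernetes_namespace" "observability" {
  metadata {
    name = "observability"
  }
}

resource "helm_release" "kube_prometheus_stack" {
  name       = "kube-prometheus" # yields service names like kube-prometheus-grafana
  repository = "https://prometheus-community.github.io/helm-charts"
  chart      = "kube-prometheus-stack"
  namespace  = kubernetes_namespace.observability.metadata[0].name
}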
Step 4: Application Namespaces
Why: Namespace isolation simplifies ownership and RBAC later.
- Create an apps namespace
- Keep workloads separate from platform tooling
Key takeaway: Logical isolation improves long-term operability.
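The apps namespace itself is a one-resource sketch (the label is an illustrative assumption):
# Illustrative apps namespace, kept separate from platform tooling.
resource "kubernetes_namespace" "apps" {
  metadata {
    name = "apps"
    labels = {
      "app.kubernetes.io/managed-by" = "terraform"
    }
  }
}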
Step 5: Frontend Tier Deployment
Why: Complete the 3-tier story, even with a simple UI.
- Deploy NGINX frontend
- Expose via ClusterIP service
- Define resource requests and limits
Key takeaway: Even simple workloads deserve resource boundaries.
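A hedged sketch of the frontend tier using the Terraform kubernetes provider. The post mentions both nginx and containous/whoami for this tier, so whoami is assumed here, and the request/limit values come from the cost section later in the post:
# Illustrative frontend deployment and ClusterIP service (values are assumptions).
resource "kubernetes_deployment_v1" "frontend" {
  metadata {
    name      = "frontend"
    namespace = "apps"
    labels    = { app = "frontend" }
  }
  spec {
    replicas = 2
    selector {
      match_labels = { app = "frontend" }
    }
    template {
      metadata {
        labels = { app = "frontend" }
      }
      spec {
        container {
          name  = "frontend"
          image = "containous/whoami" # echoes request details back to the caller
          resources {
            requests = { cpu = "50m", memory = "64Mi" }
            limits   = { cpu = "200m", memory = "128Mi" }
          }
        }
      }
    }
  }
}

resource "kubernetes_service_v1" "frontend" {
  metadata {
    name      = "frontend"
    namespace = "apps"
  }
  spec {
    type     = "ClusterIP"
    selector = { app = "frontend" }
    port {
      port        = 80
      target_port = 80
    }
  }
}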
Step 6: Backend API Tier Deployment
Why: This tier demonstrates autoscaling and failure recovery.
- Deploy a lightweight API (http-echo)
- Apply CPU requests/limits
- Expose internally via service
Key takeaway: Backend services are the primary scaling surface.
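The API tier follows the same pattern. A minimal sketch (the port and args are assumptions based on http-echo defaults); its ClusterIP service mirrors the frontend's:
# Illustrative API deployment; its ClusterIP service is identical in shape to the frontend's.
resource "kubernetes_deployment_v1" "api" {
  metadata {
    name      = "api"
    namespace = "apps"
    labels    = { app = "api" }
  }
  spec {
    replicas = 2
    selector {
      match_labels = { app = "api" }
    }
    template {
      metadata {
        labels = { app = "api" }
      }
      spec {
        container {
          name  = "api"
          image = "hashicorp/http-echo"
          args  = ["-text=Hello from API", "-listen=:5678"]
          resources {
            requests = { cpu = "100m", memory = "128Mi" } # the HPA scales on this request
            limits   = { cpu = "500m", memory = "256Mi" }
          }
        }
      }
    }
  }
}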
Step 7: Horizontal Pod Autoscaling (HPA)
Why: Scaling without metrics is dangerous.
- Configure HPA based on CPU utilization
- Define min/max replicas
- Observe behavior in Grafana
Key takeaway: Autoscaling is a control system, not a checkbox.
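A hedged sketch of the HPA as a Terraform resource, matching the 2-6 replica range and 50% CPU target described earlier (note that CPU-based HPA also needs the Kubernetes metrics-server, or an equivalent metrics API, available in the cluster):
# Illustrative HPA: scale the api Deployment between 2 and 6 replicas at 50% CPU.
resource "kubernetes_horizontal_pod_autoscaler_v2" "api" {
  metadata {
    name      = "api"
    namespace = "apps"
  }
  spec {
    min_replicas = 2
    max_replicas = 6

    scale_target_ref {
      api_version = "apps/v1"
      kind        = "Deployment"
      name        = "api"
    }

    metric {
      type = "Resource"
      resource {
        name = "cpu"
        target {
          type                = "Utilization"
          average_utilization = 50 # 50% of the container's CPU request
        }
      }
    }
  }
}
With a 100m CPU request on the API container, scale-out starts once the pods average more than roughly 50m of CPU each, which the busy-loop in the testing section triggers quickly.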
Step 8: Failure Injection (Day-2 Operations)
Pod Failure
- Manually delete an API pod
- Observe:
- No frontend impact
- New pod scheduled automatically
- Metrics reflect recovery
Node Failure
- Drain a worker node
- Observe:
- Pods rescheduled
- No service interruption
Key takeaway: Resilience is observable, not assumed.
Implementation
terraform init
terraform validate
terraform plan
terraform apply
Testing
Log in to the AWS Management Console and view the newly created EKS cluster.
Explore a bit to identify the resources created, the networking layer, and so on.
Let us check the current frontend service:
kubectl get svc -n apps
If the services are still of type ClusterIP, expose them externally by patching them to LoadBalancer:
kubectl patch svc frontend -n apps -p '{"spec": {"type": "LoadBalancer"}}'
kubectl patch svc api -n apps -p '{"spec": {"type": "LoadBalancer"}}'
Get the external URL:
kubectl get svc frontend -n apps
kubectl get svc api -n apps
Wait for EXTERNAL-IP to show the AWS ELB hostname (takes 2-3 minutes).
Get the hostname directly:
kubectl get svc frontend -n apps -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
kubectl get svc api -n apps -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
Access your frontend
Once you have the hostname, access it in your browser:
http://your-elb-hostname
Access Grafana
kubectl port-forward -n observability svc/kube-prometheus-grafana 3000:80
Open browser:
http://localhost:3000
Login:
- Username: admin
- Password: retrieve from Kubernetes secret
Decode the Grafana admin password:
kubectl get secret -n observability kube-prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 -d
Then configure and view the dashboards. Follow the same port-forward approach for Prometheus if you want to query raw metrics.
Observe Baseline Metrics
Generate Load (Autoscaling Demo)
Exec into the API pod:
kubectl exec -it deploy/api -n apps -- sh
Generate CPU load:
while true; do :; done
Pod Failure Injection
kubectl delete pod -n apps -l app=api
Node Failure Injection
List nodes:
kubectl get nodes
Drain one node:
kubectl drain <node-name> --ignore-daemonsets
GitHub Repository
aggarwal-tanushree/eks-observability-first-platform
Personal platform engineering project demonstrating an observability-first 3-tier architecture on Amazon EKS using Terraform, Prometheus, and Grafana.
Directory Structure
eks-observability-first-platform/
├── modules/
│   ├── vpc/                 # VPC module
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── variables.tf
│   ├── eks/                 # EKS cluster module
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── variables.tf
│   ├── observability/       # Prometheus/Grafana stack
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── versions.tf
│   └── apps/                # Sample applications
│       ├── main.tf
│       ├── variables.tf
│       └── versions.tf
└── main.tf                  # Root module
Key Cost Controls in This Platform:
Right-Sized Node Groups
Why: Over-provisioning nodes is the most common EKS cost mistake. Here the managed node group is capped at 2-4 t3.medium instances.
Resource Requests & Limits
- Frontend: requests 50m CPU / 64Mi memory; limits 200m CPU / 128Mi memory
- Backend API: requests 100m CPU / 128Mi memory; limits 500m CPU / 256Mi memory
Why:
- Enables accurate scheduling
- Prevents noisyโneighbor problems
- Improves HPA decision quality
Horizontal Pod Autoscaling
- Minimum replicas: 2
- Maximum replicas: 6
- Scaling driven by CPU utilization
Cost Benefit:
- Low load → minimal replicas → lower cost
- High load → scale only when needed
Observability Reduces Hidden Costs
Metrics help avoid:
- Over-scaling due to guesswork
- Long outages with high blast radius
- Manual firefighting (human cost)
Key Insight:
Observability is a cost-control mechanism, not just a debugging tool.
This demo proves that EKS platforms must be observable before they are scalable.
Conclusion
The Operational Challenge:
As organizations adopt Kubernetes, many run into an unexpected contradiction. While containers and orchestration make it easier to scale and move faster, they also add complexity that makes systems harder to understand. As a result, teams often end up reacting to problems after something breaks instead of preventing issues through better visibility and automation.
The Observability-First Approach
The answer is to adopt an observability-first approach: one where monitoring and visibility are built into the platform from day one, not added later as an afterthought. When teams can clearly see what's happening inside their systems, they're able to spot issues early, make smarter decisions automatically, and continuously improve performance. This shift allows organizations to move from constantly reacting to problems to predicting and preventing them.
In an observability-first platform, monitoring is woven directly into the infrastructure as it's being created. Every component is instrumented and visible as soon as it goes live. This creates a strong foundation for automatic scaling, meaningful alerts, and data-driven capacity planning. Over time, the platform becomes more self-aware, able to understand how it's performing and adjust on its own as conditions change.
So to summarize, in this (lengthy, but hopefully insightful) blog, we:
✅ Built a production-style Amazon EKS platform using modular Terraform, separating networking, cluster, observability, and application concerns.
✅ Implemented Prometheus and Grafana before workloads, enabling safe CPU-based autoscaling and rapid failure detection.
✅ Validated Day-2 operations by demonstrating pod and node failure recovery with real-time metrics.
Remember: "If you can't observe it, you can't operate it."
Future Enhancements
- ALB Ingress + path-based routing
- An interactive web UI for the frontend
- Distributed tracing (OpenTelemetry)
- RBAC per namespace
- CI/CD pipeline integration
- Cost dashboards
References
https://developer.hashicorp.com/terraform/tutorials/kubernetes/eks
https://registry.terraform.io/providers/hashicorp/aws/latest