
Observability-Driven Kubernetes: A Practical EKS Demo

Introduction: EKS Observability Platform 🖥️

As cloud-native systems grow in complexity, many organizations depend on Kubernetes to run and scale their containerized applications. While Kubernetes is powerful, managing distributed workloads often comes with limited visibility, making it difficult to detect issues before they impact the business. The real challenge isn't just deploying applications to a Kubernetes cluster; it's understanding how those applications are performing, how resources are being used, and whether the system is healthy overall. In today's blog we look at how building an observability-first Amazon Elastic Kubernetes Service (EKS) platform can help solve these challenges through better monitoring, automated scaling, and early detection of potential incidents.

This project demonstrates how to design, provision, and operate a production-style, observability-first Kubernetes platform on Amazon EKS, using Terraform as the platform definition layer.
The focus is Day-2 operations: metrics, autoscaling, failure recovery, and clean platform boundaries, not just deploying containers.

Project Goals 🤖

  • Build a true 3-tier architecture on EKS (Frontend → API → Platform Services)
  • Provision infrastructure using modular Terraform
  • Deploy Prometheus + Grafana before workloads
  • Demonstrate autoscaling and self-healing with live metrics
  • Serve as a GitHub portfolio project for platform engineering

What This Project Demonstrates 💭✍️

  • Production-style Terraform design
  • Observability-driven operations
  • Safe autoscaling practices
  • Kubernetes self-healing behavior
  • Platform engineering mindset

Project Structure 📋

.
├── providers.tf
├── versions.tf
├── variables.tf
├── main.tf
├── outputs.tf
├── modules/
│   ├── vpc/
│   ├── eks/
│   ├── observability/
│   └── apps/

Each Terraform module represents a platform responsibility boundary.

Infrastructure (Terraform)

  1. VPC Module: Creates the networking foundation with public/private subnets
  2. EKS Module: Deploys a managed Kubernetes cluster (v1.32) with:
    • 2-4 worker nodes (t3.medium instances)
    • Public/private API endpoint access
    • IAM Roles for Service Accounts (IRSA) enabled
    • Cluster creator admin permissions
  3. Application Stack:
    • Frontend: containous/whoami service showing request details
    • API: hashicorp/http-echo returning "Hello from API"
    • Resource Limits: CPU/memory constraints for autoscaling
    • HPA: Horizontal Pod Autoscaler (2-6 replicas, 50% CPU threshold)
  4. Observability Stack:
    • Prometheus: Metrics collection via the kube-prometheus-stack Helm chart
    • Grafana: Visualization dashboards for:
      • Kubernetes cluster metrics
      • Pod/deployment monitoring
      • CPU/memory utilization
      • Autoscaling events
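Putting the pieces together, the root main.tf can wire these modules in dependency order. This is a sketch: the module input and output names below are illustrative and may differ from the actual repository.

```hcl
# Root main.tf (sketch): wire platform modules in dependency order.
module "vpc" {
  source = "./modules/vpc"
}

module "eks" {
  source          = "./modules/eks"
  vpc_id          = module.vpc.vpc_id          # assumed output name
  private_subnets = module.vpc.private_subnets # assumed output name
}

# Observability is installed before the apps, so metrics exist
# the moment workloads come up.
module "observability" {
  source     = "./modules/observability"
  depends_on = [module.eks]
}

module "apps" {
  source     = "./modules/apps"
  depends_on = [module.observability]
}
```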

Implementation Guide 🎨

Step 1: Network Foundation (VPC Module)

Why: EKS must run in private subnets for production-grade security.

  • Create VPC with public + private subnets
  • Enable NAT Gateway for outbound traffic
  • Keep networking isolated from workloads

Key takeaway: 🌟 Networking is platform infrastructure, not an app concern. 🌟
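As a sketch of this step, here is how the vpc module could be built on the community terraform-aws-modules/vpc module (the repository's module may be hand-rolled instead; the name, AZs, and CIDRs below are illustrative):

```hcl
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "eks-observability-vpc" # illustrative
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b"]
  public_subnets  = ["10.0.1.0/24", "10.0.2.0/24"]
  private_subnets = ["10.0.101.0/24", "10.0.102.0/24"]

  # NAT Gateway gives private-subnet nodes outbound internet access.
  enable_nat_gateway = true
  single_nat_gateway = true # a single NAT keeps demo costs down
}
```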

Step 2: EKS Cluster Provisioning

Why: Managed control plane + managed node groups reduce operational load.

  • Provision EKS cluster
  • Create managed node group
  • Expose cluster endpoint and credentials for providers

Key takeaway: 🌟 Platform teams optimize for operability, not customization. 🌟
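A sketch of the cluster module, assuming the community terraform-aws-modules/eks module is used; the cluster name is illustrative, while the version, node sizing, IRSA, and admin-permissions settings mirror the description above:

```hcl
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = "eks-observability" # illustrative
  cluster_version = "1.32"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  # Public endpoint for demo convenience; IRSA for pod-level IAM.
  cluster_endpoint_public_access           = true
  enable_irsa                              = true
  enable_cluster_creator_admin_permissions = true

  eks_managed_node_groups = {
    default = {
      instance_types = ["t3.medium"]
      min_size       = 2
      max_size       = 4
      desired_size   = 2
    }
  }
}
```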

Step 3: Observability-First Setup

Why: You cannot safely scale or debug what you cannot see.

  • Create dedicated observability namespace
  • Install kube-prometheus-stack via Helm
  • Deploy Grafana and Prometheus before apps

Key takeaway: 🌟 Observability is foundational infrastructure. 🌟
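In Terraform, this step can be sketched with the Kubernetes and Helm providers. The release name kube-prometheus is an assumption chosen to match the kube-prometheus-grafana service name used later in this post:

```hcl
resource "kubernetes_namespace" "observability" {
  metadata {
    name = "observability"
  }
}

# Installs Prometheus, Grafana, Alertmanager, and default dashboards.
resource "helm_release" "kube_prometheus" {
  name       = "kube-prometheus"
  repository = "https://prometheus-community.github.io/helm-charts"
  chart      = "kube-prometheus-stack"
  namespace  = kubernetes_namespace.observability.metadata[0].name
}
```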

Step 4: Application Namespaces

Why: Namespace isolation simplifies ownership and RBAC later.

  • Create apps namespace
  • Keep workloads separate from platform tooling

Key takeaway: 🌟 Logical isolation improves long-term operability. 🌟

Step 5: Frontend Tier Deployment

Why: Complete the 3-tier story, even with a simple UI.

  • Deploy NGINX frontend
  • Expose via ClusterIP service
  • Define resource requests and limits

Key takeaway: 🌟 Even simple workloads deserve resource boundaries. 🌟

Step 6: Backend API Tier Deployment

Why: This tier demonstrates autoscaling and failure recovery.

  • Deploy lightweight API (http-echo)
  • Apply CPU requests/limits
  • Expose internally via service

Key takeaway: 🌟 Backend services are the primary scaling surface. 🌟

Step 7: Horizontal Pod Autoscaling (HPA)

Why: Scaling without metrics is dangerous.

  • Configure HPA based on CPU utilization
  • Define min/max replicas
  • Observe behavior in Grafana

Key takeaway: 🌟 Autoscaling is a control system, not a checkbox. 🌟
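The HPA described above (2-6 replicas, 50% CPU) can be expressed in Terraform with the hashicorp/kubernetes provider; the deployment name api is assumed here:

```hcl
resource "kubernetes_horizontal_pod_autoscaler_v2" "api" {
  metadata {
    name      = "api"
    namespace = "apps"
  }

  spec {
    min_replicas = 2
    max_replicas = 6

    scale_target_ref {
      api_version = "apps/v1"
      kind        = "Deployment"
      name        = "api" # assumed deployment name
    }

    # Scale out when average CPU utilization across pods exceeds
    # 50% of the CPU *requests*, which is why requests must be set.
    metric {
      type = "Resource"
      resource {
        name = "cpu"
        target {
          type                = "Utilization"
          average_utilization = 50
        }
      }
    }
  }
}
```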

Step 8: Failure Injection (Day-2 Operations)

Pod Failure

  • Manually delete an API pod
  • Observe:
    • No frontend impact
    • New pod scheduled automatically
    • Metrics reflect recovery

Node Failure

  • Drain a worker node
  • Observe:
    • Pods rescheduled
    • No service interruption

Key takeaway: 🌟 Resilience is observable, not assumed. 🌟

Implementation 🌀

terraform init
terraform validate
terraform plan
terraform apply

Testing ⚡⚡

Log in to the AWS Management Console and confirm the EKS cluster is up.


Explore a bit to identify the resources created, the networking layer, etc.

Let us check our frontend and API services:
kubectl get svc -n apps

If the services are of type ClusterIP, expose them externally by patching them to LoadBalancer:
kubectl patch svc frontend -n apps -p '{"spec": {"type": "LoadBalancer"}}'
kubectl patch svc api -n apps -p '{"spec": {"type": "LoadBalancer"}}'

Get the external URL:
kubectl get svc frontend -n apps
kubectl get svc api -n apps

Wait for EXTERNAL-IP to show the AWS ELB hostname (takes 2-3 minutes).

Get the hostname directly:
kubectl get svc frontend -n apps -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'

kubectl get svc api -n apps -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'

Access your frontend
Once you have the hostname, access it in your browser:
http://your-elb-hostname

Access Grafana

kubectl port-forward -n observability svc/kube-prometheus-grafana 3000:80

Open browser:

http://localhost:3000

Login:

  • Username: admin
  • Password: retrieve from Kubernetes secret

Decode the Grafana admin password:
kubectl get secret -n observability kube-prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 -d

Configure and view the dashboards!

Follow the same approach for Prometheus.

Observe Baseline Metrics

Dashboards to open:

  • Kubernetes / Nodes
  • Kubernetes / Pods
  • Kubernetes / Workloads / Deployment

Confirm:

  • API replicas = 2
  • Low CPU usage

Generate Load (Autoscaling Demo)

Exec into API pod:
kubectl exec -it deploy/api -n apps -- sh

Generate CPU load:
while true; do :; done

Observe:

  • CPU spikes in Grafana
  • HPA scales pods from 2 → 6

Pod Failure Injection

kubectl delete pod -n apps -l app=api

Observe:

  • New pod scheduled immediately
  • No frontend impact
  • Metrics show brief dip and recovery

Node Failure Injection

List nodes:
kubectl get nodes

Drain one node:
kubectl drain <node-name> --ignore-daemonsets

Observe:

  • Pods rescheduled
  • Grafana shows node loss
  • Service remains available

GitHub Repository

Repository: aggarwal-tanushree/eks-observability-first-platform

Personal platform engineering project demonstrating an observability-first 3-tier architecture on Amazon EKS using Terraform, Prometheus, and Grafana.

EKS Observability First Platform

A complete Terraform-based solution for deploying an Amazon EKS cluster with built-in observability stack and sample applications.

Description

This project provisions a production-ready EKS cluster on AWS with:

  • VPC Infrastructure: Custom VPC with public/private subnets across multiple AZs
  • EKS Cluster: Kubernetes 1.32 with managed node groups
  • Observability Stack: Prometheus and Grafana via Helm charts
  • Sample Applications: Frontend and API deployments for testing

Directory Structure

eks-observability-first-platform/
├── modules/
│   ├── vpc/                    # VPC module
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── variables.tf
│   ├── eks/                    # EKS cluster module
│   │   ├── main.tf
│   │   ├── outputs.tf
│   │   └── variables.tf
│   ├── observability/          # Prometheus/Grafana stack
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── versions.tf
│   └── apps/                   # Sample applications
│       ├── main.tf
│       ├── variables.tf
│       └── versions.tf
├── main.tf                     # Root module
…

Key Cost Controls in This Platform 💲:

Right-Sized Node Groups 🧮

Why: Over-provisioning nodes is the most common EKS cost mistake.

Resource Requests & Limits 💣🔆🔅

Frontend: requests 50m CPU / 64Mi memory; limits 200m CPU / 128Mi memory.

Backend API: requests 100m CPU / 128Mi memory; limits 500m CPU / 256Mi memory.

Why:

  • Enables accurate scheduling
  • Prevents noisy-neighbor problems
  • Improves HPA decision quality
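Inside a kubernetes_deployment container block, the backend API numbers above translate to roughly this sketch:

```hcl
resources {
  requests = {
    cpu    = "100m"  # the scheduler uses requests for bin-packing
    memory = "128Mi"
  }
  limits = {
    cpu    = "500m"  # hard ceiling; the container is throttled beyond it
    memory = "256Mi" # exceeding this gets the container OOMKilled
  }
}
```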

Horizontal Pod Autoscaling 👈👉👆👇

  • Minimum replicas: 2
  • Maximum replicas: 6
  • Scaling driven by CPU utilization

Cost Benefit:

  • Low load → minimal replicas → lower cost
  • High load → scale only when needed

Observability Reduces Hidden Costs 💰

Metrics help avoid:

  • Over-scaling due to guesswork
  • Long outages with high blast radius
  • Manual firefighting (human cost)

Key Insight:

Observability is a cost-control mechanism, not just a debugging tool.

This demo proves that EKS platforms must be observable before they are scalable.

Conclusion 🏗️

The Operational Challenge:
As organizations adopt Kubernetes, many run into an unexpected contradiction. While containers and orchestration make it easier to scale and move faster, they also add complexity that makes systems harder to understand. As a result, teams often end up reacting to problems after something breaks instead of preventing issues through better visibility and automation.

The Observability-First Approach
The answer is to adopt an observability-first approach: one where monitoring and visibility are built into the platform from day one, not added later as an afterthought. When teams can clearly see what's happening inside their systems, they're able to spot issues early, make smarter decisions automatically, and continuously improve performance. This shift allows organizations to move from constantly reacting to problems to predicting and preventing them.
In an observability-first platform, monitoring is woven directly into the infrastructure as it's being created. Every component is instrumented and visible as soon as it goes live. This creates a strong foundation for automatic scaling, meaningful alerts, and data-driven capacity planning. Over time, the platform becomes more self-aware, able to understand how it's performing and adjust on its own as conditions change.

So to summarize, in this (lengthy, but hopefully insightful) blog, we:
✅ Built a production-style Amazon EKS platform using modular Terraform, separating networking, cluster, observability, and application concerns.
✅ Implemented Prometheus and Grafana before workloads, enabling safe CPU-based autoscaling and rapid failure detection.
✅ Validated Day-2 operations by demonstrating pod and node failure recovery with real-time metrics.

Remember, "If you can't observe it, you can't operate it." 🙏

Future Enhancements 🧩

🚀 ALB Ingress + path-based routing
🚀 An interactive web UI for the frontend
🚀 Distributed tracing (OpenTelemetry)
🚀 RBAC per namespace
🚀 CI/CD pipeline integration
🚀 Cost dashboards

References 🌐

https://aws.amazon.com/eks/

https://developer.hashicorp.com/terraform/tutorials/kubernetes/eks

https://registry.terraform.io/providers/hashicorp/aws/latest
