Why I Built This
Most Kubernetes tutorials stop at kubectl apply -f deployment.yaml. They don't show you how a VPC is laid out, why you need two availability zones, what IAM roles EKS nodes actually need, or how to debug a live failure using Prometheus metrics.
I wanted to build something that forced me to make every decision a senior DevOps engineer would make on a real project. So I built a food delivery platform — four independent microservices, a React frontend, full Terraform infrastructure on AWS, a GitHub Actions pipeline, and a Grafana dashboard — and recorded the whole thing.
This is what I learned.
How It Works
The Application Layer
Four FastAPI microservices, each completely independent with its own SQLite database:
- user-service (port 8001): Registration, JWT login, user profiles. Seeds 3 users on startup.
- restaurant-service (port 8002): Restaurant listing + full menus. Seeds 5 restaurants with 10 menu items each — real food names, USD prices.
- order-service (port 8003): Order placement. Makes a synchronous HTTP call to restaurant-service to validate menu items before placing the order. Has a built-in ORDER_SERVICE_FAILURE_MODE env var for the observability demo.
- delivery-service (port 8004): Agent assignment and delivery tracking. Seeds 5 delivery agents.
Each service exposes /health (returns {"status":"healthy","service":"<name>","version":"1.0.0"}) and /metrics (auto-generated by prometheus-fastapi-instrumentator).
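The health endpoint is as small as it sounds. A minimal sketch of the shape (the handler below hardcodes user-service purely for illustration):

```python
from fastapi import FastAPI

app = FastAPI()

@app.get("/health")
def health():
    # Same payload every service returns; probes and the gateway rely on it.
    return {"status": "healthy", "service": "user-service", "version": "1.0.0"}
```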
An NGINX gateway (port 8080 locally) routes /api/users, /api/restaurants, /api/orders, /api/delivery to the right service and serves the React frontend at /.
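The gateway config isn't reproduced in this post, but the routing it describes looks roughly like this (upstream hostnames are assumptions based on the Compose service names):

```nginx
server {
    listen 8080;

    # Path-based routing to the four services
    location /api/users/       { proxy_pass http://user-service:8001/; }
    location /api/restaurants/ { proxy_pass http://restaurant-service:8002/; }
    location /api/orders/      { proxy_pass http://order-service:8003/; }
    location /api/delivery/    { proxy_pass http://delivery-service:8004/; }

    # React frontend served at /
    location / {
        root /usr/share/nginx/html;
        try_files $uri /index.html;
    }
}
```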
The Infrastructure
Terraform is split into four modules:
modules/vpc: VPC (10.0.0.0/16), 2 public + 2 private subnets across us-east-1a and us-east-1b, Internet Gateway, 1 NAT Gateway (single point of failure — intentional cost trade-off for a demo, documented in comments), route tables.
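Condensed, the VPC wiring looks something like the sketch below (resource names and subnet math are illustrative; route tables and the Internet Gateway are omitted):

```hcl
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_hostnames = true
}

resource "aws_subnet" "public" {
  count                   = 2
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone       = ["us-east-1a", "us-east-1b"][count.index]
  map_public_ip_on_launch = true
}

resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index + 10)
  availability_zone = ["us-east-1a", "us-east-1b"][count.index]
}

resource "aws_eip" "nat" {
  domain = "vpc"
}

# One NAT Gateway in one public subnet: the deliberate cost trade-off described above
resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id
}
```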
modules/eks: EKS 1.32 cluster, managed node group with t3.small instances (min=1, desired=2, max=4 in private subnets), cluster IAM role, node IAM role with three AWS-managed policies, launch template to name EC2 instances in the console.
modules/ecr: Five repositories (food-delivery/user-service, food-delivery/frontend, etc.), image scan on push, lifecycle policy keeping last 10 images.
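The keep-last-10 rule is a small lifecycle policy attached to each repository. A sketch of one repo, mirroring the description above (names illustrative):

```hcl
resource "aws_ecr_repository" "service" {
  name = "food-delivery/user-service"

  image_scanning_configuration {
    scan_on_push = true
  }
}

resource "aws_ecr_lifecycle_policy" "keep_last_10" {
  repository = aws_ecr_repository.service.name

  policy = jsonencode({
    rules = [{
      rulePriority = 1
      description  = "Keep only the 10 most recent images"
      selection = {
        tagStatus   = "any"
        countType   = "imageCountMoreThan"
        countNumber = 10
      }
      action = { type = "expire" }
    }]
  })
}
```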
modules/iam: GitHub Actions IAM user with an inline policy scoped to ECR push/pull and EKS describe — nothing else.
The CI/CD Pipeline
deploy.yml triggers on push to main. It:
- Applies Kubernetes manifests, ingress-nginx, and kube-prometheus-stack
- Uses a matrix job for user-service, restaurant-service, order-service, delivery-service, and frontend
- Logs into ECR
- Builds and tags each image with $GITHUB_SHA and latest
- Runs aws eks update-kubeconfig
- Does kubectl set image with the SHA tag
- Waits for kubectl rollout status (sketched below)
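Condensed, the matrix job looks something like this (the cluster name, build-context paths, and step layout are my assumptions, not the repo's exact workflow):

```yaml
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        service: [user-service, restaurant-service, order-service, delivery-service, frontend]
    steps:
      - uses: actions/checkout@v4
      # AWS credentials are configured in an earlier step (omitted here)
      - id: ecr
        uses: aws-actions/amazon-ecr-login@v2
      - name: Build, push, and roll out ${{ matrix.service }}
        run: |
          IMAGE="${{ steps.ecr.outputs.registry }}/food-delivery/${{ matrix.service }}"
          docker build -t "$IMAGE:$GITHUB_SHA" -t "$IMAGE:latest" "services/${{ matrix.service }}"  # path assumed
          docker push "$IMAGE:$GITHUB_SHA"
          docker push "$IMAGE:latest"
          aws eks update-kubeconfig --name food-delivery --region us-east-1  # cluster name assumed
          kubectl set image "deployment/${{ matrix.service }}" "${{ matrix.service }}=$IMAGE:$GITHUB_SHA"
          kubectl rollout status "deployment/${{ matrix.service }}" --timeout=300s
```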
pr-checks.yml runs flake8, pytest, terraform fmt -check, and terraform validate on every pull request.
destroy.yml is a manual workflow_dispatch with a typed confirmation — safeguard against accidental terraform destroy.
The Observability Demo
This is the part that makes the project worth recording.
Set ORDER_SERVICE_FAILURE_MODE=true in Docker Compose and restart order-service. Now 50% of POST /orders requests return HTTP 500. Run scripts/load-test.sh — it fires 300 requests in 10 concurrent workers over 3 minutes.
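The failure injection itself is just a guard on the env var. A sketch of roughly how it behaves (the real handler does more than this):

```python
import os
import random

from fastapi import FastAPI, HTTPException

app = FastAPI()

# Read once at startup; toggling the flag requires a restart, as described above.
FAILURE_MODE = os.getenv("ORDER_SERVICE_FAILURE_MODE", "false").lower() == "true"

@app.post("/orders")
async def create_order(order: dict):
    if FAILURE_MODE and random.random() < 0.5:
        # Roughly half of all order placements fail while the flag is on
        raise HTTPException(status_code=500, detail="Injected failure for observability demo")
    # ...validate items against restaurant-service, persist the order...
    return {"status": "placed"}
```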
In Grafana, the "Error rate per service" panel spikes immediately from 0% to ~50% for order-service. The failed_orders_total counter climbs. P95 latency creeps up because failed requests still go through the restaurant-service validation call before failing.
Meanwhile the HPA detects elevated CPU and scales replicas from 2 to 6. More pods, same error rate: the bug is in code, not capacity.
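For context, an HPA producing that 2 → 6 scale-out would look something like this (the CPU target is an assumption):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```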
kubectl logs on any order-service pod shows the failure mode immediately. Fix: set ORDER_SERVICE_FAILURE_MODE=false, redeploy. Grafana recovers in under 30 seconds.
That recovery graph — the spike, the plateau, the drop — is the money shot of the video.
What I Learned
1. EKS nodes don't get Name tags by default.
The aws_eks_node_group resource tags the node group, not the individual EC2 instances. You need a launch_template with tag_specifications { resource_type = "instance" } to see names in the EC2 console. Lost 20 minutes on this.
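In Terraform terms, the fix is to attach a launch template carrying the tag (names illustrative; the node group's other required arguments are elided):

```hcl
resource "aws_launch_template" "nodes" {
  name_prefix = "food-delivery-node-"

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name = "food-delivery-eks-node"  # this is what shows up in the EC2 console
    }
  }
}

resource "aws_eks_node_group" "main" {
  # cluster_name, node_role_arn, subnet_ids, and scaling_config omitted for brevity
  launch_template {
    id      = aws_launch_template.nodes.id
    version = aws_launch_template.nodes.latest_version
  }
}
```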
2. One NAT Gateway is a trade-off, not a mistake.
The prompt called for cost saving. A single NAT Gateway means if us-east-1a goes down, private subnets in us-east-1b lose internet access. I documented this in a comment on the resource. Production would use one NAT per AZ. That trade-off is worth explaining explicitly.
3. The IAM roles for EKS are the biggest footgun.
You need three separate IAM roles: a cluster role (for the control plane), a node role (for the EC2 instances in the node group), and optionally an IRSA role per service. Mixing them up silently breaks things. The AmazonEKS_CNI_Policy on the node role is what makes pod networking work — missing it gives you running pods with no network connectivity.
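For the node role specifically, the three managed policies are typically AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy, and AmazonEC2ContainerRegistryReadOnly. A sketch of the attachments (role name illustrative):

```hcl
resource "aws_iam_role" "node" {
  name = "food-delivery-eks-node"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy_attachment" "node" {
  for_each = toset([
    "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy",
    "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",              # pod networking
    "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly" # pull images from ECR
  ])
  role       = aws_iam_role.node.name
  policy_arn = each.value
}
```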
4. prometheus-fastapi-instrumentator is one line of code.
Instrumentator().instrument(app).expose(app)
That's it. You get request count, latency histograms, and HTTP status breakdown per endpoint, all at /metrics. The custom counters (orders_total, failed_orders_total, order_processing_seconds) are 5 more lines.
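The custom metrics use plain prometheus_client primitives underneath. A sketch (help strings and wiring are illustrative):

```python
from prometheus_client import Counter, Histogram

# Business metrics alongside the auto-generated request metrics
ORDERS_TOTAL = Counter("orders_total", "Orders placed successfully")
FAILED_ORDERS_TOTAL = Counter("failed_orders_total", "Orders that failed")
ORDER_PROCESSING_SECONDS = Histogram("order_processing_seconds", "Time spent processing an order")

@ORDER_PROCESSING_SECONDS.time()
def process_order(order):
    try:
        # ...validate and persist...
        ORDERS_TOTAL.inc()
    except Exception:
        FAILED_ORDERS_TOTAL.inc()
        raise
```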
5. Service-to-service calls need explicit timeouts.
order-service calls restaurant-service with httpx.AsyncClient(timeout=5.0). Without the timeout, a slow restaurant-service will hold an order-service worker indefinitely, causing cascade failures that look like order-service bugs in the logs.
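A sketch of what that call looks like with the timeout in place (the URL and response shape are assumptions):

```python
import httpx

# Every cross-service call gets an explicit budget, so a hung dependency
# fails fast instead of tying up an order-service worker.
async def validate_menu_items(restaurant_id: int, item_ids: list[int]) -> bool:
    async with httpx.AsyncClient(timeout=5.0) as client:
        resp = await client.get(
            f"http://restaurant-service:8002/restaurants/{restaurant_id}/menu"
        )
        resp.raise_for_status()
        menu_ids = {item["id"] for item in resp.json()}
        return all(i in menu_ids for i in item_ids)
```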
6. maxUnavailable=0 in rolling updates protects you more than you think.
With maxSurge=1, maxUnavailable=0, Kubernetes brings up the new pod and passes readiness checks before terminating the old one. The /health readinessProbe with initialDelaySeconds=15 means the new pod gets 15 seconds to initialize SQLite and seed data before traffic hits it. Without this, users hit 503s during every deploy.
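For reference, the relevant slice of the Deployment spec (selector, labels, image, and resources trimmed; every value here comes from the paragraph above):

```yaml
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    spec:
      containers:
        - name: order-service
          readinessProbe:
            httpGet:
              path: /health
              port: 8003
            initialDelaySeconds: 15
```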
Limitations (honest)
- SQLite is fine for local dev and demos. This would use RDS or Aurora in production.
- Single NAT Gateway is a cost optimization, not production-ready.
- The React frontend hardcodes http://localhost:8080; a real app would use environment injection at build time.
- No secrets management: passwords and the JWT secret are env vars. Production would use AWS Secrets Manager + Kubernetes Secrets.
- The GitHub Actions IAM user uses long-lived access keys. Production would use OIDC federation (no keys at all).
- The Grafana dashboard started as a local Docker Compose dashboard. Kubernetes metrics need their own PromQL queries and dashboard panels.
Try It
# Local — everything runs in Docker
git clone https://github.com/vijayb-aiops/devops-production-projects
cd devops-production-projects/projects/01-food-delivery-eks-platform
bash scripts/bootstrap.sh
# Trigger the observability demo
ORDER_SERVICE_FAILURE_MODE=true docker compose up -d order-service
bash scripts/load-test.sh
# Open Grafana at http://localhost:3000 (admin/foodrush123)
# Deploy to AWS
cd infra/terraform
terraform init
terraform apply
cd ../..
bash scripts/deploy-eks.sh
Estimated AWS cost while recording: ~$0.19/hr. Run terraform destroy when done.
📺 Full build-along: https://www.youtube.com/watch?v=HDiWR1uVI9s
📁 GitHub: https://github.com/vijayb-aiops/devops-production-projects/tree/main/projects/01-food-delivery-eks-platform