Vijaya Rajeev Bollu
I Built a Production Food Delivery Platform on AWS EKS — Here's Everything I Learned

Why I Built This

Most Kubernetes tutorials stop at kubectl apply -f deployment.yaml. They don't show you how a VPC is laid out, why you need two availability zones, what IAM roles EKS nodes actually need, or how to debug a live failure using Prometheus metrics.

I wanted to build something that forced me to make every decision a senior DevOps engineer would make on a real project. So I built a food delivery platform — four independent microservices, a React frontend, full Terraform infrastructure on AWS, a GitHub Actions pipeline, and a Grafana dashboard — and recorded the whole thing.

This is what I learned.


How It Works

The Application Layer

Four FastAPI microservices, each completely independent with its own SQLite database:

  • user-service (port 8001): Registration, JWT login, user profiles. Seeds 3 users on startup.
  • restaurant-service (port 8002): Restaurant listing + full menus. Seeds 5 restaurants with 10 menu items each — real food names, USD prices.
  • order-service (port 8003): Order placement. Makes a synchronous HTTP call to restaurant-service to validate menu items before placing the order. Has a built-in ORDER_SERVICE_FAILURE_MODE env var for the observability demo.
  • delivery-service (port 8004): Agent assignment and delivery tracking. Seeds 5 delivery agents.

Each service exposes /health (returns {"status":"healthy","service":"<name>","version":"1.0.0"}) and /metrics (auto-generated by prometheus-fastapi-instrumentator).

An NGINX gateway (port 8080 locally) routes /api/users, /api/restaurants, /api/orders, /api/delivery to the right service and serves the React frontend at /.
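The routing is plain prefix matching. A hypothetical nginx.conf sketch of it (upstream hostnames are assumed Docker Compose service names; ports are from the list above, and the actual config in the repo may differ):

```nginx
server {
    listen 8080;

    # One prefix per microservice
    location /api/users/       { proxy_pass http://user-service:8001/; }
    location /api/restaurants/ { proxy_pass http://restaurant-service:8002/; }
    location /api/orders/      { proxy_pass http://order-service:8003/; }
    location /api/delivery/    { proxy_pass http://delivery-service:8004/; }

    # React frontend served at /
    location / {
        root /usr/share/nginx/html;
        try_files $uri /index.html;
    }
}
```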

The Infrastructure

Terraform is split into four modules:

modules/vpc: VPC (10.0.0.0/16), 2 public + 2 private subnets across us-east-1a and us-east-1b, Internet Gateway, 1 NAT Gateway (single point of failure — intentional cost trade-off for a demo, documented in comments), route tables.

modules/eks: EKS 1.32 cluster, managed node group with t3.small instances (min=1, desired=2, max=4 in private subnets), cluster IAM role, node IAM role with three AWS-managed policies, launch template to name EC2 instances in the console.

modules/ecr: Five repositories (food-delivery/user-service, food-delivery/frontend, etc.), image scan on push, lifecycle policy keeping last 10 images.
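A "keep last 10 images" rule is a small piece of JSON. This is a sketch of the standard ECR lifecycle policy shape, not the exact text from the repo:

```json
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire everything beyond the 10 most recent images",
      "selection": {
        "tagStatus": "any",
        "countType": "imageCountMoreThan",
        "countNumber": 10
      },
      "action": { "type": "expire" }
    }
  ]
}
```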

modules/iam: GitHub Actions IAM user with an inline policy scoped to ECR push/pull and EKS describe — nothing else.
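A scoped inline policy like that can be sketched as follows. The action list and resource ARN here are assumptions for illustration, not copied from the repo:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ecr:GetAuthorizationToken"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "ecr:InitiateLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:CompleteLayerUpload",
        "ecr:PutImage"
      ],
      "Resource": "arn:aws:ecr:us-east-1:*:repository/food-delivery/*"
    },
    {
      "Effect": "Allow",
      "Action": ["eks:DescribeCluster"],
      "Resource": "*"
    }
  ]
}
```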

The CI/CD Pipeline

deploy.yml triggers on push to main. It:

  1. Applies Kubernetes manifests, ingress-nginx, and kube-prometheus-stack
  2. Uses a matrix job for user-service, restaurant-service, order-service, delivery-service, and frontend
  3. Logs into ECR
  4. Builds and tags each image with $GITHUB_SHA and latest
  5. Runs aws eks update-kubeconfig
  6. Does kubectl set image with the SHA tag
  7. Waits for kubectl rollout status

pr-checks.yml runs flake8, pytest, terraform fmt -check, and terraform validate on every pull request.

destroy.yml is a manual workflow_dispatch gated behind a typed confirmation, a safeguard against an accidental terraform destroy.
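A typed-confirmation gate in GitHub Actions can be sketched like this; the input name and job layout are assumptions, not the repo's exact workflow:

```yaml
on:
  workflow_dispatch:
    inputs:
      confirm:
        description: 'Type "destroy" to confirm'
        required: true

jobs:
  destroy:
    runs-on: ubuntu-latest
    # The job only runs if the exact word was typed
    if: ${{ github.event.inputs.confirm == 'destroy' }}
    steps:
      - uses: actions/checkout@v4
      - run: terraform -chdir=infra/terraform destroy -auto-approve
```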


The Observability Demo

This is the part that makes the project worth recording.

Set ORDER_SERVICE_FAILURE_MODE=true in Docker Compose and restart order-service. Now 50% of POST /orders requests return HTTP 500. Run scripts/load-test.sh — it fires 300 requests in 10 concurrent workers over 3 minutes.
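The failure mode is easy to model offline. Here is a stdlib-only Python sketch that simulates the 50% failure rate under concurrent workers; the real load-test.sh hits the gateway over HTTP, everything here is simulated:

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def place_order(seed: int) -> int:
    """Simulate POST /orders with ORDER_SERVICE_FAILURE_MODE=true:
    roughly half of requests come back as HTTP 500."""
    rng = random.Random(seed)  # deterministic per request for the demo
    return 500 if rng.random() < 0.5 else 200

# 300 requests across 10 concurrent workers, like scripts/load-test.sh
with ThreadPoolExecutor(max_workers=10) as pool:
    statuses = list(pool.map(place_order, range(300)))

counts = Counter(statuses)
error_rate = counts[500] / len(statuses)
print(f"errors: {counts[500]}/300, error rate ~ {error_rate:.0%}")
```

The same shape shows up in Grafana: the error-rate panel is just this ratio computed over a sliding window.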

In Grafana, the "Error rate per service" panel spikes immediately from 0% to ~50% for order-service. The failed_orders_total counter climbs. P95 latency creeps up because failed requests still go through the restaurant-service validation call before failing.

Meanwhile the HPA detects elevated CPU and scales replicas from 2 to 6. More pods, same error rate: the bug is in code, not capacity.

kubectl logs on any order-service pod shows the failure mode immediately. Fix: set ORDER_SERVICE_FAILURE_MODE=false, redeploy. Grafana recovers in under 30 seconds.

That recovery graph — the spike, the plateau, the drop — is the money shot of the video.


What I Learned

1. EKS nodes don't get Name tags by default.
The aws_eks_node_group resource tags the node group, not the individual EC2 instances. You need a launch_template with tag_specifications { resource_type = "instance" } to see names in the EC2 console. Lost 20 minutes on this.
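A sketch of the fix in Terraform; the resource and tag names here are assumptions:

```hcl
resource "aws_launch_template" "nodes" {
  name_prefix = "food-delivery-node-"

  # This is what actually puts a Name on each EC2 instance
  tag_specifications {
    resource_type = "instance"
    tags = {
      Name = "food-delivery-eks-node"
    }
  }
}

resource "aws_eks_node_group" "main" {
  # ...cluster, subnets, scaling config...
  launch_template {
    id      = aws_launch_template.nodes.id
    version = aws_launch_template.nodes.latest_version
  }
}
```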

2. One NAT Gateway is a trade-off, not a mistake.
The prompt called for cost saving. A single NAT Gateway means if us-east-1a goes down, private subnets in us-east-1b lose internet access. I documented this in a comment on the resource. Production would use one NAT per AZ. That trade-off is worth explaining explicitly.

3. The IAM roles for EKS are the biggest footgun.
You need three separate IAM roles: a cluster role (for the control plane), a node role (for the EC2 instances in the node group), and optionally an IRSA role per service (IAM Roles for Service Accounts). Mixing them up silently breaks things. The AmazonEKS_CNI_Policy on the node role is what makes pod networking work; missing it gives you running pods with no network connectivity.
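For reference, the three AWS-managed policies a node role typically needs, sketched in Terraform (the role resource name is an assumption):

```hcl
resource "aws_iam_role_policy_attachment" "node" {
  for_each = toset([
    "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy",
    "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",               # pod networking
    "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly", # image pulls
  ])
  role       = aws_iam_role.node.name
  policy_arn = each.value
}
```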

4. prometheus-fastapi-instrumentator is one line of code.

```python
from prometheus_fastapi_instrumentator import Instrumentator

Instrumentator().instrument(app).expose(app)
```

That's it. You get request count, latency histograms, and HTTP status breakdown per endpoint, all at /metrics. The custom counters (orders_total, failed_orders_total, order_processing_seconds) are 5 more lines.
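Those custom metrics can be sketched with the plain Prometheus Python client. The metric names follow the ones above; the handler shown is illustrative, not the repo's code:

```python
from prometheus_client import Counter, Histogram

orders_total = Counter(
    "orders_total", "Orders placed", ["status"])
failed_orders_total = Counter(
    "failed_orders_total", "Orders that returned HTTP 500")
order_processing_seconds = Histogram(
    "order_processing_seconds", "Time spent processing an order")

# Inside a hypothetical POST /orders handler:
with order_processing_seconds.time():   # records into the histogram on exit
    orders_total.labels(status="placed").inc()
```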

5. Service-to-service calls need explicit timeouts.
order-service calls restaurant-service with httpx.AsyncClient(timeout=5.0). Without the timeout, a slow restaurant-service will hold an order-service worker indefinitely, causing cascade failures that look like order-service bugs in the logs.
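The same guard can be shown with nothing but the standard library: asyncio.wait_for plays the role of httpx's timeout here, and a slow restaurant-service is simulated with asyncio.sleep. All names are illustrative:

```python
import asyncio

async def call_restaurant_service(delay: float) -> str:
    # Simulated downstream call; a slow restaurant-service just sleeps
    await asyncio.sleep(delay)
    return "menu ok"

async def validate_order() -> str:
    try:
        # Same role as httpx.AsyncClient(timeout=5.0): never wait forever
        return await asyncio.wait_for(call_restaurant_service(10.0), timeout=0.1)
    except asyncio.TimeoutError:
        return "503: restaurant-service timed out"

print(asyncio.run(validate_order()))
```

Without the wait_for, the worker would sit in that await for the full 10 seconds, which is exactly the cascade described above.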

6. maxUnavailable=0 in rolling updates protects you more than you think.
With maxSurge=1, maxUnavailable=0, Kubernetes brings up the new pod and passes readiness checks before terminating the old one. The /health readinessProbe with initialDelaySeconds=15 means the new pod gets 15 seconds to initialize SQLite and seed data before traffic hits it. Without this, users hit 503s during every deploy.
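Sketched as Deployment fields (the container name and port are assumptions; the probe path and delay match the text):

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # one extra pod during rollout
      maxUnavailable: 0  # never drop below desired replicas
  template:
    spec:
      containers:
        - name: order-service
          readinessProbe:
            httpGet:
              path: /health
              port: 8003
            initialDelaySeconds: 15  # time to init SQLite and seed data
            periodSeconds: 5
```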


Limitations (honest)

  • SQLite is fine for local dev and demos. This would use RDS or Aurora in production.
  • Single NAT Gateway is a cost optimization, not production-ready.
  • The React frontend hardcodes http://localhost:8080 — a real app would use environment injection at build time.
  • No secrets management — passwords and JWT secret are env vars. Production would use AWS Secrets Manager + Kubernetes Secrets.
  • The GitHub Actions IAM user uses long-lived access keys. Production would use OIDC federation (no keys at all).
  • The Grafana dashboard started as a local Docker Compose dashboard. Kubernetes metrics need their own PromQL queries and dashboard panels.

Try It

```bash
# Local — everything runs in Docker
git clone https://github.com/vijayb-aiops/devops-production-projects
cd devops-production-projects/projects/01-food-delivery-eks-platform
bash scripts/bootstrap.sh

# Trigger the observability demo
ORDER_SERVICE_FAILURE_MODE=true docker compose up -d order-service
bash scripts/load-test.sh
# Open Grafana at http://localhost:3000 (admin/foodrush123)

# Deploy to AWS
cd infra/terraform
terraform init
terraform apply
cd ../..
bash scripts/deploy-eks.sh
```

Estimated AWS cost while recording: ~$0.19/hr. Run terraform destroy when done.

📺 Full build-along: https://www.youtube.com/watch?v=HDiWR1uVI9s
📁 GitHub: https://github.com/vijayb-aiops/devops-production-projects/tree/main/projects/01-food-delivery-eks-platform
