Vijaya Rajeev Bollu
I Built a Production Food Delivery Platform on AWS EKS — Here's Everything I Learned

Why I Built This

Most Kubernetes tutorials stop at kubectl apply -f deployment.yaml. They don't show you how a VPC is laid out, why you need two availability zones, what IAM roles EKS nodes actually need, or how to debug a live failure using Prometheus metrics.

I wanted to build something that forced me to make every decision a senior DevOps engineer would make on a real project. So I built a food delivery platform — four independent microservices, a React frontend, full Terraform infrastructure on AWS, a GitHub Actions pipeline, and a Grafana dashboard — and recorded the whole thing.

This is what I learned.


How It Works

The Application Layer

Four FastAPI microservices, each completely independent with its own SQLite database:

  • user-service (port 8001): Registration, JWT login, user profiles. Seeds 3 users on startup.
  • restaurant-service (port 8002): Restaurant listing + full menus. Seeds 5 restaurants with 10 menu items each — real food names, USD prices.
  • order-service (port 8003): Order placement. Makes a synchronous HTTP call to restaurant-service to validate menu items before placing the order. Has a built-in ORDER_SERVICE_FAILURE_MODE env var for the observability demo.
  • delivery-service (port 8004): Agent assignment and delivery tracking. Seeds 5 delivery agents.

Each service exposes /health (returns {"status":"healthy","service":"<name>","version":"1.0.0"}) and /metrics (auto-generated by prometheus-fastapi-instrumentator).

An NGINX gateway (port 8080 locally) routes /api/users, /api/restaurants, /api/orders, /api/delivery to the right service and serves the React frontend at /.
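The routing is plain prefix matching. A hypothetical nginx.conf sketch of it (upstream hostnames are assumed Docker Compose service names; ports are from the list above, and the actual config in the repo may differ):

```nginx
server {
    listen 8080;

    # One prefix per microservice
    location /api/users/       { proxy_pass http://user-service:8001/; }
    location /api/restaurants/ { proxy_pass http://restaurant-service:8002/; }
    location /api/orders/      { proxy_pass http://order-service:8003/; }
    location /api/delivery/    { proxy_pass http://delivery-service:8004/; }

    # React frontend served at /
    location / {
        root /usr/share/nginx/html;
        try_files $uri /index.html;
    }
}
```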

The Infrastructure

Terraform is split into four modules:

modules/vpc: VPC (10.0.0.0/16), 2 public + 2 private subnets across us-east-1a and us-east-1b, Internet Gateway, 1 NAT Gateway (single point of failure — intentional cost trade-off for a demo, documented in comments), route tables.

modules/eks: EKS 1.32 cluster, managed node group with t3.small instances (min=1, desired=2, max=4 in private subnets), cluster IAM role, node IAM role with three AWS-managed policies, launch template to name EC2 instances in the console.

modules/ecr: Five repositories (food-delivery/user-service, food-delivery/frontend, etc.), image scan on push, lifecycle policy keeping last 10 images.
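A "keep last 10 images" rule is a small piece of JSON. This is a sketch of the standard ECR lifecycle policy shape, not the exact text from the repo:

```json
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire everything beyond the 10 most recent images",
      "selection": {
        "tagStatus": "any",
        "countType": "imageCountMoreThan",
        "countNumber": 10
      },
      "action": { "type": "expire" }
    }
  ]
}
```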

modules/iam: GitHub Actions IAM user with an inline policy scoped to ECR push/pull and EKS describe — nothing else.
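A scoped inline policy like that can be sketched as follows. The action list and resource ARN here are assumptions for illustration, not copied from the repo:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ecr:GetAuthorizationToken"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "ecr:InitiateLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:CompleteLayerUpload",
        "ecr:PutImage"
      ],
      "Resource": "arn:aws:ecr:us-east-1:*:repository/food-delivery/*"
    },
    {
      "Effect": "Allow",
      "Action": ["eks:DescribeCluster"],
      "Resource": "*"
    }
  ]
}
```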

The CI/CD Pipeline

deploy.yml triggers on push to main. It:

  1. Applies Kubernetes manifests, ingress-nginx, and kube-prometheus-stack
  2. Uses a matrix job for user-service, restaurant-service, order-service, delivery-service, and frontend
  3. Logs into ECR
  4. Builds and tags each image with $GITHUB_SHA and latest
  5. Runs aws eks update-kubeconfig
  6. Does kubectl set image with the SHA tag
  7. Waits for kubectl rollout status

pr-checks.yml runs flake8, pytest, terraform fmt -check, and terraform validate on every pull request.

destroy.yml is a manual workflow_dispatch gated behind a typed confirmation, a safeguard against an accidental terraform destroy.
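A typed-confirmation gate in GitHub Actions can be sketched like this; the input name and job layout are assumptions, not the repo's exact workflow:

```yaml
on:
  workflow_dispatch:
    inputs:
      confirm:
        description: 'Type "destroy" to confirm'
        required: true

jobs:
  destroy:
    runs-on: ubuntu-latest
    # The job only runs if the exact word was typed
    if: ${{ github.event.inputs.confirm == 'destroy' }}
    steps:
      - uses: actions/checkout@v4
      - run: terraform -chdir=infra/terraform destroy -auto-approve
```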


The Observability Demo

This is the part that makes the project worth recording.

Set ORDER_SERVICE_FAILURE_MODE=true in Docker Compose and restart order-service. Now 50% of POST /orders requests return HTTP 500. Run scripts/load-test.sh — it fires 300 requests in 10 concurrent workers over 3 minutes.
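The failure mode is easy to model offline. Here is a stdlib-only Python sketch that simulates the 50% failure rate under concurrent workers; the real load-test.sh hits the gateway over HTTP, everything here is simulated:

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def place_order(seed: int) -> int:
    """Simulate POST /orders with ORDER_SERVICE_FAILURE_MODE=true:
    roughly half of requests come back as HTTP 500."""
    rng = random.Random(seed)  # deterministic per request for the demo
    return 500 if rng.random() < 0.5 else 200

# 300 requests across 10 concurrent workers, like scripts/load-test.sh
with ThreadPoolExecutor(max_workers=10) as pool:
    statuses = list(pool.map(place_order, range(300)))

counts = Counter(statuses)
error_rate = counts[500] / len(statuses)
print(f"errors: {counts[500]}/300, error rate ~ {error_rate:.0%}")
```

The same shape shows up in Grafana: the error-rate panel is just this ratio computed over a sliding window.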

In Grafana, the "Error rate per service" panel spikes immediately from 0% to ~50% for order-service. The failed_orders_total counter climbs. P95 latency creeps up because failed requests still go through the restaurant-service validation call before failing.

Meanwhile the HPA detects elevated CPU and scales replicas from 2 to 6. More pods, same error rate: the bug is in code, not capacity.

kubectl logs on any order-service pod shows the failure mode immediately. Fix: set ORDER_SERVICE_FAILURE_MODE=false, redeploy. Grafana recovers in under 30 seconds.

That recovery graph — the spike, the plateau, the drop — is the money shot of the video.


What I Learned

1. EKS nodes don't get Name tags by default.
The aws_eks_node_group resource tags the node group, not the individual EC2 instances. You need a launch_template with tag_specifications { resource_type = "instance" } to see names in the EC2 console. Lost 20 minutes on this.
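A sketch of the fix in Terraform; the resource and tag names here are assumptions:

```hcl
resource "aws_launch_template" "nodes" {
  name_prefix = "food-delivery-node-"

  # This is what actually puts a Name on each EC2 instance
  tag_specifications {
    resource_type = "instance"
    tags = {
      Name = "food-delivery-eks-node"
    }
  }
}

resource "aws_eks_node_group" "main" {
  # ...cluster, subnets, scaling config...
  launch_template {
    id      = aws_launch_template.nodes.id
    version = aws_launch_template.nodes.latest_version
  }
}
```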

2. One NAT Gateway is a trade-off, not a mistake.
The prompt called for cost saving. A single NAT Gateway means if us-east-1a goes down, private subnets in us-east-1b lose internet access. I documented this in a comment on the resource. Production would use one NAT per AZ. That trade-off is worth explaining explicitly.

3. The IAM roles for EKS are the biggest footgun.
You need three separate IAM roles: a cluster role (for the control plane), a node role (for the EC2 instances in the node group), and optionally an IRSA role per service (IAM Roles for Service Accounts). Mixing them up silently breaks things. The AmazonEKS_CNI_Policy on the node role is what makes pod networking work; missing it gives you running pods with no network connectivity.
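For reference, the three AWS-managed policies a node role typically needs, sketched in Terraform (the role resource name is an assumption):

```hcl
resource "aws_iam_role_policy_attachment" "node" {
  for_each = toset([
    "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy",
    "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",               # pod networking
    "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly", # image pulls
  ])
  role       = aws_iam_role.node.name
  policy_arn = each.value
}
```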

4. prometheus-fastapi-instrumentator is one line of code.

```python
from prometheus_fastapi_instrumentator import Instrumentator

Instrumentator().instrument(app).expose(app)
```

That's it. You get request count, latency histograms, and HTTP status breakdown per endpoint, all at /metrics. The custom counters (orders_total, failed_orders_total, order_processing_seconds) are 5 more lines.
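Those custom metrics can be sketched with the plain Prometheus Python client. The metric names follow the ones above; the handler shown is illustrative, not the repo's code:

```python
from prometheus_client import Counter, Histogram

orders_total = Counter(
    "orders_total", "Orders placed", ["status"])
failed_orders_total = Counter(
    "failed_orders_total", "Orders that returned HTTP 500")
order_processing_seconds = Histogram(
    "order_processing_seconds", "Time spent processing an order")

# Inside a hypothetical POST /orders handler:
with order_processing_seconds.time():   # records into the histogram on exit
    orders_total.labels(status="placed").inc()
```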

5. Service-to-service calls need explicit timeouts.
order-service calls restaurant-service with httpx.AsyncClient(timeout=5.0). Without the timeout, a slow restaurant-service will hold an order-service worker indefinitely, causing cascade failures that look like order-service bugs in the logs.
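The same guard can be shown with nothing but the standard library: asyncio.wait_for plays the role of httpx's timeout here, and a slow restaurant-service is simulated with asyncio.sleep. All names are illustrative:

```python
import asyncio

async def call_restaurant_service(delay: float) -> str:
    # Simulated downstream call; a slow restaurant-service just sleeps
    await asyncio.sleep(delay)
    return "menu ok"

async def validate_order() -> str:
    try:
        # Same role as httpx.AsyncClient(timeout=5.0): never wait forever
        return await asyncio.wait_for(call_restaurant_service(10.0), timeout=0.1)
    except asyncio.TimeoutError:
        return "503: restaurant-service timed out"

print(asyncio.run(validate_order()))
```

Without the wait_for, the worker would sit in that await for the full 10 seconds, which is exactly the cascade described above.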

6. maxUnavailable=0 in rolling updates protects you more than you think.
With maxSurge=1, maxUnavailable=0, Kubernetes brings up the new pod and passes readiness checks before terminating the old one. The /health readinessProbe with initialDelaySeconds=15 means the new pod gets 15 seconds to initialize SQLite and seed data before traffic hits it. Without this, users hit 503s during every deploy.
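Sketched as Deployment fields (the container name and port are assumptions; the probe path and delay match the text):

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # one extra pod during rollout
      maxUnavailable: 0  # never drop below desired replicas
  template:
    spec:
      containers:
        - name: order-service
          readinessProbe:
            httpGet:
              path: /health
              port: 8003
            initialDelaySeconds: 15  # time to init SQLite and seed data
            periodSeconds: 5
```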


Limitations (honest)

  • SQLite is fine for local dev and demos. This would use RDS or Aurora in production.
  • Single NAT Gateway is a cost optimization, not production-ready.
  • The React frontend hardcodes http://localhost:8080 — a real app would use environment injection at build time.
  • No secrets management — passwords and JWT secret are env vars. Production would use AWS Secrets Manager + Kubernetes Secrets.
  • The GitHub Actions IAM user uses long-lived access keys. Production would use OIDC federation (no keys at all).
  • The Grafana dashboard started as a local Docker Compose dashboard. Kubernetes metrics need their own PromQL queries and dashboard panels.

Try It

```bash
# Local — everything runs in Docker
git clone https://github.com/vijayb-aiops/devops-production-projects
cd devops-production-projects/projects/01-food-delivery-eks-platform
bash scripts/bootstrap.sh

# Trigger the observability demo
ORDER_SERVICE_FAILURE_MODE=true docker compose up -d order-service
bash scripts/load-test.sh
# Open Grafana at http://localhost:3000 (admin/foodrush123)

# Deploy to AWS
cd infra/terraform
terraform init
terraform apply
cd ../..
bash scripts/deploy-eks.sh
```

Estimated AWS cost while recording: ~$0.19/hr. Run terraform destroy when done.

📺 Full build-along: https://www.youtube.com/watch?v=HDiWR1uVI9s
📁 GitHub: https://github.com/vijayb-aiops/devops-production-projects/tree/main/projects/01-food-delivery-eks-platform
