DEV Community

Ijeawele Divine Nkwocha

Deploying a Production-Grade Microservices Platform on AWS EKS: Every Decision, Every Error, Every Lesson

Most Kubernetes tutorials stop at "your pod is running." That's not production.

Production is secrets management, autoscaling, TLS automation, persistent storage across availability zones, and an ingress layer that handles real traffic patterns. This article walks through a full microservices deployment on AWS EKS: the architecture decisions, the errors that will humble you if you skip the fundamentals, and the things worth doing differently on the next project.

The platform is RideShare Pro. Six independent services, a centralised ingress layer, managed data stores, and a live domain. GitHub repo here.

If you're a DevOps engineer working toward production-grade Kubernetes, this is the kind of breakdown you won't find in a quickstart guide.

Application Deployed with Custom Domain


What's Being Built

RideShare Pro is a microservices-based rideshare application where each business capability lives in its own independent service:

  • Rider Service: rider profiles, ride requests, status tracking
  • Driver Service: driver profiles, vehicle management, availability
  • Trips Service: trip lifecycle from creation to completion, trip history
  • Matching Service: real-time matching of riders with the nearest available driver
  • Email Service: transactional emails triggered by events from other services
  • Frontend: the user-facing web application

Each service communicates over HTTP APIs. All external traffic is routed through a centralised NGINX Ingress Controller acting as the API gateway.

The goal: deploy this on AWS EKS in a way that is scalable, highly available, and genuinely production-ready.

Rideshare Pro Architecture Diagram


Step 1: Deploy Locally Before Touching the Cluster

Before writing a single Kubernetes manifest, deploy the entire application locally.

This one decision saves hours.

Local deployment surfaces errors in the codebase itself, bugs that would later appear as CrashLoopBackOff pods with no obvious cause. Catching them at the local stage means you're not debugging application code and infrastructure configuration simultaneously. That combination is one of the most frustrating debugging scenarios in DevOps work.

The cluster will surface problems. It won't always tell you clearly why. Local validation first means the only thing the cluster is testing is the infrastructure.


Step 2: Build Docker Images and Push to ECR

Once the application is running locally, build Docker images for each service and store them in Amazon ECR (Elastic Container Registry).

ECR is the right choice when you're already in the AWS ecosystem. It integrates natively with EKS, uses IAM-based access control, and there's no friction pulling images at deploy time. No separate registry credentials to manage. No external dependency at cluster startup.
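The per-service loop looks roughly like this. The account ID, region, and repository names below are placeholders; substitute your own:

```shell
# Placeholders — replace with your own account ID, region, and service paths
AWS_ACCOUNT_ID=123456789012
AWS_REGION=eu-north-1
REGISTRY="${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"

# Authenticate Docker against ECR (the token is valid for 12 hours)
aws ecr get-login-password --region "$AWS_REGION" | \
  docker login --username AWS --password-stdin "$REGISTRY"

# Build, tag, and push one service; repeat for each of the six
docker build -t rider-service:latest ./rider-service
docker tag rider-service:latest "$REGISTRY/rider-service:latest"
docker push "$REGISTRY/rider-service:latest"
```

The repository must exist first (aws ecr create-repository or Terraform/console), and the deployment manifests then reference the full registry path as the image.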


Step 3: Kubernetes Manifest Structure

The manifests are organised into four directories: aws/, platform/, stateful/, and applications/. Each one has a clear separation of concern.

aws/: Cluster-Level Infrastructure

These manifests interact directly with the AWS API to provision cluster-level resources.

node-groups.yaml creates managed node groups, collections of EC2 instances that Kubernetes schedules pods onto. Managed node groups mean AWS handles provisioning, scaling, and lifecycle management. t3.medium instances across multiple Availability Zones cover general-purpose workloads well without over-provisioning.
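As a sketch of what such a definition can look like, here it is in eksctl's ClusterConfig format. The repo's actual node-groups.yaml may differ, and the cluster name, zones, and sizes are illustrative:

```yaml
# Illustrative eksctl-style managed node group — names, zones, and
# sizes are placeholders, not the project's real values.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: rideshare-cluster
  region: eu-north-1
managedNodeGroups:
  - name: general-purpose
    instanceType: t3.medium
    minSize: 2
    desiredCapacity: 3
    maxSize: 6
    availabilityZones: ["eu-north-1a", "eu-north-1b"]
```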

iam-roles.yaml sets up IAM Roles for Service Accounts (IRSA). IRSA is how specific Kubernetes service accounts get permission to call AWS APIs, in this case, permission to create EBS volumes for persistent storage. This is the correct approach. Giving broad IAM permissions to nodes is a security anti-pattern. IRSA scopes permissions to exactly what each service account needs and nothing more.
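The visible end of IRSA inside the cluster is a service account annotated with a role ARN, roughly like this (the role name and account ID are placeholders; the role's trust policy and OIDC provider setup happen on the AWS side):

```yaml
# Illustrative IRSA wiring — pods using this service account assume
# the annotated IAM role, which should grant only the EBS actions needed.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ebs-csi-controller-sa
  namespace: kube-system
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/EbsCsiDriverRole
```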

storage-classes.yaml creates an EBS gp3 StorageClass using the IRSA role above. The critical setting here is volumeBindingMode: WaitForFirstConsumer. More on why this matters in the errors section.
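A minimal sketch of such a StorageClass (parameter values are illustrative):

```yaml
# gp3 StorageClass sketch. WaitForFirstConsumer is the setting the
# errors section below revisits: it delays volume creation until the
# pod is scheduled, so volume and pod land in the same AZ.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer
parameters:
  type: gp3
  encrypted: "true"
```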

platform/: Shared Cluster-Wide Components

This directory sets up everything that makes the cluster secure and scalable, autoscaling, ingress, secrets, and security.

Autoscaling: Two autoscaling mechanisms work together here. The Horizontal Pod Autoscaler (HPA) scales pods when CPU or memory thresholds are hit. When all nodes are full and more pods need scheduling, the Cluster Autoscaler adds nodes to accommodate them. Both are necessary. HPA without the Cluster Autoscaler means pod scaling stalls when node capacity runs out.

One thing worth knowing about HPA: it requires both resource requests and limits to be set on containers. HPA measures CPU utilisation as a percentage of the requested CPU, not the limit. Without requests, it has no baseline and effectively does nothing. Always set both.
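A sketch of an HPA targeting requested CPU (names and thresholds are illustrative):

```yaml
# HPA sketch — averageUtilization: 70 means 70% of the container's
# resources.requests.cpu, which is why requests must be set.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: trips-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: trips-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```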

Ingress: NGINX Ingress Controller with path-based routing. The backend ingress rules (ingress-api.yaml) are separated from the frontend (ingress-frontend.yaml) deliberately. API paths need specific annotations, rate limiting, and authentication headers that shouldn't bleed over to the frontend. Separating them gives cleaner, more targeted control and makes future changes safer.

For HTTPS, cert-manager with a ClusterIssuer pointing to Let's Encrypt handles certificate provisioning and renewal automatically. Production deployments need HTTPS. This is the cleanest way to handle it without any manual certificate management.

Secrets: This is where most engineers either get it right or create a future security incident.

Native Kubernetes Secrets are base64-encoded, not encrypted. Anyone with cluster access can decode them. The production-grade approach is the External Secrets Operator (ESO) with AWS Secrets Manager.

Here's how it works: secrets, database URLs, Redis connection strings, JWT keys and other sensitive credentials are stored in AWS Secrets Manager. ESO creates a SecretStore pointing to that service. ExternalSecret resources reference the store and map specific secrets into pods as environment variables. ESO syncs on a configurable schedule, so rotating a secret in AWS Secrets Manager propagates to the cluster automatically.
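Roughly, the wiring looks like this. Names, region, keys, and the ESO API version are illustrative and depend on your release:

```yaml
# Illustrative ESO wiring — a SecretStore pointing at AWS Secrets
# Manager, plus an ExternalSecret that maps one remote secret into a
# native Kubernetes Secret the pods consume.
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets-manager
  namespace: rideshare-app
spec:
  provider:
    aws:
      service: SecretsManager
      region: eu-north-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets-sa   # IRSA-annotated service account
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: trips-service-secrets
  namespace: rideshare-app
spec:
  refreshInterval: 1h          # rotation in Secrets Manager propagates on this schedule
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: trips-service-env    # the Kubernetes Secret ESO creates and keeps in sync
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: rideshare/trips-service
        property: database_url
```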

This eliminates an entire class of security risk. The alternative, base64-encoded secrets committed to YAML files, leaves your credentials one public repository search away from exposure.

Secret Store for Trips Service in AWS Secrets Manager

Security: Pod Disruption Budgets (PDBs) for all critical services. A PDB ensures that during node maintenance or cluster upgrades, Kubernetes cannot take down more than a defined number of pods simultaneously. Setting minAvailable: 2 means regardless of what's happening at the node level, at least 2 pods of that service stay running. This is the difference between a cluster that survives a rolling upgrade and one that causes an outage during routine maintenance.
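A PDB sketch for one such service (names are illustrative):

```yaml
# PDB sketch — voluntary disruptions (drains, rolling upgrades) must
# leave at least 2 matching pods running at all times.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: trips-service-pdb
  namespace: rideshare-app
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: trips-service
```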

stateful/: Persistent Data

This is the most significant architectural decision in the project, and it's a deliberate departure from the typical tutorial approach.

A standard spec would call for PostgreSQL and Redis deployed as StatefulSets inside the cluster. Here, both are replaced with managed AWS services: Aurora RDS for PostgreSQL and Amazon ElastiCache for Redis.

The reasoning is operational reality.

StatefulSets in Kubernetes are powerful but come with real overhead. Database replication, node failure recovery, volume reattachment, version upgrades, all of that falls on the team. For most production systems, that's engineering time that isn't being spent on the product.

Aurora RDS changes the equation. Replication across Availability Zones is automatic. Storage scales without intervention. Automated backups, failover, and read replicas are built in. ElastiCache gives the same model for Redis, managed, highly available, secure, with automatic failover and no operational burden.

Details of RDS on AWS Portal

The tradeoff is cost and cloud portability. Managed services cost more than self-hosted, and you're tied to AWS. For a production system where reliability and engineering time both matter, this is the right call. Know the tradeoff, make the decision consciously.

applications/: The Microservices

Each service directory contains:

  • deployment.yaml: pod specs, container definitions, resource requests/limits, environment variables pulled from ExternalSecrets
  • service.yaml: ClusterIP service exposing the deployment internally
  • hpa.yaml: Horizontal Pod Autoscaler targeting CPU and memory thresholds
  • configmap.yaml: non-sensitive configuration like service URLs and feature flags

All services are ClusterIP. External traffic flows through NGINX Ingress only. Exposing individual services directly to the internet via LoadBalancer-type Services is both a cost problem and a security problem.
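A minimal ClusterIP service sketch (names and ports are illustrative):

```yaml
# ClusterIP sketch — reachable only inside the cluster; the ingress
# layer handles all external routing.
apiVersion: v1
kind: Service
metadata:
  name: trips-service
  namespace: rideshare-app
spec:
  type: ClusterIP
  selector:
    app: trips-service
  ports:
    - port: 80
      targetPort: 8080
```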


Step 4: Real Domain, Real HTTPS

Deploying to a cluster is one thing. A live URL that external traffic can hit is another, and it's what makes a portfolio project credible.

Finding the Load Balancer IP

When NGINX Ingress Controller deploys on EKS, it provisions an AWS Load Balancer automatically. To get the external IP:

kubectl get svc -n ingress-nginx

The EXTERNAL-IP column on the ingress-nginx-controller service is the cluster's entry point. On EKS this is usually the load balancer's DNS name rather than a raw IP. Either way, that's what DNS points at.

DNS Configuration

In GoDaddy (or whichever registrar), add:

  • Type: A (use a CNAME instead if EXTERNAL-IP is a DNS hostname)
  • Name: rideshare
  • Value: the load balancer address from above

rideshare.ijeaweledivine.online now routes to the NGINX Ingress Controller, which applies path rules to reach the correct service.

TLS Automation with cert-manager

cert-manager with a Let's Encrypt ClusterIssuer handles certificate provisioning and renewal automatically.

The annotations that drive this on the Ingress manifest:

annotations:
  nginx.ingress.kubernetes.io/use-regex: "true"
  cert-manager.io/cluster-issuer: letsencrypt-prod
  nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
  nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
  nginx.ingress.kubernetes.io/websocket-services: "trip-service"

Each of these annotations is worth understanding individually.

use-regex: "true" enables regex path matching. Without this, path rules are basic prefix matching only. With it, you can write precise rules like /api/trips/.* to catch all trip-related routes cleanly.

cluster-issuer: letsencrypt-prod is the key annotation. cert-manager sees this, creates a CertificateRequest, runs the ACME challenge with Let's Encrypt, gets a signed certificate, stores it as a Kubernetes Secret, and handles renewal before expiry. One annotation. Permanent HTTPS.

proxy-read-timeout and proxy-send-timeout: "3600" set the timeout values to 1 hour. The default NGINX timeout is 60 seconds. For a rideshare platform where an active trip can last 45 minutes, 60 seconds kills live connections mid-trip. Match your timeout values to your actual usage patterns.

websocket-services: "trip-service" tells NGINX that the trips service uses WebSockets for real-time communication: live trip status updates and driver location tracking. Standard HTTP is request-response and closes. WebSockets stay open. Without this annotation, NGINX doesn't handle the connection upgrade correctly, and real-time features fail silently.
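For context, here is roughly how those annotations sit on a full Ingress with TLS. The paths, backend names, and ports are illustrative; the repo's actual manifests may differ:

```yaml
# Illustrative Ingress skeleton — the tls: block names the Secret
# where cert-manager stores the issued certificate.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ingress-api
  namespace: rideshare-app
  annotations:
    nginx.ingress.kubernetes.io/use-regex: "true"
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - rideshare.ijeaweledivine.online
      secretName: rideshare-tls
  rules:
    - host: rideshare.ijeaweledivine.online
      http:
        paths:
          - path: /api/trips/.*
            pathType: ImplementationSpecific   # required for regex paths
            backend:
              service:
                name: trips-service
                port:
                  number: 80
```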

The cert-manager Automation Flow

Once the Ingress is applied with the cert-manager annotation:

  1. cert-manager detects the annotation and creates a CertificateRequest
  2. Let's Encrypt issues an ACME challenge, proof that the domain is under your control
  3. cert-manager creates a temporary pod and Ingress rule to respond to the challenge
  4. Let's Encrypt verifies, issues the certificate
  5. cert-manager stores it as a Kubernetes Secret and mounts it into the Ingress
  6. NGINX serves HTTPS traffic
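The ClusterIssuer that the annotation references looks roughly like this (the email and solver details are placeholders; exact solver fields vary slightly across cert-manager releases):

```yaml
# Illustrative ClusterIssuer pointing at Let's Encrypt's production
# ACME endpoint, using the HTTP-01 challenge via the nginx ingress.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: you@example.com            # placeholder — expiry notices go here
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx
```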

Watch it happen in real time:

kubectl get certificaterequest -n your-namespace
kubectl describe certificaterequest <name> -n your-namespace

Status shows Approved and Issued in about 2 minutes. The whole process is hands-off after the initial annotation.


The Errors That May Stress You Out

Error 1: no topology key found for node

The EBS CSI Driver couldn't identify which Availability Zone the worker node was in, so it couldn't safely create a persistent volume. EBS volumes are AZ-specific; a volume in eu-north-1a cannot be mounted by a pod in eu-north-1b.

Two fixes:

EKS nodes need the label topology.ebs.csi.aws.com/zone for the storage driver to identify the AZ. Apply this to your node groups.

More importantly: set volumeBindingMode: WaitForFirstConsumer in your StorageClass. Without this, Kubernetes creates the EBS volume before it knows which node the pod will land on. WaitForFirstConsumer delays volume creation until the pod is scheduled to a node, then creates the volume in the same AZ. This single setting eliminates an entire class of storage scheduling problems.

Error 2: secret "postgres-credentials" not found

Namespace isolation. Kubernetes secrets are namespace-scoped. A pod in the rideshare-app namespace cannot access a secret created in the default namespace. The credentials existed; they were just invisible from where the pod was looking.

When pods hit CreateContainerConfigError or CrashLoopBackOff and the image is confirmed healthy, check namespace alignment before anything else. It's almost always either a namespace mismatch or a missing secret.
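A few commands narrow this down quickly (namespace and pod names are illustrative):

```shell
# Does the secret exist in the namespace the pod runs in?
kubectl get secret postgres-credentials -n rideshare-app

# If not, where does it actually live?
kubectl get secret --all-namespaces | grep postgres-credentials

# The pod's events usually name the missing secret explicitly
kubectl describe pod <pod-name> -n rideshare-app | grep -A5 Events
```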


What to Do Differently

Two gaps from the project review are worth calling out explicitly, because they're easy to miss and cheap to fix.

Health probes: livenessProbe and readinessProbe tell Kubernetes whether a pod is healthy and ready to receive traffic. Without them, Kubernetes has no mechanism to automatically restart a stuck pod or remove it from rotation when it's not ready. The result: a broken pod silently receives live traffic and returns errors until someone notices manually. Adding probes is a small amount of YAML that meaningfully improves reliability. Don't skip them.
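Inside a container spec, that YAML can look like this. The endpoints, ports, and timings are placeholders to adapt to each service's actual health routes:

```yaml
# Probe and resource sketch for one container — liveness restarts a
# stuck pod; readiness gates traffic; requests give the HPA its baseline.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 256Mi
```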

Resource requests on HPA: As covered above, HPA uses requests as the baseline for utilization calculations. Limits without requests gives the autoscaler nothing to measure against. Set both, always.


Final Thoughts

The infrastructure setup is the visible part. The real depth is in the operational decisions, why managed services over StatefulSets, why External Secrets over native Kubernetes Secrets, why separate ingress manifests for API and frontend.

Those decisions are what separate someone who knows Kubernetes syntax from someone who can design a system that holds up under real conditions.

For anyone working through something similar: deploy locally first, understand service communication before touching the cluster, and don't treat health probes and resource requests as optional polish. They're not. The cluster runs without them until it doesn't.


*Ijeawele is a DevOps Engineer building production-grade infrastructure and writing about it in plain terms. More projects coming.*

Reach out to me for questions or any opportunities on my LinkedIn.
Check out my other projects on GitHub.
