Lanre Awe

How I Built a Production-Style GitOps Platform on AWS EKS — Solo, From Scratch

Most DevOps portfolio projects follow the same pattern: deploy a "hello world" app to Kubernetes, write a README, call it done.

This isn't that.

I took the Spring PetClinic microservices — a real Java application with 7 independent services, service discovery, an API gateway, and distributed tracing — and built the entire platform around it on AWS. Infrastructure as code, a proper GitOps delivery pipeline, autoscaling at two layers, end-to-end observability, and a reproducible lifecycle that provisions or destroys the whole environment with a single command.

The live app is running right now at petclinic.ralphnetwork.online.

This post is a walkthrough of what I built, how I made the decisions I made, and — most importantly — what broke and why. Because that last part is what actually teaches you something.


Why I built this

I'm an infrastructure engineer with 18 years of hands-on experience — servers, networking, firewalls, backup and DR — making the transition into DevOps and cloud engineering. I've been building cloud-native projects and documenting the journey publicly.

My goal with this project was specific: demonstrate that I can operate at the platform layer, not just the tool layer. Anyone can follow a tutorial and get kubectl apply to work. What I wanted to prove was that I could make engineering decisions, build a reliable delivery pipeline, handle real failures, and articulate the trade-offs — the way a working engineer actually operates.

So I treated it like a real system, not a demo.


The architecture

At a high level: a push to main triggers a GitHub Actions pipeline that builds and pushes Docker images to ECR, then commits a tag bump to the Helm chart in Git. Argo CD detects the change and syncs the cluster. The CI pipeline never runs kubectl directly; Git is the single source of truth for what is deployed.

flowchart TB
    Dev[Push to GitHub main] --> GHA[GitHub Actions CI]
    GHA -->|OIDC role - no static keys| ECR[Amazon ECR]
    GHA -->|bump image tag + commit| Git[Helm chart in Git]
    Git --> Argo[Argo CD]
    Argo -->|sync| Cluster
    ECR --> Cluster

    subgraph Cluster["EKS cluster (petclinic-prod) — eu-central-1"]
      direction TB
      ALB[ALB Ingress - ACM TLS] --> GW[api-gateway]
      GW --> APP[customers / vets / visits]
      APP --- Platform[discovery + config server]
      HPA[HPA] -. scales pods .-> APP
      Karpenter[Karpenter] -. scales nodes .-> Nodes[EC2 nodes]
      APP -. traces .-> Zipkin
      APP -. metrics .-> Prometheus --> Grafana
    end

Cluster: petclinic-prod · Region: eu-central-1 · Kubernetes: 1.33


The full stack

| Layer | Tooling |
| --- | --- |
| Cloud | AWS (EKS, ECR, VPC, IAM, ALB, ACM, SQS) |
| IaC | Terraform (remote state on S3 + DynamoDB, reusable modules) |
| Containers | Docker, Amazon ECR (one repo per service, scan-on-push) |
| Orchestration | Kubernetes (EKS, managed node group + Karpenter) |
| Packaging | Helm (one values-driven chart for all 7 services) |
| GitOps | Argo CD |
| CI/CD | GitHub Actions (OIDC auth, no static AWS keys) |
| Autoscaling | HPA (pods) + Karpenter (nodes) + metrics-server |
| Observability | Prometheus, Grafana, Zipkin (distributed tracing) |
| App | Spring Boot microservices, Spring Cloud Config + Eureka |

Layer by layer: what I built and why

Infrastructure as Code (Terraform)

Every AWS resource is defined in Terraform, split into reusable modules and wired together in a single prod environment. The first thing I provisioned — before anything else — was the remote state backend: an S3 bucket (versioned, encrypted, public access blocked) and a DynamoDB lock table. If you lose your state file, you lose control of your infrastructure. That comes first, always.

The modules:

  • vpc — 2 availability zones, public and private subnets, with the specific subnet tags the AWS Load Balancer Controller and Karpenter need to discover them.
  • eks — built on the official terraform-aws-modules/eks module, EKS 1.33, managed node group, IRSA and EKS Pod Identity enabled, control-plane logging on.
  • ecr — one repository per service with image scanning on push.
  • iam — IRSA role for the Load Balancer Controller.
  • github-actions — OIDC trust policy and an IAM role so GitHub Actions can assume it without a long-lived access key.
  • Karpenter — IAM role, SQS interruption queue, and node role, via the EKS module's built-in Karpenter submodule using Pod Identity.

Terraform provisions the AWS platform. Everything above that — cluster add-ons, Argo CD, the app — is installed by scripts/addons.sh in the correct dependency order.
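
As a rough illustration of what the github-actions module in the list above sets up, the trust policy on the CI role might look like the document built below. The account ID, repository name, and branch are placeholders rather than values from this project, and the function name is hypothetical; the claim names (`aud`, `sub`) are the ones GitHub's OIDC provider issues.

```python
import json

GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id: str, repo: str, branch: str = "main") -> dict:
    """Build an IAM trust policy that lets one repo/branch assume a role via OIDC."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Token must be minted for AWS STS ...
                        f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com",
                        # ... and must come from this exact repo and branch.
                        f"{GITHUB_OIDC_HOST}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder account and repository, for illustration only.
    print(json.dumps(github_oidc_trust_policy("123456789012", "example-org/example-repo"), indent=2))
```

Pinning the `sub` claim to a single branch is what keeps a fork or a feature branch from assuming the same role.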

GitOps delivery

This is the piece I'm proudest of, because it's the difference between "I can run kubectl" and "I built a delivery pipeline."

The workflow:

  1. Push to main triggers GitHub Actions.
  2. GitHub Actions authenticates to AWS via an OIDC role — no AWS_ACCESS_KEY_ID in secrets, not anywhere.
  3. All 7 services are built as Docker images and pushed to ECR, tagged with the git SHA.
  4. The pipeline then bumps the image tag in helm/petclinic/values.yaml and commits it back to the repo.
  5. Argo CD detects the change and syncs the Helm chart to the cluster.

The cluster never pulls credentials from CI. CI never holds cluster access. The audit trail for every deployment lives in git history. That's real GitOps, and it's meaningfully different from "run kubectl apply at the end of a pipeline."
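
Step 4 of the workflow above is the hinge of the whole pipeline, so here is a minimal sketch of what a tag bump could look like. It assumes the chart exposes a single `tag:` key in values.yaml; the real layout of helm/petclinic/values.yaml may differ, and `bump_image_tag` is a hypothetical helper, not the pipeline's actual code.

```python
import re

# Matches a line such as '  tag: "4.0.1"' and captures everything up to the value.
TAG_LINE = re.compile(r"^([ \t]*tag:[ \t]*).*$", re.MULTILINE)


def bump_image_tag(values_yaml: str, new_tag: str) -> str:
    """Return values.yaml text with the first `tag:` value replaced by new_tag."""
    updated, count = TAG_LINE.subn(
        lambda m: f'{m.group(1)}"{new_tag}"', values_yaml, count=1
    )
    if count == 0:
        # Fail loudly: committing an unchanged file would look like a deploy but not be one.
        raise ValueError("no 'tag:' key found in values file")
    return updated


if __name__ == "__main__":
    # Assumed (hypothetical) values layout and git SHA.
    sample = 'image:\n  registry: example.registry.invalid\n  tag: "4.0.1"\n'
    print(bump_image_tag(sample, "3f2a9c1"))
```

Raising when nothing matches is deliberate: a bump that changes nothing should stop the pipeline rather than produce an empty commit.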

Packaging with Helm

The app started with hand-written Kubernetes manifests with hardcoded image tags — one manifest per service, with the image version baked in. I converted everything into one values-driven Helm chart that renders all 7 services from a single config block.

That collapsed seven hardcoded image tags into one value that CI controls. It also eliminated hundreds of lines of duplicated YAML, made per-service configuration changes a one-line edit, and gave me a single versioned artifact I can promote, roll back, or diff. As a side effect, Argo CD's diff view became meaningful: you can see exactly what changed in each deployment.

Autoscaling at two layers

HPA (Horizontal Pod Autoscaler) is configured on the four stateless services — api-gateway, customers-service, vets-service, visits-service — with a minimum of 2 replicas, maximum of 4, scaling on CPU at 70%, fed by metrics-server.
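
To make those numbers concrete, the sketch below applies the scaling formula from the Kubernetes documentation, desired = ceil(current × currentMetric / targetMetric), with the 2/4 replica bounds and 70% CPU target described above. It is a simplification: the real controller also applies a tolerance band and stabilization windows, which are ignored here.

```python
import math


def desired_replicas(
    current: int,
    cpu_utilization_pct: float,
    target_pct: float = 70.0,
    min_replicas: int = 2,
    max_replicas: int = 4,
) -> int:
    """Simplified HPA calculation: scale proportionally, then clamp to the bounds."""
    raw = math.ceil(current * cpu_utilization_pct / target_pct)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    for cpu in (35, 105, 200):
        print(f"2 replicas at {cpu}% CPU -> {desired_replicas(2, cpu)} replicas")
```

With two replicas averaging 105% of their CPU request, the formula asks for three; at 200% it asks for six, and the maximum of 4 caps it.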

Karpenter handles node autoscaling. When the HPA needs to schedule more pods than the current nodes can fit, Karpenter provisions a right-sized EC2 instance and decommissions it when idle. I didn't just configure this — I tested it under real load. Pending pods from HPA scaling triggered a t3a.medium provisioning event, and Karpenter had a node ready in approximately 90 seconds.

The choice to use Karpenter over the older cluster-autoscaler was deliberate. It bin-packs more efficiently, picks instance types dynamically, and it's the modern EKS approach. More setup, but a better result.

Observability

Prometheus scrapes every service via the /actuator/prometheus endpoint. Grafana visualises the metrics. Zipkin collects distributed traces, so you can follow a single user request as it travels from the api-gateway through customers-service and back.

Getting all three working together — and getting traces working end to end specifically — was one of the most instructive parts of the build. More on that in the debugging section.

Networking and TLS

An ALB Ingress (provisioned by the AWS Load Balancer Controller from a Kubernetes Ingress object) fronts the gateway. TLS is terminated at the ALB using an ACM certificate, with a real DNS record at petclinic.ralphnetwork.online. The worker nodes run in private subnets, so the only public entry point is the load balancer.


Reproducible from one command

A platform you can't rebuild from scratch isn't really infrastructure as code — it's a managed pet. So I encoded the full lifecycle in a Makefile:

# Provision the state backend once per AWS account
make state

# Provision the full platform + install add-ons + deploy the app
make up

# Tear everything down cleanly
make down

make up runs two phases in order. First, terraform apply provisions the AWS layer. Then scripts/addons.sh installs the add-ons in dependency order: AWS Load Balancer Controller → metrics-server → Karpenter → Argo CD → the PetClinic application.

make down is the part that trips people up. Terraform provisions the base infrastructure, but the in-cluster controllers create resources at runtime that Terraform doesn't know about — specifically the ALB and any Karpenter-provisioned EC2 nodes. A naive terraform destroy hangs waiting for a VPC it can't delete because the ALB is still attached. The teardown script deletes the Kubernetes layer first, waits for the ALB and extra nodes to actually drain, and then runs terraform destroy.

This means I can provision the full environment for a live demo and destroy it to near-zero cost afterward without leaving orphaned load balancers or a surprise AWS bill.


The bugs — the part that actually matters

I want to be honest about this: most of the learning in this project came from what broke. Here are the ones that taught me the most.

Zipkin showed no traces, even though all services were up

Tracing export is non-fatal — a failure to connect to Zipkin doesn't crash the application, it just silently drops spans. So the services appeared healthy while producing zero traces.

The root causes were two independent misconfigurations that had to be fixed together:

  1. The tracing endpoint in the Spring Cloud Config pointed at tracing-server, which didn't match the Kubernetes Service name zipkin.
  2. The endpoint was only set under one Spring profile, so most services never exported at all.

The fix: corrected the hostname, and had the Helm chart inject the Zipkin endpoint via an environment variable into every service — so tracing is now uniform and controlled at the platform level, not buried in per-service config files.
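
A minimal sketch of that idea, under a few assumptions: the services use the Spring Boot 3 property `management.zipkin.tracing.endpoint`, Zipkin listens on its default port 9411 at `/api/v2/spans`, and the helper names are hypothetical. The property-to-variable mapping follows Spring Boot's documented rule (dots to underscores, dashes removed, uppercase).

```python
# Assumption: Spring Boot 3 property name for the Zipkin exporter endpoint.
ZIPKIN_PROPERTY = "management.zipkin.tracing.endpoint"


def spring_env_name(prop: str) -> str:
    """Convert a Spring property name to its environment-variable form."""
    return prop.replace(".", "_").replace("-", "").upper()


def tracing_env(
    services: list[str], zipkin_service: str = "zipkin", port: int = 9411
) -> dict[str, dict[str, str]]:
    """Give every service the same endpoint, built from the Kubernetes Service name."""
    endpoint = f"http://{zipkin_service}:{port}/api/v2/spans"
    return {svc: {spring_env_name(ZIPKIN_PROPERTY): endpoint} for svc in services}


if __name__ == "__main__":
    for svc, env in tracing_env(["api-gateway", "customers-service"]).items():
        print(svc, env)
```

Because the endpoint is derived once from the Service name, a hostname mismatch like tracing-server versus zipkin can only occur in one place.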

CI was building images that never actually deployed

The deploy step rewrote :latest tags, but the manifests had specific version pins (:4.0.1, :4.0.2). The substitution matched nothing — every "deployment" silently re-applied the old images. The cluster looked updated; it wasn't.

Migrating to Helm fixed this properly. Image tags became a single chart value that CI bumps to the git SHA, and Argo CD shows a visible diff when the value changes. There's no ambiguity about what's running.
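
The original failure mode is easy to reproduce in isolation. The sketch below is not the old deploy step itself; it uses a hypothetical image name and shows how a substitution aimed at `:latest` reports zero replacements against a pinned tag, plus a guard that would have turned the silent no-op into a failed build.

```python
import re


def retag_latest(manifest: str, new_tag: str) -> tuple[str, int]:
    """Rewrite ':latest' image tags and report how many were rewritten."""
    return re.subn(r":latest\b", lambda _m: f":{new_tag}", manifest)


def retag_or_fail(manifest: str, new_tag: str) -> str:
    """Same rewrite, but refuse to continue when nothing matched."""
    updated, count = retag_latest(manifest, new_tag)
    if count == 0:
        raise RuntimeError("no image tags rewritten; manifests would deploy unchanged")
    return updated


if __name__ == "__main__":
    pinned = "image: example.registry.invalid/customers-service:4.0.1\n"
    _, rewritten = retag_latest(pinned, "3f2a9c1")
    print(f"tags rewritten: {rewritten}")
```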

Argo CD and the HPA fought over replica counts

With self-healing enabled, Argo CD kept resetting replicas to the chart value while the HPA tried to scale on CPU. The two controllers were in a tug of war that neither could win cleanly.

The fix is a standard but non-obvious GitOps pattern: omit replicas from the Deployment spec when an HPA controls the workload, and configure Argo to explicitly ignore differences on the replicas field. That way Argo reconciles everything except replica counts, and the HPA owns that field exclusively.
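
A toy model makes the conflict easier to see. The code below is not how Argo CD computes diffs; it just compares two simplified, hypothetical manifests with and without the replicas field. In Argo CD itself the equivalent is the Application's ignoreDifferences setting with a JSON pointer to /spec/replicas.

```python
import copy


def without_replicas(manifest: dict) -> dict:
    """Return a copy of the manifest with spec.replicas removed."""
    trimmed = copy.deepcopy(manifest)
    trimmed.get("spec", {}).pop("replicas", None)
    return trimmed


def out_of_sync(desired: dict, live: dict, ignore_replicas: bool = False) -> bool:
    """Report drift between Git (desired) and the cluster (live)."""
    if ignore_replicas:
        desired, live = without_replicas(desired), without_replicas(live)
    return desired != live


if __name__ == "__main__":
    desired = {"spec": {"replicas": 2, "image": "example/api-gateway:3f2a9c1"}}
    live = {"spec": {"replicas": 3, "image": "example/api-gateway:3f2a9c1"}}  # HPA scaled up
    print("drift without ignore:", out_of_sync(desired, live))
    print("drift with ignore:   ", out_of_sync(desired, live, ignore_replicas=True))
```

With the field ignored, an HPA scale-up no longer registers as drift, while a changed image still does.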

Karpenter's IAM policy exceeded AWS's size limit

The error was: LimitExceeded: Cannot exceed quota for PolicySize: 6144.

Karpenter's required IAM policy is large, and 6,144 characters is the ceiling for a managed policy. The fix was to attach it as an inline policy instead, which raises the limit to 10,240 characters, using a flag built into the EKS Terraform module. It's a one-line change, but the reason isn't obvious unless you've hit it.
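
The arithmetic behind that choice can be checked before applying anything. This sketch approximates IAM's size accounting by serialising the policy without whitespace; the exact counting rules are AWS's, so treat the result as an estimate. The function names and the "split" outcome are illustrative, not part of any AWS or Terraform API.

```python
import json

MANAGED_POLICY_LIMIT = 6144       # characters, per managed policy
ROLE_INLINE_POLICY_LIMIT = 10240  # characters, inline policies on one role combined


def policy_size(policy: dict) -> int:
    """Approximate policy size: compact JSON with no whitespace between tokens."""
    return len(json.dumps(policy, separators=(",", ":")))


def where_it_fits(policy: dict) -> str:
    """Return 'managed', 'inline', or 'split' depending on which limit the policy fits."""
    size = policy_size(policy)
    if size <= MANAGED_POLICY_LIMIT:
        return "managed"
    if size <= ROLE_INLINE_POLICY_LIMIT:
        return "inline"
    return "split"  # too large for either; would need to be broken up
```

Running a check like this in CI would surface the limit as a readable message rather than a LimitExceeded error in the middle of an apply.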

A capacity planning decision that wasn't a bug

Enabling HPA would have scheduled more pods than the 2-node cluster could hold — they'd have sat in Pending indefinitely, which looks like a broken cluster. I had three options: add a third static node, use cluster-autoscaler, or add Karpenter.

I chose Karpenter: it scales on demand rather than requiring a fixed node count, bin-packs more efficiently, and it's the approach AWS recommends for EKS. The decision had a cost in setup time and complexity. The benefit is a cluster that genuinely scales rather than one that holds a fixed headroom.


Decisions and trade-offs

The interesting engineering questions in this project weren't "which tool" — they were "why this, versus that, given these constraints."

GitOps over direct kubectl apply in CI. More moving parts upfront. But: CI doesn't hold cluster credentials, every deployment is auditable in git, and Argo's self-healing means drift from the desired state gets corrected automatically. For any real team, this is non-negotiable.

Karpenter over cluster-autoscaler. Faster to respond, picks the right instance type for the pending workload, consolidates underutilised nodes. The trade-off is more setup. Worth it for the operational behaviour and the learning.

Kept Eureka + Spring Cloud Config, deliberately. On Kubernetes, native Service DNS and ConfigMaps overlap with what these frameworks provide, so they're somewhat redundant. I kept them because rewriting all 7 services to drop them was out of scope, and doing it poorly would be worse than the overlap. Going fully Kubernetes-native is explicitly on the backlog as a next step, not an oversight.

Single NAT gateway, single environment. A deliberate cost decision. Multi-AZ NAT gateways add ~$30/month per gateway, which adds up quickly for a demo project. I know exactly where the HA gap is and named it, rather than pretending it's production-grade multi-region when it isn't.


What this project demonstrates

| Skill area | Evidence |
| --- | --- |
| Infrastructure as Code | Modular Terraform, remote state, full AWS platform from scratch |
| CI/CD | GitHub Actions, OIDC auth, build → push → tag bump → deploy automated |
| GitOps | Argo CD syncing a Helm chart; git as sole source of truth |
| Kubernetes | EKS, Helm, HPA, PDBs, ALB Ingress, Pod Identity / IRSA |
| Cloud (AWS) | EKS, ECR, IAM, VPC, ALB, ACM, SQS, provisioned end to end |
| Autoscaling | HPA + Karpenter, verified under real load, capacity reasoning documented |
| Observability | Prometheus, Grafana, distributed tracing with Zipkin |
| Security | OIDC over static keys, least-privilege IAM, TLS on all public traffic |
| Debugging | Real failures diagnosed and fixed; root causes explained |
| Engineering judgment | Trade-offs documented and defended, not assumed away |

What I'd do next

I kept an honest backlog rather than claiming the project is "done":

  • NetworkPolicies — default-deny with explicit allow rules for each service-to-service path.
  • Secrets management — External Secrets Operator backed by AWS SSM Parameter Store or Secrets Manager, to replace env-var secrets.
  • kube-prometheus-stack — replace the hand-assembled Prometheus + Grafana setup with the community Helm chart.
  • Go Kubernetes-native — remove Eureka and Spring Cloud Config in favour of Kubernetes Service DNS and native ConfigMaps.

Closing

This project started as "deploy an app to Kubernetes" and became a study in what it actually means to build a platform. The delivery pipeline, the autoscaling, the tracing, the teardown ordering, the GitOps patterns — none of that comes from a tutorial. It comes from making deliberate choices, hitting real problems, and working through them.

That's the work I want to do professionally, and this project is my evidence that I can.


Repos:

Live app: petclinic.ralphnetwork.online

If you're hiring for DevOps or Platform Engineering roles — remote or Lagos on-site — I'd genuinely love to talk. Find me on LinkedIn.
