
augusthottie

I Set Up GitOps on EKS with ArgoCD, Here's What Kubernetes Actually Looks Like in Production

I had a Kubernetes problem. I could talk about it, reference it on my resume, and deploy to existing clusters. But I'd never provisioned one from scratch, written a Helm chart, or set up GitOps. That gap showed, and I knew interviewers could smell it.

So I built the whole thing: EKS cluster with Terraform, custom Helm chart, ArgoCD for GitOps, and a real application that talks to PostgreSQL and Redis. Push to main, ArgoCD syncs, pods roll out. No kubectl apply in sight. This is what I learned.

What I Built

A GitOps pipeline where Git is the only way deployments happen:

```
git push → GitHub → ArgoCD (polls every 3m) → Helm sync → EKS cluster
```

The application is a Notes API running on Bun/Express with PostgreSQL for persistence and Redis for caching, both running as pods inside the cluster. Two API replicas sit behind an ALB provisioned automatically by the AWS Load Balancer Controller from a Kubernetes Ingress resource.

Every response includes a pod field showing which replica served the request. Hit the endpoint twice and you'll see different pod names: load balancing in action.

Provisioning EKS with Terraform

I wrote two Terraform modules: one for the VPC and one for the EKS cluster.

The VPC module creates public and private subnets across two AZs, with a NAT Gateway for outbound traffic from private subnets. The critical detail for EKS: every subnet needs specific tags so Kubernetes knows which subnets to use for load balancers.

Public subnets get kubernetes.io/role/elb = 1 (for internet-facing ALBs). Private subnets get kubernetes.io/role/internal-elb = 1. Miss these tags and the AWS Load Balancer Controller silently fails to create ALBs.
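In Terraform, the tagging looks roughly like this. This sketch assumes the community terraform-aws-modules/vpc module; the exact variable names depend on how your VPC module exposes subnet tags.

```hcl
# Subnet tags the AWS Load Balancer Controller uses for subnet discovery.
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  # ... cidr, azs, public/private subnet CIDRs ...

  public_subnet_tags = {
    "kubernetes.io/role/elb" = "1" # internet-facing ALBs
  }

  private_subnet_tags = {
    "kubernetes.io/role/internal-elb" = "1" # internal load balancers
  }
}
```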

The EKS module creates the cluster, a managed node group (2x t3.medium), an OIDC provider for IRSA (IAM Roles for Service Accounts), and IAM roles for both the nodes and the LB controller.

27 resources. terraform apply takes about 15-20 minutes because EKS clusters are slow to provision.

Writing the Helm Chart

This was the part I had the least experience with. The chart has 8 templates:

  • Namespace: isolates everything in a three-tier namespace
  • ConfigMap: database host, Redis host, app version
  • Secret: database password
  • Deployment: API with 2 replicas, health checks, resource limits
  • Service: ClusterIP exposing port 80 internally
  • Ingress: annotated for the AWS Load Balancer Controller to create an ALB
  • PostgreSQL StatefulSet: with a PersistentVolumeClaim for data
  • Redis Deployment: ephemeral cache with LRU eviction
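The Ingress is where the ALB wiring happens. A trimmed sketch of what it looks like (resource names are illustrative, and my actual annotations may include a few more):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: notes-api          # illustrative name
  namespace: three-tier
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: notes-api
                port:
                  number: 80
```

The two annotations are what the AWS Load Balancer Controller reads to decide it should create an internet-facing ALB targeting pod IPs directly.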

Everything configurable lives in values.yaml. Want 3 replicas? Change one number. Different image tag? One line. That's the power of Helm!
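For a sense of the shape, here's a sketch of the values.yaml (my actual keys differ slightly, but this is the idea):

```yaml
app:
  version: "1.0.0"

api:
  replicaCount: 2
  image:
    repository: augusthottie/notes-api   # illustrative
    tag: "1.0.0"
  resources:
    limits:
      cpu: 250m
      memory: 256Mi

postgres:
  storageSize: 1Gi
```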

ArgoCD: The GitOps Engine

ArgoCD watches a Git repo and makes the cluster match what's in Git. The application definition is simple:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: three-tier-app
  namespace: argocd
spec:
  source:
    repoURL: https://github.com/augusthottie/gitops-eks
    targetRevision: main
    path: helm/three-tier-app

  destination:
    server: https://kubernetes.default.svc
    namespace: three-tier

  syncPolicy:
    automated:
      prune: true      # Delete resources removed from Git
      selfHeal: true   # Revert manual kubectl changes
```

prune: true means if you delete a template from the Helm chart and push, ArgoCD deletes the corresponding resource from the cluster. selfHeal: true means if someone runs kubectl edit to change something manually, ArgoCD reverts it back to what Git says. Git is the source of truth, always!

The ArgoCD UI is beautiful. You get a tree view of every resource, their health status, sync status, and the Git commit that triggered each sync.


The Problems I Hit (All of Them)

EBS CSI Driver

My PostgreSQL StatefulSet stayed in Pending forever. The PersistentVolumeClaim couldn't bind because EKS doesn't include the EBS CSI driver by default; it's an addon you install separately, with its own IAM role.

The fix: install the aws-ebs-csi-driver addon via aws eks create-addon. But even that failed initially because eksctl create iamserviceaccount created a service account that conflicted with the addon. I had to delete and recreate with --resolve-conflicts OVERWRITE.
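The sequence that eventually worked looked roughly like this (cluster name, role name, and account ID are placeholders for your own):

```shell
# Create the IAM role for the CSI driver's service account via IRSA
eksctl create iamserviceaccount \
  --cluster my-cluster \
  --namespace kube-system \
  --name ebs-csi-controller-sa \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --role-only --role-name EBSCSIDriverRole --approve

# Install the addon, overwriting the conflicting service account
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name aws-ebs-csi-driver \
  --service-account-role-arn arn:aws:iam::<account-id>:role/EBSCSIDriverRole \
  --resolve-conflicts OVERWRITE
```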

IRSA Not Working for LB Controller

The AWS Load Balancer Controller kept failing with "AccessDenied" errors, but the IAM policy was correct. The problem: it was using the node role instead of the IRSA role. The service account had the right annotation, but the pods were created before the annotation was applied.

Fix: kubectl rollout restart deployment/aws-load-balancer-controller. The new pods picked up the service account annotation and used the correct IRSA role.

ArgoCD CRD Too Large

Installing ArgoCD with kubectl apply failed because the applicationsets CRD exceeded the annotation size limit for client-side apply. The fix: --server-side=true --force-conflicts. This is a known ArgoCD issue that everyone hits.
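For the record, the install command that works, assuming the standard install manifest:

```shell
kubectl create namespace argocd
kubectl apply -n argocd --server-side=true --force-conflicts \
  -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
```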

ConfigMap Changes Don't Restart Pods

I changed the app version in values.yaml, pushed to GitHub, and ArgoCD synced the ConfigMap, but the pods kept serving the old version. Kubernetes doesn't restart pods when a ConfigMap changes; you need a checksum annotation on the pod template that changes whenever the ConfigMap does, forcing a rollout.
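This is the standard Helm pattern (straight from the Helm docs tips and tricks), assuming the ConfigMap template lives at templates/configmap.yaml:

```yaml
# deployment.yaml — hash the rendered ConfigMap into a pod annotation,
# so any ConfigMap change alters the pod template and triggers a rollout
spec:
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
```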

exec format error (Again)

Same issue from Project 2 — built the Docker image on my Mac (ARM), Kubernetes nodes are x86_64. --platform linux/amd64 on every build. I'll never forget this one.
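The fix is one flag on every build (image name and registry are illustrative):

```shell
docker build --platform linux/amd64 -t notes-api:1.1.0 .
docker push <registry>/notes-api:1.1.0
```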

Testing the GitOps Flow

The demo that proves everything works:

  1. Change app.version in values.yaml from 1.0.0 to 1.1.0
  2. git push origin main
  3. Wait for ArgoCD to sync (up to 3 minutes)
  4. curl /info returns "version": "1.1.0"

No CI/CD pipeline. No deployment commands. No SSH. Git push is the deployment.
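Scripted, the whole check is a few lines (the ALB hostname is a placeholder, and the grep pattern assumes the /info response shape above):

```shell
# Bump the version in values.yaml, push, and wait for ArgoCD to roll it out
git commit -am "bump app version to 1.1.0"
git push origin main

ALB=<alb-dns-name>
until curl -s "http://$ALB/info" | grep -q '"version": "1.1.0"'; do
  echo "waiting for sync..."
  sleep 15
done
echo "deployed"
```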

Why This Matters

Before this project, my Kubernetes answer in interviews was... vague. I'd deployed to existing clusters, used Helm to install other people's charts, and read a lot of docs. But I hadn't actually provisioned a cluster, written my own Helm chart, or set up GitOps from scratch.

Now I can talk about EKS provisioning, OIDC providers, IRSA, and why your node group needs specific IAM policies. I can explain why StatefulSets exist (PostgreSQL needs stable storage) and why Deployments don't care (Redis can lose its data). I know that the EBS CSI driver isn't installed by default and that it will silently break your PVCs if you forget it.

Every problem I hit (the exec format error, the IRSA annotation not being picked up, ArgoCD CRDs being too large for kubectl) is a real production issue, not a textbook scenario. The kind of stuff that comes up in interviews when someone asks "tell me about a time you debugged something in Kubernetes."

Cost Warning

EKS is expensive for learning: ~$181/month (control plane $73, nodes $60, NAT $32, ALB $16). Always terraform destroy when you're not working. You can bring it back in 20 minutes.

Links


Building my DevOps portfolio ahead of the AWS DevOps Professional certification. Connect on LinkedIn, I'd love to hear what you're building.
