Vivian Chiamaka Okose

Posted on Jun 6

From Code to Cloud: Deploying 11 Microservices to AWS EKS with Terraform and ArgoCD

#aws #kubernetes #terraform #devops

The CI pipeline was passing. Images were in ECR. Everything looked ready.

Then I ran kubectl get pods and saw this:

emailservice        0/1    CrashLoopBackOff   12 (3m ago)    24m
recommendationservice  0/1    ImagePullBackOff    0              5m

That was just the beginning. Before the app fully loaded on a live AWS URL, I hit authentication failures, image pull errors, a Kubernetes version mismatch in Terraform, and a pod scheduling limit I didn't know existed.

This is the full story of deploying to AWS EKS — what broke, why, and how each problem got solved.

The Stack

Before getting into what broke, here is what the deployment stack looks like:

Terraform — provisions the VPC, EKS cluster, node groups, and IAM roles on AWS
ArgoCD — watches the Git repo and syncs the Helm chart to the cluster automatically
Helm — packages all 11 Kubernetes manifests into one deployable chart
AWS EKS — managed Kubernetes on AWS
Amazon ECR — private container registry where CI pushes the Docker images

The flow is: Terraform creates the infrastructure, kubectl connects to the cluster, ArgoCD is installed, and from that point on every Git push triggers an automatic deployment.

Step 1 — Provisioning the Infrastructure with Terraform

The Terraform config creates everything the cluster needs: a VPC with public and private subnets across two availability zones, a NAT gateway for private subnet egress, an EKS cluster, and a managed node group with EC2 worker nodes.

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = var.cluster_name
  cluster_version = var.cluster_version

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  cluster_endpoint_public_access = true

  eks_managed_node_groups = {
    main = {
      instance_types = [var.node_instance_type]
      min_size       = var.node_min_count
      max_size       = var.node_max_count
      desired_size   = var.node_desired_count
      ami_type       = "AL2_x86_64"
    }
  }
}

Running terraform apply takes about 15 minutes. When it completes, it outputs the cluster endpoint, certificate data, and ECR registry URL.

terraform init
terraform plan
terraform apply

After apply, connect kubectl to the new cluster:

aws eks update-kubeconfig \
  --name online-boutique-cluster \
  --region us-east-1

Problem 1 — kubectl Authentication Failure

The first kubectl command I ran against the new cluster failed with this:

error: You must be logged in to the server
(the server has asked for the client to provide credentials)

Running aws eks update-kubeconfig had added the cluster to the local kubeconfig, but kubectl get nodes kept failing with credential errors.

The cause was that the cluster was created by one IAM identity but the local AWS CLI was authenticated as a different IAM user. EKS only grants automatic access to the identity that created the cluster. Everyone else has to be explicitly granted access.

The fix was creating an access entry for the current user:

aws eks create-access-entry \
  --cluster-name online-boutique-cluster \
  --region us-east-1 \
  --principal-arn arn:aws:iam::164885464623:user/crystal \
  --type STANDARD

aws eks associate-access-policy \
  --cluster-name online-boutique-cluster \
  --region us-east-1 \
  --principal-arn arn:aws:iam::164885464623:user/crystal \
  --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSClusterAdminPolicy \
  --access-scope type=cluster

After that, kubectl get nodes returned the two worker nodes as expected.
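Before creating an access entry, it can help to confirm which IAM identity the CLI is actually using and whether the cluster already lists it. The following is a minimal sketch rather than the exact steps I ran: has_access_entry is a hypothetical helper that checks a whitespace-separated list of principal ARNs on stdin, and the commented commands show one way that list might be fetched with the AWS CLI.

```shell
# has_access_entry is a hypothetical helper (not part of the AWS CLI).
# It reads principal ARNs from stdin (separated by spaces, tabs, or
# newlines) and succeeds only if the ARN passed as $1 is an exact match.
has_access_entry() {
  tr -s '[:space:]' '\n' | grep -Fxq -- "$1"
}

# Possible usage against a real cluster (requires AWS credentials):
#   me=$(aws sts get-caller-identity --query Arn --output text)
#   aws eks list-access-entries \
#     --cluster-name online-boutique-cluster --region us-east-1 \
#     --query 'accessEntries' --output text \
#     | has_access_entry "$me" || echo "No access entry for $me"
```

One caveat with this approach: for an assumed role, get-caller-identity returns an STS assumed-role ARN rather than the IAM role ARN that access entries use, so the exact-match check is only reliable as written for IAM users.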

Problem 2 — Kubernetes Version Mismatch in Terraform

When I later ran terraform apply to scale the node group, Terraform tried to downgrade the cluster from Kubernetes 1.30 to 1.29.

Error: updating EKS Cluster (online-boutique-cluster) version:
Unsupported Kubernetes minor version update from 1.30 to 1.29

The cluster_version variable in variables.tf said 1.29 but the actual cluster was already running 1.30 — AWS had auto-upgraded it. Terraform noticed the mismatch and tried to make reality match the config, which meant a downgrade. AWS doesn't allow Kubernetes downgrades.

The fix was updating the variable to match the actual running version:

variable "cluster_version" {
  description = "Kubernetes version for EKS"
  type        = string
  default     = "1.30"
}

This is a good reminder to always keep Terraform variables in sync with the actual state of your infrastructure, especially after AWS performs automatic maintenance upgrades.
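A small pre-flight check can catch this kind of drift before terraform apply does. The sketch below is an illustration under assumptions, not part of the project repo: minor_of and is_downgrade are hypothetical helpers that compare the minor version numerically (so 1.9 sorts below 1.10), and they assume both versions share the same major version.

```shell
# minor_of prints the minor component of a version such as "1.30".
minor_of() {
  echo "${1#*.}" | cut -d. -f1
}

# is_downgrade RUNNING DESIRED
# Succeeds when DESIRED has a lower minor version than RUNNING.
# Hypothetical helper; assumes the major versions are identical.
is_downgrade() {
  [ "$(minor_of "$2")" -lt "$(minor_of "$1")" ]
}

# Possible usage before terraform apply (requires AWS credentials):
#   running=$(aws eks describe-cluster --name online-boutique-cluster \
#     --region us-east-1 --query 'cluster.version' --output text)
#   if is_downgrade "$running" "1.29"; then
#     echo "Config pins an older version than the cluster runs ($running)"
#   fi
```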

Step 2 — Installing ArgoCD on EKS

With the cluster running and kubectl authenticated, the next step was installing ArgoCD:

kubectl create namespace argocd
kubectl apply -n argocd -f \
  https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

One error came up during the install:

The CustomResourceDefinition "applicationsets.argoproj.io" is invalid:
metadata.annotations: Too long: must have at most 262144 bytes

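This is a known issue with large CRDs and client-side kubectl apply, which stores a full copy of the object in the kubectl.kubernetes.io/last-applied-configuration annotation to track changes. The ApplicationSet CRD is too big to fit within the 262144-byte annotation limit. The fix is to use kubectl create, which does not write that annotation, for that specific CRD: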

kubectl create -f \
  https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/crds/applicationset-crd.yaml

After that all ArgoCD pods came up healthy:

kubectl get pods -n argocd
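An alternative to kubectl create is server-side apply, which tracks field ownership in managedFields instead of the last-applied annotation and so avoids the size limit. The sketch below is an assumption-laden illustration, not what I ran: exceeds_annotation_limit is a hypothetical helper, and file size is only a rough proxy because the annotation holds a JSON copy of the object rather than the YAML file itself.

```shell
# Annotation size limit quoted in the error message, in bytes.
ANNOTATION_LIMIT=262144

# exceeds_annotation_limit FILE
# Hypothetical helper: succeeds when FILE is larger than the limit.
exceeds_annotation_limit() {
  size=$(( $(wc -c < "$1") ))
  [ "$size" -gt "$ANNOTATION_LIMIT" ]
}

# Possible usage (requires cluster access):
#   curl -sSLo applicationset-crd.yaml \
#     https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/crds/applicationset-crd.yaml
#   if exceeds_annotation_limit applicationset-crd.yaml; then
#     kubectl apply --server-side -f applicationset-crd.yaml
#   fi
```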

Step 3 — Deploying the App via ArgoCD

With ArgoCD running, the next step was creating the Application manifest that tells ArgoCD where to find the Helm chart and where to deploy it:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: online-boutique
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/pawsible-cloud/online-boutique-platform.git
    targetRevision: main
    path: helm-chart
    helm:
      releaseName: online-boutique
      valueFiles:
        - values.yaml
  destination:
    server: https://kubernetes.default.svc
    namespace: online-boutique
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true

Applying this triggered the first sync:

kubectl apply -f argocd/application.yaml

ArgoCD pulled the Helm chart from Git, rendered the manifests, and started creating pods in the online-boutique namespace.

Problem 3 — ImagePullBackOff on Two Services

The first kubectl get pods -n online-boutique showed most services running but two stuck:

emailservice          0/1   ImagePullBackOff   0   5m
recommendationservice 0/1   ImagePullBackOff   0   5m

Running kubectl describe pod on one of them showed:

Failed to pull image "emailservice": pull access denied,
repository does not exist or may require 'docker login'

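The image reference in the deployment was just emailservice, with no registry and no tag. Without a registry host, the container runtime defaults to Docker Hub (and to the latest tag), where no such public image exists, so the pull was denied and the pod went into ImagePullBackOff.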

The Helm values file had the image field set incorrectly. The correct reference needed the full ECR URI:

164885464623.dkr.ecr.us-east-1.amazonaws.com/online-boutique/emailservice:v0.10.5

After fixing the values file and pushing to Git, ArgoCD detected the change and redeployed with the correct image references.
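To make the shape of a full ECR reference concrete, here is a minimal sketch. Both functions are hypothetical helpers written for this post, not part of any tool: ecr_image assembles a reference from its parts, and is_qualified is a simplified check that a reference has a registry host and an explicit tag (it ignores digests and registries addressed as localhost).

```shell
# ecr_image ACCOUNT_ID REGION REPOSITORY TAG
# Hypothetical helper that prints a full ECR image reference.
ecr_image() {
  printf '%s.dkr.ecr.%s.amazonaws.com/%s:%s\n' "$1" "$2" "$3" "$4"
}

# is_qualified REF
# Succeeds when REF has a registry host (a first path segment that
# contains a dot) and a tag after the last slash. Simplified on purpose.
is_qualified() {
  case "$1" in
    */*) ;;               # needs at least registry/repository
    *) return 1 ;;
  esac
  host=${1%%/*}
  last=${1##*/}
  case "$host" in *.*) ;; *) return 1 ;; esac
  case "$last" in *:*) ;; *) return 1 ;; esac
  return 0
}

ecr_image 164885464623 us-east-1 online-boutique/emailservice v0.10.5
```

A check like is_qualified could run in CI against the rendered Helm values so that a bare name such as emailservice is caught before ArgoCD ever syncs it.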

Problem 4 — The 500 Error on the Live Site

The app got a load balancer URL. Opening it in the browser showed this:

HTTP Status: 500 Internal Server Error
rpc error: code = Unavailable
dial tcp 172.20.197.197:7070: connect: connection refused
could not retrieve cart

The frontend was up but cartservice was unreachable. Checking the pods showed cartservice was in Pending state — it had never scheduled onto a node.

That led to Problem 5.

Problem 5 — Pod Scheduling Failures (The Unexpected One)

Three pods were stuck in Pending. Running kubectl describe pod showed:

0/2 nodes are available: 2 Too many pods.

Not out of memory. Not out of CPU. Too many pods.

Here is what I didn't know: with the AWS VPC CNI, every pod gets its own VPC IP address, so the maximum number of pods a node can run depends on how many network interfaces and IP addresses the instance type supports, not just on available CPU and memory. Each node in this cluster reported a limit of 11 pods.

With 11 app services plus ArgoCD plus Prometheus all competing for slots across 2 nodes (22 slots total), and system pods also taking space, there weren't enough slots for every pod to schedule.

kubectl get nodes \
  -o custom-columns=NAME:.metadata.name,MAX_PODS:.status.allocatable.pods
# MAX_PODS was 11 on both nodes
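The per-node limit comes from a simple formula that AWS documents for the default VPC CNI without prefix delegation: ENIs × (IPv4 addresses per ENI − 1) + 2. The sketch below works it through; max_pods is a hypothetical helper, and the ENI figures are assumptions for illustration. To my knowledge AWS lists t3.small at 3 interfaces with 4 addresses each and t3.medium at 3 interfaces with 6 each, but verify the numbers for the instance type you actually run.

```shell
# max_pods ENIS IPS_PER_ENI
# Default VPC CNI formula (no prefix delegation):
#   ENIs * (IPv4 addresses per ENI - 1) + 2
max_pods() {
  echo $(( $1 * ($2 - 1) + 2 ))
}

# Assumed ENI figures for illustration only.
max_pods 3 4   # 3 ENIs, 4 IPs each -> 11
max_pods 3 6   # 3 ENIs, 6 IPs each -> 17
```

A limit of 11 per node is what the formula gives for 3 interfaces with 4 addresses each, so it is worth comparing the value your nodes report against the instance type you think you are running.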

The first attempt was adding a third node by updating Terraform:

variable "node_desired_count" {
  description = "Desired number of worker nodes"
  type        = number
  default     = 3
}

But the apply failed: the AWS account was on the free tier, which blocked t3.medium On-Demand instance launches.

The practical fix was scaling down the load generator — the least critical service — to free up exactly enough slots:

kubectl scale deployment loadgenerator \
  -n online-boutique --replicas=0

That freed one slot. The remaining pending pods scheduled immediately. Every service came up. The app loaded.
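To see how close each node is to its pod limit before deciding what to scale down, a per-node pod count is handy. This is a minimal sketch, not something from the project repo: pods_per_node is a hypothetical helper that counts node names read from stdin, skipping blank lines and the <none> placeholder that unscheduled pods show.

```shell
# pods_per_node
# Hypothetical helper: reads one node name per line on stdin and prints
# "<node> <count>", sorted by node name. Unscheduled pods are skipped.
pods_per_node() {
  grep -v -e '^$' -e '^<none>$' | sort | uniq -c | awk '{ print $2, $1 }'
}

# Possible usage (requires cluster access):
#   kubectl get pods --all-namespaces --no-headers \
#     -o custom-columns=NODE:.spec.nodeName | pods_per_node
```

One thing to keep in mind: with selfHeal enabled, ArgoCD may revert a manual kubectl scale if the chart pins the replica count, so a lasting change would go through the Helm values in Git.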

The App Live on AWS

After all of that — the authentication fix, the version mismatch, the image pull errors, the scheduling limit — the app finally loaded on the AWS load balancer URL.

Every service running. Every pod healthy. ArgoCD showing Synced and Healthy.

What This Project Taught Me About EKS

A few things that aren't obvious until you actually run into them:

EKS access control is IAM-based. The identity that creates the cluster gets access automatically. Everyone else needs an explicit access entry. If you're working in a team or switching between IAM users, set up access entries from the start.

Pod limits on EKS are network-based, not just resource-based. The nodes here maxed out at 11 pods each regardless of how much CPU or memory was free. If you're running a lot of services, plan your node count accordingly or use a larger instance type.

Terraform and AWS can drift. AWS performs automatic upgrades and maintenance. If your Terraform config doesn't reflect the actual state, the next apply will try to reconcile the difference — sometimes in a direction you don't want.

GitOps makes debugging easier. Because every deployment comes from Git, you always know exactly what version of the config is running. When something breaks you know where to look.

What Comes Next

With the app running on EKS and ArgoCD managing deployments, the next step is setting up monitoring. Prometheus and Grafana give visibility into what the cluster is actually doing: pod health, resource usage, and alerts when things go wrong.

The next post covers the full monitoring setup, what the ServiceMonitor limitation taught me about how the Online Boutique is instrumented, and how to build useful Grafana dashboards with the metrics you actually have.

This is part of an ongoing series documenting a full DevOps project built on Google's Online Boutique microservices demo, deployed to AWS EKS with Terraform, GitHub Actions, ArgoCD, Helm, Prometheus, and Grafana.

Repo: github.com/pawsible-cloud/online-boutique-platform