This post was originally published on graycloudarch.com.
I was preparing to upgrade a production EKS cluster to version 1.32 when I discovered a problem.
Four of our core cluster components (VPC CNI, CoreDNS, kube-proxy, and Metrics Server) were all running versions incompatible with EKS 1.32. I needed to update them before upgrading.
And I had no easy way to do it.
VPC CNI, CoreDNS, and kube-proxy had been installed automatically when the cluster was created, running in "self-managed" mode. Metrics Server was installed with kubectl apply -f metrics-server.yaml from some GitHub release page, months ago, by someone who is no longer on the team.
No version pinning. No history of what changed or when. No way to test the upgrade before applying it to production.
That's when I decided to stop managing EKS add-ons by hand.
The Problem with Self-Managed Add-ons
There are two categories of EKS add-ons, and most teams don't think about the distinction until they're stuck.
Self-managed: You're responsible for installation, updates, and compatibility. AWS won't help you troubleshoot them. When EKS releases a new version, you need to manually verify your add-ons still work, find compatible versions, and update them yourself.
EKS-managed: AWS handles the lifecycle. Compatible versions are tested and published for each EKS release. AWS Support can troubleshoot them. Security patches are available without you tracking CVEs.
If you created an EKS cluster without explicitly enabling managed add-ons, VPC CNI, CoreDNS, and kube-proxy are running in self-managed mode right now.
The fix is straightforward: migrate them to EKS-managed. But if you're also running kubectl-installed tools like Metrics Server, you have a second problem: those aren't managed by anything at all.
The Solution: One Terraform Module for All Six Add-ons
I built a single eks-addons Terraform module that manages everything:
EKS-managed (4):
- VPC CNI: pod networking
- EBS CSI Driver: persistent volumes (added this one while I was at it)
- CoreDNS: DNS resolution
- kube-proxy: network proxy

Helm-managed (2):
- Metrics Server: resource metrics for kubectl top and HPA
- Reloader: auto-restarts pods when ConfigMaps or Secrets change
Why one module instead of six separate ones? All of these share the same dependency: the EKS cluster. Consolidating them means one terragrunt apply deploys everything, one terraform plan shows drift across all add-ons, and one PR updates any version.
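As a sketch, the stack's Terragrunt configuration could pin all six versions in one place. The variable names and version strings below are illustrative assumptions, not the module's actual interface:

```hcl
# terragrunt.hcl for the eks-addons stack (illustrative names and versions)
terraform {
  source = "../../../common/modules/eks-addons"
}

inputs = {
  cluster_name = "my-cluster"

  # EKS-managed add-ons: pinned versions, one PR to bump any of them
  vpc_cni_version    = "v1.19.0-eksbuild.1"
  ebs_csi_version    = "v1.37.0-eksbuild.1"
  coredns_version    = "v1.11.3-eksbuild.2"
  kube_proxy_version = "v1.31.2-eksbuild.3"

  # Helm-managed add-ons: chart versions
  metrics_server_chart_version = "3.12.2"
  reloader_chart_version       = "1.1.0"
}
```

The point is that a cluster upgrade becomes a diff on this one file rather than a hunt across kubectl history.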
The core Terraform for an EKS-managed add-on is minimal:
resource "aws_eks_addon" "vpc_cni" {
  count = var.enable_vpc_cni ? 1 : 0

  cluster_name  = var.cluster_name
  addon_name    = "vpc-cni"
  addon_version = var.vpc_cni_version

  resolve_conflicts_on_create = "OVERWRITE"
  resolve_conflicts_on_update = "OVERWRITE"
  preserve                    = true
}
Two things worth explaining:
resolve_conflicts_on_create/on_update = "OVERWRITE" tells Terraform it's the source of truth. Any manual changes in the cluster get overwritten on the next apply. This is what you want.
preserve = true means that if you remove the resource from Terraform, the add-on stays in the cluster. It's a safety net during refactoring: you won't accidentally delete a running add-on.
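If you'd rather not look up compatible versions by hand, the AWS provider's aws_eks_addon_version data source can resolve them per Kubernetes version. A sketch (wiring it in as a fallback is my own suggestion, not part of the module above):

```hcl
# Read the cluster to get its current Kubernetes version
data "aws_eks_cluster" "this" {
  name = var.cluster_name
}

# Latest VPC CNI version published as compatible with that Kubernetes version
data "aws_eks_addon_version" "vpc_cni" {
  addon_name         = "vpc-cni"
  kubernetes_version = data.aws_eks_cluster.this.version
  most_recent        = true
}

# Possible fallback when no explicit version is pinned:
# addon_version = coalesce(var.vpc_cni_version, data.aws_eks_addon_version.vpc_cni.version)
```

Note the trade-off: pinned versions give reproducible plans, while the data source trades that for automatic compatibility on each apply.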
EBS CSI Driver Needs an IAM Role
The EBS CSI Driver is the one add-on that requires extra work: it needs IAM permissions to create and attach EBS volumes. The right way to handle this is IRSA (IAM Roles for Service Accounts).
resource "aws_iam_role" "ebs_csi" {
  count = var.enable_ebs_csi ? 1 : 0
  name  = "${var.cluster_name}-ebs-csi-driver"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = var.oidc_provider_arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "${var.oidc_provider}:sub" = "system:serviceaccount:kube-system:ebs-csi-controller-sa"
          "${var.oidc_provider}:aud" = "sts.amazonaws.com"
        }
      }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "ebs_csi" {
  count      = var.enable_ebs_csi ? 1 : 0
  role       = aws_iam_role.ebs_csi[0].name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy"
}
No credentials in pods, automatic rotation, and a clean audit trail in CloudTrail. IRSA is the correct pattern for any workload that needs to call AWS APIs from inside Kubernetes.
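The role only takes effect once it's attached to the add-on, which on the EKS-managed side is the service_account_role_arn argument. A sketch mirroring the vpc-cni resource earlier (var.ebs_csi_version is an assumed variable name):

```hcl
resource "aws_eks_addon" "ebs_csi" {
  count = var.enable_ebs_csi ? 1 : 0

  cluster_name  = var.cluster_name
  addon_name    = "aws-ebs-csi-driver"
  addon_version = var.ebs_csi_version

  # Injects the IRSA role into the driver's ebs-csi-controller-sa service account
  service_account_role_arn = aws_iam_role.ebs_csi[0].arn

  resolve_conflicts_on_create = "OVERWRITE"
  resolve_conflicts_on_update = "OVERWRITE"
  preserve                    = true
}
```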
Migrating Metrics Server from kubectl to Helm
This is the one step that requires manual cleanup before Terraform can take over.
The existing kubectl-installed Metrics Server needs to go first:
kubectl delete deployment metrics-server -n kube-system
kubectl delete service metrics-server -n kube-system
kubectl delete apiservice v1beta1.metrics.k8s.io
Then Terraform installs the Helm-managed version:
resource "helm_release" "metrics_server" {
  count = var.enable_metrics_server ? 1 : 0

  name       = "metrics-server"
  repository = "https://kubernetes-sigs.github.io/metrics-server/"
  chart      = "metrics-server"
  version    = var.metrics_server_chart_version
  namespace  = "kube-system"

  values = [yamlencode({
    replicas = 2
    args = [
      "--kubelet-preferred-address-types=InternalIP",
      "--kubelet-insecure-tls"
    ]
    podDisruptionBudget = {
      enabled      = true
      minAvailable = 1
    }
  })]
}
Expected downtime: 2-3 minutes. Only kubectl top is unavailable during the transition; running applications are not affected.
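Reloader, the other Helm-managed add-on, follows the same pattern with less ceremony. A minimal sketch using Stakater's chart repository (var.enable_reloader and var.reloader_chart_version are assumed variable names):

```hcl
resource "helm_release" "reloader" {
  count = var.enable_reloader ? 1 : 0

  name       = "reloader"
  repository = "https://stakater.github.io/stakater-charts"
  chart      = "reloader"
  version    = var.reloader_chart_version
  namespace  = "kube-system"
}
```

Since there's no pre-existing kubectl install to clean up, this one is a plain apply.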
Deploying It
One thing that bit me: CI/CD doesn't pick up module changes automatically.
Our GitHub Actions workflow detects changes by looking for modified terragrunt.hcl files. When I changed files in common/modules/eks-addons/, the workflow triggered but found no stacks to deploy (no terragrunt.hcl had changed), so nothing ran.
Module changes require a manual deploy:
cd workloads-nonprod/us-east-1/cluster-name/eks-addons
terragrunt init
terragrunt plan # Review: should show ~10 resources to add
terragrunt apply
After apply, verify everything is healthy:
# Check EKS-managed add-on status
for addon in vpc-cni aws-ebs-csi-driver coredns kube-proxy; do
aws eks describe-addon --cluster-name <cluster> --addon-name $addon \
--query 'addon.[addonName,status]' --output text
done
# All should show: ACTIVE
# Verify Metrics Server
kubectl top nodes
What Changed
Before: four add-ons running in self-managed mode, one installed by kubectl, no version history, no drift detection.
After:
- All six add-ons defined in code with pinned versions
- terraform plan shows immediately if anything drifts from the declared state
- Rollback is git revert + terragrunt apply
- The EKS cluster upgrade checklist is now: update four version strings in the Terragrunt config, open a PR, done
The cluster upgrade I was dreading took about 30 minutes instead of a day of manual compatibility checking.
Running into EKS add-on management problems? Reach out: this is the kind of operational work I do for platform teams.