Adedamola Ajibola

The EKS 1.32 to 1.33 Upgrade That Broke Everything (And How I Fixed It)

Introduction
Upgrading Kubernetes should be boring.

This one wasn’t.

I recently upgraded a production Amazon EKS cluster from 1.32 to 1.33, expecting a routine change. Instead, it triggered a cascading failure:

  • Nodes went NotReady
  • Add-ons stalled indefinitely
  • Karpenter stopped provisioning capacity
  • The cluster deadlocked itself

This post walks through what broke, why it broke, and the exact steps that stabilized the cluster so you don’t repeat my mistakes.

Critical Issues

If you're upgrading to EKS 1.33, know this:

  1. Amazon Linux 2 is NOT supported - migrate to AL2023 first
  2. Anonymous auth is restricted - new RBAC is required for the kube-apiserver
  3. Karpenter needs the eks:DescribeCluster permission - without it, node provisioning breaks
  4. Add-ons can get stuck in "Updating" - managed node groups are your escape hatch

Part 1: The Failed First Attempt

What I Did Wrong
I started with what looked like a standard Terraform upgrade:

module "damola_eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 20.0"

  cluster_name    = local.name
  cluster_version = "1.33"


What Happened

Error: AMI Type AL2_x86_64 is only supported for kubernetes versions 1.32 or earlier

The Root Cause
EKS 1.33 drops Amazon Linux 2 completely:

AL2 reaches end of support on Nov 26, 2025 and no AL2 AMIs exist for 1.33.

The Fix: AL2 → AL2023 Migration
For Karpenter users, this is actually simple. Update your EC2NodeClass:

# EC2NodeClass
spec:
  amiSelectorTerms:
    - alias: al2023@latest

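Before draining anything, I like to confirm the EC2NodeClass actually resolved AL2023 AMIs; a minimal check, assuming your nodeclass is named default (adjust to yours):

# Inspect the resolved AMIs in the EC2NodeClass status (nodeclass name is a placeholder)
kubectl get ec2nodeclass default -o yaml | grep -i -A 5 "amis:"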

Managed node groups (Terraform)

ami_type = "AL2023_x86_64_STANDARD"

Wait until all nodes are AL2023, then upgrade the control plane.
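One quick way to sanity-check that before touching the control plane:

# Confirm every node reports an AL2023 OS image
kubectl get nodes -o custom-columns=NAME:.metadata.name,OS:.status.nodeInfo.osImage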

Part 2: The Karpenter Catastrophe
After migrating to AL2023, I cordoned the old nodes, but no new nodes came up.

Karpenter was completely stuck.

The Error
Checking Karpenter logs revealed:


{
  "message": "failed to detect the cluster CIDR",
  "error": "not authorized to perform: eks:DescribeCluster"
}


The Root Cause

Starting with Karpenter v1.0, the controller requires eks:DescribeCluster to:

  • Detect cluster networking (CIDR)
  • Discover API endpoint configuration
  • Validate authentication mode

Without this permission, provisioning silently fails.

The Fix
Add the permission to your Karpenter controller IAM role:

{
  "Effect": "Allow",
  "Action": "eks:DescribeCluster",
  "Resource": "*"
}
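This is exactly the kind of change that should also live in Terraform (see the note at the end), but if you need to hot-patch it first, one option is an inline policy via the AWS CLI; the role, policy, and file names below are placeholders, and the JSON file should wrap the statement above in a full policy document:

# Attach the statement as an inline policy on the Karpenter controller role (names are placeholders)
aws iam put-role-policy \
  --role-name KarpenterControllerRole \
  --policy-name karpenter-eks-describe-cluster \
  --policy-document file://karpenter-describe-cluster.json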

Then restart:

kubectl rollout restart deployment/karpenter -n karpenter
kubectl rollout status deployment/karpenter -n karpenter

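To confirm the controller came back healthy, tail its logs and check that the CIDR-detection error is gone (the label selector assumes a standard Helm install):

# Watch the Karpenter controller logs after the restart
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=100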

Karpenter recovered but the cluster still wasn’t healthy.

Part 3: The Addon Deadlock

After the control plane upgraded:

  1. Add-ons started updating (vpc-cni, kube-proxy, coredns)
  2. They got stuck in Updating
  3. All nodes went NotReady
  4. No new nodes could join

Classic deadlock:

Nodes need add-ons → add-ons need healthy nodes.

The Error

kubectl get nodes
# All showed: NotReady

kubectl logs -n kube-system -l k8s-app=kube-dns
# Error: Authorization error (user=kube-apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)
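The add-on status on the EKS side told the same story; a quick check with the AWS CLI (the cluster name is a placeholder):

# Check managed add-on status directly in EKS (placeholder cluster name)
for addon in vpc-cni kube-proxy coredns; do
  aws eks describe-addon --cluster-name my-cluster --addon-name "$addon" \
    --query 'addon.[addonName,status]' --output text
done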

The Root Causes

Anonymous auth restricted (EKS 1.33)

  • Anonymous API access is now limited to health endpoints only.
  • The kube-apiserver requires explicit RBAC to communicate with kubelet.

Add-on update deadlock

  • Add-ons need healthy nodes to update.
  • Nodes need working add-ons to become Ready.
  • When all nodes are NotReady, everything gets stuck.

The Fix Part 1: RBAC for kube-apiserver
Create the missing RBAC:

kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: system:kube-apiserver-to-kubelet
rules:
- apiGroups: [""]
  resources: ["nodes/proxy","nodes/stats","nodes/log","nodes/spec","nodes/metrics"]
  verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:kube-apiserver
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:kube-apiserver-to-kubelet
subjects:
- kind: User
  name: kube-apiserver-kubelet-client
  apiGroup: rbac.authorization.k8s.io
EOF

The errors stopped, but the add-ons were still stuck.

The Fix Part 2: Breaking the Deadlock with Managed Nodes

With broken Karpenter nodes, I had no way out.

Solution: temporarily scale up managed node groups.

# Inside eks_managed_node_groups in the EKS module
desired_size = 1
ami_type     = "AL2023_x86_64_STANDARD"

Why this works

  • Managed nodes bootstrap independently
  • They come up with working VPC CNI
  • Add-ons get healthy replicas
  • Karpenter recovers
  • Broken nodes can be safely deleted (drain commands below)
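Once the managed nodes were Ready and the add-ons settled, the broken Karpenter nodes could be drained and deleted; a sketch with a placeholder node name:

# Drain, then remove a broken Karpenter-provisioned node (node name is a placeholder)
kubectl drain ip-10-0-1-23.ec2.internal --ignore-daemonsets --delete-emptydir-data
kubectl delete node ip-10-0-1-23.ec2.internal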

Within ~10 minutes, the cluster recovered.

Part 4: Final Cleanup & Validation
Once stable, verify that all nodes are healthy:

kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
STATUS:.status.conditions[-1].type,\
OS:.status.nodeInfo.osImage,\
VERSION:.status.nodeInfo.kubeletVersion

# All showed:
# Ready | Amazon Linux 2023.9.20251208 | v1.33.5-eks-ecaa3a6

# Verify addons
kubectl get daemonset -n kube-system
# All showed READY = DESIRED

# Clean up stuck terminating pods
kubectl delete pod -n kube-system --force --grace-period=0 <stuck-pod-names>

Recommended Add-on Versions for EKS 1.33

  • CoreDNS: v1.12.4-eksbuild.1
  • kube-proxy: v1.33.5-eksbuild.2
  • VPC CNI: v1.21.1-eksbuild.1
  • EBS CSI: v1.54.0-eksbuild.1
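If you manage add-ons outside Terraform, one way to pin these versions is the AWS CLI; the cluster name is a placeholder, and pick the conflict-resolution strategy that fits your setup:

# Pin the managed add-ons to the versions above (placeholder cluster name)
aws eks update-addon --cluster-name my-cluster --addon-name coredns \
  --addon-version v1.12.4-eksbuild.1 --resolve-conflicts OVERWRITE
aws eks update-addon --cluster-name my-cluster --addon-name kube-proxy \
  --addon-version v1.33.5-eksbuild.2 --resolve-conflicts OVERWRITE
aws eks update-addon --cluster-name my-cluster --addon-name vpc-cni \
  --addon-version v1.21.1-eksbuild.1 --resolve-conflicts OVERWRITE
aws eks update-addon --cluster-name my-cluster --addon-name aws-ebs-csi-driver \
  --addon-version v1.54.0-eksbuild.1 --resolve-conflicts OVERWRITE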

Key Takeaways

  • AL2023 is mandatory for EKS 1.33
  • Karpenter needs eks:DescribeCluster
  • kube-apiserver RBAC must be updated
  • Keep managed node groups as a safety net

Note
Looking back, the main issue wasn't just a missing permission; it was configuration drift.

While the cluster was still running EKS 1.32, I manually added eks:DescribeCluster during the AL2023 migration. Everything worked, so I forgot to codify it in Terraform.

During the upgrade to EKS 1.33, Terraform re-applied the IAM role and removed the permission right when Karpenter started requiring it.

The upgrade didn’t introduce the bug; it exposed the drift.

Environment: EKS, Terraform, Karpenter v1.x
