Introduction
Upgrading Kubernetes should be boring.
This one wasn’t.
I recently upgraded a production Amazon EKS cluster from 1.32 to 1.33, expecting a routine change. Instead, it triggered a cascading failure:
- Nodes went NotReady
- Add-ons stalled indefinitely
- Karpenter stopped provisioning capacity
- The cluster deadlocked itself
This post walks through what broke, why it broke, and the exact steps that stabilized the cluster so you don’t repeat my mistakes.
Critical Issues
If you're upgrading to EKS 1.33, know this:
- Amazon Linux 2 is NOT supported - must migrate to AL2023 first
- Anonymous auth is restricted - new RBAC required for kube-apiserver
- Karpenter needs eks:DescribeCluster permission - missing this breaks everything
- Add-ons can get stuck in "Updating" - managed node groups are your escape hatch
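Before touching anything, it's worth checking how exposed you are. Here is a minimal preflight sketch; the cluster name and region are placeholders, adjust them for your environment:

CLUSTER=my-cluster   # placeholder
REGION=eu-west-1     # placeholder

# Which AMI type does each managed node group use? Anything AL2_* must move to AL2023 first.
for NG in $(aws eks list-nodegroups --cluster-name "$CLUSTER" --region "$REGION" \
              --query 'nodegroups[]' --output text); do
  aws eks describe-nodegroup --cluster-name "$CLUSTER" --nodegroup-name "$NG" --region "$REGION" \
    --query 'nodegroup.{name:nodegroupName,amiType:amiType,version:version}' --output table
done

# What OS image are the nodes actually running right now?
kubectl get nodes -o custom-columns=NAME:.metadata.name,OS:.status.nodeInfo.osImage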
Part 1: The Failed First Attempt
What I Did Wrong
I started with what looked like a standard Terraform upgrade:
module "damola_eks" {
source = "terraform-aws-modules/eks/aws"
version = "~> 20.0"
cluster_name = local.name
cluster_version = "1.33"
What happened
Error: AMI Type AL2_x86_64 is only supported for kubernetes versions 1.32 or earlier
The Root Cause
EKS 1.33 drops Amazon Linux 2 completely:
AL2 reaches end of support on Nov 26, 2025 and no AL2 AMIs exist for 1.33.
The Fix: AL2 → AL2023 Migration
For Karpenter users, this is actually simple. Update your EC2NodeClass:
# EC2NodeClass
spec:
  amiSelectorTerms:
    - alias: al2023@latest
Managed node groups (Terraform)
ami_type = "AL2023_x86_64_STANDARD"
Wait until all nodes are AL2023, then upgrade the control plane.
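A quick way to confirm nothing is still on AL2 before pulling the trigger (a simple sketch that just greps the node OS image):

# Prints any node NOT running Amazon Linux 2023; no output means you're clear to upgrade.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.osImage}{"\n"}{end}' \
  | grep -v "Amazon Linux 2023" || echo "All nodes are on AL2023"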
Part 2: The Karpenter Catastrophe
After migrating to AL2023, I cordoned the old nodes, but no new nodes came up.
Karpenter was completely stuck.
The Error
Checking Karpenter logs revealed:
{
  "message": "failed to detect the cluster CIDR",
  "error": "not authorized to perform: eks:DescribeCluster"
}
The Root Cause
Starting with Karpenter v1.0, the controller requires eks:DescribeCluster to:
- Detect cluster networking (CIDR)
- Discover API endpoint configuration
- Validate authentication mode
Without this permission, provisioning silently fails.
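If you're not sure whether your controller role already has it, the IAM policy simulator can tell you before anything breaks (the role ARN below is a placeholder):

# Returns "allowed" or "implicitDeny" for the Karpenter controller role.
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::111122223333:role/KarpenterController \
  --action-names eks:DescribeCluster \
  --query 'EvaluationResults[].{action:EvalActionName,decision:EvalDecision}' \
  --output table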
The Fix
Add the permission to your Karpenter controller IAM role:
{
  "Effect": "Allow",
  "Action": "eks:DescribeCluster",
  "Resource": "*"
}
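If you need the fix immediately and Terraform is not handy, an inline policy does the job; role and policy names here are placeholders, and you should codify the change afterwards:

aws iam put-role-policy \
  --role-name KarpenterController \
  --policy-name karpenter-eks-describe-cluster \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": "eks:DescribeCluster",
      "Resource": "*"
    }]
  }'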
Then restart:
kubectl rollout restart deployment/karpenter -n karpenter
kubectl rollout status deployment/karpenter -n karpenter
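To confirm the controller is actually healthy again, I'd watch the logs for the auth error disappearing (namespace and deployment names assume the standard Helm install):

# No more "not authorized" messages should show up after the restart.
kubectl logs -n karpenter deployment/karpenter --since=5m | grep -i "not authorized" || echo "no auth errors"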
Karpenter recovered, but the cluster still wasn't healthy.
Part 3: The Addon Deadlock
After the control plane upgraded:
- Add-ons started updating (vpc-cni, kube-proxy, coredns)
- They got stuck in Updating
- All nodes went NotReady
- No new nodes could join
Classic deadlock:
Nodes need add-ons → add-ons need healthy nodes.
The Error
kubectl get nodes
# All showed: NotReady
kubectl logs -n kube-system -l k8s-app=kube-dns
# Error: Authorization error (user=kube-apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)
The Root Causes
Anonymous auth restricted (EKS 1.33)
- Anonymous API access is now limited to health endpoints only.
- The kube-apiserver requires explicit RBAC to communicate with kubelet.
Add-on update deadlock
- Add-ons need healthy nodes to update.
- Nodes need working add-ons to become Ready.
- When all nodes are NotReady, everything gets stuck.
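To see exactly which add-ons are wedged and why, the EKS API reports per-addon health (cluster name is a placeholder):

# Lists each installed add-on with its status and any health issues EKS has recorded.
for ADDON in $(aws eks list-addons --cluster-name my-cluster --query 'addons[]' --output text); do
  aws eks describe-addon --cluster-name my-cluster --addon-name "$ADDON" \
    --query 'addon.{name:addonName,status:status,issues:health.issues[].message}' --output json
done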
The Fix Part 1: RBAC for kube-apiserver
Create the missing RBAC:
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: system:kube-apiserver-to-kubelet
rules:
  - apiGroups: [""]
    resources: ["nodes/proxy", "nodes/stats", "nodes/log", "nodes/spec", "nodes/metrics"]
    verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:kube-apiserver
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:kube-apiserver-to-kubelet
subjects:
  - kind: User
    name: kube-apiserver-kubelet-client
    apiGroup: rbac.authorization.k8s.io
EOF
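A quick impersonation check confirms the binding works (assuming your own identity is cluster-admin and can impersonate users):

# Should print "yes" once the ClusterRoleBinding is in place.
kubectl auth can-i get nodes --subresource=proxy --as=kube-apiserver-kubelet-client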
Errors stopped but add-ons were still stuck.
The Fix Part 2: Breaking the Deadlock with Managed Nodes
With broken Karpenter nodes, I had no way out.
Solution: temporarily scale up managed node groups.
desired_size = 1
ami_type = "AL2023_x86_64_STANDARD"
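If Terraform itself is slow to run (or part of the problem), the same scale-up can be done directly against the node group. This is a stop-gap, not a replacement for keeping state in sync; names and sizes below are placeholders:

aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name default \
  --scaling-config minSize=1,maxSize=2,desiredSize=1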
Why this works
- Managed nodes bootstrap independently
- They come up with working VPC CNI
- Add-ons get healthy replicas
- Karpenter recovers
- Broken nodes can be safely deleted
Within ~10 minutes, the cluster recovered.
Part 4: Final Cleanup & Validation
Once the cluster is stable, verify that all nodes are healthy:
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
STATUS:.status.conditions[-1].type,\
OS:.status.nodeInfo.osImage,\
VERSION:.status.nodeInfo.kubeletVersion
# All showed:
# Ready | Amazon Linux 2023.9.20251208 | v1.33.5-eks-ecaa3a6
# Verify addons
kubectl get daemonset -n kube-system
# All showed READY = DESIRED
# Clean up stuck terminating pods
kubectl delete pod -n kube-system --force --grace-period=0 <stuck-pod-names>
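The old broken Karpenter nodes can then be removed through Karpenter itself; deleting the NodeClaim drains and terminates the underlying instance (names here are illustrative):

# List Karpenter-managed capacity and remove the stuck entries.
kubectl get nodeclaims
kubectl delete nodeclaim <stuck-nodeclaim-name>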
Recommended Add-on Versions for EKS 1.33
- CoreDNS: v1.12.4-eksbuild.1
- kube-proxy: v1.33.5-eksbuild.2
- VPC CNI: v1.21.1-eksbuild.1
- EBS CSI: v1.54.0-eksbuild.1
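These are the versions that worked for me; EKS can also tell you the current compatible versions for 1.33 directly (a sketch for one add-on, repeat per add-on name):

# Lists every compatible CoreDNS version for 1.33 and flags the default.
aws eks describe-addon-versions \
  --kubernetes-version 1.33 \
  --addon-name coredns \
  --query 'addons[].addonVersions[].{version:addonVersion,default:compatibilities[0].defaultVersion}' \
  --output table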
Key Takeaways
- AL2023 is mandatory for EKS 1.33
- Karpenter needs eks:DescribeCluster
- kube-apiserver RBAC must be updated
- Keep managed node groups as a safety net
Note
Looking back, the main issue wasn't just missing permissions; it was configuration drift.
While the cluster was still running EKS 1.32, I manually added eks:DescribeCluster during the AL2023 migration. Everything worked, so I forgot to codify it in Terraform.
During the upgrade to EKS 1.33, Terraform re-applied the IAM role and removed the permission right when Karpenter started requiring it.
The upgrade didn't introduce the bug; it exposed drift that was already there.
Environment: EKS, Terraform, Karpenter v1.x