Introduction
Upgrading Kubernetes should be boring.
This one wasn’t.
I recently upgraded a production Amazon EKS cluster from 1.32 to 1.33, expecting a routine change. Instead, it triggered a cascading failure:
- Nodes went NotReady
- Add-ons stalled indefinitely
- Karpenter stopped provisioning capacity
- The cluster deadlocked itself
This post walks through what broke, why it broke, and the exact steps that stabilized the cluster so you don’t repeat my mistakes.
Critical Issues
If you're upgrading to EKS 1.33, know this:
- Amazon Linux 2 is NOT supported - must migrate to AL2023 first
- Anonymous auth is restricted - new RBAC required for kube-apiserver
- Karpenter needs eks:DescribeCluster permission - missing this breaks everything
- Add-ons can get stuck in "Updating" - managed node groups are your escape hatch
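Before touching anything, it's worth checking how exposed you are. Here is a minimal preflight sketch; the cluster name and region are placeholders, adjust them for your environment:

CLUSTER=my-cluster   # placeholder
REGION=eu-west-1     # placeholder

# Which AMI type does each managed node group use? Anything AL2_* must move to AL2023 first.
for NG in $(aws eks list-nodegroups --cluster-name "$CLUSTER" --region "$REGION" \
              --query 'nodegroups[]' --output text); do
  aws eks describe-nodegroup --cluster-name "$CLUSTER" --nodegroup-name "$NG" --region "$REGION" \
    --query 'nodegroup.{name:nodegroupName,amiType:amiType,version:version}' --output table
done

# What OS image are the nodes actually running right now?
kubectl get nodes -o custom-columns=NAME:.metadata.name,OS:.status.nodeInfo.osImage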
Part 1: The Failed First Attempt
What I Did Wrong
I started with what looked like a standard Terraform upgrade:
module "damola_eks" {
source = "terraform-aws-modules/eks/aws"
version = "~> 20.0"
cluster_name = local.name
cluster_version = "1.33"
What happened
Error: AMI Type AL2_x86_64 is only supported for kubernetes versions 1.32 or earlier
The Root Cause
EKS 1.33 drops Amazon Linux 2 completely:
AL2 reaches end of support on Nov 26, 2025 and no AL2 AMIs exist for 1.33.
The Fix: AL2 → AL2023 Migration
For Karpenter users, this is actually simple. Update your EC2NodeClass:
# EC2NodeClass
spec:
  amiSelectorTerms:
    - alias: al2023@latest
Managed node groups (Terraform)
ami_type = "AL2023_x86_64_STANDARD"
Wait until all nodes are AL2023, then upgrade the control plane.
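A quick way to confirm nothing is still on AL2 before pulling the trigger (a simple sketch that just greps the node OS image):

# Prints any node NOT running Amazon Linux 2023; no output means you're clear to upgrade.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.osImage}{"\n"}{end}' \
  | grep -v "Amazon Linux 2023" || echo "All nodes are on AL2023"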
Part 2: The Karpenter Catastrophe
After migrating to AL2023, I cordoned the old nodes, but no new nodes came up.
Karpenter was completely stuck.
The Error
Checking Karpenter logs revealed:
{
  "message": "failed to detect the cluster CIDR",
  "error": "not authorized to perform: eks:DescribeCluster"
}
The Root Cause
Starting with Karpenter v1.0, the controller requires eks:DescribeCluster to:
- Detect cluster networking (CIDR)
- Discover API endpoint configuration
- Validate authentication mode
Without this permission, provisioning silently fails.
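If you're not sure whether your controller role already has it, the IAM policy simulator can tell you before anything breaks (the role ARN below is a placeholder):

# Returns "allowed" or "implicitDeny" for the Karpenter controller role.
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::111122223333:role/KarpenterController \
  --action-names eks:DescribeCluster \
  --query 'EvaluationResults[].{action:EvalActionName,decision:EvalDecision}' \
  --output table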
The Fix
Add the permission to your Karpenter controller IAM role:
{
  "Effect": "Allow",
  "Action": "eks:DescribeCluster",
  "Resource": "*"
}
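If you need the fix immediately and Terraform is not handy, an inline policy does the job; role and policy names here are placeholders, and you should codify the change afterwards:

aws iam put-role-policy \
  --role-name KarpenterController \
  --policy-name karpenter-eks-describe-cluster \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": "eks:DescribeCluster",
      "Resource": "*"
    }]
  }'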
Then restart:
kubectl rollout restart deployment/karpenter -n karpenter
kubectl rollout status deployment/karpenter -n karpenter
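To confirm the controller is actually healthy again, I'd watch the logs for the auth error disappearing (namespace and deployment names assume the standard Helm install):

# No more "not authorized" messages should show up after the restart.
kubectl logs -n karpenter deployment/karpenter --since=5m | grep -i "not authorized" || echo "no auth errors"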
Karpenter recovered, but the cluster still wasn't healthy.
Part 3: The Addon Deadlock
After the control plane upgraded:
- Add-ons started updating (vpc-cni, kube-proxy, coredns)
- They got stuck in Updating
- All nodes went NotReady
- No new nodes could join
Classic deadlock:
Nodes need add-ons → add-ons need healthy nodes.
The Error
kubectl get nodes
# All showed: NotReady
kubectl logs -n kube-system -l k8s-app=kube-dns
# Error: Authorization error (user=kube-apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)
The Root Causes
Anonymous auth restricted (EKS 1.33)
- Anonymous API access is now limited to health endpoints only.
- The kube-apiserver requires explicit RBAC to communicate with kubelet.
Add-on update deadlock
- Add-ons need healthy nodes to update.
- Nodes need working add-ons to become Ready.
- When all nodes are NotReady, everything gets stuck.
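To see exactly which add-ons are wedged and why, the EKS API reports per-addon health (cluster name is a placeholder):

# Lists each installed add-on with its status and any health issues EKS has recorded.
for ADDON in $(aws eks list-addons --cluster-name my-cluster --query 'addons[]' --output text); do
  aws eks describe-addon --cluster-name my-cluster --addon-name "$ADDON" \
    --query 'addon.{name:addonName,status:status,issues:health.issues[].message}' --output json
done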
The Fix Part 1: RBAC for kube-apiserver
Create the missing RBAC:
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: system:kube-apiserver-to-kubelet
rules:
  - apiGroups: [""]
    resources: ["nodes/proxy", "nodes/stats", "nodes/log", "nodes/spec", "nodes/metrics"]
    verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:kube-apiserver
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:kube-apiserver-to-kubelet
subjects:
  - kind: User
    name: kube-apiserver-kubelet-client
    apiGroup: rbac.authorization.k8s.io
EOF
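A quick impersonation check confirms the binding works (assuming your own identity is cluster-admin and can impersonate users):

# Should print "yes" once the ClusterRoleBinding is in place.
kubectl auth can-i get nodes --subresource=proxy --as=kube-apiserver-kubelet-client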
Errors stopped but add-ons were still stuck.
The Fix Part 2: Breaking the Deadlock with Managed Nodes
With broken Karpenter nodes, I had no way out.
Solution: temporarily scale up managed node groups.
desired_size = 1
ami_type = "AL2023_x86_64_STANDARD"
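If Terraform itself is slow to run (or part of the problem), the same scale-up can be done directly against the node group. This is a stop-gap, not a replacement for keeping state in sync; names and sizes below are placeholders:

aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name default \
  --scaling-config minSize=1,maxSize=2,desiredSize=1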
Why this works
- Managed nodes bootstrap independently
- They come up with working VPC CNI
- Add-ons get healthy replicas
- Karpenter recovers
- Broken nodes can be safely deleted
Within ~10 minutes, the cluster recovered.
Part 4: Final Cleanup & Validation
Once the cluster is stable, verify that all nodes are healthy:
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
STATUS:.status.conditions[-1].type,\
OS:.status.nodeInfo.osImage,\
VERSION:.status.nodeInfo.kubeletVersion
# All showed:
# Ready | Amazon Linux 2023.9.20251208 | v1.33.5-eks-ecaa3a6
# Verify addons
kubectl get daemonset -n kube-system
# All showed READY = DESIRED
# Clean up stuck terminating pods
kubectl delete pod -n kube-system --force --grace-period=0 <stuck-pod-names>
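The old broken Karpenter nodes can then be removed through Karpenter itself; deleting the NodeClaim drains and terminates the underlying instance (names here are illustrative):

# List Karpenter-managed capacity and remove the stuck entries.
kubectl get nodeclaims
kubectl delete nodeclaim <stuck-nodeclaim-name>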
Recommended Add-on Versions for EKS 1.33
- CoreDNS: v1.12.4-eksbuild.1
- kube-proxy: v1.33.5-eksbuild.2
- VPC CNI: v1.21.1-eksbuild.1
- EBS CSI: v1.54.0-eksbuild.1
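These are the versions that worked for me; EKS can also tell you the current compatible versions for 1.33 directly (a sketch for one add-on, repeat per add-on name):

# Lists every compatible CoreDNS version for 1.33 and flags the default.
aws eks describe-addon-versions \
  --kubernetes-version 1.33 \
  --addon-name coredns \
  --query 'addons[].addonVersions[].{version:addonVersion,default:compatibilities[0].defaultVersion}' \
  --output table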
Key Takeaways
- AL2023 is mandatory for EKS 1.33
- Karpenter needs eks:DescribeCluster
- kube-apiserver RBAC must be updated
- Keep managed node groups as a safety net
Note
Looking back, the main issue wasn't just missing permissions; it was configuration drift.
While the cluster was still running EKS 1.32, I manually added eks:DescribeCluster during the AL2023 migration. Everything worked, so I forgot to codify it in Terraform.
During the upgrade to EKS 1.33, Terraform re-applied the IAM role and removed the permission right when Karpenter started requiring it.
The upgrade didn't introduce the bug; it exposed drift that was already there.
Environment: EKS, Terraform, Karpenter v1.x