Upgrading a Kubernetes cluster is one of the most critical operational tasks in any cloud-native environment. A poorly planned upgrade can lead to application downtime, add-on incompatibility, or even cluster instability. On the other hand, a well-executed upgrade brings better security, improved performance, and access to the latest Kubernetes features.
Recently, I upgraded an Amazon Elastic Kubernetes Service (EKS) cluster in a production-like environment, and in this article I want to share my practical understanding and a step-by-step approach to help others perform this activity safely and confidently.
This guide walks through the complete EKS upgrade process, including prerequisites, backup strategy, control plane upgrade, add-on upgrades, node upgrades and the cordon and drain process.
Why Upgrading EKS Is Important
Kubernetes versions are released frequently, and AWS supports only a limited number of them at a time. Running an outdated cluster can lead to:
- Security vulnerabilities
- Unsupported add-ons
- Compatibility issues
- Limited feature availability
- Compliance risks
Step 1: Prerequisites
Before starting the upgrade, it is important to validate the current cluster state and ensure all required tools are ready.
Check Current Cluster Version
kubectl version
kubectl get nodes
aws eks describe-cluster --name <cluster-name> --query "cluster.version"
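EKS only supports moving up one minor version per upgrade, so it is worth validating the hop before touching anything. A minimal pure-shell sketch (the function name and the simple "major.minor" version format are assumptions for illustration):

```shell
# Hypothetical helper: EKS allows only a single minor-version hop per
# upgrade, so check the target before starting. Assumes "major.minor"
# version strings such as "1.29".
is_valid_upgrade() {
  current_minor="${1#*.}"   # strip everything up to the first dot
  target_minor="${2#*.}"
  [ "$target_minor" -eq "$((current_minor + 1))" ]
}

is_valid_upgrade "1.29" "1.30" && echo "ok: 1.29 -> 1.30 is a single minor hop"
is_valid_upgrade "1.28" "1.30" || echo "blocked: 1.28 -> 1.30 skips 1.29"
```

To go from 1.28 to 1.30, you would run the full upgrade procedure twice: once to 1.29, then again to 1.30.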
Step 2: Backup (Most Critical Step)
Before upgrading any EKS cluster, always take a backup. This ensures you can recover quickly in case something goes wrong.
Backup Kubernetes Resources
kubectl get all -A -o yaml > cluster-backup.yaml
kubectl get pvc -A -o yaml > cluster-backup-pvc.yaml
kubectl get pv -A -o yaml > cluster-backup-pv.yaml
aws eks list-addons --cluster-name <cluster-name> > cluster-backup-addons.json
We can also use the AWS Backup service to take a backup, via the 'Update and Backup History' tab in the AWS Console.
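A single cluster-wide YAML dump can be unwieldy to restore from; one file per namespace is easier to work with. A hedged sketch of that loop, with kubectl stubbed by an echo function so the flow is visible without a live cluster (remove the stub and the placeholder namespace list to run it for real):

```shell
# Sketch: back up resources one namespace at a time instead of a single
# cluster-wide dump. kubectl is stubbed with echo for illustration; the
# namespace list is a placeholder (in practice, read it from kubectl get ns).
kubectl() { echo "kubectl $*"; }

for ns in kube-system default; do
  # In a real run, redirect each command into "backup-${ns}.yaml".
  kubectl get all -n "$ns" -o yaml
done
```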
Step 3: Control Plane Upgrade
The control plane is the brain of the Kubernetes cluster. In EKS, upgrading the control plane is managed by AWS and does not affect running workloads.
Upgrade Control Plane
aws eks update-cluster-version \
--name <cluster-name> \
--kubernetes-version 1.30
You can also trigger the upgrade with a single click from the AWS Console.
Typically, this process takes around 10–15 minutes.
Once completed, the control plane will run on the new Kubernetes version.
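Rather than polling the Console, the AWS CLI ships a waiter that blocks until the cluster returns to ACTIVE. A sketch of that flow (the cluster name is a placeholder, and aws is stubbed with echo here so the sequence can be followed without an AWS account):

```shell
# Sketch: wait for the control plane to report ACTIVE, then confirm the
# new version. aws is stubbed with echo for illustration; "my-cluster" is
# a placeholder name.
aws() { echo "aws $*"; }

aws eks wait cluster-active --name my-cluster
aws eks describe-cluster --name my-cluster --query "cluster.version" --output text
```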
Step 4: Add-ons Upgrade
After upgrading the control plane, the next step is upgrading EKS add-ons.
Refer to the "cluster-backup-addons.json" file created in Step 2 and run the following command for each add-on listed in it.
aws eks update-addon \
--cluster-name <cluster-name> \
--addon-name vpc-cni \
--addon-version <version>
Step 5: Node Upgrade
Node upgrade is the most sensitive part of the EKS upgrade process.
Nodes must run a Kubernetes version compatible with the control plane.
Managed Node Group Upgrade
If you are using managed node groups, AWS handles most of the upgrade process.
aws eks update-nodegroup-version \
--cluster-name <cluster-name> \
--nodegroup-name <nodegroup-name>
AWS will:
- Create new nodes
- Move workloads
- Terminate old nodes
- Maintain availability
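The update-nodegroup-version call returns an update ID, which can be polled for progress with describe-update. A sketch (cluster name, node group name, and update ID are placeholders; aws is stubbed with echo here):

```shell
# Sketch: poll the status of a managed node group rollout. aws is stubbed
# with echo for illustration; all identifiers are placeholders.
aws() { echo "aws $*"; }

aws eks describe-update \
  --name my-cluster \
  --nodegroup-name my-nodegroup \
  --update-id 12345678-aaaa-bbbb-cccc-123456789012
```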
Self-Managed Node Upgrade
With self-managed nodes, you need to upgrade the AMI and node group manually.
1. Create New Launch Template
Update:
- AMI
- Kubernetes version
- User data
2. Create New Nodes
eksctl create nodegroup
3. Move Workloads
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node>
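The three commands above generalize to a one-node-at-a-time loop, which keeps the blast radius small. A hedged sketch with kubectl stubbed by echo so the sequence is visible without a cluster (node names are placeholders):

```shell
# Sketch: roll old nodes out one at a time - cordon, drain, delete.
# kubectl is stubbed with echo for illustration; node names are placeholders.
kubectl() { echo "kubectl $*"; }

for node in ip-10-0-1-10 ip-10-0-1-11; do
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  kubectl delete node "$node"
done
```

In a real run, verify that the workloads from each drained node are Running elsewhere before moving on to the next node.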
Challenges You Might Face During an EKS Upgrade
While upgrading the Amazon Elastic Kubernetes Service (EKS) cluster, one of my biggest lessons was that upgrades are rarely smooth in real production environments. Several challenges can arise, especially at the node level, where running workloads may block the upgrade process.
Here are some common challenges and how to handle them.
1. Workloads Running on Node Blocking Node Upgrade
One of the most common issues occurs when a node cannot be drained because some workloads are still running on it.
Problem
During node upgrade, when executing:
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
You may see errors like:
- Pod cannot be evicted
- Pod disruption budget prevents eviction
- Node drain failed
- Pod stuck in Terminating state
This happens because Kubernetes is trying to protect running applications from downtime.
Common Reasons
Pod Disruption Budget (PDB)
If a Pod Disruption Budget is configured, Kubernetes will not allow pods to be terminated beyond the defined limit.
Example:
kubectl get pdb -A
If the PDB allows only one pod to be down and that disruption budget is already consumed, the drain will fail.
Solution
Temporarily modify or remove the PDB.
kubectl delete pdb <pdb-name>
or update the PDB:
minAvailable: 0
Then retry drain.
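Patching is usually safer than deleting, because the PDB can be restored to its original value once the drain is finished. A sketch of that round trip (PDB and namespace names are placeholders; kubectl is stubbed with echo here for illustration):

```shell
# Sketch: relax a PodDisruptionBudget for the drain window, then restore
# it afterwards. kubectl is stubbed with echo; names are placeholders.
kubectl() { echo "kubectl $*"; }

kubectl patch pdb my-app-pdb -n my-ns --type merge -p '{"spec":{"minAvailable":0}}'
# ... drain the node(s) here ...
kubectl patch pdb my-app-pdb -n my-ns --type merge -p '{"spec":{"minAvailable":1}}'
```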
2. Pods Using Local Storage (emptyDir)
Problem
Some pods use emptyDir or local storage.
Kubernetes blocks node drain to prevent data loss.
Error example:
cannot delete Pods with local storage
Solution
Use:
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
This allows drain to proceed.
Always verify application behavior before doing this.
3. DaemonSets Preventing Node Drain
Problem
DaemonSets run on every node.
Examples:
- CloudWatch agent
- Monitoring agents
- Security agents
- Log collectors
Drain fails because DaemonSets cannot be evicted.
Solution
Use:
kubectl drain <node> --ignore-daemonsets
DaemonSets will automatically run on new nodes.
4. Stateful Applications Blocking Upgrade
Problem
Stateful applications like:
- Databases
- Kafka
- Airflow workers
- Elasticsearch
may not move easily.
Drain may take a long time or fail.
Solution
Upgrade in controlled steps:
- Scale replicas
- Move pods manually
- Drain one node at a time
- Monitor application health
Example:
kubectl get pods -o wide
kubectl cordon <node>
kubectl drain <node>
5. Add-on Not Compatible with New Kubernetes Version
Problem
After control plane upgrade, some add-ons stop working.
Example:
- VPC CNI crash
- CoreDNS not running
- Metrics server failure
Solution
Upgrade add-ons immediately.
aws eks update-addon \
--cluster-name <cluster-name> \
--addon-name coredns \
--addon-version <version>
Always check compatibility before upgrading.
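The compatible add-on versions for a target Kubernetes version can be listed up front with describe-addon-versions, which avoids discovering an incompatibility mid-upgrade. A sketch (aws is stubbed with echo here for illustration):

```shell
# Sketch: list add-on versions compatible with the target Kubernetes
# version before upgrading. aws is stubbed with echo for illustration.
aws() { echo "aws $*"; }

aws eks describe-addon-versions --addon-name coredns --kubernetes-version 1.30
```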
6. Node Group Upgrade Taking Too Long
Problem
Managed node group upgrade sometimes takes longer than expected.
Reasons:
- Large number of pods
- PDB restrictions
- Insufficient capacity
- Scaling delays
Solution
Monitor upgrade:
kubectl get nodes
kubectl get pods -A
kubectl get events -A
Check AWS Console for node group update status.
Ensure:
- Enough capacity
- Proper scaling
- No stuck pods
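Stuck pods are usually what hold a rollout back, and a field selector surfaces them quickly instead of scanning the full pod list. A sketch (kubectl is stubbed with echo here for illustration):

```shell
# Sketch: list pods that are neither Running nor Succeeded - typically the
# ones blocking a node group rollout. kubectl is stubbed with echo here.
kubectl() { echo "kubectl $*"; }

kubectl get pods -A "--field-selector=status.phase!=Running,status.phase!=Succeeded"
```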
If you have any questions or would like to share your experience with Amazon Elastic Kubernetes Service (EKS) upgrades, feel free to drop a comment below or connect with me on LinkedIn:
https://www.linkedin.com/in/shubham-kumar1807/
I would be happy to discuss ideas, challenges and real-world upgrade strategies with fellow cloud and Kubernetes enthusiasts. 🚀
