Shubham Kumar
Upgrading an Amazon EKS Cluster: A Practical Step-by-Step Guide from Real-World Experience

Upgrading a Kubernetes cluster is one of the most critical operational tasks in any cloud-native environment. A poorly planned upgrade can lead to application downtime, add-on incompatibility or even cluster instability. On the other hand, a well-executed upgrade ensures better security, improved performance and access to the latest Kubernetes features.

Recently, I performed an upgrade of an Amazon Elastic Kubernetes Service (EKS) cluster in a production-like environment and in this article, I want to share my practical understanding and step-by-step approach to help others perform this activity safely and confidently.

This guide walks through the complete EKS upgrade process, including prerequisites, backup strategy, control plane upgrade, add-on upgrades, node upgrades and the cordon and drain process.

Why Upgrading EKS Is Important

Kubernetes versions are released frequently and AWS supports only a limited number of Kubernetes versions at a time. Running an outdated cluster can lead to:

  • Security vulnerabilities
  • Unsupported add-ons
  • Compatibility issues
  • Limited feature availability
  • Compliance risks

Step 1: Prerequisites
Before starting the upgrade, it is important to validate the current cluster state and ensure all required tools are ready.

Check Current Cluster Version

kubectl version
kubectl get nodes
aws eks describe-cluster --name <cluster-name> --query "cluster.version"

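Beyond checking the current version, it is worth verifying that your nodes are not too far behind the version you are upgrading to. A minimal sketch of that check, using the Kubernetes skew policy (the kubelet may trail the API server by up to three minor versions); the version strings below are examples, so feed in the values reported by the commands above:

```shell
# Extract the minor version from a "major.minor" string.
minor() { echo "$1" | cut -d. -f2; }

# Return success if the node's kubelet is within the supported skew
# (at most three minor versions behind the target control plane).
skew_ok() {
  local target=$1 node=$2
  [ $(( $(minor "$target") - $(minor "$node") )) -le 3 ]
}

skew_ok "1.30" "1.28" && echo "skew OK"
skew_ok "1.30" "1.26" || echo "upgrade nodes in stages first"
```

If the skew check fails, upgrade the nodes in stages before moving the control plane further ahead.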

Step 2: Backup (Most Critical Step)

Before upgrading any EKS cluster, always take a backup. This ensures you can recover quickly in case something goes wrong.

Backup Kubernetes Resources

kubectl get all -A -o yaml > cluster-backup.yaml
kubectl get pvc -A -o yaml > cluster-backup-pvc.yaml
kubectl get pv -o yaml > cluster-backup-pv.yaml
aws eks list-addons --cluster-name <cluster-name> > cluster-backup-addons.json


You can also use the AWS Backup service to take a backup from the 'Update and Backup History' tab in the AWS Console.

Step 3: Control Plane Upgrade
The control plane is the brain of the Kubernetes cluster. In EKS, the control plane upgrade is managed by AWS and does not affect running workloads. Note that EKS upgrades the control plane only one minor version at a time, so if you are several versions behind, repeat the process once per version.

Upgrade Control Plane

aws eks update-cluster-version \
--name <cluster-name> \
--kubernetes-version 1.30

You can also trigger the upgrade with a single click from the AWS Console.

Typically, this process takes around 10–15 minutes.

Once completed, the control plane will run on the new Kubernetes version.
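A minimal sketch for confirming this, assuming the AWS CLI is configured: block until the cluster status returns to ACTIVE, then print the new version. The cluster name is a placeholder.

```shell
# Wait for the control-plane upgrade to finish, then report the
# Kubernetes version the cluster is now running.
wait_for_cluster() {
  local name=$1
  aws eks wait cluster-active --name "$name"
  aws eks describe-cluster --name "$name" \
    --query "cluster.version" --output text
}

# Usage (with your cluster name):
# wait_for_cluster <cluster-name>
```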

Step 4: Add-ons Upgrade
After upgrading the control plane, the next step is upgrading EKS add-ons.

Refer to the cluster-backup-addons.json file created in Step 2 and run the following command for each add-on listed in the file.

aws eks update-addon \
--cluster-name <cluster-name> \
--addon-name vpc-cni \
--addon-version <version>
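To pick a valid value for <version>, you can ask EKS which add-on versions are published for the target Kubernetes version. A sketch, with placeholder names; the first entry returned is typically the newest, but verify it before applying:

```shell
# List add-on versions compatible with the given Kubernetes version.
compatible_addon_versions() {
  local addon=$1 k8s=$2
  aws eks describe-addon-versions \
    --addon-name "$addon" \
    --kubernetes-version "$k8s" \
    --query "addons[0].addonVersions[].addonVersion" \
    --output text
}

# Usage:
# compatible_addon_versions vpc-cni 1.30
```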

Step 5: Node Upgrade

Node upgrade is the most sensitive part of the EKS upgrade process. Nodes must run a Kubernetes version compatible with the control plane: the kubelet may be at most three minor versions behind the API server.

Managed Node Group Upgrade

If you are using managed node groups, AWS handles most of the upgrade process.

aws eks update-nodegroup-version \
--cluster-name <cluster-name> \
--nodegroup-name <nodegroup-name>

AWS will:

  • Create new nodes
  • Move workloads
  • Terminate old nodes
  • Maintain availability
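A sketch for tracking this, assuming the AWS CLI and kubectl are configured: wait for the node group update to complete, then list node versions to confirm the roll finished. Names are placeholders.

```shell
# Block until the managed node group is ACTIVE again, then show the
# kubelet version each node reports.
wait_for_nodegroup() {
  local cluster=$1 nodegroup=$2
  aws eks wait nodegroup-active \
    --cluster-name "$cluster" \
    --nodegroup-name "$nodegroup"
  kubectl get nodes -o wide
}

# Usage:
# wait_for_nodegroup <cluster-name> <nodegroup-name>
```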

Self-Managed Node Upgrade

With self-managed nodes, you must upgrade the AMI and the node group yourself.

1. Create New Launch Template

Update:

  • AMI
  • Kubernetes version
  • User data

2. Create New Nodes

eksctl create nodegroup --cluster <cluster-name> --name <new-nodegroup-name>

3. Move Workloads

kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node>
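The three commands above can be wrapped into a loop over every node in the old group. A sketch; the label selector is an assumption (managed node groups label nodes with eks.amazonaws.com/nodegroup, so adjust it to match how your old group is labeled):

```shell
# Cordon, drain, and delete each node matching the given selector,
# one node at a time.
drain_old_nodes() {
  local selector=$1
  for node in $(kubectl get nodes -l "$selector" -o name); do
    kubectl cordon "$node"
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
    kubectl delete "$node"
  done
}

# Usage:
# drain_old_nodes "eks.amazonaws.com/nodegroup=<old-nodegroup-name>"
```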

Challenges You Might Face During an EKS Upgrade

While upgrading an Amazon Elastic Kubernetes Service (EKS) cluster, one of my biggest learnings was that upgrades are rarely smooth in real production environments. Several challenges can arise, especially at the node level, where running workloads may block the upgrade process.

Here are some common challenges and how to handle them.

1. Workloads Running on a Node Blocking the Node Upgrade
One of the most common issues occurs when a node cannot be drained because some workloads are still running on it.

Problem

During node upgrade, when executing:

kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

You may see errors like:

  • Pod cannot be evicted
  • Pod disruption budget prevents eviction
  • Node drain failed
  • Pod stuck in Terminating state

This happens because Kubernetes is trying to protect running applications from downtime.

Common Reasons

Pod Disruption Budget (PDB)

If a Pod Disruption Budget is configured, Kubernetes will not allow pods to be terminated beyond the defined limit.

Example:

kubectl get pdb -A

If the PDB leaves no room for further disruptions, the drain will fail.

Solution

Temporarily relax or remove the PDB, and restore it once the drain completes.

kubectl delete pdb <pdb-name>

or update the PDB to allow full disruption:

minAvailable: 0

Then retry the drain.
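If you delete the PDB, save its manifest first so the budget can be restored after the drain. A sketch, assuming you are allowed to drop the budget briefly; name, namespace, and node are placeholders:

```shell
# Back up and remove a PDB, drain the node, then restore the budget.
drain_without_pdb() {
  local pdb=$1 ns=$2 node=$3
  kubectl get pdb "$pdb" -n "$ns" -o yaml > pdb-backup.yaml
  kubectl delete pdb "$pdb" -n "$ns"
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  kubectl apply -f pdb-backup.yaml
}
```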

2. Pods Using Local Storage (emptyDir)

Problem

Some pods use emptyDir or local storage.

Kubernetes blocks node drain to prevent data loss.

Error example:

cannot delete Pods with local storage

Solution

Use:

kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

This allows drain to proceed.

Always verify application behavior before doing this: data in emptyDir volumes is lost when the pod is evicted.

3. DaemonSets Preventing Node Drain

Problem

DaemonSets run on every node.

Examples:

  • CloudWatch agent
  • Monitoring agents
  • Security agents
  • Log collectors

Drain fails because DaemonSets cannot be evicted.

Solution

Use:

kubectl drain <node> --ignore-daemonsets

DaemonSets will automatically run on new nodes.

4. Stateful Applications Blocking Upgrade

Problem

Stateful applications like:

  • Databases
  • Kafka
  • Airflow workers
  • Elasticsearch

may not move easily.

Drain may take a long time or fail.

Solution

Upgrade in controlled steps:

  • Scale replicas
  • Move pods manually
  • Drain one node at a time
  • Monitor application health

Example:

kubectl get pods -o wide
kubectl cordon <node>
kubectl drain <node>
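The one-node-at-a-time approach can be sketched as a small helper that drains a node and then waits for the application's pods to become Ready again before you touch the next node. The label selector is an assumption; use whatever labels your stateful workload carries:

```shell
# Drain a single node, then wait for the selected pods to be Ready
# before proceeding to the next node.
drain_and_verify() {
  local node=$1 selector=$2
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=10m
  kubectl wait --for=condition=Ready pod -l "$selector" --timeout=10m
}

# Usage:
# drain_and_verify <node> "app=kafka"
```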

5. Add-on Not Compatible with New Kubernetes Version

Problem

After control plane upgrade, some add-ons stop working.

Example:

  • VPC CNI crash
  • CoreDNS not running
  • Metrics server failure

Solution

Upgrade add-ons immediately.

aws eks update-addon \
--cluster-name <cluster-name> \
--addon-name coredns \
--addon-version <version>

Always check compatibility before upgrading.

6. Node Group Upgrade Taking Too Long

Problem

Managed node group upgrade sometimes takes longer than expected.

Reasons:

  • Large number of pods
  • PDB restrictions
  • Insufficient capacity
  • Scaling delays

Solution

Monitor upgrade:

kubectl get nodes
kubectl get pods -A
kubectl get events -A

Check AWS Console for node group update status.

Ensure:

  • Enough capacity
  • Proper scaling
  • No stuck pods

If you have any questions or would like to share your experience with Amazon Elastic Kubernetes Service (EKS) upgrades, feel free to drop a comment below or connect with me on LinkedIn:
https://www.linkedin.com/in/shubham-kumar1807/

I would be happy to discuss ideas, challenges and real-world upgrade strategies with fellow cloud and Kubernetes enthusiasts. 🚀
