Shubham Kumar
Upgrading an Amazon EKS Cluster: A Practical Step-by-Step Guide from Real-World Experience

Upgrading a Kubernetes cluster is one of the most critical operational tasks in any cloud-native environment. A poorly planned upgrade can lead to application downtime, add-on incompatibility or even cluster instability. On the other hand, a well-executed upgrade ensures better security, improved performance and access to the latest Kubernetes features.

Recently, I performed an upgrade of an Amazon Elastic Kubernetes Service (EKS) cluster in a production-like environment and in this article, I want to share my practical understanding and step-by-step approach to help others perform this activity safely and confidently.

This guide walks through the complete EKS upgrade process, including prerequisites, backup strategy, control plane upgrade, add-on upgrades, node upgrades and the cordon and drain process.

Why Upgrading EKS Is Important

Kubernetes versions are released frequently and AWS supports only a limited number of Kubernetes versions at a time. Running an outdated cluster can lead to:

  • Security vulnerabilities
  • Unsupported add-ons
  • Compatibility issues
  • Limited feature availability
  • Compliance risks

Step 1: Prerequisites
Before starting the upgrade, it is important to validate the current cluster state and ensure all required tools are ready.

Check Current Cluster Version

kubectl version
kubectl get nodes
aws eks describe-cluster --name <cluster-name> --query "cluster.version"

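Beyond checking the current version, it is worth verifying that your nodes are not too far behind the version you are upgrading to. A minimal sketch of that check, using the Kubernetes skew policy (the kubelet may trail the API server by up to three minor versions); the version strings below are examples, so feed in the values reported by the commands above:

```shell
# Extract the minor version from a "major.minor" string.
minor() { echo "$1" | cut -d. -f2; }

# Return success if the node's kubelet is within the supported skew
# (at most three minor versions behind the target control plane).
skew_ok() {
  local target=$1 node=$2
  [ $(( $(minor "$target") - $(minor "$node") )) -le 3 ]
}

skew_ok "1.30" "1.28" && echo "skew OK"
skew_ok "1.30" "1.26" || echo "upgrade nodes in stages first"
```

If the skew check fails, upgrade the nodes in stages before moving the control plane further ahead.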

Step 2: Backup (Most Critical Step)

Before upgrading any EKS cluster, always take a backup. This ensures you can recover quickly in case something goes wrong.

Backup Kubernetes Resources

kubectl get all -A -o yaml > cluster-backup.yaml
kubectl get pvc -A -o yaml > cluster-backup-pvc.yaml
kubectl get pv -o yaml > cluster-backup-pv.yaml
aws eks list-addons --cluster-name <cluster-name> > cluster-backup-addons.json


You can also use the AWS Backup service to take a backup from the 'Update and Backup History' tab in the AWS Console.

Step 3: Control Plane Upgrade
The control plane is the brain of the Kubernetes cluster. In EKS, the control plane upgrade is managed by AWS and does not affect running workloads. Note that EKS upgrades the control plane only one minor version at a time, so if you are several versions behind, repeat the process once per version.

Upgrade Control Plane

aws eks update-cluster-version \
--name <cluster-name> \
--kubernetes-version 1.30

You can also trigger the upgrade with a single click from the AWS Console.

Typically, this process takes around 10–15 minutes.

Once completed, the control plane will run on the new Kubernetes version.
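A minimal sketch for confirming this, assuming the AWS CLI is configured: block until the cluster status returns to ACTIVE, then print the new version. The cluster name is a placeholder.

```shell
# Wait for the control-plane upgrade to finish, then report the
# Kubernetes version the cluster is now running.
wait_for_cluster() {
  local name=$1
  aws eks wait cluster-active --name "$name"
  aws eks describe-cluster --name "$name" \
    --query "cluster.version" --output text
}

# Usage (with your cluster name):
# wait_for_cluster <cluster-name>
```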

Step 4: Add-ons Upgrade
After upgrading the control plane, the next step is upgrading EKS add-ons.

Refer to the cluster-backup-addons.json file created in Step 2 and run the following command for each add-on listed in the file.

aws eks update-addon \
--cluster-name <cluster-name> \
--addon-name vpc-cni \
--addon-version <version>
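To pick a valid value for <version>, you can ask EKS which add-on versions are published for the target Kubernetes version. A sketch, with placeholder names; the first entry returned is typically the newest, but verify it before applying:

```shell
# List add-on versions compatible with the given Kubernetes version.
compatible_addon_versions() {
  local addon=$1 k8s=$2
  aws eks describe-addon-versions \
    --addon-name "$addon" \
    --kubernetes-version "$k8s" \
    --query "addons[0].addonVersions[].addonVersion" \
    --output text
}

# Usage:
# compatible_addon_versions vpc-cni 1.30
```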

Step 5: Node Upgrade

Node upgrade is the most sensitive part of the EKS upgrade process. Nodes must run a Kubernetes version compatible with the control plane: the kubelet may be at most three minor versions behind the API server.

Managed Node Group Upgrade

If you are using managed node groups, AWS handles most of the upgrade process.

aws eks update-nodegroup-version \
--cluster-name <cluster-name> \
--nodegroup-name <nodegroup-name>

AWS will:

  • Create new nodes
  • Move workloads
  • Terminate old nodes
  • Maintain availability
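A sketch for tracking this, assuming the AWS CLI and kubectl are configured: wait for the node group update to complete, then list node versions to confirm the roll finished. Names are placeholders.

```shell
# Block until the managed node group is ACTIVE again, then show the
# kubelet version each node reports.
wait_for_nodegroup() {
  local cluster=$1 nodegroup=$2
  aws eks wait nodegroup-active \
    --cluster-name "$cluster" \
    --nodegroup-name "$nodegroup"
  kubectl get nodes -o wide
}

# Usage:
# wait_for_nodegroup <cluster-name> <nodegroup-name>
```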

Self-Managed Node Upgrade

With self-managed nodes, you must upgrade the AMI and the node group yourself.

1. Create New Launch Template

Update:

  • AMI
  • Kubernetes version
  • User data

2. Create New Nodes

eksctl create nodegroup --cluster <cluster-name> --name <new-nodegroup-name>

3. Move Workloads

kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node>
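The three commands above can be wrapped into a loop over every node in the old group. A sketch; the label selector is an assumption (managed node groups label nodes with eks.amazonaws.com/nodegroup, so adjust it to match how your old group is labeled):

```shell
# Cordon, drain, and delete each node matching the given selector,
# one node at a time.
drain_old_nodes() {
  local selector=$1
  for node in $(kubectl get nodes -l "$selector" -o name); do
    kubectl cordon "$node"
    kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
    kubectl delete "$node"
  done
}

# Usage:
# drain_old_nodes "eks.amazonaws.com/nodegroup=<old-nodegroup-name>"
```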

Challenges You Might Face During an EKS Upgrade

While upgrading an Amazon Elastic Kubernetes Service (EKS) cluster, one of my biggest learnings was that upgrades are rarely smooth in real production environments. Several challenges can arise, especially at the node level, where running workloads may block the upgrade process.

Here are some common challenges and how to handle them.

1. Workloads Running on a Node Blocking the Node Upgrade
One of the most common issues occurs when a node cannot be drained because some workloads are still running on it.

Problem

During node upgrade, when executing:

kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

You may see errors like:

  • Pod cannot be evicted
  • Pod disruption budget prevents eviction
  • Node drain failed
  • Pod stuck in Terminating state

This happens because Kubernetes is trying to protect running applications from downtime.

Common Reasons

Pod Disruption Budget (PDB)

If a Pod Disruption Budget is configured, Kubernetes will not allow pods to be terminated beyond the defined limit.

Example:

kubectl get pdb -A

If the PDB leaves no room for further disruptions, the drain will fail.

Solution

Temporarily relax or remove the PDB, and restore it once the drain completes.

kubectl delete pdb <pdb-name>

or update the PDB to allow full disruption:

minAvailable: 0

Then retry the drain.
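If you delete the PDB, save its manifest first so the budget can be restored after the drain. A sketch, assuming you are allowed to drop the budget briefly; name, namespace, and node are placeholders:

```shell
# Back up and remove a PDB, drain the node, then restore the budget.
drain_without_pdb() {
  local pdb=$1 ns=$2 node=$3
  kubectl get pdb "$pdb" -n "$ns" -o yaml > pdb-backup.yaml
  kubectl delete pdb "$pdb" -n "$ns"
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  kubectl apply -f pdb-backup.yaml
}
```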

2. Pods Using Local Storage (emptyDir)

Problem

Some pods use emptyDir or local storage.

Kubernetes blocks node drain to prevent data loss.

Error example:

cannot delete Pods with local storage

Solution

Use:

kubectl drain <node> --ignore-daemonsets --delete-emptydir-data

This allows drain to proceed.

Always verify application behavior before doing this: data in emptyDir volumes is lost when the pod is evicted.

3. DaemonSets Preventing Node Drain

Problem

DaemonSets run on every node.

Examples:

  • CloudWatch agent
  • Monitoring agents
  • Security agents
  • Log collectors

Drain fails because DaemonSets cannot be evicted.

Solution

Use:

kubectl drain <node> --ignore-daemonsets

DaemonSets will automatically run on new nodes.

4. Stateful Applications Blocking Upgrade

Problem

Stateful applications like:

  • Databases
  • Kafka
  • Airflow workers
  • Elasticsearch

may not move easily.

Drain may take a long time or fail.

Solution

Upgrade in controlled steps:

  • Scale replicas
  • Move pods manually
  • Drain one node at a time
  • Monitor application health

Example:

kubectl get pods -o wide
kubectl cordon <node>
kubectl drain <node>
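The one-node-at-a-time approach can be sketched as a small helper that drains a node and then waits for the application's pods to become Ready again before you touch the next node. The label selector is an assumption; use whatever labels your stateful workload carries:

```shell
# Drain a single node, then wait for the selected pods to be Ready
# before proceeding to the next node.
drain_and_verify() {
  local node=$1 selector=$2
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=10m
  kubectl wait --for=condition=Ready pod -l "$selector" --timeout=10m
}

# Usage:
# drain_and_verify <node> "app=kafka"
```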

5. Add-on Not Compatible with New Kubernetes Version

Problem

After control plane upgrade, some add-ons stop working.

Example:

  • VPC CNI crash
  • CoreDNS not running
  • Metrics server failure

Solution

Upgrade add-ons immediately.

aws eks update-addon \
--cluster-name <cluster-name> \
--addon-name coredns \
--addon-version <version>

Always check compatibility before upgrading.

6. Node Group Upgrade Taking Too Long

Problem

Managed node group upgrade sometimes takes longer than expected.

Reasons:

  • Large number of pods
  • PDB restrictions
  • Insufficient capacity
  • Scaling delays

Solution

Monitor upgrade:

kubectl get nodes
kubectl get pods -A
kubectl get events -A

Check AWS Console for node group update status.

Ensure:

  • Enough capacity
  • Proper scaling
  • No stuck pods

If you have any questions or would like to share your experience with Amazon Elastic Kubernetes Service (EKS) upgrades, feel free to drop a comment below or connect with me on LinkedIn:
https://www.linkedin.com/in/shubham-kumar1807/

I would be happy to discuss ideas, challenges and real-world upgrade strategies with fellow cloud and Kubernetes enthusiasts. 🚀
