Prithiviraj R

Amazon EKS Cluster Reliability with Node Auto Repair

What is Node Auto Repair?

Node Auto Repair is a feature in Amazon EKS (Elastic Kubernetes Service) that helps maintain the health of your cluster by automatically identifying and replacing unhealthy nodes. When a node becomes unresponsive or fails health checks, Node Auto Repair terminates the faulty node and launches a new one to restore the cluster's capacity and functionality. This reduces manual intervention and helps keep your Kubernetes workloads available and reliable.


Benefits:
Reduced Downtime: Automatically replaces failed nodes without requiring manual intervention.

Improved Reliability: Ensures application workloads have sufficient healthy nodes to run on.

Operational Efficiency: Simplifies cluster maintenance by automating node recovery.

Use Cases:
Ensures high availability and reliability in Amazon EKS clusters.

Automatically detects and replaces unhealthy nodes.

Reduces manual intervention and operational overhead.

Minimizes downtime in production environments.

Maintains consistent application performance.

Step-by-Step Guide:
Step 1: Create the EKS Cluster Without Any Node Groups
Use the eksctl command to create an EKS cluster without a node group:

eksctl create cluster --name=eks-demo --region=eu-west-1 --without-nodegroup

Step 2: Create a Managed Node Group
Add a node group with the following command:

eksctl create nodegroup --name eks-demo-ng --cluster eks-demo --region eu-west-1 --nodes 2 --nodes-min 1 --nodes-max 3 --node-type t3.medium --enable-ssm


Flag Explanations:

--cluster: Specifies the existing EKS cluster.
--name: Names the node group.
--region: AWS region.
--nodes: Initial number of nodes.
--nodes-min and --nodes-max: Minimum and maximum number of nodes for auto-scaling.
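--node-type: EC2 instance type.
--enable-ssm: Enables AWS Systems Manager (SSM) on the nodes, so you can connect through Session Manager in Step 3.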

After creation, view the node group and its instances in the EC2 console.
To enable Node Auto Repair:
Go to the Node Group page in the EKS console.
Click "Edit" and enable Node Auto Repair.
Save changes.
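
The same setting can probably be applied from the command line instead of the console. The following is a minimal sketch, assuming a recent AWS CLI version that supports the --node-repair-config option of update-nodegroup-config. The function names enable_node_repair and show_node_repair are hypothetical helpers, and the cluster, node group, and region values are the ones used earlier in this guide.

```shell
# Sketch only: assumes an AWS CLI version that knows --node-repair-config.
CLUSTER="eks-demo"
NODEGROUP="eks-demo-ng"
REGION="eu-west-1"

# Hypothetical helper: turn on Node Auto Repair for the managed node group.
enable_node_repair() {
  aws eks update-nodegroup-config \
    --cluster-name "$CLUSTER" \
    --nodegroup-name "$NODEGROUP" \
    --region "$REGION" \
    --node-repair-config enabled=true
}

# Hypothetical helper: print the node group's current repair setting.
show_node_repair() {
  aws eks describe-nodegroup \
    --cluster-name "$CLUSTER" \
    --nodegroup-name "$NODEGROUP" \
    --region "$REGION" \
    --query 'nodegroup.nodeRepairConfig' \
    --output json
}
```

After pasting the functions into a shell, run enable_node_repair once, then show_node_repair to check that the setting was applied.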


Step 3: Simulate Node Failure
Stop the kubelet on a node:
Connect to the EC2 instance via Session Manager.
Run the following commands:

sudo systemctl stop kubelet

Check node status. After a short delay, the affected node should show NotReady:

kubectl get nodes
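
To pick out the affected node without reading the whole table, the status column can be filtered. This is a small sketch, assuming the default kubectl get nodes column layout (NAME first, STATUS second). The function name list_unready_nodes is a hypothetical helper.

```shell
# Hypothetical helper: print the names of nodes whose STATUS does not
# start with "Ready" (for example NotReady or Unknown).
# "Ready,SchedulingDisabled" still counts as Ready here.
list_unready_nodes() {
  kubectl get nodes --no-headers | awk '$2 !~ /^Ready/ { print $1 }'
}
```

Running list_unready_nodes prints one node name per line, or nothing if every node is Ready.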

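Optionally, terminate the instance. Note that the node group's Auto Scaling group replaces a terminated instance on its own, so to observe Node Auto Repair itself, leave the instance running with the kubelet stopped: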

aws ec2 terminate-instances --instance-ids <instance-id>
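
The instance ID can be looked up from the node name rather than the EC2 console. This is a minimal sketch, assuming the node's spec.providerID has the usual AWS form aws:///<availability-zone>/<instance-id>. Both function names are hypothetical helpers.

```shell
# Hypothetical helper: extract the EC2 instance ID from a provider ID such as
# aws:///eu-west-1a/i-0123456789abcdef0 (everything after the last slash).
instance_id_from_provider_id() {
  printf '%s\n' "${1##*/}"
}

# Hypothetical helper: look up the instance ID behind a Kubernetes node name.
instance_id_for_node() {
  provider_id="$(kubectl get node "$1" -o jsonpath='{.spec.providerID}')"
  instance_id_from_provider_id "$provider_id"
}
```

For example, instance_id_for_node <node-name> prints the ID to pass to --instance-ids.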

Step 4: Monitor Node Auto Repair
Observe a new instance being initialized automatically.

Confirm that the replacement node reaches the Ready status:

kubectl get nodes
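
Replacement can take several minutes, so polling is more convenient than rerunning the command by hand. The following is a rough sketch, assuming the default kubectl get nodes column layout. The function names, the retry count, and the delay are hypothetical values to adjust for your cluster.

```shell
# Hypothetical helper: count nodes whose STATUS starts with "Ready".
count_ready_nodes() {
  kubectl get nodes --no-headers | awk '$2 ~ /^Ready/ { n++ } END { print n + 0 }'
}

# Hypothetical helper: poll until at least $1 nodes are Ready.
# $2 = maximum number of checks (default 60), $3 = seconds between checks (default 30).
# Returns 0 once the target is reached, 1 if the checks run out.
wait_for_ready_count() {
  want="$1"
  tries="${2:-60}"
  delay="${3:-30}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    have="$(count_ready_nodes)"
    echo "Ready nodes: $have/$want"
    if [ "$have" -ge "$want" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

With the two-node group from Step 2, wait_for_ready_count 2 keeps checking until both nodes report Ready or the checks run out.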

Conclusion
Enabling Node Auto Repair in Amazon EKS helps keep a Kubernetes cluster robust and stable. It automatically replaces failing nodes, reduces downtime, and supports high availability and consistent performance, which makes it well suited to production environments.

Prithiviraj Rengarajan
DevOps Engineer
