DEV Community

MakendranG for Kubernetes Community Days Chennai


How automatic repair works in AKS

AKS continuously monitors the health of worker nodes and automatically repairs them when they become unhealthy. The underlying Azure virtual machine (VM) platform, in turn, performs maintenance on VMs that are experiencing issues.

AKS and Azure VMs work together to minimize service disruptions for clusters.

In this article, we'll learn how the automatic repair function works for both Windows and Linux nodes.

How AKS checks for unhealthy nodes

The following rules are used by AKS to determine if there is a problem with nodes.

  • The NotReady status is reported on consecutive checks.

  • No status is reported within 10 minutes.
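The two rules above can be sketched in a few lines of Python. This is only an illustration of the logic, not how AKS implements it: the function name `is_unhealthy`, the shape of the `reports` list, and the choice of two checks for "consecutive" are all assumptions made for the example.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)  # the 10-minute window named in the rules


def is_unhealthy(reports, now, min_consecutive=2):
    """Apply the two rules to a node's status history.

    reports: list of (timestamp, status) tuples, oldest first (hypothetical shape).
    min_consecutive: how many NotReady checks in a row count as "consecutive";
    the value 2 is an assumption, the article does not give a number.
    """
    recent = [(t, s) for t, s in reports if now - t <= WINDOW]
    if not recent:
        # Rule 2: no status was reported within 10 minutes.
        return True
    # Rule 1: the most recent checks all reported NotReady.
    tail = recent[-min_consecutive:]
    return len(tail) >= min_consecutive and all(s == "NotReady" for _, s in tail)


now = datetime(2024, 1, 1, 12, 0)
silent_node = [(now - timedelta(minutes=15), "Ready")]
print(is_unhealthy(silent_node, now))  # True: last report is older than 10 minutes
```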

We can check the health state of our nodes manually with kubectl.

kubectl get nodes
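If we want to filter the result in a script, the same information is available as JSON from `kubectl get nodes -o json`. The sketch below is one possible way to pick out nodes whose `Ready` condition is not `True`; the function name `not_ready_nodes` and the node names in the sample are made up for illustration.

```python
import json


def not_ready_nodes(nodes_json):
    """Return the names of nodes whose Ready condition is missing or not "True".

    nodes_json: the text printed by `kubectl get nodes -o json`.
    """
    names = []
    for node in json.loads(nodes_json).get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        if ready is None or ready.get("status") != "True":
            names.append(node["metadata"]["name"])
    return names


# Hypothetical sample with one healthy and one unhealthy node.
sample = json.dumps({"items": [
    {"metadata": {"name": "node-a"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "node-b"},
     "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
]})
print(not_ready_nodes(sample))  # ['node-b']
```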

How automatic repair works

If a node remains unhealthy for 10 minutes, AKS takes the following actions:

  1. Reboot the node.
  2. If the reboot is not successful, reimage the node.
  3. If the reimage is not successful, redeploy the node.

If none of these actions succeeds, AKS engineers investigate alternative remediations.

If AKS finds multiple unhealthy nodes during a health check, it repairs them one at a time.
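The escalation described in this section (reboot, then reimage, then redeploy, one node at a time) can be modelled as a simple loop. This is a sketch of the ordering only. The names `repair_node`, `repair_all`, and the `try_action` callback are hypothetical, and nothing here calls a real Azure API.

```python
REPAIR_STEPS = ("reboot", "reimage", "redeploy")  # order taken from the list above


def repair_node(node, try_action):
    """Try each repair step in order and return the first one that succeeds.

    try_action(node, step) is a hypothetical callback returning True on success.
    Returns None when every step fails, which is the point where the article
    says AKS engineers step in.
    """
    for step in REPAIR_STEPS:
        if try_action(node, step):
            return step
    return None


def repair_all(nodes, try_action):
    """Repair unhealthy nodes one after another, never in parallel."""
    return {node: repair_node(node, try_action) for node in nodes}


# Simulated outcomes: node-a recovers after a reboot, node-b only after a redeploy.
fixed_by = {"node-a": "reboot", "node-b": "redeploy"}
result = repair_all(["node-a", "node-b", "node-c"],
                    lambda node, step: fixed_by.get(node) == step)
print(result)  # {'node-a': 'reboot', 'node-b': 'redeploy', 'node-c': None}
```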

Node Autodrain

Scheduled Events can occur on the underlying virtual machines in any of our node pools. For spot node pools, scheduled events may additionally cause a preempt node event for the node.

Certain events, such as preempt, cause AKS to attempt a cordon and drain of the affected nodes, which allows for a graceful rescheduling of any affected workload on that node.

When this happens, we might notice that the node receives a taint with the key "remediator.aks.microsoft.com/unschedulable" because of "kubernetes.azure.com/scalesetpriority: spot".
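To see which nodes currently carry that taint, we could inspect the `spec.taints` field of each node in the output of `kubectl get nodes -o json`. The following is a minimal sketch under that assumption; `nodes_with_taint` is a hypothetical helper, and the sample data is invented.

```python
REMEDIATOR_TAINT = "remediator.aks.microsoft.com/unschedulable"  # key quoted in the article


def nodes_with_taint(nodes, taint_key=REMEDIATOR_TAINT):
    """Return the names of nodes that carry a taint with the given key.

    nodes: dict parsed from `kubectl get nodes -o json`.
    """
    return [
        node["metadata"]["name"]
        for node in nodes.get("items", [])
        # spec.taints is absent on untainted nodes, so fall back to an empty list.
        if any(t.get("key") == taint_key
               for t in node.get("spec", {}).get("taints") or [])
    ]


# Hypothetical sample: only the spot node has been tainted.
sample = {"items": [
    {"metadata": {"name": "spot-node-1"},
     "spec": {"taints": [{"key": REMEDIATOR_TAINT, "effect": "NoSchedule"}]}},
    {"metadata": {"name": "regular-node-1"}, "spec": {}},
]}
print(nodes_with_taint(sample))  # ['spot-node-1']
```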

The following table shows the node events and the action each one causes AKS to take.

Event | Description | Action
Freeze | The VM is scheduled to pause for a few seconds. CPU and network connectivity may be suspended, but there is no impact on memory or open files. | No action
Reboot | The VM is going to be rebooted. The non-persistent memory is lost. | No action
Redeploy | The VM is going to be redeployed. The ephemeral disks are lost. | Cordon and drain
Preempt | The spot VM is being deleted. The ephemeral disks are lost. | Cordon and drain
Terminate | The VM is going to be deleted. | Cordon and drain
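The table above can also be written as a lookup, which is handy if we want to reason about the events in our own tooling. This is just a restatement of the table in Python; the names `EVENT_ACTIONS` and `needs_drain` are hypothetical.

```python
CORDON_AND_DRAIN = "Cordon and drain"
NO_ACTION = "No action"

# Event type -> AKS action, copied from the table above.
EVENT_ACTIONS = {
    "Freeze": NO_ACTION,
    "Reboot": NO_ACTION,
    "Redeploy": CORDON_AND_DRAIN,
    "Preempt": CORDON_AND_DRAIN,
    "Terminate": CORDON_AND_DRAIN,
}


def needs_drain(event_type):
    """Return True if the event leads to a cordon and drain of the node.

    Raises KeyError for event types that are not listed in the table.
    """
    return EVENT_ACTIONS[event_type] == CORDON_AND_DRAIN


print(needs_drain("Preempt"))  # True
print(needs_drain("Freeze"))   # False
```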

Limitations

In many cases, AKS can determine that a node is unhealthy and attempt to repair it, but there are cases where AKS either can't repair the issue or can't detect that there is one. For example, AKS can't detect a problem if a node's status is not being reported because of an error in the network configuration, or if the node failed to register as a healthy node in the first place.

Thanks for reading my article to the end. I hope you learned something new today. If you enjoyed this article, please share it with your friends, and if you have suggestions or thoughts to share with me, please write them in the comment box.
