Cheedge Lee

Kubernetes Cluster & Node-Related Issues

1. Node

1.1 Node NotReady

A Kubernetes cluster node being in the NotReady state can result from various issues. Here are some realistic and common reasons:

1. Node Resource Issues

  • Insufficient Memory or CPU: If the node is running out of memory or CPU resources, the kubelet may mark the node as NotReady.
  • Disk Pressure: The node's disk usage may be too high, causing the kubelet to mark it as NotReady.
    • Example: kubectl describe node <node-name> shows DiskPressure under conditions.
  • Network Problems: High latency or dropped packets between the node and the control plane can prevent status updates, causing readiness issues.
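
To see which of these pressure conditions the kubelet has reported, you can print the node's conditions directly (the node name is a placeholder):

```shell
# print each node condition (MemoryPressure, DiskPressure, PIDPressure, Ready) with its status
kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
```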

2. kubelet Problems

  • kubelet Down: The kubelet service on the node is not running or has crashed.
  • Certificate Issues: The kubelet's certificate might have expired, causing it to fail authentication with the kube-apiserver.
  • Configuration Errors: Misconfigured kubelet flags (e.g., wrong --cluster-dns, --api-servers) can lead to connectivity issues.
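
On kubeadm-provisioned clusters, a quick way to check for expired certificates is the following (the certificate path is the kubeadm default):

```shell
# expiry date of the kubelet's client certificate (default kubeadm path)
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -enddate

# expiry of all control plane certificates (run on a control plane node)
sudo kubeadm certs check-expiration
```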

3. Networking Issues

  • Node Network Unreachable: The node cannot communicate with the control plane or other nodes.
  • CNI Plugin Failure: Issues with the Container Network Interface (CNI) plugin (e.g., Calico, Flannel, Weave) may disrupt pod-to-pod or node-to-node communication.
  • Firewall Rules: A firewall or security group blocking Kubernetes-related traffic (e.g., ports 6443, 10250) can cause the node to go NotReady.
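
A quick way to test whether those ports are reachable is netcat (addresses are placeholders):

```shell
# kube-apiserver port, probed from the affected node
nc -zv <control-plane-ip> 6443
# kubelet API port, probed from a control plane node
nc -zv <node-ip> 10250
```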

4. Control Plane Connectivity Issues

  • kube-apiserver Unreachable: The node cannot reach the API server due to network partitioning or DNS resolution issues.
  • etcd Problems: If the control plane's etcd database is down or unhealthy, the API server might not respond to node heartbeats.
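
If the API server is reachable at all, its readiness endpoint reports the status of its individual checks, including etcd, which helps distinguish these two cases:

```shell
# verbose readiness report from the kube-apiserver, with a line per check (including etcd)
kubectl get --raw='/readyz?verbose'
```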

5. Component Issues

  • Container Runtime Failure: The container runtime (e.g., Docker, containerd, CRI-O) is not running or is misconfigured.
  • kube-proxy Failure: The kube-proxy component on the node is not functioning correctly, disrupting node communication.
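
A sketch of checking both components, assuming containerd as the runtime and a kubeadm-style kube-proxy DaemonSet:

```shell
# container runtime health
systemctl status containerd --no-pager
sudo crictl info

# kube-proxy pods (k8s-app=kube-proxy is the label used on kubeadm-managed clusters)
kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide
```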

6. Other Reasons

There are many other possible causes, such as hardware failures, missing configuration files, or a kubelet version mismatch with the control plane.

1.1.1 How to Debug a NotReady Node

  1. Check node conditions and kubelet status first:

```
kubectl describe node <node-name>
systemctl status kubelet
# if it is down or inactive, restart it
sudo systemctl restart kubelet
```
  2. Check logs:
    • kubelet logs, or the container runtime logs (e.g., Docker):

```
journalctl -u kubelet
journalctl -u docker
```
  3. Verify network connectivity:
    • Ping the control plane API server, or query its health endpoint directly.
    • Check the CNI plugin logs.

```
ping <control-plane-ip>
curl -k https://<control-plane-ip>:6443/healthz
```
  4. Inspect resource usage:

```
top
df -h
# or directly use kubectl top to get the name of the busiest node
kubectl top node --sort-by='cpu' | awk 'NR==2 {print $1}'
```

Throughout this article, we use systemd commands and assume an Ubuntu system.

1.2 Cordon and Drain Nodes

```
kubectl cordon NODENAME
kubectl drain NODENAME
kubectl uncordon NODENAME
```

1.2.1 kubectl cordon NODENAME

  • Purpose: Marks a node as unschedulable. This prevents new pods from being scheduled on that node.  
  • Effect on existing pods: Existing pods continue to run on the node.  
  • Use case: temporarily prevent new workloads from being placed on a node, perhaps for investigation or minor maintenance, without disrupting existing applications.

NOTICE: if you have specified nodeName: <node_name> in the pod spec, the pod can still be placed on a cordoned node, because:

  • Cordoning a node only marks it unschedulable for the scheduler: "don't place any new pods on this node."
  • nodeName in the pod spec bypasses the scheduler entirely: the pod is bound directly to the named node.
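
You can verify the effect of cordoning by inspecting the node's spec.unschedulable field (the node name is a placeholder):

```shell
kubectl cordon <node-name>
# prints "true" once the node is cordoned
kubectl get node <node-name> -o jsonpath='{.spec.unschedulable}'
```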

1.2.2 kubectl drain NODENAME

  • Purpose: Evicts all pods from a node and marks it as unschedulable.  
  • Effect on existing pods: Gracefully terminates pods running on the node.
  • Use case: perform more significant maintenance on a node, such as kernel updates or hardware replacement.
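
In practice, drain usually needs extra flags, since DaemonSet pods cannot be evicted and emptyDir data is lost on eviction; a typical invocation (flag values are examples, adjust to your workloads) looks like:

```shell
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --grace-period=60
```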

2. Cluster

2.1 Update

This section follows the official Kubernetes upgrade documentation.

0. Check available versions

```
sudo apt update
sudo apt-cache madison kubeadm
```

1. Update kubeadm

```
# change the version as you need
sudo apt-mark unhold kubeadm && \
sudo apt-get update && sudo apt-get install -y kubeadm='1.32.x-*' && \
sudo apt-mark hold kubeadm
kubeadm version
```

1.1 On the control plane node

```
# kubeadm upgrade apply requires the target version, e.g. v1.32.x
sudo kubeadm upgrade apply v1.32.x
```

1.2 On other nodes

```
sudo kubeadm upgrade node
```

2. Drain the node

```
kubectl drain <node-to-drain> --ignore-daemonsets
```

If you are upgrading another node, first SSH to the control plane node, drain the target node from there, then SSH back to the node being upgraded.

3. Update kubelet and kubectl

```
# change the version as you need
sudo apt-mark unhold kubelet kubectl && \
sudo apt-get update && sudo apt-get install -y kubelet='1.32.x-*' kubectl='1.32.x-*' && \
sudo apt-mark hold kubelet kubectl
```

Then restart kubelet

```
sudo systemctl daemon-reload
sudo systemctl restart kubelet
```

4. Uncordon the node

```
kubectl uncordon <node-to-uncordon>
```

Similarly, if you are on another node, SSH back to the control plane node and run the uncordon there.
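
Finally, it is worth confirming that every node now reports the expected version:

```shell
# the VERSION column should show the upgraded kubelet version on each node
kubectl get nodes -o wide
kubelet --version
```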
