Cheedge Lee

Kubernetes Cluster & Node-Related Issues

1. Node

1.1 Node NotReady

A Kubernetes cluster node being in the NotReady state can result from various issues. Here are some realistic and common reasons:

1. Node Resource Issues

  • Insufficient Memory or CPU: If the node is running out of memory or CPU resources, the kubelet may mark the node as NotReady.
  • Disk Pressure: The node's disk usage may be too high, causing the kubelet to mark it as NotReady.
    • Example: kubectl describe node <node-name> shows DiskPressure under conditions.
  • Network Problems: High latency or dropped packets between the node and the control plane can prevent status updates, causing readiness issues.
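
To see which of these pressure conditions the kubelet has reported, you can print the node's conditions directly (the node name is a placeholder):

```shell
# print each node condition (MemoryPressure, DiskPressure, PIDPressure, Ready) with its status
kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
```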

2. kubelet Problems

  • kubelet Down: The kubelet service on the node is not running or has crashed.
  • Certificate Issues: The kubelet's certificate might have expired, causing it to fail authentication with the kube-apiserver.
  • Configuration Errors: Misconfigured kubelet flags (e.g., wrong --cluster-dns, --api-servers) can lead to connectivity issues.
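
On kubeadm-provisioned clusters, a quick way to check for expired certificates is the following (the certificate path is the kubeadm default):

```shell
# expiry date of the kubelet's client certificate (default kubeadm path)
sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-client-current.pem -noout -enddate

# expiry of all control plane certificates (run on a control plane node)
sudo kubeadm certs check-expiration
```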

3. Networking Issues

  • Node Network Unreachable: The node cannot communicate with the control plane or other nodes.
  • CNI Plugin Failure: Issues with the Container Network Interface (CNI) plugin (e.g., Calico, Flannel, Weave) may disrupt pod-to-pod or node-to-node communication.
  • Firewall Rules: A firewall or security group blocking Kubernetes-related traffic (e.g., ports 6443, 10250) can cause the node to go NotReady.
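
A quick way to test whether those ports are reachable is netcat (addresses are placeholders):

```shell
# kube-apiserver port, probed from the affected node
nc -zv <control-plane-ip> 6443
# kubelet API port, probed from a control plane node
nc -zv <node-ip> 10250
```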

4. Control Plane Connectivity Issues

  • kube-apiserver Unreachable: The node cannot reach the API server due to network partitioning or DNS resolution issues.
  • etcd Problems: If the control plane's etcd database is down or unhealthy, the API server might not respond to node heartbeats.
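
If the API server is reachable at all, its readiness endpoint reports the status of its individual checks, including etcd, which helps distinguish these two cases:

```shell
# verbose readiness report from the kube-apiserver, with a line per check (including etcd)
kubectl get --raw='/readyz?verbose'
```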

5. Component Issues

  • Container Runtime Failure: The container runtime (e.g., Docker, containerd, CRI-O) is not running or is misconfigured.
  • kube-proxy Failure: The kube-proxy component on the node is not functioning correctly, disrupting node communication.
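
A sketch of checking both components, assuming containerd as the runtime and a kubeadm-style kube-proxy DaemonSet:

```shell
# container runtime health
systemctl status containerd --no-pager
sudo crictl info

# kube-proxy pods (k8s-app=kube-proxy is the label used on kubeadm-managed clusters)
kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide
```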

6. Other Reasons

There are many other possible causes, such as hardware failures, missing configuration files, or a kubelet version mismatch with the control plane.

1.1.1 How to Debug a NotReady Node

  1. Check node conditions and kubelet status first:

```
kubectl describe node <node-name>
systemctl status kubelet
# if it is down or inactive, restart it
sudo systemctl restart kubelet
```
  2. Check logs:
    • kubelet logs, or the container runtime logs (e.g., Docker):

```
journalctl -u kubelet
journalctl -u docker
```
  3. Verify network connectivity:
    • Ping the control plane API server, or query its health endpoint directly.
    • Check the CNI plugin logs.

```
ping <control-plane-ip>
curl -k https://<control-plane-ip>:6443/healthz
```
  4. Inspect resource usage:

```
top
df -h
# or directly use kubectl top to get the name of the busiest node
kubectl top node --sort-by='cpu' | awk 'NR==2 {print $1}'
```

Throughout this article, we use systemd commands and assume an Ubuntu system.

1.2 Cordon and Drain Nodes

```
kubectl cordon NODENAME
kubectl drain NODENAME
kubectl uncordon NODENAME
```

1.2.1 kubectl cordon NODENAME

  • Purpose: Marks a node as unschedulable. This prevents new pods from being scheduled on that node.  
  • Effect on existing pods: Existing pods continue to run on the node.  
  • Use case: temporarily prevent new workloads from being placed on a node, perhaps for investigation or minor maintenance, without disrupting existing applications.

NOTICE: if you have specified nodeName: <node_name> in the pod spec, the pod can still be placed on a cordoned node, because:

  • Cordoning a node only marks it unschedulable for the scheduler: "don't place any new pods on this node."
  • nodeName in the pod spec bypasses the scheduler entirely: the pod is bound directly to the named node.
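
You can verify the effect of cordoning by inspecting the node's spec.unschedulable field (the node name is a placeholder):

```shell
kubectl cordon <node-name>
# prints "true" once the node is cordoned
kubectl get node <node-name> -o jsonpath='{.spec.unschedulable}'
```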

1.2.2 kubectl drain NODENAME

  • Purpose: Evicts all pods from a node and marks it as unschedulable.  
  • Effect on existing pods: Gracefully terminates pods running on the node.
  • Use case: perform more significant maintenance on a node, such as kernel updates or hardware replacement.
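
In practice, drain usually needs extra flags, since DaemonSet pods cannot be evicted and emptyDir data is lost on eviction; a typical invocation (flag values are examples, adjust to your workloads) looks like:

```shell
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --grace-period=60
```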

2. Cluster

2.1 Update

This section follows the official Kubernetes upgrade documentation.

0. Check available versions

```
sudo apt update
sudo apt-cache madison kubeadm
```

1. Update kubeadm

```
# change the version as you need
sudo apt-mark unhold kubeadm && \
sudo apt-get update && sudo apt-get install -y kubeadm='1.32.x-*' && \
sudo apt-mark hold kubeadm
kubeadm version
```

1.1 On the control plane node

```
# kubeadm upgrade apply requires the target version, e.g. v1.32.x
sudo kubeadm upgrade apply v1.32.x
```

1.2 On other nodes

```
sudo kubeadm upgrade node
```

2. Drain the node

```
kubectl drain <node-to-drain> --ignore-daemonsets
```

If you are upgrading another node, first SSH to the control plane node, drain the target node from there, then SSH back to the node being upgraded.

3. Update kubelet and kubectl

```
# change the version as you need
sudo apt-mark unhold kubelet kubectl && \
sudo apt-get update && sudo apt-get install -y kubelet='1.32.x-*' kubectl='1.32.x-*' && \
sudo apt-mark hold kubelet kubectl
```

Then restart kubelet

```
sudo systemctl daemon-reload
sudo systemctl restart kubelet
```

4. Uncordon the node

```
kubectl uncordon <node-to-uncordon>
```

Similarly, if you are on another node, SSH back to the control plane node and run the uncordon there.
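
Finally, it is worth confirming that every node now reports the expected version:

```shell
# the VERSION column should show the upgraded kubelet version on each node
kubectl get nodes -o wide
kubelet --version
```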
