Background
A few days ago, while developing the KubeRay project, I learned about a Kubernetes behavior from an issue's comment section. There are two types of eviction: Node-pressure Eviction and API-initiated Eviction. API-initiated Eviction is done by calling the API directly or by using commands like kubectl drain. Pods evicted this way are ultimately deleted and usually recreated on another node. With Node-pressure Eviction, however, the kubelet only sets the Pod's phase to Failed without deleting it. Therefore, if the controller does not handle this properly, the Pod will not be recreated on another node.
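To see what this looks like, a Pod removed by Node-pressure Eviction typically stays around with a Failed phase and an Evicted reason. A quick way to check (the Pod name is a placeholder, and the exact reason text can vary by Kubernetes version):

kubectl get pod <pod-name> -o jsonpath='{.status.phase} {.status.reason}'
# expected output, roughly: Failed Evicted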
Here's a brief overview of the issue: when a Pod created by the KubeRay operator is on a node with insufficient disk space, the Pod gets evicted. After the disk space is cleared, the Pod remains in a Failed state and is not recreated on another node.
Now I need to reproduce this issue. The key point is that, since the two types of eviction behave differently, I cannot use kubectl drain or similar commands to reproduce the scenario; I need to trigger an actual Node-pressure Eviction. However, I don't have a cluster to use; I do all my development on my personal computer, which makes the issue difficult to reproduce. When developing Kubernetes applications locally, most people use minikube, kind, or k3d. Since I need a multi-node environment, minikube is excluded: although it now supports multiple nodes, it is still mostly used for single-node scenarios. Both kind and k3d use Docker containers as Kubernetes nodes. My operating system is Linux Mint, where Docker runs natively, unlike macOS where Docker runs in a virtual machine. Because resources (memory, disk, etc.) are shared between the Docker containers and my local machine, if I actually create node pressure, my computer might become unusable.
After extensive Googling, I discovered that Docker can set runtime memory limits, and that k3d has an --agents-memory flag to set the memory of agent nodes. This gave me a way to reproduce the issue.
Steps
First, create a k3d cluster with 2 agent nodes, each limited to 3GB of memory, and configure the kubelet to trigger Pod eviction when available memory drops below 1GiB.
k3d cluster create \
  --agents 2 \
  --k3s-arg "--disable=traefik@server:0" \
  --agents-memory 3g \
  --k3s-arg "--kubelet-arg=eviction-hard=memory.available<1Gi@agent:0" \
  --k3s-arg "--kubelet-arg=eviction-hard=memory.available<1Gi@agent:1"
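Since each k3d node is just a Docker container, you can also confirm the limit at the Docker level; the container name below follows k3d's default naming, and the command should print the memory limit in bytes (0 would mean no limit):

docker inspect -f '{{.HostConfig.Memory}}' k3d-k3s-default-agent-0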
Check the memory of all nodes
kubectl get nodes -o custom-columns=NAME:.metadata.name,CAPACITY_MEMORY:.status.capacity.memory,ALLOCATABLE_MEMORY:.status.allocatable.memory
Output:
# NAME CAPACITY_MEMORY ALLOCATABLE_MEMORY
# k3d-k3s-default-agent-1 3221225Ki 2172649Ki
# k3d-k3s-default-agent-0 3221225Ki 2172649Ki
# k3d-k3s-default-server-0 32590664Ki 32590664Ki
You can see that agent 0 and agent 1 each have 3GB of memory, but only about 2GB is allocatable: the 1GiB hard eviction threshold is subtracted from the node's capacity when computing allocatable memory (3221225Ki - 1048576Ki = 2172649Ki).
Next, add taints to agent 0 and agent 1 so that subsequent Pods will only be deployed to the server-0 node.
kubectl taint nodes k3d-k3s-default-agent-0 k3d=noschedule:NoSchedule
kubectl taint nodes k3d-k3s-default-agent-1 k3d=noschedule:NoSchedule
Install the KubeRay operator, so the operator Pod will run on the server-0 node.
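If the KubeRay Helm repository has not been added yet, add it first; this is the chart repository published by the KubeRay project:

helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update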
helm install kuberay-operator kuberay/kuberay-operator --namespace ray-system --version 1.1.1 --create-namespace
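You can verify that the operator Pod ended up on server-0, the only node without a taint at this point:

kubectl get pods -n ray-system -o wide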
Remove the taints from agent 0 and agent 1 and add a taint to server 0 so that subsequent Pods will not be deployed to server 0.
kubectl taint nodes k3d-k3s-default-server-0 k3d=noschedule:NoSchedule
kubectl taint nodes k3d-k3s-default-agent-0 k3d=noschedule:NoSchedule-
kubectl taint nodes k3d-k3s-default-agent-1 k3d=noschedule:NoSchedule-
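To double-check the taints, you can list them per node; after this step only server-0 should still show one:

kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints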
Install the RayCluster custom resource. After installation, the KubeRay operator will create a head pod and a worker pod. In the Helm chart, the head pod's memory request is 2GB and the worker pod's is 1GB; since agent 0 and agent 1 each have only about 2GB of allocatable memory, the two Pods will definitely not end up on the same node.
helm install raycluster kuberay/ray-cluster --version 1.1.1
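Once both Pods are running, you can confirm that the head pod and the worker pod landed on different agent nodes:

kubectl get pods -o wide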
Next, we need to put memory pressure on the node where the head pod is located. After some Googling, I found that stress-ng is commonly used for this purpose, so I'll use it as well. We need to make sure stress-ng is available inside the head pod. The simplest way is to copy a statically compiled stress-ng binary directly into the head pod, so we don't have to worry about the head pod's base image or any missing dependencies. As for obtaining the statically compiled binary, you can compile it yourself, but I took a shortcut and copied it out of a Docker image that ships the binary. Assume the head pod is named raycluster-kuberay-head-ldg9f.
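As a rough sketch of that shortcut: create a container from an image that bundles a static stress-ng binary and copy the file out. The image name and the in-image path below are assumptions on my part, so substitute whatever image you actually use.

# assumed image with a statically linked stress-ng; swap in your own
docker create --name stress-ng-src alexeiled/stress-ng
# assumed location of the binary inside that image
docker cp stress-ng-src:/stress-ng ./stress-ng
docker rm stress-ng-src

With the binary in the current directory, copy it into the head pod.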
kubectl cp ./stress-ng raycluster-kuberay-head-ldg9f:/home/ray
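If the copied file is not executable inside the Pod (kubectl cp does not always preserve the executable bit), set it from outside; the path assumes the copy above placed the file in the ray user's home directory:

kubectl exec raycluster-kuberay-head-ldg9f -- chmod +x /home/ray/stress-ng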
Open a shell on the head pod
kubectl exec -it raycluster-kuberay-head-ldg9f -- bash
Simulate memory stress
./stress-ng --vm 4 --vm-bytes 2G --vm-keep
In this way, you can see the head pod being evicted due to Node-pressure Eviction.
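From another terminal, you can watch the eviction happen; the phase/reason shown below is what I would expect to see, though the exact wording can vary by Kubernetes version:

kubectl get pods -w
kubectl get pod raycluster-kuberay-head-ldg9f -o jsonpath='{.status.phase} {.status.reason}'
# roughly: Failed Evicted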