When a Kubernetes Pod is OOMKilled, many engineers immediately blame "memory limit too small".
But in reality, a Pod can be killed by three completely different mechanisms:
- Scheduling Layer: kubelet proactively performs orderly eviction based on the Pod's QoS and resource usage.
- Container Layer: Linux cgroups enforces the Pod's memory limit (limits.memory).
- Kernel Layer: When a process triggers the cgroup memory limit (memory.max) or the system runs out of global memory, the kernel OOM mechanism intervenes and selects a process to terminate.
In this article, we will trace a real Logstash Pod from Kubernetes API down to the Linux cgroup filesystem and see how memory decisions are actually made.
Kubernetes Node Pressure Eviction Mechanism
The order of Kubernetes node pressure eviction is affected by three factors: Quality of Service, PriorityClass, and actual resource usage.
The first factor is the Pod's QoS class. The eviction order is as follows:
| QoS | Condition | Feature |
|---|---|---|
| Guaranteed | request == limit | Least likely to be killed |
| Burstable | Has request / limit but not equal | Middle |
| BestEffort | No request/limit | Most likely to be killed |
The second factor is the Pod's PriorityClass: within the same QoS class, a Pod with a higher PriorityClass is more likely to survive.
The last factor is the Pod's actual resource usage: when QoS and PriorityClass are equal (both at the default of 0, or using the same PriorityClass), the eviction priority of Burstable Pods depends on how far their memory usage exceeds their requested memory.
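As a rough sketch, the QoS classification in the table above could be expressed like this (simplified to a single container's memory values; the real kubelet logic considers both CPU and memory across every container in the Pod):

```shell
# Hedged sketch: QoS classification for one container, memory values only.
# Real classification also requires CPU request == limit for Guaranteed.
qos_class() {
  local request="$1" limit="$2"
  if [ -z "$request" ] && [ -z "$limit" ]; then
    echo "BestEffort"             # no request/limit at all
  elif [ -n "$request" ] && [ "$request" = "$limit" ]; then
    echo "Guaranteed"             # request == limit
  else
    echo "Burstable"              # has request/limit, but not equal
  fi
}

qos_class 512Mi 1Gi   # prints "Burstable" -- the case of the Logstash Pod below
```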
How Kubelet Influences OOM Behavior Through cgroups Configuration (Based on cgroups v2)
Kubernetes' resource statistics, management, quotas, and evictions are all based on cgroups.
Kubernetes Pod cgroups Mapping on Node Host
Take a Logstash pod as an example.
First, check which Node the Pod is deployed on:
[root@k8s-master-05 logstash]# kubectl get pods -n logging -o wide | grep logstash
logstash-7c748fcf64-nt52r 1/1 Running 0 116m 10.244.232.85 k8s-node-06 <none> <none>
Then, we need to get the Pod's uid:
[root@k8s-master-05 logstash]# kubectl get pod -n logging logstash-7c748fcf64-nt52r -o jsonpath='{.metadata.uid}'
745d59b2-bda3-4f6d-94a2-c5bafa3560b9
On the k8s-node-06 host, find the corresponding cgroups directory:
/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod745d59b2_bda3_4f6d_94a2_c5bafa3560b9.slice
In this directory:
- `/sys/fs/cgroup/` is the cgroup v2 filesystem mount point, through which all processes on the host can be managed.
- `kubepods.slice` contains the control groups for all Kubernetes Pods. Below it are three directories, one per QoS class: `kubepods-besteffort.slice`, `kubepods-burstable.slice`, and `kubepods-guaranteed.slice`.
- `kubepods-burstable-pod745d59b2_bda3_4f6d_94a2_c5bafa3560b9.slice` is the Pod's actual cgroup mapping on the node, where `745d59b2_bda3_4f6d_94a2_c5bafa3560b9` is the Pod uid obtained earlier, with dashes replaced by underscores.
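The mapping from Pod uid to cgroup path can be sketched in shell (note how the dashes in the uid become underscores in the slice name):

```shell
# Build the cgroup v2 path for a Burstable Pod from its uid.
pod_uid="745d59b2-bda3-4f6d-94a2-c5bafa3560b9"        # from kubectl above
slice="kubepods-burstable-pod${pod_uid//-/_}.slice"   # dashes -> underscores
echo "/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/${slice}"
```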
Inside this directory there are usually at least two container scopes: the main application container and the pause (sandbox) container, for example:
[root@k8s-node-06 kubepods-burstable-pod745d59b2_bda3_4f6d_94a2_c5bafa3560b9.slice]# ls -d */
cri-containerd-3d9508ad8b835c31b9ab4551b977bf16bef8025f6bfb61c17a8da9b60e50fd4b.scope/
cri-containerd-557e374182e9f183aac7b6537d44d8a74565891a231468599032dce0f6b9cad2.scope/
For the First Directory
Check pid:
[root@k8s-node-06 cri-containerd-3d9508ad8b835c31b9ab4551b977bf16bef8025f6bfb61c17a8da9b60e50fd4b.scope]# cat cgroup.procs
154733
Check the corresponding process by pid:
[root@k8s-node-06 cri-containerd-3d9508ad8b835c31b9ab4551b977bf16bef8025f6bfb61c17a8da9b60e50fd4b.scope]# ps -fp 154733
UID PID PPID C STIME TTY TIME CMD
65535 154733 154706 0 17:35 ? 00:00:00 /pause
As we can see, this container is the Pause container of the Logstash Pod.
For the Second Directory
Check pid:
[root@k8s-node-06 cri-containerd-3d9508ad8b835c31b9ab4551b977bf16bef8025f6bfb61c17a8da9b60e50fd4b.scope]# cd ../cri-containerd-557e374182e9f183aac7b6537d44d8a74565891a231468599032dce0f6b9cad2.scope/
[root@k8s-node-06 cri-containerd-557e374182e9f183aac7b6537d44d8a74565891a231468599032dce0f6b9cad2.scope]# cat cgroup.procs
164485
164497
Check the corresponding process by pid:
[root@k8s-node-06 ~]# ps -fp 164485
UID PID PPID C STIME TTY TIME CMD
1000 164485 154706 0 17:47 ? 00:00:00 /bin/sh -c /opt/logstash/bin/logstash -f /opt/logstash/confi
And:
[root@k8s-node-06 ~]# ps -fp 164497
UID PID PPID C STIME TTY TIME CMD
1000 164497 164485 2 17:47 ? 00:03:26 /opt/logstash/jdk/bin/java -Xms1g -Xmx1g -XX:+UseConcMarkSwe
Enter Logstash to verify:
[root@k8s-master-05 logstash]# kubectl exec -it -n logging logstash-7c748fcf64-nt52r -- sh
$ ps -aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
logstash 1 0.0 0.0 2624 96 ? Ss 09:47 0:00 /bin/sh -c /opt/logstash/bin/logstash -f /opt/logstash/config/
logstash 6 2.7 16.9 5012196 1029084 ? Sl 09:47 3:27 /opt/logstash/jdk/bin/java -Xms1g -Xmx1g -XX:+UseConcMarkSweep
logstash 197 0.0 0.0 2624 1672 pts/0 Ss 11:51 0:00 sh
logstash 203 0.0 0.0 8904 3380 pts/0 R+ 11:51 0:00 ps -aux
As we can see, processes 164485 and 164497 are the mappings of the containerized Logstash processes on the Node host.
Check the Mapping of Pod Resource Request and Limit in the cgroups Control Group
- Check Pod limits:
[root@k8s-master-05 logstash]# kubectl get pods -n logging logstash-7c748fcf64-nt52r -o jsonpath='{range .spec.containers[*]}{"Container:"}{.name}{"\n"}{"CPU Request:"}{.resources.requests.cpu}{" "}{"CPU Limit:"}{.resources.limits.cpu}{"\n"}{"Memory Request:"}{.resources.requests.memory}{" "}{"Memory Limit:"}{.resources.limits.memory}{"\n"}{end}'
Container:logstash
CPU Request:500m CPU Limit:1
Memory Request:512Mi Memory Limit:1Gi
- Check the cgroups memory.max parameter:
[root@k8s-node-06 cri-containerd-557e374182e9f183aac7b6537d44d8a74565891a231468599032dce0f6b9cad2.scope]# cat memory.max | awk '{print $1/1024/1024}'
1024
The memory limit matches the Pod's limits.
- Check the cpu.max parameter:
[root@k8s-node-06 cri-containerd-557e374182e9f183aac7b6537d44d8a74565891a231468599032dce0f6b9cad2.scope]# cat cpu.max
100000 100000
This means a quota of 100ms per 100ms period, effectively limiting the container to 1 CPU.
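The `cpu.max` value can be decoded as "quota period", both in microseconds; a minimal sketch:

```shell
# cpu.max holds "<quota> <period>" in microseconds ("max" means unlimited).
cpu_max="100000 100000"         # value read from the container's cgroup above
quota="${cpu_max%% *}"
period="${cpu_max##* }"
if [ "$quota" = "max" ]; then
  echo "unlimited"
else
  echo "$((quota / period)) CPU(s)"   # 100000/100000 -> 1 CPU
fi
```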
This demonstrates that kubelet completely relies on cgroups to manage resources.
Why can OOM Kill Multiple Processes in One Container at the Same Time?
This relies on the memory.oom.group parameter. When memory.oom.group is set to 1, the kernel treats all processes inside the cgroup as a single unit during OOM and kills them together, instead of selecting individual processes.
[root@k8s-node-06 cri-containerd-557e374182e9f183aac7b6537d44d8a74565891a231468599032dce0f6b9cad2.scope]# cat memory.oom.group
1
Kernel-Level OOM
Kernel-level OOM relies on another mechanism: oom_score and oom_score_adj.
oom_score
The Linux kernel maintains an oom_score for each process via a heuristic. The score is derived mainly from the fraction of available memory the process uses, scaled to a 0-1000 range, and then adjusted by oom_score_adj. When an OOM occurs, the kernel evaluates the score of each candidate process and kills the one with the highest oom_score; if memory pressure persists, subsequent OOM events may kill further processes until enough memory is available.
oom_score_adj
oom_score_adj is an important parameter for calculating oom_score, and it also represents the artificial priority assigned to the process.
The range is from -1000 to 1000. Critical system processes are often set to -1000, which marks them as too important to kill: per the oom_score calculation, a process with oom_score_adj of -1000 gets a computed oom_score of 0, putting it at the very end of the OOM ranking (least likely to be killed). Conversely, a value of 1000 makes a process the most likely to be killed.
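A hedged sketch of this heuristic (greatly simplified; the real kernel code also accounts for page tables, swap usage, and other factors, and the page counts below are hypothetical):

```shell
# Simplified badness score: memory fraction scaled to 0..1000, plus
# oom_score_adj, floored at 0. Arguments are hypothetical page counts.
badness() {
  local proc_pages="$1" total_pages="$2" adj="$3"
  local score=$(( 1000 * proc_pages / total_pages + adj ))
  [ "$score" -lt 0 ] && score=0
  echo "$score"
}

badness 1024 8192 0        # -> 125: plain process using 1/8 of memory
badness 1024 8192 -1000    # -> 0:   adj -1000 floors the score, never chosen
badness 1024 8192 1000     # -> 1125: adj 1000 makes it the likeliest victim
```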
oom_score_adj Set by kubelet for Pods
When a Pod is created, kubelet generates oom_score_adj for each Pod. The specific value is determined by QoS:
| QoS | oom_score_adj default value |
|---|---|
| Guaranteed | -998 |
| Burstable | Dynamically calculated |
| BestEffort | 1000 |
Among them, the oom_score_adj of Burstable Pod is comprehensively calculated from parameters such as requests.memory and node allocatable memory:
oom_score_adj = 1000 - (1000 * memory_request) / node_allocatable_memory
(This formula is just an approximate calculation for Burstable Pod's oom_score_adj. The actual implementation also includes boundary checks and special cases.)
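Plugging in numbers makes this concrete. The 512Mi request comes from the Logstash Pod above; the 8Gi node-allocatable figure is a hypothetical example:

```shell
# Approximate Burstable oom_score_adj. Only the 512Mi request comes from the
# Pod spec; the 8Gi allocatable value is assumed. Real kubelet code also
# clamps the result (roughly between 2 and 999).
memory_request=$((512 * 1024 * 1024))           # 512Mi from the Pod spec
node_allocatable=$((8 * 1024 * 1024 * 1024))    # assume 8Gi allocatable
oom_score_adj=$(( 1000 - (1000 * memory_request) / node_allocatable ))
echo "$oom_score_adj"    # -> 938: a larger request would lower this value
```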
In addition, kubelet itself runs with an oom_score_adj of -999, one of the lowest values among user-space processes, making it among the last to be killed. However, in extreme cases kubelet is not untouchable, and the reasoning is simple: if kubelet dies, the node still has a chance of self-healing, but if the node freezes because memory cannot be reclaimed, it is almost impossible to recover by any means short of a reboot, since even the operational access paths (such as SSH) become unusable. The two failure modes differ greatly in severity.
Through this oom_score_adj, Kubernetes ensures that high-value Pods have a higher probability of survival when system OOM occurs.
VPA Affects oom_score_adj
VPA can automatically correct a Pod's requests based on observed usage, so the Pod is scheduled onto a node with genuinely sufficient resources, rather than landing on an under-resourced node because of an artificially small requests value and later being evicted by kubelet as its usage grows.
Because VPA modifies the requests value, it also affects the oom_score_adj mapped into the cgroups on the node hosting the Pod. Since a Burstable Pod's oom_score_adj falls roughly between 2 and 999, VPA can lower it by raising requests closer to limits, so the Pod is OOM-killed later when a system-level OOM occurs.
VPA does not dynamically modify the oom_score_adj of running Pods; the change only takes effect on newly created or recreated Pods.
Summary
Node Pressure Eviction: An active Pod deletion behavior initiated by kubelet, aiming to prevent node crashes. It is a relatively "gentle" and orderly cleanup process.
Kernel OOM Killer: A passive, last-resort mechanism of the Linux kernel under extreme pressure (when memory is exhausted and swap space may also be depleted). From its perspective, there are only processes without the superior-subordinate relationships of kubelet and Pods. It only decides which process to kill based on the process's oom_score.
cgroups: The common technical foundation of both. It provides kubelet with the ability to implement resource limits (memory.max) and influence kernel OOM decisions (through memory.oom.group and oom_score_adj).
Relationship: Reasonable Pod resource limits (implemented through cgroups) can reduce the probability of node pressure eviction and kernel OOM. When node pressure appears, kubelet's eviction is a prevention of kernel OOM. If prevention fails, the kernel OOM will take over based on the oom_score affected by cgroups.