When a Kubernetes Pod is OOMKilled, many engineers immediately blame "memory limit too small".
But in reality, a Pod can be killed by three completely different mechanisms:
- Scheduling Layer: kubelet proactively performs orderly eviction based on the Pod's QoS and resource usage.
- Container Layer: Linux cgroups enforces the Pod's memory limit (limits.memory).
- Kernel Layer: When a process triggers the cgroup memory limit (memory.max) or the system runs out of global memory, the kernel OOM mechanism intervenes and selects a process to terminate.
In this article, we will trace a real Logstash Pod from Kubernetes API down to the Linux cgroup filesystem and see how memory decisions are actually made.
Kubernetes Node Pressure Eviction Mechanism
The order of Kubernetes node pressure eviction is affected by three factors: Quality of Service, PriorityClass, and actual resource usage.
The first factor is the Pod's QoS class. The eviction order is as follows:
| QoS | Condition | Feature |
|---|---|---|
| Guaranteed | request == limit | Least likely to be killed |
| Burstable | Has request / limit but not equal | Middle |
| BestEffort | No request/limit | Most likely to be killed |
The second factor is the Pod's PriorityClass: within the same QoS class, a Pod with a higher PriorityClass is more likely to survive.
The last factor is the Pod's actual resource usage: when QoS and PriorityClass are equal (both at the default of 0, or using the same PriorityClass), the eviction priority of Burstable Pods depends on how far their memory usage exceeds their requested memory.
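As a rough sketch, the QoS classification in the table above could be expressed like this (simplified to a single container's memory values; the real kubelet logic considers both CPU and memory across every container in the Pod):

```shell
# Hedged sketch: QoS classification for one container, memory values only.
# Real classification also requires CPU request == limit for Guaranteed.
qos_class() {
  local request="$1" limit="$2"
  if [ -z "$request" ] && [ -z "$limit" ]; then
    echo "BestEffort"             # no request/limit at all
  elif [ -n "$request" ] && [ "$request" = "$limit" ]; then
    echo "Guaranteed"             # request == limit
  else
    echo "Burstable"              # has request/limit, but not equal
  fi
}

qos_class 512Mi 1Gi   # prints "Burstable" -- the case of the Logstash Pod below
```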
How Kubelet Influences OOM Behavior Through cgroups Configuration (Based on cgroups v2)
Kubernetes' resource statistics, management, quotas, and evictions are all based on cgroups.
Kubernetes Pod cgroups Mapping on Node Host
Take a Logstash pod as an example.
First, check which Node the Pod is deployed on:
[root@k8s-master-05 logstash]# kubectl get pods -n logging -o wide | grep logstash
logstash-7c748fcf64-nt52r 1/1 Running 0 116m 10.244.232.85 k8s-node-06 <none> <none>
Then, we need to get the Pod's uid:
[root@k8s-master-05 logstash]# kubectl get pod -n logging logstash-7c748fcf64-nt52r -o jsonpath='{.metadata.uid}'
745d59b2-bda3-4f6d-94a2-c5bafa3560b9
On the k8s-node-06 host, find the corresponding cgroups directory:
/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod745d59b2_bda3_4f6d_94a2_c5bafa3560b9.slice
In this directory:
- `/sys/fs/cgroup/` is the cgroup v2 filesystem mount point, through which all processes on the host can be managed.
- `kubepods.slice` contains the control groups for all Kubernetes Pods. Below it are three directories, one per QoS class: `kubepods-besteffort.slice`, `kubepods-burstable.slice`, and `kubepods-guaranteed.slice`.
- `kubepods-burstable-pod745d59b2_bda3_4f6d_94a2_c5bafa3560b9.slice` is the Pod's actual cgroup mapping on the node, where `745d59b2_bda3_4f6d_94a2_c5bafa3560b9` is the Pod uid obtained earlier, with dashes replaced by underscores.
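The mapping from Pod uid to cgroup path can be sketched in shell (note how the dashes in the uid become underscores in the slice name):

```shell
# Build the cgroup v2 path for a Burstable Pod from its uid.
pod_uid="745d59b2-bda3-4f6d-94a2-c5bafa3560b9"        # from kubectl above
slice="kubepods-burstable-pod${pod_uid//-/_}.slice"   # dashes -> underscores
echo "/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/${slice}"
```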
Inside this directory there are usually at least two container scopes: the main application container and the pause (sandbox) container, for example:
[root@k8s-node-06 kubepods-burstable-pod745d59b2_bda3_4f6d_94a2_c5bafa3560b9.slice]# ls -d */
cri-containerd-3d9508ad8b835c31b9ab4551b977bf16bef8025f6bfb61c17a8da9b60e50fd4b.scope/
cri-containerd-557e374182e9f183aac7b6537d44d8a74565891a231468599032dce0f6b9cad2.scope/
For the First Directory
Check pid:
[root@k8s-node-06 cri-containerd-3d9508ad8b835c31b9ab4551b977bf16bef8025f6bfb61c17a8da9b60e50fd4b.scope]# cat cgroup.procs
154733
Check the corresponding process by pid:
[root@k8s-node-06 cri-containerd-3d9508ad8b835c31b9ab4551b977bf16bef8025f6bfb61c17a8da9b60e50fd4b.scope]# ps -fp 154733
UID PID PPID C STIME TTY TIME CMD
65535 154733 154706 0 17:35 ? 00:00:00 /pause
As we can see, this container is the Pause container of the Logstash Pod.
For the Second Directory
Check pid:
[root@k8s-node-06 cri-containerd-3d9508ad8b835c31b9ab4551b977bf16bef8025f6bfb61c17a8da9b60e50fd4b.scope]# cd ../cri-containerd-557e374182e9f183aac7b6537d44d8a74565891a231468599032dce0f6b9cad2.scope/
[root@k8s-node-06 cri-containerd-557e374182e9f183aac7b6537d44d8a74565891a231468599032dce0f6b9cad2.scope]# cat cgroup.procs
164485
164497
Check the corresponding process by pid:
[root@k8s-node-06 ~]# ps -fp 164485
UID PID PPID C STIME TTY TIME CMD
1000 164485 154706 0 17:47 ? 00:00:00 /bin/sh -c /opt/logstash/bin/logstash -f /opt/logstash/confi
And:
[root@k8s-node-06 ~]# ps -fp 164497
UID PID PPID C STIME TTY TIME CMD
1000 164497 164485 2 17:47 ? 00:03:26 /opt/logstash/jdk/bin/java -Xms1g -Xmx1g -XX:+UseConcMarkSwe
Enter Logstash to verify:
[root@k8s-master-05 logstash]# kubectl exec -it -n logging logstash-7c748fcf64-nt52r -- sh
$ ps -aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
logstash 1 0.0 0.0 2624 96 ? Ss 09:47 0:00 /bin/sh -c /opt/logstash/bin/logstash -f /opt/logstash/config/
logstash 6 2.7 16.9 5012196 1029084 ? Sl 09:47 3:27 /opt/logstash/jdk/bin/java -Xms1g -Xmx1g -XX:+UseConcMarkSweep
logstash 197 0.0 0.0 2624 1672 pts/0 Ss 11:51 0:00 sh
logstash 203 0.0 0.0 8904 3380 pts/0 R+ 11:51 0:00 ps -aux
As we can see, processes 164485 and 164497 are the mappings of the containerized Logstash processes on the Node host.
Check the Mapping of Pod Resource Request and Limit in the cgroups Control Group
- Check Pod limits:
[root@k8s-master-05 logstash]# kubectl get pods -n logging logstash-7c748fcf64-nt52r -o jsonpath='{range .spec.containers[*]}{"Container:"}{.name}{"\n"}{"CPU Request:"}{.resources.requests.cpu}{" "}{"CPU Limit:"}{.resources.limits.cpu}{"\n"}{"Memory Request:"}{.resources.requests.memory}{" "}{"Memory Limit:"}{.resources.limits.memory}{"\n"}{end}'
Container:logstash
CPU Request:500m CPU Limit:1
Memory Request:512Mi Memory Limit:1Gi
- Check the cgroups memory.max parameter:
[root@k8s-node-06 cri-containerd-557e374182e9f183aac7b6537d44d8a74565891a231468599032dce0f6b9cad2.scope]# cat memory.max | awk '{print $1/1024/1024}'
1024
The memory limit matches the Pod's limits.
- Check the cpu.max parameter:
[root@k8s-node-06 cri-containerd-557e374182e9f183aac7b6537d44d8a74565891a231468599032dce0f6b9cad2.scope]# cat cpu.max
100000 100000
This means a quota of 100ms per 100ms period, effectively limiting the container to 1 CPU.
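The `cpu.max` value can be decoded as "quota period", both in microseconds; a minimal sketch:

```shell
# cpu.max holds "<quota> <period>" in microseconds ("max" means unlimited).
cpu_max="100000 100000"         # value read from the container's cgroup above
quota="${cpu_max%% *}"
period="${cpu_max##* }"
if [ "$quota" = "max" ]; then
  echo "unlimited"
else
  echo "$((quota / period)) CPU(s)"   # 100000/100000 -> 1 CPU
fi
```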
This demonstrates that kubelet completely relies on cgroups to manage resources.
Why can OOM Kill Multiple Processes in One Container at the Same Time?
This relies on the memory.oom.group parameter. When memory.oom.group is set to 1, the kernel treats all processes inside the cgroup as a single unit during OOM and kills them together, instead of selecting individual processes.
[root@k8s-node-06 cri-containerd-557e374182e9f183aac7b6537d44d8a74565891a231468599032dce0f6b9cad2.scope]# cat memory.oom.group
1
Kernel-Level OOM
Kernel-level OOM relies on another mechanism: oom_score and oom_score_adj.
oom_score
The Linux kernel maintains an oom_score for each process via a heuristic. The score is derived mainly from the fraction of available memory the process uses, scaled to a 0-1000 range, and then adjusted by oom_score_adj. When an OOM occurs, the kernel evaluates the score of each candidate process and kills the one with the highest oom_score; if memory pressure persists, subsequent OOM events may kill further processes until enough memory is available.
oom_score_adj
oom_score_adj is an important parameter for calculating oom_score, and it also represents the artificial priority assigned to the process.
The range is from -1000 to 1000. Critical system processes are often set to -1000, which marks them as too important to kill: per the oom_score calculation, a process with oom_score_adj of -1000 gets a computed oom_score of 0, putting it at the very end of the OOM ranking (least likely to be killed). Conversely, a value of 1000 makes a process the most likely to be killed.
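A hedged sketch of this heuristic (greatly simplified; the real kernel code also accounts for page tables, swap usage, and other factors, and the page counts below are hypothetical):

```shell
# Simplified badness score: memory fraction scaled to 0..1000, plus
# oom_score_adj, floored at 0. Arguments are hypothetical page counts.
badness() {
  local proc_pages="$1" total_pages="$2" adj="$3"
  local score=$(( 1000 * proc_pages / total_pages + adj ))
  [ "$score" -lt 0 ] && score=0
  echo "$score"
}

badness 1024 8192 0        # -> 125: plain process using 1/8 of memory
badness 1024 8192 -1000    # -> 0:   adj -1000 floors the score, never chosen
badness 1024 8192 1000     # -> 1125: adj 1000 makes it the likeliest victim
```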
oom_score_adj Set by kubelet for Pods
When a Pod is created, kubelet generates oom_score_adj for each Pod. The specific value is determined by QoS:
| QoS | oom_score_adj default value |
|---|---|
| Guaranteed | -998 |
| Burstable | Dynamically calculated |
| BestEffort | 1000 |
Among them, the oom_score_adj of Burstable Pod is comprehensively calculated from parameters such as requests.memory and node allocatable memory:
oom_score_adj = 1000 - (1000 * memory_request) / node_allocatable_memory
(This formula is just an approximate calculation for Burstable Pod's oom_score_adj. The actual implementation also includes boundary checks and special cases.)
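Plugging in numbers makes this concrete. The 512Mi request comes from the Logstash Pod above; the 8Gi node-allocatable figure is a hypothetical example:

```shell
# Approximate Burstable oom_score_adj. Only the 512Mi request comes from the
# Pod spec; the 8Gi allocatable value is assumed. Real kubelet code also
# clamps the result (roughly between 2 and 999).
memory_request=$((512 * 1024 * 1024))           # 512Mi from the Pod spec
node_allocatable=$((8 * 1024 * 1024 * 1024))    # assume 8Gi allocatable
oom_score_adj=$(( 1000 - (1000 * memory_request) / node_allocatable ))
echo "$oom_score_adj"    # -> 938: a larger request would lower this value
```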
In addition, kubelet itself runs with an oom_score_adj of -999, one of the lowest values among user-space processes, making it among the last to be killed. However, in extreme cases kubelet is not untouchable, and the reasoning is simple: if kubelet dies, the node still has a chance of self-healing, but if the node freezes because memory cannot be reclaimed, it is almost impossible to recover by any means short of a reboot, since even the operational access paths (such as SSH) become unusable. The two failure modes differ greatly in severity.
Through this oom_score_adj, Kubernetes ensures that high-value Pods have a higher probability of survival when system OOM occurs.
VPA Affects oom_score_adj
VPA can automatically correct a Pod's requests based on observed usage, so the Pod is scheduled onto a node with genuinely sufficient resources, rather than landing on an under-resourced node because of an artificially small requests value and later being evicted by kubelet as its usage grows.
Because VPA modifies the requests value, it also affects the oom_score_adj mapped into the cgroups on the node hosting the Pod. Since a Burstable Pod's oom_score_adj falls roughly between 2 and 999, VPA can lower it by raising requests closer to limits, so the Pod is OOM-killed later when a system-level OOM occurs.
VPA does not dynamically modify the oom_score_adj of running Pods; the change only takes effect on newly created or recreated Pods.
Summary
Node Pressure Eviction: An active Pod deletion behavior initiated by kubelet, aiming to prevent node crashes. It is a relatively "gentle" and orderly cleanup process.
Kernel OOM Killer: A passive, last-resort mechanism of the Linux kernel under extreme pressure (when memory is exhausted and swap space may also be depleted). From its perspective, there are only processes without the superior-subordinate relationships of kubelet and Pods. It only decides which process to kill based on the process's oom_score.
cgroups: The common technical foundation of both. It provides kubelet with the ability to implement resource limits (memory.max) and influence kernel OOM decisions (through memory.oom.group and oom_score_adj).
Relationship: Reasonable Pod resource limits (implemented through cgroups) can reduce the probability of node pressure eviction and kernel OOM. When node pressure appears, kubelet's eviction is a prevention of kernel OOM. If prevention fails, the kernel OOM will take over based on the oom_score affected by cgroups.