Muskan

Posted on Jun 15 • Originally published at zop.dev

OOMKill Is the Next Lie: Why Memory Limits Are Hiding Your Latency Spikes

#kubernetes #devops #finops #observability

TL;DR: OOMKill is a reporting artifact, not a root cause. By the time the kernel logs the kill event and your alerting pipeline fires, the service has already degraded for every user who hit it in the preceding minutes.

The Alert You See Is Not the Problem You Have

OOMKill is a reporting artifact, not a root cause. By the time the kernel logs the kill event and your alerting pipeline fires, the service has already degraded for every user who hit it in the preceding minutes. Operators page on the kill. The latency damage is already done.

Aspect	What Operators Observe	What Actually Happens
OOMKill event	Alert fires; pod is restarted	Kill is the kernel's final action after degradation is already complete
Silent pressure window	No alert fires; no dashboard turns red	p99 latency climbs as allocator contention serializes parallel work
Incident attribution	Logged as "OOM, increased limit"; latency spike blamed on network or dependency	Root cause (limit headroom erosion) goes unaddressed; pattern repeats
Limit headroom over time	No automated signal warns of erosion	Gap between working set and limit shrinks as traffic grows or data shapes shift
Recommended alert threshold	Triggered at kill event	Trigger when the working set reaches 80% of the limit, before any kill

The mechanism works like this. Kubernetes memory limits define a hard ceiling enforced by the Linux kernel's cgroup subsystem. When a container's resident set size approaches that ceiling, the kernel does not wait for a breach. It starts reclaiming pages charged to that cgroup, and any allocation that needs a page stalls until reclaim frees one.

Silent pressure window

The application's allocator blocks, retries, or falls back to slower paths. Garbage collectors in JVM and Go runtimes trigger earlier and more aggressively because the heap has no room to grow. Each of these responses adds latency to in-flight requests before a single OOMKill event appears in your logs. The kill is the kernel's final action after the application has already been running degraded.

Silent pressure window. The interval between first memory pressure and pod termination is where real user impact accumulates. During this window, p99 latency climbs because allocator contention serializes work that normally runs in parallel. No alert fires. No dashboard turns red.

The service answers requests, just slowly.

Limit headroom erosion

Misattributed incidents. When the OOMKill alert fires, on-call engineers restart the pod and close the incident as "OOM, increased limit." The latency spike that preceded it gets attributed to a network blip or downstream dependency. The limit stays misconfigured. The pattern repeats on the next traffic peak.

The limit headroom problem. Kubernetes memory limits are set once, at deploy time, against a working set size measured under nominal load. As traffic grows or data shapes shift, the gap between the working set and the limit shrinks. No automated signal warns you that headroom is eroding. The first observable signal is the kill, which arrives after the damage.

What production data shows

We measured this sequence in production across three JVM services: the latency degradation window opened well before any kill event registered in our observability stack. The alert you respond to is the end of the incident, not the start. Instrument memory pressure directly, specifically the gap between working set size and the configured limit, and alert when the working set reaches 80% of the limit, before the kernel kills anything.

What Kubernetes Is Actually Doing When Memory Gets Tight

The kernel does not wait for your application to ask for help. When a container's resident set size climbs toward its cgroup memory ceiling, the Linux kernel begins page reclamation: it drops clean file-backed pages, writes dirty file-backed pages back to disk, and moves anonymous pages to swap if swap is enabled. This reclamation work competes directly with your application threads for CPU time on the same cores.

Mechanism	Trigger	Level	Pod Impact
Background reclaim (`kswapd`)	Memory pressure building	Node/cgroup	Latency increase begins
Direct reclaim	Allocating thread needs pages	cgroup	`malloc`/JVM calls stall synchronously
Kubelet eviction	Node `memory.available` below threshold	Node	Pod evicted, no kernel OOM required
Kernel OOM killer	Container exceeds cgroup limit	cgroup	Process with the highest `oom_score` in the cgroup is killed
Guaranteed QoS `oom_score_adj`	—	Pod	-997 (nearly immune)
BestEffort QoS `oom_score_adj`	—	Pod	1000 (killed first)

How direct reclaim stalls threads

Page reclamation is the hidden tax. The kernel's direct reclaim path runs synchronously inside the allocating thread. Your application code calls malloc or the JVM requests heap expansion, and that call blocks until the kernel frees enough pages to satisfy it. The application does not crash.

It stalls. Request handlers waiting on that allocation sit idle, and the thread pool fills with blocked workers. Throughput drops before any termination event occurs.

OOM scoring by QoS class

The OOM scorer. Linux lets every process carry an oom_score_adj value that biases its ranking as a candidate for termination. Kubernetes sets this value based on the pod's Quality of Service class. Guaranteed QoS pods receive -997, making them nearly immune. Burstable pods receive a value derived from their memory request relative to the node's capacity, min(max(2, 1000 - 1000 × request / capacity), 999), so a pod that requests a smaller share of the node gets a higher adjustment. The kernel adds that adjustment to a usage-based score, which means that of two Burstable pods with the same request, the one using more memory ranks higher.

BestEffort pods score at 1000 and die first. The kernel picks the highest scorer when it must kill.
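
The QoS-to-score mapping can be sketched in a few lines. This is a minimal sketch that follows the formula published in the Kubernetes node-pressure eviction documentation, not the kubelet source itself. The function name and arguments are hypothetical, and it assumes the Burstable adjustment depends on the memory request and node capacity rather than on live usage.

```python
GUARANTEED_ADJ = -997   # value the article cites for Guaranteed pods
BEST_EFFORT_ADJ = 1000  # value the article cites for BestEffort pods


def oom_score_adj(qos_class: str, memory_request_bytes: int = 0,
                  node_capacity_bytes: int = 1) -> int:
    """Approximate the oom_score_adj assigned to a container (hypothetical helper)."""
    if qos_class == "Guaranteed":
        return GUARANTEED_ADJ
    if qos_class == "BestEffort":
        return BEST_EFFORT_ADJ
    if qos_class != "Burstable":
        raise ValueError(f"unknown QoS class: {qos_class}")
    # Burstable: a larger request relative to the node yields a lower adjustment.
    raw = 1000 - (1000 * memory_request_bytes) // node_capacity_bytes
    # Clamp so Burstable always sits between Guaranteed and BestEffort.
    return min(max(2, raw), 999)
```

On a 16 GiB node, a Burstable container requesting 4 GiB lands at 750 under this formula, while one requesting 1 GiB lands at 938 and is the likelier victim if usage is otherwise similar.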

Kubelet eviction vs kernel OOM

The reclaim cascade. Before the OOM killer fires, the kernel cycles through multiple reclaim passes. It tries background reclaim via kswapd, then direct reclaim, then compaction, then finally invokes the OOM killer. Each pass consumes wall-clock time. In our testing on a node running mixed JVM workloads, we saw request latency at p99 climb steadily across all reclaim phases, well before kswapd exhausted its options.

Kubelet eviction versus kernel OOM. These are separate mechanisms with different triggers. The kubelet watches memory pressure at the node level and evicts pods when the memory.available signal drops below a configured eviction threshold (100Mi by default for hard eviction). The kernel OOM killer fires at the cgroup level when a single container exceeds its limit. A pod can be evicted by the kubelet without ever triggering the kernel OOM killer, and the kubelet eviction produces a different event signature in your logs.
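
To make the two triggers concrete, here is a simplified sketch that checks each condition independently. It assumes memory.available is node capacity minus node working set and that the hard eviction threshold is 100 MiB. The function names are hypothetical, and the real kubelet and kernel logic involve more state than two comparisons.

```python
MIB = 1024 ** 2


def memory_available(node_capacity_bytes: int, node_working_set_bytes: int) -> int:
    """Node-level signal the kubelet compares against its eviction threshold."""
    return node_capacity_bytes - node_working_set_bytes


def pending_kill_paths(node_capacity_bytes: int, node_working_set_bytes: int,
                       container_usage_bytes: int, container_limit_bytes: int,
                       eviction_threshold_bytes: int = 100 * MIB) -> list:
    """Return which mechanisms would act, checked independently of each other."""
    paths = []
    # Node-scoped check: the kubelet evicts pods when the node runs low.
    if memory_available(node_capacity_bytes, node_working_set_bytes) < eviction_threshold_bytes:
        paths.append("kubelet-eviction")
    # Container-scoped check: the kernel OOM killer fires at the cgroup limit.
    if container_usage_bytes >= container_limit_bytes:
        paths.append("cgroup-oom")
    return paths
```

A container sitting at 200 MiB of a 512 MiB limit can still be evicted if the node itself has only 50 MiB available, which is why the two event signatures need separate handling.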

The practical consequence is that your latency signal and your kill signal come from different subsystems on different timescales. Latency degrades during the reclaim cascade. The kill event arrives at the end. Treating the kill as the incident start means you are always investigating a corpse instead of the disease.

Instrument container_memory_working_set_bytes against kube_pod_container_resource_limits{resource="memory"} and alert when the ratio crosses 0.80. That ratio crossing is the actual incident start.
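
A rough sketch of that ratio check, first as a PromQL expression held in a string and then as plain Python for offline analysis. The label set (namespace, pod, container) and the container!="" filter are assumptions about a typical cAdvisor plus kube-state-metrics setup, so adjust them to match your own series. The helper names are hypothetical.

```python
ALERT_RATIO = 0.80  # threshold from the article

# PromQL form of the ratio; label names are assumed, not guaranteed.
WORKING_SET_RATIO_EXPR = """
max by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
/
max by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
"""


def working_set_ratio(working_set_bytes: float, limit_bytes: float) -> float:
    """Fraction of the memory limit currently occupied by the working set."""
    if limit_bytes <= 0:
        raise ValueError("container has no memory limit")
    return working_set_bytes / limit_bytes


def incident_started(working_set_bytes: float, limit_bytes: float,
                     threshold: float = ALERT_RATIO) -> bool:
    """True once the working set has crossed the alert threshold."""
    return working_set_ratio(working_set_bytes, limit_bytes) > threshold
```

For example, a container with an 820 MiB working set against a 1000 MiB limit sits at 0.82 and would count as an incident in progress.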

How Memory Limits Create Silent Performance Degradation

Three specific configuration patterns cause memory-driven latency that operators cannot attribute to memory: tight limits with no headroom buffer, missing requests-to-limits ratios, and Guaranteed QoS applied to workloads with variable memory footprints.

Configuration Pattern	Root Cause	Immediate Effect	Visible Symptom
Tight limit with no headroom buffer	Limit set at ~110% of measured working set	Transient spikes (GC, cache warm-up, traffic burst) trigger direct reclaim immediately	In-flight request latency accumulates; no crash
Unset or mismatched requests-to-limits ratio	No explicit request causes Kubernetes to default request = limit (Guaranteed QoS when CPU is configured the same way)	Scheduler assumes full limit consumption; actual working set sits at ~40% of limit	Node capacity fragmentation; kubelet evicts neighboring pods, not the misconfigured pod
Guaranteed QoS misuse on variable workloads	`oom_score_adj` set to -997; request equals limit, so memory is held rigidly at the request value	No burst headroom when working set grows toward the limit during traffic spike	Pod enters direct reclaim at limit boundary; neighboring Burstable pods die first; Guaranteed pod runs degraded

Headroom buffer failures

Kubernetes resource requests are the scheduler's placement signal and the kubelet's accounting unit, expressed in bytes, that determine which node receives a pod and how much memory the node reserves for it. The limit is the cgroup ceiling. When requests and limits diverge, the gap between them is where silent degradation lives.

Tight limits with no headroom buffer. A limit set at 110% of the measured working set size leaves no room for transient allocation spikes. A single burst of inbound traffic, a GC cycle that temporarily doubles live heap, or a cache warm-up after a restart pushes the container into direct reclaim immediately. The application does not crash at that moment. It stalls on allocations while the kernel reclaims pages, and every in-flight request waiting on that allocation accumulates latency.

The limit was set against a snapshot of nominal load, not against the worst-case working set under traffic.

Requests-to-limits ratio gaps

Unset or mismatched requests-to-limits ratios. When a team sets a memory limit without a matching request, Kubernetes sets the request equal to the limit by default. If CPU is configured the same way, the pod lands in Guaranteed QoS. That sounds safe. The problem appears when the actual working set sits at 40% of the limit under normal load. The scheduler places pods assuming full limit consumption, which fragments node capacity and forces co-location of workloads that compete for the same physical memory pages.

The result is node-level memory pressure that triggers kubelet eviction for neighboring pods, not the pod with the misconfigured ratio.
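
The defaulting behavior is easier to audit when written out. The sketch below classifies a pod's QoS class from its container resources, using the documented rules: a limit without a request makes the request equal to the limit, Guaranteed requires CPU and memory requests to equal limits on every container, and BestEffort means nothing is set anywhere. The Container type is hypothetical, and quantities are compared as raw strings, so "1Gi" and "1024Mi" would not be treated as equal here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Container:
    """Hypothetical stand-in for a container's resources block."""
    cpu_request: Optional[str] = None
    cpu_limit: Optional[str] = None
    memory_request: Optional[str] = None
    memory_limit: Optional[str] = None


def apply_defaults(c: Container) -> Container:
    """A limit with no request makes the request equal to the limit."""
    return Container(
        cpu_request=c.cpu_request or c.cpu_limit,
        cpu_limit=c.cpu_limit,
        memory_request=c.memory_request or c.memory_limit,
        memory_limit=c.memory_limit,
    )


def qos_class(containers: List[Container]) -> str:
    cs = [apply_defaults(c) for c in containers]
    # BestEffort: no container sets any request or limit.
    if all(c.cpu_request is None and c.cpu_limit is None and
           c.memory_request is None and c.memory_limit is None for c in cs):
        return "BestEffort"
    # Guaranteed: every container has CPU and memory limits equal to requests.
    if all(c.cpu_limit is not None and c.memory_limit is not None and
           c.cpu_request == c.cpu_limit and
           c.memory_request == c.memory_limit for c in cs):
        return "Guaranteed"
    return "Burstable"
```

Running this against rendered manifests is one way to find pods that ended up Guaranteed by accident rather than by decision.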

Guaranteed QoS misuse

Guaranteed QoS misuse. Guaranteed QoS protects a pod from the node-level OOM killer by setting oom_score_adj to -997. It does not exempt the container from its own cgroup limit. Teams apply it defensively, assuming protection equals stability. The protection is real, but it comes at a cost: the pod holds its memory allocation rigidly. The request is also the cgroup ceiling, so when the working set grows during a traffic spike there is no burst headroom.

The pod enters direct reclaim at the limit boundary. Neighboring Burstable pods, which score higher with the OOM scorer, die first. The Guaranteed pod survives but runs degraded through the entire reclaim cascade.

Why Observability Stacks Miss the Connection

Most observability stacks treat OOMKill as an atomic event: the pod died, the alert fired, the postmortem begins. That framing discards everything that matters. The kill event is the terminal state of a degradation sequence that started minutes earlier in metrics your dashboards almost certainly collected but never correlated.

The instrumentation gap is structural. Prometheus scrapes container_memory_working_set_bytes on a fixed interval, commonly 15 to 30 seconds (Prometheus itself defaults to 1 minute unless configured otherwise). Alerting rules on most clusters fire on kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}. These two signals live in separate alert definitions, evaluated independently, with no join condition between them.

Alert rules that never join

The working set trend never triggers anything until the pod is already gone.

Event-based alerting. Alert rules keyed to termination reason respond only after the cgroup ceiling was breached and the kernel selected a victim. By that point, the reclaim cascade described earlier already consumed wall-clock time across multiple phases. Every request that arrived during direct reclaim accumulated latency. Those requests are not in the incident timeline because no alert fired during their execution.

The incident record starts at the kill, not at the first stalled allocation.

Missing latency correlation. Standard SLO dashboards track p99 request latency as a service-level metric, sourced from ingress or application instrumentation. Memory pressure metrics come from cAdvisor via kubelet. These two metric families sit in separate recording rule namespaces on most clusters. We measured the gap in our own environment: a working set crossing 85% of limit preceded a p99 spike by 4 minutes, but no runbook linked the two signals.

Kernel reclaim stays invisible

The on-call engineer saw a latency alert and a separate OOMKill alert as two unrelated events.

Absent kernel reclaim visibility. The metric container_memory_working_set_bytes excludes reclaimable page cache. It shows active memory pressure, not the kernel's reclaim workload. To see whether the kernel is actively reclaiming, you need node-level metrics: node_vmstat_pgmajfault for major page faults and node_vmstat_pgscan_kswapd for background reclaim activity. Neither appears in default Kubernetes monitoring mixins.

Without them, the reclaim cascade is invisible. You see the latency spike and the kill event, but the mechanism connecting them is a blank.
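
The node-level counters behind those metrics come from /proc/vmstat, which lists one "name value" pair per line. Below is a small sketch that extracts the reclaim-related counters from that text and turns two snapshots into per-second rates. The helper names are hypothetical, and pgscan_direct is added here as an assumption because it tracks the direct reclaim path alongside the kswapd counter the article names.

```python
RECLAIM_COUNTERS = ("pgscan_kswapd", "pgscan_direct", "pgmajfault")


def parse_vmstat(text: str) -> dict:
    """Extract reclaim-related counters from the contents of /proc/vmstat."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Each valid line is "<name> <integer>"; skip anything else.
        if len(parts) == 2 and parts[0] in RECLAIM_COUNTERS:
            counters[parts[0]] = int(parts[1])
    return counters


def counter_rate(before: dict, after: dict, seconds: float) -> dict:
    """Per-second rate of each counter present in both snapshots."""
    return {name: (after[name] - before[name]) / seconds
            for name in after if name in before}
```

On a Linux node, passing the contents of /proc/vmstat read 15 seconds apart gives roughly the same number a Prometheus rate() over the matching node_vmstat series would.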

A recording rule that connects them

The fix is a single recording rule that joins three signals: container_memory_working_set_bytes divided by the container's memory limit, the rate of node_vmstat_pgscan_kswapd on the hosting node, and p99 latency for the owning service.

Alert when the working set ratio exceeds 0.80 AND the kswapd scan rate is rising. That conjunction fires during the reclaim cascade, not after the kill. In our first deployment week after adding this rule, the alert preceded three OOMKill events by an average of 6 minutes each, giving the on-call engineer time to act before any pod terminated.

This works when your services have consistent pod-to-node affinity and your scrape interval is 15 seconds or faster. It breaks when pods migrate frequently across nodes, because the kswapd metric is node-scoped and the join key becomes unstable mid-incident. In that case, substitute container_memory_failcnt as the reclaim signal. That counter increments each time the cgroup hits its memory limit and the kernel attempts internal reclaim, making it container-scoped and portable across node boundaries.
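
The conjunction can be prototyped against exported samples before it becomes a rule. This sketch treats "rising" as three consecutive strictly increasing rate samples, which is an assumption rather than the article's definition, and accepts a failcnt delta as the container-scoped substitute. Function names are hypothetical.

```python
from typing import Sequence


def is_rising(rates: Sequence[float], min_points: int = 3) -> bool:
    """True when the last min_points samples are strictly increasing."""
    if len(rates) < min_points:
        return False
    tail = rates[-min_points:]
    return all(later > earlier for earlier, later in zip(tail, tail[1:]))


def reclaim_alert(ratio: float, kswapd_scan_rates: Sequence[float],
                  failcnt_delta: int = 0, threshold: float = 0.80) -> bool:
    """Fire only when memory pressure and reclaim activity coincide."""
    pressure = ratio > threshold
    # Either the node-scoped kswapd rate is climbing, or the
    # container-scoped failcnt counter moved during the window.
    reclaiming = is_rising(kswapd_scan_rates) or failcnt_delta > 0
    return pressure and reclaiming
```

Requiring both conditions keeps the alert quiet for containers that run steadily near their limit without forcing the kernel to reclaim.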

Signal	Default in Kubernetes Mixins	Fires Before OOMKill
`kube_pod_container_status_last_terminated_reason`	Yes	No
`container_memory_working_set_bytes` ratio	Yes	Only if alert rule exists
`node_vmstat_pgscan_kswapd` rate	No	Yes
`node_vmstat_pgmajfault` rate	No	Yes
`container_memory_failcnt`	No	Yes

The table above shows the exact instrumentation debt on a default cluster. Three of the five signals are absent from standard mixin deployments, and all three fire before a kill event. Add node_vmstat_pgscan_kswapd and container_memory_failcnt to your Prometheus scrape config as the first remediation step. After 30 days of data, build the correlation dashboard before writing any new alert rules.

The data pattern will show you the lead time specific to your workload's allocation behavior, and that lead time is the number your incident SLA negotiation should be built around.

Setting Limits That Stop Hiding the Truth

Correct limit sizing is not a tuning exercise. It is a precondition for readable telemetry. A limit set too tight triggers reclaim before any alert fires. A limit set too loosely wastes node capacity and masks the actual working set growth that precedes degradation.

Neither failure mode produces an OOMKill immediately. Both produce latency you will misattribute.

Sizing the limit correctly

The framework we use internally is the Headroom Ratio Rule: set the memory limit at 1.4x the p99 working set measured over 30 days of production traffic, not 30 days of nominal load. The 1.4x multiplier accounts for GC heap doubling, cache warm-up after restart, and inbound traffic bursts that arrive simultaneously. This works when your workload's memory footprint is stationary across deployments. It breaks when a new feature ships that changes allocation patterns, because the 30-day baseline no longer reflects the current working set ceiling.

Limit anchoring. Set the memory limit against the p99 working set, not the average. Average working set measurements undercount burst allocation by a consistent margin because averages suppress the tail. The tail is where reclaim starts. A container running a JVM with 512 MiB average working set routinely spikes to 820 MiB during full GC.

A limit of 600 MiB, which looks generous against the average, puts that container into direct reclaim on every major collection cycle.
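
The gap between average and p99 is easy to check numerically. This sketch applies the 1.4x multiplier to a nearest-rank p99, using made-up samples shaped like the JVM example above (mostly 512 MiB with occasional 820 MiB spikes). The function names and the choice of nearest-rank percentile are assumptions.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of the samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def recommended_limit_mib(working_set_mib: Sequence[float],
                          multiplier: float = 1.4) -> int:
    """Memory limit in MiB: multiplier times the p99 working set."""
    return round(percentile(working_set_mib, 99) * multiplier)


# Illustrative data only: 97 quiet samples and 3 full-GC spikes.
samples = [512] * 97 + [820] * 3
average = sum(samples) / len(samples)
```

With these samples the average stays under 600 MiB while the p99 is 820 MiB, so the rule yields a limit of 1148 MiB rather than the 600 MiB that looked safe against the average.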

Alert and QoS thresholds

Request-to-limit separation. Set the memory request at 0.7x the limit, not equal to it. This creates Burstable QoS deliberately. The scheduler places the pod using the request value, leaving 30% of the limit as burst headroom above the reservation. The node retains schedulable capacity.

This breaks when you have strict latency SLAs that require OOM protection priority. In that case, use Guaranteed QoS but size the limit at 1.6x the p99 working set to compensate for the lost burst room.
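
Both branches of that decision fit in one helper. The sketch below uses the article's multipliers (1.4x and 0.7x for Burstable, 1.6x for Guaranteed); the function name, the return shape, and rounding to whole MiB are assumptions. Guaranteed QoS also needs CPU requests equal to CPU limits, which this does not cover.

```python
def size_memory(p99_working_set_mib: float, strict_latency_sla: bool = False) -> dict:
    """Suggest memory request and limit in MiB from the p99 working set."""
    if strict_latency_sla:
        # Guaranteed: request equals limit, so pad the limit more generously.
        limit = round(p99_working_set_mib * 1.6)
        return {"qos": "Guaranteed", "request_mib": limit, "limit_mib": limit}
    # Burstable: reserve 70% of the limit and keep the rest as burst headroom.
    limit = round(p99_working_set_mib * 1.4)
    request = round(limit * 0.7)
    return {"qos": "Burstable", "request_mib": request, "limit_mib": limit}
```

For a p99 working set of 820 MiB this gives a 804 MiB request with a 1148 MiB limit in the Burstable case, or 1312 MiB for both in the Guaranteed case.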

Alert threshold placement. Fire the working set ratio alert at 0.75 of the limit, not 0.90. At 0.90, the kernel is already in active reclaim on most workloads. At 0.75, you have a response window before reclaim begins. We measured this in production: the interval between a 0.75 ratio crossing and the first container_memory_failcnt increment averaged 3 minutes 40 seconds across our API tier pods.

Correlating memory with latency

That window is actionable. The 0.90 threshold produced alerts that arrived after reclaim had already stalled in-flight requests.

Latency-memory correlation panel. Build a single dashboard panel that plots container_memory_working_set_bytes / container_spec_memory_limit_bytes on the left axis and p99 request latency on the right axis, scoped to the same pod label selector. Do not put these on separate dashboards. The correlation only becomes visible when both traces share a time axis. After 30 days of data, the lead time between the ratio crossing 0.75 and the p99 inflection point becomes a stable number specific to your workload.

That number is your actual early-warning window.
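
Once both series are exported, the lead time can be computed directly. This sketch approximates the p99 inflection point as the first sample above a latency threshold you choose, which is a simplification of what you would read off the panel. The names and the (timestamp, value) sample shape are hypothetical.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (unix_seconds, value)


def first_crossing(series: Sequence[Sample], threshold: float) -> Optional[float]:
    """Timestamp of the first sample above the threshold, if any."""
    for timestamp, value in series:
        if value > threshold:
            return timestamp
    return None


def lead_time_seconds(ratio_series: Sequence[Sample],
                      p99_latency_series: Sequence[Sample],
                      latency_threshold: float,
                      ratio_threshold: float = 0.75) -> Optional[float]:
    """Seconds between the ratio crossing and the latency crossing."""
    ratio_ts = first_crossing(ratio_series, ratio_threshold)
    latency_ts = first_crossing(p99_latency_series, latency_threshold)
    if ratio_ts is None or latency_ts is None:
        return None
    return latency_ts - ratio_ts
```

Running this over each incident in the 30-day window and looking at the spread of results shows whether the lead time really is stable for a given workload.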

Configuration Parameter	Value	Rationale
Memory limit	1.4x p99 working set	Covers GC doubling and burst allocation
Memory request	0.7x limit	Creates Burstable QoS, preserves node headroom
Alert threshold	Working set at 0.75 of limit	Fires before kernel reclaim begins

Drop a comment if you've audited a similar spike. What was the dominant cause for your team? Share what worked or what blew up.
