Fer Rios

Posted on Jul 2 • Originally published at ferztyle.me

Kubernetes resource requests and limits explained: scheduling, throttling, and OOMKill

#kubernetes #devops #sre #k8s

This is part of the Platform engineering with Go series: a growing collection of posts on Kubernetes, Go tooling, and infrastructure automation.

The 3am incident nobody talks about

It's 3am. Your on-call phone goes off. A service is down in production. You log in, check the pods, and see this:

kubectl get pods -n production

NAME                    READY   STATUS             RESTARTS   AGE
api-7d6b9f8c4-xk2pq    0/1     OOMKilled          14         2d
api-7d6b9f8c4-mn9rt    0/1     OOMKilled          14         2d
api-7d6b9f8c4-p8wvz    0/1     OOMKilled          14         2d

You restart the pods. They come back. Five minutes later, they die again.

The problem could be several things: limits set too low for the actual workload, no limits set at all so the pod consumed memory freely until the node ran out, a traffic spike the pod wasn't provisioned for, or yes, a memory leak in the application itself. In all cases, the result is the same: memory consumption exceeded the allowed ceiling, the OOM killer fired, the pod died, and Kubernetes restarted it into the same situation. Fourteen times.

This is one of the most common production incidents in Kubernetes, and one of the most preventable. But preventing it requires understanding what requests and limits actually do, what happens when memory consumption exceeds them, and how to configure them correctly for your actual workload.

That's what this post is about.

No Go knowledge required: this post has zero Go code

Requests vs limits, two completely different things

Before anything else, the most important concept to internalize: requests and limits are not two versions of the same thing. They serve entirely different purposes and are enforced by entirely different components of Kubernetes.

# A pod spec with both requests and limits set
resources:
  requests:
    cpu: "250m"     # 250 millicores = 0.25 CPU cores
    memory: "256Mi" # 256 mebibytes
  limits:
    cpu: "500m"     # 500 millicores = 0.5 CPU cores
    memory: "512Mi" # 512 mebibytes

Here's what each one actually does:

Requests are a promise to the scheduler. When you set requests.memory: 256Mi, you're telling Kubernetes: "I need at least 256Mi of memory reserved for this pod on whatever node it runs on." The scheduler uses this value to decide which node to place the pod on. Once scheduled, that 256Mi is considered reserved on that node, even if the pod is only using 50Mi at the moment.

Limits are a ceiling enforced by the runtime. When you set limits.memory: 512Mi, you're telling Kubernetes: "This pod is never allowed to use more than 512Mi of memory." If it tries to exceed that ceiling, the kernel kills it. No warning. No graceful shutdown. Just gone.

The key insight: requests affect scheduling; limits affect runtime behavior. They are read by different components at different times for entirely different reasons.
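To make the units in the spec above concrete, here is a minimal Python sketch that turns quantity strings into plain numbers. The function names are my own, and it only handles the suffixes used in this post; the real Kubernetes quantity format accepts more.

```python
# Minimal sketch: convert quantity strings from a pod spec into plain numbers.
# Only the suffixes used in this post are handled (assumption); the real
# Kubernetes quantity format supports more.

MEMORY_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def cpu_to_millicores(quantity: str) -> int:
    """'250m' -> 250, '2' -> 2000, '0.5' -> 500."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def memory_to_bytes(quantity: str) -> int:
    """'256Mi' -> 268435456; a bare number is treated as bytes."""
    for suffix, factor in MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


print(cpu_to_millicores("250m"))  # 250
print(memory_to_bytes("256Mi"))   # 268435456
```

Working in millicores and bytes keeps every later comparison in integer math, which avoids floating-point surprises.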

How the scheduler uses requests to place pods

The Kubernetes scheduler's job is to find a node for each new pod. It does this by looking at each node's allocatable resources and comparing them against the sum of all pod requests already scheduled there.

A node's allocatable resources are not the same as its total capacity. Some resources are always reserved for the operating system and Kubernetes system components:

kubectl describe node my-node

# Look for the Allocatable section:
Allocatable:
  cpu:     3800m   # 3.8 cores available to pods (out of 4 total)
  memory:  7Gi     # 7Gi available to pods (out of 8Gi total)

The scheduler adds up the requests of all pods already on a node and compares that sum against the allocatable resources. If the remaining capacity is less than the new pod's request, the node is skipped.

Here's the critical subtlety: the scheduler cares about requests, not actual usage.

Node: 4 CPU allocatable

Pod A: requests 1 CPU → actual usage: 0.2 CPU
Pod B: requests 1 CPU → actual usage: 0.1 CPU
Pod C: requests 1 CPU → actual usage: 0.8 CPU

From the scheduler's perspective: 3 out of 4 CPU are "used"
From the kernel's perspective: only 1.1 CPU are actually being consumed

New pod requesting 1.5 CPU → scheduler says NO (only 1 CPU remaining)

This creates an important tension: if your requests are set too high relative to actual usage, your nodes look full when they're actually mostly idle, and new pods can't be scheduled. This is called poor bin packing, and it wastes money.

On the other hand, if your requests are too low, too many pods get scheduled onto the same node. When they all start consuming resources simultaneously, the node becomes overloaded, and Kubernetes starts evicting pods to relieve the pressure.

Getting requests right is a balancing act between cost efficiency and stability.
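The fit check from the example above can be sketched in a few lines of Python. This is a simplification that looks at CPU only, in millicores; the real scheduler also considers memory, taints, affinity, and more, and the function name is my own.

```python
# Sketch of the scheduler's resource fit check (CPU only, values in millicores).
# Note that actual usage never appears here: only requests count.

def fits(allocatable_m, scheduled_requests_m, new_request_m):
    """Return True if the new pod's request fits in what is still unreserved."""
    remaining = allocatable_m - sum(scheduled_requests_m)
    return new_request_m <= remaining


node_allocatable = 4000                 # 4 CPU allocatable
existing_requests = [1000, 1000, 1000]  # Pods A, B, and C request 1 CPU each

print(fits(node_allocatable, existing_requests, 1500))  # False: only 1000m unreserved
print(fits(node_allocatable, existing_requests, 1000))  # True
```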

CPU limits and throttling: the silent killer

CPU is what Kubernetes calls a compressible resource. If a pod tries to use more CPU than its limit allows, the Linux kernel doesn't kill it; it throttles it. The pod keeps running, but it gets fewer CPU cycles, so everything it does takes longer.

The throttling mechanism is the bandwidth control feature of the Linux CFS (Completely Fair Scheduler).

Understanding CPU cycles and scheduling periods

Before we talk about how throttling works, it helps to understand two concepts that are invisible in day-to-day operations but fundamental to what's happening under the hood.

What is a CPU cycle?

Your server's CPU is constantly doing work: executing instructions, processing data, running code. A CPU cycle is the smallest unit of that work. A modern CPU completes billions of cycles per second; that's what the "3.2GHz" on a server spec means: 3.2 billion cycles per second.

Think of CPU cycles like minutes of attention from a very fast worker. Your pod's processes need a certain number of those "minutes" to do their job: handle a request, run a query, process a message. The more cycles your pod gets, the faster it runs. The fewer it gets, the slower it runs.

When Kubernetes talks about CPU in millicores (250m, 500m, 1000m), it's describing what fraction of one CPU core's cycles your pod gets access to:

1000m = 1 full CPU core = 100% of one core's cycles
 500m = 0.5 CPU core   = 50% of one core's cycles
 250m = 0.25 CPU core  = 25% of one core's cycles

What is a scheduling period?

A CPU doesn't serve one process at a time from start to finish. It slices time into tiny windows and gives each process a turn. This is called time-sharing, and the windows are called scheduling periods.

Think of it like a teacher in a classroom. Instead of helping one student for the entire class, the teacher spends 5 minutes with each student in rotation. Every student gets attention, but no single student monopolizes the teacher's time.

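For CPU limits, the Linux CFS uses an enforcement period of 100 milliseconds by default (the cgroup setting cpu.cfs_period_us on cgroup v1, or the period part of cpu.max on cgroup v2). In every 100ms window, the CPU divides its time among all the processes competing for it.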

100ms scheduling period
│
├── 0ms  - 25ms  → Pod A gets its turn  (250m limit = 25% of 100ms = 25ms)
├── 25ms - 75ms  → Pod B gets its turn  (500m limit = 50% of 100ms = 50ms)
├── 75ms - 100ms → Pod C gets its turn  (250m limit = 25% of 100ms = 25ms)
│
└── (next 100ms period starts)

Each pod's CPU limit determines how many milliseconds of that 100ms window it's allowed to use:

CPU limit of 250m → 25ms of CPU time per 100ms period
CPU limit of 500m → 50ms of CPU time per 100ms period
CPU limit of 1000m (1 full core) → 100ms of CPU time per 100ms period
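The conversion above is plain arithmetic. Here is a small Python sketch of it, assuming the default 100ms period; the function name is hypothetical.

```python
# Sketch: translate a CPU limit in millicores into a CFS quota per period.

CFS_PERIOD_US = 100_000  # default period: 100ms, expressed in microseconds


def cfs_quota_us(limit_millicores, period_us=CFS_PERIOD_US):
    """CPU time (in microseconds) the container may use in each period."""
    return limit_millicores * period_us // 1000


for limit in (250, 500, 1000):
    print(f"{limit}m -> {cfs_quota_us(limit) / 1000:.0f}ms per 100ms period")
# 250m -> 25ms per 100ms period
# 500m -> 50ms per 100ms period
# 1000m -> 100ms per 100ms period
```

A limit above one core simply yields a quota larger than the period (2000m gives 200ms per 100ms), which a multi-threaded process can spend across several cores.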

What happens when a pod hits its limit mid-period?

Here's where throttling kicks in. If a pod uses up its entire allocation before the 100ms period ends, the CFS puts it in a throttled state for the rest of that period; it gets zero CPU cycles until the next period starts, regardless of whether other pods are idle.

Pod with 250m CPU limit → 25ms of allowed CPU time per 100ms period

Period 1 (0ms to 100ms):
  ├── 0ms:  Pod starts processing a request
  ├── 25ms: Pod has used its full 25ms allocation ← throttled here
  ├── 25ms to 100ms: Pod sits idle, gets zero CPU cycles
  └── 100ms: New period starts, pod gets another 25ms

What the user experiences:
  ├── Request arrives at 0ms
  ├── Pod processes half the request, then waits 75ms doing nothing
  └── Response arrives much later than it should

The pod didn't crash. It didn't log an error. It just stopped making progress for 75ms out of every 100ms, which is why a throttled service feels sluggish rather than broken. Everything works, just much slower than it should.
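To get a feel for the latency cost, here is a rough Python model of the timeline above. It assumes a single-threaded request that starts exactly at the beginning of a period in a container doing nothing else, so treat the numbers as an illustration, not a prediction.

```python
# Rough model: wall-clock time to finish a request under a CFS quota.

def wall_clock_ms(cpu_work_ms, quota_ms, period_ms=100):
    """Wall-clock milliseconds needed to complete cpu_work_ms of CPU work."""
    if cpu_work_ms <= 0:
        return 0
    # Periods in which the quota is used up and the pod then waits for the next one.
    exhausted_periods = (cpu_work_ms - 1) // quota_ms
    # CPU time still needed in the final period.
    leftover = cpu_work_ms - exhausted_periods * quota_ms
    return exhausted_periods * period_ms + leftover


# A request that needs 60ms of CPU time:
print(wall_clock_ms(60, quota_ms=25))   # 210 (250m limit)
print(wall_clock_ms(60, quota_ms=50))   # 110 (500m limit)
print(wall_clock_ms(60, quota_ms=100))  # 60  (1000m limit, never throttled)
```

With a 250m limit, 60ms of real work turns into 210ms of latency, and 150ms of that is spent waiting for the next period.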

A concrete analogy

Imagine you're writing a report and your manager says you can only use the shared laptop for 15 minutes every hour. You start writing, but at the 15-minute mark your access is cut off, even if you're mid-sentence. You sit and wait for the next hour to start before you can type another word.

That's exactly what the CFS does to a throttled pod. It doesn't care that you were in the middle of something important. When the quota is up, the process waits, and whatever request it was handling has to wait too.

Why this is hard to detect

The reason CPU throttling causes so much confusion is that it's invisible in all the usual places:

# This shows current usage — looks fine
kubectl top pods -n production
NAME                 CPU(cores)   MEMORY(bytes)
api-6d8f9b7c-xk2pq   240m         180Mi

# But the pod might be throttled 80% of the time
# kubectl top shows average usage, not whether that usage caused throttling

A pod using 240m CPU on average can still be heavily throttled if it regularly bursts above its limit within a single 100ms period. The average looks healthy; the latency tells a different story.

This is what makes CPU throttling dangerous: it is completely silent.

There's no error. No log line. No Kubernetes event. Your pod just slows down. Requests take longer. Latency increases. Users notice something is wrong, but nothing in your logs explains why.

The only reliable way to detect throttling is through the container_cpu_cfs_throttled_periods_total metric in Prometheus, which counts how many scheduling periods a container was throttled in. If that number is climbing, throttling is happening regardless of what kubectl top shows.
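A raw counter is hard to read on its own; what usually matters is the share of periods that were throttled. cAdvisor exposes a companion counter, container_cpu_cfs_periods_total, so the ratio can be computed from two samples of each. A minimal Python sketch, with made-up sample values and a hypothetical function name:

```python
# Sketch: fraction of CFS periods in which a container was throttled,
# computed from two samples of the periods and throttled-periods counters.

def throttle_ratio(periods_before, throttled_before, periods_after, throttled_after):
    """Share of elapsed periods (0.0 to 1.0) in which throttling occurred."""
    elapsed_periods = periods_after - periods_before
    if elapsed_periods <= 0:
        return 0.0
    return (throttled_after - throttled_before) / elapsed_periods


# Hypothetical counter values sampled five minutes apart
ratio = throttle_ratio(
    periods_before=12_000, throttled_before=3_000,
    periods_after=15_000, throttled_after=5_400,
)
print(f"throttled in {ratio:.0%} of periods")  # throttled in 80% of periods
```

What counts as too much depends on the service, but a latency-sensitive API throttled in a large share of its periods deserves a closer look.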

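If you don't have Prometheus, a rough first check is to compare current usage from the Kubernetes metrics server against the CPU limits in your pod spec:

# See current CPU usage (compare it against the pod's CPU limit)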
kubectl top pods -n production

NAME                    CPU(cores)   MEMORY(bytes)
api-7d6b9f8c4-xk2pq    480m         210Mi
api-7d6b9f8c4-mn9rt    498m         198Mi

If you see pods consistently near their CPU limit, throttling is likely happening, especially if you're seeing latency spikes that don't correlate with error rates.

The CPU limits controversy

There's an ongoing debate in the Kubernetes community about CPU limits. Some platform teams remove CPU limits entirely for latency-sensitive services, allowing pods to burst freely as long as there's spare capacity on the node.

The argument for removing CPU limits:

Eliminates throttling completely.
Workloads use spare node capacity efficiently.
Latency becomes predictable because pods are never artificially slowed.

The argument for keeping CPU limits:

A noisy neighbor pod can consume all spare CPU and starve other pods.
Without limits, a bug in one service can degrade the entire node.

The right answer depends on your workload. For latency-sensitive APIs, consider removing CPU limits and relying on requests alone. For batch workloads, CPU limits are fine. For anything in between, measure throttling first before deciding.

Memory limits and OOMKill: the dangerous one

Memory is what Kubernetes calls an incompressible resource. Unlike CPU, the kernel cannot throttle memory access; it can't say "you only get 80% of the memory reads you asked for." Memory is binary: either the process has it, or it doesn't.

When a pod's memory usage exceeds its limit, the Linux OOM (Out of Memory) killer terminates one of its processes immediately. No warning. No graceful shutdown. The process is gone. Kubernetes then sees that a container has exited unexpectedly and restarts it. When that keeps happening, the restarts are delayed with an increasing back-off, which is where CrashLoopBackOff comes from.

# Detect OOMKill in pod description
kubectl describe pod api-7d6b9f8c4-xk2pq -n production

# Look for this in the output:
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Sun, 21 Jun 2026 02:14:32 +0000
  Finished:     Sun, 21 Jun 2026 02:14:33 +0000

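Exit code 137 is 128 + 9: the process was killed by signal 9 (SIGKILL). Together with Reason: OOMKilled, it tells you the OOM killer terminated the container for exceeding its memory limit. If you see this, either your memory limit is too low for your actual workload, or the application is using more memory than it should (a leak, for example).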
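The 128 + signal convention applies to other exit codes too. As a small sketch, assuming a POSIX system where Python's signal module knows the signal names, a hypothetical helper could decode them like this:

```python
import signal

# Sketch: decode a container exit code using the 128 + signal-number convention.

def describe_exit_code(code):
    """Return a short human-readable description of an exit code."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"  # number not known on this platform
        return f"killed by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))  # killed by SIGKILL
print(describe_exit_code(143))  # killed by SIGTERM
print(describe_exit_code(1))    # exited with status 1
```

Keep in mind that 137 alone only says SIGKILL; it is the OOMKilled reason in the pod status that ties it to memory.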

OOMKill at the node level

OOMKill can also happen at the node level, independently of your pod limits. If total memory consumption across all pods on a node approaches the node's total capacity, the Linux kernel's node-level OOM killer activates.

Kubernetes tries not to let it get that far. The kubelet has its own eviction mechanism: if available memory on a node drops below a configured threshold, it starts evicting pods proactively. Which pods get evicted first closely follows QoS class, which we'll cover next.

QoS classes: who dies first under pressure

Kubernetes automatically assigns every pod a Quality of Service (QoS) class based on how its requests and limits are configured. You don't set this manually; it's derived. Under node pressure, Kubernetes evicts pods in order from lowest to highest QoS class.

There are three classes:

BestEffort: lowest priority

A pod is BestEffort when it has no requests or limits set at all:

# BestEffort, no resources section at all
resources: {}

BestEffort pods are the first to be evicted under node pressure. They get whatever resources happen to be available, and nothing is guaranteed. Never run production workloads as BestEffort.

Burstable: middle priority

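A pod is Burstable when at least one container has a request or limit set, but the pod doesn't meet the criteria for Guaranteed. The typical case is requests set, with limits either not set or higher than requests: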

# Burstable, requests set, limits higher than requests
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

Most production workloads should be Burstable. The pod has guaranteed minimum resources (the requests) but can burst above them when capacity is available. Under eviction pressure, Burstable pods are evicted after BestEffort but before Guaranteed.

Guaranteed: highest priority

A pod is Guaranteed when every container has both CPU and memory requests and limits set, and each request equals its limit:

# Guaranteed, requests == limits
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"     # same as request
    memory: "512Mi" # same as request

Guaranteed pods are the last to be evicted. Kubernetes will sacrifice BestEffort and Burstable pods before touching a Guaranteed pod. Use this for your most critical services, databases, core APIs, and anything where an eviction is catastrophic.

To check a pod's QoS class:

kubectl get pod api-7d6b9f8c4-xk2pq -n production \
  -o jsonpath='{.status.qosClass}'

# Output: Burstable
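The derivation rules for the three classes can be sketched in Python. This is my own simplified reading of the rules described above, not the kubelet's implementation, and it assumes equal quantities are written identically ("500m" and "0.5" would not compare as equal here).

```python
# Sketch: derive a pod's QoS class from its containers' resources.
# Each container is a dict like {"requests": {...}, "limits": {...}}.

RESOURCES = ("cpu", "memory")


def qos_class(containers):
    any_set = False
    all_guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in RESOURCES:
            limit = limits.get(resource)
            # A missing request defaults to the limit when a limit is set.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


print(qos_class([{}]))  # BestEffort
print(qos_class([{"requests": {"cpu": "250m", "memory": "256Mi"},
                  "limits": {"cpu": "500m", "memory": "512Mi"}}]))  # Burstable
print(qos_class([{"requests": {"cpu": "500m", "memory": "512Mi"},
                  "limits": {"cpu": "500m", "memory": "512Mi"}}]))  # Guaranteed
```

Note that a single container without a memory limit is enough to turn an otherwise Guaranteed pod into Burstable.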

A practical framework for setting requests and limits

Knowing the theory is one thing. Knowing what numbers to put in your YAML is another. Here's a step-by-step framework that works in production:

Step 1: observe before you configure

Deploy your workload without limits first (or with very high limits that won't be hit). Let it run under realistic traffic for several days, including peak hours. Collect metrics with kubectl top pods or Prometheus.

# Watch resource usage over time
watch kubectl top pods -n production -l app=api

Step 2: set requests at p50 (median) usage

Your request should reflect typical usage, what the pod uses most often. The p50 (50th percentile) of observed CPU and memory usage is a good starting point.

If your service typically uses 150m CPU and 200Mi memory:

requests:
  cpu: "150m"
  memory: "200Mi"

Step 3: set memory limits at p99 plus a buffer

Your memory limit should handle traffic spikes without triggering OOMKill. The p99 (99th percentile) plus a 20-30% buffer is a safe starting point:

limits:
  memory: "350Mi"  # p99 was ~280Mi, plus 25% buffer

Step 4: handle CPU limits carefully

Start at 2-4x your CPU request. Monitor for throttling. If you see consistent throttling in container_cpu_cfs_throttled_seconds_total, either raise the limit or remove it for that service.

limits:
  cpu: "500m"  # roughly 3x the 150m request, gives room to burst
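Steps 2 to 4 can be turned into a small calculation. The Python sketch below uses a nearest-rank percentile and made-up integer samples (CPU in millicores, memory in Mi); the function names, the 25% buffer, and the 3x factor are assumptions you should tune for your own workload.

```python
# Sketch: derive starting requests and limits from observed usage samples.

def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct / 100 * n) in integer math
    return ordered[max(rank, 1) - 1]


def suggest_resources(cpu_m_samples, mem_mi_samples, mem_buffer_pct=25, cpu_limit_factor=3):
    cpu_request = percentile(cpu_m_samples, 50)
    mem_request = percentile(mem_mi_samples, 50)
    mem_p99 = percentile(mem_mi_samples, 99)
    mem_limit = (mem_p99 * (100 + mem_buffer_pct) + 99) // 100  # round up
    return {
        "requests": {"cpu": f"{cpu_request}m", "memory": f"{mem_request}Mi"},
        "limits": {"cpu": f"{cpu_request * cpu_limit_factor}m", "memory": f"{mem_limit}Mi"},
    }


# Hypothetical samples collected during the observation period
cpu_samples = [120, 140, 150, 150, 160, 170, 300]
mem_samples = [180, 190, 200, 200, 210, 230, 280]
print(suggest_resources(cpu_samples, mem_samples))
# {'requests': {'cpu': '150m', 'memory': '200Mi'}, 'limits': {'cpu': '450m', 'memory': '350Mi'}}
```

In practice you would feed this from Prometheus over several days of data and round the results to friendlier numbers, such as 500m instead of 450m.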

Step 5: use LimitRange to enforce defaults at the namespace level

As a platform engineer, you don't want to rely on every developer remembering to set resources. Use a LimitRange to provide defaults for containers that don't specify them:

# Enforce default requests and limits for all containers in a namespace
apiVersion: v1
kind: LimitRange
metadata:
  name: default-resource-limits
  namespace: production
spec:
  limits:
  - type: Container
    default:           # applied when no limits are specified
      cpu: "500m"
      memory: "256Mi"
    defaultRequest:    # applied when no requests are specified
      cpu: "100m"
      memory: "128Mi"
    max:               # no container can exceed these
      cpu: "2"
      memory: "2Gi"
    min:               # no container can go below these
      cpu: "50m"
      memory: "64Mi"
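To see how these four fields interact, here is a simplified Python model of the defaulting and validation for a single resource (CPU, in millicores). It mirrors the values in the LimitRange above, but the function is a hypothetical sketch of the behavior, not the admission controller's code.

```python
# Sketch: how a LimitRange's defaults and bounds apply to one container's CPU.

LIMIT_RANGE = {"default": 500, "defaultRequest": 100, "min": 50, "max": 2000}


def apply_limit_range(request=None, limit=None, lr=LIMIT_RANGE):
    """Return the effective (request, limit) or raise ValueError if rejected."""
    if request is None:
        # An explicit limit without a request: the request follows the limit.
        request = limit if limit is not None else lr["defaultRequest"]
    if limit is None:
        limit = lr["default"]
    if request < lr["min"]:
        raise ValueError(f"request {request}m is below min {lr['min']}m")
    if limit > lr["max"]:
        raise ValueError(f"limit {limit}m is above max {lr['max']}m")
    if request > limit:
        raise ValueError(f"request {request}m exceeds limit {limit}m")
    return request, limit


print(apply_limit_range())             # (100, 500): both defaults applied
print(apply_limit_range(limit=300))    # (300, 300): request follows the limit
print(apply_limit_range(request=200))  # (200, 500): default limit applied
```

One consequence worth noticing: a container that requests more than the default limit without setting its own limit is rejected, because its request would exceed the injected limit.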

Step 6: use ResourceQuota to cap total namespace consumption

ResourceQuota limits the total resources that can be consumed across all pods in a namespace, useful for multi-tenant clusters and chargeback:

# Cap total resource consumption for the production namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "10"      # total CPU requests across all pods
    requests.memory: "20Gi" # total memory requests across all pods
    limits.cpu: "20"        # total CPU limits across all pods
    limits.memory: "40Gi"   # total memory limits across all pods
    pods: "50"              # maximum number of pods

Common mistakes

Running without any resource configuration (BestEffort)

This is the most dangerous mistake. With no requests, the scheduler has no visibility into what your pod actually needs. Under node pressure, your pods are the first to be evicted, regardless of how critical your service is.

Setting limits without requests

When you set a limit without a request, Kubernetes automatically sets the request equal to the limit. If that holds for both CPU and memory in every container, the pod becomes Guaranteed. This isn't always wrong, but it means you're reserving 100% of your limit on the scheduler even if typical usage is much lower. Over time this leads to poor bin packing and wasted capacity.

Copying the same resource values for every service

A stateless Go API, a JVM-based service, and a batch data processor have completely different memory profiles. A Go binary might be happy with 128Mi. A JVM service might need 1Gi just for the heap. Tune per workload, not per deployment template.

Forgetting init containers

Init containers run before your main container and have their own resource requirements. If you set tight limits on init containers, pod initialization can fail, and Kubernetes will keep retrying. Always check the init container resource usage too:

initContainers:
- name: db-migrate
  resources:
    requests:
      cpu: "100m"
      memory: "128Mi"
    limits:
      cpu: "200m"
      memory: "256Mi"
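Init containers also affect scheduling. Because they run one at a time before the app containers start, the pod's effective request for a resource is the larger of the biggest init container request and the sum of the app container requests. A minimal Python sketch of that rule follows; it ignores restartable (sidecar) init containers, and the function name is my own.

```python
# Sketch: effective pod request for one resource when init containers are present.

def effective_request(init_requests, app_requests):
    """max(largest init container request, sum of app container requests)."""
    return max(max(init_requests, default=0), sum(app_requests))


# CPU in millicores: a db-migrate init container (100m) and one app container (250m)
print(effective_request([100], [250]))        # 250: the app container dominates

# A hypothetical heavy init container can dominate scheduling instead
print(effective_request([1000], [250, 100]))  # 1000
```

An oversized init container request can therefore keep a pod unschedulable even when the app containers themselves would fit.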

Setting memory requests too low relative to actual usage

If your pod's actual memory usage regularly exceeds its request (even if it stays below the limit), it becomes a candidate for eviction under node pressure. The request is not just a scheduling hint; it also affects eviction priority within the Burstable class. Pods whose usage exceeds their request are evicted before pods whose usage stays within their request.

Summary

Requests and limits are the foundation of stable Kubernetes workloads. Get them wrong, and you'll spend nights chasing OOM kills and throttling-induced latency spikes. Get them right, and your cluster runs efficiently with predictable, stable workloads.

Three things to take away:

Requests tell the scheduler where to place your pod. Always set them, never run BestEffort in production.
Memory limits tell the kernel when to kill your pod. Set them generously enough to handle traffic spikes, and monitor for OOMKill with kubectl describe pod.
CPU limits are more nuanced. Start with 2-4x your CPU request and remove them for latency-sensitive services if throttling is a problem.

Understanding this is the prerequisite for everything else in platform engineering: autoscaling, capacity planning, chargeback, and multi-tenancy all depend on the resource configuration being correct.

P.S.: Writing this post from Asunción, Paraguay 🇵🇾 where it's hard to concentrate right now. Paraguay just knocked Germany out of the 2026 World Cup on penalties. For the first time in history, Germany lost a penalty shootout. Goalkeeper Orlando Gil was a hero. We face France next. Forgive me if there are any bugs or mistakes in this post. I may have been screaming at a television.

Let's connect!

One of the best parts of writing in public is the people you meet along the way, engineers at different stages of their journey, working on similar problems from completely different angles.

If something in this post resonated, if you spotted a bug, or if you just want to talk Go, Kubernetes, Platform Engineering, DevOps, or whatever, I'm always happy to hear from you.

Building from Asunción, Paraguay 🇵🇾
