Falolu Olaitan

Posted on Jun 8

Why Your AKS Pods Keep Getting OOMKilled Even When CPU Looks Fine

#azure #kubernetes #performance #sre

Introduction

One of the most misleading situations in Kubernetes is when a pod keeps restarting because of an OOMKilled event while CPU utilization looks perfectly healthy.

I have seen engineers spend hours investigating CPU throttling, autoscaling, node capacity, and even networking, only to discover later that memory was the actual problem.

The reality is that Kubernetes treats CPU and memory very differently. CPU is a compressible resource: a container that hits its CPU limit is throttled and simply runs slower. Memory is not compressible: once a container tries to use more than its memory limit, the kernel terminates it.

Understanding why this happens is critical for running production workloads reliably.

Understanding OOMKilled

OOM stands for Out Of Memory.

When a container exceeds its memory limit, the Linux kernel invokes the Out Of Memory (OOM) Killer, which terminates a process inside the container, usually the one consuming the most memory.

From Kubernetes' perspective, the container exits unexpectedly and the pod enters a restart cycle.

You will typically see something similar to:

kubectl describe pod payment-api-5f4d7d8d9f-xqk2r

Output:

Last State: Terminated
Reason: OOMKilled
Exit Code: 137

Exit code 137 is usually the first indication that memory exhaustion caused the restart.
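The number itself can be decoded. By convention, an exit code above 128 means the process was killed by a signal, and the signal number is the code minus 128. Here is a minimal Python sketch of that arithmetic; `signal_from_exit_code` is a hypothetical helper name, not part of any Kubernetes tooling.

```python
import signal
from typing import Optional


def signal_from_exit_code(exit_code: int) -> Optional[str]:
    """Return the signal name for a 128+N exit code, or None otherwise."""
    if exit_code <= 128:
        return None  # the process exited on its own
    try:
        return signal.Signals(exit_code - 128).name
    except ValueError:
        return None  # not a known signal number on this platform


print(signal_from_exit_code(137))  # 137 - 128 = 9, which is SIGKILL on Linux
```

SIGKILL can come from other sources too, so the OOMKilled reason is the real confirmation and the exit code is only a hint.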

Why CPU Looks Healthy

Many teams monitor CPU aggressively while paying little attention to memory consumption.

Consider this example:

resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: 500m
    memory: 1Gi

Application metrics show:

CPU Usage: 120m
Memory Usage: 1.1Gi

CPU utilization appears healthy.

However, the application is trying to use more memory than the configured 1Gi limit.

The kernel terminates the container as soon as it crosses that limit.

The result is:

CPU Fine
Memory Exhausted
Container Killed

This is why relying solely on CPU dashboards often leads engineers in the wrong direction.
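The comparison in this example can be sketched in a few lines of Python. This is an illustration under assumptions: `memory_bytes` and `check` are hypothetical helpers, and the parser only covers the common memory suffixes.

```python
# Binary and decimal suffixes used by Kubernetes memory quantities.
_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def memory_bytes(quantity: str) -> float:
    """Convert a quantity such as '512Mi' or '1.1Gi' to bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    return float(quantity)  # no suffix: a plain number of bytes


def check(cpu_used_m: int, cpu_limit_m: int, mem_needed: str, mem_limit: str) -> str:
    """Express CPU and memory as a percentage of their limits."""
    cpu_pct = 100 * cpu_used_m / cpu_limit_m
    mem_pct = 100 * memory_bytes(mem_needed) / memory_bytes(mem_limit)
    verdict = "OOMKill expected" if mem_pct > 100 else "within limit"
    return f"cpu {cpu_pct:.0f}% of limit, memory {mem_pct:.0f}% of limit: {verdict}"


print(check(120, 500, "1.1Gi", "1Gi"))
# cpu 24% of limit, memory 110% of limit: OOMKill expected
```

A dashboard that only plots the first number would show a pod with plenty of room to spare.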

Requests and Limits Are Not the Same Thing

One of the most common misunderstandings in Kubernetes is confusing requests with limits.

Requests

Requests determine scheduling.

requests:
  memory: 512Mi

Kubernetes uses this value when deciding where to place the pod.

Limits

Limits determine maximum consumption.

limits:
  memory: 1Gi

Once the container's memory usage exceeds this value, the kernel's OOM killer terminates it and Kubernetes reports the container as OOMKilled.

Think of requests as a reservation and limits as a hard wall.

Cross the wall and the container dies.
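To make the scheduling side concrete, here is a simplified Python sketch of the memory check a scheduler performs. It is an assumption-laden illustration: `fits_on_node` is a hypothetical function, the node sizes are made up, and the real scheduler also considers CPU, affinity, taints, and more.

```python
def fits_on_node(allocatable_mib: int, scheduled_requests_mib: list[int], new_request_mib: int) -> bool:
    """Compare the sum of memory requests with the node's allocatable memory.

    Limits and live usage play no part in this check.
    """
    return sum(scheduled_requests_mib) + new_request_mib <= allocatable_mib


# Hypothetical node with 4096 MiB allocatable and three pods already placed.
existing = [1024, 2048, 512]
print(fits_on_node(4096, existing, 512))   # True: requests add up to exactly 4096
print(fits_on_node(4096, existing, 1024))  # False: requests would total 4608
```

Because only requests are counted, a node can be "full" for scheduling while its pods are free to grow toward much higher limits.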

How to Confirm an OOMKill

Start with:

kubectl get pods

You may see:

CrashLoopBackOff

Then inspect the pod:

kubectl describe pod <pod-name>

Look for:

Reason: OOMKilled

You can also check previous logs:

kubectl logs <pod-name> --previous

This is useful because the current container may already have restarted.
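The same check can be scripted against the pod's JSON. The sketch below assumes you have saved the output of kubectl get pod <pod-name> -o json; the sample document is trimmed and made up, and `oom_killed_containers` is a hypothetical helper.

```python
import json


def oom_killed_containers(pod: dict) -> list[dict]:
    """List containers whose previous termination reason was OOMKilled."""
    hits = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated")
        if terminated and terminated.get("reason") == "OOMKilled":
            hits.append({
                "container": status["name"],
                "exit_code": terminated.get("exitCode"),
                "restarts": status.get("restartCount", 0),
            })
    return hits


# Trimmed, made-up sample of the pod JSON.
sample = json.loads("""
{"status": {"containerStatuses": [
  {"name": "payment-api", "restartCount": 4,
   "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
  {"name": "sidecar", "restartCount": 0, "lastState": {}}
]}}
""")
print(oom_killed_containers(sample))
```

Reading lastState rather than the current state matters for the same reason as --previous: by the time you look, the container has usually restarted.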

Investigating Memory Consumption

Check actual consumption:

kubectl top pod

Example:

NAME                 CPU(cores)   MEMORY(bytes)
payment-api          90m          1050Mi

If the limit is:

memory: 1024Mi

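usage is at the limit, and the container will be terminated the next time it tries to allocate more. Note that kubectl top pod sums all containers in the pod; add --containers to see each container against its own limit.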

Also inspect node utilization:

kubectl top node

This helps determine whether the issue is isolated to the workload or affecting the entire node.
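If you want to scan many pods at once, the text output is easy to parse. This is a rough sketch, assuming every pod shares one limit and that memory is reported in Mi; `near_limit` and the orders-api row are made up for illustration.

```python
def near_limit(top_output: str, limit_mib: int, threshold: float = 0.9) -> list[tuple[str, int]]:
    """Return (pod, memory MiB) pairs at or above threshold * limit."""
    flagged = []
    for line in top_output.strip().splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] == "NAME":
            continue  # skip the header and malformed lines
        name, memory = fields[0], fields[2]
        if not memory.endswith("Mi"):
            continue  # this sketch only handles the Mi unit
        mib = int(memory[:-2])
        if mib >= threshold * limit_mib:
            flagged.append((name, mib))
    return flagged


output = """
NAME                 CPU(cores)   MEMORY(bytes)
payment-api          90m          1050Mi
orders-api           40m          300Mi
"""
print(near_limit(output, limit_mib=1024))  # [('payment-api', 1050)]
```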

Common Causes of OOMKilled Events

Memory Leaks

Applications continuously allocate memory but never release it.

Typical examples:

Unclosed database connections
Large object caching
Static collections
Long-running background workers

The memory graph steadily increases until the limit is reached.
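A steady climb can be quantified with a simple least-squares slope over periodic memory readings. The sketch below uses made-up samples and a hypothetical `growth_rate` helper; real workloads are noisier, so treat the estimate as a hint rather than a forecast.

```python
def growth_rate(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of (hours, MiB) samples, in MiB per hour."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    covariance = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    return covariance / variance  # needs at least two distinct timestamps


# Hypothetical readings: (hours since start, memory in MiB).
readings = [(0, 400), (1, 452), (2, 498), (3, 551), (4, 600)]
rate = growth_rate(readings)
hours_left = (1024 - readings[-1][1]) / rate
print(f"{rate:.1f} MiB/h, about {hours_left:.1f} h until a 1024Mi limit")
```

A slope that stays positive across restarts and traffic lulls is a strong sign of a leak rather than normal load.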

Large Payload Processing

Applications processing large files often experience memory spikes.

Examples:

PDF generation
Image manipulation
Bulk imports
Report generation

The workload may run successfully hundreds of times before encountering a payload large enough to trigger an OOMKill.
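The reason payload size matters is that code which loads the whole input has a memory peak that grows with the input. As a hedged illustration (hashing stands in for real processing, and both function names are hypothetical), compare a buffered and a streaming version of the same work:

```python
import hashlib
import io


def digest_buffered(stream) -> str:
    """Read everything first: peak memory grows with payload size."""
    return hashlib.sha256(stream.read()).hexdigest()


def digest_streaming(stream, chunk_size: int = 64 * 1024) -> str:
    """Process chunk by chunk: peak memory stays near chunk_size."""
    sha = hashlib.sha256()
    while chunk := stream.read(chunk_size):
        sha.update(chunk)
    return sha.hexdigest()


payload = b"x" * (5 * 1024 * 1024)  # 5 MiB stand-in for an uploaded file
same = digest_buffered(io.BytesIO(payload)) == digest_streaming(io.BytesIO(payload))
print(same)  # True: same result, very different memory profile
```

Not every workload can be streamed, but where it can, the memory limit stops depending on the largest payload a customer happens to send.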

Incorrect Limits

Sometimes the application simply requires more memory than allocated.

For example:

limits:
  memory: 512Mi

while the application typically needs under production load:

750Mi

In this case Kubernetes is behaving exactly as configured.

The configuration is wrong.

.NET Applications

Many modern .NET applications can consume significant memory under load.

Common contributors include:

Large object heap growth
Heavy caching
Excessive serialization
Background processing

The application may perform perfectly in development but fail under production traffic.

Why Increasing Memory Is Not Always the Fix

The immediate reaction is usually:

limits:
  memory: 2Gi

Problem solved.

Or maybe not.

If a memory leak exists, the application will eventually consume:

2Gi
3Gi
4Gi

and fail again.

Increasing limits without understanding consumption patterns only delays the problem.

Always determine whether memory growth is expected or abnormal.
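The arithmetic behind "only delays the problem" is easy to see with assumed numbers. This sketch supposes a 600 MiB steady state and a linear 50 MiB/h leak; both figures and the `hours_until_limit` helper are hypothetical.

```python
def hours_until_limit(baseline_mib: float, leak_mib_per_hour: float, limit_mib: float) -> float:
    """Hours until a linearly leaking container reaches its limit."""
    return (limit_mib - baseline_mib) / leak_mib_per_hour


# Assumed numbers: 600 MiB steady state plus a 50 MiB/h leak.
for limit in (1024, 2048, 4096):
    hours = hours_until_limit(600, 50, limit)
    print(f"{limit}Mi limit: OOMKill after ~{hours:.0f} h")
# 1024Mi limit: OOMKill after ~8 h
# 2048Mi limit: OOMKill after ~29 h
# 4096Mi limit: OOMKill after ~70 h
```

Quadrupling the limit buys less than three days, and the pod now holds four times as much node memory while it waits to die.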

Monitoring OOMKills in AKS

Container Insights provides visibility into:

Memory trends
Pod restarts
Node pressure
Container consumption

Useful Kusto query:

KubePodInventory
| where ContainerStatusReason == "OOMKilled"
| project TimeGenerated, Namespace, PodName = Name, ContainerName
| order by TimeGenerated desc

This helps identify recurring offenders before they become production incidents.

Preventing OOMKilled Events

Right-Size Resources

Avoid guessing.

Measure actual workload consumption.

Use production metrics to determine realistic values.
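One way to turn measurements into values is to derive the request from typical usage and the limit from the observed peak plus headroom. The policy below (median for the request, peak plus 20% for the limit) is an assumption for illustration, not a rule; `suggest_memory` and the samples are hypothetical.

```python
import math


def suggest_memory(samples_mib: list[float], headroom_pct: int = 20) -> dict[str, int]:
    """Suggest a request (median) and a limit (peak plus headroom), in MiB."""
    ordered = sorted(samples_mib)
    median = ordered[len(ordered) // 2]
    peak = ordered[-1]
    return {
        "request_mib": math.ceil(median),
        "limit_mib": math.ceil(peak * (100 + headroom_pct) / 100),
    }


# Hypothetical production samples for one container, in MiB.
print(suggest_memory([610, 640, 655, 700, 720, 750, 790]))
# {'request_mib': 700, 'limit_mib': 948}
```

Samples should cover peak traffic periods; a week of quiet-hours data will produce a limit that fails on the first busy day.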

Configure Horizontal Pod Autoscaler

Scaling based on memory can help distribute workload.

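Example, using the autoscaling/v2 field that replaced the older targetAverageUtilization:

averageUtilization: 70

Utilization here is measured against the memory request, not the limit. However, remember that autoscaling cannot fix memory leaks.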
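The scaling decision follows the documented Horizontal Pod Autoscaler formula: desired replicas = ceil(current replicas × current value / target value). Here is a simplified Python sketch of it for memory; it ignores the autoscaler's tolerance band and stabilization windows, assumes every pod has the same request, and uses made-up usage numbers.

```python
import math


def desired_replicas(current_replicas: int, usage_mib: list[float], request_mib: float, target_pct: float) -> int:
    """ceil(currentReplicas * currentUtilization / targetUtilization)."""
    # Utilization is average usage as a percentage of the memory request.
    current_pct = 100 * (sum(usage_mib) / len(usage_mib)) / request_mib
    return math.ceil(current_replicas * current_pct / target_pct)


# Three pods requesting 512Mi each, averaging roughly 90% of the request.
print(desired_replicas(3, [435, 461, 486], request_mib=512, target_pct=70))  # 4
```

With a leak, every new replica climbs toward the same limit, so the replica count grows while the restarts continue.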

Implement Resource Governance

Every workload should define:

resources:
  requests:
  limits:

Running without limits can allow a single application to consume excessive node memory and affect other workloads.
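A rule like this is easy to enforce in a pipeline. As a minimal sketch, assuming the pod spec has already been parsed into a dictionary (the spec and the `missing_memory_settings` helper are hypothetical), the check could look like this:

```python
def missing_memory_settings(pod_spec: dict) -> list[str]:
    """Name each container that lacks a memory request or a memory limit."""
    problems = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources") or {}
        for kind in ("requests", "limits"):
            if "memory" not in (resources.get(kind) or {}):
                problems.append(f"{container['name']}: no memory {kind}")
    return problems


# Hypothetical pod spec, as parsed from a manifest.
spec = {"containers": [
    {"name": "payment-api",
     "resources": {"requests": {"memory": "512Mi"}, "limits": {"memory": "1Gi"}}},
    {"name": "log-shipper", "resources": {"requests": {"memory": "64Mi"}}},
]}
print(missing_memory_settings(spec))  # ['log-shipper: no memory limits']
```

Sidecars are the usual offenders, because they are often added after the main container was sized.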

Perform Load Testing

Many memory-related issues only appear under production-like traffic.

Load testing reveals:

Memory spikes
Allocation patterns
Scaling behaviour

before customers encounter them.

Final Thoughts

When a pod is OOMKilled, Kubernetes is usually not the problem.

The platform is enforcing the limits you defined.

The real challenge is understanding why the application exceeded those limits.

Before increasing memory allocations, determine whether the issue is caused by workload growth, configuration mistakes, or application behaviour.

The most effective troubleshooting process is simple:

1. Confirm the OOMKilled event.
2. Measure actual memory consumption.
3. Compare usage against configured limits.
4. Identify memory growth patterns.
5. Fix the root cause before increasing resources.

In production Kubernetes environments, memory issues are often harder to diagnose than CPU issues, but they are also among the most common causes of unexpected application restarts. Understanding how Kubernetes manages memory is one of the most valuable skills a platform engineer can develop.
