Introduction
One of the most misleading situations in Kubernetes is when a pod keeps restarting because of an OOMKilled event while CPU utilization looks perfectly healthy.
I have seen engineers spend hours investigating CPU throttling, autoscaling, node capacity, and even networking, only to discover later that memory was the actual problem.
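The reality is that Kubernetes treats CPU and memory very differently. CPU is a compressible resource: a container that reaches its CPU limit is throttled and keeps running. Memory is not compressible: once a container exceeds its memory limit, the Linux kernel's OOM killer terminates it and Kubernetes restarts the container.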
Understanding why this happens is critical for running production workloads reliably.
Understanding OOMKilled
OOM stands for Out Of Memory.
When a container exceeds its memory limit, the Linux kernel invokes the Out Of Memory Killer, which terminates a process inside that container.
From Kubernetes' perspective, the container exits unexpectedly and the pod enters a restart cycle.
You will typically see something similar to:
kubectl describe pod payment-api-5f4d7d8d9f-xqk2r
Output:
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
Exit code 137 is usually the first indication that memory exhaustion caused the restart. It is 128 + 9, which means the process was killed with SIGKILL (signal 9).
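To make the exit code less of a magic number, here is a minimal Python sketch that decodes it. The helper name `describe_exit_code` is made up for this post; the only assumption is the common convention that codes above 128 mean "terminated by signal (code - 128)".

```python
import signal


def describe_exit_code(exit_code: int) -> str:
    """Translate a container exit code into a readable description."""
    # By convention, codes above 128 mean "killed by signal (code - 128)".
    if exit_code > 128:
        sig = signal.Signals(exit_code - 128)
        return f"killed by {sig.name}"
    return f"exited with status {exit_code}"


print(describe_exit_code(137))
```

Keep in mind that SIGKILL alone does not prove an OOM kill, so the Reason field is still the thing to check.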
Why CPU Looks Healthy
Many teams monitor CPU aggressively while paying little attention to memory consumption.
Consider this example:
resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: 500m
    memory: 1Gi
Application metrics show:
CPU Usage: 120m
Memory Usage: 1.1Gi
CPU utilization appears healthy.
However, memory has exceeded the configured 1Gi limit.
The container gets terminated immediately.
The result is:
CPU Fine
Memory Exhausted
Container Killed
This is why relying solely on CPU dashboards often leads engineers in the wrong direction.
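The comparison above is easy to automate. The following is a rough sketch, not a full implementation of the Kubernetes quantity format: the function names are hypothetical and only the common suffixes are handled. It converts the values from the example into numbers and shows how far each resource is from its limit.

```python
from decimal import Decimal

# Common memory suffixes; two-letter binary suffixes are checked first.
_MEMORY_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '512Mi' or '1.1Gi' to bytes."""
    for suffix, factor in _MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(Decimal(quantity[: -len(suffix)]) * factor)
    return int(Decimal(quantity))  # no suffix: plain bytes


def parse_cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '250m' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Decimal(quantity) * 1000)


cpu_ratio = parse_cpu_millicores("120m") / parse_cpu_millicores("500m")
mem_ratio = parse_memory("1.1Gi") / parse_memory("1Gi")
print(f"CPU at {cpu_ratio:.0%} of limit, memory at {mem_ratio:.0%} of limit")
```

A dashboard that only plots the first ratio will look green right up to the moment the container is killed.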
Requests and Limits Are Not the Same Thing
One of the most common misunderstandings in Kubernetes is confusing requests with limits.
Requests
Requests determine scheduling.
requests:
  memory: 512Mi
Kubernetes uses this value when deciding where to place the pod.
Limits
Limits determine maximum consumption.
limits:
  memory: 1Gi
Once memory usage exceeds this value, the kernel's OOM killer terminates the container.
Think of requests as a reservation and limits as a hard wall.
Cross the wall and the container dies.
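The difference can be sketched in a few lines of Python. This is a deliberately simplified model with hypothetical function names and made-up node numbers. The real scheduler weighs many more factors, but the split is the same: requests are checked at placement time, limits at runtime.

```python
MI = 1024 * 1024  # bytes per mebibyte


def fits_on_node(allocatable: int, scheduled_requests: list[int], new_request: int) -> bool:
    """Placement check: only the sum of requests matters, not actual usage."""
    return sum(scheduled_requests) + new_request <= allocatable


def exceeds_limit(usage: int, limit: int) -> bool:
    """Runtime check: crossing the limit is what gets a container killed."""
    return usage > limit


# Assumed node with 4096Mi allocatable and three pods already placed on it.
existing = [1024 * MI, 2048 * MI, 512 * MI]
print(fits_on_node(4096 * MI, existing, 512 * MI))  # request 512Mi
print(exceeds_limit(1100 * MI, 1024 * MI))          # limit 1Gi, usage 1100Mi
```

Note that the pod is placed based on 512Mi even though it may legitimately use up to 1Gi, which is why a node can end up overcommitted on memory.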
How to Confirm an OOMKill
Start with:
kubectl get pods
You may see:
CrashLoopBackOff
Then inspect the pod:
kubectl describe pod <pod-name>
Look for:
Reason: OOMKilled
You can also check previous logs:
kubectl logs <pod-name> --previous
This is useful because the current container may already have restarted.
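If you need to run this check across many pods, the same information is available in machine-readable form from kubectl get pod <pod-name> -o json. Below is a small sketch that scans that JSON for OOM-killed containers. The function name is hypothetical and the sample pod is trimmed down to the fields the function reads.

```python
import json


def oom_killed_containers(pod: dict) -> list[dict]:
    """Return containers whose previous termination reason was OOMKilled."""
    findings = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated")
        if terminated and terminated.get("reason") == "OOMKilled":
            findings.append({
                "container": status["name"],
                "exit_code": terminated.get("exitCode"),
                "restarts": status.get("restartCount", 0),
            })
    return findings


# Trimmed sample of the pod JSON; values are illustrative.
sample_pod = json.loads("""
{
  "status": {
    "containerStatuses": [
      {
        "name": "payment-api",
        "restartCount": 4,
        "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}
      }
    ]
  }
}
""")

print(oom_killed_containers(sample_pod))
```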
Investigating Memory Consumption
Check actual consumption:
kubectl top pod
Example:
NAME          CPU(cores)   MEMORY(bytes)
payment-api   90m          1050Mi
If the limit is:
memory: 1024Mi
the container is already past its limit and will be terminated.
Also inspect node utilization:
kubectl top node
This helps determine whether the issue is isolated to the workload or affecting the entire node.
Common Causes of OOMKilled Events
Memory Leaks
Applications continuously allocate memory but never release it.
Typical examples:
- Unclosed database connections
- Large object caching
- Static collections
- Long-running background workers
The memory graph steadily increases until the limit is reached.
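A steadily rising graph can be turned into a number. The sketch below fits a straight line through a handful of (minute, MiB) samples and estimates how long until the limit is hit. The sample values are invented for illustration, and a linear fit is an assumption: real leaks are rarely this tidy.

```python
def growth_per_minute(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of (minute, MiB) samples, in MiB per minute."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den


def minutes_until_limit(samples: list[tuple[float, float]], limit_mib: float):
    """Estimate minutes until the limit is reached, or None if usage is not growing."""
    slope = growth_per_minute(samples)
    if slope <= 0:
        return None
    _, latest = samples[-1]
    return max(0.0, (limit_mib - latest) / slope)


# Invented samples: memory grows by 20 MiB every 10 minutes.
samples = [(0, 500), (10, 520), (20, 540), (30, 560)]
print(growth_per_minute(samples), minutes_until_limit(samples, 1024))
```

If the estimate keeps coming out as a finite number regardless of traffic, you are probably looking at a leak rather than load.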
Large Payload Processing
Applications processing large files often experience memory spikes.
Examples:
- PDF generation
- Image manipulation
- Bulk imports
- Report generation
The workload may run successfully hundreds of times before encountering a payload large enough to trigger an OOMKill.
Incorrect Limits
Sometimes the application simply requires more memory than allocated.
For example:
limits:
  memory: 512Mi
while production usage averages:
750Mi
In this case Kubernetes is behaving exactly as configured.
The configuration is wrong.
.NET Applications
Many modern .NET applications can consume significant memory under load.
Common contributors include:
- Large object heap growth
- Heavy caching
- Excessive serialization
- Background processing
The application may perform perfectly in development but fail under production traffic.
Why Increasing Memory Is Not Always the Fix
The immediate reaction is usually:
limits:
  memory: 2Gi
Problem solved.
Or maybe not.
If a memory leak exists, the application will eventually consume:
2Gi
3Gi
4Gi
and fail again.
Increasing limits without understanding consumption patterns only delays the problem.
Always determine whether memory growth is expected or abnormal.
Monitoring OOMKills in AKS
Container Insights provides visibility into:
- Memory trends
- Pod restarts
- Node pressure
- Container consumption
Useful Kusto query:
KubePodInventory
| where ContainerStatusReason == "OOMKilled"
| project TimeGenerated, Namespace, PodName = Name, ContainerName
| order by TimeGenerated desc
This helps identify recurring offenders before they become production incidents.
Preventing OOMKilled Events
Right-Size Resources
Avoid guessing.
Measure actual workload consumption.
Use production metrics to determine realistic values.
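One way to turn measurements into values is sketched below. The rule used here (request at the median, limit at the observed peak plus 20% headroom) is only an assumption for illustration, not an official recommendation, and the sample numbers are made up.

```python
import math
import statistics


def suggest_memory(samples_mib: list[float], headroom_percent: int = 20) -> dict:
    """Suggest a request and a limit (in MiB) from observed usage samples."""
    # Assumed heuristic: request = typical usage, limit = peak plus headroom.
    request = math.ceil(statistics.median(samples_mib))
    limit = math.ceil(max(samples_mib) * (100 + headroom_percent) / 100)
    return {"request_mib": request, "limit_mib": limit}


# Invented production samples in MiB.
observed = [700, 720, 750, 760, 780, 810, 900]
print(suggest_memory(observed))
```

Whatever rule you choose, the point is that the inputs come from production data and not from a guess.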
Configure Horizontal Pod Autoscaler
Scaling based on memory can help distribute workload.
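Example (the target of a memory resource metric in autoscaling/v2; utilization is measured against the memory request):
type: Utilization
averageUtilization: 70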
However, remember that autoscaling cannot fix memory leaks.
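The scaling rule documented for the Horizontal Pod Autoscaler is desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric). Here is a minimal sketch of that formula; it leaves out the tolerance band and stabilization behaviour of the real controller, and the function name is made up.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float) -> int:
    """Simplified HPA formula: scale in proportion to current vs target utilization."""
    return math.ceil(current_replicas * current_utilization / target_utilization)


# Three replicas averaging 90% memory utilization against a 70% target.
print(desired_replicas(3, 90, 70))
```

With a leak, every new replica climbs back towards the target on its own, so the formula keeps asking for more pods without ever bringing utilization down.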
Implement Resource Governance
Every workload should define:
resources:
  requests:
  limits:
Running without limits can allow a single application to consume excessive node memory and affect other workloads.
Perform Load Testing
Many memory-related issues only appear under production-like traffic.
Load testing reveals:
- Memory spikes
- Allocation patterns
- Scaling behaviour
before customers encounter them.
Final Thoughts
When a pod is OOMKilled, Kubernetes is usually not the problem.
The platform is enforcing the limits you defined.
The real challenge is understanding why the application exceeded those limits.
Before increasing memory allocations, determine whether the issue is caused by workload growth, configuration mistakes, or application behaviour.
The most effective troubleshooting process is simple:
- Confirm the OOMKilled event.
- Measure actual memory consumption.
- Compare usage against configured limits.
- Identify memory growth patterns.
- Fix the root cause before increasing resources.
In production Kubernetes environments, memory issues are often harder to diagnose than CPU issues, but they are also among the most common causes of unexpected application restarts. Understanding how Kubernetes manages memory is one of the most valuable skills a platform engineer can develop.
