Kubernetes with Naveen


Kubernetes Requests and Limits: The Most Misunderstood Feature in Production

In the last post, I explained why Kubernetes resource overprovisioning happens, how it quietly inflates cloud costs, and what real-world strategies DevOps teams use to regain control over CPU, memory, and GPU usage; you can read that right here.

Kubernetes requests and limits look simple, but in production they quietly dictate cost, stability, and scalability. This deep dive explains how they really work, why most teams get them wrong, and how to configure them without risking outages.


If you ask most engineers what Kubernetes requests and limits do, you’ll get a confident answer within seconds. Requests are what the container needs. Limits are the maximum it can use. Simple.

And that’s exactly why this feature causes so much damage in production.

Requests and limits are one of the earliest concepts people learn in Kubernetes, but they’re also one of the least revisited. Teams copy values from old services, cargo-cult them across repositories, and rarely question whether they still reflect reality. Over time, these numbers quietly shape scheduling behavior, autoscaling decisions, node count, and ultimately cloud spend — often without anyone realizing it.

To understand why this goes wrong at scale, you have to stop thinking of requests and limits as “resource settings” and start seeing them for what they actually are: contracts with the scheduler.


Requests Are Reservations, Not Estimates

The most important thing to internalize is this: when a pod specifies resource requests, Kubernetes treats them as guaranteed reservations.

If a container requests 1 CPU and 4 GiB of memory, the scheduler will only place it on a node that has at least that much allocatable capacity available. From that point on, that capacity is considered consumed, whether the container uses it or not.

  • It doesn’t matter if the application idles for hours.
  • It doesn’t matter if average usage is a fraction of the request.
  • As far as the scheduler is concerned, that resource is gone.

This is why clusters end up in the strange state where they can’t schedule new pods even though node-level metrics show plenty of unused CPU and memory. The scheduler is doing exactly what it was told to do — it’s just working with inflated numbers.
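To make the contract concrete, here is a minimal pod spec (the name and image are hypothetical) that makes the reservation described above. The moment this pod is scheduled, 1 CPU and 4 GiB of memory are subtracted from the node's allocatable capacity, regardless of what the process actually consumes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app        # hypothetical name
spec:
  containers:
    - name: app
      image: example/app:latest   # hypothetical image
      resources:
        requests:
          cpu: "1"          # scheduler reserves a full core on the chosen node
          memory: 4Gi       # reserved even if the app idles at a few hundred MiB
```

You can see the gap yourself by comparing `kubectl describe node` (which shows requested, i.e. reserved, capacity) against `kubectl top node` (which shows actual usage).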

Why Engineers Inflate Requests (And Why It’s Rational)

Over-requesting resources isn’t a sign of poor engineering discipline. It’s a rational response to uncertainty.

Most teams have lived through at least one painful incident where a container was under-provisioned. Maybe a memory spike triggered an OOM kill during peak traffic. Maybe CPU throttling caused latency to creep up just enough to trip timeouts. Those incidents stick.

After that, the thought process changes. Engineers stop asking, “What does this service usually need?” and start asking, “What’s the worst case I’ve ever seen?”

Requests grow to cover edge cases. Limits are pushed far beyond normal operation or removed entirely. Over time, this becomes the default posture, especially for services that are considered critical. Nobody wants to be the person who reduced a request and caused the next outage.

The problem is that Kubernetes has no native way to tell you when that fear is outdated. A service that once needed 8 GiB of memory during a launch might now be stable at 2 GiB — but the request never gets revisited. Multiply that across hundreds of workloads, and the waste compounds quietly.

Limits Are Not a Safety Net (Especially for Memory)

Limits are often described as a “safety boundary,” but that description glosses over some important realities.

CPU limits are enforced through throttling. When a container hits its CPU limit, it doesn’t crash — it just gets slowed down. This can be acceptable for some workloads and disastrous for others, depending on latency sensitivity.

Memory limits are far less forgiving. When a container exceeds its memory limit, the kernel's OOM killer terminates it immediately. There’s no graceful degradation. No backpressure. Just a hard stop.

Because of this, many teams choose one of two extremes: either they set memory limits extremely high, or they avoid setting them altogether. Both approaches come with trade-offs. High limits reduce the chance of OOM kills but increase the blast radius if something leaks memory. No limits improve stability for individual pods but shift risk to the node and, by extension, other workloads.

What’s often missing from this decision is an understanding of actual memory usage over time. Without that context, limits become guesswork — and guesswork tends to err on the side of excess.
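The asymmetry between the two enforcement mechanisms is easy to see side by side in a container's resources block (the numbers here are illustrative, not a recommendation):

```yaml
resources:
  requests:
    cpu: 250m
    memory: 512Mi
  limits:
    cpu: "1"        # exceeding this throttles the container; it keeps running, just slower
    memory: 1Gi     # exceeding this gets the container OOM-killed, with no warning
```

The practical consequence: a too-tight CPU limit shows up as latency, while a too-tight memory limit shows up as restarts.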

The Hidden Relationship Between Requests and Autoscaling

Autoscaling is frequently used as a justification for sloppy requests. The logic goes something like this: “We have HPA, so it’ll scale if things get busy.”

What’s overlooked is that horizontal autoscaling relies on requests to calculate utilization. If your CPU request is wildly inflated, your utilization percentage will look low even under real load. The autoscaler won’t trigger when it should, because from its perspective, nothing is wrong.
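A small sketch of that arithmetic makes the failure mode obvious. The HPA's utilization percentage is simply current usage divided by the request, so the same real load looks healthy or idle depending on what was requested (the millicore values below are hypothetical):

```python
def cpu_utilization_pct(usage_millicores: float, request_millicores: float) -> float:
    """Utilization as the HPA computes it: actual usage relative to the request."""
    return 100 * usage_millicores / request_millicores

# A pod genuinely busy at 400m of CPU:
usage = 400

# Honest request of 500m: 80% utilization -> above a typical 70% target, so HPA scales out.
print(cpu_utilization_pct(usage, 500))   # 80.0

# Inflated request of 2000m: 20% utilization -> HPA sees an idle pod and does nothing.
print(cpu_utilization_pct(usage, 2000))  # 20.0
```

Same workload, same load, opposite scaling decisions. The only variable that changed was the number an engineer typed into the manifest.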

In this way, over-requesting doesn’t just waste capacity — it actively breaks scaling behavior. Teams then respond by increasing replica counts manually or inflating requests even further, reinforcing the cycle.

Autoscaling works best when requests reflect baseline usage, not peak fear. Without that honesty, the system amplifies bad assumptions instead of correcting them.

A More Honest Way to Configure Requests and Limits

In mature environments, requests are treated as a representation of typical behavior, not worst-case scenarios. They’re based on observed usage over time, not a single incident from six months ago.

Limits, when used, are chosen deliberately based on failure tolerance. For CPU, that might mean allowing bursts while preventing a single pod from monopolizing a core. For memory, it often means accepting that some workloads are better protected by node-level isolation than aggressive per-container limits.

This approach requires trust — not blind trust, but trust built on metrics, slow change, and fast rollback. Teams that succeed with right-sizing don’t aim for perfection. They aim for plausibility.

Why This Misunderstanding Gets More Expensive at Scale

In small clusters, over-requesting mostly results in inefficiency. In large fleets, it reshapes the entire platform.

Inflated requests reduce bin-packing efficiency, which increases node count. Higher node count increases failure domains, upgrade complexity, and operational overhead. Autoscalers react to distorted signals. Scheduling latency increases. GPU pools grow faster than they need to.

At that point, requests and limits are no longer just a configuration detail. They are a major architectural input.

This is why organizations that treat resource configuration as a first-class concern often see dramatic improvements without changing application code at all. They stop feeding the scheduler exaggerated inputs, and the system immediately behaves better.

Closing Thoughts

Requests and limits are simple on the surface, which is exactly why they’re dangerous when misunderstood. They don’t just affect individual pods — they influence how Kubernetes perceives the entire cluster.

When requests are inflated, Kubernetes is forced to plan for a world that doesn’t exist. When limits are misunderstood, teams either accept unnecessary risk or waste massive amounts of capacity trying to avoid it.

Getting this right isn’t about squeezing every last CPU cycle. It’s about giving the scheduler truthful information and letting it do its job. Once that happens, autoscaling becomes predictable, clusters become calmer, and cost optimization stops feeling like a fight.

In the next part of this series, we’ll dig into autoscaling itself — why HPA alone won’t save you, and how bad inputs can turn scaling from a solution into a multiplier of waste.

Key Takeaways

  1. Requests are scheduling contracts, not usage estimates, and inflating them directly leads to wasted capacity.

  2. Limits behave very differently for CPU and memory, and misunderstanding that difference causes both outages and inefficiency.

  3. Autoscaling depends on honest requests, and overprovisioning silently breaks its assumptions.

So What's Next?

In my next blog post, I will cover Kubernetes autoscaling, which is often used to mask bad resource configurations. You'll learn how horizontal and vertical scaling actually work together, and how to stop autoscalers from amplifying bad inputs. Until then, happy reading, and if you found this post useful, please share it so it reaches more people.
