Daya Shankar

Posted on Jun 29

GPU Resource Requests and Limits in Kubernetes: Why Default Settings Break Production

#kubernetes

One of the strangest GPU incidents I have seen involved a Kubernetes cluster that looked underused and overloaded at the same time.

GPU utilisation was below 40%.

The dashboards looked healthy.

But new workloads were stuck in Pending. Application teams wanted more GPU nodes. The autoscaler kept expanding the cluster.

At first, the numbers did not make sense.

If the GPUs were not fully used, why could Kubernetes not schedule more work?

The answer was not compute capacity.

It was allocation.

And this is where many GPU Kubernetes clusters quietly become expensive.

Kubernetes Does Not Treat GPUs Like CPUs

Most teams understand CPU requests and limits reasonably well.

GPUs are different.

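In Kubernetes, GPUs are scheduled as extended resources, which are requested in whole integer quantities and cannot be overcommitted. The official Kubernetes GPU scheduling documentation states that GPUs must be specified in limits. If only limits are given, Kubernetes uses that value as the request, and if requests are also specified, requests and limits must be equal.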

That means a pod asking for one GPU is not asking for 30% of a GPU.

It is asking for one whole GPU device.

So, if a workload requests:

resources:
  limits:
    nvidia.com/gpu: 1

Kubernetes treats one GPU as allocated to that pod.

Even if the workload only uses 25% or 40% of the GPU in practice.

This is the part that surprises many teams.

A GPU can be partially utilised and fully allocated at the same time.

Why Default GPU Requests Break Production

There is nothing technically wrong with requesting one GPU.

For model training, large inference jobs or workloads that genuinely need exclusive access, it may be exactly the right configuration.

But many production workloads do not use an entire GPU all the time.

Small inference services, development notebooks, internal tools and batch jobs often consume only a fraction of available compute.

Still, once they request one GPU, Kubernetes reserves the whole device.

That creates a strange situation.

8 GPUs are available
8 pods request one GPU each
Average GPU utilisation stays below 40%
New GPU pods remain pending

From the dashboard, the cluster looks underused.

From the Kubernetes scheduler's point of view, the cluster is full.

Both views are true.

And that is the problem.
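
The eight-GPU situation above can be reproduced with a toy model. This is a minimal sketch, not the real scheduler: the `GpuPod` class, the `schedule` function, the pod names and the utilisation figures are all made up for illustration. The only rule it borrows from Kubernetes is that GPUs are counted as whole devices.

```python
from dataclasses import dataclass


@dataclass
class GpuPod:
    name: str
    gpus_requested: int   # whole devices only, as with extended resources
    observed_util: float  # fraction of the GPU actually used (hypothetical)


def schedule(pods, allocatable_gpus):
    """Place pods in order, looking only at whole-device requests."""
    running, pending = [], []
    free = allocatable_gpus
    for pod in pods:
        if pod.gpus_requested <= free:
            free -= pod.gpus_requested
            running.append(pod)
        else:
            # Observed utilisation never enters the decision.
            pending.append(pod)
    return running, pending, free


TOTAL_GPUS = 8

# Hypothetical mix: eight light services, then one more arrives.
pods = [GpuPod(f"svc-{i}", 1, 0.35) for i in range(8)]
pods.append(GpuPod("new-svc", 1, 0.30))

running, pending, free = schedule(pods, TOTAL_GPUS)
allocated = TOTAL_GPUS - free
avg_util = sum(p.observed_util * p.gpus_requested for p in running) / TOTAL_GPUS

print(f"allocated: {allocated}/{TOTAL_GPUS}, free: {free}")
print(f"average utilisation: {avg_util:.0%}")
print("pending:", [p.name for p in pending])
```

The ninth pod stays pending even though the average utilisation is well under half, because the placement loop never looks at utilisation at all.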

Why Autoscaling Can Make the Problem More Expensive

Once pods remain pending, node autoscaling usually enters the conversation.

A node autoscaler provisions or consolidates nodes so the cluster has enough capacity for workloads, as described in Kubernetes’ node autoscaling documentation.

So if a pod needs one GPU and no allocatable GPU is available, adding a GPU node can be the correct autoscaler response.

Technically, the autoscaler is doing its job.

But here is the catch.

The cluster may not need more GPU compute.

It may need a better allocation strategy.

If every lightweight workload reserves a full device, autoscaling solves the scheduling problem by buying more capacity. It does not fix the utilisation problem.

That is how teams end up with larger GPU fleets, higher cloud bills and only a small improvement in actual throughput.
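
A rough back-of-the-envelope calculation shows how quickly this adds up. The sketch below is only arithmetic under stated assumptions: the pod count, the node shape and the sharing factor of three are hypothetical, and `nodes_needed` is an illustrative helper, not part of any autoscaler. It also ignores GPU memory and contention, which decide whether sharing is safe in the first place.

```python
import math


def nodes_needed(pod_count, gpus_per_node, pods_per_gpu=1):
    """Nodes required if each physical GPU can host `pods_per_gpu` pods."""
    gpus = math.ceil(pod_count / pods_per_gpu)
    return math.ceil(gpus / gpus_per_node)


LIGHT_PODS = 24     # hypothetical number of lightweight GPU pods
GPUS_PER_NODE = 8   # hypothetical node shape

# Every pod reserves a whole device.
exclusive_nodes = nodes_needed(LIGHT_PODS, GPUS_PER_NODE)

# Assumed sharing factor: three light pods per physical GPU.
shared_nodes = nodes_needed(LIGHT_PODS, GPUS_PER_NODE, pods_per_gpu=3)

print(f"exclusive allocation: {exclusive_nodes} nodes")
print(f"shared allocation:    {shared_nodes} nodes")
```

With these made-up numbers the same pods fit on one node instead of three. The real ratio depends on measured memory and utilisation, not on the request alone.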

If the problem appears repeatedly across training, inference and batch workloads, the issue is no longer only resource requests.

It becomes a GPU orchestration problem.

At that point, teams may need workload-aware scheduling, queues, node pools and retry policies rather than simply adding more GPU nodes. A deeper look at multi-GPU orchestration in Kubernetes can help when the cluster is moving beyond simple one-pod-one-GPU scheduling.

The Metric That Misleads Teams

GPU utilisation is useful.

But by itself, it can be misleading.

A low utilisation number does not automatically mean Kubernetes can place another GPU workload on the node.

You also need to check:

GPU allocation
GPU memory usage
Pending GPU pods
Node allocatable GPU count
Autoscaler activity
Workload concurrency

Looking only at utilisation is like seeing empty seats in a theatre after every ticket has already been sold.

The room looks available.

The booking system disagrees.

Kubernetes works more like the booking system.
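
To see the booking system's view, compare each node's allocatable GPU count with the GPU limits of the pods placed on it. The sketch below assumes input shaped like the `items` lists returned by `kubectl get nodes -o json` and `kubectl get pods -A -o json`. The sample nodes and pods are invented, `allocation_report` is an illustrative helper, and init containers are ignored for simplicity.

```python
GPU = "nvidia.com/gpu"


def gpu_limit(pod):
    """Sum the GPU limits across a pod's regular containers."""
    total = 0
    for container in pod["spec"]["containers"]:
        limits = container.get("resources", {}).get("limits", {})
        total += int(limits.get(GPU, 0))
    return total


def allocation_report(nodes, pods):
    """Return per-node allocatable/allocated GPU counts and unscheduled GPU pods."""
    report = {}
    for node in nodes:
        allocatable = int(node["status"]["allocatable"].get(GPU, 0))
        report[node["metadata"]["name"]] = {"allocatable": allocatable, "allocated": 0}

    unscheduled = []
    for pod in pods:
        wanted = gpu_limit(pod)
        if wanted == 0:
            continue
        node_name = pod["spec"].get("nodeName")
        phase = pod["status"]["phase"]
        if phase == "Pending" and node_name is None:
            unscheduled.append(pod["metadata"]["name"])
        elif phase in ("Pending", "Running") and node_name in report:
            # Finished pods (Succeeded/Failed) no longer hold a device.
            report[node_name]["allocated"] += wanted
    return report, unscheduled


def make_pod(name, phase, node_name=None):
    """Build a minimal, made-up pod object requesting one GPU."""
    spec = {"containers": [{"resources": {"limits": {GPU: "1"}}}]}
    if node_name is not None:
        spec["nodeName"] = node_name
    return {"metadata": {"name": name}, "spec": spec, "status": {"phase": phase}}


# Invented sample data for illustration only.
nodes = [{"metadata": {"name": "gpu-node-a"},
          "status": {"allocatable": {"cpu": "64", GPU: "2"}}}]
pods = [
    make_pod("infer-a", "Running", "gpu-node-a"),
    make_pod("infer-b", "Running", "gpu-node-a"),
    make_pod("infer-c", "Pending"),
]

report, unscheduled = allocation_report(nodes, pods)
print(report)
print("unscheduled GPU pods:", unscheduled)
```

Reading this next to a utilisation dashboard makes the gap visible: a node can show every GPU allocated while the devices themselves are mostly idle.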

How Should Teams Evaluate GPU Requests?

Before assigning a full GPU to a workload, ask a few practical questions:

How much GPU memory does the workload actually use?
Does it need exclusive GPU access?
What does utilisation look like at peak load?
Can it safely share a GPU with another workload?
Is the workload continuous or intermittent?
Does it need latency isolation?

These questions usually reveal the real shape of the workload.

A training job may need a dedicated GPU.

A production inference service may need one too, especially when latency is strict.

But a development notebook, lightweight batch job or small internal inference service may not.

Without that distinction, GPU requests become guesses.

And expensive guesses tend to scale badly.
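
One way to make those questions repeatable is to write them down as a small rule set. The sketch below is a hypothetical heuristic, not a recommendation engine: the `WorkloadProfile` fields, the 0.6 threshold and the `suggest_allocation` function are assumptions chosen for illustration, and the inputs should come from measurements at peak load.

```python
from dataclasses import dataclass


@dataclass
class WorkloadProfile:
    peak_memory_fraction: float   # share of GPU memory used at peak, 0.0 to 1.0
    peak_utilisation: float       # share of GPU compute used at peak, 0.0 to 1.0
    needs_latency_isolation: bool


def suggest_allocation(profile, heavy_threshold=0.6):
    """Map a measured profile to a starting allocation model (assumed threshold)."""
    uses_most_of_device = (
        profile.peak_memory_fraction >= heavy_threshold
        or profile.peak_utilisation >= heavy_threshold
    )
    if uses_most_of_device:
        return "dedicated GPU"
    if profile.needs_latency_isolation:
        return "dedicated GPU or isolated slice"
    return "shared GPU or time-slicing"


# Made-up profiles for three kinds of workload.
training = WorkloadProfile(0.90, 0.95, False)
strict_inference = WorkloadProfile(0.20, 0.30, True)
notebook = WorkloadProfile(0.20, 0.25, False)

for name, profile in [("training", training),
                      ("strict inference", strict_inference),
                      ("notebook", notebook)]:
    print(f"{name}: {suggest_allocation(profile)}")
```

The output is a starting point for testing, not a final answer. Any sharing decision still needs to be validated under real load.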

What Do Efficient GPU Clusters Do Differently?

Efficient GPU Kubernetes clusters do not start by giving every workload a full device.

They start by measuring workload behaviour.

They monitor memory, utilisation, scheduling frequency, peak demand and contention.

Then they choose the allocation model.

Workload type	Common allocation approach
Large training jobs	Dedicated GPU
Latency-sensitive inference	Dedicated GPU or isolated slice
Small inference services	Shared GPU or time-slicing
Development environments	Shared GPU
Batch processing jobs	Time-sliced or shared GPU

The right answer depends on isolation needs.

If workloads need stronger isolation, NVIDIA Multi-Instance GPU can partition supported GPUs into separate GPU instances with dedicated compute and memory resources.

If workloads mainly need opportunistic sharing, NVIDIA GPU time-slicing can allow multiple workloads to share access to the same underlying GPU over time. Teams comparing shared and exclusive access models should also understand the trade-off between GPU time-slicing and passthrough before changing production allocation rules.

But these are not magic switches.

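MIG, time-slicing and dedicated allocation all have trade-offs. MIG gives each instance dedicated compute and memory, but only on supported GPUs and only in fixed partition sizes. Time-slicing is more flexible, but workloads sharing a device get no memory or fault isolation from each other. Dedicated allocation is the simplest and most predictable, but it reserves the whole device. Each option also changes scheduling flexibility and operational complexity.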

The point is not to share every GPU.

The point is to stop treating every workload as if it needs a full one.

When Should You Use Dedicated GPUs?

Use dedicated GPUs when the workload needs predictable performance or consumes most of the device.

This is common for large training jobs, high-throughput model serving, strict latency workloads or applications with heavy GPU memory requirements.

In those cases, sharing can create more risk than savings.

Dedicated allocation may look less efficient on a dashboard, but it can be the right design when predictability matters.

When Should You Consider GPU Sharing?

Consider sharing when workloads are smaller, intermittent or tolerant of variable performance.

This often includes development environments, internal tools, lightweight inference services and batch jobs.

These workloads may not need an entire GPU reserved all day.

Sharing can improve utilisation and reduce unnecessary node growth.

But it should be tested under real load.

A setup that works for idle notebooks may not work for production inference with strict response-time targets.

The Real Question Before Adding More GPU Nodes

When GPU workloads start pending, the instinct is to add nodes.

Sometimes that is correct.

But before expanding the cluster, ask:

Are the existing GPUs actually being used, or are they just allocated?

That one question changes the troubleshooting path.

If GPUs are heavily utilised and memory is saturated, you probably need more capacity.

If GPUs are allocated but lightly used, the problem is allocation design.

More nodes may hide the issue for a while.

They will not fix it.

Final Thought

Many GPU Kubernetes problems do not begin with a shortage of GPUs.

They begin with a mismatch between how workloads consume GPU resources and how Kubernetes allocates them.

A GPU can sit at 35% utilisation and still be unavailable to every new pod.

Once that makes sense, many scheduling, autoscaling and cost problems become easier to explain.

So before increasing autoscaling limits or buying more GPU nodes, check the allocation model.

You may not have a capacity problem.

You may have a reservation problem.

DEV Community