NTCTech

Posted on • Originally published at rack2cloud.com

GPU Scheduling in Kubernetes: Start Before the Scheduler

Most teams think GPU scheduling starts with the scheduler.

It starts with demand modeling.

By the time Volcano, Kueue, or KEDA enters the conversation, the expensive mistake has usually already been made. The cluster was provisioned against a theoretical peak that rarely materializes. The demand curve was never drawn. The concurrency profile was assumed rather than measured.

The core argument: GPU scheduling is not a capacity solution. It is a capacity enforcement layer. If you provisioned against the wrong demand curve, the scheduler cannot save you.


The Demand Model Preflight

Before you talk about schedulers, answer four questions:

1. What is your real concurrency floor? Not peak theoretical demand. The minimum sustained parallel work your cluster must support without queue collapse. If you cannot answer this from measurement, you don't have a demand model — you have an assumption.

2. What is burst, and what is noise? If demand spikes for ninety seconds, does that justify permanent GPU allocation — or should it queue? Burst shorter than your cold-start window is noise. Noise should not drive provisioning decisions.

3. How long does work stay resident? A model loaded in VRAM is not active work. If memory stays hot longer than compute stays busy, utilization is already overstated before the scheduler runs a single job.

4. What can wait, and for how long? Scheduling starts with tolerated latency. If every workload is marked urgent, none of them are schedulable efficiently.

If you cannot answer all four from data rather than assumption, the scheduler conversation is premature.
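The first two questions can be answered directly from request logs. The sketch below is a minimal illustration, assuming jobs arrive as `(start, end)` timestamp pairs and that you know your cold-start window; the function names, the percentile choice, and the sample data are all hypothetical, not from the original.

```python
def concurrency_series(jobs):
    """Sweep-line count of concurrently running jobs.

    jobs: list of (start, end) timestamps in seconds.
    Returns a list of (time, concurrency) at each event boundary.
    """
    events = []
    for start, end in jobs:
        events.append((start, 1))   # job begins
        events.append((end, -1))    # job ends
    events.sort()
    series, running = [], 0
    for t, delta in events:
        running += delta
        series.append((t, running))
    return series

def concurrency_floor(jobs, percentile=0.10):
    """Preflight question 1: a low percentile of the concurrency series
    approximates the sustained minimum, rather than the theoretical peak."""
    levels = sorted(c for _, c in concurrency_series(jobs))
    return levels[int(percentile * (len(levels) - 1))]

def is_noise(spike_duration_s, cold_start_s):
    """Preflight question 2: a burst shorter than the cold-start window
    should queue, not drive a permanent allocation."""
    return spike_duration_s < cold_start_s

# Illustrative data: three overlapping jobs.
jobs = [(0, 100), (10, 90), (50, 200)]
print(concurrency_floor(jobs))   # sustained parallel work, from measurement
print(is_noise(90, 120))         # True: a 90s spike < 120s cold start is noise
```

The point of the sketch is the article's distinction: the floor comes from a measured series, while anything shorter than the cold-start window is classified as noise before it can influence provisioning.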


What Correct GPU Demand Modeling Looks Like

Seven inputs. Each one has a consequence if you get it wrong.

Request concurrency — If you modeled single-thread throughput, your cluster is sized for a workload that never actually runs.

Queue depth — How many jobs can wait before it becomes a latency problem? Most teams buy hardware when they should be designing queue behavior.

Burst profile — Short demand spikes get priced into permanent capacity. A correct burst profile separates the spike duration from the allocation decision.

Latency tolerance — Batch training tolerates queuing. Real-time inference does not. Sizing uniformly across both is a guaranteed waste pattern.

Batch vs inference mix — These are distinct provisioning decisions. A cluster optimized for training batch jobs has a different shape than one optimized for sustained inference throughput.

VRAM residency time — How long does a model stay loaded relative to how long it is actively processing requests? High residency-to-compute ratio means memory is doing the work of availability, not throughput.

Job duration variance — High variance creates scheduling fragmentation regardless of how well the scheduler is configured. Understanding variance at p50/p90/p99 determines whether gang scheduling or preemption policies are necessary.
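Two of these inputs reduce to simple measurements. The sketch below shows one hedged way to compute the p50/p90/p99 duration profile and the residency-to-compute ratio; the field names, thresholds, and sample numbers are illustrative assumptions.

```python
def percentile(values, p):
    """Nearest-rank percentile of a sorted copy (p in [0, 100])."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[k]

def duration_profile(durations_s):
    """Job duration variance input: a wide p99/p50 gap signals the
    scheduling fragmentation the article warns about."""
    return {p: percentile(durations_s, p) for p in (50, 90, 99)}

def residency_to_compute(loaded_s, busy_s):
    """VRAM residency input: time a model stays loaded vs. time it is
    actively computing. A ratio >> 1 means memory is buying availability,
    not throughput."""
    return loaded_s / busy_s if busy_s else float("inf")

durations = [30, 32, 35, 40, 41, 300, 2400]  # seconds; note the long tail
print(duration_profile(durations))           # {50: 40, 90: 300, 99: 2400}
print(residency_to_compute(loaded_s=3600, busy_s=600))  # 6.0
```

In the sample data, p50 is 40s while p99 is 2400s: exactly the high-variance shape that makes preemption or gang-scheduling policy a necessity rather than an option.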


Provision for Shape, Not Peak

The corrective action is a provisioning philosophy shift.

| Wrong Target | Correct Target |
| --- | --- |
| Peak demand | Concurrency bands |
| Max model size | Queue tolerance |
| Future scale | Sustained demand windows |
| Worst-case headroom | Known burst ceilings |

Concurrency bands come from request concurrency measurement. Queue tolerance comes from latency tolerance modeling. Burst ceilings come from burst profile analysis. The provisioning decision is downstream of the model — not upstream of it.
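As a sketch of what "provision for shape" means arithmetically: static capacity is sized to the sustained band, and the gap up to the burst ceiling is absorbed first by queue tolerance and only then by autoscaling. All numbers and names below are hypothetical, not from the original.

```python
import math

def gpus_for_shape(sustained_band, burst_ceiling, jobs_per_gpu,
                   queue_tolerance_jobs):
    """Size permanent capacity to the sustained concurrency band.

    Demand between the band and the burst ceiling either waits in the
    queue (up to the tolerated depth) or is absorbed by autoscaling --
    never by permanently provisioned GPUs.
    """
    static_gpus = math.ceil(sustained_band / jobs_per_gpu)
    overflow = max(0, burst_ceiling - sustained_band - queue_tolerance_jobs)
    autoscale_gpus = math.ceil(overflow / jobs_per_gpu)
    return {"static": static_gpus, "autoscale_max": autoscale_gpus}

# A sustained band of 24 concurrent jobs that bursts to 40, with 4 jobs
# per GPU and a queue that may hold 8 jobs before latency SLOs break.
print(gpus_for_shape(24, 40, 4, 8))  # {'static': 6, 'autoscale_max': 2}
```

Sized against peak instead, the same workload would demand ten permanent GPUs; sized against shape, it needs six plus a bounded autoscale ceiling of two.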


Where the Scheduler Actually Fits

The right evaluation criterion for a scheduler is not its feature set. It is whether the scheduler enforces the constraints your demand model defined.

Three tools, three enforcement roles:

Volcano → batch fairness / queue discipline. Implements fair-share scheduling and gang scheduling for distributed training. Enforces concurrency band design across workload classes.

Kueue → admission control / workload gating. Answers Preflight Question 4 directly — what can wait. Prevents jobs from entering the scheduling queue until capacity exists to run them.

KEDA → event-driven scale behavior. Answers Preflight Question 2 — burst vs noise. Scales to the burst ceiling the demand model defined, not to unbounded demand signals.

These are not alternatives. They are complementary enforcement layers at different points in the scheduling stack.
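As one concrete example of enforcement, KEDA's burst-ceiling role maps directly onto a `ScaledObject`: `maxReplicaCount` is where the demand model's ceiling becomes a hard limit, and `cooldownPeriod` keeps sub-cold-start noise from driving scale decisions. The deployment name, Prometheus address, query, and numbers below are illustrative, not a recommended configuration.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler        # hypothetical name
spec:
  scaleTargetRef:
    name: inference-deployment  # hypothetical Deployment
  minReplicaCount: 2            # concurrency floor from the demand model
  maxReplicaCount: 8            # burst ceiling, not unbounded demand
  cooldownPeriod: 300           # damp spikes shorter than the cold start
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(rate(inference_requests_total[2m]))
        threshold: "50"         # requests/s per replica, from measurement
```

The same pattern holds for the other two layers: a Kueue `ClusterQueue` quota encodes queue tolerance, and a Volcano queue's fair-share weight encodes the concurrency band split between workload classes.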


What Good GPU Scheduling Actually Looks Like

Not which scheduler. What the outcome looks like when the demand model is correct:

  • Jobs wait intentionally — queue latency exists by design, not by accident
  • Inference scales on bounded demand — KEDA scales to the burst ceiling, not beyond it
  • VRAM stays loaded for active work — residency-to-compute ratio is enforced operationally
  • Queue latency is tolerated by design — the latency tolerance input becomes an SLA
  • Expensive accelerators do not sit hot without work — loaded-but-idle is designed out, not tolerated

Architect's Verdict

The scheduler is not where GPU efficiency begins. It is where good capacity decisions are enforced — or bad ones become permanent.

Build the demand model first. Provision to its shape. Then configure the enforcement layer. In that order, and no other.


