Paulo Victor Leite Lima Gomes

Posted on Jul 1

the Kubernetes scheduler is becoming the AI capacity broker

#kubernetes #ai #scheduling #dra

The most expensive machine in the cluster is not automatically the most important one.

That sounds obvious until the GPU queue fills up.

Then the cluster becomes a negotiation. Which training job gets the good devices? Which batch workload waits? Which inference service keeps its latency budget? Which team gets the rack-local placement they asked for? Which half-started job is allowed to sit on scarce hardware while the other half cannot be scheduled?

This is where AI infrastructure stops being a procurement story and becomes a scheduler story.

Kubernetes has been moving in that direction for a while. Dynamic Resource Allocation made device requests more expressive than "give this pod a GPU." The newer workload-aware scheduling work in Kubernetes v1.36 pushes on the next problem: many expensive workloads are not really pod-shaped. They are groups. They need enough capacity at once, often in the right topology, with failure behavior that makes sense for the whole job instead of one lonely pod at a time.

That is a bigger shift than it first looks.

The scheduler is no longer just finding an empty slot.

It is starting to broker scarce AI capacity as a unit of work.

pod by pod is the wrong mental model

Kubernetes trained many of us to think in pods.

That is usually fine for stateless services. A pod needs CPU, memory, maybe a volume, maybe a node selector, maybe some affinity rules. The scheduler looks at the world, finds a node, and the pod lands somewhere reasonable.

AI and high-performance batch workloads are less forgiving.

A distributed training job may need several workers to start together. If half the pods run and the other half wait, the cluster is not being productive. It is just converting expensive accelerators into anxiety. A job may need devices that share a fast network path. It may need to avoid spreading work across topology that makes every all-reduce operation slower. It may need a shared device claim for a group rather than a pile of unrelated per-pod claims.

At that point, "schedule this pod" is too small a question.

The better question is: can this workload run as a coherent thing?

Kubernetes v1.36's workload-aware scheduling work makes that question more explicit. The Workload API becomes more of a static template, while PodGroup becomes the runtime scheduling object. That separation is not just API housekeeping. It gives the scheduler a clearer object to reason about when the thing being placed is a group of pods with shared constraints.

In plain English: the scheduler gets to see the job-like shape of the work instead of pretending every pod is an independent little island.

That matters because AI capacity is rarely consumed independently.

gang scheduling is queue honesty

Gang scheduling sounds like a niche batch-computing phrase, but the idea is simple.

If a workload needs four pods to run together, do not schedule one pod and hope the other three eventually find room. Either the group can meet its minimum, or it waits.

That is queue honesty.

Without it, a cluster can drift into a very silly state. Partial jobs occupy resources, blocked jobs keep retrying, operators stare at pending pods, and everyone gets to debate whether the cluster is underprovisioned or merely wedged in a bad allocation pattern.

The v1.36 PodGroup scheduling cycle is interesting because it evaluates the group as a unified operation. The scheduler can take one view of cluster state, try to find valid placements for the group, and apply the decision atomically for the relevant pods. If the group cannot meet its requirements, the group waits instead of leaking half a workload into the cluster.
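The all-or-nothing idea is easy to sketch. What follows is a toy model, not the Kubernetes scheduler: the function name `place_gang`, the node names, and the first-fit heuristic are all assumptions made for illustration. It tries placements against a copy of cluster state and commits only if every pod in the group fits.

```python
from typing import Dict, List, Optional


def place_gang(free_gpus: Dict[str, int], pod_requests: List[int]) -> Optional[Dict[int, str]]:
    """Place every pod in the group, or place none of them.

    free_gpus maps a (hypothetical) node name to its free GPU count.
    pod_requests lists the GPUs each pod in the group needs.
    Returns pod index -> node name, or None if the group has to wait.
    """
    snapshot = dict(free_gpus)  # one view of cluster state, edited only on a copy
    assignment: Dict[int, str] = {}
    # Largest pods first: a simple first-fit-decreasing heuristic, not an exhaustive search.
    order = sorted(range(len(pod_requests)), key=lambda i: -pod_requests[i])
    for i in order:
        need = pod_requests[i]
        node = next((n for n in sorted(snapshot) if snapshot[n] >= need), None)
        if node is None:
            return None  # the group waits; the caller's state is untouched
        snapshot[node] -= need
        assignment[i] = node
    free_gpus.update(snapshot)  # commit every placement together
    return assignment


cluster = {"node-a": 2, "node-b": 1}
print(place_gang(cluster, [1, 1, 1, 1]))  # None: four pods cannot fit, so nothing is placed
print(cluster)                            # still {'node-a': 2, 'node-b': 1}
print(place_gang(cluster, [1, 1, 1]))     # all three pods placed in one step
```

The detail that matters is the copy. A failed attempt leaves no half-placed pods behind, which is the whole point of waiting as a group.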

That is not glamorous.

It is also exactly the kind of boring behavior that makes expensive infrastructure usable.

AI workloads make partial success especially painful. A web service with one fewer replica may degrade gracefully. A distributed training job with missing workers may do nothing useful while still holding devices. A batch workload spread across bad topology may technically run while burning extra time on network overhead.

So the scheduler needs to understand when "some of it is running" is not progress.

topology is part of capacity

The phrase "we have enough GPUs" hides a lot of detail.

Enough where?

On which nodes?

Behind which network?

With which device types?

Under which sharing rules?

For tightly coupled AI jobs, capacity is not just a count. Four available devices in four awkward corners of the cluster may not be equivalent to four devices close together. The network path becomes part of the resource. Rack placement becomes part of the resource. Device locality becomes part of the resource. The scheduler has to care about the shape of capacity, not only the amount.

That is why topology-aware scheduling feels like more than a nice placement optimization.

Kubernetes v1.36 lets topology constraints live on the PodGroup, so the scheduler can try placements that keep the group's pods within a physical or logical domain such as a rack. The implementation still has limits, and the Kubernetes post is refreshingly honest about that. This is a foundation, not a magic wand.
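To make "enough where?" concrete, here is a small sketch of a rack-level constraint. It is an illustration under assumptions, not the v1.36 implementation: the `rack` and `free_gpus` keys, the node names, and the tightest-fit tie-break are invented for the example, and it treats GPUs within a rack as interchangeable.

```python
from typing import Dict, Optional


def rack_for_group(nodes: Dict[str, dict], gpus_needed: int) -> Optional[str]:
    """Return a rack whose free GPUs can hold the whole group, else None."""
    free_by_rack: Dict[str, int] = {}
    for info in nodes.values():
        rack = info["rack"]
        free_by_rack[rack] = free_by_rack.get(rack, 0) + info["free_gpus"]
    candidates = [r for r, free in free_by_rack.items() if free >= gpus_needed]
    if not candidates:
        return None
    # Prefer the tightest fit so bigger racks stay open for bigger groups.
    return min(candidates, key=lambda r: (free_by_rack[r], r))


nodes = {
    "n1": {"rack": "rack-1", "free_gpus": 2},
    "n2": {"rack": "rack-2", "free_gpus": 1},
    "n3": {"rack": "rack-3", "free_gpus": 1},
}
print(rack_for_group(nodes, 4))  # None: four GPUs are free, but not in one rack
print(rack_for_group(nodes, 2))  # rack-1
```

The first call is the interesting one. The cluster has four free GPUs and the answer is still no, because the count was never the real question.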

But the direction is right.

AI platform teams are going to need ways to express not only "I need accelerators" but "I need these accelerators in a shape that makes the workload worth running." If that stays outside the scheduler, it becomes tribal knowledge, wrapper scripts, custom queues, and late-night Slack messages asking why the expensive job is slow again.

Topology should be part of the contract.

Not folklore.

preemption becomes political

Preemption is where scheduling stops being purely technical.

If the cluster cannot fit an important workload, something else may need to move. In ordinary pod-by-pod scheduling, preemption is already delicate. In workload-aware scheduling, it gets more interesting because the unit of disruption may be a whole PodGroup.

Kubernetes v1.36 introduces workload-aware preemption that treats a PodGroup as a single preemptor unit. It can look across the cluster and make enough room for the group instead of evaluating victims one node at a time. PodGroup priority and disruption mode add vocabulary for saying whether a group's pods can be disrupted independently or only all at once.
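A rough sketch of what group-level preemption means in practice. This is not the Kubernetes algorithm: `RunningGroup`, `pick_victims`, and the lowest-priority-first policy are assumptions for illustration. The sketch looks at the cluster as a whole, evicts only whole groups, and evicts nobody if the incoming group still would not fit.

```python
from typing import List, NamedTuple, Optional


class RunningGroup(NamedTuple):
    name: str
    priority: int
    gpus: int


def pick_victims(free: int, running: List[RunningGroup],
                 need: int, priority: int) -> Optional[List[str]]:
    """Return the groups to evict so `need` GPUs fit, or None if eviction cannot help."""
    victims: List[str] = []
    # Only strictly lower-priority groups are eligible; the lowest priority goes first.
    eligible = sorted((g for g in running if g.priority < priority),
                      key=lambda g: (g.priority, g.name))
    for group in eligible:
        if free >= need:
            break
        victims.append(group.name)
        free += group.gpus  # a group is evicted as a unit, never partially
    return victims if free >= need else None


running = [
    RunningGroup("etl", 10, 2),
    RunningGroup("notebook", 5, 1),
    RunningGroup("train-a", 100, 4),
]
print(pick_victims(1, running, need=4, priority=50))  # ['notebook', 'etl']
print(pick_victims(1, running, need=8, priority=50))  # None: nobody is evicted for nothing
```

Returning None in the second case is deliberate. Disrupting two groups and still failing to place the preemptor would be the worst of both outcomes.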

This is the point where the scheduler starts reflecting business policy.

Which training run can interrupt which batch job? Is a research experiment allowed to evict a lower-priority workload? Should a group be disrupted as a unit because partial eviction is worse than waiting? Are teams paying for reserved capacity, or are they sharing a common pool? Does the cluster favor utilization, fairness, deadlines, or executive urgency disguised as priority class?

Kubernetes will not answer those questions for you.

Good.

It should not.

But it can provide better primitives so platform teams do not encode every policy as a pile of conventions and custom controllers. Priority and disruption behavior are not just scheduler knobs. They are where resource politics become API fields.

That sounds uncomfortable because it is.

It is also honest.

DRA needed this partner

Dynamic Resource Allocation was an important step because accelerators are not all the same.

Requesting a generic extended resource is too blunt when hardware differs by model, topology, driver, health, partitioning, and sharing behavior. DRA gives Kubernetes a richer way to ask for and bind specialized devices. In v1.36, DRA keeps expanding with better fallback preferences, broader resource support, and ResourceClaim support at the PodGroup level.
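The fallback idea can be sketched as an ordered list of acceptable device descriptions: try the best one first, then settle for the next. Real DRA expresses this declaratively in claim objects; the dictionaries, attribute names, and the `first_satisfiable` helper below are hypothetical stand-ins, not the DRA API.

```python
from typing import Dict, List, Optional


def first_satisfiable(devices: List[Dict[str, str]],
                      preferences: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Return a device matching the earliest preference that any device satisfies."""
    for wanted in preferences:  # preferences are ordered, best first
        for dev in devices:
            if all(dev.get(key) == value for key, value in wanted.items()):
                return dev
    return None


devices = [
    {"name": "gpu-0", "model": "b", "memory": "40Gi"},
    {"name": "gpu-1", "model": "a", "memory": "80Gi"},
]
preferences = [{"model": "a", "memory": "80Gi"}, {"model": "b"}]
print(first_satisfiable(devices, preferences)["name"])  # gpu-1: the first preference matched
```

A generic counter cannot express "this model if you have it, that one otherwise." An ordered preference list can, and the workload's author no longer has to know the inventory in advance.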

But device allocation alone is not the whole job.

Once you can describe the hardware you need, you still need to place the workload that will use it. That is why the integration between DRA and workload-aware scheduling is the interesting long-term story.

DRA answers: what kind of device does this work need?

Workload-aware scheduling answers: can the work run as a group, in the right shape, under the right priority and disruption rules?

For AI infrastructure, those questions belong together.

A platform that can allocate a perfect set of devices but schedules the job badly is still wasting money. A scheduler that understands gang behavior but cannot reason about the actual devices is also incomplete. The useful control plane is the one that connects device claims, topology, queueing, preemption, health, and ownership.

That is when Kubernetes starts to look less like a place to run pods and more like the broker for expensive capacity.

the platform team still has work

None of this removes platform engineering.

It changes the work.

The tempting mistake is to read these features and assume the scheduler will make the hard choices automatically. It will not. Someone still has to decide which workloads qualify for gang scheduling, which topology labels are meaningful, which queues exist, how priority classes map to real commitments, how preemption is explained to humans, and how teams debug a workload that cannot be placed.

The day-two work may be the real test.

Can users understand why a PodGroup is waiting? Can operators see which topology constraint made placement impossible? Can finance understand why capacity is idle but not available to a particular job? Can an engineer tell whether the blocker is device health, quota, priority, topology, or a bad request? Can the platform expose enough information that "pending" stops being a mysterious state?
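One way to make "pending" less mysterious is to return reasons instead of a bare state. The sketch below is a hypothetical platform-side helper, not a Kubernetes API: the function name, its inputs, and the message formats are assumptions, and it leaves priority out to stay short.

```python
from typing import Dict, List


def explain_pending(need: int, quota_left: int,
                    free_by_rack: Dict[str, int], unhealthy: int = 0) -> List[str]:
    """Return human-readable reasons a group cannot be placed (empty list = placeable)."""
    if need <= 0:
        return ["bad request: the group asks for no devices"]
    reasons: List[str] = []
    if need > quota_left:
        reasons.append(f"quota: needs {need} GPUs, team has {quota_left} left")
    total_free = sum(free_by_rack.values())
    best_rack = max(free_by_rack.values(), default=0)
    if need > total_free:
        note = f" ({unhealthy} more are unhealthy)" if unhealthy else ""
        reasons.append(f"capacity: needs {need} GPUs, {total_free} healthy and free{note}")
    elif need > best_rack:
        reasons.append(
            f"topology: {total_free} GPUs are free, but the best rack holds only {best_rack}"
        )
    return reasons


for reason in explain_pending(4, 8, {"rack-1": 2, "rack-2": 2}):
    print(reason)  # topology: 4 GPUs are free, but the best rack holds only 2
```

That one line of output also answers the finance question. Capacity is idle and unavailable at the same time, and the message says why.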

This is where AI infrastructure becomes a product.

Not a pile of GPU nodes. Not a YAML museum. A product with queues, explanations, defaults, error messages, budgets, and escape hatches.

The scheduler primitives are necessary, but the user experience around them is where teams will either trust the platform or route around it.

the punchline

AI capacity is not just a cluster-size problem.

It is a fairness, topology, preemption, and observability problem wearing a YAML jacket.

Kubernetes workload-aware scheduling is interesting because it moves the platform closer to the actual shape of the work. PodGroups let the scheduler reason about groups instead of pretending every pod stands alone. Gang scheduling prevents partial jobs from squatting on expensive resources. Topology-aware placement admits that where capacity lives matters. Workload-aware preemption turns priority and disruption into explicit policy.

Together with DRA, this points at a more mature model for AI infrastructure.

Not "we bought GPUs, good luck."

More like: the cluster understands scarce devices, the scheduler understands workload shape, and the platform team can express how capacity should be shared when everyone wants the same expensive hardware at the same time.

That is less exciting than a benchmark chart.

It is much closer to the problem most teams will actually have.

The future bottleneck for AI platforms may not be whether Kubernetes can run the workload.

It may be whether Kubernetes can explain why this workload, with these devices, in this topology, at this priority, should run now instead of the other one.

That is the scheduler becoming a capacity broker.

And once the bill arrives, everyone suddenly cares about brokers.
