Dynamic Resource Allocation hits GA in Kubernetes 1.35, and GPU CI jobs finally get a real API

#kubernetes #dra #gpu #runners

The first time I tried to run a GPU test job on a shared Kubernetes cluster, I spent most of an afternoon fighting nodeSelector. The pod would land on the wrong node. Two builds would both grab the same card. The device plugin would happily expose a GPU another team had already booked, because nothing in the world was going to tell it otherwise. A colleague walked past, looked at my tangle of taints and affinities, and asked the fair question: "you're just trying to ask for a GPU, right?" That is the whole story, honestly. And that is the friction Dynamic Resource Allocation is meant to make go away.

What actually shipped

DRA is now generally available in Kubernetes v1.35. Alongside that milestone, the NVIDIA DRA driver (dra-driver-nvidia-gpu) has moved into Kubernetes SIGs, and its documentation has dropped the Beta label. CNCF published a hands-on walkthrough on July 1, written by CNCF Ambassador ChengHao Yang, that stands the new stack up on a Cluster API + OpenStack cluster with a mixed pool of GPUs: one NVIDIA RTX A5000 (about 23028Mi of memory) and two Tesla T10 cards at 16Gi each. The lab runs Kubernetes v1.35.3 on Ubuntu 24.04 with containerd 2.2.2, NVIDIA GPU Operator v26.3.1, and the NVIDIA DRA driver for GPUs v25.12.0.

Those are small version numbers in a blog post, but the substance underneath is a shift I have been waiting on for years.

Why the device-plugin path felt so wrong

If you have ever wired GPU access into CI, you already know the pattern. Extended resources exposed by a per-node device plugin. Heavy scheduling gymnastics stacked on top. Devices of the same kind had to be colocated on the same node. Every request needed a matching nodeSelector or affinity rule. It worked, but it read like YAML poetry every single time, and the failure modes were opaque enough that most teams eventually pinned entire jobs to one labelled pool and called it a day.

The CNCF post is blunt about this: the old model required "colocate the same kind of device on the same node" and pushed all the placement logic into "complex rules in nodeSelector or Affinity". DRA replaces that with an actual API.
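
For contrast, here is roughly what the old pattern looks like. This is a minimal sketch, written as a Python script that emits the manifest as JSON (kubectl accepts JSON as well as YAML). The nvidia.com/gpu extended resource is the name NVIDIA's device plugin advertises. The pod name, the image, the gpu-pool label key, and the matching taint are hypothetical stand-ins for whatever your cluster grew over time.

```python
import json


def legacy_gpu_pod(name: str, image: str, pool: str) -> dict:
    """Build a device-plugin-era GPU pod pinned by hand to one labelled pool."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            # Placement is manual: the pod names the pool it must land on.
            "nodeSelector": {"gpu-pool": pool},  # hypothetical label key
            "tolerations": [
                {
                    "key": "gpu-pool",  # hypothetical taint key
                    "operator": "Equal",
                    "value": pool,
                    "effect": "NoSchedule",
                }
            ],
            "containers": [
                {
                    "name": "test",
                    "image": image,
                    # The device plugin exposes only a counter. Which kind of
                    # GPU you get is implied by the pool, not by the request.
                    "resources": {"limits": {"nvidia.com/gpu": "1"}},
                }
            ],
        },
    }


if __name__ == "__main__":
    pod = legacy_gpu_pod("gpu-test", "example.com/ci/gpu-test:latest", "gpu-pool-b")
    print(json.dumps(pod, indent=2))
```

Note that nothing in the request says what kind of GPU the job needs. That knowledge lives in the label, the taint, and somebody's memory.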

The new primitives you will touch

DRA is small once you see the pieces:

DeviceClass describes the kind of device a workload can accept (say, "any NVIDIA GPU with at least 16Gi").
ResourceSlice is what a driver publishes on behalf of a node: an inventory of the devices actually available there.
ResourceClaim is a workload's request for devices from a class.
ResourceClaimTemplate stamps out claims per-pod, the way volumeClaimTemplates in a StatefulSet stamp out PersistentVolumeClaims for storage.
GpuConfig, exposed by the NVIDIA driver, adds vendor-specific tuning. The walkthrough demonstrates a TimeSlicing strategy so a single card can back multiple concurrent workloads.

Nothing here is subtle. The point is that requesting a specialised device stops being an ad-hoc affinity puzzle and starts looking like a real Kubernetes API, with the same shape as storage claims you already trust.
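
To make the shape concrete, here is a minimal sketch of a ResourceClaimTemplate and a pod that consumes it, again emitted as JSON from Python. It assumes the resource.k8s.io/v1 API, a DeviceClass named gpu.nvidia.com installed by the NVIDIA driver, and a capacity entry called memory in the driver's ResourceSlices. Check all three against your cluster (kubectl get deviceclasses, kubectl get resourceslices -o yaml) before relying on them. The object names and the image are hypothetical.

```python
import json

DEVICE_CLASS = "gpu.nvidia.com"  # assumed: DeviceClass installed by the NVIDIA DRA driver


def gpu_claim_template(name: str, min_memory: str) -> dict:
    """ResourceClaimTemplate asking for one GPU with at least min_memory."""
    # CEL selector evaluated against each device in the published ResourceSlices.
    # The 'memory' capacity name is an assumption about what the driver publishes.
    expression = (
        f"device.capacity['{DEVICE_CLASS}'].memory"
        f".compareTo(quantity('{min_memory}')) >= 0"
    )
    return {
        "apiVersion": "resource.k8s.io/v1",
        "kind": "ResourceClaimTemplate",
        "metadata": {"name": name},
        "spec": {
            "spec": {
                "devices": {
                    "requests": [
                        {
                            "name": "gpu",
                            "exactly": {
                                "deviceClassName": DEVICE_CLASS,
                                "selectors": [{"cel": {"expression": expression}}],
                            },
                        }
                    ]
                }
            }
        },
    }


def runner_pod(name: str, image: str, template: str) -> dict:
    """Pod that gets its own ResourceClaim stamped out from the template."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            # Pod-level entry: which claim (template) this pod uses.
            "resourceClaims": [{"name": "gpu", "resourceClaimTemplateName": template}],
            "containers": [
                {
                    "name": "job",
                    "image": image,
                    # Container-level entry: which of the pod's claims it may access.
                    "resources": {"claims": [{"name": "gpu"}]},
                }
            ],
        },
    }


if __name__ == "__main__":
    items = [
        gpu_claim_template("gpu-16gi", "16Gi"),
        runner_pod("runner-1", "example.com/ci/gpu-test:latest", "gpu-16gi"),
    ]
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

Piping the output to kubectl apply -f - should be enough to try it on a lab cluster. There is no nodeSelector anywhere: the scheduler matches the claim against the published ResourceSlices and picks the node.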

Where CI/CD teams will feel this first

Pipelines are where this pays back fastest. Model-training jobs, GPU-accelerated integration suites, browser farms with hardware decoders, anything CUDA-shaped: they all sit in the same queue as your Node.js install steps, waiting for a runner. With DRA, a runner pod can:

Ask for a class of GPU and let the scheduler place it, no per-pool labels required.
Share a single card across several short test jobs via time-slicing instead of dedicating a whole GPU to each pod.
Retire the pile of node labels that grew, one incident at a time, around the old scheme.

That last one is the quiet win you will notice first. Every runner Helm chart, every taint, every "please only run this on gpu-pool-b" note in a README. A lot of that becomes readable again.
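
The time-slicing case from the list above can be sketched the same way. Instead of a template, several runner pods reference one shared ResourceClaim, and the claim carries an opaque GpuConfig for the NVIDIA driver. Treat the GpuConfig payload as an assumption: the resource.nvidia.com/v1beta1 version and the sharing.strategy field follow the driver's published examples as I remember them, and your driver release may differ or need a feature gate. The claim name, pod names, and image are hypothetical.

```python
import json

DRIVER = "gpu.nvidia.com"  # assumed: driver name, also used as the DeviceClass name


def shared_timesliced_claim(name: str) -> dict:
    """One ResourceClaim for one GPU, configured for time-slicing."""
    return {
        "apiVersion": "resource.k8s.io/v1",
        "kind": "ResourceClaim",
        "metadata": {"name": name},
        "spec": {
            "devices": {
                "requests": [{"name": "gpu", "exactly": {"deviceClassName": DRIVER}}],
                "config": [
                    {
                        "requests": ["gpu"],
                        "opaque": {
                            "driver": DRIVER,
                            # Vendor-specific payload; shape assumed, verify
                            # against the driver version you actually run.
                            "parameters": {
                                "apiVersion": "resource.nvidia.com/v1beta1",
                                "kind": "GpuConfig",
                                "sharing": {"strategy": "TimeSlicing"},
                            },
                        },
                    }
                ],
            }
        },
    }


def runner_pods(claim: str, image: str, count: int) -> list:
    """Pods that all point at the same pre-created claim, and so the same card."""
    return [
        {
            "apiVersion": "v1",
            "kind": "Pod",
            "metadata": {"name": f"runner-{i}"},
            "spec": {
                "restartPolicy": "Never",
                # resourceClaimName (not a template) means the claim is shared.
                "resourceClaims": [{"name": "gpu", "resourceClaimName": claim}],
                "containers": [
                    {
                        "name": "job",
                        "image": image,
                        "resources": {"claims": [{"name": "gpu"}]},
                    }
                ],
            },
        }
        for i in range(1, count + 1)
    ]


if __name__ == "__main__":
    items = [shared_timesliced_claim("shared-gpu")]
    items += runner_pods("shared-gpu", "example.com/ci/gpu-test:latest", 3)
    print(json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2))
```

Because the claim is allocated once, every pod that references it lands on the node that owns that card. That is usually what you want for short test jobs, but it is worth knowing before you point thirty pods at one claim.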

The rough edges I would not sign off on yet

DRA is GA, which is not the same as ecosystem-ready. A few honest caveats:

Driver drift. The lab pins specific NVIDIA GPU Operator and DRA driver versions for a reason. If your fleet is a mix of managed distributions, you will hit "which driver ships when" questions before you hit any DRA question.
Time-slicing is easy to abuse. Two concurrent CUDA runs happily hitting the same VRAM ceiling look fine until they both OOM together. Slicing is a scheduling primitive, not a memory quota.
Autoscaler behaviour. Cluster Autoscaler and DRA are still learning to speak to each other in shipping deployments. If you rely on scale-from-zero for expensive GPU pools, verify it end to end before you migrate a paying workload.
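
On the time-slicing caveat, the arithmetic is worth doing before you share a card. A minimal sketch: sum the peak VRAM of the jobs you intend to co-schedule and compare it with the card's memory, keeping some headroom. The 23028Mi figure is the A5000 number from the lab. The per-job peaks and the 10% headroom are made-up values for illustration, and nothing in DRA enforces this check for you.

```python
A5000_MIB = 23028  # memory figure quoted for the RTX A5000 in the lab


def fits_on_card(peak_mib: list, card_mib: int, headroom: float = 0.10) -> bool:
    """Return True if the summed peak VRAM fits while leaving `headroom` free.

    Time-slicing shares compute time only, so the peaks add up in the
    worst case where every job hits its maximum at the same moment.
    """
    budget = card_mib * (1.0 - headroom)
    return sum(peak_mib) <= budget


if __name__ == "__main__":
    # Hypothetical per-job peak usage in Mi.
    print(fits_on_card([9000, 9000], A5000_MIB))
    print(fits_on_card([9000, 9000, 6000], A5000_MIB))
```

With 10% headroom the budget is about 20725Mi, so two 9000Mi jobs fit and adding a third at 6000Mi does not. If the numbers do not fit on paper, they will not fit at 2 a.m. either.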

How the wider ecosystem is approaching the same problem

DRA is not the only way to get a GPU into a CI job, and it does not obsolete the alternatives:

The classic device-plugin path still works for teams that do not need per-claim flexibility. If your fleet is homogeneous and your labels are already tidy, staying put is fine.
Ephemeral cloud runners (self-hosted GitHub Actions on GPU VMs, GitLab runners on managed instances, Buildkite agents on cloud hardware) sidestep the Kubernetes piece entirely. You rent a GPU node per job and skip the scheduler question.
Volcano and Kueue offer batch-style queueing on top of Kubernetes. They still handle the queueing behaviour that DRA on its own does not provide, and both projects are aligning with DRA as the underlying allocation primitive.
Managed platforms like cloud-managed batch services or specialised inference platforms lift the abstraction higher still: you hand off a workload, they pick the hardware.

DRA does one thing well. It gives the plain-Kubernetes path a decent primitive. Whether that beats a managed alternative depends entirely on how much of the surrounding scheduler you were already running.

What I am watching next

Two things. First, whether the NVIDIA driver stays the reference implementation or whether AMD and Intel driver work catches up fast enough that DeviceClass really is a portable contract across vendors. Second, how the big managed Kubernetes offerings ship default DRA support, because until that lands in a click-to-enable form, most teams will treat DRA as an interesting internal experiment and keep their old node labels alive. Ask a friend on a shared GPU cluster next week whether they have torn out their nodeSelector rules yet. Their answer is the real launch date.

Top comments (2)

Max Quimby • Jul 4

The time-slicing line is the one I'd flag hardest to anyone reading this as a green light. DRA finally makes requesting a GPU sane — a ResourceClaim reads like a PVC, which is exactly the mental model people already trust — but time-slicing a single card across CI jobs doesn't give you memory isolation. Two pods sharing one A5000 via TimeSlicing are still in the same memory space; one job that leaks or OOMs can absolutely take its neighbor down with it. If your test jobs are untrusted or just poorly behaved, MIG partitioning is the safer share, even though it's coarser and strands some capacity. The other thing I'd watch early is observability: once the scheduler places a claim for you, "which physical card did runner-7 actually land on?" gets harder to answer than it was with pinned pools — and you want that answer the first time a job is mysteriously slow. Genuinely excited about the API though; retiring the taint/affinity poetry alone is worth the upgrade.

Leo • Jul 6

This is the right flag, thanks.
One nuance worth adding: MIG isn't on the table for the lab hardware in the CNCF post - neither the A5000 nor the T10s support it - so on mixed or workstation-class fleets the practical middle ground is MPS, which at least gives per-client memory limits.

And on the "which card did I land on" question: the ResourceClaim status.allocation records the allocated device, so it's queryable - just a different habit than reading node labels, and definitely something to wire into dashboards before trusting it in anger.