Paulo Victor Leite Lima Gomes

Posted on May 27

kubernetes dra is making ai hardware an api contract

#kubernetes #aws #platformengineering #ai

Kubernetes has been slowly turning infrastructure into API objects for a decade.

Pods, Services, Ingress, volumes, identities, policies, autoscalers, gateways. The pattern is familiar by now: something starts as a messy operational concern, then Kubernetes absorbs enough of it that teams can stop passing tribal knowledge around in Slack and start expressing intent in a resource.

Dynamic Resource Allocation is one of those changes that sounds deeply unsexy until you run into the problem it solves.

And AI infrastructure is making that problem very expensive.

The short version: Kubernetes DRA lets workloads request specialized resources through richer, scheduler-aware claims instead of the old "give me N devices and good luck" model. In Kubernetes v1.36, DRA keeps getting more serious. AWS also published DRA drivers for Trainium and Elastic Fabric Adapter, which is a pretty good signal that this is not just upstream architecture astronomy.

This is the platform moving from "schedule my pod" to "broker scarce hardware correctly."

That is a big shift.

accelerators are not just bigger cpus

For a long time, Kubernetes scheduling was mostly about CPU and memory. Yes, storage mattered. Yes, networking mattered. Yes, GPUs existed. But the basic shape of a workload request was still understandable by normal human beings: this pod needs some CPU, some memory, maybe a device count, and a few placement constraints.

AI workloads are much less polite.

They care about accelerator type, topology, NUMA boundaries, high-performance networking, device locality, sharing behavior, driver configuration, and weird per-workload settings that make perfect sense to the ML team.

The old device plugin model helped Kubernetes notice specialized hardware, but it is still too blunt for a lot of this work. "Four accelerators" is not the same thing as "four accelerators placed close to the right network interfaces, with the right topology."

That difference used to be hidden in custom schedulers, init containers, launch templates, admission hooks, internal runbooks, and people who somehow knew which instance types should never be mixed.

the resource claim is the interesting object

The important idea in DRA is that hardware requirements become claims the scheduler can understand.

Instead of treating specialized hardware as a flat count, the platform can expose richer attributes. A workload can ask for a class of resource. The scheduler can reason about placement. The kubelet and drivers can coordinate allocation before the workload starts.
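To make that less abstract, here is a minimal sketch of what such a claim can look like, built as a plain Python dict and printed as JSON. It assumes the `resource.k8s.io/v1` shape of ResourceClaim, where each request wraps its details in `exactly`; older beta versions of the API lay this out differently, so check what your cluster serves. The DeviceClass name `accelerator.example.com` and the `generation` attribute are hypothetical, not taken from any real driver.

```python
import json


def resource_claim(name, device_class, count=1, selector=None):
    """Build a ResourceClaim manifest as a plain dict.

    Assumes the resource.k8s.io/v1 layout; adjust for the API version
    your cluster actually serves.
    """
    exactly = {
        "deviceClassName": device_class,
        "allocationMode": "ExactCount",
        "count": count,
    }
    if selector:
        # Selectors are CEL expressions evaluated against device attributes.
        exactly["selectors"] = [{"cel": {"expression": selector}}]
    return {
        "apiVersion": "resource.k8s.io/v1",
        "kind": "ResourceClaim",
        "metadata": {"name": name},
        "spec": {"devices": {"requests": [{"name": "accel", "exactly": exactly}]}},
    }


claim = resource_claim(
    "training-accelerators",
    "accelerator.example.com",  # hypothetical DeviceClass name
    count=4,
    # Hypothetical attribute; real drivers publish their own attribute names.
    selector='device.attributes["accelerator.example.com"].generation == "v2"',
)
print(json.dumps(claim, indent=2))
```

A Pod would then reference the claim by name under `spec.resourceClaims` and list it in a container's `resources.claims`. The useful part is that "four devices of this class, matching this expression" is now data the scheduler can evaluate, not a comment in a runbook.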

That sounds like plumbing because it is plumbing.

But good plumbing changes the product.

Once scarce hardware is represented as an API contract, platform teams can build golden paths around it. They can define approved resource classes for training, inference, experimentation, and expensive production jobs without making every ML engineer learn the physical topology of a node.

That is the win. The messy agreement between ML teams and infrastructure teams can move from tickets and conventions into Kubernetes objects.

When that happens, the scheduler becomes part of the organizational contract.

topology is where the money leaks

The AWS Trainium and Elastic Fabric Adapter example is useful because it makes the problem concrete. Distributed AI workloads are not only asking for compute. They are asking for compute plus fast communication. If the accelerator and the network path are badly aligned, the workload may still run, but slower. Slower means more expensive, and eventually teams start inventing bypasses around the platform.

This is how infrastructure gets weird.

Someone discovers that a particular placement works better. Then a script appears. Then a wiki page says to use the script. Three months later nobody knows whether the platform supports the pattern or whether the pattern is just a pile of heroic exceptions.

DRA is interesting because it gives Kubernetes a native place to carry that intent. AWS describes the EFA and Neuron DRA drivers as a way to give Kubernetes topology-aware placement for high-performance networking and accelerators. The point is that the platform can hide more of the physical mess behind a declarative request that still preserves the important constraints.
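One way to picture "a declarative request that preserves the important constraints" is a single claim with two requests tied together by a constraint. The sketch below is an illustration, not the AWS drivers' actual schema: the two DeviceClass names and the `topology.example.com/group` attribute are made up, and it again assumes the `resource.k8s.io/v1` layout. The `matchAttribute` constraint asks that every device allocated for the listed requests reports the same value for that attribute.

```python
import json

# Hypothetical names. Real drivers publish their own DeviceClasses and
# attributes; inspect the driver's ResourceSlices to see what exists.
ACCEL_CLASS = "accelerator.example.com"
NIC_CLASS = "fabric-nic.example.com"
LOCALITY_ATTR = "topology.example.com/group"


def aligned_claim(name, accelerators, nics):
    """Claim accelerators and NICs that must share one locality attribute."""

    def request(req_name, device_class, count):
        return {
            "name": req_name,
            "exactly": {
                "deviceClassName": device_class,
                "allocationMode": "ExactCount",
                "count": count,
            },
        }

    return {
        "apiVersion": "resource.k8s.io/v1",
        "kind": "ResourceClaim",
        "metadata": {"name": name},
        "spec": {
            "devices": {
                "requests": [
                    request("accel", ACCEL_CLASS, accelerators),
                    request("nic", NIC_CLASS, nics),
                ],
                # All devices from both requests must agree on LOCALITY_ATTR.
                "constraints": [
                    {"requests": ["accel", "nic"], "matchAttribute": LOCALITY_ATTR}
                ],
            }
        },
    }


print(json.dumps(aligned_claim("distributed-training", 4, 1), indent=2))
```

If no set of devices satisfies the constraint, the claim stays unallocated instead of quietly landing on a slow path, which is exactly the failure mode you want to see rather than pay for.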

That is what good platform abstractions do.

They do not pretend physics disappeared.

They give physics a better API.

this changes the platform team job

The platform team used to provide clusters. Then deployment paths. Then security controls, observability, secrets, service templates, internal developer portals, and all the other things that slowly accumulate around a production system.

AI hardware pushes the platform team into another role: resource broker.

Not just "can the pod run?" but:

Which team is allowed to consume this accelerator class?
Is this request appropriate for experimentation or production?
Can this workload share a network interface safely?
Does this job need topology guarantees?
Should this workload preempt something cheaper?
How do we explain why the scheduler placed it there?
What is the cost of this claim if it sits pending for an hour?

These questions are not YAML trivia. They are business questions wearing infrastructure clothes.

That is why I think DRA matters. It gives platform teams a better primitive for turning those decisions into policy and product. The alternative is a swamp of custom schedulers, per-team conventions, and "ask the infra channel before launching the big job."

do not make every team become a hardware team

There is a trap here.

Because AI infrastructure is expensive and specialized, organizations will be tempted to push too much detail onto application and ML teams. "Here are the accelerator types, the network topology, the driver caveats, the quota model, and a spreadsheet. Please be responsible."

That is not a platform. That is a procurement process with kubectl.

The better version is opinionated resource classes.

Give teams a small menu:

small inference
batch inference
interactive experiment
distributed training
large reserved training

Behind those classes, the platform can encode device allocation, topology, network locality, sharing rules, quota, cost labels, and admission policy. The user should still understand the cost and tradeoffs, but they should not need to know every physical constraint before doing useful work.
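As a rough sketch of what "a small menu" could mean in code, the table below maps each menu entry to the settings the platform owns. Everything in it is illustrative: the DeviceClass names, counts, and cost tiers are invented for the example, and a real platform would likely keep this in DeviceClass objects, claim templates, and admission policy rather than a Python dict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceProfile:
    device_class: str     # hypothetical DeviceClass name
    device_count: int     # illustrative default
    needs_topology: bool  # whether placement constraints are attached
    shareable: bool       # whether devices may be shared with other workloads
    cost_tier: str        # label for chargeback and dashboards


# The user-facing menu; the right-hand side is what the platform hides.
MENU = {
    "small inference": ResourceProfile("inference.example.com", 1, False, True, "low"),
    "batch inference": ResourceProfile("inference.example.com", 2, False, True, "medium"),
    "interactive experiment": ResourceProfile("training.example.com", 1, False, True, "medium"),
    "distributed training": ResourceProfile("training.example.com", 8, True, False, "high"),
    "large reserved training": ResourceProfile("training.example.com", 16, True, False, "reserved"),
}


def resolve(choice):
    """Translate a menu choice into platform-owned settings."""
    try:
        return MENU[choice]
    except KeyError:
        options = ", ".join(sorted(MENU))
        raise ValueError(f"unknown resource class {choice!r}; choose one of: {options}") from None


profile = resolve("distributed training")
print(profile.device_class, profile.device_count, profile.cost_tier)
```

The design choice worth noticing is the error path: an unknown request fails loudly with the list of supported options, instead of letting someone hand-craft a one-off claim that nobody owns.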

Same lesson as every good internal platform: hide what can be hidden, expose what must be chosen, and make the dangerous path explicit.

scheduling becomes governance

People often talk about AI infrastructure as a capacity problem. We need more GPUs. More accelerators. More regions. More quota. Sure. Sometimes the answer is literally "buy more hardware." But after you buy it, you still have to allocate it.

That allocation layer becomes governance very quickly. Who gets priority? Which jobs can use the most expensive resource classes? Which workloads are allowed to share? Which claims require approval? Which teams are burning money because their jobs are topology-incorrect but technically green?

This is where Kubernetes keeps doing Kubernetes things. The control plane absorbs more ambiguity. What used to be a human coordination problem becomes API shape, scheduler behavior, policy, and observability.

That is not magic. It is still hard. But it is a better kind of hard than every team manually negotiating hardware access through ad hoc scripts.

what i would do first

If I were building an AI platform on Kubernetes today, I would not start by exposing every DRA feature to every user.

I would start with one expensive workflow that already hurts.

Pick a training or inference path where teams regularly need help from infrastructure. Define one or two resource classes. Add cost labels. Add clear quota. Make pending claims visible. Make placement decisions explainable enough that people can distinguish "the cluster is full" from "your request cannot be satisfied with this topology."
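The "cluster is full" versus "wrong topology" distinction can be sketched with a toy model. This is not how the Kubernetes scheduler works internally; it only assumes you can collect, per node, how many devices are free in each topology group (both the grouping and the input format are hypothetical) and that a request needs all of its devices from a single group.

```python
def explain_pending(free_by_node, wanted):
    """Classify why a request for `wanted` co-located devices cannot be placed.

    free_by_node maps node name -> {topology group -> free device count}.
    Toy model: a request is satisfiable only if one group has enough devices.
    """
    group_counts = [n for groups in free_by_node.values() for n in groups.values()]
    total_free = sum(group_counts)
    largest_group = max(group_counts, default=0)

    if total_free < wanted:
        return "cluster is full"
    if largest_group < wanted:
        return "capacity exists, but not within one topology group"
    return "satisfiable"


snapshot = {
    "node-a": {"group-0": 2, "group-1": 2},
    "node-b": {"group-0": 3},
}
print(explain_pending(snapshot, 4))
print(explain_pending(snapshot, 3))
print(explain_pending(snapshot, 8))
```

Even a crude classification like this changes the conversation: the first answer is a capacity request, the second is a conversation about the claim itself.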

Then measure the boring stuff:

how often jobs sit pending
how often teams request the wrong class
how much accelerator capacity is stranded
how often manual intervention is needed
how much performance changes when topology is handled correctly
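Most of these numbers fall out of very plain bookkeeping. The sketch below assumes you already export one record per job plus a device count; the record fields (including `suitable_class`, which stands for whatever a reviewer later judged appropriate) and the definition of "stranded" as devices idle while jobs were pending are my assumptions, not a standard. The performance comparison is left out because it needs workload-specific benchmarks.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class JobRecord:
    pending_seconds: float    # time between submission and start
    requested_class: str      # what the team asked for
    suitable_class: str       # hypothetical: what review judged appropriate
    needed_manual_help: bool  # did infrastructure have to step in?


def summarize(jobs, devices_total, devices_idle_while_pending):
    """Compute the boring metrics from job records and device counts."""
    n = len(jobs)
    wrong = sum(j.requested_class != j.suitable_class for j in jobs)
    manual = sum(j.needed_manual_help for j in jobs)
    return {
        "median_pending_seconds": median(j.pending_seconds for j in jobs),
        "wrong_class_rate": wrong / n,
        "manual_intervention_rate": manual / n,
        "stranded_fraction": devices_idle_while_pending / devices_total,
    }


jobs = [
    JobRecord(0, "small inference", "small inference", False),
    JobRecord(60, "distributed training", "distributed training", True),
    JobRecord(120, "distributed training", "interactive experiment", False),
    JobRecord(600, "batch inference", "batch inference", True),
]
report = summarize(jobs, devices_total=16, devices_idle_while_pending=2)
for name, value in report.items():
    print(f"{name}: {value}")
```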

That feedback loop matters more than announcing a grand AI platform.

The first version should remove one painful class of coordination. Then another. Then another.

the punchline

Kubernetes DRA is not only a device allocation feature. It is a sign that AI workloads are forcing hardware to become part of the platform API.

That is good.

The old model asks teams to understand too much physical detail or rely on too much hidden platform magic. The better model turns scarce, topology-sensitive resources into explicit claims, policies, and scheduler decisions.

The future AI platform is not just a pile of accelerators attached to Kubernetes. It is a control plane that can broker those accelerators without making every user become a hardware specialist.

Containers made compute feel portable. Kubernetes made operations programmable. AI hardware is now pushing the next uncomfortable step: the expensive parts of the machine need contracts too.

And once something becomes a contract, someone has to own the terms.

That is platform engineering.
