Daya Shankar
GPU Scheduling Deep Dive: How Cloud Providers Allocate GPUs for Multi-Tenant AI Workloads

Cloud GPU “scheduling” is a chain of gates: quota decides if you’re allowed to ask, capacity reservations decide if GPUs exist in a zone, placement bins you onto physical hosts, and partitioning decides how many tenants share silicon (full GPU, MIG, time-slicing, vGPU). 

Kubernetes sits at the end and consumes whatever the platform exposes. 

GPU allocation is a pipeline, not one scheduler

If you can’t name the layer that said “no,” you can’t fix it.

```
Request (API / YAML)
  -> Account quota (by region + purchasing option)
  -> Zonal capacity (reservation / capacity block / spot pool)
  -> Placement (host + network topology)
  -> Partitioning (full GPU | MIG | time-slicing | vGPU)
  -> Cluster scheduler (Kubernetes / Slurm)
  -> Kubelet + device plugin exposes devices to containers
```
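To make the "name the layer" rule concrete, here is a minimal bash helper that maps common error strings to the gate that most likely produced them. It is a sketch, not provider-official tooling: the match patterns are assumptions based on the messages quoted in the table below, so extend them with the exact strings your providers emit.

```shell
# classify_gpu_failure: map a provider/scheduler error message to the
# pipeline gate that most likely produced it. Patterns are illustrative.
classify_gpu_failure() {
  # lowercase the message portably for matching
  local msg
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *"quota exceeded"*|*"limit exceeded"*)
      echo "quota" ;;
    *"insufficient capacity"*|*"insufficientinstancecapacity"*|*"zone_resource_pool_exhausted"*)
      echo "capacity" ;;
    *"placement group"*)
      echo "placement" ;;
    *"insufficient nvidia.com/gpu"*|*"0/"*"nodes are available"*)
      echo "cluster-scheduling" ;;
    *)
      echo "unknown" ;;
  esac
}
```

Usage: feed it the tail of a failed launch or `kubectl describe pod` output, e.g. `classify_gpu_failure "0/12 nodes are available: Insufficient nvidia.com/gpu"`.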

What each layer controls

This table routes incidents to the right team fast.

| Layer | Who controls it | What failure looks like | What to check |
|---|---|---|---|
| Quota | Provider control plane | “quota exceeded” / “limit exceeded” | Instance type quotas + service quotas docs |
| Capacity | Provider control plane | “insufficient capacity” / stuck provisioning | Capacity Reservations / Capacity Blocks |
| Placement | Provider control plane | GPUs launch, topology is wrong | Placement groups / cluster strategy |
| Partitioning | Provider + NVIDIA stack | Noisy neighbors / unfair sharing | MIG vs time-slicing vs vGPU |
| Cluster scheduling | You | Pods Pending | GPU resources + device plugin plumbing |

Partitioning decides multi-tenancy behavior

This is where you pick isolation vs utilization.

| Mode | What it does | Isolation | What breaks first | Best fit |
|---|---|---|---|---|
| Full GPU | One workload per GPU | High | Utilization | Training, big batch |
| MIG | Hardware partitions with dedicated compute/memory | High | Fragmentation by profile | Inference, fine-tune with QoS |
| Time-slicing | Oversubscribe GPUs; workloads interleave | Low | Noisy neighbor | Burst inference, dev/test |
| vGPU | Virtual GPU slices via hypervisor stack | Medium–High | Licensing + ops | Shared VM fleets, VDI, controlled slices |

Two blunt truths:

  • Time-slicing is not memory/fault isolation. It’s interleaving. 
  • MIG is real partitioning with fixed profiles. That creates profile fragmentation if you don’t standardize. 

How major cloud providers gate GPU allocation

You can’t “autoscale GPUs” if the provider won’t hand you any.

Amazon Web Services

Quota and capacity are separate checks.

  • Instance type quotas are grouped by purchasing option (On-Demand, Spot, Dedicated, Capacity Blocks). 
  • Capacity Reservations reserve compute capacity in a specific AZ. 
  • Capacity Blocks for ML reserve GPU instances for a future time window for short-duration ML workloads. 

Google Cloud

Reservations are zonal, and they validate capacity up front.

  • When you create a reservation, Compute Engine verifies capacity in the specified zone, then reserves it. 
  • GPU machine types include A3 variants backed by H100 SKUs (A3 High/Mega/Edge). 
  • Reservation types exist to ensure that resources such as GPUs are available when you need them. 

Microsoft Azure

Quota often shows up as vCPU-family limits plus capacity reservation.

  • Azure VM quotas are tiered (total regional vCPUs + VM-family cores). If either is exceeded, deployment fails. 
  • Azure capacity reservation reserves compute capacity in a region or AZ for any duration. 
  • ND H100 v5 starts at 8× H100 GPUs per VM (Azure docs). 
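The gates above can all be checked from each provider's CLI. The snippet below just prints a runbook-style cheat sheet, so it is safe to run anywhere; the quota filter, zone, and machine type are placeholder examples, so substitute your own before running any of the printed commands.

```shell
# Print provider-side checks for the quota and capacity gates.
# Zones, machine types, and quota filters below are examples only.
print_gpu_gate_checks() {
  cat <<'EOF'
# AWS: quota is per instance family and purchasing option; capacity is separate
aws service-quotas list-service-quotas --service-code ec2 \
  --query "Quotas[?contains(QuotaName, 'P instances')]"
aws ec2 describe-capacity-reservations

# GCP: reservations are zonal; creating one validates capacity up front
gcloud compute reservations list
gcloud compute reservations create h100-pool \
  --zone=us-central1-a --vm-count=2 --machine-type=a3-highgpu-8g

# Azure: check regional + family vCPU tiers, then capacity reservations
az vm list-usage --location eastus --output table
az capacity reservation group list
EOF
}
print_gpu_gate_checks
```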

Kubernetes GPU scheduling is device-plugin driven

The scheduler can’t place what the node doesn’t advertise.

Kubernetes exposes GPUs through device plugins; workloads request resources like nvidia.com/gpu, and the scheduler places Pods on nodes with allocatable capacity. 

Full GPU request

This is the baseline contract.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base
      command: ["bash", "-lc", "nvidia-smi && sleep 3600"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

MIG request

You request a profile resource, not “a GPU.”

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-infer
spec:
  containers:
    - name: infer
      image: your-infer-image
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1
```

MIG is explicitly described as partitioning supported GPUs into isolated instances with dedicated compute/memory. 
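If you manage MIG through the GPU Operator's MIG manager, the layout is declared in a mig-parted style config. The sketch below carves every GPU on the node into seven 1g.10gb slices, matching the resource requested above; the count of seven assumes an 80 GB H100/A100, so verify the profile table for your SKU.

```yaml
version: v1
mig-configs:
  all-1g.10gb:
    - devices: all        # apply to every GPU on the node
      mig-enabled: true
      mig-devices:
        "1g.10gb": 7      # max slices per 80GB GPU; counts are profile-dependent
```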

Time-slicing (oversubscription)

This raises utilization and raises blast radius.

```yaml
version: v1
sharing:
  timeSlicing:
    renameByDefault: true
    resources:
      - name: nvidia.com/gpu
        replicas: 10
```

Time-slicing is documented as oversubscription where workloads interleave on the same GPU. 

Installing the NVIDIA stack

You need one canonical way to install drivers + device plugin + toolkit + monitoring.

The NVIDIA GPU Operator automates driver + device plugin + container toolkit + labeling + DCGM monitoring components. 
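A sketch of Helm values for the Operator chart, wiring the components above together. The key names follow the chart's documented values, but verify them against your chart version before deploying; the ConfigMap name is a hypothetical example.

```yaml
# values.yaml sketch for the NVIDIA GPU Operator Helm chart
driver:
  enabled: true          # set false if the node image pre-installs the driver
toolkit:
  enabled: true          # container toolkit for GPU-aware runtimes
migManager:
  enabled: true          # required if any pool runs MIG
devicePlugin:
  config:
    name: time-slicing-config   # hypothetical ConfigMap holding a sharing config
dcgmExporter:
  enabled: true          # DCGM metrics for utilization dashboards
```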

Queue GPUs at the job layer or you’ll drown in Pending Pods

Pods Pending is not a scheduling policy.

Kueue manages quotas and decides when a job waits, when it’s admitted (Pods can be created), and when it’s preempted. 

Practical pattern:

  • One shared cluster.
  • Separate GPU node pools by workload class.
  • Kueue ClusterQueues per tenant/team.
  • Admission control before Pods exist.

GKE publishes a multi-tenant Kueue tutorial that matches this model. 
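That pattern can be sketched in Kueue's v1beta1 API. The names here — the flavor, the team, the quota of 16 GPUs — are hypothetical; the structure is the part to copy.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: h100-full          # hypothetical flavor for the training pool
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a
spec:
  namespaceSelector: {}    # admit from any namespace with a matching LocalQueue
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: h100-full
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 16   # team-a's share of the shared pool
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue
  namespace: team-a
spec:
  clusterQueue: team-a
```

Jobs submitted to `team-a-queue` wait until the ClusterQueue admits them, so the scheduler never sees Pods it cannot place.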

Reference architecture for multi-tenant AI workloads

This is a setup you can run for months without babysitting.

Split pools by workload class

This prevents MIG profile churn from breaking training placement.

Pool

GPU strategy

Controls

Notes

Training

Full GPUs

Kueue + quotas

avoids MIG fragmentation

Inference

MIG

Kueue or HPA

predictable slices

Dev/Test

Time-slicing

loose quotas

accept noisy neighbors

Use taints/tolerations to keep workloads honest

This stops inference from landing on training nodes “because it fit.”

```yaml
# training nodes
spec:
  taints:
    - key: gpu-class
      value: training
      effect: NoSchedule
```

```yaml
# training pods
spec:
  tolerations:
    - key: gpu-class
      operator: Equal
      value: training
      effect: NoSchedule
```

Where AceCloud.ai fits

You keep the same scheduling stack. You change where the capacity comes from.

AceCloud publishes:

  • Cloud GPU instances with H100/A100/L40S listed as available options. 
  • Spot GPU pricing pages with per-hour rates and “saving” percentages by SKU (example: L40S in Mumbai). 
  • Managed control plane claims, including a stated 99.99% uptime SLA on the managed control plane page. 

How teams wire AceCloud into multi-tenant scheduling

This is the boring path. It works.

  1. Deploy a managed Kubernetes cluster plus GPU node pools (training vs inference). 
  2. Install NVIDIA GPU Operator once per cluster. 
  3. Enable MIG on inference pools; keep training pools full GPU. 
  4. Add Kueue for tenant quotas and job admission. 
  5. Use spot GPUs for interruptible inference/batch where it fits your SLOs. 

Ops checklist for debugging “we can’t get GPUs”

This is what you run before you open a ticket.

1) Confirm the provider gate that blocked you

Quota errors and capacity errors look similar in dashboards. They aren’t.

  • AWS: instance type quotas by purchasing option; capacity reservations / capacity blocks. 
  • GCP: reservations validate capacity at creation time. 
  • Azure: vCPU quota tiers plus capacity reservation. 

2) Confirm Kubernetes sees allocatable GPUs

If the node doesn’t advertise it, the scheduler can’t place it.

```shell
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu
kubectl describe node <node> | grep -E "nvidia.com/gpu|nvidia.com/mig"
```

Device plugin plumbing is the core dependency here. 

3) Confirm you didn’t fragment MIG profiles

MIG failures are often self-inflicted.

If your inference pool is carved into small profiles, large-profile jobs won’t place until you reconfigure the GPU. MIG profiles are fixed. 
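Before assuming a scheduling bug, look at what the GPU is actually carved into. A small guarded helper (requires `nvidia-smi` on a GPU node; elsewhere it just prints a hint):

```shell
# list_mig_layout: show the current MIG layout and the profiles the
# hardware supports, so fragmentation is visible before you file a ticket.
list_mig_layout() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi mig -lgi    # GPU instances currently created
    nvidia-smi mig -lgip   # profiles this GPU supports, with max counts
  else
    echo "nvidia-smi not found: run this on a GPU node"
  fi
}
list_mig_layout
```

If the listed instances are all small profiles and your stuck job requests a large one, the fix is reconfiguration, not more quota.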

Conclusion

Cloud providers allocate GPUs through quota + capacity + placement + partitioning before Kubernetes schedules anything. Multi-tenant reliability comes from picking the right sharing primitive: full GPU for training, MIG for isolated inference, time-slicing only for best-effort workloads. Add Kueue so jobs queue instead of stalling as Pending Pods. Use AceCloud when capacity gating is your bottleneck, while keeping the same Kubernetes + NVIDIA Operator model.

 
