Cloud GPU “scheduling” is a chain of gates: quota decides if you’re allowed to ask, capacity reservations decide if GPUs exist in a zone, placement bins you onto physical hosts, and partitioning decides how many tenants share silicon (full GPU, MIG, time-slicing, vGPU).
Kubernetes sits at the end and consumes whatever the platform exposes.
## GPU allocation is a pipeline, not one scheduler
If you can’t name the layer that said “no,” you can’t fix it.
```
Request (API / YAML)
  -> Account quota (by region + purchasing option)
  -> Zonal capacity (reservation / capacity block / spot pool)
  -> Placement (host + network topology)
  -> Partitioning (full GPU | MIG | time-slicing | vGPU)
  -> Cluster scheduler (Kubernetes / Slurm)
  -> Kubelet + device plugin exposes devices to containers
```
## What each layer controls
This table routes incidents to the right team fast.
| Layer | Who controls it | What failure looks like | What to check |
|---|---|---|---|
| Quota | Provider control plane | "quota exceeded" / "limit exceeded" | Instance type quotas + service quotas docs |
| Capacity | Provider control plane | "insufficient capacity" / stuck provisioning | Capacity Reservations / Capacity Blocks |
| Placement | Provider control plane | GPUs launch, topology is wrong | Placement groups / cluster strategy |
| Partitioning | Provider + NVIDIA stack | Noisy neighbors / unfair sharing | MIG vs time-slicing vs vGPU |
| Cluster scheduling | You | Pods Pending | GPU resources + device plugin plumbing |
## Partitioning decides multi-tenancy behavior
This is where you pick isolation vs utilization.
| Mode | What it does | Isolation | What breaks first | Best fit |
|---|---|---|---|---|
| Full GPU | One workload per GPU | High | Utilization | Training, big batch |
| MIG | Hardware partitions with dedicated compute/memory | High | Fragmentation by profile | Inference, fine-tuning with QoS |
| Time-slicing | Oversubscribe GPUs; workloads interleave | Low | Noisy neighbor | Burst inference, dev/test |
| vGPU | Virtual GPU slices via hypervisor stack | Medium–High | Licensing + ops | Shared VM fleets, VDI, controlled slices |
Two blunt truths:
- Time-slicing is not memory/fault isolation. It’s interleaving.
- MIG is real partitioning with fixed profiles. That creates profile fragmentation if you don’t standardize.
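Standardizing can be as simple as allowing one profile layout per node pool. A sketch in the `mig-parted` config format the GPU Operator consumes — the config name here is made up, and the profile count assumes 80GB A100/H100 parts:

```yaml
version: v1
mig-configs:
  # Hypothetical single-profile layout: every GPU is carved identically,
  # so any 1g.10gb workload fits any node in the pool.
  inference-standard:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.10gb": 7   # max count of 1g slices on an 80GB A100/H100
```

Node pools then select a layout via the `nvidia.com/mig.config` node label instead of ad-hoc per-node reconfiguration.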
## How major cloud providers gate GPU allocation
You can’t “autoscale GPUs” if the provider won’t hand you any.
### Amazon Web Services
Quota and capacity are separate checks.
- Instance type quotas are grouped by purchasing option (On-Demand, Spot, Dedicated, Capacity Blocks).
- Capacity Reservations reserve compute capacity in a specific AZ.
- Capacity Blocks for ML reserve GPU instances for a future time window for short-duration ML workloads.
### Google Cloud
Reservations are zonal, and they validate capacity up front.
- When you create a reservation, Compute Engine verifies capacity in the specified zone, then reserves it.
- GPU machine types include A3 variants backed by H100 SKUs (A3 High/Mega/Edge).
- Reservations can include optional resources such as GPUs, guaranteeing they're available when you need them.
### Microsoft Azure
Quota often shows up as vCPU-family limits plus capacity reservation.
- Azure VM quotas are tiered (total regional vCPUs + VM-family cores). If either is exceeded, deployment fails.
- Azure capacity reservation reserves compute capacity in a region or AZ for any duration.
- ND H100 v5 starts at 8× H100 GPUs per VM (Azure docs).
## Kubernetes GPU scheduling is device-plugin driven
The scheduler can’t place what the node doesn’t advertise.
Kubernetes exposes GPUs through device plugins; workloads request resources like nvidia.com/gpu, and the scheduler places Pods on nodes with allocatable capacity.
### Full GPU request
This is the baseline contract.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["bash", "-lc", "nvidia-smi && sleep 3600"]
    resources:
      limits:
        nvidia.com/gpu: 1
```
### MIG request
You request a profile resource, not “a GPU.”
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-infer
spec:
  containers:
  - name: infer
    image: your-infer-image
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1
```
MIG is explicitly described as partitioning supported GPUs into isolated instances with dedicated compute/memory.
### Time-slicing (oversubscription)
This raises utilization and raises blast radius.
```yaml
version: v1
sharing:
  timeSlicing:
    renameByDefault: true
    resources:
    - name: nvidia.com/gpu
      replicas: 10
```
Time-slicing is documented as oversubscription where workloads interleave on the same GPU.
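Note the `renameByDefault: true` above: the device plugin then advertises shared replicas under a distinct resource name, so Pods opt in to sharing explicitly instead of receiving a sliced GPU by surprise:

```yaml
resources:
  limits:
    nvidia.com/gpu.shared: 1   # a time-sliced replica, not a dedicated GPU
```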
### Installing the NVIDIA stack
You need one canonical way to install drivers + device plugin + toolkit + monitoring.
The NVIDIA GPU Operator automates driver + device plugin + container toolkit + labeling + DCGM monitoring components.
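The canonical install is a single Helm release; the commands below follow NVIDIA's published chart location (a reachable cluster and Helm are assumed):

```shell
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace
```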
## Queue GPUs at the job layer or you’ll drown in Pending Pods
Pods Pending is not a scheduling policy.
Kueue manages quotas and decides when a job waits, when it’s admitted (Pods can be created), and when it’s preempted.
Practical pattern:
- One shared cluster.
- Separate GPU node pools by workload class.
- Kueue ClusterQueues per tenant/team.
- Admission control before Pods exist.
GKE publishes a multi-tenant Kueue tutorial that matches this model.
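The ClusterQueue-per-team pattern can be sketched with Kueue's v1beta1 API; the flavor and team names below are illustrative:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a100            # hypothetical flavor mapped to your GPU node pool
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a
spec:
  namespaceSelector: {}  # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: a100
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 8   # team-a's GPU budget
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue
  namespace: team-a
spec:
  clusterQueue: team-a
```

Jobs submit with the `kueue.x-k8s.io/queue-name: team-a-queue` label, and Kueue holds them until quota admits them.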
## Reference architecture for multi-tenant AI workloads
This is a setup you can run for months without babysitting.
### Split pools by workload class
This prevents MIG profile churn from breaking training placement.
| Pool | GPU strategy | Controls | Notes |
|---|---|---|---|
| Training | Full GPUs | Kueue + quotas | Avoids MIG fragmentation |
| Inference | MIG | Kueue or HPA | Predictable slices |
| Dev/Test | Time-slicing | Loose quotas | Accept noisy neighbors |
### Use taints/tolerations to keep workloads honest
This stops inference from landing on training nodes “because it fit.”
```yaml
# training nodes
spec:
  taints:
  - key: gpu-class
    value: training
    effect: NoSchedule
```

```yaml
# training pods
spec:
  tolerations:
  - key: gpu-class
    operator: Equal
    value: training
    effect: NoSchedule
```
## Where AceCloud.ai fits
You keep the same scheduling stack. You change where the capacity comes from.
AceCloud publishes:
- Cloud GPU instances with H100/A100/L40S listed as available options.
- Managed Kubernetes GPU clusters as a first-class service offering.
- Spot GPU pricing pages with per-hour rates and “saving” percentages by SKU (example: L40S in Mumbai).
- Managed control plane claims, including a stated 99.99% uptime SLA on the managed control plane page.
### How teams wire AceCloud into multi-tenant scheduling
This is the boring path. It works.
- Deploy a managed Kubernetes cluster plus GPU node pools (training vs inference).
- Install NVIDIA GPU Operator once per cluster.
- Enable MIG on inference pools; keep training pools full GPU.
- Add Kueue for tenant quotas and job admission.
- Use spot GPUs for interruptible inference/batch where it fits your SLOs.
## Ops checklist for debugging “we can’t get GPUs”
This is what you run before you open a ticket.
### 1) Confirm the provider gate that blocked you
Quota errors and capacity errors look similar in dashboards. They aren’t.
- AWS: instance type quotas by purchasing option; capacity reservations / capacity blocks.
- GCP: reservations validate capacity at creation time.
- Azure: vCPU quota tiers plus capacity reservation.
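Those error strings can be triaged mechanically. A rough sketch in shell — the patterns cover a few known error codes (AWS `VcpuLimitExceeded` / `InsufficientInstanceCapacity`, GCP `ZONE_RESOURCE_POOL_EXHAUSTED`) and are illustrative, not exhaustive:

```shell
# Map a provider error string to the gate that rejected you.
classify() {
  case "$1" in
    *[Qq]uota*|*LimitExceeded*)                               echo "quota" ;;
    *[Ii]nsufficient*[Cc]apacity*|*RESOURCE_POOL_EXHAUSTED*)  echo "capacity" ;;
    *)                                                        echo "unknown" ;;
  esac
}

classify "VcpuLimitExceeded: You have requested more vCPU than your current limit"  # -> quota
classify "InsufficientInstanceCapacity: We currently do not have enough capacity"   # -> capacity
```

Routing matters because the fix differs: quota errors go to a limit-increase request, capacity errors go to reservations or another zone.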
### 2) Confirm Kubernetes sees allocatable GPUs
If the node doesn’t advertise it, the scheduler can’t place it.
```shell
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
kubectl describe node <node> | grep -E "nvidia.com/gpu|nvidia.com/mig"
```
Device plugin plumbing is the core dependency here.
### 3) Confirm you didn’t fragment MIG profiles
MIG failures are often self-inflicted.
If your inference pool is carved into small profiles, large-profile jobs won’t place until you reconfigure the GPU. MIG profiles are fixed.
## Conclusion
Cloud providers allocate GPUs through quota + capacity + placement + partitioning before Kubernetes schedules anything. Multi-tenant reliability comes from picking the right sharing primitive: full GPU for training, MIG for isolated inference, time-slicing only for best-effort. Add Kueue so jobs queue instead of stalling Pods. Use AceCloud when capacity gating is your bottleneck, while keeping the same Kubernetes + NVIDIA Operator model.