Cloud GPU “scheduling” is a chain of gates: quota decides if you’re allowed to ask, capacity reservations decide if GPUs exist in a zone, placement bins you onto physical hosts, and partitioning decides how many tenants share silicon (full GPU, MIG, time-slicing, vGPU).
Kubernetes sits at the end and consumes whatever the platform exposes.
## GPU allocation is a pipeline, not one scheduler
If you can’t name the layer that said “no,” you can’t fix it.
```
Request (API / YAML)
  -> Account quota (by region + purchasing option)
  -> Zonal capacity (reservation / capacity block / spot pool)
  -> Placement (host + network topology)
  -> Partitioning (full GPU | MIG | time-slicing | vGPU)
  -> Cluster scheduler (Kubernetes / Slurm)
  -> Kubelet + device plugin exposes devices to containers
```
## What each layer controls
This table routes incidents to the right team fast.
| Layer | Who controls it | What failure looks like | What to check |
|---|---|---|---|
| Quota | Provider control plane | "quota exceeded" / "limit exceeded" | Instance type quotas + service quotas docs |
| Capacity | Provider control plane | "insufficient capacity" / stuck provisioning | Capacity Reservations / Capacity Blocks |
| Placement | Provider control plane | GPUs launch, topology is wrong | Placement groups / cluster strategy |
| Partitioning | Provider + NVIDIA stack | Noisy neighbors / unfair sharing | MIG vs time-slicing vs vGPU |
| Cluster scheduling | You | Pods Pending | GPU resources + device plugin plumbing |
## Partitioning decides multi-tenancy behavior
This is where you pick isolation vs utilization.
| Mode | What it does | Isolation | What breaks first | Best fit |
|---|---|---|---|---|
| Full GPU | One workload per GPU | High | Utilization | Training, big batch |
| MIG | Hardware partitions with dedicated compute/memory | High | Fragmentation by profile | Inference, fine-tuning with QoS |
| Time-slicing | Oversubscribe GPUs; workloads interleave | Low | Noisy neighbor | Burst inference, dev/test |
| vGPU | Virtual GPU slices via hypervisor stack | Medium–High | Licensing + ops | Shared VM fleets, VDI, controlled slices |
Two blunt truths:
- Time-slicing is not memory/fault isolation. It’s interleaving.
- MIG is real partitioning with fixed profiles. That creates profile fragmentation if you don’t standardize.
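Standardizing can be as simple as allowing one profile layout per node pool. A sketch in the `mig-parted` config format the GPU Operator consumes — the config name here is made up, and the profile count assumes 80GB A100/H100 parts:

```yaml
version: v1
mig-configs:
  # Hypothetical single-profile layout: every GPU is carved identically,
  # so any 1g.10gb workload fits any node in the pool.
  inference-standard:
    - devices: all
      mig-enabled: true
      mig-devices:
        "1g.10gb": 7   # max count of 1g slices on an 80GB A100/H100
```

Node pools then select a layout via the `nvidia.com/mig.config` node label instead of ad-hoc per-node reconfiguration.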
## How major cloud providers gate GPU allocation
You can’t “autoscale GPUs” if the provider won’t hand you any.
### Amazon Web Services
Quota and capacity are separate checks.
- Instance type quotas are grouped by purchasing option (On-Demand, Spot, Dedicated, Capacity Blocks).
- Capacity Reservations reserve compute capacity in a specific AZ.
- Capacity Blocks for ML reserve GPU instances for a future time window for short-duration ML workloads.
### Google Cloud
Reservations are zonal, and they validate capacity up front.
- When you create a reservation, Compute Engine verifies capacity in the specified zone, then reserves it.
- GPU machine types include A3 variants backed by H100 SKUs (A3 High/Mega/Edge).
- Reservations can include optional resources such as GPUs, guaranteeing they're available when you need them.
### Microsoft Azure
Quota often shows up as vCPU-family limits plus capacity reservation.
- Azure VM quotas are tiered (total regional vCPUs + VM-family cores). If either is exceeded, deployment fails.
- Azure capacity reservation reserves compute capacity in a region or AZ for any duration.
- ND H100 v5 starts at 8× H100 GPUs per VM (Azure docs).
## Kubernetes GPU scheduling is device-plugin driven
The scheduler can’t place what the node doesn’t advertise.
Kubernetes exposes GPUs through device plugins; workloads request resources like nvidia.com/gpu, and the scheduler places Pods on nodes with allocatable capacity.
### Full GPU request
This is the baseline contract.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["bash", "-lc", "nvidia-smi && sleep 3600"]
    resources:
      limits:
        nvidia.com/gpu: 1
```
### MIG request
You request a profile resource, not “a GPU.”
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mig-infer
spec:
  containers:
  - name: infer
    image: your-infer-image
    resources:
      limits:
        nvidia.com/mig-1g.10gb: 1
```
MIG is explicitly described as partitioning supported GPUs into isolated instances with dedicated compute/memory.
### Time-slicing (oversubscription)
This raises utilization and raises blast radius.
```yaml
version: v1
sharing:
  timeSlicing:
    renameByDefault: true
    resources:
    - name: nvidia.com/gpu
      replicas: 10
```
Time-slicing is documented as oversubscription where workloads interleave on the same GPU.
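Note the `renameByDefault: true` above: the device plugin then advertises shared replicas under a distinct resource name, so Pods opt in to sharing explicitly instead of receiving a sliced GPU by surprise:

```yaml
resources:
  limits:
    nvidia.com/gpu.shared: 1   # a time-sliced replica, not a dedicated GPU
```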
### Installing the NVIDIA stack
You need one canonical way to install drivers + device plugin + toolkit + monitoring.
The NVIDIA GPU Operator automates driver + device plugin + container toolkit + labeling + DCGM monitoring components.
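The canonical install is a single Helm release; the commands below follow NVIDIA's published chart location (a reachable cluster and Helm are assumed):

```shell
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace
```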
## Queue GPUs at the job layer or you’ll drown in Pending Pods
Pods Pending is not a scheduling policy.
Kueue manages quotas and decides when a job waits, when it’s admitted (Pods can be created), and when it’s preempted.
Practical pattern:
- One shared cluster.
- Separate GPU node pools by workload class.
- Kueue ClusterQueues per tenant/team.
- Admission control before Pods exist.
GKE publishes a multi-tenant Kueue tutorial that matches this model.
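The ClusterQueue-per-team pattern can be sketched with Kueue's v1beta1 API; the flavor and team names below are illustrative:

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a100            # hypothetical flavor mapped to your GPU node pool
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a
spec:
  namespaceSelector: {}  # admit workloads from any namespace
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: a100
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 8   # team-a's GPU budget
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: team-a-queue
  namespace: team-a
spec:
  clusterQueue: team-a
```

Jobs submit with the `kueue.x-k8s.io/queue-name: team-a-queue` label, and Kueue holds them until quota admits them.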
## Reference architecture for multi-tenant AI workloads
This is a setup you can run for months without babysitting.
### Split pools by workload class
This prevents MIG profile churn from breaking training placement.
| Pool | GPU strategy | Controls | Notes |
|---|---|---|---|
| Training | Full GPUs | Kueue + quotas | Avoids MIG fragmentation |
| Inference | MIG | Kueue or HPA | Predictable slices |
| Dev/Test | Time-slicing | Loose quotas | Accept noisy neighbors |
### Use taints/tolerations to keep workloads honest
This stops inference from landing on training nodes “because it fit.”
```yaml
# training nodes
spec:
  taints:
  - key: gpu-class
    value: training
    effect: NoSchedule
```

```yaml
# training pods
spec:
  tolerations:
  - key: gpu-class
    operator: Equal
    value: training
    effect: NoSchedule
```
## Where AceCloud.ai fits
You keep the same scheduling stack. You change where the capacity comes from.
AceCloud publishes:
- Cloud GPU instances with H100/A100/L40S listed as available options.
- Managed Kubernetes GPU clusters as a first-class service offering.
- Spot GPU pricing pages with per-hour rates and “saving” percentages by SKU (example: L40S in Mumbai).
- Managed control plane claims, including a stated 99.99% uptime SLA on the managed control plane page.
### How teams wire AceCloud into multi-tenant scheduling
This is the boring path. It works.
- Deploy a managed Kubernetes cluster plus GPU node pools (training vs inference).
- Install NVIDIA GPU Operator once per cluster.
- Enable MIG on inference pools; keep training pools full GPU.
- Add Kueue for tenant quotas and job admission.
- Use spot GPUs for interruptible inference/batch where it fits your SLOs.
## Ops checklist for debugging “we can’t get GPUs”
This is what you run before you open a ticket.
### 1) Confirm the provider gate that blocked you
Quota errors and capacity errors look similar in dashboards. They aren’t.
- AWS: instance type quotas by purchasing option; capacity reservations / capacity blocks.
- GCP: reservations validate capacity at creation time.
- Azure: vCPU quota tiers plus capacity reservation.
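Those error strings can be triaged mechanically. A rough sketch in shell — the patterns cover a few known error codes (AWS `VcpuLimitExceeded` / `InsufficientInstanceCapacity`, GCP `ZONE_RESOURCE_POOL_EXHAUSTED`) and are illustrative, not exhaustive:

```shell
# Map a provider error string to the gate that rejected you.
classify() {
  case "$1" in
    *[Qq]uota*|*LimitExceeded*)                               echo "quota" ;;
    *[Ii]nsufficient*[Cc]apacity*|*RESOURCE_POOL_EXHAUSTED*)  echo "capacity" ;;
    *)                                                        echo "unknown" ;;
  esac
}

classify "VcpuLimitExceeded: You have requested more vCPU than your current limit"  # -> quota
classify "InsufficientInstanceCapacity: We currently do not have enough capacity"   # -> capacity
```

Routing matters because the fix differs: quota errors go to a limit-increase request, capacity errors go to reservations or another zone.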
### 2) Confirm Kubernetes sees allocatable GPUs
If the node doesn’t advertise it, the scheduler can’t place it.
```shell
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
kubectl describe node <node> | grep -E "nvidia.com/gpu|nvidia.com/mig"
```
Device plugin plumbing is the core dependency here.
### 3) Confirm you didn’t fragment MIG profiles
MIG failures are often self-inflicted.
If your inference pool is carved into small profiles, large-profile jobs won’t place until you reconfigure the GPU. MIG profiles are fixed.
## Conclusion
Cloud providers allocate GPUs through quota + capacity + placement + partitioning before Kubernetes schedules anything. Multi-tenant reliability comes from picking the right sharing primitive: full GPU for training, MIG for isolated inference, time-slicing only for best-effort. Add Kueue so jobs queue instead of stalling Pods. Use AceCloud when capacity gating is your bottleneck, while keeping the same Kubernetes + NVIDIA Operator model.