KAI Scheduler
An open-source Kubernetes scheduler from NVIDIA.
Features
- Batch Scheduling
  - Bin Packing: minimizes the number of nodes used (less fragmentation)
  - Spread Scheduling: maximizes the number of nodes used (high availability, load balancing)
  - Workload Priority
  - Hierarchical Queues: a two-level queue hierarchy (parent and child); see the Queue example after this list
- Fairness
  - Dominant Resource Fairness (DRF) scheduling with quota enforcement and reclaim across queues
- Elastic Workloads
  - Dynamically scale workloads within defined minimum and maximum pod counts
- DRA (Dynamic Resource Allocation)
  - Multi-vendor support (NVIDIA, AMD, ...)
  - GPU Sharing: share a single GPU or multiple GPUs, maximizing resource utilization
- Cloud & On-premise
  - Supports cluster autoscalers such as Karpenter
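Hierarchical queues and quotas are configured through a Queue custom resource. Below is a minimal sketch based on my reading of the KAI Scheduler quickstart; the scheduling.run.ai/v2 API group, the parentQueue field, and the -1 values (no fixed quota/limit, as I understand them) are assumptions that may vary by version. The quota, limit, and overQuotaWeight fields are the inputs to quota enforcement and DRF-based reclaim.
# queues.yaml (illustrative file name, based on the quickstart)
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: default             # parent (top-level) queue
spec:
  resources:
    gpu:
      quota: -1              # -1: no fixed quota (assumption)
      limit: -1
      overQuotaWeight: 1
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: test                 # child queue referenced by kai.scheduler/queue: test below
spec:
  parentQueue: default       # two-level hierarchy: parent "default", child "test"
  resources:
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1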
Concepts and Principles
The scheduler's primary responsibility is to allocate workloads to the most suitable node or nodes based on:
- resource requirements
- fairness
- quota management
Workloads (Pod:Node ratios)
- 1:1 = Pod:Node: a single pod running on an individual node
- N:1 = Pod:Node: multiple pods running on the same node
- N:M = Pod:Node: a distributed workload, multiple pods spread across multiple nodes (see the sketch after this list)
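As a sketch of the N:M (distributed) case, any standard multi-pod controller becomes a KAI workload once its pod template names the scheduler and a queue, using the same schedulerName and kai.scheduler/queue label as the pod examples further down. The Deployment below is purely illustrative; the name, replica count, image, and full-GPU request are my own choices.
# distributed-workload.yaml (illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: distributed-demo
spec:
  replicas: 3                          # N pods for KAI to place across suitable nodes
  selector:
    matchLabels:
      app: distributed-demo
  template:
    metadata:
      labels:
        app: distributed-demo
        kai.scheduler/queue: test      # submit the pods to the "test" queue
    spec:
      schedulerName: kai-scheduler     # hand scheduling decisions to KAI
      containers:
        - name: worker
          image: ubuntu
          args: ["sleep", "infinity"]
          resources:
            requests:
              nvidia.com/gpu: "1"      # one full GPU per pod via standard resources.requests
            limits:
              nvidia.com/gpu: "1"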
GPU sharing
Allocating a GPU device to multiple pods by either:
- requesting an amount of GPU memory (e.g. 2000 MiB), or
- requesting a fraction of a GPU device's memory (e.g. 0.5)
# gpu-memory.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-sharing
  labels:
    kai.scheduler/queue: test
  annotations:
    gpu-memory: "2000"   # in MiB
spec:
  schedulerName: kai-scheduler
  containers:
    - name: ubuntu
      image: ubuntu
      args: ["sleep", "infinity"]
# gpu-sharing.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-sharing
  labels:
    kai.scheduler/queue: test
  annotations:
    gpu-fraction: "0.5"   # half of the GPU device's memory
spec:
  schedulerName: kai-scheduler
  containers:
    - name: ubuntu
      image: ubuntu
      args: ["sleep", "infinity"]
GPU sharing autoscaling
Problem: GPU requests in annotations -> Cluster Autoscaler can't detect them
Solution: node-scale-adjuster
- Watches unschedulable GPU-sharing pods
- Launches a utility pod that requests a full GPU → this triggers the autoscaler
The utility pod is needed because the Kubernetes Cluster Autoscaler cannot detect GPU-sharing pods on its own.
Why can't the Cluster Autoscaler detect GPU-sharing pods?
Because it only looks at resources declared in the pod spec (resources.requests), while GPU-sharing pods place their GPU requests in annotations rather than in the standard resources field. Since the autoscaler ignores annotations, it sees those pods as having no GPU demand, so it doesn't trigger scaling.
To solve this, the node-scale-adjuster launches a temporary utility pod that requests a full GPU in its spec. This makes the autoscaler recognize a resource shortage and triggers it to scale up the node pool.
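As an illustration only (this is not the actual manifest the node-scale-adjuster generates; the name and image are made up), such a placeholder pod would carry its GPU demand in resources.requests, which is exactly what the autoscaler inspects:
# utility-pod-sketch.yaml (hypothetical illustration)
apiVersion: v1
kind: Pod
metadata:
  name: gpu-scale-placeholder
spec:
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
      resources:
        requests:
          nvidia.com/gpu: "1"   # a full GPU, declared where the autoscaler can see it
        limits:
          nvidia.com/gpu: "1"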
💡 I think this workaround using utility pods is problematic, since it introduces scheduling noise and inaccurate scaling. A better solution would be to expose GPU sharing as a proper custom resource and use an operator to manage and report GPU demand declaratively to the autoscaler.
Calculation
- Sums GPU fractions (e.g., 2 pods × 0.5 GPU -> 1 utility pod)
- If the new node fits only one of them -> spawns an additional utility pod
GPU Memory Requests
- Assumes 0.1 GPU per pod
- i.e., creates 1 utility pod per 10 memory-based pods
- Configurable via --gpu-memory-to-fraction-ratio
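A quick worked example, assuming the ratio is the estimated GPU fraction per memory-based pod: with the default of 0.1, 20 pending memory-based pods count as 20 × 0.1 = 2 GPUs of demand, so two full-GPU utility pods are created; raising the ratio to 0.2 would double the estimate to 4 GPUs for the same pods.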
# How to enable: pass this value when installing the KAI Scheduler Helm chart
--set "global.clusterAutoscaling=true"