KAI Scheduler
An open-source Kubernetes scheduler from NVIDIA.
Features
- Batch Scheduling
  - Bin Packing: minimizes the number of nodes used (less fragmentation)
  - Spread Scheduling: maximizes the number of nodes used (high availability, load balancing)
  - Workload Priority
  - Hierarchical Queues: a two-level queue hierarchy (parent and child); see the Queue example after this list
- Fairness
  - Dominant Resource Fairness (DRF) scheduling with quota enforcement and reclaim across queues
- Elastic Workloads
  - Dynamically scale workloads within defined minimum and maximum pod counts
- DRA (Dynamic Resource Allocation)
  - Multi-vendor support (NVIDIA, AMD, ...)
  - GPU Sharing: share a single GPU or multiple GPUs, maximizing resource utilization
- Cloud & On-premise
  - Supports cluster autoscalers such as Karpenter
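Hierarchical queues and quotas are configured through a Queue custom resource. Below is a minimal sketch based on my reading of the KAI Scheduler quickstart; the scheduling.run.ai/v2 API group, the parentQueue field, and the -1 values (no fixed quota/limit, as I understand them) are assumptions that may vary by version. The quota, limit, and overQuotaWeight fields are the inputs to quota enforcement and DRF-based reclaim.
# queues.yaml (illustrative file name, based on the quickstart)
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: default             # parent (top-level) queue
spec:
  resources:
    gpu:
      quota: -1              # -1: no fixed quota (assumption)
      limit: -1
      overQuotaWeight: 1
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: test                 # child queue referenced by kai.scheduler/queue: test below
spec:
  parentQueue: default       # two-level hierarchy: parent "default", child "test"
  resources:
    gpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    cpu:
      quota: -1
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: -1
      limit: -1
      overQuotaWeight: 1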
Concepts and Principles
The scheduler's primary responsibility is to allocate workloads to the most suitable node or nodes based on:
- resource requirements
- fairness
- quota management
Workloads (Pod:Node ratios)
- 1:1 = Pod:Node: a single pod running on an individual node
- N:1 = Pod:Node: multiple pods running on the same node
- N:M = Pod:Node: a distributed workload, multiple pods spread across multiple nodes (see the sketch after this list)
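As a sketch of the N:M (distributed) case, any standard multi-pod controller becomes a KAI workload once its pod template names the scheduler and a queue, using the same schedulerName and kai.scheduler/queue label as the pod examples further down. The Deployment below is purely illustrative; the name, replica count, image, and full-GPU request are my own choices.
# distributed-workload.yaml (illustrative)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: distributed-demo
spec:
  replicas: 3                          # N pods for KAI to place across suitable nodes
  selector:
    matchLabels:
      app: distributed-demo
  template:
    metadata:
      labels:
        app: distributed-demo
        kai.scheduler/queue: test      # submit the pods to the "test" queue
    spec:
      schedulerName: kai-scheduler     # hand scheduling decisions to KAI
      containers:
        - name: worker
          image: ubuntu
          args: ["sleep", "infinity"]
          resources:
            requests:
              nvidia.com/gpu: "1"      # one full GPU per pod via standard resources.requests
            limits:
              nvidia.com/gpu: "1"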
GPU sharing
Allocating a GPU device to multiple pods by either:
- requesting an amount of GPU memory (e.g. 2000 MiB), or
- requesting a fraction of a GPU device's memory (e.g. 0.5)
# gpu-memory.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-sharing
  labels:
    kai.scheduler/queue: test
  annotations:
    gpu-memory: "2000"   # in MiB
spec:
  schedulerName: kai-scheduler
  containers:
    - name: ubuntu
      image: ubuntu
      args: ["sleep", "infinity"]
# gpu-sharing.yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-sharing
  labels:
    kai.scheduler/queue: test
  annotations:
    gpu-fraction: "0.5"   # half of the GPU device's memory
spec:
  schedulerName: kai-scheduler
  containers:
    - name: ubuntu
      image: ubuntu
      args: ["sleep", "infinity"]
GPU sharing autoscaling
Problem: GPU requests in annotations -> Cluster Autoscaler can't detect them
Solution: node-scale-adjuster
- Watches unschedulable GPU-sharing pods
- Launches a utility pod that requests a full GPU → this triggers the autoscaler
The utility pod is needed because the Kubernetes Cluster Autoscaler cannot detect GPU-sharing pods on its own.
Why can't the Cluster Autoscaler detect GPU-sharing pods?
Because it only looks at resources declared in the pod spec (resources.requests), while GPU-sharing pods place their GPU requests in annotations rather than in the standard resources field. Since the autoscaler ignores annotations, it sees those pods as having no GPU demand, so it doesn't trigger scaling.
To solve this, the node-scale-adjuster launches a temporary utility pod that requests a full GPU in its spec. This makes the autoscaler recognize a resource shortage and triggers it to scale up the node pool.
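As an illustration only (this is not the actual manifest the node-scale-adjuster generates; the name and image are made up), such a placeholder pod would carry its GPU demand in resources.requests, which is exactly what the autoscaler inspects:
# utility-pod-sketch.yaml (hypothetical illustration)
apiVersion: v1
kind: Pod
metadata:
  name: gpu-scale-placeholder
spec:
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
      resources:
        requests:
          nvidia.com/gpu: "1"   # a full GPU, declared where the autoscaler can see it
        limits:
          nvidia.com/gpu: "1"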
💡 I think this workaround using utility pods is problematic, since it introduces scheduling noise and inaccurate scaling. A better solution would be to expose GPU sharing as a proper custom resource and use an operator to manage and report GPU demand declaratively to the autoscaler.
Calculation
- Sums GPU fractions (e.g., 2 pods × 0.5 GPU -> 1 utility pod)
- If the new node fits only one of them -> spawns an additional utility pod
GPU Memory Requests
- Assumes 0.1 GPU per pod
- i.e., creates 1 utility pod per 10 memory-based pods
- Configurable via --gpu-memory-to-fraction-ratio
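A quick worked example, assuming the ratio is the estimated GPU fraction per memory-based pod: with the default of 0.1, 20 pending memory-based pods count as 20 × 0.1 = 2 GPUs of demand, so two full-GPU utility pods are created; raising the ratio to 0.2 would double the estimate to 4 GPUs for the same pods.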
# How to enable: pass this value when installing the KAI Scheduler Helm chart
--set "global.clusterAutoscaling=true"