DEV Community

SoftwareDevs mvpfactory.io
Posted on • Originally published at mvpfactory.io

Kubernetes Pod Scheduling for GPU-Accelerated ML Inference

---
title: "K8s GPU Scheduling: Cut ML Inference P99 Latency 40%"
published: true
description: "Topology-aware GPU scheduling, fractional sharing with MIG, and affinity rules that cut ML inference P99 latency by 40%, no new hardware required."
tags: kubernetes, cloud, architecture, devops
canonical_url: https://blog.mvpfactory.co/k8s-gpu-scheduling-cut-ml-inference-p99-latency-40
---

## What We Are Building

By the end of this walkthrough, you will have a Kubernetes scheduling stack that aligns GPU pods to the correct NUMA node, splits GPUs into hardware-isolated fractions with MIG, spreads replicas across failure domains, and protects serving pods from batch training eviction. We took P99 inference latency from 23.8ms down to 14.3ms on a 48-GPU mixed A100/H100 cluster — without adding a single GPU.

Let me show you exactly how the pieces fit together.

## Prerequisites

- A Kubernetes 1.36+ cluster with NVIDIA GPU nodes (A100 or H100)
- NVIDIA device plugin deployed with MIG support
- Familiarity with kubelet configuration and pod scheduling basics

## Step 1: Understand the Problem — Cross-NUMA Memory Access

Here is the gotcha that will save you hours of debugging mysterious P99 spikes. When a pod's CPUs and memory land on NUMA node 0 but its GPU sits behind NUMA node 1, every host-to-device tensor transfer crosses the inter-socket interconnect. The docs do not call this out, but in our experience it is the most overlooked latency source in production inference.

| Placement | Avg Latency (ms) | P99 Latency (ms) | Throughput (req/s) |
|---|---|---|---|
| NUMA-aligned | 8.2 | 14.1 | 1,240 |
| NUMA-misaligned | 10.7 | 23.8 | 940 |
| **Delta** | **+30%** | **+69%** | **-24%** |

*Benchmarks: BERT-large inference, batch size 1, A100 80GB, Kubernetes 1.36, Ubuntu 22.04*

A 69% P99 penalty compounds across a fleet. That is the gap between hitting your SLA and fielding pages at 2am.

## Step 2: Enable Topology-Aware Scheduling

For latency-sensitive serving, always use `single-numa-node`. Here is the minimal setup to get this working in your kubelet config:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerPolicy: single-numa-node
topologyManagerScope: pod
```

Set `topologyManagerScope: pod` — not `container` — so the entire pod's resources align. This matters when sidecars share the pod spec. The `single-numa-node` policy is strict: pod admission fails if alignment is impossible. That strictness is a feature. It surfaces misalignment at deploy time instead of in production.

For batch training where latency is less critical, `best-effort` works fine.
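The kubelet settings above only align CPUs with the GPU if the CPU manager hands out exclusive cores. Here is a minimal sketch of a fuller config, assuming you also want CPU pinning on serving nodes. The reserved core list is a placeholder for illustration, not something from our cluster:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerPolicy: single-numa-node
topologyManagerScope: pod
# Assumption: you want exclusive CPU pinning as well. The static policy
# only pins CPUs for Guaranteed pods with integer CPU requests.
cpuManagerPolicy: static
# Assumption: cores 0 and 1 are held back for system daemons. The static
# policy needs some CPU reservation; adjust the list to your node layout.
reservedSystemCPUs: "0,1"
```

Changing `cpuManagerPolicy` on a node that is already running typically means draining it and clearing the kubelet's CPU manager state file first, so roll this out node by node.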

## Step 3: Use MIG for Fractional GPU Sharing

Not every inference workload needs a full A100. Let me show you a pattern I use in every project that runs mixed workloads. Teams default to time-slicing because it is simplest, then cannot figure out why tail latencies spike under contention.

| Method | Isolation | Latency Predictability | Memory Guarantee | Setup Complexity |
|---|---|---|---|---|
| Time-slicing | None (context switch) | Poor under contention | None | Low |
| MPS | Partial (shared context) | Moderate | None | Medium |
| MIG | Full (hardware partition) | Excellent | Yes | High |

MIG wins for inference. Each partition is a true hardware slice with its own memory bandwidth, compute units, and L2 cache. On an A100 80GB, you can carve seven `1g.10gb` instances. Request them as extended resources:

```yaml
resources:
  limits:
    nvidia.com/mig-1g.10gb: 1
```

The device plugin API v1beta1 in Kubernetes 1.36 reports these as individual allocatable resources, and the topology manager aligns them correctly to NUMA nodes.
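To see where that `resources` fragment lives, here is a sketch of a complete pod spec. The pod name, image, and CPU and memory sizes are hypothetical. The one part that matters is that requests equal limits with an integer CPU count, which puts the pod in the Guaranteed QoS class so the topology manager can align its CPUs along with the MIG slice:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: bert-inference                # hypothetical name
  labels:
    app: inference-serving
spec:
  containers:
    - name: server
      image: registry.example.com/bert-serving:latest   # hypothetical image
      resources:
        # requests == limits with integer CPU gives Guaranteed QoS
        requests:
          cpu: "4"                    # assumed size, tune for your model
          memory: 16Gi                # assumed size
          nvidia.com/mig-1g.10gb: 1
        limits:
          cpu: "4"
          memory: 16Gi
          nvidia.com/mig-1g.10gb: 1
```

The `nvidia.com/mig-1g.10gb` resource name assumes the device plugin runs with the mixed MIG strategy. With the single strategy, slices are advertised as plain `nvidia.com/gpu`.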

## Step 4: Spread Replicas Across Failure Domains

Once pods are NUMA-aligned, spread inference replicas across zones:

```yaml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: inference-serving
```

Each pod individually gets aligned placement; the fleet distributes across zones.
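The constraint goes in the pod template, not at the top level of the Deployment. Here is a sketch of how that might look in context. The replica count, image, and the second per-node constraint are assumptions added for illustration, not part of our original setup:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-serving
spec:
  replicas: 6                         # assumed replica count
  selector:
    matchLabels:
      app: inference-serving
  template:
    metadata:
      labels:
        app: inference-serving
    spec:
      topologySpreadConstraints:
        # Hard constraint from the step above: balance across zones
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: inference-serving
        # Assumed addition: soft spread across nodes within a zone
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: inference-serving
      containers:
        - name: server
          image: registry.example.com/bert-serving:latest   # hypothetical image
          resources:
            limits:
              nvidia.com/mig-1g.10gb: 1
```

The per-node constraint uses `ScheduleAnyway` so that a full node does not block a rollout. Only the zone spread is treated as a hard requirement.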

## Step 5: Protect Serving Pods with Priority Preemption

In a heterogeneous cluster, define clear preemption rules:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: serving-critical
value: 1000000
preemptionPolicy: PreemptLowerPriority
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch-training
value: 100
preemptionPolicy: Never
```

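Serving pods evict batch jobs when capacity is constrained. Setting `preemptionPolicy: Never` on the batch class means batch pods never preempt anything themselves; they wait in the scheduling queue until capacity frees up, which prevents cascading eviction thrash.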
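A priority class does nothing until a pod references it. Here is a minimal sketch of a serving pod that opts in, with a hypothetical pod name and image:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-serving-example     # hypothetical name
  labels:
    app: inference-serving
spec:
  # Must match the metadata.name of the PriorityClass defined above
  priorityClassName: serving-critical
  containers:
    - name: server
      image: registry.example.com/inference-server:latest   # hypothetical image
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1
```

Pods that omit `priorityClassName` get priority 0 unless a class is marked `globalDefault`, so remember to label the batch workloads with `batch-training` as well.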

## Gotchas

- **Batch jobs squatting on MIG slices.** After preemption restarts, batch jobs can sit on fractional GPU slices indefinitely. Always set reasonable `activeDeadlineSeconds`.
- **Time-slicing for inference is a trap.** When an inference pod shares a time-sliced GPU with another workload, such as a batch job with large allocations, the neighbor can stall your serving workload for entire scheduling quanta. Save time-slicing for dev environments.
- **Container-scoped topology manager.** If you leave `topologyManagerScope` at the default `container` instead of `pod`, sidecars can land on a different NUMA node and break your alignment.
- **Ignoring admission failures.** `single-numa-node` will reject pods that cannot be aligned. This is good — it tells you at deploy time rather than as a mystery P99 spike at 2am.
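For the first gotcha, here is a sketch of a batch Job that caps its own runtime. The job name, image, four-hour deadline, and retry count are all assumptions; pick values that match how long your training runs should take:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-finetune              # hypothetical name
spec:
  # Assumed 4-hour cap: once exceeded, running pods are terminated and
  # the Job is marked Failed with reason DeadlineExceeded
  activeDeadlineSeconds: 14400
  backoffLimit: 2                     # assumed retry budget
  template:
    spec:
      priorityClassName: batch-training
      restartPolicy: Never
      containers:
        - name: trainer
          image: registry.example.com/trainer:latest   # hypothetical image
          resources:
            limits:
              nvidia.com/mig-1g.10gb: 1
```

The deadline is set on the Job spec rather than the pod template, so it covers the whole Job including retries after preemption.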

## Results

After rolling out the full stack across a 48-GPU mixed cluster:

- P99 latency dropped 40% (23.8ms → 14.3ms)
- GPU utilization increased 22% from MIG packing vs whole-GPU allocation
- Zero SLA breaches in 90 days, down from 3–5/month

No new hardware. Just better scheduling. Start with the topology manager policy, layer in MIG, then add spread constraints and priority classes. Each step compounds.