Practical LLM Inference Scheduling on Kubernetes

#programming #webdev

---
title: "LLM Inference on Kubernetes: Cut GPU Costs 70% with Priority Queues and MPS Time-Slicing"
published: true
description: "A practical workshop on combining Kubernetes device plugins, NVIDIA MPS time-slicing, and a custom priority queue to reduce self-hosted LLM inference costs by up to 70%."
tags: kubernetes, cloud, devops, architecture
canonical_url: https://blog.mvpfactory.co/llm-inference-kubernetes-cut-gpu-costs-70
---

## What We Are Building

By the end of this tutorial, you will have a three-layer scheduling architecture for mixed-priority LLM inference on Kubernetes. We will wire up NVIDIA MPS for GPU time-slicing, configure PriorityClasses for pod-level preemption, and design an application-level priority queue that keeps real-time requests fast while batch jobs soak up every idle GPU cycle.

This is the resource architecture that took our GPU serving costs from ~$52,000/month down to ~$16,000 on an 8x A100 cluster. Let me show you how each layer works.

## Prerequisites

- A Kubernetes cluster with NVIDIA GPU nodes (A100s or equivalent)
- The NVIDIA device plugin for Kubernetes installed
- Familiarity with Kubernetes scheduling concepts (PriorityClasses, resource requests)
- A workload mix of real-time and batch inference requests

## Step 1: Enable NVIDIA MPS for GPU Time-Slicing

Here is the minimal setup to get this working. Most teams are running at 20% GPU utilization on dedicated nodes. NVIDIA's Multi-Process Service lets multiple pods share a single GPU with actual compute partitioning — not just memory splitting.

Apply this ConfigMap to your NVIDIA device plugin:

yaml

nvidia-device-plugin ConfigMap

version: v1
sharing:
timeSlicing:
renameByDefault: false
resources:
- name: nvidia.com/gpu
replicas: 4 # 4 virtual GPUs per physical GPU


This gives you 4 schedulable GPU slices per physical device. Each slice gets fair-share compute access, and MPS handles context switching at the hardware level — far more efficient than container-level time-sharing. A single ConfigMap change can double or triple your effective capacity.

## Step 2: Define Priority Classes and Preemption

Now we teach Kubernetes which workloads matter most. Define PriorityClasses that map to your workload tiers:

yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: realtime-inference
value: 1000000
preemptionPolicy: PreemptLowerPriority
globalDefault: false

description: "User-facing real-time LLM requests"

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: batch-inference
value: 100
preemptionPolicy: Never
globalDefault: false
description: "Background summarization, embeddings, batch jobs"


When a real-time inference pod needs GPU resources and the node is full, Kubernetes evicts batch pods automatically. Your batch pods need to be idempotent and restart-safe — they pick up where they left off via checkpointed job queues.

Let me show you a pattern I use in every project: set `preemptionPolicy: Never` on batch workloads. This means batch pods will never evict *other* batch pods, keeping your lower tiers stable among themselves.

## Step 3: Build the Application-Level Priority Queue

Here is the gotcha that will save you hours: the Kubernetes scheduler alone is not enough. Pod scheduling operates on minutes-scale granularity. Request-level prioritization needs millisecond decisions. Those are different problems, and I have watched teams burn weeks trying to force one layer to do both jobs.

You need a lightweight service sitting in front of your inference servers that:

1. Accepts inference requests tagged with priority (`P0` real-time, `P1` near-real-time, `P2` batch)
2. Routes P0 requests to a reserved capacity pool (guaranteed 30% of GPU slices)
3. Allows P1/P2 to fill remaining capacity with preemption semantics
4. Tracks per-tenant quotas via Redis-backed counters

The result: P0 latency stays under 200ms at P99, while batch throughput fills every idle GPU cycle.

## The Cost Model: Know Your Crossover Point

Here is why this matters. At moderate scale — roughly 2M–10M inference requests per month — the numbers look like this:

| Monthly Requests | API Cost (est.) | Self-Hosted (this arch) | Savings |
|---|---|---|---|
| 1M | $6,800 | $16,000 | -$9,200 (API wins) |
| 3M | $20,400 | $16,000 | $4,400 |
| 5M | $34,000 | $17,500 | $16,500 |
| 10M | $68,000 | $21,000 | $47,000 (69%) |

Infrastructure cost scales sub-linearly because GPU utilization increases with request volume. That is the whole point of the architecture. Self-hosted breaks even at around the 3M request mark.

## Gotchas

**Do not skip the cost modeling step.** Self-hosted inference only wins at moderate scale. Below 3M requests/month, API calls are cheaper. Run a week of production traffic logs through a cost simulator with your actual token distributions and latency requirements — not a back-of-napkin guess.

**Separate scheduling by timescale.** Use Kubernetes PriorityClasses for pod-level preemption (seconds to minutes) and the application-level queue for request-level routing (milliseconds). The docs do not mention this, but neither layer alone is sufficient.

**MPS replicas are not infinite.** Setting `replicas: 4` is a practical sweet spot. Going higher fragments GPU memory and increases context-switch overhead. Profile your specific model's memory footprint before tuning this number.

**Batch pods must be restart-safe.** Preemption means your batch jobs *will* get killed. If they cannot checkpoint and resume, you will lose work. Design for this from day one.

## Conclusion

The GPU cost problem in AI serving is real, but it is an architecture problem, not a hardware problem. Enable MPS time-slicing before you buy more nodes. Layer in PriorityClasses for coarse-grained preemption, then add an application-level queue for fine-grained request routing. Schedule smarter before you spend bigger.

**Further reading:**
- [NVIDIA MPS Documentation](https://docs.nvidia.com/deploy/mps/)
- [Kubernetes Priority and Preemption](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/)
- [NVIDIA GPU Operator for Kubernetes](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/overview.html)

DEV Community

Practical LLM Inference Scheduling on Kubernetes

nvidia-device-plugin ConfigMap

description: "User-facing real-time LLM requests"

Top comments (0)