DEV Community: Pawan Kumar

Your First LLM API on Kubernetes: From Model to Curl Request

Pawan Kumar — Thu, 25 Jun 2026 07:44:50 +0000

Series links

Part 1: Everything You Know About Scaling Web Apps Breaks When You Serve an LLM

Part 2: The Request Is the Wrong Unit of Scale for LLMs on Kubernetes

Part 3: How Do You Fit a Trillion-Parameter Model Into a Kubernetes Cluster?

Part 4: Before the Pod Starts: GPU Node Setup for LLMs on Kubernetes

Part 5: OpenAI Already Told Us the Kubernetes Scaling Story, Most People Just Did Not Read It Closely

So far in this series, we have covered the mental model, tokens, model size, GPU node readiness, and OpenAI's Kubernetes scaling lessons.

Now we should run something.

In this part, we will deploy an actual model on a Kubernetes GPU node, expose it as an OpenAI-compatible API, and call it with curl. The model is:

Qwen/Qwen2.5-1.5B-Instruct

That model is small enough for a first single-GPU walkthrough, but still behaves like a real chat model. If your GPU is very small, try Qwen/Qwen2.5-0.5B-Instruct. If you have more memory and want a bigger test, try Qwen/Qwen2.5-7B-Instruct.

Do not start with the biggest model you can name. Start with a model your node can actually load. The goal here is not benchmark glory. The goal is to get from Kubernetes GPU capacity to a working LLM API request.

What vLLM is doing in this setup

Kubernetes is not serving the model by itself. Kubernetes schedules the pod, gives it networking, mounts the Secret, and asks the NVIDIA device plugin for a GPU. After that, the model server inside the container has to do the LLM-specific work.

vLLM is that model server in this walkthrough. It downloads the model weights, loads them into GPU memory, starts an HTTP server, accepts OpenAI-compatible requests, batches work internally, runs the model, and streams or returns generated tokens.

That distinction matters. The Kubernetes Deployment does not magically become an LLM API because it has nvidia.com/gpu: 1. It becomes an LLM API because the container starts a serving engine that knows how to load a Hugging Face model and expose routes like /v1/chat/completions.

vLLM is a good first serving engine because it hides a lot of ugly details without hiding the shape from you. You still see the model name, GPU request, port, token Secret, logs, Service, and curl request. But you do not have to write your own batching loop, tokenizer path, HTTP server, or OpenAI-compatible API wrapper just to prove the deployment works.

vLLM is the engine. The thing we care about is the model API it serves.

Prerequisites

I am assuming you already completed the GPU node setup from Part 4. That means the NVIDIA driver stack, container runtime, GPU Operator or NVIDIA device plugin, labels, and basic GPU checks are already working.

We are not reinstalling the GPU Operator here. Before deploying the model, confirm Kubernetes can see GPU capacity:

kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'

A useful output looks like this:

NAME            GPU
gpu-worker-01   1

If the GPU column is empty, <none>, or missing, stop here. Kubernetes cannot schedule this workload until the node advertises nvidia.com/gpu.
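
If you want the same check inside a script, for example as a gate before applying manifests, a minimal sketch could parse the JSON that kubectl get nodes -o json prints. The function name allocatable_gpus is hypothetical, and the sketch assumes whole GPUs advertised under nvidia.com/gpu.

```python
import json


def allocatable_gpus(nodes_json: str) -> dict[str, int]:
    """Map each node name to its allocatable nvidia.com/gpu count.

    Expects the JSON printed by `kubectl get nodes -o json`.
    Nodes that do not advertise the resource are reported as 0.
    """
    counts = {}
    for node in json.loads(nodes_json)["items"]:
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes serializes extended resource quantities as strings, e.g. "1".
        counts[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return counts
```

Feed it the captured kubectl output. Any node that maps to 0 is the "stop here" case described above.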

Create a Hugging Face token first

Even though Qwen/Qwen2.5-1.5B-Instruct is public, we will still use a Hugging Face token. That is intentional.

Real teams often start with a public model and later swap to a gated model, private model, licensed model, or organization repository. If the token path is already part of the Deployment, that swap is much less annoying.

Create a token first:

Open the official Hugging Face token docs: https://huggingface.co/docs/hub/security-tokens
Create a token with read access.
Copy the token value and keep it ready.

From this point onward, I will assume you have the token value. Do not paste it into Git. Do not put it directly in a Deployment manifest. Put it in a Kubernetes Secret.

Create the namespace and Secret

Keep the first LLM workload out of the default namespace:

kubectl create namespace llm-demo

Set the token in your shell:

export HF_TOKEN="hf_your_token_here"

Create the Secret:

kubectl create secret generic hf-token \
  -n llm-demo \
  --from-literal=HF_TOKEN="${HF_TOKEN}"

Check that it exists:

kubectl get secret hf-token -n llm-demo

Expected shape:

NAME       TYPE     DATA   AGE
hf-token   Opaque   1      10s

Existence is enough. Do not print the token back unless you have a specific reason.

Deploy the model API

vLLM gives us the model server and the OpenAI-compatible HTTP API. The Kubernetes pattern is documented in the vLLM Kubernetes docs, and the API shape is documented in the vLLM OpenAI-compatible server docs.

Create qwen-vllm.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwen-vllm
  namespace: llm-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwen-vllm
  template:
    metadata:
      labels:
        app: qwen-vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          imagePullPolicy: IfNotPresent
          command:
            - vllm
            - serve
            - Qwen/Qwen2.5-1.5B-Instruct
          args:
            - --host
            - 0.0.0.0
            - --port
            - "8000"
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: HF_TOKEN
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: HF_TOKEN
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: shm
              mountPath: /dev/shm
      volumes:
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: 2Gi
---
apiVersion: v1
kind: Service
metadata:
  name: qwen-vllm
  namespace: llm-demo
spec:
  selector:
    app: qwen-vllm
  ports:
    - name: http
      port: 8000
      targetPort: 8000

A few details matter.

The pod requests one GPU with nvidia.com/gpu: 1. That is what makes this schedulable as a GPU workload. The token appears as both HF_TOKEN and HUGGING_FACE_HUB_TOKEN because different libraries and examples use different names. Both point to the same Secret value.

The /dev/shm mount is there because model servers often use shared memory heavily. Container runtimes commonly cap /dev/shm at 64 MiB by default, and a limit that small can create strange failures. A memory-backed emptyDir keeps the first deployment boring.

When this pod starts, vLLM does roughly five things. It reads the model name from the command, uses the Hugging Face token to access the repository, downloads or reuses the model files, initializes the tokenizer and model runtime, then starts the API server on port 8000. Only after that finishes is the API useful.

For production, pin the vllm/vllm-openai image version instead of using latest. For this walkthrough, latest keeps the example readable.

Apply it:

kubectl apply -f qwen-vllm.yaml

Expected output:

deployment.apps/qwen-vllm created
service/qwen-vllm created

Watch startup properly

Watch the pod:

kubectl get pods -n llm-demo -w

You may see:

NAME                         READY   STATUS              RESTARTS   AGE
qwen-vllm-6c9f7d8c9d-x9v2m   0/1     Pending             0          3s
qwen-vllm-6c9f7d8c9d-x9v2m   0/1     ContainerCreating   0          15s
qwen-vllm-6c9f7d8c9d-x9v2m   1/1     Running             0          2m

Do not celebrate too early.

Running is not the same as ready. The container process can be running while the model is still downloading, CUDA is initializing, weights are loading, or vLLM is preparing the serving engine. This manifest defines no readiness probe, so Kubernetes reports READY 1/1 as soon as the container starts, not when the API can answer. The first start is usually slower because the model files have to be downloaded.

Follow the logs:

kubectl logs -n llm-demo -f deployment/qwen-vllm

You are looking for the server to finish loading the model and listen on port 8000. The exact log lines vary by vLLM version. If logs are still busy, wait. If they show a clear error, jump to the troubleshooting table below.

Port-forward the Service

For the first test, do not create public ingress. Do not add DNS. Do not put it behind an internet-facing load balancer.

Use port-forward:

kubectl port-forward -n llm-demo svc/qwen-vllm 8000:8000

Keep that command running. You should see:

Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000

Now local port 8000 forwards to the Kubernetes Service, which forwards to the vLLM pod.
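
Before sending a chat request, it can help to confirm that the server answers at all. vLLM's OpenAI-compatible server exposes a /health route, and the sketch below polls it through the port-forward until it returns HTTP 200. The helper names, retry count, and delay are illustrative assumptions, not vLLM defaults.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on any connection or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and non-2xx responses all mean "not ready yet".
        return False


def wait_until_ready(probe, attempts: int = 60, delay: float = 5.0, sleep=time.sleep) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds between failed tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Example (requires the port-forward above to be running):
# ready = wait_until_ready(lambda: http_ok("http://127.0.0.1:8000/health"))
```

The probe is passed in as a callable so the retry loop stays independent of how readiness is checked.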

Send the first curl request

In another terminal, call the OpenAI-compatible chat endpoint:

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-1.5B-Instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a concise Kubernetes assistant."
      },
      {
        "role": "user",
        "content": "Explain what a Kubernetes Service does in two sentences."
      }
    ],
    "max_tokens": 120,
    "temperature": 0.2
  }'

Why does the curl request include the model name again?

This part looks redundant at first:

"model": "Qwen/Qwen2.5-1.5B-Instruct"

We already gave the model name to vllm serve in the Deployment. That tells the server which model to load into memory. The model field in the curl request is part of the OpenAI-compatible API contract. Clients send it so the server knows which served model the request is targeting.

In this article, the server has only one model, so the value feels repetitive. In real systems, the same API style may sit behind routers, gateways, aliases, multiple deployments, or clients that can switch between models. Keeping the field means curl, OpenAI SDK code, and later gateway setup all follow the same shape.

For the first run, keep the value identical to the model passed to vllm serve. Later, vLLM can expose a different client-facing name through its --served-model-name option, but that is extra complexity we do not need yet.
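
If you are unsure which name the server will accept, OpenAI-compatible servers, vLLM included, list their served models at /v1/models. As a rough sketch, the helpers below take that response body and check a requested name against it. The function names are hypothetical, and the response shape (a data list of objects with an id field) is assumed from the OpenAI API format.

```python
import json


def served_model_ids(models_response: str) -> list[str]:
    """Extract model ids from the JSON body returned by GET /v1/models."""
    return [entry["id"] for entry in json.loads(models_response).get("data", [])]


def check_model_name(requested: str, models_response: str) -> str:
    """Report whether the requested model name matches one the server lists."""
    ids = served_model_ids(models_response)
    if requested in ids:
        return f"ok: {requested!r} is served"
    return f"mismatch: {requested!r} not in {ids}"
```

You could get the response body with curl http://127.0.0.1:8000/v1/models once the port-forward is running.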

A successful response will be JSON. The exact wording will differ, but the shape should look familiar:

{
  "object": "chat.completion",
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "A Kubernetes Service provides a stable network endpoint for a set of Pods, even as those Pods are created, deleted, or replaced. It selects Pods using labels and forwards traffic to the matching backends."
      }
    }
  ]
}

That is the moment the deployment becomes real. The request reached your model server, vLLM handled the OpenAI-compatible route, the model generated text, and the response came back through Kubernetes. Not a diagram, not a promise. A model answered through an API running inside the cluster.
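
The same request can come from code instead of curl. Here is a minimal Python sketch using only the standard library. It builds the POST that the curl command sends and pulls the assistant text out of the response shape shown above. The helper names are hypothetical, and the defaults simply mirror the curl example.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_prompt, system_prompt=None,
                       max_tokens=120, temperature=0.2):
    """Build a POST request for the OpenAI-compatible /v1/chat/completions route."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    body = {
        "model": model,  # must match the name the server is serving
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body):
    """Return the assistant text from a chat.completion response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


# Example (requires the port-forward to be running):
# req = build_chat_request("http://127.0.0.1:8000", "Qwen/Qwen2.5-1.5B-Instruct",
#                          "Explain what a Kubernetes Service does in two sentences.")
# with urllib.request.urlopen(req, timeout=60) as resp:
#     print(extract_reply(resp.read().decode("utf-8")))
```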

Swapping the model

To try the smaller model, change the served model:

command:
  - vllm
  - serve
  - Qwen/Qwen2.5-0.5B-Instruct

Then change the curl body too:

"model": "Qwen/Qwen2.5-0.5B-Instruct"

For a larger test, use Qwen/Qwen2.5-7B-Instruct in both places.

For a first run, keep the model name in the request identical to the model name served by vLLM. You can configure aliases later. Today, remove avoidable debugging.

What happened

Kubernetes scheduled a pod onto a node that advertises nvidia.com/gpu. The NVIDIA device plugin made the GPU available to the container. The Hugging Face token let the container pull the model. vLLM loaded the model onto the GPU and started an HTTP server on port 8000. The Service gave the pod a stable in-cluster endpoint. Port-forward gave us a safe local path. Curl proved the API could answer through /v1/chat/completions.

That is the basic loop every LLM platform needs before it becomes fancy:

Can Kubernetes schedule the workload onto a GPU?
Can the container see the GPU?
Can the model server download and load the model?
Can the API route accept a request?
Can the model generate a response?
Can you observe failures when any of those steps break?

If this loop is unreliable, autoscaling and gateways will not save you. They will only hide the problem for a while.

Troubleshooting

| Symptom | What it usually means | What to check |
| --- | --- | --- |
| Pod stuck in `Pending` | Kubernetes cannot find a matching node | Run `kubectl describe pod -n llm-demo <pod-name>` and read scheduler events. Confirm GPU capacity exists. |
| `nvidia.com/gpu` missing | GPU Operator or device plugin path is broken | Re-run the GPU visibility command and go back to Part 4 before continuing. |
| Hugging Face download fails | Token is missing, wrong, expired, or lacks model access | Recreate the token, update the Secret, then run `kubectl rollout restart deployment/qwen-vllm -n llm-demo`. |
| CUDA initialization error | Driver, runtime, image, or node stack mismatch | Check pod logs, GPU Operator status, driver version, and a simple CUDA test pod. |
| Pod crashes with OOM | Model or runtime needs more memory | Try `Qwen/Qwen2.5-0.5B-Instruct`, use a larger GPU, or reduce the context length with `--max-model-len`. |
| `curl: connection refused` | Server is not ready or port-forward is not running | Check logs, keep port-forward running, and verify `kubectl get svc -n llm-demo`. |
| Model name mismatch | Request model differs from served model | Make the curl `model` value match the `vllm serve` model. |

The most common mistake is treating Running as the finish line. It is not. For model serving, readiness is tied to download, GPU initialization, model loading, and server startup. Watch logs, not just pod phase.

Clean up

If this was only a test, delete the namespace:

kubectl delete namespace llm-demo

That removes the Deployment, Service, and Secret. If you keep experimenting, remember that a GPU pod can hold expensive capacity even when nobody is sending requests.

What we are not covering yet

This article stops at the first working API call. We are not covering public ingress, authentication, autoscaling, multi-GPU serving, quantization, production monitoring, or cost optimization yet.

Those are not tiny details. Public ingress brings TLS, routing, limits, and abuse controls. Authentication decides who can call the model. Autoscaling needs LLM-specific signals, not only CPU. Multi-GPU serving changes scheduling and failure behavior. Quantization changes memory and quality tradeoffs. Monitoring needs token, latency, GPU, queue, and model-server metrics.

But all of that comes after this basic path works.

A Kubernetes LLM platform starts becoming real when a model can load, serve, and answer through an API that other systems can call. Today we got there with one Deployment, one Service, one Secret, and one curl request.

In the next parts, we can make this less like a demo and more like a platform: readiness, observability, routing, auth, scaling, and the failure paths that show up once real users start sending prompts.

If you are following the series, subscribe and keep the manifest from this article handy. It is a good checklist for the first LLM-on-Kubernetes question: can we actually serve a model and call it?

OpenAI Already Told Us the Kubernetes Scaling Story, Most People Just Did Not Read It Closely

Pawan Kumar — Fri, 12 Jun 2026 07:03:45 +0000

OpenAI already told us a lot about Kubernetes and large AI workloads. They did it years ago, before everyone started calling every GPU cluster an AI platform.

The posts are not product launches. They are engineering notes: scaling Kubernetes to 2,500 nodes in 2018, then scaling Kubernetes to 7,500 nodes in 2021. Read them closely and the lesson is not "use Kubernetes because OpenAI used Kubernetes." That would be lazy.

The better lesson is sharper:

Kubernetes can run serious AI infrastructure, but only when the team understands which parts of Kubernetes help, which parts get bypassed, and which parts become load-bearing at scale.

That distinction matters. A lot of teams are now trying to serve LLMs on Kubernetes by stacking tools until the diagram looks impressive. OpenAI's posts are a useful correction. They show a system that is surprisingly pragmatic. Whole-node pods. Direct pod IPs. MPI. Blob storage. Custom health checks. API server tuning. Less magic than you might expect.

This is Part 5 of the LLMs on Kubernetes series. I am keeping it narrow on purpose. This is not another introduction to GPU scheduling, token metrics, model parallelism, or autoscaling. We already covered enough of that earlier. This one is a teardown of what OpenAI publicly said and what a smaller platform team can steal from it without pretending to be OpenAI.

Kubernetes was the substrate, not the AI platform

OpenAI's 2018 post is very clear about why Kubernetes was useful. At the time, they still managed bare cloud VMs directly for their largest-scale workloads, but Kubernetes gave most experiments a faster iteration cycle, reasonable scalability, and less boilerplate.

That is the first lesson.

Kubernetes did not become useful because it magically understood deep learning. It became useful because it gave researchers a common substrate for running jobs, getting capacity, and iterating without hand-rolling the same infrastructure every time.

That sounds boring. It is also the part many teams skip.

A good LLM platform should not begin with "which AI gateway should we buy?" or "which model server wins?" Those questions matter later. The first platform question is simpler: can teams reliably ask for compute, run the workload, observe it, restart it, and get their data in and out without inventing a new workflow every week?

OpenAI's Kubernetes story is not about Kubernetes replacing every ML system. It is about Kubernetes becoming the shared operating layer for a messy research environment.

For a smaller team, the translation is simple: do not copy the node count. Copy the substrate mindset.

Give teams a boring path to run workloads. Make resource ownership clear. Make failures visible. Make restart behavior predictable. Let the specialized AI tools sit on top of that instead of turning the whole cluster into a science project.

Whole-node pods were a feature, not waste

One line from OpenAI's 7,500-node post is easy to miss: for many workloads, a single pod occupied the entire node.

In normal Kubernetes land, that sounds inefficient. We are trained to think about bin packing, fragmentation, spreading, utilization, requests, limits, and squeezing many workloads onto the same pool.

Large ML jobs can invert that instinct.

OpenAI explained the reason plainly. A large machine learning job can span many nodes and run most efficiently when it has access to all hardware resources on each node. GPUs may communicate through NVLink. GPUs may communicate directly with the NIC through GPUDirect. CPU, NUMA, PCIe, and local hardware topology stop being background details and become part of the job's performance envelope.

So the pod is not a small web replica anymore. It is closer to a worker slot in a coordinated compute job.

That is a different scheduling shape.

This does not mean every LLM inference service should use one pod per node. Most teams should not start there. But it does mean you should be careful with the default Kubernetes instinct of treating every GPU node as a bin-packing target.

If your workload depends on multiple GPUs behaving as a tight local group, the cleanest unit of scheduling may be the node. If the model server expects exclusive access to the local GPU topology, pretending the node is a general shared basket can make the system worse.

The smaller-team version is not "use whole-node pods everywhere." It is this:

Know when the node is the unit of performance.

For some LLM workloads, especially larger replicas, the useful abstraction is not "one container gets one device." It is "this serving replica owns this hardware shape." Kubernetes can schedule that shape, but you have to describe it honestly.

Direct pod IPs beat pretty abstractions for some jobs

OpenAI also wrote that they did not rely heavily on Kubernetes load balancing. Their biggest jobs had very little HTTPS traffic. They were not doing A/B tests, blue/green deploys, or canary rollouts inside those jobs.

Pods communicated directly with each other on pod IPs, using MPI over SSH, not service endpoints. Discovery happened once at job startup: which pods are participating in this MPI job?

That is a very different world from a stateless HTTP service behind a Service object.

This is where OpenAI's post is most useful as a mindset check. Kubernetes has beautiful abstractions for service discovery and load balancing. But not every distributed AI workload wants that abstraction in the hot path.

For tightly coordinated jobs, the membership of the group matters. Rank 0, rank 1, rank 2, and the rest are not interchangeable web replicas. If one participant disappears, the whole job may stop. If traffic gets sprayed through a generic balancing layer, the abstraction can be actively wrong.

The practical takeaway is not to bypass Kubernetes networking casually. It is to separate two kinds of traffic:

User-facing service traffic, where Services, Gateways, ingress, and load balancing make sense.
Job-internal coordination traffic, where direct pod identity and predictable peer membership may matter more.

LLM serving teams will hit this distinction as systems get more advanced. A simple single-node inference server can look like any other HTTP deployment. A multi-node replica, a distributed prefill/decode setup, or a training-style job starts to care about peer identity and network paths.

The abstraction should match the workload. That is the point.

Checkpointing was not an optimization. It was survival.

OpenAI described their largest jobs as MPI jobs where all pods participate in a single communicator. If one pod dies, the whole job halts and needs to restart. The job checkpoints regularly and resumes from the last checkpoint.

That sentence is doing a lot of work.

In a normal web service, a pod dying is usually noise. A replica disappears. The endpoint controller updates. Traffic goes somewhere else. The user might never notice.

In a large coordinated ML job, one pod dying can waste a huge amount of work. The failure model is different. Kubernetes can replace the pod, but it cannot make the lost compute free.

That is why checkpointing belongs in the infrastructure conversation. Not as a nice ML detail. As part of reliability.

If your job takes hours or days, restart behavior is not a footnote. If your model takes minutes to load, restart behavior is not a footnote. If your serving replica spans multiple GPUs or nodes, restart behavior is not a footnote.

For smaller teams, the practical version is this: test the boring failure path before production tests it for you.

Kill a pod while the model is loaded. Drain a GPU node. Interrupt a worker. Restart the model server. Break access to the model bucket. Watch what happens.

Do not only measure the happy path where the endpoint starts once and serves a prompt. Measure how long it takes to become useful again.
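
One way to put a number on that is to poll a health probe while you break things and record the gap between the first failed check and the next successful one. The sketch below is an illustration under assumptions: measure_recovery is a hypothetical helper, and the probe is any callable you supply that returns True when the endpoint is useful again.

```python
import time


def measure_recovery(probe, poll_interval=1.0, timeout=900.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Return seconds between the first failed probe and the next success, or None on timeout."""
    start = clock()
    went_down_at = None
    while clock() - start < timeout:
        healthy = probe()
        now = clock()
        if not healthy and went_down_at is None:
            # First observed failure: the outage window starts here.
            went_down_at = now
        elif healthy and went_down_at is not None:
            # First success after a failure: the outage window ends here.
            return now - went_down_at
        sleep(poll_interval)
    return None
```

Start it, then kill the pod or drain the node. The result is only as precise as the poll interval, which is usually fine when model loading takes minutes.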

OpenAI's posts make this obvious at huge scale, but the same lesson shows up much earlier. A five-node LLM platform with bad restart behavior can still ruin your day.

Blob storage carried the boring weight

Another small but important detail: most OpenAI jobs interacted with blob storage. Jobs streamed dataset shards or checkpoints from blob storage, or cached data to fast local ephemeral disks. They used PersistentVolumes only where POSIX semantics were useful, and said blob storage was more scalable and avoided slow detach/attach operations.

That is a very practical choice.

Kubernetes storage discussions often get stuck on volumes. Which CSI driver? Which storage class? ReadWriteMany or ReadWriteOnce? How do we attach it? Can the pod move?

For AI workloads, a lot of the heavy data path may be better shaped around object storage and local cache instead. Model weights, checkpoints, datasets, tokenizer files, adapters, and artifacts often move through object storage more naturally than through classic attached disks.

The smaller-team lesson is not "never use PersistentVolumes." It is to be honest about the access pattern.

If the workload needs POSIX semantics, use the right volume system. If the workload mostly needs large immutable files, checkpoints, or artifacts, object storage plus local ephemeral cache may be simpler and easier to scale.

This also affects cold starts. A model server that pulls hundreds of gigabytes from object storage on every restart is not just a pod problem. It is a storage, cache, network, and readiness problem.

Kubernetes starts the container. Your platform still has to make the model arrive.
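
A quick back-of-the-envelope calculation shows why. The sketch below estimates transfer time from artifact size and sustained throughput. The function name and the example figures (200 GB of weights, 10 Gbit/s sustained) are illustrative assumptions, not measurements from any real cluster.

```python
def transfer_seconds(size_gb: float, throughput_gbit_per_s: float) -> float:
    """Seconds to move size_gb gigabytes at a sustained throughput in gigabits per second."""
    if throughput_gbit_per_s <= 0:
        raise ValueError("throughput must be positive")
    # 1 byte = 8 bits, so gigabytes * 8 gives gigabits.
    return size_gb * 8 / throughput_gbit_per_s


# Assumed example: 200 GB at a sustained 10 Gbit/s.
print(transfer_seconds(200, 10))  # 160.0
```

Under those assumptions the download alone adds 160 seconds to every cold start. At a sustained 2 Gbit/s the same artifact takes 800 seconds, roughly 13 minutes, before the GPU has loaded a single weight.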

The API server became part of the scaling story

The most Kubernetes-specific lesson in OpenAI's posts is not about GPUs. It is about the control plane.

At 2,500 nodes, OpenAI hit etcd latency, excessive API reads, Kubernetes Event pressure, image pull issues, KubeDNS hotspots, networking limits, and even ARP cache overflow. At 7,500 nodes, they paid close attention to API server 429 and 5xx rates, ran API servers and etcd on dedicated nodes, and observed up to 70 GB of heap usage per API server.

This is the part platform teams should read twice.

At AI scale, boring Kubernetes objects become infrastructure load.

A DaemonSet that watches the API server from every node may be harmless in a small cluster and painful in a large one. A monitoring agent that polls too aggressively can become control-plane traffic. A node join event is not just a node join event when hundreds of nodes arrive together. A Service with every node behind it can create huge watch traffic if the data structure is wrong.

OpenAI specifically called out WATCHes on Endpoints. Some services, like kubelet and node-exporter, had every node as a member. When nodes were added or removed, endpoint watches fired broadly. They said EndpointSlices, which arrived for them with Kubernetes 1.17, reduced that load by a massive amount. (The API was alpha in 1.16, beta in 1.17, and stable from 1.21.)

Most teams will never see OpenAI's numbers. That does not make the lesson irrelevant.

If you are building an LLM platform, ask boring control-plane questions early:

kubectl get --raw /metrics | grep apiserver_request_total
kubectl get --raw /metrics | grep apiserver_flowcontrol_rejected_requests_total
kubectl get endpointslices -A | head
kubectl get events -A --sort-by=.lastTimestamp | tail

Those commands do not make your platform OpenAI-scale. They just force you to look at the control plane as part of the system.
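
If grep output is too noisy to read, a small parser can turn it into the number OpenAI said they watched: the share of API requests ending in 429 or 5xx. This is a sketch, assuming the Prometheus text format that kubectl get --raw /metrics returns and a code label on apiserver_request_total. The function names are hypothetical, and the regex assumes label values contain no closing brace.

```python
import re
from collections import Counter

_SAMPLE = re.compile(r'^apiserver_request_total\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)')
_CODE = re.compile(r'code="(?P<code>\d+)"')


def request_totals_by_code(metrics_text: str) -> Counter:
    """Sum apiserver_request_total samples per HTTP status code."""
    totals = Counter()
    for line in metrics_text.splitlines():
        sample = _SAMPLE.match(line)
        if not sample:
            continue  # comments and other metrics are ignored
        code = _CODE.search(sample.group("labels"))
        if code:
            totals[code.group("code")] += float(sample.group("value"))
    return totals


def error_ratio(totals) -> float:
    """Fraction of requests that ended in 429 or any 5xx status."""
    total = sum(totals.values())
    if total == 0:
        return 0.0
    bad = sum(v for code, v in totals.items() if code == "429" or code.startswith("5"))
    return bad / total
```

These counters are cumulative since the API server started, so for a trend you would compare two snapshots taken some time apart rather than read one in isolation.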

The cluster is part of the model serving path.

If the API server is slow, if node joins are noisy, if events explode, if monitoring creates too much cardinality, if image pulls block startup, your LLM platform will feel unreliable even when the model server is technically fine.

Health checks had to understand hardware

OpenAI's 7,500-node post also described passive and active health checks. Passive checks watched basic resources, network reachability, disks, GPU errors, maintenance events, and signals such as DCGM Xid errors. Active GPU tests ran at boot through a preflight taint and label, then periodically during node lifetime.

That is a useful pattern because it draws a line between node readiness and workload readiness.

A Kubernetes node can be Ready while still being a bad place to run an expensive GPU job. The host may be reachable. The kubelet may be alive. The node may pass generic checks. But the GPU can still be unhealthy, misbehaving, or not worth trusting for a long-running job.

OpenAI handled this by preventing normal workloads from landing until preflight passed. That is a very Kubernetes-native move: use taints and labels to keep the node out of rotation until hardware-specific validation completes.

Smaller teams should steal that idea.

You do not need OpenAI-scale automation to start. A practical GPU node readiness path can be simple:

Node joins with a temporary taint.
A validation job checks GPU visibility, driver behavior, runtime access, and a tiny CUDA or model-server smoke test.
The taint is removed only after the test passes.
Failed nodes stay out of the serving pool.
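
The decision at the end of that path is small enough to sketch. The code below only builds the kubectl taint command a validation job might run. It does not execute anything. The taint key example.com/gpu-preflight, the check names, and both function names are hypothetical placeholders, not what OpenAI used.

```python
def preflight_passed(checks: dict[str, bool]) -> bool:
    """A node passes only if at least one check ran and every check succeeded."""
    return bool(checks) and all(checks.values())


def taint_command(node: str, passed: bool,
                  key: str = "example.com/gpu-preflight",
                  effect: str = "NoSchedule") -> list[str]:
    """Build the kubectl command: remove the taint on success, keep the node tainted on failure."""
    if passed:
        # A trailing "-" tells kubectl to remove the taint with this key and effect.
        return ["kubectl", "taint", "nodes", node, f"{key}:{effect}-"]
    # Record the failure in the taint value; --overwrite replaces the existing value.
    return ["kubectl", "taint", "nodes", node, f"{key}=failed:{effect}", "--overwrite"]


# Illustrative check results for one node.
checks = {"gpu_visible": True, "driver_ok": True, "cuda_smoke_test": False}
print(taint_command("gpu-worker-01", preflight_passed(checks)))
```

Treating an empty result set as a failure is deliberate: a validation job that ran no checks should not open the node to expensive work.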

This is not the same as Part 4's GPU setup checklist. Part 4 was about preparing the node. This is about trust over time.

The question changes from "can Kubernetes see the GPU?" to "should we let this node take expensive work right now?"

Do not copy OpenAI. Copy the questions.

The wrong response to OpenAI's posts is to imitate the architecture directly.

Most teams do not need 7,500-node clusters. They do not need to expose every pod CIDR to researchers. They do not need the same networking model, the same quota system, the same team tainting service, or the same autoscaling behavior.

But they do need the questions OpenAI was forced to answer:

What is the real unit of scheduling for this workload?
Does this job want service load balancing, or direct peer identity?
What happens when one participant dies?
Where do checkpoints, weights, and datasets actually live?
Which agents talk to the API server from every node?
What happens if hundreds of pods or nodes appear at once?
Can a node be Ready but still unsafe for GPU work?
Which metrics are useful, and which metrics are just expensive noise?

That list is more valuable than the node count.

OpenAI's posts are interesting because they make Kubernetes look both powerful and very ordinary. Kubernetes handled huge clusters, but not by pretending AI workloads were normal web apps. The team shaped the platform around the workload.

That is the real takeaway.

Do not turn Kubernetes into a shrine. Do not turn OpenAI into a cargo cult. Use Kubernetes where it gives you a common substrate, clear scheduling boundaries, and operational leverage. Bypass or reshape the abstractions when the workload proves they are wrong.

The cluster is not just where the model runs. At LLM scale, the cluster becomes part of the model's behavior.

In the next part, we will look at another public signal from the frontier labs: Claude on EKS, Trainium, and the rise of the AI megacluster.

If you are building or evaluating LLM serving on Kubernetes, subscribe to follow the rest of the series. I am also putting together a free LLM Serving on Kubernetes Production Readiness Checklist so teams can sanity-check GPU nodes, model loading, observability, scaling, cost, and failure recovery before production does it for them.

Before the Pod Starts: GPU Node Setup for LLMs on Kubernetes

Pawan Kumar — Thu, 04 Jun 2026 12:06:57 +0000

Originally published at dheeth.blog.

A pod is usually where Kubernetes conversations start. You write a Deployment, set requests and limits, pick a container image, add a Service, and let the scheduler place the workload somewhere in the cluster.

That is fine for normal applications. It is not enough for LLM serving.

Part 3 explained why a large model does not simply "run in a pod." A serving replica may be a coordinated GPU group. It may span multiple GPUs. It may depend on tensor parallelism, pipeline parallelism, expert parallelism, NCCL communication, model server behavior, and the shape of the hardware underneath it.

Part 4 moves one layer down: before the pod starts, the GPU node has to be prepared correctly. Kubernetes has to know that a node has GPUs. The container runtime has to expose those GPUs into containers. The node needs the right driver stack. The device plugin has to advertise schedulable resources. Labels have to describe what kind of GPU capacity exists. Metrics have to tell you whether the GPUs are healthy and useful. If you use MIG, time-slicing, or MPS, the sharing model has to be explicit.

Otherwise Kubernetes is scheduling blind.

It may see a node. It may even see nvidia.com/gpu. But that still does not mean the node is ready to serve LLM traffic well.

A GPU node is not just a bigger worker node

A normal Kubernetes worker node needs a kubelet, a container runtime, networking, storage integration, and enough CPU and memory to run pods. A GPU node needs all of that, plus a second hardware and software stack that has to line up cleanly.

At minimum, you care about:

the GPU model and memory size
the NVIDIA driver
CUDA compatibility
the NVIDIA Container Toolkit
the Kubernetes device plugin
GPU feature labels
monitoring through DCGM
node pool isolation
taints and tolerations
runtime behavior for MIG, MPS, or time-slicing
whether the node can support the serving engine you plan to run

This is why "add GPU nodes" is a dangerous oversimplification. A node with a T4, a node with an A10, a node with an A100 split into MIG instances, and a node with H100s connected through NVLink are all very different scheduling targets.

For a small model, that difference may only affect throughput. For a large model, it may decide whether the deployment works at all.

A Kubernetes scheduler does not automatically understand all of those details. It schedules based on resources, constraints, labels, taints, affinity rules, and plugins. If the GPU node does not publish the right information, Kubernetes cannot make a good placement decision.

What Kubernetes actually sees

Kubernetes has a generic way to work with special hardware through the device plugin framework. The kubelet does not magically discover every accelerator and understand how to allocate it. A vendor or third-party device plugin registers with the kubelet and advertises device resources to the node.

For NVIDIA GPUs, the common resource name is:

nvidia.com/gpu

Once the device plugin is running, a pod can request that extended resource with a quantity:

resources:
  limits:
    nvidia.com/gpu: 1

That is the basic Kubernetes contract. The pod asks for one GPU. The node says it has allocatable GPU resources. The scheduler only places the pod on a node that can satisfy the request.

Useful, but limited.

That resource request hides most of the information LLM platforms actually need. The GPU might have 16 GB, 80 GB, or 192 GB of memory. The node may or may not have NVLink between GPUs. The GPU might be in MIG mode. The node might belong to an inference pool, a training pool, a batch pool, or somebody's experiment corner. DCGM may already be reporting errors. The model server may need a topology this node cannot provide.

The device plugin makes GPUs schedulable. It does not make Kubernetes an LLM placement brain.

That distinction matters. A lot of LLM failures start when teams treat nvidia.com/gpu: 1 as the whole story.

The NVIDIA GPU Operator is the usual starting point

You can install every GPU component manually, but most production Kubernetes setups use the NVIDIA GPU Operator or a cloud provider equivalent. The operator exists because a GPU node needs more than one daemon.

NVIDIA describes the problem plainly: Kubernetes can provide access to special hardware through device plugins, but configuring nodes also requires drivers, container runtimes, libraries, monitoring, and other components. The GPU Operator automates much of that node-level software stack.

In practice, the operator can manage or deploy components such as:

NVIDIA drivers, if you want the operator to manage them
NVIDIA Container Toolkit
NVIDIA Kubernetes device plugin
GPU Feature Discovery
DCGM and DCGM Exporter
MIG Manager
validator pods

The exact setup depends on your environment. Managed Kubernetes providers sometimes preinstall drivers or handle parts of the stack. Bare metal clusters may need the operator to do more. Air-gapped clusters need image mirroring and version discipline. Some organizations deliberately manage drivers outside the cluster because kernel and driver upgrades are part of their node image pipeline.

A basic install usually starts with Helm:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

kubectl create namespace gpu-operator

If your cluster enforces Pod Security Admission, label that namespace before the operator starts creating privileged node-level components:

kubectl label --overwrite ns gpu-operator \
  pod-security.kubernetes.io/enforce=privileged

Then install the operator:

helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --version=v26.3.2 \
  --wait

Then check that the operator-managed pods are actually running:

kubectl get pods -n gpu-operator
kubectl get daemonset -n gpu-operator

The exact pod names vary by GPU Operator version and configuration, but this is the point where you should see components for the device plugin, GPU Feature Discovery, DCGM Exporter, validators, and any driver/toolkit pieces your environment needs.

The important point is not "always install the operator and forget everything else." The point is: there is a GPU node stack, and something has to own it.

If nobody owns it, the first real LLM workload becomes the integration test.

That is a bad place to learn that the driver, CUDA userspace, container runtime, and model server image do not agree with each other.

The device plugin turns GPUs into schedulable resources

The NVIDIA device plugin is the bridge between the physical GPUs on the node and the resources Kubernetes can schedule. It runs on GPU nodes, discovers the devices, registers them with the kubelet, and exposes resources such as nvidia.com/gpu.

This is the part many platform engineers recognize first because it shows up directly in pod specs.

A minimal workload might request one GPU like this:

apiVersion: v1
kind: Pod
metadata:
  name: llm-worker
spec:
  containers:
    - name: worker
      image: example/llm-server:latest
      resources:
        limits:
          nvidia.com/gpu: 1

After the device plugin is running, verify what the node advertises:

kubectl describe node <gpu-node-name> | grep -A6 -E "Capacity|Allocatable"

On a node with one physical GPU and no sharing enabled, you might see:

Capacity:
  cpu:                32
  memory:             131932000Ki
  nvidia.com/gpu:     1

Allocatable:
  cpu:                32
  memory:             131829600Ki
  nvidia.com/gpu:     1

On a node with four physical GPUs, the same resource name may show a capacity of 4. Without MIG or time-slicing, that number usually maps to physical GPU count. With time-slicing, it can become logical shared capacity instead. That difference matters.

You should also run a small GPU smoke test before trusting the node for model serving:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vectoradd
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0-ubuntu22.04
      resources:
        limits:
          nvidia.com/gpu: 1

Then apply it and check the logs:

kubectl apply -f cuda-vectoradd.yaml
kubectl logs pod/cuda-vectoradd

That proves two things: Kubernetes can schedule a pod that requests a GPU, and the container can actually use the GPU runtime path. It still does not prove that the node is good for a large LLM.

A pod spec that requests nvidia.com/gpu: 1 is useful, but it is only the outermost layer. For a serious LLM workload, you usually need to ask more questions:

Which GPU model should this pod land on?
How much GPU memory does the model need?
Is this a full GPU, a MIG slice, or a time-sliced replica?
Can the serving engine use this GPU type efficiently?
Does this workload need multiple GPUs on the same node?
Does it need a specific driver or CUDA capability?
Should it avoid nodes shared with batch or notebook workloads?

The scheduler can only respect these requirements if you express them through resources, labels, affinity, taints, topology constraints, or a higher-level scheduler. If the cluster only exposes a flat nvidia.com/gpu resource, you have thrown away a lot of useful placement information.

For simple inference, that may be acceptable. For large LLM serving, it usually is not.

Labels are how the node starts telling the truth

Kubernetes scheduling improves when nodes describe themselves. That is where Node Feature Discovery and GPU Feature Discovery come in.

Node Feature Discovery detects hardware features available on each node and advertises them through node labels, and optionally extended resources, annotations, and taints. It is not GPU-specific. It can label CPU features, kernel features, PCI devices, and other node capabilities.

GPU Feature Discovery is NVIDIA-specific. It labels GPU properties so workloads and schedulers can distinguish between different GPU nodes. Historically it existed as its own project, and NVIDIA has since archived the standalone repository, but the function remains part of the GPU Operator stack.

The labels are the difference between "this node has a GPU" and "this node has the kind of GPU I want."

You might care about labels for:

GPU product name
GPU count
GPU memory
CUDA driver capability
MIG capability
MIG strategy
GPU family or architecture
whether a node belongs to a production inference pool
whether a node is allowed to run experimental workloads

The exact label names vary by component and version, so do not hard-code examples from a blog post into production without checking your cluster. The pattern is what matters:

nodeSelector:
  accelerator: nvidia-h100

or:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nvidia.com/gpu.product
              operator: In
              values:
                - NVIDIA-H100-80GB-HBM3

That is the practical scheduling jump. You stop saying "give me a GPU" and start saying "give me this class of GPU node."

LLM serving needs that distinction because GPU memory, interconnect, and serving-engine support shape the deployment. A 7B model, a 70B model, and a multi-GPU serving group should not all be treated as generic GPU workloads.
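To make the label matching concrete, here is a minimal Python sketch of how a single nodeSelectorTerm with matchExpressions narrows the candidate nodes. It is an illustration of the matching rule, not the scheduler's actual code. The node names and the label values on them are hypothetical.

```python
# Sketch: evaluate one nodeSelectorTerm against node labels.
# Node names and label values below are hypothetical examples.

def matches_expression(labels, expr):
    """Return True if a node's labels satisfy one matchExpression."""
    value = labels.get(expr["key"])
    if expr["operator"] == "In":
        return value in expr["values"]
    if expr["operator"] == "NotIn":
        # A missing label also satisfies NotIn.
        return value not in expr["values"]
    if expr["operator"] == "Exists":
        return expr["key"] in labels
    raise ValueError(f"unsupported operator in this sketch: {expr['operator']}")


def eligible_nodes(nodes, expressions):
    """All expressions inside one term must match (logical AND)."""
    return [
        name
        for name, labels in nodes.items()
        if all(matches_expression(labels, e) for e in expressions)
    ]


nodes = {
    "gpu-node-t4": {"nvidia.com/gpu.product": "Tesla-T4"},
    "gpu-node-h100": {"nvidia.com/gpu.product": "NVIDIA-H100-80GB-HBM3"},
    "cpu-node": {},
}

expressions = [
    {
        "key": "nvidia.com/gpu.product",
        "operator": "In",
        "values": ["NVIDIA-H100-80GB-HBM3"],
    }
]

selected = eligible_nodes(nodes, expressions)
print(selected)
```

With a flat nvidia.com/gpu request, both GPU nodes would be candidates. The label expression is what removes the T4 node from consideration.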

Taints and tolerations keep GPU nodes from becoming expensive junk drawers

GPU nodes are too expensive to become general worker nodes by accident.

A common pattern is to taint GPU nodes so normal pods do not land there unless they explicitly tolerate the taint:

kubectl taint nodes gpu-node-1 accelerator=nvidia:NoSchedule

Then GPU workloads add a toleration:

tolerations:
  - key: "accelerator"
    operator: "Equal"
    value: "nvidia"
    effect: "NoSchedule"

That looks basic, but it matters. Without isolation, GPU nodes can become a dumping ground for random sidecars, CPU-heavy services, log agents with bad limits, notebooks, experiments, and batch jobs that make production inference harder to reason about.
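The taint and toleration rule is simple enough to sketch in a few lines of Python. This is a simplified model of the matching logic, not the scheduler's implementation, and it ignores details such as tolerationSeconds. The pod and node data are hypothetical.

```python
# Sketch: does a pod's toleration list allow it onto a tainted node?

def tolerates(toleration, taint):
    """Return True if one toleration matches one taint."""
    # An empty effect on the toleration matches every effect.
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # An empty key with Exists tolerates every taint.
        key = toleration.get("key")
        return not key or key == taint["key"]
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value", "") == taint.get("value", "")
    )


def can_schedule(pod_tolerations, node_taints):
    """Every hard taint on the node must be tolerated by the pod."""
    hard_effects = ("NoSchedule", "NoExecute")
    return all(
        any(tolerates(t, taint) for t in pod_tolerations)
        for taint in node_taints
        if taint["effect"] in hard_effects
    )


gpu_node_taints = [{"key": "accelerator", "value": "nvidia", "effect": "NoSchedule"}]

web_pod = []  # a normal pod with no tolerations
llm_pod = [
    {"key": "accelerator", "operator": "Equal", "value": "nvidia", "effect": "NoSchedule"}
]

print(can_schedule(web_pod, gpu_node_taints))  # normal pod is kept off the GPU node
print(can_schedule(llm_pod, gpu_node_taints))  # GPU workload is allowed
```

Note that a toleration only permits placement. It does not attract the pod to GPU nodes, which is why tolerations are usually paired with a node selector or node affinity.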

For LLMs, you may need more than one GPU pool:

production online inference
batch inference
experiments and notebooks
fine-tuning or training
small-model serving
large-model serving
MIG-backed shared inference
full-GPU serving

These pools may use the same Kubernetes cluster but should not have the same scheduling policy. Taints, labels, node selectors, priority classes, quotas, and admission policy are the boring controls that keep the expensive hardware usable.

This is also where platform teams start turning hardware into a product surface. Developers should not need to know every node name. They should be able to ask for a workload class, such as "small shared GPU inference" or "full H100 production inference," and let the platform map that to the right node pool.

DCGM is how you know whether the GPU is healthy and busy

Scheduling is only half the story. Once workloads land on GPU nodes, you need to know whether the GPUs are actually working well.

That is where DCGM and DCGM Exporter enter the setup. DCGM, NVIDIA's Data Center GPU Manager, provides GPU telemetry. DCGM Exporter exposes metrics that can be scraped by Prometheus and visualized in Grafana or another observability stack.

If you installed the GPU Operator, DCGM Exporter is usually already part of the operator-managed stack. NVIDIA's chart exposes dcgmExporter.enabled, and the default is true. So first check whether it is already there before installing anything separately:

kubectl get pods -n gpu-operator | grep -i dcgm
kubectl get svc -n gpu-operator | grep -i dcgm

If your platform disables that component, or if you are not using the GPU Operator, then deploy DCGM Exporter separately through your observability stack instead of assuming GPU metrics will appear automatically.

For LLM serving, useful DCGM metrics include:

DCGM_FI_DEV_GPU_UTIL: GPU compute utilization
DCGM_FI_DEV_MEM_COPY_UTIL: memory copy utilization
DCGM_FI_DEV_FB_USED: framebuffer memory used
DCGM_FI_DEV_FB_FREE: framebuffer memory free
DCGM_FI_DEV_GPU_TEMP: GPU temperature
DCGM_FI_DEV_POWER_USAGE: power usage
DCGM_FI_DEV_XID_ERRORS: value of the most recent XID error (an error code, not a count)
DCGM_FI_DEV_ECC_DBE_VOL_TOTAL: volatile double-bit ECC errors

Those map directly to practical questions:

Is the model filling GPU memory before traffic even arrives?
Is KV cache pressure eating the remaining memory during generation?
Is the GPU busy, or is the model server queueing somewhere else?
Is memory movement becoming the bottleneck?
Is the card throttling or throwing hardware-level errors?
Is this node safe to keep in the serving pool?

Be careful with one metric: raw GPU utilization can lie to you.

A GPU can show high utilization while users still see poor time to first token because queueing is bad. A GPU can show moderate utilization while KV cache pressure is the real limiter. A GPU can be busy with the wrong mix of prefill and decode work. A GPU can be allocated to a pod that is not producing useful throughput.

So DCGM metrics are necessary, but they are not sufficient. You still need model-server metrics from vLLM, Triton, TensorRT-LLM, TGI, SGLang, or whatever you run. The GPU layer tells you what the hardware is doing. The serving layer tells you whether the model is serving traffic well.
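As a rough illustration of turning those metrics into an answer, the Python sketch below parses a few lines of Prometheus text format and computes how full the framebuffer is. The sample values and the label set are made up for the example. In practice you would query Prometheus rather than parse exporter output by hand, and used plus free is only an approximation of total memory because some memory is reserved.

```python
import re

# Hypothetical sample of DCGM Exporter output. Framebuffer values are in MiB.
SAMPLE_METRICS = """\
# TYPE DCGM_FI_DEV_FB_USED gauge
DCGM_FI_DEV_FB_USED{gpu="0",modelName="NVIDIA H100 80GB HBM3"} 72000
DCGM_FI_DEV_FB_FREE{gpu="0",modelName="NVIDIA H100 80GB HBM3"} 8000
DCGM_FI_DEV_GPU_UTIL{gpu="0",modelName="NVIDIA H100 80GB HBM3"} 35
"""


def parse_metrics(text):
    """Return {(metric_name, gpu_index): value} from Prometheus text format."""
    samples = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field on the line.
        series, value = line.rsplit(" ", 1)
        name = series.split("{", 1)[0]
        match = re.search(r'gpu="([^"]*)"', series)
        gpu = match.group(1) if match else ""
        samples[(name, gpu)] = float(value)
    return samples


def fb_used_fraction(samples, gpu):
    """Approximate share of framebuffer memory in use on one GPU."""
    used = samples[("DCGM_FI_DEV_FB_USED", gpu)]
    free = samples[("DCGM_FI_DEV_FB_FREE", gpu)]
    return used / (used + free)


samples = parse_metrics(SAMPLE_METRICS)
fraction = fb_used_fraction(samples, "0")
utilization = samples[("DCGM_FI_DEV_GPU_UTIL", "0")]
print(f"GPU 0 framebuffer used: {fraction:.0%}, compute utilization: {utilization:.0f}%")
```

In this made-up sample the GPU is nearly full on memory while compute utilization is modest, which is exactly the kind of case where utilization alone would mislead you.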

Part 14 of this series will go deeper into autoscaling signals. For now, the practical point is simple: if you cannot observe GPU health and GPU memory pressure, your LLM platform is flying blind.

MIG is not the same as time-slicing

GPU sharing is one of the easiest places to confuse yourself because the words sound similar but the isolation model is different.

MIG, or Multi-Instance GPU, lets supported NVIDIA GPUs partition a physical GPU into separate GPU instances. NVIDIA describes MIG as a way to partition GPUs based on Ampere and later architectures into separate and secure GPU instances for CUDA applications. The GPU Operator can deploy MIG Manager to manage MIG configuration on Kubernetes nodes.

MIG is useful when you want stronger partitioning. A large GPU can be split into smaller slices so several workloads can run with more predictable boundaries. For smaller models, internal tools, embeddings workloads, evaluation jobs, or lower-tier inference, that can be a good use of expensive hardware.

But MIG is not magic. A MIG slice has less memory and compute than the full GPU. A model that needs a full 80 GB GPU will not fit just because the physical card is present. A workload that depends on multiple full GPUs may not be happy on fragmented MIG capacity. Changing MIG geometry can also be operationally disruptive. NVIDIA notes that MIG Manager requires no user workloads running on the GPUs being configured, and in some environments the node may need a reboot.

That matters for production planning. MIG configuration is not something you casually flip during an incident.

Time-slicing is different. NVIDIA's GPU Operator time-slicing documentation explains that time-slicing enables oversubscription by letting workloads scheduled on an oversubscribed GPU interleave with one another. Unlike MIG, time-slicing does not provide memory or fault isolation between replicas.

A cluster-wide time-slicing config looks like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4

Create it in the operator namespace and point the device plugin at it during install:

kubectl create -n gpu-operator -f time-slicing-config.yaml

helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --create-namespace \
  --version=v26.3.2 \
  --set devicePlugin.config.name=time-slicing-config \
  --wait

If the operator is already installed, patch the ClusterPolicy instead:

kubectl patch clusterpolicies.nvidia.com/cluster-policy \
  -n gpu-operator --type merge \
  -p '{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config","default":"any"}}}}'

With replicas: 4, one physical GPU can advertise four schedulable shared replicas. Four physical GPUs can advertise sixteen. With renameByDefault: false, the resource name remains nvidia.com/gpu, while labels such as nvidia.com/gpu.product can get a -SHARED suffix and nvidia.com/gpu.replicas=4 tells you the oversubscription factor.

kubectl describe node <gpu-node-name> | grep -A8 -E "Labels:|Capacity:|Allocatable:"

Example shape for one physical GPU with four time-sliced replicas:

Labels:
  nvidia.com/gpu.count=1
  nvidia.com/gpu.product=Tesla-T4-SHARED
  nvidia.com/gpu.replicas=4

Capacity:
  nvidia.com/gpu: 4

Allocatable:
  nvidia.com/gpu: 4

That is a big tradeoff: the node now advertises four times the GPU capacity, but the hardware underneath has not changed.

Time-slicing can be useful for lightweight workloads, experiments, notebooks, CI jobs, embeddings, small internal tools, dev/test endpoints, or low-duty-cycle inference where exclusive GPU access would waste money. If every tiny workload asks for a full nvidia.com/gpu: 1 and gets exclusive access, one notebook or one small model can occupy the entire scheduling unit while using only a fraction of the card.

Time-slicing helps utilization by allowing more pods to share the same physical GPU over time. The value is sharing, not isolation.

A pod that requests a time-sliced GPU is not getting a private piece of hardware. It is getting shared access to an underlying GPU. It does not get separate GPU memory. It does not get fault isolation. It does not get guaranteed proportional compute. NVIDIA explicitly notes that requesting more than one time-sliced GPU does not guarantee a proportional amount of GPU compute power.

So do not treat replicas: 4 as four real GPUs. Use time-slicing for workloads that can tolerate noisy neighbors. Be very careful with latency-sensitive LLM serving, large models near memory limits, or coordinated multi-GPU serving groups.
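The arithmetic behind that warning can be sketched in Python. The 16 GB card size is an assumption for the example, and the "even share" figure is only a planning number, since time-slicing does not enforce any memory split.

```python
# Sketch: what time-slicing changes (advertised capacity) and what it does not
# (physical memory). The 16 GB card size is an assumed example value.

def advertised_gpus(physical_gpus, replicas):
    """Schedulable nvidia.com/gpu count the node reports with time-slicing."""
    return physical_gpus * replicas


def even_share_gb(gpu_memory_gb, pods_on_same_gpu):
    """Memory each pod could use if all pods shared one card evenly.

    Nothing enforces this split: one pod can take more and starve the rest.
    """
    return gpu_memory_gb / pods_on_same_gpu


one_gpu = advertised_gpus(physical_gpus=1, replicas=4)
four_gpus = advertised_gpus(physical_gpus=4, replicas=4)
share = even_share_gb(gpu_memory_gb=16, pods_on_same_gpu=4)

print(f"1 physical GPU advertises {one_gpu} replicas")
print(f"4 physical GPUs advertise {four_gpus} replicas")
print(f"Even share of a 16 GB card across 4 pods: {share} GB (not enforced)")
```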

MPS, the NVIDIA Multi-Process Service, is another sharing mechanism. It can improve GPU utilization for multiple CUDA processes by letting them share execution resources more efficiently, but it also needs careful workload-level testing. For LLM serving, the question is not "can we share this GPU?" The question is "can we share this GPU without destroying latency, memory predictability, or failure isolation?"

Those are different questions.

GPU memory is a scheduling constraint, even when Kubernetes does not see it that way

This is one of the biggest gaps between LLM reality and default Kubernetes scheduling.

Kubernetes can schedule nvidia.com/gpu: 1. But a single GPU is not a uniform unit. The useful capacity depends heavily on GPU memory.

A 7B model in FP16 or BF16 may fit on many cards. A 70B model needs about 140 GB for the weights alone at that precision, and more after you include KV cache and runtime overhead. A long-context workload can run out of memory even if the model weights fit. A workload with high concurrency can hit KV cache pressure while the scheduler still sees nothing more than a satisfied nvidia.com/gpu: 1 request.

Kubernetes does not natively schedule based on "80 GB of GPU memory free for this pod" in the same way it handles CPU and RAM requests. You need to model this through one or more of:

separate node pools by GPU memory class
labels for GPU product and memory size
admission policy that maps workload profiles to allowed GPU classes
MIG profiles when slicing is appropriate
model-server-level controls for max context, max batch size, and max concurrent sequences
observability that catches GPU memory pressure before users do

This is why Part 3's memory math matters even after you leave the article. Weight memory math tells you what class of node the model can run on. GPU node setup tells Kubernetes how to find that class of node.

If you skip this step, you get weird failures: pods schedule successfully, containers start, the model begins loading, then dies with CUDA out-of-memory errors. Kubernetes did its job. You gave it the wrong abstraction.
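A back-of-the-envelope fit check catches many of these failures before a pod is ever scheduled. The Python sketch below is a rough estimate under stated assumptions, not a substitute for testing with your serving engine: the model shape (32 layers, 32 KV heads, head dimension 128) is a hypothetical 7B-class configuration, and the 10% headroom reserved for runtime overhead is an assumed figure.

```python
# Sketch: will weights plus KV cache fit on a given GPU memory class?
# Model shape and the 10% headroom are assumptions for illustration.

def weight_bytes(params, bytes_per_param):
    return params * bytes_per_param


def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """Keys and values (factor 2) for every layer, KV head, and cached token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def fits(total_bytes, gpu_memory_bytes, headroom=0.10):
    """Leave a fraction of memory for runtime overhead and fragmentation."""
    return total_bytes <= gpu_memory_bytes * (1 - headroom)


GB = 10**9
weights = weight_bytes(params=7 * 10**9, bytes_per_param=2)  # FP16/BF16

# 16 concurrent sequences, each holding 4096 tokens of context.
cached_tokens = 16 * 4096
kv = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, tokens=cached_tokens)

print(f"weights: {weights / GB:.1f} GB, KV cache: {kv / GB:.1f} GB")
for gpu_gb in (24, 80):
    weights_only = fits(weights, gpu_gb * GB)
    with_kv = fits(weights + kv, gpu_gb * GB)
    print(f"{gpu_gb} GB GPU: weights fit={weights_only}, weights + KV fit={with_kv}")
```

Under these assumptions the weights fit on a 24 GB card, but the same model at that concurrency and context length does not. Both cases look identical to the scheduler as nvidia.com/gpu: 1.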

The container runtime is part of the serving path

Another easy mistake: treating the container image as if it is enough.

For a GPU workload to work inside a container, the host and runtime must expose the GPU correctly. The NVIDIA Container Toolkit is part of that path. The driver has to exist on the host or be managed through the operator. The container needs compatible userspace libraries. The kubelet and runtime need to know how to make GPU devices available to the container.

This is why GPU node readiness is more than kubectl get nodes showing Ready.

A node can be Ready for normal pods and still be broken for GPU workloads. The failure may only appear when a pod tries to start, load CUDA, initialize NCCL, or run the model server. Good GPU platforms usually add validation pods or smoke tests that check the GPU path before developers depend on the node.

A simple mental checklist:

Can the node see the GPU?
Is the driver loaded?
Can a container see the GPU?
Does a CUDA sample work?
Does the device plugin advertise the resource?
Do labels describe the GPU accurately?
Does DCGM report metrics?
Can the intended model server initialize on this node?
Can a small test model load and serve a request?

If the answer stops at "the node is Ready," you have not tested enough.

Multi-GPU nodes need topology awareness

Part 3 talked about tensor parallelism and pipeline parallelism. This is where that discussion touches the node.

If a model server needs multiple GPUs, placement inside a node matters. GPUs may be connected differently. Some paths have better bandwidth. Some nodes have NVLink. Some rely more heavily on PCIe. The serving engine may assume a certain number of GPUs per worker. NCCL performance may depend on the topology.

Kubernetes, by default, is not deeply reasoning about your tensor parallel group. If a pod requests four GPUs on a node, the device plugin can allocate devices, but the model server still has to use them correctly. If a deployment needs multiple pods across nodes, Kubernetes can place those pods, but the serving framework has to coordinate the ranks.

This article is not the dedicated scheduler article. That comes later. But the GPU node setup matters here because scheduling cannot become topology-aware if the platform does not expose useful topology and node information in the first place.

A practical rule: keep the first successful design boring.

If a model can run with tensor parallelism inside one node, start there before spreading a single serving replica across nodes. Multi-node serving adds network sensitivity, failure coordination, startup sequencing, and debugging pain. Kubernetes can manage the shape, but it will not make a bad topology fast.

What a practical GPU node baseline looks like

A serious LLM GPU node baseline does not have to be fancy. It needs to be explicit.

At a minimum, I would want a platform team to know the answers to these questions before onboarding production LLM workloads:

Who owns driver installation and upgrades?

The GPU Operator can manage drivers, or the node image pipeline can manage them. Both can work. The bad answer is "we are not sure."

How are GPUs advertised to Kubernetes?

The NVIDIA device plugin should expose GPU resources consistently. You should know what resource names workloads request, especially if MIG or time-slicing is enabled.

How are GPU nodes labeled?

Node labels should capture GPU class, node pool purpose, MIG strategy if relevant, and anything else needed for scheduling decisions.

How are GPU nodes isolated?

Use taints, tolerations, node pools, quotas, and policy so random workloads do not land on expensive GPU nodes.

How are GPU metrics collected?

DCGM Exporter should feed your observability stack. Model-server metrics should sit beside GPU metrics so you can connect hardware behavior to LLM behavior.

What sharing mode is allowed?

Full GPU, MIG, time-slicing, and MPS are different operational choices. Do not let teams discover the difference after latency falls apart.

How do you validate a node before using it?

Have a smoke test for CUDA, device plugin resources, labels, DCGM metrics, and a small model-server startup path.

Which workloads are allowed on which GPU classes?

A small embedding service, an internal chatbot, a batch summarization job, and a large production model should not all be scheduled with the same policy.

Turn that baseline into a small verification routine:

kubectl get pods -n gpu-operator
kubectl describe node <gpu-node-name>
kubectl apply -f cuda-vectoradd.yaml
kubectl logs pod/cuda-vectoradd
kubectl get svc -n gpu-operator | grep -i dcgm

By the end of this check, you should know whether the node is Ready, the operator components are running, the device plugin advertises GPU resources, the node has useful GPU labels, GPU workloads can start, DCGM metrics are available, and your sharing mode is explicit.
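If you want to automate part of that routine, one option is to inspect the node object itself. The Python sketch below works on the dictionary shape that kubectl get node <gpu-node-name> -o json returns. The sample node is hypothetical, and the report fields are names chosen for this example, not a standard.

```python
# Sketch: summarize what a GPU node tells Kubernetes.
# SAMPLE_NODE mimics the shape of `kubectl get node <name> -o json`;
# its values are hypothetical.

SAMPLE_NODE = {
    "metadata": {
        "name": "gpu-node-1",
        "labels": {
            "nvidia.com/gpu.count": "1",
            "nvidia.com/gpu.product": "Tesla-T4-SHARED",
            "nvidia.com/gpu.replicas": "4",
        },
    },
    "spec": {
        "taints": [{"key": "accelerator", "value": "nvidia", "effect": "NoSchedule"}]
    },
    "status": {
        "allocatable": {"cpu": "32", "nvidia.com/gpu": "4"},
        "conditions": [{"type": "Ready", "status": "True"}],
    },
}


def gpu_node_report(node):
    labels = node["metadata"].get("labels", {})
    allocatable = node["status"].get("allocatable", {})
    conditions = node["status"].get("conditions", [])
    taints = node.get("spec", {}).get("taints", [])
    return {
        "ready": any(
            c["type"] == "Ready" and c["status"] == "True" for c in conditions
        ),
        "allocatable_gpus": int(allocatable.get("nvidia.com/gpu", "0")),
        "product": labels.get("nvidia.com/gpu.product"),
        # More than one replica per GPU means the capacity is shared.
        "time_sliced": int(labels.get("nvidia.com/gpu.replicas", "1")) > 1,
        "tainted": any(t["effect"] == "NoSchedule" for t in taints),
    }


report = gpu_node_report(SAMPLE_NODE)
for field, value in report.items():
    print(f"{field}: {value}")
```

A report like this does not replace the CUDA smoke test or a model-server startup test. It only confirms that the node describes itself the way you expect.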

This baseline is boring on purpose. Most production incidents are not caused by exotic scheduler theory. They are caused by a missing label, a wrong driver, an unisolated node pool, a bad sharing assumption, or a metric nobody collected.

The pod starts late in the story

By the time an LLM pod starts, many decisions have already been made.

The node pool decided what hardware exists. The driver stack decided whether CUDA works. The device plugin decided what resources Kubernetes can allocate. Feature discovery decided what labels describe the node. Taints and tolerations decided who is allowed to land there. MIG, MPS, or time-slicing decided what "a GPU" means on that node. DCGM decided what you can observe. The model server will decide how efficiently the allocated GPU is used.

The pod is where all of those decisions meet.

That is why GPU node setup deserves its own article. It is not glamorous, and it is not the full LLM platform. But if this layer is wrong, everything above it becomes harder: vLLM, Triton, TensorRT-LLM, KServe, Ray, autoscaling, routing, cost control, latency tuning, and multi-tenancy.

Kubernetes can schedule LLM workloads only as well as the cluster describes its GPU capacity.

So before you ask why your LLM pod is slow, unstable, expensive, or impossible to place, ask a simpler question:

What did the GPU node actually tell Kubernetes before the pod started?

Continue the series

This is Part 4 of my practical series on hosting large LLMs on Kubernetes. The next parts will move from GPU node setup into real-world scaling stories, model servers, KV cache, batching, scheduling, autoscaling, latency, cost, and production architecture.

I am also preparing a free LLM Serving on Kubernetes Production Readiness Checklist with the questions platform teams should ask before putting an LLM workload in production. Subscribe to the newsletter and I will share it when it is ready.

How Do You Fit a Trillion-Parameter Model Into a Kubernetes Cluster?

Pawan Kumar — Thu, 28 May 2026 03:32:13 +0000

Series links

Part 1: Everything You Know About Scaling Web Apps Breaks When You Serve an LLM

Part 2: The Request Is the Wrong Unit of Scale for LLMs on Kubernetes

A giant model does not "run in a pod."

That sentence sounds wrong if you have spent years thinking in Kubernetes objects. We package software into containers. We run containers in pods. We schedule pods onto nodes. We put Services in front of them. That model works well when the thing inside the container is a web server, worker, queue consumer, or API process.

Then someone says, "Can we host a trillion-parameter model on Kubernetes?"

The honest answer is: yes, but not in the way your brain first pictures it. A trillion-parameter model is not one neat process sitting inside one neat pod, waiting for the kubelet to give it enough CPU and memory. It is a pile of weights, communication patterns, parallel workers, GPU memory limits, interconnect assumptions, and serving-engine decisions. Kubernetes can coordinate the outer shape, but the model itself has to be split.

Part 1 of this series argued that LLM serving is not normal web serving. Part 2 argued that requests are the wrong unit of scale because tokens are the work. Part 3 is about the next uncomfortable step: once the model is large enough, a replica is no longer a pod. A replica may be a group of GPUs, a node, several nodes, or a slice of a much larger GPU cluster working together.

The pod is just the envelope. The model is the distributed system.

Start with the boring math

Before tensor parallelism, pipeline parallelism, expert parallelism, Ray, vLLM, KServe, MPI, NCCL, or any Kubernetes YAML, there is a dumb memory question:

How many bytes do the weights need?

The rough formula is simple:

weight memory = number of parameters x bytes per parameter

That is only the model weights. It does not include KV cache, runtime overhead, CUDA graphs, activations during training, optimizer states, communication buffers, fragmentation, or the serving engine's own memory reservations. But for the first pass, it is enough to kill bad assumptions.

A 1 trillion parameter dense model stored in FP16 or BF16 needs about:

1,000,000,000,000 parameters x 2 bytes = 2,000,000,000,000 bytes

That is roughly 2 TB of raw weight memory.

Not disk. Not object storage. GPU-addressable memory.

If you had 80 GB GPUs, 2 TB of raw weights already needs 25 GPUs before overhead. If you had 141 GB H200-class GPUs, it still needs about 15 GPUs just for the weights. That does not mean the model is usable with exactly that many GPUs. It means this is the floor before the real serving problem begins.
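The same floor calculation as a small Python sketch. It uses decimal gigabytes to match the numbers above and counts weights only, so the result is a lower bound, not a sizing recommendation.

```python
# Sketch: minimum GPU count needed just to hold the raw weights.

GB = 10**9


def weight_bytes(params, bytes_per_param):
    return params * bytes_per_param


def min_gpus_for_weights(total_bytes, gpu_memory_bytes):
    """Ceiling division: a partially used GPU still counts as a whole GPU."""
    return -(-total_bytes // gpu_memory_bytes)


one_trillion_fp16 = weight_bytes(params=10**12, bytes_per_param=2)

gpus_80 = min_gpus_for_weights(one_trillion_fp16, 80 * GB)
gpus_141 = min_gpus_for_weights(one_trillion_fp16, 141 * GB)

print(f"raw weights: {one_trillion_fp16 / 10**12:.1f} TB")
print(f"80 GB GPUs needed for weights alone: {gpus_80}")
print(f"141 GB GPUs needed for weights alone: {gpus_141}")
```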

This is where normal Kubernetes thinking starts to mislead people. A pod can request memory. A pod can request nvidia.com/gpu: 8. Kubernetes can place that pod on a node with enough advertised GPU devices. But Kubernetes does not magically make one process treat 25 separate GPUs as one giant GPU. The serving engine and distributed runtime have to do that work.

Kubernetes schedules access to hardware. It does not shard the model for you.

FP16 and BF16 are not small

FP16 and BF16 are often discussed as if they are memory optimizations, and compared with FP32, they are. FP32 uses 4 bytes per parameter. FP16 and BF16 use 2 bytes. Cutting weight memory in half is a big deal.

But at trillion-parameter scale, half of enormous is still enormous.

A 175B parameter model in FP16 or BF16 is about 350 GB of raw weights. That already does not fit on a single common GPU. A 671B parameter model is about 1.34 TB in FP16 or BF16. A 1T dense model is about 2 TB. A 1.8T dense model would be about 3.6 TB.

Quantization changes the math. FP8 brings 1T parameters down to about 1 TB of raw weights. FP4 brings it down to about 500 GB. NVIDIA's trillion-parameter inference write-up uses an example GPT MoE model with 1.8T parameters stored with FP4, where the raw weights are about 900 GB. On 192 GB GPUs, the theoretical minimum just to hold those weights is five GPUs.

That sounds surprisingly small until you remember the word "minimum."

Five GPUs may hold the weights. They may not generate tokens fast enough. They may not leave enough room for KV cache. For long-context models, KV cache can consume tens of gigabytes per busy replica, and at real concurrency it competes directly with weights for GPU memory. The same five GPUs may also communicate too much. They may deliver terrible time to first token. They may support one beautiful demo and then fall apart under real traffic.

The memory floor tells you whether a deployment is possible. It does not tell you whether it is good.
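To see why KV cache competes with weights, here is a rough estimate using the usual per-token accounting for a transformer with standard attention: one key and one value vector per layer per KV head. The model dimensions are hypothetical, chosen only for illustration, and real engines add paging and allocator overhead on top.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Key + value storage for one token across all layers (2 = K and V)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gb(tokens_in_flight: int, **model_dims) -> float:
    """KV cache size in decimal GB for all tokens currently held on the replica."""
    return tokens_in_flight * kv_bytes_per_token(**model_dims) / 1e9


# Hypothetical model: 80 layers, 8 KV heads of dimension 128, 2-byte cache values.
dims = dict(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_value=2)

print(kv_bytes_per_token(**dims))                # 327680 bytes per token
print(round(kv_cache_gb(32 * 8192, **dims), 1))  # 32 sequences x 8,192 tokens -> 85.9
```

Under these assumptions, 32 concurrent 8k-token sequences need more memory than an entire 80 GB GPU offers, before a single weight is loaded.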

The model gets split because memory is only one problem

When a model does not fit on one GPU, there are two broad things you can do.

You can make the model smaller: quantize it, distill it, pick a smaller checkpoint, reduce context length, use adapters, or route some traffic to a cheaper model. Those are valid and often the right production choices, but that is not the focus of this part.

Or you can split the model across GPUs.

Splitting is where the word "replica" becomes slippery. In a normal web app, one replica is usually one pod. In LLM serving, one model replica may require multiple GPU workers that must cooperate for every token. If one worker is slow, missing, placed badly, or stuck behind a bad network path, the whole replica suffers.

That is why large-model serving feels less like "run N pods" and more like "assemble a tiny supercomputer for each serving replica."

There are three important forms of model splitting to understand:

Tensor parallelism
Pipeline parallelism
Expert parallelism

They are often combined with data parallelism, where you run multiple independent replicas of the sharded model to serve more traffic. Data parallelism is easy to understand once the model replica fits somewhere. The hard part is making one replica exist in the first place.

Tensor parallelism splits the inside of a layer

Tensor parallelism splits individual tensors inside a model layer across GPUs. Instead of putting the full matrix multiplication for a transformer layer on one GPU, the serving engine divides the layer's work across several GPUs and combines the result.

This is useful because transformers have large matrix operations that can be partitioned. Megatron-LM popularized this style of tensor model parallelism for GPT-like models, and vLLM's distributed serving documentation still points to Megatron-LM's tensor parallel algorithm as the implementation basis.

A simple mental model:

One transformer layer
  GPU 0: owns one slice of the weight matrix
  GPU 1: owns another slice
  GPU 2: owns another slice
  GPU 3: owns another slice

For every token step, the GPUs compute their slices and then exchange partial results. This is powerful, but it is not free. Tensor parallelism depends heavily on fast GPU-to-GPU communication. Inside a node with NVLink or another high-bandwidth interconnect, it can work well. Across nodes, the communication cost can get ugly.

That is why many practical guides recommend keeping tensor parallelism inside a node when possible. vLLM's scaling guidance gives the same shape: if the model is too large for one GPU but fits on one multi-GPU machine, use tensor parallelism. If you have 4 GPUs in the node, tensor_parallel_size=4 is the obvious starting point.

Tensor parallelism makes one layer wider across GPUs. It helps with memory and per-token compute, but it ties those GPUs together tightly. They are not independent pods anymore. They are pieces of one inference machine.
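A toy version of the idea, in plain Python with lists standing in for GPU tensors: the weight matrix is split by columns, each simulated GPU multiplies its own slice, and concatenating the partial outputs stands in for the communication step. This is only a sketch of the column-split case. Real engines run GPU kernels, use collective operations, and also split other matrices in other ways.

```python
def matmul(x, w):
    """Plain matrix multiply for lists of lists: x is (m x k), w is (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, parts):
    """Give each simulated GPU a contiguous slice of the weight matrix columns."""
    step = len(w[0]) // parts  # assumes the column count divides evenly
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def tensor_parallel_matmul(x, w, parts):
    shards = split_columns(w, parts)
    partials = [matmul(x, shard) for shard in shards]  # one slice per "GPU"
    # Concatenating the slices stands in for the gather step between GPUs.
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]              # 2 tokens, hidden size 2
w = [[1, 0, 2, 1], [0, 1, 1, 3]]  # hidden size 2 -> 4 output features

print(tensor_parallel_matmul(x, w, parts=4))  # [[1, 2, 4, 7], [3, 4, 10, 15]]
```

The gather happens on every layer of every token step, which is why the interconnect between those GPUs matters so much.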

Pipeline parallelism splits the stack of layers

Pipeline parallelism cuts the model along its depth, by layers.

Instead of every GPU participating in every layer, one GPU or group of GPUs owns the early layers, another owns the middle layers, and another owns the later layers. A request moves through those stages like work moving through an assembly line.

A rough picture:

Stage 1: layers 1-20    -> GPU group A
Stage 2: layers 21-40   -> GPU group B
Stage 3: layers 41-60   -> GPU group C
Stage 4: layers 61-80   -> GPU group D

Pipeline parallelism is attractive when the model cannot fit within one node. Instead of stretching tensor parallelism across a slow boundary, you can keep tensor parallelism inside each node and use pipeline parallelism across nodes. NVIDIA's Megatron work describes exactly this pattern: tensor parallelism works well within a DGX A100 node, while pipeline parallelism helps scale across nodes because it uses a different communication pattern.

vLLM's current docs give a practical serving version of the same idea. For 2 nodes with 8 GPUs each, set tensor parallelism to 8 and pipeline parallelism to 2. In plain English: split each layer across the 8 GPUs inside a node, then split the model's layers across the 2 nodes.

tensor_parallel_size = GPUs per node
pipeline_parallel_size = number of nodes

That rule is not sacred, but it is a good first mental model.
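Both the stage picture and the rule of thumb can be written down in a few lines. The function names are hypothetical, and the even split is an assumption: real engines may balance stages differently, for example to account for embedding or output layers.

```python
def stage_layer_ranges(num_layers: int, num_stages: int):
    """Split layers 1..num_layers into contiguous, near-equal pipeline stages."""
    base, extra = divmod(num_layers, num_stages)
    ranges, start = [], 1
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)  # early stages absorb the remainder
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def first_guess_parallelism(num_nodes: int, gpus_per_node: int) -> dict:
    """Rule of thumb from the text: tensor parallel inside a node, pipeline across."""
    return {
        "tensor_parallel_size": gpus_per_node,
        "pipeline_parallel_size": num_nodes,
    }


print(stage_layer_ranges(80, 4))      # [(1, 20), (21, 40), (41, 60), (61, 80)]
print(first_guess_parallelism(2, 8))  # TP 8, PP 2 for two 8-GPU nodes
```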

Pipeline parallelism also introduces its own pain. Pipelines can have bubbles, where some stages sit idle while waiting for work. Training systems fight this with microbatches and scheduling tricks. In inference, the serving engine may keep stages busier by feeding different requests through the pipeline continuously, but that depends on batching, traffic shape, and implementation. The operational point is simple: the more stages you add, the more the model starts behaving like a distributed workflow instead of a containerized API.

Kubernetes can keep the pods alive. The serving engine has to keep the pipeline full.

In that shape, a "replica" is already bigger than the mental model most platform teams start with. The serving replica is not one container. It is a coordinated set of ranks, workers, and devices.

Expert parallelism exists because MoE models are weird

Mixture-of-Experts models add another twist.

A dense model usually uses the same parameters for every token. If the model has 70B parameters, each token flows through that dense stack. If the model has 1T dense parameters, the serving system must deal with a terrifying amount of memory and compute.

MoE models are different. They contain many expert feed-forward networks, and a router chooses which experts handle each token. This creates a model with a huge total parameter count, but only a fraction of those parameters are active for any one token.

This is the line that saves people from a lot of confusion:

Trillion parameters does not always mean trillion-parameter compute per token.

The Switch Transformer paper made this idea famous at large scale. It describes MoE models as sparsely activated: huge parameter counts, but roughly constant computation because each token is routed to a small number of experts. Switch simplified the routing further by sending each token to one expert.

Modern public models show the same idea in a more familiar form. DeepSeek-V3 is reported as a 671B parameter MoE model, but only 37B parameters are activated for each token. That does not make the model "really 37B." The inactive experts still exist. Their weights still need to live somewhere. But the compute path for one token is much smaller than the total parameter count suggests.

This distinction matters for capacity planning. Total parameters drive storage and placement. Active parameters drive per-token compute. Both matter, but they are not the same number.
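A small sketch keeps the two numbers apart. It assumes 2 bytes per parameter, as in the FP16/BF16 figures earlier, and uses the common rough approximation of about 2 FLOPs per active parameter per generated token, which ignores attention over the context. The function name is hypothetical.

```python
def moe_footprint(total_params: float, active_params: float,
                  bytes_per_param: float = 2.0) -> dict:
    """Separate the storage number from the per-token compute number."""
    return {
        "weight_gb": total_params * bytes_per_param / 1e9,  # driven by total params
        "gflops_per_token": 2 * active_params / 1e9,        # driven by active params
        "active_fraction": active_params / total_params,
    }


dense_1t = moe_footprint(total_params=1e12, active_params=1e12)
moe_671b = moe_footprint(total_params=671e9, active_params=37e9)

print(dense_1t["weight_gb"], dense_1t["gflops_per_token"])  # 2000.0 2000.0
print(moe_671b["weight_gb"], moe_671b["gflops_per_token"])  # 1342.0 74.0
```

The MoE model still needs well over a terabyte of placement, but its estimated per-token compute is a small fraction of the dense model's.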

Expert parallelism is the systems trick that places different experts on different GPUs or nodes. When tokens are routed to experts, the serving system sends token representations to the devices that own those experts, runs the expert computation, and combines the results back into the model flow.

That creates a new bottleneck: token routing and all-to-all communication. If the router sends too much traffic to one expert, that expert becomes hot. If experts are spread across nodes, the network starts carrying token activations around the cluster. If the serving engine does not overlap communication and compute well, the GPUs wait.

MoE is not magic. It trades dense compute for routing, load balancing, memory placement, and communication.
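A toy simulation shows how one hot expert turns into one hot GPU. Everything here is an assumption made for illustration: top-1 routing drawn from a fixed distribution, 16 experts placed contiguously on 4 GPUs, and an artificially popular expert 0. Real routers are learned and real placements vary.

```python
import random
from collections import Counter


def route_tokens(num_tokens, num_experts, weights, seed=0):
    """Top-1 routing stand-in: pick one expert per token from a fixed distribution."""
    rng = random.Random(seed)
    return Counter(rng.choices(range(num_experts), weights=weights, k=num_tokens))


def load_per_gpu(expert_counts, num_experts, num_gpus):
    """Experts are placed contiguously: the first block on GPU 0, and so on."""
    experts_per_gpu = num_experts // num_gpus
    loads = [0] * num_gpus
    for expert, count in expert_counts.items():
        loads[expert // experts_per_gpu] += count
    return loads


# Expert 0 is artificially "hot": 8x as likely as any other expert.
weights = [8] + [1] * 15
counts = route_tokens(num_tokens=10_000, num_experts=16, weights=weights)
loads = load_per_gpu(counts, num_experts=16, num_gpus=4)

mean_load = sum(loads) / len(loads)
print("tokens per GPU:", loads)
print("imbalance (max / mean):", round(max(loads) / mean_load, 2))
```

The GPU that owns the hot expert sets the pace for the whole step, so the imbalance ratio is a rough proxy for wasted capacity on the other GPUs.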

A trillion-parameter MoE is not the same as a trillion-parameter dense model

This is worth slowing down on because marketing numbers blur it.

Imagine two models:

Model A: 1T dense parameters
Model B: 1T total MoE parameters, 50B active per token

Both can be called trillion-parameter models. They are not the same infrastructure problem.

Model A needs the serving system to carry the memory and compute burden of the full dense stack. Every token touches the model in a much more uniform way.

Model B still needs the cluster to store the full set of experts, but each token activates only a subset. The challenge shifts toward routing, expert placement, load balancing, and making sure the right GPUs communicate quickly enough.

This is why a model card's parameter count is only the beginning of the conversation. For serving, you also want to know:

Is it dense or MoE?
How many total parameters are there?
How many parameters are active per token?
What precision are the weights stored in?
How long is the context window?
How large is the KV cache at your expected concurrency?
Does the serving engine support the model's parallelism pattern well?
Can your node and network topology support the communication pattern?

If you only ask, "How many parameters?" you will size the cluster badly.

What Kubernetes actually sees

Kubernetes does not see tensor slices, pipeline stages, or experts. It sees pods, containers, resources, nodes, labels, taints, tolerations, Services, volumes, and health checks.

For GPUs, Kubernetes normally depends on the device plugin framework. A vendor device plugin registers resources like nvidia.com/gpu with the kubelet. The kubelet advertises those resources on the node. A pod requests them through resource limits. The scheduler places the pod on a node that can satisfy the request.

That is useful, but it is a lower-level contract than many people assume.

Kubernetes can say:

resources:
  limits:
    nvidia.com/gpu: 8

It cannot infer that those 8 GPUs should form tensor parallel group 0, that another node should form pipeline stage 1, that experts 0-63 should live on one rank group, or that the network path between two stages is now your latency bottleneck.

Those choices happen in the serving layer: vLLM, TensorRT-LLM, Triton or Dynamo Triton, SGLang, TGI, Ray, KServe, llm-d, custom launch scripts, or whatever stack your team chooses. Kubernetes is still important, but its job is orchestration around the model, not inside the model.

This is the split I find useful:

Kubernetes decides where the workers run.
The serving engine decides how the model is split.
The interconnect decides whether the split is fast enough.

Miss any one of those, and the deployment becomes fragile.

One pod, many pods, or one distributed replica?

There are a few common deployment shapes.

The first is a single pod requesting multiple GPUs on one node. This is the simplest shape for models that fit within one machine. A vLLM server might run with --tensor-parallel-size 4 or --tensor-parallel-size 8, and the pod requests the same number of GPUs. Kubernetes schedules one pod. Inside that pod, the serving engine starts multiple GPU workers.

This is operationally pleasant because the failure domain is clean. The pod is up or down. The node has the GPUs or it does not. The model weights are local or mounted. You still have complexity, but it is contained.

The second shape is one distributed replica spread across multiple pods or nodes. This is what you need when the model or desired serving shape exceeds one node. Now you need coordinated startup, rank assignment, service discovery, identical images, shared model paths or download behavior, and careful placement. If one part of the replica is missing, the replica is not healthy.

This is where Kubernetes starts to need help from higher-level controllers or conventions. StatefulSets, headless Services, Ray clusters, KServe runtimes, LeaderWorkerSet, job-style launchers, or custom operators can all appear depending on the stack. The exact tool matters less than the invariant: the workers are not independent replicas. They are shards of one replica.

The third shape is data parallel replicas of sharded replicas. For example, you may run four independent model replicas, and each replica uses 16 GPUs internally. That gives you 64 GPUs total, but the scheduling unit is not "64 independent pods." It is four coordinated groups.

This is where platform teams need to be very careful with autoscaling language. Scaling from 4 replicas to 5 may mean adding 16 GPUs and starting a full distributed group. It may require model weight loading, rank coordination, cache warmup, and traffic shifting. It is not the same as adding one more stateless web pod.
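The scaling arithmetic looks trivial, but writing it down makes the unit explicit. The class and function below are hypothetical. The replica shape (8-way tensor parallel, 2 pipeline stages) is one assumed way to arrive at the 16 GPUs per replica used in the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaShape:
    tensor_parallel: int
    pipeline_parallel: int

    @property
    def gpus(self) -> int:
        """GPUs that must all be present for one replica to serve traffic."""
        return self.tensor_parallel * self.pipeline_parallel


def gpus_to_add(shape: ReplicaShape, current_replicas: int, desired_replicas: int) -> int:
    """Scaling up happens in whole groups, never in single GPUs."""
    return max(desired_replicas - current_replicas, 0) * shape.gpus


shape = ReplicaShape(tensor_parallel=8, pipeline_parallel=2)
print(shape.gpus * 4)            # 64 GPUs for four replicas
print(gpus_to_add(shape, 4, 5))  # 16 GPUs to go from 4 replicas to 5
```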

The network is part of the model now

With normal web services, the network matters. With distributed LLM inference, the network is part of the model's execution path.

Tensor parallelism needs frequent GPU-to-GPU communication. Pipeline parallelism moves activations between stages. Expert parallelism can create all-to-all traffic between tokens and expert owners. NCCL, or the equivalent communication layer in your stack, becomes part of the serving path. Multi-node serving depends on bandwidth, latency, topology, and how well the serving engine overlaps communication with compute.

This is why "we have 32 GPUs in the cluster" is not enough information. Are they eight GPUs in four nodes? Four GPUs in eight nodes? Do they have NVLink inside the node? What is the NIC? Is RDMA available? Are the nodes in the same placement group or rack? Are you crossing noisy network boundaries? Is the storage path going to make every cold start painful?

A cluster with the same GPU count can behave like a different machine depending on topology.
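As a rough illustration, the check below asks how many replicas fit if every tensor-parallel group must stay inside a single node. The function is hypothetical and deliberately simple: it ignores racks, NICs, and which nodes sit near each other.

```python
def replicas_that_fit(gpus_per_node, tensor_parallel, pipeline_parallel):
    """Count replicas when each tensor-parallel group must sit inside one node."""
    tp_groups = sum(gpus // tensor_parallel for gpus in gpus_per_node)
    return tp_groups // pipeline_parallel  # one TP group per pipeline stage


eight_gpu_nodes = [8] * 4  # 32 GPUs as four 8-GPU nodes
four_gpu_nodes = [4] * 8   # 32 GPUs as eight 4-GPU nodes

print(replicas_that_fit(eight_gpu_nodes, tensor_parallel=8, pipeline_parallel=2))  # 2
print(replicas_that_fit(four_gpu_nodes, tensor_parallel=8, pipeline_parallel=2))   # 0
```

Both clusters have 32 GPUs. Only one of them can host this replica shape at all.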

For smaller models, Kubernetes scheduling may feel like bin packing. For giant models, it becomes topology-aware placement. Not just any 16 GPUs will do. You need 16 GPUs arranged in a way that matches the communication pattern of the model.

That is one reason large AI clusters often feel more rigid than normal Kubernetes clusters. The workload cares about where things are, not only whether resources exist.

Why loading the model is its own event

A 2 TB model is painful while serving. It is also painful while starting.

The weights have to come from somewhere: container image layers, persistent volumes, object storage, local NVMe, a model cache, or a preloaded node image. The file format matters too. A memory-mappable format like safetensors behaves differently from formats that need heavier deserialization before the model is usable. Pulling, reading, mapping, transferring to GPU memory, initializing kernels, building CUDA graphs, and warming the serving engine can dominate startup time.

This changes how you think about pod restarts and autoscaling.

In a web app, a new pod might become useful in seconds. For a giant LLM, a new replica may take minutes. If the weights are remote and the cache is cold, it can be worse. If the replica spans nodes, all workers need to agree on their ranks and become ready together. One slow worker can delay the group.

Kubernetes readiness probes are necessary, but they are not the whole story. A pod can exist before the model is loaded. A container can be running before the GPU workers are ready. A distributed group can have seven healthy workers and still be unusable because the eighth worker failed.
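The group-level rule is easy to state in code. This is a hypothetical helper, not a Kubernetes or serving-engine API: it assumes something upstream has already collected one readiness flag per worker.

```python
def group_ready(worker_ready: dict, expected_workers: int) -> bool:
    """A distributed replica is ready only when every expected worker is ready."""
    return len(worker_ready) == expected_workers and all(worker_ready.values())


# Seven healthy workers and one failed worker: the replica cannot serve.
workers = {f"rank-{i}": True for i in range(8)}
workers["rank-7"] = False
print(group_ready(workers, expected_workers=8))  # False

workers["rank-7"] = True
print(group_ready(workers, expected_workers=8))  # True
```

Checking the expected count matters as much as the flags themselves, because a worker that never registered looks the same as one that does not exist.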

That is why production LLM serving often needs warm pools, minimum replicas, local weight caches, careful rollout strategy, and boring operational patience. The deployment is not ready when the pod starts. It is ready when the model can actually generate tokens at the latency you promised.

A practical sizing walkthrough

Suppose someone asks for a 1T parameter model on Kubernetes.

The first question is not YAML. It is model shape.

If it is a dense 1T model in BF16, you start with about 2 TB of raw weights. On 80 GB GPUs, that is 25 GPUs before overhead. In reality, you probably need more. You also need room for KV cache, which grows with context length and concurrency. If the model supports lower precision weight formats without unacceptable quality loss, quantization may reduce the floor, but it does not remove the distributed serving problem.

If it is a 1T MoE model, the next questions change. How many experts exist? How many are active per token? Are the attention layers dense while the feed-forward layers are sparse? What does the serving engine support: tensor parallel attention, expert parallel MoE layers, data parallel attention, or some hybrid? How much all-to-all traffic appears when real prompts arrive?

Then you map it to topology.

If one node has 8 GPUs with fast intra-node links, tensor parallelism across 8 GPUs is a natural first shape. If the model needs more than one node, you may use tensor parallelism inside each node and pipeline parallelism across nodes. If it is MoE, you may place experts across devices using expert parallelism and still combine that with tensor or data parallel attention.

Only after that does Kubernetes enter the center of the conversation.

You need nodes labeled by GPU type and topology. You need the NVIDIA device plugin, deployed directly or through the GPU Operator, exposing devices. You need pod specs or higher-level controllers that request the right GPU count. You need placement rules so the workers land together. You need model storage that does not turn every restart into a download storm. You need readiness that understands distributed health. You need metrics that report tokens, KV cache, queueing, and per-worker failures, not just pod CPU.

The YAML is the last mile. The architecture decision happened before it.

Where teams usually get surprised

The first surprise is that GPU count is not capacity. It is potential capacity. Capacity appears only when the GPUs are arranged, connected, loaded, and driven correctly.

The second surprise is that one model replica can be a group. Platform teams like replicas because replicas sound independent. With large LLMs, the word can hide coordination. A "replica" might be 8 pods. Or 16 GPUs. Or 2 nodes. Or a combination of tensor, pipeline, and expert parallel ranks that must agree before anything works.

The third surprise is that MoE parameter counts are easy to misread. A 671B total parameter model with 37B active parameters per token is not lying. It is telling you two different infrastructure facts at once. You need enough memory and placement for the large total model, but the per-token compute path is sparse.

The fourth surprise is that the scheduler is not the serving system. Kubernetes can place pods on GPU nodes. It cannot decide the model parallel strategy. It cannot make a bad tensor-parallel layout fast. It cannot fix a network topology that does not match the workload.

This is why serious LLM-on-Kubernetes work ends up crossing boundaries that platform teams used to keep separate: scheduler behavior, GPU topology, model architecture, serving-engine internals, storage layout, rollout strategy, and latency SLOs.

The Kubernetes mental model that works better

Do not picture a trillion-parameter model as a container image.

Picture it as a distributed runtime that Kubernetes happens to host.

The model weights are split. The computation is split. The memory pressure is split. The failure modes are split. The serving engine owns the model-parallel details. Kubernetes owns the outer lifecycle: placement, resources, health, rollout, identity, networking, storage, and integration with the rest of the platform.

That does not make Kubernetes the wrong tool. It makes Kubernetes the substrate, not the magic trick.

For smaller LLMs, you can get away with thinking "one pod equals one model server." For larger models, use a different sentence:

One serving replica is a coordinated GPU group.

That group may live in one pod on one node. It may live across several pods and nodes. It may combine tensor parallelism, pipeline parallelism, expert parallelism, and data parallelism. But if it takes all of those workers to generate one token stream, treat the group as the unit you operate.

That shift makes the rest of the architecture less surprising.

Autoscaling becomes group scaling. Rollouts become distributed rollouts. Readiness becomes model readiness, not process readiness. Capacity planning starts with bytes and tokens, not pod count. Scheduling starts caring about topology. Observability moves from CPU and memory to TTFT, TPOT, tokens per second, KV cache, queue depth, per-rank health, and GPU utilization.

A giant model does not run in a pod. It runs across a shape.

Kubernetes can manage that shape, but only if you tell it what the shape is.

Continue the series

The next part goes one layer lower into the machines themselves: GPU nodes, device plugins, feature discovery, MIG, MPS, time slicing, labels, taints, and what Kubernetes must know before it schedules an LLM.

If you are working through LLM serving on Kubernetes, subscribe to get the next part. I am also putting together a free LLM Serving on Kubernetes Production Readiness Checklist that turns these ideas into a practical review path for teams.

And if your team is already trying to serve large models on Kubernetes, this is the kind of architecture decision worth reviewing before the cloud bill becomes the incident report.

The Request Is the Wrong Unit of Scale for LLMs on Kubernetes

Pawan Kumar — Thu, 21 May 2026 03:32:01 +0000

Series links

Part 1: Everything You Know About Scaling Web Apps Breaks When You Serve an LLM

Your dashboard says traffic is flat. Requests per second barely moved. CPU looks fine. Memory looks normal. The HPA is calm. Then latency starts drifting. Time to first token gets worse. GPU memory pressure rises. Queues grow. Users complain that the model is "thinking forever."

Part 1 introduced why LLM serving breaks the normal web-scaling model. Part 2 zooms into one reason: the HTTP request is only the envelope. The real work is token processing.

For a normal web app, a request is often a useful approximation of work. One request hits an API, does some bounded work, maybe talks to a database, returns JSON, and ends. LLMs do not behave like that. One request may contain a 20-token question and produce a 50-token answer. Another may contain a long system prompt, full chat history, retrieved documents, tool output, metadata, and a user asking for a 4,000-token report.

Both are one HTTP request.

They are not the same workload.

Kubernetes may see one request. Your ingress may see one request. Your API gateway may see one request. But the GPU sees tokens: prefill work, decode work, KV cache growth, memory pressure, queueing, and time spent generating output one token at a time.

Tokens are the work.

Why request count worked for web apps

Most platform teams grew up around request-based thinking. We look at requests per second, p95 latency, p99 latency, error rate, CPU usage, memory usage, queue depth, pod count, and replica count. That model works reasonably well for many web services because requests are often similar enough for capacity planning.

Not always, of course. A login request is not the same as an export request. A cached read is not the same as a database-heavy query. Every experienced SRE has seen one "simple" endpoint melt something important. But request count still gives a useful first approximation in many normal systems.

With LLM serving, that approximation breaks faster. A request does not tell you how long the prompt is, how many retrieved documents were added, how much chat history was included, how many output tokens the model generated, how much KV cache was needed, or how long the request occupied the GPU.

This is why a Kubernetes deployment can look stable at the HTTP layer while the model server is under real pressure. The API did not necessarily get more traffic.

The traffic got heavier.

Input tokens and output tokens are different problems

When people first hear "tokens are the unit of work," they often treat all tokens as one bucket. That is a good starting point, but it is not enough.

For serving, input tokens and output tokens stress the system differently.

At a high level, LLM inference has two phases:

Prefill
Decode

The prefill phase processes the input prompt. This includes the system prompt, developer instructions, chat history, retrieved documents, tool results, the user's message, and whatever formatting your application adds before the request reaches the model. The decode phase generates the response one token at a time. The model predicts a token, appends it to the sequence, uses that updated sequence to predict the next token, and keeps going until it hits a stop condition or a token limit.

A practical way to remember it:

Input tokens decide how heavy it is to start answering.

Output tokens decide how long the model stays busy.

Long input usually hurts time to first token because the model has to process the prompt before it can begin generating. Long output usually hurts total latency and capacity because the model stays in the generation loop longer. Streaming can make this feel better to the user, but streaming does not remove the backend work. It just lets the user watch the work happen.

This is why serious LLM serving metrics talk about time to first token, time per output token, inter-token latency, and tokens per second. NVIDIA's LLM benchmarking docs describe TTFT as the time before the first output token appears, and note that longer prompts generally increase TTFT because the input sequence has to be processed and the KV cache has to be created. Databricks' inference performance guidance also separates TTFT, TPOT, latency, and throughput instead of treating latency as one simple number.
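These metrics fall out of per-token timestamps. The sketch below uses one common set of definitions, with TPOT taken as the time after the first token divided by the remaining output tokens. The function name and the timestamps are made up for illustration, and individual tools may define the terms slightly differently.

```python
def latency_metrics(request_start: float, token_times: list) -> dict:
    """token_times: arrival time of each output token, in seconds, ascending."""
    if not token_times:
        raise ValueError("need at least one output token")
    n = len(token_times)
    ttft = token_times[0] - request_start  # dominated by prefill and queueing
    e2e = token_times[-1] - request_start
    tpot = (e2e - ttft) / (n - 1) if n > 1 else 0.0  # average decode step
    return {"ttft_s": ttft, "tpot_s": tpot, "e2e_s": e2e, "output_tokens": n}


# Hypothetical request: first token after 0.5 s, then one token every 50 ms.
m = latency_metrics(0.0, [0.50, 0.55, 0.60, 0.65, 0.70])
print(m)
```

A long prompt mostly moves `ttft_s`. A long answer mostly moves `e2e_s` while `tpot_s` stays roughly the same.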

A normal API request is one operation.

An LLM request is a sequence of token work.

One request can hide a huge amount of prompt assembly

The user does not usually send the real prompt.

The application builds it.

A user might type:

What is our refund policy for enterprise customers?

That looks tiny.

But by the time your application sends the request to the model, the prompt might include:

system prompt: 700 tokens
developer instructions: 400 tokens
chat history: 1,500 tokens
retrieved policy documents: 6,000 tokens
citations and metadata: 600 tokens
user question: 12 tokens
formatting instructions: 300 tokens

The user sent one short question. The model received more than 9,000 input tokens before it generated a single output token. That is one of the easiest mistakes to miss in production: teams measure the user message size, not the final assembled prompt size.
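Adding up the example makes the gap concrete. The dictionary keys are just labels for the components listed above.

```python
prompt_parts = {
    "system_prompt": 700,
    "developer_instructions": 400,
    "chat_history": 1_500,
    "retrieved_policy_documents": 6_000,
    "citations_and_metadata": 600,
    "user_question": 12,
    "formatting_instructions": 300,
}

total = sum(prompt_parts.values())
largest = max(prompt_parts, key=prompt_parts.get)

print(total)                                                      # 9512
print(largest)                                                    # retrieved_policy_documents
print(round(100 * prompt_parts["user_question"] / total, 2), "%")  # share typed by the user
```

The text the user actually typed is a fraction of one percent of what the model has to process.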

RAG makes this even more interesting. Retrieval-augmented generation is often described as a quality feature. The model gets relevant context, answers with better grounding, and can cite internal documents. That is true. But RAG is also an infrastructure multiplier.

Changing top_k from 4 chunks to 12 chunks may look like a harmless retrieval tuning change. No Kubernetes manifest changed. No model changed. Request count did not change. The product team may even see better answers. But now every request may carry thousands of extra input tokens. That can affect time to first token, GPU memory pressure, KV cache usage, batch composition, queueing delay, maximum concurrency, tail latency, and cost per interaction.
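A back-of-the-envelope version of that change, reusing the earlier example prompt. The chunk size is an assumption: it treats the 6,000 retrieved tokens as 4 chunks of about 1,500 tokens each, and keeps every other prompt component fixed.

```python
TOKENS_PER_CHUNK = 1_500     # assumption: 6,000 retrieved tokens / 4 chunks
FIXED_PROMPT_TOKENS = 3_512  # the example prompt minus retrieved documents


def input_tokens(top_k: int) -> int:
    """Input tokens per request for a given retrieval depth."""
    return FIXED_PROMPT_TOKENS + top_k * TOKENS_PER_CHUNK


before, after = input_tokens(4), input_tokens(12)
print(before, after, round(after / before, 2))  # 9512 21512 2.26
```

Under these assumptions, a one-line retrieval change more than doubles the prefill work of every request.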

This is why prompt assembly needs observability. You do not only want to know that a request had 9,000 input tokens. You want to know where those tokens came from: chat history, retrieved documents, tool results, system instructions, verbose metadata, tenant documents, or an agent flow that appends every intermediate step.

Without that breakdown, token growth stays invisible until latency tells you something is wrong.

Long context is a capacity decision

Long-context models are useful. They let you analyze larger documents, keep longer conversations, handle more retrieval context, and build richer workflows. But a large context window is not a target. It is a limit.

This sounds obvious, but many teams behave as if "the model supports 128k context" means "we can casually send 128k context." That is like saying a node has 1 TB of memory, so every process should try to use it.

Long context changes the shape of serving. A small number of long-context requests can consume enough GPU memory and serving time to affect everyone else. A chat session can become more expensive as history grows. An agent can quietly append tool traces until each turn becomes much heavier than the first one. A summarization feature can go from "summarize this page" to "summarize this folder of documents" without the HTTP request count changing at all.

The failure mode is subtle because the old dashboard may still look calm. RPS is flat, but p95 input tokens moved from 2,000 to 18,000.

That is not flat traffic.

A useful platform practice is to bucket prompts by size: short prompts, medium prompts, long prompts, very long prompts, and batch or offline prompts. The exact numbers depend on your model and hardware, but the habit matters more than the bucket boundaries.

A 500-token chat and a 50,000-token document analysis should not be treated as the same class of work just because both entered through /v1/chat/completions.
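A bucketing helper can be tiny. The boundaries below are placeholders, not recommendations. Pick values that match your model and hardware.

```python
import bisect

# Hypothetical boundaries in input tokens; each is the first value of the next bucket.
BUCKET_BOUNDS = [1_000, 8_000, 32_000]
BUCKET_NAMES = ["short", "medium", "long", "very_long"]


def prompt_bucket(input_tokens: int) -> str:
    """Map an input token count to a coarse size class for dashboards and routing."""
    return BUCKET_NAMES[bisect.bisect_right(BUCKET_BOUNDS, input_tokens)]


print(prompt_bucket(500))     # short
print(prompt_bucket(50_000))  # very_long
```

Once the bucket is a label on every request metric, latency by prompt size is a single group-by away.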

Context windows are limits, not goals.

Output length is not just a UX choice

Input tokens get a lot of attention because long prompts are easy to blame. But output tokens can be just as important for capacity planning.

Two requests can have the same input prompt and completely different backend cost depending on output length.

Request A:

1,000 input tokens
100 output tokens

Request B:

1,000 input tokens
2,000 output tokens

Same route. Same user flow. Same prompt size. Very different serving time.

The second request keeps the model generating for much longer. If the response is streamed, the user may start seeing text quickly, which is good. But the GPU is still occupied while the model continues decoding token after token.

This is why max_tokens is not only a product parameter. It is a capacity control.

If every request is allowed to generate 4,000 tokens, you have created a worst-case capacity problem even if most responses are shorter. If a feature asks the model to "write a detailed report," that is not the same workload as "answer this chat question." If agents are allowed to produce long reasoning traces, tool plans, summaries, and final answers, output length can grow quickly.

You should track both requested maximum output tokens and actual generated output tokens. Requested max output tokens show the capacity risk your system accepted. Actual output tokens show the work the model really performed. If many requests hit the output cap, your users may be getting truncated answers. If very few requests use the available budget, your default might be too generous.
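Both signals can be computed from per-request records. The record fields here are hypothetical. The sketch assumes the OpenAI-style convention where a `finish_reason` of `"length"` means the response stopped at the token cap.

```python
def output_budget_report(requests: list) -> dict:
    """Compare the output budget the system accepted with the work actually done."""
    capped = sum(1 for r in requests if r["finish_reason"] == "length")
    used = sum(r["output_tokens"] for r in requests)
    budget = sum(r["max_tokens"] for r in requests)
    return {
        "cap_hit_rate": capped / len(requests),  # likely truncated answers
        "budget_utilization": used / budget,     # how generous the default is
    }


sample = [
    {"max_tokens": 4_000, "output_tokens": 120, "finish_reason": "stop"},
    {"max_tokens": 4_000, "output_tokens": 4_000, "finish_reason": "length"},
    {"max_tokens": 4_000, "output_tokens": 300, "finish_reason": "stop"},
    {"max_tokens": 4_000, "output_tokens": 80, "finish_reason": "stop"},
]
report = output_budget_report(sample)
print(report)  # cap_hit_rate 0.25, budget_utilization 0.28125
```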

Output length is not formatting.

It is how long the request rents the GPU.

Same request count, completely different load

A useful dashboard should show when the same request count hides a different workload shape. For example:

Window A:

requests: 1,000
average input: 500 tokens
average output: 150 tokens
total token work: 650,000 tokens

Window B:

requests: 1,000
average input: 8,000 tokens
average output: 1,000 tokens
total token work: 9,000,000 tokens

Both windows show 1,000 requests. But the second window has almost 14x the token volume. If your dashboard only shows request count, it says traffic is flat. If your dashboard shows token volume, it says the workload changed completely.
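The comparison is a single multiplication per window, which is part of the point: the data is cheap to compute once token counts are recorded.

```python
def token_volume(requests: int, avg_input: int, avg_output: int) -> int:
    """Total tokens processed in a window: input plus output, across all requests."""
    return requests * (avg_input + avg_output)


window_a = token_volume(1_000, avg_input=500, avg_output=150)
window_b = token_volume(1_000, avg_input=8_000, avg_output=1_000)

print(window_a, window_b)              # 650000 9000000
print(round(window_b / window_a, 1))   # 13.8
```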

That is why the useful question is not only:

How many requests are we serving?

It is also:

How many input tokens are arriving, how many output tokens are being generated, and where are those tokens coming from?

What Kubernetes sees and what the model server feels

Kubernetes is very good at managing containers. It can schedule pods, restart failed workloads, apply resource requests and limits, spread replicas, roll out deployments, and attach workloads to GPU nodes. But Kubernetes does not automatically understand the shape of an LLM request. A pod can be healthy while the model server is struggling. CPU can look uninteresting while GPU memory is the real limit. Generic memory can look fine while KV cache is under pressure. Request count can look flat while token volume has exploded.

This is where the division of responsibility matters. Kubernetes gives you the orchestration layer. The model server gives you the LLM execution layer. The application builds the prompt. The platform team has to connect the signals.

If those layers do not share the right metrics, you end up scaling the wrong thing. For example, CPU-based HPA may be useful around some parts of the stack, but it is not enough to understand LLM serving capacity. A model server may expose more relevant metrics such as prompt tokens, generation tokens, time to first token, time per output token, queue time, number of running requests, number of waiting requests, and KV cache usage.

vLLM's production metrics are a good example of where the industry is moving. vLLM exposes metrics for prompt tokens, generation tokens, request prompt tokens, request generation tokens, time to first token, time per output token, request queue time, prefill time, decode time, KV cache usage, running requests, and waiting requests. That metric set tells you something important:

The production surface of LLM serving is already token-aware.

Your dashboard should be too.
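If you want to look at those signals before wiring up a full Prometheus stack, you can read the model server's metrics endpoint as text and pull out a few values. The sketch below is a rough illustration with several assumptions. The metric names are the ones I would expect from a vLLM `/metrics` endpoint, but they can differ between versions, so check your own output. The sample text and its values are invented. The parser ignores labels and assumes lines carry no trailing timestamp.

```python
def parse_prometheus_text(text: str) -> dict[str, float]:
    """Sum sample values per metric name, ignoring labels and comments."""
    totals: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Assumption: "name{labels} value" with no timestamp after the value.
        series, _, value = line.rpartition(" ")
        if not series:
            continue
        name = series.split("{", 1)[0]
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


# Invented sample in Prometheus text format, not real server output.
SAMPLE = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 6.0
vllm:num_requests_waiting{model_name="demo"} 14.0
vllm:prompt_tokens_total{model_name="demo"} 820000.0
vllm:generation_tokens_total{model_name="demo"} 95000.0
"""

metrics = parse_prometheus_text(SAMPLE)
prompt_share = metrics["vllm:prompt_tokens_total"] / (
    metrics["vllm:prompt_tokens_total"] + metrics["vllm:generation_tokens_total"]
)
```

In this made-up sample, roughly 90% of token volume is prompt tokens, which points at a prefill-heavy workload. Request count alone cannot tell you that.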

Token-based observability is not optional

If you are running LLM workloads on Kubernetes, request count still matters. You still need API-level metrics. You still care about errors, availability, saturation, queueing, and latency. But those metrics need token context.

At minimum, every request should give you:

input tokens
output tokens
total tokens
requested max output tokens
time to first token
time per output token or inter-token latency
end-to-end latency
queue time
model name
model version
deployment
tenant or team
route or feature
finish reason

If possible, also track prompt composition:

system prompt tokens
chat history tokens
retrieved context tokens
tool result tokens
user message tokens
metadata or formatting tokens

That breakdown is where many production surprises hide.
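The two lists above map naturally onto a per-request record. The sketch below shows one possible shape, assuming your application counts tokens per prompt section when it assembles the final prompt. Every field name and every example value here is hypothetical, not a schema from vLLM or any other tool.

```python
from dataclasses import dataclass, field


@dataclass
class PromptBreakdown:
    system: int = 0
    chat_history: int = 0
    retrieved_context: int = 0
    tool_results: int = 0
    user_message: int = 0
    formatting: int = 0

    def total(self) -> int:
        return (self.system + self.chat_history + self.retrieved_context
                + self.tool_results + self.user_message + self.formatting)


@dataclass
class RequestRecord:
    model_name: str
    model_version: str
    deployment: str
    tenant: str
    route: str
    requested_max_output_tokens: int
    output_tokens: int
    ttft_s: float
    e2e_latency_s: float
    queue_time_s: float
    finish_reason: str
    prompt: PromptBreakdown = field(default_factory=PromptBreakdown)

    @property
    def input_tokens(self) -> int:
        return self.prompt.total()

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

    @property
    def tpot_s(self) -> float:
        """Average time per output token after the first one."""
        if self.output_tokens <= 1:
            return 0.0
        return (self.e2e_latency_s - self.ttft_s) / (self.output_tokens - 1)


# Illustrative record for a RAG-style chat request.
example = RequestRecord(
    model_name="demo-model", model_version="v1", deployment="chat-a",
    tenant="team-x", route="/support-chat",
    requested_max_output_tokens=1024, output_tokens=300,
    ttft_s=0.8, e2e_latency_s=6.78, queue_time_s=0.1, finish_reason="stop",
    prompt=PromptBreakdown(system=400, chat_history=1200,
                           retrieved_context=3000, user_message=60,
                           formatting=40),
)
```

In this example the user typed 60 tokens and the model read 4,700. That gap is the reason to log the assembled prompt, not the user message.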

For dashboards, averages are not enough. Average token count can look stable while the tail gets ugly. You want p50, p95, and p99 for input tokens and output tokens. You want latency by token bucket. You want TTFT by input size. You want end-to-end latency by output size. You want to know whether a tenant is sending mostly short prompts or occasionally sending giant ones.
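A small sketch shows why the average hides the tail and how bucketing by input size separates the lines. It uses a simple nearest-rank percentile, which is one of several common definitions. The bucket edges and the sample data are hypothetical. In practice your metrics backend would compute these from histograms.

```python
import math
from collections import defaultdict


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; simple and good enough for a sketch."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


BUCKET_EDGES = [512, 2048, 8192]  # hypothetical input-token bucket edges


def bucket_label(input_tokens: int) -> str:
    for edge in BUCKET_EDGES:
        if input_tokens <= edge:
            return f"<={edge}"
    return f">{BUCKET_EDGES[-1]}"


def ttft_by_bucket(samples: list[tuple[int, float]], pct: float) -> dict[str, float]:
    """samples are (input_tokens, ttft_seconds) pairs."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for input_tokens, ttft in samples:
        grouped[bucket_label(input_tokens)].append(ttft)
    return {label: percentile(vals, pct) for label, vals in grouped.items()}


# 95 short prompts and 5 giant ones: the mean and even p95 look calm.
input_tokens = [300] * 95 + [9000] * 5
mean_tokens = sum(input_tokens) / len(input_tokens)   # 735.0
p50 = percentile(input_tokens, 50)                    # 300
p95 = percentile(input_tokens, 95)                    # 300
p99 = percentile(input_tokens, 99)                    # 9000

ttft_p95 = ttft_by_bucket([(300, 0.2), (400, 0.3), (9000, 2.5)], pct=95)
```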

Some useful views:

input token p50, p95, p99
output token p50, p95, p99
total tokens per minute by model
total tokens per tenant
TTFT by input token bucket
TPOT by output token bucket
queue time by token bucket
percentage of requests near context limits
percentage of requests hitting output cap
retrieved context tokens per request
chat history tokens per request
KV cache usage over time
waiting requests alongside waiting token estimates

The last point is important.

Do not only ask how many requests are waiting. Ask how many tokens are waiting. A queue of 20 short chat requests and a queue of 20 long document-analysis requests are not the same queue.
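A waiting-token estimate can be as simple as summing what is already known about each queued request. The sketch below assumes you can see each queued request's input token count and its requested output cap. The class name, field names, and queue contents are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedRequest:
    input_tokens: int
    max_output_tokens: int  # worst-case output budget for this request


def waiting_token_estimate(queue: list[QueuedRequest]) -> dict[str, int]:
    return {
        "waiting_requests": len(queue),
        "waiting_input_tokens": sum(r.input_tokens for r in queue),
        "waiting_worst_case_tokens": sum(
            r.input_tokens + r.max_output_tokens for r in queue
        ),
    }


chat_queue = [QueuedRequest(input_tokens=300, max_output_tokens=256)] * 20
doc_queue = [QueuedRequest(input_tokens=12_000, max_output_tokens=2_000)] * 20

chat_estimate = waiting_token_estimate(chat_queue)
doc_estimate = waiting_token_estimate(doc_queue)
```

Both queues report 20 waiting requests. In this example the document queue is holding 40 times more input tokens.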

Product changes become infrastructure changes

One uncomfortable part of LLM platforms is that product changes can become infrastructure changes very quickly.

In a normal web app, adding a new field to a response may not matter much. In an LLM application, adding more context to a prompt can change capacity. Increasing retrieval depth can change latency. Keeping longer chat history can change memory pressure. Allowing longer outputs can change GPU occupancy.

The product team might say:

We only changed the prompt.

The platform team hears:

We changed the workload.

Both are true.

This does not mean product teams should be afraid of improving prompts. It means token impact should be visible before and after the change. If a new prompt improves answer quality but increases average input tokens by 3x, that may be a good tradeoff. But it should be a conscious tradeoff.

If a RAG change improves accuracy but pushes p99 prompts near the context limit, that should be visible before production users discover the latency problem. If a new report-generation mode produces 10x more output tokens than chat, it probably needs a different workload class and different expectations.

The platform question is not "are tokens bad?"

Tokens are the product.

The question is whether you know how many you are serving, where they come from, and what they do to your capacity.

Practical rules for platform teams

If you are starting to serve LLMs on Kubernetes, measure input and output tokens for every request. Do not wait until the first incident to add token metrics. Track the final assembled prompt, not just the user message. The model does not care what the user typed. It cares what your application sent.

Break input tokens down by source. Separate system prompt, chat history, retrieved context, tool results, and user message. Track requested max output tokens separately from actual output tokens. One tells you accepted risk. The other tells you real work.

Use token buckets in latency dashboards. A p95 latency graph without token buckets mixes small chat requests and huge document requests into one misleading line. Watch p95 and p99 token counts, not just averages. The tail is where LLM serving gets painful.

Put budgets on RAG retrieval. top_k is not only a relevance knob. It is a capacity knob. Treat context windows as limits, not targets. Just because a model accepts long context does not mean every request should use it. Set sane output defaults. Long answers should be intentional, not the accidental default for every route.
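A retrieval budget can be enforced with a few lines in the application that assembles the prompt. This is a minimal sketch, assuming the chunks arrive sorted by relevance and that token counts were already computed with the model's tokenizer. The function name, the chunk data, and the budget are hypothetical.

```python
def fit_chunks_to_budget(
    chunks: list[tuple[str, int]], budget_tokens: int, top_k: int
) -> list[str]:
    """chunks are (text, token_count) pairs, most relevant first."""
    selected: list[str] = []
    used = 0
    for text, tokens in chunks[:top_k]:
        if used + tokens > budget_tokens:
            break  # stop at the first chunk that would exceed the budget
        selected.append(text)
        used += tokens
    return selected


retrieved = [("chunk-a", 400), ("chunk-b", 500), ("chunk-c", 300), ("chunk-d", 200)]
kept = fit_chunks_to_budget(retrieved, budget_tokens=1_000, top_k=4)
print(kept)  # ['chunk-a', 'chunk-b']
```

Stopping at the first chunk that does not fit keeps the relevance order intact. Skipping it and trying smaller chunks is another reasonable choice, with different tradeoffs.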

Separate workload classes when needed. Short interactive chat, long RAG, report generation, agent workflows, and batch summarization do not have the same shape. Review token growth after product changes. Prompt changes, retrieval changes, memory changes, and tool changes can all affect infrastructure.

These rules are not about making the system slower or less useful. They are about making the system understandable.

You cannot operate what you do not measure. And in LLM serving, measuring only requests means you are measuring the envelope while ignoring the work inside it.

The real unit of scale

The request is still useful at the API boundary. You need it for authentication, rate limits, logs, tracing, errors, and user flows. But it is not the right unit for LLM capacity.

It cannot tell you how much prompt the model processed, how long the model generated, how much KV cache was needed, or whether the workload was short chat, long-context RAG, report generation, or an agent loop.

Tokens get you closer to the truth. Input tokens explain much of the work before the first response appears. Output tokens explain how long the model keeps generating. Token distributions explain why averages lie. Token sources explain which product behavior changed the workload. Token-aware metrics explain why your Kubernetes deployment looks healthy while users still feel latency.

Part 1 was about letting go of the normal web app scaling model. Part 2 is about replacing one of its most misleading assumptions.

For LLMs on Kubernetes, you are not really scaling requests.

You are scaling token work across expensive, memory-constrained, latency-sensitive GPU systems.

Once you see that, the rest of the platform starts to make more sense.

Continue the series

I am writing this as a practical series on hosting large LLMs on Kubernetes, from GPU nodes and model servers to autoscaling, latency, cost, and production architecture. If you want the next part, subscribe to the newsletter.

I am also preparing a free LLM Serving on Kubernetes Production Readiness Checklist with the metrics, dashboard questions, and architecture review points platform teams should ask before putting an LLM workload in production. Subscribe and I will share it when it is ready.

Everything You Know About Scaling Web Apps Breaks When You Serve an LLM

Pawan Kumar — Thu, 14 May 2026 03:33:30 +0000

This is Part 1 of a practical series on hosting large LLMs on Kubernetes.

Most platform engineers already know how to scale a web app. Put it in a container. Deploy it on Kubernetes. Add CPU and memory requests. Put a Service or Ingress in front. Configure HPA. Watch p95 latency, error rate, CPU, memory, and request throughput. Add replicas when traffic goes up.

That playbook works for a lot of services. Then you try to serve a large language model, and suddenly the old model starts cracking. A request is no longer just a request. Memory does not just mean RAM. Latency is not one number. Scaling a pod does not mean capacity appears instantly. One "replica" may need one GPU, eight GPUs, or several machines working together.

And the bottleneck may not be CPU at all. The first mental shift is simple:

LLM serving is not normal web serving.

The real unit of work is the token.

A request is no longer a request

In a normal web app, request count is often a useful planning signal. Not perfect, obviously. Some endpoints are heavier than others. Some queries are ugly. Some users manage to find the one path that melts the database. But request count still tells you something.

With LLMs, it can lie to your face.

One user asks:

Summarize this sentence.

Another user asks:

Analyze this 80-page contract, compare it with these policy documents, extract the risks, and generate a detailed memo.

Both are one request. They are not the same workload.

The second request may contain thousands of input tokens. It may generate thousands of output tokens. It may sit on GPU memory for longer. It may increase queueing delay for everyone behind it. It may consume far more KV cache. It may make your latency charts look haunted.

So if you only measure requests per second, you are almost blind. For LLMs, you need to care about:

input tokens
output tokens
tokens per second
time to first token
time per output token
queue depth
batch size
GPU memory
KV cache usage
model loading time

That is a very different world from normal HTTP throughput.

LLM inference has two phases, and they behave differently

When a user sends a prompt to an LLM, the model does not handle the request as one uniform block of work. At a high level, inference has two phases:

prefill
decode

Prefill is where the model processes the input prompt. If the prompt is long, prefill gets expensive. This is where the model reads the context and builds the internal state needed to start generating. Decode is where the model generates output tokens one at a time. This is the part users see when text starts streaming on the screen.

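These phases stress the system differently. Prefill tends to be compute-bound, because the whole prompt is processed in parallel. Decode is often memory-bandwidth-bound, because each new token has to read the model weights and the KV cache again. Prefill depends heavily on input length. Decode depends heavily on output length. Both affect latency, throughput, cost, and capacity.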

This distinction does not usually matter when you are scaling a normal API. You do not think of a checkout endpoint as having two GPU phases with different scheduling behavior. With LLMs, you have to.

If you ignore prefill and decode, you will struggle to explain why first token latency is slow, why long prompts hurt so much, or why the GPU looks busy but users still complain.

Latency is not one number anymore

For web services, we usually talk about latency as one number:

p50 latency
p95 latency
p99 latency
request duration

For LLMs, that is not enough. Two latency numbers matter a lot.

Time to first token

Time to first token, or TTFT, is how long the user waits before the model starts responding. This controls the feeling of responsiveness.

If nothing appears for five seconds, the product feels slow. It does not matter that the final answer is useful. The user has already started wondering if the system is stuck.

TTFT is affected by:

queueing delay
prompt length
prefill time
model routing
batch scheduling
GPU availability
cold starts
cache behavior

Users feel TTFT sharply because silence feels broken.

Time per output token

Time per output token, or TPOT, measures how fast the model generates each token after generation starts. This controls the streaming experience.

Good TTFT with bad TPOT feels like the model wakes up quickly and then crawls. Good TPOT makes the answer feel alive, even if the full response takes time.

TPOT is affected by:

decode efficiency
GPU memory bandwidth
batch size
KV cache pressure
model size
quantization
serving engine
hardware type

Normal web systems rarely force you to separate "time until the response starts" from "speed at which the rest of the response streams." LLM serving does.
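Both numbers fall out of the timestamps of a streamed response. The sketch below assumes the client records when it sent the request and when each token (or chunk) arrived. The function name and the sample timestamps are made up for illustration.

```python
def ttft_and_tpot(request_sent: float, token_times: list[float]) -> tuple[float, float]:
    """Return (time to first token, average time per output token) in seconds."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_sent
    if len(token_times) == 1:
        return ttft, 0.0
    # TPOT covers only the gaps between tokens, after the first one arrives.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Request sent at t=0.0, five tokens streamed back.
ttft, tpot = ttft_and_tpot(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(ttft, tpot)  # 0.5 0.25
```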

Memory means GPU memory now

In a web app, memory usually means heap, runtime overhead, in-process cache, or connection pools. In LLM serving, memory often means GPU memory. And GPU memory is painful because it is limited, expensive, and easy to waste.

You need GPU memory for:

model weights
KV cache
runtime buffers
activations
batching overhead
framework overhead

Model weights are the obvious part. A 7 billion parameter model in FP16 or BF16 needs roughly 14 GB just for weights. A 70 billion parameter model needs roughly 140 GB just for weights at that precision. That already means one GPU may not be enough.
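The weight estimate is just parameters times bytes per parameter. The sketch below reproduces the numbers above. The quantized entries are rough approximations that ignore scales and other quantization metadata, and the function name is hypothetical.

```python
# Approximate bytes per parameter by precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(params_billions: float, dtype: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes). Weights only."""
    return params_billions * 1e9 * BYTES_PER_PARAM[dtype] / 1e9


print(weight_memory_gb(7, "fp16"))   # 14.0
print(weight_memory_gb(70, "bf16"))  # 140.0
```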

But weights are only the obvious cost. The hidden cost is KV cache.

KV cache stores the key and value tensors from previous tokens so the model does not recompute everything from scratch during generation. The longer the context and the more concurrent users you serve, the more KV cache you need.

This is why long context is not just a product feature. It is an infra bill. Every extra token you allow into the context window can come back as GPU memory pressure. Maximum context length is not only a model capability. It is a capacity planning decision.
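A back-of-the-envelope KV cache estimate makes the point concrete. The sketch below uses the usual formula for a standard transformer: one key and one value vector per layer and per KV head for every token. The model shape is hypothetical, and the result ignores paging overhead and any serving-engine specifics.

```python
def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    # 2x because both a key and a value tensor are stored.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(tokens_in_flight: int, **model_shape: int) -> float:
    return tokens_in_flight * kv_cache_bytes_per_token(**model_shape) / 2**30


# Hypothetical model shape with FP16 cache values (2 bytes each).
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes_per_token(**shape)   # 131072 bytes = 128 KiB
one_request = kv_cache_gib(8_192, **shape)      # 1.0 GiB
fleet = kv_cache_gib(32 * 8_192, **shape)       # 32.0 GiB
```

With this made-up shape, one 8,192-token request needs about 1 GiB of cache, and 32 of them at once need about 32 GiB on top of the weights. Doubling the allowed context doubles that number.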

Replicas are not always replicas

In a normal web app, one replica usually means one pod running one copy of the application. Traffic goes up, add pods. With LLMs, the word "replica" can hide a lot.

A small model may run inside one pod on one GPU.

A larger model may need:

multiple GPUs in one node
multiple pods on one node
multiple nodes
tensor parallelism
pipeline parallelism
a Ray cluster
a leader-worker setup
a group of pods that must start together

So when someone says, "scale the model to 10 replicas," the first question should be: what is one replica?

Is it one pod? One GPU? One tensor parallel group? One multi-node deployment? One endpoint backed by several workers? One prefill group plus one decode group?

This is where Kubernetes abstractions get interesting. A Deployment works nicely for simple stateless services. Serious LLM serving may need Ray, KServe, LeaderWorkerSet, Kueue, Volcano, or custom orchestration.

The model may not fit into the old "one pod equals one replica" picture.

Scaling a pod does not mean capacity is ready

In a normal web app, a new pod can become useful quickly. The image is pulled. The process starts. Readiness passes. Traffic flows. For LLMs, a new pod may sit there for a while before it can handle real traffic.

It may need to:

pull a large container image
download model weights
load hundreds of GBs from object storage or disk
initialize CUDA
allocate GPU memory
build or load optimized engines
warm up the model
join a distributed serving group

This can take minutes. Sometimes longer. So autoscaling is not just about deciding when to add replicas. It is about adding capacity early enough that it is ready before users feel the pain.

That is much harder than scaling a normal web app. This is why LLM platforms often use:

minimum warm replicas
preloaded models
local NVMe model cache
warm pools
separate GPU node pools
predictive scaling
queue-based scaling
scheduled capacity for known peaks

Scale to zero sounds great until the first user waits for a giant model to load.

CPU autoscaling becomes a weak signal

CPU utilization is a decent signal for many Kubernetes workloads. Not perfect. But decent. For LLM serving, CPU can be almost irrelevant.

The expensive work happens on GPUs. More specifically, the bottleneck may be:

GPU memory
GPU memory bandwidth
KV cache capacity
decode throughput
queue depth
batch saturation
inter-GPU communication
model server scheduling
request length distribution

A model server can have low CPU usage and still be overloaded. It can have high GPU utilization and still deliver terrible latency. It can have enough compute but not enough KV cache capacity. It can be stuck serving long prompts while short prompts wait behind them.

So if you autoscale only on CPU, the platform may make the wrong decision at the worst possible time.

Better signals include:

queue depth
waiting requests
ongoing requests per replica
batch size
TTFT
TPOT
tokens per second
KV cache usage
GPU memory pressure
SLO burn rate

GPU utilization still matters. It just cannot be the only signal. LLM autoscaling has to understand the workload.
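To see how these signals could drive scaling, here is a sketch that applies the documented HPA rule, desired = ceil(current replicas x current value / target value), to model-server metrics instead of CPU. It assumes the metrics are already exposed to the autoscaler through something like a custom metrics adapter. The signal names, targets, and values are hypothetical.

```python
import math


def desired_replicas(
    current_replicas: int, current_value: float, target_value: float,
    tolerance: float = 0.1,
) -> int:
    """HPA-style proportional rule with a tolerance band around the target."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return max(1, math.ceil(current_replicas * ratio))


def scale_on_signals(
    current_replicas: int, signals: dict[str, float], targets: dict[str, float]
) -> int:
    # With several metrics, take the largest proposal, as HPA does.
    return max(
        desired_replicas(current_replicas, signals[name], targets[name])
        for name in targets
    )


signals = {"waiting_requests_per_replica": 9.0, "kv_cache_usage": 0.45}
targets = {"waiting_requests_per_replica": 3.0, "kv_cache_usage": 0.90}
print(scale_on_signals(4, signals, targets))  # 12
```

In this example KV cache usage alone would suggest scaling in, while the waiting queue asks for three times the capacity. Taking the maximum keeps the more urgent signal in charge.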

Round robin load balancing gets weird

For a normal web app, round robin load balancing is often fine. Request 1 goes to pod A. Request 2 goes to pod B. Request 3 goes to pod C. For LLMs, this can be wasteful.

A short prompt and a long prompt have completely different costs. A request with a cached prefix may be cheaper if it lands on the right worker. A long generation may occupy capacity much longer than the load balancer expects. One tenant may need lower latency than another. One model may need different hardware from another.

Naive load balancing can create strange failures:

one worker gets long prompts and slows down
another worker stays underused
KV cache locality is lost
prefix caching becomes less useful
tail latency gets worse
GPU utilization looks fine while users are unhappy

LLM serving needs smarter routing.

Good routing may consider:

model name
prompt length
estimated output length
tenant priority
cache locality
GPU availability
queue depth
hardware type
region
latency SLO

This is why inference gateways, model-aware routing, and cache-aware scheduling matter.

Cost changes shape

In a web app, cost usually grows with pods, CPU, memory, database load, and network traffic. In LLM serving, cost is shaped by GPU usage, and GPUs are expensive enough that small inefficiencies matter.

You can burn money through:

idle GPUs
poor batching
overprovisioned replicas
long context windows
bad routing
large models for simple tasks
no quantization
slow cold starts
inefficient KV cache usage
serving every request with the same model

The cost unit also changes. Instead of only thinking about cost per request, you start thinking about:

cost per input token
cost per output token
cost per million tokens
cost per model
cost per tenant
cost per GPU hour
cost per region
cost per latency tier

This is the cloud bill version of the first mental shift: a request is not a request. A token is the real unit of work.
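Cost per million tokens follows directly from GPU price and sustained throughput. The sketch below is a rough model with hypothetical prices and throughput. It treats utilization as the fraction of each hour the GPUs spend doing useful token work.

```python
def cost_per_million_tokens(
    gpu_hour_cost: float, gpus: int, tokens_per_second: float,
    utilization: float = 1.0,
) -> float:
    """Dollars per one million tokens served by one replica."""
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hour_cost * gpus / tokens_per_hour * 1_000_000


# Hypothetical replica: 1 GPU at $2.50 per hour, 2,500 tokens per second.
busy = cost_per_million_tokens(2.50, gpus=1, tokens_per_second=2_500)
mostly_idle = cost_per_million_tokens(2.50, gpus=1, tokens_per_second=2_500,
                                      utilization=0.4)
print(round(busy, 3), round(mostly_idle, 3))  # 0.278 0.694
```

Same hardware, same model, and the cost per token is 2.5 times higher just because the GPU sits idle more often.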

Kubernetes is still useful. It is just not enough by itself

None of this means Kubernetes is the wrong platform for LLM serving. Kubernetes still gives you a lot:

scheduling
declarative deployment
resource management
isolation
service discovery
rollouts
observability integrations
autoscaling primitives
platform patterns for multiple teams

That is why many serious AI infrastructure platforms still use Kubernetes or something close to it. But Kubernetes does not automatically understand LLMs.

Out of the box, Kubernetes does not know:

what KV cache is
whether a model is loaded
whether a GPU group must be scheduled together
whether pods should land in the same rack
whether a request has 100 tokens or 100,000 tokens
whether TTFT is bad
whether a model server is overloaded despite low CPU
whether a new replica will take 10 minutes to warm up

You have to teach the platform these things through metrics, controllers, schedulers, serving frameworks, routing layers, and operational discipline. That is the real work.

The old scaling model breaks

The old web scaling model looks like this:

Traffic increases
        ↓
CPU increases
        ↓
HPA adds pods
        ↓
Load balancer spreads requests
        ↓
Latency improves

That still works for many stateless services. LLM serving looks more like this:

Traffic increases
        ↓
Input and output token mix changes
        ↓
Queue depth grows
        ↓
KV cache pressure increases
        ↓
Batching behavior changes
        ↓
TTFT and TPOT drift
        ↓
GPU memory or decode throughput becomes the bottleneck
        ↓
Autoscaler needs model-aware metrics
        ↓
New capacity may take minutes to warm up

If you bring only the old playbook, you will scale the wrong thing, at the wrong time, using the wrong signal.

The new mental model

To serve LLMs well, you need a different model in your head:

A request is not the unit. A token is.
Memory is not just RAM. GPU memory and KV cache matter more.
Latency is not one number. TTFT and TPOT matter separately.
A replica may be a distributed group, not a single pod.
Scaling is not instant because model loading is slow.
CPU is not a reliable autoscaling signal by itself.
Load balancing must understand request cost and cache locality.
Long context is an infrastructure cost decision.
Cost optimization starts with keeping expensive GPUs useful.
Kubernetes is the foundation, but LLM-aware systems must be built on top.

Once this clicks, the rest of LLM infrastructure becomes easier to reason about. You can understand why vLLM became popular. Why PagedAttention matters. Why KV cache dominates serving design. Why quantization is a capacity strategy. Why topology-aware scheduling matters. Why teams split prefill and decode. Why GPU cost optimization is its own discipline. Why normal autoscaling is not enough.

LLM serving is not "deploy a model behind an API." It is a new platform engineering problem.

Closing thought

For years, platform teams became very good at scaling stateless web services. We learned containers, Kubernetes, service meshes, autoscaling, observability, progressive delivery, and cloud cost optimization. That knowledge still matters, but LLM serving changes the shape of the problem.

The bottlenecks move. The metrics change. The cost model changes. The scheduler matters more. The load balancer needs to get smarter. The GPU becomes the scarce resource. The token becomes the unit of work.

So if you are trying to serve LLMs on Kubernetes, the first step is not installing a Helm chart. The first step is replacing the old mental model.

Because everything you know about scaling web apps starts to break the moment you serve an LLM.


I Don't Want AI to Replace DevOps. I Want It to Read the Docs I'm Too Tired to Read

Pawan Kumar — Thu, 07 May 2026 06:32:51 +0000

Originally published at dheeth.blog

It's 2 AM. The pager went off eleven minutes ago. You're staring at a Kubernetes upgrade advisory that's forty-seven paragraphs long, and somewhere in paragraph thirty-one there's a breaking change about how EKS handles PodIdentity federation with IAM roles. You know it's in there. You read it three months ago. But right now your brain is running on caffeine and cortisol, and the words are blurring into each other.

You could run the upgrade now and hope for the best. Or you could spend forty minutes re-reading the entire changelog, the Terraform provider notes, the Helm chart migration guide, and three different Slack threads from the last time someone did this.

This is the part of DevOps nobody puts in conference talks. Not the elegant GitOps pipelines or the slick dashboards. The part where you're exhausted and you still have to make a decision that affects production, and the information you need is spread across nine browser tabs, a Confluence page from 2023, and a runbook that was last updated when your cluster was on 1.24.

This is where I want AI to help. Not by taking over. Not by running kubectl apply on my behalf while I sleep. By reading the damn docs for me.

The kind of tired that matters

The Google SRE Workbook has a word for the repetitive operational work that eats engineers' time: toil. They define it as "the repetitive, predictable, constant stream of tasks related to maintaining a service." Rollouts, upgrades, alert triage, manual repairs, ticket-driven provisioning. Google puts a hard cap on it: no more than 50% of an SRE's time should go to operational work.

The reasoning isn't just about efficiency. The workbook makes a point that has always stuck with me: time spent on toil is time not spent where human judgment, creativity, and design thinking matter.

Here's what I think the SRE Workbook doesn't fully capture, at least not in those exact words. There's a specific kind of toil that doesn't look like toil. It doesn't involve clicking buttons or running the same script for the hundredth time. It's cognitive. It's the mental cost of assembling context from scattered sources before you can make a decision.

Reading a Kubernetes release notes page that's 3,000 words long to find the one deprecation that affects your cluster. Comparing two versions of a Helm values.yaml to understand what changed between chart versions 4.2.1 and 5.0.0. Skimming a Terraform provider changelog to see if the aws_eks_cluster resource changed its default behavior. Correlating an incident timeline from last Thursday with the deployment that happened two hours before the spike in 5xx errors.

This work isn't glamorous. It doesn't produce artifacts. Nobody thanks you for spending an hour reading release notes. But if you skip it, you miss the breaking change that takes down a service at 3 AM on a Sunday.

Sometimes the most exhausting part of an incident is not fixing the issue. It is building enough context to feel safe fixing it.

I think of this as cognitive toil, and AI is unusually well suited to help with it.

Why I don't want an AI agent with production access

Before I talk about what I do want, let me be clear about what I don't.

I don't want an AI agent that has kubectl apply access by default. I don't want one that can merge PRs, push to main, modify IAM policies, or restart services without a human in the loop. I've seen enough production incidents caused by humans who were tired, rushed, or copy-pasting from the wrong terminal. Giving that same power to something that hallucinates API flags and invents Kubernetes resources that don't exist is not progress. It's a new category of incident.

In application code, an AI mistake might fail a test. In DevOps, an AI mistake might page five teams, drain the wrong node, rotate the wrong secret, or turn a small incident into a very educational afternoon.

The Stack Overflow 2025 Developer Survey backs this up. 76% of developers don't plan to use AI for deployment or monitoring tasks. Not because they're luddites. Because they know what's at stake. More developers actively distrust AI accuracy (46%) than trust it (33%). Only 3% highly trust it. That is the part that makes people nervous: AI can sound confident even when the answer still needs careful verification.

In DevOps, "almost right" isn't a minor inconvenience. An "almost right" IAM policy is a security incident. An "almost right" Kubernetes manifest is a workload that runs fine until it doesn't, and then you're debugging at 2 AM wondering why the liveness probe path changed. An "almost right" Terraform plan is a production resource that gets destroyed and recreated instead of updated in place.

The problem is not that AI is useless. The problem is that AI is useful enough to make dangerous workflows look reasonable. In DevOps, the gap between "sounds correct" and "safe to execute" is where incidents live.

The hard part of DevOps is rarely knowing the command. kubectl apply -f manifest.yaml isn't the hard part. The hard part is knowing whether that command is safe in this environment, with this version of Kubernetes, with these admission controllers, with this cluster autoscaler configuration, right after that EKS add-on got updated. That requires context, judgment, and accountability. AI is genuinely useful for the first two, but it can't own the third. Not yet. Maybe not ever.

Most production work is not blocked because nobody knows how to type kubectl. It is blocked because nobody is completely sure what is safe to do next.

What I actually want AI to do

I want AI to be the colleague who actually reads the release notes before standup. The one who highlights the three things that matter out of a forty-seven-paragraph changelog. The one who can look at a Terraform plan diff and tell you, in plain language, what's about to change and what might break.

Concretely, here's what that looks like.

When I'm going from Kubernetes 1.29 to 1.30, I want something that tells me what got deprecated, what changed in API versions, and what I need to act on before upgrading. Skip the boilerplate about "improved performance." Focus on the removals and behavioral changes.

Before I update the VPC CNI add-on, I want to know if this version is compatible with my current Kubernetes version, my node group AMI, and the Calico network policy version we're running. That compatibility matrix is spread across three AWS docs pages and it changes every quarter.

When the AWS Terraform provider goes from 5.x to 6.x, I don't want to read the entire migration guide. I want to know which resources I'm actually using that changed behavior. Focus on my code, not the universe of possibilities.

When I'm upgrading a Helm chart from 4.x to 5.x, show me what changed in the default values: which new keys were introduced, which old keys were removed, which ones changed their default behavior. Better yet, cross-reference my current values.yaml and tell me which of my overrides are now invalid.
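The mechanical half of that comparison does not even need a model. The sketch below assumes the old chart defaults, the new chart defaults, and your overrides have already been loaded into plain dictionaries, for example by parsing the chart's values files with a YAML library. The function names and the sample values are hypothetical.

```python
def flatten(values: dict, prefix: str = "") -> dict:
    """Turn nested values into dotted paths, e.g. {'a': {'b': 1}} -> {'a.b': 1}."""
    flat = {}
    for key, value in values.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict) and value:
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def compare_chart_defaults(old: dict, new: dict, overrides: dict) -> dict:
    old_f, new_f, mine = flatten(old), flatten(new), flatten(overrides)
    removed = sorted(old_f.keys() - new_f.keys())
    return {
        "added": sorted(new_f.keys() - old_f.keys()),
        "removed": removed,
        "changed_default": sorted(
            k for k in old_f.keys() & new_f.keys() if old_f[k] != new_f[k]
        ),
        # Overrides that point at keys the new chart no longer has.
        "stale_overrides": sorted(k for k in mine if k in removed),
    }


old_defaults = {"image": {"tag": "4.2.1"}, "service": {"port": 80},
                "persistence": {"enabled": False}}
new_defaults = {"image": {"tag": "5.0.0"}, "service": {"port": 8080},
                "storage": {"enabled": False}}
my_overrides = {"persistence": {"enabled": True}, "service": {"port": 80}}

diff = compare_chart_defaults(old_defaults, new_defaults, my_overrides)
```

The part I want AI for is the next step: reading the chart's changelog and telling me why `persistence.enabled` disappeared and what replaced it.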

If I inherit a cluster with 200 custom resources I've never seen before, help me understand what they do without reading CRD documentation for six hours.

When an incident happens, take the Slack thread, the PagerDuty timeline, and the post-mortem notes, and produce a runbook that the next on-call engineer can actually follow. One that isn't three years stale.

When the error rate spiked at 14:32 and something was deployed at 14:15, pull the deployment diff, the relevant log lines, and the metrics shift into one view so I can see the connection without switching between four tools.

When five services are throwing errors and the logs are a wall of stack traces, filter out the noise, group the unique errors, and tell me which one started first. That's the one I care about.
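The grouping step can be sketched with plain heuristics. This is a minimal illustration, assuming each log event has already been parsed into a timestamp, a service name, and a message, and that timestamps are ISO 8601 strings in one timezone so they sort as text. The normalization rules and the sample events are made up.

```python
import re
from dataclasses import dataclass


@dataclass
class ErrorGroup:
    signature: str
    count: int
    first_seen: str
    first_service: str


def normalize(message: str) -> str:
    """Collapse values that vary between occurrences of the same error."""
    message = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)
    return re.sub(r"\d+", "<n>", message)


def group_errors(events: list[tuple[str, str, str]]) -> list[ErrorGroup]:
    """events are (iso_timestamp, service, message). Earliest group first."""
    groups: dict[str, ErrorGroup] = {}
    for timestamp, service, message in sorted(events):
        signature = normalize(message)
        if signature not in groups:
            groups[signature] = ErrorGroup(signature, 0, timestamp, service)
        groups[signature].count += 1
    return sorted(groups.values(), key=lambda g: g.first_seen)


events = [
    ("2026-05-07T14:32:10Z", "checkout", "timeout after 3000 ms calling payments"),
    ("2026-05-07T14:31:02Z", "payments", "connection refused to db-7:5432"),
    ("2026-05-07T14:32:11Z", "checkout", "timeout after 3001 ms calling payments"),
    ("2026-05-07T14:31:40Z", "payments", "connection refused to db-7:5432"),
]
groups = group_errors(events)
```

In this sample the checkout timeouts are the loud ones, but the payments connection error started a minute earlier. That ordering is the first thing I want to see.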

None of these require production access. None require the AI to execute anything. They require it to read, understand, summarize, compare, and present information so I can decide faster.

The best DevOps AI will not feel magical. It will feel like a senior engineer left clean notes before going on vacation.

The data says this approach works

GitHub's study on Copilot found something interesting beyond speed. 87% of developers said AI helped them preserve mental effort during repetitive tasks. 73% said it helped them stay in flow. 60-75% said it helped them focus on more satisfying work. One senior engineer put it simply: with AI, they had to think less about the boring stuff, and when they had to think, it was the fun stuff.

The DORA research on generative AI adds an important nuance. Developers who use AI extensively report higher job satisfaction, more time in flow state, and less burnout. But there's a catch: AI adoption didn't reduce time spent on toilsome, repetitive tasks. It sped up the valuable work developers already enjoyed, but didn't crack the code on automating drudgery. DORA also found that a 25% increase in AI adoption was associated with a decrease in delivery stability, because AI lets teams generate more code and more changes faster than their review and testing processes can handle.

Read that last sentence again. AI doesn't hurt stability because it writes bad code. It hurts stability because it lets teams produce more work than their feedback loops can safely absorb.

This is exactly why the read-summarize-suggest model is the right one for DevOps. It gives engineers better context without adding unreviewed changes to the pipeline. It accelerates understanding without bypassing approval. It reduces the time between "I need to figure this out" and "I understand enough to decide" without collapsing the distance between "I decided" and "it's done."

A boundary that matters

I'm not anti-agent. I think autonomous AI agents will eventually have a role in infrastructure operations. But the keyword is eventually, and the prerequisite is trust, and trust is earned slowly and lost quickly.

The Stack Overflow Developer Survey also shows that developers are much more cautious with high-responsibility work. Most respondents do not plan to use AI for deployment or monitoring. These are not people who hate AI. These are people who know where the blast radius lives.

The DORA report reinforces this: trust directly drives AI productivity. Developers who trust AI accept more suggestions, submit more changes, and spend less time searching for information. But DORA also found that 39% of developers still trust AI outputs "a little" or "not at all."

In DevOps, trust isn't about vibes. It's about being right when being wrong has consequences. An AI that summarizes a changelog and misses a breaking change is annoying but survivable. An AI that applies a change based on that incomplete summary is a production incident.

The line I draw is simple. AI should read, summarize, compare, draft, and suggest. Humans should approve, execute, and own.

Let AI read. Let AI summarize. Let AI compare. Let AI draft. Let AI suggest.

But make humans approve, execute, and own.
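That boundary can be made mechanical instead of being left to habit. Here is a minimal sketch of an approval gate, assuming a hypothetical `Proposal` object that an AI assistant fills in and a `runner` callable that a human-owned tool supplies. All names here are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class NotApproved(Exception):
    """Raised when something tries to execute an unapproved proposal."""


@dataclass
class Proposal:
    summary: str                       # what the AI read and concluded
    command: str                       # what it suggests running
    approved_by: Optional[str] = None  # set only when a human approves


def approve(proposal: Proposal, human: str) -> Proposal:
    """A named human takes ownership of the change."""
    proposal.approved_by = human
    return proposal


def execute(proposal: Proposal, runner: Callable[[str], str]) -> str:
    """Run the command only if a human has approved it."""
    if proposal.approved_by is None:
        raise NotApproved(f"no human has approved: {proposal.command}")
    return runner(proposal.command)


# Hypothetical runner that only records what it was asked to do.
executed = []


def fake_runner(command: str) -> str:
    executed.append(command)
    return "ok"


draft = Proposal(
    summary="Provider upgrade changes a default; plan replaces 2 resources",
    command="terraform apply plan.out",
)

try:
    execute(draft, fake_runner)  # blocked: nobody has approved it yet
except NotApproved as err:
    print(f"blocked: {err}")

execute(approve(draft, "oncall-engineer"), fake_runner)
print(executed)
```

The AI can fill in `summary` and `command` as often as it likes. Nothing reaches the runner until a person puts their name on it.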

The fatigue I want to replace

DevOps has a burnout problem. This isn't news. The on-call rotations, the incident pressure, the constant context switching between ten different tools and three different cloud providers and a pile of documentation that's always slightly out of date.

The fatigue is real. It accumulates. It's not the dramatic kind where someone screams and quits. It's the quiet kind where you stop reading the full changelog because you've read forty of them and nothing ever breaks, until one Tuesday it does. Where you stop updating the runbook because nobody reads it anyway, including you. Where you start copy-pasting Terraform modules from the last project because you don't have the energy to check if the AWS provider changed the defaults again.

AI can't fix organizational dysfunction. It can't fix understaffed on-call rotations or unreasonable SLAs. But it can reduce the cognitive tax of the work that sits between "I got paged" and "I understand what is happening." It can give you back the thirty minutes you'd have spent re-reading docs you already read once. It can catch the breaking change you'd have missed at 2 AM.

I don't want AI to replace DevOps engineers. I want it to replace the exhaustion that makes us worse at the job we're good at. I want it to be the thing that reads the docs so I can focus on deciding what to do with what they say. I want it to handle the reading so I can handle the thinking.

That's not a smaller vision. It's a more honest one.

References:

Google SRE Workbook, "Eliminating Toil": sre.google/workbook/eliminating-toil/
DORA, "Impact of Generative AI in Software Development": dora.dev/ai/gen-ai-report/
Stack Overflow Developer Survey 2025, AI section: survey.stackoverflow.co/2025/ai/
GitHub Research, "Quantifying GitHub Copilot's Impact on Developer Productivity and Happiness": github.blog/news-insights/research/research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/

DevOps to Platform Engineer: The Career Shift Nobody Explains Properly

Pawan Kumar — Thu, 30 Apr 2026 20:15:33 +0000

If you've been in DevOps long enough, you've probably seen the job postings by now. "Platform Engineer." "Internal Developer Platform." "Platform-as-a-Product." The titles are everywhere. Gartner says 80% of large engineering organizations will have dedicated platform teams by 2026. That's up from 45% in 2022.

But nobody really explains what changes. Not the buzzwords. The actual day job. The skills. The salary. The headaches.

I work as a DevOps Engineer at a company that builds Kubernetes application platforms. So I'm living in the middle of this transition every single day. Let me break down what's actually happening, what it means for your career, and whether you should care.

What's Actually Happening

Here's the short version: DevOps broke at scale. Not the philosophy. The practice.

When you have 5 teams and 20 services, DevOps works beautifully. Everyone knows everyone. You can walk over to someone's desk (or Slack them) and figure out why the pipeline broke. The "culture of collaboration" actually functions.

But at 50 teams? 500 services? Multiple clouds? That informal shared context collapses. Onboarding takes weeks instead of days. Every team builds slightly different CI/CD pipelines. Security reviews become bottlenecks. And you end up with 3 senior engineers who "know how things really work," and they're drowning.

Platform engineering is the response to that breakdown. Instead of relying on culture and tribal knowledge, you build a product, an Internal Developer Platform (IDP), that encodes best practices into self-service tooling.

The platform becomes the documentation. The guardrails become the governance. And the paved road becomes the easiest road.

DevOps vs Platform Engineering: The Real Differences

Let's skip the marketing fluff. Here's what actually changes:

Aspect	DevOps	Platform Engineering
Who you build for	Infrastructure, pipelines	Developers (your users)
How work comes to you	Tickets, Slack pings, "can you help me"	Platform feature requests, adoption metrics
Governance	PR reviews, approval gates, manual checks	Embedded into templates and workflows
Success metric	"Did the deploy work?"	"Are developers choosing to use the platform?"
Scale model	Linear (more teams = more DevOps)	Leverage (platform scales once, serves all)
Your mindset	"Let me fix this for you"	"Let me build it so you never need to ask"

That last row is the fundamental shift. DevOps is a service mindset. Platform engineering is a product mindset.

A Day in the Life

DevOps Engineer's typical day:

Build and maintain CI/CD pipelines (30%)
Write Terraform, manage infrastructure (25%)
Set up monitoring and alerting (15%)
Automate deployment processes (20%)
Help developers with infrastructure issues (10%)

Platform Engineer's typical day:

Build internal developer tools and abstractions (35%)
Improve self-service capabilities (25%)
Maintain the platform infrastructure itself (20%)
Developer support, education, onboarding (10%)
Platform documentation (10%)

Notice the shift: you're spending more time building things for developers to use independently and less time doing things for developers. It's the difference between being a chef and being someone who designs kitchen layouts.

The Salary Question (India-Focused)

Let's talk numbers. I cross-referenced data from AmbitionBox, Glassdoor, Levels.fyi, and real job postings across LinkedIn and Naukri. Here are the realistic ranges for India in 2026:

Experience	DevOps	Platform Engineer
3-5 years	₹10-28 LPA	₹20-40 LPA
6-10 years	₹20-45 LPA	₹35-60 LPA
Lead/Principal	₹35-65 LPA	₹55-90 LPA

Platform engineering commands a 30-60% premium over generalist DevOps, according to multiple 2026 India salary reports. The premium exists because the talent pool is much smaller: you need DevOps foundations plus product thinking plus software engineering depth.
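As a quick sanity check, the premium can be recomputed from the table itself. This sketch compares the midpoint of each range, which is my own simplification; the salary reports may calculate it differently.

```python
# (DevOps range, Platform Engineer range) in LPA, copied from the table above.
bands = {
    "3-5 years": ((10, 28), (20, 40)),
    "6-10 years": ((20, 45), (35, 60)),
    "Lead/Principal": ((35, 65), (55, 90)),
}


def midpoint(salary_range):
    low, high = salary_range
    return (low + high) / 2


premiums = {}
for band, (devops, platform) in bands.items():
    d, p = midpoint(devops), midpoint(platform)
    premiums[band] = round((p - d) / d * 100)
    print(f"{band}: {d} -> {p} LPA, premium ~{premiums[band]}%")
# 3-5 years: 19.0 -> 30.0 LPA, premium ~58%
# 6-10 years: 32.5 -> 47.5 LPA, premium ~46%
# Lead/Principal: 50.0 -> 72.5 LPA, premium ~45%
```

All three bands land inside the 30-60% range, so the table and the headline number agree.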

Outside India, the gap is similar. Platform engineers in North America average about $160,000 USD, compared to DevOps roles that typically plateau around $140,000. Not life-changing, but meaningful.

The Skills Gap: What You Need to Learn

If you're a DevOps engineer today, you already have most of the foundations. Here's what's missing:

1. Product Thinking

This is the biggest mindset shift. You're no longer building pipelines; you're building a product with users, feedback loops, and adoption metrics. That means:

Understanding developer pain points (user research)
Prioritizing features based on impact (product management)
Measuring adoption, not just uptime (analytics)
Iterating based on feedback (continuous improvement)

The #1 reason platform initiatives fail? Teams build technically excellent platforms that nobody uses. Voluntary adoption is the real metric.
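To make "voluntary adoption" concrete, here is a small sketch that computes it per team from a service inventory. The `via_platform` flag and the sample services are hypothetical; in practice this data would come from your service catalog.

```python
from collections import defaultdict


def adoption_by_team(services):
    """Return {team: share of that team's services created via the platform}."""
    totals = defaultdict(int)
    on_platform = defaultdict(int)
    for svc in services:
        totals[svc["team"]] += 1
        if svc["via_platform"]:
            on_platform[svc["team"]] += 1
    return {team: on_platform[team] / totals[team] for team in totals}


# Hypothetical inventory.
services = [
    {"name": "ledger", "team": "payments", "via_platform": True},
    {"name": "refunds", "team": "payments", "via_platform": True},
    {"name": "login", "team": "auth", "via_platform": False},
    {"name": "tokens", "team": "auth", "via_platform": True},
    {"name": "search", "team": "core", "via_platform": False},
]

rates = adoption_by_team(services)
overall = sum(s["via_platform"] for s in services) / len(services)
holdouts = [team for team, rate in rates.items() if rate == 0]

print(f"overall adoption: {overall:.0%}")   # overall adoption: 60%
print(f"teams to interview: {holdouts}")    # teams to interview: ['core']
```

The teams at zero are the interesting ones. They are where the user research should start.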

2. API Design and Software Engineering

DevOps scripting (Bash, YAML, a bit of Python) doesn't cut it anymore. Platform engineers need:

API design - Your platform is consumed through APIs
Go or Rust - Most CNCF platform tooling is written in Go
Multi-tenancy patterns - Your platform serves multiple teams with different needs
Software engineering practices - Testing, versioning, deprecation strategies

# Example: A Golden Path template for a new microservice
# This is what platform engineers build - opinionated defaults
# Scaffolder templates live in the scaffolder.backstage.io API group
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: microservice-template
  title: Standard Microservice
  description: Spin up a new Go microservice with CI/CD, monitoring, and security baked in
spec:
  owner: platform   # assumed owning team; adjust to your catalog
  type: service
  parameters:
    - title: Service Details
      required:
        - name
        - team
      properties:
        name:
          title: Service Name
          type: string
        team:
          title: Owning Team
          type: string
          enum: [payments, auth, core, platform]
  steps:
    - id: scaffold
      name: Generate Service
      action: fetch:template
      input:
        url: ./templates/go-microservice
        values:
          name: ${{ parameters.name }}
          team: ${{ parameters.team }}

This is a simplified Backstage software template - one of the most common patterns in platform engineering. Developers fill in a few fields, and the platform generates a production-ready service with CI/CD, observability, and security pre-configured.
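The `parameters` block is what turns the template into a guardrail: bad input is rejected before anything is generated. Here is a toy sketch of that check, hand-rolled to cover only `required` and `enum`. It is an illustration of the idea, not how Backstage validates forms internally.

```python
# Mirrors the `parameters` block of the template above as a plain dict.
SCHEMA = {
    "required": ["name", "team"],
    "properties": {
        "name": {"type": "string"},
        "team": {"type": "string", "enum": ["payments", "auth", "core", "platform"]},
    },
}


def validate(values, schema=SCHEMA):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for key in schema["required"]:
        if not values.get(key):
            errors.append(f"{key} is required")
    for key, value in values.items():
        rules = schema["properties"].get(key)
        if rules is None:
            errors.append(f"{key} is not a known parameter")
        elif rules.get("type") == "string" and not isinstance(value, str):
            errors.append(f"{key} must be a string")
        elif "enum" in rules and value not in rules["enum"]:
            errors.append(f"{key} must be one of {rules['enum']}")
    return errors


print(validate({"name": "invoice-api", "team": "payments"}))
# []
print(validate({"name": "invoice-api", "team": "growth"}))
# ["team must be one of ['payments', 'auth', 'core', 'platform']"]
```

A developer who picks a team that doesn't exist gets an immediate, readable error instead of a half-configured service.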

You can achieve the same with Devtron's Application Templates - capture CI/CD workflows, build configs, deployment templates, and environment overrides from an existing app, then reuse them to spin up new microservices in minutes instead of hours.

3. Developer Experience (DevEx)

You need to care about how developers feel using your platform. This includes:

Time to first deploy (how fast can a new dev ship?)
Self-service capabilities (can they do it without filing a ticket?)
Documentation quality (can they figure it out without asking you?)
Error messages (are they helpful or cryptic?)

The State of Platform Engineering Report recommends tracking DORA metrics (deployment frequency, lead time, change failure rate, MTTR) alongside SPACE metrics (developer productivity) and time-to-onboarding.
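The four DORA metrics are simple to compute once you have deployment records. Here is a minimal sketch, assuming each record carries a commit time, a deploy time, a failure flag, and a restore time. The field names and sample data are made up for the example.

```python
from datetime import datetime
from statistics import mean, median


def dora_metrics(deploys, period_days):
    """Compute the four DORA metrics from a list of deployment records."""
    lead_hours = [
        (d["deployed_at"] - d["committed_at"]).total_seconds() / 3600
        for d in deploys
    ]
    failures = [d for d in deploys if d["failed"]]
    restore_minutes = [
        (d["restored_at"] - d["deployed_at"]).total_seconds() / 60
        for d in failures
    ]
    return {
        "deploys_per_day": len(deploys) / period_days,
        "median_lead_time_hours": median(lead_hours),
        "change_failure_rate": len(failures) / len(deploys),
        "mttr_minutes": mean(restore_minutes) if restore_minutes else 0.0,
    }


def t(day, hour, minute=0):
    """Shorthand for a timestamp in a hypothetical week of January 2026."""
    return datetime(2026, 1, day, hour, minute)


# Hypothetical week: four deploys, one of which failed and was restored.
deploys = [
    {"committed_at": t(1, 9), "deployed_at": t(1, 11), "failed": False, "restored_at": None},
    {"committed_at": t(2, 9), "deployed_at": t(2, 13), "failed": True, "restored_at": t(2, 13, 30)},
    {"committed_at": t(3, 10), "deployed_at": t(3, 11), "failed": False, "restored_at": None},
    {"committed_at": t(5, 9), "deployed_at": t(5, 12), "failed": False, "restored_at": None},
]

metrics = dora_metrics(deploys, period_days=7)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

For this sample week the median lead time is 2.5 hours, the change failure rate is 25%, and MTTR is 30 minutes. Tracking the same numbers before and after a platform rollout shows whether the paved road is actually faster.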

4. AI Literacy

This isn't optional anymore. 92% of CIOs plan AI integrations into their platforms. One common recommendation is to reserve 20% of your time for AI skill development:

Using AI tools for platform operations (K8sGPT, AI-assisted troubleshooting)
Building AI-powered capabilities into your platform (intelligent autoscaling, anomaly detection)
Understanding how AI-generated code flows through your CI/CD

By 2028, platforms without AI capabilities will be considered outdated.

How to Actually Make the Transition

Here's a practical roadmap, assuming you have 3+ years of DevOps experience:

Month 1-2: Build Product Thinking

Read "Team Topologies" by Matthew Skelton and Manuel Pais
Start treating your current internal tools as products - add documentation, gather feedback, track usage
Learn about Backstage (CNCF project, 89% market share for IDPs)
Explore Devtron - an AI-native Kubernetes management platform to see how real IDPs work in practice

Month 3-4: Level Up Software Engineering

Pick up Go if you haven't already - most platform tooling is Go-based
Build a small internal tool with proper API design, tests, and documentation
Contribute to an open-source platform tool (Backstage, Crossplane)

Month 5-6: Get Hands-On with IDPs

Deploy Backstage locally or in a sandbox cluster
Build a software template for your team's most common workflow
Add golden paths for your existing infrastructure patterns

Ongoing: Develop AI Competency

Experiment with K8sGPT for cluster troubleshooting
Explore AI-assisted CI/CD (GitHub Copilot in Actions, AI-powered code review)
Stay current with AI SRE tools (autonomous incident response is coming fast)

The Six Specialized Roles Within Platform Engineering

As the field matures, "platform engineer" is splitting into distinct specializations:

Head of Platform Engineering (HOPE) - Strategic direction, cross-functional coordination
Platform Product Manager (PPM) - Bridges technical teams and organizational needs
Infrastructure Platform Engineer (IPE) - Underlying infra (servers, networks, databases)
DevEx Platform Engineer (DPE) - Developer workflows, friction reduction, tool UX
Security Platform Engineer (SPE) - Security embedded into pipelines, policy-as-code
Reliability Platform Engineer (RPE) - Evolution of SRE, monitoring/observability plane

You don't need to pick one immediately. Most platform engineers touch multiple areas, especially in smaller teams. But knowing these exist helps you see where your career can go.

What I'm Seeing From the Inside

Working at Devtron, a company that literally builds a Kubernetes application platform, I get a front-row seat to this transition. Here's what I see daily:

Teams that adopted platform thinking are shipping faster with fewer incidents. They're not firefighting as much because the platform catches common mistakes before they reach production.

Teams that didn't are drowning in tickets. Every new microservice means another pipeline to build, another set of alerts to configure, another on-call rotation to manage. It doesn't scale.

The companies that get this right treat their platform as a product with a dedicated team, clear ownership, and actual user research. The ones that get it wrong rebrand their DevOps team as "Platform Engineering" and change nothing about how they work.

Don't be the second one.

The Honest Take

Platform engineering isn't replacing DevOps. It's DevOps growing up. The philosophy of collaboration, automation, and shared responsibility stays. What changes is the mechanism, from culture-dependent to platform-dependent.

Should you make the shift? If you enjoy building tools more than operating infrastructure, if you care about developer experience, and if you want to work on leverage (building something once that serves hundreds of developers) - yes.

The timing is right. Mid-level engineers with 3-5 years of experience are entering platform roles in growing numbers. You don't need to be a senior architect anymore. The field is democratizing, the salaries are competitive, and the demand is only going up.

Start by building one thing that removes friction for your team. Treat it like a product. See what happens.

Further Reading:

Team Topologies - the org design book behind platform thinking
Backstage.io - CNCF project for building developer portals
Devtron - AI-Native Kubernetes Management Platform
Platform Engineering community - reports, articles, and the annual State of Platform Engineering survey
DORA metrics - the standard for measuring software delivery performance