<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pawan Kumar</title>
    <description>The latest articles on DEV Community by Pawan Kumar (@the-persistent-engineer).</description>
    <link>https://dev.to/the-persistent-engineer</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3903751%2Fa8128dd1-30a5-4a8b-a5aa-acc051e7828e.png</url>
      <title>DEV Community: Pawan Kumar</title>
      <link>https://dev.to/the-persistent-engineer</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/the-persistent-engineer"/>
    <language>en</language>
    <item>
      <title>Before the Pod Starts: GPU Node Setup for LLMs on Kubernetes</title>
      <dc:creator>Pawan Kumar</dc:creator>
      <pubDate>Thu, 04 Jun 2026 12:06:57 +0000</pubDate>
      <link>https://dev.to/the-persistent-engineer/before-the-pod-starts-gpu-node-setup-for-llms-on-kubernetes-ae6</link>
      <guid>https://dev.to/the-persistent-engineer/before-the-pod-starts-gpu-node-setup-for-llms-on-kubernetes-ae6</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://www.dheeth.blog/before-the-pod-starts-gpu-node-setup-llms-kubernetes/" rel="noopener noreferrer"&gt;dheeth.blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Series links&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.dheeth.blog/llm-serving-is-not-normal-web-serving/" rel="noopener noreferrer"&gt;Part 1: Everything You Know About Scaling Web Apps Breaks When You Serve an LLM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.dheeth.blog/real-unit-of-llm-infrastructure-is-the-token/" rel="noopener noreferrer"&gt;Part 2: The Request Is the Wrong Unit of Scale for LLMs on Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.dheeth.blog/trillion-parameter-model-kubernetes-cluster/" rel="noopener noreferrer"&gt;Part 3: How Do You Fit a Trillion-Parameter Model Into a Kubernetes Cluster?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;A pod is usually where Kubernetes conversations start. You write a Deployment, set requests and limits, pick a container image, add a Service, and let the scheduler place the workload somewhere in the cluster.&lt;/p&gt;

&lt;p&gt;That is fine for normal applications. It is not enough for LLM serving.&lt;/p&gt;

&lt;p&gt;Part 3 explained why a large model does not simply "run in a pod." A serving replica may be a coordinated GPU group. It may span multiple GPUs. It may depend on tensor parallelism, pipeline parallelism, expert parallelism, NCCL communication, model server behavior, and the shape of the hardware underneath it.&lt;/p&gt;

&lt;p&gt;Part 4 moves one layer down: before the pod starts, the GPU node has to be prepared correctly. Kubernetes has to know that a node has GPUs. The container runtime has to expose those GPUs into containers. The node needs the right driver stack. The device plugin has to advertise schedulable resources. Labels have to describe what kind of GPU capacity exists. Metrics have to tell you whether the GPUs are healthy and useful. If you use MIG, time-slicing, or MPS, the sharing model has to be explicit.&lt;/p&gt;

&lt;p&gt;Otherwise Kubernetes is scheduling blind.&lt;/p&gt;

&lt;p&gt;It may see a node. It may even see &lt;code&gt;nvidia.com/gpu&lt;/code&gt;. But that still does not mean the node is ready to serve LLM traffic well.&lt;/p&gt;

&lt;h2&gt;
  
  
  A GPU node is not just a bigger worker node
&lt;/h2&gt;

&lt;p&gt;A normal Kubernetes worker node needs a kubelet, a container runtime, networking, storage integration, and enough CPU and memory to run pods. A GPU node needs all of that, plus a second hardware and software stack that has to line up cleanly.&lt;/p&gt;

&lt;p&gt;At minimum, you care about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the GPU model and memory size&lt;/li&gt;
&lt;li&gt;the NVIDIA driver&lt;/li&gt;
&lt;li&gt;CUDA compatibility&lt;/li&gt;
&lt;li&gt;the NVIDIA Container Toolkit&lt;/li&gt;
&lt;li&gt;the Kubernetes device plugin&lt;/li&gt;
&lt;li&gt;GPU feature labels&lt;/li&gt;
&lt;li&gt;monitoring through DCGM&lt;/li&gt;
&lt;li&gt;node pool isolation&lt;/li&gt;
&lt;li&gt;taints and tolerations&lt;/li&gt;
&lt;li&gt;runtime behavior for MIG, MPS, or time-slicing&lt;/li&gt;
&lt;li&gt;whether the node can support the serving engine you plan to run&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why "add GPU nodes" is a dangerous oversimplification. A node with a T4, a node with an A10, a node with an A100 split into MIG instances, and a node with H100s connected through NVLink are all very different scheduling targets.&lt;/p&gt;

&lt;p&gt;For a small model, that difference may only affect throughput. For a large model, it may decide whether the deployment works at all.&lt;/p&gt;

&lt;p&gt;The Kubernetes scheduler does not automatically understand all of those details. It schedules based on resources, constraints, labels, taints, affinity rules, and plugins. If the GPU node does not publish the right information, Kubernetes cannot make a good placement decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Kubernetes actually sees
&lt;/h2&gt;

&lt;p&gt;Kubernetes has a generic way to work with special hardware through the &lt;a href="https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/" rel="noopener noreferrer"&gt;device plugin framework&lt;/a&gt;. The kubelet does not magically discover every accelerator and understand how to allocate it. A vendor or third-party device plugin registers with the kubelet and advertises device resources to the node.&lt;/p&gt;

&lt;p&gt;For NVIDIA GPUs, the common resource name is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nvidia.com/gpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the device plugin is running, a pod can request that extended resource with a quantity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the basic Kubernetes contract. The pod asks for one GPU. The node says it has allocatable GPU resources. The scheduler only places the pod on a node that can satisfy the request.&lt;/p&gt;

&lt;p&gt;Useful, but limited.&lt;/p&gt;

&lt;p&gt;That resource request hides most of the information LLM platforms actually need. The GPU might have 16 GB, 80 GB, or 192 GB of memory. The node may or may not have NVLink between GPUs. The GPU might be in MIG mode. The node might belong to an inference pool, a training pool, a batch pool, or somebody's experiment corner. DCGM may already be reporting errors. The model server may need a topology this node cannot provide.&lt;/p&gt;

&lt;p&gt;The device plugin makes GPUs schedulable. It does not make Kubernetes an LLM placement brain.&lt;/p&gt;

&lt;p&gt;That distinction matters. A lot of LLM failures start when teams treat &lt;code&gt;nvidia.com/gpu: 1&lt;/code&gt; as the whole story.&lt;/p&gt;

&lt;h2&gt;
  
  
  The NVIDIA GPU Operator is the usual starting point
&lt;/h2&gt;

&lt;p&gt;You can install every GPU component manually, but most production Kubernetes setups use the &lt;a href="https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/overview.html" rel="noopener noreferrer"&gt;NVIDIA GPU Operator&lt;/a&gt; or a cloud provider equivalent. The operator exists because a GPU node needs more than one daemon.&lt;/p&gt;

&lt;p&gt;NVIDIA describes the problem plainly: Kubernetes can provide access to special hardware through device plugins, but configuring nodes also requires drivers, container runtimes, libraries, monitoring, and other components. The GPU Operator automates much of that node-level software stack.&lt;/p&gt;

&lt;p&gt;In practice, the operator can manage or deploy components such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NVIDIA drivers, if you want the operator to manage them&lt;/li&gt;
&lt;li&gt;NVIDIA Container Toolkit&lt;/li&gt;
&lt;li&gt;NVIDIA Kubernetes device plugin&lt;/li&gt;
&lt;li&gt;GPU Feature Discovery&lt;/li&gt;
&lt;li&gt;DCGM and DCGM Exporter&lt;/li&gt;
&lt;li&gt;MIG Manager&lt;/li&gt;
&lt;li&gt;validator pods&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The exact setup depends on your environment. Managed Kubernetes providers sometimes preinstall drivers or handle parts of the stack. Bare metal clusters may need the operator to do more. Air-gapped clusters need image mirroring and version discipline. Some organizations deliberately manage drivers outside the cluster because kernel and driver upgrades are part of their node image pipeline.&lt;/p&gt;

&lt;p&gt;A basic install usually starts with Helm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

kubectl create namespace gpu-operator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your cluster enforces Pod Security Admission, label that namespace before the operator starts creating privileged node-level components:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl label &lt;span class="nt"&gt;--overwrite&lt;/span&gt; ns gpu-operator &lt;span class="se"&gt;\&lt;/span&gt;
  pod-security.kubernetes.io/enforce&lt;span class="o"&gt;=&lt;/span&gt;privileged
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then install the operator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;gpu-operator nvidia/gpu-operator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; gpu-operator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;v26.3.2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--wait&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then check that the operator-managed pods are actually running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; gpu-operator
kubectl get daemonset &lt;span class="nt"&gt;-n&lt;/span&gt; gpu-operator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The exact pod names vary by GPU Operator version and configuration, but this is the point where you should see components for the device plugin, GPU Feature Discovery, DCGM Exporter, validators, and any driver/toolkit pieces your environment needs.&lt;/p&gt;

&lt;p&gt;The important point is not "always install the operator and forget everything else." The point is: there is a GPU node stack, and something has to own it.&lt;/p&gt;

&lt;p&gt;If nobody owns it, the first real LLM workload becomes the integration test.&lt;/p&gt;

&lt;p&gt;That is a bad place to learn that the driver, CUDA userspace, container runtime, and model server image do not agree with each other.&lt;/p&gt;
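
&lt;p&gt;One way to make that ownership explicit is a Helm values file that states which components the operator manages. The fragment below is a sketch for nodes whose image already ships the NVIDIA driver. The file name &lt;code&gt;gpu-operator-values.yaml&lt;/code&gt; is hypothetical, and the keys should be checked against the values of the chart version you actually install:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# gpu-operator-values.yaml (hypothetical file name)
# Assumption: the node image already provides the NVIDIA driver.
driver:
  enabled: false      # driver is owned by the node image pipeline
toolkit:
  enabled: true       # operator manages the NVIDIA Container Toolkit
devicePlugin:
  enabled: true       # advertises nvidia.com/gpu to the kubelet
gfd:
  enabled: true       # GPU Feature Discovery labels
dcgmExporter:
  enabled: true       # GPU metrics for Prometheus
migManager:
  enabled: false      # assumption: no MIG on these nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Passing a file like this to &lt;code&gt;helm install&lt;/code&gt; with &lt;code&gt;-f&lt;/code&gt; keeps the ownership decision in version control instead of in someone's shell history.&lt;/p&gt;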

&lt;h2&gt;
  
  
  The device plugin turns GPUs into schedulable resources
&lt;/h2&gt;

&lt;p&gt;The NVIDIA device plugin is the bridge between the physical GPUs on the node and the resources Kubernetes can schedule. It runs on GPU nodes, discovers the devices, registers them with the kubelet, and exposes resources such as &lt;code&gt;nvidia.com/gpu&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is the part many platform engineers recognize first because it shows up directly in pod specs.&lt;/p&gt;

&lt;p&gt;A minimal workload might request one GPU like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm-worker&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;worker&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;example/llm-server:latest&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the device plugin is running, verify what the node advertises:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl describe node &amp;lt;gpu-node-name&amp;gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A6&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s2"&gt;"Capacity|Allocatable"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On a node with one physical GPU and no sharing enabled, you might see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Capacity:
  cpu:                32
  memory:             131932000Ki
  nvidia.com/gpu:     1

Allocatable:
  cpu:                32
  memory:             131829600Ki
  nvidia.com/gpu:     1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



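&lt;p&gt;On a node with four physical GPUs, the same resource name may show a capacity of &lt;code&gt;4&lt;/code&gt;. Without MIG or time-slicing, that number usually maps to the physical GPU count. With time-slicing, it becomes logical shared capacity instead: one physical GPU configured with four replicas is advertised as &lt;code&gt;nvidia.com/gpu: 4&lt;/code&gt;, even though only one card sits behind it. That difference matters.&lt;/p&gt;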

&lt;p&gt;You should also run a small GPU smoke test before trusting the node for model serving:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cuda-vectoradd&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OnFailure&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cuda-vectoradd&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0-ubuntu22.04&lt;/span&gt;
      &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then apply it and check the logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; cuda-vectoradd.yaml
kubectl logs pod/cuda-vectoradd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That proves two things: Kubernetes can schedule a pod that requests a GPU, and the container can actually use the GPU runtime path. It still does not prove that the node is good for a large LLM.&lt;/p&gt;

&lt;p&gt;A pod spec that requests &lt;code&gt;nvidia.com/gpu: 1&lt;/code&gt; is useful, but it is only the outermost layer. For a serious LLM workload, you usually need to ask more questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which GPU model should this pod land on?&lt;/li&gt;
&lt;li&gt;How much GPU memory does the model need?&lt;/li&gt;
&lt;li&gt;Is this a full GPU, a MIG slice, or a time-sliced replica?&lt;/li&gt;
&lt;li&gt;Can the serving engine use this GPU type efficiently?&lt;/li&gt;
&lt;li&gt;Does this workload need multiple GPUs on the same node?&lt;/li&gt;
&lt;li&gt;Does it need a specific driver or CUDA capability?&lt;/li&gt;
&lt;li&gt;Should it avoid nodes shared with batch or notebook workloads?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The scheduler can only respect these requirements if you express them through resources, labels, affinity, taints, topology constraints, or a higher-level scheduler. If the cluster only exposes a flat &lt;code&gt;nvidia.com/gpu&lt;/code&gt; resource, you have thrown away a lot of useful placement information.&lt;/p&gt;

&lt;p&gt;For simple inference, that may be acceptable. For large LLM serving, it usually is not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Labels are how the node starts telling the truth
&lt;/h2&gt;

&lt;p&gt;Kubernetes scheduling improves when nodes describe themselves. That is where Node Feature Discovery and GPU Feature Discovery come in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kubernetes-sigs.github.io/node-feature-discovery/stable/get-started/introduction.html" rel="noopener noreferrer"&gt;Node Feature Discovery&lt;/a&gt; detects hardware features available on each node and advertises them through node labels, and optionally extended resources, annotations, and taints. It is not GPU-specific. It can label CPU features, kernel features, PCI devices, and other node capabilities.&lt;/p&gt;

&lt;p&gt;GPU Feature Discovery is NVIDIA-specific. It labels GPU properties so workloads and schedulers can distinguish between different GPU nodes. Historically it existed as its own project, and NVIDIA has since archived the standalone repository, but the function remains part of the GPU Operator stack.&lt;/p&gt;

&lt;p&gt;The labels are the difference between "this node has a GPU" and "this node has the kind of GPU I want."&lt;/p&gt;

&lt;p&gt;You might care about labels for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU product name&lt;/li&gt;
&lt;li&gt;GPU count&lt;/li&gt;
&lt;li&gt;GPU memory&lt;/li&gt;
&lt;li&gt;CUDA driver capability&lt;/li&gt;
&lt;li&gt;MIG capability&lt;/li&gt;
&lt;li&gt;MIG strategy&lt;/li&gt;
&lt;li&gt;GPU family or architecture&lt;/li&gt;
&lt;li&gt;whether a node belongs to a production inference pool&lt;/li&gt;
&lt;li&gt;whether a node is allowed to run experimental workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The exact label names vary by component and version, so do not hard-code examples from a blog post into production without checking your cluster. The pattern is what matters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;accelerator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia-h100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;affinity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;nodeAffinity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requiredDuringSchedulingIgnoredDuringExecution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;nodeSelectorTerms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;matchExpressions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nvidia.com/gpu.product&lt;/span&gt;
              &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
              &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;NVIDIA-H100-80GB-HBM3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the practical scheduling jump. You stop saying "give me a GPU" and start saying "give me this class of GPU node."&lt;/p&gt;

&lt;p&gt;LLM serving needs that distinction because GPU memory, interconnect, and serving-engine support shape the deployment. A 7B model, a 70B model, and a multi-GPU serving group should not all be treated as generic GPU workloads.&lt;/p&gt;
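
&lt;p&gt;Combining a GPU class label with a multi-GPU request might look like the sketch below. The pod name is hypothetical, and the product label value is only an example; confirm the exact label keys and values against the labels on your own nodes before relying on them:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: llm-worker-multi-gpu   # hypothetical name
spec:
  nodeSelector:
    # Example value only; check the labels your nodes actually carry
    nvidia.com/gpu.product: NVIDIA-H100-80GB-HBM3
  containers:
    - name: worker
      image: example/llm-server:latest
      resources:
        limits:
          # A pod runs on a single node, so all four GPUs come from one node
          nvidia.com/gpu: 4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The selector narrows placement to one class of node, and the resource limit then requires that node to have four free GPUs. Neither line says anything about NVLink or topology, which still has to be guaranteed by how the node pool itself is built.&lt;/p&gt;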

&lt;h2&gt;
  
  
  Taints and tolerations keep GPU nodes from becoming expensive junk drawers
&lt;/h2&gt;

&lt;p&gt;GPU nodes are too expensive to become general worker nodes by accident.&lt;/p&gt;

&lt;p&gt;A common pattern is to taint GPU nodes so normal pods do not land there unless they explicitly tolerate the taint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl taint nodes gpu-node-1 &lt;span class="nv"&gt;accelerator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nvidia:NoSchedule
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then GPU workloads add a toleration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;tolerations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accelerator"&lt;/span&gt;
    &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Equal"&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nvidia"&lt;/span&gt;
    &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NoSchedule"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That looks basic, but it matters. Without isolation, GPU nodes can become a dumping ground for random sidecars, CPU-heavy services, log agents with bad limits, notebooks, experiments, and batch jobs that make production inference harder to reason about.&lt;/p&gt;
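
&lt;p&gt;One side effect is worth checking: a custom taint also applies to the operator's own DaemonSets. If they do not tolerate it, the device plugin and exporter pods never reach the tainted GPU nodes. The values fragment below is a sketch that assumes the GPU Operator chart's &lt;code&gt;daemonsets.tolerations&lt;/code&gt; setting and the &lt;code&gt;accelerator=nvidia&lt;/code&gt; taint from the example above. Verify the key name and its defaults for your chart version:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Helm values fragment for the gpu-operator chart (sketch, verify per version)
daemonsets:
  tolerations:
    # Toleration commonly used for GPU nodes tainted with the nvidia.com/gpu key
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
    # Matches the custom taint used in this article's example
    - key: accelerator
      operator: Equal
      value: nvidia
      effect: NoSchedule
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;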

&lt;p&gt;For LLMs, you may need more than one GPU pool:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;production online inference&lt;/li&gt;
&lt;li&gt;batch inference&lt;/li&gt;
&lt;li&gt;experiments and notebooks&lt;/li&gt;
&lt;li&gt;fine-tuning or training&lt;/li&gt;
&lt;li&gt;small-model serving&lt;/li&gt;
&lt;li&gt;large-model serving&lt;/li&gt;
&lt;li&gt;MIG-backed shared inference&lt;/li&gt;
&lt;li&gt;full-GPU serving&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These pools may use the same Kubernetes cluster but should not have the same scheduling policy. Taints, labels, node selectors, priority classes, quotas, and admission policy are the boring controls that keep the expensive hardware usable.&lt;/p&gt;

&lt;p&gt;This is also where platform teams start turning hardware into a product surface. Developers should not need to know every node name. They should be able to ask for a workload class, such as "small shared GPU inference" or "full H100 production inference," and let the platform map that to the right node pool.&lt;/p&gt;

&lt;h2&gt;
  
  
  DCGM is how you know whether the GPU is healthy and busy
&lt;/h2&gt;

&lt;p&gt;Scheduling is only half the story. Once workloads land on GPU nodes, you need to know whether the GPUs are actually working well.&lt;/p&gt;

&lt;p&gt;That is where &lt;a href="https://developer.nvidia.com/dcgm" rel="noopener noreferrer"&gt;DCGM&lt;/a&gt; and DCGM Exporter enter the setup. DCGM provides GPU telemetry. DCGM Exporter exposes metrics that can be scraped by Prometheus and visualized in Grafana or another observability stack.&lt;/p&gt;

&lt;p&gt;If you installed the GPU Operator, DCGM Exporter is usually already part of the operator-managed stack: NVIDIA's chart exposes &lt;code&gt;dcgmExporter.enabled&lt;/code&gt;, and the default is &lt;code&gt;true&lt;/code&gt;. So first check whether it is already there before installing anything separately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; gpu-operator | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; dcgm
kubectl get svc &lt;span class="nt"&gt;-n&lt;/span&gt; gpu-operator | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; dcgm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your platform disables that component, or if you are not using the GPU Operator, then deploy DCGM Exporter separately through your observability stack instead of assuming GPU metrics will appear automatically.&lt;/p&gt;

&lt;p&gt;For LLM serving, useful DCGM metrics include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;DCGM_FI_DEV_GPU_UTIL&lt;/code&gt;: GPU compute utilization&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DCGM_FI_DEV_MEM_COPY_UTIL&lt;/code&gt;: memory copy utilization&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DCGM_FI_DEV_FB_USED&lt;/code&gt;: framebuffer memory used&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DCGM_FI_DEV_FB_FREE&lt;/code&gt;: framebuffer memory free&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DCGM_FI_DEV_GPU_TEMP&lt;/code&gt;: GPU temperature&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DCGM_FI_DEV_POWER_USAGE&lt;/code&gt;: power usage&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DCGM_FI_DEV_XID_ERRORS&lt;/code&gt;: value of the last XID error encountered (an error code, not a count)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DCGM_FI_DEV_ECC_DBE_VOL_TOTAL&lt;/code&gt;: volatile double-bit ECC errors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those map directly to practical questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the model filling GPU memory before traffic even arrives?&lt;/li&gt;
&lt;li&gt;Is KV cache pressure eating the remaining memory during generation?&lt;/li&gt;
&lt;li&gt;Is the GPU busy, or is the model server queueing somewhere else?&lt;/li&gt;
&lt;li&gt;Is memory movement becoming the bottleneck?&lt;/li&gt;
&lt;li&gt;Is the card throttling or throwing hardware-level errors?&lt;/li&gt;
&lt;li&gt;Is this node safe to keep in the serving pool?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Be careful with one metric: raw GPU utilization can lie to you.&lt;/p&gt;

&lt;p&gt;A GPU can show high utilization while users still see poor time to first token because queueing is bad. A GPU can show moderate utilization while KV cache pressure is the real limiter. A GPU can be busy with the wrong mix of prefill and decode work. A GPU can be allocated to a pod that is not producing useful throughput.&lt;/p&gt;

&lt;p&gt;So DCGM metrics are necessary, but they are not sufficient. You still need model-server metrics from vLLM, Triton, TensorRT-LLM, TGI, SGLang, or whatever you run. The GPU layer tells you what the hardware is doing. The serving layer tells you whether the model is serving traffic well.&lt;/p&gt;

&lt;p&gt;Part 14 of this series will go deeper into autoscaling signals. For now, the practical point is simple: if you cannot observe GPU health and GPU memory pressure, your LLM platform is flying blind.&lt;/p&gt;
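
&lt;p&gt;As a starting point, a few of those metrics can be turned into Prometheus alerting rules. The sketch below assumes Prometheus already scrapes DCGM Exporter and that the exporter is configured to export these fields. The group name, alert names, thresholds, and durations are placeholders, not recommendations:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;groups:
  - name: gpu-node-health            # hypothetical group name
    rules:
      # A non-zero value means the GPU reported an XID error code
      - alert: GpuXidErrorReported
        expr: DCGM_FI_DEV_XID_ERRORS != 0
        for: 1m
        labels:
          severity: warning
      # Any volatile double-bit ECC error is worth a human look
      - alert: GpuDoubleBitEccErrors
        expr: DCGM_FI_DEV_ECC_DBE_VOL_TOTAL &amp;gt; 0
        labels:
          severity: critical
      # Approximate framebuffer pressure; ignores reserved memory
      - alert: GpuFramebufferNearlyFull
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) &amp;gt; 0.95
        for: 10m
        labels:
          severity: warning
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Model servers that preallocate GPU memory will sit near a fixed framebuffer level by design, so the memory threshold has to be tuned against the serving engine's own settings rather than copied as-is.&lt;/p&gt;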

&lt;h2&gt;
  
  
  MIG is not the same as time-slicing
&lt;/h2&gt;

&lt;p&gt;GPU sharing is one of the easiest places to confuse yourself because the words sound similar but the isolation model is different.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-operator-mig.html" rel="noopener noreferrer"&gt;MIG, or Multi-Instance GPU&lt;/a&gt;, lets supported NVIDIA GPUs partition a physical GPU into separate GPU instances. NVIDIA describes MIG as a way to partition GPUs based on Ampere and later architectures into separate and secure GPU instances for CUDA applications. The GPU Operator can deploy MIG Manager to manage MIG configuration on Kubernetes nodes.&lt;/p&gt;

&lt;p&gt;MIG is useful when you want stronger partitioning. A large GPU can be split into smaller slices so several workloads can run with more predictable boundaries. For smaller models, internal tools, embeddings workloads, evaluation jobs, or lower-tier inference, that can be a good use of expensive hardware.&lt;/p&gt;

&lt;p&gt;But MIG is not magic. A MIG slice has less memory and compute than the full GPU. A model that needs a full 80 GB GPU will not fit just because the physical card is present. A workload that depends on multiple full GPUs may not be happy on fragmented MIG capacity. Changing MIG geometry can also be operationally disruptive. NVIDIA notes that MIG Manager requires no user workloads running on the GPUs being configured, and in some environments the node may need a reboot.&lt;/p&gt;

&lt;p&gt;That matters for production planning. MIG configuration is not something you casually flip during an incident.&lt;/p&gt;
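
&lt;p&gt;How a pod asks for a MIG slice depends on the MIG strategy. With the single strategy, slices are still advertised as &lt;code&gt;nvidia.com/gpu&lt;/code&gt;. With the mixed strategy, each profile gets its own resource name. The sketch below assumes the mixed strategy and a node that has been configured with a &lt;code&gt;1g.10gb&lt;/code&gt; profile; the pod name and image are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: embeddings-worker            # hypothetical name
spec:
  containers:
    - name: worker
      image: example/embeddings-server:latest   # hypothetical image
      resources:
        limits:
          # Assumes the mixed MIG strategy and a 1g.10gb profile on the node
          nvidia.com/mig-1g.10gb: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If no node advertises that profile, the pod stays pending, which is usually preferable to a model silently landing on a slice that is too small for it.&lt;/p&gt;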

&lt;p&gt;Time-slicing is different. NVIDIA's &lt;a href="https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html" rel="noopener noreferrer"&gt;GPU Operator time-slicing documentation&lt;/a&gt; explains that time-slicing enables oversubscription by letting workloads scheduled on an oversubscribed GPU interleave with one another. Unlike MIG, time-slicing does not provide memory or fault isolation between replicas.&lt;/p&gt;

&lt;p&gt;A cluster-wide time-slicing config looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;time-slicing-config&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;any&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
    &lt;span class="s"&gt;version: v1&lt;/span&gt;
    &lt;span class="s"&gt;flags:&lt;/span&gt;
      &lt;span class="s"&gt;migStrategy: none&lt;/span&gt;
    &lt;span class="s"&gt;sharing:&lt;/span&gt;
      &lt;span class="s"&gt;timeSlicing:&lt;/span&gt;
        &lt;span class="s"&gt;renameByDefault: false&lt;/span&gt;
        &lt;span class="s"&gt;failRequestsGreaterThanOne: false&lt;/span&gt;
        &lt;span class="s"&gt;resources:&lt;/span&gt;
          &lt;span class="s"&gt;- name: nvidia.com/gpu&lt;/span&gt;
            &lt;span class="s"&gt;replicas: 4&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create it in the operator namespace and point the device plugin at it during install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create namespace gpu-operator
kubectl create &lt;span class="nt"&gt;-n&lt;/span&gt; gpu-operator &lt;span class="nt"&gt;-f&lt;/span&gt; time-slicing-config.yaml

helm &lt;span class="nb"&gt;install &lt;/span&gt;gpu-operator nvidia/gpu-operator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; gpu-operator &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;v26.3.2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; devicePlugin.config.name&lt;span class="o"&gt;=&lt;/span&gt;time-slicing-config &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--wait&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the operator is already installed, patch the ClusterPolicy instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl patch clusterpolicies.nvidia.com/cluster-policy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; gpu-operator &lt;span class="nt"&gt;--type&lt;/span&gt; merge &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s1"&gt;'{"spec":{"devicePlugin":{"config":{"name":"time-slicing-config","default":"any"}}}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;replicas: 4&lt;/code&gt;, one physical GPU can advertise four schedulable shared replicas. Four physical GPUs can advertise sixteen. With &lt;code&gt;renameByDefault: false&lt;/code&gt;, the resource name remains &lt;code&gt;nvidia.com/gpu&lt;/code&gt;, while labels such as &lt;code&gt;nvidia.com/gpu.product&lt;/code&gt; can get a &lt;code&gt;-SHARED&lt;/code&gt; suffix and &lt;code&gt;nvidia.com/gpu.replicas=4&lt;/code&gt; tells you the oversubscription factor.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl describe node &amp;lt;gpu-node-name&amp;gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A8&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s2"&gt;"Labels:|Capacity:|Allocatable:"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example shape for one physical GPU with four time-sliced replicas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Labels:
  nvidia.com/gpu.count=1
  nvidia.com/gpu.product=Tesla-T4-SHARED
  nvidia.com/gpu.replicas=4

Capacity:
  nvidia.com/gpu: 4

Allocatable:
  nvidia.com/gpu: 4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



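&lt;p&gt;That is the tradeoff: the scheduler now sees four GPUs, but there is still one physical card and one pool of GPU memory behind them.&lt;/p&gt;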

&lt;p&gt;Time-slicing can be useful for lightweight workloads, experiments, notebooks, CI jobs, embeddings, small internal tools, dev/test endpoints, or low-duty-cycle inference where exclusive GPU access would waste money. If every tiny workload asks for a full &lt;code&gt;nvidia.com/gpu: 1&lt;/code&gt; and gets exclusive access, one notebook or one small model can occupy the entire scheduling unit while using only a fraction of the card.&lt;/p&gt;

&lt;p&gt;Time-slicing helps utilization by allowing more pods to share the same physical GPU over time. The value is sharing, not isolation.&lt;/p&gt;

&lt;p&gt;A pod that requests a time-sliced GPU is not getting a private piece of hardware. It is getting shared access to an underlying GPU. It does not get separate GPU memory. It does not get fault isolation. It does not get guaranteed proportional compute. NVIDIA explicitly notes that requesting more than one time-sliced GPU does not guarantee a proportional amount of GPU compute power.&lt;/p&gt;

&lt;p&gt;So do not treat &lt;code&gt;replicas: 4&lt;/code&gt; as four real GPUs. Use time-slicing for workloads that can tolerate noisy neighbors. Be very careful with latency-sensitive LLM serving, large models near memory limits, or coordinated multi-GPU serving groups.&lt;/p&gt;
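&lt;p&gt;One way to keep that distinction visible is to read the labels back and compare them with what the node advertises. The sketch below is a minimal example, not an official tool: it takes the already-parsed output of &lt;code&gt;kubectl get node NAME -o json&lt;/code&gt; as a dictionary, and the function name &lt;code&gt;summarize_gpu_node&lt;/code&gt; is made up for illustration. It assumes &lt;code&gt;renameByDefault: false&lt;/code&gt;, so the resource is still called &lt;code&gt;nvidia.com/gpu&lt;/code&gt;.&lt;/p&gt;

```python
def summarize_gpu_node(node):
    """Summarize GPU sharing from a parsed `kubectl get node NAME -o json` document."""
    labels = node.get("metadata", {}).get("labels", {})
    allocatable = node.get("status", {}).get("allocatable", {})

    physical = int(labels.get("nvidia.com/gpu.count", "0"))
    # Fall back to 1 replica per GPU if the label is missing.
    replicas = int(labels.get("nvidia.com/gpu.replicas", "1"))
    advertised = int(allocatable.get("nvidia.com/gpu", "0"))

    return {
        "product": labels.get("nvidia.com/gpu.product", "unknown"),
        "physical_gpus": physical,
        "replicas_per_gpu": replicas,
        "advertised_gpus": advertised,
        "shared": replicas != 1,
        # With renameByDefault: false, advertised units should equal count x replicas.
        "consistent": advertised == physical * replicas,
    }


# Sample input shaped like the time-sliced T4 node shown earlier.
node = {
    "metadata": {"labels": {
        "nvidia.com/gpu.count": "1",
        "nvidia.com/gpu.product": "Tesla-T4-SHARED",
        "nvidia.com/gpu.replicas": "4",
    }},
    "status": {"allocatable": {"nvidia.com/gpu": "4"}},
}
summary = summarize_gpu_node(node)
print(summary["advertised_gpus"], "schedulable units on", summary["physical_gpus"], "physical GPU(s)")
```

&lt;p&gt;Reporting physical and advertised counts side by side makes it harder to mistake four replicas for four cards when reading a capacity dashboard.&lt;/p&gt;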

&lt;p&gt;MPS, the NVIDIA Multi-Process Service, is another sharing mechanism. It can improve GPU utilization for multiple CUDA processes by letting them share execution resources more efficiently, but it also needs careful workload-level testing. For LLM serving, the question is not "can we share this GPU?" The question is "can we share this GPU without destroying latency, memory predictability, or failure isolation?"&lt;/p&gt;

&lt;p&gt;Those are different questions.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPU memory is a scheduling constraint, even when Kubernetes does not see it that way
&lt;/h2&gt;

&lt;p&gt;This is one of the biggest gaps between LLM reality and default Kubernetes scheduling.&lt;/p&gt;

&lt;p&gt;Kubernetes can schedule &lt;code&gt;nvidia.com/gpu: 1&lt;/code&gt;. But a single GPU is not a uniform unit. The useful capacity depends heavily on GPU memory.&lt;/p&gt;

&lt;p&gt;A 7B model in FP16 or BF16 may fit on many cards. A 70B model may need much more memory, especially after you include KV cache and runtime overhead. A long-context workload can run out of memory even if the model weights fit. A workload with high concurrency can hit KV cache pressure long before the scheduler, which only counts whole GPUs, sees any problem.&lt;/p&gt;

&lt;p&gt;Kubernetes does not natively schedule based on "80 GB of GPU memory free for this pod" in the same way it handles CPU and RAM requests. You need to model this through one or more of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;separate node pools by GPU memory class&lt;/li&gt;
&lt;li&gt;labels for GPU product and memory size&lt;/li&gt;
&lt;li&gt;admission policy that maps workload profiles to allowed GPU classes&lt;/li&gt;
&lt;li&gt;MIG profiles when slicing is appropriate&lt;/li&gt;
&lt;li&gt;model-server-level controls for max context, max batch size, and max concurrent sequences&lt;/li&gt;
&lt;li&gt;observability that catches GPU memory pressure before users do&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why Part 3's memory math matters even after you leave the article. Weight memory math tells you what class of node the model can run on. GPU node setup tells Kubernetes how to find that class of node.&lt;/p&gt;

&lt;p&gt;If you skip this step, you get weird failures: pods schedule successfully, containers start, the model begins loading, then dies with CUDA out-of-memory errors. Kubernetes did its job. You gave it the wrong abstraction.&lt;/p&gt;
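&lt;p&gt;A rough pre-flight estimate can catch this class of failure before a pod is ever scheduled. The sketch below adds raw weight memory to a standard KV cache estimate (one key and one value vector per layer per cached token) and compares the total with the card's memory. The 10 percent runtime reserve, the function names, and the 7B-class model shape are illustrative assumptions, not measurements from a real deployment.&lt;/p&gt;

```python
GB = 10**9


def weight_bytes(num_params, bytes_per_param=2):
    """Raw weight memory; 2 bytes per parameter for FP16 or BF16."""
    return num_params * bytes_per_param


def kv_cache_bytes(total_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache for `total_tokens` cached tokens summed over all sequences.

    The leading 2 accounts for one key and one value vector per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * total_tokens


def fits_on_gpu(gpu_mem_bytes, weights, kv_cache, reserve_percent=10):
    """True if weights plus KV cache fit after holding back a runtime reserve."""
    usable = gpu_mem_bytes * (100 - reserve_percent) // 100
    return usable >= weights + kv_cache


# Assumed 7B-class shape: 32 layers, 32 KV heads, head dimension 128.
weights = weight_bytes(7 * 10**9)
kv = kv_cache_bytes(8 * 4096, 32, 32, 128)  # 8 sequences x 4096 tokens each

print("weights GB:", weights / GB)
print("kv cache GB:", round(kv / GB, 1))
print("weights alone fit on 24 GB:", fits_on_gpu(24 * GB, weights, 0))
print("weights + KV fit on 24 GB:", fits_on_gpu(24 * GB, weights, kv))
print("weights + KV fit on 80 GB:", fits_on_gpu(80 * GB, weights, kv))
```

&lt;p&gt;With these assumed numbers the weights alone fit on a 24 GB card, but the same model with eight 4096-token sequences in flight does not. To Kubernetes, both cases are just &lt;code&gt;nvidia.com/gpu: 1&lt;/code&gt;.&lt;/p&gt;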

&lt;h2&gt;
  
  
  The container runtime is part of the serving path
&lt;/h2&gt;

&lt;p&gt;Another easy mistake: treating the container image as if it is enough.&lt;/p&gt;

&lt;p&gt;For a GPU workload to work inside a container, the host and runtime must expose the GPU correctly. The NVIDIA Container Toolkit is part of that path. The driver has to exist on the host or be managed through the operator. The container needs compatible userspace libraries. The kubelet and runtime need to know how to make GPU devices available to the container.&lt;/p&gt;

&lt;p&gt;This is why GPU node readiness is more than &lt;code&gt;kubectl get nodes&lt;/code&gt; showing &lt;code&gt;Ready&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A node can be Ready for normal pods and still be broken for GPU workloads. The failure may only appear when a pod tries to start, load CUDA, initialize NCCL, or run the model server. Good GPU platforms usually add validation pods or smoke tests that check the GPU path before developers depend on the node.&lt;/p&gt;

&lt;p&gt;A simple mental checklist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can the node see the GPU?&lt;/li&gt;
&lt;li&gt;Is the driver loaded?&lt;/li&gt;
&lt;li&gt;Can a container see the GPU?&lt;/li&gt;
&lt;li&gt;Does a CUDA sample work?&lt;/li&gt;
&lt;li&gt;Does the device plugin advertise the resource?&lt;/li&gt;
&lt;li&gt;Do labels describe the GPU accurately?&lt;/li&gt;
&lt;li&gt;Does DCGM report metrics?&lt;/li&gt;
&lt;li&gt;Can the intended model server initialize on this node?&lt;/li&gt;
&lt;li&gt;Can a small test model load and serve a request?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answer stops at "the node is Ready," you have not tested enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-GPU nodes need topology awareness
&lt;/h2&gt;

&lt;p&gt;Part 3 talked about tensor parallelism and pipeline parallelism. This is where that discussion touches the node.&lt;/p&gt;

&lt;p&gt;If a model server needs multiple GPUs, placement inside a node matters. GPUs may be connected differently. Some paths have better bandwidth. Some nodes have NVLink. Some rely more heavily on PCIe. The serving engine may assume a certain number of GPUs per worker. NCCL performance may depend on the topology.&lt;/p&gt;

&lt;p&gt;By default, Kubernetes does not reason about your tensor parallel group. If a pod requests four GPUs on a node, the device plugin can allocate devices, but the model server still has to use them correctly. If a deployment needs multiple pods across nodes, Kubernetes can place those pods, but the serving framework has to coordinate the ranks.&lt;/p&gt;

&lt;p&gt;This article is not the dedicated scheduler article. That comes later. But the GPU node setup matters here because scheduling cannot become topology-aware if the platform does not expose useful topology and node information in the first place.&lt;/p&gt;

&lt;p&gt;A practical rule: keep the first successful design boring.&lt;/p&gt;

&lt;p&gt;If a model can run with tensor parallelism inside one node, start there before spreading a single serving replica across nodes. Multi-node serving adds network sensitivity, failure coordination, startup sequencing, and debugging pain. Kubernetes can manage the shape, but it will not make a bad topology fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a practical GPU node baseline looks like
&lt;/h2&gt;

&lt;p&gt;A serious LLM GPU node baseline does not have to be fancy. It needs to be explicit.&lt;/p&gt;

&lt;p&gt;At a minimum, I would want a platform team to know the answers to these questions before onboarding production LLM workloads:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Who owns driver installation and upgrades?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The GPU Operator can manage drivers, or the node image pipeline can manage them. Both can work. The bad answer is "we are not sure."&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How are GPUs advertised to Kubernetes?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The NVIDIA device plugin should expose GPU resources consistently. You should know what resource names workloads request, especially if MIG or time-slicing is enabled.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How are GPU nodes labeled?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Node labels should capture GPU class, node pool purpose, MIG strategy if relevant, and anything else needed for scheduling decisions.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How are GPU nodes isolated?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use taints, tolerations, node pools, quotas, and policy so random workloads do not land on expensive GPU nodes.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How are GPU metrics collected?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DCGM Exporter should feed your observability stack. Model-server metrics should sit beside GPU metrics so you can connect hardware behavior to LLM behavior.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;What sharing mode is allowed?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Full GPU, MIG, time-slicing, and MPS are different operational choices. Do not let teams discover the difference after latency falls apart.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;How do you validate a node before using it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Have a smoke test for CUDA, device plugin resources, labels, DCGM metrics, and a small model-server startup path.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Which workloads are allowed on which GPU classes?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A small embedding service, an internal chatbot, a batch summarization job, and a large production model should not all be scheduled with the same policy.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Turn that baseline into a small verification routine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; gpu-operator
kubectl describe node &amp;lt;gpu-node-name&amp;gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; cuda-vectoradd.yaml
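kubectl &lt;span class="nb"&gt;wait&lt;/span&gt; &lt;span class="nt"&gt;--for&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.status.phase}'&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Succeeded pod/cuda-vectoradd &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;120s
kubectl logs pod/cuda-vectoradd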
kubectl get svc &lt;span class="nt"&gt;-n&lt;/span&gt; gpu-operator | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; dcgm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By the end of this check, you should know whether the node is &lt;code&gt;Ready&lt;/code&gt;, the operator components are running, the device plugin advertises GPU resources, the node has useful GPU labels, GPU workloads can start, DCGM metrics are available, and your sharing mode is explicit.&lt;/p&gt;

&lt;p&gt;This baseline is boring on purpose. Most production incidents are not caused by exotic scheduler theory. They are caused by a missing label, a wrong driver, an unisolated node pool, a bad sharing assumption, or a metric nobody collected.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pod starts late in the story
&lt;/h2&gt;

&lt;p&gt;By the time an LLM pod starts, many decisions have already been made.&lt;/p&gt;

&lt;p&gt;The node pool decided what hardware exists. The driver stack decided whether CUDA works. The device plugin decided what resources Kubernetes can allocate. Feature discovery decided what labels describe the node. Taints and tolerations decided who is allowed to land there. MIG, MPS, or time-slicing decided what "a GPU" means on that node. DCGM decided what you can observe. The model server will decide how efficiently the allocated GPU is used.&lt;/p&gt;

&lt;p&gt;The pod is where all of those decisions meet.&lt;/p&gt;

&lt;p&gt;That is why GPU node setup deserves its own article. It is not glamorous, and it is not the full LLM platform. But if this layer is wrong, everything above it becomes harder: vLLM, Triton, TensorRT-LLM, KServe, Ray, autoscaling, routing, cost control, latency tuning, and multi-tenancy.&lt;/p&gt;

&lt;p&gt;Kubernetes can schedule LLM workloads only as well as the cluster describes its GPU capacity.&lt;/p&gt;

&lt;p&gt;So before you ask why your LLM pod is slow, unstable, expensive, or impossible to place, ask a simpler question:&lt;/p&gt;

&lt;p&gt;What did the GPU node actually tell Kubernetes before the pod started?&lt;/p&gt;




&lt;h2&gt;
  
  
  Continue the series
&lt;/h2&gt;

&lt;p&gt;This is Part 4 of my practical series on hosting large LLMs on Kubernetes. The next parts will move from GPU node setup into real-world scaling stories, model servers, KV cache, batching, scheduling, autoscaling, latency, cost, and production architecture.&lt;/p&gt;

&lt;p&gt;I am also preparing a free &lt;strong&gt;LLM Serving on Kubernetes Production Readiness Checklist&lt;/strong&gt; with the questions platform teams should ask before putting an LLM workload in production. Subscribe to the newsletter and I will share it when it is ready.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>How Do You Fit a Trillion-Parameter Model Into a Kubernetes Cluster?</title>
      <dc:creator>Pawan Kumar</dc:creator>
      <pubDate>Thu, 28 May 2026 03:32:13 +0000</pubDate>
      <link>https://dev.to/the-persistent-engineer/how-do-you-fit-a-trillion-parameter-model-into-a-kubernetes-cluster-2124</link>
      <guid>https://dev.to/the-persistent-engineer/how-do-you-fit-a-trillion-parameter-model-into-a-kubernetes-cluster-2124</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Series links&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.dheeth.blog/llm-serving-is-not-normal-web-serving/" rel="noopener noreferrer"&gt;Part 1: Everything You Know About Scaling Web Apps Breaks When You Serve an LLM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.dheeth.blog/real-unit-of-llm-infrastructure-is-the-token/" rel="noopener noreferrer"&gt;Part 2: The Request Is the Wrong Unit of Scale for LLMs on Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;A giant model does not "run in a pod."&lt;/p&gt;

&lt;p&gt;That sentence sounds wrong if you have spent years thinking in Kubernetes objects. We package software into containers. We run containers in pods. We schedule pods onto nodes. We put Services in front of them. That model works well when the thing inside the container is a web server, worker, queue consumer, or API process.&lt;/p&gt;

&lt;p&gt;Then someone says, "Can we host a trillion-parameter model on Kubernetes?"&lt;/p&gt;

&lt;p&gt;The honest answer is: yes, but not in the way your brain first pictures it. A trillion-parameter model is not one neat process sitting inside one neat pod, waiting for the kubelet to give it enough CPU and memory. It is a pile of weights, communication patterns, parallel workers, GPU memory limits, interconnect assumptions, and serving-engine decisions. Kubernetes can coordinate the outer shape, but the model itself has to be split.&lt;/p&gt;

&lt;p&gt;Part 1 of this series argued that LLM serving is not normal web serving. Part 2 argued that requests are the wrong unit of scale because tokens are the work. Part 3 is about the next uncomfortable step: once the model is large enough, a replica is no longer a pod. A replica may be a group of GPUs, a node, several nodes, or a slice of a much larger GPU cluster working together.&lt;/p&gt;

&lt;p&gt;The pod is just the envelope. The model is the distributed system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Start with the boring math
&lt;/h2&gt;

&lt;p&gt;Before tensor parallelism, pipeline parallelism, expert parallelism, Ray, vLLM, KServe, MPI, NCCL, or any Kubernetes YAML, there is a dumb memory question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How many bytes do the weights need?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The rough formula is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;weight memory = number of parameters x bytes per parameter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is only the model weights. It does not include KV cache, runtime overhead, CUDA graphs, activations during training, optimizer states, communication buffers, fragmentation, or the serving engine's own memory reservations. But for the first pass, it is enough to kill bad assumptions.&lt;/p&gt;

&lt;p&gt;A 1 trillion parameter dense model stored in FP16 or BF16 needs about:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1,000,000,000,000 parameters x 2 bytes = 2,000,000,000,000 bytes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is roughly 2 TB of raw weight memory.&lt;/p&gt;

&lt;p&gt;Not disk. Not object storage. GPU-addressable memory.&lt;/p&gt;

&lt;p&gt;If you had 80 GB GPUs, 2 TB of raw weights already needs 25 GPUs before overhead. If you had 141 GB H200-class GPUs, it still needs about 15 GPUs just for the weights. That does not mean the model is usable with exactly that many GPUs. It means this is the floor before the real serving problem begins.&lt;/p&gt;
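&lt;p&gt;The same floor can be computed for any model and card size. This is a minimal sketch of the arithmetic above using decimal gigabytes; the helper names are made up for illustration, and the result is only the weight floor, not a sizing recommendation.&lt;/p&gt;

```python
def raw_weight_bytes(num_params, bytes_per_param):
    """Raw weight memory: number of parameters x bytes per parameter."""
    return num_params * bytes_per_param


def min_gpus_for_weights(num_params, bytes_per_param, gpu_mem_gb):
    """Smallest GPU count whose combined memory holds the raw weights."""
    total = raw_weight_bytes(num_params, bytes_per_param)
    per_gpu = gpu_mem_gb * 10**9
    return -(-total // per_gpu)  # ceiling division on integers


ONE_TRILLION = 10**12

print(raw_weight_bytes(ONE_TRILLION, 2))           # 2000000000000 bytes, about 2 TB
print(min_gpus_for_weights(ONE_TRILLION, 2, 80))   # 25
print(min_gpus_for_weights(ONE_TRILLION, 2, 141))  # 15
```

&lt;p&gt;Integer ceiling division avoids floating-point surprises when the weights divide the card size exactly, as they do in the 80 GB case.&lt;/p&gt;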

&lt;p&gt;This is where normal Kubernetes thinking starts to mislead people. A pod can request memory. A pod can request &lt;code&gt;nvidia.com/gpu: 8&lt;/code&gt;. Kubernetes can place that pod on a node with enough advertised GPU devices. But Kubernetes does not magically make one process treat 25 separate GPUs as one giant GPU. The serving engine and distributed runtime have to do that work.&lt;/p&gt;

&lt;p&gt;Kubernetes schedules access to hardware. It does not shard the model for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  FP16 and BF16 are not small
&lt;/h2&gt;

&lt;p&gt;FP16 and BF16 are often discussed as if they are memory optimizations, and compared with FP32, they are. FP32 uses 4 bytes per parameter. FP16 and BF16 use 2 bytes. Cutting weight memory in half is a big deal.&lt;/p&gt;

&lt;p&gt;But at trillion-parameter scale, half of enormous is still enormous.&lt;/p&gt;

&lt;p&gt;A 175B parameter model in FP16 or BF16 is about 350 GB of raw weights. That already does not fit on a single common GPU. A 671B parameter model is about 1.34 TB in FP16 or BF16. A 1T dense model is about 2 TB. A 1.8T dense model would be about 3.6 TB.&lt;/p&gt;

&lt;p&gt;Quantization changes the math. FP8 brings 1T parameters down to about 1 TB of raw weights. FP4 brings it down to about 500 GB. NVIDIA's trillion-parameter inference write-up uses an example GPT MoE model with 1.8T parameters stored with FP4, where the raw weights are about 900 GB. On 192 GB GPUs, the theoretical minimum just to hold those weights is five GPUs.&lt;/p&gt;
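&lt;p&gt;To see how precision moves the floor, the sketch below works in bits per parameter so that FP4 stays in integer arithmetic. The table and helper name are illustrative assumptions. Real quantized checkpoints also carry scaling factors and often keep some layers in higher precision, so actual sizes will be somewhat larger.&lt;/p&gt;

```python
# Storage width of each precision, in bits per parameter.
BITS_PER_PARAM = {"FP32": 32, "FP16/BF16": 16, "FP8": 8, "FP4": 4}


def weights_gb(num_params, precision):
    """Raw weight size in decimal gigabytes for a given storage precision."""
    return num_params * BITS_PER_PARAM[precision] // 8 // 10**9


# A 1T-parameter model at each precision.
for precision in BITS_PER_PARAM:
    print(precision, weights_gb(10**12, precision), "GB")
# FP32 4000 GB
# FP16/BF16 2000 GB
# FP8 1000 GB
# FP4 500 GB

# The 1.8T-parameter FP4 example on 192 GB GPUs.
moe_gb = weights_gb(18 * 10**11, "FP4")
floor_gpus = -(-moe_gb // 192)  # ceiling division
print(moe_gb, "GB of weights, floor of", floor_gpus, "GPUs")
```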

&lt;p&gt;That sounds surprisingly small until you remember the word "minimum."&lt;/p&gt;

&lt;p&gt;Five GPUs may hold the weights. They may not generate tokens fast enough. They may not leave enough room for KV cache. For long-context models, KV cache can consume tens of gigabytes per busy replica, and at real concurrency it competes directly with weights for GPU memory. The same five GPUs may also communicate too much. They may deliver terrible time to first token. They may support one beautiful demo and then fall apart under real traffic.&lt;/p&gt;

&lt;p&gt;The memory floor tells you whether a deployment is possible. It does not tell you whether it is good.&lt;/p&gt;

&lt;h2&gt;
  
  
  The model gets split because memory is only one problem
&lt;/h2&gt;

&lt;p&gt;When a model does not fit on one GPU, there are two broad things you can do.&lt;/p&gt;

&lt;p&gt;You can make the model smaller: quantize it, distill it, pick a smaller checkpoint, reduce context length, use adapters, or route some traffic to a cheaper model. Those are valid and often the right production choices, but that is not the focus of this part.&lt;/p&gt;

&lt;p&gt;Or you can split the model across GPUs.&lt;/p&gt;

&lt;p&gt;Splitting is where the word "replica" becomes slippery. In a normal web app, one replica is usually one pod. In LLM serving, one model replica may require multiple GPU workers that must cooperate for every token. If one worker is slow, missing, placed badly, or stuck behind a bad network path, the whole replica suffers.&lt;/p&gt;

&lt;p&gt;That is why large-model serving feels less like "run N pods" and more like "assemble a tiny supercomputer for each serving replica."&lt;/p&gt;

&lt;p&gt;There are three important forms of model splitting to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tensor parallelism&lt;/li&gt;
&lt;li&gt;Pipeline parallelism&lt;/li&gt;
&lt;li&gt;Expert parallelism&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They are often combined with data parallelism, where you run multiple independent replicas of the sharded model to serve more traffic. Data parallelism is easy to understand once the model replica fits somewhere. The hard part is making one replica exist in the first place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tensor parallelism splits the inside of a layer
&lt;/h2&gt;

&lt;p&gt;Tensor parallelism splits individual tensors inside a model layer across GPUs. Instead of putting the full matrix multiplication for a transformer layer on one GPU, the serving engine divides the layer's work across several GPUs and combines the result.&lt;/p&gt;

&lt;p&gt;This is useful because transformers have large matrix operations that can be partitioned. Megatron-LM popularized this style of tensor model parallelism for GPT-like models, and vLLM's distributed serving documentation still points to Megatron-LM's tensor parallel algorithm as the implementation basis.&lt;/p&gt;

&lt;p&gt;A simple mental model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;One transformer layer
  GPU 0: owns one slice of the weight matrix
  GPU 1: owns another slice
  GPU 2: owns another slice
  GPU 3: owns another slice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For every token step, the GPUs compute their slices and then exchange partial results. This is powerful, but it is not free. Tensor parallelism depends heavily on fast GPU-to-GPU communication. Inside a node with NVLink or another high-bandwidth interconnect, it can work well. Across nodes, the communication cost can get ugly.&lt;/p&gt;

&lt;p&gt;That is why many practical guides recommend keeping tensor parallelism inside a node when possible. vLLM's scaling guidance gives the same shape: if the model is too large for one GPU but fits on one multi-GPU machine, use tensor parallelism. If you have 4 GPUs in the node, &lt;code&gt;tensor_parallel_size=4&lt;/code&gt; is the obvious starting point.&lt;/p&gt;

&lt;p&gt;Tensor parallelism makes one layer wider across GPUs. It helps with memory and per-token compute, but it ties those GPUs together tightly. They are not independent pods anymore. They are pieces of one inference machine.&lt;/p&gt;
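&lt;p&gt;A toy version of the idea fits in plain Python. The sketch below splits a weight matrix by columns across pretend "GPUs", lets each one compute its slice of the output, and concatenates the partial results. It illustrates one column-parallel split only; it is not how Megatron-LM or vLLM are implemented, and real engines run these steps on GPUs with collective communication between them. All function names are made up for illustration.&lt;/p&gt;

```python
def matvec(x, w):
    """Multiply vector x (length d_in) by matrix w (d_in rows, d_out columns)."""
    d_out = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(d_out)]


def split_columns(w, parts):
    """Split w into `parts` column shards of equal width, one per device."""
    d_out = len(w[0])
    if d_out % parts != 0:
        raise ValueError("output width must divide evenly across devices")
    step = d_out // parts
    return [[row[k * step:(k + 1) * step] for row in w] for k in range(parts)]


def column_parallel_matvec(x, w, parts):
    """Each device multiplies x by its own shard; the outputs are then gathered."""
    shards = split_columns(w, parts)
    partials = [matvec(x, shard) for shard in shards]  # one result per device
    gathered = []
    for partial in partials:  # stands in for the gather step between GPUs
        gathered.extend(partial)
    return gathered


x = [1, 2]
w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]

print(matvec(x, w))                     # [11, 14, 17, 20]
print(column_parallel_matvec(x, w, 2))  # [11, 14, 17, 20]
```

&lt;p&gt;The gather loop is where the communication cost lives. In this toy it is a list append; on real hardware it is traffic over NVLink or the network at every layer.&lt;/p&gt;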

&lt;h2&gt;
  
  
  Pipeline parallelism splits the stack of layers
&lt;/h2&gt;

&lt;p&gt;Pipeline parallelism cuts the model vertically by layers.&lt;/p&gt;

&lt;p&gt;Instead of every GPU participating in every layer, one GPU or group of GPUs owns the early layers, another owns the middle layers, and another owns the later layers. A request moves through those stages like work moving through an assembly line.&lt;/p&gt;

&lt;p&gt;A rough picture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Stage 1: layers 1-20    -&amp;gt; GPU group A
Stage 2: layers 21-40   -&amp;gt; GPU group B
Stage 3: layers 41-60   -&amp;gt; GPU group C
Stage 4: layers 61-80   -&amp;gt; GPU group D
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
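&lt;p&gt;The stage layout pictured above can be sketched as a small helper that divides a layer count into contiguous stages. Splitting purely by layer count is an assumption made for illustration. A real serving engine may balance stages by memory or compute instead, and the function name is hypothetical.&lt;/p&gt;

```python
def stage_layer_ranges(num_layers, num_stages):
    """Split layers 1..num_layers into contiguous, near-equal pipeline stages."""
    if num_stages not in range(1, num_layers + 1):
        raise ValueError("need between 1 and num_layers stages")
    base, extra = divmod(num_layers, num_stages)
    ranges = []
    start = 1
    for stage in range(num_stages):
        # The first `extra` stages each take one additional layer.
        size = base + (1 if extra > stage else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


print(stage_layer_ranges(80, 4))  # [(1, 20), (21, 40), (41, 60), (61, 80)]
print(stage_layer_ranges(10, 3))  # [(1, 4), (5, 7), (8, 10)]
```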



&lt;p&gt;Pipeline parallelism is attractive when the model cannot fit within one node. Instead of stretching tensor parallelism across a slow boundary, you can keep tensor parallelism inside each node and use pipeline parallelism across nodes. NVIDIA's Megatron work describes exactly this pattern: tensor parallelism works well within a DGX A100 node, while pipeline parallelism helps scale across nodes because it uses a different communication pattern.&lt;/p&gt;

&lt;p&gt;vLLM's current docs give a practical serving version of the same idea. For 2 nodes with 8 GPUs each, set tensor parallelism to 8 and pipeline parallelism to 2. In plain English: split each layer across the 8 GPUs inside a node, then split the model's layers across the 2 nodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor_parallel_size = GPUs per node
pipeline_parallel_size = number of nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That rule is not sacred, but it is a good first mental model.&lt;/p&gt;
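&lt;p&gt;As a starting point, the rule can be written down as a tiny helper. This is a sketch under the assumption that every node has the same number of GPUs. The function names are made up, and the flag spellings follow vLLM's command line as I understand it, so check them against the documentation for the version you run.&lt;/p&gt;

```python
def starting_parallelism(num_nodes, gpus_per_node):
    """First guess: tensor parallel inside a node, pipeline parallel across nodes."""
    return {
        "tensor_parallel_size": gpus_per_node,
        "pipeline_parallel_size": num_nodes,
        "total_gpus": num_nodes * gpus_per_node,
    }


def vllm_flags(plan):
    """Render the plan as command-line flags (flag names assumed from vLLM's CLI)."""
    return "--tensor-parallel-size {} --pipeline-parallel-size {}".format(
        plan["tensor_parallel_size"], plan["pipeline_parallel_size"]
    )


plan = starting_parallelism(num_nodes=2, gpus_per_node=8)
print(plan["total_gpus"])  # 16
print(vllm_flags(plan))    # --tensor-parallel-size 8 --pipeline-parallel-size 2
```

&lt;p&gt;The product of the two sizes is the number of GPUs one serving replica consumes, which is the number to plan capacity around.&lt;/p&gt;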

&lt;p&gt;Pipeline parallelism also introduces its own pain. Pipelines can have bubbles, where some stages sit idle while waiting for work. Training systems fight this with microbatches and scheduling tricks. In inference, the serving engine may keep stages busier by feeding different requests through the pipeline continuously, but that depends on batching, traffic shape, and implementation. The operational point is simple: the more stages you add, the more the model starts behaving like a distributed workflow instead of a containerized API.&lt;/p&gt;

&lt;p&gt;Kubernetes can keep the pods alive. The serving engine has to keep the pipeline full.&lt;/p&gt;

&lt;p&gt;In that shape, a "replica" is already bigger than the mental model most platform teams start with. The serving replica is not one container. It is a coordinated set of ranks, workers, and devices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Expert parallelism exists because MoE models are weird
&lt;/h2&gt;

&lt;p&gt;Mixture-of-Experts models add another twist.&lt;/p&gt;

&lt;p&gt;A dense model usually uses the same parameters for every token. If the model has 70B parameters, each token flows through that dense stack. If the model has 1T dense parameters, the serving system must deal with a terrifying amount of memory and compute.&lt;/p&gt;

&lt;p&gt;MoE models are different. They contain many expert feed-forward networks, and a router chooses which experts handle each token. This creates a model with a huge total parameter count, but only a fraction of those parameters are active for any one token.&lt;/p&gt;

&lt;p&gt;This is the line that saves people from a lot of confusion:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Trillion parameters does not always mean trillion-parameter compute per token.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The Switch Transformer paper made this idea famous at large scale. It describes MoE models as sparsely activated: huge parameter counts, but roughly constant computation because each token is routed to a small number of experts. Switch simplified the routing further by sending each token to one expert.&lt;/p&gt;

&lt;p&gt;Modern public models show the same idea in a more familiar form. DeepSeek-V3 is reported as a 671B parameter MoE model, but only 37B parameters are activated for each token. That does not make the model "really 37B." The inactive experts still exist. Their weights still need to live somewhere. But the compute path for one token is much smaller than the total parameter count suggests.&lt;/p&gt;

&lt;p&gt;This distinction matters for capacity planning. Total parameters drive storage and placement. Active parameters drive per-token compute. Both matter, but they are not the same number.&lt;/p&gt;
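&lt;p&gt;The two numbers can be kept apart in a small sizing helper. The sketch below uses the DeepSeek-V3 figures quoted above and a common rule of thumb of roughly 2 FLOPs per active parameter per generated token. That rule ignores attention cost over long contexts and is an assumption here, not a measured value. The function name is made up for illustration.&lt;/p&gt;

```python
def moe_profile(total_params, active_params, bytes_per_param):
    """Separate what must be stored from what is computed per token."""
    return {
        # Storage and placement follow the total parameter count.
        "weights_gb": total_params * bytes_per_param // 10**9,
        # Per-token compute follows the active parameter count (rule of thumb).
        "approx_flops_per_token": 2 * active_params,
        "active_fraction": active_params / total_params,
    }


# 671B total parameters, 37B active per token, stored in FP16 or BF16.
profile = moe_profile(671 * 10**9, 37 * 10**9, 2)

print(profile["weights_gb"])                 # 1342
print(profile["approx_flops_per_token"])     # 74000000000
print(round(profile["active_fraction"], 3))  # 0.055
```

&lt;p&gt;The cluster still has to hold about 1.34 TB of weights, while each token pays for roughly 5.5 percent of them.&lt;/p&gt;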

&lt;p&gt;Expert parallelism is the systems trick that places different experts on different GPUs or nodes. When tokens are routed to experts, the serving system sends token representations to the devices that own those experts, runs the expert computation, and combines the results back into the model flow.&lt;/p&gt;

&lt;p&gt;That creates a new bottleneck: token routing and all-to-all communication. If the router sends too much traffic to one expert, that expert becomes hot. If experts are spread across nodes, the network starts carrying token activations around the cluster. If the serving engine does not overlap communication and compute well, the GPUs wait.&lt;/p&gt;
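&lt;p&gt;A hot expert is easy to spot once routing decisions are counted. The sketch below assumes you already have a flat list of expert ids, one per routed token. How you obtain that list depends on the serving engine, and the function names and the imbalance metric are illustrative choices, not a standard.&lt;/p&gt;

```python
from collections import Counter


def expert_load(assignments, num_experts):
    """Count routed tokens per expert from a flat list of expert ids."""
    counts = Counter(assignments)
    return [counts.get(expert, 0) for expert in range(num_experts)]


def imbalance(load):
    """Busiest expert's load divided by the mean load; 1.0 is perfectly balanced."""
    if not any(load):
        return 1.0  # nothing routed yet, so nothing is skewed
    mean = sum(load) / len(load)
    return max(load) / mean


balanced = expert_load([0, 1, 2, 3, 0, 1, 2, 3], num_experts=4)
skewed = expert_load([0, 0, 0, 0, 0, 1, 2, 3], num_experts=4)

print(balanced, imbalance(balanced))  # [2, 2, 2, 2] 1.0
print(skewed, imbalance(skewed))      # [5, 1, 1, 1] 2.5
```

&lt;p&gt;In the skewed case the GPU holding expert 0 does 2.5 times the average work, and every other expert waits on it.&lt;/p&gt;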

&lt;p&gt;MoE is not magic. It trades dense compute for routing, load balancing, memory placement, and communication.&lt;/p&gt;

&lt;h2&gt;
  
  
  A trillion-parameter MoE is not the same as a trillion-parameter dense model
&lt;/h2&gt;

&lt;p&gt;This is worth slowing down on because marketing numbers blur it.&lt;/p&gt;

&lt;p&gt;Imagine two models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model A: 1T dense parameters
Model B: 1T total MoE parameters, 50B active per token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both can be called trillion-parameter models. They are not the same infrastructure problem.&lt;/p&gt;

&lt;p&gt;Model A needs the serving system to carry the memory and compute burden of the full dense stack. Every token passes through essentially all of the weights, so per-token compute scales with the full parameter count.&lt;/p&gt;

&lt;p&gt;Model B still needs the cluster to store the full set of experts, but each token activates only a subset. The challenge shifts toward routing, expert placement, load balancing, and making sure the right GPUs communicate quickly enough.&lt;/p&gt;
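
&lt;p&gt;A rough back-of-the-envelope sketch makes the gap concrete. It assumes BF16 weights (2 bytes per parameter) and the common rule of thumb of roughly 2 FLOPs per active parameter per token for a forward pass. Both are approximations, and the function names are only illustrative.&lt;/p&gt;

```python
# Rough comparison of the two hypothetical models above.
# Assumptions: BF16 weights (2 bytes per parameter) and about
# 2 FLOPs per active parameter per token for one forward pass.

BYTES_PER_PARAM_BF16 = 2
FLOPS_PER_ACTIVE_PARAM = 2  # rule-of-thumb approximation


def weight_terabytes(total_params):
    """Memory needed just to hold the weights, in decimal terabytes."""
    return total_params * BYTES_PER_PARAM_BF16 / 1e12


def forward_tflops_per_token(active_params):
    """Approximate forward-pass compute for one token, in teraFLOPs."""
    return active_params * FLOPS_PER_ACTIVE_PARAM / 1e12


models = {
    "Model A (dense)": {"total": 1e12, "active": 1e12},
    "Model B (MoE)": {"total": 1e12, "active": 50e9},
}

for name, m in models.items():
    print(
        f"{name}: {weight_terabytes(m['total']):.1f} TB of weights, "
        f"about {forward_tflops_per_token(m['active']):.1f} TFLOPs per token"
    )
```

&lt;p&gt;Both models need the same storage and placement budget. Only the per-token compute differs, and this sketch says nothing about the routing and communication cost that Model B adds.&lt;/p&gt;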

&lt;p&gt;This is why a model card's parameter count is only the beginning of the conversation. For serving, you also want to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is it dense or MoE?&lt;/li&gt;
&lt;li&gt;How many total parameters are there?&lt;/li&gt;
&lt;li&gt;How many parameters are active per token?&lt;/li&gt;
&lt;li&gt;What precision are the weights stored in?&lt;/li&gt;
&lt;li&gt;How long is the context window?&lt;/li&gt;
&lt;li&gt;How large is the KV cache at your expected concurrency?&lt;/li&gt;
&lt;li&gt;Does the serving engine support the model's parallelism pattern well?&lt;/li&gt;
&lt;li&gt;Can your node and network topology support the communication pattern?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you only ask, "How many parameters?" you will size the cluster badly.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Kubernetes actually sees
&lt;/h2&gt;

&lt;p&gt;Kubernetes does not see tensor slices, pipeline stages, or experts. It sees pods, containers, resources, nodes, labels, taints, tolerations, Services, volumes, and health checks.&lt;/p&gt;

&lt;p&gt;For GPUs, Kubernetes normally depends on the device plugin framework. A vendor device plugin registers resources like &lt;code&gt;nvidia.com/gpu&lt;/code&gt; with the kubelet. The kubelet advertises those resources on the node. A pod requests them through resource limits. The scheduler places the pod on a node that can satisfy the request.&lt;/p&gt;

&lt;p&gt;That is useful, but it is a lower-level contract than many people assume.&lt;/p&gt;

&lt;p&gt;Kubernetes can say:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;nvidia.com/gpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It cannot infer that those 8 GPUs should form tensor parallel group 0, that another node should form pipeline stage 1, that experts 0-63 should live on one rank group, or that the network path between two stages is now your latency bottleneck.&lt;/p&gt;

&lt;p&gt;Those choices happen in the serving layer: vLLM, TensorRT-LLM, Triton or Dynamo Triton, SGLang, TGI, Ray, KServe, llm-d, custom launch scripts, or whatever stack your team chooses. Kubernetes is still important, but its job is orchestration around the model, not inside the model.&lt;/p&gt;

&lt;p&gt;This is the split I find useful:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Kubernetes decides where the workers run.
The serving engine decides how the model is split.
The interconnect decides whether the split is fast enough.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Miss any one of those, and the deployment becomes fragile.&lt;/p&gt;

&lt;h2&gt;
  
  
  One pod, many pods, or one distributed replica?
&lt;/h2&gt;

&lt;p&gt;There are a few common deployment shapes.&lt;/p&gt;

&lt;p&gt;The first is a single pod requesting multiple GPUs on one node. This is the simplest shape for models that fit within one machine. A vLLM server might run with &lt;code&gt;--tensor-parallel-size 4&lt;/code&gt; or &lt;code&gt;--tensor-parallel-size 8&lt;/code&gt;, and the pod requests the same number of GPUs. Kubernetes schedules one pod. Inside that pod, the serving engine starts multiple GPU workers.&lt;/p&gt;

&lt;p&gt;This is operationally pleasant because the failure domain is clean. The pod is up or down. The node has the GPUs or it does not. The model weights are local or mounted. You still have complexity, but it is contained.&lt;/p&gt;
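
&lt;p&gt;Here is a minimal sketch of that first shape, built as a Python dictionary so the coupling is explicit: the GPU limit and the tensor-parallel size come from the same number. The pod name, the model identifier, and the image tag are placeholders, and the arguments assume the image accepts vLLM's &lt;code&gt;--model&lt;/code&gt; and &lt;code&gt;--tensor-parallel-size&lt;/code&gt; flags.&lt;/p&gt;

```python
import json


def vllm_pod_manifest(model, tensor_parallel_size):
    """Build a single-pod manifest whose GPU request matches the TP size.

    The pod name, image tag, and model identifier are illustrative
    placeholders, not values taken from a real deployment.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "vllm-single-node"},
        "spec": {
            "containers": [
                {
                    "name": "vllm",
                    "image": "vllm/vllm-openai:latest",
                    "args": [
                        "--model", model,
                        "--tensor-parallel-size", str(tensor_parallel_size),
                    ],
                    # One number drives both the engine and the scheduler.
                    "resources": {
                        "limits": {"nvidia.com/gpu": tensor_parallel_size}
                    },
                }
            ]
        },
    }


manifest = vllm_pod_manifest("example-org/example-model", 8)
print(json.dumps(manifest, indent=2))
```

&lt;p&gt;A real deployment would usually add more around this, such as shared memory for inter-process communication, model storage, probes, and node selection. The point of the sketch is only that the two numbers must not drift apart.&lt;/p&gt;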

&lt;p&gt;The second shape is one distributed replica spread across multiple pods or nodes. This is what you need when the model or desired serving shape exceeds one node. Now you need coordinated startup, rank assignment, service discovery, identical images, shared model paths or download behavior, and careful placement. If one part of the replica is missing, the replica is not healthy.&lt;/p&gt;

&lt;p&gt;This is where Kubernetes starts to need help from higher-level controllers or conventions. StatefulSets, headless Services, Ray clusters, KServe runtimes, LeaderWorkerSet, job-style launchers, or custom operators can all appear depending on the stack. The exact tool matters less than the invariant: the workers are not independent replicas. They are shards of one replica.&lt;/p&gt;

&lt;p&gt;The third shape is data parallel replicas of sharded replicas. For example, you may run four independent model replicas, and each replica uses 16 GPUs internally. That gives you 64 GPUs total, but the scheduling unit is not "64 independent pods." It is four coordinated groups.&lt;/p&gt;

&lt;p&gt;This is where platform teams need to be very careful with autoscaling language. Scaling from 4 replicas to 5 may mean adding 16 GPUs and starting a full distributed group. It may require model weight loading, rank coordination, cache warmup, and traffic shifting. It is not the same as adding one more stateless web pod.&lt;/p&gt;
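
&lt;p&gt;The arithmetic of group scaling can be sketched in a few lines. The split of 16 GPUs into tensor parallel 8 and pipeline parallel 2 is an assumption chosen for illustration, as are the class and function names.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaShape:
    """One model replica expressed as its internal parallelism (illustrative)."""
    tensor_parallel: int
    pipeline_parallel: int

    @property
    def gpus_per_replica(self):
        return self.tensor_parallel * self.pipeline_parallel


def gpus_for(replicas, shape):
    """Total GPUs needed for a number of data-parallel replicas."""
    return replicas * shape.gpus_per_replica


# Assumed shape: 8-way tensor parallel inside a node, 2 pipeline stages.
shape = ReplicaShape(tensor_parallel=8, pipeline_parallel=2)

current = gpus_for(4, shape)
scaled = gpus_for(5, shape)
print(f"4 replicas: {current} GPUs")
print(f"5 replicas: {scaled} GPUs")
print(f"one scale-up step adds {scaled - current} GPUs as a single group")
```

&lt;p&gt;The smallest step an autoscaler can take here is one whole group, which is why replica counts and GPU counts need to be discussed together.&lt;/p&gt;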

&lt;h2&gt;
  
  
  The network is part of the model now
&lt;/h2&gt;

&lt;p&gt;With normal web services, the network matters. With distributed LLM inference, the network is part of the model's execution path.&lt;/p&gt;

&lt;p&gt;Tensor parallelism needs frequent GPU-to-GPU communication. Pipeline parallelism moves activations between stages. Expert parallelism can create all-to-all traffic as token activations are sent to the devices that own the selected experts. NCCL, or the equivalent communication layer in your stack, becomes part of the serving path. Multi-node serving depends on bandwidth, latency, topology, and how well the serving engine overlaps communication with compute.&lt;/p&gt;

&lt;p&gt;This is why "we have 32 GPUs in the cluster" is not enough information. Are they four nodes with eight GPUs each, or eight nodes with four GPUs each? Do they have NVLink inside the node? What is the NIC? Is RDMA available? Are the nodes in the same placement group or rack? Are you crossing noisy network boundaries? Is the storage path going to make every cold start painful?&lt;/p&gt;

&lt;p&gt;A cluster with the same GPU count can behave like a different machine depending on topology.&lt;/p&gt;

&lt;p&gt;For smaller models, Kubernetes scheduling may feel like bin packing. For giant models, it becomes topology-aware placement. Not just any 16 GPUs will do. You need 16 GPUs arranged in a way that matches the communication pattern of the model.&lt;/p&gt;

&lt;p&gt;That is one reason large AI clusters often feel more rigid than normal Kubernetes clusters. The workload cares about where things are, not only whether resources exist.&lt;/p&gt;
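
&lt;p&gt;A small sketch shows how the same GPU count can yield different usable capacity. It assumes one simplifying rule: each tensor-parallel group must stay inside a single node, while pipeline stages may sit on different nodes. The function name and the chosen parallel sizes are illustrative.&lt;/p&gt;

```python
def replicas_that_fit(gpus_per_node, node_count, tensor_parallel, pipeline_parallel):
    """Count whole replicas when every tensor-parallel group stays in one node.

    Simplification: pipeline stages may land on any node, and nothing else
    competes for the GPUs.
    """
    tp_groups_per_node = gpus_per_node // tensor_parallel
    total_tp_groups = tp_groups_per_node * node_count
    return total_tp_groups // pipeline_parallel


# Two clusters with the same total of 32 GPUs.
layouts = {
    "4 nodes x 8 GPUs": (8, 4),
    "8 nodes x 4 GPUs": (4, 8),
}

for name, (gpus_per_node, node_count) in layouts.items():
    fit = replicas_that_fit(gpus_per_node, node_count,
                            tensor_parallel=8, pipeline_parallel=2)
    print(f"{name}: {fit} replicas of a TP=8, PP=2 model")
```

&lt;p&gt;Under that rule the first layout hosts two 16-GPU replicas and the second hosts none, even though both clusters report 32 GPUs.&lt;/p&gt;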

&lt;h2&gt;
  
  
  Why loading the model is its own event
&lt;/h2&gt;

&lt;p&gt;A 2 TB model is painful while serving. It is also painful while starting.&lt;/p&gt;

&lt;p&gt;The weights have to come from somewhere: container image layers, persistent volumes, object storage, local NVMe, a model cache, or a preloaded node image. The file format matters too. A memory-mappable format like safetensors behaves differently from formats that need heavier deserialization before the model is usable. Pulling, reading, mapping, transferring to GPU memory, initializing kernels, building CUDA graphs, and warming the serving engine can dominate startup time.&lt;/p&gt;

&lt;p&gt;This changes how you think about pod restarts and autoscaling.&lt;/p&gt;

&lt;p&gt;In a web app, a new pod might become useful in seconds. For a giant LLM, a new replica may take minutes. If the weights are remote and the cache is cold, it can be worse. If the replica spans nodes, all workers need to agree on their ranks and become ready together. One slow worker can delay the group.&lt;/p&gt;

&lt;p&gt;Kubernetes readiness probes are necessary, but they are not the whole story. A pod can exist before the model is loaded. A container can be running before the GPU workers are ready. A distributed group can have seven healthy workers and still be unusable because the eighth worker failed.&lt;/p&gt;

&lt;p&gt;That is why production LLM serving often needs warm pools, minimum replicas, local weight caches, careful rollout strategy, and boring operational patience. The deployment is not ready when the pod starts. It is ready when the model can actually generate tokens at the latency you promised.&lt;/p&gt;
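
&lt;p&gt;The group-level view of readiness can be written down directly. This is a minimal sketch with hypothetical rank names. In practice the per-worker signal would come from whatever health endpoint your serving engine exposes.&lt;/p&gt;

```python
def group_ready(worker_ready):
    """A distributed replica is ready only when every rank reports ready.

    worker_ready maps a worker name to True or False. An empty group is
    treated as not ready.
    """
    return bool(worker_ready) and all(worker_ready.values())


# Eight workers, one of which failed to come up.
workers = {f"rank-{i}": True for i in range(8)}
workers["rank-7"] = False

healthy = sum(workers.values())
print(f"{healthy} of {len(workers)} workers ready; group ready: {group_ready(workers)}")
```

&lt;p&gt;Seven healthy workers out of eight still means zero serving capacity from that replica, which is the number traffic routing should act on.&lt;/p&gt;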

&lt;h2&gt;
  
  
  A practical sizing walkthrough
&lt;/h2&gt;

&lt;p&gt;Suppose someone asks for a 1T parameter model on Kubernetes.&lt;/p&gt;

&lt;p&gt;The first question is not YAML. It is model shape.&lt;/p&gt;

&lt;p&gt;If it is a dense 1T model in BF16, you start with about 2 TB of raw weights. On 80 GB GPUs, that is at least 25 GPUs for the weights alone, before any runtime overhead. In reality, you probably need more. You also need room for KV cache, which grows with context length and concurrency. If the model supports lower precision weight formats without unacceptable quality loss, quantization may reduce the floor, but it does not remove the distributed serving problem.&lt;/p&gt;
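
&lt;p&gt;The floor calculation is simple enough to sketch. The 70 percent usable fraction below is a hypothetical headroom figure for KV cache and runtime overhead, not a measured value, and the function name is illustrative.&lt;/p&gt;

```python
import math


def min_gpus_for_weights(params, bytes_per_param, gpu_memory_gb, usable_fraction=1.0):
    """Smallest GPU count whose usable memory can hold the weights.

    usable_fraction is the share of each GPU left for weights after
    reserving room for KV cache and overhead (an assumed planning number).
    """
    weight_bytes = params * bytes_per_param
    usable_bytes = gpu_memory_gb * 1e9 * usable_fraction
    return math.ceil(weight_bytes / usable_bytes)


ONE_TRILLION = 1e12

print("BF16, weights only:", min_gpus_for_weights(ONE_TRILLION, 2, 80))
print("BF16, 70% usable:  ", min_gpus_for_weights(ONE_TRILLION, 2, 80, 0.7))
print("8-bit, weights only:", min_gpus_for_weights(ONE_TRILLION, 1, 80))
```

&lt;p&gt;The first line reproduces the 25-GPU floor. The other two show how quickly the answer moves once you reserve headroom or change precision.&lt;/p&gt;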

&lt;p&gt;If it is a 1T MoE model, the next questions change. How many experts exist? How many are active per token? Are the attention layers dense while the feed-forward layers are sparse? What does the serving engine support: tensor parallel attention, expert parallel MoE layers, data parallel attention, or some hybrid? How much all-to-all traffic appears when real prompts arrive?&lt;/p&gt;

&lt;p&gt;Then you map it to topology.&lt;/p&gt;

&lt;p&gt;If one node has 8 GPUs with fast intra-node links, tensor parallelism across 8 GPUs is a natural first shape. If the model needs more than one node, you may use tensor parallelism inside each node and pipeline parallelism across nodes. If it is MoE, you may place experts across devices using expert parallelism and still combine that with tensor or data parallel attention.&lt;/p&gt;

&lt;p&gt;Only after that does Kubernetes enter the center of the conversation.&lt;/p&gt;

&lt;p&gt;You need nodes labeled by GPU type and topology. You need the NVIDIA device plugin, the GPU Operator, or an equivalent exposing devices. You need pod specs or higher-level controllers that request the right GPU count. You need placement rules so the workers land together. You need model storage that does not turn every restart into a download storm. You need readiness that understands distributed health. You need metrics that report tokens, KV cache, queueing, and per-worker failures, not just pod CPU.&lt;/p&gt;

&lt;p&gt;The YAML is the last mile. The architecture decision happened before it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where teams usually get surprised
&lt;/h2&gt;

&lt;p&gt;The first surprise is that GPU count is not capacity. It is potential capacity. Capacity appears only when the GPUs are arranged, connected, loaded, and driven correctly.&lt;/p&gt;

&lt;p&gt;The second surprise is that one model replica can be a group. Platform teams like replicas because replicas sound independent. With large LLMs, the word can hide coordination. A "replica" might be 8 pods. Or 16 GPUs. Or 2 nodes. Or a combination of tensor, pipeline, and expert parallel ranks that must agree before anything works.&lt;/p&gt;

&lt;p&gt;The third surprise is that MoE parameter counts are easy to misread. A 671B total parameter model with 37B active parameters per token is not lying. It is telling you two different infrastructure facts at once. You need enough memory and placement for the large total model, but the per-token compute path is sparse.&lt;/p&gt;

&lt;p&gt;The fourth surprise is that the scheduler is not the serving system. Kubernetes can place pods on GPU nodes. It cannot decide the model parallel strategy. It cannot make a bad tensor-parallel layout fast. It cannot fix a network topology that does not match the workload.&lt;/p&gt;

&lt;p&gt;This is why serious LLM-on-Kubernetes work ends up crossing boundaries that platform teams used to keep separate: scheduler behavior, GPU topology, model architecture, serving-engine internals, storage layout, rollout strategy, and latency SLOs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Kubernetes mental model that works better
&lt;/h2&gt;

&lt;p&gt;Do not picture a trillion-parameter model as a container image.&lt;/p&gt;

&lt;p&gt;Picture it as a distributed runtime that Kubernetes happens to host.&lt;/p&gt;

&lt;p&gt;The model weights are split. The computation is split. The memory pressure is split. The failure modes are split. The serving engine owns the model-parallel details. Kubernetes owns the outer lifecycle: placement, resources, health, rollout, identity, networking, storage, and integration with the rest of the platform.&lt;/p&gt;

&lt;p&gt;That does not make Kubernetes the wrong tool. It makes Kubernetes the substrate, not the magic trick.&lt;/p&gt;

&lt;p&gt;For smaller LLMs, you can get away with thinking "one pod equals one model server." For larger models, use a different sentence:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;One serving replica is a coordinated GPU group.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That group may live in one pod on one node. It may live across several pods and nodes. It may combine tensor parallelism, pipeline parallelism, expert parallelism, and data parallelism. But if it takes all of those workers to generate one token stream, treat the group as the unit you operate.&lt;/p&gt;

&lt;p&gt;That shift makes the rest of the architecture less surprising.&lt;/p&gt;

&lt;p&gt;Autoscaling becomes group scaling. Rollouts become distributed rollouts. Readiness becomes model readiness, not process readiness. Capacity planning starts with bytes and tokens, not pod count. Scheduling starts caring about topology. Observability moves from CPU and memory to time to first token (TTFT), time per output token (TPOT), tokens per second, KV cache, queue depth, per-rank health, and GPU utilization.&lt;/p&gt;

&lt;p&gt;A giant model does not run in a pod. It runs across a shape.&lt;/p&gt;

&lt;p&gt;Kubernetes can manage that shape, but only if you tell it what the shape is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Continue the series
&lt;/h2&gt;

&lt;p&gt;The next part goes one layer lower into the machines themselves: GPU nodes, device plugins, feature discovery, MIG, MPS, time slicing, labels, taints, and what Kubernetes must know before it schedules an LLM.&lt;/p&gt;

&lt;p&gt;If you are working through LLM serving on Kubernetes, subscribe to get the next part. I am also putting together a free &lt;strong&gt;LLM Serving on Kubernetes Production Readiness Checklist&lt;/strong&gt; that turns these ideas into a practical review path for teams.&lt;/p&gt;

&lt;p&gt;And if your team is already trying to serve large models on Kubernetes, this is the kind of architecture decision worth reviewing before the cloud bill becomes the incident report.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources worth reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://developer.nvidia.com/blog/demystifying-ai-inference-deployments-for-trillion-parameter-large-language-models/" rel="noopener noreferrer"&gt;NVIDIA: Demystifying AI Inference Deployments for Trillion Parameter Large Language Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron" rel="noopener noreferrer"&gt;NVIDIA: Scaling Language Model Training to a Trillion Parameters Using Megatron&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.vllm.ai/en/latest/serving/parallelism_scaling/" rel="noopener noreferrer"&gt;vLLM documentation: Parallelism and Scaling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://jmlr.org/papers/volume23/21-0998/21-0998.pdf" rel="noopener noreferrer"&gt;Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2412.19437" rel="noopener noreferrer"&gt;DeepSeek-V3 Technical Report&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/" rel="noopener noreferrer"&gt;Kubernetes documentation: Device Plugins&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>llm</category>
      <category>ai</category>
      <category>devops</category>
    </item>
    <item>
      <title>The Request Is the Wrong Unit of Scale for LLMs on Kubernetes</title>
      <dc:creator>Pawan Kumar</dc:creator>
      <pubDate>Thu, 21 May 2026 03:32:01 +0000</pubDate>
      <link>https://dev.to/the-persistent-engineer/the-request-is-the-wrong-unit-of-scale-for-llms-on-kubernetes-3j9c</link>
      <guid>https://dev.to/the-persistent-engineer/the-request-is-the-wrong-unit-of-scale-for-llms-on-kubernetes-3j9c</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Series links&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.dheeth.blog/llm-serving-is-not-normal-web-serving/" rel="noopener noreferrer"&gt;Part 1: Everything You Know About Scaling Web Apps Breaks When You Serve an LLM&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Your dashboard says traffic is flat. Requests per second barely moved. CPU looks fine. Memory looks normal. The HPA is calm. Then latency starts drifting. Time to first token gets worse. GPU memory pressure rises. Queues grow. Users complain that the model is "thinking forever."&lt;/p&gt;

&lt;p&gt;Part 1 introduced why LLM serving breaks the normal web-scaling model. Part 2 zooms into one reason: the HTTP request is only the envelope. The real work is token processing.&lt;/p&gt;

&lt;p&gt;For a normal web app, a request is often a useful approximation of work. One request hits an API, does some bounded work, maybe talks to a database, returns JSON, and ends. LLMs do not behave like that. One request may contain a 20-token question and produce a 50-token answer. Another may contain a long system prompt, full chat history, retrieved documents, tool output, metadata, and a user asking for a 4,000-token report.&lt;/p&gt;

&lt;p&gt;Both are one HTTP request.&lt;/p&gt;

&lt;p&gt;They are not the same workload.&lt;/p&gt;

&lt;p&gt;Kubernetes may see one request. Your ingress may see one request. Your API gateway may see one request. But the GPU sees tokens: prefill work, decode work, KV cache growth, memory pressure, queueing, and time spent generating output one token at a time.&lt;/p&gt;

&lt;p&gt;Tokens are the work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why request count worked for web apps
&lt;/h2&gt;

&lt;p&gt;Most platform teams grew up around request-based thinking. We look at requests per second, p95 latency, p99 latency, error rate, CPU usage, memory usage, queue depth, pod count, and replica count. That model works reasonably well for many web services because requests are often similar enough for capacity planning.&lt;/p&gt;

&lt;p&gt;Not always, of course. A login request is not the same as an export request. A cached read is not the same as a database-heavy query. Every experienced SRE has seen one "simple" endpoint melt something important. But request count still gives a useful first approximation in many normal systems.&lt;/p&gt;

&lt;p&gt;With LLM serving, that approximation breaks faster. A request does not tell you how long the prompt is, how many retrieved documents were added, how much chat history was included, how many output tokens the model generated, how much KV cache was needed, or how long the request occupied the GPU.&lt;/p&gt;

&lt;p&gt;This is why a Kubernetes deployment can look stable at the HTTP layer while the model server is under real pressure. The API did not necessarily get more traffic.&lt;/p&gt;

&lt;p&gt;The traffic got heavier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Input tokens and output tokens are different problems
&lt;/h2&gt;

&lt;p&gt;When people first hear "tokens are the unit of work," they often treat all tokens as one bucket. That is a good starting point, but it is not enough.&lt;/p&gt;

&lt;p&gt;For serving, input tokens and output tokens stress the system differently.&lt;/p&gt;

&lt;p&gt;At a high level, LLM inference has two phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Prefill&lt;/li&gt;
&lt;li&gt;Decode&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The prefill phase processes the input prompt. This includes the system prompt, developer instructions, chat history, retrieved documents, tool results, the user's message, and whatever formatting your application adds before the request reaches the model. The decode phase generates the response one token at a time. The model predicts a token, appends it to the sequence, uses that updated sequence to predict the next token, and keeps going until it hits a stop condition or a token limit.&lt;/p&gt;

&lt;p&gt;A practical way to remember it:&lt;/p&gt;

&lt;p&gt;Input tokens decide how heavy it is to start answering.&lt;/p&gt;

&lt;p&gt;Output tokens decide how long the model stays busy.&lt;/p&gt;

&lt;p&gt;Long input usually hurts time to first token because the model has to process the prompt before it can begin generating. Long output usually hurts total latency and capacity because the model stays in the generation loop longer. Streaming can make this feel better to the user, but streaming does not remove the backend work. It just lets the user watch the work happen.&lt;/p&gt;

&lt;p&gt;This is why serious LLM serving metrics talk about time to first token, time per output token, inter-token latency, and tokens per second. &lt;a href="https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html" rel="noopener noreferrer"&gt;NVIDIA's LLM benchmarking docs&lt;/a&gt; describe TTFT as the time before the first output token appears, and note that longer prompts generally increase TTFT because the input sequence has to be processed and the KV cache has to be created. &lt;a href="https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices" rel="noopener noreferrer"&gt;Databricks' inference performance guidance&lt;/a&gt; also separates TTFT, TPOT, latency, and throughput instead of treating latency as one simple number.&lt;/p&gt;

&lt;p&gt;A normal API request is one operation.&lt;/p&gt;

&lt;p&gt;An LLM request is a sequence of token work.&lt;/p&gt;

&lt;h2&gt;
  
  
  One request can hide a huge amount of prompt assembly
&lt;/h2&gt;

&lt;p&gt;The user does not usually send the real prompt.&lt;/p&gt;

&lt;p&gt;The application builds it.&lt;/p&gt;

&lt;p&gt;A user might type:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What is our refund policy for enterprise customers?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That looks tiny.&lt;/p&gt;

&lt;p&gt;But by the time your application sends the request to the model, the prompt might include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;system prompt: 700 tokens&lt;/li&gt;
&lt;li&gt;developer instructions: 400 tokens&lt;/li&gt;
&lt;li&gt;chat history: 1,500 tokens&lt;/li&gt;
&lt;li&gt;retrieved policy documents: 6,000 tokens&lt;/li&gt;
&lt;li&gt;citations and metadata: 600 tokens&lt;/li&gt;
&lt;li&gt;user question: 12 tokens&lt;/li&gt;
&lt;li&gt;formatting instructions: 300 tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The user sent one short question. The model received more than 9,000 input tokens before it generated a single output token. That is one of the easiest mistakes to miss in production: teams measure the user message size, not the final assembled prompt size.&lt;/p&gt;

&lt;p&gt;RAG makes this even more interesting. Retrieval-augmented generation is often described as a quality feature. The model gets relevant context, answers with better grounding, and can cite internal documents. That is true. But RAG is also an infrastructure multiplier.&lt;/p&gt;

&lt;p&gt;Changing &lt;code&gt;top_k&lt;/code&gt; from 4 chunks to 12 chunks may look like a harmless retrieval tuning change. No Kubernetes manifest changed. No model changed. Request count did not change. The product team may even see better answers. But now every request may carry thousands of extra input tokens. That can affect time to first token, GPU memory pressure, KV cache usage, batch composition, queueing delay, maximum concurrency, tail latency, and cost per interaction.&lt;/p&gt;

&lt;p&gt;This is why prompt assembly needs observability. You do not only want to know that a request had 9,000 input tokens. You want to know where those tokens came from: chat history, retrieved documents, tool results, system instructions, verbose metadata, tenant documents, or an agent flow that appends every intermediate step.&lt;/p&gt;

&lt;p&gt;Without that breakdown, token growth stays invisible until latency tells you something is wrong.&lt;/p&gt;
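
&lt;p&gt;Using the example numbers from the refund-policy prompt above, a per-source breakdown might look like this sketch. The dictionary keys are illustrative labels, not fields from any particular framework.&lt;/p&gt;

```python
# Token counts taken from the example prompt above; key names are illustrative.
prompt_parts = {
    "system_prompt": 700,
    "developer_instructions": 400,
    "chat_history": 1500,
    "retrieved_documents": 6000,
    "citations_and_metadata": 600,
    "user_question": 12,
    "formatting_instructions": 300,
}

total = sum(prompt_parts.values())
print(f"assembled prompt: {total} input tokens")

# Largest contributors first, with each source's share of the prompt.
ranked = sorted(prompt_parts.items(), key=lambda item: item[1], reverse=True)
for name, tokens in ranked:
    print(f"{name:26s} {tokens:6d}  {tokens / total:6.1%}")
```

&lt;p&gt;The user's own question is 12 of 9,512 tokens. Retrieval alone accounts for well over half of the prompt, which is the kind of fact that only shows up when composition is recorded.&lt;/p&gt;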

&lt;h2&gt;
  
  
  Long context is a capacity decision
&lt;/h2&gt;

&lt;p&gt;Long-context models are useful. They let you analyze larger documents, keep longer conversations, handle more retrieval context, and build richer workflows. But a large context window is not a target. It is a limit.&lt;/p&gt;

&lt;p&gt;This sounds obvious, but many teams behave as if "the model supports 128k context" means "we can casually send 128k context." That is like saying a node has 1 TB of memory, so every process should try to use it.&lt;/p&gt;

&lt;p&gt;Long context changes the shape of serving. A small number of long-context requests can consume enough GPU memory and serving time to affect everyone else. A chat session can become more expensive as history grows. An agent can quietly append tool traces until each turn becomes much heavier than the first one. A summarization feature can go from "summarize this page" to "summarize this folder of documents" without the HTTP request count changing at all.&lt;/p&gt;

&lt;p&gt;The failure mode is subtle because the old dashboard may still look calm. RPS is flat, but p95 input tokens moved from 2,000 to 18,000.&lt;/p&gt;

&lt;p&gt;That is not flat traffic.&lt;/p&gt;

&lt;p&gt;A useful platform practice is to bucket prompts by size: short prompts, medium prompts, long prompts, very long prompts, and batch or offline prompts. The exact numbers depend on your model and hardware, but the habit matters more than the bucket boundaries.&lt;/p&gt;

&lt;p&gt;A 500-token chat and a 50,000-token document analysis should not be treated as the same class of work just because both entered through &lt;code&gt;/v1/chat/completions&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Context windows are limits, not goals.&lt;/p&gt;
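
&lt;p&gt;Bucketing can start as a few lines of code. The boundaries below are hypothetical placeholders. As noted above, the right numbers depend on your model and hardware.&lt;/p&gt;

```python
from bisect import bisect_left

# Hypothetical inclusive upper bounds, in input tokens, for each bucket.
BUCKET_UPPER_BOUNDS = [1_000, 8_000, 32_000]
BUCKET_NAMES = ["short", "medium", "long", "very_long"]


def prompt_bucket(input_tokens):
    """Map an input token count to a size bucket.

    bisect_left finds the first bound that is not smaller than the count,
    so a prompt of exactly 1,000 tokens still counts as short.
    """
    return BUCKET_NAMES[bisect_left(BUCKET_UPPER_BOUNDS, input_tokens)]


for tokens in (500, 1_000, 9_512, 50_000):
    print(tokens, prompt_bucket(tokens))
```

&lt;p&gt;Once every request carries a bucket label, latency and queueing can be charted per bucket instead of per route.&lt;/p&gt;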

&lt;h2&gt;
  
  
  Output length is not just a UX choice
&lt;/h2&gt;

&lt;p&gt;Input tokens get a lot of attention because long prompts are easy to blame. But output tokens can be just as important for capacity planning.&lt;/p&gt;

&lt;p&gt;Two requests can have the same input prompt and completely different backend cost depending on output length.&lt;/p&gt;

&lt;p&gt;Request A:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1,000 input tokens&lt;/li&gt;
&lt;li&gt;100 output tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Request B:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1,000 input tokens&lt;/li&gt;
&lt;li&gt;2,000 output tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same route. Same user flow. Same prompt size. Very different serving time.&lt;/p&gt;

&lt;p&gt;The second request keeps the model generating for much longer. If the response is streamed, the user may start seeing text quickly, which is good. But the GPU is still occupied while the model continues decoding token after token.&lt;/p&gt;

&lt;p&gt;This is why &lt;code&gt;max_tokens&lt;/code&gt; is not only a product parameter. It is a capacity control.&lt;/p&gt;

&lt;p&gt;If every request is allowed to generate 4,000 tokens, you have created a worst-case capacity problem even if most responses are shorter. If a feature asks the model to "write a detailed report," that is not the same workload as "answer this chat question." If agents are allowed to produce long reasoning traces, tool plans, summaries, and final answers, output length can grow quickly.&lt;/p&gt;

&lt;p&gt;You should track both requested maximum output tokens and actual generated output tokens. Requested max output tokens show the capacity risk your system accepted. Actual output tokens show the work the model really performed. If many requests hit the output cap, your users may be getting truncated answers. If very few requests use the available budget, your default might be too generous.&lt;/p&gt;

&lt;p&gt;Output length is not formatting.&lt;/p&gt;

&lt;p&gt;It is how long the request rents the GPU.&lt;/p&gt;
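
&lt;p&gt;A minimal sketch of the two signals described above: how often requests hit the output cap, and how much of the requested budget is actually used. It assumes each record carries a &lt;code&gt;finish_reason&lt;/code&gt; field where the value &lt;code&gt;length&lt;/code&gt; means the cap was reached, as in OpenAI-compatible APIs. The sample records are made up.&lt;/p&gt;

```python
def output_budget_report(requests):
    """Summarize requested versus actual output tokens.

    Each request is a dict with max_tokens, output_tokens, and
    finish_reason. A finish_reason of "length" is assumed to mean
    the response was cut off at the cap.
    """
    if not requests:
        return {"cap_hit_rate": 0.0, "budget_utilization": 0.0}
    capped = sum(1 for r in requests if r["finish_reason"] == "length")
    used = sum(r["output_tokens"] for r in requests)
    budget = sum(r["max_tokens"] for r in requests)
    return {
        "cap_hit_rate": capped / len(requests),
        "budget_utilization": used / budget,
    }


# Made-up sample: one truncated response, three short ones.
sample = [
    {"max_tokens": 4000, "output_tokens": 4000, "finish_reason": "length"},
    {"max_tokens": 4000, "output_tokens": 200, "finish_reason": "stop"},
    {"max_tokens": 4000, "output_tokens": 150, "finish_reason": "stop"},
    {"max_tokens": 4000, "output_tokens": 450, "finish_reason": "stop"},
]

print(output_budget_report(sample))
```

&lt;p&gt;A high cap-hit rate points at truncated answers. A low budget utilization points at a default that is more generous than the workload needs.&lt;/p&gt;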

&lt;h2&gt;
  
  
  Same request count, completely different load
&lt;/h2&gt;

&lt;p&gt;A useful dashboard should show when the same request count hides a different workload shape. For example:&lt;/p&gt;

&lt;p&gt;Window A:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;requests: 1,000&lt;/li&gt;
&lt;li&gt;average input: 500 tokens&lt;/li&gt;
&lt;li&gt;average output: 150 tokens&lt;/li&gt;
&lt;li&gt;total token work: 650,000 tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Window B:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;requests: 1,000&lt;/li&gt;
&lt;li&gt;average input: 8,000 tokens&lt;/li&gt;
&lt;li&gt;average output: 1,000 tokens&lt;/li&gt;
&lt;li&gt;total token work: 9,000,000 tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both windows show 1,000 requests. But the second window has almost 14x the token volume. If your dashboard only shows request count, it says traffic is flat. If your dashboard shows token volume, it says the workload changed completely.&lt;/p&gt;

&lt;p&gt;That is why the useful question is not only:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How many requests are we serving?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is also:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How many input tokens are arriving, how many output tokens are being generated, and where are those tokens coming from?&lt;/p&gt;
&lt;/blockquote&gt;
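
&lt;p&gt;The two windows above reduce to a short calculation. The dictionary layout is only illustrative, and the numbers are the ones from the example.&lt;/p&gt;

```python
# Numbers from the two example windows above.
windows = {
    "A": {"requests": 1000, "avg_input": 500, "avg_output": 150},
    "B": {"requests": 1000, "avg_input": 8000, "avg_output": 1000},
}


def total_tokens(window):
    """Total token volume for a window: requests times average tokens each."""
    return window["requests"] * (window["avg_input"] + window["avg_output"])


volume_a = total_tokens(windows["A"])
volume_b = total_tokens(windows["B"])

print(f"window A: {volume_a:,} tokens")
print(f"window B: {volume_b:,} tokens")
print(f"request ratio: {windows['B']['requests'] / windows['A']['requests']:.1f}x")
print(f"token ratio:   {volume_b / volume_a:.1f}x")
```

&lt;p&gt;A request-based dashboard plots the first ratio. The model server experiences the second.&lt;/p&gt;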

&lt;h2&gt;
  
  
  What Kubernetes sees and what the model server feels
&lt;/h2&gt;

&lt;p&gt;Kubernetes is very good at managing containers. It can schedule pods, restart failed workloads, apply resource requests and limits, spread replicas, roll out deployments, and attach workloads to GPU nodes. But Kubernetes does not automatically understand the shape of an LLM request. A pod can be healthy while the model server is struggling. CPU can look uninteresting while GPU memory is the real limit. Generic memory can look fine while KV cache is under pressure. Request count can look flat while token volume has exploded.&lt;/p&gt;

&lt;p&gt;This is where the division of responsibility matters. Kubernetes gives you the orchestration layer. The model server gives you the LLM execution layer. The application builds the prompt. The platform team has to connect the signals.&lt;/p&gt;

&lt;p&gt;If those layers do not share the right metrics, you end up scaling the wrong thing. For example, CPU-based HPA may be useful around some parts of the stack, but it is not enough to understand LLM serving capacity. A model server may expose more relevant metrics such as prompt tokens, generation tokens, time to first token, time per output token, queue time, number of running requests, number of waiting requests, and KV cache usage.&lt;/p&gt;

&lt;p&gt;vLLM's production metrics are a good example of where the industry is moving. It exposes metrics for prompt tokens, generation tokens, request prompt tokens, request generation tokens, time to first token, time per output token, request queue time, prefill time, decode time, KV cache usage, running requests, and waiting requests. That metric set tells you something important:&lt;/p&gt;

&lt;p&gt;The production surface of LLM serving is already token-aware.&lt;/p&gt;

&lt;p&gt;Your dashboard should be too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Token-based observability is not optional
&lt;/h2&gt;

&lt;p&gt;If you are running LLM workloads on Kubernetes, request count still matters. You still need API-level metrics. You still care about errors, availability, saturation, queueing, and latency. But those metrics need token context.&lt;/p&gt;

&lt;p&gt;At minimum, every request should give you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;input tokens&lt;/li&gt;
&lt;li&gt;output tokens&lt;/li&gt;
&lt;li&gt;total tokens&lt;/li&gt;
&lt;li&gt;requested max output tokens&lt;/li&gt;
&lt;li&gt;time to first token&lt;/li&gt;
&lt;li&gt;time per output token or inter-token latency&lt;/li&gt;
&lt;li&gt;end-to-end latency&lt;/li&gt;
&lt;li&gt;queue time&lt;/li&gt;
&lt;li&gt;model name&lt;/li&gt;
&lt;li&gt;model version&lt;/li&gt;
&lt;li&gt;deployment&lt;/li&gt;
&lt;li&gt;tenant or team&lt;/li&gt;
&lt;li&gt;route or feature&lt;/li&gt;
&lt;li&gt;finish reason&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If possible, also track prompt composition:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;system prompt tokens&lt;/li&gt;
&lt;li&gt;chat history tokens&lt;/li&gt;
&lt;li&gt;retrieved context tokens&lt;/li&gt;
&lt;li&gt;tool result tokens&lt;/li&gt;
&lt;li&gt;user message tokens&lt;/li&gt;
&lt;li&gt;metadata or formatting tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That breakdown is where many production surprises hide.&lt;/p&gt;

&lt;p&gt;For dashboards, averages are not enough. Average token count can look stable while the tail gets ugly. You want p50, p95, and p99 for input tokens and output tokens. You want latency by token bucket. You want TTFT by input size. You want end-to-end latency by output size. You want to know whether a tenant is sending mostly short prompts or occasionally sending giant ones.&lt;/p&gt;
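
&lt;p&gt;As a rough illustration of those views, the sketch below computes input-token percentiles and groups TTFT by input-size bucket. The request records, the bucket edges, and the helper names are all made up for the example; in practice the numbers would come from your model server or gateway metrics.&lt;/p&gt;

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical per-request records: (input_tokens, ttft_seconds).
requests = [
    (120, 0.21), (340, 0.25), (95, 0.19), (2800, 0.90),
    (410, 0.27), (150, 0.22), (9600, 3.40), (230, 0.24),
    (180, 0.20), (7200, 2.60),
]

def percentile(values, pct):
    """Return the pct-th percentile (1..99) using inclusive interpolation."""
    cuts = quantiles(values, n=100, method="inclusive")
    return cuts[pct - 1]

# Bucket edges are an assumption; choose edges that match your traffic.
BUCKET_EDGES = [512, 2048, 8192]

def bucket_label(tokens):
    """Map a token count to the first bucket edge that can hold it."""
    for edge in BUCKET_EDGES:
        if edge >= tokens:
            return f"up to {edge}"
    return f"over {BUCKET_EDGES[-1]}"

input_tokens = [tokens for tokens, _ in requests]
summary = {p: percentile(input_tokens, p) for p in (50, 95, 99)}

# Group TTFT samples by the input-size bucket of their request.
ttft_by_bucket = defaultdict(list)
for tokens, ttft in requests:
    ttft_by_bucket[bucket_label(tokens)].append(ttft)

mean_tokens = sum(input_tokens) / len(input_tokens)
print(f"mean input tokens: {mean_tokens:.0f}")
for p, value in summary.items():
    print(f"p{p} input tokens: {value:.0f}")
for label, values in ttft_by_bucket.items():
    print(f"TTFT for {label} tokens: max {max(values):.2f}s over {len(values)} requests")
```

&lt;p&gt;In a sample like this one, a few very large prompts pull the mean far above the median, which is the reason to read percentiles and buckets instead of a single average.&lt;/p&gt;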

&lt;p&gt;Some useful views:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;input token p50, p95, p99&lt;/li&gt;
&lt;li&gt;output token p50, p95, p99&lt;/li&gt;
&lt;li&gt;total tokens per minute by model&lt;/li&gt;
&lt;li&gt;total tokens per tenant&lt;/li&gt;
&lt;li&gt;TTFT by input token bucket&lt;/li&gt;
&lt;li&gt;TPOT by output token bucket&lt;/li&gt;
&lt;li&gt;queue time by token bucket&lt;/li&gt;
&lt;li&gt;percentage of requests near context limits&lt;/li&gt;
&lt;li&gt;percentage of requests hitting output cap&lt;/li&gt;
&lt;li&gt;retrieved context tokens per request&lt;/li&gt;
&lt;li&gt;chat history tokens per request&lt;/li&gt;
&lt;li&gt;KV cache usage over time&lt;/li&gt;
&lt;li&gt;waiting requests alongside waiting token estimates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last point is important.&lt;/p&gt;

&lt;p&gt;Do not only ask how many requests are waiting. Ask how many tokens are waiting. A queue of 20 short chat requests and a queue of 20 long document-analysis requests are not the same queue.&lt;/p&gt;
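
&lt;p&gt;To make the difference concrete, here is a minimal sketch that estimates waiting tokens for two queues of equal length. The &lt;code&gt;QueuedRequest&lt;/code&gt; type and the token counts are hypothetical, and the estimate uses the requested output cap as an upper bound because the real output length is not known while a request waits.&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class QueuedRequest:
    input_tokens: int        # tokens in the final assembled prompt
    max_output_tokens: int   # requested output cap, an upper bound on decode work

def waiting_token_estimate(queue):
    """Upper-bound estimate of the token work waiting in a queue."""
    return sum(r.input_tokens + r.max_output_tokens for r in queue)

# Two hypothetical queues with the same number of requests.
short_chat = [QueuedRequest(input_tokens=200, max_output_tokens=256) for _ in range(20)]
long_docs = [QueuedRequest(input_tokens=30_000, max_output_tokens=2_000) for _ in range(20)]

for name, queue in (("short chat", short_chat), ("long documents", long_docs)):
    tokens = waiting_token_estimate(queue)
    print(f"{name}: {len(queue)} requests, about {tokens:,} tokens waiting")
```

&lt;p&gt;Both queues report 20 waiting requests, yet the second one holds roughly 70 times more token work.&lt;/p&gt;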

&lt;h2&gt;
  
  
  Product changes become infrastructure changes
&lt;/h2&gt;

&lt;p&gt;One uncomfortable part of LLM platforms is that product changes can become infrastructure changes very quickly.&lt;/p&gt;

&lt;p&gt;In a normal web app, adding a new field to a response may not matter much. In an LLM application, adding more context to a prompt can change capacity. Increasing retrieval depth can change latency. Keeping longer chat history can change memory pressure. Allowing longer outputs can change GPU occupancy.&lt;/p&gt;

&lt;p&gt;The product team might say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We only changed the prompt.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The platform team hears:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We changed the workload.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Both are true.&lt;/p&gt;

&lt;p&gt;This does not mean product teams should be afraid of improving prompts. It means token impact should be visible before and after the change. If a new prompt improves answer quality but increases average input tokens by 3x, that may be a good tradeoff. But it should be a conscious tradeoff.&lt;/p&gt;

&lt;p&gt;If a RAG change improves accuracy but pushes p99 prompts near the context limit, that should be visible before production users discover the latency problem. If a new report-generation mode produces 10x more output tokens than chat, it probably needs a different workload class and different expectations.&lt;/p&gt;

&lt;p&gt;The platform question is not "are tokens bad?"&lt;/p&gt;

&lt;p&gt;Tokens are the product.&lt;/p&gt;

&lt;p&gt;The question is whether you know how many you are serving, where they come from, and what they do to your capacity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical rules for platform teams
&lt;/h2&gt;

&lt;p&gt;If you are starting to serve LLMs on Kubernetes, measure input and output tokens for every request. Do not wait until the first incident to add token metrics. Track the final assembled prompt, not just the user message. The model does not care what the user typed. It cares what your application sent.&lt;/p&gt;

&lt;p&gt;Break input tokens down by source. Separate system prompt, chat history, retrieved context, tool results, and user message. Track requested max output tokens separately from actual output tokens. One tells you accepted risk. The other tells you real work.&lt;/p&gt;
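
&lt;p&gt;One possible shape for that per-request record is sketched below. The field names are illustrative, and &lt;code&gt;count_tokens&lt;/code&gt; is a stand-in that counts whitespace-separated words; a real implementation would use the tokenizer that matches the served model.&lt;/p&gt;

```python
def count_tokens(text):
    """Stand-in tokenizer: counts whitespace-separated words.

    Replace with the tokenizer that matches your model.
    """
    return len(text.split())

def prompt_token_breakdown(parts):
    """Return token counts per source plus the total for the assembled prompt."""
    breakdown = {source: count_tokens(text) for source, text in parts.items()}
    breakdown["total_input_tokens"] = sum(breakdown.values())
    return breakdown

# Hypothetical prompt parts, as the application assembled them.
parts = {
    "system_prompt": "You are a careful assistant for contract review.",
    "chat_history": "User asked about clause 4. Assistant summarized clause 4.",
    "retrieved_context": "Clause 4 states that either party may terminate with notice. " * 3,
    "tool_results": "",
    "user_message": "What is the notice period?",
}

record = prompt_token_breakdown(parts)
record["requested_max_output_tokens"] = 512   # accepted risk
record["actual_output_tokens"] = 37           # real work, known only after generation
print(record)
```

&lt;p&gt;Even in this toy record, retrieved context accounts for more than half of the input, which is the kind of fact a single input-token counter hides.&lt;/p&gt;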

&lt;p&gt;Use token buckets in latency dashboards. A p95 latency graph without token buckets mixes small chat requests and huge document requests into one misleading line. Watch p95 and p99 token counts, not just averages. The tail is where LLM serving gets painful.&lt;/p&gt;

&lt;p&gt;Put budgets on RAG retrieval. &lt;code&gt;top_k&lt;/code&gt; is not only a relevance knob. It is a capacity knob. Treat context windows as limits, not targets. Just because a model accepts long context does not mean every request should use it. Set sane output defaults. Long answers should be intentional, not the accidental default for every route.&lt;/p&gt;
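
&lt;p&gt;A retrieval budget can be as simple as the sketch below, which caps retrieved context by both &lt;code&gt;top_k&lt;/code&gt; and a token budget. The function name, the chunk sizes, and the budget are assumptions for illustration, not part of any particular RAG framework.&lt;/p&gt;

```python
def select_chunks(ranked_chunks, token_budget, top_k):
    """Keep at most top_k chunks, in rank order, without exceeding token_budget.

    ranked_chunks: list of (chunk_id, token_count), best match first.
    """
    selected, used = [], 0
    for chunk_id, tokens in ranked_chunks[:top_k]:
        if used + tokens > token_budget:
            break  # stop at the first chunk that would exceed the budget
        selected.append(chunk_id)
        used += tokens
    return selected, used

# Hypothetical retrieval result: chunk ids with their token counts.
ranked = [("a", 400), ("b", 900), ("c", 350), ("d", 1200), ("e", 300)]
chosen, used = select_chunks(ranked, token_budget=2000, top_k=5)
print(f"kept {chosen} using {used} of 2000 budgeted tokens")
```

&lt;p&gt;Stopping at the first chunk that does not fit keeps the selection in strict rank order. Skipping that chunk and continuing with smaller ones would fill the budget more tightly at the cost of including lower-ranked context; which behavior is better depends on the product.&lt;/p&gt;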

&lt;p&gt;Separate workload classes when needed. Short interactive chat, long RAG, report generation, agent workflows, and batch summarization do not have the same shape. Review token growth after product changes. Prompt changes, retrieval changes, memory changes, and tool changes can all affect infrastructure.&lt;/p&gt;

&lt;p&gt;These rules are not about making the system slower or less useful. They are about making the system understandable.&lt;/p&gt;

&lt;p&gt;You cannot operate what you do not measure. And in LLM serving, measuring only requests means you are measuring the envelope while ignoring the work inside it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real unit of scale
&lt;/h2&gt;

&lt;p&gt;The request is still useful at the API boundary. You need it for authentication, rate limits, logs, tracing, errors, and user flows. But it is not the right unit for LLM capacity.&lt;/p&gt;

&lt;p&gt;It cannot tell you how much prompt the model processed, how long the model generated, how much KV cache was needed, or whether the workload was short chat, long-context RAG, report generation, or an agent loop.&lt;/p&gt;

&lt;p&gt;Tokens get you closer to the truth. Input tokens explain much of the work before the first response appears. Output tokens explain how long the model keeps generating. Token distributions explain why averages lie. Token sources explain which product behavior changed the workload. Token-aware metrics explain why your Kubernetes deployment looks healthy while users still feel latency.&lt;/p&gt;

&lt;p&gt;Part 1 was about letting go of the normal web app scaling model. Part 2 is about replacing one of its most misleading assumptions.&lt;/p&gt;

&lt;p&gt;For LLMs on Kubernetes, you are not really scaling requests.&lt;/p&gt;

&lt;p&gt;You are scaling token work across expensive, memory-constrained, latency-sensitive GPU systems.&lt;/p&gt;

&lt;p&gt;Once you see that, the rest of the platform starts to make more sense.&lt;/p&gt;

&lt;h2&gt;
  
  
  Continue the series
&lt;/h2&gt;

&lt;p&gt;I am writing this as a practical series on hosting large LLMs on Kubernetes, from GPU nodes and model servers to autoscaling, latency, cost, and production architecture. If you want the next part, subscribe to the newsletter.&lt;/p&gt;

&lt;p&gt;I am also preparing a free &lt;strong&gt;LLM Serving on Kubernetes Production Readiness Checklist&lt;/strong&gt; with the metrics, dashboard questions, and architecture review points platform teams should ask before putting an LLM workload in production. Subscribe and I will share it when it is ready.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>llm</category>
      <category>devops</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Everything You Know About Scaling Web Apps Breaks When You Serve an LLM</title>
      <dc:creator>Pawan Kumar</dc:creator>
      <pubDate>Thu, 14 May 2026 03:33:30 +0000</pubDate>
      <link>https://dev.to/the-persistent-engineer/everything-you-know-about-scaling-web-apps-breaks-when-you-serve-an-llm-2141</link>
      <guid>https://dev.to/the-persistent-engineer/everything-you-know-about-scaling-web-apps-breaks-when-you-serve-an-llm-2141</guid>
      <description>&lt;p&gt;Most platform engineers already know how to scale a web app. Put it in a container. Deploy it on Kubernetes. Add CPU and memory requests. Put a Service or Ingress in front. Configure HPA. Watch p95 latency, error rate, CPU, memory, and request throughput. Add replicas when traffic goes up. This is Part 1 of a practical series on hosting large LLMs on Kubernetes.&lt;/p&gt;

&lt;p&gt;That playbook works for a lot of services. Then you try to serve a large language model, and suddenly the old model starts cracking. A request is no longer just a request. Memory does not just mean RAM. Latency is not one number. Scaling a pod does not mean capacity appears instantly. One "replica" may need one GPU, eight GPUs, or several machines working together.&lt;/p&gt;

&lt;p&gt;And the bottleneck may not be CPU at all. The first mental shift is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;LLM serving is not normal web serving.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The real unit of work is the token.&lt;/p&gt;

&lt;h2&gt;
  
  
  A request is no longer a request
&lt;/h2&gt;

&lt;p&gt;In a normal web app, request count is often a useful planning signal. Not perfect, obviously. Some endpoints are heavier than others. Some queries are ugly. Some users manage to find the one path that melts the database. But request count still tells you something.&lt;/p&gt;

&lt;p&gt;With LLMs, it can lie to your face.&lt;/p&gt;

&lt;p&gt;One user asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Summarize this sentence.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Another user asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Analyze this 80-page contract, compare it with these policy documents, extract the risks, and generate a detailed memo.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Both are one request. They are not the same workload.&lt;/p&gt;

&lt;p&gt;The second request may contain thousands of input tokens. It may generate thousands of output tokens. It may sit on GPU memory for longer. It may increase queueing delay for everyone behind it. It may consume far more KV cache. It may make your latency charts look haunted.&lt;/p&gt;

&lt;p&gt;So if you only measure requests per second, you are almost blind. For LLMs, you need to care about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;input tokens&lt;/li&gt;
&lt;li&gt;output tokens&lt;/li&gt;
&lt;li&gt;tokens per second&lt;/li&gt;
&lt;li&gt;time to first token&lt;/li&gt;
&lt;li&gt;time per output token&lt;/li&gt;
&lt;li&gt;queue depth&lt;/li&gt;
&lt;li&gt;batch size&lt;/li&gt;
&lt;li&gt;GPU memory&lt;/li&gt;
&lt;li&gt;KV cache usage&lt;/li&gt;
&lt;li&gt;model loading time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a very different world from normal HTTP throughput.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM inference has two phases, and they behave differently
&lt;/h2&gt;

&lt;p&gt;When a user sends a prompt to an LLM, the model does not handle the request as one uniform block of work. At a high level, inference has two phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;prefill&lt;/li&gt;
&lt;li&gt;decode&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Prefill is where the model processes the input prompt. If the prompt is long, prefill gets expensive. This is where the model reads the context and builds the internal state needed to start generating. Decode is where the model generates output tokens one at a time. This is the part users see when text starts streaming on the screen.&lt;/p&gt;

&lt;p&gt;These phases stress the system differently. Prefill tends to be compute-bound. Decode is often memory-bandwidth-bound. Prefill depends heavily on input length. Decode depends heavily on output length. Both affect latency, throughput, cost, and capacity.&lt;/p&gt;

&lt;p&gt;This distinction does not usually matter when you are scaling a normal API. You do not think of a checkout endpoint as having two GPU phases with different scheduling behavior. With LLMs, you have to.&lt;/p&gt;

&lt;p&gt;If you ignore prefill and decode, you will struggle to explain why first token latency is slow, why long prompts hurt so much, or why the GPU looks busy but users still complain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Latency is not one number anymore
&lt;/h2&gt;

&lt;p&gt;For web services, we usually treat latency as a single quantity, the duration of the request, and look at its distribution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;p50 latency&lt;/li&gt;
&lt;li&gt;p95 latency&lt;/li&gt;
&lt;li&gt;p99 latency&lt;/li&gt;
&lt;li&gt;request duration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For LLMs, that is not enough. Two latency numbers matter a lot.&lt;/p&gt;

&lt;h3&gt;
  
  
  Time to first token
&lt;/h3&gt;

&lt;p&gt;Time to first token, or TTFT, is how long the user waits before the model starts responding. This controls the feeling of responsiveness.&lt;/p&gt;

&lt;p&gt;If nothing appears for five seconds, the product feels slow. It does not matter that the final answer is useful. The user has already started wondering if the system is stuck.&lt;/p&gt;

&lt;p&gt;TTFT is affected by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;queueing delay&lt;/li&gt;
&lt;li&gt;prompt length&lt;/li&gt;
&lt;li&gt;prefill time&lt;/li&gt;
&lt;li&gt;model routing&lt;/li&gt;
&lt;li&gt;batch scheduling&lt;/li&gt;
&lt;li&gt;GPU availability&lt;/li&gt;
&lt;li&gt;cold starts&lt;/li&gt;
&lt;li&gt;cache behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Users feel TTFT sharply because silence feels broken.&lt;/p&gt;

&lt;h3&gt;
  
  
  Time per output token
&lt;/h3&gt;

&lt;p&gt;Time per output token, or TPOT, measures how fast the model generates each token after generation starts. This controls the streaming experience.&lt;/p&gt;

&lt;p&gt;Good TTFT with bad TPOT feels like the model wakes up quickly and then crawls. Good TPOT makes the answer feel alive, even if the full response takes time.&lt;/p&gt;

&lt;p&gt;TPOT is affected by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;decode efficiency&lt;/li&gt;
&lt;li&gt;GPU memory bandwidth&lt;/li&gt;
&lt;li&gt;batch size&lt;/li&gt;
&lt;li&gt;KV cache pressure&lt;/li&gt;
&lt;li&gt;model size&lt;/li&gt;
&lt;li&gt;quantization&lt;/li&gt;
&lt;li&gt;serving engine&lt;/li&gt;
&lt;li&gt;hardware type&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Normal web systems rarely force you to separate "time until the response starts" from "speed at which the rest of the response streams." LLM serving does.&lt;/p&gt;
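
&lt;p&gt;The two numbers relate to end-to-end latency in a simple way: roughly TTFT plus TPOT for every token after the first. The sketch below uses that relationship with made-up timings; the function names are illustrative.&lt;/p&gt;

```python
def tpot_seconds(e2e_seconds, ttft_seconds, output_tokens):
    """Mean time per output token, measured after the first token arrives."""
    if output_tokens > 1:
        return (e2e_seconds - ttft_seconds) / (output_tokens - 1)
    return 0.0

def end_to_end_seconds(ttft_seconds, tpot, output_tokens):
    """End-to-end latency implied by a TTFT, a TPOT, and an output length."""
    return ttft_seconds + tpot * max(output_tokens - 1, 0)

# Two hypothetical 400-token responses that finish at nearly the same time.
quick_start = end_to_end_seconds(ttft_seconds=0.4, tpot=0.030, output_tokens=400)
slow_start = end_to_end_seconds(ttft_seconds=5.0, tpot=0.0185, output_tokens=400)

print(f"quick start, slower stream: {quick_start:.2f}s end to end")
print(f"slow start, faster stream:  {slow_start:.2f}s end to end")
```

&lt;p&gt;A single request-duration metric would report these two responses as nearly identical, even though the second one leaves the user looking at an empty screen for five seconds.&lt;/p&gt;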

&lt;h2&gt;
  
  
  Memory means GPU memory now
&lt;/h2&gt;

&lt;p&gt;In a web app, memory usually means heap, runtime overhead, in-process cache, or connection pools. In LLM serving, memory often means GPU memory. And GPU memory is painful because it is limited, expensive, and easy to waste.&lt;/p&gt;

&lt;p&gt;You need GPU memory for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;model weights&lt;/li&gt;
&lt;li&gt;KV cache&lt;/li&gt;
&lt;li&gt;runtime buffers&lt;/li&gt;
&lt;li&gt;activations&lt;/li&gt;
&lt;li&gt;batching overhead&lt;/li&gt;
&lt;li&gt;framework overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Model weights are the obvious part. A 7 billion parameter model in FP16 or BF16 needs roughly 14 GB just for weights. A 70 billion parameter model needs roughly 140 GB just for weights at that precision. That already means one GPU may not be enough.&lt;/p&gt;
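
&lt;p&gt;The arithmetic behind those figures is parameters times bytes per parameter. A small sketch, using decimal gigabytes and ignoring everything except the weights themselves:&lt;/p&gt;

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params, dtype):
    """Approximate weight memory in GB (1 GB = 10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for params in (7e9, 70e9):
    size = weight_memory_gb(params, "bf16")
    print(f"{params / 1e9:.0f}B params in bf16: about {size:.0f} GB")
```

&lt;p&gt;The same function shows why quantization is a capacity lever: halving the bytes per parameter halves the weight footprint.&lt;/p&gt;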

&lt;p&gt;But weights are only the obvious cost. The hidden cost is KV cache.&lt;/p&gt;

&lt;p&gt;KV cache stores the key and value tensors from previous tokens so the model does not recompute everything from scratch during generation. The longer the context and the more concurrent users you serve, the more KV cache you need.&lt;/p&gt;

&lt;p&gt;This is why long context is not just a product feature. It is an infra bill. Every extra token you allow into the context window can come back as GPU memory pressure. Maximum context length is not only a model capability. It is a capacity planning decision.&lt;/p&gt;
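
&lt;p&gt;A back-of-the-envelope estimate shows how quickly this adds up. For a standard transformer, each token stores one key vector and one value vector per layer per KV head. The model shape below is illustrative rather than taken from a specific model, and the result is a worst case in which every sequence fills the whole context.&lt;/p&gt;

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """Bytes of KV cache for one token: a key and a value per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

def kv_cache_gib(context_tokens, concurrent_sequences, **model):
    """Worst-case KV cache in GiB if every sequence uses the full context."""
    total = kv_cache_bytes_per_token(**model) * context_tokens * concurrent_sequences
    return total / 2**30

# Illustrative model shape (an assumption), with 16-bit cache entries.
model = dict(num_layers=32, num_kv_heads=32, head_dim=128)

per_token_kib = kv_cache_bytes_per_token(**model) / 1024
print(f"KV cache per token: {per_token_kib:.0f} KiB")
for context in (4096, 32768):
    size = kv_cache_gib(context, concurrent_sequences=16, **model)
    print(f"{context} tokens x 16 sequences: {size:.0f} GiB")
```

&lt;p&gt;Models that share KV heads across query heads, and serving engines that allocate cache in pages as sequences grow, need less than this worst case. The scaling with context length and concurrency is the same either way.&lt;/p&gt;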

&lt;h2&gt;
  
  
  Replicas are not always replicas
&lt;/h2&gt;

&lt;p&gt;In a normal web app, one replica usually means one pod running one copy of the application. Traffic goes up, add pods. With LLMs, the word "replica" can hide a lot.&lt;/p&gt;

&lt;p&gt;A small model may run inside one pod on one GPU.&lt;/p&gt;

&lt;p&gt;A larger model may need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multiple GPUs in one node&lt;/li&gt;
&lt;li&gt;multiple pods on one node&lt;/li&gt;
&lt;li&gt;multiple nodes&lt;/li&gt;
&lt;li&gt;tensor parallelism&lt;/li&gt;
&lt;li&gt;pipeline parallelism&lt;/li&gt;
&lt;li&gt;a Ray cluster&lt;/li&gt;
&lt;li&gt;a leader-worker setup&lt;/li&gt;
&lt;li&gt;a group of pods that must start together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So when someone says, "scale the model to 10 replicas," the first question should be: what is one replica?&lt;/p&gt;

&lt;p&gt;Is it one pod? One GPU? One tensor parallel group? One multi-node deployment? One endpoint backed by several workers? One prefill group plus one decode group?&lt;/p&gt;

&lt;p&gt;This is where Kubernetes abstractions get interesting. A Deployment works nicely for simple stateless services. Serious LLM serving may need Ray, KServe, LeaderWorkerSet, Kueue, Volcano, or custom orchestration.&lt;/p&gt;

&lt;p&gt;The model may not fit into the old "one pod equals one replica" picture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scaling a pod does not mean capacity is ready
&lt;/h2&gt;

&lt;p&gt;In a normal web app, a new pod can become useful quickly. The image is pulled. The process starts. Readiness passes. Traffic flows. For LLMs, a new pod may sit there for a while before it can handle real traffic.&lt;/p&gt;

&lt;p&gt;It may need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pull a large container image&lt;/li&gt;
&lt;li&gt;download model weights&lt;/li&gt;
&lt;li&gt;load hundreds of gigabytes from object storage or disk&lt;/li&gt;
&lt;li&gt;initialize CUDA&lt;/li&gt;
&lt;li&gt;allocate GPU memory&lt;/li&gt;
&lt;li&gt;build or load optimized engines&lt;/li&gt;
&lt;li&gt;warm up the model&lt;/li&gt;
&lt;li&gt;join a distributed serving group&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This can take minutes. Sometimes longer. So autoscaling is not just about deciding when to add replicas. It is about adding capacity early enough that it is ready before users feel the pain.&lt;/p&gt;

&lt;p&gt;That is much harder than scaling a normal web app. This is why LLM platforms often use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;minimum warm replicas&lt;/li&gt;
&lt;li&gt;preloaded models&lt;/li&gt;
&lt;li&gt;local NVMe model cache&lt;/li&gt;
&lt;li&gt;warm pools&lt;/li&gt;
&lt;li&gt;separate GPU node pools&lt;/li&gt;
&lt;li&gt;predictive scaling&lt;/li&gt;
&lt;li&gt;queue-based scaling&lt;/li&gt;
&lt;li&gt;scheduled capacity for known peaks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scale to zero sounds great until the first user waits for a giant model to load.&lt;/p&gt;
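
&lt;p&gt;One way to think about adding capacity early is to size for the load you expect when the new replica is actually warm, not for the load right now. The sketch below uses a plain linear projection; the function name, the throughput figures, and the eight-minute warm-up are assumptions chosen for illustration.&lt;/p&gt;

```python
import math

def replicas_to_request(current_tps, growth_tps_per_minute, warmup_minutes, per_replica_tps):
    """Replicas needed for the load expected once new capacity is warm.

    Uses a simple linear projection of token throughput demand.
    """
    projected_tps = current_tps + growth_tps_per_minute * warmup_minutes
    return math.ceil(projected_tps / per_replica_tps)

# Hypothetical demand of 6000 tokens/s, growing by 400 tokens/s every minute.
for_now = replicas_to_request(6000, 0, 0, per_replica_tps=2500)
for_later = replicas_to_request(6000, 400, 8, per_replica_tps=2500)

print(f"replicas for the load right now: {for_now}")
print(f"replicas for the load after an 8 minute warm-up: {for_later}")
```

&lt;p&gt;Real traffic rarely grows linearly, so treat this as a way to reason about lead time, not as a forecasting method.&lt;/p&gt;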

&lt;h2&gt;
  
  
  CPU autoscaling becomes a weak signal
&lt;/h2&gt;

&lt;p&gt;CPU utilization is a decent signal for many Kubernetes workloads. Not perfect. But decent. For LLM serving, CPU can be almost irrelevant.&lt;/p&gt;

&lt;p&gt;The expensive work happens on GPUs. More specifically, the bottleneck may be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU memory&lt;/li&gt;
&lt;li&gt;GPU memory bandwidth&lt;/li&gt;
&lt;li&gt;KV cache capacity&lt;/li&gt;
&lt;li&gt;decode throughput&lt;/li&gt;
&lt;li&gt;queue depth&lt;/li&gt;
&lt;li&gt;batch saturation&lt;/li&gt;
&lt;li&gt;inter-GPU communication&lt;/li&gt;
&lt;li&gt;model server scheduling&lt;/li&gt;
&lt;li&gt;request length distribution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A model server can have low CPU usage and still be overloaded. It can have high GPU utilization and still deliver terrible latency. It can have enough compute but not enough KV cache capacity. It can be stuck serving long prompts while short prompts wait behind them.&lt;/p&gt;

&lt;p&gt;So if you autoscale only on CPU, the platform may make the wrong decision at the worst possible time.&lt;/p&gt;

&lt;p&gt;Better signals include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;queue depth&lt;/li&gt;
&lt;li&gt;waiting requests&lt;/li&gt;
&lt;li&gt;ongoing requests per replica&lt;/li&gt;
&lt;li&gt;batch size&lt;/li&gt;
&lt;li&gt;TTFT&lt;/li&gt;
&lt;li&gt;TPOT&lt;/li&gt;
&lt;li&gt;tokens per second&lt;/li&gt;
&lt;li&gt;KV cache usage&lt;/li&gt;
&lt;li&gt;GPU memory pressure&lt;/li&gt;
&lt;li&gt;SLO burn rate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GPU utilization still matters. It just cannot be the only signal. LLM autoscaling has to understand the workload.&lt;/p&gt;
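
&lt;p&gt;As a minimal sketch of a scaling check built on several of these signals, the code below reports which ones exceed their limits. The signal names, values, and thresholds are placeholders rather than recommendations, and a real autoscaler would also need smoothing and cooldowns.&lt;/p&gt;

```python
def scale_out_reasons(signals, limits):
    """Return the names of the signals that exceed their limit."""
    return [name for name, value in signals.items() if value > limits[name]]

# Thresholds are placeholders, not recommendations.
limits = {
    "waiting_requests": 10,
    "kv_cache_usage": 0.90,      # fraction of KV cache in use
    "ttft_p95_seconds": 2.0,
    "cpu_utilization": 0.80,
}

# A hypothetical snapshot from an overloaded model server.
signals = {
    "waiting_requests": 35,
    "kv_cache_usage": 0.97,
    "ttft_p95_seconds": 4.2,
    "cpu_utilization": 0.18,     # CPU looks idle while the server is saturated
}

reasons = scale_out_reasons(signals, limits)
print("scale out:", bool(reasons), "because of", reasons)
```

&lt;p&gt;In this snapshot a CPU-only policy would see 18% utilization and do nothing, while three model-level signals are already past their limits.&lt;/p&gt;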

&lt;h2&gt;
  
  
  Round robin load balancing gets weird
&lt;/h2&gt;

&lt;p&gt;For a normal web app, round robin load balancing is often fine. Request 1 goes to pod A. Request 2 goes to pod B. Request 3 goes to pod C. For LLMs, this can be wasteful.&lt;/p&gt;

&lt;p&gt;A short prompt and a long prompt have completely different costs. A request with a cached prefix may be cheaper if it lands on the right worker. A long generation may occupy capacity much longer than the load balancer expects. One tenant may need lower latency than another. One model may need different hardware from another.&lt;/p&gt;

&lt;p&gt;Naive load balancing can create strange failures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one worker gets long prompts and slows down&lt;/li&gt;
&lt;li&gt;another worker stays underused&lt;/li&gt;
&lt;li&gt;KV cache locality is lost&lt;/li&gt;
&lt;li&gt;prefix caching becomes less useful&lt;/li&gt;
&lt;li&gt;tail latency gets worse&lt;/li&gt;
&lt;li&gt;GPU utilization looks fine while users are unhappy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLM serving needs smarter routing.&lt;/p&gt;

&lt;p&gt;Good routing may consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;model name&lt;/li&gt;
&lt;li&gt;prompt length&lt;/li&gt;
&lt;li&gt;estimated output length&lt;/li&gt;
&lt;li&gt;tenant priority&lt;/li&gt;
&lt;li&gt;cache locality&lt;/li&gt;
&lt;li&gt;GPU availability&lt;/li&gt;
&lt;li&gt;queue depth&lt;/li&gt;
&lt;li&gt;hardware type&lt;/li&gt;
&lt;li&gt;region&lt;/li&gt;
&lt;li&gt;latency SLO&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why inference gateways, model-aware routing, and cache-aware scheduling matter.&lt;/p&gt;
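
&lt;p&gt;A toy comparison shows why request cost matters to routing. The sketch below sends the same hypothetical requests to three workers, first with round robin and then by choosing the worker with the fewest assigned tokens. It is deliberately simplified: it ignores requests finishing over time, output length, and cache locality.&lt;/p&gt;

```python
from itertools import cycle

def route_round_robin(requests, workers):
    """Assign requests to workers in rotation; return tokens per worker."""
    load = {w: 0 for w in workers}
    for worker, tokens in zip(cycle(workers), requests):
        load[worker] += tokens
    return load

def route_least_tokens(requests, workers):
    """Assign each request to the worker with the fewest tokens so far."""
    load = {w: 0 for w in workers}
    for tokens in requests:
        target = min(workers, key=lambda w: load[w])
        load[target] += tokens
    return load

# Hypothetical prompt sizes, in arrival order.
requests = [8000, 200, 300, 9000, 250, 150, 7000, 100, 400]
workers = ["gpu-a", "gpu-b", "gpu-c"]

print("round robin: ", route_round_robin(requests, workers))
print("least tokens:", route_least_tokens(requests, workers))
```

&lt;p&gt;With this arrival order, round robin happens to send every long prompt to the same worker. Each worker receives three requests, so a request-count view looks perfectly balanced while one worker holds almost all of the token work.&lt;/p&gt;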

&lt;h2&gt;
  
  
  Cost changes shape
&lt;/h2&gt;

&lt;p&gt;In a web app, cost usually grows with pods, CPU, memory, database load, and network traffic. In LLM serving, cost is shaped by GPU usage, and GPUs are expensive enough that small inefficiencies matter.&lt;/p&gt;

&lt;p&gt;You can burn money through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;idle GPUs&lt;/li&gt;
&lt;li&gt;poor batching&lt;/li&gt;
&lt;li&gt;overprovisioned replicas&lt;/li&gt;
&lt;li&gt;long context windows&lt;/li&gt;
&lt;li&gt;bad routing&lt;/li&gt;
&lt;li&gt;large models for simple tasks&lt;/li&gt;
&lt;li&gt;no quantization&lt;/li&gt;
&lt;li&gt;slow cold starts&lt;/li&gt;
&lt;li&gt;inefficient KV cache usage&lt;/li&gt;
&lt;li&gt;serving every request with the same model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cost unit also changes. Instead of only thinking about cost per request, you start thinking about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cost per input token&lt;/li&gt;
&lt;li&gt;cost per output token&lt;/li&gt;
&lt;li&gt;cost per million tokens&lt;/li&gt;
&lt;li&gt;cost per model&lt;/li&gt;
&lt;li&gt;cost per tenant&lt;/li&gt;
&lt;li&gt;cost per GPU hour&lt;/li&gt;
&lt;li&gt;cost per region&lt;/li&gt;
&lt;li&gt;cost per latency tier&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the cloud bill version of the first mental shift: a request is not a request. A token is the real unit of work.&lt;/p&gt;
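
&lt;p&gt;Cost per million tokens falls out of GPU price and sustained throughput. The price, GPU count, and throughput figures in this sketch are invented for illustration.&lt;/p&gt;

```python
def cost_per_million_tokens(gpu_hour_usd, num_gpus, tokens_per_second):
    """Serving cost in USD per one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_usd * num_gpus / tokens_per_hour * 1_000_000

# Same hypothetical 4-GPU deployment, well utilized versus mostly idle.
busy = cost_per_million_tokens(gpu_hour_usd=2.50, num_gpus=4, tokens_per_second=2000)
idle_heavy = cost_per_million_tokens(gpu_hour_usd=2.50, num_gpus=4, tokens_per_second=500)

print(f"well utilized: ${busy:.2f} per million tokens")
print(f"mostly idle:   ${idle_heavy:.2f} per million tokens")
```

&lt;p&gt;The hardware bill is identical in both cases. Only the number of tokens it is spread across changes, which is why idle GPUs and poor batching show up directly in the unit cost.&lt;/p&gt;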

&lt;h2&gt;
  
  
  Kubernetes is still useful. It is just not enough by itself
&lt;/h2&gt;

&lt;p&gt;None of this means Kubernetes is the wrong platform for LLM serving. Kubernetes still gives you a lot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scheduling&lt;/li&gt;
&lt;li&gt;declarative deployment&lt;/li&gt;
&lt;li&gt;resource management&lt;/li&gt;
&lt;li&gt;isolation&lt;/li&gt;
&lt;li&gt;service discovery&lt;/li&gt;
&lt;li&gt;rollouts&lt;/li&gt;
&lt;li&gt;observability integrations&lt;/li&gt;
&lt;li&gt;autoscaling primitives&lt;/li&gt;
&lt;li&gt;platform patterns for multiple teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why many serious AI infrastructure platforms still use Kubernetes or something close to it. But Kubernetes does not automatically understand LLMs.&lt;/p&gt;

&lt;p&gt;Out of the box, Kubernetes does not know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what KV cache is&lt;/li&gt;
&lt;li&gt;whether a model is loaded&lt;/li&gt;
&lt;li&gt;whether a GPU group must be scheduled together&lt;/li&gt;
&lt;li&gt;whether pods should land in the same rack&lt;/li&gt;
&lt;li&gt;whether a request has 100 tokens or 100,000 tokens&lt;/li&gt;
&lt;li&gt;whether TTFT is bad&lt;/li&gt;
&lt;li&gt;whether a model server is overloaded despite low CPU&lt;/li&gt;
&lt;li&gt;whether a new replica will take 10 minutes to warm up&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You have to teach the platform these things through metrics, controllers, schedulers, serving frameworks, routing layers, and operational discipline. That is the real work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The old scaling model breaks
&lt;/h2&gt;

&lt;p&gt;The old web scaling model looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traffic increases
        ↓
CPU increases
        ↓
HPA adds pods
        ↓
Load balancer spreads requests
        ↓
Latency improves
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That still works for many stateless services. LLM serving looks more like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traffic increases
        ↓
Input and output token mix changes
        ↓
Queue depth grows
        ↓
KV cache pressure increases
        ↓
Batching behavior changes
        ↓
TTFT and TPOT drift
        ↓
GPU memory or decode throughput becomes the bottleneck
        ↓
Autoscaler needs model-aware metrics
        ↓
New capacity may take minutes to warm up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you bring only the old playbook, you will scale the wrong thing, at the wrong time, using the wrong signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  The new mental model
&lt;/h2&gt;

&lt;p&gt;To serve LLMs well, you need a different model in your head:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A request is not the unit. A token is.&lt;/li&gt;
&lt;li&gt;Memory is not just RAM. GPU memory and KV cache matter more.&lt;/li&gt;
&lt;li&gt;Latency is not one number. TTFT and TPOT matter separately.&lt;/li&gt;
&lt;li&gt;A replica may be a distributed group, not a single pod.&lt;/li&gt;
&lt;li&gt;Scaling is not instant because model loading is slow.&lt;/li&gt;
&lt;li&gt;CPU is not a reliable autoscaling signal by itself.&lt;/li&gt;
&lt;li&gt;Load balancing must understand request cost and cache locality.&lt;/li&gt;
&lt;li&gt;Long context is an infrastructure cost decision.&lt;/li&gt;
&lt;li&gt;Cost optimization starts with keeping expensive GPUs useful.&lt;/li&gt;
&lt;li&gt;Kubernetes is the foundation, but LLM-aware systems must be built on top.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once this clicks, the rest of LLM infrastructure becomes easier to reason about. You can understand why vLLM became popular. Why PagedAttention matters. Why KV cache dominates serving design. Why quantization is a capacity strategy. Why topology-aware scheduling matters. Why teams split prefill and decode. Why GPU cost optimization is its own discipline. Why normal autoscaling is not enough.&lt;/p&gt;

&lt;p&gt;LLM serving is not "deploy a model behind an API." It is a new platform engineering problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing thought
&lt;/h2&gt;

&lt;p&gt;For years, platform teams became very good at scaling stateless web services. We learned containers, Kubernetes, service meshes, autoscaling, observability, progressive delivery, and cloud cost optimization. That knowledge still matters, but LLM serving changes the shape of the problem.&lt;/p&gt;

&lt;p&gt;The bottlenecks move. The metrics change. The cost model changes. The scheduler matters more. The load balancer needs to get smarter. The GPU becomes the scarce resource. The token becomes the unit of work.&lt;/p&gt;

&lt;p&gt;So if you are trying to serve LLMs on Kubernetes, the first step is not installing a Helm chart. The first step is replacing the old mental model.&lt;/p&gt;

&lt;p&gt;Because everything you know about scaling web apps starts to break the moment you serve an LLM.&lt;/p&gt;





</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>devops</category>
      <category>mlops</category>
    </item>
    <item>
      <title>I Don't Want AI to Replace DevOps. I Want It to Read the Docs I'm Too Tired to Read</title>
      <dc:creator>Pawan Kumar</dc:creator>
      <pubDate>Thu, 07 May 2026 06:32:51 +0000</pubDate>
      <link>https://dev.to/the-persistent-engineer/i-dont-want-ai-to-replace-devops-i-want-it-to-read-the-docs-im-too-tired-to-read-1j2d</link>
      <guid>https://dev.to/the-persistent-engineer/i-dont-want-ai-to-replace-devops-i-want-it-to-read-the-docs-im-too-tired-to-read-1j2d</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://dheeth.blog/i-dont-want-ai-to-replace-devops/" rel="noopener noreferrer"&gt;dheeth.blog&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It's 2 AM. The pager went off eleven minutes ago. You're staring at a Kubernetes upgrade advisory that's forty-seven paragraphs long, and somewhere in paragraph thirty-one there's a breaking change about how EKS handles PodIdentity federation with IAM roles. You know it's in there. You read it three months ago. But right now your brain is running on caffeine and cortisol, and the words are blurring into each other.&lt;/p&gt;

&lt;p&gt;You could run the upgrade now and hope for the best. Or you could spend forty minutes re-reading the entire changelog, the Terraform provider notes, the Helm chart migration guide, and three different Slack threads from the last time someone did this.&lt;/p&gt;

&lt;p&gt;This is the part of DevOps nobody puts in conference talks. Not the elegant GitOps pipelines or the slick dashboards. The part where you're exhausted and you still have to make a decision that affects production, and the information you need is spread across nine browser tabs, a Confluence page from 2023, and a runbook that was last updated when your cluster was on 1.24.&lt;/p&gt;

&lt;p&gt;This is where I want AI to help. Not by taking over. Not by running &lt;code&gt;kubectl apply&lt;/code&gt; on my behalf while I sleep. By reading the damn docs for me.&lt;/p&gt;

&lt;h2&gt;
  
  
  The kind of tired that matters
&lt;/h2&gt;

&lt;p&gt;The Google SRE Workbook has a word for what happens when engineers spend too much time on repetitive operational work: toil. It defines toil as "the repetitive, predictable, constant stream of tasks related to maintaining a service." Rollouts, upgrades, alert triage, manual repairs, ticket-driven provisioning. Google puts a hard cap on it: no more than 50% of an SRE's time should go to operational work.&lt;/p&gt;

&lt;p&gt;The reasoning isn't just about efficiency. The workbook makes a point that has always stuck with me: time spent on toil is time not spent where human judgment, creativity, and design thinking matter.&lt;/p&gt;

&lt;p&gt;Here's what I think the SRE Workbook doesn't fully capture, at least not in those exact words. There's a specific kind of toil that doesn't look like toil. It doesn't involve clicking buttons or running the same script for the hundredth time. It's cognitive. It's the mental cost of assembling context from scattered sources before you can make a decision.&lt;/p&gt;

&lt;p&gt;Reading a Kubernetes release notes page that's 3,000 words long to find the one deprecation that affects your cluster. Comparing two versions of a Helm &lt;code&gt;values.yaml&lt;/code&gt; to understand what changed between chart versions 4.2.1 and 5.0.0. Skimming a Terraform provider changelog to see if the &lt;code&gt;aws_eks_cluster&lt;/code&gt; resource changed its default behavior. Correlating an incident timeline from last Thursday with the deployment that happened two hours before the spike in 5xx errors.&lt;/p&gt;

&lt;p&gt;This work isn't glamorous. It doesn't produce artifacts. Nobody thanks you for spending an hour reading release notes. But if you skip it, you miss the breaking change that takes down a service at 3 AM on a Sunday.&lt;/p&gt;

&lt;p&gt;Sometimes the most exhausting part of an incident is not fixing the issue. It is building enough context to feel safe fixing it.&lt;/p&gt;

&lt;p&gt;I think of this as cognitive toil, and AI is unusually well suited to help with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I don't want an AI agent with production access
&lt;/h2&gt;

&lt;p&gt;Before I talk about what I do want, let me be clear about what I don't.&lt;/p&gt;

&lt;p&gt;I don't want an AI agent that has &lt;code&gt;kubectl apply&lt;/code&gt; access by default. I don't want one that can merge PRs, push to main, modify IAM policies, or restart services without a human in the loop. I've seen enough production incidents caused by humans who were tired, rushed, or copy-pasting from the wrong terminal. Giving that same power to something that hallucinates API flags and invents Kubernetes resources that don't exist is not progress. It's a new category of incident.&lt;/p&gt;

&lt;p&gt;In application code, an AI mistake might fail a test. In DevOps, an AI mistake might page five teams, drain the wrong node, rotate the wrong secret, or turn a small incident into a very educational afternoon.&lt;/p&gt;

&lt;p&gt;The Stack Overflow 2025 Developer Survey backs this up. 76% of developers don't plan to use AI for deployment or monitoring tasks. Not because they're luddites. Because they know what's at stake. More developers actively distrust AI accuracy (46%) than trust it (33%). Only 3% highly trust it. That is the part that makes people nervous: AI can sound confident even when the answer still needs careful verification.&lt;/p&gt;

&lt;p&gt;In DevOps, "almost right" isn't a minor inconvenience. An "almost right" IAM policy is a security incident. An "almost right" Kubernetes manifest is a workload that runs fine until it doesn't, and then you're debugging at 2 AM wondering why the liveness probe path changed. An "almost right" Terraform plan is a production resource that gets destroyed and recreated instead of updated in place.&lt;/p&gt;

&lt;p&gt;The problem is not that AI is useless. The problem is that AI is useful enough to make dangerous workflows look reasonable. In DevOps, the gap between "sounds correct" and "safe to execute" is where incidents live.&lt;/p&gt;

&lt;p&gt;The hard part of DevOps is rarely knowing the command. &lt;code&gt;kubectl apply -f manifest.yaml&lt;/code&gt; isn't the hard part. The hard part is knowing whether that command is safe in this environment, with this version of Kubernetes, with these admission controllers, with this cluster autoscaler configuration, right after that EKS add-on got updated. That requires context, judgment, and accountability. AI is genuinely useful for the first two, but it can't own the third. Not yet. Maybe not ever.&lt;/p&gt;

&lt;p&gt;Most production work is not blocked because nobody knows how to type &lt;code&gt;kubectl&lt;/code&gt;. It is blocked because nobody is completely sure what is safe to do next.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I actually want AI to do
&lt;/h2&gt;

&lt;p&gt;I want AI to be the colleague who actually reads the release notes before standup. The one who highlights the three things that matter out of a forty-seven-paragraph changelog. The one who can look at a Terraform plan diff and tell you, in plain language, what's about to change and what might break.&lt;/p&gt;

&lt;p&gt;Concretely, here's what that looks like.&lt;/p&gt;

&lt;p&gt;When I'm going from Kubernetes 1.29 to 1.30, I want something that tells me what got deprecated, what changed in API versions, and what I need to act on before upgrading. Skip the boilerplate about "improved performance." Focus on the removals and behavioral changes.&lt;/p&gt;

&lt;p&gt;Before I update the VPC CNI add-on, I want to know if this version is compatible with my current Kubernetes version, my node group AMI, and the Calico network policy version we're running. That compatibility matrix is spread across three AWS docs pages and it changes every quarter.&lt;/p&gt;

&lt;p&gt;When the AWS Terraform provider goes from 5.x to 6.x, I don't want to read the entire migration guide. I want to know which resources I'm actually using that changed behavior. Focus on my code, not the universe of possibilities.&lt;/p&gt;

&lt;p&gt;When I'm upgrading a Helm chart from 4.x to 5.x, show me what changed in the default values: which new keys were introduced, which old keys were removed, which ones changed their default behavior. Better yet, cross-reference my current &lt;code&gt;values.yaml&lt;/code&gt; and tell me which of my overrides are now invalid.&lt;/p&gt;
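
&lt;p&gt;The mechanical half of that comparison needs no AI at all. Here is a minimal sketch, assuming the chart defaults and my overrides are already loaded as nested dictionaries. The function names and the sample values are hypothetical, not taken from any real chart.&lt;/p&gt;

```python
def flatten(values, prefix=""):
    """Flatten nested dicts into {"a.b.c": value} pairs."""
    flat = {}
    for key, value in values.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict) and value:
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def compare_chart_defaults(old_defaults, new_defaults, my_overrides):
    """Report added, removed, and changed default keys, plus overrides that no longer apply."""
    old = flatten(old_defaults)
    new = flatten(new_defaults)
    mine = flatten(my_overrides)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(k for k in set(old).intersection(new) if old[k] != new[k]),
        # Overrides that pointed at a key the new chart no longer defines.
        "stale_overrides": sorted(k for k in mine if k in old and k not in new),
    }


# Hypothetical defaults for two chart versions and one local override.
old_defaults = {"image": {"tag": "4.2.1"}, "service": {"type": "ClusterIP", "port": 80}}
new_defaults = {"image": {"tag": "5.0.0"}, "service": {"type": "ClusterIP", "ports": {"http": 80}}}
my_overrides = {"service": {"port": 8080}}

report = compare_chart_defaults(old_defaults, new_defaults, my_overrides)
for section, keys in report.items():
    print(section, keys)
```

&lt;p&gt;The diff is the easy part. What I want help with is the step after it: explaining why a key moved and what my stale override should become.&lt;/p&gt;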

&lt;p&gt;If I inherit a cluster with 200 custom resources I've never seen before, help me understand what they do without reading CRD documentation for six hours.&lt;/p&gt;

&lt;p&gt;When an incident happens, take the Slack thread, the PagerDuty timeline, and the post-mortem notes, and produce a runbook that the next on-call engineer can actually follow. One that isn't three years stale.&lt;/p&gt;

&lt;p&gt;When the error rate spiked at 14:32 and something was deployed at 14:15, pull the deployment diff, the relevant log lines, and the metrics shift into one view so I can see the connection without switching between four tools.&lt;/p&gt;
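
&lt;p&gt;A rough sketch of the first step of that correlation, assuming deployment events have already been exported as records with a timestamp. The &lt;code&gt;deployments_before&lt;/code&gt; helper, the service names, and the dates are all hypothetical.&lt;/p&gt;

```python
from datetime import datetime, timedelta


def deployments_before(spike_at, deployments, window_minutes=30):
    """Return deployments that landed within the window before the spike, closest first."""
    window = timedelta(minutes=window_minutes)
    suspects = []
    for deploy in deployments:
        gap = spike_at - deploy["at"]  # positive when the deploy came before the spike
        if gap >= timedelta(0) and window >= gap:
            minutes = int(gap.total_seconds() // 60)
            suspects.append({**deploy, "minutes_before_spike": minutes})
    return sorted(suspects, key=lambda d: d["minutes_before_spike"])


# Hypothetical deployment log for one afternoon.
deployments = [
    {"service": "search", "at": datetime(2026, 5, 7, 11, 2)},
    {"service": "checkout-api", "at": datetime(2026, 5, 7, 14, 15)},
    {"service": "payments", "at": datetime(2026, 5, 7, 14, 40)},
]

suspects = deployments_before(datetime(2026, 5, 7, 14, 32), deployments)
for deploy in suspects:
    print(deploy["service"], deploy["minutes_before_spike"], "minutes before the spike")
```

&lt;p&gt;Being close in time is only a hint, not a cause. The shortlist just tells me which diff to open first.&lt;/p&gt;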

&lt;p&gt;When five services are throwing errors and the logs are a wall of stack traces, filter out the noise, group the unique errors, and tell me which one started first. That's the one I care about.&lt;/p&gt;
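
&lt;p&gt;As a sketch of what "group the unique errors and tell me which one started first" could mean in practice, assuming each log line has the shape &lt;code&gt;timestamp service message&lt;/code&gt;. The line format, the normalization rules, and the sample lines are assumptions for illustration only.&lt;/p&gt;

```python
import re
from datetime import datetime


def signature(message):
    """Collapse variable parts (hex ids, numbers) so similar errors group together."""
    message = re.sub(r"0x[0-9a-fA-F]+", "HEX", message)
    return re.sub(r"\d+", "N", message)


def group_errors(lines):
    """Group 'ISO-timestamp service message' lines by signature, earliest group first."""
    groups = {}
    for line in lines:
        stamp, service, message = line.split(" ", 2)
        seen_at = datetime.fromisoformat(stamp)
        key = (service, signature(message))
        entry = groups.setdefault(
            key,
            {"service": service, "signature": key[1], "first_seen": seen_at, "count": 0},
        )
        entry["first_seen"] = min(entry["first_seen"], seen_at)
        entry["count"] += 1
    return sorted(groups.values(), key=lambda g: g["first_seen"])


# Hypothetical log lines from three services during one incident.
lines = [
    "2026-05-07T14:31:58 orders upstream timeout after 3000 ms calling payments",
    "2026-05-07T14:31:40 payments connection pool exhausted (size 20)",
    "2026-05-07T14:32:05 orders upstream timeout after 3000 ms calling payments",
    "2026-05-07T14:32:10 cart request 8841 failed: orders returned 503",
    "2026-05-07T14:32:12 cart request 8842 failed: orders returned 503",
]

groups = group_errors(lines)
for group in groups:
    print(group["first_seen"].time(), group["service"], group["count"], group["signature"])
```

&lt;p&gt;In this made-up incident the &lt;code&gt;payments&lt;/code&gt; pool error sorts to the top, ahead of the noisier timeouts it probably caused.&lt;/p&gt;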

&lt;p&gt;None of these require production access. None require the AI to execute anything. They require it to read, understand, summarize, compare, and present information so I can decide faster.&lt;/p&gt;

&lt;p&gt;The best DevOps AI will not feel magical. It will feel like a senior engineer left clean notes before going on vacation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The data says this approach works
&lt;/h2&gt;

&lt;p&gt;GitHub's study on Copilot found something interesting beyond speed. 87% of developers said AI helped them preserve mental effort during repetitive tasks. 73% said it helped them stay in flow. 60-75% said it helped them focus on more satisfying work. One senior engineer put it simply: with AI, they had to think less about the boring stuff, and when they had to think, it was the fun stuff.&lt;/p&gt;

&lt;p&gt;The DORA research on generative AI adds an important nuance. Developers who use AI extensively report higher job satisfaction, more time in flow state, and less burnout. But there's a catch: AI adoption didn't reduce time spent on toilsome, repetitive tasks. It sped up the valuable work developers already enjoyed, but didn't crack the code on automating drudgery. DORA also found that a 25% increase in AI adoption was associated with a decrease in delivery stability. The report's hypothesis is that AI lets teams generate more code and more changes faster than their review and testing processes can handle.&lt;/p&gt;

&lt;p&gt;Read that last sentence again. AI doesn't hurt stability because it writes bad code. It hurts stability because it lets teams produce more work than their feedback loops can safely absorb.&lt;/p&gt;

&lt;p&gt;This is exactly why the read-summarize-suggest model is the right one for DevOps. It gives engineers better context without adding unreviewed changes to the pipeline. It accelerates understanding without bypassing approval. It reduces the time between "I need to figure this out" and "I understand enough to decide" without collapsing the distance between "I decided" and "it's done."&lt;/p&gt;

&lt;h2&gt;
  
  
  A boundary that matters
&lt;/h2&gt;

&lt;p&gt;I'm not anti-agent. I think autonomous AI agents will eventually have a role in infrastructure operations. But the keyword is eventually, and the prerequisite is trust, and trust is earned slowly and lost quickly.&lt;/p&gt;

&lt;p&gt;The Stack Overflow survey also shows that developers are much more cautious with high-responsibility work. Most respondents do not plan to use AI for deployment or monitoring. These are not people who hate AI. These are people who know where the blast radius lives.&lt;/p&gt;

&lt;p&gt;The DORA report reinforces this: trust directly drives AI productivity. Developers who trust AI accept more suggestions, submit more changes, and spend less time searching for information. But DORA also found that 39% of developers still trust AI outputs "a little" or "not at all."&lt;/p&gt;

&lt;p&gt;In DevOps, trust isn't about vibes. It's about being right when being wrong has consequences. An AI that summarizes a changelog and misses a breaking change is annoying but survivable. An AI that applies a change based on that incomplete summary is a production incident.&lt;/p&gt;

&lt;p&gt;The line I draw is simple. AI should read, summarize, compare, draft, and suggest. Humans should approve, execute, and own.&lt;/p&gt;

&lt;p&gt;Let AI read. Let AI summarize. Let AI compare. Let AI draft. Let AI suggest.&lt;/p&gt;

&lt;p&gt;But make humans approve, execute, and own.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fatigue I want to replace
&lt;/h2&gt;

&lt;p&gt;DevOps has a burnout problem. This isn't news. The on-call rotations, the incident pressure, the constant context switching between ten different tools and three different cloud providers and a pile of documentation that's always slightly out of date.&lt;/p&gt;

&lt;p&gt;The fatigue is real. It accumulates. It's not the dramatic kind where someone screams and quits. It's the quiet kind where you stop reading the full changelog because you've read forty of them and nothing ever breaks, until one Tuesday it does. Where you stop updating the runbook because nobody reads it anyway, including you. Where you start copy-pasting Terraform modules from the last project because you don't have the energy to check if the AWS provider changed the defaults again.&lt;/p&gt;

&lt;p&gt;AI can't fix organizational dysfunction. It can't fix understaffed on-call rotations or unreasonable SLAs. But it can reduce the cognitive tax of the work that sits between "I got paged" and "I understand what is happening." It can give you back the thirty minutes you'd have spent re-reading docs you already read once. It can catch the breaking change you'd have missed at 2 AM.&lt;/p&gt;

&lt;p&gt;I don't want AI to replace DevOps engineers. I want it to replace the exhaustion that makes us worse at the job we're good at. I want it to be the thing that reads the docs so I can focus on deciding what to do with what they say. I want it to handle the reading so I can handle the thinking.&lt;/p&gt;

&lt;p&gt;That's not a smaller vision. It's a more honest one.&lt;/p&gt;




&lt;p&gt;References:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google SRE Workbook, "Eliminating Toil": &lt;a href="https://sre.google/workbook/eliminating-toil/" rel="noopener noreferrer"&gt;sre.google/workbook/eliminating-toil/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;DORA, "Impact of Generative AI in Software Development": &lt;a href="https://dora.dev/ai/gen-ai-report/" rel="noopener noreferrer"&gt;dora.dev/ai/gen-ai-report/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Stack Overflow Developer Survey 2025, AI section: &lt;a href="https://survey.stackoverflow.co/2025/ai/" rel="noopener noreferrer"&gt;survey.stackoverflow.co/2025/ai/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub Research, "Quantifying GitHub Copilot's Impact on Developer Productivity and Happiness": &lt;a href="https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/" rel="noopener noreferrer"&gt;github.blog/news-insights/research/research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>ai</category>
      <category>sre</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>DevOps to Platform Engineer: The Career Shift Nobody Explains Properly</title>
      <dc:creator>Pawan Kumar</dc:creator>
      <pubDate>Thu, 30 Apr 2026 20:15:33 +0000</pubDate>
      <link>https://dev.to/the-persistent-engineer/devops-to-platform-engineer-the-career-shift-nobody-explains-properly-48f2</link>
      <guid>https://dev.to/the-persistent-engineer/devops-to-platform-engineer-the-career-shift-nobody-explains-properly-48f2</guid>
      <description>&lt;p&gt;If you've been in DevOps long enough, you've probably seen the job postings by now. "Platform Engineer." "Internal Developer Platform." "Platform-as-a-Product." The titles are everywhere. Gartner says 80% of large engineering organizations will have dedicated platform teams by 2026. That's up from 45% in 2022.&lt;/p&gt;

&lt;p&gt;But nobody really explains what changes. Not the buzzwords. The actual day job. The skills. The salary. The headaches.&lt;/p&gt;

&lt;p&gt;I work as a DevOps Engineer at a company that builds Kubernetes application platforms. So I'm living in the middle of this transition every single day. Let me break down what's actually happening, what it means for your career, and whether you should care.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Actually Happening
&lt;/h2&gt;

&lt;p&gt;Here's the short version: DevOps broke at scale. Not the philosophy. The practice.&lt;/p&gt;

&lt;p&gt;When you have 5 teams and 20 services, DevOps works beautifully. Everyone knows everyone. You can walk over to someone's desk (or Slack them) and figure out why the pipeline broke. The "culture of collaboration" actually functions.&lt;/p&gt;

&lt;p&gt;But at 50 teams? 500 services? Multiple clouds? That informal shared context collapses. Onboarding takes weeks instead of days. Every team builds slightly different CI/CD pipelines. Security reviews become bottlenecks. And you end up with 3 senior engineers who "know how things really work," and they're drowning.&lt;/p&gt;

&lt;p&gt;Platform engineering is the response to that breakdown. Instead of relying on culture and tribal knowledge, you build a product, an Internal Developer Platform (IDP), that encodes best practices into self-service tooling.&lt;/p&gt;

&lt;p&gt;The platform becomes the documentation. The guardrails become the governance. And the paved road becomes the easiest road.&lt;/p&gt;

&lt;h2&gt;
  
  
  DevOps vs Platform Engineering: The Real Differences
&lt;/h2&gt;

&lt;p&gt;Let's skip the marketing fluff. Here's what actually changes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;DevOps&lt;/th&gt;
&lt;th&gt;Platform Engineering&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary focus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Infrastructure, pipelines&lt;/td&gt;
&lt;td&gt;Developers (your users)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;How work comes to you&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tickets, Slack pings, "can you help me"&lt;/td&gt;
&lt;td&gt;Platform feature requests, adoption metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Governance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PR reviews, approval gates, manual checks&lt;/td&gt;
&lt;td&gt;Embedded into templates and workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Success metric&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Did the deploy work?"&lt;/td&gt;
&lt;td&gt;"Are developers choosing to use the platform?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scale model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Linear (more teams = more DevOps)&lt;/td&gt;
&lt;td&gt;Leverage (platform scales once, serves all)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Your mindset&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Let me fix this for you"&lt;/td&gt;
&lt;td&gt;"Let me build it so you never need to ask"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That last row is the fundamental shift. DevOps is a service mindset. Platform engineering is a product mindset.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Day in the Life
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DevOps Engineer's typical day:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build and maintain CI/CD pipelines (30%)&lt;/li&gt;
&lt;li&gt;Write Terraform, manage infrastructure (25%)&lt;/li&gt;
&lt;li&gt;Set up monitoring and alerting (15%)&lt;/li&gt;
&lt;li&gt;Automate deployment processes (20%)&lt;/li&gt;
&lt;li&gt;Help developers with infrastructure issues (10%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Platform Engineer's typical day:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build internal developer tools and abstractions (35%)&lt;/li&gt;
&lt;li&gt;Improve self-service capabilities (25%)&lt;/li&gt;
&lt;li&gt;Maintain the platform infrastructure itself (20%)&lt;/li&gt;
&lt;li&gt;Developer support, education, onboarding (10%)&lt;/li&gt;
&lt;li&gt;Platform documentation (10%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notice the shift: you're spending more time building things &lt;em&gt;for developers to use independently&lt;/em&gt; and less time &lt;em&gt;doing things for developers&lt;/em&gt;. It's the difference between being a chef and being someone who designs kitchen layouts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Salary Question (India-Focused)
&lt;/h2&gt;

&lt;p&gt;Let's talk numbers. I cross-referenced data from AmbitionBox, Glassdoor, Levels.fyi, and real job postings across LinkedIn and Naukri. Here is the realistic range for India in 2026:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Experience&lt;/th&gt;
&lt;th&gt;DevOps&lt;/th&gt;
&lt;th&gt;Platform Engineer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3-5 years&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;₹10-28 LPA&lt;/td&gt;
&lt;td&gt;₹20-40 LPA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;6-10 years&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;₹20-45 LPA&lt;/td&gt;
&lt;td&gt;₹35-60 LPA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lead/Principal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;₹35-65 LPA&lt;/td&gt;
&lt;td&gt;₹55-90 LPA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Platform engineering commands a 30-60% premium over generalist DevOps, according to multiple 2026 India salary reports. The premium exists because the talent pool is much smaller: you need DevOps foundations plus product thinking plus software engineering depth.&lt;/p&gt;

&lt;p&gt;In North America, platform engineers average about $160,000 USD, compared with DevOps roles that typically plateau around $140,000. Not life-changing, but meaningful.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Skills Gap: What You Need to Learn
&lt;/h2&gt;

&lt;p&gt;If you're a DevOps engineer today, you already have most of the foundations. Here's what's missing:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Product Thinking
&lt;/h3&gt;

&lt;p&gt;This is the biggest mindset shift. You're no longer building pipelines; you're building a product with users, feedback loops, and adoption metrics. That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understanding developer pain points (user research)&lt;/li&gt;
&lt;li&gt;Prioritizing features based on impact (product management)&lt;/li&gt;
&lt;li&gt;Measuring adoption, not just uptime (analytics)&lt;/li&gt;
&lt;li&gt;Iterating based on feedback (continuous improvement)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The #1 reason platform initiatives fail? Teams build technically excellent platforms that nobody uses. Voluntary adoption is the real metric.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. API Design and Software Engineering
&lt;/h3&gt;

&lt;p&gt;DevOps scripting (Bash, YAML, a bit of Python) doesn't cut it anymore. Platform engineers need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API design&lt;/strong&gt; - Your platform is consumed through APIs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Go or Rust&lt;/strong&gt; - Most CNCF platform tooling is written in Go&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tenancy patterns&lt;/strong&gt; - Your platform serves multiple teams with different needs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Software engineering practices&lt;/strong&gt; - Testing, versioning, deprecation strategies
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example: A Golden Path template for a new microservice&lt;/span&gt;
&lt;span class="c1"&gt;# This is what platform engineers build - opinionated defaults&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backstage.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Template&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;microservice-template&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Standard Microservice&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Spin up a new Go microservice with CI/CD, monitoring, and security baked in&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service Details&lt;/span&gt;
      &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;name&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;team&lt;/span&gt;
      &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service Name&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
        &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Owning Team&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
          &lt;span class="na"&gt;enum&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;payments&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;auth&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;core&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;platform&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scaffold&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Generate Service&lt;/span&gt;
      &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fetch:template&lt;/span&gt;
      &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./templates/go-microservice&lt;/span&gt;
        &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ parameters.name }}&lt;/span&gt;
          &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ parameters.team }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a simplified Backstage software template - one of the most common patterns in platform engineering. Developers fill in a few fields, and the platform generates a production-ready service with CI/CD, observability, and security pre-configured.&lt;/p&gt;

&lt;p&gt;You can achieve the same with &lt;a href="https://docs.devtron.ai/docs/user-guide/app-management/application-template" rel="noopener noreferrer"&gt;Devtron's Application Templates&lt;/a&gt; - capture CI/CD workflows, build configs, deployment templates, and environment overrides from an existing app, then reuse them to spin up new microservices in minutes instead of hours.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Developer Experience (DevEx)
&lt;/h3&gt;

&lt;p&gt;You need to care about how developers &lt;em&gt;feel&lt;/em&gt; using your platform. This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time to first deploy (how fast can a new dev ship?)&lt;/li&gt;
&lt;li&gt;Self-service capabilities (can they do it without filing a ticket?)&lt;/li&gt;
&lt;li&gt;Documentation quality (can they figure it out without asking you?)&lt;/li&gt;
&lt;li&gt;Error messages (are they helpful or cryptic?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The State of Platform Engineering Report recommends tracking DORA metrics (deployment frequency, lead time, change failure rate, MTTR) alongside SPACE metrics (developer productivity) and time-to-onboarding.&lt;/p&gt;
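
&lt;p&gt;To make the four DORA metrics concrete, here is a minimal sketch that computes them from a list of deployment records. The record fields (&lt;code&gt;committed_at&lt;/code&gt;, &lt;code&gt;deployed_at&lt;/code&gt;, &lt;code&gt;failed&lt;/code&gt;, &lt;code&gt;restored_at&lt;/code&gt;) and the sample data are hypothetical, and it uses simple means where a real dashboard might prefer medians.&lt;/p&gt;

```python
from datetime import datetime, timedelta
from statistics import mean


def hours_between(start, end):
    return (end - start).total_seconds() / 3600


def dora_metrics(deploys, period_days):
    """Compute the four DORA metrics from deployment records covering period_days."""
    if not deploys:
        raise ValueError("need at least one deployment record")
    failures = [d for d in deploys if d["failed"]]
    lead_hours = [hours_between(d["committed_at"], d["deployed_at"]) for d in deploys]
    restore_hours = [
        hours_between(d["deployed_at"], d["restored_at"])
        for d in failures
        if d["restored_at"] is not None
    ]
    return {
        "deploys_per_day": len(deploys) / period_days,
        "mean_lead_time_hours": mean(lead_hours),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_restore_hours": mean(restore_hours) if restore_hours else None,
    }


def record(day, lead_hours, failed=False, restore_hours=None):
    """Build one hypothetical deployment record committed at 09:00 on the given day."""
    committed = datetime(2026, 4, day, 9, 0)
    deployed = committed + timedelta(hours=lead_hours)
    restored = deployed + timedelta(hours=restore_hours) if failed else None
    return {"committed_at": committed, "deployed_at": deployed, "failed": failed, "restored_at": restored}


# One hypothetical week: four deploys, one of which failed and was restored in an hour.
deploys = [record(20, 2), record(21, 4, failed=True, restore_hours=1), record(22, 6), record(23, 4)]

metrics = dora_metrics(deploys, period_days=7)
for name, value in metrics.items():
    print(name, value)
```

&lt;p&gt;The arithmetic is trivial. The real platform work is capturing these events consistently for every team.&lt;/p&gt;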

&lt;h3&gt;
  
  
  4. AI Literacy
&lt;/h3&gt;

&lt;p&gt;This isn't optional anymore. 92% of CIOs plan AI integrations into their platforms. The recommendation is to reserve 20% of your time for AI skill development:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using AI tools for platform operations (K8sGPT, AI-assisted troubleshooting)&lt;/li&gt;
&lt;li&gt;Building AI-powered capabilities into your platform (intelligent autoscaling, anomaly detection)&lt;/li&gt;
&lt;li&gt;Understanding how AI-generated code flows through your CI/CD&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By 2028, platforms without AI capabilities will be considered outdated.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Actually Make the Transition
&lt;/h2&gt;

&lt;p&gt;Here's a practical roadmap, assuming you have 3+ years of DevOps experience:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 1-2: Build Product Thinking&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read "Team Topologies" by Matthew Skelton and Manuel Pais&lt;/li&gt;
&lt;li&gt;Start treating your current internal tools as products - add documentation, gather feedback, track usage&lt;/li&gt;
&lt;li&gt;Learn about Backstage (CNCF project, 89% market share for IDPs)&lt;/li&gt;
&lt;li&gt;Explore Devtron - an AI-native Kubernetes management platform to see how real IDPs work in practice&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Month 3-4: Level Up Software Engineering&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pick up Go if you haven't already - most platform tooling is Go-based&lt;/li&gt;
&lt;li&gt;Build a small internal tool with proper API design, tests, and documentation&lt;/li&gt;
&lt;li&gt;Contribute to an open-source platform tool (Backstage or Crossplane, for example)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Month 5-6: Get Hands-On with IDPs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploy Backstage locally or in a sandbox cluster&lt;/li&gt;
&lt;li&gt;Build a software template for your team's most common workflow&lt;/li&gt;
&lt;li&gt;Add golden paths for your existing infrastructure patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ongoing: Develop AI Competency&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Experiment with K8sGPT for cluster troubleshooting&lt;/li&gt;
&lt;li&gt;Explore AI-assisted CI/CD (GitHub Copilot in Actions, AI-powered code review)&lt;/li&gt;
&lt;li&gt;Stay current with AI SRE tools (autonomous incident response is coming fast)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Six Specialized Roles Within Platform Engineering
&lt;/h2&gt;

&lt;p&gt;As the field matures, "platform engineer" is splitting into distinct specializations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Head of Platform Engineering (HOPE)&lt;/strong&gt; - Strategic direction, cross-functional coordination&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Platform Product Manager (PPM)&lt;/strong&gt; - Bridges technical teams and organizational needs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure Platform Engineer (IPE)&lt;/strong&gt; - Underlying infra (servers, networks, databases)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DevEx Platform Engineer (DPE)&lt;/strong&gt; - Developer workflows, friction reduction, tool UX&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Platform Engineer (SPE)&lt;/strong&gt; - Security embedded into pipelines, policy-as-code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability Platform Engineer (RPE)&lt;/strong&gt; - Evolution of SRE, monitoring/observability plane&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You don't need to pick one immediately. Most platform engineers touch multiple areas, especially in smaller teams. But knowing these exist helps you see where your career can go.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'm Seeing From the Inside
&lt;/h2&gt;

&lt;p&gt;Working at Devtron, a company that literally builds a Kubernetes application platform, I get a front-row seat to this transition. Here's what I see daily:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Teams that adopted platform thinking&lt;/strong&gt; are shipping faster with fewer incidents. They're not firefighting as much because the platform catches common mistakes before they reach production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Teams that didn't&lt;/strong&gt; are drowning in tickets. Every new microservice means another pipeline to build, another set of alerts to configure, another on-call rotation to manage. It doesn't scale.&lt;/p&gt;

&lt;p&gt;The companies that get this right treat their platform as a product with a dedicated team, clear ownership, and actual user research. The ones that get it wrong rebrand their DevOps team as "Platform Engineering" and change nothing about how they work.&lt;/p&gt;

&lt;p&gt;Don't be the second one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Take
&lt;/h2&gt;

&lt;p&gt;Platform engineering isn't replacing DevOps. It's DevOps growing up. The philosophy of collaboration, automation, and shared responsibility stays. What changes is the &lt;em&gt;mechanism&lt;/em&gt;: good practice stops depending on culture alone and gets built into the platform.&lt;/p&gt;

&lt;p&gt;Should you make the shift? If you enjoy building tools more than operating infrastructure, if you care about developer experience, and if you want to work on leverage (building something once that serves hundreds of developers), then yes.&lt;/p&gt;

&lt;p&gt;The timing is right. Mid-level engineers with 3-5 years of experience are entering platform roles in growing numbers. You don't need to be a senior architect anymore. The field is democratizing, the salaries are competitive, and the demand is only going up.&lt;/p&gt;

&lt;p&gt;Start by building one thing that removes friction for your team. Treat it like a product. See what happens.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Further Reading:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://teamtopologies.com/" rel="noopener noreferrer"&gt;Team Topologies&lt;/a&gt; - the org design book behind platform thinking&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://backstage.io/" rel="noopener noreferrer"&gt;Backstage.io&lt;/a&gt; - CNCF project for building developer portals&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://devtron.ai/" rel="noopener noreferrer"&gt;Devtron&lt;/a&gt; - AI-Native Kubernetes Management Platform&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://platformengineering.org/" rel="noopener noreferrer"&gt;Platform Engineering community&lt;/a&gt; - reports, articles, and the annual State of Platform Engineering survey&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dora.dev/" rel="noopener noreferrer"&gt;DORA metrics&lt;/a&gt; - the standard for measuring software delivery performance&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>platformengineering</category>
      <category>career</category>
    </item>
  </channel>
</rss>
