<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pawan Kumar</title>
    <description>The latest articles on DEV Community by Pawan Kumar (@the-persistent-engineer).</description>
    <link>https://dev.to/the-persistent-engineer</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3903751%2Fa8128dd1-30a5-4a8b-a5aa-acc051e7828e.png</url>
      <title>DEV Community: Pawan Kumar</title>
      <link>https://dev.to/the-persistent-engineer</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/the-persistent-engineer"/>
    <language>en</language>
    <item>
      <title>The Request Is the Wrong Unit of Scale for LLMs on Kubernetes</title>
      <dc:creator>Pawan Kumar</dc:creator>
      <pubDate>Thu, 21 May 2026 03:32:01 +0000</pubDate>
      <link>https://dev.to/the-persistent-engineer/the-request-is-the-wrong-unit-of-scale-for-llms-on-kubernetes-3j9c</link>
      <guid>https://dev.to/the-persistent-engineer/the-request-is-the-wrong-unit-of-scale-for-llms-on-kubernetes-3j9c</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Series links&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.dheeth.blog/llm-serving-is-not-normal-web-serving/" rel="noopener noreferrer"&gt;Part 1: Everything You Know About Scaling Web Apps Breaks When You Serve an LLM&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Your dashboard says traffic is flat. Requests per second barely moved. CPU looks fine. Memory looks normal. The HPA is calm. Then latency starts drifting. Time to first token gets worse. GPU memory pressure rises. Queues grow. Users complain that the model is "thinking forever."&lt;/p&gt;

&lt;p&gt;Part 1 introduced why LLM serving breaks the normal web-scaling model. Part 2 zooms into one reason: the HTTP request is only the envelope. The real work is token processing.&lt;/p&gt;

&lt;p&gt;For a normal web app, a request is often a useful approximation of work. One request hits an API, does some bounded work, maybe talks to a database, returns JSON, and ends. LLMs do not behave like that. One request may contain a 20-token question and produce a 50-token answer. Another may contain a long system prompt, full chat history, retrieved documents, tool output, metadata, and a user asking for a 4,000-token report.&lt;/p&gt;

&lt;p&gt;Both are one HTTP request.&lt;/p&gt;

&lt;p&gt;They are not the same workload.&lt;/p&gt;

&lt;p&gt;Kubernetes may see one request. Your ingress may see one request. Your API gateway may see one request. But the GPU sees tokens: prefill work, decode work, KV cache growth, memory pressure, queueing, and time spent generating output one token at a time.&lt;/p&gt;

&lt;p&gt;Tokens are the work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why request count worked for web apps
&lt;/h2&gt;

&lt;p&gt;Most platform teams grew up around request-based thinking. We look at requests per second, p95 latency, p99 latency, error rate, CPU usage, memory usage, queue depth, pod count, and replica count. That model works reasonably well for many web services because requests are often similar enough for capacity planning.&lt;/p&gt;

&lt;p&gt;Not always, of course. A login request is not the same as an export request. A cached read is not the same as a database-heavy query. Every experienced SRE has seen one "simple" endpoint melt something important. But request count still gives a useful first approximation in many normal systems.&lt;/p&gt;

&lt;p&gt;With LLM serving, that approximation breaks faster. A request does not tell you how long the prompt is, how many retrieved documents were added, how much chat history was included, how many output tokens the model generated, how much KV cache was needed, or how long the request occupied the GPU.&lt;/p&gt;

&lt;p&gt;This is why a Kubernetes deployment can look stable at the HTTP layer while the model server is under real pressure. The API did not necessarily get more traffic.&lt;/p&gt;

&lt;p&gt;The traffic got heavier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Input tokens and output tokens are different problems
&lt;/h2&gt;

&lt;p&gt;When people first hear "tokens are the unit of work," they often treat all tokens as one bucket. That is a good starting point, but it is not enough.&lt;/p&gt;

&lt;p&gt;For serving, input tokens and output tokens stress the system differently.&lt;/p&gt;

&lt;p&gt;At a high level, LLM inference has two phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Prefill&lt;/li&gt;
&lt;li&gt;Decode&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The prefill phase processes the input prompt. This includes the system prompt, developer instructions, chat history, retrieved documents, tool results, the user's message, and whatever formatting your application adds before the request reaches the model. The decode phase generates the response one token at a time. The model predicts a token, appends it to the sequence, uses that updated sequence to predict the next token, and keeps going until it hits a stop condition or a token limit.&lt;/p&gt;

&lt;p&gt;A practical way to remember it:&lt;/p&gt;

&lt;p&gt;Input tokens decide how heavy it is to start answering.&lt;/p&gt;

&lt;p&gt;Output tokens decide how long the model stays busy.&lt;/p&gt;

&lt;p&gt;Long input usually hurts time to first token because the model has to process the prompt before it can begin generating. Long output usually hurts total latency and capacity because the model stays in the generation loop longer. Streaming can make this feel better to the user, but streaming does not remove the backend work. It just lets the user watch the work happen.&lt;/p&gt;

&lt;p&gt;This is why serious LLM serving metrics talk about time to first token, time per output token, inter-token latency, and tokens per second. &lt;a href="https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html" rel="noopener noreferrer"&gt;NVIDIA's LLM benchmarking docs&lt;/a&gt; describe TTFT as the time before the first output token appears, and note that longer prompts generally increase TTFT because the input sequence has to be processed and the KV cache has to be created. &lt;a href="https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices" rel="noopener noreferrer"&gt;Databricks' inference performance guidance&lt;/a&gt; also separates TTFT, TPOT, latency, and throughput instead of treating latency as one simple number.&lt;/p&gt;

&lt;p&gt;A normal API request is one operation.&lt;/p&gt;

&lt;p&gt;An LLM request is a sequence of token work.&lt;/p&gt;

&lt;h2&gt;
  
  
  One request can hide a huge amount of prompt assembly
&lt;/h2&gt;

&lt;p&gt;The user does not usually send the real prompt.&lt;/p&gt;

&lt;p&gt;The application builds it.&lt;/p&gt;

&lt;p&gt;A user might type:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What is our refund policy for enterprise customers?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That looks tiny.&lt;/p&gt;

&lt;p&gt;But by the time your application sends the request to the model, the prompt might include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;system prompt: 700 tokens&lt;/li&gt;
&lt;li&gt;developer instructions: 400 tokens&lt;/li&gt;
&lt;li&gt;chat history: 1,500 tokens&lt;/li&gt;
&lt;li&gt;retrieved policy documents: 6,000 tokens&lt;/li&gt;
&lt;li&gt;citations and metadata: 600 tokens&lt;/li&gt;
&lt;li&gt;user question: 12 tokens&lt;/li&gt;
&lt;li&gt;formatting instructions: 300 tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The user sent one short question. The model received roughly 9,500 input tokens (9,512 with the numbers above) before it generated a single output token. That is one of the easiest mistakes to miss in production: teams measure the user message size, not the final assembled prompt size.&lt;/p&gt;
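
&lt;p&gt;As a rough sketch of what measuring the assembled prompt could look like, the snippet below sums the example components listed above and prints each source's share. The dictionary keys are illustrative names, not fields from any particular framework.&lt;/p&gt;

```python
# Token counts per prompt component, taken from the example list above.
# The key names are hypothetical labels chosen for this sketch.
prompt_sources = {
    "system_prompt": 700,
    "developer_instructions": 400,
    "chat_history": 1500,
    "retrieved_documents": 6000,
    "citations_and_metadata": 600,
    "user_question": 12,
    "formatting_instructions": 300,
}

total_input_tokens = sum(prompt_sources.values())

# Share of the assembled prompt contributed by each source, largest first.
breakdown = sorted(
    ((name, count, count / total_input_tokens) for name, count in prompt_sources.items()),
    key=lambda row: row[1],
    reverse=True,
)

for name, count, share in breakdown:
    print(f"{name:25s} {count:6d} tokens  {share:6.1%}")
print(f"{'total':25s} {total_input_tokens:6d} tokens")
```

&lt;p&gt;In this example the retrieved documents dominate the prompt, while the text the user actually typed is a fraction of a percent of it.&lt;/p&gt;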

&lt;p&gt;RAG makes this even more interesting. Retrieval-augmented generation is often described as a quality feature. The model gets relevant context, answers with better grounding, and can cite internal documents. That is true. But RAG is also an infrastructure multiplier.&lt;/p&gt;

&lt;p&gt;Changing &lt;code&gt;top_k&lt;/code&gt; from 4 chunks to 12 chunks may look like a harmless retrieval tuning change. No Kubernetes manifest changed. No model changed. Request count did not change. The product team may even see better answers. But now every request may carry thousands of extra input tokens. That can affect time to first token, GPU memory pressure, KV cache usage, batch composition, queueing delay, maximum concurrency, tail latency, and cost per interaction.&lt;/p&gt;

&lt;p&gt;This is why prompt assembly needs observability. You do not only want to know that a request had 9,000 input tokens. You want to know where those tokens came from: chat history, retrieved documents, tool results, system instructions, verbose metadata, tenant documents, or an agent flow that appends every intermediate step.&lt;/p&gt;

&lt;p&gt;Without that breakdown, token growth stays invisible until latency tells you something is wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  Long context is a capacity decision
&lt;/h2&gt;

&lt;p&gt;Long-context models are useful. They let you analyze larger documents, keep longer conversations, handle more retrieval context, and build richer workflows. But a large context window is not a target. It is a limit.&lt;/p&gt;

&lt;p&gt;This sounds obvious, but many teams behave as if "the model supports 128k context" means "we can casually send 128k context." That is like saying a node has 1 TB of memory, so every process should try to use it.&lt;/p&gt;

&lt;p&gt;Long context changes the shape of serving. A small number of long-context requests can consume enough GPU memory and serving time to affect everyone else. A chat session can become more expensive as history grows. An agent can quietly append tool traces until each turn becomes much heavier than the first one. A summarization feature can go from "summarize this page" to "summarize this folder of documents" without the HTTP request count changing at all.&lt;/p&gt;

&lt;p&gt;The failure mode is subtle because the old dashboard may still look calm. RPS is flat, but p95 input tokens moved from 2,000 to 18,000.&lt;/p&gt;

&lt;p&gt;That is not flat traffic.&lt;/p&gt;

&lt;p&gt;A useful platform practice is to bucket prompts by size: short prompts, medium prompts, long prompts, very long prompts, and batch or offline prompts. The exact numbers depend on your model and hardware, but the habit matters more than the bucket boundaries.&lt;/p&gt;

&lt;p&gt;A 500-token chat and a 50,000-token document analysis should not be treated as the same class of work just because both entered through &lt;code&gt;/v1/chat/completions&lt;/code&gt;.&lt;/p&gt;
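
&lt;p&gt;A minimal sketch of size bucketing, assuming made-up boundaries. The thresholds and bucket names below are placeholders to tune for your own model and hardware. Batch or offline work would more likely be tagged by route than by size, so it is left out here.&lt;/p&gt;

```python
from bisect import bisect_right

# Hypothetical bucket boundaries in input tokens (assumptions, not recommendations).
BUCKET_BOUNDS = [2_000, 8_000, 32_000]
BUCKET_NAMES = ["short", "medium", "long", "very_long"]


def prompt_bucket(input_tokens):
    """Return the size bucket for a prompt with the given input token count."""
    # bisect_right counts how many boundaries the value has reached or passed,
    # which is exactly the index of the bucket it belongs to.
    return BUCKET_NAMES[bisect_right(BUCKET_BOUNDS, input_tokens)]


print(prompt_bucket(500))     # the 500-token chat
print(prompt_bucket(50_000))  # the 50,000-token document analysis
```

&lt;p&gt;Attaching the bucket name as a label on latency metrics is one way to keep small chats and large document requests from blending into a single line.&lt;/p&gt;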

&lt;p&gt;Context windows are limits, not goals.&lt;/p&gt;

&lt;h2&gt;
  
  
  Output length is not just a UX choice
&lt;/h2&gt;

&lt;p&gt;Input tokens get a lot of attention because long prompts are easy to blame. But output tokens can be just as important for capacity planning.&lt;/p&gt;

&lt;p&gt;Two requests can have the same input prompt and completely different backend cost depending on output length.&lt;/p&gt;

&lt;p&gt;Request A:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1,000 input tokens&lt;/li&gt;
&lt;li&gt;100 output tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Request B:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1,000 input tokens&lt;/li&gt;
&lt;li&gt;2,000 output tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same route. Same user flow. Same prompt size. Very different serving time.&lt;/p&gt;

&lt;p&gt;The second request keeps the model generating for much longer. If the response is streamed, the user may start seeing text quickly, which is good. But the GPU is still occupied while the model continues decoding token after token.&lt;/p&gt;

&lt;p&gt;This is why &lt;code&gt;max_tokens&lt;/code&gt; is not only a product parameter. It is a capacity control.&lt;/p&gt;

&lt;p&gt;If every request is allowed to generate 4,000 tokens, you have created a worst-case capacity problem even if most responses are shorter. If a feature asks the model to "write a detailed report," that is not the same workload as "answer this chat question." If agents are allowed to produce long reasoning traces, tool plans, summaries, and final answers, output length can grow quickly.&lt;/p&gt;

&lt;p&gt;You should track both requested maximum output tokens and actual generated output tokens. Requested max output tokens show the capacity risk your system accepted. Actual output tokens show the work the model really performed. If many requests hit the output cap, your users may be getting truncated answers. If very few requests use the available budget, your default might be too generous.&lt;/p&gt;
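
&lt;p&gt;One way to turn that into numbers is sketched below. The request records are invented for illustration, and the sketch assumes an OpenAI-style API where a response that stopped at the output cap reports &lt;code&gt;length&lt;/code&gt; as its finish reason.&lt;/p&gt;

```python
# Hypothetical per-request records:
# (requested max output tokens, actual output tokens, finish reason).
requests = [
    (4000, 180, "stop"),
    (4000, 4000, "length"),
    (4000, 350, "stop"),
    (1000, 1000, "length"),
    (4000, 90, "stop"),
]

# Fraction of requests that ran into the output cap (possible truncation).
cap_hits = sum(1 for _, _, reason in requests if reason == "length")
cap_hit_rate = cap_hits / len(requests)

# How much of the accepted output budget was actually used.
requested_total = sum(requested for requested, _, _ in requests)
actual_total = sum(actual for _, actual, _ in requests)
budget_utilization = actual_total / requested_total

print(f"cap hit rate: {cap_hit_rate:.0%}")
print(f"budget utilization: {budget_utilization:.0%}")
```

&lt;p&gt;A high cap hit rate suggests truncated answers. Low utilization alongside a low cap hit rate suggests the default budget is more generous than the route needs.&lt;/p&gt;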

&lt;p&gt;Output length is not formatting.&lt;/p&gt;

&lt;p&gt;It is how long the request rents the GPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  Same request count, completely different load
&lt;/h2&gt;

&lt;p&gt;A useful dashboard should show when the same request count hides a different workload shape. For example:&lt;/p&gt;

&lt;p&gt;Window A:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;requests: 1,000&lt;/li&gt;
&lt;li&gt;average input: 500 tokens&lt;/li&gt;
&lt;li&gt;average output: 150 tokens&lt;/li&gt;
&lt;li&gt;total token work: 650,000 tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Window B:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;requests: 1,000&lt;/li&gt;
&lt;li&gt;average input: 8,000 tokens&lt;/li&gt;
&lt;li&gt;average output: 1,000 tokens&lt;/li&gt;
&lt;li&gt;total token work: 9,000,000 tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both windows show 1,000 requests. But the second window has almost 14x the token volume. If your dashboard only shows request count, it says traffic is flat. If your dashboard shows token volume, it says the workload changed completely.&lt;/p&gt;
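
&lt;p&gt;The arithmetic behind the two windows can be checked with a few lines. This sketch simply adds input and output tokens, which is a simplification: as discussed earlier, the two kinds of token do not cost the same to serve.&lt;/p&gt;

```python
def token_volume(requests, avg_input_tokens, avg_output_tokens):
    """Total input plus output tokens for a window of traffic."""
    return requests * (avg_input_tokens + avg_output_tokens)


window_a = token_volume(1_000, 500, 150)
window_b = token_volume(1_000, 8_000, 1_000)
ratio = window_b / window_a

print(window_a, window_b, round(ratio, 1))
```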

&lt;p&gt;That is why the useful question is not only:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How many requests are we serving?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is also:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How many input tokens are arriving, how many output tokens are being generated, and where are those tokens coming from?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What Kubernetes sees and what the model server feels
&lt;/h2&gt;

&lt;p&gt;Kubernetes is very good at managing containers. It can schedule pods, restart failed workloads, apply resource requests and limits, spread replicas, roll out deployments, and attach workloads to GPU nodes. But Kubernetes does not automatically understand the shape of an LLM request. A pod can be healthy while the model server is struggling. CPU can look uninteresting while GPU memory is the real limit. Generic memory can look fine while KV cache is under pressure. Request count can look flat while token volume has exploded.&lt;/p&gt;

&lt;p&gt;This is where the division of responsibility matters. Kubernetes gives you the orchestration layer. The model server gives you the LLM execution layer. The application builds the prompt. The platform team has to connect the signals.&lt;/p&gt;

&lt;p&gt;If those layers do not share the right metrics, you end up scaling the wrong thing. For example, a CPU-based HPA may be useful around some parts of the stack, but it is not enough to understand LLM serving capacity. A model server may expose more relevant metrics such as prompt tokens, generation tokens, time to first token, time per output token, queue time, number of running requests, number of waiting requests, and KV cache usage.&lt;/p&gt;

&lt;p&gt;vLLM's production metrics are a good example of where the industry is moving. vLLM exposes metrics for prompt tokens, generation tokens, request prompt tokens, request generation tokens, time to first token, time per output token, request queue time, prefill time, decode time, KV cache usage, running requests, and waiting requests. That metric set tells you something important:&lt;/p&gt;

&lt;p&gt;The production surface of LLM serving is already token-aware.&lt;/p&gt;

&lt;p&gt;Your dashboard should be too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Token-based observability is not optional
&lt;/h2&gt;

&lt;p&gt;If you are running LLM workloads on Kubernetes, request count still matters. You still need API-level metrics. You still care about errors, availability, saturation, queueing, and latency. But those metrics need token context.&lt;/p&gt;

&lt;p&gt;At minimum, every request should give you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;input tokens&lt;/li&gt;
&lt;li&gt;output tokens&lt;/li&gt;
&lt;li&gt;total tokens&lt;/li&gt;
&lt;li&gt;requested max output tokens&lt;/li&gt;
&lt;li&gt;time to first token&lt;/li&gt;
&lt;li&gt;time per output token or inter-token latency&lt;/li&gt;
&lt;li&gt;end-to-end latency&lt;/li&gt;
&lt;li&gt;queue time&lt;/li&gt;
&lt;li&gt;model name&lt;/li&gt;
&lt;li&gt;model version&lt;/li&gt;
&lt;li&gt;deployment&lt;/li&gt;
&lt;li&gt;tenant or team&lt;/li&gt;
&lt;li&gt;route or feature&lt;/li&gt;
&lt;li&gt;finish reason&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If possible, also track prompt composition:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;system prompt tokens&lt;/li&gt;
&lt;li&gt;chat history tokens&lt;/li&gt;
&lt;li&gt;retrieved context tokens&lt;/li&gt;
&lt;li&gt;tool result tokens&lt;/li&gt;
&lt;li&gt;user message tokens&lt;/li&gt;
&lt;li&gt;metadata or formatting tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That breakdown is where many production surprises hide.&lt;/p&gt;

&lt;p&gt;For dashboards, averages are not enough. Average token count can look stable while the tail gets ugly. You want p50, p95, and p99 for input tokens and output tokens. You want latency by token bucket. You want TTFT by input size. You want end-to-end latency by output size. You want to know whether a tenant is sending mostly short prompts or occasionally sending giant ones.&lt;/p&gt;
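
&lt;p&gt;A small illustration of why the average hides the tail, using invented token counts and Python's standard &lt;code&gt;statistics&lt;/code&gt; module. The traffic mix is an assumption made up for this sketch.&lt;/p&gt;

```python
from statistics import mean, quantiles

# Hypothetical input-token counts: mostly short chats plus a few giant prompts.
input_tokens = [500] * 90 + [40_000] * 10

# quantiles(..., n=100) returns the 99 cut points between percentiles.
cuts = quantiles(input_tokens, n=100, method="inclusive")
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print("mean:", mean(input_tokens))
print("p50:", p50, "p95:", p95, "p99:", p99)
```

&lt;p&gt;The mean describes a request that does not exist in this sample: most prompts are far smaller than it, and the tail is far larger.&lt;/p&gt;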

&lt;p&gt;Some useful views:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;input token p50, p95, p99&lt;/li&gt;
&lt;li&gt;output token p50, p95, p99&lt;/li&gt;
&lt;li&gt;total tokens per minute by model&lt;/li&gt;
&lt;li&gt;total tokens per tenant&lt;/li&gt;
&lt;li&gt;TTFT by input token bucket&lt;/li&gt;
&lt;li&gt;TPOT by output token bucket&lt;/li&gt;
&lt;li&gt;queue time by token bucket&lt;/li&gt;
&lt;li&gt;percentage of requests near context limits&lt;/li&gt;
&lt;li&gt;percentage of requests hitting output cap&lt;/li&gt;
&lt;li&gt;retrieved context tokens per request&lt;/li&gt;
&lt;li&gt;chat history tokens per request&lt;/li&gt;
&lt;li&gt;KV cache usage over time&lt;/li&gt;
&lt;li&gt;waiting requests alongside waiting token estimates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last point is important.&lt;/p&gt;

&lt;p&gt;Do not only ask how many requests are waiting. Ask how many tokens are waiting. A queue of 20 short chat requests and a queue of 20 long document-analysis requests are not the same queue.&lt;/p&gt;
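
&lt;p&gt;A sketch of a waiting-token estimate, assuming the gateway knows each queued request's input token count and requested output cap. The queue contents are invented, and using the output cap gives an upper bound rather than a prediction.&lt;/p&gt;

```python
# Hypothetical queues: each entry is (input tokens, requested max output tokens).
chat_queue = [(400, 300)] * 20
document_queue = [(30_000, 2_000)] * 20


def waiting_tokens(queue):
    """Upper-bound estimate of the token work waiting in a queue."""
    return sum(input_tokens + max_output_tokens for input_tokens, max_output_tokens in queue)


print(len(chat_queue), "requests,", waiting_tokens(chat_queue), "tokens waiting")
print(len(document_queue), "requests,", waiting_tokens(document_queue), "tokens waiting")
```

&lt;p&gt;Both queues hold 20 requests, yet the estimated token work differs by well over an order of magnitude.&lt;/p&gt;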

&lt;h2&gt;
  
  
  Product changes become infrastructure changes
&lt;/h2&gt;

&lt;p&gt;One uncomfortable part of LLM platforms is that product changes can become infrastructure changes very quickly.&lt;/p&gt;

&lt;p&gt;In a normal web app, adding a new field to a response may not matter much. In an LLM application, adding more context to a prompt can change capacity. Increasing retrieval depth can change latency. Keeping longer chat history can change memory pressure. Allowing longer outputs can change GPU occupancy.&lt;/p&gt;

&lt;p&gt;The product team might say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We only changed the prompt.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The platform team hears:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We changed the workload.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Both are true.&lt;/p&gt;

&lt;p&gt;This does not mean product teams should be afraid of improving prompts. It means token impact should be visible before and after the change. If a new prompt improves answer quality but increases average input tokens by 3x, that may be a good tradeoff. But it should be a conscious tradeoff.&lt;/p&gt;

&lt;p&gt;If a RAG change improves accuracy but pushes p99 prompts near the context limit, that should be visible before production users discover the latency problem. If a new report-generation mode produces 10x more output tokens than chat, it probably needs a different workload class and different expectations.&lt;/p&gt;

&lt;p&gt;The platform question is not "are tokens bad?"&lt;/p&gt;

&lt;p&gt;Tokens are the product.&lt;/p&gt;

&lt;p&gt;The question is whether you know how many you are serving, where they come from, and what they do to your capacity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical rules for platform teams
&lt;/h2&gt;

&lt;p&gt;If you are starting to serve LLMs on Kubernetes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Measure input and output tokens for every request. Do not wait until the first incident to add token metrics.&lt;/li&gt;
&lt;li&gt;Track the final assembled prompt, not just the user message. The model does not care what the user typed. It cares what your application sent.&lt;/li&gt;
&lt;li&gt;Break input tokens down by source. Separate system prompt, chat history, retrieved context, tool results, and user message.&lt;/li&gt;
&lt;li&gt;Track requested max output tokens separately from actual output tokens. One tells you accepted risk. The other tells you real work.&lt;/li&gt;
&lt;li&gt;Use token buckets in latency dashboards. A p95 latency graph without token buckets mixes small chat requests and huge document requests into one misleading line.&lt;/li&gt;
&lt;li&gt;Watch p95 and p99 token counts, not just averages. The tail is where LLM serving gets painful.&lt;/li&gt;
&lt;li&gt;Put budgets on RAG retrieval. &lt;code&gt;top_k&lt;/code&gt; is not only a relevance knob. It is a capacity knob.&lt;/li&gt;
&lt;li&gt;Treat context windows as limits, not targets. Just because a model accepts long context does not mean every request should use it.&lt;/li&gt;
&lt;li&gt;Set sane output defaults. Long answers should be intentional, not the accidental default for every route.&lt;/li&gt;
&lt;li&gt;Separate workload classes when needed. Short interactive chat, long RAG, report generation, agent workflows, and batch summarization do not have the same shape.&lt;/li&gt;
&lt;li&gt;Review token growth after product changes. Prompt changes, retrieval changes, memory changes, and tool changes can all affect infrastructure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These rules are not about making the system slower or less useful. They are about making the system understandable.&lt;/p&gt;

&lt;p&gt;You cannot operate what you do not measure. And in LLM serving, measuring only requests means you are measuring the envelope while ignoring the work inside it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real unit of scale
&lt;/h2&gt;

&lt;p&gt;The request is still useful at the API boundary. You need it for authentication, rate limits, logs, tracing, errors, and user flows. But it is not the right unit for LLM capacity.&lt;/p&gt;

&lt;p&gt;It cannot tell you how much prompt the model processed, how long the model generated, how much KV cache was needed, or whether the workload was short chat, long-context RAG, report generation, or an agent loop.&lt;/p&gt;

&lt;p&gt;Tokens get you closer to the truth. Input tokens explain much of the work before the first response appears. Output tokens explain how long the model keeps generating. Token distributions explain why averages lie. Token sources explain which product behavior changed the workload. Token-aware metrics explain why your Kubernetes deployment looks healthy while users still feel latency.&lt;/p&gt;

&lt;p&gt;Part 1 was about letting go of the normal web app scaling model. Part 2 is about replacing one of its most misleading assumptions.&lt;/p&gt;

&lt;p&gt;For LLMs on Kubernetes, you are not really scaling requests.&lt;/p&gt;

&lt;p&gt;You are scaling token work across expensive, memory-constrained, latency-sensitive GPU systems.&lt;/p&gt;

&lt;p&gt;Once you see that, the rest of the platform starts to make more sense.&lt;/p&gt;

&lt;h2&gt;
  
  
  Continue the series
&lt;/h2&gt;

&lt;p&gt;I am writing this as a practical series on hosting large LLMs on Kubernetes, from GPU nodes and model servers to autoscaling, latency, cost, and production architecture. If you want the next part, subscribe to the newsletter.&lt;/p&gt;

&lt;p&gt;I am also preparing a free &lt;strong&gt;LLM Serving on Kubernetes Production Readiness Checklist&lt;/strong&gt; with the metrics, dashboard questions, and architecture review points platform teams should ask before putting an LLM workload in production. Subscribe and I will share it when it is ready.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>llm</category>
      <category>devops</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Everything You Know About Scaling Web Apps Breaks When You Serve an LLM</title>
      <dc:creator>Pawan Kumar</dc:creator>
      <pubDate>Thu, 14 May 2026 03:33:30 +0000</pubDate>
      <link>https://dev.to/the-persistent-engineer/everything-you-know-about-scaling-web-apps-breaks-when-you-serve-an-llm-2141</link>
      <guid>https://dev.to/the-persistent-engineer/everything-you-know-about-scaling-web-apps-breaks-when-you-serve-an-llm-2141</guid>
      <description>&lt;p&gt;This is Part 1 of a practical series on hosting large LLMs on Kubernetes. Most platform engineers already know how to scale a web app. Put it in a container. Deploy it on Kubernetes. Add CPU and memory requests. Put a Service or Ingress in front. Configure HPA. Watch p95 latency, error rate, CPU, memory, and request throughput. Add replicas when traffic goes up.&lt;/p&gt;

&lt;p&gt;That playbook works for a lot of services. Then you try to serve a large language model, and suddenly the old model starts cracking. A request is no longer just a request. Memory does not just mean RAM. Latency is not one number. Scaling a pod does not mean capacity appears instantly. One "replica" may need one GPU, eight GPUs, or several machines working together.&lt;/p&gt;

&lt;p&gt;And the bottleneck may not be CPU at all. The first mental shift is simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;LLM serving is not normal web serving.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The real unit of work is the token.&lt;/p&gt;

&lt;h2&gt;
  
  
  A request is no longer a request
&lt;/h2&gt;

&lt;p&gt;In a normal web app, request count is often a useful planning signal. Not perfect, obviously. Some endpoints are heavier than others. Some queries are ugly. Some users manage to find the one path that melts the database. But request count still tells you something.&lt;/p&gt;

&lt;p&gt;With LLMs, it can lie to your face.&lt;/p&gt;

&lt;p&gt;One user asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Summarize this sentence.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Another user asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Analyze this 80-page contract, compare it with these policy documents, extract the risks, and generate a detailed memo.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Both are one request. They are not the same workload.&lt;/p&gt;

&lt;p&gt;The second request may contain thousands of input tokens. It may generate thousands of output tokens. It may sit on GPU memory for longer. It may increase queueing delay for everyone behind it. It may consume far more KV cache. It may make your latency charts look haunted.&lt;/p&gt;

&lt;p&gt;So if you only measure requests per second, you are almost blind. For LLMs, you need to care about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;input tokens&lt;/li&gt;
&lt;li&gt;output tokens&lt;/li&gt;
&lt;li&gt;tokens per second&lt;/li&gt;
&lt;li&gt;time to first token&lt;/li&gt;
&lt;li&gt;time per output token&lt;/li&gt;
&lt;li&gt;queue depth&lt;/li&gt;
&lt;li&gt;batch size&lt;/li&gt;
&lt;li&gt;GPU memory&lt;/li&gt;
&lt;li&gt;KV cache usage&lt;/li&gt;
&lt;li&gt;model loading time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a very different world from normal HTTP throughput.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM inference has two phases, and they behave differently
&lt;/h2&gt;

&lt;p&gt;When a user sends a prompt to an LLM, the model does not handle the request as one uniform block of work. At a high level, inference has two phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;prefill&lt;/li&gt;
&lt;li&gt;decode&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Prefill is where the model processes the input prompt. If the prompt is long, prefill gets expensive. This is where the model reads the context and builds the internal state needed to start generating. Decode is where the model generates output tokens one at a time. This is the part users see when text starts streaming on the screen.&lt;/p&gt;

&lt;p&gt;These phases stress the system differently. Prefill tends to be compute-bound. Decode is often memory-bandwidth-bound. Prefill depends heavily on input length. Decode depends heavily on output length. Both affect latency, throughput, cost, and capacity.&lt;/p&gt;

&lt;p&gt;This distinction does not usually matter when you are scaling a normal API. You do not think of a checkout endpoint as having two GPU phases with different scheduling behavior. With LLMs, you have to.&lt;/p&gt;

&lt;p&gt;If you ignore prefill and decode, you will struggle to explain why time to first token is slow, why long prompts hurt so much, or why the GPU looks busy but users still complain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Latency is not one number anymore
&lt;/h2&gt;

&lt;p&gt;For web services, we usually treat latency as a single quantity and look at it a few ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;p50 latency&lt;/li&gt;
&lt;li&gt;p95 latency&lt;/li&gt;
&lt;li&gt;p99 latency&lt;/li&gt;
&lt;li&gt;request duration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For LLMs, that is not enough. Two latency numbers matter a lot.&lt;/p&gt;

&lt;h3&gt;
  
  
  Time to first token
&lt;/h3&gt;

&lt;p&gt;Time to first token, or TTFT, is how long the user waits before the model starts responding. This controls the feeling of responsiveness.&lt;/p&gt;

&lt;p&gt;If nothing appears for five seconds, the product feels slow. It does not matter that the final answer is useful. The user has already started wondering if the system is stuck.&lt;/p&gt;

&lt;p&gt;TTFT is affected by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;queueing delay&lt;/li&gt;
&lt;li&gt;prompt length&lt;/li&gt;
&lt;li&gt;prefill time&lt;/li&gt;
&lt;li&gt;model routing&lt;/li&gt;
&lt;li&gt;batch scheduling&lt;/li&gt;
&lt;li&gt;GPU availability&lt;/li&gt;
&lt;li&gt;cold starts&lt;/li&gt;
&lt;li&gt;cache behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Users feel TTFT sharply because silence feels broken.&lt;/p&gt;

&lt;h3&gt;
  
  
  Time per output token
&lt;/h3&gt;

&lt;p&gt;Time per output token, or TPOT, measures how fast the model generates each token after generation starts. This controls the streaming experience.&lt;/p&gt;

&lt;p&gt;Good TTFT with bad TPOT feels like the model wakes up quickly and then crawls. Good TPOT makes the answer feel alive, even if the full response takes time.&lt;/p&gt;

&lt;p&gt;TPOT is affected by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;decode efficiency&lt;/li&gt;
&lt;li&gt;GPU memory bandwidth&lt;/li&gt;
&lt;li&gt;batch size&lt;/li&gt;
&lt;li&gt;KV cache pressure&lt;/li&gt;
&lt;li&gt;model size&lt;/li&gt;
&lt;li&gt;quantization&lt;/li&gt;
&lt;li&gt;serving engine&lt;/li&gt;
&lt;li&gt;hardware type&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Normal web systems rarely force you to separate "time until the response starts" from "speed at which the rest of the response streams." LLM serving does.&lt;/p&gt;
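
&lt;p&gt;A common back-of-the-envelope way to combine the two numbers is to treat end-to-end latency as TTFT plus TPOT for every token after the first. The sketch below uses that approximation with made-up values. Real TPOT varies with batch size and load, so treat the result as an estimate.&lt;/p&gt;

```python
def estimated_latency_s(ttft_s, tpot_s, output_tokens):
    """Approximate end-to-end latency: wait for the first token, then stream the rest."""
    remaining_tokens = max(output_tokens - 1, 0)
    return ttft_s + tpot_s * remaining_tokens


# Hypothetical numbers: 0.5 s TTFT and 40 ms per output token.
short_answer = estimated_latency_s(0.5, 0.04, 100)
long_answer = estimated_latency_s(0.5, 0.04, 2_000)

print(f"100 output tokens:  {short_answer:.2f} s")
print(f"2000 output tokens: {long_answer:.2f} s")
```

&lt;p&gt;With the same TTFT, the longer answer keeps the request alive far longer, which is why the two numbers need to be tracked separately.&lt;/p&gt;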

&lt;h2&gt;
  
  
  Memory means GPU memory now
&lt;/h2&gt;

&lt;p&gt;In a web app, memory usually means heap, runtime overhead, in-process cache, or connection pools. In LLM serving, memory often means GPU memory. And GPU memory is painful because it is limited, expensive, and easy to waste.&lt;/p&gt;

&lt;p&gt;You need GPU memory for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;model weights&lt;/li&gt;
&lt;li&gt;KV cache&lt;/li&gt;
&lt;li&gt;runtime buffers&lt;/li&gt;
&lt;li&gt;activations&lt;/li&gt;
&lt;li&gt;batching overhead&lt;/li&gt;
&lt;li&gt;framework overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Model weights are the obvious part. FP16 and BF16 store 2 bytes per parameter, so a 7-billion-parameter model needs roughly 14 GB just for weights. A 70-billion-parameter model needs roughly 140 GB at the same precision. That already means one GPU may not be enough.&lt;/p&gt;
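
&lt;p&gt;The weight arithmetic is just parameters times bytes per parameter. A rough sketch, assuming 1 GB means 10^9 bytes and ignoring any per-format overhead that quantized checkpoints add in practice:&lt;/p&gt;

```python
# Bytes per parameter for common precisions (int4 packs two parameters per byte).
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(num_params, precision):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for name, params in [("7B", 7e9), ("70B", 70e9)]:
    for precision in ("fp16", "int4"):
        print(name, precision, weight_memory_gb(params, precision), "GB")
```

&lt;p&gt;This counts weights only. Everything else in the list above still has to fit next to them.&lt;/p&gt;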

&lt;p&gt;But weights are only the obvious cost. The hidden cost is KV cache.&lt;/p&gt;

&lt;p&gt;KV cache stores the key and value tensors from previous tokens so the model does not recompute everything from scratch during generation. The longer the context and the more concurrent users you serve, the more KV cache you need.&lt;/p&gt;

&lt;p&gt;This is why long context is not just a product feature. It is an infra bill. Every extra token you allow into the context window can come back as GPU memory pressure. Maximum context length is not only a model capability. It is a capacity planning decision.&lt;/p&gt;
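
&lt;p&gt;A back-of-the-envelope estimate shows how quickly this adds up. The sketch below assumes a standard transformer that stores one key and one value vector per layer per KV head for every token. The layer count, head count, and head size are illustrative values, not the specs of a particular model, and the result is a worst case where every sequence fills its whole context.&lt;/p&gt;

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_cache_gib(context_tokens, concurrent_sequences, **shape):
    """Worst-case KV cache size in GiB if every sequence fills its context."""
    total = kv_cache_bytes_per_token(**shape) * context_tokens * concurrent_sequences
    return total / 2**30


# Illustrative shape only: 32 layers, 8 KV heads of size 128, 2-byte elements.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)
print(kv_cache_bytes_per_token(**shape), "bytes per token")
print(kv_cache_gib(8192, 16, **shape), "GiB for 16 sequences at 8192 tokens")
print(kv_cache_gib(32768, 16, **shape), "GiB for 16 sequences at 32768 tokens")
```

&lt;p&gt;With these assumed numbers, quadrupling the allowed context quadruples the worst-case cache for the same number of users. That is the capacity planning decision hiding behind a context-length setting.&lt;/p&gt;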

&lt;h2&gt;
  
  
  Replicas are not always replicas
&lt;/h2&gt;

&lt;p&gt;In a normal web app, one replica usually means one pod running one copy of the application. Traffic goes up, add pods. With LLMs, the word "replica" can hide a lot.&lt;/p&gt;

&lt;p&gt;A small model may run inside one pod on one GPU.&lt;/p&gt;

&lt;p&gt;A larger model may need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;multiple GPUs in one node&lt;/li&gt;
&lt;li&gt;multiple pods on one node&lt;/li&gt;
&lt;li&gt;multiple nodes&lt;/li&gt;
&lt;li&gt;tensor parallelism&lt;/li&gt;
&lt;li&gt;pipeline parallelism&lt;/li&gt;
&lt;li&gt;a Ray cluster&lt;/li&gt;
&lt;li&gt;a leader-worker setup&lt;/li&gt;
&lt;li&gt;a group of pods that must start together&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So when someone says, "scale the model to 10 replicas," the first question should be: what is one replica?&lt;/p&gt;

&lt;p&gt;Is it one pod? One GPU? One tensor parallel group? One multi-node deployment? One endpoint backed by several workers? One prefill group plus one decode group?&lt;/p&gt;

&lt;p&gt;This is where Kubernetes abstractions get interesting. A Deployment works nicely for simple stateless services. Serious LLM serving may need Ray, KServe, LeaderWorkerSet, Kueue, Volcano, or custom orchestration.&lt;/p&gt;

&lt;p&gt;The model may not fit into the old "one pod equals one replica" picture.&lt;/p&gt;
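
&lt;p&gt;One way to answer "what is one replica?" is to write the shape down and multiply. This is a sketch with a hypothetical &lt;code&gt;ReplicaShape&lt;/code&gt; class. It assumes a replica needs tensor-parallel degree times pipeline-parallel degree GPUs and that replicas do not share nodes.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class ReplicaShape:
    tensor_parallel: int     # GPUs that split each layer
    pipeline_parallel: int   # stages that split the layer stack
    gpus_per_node: int

    def gpus(self):
        return self.tensor_parallel * self.pipeline_parallel

    def nodes(self):
        # Ceiling division: a partially used node is still a node.
        return -(-self.gpus() // self.gpus_per_node)


# Hypothetical large model: 8-way tensor parallel, 2 pipeline stages, 8-GPU nodes.
shape = ReplicaShape(tensor_parallel=8, pipeline_parallel=2, gpus_per_node=8)
replicas = 10
print("GPUs per replica:", shape.gpus())
print("GPUs for 10 replicas:", replicas * shape.gpus())
print("Nodes for 10 replicas:", replicas * shape.nodes())
```

&lt;p&gt;Under these assumptions, "10 replicas" is a request for 160 GPUs across 20 nodes, which is a very different conversation from adding 10 pods.&lt;/p&gt;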

&lt;h2&gt;
  
  
  Scaling a pod does not mean capacity is ready
&lt;/h2&gt;

&lt;p&gt;In a normal web app, a new pod can become useful quickly. The image is pulled. The process starts. Readiness passes. Traffic flows. For LLMs, a new pod may sit there for a while before it can handle real traffic.&lt;/p&gt;

&lt;p&gt;It may need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pull a large container image&lt;/li&gt;
&lt;li&gt;download model weights&lt;/li&gt;
&lt;li&gt;load hundreds of GBs from object storage or disk&lt;/li&gt;
&lt;li&gt;initialize CUDA&lt;/li&gt;
&lt;li&gt;allocate GPU memory&lt;/li&gt;
&lt;li&gt;build or load optimized engines&lt;/li&gt;
&lt;li&gt;warm up the model&lt;/li&gt;
&lt;li&gt;join a distributed serving group&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This can take minutes. Sometimes longer. So autoscaling is not just about deciding when to add replicas. It is about adding capacity early enough that it is ready before users feel the pain.&lt;/p&gt;

&lt;p&gt;That is much harder than scaling a normal web app. This is why LLM platforms often use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;minimum warm replicas&lt;/li&gt;
&lt;li&gt;preloaded models&lt;/li&gt;
&lt;li&gt;local NVMe model cache&lt;/li&gt;
&lt;li&gt;warm pools&lt;/li&gt;
&lt;li&gt;separate GPU node pools&lt;/li&gt;
&lt;li&gt;predictive scaling&lt;/li&gt;
&lt;li&gt;queue-based scaling&lt;/li&gt;
&lt;li&gt;scheduled capacity for known peaks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scale to zero sounds great until the first user waits for a giant model to load.&lt;/p&gt;
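
&lt;p&gt;A rough estimate shows where the minutes go. The sketch below models warm-up as image pull plus weight load plus a fixed initialization cost. All sizes and throughput figures are hypothetical, so measure your own before trusting any of them.&lt;/p&gt;

```python
def warmup_seconds(image_gb, weights_gb, pull_gb_per_s, load_gb_per_s, init_s):
    """Rough time before a new replica can serve: image pull + weight load + fixed init."""
    return image_gb / pull_gb_per_s + weights_gb / load_gb_per_s + init_s


# Hypothetical numbers: 20 GB image pulled at 0.5 GB/s, 140 GB of weights,
# 60 s for CUDA init and warm-up. Only the weight source changes between runs.
from_object_storage = warmup_seconds(20, 140, 0.5, 1.0, 60)
from_local_nvme = warmup_seconds(20, 140, 0.5, 5.0, 60)
print(f"object storage: {from_object_storage:.0f} s")
print(f"local NVMe cache: {from_local_nvme:.0f} s")
```

&lt;p&gt;Even the faster path in this example is measured in minutes, so the autoscaler has to act that far ahead of the demand it is reacting to.&lt;/p&gt;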

&lt;h2&gt;
  
  
  CPU autoscaling becomes a weak signal
&lt;/h2&gt;

&lt;p&gt;CPU utilization is a decent signal for many Kubernetes workloads. Not perfect. But decent. For LLM serving, CPU can be almost irrelevant.&lt;/p&gt;

&lt;p&gt;The expensive work happens on GPUs. More specifically, the bottleneck may be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPU memory&lt;/li&gt;
&lt;li&gt;GPU memory bandwidth&lt;/li&gt;
&lt;li&gt;KV cache capacity&lt;/li&gt;
&lt;li&gt;decode throughput&lt;/li&gt;
&lt;li&gt;queue depth&lt;/li&gt;
&lt;li&gt;batch saturation&lt;/li&gt;
&lt;li&gt;inter-GPU communication&lt;/li&gt;
&lt;li&gt;model server scheduling&lt;/li&gt;
&lt;li&gt;request length distribution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A model server can have low CPU usage and still be overloaded. It can have high GPU utilization and still deliver terrible latency. It can have enough compute but not enough KV cache capacity. It can be stuck serving long prompts while short prompts wait behind them.&lt;/p&gt;

&lt;p&gt;So if you autoscale only on CPU, the platform may make the wrong decision at the worst possible time.&lt;/p&gt;

&lt;p&gt;Better signals include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;queue depth&lt;/li&gt;
&lt;li&gt;waiting requests&lt;/li&gt;
&lt;li&gt;ongoing requests per replica&lt;/li&gt;
&lt;li&gt;batch size&lt;/li&gt;
&lt;li&gt;TTFT&lt;/li&gt;
&lt;li&gt;TPOT&lt;/li&gt;
&lt;li&gt;tokens per second&lt;/li&gt;
&lt;li&gt;KV cache usage&lt;/li&gt;
&lt;li&gt;GPU memory pressure&lt;/li&gt;
&lt;li&gt;SLO burn rate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GPU utilization still matters. It just cannot be the only signal. LLM autoscaling has to understand the workload.&lt;/p&gt;
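
&lt;p&gt;Here is a small sketch of why the choice of signal matters. It applies the proportional rule from the Kubernetes HPA documentation, ceil(current replicas * current value / target value), to several metrics and takes the largest result, which is also how the HPA combines multiple metrics. The metric names, values, and targets are invented, and real HPA behavior such as tolerance and stabilization windows is left out.&lt;/p&gt;

```python
import math


def desired_replicas(current_replicas, current_value, target_value):
    """Proportional scaling rule: ceil(current replicas * value / target)."""
    return math.ceil(current_replicas * current_value / target_value)


def scale_decision(current_replicas, metrics):
    """metrics maps a name to (current per-replica value, target); largest proposal wins."""
    proposals = {
        name: desired_replicas(current_replicas, value, target)
        for name, (value, target) in metrics.items()
    }
    return max(proposals.values()), proposals


# Hypothetical snapshot of an overloaded model server with a quiet CPU.
replicas, per_metric = scale_decision(
    current_replicas=4,
    metrics={
        "cpu_utilization": (0.20, 0.60),
        "waiting_requests": (12, 4),
        "kv_cache_usage": (0.90, 0.70),
    },
)
print("desired replicas:", replicas)
print("per metric:", per_metric)
```

&lt;p&gt;In this made-up snapshot, CPU alone would shrink the deployment to 2 replicas while the queue is asking for 12. Same formula, opposite decision.&lt;/p&gt;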

&lt;h2&gt;
  
  
  Round robin load balancing gets weird
&lt;/h2&gt;

&lt;p&gt;For a normal web app, round robin load balancing is often fine. Request 1 goes to pod A. Request 2 goes to pod B. Request 3 goes to pod C. For LLMs, this can be wasteful.&lt;/p&gt;

&lt;p&gt;A short prompt and a long prompt have completely different costs. A request with a cached prefix may be cheaper if it lands on the right worker. A long generation may occupy capacity much longer than the load balancer expects. One tenant may need lower latency than another. One model may need different hardware from another.&lt;/p&gt;

&lt;p&gt;Naive load balancing can create strange failures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one worker gets long prompts and slows down&lt;/li&gt;
&lt;li&gt;another worker stays underused&lt;/li&gt;
&lt;li&gt;KV cache locality is lost&lt;/li&gt;
&lt;li&gt;prefix caching becomes less useful&lt;/li&gt;
&lt;li&gt;tail latency gets worse&lt;/li&gt;
&lt;li&gt;GPU utilization looks fine while users are unhappy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLM serving needs smarter routing.&lt;/p&gt;

&lt;p&gt;Good routing may consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;model name&lt;/li&gt;
&lt;li&gt;prompt length&lt;/li&gt;
&lt;li&gt;estimated output length&lt;/li&gt;
&lt;li&gt;tenant priority&lt;/li&gt;
&lt;li&gt;cache locality&lt;/li&gt;
&lt;li&gt;GPU availability&lt;/li&gt;
&lt;li&gt;queue depth&lt;/li&gt;
&lt;li&gt;hardware type&lt;/li&gt;
&lt;li&gt;region&lt;/li&gt;
&lt;li&gt;latency SLO&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is why inference gateways, model-aware routing, and cache-aware scheduling matter.&lt;/p&gt;
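
&lt;p&gt;As a toy illustration of cache-aware routing, the sketch below picks the worker with the least estimated work: tokens already queued plus the part of the prompt that worker would still have to prefill. The &lt;code&gt;Worker&lt;/code&gt; class, the cost formula, and the numbers are all assumptions made for this example. Real inference gateways use richer signals.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    queued_tokens: int                               # tokens waiting or in flight
    cached_prefixes: set = field(default_factory=set)


def estimated_cost(worker, prompt_tokens, prefix_id, prefix_tokens):
    """Queued work plus the prompt tokens this worker cannot serve from its prefix cache."""
    cached = prefix_tokens if prefix_id in worker.cached_prefixes else 0
    return worker.queued_tokens + prompt_tokens - cached


def pick_worker(workers, prompt_tokens, prefix_id, prefix_tokens):
    return min(
        workers,
        key=lambda w: estimated_cost(w, prompt_tokens, prefix_id, prefix_tokens),
    )


workers = [
    Worker("a", queued_tokens=3000),
    Worker("b", queued_tokens=5000, cached_prefixes={"system-prompt-v2"}),
    Worker("c", queued_tokens=9000, cached_prefixes={"system-prompt-v2"}),
]
# A 6000-token prompt whose first 5500 tokens are a shared, cacheable prefix.
choice = pick_worker(workers, prompt_tokens=6000, prefix_id="system-prompt-v2", prefix_tokens=5500)
print("route to:", choice.name)
```

&lt;p&gt;Worker a has the shortest queue, but worker b wins because it already holds the prefix. Round robin cannot see either fact.&lt;/p&gt;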

&lt;h2&gt;
  
  
  Cost changes shape
&lt;/h2&gt;

&lt;p&gt;In a web app, cost usually grows with pods, CPU, memory, database load, and network traffic. In LLM serving, cost is shaped by GPU usage, and GPUs are expensive enough that small inefficiencies matter.&lt;/p&gt;

&lt;p&gt;You can burn money through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;idle GPUs&lt;/li&gt;
&lt;li&gt;poor batching&lt;/li&gt;
&lt;li&gt;overprovisioned replicas&lt;/li&gt;
&lt;li&gt;long context windows&lt;/li&gt;
&lt;li&gt;bad routing&lt;/li&gt;
&lt;li&gt;large models for simple tasks&lt;/li&gt;
&lt;li&gt;no quantization&lt;/li&gt;
&lt;li&gt;slow cold starts&lt;/li&gt;
&lt;li&gt;inefficient KV cache usage&lt;/li&gt;
&lt;li&gt;serving every request with the same model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cost unit also changes. Instead of only thinking about cost per request, you start thinking about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cost per input token&lt;/li&gt;
&lt;li&gt;cost per output token&lt;/li&gt;
&lt;li&gt;cost per million tokens&lt;/li&gt;
&lt;li&gt;cost per model&lt;/li&gt;
&lt;li&gt;cost per tenant&lt;/li&gt;
&lt;li&gt;cost per GPU hour&lt;/li&gt;
&lt;li&gt;cost per region&lt;/li&gt;
&lt;li&gt;cost per latency tier&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the cloud bill version of the first mental shift: a request is not a request. A token is the real unit of work.&lt;/p&gt;
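
&lt;p&gt;The token-based cost unit is easy to compute once you know what a replica costs per hour and how many tokens it actually produces. A sketch with hypothetical prices and throughput:&lt;/p&gt;

```python
def cost_per_million_tokens(gpu_hourly_usd, num_gpus, tokens_per_second, utilization=1.0):
    """USD per million tokens for one replica at a given sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hourly_usd * num_gpus / tokens_per_hour * 1_000_000


# Hypothetical replica: 8 GPUs at 4 USD per GPU hour, 4000 tokens per second when busy.
fully_used = cost_per_million_tokens(gpu_hourly_usd=4.0, num_gpus=8, tokens_per_second=4000)
half_idle = cost_per_million_tokens(4.0, 8, 4000, utilization=0.5)
print(f"fully used: {fully_used:.2f} USD per million tokens")
print(f"half idle:  {half_idle:.2f} USD per million tokens")
```

&lt;p&gt;The GPU bill is the same in both cases. Only the number of tokens it is spread over changes, which is why idle GPUs and poor batching sit at the top of the list above.&lt;/p&gt;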

&lt;h2&gt;
  
  
  Kubernetes is still useful. It is just not enough by itself
&lt;/h2&gt;

&lt;p&gt;None of this means Kubernetes is the wrong platform for LLM serving. Kubernetes still gives you a lot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scheduling&lt;/li&gt;
&lt;li&gt;declarative deployment&lt;/li&gt;
&lt;li&gt;resource management&lt;/li&gt;
&lt;li&gt;isolation&lt;/li&gt;
&lt;li&gt;service discovery&lt;/li&gt;
&lt;li&gt;rollouts&lt;/li&gt;
&lt;li&gt;observability integrations&lt;/li&gt;
&lt;li&gt;autoscaling primitives&lt;/li&gt;
&lt;li&gt;platform patterns for multiple teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why many serious AI infrastructure platforms still use Kubernetes or something close to it. But Kubernetes does not automatically understand LLMs.&lt;/p&gt;

&lt;p&gt;Out of the box, Kubernetes does not know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what KV cache is&lt;/li&gt;
&lt;li&gt;whether a model is loaded&lt;/li&gt;
&lt;li&gt;whether a GPU group must be scheduled together&lt;/li&gt;
&lt;li&gt;whether pods should land in the same rack&lt;/li&gt;
&lt;li&gt;whether a request has 100 tokens or 100,000 tokens&lt;/li&gt;
&lt;li&gt;whether TTFT is bad&lt;/li&gt;
&lt;li&gt;whether a model server is overloaded despite low CPU&lt;/li&gt;
&lt;li&gt;whether a new replica will take 10 minutes to warm up&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You have to teach the platform these things through metrics, controllers, schedulers, serving frameworks, routing layers, and operational discipline. That is the real work.&lt;/p&gt;

&lt;h2&gt;
  
  
  The old scaling model breaks
&lt;/h2&gt;

&lt;p&gt;The old web scaling model looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traffic increases
        ↓
CPU increases
        ↓
HPA adds pods
        ↓
Load balancer spreads requests
        ↓
Latency improves
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That still works for many stateless services. LLM serving looks more like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traffic increases
        ↓
Input and output token mix changes
        ↓
Queue depth grows
        ↓
KV cache pressure increases
        ↓
Batching behavior changes
        ↓
TTFT and TPOT drift
        ↓
GPU memory or decode throughput becomes the bottleneck
        ↓
Autoscaler needs model-aware metrics
        ↓
New capacity may take minutes to warm up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you bring only the old playbook, you will scale the wrong thing, at the wrong time, using the wrong signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  The new mental model
&lt;/h2&gt;

&lt;p&gt;To serve LLMs well, you need a different model in your head:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A request is not the unit. A token is.&lt;/li&gt;
&lt;li&gt;Memory is not just RAM. GPU memory and KV cache matter more.&lt;/li&gt;
&lt;li&gt;Latency is not one number. TTFT and TPOT matter separately.&lt;/li&gt;
&lt;li&gt;A replica may be a distributed group, not a single pod.&lt;/li&gt;
&lt;li&gt;Scaling is not instant because model loading is slow.&lt;/li&gt;
&lt;li&gt;CPU is not a reliable autoscaling signal by itself.&lt;/li&gt;
&lt;li&gt;Load balancing must understand request cost and cache locality.&lt;/li&gt;
&lt;li&gt;Long context is an infrastructure cost decision.&lt;/li&gt;
&lt;li&gt;Cost optimization starts with keeping expensive GPUs useful.&lt;/li&gt;
&lt;li&gt;Kubernetes is the foundation, but LLM-aware systems must be built on top.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once this clicks, the rest of LLM infrastructure becomes easier to reason about. You can understand why vLLM became popular. Why PagedAttention matters. Why KV cache dominates serving design. Why quantization is a capacity strategy. Why topology-aware scheduling matters. Why teams split prefill and decode. Why GPU cost optimization is its own discipline. Why normal autoscaling is not enough.&lt;/p&gt;

&lt;p&gt;LLM serving is not "deploy a model behind an API." It is a new platform engineering problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing thought
&lt;/h2&gt;

&lt;p&gt;For years, platform teams became very good at scaling stateless web services. We learned containers, Kubernetes, service meshes, autoscaling, observability, progressive delivery, and cloud cost optimization. That knowledge still matters, but LLM serving changes the shape of the problem.&lt;/p&gt;

&lt;p&gt;The bottlenecks move. The metrics change. The cost model changes. The scheduler matters more. The load balancer needs to get smarter. The GPU becomes the scarce resource. The token becomes the unit of work.&lt;/p&gt;

&lt;p&gt;So if you are trying to serve LLMs on Kubernetes, the first step is not installing a Helm chart. The first step is replacing the old mental model.&lt;/p&gt;

&lt;p&gt;Because everything you know about scaling web apps starts to break the moment you serve an LLM.&lt;/p&gt;




&lt;h2&gt;
  
  
  Continue the series
&lt;/h2&gt;

&lt;p&gt;I am writing this as a practical series on hosting large LLMs on Kubernetes, from GPU nodes and model servers to autoscaling, latency, cost, and production architecture. If you want the next part, subscribe to the newsletter.&lt;/p&gt;

&lt;p&gt;I am also preparing a free &lt;strong&gt;LLM Serving on Kubernetes Production Readiness Checklist&lt;/strong&gt; with the questions platform teams should ask before putting an LLM workload in production. Subscribe and I will share it when it is ready.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>ai</category>
      <category>devops</category>
      <category>mlops</category>
    </item>
    <item>
      <title>I Don't Want AI to Replace DevOps. I Want It to Read the Docs I'm Too Tired to Read</title>
      <dc:creator>Pawan Kumar</dc:creator>
      <pubDate>Thu, 07 May 2026 06:32:51 +0000</pubDate>
      <link>https://dev.to/the-persistent-engineer/i-dont-want-ai-to-replace-devops-i-want-it-to-read-the-docs-im-too-tired-to-read-1j2d</link>
      <guid>https://dev.to/the-persistent-engineer/i-dont-want-ai-to-replace-devops-i-want-it-to-read-the-docs-im-too-tired-to-read-1j2d</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://dheeth.blog/i-dont-want-ai-to-replace-devops/" rel="noopener noreferrer"&gt;dheeth.blog&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It's 2 AM. The pager went off eleven minutes ago. You're staring at a Kubernetes upgrade advisory that's forty-seven paragraphs long, and somewhere in paragraph thirty-one there's a breaking change about how EKS handles PodIdentity federation with IAM roles. You know it's in there. You read it three months ago. But right now your brain is running on caffeine and cortisol, and the words are blurring into each other.&lt;/p&gt;

&lt;p&gt;You could run the upgrade now and hope for the best. Or you could spend forty minutes re-reading the entire changelog, the Terraform provider notes, the Helm chart migration guide, and three different Slack threads from the last time someone did this.&lt;/p&gt;

&lt;p&gt;This is the part of DevOps nobody puts in conference talks. Not the elegant GitOps pipelines or the slick dashboards. The part where you're exhausted and you still have to make a decision that affects production, and the information you need is spread across nine browser tabs, a Confluence page from 2023, and a runbook that was last updated when your cluster was on 1.24.&lt;/p&gt;

&lt;p&gt;This is where I want AI to help. Not by taking over. Not by running &lt;code&gt;kubectl apply&lt;/code&gt; on my behalf while I sleep. By reading the damn docs for me.&lt;/p&gt;

&lt;h2&gt;
  
  
  The kind of tired that matters
&lt;/h2&gt;

&lt;p&gt;The Google SRE Workbook has a word for what happens when engineers spend too much time on repetitive operational work: toil. They define it as "the repetitive, predictable, constant stream of tasks related to maintaining a service." Rollouts, upgrades, alert triage, manual repairs, ticket-driven provisioning. Google puts a hard cap on it: no more than 50% of an SRE's time should go to operational work.&lt;/p&gt;

&lt;p&gt;The reasoning isn't just about efficiency. The workbook makes a point that has always stuck with me: time spent on toil is time not spent where human judgment, creativity, and design thinking matter.&lt;/p&gt;

&lt;p&gt;Here's what I think the SRE Workbook doesn't fully capture, at least not in those exact words. There's a specific kind of toil that doesn't look like toil. It doesn't involve clicking buttons or running the same script for the hundredth time. It's cognitive. It's the mental cost of assembling context from scattered sources before you can make a decision.&lt;/p&gt;

&lt;p&gt;Reading a Kubernetes release notes page that's 3,000 words long to find the one deprecation that affects your cluster. Comparing two versions of a Helm &lt;code&gt;values.yaml&lt;/code&gt; to understand what changed between chart versions 4.2.1 and 5.0.0. Skimming a Terraform provider changelog to see if the &lt;code&gt;aws_eks_cluster&lt;/code&gt; resource changed its default behavior. Correlating an incident timeline from last Thursday with the deployment that happened two hours before the spike in 5xx errors.&lt;/p&gt;

&lt;p&gt;This work isn't glamorous. It doesn't produce artifacts. Nobody thanks you for spending an hour reading release notes. But if you skip it, you miss the breaking change that takes down a service at 3 AM on a Sunday.&lt;/p&gt;

&lt;p&gt;Sometimes the most exhausting part of an incident is not fixing the issue. It is building enough context to feel safe fixing it.&lt;/p&gt;

&lt;p&gt;I think of this as cognitive toil, and AI is unusually well suited to help with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I don't want an AI agent with production access
&lt;/h2&gt;

&lt;p&gt;Before I talk about what I do want, let me be clear about what I don't.&lt;/p&gt;

&lt;p&gt;I don't want an AI agent that has &lt;code&gt;kubectl apply&lt;/code&gt; access by default. I don't want one that can merge PRs, push to main, modify IAM policies, or restart services without a human in the loop. I've seen enough production incidents caused by humans who were tired, rushed, or copy-pasting from the wrong terminal. Giving that same power to something that hallucinates API flags and invents Kubernetes resources that don't exist is not progress. It's a new category of incident.&lt;/p&gt;

&lt;p&gt;In application code, an AI mistake might fail a test. In DevOps, an AI mistake might page five teams, drain the wrong node, rotate the wrong secret, or turn a small incident into a very educational afternoon.&lt;/p&gt;

&lt;p&gt;The Stack Overflow 2025 Developer Survey backs this up. 76% of developers don't plan to use AI for deployment or monitoring tasks. Not because they're luddites. Because they know what's at stake. More developers actively distrust AI accuracy (46%) than trust it (33%). Only 3% highly trust it. That is the part that makes people nervous: AI can sound confident even when the answer still needs careful verification.&lt;/p&gt;

&lt;p&gt;In DevOps, "almost right" isn't a minor inconvenience. An "almost right" IAM policy is a security incident. An "almost right" Kubernetes manifest is a workload that runs fine until it doesn't, and then you're debugging at 2 AM wondering why the liveness probe path changed. An "almost right" Terraform plan is a production resource that gets destroyed and recreated instead of updated in place.&lt;/p&gt;

&lt;p&gt;The problem is not that AI is useless. The problem is that AI is useful enough to make dangerous workflows look reasonable. In DevOps, the gap between "sounds correct" and "safe to execute" is where incidents live.&lt;/p&gt;

&lt;p&gt;The hard part of DevOps is rarely knowing the command. &lt;code&gt;kubectl apply -f manifest.yaml&lt;/code&gt; isn't the hard part. The hard part is knowing whether that command is safe in this environment, with this version of Kubernetes, with these admission controllers, with this cluster autoscaler configuration, right after that EKS add-on got updated. That requires context, judgment, and accountability. AI is genuinely useful for the first two, but it can't own the third. Not yet. Maybe not ever.&lt;/p&gt;

&lt;p&gt;Most production work is not blocked because nobody knows how to type &lt;code&gt;kubectl&lt;/code&gt;. It is blocked because nobody is completely sure what is safe to do next.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I actually want AI to do
&lt;/h2&gt;

&lt;p&gt;I want AI to be the colleague who actually reads the release notes before standup. The one who highlights the three things that matter out of a forty-seven-paragraph changelog. The one who can look at a Terraform plan diff and tell you, in plain language, what's about to change and what might break.&lt;/p&gt;

&lt;p&gt;Concretely, here's what that looks like.&lt;/p&gt;

&lt;p&gt;When I'm going from Kubernetes 1.29 to 1.30, I want something that tells me what got deprecated, what changed in API versions, and what I need to act on before upgrading. Skip the boilerplate about "improved performance." Focus on the removals and behavioral changes.&lt;/p&gt;

&lt;p&gt;Before I update the VPC CNI add-on, I want to know if this version is compatible with my current Kubernetes version, my node group AMI, and the Calico network policy version we're running. That compatibility matrix is spread across three AWS docs pages and it changes every quarter.&lt;/p&gt;

&lt;p&gt;When the AWS Terraform provider goes from 5.x to 6.x, I don't want to read the entire migration guide. I want to know which resources I'm actually using that changed behavior. Focus on my code, not the universe of possibilities.&lt;/p&gt;

&lt;p&gt;When I'm upgrading a Helm chart from 4.x to 5.x, show me what changed in the default values: which new keys were introduced, which old keys were removed, which ones changed their default behavior. Better yet, cross-reference my current &lt;code&gt;values.yaml&lt;/code&gt; and tell me which of my overrides are now invalid.&lt;/p&gt;
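
&lt;p&gt;The mechanical half of that comparison does not even need AI. Here is a minimal sketch that works on values already parsed into dictionaries (loading the YAML is left out). The function names and the sample values are made up, and "stale" is only a heuristic, because a chart can accept keys that have no default.&lt;/p&gt;

```python
def flatten(values, prefix=""):
    """Flatten nested dicts into dotted keys, e.g. {"a": {"b": 1}} becomes {"a.b": 1}."""
    flat = {}
    for key, value in values.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value:
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def compare_defaults(old_defaults, new_defaults, overrides):
    old, new, mine = flatten(old_defaults), flatten(new_defaults), flatten(overrides)
    return {
        "added": sorted(k for k in new if k not in old),
        "removed": sorted(k for k in old if k not in new),
        "changed": sorted(k for k in old if k in new and old[k] != new[k]),
        # Overrides that point at keys the new chart no longer has a default for.
        "stale_overrides": sorted(k for k in mine if k not in new),
    }


# Made-up chart defaults for two versions, plus my overrides.
old = {"image": {"tag": "4.2.1"}, "service": {"port": 80}, "rbac": {"create": True}}
new = {"image": {"tag": "5.0.0"}, "service": {"port": 8080}, "serviceAccount": {"create": True}}
overrides = {"rbac": {"create": False}, "service": {"port": 80}}

report = compare_defaults(old, new, overrides)
for section, keys in report.items():
    print(section, keys)
```

&lt;p&gt;A diff like this tells me which keys moved. What I want from AI is the next step: reading the chart's upgrade notes and telling me what those moves mean for my cluster.&lt;/p&gt;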

&lt;p&gt;If I inherit a cluster with 200 custom resources I've never seen before, help me understand what they do without reading CRD documentation for six hours.&lt;/p&gt;

&lt;p&gt;When an incident happens, take the Slack thread, the PagerDuty timeline, and the post-mortem notes, and produce a runbook that the next on-call engineer can actually follow. One that isn't three years stale.&lt;/p&gt;

&lt;p&gt;When the error rate spiked at 14:32 and something was deployed at 14:15, pull the deployment diff, the relevant log lines, and the metrics shift into one view so I can see the connection without switching between four tools.&lt;/p&gt;

&lt;p&gt;When five services are throwing errors and the logs are a wall of stack traces, filter out the noise, group the unique errors, and tell me which one started first. That's the one I care about.&lt;/p&gt;
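
&lt;p&gt;The grouping step can be sketched in a few lines. This example assumes log lines have already been parsed into (timestamp, service, message) tuples with sortable timestamps, and it normalizes messages with a deliberately crude regex. The sample logs are invented, and "started first" is a hint about where to look, not proof of root cause.&lt;/p&gt;

```python
import re
from collections import defaultdict


def signature(message):
    """Collapse variable parts (hex ids, numbers) so similar errors group together."""
    message = re.sub(r"0x[0-9a-fA-F]+", "HEX", message)
    return re.sub(r"\d+", "N", message)


def first_seen(log_lines):
    """Group (timestamp, service, message) tuples by signature, earliest group first."""
    groups = defaultdict(lambda: {"count": 0, "first": None, "services": set()})
    for ts, service, message in log_lines:
        group = groups[signature(message)]
        group["count"] += 1
        group["first"] = ts if group["first"] is None else min(group["first"], ts)
        group["services"].add(service)
    return sorted(groups.items(), key=lambda item: item[1]["first"])


# Invented sample: same-day timestamps in HH:MM:SS sort correctly as strings.
logs = [
    ("14:32:07", "checkout", "timeout after 3000 ms calling payments"),
    ("14:31:58", "payments", "connection refused to db-7:5432"),
    ("14:32:09", "cart", "timeout after 3000 ms calling payments"),
    ("14:32:01", "payments", "connection refused to db-3:5432"),
]
ordered = first_seen(logs)
for sig, info in ordered:
    print(info["first"], info["count"], sorted(info["services"]), sig)
```

&lt;p&gt;Four lines collapse into two distinct errors, and the database connection failure shows up before the timeouts that it probably caused.&lt;/p&gt;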

&lt;p&gt;None of these require production access. None require the AI to execute anything. They require it to read, understand, summarize, compare, and present information so I can decide faster.&lt;/p&gt;

&lt;p&gt;The best DevOps AI will not feel magical. It will feel like a senior engineer left clean notes before going on vacation.&lt;/p&gt;

&lt;h2&gt;
  
  
  The data says this approach works
&lt;/h2&gt;

&lt;p&gt;GitHub's study on Copilot found something interesting beyond speed. 87% of developers said AI helped them preserve mental effort during repetitive tasks. 73% said it helped them stay in flow. 60-75% said it helped them focus on more satisfying work. One senior engineer put it simply: with AI, they had to think less about the boring stuff, and when they had to think, it was the fun stuff.&lt;/p&gt;

&lt;p&gt;The DORA research on generative AI adds an important nuance. Developers who use AI extensively report higher job satisfaction, more time in flow state, and less burnout. But there's a catch: AI adoption didn't reduce time spent on toilsome, repetitive tasks. It sped up the valuable work developers already enjoyed, but didn't crack the code on automating drudgery. DORA also found that a 25% increase in AI adoption was associated with a decrease in delivery stability. The likely explanation is that AI lets teams generate more code and more changes faster than their review and testing processes can handle.&lt;/p&gt;

&lt;p&gt;Read that last sentence again. AI doesn't hurt stability because it writes bad code. It hurts stability because it lets teams produce more work than their feedback loops can safely absorb.&lt;/p&gt;

&lt;p&gt;This is exactly why the read-summarize-suggest model is the right one for DevOps. It gives engineers better context without adding unreviewed changes to the pipeline. It accelerates understanding without bypassing approval. It reduces the time between "I need to figure this out" and "I understand enough to decide" without collapsing the distance between "I decided" and "it's done."&lt;/p&gt;

&lt;h2&gt;
  
  
  A boundary that matters
&lt;/h2&gt;

&lt;p&gt;I'm not anti-agent. I think autonomous AI agents will eventually have a role in infrastructure operations. But the keyword is eventually, and the prerequisite is trust, and trust is earned slowly and lost quickly.&lt;/p&gt;

&lt;p&gt;The Stack Overflow numbers from earlier make the same point: developers are much more cautious with high-responsibility work, and most respondents do not plan to use AI for deployment or monitoring. These are not people who hate AI. These are people who know where the blast radius lives.&lt;/p&gt;

&lt;p&gt;The DORA report reinforces this: trust directly drives AI productivity. Developers who trust AI accept more suggestions, submit more changes, and spend less time searching for information. But DORA also found that 39% of developers still trust AI outputs "a little" or "not at all."&lt;/p&gt;

&lt;p&gt;In DevOps, trust isn't about vibes. It's about being right when being wrong has consequences. An AI that summarizes a changelog and misses a breaking change is annoying but survivable. An AI that applies a change based on that incomplete summary is a production incident.&lt;/p&gt;

&lt;p&gt;The line I draw is simple. Let AI read, summarize, compare, draft, and suggest.&lt;/p&gt;

&lt;p&gt;But make humans approve, execute, and own.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fatigue I want to replace
&lt;/h2&gt;

&lt;p&gt;DevOps has a burnout problem. This isn't news. The on-call rotations, the incident pressure, the constant context switching between ten different tools and three different cloud providers and a pile of documentation that's always slightly out of date.&lt;/p&gt;

&lt;p&gt;The fatigue is real. It accumulates. It's not the dramatic kind where someone screams and quits. It's the quiet kind where you stop reading the full changelog because you've read forty of them and nothing ever breaks, until one Tuesday it does. Where you stop updating the runbook because nobody reads it anyway, including you. Where you start copy-pasting Terraform modules from the last project because you don't have the energy to check if the AWS provider changed the defaults again.&lt;/p&gt;

&lt;p&gt;AI can't fix organizational dysfunction. It can't fix understaffed on-call rotations or unreasonable SLAs. But it can reduce the cognitive tax of the work that sits between "I got paged" and "I understand what is happening." It can give you back the thirty minutes you'd have spent re-reading docs you already read once. It can catch the breaking change you'd have missed at 2 AM.&lt;/p&gt;

&lt;p&gt;I don't want AI to replace DevOps engineers. I want it to replace the exhaustion that makes us worse at the job we're good at. I want it to be the thing that reads the docs so I can focus on deciding what to do with what they say. I want it to handle the reading so I can handle the thinking.&lt;/p&gt;

&lt;p&gt;That's not a smaller vision. It's a more honest one.&lt;/p&gt;




&lt;p&gt;References:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google SRE Workbook, "Eliminating Toil": &lt;a href="https://sre.google/workbook/eliminating-toil/" rel="noopener noreferrer"&gt;sre.google/workbook/eliminating-toil/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;DORA, "Impact of Generative AI in Software Development": &lt;a href="https://dora.dev/ai/gen-ai-report/" rel="noopener noreferrer"&gt;dora.dev/ai/gen-ai-report/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Stack Overflow Developer Survey 2025, AI section: &lt;a href="https://survey.stackoverflow.co/2025/ai/" rel="noopener noreferrer"&gt;survey.stackoverflow.co/2025/ai/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;GitHub Research, "Quantifying GitHub Copilot's Impact on Developer Productivity and Happiness": &lt;a href="https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/" rel="noopener noreferrer"&gt;github.blog/news-insights/research/research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>ai</category>
      <category>sre</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>DevOps to Platform Engineer: The Career Shift Nobody Explains Properly</title>
      <dc:creator>Pawan Kumar</dc:creator>
      <pubDate>Thu, 30 Apr 2026 20:15:33 +0000</pubDate>
      <link>https://dev.to/the-persistent-engineer/devops-to-platform-engineer-the-career-shift-nobody-explains-properly-48f2</link>
      <guid>https://dev.to/the-persistent-engineer/devops-to-platform-engineer-the-career-shift-nobody-explains-properly-48f2</guid>
      <description>&lt;p&gt;If you've been in DevOps long enough, you've probably seen the job postings by now. "Platform Engineer." "Internal Developer Platform." "Platform-as-a-Product." The titles are everywhere. Gartner says 80% of large engineering organizations will have dedicated platform teams by 2026. That's up from 45% in 2022.&lt;/p&gt;

&lt;p&gt;But nobody really explains what changes. Not the buzzwords. The actual day job. The skills. The salary. The headaches.&lt;/p&gt;

&lt;p&gt;I work as a DevOps Engineer at a company that builds Kubernetes application platforms. So I'm living in the middle of this transition every single day. Let me break down what's actually happening, what it means for your career, and whether you should care.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Actually Happening
&lt;/h2&gt;

&lt;p&gt;Here's the short version: DevOps broke at scale. Not the philosophy. The practice.&lt;/p&gt;

&lt;p&gt;When you have 5 teams and 20 services, DevOps works beautifully. Everyone knows everyone. You can walk over to someone's desk (or Slack them) and figure out why the pipeline broke. The "culture of collaboration" actually functions.&lt;/p&gt;

&lt;p&gt;But at 50 teams? 500 services? Multiple clouds? That informal shared context collapses. Onboarding takes weeks instead of days. Every team builds slightly different CI/CD pipelines. Security reviews become bottlenecks. And you end up with 3 senior engineers who "know how things really work," and they're drowning.&lt;/p&gt;

&lt;p&gt;Platform engineering is the response to that breakdown. Instead of relying on culture and tribal knowledge, you build a product, an Internal Developer Platform (IDP), that encodes best practices into self-service tooling.&lt;/p&gt;

&lt;p&gt;The platform becomes the documentation. The guardrails become the governance. And the paved road becomes the easiest road.&lt;/p&gt;

&lt;h2&gt;
  
  
  DevOps vs Platform Engineering: The Real Differences
&lt;/h2&gt;

&lt;p&gt;Let's skip the marketing fluff. Here's what actually changes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;DevOps&lt;/th&gt;
&lt;th&gt;Platform Engineering&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Who you build for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Infrastructure, pipelines&lt;/td&gt;
&lt;td&gt;Developers (your users)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;How work comes to you&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tickets, Slack pings, "can you help me"&lt;/td&gt;
&lt;td&gt;Platform feature requests, adoption metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Governance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PR reviews, approval gates, manual checks&lt;/td&gt;
&lt;td&gt;Embedded into templates and workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Success metric&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Did the deploy work?"&lt;/td&gt;
&lt;td&gt;"Are developers choosing to use the platform?"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scale model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Linear (more teams = more DevOps engineers)&lt;/td&gt;
&lt;td&gt;Leverage (platform scales once, serves all)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Your mindset&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Let me fix this for you"&lt;/td&gt;
&lt;td&gt;"Let me build it so you never need to ask"&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That last row is the fundamental shift. DevOps is a service mindset. Platform engineering is a product mindset.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Day in the Life
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DevOps Engineer's typical day:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build and maintain CI/CD pipelines (30%)&lt;/li&gt;
&lt;li&gt;Write Terraform, manage infrastructure (25%)&lt;/li&gt;
&lt;li&gt;Set up monitoring and alerting (15%)&lt;/li&gt;
&lt;li&gt;Automate deployment processes (20%)&lt;/li&gt;
&lt;li&gt;Help developers with infrastructure issues (10%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Platform Engineer's typical day:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build internal developer tools and abstractions (35%)&lt;/li&gt;
&lt;li&gt;Improve self-service capabilities (25%)&lt;/li&gt;
&lt;li&gt;Maintain the platform infrastructure itself (20%)&lt;/li&gt;
&lt;li&gt;Developer support, education, onboarding (10%)&lt;/li&gt;
&lt;li&gt;Platform documentation (10%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notice the shift: you're spending more time building things &lt;em&gt;for developers to use independently&lt;/em&gt; and less time &lt;em&gt;doing things for developers&lt;/em&gt;. It's the difference between being a chef and being someone who designs kitchen layouts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Salary Question (India-Focused)
&lt;/h2&gt;

&lt;p&gt;Let's talk numbers. I cross-referenced data from AmbitionBox, Glassdoor, Levels.fyi, and real job postings across LinkedIn and Naukri. Here are the realistic ranges for India in 2026:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Experience&lt;/th&gt;
&lt;th&gt;DevOps&lt;/th&gt;
&lt;th&gt;Platform Engineer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3-5 years&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;₹10-28 LPA&lt;/td&gt;
&lt;td&gt;₹20-40 LPA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;6-10 years&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;₹20-45 LPA&lt;/td&gt;
&lt;td&gt;₹35-60 LPA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lead/Principal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;₹35-65 LPA&lt;/td&gt;
&lt;td&gt;₹55-90 LPA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Platform engineering commands a 30-60% premium over generalist DevOps, according to multiple 2026 India salary reports. The premium exists because the talent pool is much smaller: you need DevOps foundations plus product thinking plus software engineering depth.&lt;/p&gt;

&lt;p&gt;Outside India, platform engineers in North America average about $160K USD, while DevOps roles typically plateau around $140K. Not life-changing, but meaningful.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Skills Gap: What You Need to Learn
&lt;/h2&gt;

&lt;p&gt;If you're a DevOps engineer today, you already have most of the foundations. Here's what's missing:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Product Thinking
&lt;/h3&gt;

&lt;p&gt;This is the biggest mindset shift. You're no longer building pipelines; you're building a product with users, feedback loops, and adoption metrics. That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understanding developer pain points (user research)&lt;/li&gt;
&lt;li&gt;Prioritizing features based on impact (product management)&lt;/li&gt;
&lt;li&gt;Measuring adoption, not just uptime (analytics)&lt;/li&gt;
&lt;li&gt;Iterating based on feedback (continuous improvement)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The #1 reason platform initiatives fail? Teams build technically excellent platforms that nobody uses. Voluntary adoption is the real metric.&lt;/p&gt;
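
&lt;p&gt;To make "adoption, not just uptime" concrete, here is a minimal Python sketch of an adoption metric: the share of each team's services that ship through the platform. The service inventory, team names, and function names are all hypothetical; in practice the data would come from your service catalog or deploy logs.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical inventory: (service, owning team, deployed via the platform?)
SERVICES = [
    ("checkout", "payments", True),
    ("ledger", "payments", True),
    ("login", "auth", False),
    ("tokens", "auth", True),
    ("search", "core", False),
]


def adoption_by_team(services):
    """Return {team: share of that team's services deployed through the platform}."""
    total = defaultdict(int)
    adopted = defaultdict(int)
    for _name, team, on_platform in services:
        total[team] += 1
        if on_platform:
            adopted[team] += 1
    return {team: adopted[team] / total[team] for team in total}


def overall_adoption(services):
    """Share of all services that use the platform (0.0 when there are none)."""
    if not services:
        return 0.0
    return sum(1 for _n, _t, on_platform in services if on_platform) / len(services)
```

&lt;p&gt;With this sample data, payments is at 100%, auth at 50%, and core at 0%, which tells you where to do user research next.&lt;/p&gt;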

&lt;h3&gt;
  
  
  2. API Design and Software Engineering
&lt;/h3&gt;

&lt;p&gt;DevOps scripting (Bash, YAML, a bit of Python) doesn't cut it anymore. Platform engineers need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API design&lt;/strong&gt; - Your platform is consumed through APIs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Go or Rust&lt;/strong&gt; - Most CNCF platform tooling is written in Go&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tenancy patterns&lt;/strong&gt; - Your platform serves multiple teams with different needs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Software engineering practices&lt;/strong&gt; - Testing, versioning, deprecation strategies
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example: A Golden Path template for a new microservice&lt;/span&gt;
&lt;span class="c1"&gt;# This is what platform engineers build - opinionated defaults&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scaffolder.backstage.io/v1beta3&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Template&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;microservice-template&lt;/span&gt;
  &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Standard Microservice&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Spin up a new Go microservice with CI/CD, monitoring, and security baked in&lt;/span&gt;
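&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service&lt;/span&gt;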
  &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service Details&lt;/span&gt;
      &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;name&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;team&lt;/span&gt;
      &lt;span class="na"&gt;properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service Name&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
        &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Owning Team&lt;/span&gt;
          &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
          &lt;span class="na"&gt;enum&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;payments&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;auth&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;core&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;platform&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scaffold&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Generate Service&lt;/span&gt;
      &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fetch:template&lt;/span&gt;
      &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./templates/go-microservice&lt;/span&gt;
        &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ parameters.name }}&lt;/span&gt;
          &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ parameters.team }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a simplified Backstage software template - one of the most common patterns in platform engineering. Developers fill in a few fields, and the platform generates a production-ready service with CI/CD, observability, and security pre-configured.&lt;/p&gt;
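
&lt;p&gt;If you want a feel for what happens between "developer fills in a form" and "service gets generated", the Python sketch below imitates the two core steps: check the inputs against the parameter schema, then substitute them into skeleton files. It is a standalone illustration, not Backstage's implementation (Backstage validates with JSON Schema and renders with Nunjucks); the &lt;code&gt;SCHEMA&lt;/code&gt; dict and the &lt;code&gt;validate&lt;/code&gt; and &lt;code&gt;render&lt;/code&gt; helpers are hypothetical names.&lt;/p&gt;

```python
import re

# Hypothetical mirror of the parameter schema in the template above.
SCHEMA = {
    "required": ["name", "team"],
    "properties": {
        "name": {"type": "string"},
        "team": {"type": "string", "enum": ["payments", "auth", "core", "platform"]},
    },
}

# Matches placeholders such as ${{ parameters.name }}.
PLACEHOLDER = re.compile(r"\$\{\{\s*parameters\.(\w+)\s*\}\}")


def validate(params, schema=SCHEMA):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field in schema["required"]:
        if not params.get(field):
            problems.append(f"missing required field: {field}")
    for field, rules in schema["properties"].items():
        value = params.get(field)
        if value is None:
            continue
        if not isinstance(value, str):
            problems.append(f"{field} must be a string")
        elif "enum" in rules and value not in rules["enum"]:
            problems.append(f"{field} must be one of {rules['enum']}")
    return problems


def render(text, params):
    """Replace each ${{ parameters.x }} placeholder with params['x']."""
    return PLACEHOLDER.sub(lambda m: str(params[m.group(1)]), text)
```

&lt;p&gt;The design point is that the guardrail (only known teams can own a service) lives in the template itself, so nobody has to enforce it in a review.&lt;/p&gt;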

&lt;p&gt;You can achieve the same with &lt;a href="https://docs.devtron.ai/docs/user-guide/app-management/application-template" rel="noopener noreferrer"&gt;Devtron's Application Templates&lt;/a&gt; - capture CI/CD workflows, build configs, deployment templates, and environment overrides from an existing app, then reuse them to spin up new microservices in minutes instead of hours.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Developer Experience (DevEx)
&lt;/h3&gt;

&lt;p&gt;You need to care about how developers &lt;em&gt;feel&lt;/em&gt; using your platform. This includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time to first deploy (how fast can a new dev ship?)&lt;/li&gt;
&lt;li&gt;Self-service capabilities (can they do it without filing a ticket?)&lt;/li&gt;
&lt;li&gt;Documentation quality (can they figure it out without asking you?)&lt;/li&gt;
&lt;li&gt;Error messages (are they helpful or cryptic?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The State of Platform Engineering Report recommends tracking DORA metrics (deployment frequency, lead time, change failure rate, MTTR) alongside SPACE metrics (developer productivity) and time-to-onboarding.&lt;/p&gt;
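
&lt;p&gt;The four DORA metrics are simple arithmetic once you have a deployment log. Here is a rough Python sketch, assuming a hypothetical record format (commit time, deploy time, a failure flag, and a restore duration); real pipelines would pull these from your CI/CD and incident tooling, and the &lt;code&gt;dora_summary&lt;/code&gt; name is my own.&lt;/p&gt;

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical deployment log. Each record: when the change was committed,
# when it reached production, whether it caused a failure, and (if so)
# how long it took to restore service.
DEPLOYS = [
    {"committed": datetime(2026, 4, 1, 9), "deployed": datetime(2026, 4, 1, 15), "failed": False, "restore": None},
    {"committed": datetime(2026, 4, 2, 10), "deployed": datetime(2026, 4, 3, 10), "failed": True, "restore": timedelta(hours=2)},
    {"committed": datetime(2026, 4, 6, 8), "deployed": datetime(2026, 4, 6, 12), "failed": False, "restore": None},
    {"committed": datetime(2026, 4, 7, 9), "deployed": datetime(2026, 4, 7, 11), "failed": True, "restore": timedelta(hours=4)},
]


def dora_summary(deploys, window_days):
    """Compute the four DORA metrics over a window of `window_days` days."""
    if not deploys:
        raise ValueError("need at least one deployment")
    lead_times = [d["deployed"] - d["committed"] for d in deploys]
    failures = [d for d in deploys if d["failed"]]
    restores = [d["restore"] for d in failures]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_restore": sum(restores, timedelta()) / len(restores) if restores else None,
    }
```

&lt;p&gt;For the sample log over an 8-day window this gives 0.5 deploys per day, a 5-hour median lead time, a 50% change failure rate, and a 3-hour mean time to restore.&lt;/p&gt;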

&lt;h3&gt;
  
  
  4. AI Literacy
&lt;/h3&gt;

&lt;p&gt;This isn't optional anymore. 92% of CIOs plan AI integrations into their platforms. The recommendation is to reserve 20% of your time for AI skill development:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using AI tools for platform operations (K8sGPT, AI-assisted troubleshooting)&lt;/li&gt;
&lt;li&gt;Building AI-powered capabilities into your platform (intelligent autoscaling, anomaly detection)&lt;/li&gt;
&lt;li&gt;Understanding how AI-generated code flows through your CI/CD&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By 2028, platforms without AI capabilities will be considered outdated.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Actually Make the Transition
&lt;/h2&gt;

&lt;p&gt;Here's a practical roadmap, assuming you have 3+ years of DevOps experience:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 1-2: Build Product Thinking&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read "Team Topologies" by Matthew Skelton and Manuel Pais&lt;/li&gt;
&lt;li&gt;Start treating your current internal tools as products - add documentation, gather feedback, track usage&lt;/li&gt;
&lt;li&gt;Learn about Backstage (CNCF project, 89% market share for IDPs)&lt;/li&gt;
&lt;li&gt;Explore Devtron - an AI-native Kubernetes management platform to see how real IDPs work in practice&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Month 3-4: Level Up Software Engineering&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pick up Go if you haven't already - most platform tooling is Go-based&lt;/li&gt;
&lt;li&gt;Build a small internal tool with proper API design, tests, and documentation&lt;/li&gt;
&lt;li&gt;Contribute to an open-source platform tool (Backstage, Crossplane, or the open-source integrations around Port, which is itself a commercial product)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Month 5-6: Get Hands-On with IDPs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploy Backstage locally or in a sandbox cluster&lt;/li&gt;
&lt;li&gt;Build a software template for your team's most common workflow&lt;/li&gt;
&lt;li&gt;Add golden paths for your existing infrastructure patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ongoing: Develop AI Competency&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Experiment with K8sGPT for cluster troubleshooting&lt;/li&gt;
&lt;li&gt;Explore AI-assisted CI/CD (GitHub Copilot in Actions, AI-powered code review)&lt;/li&gt;
&lt;li&gt;Stay current with AI SRE tools (autonomous incident response is coming fast)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Six Specialized Roles Within Platform Engineering
&lt;/h2&gt;

&lt;p&gt;As the field matures, "platform engineer" is splitting into distinct specializations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Head of Platform Engineering (HOPE)&lt;/strong&gt; - Strategic direction, cross-functional coordination&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Platform Product Manager (PPM)&lt;/strong&gt; - Bridges technical teams and organizational needs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure Platform Engineer (IPE)&lt;/strong&gt; - Underlying infra (servers, networks, databases)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DevEx Platform Engineer (DPE)&lt;/strong&gt; - Developer workflows, friction reduction, tool UX&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Platform Engineer (SPE)&lt;/strong&gt; - Security embedded into pipelines, policy-as-code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliability Platform Engineer (RPE)&lt;/strong&gt; - Evolution of SRE, monitoring/observability plane&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You don't need to pick one immediately. Most platform engineers touch multiple areas, especially in smaller teams. But knowing these exist helps you see where your career can go.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'm Seeing From the Inside
&lt;/h2&gt;

&lt;p&gt;Working at Devtron, a company that literally builds a Kubernetes application platform, I get a front-row seat to this transition. Here's what I see daily:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Teams that adopted platform thinking&lt;/strong&gt; are shipping faster with fewer incidents. They're not firefighting as much because the platform catches common mistakes before they reach production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Teams that didn't&lt;/strong&gt; are drowning in tickets. Every new microservice means another pipeline to build, another set of alerts to configure, another on-call rotation to manage. It doesn't scale.&lt;/p&gt;

&lt;p&gt;The companies that get this right treat their platform as a product with a dedicated team, clear ownership, and actual user research. The ones that get it wrong rebrand their DevOps team as "Platform Engineering" and change nothing about how they work.&lt;/p&gt;

&lt;p&gt;Don't be the second one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Take
&lt;/h2&gt;

&lt;p&gt;Platform engineering isn't replacing DevOps. It's DevOps growing up. The philosophy of collaboration, automation, and shared responsibility stays. What changes is the &lt;em&gt;mechanism&lt;/em&gt;, from culture-dependent to platform-dependent.&lt;/p&gt;

&lt;p&gt;Should you make the shift? If you enjoy building tools more than operating infrastructure, if you care about developer experience, and if you want to work on leverage (building something once that serves hundreds of developers) - yes.&lt;/p&gt;

&lt;p&gt;The timing is right. Mid-level engineers with 3-5 years of experience are entering platform roles in growing numbers. You don't need to be a senior architect anymore. The field is democratizing, the salaries are competitive, and the demand is only going up.&lt;/p&gt;

&lt;p&gt;Start by building one thing that removes friction for your team. Treat it like a product. See what happens.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Further Reading:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://teamtopologies.com/" rel="noopener noreferrer"&gt;Team Topologies&lt;/a&gt; - the org design book behind platform thinking&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://backstage.io/" rel="noopener noreferrer"&gt;Backstage.io&lt;/a&gt; - CNCF project for building developer portals&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://devtron.ai/" rel="noopener noreferrer"&gt;Devtron&lt;/a&gt; - AI-Native Kubernetes Management Platform&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://platformengineering.org/" rel="noopener noreferrer"&gt;Platform Engineering community&lt;/a&gt; - reports, articles, and the annual State of Platform Engineering survey&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dora.dev/" rel="noopener noreferrer"&gt;DORA metrics&lt;/a&gt; - the standard for measuring software delivery performance&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>platformengineering</category>
      <category>career</category>
    </item>
  </channel>
</rss>
