Pawan Kumar

Posted on May 21 • Originally published at dheeth.blog

The Request Is the Wrong Unit of Scale for LLMs on Kubernetes

#kubernetes #llm #devops #mlops

Series links

Part 1: Everything You Know About Scaling Web Apps Breaks When You Serve an LLM

Your dashboard says traffic is flat. Requests per second barely moved. CPU looks fine. Memory looks normal. The HPA is calm. Then latency starts drifting. Time to first token gets worse. GPU memory pressure rises. Queues grow. Users complain that the model is "thinking forever."

Part 1 explained why LLM serving breaks the normal web-scaling model. Part 2 zooms in on one reason: the HTTP request is only the envelope. The real work is token processing.

For a normal web app, a request is often a useful approximation of work. One request hits an API, does some bounded work, maybe talks to a database, returns JSON, and ends. LLMs do not behave like that. One request may contain a 20-token question and produce a 50-token answer. Another may contain a long system prompt, full chat history, retrieved documents, tool output, metadata, and a user asking for a 4,000-token report.

Both are one HTTP request.

They are not the same workload.

Kubernetes may see one request. Your ingress may see one request. Your API gateway may see one request. But the GPU sees tokens: prefill work, decode work, KV cache growth, memory pressure, queueing, and time spent generating output one token at a time.

Tokens are the work.

Why request count worked for web apps

Most platform teams grew up around request-based thinking. We look at requests per second, p95 latency, p99 latency, error rate, CPU usage, memory usage, queue depth, pod count, and replica count. That model works reasonably well for many web services because requests are often similar enough for capacity planning.

Not always, of course. A login request is not the same as an export request. A cached read is not the same as a database-heavy query. Every experienced SRE has seen one "simple" endpoint melt something important. But request count still gives a useful first approximation in many normal systems.

With LLM serving, that approximation breaks faster. A request does not tell you how long the prompt is, how many retrieved documents were added, how much chat history was included, how many output tokens the model generated, how much KV cache was needed, or how long the request occupied the GPU.

This is why a Kubernetes deployment can look stable at the HTTP layer while the model server is under real pressure. The API did not necessarily get more traffic.

The traffic got heavier.

Input tokens and output tokens are different problems

When people first hear "tokens are the unit of work," they often treat all tokens as one bucket. That is a good starting point, but it is not enough.

For serving, input tokens and output tokens stress the system differently.

At a high level, LLM inference has two phases:

Prefill
Decode

The prefill phase processes the input prompt. This includes the system prompt, developer instructions, chat history, retrieved documents, tool results, the user's message, and whatever formatting your application adds before the request reaches the model. The decode phase generates the response one token at a time. The model predicts a token, appends it to the sequence, uses that updated sequence to predict the next token, and keeps going until it hits a stop condition or a token limit.

A practical way to remember it:

Input tokens decide how heavy it is to start answering.

Output tokens decide how long the model stays busy.

Long input usually hurts time to first token because the model has to process the prompt before it can begin generating. Long output usually hurts total latency and capacity because the model stays in the generation loop longer. Streaming can make this feel better to the user, but streaming does not remove the backend work. It just lets the user watch the work happen.
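A back-of-the-envelope model makes the split concrete. The sketch below assumes prefill runs at a fixed token rate and decode at a fixed time per output token. Both rates are made-up illustration values, not measurements, and the model ignores batching, queueing, and scheduling entirely.

```python
from dataclasses import dataclass

# Hypothetical rates for illustration only; measure your own model and hardware.
PREFILL_TOKENS_PER_SECOND = 5_000.0  # assumed prompt-processing speed
SECONDS_PER_OUTPUT_TOKEN = 0.02      # assumed time per output token


@dataclass(frozen=True)
class LatencyEstimate:
    ttft_seconds: float    # time to first token, driven by input length
    decode_seconds: float  # time spent generating, driven by output length

    @property
    def total_seconds(self) -> float:
        return self.ttft_seconds + self.decode_seconds


def estimate_latency(input_tokens: int, output_tokens: int) -> LatencyEstimate:
    """Naive single-request estimate that ignores batching and queueing."""
    ttft = input_tokens / PREFILL_TOKENS_PER_SECOND
    decode = output_tokens * SECONDS_PER_OUTPUT_TOKEN
    return LatencyEstimate(ttft_seconds=ttft, decode_seconds=decode)


short_chat = estimate_latency(input_tokens=20, output_tokens=50)
long_report = estimate_latency(input_tokens=9_000, output_tokens=4_000)

print(f"short chat:  {short_chat.total_seconds:.1f}s total")
print(f"long report: {long_report.total_seconds:.1f}s total")
```

With these assumed rates, the short chat finishes in about 1 second. The long report takes about 82 seconds, and less than 2 of those are spent before the first token. Streaming hides that from the user, not from the GPU.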

This is why serious LLM serving metrics talk about time to first token, time per output token, inter-token latency, and tokens per second. NVIDIA's LLM benchmarking docs describe TTFT as the time before the first output token appears, and note that longer prompts generally increase TTFT because the input sequence has to be processed and the KV cache has to be created. Databricks' inference performance guidance also separates TTFT, TPOT, latency, and throughput instead of treating latency as one simple number.

A normal API request is one operation.

An LLM request is a sequence of token work.

One request can hide a huge amount of prompt assembly

The user does not usually send the real prompt.

The application builds it.

A user might type:

What is our refund policy for enterprise customers?

That looks tiny.

But by the time your application sends the request to the model, the prompt might include:

system prompt: 700 tokens
developer instructions: 400 tokens
chat history: 1,500 tokens
retrieved policy documents: 6,000 tokens
citations and metadata: 600 tokens
user question: 12 tokens
formatting instructions: 300 tokens

The user sent one short question. The model received more than 9,000 input tokens before it generated a single output token. That is one of the easiest mistakes to miss in production: teams measure the user message size, not the final assembled prompt size.
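The arithmetic is worth doing once by hand. This small sketch sums the section sizes from the example above and shows each section's share of the assembled prompt. The dictionary keys are just labels I picked for the example.

```python
# Token counts per prompt section, taken from the example above.
prompt_sections = {
    "system_prompt": 700,
    "developer_instructions": 400,
    "chat_history": 1_500,
    "retrieved_documents": 6_000,
    "citations_and_metadata": 600,
    "user_question": 12,
    "formatting_instructions": 300,
}

total_input_tokens = sum(prompt_sections.values())

# Share of the assembled prompt contributed by each section, largest first.
shares = sorted(
    ((name, count / total_input_tokens) for name, count in prompt_sections.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, share in shares:
    print(f"{name:<26}{share:6.1%}")
print(f"total input tokens: {total_input_tokens}")
```

The total is 9,512 tokens. Retrieved documents are roughly 63% of what the model reads. The user's question is about 0.1%.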

RAG makes this even more interesting. Retrieval-augmented generation is often described as a quality feature. The model gets relevant context, answers with better grounding, and can cite internal documents. That is true. But RAG is also an infrastructure multiplier.

Changing top_k from 4 chunks to 12 chunks may look like a harmless retrieval tuning change. No Kubernetes manifest changed. No model changed. Request count did not change. The product team may even see better answers. But now every request may carry thousands of extra input tokens. That can affect time to first token, GPU memory pressure, KV cache usage, batch composition, queueing delay, maximum concurrency, tail latency, and cost per interaction.
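To put a rough number on that, here is a sketch that assumes an average chunk size of 500 tokens. That chunk size and the 1,000-request window are assumptions for illustration; plug in your own values.

```python
# Assumed average chunk size; pick the value your retriever actually produces.
ASSUMED_TOKENS_PER_CHUNK = 500


def retrieved_context_tokens(
    top_k: int, tokens_per_chunk: int = ASSUMED_TOKENS_PER_CHUNK
) -> int:
    """Approximate input tokens contributed by retrieval for one request."""
    return top_k * tokens_per_chunk


before = retrieved_context_tokens(top_k=4)
after = retrieved_context_tokens(top_k=12)
extra_per_request = after - before

# Effect over a window with an unchanged request count (1,000 is arbitrary).
requests = 1_000
extra_input_tokens = extra_per_request * requests

print(f"per request: {before} -> {after} retrieved tokens")
print(f"extra prefill work over {requests} requests: {extra_input_tokens} tokens")
```

Under these assumptions, one retrieval setting adds 4,000 input tokens to every request and 4 million tokens of prefill work per 1,000 requests, with no change in RPS.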

This is why prompt assembly needs observability. You do not only want to know that a request had 9,000 input tokens. You want to know where those tokens came from: chat history, retrieved documents, tool results, system instructions, verbose metadata, tenant documents, or an agent flow that appends every intermediate step.

Without that breakdown, token growth stays invisible until latency tells you something is wrong.

Long context is a capacity decision

Long-context models are useful. They let you analyze larger documents, keep longer conversations, handle more retrieval context, and build richer workflows. But a large context window is not a target. It is a limit.

This sounds obvious, but many teams behave as if "the model supports 128k context" means "we can casually send 128k context." That is like saying a node has 1 TB of memory, so every process should try to use it.

Long context changes the shape of serving. A small number of long-context requests can consume enough GPU memory and serving time to affect everyone else. A chat session can become more expensive as history grows. An agent can quietly append tool traces until each turn becomes much heavier than the first one. A summarization feature can go from "summarize this page" to "summarize this folder of documents" without the HTTP request count changing at all.

The failure mode is subtle because the old dashboard may still look calm. RPS is flat, but p95 input tokens moved from 2,000 to 18,000.

That is not flat traffic.

A useful platform practice is to bucket prompts by size: short prompts, medium prompts, long prompts, very long prompts, and batch or offline prompts. The exact numbers depend on your model and hardware, but the habit matters more than the bucket boundaries.

A 500-token chat and a 50,000-token document analysis should not be treated as the same class of work just because both entered through /v1/chat/completions.
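A bucketing function can be very small. In this sketch the boundaries and bucket names are hypothetical placeholders, since the right cut points depend on your model and hardware.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical exclusive upper bounds in input tokens; tune per model and hardware.
BUCKET_BOUNDS = [1_000, 8_000, 32_000]
BUCKET_NAMES = ["short", "medium", "long", "very_long"]


def prompt_bucket(input_tokens: int) -> str:
    """Map an input token count to a size bucket."""
    return BUCKET_NAMES[bisect_right(BUCKET_BOUNDS, input_tokens)]


# Example input token counts for six requests that all hit the same route.
observed = [500, 950, 2_400, 7_999, 8_000, 50_000]
bucket_counts = Counter(prompt_bucket(n) for n in observed)

for name in BUCKET_NAMES:
    print(f"{name:<10}{bucket_counts[name]}")
```

Once the bucket is attached to each request as a label, every latency and queue metric can be split by it.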

Context windows are limits, not goals.

Output length is not just a UX choice

Input tokens get a lot of attention because long prompts are easy to blame. But output tokens can be just as important for capacity planning.

Two requests can have the same input prompt and completely different backend cost depending on output length.

Request A:

1,000 input tokens
100 output tokens

Request B:

1,000 input tokens
2,000 output tokens

Same route. Same user flow. Same prompt size. Very different serving time.

The second request keeps the model generating for much longer. If the response is streamed, the user may start seeing text quickly, which is good. But the GPU is still occupied while the model continues decoding token after token.

This is why max_tokens is not only a product parameter. It is a capacity control.

If every request is allowed to generate 4,000 tokens, you have created a worst-case capacity problem even if most responses are shorter. If a feature asks the model to "write a detailed report," that is not the same workload as "answer this chat question." If agents are allowed to produce long reasoning traces, tool plans, summaries, and final answers, output length can grow quickly.

You should track both requested maximum output tokens and actual generated output tokens. Requested max output tokens show the capacity risk your system accepted. Actual output tokens show the work the model really performed. If many requests hit the output cap, your users may be getting truncated answers. If very few requests use the available budget, your default might be too generous.
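Both signals fall out of two numbers per request. The sketch below uses hypothetical names and synthetic samples. If your model server reports a finish reason, that is usually a more direct truncation signal than comparing counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputUsage:
    requested_max_tokens: int  # capacity risk the system accepted
    generated_tokens: int      # work the model actually did


def cap_hit_rate(samples: list[OutputUsage]) -> float:
    """Fraction of requests that generated all the way to their cap."""
    hits = sum(1 for s in samples if s.generated_tokens >= s.requested_max_tokens)
    return hits / len(samples)


def budget_utilization(samples: list[OutputUsage]) -> float:
    """Generated tokens as a fraction of the total output budget granted."""
    granted = sum(s.requested_max_tokens for s in samples)
    used = sum(s.generated_tokens for s in samples)
    return used / granted


# Synthetic samples: every request was allowed 4,000 output tokens.
samples = [
    OutputUsage(4_000, 120),
    OutputUsage(4_000, 300),
    OutputUsage(4_000, 4_000),  # hit the cap, so the answer is likely truncated
    OutputUsage(4_000, 380),
]

print(f"cap hit rate:       {cap_hit_rate(samples):.0%}")
print(f"budget utilization: {budget_utilization(samples):.0%}")
```

Here one request in four hits the cap while the fleet uses only 30% of the output budget it granted. That pattern suggests one route needs a higher limit and the rest need a lower default.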

Output length is not formatting.

It is how long the request rents the GPU.

Same request count, completely different load

A useful dashboard should show when the same request count hides a different workload shape. For example:

Window A:

requests: 1,000
average input: 500 tokens
average output: 150 tokens
total token work: 650,000 tokens

Window B:

requests: 1,000
average input: 8,000 tokens
average output: 1,000 tokens
total token work: 9,000,000 tokens

Both windows show 1,000 requests. But the second window has almost 14x the token volume. If your dashboard only shows request count, it says traffic is flat. If your dashboard shows token volume, it says the workload changed completely.
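The same comparison as a few lines of arithmetic, split into input and output so you can see which side moved more:

```python
def window_totals(requests: int, avg_input: int, avg_output: int) -> tuple[int, int, int]:
    """Return (input tokens, output tokens, total tokens) for a window."""
    input_total = requests * avg_input
    output_total = requests * avg_output
    return input_total, output_total, input_total + output_total


a_in, a_out, a_total = window_totals(1_000, 500, 150)
b_in, b_out, b_total = window_totals(1_000, 8_000, 1_000)

total_ratio = b_total / a_total
input_ratio = b_in / a_in
output_ratio = b_out / a_out

print(f"total tokens:  {a_total} -> {b_total} ({total_ratio:.1f}x)")
print(f"input tokens:  {a_in} -> {b_in} ({input_ratio:.1f}x)")
print(f"output tokens: {a_out} -> {b_out} ({output_ratio:.1f}x)")
```

Total volume grows about 13.8x, but the split matters: input grows 16x and output about 6.7x, so prefill and decode are not stressed equally.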

That is why the useful question is not only:

How many requests are we serving?

It is also:

How many input tokens are arriving, how many output tokens are being generated, and where are those tokens coming from?

What Kubernetes sees and what the model server feels

Kubernetes is very good at managing containers. It can schedule pods, restart failed workloads, apply resource requests and limits, spread replicas, roll out deployments, and attach workloads to GPU nodes. But Kubernetes does not automatically understand the shape of an LLM request. A pod can be healthy while the model server is struggling. CPU can look uninteresting while GPU memory is the real limit. Container memory can look fine while the KV cache in GPU memory is under pressure. Request count can look flat while token volume has exploded.

This is where the division of responsibility matters. Kubernetes gives you the orchestration layer. The model server gives you the LLM execution layer. The application builds the prompt. The platform team has to connect the signals.

If those layers do not share the right metrics, you end up scaling the wrong thing. For example, CPU-based HPA may be useful around some parts of the stack, but it is not enough to understand LLM serving capacity. A model server may expose more relevant metrics such as prompt tokens, generation tokens, time to first token, time per output token, queue time, number of running requests, number of waiting requests, and KV cache usage.

vLLM's production metrics are a good example of where the industry is moving. vLLM exposes metrics for prompt tokens, generation tokens, request prompt tokens, request generation tokens, time to first token, time per output token, request queue time, prefill time, decode time, KV cache usage, running requests, and waiting requests. That metric set tells you something important:

The production surface of LLM serving is already token-aware.

Your dashboard should be too.

Token-based observability is not optional

If you are running LLM workloads on Kubernetes, request count still matters. You still need API-level metrics. You still care about errors, availability, saturation, queueing, and latency. But those metrics need token context.

At minimum, every request should give you:

input tokens
output tokens
total tokens
requested max output tokens
time to first token
time per output token or inter-token latency
end-to-end latency
queue time
model name
model version
deployment
tenant or team
route or feature
finish reason

If possible, also track prompt composition:

system prompt tokens
chat history tokens
retrieved context tokens
tool result tokens
user message tokens
metadata or formatting tokens

That breakdown is where many production surprises hide.
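One way to carry these fields is a single per-request record. This is a minimal sketch, not a schema from any particular model server: every field name here is hypothetical, and it covers only a subset of the lists above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RequestTokenRecord:
    model: str
    route: str
    tenant: str
    input_tokens: int
    output_tokens: int
    requested_max_output_tokens: int
    ttft_seconds: float
    end_to_end_seconds: float
    finish_reason: str
    # Prompt composition by source; the keys are illustrative.
    input_by_source: dict[str, int] = field(default_factory=dict)

    @property
    def total_tokens(self) -> int:
        return self.input_tokens + self.output_tokens

    @property
    def seconds_per_output_token(self) -> float:
        """Average decode time per token after the first one."""
        if self.output_tokens <= 1:
            return 0.0
        return (self.end_to_end_seconds - self.ttft_seconds) / (self.output_tokens - 1)

    @property
    def unattributed_input_tokens(self) -> int:
        """Input tokens not explained by any tracked source."""
        return self.input_tokens - sum(self.input_by_source.values())


record = RequestTokenRecord(
    model="example-model",
    route="/v1/chat/completions",
    tenant="team-a",
    input_tokens=3_000,
    output_tokens=201,
    requested_max_output_tokens=1_024,
    ttft_seconds=1.5,
    end_to_end_seconds=5.5,
    finish_reason="stop",
    input_by_source={"system_prompt": 700, "chat_history": 800, "retrieved_context": 1_400},
)

print(record.total_tokens, record.unattributed_input_tokens)
```

The unattributed count is a useful sanity check. In this synthetic record, 100 input tokens come from somewhere the application is not tracking yet.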

For dashboards, averages are not enough. Average token count can look stable while the tail gets ugly. You want p50, p95, and p99 for input tokens and output tokens. You want latency by token bucket. You want TTFT by input size. You want end-to-end latency by output size. You want to know whether a tenant is sending mostly short prompts or occasionally sending giant ones.
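Here is a small demonstration of why the average is not enough. The two synthetic workloads below have the same mean input size, and the percentile helper uses a simple nearest-rank definition that I chose for clarity.

```python
import math
from statistics import mean


def percentile(values: list[int], pct: float) -> int:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Synthetic input token counts for 100 requests each.
steady = [2_000] * 100                  # every prompt is the same size
tailed = [600] * 90 + [14_600] * 10     # mostly short chats, a few giant prompts

for name, values in (("steady", steady), ("tailed", tailed)):
    print(
        f"{name}: mean={mean(values):.0f} "
        f"p50={percentile(values, 50)} "
        f"p95={percentile(values, 95)} "
        f"p99={percentile(values, 99)}"
    )
```

Both workloads average 2,000 input tokens. The second one has a p50 of 600 and a p95 of 14,600, which is a very different thing to serve.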

Some useful views:

input token p50, p95, p99
output token p50, p95, p99
total tokens per minute by model
total tokens per tenant
TTFT by input token bucket
TPOT by output token bucket
queue time by token bucket
percentage of requests near context limits
percentage of requests hitting output cap
retrieved context tokens per request
chat history tokens per request
KV cache usage over time
waiting requests alongside waiting token estimates

The last point is important.

Do not only ask how many requests are waiting. Ask how many tokens are waiting. A queue of 20 short chat requests and a queue of 20 long document-analysis requests are not the same queue.
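A waiting-token estimate can be as simple as the sketch below. It assumes you know each queued request's assembled input size and its requested output cap, and it treats the cap as an upper bound on decode work. The names and sizes are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedRequest:
    input_tokens: int       # size of the assembled prompt
    max_output_tokens: int  # requested output cap, an upper bound on decode work


def waiting_token_estimate(queue: list[QueuedRequest]) -> int:
    """Upper-bound estimate of token work waiting in the queue."""
    return sum(r.input_tokens + r.max_output_tokens for r in queue)


chat_queue = [QueuedRequest(500, 256)] * 20
document_queue = [QueuedRequest(50_000, 2_000)] * 20

chat_waiting = waiting_token_estimate(chat_queue)
document_waiting = waiting_token_estimate(document_queue)

print(f"chat queue:     {len(chat_queue)} requests, {chat_waiting} tokens")
print(f"document queue: {len(document_queue)} requests, {document_waiting} tokens")
```

Both queues hold 20 requests. One holds about 15,000 tokens of potential work and the other more than a million.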

Product changes become infrastructure changes

One uncomfortable part of LLM platforms is that product changes can become infrastructure changes very quickly.

In a normal web app, adding a new field to a response may not matter much. In an LLM application, adding more context to a prompt can change capacity. Increasing retrieval depth can change latency. Keeping longer chat history can change memory pressure. Allowing longer outputs can change GPU occupancy.

The product team might say:

We only changed the prompt.

The platform team hears:

We changed the workload.

Both are true.

This does not mean product teams should be afraid of improving prompts. It means token impact should be visible before and after the change. If a new prompt improves answer quality but increases average input tokens by 3x, that may be a good tradeoff. But it should be a conscious tradeoff.

If a RAG change improves accuracy but pushes p99 prompts near the context limit, that should be visible before production users discover the latency problem. If a new report-generation mode produces 10x more output tokens than chat, it probably needs a different workload class and different expectations.

The platform question is not "are tokens bad?"

Tokens are the product.

The question is whether you know how many you are serving, where they come from, and what they do to your capacity.

Practical rules for platform teams

If you are starting to serve LLMs on Kubernetes, measure input and output tokens for every request. Do not wait until the first incident to add token metrics. Track the final assembled prompt, not just the user message. The model does not care what the user typed. It cares what your application sent.

Break input tokens down by source. Separate system prompt, chat history, retrieved context, tool results, and user message. Track requested max output tokens separately from actual output tokens. One tells you accepted risk. The other tells you real work.

Use token buckets in latency dashboards. A p95 latency graph without token buckets mixes small chat requests and huge document requests into one misleading line. Watch p95 and p99 token counts, not just averages. The tail is where LLM serving gets painful.

Put budgets on RAG retrieval. top_k is not only a relevance knob. It is a capacity knob. Treat context windows as limits, not targets. Just because a model accepts long context does not mean every request should use it. Set sane output defaults. Long answers should be intentional, not the accidental default for every route.
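A retrieval budget can be enforced with a few lines at prompt-assembly time. This is one simple policy among several possible ones: it assumes chunks arrive already ranked by relevance with known token counts, and it stops at the first chunk that does not fit. The function name and numbers are hypothetical.

```python
def apply_retrieval_budget(chunk_token_counts: list[int], max_context_tokens: int) -> list[int]:
    """Return indices of the top-ranked chunks that fit within the token budget.

    Chunks are taken in rank order; selection stops at the first chunk
    that would push the total over the budget.
    """
    kept: list[int] = []
    used = 0
    for index, tokens in enumerate(chunk_token_counts):
        if used + tokens > max_context_tokens:
            break
        kept.append(index)
        used += tokens
    return kept


# Token counts of six retrieved chunks, best match first (synthetic values).
ranked_chunks = [480, 520, 610, 450, 700, 390]

kept_indices = apply_retrieval_budget(ranked_chunks, max_context_tokens=2_000)
kept_tokens = sum(ranked_chunks[i] for i in kept_indices)

print(f"kept chunks {kept_indices} using {kept_tokens} tokens")
```

With a 2,000-token budget, the first three chunks are kept (1,610 tokens) no matter what top_k the retriever was asked for. The budget, not the retrieval setting, decides how much prefill work each request can add.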

Separate workload classes when needed. Short interactive chat, long RAG, report generation, agent workflows, and batch summarization do not have the same shape. Review token growth after product changes. Prompt changes, retrieval changes, memory changes, and tool changes can all affect infrastructure.

These rules are not about making the system slower or less useful. They are about making the system understandable.

You cannot operate what you do not measure. And in LLM serving, measuring only requests means you are measuring the envelope while ignoring the work inside it.

The real unit of scale

The request is still useful at the API boundary. You need it for authentication, rate limits, logs, tracing, errors, and user flows. But it is not the right unit for LLM capacity.

It cannot tell you how much prompt the model processed, how long the model generated, how much KV cache was needed, or whether the workload was short chat, long-context RAG, report generation, or an agent loop.

Tokens get you closer to the truth. Input tokens explain much of the work before the first response appears. Output tokens explain how long the model keeps generating. Token distributions explain why averages lie. Token sources explain which product behavior changed the workload. Token-aware metrics explain why your Kubernetes deployment looks healthy while users still feel latency.

Part 1 was about letting go of the normal web app scaling model. Part 2 is about replacing one of its most misleading assumptions.

For LLMs on Kubernetes, you are not really scaling requests.

You are scaling token work across expensive, memory-constrained, latency-sensitive GPU systems.

Once you see that, the rest of the platform starts to make more sense.

Continue the series

I am writing this as a practical series on hosting large LLMs on Kubernetes, from GPU nodes and model servers to autoscaling, latency, cost, and production architecture. If you want the next part, subscribe to the newsletter.

I am also preparing a free LLM Serving on Kubernetes Production Readiness Checklist with the metrics, dashboard questions, and architecture review points platform teams should ask before putting an LLM workload in production. Subscribe and I will share it when it is ready.
