Pawan Kumar

Originally published at dheeth.blog

Everything You Know About Scaling Web Apps Breaks When You Serve an LLM

This is Part 1 of a practical series on hosting large LLMs on Kubernetes.

Most platform engineers already know how to scale a web app. Put it in a container. Deploy it on Kubernetes. Add CPU and memory requests. Put a Service or Ingress in front. Configure HPA. Watch p95 latency, error rate, CPU, memory, and request throughput. Add replicas when traffic goes up.

That playbook works for a lot of services. Then you try to serve a large language model, and suddenly the old model starts cracking. A request is no longer just a request. Memory does not just mean RAM. Latency is not one number. Scaling a pod does not mean capacity appears instantly. One "replica" may need one GPU, eight GPUs, or several machines working together.

And the bottleneck may not be CPU at all. The first mental shift is simple:

LLM serving is not normal web serving.

The real unit of work is the token.

A request is no longer a request

In a normal web app, request count is often a useful planning signal. Not perfect, obviously. Some endpoints are heavier than others. Some queries are ugly. Some users manage to find the one path that melts the database. But request count still tells you something.

With LLMs, it can lie to your face.

One user asks:

Summarize this sentence.

Another user asks:

Analyze this 80-page contract, compare it with these policy documents, extract the risks, and generate a detailed memo.

Both are one request. They are not the same workload.

The second request may contain thousands of input tokens. It may generate thousands of output tokens. It may sit on GPU memory for longer. It may increase queueing delay for everyone behind it. It may consume far more KV cache. It may make your latency charts look haunted.

So if you only measure requests per second, you are almost blind. For LLMs, you need to care about:

  • input tokens
  • output tokens
  • tokens per second
  • time to first token
  • time per output token
  • queue depth
  • batch size
  • GPU memory
  • KV cache usage
  • model loading time

That is a very different world from normal HTTP throughput.
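To make this concrete, here is a minimal sketch of what you might log per request instead of a bare request counter. The class and field names are illustrative, not taken from any particular serving framework.

```python
import time
from dataclasses import dataclass, field

@dataclass
class RequestRecord:
    """Per-request signals worth logging for an LLM endpoint (illustrative fields)."""
    arrived_at: float = field(default_factory=time.monotonic)
    input_tokens: int = 0         # prompt length after tokenization
    output_tokens: int = 0        # tokens actually generated
    queue_seconds: float = 0.0    # time spent waiting before prefill started
    ttft_seconds: float = 0.0     # time to first token, as seen by the caller
    batch_size_at_admit: int = 0  # how many requests shared the batch

def tokens_per_second(records: list[RequestRecord], window_seconds: float) -> float:
    """Aggregate throughput in tokens, not requests."""
    total = sum(r.input_tokens + r.output_tokens for r in records)
    return total / window_seconds
```

Two requests per second can mean two hundred tokens or two hundred thousand. The aggregate only makes sense once it is expressed in tokens.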

LLM inference has two phases, and they behave differently

When a user sends a prompt to an LLM, the model does not handle the request as one uniform block of work. At a high level, inference has two phases:

  1. prefill
  2. decode

Prefill is where the model processes the input prompt. If the prompt is long, prefill gets expensive. This is where the model reads the context and builds the internal state needed to start generating. Decode is where the model generates output tokens one at a time. This is the part users see when text starts streaming on the screen.

These phases stress the system differently. Prefill is more compute heavy. Decode is often more memory bandwidth heavy. Prefill depends heavily on input length. Decode depends heavily on output length. Both affect latency, throughput, cost, and capacity.

This distinction does not usually matter when you are scaling a normal API. You do not think of a checkout endpoint as having two GPU phases with different scheduling behavior. With LLMs, you have to.

If you ignore prefill and decode, you will struggle to explain why first token latency is slow, why long prompts hurt so much, or why the GPU looks busy but users still complain.
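A back-of-envelope model helps here. The sketch below splits request latency into a prefill-dominated part and a decode-dominated part; the throughput numbers are placeholders, not measurements from any specific model or GPU.

```python
def estimate_latency(prompt_tokens: int,
                     output_tokens: int,
                     prefill_tps: float = 8000.0,  # assumed prefill throughput, tokens/s
                     decode_tps: float = 60.0,     # assumed per-request decode rate, tokens/s
                     queue_seconds: float = 0.0) -> tuple[float, float]:
    """Rough split of latency into time-to-first-token and generation time.

    Real values depend on the model, batch size, hardware, and serving engine.
    """
    ttft = queue_seconds + prompt_tokens / prefill_tps  # prefill gates the first token
    generation = output_tokens / decode_tps             # decode gates everything after it
    return ttft, generation

# A short question vs. the 80-page contract from earlier.
print(estimate_latency(prompt_tokens=30, output_tokens=50))
print(estimate_latency(prompt_tokens=40_000, output_tokens=2_000))
```

Even with made-up numbers, the shape is clear: long prompts hit the first number, long generations hit the second, and they do not fail in the same way.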

Latency is not one number anymore

For web services, we usually talk about latency as one number:

  • p50 latency
  • p95 latency
  • p99 latency
  • request duration

For LLMs, that is not enough. Two latency numbers matter a lot.

Time to first token

Time to first token, or TTFT, is how long the user waits before the model starts responding. This controls the feeling of responsiveness.

If nothing appears for five seconds, the product feels slow. It does not matter that the final answer is useful. The user has already started wondering if the system is stuck.

TTFT is affected by:

  • queueing delay
  • prompt length
  • prefill time
  • model routing
  • batch scheduling
  • GPU availability
  • cold starts
  • cache behavior

Users feel TTFT sharply because silence feels broken.

Time per output token

Time per output token, or TPOT, is the average time the model takes to generate each token after the first one. This controls the streaming experience.

Good TTFT with bad TPOT feels like the model wakes up quickly and then crawls. Good TPOT makes the answer feel alive, even if the full response takes time.

TPOT is affected by:

  • decode efficiency
  • GPU memory bandwidth
  • batch size
  • KV cache pressure
  • model size
  • quantization
  • serving engine
  • hardware type

Normal web systems rarely force you to separate "time until the response starts" from "speed at which the rest of the response streams." LLM serving does.
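Measuring the two is straightforward if you instrument the stream itself. A small sketch, with a fake token stream standing in for a real streaming client:

```python
import time

def measure_stream(token_stream):
    """Record TTFT and average TPOT while consuming a token iterator."""
    start = time.monotonic()
    first = None
    count = 0
    for _token in token_stream:
        now = time.monotonic()
        if first is None:
            first = now          # first token arrived: this is TTFT
        count += 1
    ttft = first - start if first is not None else None
    tpot = (now - first) / (count - 1) if count > 1 else None
    return ttft, tpot

# Toy stream: a slow start followed by steady decode (timings are purely illustrative).
def fake_stream():
    time.sleep(0.5)              # stands in for queueing plus prefill
    for _ in range(20):
        time.sleep(0.03)         # stands in for per-token decode
        yield "tok"

print(measure_stream(fake_stream()))
```

Track the two as separate histograms. A single request-duration percentile hides which one is actually drifting.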

Memory means GPU memory now

In a web app, memory usually means heap, runtime overhead, in-process cache, or connection pools. In LLM serving, memory often means GPU memory. And GPU memory is painful because it is limited, expensive, and easy to waste.

You need GPU memory for:

  • model weights
  • KV cache
  • runtime buffers
  • activations
  • batching overhead
  • framework overhead

Model weights are the obvious part. A 7 billion parameter model in FP16 or BF16 needs roughly 14 GB just for weights. A 70 billion parameter model needs roughly 140 GB just for weights at that precision. That already means one GPU may not be enough.

But weights are only the obvious cost. The hidden cost is KV cache.

KV cache stores the key and value tensors from previous tokens so the model does not recompute everything from scratch during generation. The longer the context and the more concurrent users you serve, the more KV cache you need.

This is why long context is not just a product feature. It is an infra bill. Every extra token you allow into the context window can come back as GPU memory pressure. Maximum context length is not only a model capability. It is a capacity planning decision.
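You can put rough numbers on this. The sketch below uses the standard estimate of KV cache per token (keys plus values, across all layers) with a configuration in the ballpark of a Llama-2-7B-style model; real numbers depend on the exact architecture, grouped-query attention, and the precision of the cache.

```python
def kv_cache_bytes_per_token(num_layers: int,
                             num_kv_heads: int,
                             head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: keys and values, across every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

# Ballpark for a 7B-class model: 32 layers, 32 KV heads, head_dim 128, FP16 cache.
per_token = kv_cache_bytes_per_token(32, 32, 128)    # ~0.5 MB per token
weights_gb = 7e9 * 2 / 1e9                           # ~14 GB of FP16 weights

# 32 concurrent requests, each allowed an 8k-token context window:
kv_gb = 32 * 8192 * per_token / 1e9                  # ~137 GB of KV cache
print(f"weights ~ {weights_gb:.0f} GB, KV cache up to ~ {kv_gb:.0f} GB")
```

The weights are fixed; the KV cache scales with concurrency times context length. That product is the number that quietly decides how many GPUs you need.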

Replicas are not always replicas

In a normal web app, one replica usually means one pod running one copy of the application. Traffic goes up, add pods. With LLMs, the word "replica" can hide a lot.

A small model may run inside one pod on one GPU.

A larger model may need:

  • multiple GPUs in one node
  • multiple pods on one node
  • multiple nodes
  • tensor parallelism
  • pipeline parallelism
  • a Ray cluster
  • a leader-worker setup
  • a group of pods that must start together

So when someone says, "scale the model to 10 replicas," the first question should be: what is one replica?

Is it one pod? One GPU? One tensor parallel group? One multi-node deployment? One endpoint backed by several workers? One prefill group plus one decode group?

This is where Kubernetes abstractions get interesting. A Deployment works nicely for simple stateless services. Serious LLM serving may need Ray, KServe, LeaderWorkerSet, Kueue, Volcano, or custom orchestration.

The model may not fit into the old "one pod equals one replica" picture.
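Before anyone argues about replica counts, it helps to write down what one replica costs in GPUs. A deliberately crude sketch, ignoring activations, parallelism overhead, and the fact that the tensor parallel degree usually has to divide the number of attention heads:

```python
import math

def gpus_per_replica(weights_gb: float,
                     kv_and_overhead_gb: float,
                     gpu_memory_gb: float = 80.0) -> int:
    """Minimum GPUs one model replica needs, purely on memory."""
    return math.ceil((weights_gb + kv_and_overhead_gb) / gpu_memory_gb)

# A 70B model in FP16 with room for KV cache does not fit on a single 80 GB GPU.
print(gpus_per_replica(weights_gb=140, kv_and_overhead_gb=60))  # -> 3; in practice TP=4 or TP=8
```

If that number is greater than one, "replica" already means a coordinated group, and the scheduling question changes shape.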

Scaling a pod does not mean capacity is ready

In a normal web app, a new pod can become useful quickly. The image is pulled. The process starts. Readiness passes. Traffic flows. For LLMs, a new pod may sit there for a while before it can handle real traffic.

It may need to:

  • pull a large container image
  • download model weights
  • load hundreds of GBs from object storage or disk
  • initialize CUDA
  • allocate GPU memory
  • build or load optimized engines
  • warm up the model
  • join a distributed serving group

This can take minutes. Sometimes longer. So autoscaling is not just about deciding when to add replicas. It is about adding capacity early enough that it is ready before users feel the pain.

That is much harder than scaling a normal web app. This is why LLM platforms often use:

  • minimum warm replicas
  • preloaded models
  • local NVMe model cache
  • warm pools
  • separate GPU node pools
  • predictive scaling
  • queue based scaling
  • scheduled capacity for known peaks

Scale to zero sounds great until the first user waits for a giant model to load.
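One practical pattern is to gate readiness on warm-up, not just on the process being alive. A sketch of the idea; load_weights and generate here are placeholders for whatever your serving engine actually does:

```python
import threading
import time

class ModelServer:
    """Only report ready after weights are loaded AND a warm-up request succeeded."""

    def __init__(self):
        self._ready = False
        threading.Thread(target=self._startup, daemon=True).start()

    def _startup(self):
        self.load_weights()        # can take minutes for a large model
        self.generate("warm-up")   # force caches, CUDA graphs, compiled kernels to build
        self._ready = True

    def load_weights(self):
        time.sleep(2)              # stand-in for a multi-minute download and load

    def generate(self, prompt: str) -> str:
        return "ok"

    def readiness_probe(self) -> bool:
        """Wire this to the pod's readinessProbe so traffic waits for warm-up."""
        return self._ready

server = ModelServer()
while not server.readiness_probe():
    time.sleep(0.5)
print("ready for traffic")
```

The point is that "pod is Running" and "replica can serve a token" are different events, sometimes minutes apart.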

CPU autoscaling becomes a weak signal

CPU utilization is a decent signal for many Kubernetes workloads. Not perfect. But decent. For LLM serving, CPU can be almost irrelevant.

The expensive work happens on GPUs. More specifically, the bottleneck may be:

  • GPU memory
  • GPU memory bandwidth
  • KV cache capacity
  • decode throughput
  • queue depth
  • batch saturation
  • inter-GPU communication
  • model server scheduling
  • request length distribution

A model server can have low CPU usage and still be overloaded. It can have high GPU utilization and still deliver terrible latency. It can have enough compute but not enough KV cache capacity. It can be stuck serving long prompts while short prompts wait behind them.

So if you autoscale only on CPU, the platform may make the wrong decision at the worst possible time.

Better signals include:

  • queue depth
  • waiting requests
  • ongoing requests per replica
  • batch size
  • TTFT
  • TPOT
  • tokens per second
  • KV cache usage
  • GPU memory pressure
  • SLO burn rate

GPU utilization still matters. It just cannot be the only signal. LLM autoscaling has to understand the workload.
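As an example of what model-aware scaling can look like, here is a sketch that reuses the HPA scaling formula but drives it with in-flight requests per replica instead of CPU. The target value is a placeholder you would tune per model and hardware.

```python
import math

def desired_replicas(current_replicas: int,
                     waiting_requests: int,
                     running_requests: int,
                     target_in_flight_per_replica: int = 8,
                     max_replicas: int = 16) -> int:
    """Scale on queued plus running work per replica, not on CPU."""
    in_flight = waiting_requests + running_requests
    observed_per_replica = in_flight / max(current_replicas, 1)
    desired = math.ceil(current_replicas * observed_per_replica / target_in_flight_per_replica)
    return max(1, min(desired, max_replicas))

print(desired_replicas(current_replicas=2, waiting_requests=30, running_requests=14))  # -> 6
```

Swap the input for KV cache utilization or TTFT SLO burn and the same shape works; the hard part is picking the signal, not the arithmetic. And whatever the formula says, the new replicas still take minutes to warm up.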

Round robin load balancing gets weird

For a normal web app, round robin load balancing is often fine. Request 1 goes to pod A. Request 2 goes to pod B. Request 3 goes to pod C. For LLMs, this can be wasteful.

A short prompt and a long prompt have completely different costs. A request with a cached prefix may be cheaper if it lands on the right worker. A long generation may occupy capacity much longer than the load balancer expects. One tenant may need lower latency than another. One model may need different hardware from another.

Naive load balancing can create strange failures:

  • one worker gets long prompts and slows down
  • another worker stays underused
  • KV cache locality is lost
  • prefix caching becomes less useful
  • tail latency gets worse
  • GPU utilization looks fine while users are unhappy

LLM serving needs smarter routing.

Good routing may consider:

  • model name
  • prompt length
  • estimated output length
  • tenant priority
  • cache locality
  • GPU availability
  • queue depth
  • hardware type
  • region
  • latency SLO

This is why inference gateways, model-aware routing, and cache-aware scheduling matter.
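A toy version of model-aware routing: score each worker instead of rotating through them. The weights below are made up; a real router would also consider tenant priority, hardware type, and SLOs.

```python
from dataclasses import dataclass

@dataclass
class Worker:
    name: str
    queue_depth: int           # requests waiting on this worker
    kv_cache_used: float       # fraction of KV cache in use, 0..1
    cached_prefixes: set[str]  # prefix hashes this worker likely still has cached

def pick_worker(workers: list[Worker], prefix_hash: str, prompt_tokens: int) -> Worker:
    """Prefer prefix-cache hits, avoid busy workers and nearly full KV caches."""
    def score(w: Worker) -> float:
        s = 0.0
        if prefix_hash in w.cached_prefixes:
            s += 100.0                                # a cache hit skips most of prefill
        s -= 5.0 * w.queue_depth                      # waiting work delays TTFT directly
        s -= 50.0 * w.kv_cache_used                   # a full cache risks preemption
        s -= 0.001 * prompt_tokens * w.kv_cache_used  # long prompts hurt full workers more
        return s
    return max(workers, key=score)

workers = [
    Worker("a", queue_depth=4, kv_cache_used=0.9, cached_prefixes={"contract-v1"}),
    Worker("b", queue_depth=1, kv_cache_used=0.3, cached_prefixes=set()),
]
# The busier worker still wins because it already has the prompt prefix cached.
print(pick_worker(workers, prefix_hash="contract-v1", prompt_tokens=40_000).name)  # -> a
```

Round robin cannot express any of this, which is exactly why it quietly wastes GPUs.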

Cost changes shape

In a web app, cost usually grows with pods, CPU, memory, database load, and network traffic. In LLM serving, cost is shaped by GPU usage, and GPUs are expensive enough that small inefficiencies matter.

You can burn money through:

  • idle GPUs
  • poor batching
  • overprovisioned replicas
  • long context windows
  • bad routing
  • large models for simple tasks
  • no quantization
  • slow cold starts
  • inefficient KV cache usage
  • serving every request with the same model

The cost unit also changes. Instead of only thinking about cost per request, you start thinking about:

  • cost per input token
  • cost per output token
  • cost per million tokens
  • cost per model
  • cost per tenant
  • cost per GPU hour
  • cost per region
  • cost per latency tier

This is the cloud bill version of the first mental shift: a request is not a request. A token is the real unit of work.
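Putting a number on it is easy once you think in tokens. A rough sketch; the GPU price, throughput, and utilization below are placeholders, not benchmarks.

```python
def cost_per_million_tokens(gpu_hour_price: float,
                            gpus_per_replica: int,
                            tokens_per_second_per_replica: float,
                            utilization: float = 0.6) -> float:
    """Serving cost per 1M tokens from GPU pricing and measured throughput.

    Utilization matters because idle GPUs cost exactly as much as busy ones.
    """
    tokens_per_hour = tokens_per_second_per_replica * 3600 * utilization
    hourly_cost = gpu_hour_price * gpus_per_replica
    return hourly_cost / tokens_per_hour * 1_000_000

# e.g. 4 GPUs at $2.50/hr each, sustaining 1,500 tokens/s at 60% average utilization
print(f"${cost_per_million_tokens(2.50, 4, 1500):.2f} per 1M tokens")
```

Every lever in the list above moves one of these inputs: batching and routing raise throughput, warm pools and right-sizing raise utilization, quantization and smaller models lower the GPU bill.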

Kubernetes is still useful. It is just not enough by itself

None of this means Kubernetes is the wrong platform for LLM serving. Kubernetes still gives you a lot:

  • scheduling
  • declarative deployment
  • resource management
  • isolation
  • service discovery
  • rollouts
  • observability integrations
  • autoscaling primitives
  • platform patterns for multiple teams

That is why many serious AI infrastructure platforms still use Kubernetes or something close to it. But Kubernetes does not automatically understand LLMs.

Out of the box, Kubernetes does not know:

  • what KV cache is
  • whether a model is loaded
  • whether a GPU group must be scheduled together
  • whether pods should land in the same rack
  • whether a request has 100 tokens or 100,000 tokens
  • whether TTFT is bad
  • whether a model server is overloaded despite low CPU
  • whether a new replica will take 10 minutes to warm up

You have to teach the platform these things through metrics, controllers, schedulers, serving frameworks, routing layers, and operational discipline. That is the real work.
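The first concrete step is usually just exposing the right numbers. A sketch using the prometheus_client library as one common option; the metric names below are made up, though serving engines such as vLLM already expose similar metrics you can scrape instead.

```python
from prometheus_client import Gauge, Histogram, start_http_server

# Metric names are illustrative; the point is exporting model-aware signals
# that dashboards and autoscalers can actually act on.
QUEUE_DEPTH   = Gauge("llm_queue_depth", "Requests waiting for prefill")
KV_CACHE_USED = Gauge("llm_kv_cache_utilization", "Fraction of KV cache in use")
TTFT_SECONDS  = Histogram("llm_time_to_first_token_seconds", "Time to first token",
                          buckets=[0.1, 0.25, 0.5, 1, 2, 5, 10])
TPOT_SECONDS  = Histogram("llm_time_per_output_token_seconds", "Time per output token",
                          buckets=[0.01, 0.02, 0.05, 0.1, 0.2])

def record_request(ttft: float, tpot: float, queue_depth: int, kv_used: float) -> None:
    TTFT_SECONDS.observe(ttft)
    TPOT_SECONDS.observe(tpot)
    QUEUE_DEPTH.set(queue_depth)
    KV_CACHE_USED.set(kv_used)

if __name__ == "__main__":
    start_http_server(9400)   # scrape target for Prometheus or a custom metrics adapter
    record_request(ttft=0.4, tpot=0.03, queue_depth=3, kv_used=0.7)
```

Once these exist, the autoscaler, the router, and the on-call dashboard can all speak the same language: tokens, queues, and cache, not CPU.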

The old scaling model breaks

The old web scaling model looks like this:

Traffic increases
        ↓
CPU increases
        ↓
HPA adds pods
        ↓
Load balancer spreads requests
        ↓
Latency improves

That still works for many stateless services. LLM serving looks more like this:

Traffic increases
        ↓
Input and output token mix changes
        ↓
Queue depth grows
        ↓
KV cache pressure increases
        ↓
Batching behavior changes
        ↓
TTFT and TPOT drift
        ↓
GPU memory or decode throughput becomes the bottleneck
        ↓
Autoscaler needs model-aware metrics
        ↓
New capacity may take minutes to warm up

If you bring only the old playbook, you will scale the wrong thing, at the wrong time, using the wrong signal.

The new mental model

To serve LLMs well, you need a different model in your head:

  • A request is not the unit. A token is.
  • Memory is not just RAM. GPU memory and KV cache matter more.
  • Latency is not one number. TTFT and TPOT matter separately.
  • A replica may be a distributed group, not a single pod.
  • Scaling is not instant because model loading is slow.
  • CPU is not a reliable autoscaling signal by itself.
  • Load balancing must understand request cost and cache locality.
  • Long context is an infrastructure cost decision.
  • Cost optimization starts with keeping expensive GPUs useful.
  • Kubernetes is the foundation, but LLM-aware systems must be built on top.

Once this clicks, the rest of LLM infrastructure becomes easier to reason about. You can understand why vLLM became popular. Why PagedAttention matters. Why KV cache dominates serving design. Why quantization is a capacity strategy. Why topology-aware scheduling matters. Why teams split prefill and decode. Why GPU cost optimization is its own discipline. Why normal autoscaling is not enough.

LLM serving is not "deploy a model behind an API." It is a new platform engineering problem.

Closing thought

For years, platform teams became very good at scaling stateless web services. We learned containers, Kubernetes, service meshes, autoscaling, observability, progressive delivery, and cloud cost optimization. That knowledge still matters, but LLM serving changes the shape of the problem.

The bottlenecks move. The metrics change. The cost model changes. The scheduler matters more. The load balancer needs to get smarter. The GPU becomes the scarce resource. The token becomes the unit of work.

So if you are trying to serve LLMs on Kubernetes, the first step is not installing a Helm chart. The first step is replacing the old mental model.

Because everything you know about scaling web apps starts to break the moment you serve an LLM.


Continue the series

I am writing this as a practical series on hosting large LLMs on Kubernetes, from GPU nodes and model servers to autoscaling, latency, cost, and production architecture. If you want the next part, subscribe to the newsletter.

I am also preparing a free LLM Serving on Kubernetes Production Readiness Checklist with the questions platform teams should ask before putting an LLM workload in production. Subscribe and I will share it when it is ready.
