The demo worked because the test was one user, one prompt, one response.
Then real usage showed up, requests overlapped, and the same GPU plan suddenly looked underpowered.
What changed?
Usually not the model. Usually not the code.
What changed was the shape of the workload.
Once batching, queueing, or concurrent users become real, the memory and latency profile stops looking like the notebook version that originally passed.
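To make the memory side of that concrete, here is a back-of-envelope KV-cache sizing sketch. The model numbers are assumptions (a 7B-class transformer: 32 layers, hidden size 4096, fp16), not measurements from any specific stack; swap in your own model's config.

```python
def kv_cache_bytes(batch_size, context_len, n_layers=32, hidden=4096, dtype_bytes=2):
    # Each token stores one key vector and one value vector per layer.
    return 2 * n_layers * hidden * dtype_bytes * context_len * batch_size

gib = 1024 ** 3
# The notebook test: one user, short prompt.
print(kv_cache_bytes(1, 512) / gib)    # 0.25 GiB
# Production-ish: batching plus longer contexts.
print(kv_cache_bytes(16, 4096) / gib)  # 32.0 GiB
```

Same model, same code; the only thing that changed is the workload shape, and the cache went from a rounding error to more than an entire consumer GPU.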
Why teams miss this
- they validate the model with one request at a time
- they treat "it loaded and answered" as performance proof
- they never test the real prompt distribution

- they ignore how batching changes both memory use and latency
The exact moment the plan starts breaking
You launch a private demo. It feels fine.
Then a few real users arrive at once, or you enable batching to improve throughput, and memory margin disappears.
Now the same setup that looked safe starts queueing harder, spilling over, or forcing you into ugly latency compromises.
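That queueing behavior is easy to see even in a toy model. The sketch below is a minimal single-server queue with a fixed per-request service time; every number in it is illustrative, not a benchmark of any real system.

```python
def p95_latency(arrival_gap_s, service_s=1.0, n_requests=200):
    # One request finishes before the next starts; latecomers wait.
    free_at = 0.0
    latencies = []
    for i in range(n_requests):
        arrival = i * arrival_gap_s
        start = max(arrival, free_at)        # wait if the server is busy
        free_at = start + service_s
        latencies.append(free_at - arrival)  # queue wait + service time
    latencies.sort()
    return latencies[int(0.95 * len(latencies))]

# Demo pace: one request every 2s against a 1s service time -> no queue.
print(p95_latency(2.0))  # 1.0
# Real usage: a request every 0.9s -> the queue compounds request after request.
print(p95_latency(0.9))  # 20.0
```

Nothing about the hardware changed between those two calls. Once arrivals outpace service even slightly, the queue grows without bound, which is why "it felt fine in the demo" says almost nothing about p95 under load.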
What the single-user test hides
- Single user, short prompt: proves the model can answer
- Single user, longer prompt: exposes context sensitivity
- Multiple users or batching: exposes the real serving shape
That last test is the one that actually matters.
Why batch size changes the GPU decision
A lot of people pick GPUs like they are renting a single-user workstation.
Production inference is not that.
Once you care about throughput, queue time, or overlapping users, batch size starts interacting with context length, KV cache, and runtime overhead.
That can turn a "works on a 4090" plan into an "an A100 is calmer" plan very quickly.
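A rough headroom check shows why. The numbers below are assumptions for illustration: a 7B model in fp16 (~14 GiB of weights), ~0.5 MiB of KV cache per token, and ~2 GiB of runtime and activation overhead. Adjust all of them for your actual stack before trusting the answer.

```python
def max_batch(vram_gib, context_len, weights_gib=14, overhead_gib=2,
              kv_mib_per_token=0.5):
    # VRAM left after weights and runtime overhead, in MiB.
    free_mib = (vram_gib - weights_gib - overhead_gib) * 1024
    # KV cache each in-flight request needs at full context.
    per_request_mib = kv_mib_per_token * context_len
    return int(free_mib // per_request_mib)

print(max_batch(24, 4096))  # 24 GiB card: batch of ~4 at 4k context
print(max_batch(80, 4096))  # 80 GiB card: batch of ~32
```

Under these assumptions the 24 GiB card is not "slower", it simply runs out of room for concurrent requests, so extra traffic turns into queue time instead of batch throughput.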
The expensive mistake
Seeing the first slowdown and jumping blindly to the biggest card.
The better move is to measure what actually changed:
- prompt length
- concurrent requests
- whether batching is helping throughput enough to justify the memory cost
What we would measure before touching the GPU plan
- p50 and p95 prompt length
- how many requests overlap during real use
- whether batching improves throughput or just hurts latency
- how much headroom remains after a realistic traffic spike
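The measurements above mostly fall out of request logs. Here is a sketch of computing p50/p95 prompt length and peak request overlap; the record shape (`start`, `end`, `prompt_tokens`) is a hypothetical log format, not a standard one.

```python
def workload_stats(requests):
    lengths = sorted(r["prompt_tokens"] for r in requests)

    def pct(p):  # simple nearest-rank percentile
        return lengths[min(int(p * len(lengths)), len(lengths) - 1)]

    # Peak overlap: sweep start/end events, tracking how many are in flight.
    events = [(r["start"], 1) for r in requests] + [(r["end"], -1) for r in requests]
    live = peak = 0
    for _, delta in sorted(events):
        live += delta
        peak = max(peak, live)
    return {"p50_prompt": pct(0.50), "p95_prompt": pct(0.95), "peak_overlap": peak}

reqs = [
    {"start": 0.0, "end": 2.0, "prompt_tokens": 300},
    {"start": 0.5, "end": 3.0, "prompt_tokens": 2200},
    {"start": 2.5, "end": 4.0, "prompt_tokens": 450},
]
print(workload_stats(reqs))  # {'p50_prompt': 450, 'p95_prompt': 2200, 'peak_overlap': 2}
```

Three numbers from real traffic beat any amount of guessing: they tell you whether the problem is prompt length, concurrency, or neither.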
A practical decision framework
- One user at a time, short prompts, narrow demo: keep the smaller GPU and validate harder
- Longer prompts and moderate concurrent usage: leave more VRAM headroom before traffic grows
- Real serving, batching, and unstable latency: re-evaluate the serving plan, then the GPU tier
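The three branches above can be sketched as a helper. The thresholds here are placeholders, not recommendations; the point is that the inputs are measurements, and you should tune the cutoffs against your own traffic.

```python
def gpu_plan_advice(p95_prompt_tokens, peak_overlap, latency_stable):
    # Narrow demo shape: effectively one user at a time, short prompts.
    if peak_overlap <= 1 and p95_prompt_tokens < 1000:
        return "keep the smaller GPU and validate harder"
    # Longer prompts, moderate concurrency, latency still behaving.
    if latency_stable and peak_overlap < 8:
        return "leave more VRAM headroom before traffic grows"
    # Real serving with batching and unstable latency.
    return "re-evaluate the serving plan, then the GPU tier"

print(gpu_plan_advice(600, 1, True))    # keep the smaller GPU and validate harder
print(gpu_plan_advice(2200, 4, True))   # leave more VRAM headroom before traffic grows
print(gpu_plan_advice(2200, 12, False)) # re-evaluate the serving plan, then the GPU tier
```

Note the ordering: the serving plan gets re-evaluated before the GPU tier, which matches the framework's point that the biggest card is the last lever, not the first.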
The real rule
A GPU plan is not validated when one prompt works.
It is validated when the real workload stays stable under realistic request patterns.
If batch size becomes real and the setup starts sweating, that is not bad luck. That is the workload finally telling the truth.