Daya Shankar
Serving LLMs on IaaS: throughput vs latency tuning with practical guardrails

Serving LLMs on IaaS is queueing plus memory pressure dressed up as ML. Every request has a prefill phase (prompt → KV cache) and a decode phase (token-by-token output). 

Throughput tuning pushes batching and concurrency. Latency tuning caps them to protect TTFT and ITL. With vLLM on a single L40S (PCIe), you win by setting hard limits and enforcing admission control.

TTFT, ITL, TPS: stop mixing the metrics

If you tune the wrong metric, you’ll ship a fast benchmark and a slow product.

You need three numbers, and they mean different things:

  • TTFT (time to first token): how long the user waits before anything shows up. Interactive UX lives here. 
  • ITL (inter-token latency): the “smoothness” of streaming output once decoding starts. Chat feels broken when this jitters. 
  • Throughput (tokens/sec): the finance metric. It decides cost per request. 

One important detail: E2E latency includes queueing + prefill + decode. TTFT is where queueing hides when you’re overloaded. 

Practical measurement rule: measure TTFT and ITL at the client (or gateway), not inside the GPU server. Internal timings miss queueing in front of vLLM.
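To make that rule concrete, here is a minimal sketch of client-side timing. The helper is a pure function over timestamps; the endpoint URL and payload in the commented sketch are placeholders, not from this article.

```python
import time

def stream_timings(t_request, token_times):
    """Compute TTFT and inter-token latencies from client-side timestamps.

    t_request:   wall-clock time the request was sent
    token_times: wall-clock arrival time of each streamed chunk/token
    """
    ttft = token_times[0] - t_request
    itls = [b - a for a, b in zip(token_times, token_times[1:])]
    return ttft, itls

# Sketch of collecting timestamps at the client or gateway
# (URL/payload are hypothetical placeholders):
#
#   import requests
#   t0 = time.monotonic()
#   times = []
#   with requests.post("http://gateway:8000/v1/completions",
#                      json={...}, stream=True) as r:
#       for chunk in r.iter_lines():
#           if chunk:
#               times.append(time.monotonic())
#   ttft, itls = stream_timings(t0, times)
```

Because the timestamps are taken outside the GPU server, queueing in front of vLLM shows up in TTFT, which is exactly what you want to see.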

Hardware reality check: single L40S on PCIe

You can’t tune around a bus you don’t have.

An L40S is a strong inference GPU, but it’s not an NVLink box. It’s 48GB GDDR6 on PCIe Gen4 x16.  
That matters because:

  • You have one GPU’s worth of memory for weights + KV cache.
  • You don’t get multi-GPU model parallel tricks for free.
  • Your main enemies are KV-cache pressure and batch/concurrency overshoot, not “GPU topology.”

On a single GPU server, latency failures usually look like:

  • TTFT spikes because the prefill queue grows.
  • ITL spikes because decode gets starved or the batch gets too big.
  • OOM/restarts because KV cache math was wishful thinking.
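To replace wishful thinking with arithmetic, here is a back-of-envelope KV-cache sizing sketch. The model shape (32 layers, 8 KV heads, head dim 128, fp16) and weight size are illustrative assumptions for an 8B-class GQA model, not figures from this article.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # K and V each store n_layers * n_kv_heads * head_dim values per token.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def max_cached_tokens(vram_bytes, util, weight_bytes, per_token):
    # Cache budget = pre-allocated fraction of VRAM minus model weights.
    return int((vram_bytes * util - weight_bytes) // per_token)

# Illustrative 8B-class model with GQA (shape assumed):
per_tok = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
budget = max_cached_tokens(vram_bytes=48 * 1024**3,   # L40S: 48 GB
                           util=0.85,                  # --gpu-memory-utilization
                           weight_bytes=16 * 1024**3,  # ~16 GB fp16 weights (assumed)
                           per_token=per_tok)
# per_tok is 128 KiB/token; budget lands around 200k cached tokens.
```

Divide that token budget by your worst-case prompt + output length and you get a hard ceiling on safe concurrency, which is what max_num_seqs should respect.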

vLLM’s default behavior: TTFT-first scheduling (and the trade)

vLLM already picks a side; your job is to set guardrails around it.

By default, vLLM’s scheduler prioritizes prefills and does not mix prefill and decode requests in the same batch. That typically optimizes TTFT, but can worsen ITL and GPU utilization. 

Translation: out of the box, vLLM tries to be responsive. You can still break it by feeding it mixed traffic with no limits.

The knobs that actually move TTFT, ITL, and OOM risk

You don’t “optimize latency.” You configure concurrency and KV-cache headroom.

These four knobs do most of the work in production vLLM.

1) --max-num-seqs caps concurrency

This is your “how many requests can be active” ceiling.

--max-num-seqs is the maximum number of sequences per iteration.  
Lowering it:

  • reduces concurrent KV cache usage
  • reduces queue contention inside the engine
  • usually helps tail latency (until you underutilize the GPU)

2) --max-num-batched-tokens controls batch size per iteration

This is where you trade throughput for TTFT/ITL stability.

--max-num-batched-tokens limits batched tokens per iteration.  
Lowering it:

  • reduces “one huge prefill” events
  • reduces KV cache demand per cycle
  • can protect TTFT and prevent decode jitter

Raising it:

  • can increase throughput
  • can increase queueing and tail spikes if your traffic is bursty or prompts are long

3) --gpu-memory-utilization sets KV-cache headroom

This decides how much VRAM vLLM pre-allocates for cache.

vLLM pre-allocates GPU cache using gpu_memory_utilization. Increase it to provide more KV cache space.  
If you set it too high, too little VRAM is left for activations and everything else on the device, and you risk OOM. If you set it too low, you’ll hit KV cache limits early and TTFT will spike under concurrency.

4) --enable-chunked-prefill tames long prompts

Long prompts are TTFT killers; chunking makes them less explosive.

When enabled, vLLM can chunk prefill requests based on max_num_batched_tokens.  
This is a practical guardrail when you can’t control prompt length perfectly.

A sane starting config for your SLA (p95 TTFT 250ms, p99 800ms)

Start conservative, hit the TTFT target, then earn throughput back.

On a single L40S, don’t begin with “maximum throughput.” Begin with “stable TTFT.”

Example vllm serve baseline (single GPU):

vllm serve /models/your-llm \
  --host 0.0.0.0 --port 8000 \
  --gpu-memory-utilization 0.85 \
  --max-num-seqs 64 \
  --max-num-batched-tokens 8192 \
  --enable-chunked-prefill

Why these shapes:

  • max_num_seqs prevents unlimited concurrency blowups. 
  • max_num_batched_tokens prevents one batch from ballooning. 
  • gpu_memory_utilization keeps cache headroom explicit. 
  • chunked prefill reduces “one giant prompt ruins the minute.” 

You will tune these. But you need a stable base first.

Practical guardrails for mixed chat + batch traffic

Throughput tuning is easy. Guardrails are what keep p99 alive.

Mixed traffic (interactive + batch) is where systems get weird. Batch clients tend to:

  • send long prompts
  • request long generations
  • retry aggressively
  • keep load constant

Interactive chat needs:

  • fast TTFT
  • consistent ITL
  • predictable tail behavior

So you need admission control in front of vLLM. Not “best effort.”

Guardrail table (start here)

These caps stop one client from torching everyone else.

| Guardrail | Default starting point | Why it exists |
| --- | --- | --- |
| Max prompt tokens | 4k–8k (per request) | Long prefills blow TTFT and batch size |
| Max output tokens | 256–512 (interactive), 1k+ (batch) | Protect tail latency for chat |
| Max in-flight requests | Tie to max_num_seqs | Prevent internal queue explosion |
| Max queue depth | 1–2× in-flight | If queue exceeds that, reject with 429 fast |
| Request timeout | Slightly above p99 target | Don’t let zombie requests clog decode |
| Retry policy | Capped retries + jitter | Stops retry storms multiplying load |
These aren’t theoretical. They’re how you keep a single GPU server usable.

Two-lane routing (interactive vs batch)

If you mix traffic in one FIFO queue, batch wins and chat loses.

On one GPU, the clean pattern is two lanes at the gateway:

  • Interactive lane: strict caps (short prompts, short outputs), low queue depth.
  • Batch lane: looser caps, but it yields when interactive is busy.

You can implement this with a thin gateway that:

  • inspects request size (prompt tokens + requested output tokens)
  • routes “interactive” to the main lane
  • routes “batch” to a background lane with stricter admission

Even if both lanes hit the same vLLM backend, the queue policy changes outcomes.

Concrete rule that works: 
If interactive queue depth > N, reject batch (429) instead of letting it sit and inflate TTFT.
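That rule fits in a few lines. Here is a minimal admission-control sketch as a pure decision function; the thresholds and lane names are illustrative, not prescribed by the article.

```python
def admit(lane, in_flight, interactive_queue_depth,
          max_in_flight=64, batch_reject_depth=4):
    """Gateway decision for one incoming request, HTTP-status style.

    200 -> forward to vLLM now
    429 -> shed load; client retries with backoff (send Retry-After)

    max_in_flight should track --max-num-seqs; batch_reject_depth is the
    N from the rule above (value assumed for illustration).
    """
    if in_flight >= max_in_flight:
        return 429  # hard ceiling: never queue past the engine's capacity
    if lane == "batch" and interactive_queue_depth > batch_reject_depth:
        return 429  # batch yields: protect interactive TTFT
    return 200
```

A real gateway wraps this in locking and per-lane counters, but the policy itself stays this small: batch is the first thing you shed, and you shed it fast instead of letting it sit in the queue.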

The tuning loop that converges (without cargo cult)

Tune one knob at a time and measure TTFT and ITL separately.

Here’s the loop to run on a GPU cloud server before you call it “production.”

Step 1: Fix the workload mix

Your traffic generator must match reality.

Build two test profiles:

  • Chat: short prompts, short outputs, bursty concurrency.
  • Batch: longer prompts and outputs, steady concurrency.

If you benchmark only one, you’ll tune only one.

Step 2: Lock SLOs first

You already have targets; enforce them.

Targets:

  • TTFT p95 ≤ 250ms
  • TTFT p99 ≤ 800ms

Keep a red line on the dashboard. If a tuning change crosses it, roll back.
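The red line can be automated. A minimal sketch, assuming TTFT samples in milliseconds and a simple nearest-rank percentile (no external dependencies):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile over a non-empty list of samples."""
    s = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

def ttft_within_slo(ttft_ms, p95_budget=250.0, p99_budget=800.0):
    """True iff both TTFT targets from the SLO hold for this sample set."""
    return (percentile(ttft_ms, 95) <= p95_budget
            and percentile(ttft_ms, 99) <= p99_budget)
```

Run it against the client-side TTFT samples after every tuning change; if it returns False, that change gets rolled back, no debate.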

Step 3: Set limits, then raise carefully

Earn throughput; don’t steal it from p99.

Order of operations:

  1. Set max_num_seqs low enough that you never OOM under your worst prompt mix. 
  2. Set max_num_batched_tokens to prevent giant prefills from blocking decode. 
  3. Adjust gpu_memory_utilization to give KV cache room. 
  4. Enable chunked prefill if long prompts exist in real traffic. 

Then:

  • increase max_num_seqs until TTFT p95 hits the edge of your budget
  • increase max_num_batched_tokens only if ITL stays stable and TTFT doesn’t spike
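The raise-carefully loop for max_num_seqs can be sketched as a small search. `run_load_test` is a hypothetical stand-in for your benchmark harness (restart vLLM with the new flag, replay the recorded traffic mix, return observed TTFT p95); the step sizes are illustrative.

```python
def tune_max_num_seqs(run_load_test, start=16, step=16, ceiling=256,
                      p95_budget_ms=250.0):
    """Raise max_num_seqs until measured TTFT p95 hits the budget edge.

    run_load_test(max_num_seqs) -> observed TTFT p95 in ms (caller-supplied,
    hypothetical). Returns the last value that stayed inside the budget;
    if even `start` fails, it is returned anyway and must be lowered by hand.
    """
    best = start
    for n in range(start, ceiling + 1, step):
        if run_load_test(n) > p95_budget_ms:
            break  # crossed the red line: keep the last passing value
        best = n
    return best
```

The same shape works for max_num_batched_tokens, with the extra condition that the ITL distribution stays flat before a raise is accepted.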

Step 4: Add overload behavior on purpose

A good system fails fast, not slowly.

Define overload mode:

  • when queue depth exceeds threshold → return 429 with Retry-After
  • when prompt/output exceeds limits → return 400 with a clear message
  • when batch lane is busy → shed batch first

If you don’t define this, your system will “define it” by melting.

Dashboards that catch trouble before users do

You can’t grep production. You need signals that predict tail spikes.

Track:

  • TTFT p50/p95/p99 (interactive lane, batch lane)
  • ITL distribution (interactive lane)
  • queue depth and reject rate (the guardrail is working if it fires)
  • GPU memory usage and cache pressure (OOM risk proxy)

vLLM already frames TTFT/ITL as the core performance story, and its scheduler tradeoffs explain why TTFT can look good while ITL suffers (or vice versa). 

Where AceCloud fits (one honest paragraph)

IaaS isn’t the problem; inconsistency is.

If you’re serving on an IaaS GPU cloud server from a provider like AceCloud, treat it like any other VM: bake a known image, pin driver/CUDA versions, and script your vLLM flags so every node behaves the same. The tuning work above only sticks when the box is predictable.

Bottom line

Throughput is what you brag about. Latency is what users feel.

On vLLM + single L40S, you don’t win by chasing max tokens/sec. You win by controlling concurrency and batch size, allocating KV cache intentionally, and enforcing guardrails that keep mixed traffic from turning into a queueing disaster. Hit TTFT p95/p99 first. Then scale throughput without stealing it from your tail.

 
