Production vLLM deployments live or die on three configuration decisions, and getting any of them wrong shows up early: static KV cache allocation will OOM your GPU long before billing teaches you the same lesson. This guide is written for the operator who already accepts vLLM as the default serving engine and now needs a ranked decision surface, a runbook for the failure modes, and a clean view of the architecture that makes the knobs behave the way they do.
Configuration guidance and architecture descriptions in this article reflect vLLM 0.20.x and the V1 engine, which has been the default since v0.8.0 (released March 2025). Flag behavior and metric names may differ on releases before v0.8.0, when V1 was opt-in via VLLM_USE_V1=1. All commands assume vLLM installed via pip install vllm (tested on Python 3.10+ / CUDA 12.x). For containerized deployments, the official image is vllm/vllm-openai. Check the installation guide for version-specific CUDA requirements.
Cost-per-token: the three decisions that dominate vLLM deployments
At scale with a real inter-token latency SLA, vLLM cost is shaped by configuration choices long before GPU budget enters the conversation. Land the three below, and the remaining tuning surface yields diminishing returns; miss any of them, and no amount of GPU spend will rescue the SLA.
The first decision is framework choice itself. vLLM is the right default for most teams, but TensorRT-LLM, SGLang, and TGI each win in narrow conditions. Committing to vLLM under the wrong workload (deeply branching agentic call graphs, fixed-shape NVIDIA-only deployments at extreme scale) is a slower-to-fix mistake than a flag value.
The second is the memory budget: how much VRAM you cede to KV cache versus weights and activations, expressed through --gpu-memory-utilization and --max-model-len. This is the variable that determines how many concurrent sequences your pool can hold before the scheduler starts preempting. It is also the variable that operators most often leave at defaults on shared infrastructure and then debug for a week.
The third is the batching and admission strategy: continuous batching is on by default, but --enable-chunked-prefill and --enable-prefix-caching decide whether prefill work disrupts your decode latency and whether repeated prompt prefixes are paid for once or every time. Two flags, both cheap to enable, both with workload-dependent payoffs.
The rest of this guide treats these three in order: framework choice first, then the architecture that makes the budget and batching knobs predictable, followed by deployment shapes, memory budgeting, the measurement contract that validates your configuration, the ranked knobs themselves, and finally the failure modes you will see when one of them is off.
Serving framework: vLLM, SGLang, TensorRT-LLM, or TGI
The decision is dominated by workload shape and hardware constraint. The flowchart below leads; the prose underneath fills in the cases where the answer is not “vLLM.”
When vLLM is not the right default
SGLang earns the choice when the workload is structured generation or multi-step agent programs. Its RadixAttention reuses KV state across branching call graphs more aggressively than vLLM’s prefix caching, which matters when a single user turn fans out into a tree of constrained-output sub-calls. For linear chat and completion endpoints with unique prompts, that advantage is minimal to negligible.
TensorRT-LLM has a non-trivial throughput advantage on fixed shapes and a fixed NVIDIA SKU, but the cost is operational: every change to model version, GPU tier, or sequence-length configuration forces an engine rebuild measured in tens of minutes for large models. Teams running one model on one hardware tier at a scale where even marginal throughput gains justify operational overhead can get value from TensorRT-LLM. Most teams don’t.
Text Generation Inference (TGI) overlaps with vLLM on capability and integrates tightly with the Hugging Face ecosystem. The deciding factor is often ecosystem fit: if Hub repos, Spaces, and HF-format configs are already wired into the deployment path, TGI requires less reconfiguration to adopt. Optimization momentum since 2024 has favored vLLM, particularly on the scheduling and KV-cache management side, so greenfield deployments lean vLLM.
For everything else, including AMD GPUs and any workload where future GPU portability is a constraint, vLLM is the answer. Before sizing the deployment, understanding the architectural primitives that make vLLM’s configuration surface predictable will make every subsequent decision more legible.
vLLM architecture: PagedAttention, continuous batching, and V1 modularity
The configuration surface above is only as good as the runtime behavior that backs it. Three architectural pieces give the budget knob, the batching flags, and the scheduler-tuning options their teeth. The framing here is “why does that knob work?” rather than “here is the breakthrough.”
PagedAttention as virtual memory for KV
PagedAttention treats the KV cache the way an operating system treats process memory: as fixed-size physical blocks (16 tokens per block by default) accessed through a per-sequence logical-to-physical block table. Physical blocks live anywhere in GPU memory and don’t need to be contiguous. When a sequence advances, the allocator hands it one more block at a time. When the sequence terminates, every block returns to the free pool immediately. Block sharing across sequences with identical prefix tokens is the foundation that makes prefix caching possible.
The flowchart below shows how the operator-set memory budget translates into runtime behavior, starting from the configuration value rather than from request arrival.
The block-pool sizing step at the top is what makes --gpu-memory-utilization an operator-level budget. The reclaim path at the bottom is what makes eviction an observable event rather than a silent failure: the metrics endpoint reports free-block count and the scheduler logs reclaim actions, which is why the failure-mode catalog can name eviction as a diagnosable signature.
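The block-table idea described above can be sketched in a few lines. This is a toy model, not vLLM's allocator: the `BlockPool` and `Sequence` classes and the 8-block pool are hypothetical names and sizes chosen for illustration, and only the 16-token block size comes from the text.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (the default cited in the text)


@dataclass
class BlockPool:
    """Toy free list of physical KV blocks; not vLLM's actual allocator."""
    num_blocks: int
    free: list[int] = field(init=False)

    def __post_init__(self) -> None:
        self.free = list(range(self.num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("pool exhausted: a real scheduler would preempt here")
        return self.free.pop()

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)


@dataclass
class Sequence:
    """Holds the logical-to-physical block table for one request."""
    pool: BlockPool
    num_tokens: int = 0
    block_table: list[int] = field(default_factory=list)

    def append_token(self) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.pool.allocate())
        self.num_tokens += 1

    def finish(self) -> None:
        # Every block returns to the free pool as soon as the sequence ends.
        self.pool.release(self.block_table)
        self.block_table = []


pool = BlockPool(num_blocks=8)
seq = Sequence(pool)
for _ in range(33):  # 33 tokens span three 16-token blocks
    seq.append_token()
blocks_in_use = len(seq.block_table)
free_while_running = len(pool.free)
seq.finish()
free_after_finish = len(pool.free)
```

The physical block IDs in `block_table` need not be adjacent, which is the property that lets the pool be carved out once at startup and handed out one block at a time.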
Continuous batching at the iteration level
The other half of the throughput story is iteration-level scheduling. Static batching waits for a full batch of N sequences, runs the forward pass, returns all outputs, then admits the next batch; any sequence finishing early leaves its slot idle until the batch completes. The vLLM scheduler operates at the iteration level: when a sequence completes, its slot is freed and a waiting request can be admitted at the next iteration. The result is higher GPU utilization at steady state and lower average queue time, both of which the ranked-knobs section relies on when it claims that prefix caching and chunked prefill change the ITL distribution rather than just the mean.
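To make the slot-idling argument concrete, here is a rough simulation that counts decode iterations under both policies. It is a simplification under stated assumptions: every request is decode-only, one token per sequence per iteration, and the function names and the sample output lengths are made up for illustration.

```python
from collections import deque


def static_batch_iterations(output_lens: list[int], slots: int) -> int:
    """Static batching: each batch runs until its longest sequence finishes."""
    total = 0
    for i in range(0, len(output_lens), slots):
        total += max(output_lens[i:i + slots])
    return total


def continuous_batch_iterations(output_lens: list[int], slots: int) -> int:
    """Iteration-level scheduling: a freed slot is refilled at the next iteration."""
    waiting = deque(output_lens)
    running: list[int] = []  # tokens still to emit for each running sequence
    iterations = 0
    while waiting or running:
        # Admit waiting requests into any free slots before this iteration.
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        # One decode step: every running sequence emits one token;
        # sequences that just emitted their last token leave the batch.
        running = [remaining - 1 for remaining in running if remaining > 1]
        iterations += 1
    return iterations


lens = [2, 10, 2, 10, 2, 2]  # hypothetical output lengths, in tokens
static_iters = static_batch_iterations(lens, slots=2)
continuous_iters = continuous_batch_iterations(lens, slots=2)
```

With these lengths the static policy leaves a slot idle for most of two batches, while the continuous policy keeps both slots busy on every iteration.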
V1 modularity
The vLLM V1 re-architecture splits the scheduler, KV cache manager, and model runner into distinct, modular components. For operators, the practical change is a cleaner configuration surface; the modular design also provides developer-level hackability for custom scheduler and cache manager implementations. The disaggregated-serving direction in the closing section rests on this modular substrate.
Deployment surfaces: single-GPU, tensor-parallel, serverless
Three deployment shapes cover production vLLM workloads. The VRAM sizing rule is the same in all three: budget weights as 2 bytes per parameter at BF16/FP16, 1 byte at INT8, and 0.5 bytes at INT4, then subtract weights from --gpu-memory-utilization × VRAM to get the KV pool budget.
Single-GPU
The minimal configuration on an L40S 48GB is:
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 16384
Mistral-7B-Instruct-v0.3 at BF16 occupies roughly 14 GB for weights. At 0.90 utilization on a 48 GB L40S the engine has a 43.2 GB envelope, which leaves roughly 29 GB for the KV pool. Capping --max-model-len at 16K rather than the model’s 32K maximum halves the worst-case per-sequence KV claim and roughly doubles the concurrency the same pool can support; in production chat traffic the truncation is invisible. On an A100 40GB the same model leaves about 22 GB for KV; on an A100 80GB, about 58 GB. The numerical method is identical, only the GPU envelope changes.
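The arithmetic above is easy to script so that it travels with the deployment config. The helper below is a sketch of the sizing rule exactly as this guide states it; the function name is hypothetical, and it deliberately ignores activation memory and CUDA context, so treat the result as an upper bound on the KV pool rather than a measured value.

```python
# Bytes per parameter for each weight format, per the sizing rule in the text.
BYTES_PER_PARAM = {"bf16": 2.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def kv_pool_budget_gb(
    vram_gb: float,
    params_billion: float,
    dtype: str = "bf16",
    gpu_memory_utilization: float = 0.90,
) -> float:
    """Envelope (utilization x VRAM) minus weights; an upper bound on KV space."""
    weights_gb = params_billion * BYTES_PER_PARAM[dtype]
    envelope_gb = vram_gb * gpu_memory_utilization
    return envelope_gb - weights_gb


# The three GPUs discussed above, with a 7B model at BF16.
l40s_48 = kv_pool_budget_gb(48, 7)
a100_40 = kv_pool_budget_gb(40, 7)
a100_80 = kv_pool_budget_gb(80, 7)
```

A negative or near-zero result means the weights alone fill the envelope, which is the signal to quantize, add GPUs, or pick a larger tier before touching any other flag.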
Tensor-parallel for larger models
A 70B-class model in BF16 will not fit on a single GPU. Qwen2.5-72B-Instruct at BF16 occupies roughly 144 GB of weights. Two 80 GB GPUs hold that on paper (160 GB), but at 0.90 utilization the combined envelope is also 144 GB, which leaves essentially nothing for the KV pool. Four 80 GB GPUs give a 288 GB envelope and roughly 144 GB of KV budget.
vllm serve Qwen/Qwen2.5-72B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768
Cap --max-model-len to your actual use case; the Qwen2.5-72B architectural maximum is 128K, and a ceiling that large will exhaust the KV pool at moderate concurrency even on four 80 GB GPUs.
Tensor parallelism shards the attention and feed-forward weight matrices across the configured number of devices and exchanges activation tensors at each layer boundary. The interconnect topology matters. NVLink carries that traffic at bandwidths that keep the per-layer cost in the noise; PCIe is functional but adds measurable overhead per forward pass, with workload-dependent throughput losses that can reach the mid-double-digit-percent range in adverse topologies. If the host machine has the model split across GPUs that aren’t NVLink-bridged, expect to see that overhead reflected in the throughput numbers, not just the topology diagram.
Serverless via Runpod
For teams that need a vLLM endpoint on H100, A100, or L40S without operating GPU infrastructure, Runpod’s Serverless provisions one in minutes (initial model download may extend total setup time). The console walkthrough is covered end-to-end in the Serverless quickstart: endpoint creation, vLLM worker selection, model ID, the MAX_MODEL_LEN / GPU_MEMORY_UTILIZATION / DTYPE env vars, and HF_TOKEN for gated checkpoints like Llama-3 or Gemma. The configuration surface that matters for production is what comes next.
Runpod maps every AsyncEngineArgs field to an uppercase environment variable of the same name, so any launch-script flag has a configuration-panel equivalent that is editable without redeploying. The endpoint exposes an OpenAI-compatible API at https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1, which the OpenAI SDK consumes without code changes:
from openai import OpenAI

client = OpenAI(
    api_key="your-runpod-api-key",
    base_url="https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1",
)
completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[
        {"role": "user", "content": "Summarize the trade-offs of FP8 KV cache quantization."}
    ],
    max_tokens=512,
)
print(completion.choices[0].message.content)
Billing is per-second of active compute, which makes serverless a useful target for ramp testing without committing to reserved capacity. One operational caveat: workers scale to zero between requests, so cold start (the interval from first request to first token on a freshly-initialized worker) ranges from roughly 30 seconds on cached images to 90+ seconds on first pull, before any inference latency. Run a warm-up request before recording p99 metrics.
Memory budgeting: multi-tenant discipline on shared GPUs
GPU memory on shared infrastructure is best treated as a tenancy budget rather than a single number to dial. --gpu-memory-utilization is the primitive that exposes the budget to vLLM, and the right value depends on what else lives on the device.
On a shared node, every co-tenant (a monitoring agent, a sidecar model, a CUDA debugger) competes for the same headroom, and a peak utilization that worked in isolation can OOM in production. The discipline is to allocate a per-tenant headroom share before deciding the utilization value, then verify with watch -n1 nvidia-smi --query-gpu=memory.used,memory.free,utilization.gpu --format=csv during a ramp test. Confirm that memory.used at peak load stays within the tenant’s allocated share and that memory.free never drops below the headroom you reserved for CUDA context and activation buffers. This headroom discipline is operator practice, not a vLLM feature; the framework gives you a budget knob and trusts you to know what fraction of the device is yours.
Treating the budget as configuration that you version alongside the model and tenant changes is the practice that prevents the next incident. Platforms that surface it as a first-class endpoint setting (Runpod’s GPU_MEMORY_UTILIZATION env var is one example) make the discipline easier; on a hand-rolled launch script the same value belongs in a checked-in config file, not in the bash history.
Measurement contract: TTFT, ITL, and ramp testing
Production vLLM deployments are bounded by a measurement contract the operator owes the SLA. Four quantities define the contract, and the protocol that verifies them is a ramp test against the actual model on representative traffic. Definitions and methodology belong together; separating them is what produces the dashboards that look healthy until production breaks them.
TTFT (Time To First Token) is the wall-clock interval from request arrival to the first token streamed back. It is dominated by prefill: the cost of pushing the entire input through every attention layer once. Sub-second TTFT is the correct target for interactive chat; multi-second TTFT is acceptable for batch summarization where no human is watching the cursor.
ITL (Inter-Token Latency) is the gap between successive output tokens during decode. TPOT (Time Per Output Token) is the mean of that distribution across the full output. Interactive UX tracks ITL consistency far more than mean TPOT, because users perceive cadence stalls more readily than variations in average rate, and a mean of 50 ms with a clean p99 reads smoother than a faster mean with a long tail.
End-to-end latency is TTFT plus the sum of all ITLs across the response. SLAs typically cite this number, but it lags as a diagnostic: a healthy deployment shows p99 ITL within a small multiple of the median, and when that multiple stretches you are seeing the symptoms catalogued later in this guide (KV eviction, prefill-decode contention, communication stalls) before they show up in end-to-end numbers.
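The four definitions reduce to simple arithmetic over per-token timestamps. The sketch below assumes you already have the request arrival time and the wall-clock time of each streamed token; the function name and the sample timestamps are hypothetical.

```python
from statistics import mean


def latency_metrics(arrival: float, token_times: list[float]) -> dict[str, float]:
    """Derive TTFT, ITL-based stats, and end-to-end latency for one request."""
    ttft = token_times[0] - arrival
    # ITL: gap between each pair of successive output tokens.
    itls = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    tpot = mean(itls) if itls else 0.0  # TPOT: mean of the ITL distribution
    e2e = ttft + sum(itls)              # end-to-end: TTFT plus all ITLs
    return {
        "ttft": ttft,
        "tpot": tpot,
        "max_itl": max(itls, default=0.0),
        "e2e": e2e,
    }


# Hypothetical trace: the last token arrives after a visible stall.
metrics = latency_metrics(arrival=0.0, token_times=[0.5, 0.75, 1.0, 1.75])
```

In this trace the mean TPOT is about 0.42 s while the worst gap is 0.75 s, which is the kind of tail that a mean-only dashboard hides.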
Reading the benchmark output
vLLM ships a benchmark harness in its source tree that measures all four quantities against a running server. If you installed via pip, clone the repo first: git clone https://github.com/vllm-project/vllm && cd vllm. Start the server in a separate terminal (vllm serve <model> ...), then run the benchmark. Recent releases also expose the same harness as the vllm bench serve subcommand, so check which form your installed version ships. The --dataset-name sharegpt flag downloads the ShareGPT dataset on first use; substitute --dataset-name random for air-gapped environments.
python benchmarks/benchmark_serving.py \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --request-rate 4 \
  --num-prompts 800 \
  --dataset-name sharegpt \
  --host localhost \
  --port 8000
The output reports mean, median, and p99 for TTFT, ITL, and TPOT, plus aggregate throughput in tokens per second. Read the p99 columns first. Mean values smooth over the eviction events and contention spikes that actually shape the user experience.
Ramp methodology
A single-rate benchmark tells you whether one operating point is healthy. Finding the serving ceiling requires ramping. Step --request-rate upward (1, 2, 4, 8, 16, …) and record p99 ITL at each step. The point where p99 ITL begins growing super-linearly with request rate is the ceiling for the current configuration. Beyond that point the deployment is capacity-constrained, most commonly due to KV pool pressure, scheduler oversubscription, or a combination of both. The configuration changes in the next section move that ceiling; the ramp test is what proves they did.
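One way to turn the ramp into a number is to compare how fast p99 ITL grows against how fast the request rate grows between adjacent steps. The heuristic below is an assumption of this sketch, not a vLLM feature: it flags the first step where latency grows by a larger factor than the rate did. The function name and the sample measurements are invented for illustration.

```python
from typing import Optional


def find_ceiling(samples: list[tuple[float, float]]) -> Optional[float]:
    """Return the last healthy request rate, or None if no knee was observed.

    samples: (request_rate, p99_itl_ms) pairs sorted by increasing rate.
    """
    for (rate_lo, p99_lo), (rate_hi, p99_hi) in zip(samples, samples[1:]):
        # Super-linear step: latency grew by a larger factor than load did.
        if p99_hi / p99_lo > rate_hi / rate_lo:
            return rate_lo
    return None


# Hypothetical ramp results, one row per --request-rate step.
ramp = [(1, 20.0), (2, 21.0), (4, 23.0), (8, 30.0), (16, 95.0)]
ceiling = find_ceiling(ramp)
```

Rerunning the same ramp after a configuration change and comparing the returned rate gives a before/after figure for whether the ceiling actually moved.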
Configuration knobs: four flags ranked by impact
Once the deployment surface is fixed, four flags do most of the work on a standard mixed-traffic deployment. Treat the order below as the baseline impact ranking; the “when this matters” line on each one is what you check before deciding to enable it. Two additional features for specific workload classes follow in the next section.
Quantization (--quantization awq). Largest single memory win available. AWQ and GPTQ cut weight footprint by half (INT8) or 75% (INT4) relative to BF16, with quality degradation that is model- and benchmark-dependent but usually small for instruction-tuned models on standard tasks. AWQ (Activation-aware Weight Quantization) calibrates against activation distributions rather than applying static rounding, which generally produces better outputs at the same bit width.
vllm serve <org>/<model>-AWQ \
  --quantization awq
The --quantization awq flag expects the model checkpoint to already be in AWQ format. Pointing it at a standard BF16 checkpoint will produce a runtime error, not a silent quality degradation. Search the Hub for a *-AWQ variant of your model, or run a post-hoc quantization pass with AutoAWQ before serving.
When this matters: any deployment where weights are crowding the KV pool or where you want headroom for higher concurrency without moving to a larger GPU. Verify the chosen model has an AWQ checkpoint on the Hub; if not, GPTQ is the post-hoc alternative.
FP8 KV cache (--kv-cache-dtype fp8). Storing KV in FP8 instead of BF16 halves cache memory; at 64K context the KV-cache footprint that previously consumed roughly 8 GB drops to about 4 GB on the running model. Quality degradation is measurable but small on standard benchmarks.
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --kv-cache-dtype fp8
FP8 has native hardware support on H100 (Hopper). L40S (Ada) also has FP8 tensor cores, but kernel coverage in vLLM depends on the attention backend. A100 (Ampere) has no FP8 hardware, so values are converted on the fly, which still saves memory but at reduced throughput gains. Verify the behavior on your GPU tier before assuming compute neutrality.
When this matters: long-context workloads where the KV pool, not the weights, is the binding budget. At 4K-8K context the savings are real but rarely change the concurrency story.
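The 8 GB and 4 GB figures can be reproduced from the model's attention shape. The sketch below uses Mistral-7B's published configuration (32 layers, 8 KV heads under grouped-query attention, head dimension 128); the function name is hypothetical, and the calculation ignores any per-tensor scaling factors that an FP8 cache may store alongside the values.

```python
def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int
) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: one key vector and one value vector per head per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


MISTRAL_7B = dict(num_layers=32, num_kv_heads=8, head_dim=128)

bf16_per_token = kv_bytes_per_token(**MISTRAL_7B, bytes_per_value=2)
fp8_per_token = kv_bytes_per_token(**MISTRAL_7B, bytes_per_value=1)

context_tokens = 64 * 1024
bf16_gib = bf16_per_token * context_tokens / 2**30
fp8_gib = fp8_per_token * context_tokens / 2**30
```

Dividing the KV pool budget by the per-token figure gives the total number of tokens the pool can hold, which is the quantity that concurrency is ultimately bounded by.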
Prefix caching (--enable-prefix-caching). vLLM hashes the token sequence of each KV block and reuses materialized blocks across requests with shared prefixes. A multi-tenant chat system with a common system prompt or a RAG pipeline that retrieves from a small corpus pays prefill once for the shared portion instead of every request. The fraction of prefill compute eliminated is workload-dependent and tracks the prefix-overlap rate of your traffic.
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --enable-prefix-caching
When this matters: any workload with non-trivial prompt-prefix overlap, including agentic systems that send the same tool definitions on every call.
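The block-hashing mechanism can be illustrated with a toy cache. This is a sketch rather than vLLM's implementation: the hash function, the string encoding, and the helper names are assumptions, and chaining each block's hash to its parent is shown here so that a block only matches when everything before it matches too.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block


def block_hashes(token_ids: list[int]) -> list[str]:
    """Hash each full block together with the hash of the prefix before it."""
    hashes: list[str] = []
    parent = ""
    full_len = len(token_ids) - len(token_ids) % BLOCK_SIZE  # skip the partial tail
    for start in range(0, full_len, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        payload = parent + ":" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def cached_prefix_blocks(cache: set[str], token_ids: list[int]) -> int:
    """Count leading blocks whose KV state is already materialized."""
    hits = 0
    for digest in block_hashes(token_ids):
        if digest not in cache:
            break
        hits += 1
    return hits


system_prompt = list(range(32))  # two full blocks shared by every request
request_a = system_prompt + [1000 + i for i in range(20)]
request_b = system_prompt + [2000 + i for i in range(20)]

cache = set(block_hashes(request_a))            # request A has been served
hits_for_b = cached_prefix_blocks(cache, request_b)
prefill_tokens_saved = hits_for_b * BLOCK_SIZE
```

Only whole blocks are reusable in this model, so a shared prefix shorter than one block saves nothing, and the savings scale with the prefix-overlap rate of the traffic.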
Chunked prefill (--enable-chunked-prefill). Splits long prefill phases into smaller chunks and interleaves them with decode steps from in-flight sequences. Without it, a single 10K-token prefill stalls decode for every concurrent sequence for the duration, which surfaces as a visible ITL spike. With it, prefill is budgeted across iterations at some TTFT cost on the prefilling request (tunable via --max-num-batched-tokens) and steady ITL for everyone else.
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
  --enable-prefix-caching \
  --enable-chunked-prefill
When this matters: mixed workloads where chat traffic and long-document requests share the same endpoint. The TTFT tradeoff on the prefilling request is small relative to the ITL stability it buys for concurrent sequences.
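The budgeting behavior can be approximated with a per-iteration token budget that serves decode first and gives prefill whatever is left. The sketch below is a simplified model, not vLLM's scheduler: the function names are hypothetical, `token_budget` stands in for the batched-token limit, and the number of decoding sequences is held constant.

```python
def schedule_iteration(
    num_decoding: int, prefill_remaining: int, token_budget: int
) -> tuple[int, int]:
    """Return (decode_tokens, prefill_tokens) scheduled in one iteration."""
    # Decode goes first: one token per in-flight sequence.
    decode_tokens = min(num_decoding, token_budget)
    # Prefill gets only the leftover budget, so it cannot stall decode.
    prefill_tokens = min(prefill_remaining, token_budget - decode_tokens)
    return decode_tokens, prefill_tokens


def iterations_to_prefill(prompt_len: int, num_decoding: int, token_budget: int) -> int:
    """Count iterations needed to finish one long prompt under the budget."""
    remaining, steps = prompt_len, 0
    while remaining > 0:
        _, chunk = schedule_iteration(num_decoding, remaining, token_budget)
        if chunk == 0:
            raise ValueError("token budget leaves no room for prefill")
        remaining -= chunk
        steps += 1
    return steps


# A 10K-token prompt arriving while 48 sequences are decoding.
steps = iterations_to_prefill(prompt_len=10_000, num_decoding=48, token_budget=2048)
```

A smaller budget spreads the prompt over more iterations, which raises TTFT for that request and shortens each iteration for the sequences already decoding; that is the tradeoff the flag exposes.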
Speculative decoding and multi-LoRA: throughput levers for specific workloads
Two 2025-era features change the throughput story for specific workload classes.
Speculative decoding runs a small draft model in front of the target model to propose tokens that the target then verifies in parallel. On workloads where the draft model agrees with the target most of the time (consistent prose, predictable code), the verification step accepts multiple drafted tokens per target step, which raises effective decode throughput without changing output quality. The win shrinks on outputs the draft model handles poorly, so the feature pays back on workload classes more than on benchmarks.
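On V1-era releases, speculative decoding is configured through --speculative-config, a JSON object with fields such as model (the draft model ID) and num_speculative_tokens (typically 3-5). Older releases used separate --speculative-model <draft-model-id> and --num-speculative-tokens <N> flags. The draft model must share the target’s tokenizer. VRAM overhead is the full weight footprint of the draft model in addition to the base.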
When to use: latency-sensitive workloads where you can afford the draft-model VRAM and where the target’s outputs are predictable enough for the draft to agree often. Verify current support and operator semantics in the vLLM documentation before committing.
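The propose-then-verify loop is easiest to see in its greedy form. The sketch below assumes greedy decoding and represents the two models by their token choices only; the function name and the token IDs are made up, and sampling-based acceptance rules used in practice are more involved.

```python
def verify_draft(draft_tokens: list[int], target_tokens: list[int]) -> list[int]:
    """Greedy verification of one speculative step.

    draft_tokens:  the k tokens proposed by the draft model.
    target_tokens: the target model's own choice at each of the k positions,
                   plus one extra position after the last draft token (k + 1 total).
    """
    accepted = 0
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            break
        accepted += 1
    # Keep the agreed prefix, then take the target's token at the first
    # disagreement (or its bonus token if every draft was accepted).
    return draft_tokens[:accepted] + [target_tokens[accepted]]


partial = verify_draft([5, 7, 9], [5, 7, 2, 4])  # draft wrong at position 3
full = verify_draft([5, 7, 9], [5, 7, 9, 4])     # draft right everywhere
```

Every emitted token is one the target would have produced on its own, which is why output quality is unchanged; the gain comes from emitting several tokens per target forward pass when the draft agrees.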
Multi-LoRA serving lets a single vLLM instance host the base model once and swap in LoRA adapters per request. For deployments serving many fine-tuned variants of the same base, this collapses the GPU footprint of “one endpoint per adapter” into “one endpoint, many adapters.” The tradeoff is per-request adapter loading latency on cold paths; pre-loading adapters with a dummy warm-up request mitigates this, and you should check the docs for your target vLLM version.
Enable with --enable-lora. Register adapters at startup via --lora-modules <name>=<path-or-hub-id> (repeatable). Control concurrency with --max-loras and --max-cpu-loras. Adapters not listed at startup can be loaded dynamically via the /v1/load_lora_adapter endpoint (vLLM 0.5+), which requires setting VLLM_ALLOW_RUNTIME_LORA_UPDATING=True on the server.
When to use: SaaS deployments with per-tenant fine-tunes on a shared base, or any catalog of LoRA variants where one-endpoint-per-adapter is operationally untenable.
Failure modes: KV eviction, prefill-decode contention, OOM
Four failure modes account for most production vLLM regressions. Each entry pairs an observable symptom with the root cause and the remediation.
KV cache eviction. Symptom: p99 ITL spikes to several multiples of the median while mean throughput holds; vLLM logs show “number of free blocks” trending toward zero. Cause: the block allocator has run out of free blocks and is preempting in-flight sequences, which then need to recompute their KV state when re-admitted. Fix: lower --max-model-len to the actual maximum your application needs, raise --gpu-memory-utilization if the device has unclaimed headroom (lower it only when another process on the device is competing for the same memory), or move to a larger GPU. Enabling --kv-cache-dtype fp8 reduces the per-token KV cache cost by roughly half (the vLLM blog reports reduction to ~54% of BF16 in best cases) and is often sufficient for long-context workloads.
Prefill-decode contention. Symptom: ITL spikes correlated with the arrival of long-prompt requests rather than with overall load; mean ITL is fine but the distribution has visible tails after every long prompt. Cause: prefill is compute-bound on dense matmuls against long token sequences, decode is memory-bandwidth-bound on matrix-vector products, and a scheduler running both on one GPU has to switch between profiles inside a single iteration. Fix: --enable-chunked-prefill budgets prefill across iterations and is the first remediation. If contention persists at high concurrency with mixed prompt lengths, the architectural answer is to split prefill and decode onto different instances, covered in the closing section.
Out-of-memory at admission. Symptom: CUDA OOM during high-concurrency bursts; the engine refuses new requests rather than running them slowly. Cause: weights, KV pool, activation memory, and CUDA context together exceeded the budget set by --gpu-memory-utilization. The static-allocation case is the classic example: a slot-per-sequence allocator at long max_seq_len reserves so much KV pool per slot that a fourth or fifth request cannot be admitted even though their working sets would fit. With PagedAttention the equivalent failure is reaching pool exhaustion, which manifests as eviction first; hard OOM can follow when additional memory pressure pushes usage past the allocated budget. Fix: recompute the budget from first principles (weights bytes + KV pool budget at chosen --max-model-len + 5-10% headroom) and confirm with a ramp test before declaring the configuration shipped.
Tensor-parallel communication stalls. Symptom: p99 latency on multi-GPU deployments is disproportionately high relative to single-GPU baselines after accounting for the weight-shard benefit; throughput is sensitive to --tensor-parallel-size beyond what shard math predicts. Cause: inter-GPU activation transfers at each layer boundary are constrained by PCIe bandwidth (roughly 64 GB/s bidirectional on a Gen4 x16 link) instead of NVLink (600 GB/s on A100 NVLink3, 900 GB/s on H100 NVLink4). Fix: verify GPU interconnect topology with nvidia-smi topo -m. If GPUs are PCIe-only, the throughput loss is architectural; mitigation is reducing --tensor-parallel-size (to minimize cross-GPU transfers) or migrating to NVLink-bridged hardware.
Production observability: vLLM metrics, Prometheus, and alertable thresholds
Observability for a production vLLM deployment is layered. vLLM exposes a Prometheus-format metrics endpoint at http://<host>:8000/metrics by default (same port as the OpenAI-compatible API, no additional flag required) that surfaces request and KV-cache state; GPU-level tools sit underneath as the second layer. A minimal Prometheus scrape config:
scrape_configs:
  - job_name: vllm
    static_configs:
      - targets: ['localhost:8000']
The following metric names are accurate as of vLLM 0.20.x. Verify against /metrics on your running instance; names have changed between minor versions. Four metrics carry most of the alerting signal:
- KV cache utilization (vllm:gpu_cache_usage_perc). Fraction 0-1 representing cache pool consumption. The leading indicator for eviction. Alert when sustained usage exceeds 0.85, well before eviction starts. This metric is the dashboard companion to the eviction failure mode.
- Pending request queue depth (vllm:num_requests_waiting). The leading indicator for scheduler oversubscription. A queue that grows without bound indicates the deployment is past its serving ceiling; what’s needed is more capacity or stricter admission control, not more tuning.
- Per-request TTFT and ITL distributions (vllm:time_to_first_token_seconds, vllm:time_per_output_token_seconds). The end-user-facing contract. Alert on p99 thresholds tied to the bands defined in the measurement contract, not on means.
- GPU memory utilization and SM activity. Underlying-resource view. nvidia-smi, nvitop, or DCGM exporters fill this layer. Useful when investigating whether contention is on the device or in the scheduler.
Alert thresholds should cite the SLA bands defined in the measurement contract rather than carrying their own copies; one source of truth keeps the dashboard from drifting away from the contract over time.
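For a quick check outside Prometheus, the metrics endpoint's text format is simple enough to parse by hand. The sketch below is a minimal illustration: the parser is hypothetical and handles only plain `name{labels} value` lines, the sample payload is hand-written rather than captured from a server, and the single threshold is the 0.85 figure from the list above. A one-off reading cannot tell you whether usage is sustained, so a real alert still belongs in the monitoring stack.

```python
def parse_gauges(text: str) -> dict[str, float]:
    """Extract metric values from Prometheus text format, ignoring labels."""
    values: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        values[name] = float(value)
    return values


# Threshold taken from the KV cache utilization bullet in the list above.
THRESHOLDS = {"vllm:gpu_cache_usage_perc": 0.85}

# Hand-written sample payload for illustration only.
sample = """# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="demo"} 0.91
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 3.0
"""

gauges = parse_gauges(sample)
breached = [name for name, limit in THRESHOLDS.items() if gauges.get(name, 0.0) > limit]
```

Keeping the thresholds in one dictionary mirrors the single-source-of-truth point above: the same values can feed both an ad hoc check like this and the alert rules.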
Pre-launch checklist: validation steps and the disaggregated-serving roadmap
Before the endpoint takes production traffic, run through this short list:
- --max-model-len set to the actual maximum context your application uses, not the model’s architectural ceiling (128K is typical for Llama-3.1 and Qwen2.5 class models, which silently inherit it on a default launch).
- --gpu-memory-utilization reduced from the default of 0.9 if the device is shared, with the per-tenant share documented somewhere your on-call can find it.
- Ramp test with benchmark_serving.py on representative traffic, with p99 ITL recorded at each rate up to the target concurrency.
- Prometheus scrape configured for the vLLM metrics endpoint and alerts wired to the thresholds in the measurement contract.
Disaggregated prefill-decode serving is the architectural answer to the contention failure mode for workloads that have outgrown what --enable-chunked-prefill can absorb. The direction is toward multi-node deployments that route prefill to compute-optimized instances and decode to memory-bandwidth-optimized instances. Production readiness for any given vLLM version belongs in the docs, not in this guide; check before planning a deployment around disaggregated prefill-decode serving.
For a validated path from model selection through VRAM sizing and environment-variable configuration, Runpod’s serverless vLLM documentation walks through the full setup against the same knobs ranked above.

