This article was originally published on aifoss.dev

---
title: 'vLLM Review 2026: Production LLM Inference at Scale'
description: 'vLLM delivers high-throughput LLM serving with PagedAttention and continuous batching. 2026 review: installation, benchmarks, and when to choose it over Ollama.'
pubDate: 'May 25 2026'

tags: ["vllm", "ai", "llm", "python", "gpu"]
---

vLLM is the inference engine you reach for when Ollama stops being enough. Built at UC Berkeley's Sky Computing Lab and now stewarded under the Linux Foundation's PyTorch Foundation umbrella, it's open-source (Apache 2.0), opinionated about performance, and deliberately harder to set up than the alternatives. That trade-off is the whole point.

If you're serving LLMs to more than a handful of concurrent users, vLLM's two core innovations — PagedAttention and continuous batching — change the math considerably. If you're running a model locally for yourself, it's overkill.

This review covers v0.21.0, released May 15, 2026, on Linux with NVIDIA hardware.

What vLLM does differently

Every LLM inference engine has to solve the same problem: the KV cache. As the model generates tokens, it needs to store key and value tensors for all previous tokens in the context window. That storage eats VRAM fast, and how it's managed determines how many concurrent requests you can handle.
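To make "eats VRAM fast" concrete, here is a back-of-the-envelope sketch of KV cache size per token. The model shape (32 layers, 8 KV heads, head dimension 128, FP16) is an assumption matching the published Llama 3 8B configuration, and the function name is made up for illustration.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache that a single token occupies across all layers."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed Llama 3 8B shape: 32 layers, 8 KV heads (GQA), head_dim 128, FP16.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
per_request_gib = per_token * 8192 / 2**30

print(f"{per_token // 1024} KiB per token")                  # 128 KiB per token
print(f"{per_request_gib:.2f} GiB per 8192-token request")   # 1.00 GiB per 8192-token request
```

Under those assumptions, every request that fills an 8192-token context holds about 1 GiB of cache, which is why cache management rather than raw compute tends to cap concurrency.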

Traditional serving allocates VRAM statically — you reserve a fixed block per request and fill it as generation proceeds. The waste is significant: you're holding memory for the maximum possible context even when actual generation is using 10% of it.

PagedAttention solves this by borrowing the OS virtual memory idea. The KV cache is divided into fixed-size pages, and only the pages actually needed by active tokens are allocated. Memory fragmentation drops to near-zero, and the same VRAM supports far more concurrent sequences.
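The paging idea can be sketched with a toy allocator. This is not vLLM's implementation; the class name, the block size of 16 tokens, and the pool size are all illustrative assumptions. It only shows how a per-sequence block table grows on demand instead of reserving the full context up front.

```python
class PagedKVAllocator:
    """Toy block allocator: sequences claim fixed-size blocks only as they grow."""

    def __init__(self, total_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(total_blocks))
        self.block_tables = {}   # sequence id -> list of physical block ids
        self.lengths = {}        # sequence id -> number of tokens stored

    def append_token(self, seq_id: str) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        # Claim a new block only when the last one is full (or none exists yet).
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        # Finished sequences return their blocks to the shared pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


MAX_CONTEXT = 8192
alloc = PagedKVAllocator(total_blocks=1024, block_size=16)
for _ in range(810):  # one request that has produced 810 tokens so far
    alloc.append_token("req-1")

paged_slots = len(alloc.block_tables["req-1"]) * alloc.block_size
print(f"static reservation: {MAX_CONTEXT} token slots")  # 8192
print(f"paged allocation:   {paged_slots} token slots")  # 816
```

The only waste left in this sketch is the unfilled tail of the last block (6 slots here), compared with thousands of idle slots under a fixed per-request reservation.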

Continuous batching is the scheduling counterpart. Traditional batched inference waits for a full batch to complete before accepting new requests. Continuous batching lets new requests slot into the batch the moment a slot opens — mid-generation. Tail latency shrinks; GPU utilization rises.
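A small simulation shows the scheduling difference. It is a deliberately simplified model, not vLLM's scheduler: all requests are assumed to be queued at time zero, each decode step produces one token for every running request, and the workload numbers are invented.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Each batch runs until its longest request finishes; then the next batch starts."""
    total = 0
    for i in range(0, len(lengths), slots):
        total += max(lengths[i:i + slots])
    return total


def continuous_batching_steps(lengths, slots):
    """A waiting request takes over a slot as soon as the previous occupant finishes."""
    queue = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or running:
        while queue and len(running) < slots:  # refill free slots before each step
            running.append(queue.popleft())
        steps += 1                             # one decode step for every running request
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical workload: tokens to generate per request, in arrival order.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batching_steps(lengths, slots=4))      # 200
print(continuous_batching_steps(lengths, slots=4))  # 110
```

In the static case the short requests sit idle behind the 100-token one in their batch. In the continuous case their slots are reused after 10 steps, so the second long request starts almost immediately.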

These aren't academic improvements. They're why vLLM benchmarks at around 187 tokens/second on Llama 3 8B under 8 concurrent users, versus Ollama's 82 tokens/second in the same scenario. At peak throughput with multiple concurrent requests, the gap widens further — roughly 793 tok/s versus 41 tok/s according to third-party benchmarks from Markaicode and SitePoint (see Sources).

Installation

vLLM runs on Linux with NVIDIA GPUs as its primary target. The simple path:

pip install vllm

The default wheel targets CUDA 12.4 on Linux. It pulls in PyTorch 2.11 and the rest of its dependencies, so install into a fresh virtual environment to avoid version conflicts.

System requirements:

Python ≥3.10 and <3.15 (Python 3.14 added in v0.21.0)
Linux (Ubuntu 20.04+ recommended)
NVIDIA GPU with CUDA 12.4; CUDA 13.0 for newer Blackwell features
VRAM appropriate to the model you're serving (see table below)
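If you script your environment setup, the Python bound from the list above is easy to check before running pip. This is a minimal sketch; the function and constant names are hypothetical, and the range is simply the one stated above.

```python
import sys

MIN_PYTHON = (3, 10)  # inclusive lower bound stated above
MAX_PYTHON = (3, 15)  # exclusive upper bound stated above


def python_supported(version=None) -> bool:
    """Return True if (major, minor) falls inside the stated range."""
    if version is None:
        version = sys.version_info[:2]
    return MIN_PYTHON <= tuple(version) < MAX_PYTHON


current = sys.version_info[:2]
status = "supported" if python_supported() else "outside the stated range"
print(f"Python {current[0]}.{current[1]}: {status}")
```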

AMD ROCm support exists but trails the NVIDIA path — context-length limitations on AMD GPUs were still being worked through as of April 2026 (64k-token wall on certain configurations). Windows received native support in 2026 but requires CUDA 13.0 and RTX 6000 Ada or newer; WSL2 remains the more reliable Windows path for older hardware.

If you want to skip local driver setup entirely, RunPod provides vLLM-ready GPU instances with pre-installed CUDA environments. Useful for evaluating vLLM on production hardware before committing to a server build.

Serving your first model

The quickstart launches an OpenAI-compatible API server:

vllm serve meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 8000

Or with explicit configuration:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.9 \
  --max-model-len 8192

--gpu-memory-utilization 0.9 caps the vLLM process at 90% of each GPU's VRAM. Model weights and activation memory come out of that budget first, and whatever remains becomes the KV cache pool. Tune it downward if you're hitting OOM errors, for example when other processes share the GPU.

Once running, the server accepts standard OpenAI API calls:

from openai import OpenAI

# The OpenAI client requires a non-empty key; vLLM only checks it
# if the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model name
    messages=[{"role": "user", "content": "Explain PagedAttention in two sentences."}],
)
print(response.choices[0].message.content)

Any application already written for the OpenAI API works with vLLM by changing the base URL. That drop-in compatibility is the main reason teams choose it for self-hosted API infrastructure.
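The compatibility is a plain HTTP contract, so an SDK is optional. As a rough sketch using only the standard library, the request below targets the OpenAI-style /v1/chat/completions route; the helper name and the "my-served-model" placeholder are assumptions for illustration, and the actual send is left commented out because it needs a running server.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8000/v1",
    "my-served-model",  # placeholder: use the model name passed to `vllm serve`
    "Explain PagedAttention in two sentences.",
)
print(req.get_method(), req.full_url)

# Sending requires a live server, so it is not executed here:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```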

For 70B models across multiple GPUs, tensor parallelism is one flag away:

vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --host 0.0.0.0 \
  --port 8000

vLLM requires identical GPUs for tensor parallelism — all cards in the group need matching VRAM and compute capability.
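Two quick sanity checks are worth doing before picking a tensor-parallel size. The sketch below assumes weights are split evenly across cards and that the attention head count has to divide evenly by the group size, which is the usual constraint for this style of sharding. The 64-head figure is an assumption for a 70B Llama-style model, and both helper names are hypothetical.

```python
def per_gpu_weight_gb(params_billion: float, bytes_per_param: float, tp_size: int) -> float:
    """Approximate weight memory per GPU when weights are split evenly across tp_size cards."""
    return params_billion * bytes_per_param / tp_size


def heads_divide_evenly(num_attention_heads: int, tp_size: int) -> bool:
    """Attention heads are sharded across GPUs, so the count should divide evenly."""
    return num_attention_heads % tp_size == 0


# Assumed: 70B parameters, FP16 weights (2 bytes each), 64 attention heads.
print(per_gpu_weight_gb(70, 2, tp_size=4))  # 35.0
print(heads_divide_evenly(64, tp_size=4))   # True
print(heads_divide_evenly(64, tp_size=3))   # False
```

The per-GPU figure covers weights only; KV cache and activations need room on top of it.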

VRAM requirements by model size

Model size	FP16 (unquantized)	FP8 quantization	Example GPU
7B	~14 GB	~8 GB	RTX 3080 10GB with FP8
13B	~26 GB	~14 GB	RTX 3090 24GB with FP8
34B	~68 GB	~36 GB	A100 80GB or 2× A100 40GB
70B	~140 GB	~76 GB	2× H100 80GB or 4× A100 40GB

FP8 quantization (--quantization fp8) roughly halves VRAM requirements with minimal quality loss — it's the first flag to add when you're memory-constrained on consumer hardware. For a broader look at quantization tradeoffs across GGUF, AWQ, and FP8 formats, the GGUF quantization guide has the specifics.
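The halving follows from simple arithmetic: weight memory is roughly parameter count times bytes per parameter. The sketch below also estimates what is left for KV cache on a 24 GB card, assuming the vLLM process is capped at 90% of VRAM. The 1 GB overhead allowance is a placeholder guess rather than a measured value, and both function names are hypothetical.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight footprint: 1B parameters at 1 byte each is about 1 GB."""
    return params_billion * bytes_per_param


def kv_cache_budget_gb(vram_gb: float, gpu_memory_utilization: float,
                       weights_gb: float, overhead_gb: float = 1.0) -> float:
    """Memory left for KV cache after weights and an assumed overhead allowance."""
    return vram_gb * gpu_memory_utilization - weights_gb - overhead_gb


for label, bytes_per_param in [("FP16", 2), ("FP8", 1)]:
    weights = weight_memory_gb(7, bytes_per_param)
    budget = kv_cache_budget_gb(24, 0.9, weights)
    print(f"{label}: weights {weights:.1f} GB, KV cache budget {budget:.1f} GB")
# FP16: weights 14.0 GB, KV cache budget 6.6 GB
# FP8: weights 7.0 GB, KV cache budget 13.6 GB
```

The raw estimate lands slightly below the table's FP8 column. In this example, halving the weights roughly doubles the memory available for cache, which translates directly into more concurrent sequences.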

Performance: where the advantage shows up

vLLM's edge grows with concurrency. At a single user, the gap over Ollama is modest (about 15% on the figures below), partly because Ollama's Q4_K_M quantization uses less memory and can run faster on memory-limited consumer hardware. The architecture difference becomes clear at scale:

Scenario	vLLM (FP16)	Ollama (Q4_K_M)
1 concurrent user, Llama 3 8B	~71 tok/s	~62 tok/s
8 concurrent users	~187 tok/s	~82 tok/s
Peak sustained throughput	~793 tok/s	~41 tok/s

Benchmarks from Markaicode and SitePoint 2026 testing; see Sources for links.
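To read the table as ratios rather than raw numbers, a few lines of arithmetic are enough. The figures are copied from the table above; nothing here is a new measurement.

```python
# Tokens per second from the table above: (vLLM FP16, Ollama Q4_K_M).
results = {
    "1 concurrent user": (71, 62),
    "8 concurrent users": (187, 82),
    "peak sustained": (793, 41),
}

ratios = {name: vllm_tps / ollama_tps for name, (vllm_tps, ollama_tps) in results.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
# 1 concurrent user: 1.15x
# 8 concurrent users: 2.28x
# peak sustained: 19.34x
```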

The single-user numbers are close because vLLM's batching machinery has nothing to batch. Its advantage is in keeping GPU utilization high across many simultaneous requests — Ollama's throughput flattens almost immediately under concurrent load, while vLLM's scales smoothly.

Against TensorRT-LLM (NVIDIA's own engine, which runs only on NVIDIA GPUs), vLLM trades a few percent of peak throughput for dramatically simpler setup and model-agnostic architecture. TGI (Hugging Face's Text Generation Inference) occupies the same niche as vLLM and is worth comparing if you're already deep in the Hugging Face ecosystem.

For a detailed hardware-by-hardware Ollama vs vLLM breakdown, the Ollama vs vLLM comparison has the numbers.

Supported models in v0.21.0

vLLM's model support covers most of the architectures that matter:

Llama family: Llama 3.x, Llama 4
Mistral/Mixtral: Mistral 7B, Mixtral 8x7B, Mixtral 8x22B
Qwen: Qwen 2.5, Qwen 3.5, Qwen-VL vision-language variants
DeepSeek: DeepSeek V3, V4, R1 (with MLA attention support)
Gemma: Gemma 2, Gemma 3
Phi: Microsoft Phi-3, Phi-4
Vision-language models: LLaVA, InternVL, Moondream3 (added in v0.21.0)

New architectures from Hugging Face Transformers typically land in vLLM within weeks of release. The project moves fast — v0.21.0 shipped 367 commits from 202 contributors, and v0.