Inside vLLM's CPU backend: a new contributor's notes
Most of the public technical writing about vLLM focuses on its GPU-side innovations — PagedAttention, continuous batching, the V1 engine. Less has been written about the CPU backend, which is where I spent the last couple of weeks: building vLLM from source, working through some rough edges, and shipping a small PR that clarifies three confusing error messages.
This post is a writeup of what surprised me along the way. It's aimed at the next contributor who's going to spend time in the CPU paths — whether for dev work, CI testing, edge inference, or just because that's the entry point that fits their environment. Some of it is genuinely useful setup info that isn't well-documented elsewhere. Some of it is observations about how the project's GPU-first history shows up in the design of its CPU-side code.
The setup story (or: prerequisites the docs don't make obvious)
The official docs walk you through building vLLM from source on CPU. They're correct. They're also incomplete in ways that matter if you're on Ubuntu 22.04 — the most common WSL2 default.
Three things tripped me up that I'd want a future contributor to know upfront:
GCC 11 isn't enough. The CPU backend's CMake check is explicit:
CMake Error at cmake/cpu_extension.cmake:116 (message):
X86 backend requires gcc/g++ >= 12.3
Ubuntu 22.04's default gcc-12 package is actually 12.1, which also fails. The fix is to add the Ubuntu Toolchain PPA:
sudo add-apt-repository -y ppa:ubuntu-toolchain-r/test
sudo apt update
sudo apt install -y gcc-13 g++-13
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-13 130
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-13 130
setuptools_scm is a hidden build-time dependency. The repo's setup.py imports it to compute the version from git history, but requirements/cpu-build.txt doesn't list it. With --no-build-isolation (which you need for editable installs in a venv), the first error you'll see is:
ModuleNotFoundError: No module named 'setuptools_scm'
Easy fix once you know:
pip install setuptools_scm cmake ninja packaging wheel
Default parallel-build settings will OOM your WSL. vLLM's build defaults to spawning one cc1plus per logical core (-j 22 in my case). Each compile job uses 1–2 GB compiling AVX-512 and AMX-BF16 template code. On a 16 GiB WSL2 instance, that gives you:
c++: fatal error: Killed signal terminated program cc1plus
compilation terminated.
The fix is one environment variable:
MAX_JOBS=4 VLLM_TARGET_DEVICE=cpu pip install -e . --no-build-isolation
MAX_JOBS=2 for 8 GiB systems, MAX_JOBS=4 for 16 GiB. Slower build, completes successfully. With these three things sorted, the CPU build takes about 30–45 minutes on a modest WSL2 box and gives you a working editable install.
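Once that finishes, a quick import check confirms the editable install is live and that setuptools_scm did its job. Nothing backend-specific here, just a sanity check from inside the venv:
# Post-build smoke test: run in the venv that holds the editable install.
import vllm

# The version string is computed by setuptools_scm from git metadata,
# e.g. a dev version derived from the current checkout.
print(vllm.__version__)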
If you're building vLLM on Ubuntu 22.04 today, save yourself half a day: by my count, these three issues account for most of the friction that turned what should have been a few hours of setup into several days.
What surprised me about how CPU memory works in vLLM
Once vLLM is built and you try to serve a tiny model, this is the first error you're likely to hit:
ValueError: Available memory on node 0 (13.58/15.34 GiB) on startup is less
than desired CPU memory utilization (0.92, 14.11 GiB).
Decrease --gpu-memory-utilization or reduce CPU memory used by other processes.
If you're new, this message will probably make you stop and squint. Decrease --gpu-memory-utilization? But I'm on CPU.
Here's what's actually going on. The gpu_memory_utilization config field — and the CLI flag of the same name — is a single setting shared across backends. On GPU it controls how much VRAM vLLM is allowed to reserve for the KV cache. On CPU it controls the same thing but with system RAM. The field name is GPU-flavored because vLLM is primarily a GPU engine and that's the dominant use case. But the semantics on the CPU backend are: "the fraction of total system memory I should reserve."
Once you internalize this, the design choice makes sense. One config, two backends. The friction comes from the field name not warning CPU users that it applies to them.
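To make that concrete, here's a minimal sketch of starting a tiny model on a CPU-only box with the flag dialed down. The 0.5 fraction, the 2048-token context, and the model choice are illustrative values, not recommendations:
# Offline-inference sketch on the CPU backend. Despite the name,
# gpu_memory_utilization bounds the fraction of *system RAM* reserved here.
from vllm import LLM

llm = LLM(
    model="facebook/opt-125m",
    gpu_memory_utilization=0.5,  # reserve roughly half of system memory
    max_model_len=2048,          # a smaller context keeps the KV cache requirement modest
)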
This setting gets checked in two places, both of which raise errors that mention gpu_memory_utilization:
- vllm/v1/worker/cpu_worker.py: at startup, when the worker verifies that requested memory ≤ available memory.
- vllm/v1/core/kv_cache_utils.py: at runtime, when the KV cache is being sized for a given max_model_len.
The startup check fires almost immediately. The runtime check fires later, after the model is loaded, when vLLM tries to allocate KV cache space for the configured sequence length. Both errors point a CPU user at gpu_memory_utilization without indicating that this flag, despite its name, is the right thing to adjust on a CPU-only system.
This is the kind of thing that's totally obvious if you've been reading the codebase for months and totally bewildering when you first encounter it.
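For orientation, the startup check in cpu_worker.py boils down to something like the following paraphrase. This is not the actual vLLM source; the function and argument names are mine:
# Hand-written paraphrase of the startup-memory check, for illustration only.
def check_cpu_startup_memory(available_gib: float, total_gib: float,
                             gpu_memory_utilization: float) -> None:
    requested_gib = total_gib * gpu_memory_utilization
    if requested_gib > available_gib:
        raise ValueError(
            f"Available memory on node 0 ({available_gib:.2f}/{total_gib:.2f} GiB) "
            f"on startup is less than desired CPU memory utilization "
            f"({gpu_memory_utilization:.2f}, {requested_gib:.2f} GiB)."
        )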
Shipping a (small) fix
The error-message confusion seemed worth fixing, so I sent a PR (#42479) that does three things:
- Reword the cpu_worker.py startup-memory error to explain the flag-name overlap and give a concrete remedy value
- Add clarifying parentheticals to the two kv_cache_utils.py errors so CPU users know the flag applies to them
- Fix a small "models's" → "model's" typo I noticed in passing
The actual diff is small — fewer than twenty lines of string changes across two files. After the change, the startup error reads:
ValueError: Available memory on node 0 (12.83/15.34 GiB) on startup is less
than desired CPU memory utilization (0.99, 15.18 GiB). On the CPU backend,
the `--gpu-memory-utilization` flag controls the fraction of CPU memory
reserved (despite its name). To resolve: decrease `--gpu-memory-utilization`
(e.g. `--gpu-memory-utilization 0.5`) or reduce CPU memory used by other
processes.
Not a deep engineering achievement. But every CPU-only user who hits this from now on gets a message that tells them what to do, instead of one that implies the message doesn't apply to them.
I want to be honest about what this contribution is and isn't. It's not solving a hard technical problem. It's not adding a feature. It's a small UX cleanup informed by personally hitting the confusion the message creates. That's a legitimate kind of contribution, but it's not the kind I'd point at and claim deep inference-engine expertise. It's the kind that says: "I read enough of the code to find where the confusion originates, I traced it to specific lines, and I shipped a minimal fix."
There's a separate, more interesting thread visible in the same area of the codebase. Issue #29233 raises broader concerns about CPU memory configuration on the CPU backend — specifically that the default VLLM_CPU_KVCACHE_SPACE value is too small for many real use cases, and that the CPU backend uses a different env variable instead of the unified kv_cache_memory_bytes argument the GPU path now uses. My PR explicitly scopes itself away from those questions, leaving them for a separate, deeper PR.
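For reference, the CPU-specific knob that issue discusses can still be set directly today. A sketch, assuming VLLM_CPU_KVCACHE_SPACE is interpreted as a size in GiB (the 8 is an arbitrary example):
import os

# Size the CPU KV cache via the env var that #29233 wants folded into
# the unified kv_cache_memory_bytes argument. Value is in GiB.
os.environ["VLLM_CPU_KVCACHE_SPACE"] = "8"

from vllm import LLM
llm = LLM(model="facebook/opt-125m")  # the KV cache now gets ~8 GiB of system RAM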
When does CPU vLLM actually make sense?
Worth a section because the question comes up. CPU inference is dramatically slower than GPU — for a 7B model in fp16, you're looking at single-digit tokens per second on a decent server CPU. So the use cases are specific.
A few real ones I've seen or used myself:
- Development environments. You're contributing to vLLM, running a tiny model to exercise code paths. Throughput doesn't matter; correctness does. facebook/opt-125m on CPU works fine for testing the serving stack (see the sketch after this list).
- CI testing. Same logic. Running tests against a real (small) vLLM instance is cheaper than provisioning a GPU runner.
- AVX-512-equipped servers without GPUs. Intel Sapphire Rapids and newer Xeons have AVX-512 + AMX-BF16. For small models with throughput-tolerant batch workloads, this can actually be cost-effective compared to renting GPU time.
- Edge cases. Inference at the edge on x86 servers without GPUs, where you control batch size and latency expectations.
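Here's the kind of tiny end-to-end run I mean for the development-environment case, using vLLM's offline LLM API. The prompt, context length, and sampling settings are arbitrary:
# Exercise the serving stack end to end on CPU with a toy model.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", max_model_len=512)
params = SamplingParams(temperature=0.8, max_tokens=32)

for out in llm.generate(["vLLM's CPU backend is"], params):
    print(out.outputs[0].text)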
For everything else — anything resembling production LLM serving at meaningful throughput — CPU vLLM isn't the right call. The whole point of vLLM is throughput-optimized GPU inference. The CPU backend exists, it works, but it's not where vLLM's best work lives.
What's next
I'm going to keep working on small, scoped contributions. Next up is a bug involving a bitsandbytes-plus-pytest interaction (#32793) that's GPU-side — a debugging-heavy fix that involves understanding how vLLM's multiprocess worker startup interacts with pytest's import hooks. That's the kind of contribution I'd like to ship next.
If you're new to vLLM and looking for an entry point, the CPU backend is a legitimate one. The contributing flow, the DCO sign-off, the review process — all the same as the GPU side. The work is just smaller in scope. And if you find a confusing error message during setup, that's an opportunity, not a failure of the project.
Helpful starting links:
- vLLM contributing guide
- The CPU installation docs (where the prerequisites I documented above belong)
- Issue #29233 if you want to take on the broader CPU memory configuration redesign
- The "good first issue" filter, which is a magnet for new contributors: claim an issue quickly or pick something that isn't labeled
If you've fought through vLLM CPU setup or shipped a fix, I'd genuinely like to hear about it. Comment below or find me on GitHub.