After my recent presentation on our AI inference PoC (details here), I received a bunch of great follow-up questions and DMs. A lot of you were asking the same thing:
"This is a cool demo, but how do we actually take this to the next level and build a real commercial solution?"
It's a fantastic question, and it's the crucial step that turns a promising experiment into a production-ready service. So, in today's blog, I want to dive into the more technical details of how I'd approach this. We'll be focusing on one of the most powerful tools for the job: vLLM.
Table of Contents
- Why Choose vLLM? The Business Value of Inference
- Chapter Summary
- High Performance
- Cross-Platform Compatibility
- Ease of Use
- Environment & Setup
- Prerequisites
- Installation & Walkthrough
- How to Verify KV Cache on CPU
- Key Environment Variables for CPU Performance
- Conclusion
Why Choose vLLM? The Business Value of Inference
Before we get into the nitty-gritty, it's worth touching on why this matters.
The AI inference market is where the real business value happens, and it's projected to grow massively, from $106 billion in 2025 to over $255 billion by 2030.
Having a de facto standard inference platform is a huge opportunity. That's where vLLM comes in: it's rapidly emerging as the "Linux of GenAI Inference" for a few key reasons.
Chapter Summary
This chapter outlines the core benefits of vLLM that make it a top choice for production-level AI inference. We'll cover:
- High Performance: advanced algorithms for high QPS serving
- Cross-Platform: support for a wide range of accelerators and OEMs
- Ease of Use: integrations and APIs that developers love
High Performance
vLLM is engineered for speed and efficiency. It uses advanced algorithms to deliver high Queries Per Second (QPS) serving, which is critical for commercial applications. Its performance is already comparable to heavily optimized solutions like NVIDIA's TensorRT-LLM (TRT-LLM), making it the benchmark other inference engines are measured against.
Cross-Platform Compatibility
One of vLLM's biggest strengths is its ability to run on a wide array of hardware (NVIDIA, AMD, Intel, Google, AWS, etc.) and with major OEMs like Dell, Lenovo, Cisco, and HPE. This lets you build enterprise inference without being tied to a specific hardware stack.
Ease of Use
High performance doesn't mean high complexity. vLLM features native Hugging Face integration, simple APIs, and an OpenAI-compatible API, which is a huge productivity boost for developers.
Environment & Setup
For this walkthrough and our demo benchmarks, we'll use:
- Host: c7i.4xlarge (16 vCPU), Amazon Linux
- Local model: phi3:mini (fast micro-prompt baseline)
- Python: 3.9+
Prerequisites
Before starting, ensure your environment meets vLLM's requirements:
- Python: 3.9-3.12
- OS: Linux
- CPU Flags: avx512f is recommended.
Pro Tip: Check for the required CPU flag with:
lscpu | grep avx512f
Installation & Walkthrough
Instead of generic instructions, here are the exact steps I followed to get vLLM running from source and then containerized with Docker on Amazon Linux.
Step 1: Set Up Python Environment
uv venv --python 3.12 --seed
source .venv/bin/activate
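One assumption baked into Step 1: uv is already installed on the host. If it isn't, the standalone installer from the uv docs is the quickest route I know of (pip install uv works too):
# Install uv if it's not already on the box (assumes curl is available)
curl -LsSf https://astral.sh/uv/install.sh | sh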
Step 2: Install System Dependencies
sudo dnf update -y
sudo dnf install -y git gcc gcc-c++ gperftools-devel numactl-devel libSM-devel libXext-devel mesa-libGL-devel
# Install EPEL and RPM Fusion for extra packages like ffmpeg
sudo dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
sudo dnf install -y https://download1.rpmfusion.org/free/el/rpmfusion-free-release-9.noarch.rpm
sudo dnf install -y ffmpeg
# Set the default compiler
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc 10 --slave /usr/bin/g++ g++ /usr/bin/g++
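Before building anything, I'd do a quick sanity check that the compiler toolchain and ffmpeg actually landed. These are plain version checks, nothing vLLM-specific:
# Confirm the toolchain and ffmpeg are on the PATH
gcc --version
g++ --version
ffmpeg -version | head -n 1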
Step 3: Clone and Build vLLM
git clone https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source
# Install dependencies
uv pip install -r requirements/cpu-build.txt --torch-backend auto --index-strategy unsafe-best-match
uv pip install -r requirements/cpu.txt --torch-backend auto --index-strategy unsafe-best-match
# Build and install vLLM for CPU
VLLM_TARGET_DEVICE=cpu python setup.py install
# (Optional) For development mode
# VLLM_TARGET_DEVICE=cpu python setup.py develop
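To confirm the CPU build actually installed into the virtualenv, a one-line import check is usually enough (the version string will differ depending on the commit you built):
# Should print the vLLM version without raising ImportError
python -c "import vllm; print(vllm.__version__)"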
Step 4: Build the Docker Image
sudo docker build -f docker/Dockerfile.cpu \
--build-arg VLLM_CPU_AVX512BF16=false \
--build-arg VLLM_CPU_AVX512VNNI=false \
--build-arg VLLM_CPU_DISABLE_AVX512=false \
--tag vllm-cpu-env \
--target vllm-openai .
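As I read the CPU Dockerfile, the three VLLM_CPU_* build args toggle optional instruction-set support: with the values above, base AVX-512 stays enabled while the BF16- and VNNI-specific kernels are left off. Once the build finishes, a plain Docker listing confirms the image was tagged:
# Verify the image exists locally
sudo docker images vllm-cpu-env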
Step 5: Run & Test
Run the container with the Phi-3-mini-4k-instruct LLM:
sudo docker run --rm \
--privileged=true \
--shm-size=4g \
-p 8000:8000 \
-e VLLM_CPU_KVCACHE_SPACE=8 \
vllm-cpu-env \
--model=microsoft/Phi-3-mini-4k-instruct \
--dtype=bfloat16 \
--disable-sliding-window
INFO 08-26 10:00:17 [__init__.py:241] Automatically detected platform cpu.
(APIServer pid=1) INFO 08-26 10:00:19 [api_server.py:1873] vLLM API server version 0.10.1rc2.dev204+g2da02dd0d
(APIServer pid=1) INFO 08-26 10:00:19 [utils.py:326] non-default args: {'model': 'microsoft/Phi-3-mini-4k-instruct', 'dtype': 'bfloat16', 'disable_sliding_window': True}
(APIServer pid=1) INFO 08-26 10:00:24 [__init__.py:742] Resolved architecture: Phi3ForCausalLM
(APIServer pid=1) INFO 08-26 10:00:24 [__init__.py:1786] Using max model len 2047
(APIServer pid=1) INFO 08-26 10:00:24 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 08-26 10:00:28 [__init__.py:241] Automatically detected platform cpu.
(EngineCore_0 pid=94) INFO 08-26 10:00:29 [core.py:644] Waiting for init message from front-end.
(EngineCore_0 pid=94) INFO 08-26 10:00:29 [core.py:74] Initializing a V1 LLM engine (v0.10.1rc2.dev204+g2da02dd0d) with config: model='microsoft/Phi-3-mini-4k-instruct', speculative_config=None, tokenizer='microsoft/Phi-3-mini-4k-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2047, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cpu, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=microsoft/Phi-3-mini-4k-instruct, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=None, compilation_config={"level":2,"debug_dump_path":"","cache_dir":"","backend":"inductor","custom_ops":["none"],"splitting_ops":null,"use_inductor":true,"compile_sizes":null,"inductor_compile_config":{"enable_auto_functionalized_v2":false,"dce":true,"size_asserts":false,"nan_asserts":false,"epilogue_fusion":true},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":null,"local_cache_dir":null}
(EngineCore_0 pid=94) INFO 08-26 10:00:29 [importing.py:43] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
(EngineCore_0 pid=94) INFO 08-26 10:00:29 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
(EngineCore_0 pid=94) WARNING 08-26 10:00:29 [_logger.py:72] Pin memory is not supported on CPU.
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu_worker.py:172] auto thread-binding list (id, physical core): [(8, 0), (9, 1), (10, 2), (11, 3), (12, 4), (13, 5), (14, 6), (15, 7)]
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu_worker.py:63] OMP threads binding of Process 94:
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu_worker.py:63] OMP tid: 94, core 8
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu_worker.py:63] OMP tid: 122, core 9
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu_worker.py:63] OMP tid: 123, core 10
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu_worker.py:63] OMP tid: 124, core 11
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu_worker.py:63] OMP tid: 125, core 12
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu_worker.py:63] OMP tid: 126, core 13
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu_worker.py:63] OMP tid: 127, core 14
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu_worker.py:63] OMP tid: 128, core 15
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu_worker.py:63]
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu_model_runner.py:87] Starting to load model microsoft/Phi-3-mini-4k-instruct...
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu.py:100] Using Torch SDPA backend.
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [weight_utils.py:294] Using model weights format ['*.safetensors']
(EngineCore_0 pid=94) INFO 08-26 10:01:59 [weight_utils.py:310] Time spent downloading weights for microsoft/Phi-3-mini-4k-instruct: 88.702533 seconds
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 4.44it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 2.45it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 2.52it/s]
(EngineCore_0 pid=94)
(EngineCore_0 pid=94) INFO 08-26 10:02:00 [default_loader.py:267] Loading weights took 0.86 seconds
(EngineCore_0 pid=94) INFO 08-26 10:02:00 [kv_cache_utils.py:849] GPU KV cache size: 21,760 tokens
(EngineCore_0 pid=94) INFO 08-26 10:02:00 [kv_cache_utils.py:853] Maximum concurrency for 2,047 tokens per request: 10.62x
(EngineCore_0 pid=94) INFO 08-26 10:02:01 [cpu_model_runner.py:99] Warming up model for the compilation...
(EngineCore_0 pid=94) INFO 08-26 10:03:01 [cpu_model_runner.py:103] Warming up done.
(EngineCore_0 pid=94) INFO 08-26 10:03:01 [core.py:215] init engine (profile, create kv cache, warmup model) took 61.05 seconds
(APIServer pid=1) INFO 08-26 10:03:01 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 170
(APIServer pid=1) INFO 08-26 10:03:01 [async_llm.py:165] Torch profiler disabled. AsyncLLM CPU traces will not be collected.
(APIServer pid=1) INFO 08-26 10:03:01 [api_server.py:1679] Supported_tasks: ['generate']
(APIServer pid=1) INFO 08-26 10:03:01 [api_server.py:1948] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:36] Available routes are:
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /docs, Methods: GET, HEAD
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /health, Methods: GET
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /load, Methods: GET
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /ping, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /ping, Methods: GET
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /version, Methods: GET
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /v1/embeddings, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /pooling, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /classify, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /score, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /v1/score, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /rerank, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /v1/rerank, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /v2/rerank, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /metrics, Methods: GET
(APIServer pid=1) INFO: Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) INFO: Application startup complete.
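Before sending a full chat request, I like to poke the lighter routes first to make sure the server is actually ready (both endpoints appear in the route list the server just printed):
# Readiness check: should return HTTP 200 once warmup is complete
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health
# List the model(s) the server is serving
curl -s http://localhost:8000/v1/models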
Test the endpoint with curl:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "microsoft/Phi-3-mini-4k-instruct",
"messages": [
{"role": "user", "content": "Analyze the main changes for dijkstra algorithm"}
],
"temperature": 0.7,
"max_tokens": 50
}'
1st Result:
{"id":"chatcmpl-8b1d987f6979436d90bba661b088f6c7","object":"chat.completion","created":1756202802,"model":"microsoft/Phi-3-mini-4k-instruct","choices":[{"index":0,"message":{"role":"assistant","content":" The Dijkstra algorithm is an algorithm for finding the shortest path between nodes in a graph. It was invented by computer scientist Edsger W. Dijkstra in 1956 and published three years later. Throughout","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":14,"total_tokens":64,"completion_tokens":50,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
Logs collected:
(APIServer pid=1) INFO 08-26 10:06:42 [chat_utils.py:470] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(EngineCore_0 pid=94) WARNING 08-26 10:06:42 [logger.py:71] cudagraph dispatching keys are not initialized. No cudagraph will be used.
(APIServer pid=1) INFO: 172.17.0.1:53458 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 08-26 10:06:52 [loggers.py:123] Engine 000: Avg prompt throughput: 1.4 tokens/s, Avg generation throughput: 5.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
Let's have a look at what happens if we run the same query once again:
(APIServer pid=1) INFO 08-26 10:07:12 [loggers.py:123] Engine 000: Avg prompt throughput: 2.0 tokens/s, Avg generation throughput: 4.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, **`GPU KV cache usage: 0.6%`**, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO: 172.17.0.1:44836 - "POST /v1/chat/completions HTTP/1.1" 200 OK
How to Verify KV Cache on CPU
To enable and allocate space for the CPU KV cache, you must set the VLLM_CPU_KVCACHE_SPACE environment variable. The value is in GiB. In our docker run command, we allocated 8 GiB:
-e VLLM_CPU_KVCACHE_SPACE=8
When running in CPU-only mode, you might still see log lines mentioning GPU KV cache usage: 0.6%. vLLM reuses the same metric label regardless of backend; on a CPU build it refers to the KV cache allocated in system memory.
Explanation:
VLLM_CPU_KVCACHE_SPACE: This defines the KV cache's memory allocation in GiB. A larger value allows for more concurrent requests and longer contexts. Start with a conservative value (e.g., 4 or 8) and monitor memory usage.
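If you'd rather not scrape container logs, the /metrics route we saw earlier exposes Prometheus-style counters, including KV cache usage. The exact metric names vary between vLLM versions, so treat this grep pattern as a starting point rather than an exact recipe:
# Inspect cache-related metrics from the running server
curl -s http://localhost:8000/metrics | grep -i -E "cache|prefix"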
Key Environment Variables for CPU Performance
Fine-tuning vLLM on CPUs involves a few key environment variables. Here's a quick guide to the most important ones for performance tuning (a combined example follows the list):
- VLLM_CPU_OMP_THREADS_BIND: Pins processing threads to specific CPU cores. You can set it to auto (the default) for automatic assignment based on your hardware's NUMA architecture, or specify core ranges manually (e.g., 0-31). For multi-process tensor parallelism, you can assign different cores to each process using a pipe | (e.g., 0-31|32-63).
- VLLM_CPU_NUM_OF_RESERVED_CPU: Reserves a number of CPU cores, keeping them free from vLLM's main processing threads. This is useful for system overhead or other processes and only works when the thread binding above is set to auto. By default, one core is reserved per process in multi-process setups.
- VLLM_CPU_MOE_PREPACK: (x86 only) A performance optimization for models using Mixture-of-Experts (MoE) layers. It's enabled by default (1), but you may need to disable it by setting it to 0 if you run into issues on unsupported CPUs.
- VLLM_CPU_SGL_KERNEL: (Experimental, x86 only) Enables specialized kernels for low-latency tasks like real-time serving. This requires a CPU with the AMX instruction set, BFloat16 model weights, and specific weight shapes. It's disabled by default (0).
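To make that concrete, here's a sketch of how these variables could be layered onto the docker run from Step 5. The values are assumptions for our 16-vCPU c7i.4xlarge host, not tuned recommendations, so measure throughput before and after changing them:
sudo docker run --rm \
--privileged=true \
--shm-size=4g \
-p 8000:8000 \
-e VLLM_CPU_KVCACHE_SPACE=8 \
-e VLLM_CPU_OMP_THREADS_BIND=auto \
-e VLLM_CPU_NUM_OF_RESERVED_CPU=1 \
vllm-cpu-env \
--model=microsoft/Phi-3-mini-4k-instruct \
--dtype=bfloat16 \
--disable-sliding-window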
Conclusion
Transitioning an AI PoC to a production-ready service hinges on maximizing performance and reliability. As we've seen, vLLM's environment variables are the key to unlocking this potential on standard CPU hardware.
By strategically managing memory with VLLM_CPU_KVCACHE_SPACE and precisely controlling thread behavior with VLLM_CPU_OMP_THREADS_BIND, you move beyond default settings to achieve significant gains in throughput and latency. This fine-grained control is what transforms a functional demo into a scalable, cost-effective, and commercially viable inference solution ready for real-world traffic.
References
- vLLM Docs: Build a Docker Image from Source for CPU
- Hugging Face: Microsoft Phi-3-mini-4k-instruct Model Card
- AWS Console: AWS Management Console Login