After my recent presentation on our AI inference PoC (details here), I received a bunch of great follow-up questions and DMs. A lot of you were asking the same thing:
"This is a cool demo, but how do we actually take this to the next level and build a real commercial solution?"
It's a fantastic question, and it's the crucial step that turns a promising experiment into a production-ready service. So, in today's blog, I want to dive into the more technical details of how I'd approach this. We'll be focusing on one of the most powerful tools for the job: vLLM.
Table of Contents
- Why Choose vLLM? The Business Value of Inference
- Chapter Summary
- High Performance
- Cross-Platform Compatibility
- Ease of Use
- Environment & Setup
- Prerequisites
- Installation & Walkthrough
- How to Verify KV Cache on CPU
- Key Environment Variables for CPU Performance
- Conclusion
Why Choose vLLM? The Business Value of Inference
Before we get into the nitty-gritty, it's worth touching on why this matters.
The AI inference market is where the real business value happens, and it's projected to grow massively, from $106 billion in 2025 to over $255 billion by 2030.
Having a de facto standard inference platform is a huge opportunity. That's where vLLM comes in: it's rapidly emerging as the "Linux of GenAI Inference" for a few key reasons.
Chapter Summary
This chapter outlines the core benefits of vLLM that make it a top choice for production-level AI inference. We'll cover:
- High Performance: advanced algorithms for high QPS serving
- Cross-Platform: support for a wide range of accelerators and OEMs
- Ease of Use: integrations and APIs that developers love
High Performance
vLLM is engineered for speed and efficiency. It uses advanced algorithms to deliver high Queries Per Second (QPS) serving, which is critical for commercial applications. Its performance is already comparable to heavily optimized solutions like NVIDIA's TensorRT-LLM (TRT-LLM), making it the benchmark other inference engines are measured against.
Cross-Platform Compatibility
One of vLLM's biggest strengths is its ability to run on a wide array of hardware (NVIDIA, AMD, Intel, Google, AWS, etc.) and with major OEMs like Dell, Lenovo, Cisco, and HPE. This lets you build enterprise inference without being tied to a specific hardware stack.
Ease of Use
High performance doesn't mean high complexity. vLLM features native Hugging Face integration, simple APIs, and an OpenAI-compatible API, which is a huge productivity boost for developers.
Environment & Setup
For this walkthrough and our demo benchmarks, we'll use:
- Host: c7i.4xlarge (16 vCPU), Amazon Linux
- Local model: phi3:mini (fast micro-prompt baseline)
- Python: 3.9+
Prerequisites
Before starting, ensure your environment meets vLLM's requirements:
- Python: 3.9-3.12
- OS: Linux
- CPU Flags: avx512f is recommended.
Pro Tip: Check for the required CPU flag with:
lscpu | grep avx512f
Installation & Walkthrough
Instead of generic instructions, here are the exact steps I followed to get vLLM running from source and then containerized with Docker on Amazon Linux.
Step 1: Set Up Python Environment
uv venv --python 3.12 --seed
source .venv/bin/activate
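One assumption baked into Step 1: uv is already installed on the host. If it isn't, the standalone installer from the uv docs is the quickest route I know of (pip install uv works too):
# Install uv if it's not already on the box (assumes curl is available)
curl -LsSf https://astral.sh/uv/install.sh | sh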
Step 2: Install System Dependencies
sudo dnf update -y
sudo dnf install -y git gcc gcc-c++ gperftools-devel numactl-devel libSM-devel libXext-devel mesa-libGL-devel
# Install EPEL and RPM Fusion for extra packages like ffmpeg
sudo dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
sudo dnf install -y https://download1.rpmfusion.org/free/el/rpmfusion-free-release-9.noarch.rpm
sudo dnf install -y ffmpeg
# Set the default compiler
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc 10 --slave /usr/bin/g++ g++ /usr/bin/g++
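Before building anything, I'd do a quick sanity check that the compiler toolchain and ffmpeg actually landed. These are plain version checks, nothing vLLM-specific:
# Confirm the toolchain and ffmpeg are on the PATH
gcc --version
g++ --version
ffmpeg -version | head -n 1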
Step 3: Clone and Build vLLM
git clone https://github.com/vllm-project/vllm.git vllm_source
cd vllm_source
# Install dependencies
uv pip install -r requirements/cpu-build.txt --torch-backend auto --index-strategy unsafe-best-match
uv pip install -r requirements/cpu.txt --torch-backend auto --index-strategy unsafe-best-match
# Build and install vLLM for CPU
VLLM_TARGET_DEVICE=cpu python setup.py install
# (Optional) For development mode
# VLLM_TARGET_DEVICE=cpu python setup.py develop
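To confirm the CPU build actually installed into the virtualenv, a one-line import check is usually enough (the version string will differ depending on the commit you built):
# Should print the vLLM version without raising ImportError
python -c "import vllm; print(vllm.__version__)"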
Step 4: Build the Docker Image
sudo docker build -f docker/Dockerfile.cpu \
--build-arg VLLM_CPU_AVX512BF16=false \
--build-arg VLLM_CPU_AVX512VNNI=false \
--build-arg VLLM_CPU_DISABLE_AVX512=false \
--tag vllm-cpu-env \
--target vllm-openai .
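As I read the CPU Dockerfile, the three VLLM_CPU_* build args toggle optional instruction-set support: with the values above, base AVX-512 stays enabled while the BF16- and VNNI-specific kernels are left off. Once the build finishes, a plain Docker listing confirms the image was tagged:
# Verify the image exists locally
sudo docker images vllm-cpu-env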
Step 5: Run & Test
Run the container with the Phi-3-mini-4k-instruct LLM:
sudo docker run --rm \
--privileged=true \
--shm-size=4g \
-p 8000:8000 \
-e VLLM_CPU_KVCACHE_SPACE=8 \
vllm-cpu-env \
--model=microsoft/Phi-3-mini-4k-instruct \
--dtype=bfloat16 \
--disable-sliding-window
INFO 08-26 10:00:17 [__init__.py:241] Automatically detected platform cpu.
(APIServer pid=1) INFO 08-26 10:00:19 [api_server.py:1873] vLLM API server version 0.10.1rc2.dev204+g2da02dd0d
(APIServer pid=1) INFO 08-26 10:00:19 [utils.py:326] non-default args: {'model': 'microsoft/Phi-3-mini-4k-instruct', 'dtype': 'bfloat16', 'disable_sliding_window': True}
(APIServer pid=1) INFO 08-26 10:00:24 [__init__.py:742] Resolved architecture: Phi3ForCausalLM
(APIServer pid=1) INFO 08-26 10:00:24 [__init__.py:1786] Using max model len 2047
(APIServer pid=1) INFO 08-26 10:00:24 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 08-26 10:00:28 [__init__.py:241] Automatically detected platform cpu.
(EngineCore_0 pid=94) INFO 08-26 10:00:29 [core.py:644] Waiting for init message from front-end.
(EngineCore_0 pid=94) INFO 08-26 10:00:29 [core.py:74] Initializing a V1 LLM engine (v0.10.1rc2.dev204+g2da02dd0d) with config: model='microsoft/Phi-3-mini-4k-instruct', speculative_config=None, tokenizer='microsoft/Phi-3-mini-4k-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2047, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cpu, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=microsoft/Phi-3-mini-4k-instruct, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=False, pooler_config=None, compilation_config={"level":2,"debug_dump_path":"","cache_dir":"","backend":"inductor","custom_ops":["none"],"splitting_ops":null,"use_inductor":true,"compile_sizes":null,"inductor_compile_config":{"enable_auto_functionalized_v2":false,"dce":true,"size_asserts":false,"nan_asserts":false,"epilogue_fusion":true},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":null,"local_cache_dir":null}
(EngineCore_0 pid=94) INFO 08-26 10:00:29 [importing.py:43] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
(EngineCore_0 pid=94) INFO 08-26 10:00:29 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
(EngineCore_0 pid=94) WARNING 08-26 10:00:29 [_logger.py:72] Pin memory is not supported on CPU.
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu_worker.py:172] auto thread-binding list (id, physical core): [(8, 0), (9, 1), (10, 2), (11, 3), (12, 4), (13, 5), (14, 6), (15, 7)]
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu_worker.py:63] OMP threads binding of Process 94:
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu_worker.py:63] OMP tid: 94, core 8
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu_worker.py:63] OMP tid: 122, core 9
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu_worker.py:63] OMP tid: 123, core 10
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu_worker.py:63] OMP tid: 124, core 11
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu_worker.py:63] OMP tid: 125, core 12
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu_worker.py:63] OMP tid: 126, core 13
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu_worker.py:63] OMP tid: 127, core 14
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu_worker.py:63] OMP tid: 128, core 15
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu_worker.py:63]
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu_model_runner.py:87] Starting to load model microsoft/Phi-3-mini-4k-instruct...
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [cpu.py:100] Using Torch SDPA backend.
(EngineCore_0 pid=94) INFO 08-26 10:00:30 [weight_utils.py:294] Using model weights format ['*.safetensors']
(EngineCore_0 pid=94) INFO 08-26 10:01:59 [weight_utils.py:310] Time spent downloading weights for microsoft/Phi-3-mini-4k-instruct: 88.702533 seconds
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 4.44it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 2.45it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 2.52it/s]
(EngineCore_0 pid=94)
(EngineCore_0 pid=94) INFO 08-26 10:02:00 [default_loader.py:267] Loading weights took 0.86 seconds
(EngineCore_0 pid=94) INFO 08-26 10:02:00 [kv_cache_utils.py:849] GPU KV cache size: 21,760 tokens
(EngineCore_0 pid=94) INFO 08-26 10:02:00 [kv_cache_utils.py:853] Maximum concurrency for 2,047 tokens per request: 10.62x
(EngineCore_0 pid=94) INFO 08-26 10:02:01 [cpu_model_runner.py:99] Warming up model for the compilation...
(EngineCore_0 pid=94) INFO 08-26 10:03:01 [cpu_model_runner.py:103] Warming up done.
(EngineCore_0 pid=94) INFO 08-26 10:03:01 [core.py:215] init engine (profile, create kv cache, warmup model) took 61.05 seconds
(APIServer pid=1) INFO 08-26 10:03:01 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 170
(APIServer pid=1) INFO 08-26 10:03:01 [async_llm.py:165] Torch profiler disabled. AsyncLLM CPU traces will not be collected.
(APIServer pid=1) INFO 08-26 10:03:01 [api_server.py:1679] Supported_tasks: ['generate']
(APIServer pid=1) INFO 08-26 10:03:01 [api_server.py:1948] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:36] Available routes are:
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /docs, Methods: GET, HEAD
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /health, Methods: GET
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /load, Methods: GET
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /ping, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /ping, Methods: GET
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /version, Methods: GET
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /v1/embeddings, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /pooling, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /classify, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /score, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /v1/score, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /rerank, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /v1/rerank, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /v2/rerank, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 08-26 10:03:01 [launcher.py:44] Route: /metrics, Methods: GET
(APIServer pid=1) INFO: Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) INFO: Application startup complete.
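Before sending a full chat request, I like to poke the lighter routes first to make sure the server is actually ready (both endpoints appear in the route list the server just printed):
# Readiness check: should return HTTP 200 once warmup is complete
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health
# List the model(s) the server is serving
curl -s http://localhost:8000/v1/models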
Test the endpoint with curl:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "microsoft/Phi-3-mini-4k-instruct",
"messages": [
{"role": "user", "content": "Analyze the main changes for dijkstra algorithm"}
],
"temperature": 0.7,
"max_tokens": 50
}'
1st Result:
{"id":"chatcmpl-8b1d987f6979436d90bba661b088f6c7","object":"chat.completion","created":1756202802,"model":"microsoft/Phi-3-mini-4k-instruct","choices":[{"index":0,"message":{"role":"assistant","content":" The Dijkstra algorithm is an algorithm for finding the shortest path between nodes in a graph. It was invented by computer scientist Edsger W. Dijkstra in 1956 and published three years later. Throughout","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning_content":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":14,"total_tokens":64,"completion_tokens":50,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
Logs collected:
(APIServer pid=1) INFO 08-26 10:06:42 [chat_utils.py:470] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(EngineCore_0 pid=94) WARNING 08-26 10:06:42 [logger.py:71] cudagraph dispatching keys are not initialized. No cudagraph will be used.
(APIServer pid=1) INFO: 172.17.0.1:53458 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 08-26 10:06:52 [loggers.py:123] Engine 000: Avg prompt throughput: 1.4 tokens/s, Avg generation throughput: 5.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
Let's have a look at what happens if we run the same query once again:
(APIServer pid=1) INFO 08-26 10:07:12 [loggers.py:123] Engine 000: Avg prompt throughput: 2.0 tokens/s, Avg generation throughput: 4.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, **`GPU KV cache usage: 0.6%`**, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO: 172.17.0.1:44836 - "POST /v1/chat/completions HTTP/1.1" 200 OK
How to Verify KV Cache on CPU
To enable and allocate space for the CPU KV cache, you must set the VLLM_CPU_KVCACHE_SPACE environment variable. The value is in GiB. In our docker run command, we allocated 8 GiB:
-e VLLM_CPU_KVCACHE_SPACE=8
When running in CPU-only mode, you might still see log lines mentioning GPU KV cache usage: 0.6%. vLLM reuses the same metric label regardless of backend; on a CPU build it refers to the KV cache allocated in system memory.
Explanation:
VLLM_CPU_KVCACHE_SPACE: This defines the KV cache's memory allocation in GiB. A larger value allows for more concurrent requests and longer contexts. Start with a conservative value (e.g., 4 or 8) and monitor memory usage.
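If you'd rather not scrape container logs, the /metrics route we saw earlier exposes Prometheus-style counters, including KV cache usage. The exact metric names vary between vLLM versions, so treat this grep pattern as a starting point rather than an exact recipe:
# Inspect cache-related metrics from the running server
curl -s http://localhost:8000/metrics | grep -i -E "cache|prefix"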
Key Environment Variables for CPU Performance
Fine-tuning vLLM on CPUs involves a few key environment variables. Here's a quick guide to the most important ones for performance tuning (a combined example follows the list):
- VLLM_CPU_OMP_THREADS_BIND: Pins processing threads to specific CPU cores. You can set it to auto (the default) for automatic assignment based on your hardware's NUMA architecture, or specify core ranges manually (e.g., 0-31). For multi-process tensor parallelism, you can assign different cores to each process using a pipe | (e.g., 0-31|32-63).
- VLLM_CPU_NUM_OF_RESERVED_CPU: Reserves a number of CPU cores, keeping them free from vLLM's main processing threads. This is useful for system overhead or other processes and only works when the thread binding above is set to auto. By default, one core is reserved per process in multi-process setups.
- VLLM_CPU_MOE_PREPACK: (x86 only) A performance optimization for models using Mixture-of-Experts (MoE) layers. It's enabled by default (1), but you may need to disable it by setting it to 0 if you run into issues on unsupported CPUs.
- VLLM_CPU_SGL_KERNEL: (Experimental, x86 only) Enables specialized kernels for low-latency tasks like real-time serving. This requires a CPU with the AMX instruction set, BFloat16 model weights, and specific weight shapes. It's disabled by default (0).
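To make that concrete, here's a sketch of how these variables could be layered onto the docker run from Step 5. The values are assumptions for our 16-vCPU c7i.4xlarge host, not tuned recommendations, so measure throughput before and after changing them:
sudo docker run --rm \
--privileged=true \
--shm-size=4g \
-p 8000:8000 \
-e VLLM_CPU_KVCACHE_SPACE=8 \
-e VLLM_CPU_OMP_THREADS_BIND=auto \
-e VLLM_CPU_NUM_OF_RESERVED_CPU=1 \
vllm-cpu-env \
--model=microsoft/Phi-3-mini-4k-instruct \
--dtype=bfloat16 \
--disable-sliding-window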
Conclusion
Transitioning an AI PoC to a production-ready service hinges on maximizing performance and reliability. As we've seen, vLLM's environment variables are the key to unlocking this potential on standard CPU hardware.
By strategically managing memory with VLLM_CPU_KVCACHE_SPACE and precisely controlling thread behavior with VLLM_CPU_OMP_THREADS_BIND, you move beyond default settings to achieve significant gains in throughput and latency. This fine-grained control is what transforms a functional demo into a scalable, cost-effective, and commercially viable inference solution ready for real-world traffic.
References
- vLLM Docs: Build a Docker Image from Source for CPU
- Hugging Face: Microsoft Phi-3-mini-4k-instruct Model Card
- AWS Console: AWS Management Console Login