Are you building Large Language Model (LLM) applications and hitting slow inference, high latency, or GPU memory limits? vLLM is an open-source inference engine for serving LLMs with high throughput, efficient memory usage, continuous batching, and an OpenAI-compatible API server. This guide shows how to install vLLM, run offline batch inference, expose a real-time API, and troubleshoot common deployment issues.
What is vLLM?
vLLM is an open-source, high-throughput, memory-efficient inference engine for serving large language models.
It is designed to solve two common production problems:
- Slow inference when many users send requests concurrently.
- High GPU memory usage caused by inefficient KV cache handling.
vLLM improves serving performance mainly through:
- PagedAttention: manages the key-value cache using a paging strategy similar to virtual memory, reducing memory waste.
- Continuous batching: dynamically batches incoming requests so the GPU stays busy without waiting for fixed-size batches.
For API and backend developers, vLLM is useful when you want to self-host LLMs behind an API, reduce inference latency, or build an OpenAI-compatible endpoint using your own models.
Why API Developers Use vLLM
vLLM is commonly used for LLM API backends because it provides:
- High throughput for serving more requests per second.
- Better GPU utilization through continuous batching.
- Efficient memory management with PagedAttention.
- OpenAI-compatible API endpoints for easier integration.
- Offline and online inference APIs for both batch jobs and real-time serving.
- Broad model support, including Llama, Mistral, Qwen, OPT, Falcon, and more.
- Active open-source development with frequent updates.
See the full model list in the vLLM supported models documentation.
Tip: If you are building or testing LLM-powered APIs, you can use Apidog to design, test, and document your endpoints whether they are backed by vLLM, OpenAI, or a custom service.
Supported LLMs
vLLM supports many transformer-based models, including:
- Llama, Llama 2, Llama 3
- Mistral and Mixtral
- Qwen and Qwen2
- GPT-2, GPT-J, GPT-NeoX
- OPT
- Bloom
- Falcon
- MPT
- Multi-modal models
- Other compatible Hugging Face and ModelScope models
For the latest compatibility list, check the official vLLM Supported Models List.
If your model is not listed but uses an architecture compatible with a supported model, it may still work. Test carefully before using it in production. Custom architectures may require upstream code changes.
Key Concepts: PagedAttention and Continuous Batching
PagedAttention
Traditional attention implementations often allocate contiguous memory for the KV cache. This can cause fragmentation and wasted GPU memory, especially when requests have different sequence lengths.
vLLM uses PagedAttention to split the KV cache into smaller blocks or “pages.” This makes memory allocation more flexible and can significantly reduce memory waste.
Practical impact:
- Serve longer contexts more efficiently.
- Handle more concurrent requests.
- Reduce GPU memory pressure.
- Improve utilization for workloads with variable prompt lengths.
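To make the memory argument concrete, here is a toy calculation in plain Python. It is not vLLM's implementation; the request lengths, the 2048-token reservation, and the 16-token block size are all assumed values chosen for illustration. It compares the unused slots when every request reserves one large contiguous region against the unused slots when each request holds only the fixed-size blocks it needs.

```python
import math

def contiguous_waste(seq_lens, max_len):
    """Unused token slots when every request reserves max_len slots up front."""
    return sum(max_len - n for n in seq_lens)

def paged_waste(seq_lens, block_size):
    """Unused token slots when each request holds only whole blocks it needs."""
    return sum(math.ceil(n / block_size) * block_size - n for n in seq_lens)

seq_lens = [37, 512, 129, 1000]  # hypothetical request lengths, in tokens
max_len = 2048                   # hypothetical per-request reservation
block_size = 16                  # hypothetical block size, in tokens

print(contiguous_waste(seq_lens, max_len))  # 6514 unused slots
print(paged_waste(seq_lens, block_size))    # 34 unused slots
```

With block-based allocation, the waste per request is bounded by one partially filled block, regardless of how much the request lengths differ.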
Continuous Batching
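Static batching waits until a batch is full or a timeout is reached, then holds every slot until the longest request in the batch finishes. That can increase latency and leave GPU resources idle.
vLLM uses continuous batching, which schedules at the level of individual decoding steps: finished requests leave the batch and waiting requests join as soon as resources become available.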
Practical impact:
- Better throughput under real traffic.
- Lower waiting time for users.
- More efficient GPU scheduling.
- Better handling of mixed short and long requests.
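The difference can be illustrated with a small simulation. This is a simplified model, not vLLM's scheduler: it assumes each request needs a known number of decoding steps, that one step produces one token for every active request, and that the batch size and request lengths below are hypothetical.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each fixed batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue before every decoding step."""
    queue = deque(lengths)
    active = []
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One decoding step: each active request emits one token,
        # and requests with no tokens left drop out of the batch.
        active = [n - 1 for n in active if n - 1 > 0]
        steps += 1
    return steps

# Hypothetical workload: two long requests mixed with short ones.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]

print(static_batching_steps(lengths, batch_size=4))      # 200
print(continuous_batching_steps(lengths, batch_size=4))  # 105
```

In the static case the short requests in each batch finish early but their slots stay reserved; in the continuous case the second long request starts after five steps instead of after one hundred.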
Prerequisites
Before installing vLLM, prepare the following environment:
- OS: Linux recommended. WSL2 and macOS may work, but Linux is best supported.
- Python: 3.9, 3.10, 3.11, or 3.12.
- GPU: NVIDIA GPU with CUDA for best performance.
- PyTorch: vLLM can install a compatible version automatically, but you may pre-install PyTorch for custom CUDA setups.
- Virtual environment: recommended for clean dependency management.
Check your GPU and driver:
nvidia-smi
Check your Python version:
python --version
Install vLLM
Option 1: Install with pip
python -m venv vllm-env
source vllm-env/bin/activate
pip install vllm
On Windows:
vllm-env\Scripts\activate
Verify the install:
python -c "import vllm; print(vllm.__version__)"
vllm --help
Option 2: Install with Conda
conda create -n vllm-env python=3.11 -y
conda activate vllm-env
pip install vllm
If you need a specific CUDA/PyTorch combination, install PyTorch first, then install vLLM.
Option 3: Install with uv
uv venv vllm-env --python 3.12 --seed
source vllm-env/bin/activate
uv pip install vllm
Verify:
python -c "import vllm; print(vllm.__version__)"
vllm --help
Run Offline Batch Inference
Batch inference is useful for:
- Dataset generation
- Evaluation jobs
- Bulk prompt processing
- Synthetic data creation
- Internal model testing
Create a file named batch_inference.py:
from vllm import LLM, SamplingParams
# 1. Define prompts
prompts = [
    "The capital of France is",
    "Explain the theory of relativity in simple terms:",
    "Write a short poem about a rainy day:",
    "Translate 'Hello, world!' to German:",
]
# 2. Configure sampling
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=150,
    stop=["\n", " Human:", " Assistant:"],
)
# 3. Initialize the vLLM engine
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")
# 4. Generate outputs
outputs = llm.generate(prompts, sampling_params)
# 5. Print results
for output in outputs:
    print("-" * 40)
    print(f"Prompt: {output.prompt!r}")
    print(f"Generated Text: {output.outputs[0].text!r}")
Run it:
python batch_inference.py
Useful notes:
- vLLM loads models from Hugging Face Hub by default.
- To use ModelScope, set:
export VLLM_USE_MODELSCOPE=1
- To use vLLM’s default generation config instead of the model’s config:
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    generation_config="vllm",
)
- For quantized models such as AWQ or GPTQ, check the vLLM documentation and the model card before deployment.
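For dataset generation or evaluation jobs, you will usually want to persist results rather than print them. The following is a minimal sketch of one way to do that. The helper name write_jsonl, the record keys, and the output path are assumptions made for this example, and the stand-in objects only mimic the .prompt and .outputs[0].text attributes used in batch_inference.py so the snippet runs without a GPU.

```python
import json
import os
import tempfile
from types import SimpleNamespace

def write_jsonl(outputs, path):
    """Write one {"prompt", "completion"} record per output; return the count."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for output in outputs:
            record = {
                "prompt": output.prompt,
                "completion": output.outputs[0].text,
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count

# Stand-ins shaped like the objects printed in batch_inference.py.
fake_outputs = [
    SimpleNamespace(prompt="The capital of France is",
                    outputs=[SimpleNamespace(text=" Paris.")]),
    SimpleNamespace(prompt="Translate 'Hello, world!' to German:",
                    outputs=[SimpleNamespace(text=" Hallo, Welt!")]),
]

path = os.path.join(tempfile.gettempdir(), "vllm_batch_outputs.jsonl")
written = write_jsonl(fake_outputs, path)

# Read the file back to confirm each line is a standalone JSON record.
with open(path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
print(written, records[0])
```

In a real job, you would pass the list returned by llm.generate(...) instead of fake_outputs.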
Run vLLM as an OpenAI-Compatible API Server
vLLM can expose an OpenAI-compatible API, which makes it easier to replace or supplement OpenAI endpoints with a self-hosted model.
Start the server:
source vllm-env/bin/activate
vllm serve mistralai/Mistral-7B-Instruct-v0.1
The server runs at:
http://localhost:8000
You can also serve another model:
vllm serve Qwen/Qwen2-1.5B-Instruct
Common server options:
vllm serve mistralai/Mistral-7B-Instruct-v0.1 \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --api-key your-api-key \
  --generation-config vllm
Useful options:
- --host 0.0.0.0: bind to all interfaces.
- --port 8000: set the API port.
- --tensor-parallel-size <N>: split the model across multiple GPUs.
- --api-key <key>: require an API key.
- --generation-config vllm: use vLLM default generation parameters.
- --chat-template <path>: use a custom chat template.
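If you launch the server from a deployment script, it can help to assemble the command from configuration values instead of hand-editing a shell line. The sketch below uses only the flags listed above; the function name build_serve_command and its defaults are assumptions for this example, not part of vLLM.

```python
import shlex

def build_serve_command(model, host="0.0.0.0", port=8000,
                        tensor_parallel_size=1, api_key=None,
                        generation_config=None, chat_template=None):
    """Return the argv list for `vllm serve` with the chosen options."""
    argv = [
        "vllm", "serve", model,
        "--host", host,
        "--port", str(port),
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]
    # Optional flags are added only when a value is provided.
    if api_key:
        argv += ["--api-key", api_key]
    if generation_config:
        argv += ["--generation-config", generation_config]
    if chat_template:
        argv += ["--chat-template", chat_template]
    return argv

argv = build_serve_command(
    "mistralai/Mistral-7B-Instruct-v0.1",
    api_key="your-api-key",
    generation_config="vllm",
)
print(shlex.join(argv))

# To start the server from Python (requires vLLM to be installed):
# import subprocess
# subprocess.run(argv, check=True)
```

Passing a list to subprocess.run avoids shell quoting problems with paths or keys that contain special characters.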
Call the Completions API
cURL
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.1",
    "prompt": "San Francisco is a city in",
    "max_tokens": 50,
    "temperature": 0.7
  }'
If you started the server with --api-key, include an authorization header:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-api-key" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.1",
    "prompt": "Explain the benefits of using vLLM:",
    "max_tokens": 150,
    "temperature": 0.5
  }'
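The same request can be built in Python with only the standard library, which is handy in environments where you cannot install extra packages. This is a sketch under the assumption that the server from the previous section is running locally; build_completion_request is a hypothetical helper name, and the send step is left commented out because it needs a live server.

```python
import json
import urllib.request

def build_completion_request(base_url, model, prompt, api_key=None, **params):
    """Build a POST request for the /v1/completions endpoint."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Mirrors the Authorization header used in the cURL example.
        headers["Authorization"] = f"Bearer {api_key}"
    body = json.dumps({"model": model, "prompt": prompt, **params}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/completions", data=body, headers=headers, method="POST"
    )

req = build_completion_request(
    "http://localhost:8000",
    "mistralai/Mistral-7B-Instruct-v0.1",
    "San Francisco is a city in",
    api_key="your-api-key",
    max_tokens=50,
    temperature=0.7,
)
print(req.get_method(), req.full_url)

# Sending the request requires a running vLLM server:
# with urllib.request.urlopen(req, timeout=60) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```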
Python with the OpenAI Client
Install the client:
pip install openai
Call the vLLM endpoint:
from openai import OpenAI
client = OpenAI(
    api_key="EMPTY",  # Use your API key if the server requires one
    base_url="http://localhost:8000/v1",
)
completion = client.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    prompt="Explain the benefits of using vLLM:",
    max_tokens=150,
    temperature=0.5,
)
print(completion.choices[0].text)
Call the Chat Completions API
cURL
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.1",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is the main advantage of PagedAttention in vLLM?"
      }
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'
Python
from openai import OpenAI
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)
chat_response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful programming assistant.",
        },
        {
            "role": "user",
            "content": "Write a simple Python function to calculate factorial.",
        },
    ],
    max_tokens=200,
    temperature=0.5,
)
print(chat_response.choices[0].message.content)
You can use an API tool such as Apidog to define these endpoints, send test requests, validate responses, and document the API for your team.
vLLM Attention Backends
vLLM supports multiple attention computation backends:
- FlashAttention 1 and 2: fast on modern NVIDIA GPUs and optimized for memory usage.
- xFormers: broader compatibility and useful as a fallback.
- FlashInfer: an advanced backend that may require manual installation.
By default, vLLM selects an appropriate backend for your hardware and model.
To force a backend, set VLLM_ATTENTION_BACKEND before starting vLLM:
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
vllm serve mistralai/Mistral-7B-Instruct-v0.1
Other possible values include:
export VLLM_ATTENTION_BACKEND=XFORMERS
export VLLM_ATTENTION_BACKEND=FLASHINFER
Troubleshooting Common vLLM Issues
1. CUDA Out of Memory
Symptoms:
- Server fails during model loading.
- Requests fail with CUDA OOM errors.
- GPU memory is almost fully allocated.
Try:
nvidia-smi
Then:
- Use a smaller model.
- Reduce concurrent requests.
- Reduce max_tokens.
- Use a quantized model such as AWQ or GPTQ where supported.
- Use multiple GPUs with --tensor-parallel-size.
- Stop other processes using GPU memory.
Example:
vllm serve mistralai/Mistral-7B-Instruct-v0.1 \
  --tensor-parallel-size 2
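Before changing hardware, a rough estimate can show whether a model is likely to fit. The arithmetic below is a back-of-the-envelope sketch, not a measurement: it assumes 16-bit weights and KV cache, ignores activations and framework overhead, and uses approximate figures for a Mistral-7B-class model (about 7.2 billion parameters, 32 layers, 8 KV heads, head dimension 128). Check the model's config.json for the real values.

```python
GIB = 1024 ** 3

def weight_bytes(num_params, bytes_per_param=2):
    """Memory for model weights (2 bytes per parameter for fp16/bf16)."""
    return num_params * bytes_per_param

def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """KV cache per token: one key and one value vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

# Assumed, approximate figures for a Mistral-7B-class model.
weights = weight_bytes(7_200_000_000)
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

# Total tokens held in the cache across all concurrent requests (hypothetical).
cached_tokens = 8192
kv_total = per_token * cached_tokens

print(f"weights:   {weights / GIB:.1f} GiB")    # 13.4 GiB
print(f"per token: {per_token / 1024:.0f} KiB")  # 128 KiB
print(f"KV cache:  {kv_total / GIB:.1f} GiB")    # 1.0 GiB
```

The estimate shows why both levers in the list above matter: a smaller or quantized model shrinks the weights term, while fewer concurrent requests or a lower max_tokens shrinks the KV cache term.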
2. Installation or Compatibility Problems
Check that CUDA, PyTorch, and NVIDIA drivers are compatible.
If pip installation causes dependency issues, try a clean environment:
conda create -n vllm-env python=3.11 -y
conda activate vllm-env
pip install vllm
3. Model Loading Failures
Check:
- The model name is correct.
- You have access to gated models if applicable.
- Your machine has enough disk space.
- The model files can be downloaded.
- The model architecture is supported.
Example model name:
mistralai/Mistral-7B-Instruct-v0.1
For models that require custom code, you may need:
llm = LLM(
    model="your-model-name",
    trust_remote_code=True,
)
For local models:
llm = LLM(model="/path/to/local/model")
4. Slow Inference
Check GPU utilization:
nvidia-smi
Try:
- Upgrade vLLM and dependencies.
- Update NVIDIA drivers.
- Test a different attention backend.
- Reduce max_tokens.
- Tune temperature and top_p.
- Increase concurrency gradually to find the best throughput/latency balance.
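One way to find the throughput/latency balance is to sweep a few concurrency levels and record both numbers. The harness below is a minimal sketch: measure and fake_request are hypothetical names, and fake_request only sleeps so the example runs without a server. To test a real deployment, replace it with a function that sends one request to your vLLM endpoint.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def measure(send_request, concurrency, total_requests):
    """Send total_requests calls using `concurrency` worker threads."""
    def timed_call(i):
        start = time.perf_counter()
        send_request(i)
        return time.perf_counter() - start

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, range(total_requests)))
    wall = time.perf_counter() - wall_start

    return {
        "concurrency": concurrency,
        "requests_per_second": total_requests / wall,
        "mean_latency_s": sum(latencies) / len(latencies),
    }

def fake_request(i):
    """Stand-in for a real API call; replace with a request to your server."""
    time.sleep(0.01)

results = [measure(fake_request, c, total_requests=8) for c in (1, 2, 4)]
for r in results:
    print(r)
```

Against a real server, throughput typically rises with concurrency until the GPU saturates, after which mean latency grows without further throughput gains; that turning point is the level to operate near.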
5. Unexpected or Nonsensical Output
Check:
- Prompt formatting from the model card.
- Whether the model expects chat-style messages.
- Sampling parameters such as temperature and top_p.
- Chat template configuration.
- Whether the issue happens with a different model.
For chat models, prefer the chat completions endpoint:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.1",
    "messages": [
      {
        "role": "user",
        "content": "Explain PagedAttention in one paragraph."
      }
    ]
  }'
Production Checklist
Before exposing a vLLM API to users, verify:
- The model fits your GPU memory budget.
- You tested realistic concurrent traffic.
- You configured API authentication if needed.
- You monitor GPU utilization and memory.
- You set request limits and timeouts at the gateway or application layer.
- You validate prompt formatting for your chosen model.
- You document the request and response formats.
- You have a fallback plan for model loading or GPU failures.
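Part of a fallback plan is knowing when the server is actually ready, since model loading can take a while. The sketch below polls the OpenAI-compatible model listing endpoint (/v1/models) with the standard library. The function names, the attempt count, and the delay are assumptions for this example; adapt them to your gateway or orchestrator.

```python
import time
import urllib.error
import urllib.request

def server_is_ready(base_url, api_key=None, timeout=5):
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    req = urllib.request.Request(f"{base_url}/v1/models")
    if api_key:
        req.add_header("Authorization", f"Bearer {api_key}")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False

def wait_until_ready(check, attempts=30, delay=2.0):
    """Call check() until it returns True or the attempts run out."""
    for _ in range(attempts):
        if check():
            return True
        time.sleep(delay)
    return False

# Example usage against a local server (requires vLLM to be running):
# ready = wait_until_ready(lambda: server_is_ready("http://localhost:8000"))
# if not ready:
#     raise SystemExit("vLLM server did not become ready; trigger the fallback.")
```

Keeping the check function separate from the polling loop makes it easy to swap in a different readiness signal later.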
Next Steps
vLLM gives you a practical path to serve LLMs with better throughput, memory efficiency, and OpenAI-compatible APIs.
To continue:
- Explore quantization, multi-LoRA, distributed serving, and speculative decoding in the official vLLM documentation.
- Test your API using real prompts and expected response schemas.
- Document your vLLM endpoints so frontend, backend, and QA teams can work against the same contract.
- Use Apidog to design, mock, test, and document your LLM APIs as part of your development workflow.