Hassann

Posted on Jun 23 • Originally published at apidog.com

What is vLLM? Supercharge LLM Inference for Fast and Scalable APIs

Are you building Large Language Model (LLM) applications and hitting slow inference, high latency, or GPU memory limits? vLLM is an open-source inference engine for serving LLMs with high throughput, efficient memory usage, continuous batching, and an OpenAI-compatible API server. This guide shows how to install vLLM, run offline batch inference, expose a real-time API, and troubleshoot common deployment issues.

What is vLLM?

vLLM is an open-source, high-throughput, memory-efficient inference engine for serving large language models.

It is designed to solve two common production problems:

Slow inference when many users send requests concurrently.
High GPU memory usage caused by inefficient KV cache handling.

vLLM improves serving performance mainly through:

PagedAttention: manages the key-value cache using a paging strategy similar to virtual memory, reducing memory waste.
Continuous batching: dynamically batches incoming requests so the GPU stays busy without waiting for fixed-size batches.

For API and backend developers, vLLM is useful when you want to self-host LLMs behind an API, reduce inference latency, or build an OpenAI-compatible endpoint using your own models.

Why API Developers Use vLLM

vLLM is commonly used for LLM API backends because it provides:

High throughput for serving more requests per second.
Better GPU utilization through continuous batching.
Efficient memory management with PagedAttention.
OpenAI-compatible API endpoints for easier integration.
Offline and online inference APIs for both batch jobs and real-time serving.
Broad model support, including Llama, Mistral, Qwen, OPT, Falcon, and more.
Active open-source development with frequent updates.

See the full model list in the vLLM supported models documentation.

Tip: If you are building or testing LLM-powered APIs, you can use Apidog to design, test, and document your endpoints whether they are backed by vLLM, OpenAI, or a custom service.

Supported LLMs

vLLM supports many transformer-based models, including:

Llama, Llama 2, Llama 3
Mistral and Mixtral
Qwen and Qwen2
GPT-2, GPT-J, GPT-NeoX
OPT
Bloom
Falcon
MPT
Multi-modal models
Other compatible Hugging Face and ModelScope models

For the latest compatibility list, check the official vLLM Supported Models List.

If your model is not listed but uses an architecture compatible with a supported model, it may still work. Test carefully before using it in production. Custom architectures may require upstream code changes.

Key Concepts: PagedAttention and Continuous Batching

PagedAttention

Traditional serving systems often reserve one contiguous region of GPU memory for the KV cache of each request, typically sized for the maximum possible sequence length. This causes fragmentation and wasted GPU memory, especially when requests have different sequence lengths.

vLLM uses PagedAttention to split the KV cache into smaller blocks or “pages.” This makes memory allocation more flexible and can significantly reduce memory waste.

Practical impact:

Serve longer contexts more efficiently.
Handle more concurrent requests.
Reduce GPU memory pressure.
Improve utilization for workloads with variable prompt lengths.
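To make the memory argument concrete, here is a toy calculation, not vLLM's actual allocator. It compares reserving a full-length contiguous region per request with reserving only the fixed-size blocks a request needs. The block size, request lengths, and function names are illustrative assumptions.

```python
import math

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)


def contiguous_slots(seq_lens, max_len):
    """Token slots reserved when every request pre-allocates max_len."""
    return len(seq_lens) * max_len


def paged_slots(seq_lens, block_size=BLOCK_SIZE):
    """Token slots reserved when each request holds only the blocks it needs."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


seq_lens = [30, 500, 75, 1200]  # made-up token counts per request
max_len = 2048                  # per-request reservation in the contiguous scheme

used = sum(seq_lens)
for name, reserved in [
    ("contiguous", contiguous_slots(seq_lens, max_len)),
    ("paged", paged_slots(seq_lens)),
]:
    waste = 1 - used / reserved
    print(f"{name:>10}: reserved={reserved} used={used} waste={waste:.1%}")
```

In the paged scheme, waste is bounded by at most one partially filled block per request, which is why variable-length workloads benefit the most.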

Continuous Batching

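Static batching waits until a batch is full or a timeout is reached, and then runs the whole batch until its longest sequence finishes. That can increase latency and leave GPU resources idle while short requests wait for long ones.

vLLM uses continuous batching, which schedules at the level of individual generation steps: finished sequences leave the batch, and waiting requests join it as soon as resources become available.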

Practical impact:

Better throughput under real traffic.
Lower waiting time for users.
More efficient GPU scheduling.
Better handling of mixed short and long requests.
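The difference between the two strategies can be sketched with a small simulation. This is a simplified model, not vLLM's scheduler: each request is reduced to the number of tokens it must generate, one step produces one token per running request, and the request lengths are made up.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Finished requests free their slot immediately for waiting ones."""
    waiting = deque(lengths)
    running = []
    steps = 0
    while waiting or running:
        # Fill any free slots before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # One decode step: every running request emits one token,
        # and requests with nothing left to generate leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


lengths = [2, 10, 3, 2, 9, 2]  # tokens to generate per request (made-up)
print("static:    ", static_batching_steps(lengths, batch_size=2))
print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

With these numbers the static schedule needs 22 steps and the continuous one 16, because short requests no longer hold a slot hostage until the longest request in their batch completes.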

Prerequisites

Before installing vLLM, prepare the following environment:

OS: Linux recommended. WSL2 and macOS may work, but Linux is best supported.
Python: 3.9, 3.10, 3.11, or 3.12 (the supported range changes between vLLM releases, so confirm it in the installation guide).
GPU: NVIDIA GPU with CUDA for best performance.
PyTorch: vLLM can install a compatible version automatically, but you may pre-install PyTorch for custom CUDA setups.
Virtual environment: recommended for clean dependency management.

Check your GPU and driver:

nvidia-smi

Check your Python version:

python --version
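If you prefer to run both checks from a script, for example in a setup step, a minimal sketch could look like this. The helper names are hypothetical, and the version set simply mirrors the prerequisites list, so adjust it for your vLLM release.

```python
import shutil
import sys

# Versions copied from the prerequisites list; adjust for your vLLM release.
SUPPORTED_PYTHONS = {(3, 9), (3, 10), (3, 11), (3, 12)}


def python_is_supported(version_info=None):
    """Return True if the (major, minor) pair is in SUPPORTED_PYTHONS."""
    major, minor = (version_info or sys.version_info)[:2]
    return (major, minor) in SUPPORTED_PYTHONS


def nvidia_smi_available():
    """Return True if the nvidia-smi executable is on PATH."""
    return shutil.which("nvidia-smi") is not None


print("Python OK:", python_is_supported())
print("nvidia-smi found:", nvidia_smi_available())
```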

Install vLLM

Option 1: Install with pip

python -m venv vllm-env
source vllm-env/bin/activate

pip install vllm

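On Windows (vLLM does not officially support native Windows, so WSL2 with the Linux commands above is the safer route), a venv is activated with: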

vllm-env\Scripts\activate

Verify the install:

python -c "import vllm; print(vllm.__version__)"
vllm --help

Option 2: Install with Conda

conda create -n vllm-env python=3.11 -y
conda activate vllm-env

pip install vllm

If you need a specific CUDA/PyTorch combination, install PyTorch first, then install vLLM.

Option 3: Install with uv

uv venv vllm-env --python 3.12 --seed
source vllm-env/bin/activate

uv pip install vllm

Verify:

python -c "import vllm; print(vllm.__version__)"
vllm --help

Run Offline Batch Inference

Batch inference is useful for:

Dataset generation
Evaluation jobs
Bulk prompt processing
Synthetic data creation
Internal model testing

Create a file named batch_inference.py:

from vllm import LLM, SamplingParams

# 1. Define prompts
prompts = [
    "The capital of France is",
    "Explain the theory of relativity in simple terms:",
    "Write a short poem about a rainy day:",
    "Translate 'Hello, world!' to German:",
]

# 2. Configure sampling
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=150,
    stop=[" Human:", " Assistant:"],  # "\n" omitted: it would cut multi-line answers
)

# 3. Initialize the vLLM engine
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")

# 4. Generate outputs
outputs = llm.generate(prompts, sampling_params)

# 5. Print results
for output in outputs:
    print("-" * 40)
    print(f"Prompt: {output.prompt!r}")
    print(f"Generated Text: {output.outputs[0].text!r}")

Run it:

python batch_inference.py

Useful notes:

vLLM loads models from Hugging Face Hub by default.
To use ModelScope, set:

export VLLM_USE_MODELSCOPE=1

To use vLLM’s default generation config instead of the model’s config:

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    generation_config="vllm",
)

For quantized models such as AWQ or GPTQ, check the vLLM documentation and the model card before deployment.

Run vLLM as an OpenAI-Compatible API Server

vLLM can expose an OpenAI-compatible API, which makes it easier to replace or supplement OpenAI endpoints with a self-hosted model.

Start the server:

source vllm-env/bin/activate

vllm serve mistralai/Mistral-7B-Instruct-v0.1

The server runs at:

http://localhost:8000

You can also serve another model:

vllm serve Qwen/Qwen2-1.5B-Instruct

Common server options:

vllm serve mistralai/Mistral-7B-Instruct-v0.1 \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --api-key your-api-key \
  --generation-config vllm

Useful options:

--host 0.0.0.0: bind to all interfaces.
--port 8000: set the API port.
--tensor-parallel-size <N>: split the model across multiple GPUs.
--api-key <key>: require an API key.
--generation-config vllm: use vLLM default generation parameters.
--chat-template <path>: use a custom chat template.
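If you launch the server from a deployment script, it can help to assemble the command from configuration values instead of hand-editing a shell line. The sketch below uses only the flags listed above. build_serve_command is a hypothetical helper, not part of vLLM.

```python
import shlex


def build_serve_command(model, host=None, port=None, tensor_parallel_size=None,
                        api_key=None, generation_config=None, chat_template=None):
    """Return the `vllm serve` argument list, skipping options left as None."""
    cmd = ["vllm", "serve", model]
    options = {
        "--host": host,
        "--port": port,
        "--tensor-parallel-size": tensor_parallel_size,
        "--api-key": api_key,
        "--generation-config": generation_config,
        "--chat-template": chat_template,
    }
    for flag, value in options.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_serve_command(
    "mistralai/Mistral-7B-Instruct-v0.1",
    host="0.0.0.0",
    port=8000,
    tensor_parallel_size=1,
)
print(shlex.join(cmd))  # shell-safe string, useful for logging
```

The returned list can be passed directly to subprocess.run(cmd), which avoids shell quoting problems with API keys or template paths.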

Call the Completions API

cURL

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.1",
    "prompt": "San Francisco is a city in",
    "max_tokens": 50,
    "temperature": 0.7
  }'

If you started the server with --api-key, include an authorization header:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-api-key" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.1",
    "prompt": "Explain the benefits of using vLLM:",
    "max_tokens": 150,
    "temperature": 0.5
  }'
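The same requests can be built with only the Python standard library, which is handy when you do not want an extra dependency. This is a minimal sketch: build_completion_request is a hypothetical helper, and the call that actually sends the request is left commented out because it needs a running server.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, api_key=None, **params):
    """Build a POST request for the /v1/completions endpoint."""
    body = {"model": model, "prompt": prompt, **params}
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed when the server was started with --api-key.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_completion_request(
    "http://localhost:8000",
    model="mistralai/Mistral-7B-Instruct-v0.1",
    prompt="Explain the benefits of using vLLM:",
    api_key="your-api-key",
    max_tokens=150,
    temperature=0.5,
)

# To send it (requires a running vLLM server):
# with urllib.request.urlopen(req, timeout=60) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```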

Python with the OpenAI Client

Install the client:

pip install openai

Call the vLLM endpoint:

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",  # Use your API key if the server requires one
    base_url="http://localhost:8000/v1",
)

completion = client.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    prompt="Explain the benefits of using vLLM:",
    max_tokens=150,
    temperature=0.5,
)

print(completion.choices[0].text)

Call the Chat Completions API

cURL

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.1",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is the main advantage of PagedAttention in vLLM?"
      }
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'

Python

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

chat_response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    messages=[
        {
            "role": "system",
            "content": "You are a helpful programming assistant.",
        },
        {
            "role": "user",
            "content": "Write a simple Python function to calculate factorial.",
        },
    ],
    max_tokens=200,
    temperature=0.5,
)

print(chat_response.choices[0].message.content)

You can use an API tool such as Apidog to define these endpoints, send test requests, validate responses, and document the API for your team.

vLLM Attention Backends

vLLM supports multiple attention computation backends:

FlashAttention 1 and 2: fast on modern NVIDIA GPUs and optimized for memory usage.
xFormers: broader compatibility and useful as a fallback.
FlashInfer: an advanced backend that may require manual installation.

By default, vLLM selects an appropriate backend for your hardware and model.

To force a backend, set VLLM_ATTENTION_BACKEND before starting vLLM:

export VLLM_ATTENTION_BACKEND=FLASH_ATTN
vllm serve mistralai/Mistral-7B-Instruct-v0.1

Other possible values, depending on your vLLM version and installed packages, include:

export VLLM_ATTENTION_BACKEND=XFORMERS

export VLLM_ATTENTION_BACKEND=FLASHINFER

Troubleshooting Common vLLM Issues

1. CUDA Out of Memory

Symptoms:

Server fails during model loading.
Requests fail with CUDA OOM errors.
GPU memory is almost fully allocated.

Try:

nvidia-smi

Then:

Use a smaller model.
Reduce concurrent requests.
Reduce max_tokens.
Use a quantized model such as AWQ or GPTQ where supported.
Use multiple GPUs with --tensor-parallel-size.
Stop other processes using GPU memory.

Example:

vllm serve mistralai/Mistral-7B-Instruct-v0.1 \
  --tensor-parallel-size 2
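A rough estimate of KV cache size helps explain why concurrency and max_tokens affect memory. The arithmetic below is a back-of-the-envelope sketch. The layer count, KV head count, and head dimension are assumed values for a Mistral-7B-style model in fp16, so check your model's config.json. The 8 GiB figure is hypothetical.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies (factor 2 = keys and values)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(free_bytes, bytes_per_token):
    """How many tokens fit in the memory left after loading the weights."""
    return free_bytes // bytes_per_token


# Assumed values for a Mistral-7B-style model in fp16; verify in config.json.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
free_bytes = 8 * 1024**3  # hypothetical: 8 GiB left for the KV cache

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{max_cached_tokens(free_bytes, per_token)} tokens fit in the cache")
```

Under these assumptions the cache holds 65,536 tokens in total, shared by all in-flight requests. That is about 32 concurrent requests of 2,048 tokens each, which is why lowering concurrency or output length relieves memory pressure.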

2. Installation or Compatibility Problems

Check that CUDA, PyTorch, and NVIDIA drivers are compatible.

If pip installation causes dependency issues, try a clean environment:

conda create -n vllm-env python=3.11 -y
conda activate vllm-env
pip install vllm

3. Model Loading Failures

Check:

The model name is correct.
You have access to gated models if applicable.
Your machine has enough disk space.
The model files can be downloaded.
The model architecture is supported.

Example model name:

mistralai/Mistral-7B-Instruct-v0.1

For models that require custom code, you may need:

llm = LLM(
    model="your-model-name",
    trust_remote_code=True,
)

For local models:

llm = LLM(model="/path/to/local/model")

4. Slow Inference

Check GPU utilization:

nvidia-smi

Try:

Upgrade vLLM and dependencies.
Update NVIDIA drivers.
Test a different attention backend.
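Reduce max_tokens, since generation time grows with the number of output tokens.
Do not expect temperature or top_p to help; they change output quality, not speed.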
Increase concurrency gradually to find the best throughput/latency balance.
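When you compare concurrency levels, it helps to reduce each test run to the same few numbers. The sketch below assumes you have already recorded per-request latencies, the wall-clock duration of the run, and the total number of generated tokens. summarize is a hypothetical helper and the sample values are made up.

```python
import math
import statistics


def summarize(latencies_s, wall_time_s, total_output_tokens):
    """Reduce one load-test run to throughput and latency figures."""
    ordered = sorted(latencies_s)
    # Nearest-rank 95th percentile.
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "requests_per_s": len(ordered) / wall_time_s,
        "tokens_per_s": total_output_tokens / wall_time_s,
        "p50_s": statistics.median(ordered),
        "p95_s": ordered[p95_index],
    }


# Made-up measurements from one run at a fixed concurrency level.
run = summarize([0.5, 1.0, 1.5, 4.0], wall_time_s=2.0, total_output_tokens=400)
for key, value in run.items():
    print(f"{key}: {value}")
```

Repeat the run at each concurrency level and stop increasing it once tokens_per_s flattens while p95_s keeps climbing.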

5. Unexpected or Nonsensical Output

Check:

Prompt formatting from the model card.
Whether the model expects chat-style messages.
Sampling parameters such as temperature and top_p.
Chat template configuration.
Whether the issue happens with a different model.

For chat models, prefer the chat completions endpoint:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.1",
    "messages": [
      {
        "role": "user",
        "content": "Explain PagedAttention in one paragraph."
      }
    ]
  }'

Production Checklist

Before exposing a vLLM API to users, verify:

The model fits your GPU memory budget.
You tested realistic concurrent traffic.
You configured API authentication if needed.
You monitor GPU utilization and memory.
You set request limits and timeouts at the gateway or application layer.
You validate prompt formatting for your chosen model.
You document the request and response formats.
You have a fallback plan for model loading or GPU failures.
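For the request limits and timeouts item, one application-layer approach is a concurrency cap combined with a per-request timeout. The sketch below uses asyncio and a fake backend coroutine in place of a real HTTP call. call_with_limits and fake_backend are hypothetical names, and the limits are illustrative.

```python
import asyncio


async def call_with_limits(make_request, semaphore, timeout_s):
    """Run one request under a concurrency cap and a hard timeout."""
    async with semaphore:
        return await asyncio.wait_for(make_request(), timeout=timeout_s)


async def fake_backend(delay_s):
    # Hypothetical stand-in for an HTTP call to the vLLM server.
    await asyncio.sleep(delay_s)
    return f"done after {delay_s}s"


async def main():
    semaphore = asyncio.Semaphore(2)  # at most 2 requests in flight
    delays = [0.01, 0.01, 1.0]        # the last one exceeds the timeout
    return await asyncio.gather(
        *(
            call_with_limits(lambda d=d: fake_backend(d), semaphore, timeout_s=0.2)
            for d in delays
        ),
        return_exceptions=True,  # collect timeouts instead of raising
    )


results = asyncio.run(main())
for result in results:
    print(repr(result))
```

The timeout clock starts only after a slot is acquired, so queueing time is not counted. If you need an end-to-end deadline, wrap the whole call_with_limits call in asyncio.wait_for instead.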

Next Steps

vLLM gives you a practical path to serve LLMs with better throughput, memory efficiency, and OpenAI-compatible APIs.

To continue:

Explore quantization, multi-LoRA, distributed serving, and speculative decoding in the official vLLM documentation.
Test your API using real prompts and expected response schemas.
Document your vLLM endpoints so frontend, backend, and QA teams can work against the same contract.
Use Apidog to design, mock, test, and document your LLM APIs as part of your development workflow.
