Rost

Originally published at glukhov.org

vLLM Quickstart: High-Performance LLM Serving

vLLM is a high-throughput, memory-efficient inference and serving engine for Large Language Models (LLMs) developed by UC Berkeley's Sky Computing Lab.

With its PagedAttention algorithm, vLLM achieves up to 24x higher throughput than serving with stock HuggingFace Transformers, making it a go-to choice for production LLM deployments.

What is vLLM?

vLLM (virtual LLM) is an open-source library for fast LLM inference and serving that has quickly become the industry standard for production deployments. Released in 2023, it introduced PagedAttention, a groundbreaking memory management technique that dramatically improves serving efficiency.

Key Features

High Throughput Performance: vLLM delivers up to 24x higher throughput than HuggingFace Transformers on the same hardware. This performance gain comes from continuous batching, optimized CUDA kernels, and the PagedAttention algorithm that all but eliminates memory fragmentation.

OpenAI API Compatibility: vLLM includes a built-in API server that's fully compatible with OpenAI's format. This allows seamless migration from OpenAI to self-hosted infrastructure without changing application code. Simply point your API client to vLLM's endpoint and it works transparently.

PagedAttention Algorithm: The core innovation behind vLLM's performance is PagedAttention, which applies the concept of virtual memory paging to the attention KV cache. Instead of reserving one contiguous memory region per sequence (which leads to heavy fragmentation), PagedAttention divides the cache into fixed-size blocks that are allocated on demand. The PagedAttention paper reports that this cuts KV-cache waste from the 60-80% typical of contiguous allocation to just a few percent, enabling much larger batch sizes.
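
To make the block-table idea concrete, here is a toy sketch in plain Python (illustrative only, not vLLM's internals) of mapping a sequence's growing KV cache onto fixed-size physical blocks that are allocated on demand:

# Toy illustration of paged KV-cache allocation (not vLLM's actual code)
BLOCK_SIZE = 16  # tokens per physical block

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # pool of physical blocks
        self.block_tables = {}                      # seq_id -> list of physical block ids

    def append_token(self, seq_id, position):
        """Allocate a new physical block only when a sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:              # all current blocks are full
            table.append(self.free_blocks.pop())
        # logical token position -> (physical block, offset), no contiguous preallocation
        return table[position // BLOCK_SIZE], position % BLOCK_SIZE

    def free(self, seq_id):
        """Finished sequences return their blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=1024)
print(cache.append_token("request-1", 0))  # first block is allocated lazily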

Continuous Batching: Unlike static batching where you wait for all sequences to complete, vLLM uses continuous (rolling) batching. As soon as one sequence finishes, a new one can be added to the batch. This maximizes GPU utilization and minimizes latency for incoming requests.
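
Because each decode step works on whatever sequences are currently active, the scheduler can be pictured as a loop like the following toy sketch (illustrative only; vLLM's real scheduler also handles preemption and memory limits):

# Toy continuous-batching loop (illustrative only)
from collections import deque
import random

waiting = deque(f"req-{i}" for i in range(8))  # incoming requests
active = {}                                    # request -> tokens left to generate
MAX_BATCH = 4

while waiting or active:
    # Admit new requests as soon as slots free up, instead of waiting for the batch to drain
    while waiting and len(active) < MAX_BATCH:
        active[waiting.popleft()] = random.randint(3, 10)
    # One decode step for every active sequence
    for req in list(active):
        active[req] -= 1
        if active[req] == 0:  # finished: its slot is reused on the very next step
            print(f"{req} done")
            del active[req]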

Multi-GPU Support: vLLM supports tensor parallelism and pipeline parallelism for distributing large models across multiple GPUs. It can efficiently serve models that don't fit in a single GPU's memory, supporting configurations from 2 to 8+ GPUs.

Wide Model Support: Compatible with popular model architectures including LLaMA, Mistral, Mixtral, Qwen, Phi, Gemma, and many others. Supports both instruction-tuned and base models from HuggingFace Hub.

When to Use vLLM

vLLM excels in specific scenarios where its strengths shine:

Production API Services: When you need to serve an LLM to many concurrent users via API, vLLM's high throughput and efficient batching make it the best choice. Companies running chatbots, code assistants, or content generation services benefit from its ability to handle hundreds of requests per second.

High-Concurrency Workloads: If your application has many simultaneous users making requests, vLLM's continuous batching and PagedAttention enable serving more users with the same hardware compared to alternatives.

Cost Optimization: When GPU costs are a concern, vLLM's superior throughput means you can serve the same traffic with fewer GPUs, directly reducing infrastructure costs. PagedAttention's memory efficiency also lets you fit more concurrent requests onto smaller, cheaper GPU instances.

Kubernetes Deployments: vLLM's stateless design and container-friendly architecture make it ideal for Kubernetes clusters. Its consistent performance under load and straightforward resource management integrate well with cloud-native infrastructure.

When NOT to Use vLLM: For local development, experimentation, or single-user scenarios, tools like Ollama provide better user experience with simpler setup. vLLM's complexity is justified when you need its performance advantages for production workloads.

How to Install vLLM

Prerequisites

Before installing vLLM, ensure your system meets these requirements:

  • GPU: NVIDIA GPU with compute capability 7.0+ (V100, T4, A10, A100, H100, RTX 20/30/40 series)
  • CUDA: Version 11.8 or higher
  • Python: 3.9 to 3.12 for recent releases (older releases supported 3.8 to 3.11); check the installation docs for your vLLM version
  • VRAM: Minimum 16GB for 7B models, 24GB+ for 13B, 40GB+ for larger models
  • Driver: NVIDIA driver 450.80.02 or newer

Installation via pip

The simplest installation method is using pip. This works on systems with CUDA 11.8 or newer:

# Create a virtual environment (recommended)
python3 -m venv vllm-env
source vllm-env/bin/activate

# Install vLLM
pip install vllm

# Verify installation
python -c "import vllm; print(vllm.__version__)"

For a different CUDA version, install a wheel built against that version. The exact version tags and URLs change between releases, so check the vLLM installation docs for the current ones; the general pattern looks like this:

# For CUDA 12.1
pip install vllm==0.4.2+cu121 -f https://github.com/vllm-project/vllm/releases

# For CUDA 11.8
pip install vllm==0.4.2+cu118 -f https://github.com/vllm-project/vllm/releases

Installation with Docker

Docker provides the most reliable deployment method, especially for production:

# Pull the official vLLM image
docker pull vllm/vllm-openai:latest

# Run vLLM with GPU support
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model mistralai/Mistral-7B-Instruct-v0.2

The --ipc=host flag matters because PyTorch's tensor-parallel worker processes exchange data through shared memory; without it (or a generous --shm-size) multi-GPU setups can fail or stall.

Building from Source

For the latest features or custom modifications, build from source:

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .

vLLM Quickstart Guide

Running Your First Model

Start vLLM with a model using the command-line interface:

# Download and serve Mistral-7B with OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --port 8000

vLLM will automatically download the model from HuggingFace Hub (if not cached) and start the server. You'll see output indicating the server is ready:

INFO:     Started server process [12345]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000
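
Once the log shows Uvicorn running, a quick way to confirm the server is healthy is to list the models it serves through the OpenAI-compatible /v1/models endpoint (a small sketch using the requests library):

import requests

# The OpenAI-compatible server reports the model(s) it is currently serving
resp = requests.get("http://localhost:8000/v1/models")
print(resp.json()["data"][0]["id"])  # e.g. mistralai/Mistral-7B-Instruct-v0.2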

Making API Requests

Once the server is running, you can make requests using the OpenAI Python client or curl:

Using curl:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "prompt": "Explain what vLLM is in one sentence:",
        "max_tokens": 100,
        "temperature": 0.7
    }'

Using OpenAI Python Client:

from openai import OpenAI

# Point to your vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # vLLM doesn't require authentication by default
)

response = client.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    prompt="Explain what vLLM is in one sentence:",
    max_tokens=100,
    temperature=0.7
)

print(response.choices[0].text)

Chat Completions API:

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is PagedAttention?"}
    ],
    max_tokens=200
)

print(response.choices[0].message.content)

Advanced Configuration

vLLM offers numerous parameters to optimize performance:

python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --port 8000 \
    --gpu-memory-utilization 0.95 \
    --max-model-len 8192 \
    --tensor-parallel-size 2 \
    --dtype float16 \
    --max-num-seqs 256

Key Parameters Explained:

  • --gpu-memory-utilization: How much GPU memory to use (0.90 = 90%). Higher values allow larger batches but leave less margin for memory spikes.
  • --max-model-len: Maximum context length. Reducing this saves memory for larger batches.
  • --tensor-parallel-size: Number of GPUs to split the model across.
  • --dtype: Data type for weights (float16, bfloat16, or float32). FP16 is usually optimal.
  • --max-num-seqs: Maximum number of sequences to process in a batch.
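
The same knobs exist in vLLM's offline Python API (vllm.LLM), which is convenient for sanity-checking a configuration before wiring it into the server; the model and values below are placeholders mirroring the flags above:

from vllm import LLM, SamplingParams

# Offline engine configured with the same parameters as the server flags above
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    gpu_memory_utilization=0.95,
    max_model_len=8192,
    tensor_parallel_size=2,
    dtype="float16",
    max_num_seqs=256,
)

outputs = llm.generate(["Explain PagedAttention briefly:"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)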

vLLM vs Ollama Comparison

Both vLLM and Ollama are popular choices for local LLM hosting, but they target different use cases. Understanding when to use each tool can significantly impact your project's success.

Performance and Throughput

vLLM is engineered for maximum throughput in multi-user scenarios. Its PagedAttention and continuous batching enable serving hundreds of concurrent requests efficiently. Benchmarks show vLLM achieving up to 24x higher throughput than standard implementations and 2-4x higher than Ollama under high concurrency.

Ollama optimizes for single-user interactive use with focus on low latency for individual requests. While it doesn't match vLLM's multi-user throughput, it provides excellent performance for development and personal use with faster cold-start times and lower idle resource consumption.

Ease of Use

Ollama wins decisively on simplicity. Installation is a single command (curl | sh), and running models is as simple as ollama run llama2. It includes a model library with quantized versions optimized for different hardware profiles. The user experience resembles Docker – pull, run, and go.

vLLM requires more setup: Python environment management, CUDA installation, understanding of serving parameters, and manual model specification. The learning curve is steeper, but you gain fine-grained control over performance optimization. This complexity is warranted for production deployments where you need to squeeze maximum performance from your hardware.

API and Integration

vLLM provides OpenAI-compatible REST APIs out of the box, making it a drop-in replacement for OpenAI's API in existing applications. This is crucial for migrating production services from cloud providers to self-hosted infrastructure without code changes.

Ollama offers a simpler REST API and a dedicated Python/JavaScript library. While functional, it's not OpenAI-compatible, requiring code changes when integrating with applications expecting OpenAI's format. However, community projects like Ollama-OpenAI adapters bridge this gap.

Memory Management

vLLM's PagedAttention algorithm provides superior memory efficiency for concurrent requests. It can serve 2-4x more concurrent users with the same VRAM compared to naive implementations. This directly translates to cost savings in production deployments.

Ollama uses simpler memory management suitable for single-user scenarios. It automatically manages model loading/unloading based on activity, which is convenient for development but not optimal for high-concurrency production use.

Multi-GPU Support

vLLM excels with native tensor parallelism and pipeline parallelism, efficiently distributing models across 2-8+ GPUs. This is essential for serving large models like 70B parameter LLMs that don't fit in a single GPU.

Ollama currently has limited multi-GPU support, primarily working best with a single GPU. This makes it less suitable for very large models requiring distributed inference.

Use Case Recommendations

Choose vLLM when:

  • Serving production APIs with many concurrent users
  • Optimizing cost per request in cloud deployments
  • Running in Kubernetes or container orchestration platforms
  • Need OpenAI API compatibility for existing applications
  • Serving large models requiring multi-GPU support
  • Performance and throughput are critical requirements

Choose Ollama when:

  • Local development and experimentation
  • Single-user interactive use (personal assistants, chatbots)
  • Quick prototyping and model evaluation
  • Learning about LLMs without infrastructure complexity
  • Running on personal workstations or laptops
  • Simplicity and ease of use are priorities

Many teams use both: Ollama for development and experimentation, then vLLM for production deployment. This combination provides developer productivity while maintaining production performance.

vLLM vs Docker Model Runner

Docker recently introduced Model Runner as its official solution for running AI models locally. How does it compare to vLLM?

Architecture Philosophy

Docker Model Runner aims to be the "Docker for AI" – a simple, standardized way to run AI models locally with the same ease as running containers. It abstracts away complexity and provides a consistent interface across different models and frameworks.

vLLM is a specialized inference engine focused solely on LLM serving with maximum performance. It's a lower-level tool that you containerize with Docker, rather than a complete platform.

Setup and Getting Started

Docker Model Runner installation is straightforward for Docker users:

docker model pull llama3:8b
docker model run llama3:8b

This similarity to Docker's image workflow makes it instantly familiar to developers already using containers.

vLLM requires more initial setup (Python, CUDA, dependencies) or using pre-built Docker images:

docker pull vllm/vllm-openai:latest
docker run --runtime nvidia --gpus all vllm/vllm-openai:latest --model <model-name>

Performance Characteristics

vLLM delivers superior throughput for multi-user scenarios due to PagedAttention and continuous batching. For production API services handling hundreds of requests per second, vLLM's optimizations provide 2-5x better throughput than generic serving approaches.

Docker Model Runner focuses on ease of use rather than maximum performance. It's suitable for local development, testing, and moderate workloads, but doesn't implement the advanced optimizations that make vLLM excel at scale.

Model Support

Docker Model Runner provides a curated model library with one-command access to popular models. It supports multiple frameworks (not just LLMs) including Stable Diffusion, Whisper, and other AI models, making it more versatile for different AI workloads.

vLLM specializes in LLM inference with deep support for transformer-based language models. It supports any HuggingFace-compatible LLM but doesn't extend to other AI model types like image generation or speech recognition.

Production Deployment

vLLM is battle-tested in production at organizations serving billions of tokens daily, from LLM API providers to research groups such as LMSYS's Chatbot Arena. Its performance characteristics and stability under heavy load have made it a de facto standard for production LLM serving.

Docker Model Runner is newer and positions itself more for development and local testing scenarios. While it could serve production traffic, it lacks the proven track record and performance optimizations that production deployments require.

Integration Ecosystem

vLLM integrates with production infrastructure tools: Kubernetes operators, Prometheus metrics, Ray for distributed serving, and extensive OpenAI API compatibility for existing applications.

Docker Model Runner integrates naturally with Docker's ecosystem and Docker Desktop. For teams already standardized on Docker, this integration provides a cohesive experience but fewer specialized LLM serving features.

When to Use Each

Use vLLM for:

  • Production LLM API services
  • High-throughput, multi-user deployments
  • Cost-sensitive cloud deployments needing maximum efficiency
  • Kubernetes and cloud-native environments
  • When you need proven scalability and performance

Use Docker Model Runner for:

  • Local development and testing
  • Running various AI model types (not just LLMs)
  • Teams heavily invested in Docker ecosystem
  • Quick experimentation without infrastructure setup
  • Learning and educational purposes

Hybrid Approach: Many teams develop with Docker Model Runner locally for convenience, then deploy with vLLM in production for performance. Because vLLM itself ships as a Docker image, it fits naturally into the same container-based workflow, combining both approaches.

Production Deployment Best Practices

Docker Deployment

Create a production-ready Docker Compose configuration:

version: '3.8'

services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./logs:/logs
    ports:
      - "8000:8000"
    command: >
      --model mistralai/Mistral-7B-Instruct-v0.2
      --tensor-parallel-size 2
      --gpu-memory-utilization 0.90
      --max-num-seqs 256
      --max-model-len 8192
    restart: unless-stopped
    shm_size: '16gb'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]

Kubernetes Deployment

Deploy vLLM on Kubernetes for production scale:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
          - --model
          - mistralai/Mistral-7B-Instruct-v0.2
          - --tensor-parallel-size
          - "2"
          - --gpu-memory-utilization
          - "0.90"
        resources:
          limits:
            nvidia.com/gpu: 2
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: cache
        hostPath:
          path: /mnt/huggingface-cache
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer

Monitoring and Observability

vLLM exposes Prometheus metrics for monitoring:

import requests

# Get metrics
metrics = requests.get("http://localhost:8000/metrics").text
print(metrics)

Key metrics to monitor:

  • vllm:num_requests_running - Active requests
  • vllm:gpu_cache_usage_perc - KV cache utilization
  • vllm:time_to_first_token_seconds - Time to first token (latency)
  • vllm:time_per_output_token_seconds - Per-token generation speed
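
If you want to inspect these values programmatically rather than through a Prometheus server, the prometheus_client parser can read the plain-text exposition format; a small sketch (metric names can differ slightly between vLLM versions):

import requests
from prometheus_client.parser import text_string_to_metric_families

text = requests.get("http://localhost:8000/metrics").text
for family in text_string_to_metric_families(text):
    if family.name.startswith("vllm:"):  # keep only vLLM's own metrics
        for sample in family.samples:
            print(sample.name, sample.value)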

Performance Tuning

Optimize GPU Memory Utilization: Start with --gpu-memory-utilization 0.90 and adjust based on observed behavior. Higher values allow larger batches but risk OOM errors during traffic spikes.

Tune Max Sequence Length: If your use case doesn't need full context length, reduce --max-model-len. This frees memory for larger batches. For example, if you only need 4K context, set --max-model-len 4096 instead of using the model's maximum (often 8K-32K).

Choose Appropriate Quantization: For models that support it, use quantized versions (8-bit, 4-bit) to reduce memory and increase throughput:

--quantization awq  # For AWQ quantized models
--quantization gptq # For GPTQ quantized models

Enable Prefix Caching: For applications with repeated prompts (like chatbots with system messages), enable prefix caching:

--enable-prefix-caching

This caches the KV values for common prefixes, reducing computation for requests sharing the same prompt prefix.
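
The same switch is available in the offline API; a minimal sketch assuming a chat-style workload where every request shares the same system prompt:

from vllm import LLM, SamplingParams

# enable_prefix_caching mirrors the --enable-prefix-caching server flag
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", enable_prefix_caching=True)

system = "You are a helpful assistant. Answer concisely.\n\n"
params = SamplingParams(max_tokens=64)

# The shared system-prompt prefix is computed once and its KV blocks are reused
for question in ["What is PagedAttention?", "What is continuous batching?"]:
    out = llm.generate([system + question], params)
    print(out[0].outputs[0].text)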

Troubleshooting Common Issues

Out of Memory Errors

Symptoms: Server crashes with CUDA out of memory errors.

Solutions:

  • Reduce --gpu-memory-utilization to 0.85 or 0.80
  • Decrease --max-model-len if your use case allows
  • Lower --max-num-seqs to reduce batch size
  • Use a quantized model version
  • Enable tensor parallelism to distribute across more GPUs

Low Throughput

Symptoms: Server handles fewer requests than expected.

Solutions:

  • Increase --max-num-seqs to allow larger batches
  • Raise --gpu-memory-utilization if you have headroom
  • Check if CPU is bottlenecked with htop – consider faster CPUs
  • Verify GPU utilization with nvidia-smi – should be 95%+
  • Enable FP16 if using FP32: --dtype float16

Slow First Token Time

Symptoms: High latency before generation starts.

Solutions:

  • Use smaller models for latency-critical applications
  • Enable prefix caching for repeated prompts
  • Reduce --max-num-seqs to prioritize latency over throughput
  • Consider speculative decoding for supported models
  • Optimize tensor parallelism configuration

Model Loading Failures

Symptoms: Server fails to start, can't load model.

Solutions:

  • Verify model name matches HuggingFace format exactly
  • Check network connectivity to HuggingFace Hub
  • Ensure sufficient disk space in ~/.cache/huggingface
  • For gated models, set HF_TOKEN environment variable
  • Try manually downloading with huggingface-cli download <model>
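
A Python alternative to huggingface-cli for pre-downloading a model into the cache vLLM reads from is huggingface_hub's snapshot_download; the token handling below assumes a gated model:

import os
from huggingface_hub import snapshot_download

# Downloads into the same ~/.cache/huggingface directory vLLM uses
path = snapshot_download(
    "mistralai/Mistral-7B-Instruct-v0.2",
    token=os.environ.get("HF_TOKEN"),  # only required for gated models
)
print("Model cached at:", path)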

Advanced Features

Speculative Decoding

vLLM supports speculative decoding, where a smaller draft model proposes tokens that a larger target model verifies. This can accelerate generation by 1.5-2x:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-70b-chat-hf \
    --speculative-model meta-llama/Llama-2-7b-chat-hf \
    --num-speculative-tokens 5

LoRA Adapters

Serve multiple LoRA adapters on top of a base model without loading multiple full models:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --enable-lora \
    --lora-modules sql-lora=./path/to/sql-adapter \
                   code-lora=./path/to/code-adapter

Then specify which adapter to use per request:

response = client.completions.create(
    model="sql-lora",  # Use the SQL adapter
    prompt="Convert this to SQL: Show me all users created this month"
)

Multi-LoRA Serving

vLLM's multi-LoRA serving allows hosting dozens of fine-tuned adapters with minimal memory overhead. This is ideal for serving customer-specific or task-specific model variants:

# Request a specific LoRA adapter by passing its registered name as the model
response = client.chat.completions.create(
    model="sql-lora",  # adapter name registered via --lora-modules
    messages=[{"role": "user", "content": "Write SQL query"}]
)

Prefix Caching

Enable automatic prefix caching to avoid recomputing KV cache for repeated prompt prefixes:

--enable-prefix-caching

This is particularly effective for:

  • Chatbots with fixed system prompts
  • RAG applications with consistent context templates
  • Few-shot learning prompts repeated across requests

Prefix caching can reduce time-to-first-token by 50-80% for requests sharing prompt prefixes.

Integration Examples

LangChain Integration

from langchain_community.llms import VLLMOpenAI  # pip install langchain-community

llm = VLLMOpenAI(
    openai_api_key="EMPTY",
    openai_api_base="http://localhost:8000/v1",
    model_name="mistralai/Mistral-7B-Instruct-v0.2",
    max_tokens=512,
    temperature=0.7,
)

response = llm.invoke("Explain PagedAttention in simple terms")
print(response)

LlamaIndex Integration

from llama_index.llms.openai_like import OpenAILike  # pip install llama-index-llms-openai-like

# OpenAILike targets any OpenAI-compatible endpoint, which is how vLLM exposes itself
llm = OpenAILike(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    api_base="http://localhost:8000/v1",
    api_key="not-needed",
    temperature=0.7,
    max_tokens=512
)

response = llm.complete("What is vLLM?")
print(response)

FastAPI Application

from fastapi import FastAPI
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

@app.post("/generate")
async def generate(prompt: str):
    response = await client.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.2",
        prompt=prompt,
        max_tokens=200
    )
    return {"result": response.choices[0].text}

Performance Benchmarks

Real-world performance data helps illustrate vLLM's advantages:

Throughput Comparison (Mistral-7B on A100 GPU):

  • vLLM: ~3,500 tokens/second with 64 concurrent users
  • HuggingFace Transformers: ~250 tokens/second with same concurrency
  • Ollama: ~1,200 tokens/second with same concurrency
  • Result: vLLM provides 14x improvement over basic implementations

Memory Efficiency (LLaMA-2-13B):

  • Standard implementation: 24GB VRAM, 32 concurrent sequences
  • vLLM with PagedAttention: 24GB VRAM, 128 concurrent sequences
  • Result: 4x more concurrent requests with same memory

Latency Under Load (Mixtral-8x7B on 2xA100):

  • vLLM: P50 latency 180ms, P99 latency 420ms at 100 req/s
  • Standard serving: P50 latency 650ms, P99 latency 3,200ms at 100 req/s
  • Result: vLLM maintains consistent latency under high load

These benchmarks demonstrate why vLLM has become the de facto standard for production LLM serving where performance matters.
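
Your numbers will vary with GPU, model, and request mix, so it is worth measuring against your own endpoint. A rough throughput probe using the OpenAI client and a thread pool (prompt, concurrency, and request count are arbitrary):

import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def one_request(_):
    resp = client.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.2",
        prompt="Write one sentence about GPUs.",
        max_tokens=64,
    )
    return resp.usage.completion_tokens

start = time.time()
with ThreadPoolExecutor(max_workers=32) as pool:     # 32 concurrent requests
    tokens = sum(pool.map(one_request, range(128)))  # 128 requests total
elapsed = time.time() - start
print(f"{tokens} generated tokens in {elapsed:.1f}s -> {tokens / elapsed:.0f} tokens/s")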

Cost Analysis

Understanding the cost implications of choosing vLLM:

Scenario: Serving 1M requests/day

With Standard Serving:

  • Required: 8x A100 GPUs (80GB)
  • AWS cost: ~$32/hour × 24 × 30 = $23,040/month
  • Cost per 1M tokens: ~$0.75

With vLLM:

  • Required: 2x A100 GPUs (80GB)
  • AWS cost: ~$8/hour × 24 × 30 = $5,760/month
  • Cost per 1M tokens: ~$0.19
  • Savings: $17,280/month (75% reduction)

This cost advantage grows with scale. Organizations serving billions of tokens monthly save hundreds of thousands of dollars by using vLLM's optimized serving instead of naive implementations.

Security Considerations

Authentication

vLLM doesn't include authentication by default. For production, implement authentication at the reverse proxy level:

# Nginx configuration
location /v1/ {
    auth_request /auth;
    proxy_pass http://vllm-backend:8000;
}

location /auth {
    proxy_pass http://auth-service:8080/verify;
    proxy_pass_request_body off;
    proxy_set_header Content-Length "";
    proxy_set_header X-Original-URI $request_uri;
}

Or use API gateways like Kong, Traefik, or AWS API Gateway for enterprise-grade authentication and rate limiting.

Network Isolation

Run vLLM in private networks, not directly exposed to the internet:

# Kubernetes NetworkPolicy example
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vllm-access
spec:
  podSelector:
    matchLabels:
      app: vllm
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: api-gateway
    ports:
    - protocol: TCP
      port: 8000

Rate Limiting

Implement rate limiting to prevent abuse:

# Example using Redis for rate limiting (adjust limits to your traffic)
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import redis

app = FastAPI()
redis_client = redis.Redis(host='localhost', port=6379)

@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
    client_ip = request.client.host
    key = f"rate_limit:{client_ip}"

    requests = redis_client.incr(key)
    if requests == 1:
        redis_client.expire(key, 60)  # 60-second window

    if requests > 60:  # 60 requests per minute per client IP
        # Exceptions raised in middleware bypass FastAPI's handlers,
        # so return the 429 response directly
        return JSONResponse(status_code=429, content={"detail": "Rate limit exceeded"})

    return await call_next(request)

Model Access Control

For multi-tenant deployments, control which users can access which models:

ALLOWED_MODELS = {
    "user_tier_1": ["mistralai/Mistral-7B-Instruct-v0.2"],
    "user_tier_2": ["mistralai/Mistral-7B-Instruct-v0.2", "meta-llama/Llama-2-13b-chat-hf"],
    "admin": ["*"]  # All models
}

def verify_model_access(user_tier: str, model: str) -> bool:
    allowed = ALLOWED_MODELS.get(user_tier, [])
    return "*" in allowed or model in allowed

Migration Guide

From OpenAI to vLLM

Migrating from OpenAI to self-hosted vLLM is straightforward thanks to API compatibility:

Before (OpenAI):

from openai import OpenAI

client = OpenAI(api_key="sk-...")
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello"}]
)

After (vLLM):

from openai import OpenAI

client = OpenAI(
    base_url="https://your-vllm-server.com/v1",
    api_key="your-internal-key"  # If you added authentication
)
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Hello"}]
)

Only two changes needed: update base_url and model name. All other code remains identical.

From Ollama to vLLM

Ollama uses a different API format. Here's the conversion:

Ollama API:

import requests

response = requests.post('http://localhost:11434/api/generate',
    json={
        'model': 'llama2',
        'prompt': 'Why is the sky blue?'
    })

vLLM Equivalent:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    prompt="Why is the sky blue?"
)

You'll need to update API calls throughout your codebase, but the OpenAI client libraries provide better error handling and features.

From HuggingFace Transformers to vLLM

Direct Python usage migration:

HuggingFace:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

inputs = tokenizer("Hello", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
result = tokenizer.decode(outputs[0])

vLLM:

from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
sampling_params = SamplingParams(max_tokens=100)

outputs = llm.generate("Hello", sampling_params)
result = outputs[0].outputs[0].text

vLLM's Python API is simpler and much faster for batch inference.
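
Batch generation is where the offline API shines: pass a list of prompts and the engine batches them internally (a small sketch extending the example above):

from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(max_tokens=100, temperature=0.7)

prompts = [
    "Summarize what PagedAttention does.",
    "List two benefits of continuous batching.",
    "Explain tensor parallelism in one sentence.",
]

# One call; scheduling and batching across prompts happen inside the engine
for output in llm.generate(prompts, params):
    print(output.prompt, "->", output.outputs[0].text.strip())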

Future of vLLM

vLLM continues rapid development with exciting features on the roadmap:

Disaggregated Serving: Separating prefill (prompt processing) and decode (token generation) onto different GPUs to optimize resource utilization. Prefill is compute-bound while decode is memory-bound, so running them on specialized hardware improves efficiency.

Multi-Node Inference: Distributing very large models (100B+ parameters) across multiple machines, enabling serving of models too large for single-node setups.

Enhanced Quantization: Support for new quantization formats like GGUF (used by llama.cpp) and improved AWQ/GPTQ integration for better performance with quantized models.

Speculative Decoding Improvements: More efficient draft models and adaptive speculation strategies to achieve higher speedups without accuracy loss.

Attention Optimizations: FlashAttention 3, ring attention for extremely long contexts (100K+ tokens), and other cutting-edge attention mechanisms.

Better Model Coverage: Expanding support to multimodal models (vision-language models), audio models, and specialized architectures as they emerge.

The vLLM project maintains active development with contributions from UC Berkeley, Anyscale, and the broader open-source community. As LLM deployment becomes more critical to production systems, vLLM's role as the performance standard continues to grow.

Useful Links

Related Articles on This Site

  • Local LLM Hosting: Complete 2025 Guide - Ollama, vLLM, LocalAI, Jan, LM Studio & More - Comprehensive comparison of 12+ local LLM hosting tools including detailed vLLM analysis alongside Ollama, LocalAI, Jan, LM Studio and others. Covers API maturity, tool calling support, GGUF compatibility and performance benchmarks to help choose the right solution.

  • Ollama Cheatsheet - Complete Ollama command reference and cheatsheet covering installation, model management, API usage, and best practices for local LLM deployment. Essential for developers using Ollama alongside or instead of vLLM.

  • Docker Model Runner vs Ollama: Which to Choose? - In-depth comparison of Docker's Model Runner and Ollama for local LLM deployment, analyzing performance, GPU support, API compatibility and use cases. Helps understand the competitive landscape vLLM operates in.

  • Docker Model Runner Cheatsheet: Commands & Examples - Practical Docker Model Runner cheatsheet with commands and examples for AI model deployment. Useful for teams comparing Docker's approach with vLLM's specialized LLM serving capabilities.

External Resources and Documentation

  • vLLM GitHub Repository - Official vLLM repository with source code, comprehensive documentation, installation guides, and active community discussions. Essential resource for staying current with latest features and troubleshooting issues.

  • vLLM Documentation - Official documentation covering all aspects of vLLM from basic setup to advanced configuration. Includes API references, performance tuning guides, and deployment best practices.

  • PagedAttention Paper - Academic paper introducing PagedAttention algorithm that powers vLLM's efficiency. Essential reading for understanding the technical innovations behind vLLM's performance advantages.

  • vLLM Blog - Official vLLM blog featuring release announcements, performance benchmarks, technical deep dives, and community case studies from production deployments.

  • HuggingFace Model Hub - Comprehensive repository of open-source LLMs that work with vLLM. Search for models by size, task, license, and performance characteristics to find the right model for your use case.

  • Ray Serve Documentation - Ray Serve framework documentation for building scalable, distributed vLLM deployments. Ray provides advanced features like autoscaling, multi-model serving, and resource management for production systems.

  • NVIDIA TensorRT-LLM - NVIDIA's TensorRT-LLM for highly optimized inference on NVIDIA GPUs. Alternative to vLLM with different optimization strategies, useful for comparison and understanding the inference optimization landscape.

  • OpenAI API Reference - Official OpenAI API documentation that vLLM's API is compatible with. Reference this when building applications that need to work with both OpenAI and self-hosted vLLM endpoints interchangeably.
