Rost

Originally published at glukhov.org

vLLM Quickstart: High-Performance LLM Serving

vLLM is a high-throughput, memory-efficient inference and serving engine for Large Language Models (LLMs) developed by UC Berkeley's Sky Computing Lab.

With its PagedAttention algorithm, vLLM achieves up to 24x higher throughput than serving with stock HuggingFace Transformers, making it a go-to choice for production LLM deployments.

What is vLLM?

vLLM (virtual LLM) is an open-source library for fast LLM inference and serving that has quickly become the industry standard for production deployments. Released in 2023, it introduced PagedAttention, a groundbreaking memory management technique that dramatically improves serving efficiency.

Key Features

High Throughput Performance: vLLM delivers up to 24x higher throughput than HuggingFace Transformers on the same hardware. This performance gain comes from continuous batching, optimized CUDA kernels, and the PagedAttention algorithm that all but eliminates memory fragmentation.

OpenAI API Compatibility: vLLM includes a built-in API server that's fully compatible with OpenAI's format. This allows seamless migration from OpenAI to self-hosted infrastructure without changing application code. Simply point your API client to vLLM's endpoint and it works transparently.

PagedAttention Algorithm: The core innovation behind vLLM's performance is PagedAttention, which applies the concept of virtual memory paging to the attention KV cache. Instead of reserving one contiguous memory region per sequence (which leads to heavy fragmentation), PagedAttention divides the cache into fixed-size blocks that are allocated on demand. The PagedAttention paper reports that this cuts KV-cache waste from the 60-80% typical of contiguous allocation to just a few percent, enabling much larger batch sizes.
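
To make the block-table idea concrete, here is a toy sketch in plain Python (illustrative only, not vLLM's internals) of mapping a sequence's growing KV cache onto fixed-size physical blocks that are allocated on demand:

# Toy illustration of paged KV-cache allocation (not vLLM's actual code)
BLOCK_SIZE = 16  # tokens per physical block

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # pool of physical blocks
        self.block_tables = {}                      # seq_id -> list of physical block ids

    def append_token(self, seq_id, position):
        """Allocate a new physical block only when a sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:              # all current blocks are full
            table.append(self.free_blocks.pop())
        # logical token position -> (physical block, offset), no contiguous preallocation
        return table[position // BLOCK_SIZE], position % BLOCK_SIZE

    def free(self, seq_id):
        """Finished sequences return their blocks to the pool immediately."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=1024)
print(cache.append_token("request-1", 0))  # first block is allocated lazily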

Continuous Batching: Unlike static batching where you wait for all sequences to complete, vLLM uses continuous (rolling) batching. As soon as one sequence finishes, a new one can be added to the batch. This maximizes GPU utilization and minimizes latency for incoming requests.
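
Because each decode step works on whatever sequences are currently active, the scheduler can be pictured as a loop like the following toy sketch (illustrative only; vLLM's real scheduler also handles preemption and memory limits):

# Toy continuous-batching loop (illustrative only)
from collections import deque
import random

waiting = deque(f"req-{i}" for i in range(8))  # incoming requests
active = {}                                    # request -> tokens left to generate
MAX_BATCH = 4

while waiting or active:
    # Admit new requests as soon as slots free up, instead of waiting for the batch to drain
    while waiting and len(active) < MAX_BATCH:
        active[waiting.popleft()] = random.randint(3, 10)
    # One decode step for every active sequence
    for req in list(active):
        active[req] -= 1
        if active[req] == 0:  # finished: its slot is reused on the very next step
            print(f"{req} done")
            del active[req]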

Multi-GPU Support: vLLM supports tensor parallelism and pipeline parallelism for distributing large models across multiple GPUs. It can efficiently serve models that don't fit in a single GPU's memory, supporting configurations from 2 to 8+ GPUs.

Wide Model Support: Compatible with popular model architectures including LLaMA, Mistral, Mixtral, Qwen, Phi, Gemma, and many others. Supports both instruction-tuned and base models from HuggingFace Hub.

When to Use vLLM

vLLM excels in specific scenarios where its strengths shine:

Production API Services: When you need to serve an LLM to many concurrent users via API, vLLM's high throughput and efficient batching make it the best choice. Companies running chatbots, code assistants, or content generation services benefit from its ability to handle hundreds of requests per second.

High-Concurrency Workloads: If your application has many simultaneous users making requests, vLLM's continuous batching and PagedAttention enable serving more users with the same hardware compared to alternatives.

Cost Optimization: When GPU costs are a concern, vLLM's superior throughput means you can serve the same traffic with fewer GPUs, directly reducing infrastructure costs. PagedAttention's memory efficiency also lets you fit more concurrent requests onto smaller, cheaper GPU instances.

Kubernetes Deployments: vLLM's stateless design and container-friendly architecture make it ideal for Kubernetes clusters. Its consistent performance under load and straightforward resource management integrate well with cloud-native infrastructure.

When NOT to Use vLLM: For local development, experimentation, or single-user scenarios, tools like Ollama provide better user experience with simpler setup. vLLM's complexity is justified when you need its performance advantages for production workloads.

How to Install vLLM

Prerequisites

Before installing vLLM, ensure your system meets these requirements:

  • GPU: NVIDIA GPU with compute capability 7.0+ (V100, T4, A10, A100, H100, RTX 20/30/40 series)
  • CUDA: Version 11.8 or higher
  • Python: 3.9 to 3.12 for recent releases (older releases supported 3.8 to 3.11); check the installation docs for your vLLM version
  • VRAM: Minimum 16GB for 7B models, 24GB+ for 13B, 40GB+ for larger models
  • Driver: NVIDIA driver 450.80.02 or newer

Installation via pip

The simplest installation method is using pip. This works on systems with CUDA 11.8 or newer:

# Create a virtual environment (recommended)
python3 -m venv vllm-env
source vllm-env/bin/activate

# Install vLLM
pip install vllm

# Verify installation
python -c "import vllm; print(vllm.__version__)"

For a different CUDA version, install a wheel built against that version. The exact version tags and URLs change between releases, so check the vLLM installation docs for the current ones; the general pattern looks like this:

# For CUDA 12.1
pip install vllm==0.4.2+cu121 -f https://github.com/vllm-project/vllm/releases

# For CUDA 11.8
pip install vllm==0.4.2+cu118 -f https://github.com/vllm-project/vllm/releases

Installation with Docker

Docker provides the most reliable deployment method, especially for production:

# Pull the official vLLM image
docker pull vllm/vllm-openai:latest

# Run vLLM with GPU support
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model mistralai/Mistral-7B-Instruct-v0.2

The --ipc=host flag matters because PyTorch's tensor-parallel worker processes exchange data through shared memory; without it (or a generous --shm-size) multi-GPU setups can fail or stall.

Building from Source

For the latest features or custom modifications, build from source:

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .

vLLM Quickstart Guide

Running Your First Model

Start vLLM with a model using the command-line interface:

# Download and serve Mistral-7B with OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --port 8000

vLLM will automatically download the model from HuggingFace Hub (if not cached) and start the server. You'll see output indicating the server is ready:

INFO:     Started server process [12345]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000
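
Once the log shows Uvicorn running, a quick way to confirm the server is healthy is to list the models it serves through the OpenAI-compatible /v1/models endpoint (a small sketch using the requests library):

import requests

# The OpenAI-compatible server reports the model(s) it is currently serving
resp = requests.get("http://localhost:8000/v1/models")
print(resp.json()["data"][0]["id"])  # e.g. mistralai/Mistral-7B-Instruct-v0.2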

Making API Requests

Once the server is running, you can make requests using the OpenAI Python client or curl:

Using curl:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.2",
        "prompt": "Explain what vLLM is in one sentence:",
        "max_tokens": 100,
        "temperature": 0.7
    }'

Using OpenAI Python Client:

from openai import OpenAI

# Point to your vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # vLLM doesn't require authentication by default
)

response = client.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    prompt="Explain what vLLM is in one sentence:",
    max_tokens=100,
    temperature=0.7
)

print(response.choices[0].text)

Chat Completions API:

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is PagedAttention?"}
    ],
    max_tokens=200
)

print(response.choices[0].message.content)

Advanced Configuration

vLLM offers numerous parameters to optimize performance:

python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --port 8000 \
    --gpu-memory-utilization 0.95 \
    --max-model-len 8192 \
    --tensor-parallel-size 2 \
    --dtype float16 \
    --max-num-seqs 256

Key Parameters Explained:

  • --gpu-memory-utilization: How much GPU memory to use (0.90 = 90%). Higher values allow larger batches but leave less margin for memory spikes.
  • --max-model-len: Maximum context length. Reducing this saves memory for larger batches.
  • --tensor-parallel-size: Number of GPUs to split the model across.
  • --dtype: Data type for weights (float16, bfloat16, or float32). FP16 is usually optimal.
  • --max-num-seqs: Maximum number of sequences to process in a batch.
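
The same knobs exist in vLLM's offline Python API (vllm.LLM), which is convenient for sanity-checking a configuration before wiring it into the server; the model and values below are placeholders mirroring the flags above:

from vllm import LLM, SamplingParams

# Offline engine configured with the same parameters as the server flags above
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    gpu_memory_utilization=0.95,
    max_model_len=8192,
    tensor_parallel_size=2,
    dtype="float16",
    max_num_seqs=256,
)

outputs = llm.generate(["Explain PagedAttention briefly:"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)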

vLLM vs Ollama Comparison

Both vLLM and Ollama are popular choices for local LLM hosting, but they target different use cases. Understanding when to use each tool can significantly impact your project's success.

Performance and Throughput

vLLM is engineered for maximum throughput in multi-user scenarios. Its PagedAttention and continuous batching enable serving hundreds of concurrent requests efficiently. Benchmarks show vLLM achieving up to 24x higher throughput than standard implementations and 2-4x higher than Ollama under high concurrency.

Ollama optimizes for single-user interactive use with focus on low latency for individual requests. While it doesn't match vLLM's multi-user throughput, it provides excellent performance for development and personal use with faster cold-start times and lower idle resource consumption.

Ease of Use

Ollama wins decisively on simplicity. Installation is a single command (curl | sh), and running models is as simple as ollama run llama2. It includes a model library with quantized versions optimized for different hardware profiles. The user experience resembles Docker – pull, run, and go.

vLLM requires more setup: Python environment management, CUDA installation, understanding of serving parameters, and manual model specification. The learning curve is steeper, but you gain fine-grained control over performance optimization. This complexity is warranted for production deployments where you need to squeeze maximum performance from your hardware.

API and Integration

vLLM provides OpenAI-compatible REST APIs out of the box, making it a drop-in replacement for OpenAI's API in existing applications. This is crucial for migrating production services from cloud providers to self-hosted infrastructure without code changes.

Ollama offers a simpler REST API and a dedicated Python/JavaScript library. While functional, it's not OpenAI-compatible, requiring code changes when integrating with applications expecting OpenAI's format. However, community projects like Ollama-OpenAI adapters bridge this gap.

Memory Management

vLLM's PagedAttention algorithm provides superior memory efficiency for concurrent requests. It can serve 2-4x more concurrent users with the same VRAM compared to naive implementations. This directly translates to cost savings in production deployments.

Ollama uses simpler memory management suitable for single-user scenarios. It automatically manages model loading/unloading based on activity, which is convenient for development but not optimal for high-concurrency production use.

Multi-GPU Support

vLLM excels with native tensor parallelism and pipeline parallelism, efficiently distributing models across 2-8+ GPUs. This is essential for serving large models like 70B parameter LLMs that don't fit in a single GPU.

Ollama currently has limited multi-GPU support, primarily working best with a single GPU. This makes it less suitable for very large models requiring distributed inference.

Use Case Recommendations

Choose vLLM when:

  • Serving production APIs with many concurrent users
  • Optimizing cost per request in cloud deployments
  • Running in Kubernetes or container orchestration platforms
  • Need OpenAI API compatibility for existing applications
  • Serving large models requiring multi-GPU support
  • Performance and throughput are critical requirements

Choose Ollama when:

  • Local development and experimentation
  • Single-user interactive use (personal assistants, chatbots)
  • Quick prototyping and model evaluation
  • Learning about LLMs without infrastructure complexity
  • Running on personal workstations or laptops
  • Simplicity and ease of use are priorities

Many teams use both: Ollama for development and experimentation, then vLLM for production deployment. This combination provides developer productivity while maintaining production performance.

vLLM vs Docker Model Runner

Docker recently introduced Model Runner as its official solution for running AI models locally. How does it compare to vLLM?

Architecture Philosophy

Docker Model Runner aims to be the "Docker for AI" – a simple, standardized way to run AI models locally with the same ease as running containers. It abstracts away complexity and provides a consistent interface across different models and frameworks.

vLLM is a specialized inference engine focused solely on LLM serving with maximum performance. It's a lower-level tool that you containerize with Docker, rather than a complete platform.

Setup and Getting Started

Docker Model Runner installation is straightforward for Docker users:

docker model pull llama3:8b
docker model run llama3:8b

This similarity to Docker's image workflow makes it instantly familiar to developers already using containers.

vLLM requires more initial setup (Python, CUDA, dependencies) or using pre-built Docker images:

docker pull vllm/vllm-openai:latest
docker run --runtime nvidia --gpus all vllm/vllm-openai:latest --model <model-name>

Performance Characteristics

vLLM delivers superior throughput for multi-user scenarios due to PagedAttention and continuous batching. For production API services handling hundreds of requests per second, vLLM's optimizations provide 2-5x better throughput than generic serving approaches.

Docker Model Runner focuses on ease of use rather than maximum performance. It's suitable for local development, testing, and moderate workloads, but doesn't implement the advanced optimizations that make vLLM excel at scale.

Model Support

Docker Model Runner provides a curated model library with one-command access to popular models. It supports multiple frameworks (not just LLMs) including Stable Diffusion, Whisper, and other AI models, making it more versatile for different AI workloads.

vLLM specializes in LLM inference with deep support for transformer-based language models. It supports any HuggingFace-compatible LLM but doesn't extend to other AI model types like image generation or speech recognition.

Production Deployment

vLLM is battle-tested in production at organizations serving billions of tokens daily, from LLM API providers to research groups such as LMSYS's Chatbot Arena. Its performance characteristics and stability under heavy load have made it a de facto standard for production LLM serving.

Docker Model Runner is newer and positions itself more for development and local testing scenarios. While it could serve production traffic, it lacks the proven track record and performance optimizations that production deployments require.

Integration Ecosystem

vLLM integrates with production infrastructure tools: Kubernetes operators, Prometheus metrics, Ray for distributed serving, and extensive OpenAI API compatibility for existing applications.

Docker Model Runner integrates naturally with Docker's ecosystem and Docker Desktop. For teams already standardized on Docker, this integration provides a cohesive experience but fewer specialized LLM serving features.

When to Use Each

Use vLLM for:

  • Production LLM API services
  • High-throughput, multi-user deployments
  • Cost-sensitive cloud deployments needing maximum efficiency
  • Kubernetes and cloud-native environments
  • When you need proven scalability and performance

Use Docker Model Runner for:

  • Local development and testing
  • Running various AI model types (not just LLMs)
  • Teams heavily invested in Docker ecosystem
  • Quick experimentation without infrastructure setup
  • Learning and educational purposes

Hybrid Approach: Many teams develop with Docker Model Runner locally for convenience, then deploy with vLLM in production for performance. Because vLLM itself ships as a Docker image, it fits naturally into the same container-based workflow, combining both approaches.

Production Deployment Best Practices

Docker Deployment

Create a production-ready Docker Compose configuration:

version: '3.8'

services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./logs:/logs
    ports:
      - "8000:8000"
    command: >
      --model mistralai/Mistral-7B-Instruct-v0.2
      --tensor-parallel-size 2
      --gpu-memory-utilization 0.90
      --max-num-seqs 256
      --max-model-len 8192
    restart: unless-stopped
    shm_size: '16gb'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]

Kubernetes Deployment

Deploy vLLM on Kubernetes for production scale:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
          - --model
          - mistralai/Mistral-7B-Instruct-v0.2
          - --tensor-parallel-size
          - "2"
          - --gpu-memory-utilization
          - "0.90"
        resources:
          limits:
            nvidia.com/gpu: 2
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: cache
        hostPath:
          path: /mnt/huggingface-cache
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer

Monitoring and Observability

vLLM exposes Prometheus metrics for monitoring:

import requests

# Get metrics
metrics = requests.get("http://localhost:8000/metrics").text
print(metrics)

Key metrics to monitor:

  • vllm:num_requests_running - Active requests
  • vllm:gpu_cache_usage_perc - KV cache utilization
  • vllm:time_to_first_token_seconds - Time to first token (latency)
  • vllm:time_per_output_token_seconds - Per-token generation speed
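
If you want to inspect these values programmatically rather than through a Prometheus server, the prometheus_client parser can read the plain-text exposition format; a small sketch (metric names can differ slightly between vLLM versions):

import requests
from prometheus_client.parser import text_string_to_metric_families

text = requests.get("http://localhost:8000/metrics").text
for family in text_string_to_metric_families(text):
    if family.name.startswith("vllm:"):  # keep only vLLM's own metrics
        for sample in family.samples:
            print(sample.name, sample.value)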

Performance Tuning

Optimize GPU Memory Utilization: Start with --gpu-memory-utilization 0.90 and adjust based on observed behavior. Higher values allow larger batches but risk OOM errors during traffic spikes.

Tune Max Sequence Length: If your use case doesn't need full context length, reduce --max-model-len. This frees memory for larger batches. For example, if you only need 4K context, set --max-model-len 4096 instead of using the model's maximum (often 8K-32K).

Choose Appropriate Quantization: For models that support it, use quantized versions (8-bit, 4-bit) to reduce memory and increase throughput:

--quantization awq  # For AWQ quantized models
--quantization gptq # For GPTQ quantized models

Enable Prefix Caching: For applications with repeated prompts (like chatbots with system messages), enable prefix caching:

--enable-prefix-caching

This caches the KV values for common prefixes, reducing computation for requests sharing the same prompt prefix.
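
The same switch is available in the offline API; a minimal sketch assuming a chat-style workload where every request shares the same system prompt:

from vllm import LLM, SamplingParams

# enable_prefix_caching mirrors the --enable-prefix-caching server flag
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", enable_prefix_caching=True)

system = "You are a helpful assistant. Answer concisely.\n\n"
params = SamplingParams(max_tokens=64)

# The shared system-prompt prefix is computed once and its KV blocks are reused
for question in ["What is PagedAttention?", "What is continuous batching?"]:
    out = llm.generate([system + question], params)
    print(out[0].outputs[0].text)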

Troubleshooting Common Issues

Out of Memory Errors

Symptoms: Server crashes with CUDA out of memory errors.

Solutions:

  • Reduce --gpu-memory-utilization to 0.85 or 0.80
  • Decrease --max-model-len if your use case allows
  • Lower --max-num-seqs to reduce batch size
  • Use a quantized model version
  • Enable tensor parallelism to distribute across more GPUs

Low Throughput

Symptoms: Server handles fewer requests than expected.

Solutions:

  • Increase --max-num-seqs to allow larger batches
  • Raise --gpu-memory-utilization if you have headroom
  • Check if CPU is bottlenecked with htop – consider faster CPUs
  • Verify GPU utilization with nvidia-smi – should be 95%+
  • Enable FP16 if using FP32: --dtype float16

Slow First Token Time

Symptoms: High latency before generation starts.

Solutions:

  • Use smaller models for latency-critical applications
  • Enable prefix caching for repeated prompts
  • Reduce --max-num-seqs to prioritize latency over throughput
  • Consider speculative decoding for supported models
  • Optimize tensor parallelism configuration

Model Loading Failures

Symptoms: Server fails to start, can't load model.

Solutions:

  • Verify model name matches HuggingFace format exactly
  • Check network connectivity to HuggingFace Hub
  • Ensure sufficient disk space in ~/.cache/huggingface
  • For gated models, set HF_TOKEN environment variable
  • Try manually downloading with huggingface-cli download <model>
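
A Python alternative to huggingface-cli for pre-downloading a model into the cache vLLM reads from is huggingface_hub's snapshot_download; the token handling below assumes a gated model:

import os
from huggingface_hub import snapshot_download

# Downloads into the same ~/.cache/huggingface directory vLLM uses
path = snapshot_download(
    "mistralai/Mistral-7B-Instruct-v0.2",
    token=os.environ.get("HF_TOKEN"),  # only required for gated models
)
print("Model cached at:", path)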

Advanced Features

Speculative Decoding

vLLM supports speculative decoding, where a smaller draft model proposes tokens that a larger target model verifies. This can accelerate generation by 1.5-2x:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-70b-chat-hf \
    --speculative-model meta-llama/Llama-2-7b-chat-hf \
    --num-speculative-tokens 5

LoRA Adapters

Serve multiple LoRA adapters on top of a base model without loading multiple full models:

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --enable-lora \
    --lora-modules sql-lora=./path/to/sql-adapter \
                   code-lora=./path/to/code-adapter

Then specify which adapter to use per request:

response = client.completions.create(
    model="sql-lora",  # Use the SQL adapter
    prompt="Convert this to SQL: Show me all users created this month"
)

Multi-LoRA Serving

vLLM's multi-LoRA serving allows hosting dozens of fine-tuned adapters with minimal memory overhead. This is ideal for serving customer-specific or task-specific model variants:

# Request a specific LoRA adapter by passing its registered name as the model
response = client.chat.completions.create(
    model="sql-lora",  # adapter name registered via --lora-modules
    messages=[{"role": "user", "content": "Write SQL query"}]
)

Prefix Caching

Enable automatic prefix caching to avoid recomputing KV cache for repeated prompt prefixes:

--enable-prefix-caching

This is particularly effective for:

  • Chatbots with fixed system prompts
  • RAG applications with consistent context templates
  • Few-shot learning prompts repeated across requests

Prefix caching can reduce time-to-first-token by 50-80% for requests sharing prompt prefixes.

Integration Examples

LangChain Integration

from langchain_community.llms import VLLMOpenAI  # pip install langchain-community

llm = VLLMOpenAI(
    openai_api_key="EMPTY",
    openai_api_base="http://localhost:8000/v1",
    model_name="mistralai/Mistral-7B-Instruct-v0.2",
    max_tokens=512,
    temperature=0.7,
)

response = llm.invoke("Explain PagedAttention in simple terms")
print(response)

LlamaIndex Integration

from llama_index.llms.openai_like import OpenAILike  # pip install llama-index-llms-openai-like

# OpenAILike targets any OpenAI-compatible endpoint, which is how vLLM exposes itself
llm = OpenAILike(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    api_base="http://localhost:8000/v1",
    api_key="not-needed",
    temperature=0.7,
    max_tokens=512
)

response = llm.complete("What is vLLM?")
print(response)

FastAPI Application

from fastapi import FastAPI
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

@app.post("/generate")
async def generate(prompt: str):
    response = await client.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.2",
        prompt=prompt,
        max_tokens=200
    )
    return {"result": response.choices[0].text}

Performance Benchmarks

Real-world performance data helps illustrate vLLM's advantages:

Throughput Comparison (Mistral-7B on A100 GPU):

  • vLLM: ~3,500 tokens/second with 64 concurrent users
  • HuggingFace Transformers: ~250 tokens/second with same concurrency
  • Ollama: ~1,200 tokens/second with same concurrency
  • Result: vLLM provides 14x improvement over basic implementations

Memory Efficiency (LLaMA-2-13B):

  • Standard implementation: 24GB VRAM, 32 concurrent sequences
  • vLLM with PagedAttention: 24GB VRAM, 128 concurrent sequences
  • Result: 4x more concurrent requests with same memory

Latency Under Load (Mixtral-8x7B on 2xA100):

  • vLLM: P50 latency 180ms, P99 latency 420ms at 100 req/s
  • Standard serving: P50 latency 650ms, P99 latency 3,200ms at 100 req/s
  • Result: vLLM maintains consistent latency under high load

These benchmarks demonstrate why vLLM has become the de facto standard for production LLM serving where performance matters.
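
Your numbers will vary with GPU, model, and request mix, so it is worth measuring against your own endpoint. A rough throughput probe using the OpenAI client and a thread pool (prompt, concurrency, and request count are arbitrary):

import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def one_request(_):
    resp = client.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.2",
        prompt="Write one sentence about GPUs.",
        max_tokens=64,
    )
    return resp.usage.completion_tokens

start = time.time()
with ThreadPoolExecutor(max_workers=32) as pool:     # 32 concurrent requests
    tokens = sum(pool.map(one_request, range(128)))  # 128 requests total
elapsed = time.time() - start
print(f"{tokens} generated tokens in {elapsed:.1f}s -> {tokens / elapsed:.0f} tokens/s")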

Cost Analysis

Understanding the cost implications of choosing vLLM:

Scenario: Serving 1M requests/day

With Standard Serving:

  • Required: 8x A100 GPUs (80GB)
  • AWS cost: ~$32/hour × 24 × 30 = $23,040/month
  • Cost per 1M tokens: ~$0.75

With vLLM:

  • Required: 2x A100 GPUs (80GB)
  • AWS cost: ~$8/hour × 24 × 30 = $5,760/month
  • Cost per 1M tokens: ~$0.19
  • Savings: $17,280/month (75% reduction)

This cost advantage grows with scale. Organizations serving billions of tokens monthly save hundreds of thousands of dollars by using vLLM's optimized serving instead of naive implementations.

Security Considerations

Authentication

vLLM doesn't include authentication by default. For production, implement authentication at the reverse proxy level:

# Nginx configuration
location /v1/ {
    auth_request /auth;
    proxy_pass http://vllm-backend:8000;
}

location /auth {
    proxy_pass http://auth-service:8080/verify;
    proxy_pass_request_body off;
    proxy_set_header Content-Length "";
    proxy_set_header X-Original-URI $request_uri;
}

Or use API gateways like Kong, Traefik, or AWS API Gateway for enterprise-grade authentication and rate limiting.

Network Isolation

Run vLLM in private networks, not directly exposed to the internet:

# Kubernetes NetworkPolicy example
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vllm-access
spec:
  podSelector:
    matchLabels:
      app: vllm
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: api-gateway
    ports:
    - protocol: TCP
      port: 8000

Rate Limiting

Implement rate limiting to prevent abuse:

# Example using Redis for rate limiting (adjust limits to your traffic)
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import redis

app = FastAPI()
redis_client = redis.Redis(host='localhost', port=6379)

@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
    client_ip = request.client.host
    key = f"rate_limit:{client_ip}"

    requests = redis_client.incr(key)
    if requests == 1:
        redis_client.expire(key, 60)  # 60-second window

    if requests > 60:  # 60 requests per minute per client IP
        # Exceptions raised in middleware bypass FastAPI's handlers,
        # so return the 429 response directly
        return JSONResponse(status_code=429, content={"detail": "Rate limit exceeded"})

    return await call_next(request)

Model Access Control

For multi-tenant deployments, control which users can access which models:

ALLOWED_MODELS = {
    "user_tier_1": ["mistralai/Mistral-7B-Instruct-v0.2"],
    "user_tier_2": ["mistralai/Mistral-7B-Instruct-v0.2", "meta-llama/Llama-2-13b-chat-hf"],
    "admin": ["*"]  # All models
}

def verify_model_access(user_tier: str, model: str) -> bool:
    allowed = ALLOWED_MODELS.get(user_tier, [])
    return "*" in allowed or model in allowed

Migration Guide

From OpenAI to vLLM

Migrating from OpenAI to self-hosted vLLM is straightforward thanks to API compatibility:

Before (OpenAI):

from openai import OpenAI

client = OpenAI(api_key="sk-...")
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello"}]
)

After (vLLM):

from openai import OpenAI

client = OpenAI(
    base_url="https://your-vllm-server.com/v1",
    api_key="your-internal-key"  # If you added authentication
)
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Hello"}]
)

Only two changes needed: update base_url and model name. All other code remains identical.

From Ollama to vLLM

Ollama uses a different API format. Here's the conversion:

Ollama API:

import requests

response = requests.post('http://localhost:11434/api/generate',
    json={
        'model': 'llama2',
        'prompt': 'Why is the sky blue?'
    })

vLLM Equivalent:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    prompt="Why is the sky blue?"
)

You'll need to update API calls throughout your codebase, but the OpenAI client libraries provide better error handling and features.

From HuggingFace Transformers to vLLM

Direct Python usage migration:

HuggingFace:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

inputs = tokenizer("Hello", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
result = tokenizer.decode(outputs[0])

vLLM:

from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
sampling_params = SamplingParams(max_tokens=100)

outputs = llm.generate("Hello", sampling_params)
result = outputs[0].outputs[0].text

vLLM's Python API is simpler and much faster for batch inference.
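
Batch generation is where the offline API shines: pass a list of prompts and the engine batches them internally (a small sketch extending the example above):

from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(max_tokens=100, temperature=0.7)

prompts = [
    "Summarize what PagedAttention does.",
    "List two benefits of continuous batching.",
    "Explain tensor parallelism in one sentence.",
]

# One call; scheduling and batching across prompts happen inside the engine
for output in llm.generate(prompts, params):
    print(output.prompt, "->", output.outputs[0].text.strip())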

Future of vLLM

vLLM continues rapid development with exciting features on the roadmap:

Disaggregated Serving: Separating prefill (prompt processing) and decode (token generation) onto different GPUs to optimize resource utilization. Prefill is compute-bound while decode is memory-bound, so running them on specialized hardware improves efficiency.

Multi-Node Inference: Distributing very large models (100B+ parameters) across multiple machines, enabling serving of models too large for single-node setups.

Enhanced Quantization: Support for new quantization formats like GGUF (used by llama.cpp) and improved AWQ/GPTQ integration for better performance with quantized models.

Speculative Decoding Improvements: More efficient draft models and adaptive speculation strategies to achieve higher speedups without accuracy loss.

Attention Optimizations: FlashAttention 3, ring attention for extremely long contexts (100K+ tokens), and other cutting-edge attention mechanisms.

Better Model Coverage: Expanding support to multimodal models (vision-language models), audio models, and specialized architectures as they emerge.

The vLLM project maintains active development with contributions from UC Berkeley, Anyscale, and the broader open-source community. As LLM deployment becomes more critical to production systems, vLLM's role as the performance standard continues to grow.

Useful Links

Related Articles on This Site

  • Local LLM Hosting: Complete 2025 Guide - Ollama, vLLM, LocalAI, Jan, LM Studio & More - Comprehensive comparison of 12+ local LLM hosting tools including detailed vLLM analysis alongside Ollama, LocalAI, Jan, LM Studio and others. Covers API maturity, tool calling support, GGUF compatibility and performance benchmarks to help choose the right solution.

  • Ollama Cheatsheet - Complete Ollama command reference and cheatsheet covering installation, model management, API usage, and best practices for local LLM deployment. Essential for developers using Ollama alongside or instead of vLLM.

  • Docker Model Runner vs Ollama: Which to Choose? - In-depth comparison of Docker's Model Runner and Ollama for local LLM deployment, analyzing performance, GPU support, API compatibility and use cases. Helps understand the competitive landscape vLLM operates in.

  • Docker Model Runner Cheatsheet: Commands & Examples - Practical Docker Model Runner cheatsheet with commands and examples for AI model deployment. Useful for teams comparing Docker's approach with vLLM's specialized LLM serving capabilities.

External Resources and Documentation

  • vLLM GitHub Repository - Official vLLM repository with source code, comprehensive documentation, installation guides, and active community discussions. Essential resource for staying current with latest features and troubleshooting issues.

  • vLLM Documentation - Official documentation covering all aspects of vLLM from basic setup to advanced configuration. Includes API references, performance tuning guides, and deployment best practices.

  • PagedAttention Paper - Academic paper introducing PagedAttention algorithm that powers vLLM's efficiency. Essential reading for understanding the technical innovations behind vLLM's performance advantages.

  • vLLM Blog - Official vLLM blog featuring release announcements, performance benchmarks, technical deep dives, and community case studies from production deployments.

  • HuggingFace Model Hub - Comprehensive repository of open-source LLMs that work with vLLM. Search for models by size, task, license, and performance characteristics to find the right model for your use case.

  • Ray Serve Documentation - Ray Serve framework documentation for building scalable, distributed vLLM deployments. Ray provides advanced features like autoscaling, multi-model serving, and resource management for production systems.

  • NVIDIA TensorRT-LLM - NVIDIA's TensorRT-LLM for highly optimized inference on NVIDIA GPUs. Alternative to vLLM with different optimization strategies, useful for comparison and understanding the inference optimization landscape.

  • OpenAI API Reference - Official OpenAI API documentation that vLLM's API is compatible with. Reference this when building applications that need to work with both OpenAI and self-hosted vLLM endpoints interchangeably.
