selfhosting.sh

Posted on Mar 8 • Originally published at selfhosting.sh

Ollama vs vLLM: Which Should You Self-Host?

#selfhosted #devops #homelab #linux

Quick Verdict

Ollama is the better choice for personal use and small teams. It's easy to set up, runs on consumer hardware (including CPU-only), and integrates with every major LLM frontend. vLLM is the better choice for production serving where throughput matters — it handles concurrent requests much more efficiently using PagedAttention and continuous batching, but requires a dedicated NVIDIA GPU and more setup effort.

Overview

Both run LLMs locally, but they're designed for very different scales.

Ollama — MIT license. 250k+ GitHub stars. Written in Go, wraps llama.cpp. Designed for simplicity — download a model with one command, run it immediately. Targets developers and self-hosters who want local AI without complexity.

vLLM — Apache 2.0 license. 50k+ GitHub stars. Written in Python/C++/CUDA. Designed for high-throughput LLM serving. Invented PagedAttention for efficient GPU memory management. Targets production deployments serving multiple concurrent users.

Feature Comparison

Feature	Ollama	vLLM
Primary goal	Simplicity	Throughput
Model download	`ollama pull model`	Manual or HuggingFace Hub
OpenAI API compatible	Yes	Yes (native)
CPU inference	Yes	No (GPU required)
GPU: NVIDIA	Yes	Yes (primary)
GPU: AMD	Yes (ROCm)	Yes (ROCm)
GPU: Apple Silicon	Yes (Metal)	No
Multi-GPU	Yes	Yes (tensor parallelism)
Continuous batching	No	Yes
PagedAttention	No	Yes
Speculative decoding	No	Yes
Model formats: GGUF	Yes (primary)	Limited
Model formats: HuggingFace	Via conversion	Yes (native)
Model formats: AWQ/GPTQ	Via conversion	Yes (native)
Quantization	GGUF quants (Q4, Q5, Q8)	AWQ, GPTQ, FP8, INT8
Concurrent requests	Sequential by default	Optimized for concurrency
Vision models	Yes	Yes
Function calling	Yes	Yes
LoRA serving	No	Yes (multi-LoRA)
Guided generation	No	Yes (structured output)
Setup complexity	Very low	Medium-high
Docker image size	~1 GB	~5-10 GB
Default port	11434	8000
License	MIT	Apache 2.0

Installation Complexity

Ollama is trivial to deploy:

services:
  ollama:
    image: ollama/ollama:0.16.2
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  ollama_data:

Pull a model and start serving:

docker exec ollama ollama pull llama3.2

Works on CPU, NVIDIA, AMD, and Apple Silicon — same image, auto-detected.

vLLM requires an NVIDIA GPU and more configuration:

services:
  vllm:
    image: vllm/vllm-openai:v0.15.1
    container_name: vllm
    ports:
      - "8000:8000"
    volumes:
      - huggingface_cache:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=your-hf-token
    command: >
      --model mistralai/Mistral-7B-Instruct-v0.3
      --max-model-len 4096
      --gpu-memory-utilization 0.9
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  huggingface_cache:

vLLM downloads the model from HuggingFace on first start (requires a token for gated models). The model must fit in GPU VRAM — no CPU fallback, no automatic CPU/GPU splitting like Ollama.

Ollama is significantly easier. vLLM requires understanding GPU memory, model formats, and serving parameters.

Performance and Resource Usage

This is where vLLM shines. The performance gap is substantial under concurrent load.

Ollama processes requests sequentially by default (one at a time). A 7B model generates ~40-80 tokens/sec on a consumer NVIDIA GPU. Adding more users means waiting in line. Ollama prioritizes simplicity and compatibility over raw throughput.

vLLM uses PagedAttention and continuous batching to serve multiple requests simultaneously. The same 7B model can serve 5-10 concurrent users with minimal latency degradation. Throughput can be 2-5x higher than Ollama under concurrent load. Tensor parallelism across multiple GPUs is built-in.

For a single user: performance is comparable. For 5+ concurrent users: vLLM is dramatically faster.

Resource requirements:

Ollama: Can run on CPU (slow but works). GPU optional. A 7B GGUF Q4 model needs ~4-6 GB RAM or VRAM.
vLLM: NVIDIA GPU required (16+ GB VRAM recommended). A 7B model in FP16 needs ~14 GB VRAM. AWQ/GPTQ quantized needs ~4-6 GB VRAM.

Community and Support

Ollama: 250k+ stars, largest LLM tool community. Every frontend and IDE plugin supports it. Extensive model library with one-command downloads. Excellent documentation.

vLLM: 50k+ stars, strong ML engineering community. Used by major AI companies for production serving. Active development with frequent releases. Documentation is more technical and assumes ML background.

Ollama has the broader community. vLLM has the deeper ML engineering community.

Use Cases

Choose Ollama If...

You're running AI for personal use or a small team
You want the simplest possible setup
You need CPU-only inference (no GPU available)
You're pairing it with Open WebUI for a ChatGPT replacement
You want to quickly test different models
You need Apple Silicon or AMD GPU support
You don't serve more than 2-3 concurrent users

Choose vLLM If...

You're serving an application with multiple concurrent users
Throughput and latency under load matter
You have a dedicated NVIDIA GPU (16+ GB VRAM)
You need multi-LoRA serving (different fine-tunes for different users)
You need structured output / guided generation
You're building a production API service
You need tensor parallelism across multiple GPUs

Final Verdict

Ollama is the right choice for self-hosters. If you want to run AI models at home or for a small team, Ollama is unbeatable for simplicity. Pull a model, connect a frontend, and you're done. It works on everything from a Raspberry Pi (slowly) to a workstation with multiple GPUs.

vLLM is the right choice for production serving. If you're building an application that needs to serve LLM responses to many users simultaneously, vLLM's continuous batching and PagedAttention make it 2-5x more efficient under load. The trade-off is a hard NVIDIA GPU requirement and more complex configuration.

Most self-hosters should start with Ollama. Graduate to vLLM when you need to serve concurrent users at scale.

DEV Community