DEV Community

selfhosting.sh
selfhosting.sh

Posted on • Originally published at selfhosting.sh

Ollama vs vLLM: Which Should You Self-Host?

Quick Verdict

Ollama is the better choice for personal use and small teams. It's easy to set up, runs on consumer hardware (including CPU-only), and integrates with every major LLM frontend. vLLM is the better choice for production serving where throughput matters — it handles concurrent requests much more efficiently using PagedAttention and continuous batching, but requires a dedicated NVIDIA GPU and more setup effort.

Overview

Both run LLMs locally, but they're designed for very different scales.

Ollama — MIT license. 250k+ GitHub stars. Written in Go, wraps llama.cpp. Designed for simplicity — download a model with one command, run it immediately. Targets developers and self-hosters who want local AI without complexity.

vLLM — Apache 2.0 license. 50k+ GitHub stars. Written in Python/C++/CUDA. Designed for high-throughput LLM serving. Invented PagedAttention for efficient GPU memory management. Targets production deployments serving multiple concurrent users.

Feature Comparison

Feature Ollama vLLM
Primary goal Simplicity Throughput
Model download ollama pull model Manual or HuggingFace Hub
OpenAI API compatible Yes Yes (native)
CPU inference Yes No (GPU required)
GPU: NVIDIA Yes Yes (primary)
GPU: AMD Yes (ROCm) Yes (ROCm)
GPU: Apple Silicon Yes (Metal) No
Multi-GPU Yes Yes (tensor parallelism)
Continuous batching No Yes
PagedAttention No Yes
Speculative decoding No Yes
Model formats: GGUF Yes (primary) Limited
Model formats: HuggingFace Via conversion Yes (native)
Model formats: AWQ/GPTQ Via conversion Yes (native)
Quantization GGUF quants (Q4, Q5, Q8) AWQ, GPTQ, FP8, INT8
Concurrent requests Sequential by default Optimized for concurrency
Vision models Yes Yes
Function calling Yes Yes
LoRA serving No Yes (multi-LoRA)
Guided generation No Yes (structured output)
Setup complexity Very low Medium-high
Docker image size ~1 GB ~5-10 GB
Default port 11434 8000
License MIT Apache 2.0

Installation Complexity

Ollama is trivial to deploy:

services:
  ollama:
    image: ollama/ollama:0.16.2
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  ollama_data:
Enter fullscreen mode Exit fullscreen mode

Pull a model and start serving:

docker exec ollama ollama pull llama3.2
Enter fullscreen mode Exit fullscreen mode

Works on CPU, NVIDIA, AMD, and Apple Silicon — same image, auto-detected.

vLLM requires an NVIDIA GPU and more configuration:

services:
  vllm:
    image: vllm/vllm-openai:v0.15.1
    container_name: vllm
    ports:
      - "8000:8000"
    volumes:
      - huggingface_cache:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=your-hf-token
    command: >
      --model mistralai/Mistral-7B-Instruct-v0.3
      --max-model-len 4096
      --gpu-memory-utilization 0.9
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  huggingface_cache:
Enter fullscreen mode Exit fullscreen mode

vLLM downloads the model from HuggingFace on first start (requires a token for gated models). The model must fit in GPU VRAM — no CPU fallback, no automatic CPU/GPU splitting like Ollama.

Ollama is significantly easier. vLLM requires understanding GPU memory, model formats, and serving parameters.

Performance and Resource Usage

This is where vLLM shines. The performance gap is substantial under concurrent load.

Ollama processes requests sequentially by default (one at a time). A 7B model generates ~40-80 tokens/sec on a consumer NVIDIA GPU. Adding more users means waiting in line. Ollama prioritizes simplicity and compatibility over raw throughput.

vLLM uses PagedAttention and continuous batching to serve multiple requests simultaneously. The same 7B model can serve 5-10 concurrent users with minimal latency degradation. Throughput can be 2-5x higher than Ollama under concurrent load. Tensor parallelism across multiple GPUs is built-in.

For a single user: performance is comparable. For 5+ concurrent users: vLLM is dramatically faster.

Resource requirements:

  • Ollama: Can run on CPU (slow but works). GPU optional. A 7B GGUF Q4 model needs ~4-6 GB RAM or VRAM.
  • vLLM: NVIDIA GPU required (16+ GB VRAM recommended). A 7B model in FP16 needs ~14 GB VRAM. AWQ/GPTQ quantized needs ~4-6 GB VRAM.

Community and Support

Ollama: 250k+ stars, largest LLM tool community. Every frontend and IDE plugin supports it. Extensive model library with one-command downloads. Excellent documentation.

vLLM: 50k+ stars, strong ML engineering community. Used by major AI companies for production serving. Active development with frequent releases. Documentation is more technical and assumes ML background.

Ollama has the broader community. vLLM has the deeper ML engineering community.

Use Cases

Choose Ollama If...

  • You're running AI for personal use or a small team
  • You want the simplest possible setup
  • You need CPU-only inference (no GPU available)
  • You're pairing it with Open WebUI for a ChatGPT replacement
  • You want to quickly test different models
  • You need Apple Silicon or AMD GPU support
  • You don't serve more than 2-3 concurrent users

Choose vLLM If...

  • You're serving an application with multiple concurrent users
  • Throughput and latency under load matter
  • You have a dedicated NVIDIA GPU (16+ GB VRAM)
  • You need multi-LoRA serving (different fine-tunes for different users)
  • You need structured output / guided generation
  • You're building a production API service
  • You need tensor parallelism across multiple GPUs

Final Verdict

Ollama is the right choice for self-hosters. If you want to run AI models at home or for a small team, Ollama is unbeatable for simplicity. Pull a model, connect a frontend, and you're done. It works on everything from a Raspberry Pi (slowly) to a workstation with multiple GPUs.

vLLM is the right choice for production serving. If you're building an application that needs to serve LLM responses to many users simultaneously, vLLM's continuous batching and PagedAttention make it 2-5x more efficient under load. The trade-off is a hard NVIDIA GPU requirement and more complex configuration.

Most self-hosters should start with Ollama. Graduate to vLLM when you need to serve concurrent users at scale.

Related

Top comments (0)