Jovan Chan

Posted on Jun 2 • Originally published at aifoss.dev

ollama-review-2026

#opensource #ai #selfhosted #linux

This article was originally published on aifoss.dev

---
title: 'Ollama 2026 Review: The Default Local LLM Runner'
description: 'Ollama is the easiest way to run local LLMs on your own hardware. Here is what version 0.23.3 does well, where it falls short, and when to use something else.'
pubDate: 'May 14 2026'

tags: ["ollama", "ai", "selfhosted", "llm", "opensource"]

Ollama became the default answer to "how do I run a local LLM?" somewhere around 2024, and it has held that position by consistently solving the most annoying part of local inference: getting a model loaded and responding without a PhD in CUDA.

As of v0.23.3 (released May 13, 2026), it is still the right starting point for most people. It is also still wrong for certain use cases, and those cases are worth knowing before you commit to building around it.

This review covers what Ollama actually does, the hardware it needs, what changed in 2026, where the ceilings are, and how it compares to the main alternatives. The goal is a verdict you can act on, not a feature list.

What Ollama is (and what it isn't)

Ollama is a local model runner. You install it, pull a model, run it. It exposes an OpenAI-compatible REST API at localhost:11434, which means anything built for the OpenAI API works against Ollama with a single line change.

Under the hood it is a wrapper around llama.cpp. That matters because it means Ollama inherits llama.cpp's broad hardware support — NVIDIA, AMD, Apple Silicon, and CPU-only — while adding a cleaner CLI, automatic model management, and a model library at ollama.com/library.

What it is not: a chat UI (use Open WebUI for that), a fine-tuning tool, a multi-user inference server, or a replacement for vLLM if you are serving dozens of concurrent users.

License: MIT. Active open-source project on GitHub with frequent releases.

What changed in 2026

v0.6.2 (March 2026)

Llama 4 support added
Batch embedding API — embed multiple texts in one call, useful for RAG pipelines
Flash Attention v2.7 integration
M4 Metal 3 optimizations for Apple Silicon

v0.23.3 (May 13, 2026)

/api/show responses are now cached, improving median API latency by ~6.7x — meaningful for integrations that call this endpoint repeatedly (VS Code extensions, Open WebUI)
Claude Desktop removed from ollama launch due to Anthropic restricting the integration to their own models

Gemma 4 speculative decoding (Mac)

Ollama now supports Gemma 4 MTP speculative decoding on Mac, delivering over 2x speed increase for the Gemma 4 31B model on coding tasks — significant for Apple Silicon users running large models.

Hardware requirements

GPU is not required but makes a significant difference. The general rule: the model must fit in VRAM (or system RAM for CPU-only) for usable speeds.

Model size	Recommended VRAM	CPU-only usable?	Practical tokens/sec (GPU)
1B–3B (Gemma 3n, Phi-3.5 mini)	4 GB	Yes	80–120
7B–8B (Llama 3.1, Mistral)	8 GB	Slow (≈5–8 t/s)	40–55 (RTX 4060)
13B–14B	12–16 GB	No	25–35 (RTX 4090)
30B–34B	24 GB	No	15–22 (RTX 4090)
70B+	48 GB+	No	8–15 (2× RTX 4090)

Minimum system RAM: 16 GB. Below that, even a 7B model risks being swapped to disk, which makes CPU inference unusably slow. 32 GB is the practical baseline if you want to run 7B models comfortably while doing other work.

For a hardware-focused breakdown of GPU options, see runaihome.com's GPU buying guide. Budget entry point: an RTX 4060 on Amazon covers 7B models at 40–55 tokens/sec and costs under $350. If you don't want to buy hardware at all, RunPod rents GPU instances by the hour — useful for testing large models before committing to a purchase.

Installation

Three commands on Linux/Mac:

curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b
ollama run llama3.1:8b

On Windows: download the installer from ollama.com, run it, open a terminal. The same ollama pull and ollama run commands work identically.

The model library uses a Docker-like pull syntax. llama3.1:8b pulls the 8B Q4_K_M quantized variant by default. To get a specific quantization: ollama pull llama3.1:8b-instruct-q8_0.

The REST API is available immediately after installation:

curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1:8b", "prompt": "Explain GGUF in one sentence.", "stream": false}'

This API compatibility is Ollama's biggest practical advantage. Open WebUI, Continue.dev, AnythingLLM, and hundreds of other tools connect to it without modification.

What Ollama does well

Zero-friction setup. The install-to-first-response time is under five minutes for most setups. No Python environment, no CUDA toolkit management, no configuration files. The model library handles quantization format, so you don't need to know what Q4_K_M means to get a working model.

Automatic hardware detection. Ollama detects your GPU, falls back to CPU if needed, and handles model loading without manual layer configuration. Apple Silicon, NVIDIA, and AMD all work without separate install paths.

OpenAI API compatibility. Drop-in replacement for openai.OpenAI(base_url="http://localhost:11434/v1"). Any OpenAI SDK client works with a single line change.

Model library. 100+ models available via ollama pull. The library includes current models (Llama 3.x, Gemma 3, Mistral, DeepSeek, Phi-3, Qwen 2.5, Command R) updated within days of upstream releases.

Multi-model management. OLLAMA_MAX_LOADED_MODELS (default: 3× GPU count) keeps multiple models warm in memory simultaneously. Switching between a coding model and a chat model does not require a full reload.

Where it falls short

Concurrency is not the default

Parallel requests are queued sequentially by default. Out of the box, Ollama serves one prompt at a time, regardless of how much GPU headroom remains. You have to explicitly set environment variables to enable parallelism:

OLLAMA_NUM_PARALLEL=4 ollama serve

OLLAMA_NUM_PARALLEL controls how many requests each loaded model handles simultaneously. Default is 1 (memory-dependent; may be 4 on high-VRAM systems). OLLAMA_MAX_QUEUE sets the queue depth before requests are rejected.

Even with tuning, Ollama's throughput ceiling is modest. Under heavy multi-user load, a tuned Ollama instance peaks at roughly 40 tokens/second total, compared to vLLM's ~800 tokens/second on the same hardware. That gap exists because vLLM uses PagedAttention and continuous batching; Ollama does not.

Multi-GPU model distribution

When multiple users request the same model across a multi-GPU system, Ollama routes them to one GPU rather than distributing load across all available cards. Other GPUs sit idle while one GPU queues requests. This is a known limitation tracked in the GitHub issues.

Quantization format lock-in

Ollama uses GGUF format exclusively. If you want to experiment with GPTQ, AWQ, or exl2 quantized models — which can offer better quality/speed tradeoffs at some bit widths — you need a different tool (llama.cpp directly, or a framework that supports those formats). This matters if you are doing model evaluation work rather than just running models.

Abstraction overhead

Because Ollama wraps llama.cpp rather than exposing it directly, it adds a small but measurable overhead. Direct llama.cpp usage achieves 15–25% higher token generation speed on the same hardware. For most personal use, this does not matter. For latency-sensitive production use, it does.

Comparison: Ollama vs the main alternatives

DEV Community