Jovan Chan

Posted on Jul 4 • Originally published at runaihome.com

Ollama Not Using GPU? Fix CPU-Only Inference on Windows, WSL2, and Linux (2026)

#ollama #gpu #localllm #troubleshooting


TL;DR: If Ollama feels slow, run ollama ps — a "100% CPU" line means your GPU isn't being used at all, and a CPU/GPU split means the model is too big for your VRAM. Most cases come down to drivers, a WSL2/Docker passthrough gap, or VRAM overflow. The speed gap is real: ~42 tok/s on an RTX 3060 versus 8–14 tok/s CPU-only for Llama 3.1 8B.

What you'll be able to do after this guide:

Confirm in 30 seconds whether Ollama is on your GPU, your CPU, or split between both
Fix the six causes that account for nearly every "Ollama won't use my GPU" report in 2026
Read the Ollama server log to find the one line that tells you what actually happened

Honest take: 80% of these reports are one of two things — you installed Ollama before the NVIDIA driver was working, or your model simply doesn't fit in VRAM and is spilling to system RAM. Check ollama ps first; it tells you which camp you're in before you change a single setting.

Step 1: Confirm the problem (don't guess)

Before touching drivers or reinstalling anything, find out what Ollama is actually doing. Load a model and run ollama ps:

$ ollama run llama3.1:8b "hi" 
$ ollama ps
NAME           ID              SIZE      PROCESSOR    UNTIL
llama3.1:8b    365c0bd3c000    6.7 GB    100% GPU     4 minutes from now

That PROCESSOR column is the whole diagnosis:

100% GPU — working as intended. If it's still slow, your model/quant or context is the bottleneck, not GPU detection.
100% CPU — Ollama isn't seeing your GPU at all. This is a driver, passthrough, or unsupported-card problem.
58% / 42% CPU/GPU (a split) — Ollama found the GPU but the model doesn't fully fit in VRAM, so layers spilled to system RAM. The GPU is fine; you're out of VRAM.
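If you want that reading automated, here is a minimal sketch. The function names classify_processor and processor_column are made up for this example, and the column slicing assumes ollama ps keeps printing the aligned header shown above; treat it as a starting point, not an official tool.

```shell
# classify_processor: map one PROCESSOR value from `ollama ps` to a diagnosis.
classify_processor() {
  case "$1" in
    "100% GPU") echo "gpu-only: detection is fine" ;;
    "100% CPU") echo "cpu-only: GPU not detected" ;;
    *CPU/GPU*)  echo "split: model does not fit in VRAM" ;;
    *)          echo "unknown: $1" ;;
  esac
}

# processor_column: read `ollama ps` text on stdin and print the PROCESSOR
# field of each row, sliced by header offsets (assumes aligned columns).
processor_column() {
  awk 'NR == 1 { s = index($0, "PROCESSOR"); e = index($0, "UNTIL"); next }
       s > 0 && e > s { v = substr($0, s, e - s); sub(/[ \t]+$/, "", v); print v }'
}

# Live check, only attempted when the ollama CLI is on PATH.
if command -v ollama >/dev/null 2>&1; then
  ollama ps 2>/dev/null | processor_column | while IFS= read -r p; do
    classify_processor "$p"
  done
fi
```

Load a model first; with nothing resident, ollama ps prints only the header and the loop prints nothing.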

Cross-check with the GPU itself while a prompt is generating:

$ nvidia-smi

If nvidia-smi prints a table and you see an ollama process using VRAM during generation, the GPU is being used. If nvidia-smi returns command not found or NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver, your driver is the problem: jump to Cause 1.

Here's the diagnostic flow in one table:

`ollama ps` shows	`nvidia-smi` shows	What's wrong	Go to
100% CPU	driver error / not found	Driver missing or broken	Cause 1
100% CPU	works fine on host, fails in WSL/Docker	Passthrough not configured	Cause 2 / 4
100% CPU	works, GPU is old	Compute capability too low	Cause 5
CPU/GPU split	GPU present, VRAM full	Model bigger than VRAM	Cause 3
100% GPU on wrong card	both GPUs listed	Ollama picked the wrong GPU	Cause 6

Cause 1: Drivers missing, outdated, or installed after Ollama

This is the single most common cause. Ollama detects GPU libraries at install time and at server start, so the order of operations matters.

The version floor in 2026: Ollama supports NVIDIA GPUs with compute capability 5.0+ and driver 531 or newer. Older Maxwell/Pascal cards (compute capability 5.0–6.2, e.g. a GTX 1060) need driver 570 or newer. If your driver is below that, Ollama silently falls back to CPU.

Check your driver:

$ nvidia-smi --query-gpu=driver_version --format=csv,noheader
576.52
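To compare that number against the floor without eyeballing it, a small sketch follows. driver_meets_floor is a hypothetical helper name; it only compares the major version, and the default of 531 is the floor quoted above (pass 570 as the second argument for the older-card case).

```shell
# driver_meets_floor VERSION [FLOOR]: succeed when the driver's major version
# is at least FLOOR (default 531). Returns 2 when VERSION is not numeric.
driver_meets_floor() {
  major=${1%%.*}              # "576.52" -> "576"
  floor=${2:-531}
  case "$major" in
    ''|*[!0-9]*) return 2 ;;  # empty or non-numeric: nvidia-smi likely failed
  esac
  [ "$major" -ge "$floor" ]
}

# Live check, only attempted when nvidia-smi is on PATH.
if command -v nvidia-smi >/dev/null 2>&1; then
  v=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)
  if driver_meets_floor "$v"; then
    echo "driver $v: meets the floor"
  else
    echo "driver '$v': update before touching Ollama"
  fi
fi
```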

Then fix in this order:

Install/update the NVIDIA driver first. On Windows, grab the latest Game Ready or Studio driver. On Linux, install the proprietary driver (e.g. sudo ubuntu-drivers install on Ubuntu) and reboot.
Verify nvidia-smi works before going further.
Reinstall Ollama after the driver is healthy. If you installed Ollama before the driver worked, its server never registered CUDA support. On Linux: curl -fsSL https://ollama.com/install.sh | sh. On Windows, reinstall the app. Then run sudo systemctl restart ollama (Linux) or restart the app (Windows).

The classic trap: people install Ollama on a fresh machine, then install GPU drivers, then wonder why it's on CPU. Reinstall Ollama last.

Cause 2: WSL2 passthrough (the Windows + Linux gotcha)

Running Ollama inside WSL2 on Windows is its own special case, and the fix is counterintuitive.

Do not install a Linux NVIDIA driver inside WSL2. The Windows host driver is automatically projected into WSL2 as libcuda.so. Installing a Linux driver on top of that breaks the stub and sends you straight to CPU. This is the mistake that generates the most WSL2 bug reports.

The working setup:

Update the Windows NVIDIA driver (must be 470.76 or later for CUDA-in-WSL2; in practice use a current driver). Windows 11, or Windows 10 21H2+, is required.
Confirm you're on WSL2, not WSL1:

   # in PowerShell
   wsl -l -v

The VERSION column must say 2. WSL1 has no GPU passthrough at all.

Inside WSL2, verify the stub is visible:

   $ nvidia-smi

If that works inside WSL but Ollama still shows CPU, reinstall Ollama inside WSL after confirming nvidia-smi works.
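If nvidia-smi is missing inside WSL and you want to see whether the projected library is there at all, this sketch looks for it directly. It assumes the host libraries are mounted at /usr/lib/wsl/lib, which is where WSL2 normally puts them; wsl_cuda_stub_present is a name invented for this example.

```shell
# wsl_cuda_stub_present [DIR]: succeed when a libcuda.so* file exists in DIR.
# DIR defaults to /usr/lib/wsl/lib (assumed mount point for host driver libs).
wsl_cuda_stub_present() {
  dir=${1:-/usr/lib/wsl/lib}
  for f in "$dir"/libcuda.so*; do
    [ -e "$f" ] && return 0
  done
  return 1
}

# Only meaningful inside WSL, so check the kernel string first.
if grep -qi microsoft /proc/version 2>/dev/null; then
  if wsl_cuda_stub_present; then
    echo "libcuda is projected from the Windows driver"
  else
    echo "no libcuda under /usr/lib/wsl/lib: update the Windows driver, then run wsl --shutdown"
  fi
fi
```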

Cause 3: The model is bigger than your VRAM

If ollama ps shows a CPU/GPU split, nothing is broken — you're out of VRAM, and Ollama is doing exactly what it's designed to do: offload the layers that fit to the GPU and run the rest on CPU. That CPU portion is what tanks your tokens/sec.

A rough VRAM budget: a Q4_K_M quant needs about 0.6 GB per billion parameters, plus 1–2 GB for the KV cache at modest context. So Llama 3.1 8B Q4_K_M wants ~6–7 GB, which is why it fits cleanly on an 8GB card; a 14B Q4 wants ~10 GB; a 32B Q4 wants ~20 GB and will split on anything under a 24GB card.
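That rule of thumb is easy to turn into a quick calculator. The sketch below uses the same 0.6 GB per billion parameters and a KV-cache allowance that defaults to 1.5 GB (the midpoint of the 1–2 GB range); estimate_vram_gb and fits_in_vram are hypothetical helper names, and the result is a rough estimate, not a measurement.

```shell
# estimate_vram_gb PARAMS_B [KV_GB]: rough Q4_K_M footprint in GB.
# 0.6 GB per billion parameters plus a KV-cache allowance (default 1.5 GB).
estimate_vram_gb() {
  awk -v p="$1" -v kv="${2:-1.5}" 'BEGIN { printf "%.1f\n", p * 0.6 + kv }'
}

# fits_in_vram PARAMS_B VRAM_GB: succeed when the estimate fits on the card.
fits_in_vram() {
  awk -v need="$(estimate_vram_gb "$1")" -v have="$2" 'BEGIN { exit !(need <= have) }'
}

estimate_vram_gb 8    # prints 6.3
estimate_vram_gb 14   # prints 9.9
estimate_vram_gb 32   # prints 20.7

if fits_in_vram 32 16; then echo "32B fits in 16 GB"; else echo "32B will split on 16 GB"; fi
```

Real usage also depends on context length and the exact quant, so leave yourself some headroom.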

Fixes, in order of preference:

Use a smaller quant. Drop from Q6_K to Q4_K_M, or pull the :8b instead of :14b tag. See our quantization explainer for what you actually lose (less than people think — Q4_K_M is the sweet spot).
Shrink the context window. A huge num_ctx eats VRAM through the KV cache. If you set OLLAMA_CONTEXT_LENGTH or num_ctx to 32768 "just in case," that alone can force a split. Drop to 4096 or 8192.
Unload other models. Ollama keeps recently-used models resident. Run ollama stop <model> to free VRAM, or set OLLAMA_MAX_LOADED_MODELS=1.
Get more VRAM. If a 32B model is your daily driver, a 24GB card is the real answer — see our VRAM-by-model guide and check whether you actually have enough system RAM for the spillover, too.
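To test the context fix without editing any config, you can pass num_ctx per request through Ollama's REST API. The sketch below only builds the request body; generate_payload is a hypothetical helper that does no JSON escaping, so keep the prompt simple. The curl line assumes a server on the default port 11434.

```shell
# generate_payload MODEL NUM_CTX PROMPT: print a JSON body for POST /api/generate
# that overrides the context window for this one request.
# Note: naive formatting, no escaping of quotes inside PROMPT.
generate_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false,"options":{"num_ctx":%d}}\n' "$1" "$3" "$2"
}

generate_payload llama3.1:8b 4096 hi

# With a server running locally, send it and then re-check the split:
#   curl -s http://localhost:11434/api/generate -d "$(generate_payload llama3.1:8b 4096 hi)"
#   ollama ps                  # PROCESSOR should move toward 100% GPU
#   ollama stop llama3.1:8b    # free the VRAM held by a resident model
```

If the split disappears at 4096, the KV cache was the culprit rather than the weights.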

The classic error here is the hard out-of-memory case:

Error: CUDA error: out of memory

Fix: lower the context (for example, /set parameter num_ctx 2048 inside an ollama run session), use a smaller quant, or stop other loaded models, then retry.

Cause 4: Docker without GPU access

A container does not get GPU access by default. If you run Ollama in Docker and it's on CPU, the host setup is incomplete.

On the host (or inside your WSL2 distro if that's where Docker lives):

# after installing nvidia-container-toolkit (package steps vary by distro), register the runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Then launch the container with the GPU flag:

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 ollama/ollama

The two things people forget: installing nvidia-container-toolkit on the host, and the --gpus=all flag itself. Miss either and the container quietly runs on CPU.
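A quick way to check the first half of that is to ask Docker which runtimes it knows about. This sketch assumes nvidia-ctk registered the runtime under the name "nvidia"; has_nvidia_runtime is a hypothetical helper that does a plain substring match, not real JSON parsing.

```shell
# has_nvidia_runtime JSON: succeed when the runtimes JSON mentions "nvidia".
has_nvidia_runtime() {
  case "$1" in
    *'"nvidia"'*) return 0 ;;
    *)            return 1 ;;
  esac
}

# Live check, only attempted when the docker CLI is on PATH.
if command -v docker >/dev/null 2>&1; then
  runtimes=$(docker info --format '{{json .Runtimes}}' 2>/dev/null)
  if has_nvidia_runtime "$runtimes"; then
    echo "nvidia runtime is registered with Docker"
  else
    echo "nvidia runtime missing: rerun nvidia-ctk and restart docker"
  fi
fi

# Second half: confirm the container itself sees the GPU.
#   docker exec <container> nvidia-smi
```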

Cause 5: Your GPU is too old (compute capability)

Ollama requires compute capability 5.0 or higher — Maxwell (GTX 900 series) and newer. A Kepler-era card (GTX 700 series, compute capability 3.5) will never get GPU acceleration in current Ollama, no matter what driver you install. Datacenter Tesla K80s fall in the same bucket.
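On reasonably recent drivers, nvidia-smi can report compute capability directly through the compute_cap query field, which saves a trip to the lookup table. The sketch below is hedged on that: older drivers may not know the field, in which case the live check prints nothing useful. cc_supported is a name invented for this example.

```shell
# cc_supported CC: succeed when compute capability CC is 5.0 or higher.
cc_supported() {
  awk -v cc="$1" 'BEGIN { exit !(cc + 0 >= 5.0) }'
}

# Live check, only attempted when nvidia-smi is on PATH.
# Assumes the driver supports the compute_cap query field.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader 2>/dev/null |
  while IFS=, read -r name cc; do
    cc=${cc# }                       # drop the space after the comma
    if cc_supported "$cc"; then
      echo "$name (cc $cc): supported"
    else
      echo "$name (cc $cc): below 5.0, CPU only"
    fi
  done
fi
```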

Check your card's compute capability against NVIDIA's CUDA GPUs list. If it's below 5.0, your only paths are a newer GPU or cloud rental (see Cause 7's note). This is rare on home builds but common when someone tries to repurpose an ancient mining or serv