This article was originally published on runaihome.com
TL;DR: If Ollama feels slow, run ollama ps — a "100% CPU" line means your GPU isn't being used at all, and a CPU/GPU split means the model is too big for your VRAM. Most cases come down to drivers, a WSL2/Docker passthrough gap, or VRAM overflow. The speed gap is real: ~42 tok/s on an RTX 3060 versus 8–14 tok/s CPU-only for Llama 3.1 8B.
What you'll be able to do after this guide:
- Confirm in 30 seconds whether Ollama is on your GPU, your CPU, or split between both
- Fix the six causes that account for nearly every "Ollama won't use my GPU" report in 2026
- Read the Ollama server log to find the one line that tells you what actually happened
Honest take: 80% of these reports are one of two things: you installed Ollama before the NVIDIA driver was working, or your model simply doesn't fit in VRAM and is spilling to system RAM. Check ollama ps first; it tells you which camp you're in before you change a single setting.
Step 1: Confirm the problem (don't guess)
Before touching drivers or reinstalling anything, find out what Ollama is actually doing. Load a model and run ollama ps:
$ ollama run llama3.1:8b "hi"
$ ollama ps
NAME ID SIZE PROCESSOR UNTIL
llama3.1:8b 365c0bd3c000 6.7 GB 100% GPU 4 minutes from now
That PROCESSOR column is the whole diagnosis:
- 100% GPU: working as intended. If it's still slow, your model/quant or context is the bottleneck, not GPU detection.
- 100% CPU: Ollama isn't seeing your GPU at all. This is a driver, passthrough, or unsupported-card problem.
- 58% / 42% CPU/GPU (a split): Ollama found the GPU but the model doesn't fully fit in VRAM, so layers spilled to system RAM. The GPU is fine; you're out of VRAM.
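If you want that check in a script rather than by eye, here is a minimal sketch. The function name classify_ollama_ps is made up for illustration, and it assumes the PROCESSOR values look like the three cases above; it only reports the first loaded model it finds.

```shell
# Classify `ollama ps` output read from stdin.
# Prints one of: gpu, cpu, split, none (no model loaded).
# Assumption: PROCESSOR reads "100% GPU", "100% CPU", or something
# ending in "CPU/GPU" for a split (spacing around the slash may vary).
classify_ollama_ps() {
  awk '
    NR == 1    { next }                     # skip the header row
    /CPU\/GPU/ { state = "split"; exit }    # model spilled to system RAM
    /100% CPU/ { state = "cpu";   exit }    # GPU not used at all
    /100% GPU/ { state = "gpu";   exit }    # fully on the GPU
    END        { print (state == "" ? "none" : state) }
  '
}

# Usage:
#   ollama ps | classify_ollama_ps
```

A result of none just means nothing is loaded yet; run a prompt first, then check again.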
Cross-check with the GPU itself while a prompt is generating:
$ nvidia-smi
If nvidia-smi prints a table and you see a python/ollama process using VRAM during generation, the GPU is being used. If nvidia-smi returns command not found or NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver, your driver is the problem — jump to Cause 1.
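A single nvidia-smi snapshot is easy to mistime, so it can help to sample while the prompt is still generating. The sketch below uses nvidia-smi's query flags; the helper names watch_gpu and ollama_on_gpu are hypothetical, and the process-name match is an assumption, since some setups (WSL2 in particular) may not list process names at all.

```shell
# Print GPU utilization and VRAM use once per second (Ctrl+C to stop).
watch_gpu() {
  nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
             --format=csv -l 1
}

# Read the output of
#   nvidia-smi --query-compute-apps=process_name,used_memory --format=csv,noheader
# on stdin and succeed if any listed process name contains "ollama".
ollama_on_gpu() {
  grep -qi 'ollama'
}

# Usage (in a second terminal while a prompt is generating):
#   nvidia-smi --query-compute-apps=process_name,used_memory --format=csv,noheader \
#     | ollama_on_gpu && echo "Ollama is holding VRAM"
```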
Here's the diagnostic flow in one table:
| ollama ps shows | nvidia-smi shows | What's wrong | Go to |
|---|---|---|---|
| 100% CPU | driver error / not found | Driver missing or broken | Cause 1 |
| 100% CPU | works fine on host, fails in WSL/Docker | Passthrough not configured | Cause 2 / 4 |
| 100% CPU | works, GPU is old | Compute capability too low | Cause 5 |
| CPU/GPU split | GPU present, VRAM full | Model bigger than VRAM | Cause 3 |
| 100% GPU on wrong card | both GPUs listed | Ollama picked the wrong GPU | Cause 6 |
Cause 1: Drivers missing, outdated, or installed after Ollama
This is the single most common cause. Ollama detects GPU libraries at install time and at server start, so the order of operations matters.
The version floor in 2026: Ollama supports NVIDIA GPUs with compute capability 5.0+ and driver 531 or newer. Older Maxwell/Pascal cards (compute capability 5.0–6.2, e.g. a GTX 1060) need driver 570 or newer. If your driver is below that, Ollama silently falls back to CPU.
Check your driver:
$ nvidia-smi --query-gpu=driver_version --format=csv,noheader
576.52
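To compare that number against the floor without doing it by eye, a small sketch follows. The function name driver_meets_floor is hypothetical; it compares only the major version and defaults to the 531 floor cited above (pass 570 as the second argument for the older cards).

```shell
# Succeed if the driver's major version meets the floor.
# $1 = driver version string (e.g. "576.52"), $2 = floor (default 531).
driver_meets_floor() {
  version=$1
  floor=${2:-531}
  major=${version%%.*}        # "576.52" -> "576"
  [ "$major" -ge "$floor" ]
}

# Usage:
#   v=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   driver_meets_floor "$v" || echo "driver $v is below the floor"
```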
Then fix in this order:
- Install/update the NVIDIA driver first. On Windows, grab the latest Game Ready or Studio driver. On Linux, install the proprietary driver (e.g. sudo ubuntu-drivers install on Ubuntu) and reboot.
- Verify nvidia-smi works before going further.
- Reinstall Ollama after the driver is healthy. If you installed Ollama before the driver worked, its server never registered CUDA support. On Linux: curl -fsSL https://ollama.com/install.sh | sh. On Windows, reinstall the app. Then systemctl restart ollama (Linux) or restart the app.
The classic trap: people install Ollama on a fresh machine, then install GPU drivers, then wonder why it's on CPU. Reinstall Ollama last.
Cause 2: WSL2 passthrough (the Windows + Linux gotcha)
Running Ollama inside WSL2 on Windows is its own special case, and the fix is counterintuitive.
Do not install a Linux NVIDIA driver inside WSL2. The Windows host driver is automatically projected into WSL2 as libcuda.so. Installing a Linux driver on top of that breaks the stub and sends you straight to CPU. This is the mistake that generates the most WSL2 bug reports.
The working setup:
- Update the Windows NVIDIA driver (must be 470.76 or later for CUDA-in-WSL2; in practice use a current driver). Windows 11, or Windows 10 21H2+, is required.
- Confirm you're on WSL2, not WSL1:
# in PowerShell
wsl -l -v
The VERSION column must say 2. WSL1 has no GPU passthrough at all.
- Inside WSL2, verify the stub is visible:
$ nvidia-smi
If that works inside WSL but Ollama still shows CPU, reinstall Ollama inside WSL after confirming nvidia-smi works.
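You can also confirm the WSL version from inside the distro. This is a rough sketch under two assumptions that are not from the steps above: stock WSL2 kernels report a release string containing "WSL2", and the projected Windows driver libraries are mounted under /usr/lib/wsl/lib. Custom kernels or unusual setups may differ. The function name is_wsl2_kernel is made up for illustration.

```shell
# Succeed if the given kernel release string looks like WSL2.
# Assumption: stock WSL2 kernels contain "WSL2" in `uname -r`
# (e.g. "5.15.153.1-microsoft-standard-WSL2"); WSL1 and bare-metal
# Linux kernels do not.
is_wsl2_kernel() {
  case $1 in
    *WSL2*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Usage inside the distro:
#   is_wsl2_kernel "$(uname -r)" && ls /usr/lib/wsl/lib/libcuda.so*
```

If the ls finds nothing, the host driver is not being projected into the distro, which points back at the Windows driver rather than at Ollama.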
Cause 3: The model is bigger than your VRAM
If ollama ps shows a CPU/GPU split, nothing is broken — you're out of VRAM, and Ollama is doing exactly what it's designed to do: offload the layers that fit to the GPU and run the rest on CPU. That CPU portion is what tanks your tokens/sec.
A rough VRAM budget: a Q4_K_M quant needs about 0.6 GB per billion parameters, plus 1–2 GB for the KV cache at modest context. So Llama 3.1 8B Q4_K_M wants ~6–7 GB, which is why it fits cleanly on an 8GB card; a 14B Q4 wants ~10 GB; a 32B Q4 wants ~20 GB and will split on anything under a 24GB card.
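That rule of thumb is easy to turn into a quick calculator. The sketch below is only the arithmetic from the paragraph above (0.6 GB per billion parameters plus a KV-cache allowance, defaulting to 1.5 GB as the midpoint of the 1–2 GB range); the function names are hypothetical, and real usage also depends on context length and whatever else is using the card.

```shell
# Estimate VRAM in GB for a Q4_K_M model:
#   params_in_billions * 0.6 + kv_cache_gb
# $1 = parameters in billions, $2 = KV cache allowance in GB (default 1.5).
estimate_vram_gb() {
  awk -v p="$1" -v kv="${2:-1.5}" 'BEGIN { printf "%.1f\n", p * 0.6 + kv }'
}

# Succeed if the estimate fits in the given VRAM size.
# $1 = parameters in billions, $2 = VRAM in GB, $3 = optional KV allowance.
fits_in_vram() {
  awk -v need="$(estimate_vram_gb "$1" "$3")" -v have="$2" \
      'BEGIN { exit !(need <= have) }'
}

# Usage:
#   estimate_vram_gb 8        # 8B model
#   fits_in_vram 32 16 || echo "a 32B Q4 will split on a 16GB card"
```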
Fixes, in order of preference:
- Use a smaller quant. Drop from Q6_K to Q4_K_M, or pull the :8b instead of the :14b tag. See our quantization explainer for what you actually lose (less than people think; Q4_K_M is the sweet spot).
- Shrink the context window. A huge num_ctx eats VRAM through the KV cache. If you set OLLAMA_CONTEXT_LENGTH or num_ctx to 32768 "just in case," that alone can force a split. Drop to 4096 or 8192.
- Unload other models. Ollama keeps recently-used models resident. Run ollama stop <model> to free VRAM, or set OLLAMA_MAX_LOADED_MODELS=1.
- Get more VRAM. If a 32B model is your daily driver, a 24GB card is the real answer; see our VRAM-by-model guide and check whether you actually have enough system RAM for the spillover, too.
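To test whether context size is what's forcing the split, you can send one request with an explicit num_ctx and then re-check ollama ps. A minimal sketch, assuming the server is on the default localhost:11434 and using the options.num_ctx field of the /api/generate endpoint; the function names are hypothetical, and the payload builder does no JSON escaping, so keep the test prompt free of quotes.

```shell
# Build a JSON body for /api/generate with an explicit context window.
# $1 = model tag, $2 = num_ctx, $3 = prompt (no quotes or backslashes).
ctx_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false,"options":{"num_ctx":%d}}\n' \
         "$1" "$3" "$2"
}

# Send the request to a local Ollama server (assumed default address).
generate_with_ctx() {
  curl -s http://localhost:11434/api/generate -d "$(ctx_payload "$1" "$2" "$3")"
}

# Usage:
#   generate_with_ctx llama3.1:8b 4096 "hi"
#   ollama ps     # did the CPU/GPU split go away?
```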
The classic error here is the hard out-of-memory case:
Error: CUDA error: out of memory
Fix: lower the context (num_ctx 2048), use a smaller quant, or stop other loaded models — then retry.
Cause 4: Docker without GPU access
A container does not get GPU access by default. If you run Ollama in Docker and it's on CPU, the host setup is incomplete.
On the host (or inside your WSL2 distro if that's where Docker lives):
# install the toolkit, then register the runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Then launch the container with the GPU flag:
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 ollama/ollama
The two things people forget: installing nvidia-container-toolkit on the host, and the --gpus=all flag itself. Miss either and the container quietly runs on CPU.
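Both halves can be checked before you start the Ollama container. The sketch below assumes that nvidia-ctk registered a runtime named "nvidia" with Docker, so it should show up in docker info; the function name has_nvidia_runtime is made up for illustration.

```shell
# Read the output of
#   docker info --format '{{json .Runtimes}}'
# on stdin and succeed if a runtime named "nvidia" is registered.
has_nvidia_runtime() {
  grep -q '"nvidia"'
}

# Usage:
#   docker info --format '{{json .Runtimes}}' | has_nvidia_runtime \
#     || echo "nvidia runtime not registered; re-run nvidia-ctk and restart docker"
#
#   # Then confirm a throwaway container can actually see the GPU:
#   docker run --rm --gpus=all ubuntu nvidia-smi
```

If the throwaway container prints the usual nvidia-smi table, the host side is done and any remaining CPU fallback is down to how the Ollama container itself was launched.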
Cause 5: Your GPU is too old (compute capability)
Ollama requires compute capability 5.0 or higher — Maxwell (GTX 900 series) and newer. A Kepler-era card (GTX 700 series, compute capability 3.5) will never get GPU acceleration in current Ollama, no matter what driver you install. Datacenter Tesla K80s fall in the same bucket.
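If your driver is recent enough, nvidia-smi can report the compute capability directly, which saves a lookup. A small sketch, assuming the compute_cap query field is available (older drivers may not have it) and using the 5.0 minimum stated above; the function name cc_supported is hypothetical.

```shell
# Succeed if the given compute capability is at least 5.0.
# $1 = compute capability as printed by nvidia-smi (e.g. "6.1").
cc_supported() {
  awk -v cc="$1" 'BEGIN { exit !(cc + 0 >= 5.0) }'
}

# Usage:
#   cc=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
#   cc_supported "$cc" || echo "compute capability $cc is below 5.0"
```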
Check your card's compute capability against NVIDIA's CUDA GPUs list. If it's below 5.0, your only paths are a newer GPU or cloud rental (see Cause 7's note). This is rare on home builds but common when someone tries to repurpose an ancient mining or serv