This article was originally published on runaihome.com
TL;DR: Ollama unloads a model from VRAM 5 minutes after the last request by default, so the next prompt pays a cold-start penalty while weights reload from disk. The fix is one environment variable — OLLAMA_KEEP_ALIVE — set on the service, not your shell. If you're juggling several models, OLLAMA_MAX_LOADED_MODELS decides how many stay resident at once. You almost never need more VRAM; you need the right keep-alive policy.
What you'll be able to do after this guide:
- Keep a model pinned in VRAM indefinitely (or for a set window) so the first token is instant
- Read ollama ps to confirm what's actually loaded and when it'll unload
- Stop two models from fighting over VRAM and thrashing each other out of memory
Honest take: 90% of "Ollama is slow on the first message" complaints are the 5-minute keep-alive timeout doing exactly what it was designed to do. Set OLLAMA_KEEP_ALIVE correctly on the service and the problem disappears, with no hardware change required.
What's actually happening
By default, Ollama keeps a model loaded in memory for 5 minutes after its last request, then unloads it and frees the VRAM (Ollama FAQ). That idle timeout is deliberate: it returns GPU memory to the system so other workloads (or other models) can use it.
The downside shows up the moment you step away. Come back after lunch, send a prompt, and Ollama has to reload the entire model from disk before it can answer. On a 7B model that cold start is roughly 3–10 seconds; on a 70B model loading from a SATA SSD it can be ~74 seconds, versus about 18 seconds from an NVMe drive (Markaicode NVMe load-time benchmarks). To the user it feels like Ollama "froze." It didn't — it's doing a disk-to-VRAM reload because the model went cold.
So before you blame your GPU, confirm the symptom. Open two terminals. In one, run your model. In the other, watch what's resident:
$ ollama ps
NAME           ID              SIZE      PROCESSOR    UNTIL
llama3.1:8b    42182419e950    6.7 GB    100% GPU     4 minutes from now
Three columns matter here:
- PROCESSOR: 100% GPU means the whole model is in VRAM (fast). Anything less means part of it spilled to CPU/system RAM, which tanks tokens/sec. If you're seeing CPU offload, that's a different problem; see our Ollama not using GPU fix.
- SIZE: how much memory this model is holding.
- UNTIL: the countdown to unload. 4 minutes from now is the default 5-minute timer ticking down. This is the column that explains your cold starts.
If ollama ps shows nothing, the model is already unloaded and your next request will be a cold start. That's the whole bug.
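If you'd rather check this from a script than eyeball the table, the same information is available over HTTP. The sketch below assumes the server's /api/ps endpoint returns a "models" list whose entries carry "name", "size", "size_vram" and "expires_at"; that matches my reading of the Ollama API docs, but verify the field names against your version. The function names are my own.

```python
import json
import urllib.request


def summarize_loaded(ps_response):
    """Turn an /api/ps response dict into (name, gpu_percent, expires_at) tuples."""
    rows = []
    for m in ps_response.get("models", []):
        size = m.get("size", 0)
        in_vram = m.get("size_vram", 0)
        # Share of the model held in VRAM; 100 corresponds to "100% GPU".
        gpu_percent = round(100 * in_vram / size) if size else 0
        rows.append((m.get("name"), gpu_percent, m.get("expires_at")))
    return rows


def fetch_loaded(base_url="http://localhost:11434"):
    """Query a running Ollama server (needs a live server, so not called here)."""
    with urllib.request.urlopen(base_url + "/api/ps", timeout=5) as resp:
        return summarize_loaded(json.load(resp))
```

An empty list from fetch_loaded() is the scripted equivalent of ollama ps showing nothing: the next request will be a cold start.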
Fix 1: Set OLLAMA_KEEP_ALIVE on the service (the real fix)
OLLAMA_KEEP_ALIVE controls how long a model stays resident after its last request. It accepts (Ollama FAQ):
- A duration string: "10m", "24h"
- A number in seconds: 3600
- Any negative value to keep it loaded forever: -1
- 0 to unload immediately after each response
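To make those four forms concrete, here is a small Python sketch of how I read them. It is an illustration of the documented behaviour, not Ollama's actual parser, and it only understands the s, m and h units; the function name is hypothetical.

```python
import re

_UNITS = {"s": 1, "m": 60, "h": 3600}


def keep_alive_seconds(value):
    """Return seconds to stay loaded; None means forever, 0 means unload now."""
    if isinstance(value, (int, float)):
        seconds = float(value)
    else:
        text = str(value).strip()
        match = re.fullmatch(r"(-?)((?:\d+(?:\.\d+)?[smh])+)", text)
        if match:
            # Duration string such as "10m", "24h" or "1h30m".
            sign = -1 if match.group(1) else 1
            seconds = sign * sum(
                float(n) * _UNITS[u]
                for n, u in re.findall(r"(\d+(?:\.\d+)?)([smh])", match.group(2))
            )
        else:
            # Bare number given as a string, e.g. "3600" from an env variable.
            seconds = float(text)
    return None if seconds < 0 else seconds
```

Environment variables always arrive as strings, which is why "3600" and "-1" are handled the same way as their numeric forms.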
The trap that wastes the most time: setting it in your shell does nothing. Ollama usually runs as a background service (systemd on Linux, a launch agent on macOS, a tray app on Windows) with its own environment. Exporting OLLAMA_KEEP_ALIVE=-1 in .bashrc is invisible to that service (SumGuy's Ramblings). You have to set it where the service can see it.
Linux (systemd):
sudo systemctl edit ollama.service
Add, under the [Service] section:
[Service]
Environment="OLLAMA_KEEP_ALIVE=-1"
Then reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
-1 pins the model in VRAM permanently — ideal for a dedicated home AI box that only ever runs one model. If you'd rather it free memory overnight, use "8h" instead.
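Because the shell-versus-service mix-up is so easy to make, it's worth confirming what systemd will actually hand to the process. The sketch below shells out to systemctl show with --property=Environment and parses the result. It assumes the unit is named ollama.service as above, and it uses shlex as an approximation of systemd's quoting, which should be fine for simple values like this one.

```python
import shlex
import subprocess


def parse_environment(show_output):
    """Parse 'Environment=A=1 B=2' (systemctl show output) into a dict."""
    env = {}
    for line in show_output.splitlines():
        if not line.startswith("Environment="):
            continue
        for item in shlex.split(line[len("Environment="):]):
            key, sep, value = item.partition("=")
            if sep:
                env[key] = value
    return env


def service_keep_alive(unit="ollama.service"):
    """Ask systemd what the unit will see (needs systemd, so not called here)."""
    out = subprocess.run(
        ["systemctl", "show", unit, "--property=Environment"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_environment(out).get("OLLAMA_KEEP_ALIVE")
```

If service_keep_alive() returns None, the override never reached the unit, no matter what your shell's environment says.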
macOS: set it for the launch environment, then restart the Ollama app:
launchctl setenv OLLAMA_KEEP_ALIVE "-1"
Windows: quit Ollama from the system tray, open Settings → System → About → Advanced system settings → Environment Variables, add a user variable OLLAMA_KEEP_ALIVE with value -1, then relaunch Ollama.
Verify it took effect — UNTIL should now read Forever:
$ ollama ps
NAME           ID              SIZE      PROCESSOR    UNTIL
llama3.1:8b    42182419e950    6.7 GB    100% GPU     Forever
Fix 2: Override per request with keep_alive
If you don't want a global policy — say a script that should load a big model, do one batch job, and release the VRAM — pass keep_alive directly in the API call. The request-level parameter overrides the OLLAMA_KEEP_ALIVE environment variable for that call (Ollama FAQ).
Keep a model loaded for this session:
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1:8b",
"prompt": "Summarize this changelog...",
"keep_alive": -1
}'
Unload immediately when the job is done (frees VRAM the instant the response finishes):
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1:8b",
"keep_alive": 0
}'
Sending a request with an empty prompt just loads (or unloads) the model without generating — handy for preloading.
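For the batch-job case, the two curl calls can be wrapped into one script. This is a minimal sketch using only the standard library; it assumes the default localhost:11434 address from the examples above, sets "stream" to false so the reply arrives as a single JSON object, and uses a "10m" keep-alive during the batch (my choice, not a recommendation from the Ollama docs). The function names are hypothetical.

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:11434/api/generate"  # default address, as in the curl examples


def build_generate_payload(model, prompt=None, keep_alive=None):
    """Build the JSON body; omitting the prompt only loads or unloads the model."""
    payload = {"model": model, "stream": False}
    if prompt is not None:
        payload["prompt"] = prompt
    if keep_alive is not None:
        payload["keep_alive"] = keep_alive
    return payload


def generate(model, prompt=None, keep_alive=None, url=GENERATE_URL):
    """POST to the server (needs a live server, so not exercised here)."""
    body = json.dumps(build_generate_payload(model, prompt, keep_alive)).encode()
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.load(resp)


def run_batch(model, prompts, send=generate):
    """Keep the model warm for the batch, then release VRAM even if a call fails."""
    results = []
    try:
        for p in prompts:
            results.append(send(model, prompt=p, keep_alive="10m").get("response", ""))
    finally:
        send(model, keep_alive=0)  # prompt-less request: just unload
    return results
```

Using a finite keep-alive during the batch rather than -1 means that if the script is killed before the final unload, the model still leaves VRAM on its own.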
Fix 3: Preload before the user shows up
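If your real complaint is "the first request of the day is slow," preload the model at boot so the cold start happens before anyone is waiting. The simplest way is to run the model with an empty prompt, which loads it without generating anything; how long it then stays resident follows the service's keep-alive setting: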
ollama run llama3.1:8b ""
Or via the API in a startup script / systemd unit:
curl http://localhost:11434/api/generate -d '{"model": "llama3.1:8b", "keep_alive": -1}'
Run this from a cron @reboot job or a small systemd service after ollama.service, and your model is warm in VRAM before the first real prompt arrives. Combined with OLLAMA_KEEP_ALIVE=-1, you get instant responses around the clock — at the cost of holding that VRAM permanently.
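One practical wrinkle with boot-time preloading: the job can fire before the Ollama server is accepting connections. A small retry loop handles that. The sketch below is one way to do it in Python, assuming the default localhost:11434 address; the attempt count and delay are arbitrary choices, and the function names are my own.

```python
import json
import time
import urllib.error
import urllib.request

GENERATE_URL = "http://localhost:11434/api/generate"  # default address used in this guide


def post_preload(model, keep_alive=-1, url=GENERATE_URL, timeout=600):
    """Send a prompt-less generate request, which loads the model without generating."""
    body = json.dumps({"model": model, "keep_alive": keep_alive}).encode()
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status == 200


def preload_with_retry(model, attempts=30, delay=2.0, post=post_preload, sleep=time.sleep):
    """Retry until the server accepts the request; returns True on success."""
    for _ in range(attempts):
        try:
            if post(model):
                return True
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            pass  # server not up yet, or still starting
        sleep(delay)
    return False


# Example call for a startup script:
# preload_with_retry("llama3.1:8b")
```

The post and sleep parameters exist only so the loop can be exercised without a live server; in a startup script you would call it with just the model name.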
Fix 4: Stop two models from thrashing
A subtler version of the same problem: you switch between, say, a coding model and a general chat model, and every switch is slow. That's not the keep-alive timer — it's model swapping.
Two environment variables govern this (Ollama FAQ, envconfig source):
| Variable | Default | What it does |
|---|---|---|
| OLLAMA_MAX_LOADED_MODELS | 3 × GPU count (3 for CPU) | How many distinct models can be resident at once |
| OLLAMA_NUM_PARALLEL | auto (1 or 4) | Concurrent requests per model; each parallel slot needs its own KV-cache VRAM |
The catch is physical: on GPU inference, a new model must fit entirely in VRAM alongside whatever's already loaded, or Ollama unloads something to make room (Ollama FAQ). On a 24GB card, two 8B models (≈7 GB each) coexist comfortably and switching is instant. An 8B plus a 32B model won't both fit, so Ollama evicts one on every switch — and you're back to cold starts.
Practical rules:
- Plenty of VRAM, a few small models you alternate between? Raise OLLAMA_MAX_LOADED_MODELS and let them all stay resident.
- Tight on VRAM? Leave it at default and standardize on one model. Forcing two big models into a card that can't hold both just guarantees thrashing.
- Watch the KV cache. Bumping OLLAMA_NUM_PARALLEL multiplies KV-cache VRAM by the number of parallel slots (each slot gets its own full-length context), which can quietly push you into a CUDA out-of-memory error. If you raised parallelism and things got less stable, lower it back.
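The arithmetic behind these rules is simple enough to sanity-check before you change anything. The sketch below is a back-of-the-envelope model, not how Ollama's scheduler actually decides: it adds up the SIZE values you see in ollama ps plus a hypothetical 1 GB reserve, and it treats parallel slots as multiplying the context the KV cache is sized for (the Ollama FAQ describes a 2K context with 4 parallel requests becoming 8K). The 20 GB figure for a 32B model is an assumed example, not a measurement.

```python
def fits_together(vram_gb, model_sizes_gb, reserve_gb=1.0):
    """True if all models fit in VRAM at once, leaving reserve_gb free."""
    return sum(model_sizes_gb) + reserve_gb <= vram_gb


def effective_context(num_ctx, num_parallel):
    """Total context the KV cache is sized for when parallel slots are enabled."""
    return num_ctx * num_parallel


# Two ~7 GB models on a 24 GB card: both stay resident.
print(fits_together(24, [7, 7]))
# A ~7 GB model plus an assumed ~20 GB model: one gets evicted on every switch.
print(fits_together(24, [7, 20]))
# 2K context with 4 parallel slots is allocated as 8K.
print(effective_context(2048, 4))
```

Plug in the SIZE column from your own ollama ps output rather than these placeholder numbers, since quantization and context length change the real footprint.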
Check your VRAM headroom before raising any of these. A used RTX 3090's 24GB holds one 8B model with enormous room to spare, but t