I run a vision-language model (qwen2.5vl:7b) on an NVIDIA DGX Spark for automated camera analysis — three RTSP cameras, one inference call every 5 seconds, 24/7. The model weights are about 6GB. It should use maybe 8-10GB total.
After a week of running, I checked memory usage: 70.8GB out of 120GB.
That's 65GB of VRAM consumed by a 6GB model. Here's what happened.
## The Symptom
Everything was working fine. Inference was fast, results were accurate. I only noticed the problem because I wanted to load a second model and got an out-of-memory error.
```shell
$ ollama ps
NAME            SIZE     PROCESSOR
qwen2.5vl:7b    70.8GB   100% GPU
```
70.8GB for a 7B model. That's not right.
## Finding the Cause
The VRAM breakdown:
- Model weights: ~6 GB
- KV cache: ~65 GB
- Overhead: ~0.5 GB
The KV cache was the problem. But why was it so large?
Every transformer model has a context length — the maximum number of tokens it can process at once. Ollama pre-allocates a KV cache for the full declared context length when a model first loads. And qwen2.5vl:7b declares a context length of 131,072 tokens (128K) in its GGUF metadata.
My requests used about 1,000 tokens each. Ollama allocated memory for 131,072.
## Why Didn't It Shrink?
I tried every obvious fix:
| What I tried | What happened |
|---|---|
| `OLLAMA_NUM_CTX=4096` environment variable | Ignored — doesn't override per-model defaults |
| `"num_ctx": 4096` in `/api/chat` request body | Doesn't shrink an already-loaded model |
| `/v1/chat/completions` (OpenAI-compatible API) | No `num_ctx` parameter available at all |
| Restarting Ollama | Works temporarily — but model reloads at 128K on first request |
The root cause: Ollama reads the model's context length from the GGUF file and allocates the full KV cache on first load. There is no way to override this at request time for an already-loaded model. And in an automated pipeline where requests come every 5 seconds, the model never unloads.
## The Fix
The only reliable solution is to create a derived model with the context size baked into the model definition:
```
# Save as Modelfile.vision
FROM qwen2.5vl:7b
PARAMETER num_ctx 4096
```

```shell
ollama create qwen2.5vl:7b-4k -f Modelfile.vision
```
That's it. Now use qwen2.5vl:7b-4k instead of qwen2.5vl:7b in your API calls.
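Switching an automated pipeline over is a one-word change in the request body. A sketch (the prompt here is a placeholder, not my actual camera-analysis prompt):

```shell
# Same /api/chat endpoint as before; only the "model" field changes.
# The message content is a placeholder for your pipeline's real request.
curl -s http://localhost:11434/api/chat -d '{
  "model": "qwen2.5vl:7b-4k",
  "messages": [{"role": "user", "content": "Describe this camera frame."}],
  "stream": false
}'
```

The base `qwen2.5vl:7b` stays untouched, so you can still use the full 128K context on demand by calling the original name.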
For extra safety, I also set a global default in Ollama's systemd service:
```ini
# /etc/systemd/system/ollama.service
[Service]
Environment="OLLAMA_NUM_CTX=4096"
```

```shell
sudo systemctl daemon-reload
sudo systemctl restart ollama
```
This catches any model that doesn't have an explicit num_ctx — at least it won't silently balloon to 128K.
## The Result
| | Before | After |
|---|---|---|
| Total VRAM used | 70.8 GB | 14.5 GB |
| KV cache context | 131,072 tokens | 4,096 tokens |
| Free VRAM | 37 GB | 85 GB |
| Inference speed | No change | No change |
| Output quality | No change | No change |
56GB of VRAM recovered with zero impact on inference. My requests never used more than ~1K tokens — the other 127K were allocated for nothing.
## Who Is Affected?
This matters if you're running Ollama for automated workloads:
- API servers handling frequent requests (the model stays loaded)
- Chatbots, agents, or monitoring pipelines
- Multiple models on the same GPU
- Any setup where you need predictable memory usage
Interactive chat sessions are less affected because Ollama unloads models after an idle timeout. But if your requests keep the model hot, the full KV cache lives in VRAM permanently.
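If you need to reclaim VRAM from a hot model without restarting the server, Ollama's API accepts a `keep_alive` parameter; sending `0` with an otherwise-empty generate request unloads the model immediately. A minimal sketch:

```shell
# Force-unload a loaded model on demand: keep_alive 0 tells Ollama
# to evict it from VRAM right after handling this (empty) request.
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen2.5vl:7b",
  "keep_alive": 0
}'
```

This is a stopgap, not a fix: the next request reloads the model and re-allocates the full KV cache.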
## Models to Watch Out For
Many popular models declare 128K context by default:
| Model | Default Context | Approx KV Cache |
|---|---|---|
| qwen2.5vl:7b | 128K | ~65 GB |
| qwen2.5:32b | 128K | ~130 GB |
| llama3.1:70b | 128K | ~130 GB |
| mistral-large | 128K | ~130 GB |
Check your model's declared context:
```shell
ollama show <model> | grep -i "context length"
```
Or via the API:
```shell
curl -s http://localhost:11434/api/show \
  -d '{"name":"qwen2.5vl:7b"}' | python3 -m json.tool | grep context
```
## The Rule
For any Ollama model used in automated pipelines:
> Always create a derived Modelfile with an explicit `num_ctx` matching your actual needs.
Some guidelines:
- Vision/camera analysis: 2K–4K tokens
- Chatbot or agent: 4K–8K tokens
- Document analysis: 8K–16K tokens
- RAG with large context: 16K–32K tokens
Never leave a model at its default 128K context unless you actually need 128K. The KV cache allocation is proportional to context size — halving the context roughly halves the memory.
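To see why it scales linearly, here is a back-of-envelope estimate. The layer, head, and dimension values below are illustrative placeholders, not qwen2.5vl's actual configuration; grouped-query attention and KV-cache quantization both shrink the real number considerably.

```shell
#!/bin/sh
# Rough KV cache estimate. All model parameters are assumed values
# for illustration; real models vary (GQA uses far fewer KV heads).
layers=28       # transformer layers (assumed)
kv_heads=28     # KV attention heads (assumed; much lower with GQA)
head_dim=128    # dimension per head (assumed)
bytes=2         # fp16 cache entries
ctx=4096        # the context length you actually allocate

# Two tensors (K and V) per layer, per head, per token.
per_token=$((2 * layers * kv_heads * head_dim * bytes))
total_mib=$((per_token * ctx / 1024 / 1024))
echo "~${per_token} bytes/token, ~${total_mib} MiB at ctx=${ctx}"
```

Note that `ctx` appears exactly once and linearly: set it to 131072 instead of 4096 and the total grows 32x, which is the whole story of this post.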
## Why This Isn't a Bug (But Maybe Should Be)
Ollama's behavior is technically correct — pre-allocating the KV cache avoids the overhead of dynamic resizing during inference. For interactive use, where you might paste a long document or have a deep conversation, having the full context available makes sense.
But for API workloads, it's a footgun. The mismatch between "model supports 128K context" and "my requests use 1K context" is common, and the memory cost is hidden. You don't see it in nvidia-smi as a separate allocation — it's all lumped under the model.
A dynamic or configurable allocation per-request would fix this for API users. Until then, the Modelfile workaround is the best approach.
I documented the full benchmarks and fix in my llama-cpp-distributed-benchmarks repo, which also covers distributed inference across Apple Silicon + NVIDIA Blackwell over 10GbE.
Tags: #ollama #llm #vram #inference #nvidia #machinelearning