I run a vision-language model (qwen2.5vl:7b) on an NVIDIA DGX Spark for automated camera analysis — three RTSP cameras, one inference call every 5 seconds, 24/7. The model weights are about 6GB. It should use maybe 8-10GB total.
After a week of running, I checked memory usage: 70.8GB out of 120GB.
That's 65GB of VRAM consumed by a 6GB model. Here's what happened.
## The Symptom
Everything was working fine. Inference was fast, results were accurate. I only noticed the problem because I wanted to load a second model and got an out-of-memory error.
```shell
$ ollama ps
NAME            SIZE     PROCESSOR
qwen2.5vl:7b    70.8GB   100% GPU
```
70.8GB for a 7B model. That's not right.
## Finding the Cause
The VRAM breakdown:
- Model weights: ~6 GB
- KV cache: ~65 GB
- Overhead: ~0.5 GB
The KV cache was the problem. But why was it so large?
Every transformer model has a context length — the maximum number of tokens it can process at once. Ollama pre-allocates a KV cache for the full declared context length when a model first loads. And qwen2.5vl:7b declares a context length of 131,072 tokens (128K) in its GGUF metadata.
My requests used about 1,000 tokens each. Ollama allocated memory for 131,072.
## Why Didn't It Shrink?
I tried every obvious fix:
| What I tried | What happened |
|---|---|
| `OLLAMA_NUM_CTX=4096` environment variable | Ignored — doesn't override per-model defaults |
| `"num_ctx": 4096` in `/api/chat` request body | Doesn't shrink an already-loaded model |
| `/v1/chat/completions` (OpenAI-compatible API) | No `num_ctx` parameter available at all |
| Restarting Ollama | Works temporarily — but model reloads at 128K on first request |
The root cause: Ollama reads the model's context length from the GGUF file and allocates the full KV cache on first load. There is no way to override this at request time for an already-loaded model. And in an automated pipeline where requests come every 5 seconds, the model never unloads.
## The Fix
The only reliable solution is to create a derived model with the context size baked into the model definition:
```
# Save as Modelfile.vision
FROM qwen2.5vl:7b
PARAMETER num_ctx 4096
```

```shell
ollama create qwen2.5vl:7b-4k -f Modelfile.vision
```
That's it. Now use qwen2.5vl:7b-4k instead of qwen2.5vl:7b in your API calls.
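Switching an automated pipeline over is a one-word change in the request body. A sketch (the prompt here is a placeholder, not my actual camera-analysis prompt):

```shell
# Same /api/chat endpoint as before; only the "model" field changes.
# The message content is a placeholder for your pipeline's real request.
curl -s http://localhost:11434/api/chat -d '{
  "model": "qwen2.5vl:7b-4k",
  "messages": [{"role": "user", "content": "Describe this camera frame."}],
  "stream": false
}'
```

The base `qwen2.5vl:7b` stays untouched, so you can still use the full 128K context on demand by calling the original name.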
For extra safety, I also set a global default in Ollama's systemd service:
```ini
# /etc/systemd/system/ollama.service
[Service]
Environment="OLLAMA_NUM_CTX=4096"
```

```shell
sudo systemctl daemon-reload
sudo systemctl restart ollama
```
This catches any model that doesn't have an explicit num_ctx — at least it won't silently balloon to 128K.
## The Result
| | Before | After |
|---|---|---|
| Total VRAM used | 70.8 GB | 14.5 GB |
| KV cache context | 131,072 tokens | 4,096 tokens |
| Free VRAM | 37 GB | 85 GB |
| Inference speed | No change | No change |
| Output quality | No change | No change |
56GB of VRAM recovered with zero impact on inference. My requests never used more than ~1K tokens — the other 127K were allocated for nothing.
## Who Is Affected?
This matters if you're running Ollama for automated workloads:
- API servers handling frequent requests (the model stays loaded)
- Chatbots, agents, or monitoring pipelines
- Multiple models on the same GPU
- Any setup where you need predictable memory usage
Interactive chat sessions are less affected because Ollama unloads models after an idle timeout. But if your requests keep the model hot, the full KV cache lives in VRAM permanently.
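If you need to reclaim VRAM from a hot model without restarting the server, Ollama's API accepts a `keep_alive` parameter; sending `0` with an otherwise-empty generate request unloads the model immediately. A minimal sketch:

```shell
# Force-unload a loaded model on demand: keep_alive 0 tells Ollama
# to evict it from VRAM right after handling this (empty) request.
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen2.5vl:7b",
  "keep_alive": 0
}'
```

This is a stopgap, not a fix: the next request reloads the model and re-allocates the full KV cache.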
## Models to Watch Out For
Many popular models declare 128K context by default:
| Model | Default Context | Approx KV Cache |
|---|---|---|
| qwen2.5vl:7b | 128K | ~65 GB |
| qwen2.5:32b | 128K | ~130 GB |
| llama3.1:70b | 128K | ~130 GB |
| mistral-large | 128K | ~130 GB |
Check your model's declared context:
```shell
ollama show <model> | grep -i "context length"
```
Or via the API:
```shell
curl -s http://localhost:11434/api/show \
  -d '{"name":"qwen2.5vl:7b"}' | python3 -m json.tool | grep context
```
## The Rule
For any Ollama model used in automated pipelines:
> Always create a derived Modelfile with an explicit `num_ctx` matching your actual needs.
Some guidelines:
- Vision/camera analysis: 2K–4K tokens
- Chatbot or agent: 4K–8K tokens
- Document analysis: 8K–16K tokens
- RAG with large context: 16K–32K tokens
Never leave a model at its default 128K context unless you actually need 128K. The KV cache allocation is proportional to context size — halving the context roughly halves the memory.
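To see why it scales linearly, here is a back-of-envelope estimate. The layer, head, and dimension values below are illustrative placeholders, not qwen2.5vl's actual configuration; grouped-query attention and KV-cache quantization both shrink the real number considerably.

```shell
#!/bin/sh
# Rough KV cache estimate. All model parameters are assumed values
# for illustration; real models vary (GQA uses far fewer KV heads).
layers=28       # transformer layers (assumed)
kv_heads=28     # KV attention heads (assumed; much lower with GQA)
head_dim=128    # dimension per head (assumed)
bytes=2         # fp16 cache entries
ctx=4096        # the context length you actually allocate

# Two tensors (K and V) per layer, per head, per token.
per_token=$((2 * layers * kv_heads * head_dim * bytes))
total_mib=$((per_token * ctx / 1024 / 1024))
echo "~${per_token} bytes/token, ~${total_mib} MiB at ctx=${ctx}"
```

Note that `ctx` appears exactly once and linearly: set it to 131072 instead of 4096 and the total grows 32x, which is the whole story of this post.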
## Why This Isn't a Bug (But Maybe Should Be)
Ollama's behavior is technically correct — pre-allocating the KV cache avoids the overhead of dynamic resizing during inference. For interactive use, where you might paste a long document or have a deep conversation, having the full context available makes sense.
But for API workloads, it's a footgun. The mismatch between "model supports 128K context" and "my requests use 1K context" is common, and the memory cost is hidden. You don't see it in nvidia-smi as a separate allocation — it's all lumped under the model.
A dynamic or configurable allocation per-request would fix this for API users. Until then, the Modelfile workaround is the best approach.
I documented the full benchmarks and fix in my llama-cpp-distributed-benchmarks repo, which also covers distributed inference across Apple Silicon + NVIDIA Blackwell over 10GbE.
Tags: #ollama #llm #vram #inference #nvidia #machinelearning