The crash that started this
I was loading a Q4_K_M quantized 13B model on a 24GB card. The model file was about 7.5GB. Free VRAM according to nvidia-smi: 21GB. Plenty of headroom. I hit run, watched the loader bar, and the process died on the last few layers with CUDA out of memory.
That was not a one-off. I had the same crash twice that week, each time after eyeballing free VRAM and convincing myself a model would fit. After the second one I stopped trusting my eyes and started actually doing the math.
This post is the math, and the small CLI tool I now run before any local inference job.
Why "free VRAM" is not what you think
nvidia-smi reports a snapshot. It tells you what is allocated right now. It does not tell you what your model loader is about to allocate, and it does not account for the things that are about to grow.
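For reference, capturing that snapshot programmatically is trivial: nvidia-smi has a machine-readable query mode. A minimal sketch (the function names are mine, not part of any tool):

```python
import subprocess

def parse_free_mb(csv_text):
    """Parse `nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits`
    output: one integer (MiB free) per GPU, one line per GPU."""
    return [int(line.strip()) for line in csv_text.strip().splitlines()]

def free_vram_mb():
    """Snapshot of free VRAM per GPU. Stale the moment it returns."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_free_mb(out)
```

Polling this right before a load shrinks the race window but does not close it, which is exactly the problem the rest of this post is about.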
Three buckets eat into the gap between "reported free" and "actually usable":
1. CUDA context overhead. Loading a CUDA context for inference costs a few hundred MB on its own. Each process you spawn pays this tax. If you have a Jupyter kernel, an Ollama daemon, and a llama.cpp test all sharing one GPU, you are paying it three times.
2. The display server and other tenants. On a workstation, the desktop compositor sits on the same card. Browsers with hardware acceleration drift up and down by a couple of GB depending on what you have open. The number you saw in nvidia-smi was accurate a moment ago; it is not a promise about now.
3. The thing nobody warns you about: the KV cache. Quantized weights are only one part of the bill. As soon as you start generating tokens, the model allocates a key-value cache that scales linearly with context length and the number of attention layers. For a 13B model with 4096 context, the KV cache alone can be 1.5 to 2.5 GB. For 32k context, it can rival the model file itself.
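To put a rough number on that third bucket: the cache stores one key and one value vector per layer per token, so its size is a straightforward product. The shapes below are assumptions for a Llama-2-style 13B (40 layers, 40 KV heads, head dim 128), not numbers from any specific loader:

```python
def kv_cache_gb(ctx_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """KV cache size in decimal GB: one K and one V tensor per layer,
    each of shape [ctx_len, n_kv_heads * head_dim]."""
    return 2 * n_layers * ctx_len * n_kv_heads * head_dim * bytes_per_elem / 1e9

# Llama-2-13B-style shapes (assumed): 40 layers, 40 KV heads, head_dim 128.
print(kv_cache_gb(4096, 40, 40, 128, 2))  # fp16 cache: ~3.36 GB
print(kv_cache_gb(4096, 40, 40, 128, 1))  # 8-bit cache: ~1.68 GB
```

An 8-bit cache lands inside the 1.5 to 2.5 GB range above; a full fp16 cache is roughly double that. Either way it is real money on a consumer card.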
That is why my "21GB free, 7.5GB model, should fit" math kept failing. I was budgeting weights and ignoring everything around them.
What the math actually looks like
A more honest VRAM budget for loading a quantized model:
required = weights_on_disk
         + kv_cache(context_length, num_layers, hidden_dim, dtype)
         + activation_overhead (~10-20% of weights for batched inference)
         + cuda_context_per_process (~300-500 MB)
         + safety_buffer (1-2 GB you do not touch)
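As a sanity check, the budget above fits in a few lines of Python. The default fractions are illustrative midpoints of the ranges listed, not measured constants:

```python
def vram_budget_gb(weights_gb, kv_gb, activation_frac=0.15,
                   cuda_ctx_gb=0.4, buffer_gb=2.0):
    """Sum the five budget lines: weights, KV cache, activation overhead
    as a fraction of weights, CUDA context, and an untouchable buffer.
    Defaults are mid-range guesses, not measurements."""
    return (weights_gb
            + kv_gb
            + weights_gb * activation_frac
            + cuda_ctx_gb
            + buffer_gb)

# The 13B scenario from the intro: 7.5 GB of weights, ~2 GB of KV cache.
print(vram_budget_gb(7.5, 2.0))  # roughly 13 GB, far more than the 7.5 GB file
```

The point of writing it down is that the answer for "a 7.5GB model" is not 7.5.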
For most consumer setups, the safety buffer is the part people skip and then regret. If your "free" VRAM is exactly equal to your budget, you are one Chrome tab away from an OOM.
I now apply a simple rule: if my computed budget plus a 2GB safety buffer does not fit in current free VRAM, I do not load the model. I either drop to a smaller quantization, shrink the context length, or move to another card.
A pre-flight CLI
I got tired of doing this calculation in my head before every load, so I wrapped it in a tiny CLI called gpu-memory-guard. It does one thing: tells you whether a model will fit before you try to load it.
Install:
pip install gpu-memory-guard
Check the current state of every GPU on the box:
$ gpu-guard
GPU Memory Status
============================================================
GPU 0: NVIDIA RTX 5090
  Total:     32.00GB
  Used:      4.12GB
  Available: 27.88GB
  Util:      12.9%
Check whether a specific model will fit, with a buffer:
$ gpu-guard --model-size 18 --buffer 2
Required: 20.00GB (18.00 model + 2.00 buffer)
Available: 27.88GB
Status: FITS
Use it as a guard in front of an inference command. It exits with code 1 if the model would not fit, so you can chain it:
gpu-guard --model-size 8 && ./main -m model.gguf -n 256
If the check fails, the inference command never runs, and you do not get a half-loaded process eating VRAM until you kill it.
JSON output is there for when you want to wire it into a scheduler or CI:
$ gpu-guard --model-size 13 --json
{
  "fits": true,
  "required_gb": 13.0,
  "available_gb": 27.88,
  "gpus": [...]
}
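A sketch of what the consuming side might look like; the field names are just the ones in the sample above, and a real scheduler would run the CLI via subprocess and feed its stdout into this:

```python
import json

def admit(gpu_guard_json):
    """Turn gpu-guard --json output into an admission decision.
    Assumes only the top-level fields shown in the sample output:
    "fits", "required_gb", "available_gb"."""
    report = json.loads(gpu_guard_json)
    if report["fits"]:
        return True, ""
    short = report["required_gb"] - report["available_gb"]
    return False, f"short by {short:.2f} GB"

sample = '{"fits": false, "required_gb": 20.0, "available_gb": 17.5, "gpus": []}'
print(admit(sample))  # (False, 'short by 2.50 GB')
```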
Using it from Python
Most of the time I run it from the shell, but the same checks are exposed as a small library so you can put them at the top of a loader script:
from gpu_guard import check_vram, get_gpu_info

fits, message = check_vram(model_size_gb=18, buffer_gb=2)
print(message)
if not fits:
    raise RuntimeError("Refusing to load model: " + message)
# proceed to load
This is the pattern I use inside CastelOS. Every inference job goes through an admission check before the model is touched. If the check fails, the job is rejected with a clear reason, not a stack trace from the middle of a loader. That single change has cut our half-loaded zombie processes to zero.
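The admission-check pattern itself is easy to replicate. Here is a hypothetical decorator (not CastelOS code), assuming the check_vram signature shown earlier; the check is injectable so the guard can be exercised without a GPU:

```python
import functools

def requires_vram(model_size_gb, buffer_gb=2.0, check=None):
    """Reject a loader call before it touches the GPU.

    `check` defaults to gpu_guard.check_vram at call time; pass your own
    (model_size_gb, buffer_gb) -> (fits, message) callable for testing."""
    def decorator(load_fn):
        @functools.wraps(load_fn)
        def wrapper(*args, **kwargs):
            checker = check
            if checker is None:
                from gpu_guard import check_vram
                checker = check_vram
            fits, message = checker(model_size_gb=model_size_gb,
                                    buffer_gb=buffer_gb)
            if not fits:
                raise RuntimeError("admission check failed: " + message)
            return load_fn(*args, **kwargs)
        return wrapper
    return decorator

# With an injected failing check, the decorated loader never runs:
@requires_vram(18, check=lambda model_size_gb, buffer_gb: (False, "need 20.00GB"))
def load_model():
    return "loaded"
```

The useful property is that the rejection happens before any weights move, so a failed job leaves nothing behind to clean up.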
What I would still build on top of this
gpu-memory-guard is intentionally small. It checks weights plus a buffer, and that already catches the majority of OOMs in practice because the worst offender is people loading models that are obviously too big and pretending the buffer will save them.
The next layer, which I have not committed yet, is a proper KV cache estimator that takes context length, layer count, and head dim as inputs and gives you a real number. That would let you answer "can I run this 13B model at 32k context, or do I need to drop to 16k?" without either crashing or guessing.
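To sketch what that estimator might answer, you can invert the KV sizing formula to get the largest context a given budget supports. The architecture numbers below are assumptions for a Llama-2-style 13B, and none of this is in the tool today:

```python
def max_context(budget_gb, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Largest context length whose KV cache fits in budget_gb.
    Inverts kv_bytes = 2 * n_layers * ctx * n_kv_heads * head_dim * bytes."""
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return int(budget_gb * 1e9 // per_token_bytes)

# Llama-2-13B-style shapes (assumed), fp16 cache, 5 GB left over for KV:
print(max_context(5.0, 40, 40, 128, 2))  # 6103 tokens: 32k is out, so is 8k
```

With that in hand, "drop to 16k" stops being a guess and becomes an arithmetic answer.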
If you have ideas for the API, or you want a different cost model, the repo takes issues and PRs:
https://github.com/CastelDazur/gpu-memory-guard
The small lesson
If you run local models, the cheapest reliability fix you can ship is a check that runs before the loader, not after the crash. nvidia-smi was never designed to be a budget. The minute you treat free VRAM as a number you can spend down to zero, you are going to lose work.
A 50-line CLI does not fix this on its own. It just removes one excuse to skip the check.