<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dmytro Romanov</title>
    <description>The latest articles on DEV Community by Dmytro Romanov (@casteldazur).</description>
    <link>https://dev.to/casteldazur</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3868279%2F4ddb792c-d076-4be5-9816-ef7d07a65bd3.png</url>
      <title>DEV Community: Dmytro Romanov</title>
      <link>https://dev.to/casteldazur</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/casteldazur"/>
    <language>en</language>
    <item>
      <title>How I Stopped GGUF Models From Crashing My GPU: A Pre-flight VRAM Check</title>
      <dc:creator>Dmytro Romanov</dc:creator>
      <pubDate>Wed, 08 Apr 2026 17:21:09 +0000</pubDate>
      <link>https://dev.to/casteldazur/how-i-stopped-gguf-models-from-crashing-my-gpu-a-pre-flight-vram-check-44i2</link>
      <guid>https://dev.to/casteldazur/how-i-stopped-gguf-models-from-crashing-my-gpu-a-pre-flight-vram-check-44i2</guid>
      <description>&lt;h2&gt;
  
  
  The crash that started this
&lt;/h2&gt;

&lt;p&gt;I was loading a Q4_K_M quantized 13B model on a 24GB card. The model file was about 7.5GB. Free VRAM according to &lt;code&gt;nvidia-smi&lt;/code&gt;: 21GB. Plenty of headroom. I hit run, watched the loader bar, and the process died on the last few layers with &lt;code&gt;CUDA out of memory&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That was not a one-off. I had the same crash twice that week, each time after eyeballing free VRAM and convincing myself a model would fit. After the second one I stopped trusting my eyes and started actually doing the math.&lt;/p&gt;

&lt;p&gt;This post is the math, and the small CLI tool I now run before any local inference job.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why "free VRAM" is not what you think
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;nvidia-smi&lt;/code&gt; reports a snapshot. It tells you what is allocated right now. It does not tell you what your model loader is about to allocate, and it does not account for the things that are about to grow.&lt;/p&gt;

&lt;p&gt;Three buckets eat into the gap between "reported free" and "actually usable":&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. CUDA context overhead.&lt;/strong&gt; Initializing a CUDA context for inference costs a few hundred MB on its own. Each process you spawn pays this tax. If you have a Jupyter kernel, an Ollama daemon, and a llama.cpp test all sharing one GPU, you are paying it three times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The display server and other tenants.&lt;/strong&gt; On a workstation, the desktop compositor sits on the same card. Browsers with hardware acceleration drift up and down by a couple of GB depending on what you have open. That number you saw in &lt;code&gt;nvidia-smi&lt;/code&gt; was a moment ago.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. The thing nobody warns you about: the KV cache.&lt;/strong&gt; Quantized weights are only one part of the bill. As soon as you start generating tokens, the model allocates a key-value cache that scales linearly with context length, layer count, and hidden dimension. For a 13B model with 4096 context, the KV cache alone can be 1.5 to 2.5 GB. For 32k context, it can rival the model file itself.&lt;/p&gt;
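&lt;p&gt;The numbers above fall out of a simple formula. Here is a rough sketch of the estimate, assuming full multi-head attention and an fp16 cache; grouped-query attention or an 8-bit cache shrinks it proportionally, which is how you get back into the 1.5 to 2.5 GB range. The Llama-2-13B-style dimensions (40 layers, hidden size 5120) are illustrative:&lt;/p&gt;

```python
# Rough KV cache estimate: two tensors (K and V) per layer, each
# context_length x hidden_dim elements, bytes_per_element wide.
# Assumes full multi-head attention; GQA divides the effective
# hidden_dim, and an 8-bit cache halves bytes_per_element.
def kv_cache_bytes(num_layers, context_length, hidden_dim, bytes_per_element=2):
    return 2 * num_layers * context_length * hidden_dim * bytes_per_element

# Illustrative 13B-class dimensions (40 layers, hidden 5120)
fp16_4k = kv_cache_bytes(40, 4096, 5120)
fp16_32k = kv_cache_bytes(40, 32768, 5120)
print(f"4k context, fp16 cache:  {fp16_4k / 1e9:.2f} GB")   # ~3.36 GB
print(f"32k context, fp16 cache: {fp16_32k / 1e9:.2f} GB")  # ~26.84 GB
```

&lt;p&gt;At 32k context the cache dwarfs a 7.5GB model file, which is exactly the failure mode described above.&lt;/p&gt;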

&lt;p&gt;That is why my "21GB free, 7.5GB model, should fit" math kept failing. I was budgeting weights and ignoring everything around them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the math actually looks like
&lt;/h2&gt;

&lt;p&gt;A more honest VRAM budget for loading a quantized model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;required = weights_on_disk
         + kv_cache(context_length, num_layers, hidden_dim, dtype)
         + activation_overhead       (~10-20% of weights for batched inference)
         + cuda_context_per_process  (~300-500 MB)
         + safety_buffer             (1-2 GB you do not touch)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
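&lt;p&gt;Translated into code, the budget is a few lines plus defaults. A minimal sketch; the default fractions are rough midpoints of the ranges above, not measurements:&lt;/p&gt;

```python
# Honest VRAM budget, mirroring the formula above.
# Defaults are rough midpoints of the ranges in the text.
def required_vram_gb(weights_gb, kv_cache_gb,
                     activation_frac=0.15,   # ~10-20% of weights
                     cuda_context_gb=0.4,    # ~300-500 MB per process
                     safety_buffer_gb=2.0):  # headroom you never spend
    return (weights_gb
            + kv_cache_gb
            + weights_gb * activation_frac
            + cuda_context_gb
            + safety_buffer_gb)

# The 13B example from the intro: 7.5 GB weights, ~2 GB KV cache at 4k context
budget = required_vram_gb(7.5, 2.0)
print(f"{budget:.2f} GB")  # roughly 13 GB, not the 7.5 GB I was budgeting
```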



&lt;p&gt;For most consumer setups, the safety buffer is the part people skip and then regret. If your "free" VRAM is exactly equal to your budget, you are one Chrome tab away from an OOM.&lt;/p&gt;

&lt;p&gt;I now apply a simple rule: if my computed budget plus a 2GB safety buffer does not fit in current free VRAM, I do not load the model. I either drop to a smaller quantization, a smaller context length, or another card.&lt;/p&gt;

&lt;h2&gt;
  
  
  A pre-flight CLI
&lt;/h2&gt;

&lt;p&gt;I got tired of doing this calculation in my head before every load, so I wrapped it in a tiny CLI called &lt;code&gt;gpu-memory-guard&lt;/code&gt;. It does one thing: tells you whether a model will fit before you try to load it.&lt;/p&gt;

&lt;p&gt;Install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;gpu-memory-guard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the current state of every GPU on the box:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;gpu-guard

GPU Memory Status
&lt;span class="o"&gt;============================================================&lt;/span&gt;

GPU 0: NVIDIA RTX 5090
  Total:     32.00GB
  Used:       4.12GB
  Available: 27.88GB
  Util:      12.9%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check whether a specific model will fit, with a buffer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;gpu-guard &lt;span class="nt"&gt;--model-size&lt;/span&gt; 18 &lt;span class="nt"&gt;--buffer&lt;/span&gt; 2
Required: 20.00GB &lt;span class="o"&gt;(&lt;/span&gt;18.00 model + 2.00 buffer&lt;span class="o"&gt;)&lt;/span&gt;
Available: 27.88GB
Status: FITS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use it as a guard in front of an inference command. It exits with code 1 if the model would not fit, so you can chain it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gpu-guard &lt;span class="nt"&gt;--model-size&lt;/span&gt; 8 &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; ./main &lt;span class="nt"&gt;-m&lt;/span&gt; model.gguf &lt;span class="nt"&gt;-n&lt;/span&gt; 256
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the check fails, the inference command never runs, and you do not get a half-loaded process eating VRAM until you kill it.&lt;/p&gt;

&lt;p&gt;JSON output is there for when you want to wire it into a scheduler or CI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;gpu-guard &lt;span class="nt"&gt;--model-size&lt;/span&gt; 13 &lt;span class="nt"&gt;--json&lt;/span&gt;
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"fits"&lt;/span&gt;: &lt;span class="nb"&gt;true&lt;/span&gt;,
  &lt;span class="s2"&gt;"required_gb"&lt;/span&gt;: 13.0,
  &lt;span class="s2"&gt;"available_gb"&lt;/span&gt;: 27.88,
  &lt;span class="s2"&gt;"gpus"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;...]
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Using it from Python
&lt;/h2&gt;

&lt;p&gt;Most of the time I run it from the shell, but the same checks are exposed as a small library so you can put them at the top of a loader script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;gpu_guard&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;check_vram&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_gpu_info&lt;/span&gt;

&lt;span class="n"&gt;fits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;check_vram&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_size_gb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;buffer_gb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;fits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Refusing to load model: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# proceed to load
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the pattern I use inside CastelOS. Every inference job goes through an admission check before the model is touched. If the check fails, the job is rejected with a clear reason, not a stack trace from the middle of a loader. That single change has cut our half-loaded zombie processes to zero.&lt;/p&gt;
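&lt;p&gt;The admission-check pattern is small enough to show in full. This is not the CastelOS code, just the shape of it as a pure function, which keeps it easy to test; in practice you would wire &lt;code&gt;available_gb&lt;/code&gt; to the library's &lt;code&gt;check_vram&lt;/code&gt; query:&lt;/p&gt;

```python
# Admission check as a pure function: decide before the loader runs,
# and reject with a reason instead of a mid-load stack trace.
def admission_check(model_size_gb, available_gb, buffer_gb=2.0):
    required = model_size_gb + buffer_gb
    if required > available_gb:
        return False, f"needs {required:.2f} GB, only {available_gb:.2f} GB free"
    return True, f"fits: {required:.2f} GB of {available_gb:.2f} GB"

admitted, reason = admission_check(18, 27.88)
print(admitted, reason)
```

&lt;p&gt;The point of keeping it pure is that the reject path is exercised in tests, not discovered in production.&lt;/p&gt;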

&lt;h2&gt;
  
  
  What I would still build on top of this
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;gpu-memory-guard&lt;/code&gt; is intentionally small. It checks weights plus a buffer, and that already catches the majority of OOMs in practice because the worst offender is people loading models that are obviously too big and pretending the buffer will save them.&lt;/p&gt;

&lt;p&gt;The next layer, which I have not committed yet, is a proper KV cache estimator that takes context length, layer count, and head dim as inputs and gives you a real number. That would let you answer "can I run this 13B model at 32k context, or do I need to drop to 16k?" without either crashing or guessing.&lt;/p&gt;
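&lt;p&gt;In the meantime you can invert the KV cache arithmetic by hand to get a ceiling on context length. A rough sketch under the same assumptions as before (full multi-head attention, fp16 cache, illustrative 13B-class dimensions):&lt;/p&gt;

```python
# Invert the KV estimate: how many tokens of context fit in a VRAM budget?
def max_context_tokens(vram_budget_gb, num_layers, hidden_dim, bytes_per_element=2):
    per_token_bytes = 2 * num_layers * hidden_dim * bytes_per_element
    return int(vram_budget_gb * 1e9 // per_token_bytes)

# 13B-class dims (40 layers, hidden 5120), 6 GB left after weights + overhead
print(max_context_tokens(6.0, 40, 5120))  # 7324
```

&lt;p&gt;So with 6 GB of slack, a 16k context is already out of reach at fp16; you would need GQA, a quantized cache, or a shorter context.&lt;/p&gt;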

&lt;p&gt;If you have ideas for the API, or you want a different cost model, the repo takes issues and PRs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/CastelDazur/gpu-memory-guard" rel="noopener noreferrer"&gt;https://github.com/CastelDazur/gpu-memory-guard&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The small lesson
&lt;/h2&gt;

&lt;p&gt;If you run local models, the cheapest reliability fix you can ship is a check that runs before the loader, not after the crash. &lt;code&gt;nvidia-smi&lt;/code&gt; was never designed to be a budget. The minute you treat free VRAM as a number you can spend down to zero, you are going to lose work.&lt;/p&gt;

&lt;p&gt;A 50-line CLI does not fix this on its own. It just removes one excuse to skip the check.&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>gpu</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
  </channel>
</rss>
