Patrick Hughes

Posted on Jun 8 • Edited on Jun 30 • Originally published at bmdpat.com

How to Tune --n-gpu-layers for Your VRAM Budget

#localllm #llamacpp #gpu #vram

I wrote an explainer on llama.cpp's --n-gpu-layers flag and it keeps pulling traffic. The explainer covers what the flag does. This post covers the part people actually struggle with: how to pick the right number, do the offload math, split across two GPUs, and stop the out-of-memory crashes.

What the flag really controls

A model is a stack of transformer layers. --n-gpu-layers (or -ngl) tells llama.cpp how many of those layers to put on the GPU. The rest run on the CPU.

Layers on the GPU run fast. Layers on the CPU run slow. So your goal is simple: put as many layers on the GPU as will fit, and not one more. One layer too many and you get an out-of-memory crash or a silent spill that tanks your speed.

If the whole model fits, just set -ngl 99 and forget it. Any value at or above the model's layer count offloads everything, and 99 covers most models. The number only matters when the model is bigger than your VRAM.

The offload math

Each layer takes roughly the same amount of memory. So the math is:

vram_per_layer = model_weights_gb / total_layers
layers_that_fit = floor((free_vram_gb - overhead_gb) / vram_per_layer)

Work an example. A 13B model at Q4 is about 7.5 GB of weights across 40 layers. That is roughly 0.19 GB per layer.

You have an 8 GB card. Reserve about 1.5 GB for the KV cache and overhead. That leaves 6.5 GB for layers.

6.5 / 0.19 ≈ 34 layers

So start at -ngl 34 for that model on that card. The other 6 layers run on CPU. You get most of the speed of a full GPU load without the crash.
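The same arithmetic fits in a few lines of Python. This is a minimal sketch of the formula above, not anything llama.cpp ships. The function and parameter names are made up for illustration, and it assumes every layer is the same size, which is only roughly true.

```python
import math


def layers_that_fit(weights_gb, total_layers, free_vram_gb, overhead_gb):
    """Estimate a starting -ngl value from the offload formula.

    All names here are illustrative; the estimate assumes equal-sized layers.
    """
    vram_per_layer = weights_gb / total_layers
    usable_gb = free_vram_gb - overhead_gb
    if usable_gb <= 0:
        return 0  # nothing left for layers after the reserve
    # Round down (a partial layer cannot be offloaded) and cap at the layer count.
    return min(total_layers, math.floor(usable_gb / vram_per_layer))


# The worked example: 13B at Q4, 7.5 GB over 40 layers, 8 GB card, 1.5 GB reserve.
print(layers_that_fit(7.5, 40, 8.0, 1.5))  # prints 34
```

Treat the result as a starting point, not a guarantee. The reserve is a guess until you measure it.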

Find the real number fast

The math gets you close. Then you tune by hand. Watch VRAM in one terminal and step the number in another.

# terminal 1
nvidia-smi -l 1

# terminal 2: start lower than the math says, then climb
./llama-cli -m model-q4.gguf -ngl 30 -p "test prompt" -c 4096

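Climb by 2 or 3 layers each run and watch nvidia-smi. When VRAM hits about 90 percent, stop and leave the rest as headroom. llama.cpp sizes the KV cache from the context length you pass with -c, not from the prompt you happen to test with, so a load that fits at a small -c can crash once you raise it to handle a real 3000-token input.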

That last point is the number one cause of OOM crashes. People tune with a tiny prompt, see it fit, ship it, then crash on a real long input. Always tune at the context length you will actually use, set with -c.
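If you would rather script the 90 percent check than eyeball it, nvidia-smi can print memory as CSV. Here is a rough sketch: the query flags are real nvidia-smi options, but the helper names and the 0.90 default are my own choices, and the example feeds in canned output so it runs without a GPU.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs, one line per GPU."""
    pairs = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        pairs.append((used, total))
    return pairs


def over_budget(csv_text, limit=0.90):
    """Return True if any GPU is at or above the VRAM fraction `limit`."""
    return any(used / total >= limit for used, total in parse_gpu_memory(csv_text))


def read_gpu_memory():
    """Run nvidia-smi once and return its raw CSV output (needs an NVIDIA driver)."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout


# Example with canned output for a single 8 GB card, so it runs without a GPU.
sample = "7500, 8192\n"
print(parse_gpu_memory(sample))  # prints [(7500, 8192)]
print(over_budget(sample))       # prints True
```

On a real machine, pass read_gpu_memory() to over_budget() after each run at your target -c and stop climbing when it returns True.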

Common OOM mistakes

You set -ngl too high and forgot the KV cache. The cache is not free. At an 8K context it can eat anywhere from a GB or two to several GB on a 13B, depending on the model's architecture and the cache precision. Reserve for it.

You raised the context length and kept the old -ngl. Bigger context means a bigger cache means less room for layers. Re-tune when you change -c.

You loaded a second model on the same card. Two models share one pool of VRAM. The first one's -ngl no longer fits.

You assumed Q8 fits because Q4 did. Q8 is nearly double the weight memory. The layer math changes completely.
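The first two mistakes both come down to cache size, and you can estimate it before you load anything. The sketch below uses the standard transformer formula (keys plus values, for every layer and every position) with 16-bit entries. It is my assumption, not something read out of llama.cpp, and both model shapes are hypothetical. Check your model's metadata for the real layer and KV-head counts.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_value=2):
    """Approximate KV cache size in GiB: keys plus values for every layer and position."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value
    return total_bytes / 2**30


# Hypothetical 40-layer shapes, chosen only for illustration.
full_heads = dict(n_layers=40, n_kv_heads=40, head_dim=128)  # every head keeps its own K/V
grouped = dict(n_layers=40, n_kv_heads=8, head_dim=128)      # grouped-query attention

print(kv_cache_gib(n_ctx=4096, **full_heads))  # prints 3.125
print(kv_cache_gib(n_ctx=8192, **full_heads))  # prints 6.25
print(kv_cache_gib(n_ctx=8192, **grouped))     # prints 1.25
```

Doubling the context doubles the cache, and the architecture can move the number by several times. That is why the reserve needs a fresh look whenever -c or the model changes.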

Splitting across two GPUs

If you have two cards, llama.cpp can split the model across both. Use --tensor-split to set the ratio.

# two cards, 24 GB and 32 GB: weight the bigger card heavier
./llama-cli -m big-model-q5.gguf -ngl 99 --tensor-split 24,32 -p "test"

The numbers are a ratio, not gigabytes, but matching them to your VRAM sizes is a good start. With -ngl 99 and a split, llama.cpp puts all layers on the GPUs and divides them by the ratio. Now a model that is too big for either card alone can still fit across both.

One catch. Splitting adds a little cross-GPU traffic, so two 16 GB cards are a touch slower than one 32 GB card at the same total memory. Still far faster than spilling to CPU.
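To get a feel for what a ratio means in layers, here is a small sketch that divides a layer count proportionally. It is an approximation with my own rounding rule (largest remainder), not llama.cpp's actual assignment code, so expect the real split to differ by a layer here or there. The 60-layer model is hypothetical.

```python
def split_layers(total_layers, ratios):
    """Divide layers across GPUs in proportion to `ratios` (largest-remainder rounding)."""
    total = sum(ratios)
    exact = [total_layers * r / total for r in ratios]
    counts = [int(x) for x in exact]
    leftover = total_layers - sum(counts)
    # Hand the leftover layers to the GPUs with the largest fractional parts.
    by_fraction = sorted(range(len(ratios)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in by_fraction[:leftover]:
        counts[i] += 1
    return counts


# A hypothetical 60-layer model on the 24 GB and 32 GB cards from the example.
print(split_layers(60, [24, 32]))  # prints [26, 34]
```

Multiply each count by your per-layer size from the offload math and you can see whether each card's share fits before you launch.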

When this runs inside an agent

Tuning -ngl gets a single run fast. But if a local model sits behind an agent that calls it in a loop, a stuck loop can peg both GPUs for hours and run your power bill up overnight. Local does not mean free.

That is why I built AgentGuard. It is an open-source runtime budget, token, and rate limiter for AI agents, and it caps your agent loop whether the model is a cloud API or a local GGUF on your own cards. pip install agentguard, wrap the loop, set a cap, and a runaway agent stops before it costs you a night of compute.
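To make the idea of a loop cap concrete, here is a plain-Python sketch. It is not AgentGuard's API, which this post does not show. Every name in it (LoopBudget, BudgetExceeded, run_agent) is hypothetical, and it only illustrates the general pattern of counting calls and wall-clock time and stopping at a limit.

```python
import time


class BudgetExceeded(RuntimeError):
    """Raised when the loop has used up its call or time allowance."""


class LoopBudget:
    def __init__(self, max_calls, max_seconds):
        self.max_calls = max_calls
        self.max_seconds = max_seconds
        self.calls = 0
        self.started = time.monotonic()

    def charge(self):
        """Count one model call; raise once either cap is exceeded."""
        self.calls += 1
        if self.calls > self.max_calls:
            raise BudgetExceeded(f"more than {self.max_calls} model calls")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded(f"more than {self.max_seconds} seconds")


def run_agent(step, budget):
    """Call `step` until it returns True or the budget runs out."""
    try:
        while True:
            budget.charge()
            if step():
                return "done"
    except BudgetExceeded as exc:
        return f"stopped: {exc}"


# A stand-in step that never finishes, like a stuck loop.
print(run_agent(lambda: False, LoopBudget(max_calls=5, max_seconds=3600)))
# prints: stopped: more than 5 model calls
```

In a real agent, step would be the function that calls your local model, and the caps would come from how much compute you are willing to burn unattended.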

Do the layer math, tune at your real context length, leave headroom for the cache, and split across cards when one is not enough. That is the whole game.

Originally published on bmdpat.com. I run a one-person AI agent company and write about what actually works.
