Patrick Hughes

Posted on Jun 9 • Edited on Jun 30 • Originally published at bmdpat.com

How to Tune llama.cpp --n-gpu-layers: A Practical VRAM Guide (2026)

#localllm #llamacpp #gpu #vram

You already know what --n-gpu-layers does. It moves transformer layers onto your GPU. This post is the next step: how to actually pick the number.

If you want the basics first, read the original: llama.cpp n-gpu-layers explained. This is the tuning guide that follows it.

The one rule that matters

A model has a fixed number of layers. A 7B model might have 32. A 70B might have 80. The --n-gpu-layers flag (short form -ngl, written ngl below) says how many of those go on the GPU. The rest stay on the CPU and run in system RAM.

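Full GPU means fast. Full CPU means slow. Partial lands somewhere in between. Time per token drops roughly in step with each layer you move over, but tokens per second does not. The layers left on the CPU dominate the total, so offloading half the layers gets you well under half of the full-GPU speedup. The last few layers you manage to fit matter the most.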
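One way to reason about partial offload is a toy timing model. This sketch assumes every layer takes a fixed time on the CPU and a smaller fixed time on the GPU, and that per-token time is just the sum. The function name and the 3.0 ms and 0.3 ms timings are made up for illustration, not measurements.

```python
def tokens_per_second(n_layers, ngl, cpu_ms_per_layer, gpu_ms_per_layer):
    """Toy model: per-token time is the sum of the per-layer times."""
    # Clamp ngl to the layer count, so a large value like 99 means "all layers".
    ngl = max(0, min(ngl, n_layers))
    total_ms = ngl * gpu_ms_per_layer + (n_layers - ngl) * cpu_ms_per_layer
    return 1000.0 / total_ms


# Hypothetical timings: 3.0 ms per layer on CPU, 0.3 ms per layer on GPU.
for ngl in (0, 16, 32):
    print(ngl, round(tokens_per_second(32, ngl, 3.0, 0.3), 1))
# 0 10.4
# 16 18.9
# 32 104.2
```

With these made-up numbers, half the layers on the GPU gives less than 2x, while full offload gives 10x. Real hardware will differ, but the shape is the reason to chase every layer that fits.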

So the goal is simple. Put as many layers on the GPU as your VRAM allows. Not one more.

The VRAM math

Each layer costs roughly the same amount of VRAM. You can estimate it.

Take the model file size on disk. Divide by the layer count. That gives you a rough per-layer cost.

A 7B model quantized to Q4 is around 4 GB. Split across 32 layers, that is about 125 MB per layer. Offload 24 layers and you spend roughly 3 GB on weights.

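This is an estimate, not a promise. The file also holds the token embedding and output tensors, which sit outside the repeating layers, so the true per-layer cost is a little lower than the average. But the per-layer average holds well enough to plan with.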
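The arithmetic above fits in two small functions. This is a rough sketch using decimal units (1 GB = 1000 MB) to match the numbers in the text. The function names are my own, not anything from llama.cpp.

```python
def per_layer_mb(file_size_gb, n_layers):
    """Rough per-layer weight cost: file size divided by layer count."""
    return file_size_gb * 1000 / n_layers


def weights_vram_gb(file_size_gb, n_layers, ngl):
    """Rough VRAM spent on weights when ngl layers are offloaded."""
    ngl = max(0, min(ngl, n_layers))  # cannot offload more layers than exist
    return per_layer_mb(file_size_gb, n_layers) * ngl / 1000


print(per_layer_mb(4.0, 32))         # 125.0
print(weights_vram_gb(4.0, 32, 24))  # 3.0
```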

Do not forget the KV cache

Weights are only part of the bill. The KV cache also lives on the GPU when you offload, and it grows with context length.

Longer context means a bigger cache. Double the context window and you roughly double the cache size. On a tight card, a long context can push you into OOM even when the weights fit.

So budget VRAM in two buckets. Weights first. Then leave headroom for the KV cache at the context length you actually plan to run.
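You can estimate the cache bucket too. For standard attention, the cache stores one key and one value vector per token, per layer, per KV head. The sketch below assumes 16-bit cache entries and a model shape of 32 layers, 32 KV heads, and a head dimension of 128, which is typical of an older 7B model. Those shape values are assumptions, so read the real ones from your model's metadata. Models with grouped-query attention have fewer KV heads and a much smaller cache.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size for standard attention."""
    # The leading 2 covers keys plus values.
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


# Assumed shape for illustration; check your own model's metadata.
shape = dict(n_layers=32, n_kv_heads=32, head_dim=128)

for n_ctx in (4096, 8192):
    print(n_ctx, kv_cache_bytes(n_ctx=n_ctx, **shape) / GIB)
# 4096 2.0
# 8192 4.0
```

Under these assumptions the cache is 2 GiB at 4096 tokens and 4 GiB at 8192, which is the doubling described above.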

Reading OOM symptoms

When you ask for too many layers, llama.cpp fails at load time with a CUDA out of memory error. It will not silently fall back. It stops.

The fix is to drop ngl by a few and reload. Step down until it loads. If you are right at the edge, shave 2 or 3 layers and try again.

Watch your VRAM with nvidia-smi while the model loads. You want a buffer left over, not a card pinned at 100 percent. Other apps, your desktop, and the KV cache all want a slice.
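If you would rather script the check than eyeball it, nvidia-smi can print memory figures as CSV. This is a minimal sketch that assumes nvidia-smi is on your PATH. The helper names are hypothetical. The values come back in MiB.

```python
import subprocess

# Ask nvidia-smi for used and total memory per GPU, as bare numbers in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_text):
    """Turn lines like '6123, 8192' into a list of (used, total) ints."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory():
    """Run nvidia-smi and return one (used_mib, total_mib) pair per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory(result.stdout)


def headroom_fraction(used_mib, total_mib):
    """Fraction of the card that is still free."""
    return (total_mib - used_mib) / total_mib
```

Call gpu_memory() after the model finishes loading and pass each pair to headroom_fraction. How much buffer counts as enough is a judgment call that depends on what else is using the card.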

A fast tuning loop

You do not need to calculate everything. You can probe.

Start with ngl set to a high number. Many people use 99 to mean "offload everything." If it loads, you are done. The whole model fits.

If it OOMs, step down. On a 32-layer model, try 28, then 24, then 20. Each reload tells you where the ceiling is. Five minutes of trial beats an hour of spreadsheet math.

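Once it loads cleanly, run a real prompt at your target context length, with the context size (-c or --ctx-size) set to that length. llama.cpp sizes the KV cache from the configured context when it starts, so a clean load at a short context tells you little about a long one. If it OOMs at the longer context, at startup or mid-generation, the KV cache pushed you over. Drop a few more layers and leave room.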
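The probing loop can be written down as a small function. This sketch takes a try_load callable that you supply, which should return True when the model loads at a given ngl. That callable is hypothetical. How you implement it (for example, launching llama.cpp and checking the exit status) depends on your setup. The fake loader here only stands in for a real one.

```python
def find_max_ngl(try_load, start=99, n_layers=32, step=4):
    """Step ngl down until try_load(ngl) succeeds. Returns that ngl or None."""
    # First try "offload everything".
    if try_load(start):
        return start
    # Then walk down from just under the full layer count.
    ngl = n_layers - step
    while ngl >= 0:
        if try_load(ngl):
            return ngl
        ngl -= step
    return None


# Stand-in loader: pretends anything above 22 layers runs out of memory.
def fake_loader(ngl, ceiling=22):
    return ngl <= ceiling


print(find_max_ngl(fake_loader))  # 20
```

With a step of 4 the result can sit a few layers below the true ceiling. Once you have a value that loads, you can nudge it up one layer at a time if those last layers matter to you.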

Quick starting points by card

These are rough anchors, not guarantees. Your quant, context, and model size all move the number.

On an 8 GB card, a 7B Q4 model usually offloads fully. A 13B will only fit partially.

On a 12 GB card, 13B models at Q4 fit comfortably and you have room for context.

On 16 GB or more, you can run larger models or push context length hard. A 24 GB card handles most single-GPU local work without much tuning at all.
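To turn a card size into a starting ngl, you can combine the two buckets. This sketch assumes each offloaded layer brings its share of the weights and its share of the KV cache onto the GPU, and it holds back a fixed reserve for everything else. The function name, the 1000 MB reserve, and the 13B figures (7400 MB of weights, 3350 MB of cache, 40 layers) are assumptions for illustration only.

```python
def max_layers_that_fit(vram_mb, weights_mb, kv_cache_mb, n_layers, reserve_mb=1000):
    """Estimate how many layers fit, counting weights plus KV cache per layer."""
    per_layer_mb = (weights_mb + kv_cache_mb) / n_layers
    budget_mb = vram_mb - reserve_mb  # leave room for the desktop and buffers
    if budget_mb <= 0:
        return 0
    return min(n_layers, int(budget_mb // per_layer_mb))


# 8 GB card, 7B Q4 (about 4000 MB), about 2150 MB of cache at 4096 context.
print(max_layers_that_fit(8000, 4000, 2150, 32))  # 32

# Same card, an assumed 13B Q4 at 7400 MB with 3350 MB of cache, 40 layers.
print(max_layers_that_fit(8000, 7400, 3350, 40))  # 26
```

Treat the result as a first guess for the probing loop, not a final answer.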

How quant choice feeds in

Smaller quant means smaller weights means more layers fit. If you cannot offload a model fully, dropping from Q5 to Q4 might get you there. That tradeoff is its own decision, and it pairs directly with this one.
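A quick way to see the effect is to estimate weight size from parameter count and bits per weight. The sketch below uses 4.5 and 5.5 bits per weight as round illustrative figures. Real GGUF quant variants land at different averages, so treat these as assumptions and check the actual file size.

```python
def approx_weights_gb(n_params_billion, bits_per_weight):
    """Rough weight size in GB: parameters times bits per weight, over 8."""
    return n_params_billion * bits_per_weight / 8


# Illustrative bits-per-weight values, not exact figures for any quant.
for label, bpw in (("around Q4", 4.5), ("around Q5", 5.5)):
    print(label, approx_weights_gb(7, bpw), approx_weights_gb(13, bpw))
```

Feed the result into the per-layer estimate from earlier and you can see how many extra layers a smaller quant buys on your card.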

If you are weighing which quant to run, read the companion post: which GGUF quant should you actually pick. Tune ngl and quant together. They share the same VRAM budget.

The instinct underneath all of this

Running models locally is a cost move. Every token you serve on your own GPU is a token you did not pay an API for. Tuning ngl is just squeezing more value out of hardware you already own.

That same instinct, watching the meter and refusing to overspend, is what AgentGuard does for AI agents. It caps token spend, rate limits calls, and stops a runaway loop before it burns your budget. Local inference cuts your fixed cost. AgentGuard caps your variable cost.

If you are running agents and want a hard ceiling on spend, check out AgentGuard.

Originally published on bmdpat.com. I run a one-person AI agent company and write about what actually works.
