llama.cpp n_gpu_layers Explained: What -1, 0, and 99 Actually Do
If you have ever tried running a local LLM with llama.cpp, you have seen this flag:
```shell
--n-gpu-layers 99
```
Or maybe you set it to -1, or left it at 0, and got wildly different performance. This post explains what n_gpu_layers actually controls, how to calculate the right value for your hardware, and what happens when you get it wrong.
Quick answer: --n-gpu-layers -1 offloads every transformer layer to the GPU. It is identical to --n-gpu-layers 999. If your model fits in VRAM, this gives maximum speed. Use 0 for CPU-only. Any value between 1 and the model's layer count gives partial GPU offloading.
What n_gpu_layers Controls
Large language models are built from stacked transformer blocks. A 7B parameter model like Llama 3.1 8B has 32 of these blocks (called "layers"). Each layer contains attention weights and feed-forward weights that consume GPU memory (VRAM) when loaded.
The --n-gpu-layers flag (or -ngl for short) tells llama.cpp how many of those layers to move from system RAM onto your GPU. The more layers on the GPU, the faster inference runs, because GPUs process matrix math orders of magnitude faster than CPUs.
Here is what the common values mean:
| Value | What It Does |
|---|---|
| `0` | All layers stay on CPU. No GPU used at all. |
| `1` to `N` | That many layers load onto GPU. The rest stay on CPU. |
| `99` or `999` | Attempts to load all layers onto GPU. Excess is ignored. |
| `-1` | Special value: offload every layer to GPU. Same effect as `999`. |
Setting -ngl 99 and setting -ngl -1 do the same thing in practice: both tell llama.cpp to put as many layers as possible on the GPU. The difference is cosmetic. Many people use 99 or 999 simply because an explicit large number makes the intent obvious.
How Many Layers Does Your Model Have?
The number of layers depends on the model architecture:
| Model | Parameters | Layers |
|---|---|---|
| Llama 3.1 8B | 8B | 32 |
| Llama 3.1 70B | 70B | 80 |
| Mistral 7B | 7B | 32 |
| Phi-3 Mini | 3.8B | 32 |
| SmolLM2 1.7B | 1.7B | 24 |
| Qwen2 72B | 72B | 80 |
You can check your model's layer count in the GGUF metadata. When you load a model, llama.cpp prints something like llama.block_count = 32 in the startup logs. That is your total layer count.
VRAM Math: How to Pick the Right Number
The formula is straightforward. Take your model file size (the GGUF file on disk) and divide by the number of layers. That gives you a rough per-layer VRAM cost.
Example: Llama 3.1 8B Q4_K_M (4.9 GB file, 32 layers)
- Per layer: ~153 MB
- 16 layers on GPU: ~2.4 GB VRAM
- All 32 layers on GPU: ~4.9 GB VRAM
- Plus overhead (KV cache, context buffer): ~1-2 GB extra
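The arithmetic above is easy to wrap in a small helper. This is a rough estimate only (actual VRAM use varies by backend, quantization, and context settings), and the function name is illustrative:

```python
def layers_that_fit(model_gb, total_layers, vram_gb, overhead_gb=1.5):
    """Estimate how many transformer layers fit in VRAM.

    model_gb:     GGUF file size on disk (GB)
    total_layers: model's block count (e.g. 32 for Llama 3.1 8B)
    vram_gb:      total GPU memory (GB)
    overhead_gb:  reserve for KV cache and context buffers (~1-2 GB)
    """
    per_layer_gb = model_gb / total_layers   # rough per-layer weight cost
    usable_gb = vram_gb - overhead_gb        # VRAM left for layer weights
    fit = int(usable_gb / per_layer_gb)
    return min(max(fit, 0), total_layers)    # clamp to [0, total_layers]

# Llama 3.1 8B Q4_K_M on a 6 GB GPU with a small context (~1 GB overhead)
print(layers_that_fit(4.9, 32, 6.0, overhead_gb=1.0))  # -> 32 (all layers fit)

# Llama 3.1 70B Q4_K_M on an 8 GB GPU
print(layers_that_fit(40.0, 80, 8.0))                  # -> 13
```

The second result matches the 70B example later in this post: with 8 GB of VRAM, around 13 layers fit and the rest spill to system RAM.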
So a 6 GB GPU (like an RTX 2060) can fit all 32 layers of this model with room for a small context window. A 16 GB GPU (like an RTX 5070 Ti) can fit it with a 32K context window and still have headroom.
Example: Llama 3.1 70B Q4_K_M (~40 GB file, 80 layers)
- Per layer: ~500 MB
- 16 GB GPU: fits about 28-30 layers (after context overhead)
- 24 GB GPU (RTX 4090): fits about 42-44 layers
- The rest spills to system RAM
When layers spill to CPU, you get "partial offloading." The GPU handles some layers, the CPU handles others, and data shuttles back and forth over PCIe. This is slower than full GPU offloading but still faster than CPU-only.
Performance: CPU vs Partial vs Full GPU
Real benchmark data from a Ryzen 5900X + RX 7900 XT running SmolLM2 1.7B (source: SteelPh0enix's llama.cpp guide):
| Configuration | Prompt Processing | Text Generation |
|---|---|---|
| CPU only (0 layers) | 165 tok/s | 22 tok/s |
| Full GPU offload | 880 tok/s | 90 tok/s |
That is a 5.3x speedup for prompt processing and a 4x speedup for text generation. The difference between 22 tokens per second and 90 tokens per second is the difference between "painfully slow" and "usable in production."
For larger models on more powerful hardware, NVIDIA reports the RTX 4090 hitting roughly 150 tokens per second on Llama 3 8B int4 with full GPU offloading (100-token input, 100-token output sequences).
Partial offloading falls somewhere in between. If you can only fit 20 out of 32 layers on GPU, expect roughly 60-70% of full GPU performance. The exact number depends on which layers are offloaded and how fast your PCIe bus moves data.
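As a rough mental model (an assumption, not a measured curve), you can interpolate linearly between the CPU-only and full-GPU numbers by the fraction of layers offloaded:

```python
def estimated_tok_s(cpu_tok_s, gpu_tok_s, layers_on_gpu, total_layers):
    """Linear interpolation between CPU-only and full-GPU generation speed."""
    frac = layers_on_gpu / total_layers
    return cpu_tok_s + frac * (gpu_tok_s - cpu_tok_s)

# SmolLM2 1.7B figures from the benchmark above: 22 tok/s CPU, 90 tok/s GPU,
# 24 layers total. Offloading half the layers:
print(round(estimated_tok_s(22, 90, 12, 24)))  # -> 56
```

Real systems deviate from this line because PCIe transfer costs are not evenly distributed, but it is a reasonable first approximation when deciding whether partial offloading is worth it.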
Common Problems and Fixes
"I set n_gpu_layers to 99 but it is still slow"
Check that your llama.cpp build actually has GPU support compiled in. If you built from source without CUDA, Vulkan, or Metal flags, the flag gets silently ignored. Run llama-cli --help and look for GPU-related options. If they are missing, rebuild with the right backend.
On Linux with NVIDIA:
```shell
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release
```
On Mac with Metal (M-series):
```shell
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release
```
"Out of memory" errors
Lower the value. If -ngl 32 crashes, try -ngl 24 or -ngl 16. You can also reduce the context size with --ctx-size 2048 to free up VRAM for more layers. Or use a more aggressively quantized model (Q3_K_M instead of Q4_K_M).
"n_gpu_layers -1 does not work on AMD"
Some AMD GPU users on older ROCm versions have reported that -ngl -1 does not trigger offloading. The workaround: use a large explicit number like --n-gpu-layers 999 instead.
Shared VRAM on Windows (integrated graphics)
On Windows systems with shared VRAM, llama.cpp might report using your GPU but actually consume ~20 GB of system RAM with minimal speed benefit. This is because shared memory goes through the CPU memory bus anyway. If you have a dedicated GPU, make sure llama.cpp is targeting it, not the integrated graphics.
Practical Recommendations
If your model fits entirely in VRAM:
Use -ngl -1 or -ngl 999. Full offloading gives the best performance.
If your model is too large for your GPU:
Count backward. Check the GGUF file size, subtract 1-2 GB for context overhead, and divide the remaining VRAM by per-layer size. That is your layer count.
For example, with 8 GB VRAM and a 4.9 GB model:
- Available after overhead: ~6.5 GB
- 4.9 GB / 32 layers = 153 MB per layer
- 6.5 GB / 153 MB = ~42 layers (more than the model has, so use -1)
With 8 GB VRAM and a 40 GB model:
- Available after overhead: ~6.5 GB
- 40 GB / 80 layers = 500 MB per layer
- 6.5 GB / 500 MB = ~13 layers
- Use -ngl 13
If you have no dedicated GPU:
Leave it at 0 and accept CPU-only speeds. On a modern CPU with AVX2, expect 1-3 tokens per second for 7B models. It works, but it is not fast enough for real-time applications.
Running It
Here is a concrete command to serve Llama 3.1 8B with full GPU offloading as an OpenAI-compatible API:
```shell
./llama-server \
  --model models/llama-3.1-8b-q4_k_m.gguf \
  --n-gpu-layers -1 \
  --ctx-size 8192 \
  --host 0.0.0.0 \
  --port 8080
```
This loads all layers onto your GPU, sets an 8K context window, and exposes an API endpoint you can hit with any OpenAI-compatible client. If you run out of VRAM, drop -ngl to something like 24 and reduce --ctx-size to 4096.
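Once the server is up, any OpenAI-compatible client can talk to it. Here is a minimal sketch using only Python's standard library; the URL assumes the host and port above, and the model name is informational for a single-model llama-server:

```python
import json
import urllib.request

# Chat-completions request in the OpenAI wire format
payload = {
    "model": "llama-3.1-8b",  # informational for a single-model server
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64,
}

def chat(base_url="http://localhost:8080"):
    """POST the payload to llama-server's OpenAI-compatible endpoint."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# chat() returns the assistant's reply once the server above is running
```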
For a deeper look at running this in production (including hardware selection, benchmarks on an RTX 5070 Ti, and when local inference actually makes financial sense), see the full guide: Local LLM Inference on Consumer GPUs.
Can llama.cpp Use Two GPUs?
Yes. llama.cpp supports multi-GPU inference via CUDA using the --tensor-split flag. The syntax is a comma-separated list of ratios corresponding to each GPU:
```shell
--n-gpu-layers 99 --tensor-split 1,1
```
This splits layers evenly between two GPUs. If your GPUs have different VRAM sizes, adjust the ratio — e.g. --tensor-split 2,1 for a setup where GPU 0 has twice the VRAM of GPU 1. NVLink is not required; PCIe-connected GPUs work, with some inter-GPU data transfer overhead.
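Because the ratios are relative, you can feed each card's VRAM size straight into the flag. A trivial helper (name illustrative) makes this explicit:

```python
def tensor_split(vram_gb_per_gpu):
    """Build a --tensor-split argument proportional to each GPU's VRAM."""
    return ",".join(str(v) for v in vram_gb_per_gpu)

# 24 GB + 12 GB cards: GPU 0 gets roughly twice as many layers as GPU 1
print("--tensor-split " + tensor_split([24, 12]))  # --tensor-split 24,12
```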
Multi-GPU is most useful for models too large to fit on a single card, like 70B parameter models.
Which Is Faster: vLLM or llama.cpp?
For high-concurrency production deployments, vLLM is faster. It uses PagedAttention and continuous batching designed for serving many users simultaneously. At 20+ concurrent requests, vLLM wins.
For a single-user local setup or small-team server (under 5 concurrent users), llama.cpp is competitive and much simpler to run. It has lower memory overhead, works on consumer hardware including AMD and Apple Silicon, and supports quantized formats vLLM does not.
Rule of thumb: llama.cpp for personal or small-team inference. vLLM for production APIs serving real traffic.
Is llama.cpp Multithreaded?
Yes. llama.cpp uses multiple CPU threads via --threads (or -t). The default is typically half your logical core count. For most setups, setting threads to your physical core count gives the best CPU performance:
```shell
--threads 8   # for an 8-core CPU
```
If you have GPU offloading enabled with -ngl, the GPU handles offloaded layers while CPU threads process the rest. With full GPU offloading (-ngl -1), the --threads flag has minimal effect since almost all computation is on the GPU.
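To pick a starting value programmatically, Python's os module reports logical cores; on SMT/Hyper-Threading systems, physical cores are typically half that. This halving is a heuristic, not a guarantee:

```python
import os

def suggested_threads():
    """Heuristic starting point for --threads: logical cores / 2 (~physical cores)."""
    logical = os.cpu_count() or 1
    return max(logical // 2, 1)

print(f"--threads {suggested_threads()}")
```

Benchmark a value or two on either side of this number; oversubscribing threads usually hurts more than undersubscribing.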
What Does "cpp" Mean in llama.cpp?
It is the C++ file extension. llama.cpp is an implementation of Llama model inference written in C++, originally by Georgi Gerganov. The .cpp extension became part of the project name.
The goal was to run LLM inference on consumer hardware without Python overhead or heavy framework dependencies. The inference stack compiles to a single portable binary — no virtual environment, no PyTorch, no CUDA toolkit required for CPU inference. That portability is why llama.cpp runs on everything from MacBooks to Raspberry Pis to enterprise GPUs.
FAQ
What does n_gpu_layers = -1 mean in llama.cpp?
It tells llama.cpp to offload all transformer layers to the GPU. It is equivalent to setting the value to a very large number like 999. If your model fits in VRAM, this gives maximum performance.
What happens if I set n_gpu_layers higher than the model's layer count?
Nothing bad. llama.cpp caps the value at the actual layer count. Setting 999 on a 32-layer model just loads 32 layers. The extra is ignored.
How much VRAM do I need for full offloading?
Roughly the size of the GGUF file on disk, plus 1-2 GB for context and overhead. A 4.9 GB Q4_K_M model needs about 6-7 GB total. A 40 GB Q4_K_M model needs about 42-44 GB total.
Is partial GPU offloading worth it?
Yes. Even offloading 50% of layers to GPU gives a significant speedup over CPU-only. The relationship is roughly linear: half the layers on GPU gives roughly half the GPU-only speedup.
Does n_gpu_layers affect model quality or accuracy?
No. It only changes where computation happens (CPU vs GPU). The math is identical either way. Your outputs will be the same regardless of the setting.