5 llama.cpp Settings That Turn 8GB VRAM From Sluggish to 5x Faster — Every Option Benchmarked
llama.cpp has over 50 launch options. Most of them are fine at their defaults. But on 8GB VRAM, getting just 5 of them wrong can cost you most of your inference speed.
What follows is a settings guide based on actual measurements on an RTX 4060 8GB (GDDR6 272 GB/s).
The Most Important: -ngl (GPU Layer Count)
# -ngl: How many model layers to offload to GPU
ngl_config = {
"meaning": "Number of Transformer layers loaded into GPU VRAM",
"default": 0, # All layers on CPU = slowest possible
"max": "Total layers in the model (Qwen2.5-7B = 28, Llama-3-8B = 32, Qwen2.5-32B = 64)",
"999": "All layers on GPU (fastest, if it fits in VRAM)",
}
# Optimal values for 8GB VRAM
ngl_optimal_8gb = {
"Qwen2.5-7B Q4_K_M (4.7GB)": {
"-ngl": 999, # Full GPU offload possible
"VRAM usage": "~5.4 GB (weights 4.7 + KV 0.44 + overhead 0.3 at 8K context)",
"speed": "~32 t/s",
},
"Mistral-Nemo-12B Q4_K_M (7.2GB)": {
"-ngl": 999, # Barely fits entirely on GPU
"VRAM usage": "~7.5 GB",
"speed": "~20 t/s",
"warning": "KV cache may cause OOM. Use -c 2048",
},
"Qwen2.5-32B Q4_K_M (18.5GB)": {
"-ngl": 25, # 25 of 64 layers on GPU
"VRAM usage": "~7.4 GB",
"speed": "~10.8 t/s",
"remaining 39 layers": "CPU (via DDR5)",
},
}
Changing -ngl by just 1 shifts speed by a few percent. The optimal value is the one that squeezes VRAM usage right to the limit.
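Before binary searching, you can get a decent starting guess from the GGUF file size alone. A rough sketch under the assumptions measured above (the 7.5 GB usable budget, the 0.3 GB overhead, and the estimate_ngl name are mine; layers are treated as equal-sized, which they are not exactly):
# Rough starting guess for -ngl (a sketch, not something llama.cpp reports)
def estimate_ngl(model_file_gb, total_layers, gpu_kv_gb, usable_vram_gb=7.5):
    per_layer_gb = model_file_gb / total_layers        # approximation: layers are not all equal
    budget = usable_vram_gb - gpu_kv_gb - 0.3          # leave room for KV cache + runtime overhead
    return max(0, min(total_layers, int(budget / per_layer_gb)))

# Qwen2.5-32B Q4_K_M (18.5 GB, 64 layers), 4K context, KV q8_0 on GPU (~0.2 GB)
print(estimate_ngl(18.5, 64, 0.2))   # -> 24, close to the measured optimum of 25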
# Finding the optimal -ngl (binary search)
def find_optimal_ngl(total_layers, fits):
    """
    Binary search for the highest -ngl that loads without OOM.
    `fits(ngl)` launches llama.cpp with that -ngl and returns True if it loads.
    1. Try full offload first (-ngl 999) -> if it fits, you're done
    2. Otherwise bisect between 0 and total_layers
    3. The sweet spot is where VRAM sits at 7.0-7.5 GB
    """
    # On RTX 4060 8GB, ~0.5 GB goes to CUDA context + framework overhead
    # The remaining ~7.5 GB is available for model layers and KV cache
    if fits(total_layers):
        return total_layers          # full offload, equivalent to -ngl 999
    good, bad = 0, total_layers      # invariant: `good` loads, `bad` OOMs
    while bad - good > 1:
        mid = (good + bad) // 2
        if fits(mid):
            good = mid
        else:
            bad = mid
    return good
# Tips for tuning:
# Monitor VRAM with nvidia-smi while adjusting -ngl
# 7.0-7.5 GB usage is the sweet spot. Above 7.8 GB risks OOM during inference
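To automate the search, the `fits()` probe can simply try to load the model and generate one token. A rough sketch, assuming llama-cli is on your PATH and aborts with a non-zero exit code when CUDA allocation fails (the timeout is illustrative):
import subprocess

def fits(ngl, model="model.gguf"):
    # Load the model with the given -ngl and generate a single token.
    # If the GPU runs out of memory, llama-cli exits with a non-zero code.
    result = subprocess.run(
        ["llama-cli", "-m", model, "-ngl", str(ngl), "-n", "1", "-p", "hi"],
        capture_output=True, timeout=300,
    )
    return result.returncode == 0

# print(find_optimal_ngl(64, fits))   # e.g. Qwen2.5-32B -> ~25 on 8 GB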
-c (Context Length)
# -c: Maximum context length (in tokens)
context_config = {
"meaning": "Upper limit of tokens the model can reference during inference",
"default": 4096, # llama.cpp default (as of b8233)
"impact": "Directly determines KV cache VRAM consumption",
}
# KV cache VRAM consumption calculation
def kv_cache_vram(context_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """
    KV cache = 2 × n_layers × n_kv_heads × head_dim × context_len × dtype_bytes
    (K cache + V cache = 2x; with GQA, only the KV heads count, not the query heads)
    """
    bytes_total = 2 * n_layers * n_kv_heads * head_dim * context_len * dtype_bytes
    return bytes_total / (1024**3)  # GB
# Qwen2.5-7B (28 layers, 4 KV heads (GQA), 128 head_dim)
kv_7b = {
"4096 tokens (FP16)": f"{kv_cache_vram(4096, 28, 4, 128, 2):.2f} GB", # 0.22 GB
"8192 tokens (FP16)": f"{kv_cache_vram(8192, 28, 4, 128, 2):.2f} GB", # 0.44 GB
"32768 tokens (FP16)": f"{kv_cache_vram(32768, 28, 4, 128, 2):.2f} GB", # 1.75 GB
"131072 tokens (FP16)": f"{kv_cache_vram(131072, 28, 4, 128, 2):.2f} GB", # 7.00 GB
}
# Qwen2.5-32B (64 layers, 8 KV heads, 128 head_dim)
kv_32b = {
"4096 tokens (FP16)": f"{kv_cache_vram(4096, 64, 8, 128, 2):.2f} GB", # 1.00 GB
"8192 tokens (FP16)": f"{kv_cache_vram(8192, 64, 8, 128, 2):.2f} GB", # 2.00 GB
"32768 tokens (FP16)": f"{kv_cache_vram(32768, 64, 8, 128, 2):.2f} GB", # 8.00 GB
}
# Note: with partial offload (-ngl), KV cache is also split across CPU/GPU per layer
# -ngl 25 means GPU holds KV for 25/64 layers only
# Recommendations for 8GB VRAM:
# 7B model: -c 8192 (KV 0.44GB, safe), -c 32768 (KV 1.75GB, use flash-attn)
# 32B model (partial offload -ngl 25): -c 4096 (GPU KV ~0.39GB), anything higher requires KV quantization
Doubling the context length doubles the KV cache VRAM. On 8GB, your -c setting directly determines what model size you can load.
--cache-type-k / --cache-type-v (KV Cache Quantization)
# KV cache quantization options
kv_quant_options = {
"f16": "Default. FP16 (2 bytes/element)",
"q8_0": "8-bit quantization (1 byte/element) -> VRAM halved",
"q4_0": "4-bit quantization (0.5 bytes/element) -> VRAM quartered",
}
# Recommended combinations
kv_quant_recommendations = {
"Quality first": {
"K": "f16",
"V": "f16",
"VRAM": "1x (baseline)",
"quality loss": "None",
},
"Balanced (recommended)": {
"K": "q8_0",
"V": "q8_0",
"VRAM": "0.5x",
"quality loss": "Negligible for general tasks",
},
"Capacity first": {
"K": "q4_0",
"V": "q8_0",
"VRAM": "0.375x",
"quality loss": "Degradation on math/reasoning tasks",
"note": "V cache is more sensitive to quantization than K cache",
},
"Maximum compression": {
"K": "q4_0",
"V": "q4_0",
"VRAM": "0.25x",
"quality loss": "Significant. Especially bad on long contexts",
},
}
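These multipliers fold directly into the earlier kv_cache_vram() formula. A small sketch (kv_cache_gb is my name; the bytes-per-element values are the same approximations used in the table above and ignore the small per-block scale overhead of q8_0/q4_0):
# Approximate bytes per KV element for each cache type
KV_BYTES = {"f16": 2.0, "q8_0": 1.0, "q4_0": 0.5}

def kv_cache_gb(context_len, n_layers, n_kv_heads, head_dim, k_type="f16", v_type="f16"):
    per_elem = KV_BYTES[k_type] + KV_BYTES[v_type]   # K half + V half
    return n_layers * n_kv_heads * head_dim * context_len * per_elem / (1024**3)

# Qwen2.5-32B (64 layers, 8 KV heads, 128 head_dim), 8K context, K/V both q8_0:
print(f"{kv_cache_gb(8192, 64, 8, 128, 'q8_0', 'q8_0'):.2f} GB")   # ~1.00 GB (vs 2.00 GB at f16)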
# Example: Qwen2.5-32B + -ngl 25 + 8K context on 8GB VRAM
# -ngl 25 -> 25/64 layers on GPU, KV also splits 25/64 on GPU
example_32b_8k = {
    "Total KV (f16, 8K)": "2.00 GB (all 64 layers)",
    "GPU KV (f16, -ngl 25)": "2.00 * 25/64 = 0.78 GB",
    "GPU weights (-ngl 25)": "18.5 * 25/64 = ~7.2 GB",
    "GPU total (f16 KV)": "weights 7.2GB + KV 0.78GB + overhead 0.3GB = ~8.3GB -> OOM",
    "With KV q8_0": "0.78 * 0.5 = 0.39 GB -> 7.2 + 0.39 + 0.3 = ~7.9GB -> fits, barely",
    "conclusion": "32B at 8K context fits on 8GB only with KV quantization (q8_0)",
    "32K context": "GPU KV (f16) = 8.0 * 25/64 = 3.13 GB -> impossible. Even q4_0 (~0.78GB) lands back at ~8.3GB -> drop -ngl by a few layers",
}
Launch command (note: llama.cpp requires --flash-attn to quantize the V cache):
llama-server -m model.gguf -ngl 25 -c 8192 --flash-attn --cache-type-k q8_0 --cache-type-v q8_0
--flash-attn (Flash Attention)
# Flash Attention
flash_attn_config = {
"meaning": "Memory-efficient attention computation algorithm",
"effects": {
"VRAM reduction": "Eliminates intermediate attention buffers -> saves hundreds of MB",
"speed boost": "Faster on long contexts (~10% at 32K, scales with context length)",
"short_context": "Minimal effect below 4K tokens",
},
"requirements": "CUDA backend + compatible GPU (RTX 20xx or newer)",
"compatibility": "Works alongside KV cache quantization",
}
# Benchmarks on 8GB
flash_attn_benchmark = {
"Qwen2.5-7B Q4_K_M, -c 8192": {
"without_flash_attn": "31.8 t/s, VRAM 5.5 GB",
"with_flash_attn": "32.1 t/s, VRAM 5.2 GB",
"delta": "speed +1%, VRAM -0.3 GB",
},
"Qwen2.5-7B Q4_K_M, -c 32768": {
"without_flash_attn": "28.5 t/s, VRAM 6.3 GB",
"with_flash_attn": "31.5 t/s, VRAM 5.8 GB",
"delta": "speed +10.5%, VRAM -0.5 GB",
},
}
# Verdict: Always enable it. There is no downside.
--flash-attn has zero downsides. Always include it.
-b (Batch Size) and -t (Thread Count)
# Batch size
batch_config = {
"-b (batch size)": {
"meaning": "Number of tokens processed at once during prompt evaluation",
"default": 2048,
"8GB recommendation": 512,
"reason": "Large batches cause VRAM spikes during prompt eval, risking OOM",
},
"-ub (micro batch)": {
"meaning": "Further subdivides batches for processing",
"default": 512,
"usually": "No need to change",
},
}
# Thread count
thread_config = {
"-t (threads)": {
"meaning": "Number of threads for CPU computation",
"default": "All cores",
"recommendation": "Physical core count (no HT)",
"example_i7_13700H": "-t 6 (6 P-cores)",
"reason": "HT logical threads just compete for memory bandwidth. Physical core count is optimal",
},
}
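A quick way to check your core counts before picking -t. A sketch assuming psutil is installed; note that on hybrid Intel CPUs psutil lumps P and E cores together, so you still cap -t at the P-core count manually:
import psutil

physical = psutil.cpu_count(logical=False)   # physical cores (P + E on hybrid CPUs)
logical = psutil.cpu_count(logical=True)     # includes Hyper-Threading siblings
print(f"physical={physical}, logical={logical}")
# On an i7-13700H this prints physical=14, logical=20 -- per the benchmark below,
# start from the P-core count (-t 6) rather than either of these numbers.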
# Benchmark: thread count impact (Qwen2.5-32B Q4_K_M, -ngl 25)
thread_benchmark = {
"-t 6 (P-core count)": "10.8 t/s",
"-t 8 (P+E cores)": "10.5 t/s",
"-t 14 (all physical cores P+E)": "9.8 t/s",
"-t 20 (all threads incl. HT)": "9.2 t/s",
"conclusion": "More threads = slower. Physical P-core count is optimal",
}
The intuition that more threads means faster inference is wrong. HT logical threads share L1/L2 cache and memory bandwidth, which turns into pure overhead for LLM inference.
Server Options (llama-server)
# llama-server: set up an OpenAI-compatible API
server_config = {
"basic": "llama-server -m model.gguf -ngl 999 -c 4096 --host 0.0.0.0 --port 8080",
"recommended extras": {
"--flash-attn": "Memory efficiency (always ON)",
"--metrics": "Expose Prometheus-format metrics",
"--parallel 1": "Concurrent request count (keep at 1 for 8GB)",
"--cont-batching": "Continuous batching (useful when --parallel >= 2)",
},
}
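Once llama-server is up, any OpenAI-compatible client can talk to it. A minimal sketch using the openai Python package (the model name is arbitrary, since llama-server serves whichever single model it was launched with):
from openai import OpenAI

# llama-server exposes an OpenAI-compatible API under /v1
client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="local",   # ignored in practice; one model per server
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)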
# Function calling setup
function_calling_config = {
"--chat-template": "Auto-detected (uses template embedded in GGUF)",
"note": "The tools parameter for function calling depends on the model's chat template",
"recommended models": [
"Qwen3.5-4B-Instruct (3.4GB, function calling 97.5%)",
"Qwen2.5-7B-Instruct (4.7GB, function calling 95%+)",
],
}
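Function calling goes through the same endpoint with a tools array; whether the model actually emits tool calls depends on its chat template, as noted above, and depending on your build you may need to launch llama-server with --jinja for tool-call parsing. A hedged sketch (the get_weather tool is made up for illustration):
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",   # hypothetical tool, for illustration only
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "What's the weather in Osaka?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)   # populated if the model chose to call the tool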
# Enforce structured output with GBNF grammar
grammar_config = {
"--grammar-file": "Force output format via GBNF grammar file",
"use case": "Guarantees valid JSON output. Syntax errors drop to 0%",
"caveat": "Inference can slow down when the model tries to generate output that doesn't match the grammar",
"alternative": "--json-schema to specify JSON Schema directly (llama.cpp b7000+)",
}
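The same constraint can also be applied per request instead of at launch: llama-server's native /completion endpoint accepts grammar and json_schema fields in the request body. A rough sketch with requests (the schema is made up; check the server README of your build for exact field names):
import requests

schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}, "confidence": {"type": "number"}},
    "required": ["answer"],
}

# Native llama-server endpoint (not the OpenAI-compatible /v1 one)
resp = requests.post("http://127.0.0.1:8080/completion", json={
    "prompt": "Reply in JSON. Is the sky blue?",
    "n_predict": 128,
    "json_schema": schema,   # constrains decoding to outputs matching the schema
})
print(resp.json()["content"])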
Configuration Templates
Template 1: 7B Model, Chat Use (Maximum Speed)
llama-server \
-m qwen2.5-7b-instruct-q4_k_m.gguf \
-ngl 999 \
-c 8192 \
--flash-attn \
-t 6 \
--host 127.0.0.1 --port 8080
# Expected speed: ~32 t/s, VRAM: ~5.2 GB
Template 2: 32B Model, Quality Focus (Partial Offload)
llama-server \
-m qwen2.5-32b-instruct-q4_k_m.gguf \
-ngl 25 \
-c 4096 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--flash-attn \
-t 6 \
-b 512 \
--host 127.0.0.1 --port 8080
# Expected speed: ~10.8 t/s, VRAM: ~7.4 GB
Template 3: 7B Model, Long Context (32K)
llama-server \
-m qwen2.5-7b-instruct-q4_k_m.gguf \
-ngl 999 \
-c 32768 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--flash-attn \
-t 6 \
-b 512 \
--host 127.0.0.1 --port 8080
# Expected speed: ~31 t/s, VRAM: ~5.7 GB
Template 4: 4B Model, Function Calling (Maximum Reliability)
llama-server \
-m qwen3.5-4b-instruct-q4_k_m.gguf \
-ngl 999 \
-c 4096 \
--flash-attn \
-t 6 \
--host 127.0.0.1 --port 8080
# Expected speed: ~50 t/s, VRAM: ~3.8 GB
# Function calling accuracy: 97.5%
Common Mistakes and Fixes
common_mistakes = {
"-ngl 0 (not using GPU)": {
"symptom": "Inference speed stuck at 3-5 t/s",
"cause": "All layers running on CPU. DDR5 ~50 GB/s is the bottleneck",
"fix": "Try -ngl 999. If OOM, decrease",
},
"-c set too high": {
"symptom": "OOM immediately after inference starts",
"cause": "KV cache eating all VRAM",
"fix": "Lower to -c 4096, or add --cache-type-k q8_0",
},
"-t set too high": {
"symptom": "CPU at 100% but inference is slow",
"cause": "HT logical threads fighting over cache and memory bandwidth",
"fix": "Set -t to physical core count",
},
"Using --mlock": {
"symptom": "Memory error on startup",
"cause": "Locks entire model in RAM -> physical memory exhausted",
"fix": "Remove --mlock (especially unnecessary on Windows)",
},
"Batch size too large": {
"symptom": "OOM when feeding long prompts",
"cause": "VRAM spike during prompt evaluation",
"fix": "Lower to -b 512",
},
}
Summary: Speed Impact by Setting
Setting Change                          Speed Impact   VRAM Impact
──────────────────────────────────────────────────────────────────
-ngl 0 -> 999 (full GPU)                +5-10x         +4-7 GB
-ngl fine-tuning (±5)                   +10-20%        ±0.5 GB
--flash-attn enabled                    +1-10%         -0.3 GB
--cache-type q8_0                       ±0%            -50% (KV cache)
-t all threads -> physical cores        +5-15%         ±0
-c 32K -> 4K                            +5%            -0.7 GB
-b 2048 -> 512                          ±0%*           -0.2 GB**
* No effect on generation speed (only prompt eval time)
** Suppresses temporary VRAM spikes during prompt eval
The biggest lever is -ngl. Next is -t. Everything else is fine-tuning. On 8GB VRAM, the core strategy is: maximize -ngl, then use -c and KV cache quantization to claw back enough VRAM to make it fit.
References
- llama.cpp — github.com/ggerganov/llama.cpp
- llama.cpp Server documentation — github.com/ggerganov/llama.cpp/tree/master/examples/server
- GGUF format specification — github.com/ggerganov/ggml/blob/master/docs/gguf.md
- Flash Attention — "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning" (2023) arXiv:2307.08691