DEV Community

plasmon

5 llama.cpp Settings That Turn 8GB VRAM From Sluggish to 5x Faster — Every Option Benchmarked

llama.cpp has over 50 launch options. Most of them are fine at their defaults. But on 8GB VRAM, misconfiguring just 5 of them will cut your inference speed in half.

What follows is a settings guide based on actual measurements on an RTX 4060 8GB (GDDR6 272 GB/s).


The Most Important: -ngl (GPU Layer Count)

# -ngl: How many model layers to offload to GPU
ngl_config = {
    "meaning": "Number of Transformer layers loaded into GPU VRAM",
    "default": 0,  # All layers on CPU = slowest possible
    "max": "Total layers in the model (Qwen2.5-7B = 28, Llama-3-8B = 32, Qwen2.5-32B = 64)",
    "999": "All layers on GPU (fastest, if it fits in VRAM)",
}

# Optimal values for 8GB VRAM
ngl_optimal_8gb = {
    "Qwen2.5-7B Q4_K_M (4.7GB)": {
        "-ngl": 999,  # Full GPU offload possible
        "VRAM usage": "~5.4 GB (weights 4.7 + KV 0.44 + overhead 0.3 at 8K context)",
        "speed": "~32 t/s",
    },
    "Mistral-Nemo-12B Q4_K_M (7.2GB)": {
        "-ngl": 999,  # Barely fits entirely on GPU
        "VRAM usage": "~7.5 GB",
        "speed": "~20 t/s",
        "warning": "KV cache may cause OOM. Use -c 2048",
    },
    "Qwen2.5-32B Q4_K_M (18.5GB)": {
        "-ngl": 25,  # 25 of 64 layers on GPU
        "VRAM usage": "~7.4 GB",
        "speed": "~10.8 t/s",
        "remaining 39 layers": "CPU (via DDR5)",
    },
}

Changing -ngl by just 1 shifts speed by a few percent. The optimal value is the one that squeezes VRAM usage right to the limit.

# Finding the optimal -ngl (binary search)
def find_optimal_ngl(total_layers, launches_ok):
    """
    launches_ok(ngl) is a probe you supply: launch llama.cpp with that
    -ngl and return True if it runs without OOM.
    1. Try -ngl 999 -> if it fits, full offload wins
    2. Otherwise binary-search between 0 and total_layers
    3. The sweet spot is where VRAM sits at 7.0-7.5 GB
    """
    # On RTX 4060 8GB, ~0.5 GB goes to CUDA context + framework overhead
    # The remaining 7.5 GB is available for model layers
    if launches_ok(999):
        return total_layers  # full offload fits
    lo, hi = 0, total_layers - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if launches_ok(mid):
            lo = mid          # no OOM -> try more layers
        else:
            hi = mid - 1      # OOM -> back off
    return lo

# Tips for tuning:
# Monitor VRAM with nvidia-smi while adjusting -ngl
# 7.0-7.5 GB usage is the sweet spot. Above 7.8 GB risks OOM during inference
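The nvidia-smi monitoring tip can be scripted so you can watch VRAM while bisecting -ngl. A minimal sketch (`vram_used_mb` is a hypothetical helper; it assumes the NVIDIA driver's nvidia-smi is on PATH):

```python
import subprocess

def vram_used_mb(out=None):
    """Return VRAM usage in MB, one entry per GPU.
    If out is None, invoke nvidia-smi; otherwise parse the given
    string -- handy for testing without a GPU present."""
    if out is None:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"],
            text=True,
        )
    return [int(line) for line in out.strip().splitlines()]
```

Run it in a loop (or just `watch -n 1 nvidia-smi`) while stepping -ngl up and down.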

-c (Context Length)

# -c: Maximum context length (in tokens)
context_config = {
    "meaning": "Upper limit of tokens the model can reference during inference",
    "default": 4096,  # llama.cpp default (as of b8233)
    "impact": "Directly determines KV cache VRAM consumption",
}

# KV cache VRAM consumption calculation
def kv_cache_vram(context_len, n_layers, n_heads, head_dim, dtype_bytes=2):
    """
    KV cache = 2 × n_layers × n_heads × head_dim × context_len × dtype_bytes
    (K cache + V cache = 2x)
    """
    bytes_total = 2 * n_layers * n_heads * head_dim * context_len * dtype_bytes
    return bytes_total / (1024**3)  # GB

# Qwen2.5-7B (28 layers, 4 KV heads (GQA), 128 head_dim)
kv_7b = {
    "4096 tokens (FP16)":  f"{kv_cache_vram(4096, 28, 4, 128, 2):.2f} GB",   # 0.22 GB
    "8192 tokens (FP16)":  f"{kv_cache_vram(8192, 28, 4, 128, 2):.2f} GB",   # 0.44 GB
    "32768 tokens (FP16)": f"{kv_cache_vram(32768, 28, 4, 128, 2):.2f} GB",  # 1.75 GB
    "131072 tokens (FP16)": f"{kv_cache_vram(131072, 28, 4, 128, 2):.2f} GB", # 7.00 GB
}

# Qwen2.5-32B (64 layers, 8 KV heads, 128 head_dim)
kv_32b = {
    "4096 tokens (FP16)":  f"{kv_cache_vram(4096, 64, 8, 128, 2):.2f} GB",   # 1.00 GB
    "8192 tokens (FP16)":  f"{kv_cache_vram(8192, 64, 8, 128, 2):.2f} GB",   # 2.00 GB
    "32768 tokens (FP16)": f"{kv_cache_vram(32768, 64, 8, 128, 2):.2f} GB",  # 8.00 GB
}

# Note: with partial offload (-ngl), KV cache is also split across CPU/GPU per layer
# -ngl 25 means GPU holds KV for 25/64 layers only

# Recommendations for 8GB VRAM:
# 7B model: -c 8192 (KV 0.44GB, safe), -c 32768 (KV 1.75GB, use flash-attn)
# 32B model (partial offload -ngl 25): -c 4096 (GPU KV ~0.39GB), anything higher requires KV quantization

Doubling the context length doubles the KV cache VRAM. On 8GB, your -c setting directly determines what model size you can load.


--cache-type-k / --cache-type-v (KV Cache Quantization)

# KV cache quantization options
kv_quant_options = {
    "f16": "Default. FP16 (2 bytes/element)",
    "q8_0": "8-bit quantization (1 byte/element) -> VRAM halved",
    "q4_0": "4-bit quantization (0.5 bytes/element) -> VRAM quartered",
}

# Recommended combinations
kv_quant_recommendations = {
    "Quality first": {
        "K": "f16",
        "V": "f16",
        "VRAM": "1x (baseline)",
        "quality loss": "None",
    },
    "Balanced (recommended)": {
        "K": "q8_0",
        "V": "q8_0",
        "VRAM": "0.5x",
        "quality loss": "Negligible for general tasks",
    },
    "Capacity first": {
        "K": "q4_0",
        "V": "q8_0",
        "VRAM": "0.375x",
        "quality loss": "Degradation on math/reasoning tasks",
        "note": "V cache is more sensitive to quantization than K cache",
    },
    "Maximum compression": {
        "K": "q4_0",
        "V": "q4_0",
        "VRAM": "0.25x",
        "quality loss": "Significant. Especially bad on long contexts",
    },
}

# Example: Qwen2.5-32B + -ngl 25 + 8K context on 8GB VRAM
# -ngl 25 -> 25/64 layers on GPU, KV also splits 25/64 on GPU
example_32b_8k = {
    "Total KV (f16, 8K)": "2.00 GB (all 64 layers)",
    "GPU KV (f16, -ngl 25)": "2.00 * 25/64 = 0.78 GB",
    "GPU weights (-ngl 25)": "18.5 * 25/64 = 7.2 GB",
    "GPU total (f16 KV)": "weights 7.2GB + KV 0.78GB + overhead 0.3GB = 8.3GB -> OOM",
    "With KV q8_0": "0.78 * 0.5 = 0.39 GB -> 7.2 + 0.39 + 0.3 = 7.9GB -> fits (barely)",
    "conclusion": "32B at 8K context fits on 8GB with KV quantization (q8_0)",
    "32K context": "GPU KV (f16) = 8.0 * 25/64 = 3.13 GB -> impossible. q4_0 = 0.78GB -> 8.3GB -> still OOM. Lower -ngl or -c",
}
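This kind of arithmetic can be wrapped in a quick fit-check. A rough sketch (the function name, the 0.3 GB overhead figure, and the weights slice 18.5 × 25/64 ≈ 7.2 GB are assumptions based on the numbers above):

```python
def fits_in_vram(weights_gb, kv_gb, overhead_gb=0.3, vram_gb=8.0):
    """Rough budget check: GPU-resident weights + KV cache + runtime
    overhead must stay within total VRAM."""
    used_gb = weights_gb + kv_gb + overhead_gb
    return used_gb <= vram_gb

# 32B, -ngl 25, 8K context (weights slice ~7.2 GB on GPU):
fits_in_vram(7.2, 0.78)  # f16 KV  -> False (8.28 GB, OOM)
fits_in_vram(7.2, 0.39)  # q8_0 KV -> True (7.89 GB, barely)
```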

Launch command:

llama-server -m model.gguf -ngl 25 -c 8192 --cache-type-k q8_0 --cache-type-v q8_0

--flash-attn (Flash Attention)

# Flash Attention
flash_attn_config = {
    "meaning": "Memory-efficient attention computation algorithm",
    "effects": {
        "VRAM reduction": "Eliminates intermediate attention buffers -> saves hundreds of MB",
        "speed boost": "Faster on long contexts (~10% at 32K, scales with context length)",
        "short_context": "Minimal effect below 4K tokens",
    },
    "requirements": "CUDA backend + compatible GPU (RTX 20xx or newer)",
    "compatibility": "Works alongside KV cache quantization",
}

# Benchmarks on 8GB
flash_attn_benchmark = {
    "Qwen2.5-7B Q4_K_M, -c 8192": {
        "without_flash_attn": "31.8 t/s, VRAM 5.5 GB",
        "with_flash_attn": "32.1 t/s, VRAM 5.2 GB",
        "delta": "speed +1%, VRAM -0.3 GB",
    },
    "Qwen2.5-7B Q4_K_M, -c 32768": {
        "without_flash_attn": "28.5 t/s, VRAM 6.3 GB",
        "with_flash_attn": "31.5 t/s, VRAM 5.8 GB",
        "delta": "speed +10.5%, VRAM -0.5 GB",
    },
}

# Verdict: Always enable it. There is no downside.

--flash-attn has zero downsides. Always include it.


-b (Batch Size) and -t (Thread Count)

# Batch size
batch_config = {
    "-b (batch size)": {
        "meaning": "Number of tokens processed at once during prompt evaluation",
        "default": 2048,
        "8GB recommendation": 512,
        "reason": "Large batches cause VRAM spikes during prompt eval, risking OOM",
    },
    "-ub (micro batch)": {
        "meaning": "Further subdivides batches for processing",
        "default": 512,
        "usually": "No need to change",
    },
}

# Thread count
thread_config = {
    "-t (threads)": {
        "meaning": "Number of threads for CPU computation",
        "default": "All cores",
        "recommendation": "Physical core count (no HT)",
        "example_i7_13700H": "-t 6 (6 P-cores)",
        "reason": "HT logical threads just compete for memory bandwidth. Physical core count is optimal",
    },
}

# Benchmark: thread count impact (Qwen2.5-32B Q4_K_M, -ngl 25)
thread_benchmark = {
    "-t 6 (P-core count)": "10.8 t/s",
    "-t 8 (P+E cores)": "10.5 t/s",
    "-t 14 (all physical cores P+E)": "9.8 t/s",
    "-t 20 (all threads incl. HT)": "9.2 t/s",
    "conclusion": "More threads = slower. Physical P-core count is optimal",
}

The intuition that more threads means faster inference is wrong. HT logical threads share L1/L2 cache and memory bandwidth, which turns into pure overhead for LLM inference.
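A reasonable starting point for -t can be computed, though it is only a heuristic (this helper is an assumption, not part of llama.cpp): os.cpu_count() reports logical threads, so halving it approximates physical cores on HT/SMT machines.

```python
import os

def suggest_threads():
    """Starting guess for -t: half the logical thread count.
    On hybrid CPUs (P+E cores) benchmark downward from here --
    the measurements above found P-core count alone fastest."""
    logical = os.cpu_count() or 1
    return max(1, logical // 2)
```

On the i7-13700H above this returns 10 (20 threads / 2); the benchmark shows you should still step down to 6.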


Server Options (llama-server)

# llama-server: set up an OpenAI-compatible API
server_config = {
    "basic": "llama-server -m model.gguf -ngl 999 -c 4096 --host 0.0.0.0 --port 8080",
    "recommended extras": {
        "--flash-attn": "Memory efficiency (always ON)",
        "--metrics": "Expose Prometheus-format metrics",
        "--parallel 1": "Concurrent request count (keep at 1 for 8GB)",
        "--cont-batching": "Continuous batching (useful when --parallel >= 2)",
    },
}

# Function calling setup
function_calling_config = {
    "--chat-template": "Auto-detected (uses template embedded in GGUF)",
    "note": "The tools parameter for function calling depends on the model's chat template",
    "recommended models": [
        "Qwen3.5-4B-Instruct (3.4GB, function calling 97.5%)",
        "Qwen2.5-7B-Instruct (4.7GB, function calling 95%+)",
    ],
}

# Enforce structured output with GBNF grammar
grammar_config = {
    "--grammar-file": "Force output format via GBNF grammar file",
    "use case": "Guarantees valid JSON output. Syntax errors drop to 0%",
    "caveat": "Inference can slow down when the model tries to generate output that doesn't match the grammar",
    "alternative": "--json-schema to specify JSON Schema directly (llama.cpp b7000+)",
}
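Once llama-server is up, any OpenAI-style client works. A minimal stdlib-only sketch against the /v1/chat/completions endpoint (the helper names and the port are assumptions matching the commands above):

```python
import json
import urllib.request

def build_chat_request(prompt, max_tokens=128):
    """JSON body for llama-server's OpenAI-compatible
    /v1/chat/completions endpoint."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt, base_url="http://127.0.0.1:8080"):
    """Send a chat request to a running llama-server instance."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```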

Configuration Templates

Template 1: 7B Model, Chat Use (Maximum Speed)

llama-server \
  -m qwen2.5-7b-instruct-q4_k_m.gguf \
  -ngl 999 \
  -c 8192 \
  --flash-attn \
  -t 6 \
  --host 127.0.0.1 --port 8080
# Expected speed: ~32 t/s, VRAM: ~5.2 GB

Template 2: 32B Model, Quality Focus (Partial Offload)

llama-server \
  -m qwen2.5-32b-instruct-q4_k_m.gguf \
  -ngl 25 \
  -c 4096 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn \
  -t 6 \
  -b 512 \
  --host 127.0.0.1 --port 8080
# Expected speed: ~10.8 t/s, VRAM: ~7.4 GB

Template 3: 7B Model, Long Context (32K)

llama-server \
  -m qwen2.5-7b-instruct-q4_k_m.gguf \
  -ngl 999 \
  -c 32768 \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --flash-attn \
  -t 6 \
  -b 512 \
  --host 127.0.0.1 --port 8080
# Expected speed: ~31 t/s, VRAM: ~5.7 GB

Template 4: 4B Model, Function Calling (Maximum Reliability)

llama-server \
  -m qwen3.5-4b-instruct-q4_k_m.gguf \
  -ngl 999 \
  -c 4096 \
  --flash-attn \
  -t 6 \
  --host 127.0.0.1 --port 8080
# Expected speed: ~50 t/s, VRAM: ~3.8 GB
# Function calling accuracy: 97.5%

Common Mistakes and Fixes

common_mistakes = {
    "-ngl 0 (not using GPU)": {
        "symptom": "Inference speed stuck at 3-5 t/s",
        "cause": "All layers running on CPU. DDR5 ~50 GB/s is the bottleneck",
        "fix": "Try -ngl 999. If OOM, decrease",
    },
    "-c set too high": {
        "symptom": "OOM immediately after inference starts",
        "cause": "KV cache eating all VRAM",
        "fix": "Lower to -c 4096, or add --cache-type-k q8_0",
    },
    "-t set too high": {
        "symptom": "CPU at 100% but inference is slow",
        "cause": "HT logical threads fighting over cache and memory bandwidth",
        "fix": "Set -t to physical core count",
    },
    "Using --mlock": {
        "symptom": "Memory error on startup",
        "cause": "Locks entire model in RAM -> physical memory exhausted",
        "fix": "Remove --mlock (especially unnecessary on Windows)",
    },
    "Batch size too large": {
        "symptom": "OOM when feeding long prompts",
        "cause": "VRAM spike during prompt evaluation",
        "fix": "Lower to -b 512",
    },
}

Summary: Speed Impact by Setting

Setting Change                        Speed Impact    VRAM Impact
──────────────────────────────────────────────────────────────────
-ngl 0 -> 999 (full GPU)             +5-10x          +4-7 GB
-ngl fine-tuning (±5)                +10-20%         ±0.5 GB
--flash-attn enabled                  +1-10%          -0.3 GB
--cache-type q8_0                     ±0%             -50%
-t all threads -> physical cores      +5-15%          ±0
-c 32K -> 4K                         +5%             -0.7 GB
-b 2048 -> 512                       ±0%*            -0.2 GB**

* No effect on generation speed (only prompt eval time)
** Suppresses temporary VRAM spikes during prompt eval

The biggest lever is -ngl. Next is -t. Everything else is fine-tuning. On 8GB VRAM, the core strategy is: maximize -ngl, then use -c and KV cache quantization to claw back enough VRAM to make it fit.
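That strategy can be turned into a back-of-envelope first guess for -ngl (a rough sketch under stated assumptions: weights and KV cache split evenly per layer, ~0.5 GB overhead; fine-tune with nvidia-smi afterwards):

```python
def estimate_ngl(weights_gb, total_layers, kv_gb_full_model,
                 vram_gb=8.0, overhead_gb=0.5):
    """First guess for -ngl: fill the VRAM budget with per-layer
    slices of weights + KV cache, assuming an even split."""
    per_layer_gb = (weights_gb + kv_gb_full_model) / total_layers
    budget_gb = vram_gb - overhead_gb
    return min(total_layers, int(budget_gb / per_layer_gb))

# Qwen2.5-32B Q4_K_M (18.5 GB, 64 layers), 8K context, q8_0 KV (1.0 GB):
estimate_ngl(18.5, 64, 1.0)   # -> 24, near the measured optimum of 25
# Qwen2.5-7B Q4_K_M (4.7 GB, 28 layers), 8K context, f16 KV (0.44 GB):
estimate_ngl(4.7, 28, 0.44)   # -> 28 (full offload)
```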


References

  1. llama.cpp — github.com/ggerganov/llama.cpp
  2. llama.cpp Server documentation — github.com/ggerganov/llama.cpp/tree/master/examples/server
  3. GGUF format specification — github.com/ggerganov/ggml/blob/master/docs/gguf.md
  4. Flash Attention — "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning" (2023) arXiv:2307.08691
