

This article was originally published on aifoss.dev

---
title: 'GGUF Quantization Guide 2026: Q4_K_M vs Q5_K_M vs Q8_0'
description: 'The exact tradeoffs between Q4_K_M, Q5_K_M, and Q8_0 GGUF quantization: perplexity cost, VRAM requirements, and inference speed for 7B to 70B models in 2026.'
pubDate: 'May 23 2026'
tags: ["quantization", "ai", "llm", "gpu", "opensource"]
---

The Q4_K_M vs Q5_K_M vs Q8_0 decision is the first one you make every time you pull a new model — and most guides reduce it to "Q4 is fine for most people." That's true. It's also not the whole picture.

What follows is the breakdown with actual numbers: file sizes, VRAM budgets, perplexity deltas from FP16, and generation speeds — all from the llama.cpp official documentation and verified benchmarks. By the end, you'll have a hardware-specific decision rule that takes 30 seconds to apply, not a table of caveats.

What quantization does

A full-precision language model stores each weight as a 16-bit floating-point number. An 8B parameter model at FP16 takes 14.96 GiB of storage and roughly the same in VRAM — out of reach for most consumer GPUs. Quantization compresses those weights by mapping them to a smaller set of representable values. A 4-bit quantized 8B model fits in 4.58 GiB.
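
The sizes above follow from simple arithmetic: parameter count times average bits per weight. Here is a minimal sketch, assuming roughly 8.03 billion parameters for an "8B" model (an assumption for illustration, not a figure from this article) and the bits-per-weight averages quoted later in this guide:

```python
GIB = 2**30  # bytes per GiB

def model_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB, ignoring file metadata."""
    return n_params * bits_per_weight / 8 / GIB

# Assumed parameter count for an "8B" model (hypothetical value for illustration).
N_PARAMS = 8.03e9

for name, bpw in [("FP16", 16.0), ("Q8_0", 8.50), ("Q5_K_M", 5.70), ("Q4_K_M", 4.89)]:
    print(f"{name:7s} {model_size_gib(N_PARAMS, bpw):6.2f} GiB")
```

Under that assumption the results land within a few hundredths of a GiB of the file sizes quoted in this guide. Real GGUF files also carry metadata and tokenizer data, so treat this as an estimate.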

The compression is lossy. Rounding weights to fewer bits introduces noise. The question is how much, and whether it matters for your specific workload. The short answer is that the quality gap between Q4 and Q8 is small enough that most users won't perceive it in chat. The gap between Q3 and Q4 is where things break noticeably. Most quantization guides worry about the wrong cliff.
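
To make "rounding introduces noise" concrete, here is a toy round trip through a symmetric absmax quantizer at several bit widths. This is a deliberately simplified scheme on synthetic Gaussian values, not the algorithm llama.cpp uses, and it does not measure model quality; it only shows how reconstruction error grows as bits are removed.

```python
import math
import random

def quantize_roundtrip(weights, bits):
    """Symmetric absmax quantization to signed integers, then back to floats."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4 bits, 127 for 8 bits
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]

def rms_error(original, restored):
    """Root-mean-square difference between two equal-length lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original, restored)) / len(original))

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic stand-in for weights

errors = {bits: rms_error(weights, quantize_roundtrip(weights, bits)) for bits in (8, 5, 4, 3)}
for bits, err in errors.items():
    print(f"{bits}-bit RMS error: {err:.6f}")
```

Each bit removed roughly doubles the rounding step, so the error climbs fastest at the low end of the range.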

GGUF and the three quantization families

GGUF (GPT-Generated Unified Format) is the file format used by llama.cpp (MIT license, 112k GitHub stars, build b9297 as of May 23, 2026), Ollama, LM Studio, and nearly every other local inference tool. It replaced the older GGML format and is now the universal container for quantized open-source models.

Three quantization families exist within GGUF:

Legacy quants (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0) — the original scheme. At 8 bits, the legacy Q8_0 format remains excellent and widely used. At lower bit depths, the legacy formats are largely outclassed by K-quants at the same bit count.

K-quants (Q2_K through Q6_K) — the current standard for 4–6 bit compression. These use a super-block structure: 256 weights are grouped together, and the per-block scales are themselves quantized, which recovers significant quality at the same average bit depth. The suffixes indicate mixed-precision aggression: _S (small/aggressive), _M (medium/balanced), _L (large/conservative). The _M variant is the recommended default for most use cases.

IQ quants (IQ1_S through IQ4_XS) — importance-matrix-based quantization. These use lookup tables derived from calibration data that records which weights matter most for model behavior. They deliver better quality per bit than K-quants at the lowest bit levels, but require an imatrix calibration file to work correctly. Without an imatrix, IQ quant output is noticeably worse than the equivalent K-quant.
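
The block-scaling idea behind these formats can be sketched in a few lines. The example below compares one global scale against per-block scales on synthetic data with a single outlier. It is a simplification under stated assumptions: the block size of 32 is illustrative, scales are kept as plain floats, and the second-level quantization of scales within 256-weight super-blocks is not modeled.

```python
import math
import random

def absmax_roundtrip(values, bits=4):
    """Quantize to signed integers with one shared scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0  # fall back to 1.0 for all-zero input
    return [round(v / scale) * scale for v in values]

def blockwise_roundtrip(values, block_size=32, bits=4):
    """Same scheme, but every block of `block_size` values gets its own scale."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(absmax_roundtrip(values[start:start + block_size], bits))
    return out

def rms_error(a, b):
    """Root-mean-square difference between two equal-length lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]
weights[100] = 0.5  # a single outlier stretches any scale it shares

global_err = rms_error(weights, absmax_roundtrip(weights))
block_err = rms_error(weights, blockwise_roundtrip(weights))
print(f"one global scale: {global_err:.5f}")
print(f"per-block scales: {block_err:.5f}")
```

With a single scale, the outlier forces a coarse step on every weight. With per-block scales, only the outlier's own block pays that cost.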

For the vast majority of decisions, the relevant question is: Q4_K_M, Q5_K_M, or Q8_0?

Q4_K_M: the default for a reason

Q4_K_M averages 4.89 bits per weight. For a Llama 3.1 8B model, that's a 4.58 GiB file. In VRAM, add ~10–20% overhead for context buffers and metadata — budget 5.5–6 GB for an 8B model at a moderate context window.
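
That budget can be reproduced with a small helper. This is a rough sketch that applies the 10 to 20% rule of thumb above and converts GiB to GB; the function name is hypothetical, and real overhead depends on context length and KV cache settings.

```python
GIB_TO_GB = 2**30 / 1e9  # 1 GiB is about 1.074 GB

def vram_budget_gb(file_size_gib: float, overhead=(0.10, 0.20)) -> tuple[float, float]:
    """Return (low, high) VRAM estimates in GB for a given GGUF file size."""
    low, high = (file_size_gib * (1 + o) * GIB_TO_GB for o in overhead)
    return low, high

low, high = vram_budget_gb(4.58)  # Q4_K_M file size for an 8B model
print(f"Q4_K_M 8B: {low:.1f} to {high:.1f} GB")
```

For the 4.58 GiB file this gives roughly 5.4 to 5.9 GB, in line with the 5.5 to 6 GB budget.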

The internal structure is mixed-precision: Q4_K_M uses Q6_K for half of the attention value-projection and feed-forward output tensors (the layers most sensitive to quantization noise) and Q4_K for everything else. This selective higher precision is why K-quants outperform legacy Q4_0 at essentially the same file size.

Quality: Perplexity increase over FP16 is approximately +0.18 on Llama 3.1 8B. In absolute terms, this is small — around 3% degradation from the full-precision baseline. For general chat, summarization, and document Q&A tasks, Q4_K_M output is indistinguishable from FP16 to most users.

Speed: Q4_K_M generates at approximately 71.9 tokens/second on an 8B model, with prompt processing (prefill) at ~821 tokens/second. These figures come from the llama.cpp quantize README and reflect single-GPU consumer hardware performance.

Ollama defaults to Q4_K_M for most models in its registry. When you run ollama pull llama3.2, you get Q4_K_M unless you specify otherwise. That default choice is deliberate — it runs on 6 GB VRAM, produces output that satisfies nearly every general-use workload, and generates fast enough that interactive use feels responsive.

Q5_K_M: the step-up that earns its VRAM

Q5_K_M uses 5.70 bits per weight. The same 8B model becomes a 5.33 GiB file — 16% larger than Q4_K_M. VRAM requirement for an 8B model: approximately 6.5 GB.

The quality improvement is real but modest on perplexity benchmarks: +0.06 over FP16 versus Q4_K_M's +0.18, a 67% reduction in the perplexity penalty for 16% more memory. Whether that matters depends entirely on the task.

Where the Q4→Q5 upgrade earns its cost: coding and structured reasoning. At 4 bits, the accumulated weight noise is enough to occasionally corrupt variable names, skip edge conditions in logic chains, or produce subtly malformed JSON output. The degradation isn't catastrophic — it's the kind of error you might attribute to the model's capability rather than its quantization. At 5 bits, this class of error largely disappears. If you're running a coding assistant or using the model for structured data extraction in a pipeline, Q5_K_M is a meaningful upgrade over Q4_K_M.

Where it doesn't matter: chat, summarization, translation, creative writing. The Q4→Q5 delta is not perceptible in output quality for these workloads.

Speed: Q5_K_M generates at approximately 67.2 tokens/second on an 8B model — about 7% slower than Q4_K_M for generation. Prompt processing drops to ~758 tokens/second.
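
The Q4_K_M to Q5_K_M tradeoff is easier to weigh with the relative changes side by side. The snippet below only recomputes percentages from the figures quoted in this guide; the dictionary keys are names chosen for this example.

```python
# Figures quoted in this guide for an 8B model.
q4 = {"size_gib": 4.58, "ppl_delta": 0.18, "gen_tps": 71.9, "prefill_tps": 821}
q5 = {"size_gib": 5.33, "ppl_delta": 0.06, "gen_tps": 67.2, "prefill_tps": 758}

def pct_change(old: float, new: float) -> float:
    """Signed percentage change from old to new."""
    return (new - old) / old * 100

for key in q4:
    print(f"{key:12s} {pct_change(q4[key], q5[key]):+6.1f}%")
```

The memory cost and the speed cost are both modest, while the perplexity penalty shrinks by about two-thirds.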

Q8_0: near-lossless, with a speed tradeoff worth understanding

Q8_0 is the legacy 8-bit format. It averages 8.50 bits per weight, producing a 7.95 GiB file for an 8B model, about 1.7× the size of Q4_K_M. Budget 9–10 GB VRAM for an 8B model at normal context lengths.

The quality delta from FP16 is approximately +0.01 perplexity — effectively lossless. No task type shows a meaningful degradation compared to running the model at full precision. This is the format to use when you need a reference-quality baseline.

The counterintuitive speed profile: Q8_0 generates at approximately 50.9 tokens/second — 29% slower than Q4_K_M for token generation. That's a significant penalty for interactive use. However, prompt processing runs at ~865 tokens/second, faster than Q4_K_M's 821.

This is not a paradox. Prompt processing (prefill) is compute-bound: all input tokens are processed in large parallel matrix operations on the GPU, and modern tensor cores have highly optimized INT8 paths that run efficiently at 8-bit precision. Token generation (decode) is memory-bandwidth-bound: producing each new token requires reading every weight in the model once from VRAM. Moving 8.5 bits per parameter across the memory bus takes proportionally longer than moving 4.89 bits. The more data the GPU has to shuttle per generated token, the slower generation gets, regardless of computational throughput.
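
A back-of-the-envelope version of the bandwidth argument: if every generated token requires reading all weights once, tokens per second cannot exceed memory bandwidth divided by model size. The sketch below assumes a hypothetical 500 GB/s GPU; the bandwidth figure and function name are illustrative, not taken from any benchmark in this guide.

```python
GIB = 2**30

def decode_ceiling_tps(model_gib: float, bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/second if each token reads every weight once."""
    return bandwidth_gb_s * 1e9 / (model_gib * GIB)

BANDWIDTH_GB_S = 500.0  # hypothetical GPU memory bandwidth, for illustration only

q4 = decode_ceiling_tps(4.58, BANDWIDTH_GB_S)
q8 = decode_ceiling_tps(7.95, BANDWIDTH_GB_S)
print(f"Q4_K_M ceiling: {q4:.0f} tok/s")
print(f"Q8_0 ceiling:   {q8:.0f} tok/s")
print(f"Q8_0 / Q4_K_M:  {q8 / q4:.2f}")
```

The ceiling ratio depends only on file size, and it is lower than the measured ratio of 50.9 to 71.9 tokens/second. That suggests file size alone does not fully explain the gap, so treat this as a bound, not a prediction.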

Practical implication: if your workload is document ingestion, RAG, or long-context summarization (many prompt tokens, few generated tokens), Q8_0 costs you almost nothing in speed and gives you near-lossless quality. If you're generating long responses in interactive chat, you'll notice the 29% generation slowdown.
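
To see how the workload mix changes the picture, estimate total request time as prefill time plus decode time. The sketch below uses the throughput figures quoted in this guide and two hypothetical workloads. It assumes throughput stays constant regardless of context length, which real inference does not guarantee.

```python
# Throughput figures quoted in this guide (tokens/second, 8B model).
RATES = {
    "Q4_K_M": {"prefill": 821.0, "decode": 71.9},
    "Q8_0":   {"prefill": 865.0, "decode": 50.9},
}

def request_seconds(fmt: str, prompt_tokens: int, output_tokens: int) -> float:
    """Estimated wall time: prefill the prompt, then decode the output."""
    r = RATES[fmt]
    return prompt_tokens / r["prefill"] + output_tokens / r["decode"]

# Hypothetical workloads: (prompt tokens, generated tokens).
workloads = {"long-document summary": (8000, 100), "interactive chat": (200, 800)}

for name, (p, g) in workloads.items():
    t4 = request_seconds("Q4_K_M", p, g)
    t8 = request_seconds("Q8_0", p, g)
    print(f"{name}: Q4_K_M {t4:.1f}s, Q8_0 {t8:.1f}s ({t8 / t4:.2f}x)")
```

Under these assumptions the prompt-heavy request takes about the same time in either format, while the generation-heavy one takes noticeably longer at Q8_0.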

The other Q8_0 risk: VRAM crowding. A 7B Q8_0 fits on an 8 GB card, but leaves little room for context. Q4_K_M on the same card lets you run a 13B model with headroom to spare — and the 13B Q4_K_M almost always outperforms the 7B Q8_0 on output