You ran ollama pull and saw phi4:Q4_K_M. The docs say it's a quantized version. The model page shows the file size. Neither tells you which one to pull or why the difference matters.
Here's what the naming actually means.
The Q Number is Bits Per Weight
LLM quantization compresses model weights from full floating-point precision down to lower-bit representations so the model fits in less VRAM without destroying output quality. The Q number is the approximate bits per weight: Q4 stores each weight in roughly four bits instead of the sixteen that FP16 uses.
A 7B model at FP16 needs roughly 14GB of VRAM. At Q4_K_M, that same model loads in 4 to 4.5GB. That's not a marginal savings. That's the difference between a model loading at all and refusing to load entirely.
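The arithmetic is simple enough to sanity-check yourself: parameters × bits-per-weight / 8 gives a rough weights-only size. One caveat in the sketch below: the 4.85 bpw figure for Q4_K_M is an approximate llama.cpp average, not a flat 4.0, because the group scales add overhead.

```python
def est_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weights-only size estimate. KV cache and runtime overhead are extra."""
    return params_billion * bits_per_weight / 8

print(est_gb(7, 16))    # FP16:   ~14.0 GB
print(est_gb(7, 4.85))  # Q4_K_M: ~4.2 GB (approximate bpw, incl. group scales)
```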
What Each Level Delivers
Q2 / Q3 — Dramatic VRAM savings, significant quality loss. Q3 is not meaningfully better than Q2 for most tasks. If a model only fits at Q3, the better move is a smaller model at Q4.
Q4_K_M — The working standard. Strong output quality across drafting, summarization, coding, and reasoning. This is what Ollama pulls by default, and that default is correct.
Q5_K_M — Noticeable improvement on structured output, complex reasoning chains, and constrained code generation. Worth the extra VRAM if your card has headroom.
Q6 — Exists. Works. Most people skip to Q8 anyway.
Q8 — Near-full-precision output at roughly half the VRAM of FP16. Pull this if your card handles it cleanly. If it fits but barely, stay at Q5.
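The list above collapses to one rule: take the highest quant that fits with headroom for the KV cache. Here's a minimal sketch of that rule; the bpw values are approximate llama.cpp averages and the 20% headroom margin is my assumption, not Ollama's.

```python
# Hypothetical helper: pick a quant level for a given card.
# bpw values are approximate; the 20% headroom reserved for KV cache
# and runtime overhead is an assumption. Tune to taste.
LEVELS = [("Q8_0", 8.5), ("Q5_K_M", 5.7), ("Q4_K_M", 4.85)]

def pick_quant(params_billion: float, vram_gb: float, headroom: float = 0.20):
    budget = vram_gb * (1 - headroom)
    for name, bpw in LEVELS:  # highest quality first
        if params_billion * bpw / 8 <= budget:
            return name
    return None  # nothing fits: pick a smaller model, not Q2/Q3

print(pick_quant(14, 12))  # -> 'Q4_K_M' for a 14B model on a 12GB card
```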
What K_M Actually Means
K-quants apply different bit depths to different parts of the model rather than treating all weights identically. Some layers matter more for output quality than others, so K-quants spend more bits where precision has a visible impact and fewer bits where the loss is harder to notice.
The trailing letter sets how aggressively that mixed-precision recipe is applied:
M = medium mix (the best default).
S = small: slightly less VRAM, small quality cost.
L = large: slightly more VRAM, small quality gain.
The difference between K_M, K_S, and K_L at the same Q level is smaller than the difference between Q levels. Start with K_M.
If you see Q4_0 and Q4_K_M available, pull Q4_K_M. The K-quant method outperforms the older _0 format at the same bit depth.
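To make the idea concrete, here's a toy block quantizer: every group of weights shares one scale and offset, and the bit depth sets how many levels each group gets. This is a minimal sketch of group-wise quantization in general, assuming NumPy; it is not the actual K-quant format, which uses super-blocks and mixed tensor types.

```python
import numpy as np

# Toy group-wise quantization: each block of `block` weights shares one
# scale and one offset. A sketch of the idea behind K-quants, not
# llama.cpp's real kernels.
def block_quantize(weights: np.ndarray, bits: int = 4, block: int = 32) -> np.ndarray:
    w = weights.reshape(-1, block)
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    scale[scale == 0] = 1.0                 # guard against flat blocks
    codes = np.round((w - lo) / scale)      # integer codes in [0, levels]
    return (codes * scale + lo).reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
for bits in (2, 3, 4, 5, 8):
    err = float(np.abs(w - block_quantize(w, bits)).mean())
    print(f"{bits}-bit mean abs error: {err:.5f}")
```

Run it and the error drops sharply from 2-bit to 4-bit, then flattens: the same curve that makes Q4 the sweet spot and Q2/Q3 a poor trade.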
A Real Example: Phi-4 14B on an RTX 3060 12GB
FP16: ~28GB — requires a professional GPU
Q8: ~14GB — fits on a 16GB card with minimal headroom
Q5_K_M: ~10–11GB — fits on a 12GB card comfortably
Q4_K_M: ~8–9GB — runs cleanly on an RTX 3060 12GB with KV cache room
Q3: fits on 8GB, but defeats the purpose of running a 14B model
The right pull for an RTX 3060 12GB running Phi-4 is Q4_K_M. Every time.
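Those figures follow from the same bits-per-weight arithmetic. A quick check, with bpw values that are rough averages I'm assuming from llama.cpp's formats; published file sizes will differ by a few hundred MB:

```python
# Reproduce the Phi-4 14B list above from approximate bits-per-weight.
params_billion = 14
for name, bpw in [("FP16", 16.0), ("Q8_0", 8.5), ("Q5_K_M", 5.7),
                  ("Q4_K_M", 4.85), ("Q3_K_M", 3.9)]:
    print(f"{name:7s} ~{params_billion * bpw / 8:.1f} GB")
```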
Full post: https://engineeredai.net/llm-quantization-explained/