You ran ollama pull and saw phi4:Q4_K_M. The docs say it's a quantized version. The model page shows the file size. Neither tells you which one to pull or why the difference matters.
Here's what the naming actually means.
The Q Number is Bits Per Weight
LLM quantization compresses model weights from full floating-point precision down to lower-bit representations so the model fits in less VRAM without destroying output quality. The Q number is the approximate bits per weight: Q4 stores each weight in roughly four bits instead of the sixteen that FP16 uses.
A 7B model at FP16 needs roughly 14GB of VRAM. At Q4_K_M, that same model loads in 4 to 4.5GB. That's not a marginal savings. That's the difference between a model loading at all and refusing to load entirely.
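The arithmetic is simple enough to sanity-check yourself: parameters × bits-per-weight / 8 gives a rough weights-only size. One caveat in the sketch below: the 4.85 bpw figure for Q4_K_M is an approximate llama.cpp average, not a flat 4.0, because the group scales add overhead.

```python
def est_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weights-only size estimate. KV cache and runtime overhead are extra."""
    return params_billion * bits_per_weight / 8

print(est_gb(7, 16))    # FP16:   ~14.0 GB
print(est_gb(7, 4.85))  # Q4_K_M: ~4.2 GB (approximate bpw, incl. group scales)
```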
What Each Level Delivers
Q2 / Q3 — Dramatic VRAM savings, significant quality loss. Q3 is not meaningfully better than Q2 for most tasks. If a model only fits at Q3, the better move is a smaller model at Q4.
Q4_K_M — The working standard. Strong output quality across drafting, summarization, coding, and reasoning. This is what Ollama pulls by default, and that default is correct.
Q5_K_M — Noticeable improvement on structured output, complex reasoning chains, and constrained code generation. Worth the extra VRAM if your card has headroom.
Q6 — Exists. Works. Most people skip to Q8 anyway.
Q8 — Near-full-precision output at roughly half the VRAM of FP16. Pull this if your card handles it cleanly. If it fits but barely, stay at Q5.
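The list above collapses to one rule: take the highest quant that fits with headroom for the KV cache. Here's a minimal sketch of that rule; the bpw values are approximate llama.cpp averages and the 20% headroom margin is my assumption, not Ollama's.

```python
# Hypothetical helper: pick a quant level for a given card.
# bpw values are approximate; the 20% headroom reserved for KV cache
# and runtime overhead is an assumption. Tune to taste.
LEVELS = [("Q8_0", 8.5), ("Q5_K_M", 5.7), ("Q4_K_M", 4.85)]

def pick_quant(params_billion: float, vram_gb: float, headroom: float = 0.20):
    budget = vram_gb * (1 - headroom)
    for name, bpw in LEVELS:  # highest quality first
        if params_billion * bpw / 8 <= budget:
            return name
    return None  # nothing fits: pick a smaller model, not Q2/Q3

print(pick_quant(14, 12))  # -> 'Q4_K_M' for a 14B model on a 12GB card
```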
What K_M Actually Means
K-quants apply different bit depths to different parts of the model rather than treating all weights identically. Some layers matter more for output quality than others, so K-quants spend more bits where precision has a visible impact and fewer bits where the loss is harder to notice.
The trailing letter sets how aggressively that mixed-precision recipe is applied:
M = medium mix (the best default).
S = small: slightly less VRAM, small quality cost.
L = large: slightly more VRAM, small quality gain.
The difference between K_M, K_S, and K_L at the same Q level is smaller than the difference between Q levels. Start with K_M.
If you see Q4_0 and Q4_K_M available, pull Q4_K_M. The K-quant method outperforms the older _0 format at the same bit depth.
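To make the idea concrete, here's a toy block quantizer: every group of weights shares one scale and offset, and the bit depth sets how many levels each group gets. This is a minimal sketch of group-wise quantization in general, assuming NumPy; it is not the actual K-quant format, which uses super-blocks and mixed tensor types.

```python
import numpy as np

# Toy group-wise quantization: each block of `block` weights shares one
# scale and one offset. A sketch of the idea behind K-quants, not
# llama.cpp's real kernels.
def block_quantize(weights: np.ndarray, bits: int = 4, block: int = 32) -> np.ndarray:
    w = weights.reshape(-1, block)
    lo = w.min(axis=1, keepdims=True)
    hi = w.max(axis=1, keepdims=True)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    scale[scale == 0] = 1.0                 # guard against flat blocks
    codes = np.round((w - lo) / scale)      # integer codes in [0, levels]
    return (codes * scale + lo).reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
for bits in (2, 3, 4, 5, 8):
    err = float(np.abs(w - block_quantize(w, bits)).mean())
    print(f"{bits}-bit mean abs error: {err:.5f}")
```

Run it and the error drops sharply from 2-bit to 4-bit, then flattens: the same curve that makes Q4 the sweet spot and Q2/Q3 a poor trade.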
A Real Example: Phi-4 14B on an RTX 3060 12GB
FP16: ~28GB — requires a professional GPU
Q8: ~14GB — fits on a 16GB card with minimal headroom
Q5_K_M: ~10–11GB — fits on a 12GB card comfortably
Q4_K_M: ~8–9GB — runs cleanly on an RTX 3060 12GB with KV cache room
Q3: fits on 8GB, but defeats the purpose of running a 14B model
The right pull for an RTX 3060 12GB running Phi-4 is Q4_K_M. Every time.
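Those figures follow from the same bits-per-weight arithmetic. A quick check, with bpw values that are rough averages I'm assuming from llama.cpp's formats; published file sizes will differ by a few hundred MB:

```python
# Reproduce the Phi-4 14B list above from approximate bits-per-weight.
params_billion = 14
for name, bpw in [("FP16", 16.0), ("Q8_0", 8.5), ("Q5_K_M", 5.7),
                  ("Q4_K_M", 4.85), ("Q3_K_M", 3.9)]:
    print(f"{name:7s} ~{params_billion * bpw / 8:.1f} GB")
```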
Full post: https://engineeredai.net/llm-quantization-explained/