This article was originally published on Best GPU for LLM. The full version with interactive tools, FAQ, and live pricing is on the original site.
Q4_K_M. That is the answer for 90% of users — skip the rest of this article if you just need a quick recommendation. But if you want to understand why, and when the other options make sense, read on. The difference between Q3 and Q5 can mean the gap between a model that hallucinates and one that reasons cleanly.
What quantization actually does
Quantization reduces the precision of model weights from 16-bit floating point (FP16) to lower bit representations. Fewer bits = smaller model = less VRAM = faster inference. The trade-off is output quality — lower precision means the model loses nuance in its weights, which can degrade reasoning, instruction following, and factual accuracy.
GGUF is the standard format for quantized models on consumer hardware. Tools like llama.cpp, Ollama, and LM Studio all use GGUF files. When you download a model from HuggingFace, the filename tells you the quantization: model-Q4_K_M.gguf, model-Q5_K_M.gguf, etc.
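As a concrete example, here is a minimal sketch of loading a Q4_K_M GGUF through llama-cpp-python (one of the llama.cpp Python bindings). The model path is a placeholder for whatever file you downloaded, and it assumes `pip install llama-cpp-python` built with GPU support.

```python
# Minimal sketch: running a quantized GGUF with llama-cpp-python.
# The model path below is a placeholder for a file you downloaded yourself.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model-Q4_K_M.gguf",  # hypothetical local path
    n_gpu_layers=-1,   # offload every layer to the GPU if it fits
    n_ctx=4096,        # context window; larger values grow the KV cache
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```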
The quantization comparison table
| Quant | Bits/param | Quality vs FP16 | VRAM (7B) | VRAM (13B) | VRAM (34B) | VRAM (70B) |
|---|---|---|---|---|---|---|
| Q2_K | ~2.5 | 75-80% | ~2.5GB | ~5GB | ~12GB | ~25GB |
| Q3_K_M | ~3.5 | 85-90% | ~3.5GB | ~7GB | ~17GB | ~35GB |
| Q4_K_M | ~4.5 | 93-96% | ~4.5GB | ~8.5GB | ~21GB | ~42GB |
| Q5_K_M | ~5.5 | 96-98% | ~5.5GB | ~10GB | ~25GB | ~50GB |
| Q6_K | ~6.5 | 98-99% | ~6.5GB | ~12GB | ~30GB | ~60GB |
| Q8_0 | ~8 | 99%+ | ~8GB | ~15GB | ~38GB | ~75GB |
| FP16 | 16 | 100% | ~14GB | ~26GB | ~68GB | ~140GB |
VRAM estimates include ~1-2GB overhead for KV cache at moderate context lengths. Actual usage varies by model architecture and context window size.
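The table's numbers follow directly from parameter count times bits per parameter, plus overhead. Here is a rough back-of-the-envelope estimator, assuming the bits-per-parameter figures above and a flat 1.5GB allowance for KV cache and runtime buffers:

```python
# Rough VRAM estimate: weights = params * bits / 8, plus a flat overhead
# allowance for KV cache and runtime buffers. Real usage varies with
# architecture and context length, so treat results as ballpark figures.
BITS_PER_PARAM = {
    "Q2_K": 2.5, "Q3_K_M": 3.5, "Q4_K_M": 4.5,
    "Q5_K_M": 5.5, "Q6_K": 6.5, "Q8_0": 8.0, "FP16": 16.0,
}

def estimate_vram_gb(params_billions: float, quant: str, overhead_gb: float = 1.5) -> float:
    weights_gb = params_billions * BITS_PER_PARAM[quant] / 8
    return round(weights_gb + overhead_gb, 1)

# e.g. a 13B model at Q4_K_M: 13 * 4.5 / 8 + 1.5 ≈ 8.8 GB
print(estimate_vram_gb(13, "Q4_K_M"))
```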
The breakdown: when to use each level
Q4_K_M — the default choice
Use when: You want the best balance of quality and VRAM efficiency.
Q4_K_M preserves 93-96% of FP16 quality on most benchmarks. In the name, "_K" marks the k-quant family and "_M" the medium mix: the more sensitive tensors (parts of the attention and output layers) are stored at higher precision, while less critical tensors get fewer bits. This targeted allocation is why Q4_K_M outperforms naive 4-bit quantization by a meaningful margin.
For conversational AI, coding assistance, and general reasoning, Q4_K_M is virtually indistinguishable from FP16 in blind tests. We recommend it as the starting point for any model.
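If you want to see the mixed allocation for yourself, the `gguf` Python package that ships alongside llama.cpp can list the quantization type of each tensor in a file. A minimal sketch, assuming a local Q4_K_M file at a placeholder path:

```python
# Sketch: inspect which quant type each tensor in a GGUF file uses.
# In a Q4_K_M file you will typically see a mix, e.g. Q4_K for most
# weights and a higher-precision type for the more sensitive tensors.
from collections import Counter
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("./models/model-Q4_K_M.gguf")  # hypothetical path
types = Counter(t.tensor_type.name for t in reader.tensors)
for quant_type, count in types.most_common():
    print(f"{quant_type}: {count} tensors")
```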
Q5_K_M — the upgrade if you have headroom
Use when: You have 20-30% more VRAM than Q4 requires.
Q5_K_M closes most of the remaining gap to FP16. The quality improvement over Q4 is most noticeable on:
- Complex multi-step reasoning
- Creative writing with specific style constraints
- Code generation for less common languages
- Tasks requiring precise numerical reasoning
If your GPU has the VRAM to spare, Q5 is usually worth choosing over Q4. The model is ~20% larger, and because single-stream inference speed is bound by memory bandwidth rather than compute, expect tokens per second to drop roughly in proportion; noticeable on paper, but rarely enough to change the experience.
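A quick sanity check on that speed claim: in single-stream decoding the GPU reads essentially the whole weight file from VRAM for every generated token, so tokens per second is roughly memory bandwidth divided by model size. A back-of-the-envelope sketch, assuming a card with about 1,000 GB/s of bandwidth (roughly RTX 4090 class):

```python
# Back-of-the-envelope decode speed: each generated token reads roughly
# the full weight file from VRAM, so tok/s ceiling ≈ bandwidth / model size.
# Real numbers are lower (kernel overhead, KV cache reads), but the
# ratio between quant levels is what matters here.
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

bandwidth = 1000.0  # assumed ~RTX 4090-class memory bandwidth, GB/s
q4 = decode_ceiling_tok_s(bandwidth, 4.5)   # 7B @ Q4_K_M: ~222 tok/s ceiling
q5 = decode_ceiling_tok_s(bandwidth, 5.5)   # 7B @ Q5_K_M: ~182 tok/s ceiling
print(f"Q4 ceiling ≈ {q4:.0f} tok/s, Q5 ceiling ≈ {q5:.0f} tok/s ({q5 / q4:.0%} of Q4)")
```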
Q3_K_M — acceptable compromise
Use when: Your VRAM is tight and Q4 does not fit comfortably.
Q3 is the lowest we recommend for serious use. Quality degrades noticeably on reasoning-heavy tasks — you will see more hallucinations and logic errors compared to Q4. But for simple chat, summarization, and straightforward Q&A, Q3 models remain functional. If the alternative is not running the model at all, Q3 is a valid option.
Q6_K and Q8_0 — diminishing returns
Use when: You have abundant VRAM and want maximum quality.
The jump from Q5 to Q6 is marginal — maybe 1-2% on benchmarks. Q8 is nearly identical to FP16 in practice. These quantizations make sense for small models (7B at Q8 = ~8GB, easily fits on most GPUs) but become impractical for larger models. Running a 34B at Q8 needs ~38GB — beyond any single consumer GPU.
Q2_K and below — last resort
Use when: You absolutely must fit a specific model on limited hardware and accept significant quality loss.
Q2 models lose 20-25% of FP16 quality. Reasoning degrades substantially. Instruction following becomes unreliable. We do not recommend Q2 for anything beyond experimentation.
Dynamic quantization: the new frontier
Unsloth introduced its UD (Unsloth Dynamic) quantization scheme in 2025, and it is gaining traction in 2026. UD-Q2, UD-Q3, and UD-Q4 variants use variable bit allocation across layers: critical layers get more bits, less important layers get fewer. The result is that a UD-Q3 model can match traditional Q4_K_M quality at Q3-level VRAM usage.
If you see UD-quantized models on HuggingFace, prefer them over standard quants at the same nominal bit level. The VRAM savings are real and the quality is measurably better.
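Checking what a repo actually offers is easy to script with the huggingface_hub client. A minimal sketch; the repo id is a placeholder, not a recommendation:

```python
# Sketch: list GGUF files in a HuggingFace repo and flag UD-prefixed quants.
# The repo id below is a placeholder; substitute the model you actually want.
from huggingface_hub import list_repo_files

repo_id = "some-org/some-model-GGUF"  # hypothetical repo id
gguf_files = [f for f in list_repo_files(repo_id) if f.endswith(".gguf")]

ud_quants = [f for f in gguf_files if "UD-" in f]
print("Dynamic (UD) quants available:" if ud_quants else "No UD quants found.")
for f in ud_quants:
    print(" ", f)
```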
Practical recommendations by GPU
| GPU | VRAM | Best quant for 7B | Best quant for 14B | Best quant for 34B |
|---|---|---|---|---|
| RTX 3060 12GB | 12GB | Q8_0 | Q4_K_M | Won't fit |
| RTX 4060 Ti 16GB | 16GB | Q8_0 | Q5_K_M | Won't fit |
| RTX 4090 | 24GB | FP16 | Q8_0 | Q4_K_M |
| RTX 5090 | 32GB | FP16 | FP16 | Q5_K_M |
The pattern is simple: use the highest-precision quant your VRAM can hold while leaving 2-3GB of headroom for the KV cache.
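That rule is mechanical enough to script. A minimal sketch, reusing the bits-per-parameter figures from the comparison table and assuming a 2.5GB headroom target:

```python
# Sketch: pick the highest-precision quant that fits in a given VRAM budget,
# leaving headroom for the KV cache. Ordered from highest precision down.
QUANTS = [("FP16", 16.0), ("Q8_0", 8.0), ("Q6_K", 6.5),
          ("Q5_K_M", 5.5), ("Q4_K_M", 4.5), ("Q3_K_M", 3.5), ("Q2_K", 2.5)]

def pick_quant(vram_gb: float, params_billions: float, headroom_gb: float = 2.5) -> str:
    budget = vram_gb - headroom_gb
    for name, bits in QUANTS:
        if params_billions * bits / 8 <= budget:
            return name
    return "Won't fit (even at Q2_K)"

# e.g. a 24GB RTX 4090 running a 34B model -> Q4_K_M
print(pick_quant(24, 34))
```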
Common mistakes
- Defaulting to Q8 or FP16 "for quality." Unless you are evaluating or fine-tuning, Q8 is overkill for inference. Q5_K_M captures nearly all the quality at 60-70% of the VRAM cost.
- Using Q2/Q3 to fit a bigger model. Running a 70B at Q2 is almost always worse than running a 34B at Q4. A well-quantized smaller model beats a poorly quantized larger one.
- Ignoring the _K_M suffix. The legacy Q4_0 format and Q4_K_M are not the same. Always prefer the k-quant variants; they allocate bits more intelligently.
- Not checking for UD quants. Before downloading a standard Q4_K_M, check if a UD-Q4 version exists. Same VRAM, better quality.
Final answer
| Situation | Recommended quant |
|---|---|
| General use, most users | Q4_K_M |
| Have VRAM headroom (~20%+) | Q5_K_M |
| VRAM-constrained | Q3_K_M |
| Small models (7B) on 16GB+ | Q8_0 |
| Evaluating/benchmarking | FP16 |
Q4_K_M remains king in 2026. The quality-to-VRAM ratio is unmatched. Upgrade to Q5 when you can, drop to Q3 when you must, and check for UD quants before downloading anything.
For VRAM planning across model sizes, see how much VRAM for local LLM. Running models through Ollama? Our best GPU for Ollama guide covers setup. Budget shoppers should check best budget GPU for local LLM for affordable options. And if you want to push the limits with a single GPU, read how to run 70B on a single GPU.
Related guides on Best GPU for LLM
- How Much VRAM for Local LLMs in 2026? Full Q4-Q8 Guide
- Can the RTX 4060 Ti Run Llama 70B in 2026? (Honest)
- Can the RTX 5070 Run 34B Models in 2026? (Analyzed)
Read the full guide on Best GPU for LLM — includes our VRAM calculator, GPU comparison table, and live pricing.