This article was originally published on Best GPU for LLM. The full version with interactive tools, FAQ, and live pricing is on the original site.
Q4_K_M. That is the answer for 90% of users — skip the rest of this article if you just need a quick recommendation. But if you want to understand why, and when the other options make sense, read on. The difference between Q3 and Q5 can mean the gap between a model that hallucinates and one that reasons cleanly.
What quantization actually does
Quantization reduces the precision of model weights from 16-bit floating point (FP16) to lower bit representations. Fewer bits = smaller model = less VRAM = faster inference. The trade-off is output quality — lower precision means the model loses nuance in its weights, which can degrade reasoning, instruction following, and factual accuracy.
GGUF is the standard format for quantized models on consumer hardware. Tools like llama.cpp, Ollama, and LM Studio all use GGUF files. When you download a model from HuggingFace, the filename tells you the quantization: model-Q4_K_M.gguf, model-Q5_K_M.gguf, etc.
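As a concrete example, here is a minimal sketch of loading a Q4_K_M GGUF through llama-cpp-python (one of the llama.cpp Python bindings). The model path is a placeholder for whatever file you downloaded, and it assumes `pip install llama-cpp-python` built with GPU support.

```python
# Minimal sketch: running a quantized GGUF with llama-cpp-python.
# The model path below is a placeholder for a file you downloaded yourself.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model-Q4_K_M.gguf",  # hypothetical local path
    n_gpu_layers=-1,   # offload every layer to the GPU if it fits
    n_ctx=4096,        # context window; larger values grow the KV cache
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```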
The quantization comparison table
| Quant | Bits/param | Quality vs FP16 | VRAM (7B) | VRAM (13B) | VRAM (34B) | VRAM (70B) |
|---|---|---|---|---|---|---|
| Q2_K | ~2.5 | 75-80% | ~2.5GB | ~5GB | ~12GB | ~25GB |
| Q3_K_M | ~3.5 | 85-90% | ~3.5GB | ~7GB | ~17GB | ~35GB |
| Q4_K_M | ~4.5 | 93-96% | ~4.5GB | ~8.5GB | ~21GB | ~42GB |
| Q5_K_M | ~5.5 | 96-98% | ~5.5GB | ~10GB | ~25GB | ~50GB |
| Q6_K | ~6.5 | 98-99% | ~6.5GB | ~12GB | ~30GB | ~60GB |
| Q8_0 | ~8 | 99%+ | ~8GB | ~15GB | ~38GB | ~75GB |
| FP16 | 16 | 100% | ~14GB | ~26GB | ~68GB | ~140GB |
VRAM estimates include ~1-2GB overhead for KV cache at moderate context lengths. Actual usage varies by model architecture and context window size.
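The table's numbers follow directly from parameter count times bits per parameter, plus overhead. Here is a rough back-of-the-envelope estimator, assuming the bits-per-parameter figures above and a flat 1.5GB allowance for KV cache and runtime buffers:

```python
# Rough VRAM estimate: weights = params * bits / 8, plus a flat overhead
# allowance for KV cache and runtime buffers. Real usage varies with
# architecture and context length, so treat results as ballpark figures.
BITS_PER_PARAM = {
    "Q2_K": 2.5, "Q3_K_M": 3.5, "Q4_K_M": 4.5,
    "Q5_K_M": 5.5, "Q6_K": 6.5, "Q8_0": 8.0, "FP16": 16.0,
}

def estimate_vram_gb(params_billions: float, quant: str, overhead_gb: float = 1.5) -> float:
    weights_gb = params_billions * BITS_PER_PARAM[quant] / 8
    return round(weights_gb + overhead_gb, 1)

# e.g. a 13B model at Q4_K_M: 13 * 4.5 / 8 + 1.5 ≈ 8.8 GB
print(estimate_vram_gb(13, "Q4_K_M"))
```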
The breakdown: when to use each level
Q4_K_M — the default choice
Use when: You want the best balance of quality and VRAM efficiency.
Q4_K_M preserves 93-96% of FP16 quality on most benchmarks. In the name, "_K" marks the k-quant family and "_M" the medium mix: the more sensitive tensors (parts of the attention and output layers) are stored at higher precision, while less critical tensors get fewer bits. This targeted allocation is why Q4_K_M outperforms naive 4-bit quantization by a meaningful margin.
For conversational AI, coding assistance, and general reasoning, Q4_K_M is virtually indistinguishable from FP16 in blind tests. We recommend it as the starting point for any model.
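If you want to see the mixed allocation for yourself, the `gguf` Python package that ships alongside llama.cpp can list the quantization type of each tensor in a file. A minimal sketch, assuming a local Q4_K_M file at a placeholder path:

```python
# Sketch: inspect which quant type each tensor in a GGUF file uses.
# In a Q4_K_M file you will typically see a mix, e.g. Q4_K for most
# weights and a higher-precision type for the more sensitive tensors.
from collections import Counter
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("./models/model-Q4_K_M.gguf")  # hypothetical path
types = Counter(t.tensor_type.name for t in reader.tensors)
for quant_type, count in types.most_common():
    print(f"{quant_type}: {count} tensors")
```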
Q5_K_M — the upgrade if you have headroom
Use when: You have 20-30% more VRAM than Q4 requires.
Q5_K_M closes most of the remaining gap to FP16. The quality improvement over Q4 is most noticeable on:
- Complex multi-step reasoning
- Creative writing with specific style constraints
- Code generation for less common languages
- Tasks requiring precise numerical reasoning
If your GPU has the VRAM to spare, Q5 is usually worth choosing over Q4. The model is ~20% larger, and because single-stream inference speed is bound by memory bandwidth rather than compute, expect tokens per second to drop roughly in proportion; noticeable on paper, but rarely enough to change the experience.
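A quick sanity check on that speed claim: in single-stream decoding the GPU reads essentially the whole weight file from VRAM for every generated token, so tokens per second is roughly memory bandwidth divided by model size. A back-of-the-envelope sketch, assuming a card with about 1,000 GB/s of bandwidth (roughly RTX 4090 class):

```python
# Back-of-the-envelope decode speed: each generated token reads roughly
# the full weight file from VRAM, so tok/s ceiling ≈ bandwidth / model size.
# Real numbers are lower (kernel overhead, KV cache reads), but the
# ratio between quant levels is what matters here.
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

bandwidth = 1000.0  # assumed ~RTX 4090-class memory bandwidth, GB/s
q4 = decode_ceiling_tok_s(bandwidth, 4.5)   # 7B @ Q4_K_M: ~222 tok/s ceiling
q5 = decode_ceiling_tok_s(bandwidth, 5.5)   # 7B @ Q5_K_M: ~182 tok/s ceiling
print(f"Q4 ceiling ≈ {q4:.0f} tok/s, Q5 ceiling ≈ {q5:.0f} tok/s ({q5 / q4:.0%} of Q4)")
```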
Q3_K_M — acceptable compromise
Use when: Your VRAM is tight and Q4 does not fit comfortably.
Q3 is the lowest we recommend for serious use. Quality degrades noticeably on reasoning-heavy tasks — you will see more hallucinations and logic errors compared to Q4. But for simple chat, summarization, and straightforward Q&A, Q3 models remain functional. If the alternative is not running the model at all, Q3 is a valid option.
Q6_K and Q8_0 — diminishing returns
Use when: You have abundant VRAM and want maximum quality.
The jump from Q5 to Q6 is marginal — maybe 1-2% on benchmarks. Q8 is nearly identical to FP16 in practice. These quantizations make sense for small models (7B at Q8 = ~8GB, easily fits on most GPUs) but become impractical for larger models. Running a 34B at Q8 needs ~38GB — beyond any single consumer GPU.
Q2_K and below — last resort
Use when: You absolutely must fit a specific model on limited hardware and accept significant quality loss.
Q2 models lose 20-25% of FP16 quality. Reasoning degrades substantially. Instruction following becomes unreliable. We do not recommend Q2 for anything beyond experimentation.
Dynamic quantization: the new frontier
Unsloth introduced its UD (Unsloth Dynamic) quantization scheme in 2025, and it is gaining traction in 2026. UD-Q2, UD-Q3, and UD-Q4 variants use variable bit allocation across layers: critical layers get more bits, less important layers get fewer. The result is that a UD-Q3 model can match traditional Q4_K_M quality at Q3-level VRAM usage.
If you see UD-quantized models on HuggingFace, prefer them over standard quants at the same nominal bit level. The VRAM savings are real and the quality is measurably better.
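Checking what a repo actually offers is easy to script with the huggingface_hub client. A minimal sketch; the repo id is a placeholder, not a recommendation:

```python
# Sketch: list GGUF files in a HuggingFace repo and flag UD-prefixed quants.
# The repo id below is a placeholder; substitute the model you actually want.
from huggingface_hub import list_repo_files

repo_id = "some-org/some-model-GGUF"  # hypothetical repo id
gguf_files = [f for f in list_repo_files(repo_id) if f.endswith(".gguf")]

ud_quants = [f for f in gguf_files if "UD-" in f]
print("Dynamic (UD) quants available:" if ud_quants else "No UD quants found.")
for f in ud_quants:
    print(" ", f)
```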
Practical recommendations by GPU
| GPU | VRAM | Best quant for 7B | Best quant for 14B | Best quant for 34B |
|---|---|---|---|---|
| RTX 3060 12GB | 12GB | Q8_0 | Q4_K_M | Won't fit |
| RTX 4060 Ti 16GB | 16GB | Q8_0 | Q5_K_M | Won't fit |
| RTX 4090 | 24GB | FP16 | Q8_0 | Q4_K_M |
| RTX 5090 | 32GB | FP16 | FP16 | Q5_K_M |
The pattern is simple: use the highest-precision quant your VRAM can hold while leaving 2-3GB of headroom for the KV cache.
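That rule is mechanical enough to script. A minimal sketch, reusing the bits-per-parameter figures from the comparison table and assuming a 2.5GB headroom target:

```python
# Sketch: pick the highest-precision quant that fits in a given VRAM budget,
# leaving headroom for the KV cache. Ordered from highest precision down.
QUANTS = [("FP16", 16.0), ("Q8_0", 8.0), ("Q6_K", 6.5),
          ("Q5_K_M", 5.5), ("Q4_K_M", 4.5), ("Q3_K_M", 3.5), ("Q2_K", 2.5)]

def pick_quant(vram_gb: float, params_billions: float, headroom_gb: float = 2.5) -> str:
    budget = vram_gb - headroom_gb
    for name, bits in QUANTS:
        if params_billions * bits / 8 <= budget:
            return name
    return "Won't fit (even at Q2_K)"

# e.g. a 24GB RTX 4090 running a 34B model -> Q4_K_M
print(pick_quant(24, 34))
```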
Common mistakes
- Defaulting to Q8 or FP16 "for quality." Unless you are evaluating or fine-tuning, Q8 is overkill for inference. Q5_K_M captures nearly all the quality at 60-70% of the VRAM cost.
- Using Q2/Q3 to fit a bigger model. Running a 70B at Q2 is almost always worse than running a 34B at Q4. A well-quantized smaller model beats a poorly quantized larger one.
- Ignoring the _K_M suffix. The legacy Q4_0 format and Q4_K_M are not the same. Always prefer the k-quant variants; they allocate bits more intelligently.
- Not checking for UD quants. Before downloading a standard Q4_K_M, check if a UD-Q4 version exists. Same VRAM, better quality.
Final answer
| Situation | Recommended quant |
|---|---|
| General use, most users | Q4_K_M |
| Have VRAM headroom (~20%+) | Q5_K_M |
| VRAM-constrained | Q3_K_M |
| Small models (7B) on 16GB+ | Q8_0 |
| Evaluating/benchmarking | FP16 |
Q4_K_M remains king in 2026. The quality-to-VRAM ratio is unmatched. Upgrade to Q5 when you can, drop to Q3 when you must, and check for UD quants before downloading anything.
For VRAM planning across model sizes, see how much VRAM for local LLM. Running models through Ollama? Our best GPU for Ollama guide covers setup. Budget shoppers should check best budget GPU for local LLM for affordable options. And if you want to push the limits with a single GPU, read how to run 70B on a single GPU.
Related guides on Best GPU for LLM
- How Much VRAM for Local LLMs in 2026? Full Q4-Q8 Guide
- Can the RTX 4060 Ti Run Llama 70B in 2026? (Honest)
- Can the RTX 5070 Run 34B Models in 2026? (Analyzed)
Read the full guide on Best GPU for LLM — includes our VRAM calculator, GPU comparison table, and live pricing.