Patrick Hughes

Posted on Jun 9 • Edited on Jun 30 • Originally published at bmdpat.com

Which GGUF Quant Should You Actually Pick? Q4 vs Q5 vs Q6 vs Q8 (2026)

#localllm #gguf #quantization #gpu

You know what Q4, Q5, and Q8 mean. Now the real question: which one do you actually download?

If you need the background on what these numbers represent, start with the original: GGUF quantization Q4 Q5 Q8 explained. This post is the decision guide.

The tradeoff in one line

A lower number means a smaller file, less VRAM, faster loading, and somewhat worse output. A higher number means a bigger file, more VRAM, and output closer to the original model.

That is the whole game. You are trading quality for size. The trick is knowing how much quality you actually lose, and the answer is: less than you think at the high end, more than you think at the low end.

Start with what fits

The first filter is not quality. It is VRAM.

Pick the largest quant that fits on your GPU with room for context. A model you can fully offload at Q4 will usually run faster than the same model at Q5 with some layers left in system RAM and running on the CPU. Fit beats precision when fit decides speed.

So measure your VRAM, subtract headroom for the KV cache, and see which quant lands under that ceiling. That narrows the choice fast.
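The budgeting step above can be sketched in a few lines of Python. This is a rough estimate, not a measurement: the bits-per-weight figures are approximations (the K_M values are mixes that vary by model), the KV cache formula assumes an unquantized fp16 cache on a standard transformer, and the model dimensions and the 1 GiB of slack for compute buffers are hypothetical.

```python
from typing import Optional

# Approximate bits per weight. Q8_0 and Q6_K follow from their block layout;
# the K_M values are mixed-precision averages and vary from model to model.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.56,
    "Q8_0": 8.5,
}

GIB = 2**30


def weights_gib(n_params: float, quant: str) -> float:
    """Estimated size of the quantized weights in GiB."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / GIB


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_ctx: int, bytes_per_elem: int = 2) -> float:
    """Estimated KV cache size: keys plus values for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / GIB


def largest_fit(n_params: float, vram_gib: float,
                headroom_gib: float) -> Optional[str]:
    """Return the largest quant whose weights fit under vram minus headroom."""
    budget = vram_gib - headroom_gib
    # Try the biggest quant first and walk down the ladder.
    for quant in sorted(BITS_PER_WEIGHT, key=BITS_PER_WEIGHT.get, reverse=True):
        if weights_gib(n_params, quant) <= budget:
            return quant
    return None  # nothing fits: consider a smaller model


if __name__ == "__main__":
    # Hypothetical 8B model: 32 layers, 8 KV heads, head size 128, 8192 context.
    kv = kv_cache_gib(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=8192)
    headroom = kv + 1.0  # assumed 1 GiB of slack for compute buffers
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: {weights_gib(8e9, quant):.1f} GiB")
    print("pick on an 8 GiB card:", largest_fit(8e9, 8.0, headroom))
```

Treat the result as a starting point. The file sizes listed on the download page and a real load test on your card are the final word.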

What the K-quants give you

You will see names like Q4_K_M and Q5_K_M. The K marks them as K-quants, the newer family of quantization formats in llama.cpp. They are smarter than the old flat quants such as Q4_0.

K-quants spend more bits on the parts of the model that matter most and fewer bits on the rest. For the same file size, a K-quant holds quality better than a plain one. This is why Q4_K_M became the default many people reach for.

The M and S suffixes mean medium and small. M keeps more quality. S shrinks further. When in doubt, pick M.

The practical ladder

Here is how the common options stack up for someone on a consumer GPU.

Q4_K_M is the workhorse. Smallest footprint that still feels like the real model. If you are tight on VRAM, start here.

Q5_K_M is the safe upgrade. A noticeable quality bump over Q4 for a modest size increase. If it fits, many people prefer it.

Q6_K is close to lossless for most tasks. Bigger, but the quality gap to the full model is small. Good when you have VRAM to spare and want margin.

Q8_0 is near the original. The difference from full precision is hard to notice in normal use. It is large, so you only pick it when size is not a concern.

How quality actually falls off

Think of it as a curve, not a line.

Going from Q8 down to Q5, the quality loss is small. The model barely changes for most prompts. You get a big size win for almost no cost.

Going below Q4, the loss grows fast. Q3 and Q2 start making real mistakes: weaker reasoning, more repetition, shakier instruction following. They exist for cases where a model simply will not fit otherwise.

So the sweet spot for most people sits between Q4_K_M and Q6_K. Above that you pay size for little gain. Below that you lose quality faster than you save space.

A simple decision flow

Can you fit Q6_K with your context? Take it. Near-lossless, done.

Cannot fit Q6 but can fit Q5_K_M? Take that. Strong quality, smaller.

Tight on VRAM? Q4_K_M is the floor that still feels right.

Cannot even fit Q4? Drop to a smaller model at Q4 before you drop to Q3 of a bigger one. A smaller model at a healthy quant usually beats a big model crushed too hard.
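The flow above is short enough to write down as a function. A minimal sketch, assuming you already know each quant's size and your VRAM budget after context; the sizes in the example are hypothetical.

```python
from typing import Dict

# The ladder from the decision flow, best quality first.
LADDER = ("Q6_K", "Q5_K_M", "Q4_K_M")


def pick_quant(size_gib: Dict[str, float], budget_gib: float) -> str:
    """Return the first quant on the ladder that fits the budget."""
    for quant in LADDER:
        if size_gib[quant] <= budget_gib:
            return quant
    # Nothing fits: step down in model size rather than dropping to Q3.
    return "smaller model at Q4_K_M"


if __name__ == "__main__":
    # Hypothetical file sizes in GiB for one model.
    sizes = {"Q6_K": 6.1, "Q5_K_M": 5.3, "Q4_K_M": 4.5}
    for budget in (9.0, 6.0, 5.0, 4.0):
        print(f"{budget} GiB budget -> {pick_quant(sizes, budget)}")
```

The fallback string is deliberate: the function never returns Q3 or Q2, because the flow says to shrink the model first.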

Quant and offload are the same budget

Your quant choice and your --n-gpu-layers setting pull from the same VRAM pool. A smaller quant frees room to offload more layers, which is what makes the model fast.

If you have not tuned your layer offload yet, read the companion post: the n-gpu-layers tuning guide. Pick the quant and the layer count together. They are one decision wearing two hats.
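To see how the two settings share one budget, here is a rough sketch that estimates how many layers fit for a given quant. It assumes the weights are spread evenly across layers, which ignores the embedding and output tensors, and the sizes and layer count are hypothetical.

```python
def layers_that_fit(weights_gib: float, n_layers: int, budget_gib: float) -> int:
    """Estimate how many layers fit in the VRAM left after context."""
    if budget_gib <= 0:
        return 0
    per_layer = weights_gib / n_layers  # assumes evenly sized layers
    return min(n_layers, int(budget_gib // per_layer))


if __name__ == "__main__":
    # Hypothetical 32-layer model with 5 GiB left after the KV cache.
    for quant, size in (("Q5_K_M", 5.3), ("Q4_K_M", 4.5)):
        n = layers_that_fit(size, n_layers=32, budget_gib=5.0)
        print(f"{quant}: try --n-gpu-layers {n}")
```

In this made-up case the smaller quant is the one that offloads every layer. Use the number as a first guess for --n-gpu-layers and adjust from there.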

Why any of this matters

Running local models is a cost play. Every prompt you answer on your own card is a prompt you did not pay an API to handle. Picking the right quant means more capable output per dollar of hardware you already bought.

That same discipline, getting the most value while keeping spend capped, is what AgentGuard enforces for AI agents. It sets hard limits on tokens, cost, and call rate so a loop cannot run up a bill while you sleep. Local inference trims your fixed cost. AgentGuard caps your variable cost.

If you run agents and want a real ceiling on spend, check out AgentGuard.

Originally published on bmdpat.com. I run a one-person AI agent company and write about what actually works.
