DEV Community

Billy Bob Gurr
Billy Bob Gurr

Posted on

Most people starting with local LLMs jump straight to 4-bit quantization because it's fast and uses

I tested the same model (Mistral 7B) in three formats: full precision (16-bit), 8-bit, and 4-bit. On inference speed, yes, 4-bit was fastest. But here's what surprised me: the quality gap between 8-bit and 4-bit was visible on reasoning tasks. Writing tasks didn't suffer much. Math almost always came out wrong with 4-bit.

The real tradeoff isn't speed versus accuracy. It's what you're actually doing with the model.

For creative tasks (summarization, rewriting), 4-bit is fine. For anything requiring precision (code generation, math, fact retrieval), start with 8-bit. You'll get 70-80% of the speed benefit with close to zero quality loss.

RAM and VRAM matter too. A 7B model in 8-bit needs about 14GB VRAM. 4-bit cuts that to 4-5GB. If you're running on an RTX 4060 (8GB), 4-bit is your only option. But if you have a 16GB GPU or you're offloading to system RAM, 8-bit is the better default.

The mistake most people make: picking quantization based on hardware alone. Pick based on your task first, then your hardware constraints.

Top comments (0)