When I started running open models locally, I was paranoid about quantization. Lower bit depths seemed like cutting corners. After months of testing, I've changed my mind: for most use cases, 4-bit quantization is the practical sweet spot.
Here's what I found. A 70B model at 16-bit precision takes roughly 140GB of VRAM for the weights alone, and an 8-bit quantized version still needs about 70GB. A 4-bit version of the same model takes 35-45GB. That's not a rounding difference. For anyone without a $40K GPU cluster, that's the difference between "possible" and "impossible."
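The arithmetic behind these sizes is easy to check yourself. Here is a rough sketch that counts weights only; the function name and the nominal 70e9 parameter count are my own illustrative choices. Real usage adds KV cache, activations, and per-block quantization metadata on top, which is why a 4-bit 70B model lands at 35-45GB rather than exactly 35GB.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS_70B = 70e9  # nominal parameter count, assumed for illustration

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(N_PARAMS_70B, bits):.0f} GB")
```

Halving the bits per weight halves the weight footprint, so the step from 8-bit to 4-bit frees about as much memory as the entire 4-bit model occupies.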
The quality hit is smaller than the math suggests. With techniques like NF4 (used in bitsandbytes) and the k-quant formats in llama.cpp, 4-bit models in my testing lose maybe 2-5% of reasoning quality on structured tasks, while staying nearly identical on text generation, creative writing, and instruction following. I tested this by running the same prompts through 8-bit and 4-bit versions of Llama 2 70B back to back. The 4-bit output was almost never noticeably worse unless I was explicitly pushing the model's reasoning limits.
Where 4-bit breaks down: long-context work under heavy load. If you're doing intensive reasoning with 8k+ token context, or running batch inference with dozens of requests, you'll see degradation. For those cases, 8-bit or higher is worth the VRAM cost. But for typical interactive chat and most production workloads on modest hardware, 4-bit is the realistic choice.
The practical implication: don't default to "highest quality quantization." Start at 4-bit, run your actual workload, and only step up if the quality gap you measure justifies the extra hardware. Most people find 4-bit is good enough and never move.
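One way to make "run your actual workload" concrete is a small side-by-side harness. This is a minimal sketch, not a benchmark suite: `compare_quant_levels`, the two `generate_*` callables, and `score` are all hypothetical names. You would wire the callables to whatever runtime you use and supply a scoring function that fits your task.

```python
from typing import Callable, Sequence


def compare_quant_levels(
    prompts: Sequence[str],
    generate_low: Callable[[str], str],   # hypothetical: calls your 4-bit model
    generate_high: Callable[[str], str],  # hypothetical: calls your 8-bit model
    score: Callable[[str, str], float],   # hypothetical: rates (prompt, output)
) -> float:
    """Return the fraction of prompts where the low-bit output scores
    at least as well as the high-bit output."""
    if not prompts:
        raise ValueError("need at least one prompt")
    holds_up = sum(
        1
        for p in prompts
        if score(p, generate_low(p)) >= score(p, generate_high(p))
    )
    return holds_up / len(prompts)


# Stub usage: replace the lambdas with real model calls and a real metric.
share = compare_quant_levels(
    prompts=["Summarize this ticket.", "Write a haiku about VRAM."],
    generate_low=lambda p: p.upper(),
    generate_high=lambda p: p.upper(),
    score=lambda prompt, output: float(len(output)),
)
print(f"4-bit held up on {share:.0%} of prompts")
```

If that share stays high on prompts drawn from your real traffic, there is little reason to pay for the larger footprint.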
One concrete number: 4-bit Mistral 7B runs at 50+ tokens/second on a single RTX 4060 Ti (8GB). 8-bit tops out around 20 tokens/second on the same card, because token generation is limited by memory bandwidth and the 8-bit weights are twice as much data to read per token. Speed gains are real.
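The bandwidth argument can be turned into a back-of-the-envelope ceiling: when generating one token at a time, every token requires reading all the weights once, so throughput cannot exceed memory bandwidth divided by weight size. The sketch below assumes roughly 288 GB/s for the RTX 4060 Ti and about 7.2e9 parameters for Mistral 7B. Both are approximate figures I'm plugging in for illustration, and the function name is my own.

```python
def decode_ceiling_tok_per_s(
    n_params: float, bits_per_weight: float, bandwidth_gb_per_s: float
) -> float:
    """Upper bound on single-stream decode speed when every token
    requires one full read of the weights from VRAM."""
    weight_gb = n_params * bits_per_weight / 8 / 1e9
    return bandwidth_gb_per_s / weight_gb


BANDWIDTH_GB_PER_S = 288.0  # assumed RTX 4060 Ti memory bandwidth
N_PARAMS_7B = 7.2e9         # assumed approximate Mistral 7B parameter count

for bits in (8, 4):
    ceiling = decode_ceiling_tok_per_s(N_PARAMS_7B, bits, BANDWIDTH_GB_PER_S)
    print(f"{bits}-bit ceiling: {ceiling:.0f} tokens/second")
```

Under these assumptions the ceilings come out near 40 and 80 tokens/second. Measured speeds sit below the ceiling because of KV-cache reads, dequantization work, and kernel overhead, but the 2x ratio between the two bit depths is the part that carries over.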