plasmon
8-Bit Quantization Destroyed 92% of Code Generation — The Culprit Wasn't Bit Count

If you run local LLMs, you probably assume "Q4 loses quality" and "Q8 is safe." More bits = better quality. Obvious.

A 2025 arXiv paper (Dong et al., arXiv:2508.16712) destroyed this assumption with measured data. 8-bit quantization killed 92% of the HumanEval pass rate on a 13B model. The worst 4-bit degradation was 22%. 8-bit lost to 4-bit.

Read in isolation, this makes no sense. But dig into the cause and you'll find the essential distinction that the word "quantization" conceals. The bit count wasn't the problem. What got quantized was.


"Quantization" Is Not One Operation

When the local LLM community says "quantization," two fundamentally different operations get conflated.

Weight-only Quantization:
  Only model weight parameters converted from FP16 → low-bit
  Inference activations (intermediate computations) remain FP16
  Examples: GGUF Q4_K_M, GPTQ, AWQ

Weight-Activation Quantization:
  Both weights AND inference intermediate values converted to low-bit
  Examples: W8A8-INT, W8A8-FP, SmoothQuant

Notation: W{weight_bits}A{activation_bits}-{format}
  W4A16 = 4-bit weights, FP16 activations (weight-only)
  W8A8-INT = 8-bit weights, INT8 activations (both quantized)

llama.cpp's GGUF formats (Q4_K_M, Q5_K_M, Q8_0) are all weight-only quantization. Activations stay in FP16/FP32. This makes them W4A16-family methods.

The 92% destruction came from W8A8-INT — weight + activation quantization, with activations rounded to INT8 (8-bit integer). Activations contain outlier channels with a wide dynamic range (the very problem SmoothQuant exists to mitigate); forcing them onto INT8's uniform grid distorts the intermediate state of inference far more than rounding static weights does.

"8-bit is safe, 4-bit is dangerous" is wrong. What you quantize determines quality.
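The weight-only vs. weight-activation distinction is mechanical enough to encode. Here's a minimal sketch (my own helper, not from the paper or any library) that classifies a spec written in the W{weight_bits}A{activation_bits}-{format} notation:

```python
import re

def classify_quant(spec: str) -> str:
    """Classify a W{w}A{a}[-{fmt}] spec as weight-only or weight+activation."""
    m = re.match(r"W(\d+)A(\d+)(?:-(\w+))?$", spec)
    if not m:
        raise ValueError(f"not W..A.. notation: {spec}")
    a_bits = int(m.group(2))
    fmt = m.group(3) or "INT"
    if a_bits >= 16:
        return "weight-only"          # activations stay FP16+, GGUF-style
    return f"weight+activation ({fmt})"

print(classify_quant("W4A16"))      # weight-only
print(classify_quant("W8A8-INT"))   # weight+activation (INT)
```

The rule of thumb the function encodes: if the A number is 16 or higher, you're in GGUF/GPTQ/AWQ territory; if it's 8 or lower, activations are being quantized too, and the format suffix (INT vs FP) starts to matter a lot.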


The Quality Degradation Map

Dong et al. (2025) tested 11 quantization methods across 4 model sizes (7B/13B/34B/70B) using the qMeter framework. GPUs: A100/H100.

Degradation by Task Type

Benchmarks tested:
  Chat-S (commonsense): HellaSwag, ARC-C, Winogrande, TriviaQA
  Chat-R (reasoning): BigBench-Hard, MMLU
  Chat-M (math): GSM8K, GPQA Diamond
  Code: HumanEval
  Summary: NewsQA (ROUGE)
| Method | Type | Chat-S Loss | Code Loss | Worst Case |
|---|---|---|---|---|
| W8A16-INT | Weight-only 8-bit | 5-10% | 5-10% | Mild |
| W8A8-FP | Weight+Act 8-bit (FP) | 5-10% | Moderate | Moderate |
| W4A16-INT | Weight-only 4-bit | 5-15% | Up to 22% | Moderate |
| W8A8-INT | Weight+Act 8-bit (INT) | Med-High | Up to 92% | Catastrophic |

Only W8A8-INT shows anomalous degradation. The same 8-bit width with W8A16-INT (weight-only) is mild. INT activation quantization is the culprit.
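To make the relative percentages concrete: a "92% loss" is a relative drop from the FP16 baseline, not 92 percentage points. The 30% baseline below is a made-up illustrative figure, not a number from the paper:

```python
# Illustrative arithmetic only — the FP16 baseline is hypothetical.
fp16_pass_at_1 = 0.30   # assumed HumanEval pass@1 before quantization
relative_loss  = 0.92   # W8A8-INT worst case reported by Dong et al.
quantized = fp16_pass_at_1 * (1 - relative_loss)
print(f"pass@1 after W8A8-INT: {quantized:.3f}")  # collapses to ~2.4%
```

In other words, a model that solved nearly a third of HumanEval drops to solving almost none of it — a different model in practice, not a slightly degraded one.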

Model Size × Vulnerability

# Pattern from paper data
vulnerability_by_size = {
    "13B": {
        "description": "widespread and severe",
        "worst_case": "HumanEval -92% (W8A8-INT)",
        "note": "Most vulnerable. Large degradation across all tasks"
    },
    "34B": {
        "description": "moderate to severe",
        "worst_case": "Code/Math severe, Chat-S/R moderate",
        "note": "Degradation severity splits by task type"
    },
    "70B": {
        "description": "mostly resilient",
        "worst_case": "Code -22%, Chat-S nearly lossless",
        "note": "Larger models have higher quantization resilience"
    }
}
# Pattern: smaller models are more vulnerable to quantization
# 7B-13B that fit on 8GB VRAM are in the most vulnerable zone

This aligns with intuition. Smaller models have less redundancy. Each parameter carries more information, so quantization-induced information loss hits harder.


What This Means for RTX 4060 8GB Users

Translating the paper's results to our use case.

The Good News

llama.cpp's GGUF quantization (Q4_K_M etc.) is weight-only quantization. It's structurally different from W8A8-INT that caused 92% destruction.

# llama.cpp GGUF = W_xA16 family
gguf_formats = {
    "Q4_K_M": {"weight_bits": 4.8, "activation": "FP16/FP32",
               "type": "weight-only", "risk": "low-moderate"},
    "Q5_K_M": {"weight_bits": 5.5, "activation": "FP16/FP32",
               "type": "weight-only", "risk": "low"},
    "Q8_0":   {"weight_bits": 8.5, "activation": "FP16/FP32",
               "type": "weight-only", "risk": "lowest"},
}
# Activations always FP16+ → W8A8-INT catastrophic failure can't happen

As long as you're running Q4_K_M, the 92% degradation is structurally impossible. Relax.
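The effective bits-per-weight numbers above also give you a quick back-of-envelope file-size estimate: parameters × bpw / 8 bytes. This is a rough sketch that ignores KV cache, context buffers, and runtime overhead, so real VRAM usage will be higher:

```python
# Back-of-envelope GGUF file size from effective bits per weight (bpw).
# bpw values mirror the table above; KV cache and overhead NOT included.
def gguf_size_gb(n_params: float, bpw: float) -> float:
    return n_params * bpw / 8 / 1e9

for name, bpw in [("Q4_K_M", 4.8), ("Q5_K_M", 5.5), ("Q8_0", 8.5)]:
    print(f"9B {name}: ~{gguf_size_gb(9e9, bpw):.1f} GB")
```

For a 9B model this lands around 5.4 GB at Q4_K_M versus roughly 9.6 GB at Q8_0 — which is exactly why Q8_0 doesn't fit a 9B model plus context on an 8GB card.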

The Caveats

  1. Smaller models are more vulnerable: The 7B-13B range that fits on 8GB is the fragile zone. Weight-only quantization still causes 5-22% degradation with Q4_K_M

  2. Code generation is the most vulnerable task: HumanEval degradation consistently exceeds other tasks. For code generation with local LLMs, use the largest model or highest-bit quantization you can fit

  3. Commonsense tasks are resilient: Chat-S (HellaSwag etc.) resists quantization well. If chat and Q&A are your primary use, degradation is minimal

# Task-specific quantization resilience (paper data + hands-on experience)
task_resilience = {
    "Chat / conversation": {"resilience": "high", "safe_quant": "Q4_K_M"},
    "Knowledge Q&A":       {"resilience": "high", "safe_quant": "Q4_K_M"},
    "Math / reasoning":    {"resilience": "medium", "safe_quant": "Q5_K_M preferred"},
    "Code generation":     {"resilience": "low", "safe_quant": "Q5_K_M+ preferred"},
    "Summarization":       {"resilience": "medium", "safe_quant": "Q4_K_M"},
}

Next Generation: Does NVFP4 Change the Game?

A January 2026 arXiv paper (arXiv:2601.09527) evaluated local LLM deployment on NVIDIA's consumer Blackwell GPU (RTX 5090).

The headline number is NVFP4 (NVIDIA floating-point 4-bit):

NVFP4 performance (RTX 5090):
  - 1.6x throughput over BF16
  - 41% energy reduction
  - Quality loss: only 2-4%

Models tested: Qwen3-8B, Gemma3-12B/27B, GPT-OSS-20B

Inference cost:
  $0.001-0.04 per million tokens (electricity only)
  → 40-200x cheaper than budget cloud APIs
  → Hardware ROI under 4 months at moderate usage

NVFP4 is hardware-native 4-bit floating-point support. Unlike software quantization (GGUF Q4_K_M), the GPU's tensor cores execute 4-bit arithmetic directly. The 2-4% quality loss matches or beats software Q4_K_M.
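The electricity-only cost figure is easy to sanity-check yourself. The power draw, batched throughput, and electricity rate below are my own assumptions for illustration, not figures from the paper:

```python
# Illustrative electricity-only cost per million tokens.
# 450 W draw, 1000 tok/s batched throughput, and $0.15/kWh are ASSUMPTIONS.
watts, tok_per_s, usd_per_kwh = 450, 1000, 0.15
seconds = 1_000_000 / tok_per_s          # time to generate 1M tokens
kwh = watts * seconds / 3600 / 1000      # energy consumed in kWh
cost = kwh * usd_per_kwh
print(f"~${cost:.4f} per million tokens")
```

Under these assumptions you land around two cents per million tokens — inside the paper's $0.001-0.04 range, and the source of the "40-200x cheaper than budget cloud APIs" claim.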

RTX 4060 reality:

  • RTX 4060 (Ada Lovelace) doesn't have NVFP4. Next-gen GPU required
  • GGUF Q4_K_M remains the best current option
  • Upgrading to RTX 5060 Ti or above unlocks NVFP4 + larger VRAM

Quantization Selection Decision Flow

Practical guide integrating data from all three papers:

Q1: What task?
├── Code generation → Q5_K_M+ recommended (most vulnerable task)
├── Math / reasoning → Q5_K_M preferred, Q4_K_M acceptable
├── Chat / Q&A → Q4_K_M is sufficient
└── Summarization / translation → Q4_K_M is sufficient

Q2: Model size?
├── 7B-13B → Weak against quantization. Go one bit level higher if possible
├── 27B-35B → Moderate resilience. Q4_K_M generally safe
└── 70B+ → High resilience. Partial offload required on 8GB anyway

Q3: VRAM budget?
├── Inference only → Use highest bit quantization that fits
└── Inference + Embedding → Lock Q4_K_M, prioritize VRAM savings
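The flow above collapses into a small lookup. This is a sketch of this article's rules of thumb, not hard limits from the papers:

```python
# Sketch of the decision flow above; thresholds are rules of thumb.
def pick_quant(task: str, model_b: float) -> str:
    """task: 'code', 'math', 'chat', 'summarization'; model_b: params in billions."""
    fragile_tasks = {"code": "Q5_K_M", "math": "Q5_K_M"}
    base = fragile_tasks.get(task, "Q4_K_M")
    # 7B-13B models are the most quantization-sensitive zone:
    # go one bit level higher when VRAM allows.
    if model_b <= 13 and base == "Q4_K_M":
        return "Q4_K_M (consider Q5_K_M if VRAM allows)"
    return base

print(pick_quant("code", 9))   # Q5_K_M
print(pick_quant("chat", 9))   # Q4_K_M (consider Q5_K_M if VRAM allows)
```

Nothing magic here — the point is that task type drives the choice before bit count does.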

Recommendation Matrix (RTX 4060 8GB)

| Model | Code Gen | Chat | RAG + Inference |
|---|---|---|---|
| 9B | Q5_K_M (7.5GB) | Q4_K_M (7.1GB) | Q4_K_M + BGE-M3 at limit |
| 32B (ngl=24) | Q4_K_M only | Q4_K_M only | Not recommended |
| 35B-A3B MoE | Q4_K_M (7.6GB) | Q4_K_M (7.6GB) | Not recommended |

Stop Talking About Quantization by Bit Count Alone

The conclusion is simple.

"Q4 loses quality" is imprecise. "What you quantize and how" determines quality.

  • Weight-only 4-bit (Q4_K_M): 5-22% degradation, practical
  • Weight+activation 8-bit INT (W8A8-INT): up to 92% degradation, catastrophic
  • 4-bit beating 8-bit is a documented reality

Bit count is necessary but not sufficient for quantization quality. If you're using llama.cpp/GGUF, you're in weight-only territory, so catastrophic failure won't happen. But that's because GGUF's design is smart, not because "4-bit is safe."

Next time you pick a model, check the quantization method — not just the bit count. GGUF vs GPTQ vs AWQ. Weight-only vs weight+activation. That one extra check might save 92% of your model's capability.


References

  1. Dong, J. et al. "Systematic Characterization of LLM Quantization: A Performance, Energy, and Quality Perspective" (2025) arXiv:2508.16712
  2. "Private LLM Inference on Consumer Blackwell GPUs: A Practical Guide for Cost-Effective Local Deployment in SMEs" (2026) arXiv:2601.09527
  3. "A Systematic Evaluation of On-Device LLMs: Quantization, Performance, and Resources" (2026) arXiv:2505.15030
