MxGuru

Posted on May 20

When the Sensitivity Metric Lies: A Drift-Inversion Smoking Gun in Mixed-Precision LLM Quantization

#quantization #hsaq #awq #granite

The HSAQ pipeline (Hybrid Sensitivity-Aware Quantization) is supposed to do one thing well: spend bits where they hurt. Profile each Linear layer's output drift under 2/3/4-bit quantization on real calibration data, then let a greedy allocator distribute the bit budget so total drift is minimized under the VRAM ceiling.
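To make the allocator idea concrete, here is a minimal sketch of a greedy bit allocator. It is not HSAQ's actual code: `LayerProfile`, `greedy_allocate`, the toy drift numbers, and the "drift reduction per extra bit of storage" score are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LayerProfile:
    name: str
    n_params: int
    drift: dict  # bit-width -> measured output drift at that bit-width


def greedy_allocate(layers, budget_bits, ladder=(2, 3, 4)):
    """Start every layer on the lowest rung, then keep buying the single
    upgrade with the largest drift reduction per extra bit of storage."""
    assignment = {layer.name: ladder[0] for layer in layers}
    used = sum(layer.n_params * ladder[0] for layer in layers)
    while True:
        best = None  # (score, layer name, new bit-width, cost in bits)
        for layer in layers:
            rung = ladder.index(assignment[layer.name])
            if rung + 1 == len(ladder):
                continue  # already at the top of the ladder
            cur, nxt = ladder[rung], ladder[rung + 1]
            cost = layer.n_params * (nxt - cur)
            if used + cost > budget_bits:
                continue  # this upgrade does not fit in the budget
            score = (layer.drift[cur] - layer.drift[nxt]) / cost
            if best is None or score > best[0]:
                best = (score, layer.name, nxt, cost)
        if best is None or best[0] <= 0:
            break  # nothing fits, or no upgrade lowers measured drift
        _, name, bits, cost = best
        assignment[name] = bits
        used += cost
    return assignment


# Hypothetical profiles: layer "b" reports higher drift at 4-bit than at 3-bit.
layers = [
    LayerProfile("a", 100, {2: 10.0, 3: 5.0, 4: 1.0}),
    LayerProfile("b", 100, {2: 4.0, 3: 1.26, 4: 6.51}),
]
print(greedy_allocate(layers, budget_bits=800))
```

Note what happens to layer "b": even with budget to spare, an allocator of this shape never promotes a layer whose measured drift rises with bit-width. The allocator is only as good as the signal it optimizes.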

That works. Until it doesn't.

This is the story of one experiment — Phase-3a, run 2026-05-19 on ibm-granite/granite-3.3-8b-instruct — that broke a quiet assumption underneath the whole approach. The drift metric mismeasures real PPL impact on outlier-heavy attention layers. Worse, it mismeasures it in the wrong direction: the harder you push the metric down, the more outliers can sometimes corrupt generation.

The setup

HSAQ's baseline on granite-3.3-8B at a 12 GB consumer VRAM budget produces a mixed assignment averaging ~3.3 bits per Linear across 281 quantized modules. Measured against bf16, this baseline lands at:

Metric	bf16 baseline	HSAQ baseline	Δ
Wikitext perplexity	8.756	10.013	+14.42%

A +14.42% PPL hit is rough. Target was <8% (a soft "you can still feel it but it's usable" line in our internal eval). The first thing you do when the budget is the constraint is examine the residue — which layers are at the bottom of the bit-ladder, and could a small structural rule move them up?

After baseline assignment, 16 of 281 Linears sit at 3-bit (the rest at 4):

7 × mlp.down_proj: FFN down-projections (~59M params each, the allocator's favorite victims)
6 × self_attn.o_proj: attention output projections (the outlier-heavy ones)
2 × mlp.gate_proj (L0, L39)
1 × self_attn.q_proj (L34)

The Phase-3a intervention was simple: force all o_proj layers to a minimum of 4 bits, regardless of allocator preference. Six layers move 3 → 4. About 0.05 GB of weight budget gets reallocated. Re-run end to end.
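As a rough illustration, a floor like this can be applied as a post-processing pass over the allocator's output. The function name `apply_floor`, the regex, and the example assignment are hypothetical, not taken from the HSAQ codebase.

```python
import re


def apply_floor(assignment, pattern=r"\.self_attn\.o_proj$", min_bits=4):
    """Return a copy of the assignment with matching layers raised to
    min_bits, plus the list of layers that were actually promoted."""
    floored, promoted = {}, []
    for name, bits in assignment.items():
        if re.search(pattern, name) and bits < min_bits:
            floored[name] = min_bits
            promoted.append(name)
        else:
            floored[name] = bits
    return floored, promoted


# Hypothetical allocator output for three layers.
assignment = {
    "model.layers.21.self_attn.o_proj": 3,
    "model.layers.0.mlp.gate_proj": 3,
    "model.layers.1.self_attn.o_proj": 4,
}
floored, promoted = apply_floor(assignment)
print(promoted)
```

Keeping the floor as a separate pass leaves the allocator and its cached sensitivities untouched, which is what makes a two-run comparison differing only in the floor parameter possible.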

The result

Metric	HSAQ baseline	HSAQ + o_proj floor	Δ
PPL above bf16	+14.42%	+13.80%	-0.62pp

A real improvement. Small (about 4% relative on the gap to bf16) but real. And reproducible: the baseline run inside the same job matched yesterday's baseline to 4 decimal places (10.0133 → 10.0133), so the pipeline is deterministic from run to run. Cache invariance was also confirmed: HSAQ's SQLite sensitivity cache produced identical drift values across both runs.
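The "about 4% relative" figure is just the share of the bf16 gap that the floor closed. A quick check, using the two gap percentages from the table above (the helper name is ours):

```python
def relative_gap_closed(gap_before_pct, gap_after_pct):
    """Fraction of the PPL gap to bf16 that an intervention removed."""
    return (gap_before_pct - gap_after_pct) / gap_before_pct


print(f"{relative_gap_closed(14.42, 13.80):.1%}")  # prints 4.3%
```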

So far this is unremarkable. The "+0.62pp from a 0.05 GB nudge" finding alone would justify a paragraph in an internal log, nothing more.

Then we looked at the per-layer drift.

The inversion

When the floor forced these six o_proj layers from 3-bit to 4-bit, their measured per-layer drift got dramatically worse, not better:

Layer	Drift at 3-bit	Drift at 4-bit	Ratio
`model.layers.21.self_attn.o_proj`	2.70	8.44	3.1× worse
`model.layers.30.self_attn.o_proj`	1.26	6.51	5.2× worse
`model.layers.8.self_attn.o_proj`	1.39	3.44	2.5× worse

Three of the six layers showed drift inflation of roughly 2.5× or more at the higher bit-width. And the overall PPL, the thing the drift metric is supposed to predict, got better anyway.
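Spotting this kind of inversion is cheap once per-layer drift is cached for each bit-width. A small sketch, using the three rows from the table above; the dictionary layout and `find_inversions` are assumptions, not HSAQ's cache schema:

```python
# Drift values copied from the table; keys are bit-widths.
drift = {
    "model.layers.21.self_attn.o_proj": {3: 2.70, 4: 8.44},
    "model.layers.30.self_attn.o_proj": {3: 1.26, 4: 6.51},
    "model.layers.8.self_attn.o_proj": {3: 1.39, 4: 3.44},
}


def find_inversions(drift_by_layer, low=3, high=4):
    """Return {layer: ratio} for layers whose drift is higher at the
    higher bit-width, i.e. where the metric says more bits hurt."""
    inversions = {}
    for name, by_bits in drift_by_layer.items():
        if by_bits[high] > by_bits[low]:
            inversions[name] = by_bits[high] / by_bits[low]
    return inversions


for name, ratio in find_inversions(drift).items():
    print(f"{name}: {ratio:.1f}x")
```

Any layer this flags is a candidate for checking against a downstream metric before trusting the allocator's verdict.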

Let that land. The signal the allocator uses to decide which layers deserve more bits is telling us:

"Layer 21's o_proj is 3× more damaged at 4-bit than at 3-bit. Definitely don't promote it."

And the model is responding:

"Actually, the 4-bit version generates better text. Thanks."

This is not noise. It reproduced across 32-sample and 256-sample calibration sets. It is a systematic divergence between what HSAQ measures and what actually matters.

What's actually happening

HQQ's quantization is groupwise: it picks one scale and zero-point per group of 64 weights. The mechanism that makes HQQ fast and parameter-light is the same mechanism that breaks here.

"One scaling factor for 128 weights means one outlier crushes the other 127 to zero." — Gemini's description of HQQ group quantization (we run at group_size=64, but the principle is identical).

On outlier-heavy layers like o_proj (which carries the per-head attention output back into the residual stream) and down_proj (which projects the wide FFN intermediate back down), a small number of channels carry order-of-magnitude larger activations than the rest. At 3-bit, the quantization is so coarse that everything is approximate and the model has already absorbed the noise. At 4-bit, you get more precision per group, but the outlier still dominates its group's scale — so the 63 non-outlier weights in that group get more crushed relative to what they should be, not less.

The drift metric notices this. It measures normalized MSE between the bf16 layer output and the quantized layer output on captured calibration activations. The increased crushing of small weights inside outlier-dominated groups produces a larger MSE — that part is real and the metric is honest about it. But the model in practice is much more tolerant of "small weights got squashed" noise than of "outlier weight got rounded to a bin that doesn't represent its magnitude" noise. The drift metric weights these the same. Real PPL doesn't.
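The two ingredients here, one scale per group and a normalized-MSE drift score, can be sketched in a few lines. This is a plain min-max round-to-nearest stand-in, not HQQ's actual solver, and the single-row "layer", the synthetic weights, and all function names are assumptions. It is only meant to show how one outlier widens the quantization step for the rest of its group; it is not meant to reproduce the 3-bit vs 4-bit inversion itself.

```python
import random


def quantize_group(weights, bits):
    """Asymmetric min-max round-to-nearest with one scale and zero-point."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((w - lo) / scale) * scale for w in weights]


def quantize_groupwise(weights, bits, group_size=64):
    out = []
    for start in range(0, len(weights), group_size):
        out.extend(quantize_group(weights[start:start + group_size], bits))
    return out


def normalized_mse(reference, candidate):
    num = sum((r - c) ** 2 for r, c in zip(reference, candidate))
    den = sum(r * r for r in reference)
    return num / den


def layer_drift(weights, quantized, calibration):
    """Normalized MSE between full-precision and quantized outputs of a
    single-row linear layer over a set of calibration inputs."""
    def dot(w, x):
        return sum(wi * xi for wi, xi in zip(w, x))
    ref = [dot(weights, x) for x in calibration]
    out = [dot(quantized, x) for x in calibration]
    return normalized_mse(ref, out)


def mean_abs_error(weights, quantized):
    return sum(abs(w - q) for w, q in zip(weights, quantized)) / len(weights)


random.seed(0)
plain = [random.gauss(0.0, 0.02) for _ in range(64)]
spiked = plain[:-1] + [1.0]  # same group with one large outlier weight
calibration = [[random.gauss(0.0, 1.0) for _ in range(64)] for _ in range(32)]

for bits in (3, 4):
    q_plain = quantize_groupwise(plain, bits)
    q_spiked = quantize_groupwise(spiked, bits)
    print(bits,
          mean_abs_error(plain[:-1], q_plain[:-1]),    # small weights, no outlier
          mean_abs_error(spiked[:-1], q_spiked[:-1]),  # same weights, outlier in group
          layer_drift(spiked, q_spiked, calibration))
```

In this toy setup the outlier itself is reproduced almost exactly (it defines the top of the range), while the 63 small weights absorb the error. A normalized MSE over outputs folds both kinds of error into one number, which is the blind spot described above.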

"HQQ is blind to data flowing through it." — same source. This is the whole conceptual gap that activation-aware methods (AWQ, GPTQ, imatrix) close.

What this means if you use a drift-based allocator

If you're running anything in the mixed-precision-by-sensitivity family (SqueezeLLM, OWQ, our HSAQ, anything that picks per-layer bit-widths from a calibration MSE signal), there is a category of layer where your signal is lying to you. Specifically: outlier-heavy attention output projections (o_proj) and FFN down projections (down_proj). These are the layers AWQ identified back in 2023 as needing per-channel scaling, and the reason is precisely the dynamic our drift metric is failing to model.

Two implications:

Treat the drift signal as approximate on o_proj and down_proj. A sensitivity floor is one cheap way to do this — force these layers to a known-better bit-width regardless of what calibration MSE says. That's what Phase-3a tested, and it worked, even though it cut against the allocator's recommendation.
Calibration-MSE is the wrong signal for outlier-heavy layers. The right signal is something like KL divergence on output logits, or PPL impact directly measured on a held-out validation set. Both are more expensive than HQQ-output MSE, but on the layers where MSE lies, the expense is justified.
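For the second implication, a per-position KL divergence between bf16 and quantized logits is straightforward to compute once both sets of logits are captured. A minimal sketch in pure Python (function names are ours; a real pipeline would do this batched on tensors):

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one logit vector."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats for a single token position."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


print(kl_divergence([2.0, 0.0, 0.0], [2.0, 0.0, 0.0]))
print(kl_divergence([2.0, 0.0, 0.0], [0.0, 2.0, 0.0]))
```

Unlike a per-layer output MSE, this scores a candidate bit assignment by what it does to the model's next-token distribution, so it requires a full forward pass per candidate. That is where the extra expense comes from.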

We are not the first to notice this. AWQ's original paper makes the case in different language: salient weight channels should be identified from the activation distribution, not from the weights themselves. HQQ's design choice to be data-blind is the feature that makes it fast and the bug that makes it brittle. What this experiment adds is a clean reproduction on a current 8B model, with the exact mechanism visible: same calibration cache, same allocator, two runs differing only in the floor parameter, and a drift-vs-PPL anticorrelation that jumps out at you.

What didn't work

For completeness: Phase-3a tested two structural levers, and only one helped meaningfully.

o_proj sensitivity floor: +0.6pp PPL improvement. Useful, but small.
group_size=64 (vs the HQQ default of 128): already baked into HSAQ from day one (config.py:52: HQQ_OVERHEAD_FACTOR = 1.065 # 6.5% average (zeros 64 + scales 64 per group)). The hypothesis that tightening the group size would help was wrong about our starting point — we were already at the practical floor. Tightening further to gs=32 has diminishing returns and roughly doubles overhead.
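The "roughly doubles overhead" point follows from simple arithmetic: each group stores one scale and one zero-point, so halving the group size doubles the metadata per weight. A sketch, assuming 8 bits each for the scale and zero-point (that width is our assumption, not something read out of HSAQ's config; the doubling holds for any fixed width):

```python
def metadata_bits_per_weight(group_size, meta_bits=8):
    """Extra storage per weight from one scale and one zero-point per group."""
    return 2 * meta_bits / group_size


for group_size in (128, 64, 32):
    print(group_size, metadata_bits_per_weight(group_size))
```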

The conclusion is sharper than the headline number: more HQQ tuning is not the lever. The bit budget is gone, the group size is at the practical floor, and the drift metric we're using to allocate the budget that remains is unreliable on the layers where allocation matters most.

What's next: AWQ on a 9-layer target list

A separate diagnostic, a logit divergence comparison between the HSAQ-quantized model and bf16 run on 96 prompts the same day, produced a clean QUANTIZATION_BIAS_DOMINANT verdict: 63/96 divergences are confidently wrong (the model is sure of a wrong token), while only 3/96 are high-entropy uncertainty. This is the signature of representation failure, not undertraining. It is what AWQ is designed to fix.
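One plausible way to bucket divergences like this is sketched below. The thresholds (`confident_p`, `entropy_frac`), the category names, and the function itself are hypothetical; the diagnostic's real criteria are not spelled out in this post.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def classify_divergence(ref_logits, quant_logits,
                        confident_p=0.8, entropy_frac=0.7):
    """Label one token position by how the quantized model disagrees
    with the bf16 reference. Thresholds are illustrative only."""
    ref_top = max(range(len(ref_logits)), key=ref_logits.__getitem__)
    quant_top = max(range(len(quant_logits)), key=quant_logits.__getitem__)
    if ref_top == quant_top:
        return "agree"
    probs = softmax(quant_logits)
    if probs[quant_top] >= confident_p:
        return "confidently_wrong"  # sure of a token bf16 did not pick
    if entropy(probs) >= entropy_frac * math.log(len(probs)):
        return "high_entropy"  # near-uniform: the model is just unsure
    return "other"


print(classify_divergence([5.0, 0.0, 0.0], [0.0, 5.0, 0.0]))
print(classify_divergence([5.0, 0.0, 0.0, 0.0], [0.0, 0.1, 0.0, 0.0]))
```

The distinction matters because the two buckets point at different fixes: confident errors suggest the quantized weights represent something different, while high-entropy errors suggest the model simply lost resolution.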

The diagnostic surfaced nine specific layers driving the divergence:

Layer	Drift score
`model.layers.28.self_attn.o_proj`	23.00
`model.layers.13.self_attn.o_proj`	14.53
`model.layers.15.mlp.down_proj`	6.36
`model.layers.28.mlp.down_proj`	6.28
`model.layers.25.mlp.down_proj`	6.21
`model.layers.20.mlp.down_proj`	5.41
`model.layers.14.self_attn.o_proj`	5.18
`model.layers.15.self_attn.o_proj`	5.15
`model.layers.17.self_attn.o_proj`	4.69

Pattern: mid-to-late transformer (L13–L28), attention output and MLP down projections. Textbook activation-outlier signature. The next post will report on an AWQ POC targeting exactly these nine layers — leaving the other 272 Linears under HSAQ as today, swapping only the outliers to AWQ. If the gap closes there, the recipe likely generalizes. If it doesn't, we have a different problem.

Calibrating prior claims

A previous LinkedIn pulse made the claim that this hybrid quantization recipe holds across model families. That claim should be softened pending the AWQ run. The HSAQ allocator's behavior on o_proj and down_proj is consistent across the architectures we've tested, but the fix (whether AWQ closes the gap to <8% PPL across architectures) is not yet validated. Phi-4 has a different attention layout (a fused qkv_proj rather than separate q/k/v projections); confirming transferability there requires running the same divergence diagnostic on a Phi-4 HSAQ quantization, which is queued.

Bottom line

If you're using calibration-MSE as your per-layer sensitivity signal, run a sanity check: pick your worst-PPL allocation and force-promote the o_proj and down_proj layers to 4-bit anyway. If PPL improves, your drift metric is lying to you in the same direction ours is. That's information you can get without changing your quantizer, and it's information that says your quantizer needs to change.

This is part of an ongoing series on running 13–20B language models on 12 GB consumer GPUs. The pipeline is open work-in-progress at mxguru1/hsaq-tools on Hugging Face. Granite-3.3-8B was chosen as the headline target because community AWQ/GPTQ quants exist for ground truth, and because 8B parameters at mixed 3/4-bit fits comfortably on a 12 GB card with room for a LoRA adapter.

Update (2026-05-21) — model-specificity caveat

Follow-up transfer testing on the o_proj 3→4-bit floor intervention shows it is model-specific, not a generalizable recipe. On a clean, identical evaluation protocol (full wikitext-2 test set, non-overlapping 2048-token windows):

Model	PPL change from floor (positive = PPL reduced)	Direction
granite-3.3-8B	+0.0840 (1.137%)	improvement
phi-4 (14B)	+0.0088 (0.127%)	small improvement
Qwen-2.5-14B	−0.0019 (0.031% worse)	mild regression

Phase-3a's observation (drift-MSE on outlier-heavy layers disagrees with downstream PPL) holds for granite as originally reported. The intervention of forcing o_proj layers from 3-bit to 4-bit carries over to phi-4 only weakly (small positive effect, 67.6% of windows helped) and reverses on Qwen-2.5-14B (61.2% of windows hurt). No clean predictor (count of underbitted layers, tier distribution, architecture, or parameter scale) sorts the result.
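For readers who want to run the same kind of comparison, here is a minimal sketch of the bookkeeping: corpus perplexity from per-window summed negative log-likelihoods, plus the fraction of windows where the floored model did better. The function names and the toy numbers are ours; producing the per-window NLLs requires a model forward pass that is not shown.

```python
import math


def perplexity(window_nlls, tokens_per_window=2048):
    """Corpus perplexity from per-window summed NLL (in nats) over
    non-overlapping windows of equal length."""
    total_tokens = tokens_per_window * len(window_nlls)
    return math.exp(sum(window_nlls) / total_tokens)


def fraction_helped(baseline_nlls, floored_nlls):
    """Share of windows where the floored model has lower NLL."""
    wins = sum(f < b for b, f in zip(baseline_nlls, floored_nlls))
    return wins / len(baseline_nlls)


# Toy per-window NLLs, not measured values.
baseline = [4300.0, 4150.0, 4420.0, 4280.0]
floored = [4290.0, 4155.0, 4400.0, 4275.0]
print(perplexity(baseline), perplexity(floored), fraction_helped(baseline, floored))
```

Reporting the window-level win rate alongside the aggregate PPL change helps separate a small but consistent effect from one driven by a handful of windows.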

Full writeup of the transfer testing, the dose-response hypothesis that died on the clean protocol, and the discipline checks that caught a wrong prediction in real time is in Part 2 and a forthcoming Part 3.
