Perplexity held flat after INT4. Task accuracy dropped 7 points.

#machinelearning #mlops #llm #pytorch

TL;DR: We quantized a fine-tuned 14B agent model to INT4 with GPTQ. Perplexity moved 0.04. We almost shipped it. A domain eval suite caught a 7-point drop in multi-step task completion that perplexity never saw. Perplexity is a terrible acceptance gate for quantized models.

We run model fine-tuning and eval for enterprise agent automation at Nexus Labs. Series B, small team, ten people who touch the eval pipeline. The model in question was a Qwen2.5-14B fine-tune we use for structured workflow execution. Customer-facing. It matters when it's wrong.

The plan was boring. Quantize to INT4 so two replicas fit on one A100 instead of one, cutting serving cost roughly in half. Standard move. We picked GPTQ with a group size of 128, ran calibration on 512 samples from our training distribution, and measured perplexity before and after.

The number that lied

Perplexity on our held-out set: 3.81 full precision, 3.85 after INT4. That's a 1% move. Nothing. By the old folklore, a quantization that holds perplexity is a quantization you ship.

So we ran the actual eval suite. Not perplexity. The 340-case adversarial set we built for this product, where each case is a multi-step task with a programmatic pass/fail check on the final state.

Task completion went from 81.2% to 74.1%. Seven points. On a metric customers feel directly.

The failures clustered. Long sequences, six steps or more, where the model had to hold a constraint from step one and apply it at step five. The INT4 model dropped the constraint. Perplexity averages token-level surprise across the whole corpus, so a few critical tokens going wrong in a 400-token trajectory barely move the mean. The eval that scores the trajectory outcome sees it immediately.
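The dilution is easy to see with toy numbers. The sketch below is illustrative only: it assumes a hypothetical 400-token trajectory in which every token has the same loss (chosen so the baseline perplexity is 3.81), then badly mispredicts two tokens. The token positions and the 1% probability are made up.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Hypothetical trajectory: 400 tokens, uniform per-token loss,
# chosen so the baseline perplexity matches the 3.81 reported above.
base_nll = math.log(3.81)
clean = [base_nll] * 400

# Same trajectory, but two "critical" tokens are now badly wrong:
# the correct token gets only 1% probability. Positions are arbitrary.
damaged = list(clean)
for i in (120, 310):
    damaged[i] = -math.log(0.01)

ppl_clean = perplexity(clean)
ppl_damaged = perplexity(damaged)
print(f"clean={ppl_clean:.3f} damaged={ppl_damaged:.3f}")
```

In this toy setup perplexity rises by under 2%, even though two wrong tokens at the wrong step can be enough to fail the whole task.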

Here is roughly what we measured across the gates:

Metric	FP16	INT4 (GPTQ)	Delta
Perplexity (held-out)	3.81	3.85	+0.04
MMLU (5-shot)	71.4%	70.9%	-0.5 pts
Task completion (our suite)	81.2%	74.1%	-7.1 pts
Constraint-retention subset	88%	69%	-19 pts

MMLU barely moved either. Generic benchmarks were as blind as perplexity here. The damage was concentrated in exactly the capability our product depends on, and only the domain suite measured it.
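One way to surface that kind of concentration is to slice pass rates by trajectory length instead of reporting a single number. This is a minimal sketch, not our harness: the `(num_steps, passed)` result shape is an assumption, the six-step threshold comes from the failure pattern described above, and the sample data is made up.

```python
from collections import defaultdict

def pass_rate_by_bucket(results, long_threshold=6):
    """results: iterable of (num_steps, passed) pairs (hypothetical shape)."""
    totals = defaultdict(lambda: [0, 0])      # bucket -> [passed, total]
    for num_steps, passed in results:
        bucket = "long" if num_steps >= long_threshold else "short"
        totals[bucket][0] += int(passed)
        totals[bucket][1] += 1
    return {b: p / t for b, (p, t) in totals.items()}

# Made-up results, for illustration only
sample = [(2, True), (3, True), (4, False), (6, True), (7, False), (8, False)]
rates = pass_rate_by_bucket(sample)
print(rates)
```

Running the same slicing on reference and candidate results puts the per-bucket gap side by side, which is where a regression like the constraint-retention one shows up.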

Why averaged metrics miss this

Quantization error isn't uniform. INT4 rounds each weight to one of 16 levels, and the layers that handle long-range dependencies, the attention projections deep in the stack, take the worst of that error. A model can stay fluent token-to-token while losing the thread across a long context. Fluency is what perplexity rewards. Following a constraint across 400 tokens is not fluency.
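To make "not uniform" concrete, here is a toy sketch of plain symmetric round-to-nearest quantization over one 128-weight group. It is not GPTQ or AWQ, which both do more than naive rounding, and it says nothing about which layers suffer most. It only shows that the same 4-bit grid produces very different error depending on the weight distribution inside a group. All values are synthetic.

```python
import random

def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest on one group; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1                # 7 for INT4
    peak = max(abs(w) for w in weights)
    scale = peak / qmax if peak > 0 else 1.0  # one scale shared by the group
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in weights]

def mean_sq_error(weights, group_size=128):
    """Mean squared quantization error, computed group by group."""
    err = 0.0
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        deq = quantize_group(group)
        err += sum((w - d) ** 2 for w, d in zip(group, deq))
    return err / len(weights)

rng = random.Random(0)
smooth = [rng.gauss(0, 0.02) for _ in range(128)]   # synthetic weights
spiky = list(smooth)
spiky[0] = 0.5      # a single outlier stretches the scale for the whole group

mse_smooth = mean_sq_error(smooth)
mse_spiky = mean_sq_error(spiky)
print(mse_smooth, mse_spiky)
```

The group with the outlier ends up with far larger error, because most of its small weights collapse to zero on the stretched grid.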

The lesson we keep relearning: the model is the easy part. The thing that tells you whether the model is good enough is the hard part, and it's almost never a single scalar.

What we changed

We made the domain suite a hard gate for any inference-level change. Quantization, a vLLM version bump, a new kernel, all of it has to clear the trajectory eval, not perplexity.

To get clean comparisons we shadow every eval case against two backends at once: the FP16 reference on one endpoint and the candidate INT4 build on another. We route both through Bifrost, our gateway, so the eval harness sends one OpenAI-format request and we fan it to both backends behind the same interface. That removed a class of bugs where prompt formatting drifted between the two test paths and made the diff look bigger than it was.

The harness itself is dull on purpose:

import asyncio

import httpx

# OpenAI-compatible chat endpoint exposed by the gateway
GATEWAY = "http://localhost:8080/v1/chat/completions"

async def run_case(client, model, case):
    """Run one multi-step case against `model` and return its pass/fail result."""
    state = case.initial_state
    for step in case.steps:
        r = await client.post(GATEWAY, json={
            "model": model,                 # "ref/qwen-fp16" or "cand/qwen-int4"
            "messages": case.render(state),
            "temperature": 0,
        })
        r.raise_for_status()               # fail loudly instead of scoring an error body
        state = case.apply(state, r.json())
    return case.check(state)               # programmatic pass/fail

async def eval_suite(model, cases):
    """Return the fraction of cases that pass for `model`."""
    async with httpx.AsyncClient(timeout=60) as c:
        results = await asyncio.gather(*[run_case(c, model, x) for x in cases])
    return sum(results) / len(results)

Temperature 0, deterministic check, no LLM judging the output. The check is code that inspects final state. When the pass criterion is itself fuzzy, you can't tell a quantization regression from judge noise, and we'd already been burned by that.
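Because both backends see the same cases, the comparison can be paired per case instead of just comparing two aggregate rates. A minimal sketch, assuming results are kept as a `case_id -> passed` mapping per backend (that shape and the case IDs are hypothetical, not the harness above):

```python
def paired_diff(ref_results, cand_results):
    """Compare per-case outcomes; both args map case_id -> bool (passed)."""
    shared = ref_results.keys() & cand_results.keys()
    # Passed on the reference but failed on the candidate
    regressions = sorted(c for c in shared if ref_results[c] and not cand_results[c])
    # Failed on the reference but passed on the candidate
    fixes = sorted(c for c in shared if cand_results[c] and not ref_results[c])
    return regressions, fixes

# Made-up outcomes, for illustration only
ref = {"case-001": True, "case-002": True, "case-003": False}
cand = {"case-001": True, "case-002": False, "case-003": True}
regressions, fixes = paired_diff(ref, cand)
print(regressions, fixes)
```

The regression list is the useful artifact: it gives you the exact cases to read when looking for a failure cluster.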

We didn't abandon INT4. We re-ran with AWQ instead of GPTQ and bumped calibration to 1,024 samples weighted toward long sequences. That landed at 79.3% task completion. Still down from FP16, but inside our 2-point tolerance, so we shipped it with the cost win mostly intact.
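The gate itself reduces to a small check. This is a sketch under assumptions: the function name is made up, rates are in percent, and treating a drop of exactly 2 points as passing is our guess at a reasonable boundary, not a stated rule.

```python
def clears_gate(ref_rate, cand_rate, tolerance_pts=2.0):
    """True if the candidate's drop from the reference is within tolerance.

    Rates are percentages (for example 81.2), and the tolerance is in
    percentage points. Inclusive boundary is an assumption.
    """
    return (ref_rate - cand_rate) <= tolerance_pts

gptq_ok = clears_gate(81.2, 74.1)   # GPTQ build from the table above
awq_ok = clears_gate(81.2, 79.3)    # AWQ build with reweighted calibration
print(gptq_ok, awq_ok)
```

With the numbers from this post, the GPTQ build fails the gate and the AWQ build clears it.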

Trade-offs and limitations

A 340-case trajectory suite is expensive. Each full run takes about 11 minutes and costs real GPU time. Perplexity takes seconds. We can only afford the suite because we run it as a release gate, not on every commit.

This finding is ours, not a law. A model serving short single-turn responses would likely show almost no gap between perplexity and task metrics, because there's no long-range constraint to lose. The wider the gap between your token-level proxy and your actual product behavior, the more this bites.

Deterministic checks only work when success is checkable in code. Plenty of generation tasks aren't, and there you're stuck with judge models and their variance. We don't pretend INT4 is free either. It cost us about 2 points of task completion (81.2% to 79.3%), which we chose to pay for the throughput.

And calibration data matters more than the algorithm. Switching GPTQ to AWQ helped, but reweighting calibration toward long sequences helped more.
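This post doesn't spell out the reweighting scheme, so the following is only one plausible way to do it: weighted sampling without replacement, where each example's weight grows with its token length. The function name, the `power` knob, and the synthetic pool are all assumptions for illustration.

```python
import math
import random

def sample_calibration(examples, k, length_of=len, power=1.0, seed=0):
    """Pick k examples without replacement, favoring longer ones.

    Uses exponential-key weighted sampling: each example gets the key
    log(u) / weight with u uniform in (0, 1], and the k largest keys win.
    """
    rng = random.Random(seed)
    keyed = []
    for ex in examples:
        weight = max(length_of(ex), 1) ** power
        u = 1.0 - rng.random()                 # in (0, 1], so log is safe
        keyed.append((math.log(u) / weight, ex))
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [ex for _, ex in keyed[:k]]

# Synthetic pool: 200 short (10-token) and 200 long (400-token) sequences
pool = [[0] * 10 for _ in range(200)] + [[0] * 400 for _ in range(200)]
picked = sample_calibration(pool, k=100)
n_long = sum(len(x) == 400 for x in picked)
print(len(picked), n_long)
```

Raising `power` above 1 skews harder toward long sequences; setting it to 0 falls back to uniform sampling.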