Benjamin Savoy

Posted on Jun 9

Can LLMs save themselves from verbosity?

#ai #nlp

« Je n'ai fait celle-ci plus longue que parce que je n'ai pas eu le loisir de la faire plus courte. »
— Blaise Pascal, Lettres provinciales, Lettre XVI (1656)

"I have made this one longer only because I have not had the leisure to make it shorter."

Pascal's joke is the whole problem: the short version is the expensive one. LLMs lean the other way: they pad. So the question is whether a model can rein in its own verbosity, and what the trimming costs when the deciding clause is buried, as in "…shall not disclose, except to affiliates who…" Drop the "except" and the answer flips.

The test

We use ContractNLI: real NDAs, each with expert Entailment / Contradiction / NotMentioned labels. Items whose label turns on a buried "negation-by-exception" condition we tag as traps. The metric is decision-survival, and it's judge-free: answer from the full document (the ceiling), compress, answer again from the compressed text, and score both answers by exact match against the expert label. Survival is the fraction of full-document-correct answers that stay correct after compression. Compression is blind to the question and computed once per document.
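To make the metric concrete, here is a minimal sketch of how decision-survival could be computed from already-collected answers. The `Item` record and its field names are hypothetical, invented for this illustration, and the four sample rows are made up rather than taken from the experiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    gold: str               # expert label: Entailment / Contradiction / NotMentioned
    full_answer: str        # reader's answer from the full document (the ceiling)
    compressed_answer: str  # reader's answer from the compressed document
    is_trap: bool = False   # label turns on a buried exception


def decision_survival(items):
    """Fraction of full-document-correct answers that stay correct after compression."""
    baseline = [it for it in items if it.full_answer == it.gold]
    if not baseline:
        return None  # nothing was correct at the ceiling, so survival is undefined
    survived = sum(it.compressed_answer == it.gold for it in baseline)
    return survived / len(baseline)


# Made-up rows, only to exercise the function.
items = [
    Item("Contradiction", "Contradiction", "Contradiction", is_trap=True),
    Item("Contradiction", "Contradiction", "Entailment", is_trap=True),
    Item("Entailment", "Entailment", "Entailment"),
    Item("NotMentioned", "Entailment", "NotMentioned"),  # wrong at the ceiling: excluded
]
overall = decision_survival(items)
traps = decision_survival([it for it in items if it.is_trap])
```

The last row shows why survival differs from accuracy: an item the reader gets wrong on the full document never enters the denominator, even if the compressed answer happens to be right.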

Three compressors on Groq (llama-3.1-8b, qwen3-32b, gpt-oss-120b), one fixed reader (llama-3.3-70b), 400 items across 61 NDAs, two prompts: naive ("Summarise this") and effortful (a careful lossless instruction). The raw ceiling is 66%, but 87% on traps, an artifact of the label mix, which is exactly why we report survival rather than accuracy.

Finding 1: Prompt engineering is still alive

Decision-survival on trap clauses:

Compressor	naive	effortful
`llama-3.1-8b`	57%	74%
`qwen3-32b`	88%	93%
`gpt-oss-120b`	91%	95%

The weak model jumps 17 points on traps (57% to 74%); the capable ones gain 4 to 5. The payoff from a better prompt is largest exactly where capacity is scarce.

Finding 2: The traps catch out simpler models

Decision-survival on ordinary (non-trap) clauses:

Compressor	naive	effortful
`llama-3.1-8b`	87%	87%
`qwen3-32b`	88%	94%
`gpt-oss-120b`	94%	91%

The small model isn't uniformly lossy. It's fine on ordinary clauses and collapses only on the buried exceptions: with the naive prompt it holds 87% here but falls to 57% on traps. The capable models show no such gap.
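The gap is easier to see as arithmetic. This sketch simply subtracts the trap table from the ordinary table; the percentages are copied from the two tables above, and the dictionary layout and the `trap_gap` helper are my own naming for illustration.

```python
# Survival percentages copied from the two tables in this post.
trap = {
    "llama-3.1-8b": {"naive": 57, "effortful": 74},
    "qwen3-32b": {"naive": 88, "effortful": 93},
    "gpt-oss-120b": {"naive": 91, "effortful": 95},
}
ordinary = {
    "llama-3.1-8b": {"naive": 87, "effortful": 87},
    "qwen3-32b": {"naive": 88, "effortful": 94},
    "gpt-oss-120b": {"naive": 94, "effortful": 91},
}


def trap_gap(model, prompt):
    """Ordinary-clause survival minus trap survival, in percentage points."""
    return ordinary[model][prompt] - trap[model][prompt]


gaps = {m: {p: trap_gap(m, p) for p in ("naive", "effortful")} for m in trap}
for model, by_prompt in gaps.items():
    print(f"{model:14s} naive {by_prompt['naive']:+d}  effortful {by_prompt['effortful']:+d}")
```

With the naive prompt the gap is 30 points for llama-3.1-8b against 0 and 3 for the other two; the effortful prompt narrows the small model's gap to 13.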

Finding 3: Compression ratios vary significantly

Realised compression (tokens out / in):

Compressor	naive	effortful
`llama-3.1-8b`	0.22	0.36
`qwen3-32b`	0.27	0.37
`gpt-oss-120b`	0.37	0.43

The smallest model compresses the hardest (0.22) and loses the most traps: it isn't lazy, it hits that ratio by throwing the load-bearing clause overboard. Not too surprising, just sloppy work.

More interestingly, with the effortful prompt the 32B qwen compresses tighter than the 120B gpt-oss (0.37 vs 0.43) at comparable fidelity. Concision may be a model trait, but we'd have to broaden our testing to confirm.
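For reference, the ratio itself is just tokens out over tokens in. Here is a minimal sketch, assuming whitespace-separated words as a stand-in for a real tokenizer; the `count_tokens` and `compression_ratio` names and the sample clause are invented for illustration.

```python
def count_tokens(text):
    # Assumption: whitespace tokens stand in for the model's real tokenizer.
    return len(text.split())


def compression_ratio(source, summary, counter=count_tokens):
    """Realised compression: tokens out / tokens in (lower means tighter)."""
    n_in = counter(source)
    if n_in == 0:
        raise ValueError("source document is empty")
    return counter(summary) / n_in


# Invented example clause, not taken from the corpus.
source = (
    "The Recipient shall not disclose Confidential Information, "
    "except to affiliates who are bound by equivalent obligations."
)
summary = "Recipient may disclose only to bound affiliates."
ratio = compression_ratio(source, summary)  # 7 words out of 16
```

Passing a different `counter` swaps in a real tokenizer without touching the rest; absolute ratios will shift with the tokenizer, so comparisons only make sense when every model is counted the same way.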

Finding 4: Pushing the boundaries can still work

A forced word-budget prompt, swept at 0.35 and 0.25 targets, moved only qwen: from 0.37 down to 0.32 at full fidelity (95% on traps). It then refuses to go lower: told to hit 0.25, it stops at 0.32 rather than cut content. llama-3.1-8b and gpt-oss-120b don't tighten at all, and fidelity didn't budge for any of the three. It's an interesting degree of self-limitation.
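A word-budget prompt can be sketched roughly like this. The prompt wording and both helper names are hypothetical, not the prompt used in the experiment, and the budget is counted in whitespace words while the ratios above are in tokens, so the two only line up approximately.

```python
def word_budget_prompt(document, target_ratio):
    """Turn a target ratio into an explicit word budget and a prompt that states it."""
    if not 0 < target_ratio < 1:
        raise ValueError("target_ratio must be between 0 and 1")
    budget = max(1, round(len(document.split()) * target_ratio))
    prompt = (
        f"Rewrite the document below in at most {budget} words. "
        "Keep every condition, exception and negation.\n\n" + document
    )
    return budget, prompt


def overshoot(summary, budget):
    """How far the output ran over the budget (1.0 means exactly on budget)."""
    return len(summary.split()) / budget


document = "word " * 1000            # placeholder 1000-word document
budget_35, _ = word_budget_prompt(document, 0.35)
budget_25, _ = word_budget_prompt(document, 0.25)
stubborn = overshoot("word " * 320, budget_25)  # a model that stops at 0.32
```

Tracking the overshoot alongside survival is what separates a model that ignores the budget from one that holds its ground to protect content.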

Gotcha: Reasoning models do behave differently

gpt-oss first looked terrible: a 0.74 ratio (barely shorter than the input) and a suspiciously high survival score. A strong model flubbing an easy task is a smell, not a finding. It's a reasoning model, and even with its chain-of-thought hidden (`reasoning_format=hidden`) the reasoning still spends the token budget. On hard documents it thought its way to the cap and emitted a bloated or empty "summary" (6 of 61 came back empty, each scored as perfect compression). One parameter fixed it:

`reasoning_effort`	tokens for the same answer
high	596
low	55

Capped, gpt-oss is a fine compressor (0.43, zero empties). Benchmark a reasoning model on a generation task and you must bound its thinking, or you're measuring its budget hitting a wall.
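The scoring bug is worth guarding against separately from the fix. Here is a minimal sketch of a scorer that refuses to count an empty or truncated output as compression; `score_summary` and its status strings are hypothetical, and the `hit_token_cap` flag is assumed to come from the API response (OpenAI-compatible chat APIs typically report a finish reason of `length` when the cap is hit).

```python
def score_summary(source, summary, hit_token_cap=False):
    """Return (ratio, status); ratio is None when the output can't be trusted."""
    n_in = len(source.split())    # whitespace words as a stand-in for tokens
    n_out = len(summary.split())
    if n_in == 0:
        raise ValueError("source document is empty")
    if n_out == 0:
        return None, "empty"      # not a 0.0 ratio: the model produced nothing
    if hit_token_cap:
        return None, "truncated"  # output was cut off, so its length is an artifact
    return n_out / n_in, "ok"


empty = score_summary("a b c d", "   ")
cut_off = score_summary("a b c d", "a b", hit_token_cap=True)
fine = score_summary("a b c d", "a b")
```

Counting the non-`ok` statuses per model would have surfaced the 6 empties immediately instead of folding them into the average.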

Honest caveats

We measure fidelity through a reader that is itself weak on NotMentioned (46%), so those items rest on a shakier base.
That reader shares a family with llama-3.1-8b, which may give the small compressor a mild self-decoding edge, so llama-3.1-8b's already-poor trap numbers are if anything optimistic.
A few long documents still truncate gpt-oss even at low effort; those are excluded from the ratios.
One corpus, one domain, n = 400. The shape is the finding, not any single number.

Conclusion

So, can LLMs save themselves from verbosity? Up to a point, and the point is set by the model. A good prompt trims the small one, a word budget trims qwen further until it digs in at 0.32, and capping a reasoning model's hidden thinking stops it drowning in its own tokens. Concision also looks partly innate: the 32B compressed tighter than the 120B at comparable fidelity. Even so, the trimming still costs you some decisions in the end.

Where this goes next

This opened two threads that I'm now curious about:

More models, more scenarios. Three compressors on NDAs is a keyhole. The obvious widening: a same-lab size ladder, other labs, and domains where the deciding clause hides differently, like specs, policies, or medical instructions.
How hard can we push? A word budget moved qwen from 0.37 to 0.32, and then it dug in. I want the frontier, not two points: how far a better prompt takes each model before fidelity finally cracks, what a hard token cap does when I shove a model past its comfort zone, and whether the prompt that rescues the small model is the same one that helps the big one.

It may be that prompt engineering still has a bright future ahead of it.

~ Ben
