Your LLM Gets Dumber Every Time You Compact Context — And Nobody Is Measuring It

#llm #opensource

TL;DR: LLM context compaction degrades output quality across sessions, but I could not find any published research that quantifies how much. I ran 70+ sessions and mapped what looks like a "dead cat bounce": the model briefly gets better after the second compaction, then collapses. This post catalogs what is known, what isn't, and calls for shared benchmarks.

My Claude session was 200 messages deep. I hit compact. The next response was... fine. Then I hit compact again. And again. By the fifth compaction, the model couldn't remember which file we were editing. I started over, counted the quality drop, ran another 70 sessions on DeepSeek V4, and charted what nobody talks about: context compaction has a curve, and that curve is not linear.

The strangest part: after the second compaction, the model briefly gets better. Then it declines and never recovers. It is a "dead cat bounce", except that as far as I can tell nobody has published a measurement of it or built a benchmark for it.

What I Found (Not Much)

I searched for benchmarks that measure multi-round compaction degradation. Here's what exists:

RULER: Measures how performance drops as static input grows longer. Nothing about what happens after you compress and re-compress.
Context Rot (Chroma 2025): 18 models tested, all degrade with more tokens. Again, static.
Multi-turn evaluation: Tests whether models drift across conversation turns. Doesn't touch compaction.

Parameter compression (pruning, quantization) is far better mapped. The Lottery Ticket Hypothesis (ICLR 2019) and Compression Laws for LLMs (2025) study how performance changes as parameters are removed or compressed. Context summarization, the thing that happens every time your agent runs /compact, has no such curve.

Why This Matters

If the curve is real, you could:

Know exactly when to start a fresh session (before the decline hits)
Compare models on a new dimension: who maintains quality longest across compactions?
Give LLM providers a concrete target: "your compaction quality drops 20% faster than competitor X"

Right now, none of the major benchmark suites (MMLU, HELM, BIG-bench, RULER) include a "compaction persistence" metric. If context windows keep growing and sessions keep getting longer, this gap gets bigger every year.
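A minimal sketch of how such a "compaction persistence" metric could be defined: count how many compactions a session survives before its score falls below a chosen fraction of the pre-compaction baseline. The function name, the 0.8 threshold, and the sample scores are illustrative assumptions, not measurements from the sessions described in this post.

```python
from typing import Sequence


def compaction_persistence(scores: Sequence[float], threshold: float = 0.8) -> int:
    """Return how many compactions pass before quality drops below
    threshold * baseline.

    scores[0] is the rubric score before any compaction; scores[k] is the
    score after the k-th compaction. If quality never drops below the
    cutoff, the number of compactions observed is returned.
    """
    if not scores:
        raise ValueError("scores must contain at least the baseline")
    if scores[0] <= 0:
        raise ValueError("baseline score must be positive")
    cutoff = threshold * scores[0]
    for k, score in enumerate(scores[1:], start=1):
        if score < cutoff:
            return k - 1  # last compaction that still met the cutoff
    return len(scores) - 1


# Placeholder scores on a 0-5 rubric, invented purely to exercise the function.
example = [4.5, 4.0, 4.2, 3.4, 2.8, 2.1]
print(compaction_persistence(example))  # prints 2
```

Under this definition, the returned number doubles as a rough "restart before compaction N+1" signal, and two models could be compared by their persistence on the same task set. A stricter variant might require the score to stay below the cutoff for two rounds, so that a temporary bounce does not mask the decline.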

This is why I eventually built a closed-loop self-healing config system — when your model degrades silently across sessions, you need automated detection.

Help Me Map This Curve

I built a tiny monitor (compact-counter) and a rough experiment framework — 50 lines of Python, 10 benchmark tasks, 0-5 rubric. It's a starting point.
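The repo is the reference for the actual framework. What follows is only a sketch of what a measurement loop of this shape might look like: score a fixed task set, compact, and repeat. The `ask`, `compact`, and `grade` callables are hypothetical stand-ins (not the compact-counter API), and the toy implementations exist only so the example runs without a model.

```python
from statistics import mean
from typing import Callable, List

Ask = Callable[[str, str], str]      # (context, task) -> answer
Compact = Callable[[str], str]       # context -> shorter context
Grade = Callable[[str, str], int]    # (task, answer) -> rubric score 0-5


def degradation_curve(context: str, tasks: List[str], rounds: int,
                      ask: Ask, compact: Compact, grade: Grade) -> List[float]:
    """Mean rubric score at baseline and after each compaction round."""
    curve = []
    for _ in range(rounds + 1):
        scores = [grade(task, ask(context, task)) for task in tasks]
        curve.append(mean(scores))
        context = compact(context)  # compress the already-compressed context
    return curve


# Toy stand-ins so the sketch runs offline: the "model" can only answer
# when the fact it needs is still present in the context.
def toy_ask(context: str, task: str) -> str:
    return task if task in context else ""


def toy_compact(context: str) -> str:
    words = context.split()
    return " ".join(words[: max(1, len(words) // 2)])  # keep the first half


def toy_grade(task: str, answer: str) -> int:
    return 5 if answer == task else 0


curve = degradation_curve("alpha beta gamma delta", ["alpha", "delta"], 2,
                          toy_ask, toy_compact, toy_grade)
print(curve)  # [5, 2.5, 2.5]
```

In a real run, `ask` and `compact` would wrap the model and its compaction command, and `grade` would apply the 0-5 rubric, whether by hand or with a separate judge. Keeping the three as plain callables makes it possible to swap models without touching the loop.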

If you've measured compaction degradation in your own LLM workflows — or if your team has internal benchmarks — drop a link in the comments or open an issue on the repo. If enough people contribute data points across different models, I'll compile and publish a shared benchmark.

Have you observed non-linear degradation in your own LLM pipelines? At which compaction did it break? I'm especially curious about Claude Opus and GPT-5 — does the curve look different on different architectures?

References

Frankle & Carbin, "The Lottery Ticket Hypothesis" (ICLR 2019)
"Compression Laws for Large Language Models" (2025)
"RULER: What's the Real Context Size of Your Long-Context Language Models?" (COLM 2024)
Chroma Research, "Context Rot" (2025)

🤖 Fact-checked 2026-07-10: GitHub PR status verified against API.

Chinese version: Juejin/YuhaoLin2005yhl · Code on GitHub