DEV Community

Elnur Atakishiyev
Elnur Atakishiyev

Posted on

Context rot is real. You can compile it away.

Your agent didn't get dumber. Its context rotted.

A support agent answers perfectly for 30 turns, then starts getting things wrong. Nobody touched the prompt. The chat just got long — and the one fact that mattered (which card a refund goes to) is now stranded in the middle of a 250k-token window, where the model barely attends to it.

This is context rot, and it's now measurable. Chroma's 2025 study showed accuracy dropping well before the documented context limit, with a sharp knee far under the "1M-token" ceiling — and the unsettling part: coherent, well-structured input can degrade attention more than shuffled input. Standard needle-in-a-haystack benchmarks miss it because the real failures live in long, multi-tool agent sessions.

The fix isn't a bigger window. It's a smaller, cleaner one.

ContextForge is an open-source context compiler. It sits between your app and the model and does four things to everything entering the window:

  • Score — a 0–100 rot-risk per call (load · redundancy · middle-burial · fragmentation). Put it in CI.
  • Compress — remove near-duplicates and trim stale, low-salience spans. Extractive only — it never paraphrases away the fact you needed.
  • Reorder — lift load-bearing facts to the edges of the window to beat "lost in the middle."
  • Budget — enforce a token ceiling, dropping the least-salient material, logging every drop.

What it looks like. On a long support session: 252k → 20k tokens (~92% smaller), rot risk 44 → 13 (moderate → low), and the buried refund fact restored to the window edge. The same shows up for a 24/7 OpenClaw-style agent (228k→20k) and an OpenHands-style coding agent that kept forgetting a "don't touch the frozen module" constraint (171k→20k, constraint re-anchored).

It's reproducible. The benchmark ships an offline, deterministic proxy "model" that emulates rot, so you can run it with no API key and see the baseline lose a buried fact while the compiled context recovers it — then swap in a real model with one flag to measure the accuracy delta on your traces. That last part is the point: don't trust my numbers, measure your own.

Drop-in. Point your SDK's base_url at the compiling proxy and change nothing else:

contextforge proxy --api anthropic --budget 20000
Enter fullscreen mode Exit fullscreen mode

Responses come back with x-contextforge-rot-before/after and token deltas in the headers.

Why this compounds: even if model vendors fix long-context attention tomorrow, the ~90% token savings never stop mattering. Quality and cost, one layer.

Repo (Apache-2.0, zero-dependency core): https://github.com/eatakishiyev/context-forge — stars and adversarial traces both very welcome.

Top comments (0)