Maurizio-L

Posted on May 28 • Originally published at promptolian.com

Everyone compresses their agent's context. Nobody measures what it forgets.

#python #ai #productivity #machinelearning

How we benchmarked context quality across Anthropic, OpenAI, and Promptolian — a transparent proxy that compresses agent context without losing facts — and found the sweet spot nobody talks about

The problem nobody talks about
The U-curve most teams never see
What we built instead
The second problem: tool schema re-sending
The business case
Try it

The problem nobody talks about

You ship an AI agent. It works great on the first 10 messages. By message 30, it's asking you to repeat the database URL. By message 50, it's recommending the wrong config value — one it hallucinated because the real one got summarised away three turns back.

This isn't a model problem. The problem is context management.

Most agent frameworks hand off context management to the provider — Anthropic's built-in summarisation, or OpenAI's equivalent. The provider, with no way to know which facts matter, does what it's built to do: compress aggressively. 98–99% compression. Everything gets squashed into a summary paragraph.

The result: your agent forgets things. And every time it forgets something important, a human has to step in. That's rework. And rework costs more than tokens.

The U-curve most teams never see

Total cost = context tokens + rework from fact loss.

Everyone optimises the first term. But when the agent loses facts, someone has to catch the error, re-explain the context, and redo the work. That's not free.

Plot quality score against compression rate and you get this:

Interactive version: promptolian.com/ucurve.html

Promptolian (22%) stays in the green zone (≥ 4.0). Both provider built-ins land in the red zone at 99% compression — because LLM summarisers don't know which facts matter later. They see postgres://db.prod/main and write "the database connection was discussed." Accurate. Useless.

Factory.ai measured this (May 2026):

System	Context quality	Compression
Anthropic built-in	3.44 / 5	98.7%
OpenAI built-in	3.35 / 5	99.3%

The quality gap has a direct cost. Figure 2 shows total monthly spend (API tokens + engineer time fixing context failures) for a solo developer:

Assumptions:

100 sessions/month
50 calls/session · 8K context tokens
$3/MTok input (Claude Sonnet 4)
$100/hr engineer rate

At zero debugging time, Anthropic built-in wins (99% token savings). Above 3.5 minutes per failure, Promptolian is cheaper. That threshold is low — it's the agent asking you to re-confirm the deployment target, you correcting it, losing your train of thought.

The cost minimum isn't at maximum compression. It's at 22%, not 99%.
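To make the break-even concrete, here is a rough sketch of the total-cost model using the assumptions listed above. The compression figures and fact-loss rates (1 − quality/5) are the ones reported in this post. One thing here is our own simplifying assumption and not necessarily how the figure was produced: that failures per month equal sessions × fact-loss rate. Under that assumption the break-even comes out at roughly 3.4 minutes per failure, in the same range as the 3.5 minutes quoted above.

```python
# Assumptions taken from the list above.
SESSIONS_PER_MONTH = 100
CALLS_PER_SESSION = 50
CONTEXT_TOKENS = 8_000
PRICE_PER_MTOK = 3.00   # USD per million input tokens
ENGINEER_RATE = 100.00  # USD per hour


def token_cost(compression: float) -> float:
    """Monthly input-token spend after removing `compression` of the context."""
    tokens = SESSIONS_PER_MONTH * CALLS_PER_SESSION * CONTEXT_TOKENS
    return tokens / 1_000_000 * PRICE_PER_MTOK * (1 - compression)


def total_cost(compression: float, fact_loss: float, minutes_per_failure: float) -> float:
    """Token spend plus rework.

    ASSUMPTION (ours): failures per month = sessions * fact-loss rate.
    """
    failures = SESSIONS_PER_MONTH * fact_loss
    rework = failures * minutes_per_failure / 60 * ENGINEER_RATE
    return token_cost(compression) + rework


def break_even_minutes(a: tuple, b: tuple) -> float:
    """Minutes per failure at which systems a and b cost the same.

    Each system is a (compression, fact_loss) pair.
    """
    token_gap = token_cost(a[0]) - token_cost(b[0])
    failure_gap = SESSIONS_PER_MONTH * (b[1] - a[1])
    return token_gap / (failure_gap * ENGINEER_RATE / 60)


PROMPTOLIAN = (0.218, 0.148)  # 21.8% compression, 14.8% fact loss
ANTHROPIC = (0.987, 0.312)    # 98.7% compression, 31.2% fact loss

print(round(break_even_minutes(PROMPTOLIAN, ANTHROPIC), 2))
```

Change the engineer rate or the failure model and the threshold moves, but the shape of the trade-off stays the same: token savings are bounded, rework is not.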

What we built instead

Promptolian is a transparent proxy that sits between your agent and the Anthropic API. You change one line (base_url) — no changes to agent logic.

The key insight: not all turns are equal.

First 2 turns — task framing and constraints. Losing these is catastrophic. Kept verbatim.
Last 4 turns — current working state. Compress these and you break continuity. Kept verbatim.
Everything in between — repeated phrasing, confirmed values, filler. Safe to compress.

So the middle is compressed, the edges are not:

HEAD   → first 2 turns  → VERBATIM
MIDDLE → turns 3 to N-4 → WEIGHTED + COMPRESSED
TAIL   → last 4 turns   → VERBATIM
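
The split above can be sketched in a few lines. This is a minimal illustration, not the proxy's actual code: `split_turns` is a hypothetical helper, and treating short conversations as "all verbatim" is our assumption.

```python
from typing import List, Tuple

HEAD_TURNS = 2  # first turns kept verbatim
TAIL_TURNS = 4  # last turns kept verbatim


def split_turns(turns: List[str]) -> Tuple[List[str], List[str], List[str]]:
    """Return (head, middle, tail); only the middle is eligible for compression."""
    if len(turns) <= HEAD_TURNS + TAIL_TURNS:
        # Too short to have a middle: everything stays verbatim.
        return turns[:HEAD_TURNS], [], turns[HEAD_TURNS:]
    return turns[:HEAD_TURNS], turns[HEAD_TURNS:-TAIL_TURNS], turns[-TAIL_TURNS:]


turns = [f"turn {i}" for i in range(1, 11)]
head, middle, tail = split_turns(turns)
print(len(head), len(middle), len(tail))  # 2 4 4
```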

Not all middle turns are treated equally either. Each turn gets a score based on how much new information it adds — new entities, new vocabulary, delta from what came before. Turns with high information density survive compression; pure acknowledgements ("ok", "noted", "sounds good") and reformulations of things already said are pruned first.
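
As an illustration of the idea, here is one deliberately simple way to score turns by novelty: count the tokens a turn introduces that no earlier turn contained. This is our sketch, not Promptolian's scoring function; the tokenizer, the `novelty_scores` name, and the use of a raw count are all assumptions.

```python
import re
from typing import List


def novelty_scores(turns: List[str]) -> List[int]:
    """Score each turn by the number of tokens not seen in any earlier turn."""
    seen = set()
    scores = []
    for turn in turns:
        raw = re.findall(r"[\w:/.\-]+", turn.lower())
        tokens = {tok.strip(".,!?") for tok in raw} - {""}
        scores.append(len(tokens - seen))
        seen |= tokens
    return scores


turns = [
    "Use postgres://db.prod/main for the deploy",
    "ok",
    "So we use postgres://db.prod/main for the deploy",
    "Set pool size to 20 and timeout to 30s",
]
scores = novelty_scores(turns)
# Lowest-scoring turns are the first candidates for pruning.
prune_order = sorted(range(len(turns)), key=lambda i: scores[i])
print(scores, prune_order)
```

In this toy run the bare acknowledgement and the restatement score lowest, so they would be pruned before the turn that introduces the pool size and timeout.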

The compression engine is rule-based — no LLM. It encodes repeated entities into a local registry (postgres://db.prod/main → §E1) and expands them back before each API call. Facts aren't summarised — they're encoded. Nothing is destroyed.
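
A minimal sketch of what "encoded, not summarised" can look like, assuming the only entities are URL-like strings that appear at least twice. The regex, the `encode`/`expand` names, and the repeat threshold are illustrative assumptions; the real registry is not described in this post. The sketch also assumes the original text never contains a `§E…` placeholder itself.

```python
import re
from collections import Counter
from typing import Dict, Tuple

# ASSUMPTION: entities are URL-like strings such as postgres://db.prod/main.
URL_PATTERN = re.compile(r"[a-z][a-z0-9+.\-]*://[^\s\"'<>]+")


def encode(text: str, min_repeats: int = 2) -> Tuple[str, Dict[str, str]]:
    """Replace repeated entities with short placeholders; return text and registry."""
    counts = Counter(URL_PATTERN.findall(text))
    registry: Dict[str, str] = {}
    for entity, n in counts.items():
        if n >= min_repeats:
            registry[f"§E{len(registry) + 1}"] = entity
    # Replace longer entities first so overlapping values are handled cleanly.
    for placeholder, entity in sorted(registry.items(), key=lambda kv: -len(kv[1])):
        text = text.replace(entity, placeholder)
    return text, registry


def expand(text: str, registry: Dict[str, str]) -> str:
    """Restore the exact original values before the text is sent upstream."""
    # Longer placeholders first, so §E1 never matches inside §E10.
    for placeholder in sorted(registry, key=len, reverse=True):
        text = text.replace(placeholder, registry[placeholder])
    return text


original = (
    "Connect to postgres://db.prod/main first. "
    "Then migrate postgres://db.prod/main and ping https://example.com/health once."
)
encoded, registry = encode(original)
print(encoded)
print(expand(encoded, registry) == original)  # True
```

The round trip is exact by construction: the value is stored once in the registry and substituted back character for character, so there is nothing for a summariser to paraphrase.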

Benchmark across 25 sessions, 5 task domains, same Factory.ai 6-dimension methodology:

System	Quality	Compression
Promptolian	4.26 / 5	21.8%
Anthropic built-in	3.44 / 5	98.7%
OpenAI built-in	3.35 / 5	99.3%

Better context quality at 22% compression vs 99%.

The second problem: your agent re-sends the same tool schema every call

Every API call re-sends the full tool schema — even if nothing changed. For 5 tools that's ~600 tokens wasted per call.

Call 1: [system] + [tools: 600 tok] + [message]  → full price
Call 2: [system] + [tools: 600 tok] + [message]  → full price again

Anthropic's prompt cache solves this — but only if you manually add cache_control blocks. Most frameworks don't. The proxy does it automatically.
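
If you want to do this by hand, the usual pattern is to put a `cache_control` marker on the last tool definition so the whole tool list becomes a cacheable prefix. The helper below is a sketch of that step only; `with_cached_tools` and the example tools are hypothetical, and this is not the proxy's code.

```python
import copy
from typing import Any, Dict, List


def with_cached_tools(tools: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return a copy of `tools` with a cache breakpoint on the last definition."""
    if not tools:
        return []
    cached = copy.deepcopy(tools)  # do not mutate the caller's list
    cached[-1]["cache_control"] = {"type": "ephemeral"}
    return cached


# Hypothetical tool definitions, for illustration only.
tools = [
    {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
    {
        "name": "get_time",
        "description": "Return the current time in a timezone.",
        "input_schema": {
            "type": "object",
            "properties": {"tz": {"type": "string"}},
            "required": ["tz"],
        },
    },
]

cached_tools = with_cached_tools(tools)
print(cached_tools[-1]["cache_control"])
```

You would then pass `cached_tools` as the `tools` argument of `client.messages.create(...)`. Check the provider's prompt-caching documentation for the minimum cacheable prompt length: a very small tool schema may not be cached on its own.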

Assumptions:

500 calls/day · 5 tools · ~120 tokens each = 600 tool tokens/call
30 days/month → 9M tool tokens/month → $27.00 without caching
With Anthropic prompt cache (10% on hits) → $2.70
Saving: $24.30/month
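
The arithmetic above, spelled out so you can plug in your own numbers. It uses the $3/MTok input price from the earlier assumptions and, like the figures above, treats every call as a cache hit and ignores any surcharge for writing to the cache, so read the result as an upper bound.

```python
CALLS_PER_DAY = 500
TOOLS = 5
TOKENS_PER_TOOL = 120
DAYS_PER_MONTH = 30
PRICE_PER_MTOK = 3.00        # USD per million input tokens
CACHE_HIT_MULTIPLIER = 0.10  # cached tokens billed at 10% of the input price


def monthly_tool_cost(multiplier: float = 1.0) -> float:
    """Monthly spend on tool-schema tokens, in USD."""
    tokens = CALLS_PER_DAY * TOOLS * TOKENS_PER_TOOL * DAYS_PER_MONTH
    return round(tokens / 1_000_000 * PRICE_PER_MTOK * multiplier, 2)


uncached = monthly_tool_cost()
cached = monthly_tool_cost(CACHE_HIT_MULTIPLIER)
print(uncached, cached, round(uncached - cached, 2))  # 27.0 2.7 24.3
```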

The business case

Context failures are rework events. The fact-loss rates derived from benchmark scores:

System	Quality	Fact-loss rate
Promptolian	4.26/5	14.8%
Anthropic built-in	3.44/5	31.2%
OpenAI built-in	3.35/5	33.0%
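
The rates in the table come from the formula given in the methodology note, fact-loss rate = 1 − quality/5. A short check, using only the scores reported in this post:

```python
def fact_loss_rate(quality: float, max_score: float = 5.0) -> float:
    """Fact-loss rate in percent, defined as 1 - quality / max_score."""
    return round((1 - quality / max_score) * 100, 1)


for name, quality in [
    ("Promptolian", 4.26),
    ("Anthropic built-in", 3.44),
    ("OpenAI built-in", 3.35),
]:
    print(f"{name}: {fact_loss_rate(quality)}%")
```

Note that this is a linear mapping from a rubric score, not a direct count of lost facts, so treat the percentages as a proxy.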

If fixing a context failure costs your team more than 3.5 minutes, Promptolian is cheaper than Anthropic built-in in total. A context failure that reaches code review can cost an hour.

Tool caching saves another ~$24/month on top. Token savings show up in your API bill. Rework savings don't, but that's where the ROI lives.

Try it

pip install "promptolian[proxy]"

promptolian proxy            # tool caching only — 1 min to production
promptolian proxy --compress # + context history compression

import anthropic

client = anthropic.Anthropic(
    base_url="http://localhost:3002",  # only change needed
)

Or skip self-hosting with the cloud proxy:

client = anthropic.Anthropic(
    base_url="https://proxy.promptolian.com",
    default_headers={"X-Promptolian-Key": "pk_..."},
)

Full docs: promptolian.com/docs.html · GitHub: github.com/Maurizio-L/promptolian-public

Methodology: 25 sessions × 5 task domains, Factory.ai 6-dimension probe scoring (Accuracy, Context, Artifact, Completeness, Continuity, Instruction). Anthropic/OpenAI baselines from Factory.ai May 2026. Promptolian: internal benchmark, same methodology. Validation run: 4.19/5 (second 25-session run after entity-encoding fix). Fact-loss rate = 1 − quality/5.

Top comments (9)

Ken W Alger • Jun 1

This is a fantastic and deeply necessary breakdown. The industry has treated context compression as a pure FinOps optimization, completely ignoring the cognitive degradation it introduces.

When teams use a probabilistic model to summarize an ongoing session, they are introducing semantic drift under the hood. The agent ends up running on an unverified, lossy 'hallucination of its own history.'

In my work on the Sovereign System Spec, I've been advocating for treating context management as a strict engineering boundary—using deterministic sieves to strip conversational filler (the 'Prose Tax') at the ingestion gate, while preserving core transactions in a cryptographically signed, immutable ledger underneath. If we don't measure what a model forgets during compression, we aren't engineering a system; we're just gambling with state.

Maurizio-L • Jun 1

Hi Ken! Word up!

To add to what you said: following a benchmark study I did, the optimal compression rate should fall between 20% and 40% of the original context window in order not to lose information.

I guess not losing information won't be enough to stop humans from passing the buck to LLMs.

At some point we should create some metrics by imagining we are an LLM receiving bad prompts and complaining: "It's not our fault! Dude, you just gave a suboptimal prompt..." 😄

Thanks for your comment, @kenwalger

Ken W Alger • Jun 1 • Edited

Those benchmark numbers (20%–40%) are a massive sanity check for the industry. It proves there is a hard structural ceiling where optimization turns into destruction.

Your joke about the LLM complaining about bad prompts hits on a profound architectural truth. We need to stop treating the input payload as a 'vibes-based string' and start treating it as a strict software contract.

Within the Sovereign System Spec, we treat context management as a dual-engine responsibility rather than a single freeform text dump. We use a deterministic sieve to strip out colloquial filler (the Prose Tax) at the ingestion gate so we can stay safely within that optimal 20%–40% window without touching the core data structure.

But more importantly, the Sovereign SDK uses a stateful SessionContext and a deterministic runtime router to enforce strict input validation. Before a query ever touches a model token, the runtime checks the structural boundaries—verifying, for example, a monotonic execution_depth index and the cryptographic hashes of preceding tool states.

If the upstream compression routine violates the contract by dropping critical variables, the runtime gatekeeper should immediately fail the transaction. We absolutely need a metrics layer that stands up for the model and flags when it's being fed a compromised or suboptimal history. You can't have model accountability without strict input custody! Thanks for dropping those benchmark insights.

Harjot Singh • May 31

"Works beautifully until your agent forgets the one detail that mattered" is the exact silent failure of context compression, and you've named the gap nobody owns: compression is measured by tokens saved, never by information destroyed, so the metric points entirely at the win and is blind to the cost. The brutal part is that the cost is selective, summarizing turns is lossy in a way that usually drops the rare, specific, load-bearing detail (the constraint, the edge case, the one ID) while faithfully keeping the generic narrative, so the agent feels coherent and is quietly wrong. Measuring what compression forgets is the right and missing discipline: you need a held-out set of must-not-forget facts and a check that they survive the compression, otherwise you're optimizing token savings against an invisible accuracy regression. The compression that loses the one thing that mattered is negative ROI no matter how many tokens it saved. The instinct I'd add: protect the negative constraints and specifics from compression explicitly, never let never-do-this or the exact figure get summarized into a vibe. That measure-what-you-destroyed-not-just-what-you-saved view is core to how I think about context in Moonshift. How are you measuring the forgetting, a recall set of key facts, or downstream task success before/after compression?

Maurizio-L • Jun 1

Hi Harjot thanks a lot for your comment!

Regarding your question:

We measure it with a held-out recall set: factual questions whose answers were explicitly stated earlier, probed after compression. The Artifact dimension specifically checks whether the exact value survived, not just the vibe. LLM summarizers score 2.2/5 on that. They keep the narrative, lose the number.

The entity registry is our answer to your constraint-protection point. Exact figures, IDs, "never-do-this" values get extracted before compression touches the text and expanded back verbatim before each API call. Can't be paraphrased if it's never handed to a summarizer.

The exact design of the entity registry engine is a bit complex and will be fully explained in our methodology paper, which will be out soon!
So keep an eye on the research section of our website.