How we benchmarked context quality across Anthropic, OpenAI, and Promptolian — a transparent proxy that compresses agent context without losing facts — and found the sweet spot nobody talks about
- The problem nobody talks about
- The U-curve most teams never see
- What we built instead
- The second problem: tool schema re-sending
- The business case
- Try it
The problem nobody talks about
You ship an AI agent. It works great on the first 10 messages. By message 30, it's asking you to repeat the database URL. By message 50, it's recommending the wrong config value — one it hallucinated because the real one got summarised away three turns back.
This isn't a model problem. The problem is context management.
Most agent frameworks hand off context management to the provider — Anthropic's built-in summarisation, or OpenAI's equivalent. The provider, with no way to know which facts matter, does what it's built to do: compress aggressively. 98–99% compression. Everything gets squashed into a summary paragraph.
The result: your agent forgets things. And every time it forgets something important, a human has to step in. That's rework. And rework costs more than tokens.
The U-curve most teams never see
Total cost = context tokens + rework from fact loss.
Everyone optimises the left side. But when the agent loses facts, someone has to catch the error, re-explain the context, and redo the work. That's not free.
Plot quality score against compression rate and you get this:

Interactive version: promptolian.com/ucurve.html
Promptolian (22%) stays in the green zone (≥ 4.0). Both provider built-ins land in the red zone at 99% compression — because LLM summarisers don't know which facts matter later. They see postgres://db.prod/main and write "the database connection was discussed." Accurate. Useless.
Factory.ai measured this (May 2026):
| System | Context quality | Compression |
|---|---|---|
| Anthropic built-in | 3.44 / 5 | 98.7% |
| OpenAI built-in | 3.35 / 5 | 99.3% |
The quality gap has a direct cost. Figure 2 shows total monthly spend (API tokens + engineer time fixing context failures) for a solo developer:
- 100 sessions/month
- 50 calls/session · 8K context tokens
- $3/MTok input (Claude Sonnet 4)
- $100/hr engineer rate
At zero debugging time, Anthropic built-in wins (99% token savings). Above 3.5 minutes per failure, Promptolian is cheaper. That threshold is low — it's the agent asking you to re-confirm the deployment target, you correcting it, losing your train of thought.
The cost minimum isn't at maximum compression. It's at 22%, not 99%.
What we built instead
Promptolian is a transparent proxy that sits between your agent and the Anthropic API. You change one line (base_url) — no changes to agent logic.
The key insight: not all turns are equal.
- First 2 turns — task framing and constraints. Losing these is catastrophic. Kept verbatim.
- Last 4 turns — current working state. Compress these and you break continuity. Kept verbatim.
- Everything in between — repeated phrasing, confirmed values, filler. Safe to compress.
So the middle is compressed, the edges are not:
HEAD → first 2 turns → VERBATIM
MIDDLE → turns 3 to N-4 → WEIGHTED + COMPRESSED
TAIL → last 4 turns → VERBATIM
Not all middle turns are treated equally either. Each turn gets a score based on how much new information it adds — new entities, new vocabulary, delta from what came before. Turns with high information density survive compression; pure acknowledgements ("ok", "noted", "sounds good") and reformulations of things already said are pruned first.
The compression engine is rule-based — no LLM. It encodes repeated entities into a local registry (postgres://db.prod/main → §E1) and expands them back before each API call. Facts aren't summarised — they're encoded. Nothing is destroyed.
Benchmark across 25 sessions, 5 task domains, same Factory.ai 6-dimension methodology:
| System | Quality | Compression |
|---|---|---|
| Promptolian | 4.26 / 5 | 21.8% |
| Anthropic built-in | 3.44 / 5 | 98.7% |
| OpenAI built-in | 3.35 / 5 | 99.3% |
Better context quality at 22% compression vs 99%.
The second problem: your agent re-sends the same tool schema every call
Every API call re-sends the full tool schema — even if nothing changed. For 5 tools that's ~600 tokens wasted per call.
Call 1: [system] + [tools: 600 tok] + [message] → full price
Call 2: [system] + [tools: 600 tok] + [message] → full price again
Anthropic's prompt cache solves this — but only if you manually add cache_control blocks. Most frameworks don't. The proxy does it automatically.
Assumptions:
- 500 calls/day · 5 tools · ~120 tokens each = 600 tool tokens/call
- 30 days/month → 9M tool tokens/month → $27.00 without caching
- With Anthropic prompt cache (10% on hits) → $2.70
- Saving: $24.30/month
The business case
Context failures are rework events. The fact-loss rates derived from benchmark scores:
| System | Quality | Fact-loss rate |
|---|---|---|
| Promptolian | 4.26/5 | 14.8% |
| Anthropic built-in | 3.44/5 | 31.2% |
| OpenAI built-in | 3.35/5 | 33.0% |
If fixing a context failure costs your team more than 3.5 minutes, Promptolian is cheaper than Anthropic built-in in total. A context failure that reaches code review can cost an hour.
Tool caching adds ~$24/month on top. Token savings show up in your API bill. Rework savings don't — but that's where the ROI lives.
Try it
pip install "promptolian[proxy]"
promptolian proxy # tool caching only — 1 min to production
promptolian proxy --compress # + context history compression
import anthropic
client = anthropic.Anthropic(
base_url="http://localhost:3002", # only change needed
)
Or skip self-hosting with the cloud proxy:
client = anthropic.Anthropic(
base_url="https://proxy.promptolian.com",
default_headers={"X-Promptolian-Key": "pk_..."},
)
Full docs: promptolian.com/docs.html · GitHub: github.com/Maurizio-L/promptolian-public
Methodology: 25 sessions × 5 task domains, Factory.ai 6-dimension probe scoring (Accuracy, Context, Artifact, Completeness, Continuity, Instruction). Anthropic/OpenAI baselines from Factory.ai May 2026. Promptolian: internal benchmark, same methodology. Validation run: 4.19/5 (second 25-session run after entity-encoding fix). Fact-loss rate = 1 − quality/5.

Top comments (8)
This is a fantastic and deeply necessary breakdown. The industry has treated context compression as a pure FinOps optimization, completely ignoring the cognitive degradation it introduces.
When teams use a probabilistic model to summarize an ongoing session, they are introducing semantic drift under the hood. The agent ends up running on an unverified, lossy 'hallucination of its own history.'
In my work on the Sovereign System Spec, I've been advocating for treating context management as a strict engineering boundary—using deterministic sieves to strip conversational filler (the 'Prose Tax') at the ingestion gate, while preserving core transactions in a cryptographically signed, immutable ledger underneath. If we don't measure what a model forgets during compression, we aren't engineering a system; we're just gambling with state.
Hi Keen! Word up!
to integrate what you said and following a benchmark study i did, the optimal compression rate should fall between [20%-40%] of the original context window in order to dont lose information.
I guess dont lose information wont be enough to avoid the humans to pass the buck to LLMs.
At some points we should create some metrics imagining we are LLM receiveing wrong prompts complaining "Is not our fault! Dude you just gave a suboptimal prompt..." 😄
thanks for your comment @kenwalger
Those benchmark numbers ($20\%–40\%$) are a massive sanity check for the industry. It proves there is a hard structural ceiling where optimization turns into destruction.
Your joke about the LLM complaining about bad prompts hits on a profound architectural truth. We need to stop treating the input payload as a 'vibes-based string' and start treating it as a strict software contract.
Within the Sovereign System Spec, we treat context management as a dual-engine responsibility rather than a single freeform text dump. We use a deterministic sieve to strip out colloquial filler (the Prose Tax) at the ingestion gate so we can stay safely within that optimal $20\%–40\%$ window without touching the core data structure.
But more importantly, the Sovereign SDK uses a stateful
SessionContextand a deterministic runtime router to enforce strict input validation. Before a query ever touches a model token, the runtime checks the structural boundaries—verifying, for example, a monotonicexecution_depthindex and the cryptographic hashes of preceding tool states.If the upstream compression routine violates the contract by dropping critical variables, the runtime gatekeeper should immediately fail the transaction. We absolutely need a metrics layer that stands up for the model and flags when it's being fed a compromised or suboptimal history. You can't have model accountability without strict input custody! Thanks for dropping those benchmark insights.
"Works beautifully until your agent forgets the one detail that mattered" is the exact silent failure of context compression, and you've named the gap nobody owns: compression is measured by tokens saved, never by information destroyed, so the metric points entirely at the win and is blind to the cost. The brutal part is that the cost is selective, summarizing turns is lossy in a way that usually drops the rare, specific, load-bearing detail (the constraint, the edge case, the one ID) while faithfully keeping the generic narrative, so the agent feels coherent and is quietly wrong. Measuring what compression forgets is the right and missing discipline: you need a held-out set of must-not-forget facts and a check that they survive the compression, otherwise you're optimizing token savings against an invisible accuracy regression. The compression that loses the one thing that mattered is negative ROI no matter how many tokens it saved. The instinct I'd add: protect the negative constraints and specifics from compression explicitly, never let never-do-this or the exact figure get summarized into a vibe. That measure-what-you-destroyed-not-just-what-you-saved view is core to how I think about context in Moonshift. How are you measuring the forgetting, a recall set of key facts, or downstream task success before/after compression?
Hi Harjot thanks a lot for your comment!
Regarding your question:
We measure it with a held-out recall set: factual questions whose answers were explicitly stated earlier, probed after compression. Artifact dimension specifically (did the exact value survive, not just the vibe. LLM summarizers score 2.2/5 on that. They keep the narrative, lose the number.
The entity registry is our answer to your constraint-protection point. Exact figures, IDs, "never-do-this" values get extracted before compression touches the text and expanded back verbatim before each API call. Can't be paraphrased if it's never handed to a summarizer.
Now, the exact way of how we designed the entity registry engine is a bit complex and will be fully explained in our methodology paper that will be out soon!
So, monitor our website in the research section.
Thanks for sharing! Very impressive article! It is rare that not many researches on this domain. My team loves to explore Promptolian.
Very insightful work, thank you for sharing. Looking forward to using Promptolian
Some comments may only be visible to logged-in visitors. Sign in to view all comments.