Everyone compresses their agent's context. Nobody measures what it forgets.

Maurizio-L — Thu, 28 May 2026 21:33:47 +0000

How we benchmarked context quality across Anthropic, OpenAI, and Promptolian — a transparent proxy that compresses agent context without losing facts — and found the sweet spot nobody talks about

The problem nobody talks about
The U-curve most teams never see
What we built instead
The second problem: tool schema re-sending
The business case
Try it

The problem nobody talks about

You ship an AI agent. It works great on the first 10 messages. By message 30, it's asking you to repeat the database URL. By message 50, it's recommending the wrong config value — one it hallucinated because the real one got summarised away three turns back.

This isn't a model problem. The problem is context management.

Most agent frameworks hand off context management to the provider — Anthropic's built-in summarisation, or OpenAI's equivalent. The provider, with no way to know which facts matter, does what it's built to do: compress aggressively. 98–99% compression. Everything gets squashed into a summary paragraph.

The result: your agent forgets things. And every time it forgets something important, a human has to step in. That's rework. And rework costs more than tokens.

The U-curve most teams never see

Total cost = context tokens + rework from fact loss.

Everyone optimises the first term, the token bill. But when the agent loses facts, someone has to catch the error, re-explain the context, and redo the work. That's not free.

Plot quality score against compression rate and you get the curve in Figure 1 (interactive version: promptolian.com/ucurve.html).

Promptolian (22%) stays in the green zone (≥ 4.0). Both provider built-ins land in the red zone at 99% compression — because LLM summarisers don't know which facts matter later. They see postgres://db.prod/main and write "the database connection was discussed." Accurate. Useless.

Factory.ai measured this (May 2026):

System	Context quality	Compression
Anthropic built-in	3.44 / 5	98.7%
OpenAI built-in	3.35 / 5	99.3%

The quality gap has a direct cost. Figure 2 shows total monthly spend (API tokens + engineer time fixing context failures) for a solo developer:

Assumptions:

100 sessions/month
50 calls/session · 8K context tokens per call
$3/MTok input (Claude Sonnet 4)
$100/hr engineer rate

At zero debugging time, Anthropic built-in wins (99% token savings). Above 3.5 minutes per failure, Promptolian is cheaper. That threshold is low — it's the agent asking you to re-confirm the deployment target, you correcting it, losing your train of thought.
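To make the token side of that comparison concrete, here is a rough sketch of the monthly spend under the assumptions listed above. It assumes the 8K context is billed on every call and that compression reduces billed tokens by the stated percentage. Both are simplifications on our part, and the function name is only illustrative.

```python
# Monthly context-token spend under the article's assumptions.
# Assumption: 8K context tokens are billed on every call, and a compression
# rate of c leaves (1 - c) of those tokens on the bill.
SESSIONS_PER_MONTH = 100
CALLS_PER_SESSION = 50
CONTEXT_TOKENS_PER_CALL = 8_000
USD_PER_MTOK = 3.00  # input price quoted above


def monthly_token_cost(compression: float) -> float:
    """Return USD per month for context tokens at a compression rate in [0, 1]."""
    raw_tokens = SESSIONS_PER_MONTH * CALLS_PER_SESSION * CONTEXT_TOKENS_PER_CALL
    billed_tokens = raw_tokens * (1.0 - compression)
    return billed_tokens / 1_000_000 * USD_PER_MTOK


if __name__ == "__main__":
    systems = [("No compression", 0.0), ("Promptolian", 0.218), ("Anthropic built-in", 0.987)]
    for name, compression in systems:
        print(f"{name:20s} ${monthly_token_cost(compression):7.2f}/month")
```

On tokens alone the built-in wins by roughly $92 a month. That gap is what the rework term has to close.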

The cost minimum isn't at maximum compression. Of the three systems measured, the cheapest in total sits at 22%, not 99%.

What we built instead

Promptolian is a transparent proxy that sits between your agent and the Anthropic API. You change one line (base_url) — no changes to agent logic.

The key insight: not all turns are equal.

First 2 turns — task framing and constraints. Losing these is catastrophic. Kept verbatim.
Last 4 turns — current working state. Compress these and you break continuity. Kept verbatim.
Everything in between — repeated phrasing, confirmed values, filler. Safe to compress.

So the middle is compressed, the edges are not:

HEAD   → first 2 turns  → VERBATIM
MIDDLE → turns 3 to N-4 → WEIGHTED + COMPRESSED
TAIL   → last 4 turns   → VERBATIM
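The split above can be sketched in a few lines. This is a minimal illustration, not the proxy's actual code: the function name and the handling of short conversations (everything kept verbatim when there is no middle) are our assumptions.

```python
from typing import List, Tuple, TypeVar

T = TypeVar("T")


def split_turns(turns: List[T], head: int = 2, tail: int = 4) -> Tuple[List[T], List[T], List[T]]:
    """Split a conversation into (head, middle, tail).

    Head and tail are meant to be kept verbatim; only the middle is a
    candidate for compression.
    """
    if len(turns) <= head + tail:
        # Assumption: too short to have a middle, so keep everything verbatim.
        return list(turns), [], []
    cut = len(turns) - tail
    return turns[:head], turns[head:cut], turns[cut:]


if __name__ == "__main__":
    head, middle, tail = split_turns(list(range(1, 11)))
    print(head, middle, tail)
```

With 10 turns, the middle is turns 3 to 6, which matches "turns 3 to N-4".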

Not all middle turns are treated equally either. Each turn gets a score based on how much new information it adds — new entities, new vocabulary, delta from what came before. Turns with high information density survive compression; pure acknowledgements ("ok", "noted", "sounds good") and reformulations of things already said are pruned first.
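One way to picture that scoring is as a novelty ratio: the share of a turn's vocabulary that has not appeared earlier in the conversation. The sketch below is a deliberately crude stand-in for the real weighting. The acknowledgement list, the tokenizer, and the scoring formula are all assumptions made for illustration.

```python
import re
from typing import List, Set

# Hypothetical list of pure acknowledgements that carry no new information.
ACKNOWLEDGEMENTS = {"ok", "okay", "noted", "sounds good", "got it", "thanks"}


def novelty_scores(turns: List[str]) -> List[float]:
    """Score each turn by the fraction of its distinct words not seen before."""
    seen: Set[str] = set()
    scores: List[float] = []
    for text in turns:
        tokens = re.findall(r"\w+", text.lower())
        if not tokens or " ".join(tokens) in ACKNOWLEDGEMENTS:
            scores.append(0.0)  # empty turn or pure acknowledgement
        else:
            distinct = set(tokens)
            scores.append(len(distinct - seen) / len(distinct))
        seen.update(tokens)
    return scores


if __name__ == "__main__":
    turns = [
        "Use postgres://db.prod/main for the database",
        "ok",
        "The database is postgres://db.prod/main",
        "Deploy target is eu-west-1",
    ]
    for turn, score in zip(turns, novelty_scores(turns)):
        print(f"{score:.2f}  {turn}")
```

The restated database line scores far lower than the turn that introduces the deploy target, so it would be pruned first.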

The compression engine is rule-based — no LLM. It encodes repeated entities into a local registry (postgres://db.prod/main → §E1) and expands them back before each API call. Facts aren't summarised — they're encoded. Nothing is destroyed.
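The encode-and-expand idea can be sketched as a lossless round trip. This is a simplified illustration under our own assumptions: it only recognises URL-like strings, it encodes every occurrence rather than only repeated ones, and it would misbehave if the original text already contained a literal placeholder such as §E1.

```python
import re
from typing import Dict

# Assumption: entities are URL-like strings; trailing punctuation is excluded.
URL_PATTERN = re.compile(r"[a-z][a-z0-9+.-]*://[^\s<>]*[^\s<>.,;:!?)]")


def encode(text: str, registry: Dict[str, str]) -> str:
    """Replace each URL-like entity with a short placeholder (§E1, §E2, ...)."""
    def substitute(match: "re.Match[str]") -> str:
        entity = match.group(0)
        if entity not in registry:
            registry[entity] = f"§E{len(registry) + 1}"
        return registry[entity]

    return URL_PATTERN.sub(substitute, text)


def expand(text: str, registry: Dict[str, str]) -> str:
    """Restore the original entities from their placeholders."""
    # Longest placeholders first, so §E10 is not read as §E1 followed by "0".
    for entity, placeholder in sorted(registry.items(), key=lambda kv: -len(kv[1])):
        text = text.replace(placeholder, entity)
    return text


if __name__ == "__main__":
    registry: Dict[str, str] = {}
    original = "Connect to postgres://db.prod/main, then migrate postgres://db.prod/main."
    encoded = encode(original, registry)
    print(encoded)
    print(expand(encoded, registry) == original)
```

Because the registry maps each placeholder back to the exact string, the expansion reproduces the input byte for byte, which is the property a summary cannot offer.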

Benchmark across 25 sessions, 5 task domains, same Factory.ai 6-dimension methodology:

System	Quality	Compression
Promptolian	4.26 / 5	21.8%
Anthropic built-in	3.44 / 5	98.7%
OpenAI built-in	3.35 / 5	99.3%

Better context quality at 22% compression vs 99%.

The second problem: your agent re-sends the same tool schema every call

Every API call re-sends the full tool schema — even if nothing changed. For 5 tools that's ~600 tokens wasted per call.

Call 1: [system] + [tools: 600 tok] + [message]  → full price
Call 2: [system] + [tools: 600 tok] + [message]  → full price again

Anthropic's prompt cache solves this — but only if you manually add cache_control blocks. Most frameworks don't. The proxy does it automatically.
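For reference, this is roughly what the manual step looks like if you do it yourself. The helper name is hypothetical and this is not the proxy's code. It relies on Anthropic's documented convention that a cache_control marker on the last tool definition makes the tool list up to that point cacheable.

```python
import copy
from typing import Any, Dict, List


def with_cached_tools(tools: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return a copy of the tool list with a cache breakpoint on the last tool."""
    if not tools:
        return []
    cached = copy.deepcopy(tools)  # leave the caller's list untouched
    cached[-1]["cache_control"] = {"type": "ephemeral"}
    return cached


if __name__ == "__main__":
    # Hypothetical tool definitions, for illustration only.
    tools = [
        {"name": "read_file", "description": "Read a file",
         "input_schema": {"type": "object", "properties": {"path": {"type": "string"}}}},
        {"name": "run_tests", "description": "Run the test suite",
         "input_schema": {"type": "object", "properties": {}}},
    ]
    print(with_cached_tools(tools)[-1]["cache_control"])
```

You would then pass `tools=with_cached_tools(tools)` to `client.messages.create(...)`. Note that Anthropic enforces a minimum cacheable prompt length, so a very small tool list on its own may not be cached.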

Assumptions:

500 calls/day · 5 tools · ~120 tokens each = 600 tool tokens/call
30 days/month → 9M tool tokens/month → $27.00 without caching
With Anthropic prompt cache (cache hits billed at 10% of the input price) → $2.70
Saving: $24.30/month
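The arithmetic above is easy to check. The sketch below reproduces it under the same assumptions plus two simplifications of our own: every call is treated as a cache hit, and the surcharge Anthropic applies to cache writes is ignored. Real savings will therefore be slightly lower.

```python
def monthly_tool_token_cost(
    calls_per_day: int = 500,
    tools: int = 5,
    tokens_per_tool: int = 120,
    days: int = 30,
    usd_per_mtok: float = 3.00,
    price_multiplier: float = 1.0,
) -> float:
    """Return USD per month spent on re-sent tool schema tokens.

    price_multiplier is 1.0 without caching and 0.10 for cache hits.
    """
    tokens = calls_per_day * tools * tokens_per_tool * days
    return tokens / 1_000_000 * usd_per_mtok * price_multiplier


if __name__ == "__main__":
    uncached = monthly_tool_token_cost()
    cached = monthly_tool_token_cost(price_multiplier=0.10)
    print(f"uncached ${uncached:.2f}, cached ${cached:.2f}, saving ${uncached - cached:.2f}")
```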

The business case

Context failures are rework events. The fact-loss rates below are derived from the benchmark scores (fact-loss rate = 1 − quality/5):

System	Quality	Fact-loss rate
Promptolian	4.26/5	14.8%
Anthropic built-in	3.44/5	31.2%
OpenAI built-in	3.35/5	33.0%

If fixing a context failure takes your team more than about 3.5 minutes, Promptolian's total cost is lower than the Anthropic built-in's. A context failure that reaches code review can cost an hour.
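Here is a back-of-the-envelope reconstruction of that break-even point. It assumes the solo-developer scenario from earlier (100 sessions, 50 calls per session, 8K tokens per call, $3/MTok, $100/hr) and, as our own simplification, treats the fact-loss rate as the chance that a session produces one failure needing a fix. The failure model is an assumption, so read the result as an estimate.

```python
SESSIONS_PER_MONTH = 100
CALLS_PER_SESSION = 50
CONTEXT_TOKENS_PER_CALL = 8_000
USD_PER_MTOK = 3.00
ENGINEER_USD_PER_HOUR = 100.0


def fact_loss_rate(quality: float) -> float:
    """Fact-loss rate as defined in the methodology: 1 - quality / 5."""
    return 1.0 - quality / 5.0


def monthly_token_cost(compression: float) -> float:
    raw = SESSIONS_PER_MONTH * CALLS_PER_SESSION * CONTEXT_TOKENS_PER_CALL
    return raw * (1.0 - compression) / 1_000_000 * USD_PER_MTOK


def monthly_total_cost(quality: float, compression: float, minutes_per_failure: float) -> float:
    """Token spend plus engineer time spent fixing context failures."""
    # Assumption: one potential failure per session, at the fact-loss rate.
    failures = fact_loss_rate(quality) * SESSIONS_PER_MONTH
    rework = failures * minutes_per_failure / 60.0 * ENGINEER_USD_PER_HOUR
    return monthly_token_cost(compression) + rework


def break_even_minutes(low_loss: tuple, high_loss: tuple) -> float:
    """Minutes per failure at which both (quality, compression) systems cost the same."""
    token_gap = monthly_token_cost(low_loss[1]) - monthly_token_cost(high_loss[1])
    failure_gap = (fact_loss_rate(high_loss[0]) - fact_loss_rate(low_loss[0])) * SESSIONS_PER_MONTH
    return token_gap / failure_gap * 60.0 / ENGINEER_USD_PER_HOUR


if __name__ == "__main__":
    promptolian = (4.26, 0.218)
    anthropic_builtin = (3.44, 0.987)
    print(f"break-even: {break_even_minutes(promptolian, anthropic_builtin):.2f} min per failure")
```

Under these assumptions the break-even lands at about 3.4 minutes per failure, close to the 3.5-minute figure quoted above. The small gap presumably comes from rounding or a slightly different failure model.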

Tool caching saves a further ~$24/month on top. Token savings show up in your API bill. Rework savings don't, but that's where the ROI lives.

Try it

pip install "promptolian[proxy]"

promptolian proxy            # tool caching only — 1 min to production
promptolian proxy --compress # + context history compression

import anthropic

client = anthropic.Anthropic(
    base_url="http://localhost:3002",  # only change needed
)

Or skip self-hosting with the cloud proxy:

client = anthropic.Anthropic(
    base_url="https://proxy.promptolian.com",
    default_headers={"X-Promptolian-Key": "pk_..."},
)

Full docs: promptolian.com/docs.html · GitHub: github.com/Maurizio-L/promptolian-public

Methodology: 25 sessions × 5 task domains, Factory.ai 6-dimension probe scoring (Accuracy, Context, Artifact, Completeness, Continuity, Instruction). Anthropic/OpenAI baselines from Factory.ai May 2026. Promptolian: internal benchmark, same methodology. Validation run: 4.19/5 (second 25-session run after entity-encoding fix). Fact-loss rate = 1 − quality/5.