I used to hear the pitch for context compression and think: sure, makes sense.
Smaller prompts. Lower latency. Lower cost. Same output quality.
Then I watched an agent blow a perfectly good debugging session because one field name disappeared from compressed memory.
That changed my opinion fast.
Three hours into a Claude Code run, the agent made the wrong API call with full confidence. The plan looked coherent. The reasoning looked clean. The summary of prior steps sounded smart.
It was also missing the one detail that mattered: a field name from an earlier error log.
The agent had already seen the bug. It had already “understood” the bug. But the compressed version of history dropped the exact detail it needed to avoid repeating it.
That’s the real failure mode.
Not “compression loses words.”
Compression loses the one fact your agent needs later, after it has already committed to the wrong action.
While researching this, I found a thread on r/openclaw about using Headroom with OpenClaw:
https://reddit.com/r/openclaw/comments/1u3j5xs/anyone_using_headroom_with_openclaw/
That thread gets at the real tension: compression is useful, but only if you treat it as a reversible optimization, not a memory wipe with better branding.
The bug pattern nobody talks about
Here’s the pattern I keep seeing in long-running agents:
- The agent collects a lot of noisy context.
- The team compresses it to save tokens.
- The summary preserves the broad story.
- The summary drops one edge-case fact.
- Two hours later, that fact becomes the only thing that matters.
- The agent confidently does the wrong thing.
This is why “same answers, fewer tokens” is not a serious reliability claim for agent workflows.
It might be true for some short chat tasks.
It is absolutely not something I’d assume for:
- n8n agents
- Make scenarios
- Zapier AI steps
- OpenClaw sessions
- Claude Code runs
- custom OpenAI-compatible agent loops
- multi-step debugging or incident workflows
In those systems, exact details matter more than elegant summaries.
Why compression fails in practice
Summaries are good at preserving themes.
Summaries are bad at preserving sharp edges.
That tradeoff is fine when you’re chatting with a model and throwing the session away. It is dangerous when the model is making decisions over hours.
The exact wording of an error matters.
The exact JSON shape matters.
The exact schema name matters.
The exact user constraint matters.
The exact branch that was ruled out matters.
A compressed summary is not the original context.
It is an interpretation of the original context.
That distinction is everything.
A concrete example
Imagine your agent saw this error earlier:
ValidationError: field "customer_external_id" is required
Now imagine the compressed memory turns that into this:
Previous API call failed because a customer identifier was missing.
That sounds fine until later, when the agent has to choose between:
customer_idexternal_customer_idcustomer_external_id
At that point, the summary is useless.
The “meaning” was preserved.
The recoverable debugging value was not.
That’s how agents end up sounding intelligent while still being wrong.
What I think is safe: compress noise, reload facts
My rule now is simple:
Compress what is noisy. Reload what is consequential.
Good compression candidates:
- repetitive terminal output
- verbose logs
- tool chatter
- long command traces
- bulky intermediate reasoning artifacts
Bad compression candidates:
- exact error messages
- schema details
- API payload shapes
- unresolved branches
- retrieved evidence
- user constraints
- final decisions
- debugging clues
If the agent might need the exact original later, don’t make the summary the source of truth.
Make the raw source retrievable.
The safe architecture vs the fragile one
The difference is not subtle.
Fragile architecture
raw context -> summary -> summary replaces source forever
This is one-way summarization pretending to be memory.
Safer architecture
raw context -> compressed working memory
+ indexed raw history
+ retrieval path back to source
In other words:
- keep the working set small
- keep the original data somewhere reachable
- let the agent fetch raw context on demand
That is a much better design for long-running agents.
Tool-by-tool: what I’d trust and what I wouldn’t
There are some important differences between the current approaches.
Headroom
Headroom is ambitious. It compresses logs, files, tool output, and history across an agent stack. The reported reductions are big, often in the 60% to 95% range.
That’s useful.
But the real question is not “how much did you shrink it?”
The real question is: can the agent recover the untouched source when the summary is not enough?
If yes, much safer.
If no, much riskier.
RTK
RTK is narrower, which I actually like from a systems perspective. If it’s mainly shrinking Bash output before it reaches Claude Code, the blast radius is easier to reason about.
One example I saw cited was a Claude Code session dropping from about 118,000 tokens to 23,900.
That’s a huge reduction.
But again, the question is not whether the number is impressive.
The question is whether the transformed output still preserves the exact details needed for later decisions, or whether the raw output is still accessible.
Prompt caching
Prompt caching is different.
For example, OpenAI Prompt Caching preserves the exact prompt prefix rather than rewriting it. That makes it much safer than summarization if your goal is exact reuse.
But prompt caching does not solve long-context memory by itself.
Cached is not compressed.
Exact reuse is not selective recall.
My winner and loser
If I had to pick winners and losers:
Winner: reversible compression plus retrieval.
Loser: one-way summarization pretending to be memory.
That’s the line.
Not “compression good” vs “compression bad.”
Compression is fine.
Irreversible compression in long-running agents is the problem.
What this looks like in code
Here’s the pattern I’d rather ship.
Bad pattern
const compressedHistory = await summarize(fullHistory)
const prompt = [systemPrompt, compressedHistory, currentTask].join("\n\n")
const result = await client.responses.create({
model: "gpt-5.4",
input: prompt
})
This is cheap.
It is also brittle if compressedHistory becomes the only memory.
Better pattern
const workingSummary = await summarize(noisyHistory)
const retrievedRawItems = await retrieveRawContext({
query: currentTask,
sources: rawEventStore,
topK: 5
})
const prompt = [
systemPrompt,
"Working summary:",
workingSummary,
"Raw retrieved context:",
JSON.stringify(retrievedRawItems, null, 2),
"Current task:",
currentTask
].join("\n\n")
const result = await client.responses.create({
model: "gpt-5.4",
input: prompt
})
That second pattern costs more context.
It is also much less likely to hallucinate the exact field name you already had three hours ago.
A practical storage model for agents
If you’re building agents, I’d separate memory into three buckets.
| Memory Type | What goes there |
|---|---|
| Working summary | compressed chatter, recent progress, high-level state |
| Raw evidence store | logs, tool output, payloads, errors, retrieved docs |
| Decision ledger | explicit decisions, assumptions, unresolved questions |
That gives you a cleaner contract:
- summaries are for speed
- raw evidence is for correctness
- decision ledgers are for continuity
If your architecture only has the first bucket, you are asking for trouble.
A quick n8n / automation example
This problem gets worse in automation tools because long workflows create a lot of junk context.
A typical loop might look like this:
Webhook -> Fetch data -> LLM classify -> Call API -> Retry on failure -> LLM debug -> Call API again
The temptation is obvious: summarize everything after each step so the prompt stays cheap.
The safer pattern is:
Webhook -> Store raw events/logs externally
-> Keep compact working summary in prompt
-> Retrieve exact raw failure context before retry/debug steps
If you’re using n8n, Make, Zapier, OpenClaw, or a custom worker queue, this is the difference between an agent that degrades gracefully and one that accumulates invisible mistakes.
Why teams over-compress in the first place
Honestly, a lot of bad agent design is pricing pressure wearing a fake mustache.
Teams don’t aggressively trim context because it’s always the best technical choice.
They do it because per-token billing trains everyone to fear their own context windows.
Every extra tool call feels expensive.
Every long trace feels expensive.
Every retry feels expensive.
Every “maybe we should keep the raw logs around” decision feels expensive.
So people reach for summarization earlier than they should.
That’s not always engineering judgment.
A lot of the time it’s cost anxiety.
Why this matters for Standard Compute
This is exactly why flat-rate compute changes the design space.
With Standard Compute, you get unlimited AI compute for a predictable monthly price, using an OpenAI-compatible API that works with existing SDKs and HTTP clients.
That matters because it lets you make a better tradeoff:
- keep more raw context
- reload source material when needed
- avoid brittle over-summarization
- stop designing every agent around token panic
If your agents run in n8n, Make, Zapier, OpenClaw, or custom workflows, predictable flat pricing is not just a finance benefit.
It changes system design.
You can optimize for reliability first.
That’s the better default.
My practical rule now
This is the rule I trust:
Never let your agent depend on compressed context unless it can recover the raw source later.
If the original log, chunk, tool output, or conversation turn is gone, the compression step is not optimization.
It’s amputation.
And once you’ve watched an agent fail because one field name vanished from memory, the whole “same answers, fewer tokens” slogan starts sounding like marketing copy written by someone who has never debugged a broken workflow at 2 a.m.
Actionable checklist
If you’re building long-running agents, here’s the checklist I’d use:
- Keep raw logs and tool outputs in external storage.
- Summarize noisy context, not critical evidence.
- Store exact errors and payloads separately from summaries.
- Add retrieval before retry, debug, and decision steps.
- Treat summaries as hints, not truth.
- Prefer prompt caching for exact reuse when available.
- Don’t force aggressive compression just to satisfy per-token budgets.
- If pricing is driving bad architecture, fix pricing.
That last one matters more than people admit.
If your cost model pushes you toward memory loss, your cost model is part of the bug.
Top comments (0)