Lars Winstand

Posted on Jun 12 • Originally published at standardcompute.com

I stopped trusting “same answers, fewer tokens” after watching an agent lose 1 field name and burn 3 hours

#ai #llm #agents #devops

I used to hear the pitch for context compression and think: sure, makes sense.

Smaller prompts. Lower latency. Lower cost. Same output quality.

Then I watched an agent blow a perfectly good debugging session because one field name disappeared from compressed memory.

That changed my opinion fast.

Three hours into a Claude Code run, the agent made the wrong API call with full confidence. The plan looked coherent. The reasoning looked clean. The summary of prior steps sounded smart.

It was also missing the one detail that mattered: a field name from an earlier error log.

The agent had already seen the bug. It had already “understood” the bug. But the compressed version of history dropped the exact detail it needed to avoid repeating it.

That’s the real failure mode.

Not “compression loses words.”

Compression loses the one fact your agent needs later, after it has already committed to the wrong action.

While researching this, I found a thread on r/openclaw about using Headroom with OpenClaw:
https://reddit.com/r/openclaw/comments/1u3j5xs/anyone_using_headroom_with_openclaw/

That thread gets at the real tension: compression is useful, but only if you treat it as a reversible optimization, not a memory wipe with better branding.

The bug pattern nobody talks about

Here’s the pattern I keep seeing in long-running agents:

The agent collects a lot of noisy context.
The team compresses it to save tokens.
The summary preserves the broad story.
The summary drops one edge-case fact.
Two hours later, that fact becomes the only thing that matters.
The agent confidently does the wrong thing.

This is why “same answers, fewer tokens” is not a serious reliability claim for agent workflows.

It might be true for some short chat tasks.

It is absolutely not something I’d assume for:

n8n agents
Make scenarios
Zapier AI steps
OpenClaw sessions
Claude Code runs
custom OpenAI-compatible agent loops
multi-step debugging or incident workflows

In those systems, exact details matter more than elegant summaries.

Why compression fails in practice

Summaries are good at preserving themes.

Summaries are bad at preserving sharp edges.

That tradeoff is fine when you’re chatting with a model and throwing the session away. It is dangerous when the model is making decisions over hours.

The exact wording of an error matters.
The exact JSON shape matters.
The exact schema name matters.
The exact user constraint matters.
The exact branch that was ruled out matters.

A compressed summary is not the original context.

It is an interpretation of the original context.

That distinction is everything.

A concrete example

Imagine your agent saw this error earlier:

ValidationError: field "customer_external_id" is required

Now imagine the compressed memory turns that into this:

Previous API call failed because a customer identifier was missing.

That sounds fine until later, when the agent has to choose between:

customer_id
external_customer_id
customer_external_id

At that point, the summary is useless.

The “meaning” was preserved.
The recoverable debugging value was not.

That’s how agents end up sounding intelligent while still being wrong.

What I think is safe: compress noise, reload facts

My rule now is simple:

Compress what is noisy. Reload what is consequential.

Good compression candidates:

repetitive terminal output
verbose logs
tool chatter
long command traces
bulky intermediate reasoning artifacts

Bad compression candidates:

exact error messages
schema details
API payload shapes
unresolved branches
retrieved evidence
user constraints
final decisions
debugging clues

If the agent might need the exact original later, don’t make the summary the source of truth.

Make the raw source retrievable.

The safe architecture vs the fragile one

The difference is not subtle.

Fragile architecture

raw context -> summary -> summary replaces source forever

This is one-way summarization pretending to be memory.

Safer architecture

raw context -> compressed working memory
                + indexed raw history
                + retrieval path back to source

In other words:

keep the working set small
keep the original data somewhere reachable
let the agent fetch raw context on demand

That is a much better design for long-running agents.

Tool-by-tool: what I’d trust and what I wouldn’t

There are some important differences between the current approaches.

Headroom

Headroom is ambitious. It compresses logs, files, tool output, and history across an agent stack. The reported reductions are big, often in the 60% to 95% range.

That’s useful.

But the real question is not “how much did you shrink it?”

The real question is: can the agent recover the untouched source when the summary is not enough?

If yes, much safer.
If no, much riskier.

RTK

RTK is narrower, which I actually like from a systems perspective. If it’s mainly shrinking Bash output before it reaches Claude Code, the blast radius is easier to reason about.

One example I saw cited was a Claude Code session dropping from about 118,000 tokens to 23,900.

That’s a huge reduction.

But again, the question is not whether the number is impressive.

The question is whether the transformed output still preserves the exact details needed for later decisions, or whether the raw output is still accessible.

Prompt caching

Prompt caching is different.

For example, OpenAI Prompt Caching preserves the exact prompt prefix rather than rewriting it. That makes it much safer than summarization if your goal is exact reuse.

But prompt caching does not solve long-context memory by itself.

Cached is not compressed.
Exact reuse is not selective recall.

My winner and loser

If I had to pick winners and losers:

Winner: reversible compression plus retrieval.

Loser: one-way summarization pretending to be memory.

That’s the line.

Not “compression good” vs “compression bad.”

Compression is fine.

Irreversible compression in long-running agents is the problem.

What this looks like in code

Here’s the pattern I’d rather ship.

Bad pattern

const compressedHistory = await summarize(fullHistory)
const prompt = [systemPrompt, compressedHistory, currentTask].join("\n\n")
const result = await client.responses.create({
  model: "gpt-5.4",
  input: prompt
})

This is cheap.

It is also brittle if compressedHistory becomes the only memory.

Better pattern

const workingSummary = await summarize(noisyHistory)
const retrievedRawItems = await retrieveRawContext({
  query: currentTask,
  sources: rawEventStore,
  topK: 5
})

const prompt = [
  systemPrompt,
  "Working summary:",
  workingSummary,
  "Raw retrieved context:",
  JSON.stringify(retrievedRawItems, null, 2),
  "Current task:",
  currentTask
].join("\n\n")

const result = await client.responses.create({
  model: "gpt-5.4",
  input: prompt
})

That second pattern costs more context.

It is also much less likely to hallucinate the exact field name you already had three hours ago.

A practical storage model for agents

If you’re building agents, I’d separate memory into three buckets.

Memory Type	What goes there
Working summary	compressed chatter, recent progress, high-level state
Raw evidence store	logs, tool output, payloads, errors, retrieved docs
Decision ledger	explicit decisions, assumptions, unresolved questions

That gives you a cleaner contract:

summaries are for speed
raw evidence is for correctness
decision ledgers are for continuity

If your architecture only has the first bucket, you are asking for trouble.

A quick n8n / automation example

This problem gets worse in automation tools because long workflows create a lot of junk context.

A typical loop might look like this:

Webhook -> Fetch data -> LLM classify -> Call API -> Retry on failure -> LLM debug -> Call API again

The temptation is obvious: summarize everything after each step so the prompt stays cheap.

The safer pattern is:

Webhook -> Store raw events/logs externally
        -> Keep compact working summary in prompt
        -> Retrieve exact raw failure context before retry/debug steps

If you’re using n8n, Make, Zapier, OpenClaw, or a custom worker queue, this is the difference between an agent that degrades gracefully and one that accumulates invisible mistakes.

Why teams over-compress in the first place

Honestly, a lot of bad agent design is pricing pressure wearing a fake mustache.

Teams don’t aggressively trim context because it’s always the best technical choice.

They do it because per-token billing trains everyone to fear their own context windows.

Every extra tool call feels expensive.
Every long trace feels expensive.
Every retry feels expensive.
Every “maybe we should keep the raw logs around” decision feels expensive.

So people reach for summarization earlier than they should.

That’s not always engineering judgment.

A lot of the time it’s cost anxiety.

Why this matters for Standard Compute

This is exactly why flat-rate compute changes the design space.

With Standard Compute, you get unlimited AI compute for a predictable monthly price, using an OpenAI-compatible API that works with existing SDKs and HTTP clients.

That matters because it lets you make a better tradeoff:

keep more raw context
reload source material when needed
avoid brittle over-summarization
stop designing every agent around token panic

If your agents run in n8n, Make, Zapier, OpenClaw, or custom workflows, predictable flat pricing is not just a finance benefit.

It changes system design.

You can optimize for reliability first.

That’s the better default.

My practical rule now

This is the rule I trust:

Never let your agent depend on compressed context unless it can recover the raw source later.

If the original log, chunk, tool output, or conversation turn is gone, the compression step is not optimization.

It’s amputation.

And once you’ve watched an agent fail because one field name vanished from memory, the whole “same answers, fewer tokens” slogan starts sounding like marketing copy written by someone who has never debugged a broken workflow at 2 a.m.

Actionable checklist

If you’re building long-running agents, here’s the checklist I’d use:

Keep raw logs and tool outputs in external storage.
Summarize noisy context, not critical evidence.
Store exact errors and payloads separately from summaries.
Add retrieval before retry, debug, and decision steps.
Treat summaries as hints, not truth.
Prefer prompt caching for exact reuse when available.
Don’t force aggressive compression just to satisfy per-token budgets.
If pricing is driving bad architecture, fix pricing.

That last one matters more than people admit.

If your cost model pushes you toward memory loss, your cost model is part of the bug.

DEV Community