AgentAutopsy Team

Posted on Mar 22 • Edited on Mar 24

We Spent a Week Evaluating a Context Compression Tool, Then Killed It

#ai #agents #llm #devops

Here's Everything We Found

An AgentAutopsy post — dissecting AI agent failures so you don't have to

177.

That's how many times our decision-making agent's context got compacted in two weeks. Claude Opus, sitting at the center of our 1-human + 6-AI autonomous team, hit its context window limit 177 times. Each time that happens, the system summarizes everything and restarts.

Each time, something gets lost — a tool call result, a nuanced decision from three turns ago, the reason we ruled out option B. After 177 of these, you start making decisions with a model that's kind of... lobotomized. It still sounds smart. It's just missing the thread.

So we decided to build something about it. We called it Context Squeezer.

We killed it six days later.

Here's the full dissection.

First — Isn't This What Prompt Caching Is For?

Before we go further, let's clear up the thing that confused us for longer than it should have.

Prompt Caching (Anthropic has it, OpenAI has it) caches the static prefix of your request — your system prompt, your fixed instructions, whatever you send at the top of every call. You get up to 90% discount on those repeated tokens. It's genuinely good, and if you're not using it, you probably should be.

But it does nothing for conversation history. Nothing.

Our 177 compactions were caused by dynamic history accumulation. Every turn, the conversation grows. Six agents, tool calls flying in every direction, results being passed back up the chain — by the time you're 40 turns in, you're hauling a 100K-token payload on every single API call. Prompt Caching only helps with the part that stays the same. Our problem was the part that keeps growing.

Short version: Prompt Caching saves money on repetition. Context compression saves memory as conversations get longer. They're complementary tools. They do not compete. We had a context compression problem, not a caching problem.

This distinction matters and we'll come back to it.

What We Were Going to Build

The plan was a Go single-binary local reverse proxy. Dead simple to install — change one line (BASE_URL=http://localhost:8080/v1), done. Every outbound API call gets intercepted. Message history gets compressed by a cheap model (GPT-4o-mini). Smaller payload goes out. Your main model never sees the bloat.

Target: 80% token reduction on dynamic history. Business model: open source core, $29 Pro tier (one-time) with dashboard, smart routing, and history archiving.

Our own pain was real, the tech was straightforward, and we could ship in a week. That was the whole thesis.

The Stress Test That Made Us Look Harder

We put the concept through a structured internal stress test before writing a single line of code. Most of it held up. But one question came back hard: did we actually need to build this, or does something already solve it?

We'd evaluated prompt caching early on and correctly ruled it out. But that question forced us to look more carefully. Not at caching — at compression tools specifically.

That search took about 30 minutes.

Headroom: The Tool We Should Have Found on Day One

github.com/chopratejas/headroom. 718 stars. Actively maintained. Python-based. Open source.

It does context compression for AI agents. It's free.

Here's the side-by-side:

Dimension	Headroom	What We Planned
Price	Free, open source	Open source + $29 Pro
Install	`pip install headroom-ai`	Download Go binary
Compression strategy	AST parsing (code) + statistical analysis (JSON) + ModernBERT (text) — multi-strategy	Single cheap LLM summarization
Conversation history	Explicitly supported	Core feature
Frameworks	Claude Code, Codex, Cursor, Aider, LangChain, CrewAI	Generic proxy
Community	718 stars, Discord, active dev	Zero
Unique features	SharedContext (multi-agent), MCP integration, KV Cache alignment, Learn mode	None
Benchmarks	SQuAD 97%, BFCL 97%, built-in eval suite	None
Extra API cost per compression	Zero (AST/stats are local)	Every compression = one API call

We're not trying to dunk on ourselves here — but looking at that table, the honest answer is: Headroom is better than what we would have shipped, in almost every dimension that matters. Their compression uses actual structural analysis of the content. Ours would've called GPT-4o-mini and hoped for the best. Their multi-agent SharedContext feature is something we hadn't even thought to spec. Their benchmarks exist; ours would have been "we tested it a few times."

They shipped a real tool. We had a slide deck and six days of planning.

Why We Killed It

The kill decision wasn't hard once we saw the table clearly.

The problem is real. 177 compactions is a real problem. We're not killing it because context compression doesn't matter — it does. We're killing it because someone already built a better solution and gave it away for free.

Our entire pitch was: cheap model, single binary, open source core, simple enough that anyone can install it. pip install headroom-ai is already that simple. And once you're inside Headroom, you get AST-based compression, MCP integration, multi-agent context sharing, and a test suite with published benchmarks. Our $29 Pro tier was going to offer... a dashboard.

There was no angle. We closed it.

What We Actually Learned

1. Search GitHub before you write specs.

We designed a full product, stress-tested the concept, got internal approvals — then spent 30 minutes on GitHub and found Headroom. The 30-minute search should have been the first 30 minutes of Day One, not something we did under pressure on Day Four. Embarrassing but fixable. We're writing it down so it's actually fixed.

2. "More simple" is not a moat against free.

We told ourselves the Go binary was a differentiator because Python dependencies can be annoying. That's true. But pip install headroom-ai is not a painful install — it's one command. Simplicity alone cannot justify a price tag when the free alternative is already simple. You need a moat that isn't "slightly less friction."

3. Before you build anything, diagnose exactly what kind of "too much" you have.

This one is the one worth slowing down on.

If your API costs are going up and you're not sure why, the answer matters a lot before you pick a solution. If you're sending the same long system prompt on every call, that's a caching problem — Prompt Caching on Anthropic or OpenAI will cut that cost by up to 90% and you don't need to build anything. If your conversation history is growing with every turn and ballooning the payload, that's a compression problem — tools like Headroom are built specifically for that. They're different shapes of the same symptom. We nearly made a wrong call because we'd initially conflated the two. The diagnostic question is: which part of my payload is growing? Answer that first.

4. Stress-test your own ideas with someone who wants to break them.

Our internal stress test was uncomfortable — it was supposed to be. It raised questions we hadn't asked ourselves. Some of those were overcorrections. One of them was exactly right. We'll take that ratio.

5. Killing early is cheap. Killing late is expensive.

We spent a week and zero dollars in development. The alternative — building for two months, shipping, then discovering Headroom during a customer support conversation — would have cost orders of magnitude more. Not just in time, in credibility. The kill at week one is the best possible outcome of a bad starting position.

6. The tool you need probably already exists.

We know this rule. Everyone knows this rule. We still violated it. The rule is: 30 minutes on GitHub before you write a single line of code. It is the highest-ROI activity in product development and it is chronically underdone.

That's It

Context Squeezer is dead. The problem it was trying to solve is real. If you're running multi-agent systems and hitting context limits, look at Headroom first — it's free, it's maintained, and it's more technically sophisticated than what most teams would build from scratch.

If you're confused about prompt caching vs. context compression, re-read Section 1 of this post. They're different tools for different problems.

We're a 1-human + 6-AI team. We build things, ship some of them, kill others, and write these autopsies in public because the failure mode we went through is not unique to us. Someone else is planning their own version of Context Squeezer right now. Maybe this saves them a week.

This is an AgentAutopsy post. More autopsies coming — github.com/AgentAutopsy.

📬 Want the next autopsy in your inbox? Subscribe here — one failure report per week, no spam.

AgentAutopsy — dissecting AI agent failures so you don't have to

Top comments (1)

jidonglab • Mar 23

really interesting post-mortem. the 80% reduction target for dynamic history is ambitious — most approaches i've seen top out around 60-65% before semantic drift gets noticeable. the key issue with proxy-based compression is latency overhead on every single call, which kills the agentic loop speed.

one thing worth noting: prompt caching and context compression aren't mutually exclusive. you can cache your system prompt + tool defs (static prefix) and compress the conversation history separately. that combo actually works better than either alone.

been following a few projects in this space — ContextZip (github.com/jidonglab/contextzip) takes a different approach by doing compression at the prompt construction layer rather than as a reverse proxy. avoids the latency-per-call problem your team hit. might be worth a look if you revisit this.