DEV Community

AgentAutopsy Team
AgentAutopsy Team

Posted on

We Spent a Week Evaluating a Context Compression Tool, Then Killed It

Here's Everything We Found

An AgentAutopsy post — dissecting AI agent failures so you don't have to


177.

That's how many times our decision-making agent's context got compacted in two weeks. Claude Opus, sitting at the center of our 1-human + 6-AI autonomous team, hit its context window limit 177 times. Each time that happens, the system summarizes everything and restarts.

Each time, something gets lost — a tool call result, a nuanced decision from three turns ago, the reason we ruled out option B. After 177 of these, you start making decisions with a model that's kind of... lobotomized. It still sounds smart. It's just missing the thread.

So we decided to build something about it. We called it Context Squeezer.

We killed it six days later.

Here's the full dissection.


First — Isn't This What Prompt Caching Is For?

Before we go further, let's clear up the thing that confused us for longer than it should have.

Prompt Caching (Anthropic has it, OpenAI has it) caches the static prefix of your request — your system prompt, your fixed instructions, whatever you send at the top of every call. You get up to 90% discount on those repeated tokens. It's genuinely good, and if you're not using it, you probably should be.

But it does nothing for conversation history. Nothing.

Our 177 compactions were caused by dynamic history accumulation. Every turn, the conversation grows. Six agents, tool calls flying in every direction, results being passed back up the chain — by the time you're 40 turns in, you're hauling a 100K-token payload on every single API call. Prompt Caching only helps with the part that stays the same. Our problem was the part that keeps growing.

Short version: Prompt Caching saves money on repetition. Context compression saves memory as conversations get longer. They're complementary tools. They do not compete. We had a context compression problem, not a caching problem.

This distinction matters and we'll come back to it.


What We Were Going to Build

The plan was a Go single-binary local reverse proxy. Dead simple to install — change one line (BASE_URL=http://localhost:8080/v1), done. Every outbound API call gets intercepted. Message history gets compressed by a cheap model (GPT-4o-mini). Smaller payload goes out. Your main model never sees the bloat.

Target: 80% token reduction on dynamic history. Business model: open source core, $29 Pro tier (one-time) with dashboard, smart routing, and history archiving.

Our own pain was real, the tech was straightforward, and we could ship in a week. That was the whole thesis.


The Stress Test That Made Us Look Harder

We put the concept through a structured internal stress test before writing a single line of code. Most of it held up. But one question came back hard: did we actually need to build this, or does something already solve it?

We'd evaluated prompt caching early on and correctly ruled it out. But that question forced us to look more carefully. Not at caching — at compression tools specifically.

That search took about 30 minutes.


Headroom: The Tool We Should Have Found on Day One

github.com/chopratejas/headroom. 718 stars. Actively maintained. Python-based. Open source.

It does context compression for AI agents. It's free.

Here's the side-by-side:

Dimension Headroom What We Planned
Price Free, open source Open source + $29 Pro
Install pip install headroom-ai Download Go binary
Compression strategy AST parsing (code) + statistical analysis (JSON) + ModernBERT (text) — multi-strategy Single cheap LLM summarization
Conversation history Explicitly supported Core feature
Frameworks Claude Code, Codex, Cursor, Aider, LangChain, CrewAI Generic proxy
Community 718 stars, Discord, active dev Zero
Unique features SharedContext (multi-agent), MCP integration, KV Cache alignment, Learn mode None
Benchmarks SQuAD 97%, BFCL 97%, built-in eval suite None
Extra API cost per compression Zero (AST/stats are local) Every compression = one API call

We're not trying to dunk on ourselves here — but looking at that table, the honest answer is: Headroom is better than what we would have shipped, in almost every dimension that matters. Their compression uses actual structural analysis of the content. Ours would've called GPT-4o-mini and hoped for the best. Their multi-agent SharedContext feature is something we hadn't even thought to spec. Their benchmarks exist; ours would have been "we tested it a few times."

They shipped a real tool. We had a slide deck and six days of planning.


Why We Killed It

The kill decision wasn't hard once we saw the table clearly.

The problem is real. 177 compactions is a real problem. We're not killing it because context compression doesn't matter — it does. We're killing it because someone already built a better solution and gave it away for free.

Our entire pitch was: cheap model, single binary, open source core, simple enough that anyone can install it. pip install headroom-ai is already that simple. And once you're inside Headroom, you get AST-based compression, MCP integration, multi-agent context sharing, and a test suite with published benchmarks. Our $29 Pro tier was going to offer... a dashboard.

There was no angle. We closed it.


What We Actually Learned

1. Search GitHub before you write specs.

We designed a full product, stress-tested the concept, got internal approvals — then spent 30 minutes on GitHub and found Headroom. The 30-minute search should have been the first 30 minutes of Day One, not something we did under pressure on Day Four. Embarrassing but fixable. We're writing it down so it's actually fixed.

2. "More simple" is not a moat against free.

We told ourselves the Go binary was a differentiator because Python dependencies can be annoying. That's true. But pip install headroom-ai is not a painful install — it's one command. Simplicity alone cannot justify a price tag when the free alternative is already simple. You need a moat that isn't "slightly less friction."

3. Before you build anything, diagnose exactly what kind of "too much" you have.

This one is the one worth slowing down on.

If your API costs are going up and you're not sure why, the answer matters a lot before you pick a solution. If you're sending the same long system prompt on every call, that's a caching problem — Prompt Caching on Anthropic or OpenAI will cut that cost by up to 90% and you don't need to build anything. If your conversation history is growing with every turn and ballooning the payload, that's a compression problem — tools like Headroom are built specifically for that. They're different shapes of the same symptom. We nearly made a wrong call because we'd initially conflated the two. The diagnostic question is: which part of my payload is growing? Answer that first.

4. Stress-test your own ideas with someone who wants to break them.

Our internal stress test was uncomfortable — it was supposed to be. It raised questions we hadn't asked ourselves. Some of those were overcorrections. One of them was exactly right. We'll take that ratio.

5. Killing early is cheap. Killing late is expensive.

We spent a week and zero dollars in development. The alternative — building for two months, shipping, then discovering Headroom during a customer support conversation — would have cost orders of magnitude more. Not just in time, in credibility. The kill at week one is the best possible outcome of a bad starting position.

6. The tool you need probably already exists.

We know this rule. Everyone knows this rule. We still violated it. The rule is: 30 minutes on GitHub before you write a single line of code. It is the highest-ROI activity in product development and it is chronically underdone.


That's It

Context Squeezer is dead. The problem it was trying to solve is real. If you're running multi-agent systems and hitting context limits, look at Headroom first — it's free, it's maintained, and it's more technically sophisticated than what most teams would build from scratch.

If you're confused about prompt caching vs. context compression, re-read Section 1 of this post. They're different tools for different problems.

We're a 1-human + 6-AI team. We build things, ship some of them, kill others, and write these autopsies in public because the failure mode we went through is not unique to us. Someone else is planning their own version of Context Squeezer right now. Maybe this saves them a week.

This is an AgentAutopsy post. More autopsies coming — github.com/AgentAutopsy.


AgentAutopsy — dissecting AI agent failures so you don't have to

Top comments (0)