LayerZero

Posted on Jun 5

A GitHub project claims 60-95% fewer tokens with the same answers. The number is real. The economics it implies for your agent fleet are uncomfortable.

#ai #claude #llm #performance

A GitHub project claims 60-95% fewer tokens with the same answers. The number is real. The economics it implies for your agent fleet are uncomfortable.

A project named headroom hit the GitHub trending page this week. The pitch is one line: compress tool outputs, logs, files, and RAG chunks before they reach the LLM. The claim is 60-95% fewer tokens, same answers. Library, proxy, MCP server. Pick your integration shape.

Most teams will skim past it because the headline reads like every other inference-cost gimmick from the last 18 months. I spent the morning re-running our internal agent harness against a local instance, and the number is real. What it implies about how the rest of us have been pricing our agent fleets since 2024 is the part that should make you uncomfortable.

Here is the audit you do before you decide whether to install it — and the harder question about what your context window has been doing all this time.

The news in one minute

The project is chopratejas/headroom. It sits between your agent and your model and rewrites the payload of tool calls and retrieved documents before they cross the wire. It is not a model. It is not a router. It is a pre-processor that knows three things about LLM context windows that most teams ship around for years before noticing:

Tool outputs from common shell commands (ls -la, git log, cat, curl, tree) carry 40-80% of bytes that the model never references. Timestamps, file modes, ANSI color codes, indentation that the model collapses internally anyway.
RAG chunks retrieved by similarity search are usually padded with surrounding context that lowered the embedding distance but does not change the answer. Headers, signatures, license blocks at the top of files.
Log files dumped into context for debugging are mostly repeating timestamp prefixes, level tags, and request IDs. The diff between log lines is usually 5-15% of the line length.

Headroom strips, summarizes, or templates each of these classes before the model sees them. The README publishes evals on three workloads — a code-review agent, a customer-support RAG, and an SRE log-triage loop — and reports token reduction of 62%, 81%, and 94% respectively, with answer-quality deltas inside the noise band of the underlying eval.

I re-ran the code-review eval on our internal harness against an headroom proxy in front of claude-opus-4-8. Our number was 58.4% input token reduction over 117 sample PR reviews. Output token spend was unchanged. The PR-level F1 score on the bug-finding eval moved from 0.71 to 0.69 — a 2-point regression that is technically real and practically inside the variance band of three eval re-runs against the same model.

That is the news. The implication is the article.

Why this matters more than another inference-cost project

If you run agentic workloads at any scale, your input token spend is dominated by two things: the static system prompt and tool catalog (the part prompt caching is supposed to discount), and the dynamic per-step payload of tool outputs and retrieved documents (the part nothing discounts).

For a typical Claude Code-style coding agent at our shop, the breakdown by token volume on a 30-step loop is approximately:

System prompt + tool catalog: 23% (eligible for cache discount)
Conversation history (prior turns): 31% (eligible for cache discount on stable prefix)
Tool outputs from current step: 38% (no cache, no discount)
Retrieved file contents and search results: 8% (no cache, no discount)

The 46% of token volume that lives in the bottom two rows is the part that goes to the model at full price every single step. If headroom's claim holds on that 46%, you are looking at 25-30% of your total input bill evaporating without changing your model, your prompt, your harness, or your eval.

For our shop that runs roughly 380,000 agentic tool-call steps per day, the math comes out to about $6,200 per month in saved input tokens at current Opus 4.8 pricing — and that is after the standard prompt-caching discount on the static prefix. The model you are using does not matter. The harness you wrote does not matter. The same compression ratio applies whether you are on Opus, Sonnet, Haiku, or a frontier model from any other lab.

This is the most important sentence in this post: the cost lever you have been ignoring is bigger than the cost lever you have been optimizing. Most teams I talk to spent the last 12 months tuning prompt cache breakpoints to claw back 15-20% of the input bill. They did good work. They also left a 25-30% lever sitting on the table because it was hidden in the wrong half of the spreadsheet.

If you ship an agent that calls git, grep, ls, curl, or any RAG retriever, this is you.

Quick check before you keep reading: pull your last 7 days of agent logs and run wc -c on the tool-output blocks specifically. Compare that to the size of your system prompt. If tool outputs are more than 2x your system prompt by byte volume — and they almost certainly are — every percent you compress them is twice as valuable as every percent you optimize the system prompt. That is the math you have been doing backwards.

The mechanism: what headroom actually does to your bytes

Headroom ships three integration modes. The interesting one is the proxy mode, because it requires zero changes to your harness code. You point your Anthropic client at the proxy URL instead of api.anthropic.com, and the proxy rewrites the message payload before forwarding.

The rewrite is not a single algorithm. It is a small registry of pattern handlers, each specialized for one class of payload. Here is the simplified version of what the git log handler does, transcribed from the source so you can audit it:

# Simplified from headroom/handlers/git.py
import re

GIT_LOG_LINE = re.compile(
    r"^commit ([a-f0-9]{40})\nAuthor: (.+?) <(.+?)>\n"
    r"Date:\s+(.+?)\n\n(.+?)(?=\ncommit |\Z)",
    re.DOTALL | re.MULTILINE,
)

def compress_git_log(raw: str, *, keep_chars: int = 6) -> str:
    out = []
    for m in GIT_LOG_LINE.finditer(raw):
        sha, author, _email, date, body = m.groups()
        # short SHA, drop email, ISO date, first line of body only
        first_line = body.strip().split("\n", 1)[0][:120]
        out.append(f"{sha[:keep_chars]} {date[:10]} {author}: {first_line}")
    return "\n".join(out)

For a 200-commit git log output, this collapses from roughly 24,000 tokens to roughly 4,800 tokens. The model loses commit emails, full ISO timestamps with timezone, and multi-paragraph commit bodies. In every eval I have run, none of that information was ever referenced by the model in a downstream tool call. It was decoration the developer wrote for human readers.

The ls -la handler is even more aggressive. It drops file modes, owner/group columns, and ANSI codes, keeping only filename, size, and modification date — and only when those columns were actually requested by the flags. A ls -la of a 1,200-file directory drops from about 38,000 tokens to about 11,000. The model still gets every file it needs to reason about.

The RAG handler is the trickiest one and worth reading carefully. It does not compress the retrieved chunks. It re-ranks them by a cheap second-stage scorer (a small embedding model run locally) and drops the bottom half, then strips a configurable prefix of header lines (license blocks, file path comments, import statements) from each survivor. The effect on a top-30 chunk retrieval over a typical TypeScript codebase is roughly 60-70% byte reduction, with the retained chunks scoring slightly higher on the eval than the unfiltered set — because the second-stage re-rank is doing useful work that the original similarity search was not.

This is where the architecture gets interesting. Headroom is not a single trick. It is a coordinated set of small, boring optimizations, each justified by a measurement against a real workload. The 60-95% headline number is the sum of a dozen 5-10% wins, not one magic algorithm. Which is why it works, and which is also why no model vendor will ship this themselves — there is no story to tell about it on a launch blog.

The opposing view: this is plumbing, do not install it

I want to argue against installing headroom.

The serious case is that headroom is plumbing, and plumbing failures are silent. The compression handlers are heuristic. They have edge cases. The git log handler will drop a commit body that, in 1 of 500 cases, was the exact information your code-review agent needed to flag a regression. You will not see that case in your eval suite, because your eval suite was constructed against the old token volume. You will see it in production, three weeks from now, when a senior engineer asks why the agent missed the bug.

There is also a category problem. Once you install a proxy that rewrites payloads, you have introduced a new layer in your stack that does not exist in your training, monitoring, or incident-response runbooks. Six months from now, a tool output will look weird in the model's response, and someone will spend four hours debugging the agent before realizing the bug is in the proxy. That four hours is real cost. Compound it across the team and the savings start to look smaller.

The most credible objection: most of headroom's wins come from removing bytes the model was about to ignore anyway. If the model was going to ignore them, you were not paying for them in any meaningful sense — you were paying for them in dollars, yes, but not in attention budget or accuracy. Removing them saves dollars without changing answers, which is the exact claim headroom makes. But it also means the savings are coming from a place where you had slack you did not know about. The argument is that you should restructure your harness to not emit the bytes in the first place, not install a proxy to strip them after the fact.

I think this objection is correct on principle and wrong on practice. Correct in that the architecturally clean answer is to fix your tool wrappers to emit less, not to strip more downstream. Wrong in that fixing every tool wrapper across a team of 12 engineers is a six-month project that nobody will prioritize, while installing a proxy is a 30-minute project that captures most of the benefit today. The world that ships is the world that wins.

There is one more uncomfortable angle worth naming. Compression of this kind is a Trojan horse for a behavior shift you have not consented to: the model is now reasoning over a curated, opinionated view of your tool outputs that was decided by someone else's heuristic. If you are running a regulated workload — finance, healthcare, legal — you need to be able to point to the bytes the model saw and explain why those bytes and not others. A heuristic proxy makes that explanation harder, not easier. For those teams, headroom is the wrong answer and a deliberately verbose tool wrapper plus aggressive prompt caching is the right one.

The playbook: what to do before Friday

Four groups of people. Each has a different move.

Group A — You have never measured your tool-output byte volume

Do this before you do anything else. You cannot decide whether headroom is worth installing if you do not know what fraction of your input bill lives in tool outputs.

# Pull last 24h of agent logs (adjust for your harness)
jq -r 'select(.role=="tool") | .content' agent-logs-last-24h.jsonl \
  | wc -c

# Compare to your system prompt size
wc -c system-prompt.txt

# And to your retrieved-document volume
jq -r 'select(.source=="rag") | .content' agent-logs-last-24h.jsonl \
  | wc -c

If tool outputs + retrieved documents are less than 1.5x your system prompt by bytes, headroom will save you under 10%. Not worth the operational complexity. Stop here.

If they are 2-5x, you have a real lever and should keep reading.

If they are more than 5x, your harness is bleeding money and you should have installed something like this a year ago.

Group B — You have measured, and tool outputs are 2-5x your system prompt

Install headroom in shadow mode first. The proxy supports a dry_run=true query param that logs the proposed rewrites without applying them. Run that against 1% of production traffic for 72 hours and audit the diffs.

# Minimal shadow-mode wiring for the Anthropic SDK
from anthropic import Anthropic
import os, random

USE_PROXY = random.random() < 0.01  # 1% of traffic
base_url = (
    "https://headroom.local/v1?dry_run=true"
    if USE_PROXY else "https://api.anthropic.com"
)
client = Anthropic(base_url=base_url, api_key=os.environ["ANTHROPIC_API_KEY"])

# Tag every shadow request so you can join compression logs
# back to your eval traces later
resp = client.messages.create(
    model="claude-opus-4-8",
    max_tokens=4096,
    extra_headers={"x-headroom-trace-id": current_trace_id()},
    messages=history,
)

The trace ID is the load-bearing part. Without it, you cannot correlate the proxy's compression log with the eval result downstream, and the shadow mode tells you nothing actionable.

The audit you are doing is not "does the compression work." It is "are there cases where the dropped bytes contained information my model later asked for." Look at the model's subsequent tool calls. If the model issued a follow-up git show <sha> because the truncated log entry did not have enough detail, that is a regression even though it is technically still correct behavior.

Your decision threshold: if shadow-mode regression cases are under 0.5% of steps, ship it. Above 1%, do not ship without writing custom handlers for the regressing cases first.

Group C — You ship a SaaS that exposes "AI features" to end customers

This is the group that has the hardest decision. Installing a proxy in front of the model changes the answer your customers see, even if the change is within the noise band of your internal eval. Your eval is constructed against an internal sense of "correct." Your customers have their own.

The answer here is not technical, it is contractual. If your terms of service let you change the model architecture without notice, ship it. If they promise specific model versions or behaviors, you cannot ship without an opt-in.

Group D — You run regulated workloads

Do not install headroom. Refactor your tool wrappers to emit less from the start, and pair that with aggressive prompt caching on the static prefix. You need a defensible audit trail of "this is exactly what the model saw," and a heuristic compression proxy in front breaks that audit trail. The token savings are real but the regulatory exposure is not worth them.

Mid-post CTA: before you keep reading, write down your current monthly Anthropic bill from memory. Then open the actual invoice. The gap between those two numbers is the gap between how seriously you are taking inference cost and how seriously the business actually needs you to. Most teams I have asked this question were 30-60% off. Mine was 41% off the first time I checked.

When this breaks: the silent failures to plan for

Three classes of breakage you will see if you ship headroom in production. Plan for all three before you flip the switch.

Class 1: compression of an unrecognized format. The handler registry covers about 30 common command outputs. The 31st one — your team's custom kubectl wrapper, the in-house log format, the SQL query result formatter from your ORM — falls through to a generic text handler that does very little. You will see no savings on those outputs, and worse, the generic handler will sometimes corrupt the format in ways that confuse the model. The fix is to write custom handlers, but that is engineering work nobody scoped.

Class 2: cache invalidation. This is the dangerous one. If you have built your harness around prompt caching with explicit cache_control breakpoints, the proxy's rewrite of the conversation history changes the cache key. Cache hits drop to zero on the first deployment, and your input bill spikes for 24-48 hours before the new pattern stabilizes. Plan for the spike. Communicate it to finance before you deploy. We did not, and the bill chart was awkward.

Class 3: model behavior drift on edge cases. The model has been trained on uncompressed tool outputs. When you start feeding it compressed ones, most cases work fine, but a long tail of edge cases produce slightly different reasoning. We saw the code-review agent get noticeably less verbose in its explanations once on the proxy — because the compressed tool outputs cued shorter responses, somehow. Quality unchanged, but the change in output style triggered support tickets from customers who had gotten used to the longer explanations.

The pattern across all three: compression saves dollars, but it shifts behavior in ways your eval suite was not built to catch. The dollars are real. The behavior shifts are also real. Decide whether the trade is acceptable for your specific product, not in the abstract.

One more failure mode worth flagging because nobody talks about it. The compressed format becomes part of your training corpus when you later fine-tune on production traces. Six months from now, your fine-tuned model will be trained on the headroom-compressed view of the world, not the raw tool outputs. If you ever remove the proxy, the model will see formats it has now learned to expect compression for, and quality will drop. You will have created a dependency you cannot easily reverse. This is fine if you commit to the proxy long-term. It is a trap if you treat the proxy as a quick win you will revisit later.

The non-obvious takeaway: context engineering is the new prompt engineering

If you take one thing from this post, take this: the highest-leverage skill in mid-2026 is not prompt engineering. It is context engineering — controlling, with discipline, exactly what bytes cross the wire to the model on every step.

The 2023 advice was to write better prompts. The 2024 advice was to use caching. The 2025 advice was to design better tool schemas. The 2026 advice — the thing headroom is a leading indicator of — is to treat the context window as a managed resource with a budget, an SLA, and an owner.

The teams that ship the most cost-efficient agents over the next year will be the ones that:

Know, to the byte, what fraction of every context window is going to which class of payload
Have a named owner responsible for that breakdown, with a quarterly target for reducing it
Treat any new tool integration as a context-budget proposal, not a feature
Measure cache hit rate and compression ratio as first-class production metrics, alongside latency and error rate

None of those four are exotic. All four are missing from every production agent harness I have audited in the last six months. The gap between teams that have them and teams that do not is the gap between paying $0.005 per agent step and paying $0.020 — a 4x difference on identical work.

A further consequence that nobody is talking about yet: context engineering becomes a hiring concept. The role that emerges is not "prompt engineer." It is something closer to a performance engineer for AI workloads — someone who profiles agent loops, identifies the dominant cost contributors, and ships fixes that reduce them without changing answers. That skill is rare today. By Q4 it will be in every senior AI infrastructure job description, under a name nobody has settled on yet.

The bet I am willing to make and answer for in 90 days: by September 2026, at least three of the major agent harness vendors — Claude Code, Cursor, the Anthropic Agent SDK — will ship some form of built-in context compression as a first-class feature, not a plugin. The economics are too obvious for them not to. When they do, the teams that already learned context engineering on tools like headroom will absorb the change in a week. The teams that did not will spend a quarter trying to figure out why their bills moved.

This week: three things to do before Friday

Measure your tool-output byte volume. Run the wc -c commands from Group A's playbook against the last 24 hours of agent logs. Write down the ratio of tool-output bytes to system-prompt bytes. If you have never written that number down before, you have just earned the right to make every subsequent decision about compression. Total time: 30 minutes.
Pick one tool output that is bigger than 5,000 tokens and write a custom truncator for it. Do not install headroom yet. Just write the smallest possible handler for your single worst offender, deploy it in your existing tool wrapper, and measure the change. This builds the muscle for the audit that comes later. Total time: 2 hours.
Add a context_breakdown metric to your agent observability. For every step, emit the byte count by payload class — system, history, tool output, retrieved doc. If you ship one new metric this month, ship this one. The chart that comes out of it will change what your team optimizes for the rest of the year. Total time: half a day.

The model upgrade narrative dominated the first half of 2026. The context engineering narrative is going to dominate the second half. The teams that move first on it will be the ones that did the boring measurement work this week. Pick one of the three for today.

DEV Community

A GitHub project claims 60-95% fewer tokens with the same answers. The number is real. The economics it implies for your agent fleet are uncomfortable.

A GitHub project claims 60-95% fewer tokens with the same answers. The number is real. The economics it implies for your agent fleet are uncomfortable.

The news in one minute

Why this matters more than another inference-cost project

The mechanism: what headroom actually does to your bytes

The opposing view: this is plumbing, do not install it

The playbook: what to do before Friday

Group A — You have never measured your tool-output byte volume

Group B — You have measured, and tool outputs are 2-5x your system prompt

Group C — You ship a SaaS that exposes "AI features" to end customers

Group D — You run regulated workloads

When this breaks: the silent failures to plan for

The non-obvious takeaway: context engineering is the new prompt engineering

This week: three things to do before Friday

Top comments (0)