
Phil Rentier Digital

Originally published at rentierdigital.xyz

94% of My Claude Code Tokens Went to the Wrong Model. So I Stopped Paying Opus to Do Haiku's Job.

You feel like you have done everything right with Claude Code. Hooks installed. CLAUDE.md curated to 6,890 tokens. Every MCP server killed off, with the kind of discipline you are silently proud of on a Friday evening.

Then it is Wednesday, three days before your weekly limits reset, and you are already at 80% usage. Even on the Max plan.

That is when you start wondering what the discipline was actually buying you.

TLDR: You can tweak your Claude Code setup all you want. I found a free tool that does not lie about the bloat, cuts costs drastically, and makes Claude sharper, not just cheaper. Here is what I did, and how you can replicate it.

For months I had been listening to Boris Cherny and reading the Anthropic essays on context engineering. I did the homework. MCPs gone. CLAUDE.md trimmed. By the time I had also archived the MEMORY files and installed the SessionEnd hook, the numbers spoke for themselves: 643 sessions in thirty days with zero crashes and zero /context panic.

Running out of tokens, I went looking. And I found a gem: an audit tool that does not just count tokens. It tells you when Claude is getting dumb.

The Score That Made Me Close the Tab

[IMAGE: Token Optimizer audit dashboard showing Health Score C at 69/100 with Claude.md global tokens and session overhead breakdown]

I ran it last week. Six agents in parallel, sixty seconds. First thing on screen: Health Score C, 69 out of 100.

Not D, not F. C. The grade of a kid who thought he had revised well. I had margin. What am I saying: I had a chasm.

Total Session Overhead: 23.5K tokens, 2.3% of the million-token window. CLAUDE.md global at 6,890. Skills at 1,809. Commands 1K, zero MCP tools, the rest of it sitting where I expected. The numbers were small.

And then I saw the line that made me stop scrolling.

Real session baseline averages 43.2K tokens (+83.9% vs estimate). Gap is from unmeasured system overhead.

The audit tool was admitting, on its own, that it was under-measuring. The XML framing, the MCP instructions, the system prompts that ship with every model call (admit it, you do not check those either). Things I never see in /context, and that I had been paying for on every single message.

What I thought I was measuring was half of what was actually billed.

That is the moment the framing flipped. The score is an architecture grade dressed up as a fill gauge. Half the bloat I was paying for, I never saw.

Token Debt Is Not Just Financial

Every Medium post about Claude Code costs frames it the same way. Your context is full. You are paying too much. Trim your CLAUDE.md.

Half right. The other half is that token debt comes in two layers.

The financial layer is the obvious one. The Anthropic API is stateless. Every turn reloads the entire stack. Prompt caching divides cost by ten but the window itself does not shrink. So if you have a ghost file of 1,410 tokens sitting in MEMORY for a finished project, and you send 3,347 messages a day, and you do that for eighteen days... that is 85 million tokens billed for zero value. From one file.

For scale: 137 million billable tokens is what I went through last month, total. One ghost file alone carved out a real slice of that volume. In silence.

The cognitive layer is the one nobody talks about. Per Anthropic's published guidance, and reflected in the audit tool's design, quality drops past 50-70% context fill. Compaction is lossy. With an average of 9.7 million tokens per session flowing through a one-million-token window, that is roughly ten compactions per session. Across 643 sessions, more than six thousand compactions in thirty days. Each one shaves something off (a rule, a convention, a piece of scope you set up two hours ago).

And then there is the overhead you cannot see. Anthropic's own engineering team reported that MCP tool definitions consumed 134K tokens before they shipped Tool Search. A Reddit user documented 67,000 tokens gone just from connecting four MCP servers. None of it shows up in /context.

Unused tokens cost you money. The bigger cost is that you pay to degrade the quality of your own reasoning, every turn.

I had written, back in February, that CLAUDE.md is the new .env, not the new README. I treated the file as a config problem. That was half the picture. The cognitive layer was the half I had not seen.

Six Agents, Sixty Seconds, One Verdict

The tool is called token-optimizer. It is a Claude Code plugin, open source, on Alex Greensh's GitHub. You install it with one line: /plugin marketplace add alexgreensh/token-optimizer.
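
For copy-paste convenience, that one line is:

```
/plugin marketplace add alexgreensh/token-optimizer
```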

The architecture is honest. Six agents running in parallel. Sonnet handles the judgment calls, Haiku does the counting work, and Opus comes in only at the end for the synthesis. A dashboard lives at localhost:24842 and auto-updates on every SessionEnd hook.

What it does that nothing else does: it tracks "how much dumber your AI is getting" as the session wears on. That phrasing is from the repo, not mine. /context shows you the gauge. This shows you the architecture under the gauge AND the quality decay over time.

The tool also admits its own limits. The +83.9% under-measurement I quoted earlier? It comes from the audit itself: "Gap is from unmeasured system overhead." A tool that owns what it cannot see is more credible than one that pretends to see everything.

Three weeks ago, I wrote about the day my Pro Max plan lasted fifteen minutes before I ran /context and saw the bloat for the first time. I thought I had seen the worst. /context tells you the fridge is full. Token-optimizer tells you half the food is expired.

It also fits a thesis I have been pushing for months. The reason this works is that it is a CLI-native plugin, not an MCP server. It runs inside Claude Code's execution loop, not over a remote tool protocol. It is a small data point in why CLI-native tools beat MCP servers for AI agents.

Six agents, sixty seconds, one dashboard. And every finding it produces, you have to apply yourself.

The Visible Lever Pays the Least

Here is what the audit found first. Three project MEMORY files left over from finished work. One of them was 1,410 tokens on its own. Run the math: 1,410 tokens × 3,347 messages a day × the 18 days the file sat in MEMORY after I had stopped touching the project. That is 85 million tokens billed for zero value, from one file.

For scale, my total billable consumption that month was 137 million. One ghost file ate roughly 0.6% of my entire monthly volume, while contributing nothing.

Multiply that across the others: two more stale MEMORY files, archived. Six references to old Twitter incidents in the global CLAUDE.md, retired. Verbose skill descriptions trimmed. Plugin commands I never invoked, pruned.

The total measured cleanup: -1,386 tokens per message on the file layer, a 5.9% drop. Plus another 2,565 tokens shaved from the on-demand corpus.

The rule I would generalize: if a MEMORY file has not been referenced in a prompt for 14 days, archive it by default. Anthropic's docs recommend a 200-line auto-load cap. The more practical cap is 14 days without a reference.
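
If you want that rule to be mechanical rather than aspirational, a minimal sketch looks like this. The paths, the file name pattern, and the use of modification time as a stand-in for "referenced in a prompt" are all my assumptions; adapt them to your own layout:

```bash
#!/usr/bin/env bash
# Sketch: park MEMORY files that have not been touched in 14+ days.
# Assumes project memory lives under ~/projects/*/.claude/ -- adjust to reality.
ARCHIVE=~/claude-memory-archive
mkdir -p "$ARCHIVE"

# -mtime +14 uses file modification time as a proxy for "last referenced".
find ~/projects -path '*/.claude/MEMORY*.md' -mtime +14 -print | while read -r f; do
  mv "$f" "$ARCHIVE/$(echo "$f" | tr '/' '_')"
done
```

Run it from a SessionEnd hook or a weekly cron and the ghost files stop billing you in silence.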

Now here is the inconvenient truth.

5.9% is not nothing. It is the layer you can see, which makes it the layer everyone obsesses over. But it is also the layer that pays the least.

The ghost file is the appetizer. The main course is somewhere else.

Two Best Practices I Copied From Threads. Both Cost Me.

Two findings the audit flagged that you would not catch by reading every Anthropic doc cover to cover.

First one. I had a skill called questionnaire-prospect with a description that ran 438 characters. The audit flagged it as over the threshold. Past roughly 250 characters, Claude Code silently truncates the description in the skill menu. The full SKILL.md still loads when the skill is invoked, but in the menu where Claude decides whether to trigger the skill in the first place, the description is mutilated.

Result: Claude sees a cut-off sentence, does not understand when the skill should fire, and quietly ignores it. The skill does not crash. It just stops triggering reliably. I trimmed the description to 223 characters. The audit's recommended optimum is 80 characters. None of the official Anthropic docs I have read mentions this truncation behavior.

What you write in a description, and what Claude actually sees, are not guaranteed to be the same.
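
For anyone who has not poked inside a skill: the description in question lives in the SKILL.md front matter. A compliant version looks roughly like this; the wording below is illustrative, not my production skill:

```yaml
---
name: questionnaire-prospect
description: Build a qualification questionnaire from a prospect's intake form. Use when the user asks to qualify, score, or onboard a new prospect.
---
```

That description sits comfortably under the ~250-character truncation threshold, still above the 80-character optimum. The body of SKILL.md can be as long as it needs to be, since it only loads on invocation. The description is the part that has to survive the menu.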

Second one. A PostToolUse hook running vitest after every Edit/Write. TDD applied to the agent workflow, no broken state ever shipped. Sounded great on the Reddit thread where I found it.

Real life, with one file edited thirty times during a refactor: thirty test runs. Thirty vitest reports polluting Claude's context (each report lands as a system message in its context window). Thirty latency hits. The audit flagged it the moment it ran. I disabled the hook. Latency cleaned up. Context stayed cleaner.

The rule that goes beyond vitest: a generic PostToolUse hook matching Edit/Write is almost always a trap. Hooks should fire on phase transitions (SessionEnd, end of feature, deploy). Not on atomic operations.
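
Concretely, the fix is one key in .claude/settings.json. A sketch, assuming the standard hooks schema and with npx vitest run standing in for your own test command: instead of registering it under PostToolUse with an Edit|Write matcher, register it on a phase boundary.

```json
{
  "hooks": {
    "SessionEnd": [
      {
        "hooks": [
          { "type": "command", "command": "npx vitest run" }
        ]
      }
    ]
  }
}
```

One test run per session instead of one per edit, and the report lands at a moment when there is no context left to pollute.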

The common thread: generic best practices are not the same as best practices for YOUR project at YOUR volume. A senior dev does not run the test suite after every save; they run it before commit. Same logic for agents.

These are the patterns I called out in every Claude Code tutorial that falls apart in production. I had been collecting best practices like skills: in a folder, just in case, paid for in silence.

Why Routing Made the AI Sharper, Not Just Cheaper

This is the finding the audit ranked number one. And it is the section the title points at.

Thirty days of usage. 643 sessions. 137 million billable tokens. Model mix: 94% Opus, 0% Sonnet, 4% Haiku, 2% other.

Sonnet at zero. On 137 million tokens.

I had not chosen Opus over Sonnet. I had simply never set up routing in the first place. The audit tagged it with a savings projection of 50-75%. On the seven days I have visible in the report, I spent $1,668. A 50% routing discipline saves $834 a week. 75% saves $1,250.

That is where the cost angle lands honestly, not on the file cleanup. But cost is half the story. The pivot the title makes is this: the routing does not just save money. It makes the AI sharper. Three mechanisms.

Opus has a temperament. It is tuned to think. When you give it a bounded task, it widens it, because that is what it does well. Take a typical week. I am debugging a flow in my WooCommerce store and I ask an Explore agent a simple question: "find every reference to checkout_cta across the storefront." A grep task. Literally. Haiku does this in four seconds for a fraction of a cent.

In 94% Opus mode, the agent reads 23 files, contextualizes the usages, notices an inconsistency between the markdown bullet and blockquote formats in older versus newer modules, proposes a refactor, opens a discussion about factoring the pattern into a shared template. I had asked for nothing of the kind. I wanted a list of files. I got a mini architecture review. Cost: roughly $0.30, eight minutes, context eaten for nothing because I had already moved on.

Routed to Haiku, same query: list in four seconds, $0.001, context intact.

Opus is not bad here. Opus is miscast. Asking Opus to grep is like asking a senior architect to count boxes in a closet.

Sonnet on bounded refactors is more disciplined than Opus. For transformations in a defined scope (refactoring a module, adding a feature inside an agreed boundary), Sonnet ships output that stays aligned with the ask. Fewer unsolicited alternative proposals. Right tool, right scope.

Opus freed from bounded tasks gets better at the architectural decisions you actually need it for. When you let Opus handle grep AND counting AND lookup AND architectural decisions, by the time you invoke it on the hard problem, its context is already polluted by 47 sessions of grep. The "quality drops past 70%" rule lives here. You do not just pay Opus to do Haiku's job. You pay the hidden price too: your Opus is less sharp when you actually need it.

Opus stays the right call for complex reasoning; that is what Anthropic tunes it for. Haiku is not "smarter" in absolute terms. The nuance is that Haiku is better cast: the right task, a clean context.
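
One way to make that casting easier to stick to, if you use Claude Code subagents: pin the lookup work to Haiku in its own agent file under .claude/agents/. The agent name, tool list, and prompt below are mine, a sketch rather than anything the audit generates:

```markdown
---
name: lookup
description: Read-only lookups. Finds references, lists files, greps the codebase. Use for any "where is X used" question.
tools: Read, Grep, Glob
model: haiku
---
Answer lookup questions with a plain list of file paths and line numbers.
Do not propose refactors. Do not open architectural discussions.
```

With that in place, the grep-shaped questions have somewhere cheap to land.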

The Lever the Audit Couldn't Pull for Me

The audit shows you. The audit does not fix everything.

Of the findings in the report, the file cleanup is auto-applicable. Archive a file, it stays archived. The skill bloat is partially auto: descriptions get shortened in place. The vitest hook is binary, on or off.

The model routing? That is different.

The tool injects a routing block into CLAUDE.md global with a 48-hour TTL. If I do not refresh it, it auto-deletes, because usage patterns drift. I cannot auto-discipline myself from a config file. I have to consciously dispatch: Explore and lookup go to Haiku. Refactors in scope go to Sonnet. Architectural decisions go to Opus.
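
I will not reproduce the generated block verbatim here. A hand-rolled equivalent, if you want to keep one in your global CLAUDE.md yourself, reads something like this (wording mine, not the plugin's):

```markdown
## Model routing (re-audit within 48 hours or delete)
- Explore, grep, lookup, "where is X used": dispatch to Haiku.
- Refactors and features inside an agreed scope: Sonnet.
- Architecture, trade-offs, anything without a clear scope yet: Opus.
```

Writing it down is the easy part.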

That is human discipline. Measurable, but not automatable.

The real message of the audit is not "here are the files to archive." It is closer to: here is the debt your workflow accumulates, and the dominant lever runs through you, not through me.

Same structure as something I shipped a couple of months ago, when I let Claude Code audit my own 374 sessions and the report came back inconveniently honest. /insights audited my behavior. token-optimizer audits my config. In both cases, the correction is human.

The 50-75% number is a projection at the moment of the audit. My real gain depends on my routing discipline over the next two weeks. Honest framing.

Two Weeks Later: What the Score Did and Didn't Move

I ran the audit again two weeks later. Different score, different shape.

The file cleanup held. The archived MEMORY files stayed archived. The skill descriptions stayed trimmed. -1,386 tokens per message on the file layer, baseline. -2,565 tokens on the on-demand corpus. Those gains do not require willpower. They are structural.

The routing layer is where the real test was. I had committed to dispatching Explore and lookup tasks to Haiku, refactors to Sonnet, architectural decisions to Opus. Two weeks of conscious dispatch. The model mix shifted, but not as cleanly as I had pictured. The 48-hour TTL on the routing block in CLAUDE.md global did its job: when I drifted back to default-Opus on a Tuesday afternoon, the audit caught it the next time I ran it.

Drift detection is the feature I underestimated. The tool snapshots your config at install and watches for what comes back. If a hook I disabled gets reinstalled, I see it. If a memory file I archived returns, I see it too. The model mix slipping back to 90% Opus also lands in red on the dashboard.

What I see clearly now: the qualitative side is real. The Explore tasks I route to Haiku come back faster, in cleaner context, and the Opus calls I save for architectural decisions arrive on a context that has not already been chewed. Sessions that used to hit 70% fill before compaction now hit it a third later. The bill will not register that change. The output does.

A C is not "bad." It means there is margin, and you now know where.

The real levers do not sit where Medium keeps pointing. They sit in routing, and in the hooks you copy from threads without testing on your own setup.


Token-optimizer is part of my routine now, sitting next to graphify, which is a story for another day. I am calmer, and Claudius is sharper 😊

Less tempted to wander over and check whether the grass is greener at Codex.
