Three weeks ago I had an overnight agent run that cleaned out my Claude Max quota in about 40 minutes. I went to bed thinking the refactor would be done by morning, and woke up to a "rate limit reached" alert before any of it had committed.
There's an active GitHub thread on this — anthropics/claude-code#41930. It got locked a couple of days ago with the lock reason "resolved", which I think means Anthropic shipped something server-side. Good. But a separate thing was bugging me, which is that I had no idea, while it was happening, what specifically was eating the quota. The Anthropic console tells you you're over the limit. It doesn't tell you why.
So I pulled the JSONL session logs from ~/.claude/projects/ and started reading. They're surprisingly readable: every assistant turn, every tool call, the token counts, all there. After a few hours I had three observations I haven't seen written up anywhere.
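If you want to poke at the logs yourself before installing anything, this is roughly the loop I started with. A minimal sketch; the usage field names (`message.usage`, `cache_read_input_tokens`, and friends) match what I saw in my own logs and may differ across Claude Code versions:

```typescript
import { readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";
import { homedir } from "node:os";

// Sum token usage across every assistant turn in one session file.
// Only assistant turns carry a usage object; everything else is skipped.
function sessionTotals(path: string) {
  const totals = { input: 0, output: 0, cacheRead: 0, cacheWrite: 0 };
  for (const line of readFileSync(path, "utf8").split("\n")) {
    if (!line.trim()) continue;
    const entry = JSON.parse(line);
    const usage = entry?.message?.usage;
    if (!usage) continue;
    totals.input += usage.input_tokens ?? 0;
    totals.output += usage.output_tokens ?? 0;
    totals.cacheRead += usage.cache_read_input_tokens ?? 0;
    totals.cacheWrite += usage.cache_creation_input_tokens ?? 0;
  }
  return totals;
}

// One directory per project, one JSONL file per session.
const root = join(homedir(), ".claude", "projects");
for (const project of readdirSync(root)) {
  for (const file of readdirSync(join(root, project))) {
    if (file.endsWith(".jsonl")) {
      console.log(project, file, sessionTotals(join(root, project, file)));
    }
  }
}
```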
1. Cache-read tokens are not a flat multiplier
I had assumed cache reads behaved like a steady tax on input tokens. They don't. They spike on certain tool sequences — specifically, when the agent does Edit → Read(same file) → Edit → Read. Each Read after the Edit reloads the whole file context, and the cache-read count for those turns is something like 3-5× the count for turns that just used Grep or Bash.
Anyone who's had a long agent run that "felt fine per turn but blew the budget at the end" — this is probably why. The agent verifies its own work after every change, and each verification is full-context-priced.
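You can check this in your own logs by averaging cache-read tokens per turn, grouped by tool. A sketch, assuming tool calls show up as `tool_use` blocks in `message.content` (the standard Anthropic message shape):

```typescript
// Average cache-read tokens per turn, grouped by the tool the turn called.
// A turn that calls several tools gets counted once per tool, which is
// crude but good enough to make the Read-after-Edit spike visible.
function cacheReadByTool(lines: string[]) {
  const byTool = new Map<string, { turns: number; cacheRead: number }>();
  for (const line of lines) {
    if (!line.trim()) continue;
    const entry = JSON.parse(line);
    const usage = entry?.message?.usage;
    const content = entry?.message?.content;
    if (!usage || !Array.isArray(content)) continue;
    for (const block of content) {
      if (block.type !== "tool_use") continue;
      const stats = byTool.get(block.name) ?? { turns: 0, cacheRead: 0 };
      stats.turns += 1;
      stats.cacheRead += usage.cache_read_input_tokens ?? 0;
      byTool.set(block.name, stats);
    }
  }
  for (const [tool, s] of byTool) {
    console.log(tool, Math.round(s.cacheRead / s.turns), "cache-read/turn");
  }
}
```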
2. Post-compact turns cost a lot more than you'd think
When the conversation hits the context limit, Claude Code compacts: summarizes the prior context into a single message, then keeps going. The first 2-3 turns after a compact are around 3-5× the cost of the same kind of turn at session start. The agent has to re-establish its mental model, often re-reads files it already had loaded, and pays for it.
This is a known trade-off (compaction lets you go past the limit at all), but I think people are seriously under-counting how much of the "I burned my quota" delta comes from this. If your workflow involves many long compacted sessions, that's where a chunk of the spend goes.
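To measure this yourself, find the compaction boundaries and compare the first few turns after each one against the session average. A sketch, where `entries` is the parsed JSONL lines; I'm using `isCompactSummary` as a stand-in for however your Claude Code version marks the compacted summary message, so check your own logs for the real marker:

```typescript
// Ratio of per-turn cost in the first 3 turns after each compaction
// to the average per-turn cost across the whole session.
function postCompactPremium(entries: any[]): number {
  const cost = (e: any) =>
    (e.message?.usage?.input_tokens ?? 0) +
    (e.message?.usage?.cache_read_input_tokens ?? 0);
  const all: number[] = [];
  const post: number[] = [];
  let sinceCompact = Infinity;
  for (const e of entries) {
    if (e.isCompactSummary) { // placeholder marker, see lead-in
      sinceCompact = 0;
      continue;
    }
    if (!e.message?.usage) continue;
    sinceCompact += 1;
    all.push(cost(e));
    if (sinceCompact <= 3) post.push(cost(e));
  }
  const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / (xs.length || 1);
  return avg(post) / avg(all); // came out around 3-5x in my sessions
}
```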
3. Subagent loops multiply down the tree
If your main agent dispatches a subagent, both processes log their own session JSONL. Both load the conversation context. Both pay for it. If the subagent dispatches further subagents, that compounds.
The worst session I have is a 4-deep subagent chain where the leaf got into a Read-loop on the same file. It cost $47 for what should have been under a dollar. The data is sitting right there in the JSONL — there's a subagentPaths field that ties it all together — but nothing I'd seen surfaces it.
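Costing the tree is a straightforward recursion once you have that field. A sketch, assuming `subagentPaths` holds paths to the child sessions' JSONL files (the shape I saw in mine):

```typescript
import { readFileSync } from "node:fs";

// Recursively sum input + cache-read tokens down a subagent chain,
// printing one indented line per node so the expensive branch stands out.
function treeCost(sessionPath: string, depth = 0): number {
  let tokens = 0;
  for (const line of readFileSync(sessionPath, "utf8").split("\n")) {
    if (!line.trim()) continue;
    const entry = JSON.parse(line);
    const usage = entry?.message?.usage;
    if (usage) {
      tokens += (usage.input_tokens ?? 0) + (usage.cache_read_input_tokens ?? 0);
    }
    for (const child of entry.subagentPaths ?? []) {
      tokens += treeCost(child, depth + 1); // each level loads its own context
    }
  }
  console.log(`${"  ".repeat(depth)}${sessionPath}: ${tokens}`);
  return tokens;
}
```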
What I built
A small MCP server, `@vk0/agent-cost-mcp`. v2.0-beta.4 just shipped on npm. It reads the JSONL logs locally and exposes a few tools the agent can call:
- `get_subagent_tree` — returns the subagent tree as a JSON object with cost summed per branch. This is the one I cared most about. You can see exactly which child agent burned what.
- `get_tool_roi` — for each tool the agent used, looks at whether the next turn moved the conversation forward (linked tool result, no immediate retry/rollback). Tools that fire repeatedly without progress get tagged `efficiency=low`. That's the runaway-loop signature.
- `detect_cost_anomalies` — keeps a rolling daily baseline and flags days that deviate by more than ~50% from the recent average.
- `estimate_run_cost` — give it a prompt and a model, get back a `{low, expected, high}` for what the run will cost.
- `configure_budget` — set a daily cap and three thresholds (80/100/150%). When you cross one, the next cost-query tool returns the alert in its response, and your agent can read that and stop.
- `set_monitor_webhook` — pipe alerts to Telegram/Slack/whatever. HMAC-signed so the receiver can verify; a receiver-side sketch follows this list.
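Receiver-side verification is standard HMAC checking. A minimal sketch, assuming the signature arrives as a hex-encoded HMAC-SHA256 of the raw request body in a header; the actual header name and encoding are whatever the repo documents, not something I'm asserting here:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verify an incoming alert against a shared secret. Header name and
// hex encoding are assumptions about the wire format (see lead-in).
function verifyWebhook(rawBody: string, signature: string, secret: string): boolean {
  const expected = createHmac("sha256", secret).update(rawBody).digest("hex");
  const a = Buffer.from(expected);
  const b = Buffer.from(signature);
  return a.length === b.length && timingSafeEqual(a, b);
}
```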
It's all local. No API key, no auth, no cloud. Reads files you already have. There's an opt-in telemetry stub for v2.1 so I can do anonymized benchmarks ("your session cost X, median for similar sessions is Y") but it's a no-op in v2.0 by design — I want to ship that with the privacy commitments worked out before the network code goes in.
What it doesn't do
It doesn't fix anything Anthropic-side. If their server-side resolution holds, the drain pattern from #41930 should be less common — but the per-tool waste, subagent loops, and post-compact reprocessing cost are independent problems that are still on you to manage.
The forecasting tool is dumb right now. It's basically linear extrapolation. I'll do something with seasonal smoothing in v2.1 if it turns out to matter.
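Concretely, "linear extrapolation" here means cost per turn so far, projected over the turns you expect; the function below is a hypothetical illustration of that idea, not the package's actual internals:

```typescript
// Hypothetical illustration of the linear approach: average cost per turn
// so far, projected over the expected number of turns, with fixed bands.
// Not the package's actual internals.
function linearEstimate(costSoFar: number, turnsSoFar: number, expectedTurns: number) {
  const perTurn = costSoFar / Math.max(turnsSoFar, 1);
  const expected = perTurn * expectedTurns;
  return { low: expected * 0.7, expected, high: expected * 1.5 };
}
```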
The runaway detection has about an 8% false positive rate on my fixture sessions. I'd like to get that under 3% before going to GA, but I'd need adversarial samples I don't have. If anyone reading this has a JSONL file from a session that went really sideways (and is comfortable with anonymizing it), I'd love to run it.
Try it
```bash
npx -y @vk0/agent-cost-mcp@beta
```
Repo: https://github.com/vk0dev/agent-cost-mcp
Five MCP clients tested — Claude Desktop, Claude Code, Cursor, Cline, Windsurf.
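Registration is the usual MCP server entry. For Claude Desktop (and the other JSON-config clients) it looks like this; the `agent-cost` key is just a name you pick:

```json
{
  "mcpServers": {
    "agent-cost": {
      "command": "npx",
      "args": ["-y", "@vk0/agent-cost-mcp@beta"]
    }
  }
}
```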