Garvit Surana

I lost $14,502 to Claude Code in one month. Here's the autopsy.

Last spring I racked up a $14,502 invoice on Claude Code in 31 days.

Anthropic's billing page told me the total. It didn't tell me where it went. There was no per-session breakdown, no "this retry storm cost you $612," no way to tell whether I was hitting Opus on tasks Sonnet would have nailed. Just one big number, paid, gone.

So I wrote a CLI to read my own ~/.claude/projects/*.jsonl files and rank the leaks. The CLI is open source (github.com/garvitsurana271/burnd) but this post isn't a pitch — it's the autopsy. Eight patterns, ranked by what they cost me personally. Most of them generalize to any LLM-coding-agent setup; only two are Claude-specific.

If you've ever paid an LLM bill and not known where it went, one of these is probably eating you alive too.

The damage, ranked

#   Pattern               % of bill   $ value
1   retry-storm           21.6%       $3,140
2   wrong-model-on-task   19.9%       $2,890
3   context-bloat         14.7%       $2,140
4   repeated-reads        11.1%       $1,610
5   tool-overuse           9.4%       $1,360
6   model-substitution     7.7%       $1,120
7   off-hours-spend        5.9%       $850
8   spend-creep            4.5%       $650
–   unclassified noise     5.2%       $742

Now the parts.


1 · retry-storm — $3,140

Pattern: Claude tries something, it fails, it tries again, fails again, tries a third time, and so on. Each retry pays full input-token cost for the (usually growing) context. If you don't notice, the loop can run six or seven turns deep before you intervene.

The session that hurt the most: I asked Claude to "make this work with the new API." It tried four implementations across six turns; each turn re-read the same 14k-token codebase context. Total session: $612 for what should have been a $40 fix.

What it looks like in the raw logs. A retry-storm turn pair in the JSONL typically reads:

{
  "role": "assistant",
  "tool_use": [{ "name": "Edit", "input": { ... } }],
  "stop_reason": "tool_use",
  "usage": { "input_tokens": 14723, "output_tokens": 891 }
}
{
  "role": "tool_result",
  "is_error": true,
  "content": "TypeError: Cannot read property 'foo' of undefined"
}

Three consecutive (assistant, tool-error) pairs with no user message in between = retry storm. The detector codifies this:

// Minimal shape of a parsed JSONL turn — just what this detector needs.
interface Turn {
  role: 'user' | 'assistant' | 'tool_result';
  is_error?: boolean;
}

function isRetryStorm(turns: Turn[]): boolean {
  let consecutiveErrors = 0;
  for (const t of turns) {
    if (t.role === 'tool_result' && t.is_error) consecutiveErrors++;
    else if (t.role === 'user') consecutiveErrors = 0;     // user broke the loop
    if (consecutiveErrors >= 3) return true;
  }
  return false;
}

Fix that actually worked for me: when Claude's first attempt fails, I now stop, read the error, and either give explicit guidance or rewrite the prompt. The retry loop is the single most expensive habit a Claude Code user has — and the one easiest to break once you see it.


2 · wrong-model-on-task — $2,890

Pattern: running Opus on tasks Sonnet (or Haiku) would have nailed in seconds. This mostly happens because Claude Code's model selection is sticky — set Opus for one hard problem and you keep paying Opus rates for the next ten trivial ones.

Concrete example: a 142-token "rename this variable" task on Opus 4.7 billed $1.40. Same task on Sonnet 4.6 would have been ~$0.05. That's a 28× markup for nothing.

The heuristic for classifying tasks:

function isOpusClass(turn: AssistantTurn): boolean {
  return turn.output_tokens > 2000                     // generated significant code
      || turn.tool_uses.length > 5                     // multi-tool reasoning
      || turn.input.includes('refactor')               // architectural verb
      || turn.input.length > 8000;                     // complex prompt
}

If a turn used Opus but doesn't match isOpusClass, the waste is the cost delta to Sonnet.
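
A minimal sketch of that delta, reusing isOpusClass from above. RATES and wrongModelWaste are illustrative names rather than Burnd's actual code, and the per-million-token prices are placeholders — swap in current Anthropic list pricing:

// Placeholder $/million-token rates — substitute current list prices.
const RATES = {
  opus:   { input: 15, output: 75 },
  sonnet: { input: 3,  output: 15 },
};

// Waste for one Opus-billed turn: what it cost on Opus minus what the
// same tokens would have cost on Sonnet. Zero if Opus was justified.
function wrongModelWaste(turn: AssistantTurn, inputTokens: number): number {
  if (isOpusClass(turn)) return 0;
  const onOpus   = (inputTokens * RATES.opus.input   + turn.output_tokens * RATES.opus.output)   / 1e6;
  const onSonnet = (inputTokens * RATES.sonnet.input + turn.output_tokens * RATES.sonnet.output) / 1e6;
  return onOpus - onSonnet;
}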

Fix: I now set the model explicitly per session instead of relying on Claude Code's sticky preference: claude --model sonnet for routine work, claude --model opus for novel-architecture work. Saves ~70% on routine sessions, with effectively zero quality loss — the Opus-vs-Sonnet quality gap on a "rename this variable" task is rounding error.


3 · context-bloat — $2,140

Pattern: a session passes 60k tokens of conversation context, and now every new turn pays full input cost on the entire history. The cumulative effect is brutal — a 200k-context session pays roughly 50× the cost of an equivalent 4k-context session, even when your actual question is the same size.

Most bloat comes from (a) Claude re-reading large files across turns, and (b) the user pasting big chunks of code that linger.

The detector flags any session where input-tokens-per-turn exceeds a per-user baseline by 3σ. Per-user matters here — a researcher chewing through papers has a legitimately higher baseline than someone editing CSS, so an absolute threshold over-fires for the first group and under-fires for the second.

function isBloated(session: Session, userBaseline: number, userStdDev: number): boolean {
  const avgInputPerTurn = session.totalInputTokens / session.turnCount;
  return avgInputPerTurn > userBaseline + 3 * userStdDev;
}
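
Where the baseline comes from: nothing fancier than the mean and standard deviation of input-tokens-per-turn across that user's past sessions. A sketch with my own naming, not necessarily Burnd's:

// Per-user baseline: mean and σ of input-tokens-per-turn over past sessions.
function baselineStats(sessions: Session[]): { mean: number; stdDev: number } {
  const perTurn = sessions.map(s => s.totalInputTokens / s.turnCount);
  const mean = perTurn.reduce((a, b) => a + b, 0) / perTurn.length;
  const variance = perTurn.reduce((a, x) => a + (x - mean) ** 2, 0) / perTurn.length;
  return { mean, stdDev: Math.sqrt(variance) };
}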

Fix: start a new session more aggressively. Once you cross ~50k tokens, the cost-per-turn-to-value-of-turn ratio breaks. Better to ask Claude to summarize the session into a single "state" document and start fresh with that summary as the new context. Pays itself back within 2-3 turns.


4 · repeated-reads — $1,610

Pattern: Claude reads the same file three or more times in a session because the conversation didn't preserve what it read. Each re-read pays full input-token cost for that file.

This happened in a session where I asked questions about my Express routes file across 8 turns; Claude re-read it 5 times because each turn was somewhat independent and Claude's context management decided "fresh read is safer."

The detector flags any file path appearing in tool-use input across 3+ non-consecutive turns:

// True when the turn indices form one unbroken run (e.g. [4, 5, 6]).
const isConsecutive = (turns: number[]): boolean =>
  turns.every((v, i) => i === 0 || v === turns[i - 1] + 1);

function repeatedReads(session: Session): string[] {
  const reads: Record<string, number[]> = {};
  session.turns.forEach((t, i) => {
    for (const use of t.tool_uses ?? []) {
      if (use.name === 'Read' && use.input.file_path) {
        (reads[use.input.file_path] ??= []).push(i);
      }
    }
  });
  return Object.entries(reads)
    .filter(([, turns]) => turns.length >= 3 && !isConsecutive(turns))
    .map(([path]) => path);
}

Fix: explicitly tell Claude "you've already read routes.ts — refer to it from context" — or ask it to summarize the file once into the session and refer back. Saves an entire re-read every time.


5 · tool-overuse — $1,360

Pattern: agentic loops where Claude calls Read, Grep, ListFiles, etc. excessively, often duplicating work the previous turn already did. Each tool call costs input tokens to plan + output tokens to consume the result.

Heaviest session: Claude ran 47 separate Grep calls across 12 turns. About 15 of those were variations of the same query (User, users, User\s+, \buser\b, Users).

The detector flags sessions with tool-call counts more than 2σ above the user's baseline AND with significant overlap in tool arguments (Levenshtein distance < 20% between calls). The Levenshtein check is what catches the "I keep grepping for variants of the same word" pattern that simple count-based detectors miss.
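
A minimal sketch of the overlap half, assuming tool arguments have been serialized to strings first — levenshtein and isToolOveruse are my illustrative names, not the detector's:

// Classic dynamic-programming edit distance between two strings.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    [i, ...new Array<number>(b.length).fill(0)]);
  for (let j = 0; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                      // deletion
        dp[i][j - 1] + 1,                                      // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1),    // substitution
      );
  return dp[a.length][b.length];
}

// Flag when the call count is 2σ above baseline AND some pair of
// arguments differ by less than 20% of the longer string's length.
function isToolOveruse(args: string[], baseline: number, stdDev: number): boolean {
  if (args.length <= baseline + 2 * stdDev) return false;
  for (let i = 0; i < args.length; i++)
    for (let j = i + 1; j < args.length; j++)
      if (levenshtein(args[i], args[j]) < 0.2 * Math.max(args[i].length, args[j].length))
        return true;
  return false;
}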

Fix: prompt-engineer the search step. "Before searching, list the queries you intend to run. I'll approve them. Then batch the searches." Turns 47 grep calls into one batched plan. Costs less in tokens; produces better answers because Claude reasons about coverage upfront.


6 · model-substitution — $1,120

Pattern: subtle, and Claude-specific. Sometimes Claude Code ends up on a more expensive model than the one you requested, thanks to availability throttling or model-routing quirks. You think you're paying Sonnet rates, but the session logs show Opus.

The detector compares the user's stated model preference (in .claude/config.json) to the actual model used (parsed from the model field in each JSONL turn), and reports the fraction of turns where a cheaper model was requested but a more expensive one was billed.

// Crude tiering — treating any Opus-family model id as the expensive tier.
const isExpensive = (model: string): boolean => model.includes('opus');

function modelSubstitution(session: Session, statedModel: string): number {
  const actualModels = session.turns.map(t => t.model);
  const wrongModelTurns = actualModels.filter(m =>
    isExpensive(m) && !isExpensive(statedModel)
  );
  return wrongModelTurns.length / actualModels.length;
}

Fix: spot-check session logs occasionally. This isn't your fault — it's an Anthropic-side sometimes-thing — but it's worth knowing about so you can report it or factor it into your monthly burn.


7 · off-hours-spend — $850

Pattern: agentic sessions that ran while you weren't watching, drove up cost on autopilot, and you didn't notice until the next morning. Often these are "let it think for a while" sessions where you walked away.

The detector clusters timestamps to derive your "active hours" envelope (heuristic: any 10-minute window with ≥3 short-turn-cadence interactions = human-in-the-loop), then flags spend during off-envelope hours.
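
A sketch of that envelope check — timestamps in milliseconds, and the windowing details are my assumption rather than Burnd's exact logic:

// A turn is "attended" if ≥3 user messages land within its 10-minute window.
function offHoursSpend(turns: { ts: number; cost: number; role: string }[]): number {
  const userTs = turns.filter(t => t.role === 'user').map(t => t.ts);
  const WINDOW = 10 * 60 * 1000;
  const attended = (ts: number) =>
    userTs.filter(u => Math.abs(u - ts) <= WINDOW / 2).length >= 3;
  // Sum the cost of every turn that fell outside the active envelope.
  return turns.filter(t => !attended(t.ts))
              .reduce((sum, t) => sum + t.cost, 0);
}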

Fix: this is more of an awareness pattern than a fix. Nothing wrong with letting Claude work while you're away — but the detector tells you what it cost so you can make the trade explicitly. I cut my off-hours spend ~60% just by knowing the number.


8 · spend-creep — $650

Pattern: a slow upward drift in average cost-per-session week-over-week, without a corresponding increase in productivity. Often caused by gradually larger context windows, gradually more tool calls, or gradually more aggressive model selection.

The detector compares each week's median cost-per-session to the previous 4-week rolling median. Flags weeks with >40% increase that aren't explained by a major workload change.
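
A sketch of that comparison — weeklyCosts[w] holds week w's per-session dollar costs, the naming is mine rather than Burnd's, and the 40% cutoff comes straight from the description above:

const median = (xs: number[]): number => {
  const s = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
};

// Flag weeks whose median session cost beats the prior 4-week rolling median by >40%.
function creepWeeks(weeklyCosts: number[][]): number[] {
  const flagged: number[] = [];
  for (let w = 4; w < weeklyCosts.length; w++) {
    const rolling = median(weeklyCosts.slice(w - 4, w).flat());
    if (median(weeklyCosts[w]) > rolling * 1.4) flagged.push(w);
  }
  return flagged;
}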

Fix: the early-warning value here is bigger than any specific fix. Knowing your spend is creeping lets you check in BEFORE the next $14k surprise. I now look at the trend chart on Sunday mornings — takes 30 seconds.


"But Anthropic shipped /usage — isn't that enough?"

Anthropic shipped /usage in mid-April. It's good. It shows your current-session token usage in real time.

It's also a different problem. /usage answers "what is this one session costing me right now?" The autopsy above answers "which patterns across my whole month are bleeding money?" You need both.

/usage is your speedometer. The detector list is your annual mechanic inspection. They're complementary, not substitutes.

The two practical differences:

  1. Cross-session. A retry storm rarely happens in one session — the pattern is "every time you ask me to integrate a new API, you retry 4 times." That's only visible looking across N sessions of the same shape. /usage is single-session.
  2. Prescriptive. /usage tells you the dollar number. The detector list tells you what to change in your CLAUDE.md so the leak stops returning. That's the difference between a dashboard and a fix.

What I changed after running the detectors on my own logs

Three habits, ranked by impact:

  1. I check before I retry. When Claude's first attempt fails, I read it before re-running. ~22% of my prior waste was retry storms; this habit alone got most of it back.
  2. I explicitly set model per session. claude --model sonnet for routine work, claude --model opus for heavy lifting. Save the Opus tab for problems that genuinely need it.
  3. I start new sessions more aggressively. When a session crosses 50k tokens, I summarize and reset. The summary loses ~5% of the context; the next 20 turns are 10× cheaper.

Three months later my Claude Code spend is ~$2,300/month for the same volume of work. Not zero, but a hell of a lot less than $14,502.


Reproducible: try it on your own logs

The numbers above are from running Burnd on my own ~/.claude/projects/*.jsonl for one calendar month. It's open source, MIT, local-first — nothing leaves your machine.

npx getburnd

That's the whole free tier. It scans, ranks, shows you the same eight detectors with your own dollar amounts and the top sessions for each. No account, no signup, no telemetry.

If you want it to also auto-apply CLAUDE.md fixes, run alerts when spend creeps, and produce weekly reports, there's a Pro tier ($89 lifetime, founding price until May 18) — but the free CLI is genuinely complete on its own and most people will never need anything more.

Detector source is in /src/cli/src/detectors. Each has unit tests with synthetic session data. If you disagree with a threshold, the tests show what would change at different cutoffs. PRs welcome — the detector list is meant to grow.

If you've found patterns I missed, drop a comment or open an issue. I want this list to be the canonical "where LLM coding spend leaks" reference.


About the author: Garvit Surana is a 16-year-old developer in Guwahati, India. He shipped Burnd after losing $14,502 of Claude Code in one month. Find him on GitHub and at garvit-surana.vercel.app.

