- Author: gabrielanhaia
A $200 Plan That Died in Two Days
Something went wrong with Claude Code in late March 2025. Not a crash. Not a 500 error. The billing meter lost its mind.
Developers across multiple continents opened their dashboards to discover their monthly quotas had been gutted. Plans meant to last 30 days flatlined in 48 hours. Some users reported simple operations eating 10x to 20x more quota than expected. A function refactor here, a test generation there, and suddenly the account was toast.
Anthropic stayed quiet for a while. Then came the postmortem on April 2nd, confirming what everyone suspected: prompt caching had broken. Two separate bugs killed it simultaneously. Every request was being processed from scratch, billed at full price, with no one at Anthropic catching it for days.
The developer community's reaction was predictable. And entirely earned.
Prompt Caching 101: The Thing That Keeps AI Billing Sane
Before getting into the bugs, it's worth understanding what prompt caching actually does. It's the single biggest cost-reduction mechanism in usage-based AI pricing.
When a developer works in a codebase with Claude Code, the tool builds up context: file trees, recent edits, coding patterns, project structure. That context gets sent as part of the prompt to the model on every request. For a medium-sized project, this context alone can run 20,000 to 80,000 tokens.
Prompt caching stores this context server-side so repeat requests don't re-process it from zero. The cached tokens get billed at a fraction of the cost. Anthropic's own documentation puts cached input tokens at roughly 10% of the price of fresh input tokens.
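At the API level, caching is opt-in: the client marks the large, stable prefix of the prompt (system instructions plus project context) with a `cache_control` breakpoint, and the server reuses everything up to that marker on later requests. Here's a sketch of how a tool like Claude Code might structure such a request (the payload shape follows Anthropic's published prompt-caching docs; the context string and function name are invented for illustration):

```python
def build_cached_request(project_context: str, user_message: str) -> dict:
    """Build a Messages API payload whose large, stable prefix is cacheable.

    Everything up to the `cache_control` breakpoint (system prompt plus
    project context) can be stored server-side; only the user turn is fresh.
    """
    return {
        "model": "claude-3-5-sonnet-20241022",
        "max_tokens": 1024,
        "system": [
            {"type": "text", "text": "You are a coding assistant."},
            {
                "type": "text",
                "text": project_context,  # the 20k-80k tokens of file trees, edits, etc.
                "cache_control": {"type": "ephemeral"},  # cache breakpoint
            },
        ],
        "messages": [{"role": "user", "content": user_message}],
    }

payload = build_cached_request(
    "<~45k tokens of repo context>", "Refactor the auth middleware"
)
# The same keyword arguments work with the official SDK:
#   anthropic.Anthropic().messages.create(**payload)
```

On each response, the `usage` object reports how many input tokens were read from cache versus processed fresh, which is the signal that went silently wrong in this incident.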
Here's a simplified view of what token costs look like with and without caching on a typical coding session:
# WITH prompt caching working correctly
# ──────────────────────────────────────
Request 1: "Refactor the auth middleware"
  System prompt tokens:     3,000   (full price)
  Project context tokens:  45,000   (full price - first load)
  User message tokens:         50   (full price)
  Total billed tokens:     48,050

Request 2: "Now add error handling to that refactor"
  System prompt tokens:     3,000   (CACHED - 90% discount)
  Project context tokens:  45,000   (CACHED - 90% discount)
  New context tokens:       1,200   (full price)
  User message tokens:         60   (full price)
  Total billed (effective): 6,060   << this is the number that matters

# WITHOUT prompt caching (the bug)
# ──────────────────────────────────────
Request 2: "Now add error handling to that refactor"
  System prompt tokens:     3,000   (full price)
  Project context tokens:  45,000   (full price)
  New context tokens:       1,200   (full price)
  User message tokens:         60   (full price)
  Total billed tokens:     49,260   << 8x more expensive
That's one follow-up request. Now multiply that by a normal working day. A developer making 40-60 requests during an eight-hour coding session would see their token consumption balloon from roughly 250,000 effective tokens to over 2,000,000 tokens. Same work. Same output quality. Eight times the bill.
And that's a conservative estimate. Developers working on larger codebases with 100k+ token context windows were looking at even worse multipliers.
The Two Bugs: A Perfect Failure
Anthropic's postmortem confirmed two distinct caching bugs hit production at the same time. Understanding how they interacted explains why the damage was so severe.
Bug #1: Cache writes failed. New context wasn't being persisted to the cache layer. Every fresh session started cold, and nothing it processed got stored for future use. Think of it like a database where all INSERT operations silently fail. The application keeps running, but nothing accumulates.
Bug #2: Cache reads failed. Even data that had been cached before the bug existed couldn't be retrieved. This is the double-tap. Bug #1 alone would've been bad but survivable for users with established sessions. Bug #2 made sure nobody got cache hits, period.
Together, these bugs created a system doing maximum computational work on every single interaction while billing accordingly. No errors thrown. No warnings surfaced. The tool worked fine from a functionality perspective. It just cost 8-15x more per request than it should have.
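To make the failure mode concrete, here's a toy cache model (illustrative only, not Anthropic's actual implementation) where the two bugs can be toggled independently. With both switches flipped, every request bills at the full rate and the store never grows:

```python
class PromptCache:
    """Toy prefix cache with switches mimicking the two reported bugs."""

    def __init__(self, writes_broken=False, reads_broken=False):
        self.store = {}
        self.writes_broken = writes_broken
        self.reads_broken = reads_broken

    def billable_tokens(self, prefix: str, tokens: int) -> int:
        # Bug #2: reads fail, so even previously stored prefixes miss.
        if not self.reads_broken and prefix in self.store:
            return tokens // 10  # cache hit: ~10% of full price
        # Bug #1: writes fail silently, so nothing accumulates for next time.
        if not self.writes_broken:
            self.store[prefix] = True
        return tokens  # cache miss: full price

healthy = PromptCache()
broken = PromptCache(writes_broken=True, reads_broken=True)

# Two identical follow-up requests over a 45k-token context prefix:
print([healthy.billable_tokens("ctx", 45_000) for _ in range(2)])  # [45000, 4500]
print([broken.billable_tokens("ctx", 45_000) for _ in range(2)])   # [45000, 45000]
```

Note what the broken path never does: raise an error. It just returns the full-price number, which is exactly why nothing surfaced for days.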
A rough model of what this looks like in a real session:
# Simulating token costs over a typical dev session:
# 50 requests, 45k context tokens, Claude 3.5 Sonnet input pricing
CACHED_INPUT_RATE = 0.30   # $ per million tokens (cache read)
FULL_INPUT_RATE = 3.00     # $ per million tokens (fresh input)
CONTEXT_TOKENS = 45_000
REQUESTS_PER_SESSION = 50

# Normal operation: first request pays full price, the rest hit the cache
normal_cost = (
    CONTEXT_TOKENS * FULL_INPUT_RATE / 1_000_000         # first request
    + CONTEXT_TOKENS * (REQUESTS_PER_SESSION - 1)
    * CACHED_INPUT_RATE / 1_000_000                      # cached requests
)
# normal_cost = $0.135 + $0.6615 = ~$0.80

# Bugged operation: every request re-bills the full context
bugged_cost = (
    CONTEXT_TOKENS * REQUESTS_PER_SESSION * FULL_INPUT_RATE / 1_000_000
)
# bugged_cost = $6.75
# That's ~8.4x more expensive PER SESSION
# A dev doing 3-4 sessions/day burns through quota in days, not weeks
These numbers are rough but directionally correct. The actual multiplier varied by project size, session length, and how the developer interacted with the tool. Some users reported even worse ratios.
The Compounding Factors That Made Everything Worse
The caching bug alone would've been a bad week. But three other factors turned it into a billing catastrophe.
Peak-hour quota throttling kicked in at the worst possible time. Anthropic applies multipliers during high-traffic periods that reduce how much compute each plan tier can consume per hour. The specific thresholds aren't public. During the caching outage, developers slammed into these reduced limits even faster because every request was burning tokens at the full uncached rate. So the hourly quota that might normally support 30-40 requests was getting exhausted in 5-6.
To make matters worse, Anthropic's promotional pricing had quietly expired. They'd been running a subsidy that lowered effective per-token costs, and when it ended, many affected users say there was no prominent notification. Per-token costs went up at the exact moment per-request token consumption exploded.
And developers had no way to catch any of this. Claude Code doesn't surface real-time token consumption in any meaningful way. There's no per-request cost breakdown. No "your caching hit rate is 0%" warning. No spending alerts. The only signal was the quota bar dropping like a stone.
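None of that visibility requires provider cooperation, in principle: the usage metadata in each API response is enough to build a rudimentary client-side meter. A sketch of one (the field names follow Anthropic's Messages API `usage` object; the rates are Claude 3.5 Sonnet list prices, where cache writes bill at roughly 1.25x the fresh rate; the warning threshold is an arbitrary choice):

```python
FULL_RATE = 3.00         # $ per million fresh input tokens
CACHE_READ_RATE = 0.30   # $ per million cached input tokens
CACHE_WRITE_RATE = 3.75  # $ per million tokens written to cache (~1.25x fresh)

class CostMeter:
    """Client-side running total built from per-response usage metadata."""

    def __init__(self):
        self.cost = 0.0
        self.hits = 0
        self.misses = 0

    def record(self, usage: dict) -> None:
        """Feed the `usage` object from each API response."""
        fresh = usage.get("input_tokens", 0)
        read = usage.get("cache_read_input_tokens", 0)
        write = usage.get("cache_creation_input_tokens", 0)
        self.cost += (fresh * FULL_RATE + read * CACHE_READ_RATE
                      + write * CACHE_WRITE_RATE) / 1_000_000
        self.hits += read > 0
        self.misses += read == 0

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

# Two requests during the outage: neither ever reads from cache.
meter = CostMeter()
meter.record({"input_tokens": 48_050})  # request 1: cold start, expected
meter.record({"input_tokens": 49_260})  # request 2: should have hit, didn't
if meter.hit_rate < 0.5:
    print(f"warning: cache hit rate {meter.hit_rate:.0%}, spent ${meter.cost:.2f}")
```

A meter like this, running inside the tool, would have flagged a 0% hit rate within minutes of the bug shipping. Instead, the only signal was the quota bar.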
One Reddit user reported their $200 Max plan lasting exactly 47 hours of normal development. Another posted screenshots showing a simple "explain this function" request consuming what appeared to be a full day's quota allocation in a single interaction.
The Community Response: From Confusion to Distrust
The r/ClaudeAI subreddit became ground zero.
Posts titled "Is anyone else's quota disappearing?" started appearing on March 31st. By April 1st, billing complaints dominated the front page. Several developers initially assumed it was an April Fools' joke because the timing was that absurd. It wasn't.
The frustration wasn't purely financial. Developers had restructured their workflows around Claude Code. They'd recommended it to teams, built muscle memory, integrated it into sprint cycles. The tool didn't stop working. It just became economically hostile to use.
The sentiment arc went: confusion, then anger, then something harder to fix. Distrust.
"How do we know this won't happen again?" became the refrain. "How do we know it isn't happening right now at a smaller scale we can't detect?"
These questions aren't paranoia. They're rational responses to a system where the billing meter is completely opaque and just proved it can fail catastrophically without anyone noticing for days. When developers can't independently verify what they're being charged, they're trusting the provider's meter. That meter broke. Trust doesn't just snap back.
The Source Leak Didn't Help
This billing mess didn't happen in isolation. Days before the caching bugs surfaced, roughly 500,000 lines of Claude Code's source code leaked via an npm package. Internal implementation details, API patterns, and architectural decisions were exposed to anyone paying attention.
Anthropic moved fast to pull the code down. The internet moved faster to archive it.
A source code leak alone is survivable. Every large company has had operational mistakes. But stacked on top of a billing crisis, it built a narrative that's difficult to dismantle: that the operational maturity doesn't match the product ambition.
Fair or not, perception compounds. And the compounding here wasn't in Anthropic's favor.
What Anthropic Got Right (and Where They Didn't)
The postmortem was solid engineering communication. They identified both bugs, explained the interaction, acknowledged the impact, and committed to quota extensions for affected users.
But the surrounding response had gaps.
The delay was the biggest problem. Users reported anomalies for days before any public acknowledgment. In a usage-based billing system, every silent hour is an hour where customers are potentially being overcharged with no way to know. Incident communication isn't a nice-to-have here. It's a billing integrity issue. The minimum viable response time for a billing anomaly should be hours, not days.
The promo transition was poorly handled. Ending a promotional offer is standard business. Ending it during a billing bug, without clear user notification, creates a bait-and-switch perception even if the intent was innocent.
Quota extensions are a start, not a solution. For developers who hit their limit mid-sprint, the damage wasn't just the lost quota. It was the scramble to find alternatives, the missed deadlines, the context switching cost. A credit doesn't cover that.
The Real Problem: Usage-Based AI Billing Is a Black Box
This is where the Claude Code situation stops being about one company and starts being about an industry-wide structural flaw.
Usage-based billing for AI coding tools is, right now, a pure trust exercise. Developers can't see the meter ticking. They can't audit token counts per request. They can't verify caching status, confirm which pricing tier was applied, or check whether peak-hour multipliers are active.
Compare this to other infrastructure developers pay for:
AWS EC2:      Per-second billing, full instance visibility,
              CloudWatch metrics, billing alerts, cost explorer

Stripe:       Per-transaction billing, every charge logged
              and auditable, real-time dashboards

Vercel:       Per-invocation billing, function-level metrics,
              spending limits, automatic alerts

Claude Code:  Per-token billing, no per-request breakdown,
              no cache-hit visibility, no spending alerts,
              no real-time cost tracking
The information asymmetry is staggering. Every other developer tool in this price range gives users granular visibility into what they're paying for. AI coding assistants give users a quota bar and a prayer.
That asymmetry benefits the provider in normal times and devastates the user when things break. There's no third-party auditing for AI billing. No open standard for token usage reporting. No equivalent of a cloud cost explorer for prompt economics.
This isn't a billing model. It's a trust fall with someone else's wallet.
What Developers Should Be Asking Every AI Tool Provider
This incident surfaced questions that apply to Claude Code, GitHub Copilot, Cursor, Windsurf, and every other AI coding tool:
- What happens when billing breaks? Is there a hard spending cap? An automatic shutoff? Or does the meter just keep running silently until your quota is gone?
- Can a developer see per-request token costs? Not aggregate. Per request. With cache hit/miss status.
- What's the SLA for billing accuracy? If overcharging occurs, what's the detection mechanism, the notification timeline, and the remediation policy?
- Is there a flat-rate option? Predictability has real value, even if it's less cost-efficient on average.
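The first of those questions can be partially self-answered today. Until providers ship hard caps, nothing stops a team from wrapping their API client in a budget guard that refuses to send requests past a ceiling. A minimal sketch (the wrapper, the budget figure, and the cost estimates are all hypothetical; real per-request estimates would come from usage metadata like the meter above):

```python
class BudgetExceeded(Exception):
    """Raised instead of silently letting the meter keep running."""

class SpendingGuard:
    """Refuse to issue requests once a hard daily budget is reached."""

    def __init__(self, daily_budget_usd: float):
        self.budget = daily_budget_usd
        self.spent = 0.0

    def charge(self, estimated_cost_usd: float) -> None:
        """Call before each request with an estimated cost; raises on breach."""
        if self.spent + estimated_cost_usd > self.budget:
            raise BudgetExceeded(
                f"would spend ${self.spent + estimated_cost_usd:.2f} "
                f"of ${self.budget:.2f} daily budget"
            )
        self.spent += estimated_cost_usd

guard = SpendingGuard(daily_budget_usd=5.00)
guard.charge(0.80)       # a normal cached session: fine
try:
    guard.charge(6.75)   # an uncached session at the bugged rate: refused
except BudgetExceeded as e:
    print("blocked:", e)
```

It's crude, and it depends on cost estimates the provider should be supplying anyway. But "my tool stopped and told me why" is a strictly better failure mode than "my month's quota evaporated over a weekend."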
GitHub Copilot's flat-rate pricing suddenly looks a lot more attractive to risk-averse teams. It's a worse deal on paper for heavy users, but "worse deal with predictable costs" beats "better deal with the possibility of waking up broke." Local model setups, once dismissed as impractical, are getting a serious second look from developers who'd rather trade raw capability for billing certainty.
Where This Goes From Here
The Claude Code incident didn't create the underlying problems with usage-based AI billing, but it made them impossible to ignore.
Anthropic has a real opportunity here. They can build the transparency infrastructure the entire industry needs: per-request cost breakdowns, cache status indicators, real-time spending dashboards, hard spending caps, and public SLAs for billing accuracy. If they do that, this incident becomes a growth moment. A scar that made the product stronger.
Or they can patch the bugs, issue credits, and move on. In which case, the next caching failure produces the same result, and developers will have already moved to alternatives.
The underlying AI is still excellent. The product vision is sound. But a great model behind an opaque billing system is a liability, not just an asset. And trust, once cracked, takes more than a postmortem to rebuild.
The question worth sitting with: how much do you actually know about what you're paying for? Not the marketing page. Not the tier comparison chart. The actual, per-request, real-time cost of using the tool.
If the answer is "not much," the Claude Code incident isn't someone else's problem. It's a preview.
What's your experience been? Has this situation changed how you evaluate AI coding tool pricing? Switched tools? Set spending limits? Started looking at local alternatives? The comments are open.