Gabriel Anhaia

Claude Code's Prompt Cache TTL Dropped From 1h to 5m


The first sign was a quota bar. Engineers on Claude Code's 5-hour subscription tier started hitting their limits a week ahead of schedule. A handful of them ran their billing logs through a Python script and confirmed what their gut had told them: every coffee break was now invalidating their cache and re-paying the write rate. The change landed on April 2, 2026: the default prompt-cache TTL on Claude Code went from 1 hour to 5 minutes, with no release-notes line item and no banner.

The behavior is documented in the public postmortem thread on GitHub issue #46829, in community write-ups on dev.to, and in The Register's coverage of Anthropic's response. Anthropic's stated position, as quoted in The Register, is that the 5-minute TTL is, on average, cheaper for their workload mix because many Claude Code sessions are one-shot. That is plausibly true in aggregate. It is also entirely irrelevant to the engineer whose long-running refactor session now writes a large system prompt many times an hour (the 80k-token figure I use later is illustrative, not measured).

The bigger story is not which TTL is correct. The bigger story is that vendor-side knobs on hosted LLM infrastructure now move silently, and the failure mode they produce (higher cost, identical outputs) is invisible to every monitor most teams have wired up.

The shape of the failure

Latency monitors caught nothing. The p50 and p99 of messages.create did not move. Error rates did not move. Output token counts did not move. The only signals that moved were:

  • cache_creation_input_tokens per session climbed.
  • cache_read_input_tokens per session fell.
  • Total billed cost per equivalent session rose 15–53% across the 95-day log scan one engineer published, per their write-up on dev.to.

That is the entire shape of the regression. If your observability stack tracks request count, latency, and 5xx rate: congratulations, the change is invisible to you. If it tracks tokens but only at the rollup level, the cost ratio inside the rollup shifted but the totals look like organic growth.

Three classes of failure live in this gap:

  1. Silent vendor changes — TTLs, default sampling, model-version aliases, rate-limit accounting.
  2. Drift in semantics — same model name, slightly different behavior after a routing change.
  3. Billing-vs-usage skew — the unit cost of a logically identical request changes without the request changing.

Traditional APM was built for the first wave of cloud failures: latency, errors, saturation. LLM ops needs a different scoreboard.

What to instrument

Three signals catch this class of incident before the invoice does.

Prompt-cache hit ratio over time. Every hosted LLM that supports caching exposes per-call counters for cache reads and cache writes. Emit them. Roll them up by session, by user, by deployment. The number you watch is cache_read_input_tokens / (cache_read_input_tokens + cache_creation_input_tokens). For a steady workload this is flat. When it drops, something changed. Could be your prompt. Could be theirs.

Billing-vs-usage drift. Track two ratios: tokens-per-session and dollars-per-session. They should move together. When dollars-per-session rises faster than tokens-per-session, the unit economics of a token changed. That is almost always a vendor-side action: cache change, pricing tier shift, or a routing change to a more expensive model.
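
A minimal sketch of that divergence check, assuming you already roll up per-session totals; the (tokens, dollars) tuples and the 10% threshold are placeholders to calibrate against your own traffic, not measured values.

from statistics import mean

def billing_usage_drift(baseline_sessions, recent_sessions, threshold=0.10):
    # Each argument: a list of (total_tokens, total_dollars) tuples, one per session.
    base_tokens = mean(t for t, _ in baseline_sessions)
    base_dollars = mean(d for _, d in baseline_sessions)
    recent_tokens = mean(t for t, _ in recent_sessions)
    recent_dollars = mean(d for _, d in recent_sessions)

    token_growth = recent_tokens / base_tokens
    dollar_growth = recent_dollars / base_dollars

    # Dollars outpacing tokens means the unit economics of a token changed,
    # not your workload.
    return (dollar_growth / token_growth) - 1.0 > threshold

Run it on the trailing week against the week before; a sustained True is the signal that pricing, caching, or routing moved under you.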

Golden-set regression. A small fixed set of prompts you replay daily against the production model and diff. Catches semantic drift the cache metrics miss. Cheap to run. Embarrassingly underused.
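
One way to run it, sketched below under a few assumptions: a golden_prompts.json file of {id, prompt, expected} cases, an exact-string diff, and the same claude-sonnet-4-5 handle used later in this post. Swap the diff for an embedding distance or an LLM judge if your outputs are not deterministic.

import json
from anthropic import Anthropic

client = Anthropic()

def run_golden_set(path="golden_prompts.json", model="claude-sonnet-4-5"):
    # Replay a fixed prompt set against the production model, report what changed.
    with open(path) as f:
        cases = json.load(f)

    drifted = []
    for case in cases:
        resp = client.messages.create(
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": case["prompt"]}],
        )
        output = resp.content[0].text
        if output.strip() != case["expected"].strip():
            drifted.append(case["id"])
    return drifted  # a non-empty list is semantic drift worth a look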

The point is not to monitor everything. The point is that "200 OK with the right shape" is no longer a sufficient health check when the thing on the other end of the wire is a multi-tenant model with a control plane you cannot see.

A Python emitter for prompt-cache hit ratio

Below is a thin wrapper that publishes the cache-hit ratio as an OpenTelemetry attribute on every Claude API call. It assumes you are already using the anthropic SDK and have an OTel tracer set up. The official SDK exposes usage.cache_creation_input_tokens and usage.cache_read_input_tokens per call, per Anthropic's prompt-caching docs.

from anthropic import Anthropic
from opentelemetry import trace

client = Anthropic()
tracer = trace.get_tracer("llm.claude")


def call_claude(messages, system, model="claude-sonnet-4-5"):
    with tracer.start_as_current_span("claude.messages") as span:
        resp = client.messages.create(
            model=model,
            max_tokens=2048,
            system=system,
            messages=messages,
        )

        u = resp.usage
        cache_read = getattr(u, "cache_read_input_tokens", 0) or 0
        cache_write = (
            getattr(u, "cache_creation_input_tokens", 0) or 0
        )
        cached_total = cache_read + cache_write
        hit_ratio = (
            cache_read / cached_total if cached_total else 0.0
        )

        span.set_attribute("llm.model", model)
        span.set_attribute("llm.input_tokens", u.input_tokens)
        span.set_attribute("llm.output_tokens", u.output_tokens)
        span.set_attribute("llm.cache.read_tokens", cache_read)
        span.set_attribute("llm.cache.write_tokens", cache_write)
        span.set_attribute("llm.cache.hit_ratio", hit_ratio)

        return resp

That is the emitter. Now the alert. Most teams export llm.cache.hit_ratio as a histogram into their existing metrics backend and compute the 7-day rolling average per deployment. The alert is a simple step-down detector: page when the trailing 1-hour mean drops more than 5 percentage points below the trailing 7-day mean and stays there for two consecutive 5-minute evaluation windows.

def cache_regression_alert(samples_1h, samples_7d):
    # Inputs are per-minute hit-ratio samples: 60 for the trailing hour,
    # 7 * 24 * 60 for the trailing week.
    if len(samples_1h) < 60 or len(samples_7d) < 7 * 24 * 60:
        return False  # not enough data yet
    recent = sum(samples_1h) / len(samples_1h)
    baseline = sum(samples_7d) / len(samples_7d)
    # Step-down check only; the "two consecutive windows" persistence is the
    # alert rule's job, not this function's.
    return (baseline - recent) > 0.05

Two thresholds matter here: the 5-percentage-point gap and the 1-hour evaluation window. Drop the gap to 2 points and you will page on every cold-start cohort. Stretch the window to 6 hours and the April 2 incident would have taken six hours to fire. The values above are tuned for a workload where cache hit ratio sits around 0.85 in steady state. Calibrate to yours.

Why this category is growing

Hosted LLMs are a control plane you do not own. The vendor ships:

  • routing rules between model variants,
  • cache TTLs and admission policies,
  • safety filters that may rewrite or refuse,
  • batching policies that affect latency tails,
  • pricing-tier defaults.

Each of these can change without a model-version bump. The model name in your config (claude-sonnet-4-5, gpt-5o, whatever) is a stable handle on an unstable backend. The xda-developers write-up of the Claude Code situation framed it bluntly: the cache change wasn't announced because, from the vendor's perspective, the contract is the API surface, not the cost profile.

You probably accept that for a CDN. You probably accept it for a database vendor adjusting a query planner. The new wrinkle is that LLM cost variance is dramatic: a cache-hit token is roughly 10% the cost of a base token, and cache writes are priced at 25% extra (5-minute) or 100% extra (1-hour) per Anthropic's docs. A control-plane change that flips the cache ratio is a control-plane change that doubles or halves your bill.
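
To make the exposure concrete, here is the back-of-the-envelope model I would run against a long session, using the multipliers above (reads at 0.1x the base input rate, writes at 1.25x for the 5-minute TTL and 2x for the 1-hour TTL). The 80k-token prefix, the $3/MTok base rate, and the call cadence are illustrative placeholders, not numbers pulled from anyone's logs.

def session_input_cost(gaps_min, prefix_tokens, base_per_mtok, ttl_min, write_mult):
    # Input-token cost for one session whose calls share a cached prefix.
    # gaps_min: minutes between consecutive calls. A gap longer than the TTL
    # means the next call pays the write rate on the prefix again; otherwise
    # it pays the 0.1x read rate. The TTL refreshes on every hit.
    rate = base_per_mtok / 1_000_000
    cost = prefix_tokens * write_mult * rate  # first call always writes
    for gap in gaps_min:
        if gap > ttl_min:
            cost += prefix_tokens * write_mult * rate  # expired: re-write
        else:
            cost += prefix_tokens * 0.10 * rate        # hit: cheap read
    return cost


# A 3-hour refactor session, one call every 12 minutes, 80k-token prefix,
# $3/MTok base input rate.
gaps = [12] * 15
print(session_input_cost(gaps, 80_000, 3.0, ttl_min=5, write_mult=1.25))   # ~4.80
print(session_input_cost(gaps, 80_000, 3.0, ttl_min=60, write_mult=2.0))   # ~0.84

On this toy session the 5-minute TTL comes out nearly six times more expensive; the same gap computed from your real traffic is the exposure the checklist below asks you to measure.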

A short checklist

If you have a Claude Code or Claude API workload of any size, do these this week:

  • Emit cache read/write tokens on every call. Aggregate to hit ratio.
  • Track tokens-per-session and dollars-per-session. Alert on divergence.
  • Add the 1h/5m TTL hint explicitly on your cacheable system prompts; the default may move again. See the sketch after this list.
  • Run a 20-prompt golden set daily against your production deployment. Diff outputs.
  • Pull last 30 days of usage logs and rerun the math under both TTL assumptions. The gap is your exposure.
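
For the TTL hint in the third item, here is roughly what pinning it looks like. I am going off Anthropic's prompt-caching docs, which describe an optional ttl field ("5m" or "1h") on the cache_control block; the 1-hour tier has required a beta header at times, so verify the exact shape against the current docs before shipping this.

from anthropic import Anthropic

client = Anthropic()

LONG_SYSTEM_PROMPT = "..."  # your large, stable system prompt

resp = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Pin the TTL instead of inheriting whatever the default is today.
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    messages=[{"role": "user", "content": "Refactor the session manager."}],
)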

The April 2 incident is going to happen again, with a different vendor, a different knob, a different magnitude. Catching it costs one OTel span attribute and one alert rule. Missing it costs a month of writes while you argue about whether your usage really grew that much.

If this was useful

The instrumentation patterns above sit in chapters 3 and 7 of the LLM Observability Pocket Guide: span attributes for cache, golden-set diffs, billing-vs-usage drift detection. It is the short reference for picking a tracing and evals stack that catches silent regressions before the invoice does.

