- Book: LLM Observability Pocket Guide
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
March 4, 2026: per Anthropic's April 23 postmortem, the default reasoning effort for Claude Code was reduced from high to medium. The change was published in the changelog. It did not show up in any latency dashboard or 5xx counter. No SLO moved. The HTTP 200s kept flowing and so did the tokens. The model just got noticeably worse at hard problems. It took the developer community roughly seven weeks to triangulate what had changed. Anthropic's postmortem and a community write-up on earezki.com landed the same week.
Per the postmortem, three changes were eventually identified: the reasoning-effort default change on March 4, a thinking-history clearing bug shipped March 26, and a verbosity-reduction system prompt added April 16. Anthropic reverted all three by April 20 and published the postmortem on April 23. Fortune and The Register covered the user backlash.
Set aside whether this was a fair tradeoff (Anthropic reverted it). The interesting question is upstream: why didn't your telemetry catch a 50-day regression? Because LLM regressions don't trip the alarms most LLM apps are wired with.
What latency and error rate cannot tell you
Your standard SRE dashboard watches three things: requests per second, p99 latency, error rate. A reasoning-effort downgrade moves none of them in a direction that triggers an alert. Latency, if anything, drops — that was the entire point of the change. Error rate stays flat because the model still returns valid JSON. The output is just worse.
The regression lives in a place the dashboard isn't looking:
- Output quality on tasks that benefit from longer reasoning chains. Code that compiles but doesn't handle edge cases. Plans that skip steps. Refactors that drop tests.
- Output length distribution. Medium-effort responses are systematically shorter than high-effort responses on the same prompts. The mean shifts. The distribution gets a fatter left tail.
- Tool-call patterns. Lower reasoning effort means fewer search-then-act loops, more one-shot guesses. The shape of tool sequences changes even when the tools used don't.
None of those surface as an incident in Datadog. They surface as a slow trickle of user complaints. "Claude feels off" gets filed as an anecdote for weeks before the anecdotes pattern-match into a regression report. Per the Anthropic engineering postmortem, the issue was first surfaced via user reports rather than internal monitoring.
The three signals that would have caught it
Three lightweight checks, all of which could have shaved weeks off detection:
- A frozen golden set, run on a schedule. Same 50–200 prompts, every hour, judged by an LLM-as-judge against a stable rubric. Score this week against last week. Alert on drift.
- Output-length distribution monitoring. Track mean, median, and 90th percentile of completion-token counts per route. A defaults change drops the mean overnight. The shape moves before quality does.
- Per-prompt eval drift. For your top-N production prompts (highest volume or highest stakes), maintain a baseline eval score. Re-score nightly on a sampled traffic snapshot. Compare against a 7-day rolling baseline.
The first one is the cheapest to ship and would likely have surfaced the March 4 change within the first day's run, since a 5% aggregate score drop on a 50-case eval is a one-run signal rather than a trend. Here's the implementation.
A golden-set runner that alerts on drift
The script is small on purpose. You can have one running by tomorrow. The framework doesn't matter; the cron does.
import json
import statistics
import time
from dataclasses import dataclass
from datetime import datetime, timedelta
from pathlib import Path
from typing import Callable
import httpx
The golden set is JSONL on disk, one line per case. Each case has a prompt, a rubric describing the traits the judge should score, and a reference answer the judge can compare against.
@dataclass
class GoldenCase:
id: str
prompt: str
rubric: str
reference: str
def load_golden(path: Path) -> list[GoldenCase]:
cases = []
for line in path.read_text().splitlines():
d = json.loads(line)
cases.append(GoldenCase(**d))
return cases
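For concreteness, one line of that file might look like this (the case content is illustrative, not from a real set):

{"id": "refactor-007", "prompt": "Refactor this parser to handle empty input without changing its public signature: <code>", "rubric": "Keeps the signature, adds an explicit empty-input path, does not change behavior for non-empty input.", "reference": "Adds an early return for empty input and leaves the rest of the function untouched."}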
The runner. Calls the system under test, calls a judge model, returns a score per case. Two LLM calls per case — one for the candidate answer, one for the rubric grade. Cheap if your set is 50 cases.
JUDGE_PROMPT = """You are scoring a candidate answer.
Rubric: {rubric}
Reference answer: {reference}
Candidate answer: {candidate}
Score the candidate from 0 to 10 against the rubric.
Return JSON: {{"score": <int>, "reason": "<one sentence>"}}.
"""
def run_case(
case: GoldenCase,
answer_fn: Callable[[str], str],
judge_fn: Callable[[str], dict],
) -> dict:
candidate = answer_fn(case.prompt)
grade_prompt = JUDGE_PROMPT.format(
rubric=case.rubric,
reference=case.reference,
candidate=candidate,
)
grade = judge_fn(grade_prompt)
return {
"id": case.id,
"ts": time.time(),
"score": grade["score"],
"reason": grade["reason"],
"len": len(candidate),
}
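run_case is provider-agnostic on purpose: answer_fn and judge_fn are plain callables. A minimal sketch of both, assuming httpx against the Anthropic Messages API with a Haiku-class judge. The model IDs and key handling are placeholders; swap in whatever client your stack already uses.

ANTHROPIC_URL = "https://api.anthropic.com/v1/messages"
HEADERS = {
    "x-api-key": "YOUR_KEY",  # placeholder; load from your secrets manager
    "anthropic-version": "2023-06-01",
    "content-type": "application/json",
}

def _complete(model: str, prompt: str) -> str:
    # One blocking completion per call; fine for a 50-case hourly probe.
    resp = httpx.post(
        ANTHROPIC_URL,
        headers=HEADERS,
        json={
            "model": model,
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60.0,
    )
    resp.raise_for_status()
    return resp.json()["content"][0]["text"]

def answer_fn(prompt: str) -> str:
    # System under test: the same model and settings your product uses.
    return _complete("claude-sonnet-4-5", prompt)

def judge_fn(grade_prompt: str) -> dict:
    # The judge must return {"score": ..., "reason": ...}; parse strictly so a
    # malformed grade fails loudly instead of silently skewing the baseline.
    return json.loads(_complete("claude-haiku-4-5", grade_prompt))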
Persist results. Plain JSONL append per hour. You don't need a database for this.
def append_results(out: Path, results: list[dict]):
with out.open("a") as f:
for r in results:
f.write(json.dumps(r) + "\n")
The drift check. Read the last 7 days of results, compute a baseline mean per case ID, compare today's hour against it. Alert when more than 5% of cases drift more than 1.0 score points, or when the aggregate mean drops more than 5%.
def baseline_means(history: list[dict]) -> dict[str, float]:
by_id: dict[str, list[float]] = {}
for r in history:
by_id.setdefault(r["id"], []).append(r["score"])
return {k: statistics.mean(v) for k, v in by_id.items()}
def detect_drift(
today: list[dict],
history: list[dict],
per_case_threshold: float = 1.0,
aggregate_pct: float = 0.05,
) -> dict:
base = baseline_means(history)
drifted = []
for r in today:
b = base.get(r["id"])
if b is None:
continue
if b - r["score"] >= per_case_threshold:
drifted.append(
{"id": r["id"], "base": b, "now": r["score"]}
)
today_mean = statistics.mean(r["score"] for r in today)
base_mean = statistics.mean(base.values())
pct_drop = (base_mean - today_mean) / base_mean
return {
"drifted_cases": drifted,
"pct_drop": pct_drop,
"alert": (
len(drifted) / max(len(today), 1) > 0.05
or pct_drop > aggregate_pct
),
}
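A quick sanity check of those thresholds with toy data: a single case slipping a point and a half from an 8.0 baseline trips the per-case rule and, in this one-case example, the aggregate rule as well.

# Toy data: ten historical runs scoring 8, one new run scoring 6.5.
history = [{"id": "refactor-007", "ts": 0.0, "score": 8, "reason": "", "len": 900}] * 10
today = [{"id": "refactor-007", "ts": 1.0, "score": 6.5, "reason": "", "len": 640}]
print(detect_drift(today, history))
# -> one drifted case, pct_drop just under 0.19, alert True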
Wire it together. A cron at the top of every hour. Run the golden set, append, check drift, page if the alert flag is true.
def hourly(
golden_path: Path,
out_path: Path,
answer_fn: Callable[[str], str],
judge_fn: Callable[[str], dict],
notify: Callable[[dict], None],
):
cases = load_golden(golden_path)
today = [
run_case(c, answer_fn, judge_fn) for c in cases
]
    append_results(out_path, today)
    # Parse each line once; the baseline window is the last 7 days on disk.
    cutoff = datetime.utcnow() - timedelta(days=7)
    history = []
    for line in out_path.read_text().splitlines():
        r = json.loads(line)
        if datetime.utcfromtimestamp(r["ts"]) >= cutoff:
            history.append(r)
result = detect_drift(today, history)
if result["alert"]:
notify(result)
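notify is whatever already pages you. A sketch that posts the drift summary to a Slack incoming webhook (the webhook URL is a placeholder, and a PagerDuty event or plain email works the same way), plus the cron entry that makes the whole thing hourly:

SLACK_WEBHOOK = "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"  # placeholder

def notify(result: dict) -> None:
    # Keep the page short: the aggregate drop plus the five worst cases.
    worst = sorted(
        result["drifted_cases"],
        key=lambda d: d["base"] - d["now"],
        reverse=True,
    )[:5]
    text = (
        f"Golden-set drift: mean score down {result['pct_drop']:.1%}. "
        f"Worst cases: {worst}"
    )
    httpx.post(SLACK_WEBHOOK, json={"text": text}, timeout=10.0)

# crontab entry (path is a placeholder):
# 0 * * * * /usr/bin/python3 /opt/probes/golden_runner.py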
At $0.01 per case (a Haiku-tier judge with roughly 2K tokens per case round-trip), fifty cases is $0.50 per run and $12/day if you run hourly. The cost of catching a silent model regression on day one instead of day fifty is a much larger number.
Adding the output-length sentinel
The golden-set check is signal-rich but slow to trigger. Output-length distribution is the early-warning channel — it moves before the quality eval does, because length shifts the moment you change reasoning effort.
You don't need an eval for this. Just instrument your existing tracing.
import statistics
from collections import deque
class LengthMonitor:
def __init__(self, window: int = 1000):
self.window: dict[str, deque[int]] = {}
self.size = window
def record(self, route: str, tokens: int):
q = self.window.setdefault(
route, deque(maxlen=self.size)
)
q.append(tokens)
def median(self, route: str) -> float:
return statistics.median(self.window[route])
def shift_pct(self, route: str, baseline: float) -> float:
return (baseline - self.median(route)) / baseline
Pair this with a baseline you snapshot weekly. A 10% drop in median completion tokens on a stable route is the kind of thing that should generate a JIRA ticket the same day. A reasoning-effort downgrade shifts the median by enough that a 1000-sample window would likely surface the change within the first day of normal traffic.
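A minimal sketch of that pairing, reusing the json and Path imports from the runner above: a weekly cron freezes the current medians to a file, and the hot path checks the live median against it. The file name and the 10% threshold are placeholders to tune against your own traffic.

BASELINE_PATH = Path("length_baseline.json")  # {"route": median_completion_tokens}
monitor = LengthMonitor(window=1000)

def snapshot_baseline(routes: list[str]) -> None:
    # Run weekly (e.g. a Sunday cron) to freeze current medians as the baseline.
    BASELINE_PATH.write_text(
        json.dumps({r: monitor.median(r) for r in routes})
    )

def length_alert(route: str, threshold: float = 0.10) -> bool:
    # True when the live median has fallen 10% or more below the weekly snapshot.
    baseline = json.loads(BASELINE_PATH.read_text())
    return monitor.shift_pct(route, baseline[route]) >= threshold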
Why these are the checks the postmortem implies
Anthropic's April 23 postmortem describes three issues that ran for weeks before being identified. Two of them (the reasoning-effort default and the brevity prompt) directly affect output length and reasoning depth. The third (the thinking-history clearing bug) affects multi-turn coherence, which a per-prompt eval on a single-turn golden set won't catch but a multi-turn golden set will.
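That third issue is the argument for keeping a handful of multi-turn cases in the set. A hedged sketch of what that could look like (none of this is from the postmortem): the prompt becomes an ordered list of user turns, the runner replays them through one conversation, and the judge grades only the final reply.

@dataclass
class MultiTurnCase:
    id: str
    turns: list[str]  # user messages, replayed in order in a single conversation
    rubric: str       # judged against the final assistant reply only
    reference: str

def run_multi_turn_case(
    case: MultiTurnCase,
    converse_fn: Callable[[list[str]], str],  # runs the turns, returns last reply
    judge_fn: Callable[[str], dict],
) -> dict:
    candidate = converse_fn(case.turns)
    grade = judge_fn(JUDGE_PROMPT.format(
        rubric=case.rubric,
        reference=case.reference,
        candidate=candidate,
    ))
    return {
        "id": case.id,
        "ts": time.time(),
        "score": grade["score"],
        "reason": grade["reason"],
        "len": len(candidate),
    }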
Anthropic has Anthropic-scale telemetry, and the detection gap was still multiple weeks per the postmortem. The standard observability stack (logs, metrics, traces) is not enough on its own when the system under test is a stochastic model whose behavior depends on configuration that lives outside your service.
You need a probe layer. The golden set is a probe. The length monitor is a probe. The per-prompt eval baseline is a probe. They run on a schedule, they emit numbers, and the numbers go on a graph next to your latency and error rate.
What to put on the graph
A useful pre-production dashboard, with panels in roughly the order they should be added:
- Median completion tokens per route, with a 7-day baseline.
- Golden-set pass rate, hourly.
- LLM-judge score mean, daily, broken down by case category.
- Tool-call sequence-length distribution per agent route (sketched below).
- Per-prompt eval score on the top 20 production prompts, sampled hourly.
None of these are exotic. All five together cost less than a single mid-tier engineer's afternoon to build. The reason most teams don't have them is that LLM observability is treated as a follow-up project, scheduled after the demo ships and before the second incident. The Claude Code timeline is the cheap reminder to flip that order.
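The one item on that list without a snippet above is the tool-call sequence distribution. A minimal sketch, reusing the same rolling-window idea as LengthMonitor; the 20% threshold is a placeholder to tune against your own traffic.

class ToolSequenceMonitor:
    """Tracks how many tool calls an agent route makes per request.

    A reasoning downgrade tends to show up here as shorter sequences:
    fewer search-then-act loops, more one-shot guesses.
    """

    def __init__(self, window: int = 1000):
        self.windows: dict[str, deque[int]] = {}
        self.size = window

    def record(self, route: str, tool_calls: int) -> None:
        q = self.windows.setdefault(route, deque(maxlen=self.size))
        q.append(tool_calls)

    def median(self, route: str) -> float:
        return statistics.median(self.windows[route])

    def drifted(self, route: str, baseline: float, threshold: float = 0.20) -> bool:
        # True when the median sequence length falls 20%+ below the weekly baseline.
        return (baseline - self.median(route)) / baseline >= threshold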
The real lesson sits in the tooling layer
Silent regressions are a property of building on top of a model whose behavior is configured by someone other than you, and whose configuration changes do not cause errors. They will keep happening. The question is whether your stack tells you about them in hours or in weeks.
The Anthropic episode turned out fine because the issues were eventually caught and reverted. The next one might not originate with your provider at all: it might be your own prompt template, your own retrieval pipeline, your own fine-tune. The probe layer doesn't care which.
If your current observability story for an LLM feature is "logs and Datadog dashboards," the calendar on the wall says you have until the next silent model update to fix that.
If this was useful
The probe-layer patterns above (golden sets, length sentinels, per-prompt drift) are the core of LLM Observability Pocket Guide. It walks through the eval rig, the tracing instrumentation, and the alerting thresholds that turn "Claude feels off" into a graph with a line that crosses on the day the upstream change shipped.
