- Book: Observability for LLM Applications — paperback and hardcover on Amazon · Ebook from Apr 22
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
This week, developers and researchers have been reporting a sharp drop in Claude Opus 4.6 quality. Coding tasks noticeably worse. Reasoning benchmarks down. Conversational quality off. The theory gaining traction is that Anthropic is reallocating compute toward training Opus 4.7 and running the live 4.6 model leaner in the interim.
Anthropic has not confirmed it. The model ID you call is the same as last week. The status page is green. Your APM sees nothing because nothing at the transport layer has changed.
If your product has Opus 4.6 in the critical path, you have two choices. Trust the users on Reddit, or have your own instrument.
This post is about building the instrument.
Silent model drift is the default, not the exception
The mapping from the model ID you call to the artifact your request actually runs against is maintained by the vendor, not by you. That mapping can change:
- Without a version bump.
- For a subset of traffic (a load-balancer shifts a percentage through a different path).
- For a subset of request shapes (long context routes differently from short).
- For a subset of geographies or tenants.
The canonical public example is the Anthropic August–September 2025 three-bug cascade. A context-window routing error affected 0.8% of Sonnet 4 requests on August 5, rising to 16% at peak on August 31 after a load-balancer change. A TPU misconfiguration occasionally inserted Thai or Chinese characters into English responses. An XLA:TPU top-k miscompile affected Haiku 3.5. None of the three triggered a server-side error. None changed the shape of the API response. All three changed the content.
Simon Willison's take on the postmortem is the one to remember: "The evaluations Anthropic ran simply didn't capture the degradation users were reporting." A world-class eval team missed what users were catching by eye.
Google has done the same thing quietly. Developers reported that gemini-2.5-pro-preview-03-25 was silently redirected to gemini-2.5-pro-preview-05-06 in May 2025. OpenAI community forums carry long threads on perceived degradation across specific weeks. None of these show up on your APM.
What a canary actually measures
A canary eval is a fixed set of prompts you send through your production model path every N minutes, with a judge that scores the output against a known-good baseline. If the score drops, the model behind your endpoint changed.
Four properties matter:
- Fixed prompts. Same text every time. The input is a constant; the only variable is the model's response.
- Pinned judge. Use a different provider than the one you're canarying. If you're watching Anthropic, judge with OpenAI. Self-preference bias is real (arXiv:2410.21819) — a judge rates its own family higher, which masks the drift you're trying to catch.
- Rolling window. One-off dips happen. Two consecutive failing runs are signal. The window is part of the contract.
- Per model ID. If you call both claude-sonnet-4-6 and claude-opus-4-6, you canary each one separately. A drop on Opus doesn't tell you anything about Sonnet.
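The last two properties — the rolling window and per-model separation — fit in a few lines of state. A sketch, where the 0.75 threshold and the 2-run window are illustrative values, not recommendations:

```python
from collections import defaultdict, deque

WINDOW = 2        # consecutive failing runs required before alerting
THRESHOLD = 0.75  # known-good baseline minus the tolerated drop

# One independent score history per model ID.
history = defaultdict(lambda: deque(maxlen=WINDOW))

def record(model: str, score: float) -> bool:
    """Record one canary run for one model ID. Returns True only when
    the last WINDOW runs for that model are all below THRESHOLD."""
    h = history[model]
    h.append(score)
    return len(h) == WINDOW and all(s < THRESHOLD for s in h)

assert not record("claude-opus-4-6", 0.60)    # one dip: no alert yet
assert not record("claude-sonnet-4-6", 0.90)  # Sonnet tracked separately
assert record("claude-opus-4-6", 0.55)        # second low Opus run: alert
```

Keeping the deques keyed by model ID is what makes the per-model property cheap: a second model is just a second key, not a second pipeline.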
The 30 lines
A minimum-viable canary. Runs hourly via cron or a scheduled job. Writes the score to your metrics backend.
```python
# canary.py — detects silent model drift on the provider side.
import json
import os

from anthropic import Anthropic
from openai import OpenAI

CANARY_PROMPTS = [
    (
        "Refactor: replace the for-loop below with a list "
        "comprehension. Return only Python code.\n\n"
        "result = []\nfor x in items:\n    if x > 0:\n"
        "        result.append(x * 2)\n"
    ),
    (
        "What is the capital of Australia? Answer with one "
        "word and nothing else."
    ),
    # 10-20 total. Keep them narrow, verifiable, deterministic.
]

JUDGE_PROMPT = """You are a strict grader. Below is a PROMPT
and a RESPONSE. Return a JSON object with two fields:
- "verdict": 1 if the response is correct and well-formed, 0 otherwise.
- "rationale": one sentence.

PROMPT:
{prompt}

RESPONSE:
{response}

Return only the JSON."""


def run_canary(model: str) -> float:
    anthropic = Anthropic()
    judge = OpenAI()
    scores = []
    for prompt in CANARY_PROMPTS:
        resp = anthropic.messages.create(
            model=model,
            max_tokens=300,
            messages=[{"role": "user", "content": prompt}],
        )
        answer = resp.content[0].text
        grade = judge.chat.completions.create(
            model="gpt-4o-2024-11-20",  # pinned
            temperature=0,
            response_format={"type": "json_object"},
            messages=[
                {
                    "role": "user",
                    "content": JUDGE_PROMPT.format(
                        prompt=prompt, response=answer
                    ),
                }
            ],
        )
        # response_format forces valid JSON, so parse it properly
        # instead of string-splitting the verdict out.
        verdict = int(json.loads(grade.choices[0].message.content)["verdict"])
        scores.append(verdict)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    score = run_canary(os.environ["MODEL_UNDER_TEST"])
    print(f"canary_score={score:.3f}")
```
In production, the last line becomes a metric emission (`canary_score{model="claude-opus-4-6"} 0.900`). Prometheus scrapes it. Grafana alerts on a drop.
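One stdlib-only way to do that emission is Prometheus text exposition format written to a file that node_exporter's textfile collector scrapes — a sketch, with the collector path as an assumption about your deployment:

```python
import os
import tempfile

def write_canary_metric(score: float, model: str, path: str) -> None:
    """Write the canary score in Prometheus text exposition format.
    Write to a temp file first, then rename, so the scraper never
    reads a half-written file (os.replace is atomic on POSIX)."""
    body = (
        "# TYPE canary_score gauge\n"
        f'canary_score{{model="{model}"}} {score:.3f}\n'
    )
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        f.write(body)
    os.replace(tmp, path)

# e.g. point `path` at node_exporter's textfile collector directory:
# write_canary_metric(0.900, "claude-opus-4-6",
#                     "/var/lib/node_exporter/canary.prom")
```

If you already run a metrics client (prometheus_client, statsd, OTel), use that instead; the textfile route just avoids a new dependency for a script that runs once an hour.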
The alert that catches the story
```
# Alert when canary score drops more than 0.15 from the 7-day
# baseline, sustained over two consecutive runs.
(
  avg_over_time(canary_score[2h])
    <
  avg_over_time(canary_score[7d]) - 0.15
)
and
canary_score < avg_over_time(canary_score[7d]) - 0.15
```
Two conditions. The short-window average has to sit below the long-window baseline by more than the 0.15 threshold, and the most recent run has to confirm it. That pattern rejects single-run noise and surfaces real shifts.
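The same two-condition check can be mirrored offline in Python, which is a cheap way to tune the threshold against historical scores before wiring up the PromQL. A sketch, assuming hourly runs so the 2h window is two samples:

```python
def alert_fires(scores, drop=0.15, short=2, long=7 * 24):
    """scores: hourly canary scores, oldest first. Mirrors the
    PromQL: short-window average AND latest run must both be
    below the long-window baseline minus `drop`."""
    if len(scores) < long:
        return False  # not enough history for a baseline
    baseline = sum(scores[-long:]) / long      # 7-day average
    short_avg = sum(scores[-short:]) / short   # 2-hour average
    latest = scores[-1]
    return short_avg < baseline - drop and latest < baseline - drop

hist = [0.90] * (7 * 24)
assert not alert_fires(hist)                       # steady state: quiet

one_dip = hist[:-1] + [0.70]                       # single moderate dip
assert not alert_fires(one_dip)                    # noise: rejected

sustained = hist[:-2] + [0.70, 0.65]               # two low runs in a row
assert alert_fires(sustained)                      # real shift: fires
```

Note one asymmetry worth knowing: a single catastrophic run (say 0.50 against a 0.90 baseline) can still drag the two-sample average under the threshold and fire immediately, which is usually what you want for a severe break.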
In the Anthropic August 2025 case, a canary like this would have fired on August 5 for users routing to the affected Sonnet 4 path. Anthropic confirmed the issue weeks later. Users with their own canaries knew first.
What canaries cannot see (and the layer above)
A canary with 15 prompts catches provider-side drift because the input is fixed. It does not catch:
- Drift in your own prompt templates (you shipped a new system prompt).
- Drift in your retrieval corpus (the RAG index rebuilt, different docs now score higher).
- Drift in your user query distribution (new product launched, the model gets asked new questions).
For those, you need an online eval running on a sample of live production traffic, judged against a multi-axis rubric (faithfulness, relevance, safety, format). That's Chapter 10 of the book. A canary is the cheapest first move. The online eval is where the observability stack becomes load-bearing.
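The sampling half of that online eval is the easy part; the rubric and judge are where the work is. A sketch of the sampler, where the 1% rate and the field names are illustrative:

```python
import random

SAMPLE_RATE = 0.01  # judge roughly 1% of live traffic

def maybe_enqueue_for_eval(request_id: str, prompt: str, response: str,
                           queue: list, rate: float = SAMPLE_RATE) -> bool:
    """Each live request gets an independent `rate` chance of being
    queued for multi-axis judging. Returns True if sampled."""
    if random.random() < rate:
        queue.append({
            "id": request_id,
            "prompt": prompt,
            "response": response,
        })
        return True
    return False
```

Sampling at the request level, rather than judging everything, is what keeps judge cost and latency out of the serving path: the queue is drained asynchronously by whatever runs the rubric.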
If this was useful
The canary is 30 lines. The alert is four lines. The hard part is agreeing on what "good" looks like and pinning the judge. That is the Chapter 17 problem: building alerts that don't wake you up for nothing and still catch the silent regressions.
Observability for LLM Applications has a full chapter (Ch 17) on drift detection, alert design, and the multi-window burn-rate SLOs that cut LLM alert noise by 10x.
- Book: Observability for LLM Applications — paperback and hardcover now; ebook April 22.
- Hermes IDE: hermes-ide.com — the IDE for developers shipping with Claude Code and other AI tools.
- Me: xgabriel.com · github.com/gabrielanhaia.