Jamie Cole
Claude 3.5 Sonnet Changed. My System Prompt Stopped Working. Here's What I Learned.

I've been building with Claude APIs since early 2025. Last month, I noticed something strange: my carefully tuned system prompt stopped producing the outputs I expected. The format was slightly off. The tone shifted. My downstream parsing started failing intermittently.

Anthropic hadn't announced any changes to claude-3-5-sonnet-20241022. My code was identical. But the model was behaving differently.

This article is what I learned from that experience — and what I now do to catch this kind of drift automatically.

What happened

I was using Claude for a structured data extraction task. My system prompt instructed the model to:

  • Return JSON only, no preamble
  • Use specific field names (entity, confidence, source)
  • Never include explanation text

For months it worked perfectly. Then, over the course of about 3 days, outputs started including preamble text: "Here is the extracted data:" followed by the JSON. My json.loads() calls started throwing JSONDecodeError because of the leading text.

It wasn't every response. It was maybe 15–20% of calls. Enough to cause intermittent failures but not a hard crash. The worst kind of bug.
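To make the failure concrete, here is a minimal reproduction. The preamble string and field values are illustrative, but the failure mode is exactly what the standard library does with leading non-JSON text:

```python
import json

# The output the prompt was designed to produce
clean = '{"entity": "Acme Corp", "confidence": 0.92, "source": "filing"}'

# The output after the drift: same JSON, but with a conversational preamble
drifted = 'Here is the extracted data:\n' + clean

json.loads(clean)  # parses fine

try:
    json.loads(drifted)
except json.JSONDecodeError:
    # The leading text makes the entire response unparseable
    print("JSONDecodeError: preamble broke the parser")
```

A defensive workaround is to slice from the first "{" before parsing, but as the rest of this article argues, patching the symptom hides the fact that the model's behavior changed.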

What I first tried (that didn't work)

Checking Anthropic's changelog: Anthropic updates model behavior without always publishing granular release notes. The model version identifier (claude-3-5-sonnet-20241022) hadn't changed, but behavior had.

Checking my own code: Nothing changed. I'd been running the exact same system prompt and API call structure for months.

Adding retry logic: This masked the symptom without fixing the cause. Worse — it made the failure invisible. Errors went down, but bad data was now flowing into my database.

Strengthening the system prompt: Adding more explicit instructions helped but didn't eliminate the drift. The model was behaving differently at the inference level and the system prompt couldn't fully compensate.

The core problem: you have no baseline

Here's the thing that makes this failure mode particularly dangerous: without a behavioral baseline, you don't know if your model has drifted.

You can check if your code is working (tests pass). You can check if the API is responding (health checks pass). But you can't check if the model is still producing the same kind of outputs it was a month ago — unless you have a baseline to compare against.

Most observability tools (LangSmith, Langfuse, Helicone) log what your model does. They don't tell you when the behavior has shifted from before. That requires:

  1. A reference run at time T₀
  2. A regular re-run of the same prompts at T₁, T₂, T₃...
  3. A scoring mechanism that compares T₁ to T₀ semantically (not just string equality)
  4. An alert when the score crosses a threshold
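The four steps above can be sketched in a few lines. Here difflib.SequenceMatcher is a lexical stand-in for a real semantic scorer (in practice you would use embedding cosine similarity), and call_model is a hypothetical wrapper around your LLM client:

```python
import difflib

def drift_score(baseline: str, current: str) -> float:
    """Return 0.0 for identical outputs, approaching 1.0 as they diverge.

    SequenceMatcher is a lexical approximation; swap in embedding
    cosine distance for true semantic comparison (step 3).
    """
    return 1.0 - difflib.SequenceMatcher(None, baseline, current).ratio()

def check_for_drift(prompts, call_model, baselines, threshold=0.3):
    """Re-run each prompt (step 2) and flag any output that drifted
    past the threshold relative to its T0 baseline (steps 3 and 4)."""
    alerts = []
    for prompt in prompts:
        current = call_model(prompt)
        score = drift_score(baselines[prompt], current)
        if score > threshold:
            alerts.append((prompt, score))
    return alerts
```

Run check_for_drift on a schedule (cron, a worker queue, whatever you already have) and route the returned alerts to Slack or email.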

None of the existing observability tools do this automatically. It's a gap.

What I built to solve this

I built DriftWatch — a service that does exactly this.

You paste your critical prompts. It runs them once to create a behavioral baseline. Every hour, it re-runs them and computes a drift score across three dimensions:

  • Semantic similarity (is the meaning of the output still the same?)
  • Format compliance (is the structure — JSON, markdown, etc. — preserved?)
  • Instruction-following delta (did specific instructions like "no preamble" stop being honored?)

When drift exceeds a threshold (configurable, default 0.3), you get a Slack or email alert.
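One way to combine the three dimensions into a single alertable number is a weighted average. The weights below are illustrative, not DriftWatch's actual formula:

```python
def composite_drift(semantic: float, format_: float, instruction: float,
                    weights=(0.4, 0.35, 0.25)) -> float:
    """Combine three per-dimension drift scores (each in [0, 1]) into one.

    The weights are illustrative; tune them to how much each dimension
    matters for your pipeline (e.g. weight format heavily if you parse JSON).
    """
    w_sem, w_fmt, w_inst = weights
    return w_sem * semantic + w_fmt * format_ + w_inst * instruction

def should_alert(score: float, threshold: float = 0.3) -> bool:
    """Fire an alert when the composite score crosses the threshold."""
    return score > threshold
```

Note that a severe failure in one dimension (say, format compliance at 0.8) can push the composite over the threshold even when the others are quiet, which is exactly the behavior you want for a JSON-parsing pipeline.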

For the Claude preamble regression I described above: the JSON extraction prompt drifted to 0.316 (format compliance failure). Alert fired within one hour of the change. I caught it before users noticed.

What drift scores look like in practice

Here are some real drift scores from the past few months of monitoring Claude and GPT-4o endpoints:

Prompt                       | Drift score | What changed
JSON extraction              | 0.316       | Added preamble text (breaks json.loads)
Binary classifier            | 0.041       | Stable, no drift
Instruction follow (no caps) | 0.575       | Started capitalizing headers
Summarization                | 0.189       | Slightly more verbose
Sentiment analysis           | 0.0         | No drift detected

The instruction-following regression (0.575) is particularly dangerous. If you're using re.findall() or splitting on specific structural markers, a capitalization change breaks your parser silently.
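Here is what that silent breakage looks like. The section headers ("summary:", "risks:") are made up for illustration; the point is that a case-sensitive regex returns an empty list rather than raising an error:

```python
import re

# Parser written against the model's original lowercase headers
expected = "summary:\nthe quarterly numbers improved.\nrisks:\nsupply chain delays."

# After the regression, the model capitalizes the headers
drifted = "Summary:\nThe quarterly numbers improved.\nRisks:\nSupply chain delays."

sections = re.findall(r"^(summary|risks):$", expected, flags=re.MULTILINE)
# Matches both headers as intended

broken = re.findall(r"^(summary|risks):$", drifted, flags=re.MULTILINE)
# Empty list: no exception, no log line, just missing data

# Defensive fix: make the parser case-insensitive
fixed = re.findall(r"^(summary|risks):$", drifted,
                   flags=re.MULTILINE | re.IGNORECASE)
```

The dangerous part is the middle case: an empty match list flows downstream as "no sections found" instead of crashing, so nothing pages you.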

How to set up drift monitoring for Claude

If you want to do this yourself:

  1. Identify your 5–10 most critical prompts (the ones where output format or behavior matters)
  2. Run each one 3 times and store the median output as your baseline
  3. Schedule a cron job to re-run them daily
  4. Compute cosine similarity between current and baseline outputs (use a sentence transformer for semantic comparison)
  5. Flag anything over 0.3 for manual review
  6. Alert on anything over 0.5 (likely a breaking regression)
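Step 4 can be sketched without any heavy dependencies. The version below computes cosine similarity over bag-of-words counts as a lexical approximation; for genuine semantic comparison, replace the Counter vectors with sentence-transformer embeddings:

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over bag-of-words token counts.

    A lexical approximation: for true semantic comparison, build the
    vectors from a sentence-transformer model instead of Counter.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0

def drift(baseline: str, current: str) -> float:
    """Drift score in [0, 1]: 0.0 means unchanged, 1.0 means no overlap."""
    return 1.0 - cosine_similarity(baseline, current)
```

With this in place, steps 5 and 6 are a pair of threshold checks: flag drift(...) > 0.3 for review, page someone on drift(...) > 0.5.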

Alternatively, DriftWatch does all of this for you. Free tier is 3 prompts, no credit card. Setup takes about 5 minutes.

The bigger pattern

This isn't just a Claude problem. OpenAI does the same thing with GPT-4o. Model providers update model behavior without updating version identifiers. It's a consequence of how LLMs are deployed — rolling updates, A/B experiments, safety fine-tunes that change behavior as a side effect.

The solution isn't to find a provider that doesn't do this (they all do). The solution is to monitor your model's behavior the same way you monitor your API's uptime: continuously, with alerts.


DriftWatch is a behavioral drift detection service for LLM applications. It monitors your prompts on a schedule and alerts you the moment output quality or format shifts — before your users notice.

Free tier available at genesisclawbot.github.io/llm-drift.


If you want to know the moment Claude changes again, set up drift monitoring with real-time alerts so you hear about it when Anthropic updates your model. Or try DriftWatch free: 3 prompts, no card.
