<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Andrew Kew</title>
    <description>The latest articles on DEV Community by Andrew Kew (@thegatewayguy).</description>
    <link>https://dev.to/thegatewayguy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3895707%2Fc20fb912-7c14-4930-97fc-5894d1c3e4c5.png</url>
      <title>DEV Community: Andrew Kew</title>
      <link>https://dev.to/thegatewayguy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/thegatewayguy"/>
    <language>en</language>
    <item>
      <title>Harness bugs, not model bugs</title>
      <dc:creator>Andrew Kew</dc:creator>
      <pubDate>Fri, 24 Apr 2026 14:51:12 +0000</pubDate>
      <link>https://dev.to/thegatewayguy/harness-bugs-not-model-bugs-1f4e</link>
      <guid>https://dev.to/thegatewayguy/harness-bugs-not-model-bugs-1f4e</guid>
      <description>&lt;p&gt;For six weeks, developers have been complaining that Claude got worse. You've seen the posts — "Claude Code is flaky", the AI-shrinkflation discourse.&lt;/p&gt;

&lt;p&gt;Yesterday, Anthropic shipped a &lt;a href="https://www.anthropic.com/engineering/april-23-postmortem" rel="noopener noreferrer"&gt;postmortem&lt;/a&gt;. Three unrelated bugs, all in the Claude Code &lt;em&gt;harness&lt;/em&gt;. The model weights and the API were never touched.&lt;/p&gt;

&lt;p&gt;That distinction is the whole point.&lt;/p&gt;

&lt;h2&gt;What actually broke&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Default reasoning effort got dialled down.&lt;/strong&gt; A UX fix dropped Claude Code's default from "high" to "medium" in early March. Users noticed it felt dumber. Reverted April 7.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A caching optimisation dropped prior reasoning every turn.&lt;/strong&gt; It was supposed to clear stale thinking once per idle session, but a bug made it fire on every turn. Claude kept executing without remembering why, which surfaced as forgetfulness and repetition. Fixed April 10.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A verbosity system prompt hurt coding quality.&lt;/strong&gt; "Keep responses under 100 words." Internal evals missed a 3% regression on code. Caught by broader ablations. Reverted April 20.&lt;/li&gt;
&lt;/ol&gt;
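&lt;p&gt;Bug #2 is easy to reproduce in any harness. Here is a minimal sketch (all names hypothetical, not Anthropic's code) contrasting the intended once-per-idle-session prune with the buggy every-turn prune:&lt;/p&gt;

```python
# Toy harness context illustrating bug #2: a prune meant to run once per
# idle session, mis-wired to run on every turn.

IDLE_THRESHOLD_S = 300  # clear stale reasoning only after 5 min of inactivity


class Context:
    def __init__(self):
        self.reasoning = []      # prior chain-of-thought blocks
        self.last_turn_at = None

    def on_turn(self, now, buggy=False):
        if buggy:
            # The bug: prune unconditionally, so the model loses the
            # "why" behind its own plan between consecutive turns.
            self.reasoning.clear()
        elif self.last_turn_at is not None and now - self.last_turn_at > IDLE_THRESHOLD_S:
            # Intended behaviour: prune only after the session went idle.
            self.reasoning.clear()
        self.last_turn_at = now


ctx = Context()
ctx.on_turn(0)
ctx.reasoning.append("plan: refactor module A first")
ctx.on_turn(10)                    # 10s later, same session
print(len(ctx.reasoning))          # intended path keeps the plan: 1
ctx.on_turn(20, buggy=True)
print(len(ctx.reasoning))          # buggy path forgets it: 0
```

&lt;p&gt;The failure is silent: every individual turn still succeeds, so nothing errors; only the cross-turn behaviour degrades.&lt;/p&gt;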

&lt;p&gt;None of this was a model change. The weights didn't move. The API was never in scope.&lt;/p&gt;

&lt;h2&gt;The lesson&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The model is not the product.&lt;/strong&gt; What your users experience is &lt;code&gt;model + harness + system prompt + tool wiring + context management + caching&lt;/code&gt;. Each layer has its own bugs. When someone says "Claude got worse," the weights are usually the last thing that changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API-layer products were unaffected.&lt;/strong&gt; If you're building directly against the Messages API, none of these bugs touched you. This is why "am I on Claude Code, or am I on the raw API?" matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Eval passed" ≠ "no regression."&lt;/strong&gt; The verbosity prompt passed Anthropic's initial evals. Only a broader ablation — removing lines one at a time — caught the 3% drop. Fixed eval suites miss behavioural drift; ablations catch it.&lt;/p&gt;

&lt;h2&gt;What to actually do&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;On Claude Code?&lt;/strong&gt; Update to v2.1.116 or later; that release is past all three fixes. Usage limits were reset as an apology.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On the API directly?&lt;/strong&gt; Nothing to do. Stay on whatever model you were on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shipping your own harness on top of a frontier model?&lt;/strong&gt; Read the postmortem twice, then audit your prompt + caching + context-management pipeline for the same silent-failure modes. The bugs Anthropic described are exactly the ones every harness reinvents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;The meta-lesson is boring and important: most of the quality variance lives between "the model" and "the thing your user sees." Ship good harnesses.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;✏️ Drafted with KewBot (AI), edited and approved by Drew.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>anthropic</category>
      <category>postmortem</category>
    </item>
  </channel>
</rss>
