<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: r-via</title>
    <description>The latest articles on DEV Community by r-via (@r-via).</description>
    <link>https://dev.to/r-via</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3919360%2F5596f2ba-1d5e-4618-8399-b4d8a93780c2.png</url>
      <title>DEV Community: r-via</title>
      <link>https://dev.to/r-via</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/r-via"/>
    <language>en</language>
    <item>
      <title>Benchmarking the Claude Agent SDK on a local LLM: Haiku and Sonnet tier performance</title>
      <dc:creator>r-via</dc:creator>
      <pubDate>Thu, 28 May 2026 08:31:52 +0000</pubDate>
      <link>https://dev.to/r-via/benchmarking-the-claude-agent-sdk-on-a-local-llm-haiku-and-sonnet-tier-performance-504b</link>
      <guid>https://dev.to/r-via/benchmarking-the-claude-agent-sdk-on-a-local-llm-haiku-and-sonnet-tier-performance-504b</guid>
      <description>&lt;p&gt;The Claude Agent SDK exposes three budget tiers (&lt;code&gt;haiku&lt;/code&gt;, &lt;code&gt;sonnet&lt;/code&gt;, &lt;code&gt;opus&lt;/code&gt;) and reads its routing target from environment variables on every call. That means a single env-var swap can point a tier at any Anthropic-compatible endpoint — including a local &lt;code&gt;llama-server&lt;/code&gt;. The question is not whether you &lt;em&gt;can&lt;/em&gt; do it. The question is whether the local model is good enough, per tier, to ship.&lt;/p&gt;

&lt;p&gt;This article is the benchmark we ran to answer that question for Anatoly's document fact-check pipeline. &lt;strong&gt;5 providers × 4 workloads × N=5 trials&lt;/strong&gt;, Opus LLM-as-judge with an Anthropic-vs-Anthropic ceiling for calibration. Host: one RTX 3090 Ti, 24 GB VRAM.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we are testing
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;What it does in our pipeline&lt;/th&gt;
&lt;th&gt;Calls per run&lt;/th&gt;
&lt;th&gt;Profile&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Haiku&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;extract atomic facts, verify them against retrieval, score importance&lt;/td&gt;
&lt;td&gt;~1,700&lt;/td&gt;
&lt;td&gt;JSON-only, low-latency, strict schema&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sonnet&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;rewrite document sections to integrate omissions and correct hallucinations&lt;/td&gt;
&lt;td&gt;~8&lt;/td&gt;
&lt;td&gt;markdown out, ~1.5 to 3 k tokens, citation-tag sensitive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Opus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;reserved for final review and high-stakes work&lt;/td&gt;
&lt;td&gt;varies&lt;/td&gt;
&lt;td&gt;stays on Anthropic regardless&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The two questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Can a local LLM hit the Haiku tier's quality bar&lt;/strong&gt; at lower latency?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can a local LLM hit the Sonnet tier's quality bar&lt;/strong&gt; on long-form rewriting?&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;Five providers cycled sequentially through four workloads, N=5 trials, 100 calls total, 0 errors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;anthropic&lt;/strong&gt; — Anthropic native API (reference)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;local-mainline&lt;/strong&gt; — Qwen on llama.cpp mainline, FP16 KV&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;local-turbo&lt;/strong&gt; — Qwen on spiritbuun's TurboQuant fork, 4-bit KV, &lt;code&gt;parallel=1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;local-turbo-parallel&lt;/strong&gt; — same with &lt;code&gt;parallel=2&lt;/code&gt; (the 1.5 GB freed by 4-bit KV makes the extra slot viable)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;local-turbo-haiku-think&lt;/strong&gt; — ablation: 35B model with thinking ON, on haiku-tier workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three quality signals per output, all measured against the Anthropic reference:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cosine similarity&lt;/strong&gt; (semantic, embedding-based)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ROUGE-L F1&lt;/strong&gt; (lexical)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Opus LLM-as-judge&lt;/strong&gt; (0-to-10 absolute equivalence scale)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Opus judge is the load-bearing signal. Crucially, we also rate &lt;strong&gt;Anthropic trial 2 against Anthropic trial 1&lt;/strong&gt; as the empirical ceiling: the score that two runs of the same backend on the same input receive, which is what "indistinguishable" looks like on this scale. Any local provider that reaches the ceiling is at parity &lt;em&gt;empirically&lt;/em&gt;, not approximately.&lt;/p&gt;
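
&lt;p&gt;To make the lexical signal concrete, here is a minimal sketch of ROUGE-L F1 built on a longest-common-subsequence count. It is an illustration, not the benchmark harness: the function names are hypothetical, and lowercased whitespace tokenisation without stemming is an assumption.&lt;/p&gt;

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                # Extend the subsequence that ended before both tokens.
                curr.append(prev[j - 1] + 1)
            else:
                # Carry over the best length seen so far.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 on lowercased whitespace tokens (no stemming, an assumption)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

&lt;p&gt;Scoring the Anthropic reference against a second Anthropic run with the same function gives the lexical counterpart of the ceiling described above.&lt;/p&gt;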

&lt;h2&gt;
  
  
  Haiku tier results
&lt;/h2&gt;

&lt;p&gt;Three workloads: extract-chunk (~88 calls/run), verify-rag-fact (~1,300 calls/run), importance-score (~300 calls/run). All JSON-only with strict schemas.&lt;/p&gt;

&lt;h3&gt;
  
  
  Latency
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;Anthropic&lt;/th&gt;
&lt;th&gt;local-turbo-parallel&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;extract-chunk&lt;/td&gt;
&lt;td&gt;53.80 s&lt;/td&gt;
&lt;td&gt;5.79 s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;×9.3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;verify-rag-fact&lt;/td&gt;
&lt;td&gt;10.90 s&lt;/td&gt;
&lt;td&gt;1.87 s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;×5.8&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;importance-score&lt;/td&gt;
&lt;td&gt;10.39 s&lt;/td&gt;
&lt;td&gt;2.42 s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;×4.3&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
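
&lt;p&gt;The speedup column is just the ratio of the two wall-clock means. A small sketch that recomputes it from the figures in the table (the names &lt;code&gt;LATENCY_S&lt;/code&gt; and &lt;code&gt;speedup&lt;/code&gt; are illustrative):&lt;/p&gt;

```python
# Wall-clock means in seconds, copied from the latency table above:
# (Anthropic, local-turbo-parallel).
LATENCY_S = {
    "extract-chunk": (53.80, 5.79),
    "verify-rag-fact": (10.90, 1.87),
    "importance-score": (10.39, 2.42),
}


def speedup(anthropic_s, local_s):
    """Ratio of reference latency to local latency, rounded to one decimal."""
    return round(anthropic_s / local_s, 1)


SPEEDUPS = {name: speedup(*pair) for name, pair in LATENCY_S.items()}

for name, factor in SPEEDUPS.items():
    print(f"{name}: x{factor}")
```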

&lt;h3&gt;
  
  
  Quality (Opus-judge, ceiling = Anthropic vs Anthropic)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;Anthropic ceiling&lt;/th&gt;
&lt;th&gt;Local (production: 35 B no-think)&lt;/th&gt;
&lt;th&gt;Reading&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;verify-rag-fact&lt;/td&gt;
&lt;td&gt;9/10&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9/10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;at ceiling&lt;/strong&gt; (1,300 calls/run, dominant cost)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;importance-score&lt;/td&gt;
&lt;td&gt;8/10&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9/10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;one point above ceiling&lt;/strong&gt; (35B picks borderline buckets cleaner than Anthropic Haiku)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;extract-chunk&lt;/td&gt;
&lt;td&gt;6/10&lt;/td&gt;
&lt;td&gt;4–5/10&lt;/td&gt;
&lt;td&gt;67–83 % of ceiling (the Opus judge gives Anthropic's own repeat run only 6/10 here)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Verdict on the Haiku tier&lt;/strong&gt;: the local 35B-A3B in no-think mode is at parity on the dominant workload (verify, ×5.8 faster) and one point above ceiling on importance. Extract is one to two points below ceiling (4–5/10 against 6/10); the gap is real but small. &lt;strong&gt;The local 35B can replace Anthropic Haiku in production&lt;/strong&gt;, with the caveat that extract benefits from human spot-checks until we close that gap.&lt;/p&gt;

&lt;p&gt;A surprise from a follow-up mini-bench: a 35B-A3B MoE in no-think mode &lt;strong&gt;outperforms a dedicated dense 4B&lt;/strong&gt; at essentially the same latency, because MoE only activates ~3 B parameters per token.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Config (importance workload)&lt;/th&gt;
&lt;th&gt;Wall mean&lt;/th&gt;
&lt;th&gt;Stdev&lt;/th&gt;
&lt;th&gt;Opus rating&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;4 B no-think&lt;/td&gt;
&lt;td&gt;2.13 s&lt;/td&gt;
&lt;td&gt;0.02 s&lt;/td&gt;
&lt;td&gt;8/10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;35 B no-think&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.42 s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.15 s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9/10&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We initially shipped two separate models (4B for haiku, 35B for sonnet). After the mini-bench we collapsed to one GGUF in two containers, distinguished only by a thinking flag.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sonnet tier results
&lt;/h2&gt;

&lt;p&gt;One workload: correct-section-rewrite (~8 calls/run). Markdown output, ~1.5 to 3 k tokens, requires &lt;code&gt;[#filename]&lt;/code&gt; citation tags on every inserted sentence.&lt;/p&gt;

&lt;h3&gt;
  
  
  Latency
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Wall mean&lt;/th&gt;
&lt;th&gt;Note&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic Sonnet&lt;/td&gt;
&lt;td&gt;26.01 s&lt;/td&gt;
&lt;td&gt;reference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic Opus&lt;/td&gt;
&lt;td&gt;12.88 s&lt;/td&gt;
&lt;td&gt;the actual Anthropic option for this workload&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;local Qwen 35B no-think&lt;/td&gt;
&lt;td&gt;9.61 s&lt;/td&gt;
&lt;td&gt;fastest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;local Qwen 35B thinking ON&lt;/td&gt;
&lt;td&gt;23.50 s&lt;/td&gt;
&lt;td&gt;thinking helps quality, hurts latency&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Quality (Opus-judge, ceiling = 9/10)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Opus rating&lt;/th&gt;
&lt;th&gt;Failure mode&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic Opus&lt;/td&gt;
&lt;td&gt;9/10&lt;/td&gt;
&lt;td&gt;reference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;local Qwen 35B thinking ON&lt;/td&gt;
&lt;td&gt;6/10&lt;/td&gt;
&lt;td&gt;misses &lt;code&gt;[#filename]&lt;/code&gt; citation tags&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;local Qwen 35B no-think&lt;/td&gt;
&lt;td&gt;5/10&lt;/td&gt;
&lt;td&gt;misses citation tags, slight formatting drift&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Verdict on the Sonnet tier&lt;/strong&gt;: &lt;strong&gt;local does not replace Anthropic on this workload&lt;/strong&gt;. The local 35B applies every requested fix (Opus judge: "all four hallucinations softened and all three omissions integrated"), but consistently omits the &lt;code&gt;[#filename]&lt;/code&gt; citation tags Opus produces by default. No combination of thinking flags, prompt tweaks, or larger context closed the gap. The shortfall is 3 to 4 Opus-judge points on a 0-10 scale.&lt;/p&gt;

&lt;p&gt;So we ship hybrid: local for the high-volume Haiku-tier middle of the pipeline, &lt;strong&gt;Anthropic Opus for the 8 Sonnet-tier rewrites&lt;/strong&gt;. Those 8 calls cost ~$4 per run, but they buy the full ceiling on the content-touching phase, which matters more to the user than the cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Headline summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Full Anthropic&lt;/th&gt;
&lt;th&gt;Production hybrid (local middle + Opus rewrite)&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic API calls per run&lt;/td&gt;
&lt;td&gt;1,696&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;−99.5%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Wall-clock end-to-end&lt;/td&gt;
&lt;td&gt;~4 h&lt;/td&gt;
&lt;td&gt;~59 min&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;−75%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost per run&lt;/td&gt;
&lt;td&gt;~$5&lt;/td&gt;
&lt;td&gt;~$4&lt;/td&gt;
&lt;td&gt;−20%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verify quality&lt;/td&gt;
&lt;td&gt;9/10&lt;/td&gt;
&lt;td&gt;9/10 (parity)&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rewrite quality&lt;/td&gt;
&lt;td&gt;9/10&lt;/td&gt;
&lt;td&gt;9/10 (still Opus)&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The cost saving is modest because the 8 Anthropic calls we keep are on Opus (expensive per call but mandatory for quality). The volume reduction is the headline: 99.5% fewer calls means our Anthropic quota is no longer the bottleneck for the rest of the product.&lt;/p&gt;
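
&lt;p&gt;For readers who want to check the deltas, the sketch below recomputes them from the table. The per-workload call counts are the approximate figures quoted earlier in the article, and the variable names are illustrative.&lt;/p&gt;

```python
def pct_change(before, after):
    """Signed percentage change from before to after."""
    return (after - before) / before * 100


# Approximate calls per run: extract + verify + importance, plus 8 rewrites.
HAIKU_CALLS = 88 + 1300 + 300
REWRITE_CALLS = 8
TOTAL_CALLS = HAIKU_CALLS + REWRITE_CALLS

calls_delta = round(pct_change(TOTAL_CALLS, REWRITE_CALLS), 1)  # Anthropic calls
wall_delta = round(pct_change(4 * 60, 59))  # minutes: ~4 h vs ~59 min
cost_delta = round(pct_change(5.0, 4.0))  # dollars per run

print(TOTAL_CALLS, calls_delta, wall_delta, cost_delta)
```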

&lt;h2&gt;
  
  
  What we ship: per-tier production setup
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SDK env vars per call
       │
       ├─ haiku  ──► local llama-server, Qwen3.6-35B-A3B GGUF, thinking OFF
       ├─ sonnet ──► (defined but not loaded in production; falls back to Opus)
       └─ opus   ──► Anthropic native API (for correct-phase rewrite)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same GGUF in both LLM containers, different thinking flags. ~10 s restart to switch tiers. TurboQuant 4-bit KV cache + &lt;code&gt;--parallel 2&lt;/code&gt; for throughput.&lt;/p&gt;
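
&lt;p&gt;A rough sketch of what the per-call env-var swap could look like. Everything here is an assumption rather than our production code: the endpoint URL is hypothetical, and &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; and &lt;code&gt;ANTHROPIC_MODEL&lt;/code&gt; are the variable names Claude Code documents, which your SDK version may or may not honour in the same way. The alias &lt;code&gt;local-haiku&lt;/code&gt; is the &lt;code&gt;llama-server&lt;/code&gt; alias mentioned in the gotchas below.&lt;/p&gt;

```python
import os

# Hypothetical endpoint: where the local llama-server container listens.
LOCAL_BASE_URL = "http://localhost:8081"

# Tiers served locally in the setup described above, mapped to their alias.
LOCAL_TIERS = {"haiku": "local-haiku"}


def env_for_tier(tier, base_env=None):
    """Return a copy of the environment with routing set for one tier."""
    env = dict(os.environ if base_env is None else base_env)
    alias = LOCAL_TIERS.get(tier)
    if alias is None:
        # Tiers that stay on Anthropic: drop any local override.
        env.pop("ANTHROPIC_BASE_URL", None)
        env.pop("ANTHROPIC_MODEL", None)
    else:
        env["ANTHROPIC_BASE_URL"] = LOCAL_BASE_URL
        env["ANTHROPIC_MODEL"] = alias
    return env
```

&lt;p&gt;Building a fresh environment per call, instead of mutating &lt;code&gt;os.environ&lt;/code&gt; in place, keeps concurrent calls to different tiers from stepping on each other.&lt;/p&gt;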

&lt;p&gt;For Anatoly, the practical impact:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pipeline can run roughly 4× more often per day (~59 min per run instead of ~4 h), at about 20% lower Anthropic cost per run&lt;/li&gt;
&lt;li&gt;High-volume Haiku tier no longer competes for rate-limit headroom&lt;/li&gt;
&lt;li&gt;Hardware floor is one consumer GPU; horizontal scaling is just another machine&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Technical unlocks: four SDK gotchas that gate the numbers
&lt;/h2&gt;

&lt;p&gt;The benchmark above is contingent on four SDK integration fixes. They are not exotic, but none are obvious from the docs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Per-tier &lt;code&gt;--alias&lt;/code&gt; on &lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; so the SDK's &lt;code&gt;/v1/models&lt;/code&gt; validation accepts a stable name (&lt;code&gt;local-haiku&lt;/code&gt;, &lt;code&gt;local-sonnet&lt;/code&gt;) instead of the GGUF filename.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;parallel=1, ctx_per_slot=32768&lt;/code&gt;&lt;/strong&gt; for the long correction prompts (llama.cpp divides &lt;code&gt;--ctx-size&lt;/code&gt; by &lt;code&gt;--parallel&lt;/code&gt; for per-slot context; defaults give only 4 k per slot).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ban all 27 built-in tools, not the 12 you remember.&lt;/strong&gt; The SDK exposes &lt;code&gt;AskUserQuestion&lt;/code&gt;, &lt;code&gt;Skill&lt;/code&gt;, &lt;code&gt;CronCreate&lt;/code&gt;, &lt;code&gt;ScheduleWakeup&lt;/code&gt;, and ~20 others by default. Sonnet ignores them politely; Qwen happily calls &lt;code&gt;AskUserQuestion&lt;/code&gt; to "think out loud" and burns the &lt;code&gt;max_turns&lt;/code&gt; budget.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disable thinking at the container, not in the prompt.&lt;/strong&gt; The prompt-level &lt;code&gt;/no_think&lt;/code&gt; directive is honoured only 8% of the time on Qwen3.5+. Fix it at &lt;code&gt;llama-server&lt;/code&gt; startup:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--jinja&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--reasoning&lt;/span&gt; off &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--chat-template-kwargs&lt;/span&gt; &lt;span class="s1"&gt;'{"enable_thinking": false}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last one is the biggest single perf lever: &lt;strong&gt;×12 speedup&lt;/strong&gt; on the importance workload (21.7 s → 1.83 s wall mean) because the model stops emitting 358 tokens of reasoning before the 9-token JSON answer.&lt;/p&gt;
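
&lt;p&gt;The per-slot arithmetic behind the second gotcha is easy to get wrong, so here is a small sketch of it. It only encodes the division described above (&lt;code&gt;--ctx-size&lt;/code&gt; split evenly across &lt;code&gt;--parallel&lt;/code&gt; slots); the helper names are hypothetical.&lt;/p&gt;

```python
def per_slot_ctx(ctx_size, parallel):
    """Context tokens each slot gets when ctx_size is split across slots."""
    return ctx_size // parallel


def total_ctx_size(ctx_per_slot, parallel):
    """Value to pass as --ctx-size so that every slot gets ctx_per_slot tokens."""
    return ctx_per_slot * parallel


# Long correction prompts: one slot with the full 32768-token window.
print(total_ctx_size(32768, 1))

# Keeping 32768 tokens per slot with two slots needs twice the total.
print(total_ctx_size(32768, 2))

# Leaving --ctx-size at 32768 while adding a slot halves the window.
print(per_slot_ctx(32768, 2))
```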

&lt;h2&gt;
  
  
  What to take away
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;For the Haiku tier&lt;/strong&gt;: a local mid-size MoE in no-think mode is a credible replacement on JSON workloads, validated against an empirical Anthropic-vs-Anthropic ceiling. The 9/10 parity on verify is the strongest signal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For the Sonnet tier&lt;/strong&gt;: rewrite-with-citation-tags is the workload where local plateaued at 6/10 vs 9/10 ceiling. Route it back to Opus and stop chasing the gap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Methodology matters more than the numbers&lt;/strong&gt;. The Anthropic-vs-Anthropic ceiling is what turns "0.96 cosine, looks close" into "at parity, ship it". Single-trial benches lie about stochastic decoders.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid is the destination, not pure local.&lt;/strong&gt; The 99.5% call-volume cut is bigger than the 20% cost cut precisely because we keep Anthropic where it matters.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Full write-up with the TurboQuant Dockerfile, the build gotchas (&lt;code&gt;libcuda.so.1&lt;/code&gt;, &lt;code&gt;libgomp.so.1&lt;/code&gt;), the fork comparison, the complete per-workload Opus-judge tables, and the N=5 variance discussion: &lt;strong&gt;&lt;a href="https://anatoly.cloud/research/local-llm-claude-agent-sdk-turboquant" rel="noopener noreferrer"&gt;https://anatoly.cloud/research/local-llm-claude-agent-sdk-turboquant&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>claude</category>
      <category>llamacpp</category>
      <category>benchmark</category>
    </item>
    <item>
      <title>Your AI coding agent gets worse as your codebase grows. Here's why.</title>
      <dc:creator>r-via</dc:creator>
      <pubDate>Fri, 08 May 2026 08:06:59 +0000</pubDate>
      <link>https://dev.to/r-via/your-ai-coding-agent-gets-worse-as-your-codebase-grows-heres-why-53c0</link>
      <guid>https://dev.to/r-via/your-ai-coding-agent-gets-worse-as-your-codebase-grows-heres-why-53c0</guid>
      <description>&lt;p&gt;Most people don't notice their AI coding agent gets worse as their codebase grows.&lt;/p&gt;

&lt;p&gt;Not because the model degrades. Because the context does.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pattern
&lt;/h2&gt;

&lt;p&gt;50 files: Claude Code or Cursor sees enough of the codebase to follow conventions, reuse utilities, avoid duplication. The output is coherent.&lt;/p&gt;

&lt;p&gt;500 files: it can't. So it reimplements helpers that already exist three folders away. Introduces naming conventions that contradict the rest. Generates functions nobody will ever call. "Fixes" bugs by stacking workarounds on workarounds.&lt;/p&gt;

&lt;p&gt;The model didn't get dumber. It just stopped being able to hold the whole project in its head.&lt;/p&gt;

&lt;p&gt;The result: codebases that ship fast at first, then collapse under their own weight. Dead code. Hidden duplication. Best practices selectively applied. The exact opposite of what AI-assisted coding was supposed to give us.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why nothing catches it
&lt;/h2&gt;

&lt;p&gt;Linters catch syntax. Type checkers catch types. Test runners catch broken contracts. None of them catch the kind of rot AI-generated code produces, which is &lt;strong&gt;architectural&lt;/strong&gt; rot.&lt;/p&gt;

&lt;p&gt;"Is this function used anywhere?"&lt;br&gt;
"Does this already exist somewhere else under a different name?"&lt;br&gt;
"Is this the third time we've solved this problem three different ways?"&lt;br&gt;
"Does this follow the conventions used in the rest of the codebase?"&lt;/p&gt;

&lt;p&gt;These questions require understanding the whole project, not just one file. No traditional tool can answer them. And asking the same AI agent that wrote the code to also review it is asking the fox to guard the henhouse.&lt;/p&gt;
&lt;h2&gt;
  
  
  What I tried first
&lt;/h2&gt;

&lt;p&gt;I tried prompting more. Bigger context windows. Stricter system prompts. "Please check if this function already exists before creating a new one."&lt;/p&gt;

&lt;p&gt;It worked sometimes. It also failed silently most of the time, because the agent had no reliable way to actually verify its claims. So it just confidently asserted things were unique when they weren't.&lt;/p&gt;

&lt;p&gt;I needed an intermediary. A separate quality layer that does one job: audit what the coding agent produces, with proof.&lt;/p&gt;
&lt;h2&gt;
  
  
  What I built
&lt;/h2&gt;

&lt;p&gt;I built &lt;a href="https://anatoly.cloud" rel="noopener noreferrer"&gt;Anatoly&lt;/a&gt;, an open-source audit agent that walks through every file in your codebase and produces an evidence-backed review.&lt;/p&gt;

&lt;p&gt;The core rule: every finding has to be proven before it's reported. If the agent claims a function is dead, a second agent has to prove it through a deliberation mode, using read-only tools (Grep, Glob, Read) to investigate the whole project and confirm the finding. No claim survives without evidence. No hallucinated findings.&lt;/p&gt;

&lt;p&gt;Under the hood it uses tree-sitter for AST parsing, gives a Claude agent read-only tools (Glob, Grep, Read) to investigate, and runs a local semantic RAG index (Xenova embeddings + LanceDB) to catch cross-file duplication that grep can't see. Output is schema-validated with Zod, with a self-correction loop if the agent's JSON doesn't pass.&lt;/p&gt;

&lt;p&gt;One command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx anatoly run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Next step
&lt;/h2&gt;

&lt;p&gt;The next step I'm working on: a remote audit workflow. Anatoly runs on a remote server while you sleep, posts a structured report directly on your GitHub repo (issues or PR comments), and gives you a clean list of findings to address one by one the next morning. No local cost, no waiting, no context switching. You wake up, you review, you fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking for repos
&lt;/h2&gt;

&lt;p&gt;Anatoly is open-source under AGPL-3.0. I'm currently looking for codebases to scan for free to refine the model and surface edge cases. If you've got a project you'd like audited, no strings attached, drop a comment or open an issue on the repo.&lt;br&gt;
Repo: &lt;a href="https://github.com/r-via/anatoly" rel="noopener noreferrer"&gt;github.com/r-via/anatoly&lt;/a&gt;&lt;br&gt;
How are you handling AI-generated code rot on your team? Curious if others are seeing the same pattern.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>programming</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
