r-via

Posted on May 28 • Originally published at anatoly.cloud

Benchmarking the Claude Agent SDK on a local LLM: Haiku and Sonnet tier performance

#llm #claude #llamacpp #benchmark

The Claude Agent SDK exposes three budget tiers (haiku, sonnet, opus) and reads its routing target from environment variables on every call. That means a single env-var swap can point a tier at any Anthropic-compatible endpoint — including a local llama-server. The question is not whether you can do it. The question is whether the local model is good enough, per tier, to ship.

This article is the benchmark we ran to answer that question for Anatoly's document fact-check pipeline. 5 providers × 4 workloads × N=5 trials, Opus LLM-as-judge with an Anthropic-vs-Anthropic ceiling for calibration. Host: one RTX 3090 Ti, 24 GB VRAM.

What we are testing

Tier	What it does in our pipeline	Calls per run	Profile
Haiku	extract atomic facts, verify them against retrieval, score importance	~1,700	JSON-only, low-latency, strict schema
Sonnet	rewrite document sections to integrate omissions and correct hallucinations	~8	markdown out, ~3 k tokens, citation-tag sensitive
Opus	reserved for final review and high-stakes work	varies	stays on Anthropic regardless

The two questions:

Can a local LLM hit the Haiku tier's quality bar at lower latency?
Can a local LLM hit the Sonnet tier's quality bar on long-form rewriting?

Methodology

Five providers cycled sequentially through four workloads, N=5 trials, 100 calls total, 0 errors:

anthropic — Anthropic native API (reference)
local-mainline — Qwen on llama.cpp mainline, FP16 KV
local-turbo — Qwen on spiritbuun's TurboQuant fork, 4-bit KV, parallel=1
local-turbo-parallel — same with parallel=2 (the 1.5 GB freed by 4-bit KV makes the extra slot viable)
local-turbo-haiku-think — ablation: 35B model with thinking ON, on haiku-tier workloads

Three quality signals per output, all measured against the Anthropic reference:

Cosine similarity (semantic, embedding-based)
ROUGE-L F1 (lexical)
Opus LLM-as-judge (0-to-10 absolute equivalence scale)

The Opus judge is the load-bearing one. Crucially, we also rate Anthropic trial 2 against Anthropic trial 1 as the empirical ceiling: the score "indistinguishable" gets for two runs of the same backend on the same input. Any local provider at the ceiling is at parity, empirically, not approximately.

Haiku tier results

Three workloads: extract-chunk (~88 calls/run), verify-rag-fact (~1,300 calls/run), importance-score (~300 calls/run). All JSON-only with strict schemas.

Latency

Workload	Anthropic	local-turbo-parallel	Speedup
extract-chunk	53.80 s	5.79 s	×9.3
verify-rag-fact	10.90 s	1.87 s	×5.8
importance-score	10.39 s	2.42 s	×4.3

Quality (Opus-judge, ceiling = Anthropic vs Anthropic)

Workload	Anthropic ceiling	Local (production: 35 B no-think)	Reading
verify-rag-fact	9/10	9/10	at ceiling (1,300 calls/run, dominant cost)
importance-score	8/10	9/10	one point above ceiling (35B picks borderline buckets cleaner than Anthropic Haiku)
extract-chunk	6/10	4–5/10	66–83 % of ceiling (Anthropic is strict on its own output here)

Verdict on the Haiku tier: the local 35B-A3B in no-think mode is at parity on the dominant workload (verify, ×5.8 faster) and one point above ceiling on importance. Extract is one to two points below ceiling; the gap is real but small, and the upper bound matches the ceiling. The local 35B can replace Anthropic Haiku in production, with the caveat that extract benefits from human spot-checks until we close that gap.

A surprise from a follow-up mini-bench: a 35B-A3B MoE in no-think mode outperforms a dedicated dense 4B at essentially the same latency, because MoE only activates ~3 B parameters per token.

Config (importance workload)	Wall mean	Stdev	Opus rating
4 B no-think	2.13 s	0.02 s	8/10
35 B no-think	2.42 s	0.15 s	9/10

We initially shipped two separate models (4B for haiku, 35B for sonnet). After the mini-bench we collapsed to one GGUF in two containers, distinguished only by a thinking flag.

Sonnet tier results

One workload: correct-section-rewrite (~8 calls/run). Markdown output, ~1.5 to 3 k tokens, requires [#filename] citation tags on every inserted sentence.

Latency

Provider	Wall mean	Note
Anthropic Sonnet	26.01 s	reference
Anthropic Opus	12.88 s	the actual Anthropic option for this workload
local Qwen 35B no-think	9.61 s	fastest
local Qwen 35B thinking ON	23.50 s	thinking helps quality, hurts latency

Quality (Opus-judge, ceiling = 9/10)

Provider	Opus rating	Failure mode
Anthropic Opus	9/10	reference
local Qwen 35B thinking ON	6/10	misses `[#filename]` citation tags
local Qwen 35B no-think	5/10	misses citation tags, slight formatting drift

Verdict on the Sonnet tier: local does not replace Anthropic on this workload. The local 35B applies every requested fix (Opus judge: "all four hallucinations softened and all three omissions integrated"), but consistently omits the [#filename] citation tags Opus produces by default. No combination of thinking flags, prompt tweaks, or larger context closed the gap. The shortfall is 3 to 4 Opus-judge points on a 0-10 scale.

So we ship hybrid: local for the Haiku tier middle, Anthropic Opus for the 8 Sonnet-tier rewrites. Those 8 calls cost ~$4 per run, but they buy the full ceiling on the content-touching phase, which matters more for the user than the cost.

Headline summary

	Full Anthropic	Production hybrid (local middle + Opus rewrite)	Delta
Anthropic API calls per run	1,696	8	−99.5%
Wall-clock end-to-end	~4 h	~59 min	−75%
Cost per run	~$5	~$4	−20%
Verify quality	9/10	9/10 (parity)	0
Rewrite quality	9/10	9/10 (still Opus)	0

The cost saving is modest because the 8 Anthropic calls we keep are on Opus (expensive per call but mandatory for quality). The volume reduction is the headline: 99.5% fewer calls means our Anthropic quota is no longer the bottleneck for the rest of the product.

What we ship: per-tier production setup

SDK env vars per call
       │
       ├─ haiku  ──► local llama-server, Qwen3.6-35B-A3B GGUF, thinking OFF
       ├─ sonnet ──► (defined but not loaded in production; falls back to Opus)
       └─ opus   ──► Anthropic native API (for correct-phase rewrite)

Same GGUF in both LLM containers, different thinking flags. ~10 s restart to switch tiers. TurboQuant 4-bit KV cache + --parallel 2 for throughput.

For Anatoly, the practical impact:

Pipeline can run roughly 4× more often per day on the same Anthropic budget
High-volume Haiku tier no longer competes for rate-limit headroom
Hardware floor is one consumer GPU; horizontal scaling is just another machine

Technical unlocks: four SDK gotchas that gate the numbers

The benchmark above is contingent on four SDK integration fixes. They are not exotic, but none are obvious from the docs:

Per-tier --alias on llama-server so the SDK's /v1/models validation accepts a stable name (local-haiku, local-sonnet) instead of the GGUF filename.
parallel=1, ctx_per_slot=32768 for the long correction prompts (llama.cpp divides --ctx-size by --parallel for per-slot context; defaults give only 4 k per slot).
Ban all 27 built-in tools, not the 12 you remember. The SDK exposes AskUserQuestion, Skill, CronCreate, ScheduleWakeup, and ~20 others by default. Sonnet ignores them politely; Qwen happily calls AskUserQuestion to "think out loud" and burns the max_turns budget.
Disable thinking at the container, not in the prompt. Prompt-level /no_think directive is honoured 8% of the time on Qwen3.5+. Fix at llama-server startup:

llama-server \
  --jinja \
  --reasoning off \
  --chat-template-kwargs '{"enable_thinking": false}' \
  ...

The last one is the biggest single perf lever: ×12 speedup on the importance workload (21.7 s → 1.83 s wall mean) because the model stops emitting 358 tokens of reasoning before the 9-token JSON answer.

What to take away

For the Haiku tier: a local mid-size MoE in no-think mode is a credible replacement on JSON workloads, validated against an empirical Anthropic-vs-Anthropic ceiling. The 9/10 parity on verify is the strongest signal.
For the Sonnet tier: rewrite-with-citation-tags is the workload where local plateaued at 6/10 vs 9/10 ceiling. Route it back to Opus and stop chasing the gap.
Methodology matters more than the numbers. The Anthropic-vs-Anthropic ceiling is what turns "0.96 cosine, looks close" into "at parity, ship it". Single-trial benches lie about stochastic decoders.
Hybrid is the destination, not pure local. The 99.5% call-volume cut is bigger than the 20% cost cut precisely because we keep Anthropic where it matters.

Full write-up with the TurboQuant Dockerfile, the build gotchas (libcuda.so.1, libgomp.so.1), the fork comparison, the complete per-workload Opus-judge tables, and the N=5 variance discussion: https://anatoly.cloud/research/local-llm-claude-agent-sdk-turboquant

Top comments (1)

Harjot Singh • May 31

Surprisingly close for many tasks, with caveats around tool use reliability is the most useful sentence in the whole local-vs-hosted debate, because it kills the binary framing. The question was never is the local model as good as Sonnet, it's which tasks does it clear the bar on, and your result says the answer is most of them, which is exactly the case for routing rather than picking one model for everything. The easy, high-volume majority goes to the cheap self-hosted model, and you reserve the hosted frontier model for the slice that actually needs it. And your caveat names that slice precisely: tool use reliability. Freeform reasoning and summarization degrade gracefully on a smaller model, but tool calling is brittle, a malformed JSON arg or a hallucinated tool name doesn't degrade, it breaks the whole agent step, so that's the dividing line, not a vague quality score. The pragmatic architecture that falls out: route structured tool-heavy steps to the reliable model, route the reasoning and drafting to local, and you capture most of the cost savings without eating the failure mode that actually hurts. Match the model to the task's failure tolerance, not to a leaderboard. That route-by-where-the-cheap-model-actually-breaks instinct is core to how I think about cost in Moonshift. Did the tool-use unreliability look fixable with stricter output constraints/grammar, or was it a hard ceiling on the local model?