Local LLM vs Claude: Benchmarking qwen3-coder:30b as a Production Agent Backend

#llm #homelab #opensource #ai

TL;DR

Replayed 27 real historical tasks from Jarvis (my LangGraph agent, ~90 tools) through qwen3-coder:30b on an RTX 3090 and scored the results against Claude's actual production answers to the same tasks. Quality: Claude 89.4/100 vs qwen 22.8/100. Cost: qwen ~5,150x cheaper per task ($0.00015 vs $0.763, real GPU electricity vs real API billing). Reliability: qwen leaked malformed tool-call tags into 26% of answers, and on tasks that needed tools it reached for the right ones only 14.8% of the time on average. The same qwen3-coder:30b scored 100% in an earlier 17-task benchmark with a much smaller tool surface, so the gap here says as much about tool-surface complexity as about the model itself.

The question

Jarvis is a real personal AI agent — LangGraph create_react_agent, ~90 tools spanning email/calendar/notes/files/messaging/code, running on Claude in production. qwen3-coder:30b had already scored 100% task success in a controlled 17-task benchmark on the same RTX 3090. Obvious next question: drop it into the real agent and see what happens.

The setup

28 real task prompts pulled from Jarvis's own Langfuse traces (90-day window), stratified 4×7 across calendar / code / email / files / general / messaging / notes.
Claude's answers are real production history, not re-run. Re-running through the sandbox would hand it fake stub data it never saw — that's a worse baseline, not a fairer one.
qwen runs fresh, through a sandboxed replay harness: the real Jarvis agent code in-process, every write-capable tool intercepted (nothing sent/written for real), and every mocked read-only tool serves the real recorded output from that task's original trace when available — not a generic stub. Same data, both models.
1/28 tasks excluded (its 336,906-char prompt overflows any 16K–24K-token context window) → 27 scored.
Judge: LLM-as-judge (claude-opus-4-8), scoring each answer independently (not pairwise) to avoid position bias, on a 1–5 scale rescaled to 0–100.
Every qwen run priced as a HomeLab Monitor experiment against real 3090 power draw. Claude's cost is Langfuse's recorded API billing.
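The replay rule described above (intercept writes, serve recorded reads, fall back to a stub) can be sketched roughly as follows. This is a minimal illustration, not the real Jarvis harness: `make_replay_tool`, the `WRITE_TOOLS` set, and the `(output, replayed_real_data)` return shape are all hypothetical names chosen for the example.

```python
from typing import Any, Callable, Dict, Tuple

# Assumed set of write-capable tools; the real harness's list is not shown in this post.
WRITE_TOOLS = {"send_email", "todo_write"}


def make_replay_tool(
    name: str,
    recorded: Dict[str, Any],
    stub: Callable[..., Any],
) -> Callable[..., Tuple[Any, bool]]:
    """Wrap one tool so it never touches the real world.

    The returned callable yields (output, replayed_real_data) on every call.
    """

    def tool(**kwargs: Any) -> Tuple[Any, bool]:
        if name in WRITE_TOOLS:
            # Intercept: describe what would have happened, send nothing.
            return {"intercepted": True, "tool": name, "args": kwargs}, False
        if name in recorded:
            # Serve the output recorded in this task's original trace.
            return recorded[name], True
        # No recording available for this tool: fall back to a generic stub.
        return stub(**kwargs), False

    return tool
```

Returning the `replayed_real_data` flag next to the output makes it easy to check afterwards whether a given failure happened on real recorded data or on stub data.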

Caveat, stated plainly: the judge is a Claude model scoring Claude's own answers alongside qwen's. Self-preference bias is a documented effect in LLM-as-judge setups and probably inflates the gap somewhat. It doesn't explain a gap of nearly 67 points, a 26% malformed-output rate, or two tool-call loops, but it's a real methodology limitation, not a footnote.

Getting here took three re-runs: a judge-response parsing bug that silently neutral-scored ~40/54 calls, a mock-data bug that starved qwen of real inbox/calendar content on 16/28 tasks while Claude's baseline had the real thing, and a Claude-API rate limit that neutral-scored another batch mid-scoring. All three caught by checking score distributions, not by trusting a clean exit code — worth knowing before trusting the numbers below.
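A distribution check of the kind that caught those bugs might look something like the sketch below. It assumes the fallback ("neutral") value is the midpoint of the 1–5 scale; `NEUTRAL_RAW`, `neutral_share`, `looks_suspicious`, and the 50% threshold are illustrative choices, not the actual checks used here.

```python
from collections import Counter
from typing import Sequence

NEUTRAL_RAW = 3  # assumption: a failed judge call falls back to the scale midpoint


def neutral_share(raw_scores: Sequence[int]) -> float:
    """Fraction of scores sitting exactly on the neutral fallback value."""
    if not raw_scores:
        return 0.0
    return Counter(raw_scores)[NEUTRAL_RAW] / len(raw_scores)


def looks_suspicious(raw_scores: Sequence[int], max_share: float = 0.5) -> bool:
    """Flag a scoring run where too many scores equal the fallback value."""
    return neutral_share(raw_scores) > max_share
```

A run in which about 40 of 54 judge calls land on the same value would be flagged immediately, even though the scoring script itself exited cleanly.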

The numbers

Metric	Claude	qwen3-coder:30b
Avg quality (0–100)	89.4	22.8
Cost / task	$0.763 (real API billing)	$0.00015 (real GPU electricity)
Total cost, 27 tasks	$20.60	$0.004
Total energy	n/a	0.0396 kWh

qwen is ~5,150x cheaper per task. That figure is precise: it is currency-converted from a 0.0072 BGN total across all 27 tasks, at 1 BGN = $0.5547. An earlier rough estimate of 180x on this project was wrong; this is the corrected number.
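The arithmetic behind that ratio can be checked directly. A minimal sketch using only figures quoted in this post; the variable names are illustrative, and the implied electricity tariff is derived here rather than stated anywhere above.

```python
TOTAL_BGN = 0.0072          # qwen electricity cost across all 27 scored tasks
USD_PER_BGN = 0.5547        # exchange rate quoted above
TOTAL_KWH = 0.0396          # total energy from the table
TASKS = 27
CLAUDE_USD_PER_TASK = 0.763

qwen_total_usd = TOTAL_BGN * USD_PER_BGN
qwen_usd_per_task = qwen_total_usd / TASKS
ratio = CLAUDE_USD_PER_TASK / qwen_usd_per_task

# Derived, not stated in the post: the tariff implied by cost / energy.
implied_bgn_per_kwh = TOTAL_BGN / TOTAL_KWH

print(f"qwen total:     ${qwen_total_usd:.4f}")
print(f"qwen per task:  ${qwen_usd_per_task:.5f}")
print(f"Claude / qwen:  {ratio:,.0f}x")
print(f"implied tariff: {implied_bgn_per_kwh:.3f} BGN/kWh")
```

From the unrounded total the ratio comes out at roughly 5,158x, which rounds to the ~5,150x headline. Dividing the rounded per-task figures ($0.763 / $0.00015) gives about 5,087x instead, so the unrounded total is the better basis.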

By category (Claude | qwen | n):

calendar:   90 | 30 | 4
code:       87 | 25 | 3
email:      92 | 15 | 4
files:      88 | 15 | 4
general:    85 | 30 | 4
messaging:  87 | 22 | 4
notes:      97 | 22 | 4

qwen's best relative showing (calendar, general) is still only about a third of Claude's score. It never wins a category.

Where it breaks

Malformed tool-call leak — instead of a real LangGraph tool call, qwen sometimes emits the call as raw text in its final answer:

<function=send_email>
{"to": "...", "subject": "...", "body": "..."}
</function>

That happened on 7/27 tasks (26%). The user reading that answer sees broken syntax where a real action should have been confirmed or a real answer given.
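Leaks of this shape are easy to detect mechanically. A minimal sketch, assuming the leaked text always follows the `<function=NAME>...</function>` pattern shown above; `find_leaked_calls` and `leak_rate` are hypothetical helpers, not part of the harness.

```python
import re
from typing import List, Tuple

# Matches the leaked pattern shown above: <function=NAME> ... </function>
LEAKED_CALL = re.compile(r"<function=([\w.-]+)>(.*?)</function>", re.DOTALL)


def find_leaked_calls(answer: str) -> List[Tuple[str, str]]:
    """Return (tool_name, raw_payload) for every leaked tool-call tag."""
    return [(m.group(1), m.group(2).strip()) for m in LEAKED_CALL.finditer(answer)]


def leak_rate(answers: List[str]) -> float:
    """Share of final answers containing at least one leaked tag."""
    if not answers:
        return 0.0
    return sum(bool(find_leaked_calls(a)) for a in answers) / len(answers)
```

A check like this could also run as a guard before an answer reaches the user, so a leaked call is retried or reported as a failure rather than displayed as raw syntax.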

Tool-overlap recall: 14.8% average, measured over the 18/27 tasks where the original historical trace actually used at least one tool (9 tasks needed none). Most of the time qwen reached for different tools than the ones that actually solved the task — or none.
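One plausible way to compute such a number is set-based recall per task, averaged over the tasks whose original trace used at least one tool. The exact formula used for the 14.8% figure is not spelled out in this post, so treat the definition below, and the names `tool_recall` and `mean_recall`, as assumptions.

```python
from typing import Iterable, List, Optional, Tuple


def tool_recall(expected: Iterable[str], used: Iterable[str]) -> Optional[float]:
    """Share of the originally used tools that the replayed model also called.

    Returns None when the original trace used no tools, so the task
    can be left out of the average.
    """
    expected_set, used_set = set(expected), set(used)
    if not expected_set:
        return None
    return len(expected_set & used_set) / len(expected_set)


def mean_recall(pairs: List[Tuple[Iterable[str], Iterable[str]]]) -> Optional[float]:
    """Average recall over tasks that needed at least one tool."""
    scores = []
    for expected, used in pairs:
        score = tool_recall(expected, used)
        if score is not None:
            scores.append(score)
    return sum(scores) / len(scores) if scores else None
```

Under this definition, calling extra tools costs nothing; only missing an expected tool lowers the score, which fits the "different tools, or none" failure described above.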

Repetitive-loop failure on 2/27 tasks: pilot-17 (email, 24 tool calls, 138.6s, ~196.7K input tokens) and pilot-27 (messaging, 27 tool calls, 148.9s, ~196.7K input tokens) both kept re-calling tools that had already returned an answer (run_command, todo_write) instead of stopping. Raw logs confirm that both tasks received real replayed data (replayed_real_data: true), so this is a genuine stopping-condition failure, not a data-starvation artifact.
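A simple guard against this failure mode is to count identical tool calls and cut the run off past a threshold. The sketch below is a hypothetical mitigation, not something the harness in this post does; `RepeatGuard` and its `max_repeats` default are illustrative.

```python
import json
from collections import Counter
from typing import Any, Dict


class RepeatGuard:
    """Counts identical (tool, args) calls and signals when to stop."""

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self._seen: Counter = Counter()

    def should_stop(self, tool: str, args: Dict[str, Any]) -> bool:
        # Canonical JSON so that argument order does not hide a repeat.
        key = (tool, json.dumps(args, sort_keys=True, default=str))
        self._seen[key] += 1
        return self._seen[key] > self.max_repeats
```

The guard only catches exact repeats. A model that varies its arguments slightly on each call would slip past it, so a cap on total tool calls per task is a sensible companion.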

One more data point worth having, not a verdict: on a task where both models actually called send_email(...) in the harness (intercepted, nothing sent), Claude told the user the email had been sent — a fabrication. qwen correctly disclosed the send didn't go through. Not "qwen is more honest" — it's also the model leaking raw tags 26% of the time. Both mishandled the mock, just differently.

Scope of the claim

Same qwen3-coder:30b, same GPU, scored 100% on a 17-task controlled benchmark with a much smaller tool surface. This isn't "local LLMs are bad" — it's that a model excellent on a scoped benchmark isn't automatically a safe drop-in for a large, real, ~90-tool production surface with a 31KB context prompt and real messy history behind it. Task/tool-surface complexity mattered as much as raw model quality here. Claude isn't flawless either — see the fabricated send-email confirmation above.

Jarvis stays on Claude for now. The cost number is real enough to be worth a narrower follow-up — testing qwen on just the categories where it scored closest (calendar, general) as a cheap fallback path, rather than a full swap.

Full narrative version, charts, and the three-bug scoring saga: on Medium.

Every qwen run here was priced through HomeLab Monitor against the 3090's real power draw — MIT licensed, one container, reproducible if you want to price your own local-model experiments the same way.

Curious where the line is for you: how cheap does a local model have to be before you'd trust it with a slice of a real agent, and which slice would you pick first?

Top comments (1)

Mateo Ruiz • Jul 6

Appreciate that you included both the wins and the failures instead of trying to prove a point. Real production comparisons are rarely clean, and calling out methodology limitations (judge bias, replay bugs, tool complexity) makes the benchmark much more useful than a simple "Model A vs Model B" scorecard.