<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: kuroko</title>
    <description>The latest articles on DEV Community by kuroko (@kuroko1t).</description>
    <link>https://dev.to/kuroko1t</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F168435%2Fb7b5bb69-c228-4766-8b85-7b9cc4322d73.jpeg</url>
      <title>DEV Community: kuroko</title>
      <link>https://dev.to/kuroko1t</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kuroko1t"/>
    <language>en</language>
    <item>
      <title>Benchmarking Local Coding LLMs: 11 Realistic Tasks, 232 Runs, and the Bugs My Bench Found in My Agent</title>
      <dc:creator>kuroko</dc:creator>
      <pubDate>Wed, 29 Apr 2026 03:42:33 +0000</pubDate>
      <link>https://dev.to/kuroko1t/benchmarking-local-coding-llms-11-realistic-tasks-232-runs-and-the-bugs-my-bench-found-in-my-46pl</link>
      <guid>https://dev.to/kuroko1t/benchmarking-local-coding-llms-11-realistic-tasks-232-runs-and-the-bugs-my-bench-found-in-my-46pl</guid>
      <description>&lt;p&gt;What can a 16GB GPU and a local LLM actually do for everyday coding work? I built an 11-task benchmark to find out and ran four open-weight models (9B to 35B; the 35B is an MoE with 3B active per token) through it. &lt;strong&gt;232 runs in total.&lt;/strong&gt; A single RTX 5060 Ti with 16GB VRAM.&lt;/p&gt;

&lt;p&gt;Headline: the biggest, newest model (Qwen3.6-35B-A3B) won at &lt;strong&gt;100% pass rate (29/29 runs, 11/11 tasks)&lt;/strong&gt; after some tuning. The previous-gen &lt;code&gt;qwen3.5:9b&lt;/code&gt; — older &lt;em&gt;and&lt;/em&gt; smaller — passed &lt;strong&gt;9/11 tasks at 24s/run average, roughly one third the wall time&lt;/strong&gt; of the 35B. So the more interesting question turns out not to be "which model wins" but "do you actually need the latest, biggest model":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The benchmark found three bugs in &lt;strong&gt;my own agent&lt;/strong&gt; before it surfaced anything interesting about the models.&lt;/li&gt;
&lt;li&gt;Picking the right quantization (UD-Q3_K_M instead of Q4_K_M) was worth ~33% on average and saved one model from CPU offload entirely — but the same quant under FP16 KV cache &lt;em&gt;blew up&lt;/em&gt; on two tasks specifically.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;qwen3.5:9b&lt;/code&gt; passes 9/11 tasks at one-third the latency of Qwen3.6-35B-A3B; the bigger newer model's two extra wins are both refactor/feature-add tasks. If your workload is single-file edits, debugging, or read-only investigation, the 9B is plenty.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;The agent under test is &lt;a href="https://github.com/kuroko1t/whet" rel="noopener noreferrer"&gt;Whet&lt;/a&gt; — a single-binary Rust coding agent that talks to local models via Ollama. The benchmark suite is open and runnable: &lt;code&gt;scripts/run_bench.sh -m &amp;lt;model&amp;gt; -n 3&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The benchmark
&lt;/h2&gt;

&lt;p&gt;Eleven tasks across six ability axes. Each task is a self-contained directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;benchmarks/&amp;lt;task&amp;gt;/
  prompt.txt    — instruction passed to whet -p
  verify.sh     — exits 0 on pass, non-zero on fail
  workspace/    — initial files, copied to a tempdir per run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;What it asks for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;task1_hello&lt;/td&gt;
&lt;td&gt;single-file edit&lt;/td&gt;
&lt;td&gt;add a &lt;code&gt;farewell()&lt;/code&gt; function&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;task2_typo&lt;/td&gt;
&lt;td&gt;multi-file grep+replace&lt;/td&gt;
&lt;td&gt;fix &lt;code&gt;recieve&lt;/code&gt; → &lt;code&gt;receive&lt;/code&gt; across 3 files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;task3_rename&lt;/td&gt;
&lt;td&gt;multi-file rename&lt;/td&gt;
&lt;td&gt;rename &lt;code&gt;compute()&lt;/code&gt; → &lt;code&gt;add()&lt;/code&gt; across 3 files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;task6_debug&lt;/td&gt;
&lt;td&gt;debug + run tests&lt;/td&gt;
&lt;td&gt;fix three empty-list edge cases (division-by-zero / index-out-of-range) — tests are SHA-pinned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;task7_dedupe&lt;/td&gt;
&lt;td&gt;refactor&lt;/td&gt;
&lt;td&gt;extract a helper from four near-duplicate functions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;task8_cli_filter&lt;/td&gt;
&lt;td&gt;feature add&lt;/td&gt;
&lt;td&gt;add a &lt;code&gt;--status pending&lt;/code&gt; filter flag with a full code path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;task9_investigate&lt;/td&gt;
&lt;td&gt;read-only exploration&lt;/td&gt;
&lt;td&gt;enumerate every HTTP endpoint and write to ANSWER.md&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;task10_security_fix&lt;/td&gt;
&lt;td&gt;judgment&lt;/td&gt;
&lt;td&gt;patch a SQL injection (verifier injects &lt;code&gt;' OR '1'='1&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;task11_planning_chain&lt;/td&gt;
&lt;td&gt;multi-file planning&lt;/td&gt;
&lt;td&gt;migrate &lt;code&gt;print()&lt;/code&gt; → &lt;code&gt;logging&lt;/code&gt; across 3 files (caplog tests)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;task12_test_gen&lt;/td&gt;
&lt;td&gt;TDD&lt;/td&gt;
&lt;td&gt;write a test suite for a Calculator class. &lt;strong&gt;Verifier mutation-tests it.&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;task13_typescript&lt;/td&gt;
&lt;td&gt;non-Python&lt;/td&gt;
&lt;td&gt;add a function + test in a tiny TS module (&lt;code&gt;tsc --noEmit&lt;/code&gt; + &lt;code&gt;node:test&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;(Task numbering is non-contiguous: numbers 4 and 5 were never assigned, and the original &lt;code&gt;task7_refactor&lt;/code&gt; was retired and replaced with the stricter &lt;code&gt;task7_dedupe&lt;/code&gt;. The 11 above are the live set.)&lt;/p&gt;

&lt;p&gt;Verifier rules in three short bullets (a sketch of a representative verifier follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where the model is supposed to fix implementation code while tests act as the judge (task6, task10, task11), the test files are SHA-256-pinned. A model that "wins" by deleting or weakening the failing tests gets a hard FAIL.&lt;/li&gt;
&lt;li&gt;Where the model writes its own tests (task8, task12), the verifier collects them with &lt;code&gt;pytest --collect-only&lt;/code&gt; and then applies a one-line &lt;strong&gt;mutation&lt;/strong&gt; to the implementation (e.g. &lt;code&gt;divide(a, b)&lt;/code&gt; becomes &lt;code&gt;a + b&lt;/code&gt;). Tests that don't catch the mutation get FAIL.&lt;/li&gt;
&lt;li&gt;For task9 (read-only investigation) the source files themselves are SHA-pinned.&lt;/li&gt;
&lt;/ul&gt;
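
&lt;p&gt;To make that concrete, here is a minimal sketch of a task12-style verifier. The file names and the exact &lt;code&gt;sed&lt;/code&gt; mutation are illustrative, not lifted from the repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# verify.sh: hedged sketch of the mutation-testing verifier (paths and mutation illustrative)
set -e

# The model-written tests must exist and must pass against the real implementation
pytest --collect-only -q
pytest -q

# Apply a one-line mutation to the implementation (e.g. divide becomes add) ...
cp calculator.py calculator.py.orig
sed -i 's|a / b|a + b|' calculator.py

# ... and the suite must now fail; a surviving mutant means the tests are too weak
if pytest -q &amp;gt;/dev/null 2&amp;gt;&amp;amp;1; then
  echo "FAIL: mutation survived"; mv calculator.py.orig calculator.py; exit 1
fi
mv calculator.py.orig calculator.py
echo "PASS"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;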

&lt;h2&gt;
  
  
  The runners
&lt;/h2&gt;

&lt;p&gt;Four local models, picked to span the practical 16GB-VRAM range:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;qwen3.6:35b-a3b-q4_K_M&lt;/code&gt; — Qwen3.6 35B-A3B (3B active per token). Released April 2026. Tool-calling-trained. ~23GB on disk → 7GB CPU offload on this GPU (see the &lt;code&gt;ollama ps&lt;/code&gt; check after this list).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;devstral:24b&lt;/code&gt; — Mistral × All Hands AI's open coding agent model. ~14GB → fits cleanly.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;gemma4:26b&lt;/code&gt; — Google's QAT int4 release. ~17GB → fits cleanly.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;qwen3.5:9b&lt;/code&gt; — the previous-generation Qwen, smaller and pure-dense. ~5.5GB → fits trivially. Included as a "do you actually need 24B+?" baseline.&lt;/li&gt;
&lt;/ul&gt;
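
&lt;p&gt;Whether a model actually fits is easy to check before committing to a full batch: load it once and read the &lt;code&gt;PROCESSOR&lt;/code&gt; column of &lt;code&gt;ollama ps&lt;/code&gt;. Output below abridged and numbers illustrative; the 50/50 split is what this GPU reports for the Q4_K_M build:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run qwen3.6:35b-a3b-q4_K_M "hi" &amp;gt;/dev/null   # force a load
ollama ps
# NAME                     ...  PROCESSOR          UNTIL
# qwen3.6:35b-a3b-q4_K_M   ...  50%/50% CPU/GPU    4 minutes from now
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;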

&lt;p&gt;Each model ran every task three times with &lt;code&gt;temperature=0&lt;/code&gt;, &lt;code&gt;seed=42&lt;/code&gt;, &lt;code&gt;num_ctx=8192&lt;/code&gt;, &lt;code&gt;think=false&lt;/code&gt; (Qwen3.6 is a thinking model; without this flag it spends every iteration on internal reasoning). &lt;code&gt;OLLAMA_KV_CACHE_TYPE=q8_0&lt;/code&gt;, &lt;code&gt;OLLAMA_FLASH_ATTENTION=1&lt;/code&gt;.&lt;/p&gt;
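
&lt;p&gt;For orientation: the two &lt;code&gt;OLLAMA_*&lt;/code&gt; variables are server-side settings, while the rest travel with each request. A hedged sketch of one such request against Ollama's chat API (message content illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -s http://localhost:11434/api/chat -d '{
  "model": "qwen3.6:35b-a3b-q4_K_M",
  "messages": [{"role": "user", "content": "Read hello.py and add a farewell function"}],
  "stream": false,
  "think": false,
  "options": { "temperature": 0, "seed": 42, "num_ctx": 8192 }
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;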

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Headline numbers across 11 tasks (latest batch per &lt;code&gt;(model, task)&lt;/code&gt;):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Pass rate&lt;/th&gt;
&lt;th&gt;Tasks fully passed&lt;/th&gt;
&lt;th&gt;Avg time/task&lt;/th&gt;
&lt;th&gt;Total tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;&lt;code&gt;qwen3.6-q3&lt;/code&gt;&lt;/strong&gt; (Qwen3.6-35B-A3B, UD-Q3_K_M)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;100%&lt;/strong&gt; (29/29)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11/11&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;82s&lt;/td&gt;
&lt;td&gt;532K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qwen3.5:9b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;82% (27/33)&lt;/td&gt;
&lt;td&gt;9/11&lt;/td&gt;
&lt;td&gt;24s&lt;/td&gt;
&lt;td&gt;627K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemma4:26b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;61% (20/33)&lt;/td&gt;
&lt;td&gt;6/11&lt;/td&gt;
&lt;td&gt;32s&lt;/td&gt;
&lt;td&gt;585K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;devstral:24b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;39% (13/33)&lt;/td&gt;
&lt;td&gt;4/11&lt;/td&gt;
&lt;td&gt;70s&lt;/td&gt;
&lt;td&gt;437K&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;(The vanilla &lt;code&gt;qwen3.6:35b-a3b-q4_K_M&lt;/code&gt; from Ollama also went 18/18 on the six common tasks it ran but is slower; see the Quant sweep section below.)&lt;/p&gt;

&lt;p&gt;Per-task in compact form:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;qwen3.6-q3&lt;/th&gt;
&lt;th&gt;qwen3.5:9b&lt;/th&gt;
&lt;th&gt;gemma4&lt;/th&gt;
&lt;th&gt;devstral&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;task1_hello&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️ 1/3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task2_typo&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task3_rename&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task6_debug&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task7_dedupe&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task8_cli_filter&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;⚠️ 1/3&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task9_investigate&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task10_security_fix&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task11_planning_chain&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️ 1/3&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task12_test_gen&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task13_typescript&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 9B is the surprise of this batch. It clears the multi-file rename, the typo fix, the planning chain, the SQL-injection patch, the debug task, and the TypeScript edit — at a third of the 35B's wall time. The two it loses (task7: extract a helper from four near-duplicates; task8: add a CLI flag with a full code path) are both &lt;em&gt;write-new-structure&lt;/em&gt; tasks. The 9B handles &lt;em&gt;modify-existing-code&lt;/em&gt; work cleanly; on &lt;em&gt;write-new-structure&lt;/em&gt; it falls behind.&lt;/p&gt;

&lt;p&gt;devstral and gemma4 fail in different shapes. devstral misses on simpler tasks too: &lt;code&gt;task1_hello&lt;/code&gt; 1/3 (gives up after one whitespace-mismatched &lt;code&gt;edit_file&lt;/code&gt;), &lt;code&gt;task7_dedupe&lt;/code&gt; 0/3 (the edits succeed but the refactor only deduplicates the validation guard, not the &lt;code&gt;round()&lt;/code&gt; call — verifier requires both). gemma4's biggest gaps are multi-file (task2/task3) and TypeScript. Some of those gemma4 failures, as the next section shows, were really &lt;em&gt;my&lt;/em&gt; agent's fault.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bench found three bugs in my agent first
&lt;/h2&gt;

&lt;p&gt;The first time I ran the suite the rankings were misleading: qwen3.6's pass rate looked closer to gemma4's than it should have. After fixing three Whet-side issues, qwen3.6 went to 100% and the real gap opened up. None of the three bugs were obvious before I had the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 1 — &lt;code&gt;apply_diff&lt;/code&gt; ignored multi-file diffs.&lt;/strong&gt; gemma4 likes to fix multi-file typos with a single unified diff containing three &lt;code&gt;--- file&lt;/code&gt; headers. Whet's &lt;code&gt;apply_diff&lt;/code&gt; ignored the headers and applied every hunk to the JSON &lt;code&gt;path&lt;/code&gt; argument, so hunks meant for files 2 and 3 hit file 1 with mismatched context and the call returned "context not found." Fix: parse &lt;code&gt;--- path&lt;/code&gt; headers between hunks and route each hunk group to its real file.&lt;/p&gt;
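
&lt;p&gt;For illustration, the shape of payload that tripped it: one unified diff with three &lt;code&gt;---&lt;/code&gt; headers. The file names are the task2 files; the hunk contents are made up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--- server.py
+++ server.py
@@ -10,1 +10,1 @@
-def recieve_request(payload):
+def receive_request(payload):
--- notes.md
+++ notes.md
@@ -3,1 +3,1 @@
-The handler will recieve JSON.
+The handler will receive JSON.
--- README.txt
+++ README.txt
@@ -1,1 +1,1 @@
-A server that can recieve webhook events.
+A server that can receive webhook events.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;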

&lt;p&gt;&lt;strong&gt;Bug 2 — hunk anchor was a hard line-number match.&lt;/strong&gt; When a model emits &lt;code&gt;@@ -44,3 @@&lt;/code&gt; but the actual context lives at line 39, real &lt;code&gt;git apply&lt;/code&gt; and &lt;code&gt;patch&lt;/code&gt; are tolerant; Whet wasn't. Fix: treat the &lt;code&gt;@@&lt;/code&gt; line numbers as a hint and locate the hunk by searching for the context+removal lines, picking the closest match.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 3 — my verifiers were reading my own logs.&lt;/strong&gt; This is the one that fooled me for half a day. After fixing &lt;code&gt;apply_diff&lt;/code&gt;, qwen3.6's task2_typo runs &lt;em&gt;still&lt;/em&gt; failed. The fixed files looked correct — every &lt;code&gt;recieve&lt;/code&gt; was now &lt;code&gt;receive&lt;/code&gt;. The verifier ran &lt;code&gt;grep -rEi 'recieve' .&lt;/code&gt; recursively and found … &lt;em&gt;eight matches&lt;/em&gt;. In &lt;code&gt;.stats.log&lt;/code&gt;. The harness was writing Whet's tool-call traces (which include the model's &lt;code&gt;-recieve&lt;/code&gt;/&lt;code&gt;+receive&lt;/code&gt; diffs) inside the workspace copy, and the verifier was scanning them as if they were task content. Fix: put &lt;code&gt;.stats.log&lt;/code&gt;, &lt;code&gt;.stdout.log&lt;/code&gt;, &lt;code&gt;.verify.log&lt;/code&gt; in a sibling &lt;code&gt;${run_dir}.logs&lt;/code&gt; directory.&lt;/p&gt;
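
&lt;p&gt;In harness terms the change is small, roughly this shape (variable and file names are a sketch, not lifted from &lt;code&gt;run_bench.sh&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Before: .stats.log / .stdout.log / .verify.log lived inside the workspace copy,
# so the verifier's recursive grep scanned them as task content.

# After: logs live in a sibling directory the verifier never sees
log_dir="${run_dir}.logs"
mkdir -p "$log_dir"
( cd "$run_dir" &amp;amp;&amp;amp; whet -p "$(cat "$task_dir/prompt.txt")" ) &amp;gt; "$log_dir/stdout.log" 2&amp;gt;&amp;amp;1
( cd "$run_dir" &amp;amp;&amp;amp; bash verify.sh ) &amp;gt; "$log_dir/verify.log" 2&amp;gt;&amp;amp;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;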

&lt;p&gt;The pass-rate trajectory for task2_typo across the three fixes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;State of the harness&lt;/th&gt;
&lt;th&gt;qwen3.6&lt;/th&gt;
&lt;th&gt;gemma4&lt;/th&gt;
&lt;th&gt;devstral&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Original bench&lt;/td&gt;
&lt;td&gt;0/3&lt;/td&gt;
&lt;td&gt;0/3&lt;/td&gt;
&lt;td&gt;0/3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After &lt;code&gt;apply_diff&lt;/code&gt; multi-file fix&lt;/td&gt;
&lt;td&gt;0/3&lt;/td&gt;
&lt;td&gt;0/3&lt;/td&gt;
&lt;td&gt;0/3 (174s avg — model retrying)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;After verifier-infra fix&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;3/3&lt;/strong&gt; ✅&lt;/td&gt;
&lt;td&gt;0/3&lt;/td&gt;
&lt;td&gt;0/3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The first two fixes were necessary but didn't move the needle on their own. The grep-the-logs bug was the one blocking the visible green — and you couldn't tell which of the three fixes was load-bearing until all three were in place and the cell flipped.&lt;/p&gt;

&lt;h2&gt;
  
  
  Picking the right quant matters as much as picking the right model
&lt;/h2&gt;

&lt;p&gt;Qwen3.6-35B-A3B has many quantizations. The default &lt;code&gt;qwen3.6:35b-a3b-q4_K_M&lt;/code&gt; from Ollama is 23GB on disk — 7GB over my GPU's VRAM, so a portion of the layers run on CPU (&lt;code&gt;ollama ps&lt;/code&gt; reports a 50/50 CPU/GPU split for this model on this hardware). Unsloth ships a &lt;code&gt;UD-Q3_K_M&lt;/code&gt; variant (~15GB) that fits cleanly into 16GB VRAM. I tested four configurations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;q4_K_M + KV f16 (default)&lt;/th&gt;
&lt;th&gt;q4_K_M + KV q8_0&lt;/th&gt;
&lt;th&gt;UD-Q3_K_M + KV q8_0&lt;/th&gt;
&lt;th&gt;UD-Q3_K_M + KV f16&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;task1_hello&lt;/td&gt;
&lt;td&gt;40s&lt;/td&gt;
&lt;td&gt;43s&lt;/td&gt;
&lt;td&gt;23s ¹&lt;/td&gt;
&lt;td&gt;27s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task2_typo&lt;/td&gt;
&lt;td&gt;(verify bug)&lt;/td&gt;
&lt;td&gt;104s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;59s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;55s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task3_rename&lt;/td&gt;
&lt;td&gt;(verify bug)&lt;/td&gt;
&lt;td&gt;74s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;51s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;46s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task6_debug&lt;/td&gt;
&lt;td&gt;67s&lt;/td&gt;
&lt;td&gt;77s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;52s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;188s&lt;/strong&gt; ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task7_dedupe&lt;/td&gt;
&lt;td&gt;47s&lt;/td&gt;
&lt;td&gt;46s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;34s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;34s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task8_cli_filter&lt;/td&gt;
&lt;td&gt;157s&lt;/td&gt;
&lt;td&gt;146s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;103s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;242s&lt;/strong&gt; ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;¹: warm load. The first run after a fresh &lt;code&gt;ollama&lt;/code&gt; session takes 100-170s while the model loads.&lt;br&gt;
⚠️: completion output ballooned. On task6 the model emitted ~2.4× the tokens (12K → 30K) and the average duration tripled. On task8 the token count grew more modestly (~37K → ~43K, +16%) but each run still took 2.4× as long, suggesting throughput dropped, not just length.&lt;/p&gt;

&lt;p&gt;Three lessons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;KV-cache quantization on Q4_K_M was roughly a wash, with a slight tilt toward regression. Per-task deltas across the four rows with a clean f16 baseline (task2 and task3 hit the verify bug under the default config) ranged from 7% faster to 15% slower (mean ≈ +3% slower). The V-cache dequantization overhead and the small VRAM savings cancel out when the model is already CPU-offloading: cutting the KV cache doesn't change which layers fit on the GPU.&lt;/li&gt;
&lt;li&gt;The same KV-q8_0 helped UD-Q3_K_M considerably on average (-21% across the six common tasks). The win was concentrated in task6 and task8, which were ~3× faster with q8_0; on task2 and task3 the q8_0 variant was actually 5-10% slower than f16. So "faster on average" hides a per-task split.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UD-Q3_K_M with FP16 KV blew up on task6 and task8.&lt;/strong&gt; Same model, same task code, same prompt — moving from 8-bit to 16-bit KV cache made task6 emit ~2.4× the tokens and pushed both tasks to ~3× the wall time. I don't have a clean explanation; the pattern was reproducible across runs in the same batch.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The clear winner: &lt;strong&gt;UD-Q3_K_M weights + KV q8_0 + Flash Attention.&lt;/strong&gt; The short recipe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Download the GGUF (~15GB) into ~/models&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;
curl &lt;span class="nt"&gt;-L&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Qwen3.6-35B-A3B-UD-Q3_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-Q3_K_M.gguf"&lt;/span&gt;

&lt;span class="c"&gt;# 2. Write a Modelfile (absolute path required — Ollama does not expand ~) and register it.&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/Modelfile.q3 &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
FROM &lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="sh"&gt;/models/Qwen3.6-35B-A3B-UD-Q3_K_M.gguf
RENDERER qwen3.5
PARSER qwen3.5
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;ollama create qwen3.6-q3 &lt;span class="nt"&gt;-f&lt;/span&gt; /tmp/Modelfile.q3

&lt;span class="c"&gt;# 3. Turn on KV q8_0 + Flash Attention via a systemd drop-in&lt;/span&gt;
&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /etc/systemd/system/ollama.service.d
&lt;span class="nb"&gt;sudo tee&lt;/span&gt; /etc/systemd/system/ollama.service.d/kv-cache.conf &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;/dev/null &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
[Service]
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_FLASH_ATTENTION=1"
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl daemon-reload &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then in &lt;code&gt;~/.whet/config.toml&lt;/code&gt;, point &lt;code&gt;[llm].model&lt;/code&gt; at &lt;code&gt;qwen3.6-q3&lt;/code&gt; and set &lt;code&gt;[llm.options]&lt;/code&gt; with &lt;code&gt;num_ctx=8192&lt;/code&gt;, &lt;code&gt;temperature=0.0&lt;/code&gt;, &lt;code&gt;seed=42&lt;/code&gt;, &lt;code&gt;think=false&lt;/code&gt;. Run &lt;code&gt;whet -m qwen3.6-q3&lt;/code&gt;.&lt;/p&gt;
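
&lt;p&gt;As a copy-pasteable sketch (section and key names follow the description above; worth double-checking against your Whet version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mkdir -p ~/.whet
cat &amp;gt; ~/.whet/config.toml &amp;lt;&amp;lt;'EOF'
[llm]
model = "qwen3.6-q3"

[llm.options]
num_ctx = 8192
temperature = 0.0
seed = 42
think = false
EOF
whet -m qwen3.6-q3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;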

&lt;h2&gt;
  
  
  Three failure modes I saw repeatedly
&lt;/h2&gt;

&lt;p&gt;Across 232 runs the model-side failures clustered into a handful of patterns. Three are worth a closer look.&lt;/p&gt;

&lt;h3&gt;
  
  
  "Read everything, edit nothing" — the early give-up
&lt;/h3&gt;

&lt;p&gt;This appeared mostly in devstral and gemma4 runs that did not progress past the read phase.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;gemma4 on task13_typescript, run 2 (warm load, after the node_modules pre-install fix)&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;4.6s&lt;/span&gt;
  &lt;span class="na"&gt;llm_calls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;       &lt;span class="m"&gt;4&lt;/span&gt;
  &lt;span class="na"&gt;completion tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;67&lt;/span&gt;
  &lt;span class="na"&gt;tool calls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;      &lt;span class="s"&gt;list_dir, read_file × 2     ← reads only&lt;/span&gt;
  &lt;span class="na"&gt;edits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;           &lt;span class="m"&gt;0&lt;/span&gt;
  &lt;span class="na"&gt;stdout response&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;(empty)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four LLM iterations, sixty-seven completion tokens, no editing tool ever invoked. After those three read-only calls the model produced no further tool calls and the agent loop exited (its termination condition is "the last response had no tool calls and no extracted text-mode tool calls"). The 67 generated tokens never reached &lt;code&gt;stdout&lt;/code&gt; — they were attached to the tool-calling iterations themselves, not to a final user-visible reply.&lt;/p&gt;

&lt;p&gt;Whet has a "looks like a question?" detector that re-prompts once when the model asks the user something instead of acting. That doesn't catch this case — the model isn't asking a question, it's just stopping. A fix here would be: detect "no tool calls and no terminal verifier evidence the task is done" and inject a re-prompt. Future work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Edit-tool whitespace thrash
&lt;/h3&gt;

&lt;p&gt;devstral on task2_typo, after I'd added &lt;code&gt;apply_diff&lt;/code&gt; multi-file support but before the verifier-infra fix. Aggregated across the three runs in that batch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    &lt;span class="s"&gt;174s avg per run&lt;/span&gt;
&lt;span class="na"&gt;tool calls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;41 × edit_file across 3 runs (~14 per run)&lt;/span&gt;
&lt;span class="na"&gt;edit targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;server.py 20, notes.md 12, README.txt 3, .stats.log &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;
&lt;span class="na"&gt;tool failed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;26/41&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model correctly identified the typo and &lt;em&gt;all three&lt;/em&gt; affected files, but its &lt;code&gt;edit_file&lt;/code&gt; calls used &lt;code&gt;old_text&lt;/code&gt; snippets that did not exactly match the file's whitespace. After each "text not found" error it retried with a slightly different snippet rather than switching tools or moving on. Each run hit Whet's max-iteration cap. The three &lt;code&gt;.stats.log&lt;/code&gt; edit attempts are a side-effect of bug 3 above: the model saw the in-workspace stats file containing its own diff text (&lt;code&gt;-recieve&lt;/code&gt;/&lt;code&gt;+receive&lt;/code&gt;) and tried to "fix the typo" there too.&lt;/p&gt;

&lt;p&gt;This is the classic &lt;code&gt;edit_file&lt;/code&gt; exact-match brittleness, plus a model that doesn't know to stop and try a different tool. A fuzzy-match tier inside &lt;code&gt;edit_file&lt;/code&gt; would rescue most of these — same fix as &lt;code&gt;apply_diff&lt;/code&gt;'s anchor matching, applied to the simpler tool. I haven't built it yet because qwen3.6-q3 (the recommended model) doesn't trigger it; the 9B and devstral do, and adding fuzzy match would let the 9B finish faster and pull devstral over the line on more multi-file tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Helpfully wrong: &lt;code&gt;npm install --save-dev&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;qwen3.6-q3 on task13_typescript, run 2 of 3 (before the node_modules pre-install fix):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1] read_file src/calc.ts
[2] read_file src/calc.test.ts
[3] edit_file src/calc.ts            ← added subtract() correctly
[4] edit_file src/calc.test.ts       ← updated import correctly
[5] edit_file src/calc.test.ts       ← added subtract test correctly
[6] shell    npx tsc --noEmit        ← failed: typescript not installed
[7] shell    npx tsx src/calc.test.ts ← failed: tsx not installed either
[8] shell    npm install typescript --save-dev &amp;amp;&amp;amp; npx tsc --noEmit  ← MUTATED package.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model completed the task correctly, then tried to verify its work, hit two missing tools, and bundled &lt;code&gt;npm install --save-dev&lt;/code&gt; with &lt;code&gt;tsc&lt;/code&gt; in a single shell call. The &lt;code&gt;--save-dev&lt;/code&gt; rewrites &lt;code&gt;package.json&lt;/code&gt;. The verifier had &lt;code&gt;package.json&lt;/code&gt; SHA-pinned to block the model from disabling failing tests by editing config — and it caught this install instead.&lt;/p&gt;

&lt;p&gt;This is reasonable model behaviour that got penalized by defensive harness infrastructure. The fix was not on the model side — it was to ship &lt;code&gt;node_modules/&lt;/code&gt; pre-populated in the workspace and to tell the model up front: "deps are already installed, do not run npm install." After that change qwen3.6-q3 went from 1/3 partial to 3/3.&lt;/p&gt;
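
&lt;p&gt;The workspace-side prep is a one-time step, roughly (package names from the trace above; &lt;code&gt;--no-save&lt;/code&gt; keeps &lt;code&gt;package.json&lt;/code&gt; untouched):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Populate node_modules/ once in the task workspace without touching package.json
( cd benchmarks/task13_typescript/workspace &amp;amp;&amp;amp; npm install --no-save typescript tsx )

# ...and say so in prompt.txt, e.g.:
#   "Dependencies are already installed in node_modules/. Do not run npm install."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;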

&lt;h2&gt;
  
  
  Tool selection matters as much as model capability
&lt;/h2&gt;

&lt;p&gt;The benchmark harness writes every tool call to a &lt;code&gt;stats.log&lt;/code&gt; file. Aggregated over the latest batch, the per-model histograms tell their own story.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;task2_typo edit calls (one task, totals across 3 runs each)&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;qwen3.6-q3&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                &lt;span class="s"&gt;15 × edit_file,  0 × apply_diff       (0 failed)&lt;/span&gt;
  &lt;span class="na"&gt;qwen3.6:35b-a3b-q4_K_M&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    &lt;span class="s"&gt;27 × edit_file,  0 × apply_diff       (0 failed)&lt;/span&gt;
  &lt;span class="na"&gt;qwen3.5:9b&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                &lt;span class="s"&gt;21 × edit_file,  0 × apply_diff      (12 failed, then succeeded)&lt;/span&gt;
  &lt;span class="na"&gt;gemma4:26b&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                 &lt;span class="s"&gt;6 × edit_file,  9 × apply_diff       (apply_diff path)&lt;/span&gt;
  &lt;span class="na"&gt;devstral:24b&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;              &lt;span class="s"&gt;21 × edit_file,  0 × apply_diff      (13 failed, gave up)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All three Qwen models went through &lt;code&gt;edit_file&lt;/code&gt; exclusively. gemma4 reached for &lt;code&gt;apply_diff&lt;/code&gt; — a semantically equivalent choice that also happened to exercise the multi-file routing bug described above. Two different paths to the same task, with very different harness experiences.&lt;/p&gt;

&lt;p&gt;The 9B and devstral both hit the same wall (whitespace mismatches), but only the 9B got past it: it adjusted the &lt;code&gt;old_text&lt;/code&gt; snippet on retry, devstral retried near-identical snippets until the iteration cap. Persistence shape, not just persistence count.&lt;/p&gt;

&lt;p&gt;A similar split shows up on read-heavy tasks. On task9_investigate the per-run tool mix was 6 &lt;code&gt;read_file&lt;/code&gt; + 1 &lt;code&gt;repo_map&lt;/code&gt; + 1 &lt;code&gt;list_dir&lt;/code&gt; for qwen3.6-q3, 5 reads + 1 &lt;code&gt;list_dir&lt;/code&gt; for the 9B, and 4 reads + 2 &lt;code&gt;list_dir&lt;/code&gt; for gemma4. All three passed — for read-only work, the difference is just search depth. It only matters once the chosen tool has to &lt;em&gt;do&lt;/em&gt; something.&lt;/p&gt;

&lt;p&gt;A model that consistently picks tools its harness handles well can outperform a more capable model that picks tools whose implementation has rough edges. Easy to miss when comparing LLMs head-to-head.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One GPU, one configuration.&lt;/strong&gt; RTX 5060 Ti, 16GB VRAM, Blackwell. On a 24GB or 48GB card the rankings could shift — Q4_K_M wouldn't need offload anymore, gemma4's speed advantage shrinks, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python-heavy.&lt;/strong&gt; 10 of 11 tasks are Python. The single TypeScript task is enough to show that gemma4/devstral struggle with non-Python ecosystems, but I wouldn't claim much beyond that. Rust/Go/Java tasks are future work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Four tasks are effectively calibration tasks.&lt;/strong&gt; task6_debug, task9_investigate, task10_security_fix, and task12_test_gen were passed by all four models. They're useful for catching regressions but they don't differentiate the models in this lineup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;temperature=0&lt;/code&gt; and &lt;code&gt;seed=42&lt;/code&gt; did not produce fully deterministic runs&lt;/strong&gt; for Qwen3.6-35B-A3B. MoE expert routing has small non-determinism that shows up as ±5% token-count variance between runs. I report mean values across &lt;code&gt;n=3&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Author bias.&lt;/strong&gt; I built Whet. When a model fails because of a Whet-side bug I'm motivated to fix Whet, not to penalize the model. A different reviewer might decide that "model x failed at multi-file diff because Whet's &lt;code&gt;apply_diff&lt;/code&gt; was buggy" should still count as a model failure for the purposes of choosing a model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;232 runs is small.&lt;/strong&gt; For headline rankings I'm comfortable. For "is gemma4's task11 really 1/3 partial or just unlucky?" I'm not.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build a benchmark before believing a benchmark.&lt;/strong&gt; The first time I ran my own suite, two of the three biggest signals were artifacts of bugs in my benchmark harness or my agent — not in the models. If I'd published rankings off that data I'd be wrong in print.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The most ergonomic model finds the fewest agent bugs.&lt;/strong&gt; All three Whet bugs surfaced through devstral and gemma4 failures. qwen3.6 has such a strong preference for &lt;code&gt;edit_file&lt;/code&gt; that it never exercised &lt;code&gt;apply_diff&lt;/code&gt; and never tripped its multi-file routing bug. If I'd benchmarked only qwen3.6 my agent would still be broken.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Quantization choice can be worth as much as model choice.&lt;/strong&gt; Same model file at UD-Q3_K_M instead of Q4_K_M was ~33% faster on average and never lost a task. Same model file at FP16 KV instead of q8_0 KV blew up on two specific tasks. Run the sweep on your hardware before settling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You may not need the latest, biggest model.&lt;/strong&gt; &lt;code&gt;qwen3.5:9b&lt;/code&gt; — older generation &lt;em&gt;and&lt;/em&gt; one quarter the parameter count — passed 9/11 tasks at 24s/run average, about a third of qwen3.6-q3's 82s. The two it failed (task7_dedupe, task8_cli_filter) were both &lt;em&gt;write-new-structure&lt;/em&gt; tasks. Modify-existing-code work — multi-file rename, typo fix, planning chain, debug, security patch, TypeScript edit — it handled cleanly. The 35B's headline 100% is real, but the &lt;em&gt;delta&lt;/em&gt; between 82% and 100% is exactly those two tasks. Knowing which class of work you do most decides whether that delta is worth 3.4× the latency. (Caveat: this crosses two axes, size and generation. Without &lt;code&gt;qwen3.5:35B&lt;/code&gt; or &lt;code&gt;qwen3.6:9b&lt;/code&gt; in the run I can't separate them.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Realistic-task benchmarks differ from synthetic benchmarks more than I expected.&lt;/strong&gt; Four tasks (task6, task9, task10, task12) were passed by all four models — they catch regressions but don't rank the lineup. The other seven (task1_hello, task2, task3, task7, task8, task11, task13 — single-file edit, multi-file work, refactor, planning, non-Python) made the differences sharp. If you only test the cases everyone passes you'll buy speed at the cost of correctness without realizing it.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;The benchmark suite, the analysis scripts, and a per-run leaderboard generator are in &lt;a href="https://github.com/kuroko1t/whet/tree/main/benchmarks" rel="noopener noreferrer"&gt;whet on GitHub&lt;/a&gt;. To reproduce on your own hardware:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/kuroko1t/whet
&lt;span class="nb"&gt;cd &lt;/span&gt;whet
cargo &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--path&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
ollama pull qwen3.6:35b-a3b-q4_K_M    &lt;span class="c"&gt;# or use the UD-Q3_K_M recipe above&lt;/span&gt;
scripts/run_bench.sh &lt;span class="nt"&gt;-m&lt;/span&gt; qwen3.6:35b-a3b-q4_K_M &lt;span class="nt"&gt;-n&lt;/span&gt; 3
&lt;span class="nb"&gt;cat &lt;/span&gt;benchmarks/results/leaderboard.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'd be curious to see the same suite run on a 24GB or 48GB card. If you do, send me the JSONL.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Whet is a Rust-based coding agent for local LLMs. &lt;a href="https://github.com/kuroko1t/whet" rel="noopener noreferrer"&gt;Source on GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>rust</category>
      <category>ollama</category>
      <category>benchmark</category>
    </item>
    <item>
      <title>I Built a Tool to Stop Losing My Claude Code Conversation History</title>
      <dc:creator>kuroko</dc:creator>
      <pubDate>Sat, 14 Mar 2026 03:02:40 +0000</pubDate>
      <link>https://dev.to/kuroko1t/i-built-a-tool-to-stop-losing-my-claude-code-conversation-history-5500</link>
      <guid>https://dev.to/kuroko1t/i-built-a-tool-to-stop-losing-my-claude-code-conversation-history-5500</guid>
      <description>&lt;p&gt;A few weeks ago I needed to revisit a debugging session. Claude had walked me through a nasty race condition in my app — it took over an hour, and the fix was subtle. I knew exactly which session it was.&lt;/p&gt;

&lt;p&gt;I went to find the JSONL file. Gone. No warning, no "this file will be deleted in 3 days." Just gone.&lt;/p&gt;

&lt;p&gt;If you've been using Claude Code for more than a couple of months, this has probably happened to you too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wait, Claude Code Deletes My History?
&lt;/h2&gt;

&lt;p&gt;Yeah. Claude Code stores conversations as JSONL files under &lt;code&gt;~/.claude/projects/&lt;/code&gt;, and old files are &lt;a href="https://github.com/anthropics/claude-code/issues/4172" rel="noopener noreferrer"&gt;automatically deleted over time&lt;/a&gt;. You can change this in settings, but that only solves the auto-deletion problem. &lt;code&gt;/compact&lt;/code&gt; still lossy-summarizes your context, and version updates can &lt;a href="https://github.com/anthropics/claude-code/issues/29154" rel="noopener noreferrer"&gt;break session compatibility&lt;/a&gt;. Even with deletion disabled, JSONL files are scattered across directories with no way to search across sessions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Tried (and Why It Wasn't Enough)
&lt;/h2&gt;

&lt;p&gt;I tried &lt;a href="https://github.com/raine/claude-history" rel="noopener noreferrer"&gt;claude-history&lt;/a&gt; (Rust TUI) and &lt;a href="https://github.com/jhlee0409/claude-code-history-viewer" rel="noopener noreferrer"&gt;Claude Code History Viewer&lt;/a&gt; (desktop app). Both are great for browsing, but they read JSONL files directly — once those files get deleted, they can't show you anything either. &lt;a href="https://github.com/thedotmack/claude-mem" rel="noopener noreferrer"&gt;claude-mem&lt;/a&gt; does persist data into its own database, but it's a full memory system with Node.js, MCP server, and semantic search — more than I needed. I just wanted to archive conversations before they disappear.&lt;/p&gt;

&lt;p&gt;What I was missing: a simple, durable archive I could set up once and forget about.&lt;/p&gt;

&lt;h2&gt;
  
  
  So I Built One
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/kuroko1t/claude-vault" rel="noopener noreferrer"&gt;claude-vault&lt;/a&gt; is a single Rust binary that imports your Claude Code conversations into SQLite with full-text search. No Node.js, no Python, no MCP server — just download and run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude-vault import
&lt;span class="c"&gt;# Imported 94562 messages (0 skipped, 12847 filtered, 0 errors) from 203 files&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once conversations are in SQLite, they survive file deletion, compaction, updates — whatever happens to the original JSONL files.&lt;/p&gt;

&lt;h3&gt;
  
  
  What About All the Noise?
&lt;/h3&gt;

&lt;p&gt;If you've ever opened a Claude Code JSONL file, you know it's mostly noise — tool results, system tags, file read outputs, progress messages. claude-vault strips all of that during import, keeping only what matters: your questions, Claude's responses, and code-modifying actions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Search That Actually Works
&lt;/h3&gt;

&lt;p&gt;Search uses FTS5 with Porter stemming, so "running" matches "run" and "configurations" matches "configure":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude-vault search &lt;span class="s2"&gt;"race condition fix"&lt;/span&gt;
claude-vault search &lt;span class="s2"&gt;"deploy"&lt;/span&gt; &lt;span class="nt"&gt;--project&lt;/span&gt; my-app &lt;span class="nt"&gt;--since&lt;/span&gt; 2025-01-01
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also pipe JSON output to Claude itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude-vault search &lt;span class="s2"&gt;"previous auth implementation"&lt;/span&gt; &lt;span class="nt"&gt;--json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
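
&lt;p&gt;For example, pulling an old session back into a new conversation: a hypothetical pipeline that assumes Claude Code's non-interactive &lt;code&gt;-p&lt;/code&gt; mode and a result set small enough to use as context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude-vault search "previous auth implementation" --json \
  | claude -p "These are excerpts from an earlier session. Summarize how auth was implemented."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;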



&lt;h2&gt;
  
  
  The Part That Made It Actually Useful: Hooks
&lt;/h2&gt;

&lt;p&gt;Manually running &lt;code&gt;import&lt;/code&gt; is fine, but I kept forgetting. The real fix was hooking it into Claude Code's lifecycle. Add this to &lt;code&gt;~/.claude/settings.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"PreCompact"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude-vault import &amp;gt;/dev/null 2&amp;gt;&amp;amp;1"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"SessionEnd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude-vault import &amp;gt;/dev/null 2&amp;gt;&amp;amp;1 &amp;amp;"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;PreCompact&lt;/strong&gt; captures the full conversation before &lt;code&gt;/compact&lt;/code&gt; summarizes it. &lt;strong&gt;SessionEnd&lt;/strong&gt; archives in the background when you exit. Once set up, I never think about it — every session is archived automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Doesn't Do (Honest Assessment)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;It's an &lt;strong&gt;archive&lt;/strong&gt;, not a memory system. It won't inject past context into new sessions automatically.&lt;/li&gt;
&lt;li&gt;It's &lt;strong&gt;CLI-only&lt;/strong&gt;. If you want a TUI, &lt;a href="https://github.com/raine/claude-history" rel="noopener noreferrer"&gt;claude-history&lt;/a&gt; is great.&lt;/li&gt;
&lt;li&gt;No semantic search — it's keyword-based FTS5 with stemming.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It does one thing: makes sure your conversations don't disappear. That's it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo &lt;span class="nb"&gt;install &lt;/span&gt;claude-vault
&lt;span class="c"&gt;# or download a prebuilt binary from GitHub Releases&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Seriously, run &lt;code&gt;claude-vault import&lt;/code&gt; now. If you've been using Claude Code for a while, some of your old sessions might already be gone — archive what's left before it's too late.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kuroko1t/claude-vault" rel="noopener noreferrer"&gt;GitHub: kuroko1t/claude-vault&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you lost Claude Code sessions you wish you could get back? What's your approach to preserving conversation history? I'd love to hear what others are doing.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>productivity</category>
      <category>devtools</category>
    </item>
    <item>
      <title>What Happens When Local LLMs Fail at Tool Calling — Testing 7 Models with a Rust Coding Agent</title>
      <dc:creator>kuroko</dc:creator>
      <pubDate>Sun, 01 Mar 2026 14:28:05 +0000</pubDate>
      <link>https://dev.to/kuroko1t/what-happens-when-local-llms-fail-at-tool-calling-testing-7-models-with-a-rust-coding-agent-cep</link>
      <guid>https://dev.to/kuroko1t/what-happens-when-local-llms-fail-at-tool-calling-testing-7-models-with-a-rust-coding-agent-cep</guid>
      <description>&lt;p&gt;I tested 7 local LLMs on the same simple coding task. 4 succeeded. 3 failed — each in a different way. One model burned 30K tokens retrying the exact same broken call because my system prompt told it to.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://github.com/kuroko1t/whet" rel="noopener noreferrer"&gt;Whet&lt;/a&gt;, a coding agent written in Rust. It connects to local LLMs through Ollama and gives them tools — read files, edit files, run shell commands, search code — so the model can actually modify your project instead of just suggesting changes. Think of it as a local, open-source alternative to tools like Claude Code or Cursor, but running entirely on your machine with whatever model you choose.&lt;/p&gt;

&lt;p&gt;The key mechanism is &lt;strong&gt;tool calling&lt;/strong&gt;: instead of the model printing "you should edit line 5," the model returns a structured API call like &lt;code&gt;edit_file(path, old_text, new_text)&lt;/code&gt;, and the agent executes it. When this works, the model can autonomously chain multiple tools to complete a task. When it breaks, things get interesting.&lt;/p&gt;

&lt;p&gt;This article documents the failure patterns I found, which ones were the model's fault vs. my agent's fault, and what I did about it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important caveat&lt;/strong&gt;: I built Whet as a personal project, so I'm biased toward finding and fixing issues in my own agent rather than blaming models. The "model vs agent" distinction below is my interpretation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Agent&lt;/strong&gt;: &lt;a href="https://github.com/kuroko1t/whet" rel="noopener noreferrer"&gt;Whet&lt;/a&gt; — a single-binary Rust coding agent with 9 built-in tools (read_file, edit_file, shell, grep, etc.) plus optional web tools&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task&lt;/strong&gt;: "Read hello.py and add a farewell function"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# hello.py (before)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;greet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;greet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;World&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple enough that any tool-calling model should handle it. The expected tool chain is: &lt;code&gt;read_file&lt;/code&gt; → &lt;code&gt;edit_file&lt;/code&gt;. Two calls, done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Models&lt;/strong&gt;: 7 models available via Ollama, ranging from 7B to 24B parameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mode&lt;/strong&gt;: Yolo (auto-approve all tool calls). Max 10 iterations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to reproduce&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo &lt;span class="nb"&gt;install &lt;/span&gt;whet
ollama pull qwen3:8b  &lt;span class="c"&gt;# or any model below&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'def greet(name):
    return f"Hello, {name}!"

if __name__ == "__main__":
    print(greet("World"))'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; hello.py
whet &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Read hello.py and add a farewell function"&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; qwen3:8b &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Tool Calls&lt;/th&gt;
&lt;th&gt;Failure Pattern&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;devstral-small-2&lt;/td&gt;
&lt;td&gt;24B&lt;/td&gt;
&lt;td&gt;Pass&lt;/td&gt;
&lt;td&gt;5,990&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;glm-4.7-flash&lt;/td&gt;
&lt;td&gt;19B&lt;/td&gt;
&lt;td&gt;Pass&lt;/td&gt;
&lt;td&gt;6,684&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3:8b&lt;/td&gt;
&lt;td&gt;8B&lt;/td&gt;
&lt;td&gt;Pass&lt;/td&gt;
&lt;td&gt;6,895&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3:14b&lt;/td&gt;
&lt;td&gt;14B&lt;/td&gt;
&lt;td&gt;Pass&lt;/td&gt;
&lt;td&gt;8,946&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5:14b&lt;/td&gt;
&lt;td&gt;14B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Fail&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6,013&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Wrong &lt;code&gt;old_text&lt;/code&gt;, gave up&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5:7b&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Fail&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3,801&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Read file, asked user instead of editing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5-coder:14b&lt;/td&gt;
&lt;td&gt;14B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Fail&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1,873&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Output JSON as text instead of calling tool&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;4 passed. 3 failed. Parameter count didn't predict success — qwen3:8b (8B) passed while qwen2.5-coder:14b (14B) failed.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Success Looks Like
&lt;/h2&gt;

&lt;p&gt;Before the failures, here's a successful run (devstral-small-2, 5,990 tokens):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1] read_file {"path": "hello.py"}
    → returned file content (5 lines)

[2] edit_file {"path": "hello.py", "old_text": "if __name__...", "new_text": "def farewell..."}
    → added farewell function ✓

Done. Task complete.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two tool calls, clean execution. The model read the file, understood the structure, wrote a valid edit, and stopped. This is what all 7 models should have done.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Failure Patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pattern 1: Refusing to Act (qwen2.5:7b)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[tool: read_file] {"path":"hello.py"}  ← only tool call

"Should I edit the file?"  ← asked user instead of editing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model read the file successfully, then asked for permission instead of using &lt;code&gt;edit_file&lt;/code&gt;. The system prompt says "ACT, DON'T ASK" — the model ignored it. 1 tool call, 3,801 tokens, task incomplete.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 2: Tool Format Confusion (qwen2.5-coder:14b)
&lt;/h3&gt;

&lt;p&gt;The model output what &lt;em&gt;looks like&lt;/em&gt; a tool call, but as plain text instead of using the API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# What the model printed (as text, NOT an actual tool call):
{"name": "read_file", "arguments": {"path": "hello.py"}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model understood it needed to call &lt;code&gt;read_file&lt;/code&gt;, but output the JSON as text inside a markdown code block instead of using the tool calling API. Zero actual tool calls. 1,873 tokens wasted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 3: Retry Loop (qwen3:14b)
&lt;/h3&gt;

&lt;p&gt;This was the most interesting failure because it was &lt;strong&gt;both the model's and my agent's fault&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Iteration&lt;/th&gt;
&lt;th&gt;Tool Call&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;read_file {"path": "hello.py"}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;shell {"command": "cat hello.py"}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Error&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;shell {"command": "cat hello.py"}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Error&lt;/strong&gt; (same)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;shell {"command": "cat hello.py"}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Error&lt;/strong&gt; (same)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;(max iterations)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Gave up&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;30K tokens. 10+ tool calls. The model hit an error on &lt;code&gt;shell&lt;/code&gt;, then repeated the exact same call 5+ times. It never tried a different approach.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model side&lt;/strong&gt;: qwen3:14b didn't adapt after seeing the error. Other models (qwen3:8b, devstral) changed their approach on failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent side&lt;/strong&gt;: My system prompt said &lt;em&gt;"if shell command fails: read the error output, fix the issue, and retry"&lt;/em&gt; — which the model interpreted literally as "call the same thing again."&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What I Did About It
&lt;/h2&gt;

&lt;p&gt;Pattern 3 was the most actionable. One line added to the system prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- NEVER repeat the same failing tool call more than once.
  If it failed, change your approach (different arguments,
  different tool, or ask the user).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;qwen3:14b (before)&lt;/th&gt;
&lt;th&gt;qwen3:14b (after)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Task completed&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total tokens&lt;/td&gt;
&lt;td&gt;~30,000&lt;/td&gt;
&lt;td&gt;8,946&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool calls&lt;/td&gt;
&lt;td&gt;10+&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool success rate&lt;/td&gt;
&lt;td&gt;&amp;lt; 20%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One line of prompt turned a 30K-token failure into a 9K-token success.&lt;/p&gt;

&lt;p&gt;For the other two patterns, I added agent-level recovery:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pattern 2 (JSON as text)&lt;/strong&gt;: A fallback parser that scans the model's text output for JSON objects matching the tool call format and executes them. This successfully extracted &lt;code&gt;read_file&lt;/code&gt; calls from qwen2.5-coder:14b's text output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pattern 1 (refusing to act)&lt;/strong&gt;: A question detector that catches when the model asks instead of acting, and re-prompts it to use tools instead of asking. This fired in 3 out of 5 test runs with qwen2.5:7b.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both helped partially, but neither is a complete fix — ultimately the model needs to use the tool calling API correctly.&lt;/p&gt;
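
&lt;p&gt;Whet's recovery code is Rust, but the idea is easy to sketch in Python. This is an illustration of the approach, not Whet's actual implementation; the tool list and function names are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json

# Tools the agent exposes (illustrative list, not Whet's full set)
TOOL_NAMES = {"read_file", "edit_file", "shell"}

def extract_tool_calls(text):
    """Scan free-form model output for JSON objects shaped like tool calls."""
    decoder = json.JSONDecoder()
    calls = []
    i = text.find("{")
    while i != -1:
        try:
            obj, end = decoder.raw_decode(text, i)
        except json.JSONDecodeError:
            i = text.find("{", i + 1)
            continue
        if (isinstance(obj, dict)
                and obj.get("name") in TOOL_NAMES
                and isinstance(obj.get("arguments"), dict)):
            calls.append(obj)
        i = text.find("{", end)
    return calls

def asked_instead_of_acting(text):
    """Crude question detector: did the model end its turn with a question?"""
    return text.strip().endswith("?")

# The literal text qwen2.5-coder:14b printed instead of calling the API:
sample = '{"name": "read_file", "arguments": {"path": "hello.py"}}'
print(extract_tool_calls(sample))
# [{'name': 'read_file', 'arguments': {'path': 'hello.py'}}]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;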




&lt;h2&gt;
  
  
  What the Data Shows
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Model generation matters more than size
&lt;/h3&gt;

&lt;p&gt;All three qwen2.5 models failed. Both qwen3 models passed (after the prompt fix). devstral-small-2 and glm-4.7-flash also passed. The qwen3/qwen2.5 boundary is a clearer predictor of tool-calling success than parameter count.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Each failure is different
&lt;/h3&gt;

&lt;p&gt;The three failing models broke in three distinct ways: refusing to act, format confusion, retry loops. There's no single "tool calling doesn't work" failure mode — each model fails differently, which means each failure needs different investigation.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Agent bugs hide behind smart models
&lt;/h3&gt;

&lt;p&gt;qwen3:8b and devstral never triggered the retry loop bug because they recover gracefully from errors. If I'd only tested with these models, the prompt bug would still be in my code. The "worst" model (qwen3:14b pre-fix) was the most useful for finding agent bugs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single task&lt;/strong&gt;: These results are from one task. A model that passes "add a function" might fail at "debug a test failure" or "refactor across files." I'm working on a broader benchmark.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-deterministic&lt;/strong&gt;: LLM outputs vary between runs. qwen2.5:14b might succeed on a retry. I ran each model once for the initial results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama-specific&lt;/strong&gt;: Results may differ with other inference engines (llama.cpp, vLLM). Tool calling implementation varies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Author bias&lt;/strong&gt;: I built Whet. I'm inclined to fix my agent rather than blame models. Another developer might classify some "agent bugs" as "model limitations" or vice versa.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test with multiple models, not just the best one.&lt;/strong&gt; Smart models hide agent bugs by working around them. The model that fails the most dramatically teaches you the most about your agent's weaknesses.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;"Retry on failure" is dangerous prompt guidance.&lt;/strong&gt; Humans understand "retry" as "try differently." LLMs may read it as "call the exact same function again." Be explicit about what NOT to do.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Check the generation, not just the size.&lt;/strong&gt; qwen3:8b (8B) outperformed qwen2.5-coder:14b (14B) at tool calling. Newer model families tend to have better tool-use training regardless of parameter count.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The agent can compensate — partially.&lt;/strong&gt; JSON fallback parsing and question re-prompting helped, but the biggest win was a one-line prompt fix. Invest in your system prompt before building workarounds.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;The code is &lt;a href="https://github.com/kuroko1t/whet" rel="noopener noreferrer"&gt;open source&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>llm</category>
      <category>rust</category>
      <category>ai</category>
      <category>agents</category>
    </item>
    <item>
      <title>How Accessibility Tree Formatting Affects Token Cost in Browser MCPs</title>
      <dc:creator>kuroko</dc:creator>
      <pubDate>Thu, 26 Feb 2026 07:58:44 +0000</pubDate>
      <link>https://dev.to/kuroko1t/how-accessibility-tree-formatting-affects-token-cost-in-browser-mcps-n2a</link>
      <guid>https://dev.to/kuroko1t/how-accessibility-tree-formatting-affects-token-cost-in-browser-mcps-n2a</guid>
      <description>&lt;p&gt;Token cost in browser automation MCPs has become a real topic — articles like &lt;a href="https://scrolltest.medium.com/playwright-mcp-burns-114k-tokens-per-test-the-new-cli-uses-27k-heres-when-to-use-each-65dabeaac7a0" rel="noopener noreferrer"&gt;"Playwright MCP Burns 114K Tokens Per Test"&lt;/a&gt; have been making the rounds. Tools are approaching this from different angles: Playwright MCP's &lt;code&gt;--output-mode file&lt;/code&gt; option saves snapshots to disk instead of returning them in LLM context, Vercel's &lt;a href="https://github.com/vercel-labs/agent-browser" rel="noopener noreferrer"&gt;agent-browser&lt;/a&gt; compresses DOM state to a fraction of the original, and some tools add vision-based fallbacks for layout understanding.&lt;/p&gt;

&lt;p&gt;I've been working on &lt;a href="https://github.com/kuroko1t/webclaw" rel="noopener noreferrer"&gt;WebClaw&lt;/a&gt;, an open-source Chrome extension-based browser MCP. It takes the accessibility tree approach like Playwright MCP, but with a more compact format. I wanted to measure the actual difference — not guess, but measure — so I set up a side-by-side test.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Measured
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Versions tested:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Playwright MCP: &lt;code&gt;@playwright/mcp&lt;/code&gt; v0.0.68 (&lt;code&gt;npx @playwright/mcp@0.0.68 --headless&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;WebClaw: &lt;code&gt;webclaw-mcp&lt;/code&gt; v0.9.0 + Chrome extension v0.9.0&lt;/li&gt;
&lt;li&gt;Measured: February 26, 2026&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I registered both &lt;a href="https://github.com/microsoft/playwright-mcp" rel="noopener noreferrer"&gt;Playwright MCP&lt;/a&gt; and WebClaw as MCP servers in the &lt;strong&gt;same Claude Code session&lt;/strong&gt;, then ran the same steps on each:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Navigate to the target URL&lt;/li&gt;
&lt;li&gt;Call the snapshot tool (&lt;code&gt;browser_snapshot&lt;/code&gt; / &lt;code&gt;page_snapshot&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Measure the full response text length in characters&lt;/li&gt;
&lt;li&gt;Estimate tokens as &lt;code&gt;characters / 4&lt;/code&gt; (a rough approximation, sketched below; actual tokenization varies by model)&lt;/li&gt;
&lt;/ol&gt;
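
&lt;p&gt;Step 4 in code form, roughly (the helper names are mine, not part of either tool):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def estimate_tokens(snapshot_text):
    """Rough token estimate: about 4 characters per token."""
    return round(len(snapshot_text) / 4)

def percent_smaller(playwright_chars, webclaw_chars):
    """How much smaller the WebClaw snapshot is, as a percentage."""
    return round((1 - webclaw_chars / playwright_chars) * 100)

# Using the GitHub measurement from the results table below:
print(estimate_tokens("x" * 77637))    # 19409
print(estimate_tokens("x" * 17215))    # 4304
print(percent_smaller(77637, 17215))   # 78
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;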

&lt;p&gt;&lt;strong&gt;Both tools return the complete accessibility tree with no truncation.&lt;/strong&gt; WebClaw's default is unlimited output (no token budget), so this is a pure format efficiency comparison.&lt;/p&gt;

&lt;p&gt;I picked three pages with different content patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Wikipedia&lt;/strong&gt; — long article with many reference links and navigation templates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt; — repository page with file listing, README, and sidebar&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hacker News&lt;/strong&gt; — list-style page with 30 items&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Important caveat on fairness:&lt;/strong&gt; Playwright MCP runs a headless Chromium (not logged in). WebClaw runs in the user's Chrome (logged in to GitHub in my case). This means WebClaw sees &lt;em&gt;more&lt;/em&gt; UI on GitHub — authenticated menus, notifications, repo actions — which actually increases its output. The comparison is biased against WebClaw on that page.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results: Format Efficiency
&lt;/h2&gt;

&lt;p&gt;Both tools returning full, untruncated accessibility trees:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Site&lt;/th&gt;
&lt;th&gt;Playwright MCP&lt;/th&gt;
&lt;th&gt;WebClaw&lt;/th&gt;
&lt;th&gt;Difference&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://en.wikipedia.org/wiki/Model_Context_Protocol" rel="noopener noreferrer"&gt;Wikipedia (MCP article)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;16,044 tokens (64,176 chars)&lt;/td&gt;
&lt;td&gt;7,860 tokens (31,439 chars)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;51% smaller&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/anthropics/claude-cookbooks" rel="noopener noreferrer"&gt;GitHub (anthropics/claude-cookbooks)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;19,409 tokens (77,637 chars)&lt;/td&gt;
&lt;td&gt;4,304 tokens (17,215 chars)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78% smaller&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://news.ycombinator.com/" rel="noopener noreferrer"&gt;Hacker News (front page)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;14,547 tokens (58,189 chars)&lt;/td&gt;
&lt;td&gt;3,052 tokens (12,207 chars)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;79% smaller&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The range is &lt;strong&gt;51% to 79%&lt;/strong&gt; depending on the page. Let me dig into why.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Creates the Difference
&lt;/h2&gt;

&lt;p&gt;Comparing the actual output for the same Wikipedia page:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Playwright MCP&lt;/strong&gt; (&lt;code&gt;browser_snapshot&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;generic [active] [ref=e1]&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;link "Jump to content" [ref=e2] [cursor=pointer]&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;/url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#bodyContent"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;banner [ref=e4]&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;navigation "Site" [ref=e6]&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;generic "Main menu" [ref=e7]&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;button "Main menu" [ref=e8] [cursor=pointer]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;WebClaw&lt;/strong&gt; (&lt;code&gt;page_snapshot&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[page "Model Context Protocol - Wikipedia"]
 [banner]
  [nav "Site"]
  [@e2 link]
 [search]
  [@e3 searchbox "Search Wikipedia"]
  [@e4 button "Search"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference comes down to design choices — each reasonable on its own, but they compound:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Design choice&lt;/th&gt;
&lt;th&gt;Playwright MCP&lt;/th&gt;
&lt;th&gt;WebClaw&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Which elements get refs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;All elements (&lt;code&gt;generic&lt;/code&gt;, &lt;code&gt;rowgroup&lt;/code&gt;, &lt;code&gt;cell&lt;/code&gt;...)&lt;/td&gt;
&lt;td&gt;Only interactive elements (buttons, links, inputs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Attribute output&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[active]&lt;/code&gt;, &lt;code&gt;[cursor=pointer]&lt;/code&gt;, &lt;code&gt;/url:&lt;/code&gt; on all applicable&lt;/td&gt;
&lt;td&gt;Minimal — only what's needed for action&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Table representation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full nested structure per cell&lt;/td&gt;
&lt;td&gt;Compressed single-line rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ref count (GitHub)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;789 refs&lt;/td&gt;
&lt;td&gt;245 refs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Playwright MCP's approach — labeling every element with a ref — gives maximum flexibility for targeting any element. WebClaw trades that completeness for compactness by only labeling things the AI can actually interact with.&lt;/p&gt;
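
&lt;p&gt;As a toy illustration of the ref-filtering choice (WebClaw is a Chrome extension walking the real accessibility tree; this Python sketch only shows the idea on a flattened node list, and the role set is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Roles that get a clickable @eN ref; everything else stays visible but unlabeled
INTERACTIVE_ROLES = {"button", "link", "textbox", "searchbox", "checkbox",
                     "radio", "combobox", "menuitem", "tab", "switch"}

def render(nodes):
    """Render flattened accessibility nodes, assigning refs only to interactive ones."""
    lines, ref = [], 0
    for node in nodes:  # each node: {"role": ..., "name": ..., "depth": ...}
        label = node["role"]
        if node["name"]:
            label = label + ' "' + node["name"] + '"'
        if node["role"] in INTERACTIVE_ROLES:
            ref = ref + 1
            label = "@e" + str(ref) + " " + label
        lines.append(" " * node["depth"] + "[" + label + "]")
    return "\n".join(lines)

print(render([
    {"role": "banner",    "name": "",                 "depth": 1},
    {"role": "nav",       "name": "Site",             "depth": 2},
    {"role": "link",      "name": "Jump to content",  "depth": 3},
    {"role": "search",    "name": "",                 "depth": 1},
    {"role": "searchbox", "name": "Search Wikipedia", "depth": 2},
    {"role": "button",    "name": "Search",           "depth": 2},
]))
#  [banner]
#   [nav "Site"]
#    [@e1 link "Jump to content"]
#  [search]
#   [@e2 searchbox "Search Wikipedia"]
#   [@e3 button "Search"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;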

&lt;h3&gt;
  
  
  Why the range is so wide (51% to 79%)
&lt;/h3&gt;

&lt;p&gt;The format savings vary by page structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub (78%)&lt;/strong&gt;: The file listing table is where the biggest difference shows up. Playwright MCP assigns refs to every &lt;code&gt;row&lt;/code&gt;, &lt;code&gt;cell&lt;/code&gt;, &lt;code&gt;generic&lt;/code&gt; wrapper (789 total). WebClaw only labels links and buttons (245 total). Additionally, WebClaw follows the W3C Accessible Name specification, using &lt;code&gt;textContent&lt;/code&gt; before the &lt;code&gt;title&lt;/code&gt; attribute for buttons and links. On GitHub, many buttons have short display text ("X") but verbose title attributes ("Close dialog"); using the spec-compliant order avoids the bloat (a simplified sketch follows this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hacker News (79%)&lt;/strong&gt;: Simple, repetitive table structure. WebClaw's table compression (&lt;code&gt;[row] 1. | link | link&lt;/code&gt;) eliminates most of the verbosity. Playwright MCP outputs nested &lt;code&gt;rowgroup &amp;gt; row &amp;gt; cell &amp;gt; generic &amp;gt; link&lt;/code&gt; for each of the 30 items.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wikipedia (51%)&lt;/strong&gt;: The article body has many inline links that both tools represent similarly. The savings come primarily from the navigation templates (Generative AI, Artificial Intelligence navboxes) where structural compression helps, but the text content itself is irreducible.&lt;/li&gt;
&lt;/ul&gt;
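
&lt;p&gt;The accessible-name point, heavily simplified (the real W3C algorithm has more steps, such as aria-labelledby and native labels; this sketch only shows the ordering that matters here):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def accessible_name(aria_label, text_content, title):
    """Simplified name computation: explicit label first, then visible text,
    and the title attribute only as a last resort."""
    for candidate in (aria_label, text_content, title):
        if candidate and candidate.strip():
            return candidate.strip()
    return ""

# A GitHub-style button: short visible text, verbose title attribute
print(accessible_name("", "X", "Close dialog"))   # "X"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;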

&lt;h2&gt;
  
  
  Controlling Output Size
&lt;/h2&gt;

&lt;p&gt;WebClaw defaults to unlimited output — no truncation. But when you need to manage token costs, two options are available:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interactive elements only&lt;/strong&gt; — &lt;code&gt;interactiveOnly&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"interactiveOnly"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Strips all text content. A 2,000-line page becomes ~200 lines of buttons, links, and inputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Landmark region focus&lt;/strong&gt; — &lt;code&gt;focusRegion&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"focusRegion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"main"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only returns the &lt;code&gt;main&lt;/code&gt;, &lt;code&gt;nav&lt;/code&gt;, &lt;code&gt;header&lt;/code&gt;, or &lt;code&gt;footer&lt;/code&gt; section. Useful when you know where the content you need is.&lt;/p&gt;

&lt;p&gt;Playwright MCP doesn't have equivalents — it always returns the full tree.&lt;/p&gt;
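
&lt;p&gt;In the same toy node model as the earlier sketch, &lt;code&gt;focusRegion&lt;/code&gt; amounts to subtree selection (again illustrative, not WebClaw's implementation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def focus_region(nodes, region):
    """Keep the first landmark whose role matches `region` plus its subtree.
    Assumes a pre-order flattened list with depth info, like the render() sketch."""
    kept, root_depth = [], None
    for node in nodes:
        if root_depth is None:
            if node["role"] == region:
                root_depth = node["depth"]
                kept.append(node)
        elif node["depth"] &amp;gt; root_depth:
            kept.append(node)
        else:
            break
    return kept

# focus_region(nodes, "main") keeps only the main landmark and everything under it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;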

&lt;h2&gt;
  
  
  The Broader Landscape
&lt;/h2&gt;

&lt;p&gt;This comparison only covers in-context accessibility trees. The ecosystem is moving fast, and there are other approaches worth knowing about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Playwright MCP file output&lt;/strong&gt; (&lt;code&gt;--output-mode file&lt;/code&gt;): Saves snapshots to disk files instead of returning them in LLM context. Clients that support file references can read these without consuming context tokens. A fundamentally different approach to the same problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DOM compression tools&lt;/strong&gt; (Vercel's &lt;a href="https://github.com/vercel-labs/agent-browser" rel="noopener noreferrer"&gt;agent-browser&lt;/a&gt;, &lt;a href="https://github.com/browser-use/browser-use" rel="noopener noreferrer"&gt;browser-use&lt;/a&gt;, etc.): These extract and compress DOM/accessibility tree state, filtering down thousands of nodes to the most relevant elements. Some also support optional vision models for layout understanding as a secondary input.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;WebClaw's approach is narrower: same accessibility tree method as Playwright MCP's &lt;code&gt;browser_snapshot&lt;/code&gt;, but with a more compact format. The numbers above show what format choices alone can do — but they don't capture the full picture of what's possible with file-based or DOM compression approaches.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Format Efficiency Still Matters
&lt;/h2&gt;

&lt;p&gt;Even with file-based alternatives emerging, in-context snapshots remain the default for most MCP setups. A browser automation task rarely reads a page just once — navigate, read, click, read again, fill a form, check the result — that's easily 5-10 snapshot calls. A 51-79% format reduction compounds across those calls.&lt;/p&gt;
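
&lt;p&gt;Back-of-the-envelope, using the GitHub numbers from the results table and assuming eight snapshot calls in a session (the call count is an assumption, not a measurement):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SNAPSHOTS_PER_SESSION = 8     # assumed; varies by task
PLAYWRIGHT_TOKENS = 19409     # GitHub page, from the table above
WEBCLAW_TOKENS = 4304

print(SNAPSHOTS_PER_SESSION * PLAYWRIGHT_TOKENS)   # 155272 tokens of snapshot context
print(SNAPSHOTS_PER_SESSION * WEBCLAW_TOKENS)      # 34432 tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;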

&lt;h2&gt;
  
  
  Tradeoffs
&lt;/h2&gt;

&lt;p&gt;I'm biased — I built WebClaw — so let me be upfront about the tradeoffs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where Playwright MCP is the better choice:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CI/headless environments (WebClaw needs a visible Chrome window)&lt;/li&gt;
&lt;li&gt;Cross-browser testing (Chromium, Firefox, WebKit)&lt;/li&gt;
&lt;li&gt;Zero-install setup (&lt;code&gt;npx&lt;/code&gt; one-liner vs. Chrome extension)&lt;/li&gt;
&lt;li&gt;Complete output — every element gets a ref, nothing is omitted&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--output-mode file&lt;/code&gt; for file-based snapshots&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where WebClaw fits better:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token-sensitive workflows where format compactness matters&lt;/li&gt;
&lt;li&gt;Logged-in sessions (runs in your existing Chrome — no re-authentication)&lt;/li&gt;
&lt;li&gt;Bot-resistant sites (Chrome extension, no WebDriver flags)&lt;/li&gt;
&lt;li&gt;When you need output size controls (&lt;code&gt;interactiveOnly&lt;/code&gt;, &lt;code&gt;focusRegion&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;WebClaw limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires Chrome + extension install&lt;/li&gt;
&lt;li&gt;No headless mode&lt;/li&gt;
&lt;li&gt;No test code generation&lt;/li&gt;
&lt;li&gt;Uses your real session (the AI operates with your credentials)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Claude Code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude mcp add webclaw &lt;span class="nt"&gt;--&lt;/span&gt; npx &lt;span class="nt"&gt;-y&lt;/span&gt; webclaw-mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Claude Desktop&lt;/strong&gt; — add to &lt;code&gt;claude_desktop_config.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"webclaw"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"webclaw-mcp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then install the &lt;a href="https://github.com/kuroko1t/webclaw/releases/latest" rel="noopener noreferrer"&gt;Chrome extension&lt;/a&gt;: extract the zip, go to &lt;code&gt;chrome://extensions/&lt;/code&gt;, enable Developer mode, and load the &lt;code&gt;dist/&lt;/code&gt; folder.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;The takeaway isn't "use WebClaw instead of Playwright MCP" — it's that &lt;strong&gt;accessibility tree format choices matter more than you'd expect&lt;/strong&gt;. Assigning refs to every element vs. only interactive ones, including &lt;code&gt;[cursor=pointer]&lt;/code&gt; hints vs. omitting them, following the W3C accessible name spec vs. using title attributes — these small decisions compound into a 51-79% difference on real pages.&lt;/p&gt;

&lt;p&gt;The browser MCP space is evolving quickly. File-based snapshots, DOM compression tools, and hybrid approaches are all worth watching. If you're hitting token limits with your current setup, the data here might help you understand why — and what to try next.&lt;/p&gt;

&lt;p&gt;If you want to reproduce these measurements or try WebClaw, the &lt;a href="https://github.com/kuroko1t/webclaw" rel="noopener noreferrer"&gt;repo is open&lt;/a&gt;. Issues and feedback welcome — this is a solo project and I'm still figuring out the right tradeoffs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/kuroko1t/webclaw" rel="noopener noreferrer"&gt;github.com/kuroko1t/webclaw&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;npm&lt;/strong&gt;: &lt;code&gt;npx -y webclaw-mcp&lt;/code&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;WebClaw is MIT-licensed open source.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>webdev</category>
      <category>playwright</category>
    </item>
  </channel>
</rss>
