<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: kuroko</title>
    <description>The latest articles on DEV Community by kuroko (@kuroko1t).</description>
    <link>https://dev.to/kuroko1t</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F168435%2Fb7b5bb69-c228-4766-8b85-7b9cc4322d73.jpeg</url>
      <title>DEV Community: kuroko</title>
      <link>https://dev.to/kuroko1t</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kuroko1t"/>
    <language>en</language>
    <item>
      <title>Benchmarking Local Coding LLMs: 11 Realistic Tasks, 232 Runs, and the Bugs My Bench Found in My Agent</title>
      <dc:creator>kuroko</dc:creator>
      <pubDate>Wed, 29 Apr 2026 03:42:33 +0000</pubDate>
      <link>https://dev.to/kuroko1t/benchmarking-local-coding-llms-11-realistic-tasks-232-runs-and-the-bugs-my-bench-found-in-my-46pl</link>
      <guid>https://dev.to/kuroko1t/benchmarking-local-coding-llms-11-realistic-tasks-232-runs-and-the-bugs-my-bench-found-in-my-46pl</guid>
      <description>&lt;p&gt;What can a 16GB GPU and a local LLM actually do for everyday coding work? I built an 11-task benchmark to find out and ran four open-weight models (9B to 35B; the 35B is an MoE with 3B active per token) through it. &lt;strong&gt;232 runs in total.&lt;/strong&gt; A single RTX 5060 Ti with 16GB VRAM.&lt;/p&gt;

&lt;p&gt;Headline: the biggest, newest model (Qwen3.6-35B-A3B) won at &lt;strong&gt;100% pass rate (29/29 runs, 11/11 tasks)&lt;/strong&gt; after some tuning. The previous-gen &lt;code&gt;qwen3.5:9b&lt;/code&gt; — older &lt;em&gt;and&lt;/em&gt; smaller — passed &lt;strong&gt;9/11 tasks at 24s/run average, roughly one third the wall time&lt;/strong&gt; of the 35B. So the more interesting question turns out not to be "which model wins" but "do you actually need the latest, biggest model":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The benchmark found three bugs in &lt;strong&gt;my own agent&lt;/strong&gt; before it surfaced anything interesting about the models.&lt;/li&gt;
&lt;li&gt;Picking the right quantization (UD-Q3_K_M instead of Q4_K_M) was worth ~33% on average and saved one model from CPU offload entirely — but the same quant under FP16 KV cache &lt;em&gt;blew up&lt;/em&gt; on two tasks specifically.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;qwen3.5:9b&lt;/code&gt; passes 9/11 tasks at one-third the latency of Qwen3.6-35B-A3B; the bigger newer model's two extra wins are both refactor/feature-add tasks. If your workload is single-file edits, debugging, or read-only investigation, the 9B is plenty.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;The agent under test is &lt;a href="https://github.com/kuroko1t/whet" rel="noopener noreferrer"&gt;Whet&lt;/a&gt; — a single-binary Rust coding agent that talks to local models via Ollama. The benchmark suite is open and runnable: &lt;code&gt;scripts/run_bench.sh -m &amp;lt;model&amp;gt; -n 3&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The benchmark
&lt;/h2&gt;

&lt;p&gt;Eleven tasks across six ability axes. Each task is a self-contained directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;benchmarks/&amp;lt;task&amp;gt;/
  prompt.txt    — instruction passed to whet -p
  verify.sh     — exits 0 on pass, non-zero on fail
  workspace/    — initial files, copied to a tempdir per run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;What it asks for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;task1_hello&lt;/td&gt;
&lt;td&gt;single-file edit&lt;/td&gt;
&lt;td&gt;add a &lt;code&gt;farewell()&lt;/code&gt; function&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;task2_typo&lt;/td&gt;
&lt;td&gt;multi-file grep+replace&lt;/td&gt;
&lt;td&gt;fix &lt;code&gt;recieve&lt;/code&gt; → &lt;code&gt;receive&lt;/code&gt; across 3 files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;task3_rename&lt;/td&gt;
&lt;td&gt;multi-file rename&lt;/td&gt;
&lt;td&gt;rename &lt;code&gt;compute()&lt;/code&gt; → &lt;code&gt;add()&lt;/code&gt; across 3 files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;task6_debug&lt;/td&gt;
&lt;td&gt;debug + run tests&lt;/td&gt;
&lt;td&gt;fix three empty-list edge cases (division-by-zero / index-out-of-range) — tests are SHA-pinned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;task7_dedupe&lt;/td&gt;
&lt;td&gt;refactor&lt;/td&gt;
&lt;td&gt;extract a helper from four near-duplicate functions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;task8_cli_filter&lt;/td&gt;
&lt;td&gt;feature add&lt;/td&gt;
&lt;td&gt;add a &lt;code&gt;--status pending&lt;/code&gt; filter flag with a full code path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;task9_investigate&lt;/td&gt;
&lt;td&gt;read-only exploration&lt;/td&gt;
&lt;td&gt;enumerate every HTTP endpoint and write to ANSWER.md&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;task10_security_fix&lt;/td&gt;
&lt;td&gt;judgment&lt;/td&gt;
&lt;td&gt;patch a SQL injection (verifier injects &lt;code&gt;' OR '1'='1&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;task11_planning_chain&lt;/td&gt;
&lt;td&gt;multi-file planning&lt;/td&gt;
&lt;td&gt;migrate &lt;code&gt;print()&lt;/code&gt; → &lt;code&gt;logging&lt;/code&gt; across 3 files (caplog tests)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;task12_test_gen&lt;/td&gt;
&lt;td&gt;TDD&lt;/td&gt;
&lt;td&gt;write a test suite for a Calculator class. &lt;strong&gt;Verifier mutation-tests it.&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;task13_typescript&lt;/td&gt;
&lt;td&gt;non-Python&lt;/td&gt;
&lt;td&gt;add a function + test in a tiny TS module (&lt;code&gt;tsc --noEmit&lt;/code&gt; + &lt;code&gt;node:test&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;(Task numbering is non-contiguous: numbers 4 and 5 were never assigned, and the original &lt;code&gt;task7_refactor&lt;/code&gt; was retired and replaced with the stricter &lt;code&gt;task7_dedupe&lt;/code&gt;. The 11 above are the live set.)&lt;/p&gt;

&lt;p&gt;Verifier rules in three short bullets (a sketch of a representative verifier follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where the model is supposed to fix implementation code while tests act as the judge (task6, task10, task11), the test files are SHA-256-pinned. A model that "wins" by deleting or weakening the failing tests gets a hard FAIL.&lt;/li&gt;
&lt;li&gt;Where the model writes its own tests (task8, task12), the verifier collects them with &lt;code&gt;pytest --collect-only&lt;/code&gt; and then applies a one-line &lt;strong&gt;mutation&lt;/strong&gt; to the implementation (e.g. &lt;code&gt;divide(a, b)&lt;/code&gt; becomes &lt;code&gt;a + b&lt;/code&gt;). Tests that don't catch the mutation get FAIL.&lt;/li&gt;
&lt;li&gt;For task9 (read-only investigation) the source files themselves are SHA-pinned.&lt;/li&gt;
&lt;/ul&gt;
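
&lt;p&gt;To make that concrete, here is a minimal sketch of a task12-style verifier. The file names and the exact &lt;code&gt;sed&lt;/code&gt; mutation are illustrative, not lifted from the repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# verify.sh: hedged sketch of the mutation-testing verifier (paths and mutation illustrative)
set -e

# The model-written tests must exist and must pass against the real implementation
pytest --collect-only -q
pytest -q

# Apply a one-line mutation to the implementation (e.g. divide becomes add) ...
cp calculator.py calculator.py.orig
sed -i 's|a / b|a + b|' calculator.py

# ... and the suite must now fail; a surviving mutant means the tests are too weak
if pytest -q &amp;gt;/dev/null 2&amp;gt;&amp;amp;1; then
  echo "FAIL: mutation survived"; mv calculator.py.orig calculator.py; exit 1
fi
mv calculator.py.orig calculator.py
echo "PASS"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;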

&lt;h2&gt;
  
  
  The runners
&lt;/h2&gt;

&lt;p&gt;Four local models, picked to span the practical 16GB-VRAM range:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;qwen3.6:35b-a3b-q4_K_M&lt;/code&gt; — Qwen3.6 35B-A3B (3B active per token). Released April 2026. Tool-calling-trained. ~23GB on disk → 7GB CPU offload on this GPU (see the &lt;code&gt;ollama ps&lt;/code&gt; check after this list).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;devstral:24b&lt;/code&gt; — Mistral × All Hands AI's open coding agent model. ~14GB → fits cleanly.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;gemma4:26b&lt;/code&gt; — Google's QAT int4 release. ~17GB → fits cleanly.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;qwen3.5:9b&lt;/code&gt; — the previous-generation Qwen, smaller and pure-dense. ~5.5GB → fits trivially. Included as a "do you actually need 24B+?" baseline.&lt;/li&gt;
&lt;/ul&gt;
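
&lt;p&gt;Whether a model actually fits is easy to check before committing to a full batch: load it once and read the &lt;code&gt;PROCESSOR&lt;/code&gt; column of &lt;code&gt;ollama ps&lt;/code&gt;. Output below abridged and numbers illustrative; the 50/50 split is what this GPU reports for the Q4_K_M build:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run qwen3.6:35b-a3b-q4_K_M "hi" &amp;gt;/dev/null   # force a load
ollama ps
# NAME                     ...  PROCESSOR          UNTIL
# qwen3.6:35b-a3b-q4_K_M   ...  50%/50% CPU/GPU    4 minutes from now
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;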

&lt;p&gt;Each model ran every task three times with &lt;code&gt;temperature=0&lt;/code&gt;, &lt;code&gt;seed=42&lt;/code&gt;, &lt;code&gt;num_ctx=8192&lt;/code&gt;, &lt;code&gt;think=false&lt;/code&gt; (Qwen3.6 is a thinking model; without this flag it spends every iteration on internal reasoning). &lt;code&gt;OLLAMA_KV_CACHE_TYPE=q8_0&lt;/code&gt;, &lt;code&gt;OLLAMA_FLASH_ATTENTION=1&lt;/code&gt;.&lt;/p&gt;
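
&lt;p&gt;For orientation: the two &lt;code&gt;OLLAMA_*&lt;/code&gt; variables are server-side settings, while the rest travel with each request. A hedged sketch of one such request against Ollama's chat API (message content illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -s http://localhost:11434/api/chat -d '{
  "model": "qwen3.6:35b-a3b-q4_K_M",
  "messages": [{"role": "user", "content": "Read hello.py and add a farewell function"}],
  "stream": false,
  "think": false,
  "options": { "temperature": 0, "seed": 42, "num_ctx": 8192 }
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;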

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Headline numbers across 11 tasks (latest batch per &lt;code&gt;(model, task)&lt;/code&gt;):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Pass rate&lt;/th&gt;
&lt;th&gt;Tasks fully passed&lt;/th&gt;
&lt;th&gt;Avg time/task&lt;/th&gt;
&lt;th&gt;Total tokens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;&lt;code&gt;qwen3.6-q3&lt;/code&gt;&lt;/strong&gt; (Qwen3.6-35B-A3B, UD-Q3_K_M)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;100%&lt;/strong&gt; (29/29)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11/11&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;82s&lt;/td&gt;
&lt;td&gt;532K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qwen3.5:9b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;82% (27/33)&lt;/td&gt;
&lt;td&gt;9/11&lt;/td&gt;
&lt;td&gt;24s&lt;/td&gt;
&lt;td&gt;627K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemma4:26b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;61% (20/33)&lt;/td&gt;
&lt;td&gt;6/11&lt;/td&gt;
&lt;td&gt;32s&lt;/td&gt;
&lt;td&gt;585K&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;devstral:24b&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;39% (13/33)&lt;/td&gt;
&lt;td&gt;4/11&lt;/td&gt;
&lt;td&gt;70s&lt;/td&gt;
&lt;td&gt;437K&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;(The vanilla &lt;code&gt;qwen3.6:35b-a3b-q4_K_M&lt;/code&gt; from Ollama also went 18/18 on the six common tasks it ran but is slower; see the Quant sweep section below.)&lt;/p&gt;

&lt;p&gt;Per-task in compact form:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;qwen3.6-q3&lt;/th&gt;
&lt;th&gt;qwen3.5:9b&lt;/th&gt;
&lt;th&gt;gemma4&lt;/th&gt;
&lt;th&gt;devstral&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;task1_hello&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️ 1/3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task2_typo&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task3_rename&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task6_debug&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task7_dedupe&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task8_cli_filter&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;⚠️ 1/3&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task9_investigate&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task10_security_fix&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task11_planning_chain&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️ 1/3&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task12_test_gen&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task13_typescript&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 9B is the surprise of this batch. It clears the multi-file rename, the typo fix, the planning chain, the SQL-injection patch, the debug task, and the TypeScript edit — at a third of the 35B's wall time. The two it loses (task7: extract a helper from four near-duplicates; task8: add a CLI flag with a full code path) are both &lt;em&gt;write-new-structure&lt;/em&gt; tasks. The 9B handles &lt;em&gt;modify-existing-code&lt;/em&gt; work cleanly; on &lt;em&gt;write-new-structure&lt;/em&gt; it falls behind.&lt;/p&gt;

&lt;p&gt;devstral and gemma4 fail in different shapes. devstral misses on simpler tasks too: &lt;code&gt;task1_hello&lt;/code&gt; 1/3 (gives up after one whitespace-mismatched &lt;code&gt;edit_file&lt;/code&gt;), &lt;code&gt;task7_dedupe&lt;/code&gt; 0/3 (the edits succeed but the refactor only deduplicates the validation guard, not the &lt;code&gt;round()&lt;/code&gt; call — verifier requires both). gemma4's biggest gaps are multi-file (task2/task3) and TypeScript. Some of those gemma4 failures, as the next section shows, were really &lt;em&gt;my&lt;/em&gt; agent's fault.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bench found three bugs in my agent first
&lt;/h2&gt;

&lt;p&gt;The first time I ran the suite the rankings were misleading: qwen3.6's pass rate looked closer to gemma4's than it should have. After fixing three Whet-side issues, qwen3.6 went to 100% and the real gap opened up. None of the three bugs were obvious before I had the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 1 — &lt;code&gt;apply_diff&lt;/code&gt; ignored multi-file diffs.&lt;/strong&gt; gemma4 likes to fix multi-file typos with a single unified diff containing three &lt;code&gt;--- file&lt;/code&gt; headers. Whet's &lt;code&gt;apply_diff&lt;/code&gt; ignored the headers and applied every hunk to the JSON &lt;code&gt;path&lt;/code&gt; argument, so hunks meant for files 2 and 3 hit file 1 with mismatched context and the call returned "context not found." Fix: parse &lt;code&gt;--- path&lt;/code&gt; headers between hunks and route each hunk group to its real file.&lt;/p&gt;
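
&lt;p&gt;For illustration, the shape of payload that tripped it: one unified diff with three &lt;code&gt;---&lt;/code&gt; headers. The file names are the task2 files; the hunk contents are made up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--- server.py
+++ server.py
@@ -10,1 +10,1 @@
-def recieve_request(payload):
+def receive_request(payload):
--- notes.md
+++ notes.md
@@ -3,1 +3,1 @@
-The handler will recieve JSON.
+The handler will receive JSON.
--- README.txt
+++ README.txt
@@ -1,1 +1,1 @@
-A server that can recieve webhook events.
+A server that can receive webhook events.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;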

&lt;p&gt;&lt;strong&gt;Bug 2 — hunk anchor was a hard line-number match.&lt;/strong&gt; When a model emits &lt;code&gt;@@ -44,3 @@&lt;/code&gt; but the actual context lives at line 39, real &lt;code&gt;git apply&lt;/code&gt; and &lt;code&gt;patch&lt;/code&gt; are tolerant; Whet wasn't. Fix: treat the &lt;code&gt;@@&lt;/code&gt; line numbers as a hint and locate the hunk by searching for the context+removal lines, picking the closest match.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 3 — my verifiers were reading my own logs.&lt;/strong&gt; This is the one that fooled me for half a day. After fixing &lt;code&gt;apply_diff&lt;/code&gt;, qwen3.6's task2_typo runs &lt;em&gt;still&lt;/em&gt; failed. The fixed files looked correct — every &lt;code&gt;recieve&lt;/code&gt; was now &lt;code&gt;receive&lt;/code&gt;. The verifier ran &lt;code&gt;grep -rEi 'recieve' .&lt;/code&gt; recursively and found … &lt;em&gt;eight matches&lt;/em&gt;. In &lt;code&gt;.stats.log&lt;/code&gt;. The harness was writing Whet's tool-call traces (which include the model's &lt;code&gt;-recieve&lt;/code&gt;/&lt;code&gt;+receive&lt;/code&gt; diffs) inside the workspace copy, and the verifier was scanning them as if they were task content. Fix: put &lt;code&gt;.stats.log&lt;/code&gt;, &lt;code&gt;.stdout.log&lt;/code&gt;, &lt;code&gt;.verify.log&lt;/code&gt; in a sibling &lt;code&gt;${run_dir}.logs&lt;/code&gt; directory.&lt;/p&gt;
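
&lt;p&gt;In harness terms the change is small, roughly this shape (variable and file names are a sketch, not lifted from &lt;code&gt;run_bench.sh&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Before: .stats.log / .stdout.log / .verify.log lived inside the workspace copy,
# so the verifier's recursive grep scanned them as task content.

# After: logs live in a sibling directory the verifier never sees
log_dir="${run_dir}.logs"
mkdir -p "$log_dir"
( cd "$run_dir" &amp;amp;&amp;amp; whet -p "$(cat "$task_dir/prompt.txt")" ) &amp;gt; "$log_dir/stdout.log" 2&amp;gt;&amp;amp;1
( cd "$run_dir" &amp;amp;&amp;amp; bash verify.sh ) &amp;gt; "$log_dir/verify.log" 2&amp;gt;&amp;amp;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;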

&lt;p&gt;The pass-rate trajectory for task2_typo across the three fixes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;State of the harness&lt;/th&gt;
&lt;th&gt;qwen3.6&lt;/th&gt;
&lt;th&gt;gemma4&lt;/th&gt;
&lt;th&gt;devstral&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Original bench&lt;/td&gt;
&lt;td&gt;0/3&lt;/td&gt;
&lt;td&gt;0/3&lt;/td&gt;
&lt;td&gt;0/3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After &lt;code&gt;apply_diff&lt;/code&gt; multi-file fix&lt;/td&gt;
&lt;td&gt;0/3&lt;/td&gt;
&lt;td&gt;0/3&lt;/td&gt;
&lt;td&gt;0/3 (174s avg — model retrying)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;After verifier-infra fix&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;3/3&lt;/strong&gt; ✅&lt;/td&gt;
&lt;td&gt;0/3&lt;/td&gt;
&lt;td&gt;0/3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The first two fixes were necessary but didn't move the needle on their own. The grep-the-logs bug was the one blocking the visible green — and you couldn't tell which of the three fixes was load-bearing until all three were in place and the cell flipped.&lt;/p&gt;

&lt;h2&gt;
  
  
  Picking the right quant matters as much as picking the right model
&lt;/h2&gt;

&lt;p&gt;Qwen3.6-35B-A3B has many quantizations. The default &lt;code&gt;qwen3.6:35b-a3b-q4_K_M&lt;/code&gt; from Ollama is 23GB on disk — 7GB over my GPU's VRAM, so a portion of the layers run on CPU (&lt;code&gt;ollama ps&lt;/code&gt; reports a 50/50 CPU/GPU split for this model on this hardware). Unsloth ships a &lt;code&gt;UD-Q3_K_M&lt;/code&gt; variant (~15GB) that fits cleanly into 16GB VRAM. I tested four configurations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;q4_K_M + KV f16 (default)&lt;/th&gt;
&lt;th&gt;q4_K_M + KV q8_0&lt;/th&gt;
&lt;th&gt;UD-Q3_K_M + KV q8_0&lt;/th&gt;
&lt;th&gt;UD-Q3_K_M + KV f16&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;task1_hello&lt;/td&gt;
&lt;td&gt;40s&lt;/td&gt;
&lt;td&gt;43s&lt;/td&gt;
&lt;td&gt;23s ¹&lt;/td&gt;
&lt;td&gt;27s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task2_typo&lt;/td&gt;
&lt;td&gt;(verify bug)&lt;/td&gt;
&lt;td&gt;104s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;59s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;55s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task3_rename&lt;/td&gt;
&lt;td&gt;(verify bug)&lt;/td&gt;
&lt;td&gt;74s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;51s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;46s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task6_debug&lt;/td&gt;
&lt;td&gt;67s&lt;/td&gt;
&lt;td&gt;77s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;52s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;188s&lt;/strong&gt; ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task7_dedupe&lt;/td&gt;
&lt;td&gt;47s&lt;/td&gt;
&lt;td&gt;46s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;34s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;34s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;task8_cli_filter&lt;/td&gt;
&lt;td&gt;157s&lt;/td&gt;
&lt;td&gt;146s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;103s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;242s&lt;/strong&gt; ⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;¹: warm load. The first run after a fresh &lt;code&gt;ollama&lt;/code&gt; session takes 100-170s while the model loads.&lt;br&gt;
⚠️: completion output ballooned. On task6 the model emitted ~2.4× the tokens (12K → 30K) and the average duration tripled. On task8 the token count grew more modestly (~37K → ~43K, +16%) but each run still took 2.4× as long, suggesting throughput dropped, not just length.&lt;/p&gt;

&lt;p&gt;Three lessons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;KV-cache quantization on Q4_K_M was roughly a wash, with a slight tilt toward regression. Per-task deltas across the four rows with a clean f16 baseline (task2 and task3 hit the verify bug under the default config) ranged from 7% faster to 15% slower (mean ≈ +3% slower). The V-cache dequantization overhead and the small VRAM savings cancel out when the model is already CPU-offloading: cutting the KV cache doesn't change which layers fit on the GPU.&lt;/li&gt;
&lt;li&gt;The same KV-q8_0 helped UD-Q3_K_M considerably on average (-21% across the six common tasks). The win was concentrated in task6 and task8, which were ~3× faster with q8_0; on task2 and task3 the q8_0 variant was actually 5-10% slower than f16. So "faster on average" hides a per-task split.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UD-Q3_K_M with FP16 KV blew up on task6 and task8.&lt;/strong&gt; Same model, same task code, same prompt — moving from 8-bit to 16-bit KV cache made task6 emit ~2.4× the tokens and pushed both tasks to ~3× the wall time. I don't have a clean explanation; the pattern was reproducible across runs in the same batch.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The clear winner: &lt;strong&gt;UD-Q3_K_M weights + KV q8_0 + Flash Attention.&lt;/strong&gt; The short recipe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Download the GGUF (~15GB) into ~/models&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;
curl &lt;span class="nt"&gt;-L&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Qwen3.6-35B-A3B-UD-Q3_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-Q3_K_M.gguf"&lt;/span&gt;

&lt;span class="c"&gt;# 2. Write a Modelfile (absolute path required — Ollama does not expand ~) and register it.&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/Modelfile.q3 &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
FROM &lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="sh"&gt;/models/Qwen3.6-35B-A3B-UD-Q3_K_M.gguf
RENDERER qwen3.5
PARSER qwen3.5
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;ollama create qwen3.6-q3 &lt;span class="nt"&gt;-f&lt;/span&gt; /tmp/Modelfile.q3

&lt;span class="c"&gt;# 3. Turn on KV q8_0 + Flash Attention via a systemd drop-in&lt;/span&gt;
&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /etc/systemd/system/ollama.service.d
&lt;span class="nb"&gt;sudo tee&lt;/span&gt; /etc/systemd/system/ollama.service.d/kv-cache.conf &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;/dev/null &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;'
[Service]
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_FLASH_ATTENTION=1"
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl daemon-reload &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then in &lt;code&gt;~/.whet/config.toml&lt;/code&gt;, point &lt;code&gt;[llm].model&lt;/code&gt; at &lt;code&gt;qwen3.6-q3&lt;/code&gt; and set &lt;code&gt;[llm.options]&lt;/code&gt; with &lt;code&gt;num_ctx=8192&lt;/code&gt;, &lt;code&gt;temperature=0.0&lt;/code&gt;, &lt;code&gt;seed=42&lt;/code&gt;, &lt;code&gt;think=false&lt;/code&gt;. Run &lt;code&gt;whet -m qwen3.6-q3&lt;/code&gt;.&lt;/p&gt;
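
&lt;p&gt;As a copy-pasteable sketch (section and key names follow the description above; worth double-checking against your Whet version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mkdir -p ~/.whet
cat &amp;gt; ~/.whet/config.toml &amp;lt;&amp;lt;'EOF'
[llm]
model = "qwen3.6-q3"

[llm.options]
num_ctx = 8192
temperature = 0.0
seed = 42
think = false
EOF
whet -m qwen3.6-q3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;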

&lt;h2&gt;
  
  
  Three failure modes I saw repeatedly
&lt;/h2&gt;

&lt;p&gt;Across 232 runs the model-side failures clustered into a handful of patterns. Three are worth a closer look.&lt;/p&gt;

&lt;h3&gt;
  
  
  "Read everything, edit nothing" — the early give-up
&lt;/h3&gt;

&lt;p&gt;This appeared mostly in devstral and gemma4 runs that did not progress past the read phase.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;gemma4 on task13_typescript, run 2 (warm load, after the node_modules pre-install fix)&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;4.6s&lt;/span&gt;
  &lt;span class="na"&gt;llm_calls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;       &lt;span class="m"&gt;4&lt;/span&gt;
  &lt;span class="na"&gt;completion tokens&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;67&lt;/span&gt;
  &lt;span class="na"&gt;tool calls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;      &lt;span class="s"&gt;list_dir, read_file × 2     ← reads only&lt;/span&gt;
  &lt;span class="na"&gt;edits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;           &lt;span class="m"&gt;0&lt;/span&gt;
  &lt;span class="na"&gt;stdout response&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;(empty)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four LLM iterations, sixty-seven completion tokens, no editing tool ever invoked. After those three read-only calls the model produced no further tool calls and the agent loop exited (its termination condition is "the last response had no tool calls and no extracted text-mode tool calls"). The 67 generated tokens never reached &lt;code&gt;stdout&lt;/code&gt; — they were attached to the tool-calling iterations themselves, not to a final user-visible reply.&lt;/p&gt;

&lt;p&gt;Whet has a "looks like a question?" detector that re-prompts once when the model asks the user something instead of acting. That doesn't catch this case — the model isn't asking a question, it's just stopping. A fix here would be: detect "no tool calls and no terminal verifier evidence the task is done" and inject a re-prompt. Future work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Edit-tool whitespace thrash
&lt;/h3&gt;

&lt;p&gt;devstral on task2_typo, after I'd added &lt;code&gt;apply_diff&lt;/code&gt; multi-file support but before the verifier-infra fix. Aggregated across the three runs in that batch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    &lt;span class="s"&gt;174s avg per run&lt;/span&gt;
&lt;span class="na"&gt;tool calls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;41 × edit_file across 3 runs (~14 per run)&lt;/span&gt;
&lt;span class="na"&gt;edit targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;server.py 20, notes.md 12, README.txt 3, .stats.log &lt;/span&gt;&lt;span class="m"&gt;3&lt;/span&gt;
&lt;span class="na"&gt;tool failed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;26/41&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model correctly identified the typo and &lt;em&gt;all three&lt;/em&gt; affected files, but its &lt;code&gt;edit_file&lt;/code&gt; calls used &lt;code&gt;old_text&lt;/code&gt; snippets that did not exactly match the file's whitespace. After each "text not found" error it retried with a slightly different snippet rather than switching tools or moving on. Each run hit Whet's max-iteration cap. The three &lt;code&gt;.stats.log&lt;/code&gt; edit attempts are a side-effect of bug 3 above: the model saw the in-workspace stats file containing its own diff text (&lt;code&gt;-recieve&lt;/code&gt;/&lt;code&gt;+receive&lt;/code&gt;) and tried to "fix the typo" there too.&lt;/p&gt;

&lt;p&gt;This is the classic &lt;code&gt;edit_file&lt;/code&gt; exact-match brittleness, plus a model that doesn't know to stop and try a different tool. A fuzzy-match tier inside &lt;code&gt;edit_file&lt;/code&gt; would rescue most of these — same fix as &lt;code&gt;apply_diff&lt;/code&gt;'s anchor matching, applied to the simpler tool. I haven't built it yet because qwen3.6-q3 (the recommended model) doesn't trigger it; the 9B and devstral do, and adding fuzzy match would let the 9B finish faster and pull devstral over the line on more multi-file tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Helpfully wrong: &lt;code&gt;npm install --save-dev&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;qwen3.6-q3 on task13_typescript, run 2 of 3 (before the node_modules pre-install fix):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1] read_file src/calc.ts
[2] read_file src/calc.test.ts
[3] edit_file src/calc.ts            ← added subtract() correctly
[4] edit_file src/calc.test.ts       ← updated import correctly
[5] edit_file src/calc.test.ts       ← added subtract test correctly
[6] shell    npx tsc --noEmit        ← failed: typescript not installed
[7] shell    npx tsx src/calc.test.ts ← failed: tsx not installed either
[8] shell    npm install typescript --save-dev &amp;amp;&amp;amp; npx tsc --noEmit  ← MUTATED package.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model completed the task correctly, then tried to verify its work, hit two missing tools, and bundled &lt;code&gt;npm install --save-dev&lt;/code&gt; with &lt;code&gt;tsc&lt;/code&gt; in a single shell call. The &lt;code&gt;--save-dev&lt;/code&gt; rewrites &lt;code&gt;package.json&lt;/code&gt;. The verifier had &lt;code&gt;package.json&lt;/code&gt; SHA-pinned to block the model from disabling failing tests by editing config — and it caught this install instead.&lt;/p&gt;

&lt;p&gt;This is reasonable model behaviour that got penalized by defensive harness infrastructure. The fix was not on the model side — it was to ship &lt;code&gt;node_modules/&lt;/code&gt; pre-populated in the workspace and to tell the model up front: "deps are already installed, do not run npm install." After that change qwen3.6-q3 went from 1/3 partial to 3/3.&lt;/p&gt;
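
&lt;p&gt;The workspace-side prep is a one-time step, roughly (package names from the trace above; &lt;code&gt;--no-save&lt;/code&gt; keeps &lt;code&gt;package.json&lt;/code&gt; untouched):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Populate node_modules/ once in the task workspace without touching package.json
( cd benchmarks/task13_typescript/workspace &amp;amp;&amp;amp; npm install --no-save typescript tsx )

# ...and say so in prompt.txt, e.g.:
#   "Dependencies are already installed in node_modules/. Do not run npm install."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;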

&lt;h2&gt;
  
  
  Tool selection matters as much as model capability
&lt;/h2&gt;

&lt;p&gt;The benchmark harness writes every tool call to a &lt;code&gt;stats.log&lt;/code&gt; file. Aggregated over the latest batch, the per-model histograms tell their own story.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;task2_typo edit calls (one task, totals across 3 runs each)&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;qwen3.6-q3&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                &lt;span class="s"&gt;15 × edit_file,  0 × apply_diff       (0 failed)&lt;/span&gt;
  &lt;span class="na"&gt;qwen3.6:35b-a3b-q4_K_M&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;    &lt;span class="s"&gt;27 × edit_file,  0 × apply_diff       (0 failed)&lt;/span&gt;
  &lt;span class="na"&gt;qwen3.5:9b&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                &lt;span class="s"&gt;21 × edit_file,  0 × apply_diff      (12 failed, then succeeded)&lt;/span&gt;
  &lt;span class="na"&gt;gemma4:26b&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                 &lt;span class="s"&gt;6 × edit_file,  9 × apply_diff       (apply_diff path)&lt;/span&gt;
  &lt;span class="na"&gt;devstral:24b&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;              &lt;span class="s"&gt;21 × edit_file,  0 × apply_diff      (13 failed, gave up)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All three Qwen models went through &lt;code&gt;edit_file&lt;/code&gt; exclusively. gemma4 reached for &lt;code&gt;apply_diff&lt;/code&gt; — a semantically equivalent choice that also happened to exercise the multi-file routing bug described above. Two different paths to the same task, with very different harness experiences.&lt;/p&gt;

&lt;p&gt;The 9B and devstral both hit the same wall (whitespace mismatches), but only the 9B got past it: it adjusted the &lt;code&gt;old_text&lt;/code&gt; snippet on retry, devstral retried near-identical snippets until the iteration cap. Persistence shape, not just persistence count.&lt;/p&gt;

&lt;p&gt;A similar split shows up on read-heavy tasks. On task9_investigate the per-run tool mix was 6 &lt;code&gt;read_file&lt;/code&gt; + 1 &lt;code&gt;repo_map&lt;/code&gt; + 1 &lt;code&gt;list_dir&lt;/code&gt; for qwen3.6-q3, 5 reads + 1 &lt;code&gt;list_dir&lt;/code&gt; for the 9B, and 4 reads + 2 &lt;code&gt;list_dir&lt;/code&gt; for gemma4. All three passed — for read-only work, the difference is just search depth. It only matters once the chosen tool has to &lt;em&gt;do&lt;/em&gt; something.&lt;/p&gt;

&lt;p&gt;A model that consistently picks tools its harness handles well can outperform a more capable model that picks tools whose implementation has rough edges. Easy to miss when comparing LLMs head-to-head.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One GPU, one configuration.&lt;/strong&gt; RTX 5060 Ti, 16GB VRAM, Blackwell. On a 24GB or 48GB card the rankings could shift — Q4_K_M wouldn't need offload anymore, gemma4's speed advantage shrinks, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python-heavy.&lt;/strong&gt; 10 of 11 tasks are Python. The single TypeScript task is enough to show that gemma4/devstral struggle with non-Python ecosystems, but I wouldn't claim much beyond that. Rust/Go/Java tasks are future work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Four tasks are effectively calibration tasks.&lt;/strong&gt; task6_debug, task9_investigate, task10_security_fix, and task12_test_gen were passed by all four models. They're useful for catching regressions but they don't differentiate the models in this lineup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;temperature=0&lt;/code&gt; and &lt;code&gt;seed=42&lt;/code&gt; did not produce fully deterministic runs&lt;/strong&gt; for Qwen3.6-35B-A3B. MoE expert routing has small non-determinism that shows up as ±5% token-count variance between runs. I report mean values across &lt;code&gt;n=3&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Author bias.&lt;/strong&gt; I built Whet. When a model fails because of a Whet-side bug I'm motivated to fix Whet, not to penalize the model. A different reviewer might decide that "model x failed at multi-file diff because Whet's &lt;code&gt;apply_diff&lt;/code&gt; was buggy" should still count as a model failure for the purposes of choosing a model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;232 runs is small.&lt;/strong&gt; For headline rankings I'm comfortable. For "is gemma4's task11 really 1/3 partial or just unlucky?" I'm not.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build a benchmark before believing a benchmark.&lt;/strong&gt; The first time I ran my own suite, two of the three biggest signals were artifacts of bugs in my benchmark harness or my agent — not in the models. If I'd published rankings off that data I'd be wrong in print.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The most ergonomic model finds the fewest agent bugs.&lt;/strong&gt; All three Whet bugs surfaced through devstral and gemma4 failures. qwen3.6 has such a strong preference for &lt;code&gt;edit_file&lt;/code&gt; that it never exercised &lt;code&gt;apply_diff&lt;/code&gt; and never tripped its multi-file routing bug. If I'd benchmarked only qwen3.6 my agent would still be broken.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Quantization choice can be worth as much as model choice.&lt;/strong&gt; Same model file at UD-Q3_K_M instead of Q4_K_M was ~33% faster on average and never lost a task. Same model file at FP16 KV instead of q8_0 KV blew up on two specific tasks. Run the sweep on your hardware before settling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You may not need the latest, biggest model.&lt;/strong&gt; &lt;code&gt;qwen3.5:9b&lt;/code&gt; — older generation &lt;em&gt;and&lt;/em&gt; one quarter the parameter count — passed 9/11 tasks at 24s/run average, about a third of qwen3.6-q3's 82s. The two it failed (task7_dedupe, task8_cli_filter) were both &lt;em&gt;write-new-structure&lt;/em&gt; tasks. Modify-existing-code work — multi-file rename, typo fix, planning chain, debug, security patch, TypeScript edit — it handled cleanly. The 35B's headline 100% is real, but the &lt;em&gt;delta&lt;/em&gt; between 82% and 100% is exactly those two tasks. Knowing which class of work you do most decides whether that delta is worth 3.4× the latency. (Caveat: this crosses two axes, size and generation. Without &lt;code&gt;qwen3.5:35B&lt;/code&gt; or &lt;code&gt;qwen3.6:9b&lt;/code&gt; in the run I can't separate them.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Realistic-task benchmarks differ from synthetic benchmarks more than I expected.&lt;/strong&gt; Four tasks (task6, task9, task10, task12) were passed by all four models — they catch regressions but don't rank the lineup. The other seven (task1_hello, task2, task3, task7, task8, task11, task13 — single-file edit, multi-file work, refactor, planning, non-Python) made the differences sharp. If you only test the cases everyone passes you'll buy speed at the cost of correctness without realizing it.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;The benchmark suite, the analysis scripts, and a per-run leaderboard generator are in &lt;a href="https://github.com/kuroko1t/whet/tree/main/benchmarks" rel="noopener noreferrer"&gt;whet on GitHub&lt;/a&gt;. To reproduce on your own hardware:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/kuroko1t/whet
&lt;span class="nb"&gt;cd &lt;/span&gt;whet
cargo &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--path&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
ollama pull qwen3.6:35b-a3b-q4_K_M    &lt;span class="c"&gt;# or use the UD-Q3_K_M recipe above&lt;/span&gt;
scripts/run_bench.sh &lt;span class="nt"&gt;-m&lt;/span&gt; qwen3.6:35b-a3b-q4_K_M &lt;span class="nt"&gt;-n&lt;/span&gt; 3
&lt;span class="nb"&gt;cat &lt;/span&gt;benchmarks/results/leaderboard.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I'd be curious to see the same suite run on a 24GB or 48GB card. If you do, send me the JSONL.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Whet is a Rust-based coding agent for local LLMs. &lt;a href="https://github.com/kuroko1t/whet" rel="noopener noreferrer"&gt;Source on GitHub&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>rust</category>
      <category>ollama</category>
      <category>benchmark</category>
    </item>
    <item>
      <title>I Built a Tool to Stop Losing My Claude Code Conversation History</title>
      <dc:creator>kuroko</dc:creator>
      <pubDate>Sat, 14 Mar 2026 03:02:40 +0000</pubDate>
      <link>https://dev.to/kuroko1t/i-built-a-tool-to-stop-losing-my-claude-code-conversation-history-5500</link>
      <guid>https://dev.to/kuroko1t/i-built-a-tool-to-stop-losing-my-claude-code-conversation-history-5500</guid>
      <description>&lt;p&gt;A few weeks ago I needed to revisit a debugging session. Claude had walked me through a nasty race condition in my app — it took over an hour, and the fix was subtle. I knew exactly which session it was.&lt;/p&gt;

&lt;p&gt;I went to find the JSONL file. Gone. No warning, no "this file will be deleted in 3 days." Just gone.&lt;/p&gt;

&lt;p&gt;If you've been using Claude Code for more than a couple of months, this has probably happened to you too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wait, Claude Code Deletes My History?
&lt;/h2&gt;

&lt;p&gt;Yeah. Claude Code stores conversations as JSONL files under &lt;code&gt;~/.claude/projects/&lt;/code&gt;, and old files are &lt;a href="https://github.com/anthropics/claude-code/issues/4172" rel="noopener noreferrer"&gt;automatically deleted over time&lt;/a&gt;. You can change this in settings, but that only solves the auto-deletion problem. &lt;code&gt;/compact&lt;/code&gt; still lossy-summarizes your context, and version updates can &lt;a href="https://github.com/anthropics/claude-code/issues/29154" rel="noopener noreferrer"&gt;break session compatibility&lt;/a&gt;. Even with deletion disabled, JSONL files are scattered across directories with no way to search across sessions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Tried (and Why It Wasn't Enough)
&lt;/h2&gt;

&lt;p&gt;I tried &lt;a href="https://github.com/raine/claude-history" rel="noopener noreferrer"&gt;claude-history&lt;/a&gt; (Rust TUI) and &lt;a href="https://github.com/jhlee0409/claude-code-history-viewer" rel="noopener noreferrer"&gt;Claude Code History Viewer&lt;/a&gt; (desktop app). Both are great for browsing, but they read JSONL files directly — once those files get deleted, they can't show you anything either. &lt;a href="https://github.com/thedotmack/claude-mem" rel="noopener noreferrer"&gt;claude-mem&lt;/a&gt; does persist data into its own database, but it's a full memory system with Node.js, MCP server, and semantic search — more than I needed. I just wanted to archive conversations before they disappear.&lt;/p&gt;

&lt;p&gt;What I was missing: a simple, durable archive I could set up once and forget about.&lt;/p&gt;

&lt;h2&gt;
  
  
  So I Built One
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/kuroko1t/claude-vault" rel="noopener noreferrer"&gt;claude-vault&lt;/a&gt; is a single Rust binary that imports your Claude Code conversations into SQLite with full-text search. No Node.js, no Python, no MCP server — just download and run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude-vault import
&lt;span class="c"&gt;# Imported 94562 messages (0 skipped, 12847 filtered, 0 errors) from 203 files&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once conversations are in SQLite, they survive file deletion, compaction, updates — whatever happens to the original JSONL files.&lt;/p&gt;

&lt;h3&gt;
  
  
  What About All the Noise?
&lt;/h3&gt;

&lt;p&gt;If you've ever opened a Claude Code JSONL file, you know it's mostly noise — tool results, system tags, file read outputs, progress messages. claude-vault strips all of that during import, keeping only what matters: your questions, Claude's responses, and code-modifying actions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Search That Actually Works
&lt;/h3&gt;

&lt;p&gt;Search uses FTS5 with Porter stemming, so "running" matches "run" and "configurations" matches "configure":&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude-vault search &lt;span class="s2"&gt;"race condition fix"&lt;/span&gt;
claude-vault search &lt;span class="s2"&gt;"deploy"&lt;/span&gt; &lt;span class="nt"&gt;--project&lt;/span&gt; my-app &lt;span class="nt"&gt;--since&lt;/span&gt; 2025-01-01
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also pipe JSON output to Claude itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude-vault search &lt;span class="s2"&gt;"previous auth implementation"&lt;/span&gt; &lt;span class="nt"&gt;--json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
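
&lt;p&gt;For example, pulling an old session back into a new conversation: a hypothetical pipeline that assumes Claude Code's non-interactive &lt;code&gt;-p&lt;/code&gt; mode and a result set small enough to use as context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude-vault search "previous auth implementation" --json \
  | claude -p "These are excerpts from an earlier session. Summarize how auth was implemented."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;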



&lt;h2&gt;
  
  
  The Part That Made It Actually Useful: Hooks
&lt;/h2&gt;

&lt;p&gt;Manually running &lt;code&gt;import&lt;/code&gt; is fine, but I kept forgetting. The real fix was hooking it into Claude Code's lifecycle. Add this to &lt;code&gt;~/.claude/settings.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"PreCompact"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude-vault import &amp;gt;/dev/null 2&amp;gt;&amp;amp;1"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"SessionEnd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"claude-vault import &amp;gt;/dev/null 2&amp;gt;&amp;amp;1 &amp;amp;"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;PreCompact&lt;/strong&gt; captures the full conversation before &lt;code&gt;/compact&lt;/code&gt; summarizes it. &lt;strong&gt;SessionEnd&lt;/strong&gt; archives in the background when you exit. Once set up, I never think about it — every session is archived automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  What It Doesn't Do (Honest Assessment)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;It's an &lt;strong&gt;archive&lt;/strong&gt;, not a memory system. It won't inject past context into new sessions automatically.&lt;/li&gt;
&lt;li&gt;It's &lt;strong&gt;CLI-only&lt;/strong&gt;. If you want a TUI, &lt;a href="https://github.com/raine/claude-history" rel="noopener noreferrer"&gt;claude-history&lt;/a&gt; is great.&lt;/li&gt;
&lt;li&gt;No semantic search — it's keyword-based FTS5 with stemming.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It does one thing: makes sure your conversations don't disappear. That's it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo &lt;span class="nb"&gt;install &lt;/span&gt;claude-vault
&lt;span class="c"&gt;# or download a prebuilt binary from GitHub Releases&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Seriously, run &lt;code&gt;claude-vault import&lt;/code&gt; now. If you've been using Claude Code for a while, some of your old sessions might already be gone — archive what's left before it's too late.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kuroko1t/claude-vault" rel="noopener noreferrer"&gt;GitHub: kuroko1t/claude-vault&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you lost Claude Code sessions you wish you could get back? What's your approach to preserving conversation history? I'd love to hear what others are doing.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>ai</category>
      <category>productivity</category>
      <category>devtools</category>
    </item>
    <item>
      <title>What Happens When Local LLMs Fail at Tool Calling — Testing 7 Models with a Rust Coding Agent</title>
      <dc:creator>kuroko</dc:creator>
      <pubDate>Sun, 01 Mar 2026 14:28:05 +0000</pubDate>
      <link>https://dev.to/kuroko1t/what-happens-when-local-llms-fail-at-tool-calling-testing-7-models-with-a-rust-coding-agent-cep</link>
      <guid>https://dev.to/kuroko1t/what-happens-when-local-llms-fail-at-tool-calling-testing-7-models-with-a-rust-coding-agent-cep</guid>
      <description>&lt;p&gt;I tested 7 local LLMs on the same simple coding task. 4 succeeded. 3 failed — each in a different way. One model burned 30K tokens retrying the exact same broken call because my system prompt told it to.&lt;/p&gt;

&lt;p&gt;I built &lt;a href="https://github.com/kuroko1t/whet" rel="noopener noreferrer"&gt;Whet&lt;/a&gt;, a coding agent written in Rust. It connects to local LLMs through Ollama and gives them tools — read files, edit files, run shell commands, search code — so the model can actually modify your project instead of just suggesting changes. Think of it as a local, open-source alternative to tools like Claude Code or Cursor, but running entirely on your machine with whatever model you choose.&lt;/p&gt;

&lt;p&gt;The key mechanism is &lt;strong&gt;tool calling&lt;/strong&gt;: instead of the model printing "you should edit line 5," the model returns a structured API call like &lt;code&gt;edit_file(path, old_text, new_text)&lt;/code&gt;, and the agent executes it. When this works, the model can autonomously chain multiple tools to complete a task. When it breaks, things get interesting.&lt;/p&gt;

&lt;p&gt;This article documents the failure patterns I found, which ones were the model's fault vs. my agent's fault, and what I did about it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important caveat&lt;/strong&gt;: I built Whet as a personal project, so I'm biased toward finding and fixing issues in my own agent rather than blaming models. The "model vs agent" distinction below is my interpretation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Agent&lt;/strong&gt;: &lt;a href="https://github.com/kuroko1t/whet" rel="noopener noreferrer"&gt;Whet&lt;/a&gt; — a single-binary Rust coding agent with 9 built-in tools (read_file, edit_file, shell, grep, etc.) plus optional web tools&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task&lt;/strong&gt;: "Read hello.py and add a farewell function"&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# hello.py (before)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;greet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;greet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;World&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple enough that any tool-calling model should handle it. The expected tool chain is: &lt;code&gt;read_file&lt;/code&gt; → &lt;code&gt;edit_file&lt;/code&gt;. Two calls, done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Models&lt;/strong&gt;: 7 models available via Ollama, ranging from 7B to 24B parameters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mode&lt;/strong&gt;: Yolo (auto-approve all tool calls). Max 10 iterations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to reproduce&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo &lt;span class="nb"&gt;install &lt;/span&gt;whet
ollama pull qwen3:8b  &lt;span class="c"&gt;# or any model below&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'def greet(name):
    return f"Hello, {name}!"

if __name__ == "__main__":
    print(greet("World"))'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; hello.py
whet &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Read hello.py and add a farewell function"&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; qwen3:8b &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Tool Calls&lt;/th&gt;
&lt;th&gt;Failure Pattern&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;devstral-small-2&lt;/td&gt;
&lt;td&gt;24B&lt;/td&gt;
&lt;td&gt;Pass&lt;/td&gt;
&lt;td&gt;5,990&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;glm-4.7-flash&lt;/td&gt;
&lt;td&gt;19B&lt;/td&gt;
&lt;td&gt;Pass&lt;/td&gt;
&lt;td&gt;6,684&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3:8b&lt;/td&gt;
&lt;td&gt;8B&lt;/td&gt;
&lt;td&gt;Pass&lt;/td&gt;
&lt;td&gt;6,895&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen3:14b&lt;/td&gt;
&lt;td&gt;14B&lt;/td&gt;
&lt;td&gt;Pass&lt;/td&gt;
&lt;td&gt;8,946&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5:14b&lt;/td&gt;
&lt;td&gt;14B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Fail&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6,013&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Wrong &lt;code&gt;old_text&lt;/code&gt;, gave up&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5:7b&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Fail&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3,801&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Read file, asked user instead of editing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;qwen2.5-coder:14b&lt;/td&gt;
&lt;td&gt;14B&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Fail&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1,873&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Output JSON as text instead of calling tool&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;4 passed. 3 failed. Parameter count didn't predict success — qwen3:8b (8B) passed while qwen2.5-coder:14b (14B) failed.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Success Looks Like
&lt;/h2&gt;

&lt;p&gt;Before the failures, here's a successful run (devstral-small-2, 5,990 tokens):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1] read_file {"path": "hello.py"}
    → returned file content (5 lines)

[2] edit_file {"path": "hello.py", "old_text": "if __name__...", "new_text": "def farewell..."}
    → added farewell function ✓

Done. Task complete.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two tool calls, clean execution. The model read the file, understood the structure, wrote a valid edit, and stopped. This is what all 7 models should have done.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Failure Patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pattern 1: Refusing to Act (qwen2.5:7b)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[tool: read_file] {"path":"hello.py"}  ← only tool call

"Should I edit the file?"  ← asked user instead of editing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model read the file successfully, then asked for permission instead of using &lt;code&gt;edit_file&lt;/code&gt;. The system prompt says "ACT, DON'T ASK" — the model ignored it. 1 tool call, 3,801 tokens, task incomplete.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 2: Tool Format Confusion (qwen2.5-coder:14b)
&lt;/h3&gt;

&lt;p&gt;The model output what &lt;em&gt;looks like&lt;/em&gt; a tool call, but as plain text instead of using the API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# What the model printed (as text, NOT an actual tool call):
{"name": "read_file", "arguments": {"path": "hello.py"}}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model understood it needed to call &lt;code&gt;read_file&lt;/code&gt;, but output the JSON as text inside a markdown code block instead of using the tool calling API. Zero actual tool calls. 1,873 tokens wasted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 3: Retry Loop (qwen3:14b)
&lt;/h3&gt;

&lt;p&gt;This was the most interesting failure because it was &lt;strong&gt;both the model's and my agent's fault&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Iteration&lt;/th&gt;
&lt;th&gt;Tool Call&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;read_file {"path": "hello.py"}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;shell {"command": "cat hello.py"}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Error&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;shell {"command": "cat hello.py"}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Error&lt;/strong&gt; (same)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;shell {"command": "cat hello.py"}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Error&lt;/strong&gt; (same)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;(max iterations)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Gave up&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;30K tokens. 10+ tool calls. The model hit an error on &lt;code&gt;shell&lt;/code&gt;, then repeated the exact same call 5+ times. It never tried a different approach.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model side&lt;/strong&gt;: qwen3:14b didn't adapt after seeing the error. Other models (qwen3:8b, devstral) changed their approach on failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent side&lt;/strong&gt;: My system prompt said &lt;em&gt;"if shell command fails: read the error output, fix the issue, and retry"&lt;/em&gt; — which the model interpreted literally as "call the same thing again."&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What I Did About It
&lt;/h2&gt;

&lt;p&gt;Pattern 3 was the most actionable. One line added to the system prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- NEVER repeat the same failing tool call more than once.
  If it failed, change your approach (different arguments,
  different tool, or ask the user).
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;qwen3:14b (before)&lt;/th&gt;
&lt;th&gt;qwen3:14b (after)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Task completed&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total tokens&lt;/td&gt;
&lt;td&gt;~30,000&lt;/td&gt;
&lt;td&gt;8,946&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool calls&lt;/td&gt;
&lt;td&gt;10+&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool success rate&lt;/td&gt;
&lt;td&gt;&amp;lt; 20%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One line of prompt turned a 30K-token failure into a 9K-token success.&lt;/p&gt;

&lt;p&gt;For the other two patterns, I added agent-level recovery:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pattern 2 (JSON as text)&lt;/strong&gt;: A fallback parser that scans the model's text output for JSON objects matching the tool call format and executes them. This successfully extracted &lt;code&gt;read_file&lt;/code&gt; calls from qwen2.5-coder:14b's text output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pattern 1 (refusing to act)&lt;/strong&gt;: A question detector that catches when the model asks instead of acting, and re-prompts it to use tools instead of asking. This fired in 3 out of 5 test runs with qwen2.5:7b.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both helped partially, but neither is a complete fix — ultimately the model needs to use the tool calling API correctly.&lt;/p&gt;
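
&lt;p&gt;Whet's recovery code is Rust, but the idea is easy to sketch in Python. This is an illustration of the approach, not Whet's actual implementation; the tool list and function names are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json

# Tools the agent exposes (illustrative list, not Whet's full set)
TOOL_NAMES = {"read_file", "edit_file", "shell"}

def extract_tool_calls(text):
    """Scan free-form model output for JSON objects shaped like tool calls."""
    decoder = json.JSONDecoder()
    calls = []
    i = text.find("{")
    while i != -1:
        try:
            obj, end = decoder.raw_decode(text, i)
        except json.JSONDecodeError:
            i = text.find("{", i + 1)
            continue
        if (isinstance(obj, dict)
                and obj.get("name") in TOOL_NAMES
                and isinstance(obj.get("arguments"), dict)):
            calls.append(obj)
        i = text.find("{", end)
    return calls

def asked_instead_of_acting(text):
    """Crude question detector: did the model end its turn with a question?"""
    return text.strip().endswith("?")

# The literal text qwen2.5-coder:14b printed instead of calling the API:
sample = '{"name": "read_file", "arguments": {"path": "hello.py"}}'
print(extract_tool_calls(sample))
# [{'name': 'read_file', 'arguments': {'path': 'hello.py'}}]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;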




&lt;h2&gt;
  
  
  What the Data Shows
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Model generation matters more than size
&lt;/h3&gt;

&lt;p&gt;All three qwen2.5 models failed. Both qwen3 models passed (after the prompt fix). devstral-small-2 and glm-4.7-flash also passed. The qwen3/qwen2.5 boundary is a clearer predictor of tool-calling success than parameter count.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Each failure is different
&lt;/h3&gt;

&lt;p&gt;The three failing models broke in three distinct ways: refusing to act, format confusion, retry loops. There's no single "tool calling doesn't work" failure mode — each model fails differently, which means each failure needs different investigation.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Agent bugs hide behind smart models
&lt;/h3&gt;

&lt;p&gt;qwen3:8b and devstral never triggered the retry loop bug because they recover gracefully from errors. If I'd only tested with these models, the prompt bug would still be in my code. The "worst" model (qwen3:14b pre-fix) was the most useful for finding agent bugs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single task&lt;/strong&gt;: These results are from one task. A model that passes "add a function" might fail at "debug a test failure" or "refactor across files." I'm working on a broader benchmark.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-deterministic&lt;/strong&gt;: LLM outputs vary between runs. qwen2.5:14b might succeed on a retry. I ran each model once for the initial results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama-specific&lt;/strong&gt;: Results may differ with other inference engines (llama.cpp, vLLM). Tool calling implementation varies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Author bias&lt;/strong&gt;: I built Whet. I'm inclined to fix my agent rather than blame models. Another developer might classify some "agent bugs" as "model limitations" or vice versa.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Test with multiple models, not just the best one.&lt;/strong&gt; Smart models hide agent bugs by working around them. The model that fails the most dramatically teaches you the most about your agent's weaknesses.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;"Retry on failure" is dangerous prompt guidance.&lt;/strong&gt; Humans understand "retry" as "try differently." LLMs may read it as "call the exact same function again." Be explicit about what NOT to do.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Check the generation, not just the size.&lt;/strong&gt; qwen3:8b (8B) outperformed qwen2.5-coder:14b (14B) at tool calling. Newer model families tend to have better tool-use training regardless of parameter count.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The agent can compensate — partially.&lt;/strong&gt; JSON fallback parsing and question re-prompting helped, but the biggest win was a one-line prompt fix. Invest in your system prompt before building workarounds.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;The code is &lt;a href="https://github.com/kuroko1t/whet" rel="noopener noreferrer"&gt;open source&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>llm</category>
      <category>rust</category>
      <category>ai</category>
      <category>agents</category>
    </item>
    <item>
      <title>How Accessibility Tree Formatting Affects Token Cost in Browser MCPs</title>
      <dc:creator>kuroko</dc:creator>
      <pubDate>Thu, 26 Feb 2026 07:58:44 +0000</pubDate>
      <link>https://dev.to/kuroko1t/how-accessibility-tree-formatting-affects-token-cost-in-browser-mcps-n2a</link>
      <guid>https://dev.to/kuroko1t/how-accessibility-tree-formatting-affects-token-cost-in-browser-mcps-n2a</guid>
      <description>&lt;p&gt;Token cost in browser automation MCPs has become a real topic — articles like &lt;a href="https://scrolltest.medium.com/playwright-mcp-burns-114k-tokens-per-test-the-new-cli-uses-27k-heres-when-to-use-each-65dabeaac7a0" rel="noopener noreferrer"&gt;"Playwright MCP Burns 114K Tokens Per Test"&lt;/a&gt; have been making the rounds. Tools are approaching this from different angles: Playwright MCP's &lt;code&gt;--output-mode file&lt;/code&gt; option saves snapshots to disk instead of returning them in LLM context, Vercel's &lt;a href="https://github.com/vercel-labs/agent-browser" rel="noopener noreferrer"&gt;agent-browser&lt;/a&gt; compresses DOM state to a fraction of the original, and some tools add vision-based fallbacks for layout understanding.&lt;/p&gt;

&lt;p&gt;I've been working on &lt;a href="https://github.com/kuroko1t/webclaw" rel="noopener noreferrer"&gt;WebClaw&lt;/a&gt;, an open-source Chrome extension-based browser MCP. It takes the accessibility tree approach like Playwright MCP, but with a more compact format. I wanted to measure the actual difference — not guess, but measure — so I set up a side-by-side test.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Measured
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Versions tested:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Playwright MCP: &lt;code&gt;@playwright/mcp&lt;/code&gt; v0.0.68 (&lt;code&gt;npx @playwright/mcp@0.0.68 --headless&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;WebClaw: &lt;code&gt;webclaw-mcp&lt;/code&gt; v0.9.0 + Chrome extension v0.9.0&lt;/li&gt;
&lt;li&gt;Measured: February 26, 2026&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I registered both &lt;a href="https://github.com/microsoft/playwright-mcp" rel="noopener noreferrer"&gt;Playwright MCP&lt;/a&gt; and WebClaw as MCP servers in the &lt;strong&gt;same Claude Code session&lt;/strong&gt;, then ran the same steps on each:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Navigate to the target URL&lt;/li&gt;
&lt;li&gt;Call the snapshot tool (&lt;code&gt;browser_snapshot&lt;/code&gt; / &lt;code&gt;page_snapshot&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Measure the full response text length in characters&lt;/li&gt;
&lt;li&gt;Estimate tokens as &lt;code&gt;characters / 4&lt;/code&gt; (a rough approximation, sketched below; actual tokenization varies by model)&lt;/li&gt;
&lt;/ol&gt;
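
&lt;p&gt;Step 4 in code form, roughly (the helper names are mine, not part of either tool):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def estimate_tokens(snapshot_text):
    """Rough token estimate: about 4 characters per token."""
    return round(len(snapshot_text) / 4)

def percent_smaller(playwright_chars, webclaw_chars):
    """How much smaller the WebClaw snapshot is, as a percentage."""
    return round((1 - webclaw_chars / playwright_chars) * 100)

# Using the GitHub measurement from the results table below:
print(estimate_tokens("x" * 77637))    # 19409
print(estimate_tokens("x" * 17215))    # 4304
print(percent_smaller(77637, 17215))   # 78
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;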

&lt;p&gt;&lt;strong&gt;Both tools return the complete accessibility tree with no truncation.&lt;/strong&gt; WebClaw's default is unlimited output (no token budget), so this is a pure format efficiency comparison.&lt;/p&gt;

&lt;p&gt;I picked three pages with different content patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Wikipedia&lt;/strong&gt; — long article with many reference links and navigation templates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt; — repository page with file listing, README, and sidebar&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hacker News&lt;/strong&gt; — list-style page with 30 items&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Important caveat on fairness:&lt;/strong&gt; Playwright MCP runs a headless Chromium (not logged in). WebClaw runs in the user's Chrome (logged in to GitHub in my case). This means WebClaw sees &lt;em&gt;more&lt;/em&gt; UI on GitHub — authenticated menus, notifications, repo actions — which actually increases its output. The comparison is biased against WebClaw on that page.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results: Format Efficiency
&lt;/h2&gt;

&lt;p&gt;Both tools returning full, untruncated accessibility trees:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Site&lt;/th&gt;
&lt;th&gt;Playwright MCP&lt;/th&gt;
&lt;th&gt;WebClaw&lt;/th&gt;
&lt;th&gt;Difference&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://en.wikipedia.org/wiki/Model_Context_Protocol" rel="noopener noreferrer"&gt;Wikipedia (MCP article)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;16,044 tokens (64,176 chars)&lt;/td&gt;
&lt;td&gt;7,860 tokens (31,439 chars)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;51% smaller&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://github.com/anthropics/claude-cookbooks" rel="noopener noreferrer"&gt;GitHub (anthropics/claude-cookbooks)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;19,409 tokens (77,637 chars)&lt;/td&gt;
&lt;td&gt;4,304 tokens (17,215 chars)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;78% smaller&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://news.ycombinator.com/" rel="noopener noreferrer"&gt;Hacker News (front page)&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;14,547 tokens (58,189 chars)&lt;/td&gt;
&lt;td&gt;3,052 tokens (12,207 chars)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;79% smaller&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The range is &lt;strong&gt;51% to 79%&lt;/strong&gt; depending on the page. Let me dig into why.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Creates the Difference
&lt;/h2&gt;

&lt;p&gt;Comparing the actual output for the same Wikipedia page:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Playwright MCP&lt;/strong&gt; (&lt;code&gt;browser_snapshot&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;generic [active] [ref=e1]&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;link "Jump to content" [ref=e2] [cursor=pointer]&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;/url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#bodyContent"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;banner [ref=e4]&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;navigation "Site" [ref=e6]&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;generic "Main menu" [ref=e7]&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;button "Main menu" [ref=e8] [cursor=pointer]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;WebClaw&lt;/strong&gt; (&lt;code&gt;page_snapshot&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[page "Model Context Protocol - Wikipedia"]
 [banner]
  [nav "Site"]
  [@e2 link]
 [search]
  [@e3 searchbox "Search Wikipedia"]
  [@e4 button "Search"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The difference comes down to design choices — each reasonable on its own, but they compound:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Design choice&lt;/th&gt;
&lt;th&gt;Playwright MCP&lt;/th&gt;
&lt;th&gt;WebClaw&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Which elements get refs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;All elements (&lt;code&gt;generic&lt;/code&gt;, &lt;code&gt;rowgroup&lt;/code&gt;, &lt;code&gt;cell&lt;/code&gt;...)&lt;/td&gt;
&lt;td&gt;Only interactive elements (buttons, links, inputs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Attribute output&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;[active]&lt;/code&gt;, &lt;code&gt;[cursor=pointer]&lt;/code&gt;, &lt;code&gt;/url:&lt;/code&gt; on all applicable&lt;/td&gt;
&lt;td&gt;Minimal — only what's needed for action&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Table representation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full nested structure per cell&lt;/td&gt;
&lt;td&gt;Compressed single-line rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ref count (GitHub)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;789 refs&lt;/td&gt;
&lt;td&gt;245 refs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Playwright MCP's approach — labeling every element with a ref — gives maximum flexibility for targeting any element. WebClaw trades that completeness for compactness by only labeling things the AI can actually interact with.&lt;/p&gt;
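
&lt;p&gt;As a toy illustration of the ref-filtering choice (WebClaw is a Chrome extension walking the real accessibility tree; this Python sketch only shows the idea on a flattened node list, and the role set is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Roles that get a clickable @eN ref; everything else stays visible but unlabeled
INTERACTIVE_ROLES = {"button", "link", "textbox", "searchbox", "checkbox",
                     "radio", "combobox", "menuitem", "tab", "switch"}

def render(nodes):
    """Render flattened accessibility nodes, assigning refs only to interactive ones."""
    lines, ref = [], 0
    for node in nodes:  # each node: {"role": ..., "name": ..., "depth": ...}
        label = node["role"]
        if node["name"]:
            label = label + ' "' + node["name"] + '"'
        if node["role"] in INTERACTIVE_ROLES:
            ref = ref + 1
            label = "@e" + str(ref) + " " + label
        lines.append(" " * node["depth"] + "[" + label + "]")
    return "\n".join(lines)

print(render([
    {"role": "banner",    "name": "",                 "depth": 1},
    {"role": "nav",       "name": "Site",             "depth": 2},
    {"role": "link",      "name": "Jump to content",  "depth": 3},
    {"role": "search",    "name": "",                 "depth": 1},
    {"role": "searchbox", "name": "Search Wikipedia", "depth": 2},
    {"role": "button",    "name": "Search",           "depth": 2},
]))
#  [banner]
#   [nav "Site"]
#    [@e1 link "Jump to content"]
#  [search]
#   [@e2 searchbox "Search Wikipedia"]
#   [@e3 button "Search"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;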

&lt;h3&gt;
  
  
  Why the range is so wide (51% to 79%)
&lt;/h3&gt;

&lt;p&gt;The format savings vary by page structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub (78%)&lt;/strong&gt;: The file listing table is where the biggest difference shows up. Playwright MCP assigns refs to every &lt;code&gt;row&lt;/code&gt;, &lt;code&gt;cell&lt;/code&gt;, &lt;code&gt;generic&lt;/code&gt; wrapper (789 total). WebClaw only labels links and buttons (245 total). Additionally, WebClaw follows the W3C Accessible Name specification, using &lt;code&gt;textContent&lt;/code&gt; before the &lt;code&gt;title&lt;/code&gt; attribute for buttons and links. On GitHub, many buttons have short display text ("X") but verbose title attributes ("Close dialog"); using the spec-compliant order avoids the bloat (a simplified sketch follows this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hacker News (79%)&lt;/strong&gt;: Simple, repetitive table structure. WebClaw's table compression (&lt;code&gt;[row] 1. | link | link&lt;/code&gt;) eliminates most of the verbosity. Playwright MCP outputs nested &lt;code&gt;rowgroup &amp;gt; row &amp;gt; cell &amp;gt; generic &amp;gt; link&lt;/code&gt; for each of the 30 items.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wikipedia (51%)&lt;/strong&gt;: The article body has many inline links that both tools represent similarly. The savings come primarily from the navigation templates (Generative AI, Artificial Intelligence navboxes) where structural compression helps, but the text content itself is irreducible.&lt;/li&gt;
&lt;/ul&gt;
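
&lt;p&gt;The accessible-name point, heavily simplified (the real W3C algorithm has more steps, such as aria-labelledby and native labels; this sketch only shows the ordering that matters here):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def accessible_name(aria_label, text_content, title):
    """Simplified name computation: explicit label first, then visible text,
    and the title attribute only as a last resort."""
    for candidate in (aria_label, text_content, title):
        if candidate and candidate.strip():
            return candidate.strip()
    return ""

# A GitHub-style button: short visible text, verbose title attribute
print(accessible_name("", "X", "Close dialog"))   # "X"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;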

&lt;h2&gt;
  
  
  Controlling Output Size
&lt;/h2&gt;

&lt;p&gt;WebClaw defaults to unlimited output — no truncation. But when you need to manage token costs, two options are available:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interactive elements only&lt;/strong&gt; — &lt;code&gt;interactiveOnly&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"interactiveOnly"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Strips all text content. A 2,000-line page becomes ~200 lines of buttons, links, and inputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Landmark region focus&lt;/strong&gt; — &lt;code&gt;focusRegion&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"focusRegion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"main"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only returns the &lt;code&gt;main&lt;/code&gt;, &lt;code&gt;nav&lt;/code&gt;, &lt;code&gt;header&lt;/code&gt;, or &lt;code&gt;footer&lt;/code&gt; section. Useful when you know where the content you need is.&lt;/p&gt;

&lt;p&gt;Playwright MCP doesn't have equivalents — it always returns the full tree.&lt;/p&gt;
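
&lt;p&gt;In the same toy node model as the earlier sketch, &lt;code&gt;focusRegion&lt;/code&gt; amounts to subtree selection (again illustrative, not WebClaw's implementation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def focus_region(nodes, region):
    """Keep the first landmark whose role matches `region` plus its subtree.
    Assumes a pre-order flattened list with depth info, like the render() sketch."""
    kept, root_depth = [], None
    for node in nodes:
        if root_depth is None:
            if node["role"] == region:
                root_depth = node["depth"]
                kept.append(node)
        elif node["depth"] &amp;gt; root_depth:
            kept.append(node)
        else:
            break
    return kept

# focus_region(nodes, "main") keeps only the main landmark and everything under it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;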

&lt;h2&gt;
  
  
  The Broader Landscape
&lt;/h2&gt;

&lt;p&gt;This comparison only covers in-context accessibility trees. The ecosystem is moving fast, and there are other approaches worth knowing about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Playwright MCP file output&lt;/strong&gt; (&lt;code&gt;--output-mode file&lt;/code&gt;): Saves snapshots to disk files instead of returning them in LLM context. Clients that support file references can read these without consuming context tokens. A fundamentally different approach to the same problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DOM compression tools&lt;/strong&gt; (Vercel's &lt;a href="https://github.com/vercel-labs/agent-browser" rel="noopener noreferrer"&gt;agent-browser&lt;/a&gt;, &lt;a href="https://github.com/browser-use/browser-use" rel="noopener noreferrer"&gt;browser-use&lt;/a&gt;, etc.): These extract and compress DOM/accessibility tree state, filtering down thousands of nodes to the most relevant elements. Some also support optional vision models for layout understanding as a secondary input.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;WebClaw's approach is narrower: same accessibility tree method as Playwright MCP's &lt;code&gt;browser_snapshot&lt;/code&gt;, but with a more compact format. The numbers above show what format choices alone can do — but they don't capture the full picture of what's possible with file-based or DOM compression approaches.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Format Efficiency Still Matters
&lt;/h2&gt;

&lt;p&gt;Even with file-based alternatives emerging, in-context snapshots remain the default for most MCP setups. A browser automation task rarely reads a page just once — navigate, read, click, read again, fill a form, check the result — that's easily 5-10 snapshot calls. A 51-79% format reduction compounds across those calls.&lt;/p&gt;
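
&lt;p&gt;Back-of-the-envelope, using the GitHub numbers from the results table and assuming eight snapshot calls in a session (the call count is an assumption, not a measurement):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SNAPSHOTS_PER_SESSION = 8     # assumed; varies by task
PLAYWRIGHT_TOKENS = 19409     # GitHub page, from the table above
WEBCLAW_TOKENS = 4304

print(SNAPSHOTS_PER_SESSION * PLAYWRIGHT_TOKENS)   # 155272 tokens of snapshot context
print(SNAPSHOTS_PER_SESSION * WEBCLAW_TOKENS)      # 34432 tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;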

&lt;h2&gt;
  
  
  Tradeoffs
&lt;/h2&gt;

&lt;p&gt;I'm biased — I built WebClaw — so let me be upfront about the tradeoffs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where Playwright MCP is the better choice:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CI/headless environments (WebClaw needs a visible Chrome window)&lt;/li&gt;
&lt;li&gt;Cross-browser testing (Chromium, Firefox, WebKit)&lt;/li&gt;
&lt;li&gt;Zero-install setup (&lt;code&gt;npx&lt;/code&gt; one-liner vs. Chrome extension)&lt;/li&gt;
&lt;li&gt;Complete output — every element gets a ref, nothing is omitted&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--output-mode file&lt;/code&gt; for file-based snapshots&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where WebClaw fits better:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token-sensitive workflows where format compactness matters&lt;/li&gt;
&lt;li&gt;Logged-in sessions (runs in your existing Chrome — no re-authentication)&lt;/li&gt;
&lt;li&gt;Bot-resistant sites (Chrome extension, no WebDriver flags)&lt;/li&gt;
&lt;li&gt;When you need output size controls (&lt;code&gt;interactiveOnly&lt;/code&gt;, &lt;code&gt;focusRegion&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;WebClaw limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires Chrome + extension install&lt;/li&gt;
&lt;li&gt;No headless mode&lt;/li&gt;
&lt;li&gt;No test code generation&lt;/li&gt;
&lt;li&gt;Uses your real session (the AI operates with your credentials)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Claude Code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude mcp add webclaw &lt;span class="nt"&gt;--&lt;/span&gt; npx &lt;span class="nt"&gt;-y&lt;/span&gt; webclaw-mcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Claude Desktop&lt;/strong&gt; — add to &lt;code&gt;claude_desktop_config.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"webclaw"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"-y"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"webclaw-mcp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then install the &lt;a href="https://github.com/kuroko1t/webclaw/releases/latest" rel="noopener noreferrer"&gt;Chrome extension&lt;/a&gt;: extract the zip, go to &lt;code&gt;chrome://extensions/&lt;/code&gt;, enable Developer mode, and load the &lt;code&gt;dist/&lt;/code&gt; folder.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;The takeaway isn't "use WebClaw instead of Playwright MCP" — it's that &lt;strong&gt;accessibility tree format choices matter more than you'd expect&lt;/strong&gt;. Assigning refs to every element vs. only interactive ones, including &lt;code&gt;[cursor=pointer]&lt;/code&gt; hints vs. omitting them, following the W3C accessible name spec vs. using title attributes — these small decisions compound into a 51-79% difference on real pages.&lt;/p&gt;

&lt;p&gt;The browser MCP space is evolving quickly. File-based snapshots, DOM compression tools, and hybrid approaches are all worth watching. If you're hitting token limits with your current setup, the data here might help you understand why — and what to try next.&lt;/p&gt;

&lt;p&gt;If you want to reproduce these measurements or try WebClaw, the &lt;a href="https://github.com/kuroko1t/webclaw" rel="noopener noreferrer"&gt;repo is open&lt;/a&gt;. Issues and feedback welcome — this is a solo project and I'm still figuring out the right tradeoffs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/kuroko1t/webclaw" rel="noopener noreferrer"&gt;github.com/kuroko1t/webclaw&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;npm&lt;/strong&gt;: &lt;code&gt;npx -y webclaw-mcp&lt;/code&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;WebClaw is MIT-licensed open source.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>ai</category>
      <category>webdev</category>
      <category>playwright</category>
    </item>
  </channel>
</rss>
