Benchmarking Local Coding LLMs: 11 Realistic Tasks, 232 Runs, and the Bugs My Bench Found in My Agent

What can a 16GB GPU and a local LLM actually do for everyday coding work? I built an 11-task benchmark to find out and ran four open-weight models (9B to 35B; the 35B is an MoE with 3B active per token) through it. 232 runs in total. A single RTX 5060 Ti with 16GB VRAM.

Headline: the biggest, newest model (Qwen3.6-35B-A3B) won at 100% pass rate (29/29 runs, 11/11 tasks) after some tuning. The previous-gen qwen3.5:9b — older and smaller — passed 9/11 tasks at 24s/run average, roughly one third the wall time of the 35B. So the more interesting question turns out not to be "which model wins" but "do you actually need the latest, biggest model":

  • The benchmark found three bugs in my own agent before it surfaced anything interesting about the models.
  • Picking the right quantization (UD-Q3_K_M instead of Q4_K_M) bought a ~33% average speedup and saved one model from CPU offload entirely, yet the same quant with an FP16 KV cache blew up on two specific tasks.
  • qwen3.5:9b passes 9/11 tasks at one-third the latency of Qwen3.6-35B-A3B; the bigger newer model's two extra wins are both refactor/feature-add tasks. If your workload is single-file edits, debugging, or read-only investigation, the 9B is plenty.

The agent under test is Whet — a single-binary Rust coding agent that talks to local models via Ollama. The benchmark suite is open and runnable: scripts/run_bench.sh -m <model> -n 3.


The benchmark

Eleven tasks across six ability axes. Each task is a self-contained directory:

benchmarks/<task>/
  prompt.txt    — instruction passed to whet -p
  verify.sh     — exits 0 on pass, non-zero on fail
  workspace/    — initial files, copied to a tempdir per run
| # | Task | Axis | What it asks for |
|---|------|------|------------------|
| 1 | task1_hello | single-file edit | add a farewell() function |
| 2 | task2_typo | multi-file grep+replace | fix recieve → receive across 3 files |
| 3 | task3_rename | multi-file rename | rename compute() → add() across 3 files |
| 6 | task6_debug | debug + run tests | fix three empty-list edge cases (division-by-zero / index-out-of-range); tests are SHA-pinned |
| 7 | task7_dedupe | refactor | extract a helper from four near-duplicate functions |
| 8 | task8_cli_filter | feature add | add a `--status pending` filter flag to a CLI tool |
| 9 | task9_investigate | read-only exploration | enumerate every HTTP endpoint and write to ANSWER.md |
| 10 | task10_security_fix | judgment | patch a SQL injection (verifier injects ' OR '1'='1) |
| 11 | task11_planning_chain | multi-file planning | migrate print() → logging across 3 files (caplog tests) |
| 12 | task12_test_gen | TDD | write a test suite for a Calculator class; the verifier mutation-tests it |
| 13 | task13_typescript | non-Python | add a function + test in a tiny TS module (tsc --noEmit + node:test) |

(Task numbering is non-contiguous: numbers 4 and 5 were never assigned, and the original task7_refactor was retired and replaced with the stricter task7_dedupe. The 11 above are the live set.)
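
Adding a new task is just creating one more of those directories. A sketch of a hypothetical task14 (file names and contents are illustrative, and the verifier is assumed to run from inside the copied workspace):

# Hypothetical new task following the layout above.
mkdir -p benchmarks/task14_example/workspace
cat > benchmarks/task14_example/prompt.txt <<'EOF'
Add a goodbye() function to hello.py that returns the string "bye".
EOF
cat > benchmarks/task14_example/workspace/hello.py <<'EOF'
def hello():
    return "hi"
EOF
cat > benchmarks/task14_example/verify.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
# Exit 0 on pass, non-zero on fail.
python3 -c 'from hello import goodbye; assert goodbye() == "bye"'
EOF
chmod +x benchmarks/task14_example/verify.sh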

Verifier rules in three short bullets (a sketch of the mutation-testing verifier follows the list):

  • Where the model is supposed to fix implementation code while tests act as the judge (task6, task10, task11), the test files are SHA-256-pinned. A model that "wins" by deleting or weakening the failing tests gets a hard FAIL.
  • Where the model writes its own tests (task8, task12), the verifier collects them with pytest --collect-only and then applies a one-line mutation to the implementation (e.g. divide(a, b) becomes a + b). Tests that don't catch the mutation get FAIL.
  • For task9 (read-only investigation) the source files themselves are SHA-pinned.
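
For concreteness, here is roughly what the mutation-testing path (the second bullet) looks like as a verify.sh. The file names and the exact sed mutation are illustrative assumptions, not copied from the repo:

#!/usr/bin/env bash
# Sketch of a task12-style mutation-testing verifier (illustrative names).
set -euo pipefail

# 1. The model's tests must exist and be collectable.
pytest --collect-only -q tests/ >/dev/null

# 2. They must pass against the intact implementation...
pytest -q tests/

# 3. ...and must FAIL once a one-line mutation is applied.
cp calculator.py calculator.py.orig
sed -i 's/return a \/ b/return a + b/' calculator.py     # divide() now adds
if pytest -q tests/ >/dev/null 2>&1; then
  mv calculator.py.orig calculator.py
  echo "FAIL: tests did not catch the mutation"
  exit 1
fi
mv calculator.py.orig calculator.py
echo "PASS"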

The runners

Four local models, picked to span the practical 16GB-VRAM range:

  • qwen3.6:35b-a3b-q4_K_M — Qwen3.6 35B-A3B (3B active per token). Released April 2026. Tool-calling-trained. ~23GB on disk → 7GB CPU offload on this GPU.
  • devstral:24b — Mistral × All Hands AI's open coding agent model. ~14GB → fits cleanly.
  • gemma4:26b — Google's QAT int4 release. ~17GB → fits cleanly.
  • qwen3.5:9b — the previous-generation Qwen, smaller and pure-dense. ~5.5GB → fits trivially. Included as a "do you actually need 24B+?" baseline.

Each model ran every task three times with temperature=0, seed=42, num_ctx=8192, think=false (Qwen3.6 is a thinking model; without this flag it spends every iteration on internal reasoning). OLLAMA_KV_CACHE_TYPE=q8_0, OLLAMA_FLASH_ATTENTION=1.
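
As a sanity check, those per-run settings map onto a plain Ollama request roughly like this. This is a sketch, not Whet's actual payload, and the top-level think field needs a reasonably recent Ollama:

# Reproduce the per-run sampling settings against the raw Ollama API (sketch).
curl -s http://localhost:11434/api/chat -d '{
  "model": "qwen3.6:35b-a3b-q4_K_M",
  "messages": [{"role": "user", "content": "say hi"}],
  "stream": false,
  "think": false,
  "options": { "temperature": 0, "seed": 42, "num_ctx": 8192 }
}'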

Results

Headline numbers across 11 tasks (latest batch per (model, task)):

| Model | Pass rate | Tasks fully passed | Avg time/task | Total tokens |
|-------|-----------|--------------------|---------------|--------------|
| qwen3.6-q3 (Qwen3.6-35B-A3B, UD-Q3_K_M) | 100% (29/29) | 11/11 | 82s | 532K |
| qwen3.5:9b | 82% (27/33) | 9/11 | 24s | 627K |
| gemma4:26b | 61% (20/33) | 6/11 | 32s | 585K |
| devstral:24b | 39% (13/33) | 4/11 | 70s | 437K |

(The vanilla qwen3.6:35b-a3b-q4_K_M from Ollama also went 18/18 on the six common tasks it ran but is slower; see the Quant sweep section below.)

Per-task in compact form:

  • qwen3.6-q3: 11/11 tasks, every run green.
  • qwen3.5:9b: 9/11; the two misses are task7_dedupe and task8_cli_filter.
  • gemma4:26b: 6/11; the biggest gaps are the multi-file tasks (task2_typo, task3_rename) and task13_typescript, with task11_planning_chain at a partial 1/3.
  • devstral:24b: 4/11; task1_hello lands at 1/3, task7_dedupe at 0/3, and the multi-file work doesn't land either.

The 9B is the surprise of this batch. It clears the multi-file rename, the typo fix, the planning chain, the SQL-injection patch, the debug task, and the TypeScript edit — at a third of the 35B's wall time. The two it loses (task7: extract a helper from four near-duplicates; task8: add a CLI flag with a full code path) are both write-new-structure tasks. The 9B handles modify-existing-code work cleanly; on write-new-structure it falls behind.

devstral and gemma4 fail in different shapes. devstral misses on simpler tasks too: task1_hello 1/3 (gives up after one whitespace-mismatched edit_file), task7_dedupe 0/3 (the edits succeed but the refactor only deduplicates the validation guard, not the round() call — verifier requires both). gemma4's biggest gaps are multi-file (task2/task3) and TypeScript. Some of those gemma4 failures, as the next section shows, were really my agent's fault.

The bench found three bugs in my agent first

The first time I ran the suite the rankings were misleading: qwen3.6's pass rate looked closer to gemma4's than it should have. After fixing three Whet-side issues, qwen3.6 went to 100% and the real gap opened up. None of the three bugs were obvious before I had the data.

Bug 1 — apply_diff ignored multi-file diffs. gemma4 likes to fix multi-file typos with a single unified diff containing three --- file headers. Whet's apply_diff ignored the headers and applied every hunk to the JSON path argument, so hunks meant for files 2 and 3 hit file 1 with mismatched context and the call returned "context not found." Fix: parse --- path headers between hunks and route each hunk group to its real file.

Bug 2 — hunk anchor was a hard line-number match. When a model emits @@ -44,3 @@ but the actual context lives at line 39, real git apply and patch are tolerant; Whet wasn't. Fix: treat the @@ line numbers as a hint and locate the hunk by searching for the context+removal lines, picking the closest match.

Bug 3 — my verifiers were reading my own logs. This is the one that fooled me for half a day. After fixing apply_diff, qwen3.6's task2_typo runs still failed. The fixed files looked correct — every recieve was now receive. The verifier ran grep -rEi 'recieve' . recursively and found … eight matches. In .stats.log. The harness was writing Whet's tool-call traces (which include the model's -recieve/+receive diffs) inside the workspace copy, and the verifier was scanning them as if they were task content. Fix: put .stats.log, .stdout.log, .verify.log in a sibling ${run_dir}.logs directory.
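
The harness-side change is a few lines. A sketch with assumed variable names (the real run_bench.sh may differ, and the post-run mv assumes whet writes .stats.log into its working directory):

# Route every harness-produced log into a sibling "<run_dir>.logs" directory
# so verify.sh's recursive grep over the workspace copy never sees it.
log_dir="${run_dir}.logs"
mkdir -p "$log_dir"
(cd "$run_dir" && whet -p "$(cat "$task_dir/prompt.txt")") \
  > "$log_dir/.stdout.log" 2>&1
mv "$run_dir/.stats.log" "$log_dir/" 2>/dev/null || true   # whet's tool-call trace
(cd "$run_dir" && ./verify.sh) > "$log_dir/.verify.log" 2>&1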

The pass-rate trajectory for task2_typo across the three fixes:

| State of the harness | qwen3.6 | gemma4 | devstral |
|----------------------|---------|--------|----------|
| Original bench | 0/3 | 0/3 | 0/3 |
| After apply_diff multi-file fix | 0/3 | 0/3 | 0/3 (174s avg, model retrying) |
| After verifier-infra fix | 3/3 | 0/3 | 0/3 |

The first two fixes were necessary but didn't move the needle on their own. The grep-the-logs bug was the one blocking the visible green — and you couldn't tell which of the three fixes was load-bearing until all three were in place and the cell flipped.

Picking the right quant matters as much as picking the right model

Qwen3.6-35B-A3B has many quantizations. The default qwen3.6:35b-a3b-q4_K_M from Ollama is 23GB on disk — 7GB over my GPU's VRAM, so a portion of the layers run on CPU (ollama ps reports a 50/50 CPU/GPU split for this model on this hardware). Unsloth ships a UD-Q3_K_M variant (~15GB) that fits cleanly into 16GB VRAM. I tested four configurations:

| Task | q4_K_M + KV f16 (default) | q4_K_M + KV q8_0 | UD-Q3_K_M + KV q8_0 | UD-Q3_K_M + KV f16 |
|------|---------------------------|------------------|---------------------|--------------------|
| task1_hello | 40s | 43s | 23s ¹ | 27s |
| task2_typo | (verify bug) | 104s | 59s | 55s |
| task3_rename | (verify bug) | 74s | 51s | 46s |
| task6_debug | 67s | 77s | 52s | 188s ⚠️ |
| task7_dedupe | 47s | 46s | 34s | 34s |
| task8_cli_filter | 157s | 146s | 103s | 242s ⚠️ |

¹: warm load. The first run after a fresh ollama session takes 100-170s while the model loads.
⚠️: completion output ballooned. On task6 the model emitted ~2.4× the tokens (12K → 30K) and the average duration tripled. On task8 the token count grew more modestly (~37K → ~43K, +16%) but each run still took 2.4× as long, suggesting throughput dropped, not just length.

Three lessons (a sketch for re-running the sweep follows the list):

  1. KV-cache quantization on Q4_K_M was roughly a wash, with a slight tilt toward regression. Per-task deltas across the four rows with a valid f16 baseline (task2/task3 hit the verifier bug) ranged from 7% faster to 15% slower (mean ≈ +3% slower). The V-cache dequantization overhead and the small VRAM savings cancel out when the model is already CPU-offloading: cutting the KV cache doesn't change which layers fit on the GPU.
  2. The same KV-q8_0 helped UD-Q3_K_M considerably on average (-21% across the six common tasks). The win was concentrated in task6 and task8, which were ~3× faster with q8_0; on task2 and task3 the q8_0 variant was actually 5-10% slower than f16. So "faster on average" hides a per-task split.
  3. UD-Q3_K_M with FP16 KV blew up on task6 and task8. Same model, same task code, same prompt — moving from 8-bit to 16-bit KV cache made task6 emit ~2.4× the tokens and pushed both tasks to ~3× the wall time. I don't have a clean explanation; the pattern was reproducible across runs in the same batch.
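
Re-running the sweep on your own hardware is mechanical. A minimal sketch, assuming the ollama.service drop-in from the recipe below is already in place:

# Re-run the bench once per KV-cache type for both quants of the 35B.
for kv in f16 q8_0; do
  sudo sed -i "s/OLLAMA_KV_CACHE_TYPE=[^\"]*/OLLAMA_KV_CACHE_TYPE=${kv}/" \
    /etc/systemd/system/ollama.service.d/kv-cache.conf
  sudo systemctl daemon-reload && sudo systemctl restart ollama
  for model in qwen3.6:35b-a3b-q4_K_M qwen3.6-q3; do
    scripts/run_bench.sh -m "$model" -n 3
  done
done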

The clear winner: UD-Q3_K_M weights + KV q8_0 + Flash Attention. The short recipe:

# 1. Download the GGUF (~15GB) into ~/models
mkdir -p "$HOME/models"
curl -L -o "$HOME/models/Qwen3.6-35B-A3B-UD-Q3_K_M.gguf" \
  "https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF/resolve/main/Qwen3.6-35B-A3B-UD-Q3_K_M.gguf"

# 2. Write a Modelfile (absolute path required — Ollama does not expand ~) and register it.
cat > /tmp/Modelfile.q3 <<EOF
FROM $HOME/models/Qwen3.6-35B-A3B-UD-Q3_K_M.gguf
RENDERER qwen3.5
PARSER qwen3.5
EOF
ollama create qwen3.6-q3 -f /tmp/Modelfile.q3

# 3. Turn on KV q8_0 + Flash Attention via a systemd drop-in
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/kv-cache.conf >/dev/null <<'EOF'
[Service]
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_FLASH_ATTENTION=1"
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama

Then in ~/.whet/config.toml, point [llm].model at qwen3.6-q3 and set [llm.options] with num_ctx=8192, temperature=0.0, seed=42, think=false. Run whet -m qwen3.6-q3.
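
For copy-paste convenience, that config stanza might look like this; the key layout is taken from the description above, so check Whet's README for the exact schema:

mkdir -p ~/.whet
cat > ~/.whet/config.toml <<'EOF'
# Key names assumed from the prose above; verify against Whet's docs.
[llm]
model = "qwen3.6-q3"

[llm.options]
num_ctx = 8192
temperature = 0.0
seed = 42
think = false
EOF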

Three failure modes I saw repeatedly

Across 232 runs the model-side failures clustered into a handful of patterns. Three are worth a closer look.

"Read everything, edit nothing" — the early give-up

This appeared mostly in devstral and gemma4 runs that did not progress past the read phase.

gemma4 on task13_typescript, run 2 (warm load, after the node_modules pre-install fix):
  duration:        4.6s
  llm_calls:       4
  completion tokens: 67
  tool calls:      list_dir, read_file × 2     ← reads only
  edits:           0
  stdout response: (empty)

Four LLM iterations, sixty-seven completion tokens, no editing tool ever invoked. After three reads the model produced no further tool calls and the agent loop exited (its termination condition is "the last response had no tool calls and no extracted text-mode tool calls"). The 67 generated tokens never reached stdout — they were attached to the tool-calling iterations themselves, not to a final user-visible reply.

Whet has a "looks like a question?" detector that re-prompts once when the model asks the user something instead of acting. That doesn't catch this case — the model isn't asking a question, it's just stopping. A fix here would be: detect "no tool calls and no terminal verifier evidence the task is done" and inject a re-prompt. Future work.

Edit-tool whitespace thrash

devstral on task2_typo, after I'd added apply_diff multi-file support but before the verifier-infra fix. Aggregated across the three runs in that batch:

duration:    174s avg per run
tool calls:  41 × edit_file across 3 runs (~14 per run)
edit targets: server.py 20, notes.md 12, README.txt 3, .stats.log 3
tool failed: 26/41

The model correctly identified the typo and all three affected files, but its edit_file calls used old_text snippets that did not exactly match the file's whitespace. After each "text not found" error it retried with a slightly different snippet rather than switching tools or moving on. Each run hit Whet's max-iteration cap. The three .stats.log edit attempts are a side-effect of bug 3 above: the model saw the in-workspace stats file containing its own diff text (-recieve/+receive) and tried to "fix the typo" there too.

This is the classic edit_file exact-match brittleness, plus a model that doesn't know to stop and try a different tool. A fuzzy-match tier inside edit_file would rescue most of these — same fix as apply_diff's anchor matching, applied to the simpler tool. I haven't built it yet because qwen3.6-q3 (the recommended model) doesn't trigger it; the 9B and devstral do, and adding fuzzy match would let the 9B finish faster and pull devstral over the line on more multi-file tasks.

Helpfully wrong: npm install --save-dev

qwen3.6-q3 on task13_typescript, run 2 of 3 (before the node_modules pre-install fix):

[1] read_file src/calc.ts
[2] read_file src/calc.test.ts
[3] edit_file src/calc.ts            ← added subtract() correctly
[4] edit_file src/calc.test.ts       ← updated import correctly
[5] edit_file src/calc.test.ts       ← added subtract test correctly
[6] shell    npx tsc --noEmit        ← failed: typescript not installed
[7] shell    npx tsx src/calc.test.ts ← failed: tsx not installed either
[8] shell    npm install typescript --save-dev && npx tsc --noEmit  ← MUTATED package.json

The model completed the task correctly, then tried to verify its work, hit two missing tools, and bundled npm install --save-dev with tsc in a single shell call. The --save-dev rewrites package.json. The verifier had package.json SHA-pinned to block the model from disabling failing tests by editing config — and it caught this install instead.

This is reasonable behaviour from the model, penalized by defensive harness infrastructure. The fix was not on the model side: it was to ship node_modules/ pre-populated in the workspace and to tell the model up front, "deps are already installed, do not run npm install." After that change qwen3.6-q3 went from 1/3 partial to 3/3.
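
The prep itself is a one-time task-authoring step, not something the harness does per run. A sketch, with --no-save keeping the SHA-pinned package.json untouched and the "deps are installed" note appended to the prompt:

# One-time task-authoring step (illustrative): pre-populate node_modules/
# without modifying the SHA-pinned package.json, then tell the model so.
cd benchmarks/task13_typescript/workspace
npm install --no-save --silent typescript
echo "TypeScript deps are pre-installed in node_modules/; do not run npm install." \
  >> ../prompt.txt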

Tool selection matters as much as model capability

The benchmark harness writes every tool call to a stats.log file. Aggregated over the latest batch, the per-model histograms tell their own story.
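
The counts below fall straight out of those logs. A minimal aggregation sketch, assuming each .stats.log is JSONL with one object per tool call carrying tool and success fields (the real schema may differ):

# Per-tool call/failure histogram for one run (field names are assumptions).
jq -r 'select(.tool != null)
       | "\(.tool)\t\(if .success then "ok" else "failed" end)"' .stats.log \
  | sort | uniq -c | sort -rn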

task2_typo edit calls (one task, totals across 3 runs each):
  qwen3.6-q3:                15 × edit_file,  0 × apply_diff       (0 failed)
  qwen3.6:35b-a3b-q4_K_M:    27 × edit_file,  0 × apply_diff       (0 failed)
  qwen3.5:9b:                21 × edit_file,  0 × apply_diff      (12 failed, then succeeded)
  gemma4:26b:                 6 × edit_file,  9 × apply_diff       (apply_diff path)
  devstral:24b:              21 × edit_file,  0 × apply_diff      (13 failed, gave up)

All three Qwen variants went through edit_file exclusively. gemma4 reached for apply_diff, a semantically equivalent choice that also happened to exercise the multi-file routing bug described above. Two different paths to the same task, with very different harness experiences.

The 9B and devstral both hit the same wall (whitespace mismatches), but only the 9B got past it: it adjusted the old_text snippet on retry, while devstral retried near-identical snippets until the iteration cap. Persistence shape, not just persistence count.

A similar split shows up on read-heavy tasks. On task9_investigate the per-run tool mix was 6 read_file + 1 repo_map + 1 list_dir for qwen3.6-q3, 5 reads + 1 list_dir for the 9B, and 4 reads + 2 list_dir for gemma4. All three passed — for read-only work, the difference is just search depth. It only matters once the chosen tool has to do something.

A model that consistently picks tools its harness handles well can outperform a more capable model that picks tools whose implementation has rough edges. Easy to miss when comparing LLMs head-to-head.

Limitations

  • One GPU, one configuration. RTX 5060 Ti, 16GB VRAM, Blackwell. On a 24GB or 48GB card the rankings could shift — Q4_K_M wouldn't need offload anymore, gemma4's speed advantage shrinks, etc.
  • Python-heavy. 10 of 11 tasks are Python. The single TypeScript task is enough to show that gemma4/devstral struggle with non-Python ecosystems, but I wouldn't claim much beyond that. Rust/Go/Java tasks are future work.
  • Four tasks are effectively calibration tasks. task6_debug, task9_investigate, task10_security_fix, and task12_test_gen were passed by all four models. They're useful for catching regressions but they don't differentiate the models in this lineup.
  • temperature=0 and seed=42 did not produce fully deterministic runs for Qwen3.6-35B-A3B. MoE expert routing has small non-determinism that shows up as ±5% token-count variance between runs. I report mean values across n=3.
  • Author bias. I built Whet. When a model fails because of a Whet-side bug I'm motivated to fix Whet, not to penalize the model. A different reviewer might decide that "model x failed at multi-file diff because Whet's apply_diff was buggy" should still count as a model failure for the purposes of choosing a model.
  • 232 runs is small. For headline rankings I'm comfortable. For "is gemma4's task11 really 1/3 partial or just unlucky?" I'm not.

Takeaways

  1. Build a benchmark before believing a benchmark. The first time I ran my own suite, two of the three biggest signals were artifacts of bugs in my benchmark harness or my agent — not in the models. If I'd published rankings off that data I'd be wrong in print.

  2. The most ergonomic model finds the fewest agent bugs. All three Whet bugs surfaced through devstral and gemma4 failures. qwen3.6 has such a strong preference for edit_file that it never exercised apply_diff and never tripped its multi-file routing bug. If I'd benchmarked only qwen3.6 my agent would still be broken.

  3. Quantization choice can be worth as much as model choice. Same model file at UD-Q3_K_M instead of Q4_K_M was ~33% faster on average and never lost a task. Same model file at FP16 KV instead of q8_0 KV blew up on two specific tasks. Run the sweep on your hardware before settling.

  4. You may not need the latest, biggest model. qwen3.5:9b — older generation and one quarter the parameter count — passed 9/11 tasks at 24s/run average, about a third of qwen3.6-q3's 82s. The two it failed (task7_dedupe, task8_cli_filter) were both write-new-structure tasks. Modify-existing-code work — multi-file rename, typo fix, planning chain, debug, security patch, TypeScript edit — it handled cleanly. The 35B's headline 100% is real, but the delta between 82% and 100% is exactly those two tasks. Knowing which class of work you do most decides whether that delta is worth 3.4× the latency. (Caveat: this crosses two axes, size and generation. Without qwen3.5:35B or qwen3.6:9b in the run I can't separate them.)

  5. Realistic-task benchmarks differ from synthetic benchmarks more than I expected. Four tasks (task6, task9, task10, task12) were passed by all four models — they catch regressions but don't rank the lineup. The other seven (task1_hello, task2, task3, task7, task8, task11, task13 — single-file edit, multi-file work, refactor, planning, non-Python) made the differences sharp. If you only test the cases everyone passes you'll buy speed at the cost of correctness without realizing it.


The benchmark suite, the analysis scripts, and a per-run leaderboard generator are in whet on GitHub. To reproduce on your own hardware:

git clone https://github.com/kuroko1t/whet
cd whet
cargo install --path .
ollama pull qwen3.6:35b-a3b-q4_K_M    # or use the UD-Q3_K_M recipe above
scripts/run_bench.sh -m qwen3.6:35b-a3b-q4_K_M -n 3
cat benchmarks/results/leaderboard.md

I'd be curious to see the same suite run on a 24GB or 48GB card. If you do, send me the JSONL.

Whet is a Rust-based coding agent for local LLMs. Source on GitHub.
