QuantaMind

Posted on Jun 22

I benchmarked two local LLMs on agentic coding tasks — the results surprised me

#llm #ai #quantamind #opensource

I've been building QuantaMind — a desktop app for evaluating local LLMs on real agentic tasks, not just vibes. This week I ran the built-in Coding eval suite against two popular Ollama models and the gap was wider than I expected.

Here's what I found.

The setup

Both models ran on the same machine (64 GB RAM, Workstation class), same Ollama backend, same settings:

Difficulty: Easy tier
Eval suite: Built-in Coding collection (agentic tasks)
K (pass consistency): 5 runs per task
Decoys: off
Max steps: 8

The five tasks tested:

Task ID	Category
es_co_run_failing_test	agent_loop
es_co_lint_then_report	agent_loop
es_co_grep_symbol	agent_loop
es_co_branch_target	agent_loop
es_co_dep_pin	agent_loop
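
To make the run count concrete, here is a minimal sketch of the run matrix these settings imply. The class and field names are my own illustration, not QuantaMind's actual config schema; only the values come from the setup above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalConfig:
    # Field names are hypothetical; values mirror the settings listed above.
    difficulty: str = "easy"
    suite: str = "coding"
    k: int = 5              # runs per task (pass consistency)
    decoys: bool = False
    max_steps: int = 8


TASKS = [
    "es_co_run_failing_test",
    "es_co_lint_then_report",
    "es_co_grep_symbol",
    "es_co_branch_target",
    "es_co_dep_pin",
]


def run_matrix(config, tasks):
    """Return one (task_id, run_index) pair per scheduled run."""
    return [(task, i) for task in tasks for i in range(config.k)]


config = EvalConfig()
runs = run_matrix(config, TASKS)
print(len(runs))  # 5 tasks x 5 runs each = 25
```

That 25 is the denominator behind the per-model pass rates below.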

gemma4:e4b (Q4_K_M) — 60% Pass Rate

15/25 runs passed. The model handled the grep and lint tasks cleanly but consistently failed es_co_branch_target and es_co_dep_pin.
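
The 60% figure can be reproduced with a few lines of arithmetic. This is a sketch under one assumption: I am reading "consistently failed" as 0 of 5 passes for those two tasks, which leaves 5 of 5 for the other three. The per-task counts are inferred from the summary above, not exported from the app.

```python
K = 5  # runs per task

# Per-task pass counts, inferred from the summary (an assumption, see above).
gemma_passes = {
    "es_co_run_failing_test": 5,
    "es_co_lint_then_report": 5,
    "es_co_grep_symbol": 5,
    "es_co_branch_target": 0,
    "es_co_dep_pin": 0,
}


def pass_rate(passes, k):
    """Fraction of all runs that passed."""
    return sum(passes.values()) / (len(passes) * k)


def fully_consistent(passes, k):
    """Tasks that passed on every one of their k runs."""
    return sorted(task for task, n in passes.items() if n == k)


print(f"{pass_rate(gemma_passes, K):.0%}")    # 60%
print(len(fully_consistent(gemma_passes, K)))  # 3
```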

Avg steps: 1.8
Effort: 257 tokens
Top error: FAKE DONE (the model reported task completion without actually finishing it)

That last point is the important one. FAKE DONE is a specific failure mode QuantaMind tracks — the model calls the done tool or says it's finished before the required sequence of tool calls is complete. It's a reliability red flag for agentic use.
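
To illustrate the idea, here is a minimal sketch of how a run could be labeled from its tool-call trace. This is not QuantaMind's implementation: the `classify_run` function, the `INCOMPLETE` label, and the tool names in the example are all hypothetical.

```python
def classify_run(trace, required, done_tool="done"):
    """Label one run as PASS, FAKE_DONE, or INCOMPLETE.

    trace:    tool names in the order the model called them
    required: tool names that must appear, in order, before done
    Returns (label, required steps still missing).
    """
    progress = 0  # index of the next required step we are waiting for
    for call in trace:
        if call == done_tool:
            if progress == len(required):
                return "PASS", []
            # The model claimed completion with required steps outstanding.
            return "FAKE_DONE", required[progress:]
        if progress < len(required) and call == required[progress]:
            progress += 1
    # The model never signalled completion (for example, it hit max steps).
    return "INCOMPLETE", required[progress:]


# Hypothetical tool names, for illustration only.
required = ["run_tests", "read_file", "edit_file"]

print(classify_run(["run_tests", "done"], required))
# ('FAKE_DONE', ['read_file', 'edit_file'])
print(classify_run(["run_tests", "read_file", "edit_file", "done"], required))
# ('PASS', [])
```

The useful part is the split between the last two outcomes: a run that stalls without claiming success is a loud failure, while a run that claims success early is a silent one.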

qwen3.6:35b (Q4_K_M) — 100% Pass Rate

25/25 runs passed. Every task, every run. Clean.

Avg steps: 1.8 (same!)
Effort: 447 tokens (74% more than gemma4:e4b's 257)
Top error: NONE

Interestingly, both models averaged exactly 1.8 steps per task. The difference was in correctness, not in how many actions they took. qwen3.6:35b just did the right things.

What this means

A few takeaways:

1. Parameter count isn't everything for agentic tasks.
gemma4:e4b is a 4B model. qwen3.6:35b is nearly 9× larger at 35B. You would expect the bigger model to win on raw capability, but the 60% vs 100% gap on structured agentic loops is still striking: Easy tier tasks should be baseline reliable for any model you plan to put in an agent loop.

2. "FAKE DONE" is a real production concern.
If you're building an agent pipeline, a model that prematurely signals completion is worse than one that fails loudly. This failure mode doesn't show up in standard benchmarks.

3. Effort ≠ accuracy.
qwen used 74% more tokens to achieve its 100% pass rate. That's a real cost in latency and context if you're running these locally. There's a trade-off to weigh.
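
One way to weigh that trade-off is tokens per passing run instead of tokens per run. The sketch below rests on two assumptions of mine: that Effort is the average token count per run, and that failed runs cost about as much as passing ones. Neither is confirmed by the app's output.

```python
def token_overhead(baseline, other):
    """Relative extra tokens of `other` compared with `baseline`."""
    return other / baseline - 1


def tokens_per_pass(avg_tokens, passed, total):
    """Average tokens spent for each run that actually passed."""
    return avg_tokens * total / passed


gemma = tokens_per_pass(257, passed=15, total=25)
qwen = tokens_per_pass(447, passed=25, total=25)

print(f"{token_overhead(257, 447):.0%}")  # 74%
print(round(gemma), round(qwen))          # 428 447
print(f"{qwen / gemma - 1:.0%}")          # 4%
```

Under those assumptions, the 74% gap per run shrinks to roughly 4% per successful run, because 40% of gemma's tokens went into runs that failed.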

Try it yourself

QuantaMind is a free macOS desktop app. Run your own models through the same eval suite and see where they land before putting them in a pipeline.

👉 www.quantamind.co

The coding eval collection is built-in — just pick your model, hit Run Batch, and let the agentic simulator do the rest. The context-cliff probe and Agent Report tab give you even deeper readiness data.

Have you tested local models on agentic tasks? I'd love to compare notes in the comments.

Top comments (2)

Dhanush G • Jun 22

It's the FAKE DONE finding that gets me. I've hit exactly this: a local model that "passes" in a quick manual test because it confidently says it's done, and you only catch it when something downstream breaks. The fact that it doesn't show up in standard benchmarks is the whole problem; a model that fails loudly is debuggable, one that lies about completion isn't.

Question:

How does QuantaMind actually verify completion? If the model can fake "done," presumably you're checking the end state against an expected sequence of tool calls rather than trusting the model's claim. Is it a strict checklist match, or something looser?

QuantaMind • Jun 23

Exactly this — a loud failure is a known failure.

QuantaMind doesn't trust the model's "I'm done." It watches the actual tool calls made during the run and checks them against a required sequence. If the model skipped a step, it's a fail — doesn't matter what it claimed.

The pipeline debugger shows you the exact call trace so you can see where it went wrong.
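
On the "checklist match, or something looser" question, the two options can be sketched like this. It is an illustration of the difference between the two matching styles, not a statement of which one QuantaMind uses; the function names and tool names are hypothetical.

```python
def strict_match(trace, required):
    """Checklist match: the trace must equal the required sequence exactly."""
    return list(trace) == list(required)


def ordered_match(trace, required):
    """Looser match: required calls appear in order; extra calls are allowed."""
    remaining = iter(trace)
    # `step in remaining` consumes the iterator, so order is enforced.
    return all(step in remaining for step in required)


required = ["grep", "report"]            # hypothetical tool names
detour = ["grep", "read_file", "report"]  # one extra, harmless call
swapped = ["report", "grep"]              # right calls, wrong order

print(strict_match(detour, required), ordered_match(detour, required))
# False True
print(strict_match(swapped, required), ordered_match(swapped, required))
# False False
```

The strict form penalizes harmless detours; the ordered form tolerates them but still fails a run that skips or reorders a required step.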