I've been building QuantaMind — a desktop app for evaluating local LLMs on real agentic tasks, not just vibes. This week I ran the built-in Coding eval suite against two popular Ollama models and the gap was wider than I expected.
Here's what I found.
The setup
Both models ran on the same machine (64 GB RAM, Workstation class), same Ollama backend, same settings:
- Difficulty: Easy tier
- Eval suite: Built-in Coding collection (agentic tasks)
- K (pass consistency): 5 runs per task
- Decoys: off
- Max steps: 8
The five tasks tested:
| Task ID | Category |
|---|---|
| es_co_run_failing_test | agent_loop |
| es_co_lint_then_report | agent_loop |
| es_co_grep_symbol | agent_loop |
| es_co_branch_target | agent_loop |
| es_co_dep_pin | agent_loop |
gemma4:e4b (Q4_K_M) — 60% Pass Rate

15/25 runs passed. The model handled the grep and lint tasks cleanly but consistently failed es_co_branch_target and es_co_dep_pin.
- Avg steps: 1.8
- Effort: 257 tokens
- Top error: FAKE DONE (the model reported task completion without actually finishing it)
That last point is the important one. FAKE DONE is a specific failure mode QuantaMind tracks — the model calls the done tool or says it's finished before the required sequence of tool calls is complete. It's a reliability red flag for agentic use.
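To make the failure mode concrete, here is a minimal sketch of how a run could be classified as FAKE DONE. It is not QuantaMind's actual implementation: the function names, the labels other than FAKE DONE, and the tool names in the usage note are all hypothetical, and it assumes a task is defined by an ordered list of required tool calls.

```python
from typing import Sequence


def completed_required_calls(required: Sequence[str], observed: Sequence[str]) -> bool:
    """True if `required` occurs in order (not necessarily contiguously) in `observed`."""
    remaining = iter(observed)
    # `name in remaining` advances the iterator, so ordering is enforced.
    return all(name in remaining for name in required)


def classify_run(required: Sequence[str], observed: Sequence[str], claimed_done: bool) -> str:
    """Label one run. The labels are illustrative, not QuantaMind's schema."""
    finished = completed_required_calls(required, observed)
    if claimed_done and not finished:
        return "FAKE_DONE"  # model said it was finished before the sequence was complete
    if finished:
        return "PASS"
    return "INCOMPLETE"  # ran out of steps without claiming completion
```

With a hypothetical task that requires `run_tests` followed by `report`, `classify_run(["run_tests", "report"], ["run_tests"], claimed_done=True)` returns `"FAKE_DONE"`, because the model signalled completion after only the first required call.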
qwen3.6:35b (Q4_K_M) — 100% Pass Rate

25/25 runs passed. Every task, every run. Clean.
- Avg steps: 1.8 (same!)
- Effort: 447 tokens (74% more tokens spent)
- Top error: NONE
Interestingly, both models averaged exactly 1.8 steps per task. The difference was in correctness, not in how many actions they took. qwen3.6:35b just did the right things.
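The headline numbers for both models follow directly from the run counts and token figures reported above. A short sketch of the arithmetic (the helper names are mine, not part of QuantaMind):

```python
def pass_rate(passed: int, total: int) -> float:
    """Fraction of runs that passed."""
    return passed / total


def relative_overhead(baseline: float, other: float) -> float:
    """How much larger `other` is than `baseline`, as a fraction of `baseline`."""
    return (other - baseline) / baseline


TASKS, K = 5, 5          # five tasks, K = 5 runs per task
total_runs = TASKS * K   # 25 runs per model

gemma_rate = pass_rate(15, total_runs)        # gemma4:e4b, 15/25
qwen_rate = pass_rate(25, total_runs)         # qwen3.6:35b, 25/25
token_overhead = relative_overhead(257, 447)  # qwen effort vs gemma effort

print(f"{gemma_rate:.0%}, {qwen_rate:.0%}, {token_overhead:.0%}")
# prints "60%, 100%, 74%"
```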
What this means
A few takeaways:
1. Parameter count isn't everything for agentic tasks.
gemma4:e4b is a 4B model; qwen3.6:35b is nearly 9× larger at 35B. You'd expect the larger model to come out ahead, but the 60% vs 100% gap on structured agentic loops is still striking, because Easy tier tasks should be baseline reliable.
2. "FAKE DONE" is a real production concern.
If you're building an agent pipeline, a model that prematurely signals completion is worse than one that fails loudly. This failure mode doesn't show up in standard benchmarks.
3. Effort ≠ accuracy.
qwen used 74% more tokens to achieve its 100% pass rate. That's a real cost in latency and context if you're running these locally. There's a trade-off to weigh.
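On takeaway 2, one way to turn a premature "done" into a loud failure in your own pipeline is to never accept the model's claim without an independent check of the end state. The sketch below is an illustration under that assumption; `FakeDoneError`, `accept_result`, and the example check are hypothetical names, not part of QuantaMind or Ollama.

```python
from typing import Callable


class FakeDoneError(RuntimeError):
    """Raised when an agent claims completion but the end state does not check out."""


def accept_result(claimed_done: bool, verify: Callable[[], bool]) -> bool:
    """Accept a step only if the agent claimed done AND an independent check passes."""
    if not claimed_done:
        return False  # still working, or gave up: the caller decides what to do next
    if not verify():
        # Fail loudly instead of letting a false completion flow downstream.
        raise FakeDoneError("agent reported done, but verification failed")
    return True


# Usage with a made-up end-state check: did the pin actually land in the file?
requirements_text = "requests>=2.0\n"
try:
    accept_result(True, lambda: "requests==" in requirements_text)
except FakeDoneError as err:
    print(f"rejected: {err}")
```

The design choice here is to raise rather than return `False` on a failed check, so a silent false completion cannot be mistaken for a step that is simply still in progress.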
Try it yourself
QuantaMind is a free macOS desktop app. Run your own models through the same eval suite and see where they land before putting them in a pipeline.
The coding eval collection is built-in — just pick your model, hit Run Batch, and let the agentic simulator do the rest. The context-cliff probe and Agent Report tab give you even deeper readiness data.
Have you tested local models on agentic tasks? I'd love to compare notes in the comments.
Top comments (1)
It's the FAKE DONE finding that gets me. I've hit exactly this: a local model that "passes" a quick manual test because it confidently says it's done, and you only catch it when something downstream breaks. The fact that it doesn't show up in standard benchmarks is the whole problem. A model that fails loudly is debuggable; one that lies about completion isn't.
Question:
How does QuantaMind actually verify completion? If the model can fake "done," presumably you're checking the end state against an expected sequence of tool calls rather than trusting the model's claim. Is it a strict checklist match, or something looser?