Running LLMs locally is half the promise — the other half is knowing what they can actually do reliably. I put qwen2.5-coder-14b-instruct-q8_0 through QuantaMind's three-panel evaluation suite (Inspector + Eval + Agent Report) on a 64GB workstation and the results are more nuanced than a simple pass/fail.
TL;DR: Easy coding tasks? Flawless. Hard multi-step agent tasks? Falls apart completely. Here's the data.
The Setup
Model served via llama.cpp backend using native tool-calling (the model's embedded Jinja template). Hardware: 64GB unified-memory workstation. QuantaMind's Inspector measures raw inference telemetry; Eval runs structured agent-loop task batteries across difficulty tiers.
Model details
model = qwen2.5-coder-14b-instruct-q8_0
backend = llama.cpp
method = Tool-Calling (native)
hardware = Workstation · 64GB RAM
Inspector output (single run)
ttft = 334ms
generation = 18.4 tok/s
prompt_prefill = 326ms (129 tokens @ 396 tok/s)
inter_token = 54.4ms avg
outlier_spikes = 32 #latency spikes above threshold
prefix_cache = 0 reused / 129 recomputed # cold start
18.4 tok/s is comfortable for local inference on a 14B Q8 model. The 32 outlier spikes are visible as red bars in the token timeline — momentary latency jumps during generation, likely GC or KV cache pressure. For interactive use they're perceptible but not blocking.
Eval Results: Easy vs Hard
Easy Tier — 25/25 tasks ✅
| Task ID | Target Tools | Steps | Result |
|---|---|---|---|
es_co_run_failing_test |
run_tests, reply | 2 | ✅ PASS |
es_co_lint_then_report |
run_lint, reply | 2 | ✅ PASS |
es_co_grep_symbol |
search_symbol, reply | 2 | ✅ PASS |
es_co_branch_target |
open_pr | 1 | ✅ PASS |
es_co_dep_pin |
get_dep, pin_and_flag, apply_update | 2 | ✅ PASS |
Easy Tier Summary: 100% pass rate · 25/25 · avg 1.8 steps · 73 tokens effort. The model handles single-tool and 2-step agent loops cleanly. Tool schemas respected, correct tool selected every time.
Hard Tier — 0/64 tasks ❌
🛑 Failed Tasks
These tasks required complex tool chaining but ultimately resulted in a failure.
| Task ID | Target Tools | Result |
|---|---|---|
hd_co_ci_multifile_instance0 |
run_ci → read_file → search_symbol → write_file → add_marker |
❌ FAIL |
hd_co_import_cycle_instance0 |
run_import_check → read_file → write_file |
❌ FAIL |
hd_co_perf_regression_instance0 |
run_profiler → read_file → write_file |
❌ FAIL |
hd_co_incident_forensics_instance0 |
get_audit_log → identify_credential → blast_radius → snapshot_forensics → rotate_credential → revoke_sessions → file_incident |
❌ FAIL |
Successful Tasks
These tasks were executed successfully, completing within 1 to 2 steps.
| Task ID | Target Tools | Steps | Result |
|---|---|---|---|
es_co_run_failing_test |
run_tests, reply | 2 | ✅ PASS |
es_co_lint_then_report |
run_lint, reply | 2 | ✅ PASS |
es_co_grep_symbol |
search_symbol, reply | 2 | ✅ PASS |
es_co_branch_target |
open_pr | 1 | ✅ PASS |
es_co_dep_pin |
get_dep, pin_and_flag, apply_update | 2 | ✅ PASS |
⚠️ Hard Tier Failure Pattern: Top error:
FORBIDDEN. Execution loop capped at turn 4. The model either refuses multi-step tool chains or enters loops it can't exit. 28-step horizon tasks are entirely out of reach at this tier.
The FORBIDDEN error points to safety guardrails triggering on tool names like rotate_credential and revoke_sessions — the model refuses to execute the chain in a sandboxed eval context. This isn't a context length problem; the Cliff Depth probe would isolate that, but the failure here starts from turn 1.
Tier Progression
| Tier | Pass Rate | Verdict |
|---|---|---|
| Easy | 100% (pass^5) | ✅ CLEAR |
| Medium | — | NOT TESTED |
| Hard | 0% (pass^16) | ❌ FAIL |
| Extreme | — | NOT TESTED |
Agent Report Verdict: CONDITIONAL 🟡
"Clears through Easy; falls off at Hard — the most demanding tier tested."
Blocking issues: [X Reliability] [X Loops] · pass^k = 0.00 (required ≥ 0.80)
The CONDITIONAL verdict means: usable, but know the ceiling. The Loops blocker is notable — on 16 Hard runs the model reliably enters execution loops rather than completing the task or failing gracefully. That's a reliability signal, not just a difficulty signal.
What Can You Actually Use This For?
▸ Code search and symbol lookup — grep_symbol, find_file patterns: reliable
▸ Single-pass lint / test runner with report — run tool, read output, reply: solid 100%
▸ Dependency pinning and simple PR ops — 1–2 step agent loops: works well
▸ Code explanation and documentation drafting — text generation at 18.4 tok/s is fast enough for interactive use
▸ Autocomplete and inline suggestions — strong code training, fast prefill at 396 tok/s
▸ Privacy-first development — runs fully local, no data leaves the machine
What to Avoid
▸ Multi-file CI/CD debugging chains — fails at Hard tier
▸ Security-sensitive agentic tasks — credential rotation, session management → FORBIDDEN
▸ 28+ step autonomous agents — execution loops, no graceful exit
▸ Import cycle / dependency graph analysis — requires cross-file reasoning the model can't sustain
Bottom Line
qwen2.5-coder-14b-q8_0 is a solid local coding assistant for developers who want fast, privacy-first code search, test running, and simple 2-step agent workflows — at 18.4 tok/s on a 64GB workstation. It's not ready to replace a cloud model on complex autonomous coding tasks.
The CONDITIONAL verdict is honest: it earns its place in the local toolchain, just not at the agentic frontier. The Medium tier remains untested — that's the logical next experiment to find exactly where reliability starts degrading.
Tested with QuantaMind · llama.cpp backend · Native tool-calling · 64GB workstation · June 2026



Top comments (4)
The FORBIDDEN result is the part worth noticing. The model didn't fail because the task was too hard — it backed off because tool names like rotate_credential and revoke_sessions tripped a safety reflex. On a leaderboard that just looks like "bad at hard tasks," but it's a totally different problem with a different fix. Makes you wonder how many "weak agent" scores out there are really just the model being over-cautious.
Exactly! It’s easy to misinterpret those 'FORBIDDEN' results as a lack of intelligence rather than a safety feature. It definitely raises the question of how many models on current leaderboards are actually being penalized for being 'too safe' during agentic tasks.
Why exactly did the model fail 100% of the hard tasks - was it completely stuck in infinite loops on the coding problems, or did it just refuse to run the security tasks because of its safety guardrails?
It was actually both. Security tasks triggered safety guardrails (the 'FORBIDDEN' error), while the more complex coding problems caused the model to get stuck in infinite loops. It really struggled to maintain the chain of logic required for those multi-step processes.