QuantaMind

Posted on Jun 30

Fast Model, Fatal Loop: north-mini-code-1.0 Honest Benchmark

#ai #opensource #rust #testing

Local coding models are everywhere right now, and benchmark numbers on a README
tell you almost nothing useful. Does the model loop? Does it hallucinate tool
calls? Does accuracy fall off a cliff when the context grows? I ran
north-mini-code-1.0:mlx-mxfp8 through QuantaMind's full diagnostic suite on a
64 GB unified-memory Mac and the results are worth sharing — both the impressive
parts and the hard limits.

The setup

| Field | Value |
|---|---|
| Model | `north-mini-code-1.0:mlx-mxfp8` |
| Runtime | Ollama (`:11434`) |
| Hardware | 64 GB unified memory (Apple Silicon) |
| Benchmark tool | QuantaMind Inspector + Audit + Agent Report |
| Test domain | Coding |

All three probes — latency, context retention, and agentic reliability — were run
cold (model not yet warm in memory) to simulate a real first-use scenario.

1 · Inspector: raw throughput and VRAM

The Inspector measures the three phases that define how a model feels in practice:
model load, prompt prefill, and inter-token generation latency.

Cold load      2 924 ms
Prompt prefill   797 ms  (123 prompt tokens)
Inter-token       16.5 ms/tok  →  60.8 tok/s
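
If you want to sanity-check figures like these without QuantaMind, Ollama's own response metadata is enough for a rough version. The sketch below is not how the Inspector computes its numbers (I'm not describing its internals here); it only assumes the timing fields Ollama returns in the final `/api/generate` response, which are reported in nanoseconds. The function name `summarize_timings` and the sample values are made up for illustration.

```python
# Sketch: derive load, prefill and per-token figures from Ollama's timing fields.
# Ollama reports these durations in nanoseconds.

NS_PER_MS = 1_000_000


def summarize_timings(resp: dict) -> dict:
    """Convert Ollama timing fields into load, prefill and per-token figures."""
    load_ms = resp["load_duration"] / NS_PER_MS
    prefill_ms = resp["prompt_eval_duration"] / NS_PER_MS
    gen_ms = resp["eval_duration"] / NS_PER_MS
    ms_per_tok = gen_ms / resp["eval_count"]  # mean gap, not individual gaps
    return {
        "load_ms": load_ms,
        "prefill_ms": prefill_ms,
        "prompt_tokens": resp["prompt_eval_count"],
        "ms_per_tok": ms_per_tok,
        "tok_per_s": 1000.0 / ms_per_tok,
    }


# Made-up sample values, NOT the measurements reported in this post.
sample = {
    "load_duration": 2_000_000_000,
    "prompt_eval_count": 100,
    "prompt_eval_duration": 500_000_000,
    "eval_count": 200,
    "eval_duration": 4_000_000_000,
}
print(summarize_timings(sample))
```

For the made-up sample this works out to 20 ms/tok, or 50 tok/s. Note that this is a mean over the whole generation, so it hides exactly the kind of outlier gaps a waterfall view makes visible.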

VRAM footprint: 29.5 GB of 64 GB (46%) at a 500 000-token context ceiling.
The OOM ceiling marker sits comfortably above the model load line at current
context depth — but watch that number if you push toward the context limit with
large codebases.

Three latency spikes were detected in the generation phase (visible as red marks
on the waterfall). These are outlier inter-token gaps, not dropped packets; the
likely causes are GC pauses or moments of memory pressure. They don't derail
generation, but they do cause occasional stutter in streaming UIs.
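
To make "outlier inter-token gap" concrete, here is one simple way to flag spikes from a list of per-token gaps. The rule (anything longer than three times the median gap) is my own assumption for illustration, not the Inspector's actual threshold, and `find_spikes` is a hypothetical helper.

```python
from statistics import median


def find_spikes(gaps_ms: list[float], factor: float = 3.0) -> list[int]:
    """Return indices of inter-token gaps longer than `factor` times the median gap."""
    if not gaps_ms:
        return []
    threshold = factor * median(gaps_ms)
    return [i for i, gap in enumerate(gaps_ms) if gap > threshold]


# Made-up gaps in milliseconds; indices 3 and 7 are the outliers.
gaps = [16.0, 17.0, 16.5, 90.0, 16.2, 16.8, 16.4, 120.0, 16.6]
print(find_spikes(gaps))  # [3, 7]
```

Using the median rather than the mean keeps the threshold from being dragged upward by the spikes themselves.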

Takeaway: 60.8 tok/s is fast for a model this size on Apple Silicon. Cold
start under 3 seconds is comfortable for a workflow tool. VRAM headroom is
generous at 46% used.

2 · Audit: context-cliff diagnostic

The context-cliff probe progressively fills the context window with padding
(Corporate Policy prose — realistic, dense, but semantically irrelevant) and
checks whether the model can still retrieve a hidden fact buried at the start.
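
The shape of such a probe is easy to sketch. The version below is a minimal illustration under my own assumptions, not QuantaMind's implementation: the fact, question, filler sentence, and the substring-based scoring are all hypothetical stand-ins.

```python
def build_probe_prompt(fact: str, question: str, filler: str, filler_repeats: int) -> str:
    """Place the target fact first, then distractor text, then the question."""
    padding = "\n".join(filler for _ in range(filler_repeats))
    return f"{fact}\n\n{padding}\n\n{question}"


def is_retrieved(answer: str, expected: str) -> bool:
    """Score a reply as correct if it contains the expected value (case-insensitive)."""
    return expected.lower() in answer.lower()


# Hypothetical fact and filler, purely for illustration.
FACT = "The deployment passphrase is 'amber-falcon-42'."
QUESTION = "What is the deployment passphrase?"
FILLER = "Employees must submit expense reports within thirty days of purchase."

# Each step pads the context further; send each prompt to the model and score the reply.
prompts = [build_probe_prompt(FACT, QUESTION, FILLER, n) for n in (10, 50, 100)]
print([len(p.split()) for p in prompts])
```

Word counts are only a crude proxy for prompt tokens; a real run should read the token count back from the runtime.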

| Step | Prompt tokens | Accuracy | Status |
|---|---|---|---|
| 1 | 829 | 100% | Pass |
| 2 | 2 134 | 100% | Pass |
| 3 | 3 903 | 100% | Pass |
| 4 | 6 072 | 100% | Pass |
| 5 | 7 902 | 100% | Pass |

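Verdict: accuracy held at 100% across all five steps, up to 7 902 prompt tokens.
The model does not hallucinate or confabulate the target fact as context grows;
it holds the signal through nearly 8 k tokens of distractor noise. The probe
stopped there, so this says nothing about behaviour closer to the 500 000-token
context ceiling.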

This is the part of the benchmark most README numbers skip entirely. A model that
degrades at 4 k tokens is useless for real repo-level tasks. north-mini-code-1.0
shows no degradation in the tested range, which is a genuine green flag for
coding workflows that pass a large file or a few files of context, within
roughly 8 k tokens.

3 · Agent Report: agentic reliability

This is where things get real. The Agent Report runs the model as an autonomous
coding agent — it must use tools, chain steps, and call done correctly.

Executive verdict

CONDITIONAL: clears the Easy tier; fails at Hard, the most demanding tier
tested. Hardware class: Workstation (64 GB RAM). The hardware-based
recommendation is the Hard tier.

| Tier | Pass rate | Avg steps | Result |
|---|---|---|---|
| Easy | 100% (Pass^5) | 1.8 | ✅ CLEAR |
| Medium | — | — | NOT TESTED |
| Hard | 0% (Pass^16) | 7.4 | ❌ FAIL |
| Extreme | — | — | NOT TESTED |

Failure taxonomy (across Easy + Hard, 64 tracked events)

| Failure type | Share | Description |
|---|---|---|
| InfiniteLoop | 83% | Failed to resolve hidden prerequisites; repeated actions |
| Hallucinated | 17% | Claimed done / called methods outside the schema |

The dominant failure mode is looping: the model gets stuck on prerequisites it
cannot surface and cycles through the same tool calls instead of recognising the
dead end and re-planning. On the Hard tier (28-step horizon, 8 decoy tools) it
averages 7.4 steps before failing and never reaches the goal.

Hallucination is the secondary problem: claiming done prematurely, or invoking
tool names that don't exist in the schema. At 17% of failure events this is
meaningful but not the primary blocker.
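
The out-of-schema half of this failure type is cheap to catch before a call is ever executed. Below is a minimal sketch, assuming a hypothetical schema that maps each tool name to its allowed argument names; the tool names and `validate_tool_call` are invented for illustration and are not part of QuantaMind or Ollama.

```python
def validate_tool_call(call: dict, schema: dict[str, set[str]]) -> list[str]:
    """Return a list of problems; an empty list means the call matches the schema."""
    name = call.get("name")
    if name not in schema:
        return [f"unknown tool: {name!r}"]
    unknown_args = set(call.get("arguments", {})) - schema[name]
    return [f"unknown argument for {name}: {arg!r}" for arg in sorted(unknown_args)]


# Hypothetical schema: tool name -> allowed argument names.
SCHEMA = {
    "read_file": {"path"},
    "run_tests": {"target"},
    "done": {"summary"},
}

print(validate_tool_call({"name": "read_file", "arguments": {"path": "src/main.rs"}}, SCHEMA))
print(validate_tool_call({"name": "delete_repo", "arguments": {}}, SCHEMA))
```

The first call prints an empty list and the second reports the unknown tool. A check like this does nothing about a premature `done`, which is a valid call made at the wrong time; that needs a separate test of whether the goal state was actually reached.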

What this means in practice

For simple, single-hop coding tasks — generate a function, fix a bug,
explain a snippet — this model is fast, accurate, and context-stable. The Easy
tier numbers (100% Pass^5, 1.8 avg steps) confirm it.

For multi-step agentic workflows (tasks that require planning, dependency
resolution across several tool calls, and graceful backtracking) it isn't
reliable yet. With a 0% pass rate at the Hard tier and infinite loops making up
83% of tracked failure events, you should not deploy this model as the backbone
of an autonomous coding agent without a loop-detection layer or a stronger
orchestrating model above it.
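
A loop-detection layer does not have to be elaborate. The sketch below is one possible approach under my own assumptions, not something QuantaMind ships: it keeps a sliding window of recent tool calls and flags the agent once the same call (same tool, same arguments) shows up too often. `LoopDetector`, the window size, and the repeat limit are all hypothetical choices.

```python
import json
from collections import Counter, deque


class LoopDetector:
    """Flag an agent that repeats the same tool call too often within a recent window."""

    def __init__(self, window: int = 6, max_repeats: int = 3) -> None:
        self.recent: deque[str] = deque(maxlen=window)
        self.max_repeats = max_repeats

    def record(self, tool: str, arguments: dict) -> bool:
        """Record one call; return True once it has occurred max_repeats times in the window."""
        key = json.dumps([tool, arguments], sort_keys=True)
        self.recent.append(key)
        return Counter(self.recent)[key] >= self.max_repeats


detector = LoopDetector(window=6, max_repeats=3)
calls = [
    ("list_dir", {"path": "."}),
    ("read_file", {"path": "Cargo.toml"}),
    ("list_dir", {"path": "."}),
    ("read_file", {"path": "Cargo.toml"}),
    ("list_dir", {"path": "."}),
]
flags = [detector.record(tool, args) for tool, args in calls]
print(flags)  # [False, False, False, False, True]
```

When `record` returns True, the orchestrator can abort the run, inject a re-planning prompt, or hand the task to a stronger model. Counting within a window, rather than only checking consecutive repeats, also catches two-step cycles like the one above.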

TL;DR

| Dimension | Result |
|---|---|
| Speed | ✅ 60.8 tok/s (fast) |
| VRAM | ✅ 29.5 GB / 64 GB (comfortable) |
| Context retention | ✅ 100% accuracy to 8 k tokens |
| Simple tasks | ✅ 100% pass rate |
| Agentic reliability (Hard) | ❌ 0% pass (loops dominate) |

Best fit: a fast, locally-run assistant for single-shot coding queries and
context-heavy retrieval tasks. Not yet a drop-in autonomous agent backbone.

Tools used

QuantaMind (GitHub): Inspector, Audit (Context-Cliff), Agent Report
Ollama: model serving

Have you benchmarked a different quantisation level of this model? Drop the
numbers in the comments — I'm curious whether mxfp4 trades context retention
for more VRAM headroom.
