Local coding models are everywhere right now, and benchmark numbers on a README
tell you almost nothing useful. Does the model loop? Does it hallucinate tool
calls? Does accuracy fall off a cliff when the context grows? I ran
north-mini-code-1.0:mlx-mxfp8 through QuantaMind's full diagnostic suite on a
64 GB unified-memory Mac and the results are worth sharing — both the impressive
parts and the hard limits.
The setup
| Field | Value |
|---|---|
| Model | north-mini-code-1.0:mlx-mxfp8 |
| Runtime | Ollama (:11434) |
| Hardware | 64 GB unified memory (Apple Silicon) |
| Benchmark tool | QuantaMind Inspector + Audit + Agent Report |
| Test domain | Coding |
All three probes — latency, context retention, and agentic reliability — were run
cold (model not yet warm in memory) to simulate a real first-use scenario.
1 · Inspector: raw throughput and VRAM
The Inspector measures the three phases that define how a model feels in practice:
model load, prompt prefill, and inter-token generation latency.
| Phase | Measured |
|---|---|
| Cold load | 2 924 ms |
| Prompt prefill | 797 ms (123 prompt tokens) |
| Inter-token latency | 16.5 ms/tok (60.8 tok/s) |
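If you want to sanity-check these three phases without QuantaMind, here is a minimal sketch that derives them from the timing fields Ollama returns on a non-streaming `/api/generate` response (`load_duration`, `prompt_eval_duration`, `eval_count`, `eval_duration`, all in nanoseconds). I don't know whether the Inspector measures them this way; the `PhaseTimings` type and the sample values are my own illustration, not output from the run above.

```python
from dataclasses import dataclass

NS_PER_MS = 1_000_000  # Ollama reports durations in nanoseconds


@dataclass
class PhaseTimings:
    load_ms: float
    prefill_ms: float
    inter_token_ms: float
    tokens_per_s: float


def phase_timings(resp: dict) -> PhaseTimings:
    """Split one Ollama /api/generate response into load, prefill and decode."""
    load_ms = resp["load_duration"] / NS_PER_MS
    prefill_ms = resp["prompt_eval_duration"] / NS_PER_MS
    decode_ms = resp["eval_duration"] / NS_PER_MS
    # Average gap between generated tokens over the whole decode phase.
    inter_token_ms = decode_ms / resp["eval_count"]
    return PhaseTimings(load_ms, prefill_ms, inter_token_ms, 1000.0 / inter_token_ms)


# Illustrative values only (eval_count is made up), in the same range as this run.
sample = {
    "load_duration": 2_924_000_000,
    "prompt_eval_count": 123,
    "prompt_eval_duration": 797_000_000,
    "eval_count": 200,
    "eval_duration": 3_300_000_000,
}
t = phase_timings(sample)
print(f"{t.load_ms:.0f} ms load, {t.prefill_ms:.0f} ms prefill, "
      f"{t.inter_token_ms:.1f} ms/tok, {t.tokens_per_s:.1f} tok/s")
```

Note that 1000 / 16.5 is about 60.6 tok/s; the reported 60.8 tok/s implies an unrounded latency closer to 16.45 ms/tok.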
VRAM footprint: 29.5 GB of 64 GB (46%) at a 500 000-token context ceiling.
The OOM ceiling marker sits comfortably above the model load line at current
context depth — but watch that number if you push toward the context limit with
large codebases.
Three latency spikes were detected in the generation phase (visible as red marks
on the waterfall). These are outlier inter-token gaps, not dropped packets, and
the likely causes are GC pauses or moments of memory pressure. They don't derail
generation, but they do cause occasional stutter in streaming UIs.
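One simple way to flag outlier gaps like these is to compare each inter-token gap against a multiple of the median. This is only a sketch: the `find_spikes` helper, the 3x factor, and the sample gaps are assumptions of mine, and QuantaMind's actual spike threshold may be different.

```python
from statistics import median


def find_spikes(gaps_ms: list[float], factor: float = 3.0) -> list[int]:
    """Return indices of inter-token gaps longer than factor x the median gap."""
    if not gaps_ms:
        return []
    threshold = factor * median(gaps_ms)
    return [i for i, gap in enumerate(gaps_ms) if gap > threshold]


# Made-up gaps in milliseconds: mostly ~16.5 ms with three obvious outliers.
gaps = [16.4, 16.6, 16.5, 92.0, 16.3, 16.7, 71.5, 16.5, 16.4, 120.3]
spikes = find_spikes(gaps)
print(spikes)
```

Using the median rather than the mean keeps the threshold stable even when a few very long gaps are present.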
Takeaway: 60.8 tok/s is fast for a model this size on Apple Silicon. Cold
start under 3 seconds is comfortable for a workflow tool. VRAM headroom is
generous at 46% used.
2 · Audit: context-cliff diagnostic
The context-cliff probe progressively fills the context window with padding
(Corporate Policy prose — realistic, dense, but semantically irrelevant) and
checks whether the model can still retrieve a hidden fact buried at the start.

| Step | Prompt tokens | Accuracy | Status |
|---|---|---|---|
| 1 | 829 | 100% | Pass |
| 2 | 2 134 | 100% | Pass |
| 3 | 3 903 | 100% | Pass |
| 4 | 6 072 | 100% | Pass |
| 5 | 7 902 | 100% | Pass |
Verdict: accuracy maintained up to ≈ 8 000 tokens. The model does not
hallucinate or confabulate the target fact as context grows — it holds the signal
through nearly 8 k tokens of distractor noise.
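This is the part of the benchmark most README numbers skip entirely. A model that
degrades at 4 k tokens is useless for real repo-level tasks. north-mini-code-1.0
doesn't degrade anywhere in the tested range, a genuine green flag for coding
workflows where you're passing large files or multi-file context. The probe
stopped at 7 902 tokens, though, so it says nothing about retention further
toward the 500 000-token ceiling.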
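For readers who want to try the same idea on their own model, the probe can be sketched roughly as follows: put a fact first, pad with irrelevant prose, ask for the fact at the end, and grow the padding step by step. Everything named here (`NEEDLE`, `FILLER`, `run_cliff_probe`, the stand-in `fake_model`) is hypothetical; QuantaMind's real prompts and scoring are not shown in this post, and a real run would pass a function that calls your model instead of the stub.

```python
from typing import Callable

NEEDLE = "The deployment passphrase is 'amber-falcon-42'."
QUESTION = "What is the deployment passphrase? Answer with the passphrase only."
EXPECTED = "amber-falcon-42"
FILLER = "Employees must submit expense reports within thirty days of purchase. "


def build_prompt(filler_repeats: int) -> str:
    """Place the fact first, then the padding, then the question at the end."""
    return f"{NEEDLE}\n\n{FILLER * filler_repeats}\n\n{QUESTION}"


def run_cliff_probe(
    generate: Callable[[str], str], steps: list[int]
) -> list[tuple[int, bool]]:
    """For each padding size, record whether the answer still contains the fact."""
    results = []
    for repeats in steps:
        answer = generate(build_prompt(repeats))
        results.append((repeats, EXPECTED in answer))
    return results


def fake_model(prompt: str) -> str:
    """Offline stand-in that 'forgets' the fact once the prompt gets long."""
    return EXPECTED if len(prompt) < 20_000 else "I don't know."


report = run_cliff_probe(fake_model, [10, 100, 1000])
print(report)
```

A single substring check per step is the crudest possible scoring; running several trials per step and averaging would give a percentage like the one in the table.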
3 · Agent Report: agentic reliability
This is where things get real. The Agent Report runs the model as an autonomous
coding agent — it must use tools, chain steps, and call done correctly.
Executive verdict
CONDITIONAL — Clears through Easy; falls off at Hard, the most demanding
tier tested. Hardware class: Workstation (64 GB RAM). HW recommends HARD.
| Tier | Pass rate | Avg steps | Result |
|---|---|---|---|
| Easy | 100% (Pass^5) | 1.8 | ✅ CLEAR |
| Medium | — | — | NOT TESTED |
| Hard | 0% (Pass^16) | 7.4 | ❌ FAIL |
| Extreme | — | — | NOT TESTED |
Failure taxonomy (across Easy + Hard, 64 tracked events)
| Failure type | Share | Description |
|---|---|---|
| InfiniteLoop | 83% | Failed to resolve hidden prerequisites; repeated actions |
| Hallucinated | 17% | Claimed done / called methods outside the schema |
The dominant failure mode is looping: the model gets stuck resolving
prerequisites it cannot surface, cycling through the same tool calls instead of
recognising the dead end and re-planning. With a 28-step horizon and 8 decoy tools
(Hard tier), it averages 7.4 steps before failing and never reaches the goal.
Hallucination is the secondary problem: claiming done prematurely, or invoking
tool names that don't exist in the schema. At 17% of failure events this is
meaningful but not the primary blocker.
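Both hallucination variants are cheap to catch on the harness side before a call is executed. Here is a minimal sketch, assuming a made-up tool schema and a `goal_reached` flag supplied by the harness; neither comes from QuantaMind.

```python
# Hypothetical tool schema; a real agent harness would load this from its config.
ALLOWED_TOOLS = {"read_file", "write_file", "run_tests", "done"}


def classify_call(name: str, goal_reached: bool) -> str:
    """Label one tool call as 'ok', 'unknown_tool' or 'premature_done'."""
    if name not in ALLOWED_TOOLS:
        return "unknown_tool"  # tool name does not exist in the schema
    if name == "done" and not goal_reached:
        return "premature_done"  # model claims success before the goal is met
    return "ok"


print(classify_call("git_push", goal_reached=False))
print(classify_call("done", goal_reached=False))
print(classify_call("read_file", goal_reached=False))
```

Rejecting an unknown tool name with an error message, rather than silently dropping it, gives the model a chance to correct itself on the next step.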
What this means in practice
For simple, single-hop coding tasks — generate a function, fix a bug,
explain a snippet — this model is fast, accurate, and context-stable. The Easy
tier numbers (100% Pass^5, 1.8 avg steps) confirm it.
For multi-step agentic workflows (tasks that require planning, dependency
resolution across several tool calls, and graceful backtracking) it isn't
reliable yet. With a 0% pass rate at Hard tier and infinite loops making up 83% of
tracked failure events, you should not deploy this model as the backbone of an
autonomous coding agent without a loop-detection layer or a stronger
orchestrating model above it.
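A loop-detection layer does not have to be elaborate. The sketch below flags a run as stuck once the same tool call (same name, same arguments) appears a set number of times within a sliding window of recent calls. The `LoopDetector` class, the window of 6, and the limit of 3 repeats are my own assumptions, not values taken from the benchmark.

```python
import json
from collections import Counter


class LoopDetector:
    """Flags an agent run once the same tool call repeats too often recently."""

    def __init__(self, window: int = 6, max_repeats: int = 3):
        self.window = window
        self.max_repeats = max_repeats
        self.history: list[str] = []

    def record(self, tool: str, args: dict) -> bool:
        """Record one call; return True if the run now looks stuck."""
        # Serialise with sorted keys so identical arguments always match.
        key = f"{tool}:{json.dumps(args, sort_keys=True)}"
        self.history.append(key)
        recent = self.history[-self.window:]
        return Counter(recent)[key] >= self.max_repeats


detector = LoopDetector()
calls = [
    ("read_file", {"path": "setup.py"}),
    ("run_tests", {}),
    ("read_file", {"path": "setup.py"}),
    ("run_tests", {}),
    ("read_file", {"path": "setup.py"}),
]
flags = [detector.record(tool, args) for tool, args in calls]
print(flags)
```

When `record` returns True, the orchestrator can abort the run, inject a "you are repeating yourself, re-plan" message, or hand the task to a stronger model.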
TL;DR
| Dimension | Result |
|---|---|
| Speed | ✅ 60.8 tok/s — fast |
| VRAM | ✅ 29.5 GB / 64 GB — comfortable |
| Context retention | ✅ 100% accuracy to 8 k tokens |
| Simple tasks | ✅ 100% pass rate |
| Agentic reliability (Hard) | ❌ 0% pass — loops dominate |
Best fit: a fast, locally run assistant for single-shot coding queries and
retrieval over contexts up to the ≈ 8 k tokens tested. Not yet a drop-in
autonomous agent backbone.
Tools used
- QuantaMind (GitHub): Inspector, Audit (Context-Cliff), Agent Report
- Ollama: model serving
Have you benchmarked a different quantisation level of this model? Drop the
numbers in the comments — I'm curious whether mxfp4 trades context retention
for more VRAM headroom.