Dhanush G

Posted on Jul 1

Qwen 3.5 vs Ornith 1.0 9B Models, Same Hardware, Same Quant as Coding Agents

#ai #llm #opensource #machinelearning

I ran Qwen 3.5 9B and Ornith 1.0 9B, both at Q8, on the same 16GB Mac, through the same multi-step agent tests. Neither is agent-ready. But they're not ready in interesting, different ways — and the most surprising result is that the native tool-calling API made both of them worse on easy tasks than plain prompting did.

If you run local models, you've seen the recommendation threads: "Qwen 3.5 9B is great for its size," "Ornith 1.0 is impressive." Both are true in a chat window. The question I care about is different: can either one actually run a coding agent (a multi-step loop that calls tools, reads results, and keeps going) on normal hardware?

So I tested both the same way. Same machine (16GB Apple Silicon, Mainstream class). Same quant (Q8). Same task batteries, run through a real agent loop, each task repeated multiple times, counted as passed only if it succeeds every run (pass^k). Same backend (llama.cpp).
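To make the pass^k rule concrete, here is a minimal sketch of how such a score could be computed. The function name, the task names, and the run results are hypothetical illustrations, not the harness's actual code or data.

```python
from typing import Dict, List


def pass_hat_k(runs: Dict[str, List[bool]]) -> float:
    """Fraction of tasks that succeeded on every one of their repeated runs."""
    if not runs:
        return 0.0
    # A task counts only if it has at least one run and all runs succeeded.
    passed = sum(1 for results in runs.values() if results and all(results))
    return passed / len(runs)


# Hypothetical results: task name -> outcome of each repeated run.
example_runs = {
    "run_failing_test": [True, True, True],
    "lint_and_report": [True, False, True],  # one failed run fails the task
}
print(pass_hat_k(example_runs))  # 0.5
```

The point of the strict rule is that a task which works two times out of three still counts as a failure, which is closer to what an unattended agent loop needs.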

Here's what came back.

The headline: neither clears even the Easy tier

Both models got the same top-line verdict: NOT READY (does not clear Easy, the easiest tier tested). Cleared tier: NONE.

That's worth sitting with. These are capable chat models. In a single back-and-forth they answer fine. But "answers one question well" and "survives a tool-calling loop reliably" are nearly unrelated skills, and the loop is where both fall down.

The surprise: native tool-calling was worse than prompting

This is the result I didn't expect, and it showed up for both models.

On the Easy tier, the native tool-calling path scored worse than just describing the tools in the prompt and parsing the model's text:

| Model (Q8, Easy tier) | Native tool-calling | Prompt-based |
| --- | --- | --- |
| Qwen 3.5 9B | 60% (15/25) | 100% (25/25) |
| Ornith 1.0 9B | 60% (15/25) | 100% (25/25) |

Same pattern, same numbers. In both cases the native path failed the exact tasks the prompt path passed — things like running a failing test or linting and reporting. The failure wasn't "the model is dumb." Looking at the failure breakdown, 100% of the easy-tier failures were "reported in prose" — the model did the work correctly but answered in plain text instead of emitting a proper tool call. Right answer, wrong channel.

Everyone's default advice is "use the native function-calling API." For these two 9B models, on easy agent tasks, that advice was backwards.
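For readers who haven't built the prompt-based path, here is a rough sketch of what "describe the tools in the prompt and parse the model's text" can look like. The JSON reply convention, the tool names, and both function names are assumptions for illustration; the actual harness may parse differently.

```python
import json
import re
from typing import Optional

# Assumed convention: the prompt asks the model to reply with a JSON object
# such as {"tool": "run_tests", "args": {"path": "tests/"}}.
BRACE_SPAN_RE = re.compile(r"\{.*\}", re.DOTALL)


def parse_tool_call(text: str) -> Optional[dict]:
    """Parse the span from the first '{' to the last '}' as a tool call."""
    match = BRACE_SPAN_RE.search(text)
    if match is None:
        return None
    try:
        call = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    # Accept only objects that name a tool; default to empty arguments.
    if isinstance(call, dict) and isinstance(call.get("tool"), str):
        call.setdefault("args", {})
        return call
    return None


def classify_reply(text: str) -> str:
    """Label a reply as a tool call or as 'reported in prose'."""
    return "tool_call" if parse_tool_call(text) else "reported_in_prose"
```

A reply like `I will run it. {"tool": "run_tests", "args": {"path": "tests/"}}` is treated as a tool call, while a plain-text answer with no JSON object falls into the "reported in prose" bucket.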

Where they differ: how they break on harder tasks

On the Medium tier the two models start to diverge in their failure modes, which is more useful than the raw score.

Qwen 3.5 9B (Medium): native 40%, prompt-based 0%. The prompt path collapsed completely — top error "NO OUTPUT," it just stopped producing anything usable. Native limped to 40% but its top error was FAKE DONE — claiming the task was complete when it wasn't.

Ornith 1.0 9B (Medium): native 40%, prompt-based 0% — nearly identical top-line. Its Medium failures broke down as 67% hallucinated completions ("claimed done / called methods outside the schema") and 33% infinite loops ("failed to resolve hidden prerequisites; repeated actions").

So as the tasks get harder, both shift from a harmless formatting problem (prose instead of a tool call) into the two genuinely dangerous agent failures: falsely claiming completion, and looping without progress. A model that hallucinates "done" is the worst case for a coding agent, because in a real pipeline it looks like success.
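Both dangerous failures can be caught with fairly simple checks inside the loop. The sketch below is one possible approach, not how any particular tool does it: the `window` threshold, the `Action` shape, and the `verify` callback (for example, re-running the test suite) are all assumptions.

```python
from typing import Callable, List, Tuple

Action = Tuple[str, str]  # (tool name, serialized arguments)


def is_looping(actions: List[Action], window: int = 3) -> bool:
    """True if the last `window` actions are all identical."""
    if len(actions) < window:
        return False
    tail = actions[-window:]
    return all(action == tail[0] for action in tail)


def is_fake_done(claimed_done: bool, verify: Callable[[], bool]) -> bool:
    """True if the model claimed completion but an independent check fails."""
    return claimed_done and not verify()
```

The key design choice is that "done" is never taken from the model's own word; it only counts when a check outside the model confirms it.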

The hidden killer: the cold-start tax

There's one more finding that doesn't show up in any pass rate, and it's arguably the most practical.

Both models are ~8.9GB at Q8. With KV cache and overhead, a run pushes ~15.6GB on a 16GB machine — right at the ceiling. The result: after each response the model gets evicted from memory, and the next call has to reload it from disk before it can even start. That's a ~20-second time-to-first-token, every single call.

In a chat window you barely notice — one reload, then you talk. In an agent loop it's fatal. A 20-step task pays that ~20s reload twenty times — roughly 7 minutes of pure cold-start tax before counting a single token of actual reasoning. Generation speed itself was fine (~9.8 tok/s once loaded). The model isn't slow to think; it's slow because it can't stay in memory.
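The arithmetic behind that figure, as a quick sketch using the approximate numbers above (the function name is just for illustration):

```python
def cold_start_tax_seconds(steps: int, reload_seconds: float) -> float:
    """Total time spent reloading the model if every step pays a full reload."""
    return steps * reload_seconds


tax = cold_start_tax_seconds(steps=20, reload_seconds=20.0)
print(f"{tax:.0f} s, about {tax / 60:.1f} min")  # 400 s, about 6.7 min
```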

This is a hardware-ceiling problem, not a model problem — and it points at a fix: leave headroom. A smaller quant that fits with room to spare would stay resident and skip the per-call reload entirely. "Largest quant that fits" is the wrong rule when fitting exactly means re-loading every step.
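A headroom rule can be written down as a simple check before picking a quant. This is only a sketch: the 2 GB headroom threshold and the 12 GB footprint in the second call are made-up illustrative values, not measurements, and the right margin depends on what else is running on the machine.

```python
def fits_with_headroom(total_gb: float, footprint_gb: float,
                       headroom_gb: float = 2.0) -> bool:
    """True if the run leaves at least `headroom_gb` of memory free.

    `footprint_gb` is weights plus KV cache plus overhead for one run.
    The default headroom is an assumed value, not a measured requirement.
    """
    return total_gb - footprint_gb >= headroom_gb


# The Q8 runs above: about 15.6 GB used on a 16 GB machine.
print(fits_with_headroom(16.0, 15.6))  # False

# A hypothetical smaller quant with a 12 GB total footprint.
print(fits_with_headroom(16.0, 12.0))  # True
```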

So which one?

Honestly: for a coding agent on a 16GB Mac, neither, at Q8. Both are NOT READY, both fail the easiest tier, both pay the cold-start tax. As chat models or for single-shot help they're fine — this isn't a knock on either.

But the comparison taught me three things worth keeping:

Native tool-calling isn't automatically the right path. Both models scored higher with plain prompting on easy tasks. Test both; don't assume the native API wins.
The failure mode matters more than the score. Two models can tie and break completely differently — prose-formatting vs hallucinated-done vs looping — and you debug each one differently.
Fit isn't the same as headroom. A model that just fits cold-starts every call. For agent work, leave memory to spare.

The honest caveats

This is 9B at Q8 on a 16GB Mac, one backend (llama.cpp). Different hardware, quant, or harness may land differently — which is exactly why testing your own combo beats trusting a thread.
Both models are well-regarded for good reason; "not agent-ready here" is a narrow, scoped claim about multi-step tool-calling reliability on constrained hardware, not a verdict on the models overall.
Small sample at Medium. The failure-mode split is a strong signal, not a proof.

I measured all of this with QuantaMind, the open-source tool (on GitHub) I'm building to test local models in a real agent loop on your own hardware. It's free, fully offline, and breaks failures down by type instead of hiding them behind a single score.

Question for you: have you seen native tool-calling underperform plain prompting on a small model? Or is this a 9B-specific thing that disappears at 27B+? I'm collecting these and would genuinely like to know where the line is.

Top comments (4)

Chirag.V.K • Jul 1

This is a fantastic reality check for local agent development. The distinction between "good at chat" and "resilient in a tool loop" is massive.

Your point about native tool-calling failing on formatting rather than reasoning is spot on for 9B models—they have the intellect to solve the problem but lack the structural discipline under loop pressure.

Dhanush G • Jul 1

The chat interface masks a lot of structural flaws. The next hurdle is figuring out if this formatting discipline can be fixed with targeted fine-tuning, or if it strictly requires scaling up the parameter count.

Gowri Katte • Jul 1

It’s wild how differently they break on the harder tasks. When you were testing this, how exactly did your tool help you catch those 'fake done' and infinite loop errors compared to just looking at a standard benchmark score?

Dhanush G • Jul 1

QuantaMind acts as a workbench for the agent loop: it logs every tool call attempt, parsed argument, and raw text output in real time. This lets us see exactly when the model hallucinates a 'done' state or gets stuck in an infinite loop, rather than just getting a generic failure score at the end of the run.