QuantaMind

Posted on Jul 2

100% to 0%: What Happened When I Stress-Tested a Local 35B Model's Tool-Calling

#testing #ai #development #opensource

I've spent the last few weeks building a benchmarking desktop app for local LLMs — think of it as a dashboard that sits on top of llama.cpp / Ollama and answers the question every local-model tinkerer eventually asks: "is this model actually good, or does it just look good in a chat window?"

Here's the short version of what I found testing a 35B quantized model: it aced a five-task "easy" agentic suite with a 100% pass rate, then I bumped the difficulty up one notch and it went to 0/54. Zero. Not "a bit worse." Zero.

That gap is the whole reason this post exists, because I don't think it's specific to this model — I think it's specific to how most of us evaluate local LLMs (a chat prompt, a vibe check, maybe a HumanEval score) versus how we actually use them (multi-step, tool-calling, agentic loops). Here's the walkthrough.

The setup: a model, a quant, and a question

The model under test is ornith-1.0-35b-Q8_0, served locally through llama.cpp, alongside Whisper.cpp for the audio side of the same pipeline. Before touching anything agentic, I wanted a baseline read on whether the model could reason about code at all — a single-shot code review task.

This part was genuinely impressive. Fed a script for review, it correctly flagged an unhandled EOFError around input() calls that would crash the "graceful exit" path the script promised — a real bug, not a hallucinated one. Underneath the response, the harness logs the numbers that actually matter for local inference: 73,317ms time-to-first-token, 53.2 tok/s, 533 tokens generated, and a cache stat (0/451 reused) confirming this was a cold run with no prompt-prefix caching. Nice model. Slow first token. Filed that away.

Before you even get to "is it smart": does it fit your machine?

Long TTFT numbers like that raise the obvious next question — what's the actual memory/quality tradeoff here, and did I pick the right quantization for my hardware? This is the part of local-LLM work that's usually done by squinting at a GGUF filename and hoping. I built a comparison view specifically so I wouldn't have to guess:

For a gemma4 8.0B model on a 41GB Mac, the recommendation engine lands on Q4_K_M at 8.9GB — "the highest-quality quant that fits," with headroom. That's a boring, correct answer, and boring-correct is exactly what you want before you commit to downloading a 20-70GB file and finding out three hours later that it swaps to disk under load.
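
The "highest-quality quant that fits, with headroom" rule is easy to sketch. This is a minimal illustration, not the app's actual recommendation engine: the `Quant` type, the `pick_quant` function, the 25% headroom figure, and every size in `CANDIDATES` are hypothetical placeholders rather than measurements.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Quant:
    name: str
    size_gb: float     # memory the quantized weights need (placeholder values below)
    quality_rank: int  # higher means closer to the unquantized model


def pick_quant(candidates: List[Quant], total_ram_gb: float,
               headroom_fraction: float = 0.25) -> Optional[Quant]:
    """Return the highest-quality quant that fits after reserving headroom.

    The headroom is an assumed safety margin for the KV cache, the OS,
    and everything else running on the machine.
    """
    budget_gb = total_ram_gb * (1.0 - headroom_fraction)
    fitting = [q for q in candidates if q.size_gb <= budget_gb]
    if not fitting:
        return None  # nothing fits: pick a smaller model instead
    return max(fitting, key=lambda q: q.quality_rank)


# Hypothetical candidate table: the sizes are illustrative, not real file sizes.
CANDIDATES = [
    Quant("Q4_K_M", 5.0, 1),
    Quant("Q6_K", 7.0, 2),
    Quant("Q8_0", 9.0, 3),
]

choice = pick_quant(CANDIDATES, total_ram_gb=10.0)
print(choice.name if choice else "no quant fits")
```

With a 10GB budget and 25% headroom, only the two smaller placeholder quants fit, so the sketch picks the better of those rather than the largest file.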

Watching the model breathe: latency, phase by phase

Aggregate tok/s numbers hide a lot. A model can have a great average throughput and still spike badly on individual tokens — which matters a lot if you're building anything interactive. So the harness breaks a run down into phases (model load → prompt prefill → generation) and flags outliers (latency spikes) inside generation itself:

For the same ornith-1.0-35b-Q8_0 run: model loaded in 6.0s at server startup (a one-time cost, not per-request), prompt prefill processed 451 tokens at 434 tok/s, and generation ran at an 18.8ms average inter-token gap — with 3 outlier spikes visible as red bars near the end of the run. Prefix cache was 0/451 reused, meaning nothing was carried over from a previous turn. On a model you're planning to put behind an actual product, that's the difference between "feels responsive" and "why did it just hang for half a second."
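
Two pieces of arithmetic sit behind those numbers. An 18.8ms average inter-token gap works out to 1000 / 18.8 ≈ 53.2 tok/s, which matches the generation speed logged earlier. Spike detection can be sketched as "any gap well above the typical gap". The median-times-three rule below is my assumption for illustration, not necessarily the threshold the harness uses, and the sample gaps are made up.

```python
from statistics import median
from typing import List


def tokens_per_second(mean_gap_ms: float) -> float:
    """Convert an average inter-token gap in milliseconds to throughput."""
    return 1000.0 / mean_gap_ms


def find_spikes(gaps_ms: List[float], factor: float = 3.0) -> List[int]:
    """Return indices of gaps larger than `factor` times the median gap.

    The median is used as the baseline so a few spikes do not drag the
    threshold upward the way they would with a mean.
    """
    baseline = median(gaps_ms)
    return [i for i, gap in enumerate(gaps_ms) if gap > factor * baseline]


# Made-up inter-token gaps (ms) with two obvious stalls near the end.
sample_gaps = [18, 19, 18, 20, 19, 95, 18, 110]

print(round(tokens_per_second(18.8), 1))
print(find_spikes(sample_gaps))
```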

The real test: can it actually do things?

Chat quality and raw latency are necessary but not sufficient. The question I actually care about is: can this model drive a tool-calling loop — call a function, read the result, decide the next step, and eventually finish the task — without a human in the loop?
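
To make "tool-calling loop" concrete, here is a minimal sketch of the shape of such a loop with a step ceiling. Everything in it is an assumption for illustration: `run_agent`, the action dictionaries, the `max_steps=6` default, and the two scripted policies standing in for a real model are not the harness's implementation.

```python
from typing import Any, Callable, Dict, List, Tuple

Action = Dict[str, Any]
History = List[Tuple[Action, Any]]


def run_agent(next_action: Callable[[History], Action],
              tools: Dict[str, Callable[..., Any]],
              max_steps: int = 6) -> Dict[str, Any]:
    """Call tools until the policy emits a final answer or the cap is hit."""
    history: History = []
    for step in range(1, max_steps + 1):
        action = next_action(history)
        if action["type"] == "final":
            return {"verdict": "FINISHED", "steps": step,
                    "answer": action["answer"]}
        # Execute the requested tool and feed the result back to the policy.
        result = tools[action["tool"]](**action["args"])
        history.append((action, result))
    # The policy never produced a final answer within the step budget.
    return {"verdict": "LOOP CAP", "steps": max_steps, "answer": None}


# Hypothetical tool and two scripted stand-ins for a model.
TOOLS = {"run_tests": lambda: "1 failed: test_parse"}


def converging_policy(history: History) -> Action:
    if not history:
        return {"type": "tool", "tool": "run_tests", "args": {}}
    return {"type": "final", "answer": history[-1][1]}


def stuck_policy(history: History) -> Action:
    # Repeats the same call no matter what the tool returned.
    return {"type": "tool", "tool": "run_tests", "args": {}}


good = run_agent(converging_policy, TOOLS)
bad = run_agent(stuck_policy, TOOLS)
print(good["verdict"], good["steps"])
print(bad["verdict"], bad["steps"])
```

The first policy finishes in two steps. The second never emits a final answer, so the only thing that stops it is the cap.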

I ran the model's native tool-calling mode (llama.cpp applying the model's own Jinja chat template, not a prompt-engineered fallback) against a batch of agentic coding tasks. First, the "Easy" tier: five tasks along the lines of "run the failing test, then report"; "lint, then report"; "grep for a symbol"; "open a PR against a target branch"; and "pin a dependency and apply the update." Each task expects a short, bounded tool-call chain.

100% pass rate. 25/25. Average 2 steps. 235 tokens of effort. Every single task, every one of the 5 iterations, passed clean, with zero errors. If I'd stopped here, I would have shipped this model into an agent pipeline and called it a day.

I didn't stop here.

Then I turned the difficulty up one notch

The "Hard" tier isn't a different category of task — it's the same idea (agent loop, native tool-calling, multi-tool chains) scaled up to what a real engineering task actually looks like: fixing a multi-file CI failure, resolving an import cycle across files, profiling and fixing a performance regression, or running a full incident-response chain (pull the audit log, identify the compromised credential, assess blast radius, snapshot forensics, rotate the credential, revoke sessions, file the incident).

0% pass rate. 0/54. Every single task failed. And critically, look at the failure mode: it's not "wrong answer" or "malformed tool call" — the top error is LOOP CAP. The model didn't fail by doing the wrong thing; it failed by never converging. Average steps climbed to 4.4–6 per task, right up against the step ceiling, without ever emitting the final state the task needed. Given more steps, would it eventually get there, or would it loop forever? That's the uncomfortable question a LOOP CAP verdict leaves open, and it's a completely different failure mode than "wrong answer" — one that a five-task easy suite will never surface.
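
One cheap way to start answering that question is to look at what a capped run was doing when it hit the ceiling. The sketch below measures how much of a trace consists of tool calls that exactly repeat an earlier call. The trace format and the `repeated_call_fraction` helper are hypothetical, and the two traces are made up for illustration rather than taken from the harness logs.

```python
import json
from typing import Any, Dict, List


def repeated_call_fraction(calls: List[Dict[str, Any]]) -> float:
    """Fraction of tool calls that exactly repeat an earlier call in the trace."""
    seen = set()
    repeats = 0
    for call in calls:
        # Serialize args with sorted keys so argument order does not matter.
        key = (call["tool"], json.dumps(call["args"], sort_keys=True))
        if key in seen:
            repeats += 1
        seen.add(key)
    return repeats / len(calls) if calls else 0.0


# Made-up traces: one cycling on the same call, one still exploring.
stuck_trace = [
    {"tool": "run_tests", "args": {}},
    {"tool": "run_tests", "args": {}},
    {"tool": "run_tests", "args": {}},
    {"tool": "read_file", "args": {"path": "a.py"}},
]
progressing_trace = [
    {"tool": "run_tests", "args": {}},
    {"tool": "read_file", "args": {"path": "a.py"}},
    {"tool": "read_file", "args": {"path": "b.py"}},
    {"tool": "apply_patch", "args": {"path": "a.py"}},
]

print(repeated_call_fraction(stuck_trace))
print(repeated_call_fraction(progressing_trace))
```

A capped run with a high repeat fraction is probably cycling. A capped run with a low one might simply have needed a bigger step budget, though this metric alone can't prove either.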

Why this gap matters more than either number alone

Neither the 100% nor the 0% is the interesting number on its own. What's interesting is that both came from the same model, same quant, and same calling method, tested minutes apart. A model card, a leaderboard score, or a five-minute chat session would have shown you the 100% and nothing else. You'd deploy it into an agent, hand it a real multi-file task, and watch it spin in a loop burning tokens until something else timed out.

A few things I've taken away from building this and watching it happen live:

"Tool-calling capable" is not binary. It's a curve that degrades with task complexity, chain length, and required tool diversity — not a checkbox a model either has or doesn't.
Loop caps are a distinct failure mode from wrong answers, and they deserve their own bucket. A model that confidently gives a wrong answer is more useful — and more diagnosable — than one that never stops "thinking about it."
Step count and token effort are leading indicators. Watching average steps climb toward the ceiling before the pass rate craters is a warning sign worth alerting on, not just a footnote.
Latency numbers and correctness numbers live in different dashboards for a reason, but they should inform the same decision. A model with a great TTFT and a 0% hard-tier pass rate is not a fast model — it's a fast way to burn a budget on a task that was never going to finish.
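
Here is a rough sketch of what "their own bucket" and "worth alerting on" could look like in a results summary. The verdict labels other than LOOP CAP, the `summarize` function, and the 70% warning ratio are assumptions I picked for illustration, and the sample runs are invented.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize(runs: List[Dict[str, Any]], step_cap: int,
              warn_ratio: float = 0.7) -> Dict[str, Any]:
    """Bucket verdicts separately and flag average steps nearing the cap."""
    verdicts = Counter(run["verdict"] for run in runs)
    avg_steps = sum(run["steps"] for run in runs) / len(runs)
    return {
        "pass_rate": verdicts.get("PASS", 0) / len(runs),
        "verdicts": dict(verdicts),  # LOOP CAP stays distinct from WRONG ANSWER
        "avg_steps": avg_steps,
        "near_cap": avg_steps >= warn_ratio * step_cap,
    }


# Invented runs for illustration only.
mixed_runs = [
    {"verdict": "PASS", "steps": 2},
    {"verdict": "PASS", "steps": 2},
    {"verdict": "LOOP CAP", "steps": 6},
    {"verdict": "WRONG ANSWER", "steps": 3},
]
capped_runs = [{"verdict": "LOOP CAP", "steps": 6} for _ in range(3)]

print(summarize(mixed_runs, step_cap=6))
print(summarize(capped_runs, step_cap=6))
```

The point of the `near_cap` flag is that it can trip while the pass rate still looks healthy, which is when the warning is most useful.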

If you're evaluating a local model for anything more agentic than autocomplete, don't stop at the easy tier. The gap between 100% and 0% is exactly where the real answer is hiding.

Feel free to contribute to this open-source application: GitHub link

Built with QuantaMind — a local desktop harness for benchmarking LLMs across quantization, latency, and agentic tool-calling, running on top of llama.cpp/Ollama. Questions, disagreements, or "your loop-cap threshold is wrong" takes welcome in the comments.

Top comments (4)

Dhanush G • Jul 2

Shedding some light on why the model zeroed out in the hard tier:
When you look at the raw logs, the model didn't fail because it forgot Python syntax or hallucinated a tool name. It failed because it lost the thread of the overall plan.

In simple terms: it would call a tool and get a correct result, but instead of moving to step two, it would panic and repeat the exact same tool call, or try to fix an error that wasn't there. It gets stuck in a reaction loop instead of driving the task forward.

I'm curious if anyone here has found a reliable prompting strategy (or system prompt structure) to force a local 35B model to break out of these loops, or if this is strictly a parameter-size limitation.

QuantaMind • Jul 2

That’s a spot-on observation. The 'reaction loop' behavior is exactly what I was seeing in the logs—it’s like the model gets 'stuck' in the immediate context of the last tool output and loses the higher-level objective.

Gowri Katte • Jul 2

This dashboard looks incredible. Testing local tool-calling like this is exactly what the community needs right now to move past basic chat leaderboards. Would love to try this out.

QuantaMind • Jul 2

Thanks, Gowri! I’m glad the dashboard resonates with you. Moving past basic chat leaderboards is exactly the goal—we need metrics that reflect how these models actually hold up in a real, multi-step engineering loop. If you end up giving it a spin, I'd love to hear how your local models perform on that 'hard' tier!