kuroko
What Happens When Local LLMs Fail at Tool Calling — Testing 7 Models with a Rust Coding Agent

I tested 7 local LLMs on the same simple coding task. 4 succeeded. 3 failed — each in a different way. One model burned 30K tokens retrying the exact same broken call because my system prompt told it to.

I built Whet, a coding agent written in Rust. It connects to local LLMs through Ollama and gives them tools — read files, edit files, run shell commands, search code — so the model can actually modify your project instead of just suggesting changes. Think of it as a local, open-source alternative to tools like Claude Code or Cursor, but running entirely on your machine with whatever model you choose.

The key mechanism is tool calling: instead of the model printing "you should edit line 5," the model returns a structured API call like edit_file(path, old_text, new_text), and the agent executes it. When this works, the model can autonomously chain multiple tools to complete a task. When it breaks, things get interesting.
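To make the mechanism concrete, here is a minimal Rust sketch of the dispatch side of that loop. The types and function names are hypothetical illustrations, not Whet's actual API: the model returns a structured call, the agent matches on the tool name, executes it, and feeds the result (or the error) back into the conversation.

```rust
// Hypothetical sketch of an agent's tool dispatcher (not Whet's real code).
// The model emits a ToolCall; the agent executes it and returns the result
// as a string the model sees on the next turn.

#[derive(Debug)]
struct ToolCall {
    name: String,
    args: Vec<(String, String)>, // argument name -> value
}

fn execute(call: &ToolCall) -> Result<String, String> {
    // Look up an argument by name, or fail with an error the model can read.
    let arg = |key: &str| {
        call.args
            .iter()
            .find(|(k, _)| k == key)
            .map(|(_, v)| v.clone())
            .ok_or_else(|| format!("missing argument: {key}"))
    };

    match call.name.as_str() {
        "read_file" => {
            let path = arg("path")?;
            std::fs::read_to_string(&path).map_err(|e| e.to_string())
        }
        "edit_file" => {
            let path = arg("path")?;
            let old = arg("old_text")?;
            let new = arg("new_text")?;
            let content = std::fs::read_to_string(&path).map_err(|e| e.to_string())?;
            if !content.contains(&old) {
                // A failure mode seen later in the article: old_text doesn't match.
                return Err("old_text not found in file".into());
            }
            std::fs::write(&path, content.replacen(&old, &new, 1))
                .map_err(|e| e.to_string())?;
            Ok("edit applied".into())
        }
        other => Err(format!("unknown tool: {other}")),
    }
}
```

The important property is that errors go back to the model as text rather than crashing the loop, which is what allows (or, as we'll see, fails to provoke) self-correction.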

This article documents the failure patterns I found, which ones were the model's fault vs. my agent's fault, and what I did about it.

Important caveat: I built Whet as a personal project, so I'm biased toward finding and fixing issues in my own agent rather than blaming models. The "model vs agent" distinction below is my interpretation.


Setup

Agent: Whet — a single-binary Rust coding agent with 9 built-in tools (read_file, edit_file, shell, grep, etc.) plus optional web tools

Task: "Read hello.py and add a farewell function"

# hello.py (before)
def greet(name):
    return f"Hello, {name}!"

if __name__ == "__main__":
    print(greet("World"))

Simple enough that any tool-calling model should handle it. The expected tool chain is: read_file → edit_file. Two calls, done.

Models: 7 models available via Ollama, ranging from 7B to 24B parameters.

Mode: Yolo (auto-approve all tool calls). Max 10 iterations.

How to reproduce:

cargo install whet
ollama pull qwen3:8b  # or any model below
echo 'def greet(name):
    return f"Hello, {name}!"

if __name__ == "__main__":
    print(greet("World"))' > hello.py
whet -p "Read hello.py and add a farewell function" -m qwen3:8b -y

Results

| Model | Params | Task | Tokens | Tool Calls | Failure Pattern |
|---|---|---|---|---|---|
| devstral-small-2 | 24B | Pass | 5,990 | 2 | |
| glm-4.7-flash | 19B | Pass | 6,684 | 2 | |
| qwen3:8b | 8B | Pass | 6,895 | 2 | |
| qwen3:14b | 14B | Pass | 8,946 | 3 | |
| qwen2.5:14b | 14B | Fail | 6,013 | 2 | Wrong old_text, gave up |
| qwen2.5:7b | 7B | Fail | 3,801 | 1 | Read file, asked user instead of editing |
| qwen2.5-coder:14b | 14B | Fail | 1,873 | 0 | Output JSON as text instead of calling tool |

4 passed. 3 failed. Parameter count didn't predict success — qwen3:8b (8B) passed while qwen2.5-coder:14b (14B) failed.


What Success Looks Like

Before the failures, here's a successful run (devstral-small-2, 5,990 tokens):

[1] read_file {"path": "hello.py"}
    → returned file content (5 lines)

[2] edit_file {"path": "hello.py", "old_text": "if __name__...", "new_text": "def farewell..."}
    → added farewell function ✓

Done. Task complete.

Two tool calls, clean execution. The model read the file, understood the structure, wrote a valid edit, and stopped. This is what all 7 models should have done.


The Three Failure Patterns

Pattern 1: Refusing to Act (qwen2.5:7b)

[tool: read_file] {"path":"hello.py"}  ← only tool call

"Should I edit the file?"  ← asked user instead of editing

The model read the file successfully, then asked for permission instead of using edit_file. The system prompt says "ACT, DON'T ASK" — the model ignored it. 1 tool call, 3,801 tokens, task incomplete.

Pattern 2: Tool Format Confusion (qwen2.5-coder:14b)

The model output what looks like a tool call, but as plain text instead of using the API:

# What the model printed (as text, NOT an actual tool call):
{"name": "read_file", "arguments": {"path": "hello.py"}}

The model understood it needed to call read_file, but output the JSON as text inside a markdown code block instead of using the tool calling API. Zero actual tool calls. 1,873 tokens wasted.

Pattern 3: Retry Loop (qwen3:14b, before the fix)

This was the most interesting failure because it was both the model's and my agent's fault.

| Iteration | Tool Call | Result |
|---|---|---|
| 1 | read_file {"path": "hello.py"} | OK |
| 2 | shell {"command": "cat hello.py"} | Error |
| 3 | shell {"command": "cat hello.py"} | Error (same) |
| 4 | shell {"command": "cat hello.py"} | Error (same) |
| ... | ... | ... |
| 10 | (max iterations) | Gave up |

30K tokens. 10+ tool calls. The model hit an error on shell, then repeated the exact same call 5+ times. It never tried a different approach.

  • Model side: qwen3:14b didn't adapt after seeing the error. Other models (qwen3:8b, devstral) changed their approach on failure.
  • Agent side: My system prompt said "if shell command fails: read the error output, fix the issue, and retry" — which the model interpreted literally as "call the same thing again."

What I Did About It

Pattern 3 was the most actionable. One line added to the system prompt:

- NEVER repeat the same failing tool call more than once.
  If it failed, change your approach (different arguments,
  different tool, or ask the user).

The result:

| Metric | qwen3:14b (before) | qwen3:14b (after) |
|---|---|---|
| Task completed | No | Yes |
| Total tokens | ~30,000 | 8,946 |
| Tool calls | 10+ | 3 |
| Tool success rate | < 20% | 100% |

One line of prompt turned a 30K-token failure into a 9K-token success.
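The same rule can also be enforced in the agent itself. Here's a minimal Rust sketch (hypothetical, not Whet's actual code) of a guard that remembers the last failing call and, if the model tries the identical call again, intercepts it and injects a corrective nudge instead of re-executing:

```rust
// Hypothetical agent-side retry guard. Instead of burning tokens on a doomed
// repeat, intercept an exact duplicate of the last failed call and return a
// nudge message to feed back to the model.

#[derive(Default)]
struct RetryGuard {
    // Fingerprint ("name:serialized-args") of the most recent failed call.
    last_failure: Option<String>,
}

impl RetryGuard {
    /// Returns Some(nudge) if this call exactly repeats the last failure.
    fn check(&self, name: &str, args: &str) -> Option<String> {
        let key = format!("{name}:{args}");
        if self.last_failure.as_deref() == Some(key.as_str()) {
            Some(
                "That exact call just failed. Change your approach: \
                 different arguments, a different tool, or ask the user."
                    .to_string(),
            )
        } else {
            None
        }
    }

    fn record_failure(&mut self, name: &str, args: &str) {
        self.last_failure = Some(format!("{name}:{args}"));
    }

    fn record_success(&mut self) {
        self.last_failure = None;
    }
}
```

A prompt rule relies on the model obeying it; a guard like this makes the repeat impossible regardless of the model. In practice you'd probably want both.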

For the other two patterns, I added agent-level recovery:

  • Pattern 2 (JSON as text): A fallback parser that scans the model's text output for JSON objects matching the tool call format and executes them. This successfully extracted read_file calls from qwen2.5-coder:14b's text output.
  • Pattern 1 (refusing to act): A question detector that catches when the model asks instead of acting, and re-prompts it to use tools instead of asking. This fired in 3 out of 5 test runs with qwen2.5:7b.

Both helped partially, but neither is a complete fix — ultimately the model needs to use the tool calling API correctly.
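For the curious, the core of such a fallback parser can be sketched in a few lines of Rust. This is a simplified, hypothetical version (a real implementation would hand the candidate to a JSON library like serde_json for validation): scan the text for a balanced top-level `{...}` object that mentions both `"name"` and `"arguments"`, skipping braces inside string literals.

```rust
// Hypothetical fallback extractor for Pattern 2: find a balanced JSON object
// in plain text that looks like a tool call ("name" + "arguments" keys).
// Brace counting skips string literals so quoted braces don't confuse it.

fn extract_tool_call_json(text: &str) -> Option<&str> {
    let bytes = text.as_bytes();
    let mut start = None;
    let mut depth = 0usize;
    let mut in_string = false;
    let mut escaped = false;

    for (i, &b) in bytes.iter().enumerate() {
        if in_string {
            if escaped {
                escaped = false;
            } else if b == b'\\' {
                escaped = true;
            } else if b == b'"' {
                in_string = false;
            }
            continue;
        }
        match b {
            b'"' => in_string = true,
            b'{' => {
                if depth == 0 {
                    start = Some(i);
                }
                depth += 1;
            }
            b'}' => {
                depth = depth.saturating_sub(1);
                if depth == 0 {
                    if let Some(s) = start {
                        let candidate = &text[s..=i];
                        if candidate.contains("\"name\"")
                            && candidate.contains("\"arguments\"")
                        {
                            return Some(candidate);
                        }
                        start = None;
                    }
                }
            }
            _ => {}
        }
    }
    None
}
```

The extracted string would then go through the normal JSON parsing and tool-dispatch path, as if the model had used the API correctly.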


What the Data Shows

1. Model generation matters more than size

All three qwen2.5 models failed. Both qwen3 models passed (qwen3:14b only after the prompt fix). devstral-small-2 and glm-4.7-flash also passed. The qwen3/qwen2.5 boundary is a clearer predictor of tool-calling success than parameter count.

2. Each failure is different

The three failing models broke in three distinct ways: refusing to act, format confusion, retry loops. There's no single "tool calling doesn't work" failure mode — each model fails differently, which means each failure needs different investigation.

3. Agent bugs hide behind smart models

qwen3:8b and devstral never triggered the retry loop bug because they recover gracefully from errors. If I'd only tested with these models, the prompt bug would still be in my code. The "worst" model (qwen3:14b pre-fix) was the most useful for finding agent bugs.


Limitations

  • Single task: These results are from one task. A model that passes "add a function" might fail at "debug a test failure" or "refactor across files." I'm working on a broader benchmark.
  • Non-deterministic: LLM outputs vary between runs. qwen2.5:14b might succeed on a retry. I ran each model once for the initial results.
  • Ollama-specific: Results may differ with other inference engines (llama.cpp, vLLM). Tool calling implementation varies.
  • Author bias: I built Whet. I'm inclined to fix my agent rather than blame models. Another developer might classify some "agent bugs" as "model limitations" or vice versa.

Takeaways

  1. Test with multiple models, not just the best one. Smart models hide agent bugs by working around them. The model that fails the most dramatically teaches you the most about your agent's weaknesses.

  2. "Retry on failure" is dangerous prompt guidance. Humans understand "retry" as "try differently." LLMs may read it as "call the exact same function again." Be explicit about what NOT to do.

  3. Check the generation, not just the size. qwen3:8b (8B) outperformed qwen2.5-coder:14b (14B) at tool calling. Newer model families tend to have better tool-use training regardless of parameter count.

  4. The agent can compensate — partially. JSON fallback parsing and question re-prompting helped, but the biggest win was a one-line prompt fix. Invest in your system prompt before building workarounds.


The code is open source.

Top comments (1)

AutoJanitor

This is one of the best empirical tool-calling analyses I've seen. The finding that model generation matters more than parameter count is something we've observed too — and your Pattern 3 (retry loop) is a critical insight for anyone building agentic systems.

We run local LLMs on an IBM POWER8 server (128 threads, 512GB RAM) using llama.cpp with custom vec_perm optimizations. Our experience with Qwen2.5-7B is interesting context for your results: we fine-tuned it with SFT + DPO into "SophiaCore" — a 7B model with baked-in identity and tool awareness. The fine-tuning dramatically improved tool-call reliability vs stock Qwen2.5, which aligns with your finding that the qwen2.5 family struggles with structured tool output.

Your "NEVER repeat the same failing tool call" prompt fix mirrors exactly what Claude Code does internally — it has the same anti-retry logic baked into its system prompt. The fact that one line of prompt engineering turned 30K wasted tokens into a 9K success is a powerful argument that agent architecture matters as much as model quality.

One thing we found running multi-model consensus (4 models simultaneously on POWER8, each getting the same prompt with different role framing): the failure modes you document are actually useful when you treat them as diversity signals rather than bugs. The model that "asks the user instead of acting" is providing a different cognitive frame than the one that "retries aggressively." In a multi-path setup, these become complementary perspectives rather than failures.

Whet looks solid — the Rust + Ollama approach is clean. Have you considered testing with the tool-calling-specific fine-tunes (like Hermes or NousResearch function-calling variants)? Curious if they handle your Pattern 2 better than stock models.