DEV Community

Baris Terzioglu


Why LLM agents break when you give them tools (and what to do about it)

Your agent demo works perfectly. The model picks the right function, passes clean arguments, gets a response, and synthesizes a nice answer. Then you deploy it with 50 real API endpoints and everything falls apart.

This is the gap that nobody warns you about in tool-use tutorials. The research on LLM tool use is actually quite mature at this point, with clear findings about what works and what doesn't. But most of those findings haven't made it into the "how to build an AI agent" blog posts that dominate search results.

I spent the last few weeks going through the academic literature on tool use in LLM agents. Here's what I found, what it means if you're building agents today, and the failure modes that will bite you in production.

The two schools of tool use

There are fundamentally two approaches to giving LLMs access to tools, and understanding the difference matters.

The first is prompting-based tool use. You describe your tools in the system prompt or via a function-calling API, and the model decides at inference time which tools to use. This is what OpenAI's function calling, Anthropic's tool use, and most agent frameworks do. The model was never specifically trained on your tools. It's generalizing from its understanding of APIs and function signatures.

The second is training-based tool use. You fine-tune the model to use specific tools. Toolformer (Schick et al., 2023) is the seminal paper here. They trained a model to decide which APIs to call, when to call them, what arguments to pass, and how to incorporate results into its predictions. The clever bit: they did it in a self-supervised way, needing only a handful of demonstrations per API. The model learned to insert API calls into its own text generation when doing so improved next-token prediction.

The Toolformer approach got 3100+ citations for good reason. It showed that a 6.7B parameter model could match or beat much larger models on tasks involving calculation, search, and translation, simply by learning when to reach for a tool instead of hallucinating an answer.

But here's what's interesting: the industry largely went with the prompting-based approach anyway. Why? Because fine-tuning per-tool doesn't scale when you need to support arbitrary APIs. And modern function-calling implementations are good enough for most use cases.

ReAct and why interleaving reasoning with action matters

The ReAct paper (Yao et al., 2022) is probably the most influential work on agent tool use, with over 6300 citations. The core insight is deceptively simple: let the model think out loud between tool calls.

Before ReAct, you had two separate paradigms. Chain-of-thought prompting let models reason step-by-step but couldn't interact with the outside world. Action-generation approaches could call tools but were essentially operating blind between calls, with no visible reasoning about what to do next.

ReAct interleaves the two. The model generates a thought ("I need to search for X because..."), then an action (calling the search tool), then an observation (processing the result), then another thought ("This tells me Y, so next I should..."), and so on.

On HotpotQA, this approach beat chain-of-thought prompting by reducing hallucinations. The model could actually check facts instead of just reasoning about them. On interactive benchmarks like ALFWorld and WebShop, it outperformed reinforcement learning methods by 34% and 10% success rate respectively, using only one or two in-context examples.

The practical lesson: if you're building an agent that uses tools, don't just have it call functions. Make it explain its reasoning between calls. The interpretability is nice for debugging, but the real payoff is better tool selection and argument quality.

Here's what a ReAct-style loop looks like in practice:

messages = [{"role": "system", "content": SYSTEM_PROMPT}]
done = False

while not done:
    response = llm.chat(messages, tools=available_tools)
    messages.append(response.message)  # keep the assistant turn in history

    if response.has_tool_call:
        # The model chose a tool - execute it and feed the result back
        result = execute_tool(response.tool_call)
        messages.append({"role": "tool", "content": result})
    else:
        # No tool call means the model has finished reasoning
        done = True

The important difference from naive tool use is that the system prompt encourages the model to think before acting:

Before calling any tool, explain:
1. What you're trying to find out
2. Why this specific tool is the right choice
3. What you'll do with the result

After receiving a tool result, analyze it before deciding your next step.

This isn't magic. It's just giving the model space to plan. But in our experience building agents for data pipelines at Bruin, that planning step is the difference between an agent that picks the right tool 60% of the time and one that picks it 85% of the time.

Where tool use actually fails

The research paints a clear picture of where things go wrong. Let me walk through the biggest failure modes.

Nested and sequential calls are hard

The NESTFUL benchmark (Basu et al., 2024) tested LLMs on nested sequences of API calls, where the output of one call feeds into the input of the next. Think: "get the user's order history, find the most recent order, look up the shipping status for that order."

GPT-4o, the best-performing model they tested, achieved a full sequence match accuracy of just 28%. It could get individual calls right, but chaining them correctly was a different story. The win-rate (partial credit for getting some calls right) was 60%, which tells you the model understands the individual tools fine but struggles with the composition.

This matches what anyone building real agents has seen. An agent can call a single tool reliably. But give it a task that requires calling tool A, parsing the result, using part of that result as input to tool B, and then combining both results? The error rate compounds at each step.
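The compounding is easy to quantify. As an illustrative sketch (not a result from the paper): if each call in a chain succeeds independently with probability p, the whole chain succeeds with probability p^n:

```python
def chain_success_rate(per_call_accuracy: float, num_calls: int) -> float:
    """Probability that every call in an n-step chain succeeds,
    assuming independent per-call success rates."""
    return per_call_accuracy ** num_calls

# A model that gets single calls right 90% of the time
# completes a 5-step chain only about 59% of the time.
print(round(chain_success_rate(0.90, 1), 3))  # 0.9
print(round(chain_success_rate(0.90, 5), 3))  # 0.59
```

Real errors aren't independent, so this underestimates some failure modes and overestimates others, but it captures why multi-step tasks fall off a cliff.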

Static tool retrieval breaks down

When you have more than a handful of tools, you need some way to decide which tools are relevant for a given query. Most systems use embedding-based retrieval: embed the user query, embed the tool descriptions, find the closest matches, and include only those in the prompt.

Patel et al. (2025) showed this static approach has a fundamental problem. The right tool for step 2 of a task depends on what happened in step 1. Their Dynamic Tool Dependency Retrieval (DTDR) method conditions on both the initial query and the evolving execution context, which improved function calling success rates compared to static retrievers.

In simpler terms: when an agent is three steps into a task, the tools it needs aren't necessarily the ones that were closest to the original question. Tool selection needs to be dynamic.
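Here's a toy sketch of that idea: score tools against the query plus the evolving execution context, not the query alone. Word overlap stands in for a real embedding model, and the tool names, descriptions, and queries are all hypothetical:

```python
def score(text: str, description: str) -> float:
    # Crude stand-in for embedding similarity: token overlap
    text_tokens = set(text.lower().split())
    desc_tokens = set(description.lower().split())
    return len(text_tokens & desc_tokens) / (len(desc_tokens) or 1)

def retrieve_tools(query: str, context: list, tools: dict, k: int = 3) -> list:
    # Condition retrieval on the last few observations as well as the query
    search_text = " ".join([query] + context[-3:])
    ranked = sorted(tools, key=lambda name: score(search_text, tools[name]),
                    reverse=True)
    return ranked[:k]

tools = {
    "get_order_history": "list recent orders for a user",
    "get_shipping_status": "look up shipping status for an order id",
    "search_products": "search the product catalog by keyword",
}

# Before any calls, the query alone points at order history
print(retrieve_tools("show my recent orders", [], tools, k=1))

# After step 1, the observation shifts relevance toward the shipping lookup
context = ["observation: latest order id is 4521, shipping status pending"]
print(retrieve_tools("show my recent orders", context, tools, k=1))
```

The second call retrieves a different tool than the first even though the query never changed, which is the whole point of conditioning on execution context.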

Poor documentation means poor tool use

OpaqueToolsBench (Hallinan et al., 2026) studied what happens when tools have incomplete or misleading documentation. This is the real world. Your internal APIs rarely have perfect docs. There are edge cases, implicit constraints, and undocumented behaviors everywhere.

Their finding: LLMs struggle with tools that lack clear best practices or documented failure modes. Their proposed solution, ToolObserver, iteratively refines tool documentation by observing execution feedback from actual tool-calling trajectories. The agent literally learns from its mistakes to build better tool descriptions.

This is a big deal. It means writing good tool descriptions isn't optional; it's load-bearing infrastructure. The descriptions need to cover failure modes and edge cases, not just the happy path.

The model doesn't have a world model

Guo et al. (2025) identified a subtle but important problem: LLMs using tools in stateful environments can't predict what will happen when they take an action. Their method, DyMo (dynamics modelling), augments LLMs with a state prediction capability alongside function calling during post-training.

This matters because in many real applications, tools change state. If your agent is managing a database, it needs to understand that running a DELETE query will change what subsequent SELECT queries return. Without that world model, agents can and do take destructive actions because they don't predict the consequences.
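You probably can't post-train a world model into your agent, but you can put a cheap guard in the loop. A minimal sketch, with hypothetical tool names: tag mutating tools, and refuse to run them until the agent has stated the expected state change.

```python
# Tools known to change state (hypothetical names)
MUTATING_TOOLS = {"write_database", "delete_record"}

def execute_with_guard(tool_name, args, executor, predicted_effect=None):
    if tool_name in MUTATING_TOOLS and not predicted_effect:
        # Bounce the call back so the agent loop can re-prompt the model
        # for its predicted state change before anything destructive runs
        return {"ok": False,
                "error": f"{tool_name} mutates state; describe the expected "
                         f"effect of this call before executing it."}
    return {"ok": True, "result": executor(tool_name, args)}

run = lambda name, args: f"executed {name}"  # stand-in executor

print(execute_with_guard("query_database", {"sql": "SELECT 1"}, run)["ok"])  # True
print(execute_with_guard("delete_record", {"id": 42}, run)["ok"])            # False
print(execute_with_guard("delete_record", {"id": 42}, run,
                         predicted_effect="removes row 42 permanently")["ok"])  # True
```

This doesn't give the model a real dynamics model the way DyMo does, but forcing an explicit prediction before mutation catches a surprising number of careless destructive calls.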

What actually helps in practice

Based on the research and my own experience, here are the things that make the biggest difference.

Write tool descriptions like your agent's life depends on it

Because it does. Include:

  • What the tool does (one sentence)
  • What the parameters mean (with types and constraints)
  • What the tool returns (with examples)
  • When NOT to use the tool
  • Common error conditions

Here's what that looks like for a database query tool:

{
  "name": "query_database",
  "description": "Executes a read-only SQL query against the analytics database. Returns up to 1000 rows. Use this for data retrieval, not for mutations. If you need to modify data, use the write_database tool instead.",
  "parameters": {
    "sql": {
      "type": "string",
      "description": "A SELECT query. Must not contain INSERT, UPDATE, DELETE, or DROP statements. The query will be rejected if it does."
    },
    "timeout_seconds": {
      "type": "integer",
      "description": "Max execution time. Default 30. Queries over large tables may need 60+. If a query times out, try adding a LIMIT clause or narrowing the WHERE condition."
    }
  },
  "errors": {
    "TIMEOUT": "Query exceeded timeout_seconds. Simplify the query or increase timeout.",
    "SYNTAX_ERROR": "SQL syntax error. Check for missing quotes or wrong table names.",
    "NO_RESULTS": "Query ran but returned 0 rows. This isn't an error, the data just doesn't exist."
  }
}

That "errors" field isn't standard in any function-calling spec. But putting it in the description text helps the model recover from failures instead of retrying the same broken call.
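One way to do that: flatten the error table into the description string before handing the schema to your function-calling API. A minimal sketch, assuming a plain-dict tool definition like the one above:

```python
def inline_errors(tool: dict) -> dict:
    """Fold a non-standard 'errors' table into the description text,
    since function-calling specs have no dedicated errors field."""
    errors = tool.pop("errors", {})
    if errors:
        lines = [f"- {code}: {hint}" for code, hint in errors.items()]
        tool["description"] += "\n\nPossible errors:\n" + "\n".join(lines)
    return tool

tool = {
    "name": "query_database",
    "description": "Executes a read-only SQL query against the analytics database.",
    "errors": {
        "TIMEOUT": "Query exceeded timeout_seconds. Simplify or add a LIMIT.",
        "NO_RESULTS": "Query ran but returned 0 rows; the data just doesn't exist.",
    },
}

print(inline_errors(tool)["description"])
```

The resulting schema is valid for any function-calling API, and the model still sees the recovery hints where it reads everything else: the description.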

Keep the tool count low

The Berkeley Function Calling Leaderboard (Patil et al., 2025) has evaluated dozens of models across thousands of function-calling scenarios. One consistent finding: accuracy drops as the number of available tools increases. This isn't surprising. More tools means more options to confuse, more documentation to parse, and more chances to pick the wrong one.

If you have 50 tools, the agent doesn't need all 50 in every prompt. Use retrieval to narrow it down to the 5-10 most relevant. And if you find yourself with hundreds of tools, that's a sign you need to rethink your API design, not a problem to solve with better prompting.

Let the agent fail and recover

ToolPRM (Lin et al., 2025) introduced an inference scaling framework that scores internal steps of function calls. One of their findings is a principle they call "explore more but retain less," because structured function calling has an "unrecoverability" characteristic. Once the model starts generating a malformed function call, it's very hard to course-correct mid-generation.

The practical implication: build your agent loops to expect failures. Parse tool call results. If the call failed, give the model the error message and let it try again. This sounds obvious, but a surprising number of agent implementations either crash on tool errors or silently swallow them.

max_retries = 3
for attempt in range(max_retries):
    tool_result = execute_tool(tool_call)

    if tool_result.success:
        break

    # Feed the error back to the model so it can correct the call
    messages.append({
        "role": "tool",
        "content": f"Error: {tool_result.error}. "
                   f"Attempt {attempt + 1}/{max_retries}. "
                   f"Please fix the arguments and try again."
    })

    response = llm.chat(messages, tools=available_tools)
    tool_call = response.tool_call
else:
    # All retries exhausted - surface the failure instead of swallowing it
    raise RuntimeError(f"Tool call failed after {max_retries} attempts: "
                       f"{tool_result.error}")

Think in chains, not single calls

The Chain-of-Abstraction work (Gao et al., 2024) showed that training models to plan tool call chains with abstract placeholders before executing them improved accuracy by about 6% across math and QA tasks, while being 1.4x faster than sequential tool calling.

You probably can't train your own model this way. But you can prompt for it. Before a complex task, ask the model to outline the sequence of tool calls it plans to make, with placeholders for intermediate results. Then execute them. This planning step catches many composition errors before they happen.

User: What was our revenue last quarter compared to the same quarter last year?

Agent thinking:
1. Call query_database to get last quarter's dates -> Q_CURRENT
2. Call query_database to get revenue for Q_CURRENT -> REV_CURRENT
3. Call query_database to get same quarter last year -> Q_PREVIOUS
4. Call query_database to get revenue for Q_PREVIOUS -> REV_PREVIOUS
5. Calculate and compare REV_CURRENT vs REV_PREVIOUS

Actually, I can combine steps 1-2 and 3-4 into two queries...

That "actually, I can combine" moment is exactly what you want. The model is optimizing its own plan before executing it.
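Once the model emits a plan like this, executing it is mechanical. A sketch of one way to do it, where each step names a tool, an argument template, and the placeholder its result fills (the plan structure and stub tool are my own invention, not from the paper):

```python
def run_plan(plan, call_tool):
    """Execute plan steps in order, substituting earlier results
    into later steps' argument templates."""
    results = {}
    for step in plan:
        args = step["args"].format(**results)  # fill earlier placeholders
        results[step["out"]] = call_tool(step["tool"], args)
    return results

plan = [
    {"tool": "query_database", "args": "revenue for 2024-Q4", "out": "REV_CURRENT"},
    {"tool": "query_database", "args": "revenue for 2023-Q4", "out": "REV_PREVIOUS"},
    {"tool": "calculate", "args": "compare {REV_CURRENT} vs {REV_PREVIOUS}",
     "out": "ANSWER"},
]

# A stub tool so the sketch runs end to end
stub = lambda name, args: f"<{name}: {args}>"
print(run_plan(plan, stub)["ANSWER"])
```

Because the plan is explicit data rather than free-form generation, you can validate it before execution: check that every placeholder a step consumes was produced by an earlier step, which catches composition errors up front.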

The gap between benchmarks and production

One thing that stood out going through this research: the benchmarks are getting more realistic, but they still don't capture the full messiness of production environments.

NESTFUL tests nested API calls but with well-defined, stable APIs. In real systems, APIs return unexpected formats, timeout randomly, and have rate limits. The Berkeley Function Calling Leaderboard tests a wide range of scenarios but in isolation. Real agents deal with conversation context that accumulates over dozens of turns and can push against context window limits.

The OpaqueToolsBench work gets closest to reality by studying poorly-documented tools, and it's no coincidence that it's one of the most recent papers in this space. The field is slowly moving toward evaluating what actually matters: tool use under messy, real-world conditions.

What I'd tell someone building their first agent

Start with three to five tools, not thirty. Get ReAct-style reasoning working with those few tools until it's reliable. Write obsessively detailed tool descriptions, including failure modes and constraints. Build retry logic from day one. And test with sequences of tool calls, not just individual ones, because that's where everything breaks.

The research is clear that tool use in LLMs is a solved problem for simple cases and a very much unsolved problem for complex ones. The gap between "call one function with clean arguments" and "orchestrate a sequence of dependent API calls with error handling" is enormous. If you're building something real, plan for that gap.

The papers I've referenced here are a good starting point if you want to go deeper. The ReAct paper for foundations, NESTFUL for understanding failure modes in composition, and the Berkeley Function Calling Leaderboard for keeping up with which models are actually good at this. The field moves fast, but these core challenges haven't changed much in three years. We're just getting more honest about how hard they are.
