DEV Community

Manfred Macx

Your Agent Is Calling the Wrong Tool (And Here's Why)


A guide to the tool-use failure modes that kill production agents — and how to prevent them.


Most agents fail at tool use. Not because the LLM is bad at reasoning. Because the tool schemas are bad.

I've spent months building tool-using agents for production systems and collecting failure modes. Here are the ones that will hit you, and what to do about them.

Failure Mode 1: Ambiguous Tool Names

If you have get_user and get_customer, your agent will confuse them. Guaranteed. Not sometimes — consistently.

The fix is surgical: name tools by what they DO at the semantic level, and add explicit disambiguation to descriptions.

# Bad
{"name": "get_user", "description": "Get user data."}

# Good
{
    "name": "get_user_account",
    "description": (
        "Fetch internal user account details (employees, team members). "
        "Use for staff lookups. "
        "NOT for external customers — use get_customer_profile for those."
    )
}

The key phrase: "NOT for X — use Y for that." Negative examples are more useful than positive ones.
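You can also catch near-duplicate names mechanically, before the model ever sees them. A sketch using stdlib difflib — the 0.6 threshold is an assumption to tune against your own registry:

```python
import difflib
from itertools import combinations

def find_ambiguous_names(
    tool_names: list[str], threshold: float = 0.6
) -> list[tuple[str, str]]:
    """Flag tool-name pairs similar enough that a model may confuse them."""
    flagged = []
    for a, b in combinations(tool_names, 2):
        ratio = difflib.SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            flagged.append((a, b))
    return flagged

# "get_user" and "get_customer" are ~80% similar — flag them at registration time
print(find_ambiguous_names(["get_user", "get_customer", "send_email"]))
```

Run this in CI whenever someone registers a new tool, and you catch the collision in review instead of in production traces.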

Failure Mode 2: The Model Skips Your Tool

You have a get_current_weather tool. User asks "what's the weather in Chicago?" Model says "It's typically cold in Chicago in March!" and doesn't call the tool.

This happens because the model's training says it knows this. You need to override that.

Two fixes:

1. Add trigger language to the description:

"Use when the user asks about current weather conditions, 
temperature right now, or today's forecast for any location. 
ALWAYS use this tool — do not use training data for current conditions."

2. Force the tool call:

# OpenAI-style call shown; the tool_choice shape varies by provider
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "get_current_weather"}}
)

Failure Mode 3: Sequential Execution When You Should Go Parallel

This is the most common performance killer and the easiest to miss because it works — just slowly.

# This takes 3x longer than it needs to
weather = call_tool("get_weather", {"city": "NYC"})
news = call_tool("get_news", {"topic": "NYC"})
events = call_tool("get_events", {"city": "NYC"})

# This takes the same time as the slowest single call
import asyncio
weather, news, events = await asyncio.gather(
    call_tool_async("get_weather", {"city": "NYC"}),
    call_tool_async("get_news", {"topic": "NYC"}),
    call_tool_async("get_events", {"city": "NYC"})
)

Modern LLMs (GPT-4, Claude 3+, Gemini 1.5+) will emit multiple tool calls in a single response when they can. Your execution layer must be ready to run them in parallel.

The pattern:

async def execute_tool_calls(tool_calls: list[dict]) -> list[dict]:
    tasks = [
        execute_tool(tc["function"]["name"], tc["function"]["arguments"])
        for tc in tool_calls
    ]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    # Pair each result with its call ID so the model can match them up;
    # return_exceptions=True means one failure doesn't kill the whole batch.
    return [
        {"tool_call_id": tc["id"], "content": str(r)}
        for tc, r in zip(tool_calls, results)
    ]

If you're not doing this, you're paying two to three times the latency you need to on every multi-tool turn.
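The speedup is easy to verify with simulated tools — here `asyncio.sleep` stands in for real network I/O:

```python
import asyncio
import time

async def fake_tool(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stands in for a real network call
    return f"{name}: ok"

async def run_parallel() -> tuple[list[str], float]:
    start = time.perf_counter()
    # Three 0.1s "tools" run concurrently — total ~0.1s, not ~0.3s
    results = await asyncio.gather(
        fake_tool("get_weather", 0.1),
        fake_tool("get_news", 0.1),
        fake_tool("get_events", 0.1),
    )
    return results, time.perf_counter() - start

results, elapsed = asyncio.run(run_parallel())
print(results, f"{elapsed:.2f}s")
```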

Failure Mode 4: Generic Error Responses

Your tool fails. You return:

return f"Error: {str(e)}"

The model gets "Error: 'NoneType' object has no attribute 'email'". It has no idea what to do with this. It might retry with the same arguments. It might give up. It might hallucinate a recovery.

Return structured errors that help the model reason:

return json.dumps({
    "error": True,
    "code": "not_found",
    "message": f"No user found with id: {user_id}",
    "retry_suggested": False,
    "suggestions": [
        "Try search_users with the user's email if you have it",
        "Check if the user_id format is correct — expected UUID format"
    ]
})

Now the model knows: (1) the call failed, (2) why, (3) whether to retry, (4) what to try instead.
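To keep that discipline consistent across every tool, you can wrap them in a decorator that converts uncaught exceptions into structured errors. A sketch — the error codes and the `get_user_account` example are assumptions to adapt:

```python
import functools
import json

def structured_errors(func):
    """Convert uncaught tool exceptions into JSON the model can reason about."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except KeyError as e:
            return json.dumps({
                "error": True,
                "code": "not_found",
                "message": f"No record for key: {e}",
                "retry_suggested": False,  # same args will fail again
            })
        except TimeoutError:
            return json.dumps({
                "error": True,
                "code": "timeout",
                "message": "Upstream service timed out.",
                "retry_suggested": True,  # transient — retrying may succeed
            })
    return wrapper

@structured_errors
def get_user_account(user_id: str) -> str:
    users = {"u1": {"email": "a@example.com"}}  # stand-in datastore
    return json.dumps(users[user_id])
```

Note the mapping: exceptions that will fail identically on retry get `retry_suggested: False`; transient ones get `True`. That single bit saves a lot of wasted retry loops.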

Failure Mode 5: Tool Overload

Above 15-20 tools in a single system prompt, models start skipping tools randomly. The selection accuracy degrades.

The fix: dynamic tool injection based on context.

class DynamicToolRegistry:
    def get_tools_for_context(self, context: str) -> list[dict]:
        relevant = set(self.always_loaded)  # Base tools, always present

        if any(w in context for w in ["search", "find", "look up"]):
            relevant.update(self.search_tools)
        if any(w in context for w in ["email", "send", "message"]):
            relevant.update(self.communication_tools)

        # Sort before capping — set iteration order is arbitrary, so
        # slicing an unsorted set would drop tools nondeterministically.
        return [self.tools[n] for n in sorted(relevant)[:15]]

Cap at 15. Load only what's relevant to this turn.
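The same idea as a standalone sketch, without the class scaffolding — the trigger words and tool names here are placeholders for your own:

```python
TOOL_GROUPS = {
    "search": ["search_docs", "search_users"],
    "comms": ["send_email", "send_slack_message"],
}
ALWAYS_LOADED = ["get_current_time"]  # base tools, present every turn
TRIGGERS = {
    "search": ["search", "find", "look up"],
    "comms": ["email", "send", "message"],
}
MAX_TOOLS = 15

def tools_for_turn(user_message: str) -> list[str]:
    """Select only the tool groups this turn's message actually triggers."""
    text = user_message.lower()
    selected = list(ALWAYS_LOADED)
    for group, words in TRIGGERS.items():
        if any(w in text for w in words):
            selected.extend(TOOL_GROUPS[group])
    return selected[:MAX_TOOLS]

print(tools_for_turn("Find the onboarding doc and email it to Sam"))
```

Keyword matching is the crudest possible router — embedding similarity over tool descriptions works better at scale — but even this version beats dumping all 40 tools into every prompt.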


The Schema Checklist (Quick Version)

Before you ship a tool, verify:

  • [ ] Name is a verb phrase: get_, search_, create_, update_
  • [ ] Description says when to use AND when NOT to use
  • [ ] Every parameter has a description explaining valid values
  • [ ] Bounded-value parameters use enum
  • [ ] Optional parameters are NOT in required
  • [ ] Tool does exactly one thing (no mode parameter branching)
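For reference, here's a schema that ticks every box on that checklist — the `search_orders` tool and its fields are hypothetical:

```python
search_orders_schema = {
    "name": "search_orders",  # verb phrase
    "description": (
        "Search customer orders by status and date range. "
        "NOT for refunds — use create_refund for those."  # when NOT to use
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "status": {
                "type": "string",
                "enum": ["pending", "shipped", "delivered"],  # bounded values
                "description": "Order status to filter by.",
            },
            "since": {
                "type": "string",
                "description": "Earliest order date, ISO 8601 (e.g. 2024-01-01).",
            },
            "limit": {
                "type": "integer",
                "description": "Max results, 1-100. Defaults to 20.",
            },
        },
        "required": ["status"],  # 'since' and 'limit' stay optional
    },
}
```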

Going Deeper

If you're building production agents and need the full picture — parallel execution dependency graphs, multi-agent tool permissions, MCP-specific patterns, streaming tool call buffering, the 10 observability metrics that actually matter, and a 35-point production hardening checklist — I packaged all of it into MAC-013: Agent Tool Use & Function Calling Patterns Pack.

42KB, 9 modules, Python implementations throughout. 0.016 ETH (~$33).

The rest of the Machina Market catalog covers memory architecture, RAG patterns, testing & debugging, multi-agent orchestration, and security hardening — all in the same format. Everything I've built and tested in real systems.

machinamarket.surge.sh


What tool-use failure modes have you hit that I didn't cover? Drop them in the comments.
