Agent Tool Selection Accuracy: 3 Prompt Patterns That Move It 20%

#agents #ai #llm #python

Book: AI Agents Pocket Guide: Patterns for Building Autonomous Systems with LLMs
Me: xgabriel.com | GitHub

Your agent picks the wrong tool about 30% of the time. The first instinct is to swap to a bigger model. That fixes maybe half of it and triples your bill. The other fix, the one that scales, is to rewrite three things in your prompt. None of them are clever. All three move the needle.

I'll show you a 100-case eval, a baseline, and the three patterns that took us from 68% to 89% selection accuracy on a tool set of 14 functions. The model didn't change. The prompt did.

Why "the model picks badly" is usually a prompt bug

When a model picks get_user_orders instead of list_recent_orders, the model isn't confused. The descriptions are confused. Two tools that overlap by 60% in their description text will get picked roughly 50/50 regardless of which one the user really wants. The model is doing pattern matching against text you wrote. If you wrote ambiguous text, you get ambiguous behavior.

The other failure shape is more subtle. A tool description that says "use this to query the database" sounds reasonable in isolation. Add five more tools that also touch a database (search_invoices, fetch_customer, get_subscription_status) and the model has no way to know which one wins. Each one's description is locally coherent and globally useless.

These are not model bugs. They're documentation bugs that happen to have a language model as the reader.
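One way to find these documentation bugs before running any eval is to score how much each pair of descriptions overlaps. The sketch below uses word-level Jaccard overlap, which is my choice of metric and not something measured in this post; the 0.6 threshold borrows the 60% figure above as a starting point, and the three tool descriptions are hypothetical.

```python
import re
from itertools import combinations

def tokenize(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for real semantic similarity."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def description_overlap(a: str, b: str) -> float:
    """Jaccard overlap between two descriptions (0.0 = disjoint, 1.0 = identical)."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def flag_overlapping_tools(tools: list[dict], threshold: float = 0.6) -> list[tuple[str, str, float]]:
    """Return (name_a, name_b, overlap) for every pair at or above the threshold."""
    flagged = []
    for a, b in combinations(tools, 2):
        overlap = description_overlap(a["description"], b["description"])
        if overlap >= threshold:
            flagged.append((a["name"], b["name"], round(overlap, 2)))
    return sorted(flagged, key=lambda item: -item[2])

# Hypothetical descriptions, written only to illustrate the problem.
example_tools = [
    {"name": "get_user_orders", "description": "Returns orders for the current user."},
    {"name": "list_recent_orders", "description": "Returns recent orders for the current user."},
    {"name": "get_refund_status", "description": "Checks whether a refund has been issued."},
]

for name_a, name_b, overlap in flag_overlapping_tools(example_tools):
    print(f"{name_a} <-> {name_b}: {overlap:.0%} word overlap")
```

Word overlap misses synonyms, so treat a flagged pair as a prompt to reread both descriptions, not as a verdict.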

Baseline — measure tool-selection accuracy in 30 lines

Before fixing anything, measure. The eval rig is tiny. Each case is a user query paired with the tool that should be called. You run the agent, capture which tool it picked, and compare.

import json
from dataclasses import dataclass
from anthropic import Anthropic

client = Anthropic()

@dataclass
class EvalCase:
    query: str
    expected_tool: str
    notes: str = ""

EVAL_SET = [
    EvalCase("show me my last 5 orders", "list_recent_orders"),
    EvalCase("what did I buy in March 2025?", "search_orders_by_date"),
    EvalCase("is order #4471 shipped yet?", "get_order_status"),
    EvalCase("did my refund go through?", "get_refund_status"),
    EvalCase("cancel my pending subscription", "cancel_subscription"),
    # ... 95 more
]

def run_case(case: EvalCase, system_prompt: str, tools: list) -> str:
    resp = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=system_prompt,
        tools=tools,
        messages=[{"role": "user", "content": case.query}],
    )
    for block in resp.content:
        if block.type == "tool_use":
            return block.name
    return "no_tool_called"

def score(cases, system_prompt, tools) -> dict:
    results = {"correct": 0, "wrong": 0, "no_call": 0, "details": []}
    for case in cases:
        picked = run_case(case, system_prompt, tools)
        if picked == case.expected_tool:
            results["correct"] += 1
        elif picked == "no_tool_called":
            results["no_call"] += 1
        else:
            results["wrong"] += 1
        results["details"].append((case.query, case.expected_tool, picked))
    return results

That's the whole rig. On a 14-tool set with our 100 cases, the baseline came in at 68% correct, 24% wrong tool, 8% no call. Anything below 90% on this kind of eval will produce visible quality problems in production.

The expected_tool label is the entire game. Without it, you're measuring something else: completion length, latency, model preference. With it, every regression has a name.
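To give each regression a name, the details list that score() returns can be grouped into (expected, picked) pairs. This is a minimal sketch: confusion_pairs is a helper name I made up, and the sample rows are hypothetical data in the same shape as results["details"].

```python
from collections import Counter

def confusion_pairs(details):
    """Count (expected, picked) pairs for every case the agent got wrong."""
    return Counter(
        (expected, picked)
        for _query, expected, picked in details
        if picked != expected
    )

# Hypothetical results in the same shape as results["details"] from score().
details = [
    ("show me my last 5 orders", "list_recent_orders", "list_recent_orders"),
    ("what did I buy in March 2025?", "search_orders_by_date", "list_recent_orders"),
    ("orders from last spring", "search_orders_by_date", "list_recent_orders"),
    ("did my refund go through?", "get_refund_status", "no_tool_called"),
]

for (expected, picked), count in confusion_pairs(details).most_common():
    print(f"{count}x expected {expected}, got {picked}")
```

The most frequent pair is usually the first description to rewrite.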

Pattern 1 — Disambiguating descriptions ("when to use vs when NOT to use")

The first pattern is the cheapest. For every tool description, add a "when to use" line and a "when NOT to use" line. The "NOT to use" line is doing most of the work.

Before:

{
    "name": "list_recent_orders",
    "description": "Returns the most recent orders for the current user.",
    "input_schema": {...}
},
{
    "name": "search_orders_by_date",
    "description": "Searches orders by date range.",
    "input_schema": {...}
},

After:

{
    "name": "list_recent_orders",
    "description": (
        "Returns the user's N most recent orders, newest first. "
        "USE when the user asks for 'recent', 'last few', or 'latest' orders "
        "without specifying a date. "
        "DO NOT USE when the user names a specific date, month, or year — "
        "use search_orders_by_date for that."
    ),
    "input_schema": {...}
},
{
    "name": "search_orders_by_date",
    "description": (
        "Searches orders within a date range (inclusive). "
        "USE when the user mentions a specific date, month, year, or range "
        "like 'last quarter', 'in March', 'between Jan and June'. "
        "DO NOT USE for 'recent' or 'last few' — use list_recent_orders for those."
    ),
    "input_schema": {...}
},

The model now has explicit negative constraints. The "DO NOT USE" line is what kills the 50/50 ambiguity; it gives the model a reason to prefer one over the other on cases where both look plausible.

On our eval set, pattern 1 alone moved accuracy from 68% to 79%. The biggest gains were on tool pairs that overlapped semantically. Cases the model already got right stayed right, and many of the pairs it used to confuse stopped being confused.

One thing to watch: the "DO NOT USE" line should reference the alternative tool by name. "Do not use for date queries" is weak. "Do not use for date queries, use search_orders_by_date instead" gives the model the next step, not just the negative.
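That rule is easy to check mechanically. The sketch below is a hypothetical lint helper, not part of the original rig: it flags any description whose "DO NOT USE" text fails to mention another tool by name. It relies on plain substring matching, so a tool whose name is a prefix of another tool's name could produce a false pass.

```python
def lint_negative_constraints(tools: list[dict]) -> list[str]:
    """Warn for each description whose DO NOT USE text names no alternative tool."""
    names = {tool["name"] for tool in tools}
    warnings = []
    for tool in tools:
        description = tool["description"]
        marker = description.find("DO NOT USE")
        if marker == -1:
            warnings.append(f"{tool['name']}: no DO NOT USE line")
            continue
        negative_part = description[marker:]
        others = names - {tool["name"]}
        if not any(other in negative_part for other in others):
            warnings.append(f"{tool['name']}: DO NOT USE line names no alternative tool")
    return warnings

# Shortened, hypothetical descriptions: the first names its alternative, the second does not.
tools = [
    {
        "name": "list_recent_orders",
        "description": (
            "Returns the user's N most recent orders. "
            "DO NOT USE when the user names a specific date; "
            "use search_orders_by_date for that."
        ),
    },
    {
        "name": "search_orders_by_date",
        "description": "Searches orders within a date range. DO NOT USE for recent orders.",
    },
]

for warning in lint_negative_constraints(tools):
    print(warning)
```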

Pattern 2 — Worked examples in the system prompt (2 per tool)

The second pattern is worked examples, with a strict cap. Two per tool. One positive, one near-miss. More than that and you crowd the prompt without adding signal.

The system prompt grows a section like this (abridged here to one example per tool):

SYSTEM_PROMPT = """
You are an order-management agent. You have access to tools for querying
orders, subscriptions, and refunds. Pick the most specific tool for each
user query.

## Tool selection examples

User: "show me what I ordered last week"
→ list_recent_orders (relative time window, no specific date)

User: "what did I buy on March 14?"
→ search_orders_by_date (specific date mentioned)

User: "did my refund post?"
→ get_refund_status (status query about a refund)

User: "cancel my subscription"
→ cancel_subscription (mutation, not a query)

User: "is order 4471 shipped?"
→ get_order_status (status query, specific order number)

User: "what's my subscription tier?"
→ get_subscription_status (subscription metadata, not order metadata)
"""

The positive example shows the model what success looks like. The near-miss example, the contrast pair, is more important. "March 14" vs "last week" trains the model on the exact ambiguity boundary that was costing you accuracy.

Two-per-tool is the calibrated number. We tried four. Accuracy went up another point on simple cases and down three points on novel queries. The model started over-anchoring to the examples and refusing to generalize. Two examples gives the model just enough pattern to copy without it learning to copy too rigidly.

Pattern 2 stacks with pattern 1. After both: 79% → 85%. The gain is smaller because pattern 1 already grabbed the cheap wins. Pattern 2 catches the cases where the description text was clear but the boundary between two tools wasn't intuitive.
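If the examples live in data instead of a hand-edited string, the two-per-tool cap can be enforced in code. A minimal sketch, assuming a hypothetical EXAMPLES dict and build_examples_section helper; the two entries reuse examples from the prompt above, and a real set would carry a positive and a near-miss for each tool.

```python
MAX_EXAMPLES_PER_TOOL = 2  # the cap described above

def build_examples_section(examples: dict[str, list[tuple[str, str]]]) -> str:
    """Render {tool: [(query, reason), ...]} as a prompt section, rejecting more than two per tool."""
    lines = ["## Tool selection examples", ""]
    for tool_name, pairs in examples.items():
        if len(pairs) > MAX_EXAMPLES_PER_TOOL:
            raise ValueError(
                f"{tool_name} has {len(pairs)} examples; cap is {MAX_EXAMPLES_PER_TOOL}"
            )
        for query, reason in pairs:
            lines.append(f'User: "{query}"')
            lines.append(f"→ {tool_name} ({reason})")
            lines.append("")
    return "\n".join(lines).rstrip()

# Hypothetical container; the queries and reasons are taken from the prompt above.
EXAMPLES = {
    "list_recent_orders": [
        ("show me what I ordered last week", "relative time window, no specific date"),
    ],
    "search_orders_by_date": [
        ("what did I buy on March 14?", "specific date mentioned"),
    ],
}

print(build_examples_section(EXAMPLES))
```

Failing loudly on a third example keeps the cap from eroding one well-meaning addition at a time.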

Pattern 3 — Tool grouping with a router meta-tool

The third pattern only matters past a certain tool count. Below 10 tools, skip it. Above 15, it's the difference between an agent that scales and one that hits a wall.

The idea: instead of giving the model 25 tools and asking it to pick, give it 5 categories and a single select_tool_category meta-tool. The model names the category, your code maps that category to its subset of tools, and the agent makes a second call with that filtered set.

TOOL_CATEGORIES = {
    "orders": [
        "list_recent_orders",
        "search_orders_by_date",
        "get_order_status",
        "get_order_details",
    ],
    "refunds": [
        "get_refund_status",
        "request_refund",
        "list_refunds",
    ],
    "subscriptions": [
        "get_subscription_status",
        "cancel_subscription",
        "change_subscription_tier",
        "pause_subscription",
    ],
    "payment_methods": [
        "list_payment_methods",
        "add_payment_method",
        "remove_payment_method",
    ],
    "account": [
        "get_account_info",
        "update_account_info",
    ],
}

ROUTER_TOOL = {
    "name": "select_tool_category",
    "description": (
        "ALWAYS call this first to identify which category of tools is "
        "relevant to the user query. Categories: orders, refunds, "
        "subscriptions, payment_methods, account."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "category": {
                "type": "string",
                "enum": list(TOOL_CATEGORIES.keys()),
                "description": "The single category most relevant to the query"
            }
        },
        "required": ["category"]
    }
}

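def run_with_router(query: str) -> str:
    # ALL_TOOLS is assumed to be the full list of tool definitions, defined elsewhere.
    # First call: pick the category. tool_choice forces the router call so the
    # response always contains a tool_use block.
    first = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        system="Route the user query to the right tool category.",
        tools=[ROUTER_TOOL],
        tool_choice={"type": "tool", "name": "select_tool_category"},
        messages=[{"role": "user", "content": query}],
    )
    # Find the tool_use block by type; content[0] is not guaranteed to be one.
    router_call = next(b for b in first.content if b.type == "tool_use")
    category = router_call.input["category"]
    tools_subset = [t for t in ALL_TOOLS if t["name"] in TOOL_CATEGORIES[category]]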

    # Second call: pick the tool within that category
    second = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        system=SYSTEM_PROMPT,
        tools=tools_subset,
        messages=[{"role": "user", "content": query}],
    )
    for block in second.content:
        if block.type == "tool_use":
            return block.name
    return "no_tool_called"

Two calls instead of one. The first is cheap (5-category enum, short response). The second sees a smaller, more coherent tool set, 2-4 tools with the categories above, which makes selection within the category almost trivial.

The trade is latency and one extra request. On our 14-tool set, pattern 3 added latency for a small accuracy bump. On a 28-tool variant we built for stress testing, pattern 3 was the difference between 71% and 88%. The break-even point sits around 15 tools.
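One failure the router introduces: a tool that sits in no category can never be selected, and a tool in two categories blurs the boundary the router depends on. The sketch below is a hypothetical startup check, not part of the code above; validate_categories and the sample data are illustrative names.

```python
def validate_categories(categories: dict[str, list[str]], all_tool_names: set[str]) -> list[str]:
    """List tools in no category, tools in several, and category entries with no definition."""
    problems = []
    seen: dict[str, str] = {}
    for category, names in categories.items():
        for name in names:
            if name in seen:
                problems.append(f"{name} is in both {seen[name]} and {category}")
            else:
                seen[name] = category
            if name not in all_tool_names:
                problems.append(f"{name} ({category}) has no tool definition")
    for name in sorted(all_tool_names - set(seen)):
        problems.append(f"{name} is not reachable from any category")
    return problems

# Hypothetical data: one tool is uncategorized, one category entry has no definition.
categories = {
    "orders": ["list_recent_orders", "get_order_status"],
    "refunds": ["get_refund_status", "list_refunds"],
}
defined = {"list_recent_orders", "get_order_status", "get_refund_status", "request_refund"}

for problem in validate_categories(categories, defined):
    print(problem)
```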

Combining all three — 100-case eval delta

Numbers from the eval rig above, same model (Claude Sonnet 4.5), same 100 cases, same 14-tool set:

| Configuration | Correct | Wrong | No call | Δ correct vs baseline |
| --- | --- | --- | --- | --- |
| Baseline (default descriptions) | 68 | 24 | 8 | n/a |
| + Pattern 1 (disambiguating descs) | 79 | 14 | 7 | +11 |
| + Pattern 2 (worked examples, 2/tool) | 85 | 11 | 4 | +17 |
| + Pattern 3 (router meta-tool) | 87 | 9 | 4 | +19 |
| All three + slot disambiguation prose | 89 | 8 | 3 | +21 |

The final row adds one more sentence to the system prompt naming the "slot" each tool occupies in the mental model the agent is supposed to have. ("get_order_status is one order. list_recent_orders is many orders. search_orders_by_date is some orders.") Two points of recovery on edge cases from a single sentence.

A 21-point delta. Same model. Patterns 1 and 2 add no extra requests, only a longer prompt; pattern 3 doubles the request count but only on calls that route through it. The economics are good even before you count the retries you no longer pay for after a wrong tool call.
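The arithmetic behind the table is worth keeping next to the numbers so the step each pattern contributes stays visible. A small sketch that recomputes the deltas from the rows above; the shortened row labels and the deltas helper are mine.

```python
# (configuration, correct, wrong, no_call) copied from the table above.
ROWS = [
    ("Baseline", 68, 24, 8),
    ("+ Pattern 1", 79, 14, 7),
    ("+ Pattern 2", 85, 11, 4),
    ("+ Pattern 3", 87, 9, 4),
    ("All three + slot prose", 89, 8, 3),
]

def deltas(rows):
    """Return (name, delta vs baseline, step over the previous row) per configuration."""
    baseline = rows[0][1]
    previous = baseline
    out = []
    for name, correct, wrong, no_call in rows:
        # Every row must account for all 100 eval cases.
        assert correct + wrong + no_call == 100, f"{name} does not sum to 100 cases"
        out.append((name, correct - baseline, correct - previous))
        previous = correct
    return out

for name, vs_baseline, step in deltas(ROWS):
    print(f"{name}: {vs_baseline:+d} vs baseline, {step:+d} over previous row")
```

The step column reads 11, 6, 2, 2 points: most of the gain comes from the two patterns that cost no extra requests.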

The gotcha — over-instruction backfires past a certain tool count

The honest version of this post: patterns 1 and 2 can hurt past a tool count of about 20. The "DO NOT USE — use X instead" lines start chaining. Tool A says "don't use, use B." Tool B says "don't use, use C." The cross-references compound and the model spends its budget reading instructions instead of answering the question.

We saw this when we pushed the same 14-tool stack to 32. Pattern 1's accuracy lift dropped from +11 to +3. Pattern 2 went from +6 to +1. Pattern 3 became necessary, not optional.

The rule of thumb: if your system prompt's tool section is over 4,000 tokens, you've crossed the line past which adding more text stops helping. Switch to a router pattern that filters tools dynamically. Keep each tool's description tight and let the category boundaries do the disambiguation work.
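A rough budget check can warn you when you are nearing that line. This sketch assumes about 4 characters per token, which is only a heuristic and varies by tokenizer and content; for an exact figure, swap in your provider's token counting. The helper names and sample tool lists are hypothetical.

```python
import json

CHARS_PER_TOKEN = 4         # assumption: rough average for English text
TOOL_SECTION_BUDGET = 4000  # the rule-of-thumb threshold from the text

def estimate_tokens(tools: list[dict], examples_section: str = "") -> int:
    """Approximate tokens spent on tool definitions plus any examples section."""
    serialized = json.dumps(tools) + examples_section
    return len(serialized) // CHARS_PER_TOKEN

def over_budget(tools: list[dict], examples_section: str = "") -> bool:
    """True when the estimate exceeds the budget and a router is worth considering."""
    return estimate_tokens(tools, examples_section) > TOOL_SECTION_BUDGET

# Hypothetical tool sets: one tight description versus 32 padded ones.
small = [{"name": "get_order_status", "description": "Status of one order.", "input_schema": {}}]
bloated = [
    {
        "name": f"tool_{i}",
        "description": "USE when ... DO NOT USE when ... " * 20,
        "input_schema": {},
    }
    for i in range(32)
]

print(estimate_tokens(small), over_budget(small))
print(estimate_tokens(bloated), over_budget(bloated))
```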

The other thing that goes wrong: stale eval sets. If you wrote the 100 cases six months ago and your tool set has grown, the eval no longer covers the failure modes you have in production today. Sample production traces monthly, label what should have been called, append to the eval. The rig above is cheap enough to run on every prompt change. Make it part of CI and the regression that would have shipped on Tuesday gets caught Monday morning.
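Both habits fit in a few lines. The sketch below assumes results has the shape returned by score() and that labeled production traces arrive as (query, expected_tool) pairs. The helper names, the 2-point tolerance, and the sample data are all hypothetical.

```python
def accuracy(results: dict) -> float:
    """Fraction of cases where the expected tool was picked."""
    total = results["correct"] + results["wrong"] + results["no_call"]
    return results["correct"] / total if total else 0.0

def passes_gate(results: dict, previous_accuracy: float, tolerance: float = 0.02) -> bool:
    """True if accuracy has not dropped more than `tolerance` below the previous run."""
    return accuracy(results) >= previous_accuracy - tolerance

def merge_labeled_traces(eval_set, traces):
    """Append (query, expected_tool) pairs from production, skipping queries already covered."""
    known = {query.strip().lower() for query, _ in eval_set}
    merged = list(eval_set)
    for query, expected_tool in traces:
        key = query.strip().lower()
        if key not in known:
            merged.append((query, expected_tool))
            known.add(key)
    return merged

# Hypothetical run: a prompt change drops accuracy from 89% to 85%.
regressed = {"correct": 85, "wrong": 11, "no_call": 4}
print("gate passed" if passes_gate(regressed, previous_accuracy=0.89) else "gate failed")

eval_set = [("show me my last 5 orders", "list_recent_orders")]
traces = [
    ("Show me my last 5 orders", "list_recent_orders"),  # duplicate, different casing
    ("pause my plan for a month", "pause_subscription"),
]
print(merge_labeled_traces(eval_set, traces))
```

In CI, a failed gate would translate into a non-zero exit code so the prompt change cannot merge.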

What's the wrong-tool rate on your current agent, and have you ever measured it against a labeled eval set? Drop your numbers in the comments — I'm curious how 68% baseline lines up with what other teams see.

If this was useful

Tool selection is one of the load-bearing problems in agent design. Get it wrong and the rest of the system tries to compensate with retries, fallback prompts, and bigger models. The AI Agents Pocket Guide covers the broader pattern catalog: tool design, error recovery, multi-turn context handling, and the architectural decisions that decide whether your agent scales past 10 tools. The chapter on tool surfaces walks through the disambiguation patterns above and the router pattern in production-grade depth.