- Book: AI Agents Pocket Guide
- Also by me: Prompt Engineering Pocket Guide
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
You've seen this shape: an agent confidently calls delete_user instead of archive_user because the description is off by one verb. Same arg signature, both (user_id: str), the model picks the first one in the schema list, and a paying customer gets nuked. The post-mortem eats half a day. The fix is a sentence in a tool description.
That's edge case one of five. Tool-calling looks clean in the demo. You define a function, register it, the model calls it, you parse the JSON. Then you ship and the model finds the seams. It picks the wrong tool. It passes a string where you wanted an int. It calls search_orders six times in a row with the same query because the response format confused it. It runs send_email before compose_email finished returning. It calls a tool when the answer was already in context.
These five failure modes are common in production agents, and the prompt-plus-schema-plus-validator pattern handles them. Code is Python with the Anthropic SDK shape, but the patterns port to OpenAI, Vertex, and the AI SDK one-to-one.
Edge case 1: two valid tools, model picks neither (or the wrong one)
Two tools both fit the user's request. The model either picks the wrong one with confidence, or freezes and answers without calling any tool at all. The NESTFUL benchmark measured frontier models on nested sequences of API calls and found full-sequence accuracy was low across the board. A chunk of those misses are tool-selection misses, and many of the rest are argument-generation misses.
The fix lives in the description. Each tool description has to explicitly contrast itself against the closest sibling.
{
  "name": "archive_user",
  "description": "Soft-delete a user. Marks the row archived=true and hides the account from listings. REVERSIBLE within 30 days. Use this for cancellations, churn, GDPR soft requests. NOT for permanent deletion: use delete_user for that.",
  "input_schema": {
    "type": "object",
    "properties": {
      "user_id": {"type": "string", "format": "uuid"}
    },
    "required": ["user_id"]
  }
}
{
  "name": "delete_user",
  "description": "PERMANENTLY remove a user row and PII. IRREVERSIBLE. Use only for legal-mandated hard deletes (GDPR Article 17 confirmed, court order). NOT for cancellations: use archive_user.",
  "input_schema": {
    "type": "object",
    "properties": {
      "user_id": {"type": "string", "format": "uuid"},
      "legal_basis": {
        "type": "string",
        "enum": ["gdpr_article_17", "court_order"]
      }
    },
    "required": ["user_id", "legal_basis"]
  }
}
Two patterns there. First, every description names its sibling and the boundary between them. The model picks tools by reading descriptions; if your descriptions don't compare, the model guesses. Second, the dangerous tool requires an extra arg with a closed enum. A model that wants to call delete_user has to commit to a legal basis from a fixed list. That's hard to do casually.
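One more belt-and-suspenders step: mirror the enum in the argument model on your side of the wire, so a handler can never run with a legal basis outside the closed set. A sketch (DeleteUserArgs is hypothetical; validators get a full treatment in edge case 2):

from typing import Literal

from pydantic import BaseModel

class DeleteUserArgs(BaseModel):
    user_id: str
    # Closed set mirrors the schema enum: the model has to commit
    # to one of these, and anything else fails validation.
    legal_basis: Literal["gdpr_article_17", "court_order"]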
Edge case 2: wrong-arg-type, silently coerced
The model passes "42" where you wanted 42. Or "true" where you wanted true. Or "2026-04-27" where you wanted a full ISO datetime with timezone. JSON Schema validation catches the obvious ones. The non-obvious ones slide through. Strings that parse as numbers, dates without timezones, IDs that look right but reference the wrong namespace. They reach your downstream call and break something you didn't write.
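Part of the coercion happens inside the validation layer itself: Pydantic v2's default lax mode will happily turn "42" into 42 on an int field. If you want those mismatches surfaced as errors, turn on strict mode. A minimal sketch with a hypothetical ChargeArgs:

from pydantic import BaseModel, ConfigDict

class ChargeArgs(BaseModel):
    # Lax mode (the default) silently coerces "42" -> 42.
    # strict=True makes that a ValidationError instead.
    model_config = ConfigDict(strict=True)

    amount_cents: int

ChargeArgs(amount_cents=42)      # ok
# ChargeArgs(amount_cents="42")  # raises ValidationError in strict mode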
Wrap every tool handler in a validator. Don't trust the schema alone.
import re
from pydantic import BaseModel, Field, field_validator

UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-"
    r"[0-9a-f]{4}-[0-9a-f]{12}$"
)

class ArchiveUserArgs(BaseModel):
    user_id: str = Field(..., min_length=36, max_length=36)

    @field_validator("user_id")
    @classmethod
    def _uuid_shape(cls, v: str) -> str:
        if not UUID_RE.match(v.lower()):
            raise ValueError(
                f"user_id must be a UUID, got {v!r}"
            )
        return v.lower()

# TOOL_SCHEMAS maps tool name -> args model; TOOL_HANDLERS maps
# tool name -> the function that does the real work.
def call_tool(name: str, raw_args: dict) -> dict:
    schema = TOOL_SCHEMAS[name]
    try:
        args = schema(**raw_args)
    except Exception as e:
        return {
            "error": "invalid_arguments",
            "tool": name,
            "detail": str(e),
            "hint": "Re-read the tool schema and retry.",
        }
    return TOOL_HANDLERS[name](args)
The validator does two jobs the schema can't. It returns a structured error the model can read on the next turn, and it gives you a single chokepoint to log every coercion attempt. When user_id arrives as "USER-1234" instead of a UUID, the model sees the error message, corrects, and retries, usually within one turn. Without the validator, the request hits your DB layer, throws somewhere ugly, and the agent turn ends in a 500.
The hint matters. Models steer hard on the error string. "Re-read the tool schema and retry" produces better recoveries than a bare stack trace.
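To see the recovery loop concretely, here's the shape of what the model gets back on a bad call, assuming TOOL_SCHEMAS maps "archive_user" to ArchiveUserArgs (output abridged):

call_tool("archive_user", {"user_id": "USER-1234"})
# -> {"error": "invalid_arguments",
#     "tool": "archive_user",
#     "detail": "... user_id must be a UUID, got 'USER-1234' ...",
#     "hint": "Re-read the tool schema and retry."}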
Edge case 3: the infinite tool loop
Agent calls search_orders(query="acme"). Gets back a result that doesn't quite match what the user asked. Calls search_orders(query="acme") again. Gets the same result. Calls it again. You wake up to a $2,400 bill and a Slack thread.
Two layers of defense. The first is a hard step counter, the classic mitigation. The second is a duplicate-call detector that fires before the counter ever trips.
from collections import Counter
from hashlib import sha256
import json

class LoopGuard:
    def __init__(self, max_steps: int = 12,
                 max_repeat: int = 2):
        self.max_steps = max_steps
        self.max_repeat = max_repeat
        self.calls: Counter[str] = Counter()
        self.step = 0

    def _key(self, name: str, args: dict) -> str:
        blob = json.dumps(
            {"n": name, "a": args}, sort_keys=True
        )
        return sha256(blob.encode()).hexdigest()[:16]

    def check(self, name: str, args: dict) -> str | None:
        self.step += 1
        if self.step > self.max_steps:
            return "step_budget_exceeded"
        k = self._key(name, args)
        self.calls[k] += 1
        if self.calls[k] > self.max_repeat:
            return "duplicate_call_blocked"
        return None
In the agent loop, call guard.check(name, args) before dispatching. If it returns a reason, push that reason back into the conversation as a tool result with is_error=true and break out. The model sees "you already ran this exact call twice; pick a different approach" and almost always pivots. If it doesn't, the step budget catches it.
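A sketch of that push-back, assuming the LoopGuard above; the hint strings are illustrative, not canonical:

GUARD_HINTS = {
    "step_budget_exceeded": (
        "Tool budget exhausted for this turn. Answer with "
        "what you have."
    ),
    "duplicate_call_blocked": (
        "You already ran this exact call twice. Change the "
        "args or pick a different tool."
    ),
}

reason = guard.check(name, args)
if reason:
    result = {"error": reason, "hint": GUARD_HINTS[reason]}
    # push `result` back as a tool_result with is_error=true,
    # then break out of the dispatch loop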
The system prompt has to know about this. Otherwise the model treats the block as a bug and complains.
You have a tool budget of 12 calls per turn. Identical
calls (same name, same args) are blocked after 2 attempts.
If a tool result is unhelpful, change the args or pick a
different tool. Don't retry the exact call.
Edge case 4: wrong tool ordering
The agent has to compose an email before sending it. It sends first, the body is empty, the customer gets a blank message at 9:02 AM. Or it tries to charge a card before creating the customer record. Or it queries an analytics endpoint before the data warehouse refresh tool finished.
Schemas don't express order. Prompts have to. Two patterns work.
The first is a phase-gated system prompt. Group tools by phase and tell the model the phase rules.
You operate in three phases per request:
1. GATHER: call read-only tools (search_*, get_*, list_*).
   You may call any number, in any order.
2. PLAN: produce a short plan in plain text. No tools.
3. ACT: call write tools (create_*, send_*, charge_*,
   archive_*, delete_*) in the order from the plan.
You may not call a write tool before producing the plan.
If a write tool fails, return to PLAN before retrying.
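The prompt alone is soft enforcement. If you want the dispatcher to back it up, a small gate works. A sketch under the naming convention above, with a deliberately crude plan heuristic (PhaseGate and its method names are hypothetical):

READ_PREFIXES = ("search_", "get_", "list_")

class PhaseGate:
    """Blocks write tools until the model has emitted a plan."""

    def __init__(self):
        self.plan_seen = False

    def note_assistant_text(self, text: str) -> None:
        # Crude heuristic: any non-empty plain-text block from
        # the model counts as the PLAN step.
        if text.strip():
            self.plan_seen = True

    def check(self, tool_name: str) -> str | None:
        if tool_name.startswith(READ_PREFIXES):
            return None  # GATHER tools are always allowed
        if not self.plan_seen:
            return "write_tool_before_plan"
        return None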
The second is a precondition field on each write tool, enforced by the validator.
class SendEmailArgs(BaseModel):
    draft_id: str
    confirmed_recipient: str

    @field_validator("draft_id")
    @classmethod
    def _draft_exists(cls, v: str) -> str:
        # draft_store: whatever persistence layer compose_email
        # writes drafts into.
        if not draft_store.exists(v):
            raise ValueError(
                f"draft_id {v!r} not found. "
                "Call compose_email first."
            )
        return v
The model can't call send_email until compose_email has produced a real draft_id. The error message names the missing prerequisite, so the next turn fixes itself. This is the same "structured error, model reads it, recovers" pattern from edge case 2, applied to ordering instead of types.
Edge case 5: the tool that should be skipped entirely
You ask the agent "what's our refund policy?" The agent calls search_knowledge_base("refund policy"), retrieves three docs, summarizes them. Fine. You ask it "what's 2 + 2." The agent calls search_knowledge_base("2 + 2"), retrieves nothing relevant, and now you've spent 800ms and three vector lookups on arithmetic. Multiply by a million users.
The fix is a "no tool needed" escape hatch in the system prompt, plus a tool-selection rubric.
Before calling any tool, ask yourself:
- Is the answer already in the conversation?
- Is this a general-knowledge question the model can
  answer without a tool?
- Would a human assistant reach for a tool here, or
  just answer?
If any of the above is yes, answer directly. Tools cost
latency and money. Default to no tool.
Pair that with a tool description that names the trigger condition explicitly.
{
  "name": "search_knowledge_base",
  "description": "Search internal company docs (policies, runbooks, product specs). USE WHEN the user asks about company-specific information not in your training data. DO NOT USE for general knowledge, math, code questions, or follow-ups whose answer is already in the conversation.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": {"type": "string", "minLength": 3}
    },
    "required": ["query"]
  }
}
"USE WHEN" and "DO NOT USE" are two phrases that tend to move tool-call rate the most in production evals. The model is reading your description as a rubric. Give it one.
The prompt skeleton, end to end
The pieces fit together like this. System prompt sets phases and the no-tool rubric. Tool descriptions contrast siblings and name USE WHEN / DO NOT USE. Schemas use enums and formats wherever a closed set exists. Every handler runs through a Pydantic validator that returns structured errors. A LoopGuard wraps the dispatch loop. The model sees errors as readable text on the next turn and self-corrects.
def agent_turn(messages: list, tools: list) -> list:
    guard = LoopGuard()
    while True:
        resp = client.messages.create(  # client: anthropic.Anthropic()
            model="claude-opus-4-7",
            system=SYSTEM_PROMPT,
            messages=messages,
            tools=tools,
        )
        # Record the assistant turn first: tool_result blocks must
        # answer a tool_use block that exists in the history.
        messages.append(
            {"role": "assistant", "content": resp.content}
        )
        if resp.stop_reason != "tool_use":
            return messages
        results = []
        for block in resp.content:
            if block.type != "tool_use":
                continue
            reason = guard.check(block.name, block.input)
            if reason:
                result = {"error": reason}
            else:
                result = call_tool(block.name, block.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": json.dumps(result),
                "is_error": "error" in result,
            })
        # One user message carries every tool_result for the turn.
        messages.append({"role": "user", "content": results})
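Driving it is one call per user turn. A usage sketch, with hypothetical names for the tool-definition dicts from earlier:

TOOLS = [ARCHIVE_USER, DELETE_USER, SEARCH_KNOWLEDGE_BASE]

messages = [{"role": "user",
             "content": "Archive the account for jane@example.com"}]
messages = agent_turn(messages, TOOLS)
print(messages[-1]["content"])  # final assistant answer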
That loop is maybe forty lines once you flesh out the imports. It survives all five edge cases above without any LangChain, LangGraph, or framework on top. Teams shipping production agents tend to reinvent pieces of this anyway; writing it once and treating it as the contract is faster than learning a framework's escape hatches.
The pattern that ties everything together is small: tools fail, the model reads the failure, the model retries. Your job is to make every failure a sentence the model can act on. A bare 500 isn't actionable. "draft_id 'xyz' not found. Call compose_email first." is. The whole rig is just plumbing around that one idea.
If this was useful
The AI Agents Pocket Guide goes deeper on the agent-loop design: tool budgets, retry policy, phase prompts, and the eval rig that catches edge cases before customers do. The Prompt Engineering Pocket Guide covers the description-writing patterns from edge cases 1 and 5 in their own chapter, with before/after diffs and the eval scores that go with them. Pair them if you're shipping anything that calls more than one tool per turn.

