Tool-use API design for LLMs: 5 patterns that prevent agent loops and silent failures

In July 2025, a developer's Claude Code instance hit a recursion loop and burned through 1.67 billion tokens in 5 hours, generating an estimated $16,000 to $50,000 in API charges before anyone noticed. The agent did not crash. It did not throw an error. It just kept calling tools, getting confused, calling more tools, and silently accumulating cost. Old software crashes. LLM agents spend.

This is the failure mode most teams discover the hard way. You design a clean tool interface, the agent works in your test environment, you ship it to production, and three weeks later something hits an edge case that sends the agent into a loop. The patterns below are what we have used to prevent agent loops and silent failures across production LLM systems handling thousands of tool calls per day. None of them are about better prompts. They are about better tool design.

Why agent loops happen in the first place

Before the patterns, it helps to understand the failure mode precisely.

An LLM agent receives a user request, reasons about which tool to call, calls it, gets a result, reasons about whether the goal is achieved, and either responds or calls another tool. This loop is supposed to terminate when the goal is achieved or the model decides no further action is needed.

It does not terminate when:

  1. The tool result is ambiguous. The model cannot tell whether the call succeeded, so it tries again with slightly different parameters.
  2. The tool fails silently. The model receives a non-error response that does not actually contain the data it needed, so it interprets this as "I should retry."
  3. The tool returns conflicting information. Two consecutive calls return different results, the model loses confidence in either, and tries to "verify" by calling more tools.
  4. The model misreads its own previous output. With long context windows, the model sees a previous tool call's result, forgets it already processed that result, and re-processes it as new information.

Every one of these is preventable with tool design. The model is not the problem. The interface is.

Pattern 1: Make every tool result self-describing

The single most common cause of agent loops is tool results that the model cannot interpret without making assumptions.

A bad tool result:

{
  "results": [
    {"id": "h_1234", "name": "Hotel Granbell", "price": 128},
    {"id": "h_5678", "name": "Shibuya Stream", "price": 142}
  ]
}

The model now has to assume what this means. Are these the only matches? Are there more? Is the search complete? What was searched for? When the model gets confused (and at scale, it always gets confused eventually), it calls the tool again "to verify."

A self-describing tool result:

{
  "status": "success",
  "search_id": "srch_abc123",
  "query_summary": {
    "destination": "Shibuya, Tokyo",
    "check_in": "2026-07-12",
    "check_out": "2026-07-15",
    "guests": 1,
    "max_price": 150
  },
  "results": [
    {"id": "h_1234", "name": "Hotel Granbell", "price": 128, "currency": "USD"},
    {"id": "h_5678", "name": "Shibuya Stream", "price": 142, "currency": "USD"}
  ],
  "total_matches": 2,
  "is_complete": true,
  "next_action_hint": "User has 2 valid options. Present both with prices. Do not search again unless user changes parameters."
}

The is_complete: true and next_action_hint fields are the critical additions. The model can now read this result, understand that the search is finished, and know what to do next without re-querying. The query_summary echo lets the model verify it called the tool with the right parameters.

The next_action_hint is unconventional but extremely effective. It is a short instruction included in the tool response that tells the model what state the conversation is in. Think of it as the tool nudging the model toward correct loop termination.

# Wrapping tools to inject next_action_hint
def with_action_hint(tool_func):
    def wrapper(*args, **kwargs):
        result = tool_func(*args, **kwargs)
        result['next_action_hint'] = derive_hint(result)
        return result
    return wrapper

def derive_hint(result):
    # Use .get() so a malformed result cannot raise a KeyError here;
    # the hint key names match the error payloads in Pattern 2.
    status = result.get('status')
    matches = result.get('total_matches', 0)
    if status == 'success' and matches == 0:
        return "No matches. Inform user and ask for relaxed criteria. Do not retry."
    if status == 'success' and matches > 0:
        return f"Found {matches} matches. Present to user. Do not search again unless parameters change."
    if status == 'error':
        return f"Tool failed: {result.get('error_message', 'unknown error')}. Inform user. Do not retry without user input."
    return "Process result and decide next step."

Implementing this across our tool surface reduced retry-driven loops by approximately 60% in production.

Pattern 2: Distinguish between "no results" and "tool failure"

The second most common cause of agent loops: ambiguous failure states.

A search that returns zero matches is a successful tool call. A search that timed out is a failed tool call. To the LLM, both can look identical if the tool just returns an empty results array.

# Bad: indistinguishable from no results
def search_hotels(query):
    try:
        results = supplier_api.search(query)
        return {"results": results}
    except Exception:
        return {"results": []}  # silent failure

# Good: explicit status with retry guidance
def search_hotels(query):
    try:
        results = supplier_api.search(query, timeout=5)
        return {
            "status": "success",
            "results": results,
            "total_matches": len(results),
            "retryable": False,
        }
    except SupplierTimeout:
        return {
            "status": "error",
            "error_type": "timeout",
            "error_message": "Supplier API did not respond within 5 seconds.",
            "retryable": True,
            "retry_after_ms": 2000,
            "max_retries_remaining": get_retry_budget(query),
        }
    except SupplierAuthError:
        return {
            "status": "error",
            "error_type": "auth",
            "error_message": "API authentication failed.",
            "retryable": False,
            "user_facing_message": "We're having trouble accessing hotel data. Please try again later.",
        }
    except RateLimitError as e:
        return {
            "status": "error",
            "error_type": "rate_limit",
            "error_message": f"Rate limit hit. Reset in {e.reset_seconds}s.",
            "retryable": True,
            "retry_after_ms": e.reset_seconds * 1000,
        }

The retryable flag is doing real work. When it is false, the LLM knows there is no point retrying and will inform the user instead. When it is true, the LLM has a structured retry path with explicit limits.

Without this pattern, an authentication failure that looks like an empty result set causes the model to try increasingly creative parameter combinations to "find results," consuming tokens and producing nothing.
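
The get_retry_budget helper above is left undefined. A minimal sketch, assuming an in-memory counter keyed on a hash of the query (in production you would back this with shared state such as Redis so the budget holds across workers):

import hashlib
import json

# Hypothetical retry budget: each identical query gets MAX_RETRIES retries
# per process. Not a definitive implementation, just one way to give the
# LLM an honest max_retries_remaining number.
MAX_RETRIES = 3
_retry_budgets = {}

def get_retry_budget(query):
    key = hashlib.sha256(
        json.dumps(query, sort_keys=True, default=str).encode()
    ).hexdigest()
    remaining = _retry_budgets.get(key, MAX_RETRIES)
    # Each lookup consumes one retry, so repeated timeouts count down to zero.
    _retry_budgets[key] = max(remaining - 1, 0)
    return _retry_budgets[key]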

Pattern 3: Enforce a hard call budget at the orchestrator level

No matter how well-designed your tools are, the model will occasionally enter a loop. The orchestrator must enforce a hard ceiling.

import json

class AgentOrchestrator:
    def __init__(self, tools, max_tool_calls=15, max_total_cost_usd=0.50):
        self.tools = tools  # tool definitions passed through to call_llm
        self.max_tool_calls = max_tool_calls
        self.max_total_cost_usd = max_total_cost_usd
        self.calls_made = 0
        self.total_cost_usd = 0

    async def run_agent_turn(self, user_message, conversation_history):
        history = conversation_history + [{"role": "user", "content": user_message}]

        while self.calls_made < self.max_tool_calls:
            if self.total_cost_usd >= self.max_total_cost_usd:
                return self._cost_limit_response()

            response = await call_llm(history, tools=self.tools)
            self.total_cost_usd += response.cost_usd

            if not response.tool_calls:
                # Model produced final response, exit loop
                return response.content

            for tool_call in response.tool_calls:
                self.calls_made += 1
                tool_result = await self.execute_tool(tool_call)
                # Serialize the result so it lands in context as structured text
                history.append({"role": "tool", "content": json.dumps(tool_result)})

        # Hit call limit. Force a final response.
        return await self._force_final_response(history)

    async def _force_final_response(self, history):
        # Add explicit instruction and call LLM with tools=None
        history.append({
            "role": "system",
            "content": "Tool call limit reached. Produce a final response to the user "
                       "based on information already gathered. Do not request more tools."
        })
        response = await call_llm(history, tools=None)
        return response.content

Two safeguards here. First, max_tool_calls prevents infinite loops by capping iterations. Fifteen is our default for booking workflows. Anything more than that is almost always a sign the agent is confused, not productive. Second, max_total_cost_usd is a financial circuit breaker. Even if the agent finds creative ways to make many tool calls, it cannot spend more than the per-conversation budget.

When the limit is hit, the orchestrator does not just return an error. It calls the LLM one more time with tools=None, forcing it to produce a final response from whatever it has gathered. This is much better UX than "Sorry, agent failed."

For high-volume systems, also implement per-tenant rate limiting. The single-developer Claude Code incident burned an estimated $16,000 to $50,000 because there was no per-account ceiling. Production systems need both per-conversation and per-tenant limits.
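
A per-tenant ceiling does not need to be elaborate. A sketch, assuming an in-memory counter and a flat daily limit (production would use shared storage and per-plan limits):

from collections import defaultdict
from datetime import date

class TenantBudget:
    """Per-tenant daily spend ceiling. Back this with Redis or a database
    in production so every orchestrator instance shares the counter."""

    def __init__(self, daily_limit_usd=100.0):
        self.daily_limit_usd = daily_limit_usd
        self._spend = defaultdict(float)  # (tenant_id, day) -> USD spent

    def charge(self, tenant_id, cost_usd):
        self._spend[(tenant_id, date.today())] += cost_usd

    def is_exhausted(self, tenant_id):
        return self._spend[(tenant_id, date.today())] >= self.daily_limit_usd


# In the orchestrator, before each LLM call:
#     if tenant_budget.is_exhausted(tenant_id):
#         return self._tenant_limit_response()
# and after each response:
#     tenant_budget.charge(tenant_id, response.cost_usd)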

Pattern 4: Detect repeated calls and short-circuit them

Even with call budgets, agents waste budget by repeating the same call with minor variations. The fix is a deduplication layer at the orchestrator.

import hashlib
import json

class ToolCallDeduplicator:
    def __init__(self, window_size=5):
        self.recent_calls = []
        self.window_size = window_size

    def is_duplicate(self, tool_name, arguments):
        signature = self._signature(tool_name, arguments)
        is_dup = any(call == signature for call in self.recent_calls)
        self.recent_calls.append(signature)
        if len(self.recent_calls) > self.window_size:
            self.recent_calls.pop(0)
        return is_dup

    def _signature(self, tool_name, arguments):
        # Normalize arguments for comparison
        normalized = json.dumps(arguments, sort_keys=True, default=str)
        return f"{tool_name}:{hashlib.sha256(normalized.encode()).hexdigest()[:16]}"


# In the orchestrator
async def execute_tool(self, tool_call):
    if self.deduplicator.is_duplicate(tool_call.name, tool_call.arguments):
        return {
            "status": "duplicate_call_blocked",
            "message": (
                f"This exact {tool_call.name} call was made earlier in this conversation "
                f"with the same arguments. The previous result is already in your context. "
                f"Use it instead of calling again."
            ),
            "retryable": False,
        }

    return await self._actually_execute(tool_call)

When the model calls the same tool with the same arguments twice in a 5-call window, the orchestrator returns a structured "this is a duplicate" message instead of executing again. The model sees this and almost always recovers, often by referring back to the earlier result.

This pattern caught about 8% of calls in our production systems. Eight percent of total tool calls were unnecessary repeats. Blocking them saved both cost and latency.

A subtle detail: the deduplication signature should be lossy enough to catch near-duplicates. We use exact argument matching, but for some tools (search queries that differ only in word order), a normalization step before hashing would catch more.
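
For free-text fields, that normalization can be as simple as lowercasing and sorting the words before hashing. A sketch (which fields to treat this way depends on the tool):

def normalize_arguments(arguments):
    """Canonicalize free-text fields so that queries like
    'cheap hotels Shibuya' and 'Shibuya cheap hotels' hash identically."""
    normalized = dict(arguments)
    for key, value in normalized.items():
        if isinstance(value, str):
            normalized[key] = " ".join(sorted(value.lower().split()))
    return normalized


# In ToolCallDeduplicator._signature, hash normalize_arguments(arguments)
# instead of the raw arguments to catch word-order near-duplicates.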

Pattern 5: Parameter validation at the boundary, not inside the LLM

The slowest path to detecting a bad tool call is letting the LLM make it, the tool execute it, and the failure propagate back. The fastest path is validating parameters before the tool runs.

# Pydantic v1-style validators; ValidationError is needed by the handler below
from pydantic import BaseModel, Field, ValidationError, validator
from datetime import date, timedelta


class SearchHotelsArgs(BaseModel):
    destination: str = Field(min_length=2, max_length=100)
    check_in: date
    check_out: date
    guests: int = Field(ge=1, le=20)
    max_price: float = Field(gt=0, le=10000)

    @validator('check_in')
    def check_in_not_in_past(cls, v):
        if v < date.today():
            raise ValueError(f"check_in date {v} is in the past")
        return v

    @validator('check_out')
    def check_out_after_check_in(cls, v, values):
        if 'check_in' in values and v <= values['check_in']:
            raise ValueError("check_out must be after check_in")
        if 'check_in' in values and (v - values['check_in']) > timedelta(days=90):
            raise ValueError("Stay length cannot exceed 90 days")
        return v


# In the orchestrator
async def execute_tool(self, tool_call):
    if tool_call.name == "search_hotels":
        try:
            args = SearchHotelsArgs(**tool_call.arguments)
        except ValidationError as e:
            return {
                "status": "validation_error",
                "errors": e.errors(),
                "user_facing_hint": (
                    "Some search parameters were invalid. Confirm with the user before retrying."
                ),
                "retryable_after_correction": True,
            }
        return await self._search_hotels(args)

This catches three classes of bad calls:

  1. Type errors: the LLM passes a string where the tool expects an integer.
  2. Range errors: the LLM tries to search for 50 guests in one room.
  3. Logical errors: check-out before check-in, dates in the past.

By rejecting these at the boundary with structured error responses, we prevent the deeper failure mode where the supplier API gets called with bad data, returns a cryptic error, the LLM cannot interpret the error, and a loop starts.

The Pydantic-based approach also gives you JSON schema generation for free, which feeds directly into the tool definitions you send to the LLM. Schema-aligned validation across both ends.
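
A sketch of that wiring, using the Pydantic v1 .schema() method (v2 renames it model_json_schema()) and an OpenAI-style function shape; adapt the envelope to your provider:

def to_tool_definition(model_cls, name, description):
    """Build the LLM-facing tool definition from the same Pydantic model
    used for boundary validation, so both ends share one schema."""
    return {
        "name": name,
        "description": description,
        "parameters": model_cls.schema(),  # model_cls.model_json_schema() in Pydantic v2
    }


search_hotels_tool = to_tool_definition(
    SearchHotelsArgs,
    name="search_hotels",
    description=(
        "Search hotels by destination, dates, guests, and max price. "
        "Do not call again with identical parameters."
    ),
)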

What I would not recommend

A few approaches we tried and abandoned:

  • Injecting "are you done?" prompts mid-loop. This slows everything down and works only inconsistently. The orchestrator-level call budget is more reliable.
  • Letting the LLM see the full call history in every iteration. Increases context cost dramatically and provides little benefit. Pattern 4 (deduplication with structured feedback) is more efficient.
  • Streaming tool execution with partial results. Looks attractive but creates new failure modes where the LLM acts on incomplete data. Stick with atomic tool calls that either complete or fail cleanly.
  • Auto-generating tool definitions from API specs. Tempting because it sounds DRY, but auto-generated descriptions are usually not what the LLM needs. Hand-written tool descriptions, with explicit guidance about when to use the tool and when not to, work better.

Production results

After implementing these five patterns across our LLM-powered booking systems:

  • Agent loop incidents: dropped from 3–5 per week to under 1 per month.
  • Average tool calls per conversation: dropped 22%, mostly by eliminating duplicates and unnecessary retries.
  • Time-to-final-response: improved 18%, primarily from earlier short-circuiting of bad parameter calls.
  • Cost per conversation: dropped 31%, from a combination of fewer tool calls and tighter budget enforcement.

The patterns are not glamorous. They are mostly defensive engineering. But the alternative is the Claude Code incident: a $16,000 to $50,000 loss from an agent that "did not crash" but kept spending. In production LLM systems, the difference between cost-stable and not is exactly this kind of unsexy infrastructure.

If you are designing tools for an LLM agent, treat the tool interface as a contract that must hold even when the model is confused. The model will be confused. The contract will be tested. Patterns 1 through 5 are how it survives.


These patterns were developed across production builds at Adamo Software, including AI travel assistant deployments and agentic AI systems where tool-use reliability is non-negotiable.
