DEV Community

Young Gao

Practical Guide to Building AI Agents with Tool Use: Patterns That Actually Work in Production

Every week there's a new "autonomous AI agent" framework on GitHub with 10k stars and a demo that books flights, writes code, and orders pizza. Every week, teams try to use these in production and discover they hallucinate tool calls, burn through API budgets in minutes, and get stuck in infinite loops.

The gap between agent demos and production agents is enormous. This guide bridges it. We'll build a minimal agent framework from scratch, implement battle-tested patterns for tool use, and be honest about when you should skip agents entirely. No frameworks, no magic -- just Python, an LLM API, and hard-won lessons from shipping agents that handle real workloads.

What AI Agents with Tool Use Actually Are

Strip away the hype and an AI agent is just a loop:

  1. The LLM receives a task and a list of available tools
  2. It decides which tool to call (or whether to respond directly)
  3. The tool executes and returns a result
  4. The LLM sees the result and decides what to do next

That's it. The "intelligence" comes from the LLM's ability to plan multi-step sequences and adapt when things go wrong. The "agency" comes from the loop -- the model keeps going until the task is done.

This is fundamentally different from a single LLM call. A single call is a function: input in, output out. An agent is a program that runs for an indeterminate number of steps. That distinction has massive implications for error handling, cost, and safety.

The Core Loop: Plan, Act, Observe, Reflect

Every production agent follows the same conceptual loop, whether the framework makes it explicit or not:

  • Plan: The LLM analyzes the current state and decides on the next action. This might be implicit (the model just picks a tool) or explicit (the model writes out its reasoning first).
  • Act: A tool is called with specific arguments. This is where the agent interacts with the real world -- APIs, databases, file systems.
  • Observe: The tool's output (or error) is fed back to the LLM as a new message in the conversation.
  • Reflect: The LLM evaluates whether the task is complete, whether the tool output was useful, and what to do next. This often happens implicitly within the next "plan" step.

The key insight: the conversation history IS the agent's memory. Every tool call and result gets appended to the message list. The LLM reasons over the full history each iteration. This is both the strength (rich context) and the weakness (token costs grow linearly with steps).
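To make that concrete, here is roughly what the message list looks like after one tool call. The block shapes follow the Anthropic Messages API; the specific IDs and values are illustrative:

```python
# Illustrative snapshot of the agent's "memory" after one tool call.
# Each iteration appends an assistant turn (with any tool_use blocks)
# and a user turn carrying the matching tool_result blocks, so the
# context the model sees grows every step.
messages = [
    {"role": "user", "content": "What is 2^32?"},              # the task
    {"role": "assistant", "content": [                         # plan + act
        {"type": "tool_use", "id": "tu_1", "name": "calculator",
         "input": {"expression": "2 ** 32"}},
    ]},
    {"role": "user", "content": [                              # observe
        {"type": "tool_result", "tool_use_id": "tu_1",
         "content": "4294967296"},
    ]},
    # next turn: the model reasons over the full history above
]
```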

A Minimal Agent Framework in Python

Here's a complete, runnable agent in about 100 lines. It uses the Anthropic API, but the pattern is identical for OpenAI.

"""
minimal_agent.py - A production-ready agent loop in ~100 lines.
Requires: pip install anthropic
"""
import json
import anthropic
from typing import Any, Callable

# --- Tool Registry ---
TOOLS: dict[str, Callable] = {}
TOOL_SCHEMAS: list[dict] = []

def tool(name: str, description: str, input_schema: dict):
    """Decorator to register a function as an agent tool."""
    def decorator(func: Callable) -> Callable:
        TOOLS[name] = func
        TOOL_SCHEMAS.append({
            "name": name,
            "description": description,
            "input_schema": input_schema,
        })
        return func
    return decorator

# --- Example Tools ---
@tool(
    name="calculator",
    description="Evaluate a mathematical expression. Returns the numeric result.",
    input_schema={
        "type": "object",
        "properties": {
            "expression": {
                "type": "string",
                "description": "A Python math expression, e.g. '2 ** 10 + 5'"
            }
        },
        "required": ["expression"],
    },
)
def calculator(expression: str) -> str:
    """Restricted math evaluation: allowlist the characters first, then
    eval with builtins stripped so no names or calls can sneak through."""
    allowed = set("0123456789+-*/().% ")
    if not all(c in allowed for c in expression):
        return "Error: expression contains disallowed characters"
    try:
        result = eval(expression, {"__builtins__": {}}, {})
        return str(result)
    except Exception as e:
        return f"Error: {e}"

@tool(
    name="lookup_user",
    description="Look up a user by ID. Returns their name and email.",
    input_schema={
        "type": "object",
        "properties": {
            "user_id": {"type": "string", "description": "The user ID"}
        },
        "required": ["user_id"],
    },
)
def lookup_user(user_id: str) -> str:
    fake_db = {
        "u_001": {"name": "Alice Chen", "email": "alice@example.com"},
        "u_002": {"name": "Bob Park", "email": "bob@example.com"},
    }
    user = fake_db.get(user_id)
    if user:
        return json.dumps(user)
    return f"Error: user '{user_id}' not found"

# --- Agent Loop ---
def run_agent(
    task: str,
    max_steps: int = 10,
    max_tokens_per_turn: int = 1024,
    model: str = "claude-sonnet-4-20250514",
) -> str:
    client = anthropic.Anthropic()
    messages = [{"role": "user", "content": task}]
    system = "You are a helpful assistant. Use the provided tools when needed. Be concise."

    for step in range(max_steps):
        response = client.messages.create(
            model=model,
            max_tokens=max_tokens_per_turn,
            system=system,
            tools=TOOL_SCHEMAS,
            messages=messages,
        )

        # Collect all content blocks
        assistant_content = response.content
        messages.append({"role": "assistant", "content": assistant_content})

        # If the model didn't request a tool (finished, or hit max_tokens), we're done
        if response.stop_reason != "tool_use":
            # Extract the final text response
            text_parts = [b.text for b in assistant_content if b.type == "text"]
            return "\n".join(text_parts)

        # Process tool calls
        tool_results = []
        for block in assistant_content:
            if block.type == "tool_use":
                func = TOOLS.get(block.name)
                if func is None:
                    result = f"Error: unknown tool '{block.name}'"
                else:
                    try:
                        result = func(**block.input)
                    except Exception as e:
                        result = f"Error executing {block.name}: {e}"
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result,
                })

        messages.append({"role": "user", "content": tool_results})

    return "Error: agent exceeded maximum steps"

# --- Run It ---
if __name__ == "__main__":
    answer = run_agent("What is 2^32, and can you look up user u_001?")
    print(answer)

Run it, and the agent will call calculator and lookup_user in sequence (or parallel, depending on the model), then synthesize a final answer. This is the skeleton every production agent builds on.

Tool Definition Patterns

Both OpenAI and Anthropic use JSON Schema for tool definitions. The quality of your schema directly impacts how reliably the model calls your tools. Here are patterns that work.

Be Specific in Descriptions

Bad:

{"name": "search", "description": "Search for stuff"}

Good:

{
    "name": "search_documents",
    "description": "Full-text search over the internal knowledge base. Returns up to 5 matching document snippets ranked by relevance. Use this when the user asks about company policies, product specs, or internal processes.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Natural language search query. Be specific -- 'vacation policy for US employees' works better than 'vacation'."
            },
            "max_results": {
                "type": "integer",
                "description": "Number of results to return (1-10). Default: 5.",
                "default": 5
            }
        },
        "required": ["query"]
    }
}

The description should tell the model when to use the tool, not just what it does. Include example inputs. Mention edge cases.

Enum Parameters Over Free-Text

When a parameter has a fixed set of valid values, use an enum:

"status_filter": {
    "type": "string",
    "enum": ["open", "closed", "in_progress"],
    "description": "Filter tickets by status"
}

This eliminates an entire class of hallucinated arguments.

Return Structured Data

Tool outputs should be structured and concise. Don't return raw HTML or 50KB API responses. Parse, filter, and format the result before handing it back:

def search_tickets(query: str, status_filter: str = "open") -> str:
    # ticket_api stands in for whatever client your ticketing system exposes
    raw_results = ticket_api.search(query, status=status_filter)
    # Don't return raw API response -- extract what matters
    formatted = [
        {"id": t["id"], "title": t["title"], "status": t["status"]}
        for t in raw_results[:5]
    ]
    return json.dumps(formatted, indent=2)

Every unnecessary byte in a tool result costs tokens on every subsequent turn.
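A blunt but effective backstop is to cap tool output size before it enters the conversation. The character limit here is an illustrative default; tune it per tool:

```python
# Cap a tool result's size and tell the model how much was cut, so it can
# decide whether to re-query with a narrower request.
def truncate_result(result: str, max_chars: int = 4000) -> str:
    if len(result) <= max_chars:
        return result
    omitted = len(result) - max_chars
    return result[:max_chars] + f"\n[truncated: {omitted} chars omitted]"
```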

Error Handling and Retry Strategies

Tools fail. APIs time out. Models hallucinate invalid arguments. Your agent needs to handle all of this gracefully.

Structured Error Returns

Never let exceptions bubble up as raw tracebacks. Return errors as structured strings the model can reason about:

def execute_tool(name: str, args: dict) -> str:
    func = TOOLS.get(name)
    if func is None:
        return json.dumps({"error": "unknown_tool", "message": f"No tool named '{name}'. Available: {list(TOOLS.keys())}"})

    try:
        result = func(**args)
        return result
    except TypeError as e:
        return json.dumps({"error": "invalid_arguments", "message": str(e), "hint": "Check the required parameters and their types."})
    except TimeoutError:
        return json.dumps({"error": "timeout", "message": f"Tool '{name}' timed out after 30s. Try again or use a simpler query."})
    except Exception as e:
        return json.dumps({"error": "execution_error", "message": str(e)})

The model can read these error messages and self-correct. Often it will fix its own argument mistakes on the retry. The hint field is particularly useful -- it guides the model toward the right fix.

Retry with Exponential Backoff on Transient Failures

Wrap external API calls with retries at the tool level, not the agent level:

import json
import time
import requests
from functools import wraps

def with_retries(max_retries: int = 3, backoff_base: float = 1.0):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except (ConnectionError, TimeoutError) as e:
                    if attempt == max_retries - 1:
                        return json.dumps({"error": "max_retries_exceeded", "message": str(e)})
                    time.sleep(backoff_base * (2 ** attempt))
            return json.dumps({"error": "unexpected", "message": "Retry loop exited unexpectedly"})
        return wrapper
    return decorator

@with_retries(max_retries=3)
def call_external_api(endpoint: str, params: dict) -> str:
    response = requests.get(endpoint, params=params, timeout=10)
    response.raise_for_status()
    return json.dumps(response.json())

This keeps transient failures invisible to the agent. It only sees the error if retries are exhausted.

Guardrails: Preventing Runaway Agents

An unguarded agent with access to your production database is a liability. Here are the guardrails that matter.

Token Budget Enforcement

Track cumulative token usage and kill the loop when budget is exceeded:

def run_agent_with_budget(task: str, token_budget: int = 50_000, **kwargs) -> str:
    client = anthropic.Anthropic()
    messages = [{"role": "user", "content": task}]
    total_input_tokens = 0
    total_output_tokens = 0

    for step in range(kwargs.get("max_steps", 10)):
        response = client.messages.create(
            model=kwargs.get("model", "claude-sonnet-4-20250514"),
            max_tokens=kwargs.get("max_tokens_per_turn", 1024),
            system="You are a helpful assistant. Use tools when needed.",
            tools=TOOL_SCHEMAS,
            messages=messages,
        )

        total_input_tokens += response.usage.input_tokens
        total_output_tokens += response.usage.output_tokens
        total = total_input_tokens + total_output_tokens

        if total > token_budget:
            return f"Agent stopped: token budget exceeded ({total}/{token_budget} tokens used)"

        # ... rest of the loop (same as before)

Action Limits and Confirmation Gates

For destructive operations, require explicit confirmation:

DESTRUCTIVE_TOOLS = {"delete_record", "send_email", "execute_sql_write"}

def execute_with_guardrails(name: str, args: dict, confirm_fn=None) -> str:
    if name in DESTRUCTIVE_TOOLS:
        if confirm_fn is None:
            return json.dumps({
                "error": "confirmation_required",
                "message": f"Tool '{name}' requires human confirmation. Args: {json.dumps(args)}"
            })
        if not confirm_fn(name, args):
            return json.dumps({"error": "rejected", "message": "Human rejected the action."})

    return execute_tool(name, args)

In a web application, confirm_fn could pause the agent and show a dialog. In a CLI, it could prompt y/n. The point is that the agent cannot bypass the gate.
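One possible CLI implementation, matching the `(name, args)` signature that `execute_with_guardrails` expects (in a web app, the same hook would instead pause the run and enqueue an approval request):

```python
import json

# Hypothetical CLI confirm_fn: show the proposed action, prompt for y/n.
# Anything other than an explicit "y" is treated as a rejection.
def cli_confirm(tool_name: str, args: dict) -> bool:
    print(f"Agent wants to run '{tool_name}' with args:\n{json.dumps(args, indent=2)}")
    return input("Allow? [y/N] ").strip().lower() == "y"
```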

Rate Limiting Per Tool

Prevent the agent from hammering a single tool in a loop:

from collections import defaultdict

class ToolRateLimiter:
    def __init__(self, max_calls_per_tool: int = 5):
        self.counts: dict[str, int] = defaultdict(int)
        self.max_calls = max_calls_per_tool

    def check(self, tool_name: str) -> bool:
        self.counts[tool_name] += 1
        return self.counts[tool_name] <= self.max_calls

    def reject_message(self, tool_name: str) -> str:
        return json.dumps({
            "error": "rate_limited",
            "message": f"Tool '{tool_name}' has been called {self.counts[tool_name]} times (limit: {self.max_calls}). Find another approach or summarize what you've found so far."
        })

This is particularly important for search tools. Without it, the agent will sometimes enter a loop of slightly different searches, each burning tokens without making progress.

Real Production Patterns

Idempotent Tools

Every tool that mutates state should be idempotent. If the agent retries a tool call (because it didn't see the result, or the framework retried), the outcome should be the same:

@tool(
    name="create_or_update_ticket",
    description="Create a ticket or update it if it already exists. Uses idempotency_key to prevent duplicates.",
    input_schema={
        "type": "object",
        "properties": {
            "idempotency_key": {"type": "string", "description": "Unique key for this operation (e.g. 'user-123-refund-456')"},
            "title": {"type": "string"},
            "body": {"type": "string"},
        },
        "required": ["idempotency_key", "title", "body"],
    },
)
def create_or_update_ticket(idempotency_key: str, title: str, body: str) -> str:
    existing = db.tickets.find_one({"idempotency_key": idempotency_key})
    if existing:
        db.tickets.update_one(
            {"idempotency_key": idempotency_key},
            {"$set": {"title": title, "body": body}}
        )
        return json.dumps({"status": "updated", "id": existing["id"]})
    else:
        ticket_id = db.tickets.insert_one({
            "idempotency_key": idempotency_key,
            "title": title,
            "body": body
        }).inserted_id
        return json.dumps({"status": "created", "id": str(ticket_id)})

Audit Logging

Log every tool invocation. You will need this for debugging, compliance, and cost analysis:

import datetime
import uuid

class AuditLogger:
    def __init__(self, log_store):
        self.log_store = log_store

    def log_tool_call(self, session_id: str, tool_name: str, args: dict, result: str, duration_ms: float):
        entry = {
            "id": str(uuid.uuid4()),
            "session_id": session_id,
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "tool_name": tool_name,
            "arguments": args,
            "result_preview": result[:500],  # Don't store huge outputs
            "duration_ms": duration_ms,
        }
        self.log_store.append(entry)
        return entry

Integrate this into your execute_tool function. Every call gets logged with timing, arguments, and a truncated result. When an agent goes off the rails at 2 AM, this log is how you figure out what happened.

Cost Tracking

Token costs add up fast with multi-step agents. Track them per-session:

# Pricing per million tokens (example rates, check current pricing)
PRICING = {
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
    "claude-opus-4-20250514": {"input": 15.00, "output": 75.00},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    rates = PRICING.get(model, {"input": 5.0, "output": 15.0})
    input_cost = (input_tokens / 1_000_000) * rates["input"]
    output_cost = (output_tokens / 1_000_000) * rates["output"]
    return round(input_cost + output_cost, 6)

Set hard dollar limits per agent session. A customer support agent that costs $2 per conversation is a problem. Measure this from day one.
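To make "adds up fast" concrete, here is the arithmetic for a typical multi-step session, with the function and example Sonnet rates redefined so the snippet runs standalone (rates are the illustrative ones from the table above, not guaranteed current):

```python
PRICING = {"claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00}}

def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    rates = PRICING[model]
    return round((input_tokens / 1_000_000) * rates["input"]
                 + (output_tokens / 1_000_000) * rates["output"], 6)

# A 10-step session easily accumulates ~100K input tokens (the growing
# history is re-sent every turn) and ~20K output tokens:
cost = calculate_cost("claude-sonnet-4-20250514", 100_000, 20_000)
# $0.30 input + $0.30 output = $0.60 per session
```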

When NOT to Use Agents

Agents are powerful, but they're also slow, expensive, and non-deterministic. Here's when simpler alternatives win.

Use a Single LLM Call When...

  • The task is self-contained (summarization, translation, classification)
  • You don't need external data or side effects
  • Determinism matters more than flexibility
  • Latency budget is under 2 seconds

Use Traditional Code When...

  • The logic is fully known at design time
  • You're doing data transformation with clear rules
  • The "decision" is a lookup table or a few if statements
  • You need guaranteed correctness (financial calculations)

Use a Pipeline (Chain) Instead of an Agent When...

  • The steps are always the same, just the data changes
  • You can hardcode the sequence: extract -> enrich -> format -> send
  • There's no conditional branching based on intermediate results

A pipeline is a fixed sequence of LLM calls and tool executions. An agent is a dynamic loop. Pipelines are cheaper, faster, and easier to debug. Only reach for agents when you genuinely need the model to decide what to do next based on what it just learned.

The Decision Framework

Ask yourself: "Does the next step depend on the result of the previous step in ways I can't predict at design time?" If yes, you need an agent. If no, you probably don't.

A common anti-pattern is building an agent to do something that's really a three-step pipeline wearing a trench coat. The agent technically works, but it's 5x slower, 10x more expensive, and fails in unpredictable ways compared to the hardcoded version.

Putting It All Together

Here's the production-ready version combining all the patterns above -- budget enforcement, rate limiting, audit logging, and guardrails -- in a single coherent loop:

def run_production_agent(
    task: str,
    session_id: str | None = None,
    max_steps: int = 10,
    token_budget: int = 50_000,
    max_calls_per_tool: int = 5,
    confirm_fn: Callable | None = None,
    model: str = "claude-sonnet-4-20250514",
) -> dict:
    session_id = session_id or str(uuid.uuid4())
    client = anthropic.Anthropic()
    messages = [{"role": "user", "content": task}]
    rate_limiter = ToolRateLimiter(max_calls_per_tool)
    audit = AuditLogger(log_store=[])
    total_input_tokens = 0
    total_output_tokens = 0

    for step in range(max_steps):
        response = client.messages.create(
            model=model, max_tokens=1024,
            system="You are a helpful assistant. Use tools when needed. Be concise.",
            tools=TOOL_SCHEMAS, messages=messages,
        )

        total_input_tokens += response.usage.input_tokens
        total_output_tokens += response.usage.output_tokens

        if total_input_tokens + total_output_tokens > token_budget:
            return {"status": "budget_exceeded", "session_id": session_id,
                    "cost": calculate_cost(model, total_input_tokens, total_output_tokens)}

        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason != "tool_use":
            text = "\n".join(b.text for b in response.content if b.type == "text")
            return {
                "status": "complete", "result": text, "session_id": session_id,
                "steps": step + 1,
                "tokens": {"input": total_input_tokens, "output": total_output_tokens},
                "cost": calculate_cost(model, total_input_tokens, total_output_tokens),
                "audit_log": audit.log_store,
            }

        tool_results = []
        for block in response.content:
            if block.type != "tool_use":
                continue

            if not rate_limiter.check(block.name):
                result = rate_limiter.reject_message(block.name)
            elif block.name in DESTRUCTIVE_TOOLS and confirm_fn and not confirm_fn(block.name, block.input):
                result = json.dumps({"error": "rejected", "message": "Human rejected."})
            else:
                start = time.time()
                result = execute_tool(block.name, block.input)
                duration = (time.time() - start) * 1000
                audit.log_tool_call(session_id, block.name, block.input, result, duration)

            tool_results.append({"type": "tool_result", "tool_use_id": block.id, "content": result})

        messages.append({"role": "user", "content": tool_results})

    return {"status": "max_steps_exceeded", "session_id": session_id, "steps": max_steps}

This returns a structured result with cost tracking, audit logs, and clear status. You can store this in a database, alert on high-cost sessions, and replay the audit log for debugging.

Conclusion

Building AI agents that work in production comes down to engineering discipline, not framework magic. The core loop is simple: let the model plan, execute tools, observe results, and iterate. Everything else is guardrails and operational hygiene.

The patterns that matter most:

  1. Keep the tool registry simple -- decorators, JSON schemas, a dictionary. You don't need a framework for this.
  2. Return structured errors from tools so the model can self-correct.
  3. Enforce hard limits on tokens, steps, and per-tool call counts. Runaway agents are not a theoretical risk; they're a Tuesday.
  4. Make tools idempotent because retries are inevitable.
  5. Log everything -- tool calls, arguments, results, timing, costs. You will need this data.
  6. Ask whether you need an agent at all. A pipeline is almost always better if the steps are predictable.

The best agent is the simplest one that gets the job done. Start with a single tool and the minimal loop shown here. Add complexity only when production data tells you to. The 100-line agent in this article isn't a toy -- it's a foundation you can build real systems on.
