Every week there's a new "autonomous AI agent" framework on GitHub with 10k stars and a demo that books flights, writes code, and orders pizza. Every week, teams try to use these in production and discover they hallucinate tool calls, burn through API budgets in minutes, and get stuck in infinite loops.
The gap between agent demos and production agents is enormous. This guide bridges it. We'll build a minimal agent framework from scratch, implement battle-tested patterns for tool use, and be honest about when you should skip agents entirely. No frameworks, no magic -- just Python, an LLM API, and hard-won lessons from shipping agents that handle real workloads.
What AI Agents with Tool Use Actually Are
Strip away the hype and an AI agent is just a loop:
- The LLM receives a task and a list of available tools
- It decides which tool to call (or whether to respond directly)
- The tool executes and returns a result
- The LLM sees the result and decides what to do next
That's it. The "intelligence" comes from the LLM's ability to plan multi-step sequences and adapt when things go wrong. The "agency" comes from the loop -- the model keeps going until the task is done.
This is fundamentally different from a single LLM call. A single call is a function: input in, output out. An agent is a program that runs for an indeterminate number of steps. That distinction has massive implications for error handling, cost, and safety.
The Core Loop: Plan, Act, Observe, Reflect
Every production agent follows the same conceptual loop, whether the framework makes it explicit or not:
- Plan: The LLM analyzes the current state and decides on the next action. This might be implicit (the model just picks a tool) or explicit (the model writes out its reasoning first).
- Act: A tool is called with specific arguments. This is where the agent interacts with the real world -- APIs, databases, file systems.
- Observe: The tool's output (or error) is fed back to the LLM as a new message in the conversation.
- Reflect: The LLM evaluates whether the task is complete, whether the tool output was useful, and what to do next. This often happens implicitly within the next "plan" step.
The key insight: the conversation history IS the agent's memory. Every tool call and result gets appended to the message list, and the LLM reasons over the full history each iteration. This is both the strength (rich context) and the weakness (each turn resends the full history, so per-turn input tokens grow linearly with steps and the cumulative bill for a run grows roughly quadratically).
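To put rough numbers on that growth, here's a back-of-envelope estimator. The token counts are hypothetical, purely for illustration: each turn resends the base prompt plus everything accumulated so far.

```python
def estimate_input_tokens(base_tokens: int, tokens_per_step: int, steps: int) -> int:
    """Cumulative input tokens when the full history is resent each turn.

    Turn k sends base_tokens + k * tokens_per_step (system prompt and task,
    plus every prior tool call and result), so the total is quadratic in steps.
    """
    return sum(base_tokens + k * tokens_per_step for k in range(steps))

# A 10-step run with a 1,000-token prompt and ~500 tokens added per step:
total = estimate_input_tokens(base_tokens=1_000, tokens_per_step=500, steps=10)
```

At 10 steps that's 32,500 input tokens, against 10,000 for ten independent calls -- worth keeping in mind when you pick max_steps.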
A Minimal Agent Framework in Python
Here's a complete, runnable agent in about 100 lines. It uses the Anthropic API, but the pattern is identical for OpenAI.
"""
minimal_agent.py - A production-ready agent loop in ~100 lines.
Requires: pip install anthropic
"""
import json
import anthropic
from typing import Any, Callable
# --- Tool Registry ---
TOOLS: dict[str, Callable] = {}
TOOL_SCHEMAS: list[dict] = []
def tool(name: str, description: str, input_schema: dict):
"""Decorator to register a function as an agent tool."""
def decorator(func: Callable) -> Callable:
TOOLS[name] = func
TOOL_SCHEMAS.append({
"name": name,
"description": description,
"input_schema": input_schema,
})
return func
return decorator
# --- Example Tools ---
@tool(
name="calculator",
description="Evaluate a mathematical expression. Returns the numeric result.",
input_schema={
"type": "object",
"properties": {
"expression": {
"type": "string",
"description": "A Python math expression, e.g. '2 ** 10 + 5'"
}
},
"required": ["expression"],
},
)
def calculator(expression: str) -> str:
"""Safe math evaluation -- no exec/eval of arbitrary code."""
allowed = set("0123456789+-*/().% ")
if not all(c in allowed for c in expression):
return f"Error: expression contains disallowed characters"
try:
result = eval(expression, {"__builtins__": {}}, {})
return str(result)
except Exception as e:
return f"Error: {e}"
@tool(
name="lookup_user",
description="Look up a user by ID. Returns their name and email.",
input_schema={
"type": "object",
"properties": {
"user_id": {"type": "string", "description": "The user ID"}
},
"required": ["user_id"],
},
)
def lookup_user(user_id: str) -> str:
fake_db = {
"u_001": {"name": "Alice Chen", "email": "alice@example.com"},
"u_002": {"name": "Bob Park", "email": "bob@example.com"},
}
user = fake_db.get(user_id)
if user:
return json.dumps(user)
return f"Error: user '{user_id}' not found"
# --- Agent Loop ---
def run_agent(
task: str,
max_steps: int = 10,
max_tokens_per_turn: int = 1024,
model: str = "claude-sonnet-4-20250514",
) -> str:
client = anthropic.Anthropic()
messages = [{"role": "user", "content": task}]
system = "You are a helpful assistant. Use the provided tools when needed. Be concise."
for step in range(max_steps):
response = client.messages.create(
model=model,
max_tokens=max_tokens_per_turn,
system=system,
tools=TOOL_SCHEMAS,
messages=messages,
)
# Collect all content blocks
assistant_content = response.content
messages.append({"role": "assistant", "content": assistant_content})
        # If the model didn't request a tool (end_turn, max_tokens, etc.), we're done.
        # Checking for "tool_use" directly avoids hanging on truncated responses.
        if response.stop_reason != "tool_use":
            # Extract the final text response
            text_parts = [b.text for b in assistant_content if b.type == "text"]
            return "\n".join(text_parts)
# Process tool calls
tool_results = []
for block in assistant_content:
if block.type == "tool_use":
func = TOOLS.get(block.name)
if func is None:
result = f"Error: unknown tool '{block.name}'"
else:
try:
result = func(**block.input)
except Exception as e:
result = f"Error executing {block.name}: {e}"
tool_results.append({
"type": "tool_result",
"tool_use_id": block.id,
"content": result,
})
messages.append({"role": "user", "content": tool_results})
return "Error: agent exceeded maximum steps"
# --- Run It ---
if __name__ == "__main__":
answer = run_agent("What is 2^32, and can you look up user u_001?")
print(answer)
Run it, and the agent will call calculator and lookup_user in sequence (or parallel, depending on the model), then synthesize a final answer. This is the skeleton every production agent builds on.
Tool Definition Patterns
Both OpenAI and Anthropic use JSON Schema for tool definitions. The quality of your schema directly impacts how reliably the model calls your tools. Here are patterns that work.
Be Specific in Descriptions
Bad:
{"name": "search", "description": "Search for stuff"}
Good:
{
"name": "search_documents",
"description": "Full-text search over the internal knowledge base. Returns up to 5 matching document snippets ranked by relevance. Use this when the user asks about company policies, product specs, or internal processes.",
"input_schema": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "Natural language search query. Be specific -- 'vacation policy for US employees' works better than 'vacation'."
},
"max_results": {
"type": "integer",
"description": "Number of results to return (1-10). Default: 5.",
"default": 5
}
},
"required": ["query"]
}
}
The description should tell the model when to use the tool, not just what it does. Include example inputs. Mention edge cases.
Enum Parameters Over Free-Text
When a parameter has a fixed set of valid values, use an enum:
"status_filter": {
"type": "string",
"enum": ["open", "closed", "in_progress"],
"description": "Filter tickets by status"
}
This eliminates an entire class of hallucinated arguments.
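You can also enforce the schema on your side before the tool runs. This is a minimal sketch covering only required keys and enum values; a full validator such as the jsonschema package also handles types and nesting:

```python
def validate_args(schema: dict, args: dict) -> list[str]:
    """Check tool arguments against a subset of JSON Schema.

    Returns a list of human-readable errors the model can act on.
    Covers required keys and enum membership only.
    """
    errors = []
    props = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in args:
            errors.append(f"missing required parameter '{key}'")
    for key, value in args.items():
        allowed = props.get(key, {}).get("enum")
        if allowed is not None and value not in allowed:
            errors.append(f"'{key}' must be one of {allowed}, got {value!r}")
    return errors
```

Feed any errors back as a tool result instead of executing; the model usually corrects itself on the next turn.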
Return Structured Data
Tool outputs should be structured and concise. Don't return raw HTML or 50KB API responses. Parse, filter, and format the result before handing it back:
def search_tickets(query: str, status_filter: str = "open") -> str:
raw_results = ticket_api.search(query, status=status_filter)
# Don't return raw API response -- extract what matters
formatted = [
{"id": t["id"], "title": t["title"], "status": t["status"]}
for t in raw_results[:5]
]
return json.dumps(formatted, indent=2)
Every unnecessary byte in a tool result costs tokens on every subsequent turn.
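A cheap safeguard is to cap result size at the tool boundary. The 4,000-character default below is an arbitrary example; tune it to your context window and typical tool output:

```python
def truncate_result(result: str, max_chars: int = 4000) -> str:
    """Cap a tool result so one verbose response can't bloat every later turn.

    The marker tells the model the output was cut, so it can narrow its
    query instead of assuming it saw everything.
    """
    if len(result) <= max_chars:
        return result
    omitted = len(result) - max_chars
    return result[:max_chars] + f"\n[...truncated, {omitted} chars omitted]"
```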
Error Handling and Retry Strategies
Tools fail. APIs time out. Models hallucinate invalid arguments. Your agent needs to handle all of this gracefully.
Structured Error Returns
Never let exceptions bubble up as raw tracebacks. Return errors as structured strings the model can reason about:
def execute_tool(name: str, args: dict) -> str:
func = TOOLS.get(name)
if func is None:
return json.dumps({"error": "unknown_tool", "message": f"No tool named '{name}'. Available: {list(TOOLS.keys())}"})
try:
result = func(**args)
return result
except TypeError as e:
return json.dumps({"error": "invalid_arguments", "message": str(e), "hint": "Check the required parameters and their types."})
except TimeoutError:
return json.dumps({"error": "timeout", "message": f"Tool '{name}' timed out after 30s. Try again or use a simpler query."})
except Exception as e:
return json.dumps({"error": "execution_error", "message": str(e)})
The model can read these error messages and self-correct. Often it will fix its own argument mistakes on the retry. The hint field is particularly useful -- it guides the model toward the right fix.
Retry with Exponential Backoff on Transient Failures
Wrap external API calls with retries at the tool level, not the agent level:
import time
import requests
from functools import wraps
def with_retries(max_retries: int = 3, backoff_base: float = 1.0):
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
for attempt in range(max_retries):
try:
return func(*args, **kwargs)
except (ConnectionError, TimeoutError) as e:
if attempt == max_retries - 1:
return json.dumps({"error": "max_retries_exceeded", "message": str(e)})
time.sleep(backoff_base * (2 ** attempt))
return json.dumps({"error": "unexpected", "message": "Retry loop exited unexpectedly"})
return wrapper
return decorator
@with_retries(max_retries=3)
def call_external_api(endpoint: str, params: dict) -> str:
response = requests.get(endpoint, params=params, timeout=10)
response.raise_for_status()
return json.dumps(response.json())
This keeps transient failures invisible to the agent. It only sees the error if retries are exhausted.
Guardrails: Preventing Runaway Agents
An unguarded agent with access to your production database is a liability. Here are the guardrails that matter.
Token Budget Enforcement
Track cumulative token usage and kill the loop when budget is exceeded:
def run_agent_with_budget(task: str, token_budget: int = 50_000, **kwargs) -> str:
client = anthropic.Anthropic()
messages = [{"role": "user", "content": task}]
total_input_tokens = 0
total_output_tokens = 0
for step in range(kwargs.get("max_steps", 10)):
response = client.messages.create(
model=kwargs.get("model", "claude-sonnet-4-20250514"),
max_tokens=kwargs.get("max_tokens_per_turn", 1024),
system="You are a helpful assistant. Use tools when needed.",
tools=TOOL_SCHEMAS,
messages=messages,
)
total_input_tokens += response.usage.input_tokens
total_output_tokens += response.usage.output_tokens
total = total_input_tokens + total_output_tokens
if total > token_budget:
return f"Agent stopped: token budget exceeded ({total}/{token_budget} tokens used)"
# ... rest of the loop (same as before)
Action Limits and Confirmation Gates
For destructive operations, require explicit confirmation:
DESTRUCTIVE_TOOLS = {"delete_record", "send_email", "execute_sql_write"}
def execute_with_guardrails(name: str, args: dict, confirm_fn=None) -> str:
if name in DESTRUCTIVE_TOOLS:
if confirm_fn is None:
return json.dumps({
"error": "confirmation_required",
"message": f"Tool '{name}' requires human confirmation. Args: {json.dumps(args)}"
})
if not confirm_fn(name, args):
return json.dumps({"error": "rejected", "message": "Human rejected the action."})
return execute_tool(name, args)
In a web application, confirm_fn could pause the agent and show a dialog. In a CLI, it could prompt y/n. The point is that the agent cannot bypass the gate.
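As a concrete example, a CLI confirm_fn can be a few lines. The input_fn parameter here is injectable purely so the prompt is testable; in production you'd omit it and get the builtin input():

```python
import json

def cli_confirm(tool_name: str, args: dict, input_fn=input) -> bool:
    """Prompt the operator before a destructive tool runs.

    Returns True only on an explicit 'y' -- anything else (including
    just hitting Enter) rejects the action.
    """
    prompt = f"Agent wants to call {tool_name} with {json.dumps(args)}. Allow? [y/N] "
    return input_fn(prompt).strip().lower() == "y"
```

Pass it as `confirm_fn=cli_confirm` to the guarded executor above.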
Rate Limiting Per Tool
Prevent the agent from hammering a single tool in a loop:
from collections import defaultdict
class ToolRateLimiter:
def __init__(self, max_calls_per_tool: int = 5):
self.counts: dict[str, int] = defaultdict(int)
self.max_calls = max_calls_per_tool
def check(self, tool_name: str) -> bool:
self.counts[tool_name] += 1
return self.counts[tool_name] <= self.max_calls
def reject_message(self, tool_name: str) -> str:
return json.dumps({
"error": "rate_limited",
"message": f"Tool '{tool_name}' has been called {self.counts[tool_name]} times (limit: {self.max_calls}). Find another approach or summarize what you've found so far."
})
This is particularly important for search tools. Without it, the agent will sometimes enter a loop of slightly different searches, each burning tokens without making progress.
Real Production Patterns
Idempotent Tools
Every tool that mutates state should be idempotent. If the agent retries a tool call (because it didn't see the result, or the framework retried), the outcome should be the same:
@tool(
name="create_or_update_ticket",
description="Create a ticket or update it if it already exists. Uses idempotency_key to prevent duplicates.",
input_schema={
"type": "object",
"properties": {
"idempotency_key": {"type": "string", "description": "Unique key for this operation (e.g. 'user-123-refund-456')"},
"title": {"type": "string"},
"body": {"type": "string"},
},
"required": ["idempotency_key", "title", "body"],
},
)
def create_or_update_ticket(idempotency_key: str, title: str, body: str) -> str:
existing = db.tickets.find_one({"idempotency_key": idempotency_key})
if existing:
db.tickets.update_one(
{"idempotency_key": idempotency_key},
{"$set": {"title": title, "body": body}}
)
return json.dumps({"status": "updated", "id": existing["id"]})
else:
ticket_id = db.tickets.insert_one({
"idempotency_key": idempotency_key,
"title": title,
"body": body
}).inserted_id
return json.dumps({"status": "created", "id": str(ticket_id)})
Audit Logging
Log every tool invocation. You will need this for debugging, compliance, and cost analysis:
import datetime
import uuid
class AuditLogger:
def __init__(self, log_store):
self.log_store = log_store
def log_tool_call(self, session_id: str, tool_name: str, args: dict, result: str, duration_ms: float):
entry = {
"id": str(uuid.uuid4()),
"session_id": session_id,
"timestamp": datetime.datetime.utcnow().isoformat(),
"tool_name": tool_name,
"arguments": args,
"result_preview": result[:500], # Don't store huge outputs
"duration_ms": duration_ms,
}
self.log_store.append(entry)
return entry
Integrate this into your execute_tool function. Every call gets logged with timing, arguments, and a truncated result. When an agent goes off the rails at 2 AM, this log is how you figure out what happened.
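One way to wire it in is a wrapper that times and logs every call, so individual call sites can't forget to log. This sketch accepts any executor with the shape execute_fn(name, args) -> str, such as the execute_tool function shown earlier:

```python
import time

def with_audit(execute_fn, audit_logger, session_id: str):
    """Wrap a tool executor so every call is timed and logged.

    audit_logger is anything with the AuditLogger.log_tool_call interface
    defined above.
    """
    def audited(name: str, args: dict) -> str:
        start = time.perf_counter()
        result = execute_fn(name, args)
        duration_ms = (time.perf_counter() - start) * 1000
        audit_logger.log_tool_call(session_id, name, args, result, duration_ms)
        return result
    return audited
```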
Cost Tracking
Token costs add up fast with multi-step agents. Track them per-session:
# Pricing per million tokens (example rates, check current pricing)
PRICING = {
"claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
"claude-opus-4-20250514": {"input": 15.00, "output": 75.00},
"gpt-4o": {"input": 2.50, "output": 10.00},
}
def calculate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
rates = PRICING.get(model, {"input": 5.0, "output": 15.0})
input_cost = (input_tokens / 1_000_000) * rates["input"]
output_cost = (output_tokens / 1_000_000) * rates["output"]
return round(input_cost + output_cost, 6)
Set hard dollar limits per agent session. A customer support agent that costs $2 per conversation is a problem. Measure this from day one.
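A hard dollar limit is then a small check after each model response. The rates below are the same example rates as the table above -- verify against current provider pricing before relying on them:

```python
# Example rates per million tokens -- check current provider pricing.
PRICING = {
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
}

def check_cost_limit(model: str, input_tokens: int, output_tokens: int,
                     max_dollars: float) -> tuple[bool, float]:
    """Return (within_budget, cost_so_far).

    Call after every model response and stop the loop as soon as
    within_budget goes False.
    """
    rates = PRICING.get(model, {"input": 5.0, "output": 15.0})
    cost = (input_tokens / 1_000_000) * rates["input"] \
         + (output_tokens / 1_000_000) * rates["output"]
    return cost <= max_dollars, round(cost, 6)
```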
When NOT to Use Agents
Agents are powerful, but they're also slow, expensive, and non-deterministic. Here's when simpler alternatives win.
Use a Single LLM Call When...
- The task is self-contained (summarization, translation, classification)
- You don't need external data or side effects
- Determinism matters more than flexibility
- Latency budget is under 2 seconds
Use Traditional Code When...
- The logic is fully known at design time
- You're doing data transformation with clear rules
- The "decision" is a lookup table or a few
ifstatements - You need guaranteed correctness (financial calculations)
Use a Pipeline (Chain) Instead of an Agent When...
- The steps are always the same, just the data changes
- You can hardcode the sequence: extract -> enrich -> format -> send
- There's no conditional branching based on intermediate results
A pipeline is a fixed sequence of LLM calls and tool executions. An agent is a dynamic loop. Pipelines are cheaper, faster, and easier to debug. Only reach for agents when you genuinely need the model to decide what to do next based on what it just learned.
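In code, the difference is stark: a pipeline is plain function composition with no model-driven control flow. The step names in the comment are hypothetical:

```python
def run_pipeline(data, steps):
    """Run a fixed sequence of steps. Each step is any callable -- an LLM
    call, a tool, or plain code -- but the order never changes at runtime."""
    for step in steps:
        data = step(data)
    return data

# Hypothetical flow:
# result = run_pipeline(raw_email, [extract_fields, enrich_with_crm, format_report])
```

No step budget, no stop-reason handling, no rate limiter: the control flow is yours, so most of the guardrails above simply aren't needed.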
The Decision Framework
Ask yourself: "Does the next step depend on the result of the previous step in ways I can't predict at design time?" If yes, you need an agent. If no, you probably don't.
A common anti-pattern is building an agent to do something that's really a three-step pipeline wearing a trench coat. The agent technically works, but it's 5x slower, 10x more expensive, and fails in unpredictable ways compared to the hardcoded version.
Putting It All Together
Here's the production-ready version combining all the patterns above -- budget enforcement, rate limiting, audit logging, and guardrails -- in a single coherent loop:
def run_production_agent(
task: str,
session_id: str | None = None,
max_steps: int = 10,
token_budget: int = 50_000,
max_calls_per_tool: int = 5,
confirm_fn: Callable | None = None,
model: str = "claude-sonnet-4-20250514",
) -> dict:
session_id = session_id or str(uuid.uuid4())
client = anthropic.Anthropic()
messages = [{"role": "user", "content": task}]
rate_limiter = ToolRateLimiter(max_calls_per_tool)
audit = AuditLogger(log_store=[])
total_input_tokens = 0
total_output_tokens = 0
for step in range(max_steps):
response = client.messages.create(
model=model, max_tokens=1024,
system="You are a helpful assistant. Use tools when needed. Be concise.",
tools=TOOL_SCHEMAS, messages=messages,
)
total_input_tokens += response.usage.input_tokens
total_output_tokens += response.usage.output_tokens
if total_input_tokens + total_output_tokens > token_budget:
return {"status": "budget_exceeded", "session_id": session_id,
"cost": calculate_cost(model, total_input_tokens, total_output_tokens)}
messages.append({"role": "assistant", "content": response.content})
if response.stop_reason == "end_turn":
text = "\n".join(b.text for b in response.content if b.type == "text")
return {
"status": "complete", "result": text, "session_id": session_id,
"steps": step + 1,
"tokens": {"input": total_input_tokens, "output": total_output_tokens},
"cost": calculate_cost(model, total_input_tokens, total_output_tokens),
"audit_log": audit.log_store,
}
tool_results = []
for block in response.content:
if block.type != "tool_use":
continue
if not rate_limiter.check(block.name):
result = rate_limiter.reject_message(block.name)
            elif block.name in DESTRUCTIVE_TOOLS and (
                confirm_fn is None or not confirm_fn(block.name, block.input)
            ):
                result = json.dumps({"error": "rejected", "message": "Action not confirmed by a human."})
else:
start = time.time()
result = execute_tool(block.name, block.input)
duration = (time.time() - start) * 1000
audit.log_tool_call(session_id, block.name, block.input, result, duration)
tool_results.append({"type": "tool_result", "tool_use_id": block.id, "content": result})
messages.append({"role": "user", "content": tool_results})
return {"status": "max_steps_exceeded", "session_id": session_id, "steps": max_steps}
This returns a structured result with cost tracking, audit logs, and clear status. You can store this in a database, alert on high-cost sessions, and replay the audit log for debugging.
Conclusion
Building AI agents that work in production comes down to engineering discipline, not framework magic. The core loop is simple: let the model plan, execute tools, observe results, and iterate. Everything else is guardrails and operational hygiene.
The patterns that matter most:
- Keep the tool registry simple -- decorators, JSON schemas, a dictionary. You don't need a framework for this.
- Return structured errors from tools so the model can self-correct.
- Enforce hard limits on tokens, steps, and per-tool call counts. Runaway agents are not a theoretical risk; they're a Tuesday.
- Make tools idempotent because retries are inevitable.
- Log everything -- tool calls, arguments, results, timing, costs. You will need this data.
- Ask whether you need an agent at all. A pipeline is almost always better if the steps are predictable.
The best agent is the simplest one that gets the job done. Start with a single tool and the minimal loop shown here. Add complexity only when production data tells you to. The 100-line agent in this article isn't a toy -- it's a foundation you can build real systems on.
If this was helpful, you can support my work at ko-fi.com/nopkt