Most tutorials about building AI agents focus on the happy path. The agent calls a tool, gets a result, continues. Clean. Simple. Nothing like what you actually deal with in production.
This guide is different. I've been building Claude-powered agents for my ecommerce operation for eight months. Some of them run dozens of times a day. Here's what actually works - including the parts that are messy.
What "Agent" Actually Means Here
Before we touch any code, let's align on terminology because this word is overloaded.
An agent, in the context of the Claude SDK, is a loop:
- Give Claude a task and tools
- Claude decides whether to use a tool
- If yes: execute the tool, feed the result back to Claude
- Repeat until Claude says it's done
That's it. The magic is in how you design the tools, structure the context, and handle the failure cases.
The Minimal Agent
Here's the smallest useful agent I can show you:
```python
import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "get_product_inventory",
        "description": "Get current inventory count for a product SKU",
        "input_schema": {
            "type": "object",
            "properties": {
                "sku": {
                    "type": "string",
                    "description": "The product SKU to check"
                }
            },
            "required": ["sku"]
        }
    }
]
```
```python
def run_agent(task: str):
    messages = [{"role": "user", "content": task}]

    while True:
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            tools=tools,
            messages=messages
        )

        # Agent is done
        if response.stop_reason == "end_turn":
            return response.content[0].text

        # Agent wants to use a tool
        if response.stop_reason == "tool_use":
            tool_use = next(b for b in response.content if b.type == "tool_use")

            # Execute the tool
            result = execute_tool(tool_use.name, tool_use.input)

            # Add the exchange to message history
            messages.append({"role": "assistant", "content": response.content})
            messages.append({
                "role": "user",
                "content": [{
                    "type": "tool_result",
                    "tool_use_id": tool_use.id,
                    "content": str(result)
                }]
            })
```
This is the core loop that every Claude agent is built on. The rest is complexity management.
The loop runs until `stop_reason == "end_turn"`. Everything else is about what happens inside the loop.
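The loop above calls an `execute_tool` helper that isn't shown. A minimal stand-in lets you smoke-test the loop before wiring up real data sources; the in-memory `INVENTORY` dict here is hypothetical, a placeholder for whatever backend your store actually uses:

```python
# Minimal stand-in for execute_tool so the loop above can run end to end.
# INVENTORY is a hypothetical in-memory table; a real version would hit
# your store's database or API.
INVENTORY = {
    "SKU-123": {"sku": "SKU-123", "quantity": 47, "warehouse": "main"},
}

def execute_tool(name: str, inputs: dict) -> dict:
    if name == "get_product_inventory":
        record = INVENTORY.get(inputs["sku"])
        if record is None:
            return {"error": True, "message": f"Unknown SKU: {inputs['sku']}"}
        return record
    return {"error": True, "message": f"Unknown tool: {name}"}
```

Note that even the stub returns structured errors instead of raising; that choice matters later when we talk about failure handling.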
Designing Tools That Actually Work
The quality of your agent is almost entirely determined by your tool design. Bad tools make even great models perform poorly.
Rule 1: One tool, one responsibility.
I've seen developers build a single manage_inventory tool that handles checking, updating, and reporting inventory. This confuses the model and produces unpredictable behavior.
Instead: get_inventory, update_inventory, generate_inventory_report. Three tools with crystal-clear purposes.
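Here's a sketch of what that split might look like as tool schemas. The field names and constraints are illustrative, not a real API:

```python
# Illustrative split: three single-purpose tool schemas instead of one
# catch-all manage_inventory tool. Schemas are examples, not a real API.
inventory_tools = [
    {
        "name": "get_inventory",
        "description": "Get the current stock count for a product SKU.",
        "input_schema": {
            "type": "object",
            "properties": {"sku": {"type": "string"}},
            "required": ["sku"],
        },
    },
    {
        "name": "update_inventory",
        "description": "Set the stock count for a product SKU to a new value.",
        "input_schema": {
            "type": "object",
            "properties": {
                "sku": {"type": "string"},
                "quantity": {"type": "integer", "minimum": 0},
            },
            "required": ["sku", "quantity"],
        },
    },
    {
        "name": "generate_inventory_report",
        "description": "Generate a summary report of all inventory levels.",
        "input_schema": {"type": "object", "properties": {}},
    },
]
```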
Rule 2: Descriptions are prompts.
Your tool description is not documentation. It's instruction. Write it like you're telling a smart colleague exactly when and how to use this function.
Bad:
"description": "Gets order data"
Good:
"description": "Retrieves detailed order information including line items, customer data, shipping status, and fulfillment history. Use this when you need to analyze a specific order or when a customer asks about their order status. Requires a valid order ID."
Rule 3: Return structured data, not prose.
Your tool results feed back into the model's context. Structured data (JSON) is more reliably understood than natural language summaries.
```python
# Bad tool return
return f"There are 47 units of SKU-123 in stock, last updated Tuesday"

# Good tool return
return {
    "sku": "SKU-123",
    "quantity": 47,
    "last_updated": "2026-04-22T14:30:00Z",
    "warehouse": "main"
}
```
Prompt Caching: The Performance Multiplier
If your agent runs repeatedly with similar system prompts (and it will), prompt caching will cut your costs significantly and improve response times.
```python
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": """You are an inventory management agent for an ecommerce store.

Your responsibilities:
- Check inventory levels when asked
- Flag items that need reordering (below 10 units)
- Generate reorder recommendations with quantities
- Track inventory changes over time

Always verify data before making recommendations. Be conservative with reorder quantities.""",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    tools=tools,
    messages=messages
)
```
About 60% of my agent API costs disappeared after I added prompt caching. The system prompt gets cached after the first call and reused across the entire conversation.
The `cache_control: {"type": "ephemeral"}` marker tells the API to cache this content. The cache persists for 5 minutes, which covers most agent loops. For longer operations, you can cache at multiple breakpoints in the conversation.
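Setting those breakpoints can be wrapped in a small helper. This is a sketch of my understanding: a cache_control marker on the last tool definition caches the prefix up to and including the tool list, and a second marker caches the system prompt on top of it. The helper name and return shape are mine; check the prompt caching docs for current breakpoint limits:

```python
def add_cache_breakpoints(tools: list, system_text: str):
    """Mark the last tool definition and the system prompt for caching.

    A cache_control marker on the last tool caches the whole prefix up to
    that point; a second marker caches the system prompt as well.
    (Sketch only - verify breakpoint limits against the caching docs.)
    """
    cached_tools = [dict(t) for t in tools]  # shallow copies, originals untouched
    cached_tools[-1]["cache_control"] = {"type": "ephemeral"}
    system = [{
        "type": "text",
        "text": system_text,
        "cache_control": {"type": "ephemeral"},
    }]
    return cached_tools, system
```

Then pass the results straight through: `client.messages.create(..., tools=cached_tools, system=system, messages=messages)`.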
Handling Failure Gracefully
Production agents fail. Here's how to handle it without your entire workflow breaking.
Tool execution errors:
```python
def execute_tool(name: str, inputs: dict) -> dict:
    try:
        if name == "get_product_inventory":
            return get_inventory(inputs["sku"])
        # ... other tools
        return {"error": True, "message": f"Unknown tool: {name}"}
    except Exception as e:
        # Return error as structured data so Claude can decide what to do
        return {
            "error": True,
            "error_type": type(e).__name__,
            "message": str(e),
            "recoverable": isinstance(e, (TimeoutError, ConnectionError))
        }
```
When you return structured error data instead of raising an exception, Claude can often recover - retrying the operation, trying an alternative approach, or explaining to the user what happened.
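You can also retry transient failures in the tool layer before Claude ever sees them, so only persistent errors consume context. A sketch that builds on the recoverable flag above; the function name and backoff values are my own choices:

```python
import time

def execute_tool_with_retry(name: str, inputs: dict, execute,
                            attempts: int = 3, delay: float = 0.5) -> dict:
    """Retry transient tool failures before surfacing them to Claude.

    'execute' is the underlying execute_tool function. Only errors flagged
    as recoverable are retried, with exponential backoff between attempts.
    """
    result = None
    for attempt in range(attempts):
        result = execute(name, inputs)
        recoverable = (isinstance(result, dict)
                       and result.get("error")
                       and result.get("recoverable"))
        if not recoverable:
            return result
        time.sleep(delay * (2 ** attempt))
    return result  # still failing after all attempts; let Claude decide
```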
Infinite loop protection:
```python
def run_agent(task: str, max_iterations: int = 10):
    messages = [{"role": "user", "content": task}]
    iterations = 0

    while iterations < max_iterations:
        iterations += 1
        response = client.messages.create(...)

        if response.stop_reason == "end_turn":
            return response.content[0].text

        # ... handle tool use

    return "Agent reached maximum iterations without completing the task."
```
Set max_iterations based on your task complexity. Simple lookups: 5. Complex multi-step operations: 15-20.
Multi-Agent Patterns
Single agents are powerful. Multiple agents working together can handle complexity that would overwhelm any single context window.
The pattern I use most: orchestrator + specialists.
```python
# Orchestrator decides what needs to happen
orchestrator_result = run_agent(
    task="Analyze our inventory situation and create a reorder plan",
    tools=[route_to_inventory_agent, route_to_pricing_agent, route_to_supplier_agent]
)

# Specialists handle specific domains
def route_to_inventory_agent(query: str) -> dict:
    return run_specialized_agent(
        system="You are an inventory specialist...",
        tools=[get_inventory, update_inventory, get_sales_velocity],
        task=query
    )
```
The orchestrator never touches raw data. It coordinates specialists who do. This keeps each agent's context focused and its tool set manageable.
In practice, a single agent with 30 tools is harder to debug and less reliable than three agents with 10 tools each.
Streaming for Long Operations
For operations that take more than a few seconds, streaming makes the experience dramatically better.
```python
with client.messages.stream(
    model="claude-sonnet-4-6",
    max_tokens=2048,
    tools=tools,
    messages=messages
) as stream:
    # text_stream yields just the text deltas, skipping other event types
    for text in stream.text_stream:
        print(text, end="", flush=True)
```
This is particularly valuable for agents that generate reports or analysis as their final output. Users see progress instead of waiting for a spinner.
The Observability Problem
The hardest part of running agents in production isn't building them. It's understanding what they did when something goes wrong.
My solution: log every tool call and result.
```python
import json
from datetime import datetime

def execute_tool_with_logging(name: str, inputs: dict) -> dict:
    start_time = datetime.now()
    result = execute_tool(name, inputs)
    duration_ms = (datetime.now() - start_time).total_seconds() * 1000

    log_entry = {
        "timestamp": start_time.isoformat(),
        "tool": name,
        "inputs": inputs,
        "result": result,
        "duration_ms": duration_ms
    }

    # Write to your logging system
    append_to_agent_log(log_entry)
    return result
```
This log lets you reconstruct exactly what an agent did, in what order, with what data. When a bug appears (and it will), you won't be debugging blind.
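Once the log exists, reconstructing a run is just replaying it. A small helper along these lines turns the entries into a readable trace; the output format is illustrative, assuming log entries are the dicts built above:

```python
def summarize_agent_run(log_entries: list) -> str:
    """Render a tool-call log as a trace: one line per call, flagging
    errors and slow calls so they stand out while debugging."""
    lines = []
    for i, entry in enumerate(log_entries, 1):
        result = entry.get("result") or {}
        flag = ""
        if isinstance(result, dict) and result.get("error"):
            flag = " [ERROR]"
        if entry.get("duration_ms", 0) > 1000:
            flag += " [SLOW]"
        lines.append(f"{i}. {entry['tool']}({entry['inputs']}) "
                     f"{entry['duration_ms']:.0f}ms{flag}")
    return "\n".join(lines)
```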
Starting Simple, Scaling Up
Here's the progression I'd recommend:
Week 1: Build one agent with two or three tools for a task you currently do manually. Don't optimize. Just get it working.
Week 2: Add error handling and logging. Run it in production but monitor it closely.
Week 3: Add prompt caching. Measure the cost and latency improvement.
Month 2: Extract specialists for different domains. Build the orchestrator pattern.
The agents I run today took about six months to reach their current form. They didn't start that way. They started with three tools and grew as I understood what they needed to do.
Resources
The tools I've built for managing Claude agents are available at mynextools.com - including workflow templates and a monitoring dashboard for tracking agent runs.
The full Anthropic SDK documentation is thorough and worth reading: the tool use guide in particular covers edge cases I didn't have space for here.
What are you trying to automate with Claude agents? Drop it in the comments - I read every one and try to cover the most common use cases in future posts.
If you found this useful, follow me here. I publish a new deep-dive every week.