Every production AI agent, regardless of framework or language, shares the same structure. It has a brain that reasons, memory that persists state, tools that act on the world, and a planning mechanism that decides what to do next.
These are not optional components that some agents have and others skip. They are the defining structure of what makes something an agent rather than a chatbot. Understanding all four before you write a line of agent code determines whether your system ends up fast or slow, expensive or cheap, reliable or brittle.
Frameworks come and go. This structure does not.
The Agent Loop
Before examining each component, look at the structure that connects them:
Perceive -> Reason -> Act -> Observe -> Repeat
The agent receives input from its environment. It reasons about what to do. It calls a tool. It observes the result. It reasons again with the new information. It repeats this until it achieves its goal or hits a stop condition.
This maps onto the OODA loop from military strategy (Observe, Orient, Decide, Act), developed to describe how fighter pilots make decisions faster than their opponents. The principle is the same: the agent that cycles through the loop faster and more accurately than the problem requires is the agent that succeeds.
Here is the minimal agent loop in Python:
from anthropic import Anthropic

client = Anthropic()
tools = []     # tool definitions go here
messages = []

def agent_loop(user_input: str, max_iterations: int = 20) -> str:
    messages.append({"role": "user", "content": user_input})
    for iteration in range(max_iterations):
        response = client.messages.create(
            model="claude-opus-4.6",
            max_tokens=4096,
            system="You are a helpful agent. Use tools to accomplish tasks.",
            tools=tools,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason == "end_turn":
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
        if response.stop_reason == "tool_use":
            tool_results = execute_tools(response.content)  # runs each requested tool
            messages.append({"role": "user", "content": tool_results})
            continue
        break
    return "Agent stopped without completing task"
This is the foundation. Everything else is about what goes into each component.
Component 1: The Brain (LLM)
The LLM is the reasoning engine. It receives everything in the conversation window, decides what to do next, and either produces a response or requests a tool call.
Model selection is not about benchmark scores. It is about matching model capabilities to task requirements along specific dimensions.
Context window determines how much state the agent can hold at once. If your agent needs to reason over large amounts of accumulated information, context window size often determines whether it can do the task at all.
Tool-calling reliability measures how accurately the model fills tool parameter schemas, handles ambiguous tool selection, and recovers from errors. A model that calls tools correctly 95% of the time sounds reliable until you realize that across fifty agent steps, you expect two or three misfires. Production needs closer to 99%.
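To see why the gap between 95% and 99% matters, here is a back-of-the-envelope check, assuming each tool call fails independently (the numbers below come from the claim above, not from any benchmark):

```python
# How per-call reliability compounds over a multi-step agent run.
def run_success_probability(per_call_reliability: float, steps: int) -> float:
    """Probability that every tool call in a run is correct."""
    return per_call_reliability ** steps

def expected_misfires(per_call_reliability: float, steps: int) -> float:
    """Expected number of incorrect tool calls across a run."""
    return (1 - per_call_reliability) * steps

# At 95% per-call reliability over 50 steps: 2.5 expected misfires,
# and the chance of a flawless run falls below 8%.
# At 99%, expected misfires drop to 0.5 and a clean run is more likely than not.
```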
Cost and latency determine whether your agent is economically viable. At $15 per million output tokens for flagship models, an agent producing 50,000 tokens of reasoning per task costs $0.75 per run in model fees alone. That number changes your business model.
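The arithmetic behind that $0.75 figure is worth keeping as a helper, since it is the number you multiply by run volume when checking viability:

```python
# Cost check for the $0.75-per-run figure above, using the $15/M
# output-token flagship rate quoted in the text.
def run_cost_usd(output_tokens: int, price_per_million_usd: float) -> float:
    return output_tokens / 1_000_000 * price_per_million_usd

cost = run_cost_usd(output_tokens=50_000, price_per_million_usd=15.0)
print(f"${cost:.2f} per run")  # $0.75 per run
# At 10,000 runs per month, that is $7,500/month in output tokens alone.
```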
Model Routing: The Highest-Leverage Cost Optimization
The single biggest lever for reducing agent costs without sacrificing quality is routing different steps to different models based on what each step actually requires.
Expensive models are for steps that need deep reasoning: planning a complex approach, evaluating conflicting information, making high-stakes decisions. Cheap, fast models handle mechanical steps: simple classification, format conversion, data extraction.
# Two clients, two price points
expensive_client = Anthropic()  # Claude Opus 4.6
# cheap_client: Gemini Flash, GPT-4o-mini, Claude Haiku -- pick your fast, cheap model

def plan_task(user_goal: str) -> dict:
    """High-level planning: use the expensive model."""
    response = expensive_client.messages.create(
        model="claude-opus-4.6",
        max_tokens=2048,
        system='Break down the goal into steps. Return JSON: {"steps": [...]}',
        messages=[{"role": "user", "content": user_goal}],
    )
    return parse_json(response.content[0].text)  # parse_json: your JSON-extraction helper

def classify_result(tool_output: str, expected_format: str) -> str:
    """Simple extraction: use the cheap model."""
    response = cheap_client.generate_content(
        f"Extract {expected_format} from this text. Return only the value.\n\n{tool_output}"
    )
    return response.text.strip()
A well-implemented routing strategy typically reduces model API costs by 60-80% with minimal quality impact. The majority of steps in a typical agent task are mechanical operations that do not require deep reasoning. Stop paying flagship model prices for them.
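The routing decision itself can be as simple as a lookup table keyed on step type. This is a sketch, not the post's implementation; the step-type taxonomy and the cheap-model identifier are illustrative placeholders:

```python
# Minimal routing sketch: map step types to model tiers.
# "claude-haiku-4.5" is a placeholder id for your cheap model of choice.
ROUTING_TABLE = {
    "plan": "claude-opus-4.6",       # deep reasoning: expensive model
    "evaluate": "claude-opus-4.6",
    "classify": "claude-haiku-4.5",  # mechanical: cheap model
    "extract": "claude-haiku-4.5",
    "format": "claude-haiku-4.5",
}

def route_model(step_type: str) -> str:
    """Pick a model for a step. Unknown step types default to the
    expensive model: over-paying fails safer than under-reasoning."""
    return ROUTING_TABLE.get(step_type, "claude-opus-4.6")
```

The default matters: when the router cannot classify a step, falling back to the flagship model costs a little more but avoids silently degrading quality.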
Component 2: Memory
Memory is how an agent maintains state. Without it, every invocation starts from scratch. The agent does not know what it already tried. It does not know the user's database is on port 5434. It does not know the file format it is parsing has a known bug.
Memory operates at three levels.
Short-term memory is the conversation context -- everything in the current session. It lives in the context window and is naturally managed by the framework. When you exceed the model's context limit, old context must be discarded or compressed.
Long-term memory is persistent state that survives between sessions. This requires deliberate engineering: writing important information to a database, file system, or vector store, and retrieving it at the start of future sessions. Without it, your agent starts every session knowing nothing about previous interactions.
Working memory is the current task state: which step is the agent on, what has it already done, what intermediate results has it collected. For multi-step tasks, this is the difference between an agent that can recover from an interrupted run and one that must start over.
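As a concrete illustration of the long-term layer, here is a minimal sketch that persists facts to a JSON file between sessions. The file path and key-value schema are assumptions for the example; a production system would typically use a database or vector store, as noted above:

```python
# Minimal long-term memory: a key-value store persisted to disk,
# loaded at session start to seed the system prompt.
import json
from pathlib import Path

MEMORY_PATH = Path("agent_memory.json")

def remember(key: str, value: str) -> None:
    """Persist a fact so future sessions can retrieve it."""
    store = json.loads(MEMORY_PATH.read_text()) if MEMORY_PATH.exists() else {}
    store[key] = value
    MEMORY_PATH.write_text(json.dumps(store, indent=2))

def recall_all() -> dict:
    """Load all persisted facts at session start."""
    return json.loads(MEMORY_PATH.read_text()) if MEMORY_PATH.exists() else {}

# remember("db_port", "5434")
# Next session, recall_all() returns {"db_port": "5434"} -- the agent no
# longer has to rediscover which port the user's database is on.
```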
Memory is complex enough to deserve its own post (which it gets -- see part 3 of this series). For now, the key point: memory is not optional. It is one of the four core components, and your choice of memory architecture has as large an impact on agent capability as your choice of model.
Component 3: Tools
Tools are how agents take actions in the world. A tool is a function with a name, a description, and a JSON schema for its parameters. The agent calls a tool by specifying name and parameters; the runtime executes the function and returns the result.
search_tool = {
    "name": "web_search",
    "description": """Search the web for current information. Use this when you
    need facts that may not be in your training data, or when you need recent
    information about events, prices, or system status.""",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "The search query. Be specific and use keywords."
            }
        },
        "required": ["query"]
    }
}

def web_search(query: str) -> list[dict]:
    results = search_api.search(query)
    return [{"title": r.title, "url": r.url, "snippet": r.snippet} for r in results]
Notice the detail in the tool description. The description is not documentation for humans -- it is the primary mechanism by which the model decides when and how to call the tool. A well-written description that explains what the tool does, when to use it, and what input to expect will be called correctly far more often than a terse one.
The tool-calling protocol follows the same pattern across all frameworks:
- At initialization, the agent receives a list of available tools with their schemas.
- During the loop, when the model needs a tool, it outputs a structured call (tool name + parameters).
- The runtime validates parameters and executes the function.
- The result is appended to the conversation.
- The agent continues reasoning with the new information.
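The validation step deserves a sketch of its own, since skipping it is a common source of cryptic runtime errors. This hand-rolled checker (an illustration, not a replacement for a full JSON Schema validator) assumes schemas shaped like the `web_search` example above:

```python
# Validate a tool call's parameters against its input_schema before
# executing -- catch missing/extra fields and type mismatches early.
def validate_tool_call(schema: dict, params: dict) -> list[str]:
    """Return a list of validation errors; empty list means the call is valid."""
    errors = []
    for field in schema.get("required", []):
        if field not in params:
            errors.append(f"missing required parameter: {field}")
    type_map = {"string": str, "number": (int, float), "integer": int,
                "boolean": bool, "object": dict, "array": list}
    for name, value in params.items():
        spec = schema.get("properties", {}).get(name)
        if spec is None:
            errors.append(f"unexpected parameter: {name}")
        elif not isinstance(value, type_map.get(spec.get("type"), object)):
            errors.append(f"parameter {name} should be {spec['type']}")
    return errors
```

When validation fails, return the error list to the model as the tool result instead of raising: the model can usually correct its own call on the next iteration.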
Component 4: Planning
Planning is how an agent decides what to do next. It is the least tangible of the four components, but it separates agents that stumble toward a goal from agents that execute efficiently.
The ReAct Pattern
The dominant planning pattern in production agents is ReAct (Reasoning + Acting). Before taking any action, the agent explicitly reasons about what it is trying to do and why this specific action serves that goal.
In practice, ReAct looks like internal monologue before each tool call:
Thought: I need to find the current API pricing. I'll search rather than
use my training data, since prices change frequently.
Action: web_search(query="Anthropic Claude API pricing 2026")
Observation: [search results]
Thought: Results show Sonnet costs $3 per million input tokens. Now I
need to calculate the cost for the user's use case of 10M tokens/month.
Action: calculate(expression="10 * 3")
...
This explicit reasoning step serves two purposes. It makes the agent's decision-making auditable -- you can read the trace and understand exactly why the agent did what it did. And it improves decision quality. Models that reason explicitly before acting make fewer tool-calling mistakes than models that jump directly from observation to action.
Stop Conditions: The Most Underappreciated Problem
Without explicit stop conditions, agents fall into two failure modes. The first is infinite loops: the agent keeps taking actions without recognizing it has achieved its goal. The second is premature termination: the agent stops before completing the task because it incorrectly evaluates intermediate results as sufficient.
Stop conditions come in several forms:
- Iteration-based: the framework stops after N iterations regardless of state
- Tool-based: the agent calls a dedicated complete tool to signal it is done
- Goal-evaluation: a secondary model evaluates whether the goal was achieved
- Semantic: the agent includes a specific phrase in its response when done
In practice, always use an iteration limit as a safety net to prevent runaway API costs. But do not rely on it as your primary stop condition. The primary condition should be semantic: explicitly prompt the agent to recognize and declare completion when the goal is achieved.
A Minimal Production Agent
Here is a complete, working agent that combines all four components. Stripped to essentials to show the core structure:
import json
from anthropic import Anthropic

client = Anthropic()

# TOOLS
def read_file(path: str) -> str:
    try:
        with open(path, "r") as f:
            return f.read()
    except FileNotFoundError:
        return f"Error: file not found at {path}"

def write_file(path: str, content: str) -> str:
    with open(path, "w") as f:
        f.write(content)
    return f"Wrote {len(content)} characters to {path}"

TOOLS = [
    {
        "name": "read_file",
        "description": "Read the contents of a file.",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"]
        }
    },
    {
        "name": "write_file",
        "description": "Write content to a file. Creates the file if it doesn't exist.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "content": {"type": "string"}
            },
            "required": ["path", "content"]
        }
    }
]

TOOL_MAP = {"read_file": read_file, "write_file": write_file}

def run_agent(task: str, max_steps: int = 20) -> str:
    messages = [{"role": "user", "content": task}]
    system = """You are a file agent. Use available tools to accomplish the task.
Think step by step. When done, respond with a summary of what you did."""

    for step in range(max_steps):
        response = client.messages.create(
            model="claude-sonnet-4.6",  # Sonnet for execution steps
            max_tokens=2048,
            system=system,
            tools=TOOLS,
            messages=messages,
        )

        # MEMORY: working memory accumulates in messages
        messages.append({"role": "assistant", "content": response.content})

        # PLANNING: stop condition -- agent declared completion
        if response.stop_reason == "end_turn":
            for block in response.content:
                if hasattr(block, "text"):
                    return block.text
            return "Task complete"

        # TOOLS: execute what the agent asked for (ACT + OBSERVE)
        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = TOOL_MAP.get(block.name, lambda **k: "Unknown tool")(**block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": str(result)
                    })
            messages.append({"role": "user", "content": tool_results})

    return f"Hit maximum steps ({max_steps})"
This is approximately fifty lines. It handles the full loop: reasoning, tool execution, observation, and completion detection. It is missing production features -- error recovery, cost tracking, observability -- but the architecture is correct. Every production agent you build is an elaboration of this pattern.
The Complexity Is in the Details
The four-component architecture is simple to describe and hard to get right.
The brain requires careful model selection and routing. Memory requires decisions about what to persist and how to retrieve it efficiently. Tools require careful schema design and security thinking. Planning requires explicit stop conditions and recovery mechanisms for the inevitable failure cases.
When an agent fails in production -- and it will -- the diagnosis almost always points back to one of these four components being misconfigured, underpowered, or missing entirely.
The agent that performs well in production is not the one with the most sophisticated framework. It is the one where someone thought carefully about each of these four components and made deliberate decisions about all of them.
This post is adapted from Production AI Agents: Build, Deploy, and Monetize Autonomous Systems, available on Amazon Kindle. The book goes deeper with 12 chapters of real code, battle-tested patterns, and a complete hands-on tutorial.
I build production AI systems. More at astraedus.dev.