Everyone is building AI agents in 2026. Most of them are terrible.
I have spent the last year building, testing, and breaking AI agents across dozens of use cases — from research assistants to code generators to automated customer support pipelines. Along the way, I watched countless projects fail spectacularly, including several of my own.
The pattern is always the same: a developer gets excited about a demo, spins up a quick prototype, shows it to stakeholders, and then spends six months trying to make it reliable enough for production. The demo-to-production gap for AI agents is wider than almost any other technology I have worked with.
This article is the guide I wish I had when I started. A practical, no-hype framework for building AI agents that actually work — not just in demos, but in the real world where users do unexpected things and uptime matters.
Why Most AI Agents Fail
Before we build anything, let us understand the failure modes. After analyzing dozens of failed agent projects (mine and others), I have identified four recurring patterns.
Failure Mode 1: Over-Engineering from Day One
The most common mistake is starting with a complex multi-agent orchestration system when a single well-prompted LLM call would do the job. I see teams building elaborate frameworks with 15 different agent types before they have even validated that the core task works.
The fix: Start with the simplest possible implementation. A single LLM call with good instructions. Only add complexity when you can prove it is necessary.
Failure Mode 2: Poor Prompt Design
Many developers treat prompts as an afterthought — a quick instruction tacked onto the beginning of a context window. But prompt design is the single most important factor in agent reliability. A well-designed prompt with a mediocre model will outperform a poorly designed prompt with a frontier model almost every time.
Failure Mode 3: Wrong Architecture for the Task
Not every task needs an agent. If you can solve the problem with a simple chain of LLM calls (input → process → output), do that. Agents add autonomy, which adds unpredictability. That unpredictability is only worth it when the task genuinely requires adaptive decision-making.
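To make the chain-versus-agent distinction concrete, here is a minimal sketch of a fixed chain: each step runs exactly once, in order, with no branching or tool choice. The function names and the `llm` callable are illustrative stand-ins, not part of any real library:

```python
def extract_question(raw_input: str) -> str:
    """Normalize the user's input into a clean question."""
    return raw_input.strip().rstrip("?") + "?"

def build_prompt(question: str) -> str:
    """Wrap the question in fixed instructions."""
    return f"Answer concisely: {question}"

def format_output(answer: str) -> str:
    """Apply deterministic post-processing to the model's answer."""
    return f"Answer: {answer}"

def run_chain(raw_input: str, llm) -> str:
    """input → process → output: a chain, not an agent.

    `llm` is a stand-in for your actual model call. The control flow
    is fully determined before the first call — that is what makes
    chains predictable and cheap to test.
    """
    question = extract_question(raw_input)
    prompt = build_prompt(question)
    answer = llm(prompt)  # the single LLM call
    return format_output(answer)
```

If you find yourself wanting the model to decide *which* step comes next, that is the signal you have left chain territory and need an agent.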
Failure Mode 4: No Evaluation Framework
If you cannot measure whether your agent is working, you cannot improve it. Most teams skip evaluation entirely and rely on vibes — "it seems to work pretty well." That is how you ship agents that fail 30% of the time and nobody notices until users start complaining.
The PTME Framework: Plan, Tools, Memory, Evaluation
Here is the framework I use for every agent project. It is not fancy, but it works.
Step 1: Plan — Define the Agent's Decision Space
Before writing any code, answer these questions:
- What decisions does the agent need to make? List every point where the agent chooses between actions.
- What information does it need to make each decision? This determines your context strategy.
- What are the failure modes for each decision? This shapes your error handling.
- What should happen when the agent is uncertain? This determines your fallback strategy.
Write this down. Literally. I keep a one-page "Agent Decision Map" for every agent I build.
```text
Agent: Research Assistant

Decisions:
1. Which sources to search → Needs: user query, available tools
2. Whether results are relevant → Needs: user query, search results
3. When to stop searching → Needs: result quality threshold, max iterations
4. How to synthesize findings → Needs: all collected results, output format

Failure modes:
- No relevant results found → Ask user to refine query
- Contradictory sources → Present both with confidence scores
- Token limit approaching → Summarize and present partial results
```
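The one-page map can also live in code, so your fallback strategy is data the agent consults rather than a document nobody reads. A minimal sketch — every name here is illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Decision:
    name: str
    needs: list[str]  # information required to make this decision

@dataclass
class DecisionMap:
    agent: str
    decisions: list[Decision]
    fallbacks: dict[str, str] = field(default_factory=dict)  # failure → action

    def fallback_for(self, failure: str) -> str:
        # Default to escalation for failure modes you never planned for
        return self.fallbacks.get(failure, "escalate to human")

research_map = DecisionMap(
    agent="Research Assistant",
    decisions=[
        Decision("which_sources", needs=["user query", "available tools"]),
        Decision("when_to_stop", needs=["quality threshold", "max iterations"]),
    ],
    fallbacks={"no_relevant_results": "ask user to refine query"},
)
```

The `fallback_for` default is the important part: an unmapped failure mode should degrade to human escalation, never to silent improvisation.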
Step 2: Tools — Give the Agent Capabilities
Tools are functions your agent can call to interact with the world. The quality of your tools determines the ceiling of your agent's capabilities.
Here are the principles I follow for tool design:
Keep tools atomic. Each tool should do one thing well. A search_web tool should search the web, not search the web and summarize the results.
Make tool descriptions crystal clear. The LLM reads your tool descriptions to decide when to use each tool. Ambiguous descriptions lead to wrong tool choices.
Return structured data. Tools should return JSON or structured objects, not free-form text. This makes it easier for the agent to process results.
Here is a practical example in Python:
```python
import json
from datetime import datetime

import httpx


def create_tool(name: str, description: str, parameters: dict, handler) -> dict:
    """Create a tool definition for the agent."""
    return {
        "name": name,
        "description": description,
        "parameters": parameters,
        "handler": handler,
    }


async def search_web(query: str, max_results: int = 5) -> dict:
    """Search the web and return structured results."""
    async with httpx.AsyncClient() as client:
        response = await client.get(
            "https://api.search-provider.com/search",  # placeholder endpoint
            params={"q": query, "count": max_results},
        )
    results = response.json()
    return {
        "query": query,
        "results": [
            {
                "title": r["title"],
                "url": r["url"],
                "snippet": r["snippet"],
            }
            for r in results.get("items", [])
        ],
        "total_count": results.get("total", 0),
    }


async def read_webpage(url: str) -> dict:
    """Fetch and extract content from a webpage."""
    async with httpx.AsyncClient() as client:
        response = await client.get(url, follow_redirects=True)
    # extract_main_content / extract_title stand in for your own
    # HTML-extraction helpers (e.g. built on an HTML parser)
    text = extract_main_content(response.text)
    return {
        "url": url,
        "title": extract_title(response.text),
        "content": text[:5000],  # limit content length
        "word_count": len(text.split()),
    }


async def save_note(title: str, content: str, tags: list[str] | None = None) -> dict:
    """Save a research note for later reference."""
    note = {
        "title": title,
        "content": content,
        "tags": tags or [],
        "timestamp": datetime.now().isoformat(),
    }
    # `storage` is whatever persistence backend you use
    note_id = await storage.save(note)
    return {"note_id": note_id, "status": "saved"}


# Register tools
tools = [
    create_tool(
        name="search_web",
        description=(
            "Search the web for information. Use this when you need to find "
            "current data, articles, or documentation on a topic."
        ),
        parameters={
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query",
                },
                "max_results": {
                    "type": "integer",
                    "description": "Maximum number of results (default: 5)",
                },
            },
            "required": ["query"],
        },
        handler=search_web,
    ),
    create_tool(
        name="read_webpage",
        description=(
            "Read the content of a specific webpage. Use this after "
            "search_web to get the full content of a promising result."
        ),
        parameters={
            "type": "object",
            "properties": {
                "url": {
                    "type": "string",
                    "description": "The URL to read",
                },
            },
            "required": ["url"],
        },
        handler=read_webpage,
    ),
    create_tool(
        name="save_note",
        description=(
            "Save a research note. Use this when you find important "
            "information that should be included in the final report."
        ),
        parameters={
            "type": "object",
            "properties": {
                "title": {"type": "string", "description": "Note title"},
                "content": {"type": "string", "description": "Note content"},
                "tags": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Tags for categorization",
                },
            },
            "required": ["title", "content"],
        },
        handler=save_note,
    ),
]
```
Step 3: Memory — Give the Agent Context
Memory is what separates a stateless chatbot from a useful agent. There are three types of memory you need to consider:
Working Memory (Short-term): The current conversation or task context. This is your context window — the information the agent can see right now.
Episodic Memory (Medium-term): Records of past interactions and outcomes. "Last time the user asked about X, they wanted Y format." This helps agents adapt to individual users.
Semantic Memory (Long-term): Persistent knowledge the agent can reference. Documentation, FAQs, product catalogs, user preferences.
Here is a simple but effective memory system:
```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class MemoryEntry:
    content: str
    memory_type: str  # "working", "episodic", "semantic"
    timestamp: datetime = field(default_factory=datetime.now)
    relevance_score: float = 1.0
    metadata: dict = field(default_factory=dict)


class AgentMemory:
    def __init__(self, max_working_entries: int = 20):
        self.working: list[MemoryEntry] = []
        self.episodic: list[MemoryEntry] = []
        self.semantic: list[MemoryEntry] = []
        self.max_working = max_working_entries

    def add_working(self, content: str, metadata: dict | None = None):
        """Add to working memory (current task context)."""
        entry = MemoryEntry(
            content=content,
            memory_type="working",
            metadata=metadata or {},
        )
        self.working.append(entry)
        # Evict oldest entries if over limit
        if len(self.working) > self.max_working:
            self.working = self.working[-self.max_working:]

    def add_episodic(self, content: str, metadata: dict | None = None):
        """Record an interaction outcome for future reference."""
        entry = MemoryEntry(
            content=content,
            memory_type="episodic",
            metadata=metadata or {},
        )
        self.episodic.append(entry)

    def retrieve_relevant(self, query: str, top_k: int = 5) -> list[MemoryEntry]:
        """Retrieve relevant memories using simple keyword matching.

        In production, replace this with vector similarity search.
        """
        all_memories = self.working + self.episodic + self.semantic
        query_words = set(query.lower().split())
        scored = []
        for memory in all_memories:
            content_words = set(memory.content.lower().split())
            overlap = len(query_words & content_words)
            if overlap > 0:
                scored.append((memory, overlap))
        scored.sort(key=lambda x: x[1], reverse=True)
        return [m for m, _ in scored[:top_k]]

    def build_context(self, query: str) -> str:
        """Build a context string for the agent's prompt."""
        relevant = self.retrieve_relevant(query)
        sections = []
        for memory in relevant:
            prefix = f"[{memory.memory_type}]"
            sections.append(f"{prefix} {memory.content}")
        return "\n".join(sections) if sections else "No relevant context found."
```
For production systems, replace the keyword matching with vector similarity search using embeddings. But start simple — keyword matching works surprisingly well for many use cases.
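When you do outgrow keyword matching, the swap is contained: only `retrieve_relevant` changes, replaced by cosine-similarity ranking over embedding vectors. A sketch of that ranking step — in practice the vectors would come from an embedding model, computed once when each memory is written:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_by_embedding(query_vec, memories, top_k: int = 5):
    """memories: list of (entry, embedding) pairs, embedded at write time."""
    scored = [(entry, cosine(query_vec, vec)) for entry, vec in memories]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [entry for entry, _ in scored[:top_k]]
```

Embedding at write time rather than query time keeps retrieval fast: each query costs one embedding call plus a cheap similarity scan.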
Step 4: Evaluation — Measure What Matters
This is the step everyone skips, and it is the most important one. Without evaluation, you are flying blind.
Here is my evaluation framework:
```python
import time
from dataclasses import dataclass


@dataclass
class EvalCase:
    input_query: str
    expected_behavior: str  # What should the agent do?
    success_criteria: list[str]  # Specific checkable outcomes
    max_steps: int = 10
    max_time_seconds: float = 60.0


# Define your evaluation suite
eval_suite = [
    EvalCase(
        input_query="What are the latest developments in quantum computing?",
        expected_behavior=(
            "Search for recent quantum computing news, read 2-3 sources, "
            "synthesize findings"
        ),
        success_criteria=[
            "agent_used_search_tool",
            "agent_read_at_least_2_sources",
            "response_mentions_specific_developments",
            "response_includes_dates_or_timeframes",
            "response_is_factually_grounded",
        ],
        max_steps=8,
    ),
    EvalCase(
        input_query="Compare React and Vue for a new project",
        expected_behavior="Research both frameworks, present structured comparison",
        success_criteria=[
            "agent_searched_for_both_frameworks",
            "response_covers_performance",
            "response_covers_ecosystem",
            "response_covers_learning_curve",
            "response_gives_recommendation_with_reasoning",
        ],
        max_steps=10,
    ),
]


async def run_evaluation(agent, eval_cases: list[EvalCase]) -> dict:
    """Run evaluation suite and return results."""
    results = []
    for case in eval_cases:
        start_time = time.time()
        # Run the agent
        response, trace = await agent.run_with_trace(case.input_query)
        elapsed = time.time() - start_time
        steps_taken = len(trace.steps)
        # Check success criteria (check_criterion maps a criterion name
        # to a pass/fail check over the response and trace)
        criteria_results = {}
        for criterion in case.success_criteria:
            criteria_results[criterion] = check_criterion(
                criterion, response, trace
            )
        passed = sum(criteria_results.values())
        total = len(criteria_results)
        results.append({
            "query": case.input_query,
            "score": passed / total,
            "criteria": criteria_results,
            "steps": steps_taken,
            "time": elapsed,
            "within_step_limit": steps_taken <= case.max_steps,
            "within_time_limit": elapsed <= case.max_time_seconds,
        })
    # Calculate aggregate metrics
    avg_score = sum(r["score"] for r in results) / len(results)
    return {
        "average_score": avg_score,
        "total_cases": len(results),
        "results": results,
    }
```
Run this evaluation suite every time you change your agent. Track the scores over time. This is how you know whether your changes are improvements or regressions.
Putting It All Together: A Research Agent
Let me walk you through building a complete research agent using the PTME framework. This agent takes a research question, searches the web, reads relevant sources, and produces a structured summary.
```python
import json

import anthropic


class ResearchAgent:
    def __init__(self):
        # Async client, since run() awaits the API call
        self.client = anthropic.AsyncAnthropic()
        self.memory = AgentMemory()
        self.max_iterations = 8

    async def run(self, query: str) -> str:
        """Execute a research task."""
        self.memory.add_working(f"Research query: {query}")

        system_prompt = """You are a research assistant. Your job is to
thoroughly research a topic and provide a well-sourced summary.

Guidelines:
- Search for information using the search_web tool
- Read at least 2-3 relevant sources using read_webpage
- Save important findings using save_note
- When you have enough information, provide a final summary
- Always cite your sources with URLs
- If sources contradict each other, note the disagreement
- Focus on recent, authoritative sources

Available context from memory:
{context}"""

        messages = [{"role": "user", "content": query}]

        for iteration in range(self.max_iterations):
            context = self.memory.build_context(query)
            response = await self.client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=4096,
                system=system_prompt.format(context=context),
                tools=self._get_tool_definitions(),
                messages=messages,
            )

            # Check if agent wants to use tools
            if response.stop_reason == "tool_use":
                tool_results = await self._execute_tools(response)
                # Add tool results to memory
                for result in tool_results:
                    self.memory.add_working(
                        f"Tool {result['name']}: {json.dumps(result['output'])[:500]}"
                    )
                # Continue the conversation; the API expects tool_result
                # blocks carrying only tool_use_id and content
                messages.append({"role": "assistant", "content": response.content})
                messages.append({
                    "role": "user",
                    "content": [
                        {
                            "type": "tool_result",
                            "tool_use_id": r["tool_use_id"],
                            "content": r["content"],
                        }
                        for r in tool_results
                    ],
                })
            else:
                # Agent is done — extract final text
                final_text = self._extract_text(response)
                # Save to episodic memory
                self.memory.add_episodic(
                    f"Researched '{query}' in {iteration + 1} steps. "
                    f"Result length: {len(final_text)} chars."
                )
                return final_text

        return "Research incomplete — reached maximum iterations."

    async def _execute_tools(self, response) -> list:
        """Execute tool calls from the agent's response."""
        results = []
        for block in response.content:
            if block.type == "tool_use":
                handler = self._get_handler(block.name)
                output = await handler(**block.input)
                results.append({
                    "tool_use_id": block.id,
                    "name": block.name,
                    "output": output,
                    "content": json.dumps(output),
                })
        return results

    # _get_tool_definitions, _get_handler, and _extract_text are small
    # helpers that read from the `tools` list defined earlier and pull
    # the text blocks out of a response; they are omitted for brevity.
```
Tool Calling Patterns That Work
After building dozens of agents, here are the tool calling patterns I have found most reliable:
Pattern 1: Search-Read-Synthesize
The most common pattern for information gathering:
search_web("topic") → pick best results → read_webpage(url) → synthesize
Always search first, then read. Do not try to guess URLs.
Pattern 2: Plan-Execute-Verify
For multi-step tasks:
create_plan(task) → for each step: execute_step() → verify_result() → next
The verification step catches errors early, before they compound.
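A minimal sketch of the Plan-Execute-Verify loop, with plain callables standing in for the LLM-backed planner, executor, and verifier (all names are illustrative):

```python
def plan_execute_verify(task, planner, executor, verifier, max_retries: int = 2):
    """Run each planned step, verifying before moving on.

    planner(task) -> list of steps; executor(step) -> outcome;
    verifier(step, outcome) -> bool. Failed steps are retried a
    bounded number of times, so one bad step cannot compound.
    """
    results = []
    for step in planner(task):
        for _attempt in range(max_retries + 1):
            outcome = executor(step)
            if verifier(step, outcome):  # catch errors before they compound
                results.append(outcome)
                break
        else:
            # Bounded retries: surface the failure instead of looping forever
            raise RuntimeError(f"step failed after retries: {step}")
    return results
```

The `else` on the retry loop is the safety property: a step that never verifies raises, rather than letting a corrupted intermediate result flow into later steps.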
Pattern 3: Progressive Refinement
For complex analysis:
rough_analysis(data) → identify_gaps() → targeted_search(gaps) → refine_analysis()
Start broad, then narrow down. This is more efficient than trying to be comprehensive from the start.
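The refinement loop can be sketched the same way — analyze, find what is missing, fetch only that, and repeat until no gaps remain or a round limit is hit (the callables are illustrative stand-ins for LLM-backed steps):

```python
def progressive_refinement(data, analyze, find_gaps, search, max_rounds: int = 3):
    """Start broad, then fill gaps with targeted searches.

    analyze(items) -> analysis; find_gaps(analysis) -> list of gaps;
    search(gap) -> new item. Bounded by max_rounds so an agent that
    keeps finding gaps cannot loop forever.
    """
    analysis = analyze(data)
    for _ in range(max_rounds):
        gaps = find_gaps(analysis)
        if not gaps:
            break  # analysis is complete enough
        extra = [search(gap) for gap in gaps]
        analysis = analyze(data + extra)
    return analysis
```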
Common Pitfalls and How to Avoid Them
Infinite loops: Always set a maximum iteration count. Agents can get stuck in search-refine loops forever.
Token budget explosions: Track token usage per step. Set hard limits. Summarize intermediate results to keep context manageable.
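A per-step token ledger is only a few lines. In this sketch the token counts are assumed to come from your provider's usage reporting or tokenizer; the class itself just enforces the hard limit:

```python
class TokenBudget:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, step_name: str, tokens: int) -> bool:
        """Record usage; return False once the hard limit would be exceeded.

        When this returns False, the caller should summarize what it has
        and stop, rather than issuing another full-context call.
        """
        if self.used + tokens > self.max_tokens:
            return False
        self.used += tokens
        return True
```

Checking *before* spending (rather than after) is the point: the budget refuses the step that would blow the limit, instead of discovering the overrun in next month's bill.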
Tool abuse: Some agents will call tools unnecessarily — searching for information they already have. Include "only search if you do not already know the answer" in your system prompt.
Hallucinated tool calls: Agents sometimes try to call tools that do not exist. Validate tool names before execution.
Testing Your Agent in Production
Once your agent passes evaluation, deploy it with guardrails:
- Log everything. Every tool call, every decision, every output. You will need this for debugging.
- Set rate limits. Prevent runaway agents from making thousands of API calls.
- Add a human-in-the-loop option. For high-stakes decisions, let the agent ask for confirmation.
- Monitor costs. AI agents can get expensive fast. Track cost per task.
```python
import json


class ProductionGuardrails:
    def __init__(self, max_cost_per_task: float = 0.50):
        self.max_cost = max_cost_per_task
        self.current_cost = 0.0

    def check_budget(self, estimated_cost: float) -> bool:
        """Refuse work that would push the task over its cost ceiling."""
        if self.current_cost + estimated_cost > self.max_cost:
            return False
        self.current_cost += estimated_cost
        return True

    def check_tool_call(self, tool_name: str, params: dict) -> bool:
        """Validate tool calls before execution."""
        # Block dangerous operations
        blocked_patterns = ["delete", "drop", "remove", "sudo"]
        param_str = json.dumps(params).lower()
        return not any(p in param_str for p in blocked_patterns)
```
What I Would Do Differently
If I were starting over, I would:
- Spend 80% of my time on prompts and evaluation, 20% on code. The framework matters less than the instructions.
- Build the evaluation suite before the agent. Test-driven development works even better for agents than for traditional code.
- Start with Claude Haiku or Sonnet, not Opus. Faster iterations, lower costs, and the performance difference matters less than you think for most tasks.
- Ship a simple version first. A research agent that searches and summarizes is more useful shipped today than a perfect multi-agent system shipped never.
Next Steps
The best way to learn is to build. Take a task you do repeatedly — research, data analysis, content creation — and build an agent for it using the PTME framework.
Start with Plan. Define the decisions. Then add Tools one at a time. Layer in Memory as patterns emerge. And always, always build your Evaluation suite early.
If you want a head start with pre-built agent templates, system prompts, and tool configurations for 8 common agent types (research, code review, data analysis, content creation, and more), check out the AI Agent Toolkit. It includes ready-to-use Python code for each agent pattern covered in this article, plus advanced patterns like multi-agent orchestration and human-in-the-loop workflows.
Happy building.