Everyone is building AI agents in 2026. Most of them are terrible.
I have spent the last year building, testing, and breaking AI agents across dozens of use cases — from research assistants to code generators to automated customer support pipelines. Along the way, I watched countless projects fail spectacularly, including several of my own.
The pattern is always the same: a developer gets excited about a demo, spins up a quick prototype, shows it to stakeholders, and then spends six months trying to make it reliable enough for production. The demo-to-production gap for AI agents is wider than almost any other technology I have worked with.
This article is the guide I wish I had when I started. A practical, no-hype framework for building AI agents that actually work — not just in demos, but in the real world where users do unexpected things and uptime matters.
Why Most AI Agents Fail
Before we build anything, let us understand the failure modes. After analyzing dozens of failed agent projects (mine and others), I have identified four recurring patterns.
Failure Mode 1: Over-Engineering from Day One
The most common mistake is starting with a complex multi-agent orchestration system when a single well-prompted LLM call would do the job. I see teams building elaborate frameworks with 15 different agent types before they have even validated that the core task works.
The fix: Start with the simplest possible implementation. A single LLM call with good instructions. Only add complexity when you can prove it is necessary.
Failure Mode 2: Poor Prompt Design
Many developers treat prompts as an afterthought — a quick instruction tacked onto the beginning of a context window. But prompt design is the single most important factor in agent reliability. A well-designed prompt with a mediocre model will outperform a poorly designed prompt with a frontier model almost every time.
Failure Mode 3: Wrong Architecture for the Task
Not every task needs an agent. If you can solve the problem with a simple chain of LLM calls (input → process → output), do that. Agents add autonomy, which adds unpredictability. That unpredictability is only worth it when the task genuinely requires adaptive decision-making.
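To make the chain-versus-agent distinction concrete, here is a minimal sketch of a fixed chain: each step runs exactly once, in order, with no branching or tool choice. The function names and the `llm` callable are illustrative stand-ins, not part of any real library:

```python
def extract_question(raw_input: str) -> str:
    """Normalize the user's input into a clean question."""
    return raw_input.strip().rstrip("?") + "?"

def build_prompt(question: str) -> str:
    """Wrap the question in fixed instructions."""
    return f"Answer concisely: {question}"

def format_output(answer: str) -> str:
    """Apply deterministic post-processing to the model's answer."""
    return f"Answer: {answer}"

def run_chain(raw_input: str, llm) -> str:
    """input → process → output: a chain, not an agent.

    `llm` is a stand-in for your actual model call. The control flow
    is fully determined before the first call — that is what makes
    chains predictable and cheap to test.
    """
    question = extract_question(raw_input)
    prompt = build_prompt(question)
    answer = llm(prompt)  # the single LLM call
    return format_output(answer)
```

If you find yourself wanting the model to decide *which* step comes next, that is the signal you have left chain territory and need an agent.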
Failure Mode 4: No Evaluation Framework
If you cannot measure whether your agent is working, you cannot improve it. Most teams skip evaluation entirely and rely on vibes — "it seems to work pretty well." That is how you ship agents that fail 30% of the time and nobody notices until users start complaining.
The PTME Framework: Plan, Tools, Memory, Evaluation
Here is the framework I use for every agent project. It is not fancy, but it works.
Step 1: Plan — Define the Agent's Decision Space
Before writing any code, answer these questions:
- What decisions does the agent need to make? List every point where the agent chooses between actions.
- What information does it need to make each decision? This determines your context strategy.
- What are the failure modes for each decision? This shapes your error handling.
- What should happen when the agent is uncertain? This determines your fallback strategy.
Write this down. Literally. I keep a one-page "Agent Decision Map" for every agent I build.
```text
Agent: Research Assistant

Decisions:
1. Which sources to search → Needs: user query, available tools
2. Whether results are relevant → Needs: user query, search results
3. When to stop searching → Needs: result quality threshold, max iterations
4. How to synthesize findings → Needs: all collected results, output format

Failure modes:
- No relevant results found → Ask user to refine query
- Contradictory sources → Present both with confidence scores
- Token limit approaching → Summarize and present partial results
```
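The one-page map can also live in code, so your fallback strategy is data the agent consults rather than a document nobody reads. A minimal sketch — every name here is illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Decision:
    name: str
    needs: list[str]  # information required to make this decision

@dataclass
class DecisionMap:
    agent: str
    decisions: list[Decision]
    fallbacks: dict[str, str] = field(default_factory=dict)  # failure → action

    def fallback_for(self, failure: str) -> str:
        # Default to escalation for failure modes you never planned for
        return self.fallbacks.get(failure, "escalate to human")

research_map = DecisionMap(
    agent="Research Assistant",
    decisions=[
        Decision("which_sources", needs=["user query", "available tools"]),
        Decision("when_to_stop", needs=["quality threshold", "max iterations"]),
    ],
    fallbacks={"no_relevant_results": "ask user to refine query"},
)
```

The `fallback_for` default is the important part: an unmapped failure mode should degrade to human escalation, never to silent improvisation.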
Step 2: Tools — Give the Agent Capabilities
Tools are functions your agent can call to interact with the world. The quality of your tools determines the ceiling of your agent's capabilities.
Here are the principles I follow for tool design:
Keep tools atomic. Each tool should do one thing well. A search_web tool should search the web, not search the web and summarize the results.
Make tool descriptions crystal clear. The LLM reads your tool descriptions to decide when to use each tool. Ambiguous descriptions lead to wrong tool choices.
Return structured data. Tools should return JSON or structured objects, not free-form text. This makes it easier for the agent to process results.
Here is a practical example in Python:
```python
import json
from datetime import datetime

import httpx


def create_tool(name: str, description: str, parameters: dict, handler) -> dict:
    """Create a tool definition for the agent."""
    return {
        "name": name,
        "description": description,
        "parameters": parameters,
        "handler": handler,
    }


async def search_web(query: str, max_results: int = 5) -> dict:
    """Search the web and return structured results."""
    async with httpx.AsyncClient() as client:
        response = await client.get(
            "https://api.search-provider.com/search",  # placeholder endpoint
            params={"q": query, "count": max_results},
        )
    results = response.json()
    return {
        "query": query,
        "results": [
            {
                "title": r["title"],
                "url": r["url"],
                "snippet": r["snippet"],
            }
            for r in results.get("items", [])
        ],
        "total_count": results.get("total", 0),
    }


async def read_webpage(url: str) -> dict:
    """Fetch and extract content from a webpage."""
    async with httpx.AsyncClient() as client:
        response = await client.get(url, follow_redirects=True)
    # extract_main_content / extract_title stand in for your own
    # HTML-extraction helpers (e.g. built on an HTML parser)
    text = extract_main_content(response.text)
    return {
        "url": url,
        "title": extract_title(response.text),
        "content": text[:5000],  # limit content length
        "word_count": len(text.split()),
    }


async def save_note(title: str, content: str, tags: list[str] | None = None) -> dict:
    """Save a research note for later reference."""
    note = {
        "title": title,
        "content": content,
        "tags": tags or [],
        "timestamp": datetime.now().isoformat(),
    }
    # `storage` is whatever persistence backend you use
    note_id = await storage.save(note)
    return {"note_id": note_id, "status": "saved"}


# Register tools
tools = [
    create_tool(
        name="search_web",
        description=(
            "Search the web for information. Use this when you need to find "
            "current data, articles, or documentation on a topic."
        ),
        parameters={
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query",
                },
                "max_results": {
                    "type": "integer",
                    "description": "Maximum number of results (default: 5)",
                },
            },
            "required": ["query"],
        },
        handler=search_web,
    ),
    create_tool(
        name="read_webpage",
        description=(
            "Read the content of a specific webpage. Use this after "
            "search_web to get the full content of a promising result."
        ),
        parameters={
            "type": "object",
            "properties": {
                "url": {
                    "type": "string",
                    "description": "The URL to read",
                },
            },
            "required": ["url"],
        },
        handler=read_webpage,
    ),
    create_tool(
        name="save_note",
        description=(
            "Save a research note. Use this when you find important "
            "information that should be included in the final report."
        ),
        parameters={
            "type": "object",
            "properties": {
                "title": {"type": "string", "description": "Note title"},
                "content": {"type": "string", "description": "Note content"},
                "tags": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Tags for categorization",
                },
            },
            "required": ["title", "content"],
        },
        handler=save_note,
    ),
]
```
Step 3: Memory — Give the Agent Context
Memory is what separates a stateless chatbot from a useful agent. There are three types of memory you need to consider:
Working Memory (Short-term): The current conversation or task context. This is your context window — the information the agent can see right now.
Episodic Memory (Medium-term): Records of past interactions and outcomes. "Last time the user asked about X, they wanted Y format." This helps agents adapt to individual users.
Semantic Memory (Long-term): Persistent knowledge the agent can reference. Documentation, FAQs, product catalogs, user preferences.
Here is a simple but effective memory system:
```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class MemoryEntry:
    content: str
    memory_type: str  # "working", "episodic", "semantic"
    timestamp: datetime = field(default_factory=datetime.now)
    relevance_score: float = 1.0
    metadata: dict = field(default_factory=dict)


class AgentMemory:
    def __init__(self, max_working_entries: int = 20):
        self.working: list[MemoryEntry] = []
        self.episodic: list[MemoryEntry] = []
        self.semantic: list[MemoryEntry] = []
        self.max_working = max_working_entries

    def add_working(self, content: str, metadata: dict | None = None):
        """Add to working memory (current task context)."""
        entry = MemoryEntry(
            content=content,
            memory_type="working",
            metadata=metadata or {},
        )
        self.working.append(entry)
        # Evict oldest entries if over limit
        if len(self.working) > self.max_working:
            self.working = self.working[-self.max_working:]

    def add_episodic(self, content: str, metadata: dict | None = None):
        """Record an interaction outcome for future reference."""
        entry = MemoryEntry(
            content=content,
            memory_type="episodic",
            metadata=metadata or {},
        )
        self.episodic.append(entry)

    def retrieve_relevant(self, query: str, top_k: int = 5) -> list[MemoryEntry]:
        """Retrieve relevant memories using simple keyword matching.

        In production, replace this with vector similarity search.
        """
        all_memories = self.working + self.episodic + self.semantic
        query_words = set(query.lower().split())
        scored = []
        for memory in all_memories:
            content_words = set(memory.content.lower().split())
            overlap = len(query_words & content_words)
            if overlap > 0:
                scored.append((memory, overlap))
        scored.sort(key=lambda x: x[1], reverse=True)
        return [m for m, _ in scored[:top_k]]

    def build_context(self, query: str) -> str:
        """Build a context string for the agent's prompt."""
        relevant = self.retrieve_relevant(query)
        sections = []
        for memory in relevant:
            prefix = f"[{memory.memory_type}]"
            sections.append(f"{prefix} {memory.content}")
        return "\n".join(sections) if sections else "No relevant context found."
```
For production systems, replace the keyword matching with vector similarity search using embeddings. But start simple — keyword matching works surprisingly well for many use cases.
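When you do outgrow keyword matching, the swap is contained: only `retrieve_relevant` changes, replaced by cosine-similarity ranking over embedding vectors. A sketch of that ranking step — in practice the vectors would come from an embedding model, computed once when each memory is written:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_by_embedding(query_vec, memories, top_k: int = 5):
    """memories: list of (entry, embedding) pairs, embedded at write time."""
    scored = [(entry, cosine(query_vec, vec)) for entry, vec in memories]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [entry for entry, _ in scored[:top_k]]
```

Embedding at write time rather than query time keeps retrieval fast: each query costs one embedding call plus a cheap similarity scan.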
Step 4: Evaluation — Measure What Matters
This is the step everyone skips, and it is the most important one. Without evaluation, you are flying blind.
Here is my evaluation framework:
```python
import time
from dataclasses import dataclass


@dataclass
class EvalCase:
    input_query: str
    expected_behavior: str  # What should the agent do?
    success_criteria: list[str]  # Specific checkable outcomes
    max_steps: int = 10
    max_time_seconds: float = 60.0


# Define your evaluation suite
eval_suite = [
    EvalCase(
        input_query="What are the latest developments in quantum computing?",
        expected_behavior=(
            "Search for recent quantum computing news, read 2-3 sources, "
            "synthesize findings"
        ),
        success_criteria=[
            "agent_used_search_tool",
            "agent_read_at_least_2_sources",
            "response_mentions_specific_developments",
            "response_includes_dates_or_timeframes",
            "response_is_factually_grounded",
        ],
        max_steps=8,
    ),
    EvalCase(
        input_query="Compare React and Vue for a new project",
        expected_behavior="Research both frameworks, present structured comparison",
        success_criteria=[
            "agent_searched_for_both_frameworks",
            "response_covers_performance",
            "response_covers_ecosystem",
            "response_covers_learning_curve",
            "response_gives_recommendation_with_reasoning",
        ],
        max_steps=10,
    ),
]


async def run_evaluation(agent, eval_cases: list[EvalCase]) -> dict:
    """Run evaluation suite and return results."""
    results = []
    for case in eval_cases:
        start_time = time.time()
        # Run the agent
        response, trace = await agent.run_with_trace(case.input_query)
        elapsed = time.time() - start_time
        steps_taken = len(trace.steps)
        # Check success criteria (check_criterion maps a criterion name
        # to a pass/fail check over the response and trace)
        criteria_results = {}
        for criterion in case.success_criteria:
            criteria_results[criterion] = check_criterion(
                criterion, response, trace
            )
        passed = sum(criteria_results.values())
        total = len(criteria_results)
        results.append({
            "query": case.input_query,
            "score": passed / total,
            "criteria": criteria_results,
            "steps": steps_taken,
            "time": elapsed,
            "within_step_limit": steps_taken <= case.max_steps,
            "within_time_limit": elapsed <= case.max_time_seconds,
        })
    # Calculate aggregate metrics
    avg_score = sum(r["score"] for r in results) / len(results)
    return {
        "average_score": avg_score,
        "total_cases": len(results),
        "results": results,
    }
```
Run this evaluation suite every time you change your agent. Track the scores over time. This is how you know whether your changes are improvements or regressions.
Putting It All Together: A Research Agent
Let me walk you through building a complete research agent using the PTME framework. This agent takes a research question, searches the web, reads relevant sources, and produces a structured summary.
```python
import json

import anthropic


class ResearchAgent:
    def __init__(self):
        # Async client, since run() awaits the API call
        self.client = anthropic.AsyncAnthropic()
        self.memory = AgentMemory()
        self.max_iterations = 8

    async def run(self, query: str) -> str:
        """Execute a research task."""
        self.memory.add_working(f"Research query: {query}")

        system_prompt = """You are a research assistant. Your job is to
thoroughly research a topic and provide a well-sourced summary.

Guidelines:
- Search for information using the search_web tool
- Read at least 2-3 relevant sources using read_webpage
- Save important findings using save_note
- When you have enough information, provide a final summary
- Always cite your sources with URLs
- If sources contradict each other, note the disagreement
- Focus on recent, authoritative sources

Available context from memory:
{context}"""

        messages = [{"role": "user", "content": query}]

        for iteration in range(self.max_iterations):
            context = self.memory.build_context(query)
            response = await self.client.messages.create(
                model="claude-sonnet-4-6",
                max_tokens=4096,
                system=system_prompt.format(context=context),
                tools=self._get_tool_definitions(),
                messages=messages,
            )

            # Check if agent wants to use tools
            if response.stop_reason == "tool_use":
                tool_results = await self._execute_tools(response)
                # Add tool results to memory
                for result in tool_results:
                    self.memory.add_working(
                        f"Tool {result['name']}: {json.dumps(result['output'])[:500]}"
                    )
                # Continue the conversation; the API expects tool_result
                # blocks carrying only tool_use_id and content
                messages.append({"role": "assistant", "content": response.content})
                messages.append({
                    "role": "user",
                    "content": [
                        {
                            "type": "tool_result",
                            "tool_use_id": r["tool_use_id"],
                            "content": r["content"],
                        }
                        for r in tool_results
                    ],
                })
            else:
                # Agent is done — extract final text
                final_text = self._extract_text(response)
                # Save to episodic memory
                self.memory.add_episodic(
                    f"Researched '{query}' in {iteration + 1} steps. "
                    f"Result length: {len(final_text)} chars."
                )
                return final_text

        return "Research incomplete — reached maximum iterations."

    async def _execute_tools(self, response) -> list:
        """Execute tool calls from the agent's response."""
        results = []
        for block in response.content:
            if block.type == "tool_use":
                handler = self._get_handler(block.name)
                output = await handler(**block.input)
                results.append({
                    "tool_use_id": block.id,
                    "name": block.name,
                    "output": output,
                    "content": json.dumps(output),
                })
        return results

    # _get_tool_definitions, _get_handler, and _extract_text are small
    # helpers that read from the `tools` list defined earlier and pull
    # the text blocks out of a response; they are omitted for brevity.
```
Tool Calling Patterns That Work
After building dozens of agents, here are the tool calling patterns I have found most reliable:
Pattern 1: Search-Read-Synthesize
The most common pattern for information gathering:
search_web("topic") → pick best results → read_webpage(url) → synthesize
Always search first, then read. Do not try to guess URLs.
Pattern 2: Plan-Execute-Verify
For multi-step tasks:
create_plan(task) → for each step: execute_step() → verify_result() → next
The verification step catches errors early, before they compound.
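A minimal sketch of the Plan-Execute-Verify loop, with plain callables standing in for the LLM-backed planner, executor, and verifier (all names are illustrative):

```python
def plan_execute_verify(task, planner, executor, verifier, max_retries: int = 2):
    """Run each planned step, verifying before moving on.

    planner(task) -> list of steps; executor(step) -> outcome;
    verifier(step, outcome) -> bool. Failed steps are retried a
    bounded number of times, so one bad step cannot compound.
    """
    results = []
    for step in planner(task):
        for _attempt in range(max_retries + 1):
            outcome = executor(step)
            if verifier(step, outcome):  # catch errors before they compound
                results.append(outcome)
                break
        else:
            # Bounded retries: surface the failure instead of looping forever
            raise RuntimeError(f"step failed after retries: {step}")
    return results
```

The `else` on the retry loop is the safety property: a step that never verifies raises, rather than letting a corrupted intermediate result flow into later steps.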
Pattern 3: Progressive Refinement
For complex analysis:
rough_analysis(data) → identify_gaps() → targeted_search(gaps) → refine_analysis()
Start broad, then narrow down. This is more efficient than trying to be comprehensive from the start.
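The refinement loop can be sketched the same way — analyze, find what is missing, fetch only that, and repeat until no gaps remain or a round limit is hit (the callables are illustrative stand-ins for LLM-backed steps):

```python
def progressive_refinement(data, analyze, find_gaps, search, max_rounds: int = 3):
    """Start broad, then fill gaps with targeted searches.

    analyze(items) -> analysis; find_gaps(analysis) -> list of gaps;
    search(gap) -> new item. Bounded by max_rounds so an agent that
    keeps finding gaps cannot loop forever.
    """
    analysis = analyze(data)
    for _ in range(max_rounds):
        gaps = find_gaps(analysis)
        if not gaps:
            break  # analysis is complete enough
        extra = [search(gap) for gap in gaps]
        analysis = analyze(data + extra)
    return analysis
```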
Common Pitfalls and How to Avoid Them
Infinite loops: Always set a maximum iteration count. Agents can get stuck in search-refine loops forever.
Token budget explosions: Track token usage per step. Set hard limits. Summarize intermediate results to keep context manageable.
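A per-step token ledger is only a few lines. In this sketch the token counts are assumed to come from your provider's usage reporting or tokenizer; the class itself just enforces the hard limit:

```python
class TokenBudget:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, step_name: str, tokens: int) -> bool:
        """Record usage; return False once the hard limit would be exceeded.

        When this returns False, the caller should summarize what it has
        and stop, rather than issuing another full-context call.
        """
        if self.used + tokens > self.max_tokens:
            return False
        self.used += tokens
        return True
```

Checking *before* spending (rather than after) is the point: the budget refuses the step that would blow the limit, instead of discovering the overrun in next month's bill.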
Tool abuse: Some agents will call tools unnecessarily — searching for information they already have. Include "only search if you do not already know the answer" in your system prompt.
Hallucinated tool calls: Agents sometimes try to call tools that do not exist. Validate tool names before execution.
Testing Your Agent in Production
Once your agent passes evaluation, deploy it with guardrails:
- Log everything. Every tool call, every decision, every output. You will need this for debugging.
- Set rate limits. Prevent runaway agents from making thousands of API calls.
- Add a human-in-the-loop option. For high-stakes decisions, let the agent ask for confirmation.
- Monitor costs. AI agents can get expensive fast. Track cost per task.
```python
import json


class ProductionGuardrails:
    def __init__(self, max_cost_per_task: float = 0.50):
        self.max_cost = max_cost_per_task
        self.current_cost = 0.0

    def check_budget(self, estimated_cost: float) -> bool:
        """Refuse work that would push the task over its cost ceiling."""
        if self.current_cost + estimated_cost > self.max_cost:
            return False
        self.current_cost += estimated_cost
        return True

    def check_tool_call(self, tool_name: str, params: dict) -> bool:
        """Validate tool calls before execution."""
        # Block dangerous operations
        blocked_patterns = ["delete", "drop", "remove", "sudo"]
        param_str = json.dumps(params).lower()
        return not any(p in param_str for p in blocked_patterns)
```
What I Would Do Differently
If I were starting over, I would:
- Spend 80% of my time on prompts and evaluation, 20% on code. The framework matters less than the instructions.
- Build the evaluation suite before the agent. Test-driven development works even better for agents than for traditional code.
- Start with Claude Haiku or Sonnet, not Opus. Faster iterations, lower costs, and the performance difference matters less than you think for most tasks.
- Ship a simple version first. A research agent that searches and summarizes is more useful shipped today than a perfect multi-agent system shipped never.
Next Steps
The best way to learn is to build. Take a task you do repeatedly — research, data analysis, content creation — and build an agent for it using the PTME framework.
Start with Plan. Define the decisions. Then add Tools one at a time. Layer in Memory as patterns emerge. And always, always build your Evaluation suite early.
If you want a head start with pre-built agent templates, system prompts, and tool configurations for 8 common agent types (research, code review, data analysis, content creation, and more), check out the AI Agent Toolkit. It includes ready-to-use Python code for each agent pattern covered in this article, plus advanced patterns like multi-agent orchestration and human-in-the-loop workflows.
Happy building.