Originally published at paxrel.com

AI Agent Architecture Patterns: 6 Designs That Work in Production (2026)

Every AI agent is built on an architecture pattern — even if the builder doesn't realize it. The pattern determines how the agent reasons, when it uses tools, how it handles errors, and ultimately whether it works reliably or falls apart under real traffic.

There's no single "best" architecture. A customer support agent needs a different design than a research agent or a coding assistant. The right choice depends on your task complexity, latency requirements, cost budget, and reliability needs.

This guide covers the 6 architecture patterns used by production AI agents in 2026, with trade-offs and code for each.

## Pattern 1: ReAct (Reasoning + Acting)

The most common agent pattern. The LLM alternates between **thinking** (reasoning about what to do) and **acting** (calling tools). Each observation from a tool informs the next thought.
```
Loop:
  1. THOUGHT: "I need to find the user's order status"
  2. ACTION: lookup_order(order_id="12345")
  3. OBSERVATION: {"status": "shipped", "tracking": "FX789"}
  4. THOUGHT: "I have the tracking info, I can respond now"
  5. RESPONSE: "Your order shipped! Tracking: FX789"
```
### Implementation

```python
class ReActAgent:
    def __init__(self, llm, tools, system_prompt, max_steps=10):
        self.llm = llm
        self.tools = {t.name: t for t in tools}
        self.system_prompt = system_prompt
        self.max_steps = max_steps

    def run(self, user_input: str) -> str:
        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": user_input}
        ]

        for step in range(self.max_steps):
            response = self.llm.generate(messages, tools=list(self.tools.values()))

            if response.tool_calls:
                # Keep the assistant's tool-call turn in the history so the model
                # can match each observation to the call that produced it
                messages.append({"role": "assistant", "content": response.content,
                                 "tool_calls": response.tool_calls})
                for call in response.tool_calls:
                    result = self.tools[call.name].execute(**call.args)
                    messages.append({"role": "tool", "content": str(result),
                                     "tool_call_id": call.id})
            else:
                return response.content  # Final answer

        return "I wasn't able to complete this request. Let me connect you with support."
```
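
The loop assumes each tool exposes a `name`, a description the model can read, and an `execute` method. A minimal sketch of that wrapper with one hypothetical `lookup_order` tool (the `Tool` class here is an illustration, not from a specific framework):

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str
    func: Callable
    parameters: dict = field(default_factory=dict)  # JSON-schema-style argument spec

    def execute(self, **kwargs):
        # Run the wrapped function with the arguments the LLM chose
        return self.func(**kwargs)

def lookup_order(order_id: str) -> dict:
    # Stand-in for a real order lookup against your own system
    return {"status": "shipped", "tracking": "FX789"}

tools = [
    Tool(name="lookup_order",
         description="Look up an order's status by ID",
         func=lookup_order,
         parameters={"order_id": {"type": "string"}}),
]

# agent = ReActAgent(llm=my_llm_client, tools=tools,
#                    system_prompt="You are a customer support agent.")
# print(agent.run("Where is order 12345?"))
```
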
### When to Use ReAct

- Tasks with 1-5 tool calls
- When the next step depends on the previous result
- Customer support, Q&A, simple data lookups


### Trade-offs

| Pros | Cons |
| --- | --- |
| Simple to implement | Sequential — can't parallelize tool calls |
| Good reasoning transparency | Cost grows linearly with steps (full context each time) |
| Works with any LLM that supports tools | Can loop on difficult tasks |
| Easy to debug (read the thought chain) | No upfront planning — greedy decisions |


## Pattern 2: Plan-and-Execute

Instead of deciding one step at a time, the agent first creates a **complete plan**, then executes each step. If a step fails, it replans.
```
1. PLAN:
   Step 1: Look up customer's recent orders
   Step 2: Check refund eligibility for each
   Step 3: Process refund for the eligible one
   Step 4: Send confirmation email

2. EXECUTE Step 1: lookup_orders(email="user@example.com")
3. EXECUTE Step 2: check_eligibility(order_id="ORD-789")
4. EXECUTE Step 3: process_refund(order_id="ORD-789", amount=49.99)
5. EXECUTE Step 4: send_email(to="user@example.com", template="refund_confirmation")
```
### Implementation

```python
class PlanAndExecuteAgent:
    def __init__(self, planner_llm, executor_llm, tools):
        self.planner = planner_llm    # Strong model (GPT-4o, Claude Sonnet)
        self.executor = executor_llm  # Can be a cheaper model
        self.tools = tools

    def run(self, user_input: str) -> str:
        # Phase 1: Plan
        plan = self.create_plan(user_input)

        # Phase 2: Execute
        results = []
        steps = list(plan.steps)
        i, replans = 0, 0
        while i < len(steps):
            step = steps[i]
            try:
                result = self.execute_step(step, results)
                results.append({"step": step, "result": result, "status": "success"})
                i += 1
            except Exception as e:
                results.append({"step": step, "result": str(e), "status": "failed"})
                replans += 1
                if replans > 3:
                    break  # Give up and synthesize from what we have
                # Replan from the current state and continue with the new steps
                plan = self.replan(user_input, results, remaining=steps[i + 1:])
                steps = list(plan.steps)
                i = 0

        # Phase 3: Synthesize response
        return self.synthesize(user_input, results)

    def create_plan(self, task: str) -> Plan:
        prompt = f"""Create a step-by-step plan to accomplish this task.
Each step should map to exactly one tool call.
Available tools: {[t.name for t in self.tools]}

Task: {task}

Output as JSON: {{"steps": ["step 1 description", "step 2 description", ...]}}"""
        # The planner is asked for JSON; parse the output into a Plan object
        # with a .steps list (see the Plan sketch after this code block)
        return Plan.from_json(self.planner.generate(prompt))

    def replan(self, task, completed, remaining):
        prompt = f"""The original plan hit a problem. Create a new plan.
Task: {task}
Completed steps: {completed}
Remaining (may need changes): {remaining}"""
        return Plan.from_json(self.planner.generate(prompt))
```
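
The code above assumes a `Plan` container with a `.steps` list, built from the planner's JSON output. A small sketch of that container (the `from_json` helper is an assumption, not part of any framework):

```python
import json
from dataclasses import dataclass

@dataclass
class Plan:
    steps: list[str]

    @classmethod
    def from_json(cls, raw: str) -> "Plan":
        # The planner was asked for {"steps": [...]}; fall back to an
        # empty plan if the output isn't valid JSON
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            return cls(steps=[])
        return cls(steps=data.get("steps", []))
```
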
### When to Use Plan-and-Execute

- Complex tasks with 5+ steps
- Tasks where order matters (can't just wing it)
- Research tasks, data pipelines, multi-tool workflows


### Trade-offs

| Pros | Cons |
| --- | --- |
| Better at complex multi-step tasks | Planning step adds latency |
| Can use cheaper model for execution | Plan can be wrong (garbage in, garbage out) |
| Replanning handles failures gracefully | More complex to implement |
| User can review/approve plan before execution | Over-plans for simple tasks |


## Pattern 3: Router Agent

A lightweight agent that **classifies the request** and routes it to a specialized handler. Each handler is optimized for one type of task.
```
User Input → Router
              │
              ├── "order_status" → Order Status Handler (fast, cheap model)
              ├── "refund" → Refund Handler (careful model + approval flow)
              ├── "technical" → Technical Support Handler (RAG + strong model)
              └── "general" → General Handler (basic RAG)
```
### Implementation

```python
import json

class RouterAgent:
    def __init__(self, router_llm):
        self.router_llm = router_llm  # Small, fast model client for classification (e.g. gpt-4o-mini)
        self.handlers = {
            "order_status": OrderStatusHandler(model="gpt-4o-mini"),
            "refund": RefundHandler(model="gpt-4o", requires_approval=True),
            "technical": TechnicalHandler(model="claude-sonnet", rag=True),
            "billing": BillingHandler(model="gpt-4o"),
            "general": GeneralHandler(model="gpt-4o-mini", rag=True),
        }

    async def run(self, user_input: str) -> str:
        # Step 1: Classify (fast, ~100ms)
        intent = await self.classify(user_input)

        # Step 2: Route to handler (fall back to the general handler)
        handler = self.handlers.get(intent["category"], self.handlers["general"])

        # Step 3: Execute specialized handler
        return await handler.handle(user_input, intent)

    async def classify(self, text: str) -> dict:
        result = await self.router_llm.generate(
            f"Classify into: {list(self.handlers.keys())}\n"
            f"Input: {text}\n"
            f'JSON: {{"category": "...", "confidence": 0.0-1.0}}'
        )
        return json.loads(result)
```
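
The handlers themselves stay deliberately small. As a rough illustration of the shape (the `OrderStatusHandler` below is a sketch; the `llm` and `order_api` clients are assumed interfaces, not a specific library):

```python
class OrderStatusHandler:
    """One intent, one data source, one cheap model, no agent loop needed."""

    def __init__(self, model: str, llm=None, order_api=None):
        self.model = model
        self.llm = llm              # async LLM client (assumed: .generate(prompt, model=...))
        self.order_api = order_api  # async client for your order system (assumed)

    async def handle(self, user_input: str, intent: dict) -> str:
        order = await self.order_api.get_order(intent.get("order_id"))
        return await self.llm.generate(
            f"Answer the customer using this order data: {order}\n"
            f"Question: {user_input}",
            model=self.model,
        )
```
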
### When to Use Router

- Multiple distinct task types with different requirements
- When you want different models/tools per task type
- Customer support, multi-purpose assistants


### Trade-offs

| Pros | Cons |
| --- | --- |
| Optimized cost per task type | Classification errors route to wrong handler |
| Each handler is simple and focused | More code to maintain |
| Easy to add new task types | Doesn't handle multi-intent requests well |
| Latency optimized per path | Router adds one extra LLM call |


## Pattern 4: Hierarchical (Manager/Worker)

A manager agent breaks the task into subtasks and delegates each to specialized worker agents. Workers run independently and report back.
```
User: "Write a market analysis report for AI agents in healthcare"

Manager Agent:
  ├── Worker 1: Research market size and growth (web search agent)
  ├── Worker 2: Find key players and competitors (web search agent)
  ├── Worker 3: Analyze regulatory landscape (RAG agent)
  └── Worker 4: Compile report from all findings (writing agent)

Each worker has its own tools, context, and model.
```
### Implementation

```python
import asyncio

# web_search, scrape, database_query, calculator, and format_document are
# tool objects assumed to be defined elsewhere
class HierarchicalAgent:
    def __init__(self):
        self.manager = ManagerLLM(model="gpt-4o")
        self.workers = {
            "researcher": ResearchWorker(tools=[web_search, scrape]),
            "analyst": AnalystWorker(tools=[database_query, calculator]),
            "writer": WriterWorker(tools=[format_document]),
        }

    async def run(self, task: str) -> str:
        # Manager creates subtask plan
        subtasks = await self.manager.decompose(task)

        # Execute workers (parallel where possible)
        results = {}
        parallel_groups = self.manager.identify_parallel_groups(subtasks)

        for group in parallel_groups:
            group_results = await asyncio.gather(*[
                self.workers[st.worker_type].execute(st)
                for st in group
            ])
            for st, result in zip(group, group_results):
                results[st.id] = result

        # Manager synthesizes final output
        return await self.manager.synthesize(task, results)
```
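
Worker classes follow one contract: take a subtask, return a result the manager can synthesize. A sketch of one worker, reusing the ReAct loop from Pattern 1 as its inner loop (the prompt, `max_steps`, and `subtask.description` field are illustrative assumptions):

```python
class ResearchWorker:
    """A worker is just a small ReAct agent with a narrow job description."""

    def __init__(self, tools, llm=None, max_steps=6):
        # Reuse the ReAct loop from Pattern 1 for the worker's inner loop
        self.agent = ReActAgent(
            llm=llm,
            tools=tools,
            system_prompt="You are a research specialist. Gather facts for the "
                          "subtask you are given and return a concise summary "
                          "with sources.",
            max_steps=max_steps,
        )

    async def execute(self, subtask) -> str:
        # subtask.description comes from the manager's decomposition step
        return self.agent.run(subtask.description)
```
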
### When to Use Hierarchical

- Complex tasks that decompose into independent subtasks
- When subtasks need different tools/models
- Research, report generation, complex analysis


### Trade-offs

| Pros | Cons |
| --- | --- |
| Parallel execution = faster | Most complex to implement |
| Each worker is focused and testable | Manager decomposition can be wrong |
| Scales to very complex tasks | Higher cost (multiple agents running) |
| Workers can use different models | Inter-worker communication is tricky |


## Pattern 5: Reflection

The agent generates a response, then **critiques its own output** and iterates. Like having a built-in code reviewer.
```
1. GENERATE: Draft response to user query
2. REFLECT: "Is this response correct? Complete? Well-formatted?"
3. CRITIQUE: "The pricing info might be outdated. I should verify."
4. REVISE: Call pricing API, update response
5. REFLECT: "Now it's accurate and complete."
6. RESPOND: Return final version
```
### Implementation

```python
import json

class ReflectionAgent:
    def __init__(self, llm, tools, max_reflections=3):
        self.llm = llm
        self.tools = tools
        self.max_reflections = max_reflections

    def run(self, user_input: str) -> str:
        # Generate initial response
        draft = self.llm.generate(f"Respond to: {user_input}")

        for i in range(self.max_reflections):
            # Critique
            critique = self.llm.generate(f"""Review this response for issues:

User query: {user_input}
Draft response: {draft}

Check for:
1. Factual accuracy — are all claims verifiable?
2. Completeness — does it address the full query?
3. Missing information — should any tools be called?
4. Tone — is it appropriate?

If the response is good, output: {{"status": "approved"}}
If it needs improvement, output: {{"status": "revise", "issues": ["..."], "suggested_actions": ["..."]}}""")

            result = json.loads(critique)
            if result["status"] == "approved":
                return draft

            # Revise based on critique
            if result.get("suggested_actions"):
                for action in result["suggested_actions"]:
                    # execute_action maps a suggested action onto one of self.tools (not shown)
                    tool_result = self.execute_action(action)
                    draft = self.llm.generate(
                        f"Revise this response based on new information:\n"
                        f"Original: {draft}\n"
                        f"New data: {tool_result}\n"
                        f"Issues to fix: {result['issues']}"
                    )

        return draft  # Return best version after max reflections
```
### When to Use Reflection

- Tasks where accuracy is critical (medical, legal, financial)
- Content generation (writing, reports, emails)
- When the cost of a wrong answer is high


### Trade-offs

| Pros | Cons |
| --- | --- |
| Higher accuracy through self-correction | 2-3x the cost (multiple LLM calls) |
| Catches hallucinations before delivery | Slower (each reflection adds latency) |
| Natural quality improvement loop | Can over-critique and make things worse |
| Works with any base architecture | Diminishing returns after 2-3 iterations |


## Pattern 6: State Machine

The most structured pattern. The agent follows a predefined state machine with explicit transitions. Each state has its own behavior, tools, and exit conditions.
```
States: [GREETING] → [IDENTIFY] → [DIAGNOSE] → [RESOLVE] → [CONFIRM] → [CLOSE]

GREETING: Welcome user, detect intent
  → IDENTIFY (if needs account lookup)
  → DIAGNOSE (if general question)

IDENTIFY: Authenticate user, find account
  → DIAGNOSE (authenticated)
  → ESCALATE (auth failed 3x)

DIAGNOSE: Understand the specific issue
  → RESOLVE (issue identified)
  → ESCALATE (can't determine issue)

RESOLVE: Apply fix or provide answer
  → CONFIRM (fix applied)
  → ESCALATE (can't resolve)

CONFIRM: Verify customer is satisfied
  → CLOSE (satisfied)
  → DIAGNOSE (not satisfied, try again)
```
### Implementation

```python
from enum import Enum

class State(Enum):
    GREETING = "greeting"
    IDENTIFY = "identify"
    DIAGNOSE = "diagnose"
    RESOLVE = "resolve"
    CONFIRM = "confirm"
    CLOSE = "close"
    ESCALATE = "escalate"

class StateMachineAgent:
    def __init__(self, llm):
        self.llm = llm  # LLM client used in the DIAGNOSE state
        self.state = State.GREETING
        self.context = {}
        self.handlers = {
            State.GREETING: self.handle_greeting,
            State.IDENTIFY: self.handle_identify,
            State.DIAGNOSE: self.handle_diagnose,
            State.RESOLVE: self.handle_resolve,
            State.CONFIRM: self.handle_confirm,
        }

    async def process_message(self, message: str) -> str:
        handler = self.handlers.get(self.state)
        if not handler:
            return "This conversation has ended. Please start a new one."

        response, next_state = await handler(message)
        self.state = next_state
        return response

    async def handle_greeting(self, message):
        intent = await classify_intent(message)
        self.context["intent"] = intent

        if intent["requires_auth"]:
            return ("I'd be happy to help with that! First, I need to verify your identity. "
                   "Could you provide your email address?"), State.IDENTIFY
        else:
            return await self.handle_diagnose(message)

    async def handle_identify(self, message):
        # Try to authenticate
        email = extract_email(message)
        if email:
            account = await lookup_account(email)
            if account:
                self.context["account"] = account
                return (f"Found your account. Now, tell me more about the issue "
                       f"you're experiencing."), State.DIAGNOSE

        self.context["auth_attempts"] = self.context.get("auth_attempts", 0) + 1
        if self.context["auth_attempts"] >= 3:
            return "Let me connect you with our team for security verification.", State.ESCALATE

        return "I couldn't find an account with that info. Could you try again?", State.IDENTIFY

    async def handle_diagnose(self, message):
        # Use RAG + LLM to understand the issue. retrieve_relevant_docs (like
        # classify_intent, extract_email, and lookup_account above) is a domain
        # helper assumed to be defined elsewhere.
        context_docs = await retrieve_relevant_docs(message)
        # Assumes the LLM is prompted for structured output and the client
        # returns it as a dict with a suggested_response field
        diagnosis = await self.llm.generate(
            f"Diagnose this issue: {message}\nContext: {context_docs}"
        )
        self.context["diagnosis"] = diagnosis
        return diagnosis["suggested_response"], State.RESOLVE
```
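
The remaining state handlers follow the same contract as the ones above: take the message, return `(response, next_state)`. A sketch of the resolve and confirm states (the `apply_fix` helper is hypothetical):

```python
    # Additional StateMachineAgent methods (sketch)

    async def handle_resolve(self, message):
        # apply_fix is a hypothetical domain helper that acts on the diagnosis
        fixed = await apply_fix(self.context.get("account"), self.context["diagnosis"])
        if fixed:
            return "I've applied that fix. Did that resolve the issue for you?", State.CONFIRM
        return "I couldn't resolve this automatically. Let me bring in a teammate.", State.ESCALATE

    async def handle_confirm(self, message):
        # Reuse the intent classifier to read the customer's reaction
        reaction = await classify_intent(message)
        if reaction.get("satisfied"):
            return "Great! I'm closing this conversation. Reach out any time.", State.CLOSE
        # Not satisfied: go back to DIAGNOSE with the new information
        return "Sorry about that. Tell me more about what's still not working.", State.DIAGNOSE
```
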
### When to Use State Machine

- Well-defined workflows (support, onboarding, intake)
- Compliance-sensitive processes (need audit trail)
- When you need predictable behavior


### Trade-offs

| Pros | Cons |
| --- | --- |
| Most predictable and controllable | Rigid — can't handle unexpected flows |
| Easy to audit and debug | Requires upfront workflow design |
| Clear metrics per state | New scenarios need new states |
| Lowest risk of runaway behavior | Less "intelligent" feeling to users |


## Choosing the Right Pattern

| Scenario | Best Pattern | Why |
| --- | --- | --- |
| Simple Q&A with tools | ReAct | Low complexity, good enough |
| Complex multi-step research | Plan-and-Execute | Needs upfront planning |
| Multi-purpose assistant | Router | Different handlers per intent |
| Report generation | Hierarchical | Parallel research + synthesis |
| High-accuracy responses | Reflection | Self-correction catches errors |
| Regulated workflow | State Machine | Predictable, auditable |
| Customer support | Router + State Machine | Route by intent, structured flow per type |
| Coding assistant | ReAct + Reflection | Try code, test, self-correct |



> **Tip:** Most production agents combine 2-3 patterns. A customer support system might use a **Router** for classification, **State Machine** for the refund flow, and **ReAct** for general questions. Don't feel locked into a single pattern.
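
As a sketch of that kind of composition, wiring together the classes defined earlier in this post (the class names and prompts here are illustrative, not a specific framework):

```python
import json

class SupportAssistant:
    """Router on top, State Machine for the refund flow, ReAct for the rest."""

    def __init__(self, llm, tools):
        self.llm = llm
        self.refund_flow = StateMachineAgent(llm)   # Pattern 6: structured refund workflow
        self.general = ReActAgent(                  # Pattern 1: open-ended questions
            llm=llm, tools=tools,
            system_prompt="You are a helpful support agent.",
        )

    async def handle(self, user_input: str) -> str:
        # Pattern 3: a single cheap classification call decides the branch
        # (assumes the LLM client returns the requested JSON as a string)
        raw = await self.llm.generate(
            f'Classify into ["refund", "general"].\n'
            f'Input: {user_input}\nJSON: {{"category": "..."}}'
        )
        if json.loads(raw)["category"] == "refund":
            return await self.refund_flow.process_message(user_input)
        return self.general.run(user_input)
```
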


## Anti-Pattern: The "God Agent"

The most common architecture mistake: one giant agent with 30 tools, a 5,000-token system prompt, and instructions for every possible scenario. This agent:

- Confuses which tools to use (too many choices)
- Has slow, expensive LLM calls (massive context)
- Is impossible to test (too many code paths)
- Degrades as you add more features

If your agent has more than 8-10 tools, you need a Router or Hierarchical pattern. Split it up.
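
In practice, "split it up" mostly means giving each specialized handler only the few tools its intent needs, roughly like this (all tool names are placeholders):

```python
# Instead of one agent that sees all 30 tools, give each handler its own
# small subset (names are placeholders; map them to your real Tool objects)
TOOLS_BY_INTENT = {
    "order_status": ["lookup_order", "track_shipment"],
    "refund":       ["lookup_order", "check_refund_eligibility", "process_refund"],
    "technical":    ["search_docs", "create_ticket"],
    "billing":      ["get_invoice", "update_payment_method"],
}

# handlers = {
#     intent: ReActAgent(llm=llm, tools=[TOOL_REGISTRY[name] for name in names],
#                        system_prompt=f"You handle {intent} requests only.")
#     for intent, names in TOOLS_BY_INTENT.items()
# }
# Each handler now chooses between 2-3 tools instead of 30, so tool
# selection stays reliable as the system grows.
```
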

## Architecture Decision Checklist

Before building, answer these questions (a rough code version of the checklist follows the list):

- **How many steps does the typical task require?** (1-3: ReAct, 4-8: Plan-and-Execute, 8+: Hierarchical)
- **Are there distinct task categories?** (Yes: Router)
- **Is accuracy critical?** (Yes: add Reflection)
- **Is the workflow well-defined?** (Yes: State Machine)
- **What's your latency budget?** (Tight: ReAct or Router. Flexible: any)
- **What's your cost budget per request?** (Tight: ReAct + cheap model. Flexible: Hierarchical + Reflection)


> Designing AI agent architectures? [AI Agents Weekly](/newsletter.html) covers patterns, frameworks, and production case studies 3x/week. Join free.



## Conclusion

Architecture is the decision that's hardest to change later. Start with the simplest pattern that meets your requirements (usually ReAct), then evolve. Most production agents end up as hybrids — and that's fine.

The key insight: **match the architecture to the task, not the framework**. Don't use a hierarchical multi-agent system because it sounds impressive. Use it because your task genuinely decomposes into parallel subtasks. The best architecture is the one that solves your problem with the least complexity.

Get our free AI Agent Starter Kit — templates, checklists, and deployment guides for building production AI agents.
