DEV Community

Eric Weston

How Agentic AI Systems Execute Multi-Step Workflows (Architecture + Stack)

You've probably used ChatGPT or Claude to answer a question. That's a single-turn interaction: you ask, it answers, done.
Agentic AI is different. It doesn't just answer; it plans, acts, observes, and iterates until a goal is achieved.

This article breaks down exactly how that works: the architecture, the components, the stack, and the tricky parts nobody talks about.

What Makes a System "Agentic"?

A system is agentic when the LLM isn't just generating text but making decisions that affect what happens next.
Three markers of an agentic system:
Tool use: the model calls external functions like search, code execution, or APIs

Multi-step loops: the model acts, sees a result, then decides the next action based on what it observed
Goal-directedness: it's working toward an objective, not just completing a prompt

Single LLM call = not agentic. LLM in a loop with tools and memory = agentic.

The Core Architecture

At their core, agentic AI systems operate through a recurring execution loop:

Perceive → Plan → Act → Observe → Repeat

Everything else (memory, tools, and multi-agent coordination) is built around making that loop smarter, faster, and more reliable. Five components make up the architecture:

The Orchestrator (Agent Core)
The Planner
The Memory Layer
The Tool Execution Layer
The Multi-Agent Coordination Layer (optional but powerful)
Each one has a specific task. All of them talk to each other.
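The Perceive → Plan → Act → Observe loop can be sketched in a few lines. This is an illustrative skeleton, not any particular framework's API: `call_llm` stands in for a real model client, and the tools are plain Python functions.

```python
# Minimal sketch of the agentic execution loop. `call_llm` is a stand-in
# for a real LLM client that returns either a tool call or a final answer.

def run_agent(goal, tools, call_llm, max_steps=10):
    context = [{"role": "user", "content": goal}]             # Perceive
    for _ in range(max_steps):
        decision = call_llm(context)                          # Plan
        if decision["type"] == "final":
            return decision["content"]                        # Goal complete
        result = tools[decision["tool"]](**decision["args"])  # Act
        context.append({"role": "tool", "content": str(result)})  # Observe
    return "step limit reached"                               # Safety cap
```

The hard step limit at the end is not optional in practice; without it, a confused model can loop forever.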

The Orchestrator (Agent Core)

This is the LLM itself, the brain of the operation.
It receives:
The user's goal
Current memory and context
Results from previous actions
A list of tools it can call

It outputs:

A tool call with parameters, OR
A final response when the goal is complete

The orchestrator is shaped almost entirely by its system prompt. That prompt defines its role, what tools are available, and how it should make decisions. A well-written system prompt is the difference between an agent that spirals into loops and one that reliably completes tasks.
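A condensed, illustrative system prompt for such an orchestrator might look like this (the tool names and rules here are hypothetical, not from any specific product):

```
You are a research agent. Complete the user's task using the tools
below, then return a final answer.

Tools:
- web_search(query): search the web and return result snippets
- read_url(url): fetch and return the text of a page

Rules:
- Think through what you need before each action.
- Never repeat an identical tool call.
- If the goal is ambiguous, ask one clarifying question before planning.
```

Note that the prompt covers all three ingredients from the section above: role, available tools, and decision rules.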

The Planner

Not every agentic system has an explicit planner, but the best ones do.
Reactive agents: Skip planning entirely; they decide the next action based purely on the current state. Fast, but brittle on complex tasks. No roadmap means they drift and get stuck.

Planning agents: Generate a task graph before acting. Given a goal like "write a competitive analysis on Notion," the planner breaks it into steps:

Search for Notion's latest features
Search for each competitor (Obsidian, Confluence, Coda)
Read the top results per competitor
Synthesize the findings
Write the structured report

The most widely used pattern is ReAct (Reasoning + Acting). Before each action, the model thinks out loud about what it's trying to find, what action it's taking, and what it learned from the result. That chain of thought before each step dramatically improves reliability compared to acting blindly.

The planner can be the same LLM prompted differently, or a dedicated, smaller model that handles task decomposition while a larger model handles reasoning.
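An illustrative ReAct trace for the Notion example might look like the following. The Thought/Action/Observation labels are the conventional format; the specific content here is made up:

```
Thought: I need Notion's most recent feature announcements before I can
  compare it to anything.
Action: web_search("Notion latest features")
Observation: <search result snippets>
Thought: Obsidian is the first competitor on my list; I'll look at how
  it positions itself against Notion next.
Action: web_search("Obsidian vs Notion")
```

Each Thought line forces the model to commit to a rationale before acting, which is what makes failures easier to diagnose and behavior more reliable.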

Memory

This is where most agentic systems fall apart in production. Without the right memory architecture, agents lose context, repeat themselves, or hallucinate facts from earlier in the run.

In-Context Memory: Just the conversation window. Everything the agent knows during a run lives here. Simple, but once you hit the token limit, the old context falls off.

Episodic Memory (Short-Term): A structured log of what happened during the current session, actions taken, results observed, and decisions made. Gets summarized periodically and fed back into context so the agent doesn't lose track in long workflows.

Semantic Memory (Long-Term): A vector database. Past runs, domain knowledge, and user preferences are chunked, embedded, and retrieved by similarity. This is what lets an agent "remember" things across sessions without stuffing everything into context.

Procedural Memory: Stores tool definitions, function signatures, and learned workflows. Lets an agent reuse strategies that worked in the past rather than figuring everything out from scratch.

A production memory system works like this:

Run starts → load relevant past memories from the vector database
During run → append actions and results to episodic log
Run ends → summarize the episode, embed it, store it back for future retrieval

Popular tools: Pinecone, Weaviate, ChromaDB, pgvector for semantic memory. Redis for fast episodic state.
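The three-step lifecycle above can be sketched with a toy in-memory store standing in for a real vector database. The `MemoryStore` here ranks by naive word overlap rather than embedding similarity, purely to keep the example self-contained:

```python
# Illustrative memory lifecycle. A real system would embed text and query
# a vector DB (Pinecone, pgvector, etc.); word overlap stands in for that.

class MemoryStore:
    def __init__(self):
        self.episodes = []  # list of (summary, token set) pairs

    def save(self, summary):
        self.episodes.append((summary, set(summary.lower().split())))

    def retrieve(self, query, k=2):
        q = set(query.lower().split())
        ranked = sorted(self.episodes, key=lambda e: -len(q & e[1]))
        return [text for text, _ in ranked[:k]]

def run_with_memory(goal, store, step_fn):
    context = store.retrieve(goal)        # 1. load relevant past memories
    log = list(step_fn(goal, context))    # 2. append actions and results
    summary = f"goal={goal}; steps={len(log)}"
    store.save(summary)                   # 3. summarize, store for later runs
    return log
```

The important part is the shape of the flow: retrieval before the run, an episodic log during it, and a stored summary after it.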

Tool Execution Layer

Tools are just functions. The agent calls them, your code runs them, results come back as observations.

The flow is always the same:

LLM outputs a structured tool call with parameters
Your code parses it and executes the function
The result is returned as a string
That result becomes the next "Observation" in context
The loop continues

Common tool categories:

Information retrieval: Web search, database queries, vector lookups
Code execution: Python REPL, bash shell, Jupyter kernel
File I/O: Reading and writing documents, parsing PDFs and CSVs
External APIs: Slack, GitHub, Jira, email, and calendar integrations
Browser control: Navigating and interacting with web pages via Playwright or Puppeteer
Sub-agents: Spawning another agent to handle a delegated subtask

Tool definitions are passed to the LLM as structured schemas. The model produces valid, parseable tool calls, which your code intercepts, executes, and returns results from.
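The parse-execute-return step can be sketched as follows. The schema layout is hypothetical but mirrors the JSON-schema style most providers use for tool definitions:

```python
import json

# Illustrative tool registry plus the dispatch code that runs a parsed
# tool call. Schema fields and tool names here are made up for the example.

TOOLS = {
    "web_search": {
        "description": "Search the web and return top result snippets.",
        "parameters": {"query": {"type": "string"}},
        "fn": lambda query: f"results for: {query}",
    },
}

def execute_tool_call(raw_call):
    call = json.loads(raw_call)             # 1. parse the structured call
    fn = TOOLS[call["name"]]["fn"]          # 2. look up the function
    return str(fn(**call["arguments"]))     # 3. return the result as a string
```

In production you'd validate the arguments against the schema before executing, since models occasionally emit calls with missing or mistyped parameters.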

Multi-Agent Coordination

When one agent isn't enough, you compose multiple agents. This is where things get genuinely powerful and genuinely complicated.

Hierarchical Pattern: One orchestrator agent decomposes the goal and delegates to specialized worker agents. A research agent, an analysis agent, and a writing agent each do one thing well. The orchestrator coordinates them and assembles the final output.

Pipeline Pattern: Agents run sequentially. Output of Agent A feeds into Agent B, which feeds into Agent C. Works well for structured, predictable workflows where each stage has a clear input and output.

Debate / Critique Pattern: Two agents with opposing perspectives. One generates, the other critiques, and they loop until the output converges on something defensible. Significantly improves quality on complex reasoning tasks.
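Of the three, the pipeline pattern is the simplest to sketch: each agent is just a function from text to text, and stages compose. The stage implementations below are placeholders for real agents:

```python
# Pipeline pattern sketch: output of each stage feeds the next.
# Real stages would each be a full agent; these are trivial stand-ins.

def pipeline(stages, task):
    result = task
    for stage in stages:
        result = stage(result)
    return result

research = lambda t: t + " | findings"   # placeholder research agent
analyze  = lambda t: t + " | insights"   # placeholder analysis agent
write    = lambda t: t + " | report"     # placeholder writing agent
```

The hierarchical and debate patterns need more machinery (delegation, shared state, a convergence check), but they compose out of the same function-shaped agents.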

The Full Stack

A production agentic system has six layers:
User Interface: Web app, Slack bot, CLI, or API endpoint. How the user or system triggers the agent.

Agent Orchestration Framework: The code that runs the loop, manages tool calls, and coordinates agents. Popular choices:

LangChain: The largest ecosystem, great for RAG-heavy systems, can be verbose

LlamaIndex: Excels at document and data workflows
AutoGen: Best for multi-agent conversation patterns
CrewAI: Clean role-based mental model, good developer experience
Custom: When you know exactly what you need and don't want framework overhead

LLM Provider: Where the actual model calls happen. OpenAI, Anthropic, Google, or a locally hosted model, depending on your cost, latency, and data privacy requirements.

Memory Store: Short-term episodic state and long-term semantic retrieval. Pinecone, Weaviate, ChromaDB, and pgvector for vectors. Redis for fast session state.

Tool Layer: Where actual work happens. Web searches, code execution, API calls, file reads, and writes.

Observability: Every LLM call, every tool call, and every result needs to be logged and traceable. LangSmith and Langfuse are the main tools here. Without observability, debugging a broken agentic run is nearly impossible.
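Even without a dedicated platform, a thin tracing wrapper gets you most of the way. This sketch logs every tool call, its arguments, result, and latency into an in-memory trace (a real setup would ship these records to LangSmith, Langfuse, or your own store):

```python
import functools
import time

# Minimal observability sketch: decorate each tool so every call is
# recorded with its inputs, output, and latency.

TRACE = []

def traced(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        TRACE.append({
            "tool": fn.__name__,
            "args": args,
            "result": result,
            "ms": round((time.perf_counter() - start) * 1000, 2),
        })
        return result
    return wrapper

@traced
def web_search(query):
    # Stand-in tool; a real one would hit a search API.
    return f"results for: {query}"
```

When a run fails at step seven, the trace tells you exactly what the agent saw at steps one through six.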

The Hard Truth Nobody Talks About

Prompt brittleness at scale: A prompt that works in testing breaks on edge cases in production. Agentic systems amplify this; a bad decision at step two cascades through eight more steps.

Fix: Add validation steps. Have the agent confirm its plan before executing. Catch malformed decisions early.

Token budget management: A ten-step agentic run with retrieval can burn through 50K+ tokens. Expensive and slow.

Fix: Summarize episodic memory mid-run. Use smaller models for tool selection, larger models for reasoning. Cache tool results aggressively.

Loop detection: Agents get stuck, especially when a tool fails, and they keep retrying the same action.

Fix: Hard step limits. Track action history and refuse repeated identical calls. Exponential backoff on tool failures.
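The first two parts of that fix (a hard step limit and refusing repeated identical calls) fit in a small guard around the loop. This is an illustrative sketch, not a library API:

```python
# Loop-detection sketch: cap total steps and refuse an identical
# repeated tool call by hashing (tool name, sorted args).

def guarded_loop(next_action, execute, max_steps=8):
    history = set()
    for _ in range(max_steps):
        action = next_action()
        if action is None:
            return "done"
        key = (action["tool"], tuple(sorted(action["args"].items())))
        if key in history:
            return f"loop detected: repeated {action['tool']}"
        history.add(key)
        execute(action)
    return "step limit reached"
```

Exponential backoff on failing tools layers on top of this: retry the tool with increasing delays, and only surface the failure to the model after the retries are exhausted.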

Ambiguous goals: "Make my code better" is not a goal an agent can reliably pursue. Vague inputs produce unpredictable behavior.

Fix: Add a clarification step at the start. Have the agent restate the goal and get confirmation before planning.

Observability gaps: When something goes wrong, you need to know which step failed, why, and what the agent was thinking.

Fix: Log everything. Use a dedicated tracing tool. Never debug a blind agent.
