I Rewrote the Same AI Agent 4 Times. Here's the Architecture That Finally Stuck.
I built an AI agent that could classify customer queries, pull up the right data, and generate a helpful response — all in about 40 lines of LangChain code. My manager watched it handle a compound question about a return and a delivery change in the same message, and said: "Ship it."
So I did. It fell apart in production within a week.
I rebuilt it as a graph. That lasted three months before a 23-node, 47-edge spaghetti diagram made me question my career choices. I rebuilt it again as a state machine. That one stuck — and it taught me something I couldn't have learned from any tutorial.
Four rewrites in 18 months. Each time, I was convinced the architecture was right. Each time, production proved me wrong in a way the previous pattern couldn't handle. Having led this system through every migration, I've developed a specific view on when each pattern works, when it breaks, and — more importantly — why the industry is converging on a single answer.
This is the article I wish existed before I learned these lessons at the cost of sleep, oncall pages, and one particularly expensive API bill.
The Five Stages of AI Application Architecture
Before we dive in, let me be clear: this isn't a "which framework is best" article. This is about patterns. Frameworks implement patterns. Patterns outlive frameworks. LangChain may or may not survive 2027 — but the problems it solved (and the ones it couldn't) will define how we build for the next decade.
Stage 1: Deterministic Code — "Just Write If-Else"
Every AI application starts here. Most teams I've worked with would have been better served staying here longer than they did.
```python
def classify_intent(user_message: str) -> str:
    message = user_message.lower()
    if any(word in message for word in ["refund", "return", "money back"]):
        return "refund"
    elif any(word in message for word in ["track", "where is", "shipping"]):
        return "tracking"
    elif any(word in message for word in ["cancel", "stop", "don't want"]):
        return "cancellation"
    else:
        return "general"


def handle_customer_query(message: str) -> str:
    intent = classify_intent(message)
    if intent == "refund":
        return check_refund_eligibility(message)
    elif intent == "tracking":
        return fetch_tracking_info(message)
    elif intent == "cancellation":
        return process_cancellation(message)
    else:
        return "Let me connect you with a human agent."
```
This works. For exactly 47 intents and a product catalog that doesn't change weekly. The moment a customer writes "I bought this for my mom but she hated it and I need to also change the delivery on my other order", you're writing regex for compound intents and your classifier becomes a 2,000-line monster.
Where it breaks:
- Ambiguous input ("I want to return something" — return the product? Return to the homepage?)
- Compound intents (multiple requests in one message)
- Language variability (slang, typos, multilingual users)
- Maintenance cost scales linearly with every new edge case
A pattern I've observed repeatedly: If your problem has fewer than 50 well-defined intents and the input is structured, stay here. I've watched teams waste months migrating to LLM-based systems that performed worse than their hand-tuned rule engine — at 100x the cost per inference.
What forced the migration: Our intent taxonomy grew from 12 to 200+ categories in 6 months. Maintaining keyword lists had become a full-time job for two engineers. I made the call to switch to an LLM-based classifier, and the initial results were transformative.
Stage 2: The Single LLM Call — "Just Ask GPT"
This is where most teams start today. You replace 2,000 lines of regex with 10 lines and a prompt. The productivity gain is immediate and substantial.
```python
from openai import OpenAI

client = OpenAI()

def handle_customer_query(message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """You are a customer service agent for a retail company.
Help customers with refunds, tracking, cancellations, and general queries.
Be helpful, concise, and empathetic.""",
            },
            {"role": "user", "content": message},
        ],
    )
    return response.choices[0].message.content
```
And it handles the compound intent perfectly:
"I bought this for my mom but she hated it and I need to also change the delivery on my other order"
"I'm sorry to hear that! I can help with both. For the return, could you share the order number? And for the delivery change, which order would you like to update?"
Then you deploy:
Customer: "Cancel my order"
Agent: "Your order has been cancelled!"
(It didn't actually cancel anything. It just said it did.)
Customer: "What's the refund policy for electronics bought on sale during Black Friday
using an employee discount with a gift card?"
Agent: [Confidently hallucinates a policy that doesn't exist]
Customer: "Transfer $500 to account ending in 4829"
Agent: [Cheerfully attempts to help with what is clearly a social engineering attack]
Where it breaks:
- No actions: The LLM can say it cancelled your order. It can't actually call the API to do it.
- No memory: Each call is stateless. The LLM doesn't know what happened 2 messages ago unless you manually stuff the entire conversation history into context.
- Hallucination: Without access to real data, the LLM fills gaps with plausible fiction.
- No guardrails: It will happily attempt anything the user asks, including things it absolutely should not do.
- Cost at scale: Sending full conversation context with every call gets expensive fast.
What forced the next migration: The system needed to do things — look up orders, check eligibility, actually process refunds — not just talk about doing them. I needed an architecture that could orchestrate real actions, not just generate text.
Stage 3: Chains — "Connect the Pipes"
This is where LangChain entered the picture and became the most starred AI repo on GitHub. The insight: instead of one LLM call, build a pipeline.
```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_community.vectorstores import FAISS

llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings()

# Step 1: Load company knowledge into a vector store
vectorstore = FAISS.load_local("company_policies", embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Step 2: Build the chain
prompt = ChatPromptTemplate.from_template(
    """Answer the customer's question using ONLY the context below.
If the answer isn't in the context, say "I'll connect you with a specialist."

Context: {context}
Question: {question}
"""
)

# The classic RAG chain: Retrieve → Augment → Generate
rag_chain = (
    {"context": retriever, "question": lambda x: x}
    | prompt
    | llm
    | StrOutputParser()
)

response = rag_chain.invoke("What's the return policy for electronics on sale?")
```
Now the LLM answers based on actual company policy, not its training data. The RAG (Retrieval-Augmented Generation) pattern was the first real architecture pattern of the LLM era. And it solved the hallucination problem — mostly.
Chains introduced composition — the ability to snap together reusable components:
```python
# Chains for classification → routing → response
classify_chain = classify_prompt | llm | JsonOutputParser()
refund_chain = refund_prompt | llm | StrOutputParser()
tracking_chain = tracking_prompt | llm | StrOutputParser()

# Sequential pipeline
def handle_query(message):
    classification = classify_chain.invoke(message)  # Step 1: What kind of query?
    if classification["intent"] == "refund":         # Step 2: Route to specialist
        return refund_chain.invoke(message)
    elif classification["intent"] == "tracking":
        return tracking_chain.invoke(message)
```
Chains worked well for a class of problems. But they're fundamentally linear — data flows in one direction, like water through a pipe.
Where it breaks:
- No conditional logic within the chain: That `if classification["intent"]` block? That's your code routing between chains, not the chain itself being intelligent about routing.
- No loops: If the LLM generates a bad response, you can't send it back for a retry within the chain abstraction.
- No parallel execution: You can't retrieve from a vector store AND call an API simultaneously.
- Error recovery is your problem: If step 3 of 5 fails, the whole chain fails. No fallback, no retry, no graceful degradation.
- State is an afterthought: Each chain invocation is independent. Maintaining conversation context across multiple chain calls requires external state management.
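The error-recovery gap is concrete: because the chain abstraction has no retry vocabulary, every failure policy ends up as external plumbing you write around each invocation. A minimal sketch of that plumbing, where `chain` is any callable and the backoff numbers are purely illustrative:

```python
import time

def invoke_with_retry(chain, payload, retries=2, fallback=None):
    """The retry/fallback logic a chain can't express for itself:
    it lives entirely outside the chain, in your code."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return chain(payload)
        except Exception as e:  # real code would catch specific errors
            last_error = e
            time.sleep(0.05 * (2 ** attempt))  # simple exponential backoff
    if fallback is not None:
        return fallback(payload)  # graceful degradation, also your problem
    raise last_error
```

Multiply this by timeouts, partial results, and per-step fallbacks, and "error recovery is your problem" stops being an abstract complaint.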
The moment I knew chains weren't enough: A customer reported a problem that required (1) looking up their order, (2) checking inventory at a nearby store, (3) verifying their membership status, and (4) applying a conditional discount — all before generating a response. Some of those steps could run in parallel. Some depended on each other. The chain abstraction had no vocabulary for this. I began designing the graph-based replacement that week.
Stage 4: Graphs — "Model It Like You Think About It"
The mental model shifts here. Instead of thinking about your AI application as a pipeline, you think about it as a graph of decisions.
LangGraph (built on top of LangChain) formalized this. But the pattern exists independently of the framework — you could implement it with plain Python and a dictionary.
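To make the "plain Python and a dictionary" claim concrete, here's a minimal sketch of a graph executor: node functions that update state, a routing table that picks the next node, and a step cap so a cycle can't run forever. The node names and toy retry logic are invented for illustration:

```python
def run_graph(nodes, edges, state, start, max_steps=50):
    """Tiny graph executor: `nodes` maps names to state-updating
    functions, `edges` maps names to routers choosing the next node.
    `max_steps` guards against unbounded cycles."""
    current = start
    for _ in range(max_steps):
        state = {**state, **nodes[current](state)}   # node updates state
        current = edges[current](state)              # router picks next node
        if current is None:                          # None means END
            return state
    raise RuntimeError("max_steps exceeded -- possible infinite cycle")

# Toy two-node graph with a retry cycle
nodes = {
    "generate": lambda s: {"draft": f"answer v{s.get('tries', 0) + 1}",
                           "tries": s.get("tries", 0) + 1},
    "review":   lambda s: {"ok": s["tries"] >= 2},
}
edges = {
    "generate": lambda s: "review",
    "review":   lambda s: None if s["ok"] else "generate",  # loop back!
}
final = run_graph(nodes, edges, {}, "generate")
```

That's the whole pattern. Everything a framework adds on top is ergonomics: typed state, checkpointing, visualization.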
```python
from typing import TypedDict, Literal

from langgraph.graph import StateGraph, START, END


class AgentState(TypedDict):
    messages: list
    intent: str
    order_data: dict | None
    membership_tier: str | None
    needs_human: bool


def classify_intent(state: AgentState) -> AgentState:
    """Node 1: Understand what the customer needs."""
    response = llm.invoke(
        f"Classify this customer message into one of: "
        f"refund, tracking, cancellation, complex. "
        f"Message: {state['messages'][-1]}"
    )
    return {"intent": response.content.strip().lower()}


def fetch_order(state: AgentState) -> AgentState:
    """Node 2: Get order details from the database."""
    order_id = extract_order_id(state["messages"][-1])
    order = db.get_order(order_id)
    return {"order_data": order}


def check_membership(state: AgentState) -> AgentState:
    """Node 3: Check membership tier for discount eligibility."""
    customer_id = state["order_data"]["customer_id"]
    tier = membership_api.get_tier(customer_id)
    return {"membership_tier": tier}


def generate_response(state: AgentState) -> AgentState:
    """Node 4: Generate contextual response with all gathered data."""
    context = f"""
    Intent: {state['intent']}
    Order: {state['order_data']}
    Membership: {state['membership_tier']}
    """
    response = llm.invoke(
        f"Given this context:\n{context}\n"
        f"Respond to: {state['messages'][-1]}"
    )
    return {"messages": [response.content]}


def route_by_intent(state: AgentState) -> Literal["fetch_order", "generate_response", "escalate"]:
    """Conditional edge: decide which path to take."""
    if state["intent"] == "complex":
        return "escalate"
    elif state["intent"] in ("refund", "tracking", "cancellation"):
        return "fetch_order"
    else:
        return "generate_response"


def should_check_membership(state: AgentState) -> Literal["check_membership", "generate_response"]:
    """Conditional edge: only check membership for refunds."""
    if state["intent"] == "refund" and state["order_data"]:
        return "check_membership"
    return "generate_response"


# Build the graph
graph = StateGraph(AgentState)
graph.add_node("classify", classify_intent)
graph.add_node("fetch_order", fetch_order)
graph.add_node("check_membership", check_membership)
graph.add_node("generate_response", generate_response)
graph.add_node("escalate", lambda s: {"needs_human": True})

graph.add_edge(START, "classify")
graph.add_conditional_edges("classify", route_by_intent)
graph.add_conditional_edges("fetch_order", should_check_membership)
graph.add_edge("check_membership", "generate_response")
graph.add_edge("generate_response", END)
graph.add_edge("escalate", END)

agent = graph.compile()
```
This is a fundamentally different model. The graph says: classify first, then take different paths depending on what you find, gather exactly the data you need, and generate a response with full context.
And here's the key innovation — cycles:
```python
def quality_check(state: AgentState) -> Literal["generate_response", "end"]:
    """If the response quality is low, loop back and try again."""
    if state.get("retry_count", 0) >= 3:
        return "end"
    score = evaluate_response_quality(state["messages"][-1])
    if score < 0.7:
        return "generate_response"  # Loop back!
    return "end"

# The path map translates the router's return values into real targets
graph.add_conditional_edges(
    "generate_response",
    quality_check,
    {"generate_response": "generate_response", "end": END},
)
```
A chain can't do this. It can't say "that response wasn't good enough, try again" without you wrapping it in external loop logic. A graph handles it natively.
What graphs provide that chains cannot:
| Capability | Chain | Graph |
|---|---|---|
| Sequential steps | Yes | Yes |
| Conditional routing | External code | Native (conditional edges) |
| Parallel execution | No | Yes (fan-out/fan-in) |
| Retry loops | External code | Native (cycles) |
| State persistence | Manual | Built-in (checkpointing) |
| Human-in-the-loop | Difficult | First-class support |
| Debuggability | Print statements | State inspection at every node |
But graphs have their own failure mode. And this is something nobody talks about in the tutorials.
Where it breaks:
- Implicit state transitions: In a graph, any node can technically transition to any other node via conditional edges. As the graph grows, the number of possible paths through the system explodes combinatorially. I've seen production graphs where nobody on the team could tell you all the paths a request might take.
- Error states are afterthoughts: What happens when `fetch_order` returns `None` because the API is down? Your conditional edge routes to... where? Each failure mode needs its own routing logic, and graphs don't enforce that you've handled them all.
- Concurrency is unstructured: Parallel nodes are powerful, but without explicit synchronization points, you get race conditions in state updates.
- Testing becomes integration testing: You can't unit test a node meaningfully without setting up the full state context. Every test is inherently an integration test.
The moment I knew graphs weren't enough: Our agent graph had grown to 23 nodes and 47 edges. A new engineer joining the team asked me to walk them through the flow. I opened the visualization and realized I couldn't confidently trace the path for a refund request that combined a membership discount, a partial return, and an API timeout — without actually running it. If the system's architect can't reason about the system's behavior from the diagram, the architecture has failed at its primary job. I started sketching the state machine replacement that afternoon.
Stage 5: Stateful Orchestration — "Controlled Autonomy"
This is where the industry is heading in 2026, and it's the pattern I wish I'd started with.
The core insight: an AI agent isn't a pipeline or a graph. It's a system that exists in a known state at every moment, and transitions between states through well-defined rules.
This isn't new computer science. Finite state machines are a 1950s concept. But applying them to LLM orchestration gives you something neither chains nor graphs provide: you always know where you are, how you got there, and where you can go next.
```python
import asyncio
from dataclasses import dataclass, field
from enum import Enum
from typing import Any


class AgentPhase(Enum):
    IDLE = "idle"
    UNDERSTANDING = "understanding"
    GATHERING = "gathering"
    REASONING = "reasoning"
    REVIEWING = "reviewing"
    AWAITING_HUMAN = "awaiting_human"
    RESPONDING = "responding"
    ERROR = "error"


VALID_TRANSITIONS = {
    AgentPhase.IDLE: {AgentPhase.UNDERSTANDING},
    AgentPhase.UNDERSTANDING: {AgentPhase.GATHERING, AgentPhase.ERROR},
    AgentPhase.GATHERING: {AgentPhase.REASONING, AgentPhase.ERROR},
    AgentPhase.REASONING: {AgentPhase.REVIEWING, AgentPhase.ERROR},
    AgentPhase.REVIEWING: {AgentPhase.RESPONDING, AgentPhase.REASONING, AgentPhase.AWAITING_HUMAN},
    AgentPhase.AWAITING_HUMAN: {AgentPhase.RESPONDING, AgentPhase.REASONING},
    AgentPhase.RESPONDING: {AgentPhase.IDLE},
    AgentPhase.ERROR: {AgentPhase.UNDERSTANDING, AgentPhase.RESPONDING},
}


class InvalidTransitionError(Exception):
    pass


@dataclass
class AgentState:
    phase: AgentPhase = AgentPhase.IDLE
    messages: list = field(default_factory=list)
    intent: str | None = None
    gathered_data: dict = field(default_factory=dict)
    response_draft: str | None = None
    confidence: float = 0.0
    retry_count: int = 0
    max_retries: int = 3
    error: str | None = None

    def transition_to(self, new_phase: AgentPhase) -> None:
        if new_phase not in VALID_TRANSITIONS.get(self.phase, set()):
            raise InvalidTransitionError(
                f"Cannot transition from {self.phase.value} to {new_phase.value}. "
                f"Valid transitions: {[p.value for p in VALID_TRANSITIONS[self.phase]]}"
            )
        self.phase = new_phase


class CustomerServiceAgent:
    def __init__(self, llm, tools: dict[str, Any]):
        self.llm = llm
        self.tools = tools
        self.state = AgentState()

    async def handle_message(self, message: str) -> str:
        self.state.messages.append({"role": "user", "content": message})
        self.state.transition_to(AgentPhase.UNDERSTANDING)

        while self.state.phase != AgentPhase.IDLE:
            match self.state.phase:
                case AgentPhase.UNDERSTANDING:
                    await self._understand()
                case AgentPhase.GATHERING:
                    await self._gather()
                case AgentPhase.REASONING:
                    await self._reason()
                case AgentPhase.REVIEWING:
                    await self._review()
                case AgentPhase.AWAITING_HUMAN:
                    return "[ESCALATED] Awaiting human review."
                case AgentPhase.RESPONDING:
                    return self._respond()
                case AgentPhase.ERROR:
                    self._handle_error()

        return "How else can I help?"

    async def _understand(self):
        try:
            classification = await self.llm.classify(self.state.messages)
            self.state.intent = classification["intent"]
            self.state.confidence = classification["confidence"]
            self.state.transition_to(AgentPhase.GATHERING)
        except Exception as e:
            self.state.error = str(e)
            self.state.transition_to(AgentPhase.ERROR)

    async def _gather(self):
        try:
            tools_needed = self._determine_tools(self.state.intent)
            results = await asyncio.gather(
                *[self.tools[t].execute(self.state) for t in tools_needed],
                return_exceptions=True,
            )
            for tool_name, result in zip(tools_needed, results):
                if isinstance(result, Exception):
                    self.state.gathered_data[tool_name] = {"error": str(result)}
                else:
                    self.state.gathered_data[tool_name] = result
            self.state.transition_to(AgentPhase.REASONING)
        except Exception as e:
            self.state.error = str(e)
            self.state.transition_to(AgentPhase.ERROR)

    async def _reason(self):
        try:
            draft = await self.llm.generate_response(
                messages=self.state.messages,
                intent=self.state.intent,
                context=self.state.gathered_data,
            )
            self.state.response_draft = draft
            self.state.transition_to(AgentPhase.REVIEWING)
        except Exception as e:
            self.state.error = str(e)
            self.state.transition_to(AgentPhase.ERROR)

    async def _review(self):
        quality = await self._evaluate_quality(self.state.response_draft)
        guardrail_pass = self._check_guardrails(self.state.response_draft)

        if not guardrail_pass:
            self.state.transition_to(AgentPhase.AWAITING_HUMAN)
        elif quality < 0.7 and self.state.retry_count < self.state.max_retries:
            self.state.retry_count += 1
            self.state.transition_to(AgentPhase.REASONING)
        else:
            self.state.transition_to(AgentPhase.RESPONDING)

    def _respond(self) -> str:
        response = self.state.response_draft
        self.state.messages.append({"role": "assistant", "content": response})
        self.state.transition_to(AgentPhase.IDLE)
        return response

    def _handle_error(self):
        if self.state.retry_count < self.state.max_retries:
            self.state.retry_count += 1
            self.state.error = None
            self.state.transition_to(AgentPhase.UNDERSTANDING)
        else:
            self.state.response_draft = (
                "I'm having trouble processing your request. "
                "Let me connect you with a human agent."
            )
            self.state.transition_to(AgentPhase.RESPONDING)
```
Read that code carefully. Notice what's not there:
- No spaghetti of conditional edges
- No implicit paths through the system
- No unhandled error states
And notice what is there:
- Every valid transition is declared upfront. You can look at `VALID_TRANSITIONS` and immediately see every possible flow in the system. Try doing that with a 47-edge graph.
- Invalid transitions throw exceptions. If a bug tries to go from `REVIEWING` directly to `GATHERING`, it fails loudly instead of silently producing wrong results.
- The state is a complete snapshot. At any point, you can serialize `AgentState`, log it, inspect it in a debugger, or resume from it after a crash. This is what LangGraph calls "checkpointing" — but here it's a natural consequence of the architecture, not a framework feature bolted on.
- Error handling is a first-class state, not an afterthought. The `ERROR` phase has its own transition rules, its own retry logic, and its own graceful degradation path.
Why This Matters: The Real-World Failure Modes
I'm going to be specific about three production failures I diagnosed, because vague handwaving about "production challenges" helps nobody. These are the incidents that directly motivated the architecture decisions above.
Failure 1: The Infinite Loop — $47 in 3 Minutes
Our graph-based agent had a `quality_check` node that could cycle back to `generate_response`. In staging, it worked fine. In production, an edge-case query caused the quality evaluator to consistently score the response at 0.69 — just below the 0.7 threshold. The agent entered an infinite retry loop, burning $47 in API calls in 3 minutes before our rate limiter killed it.
When I investigated, the root cause was clear: the graph architecture had no structural mechanism to prevent unbounded cycles. We could add a counter — and we did, as a patch — but it was duct tape on a design flaw. The graph didn't know it was looping. It was just following edges.
The state machine solution I implemented: `retry_count` is part of the state itself. `max_retries` is enforced at the transition level, not as an afterthought check inside a node. It is architecturally impossible for the system to loop more than N times. The constraint is declared, not hoped for.
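A minimal sketch of what "enforced at the transition level" means, using a stripped-down stand-in for the full agent state (class and phase names here are illustrative, and the full `VALID_TRANSITIONS` table is omitted for brevity): the guard lives inside `transition_to` itself, so no node can loop past the budget, forgetfully or otherwise.

```python
from dataclasses import dataclass

class RetryLimitExceeded(Exception):
    pass

@dataclass
class LoopGuardedState:
    phase: str = "reviewing"
    retry_count: int = 0
    max_retries: int = 3

    def transition_to(self, new_phase: str) -> None:
        # The loop bound is part of the transition rule, not a node's
        # courtesy check: re-entering "reasoning" past the budget
        # raises, so an unbounded retry cycle is structurally impossible.
        if new_phase == "reasoning":
            if self.retry_count >= self.max_retries:
                raise RetryLimitExceeded(
                    f"refusing retry #{self.retry_count + 1}"
                )
            self.retry_count += 1
        self.phase = new_phase
```

Contrast with the $47 incident: the graph's counter was a patch inside a node; here the invariant is unskippable because every path into the retry state goes through the same guard.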
Failure 2: The Ghost State — Confident and Wrong
Our graph agent handled multi-turn conversations by appending to a message list. One morning I was paged because a customer had been told their order would arrive Tuesday — but the tracking API had never been called. A tool call had failed silently (returned `None` instead of throwing), and the agent proceeded to `generate_response` with incomplete context. The `None` propagated silently through three subsequent nodes. The LLM, lacking real data, hallucinated a plausible delivery date.
I traced this through our logs and found that the graph had no mechanism to enforce data completeness between nodes. Any node could pass any state forward — there were no contracts between stages.
The state machine solution I designed: The GATHERING phase explicitly validates tool results before any transition fires. The transition from GATHERING → REASONING requires all expected data fields to be non-null. If a tool fails, the system routes to ERROR state — it cannot silently arrive at REASONING with missing data. The contract is enforced by the architecture, not by the developer remembering to add a null check.
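Here's a hedged sketch of that contract as a transition guard; `REQUIRED_FIELDS` and the field names are invented for illustration, but the shape is the point: the check fires on the transition, so missing data cannot slip past it.

```python
# Per-intent contract: what GATHERING must produce before REASONING
# may run. Field names here are illustrative.
REQUIRED_FIELDS = {
    "tracking": ["order", "tracking_info"],
    "refund": ["order", "membership_tier"],
}

class IncompleteDataError(Exception):
    pass

def guard_gather_to_reason(intent: str, gathered: dict) -> str:
    """Contract check on the GATHERING -> REASONING transition: every
    expected field must be present and non-null, or the transition is
    refused and the caller routes to ERROR instead of letting a silent
    None reach the LLM."""
    missing = [f for f in REQUIRED_FIELDS.get(intent, [])
               if gathered.get(f) is None]
    if missing:
        raise IncompleteDataError(f"missing data for {intent}: {missing}")
    return "reasoning"
```

The graph version of this agent had nowhere natural to put that check; any node could pass any state forward.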
Failure 3: The Phantom Conversation — Context Lost at the Worst Moment
A customer started a refund request, provided their order number, verified their identity, and was halfway through the eligibility check — then got disconnected. When they came back 2 hours later, the agent treated it as a brand new conversation. All context was gone. The customer had to start over from scratch — re-explain the issue, re-provide the order number, re-verify identity. For a loyal customer already frustrated enough to request a refund, this was the last thing they needed.
The root cause: I'd built the graph to be stateless across invocations. State lived in memory for the duration of a single request. There was no persistence layer because the graph abstraction didn't naturally surface the need for one.
The state machine solution: Because AgentState is a single, serializable dataclass, persisting it is trivial — serialize to Redis on every transition, deserialize when the customer returns. The system resumes from exactly where it left off: same phase, same gathered data, same retry count. This wasn't an add-on feature. It was a natural consequence of the architecture. The state machine's design invites persistence in a way that graphs don't.
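A sketch of that persistence layer, with an in-memory dict standing in for Redis so the example is self-contained (in production you'd swap `self.kv` for a redis-py client's `get`/`set`, likely with a TTL); the class and field names are illustrative, not the article's exact schema:

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class ConversationState:
    phase: str = "idle"
    gathered_data: dict = field(default_factory=dict)
    retry_count: int = 0

class StateStore:
    """In-memory stand-in for Redis: one whole-snapshot write per
    transition, one read to resume a conversation exactly where it
    left off."""
    def __init__(self):
        self.kv = {}

    def save(self, conversation_id: str, state: ConversationState) -> None:
        # Because the state is a plain dataclass, serialization is
        # asdict + json -- no framework checkpointer required.
        self.kv[conversation_id] = json.dumps(asdict(state))

    def load(self, conversation_id: str) -> ConversationState:
        raw = self.kv.get(conversation_id)
        return ConversationState(**json.loads(raw)) if raw else ConversationState()
```

When the disconnected customer returns two hours later, `load("conv-42")` hands back the same phase, gathered data, and retry count they left with.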
When to Use What: A Decision Framework
Don't let this article convince you to over-engineer. Here's the decision framework I use with my team:
| Signal | Use This | Not This |
|---|---|---|
| Simple Q&A chatbot | Single LLM call | Graph |
| RAG with one data source | Chain | State machine |
| Multi-step with branching | Graph | Chain |
| Multi-turn, persistent | State machine | Stateless graph |
| Needs audit trail | State machine | Anything implicit |
| Fewer than 50 well-defined intents | Deterministic code | LLM (seriously) |
| Human-in-the-loop required | State machine or LangGraph with checkpointing | Vanilla chain |
| Weekend hackathon | Whatever ships | Architecture astronautics |
The uncomfortable truth: Most AI applications in production today are overarchitected. A well-written chain with good error handling beats a poorly-implemented state machine every time. Adopt the simplest pattern that handles your actual failure modes, not the ones you imagine you might have someday.
For context on what the state machine migration delivered: the category of production incidents I'd classify as "agent misbehavior" — wrong responses, silent failures, unrecoverable states — dropped significantly. But the bigger win was debuggability. When a conversation goes wrong in a chain or graph, you're reading logs and guessing. In a state machine, you pull the serialized state, see exactly which phase it's in, what data it gathered, and which transition it attempted. Mean time to diagnose a failed conversation went from hours of log-diving to minutes of state inspection. New engineers could trace the full system flow within their first week instead of their first month.
Where It's All Going: The Convergence I'm Betting On
The five patterns in this article aren't competing alternatives. They're points on a convergence curve — and the destination is becoming clear.
1. Every framework is converging on state machines, whether they admit it or not. LangGraph's latest versions emphasize StateGraph with typed dictionaries, conditional edges, and checkpointing. The API says "graph." The mental model is increasingly state-machine-like. CrewAI added explicit task states. OpenAI's Agents SDK introduced "handoffs" — which are, structurally, state transitions between agent-scoped machines. The industry is independently arriving at the same answer through different doors.
2. The unit of AI architecture is shifting from "the agent" to "the agent system." The next evolution isn't a better single-agent pattern — it's multiple state machines composed into a larger system. Each agent owns a bounded domain (refunds, tracking, membership), modeled as its own state machine, coordinated by an orchestrator that itself is a state machine. This is the multi-agent pattern, and it mirrors how we've designed microservices for a decade. The hard problems — consistency, handoff protocols, shared state — are the same ones distributed systems engineers have been solving since the 1990s. The AI community is rediscovering them.
3. MCP (Model Context Protocol) is solving the right problem at the right layer. Anthropic's open protocol standardizes how agents discover and invoke tools. This is significant because it decouples the agent's control logic (state machine) from its capabilities (tools available via MCP). Your state machine doesn't need to know if it's calling a REST API, a Postgres query, or another AI model — MCP abstracts it. This separation is what allows agent architectures to evolve independently of their integration layer. Every major platform — Claude, Cursor, GitHub Copilot, OpenAI — has adopted it within a year of release. That speed of adoption tells you something.
4. The winning pattern is the one that looks like regular software engineering. This is the meta-observation that ties everything together. The frameworks winning in production aren't the ones with the cleverest abstractions — they're the ones that feel like writing normal backend code with explicit state, typed interfaces, error handling, and tests. The "magic" of early LangChain — where you composed opaque runnables and hoped the output was right — is being replaced by the deliberate, inspectable patterns that backend engineering has refined for decades. We aren't inventing new computer science. We're applying the computer science we already have to a new problem domain.
Build This, Not That
If you're starting a new AI agent today, here's my concrete advice:
Start with Stage 3 (chains) if you're prototyping. Get the LLM calls right, the prompts tuned, and the retrieval working. Don't optimize for failure modes you haven't encountered yet.
Graduate to Stage 5 (state machine) when you hit your first production incident that a chain couldn't handle. Skip Stage 4 (ad-hoc graphs) entirely — go directly to explicit states and transitions. You'll save yourself a rewrite.
Invest in observability from day one. Whichever pattern you use, log the state at every transition. The number one complaint I hear from teams debugging AI agents is: "I have no idea what the agent was thinking." State machines solve this structurally — every transition is an event you can log, trace, and replay.
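A minimal sketch of transition-level logging, as a stand-alone illustrative class rather than the full agent above: the sink is pluggable (print, a structured logger, a queue producer), and every transition produces exactly one JSON event you can later replay.

```python
import json
import time

class ObservableState:
    """Every transition emits one structured event through `sink`;
    with the event stream you can reconstruct a conversation's exact
    path through the machine."""
    def __init__(self, sink):
        self.phase = "idle"
        self.sink = sink  # any callable taking one string

    def transition_to(self, new_phase: str, **context) -> None:
        self.sink(json.dumps({
            "ts": time.time(),
            "from": self.phase,
            "to": new_phase,
            **context,  # optional extras: intent, retry_count, ...
        }))
        self.phase = new_phase
```

This is the structural answer to "I have no idea what the agent was thinking": the thinking is a list of declared transitions, each one logged.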
Closing Thought
Every pattern in this article is a response to a real production failure. The industry didn't move from chains to graphs to state machines because conference speakers needed new slides — it moved because things broke, customers were affected, and engineers needed more control than the previous abstraction could provide.
Four rewrites later, here's what I know: the trajectory of AI application architecture is not toward more abstraction. It's toward more clarity. More explicitness. More of the properties that have made backend systems reliable for decades — typed state, declared transitions, enforced contracts, observable behavior.
From implicit to explicit. From magical to inspectable. From "it works in the demo" to "it works at 3 AM on Black Friday with a partial API outage and a new engineer on call."
That's not a step backward. That's engineering catching up to its own ambition.
If this resonated, follow me here or connect on LinkedIn.