Dishant Sethi

Posted on Jun 8 • Originally published at prodinit.com

AI Agents in Production: 7 Architecture Mistakes That Sink Your System

#ai #rag #llm #agents

Key Takeaways

52% of enterprises deployed AI agents in production in 2026 — most hit at least one of these seven architecture mistakes before stabilizing (McKinsey State of AI, 2026)

The #1 mistake is the god agent: one agent handling too many tasks, causing hallucinations and unpredictable behavior that scale with complexity

Stateless agents look fine in demos and fail silently in production when sessions span more than a few turns

Missing tool-call guardrails is the fastest path to an unauthorized external action your team will spend days explaining

Unbounded agent loops have a documented cost: teams have burned thousands in API credits overnight from a single recursive loop triggered by a malformed tool response

AI agent production mistakes cluster around seven architecture decisions: task decomposition, memory strategy, tool-call guardrails, observability, evaluation pipelines, cost controls, and human escalation paths. Most teams discover these gaps in production, not in staging — and the systems that survive are the ones where all seven were designed in before the first deploy.

Why Production AI Agents Fail Differently Than Demos

Demos lie. A demo runs for 30 seconds, processes a happy-path input, produces a clean output, and everyone applauds. Production runs for 30 days, handles inputs nobody anticipated, hits API rate limits, encounters malformed tool responses, and keeps running — or stops without telling you.

The gap between a working demo and a stable production agent system is not a gap in model capability. It is a gap in architecture. The model that produced the demo output is the same model that hallucinates tool arguments in production. What changed is the system around it: guardrails that weren't added, observability that wasn't wired in, a memory strategy that was never designed, an escalation path that was never built.

52% of enterprises had deployed AI agents in production as of 2026. Prodinit's engineering team and the production teams we've worked with encountered these seven mistakes (either in their own deployments or in systems we audited) and designed around them before scaling. The teams still debugging production incidents at 3 AM mostly skipped that second step.

Mistake 1: The God Agent

What it is: A single agent is assigned every task in a workflow — it retrieves data, drafts responses, calls external APIs, validates outputs, and triggers downstream systems, all in one prompt loop. It is the LLM equivalent of a 2,000-line function.

Why it happens: The demo worked. A single model call with a long system prompt produced a coherent output for a controlled input. The natural next step was adding more tools and more instructions to the same agent rather than decomposing the problem.

How to detect it: Your system prompt exceeds 3,000 tokens. The agent is registered with more than 6–7 tools. Hallucination rate increases non-linearly as task complexity grows. Latency spikes on simple requests because the model navigates a bloated context.

The fix: Decompose into an orchestrator agent that routes tasks and specialized sub-agents that each own one domain.

Before: One agent with 12 tools and a 4,000-token system prompt handling inbound requests, CRM lookups, response drafting, ticket updates, and Slack notifications.

After (LangGraph pattern):

from langgraph.graph import StateGraph

# AgentState (a TypedDict) and the node functions below are defined elsewhere
graph = StateGraph(AgentState)
graph.add_node("router", route_intent)               # routes only, no tools
graph.add_node("crm_agent", crm_lookup_agent)        # 2 tools, narrow context
graph.add_node("draft_agent", response_draft_agent)  # 1 tool, narrow context
graph.add_node("ticket_agent", ticket_update_agent)  # 3 tools, narrow context

Each worker agent has a 300–500 token system prompt and a single responsibility. The orchestrator knows nothing about tool details — it only routes. Hallucination rates drop because context windows stay within the model's reliable operating range.
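To make the "orchestrator only routes" idea concrete, here is a minimal sketch of a routing decision in plain Python. The keyword map and the pick_worker name are assumptions for illustration; a real router would more likely use a small classifier or a constrained LLM call, and the returned string would feed a conditional edge in the graph.

```python
from typing import TypedDict


class AgentState(TypedDict):
    message: str


# Hypothetical keyword map, not from the original system.
INTENT_KEYWORDS = {
    "crm_agent": ("account", "customer", "contact"),
    "ticket_agent": ("ticket", "issue", "bug"),
}


def pick_worker(state: AgentState) -> str:
    """Return the name of the worker node that should handle the message."""
    text = state["message"].lower()
    for node, keywords in INTENT_KEYWORDS.items():
        if any(word in text for word in keywords):
            return node
    # Default to the drafting agent, which has no write-capable tools here.
    return "draft_agent"
```

The router never sees tool schemas. It only maps an input to a node name, which keeps its prompt (or rule set) small and easy to test.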

Mistake 2: No Memory Strategy

What it is: Stateless agents reset entirely between turns or sessions. Every invocation starts from scratch with no awareness of prior context, user preferences, or previous decisions made in the same workflow.

Why it happens: The MVP didn't need it. A single-turn agent — "summarize this document" — has no session concept. When the same codebase is extended to multi-turn workflows, nobody adds the missing memory layer because the agent technically still runs.

How to detect it: Users repeatedly re-state context the agent should know. Long-running workflows fail when they hit token limits because all prior state is crammed into the context window. Agent decisions in step 8 contradict decisions made in step 2.

The fix: Design three memory tiers before writing agent logic:

In-context memory: Current conversation history and task state, managed via a structured state object (LangGraph's TypedDict state). Use for data that must be in the active prompt.
Semantic memory: Long-term user facts and preferences, stored in a vector database and retrieved via similarity search. Use for anything that won't fit in context.
Episodic memory: Prior session summaries and decision logs, stored by session ID. Use for audit trails and session continuity.

Before: Agent receives the full conversation history as a growing context window until it hits the model's context limit (128k tokens in this example) and starts truncating or hallucinating.

After: A memory manager summarizes completed subtask state into external storage after each milestone. New turns retrieve the relevant summary plus the last 3–5 turns, keeping the context window stable regardless of session length.
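A rough sketch of that "summary plus last few turns" pattern follows. It simplifies the description above: it folds old turns into a rolling summary instead of summarizing at milestones, keeps everything in memory rather than in external storage, and uses a placeholder summarize function where a real system would call an LLM. All names are hypothetical.

```python
from dataclasses import dataclass, field


def summarize(existing: str, turn: str) -> str:
    """Placeholder summarizer: keeps a short snippet of each old turn.

    A real system would call an LLM here; this stand-in is an assumption.
    """
    snippet = turn[:40]
    return f"{existing} | {snippet}" if existing else snippet


@dataclass
class SessionMemory:
    """Keeps a bounded summary plus the most recent turns for one session."""

    max_recent_turns: int = 4
    max_summary_chars: int = 500
    summary: str = ""
    recent_turns: list[str] = field(default_factory=list)

    def add_turn(self, turn: str) -> None:
        self.recent_turns.append(turn)
        # Fold turns that fall out of the window into the summary.
        while len(self.recent_turns) > self.max_recent_turns:
            oldest = self.recent_turns.pop(0)
            merged = summarize(self.summary, oldest)
            # Cap the summary so the prompt size stays bounded.
            self.summary = merged[-self.max_summary_chars:]

    def build_context(self) -> str:
        """Return the text to place in the prompt for the next turn."""
        parts = []
        if self.summary:
            parts.append(f"Summary of earlier turns: {self.summary}")
        parts.extend(self.recent_turns)
        return "\n".join(parts)
```

Because both the summary and the recent-turn window are capped, the context built for each turn stays roughly constant in size no matter how long the session runs.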

Mistake 3: Missing Tool-Call Guardrails

What it is: The agent can call any tool at any time with any arguments it generates — including tools that write to external systems, spend money, or contact third parties — without validation or confirmation.

Why it happens: Tools are added incrementally. First a read-only tool, then a write tool, then an external API. No single addition seemed dangerous, and adding a confirmation step felt like it would break the autonomous flow the demo promised.

How to detect it: Your agent has write-capable tools accessible without additional validation. Tool arguments are passed directly from LLM output without schema validation. You cannot produce a log showing every external action the agent took in a given session.

The fix: Apply a three-layer guardrail pattern:

Schema validation: Validate every tool-call argument against a strict schema before execution. Reject calls with missing required fields or out-of-range values before the tool runs.
Action classification: Tag every tool as read, write, or external. Apply different confirmation policies per class. Read tools run automatically; write tools validate against business rules; external API calls with financial or communication effects require explicit confirmation.
Role-scoped access: Pass only the tools relevant to the current agent's role and the current user's permission level.

def get_tools_for_role(role: str) -> list[Tool]:
    base_tools = [search_knowledge_base, get_ticket_status]
    if role == "admin":
        return base_tools + [update_ticket, send_notification]
    return base_tools  # regular users: read-only
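The first two layers, schema validation and action classification, can be sketched in the same spirit. The registry below is a hypothetical stand-in: the tool names come from the example above, but the argument names, types, and the ActionClass enum are assumptions, and a production system would more likely use a schema library than hand-rolled type checks.

```python
from enum import Enum


class ActionClass(Enum):
    READ = "read"
    WRITE = "write"
    EXTERNAL = "external"


# Hypothetical registry: action class plus required argument names and types.
TOOL_REGISTRY = {
    "get_ticket_status": (ActionClass.READ, {"ticket_id": int}),
    "update_ticket": (ActionClass.WRITE, {"ticket_id": int, "status": str}),
    "send_notification": (ActionClass.EXTERNAL, {"channel": str, "text": str}),
}


def validate_call(tool_name: str, args: dict) -> ActionClass:
    """Reject unknown tools and malformed arguments before anything runs."""
    if tool_name not in TOOL_REGISTRY:
        raise ValueError(f"unknown tool: {tool_name}")
    action_class, schema = TOOL_REGISTRY[tool_name]
    missing = schema.keys() - args.keys()
    extra = args.keys() - schema.keys()
    if missing or extra:
        raise ValueError(
            f"bad arguments for {tool_name}: "
            f"missing={sorted(missing)} extra={sorted(extra)}"
        )
    for name, expected_type in schema.items():
        if not isinstance(args[name], expected_type):
            raise ValueError(f"{name} must be {expected_type.__name__}")
    return action_class


def needs_confirmation(action_class: ActionClass) -> bool:
    """External actions require explicit confirmation; reads and writes do not."""
    return action_class is ActionClass.EXTERNAL
```

Running validate_call on every LLM-generated tool call, before dispatch, means a hallucinated argument fails fast with a clear error instead of reaching the external system.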

Mistake 4: No Observability

What it is: You cannot reconstruct what the agent did, why it did it, what tools it called with what arguments, or where it went wrong — in real time or after the fact.

Why it happens: Observability is treated as infrastructure work to be done after the "real" AI work is complete. In demos, you watch the output stream. In production, thousands of sessions run concurrently and something fails in session 7,312.

How to detect it: When a customer reports a wrong output, you cannot trace the exact tool calls and model decisions that produced it. You have no visibility into token usage at the session level. There is no alert when an agent session takes longer than expected or calls a tool an unusual number of times.

The fix: Instrument every layer at build time, not after an incident. The minimum instrumentation surface:

Trace-level: Every agent invocation gets a trace ID. Log the input, model parameters, every tool call with arguments and response, every LLM call with token count, and the final output — all linked by trace ID.
Span-level: Each tool call is a child span with timing, success/failure status, and serialized arguments.
Metric-level: Token cost per session, tool call frequency by tool name, error rate by agent node, average session duration.

LangSmith, Langfuse, and Arize Phoenix provide out-of-the-box instrumentation for LangGraph systems:

from langsmith import traceable

@traceable(run_type="chain", name="crm_lookup_agent")
def crm_lookup_agent(state: AgentState) -> AgentState:
    # LangChain/LangGraph calls and nested @traceable functions invoked here are recorded as child runs
    ...

Set alerts on anomalous tool call frequency (more than N calls to any single tool in one session) and session cost thresholds before the first production deploy.
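The tool-call frequency alert can start as something as simple as a per-session counter. This is a minimal sketch with a hypothetical threshold and class name; in practice the counts would be emitted as metrics to your observability backend rather than checked in process.

```python
from collections import Counter

MAX_CALLS_PER_TOOL = 10  # hypothetical threshold; tune per tool


class ToolCallMonitor:
    """Counts tool calls in one session and flags tools that exceed a limit."""

    def __init__(self, limit: int = MAX_CALLS_PER_TOOL) -> None:
        self.limit = limit
        self.calls = Counter()

    def record(self, tool_name: str) -> bool:
        """Record one call; return True if the tool is now over the limit."""
        self.calls[tool_name] += 1
        return self.calls[tool_name] > self.limit

    def anomalies(self) -> list[str]:
        """Names of tools called more often than the limit, sorted."""
        return sorted(name for name, n in self.calls.items() if n > self.limit)
```

Calling record before each tool dispatch gives you a cheap hook for raising an alert or halting the session when a single tool is being hammered.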

Mistake 5: No Eval Loop

What it is: The agent ships, and its behavior is validated through production incidents rather than systematic evaluation. Regressions from model updates, prompt changes, or new tool versions are discovered by customers, not caught by a test suite.

Why it happens: Agents are harder to evaluate than deterministic software. The same input can produce different outputs. Writing evals feels uncertain, and teams postpone it until after launch — which means it rarely happens before the first regression.

How to detect it: You changed the system prompt and deployed without running structured tests. A model version upgrade is treated as a "should be fine" event. You have no golden dataset. Customer-reported bugs cannot be mapped to specific eval failures because there are no evals.

The fix: Build a four-layer eval suite before deploying:

Unit evals: Does the agent route correctly for known inputs? Does it refuse out-of-scope requests? These are deterministic and run in milliseconds.
Tool-call evals: For a given input, does the agent call the right tool with the right arguments? Compare actual calls to recorded ground-truth calls.
Output evals (LLM-as-judge): Is the final output factually consistent, on-topic, and within policy constraints?
Behavioral evals: Does the agent complete multi-turn workflows correctly from start to finish?

Maintain a golden dataset of at least 50–100 representative inputs per agent node. Block deployment if tool-call accuracy drops more than 5% relative to the last passing run.
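The deployment gate itself is a small calculation. Here is a sketch, assuming each golden case is recorded as a (tool_name, arguments) pair and that exact match is the scoring rule; both the function names and the exact-match rule are assumptions.

```python
def tool_call_accuracy(expected: list[tuple], actual: list[tuple]) -> float:
    """Fraction of golden cases where the agent's (tool, args) matched exactly."""
    if len(expected) != len(actual):
        raise ValueError("expected and actual must cover the same cases")
    if not expected:
        return 0.0
    matches = sum(1 for e, a in zip(expected, actual) if e == a)
    return matches / len(expected)


def should_block_deploy(
    current: float, last_passing: float, max_relative_drop: float = 0.05
) -> bool:
    """Block when accuracy fell by more than 5% relative to the last passing run."""
    if last_passing <= 0:
        return False  # no meaningful baseline to compare against
    return (last_passing - current) / last_passing > max_relative_drop
```

Wired into CI, should_block_deploy turns "the prompt change should be fine" into a check that either passes or fails.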

Mistake 6: Runaway Costs from Unbounded Loops

What it is: The agent enters a loop — through a retry strategy, a recursive tool call chain, or an orchestration bug — with no termination condition, consuming tokens and API credits until it hits an external limit or exhausts the budget.

Why it happens: Retry and reflection loops are added to handle edge cases: "if the tool call fails, try again." The retry logic has no maximum iteration count, or the maximum is set too high. A malformed tool response triggers the retry condition on every attempt. Nobody tested behavior when the tool returns unexpected data.

How to detect it: Agent sessions occasionally run 10–20× longer than expected. Token cost per session has a long right tail — most sessions cost $0.02, a few cost $2.00. A single session can trigger hundreds of identical tool calls in sequence.

The fix: Enforce two hard limits at the infrastructure level — not in the prompt:

Max steps: Every agent graph has a maximum step count. In LangGraph, this is recursion_limit, passed in the run config at invocation. Set it to 2–3× the expected maximum legitimate step count.
Token budget: Track cumulative token usage across the session. Halt and return a graceful error if it exceeds a defined threshold.

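# LangGraph: hard step limit, never leave this unbounded.
# recursion_limit is a run-time config key, not a compile() argument.
app = graph.compile(checkpointer=memory)
result = app.invoke(
    inputs,  # initial state for the run (defined elsewhere)
    config={
        "recursion_limit": 25,
        "configurable": {"thread_id": session_id},  # required when a checkpointer is set
    },
)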

# Session-level token budget check
def check_budget(state: AgentState) -> AgentState:
    if state["total_tokens"] > TOKEN_BUDGET:
        raise BudgetExceededError(f"Session exceeded {TOKEN_BUDGET} token budget")
    return state
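The retry logic that usually starts these loops can also be bounded at the call site. This is a minimal sketch, assuming tools signal failure by raising a hypothetical ToolError; it complements the graph-level limits rather than replacing them.

```python
import time


class ToolError(Exception):
    """Hypothetical error type raised by a tool wrapper on failure."""


def call_with_retries(tool, args: dict, max_attempts: int = 3, base_delay: float = 0.0):
    """Call tool(**args) at most max_attempts times, then re-raise the last error."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return tool(**args)
        except ToolError as exc:
            last_error = exc
            if attempt < max_attempts:
                # Exponential backoff between attempts; zero delay by default.
                time.sleep(base_delay * 2 ** (attempt - 1))
    raise last_error
```

A malformed tool response now costs at most max_attempts calls before the error surfaces, instead of retrying until the budget runs out.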

Wire a spend-rate alert before you deploy. A $10/hour burn rate on an agent that normally costs $0.50/hour is detectable within minutes with a CloudWatch or Datadog metric. At that rate, a loop caught in 10 minutes costs under $2, while one that runs unnoticed for 40 hours costs $400.
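The alert condition is simple arithmetic over a short window. Here is a sketch, assuming you can read the cost accrued over the last few minutes from your metrics system; the function names and the 5× multiplier are assumptions.

```python
def hourly_burn_rate(cost_usd: float, window_minutes: float) -> float:
    """Extrapolate the cost observed in a short window to a per-hour rate."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    return cost_usd * 60.0 / window_minutes


def burn_rate_alert(
    cost_usd: float,
    window_minutes: float,
    baseline_per_hour: float,
    multiplier: float = 5.0,
) -> bool:
    """Alert when the extrapolated rate exceeds the baseline by the multiplier."""
    return hourly_burn_rate(cost_usd, window_minutes) > baseline_per_hour * multiplier
```

For example, $1.00 spent in a 6-minute window extrapolates to $10/hour, which trips the alert against a $0.50/hour baseline.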

Mistake 7: No Human-in-the-Loop Escalation Path

What it is: The agent handles every case autonomously — including cases where it is uncertain, where the stakes are high, or where the action is irreversible. There is no mechanism for the agent to pause, flag a case for human review, or request confirmation before acting.

Why it happens: Autonomous operation is the goal. Adding human review checkpoints feels like defeating the purpose of the agent. The design assumes the model will handle edge cases correctly — which it does in demos.

How to detect it: The agent performs irreversible actions (sends emails, charges payments, deletes records) without any human confirmation step. There is no low-confidence threshold that triggers a review queue. Customers report complaints about autonomous actions they didn't authorize.

The fix: Design human escalation as a first-class node in the agent graph, not a fallback added after an incident. Three trigger conditions that should always route to human review:

Low confidence: The model's decision confidence score falls below a defined threshold
High-stakes action: The agent is about to perform an irreversible or high-cost action (write, send, delete, charge)
Ambiguity: The input maps to multiple valid interpretations with meaningfully different outcomes

def should_escalate(state: AgentState) -> str:
    if state["confidence"] < 0.75 or state["action_type"] == "irreversible":
        return "human_review"
    return "execute"

graph.add_conditional_edges("agent", should_escalate, {
    "human_review": "human_review_node",
    "execute": "execute_node"
})

The human review node suspends the session, routes the case to a review queue (Slack, email, internal dashboard), and resumes from the agent's current state once a decision is recorded. LangGraph's persistence layer handles state across the suspension window — the agent picks up exactly where it paused.
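The suspend-and-resume flow can be illustrated without any framework. The sketch below is an in-memory stand-in for what a persistence layer does: it is not LangGraph's API, and the class names, the "approve"/"reject" decisions, and the "abort" node name are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingReview:
    session_id: str
    state: dict
    decision: Optional[str] = None  # "approve" or "reject" once a human decides


class ReviewQueue:
    """In-memory stand-in for a persistent human review queue."""

    def __init__(self) -> None:
        self._pending = {}

    def suspend(self, session_id: str, state: dict) -> None:
        """Snapshot the agent state and park the session for review."""
        self._pending[session_id] = PendingReview(session_id, dict(state))

    def record_decision(self, session_id: str, decision: str) -> None:
        if decision not in ("approve", "reject"):
            raise ValueError("decision must be 'approve' or 'reject'")
        self._pending[session_id].decision = decision

    def resume(self, session_id: str) -> tuple:
        """Return (next_node, saved_state) once a decision has been recorded."""
        review = self._pending[session_id]
        if review.decision is None:
            raise RuntimeError("no decision recorded yet")
        del self._pending[session_id]
        next_node = "execute" if review.decision == "approve" else "abort"
        return next_node, review.state
```

The key property is that resume refuses to proceed until a human decision exists, so an irreversible action cannot slip through while the case is still waiting in the queue.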

Quick Reference: Mistake, Symptom, Fix

Mistake	Production Symptom	Fix
God Agent	Hallucination scales with task complexity; latency spikes	Orchestrator + specialized sub-agents
No Memory Strategy	Users re-state context; long sessions truncate silently	External memory layer + structured state
Missing Tool Guardrails	Unauthorized external actions; write calls with bad args	Schema validation + action classification + role-scoped tools
No Observability	Cannot trace what the agent did or why	Trace per session + span per tool + cost alerts
No Eval Loop	Regressions discovered by customers after model/prompt changes	Four-layer eval suite gating every deploy
Unbounded Loops	Token cost spikes; sessions run indefinitely	`recursion_limit` + token budget enforced at infrastructure level
No Human Escalation	Irreversible actions without confirmation; customer complaints	Low-confidence + high-stakes + ambiguity routing to review queue

Frequently Asked Questions

What are the most common AI agent failures in production?

The most common failure is the god agent pattern — a single agent handling every task in a workflow. It works in demos because inputs are controlled. In production, task complexity grows, context windows fill, and hallucination rates climb non-linearly. The second most common failure is missing observability: teams cannot trace what the agent did, so debugging takes days instead of hours. Both are architecture decisions made before the first line of agent code.

How do I add observability to an existing LangGraph agent?

The fastest path is enabling LangSmith tracing by setting LANGCHAIN_TRACING_V2=true (along with a LangSmith API key) in your environment. This instruments LangChain and LangGraph calls automatically with trace and span data. Pair it with a session-level token cost metric and an alert for anomalous tool call frequency. Instrument on the next deploy, not after the next incident.

How do I prevent runaway AI agent costs?

Three controls in combination cover most runaway cost scenarios: set recursion_limit in the run config you pass to your LangGraph graph, at 2–3× the expected maximum step count; add a session-level token budget check as an early graph node; wire a spend-rate alert in your cloud billing tooling. These enforce hard stops without relying on the model to self-terminate.

What is human-in-the-loop in agentic AI systems?

Human-in-the-loop is an architecture pattern where the agent suspends execution and routes a case to a human reviewer before proceeding. It triggers on low model confidence, high-stakes irreversible actions, or ambiguous inputs. LangGraph supports this natively through its persistence layer, which preserves the full agent state across the suspension window so the agent resumes from exactly where it paused.

How should I test AI agents before production?

Build a four-layer eval suite: unit evals for routing and refusal behavior, tool-call evals comparing actual calls to ground-truth calls, output evals using LLM-as-judge against a defined rubric, and end-to-end behavioral evals for multi-turn workflows. Maintain a golden dataset of 50–100 representative inputs per agent node. Block deployment if tool-call accuracy regresses more than 5% from the last passing run.

Can LangGraph handle production multi-agent systems?

Yes. LangGraph is production-ready for multi-agent architectures and provides the graph-based execution model, persistence layer, and streaming support the patterns above require. The critical configuration decisions are recursion_limit, human-in-the-loop node design, and LangSmith integration — set these before the first production deploy, not after the first incident.

Build AI Agents That Survive Production

The seven mistakes above are not edge cases — they are the default trajectory for agent systems built without architecture review. The difference between a working demo and a stable production deployment is a handful of deliberate decisions made before the first deploy.

At Prodinit, our AI product development practice is built around these architecture patterns. We design multi-agent systems with observability, guardrails, eval pipelines, and human escalation built in — not bolted on after the first incident.

If you're scaling an AI agent system and want an architecture review before it reaches production, talk to our team.

Top comments (2)

Mateo Ruiz • Jun 8

This hits on something a lot of teams discover the hard way: AI can generate code incredibly fast, but verification, testing, and runtime behavior are where the real engineering work begins.
The shift from "single file execution" to "project-aware validation" is especially important because modern AI tools rarely generate isolated scripts anymore; they generate interconnected systems with dependencies, tests, configs, and deployment assumptions. The dependency race condition example is a great reminder that many AI-generated issues aren't code problems, they're workflow and environment problems. We've seen similar patterns at IT Path Solutions when helping teams move AI-generated MVPs toward production readiness: the bottleneck is usually validation and operational reliability, not code generation itself.
The idea of treating verification as a first-class layer rather than an afterthought feels like the right direction. Nice build and honest lessons learned from shipping it.
