The gap between an AI agent that works in a demo and one that works reliably over months of production operation is larger than most teams anticipate. I have built and maintained agentic AI systems across several enterprise deployments, and the failure modes I keep encountering are consistent enough that I want to document them specifically.
This is not a tutorial on how to build an agent. It is a description of the specific things that break in production that were not obvious during development, and the architectural decisions that prevent or mitigate those failures.
Why agents fail differently from RAG systems
A RAG system has a bounded failure mode. It retrieves context, generates a response, and either the response is accurate or it is not. The failure is contained to a single interaction and the blast radius is limited to one user's trust in one answer.
An agentic system has unbounded failure potential. An agent that can take actions, update records, send communications, trigger processes, or spawn other agents can fail in ways that compound across time and affect people beyond the user who initiated the interaction. A retrieval error produces a wrong answer. An agent error can produce a wrong action, and wrong actions in enterprise systems often have consequences that are difficult or impossible to reverse.
This fundamental difference in failure surface is the reason that building agents requires a different set of architectural precautions than building RAG systems, and why most RAG-first teams are surprised by the problems they encounter when they add action capabilities to their AI systems.
The state management problem
The most common agentic failure I encounter in production is a state management failure: the agent loses track of where it is in a multi-step task and either starts over, gets stuck, or makes decisions based on an incorrect model of what has already happened.
In development, this problem is rare because tasks run to completion quickly in a controlled environment. In production, tasks get interrupted by network failures, rate limits, user session timeouts, and system restarts. An agent that cannot resume a partially completed task correctly is an agent that will occasionally take the wrong action after a restart, thinking it is in a different position than it actually is.
The architectural solution is checkpointing with immutable state records:
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Any, Dict, List, Optional


class TaskStatus(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    AWAITING_APPROVAL = "awaiting_approval"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


class AwaitingApprovalException(Exception):
    """Raised to pause a task until a human approves the pending step."""


@dataclass
class TaskCheckpoint:
    task_id: str
    step_index: int
    step_name: str
    status: TaskStatus
    inputs: Dict[str, Any]
    outputs: Optional[Dict[str, Any]]
    agent_reasoning: str  # what the agent was thinking at this step
    timestamp: str
    requires_human_approval: bool
    approved_by: Optional[str] = None  # set when a human approves the step


class CheckpointedAgent:
    def __init__(self, task_id: str, checkpoint_store):
        self.task_id = task_id
        self.checkpoint_store = checkpoint_store
        self.steps_completed: List[TaskCheckpoint] = self.load_history()

    def load_history(self) -> List[TaskCheckpoint]:
        return self.checkpoint_store.get_task_history(self.task_id)

    def get_current_context(self) -> str:
        if not self.steps_completed:
            return "No previous steps. Starting fresh."
        completed_summary = "\n".join([
            f"Step {cp.step_index} ({cp.step_name}): {cp.status.value} - {cp.agent_reasoning}"
            for cp in self.steps_completed
        ])
        return f"Previously completed steps:\n{completed_summary}"

    def execute_step(self, step_name: str, step_inputs: Dict[str, Any], requires_approval: bool = False):
        checkpoint = TaskCheckpoint(
            task_id=self.task_id,
            step_index=len(self.steps_completed),
            step_name=step_name,
            status=TaskStatus.IN_PROGRESS,
            inputs=step_inputs,
            outputs=None,
            agent_reasoning="",
            timestamp=datetime.now().isoformat(),
            requires_human_approval=requires_approval,
        )
        # Record the step before doing any work, so a crash leaves a trace of intent.
        self.checkpoint_store.save(checkpoint)
        # Track it locally too, so step_index advances and resume can find it.
        self.steps_completed.append(checkpoint)

        if requires_approval:
            checkpoint.status = TaskStatus.AWAITING_APPROVAL
            self.checkpoint_store.save(checkpoint)
            raise AwaitingApprovalException(
                f"Step {step_name} requires human approval before proceeding"
            )

        # Execute the actual step here
        # ...

    def resume_after_approval(self, step_index: int, approved_by: str):
        checkpoint = self.steps_completed[step_index]
        if checkpoint.status != TaskStatus.AWAITING_APPROVAL:
            raise ValueError("This step is not awaiting approval")
        checkpoint.approved_by = approved_by
        checkpoint.status = TaskStatus.IN_PROGRESS
        self.checkpoint_store.save(checkpoint)
        # Resume execution from this step
The critical property here is that every state transition is logged before it is executed. If the system fails mid-execution, the checkpoint log tells you exactly what was done, what was not done, and what the agent was intending to do next. Resuming from a known state is dramatically safer than attempting to infer state from partial completion.
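To make the resume behavior concrete, here is a minimal sketch of an append-only store and a resume helper. It assumes an in-memory backend and a simplified record; `CheckpointRecord`, `InMemoryCheckpointStore`, and `next_step_to_run` are hypothetical names for illustration, and a real deployment would write to durable storage.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CheckpointRecord:
    """One immutable state transition for one step of a task (simplified)."""
    task_id: str
    step_index: int
    step_name: str
    status: str  # e.g. "in_progress", "completed", "failed"


class InMemoryCheckpointStore:
    """Append-only store: saving never overwrites an earlier record."""

    def __init__(self) -> None:
        self._log: Dict[str, List[CheckpointRecord]] = defaultdict(list)

    def save(self, record: CheckpointRecord) -> None:
        self._log[record.task_id].append(record)

    def get_task_history(self, task_id: str) -> List[CheckpointRecord]:
        # Return a copy so callers cannot rewrite history.
        return list(self._log.get(task_id, []))


def next_step_to_run(history: List[CheckpointRecord]) -> int:
    """Return the index of the first step that has no 'completed' record."""
    completed = {r.step_index for r in history if r.status == "completed"}
    step = 0
    while step in completed:
        step += 1
    return step


store = InMemoryCheckpointStore()
store.save(CheckpointRecord("task-1", 0, "fetch_contract", "in_progress"))
store.save(CheckpointRecord("task-1", 0, "fetch_contract", "completed"))
store.save(CheckpointRecord("task-1", 1, "update_renewal_date", "in_progress"))
# Simulated crash: step 1 was started but never marked completed.
resume_at = next_step_to_run(store.get_task_history("task-1"))
print(resume_at)  # 1
```

A step that was left in progress may already have produced side effects, so the resumed run should check what actually happened at that step before repeating it.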
The approval gate pattern
Every agentic system I build that operates on real enterprise data now includes what I call approval gates: explicit checkpoints where the agent cannot proceed without human confirmation.
The reasoning is straightforward. Agents make mistakes, and mistakes in action-taking agents affect the real world. Requiring human approval for consequential actions costs some friction and slows execution. An agent taking an incorrect consequential action costs far more, because it creates an operational problem that someone has to find and unwind. For enterprise deployments, the second cost consistently outweighs the first.
What makes approval gates work in practice is specificity. A vague approval request ("the AI wants to do something, please approve") does not work. Approval requests need to specify: what action is proposed, what the agent's reasoning was, what the expected outcome is, what the reversibility is if the action turns out to be wrong, and what happens if the approver does nothing.
@dataclass
class ApprovalRequest:
    request_id: str
    task_id: str
    requesting_agent: str
    proposed_action: str
    action_parameters: Dict[str, Any]
    agent_reasoning: str
    expected_outcome: str
    reversibility: str  # "fully reversible", "partially reversible", "irreversible"
    auto_approve_after: Optional[int]  # seconds until auto-approval, None for manual-only
    auto_deny_after: Optional[int]  # seconds until auto-denial if no response
    context_summary: str  # what has happened so far in this task


class ApprovalGateManager:
    def __init__(self, store):
        self.store = store
        # notify_approvers, wait_for_response, and log_denial are
        # deployment-specific and not shown here.

    def request_approval(self, request: ApprovalRequest) -> bool:
        # For irreversible actions, never auto-approve. Apply this before the
        # request is saved or sent so approvers see the policy that is enforced.
        if request.reversibility == "irreversible":
            request.auto_approve_after = None

        self.store.save(request)
        self.notify_approvers(request)

        # Wait for whichever timeout applies: auto-approval if configured,
        # otherwise auto-denial (defaulting to one hour).
        timeout = request.auto_approve_after or request.auto_deny_after or 3600
        response = self.wait_for_response(
            request_id=request.request_id,
            timeout=timeout,
        )

        if response is None and request.auto_approve_after:
            # Time-based auto-approval for low-risk actions
            return True
        elif response is None:
            # Default to denial for safety
            self.log_denial(request.request_id, reason="timeout_no_response")
            return False
        return response.approved
The reversibility field drives the strictest control: irreversible actions always require explicit human approval, no exceptions. This is the line I do not move regardless of how much the operational team wants to eliminate the friction for specific workflows.
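The specificity requirement can be illustrated with a small formatter that turns a request into the message an approver sees. This is a sketch under assumptions: `ApprovalSummary` and `render_approval_message` are hypothetical names, the message layout is my own, and the example values are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ApprovalSummary:
    proposed_action: str
    agent_reasoning: str
    expected_outcome: str
    reversibility: str  # "fully reversible", "partially reversible", "irreversible"
    auto_deny_after: Optional[int]  # seconds; None means wait indefinitely


def render_approval_message(s: ApprovalSummary) -> str:
    """Format the five things an approver needs into a plain-text message."""
    if s.auto_deny_after is None:
        on_silence = "Nothing happens until someone responds."
    else:
        on_silence = (
            f"Denied automatically after {s.auto_deny_after} seconds with no response."
        )
    return "\n".join([
        f"Proposed action: {s.proposed_action}",
        f"Agent reasoning: {s.agent_reasoning}",
        f"Expected outcome: {s.expected_outcome}",
        f"Reversibility: {s.reversibility}",
        f"If no one responds: {on_silence}",
    ])


message = render_approval_message(ApprovalSummary(
    proposed_action="Send renewal notice to vendor Acme Corp",
    agent_reasoning="Contract expires in 30 days and no notice is on file",
    expected_outcome="Vendor receives one email with the renewal terms",
    reversibility="irreversible",
    auto_deny_after=3600,
))
print(message)
```

Keeping the message to these five lines means an approver can decide without opening the task history, which is what separates a usable gate from one that gets rubber-stamped.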
The tool call audit problem
When an agent calls a tool, whether that is querying a database, updating a CRM record, sending an email, or calling an external API, that tool call needs to be logged in a format that satisfies two requirements simultaneously: technical debugging and compliance audit.
These requirements conflict. Technical debugging logs want raw data: full request and response payloads, timing information, and error details. Compliance audit logs want a human-readable description of what happened and why, without the raw data that might contain personal information or proprietary content.
The architecture that serves both:
@dataclass
class ToolCallRecord:
    call_id: str
    task_id: str
    agent_id: str
    tool_name: str
    timestamp: str
    duration_ms: int

    # Technical log (internal, full detail)
    raw_input: Dict  # full tool input parameters
    raw_output: Dict  # full tool output
    error: Optional[str]

    # Compliance log (auditable, human-readable, privacy-safe)
    action_description: str  # "Updated contract renewal date for vendor X"
    data_accessed: List[str]  # ["contract_id:12345", "vendor:acme_corp"]
    data_modified: List[str]  # ["renewal_date", "contract_status"]
    authorized_by: str  # user or policy that authorized this action
    outcome: str  # "success", "failure", "partial"

    # Immutability controls
    logged_at: str
    log_hash: str  # hash of the record content for tamper detection
The compliance log fields are what your legal team and auditors need to see. The technical log fields are what your engineers need when something goes wrong. Both live in the same record. They go to different downstream destinations with different access controls.
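One way to fill `log_hash` and to route the two audiences is sketched below. It assumes the record is available as a plain dict, that SHA-256 over canonical JSON is an acceptable tamper check, and that the compliance destination receives only the fields listed in `COMPLIANCE_FIELDS`; the helper names and the sample record are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict

# Fields assumed safe to send to the compliance destination.
COMPLIANCE_FIELDS = (
    "call_id", "task_id", "agent_id", "tool_name", "timestamp",
    "action_description", "data_accessed", "data_modified",
    "authorized_by", "outcome", "logged_at", "log_hash",
)


def compute_log_hash(record: Dict[str, Any]) -> str:
    """SHA-256 over a canonical JSON encoding, excluding the hash field itself."""
    body = {k: v for k, v in record.items() if k != "log_hash"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(record: Dict[str, Any]) -> bool:
    """True if the stored hash still matches the record content."""
    return record.get("log_hash") == compute_log_hash(record)


def compliance_view(record: Dict[str, Any]) -> Dict[str, Any]:
    """Project the record onto the privacy-safe fields only."""
    return {k: record[k] for k in COMPLIANCE_FIELDS if k in record}


record = {
    "call_id": "call-1",
    "task_id": "task-1",
    "agent_id": "agent-a",
    "tool_name": "crm_update",
    "timestamp": "2025-01-01T12:00:00",
    "duration_ms": 120,
    "raw_input": {"contract_id": 12345, "renewal_date": "2026-01-01"},
    "raw_output": {"ok": True},
    "error": None,
    "action_description": "Updated contract renewal date for vendor X",
    "data_accessed": ["contract_id:12345"],
    "data_modified": ["renewal_date"],
    "authorized_by": "policy:renewals",
    "outcome": "success",
    "logged_at": "2025-01-01T12:00:01",
}
record["log_hash"] = compute_log_hash(record)

tampered = dict(record, outcome="failure")
print(verify(record), verify(tampered))  # True False
print("raw_input" in compliance_view(record))  # False
```

A content hash only detects edits if the hash itself is stored somewhere the editor cannot reach, so in practice it would be written to a separate system or chained to the previous record's hash.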
Handling partial failures gracefully
Production agentic systems encounter partial failures constantly. A tool call times out. An API rate limit is hit. A database record was modified by another user between when the agent read it and when it tried to update it. An external service returns an unexpected response format.
The naive approach is to treat any failure as a task failure and surface an error to the user. This is too conservative. Many partial failures are recoverable without human intervention.
The approach that works: classify failures by their recoverability at the tool call level rather than the task level.
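import time


class HumanInterventionRequired(Exception):
    """A person must resolve this failure before the task can continue."""


class TaskFailure(Exception):
    """The task cannot complete; surface the error to the user."""


class FailureClassifier:
    TRANSIENT = "transient"    # retry likely to succeed
    STALE_DATA = "stale_data"  # re-read and retry
    PERMISSION = "permission"  # requires human intervention
    DATA_ERROR = "data_error"  # data quality issue, needs investigation
    FATAL = "fatal"            # task cannot complete, surface to user

    def classify(self, tool_name: str, error: Exception, context: dict) -> str:
        # Matching on the message text is a heuristic; prefer typed
        # exceptions or error codes where the tool provides them.
        message = str(error).lower()
        if "timeout" in message or "connection" in message:
            return self.TRANSIENT
        if "optimistic_lock" in message or "version_conflict" in message:
            return self.STALE_DATA
        if "permission" in message or "unauthorized" in message:
            return self.PERMISSION
        if "invalid_data" in message or "validation" in message:
            return self.DATA_ERROR
        return self.FATAL


class ResilientToolExecutor:
    def __init__(self, tools, classifier=None):
        self.tools = tools  # mapping of tool name -> object with .execute(inputs)
        self.classifier = classifier or FailureClassifier()

    def refresh_inputs(self, tool_name, inputs):
        # Tool-specific: re-read current data and rebuild the inputs.
        raise NotImplementedError

    def execute_with_retry(self, tool_name, inputs, max_retries=3):
        for attempt in range(max_retries):
            try:
                return self.tools[tool_name].execute(inputs)
            except Exception as e:
                failure_type = self.classifier.classify(tool_name, e, inputs)
                if failure_type == FailureClassifier.TRANSIENT and attempt < max_retries - 1:
                    time.sleep(2 ** attempt)  # exponential backoff
                    continue
                elif failure_type == FailureClassifier.STALE_DATA:
                    inputs = self.refresh_inputs(tool_name, inputs)
                    continue
                elif failure_type == FailureClassifier.PERMISSION:
                    raise HumanInterventionRequired(f"Permission error on {tool_name}: {e}") from e
                else:
                    raise TaskFailure(f"Unrecoverable error on {tool_name}: {e}") from e
        # Retries exhausted (for example, repeated stale-data conflicts).
        raise TaskFailure(f"Retries exhausted on {tool_name}")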
This classification-based retry strategy means that transient infrastructure failures do not surface as user-visible errors, stale data conflicts get automatically refreshed and retried, and only genuinely unrecoverable failures escalate to human attention.
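The stale-data case is the least obvious of these, so here is a sketch of what "re-read and retry" can look like against a store that uses version numbers for optimistic locking. `VersionedStore`, `VersionConflict`, and `update_with_refresh` are hypothetical names, and the concurrent edit is simulated inside the example.

```python
from typing import Callable, Dict, Tuple


class VersionConflict(Exception):
    """Raised when a record changed since the caller last read it."""


class VersionedStore:
    """Toy record store where every write bumps a version number."""

    def __init__(self) -> None:
        self._rows: Dict[str, Tuple[int, dict]] = {}

    def put(self, key: str, value: dict) -> None:
        version = self._rows.get(key, (0, {}))[0] + 1
        self._rows[key] = (version, dict(value))

    def read(self, key: str) -> Tuple[int, dict]:
        version, value = self._rows[key]
        return version, dict(value)

    def update(self, key: str, expected_version: int, value: dict) -> None:
        current_version, _ = self._rows[key]
        if current_version != expected_version:
            raise VersionConflict(f"version_conflict on {key}")
        self._rows[key] = (current_version + 1, dict(value))


def update_with_refresh(store: VersionedStore, key: str,
                        change: Callable[[dict], dict],
                        max_attempts: int = 3) -> int:
    """Re-read and re-apply `change` on a stale write. Returns attempts used."""
    for attempt in range(1, max_attempts + 1):
        version, value = store.read(key)
        try:
            store.update(key, version, change(value))
            return attempt
        except VersionConflict:
            continue  # someone else wrote first; read the fresh copy and retry
    raise VersionConflict(f"gave up on {key} after {max_attempts} attempts")


store = VersionedStore()
store.put("contract:12345", {"renewal_date": "2025-01-01", "status": "active"})

calls = {"n": 0}


def set_renewal(value: dict) -> dict:
    calls["n"] += 1
    if calls["n"] == 1:
        # Simulate another user editing the record after our read.
        store.put("contract:12345", {**value, "status": "under_review"})
    return {**value, "renewal_date": "2026-01-01"}


attempts = update_with_refresh(store, "contract:12345", set_renewal)
print(attempts)  # 2
print(store.read("contract:12345"))
# (3, {'renewal_date': '2026-01-01', 'status': 'under_review'})
```

Because the change is re-applied to the freshly read record, the other user's edit to `status` survives instead of being overwritten by the agent's stale copy.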
The monitoring layer that actually matters
For agentic systems, the metrics that matter are different from the metrics that matter for RAG systems.
Queue depth and task age are the early warning metrics. If tasks are sitting in the pending queue for longer than expected, something has stalled. If a task is in-progress for longer than the expected maximum execution time, the agent is probably stuck in a loop or waiting on an approval that was missed.
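A stalled-task check along these lines can be very small. The sketch below assumes each task exposes its current status and when it entered that status; `TaskSnapshot`, `find_stalled`, and the five-minute and thirty-minute thresholds are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class TaskSnapshot:
    task_id: str
    status: str  # "pending", "in_progress", or a terminal status
    status_since: datetime  # when the task entered its current status


def find_stalled(tasks: List[TaskSnapshot], now: datetime,
                 max_pending: timedelta = timedelta(minutes=5),
                 max_running: timedelta = timedelta(minutes=30)) -> List[str]:
    """Return IDs of tasks that have sat in a non-terminal status too long."""
    limits = {"pending": max_pending, "in_progress": max_running}
    return [
        t.task_id
        for t in tasks
        if t.status in limits and now - t.status_since > limits[t.status]
    ]


now = datetime(2025, 1, 1, 12, 0, 0)
tasks = [
    TaskSnapshot("a", "pending", datetime(2025, 1, 1, 11, 58)),      # 2 min: fine
    TaskSnapshot("b", "pending", datetime(2025, 1, 1, 11, 50)),      # 10 min: stalled
    TaskSnapshot("c", "in_progress", datetime(2025, 1, 1, 11, 45)),  # 15 min: fine
    TaskSnapshot("d", "in_progress", datetime(2025, 1, 1, 11, 0)),   # 60 min: stalled
    TaskSnapshot("e", "completed", datetime(2025, 1, 1, 10, 0)),     # terminal: ignored
]
print(find_stalled(tasks, now))  # ['b', 'd']
```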
Tool call success rates by tool are more useful than overall success rates. An agentic system can have a high task completion rate while masking a specific tool that is failing silently and causing tasks to take workaround paths. Broken down by tool, you see the specific integration that is causing problems rather than a blended metric that looks acceptable.
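The difference between a blended rate and a per-tool rate is easy to see with made-up numbers. In this sketch the tool names and call counts are invented for illustration, and `success_rate_by_tool` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def success_rate_by_tool(calls: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Compute the fraction of successful calls for each tool name."""
    totals: Counter = Counter()
    successes: Counter = Counter()
    for tool, outcome in calls:
        totals[tool] += 1
        if outcome == "success":
            successes[tool] += 1
    return {tool: successes[tool] / totals[tool] for tool in totals}


calls = (
    [("crm_update", "success")] * 45
    + [("email_send", "success")] * 45
    + [("erp_lookup", "success")] * 4
    + [("erp_lookup", "failure")] * 6
)

overall = sum(1 for _, outcome in calls if outcome == "success") / len(calls)
rates = success_rate_by_tool(calls)
print(overall)              # 0.94
print(rates["erp_lookup"])  # 0.4
```

A 94% blended rate looks healthy, while the breakdown shows one integration failing more often than it succeeds.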
Human intervention rate is the metric I watch most closely for agentic systems. It measures what percentage of task executions required human approval or intervention beyond what was expected. Rising intervention rates indicate either that the agent's judgment is not calibrated correctly for the real-world task distribution, or that the task types being handled have evolved beyond what the agent was designed for. Either case requires investigation.
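As a sketch of how this metric could be computed, assume each finished task records how many approval gates it was designed to hit and how many times a human actually had to act. `TaskOutcome` and `intervention_rate` are hypothetical names, and the sample data is invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskOutcome:
    task_id: str
    planned_approvals: int  # approval gates the workflow was designed to hit
    human_touches: int      # approvals plus any manual intervention that occurred


def intervention_rate(outcomes: List[TaskOutcome]) -> float:
    """Fraction of tasks that needed more human involvement than planned."""
    if not outcomes:
        return 0.0
    unexpected = sum(1 for o in outcomes if o.human_touches > o.planned_approvals)
    return unexpected / len(outcomes)


this_week = [
    TaskOutcome("t1", planned_approvals=1, human_touches=1),
    TaskOutcome("t2", planned_approvals=0, human_touches=0),
    TaskOutcome("t3", planned_approvals=1, human_touches=3),  # needed manual fixes
    TaskOutcome("t4", planned_approvals=0, human_touches=0),
]
print(intervention_rate(this_week))  # 0.25
```

Tracking this value per week, and per task type where volume allows, makes a rising trend visible before it shows up as a backlog of stuck tasks.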
Building the monitoring layer is not glamorous work. It also does not show up in demos. But the systems that remain trustworthy in production over eighteen months are almost always the systems where the monitoring layer was built with the same care as the execution layer.