Ali Suleyman TOPUZ

Posted on Mar 31 • Originally published at topuzas.Medium on Mar 31

Agentic Architectures — Article 3: AgentOps

#devops #softwareengineering #artificialintelligen #machinelearning

Treating AI Like the Distributed System It Actually Is

There’s a moment every team hits, usually somewhere between the third demo and the first real production deployment. The agent works beautifully in the notebook. It handles every test case you throw at it. You ship it. And then, three days later, you get a Slack message from a user that says something like: “It’s been running for 20 minutes and nothing is happening.”

You open the logs. There are no logs. The agent made 47 API calls, hit a rate limit on call 12, entered an undocumented retry state, and has been quietly spinning ever since — accumulating token costs, holding open a connection, and doing absolutely nothing useful.

Welcome to production.

The discipline of AgentOps exists because agentic systems are distributed systems, and distributed systems fail in distributed ways — partially, silently, and at the worst possible time. The practices in this article aren’t optional polish you add after launch. They’re the foundation that determines whether your system is operable when things go wrong. And they will go wrong.

Observability & Tracing: You Can’t Fix What You Can’t See

In a traditional web application, a request comes in and a response goes out. If something breaks, you have a single trace to inspect — one thread of execution, one error to find.

An agentic system doesn’t work like this. A single user request might spawn a Manager agent, three Worker agents, a Critic, and a tool execution layer. Each of these makes independent model calls. Some run in parallel. Each can fail independently. The user sees one thing — “the agent is thinking” — and behind that is a branching tree of execution that can fail at any node.

Debugging this without proper tracing is like trying to debug a microservices outage by reading individual server logs with no correlation IDs. Technically possible. Practically miserable.

The modern answer is OpenTelemetry (OTel) — the vendor-neutral observability standard that has become the lingua franca of distributed systems monitoring. The good news is that both LangSmith (from the LangChain ecosystem) and Arize Phoenix support OTel-compatible trace ingestion, which means you can instrument your agent once and route traces to whichever backend you prefer.

What you want to capture at every node in your agent graph:

Span start and end timestamps — so you can see exactly where time is being spent
Model call metadata — which model, which prompt template version, input/output token counts
Tool call inputs and outputs — what the agent asked the tool to do, and what it got back
State transitions — when the agent moved from Planning to Executing to Reflecting
Error events — with full context, not just the exception message

The two metrics that matter most in production, and that most teams don’t track until they should:

Trace Latency is the wall-clock time from user request to final response, across the entire agent execution graph. Not just model latency — total latency, including tool calls, state reads and writes, and any waiting time. This is what the user experiences, and it’s almost always higher than you think because it includes all the overhead your benchmark tests don’t capture.

Token Cost per Trace is the total model spend for a single user task, aggregated across all agents and all model calls in the trace. In a multi-agent system, this is the number that will surprise you. Individual model calls look cheap. When you multiply them by agent count, loop iterations, and daily request volume, the number that emerges is frequently 5–10x what the team estimated during design.

Build a dashboard with both of these as primary metrics before you launch, not after. The alert threshold for Trace Latency should trigger before your user-facing timeout does. The alert threshold for Token Cost per Trace should trigger before your monthly budget does.

Guardrails & Safety: The Gates That Protect the System

Every agent that interacts with real users is a potential attack surface — not just for adversarial users, but for the model’s own failure modes. A guardrail is an enforcement layer that sits between the world and your agent, checking inputs before they reach the model and outputs before they reach the user.

Think of it as two gates:

User Input → [INPUT GATE] → Agent → Model → [OUTPUT GATE] → User Response
                  ↓ ↓
             Block / Sanitize Block / Rewrite / Flag

The Input Gate protects the model from harmful, manipulative, or out-of-scope inputs. Common implementations:

LlamaGuard — Meta’s open-source safety classifier, trained specifically to detect harmful content categories (violence, hate speech, self-harm, illegal activity). It runs as a separate model call before your main agent, adding ~100–200ms of latency and a fraction of a cent in cost per request. Worth it for any user-facing deployment.
Regex and rule-based filters — Fast, cheap, and reliable for known patterns. Prompt injection attempts often have detectable signatures (ignore previous instructions, you are now, your new system prompt is). A well-maintained regex filter catches a meaningful percentage of these before they ever reach the model.
LLM-based classifiers — For nuanced cases where rule-based filtering isn’t sufficient. A small, fast model (Haiku, GPT-4o mini) classifying the intent of an input before it hits your expensive main model is a good investment for high-value workflows.

The Output Gate protects users from the model’s failure modes — hallucinations, off-topic responses, sensitive data leakage, and policy violations:

PII detection — Before any agent output reaches a user or gets written to a log, scan it for personally identifiable information that shouldn’t be there. Regex handles the obvious cases (email patterns, credit card formats, SSN patterns). For subtler cases, a dedicated NER model does the job.
Factual grounding checks — For RAG-based agents, verify that claims in the output can be traced back to retrieved source documents. Outputs that make claims not present in the source context should be flagged or blocked.
Output format validation — If your agent is supposed to return structured JSON, validate the schema before passing it downstream. A malformed output that crashes a downstream service is a guardrail failure, not a model failure.

One implementation principle that’s easy to overlook: guardrails must fail safely. If your Input Gate goes down, what happens? If the answer is “all user requests go through unfiltered,” you have a single point of failure in your safety layer. Design guardrails with explicit fallback behavior — if the safety classifier is unavailable, either queue the request or return a graceful error, never silently bypass the check.

Evaluation Pipelines: The Regression Test for Your Agent

Software engineers have a deeply ingrained habit of writing tests before shipping code. Most AI teams don’t extend this habit to their agents — and they pay for it every time a prompt change breaks something in production that worked perfectly last week.

The equivalent of a test suite for an agentic system is an Eval Pipeline , and the core artifact it runs against is a Golden Dataset.

A Golden Dataset is a curated collection of input-output pairs that represent the behavior your system should exhibit. Each entry contains:

A realistic input (user query, document, task description)
The expected output — or the criteria by which a good output can be evaluated
Metadata: the scenario type, difficulty level, which agent capabilities it exercises

Building a good Golden Dataset is not a one-time task. It grows over time, fed primarily by production failures. Every time your agent produces a wrong or unexpected output in production, that input — along with the correct output — gets added to the dataset. The Golden Dataset becomes a living record of every failure mode your system has ever exhibited and been fixed for.

The Eval Pipeline runs this dataset against your agent automatically on every significant change — prompt updates, model version changes, tool modifications, new agent roles. The output is a regression report:

Golden Dataset Run — 2025-03-28
------------------------------------------------------------
Total cases: 847
Passed: 821 (96.9%)
Regressed: 18 (2.1%) ← These need investigation
Improved: 8 (0.9%)
------------------------------------------------------------
Regressed cases by category:
  Multi-step reasoning: 9
  Tool selection: 5
  Output format: 4

The 18 regressions are the cases where a change that was supposed to improve things has broken something that worked before. Without the eval pipeline, you’d find these in production. With it, you find them in CI.

Evaluation metrics vary by task type. For classification tasks, precision and recall. For generation tasks, you need LLM-based evaluation — a judge model that scores outputs against a rubric (this is increasingly standard and works well when the rubric is specific). For tool-use tasks, check whether the correct tools were called with correct arguments, independent of the final text output.

One practical note: keep your Golden Dataset honest. The temptation is to add only cases your system handles well, which turns the eval into a confidence-boosting exercise rather than a quality gate. Actively seek out edge cases, adversarial inputs, and the kinds of queries that make your system stumble.

Agentic Error Handling: Building for the Inevitable

A production agent will encounter rate limits. APIs will return 500 errors. Model calls will time out. Tools will return malformed responses. The question is not whether these things will happen — it’s whether your system handles them gracefully or catastrophically.

Exponential Backoff with Jitter

When a model API returns a 429 (rate limit exceeded), the naive response is to retry immediately. This is exactly wrong — immediate retries hammer the rate-limited endpoint and make the congestion worse. The correct pattern is exponential backoff with jitter:

Attempt 1: fail → wait 1s + random(0-500ms)
Attempt 2: fail → wait 2s + random(0-500ms)
Attempt 3: fail → wait 4s + random(0-500ms)
Attempt 4: fail → wait 8s + random(0-500ms)
Attempt 5: fail → give up, return error to orchestrator

The jitter (random delay) is critical in multi-agent systems. Without it, multiple agents hitting the same rate limit will retry in synchrony, creating thundering herd waves that make the rate limit problem worse. With jitter, retries spread out naturally.

Set a maximum retry count (5 is a reasonable default) and a maximum total wait time (30–60 seconds for interactive tasks). After that, fail explicitly — a clean error the orchestrator can handle is better than a silent spin.

Fallback Models

Not all model failures are rate limits. Sometimes a model is genuinely unavailable, or a specific request exceeds the model’s context limit, or a cost budget has been hit. For these cases, build a fallback model hierarchy:

Primary: Claude Sonnet (full capability, higher cost)
Fallback: Claude Haiku (reduced capability, lower cost)
Emergency: Cached response or template-based response

The fallback trigger conditions and the fallback target should be explicit configuration, not hardcoded logic. Different tasks warrant different fallback strategies — a customer-facing response probably shouldn’t fall back to a template, but an internal classification task might be fine running on a smaller model.

Critically, log every fallback event. A spike in fallback usage is an early warning signal — it means your primary model is struggling for some reason, and you want to know about it before it becomes a full outage.

Human-in-the-Loop: Designing Deliberate Pause Points

There’s a class of agent actions where the cost of getting it wrong is high enough that no amount of automated validation is sufficient. Executing a SQL write against a production database. Sending an email to a thousand customers. Approving a financial transaction. Deploying code to production.

For these, Human-in-the-Loop (HITL) isn’t a limitation of your agent’s capability — it’s a deliberate architectural choice that reflects the appropriate level of trust for that action.

The implementation pattern is an Interrupt Point — a designated node in your agent’s state graph where execution pauses, the pending action is surfaced to a human reviewer, and the agent waits for explicit approval before proceeding.

Agent Planning → Tool Selection → [INTERRUPT: Awaiting approval]
                                          ↓
                                   Human Reviews
                                    / \
                               Approve Reject (+ feedback)
                                  ↓ ↓
                           Agent Executes Agent Replans

The UX of the approval interface matters more than most engineering teams acknowledge. The human reviewer needs to see: what action the agent wants to take, why it decided to take it (a brief reasoning summary), what the expected outcome is, and what the rollback plan is if it goes wrong. A one-line notification that says “Agent wants to run a database query — approve?” is not sufficient. A panel showing the exact SQL, the expected rows affected, and the agent’s stated rationale is.

HITL points should be defined in configuration, not code — so that business stakeholders can adjust the approval threshold for an action type without requiring a code deploy. “Any SQL write affecting more than 1,000 rows requires approval” is a policy decision, not an engineering decision.

One nuance worth designing for early: what happens if the human doesn’t respond? The agent is waiting at an interrupt point. The reviewer is in a meeting. An hour passes. Your system needs explicit timeout behavior — either escalate to a different reviewer, cancel the task gracefully, or (for some use cases) proceed with a lower-risk fallback action. The worst outcome is an agent silently holding state and accumulating cost while waiting indefinitely.

Production Reality Check

AgentOps is the article in this series most likely to be skimmed and least likely to be implemented before launch. That’s a mistake that reliably produces avoidable incidents.

Some concrete numbers to make the case:

Observability instrumentation — setting up OTel, integrating LangSmith or Arize Phoenix, building a basic dashboard — takes a senior engineer approximately 2–3 days to do properly. The first production incident it prevents will typically save more than 2–3 days of debugging time, usually within the first month of operation.

A well-maintained Golden Dataset with 500–1000 cases catches roughly 60–70% of prompt-change regressions before they reach production, based on experience across several production systems. The remaining 30–40% are novel failure modes — which get added to the dataset after they’re found, continuously improving the coverage.

HITL approval flows feel like friction during development (“the agent should just do it”). In production, they become the feature that saves your team from the 2am incident where the agent queued 50,000 email sends based on a misconfigured trigger. Every high-stakes agentic system needs at least one HITL checkpoint. Design it in from the start — retrofitting it into an existing state machine is painful.

The honest framing: every hour you invest in AgentOps before launch is worth roughly five hours of incident response after it. The math isn’t complicated. The discipline is.

What Comes Next

Article 4 is where we zoom out from how individual agents work to how different agents talk to each other — across team boundaries, vendor boundaries, and trust boundaries.

The Model Context Protocol (MCP) and Agent-to-Agent (A2A) communication standards are quietly becoming the connective tissue of the agentic ecosystem. If you’re building agents that need to interoperate with tools, services, or other agents you don’t control — which is almost everyone — understanding these protocols is no longer optional.

                  L1: Stateless L2: Tool-Augmented L3: Autonomous L4: Multi-Agent L5: Self-Correcting
------------------------------------------------------------------------------------------------------------------------------------------------------
Execution Serverless / Edge Serverless + integr. Long-running container Distributed orchestrator Distributed + feedback loops
State None None Short + long-term memory Shared state across agents State + mutation history
Latency profile Predictable Slightly variable Variable (loop-dependent) High, parallelizable Highest, bounded by budget
Cost model Linear (tokens) Linear + tool costs Nonlinear (calls per task) Nonlinear × agent count Nonlinear × iteration count
Primary failure Bad retrieval Tool hallucination Context overflow Cascade failures Runaway loops
Observability Basic logging Tool call tracing Full trace per loop Cross-agent tracing Cost + quality dashboards

Included for reference from Article 1 — AgentOps practices apply most critically at L3, L4, and L5.

This is Article 3 of a 4-part series on Agentic AI Architectures. Article 4 — Agentic Protocols (MCP and A2A) — is the final piece.

DEV Community