AWS made its DevOps Agent generally available on March 31, 2026. It investigates incidents, executes SRE tasks, and operates across multicloud and on-prem environments autonomously. Enterprises that haven't shipped their own agents yet are about to have one handed to them.
The question nobody is asking loudly enough: how do you operate it once it's running?
MLOps gave teams a discipline for managing models — training pipelines, versioning, drift detection, retraining schedules. It was the right answer to the right problem. But the problem has changed. An agent isn't a model. It's a system that uses models to take actions in the world: querying databases, calling APIs, spawning sub-agents, writing files, triggering workflows. The failure mode isn't a bad prediction. It's a bad action. And there's no MLOps playbook for that.
AgentOps is the emerging discipline that fills this gap. Not a rebrand of MLOps with "agent" swapped in, but a different set of operational concerns, practices, and controls — built for systems that act rather than just infer.
AgentOps is the set of operational practices, tooling, and governance controls required to deploy, monitor, and manage autonomous AI agents in production. Where MLOps stops at the model boundary — concerned with training, versioning, and prediction quality — AgentOps extends to the full execution surface: how agents reason, which tools they invoke, what they spend, how they behave across sessions, and what constraints they operate within. AgentOps is not MLOps for agents. It's a new discipline for a new class of system.
Why MLOps doesn't cover what agents actually do
MLOps was designed for a well-defined production unit: the model. A model has inputs, produces outputs, and can be versioned, evaluated, and retrained. Drift detection tells you when the relationship between inputs and outputs has changed. A/B testing tells you which version performs better. Rollback is clean — you swap in the previous model checkpoint.
None of this translates cleanly to agents, because agents aren't defined by their outputs. They're defined by their actions.
When an MLOps system detects a bad prediction, the blast radius is a bad response. When an AgentOps system fails to catch a bad action, the blast radius is whatever the agent did: a database write, a sent email, a processed transaction, a spawned sub-agent that spawns further sub-agents. Agents don't just generate text that might be wrong. They interact with systems that don't have an undo button.
There's a second difference that makes MLOps even less transferable: agents run in loops. A model call is atomic — prompt in, completion out, done. An agent session is a sequence of decisions, each one potentially altering the context for the next. A single agentic task that looks simple — "research this company and draft an outreach email" — might involve a dozen tool calls, multiple model invocations, and decisions that compound across the session. The cost and behavior profile of that session is nothing like the per-call economics MLOps was built to track.
The third difference is non-determinism at the workflow level. Individual LLM calls are already non-deterministic. At the agent level, this non-determinism compounds: the same starting prompt can produce genuinely different execution paths depending on which tool returns what, in what order. Testing and evaluation can characterize this variance. They cannot eliminate it. Every agent session is partly novel.
MLOps handles none of this well. It wasn't designed to. Calling your agent deployment "MLOps" is like calling your Kubernetes cluster a "script runner" — technically not wrong, but missing the actual discipline you need.
What AgentOps actually requires
AgentOps is a four-layer discipline. Teams that have shipped agents to production typically have the first two layers. The ones who've been burned tend to know which layers they were missing — and it's rarely the tracing layer.
Layer 1: Agent registry and lifecycle management. Before you can operate agents, you need a system of record for what's running. What agents exist? What model version does each use? What tools does each have access to? What policy version are they running against? Without a registry, you can't answer the question your CTO will eventually ask: "what do we have running in production right now?" A registry isn't just inventory — it's the prerequisite for every other AgentOps practice. You can't deploy a configuration change to "all agents handling customer data" if you don't have a queryable definition of that set.
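A registry can start as little more than a queryable record type. Below is a minimal sketch in Python, with hypothetical field names drawn from the list above (name, model, tools, policy version, tags); a real implementation would back this with a database rather than an in-memory dict, but the point is the query: "all agents handling customer data" must be answerable.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class AgentRecord:
    """One entry in the agent registry: the system of record for what's running."""
    name: str
    version: str
    framework: str          # e.g. "langchain", "crewai", "custom"
    model: str              # model identifier the agent calls
    tools: frozenset[str]   # tools the agent may invoke
    policy_version: str     # governance policy the agent runs against
    tags: frozenset[str] = field(default_factory=frozenset)

class AgentRegistry:
    def __init__(self) -> None:
        self._agents: dict[str, AgentRecord] = {}

    def register(self, record: AgentRecord) -> None:
        self._agents[record.name] = record

    def query(self, tag: str) -> list[AgentRecord]:
        """Answer questions like 'all agents handling customer data'."""
        return [a for a in self._agents.values() if tag in a.tags]

registry = AgentRegistry()
registry.register(AgentRecord(
    name="outreach-drafter", version="1.4.0", framework="langchain",
    model="gpt-4o", tools=frozenset({"web_search", "crm_read"}),
    policy_version="policy-v7", tags=frozenset({"customer-data"}),
))
```

With this in place, "what do we have running in production right now?" becomes a query, not an archaeology project.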
Layer 2: Execution tracing. You need to know what your agents are doing, not just what they're costing. Execution logs for agents need to capture the full execution graph: every LLM call, every tool invocation, every external API request, every sub-agent spawn, the timing, the token counts, the sequence. Not just the LLM API costs — those are one dimension of a multi-dimensional record. A team that has LLM cost visibility but no tool call tracing is flying half-blind. They know what the model charged them. They don't know what the agent actually did between calls.
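To make "full execution graph" concrete, here is a hedged sketch of a trace event schema. The event kinds and fields (LLM calls, tool calls, sub-agent spawns, timing, token counts, cost) follow the list above; the names are illustrative, not any particular SDK's API.

```python
import time
from dataclasses import dataclass

@dataclass
class TraceEvent:
    """One node in the execution graph of an agent session."""
    session_id: str
    kind: str           # "llm_call" | "tool_call" | "subagent_spawn"
    name: str           # model name, tool name, or child agent name
    started_at: float
    duration_ms: float
    tokens_in: int = 0
    tokens_out: int = 0
    cost_usd: float = 0.0

class Tracer:
    def __init__(self) -> None:
        self.events: list[TraceEvent] = []

    def record(self, event: TraceEvent) -> None:
        self.events.append(event)

    def session_cost(self, session_id: str) -> float:
        """Cost is one dimension of the record, not the whole record."""
        return sum(e.cost_usd for e in self.events if e.session_id == session_id)

tracer = Tracer()
now = time.time()
tracer.record(TraceEvent("s1", "llm_call", "gpt-4o", now, 820.0, 1500, 300, 0.011))
tracer.record(TraceEvent("s1", "tool_call", "crm_read", now, 120.0))
```

Note that the tool call carries no token cost at all; a team tracking only LLM spend would never see it, which is exactly the half-blindness described above.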
Layer 3: Runtime telemetry and alerting. Static logs tell you what happened. Runtime telemetry tells you what's happening now. For agents, the operational signals that matter aren't the same as for traditional services. You're not just watching p95 latency and error rates. You're watching session cost accumulation rates — because a session that's burning through its token budget faster than expected is a real-time signal, not a post-hoc finding. You're watching tool call failure rates by tool type, because a specific integration breaking can cause cascading agent failures. You're watching session duration distributions, because an agent stuck in a loop looks like an outlier on session length before it looks like anything else.
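Each of the three signals above reduces to a simple check. The sketch below is illustrative Python with made-up thresholds (a $0.25/min cost rate, a 20% tool failure rate, a 3-sigma session-length outlier); a real system would tune these per agent and per tool.

```python
from statistics import mean, stdev

def cost_rate_alert(cost_usd: float, elapsed_s: float,
                    ceiling_per_min: float = 0.25) -> bool:
    """Flag a session burning budget faster than expected, while it runs."""
    if elapsed_s <= 0:
        return False
    return (cost_usd / elapsed_s) * 60 > ceiling_per_min

def tool_failure_alert(failures: int, calls: int,
                       threshold: float = 0.2) -> bool:
    """Flag a tool whose failure rate suggests a breaking integration."""
    return calls > 0 and failures / calls > threshold

def duration_outlier(durations_s: list[float], session_s: float,
                     z: float = 3.0) -> bool:
    """An agent stuck in a loop shows up first as a session-length outlier."""
    if len(durations_s) < 2:
        return False
    mu, sigma = mean(durations_s), stdev(durations_s)
    return sigma > 0 and (session_s - mu) / sigma > z
```

None of these requires sophisticated infrastructure; what they require is that the underlying signals exist at all, which is why tracing is a prerequisite.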
Layer 4: Policy enforcement and governance. This is the layer most AgentOps implementations skip. And it's the layer that makes the first three layers matter.
Tracing tells you what happened. Telemetry tells you what's happening. Governance determines what's allowed to happen — and enforces those constraints before actions execute, not after they're logged.
Without governance, your AgentOps stack is a very sophisticated incident report generator. You'll have beautiful traces of exactly how your agent went wrong. You won't have stopped it.
Agentic governance at the AgentOps layer means: per-session cost ceilings that terminate sessions before they overspend; tool access policies that scope what each agent can invoke in what context; content policies that intercept PII before it leaves your system boundary; circuit breakers that terminate sessions exhibiting anomalous loop behavior; and human escalation gates for actions above a defined risk threshold. These aren't observability features. They're control features. The distinction matters operationally.
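A minimal sketch of what "enforce before actions execute" looks like, assuming a hypothetical policy record and a three-way decision (allow, deny, escalate). This is illustrative, not a production engine; the point is that the check runs before the tool call, not after it is logged.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    session_cost_ceiling_usd: float
    allowed_tools: frozenset[str]
    escalation_tools: frozenset[str]   # tools requiring human sign-off

def evaluate(policy: Policy, tool: str, session_cost_usd: float) -> str:
    """Evaluate BEFORE the action executes; return 'allow', 'deny', or 'escalate'."""
    if session_cost_usd >= policy.session_cost_ceiling_usd:
        return "deny"        # terminate before overspend, not after
    if tool not in policy.allowed_tools:
        return "deny"        # scope is enforced outside the agent prompt
    if tool in policy.escalation_tools:
        return "escalate"    # route the high-risk action to a human gate
    return "allow"

p = Policy(
    session_cost_ceiling_usd=0.50,
    allowed_tools=frozenset({"crm_read", "send_email"}),
    escalation_tools=frozenset({"send_email"}),
)
```

The control-vs-observability distinction lives in the return value: a tracer records the tool call either way; this function decides whether it happens.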
Where AgentOps breaks without governance
Three patterns emerge, reliably, when teams ship agents without the governance layer.
The cost spiral. An agent is deployed to handle a customer research workflow. Average session cost: $0.12. Ninety-nine sessions in, one hits an edge case where a tool call returns unexpected data, and the agent enters a sub-optimal retry pattern. That session costs $4.70. Nobody notices for two days. By then, a handful of similar sessions have run. The bill is wrong, the cause is reconstructable from logs, and the fix requires a code change that goes through normal deployment. The governance answer: a per-session cost ceiling terminates the anomalous session automatically, at $0.50, without a deployment or a ticket.
The silent policy violation. A customer service agent is granted read-only access to the customer database. Four months into production, a developer adds a new tool — a "helpful utility" for looking up related accounts — that has implicit write access to a tagging field. The agent starts using it. Nobody notices. The agent is correct that the tool is available to it; nobody told it that write access to that field has compliance implications. The governance answer: tool access policies are defined at the infrastructure layer, not inside the agent prompt. The new tool gets used only when the policy explicitly permits it.
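The fix described here is default-deny tool access: a tool being wired into the agent's toolset is not the same as the policy permitting it. A toy sketch, with hypothetical tool names matching the scenario:

```python
def permitted(policy_tools: frozenset[str],
              agent_toolset: frozenset[str],
              tool: str) -> bool:
    """Default-deny: availability in the agent's toolset is not permission.
    A tool a developer wires in later stays unusable until policy lists it."""
    return tool in agent_toolset and tool in policy_tools

policy_tools = frozenset({"customer_db_read"})
# The developer's new "helpful utility" is in the toolset, not the policy:
agent_toolset = frozenset({"customer_db_read", "related_accounts_util"})
```

Under this scheme, adding the utility to the agent changes nothing operationally until someone consciously adds it to the policy, which is exactly the review step the scenario was missing.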
The audit gap. A regulated industry team deploys agents for document processing. Six months later, a compliance audit asks for evidence that agents operated within defined data handling constraints. The team has logs. Logs show what happened — they do not show that specific constraints were evaluated before each action, that certain behaviors were blocked, or that the agent operated within a defined policy envelope. Logs are not enforcement records. The governance answer: policy evaluations are embedded in the execution trace as first-class events, not inferred from behavior.
All three of these are AgentOps failures, not engineering failures. The engineering was fine. The agents worked. The operational discipline that should have been around them wasn't.
How Waxell handles this
Waxell Observe instruments agents across any framework — LangChain, CrewAI, LlamaIndex, custom Python — with three lines of SDK code, capturing the full execution graph as the foundation of your AgentOps stack. Execution logs record every LLM call, tool invocation, external request, and sub-agent spawn with timing, token counts, and costs. Runtime telemetry surfaces operational signals in real time: session cost rates, tool call failure patterns, session duration distributions. On top of that observability layer, Waxell's policy engine evaluates before each tool call and output — enforcing cost ceilings, tool access scope, content filtering, and escalation triggers at the infrastructure layer, independent of agent code. Policy evaluations land in the execution trace as enforcement records — distinct events, not inferred from behavior after the fact. The governance layer runs inside the same operational data model as the observability layer: one instrumentation, one data store, one audit trail.
AgentOps in 2026: the moment it becomes mandatory
The week AWS made its DevOps Agent generally available, enterprises across six AWS regions gained access to an autonomous agent that investigates incidents, executes SRE tasks, and has integrations with Datadog, PagerDuty, GitHub Enterprise, Grafana, and Azure. This isn't a demo. It's production infrastructure.
The teams receiving this capability face an AgentOps gap immediately: they have an agent, but they didn't build it. They don't have a registry entry for it. They don't have custom telemetry watching its behavior. They don't have governance policies defining what it should and shouldn't be allowed to do in their specific environment.
AWS provides the agent. AWS does not provide your organization's governance requirements. That's the AgentOps gap, and it's now a gap in production rather than in theory.
This is the inflection point AgentOps has been building toward: not the moment teams decide to ship their first agent, but the moment production agents arrive whether they're ready or not. The discipline can no longer be deferred.
A January 2026 Futurum Group survey of 628 enterprise IT leaders found that 60% of organizations are actively using AI to build and deploy software. LangChain's 2026 State of Agent Engineering report found that 89% of teams have implemented observability for their agents — but only 52% have adopted evals. The report doesn't ask directly about runtime governance enforcement, but if observability outpaces evals by 37 points, enforcement almost certainly lags further. Teams are watching. They're not yet controlling.
That gap closes the hard way — through incidents — or the planned way, through an AgentOps practice that treats governance as a first-class concern from day one.
If your team is at the stage of building out that practice, get early access to Waxell — the governance layer built specifically for production agent deployments.
Frequently Asked Questions
What is AgentOps?
AgentOps is the set of operational practices, tooling, and governance controls required to deploy, monitor, and manage autonomous AI agents in production. It extends beyond MLOps — which stops at the model boundary — to cover the full execution surface of agentic systems: tool invocations, session cost management, multi-step reasoning traces, deployment lifecycle, and runtime policy enforcement. The term is emerging as agentic systems move from experimentation into production infrastructure.
How is AgentOps different from MLOps?
MLOps is built for the model: training pipelines, versioning, drift detection, prediction quality evaluation. The production unit is a model call — atomic, bounded, evaluable. AgentOps is built for the agent: a system that makes sequences of decisions, invokes tools, interacts with external systems, and accumulates costs across sessions. The failure modes differ (bad action vs. bad prediction), the cost models differ (per-session loop cost vs. per-call API cost), and the observability requirements differ (full execution graph vs. model call logs). MLOps practices are a starting point — they don't cover the action surface that agents introduce.
What does governance look like in an AgentOps stack?
Governance in AgentOps operates at the policy enforcement layer — above agent code, evaluated before each tool call and output. Concretely: per-session token budget policies that terminate sessions before they overspend; tool access controls that scope which tools each agent can invoke in which contexts; content policies that intercept PII or sensitive data before it crosses a system boundary; circuit breaker policies that terminate sessions exhibiting anomalous loop behavior; and escalation gates that route high-risk actions to human reviewers before execution. Without this layer, an AgentOps stack provides visibility into what happened — but no control over what happens.
Why can't I use CI/CD for AI agents the way I use it for code?
You can use CI/CD for deploying agent configurations and policy definitions — that part maps cleanly. The gap is in what CI/CD can't cover: agent behavior at runtime. A CI/CD pipeline can test that your agent passes your test suite before deployment. It cannot handle the novel edge cases that appear in production that didn't appear in your test suite. That's not a failure of CI/CD — it's a fundamental property of systems that interact with a non-deterministic external environment. Runtime governance fills the gap that pre-deployment testing leaves open.
What should be in an agent registry?
An agent registry should capture, at minimum: agent identity (name, version, framework), the model(s) it uses, the tools it has access to, the policy version it's running against, current deployment status, and the team responsible for it. The registry is the operational prerequisite for everything else — you can't apply a governance policy change to a category of agents if you don't have a queryable definition of that category. Most teams build this as a side effect of a CI/CD pipeline or a configuration management system; the more intentional approach is a first-class registry as part of the AgentOps platform.
When do teams need AgentOps vs. just MLOps?
The transition point is tool access. If your AI system only generates text — no external API calls, no database operations, no file writes — MLOps practices are sufficient. The moment your system takes actions in external systems, the failure mode changes from "bad output" to "bad action," and the operational requirements change with it. Most production agents cross this threshold immediately: even a simple retrieval-augmented agent is making external calls (to a vector database or document store). AgentOps applies from the first tool call.
Sources
- AWS, AWS DevOps Agent is now generally available (2026-03-31) — https://aws.amazon.com/about-aws/whats-new/2026/03/aws-devops-agent-generally-available/
- AWS, Announcing General Availability of AWS DevOps Agent — https://aws.amazon.com/blogs/mt/announcing-general-availability-of-aws-devops-agent/
- Futurum Group / DevOps.com, Futurum Group Survey Sees Increasing Investments in AI to Deliver Software (2026-01-24) — https://devops.com/futurum-group-survey-sees-increasing-investments-in-ai-to-deliver-software/
- LangChain, State of Agent Engineering (2026) — https://www.langchain.com/state-of-agent-engineering
- IBM, What is AgentOps? — https://www.ibm.com/think/topics/agentops