Logan for Waxell

Posted on • Originally published at waxell.ai

You've Shipped Agents. Now You Have to Run Them.

Shipping an agent is an act of optimism. Running it is an act of engineering discipline.

These are different skills. The skills that get you from idea to working demo — prompt engineering, tool design, context management, iteration speed — are not the same skills you need when the thing is live and real users are depending on it and something is going wrong at 11pm and you need to figure out what.

Most engineering teams learned this the hard way with microservices. The same lesson is playing out right now with agents. Different architecture, same core problem: building something and operating something are not the same discipline.

Here's what changes when you shift from "we shipped an agent" to "we're running an agent in production." (See also: Why AI agent costs spiral → · What is agentic governance →)


What Breaks First

Not what breaks catastrophically — what breaks first, subtly, in the ways you don't catch until they've been broken for a while.

Latency. Your p95 looked fine in testing. In production, you have tail cases that never appeared in your test set — long context windows, tool calls that take longer under load, retry sequences. Your p99 is significantly worse than your median. Users in that tail are having a bad experience, and you're finding out from support tickets, not monitoring.
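As a sketch of what surfacing that gap looks like, here is a nearest-rank percentile check over a synthetic list of per-session latencies (`session_latencies_ms` is illustrative data, not real measurements). The point is the median/tail spread, not the exact numbers:

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    # Nearest-rank index: ceil(pct/100 * n), clamped to a valid 0-based index.
    k = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[k]

# 95 fast sessions, plus a handful of tail cases the test set never had:
# long contexts, slow tool calls, retry chains.
session_latencies_ms = [800] * 95 + [2500, 3100, 4800, 9200, 15000]

p50 = percentile(session_latencies_ms, 50)  # the median looks healthy
p99 = percentile(session_latencies_ms, 99)  # the tail is over 10x worse
```

If your dashboard only plots the median, every user in that p99 bucket is invisible until they file a ticket.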

Behavioral drift. This one's insidious. The model provider ships a new version. A tool your agent depends on changes its response schema. Someone modifies the system prompt for a product reason and the downstream effects on agent behavior weren't fully mapped. None of these show up as errors. The agent still runs. It just behaves differently — and you might not notice for days.
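One cheap defense is to check the tool contract at the boundary instead of letting a schema change silently shift agent behavior. A minimal sketch, assuming a hypothetical tool whose response should contain `order_id`, `status`, and `total_cents` (the field names are illustrative):

```python
# Hypothetical minimal contract for one tool's response.
EXPECTED_SCHEMA = {"order_id": str, "status": str, "total_cents": int}

def check_tool_response(payload: dict) -> list[str]:
    """Return a list of contract violations; an empty list means it holds."""
    violations = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in payload:
            violations.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            violations.append(
                f"wrong type for {field}: {type(payload[field]).__name__}"
            )
    return violations
```

When the provider renames `total_cents` or changes a type, this turns "the agent quietly behaves differently" into an alert you see the same day, not an investigation a week later.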

Context window edge cases are the gap between your test suite and reality. Production has long sessions, confused users who restart mid-conversation, inputs that contain unexpected content, tool responses three times longer than anticipated. Your context management wasn't built for any of this.
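A context budget makes these edge cases degrade predictably instead of overflowing the window. This is a sketch only: token counting is approximated by word count (a real tokenizer would replace `rough_tokens`), and the trimming policy — keep the system prompt plus the most recent turns that fit — is one choice among several:

```python
def rough_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer; good enough to illustrate.
    return len(text.split())

def trim_context(system_prompt: str, turns: list[str], budget: int) -> list[str]:
    """Keep the system prompt plus as many of the newest turns as fit."""
    remaining = budget - rough_tokens(system_prompt)
    kept = []
    for turn in reversed(turns):          # walk newest-first
        cost = rough_tokens(turn)
        if cost > remaining:
            break                          # oldest turns fall off first
        kept.append(turn)
        remaining -= cost
    return [system_prompt] + list(reversed(kept))
```

The useful property is that a three-times-longer tool response or a marathon session costs you the oldest history, not an API error at the worst possible moment.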

Concurrency breaks things that worked perfectly in sequence. Resource contention, rate limits on downstream tools, session isolation issues — problems that only exist when multiple sessions are running at once.
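Bounding concurrency on the agent side is often simpler than negotiating with every downstream rate limit. A minimal sketch using an asyncio semaphore, where `call_tool` is a hypothetical stand-in for a real tool invocation:

```python
import asyncio

async def call_tool(payload: str) -> str:
    # Hypothetical downstream tool; sleep simulates its latency.
    await asyncio.sleep(0.01)
    return f"ok:{payload}"

async def guarded_tool_call(slots: asyncio.Semaphore, payload: str) -> str:
    # Sessions queue here instead of tripping the downstream rate limit.
    async with slots:
        return await call_tool(payload)

async def run_sessions(n: int, max_concurrent: int = 4) -> list[str]:
    slots = asyncio.Semaphore(max_concurrent)
    return await asyncio.gather(
        *(guarded_tool_call(slots, str(i)) for i in range(n))
    )
```

This doesn't fix session isolation bugs, but it turns "ten concurrent sessions hammer the tool and half get 429s" into "ten sessions share four slots and all complete."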

And then there's cost variance. Your average case is fine. Your variance is not. Long sessions, retry chains, and aggressive tool use by certain user segments run up bills that your average-case projections never captured.
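Catching this means ranking sessions by cost, not averaging them. A sketch over hypothetical per-session spend (`session_costs` is made-up data; the 1% threshold is a tunable choice):

```python
def cost_outliers(session_costs: dict[str, float], top_fraction: float = 0.01):
    """Return (mean cost, session ids in the top `top_fraction` by cost)."""
    mean = sum(session_costs.values()) / len(session_costs)
    ranked = sorted(session_costs, key=session_costs.get, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return mean, ranked[:k]

# 99 ordinary sessions and one retry-heavy marathon.
costs = {f"s{i}": 0.02 for i in range(99)}
costs["s_long"] = 5.00
mean_cost, top_sessions = cost_outliers(costs)
# The mean stays under seven cents while one session costs 250x that.
```

Average-case budgeting would sign off on this workload; the ranked view flags the session that actually needs a spend cap.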


The Operational Questions You Need to Be Able to Answer

The test of whether you're actually running an agent — versus just having deployed one — is whether you can answer the operational questions on demand. Not after a two-hour investigation. On demand.

What's my p99 latency right now? Which sessions today took the longest, and why? What fraction of sessions in the last 24 hours completed successfully vs. hit an error? Did the agent's behavior change after the system prompt update yesterday? What's my average cost per session today vs. last week — and which sessions are in the top 1% of cost? What PII has entered agent context in the last 7 days, how was it handled, and where did it end up?

If answering any of those requires digging through logs instead of pulling up a dashboard, you have observability data, but you don't have operational capability.
If answering any of those requires digging through logs instead of pulling up a dashboard, you have observability, but you don't have operational capability.


Building an SLA for Your Agent

Most teams haven't asked this question yet, and it shows: what does reliability actually mean for an AI agent?

For a traditional API, it's clean. Uptime percentage. Error rate. Latency percentiles. Done.

For an agent, reliability has a behavioral dimension that makes it harder. An agent that responds within your latency target but gives a confidently wrong answer isn't reliable in the way that matters. So you need to think about reliability across multiple axes: Does the agent respond? (Availability — table stakes.) Does it respond correctly? (Behavioral consistency — which means you've defined what "correct" looks like, and that's a product decision before it's an engineering one.) Does it stay inside its policy envelope — spend budgets, PII handling, tool constraints? (Governance compliance — measurable if you have the infrastructure, invisible if you don't.) And what happens when it can't handle a request? A good SLA defines acceptable fallback behavior, not just acceptable success behavior.
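Those four axes can be made concrete as a per-session check. This is a sketch, not a standard: the field names, the eval-score threshold, and the idea that a clean fallback counts as meeting the SLO are all illustrative choices your team would make explicitly:

```python
from dataclasses import dataclass

@dataclass
class SessionRecord:
    responded: bool          # availability: did the agent answer at all?
    eval_score: float        # behavioral consistency: 0..1 from your eval suite
    policy_violations: int   # governance: spend, PII, tool-constraint breaches
    fell_back_cleanly: bool  # degradation: did a punt follow the defined fallback?

def meets_slo(s: SessionRecord, min_eval_score: float = 0.8) -> bool:
    if not s.responded:
        # A defined, clean fallback is acceptable; a silent failure is not.
        return s.fell_back_cleanly
    return s.eval_score >= min_eval_score and s.policy_violations == 0
```

The hard work isn't this function — it's that `eval_score` only exists if you've defined "correct," and `policy_violations` only exists if the policy envelope is explicit and enforced.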

Having this conversation with your stakeholders explicitly turns "the agent sometimes does weird things" into a quantifiable problem instead of a vague worry.


Incident Response for AI Behavior

Traditional incident response has a clean shape: something breaks, you find the root cause, you fix it, you deploy. Bounded in time and scope.

Agent incidents don't work like that.

First, the incident may have been happening for days before anyone flagged it. Behavioral issues don't always surface as errors — they look like slightly worse retention, slightly higher support volume, slightly more escalations. By the time someone says "this is an incident," you're reconstructing what happened over a week, not debugging a single event.

Second, the fix often isn't a code deploy. Maybe it's reverting a system prompt change. Maybe it's adapting to a model behavior shift from the upstream provider. Maybe it's a policy update for a PII handling gap. The intervention options are fundamentally different.

And rollback? Rollback doesn't mean the same thing when behavior is distributed across the model, the prompt, the tools, and the governance policies. You need to figure out which layer the problem lives in before you know what to revert. Meanwhile, if the agent already produced bad outputs that users acted on or that got logged in downstream systems, those effects don't disappear when you fix the agent. Your incident response needs to account for cleanup and user communication, not just the technical fix.

Think through this before the incident happens. Document your response procedures. Define what "resolved" means. The difference between a managed incident and a chaotic one is whether you did the thinking in advance.


How Do You Move from Reactive to Governed Agent Operations?

The teams that operate agents well aren't the ones that get good at firefighting. They're the ones that systematically reduce the number of fires. That means moving from reactive — find it when it breaks — to governed — define what acceptable looks like, enforce it continuously, know immediately when it's violated.

What that actually takes: a policy layer that makes "acceptable behavior" explicit instead of hoping the model does the right thing. An enforcement mechanism that applies those policies in real time, not after the fact. Instrumentation that makes the operational questions answerable without an investigation. And incident playbooks that treat agent behavior as its own category — because it is.
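As a sketch of what real-time enforcement means, here is a minimal policy gate that sits in front of tool execution. The policy fields (a tool allowlist and a per-session spend cap) are illustrative; a production version would cover PII handling and more:

```python
class PolicyViolation(Exception):
    """Raised when an action would leave the policy envelope."""

class PolicyGate:
    def __init__(self, allowed_tools: set[str], spend_cap: float):
        self.allowed_tools = allowed_tools
        self.spend_cap = spend_cap
        self.spent = 0.0

    def authorize(self, tool: str, est_cost: float) -> None:
        # Enforce the envelope before the call happens, not in a postmortem.
        if tool not in self.allowed_tools:
            raise PolicyViolation(f"tool not allowed: {tool}")
        if self.spent + est_cost > self.spend_cap:
            raise PolicyViolation("session spend cap exceeded")
        self.spent += est_cost
```

The difference between this and a log line is the difference between governed and reactive: the violation never executes, and the attempt is still recorded for the audit trail.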

None of this is exotic. It's ops discipline applied to a new kind of system. Teams that bring the same rigor to their agents that they'd bring to a database or a microservice find that agents are perfectly operable. Teams that treat agents as something too intelligent to need real ops are the ones with the war stories.


How Waxell handles this: Waxell provides the governance and operational layer that makes "running agents" different from "having deployed agents" — real-time cost tracking, behavioral policy enforcement, PII controls, and a queryable audit trail. The operational questions (latency, cost per session, behavioral drift, data handling) become answerable on demand without engineering investigation. No rewrites. Deploy over whatever you've already built. Start free →


Frequently Asked Questions

What breaks first when you run AI agents in production?
In order of typical appearance: latency tail cases (p99 latency significantly worse than median, discovered through support tickets not monitoring), behavioral drift after upstream changes (model updates, tool schema changes, prompt modifications with unmapped downstream effects), context window edge cases (long sessions, unexpected tool response lengths), and cost variance (average-case costs in budget, but outlier sessions running up the tail).

How do you build an SLA for an AI agent?
An agent SLA needs to cover four dimensions: availability (error rate, latency at defined percentiles), behavioral consistency (responses meet defined quality criteria, evaluated against some benchmark), governance compliance (agent operates within its policy envelope — spend, PII, tool constraints), and degradation behavior (defined fallback when the agent can't handle a request). Each dimension requires having defined what "acceptable" looks like before you can measure it, which is a product decision before it's an engineering one.

What does AI agent incident response look like?
Agent behavior incidents differ from traditional software incidents in four ways: the problem may have been happening for days before detection; the fix may not be a code deploy (it might be a policy update, a prompt revert, or a governance layer change); rollback has a different meaning because behavior is distributed across model, prompt, tools, and policies; and impact may not be fully reversible if bad outputs were acted on or logged downstream. Response procedures need to account for all of this before an incident, not during one.

How is operating AI agents different from traditional software operations?
Traditional systems are deterministic — you control the code, the code executes predictably. Agents are probabilistic — the same inputs can produce different outputs, and behavior is distributed across model, prompt, tools, and governance layer. This means traditional on-call runbooks don't translate directly. Agent operations requires understanding which layer a problem is at (model behavior? prompt? tools? policies?), what "rollback" means for each layer, and how to measure behavioral compliance, not just technical availability.

What operational questions should you be able to answer about your AI agents?
On demand, without investigation: current p50 and p99 session latency; fraction of sessions in the last 24 hours that completed successfully vs. hit errors; average cost per session today vs. last week; top 1% of sessions by cost and what made them expensive; any PII that entered context in the last 7 days and how it was handled; whether agent behavior shifted after the last system prompt change. If any of these require a manual investigation rather than a dashboard query, you have observability but not operational capability.

Top comments (2)

Alessandro Pireno

The gap between "agents work in my demo" and "agents work reliably in production" is the most important unsolved problem in this space right now. The hardest part isn't the agent logic — it's that agents fail silently and gracefully, which is worse than crashing. A crash you catch. An agent that confidently produces the wrong output and moves on? You find that in a postmortem. Observability for agents needs to be opinionated about capturing decision points, not just logs. What's working for your team on the instrumentation side?

Incident Copilot

This framing is exactly right. A lot of teams still think the hard part is getting an agent demo to work, when the real difficulty starts once you need repeatability, auditability, and incident response for behavior instead of just uptime.

The line that matters most is the distinction between observability and operational capability. If you cannot answer latency, drift, cost, and policy questions on demand, you do not really have an operable system yet.