Austin Vance for Focused

Posted on May 28 • Originally published at focused.io

Agent Monitoring Is an Infrastructure Workload

#ai #programming

I have added agent monitoring to my list of reporting work that has crossed over into SRE production infrastructure. Annoying, but real. A trace used to explain a single request. Now it has to carry the agent run through tool calls, subagents, sandboxes, services, approvals, retries, and side effects. It has to make sense to an SRE reading it a week or so later, when no one remembers the details. It has to support rollback and the other production troubleshooting work SREs do. And it has to be understandable by an SRE who has not already read the full raw event log for the run.

Sarah Cat made the core point: managing and monitoring agents requires rethinking infrastructure, because existing systems were not designed for agent scale. Harrison Chase added that the same point applies on the monitoring side. Charity Majors made the observability version sharper: the usual transaction and trace building blocks have a huge problem tracking long-running async AI sessions.

Observability for long-running agent sessions is turning into the storage, identity, retention, correlation, and control-plane layer for AI agent behavior.

Fine. Call it monitoring if the label helps. Just don’t staff it like reporting.

The trace is carrying a bigger object

A web request enters through a handler, calls a database (maybe a queue), and returns a response to the caller. That is the shape of a transaction, and it is the shape observability tools were built around: spans, logs, metrics, individual exemplars, service maps, dashboards.

LangChain’s observability docs say agents require visibility into tools, prompts, decisions, tool calls, model interactions, and decision points. LangSmith’s dashboard docs turn that surface into operating metrics: trace count, latency, error rates, token usage, cost, tool run counts, tool errors, tool latency, run types, and feedback scores.

Read that list slowly. The list describes the beginning of an agent control plane.

I like traces. I like dashboards. I do not like pretending raw logs are the record.

The trace has to follow both the decision and the side effect.

For a single agent run, the logs are fine to read as logs. In a production system, where those logs later become evidence for SREs or security investigators, the trace also has to follow decisions and side effects across every boundary the agent crosses: tool boundaries, MCP boundaries, sandbox boundaries, and other runtime boundaries. We made the same argument in Agent Traces Need to Cross the MCP Boundary.

Long-running sessions break request-shaped thinking

A transaction trace is too small for a long-running agent session.

Honeycomb is running its 2026 Innovation Week around understanding, debugging, and improving AI systems in production. Its agent observability launch focuses on tracing agent activity, reconstructing decision paths, and debugging failures without manual log dives. The manual log diving is where immature agent systems die.

Without a session ID shared across systems, incident review degenerates into guesswork. Someone opens LangSmith to trace the model calls. Someone opens Honeycomb to trace the service operations. Someone checks the workflow queues. Someone reads application logs. Someone asks whether the sandbox still exists. Someone searches Slack for approvals. Then the team starts piecing the guesses together.

That is archaeology with search bars.
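A minimal sketch of the alternative, assuming every system stamps its events with the same session ID. The `Event` shape, the source names, and the sample data are hypothetical and not taken from LangSmith, Honeycomb, or any other product named here; the point is only that one shared key turns six searches into one ordered timeline.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str       # hypothetical system name, e.g. "model_trace" or "approvals"
    session_id: str   # the shared identifier every system is assumed to propagate
    timestamp: float  # seconds, on a clock all systems are assumed to share
    summary: str


def build_timelines(events):
    """Group events from every system by session ID, ordered by time."""
    timelines = defaultdict(list)
    for event in events:
        timelines[event.session_id].append(event)
    for timeline in timelines.values():
        timeline.sort(key=lambda e: e.timestamp)
    return dict(timelines)


# Illustrative events arriving out of order from three different systems.
events = [
    Event("service_trace", "sess-1", 12.0, "POST /customers/42"),
    Event("model_trace", "sess-1", 10.0, "model chose update_customer tool"),
    Event("approvals", "sess-1", 11.0, "human approved customer update"),
    Event("model_trace", "sess-2", 10.5, "model chose search tool"),
]

timelines = build_timelines(events)
for event in timelines["sess-1"]:
    print(f"{event.timestamp:>5} {event.source:<14} {event.summary}")
```

The join itself is trivial. The hard part is the assumption in the comments: every tool, queue, sandbox, and approval channel has to carry the same identifier.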


Stateful agents raise the pressure again. The DeltaBox paper frames AI agents as systems that perform high-frequency state exploration and rely on checkpoint and rollback of complete sandbox state, including files and process state. With change-based sandbox state, the paper reports 14 ms checkpoints and 5 ms rollbacks. Prior approaches copy the full state to roll back, adding hundreds of milliseconds to seconds of latency during evaluation. At those speeds, high-frequency state exploration becomes practical, which means many more checkpoints and rollbacks for monitoring to account for.

The same pressure shows up on the UI side. In Agent UI Is Runtime Infrastructure we noted that the agent event stream contains state that a user can act on. Monitoring has to share those identifiers. The backend trace and the user-visible event belong to the same run.

AI SRE starts before the incident prompt

The term “AI SRE” is likely to be abused in 2026, but the useful meaning is mundane: an agent helps operate software, and the software gives that agent enough structure to avoid burning the incident budget on interpretation.

The Causely paper puts hard numbers on AI in SRE. The authors describe how AI-driven workflows derive the environment state from raw telemetry and pay for it in tokens, latency, and interpretation errors. In a 24-microservice OpenTelemetry demo application with injected faults, causal grounding reduced mean time to diagnosis by 63 percent, token consumption by 60 percent, and tool calls by 78 percent, and it improved root-cause accuracy from 75 percent to 100 percent.


Monitoring an agent means building production monitoring for the service that runs the agent. That work is data modeling, retention, identity propagation, correlation, storage, schema, and permissions. Boring work. That is why no one wants to do it. That is also why it matters, as long as the work stays boring and does not turn into AI SRE snake oil.

It also has to close the loop from signal to issue. The signal should open a ticket, attach evidence to it, and then keep updating the evaluator or release gate based on what the human does with that ticket. We made that case in Agent Failures Should Open Tickets.

Ownership belongs below the dashboard

Agent monitoring fails when ownership only exists inside the dashboard.

Who Owns This Agent? observes an attribution gap in which the agent’s behavior is visible to affected parties, but the responsible operator or account is not identifiable. The enterprise version is obvious enough. A finance agent updates a customer record. A security agent modifies a ticket. A coding agent opens a pull request. A support agent sends an email. In each case, the record should say: system, account, policy, and operator.

The answer cannot be "check the logs." Come on.

Agent attribution forms part of monitoring, along with the permission state, session identity, and the rollback context of the deployment. The user-visible event and the backend action that caused it need a shared record. Supervisor and swarm systems spread decisions across agents with different responsibilities, which is why multi-agent orchestration has to be an architecture choice, not a vibe.

A production record has to survive outside the dashboard. I want a ledger that can answer boring questions without detective work: which run, which owner, which permission, which tool, which checkpoint, which evaluator, which side effect. The dashboard can stay a view. The record has to be the substrate.
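As a rough sketch of what "boring questions without detective work" could look like, here is a tiny ledger in an in-memory SQLite database. The table name, columns, and sample rows are all hypothetical, chosen to mirror the list above. A real substrate would need durable, append-only storage and access control, which this sketch ignores.

```python
import sqlite3

# In-memory database for illustration only; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ledger (
        run_id TEXT, owner TEXT, permission TEXT, tool TEXT,
        checkpoint TEXT, evaluator TEXT, side_effect TEXT
    )
    """
)

# Made-up rows: one row per side effect an agent run produced.
rows = [
    ("run-1", "finance-ops", "crm:write", "update_customer", "ckpt-7", "pass", "customer 42 updated"),
    ("run-1", "finance-ops", "email:send", "send_email", "ckpt-8", "fail", "email sent to customer 42"),
    ("run-2", "support", "tickets:write", "modify_ticket", "ckpt-1", "pass", "ticket 9 relabeled"),
]
conn.executemany("INSERT INTO ledger VALUES (?, ?, ?, ?, ?, ?, ?)", rows)


def side_effects_for_run(conn, run_id):
    """Answer 'which owner, permission, tool, checkpoint, evaluator, side effect' for one run."""
    cursor = conn.execute(
        "SELECT owner, permission, tool, checkpoint, evaluator, side_effect "
        "FROM ledger WHERE run_id = ? ORDER BY rowid",
        (run_id,),
    )
    return cursor.fetchall()


for row in side_effects_for_run(conn, "run-1"):
    print(row)
```

A dashboard can be built on top of a table like this. The reverse does not work: a chart cannot be queried for the owner of a side effect.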

Build the monitoring substrate before the agent fleet arrives

Monitoring an AI agent is a platform capability, with its own ownership, budget, and failure modes. Treating it that way eventually makes vendor comparisons useful. The steering deck can wait.

For incident tracking and monitoring to work, the evidence needs structure. As we argued in Agent Failures Should Open Tickets, the entire prompt may not be safe to store, and raw logs are not enough. Store the fields that make incident response possible: tool called, rejected alternatives, human policy decision, permissions, session cost, retry reason, checkpoint pointer, evaluator result, and final side effect. Govern the sensitive information. Keep the operational information queryable.
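One way to picture that split between governed and queryable data is a structured record that keeps the operational fields and stores only a digest of the prompt. This is a sketch under assumptions: the `EvidenceRecord` class, the field names, and the choice of a SHA-256 digest are illustrative, not a schema from any tool mentioned in this post.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class EvidenceRecord:
    session_id: str
    tool_called: str
    rejected_alternatives: List[str]
    human_policy_decision: str
    permissions: List[str]
    session_cost_usd: float
    retry_reason: Optional[str]
    checkpoint_pointer: str
    evaluator_result: str
    final_side_effect: str
    prompt_sha256: str  # digest only; the raw prompt is assumed to live in a governed store


def make_record(prompt: str, **fields) -> EvidenceRecord:
    """Build a record that references the prompt by hash instead of storing its text."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return EvidenceRecord(prompt_sha256=digest, **fields)


# Made-up values for illustration.
record = make_record(
    "raw prompt text containing customer details",
    session_id="sess-1",
    tool_called="update_customer",
    rejected_alternatives=["send_email", "escalate_to_human"],
    human_policy_decision="approved",
    permissions=["crm:write"],
    session_cost_usd=0.42,
    retry_reason=None,
    checkpoint_pointer="ckpt-7",
    evaluator_result="pass",
    final_side_effect="customer 42 updated",
)

serialized = json.dumps(asdict(record), sort_keys=True)
print(serialized)
```

The digest lets an investigator confirm which prompt a record refers to without the record itself becoming a place where sensitive text leaks.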

Connecting monitoring to repair work matters too. Failed evaluators should create issues. Cost anomalies should page the owner or throttle the workflow. Repetitive tool errors should open a repair loop. A rollback must attach the checkpoint, diff, and human decision path. A human approval to modify customer data should attach to the trace, not to a screenshot in Slack.
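The routing described above can be sketched as a small dispatch function. Everything here is hypothetical: the signal kinds, the action strings, and the repeat threshold are placeholders for whatever ticketing, paging, and workflow systems a team actually runs.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    kind: str        # assumed kinds: "evaluator_failed", "cost_anomaly", "tool_error"
    session_id: str
    owner: str
    count: int = 1   # how many times this signal has repeated in the session


REPEAT_THRESHOLD = 3  # assumed cutoff for "repetitive" tool errors


def route(signal: Signal) -> str:
    """Map a monitoring signal to a repair action, returned as a plain string."""
    if signal.kind == "evaluator_failed":
        return f"open_issue:{signal.session_id}"
    if signal.kind == "cost_anomaly":
        return f"page_owner:{signal.owner}"
    if signal.kind == "tool_error" and signal.count >= REPEAT_THRESHOLD:
        return f"open_repair_loop:{signal.session_id}"
    # Anything else is kept as evidence but triggers no action.
    return "record_only"


print(route(Signal("evaluator_failed", "sess-1", "finance-ops")))
print(route(Signal("cost_anomaly", "sess-1", "finance-ops")))
print(route(Signal("tool_error", "sess-1", "finance-ops", count=4)))
print(route(Signal("tool_error", "sess-1", "finance-ops", count=1)))
```

In a real system each branch would call out to a tracker or pager and attach the evidence record. Returning strings keeps the sketch self-contained.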

This is the unglamorous part of AI agent infrastructure. It is also where production agents become manageable. The agent can be clever. The runtime cannot be casual.

Monitoring an AI agent is an infrastructure workload. The agent is a running system with memory, side effects, cost, ownership, state, and judgment loops. A dashboard may report the smoke. The monitoring substrate has to explain the fire.

Top comments (9)

xulingfeng • May 28

The framing of agent monitoring as infrastructure workload (not debugging tool) is spot on. We've been running Hermes agents in production and the hardest lesson was that traditional APM tools don't understand agent-specific signals — tool call latency, context window pressure, memory retrieval quality.

One thing I'd love to hear more about: how do you separate "the agent is working as expected" from "the agent is silently failing but still generating output"? That's been our biggest blind spot in production.

Austin Vance Focused • May 28

We instrument traces with things like tool calls, then run online evals that alert if the tool call rate falls below an expected value.

xulingfeng • May 28

Really appreciate this post! Agent monitoring is exactly the problem we've been trying to solve. Followed you — would love to see more on this topic 👋

xulingfeng • May 28

Nice approach! We do something similar but with token-level anomaly detection on the Hermes side. Curious — do you run evals in real-time or batch after the fact? Both have trade-offs and I keep going back and forth 🤔

Austin Vance Focused • May 28

Depends on the eval. Real-time for some, batch for others. Run a sample set or 💸💸💸
