SapotaCorp

Posted on May 24 • Originally published at sapotacorp.vn

What to monitor in an AI agent before you launch (and after)

A founder we work with texted six weeks after launching their AI agent: "Something is wrong. Cost has tripled this month, response times have gotten worse, but I don't actually know which part of the system is the problem. Help."

The system was a multi-agent customer support setup with three agents and four tools. The team had built it well in most respects: solid prompts, good tool integration, decent eval before launch. They had skipped one thing: observability. Six weeks of production traffic had gone through the system with no traces, no per-step latency tracking, no cost attribution.

Diagnosing the problem took us a week of forensic work, mostly because we had to retroactively instrument what should have been there from day one. The fix took an afternoon. By the time we found the actual issue (one of the agents had started looping more often after a vendor model update), the team had spent thousands of dollars on unnecessary LLM calls and had been eroding customer trust for weeks.

This is the most preventable category of agent failure. The minimum observability stack costs almost nothing to install before launch and saves weeks of debugging when something goes wrong. Here is what Sapota ships with every production agent.

The three failure modes observability catches

AI agents in production fail in three structurally different ways. None of them are visible without instrumentation.

Silent quality drops. The agent's responses get worse over time. The corpus drifts, the model provider updates the underlying weights, the prompt template gets edited, the user query distribution shifts. If quality degrades by 5% per week and the team does not notice for two months, customer-reported errors have spiked and trust has eroded by the time anyone looks.

Cost spikes. Production cost runs 3 to 10 times the test projection. Sometimes this is because production queries are more complex than test queries. Sometimes it is because a specific user is abusing the system. Sometimes it is because one agent in a multi-agent setup has started looping more often. Without per-request, per-agent cost tracking, you cannot tell which.

Cascade failures. Tool A starts timing out. Agent B retries it. Agent C waits for Agent B's output. The whole system slows to a crawl, but no single component is "down" enough to alert. Latency at the 95th percentile triples, but the average looks fine. Without traces across the agent system, the failure mode is invisible.

Each of these failure modes has cost real production agents real money in the audits we have run. None of them needed to happen.

Layer 1: Tracing

The minimum trace setup captures, per request:

Request ID that propagates through every step
User ID for cost attribution and abuse detection
Input and output of each LLM call (input tokens, output tokens, latency, cost)
Tool calls with input, output, latency, and any errors
Exit reason: success, max iterations hit, timeout, abort
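As a rough illustration of the fields listed above, here is a minimal sketch of a per-request trace record using only the Python standard library. The class and field names (`Trace`, `LLMCall`, `ToolCall`, `search_orders`, `example-model`) are hypothetical and are not the schema of Langfuse or any other tool; a real tracing library will define its own.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field


@dataclass
class LLMCall:
    model: str
    input_tokens: int
    output_tokens: int
    latency_s: float
    cost_usd: float


@dataclass
class ToolCall:
    tool: str
    latency_s: float
    error: str | None = None  # None means the call succeeded


@dataclass
class Trace:
    user_id: str  # for cost attribution and abuse detection
    # One ID per request, passed along to every step that handles it.
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    llm_calls: list[LLMCall] = field(default_factory=list)
    tool_calls: list[ToolCall] = field(default_factory=list)
    exit_reason: str = "unknown"  # success | max_iter | timeout | abort

    @property
    def total_cost_usd(self) -> float:
        return sum(call.cost_usd for call in self.llm_calls)


# Hypothetical request: one LLM call, one tool call, clean exit.
trace = Trace(user_id="user-123")
trace.llm_calls.append(LLMCall("example-model", 850, 120, 1.4, 0.004))
trace.tool_calls.append(ToolCall("search_orders", 0.3))
trace.exit_reason = "success"
```

Keeping the exit reason as an explicit field is what makes patterns like a rising "max_iter" rate queryable later.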

The implementation is not complicated. Langfuse, Opik, LangSmith, and Helicone each provide this with roughly 10 lines of code wrapping your LLM client. The marginal cost is roughly 5% added latency plus storage that runs about 1% of LLM cost. Both are negligible.

We default to Langfuse self-hosted for most clients. Open source, no per-trace pricing, runs on a single small VM. LangSmith is fine if you are already in the LangChain ecosystem. Opik is good if you want minimal setup. Pick one and install it before launch.

The practical value: when something breaks, the traces tell you exactly where. The founder's looping agent showed up in traces as "Agent B exit reason: max_iter" appearing 4x more often than the baseline. Without traces, that pattern was invisible.
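To make the "4x more often than the baseline" comparison concrete, here is a small sketch that turns a list of exit reasons into rates and compares a current window against a baseline window. The function names and the sample counts are made up for illustration; they are not the founder's actual data.

```python
from collections import Counter


def exit_reason_rates(exit_reasons):
    """Share of requests ending with each exit reason."""
    total = len(exit_reasons)
    if total == 0:
        return {}
    return {reason: n / total for reason, n in Counter(exit_reasons).items()}


def rate_ratio(current, baseline, reason):
    """How many times more often `reason` occurs now than in the baseline."""
    base = baseline.get(reason, 0.0)
    now = current.get(reason, 0.0)
    if base == 0.0:
        return float("inf") if now > 0.0 else 1.0
    return now / base


# Illustrative numbers only: 5% max_iter at baseline, 20% now.
baseline = exit_reason_rates(["success"] * 95 + ["max_iter"] * 5)
current = exit_reason_rates(["success"] * 80 + ["max_iter"] * 20)
ratio = rate_ratio(current, baseline, "max_iter")
```

In practice the exit reasons would come from the trace store, grouped per agent, so that a ratio well above 1 points at a specific agent rather than at the system as a whole.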

Layer 2: Metrics

Aggregated dashboards built on top of traces. The metrics that matter for production agents:

Quality metrics:

Task completion rate (% of requests where the agent reaches a "success" exit, not max_iter or abort)
Faithfulness score (if you have a faithfulness gate, track its output)
User satisfaction (thumbs up rate, if you have feedback collection)

Performance metrics:

Latency at p50, p95, p99 (averages hide the tail)
Iteration count per task (a high iteration count usually means the agent is stuck in a loop)
Tool call frequency by tool
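The point about averages hiding the tail can be checked with a few lines. This sketch computes mean, p50, p95, and p99 with the standard library's `statistics.quantiles`; the two latency samples are synthetic and chosen only to show the effect.

```python
import statistics


def latency_summary(latencies_s):
    """Mean and tail percentiles for a list of request latencies in seconds."""
    # n=100 yields 99 cut points; index 49 is p50, 94 is p95, 98 is p99.
    cuts = statistics.quantiles(latencies_s, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(latencies_s),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Synthetic data: every request takes 2s, then 6% of requests slow to 6s.
healthy = latency_summary([2.0] * 100)
degraded = latency_summary([2.0] * 94 + [6.0] * 6)
```

With these samples the p95 goes from 2 seconds to 6 seconds while the mean only moves from 2.0 to 2.24, and the median does not move at all. A dashboard that plots only the mean would show a 12% change for a tail that has tripled.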

Cost metrics:

Cost per task (mean, p95, max)
Cost per user per day (to catch abuse and runaway costs)
Cost by model (if using a multi-model setup)
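Cost attribution is mostly a grouping problem once every trace carries a user ID, an agent name, and a cost. The sketch below assumes cost records are plain dictionaries with hypothetical keys (`user_id`, `agent`, `day`, `cost_usd`); the values are invented.

```python
from collections import defaultdict


def cost_by(records, *keys):
    """Sum cost_usd over records, grouped by the given record keys."""
    totals = defaultdict(float)
    for record in records:
        group = tuple(record[key] for key in keys)
        totals[group] += record["cost_usd"]
    return dict(totals)


# Invented records; in practice these would be exported from the trace store.
records = [
    {"user_id": "u1", "agent": "router", "day": "day-1", "cost_usd": 0.25},
    {"user_id": "u1", "agent": "router", "day": "day-1", "cost_usd": 0.50},
    {"user_id": "u2", "agent": "billing", "day": "day-1", "cost_usd": 0.125},
    {"user_id": "u2", "agent": "router", "day": "day-2", "cost_usd": 0.25},
]

per_agent = cost_by(records, "agent")
per_user_day = cost_by(records, "user_id", "day")
top_spender = max(per_user_day, key=per_user_day.get)
```

Grouping by agent answers "which part of the system is expensive", and grouping by user and day is the view that surfaces abuse.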

Reliability metrics:

Error rate by error type (transient, permanent, validation)
Tool failure rate by tool
Fallback usage rate
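Tracking error rate by type requires deciding how errors map to types. The mapping below is an assumption made for illustration (timeouts and connection errors as transient, bad input as validation, everything else as permanent); a real system would classify on its own exception types or provider error codes.

```python
from collections import Counter


def classify_error(exc):
    """Assumed mapping from exception class to error type."""
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return "transient"  # usually worth retrying
    if isinstance(exc, (ValueError, TypeError)):
        return "validation"  # bad input or malformed model output
    return "permanent"  # retrying is unlikely to help


def error_rate_by_type(errors, total_requests):
    """Errors of each type as a fraction of all requests."""
    counts = Counter(classify_error(exc) for exc in errors)
    return {kind: n / total_requests for kind, n in counts.items()}


# Three failures out of 100 requests, invented for the example.
rates = error_rate_by_type(
    [TimeoutError(), ConnectionResetError(), ValueError("bad JSON")],
    total_requests=100,
)
```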

A production dashboard with eight to twelve metrics covers most of what you need to see. We typically build this in Grafana on top of the trace store, or in the native dashboards of Langfuse / Opik / LangSmith.

Layer 3: Alerts

Metrics on a dashboard are useful for investigation. Alerts are what catch problems before customers do.

The minimum alert set:

Error rate spike. Alert if error rate exceeds 5% over any 5-minute window. Catches sudden breakage from a tool outage, model provider issue, or bad deploy.
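One way to evaluate this rule is a sliding window over request outcomes. The sketch below is an in-process version; in most setups the same rule would be written in the alerting system's own query language. The `min_requests` guard is an addition not in the rule above, included so that one failure among a handful of requests does not page anyone.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when the error rate over a sliding time window exceeds a threshold."""

    def __init__(self, threshold=0.05, window_s=300, min_requests=20):
        self.threshold = threshold
        self.window_s = window_s
        self.min_requests = min_requests  # assumption: ignore tiny samples
        self._events = deque()  # (timestamp_s, is_error), oldest first
        self._errors = 0

    def record(self, timestamp_s, is_error):
        """Record one request outcome; return True if the alert should fire."""
        self._events.append((timestamp_s, is_error))
        self._errors += int(is_error)
        # Drop events that have fallen out of the window.
        cutoff = timestamp_s - self.window_s
        while self._events and self._events[0][0] <= cutoff:
            _, was_error = self._events.popleft()
            self._errors -= int(was_error)
        total = len(self._events)
        return total >= self.min_requests and self._errors / total > self.threshold


# One request per second; every tenth request fails (10% error rate).
alert = ErrorRateAlert()
fired = [alert.record(t, t % 10 == 0) for t in range(100)]
```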

Latency degradation. Alert if p95 latency exceeds your SLA for 10 minutes. Catches gradual degradation from increased load, retry loops, or downstream service issues.
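The "for 10 minutes" part matters: the rule should fire on a sustained breach, not on a single slow sample. A minimal sketch, assuming p95 is sampled periodically and fed in with a timestamp (the class name and the 5-second SLA in the example are hypothetical):

```python
class SustainedBreachAlert:
    """Fires once p95 latency has stayed above the SLA for a full duration."""

    def __init__(self, sla_s, duration_s=600):
        self.sla_s = sla_s
        self.duration_s = duration_s
        self._breach_started = None  # timestamp of the first sample over SLA

    def observe(self, timestamp_s, p95_s):
        """Feed one p95 sample; return True if the alert should fire."""
        if p95_s <= self.sla_s:
            self._breach_started = None  # any healthy sample resets the timer
            return False
        if self._breach_started is None:
            self._breach_started = timestamp_s
        return timestamp_s - self._breach_started >= self.duration_s


# One sample per minute, all at 8s against a hypothetical 5s SLA.
alert = SustainedBreachAlert(sla_s=5.0)
results = [alert.observe(t, 8.0) for t in range(0, 661, 60)]
```

Here the first ten samples do not fire and the sample at the 600-second mark does.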

Cost overrun. Alert if daily cost exceeds budget threshold. Catches runaway loops, abuse, or pricing changes from the model provider.
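A daily budget check can be as simple as a running total per calendar day. This sketch assumes timezone-aware timestamps and buckets spend by UTC date; the class name and the budget figure are illustrative.

```python
from collections import defaultdict
from datetime import datetime, timezone


class DailyBudgetAlert:
    """Tracks spend per UTC day and reports when a day exceeds the budget."""

    def __init__(self, budget_usd):
        self.budget_usd = budget_usd
        self._spend = defaultdict(float)  # UTC date -> USD spent

    def add(self, when, cost_usd):
        """Add one request's cost; return True if that day is over budget."""
        day = when.astimezone(timezone.utc).date()
        self._spend[day] += cost_usd
        return self._spend[day] > self.budget_usd


budget = DailyBudgetAlert(budget_usd=1.0)
day_one = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
day_two = datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc)
first = budget.add(day_one, 0.6)     # under budget
second = budget.add(day_one, 0.6)    # pushes day one past $1.00
next_day = budget.add(day_two, 0.6)  # new day, fresh total
```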

Quality regression. Alert if weekly eval scores drop more than 5%. Catches the silent degradation that customers notice before you do.

Critical service unavailability. Alert if a circuit breaker opens on any external dependency. Catches a failing dependency before the failure cascades through the rest of the system.
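For readers who have not wired one up, here is a minimal circuit breaker sketch with an alert hook. It is a simplified illustration, not a production implementation: it counts consecutive failures, opens after `max_failures`, and allows a trial call after `reset_after_s`. The `on_open` callback and the `crm_api` name are hypothetical.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency and reports when it trips."""

    def __init__(self, name, max_failures=5, reset_after_s=30.0,
                 on_open=None, clock=time.monotonic):
        self.name = name
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.on_open = on_open  # alert hook, called each time the breaker opens
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    @property
    def is_open(self):
        if self._opened_at is None:
            return False
        # After the cool-down, let a trial call through (half-open).
        return self._clock() - self._opened_at < self.reset_after_s

    def call(self, fn, *args, **kwargs):
        if self.is_open:
            raise RuntimeError(f"circuit open for {self.name}")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
                if self.on_open is not None:
                    self.on_open(self.name)
            raise
        # Any success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Usage would look like `breaker = CircuitBreaker("crm_api", on_open=send_alert)` followed by `breaker.call(fetch_customer, customer_id)`, where `send_alert` and `fetch_customer` stand in for your own functions.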

Five alert rules is the minimum. Most production agents accumulate more over time as the team learns what to watch. The pattern that does not work: zero alert rules, "we'll check the dashboard sometimes."

Layer 4: Eval pipeline

The fourth layer is offline evaluation. Traces tell you what is happening in production. Eval tells you whether what is happening is actually correct.

The minimum eval pipeline:

A ground-truth eval set of 100 to 500 questions with expected answers
A weekly cron that runs the eval set through the production pipeline
Tracking of pass rate over time
An alert if pass rate drops more than 5% week over week
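The week-over-week check itself is a one-line comparison. The sketch below reads "drops more than 5%" as five percentage points of pass rate, which is an assumption; a relative drop would be an equally valid reading, so pick one and keep it consistent. The pass/fail counts are invented.

```python
def pass_rate(results):
    """Fraction of eval questions that passed; `results` is a list of booleans."""
    if not results:
        raise ValueError("eval run produced no results")
    return sum(results) / len(results)


def has_regressed(previous_rate, current_rate, max_drop=0.05):
    """True if the pass rate fell by more than `max_drop` (absolute)."""
    return (previous_rate - current_rate) > max_drop


# Invented runs of a 100-question eval set: 90 passes last week, 83 this week.
last_week = pass_rate([True] * 90 + [False] * 10)
this_week = pass_rate([True] * 83 + [False] * 17)
alert = has_regressed(last_week, this_week)
```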

We use Ragas as the metric layer for most clients. It computes faithfulness, answer relevance, context recall, and answer correctness from a single eval run. The compute cost for a 100-question weekly eval is under $5.

The eval set is the artifact most teams underestimate. It needs to represent the real production query distribution, not just what the engineering team thinks users will ask. The way to build a good eval set is to wait until you have two weeks of production traffic, sample 100 actual queries, write expected answers, and use that as the ground truth going forward. Refresh quarterly.
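Sampling the queries is the mechanical part of that process. A minimal sketch, assuming production queries are available as a list of strings: it shuffles the raw log with a fixed seed and keeps the first `k` distinct queries, so frequently asked questions are more likely to be picked but never appear twice. Writing the expected answers remains manual work.

```python
import random


def sample_eval_queries(queries, k=100, seed=0):
    """Pick up to k distinct queries, weighted by how often they were asked."""
    cleaned = [q.strip() for q in queries if q.strip()]
    random.Random(seed).shuffle(cleaned)  # fixed seed keeps the sample reproducible
    # dict.fromkeys removes duplicates while preserving shuffled order.
    return list(dict.fromkeys(cleaned))[:k]


# Synthetic log: 1,000 requests covering 250 distinct queries.
log = [f"query {i % 250}" for i in range(1000)]
eval_queries = sample_eval_queries(log, k=100)
```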

What we found in the founder's system

When we instrumented the agent six weeks after launch, the traces immediately showed:

One specific agent (the routing agent) had started exiting with "max_iter" 4x more often than at launch
Latency p95 had drifted from 4 seconds to 12 seconds
Cost per task had drifted from $0.04 to $0.13
The change had started three weeks earlier, exactly when the model provider pushed an update

The fix was prompt-level: the new model version interpreted one specific instruction differently than the old one, and the routing agent kept asking for clarification on requests that the old model handled directly. We tightened the routing prompt with two more few-shot examples. The agent stopped looping. Costs and latency returned to baseline.

The whole investigation was four hours once we had traces. Without traces, the team had spent weeks unable to pin down what was wrong.

The recommendation: install before launch

The argument for waiting on observability is "we're moving fast, we'll add it after the launch." The argument for installing it before launch is that the cost of installing it later is much higher (retroactive instrumentation, lost data, catastrophic surprises) and the cost of installing it now is almost zero.

The minimum stack:

Tracing (Langfuse / Opik / LangSmith): 1 day to install and to propagate request IDs through your code
Dashboard with 8-12 metrics: 1 day to build on top of traces
Five alert rules: 1 day to define and wire to your alerting system
Eval pipeline (Ragas + cron): 2 days to set up the eval set and weekly run

One week of work. Less than the time spent debugging a single mystery production issue.

If your agent is in production with no traces

If your team has shipped an AI agent and you would not be able to answer "why did response time get worse last week" with data, the gap is observability.

Sapota offers a one-week observability implementation that installs Langfuse (or your preferred stack), instruments your existing agent code, builds the production dashboard, configures the alert rules, and sets up the eval pipeline. We have done this for half a dozen agent systems, mostly for teams that shipped without instrumentation and started having mystery production issues.

Reach out via the AI engineering page with your current agent stack and what kind of issues you are seeing. The first conversation usually surfaces which layer is missing and what to install first.