The agent returned a 200. The customer lost $4,200. The monitoring dashboard showed green. This is the failure mode that keeps me up at night — not the one where your AI agent crashes with a stack trace, but the one where it succeeds at the wrong thing with complete confidence and nobody notices for three days.
I run AI agents in production. A content research pipeline, a pricing sync agent across 2,000-plus products, a nightly SEO task executor. I have had every variety of failure: hard crashes, timeout loops, runaway tool calls, authentication expiries. Those failures are easy. They show up in your error logs. They page you at 2am. They get fixed. The failure that cost that customer $4,200 was none of those. It was a step-10 output that was confidently, politely, completely wrong — and tracing it back to the tool call at step 3 that set up the error took four engineers and a full day of log archaeology.
According to data from enterprise AI deployment surveys published in early 2026, 89% of organizations have implemented some form of observability for their AI systems. Only 31% have measurement frameworks, defined as documented KPIs, baselines, and a process for evaluating whether the outputs are actually correct. The 58% in between are watching request counts and latency distributions while entirely missing the question that matters: was the answer right?
This post is the architecture I wish I had before that incident. I will cover why traditional monitoring fails for multi-step agents, the five observability primitives you actually need, how to build an eval pipeline from production traces, how the major platforms compare, and the shadow agent problem that will bite you if you skip the governance layer.
Why Traditional Monitoring Fails for AI Agents
Traditional application monitoring was built for a world where correctness is binary. A function either throws an exception or it does not. A database query either returns results or it errors. An HTTP endpoint either returns 200 or it does not. The entire observability stack — Datadog, New Relic, Prometheus, PagerDuty — is optimized for detecting deviations from a deterministic expected behavior. When everything is deterministic, the absence of errors means things are working.
AI agents violate this assumption at every level. An agent can execute all tool calls successfully, return HTTP 200, complete within SLA, and produce output that is wrong in a way that only a domain expert would recognize. The monitoring dashboard showing green is not lying to you — it is answering a different question than the one you need answered. It is telling you that the plumbing worked. It cannot tell you whether the output was good.
The multi-step nature of agent workflows compounds the problem. In a ten-step reasoning chain, an error at step 3 does not propagate as an exception. It propagates as subtly wrong context that informs step 4, which informs step 5, which informs the final answer. By the time you see the output, the root cause is seven steps upstream and you have no trace linking the wrong conclusion to the wrong tool call that started it. This is not a hypothetical failure mode. It is the default failure mode for any agent that chains tool calls together and does not instrument each step independently.
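A toy sketch of that propagation, assuming a hypothetical three-step pricing pipeline (the function, step names, and values are invented for illustration). Every step returns normally, and only the per-step records show where the wrong value entered:

```typescript
// Hypothetical pipeline: every step "succeeds", yet the final value is wrong.
interface StepRecord {
  step: number
  name: string
  input: unknown
  output: unknown
}

function runPipeline(listPriceCents: number, records: StepRecord[]): number {
  // Store each step's input and output so the chain can be inspected later.
  const record = <T>(step: number, name: string, input: unknown, output: T): T => {
    records.push({ step, name, input, output })
    return output
  }
  // Step 1: a lookup returns a discount as a percentage (15) where the next
  // step expects a fraction (0.15). Nothing throws.
  const discount = record(1, 'lookupDiscount', { sku: 'A-1' }, 15)
  // Step 2: the arithmetic is valid, but the result is nonsense.
  const discounted = record(
    2,
    'applyDiscount',
    { listPriceCents, discount },
    Math.round(listPriceCents * (1 - discount))
  )
  // Step 3: clamping hides the negative number behind a plausible-looking price.
  return record(3, 'clampPrice', { discounted }, Math.max(0, discounted))
}

const records: StepRecord[] = []
const finalPrice = runPipeline(10000, records)
// finalPrice is 0; records[0].output (15) is where the error entered.
```

Without the `records` array, the only visible fact is a price of 0 and a successful run.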
Quality issues are now the primary barrier for production agents: cited by 32% of surveyed organizations, they outrank latency problems, cost overruns, and infrastructure failures combined. Quality problems are invisible to infrastructure monitoring because quality is a semantic property of outputs, not a syntactic property of HTTP responses. Detecting them requires a different instrumentation layer entirely.
The Five Observability Primitives You Actually Need
After running agents in production and reading post-mortems from teams that have been doing this longer, I have converged on five primitives that actually cover the failure surface of agentic systems. These are not the five things vendors pitch. They are the five things that would have caught every production incident I have personally experienced or traced through other teams' post-mortems.
Distributed tracing with step-level spans. Each tool call, each LLM invocation, each decision branch in your agent workflow needs its own span. The span should capture the input, the output, the latency, the token counts, and any structured metadata the step produces. The trace should be queryable by run ID so you can replay exactly what happened in any production execution. This is the foundation everything else builds on. Without it, debugging a wrong output means guessing at causality from log fragments.
OpenTelemetry is the right instrumentation layer here, and it integrates with all four major platforms. Here is the setup that works for a TypeScript agent:
// otel-agent-tracing.ts: step-level spans for every agent action
// Written against the OpenTelemetry JS SDK 1.x API (Resource class, addSpanProcessor).
import { trace, context, createContextKey, SpanStatusCode } from '@opentelemetry/api'
import { NodeTracerProvider } from '@opentelemetry/sdk-trace-node'
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base'
import { Resource } from '@opentelemetry/resources'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'

// Context key under which the caller stores the current run ID,
// e.g. context.with(context.active().setValue(RUN_ID_KEY, runId), () => runAgent())
export const RUN_ID_KEY = createContextKey('agent.run_id')

const provider = new NodeTracerProvider({
  resource: Resource.default().merge(
    new Resource({ 'service.name': 'wowhow-content-agent', 'service.version': '2.1.0' })
  ),
})
provider.addSpanProcessor(
  // With no url option, the exporter reads OTEL_EXPORTER_OTLP_ENDPOINT itself
  // and appends the /v1/traces path.
  new BatchSpanProcessor(new OTLPTraceExporter())
)
provider.register()

const tracer = trace.getTracer('agent-tracer')

export async function tracedToolCall<T>(
  toolName: string,
  input: unknown,
  fn: () => Promise<T>
): Promise<T> {
  return tracer.startActiveSpan(`tool.${toolName}`, async (span) => {
    span.setAttributes({
      'tool.name': toolName,
      // JSON.stringify returns undefined for undefined values, hence the fallback
      'tool.input': (JSON.stringify(input) ?? '').slice(0, 1024), // truncate large inputs
      'agent.run_id': (context.active().getValue(RUN_ID_KEY) as string | undefined) ?? 'unknown',
    })
    try {
      const result = await fn()
      span.setAttributes({ 'tool.output': (JSON.stringify(result) ?? '').slice(0, 1024) })
      span.setStatus({ code: SpanStatusCode.OK })
      return result
    } catch (err) {
      span.recordException(err as Error)
      span.setStatus({ code: SpanStatusCode.ERROR })
      throw err
    } finally {
      span.end()
    }
  })
}
Multi-turn conversation replay. For agents that operate over multiple turns — handling a support ticket thread, researching a topic across several queries — you need the ability to replay any production conversation exactly as it happened. Not a summary, not a log of outputs. The full message history, tool call sequence, and intermediate states, reconstructed so a human reviewer can follow the exact reasoning path the agent took. This is what makes the step-3 root cause findable instead of invisible.
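One way to make a conversation replayable is to record every message and tool call as an ordered event and store the serialized list under the run ID. The sketch below is a minimal in-memory version; `ConversationRecorder`, `TraceEvent`, and `replay` are illustrative names, and persistence is left out:

```typescript
// Ordered event log for one agent run. Names are illustrative, not a library API.
type TraceEvent =
  | { kind: 'message'; seq: number; role: 'user' | 'assistant'; content: string }
  | { kind: 'tool_call'; seq: number; tool: string; input: unknown; output: unknown }

class ConversationRecorder {
  private events: TraceEvent[] = []
  private seq = 0

  message(role: 'user' | 'assistant', content: string): void {
    this.events.push({ kind: 'message', seq: this.seq++, role, content })
  }

  toolCall(tool: string, input: unknown, output: unknown): void {
    this.events.push({ kind: 'tool_call', seq: this.seq++, tool, input, output })
  }

  // Serialize for storage keyed by run ID (the storage layer is out of scope).
  serialize(): string {
    return JSON.stringify(this.events)
  }
}

// Rebuild a human-readable transcript in the exact order the events occurred.
function replay(serialized: string): string[] {
  const events = JSON.parse(serialized) as TraceEvent[]
  return [...events]
    .sort((a, b) => a.seq - b.seq)
    .map((e) =>
      e.kind === 'message'
        ? `#${e.seq} ${e.role}: ${e.content}`
        : `#${e.seq} tool ${e.tool}(${JSON.stringify(e.input)}) -> ${JSON.stringify(e.output)}`
    )
}

const recorder = new ConversationRecorder()
recorder.message('user', 'Find pricing for SKU A-1')
recorder.toolCall('search', { q: 'A-1' }, ['row-17'])
recorder.message('assistant', 'Found one matching row.')
const transcript = replay(recorder.serialize())
```

A reviewer reading `transcript` sees the tool call sitting between the two messages, which is the intermediate state a log of final outputs would drop.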
Online evaluation. A sample of production outputs, evaluated automatically against quality rubrics, running continuously. Not just at deployment time, not just in staging. In production, on real traffic, so you detect when model behavior drifts after a provider update, after your prompt changes, or after input distribution shifts. The evaluations do not need to be exhaustive — sampling 5% of production traffic with a fast evaluator gives you statistical sensitivity to drift without burning your evaluation budget.
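A sketch of the sampling decision, assuming you want it to be deterministic per run so that a given run ID is always either in or out of the evaluated set. The hash is an FNV-1a style hash over UTF-16 code units; `shouldEvaluate` and its default rate are illustrative choices, not a prescribed API:

```typescript
// FNV-1a style 32-bit hash over the string's UTF-16 code units.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0 // keep the value an unsigned 32-bit int
  }
  return hash
}

// Map the hash into [0, 1) and compare against the sample rate.
// The same runId always yields the same decision.
function shouldEvaluate(runId: string, sampleRate = 0.05): boolean {
  return fnv1a(runId) / 0x100000000 < sampleRate
}
```

Hashing the run ID instead of calling `Math.random()` means a re-processed trace gets the same decision, which keeps the evaluated sample stable across retries.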
Semantic alerting. Alerts that trigger on output quality metrics, not just infrastructure metrics. Error rate for tool calls is a good infrastructure alert. A drop in evaluation pass rate from 94% to 87% over 24 hours is a semantic alert, and it is the one that actually tells you something is wrong with the agent's behavior. Most teams have the former and none of the latter.
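A minimal sketch of that kind of check, comparing the pass rate in the current window against the window immediately before it. The names, the `maxDrop` parameter, and the minimum-sample guard are assumptions made for illustration:

```typescript
interface EvalResult {
  passed: boolean
  timestamp: number // epoch milliseconds
}

function passRate(results: EvalResult[]): number {
  return results.filter((r) => r.passed).length / results.length
}

// True when the pass rate fell by more than maxDrop (absolute, e.g. 0.05)
// between the previous window and the current one.
function passRateDropped(
  results: EvalResult[],
  now: number,
  opts: { windowMs: number; maxDrop: number; minSamples: number }
): boolean {
  const current = results.filter(
    (r) => r.timestamp > now - opts.windowMs && r.timestamp <= now
  )
  const baseline = results.filter(
    (r) => r.timestamp > now - 2 * opts.windowMs && r.timestamp <= now - opts.windowMs
  )
  // Too few samples in either window: stay quiet rather than alert on noise.
  if (current.length < opts.minSamples || baseline.length < opts.minSamples) return false
  if (current.length === 0 || baseline.length === 0) return false
  return passRate(baseline) - passRate(current) > opts.maxDrop
}
```

With a 24-hour window and `maxDrop` of 0.05, the 94% to 87% drop described above would fire, while ordinary day-to-day wobble of a point or two would not.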
Data curation loop. The traces from your production runs are the highest-quality training and evaluation data you will ever have. They show exactly what real users asked, exactly what the agent produced, and — after review — whether the output was good. A data curation loop systematically captures interesting production traces (high-confidence correct, high-confidence wrong, borderline cases) and routes them into your eval dataset. This is how your evaluation suite gets smarter over time rather than stale.
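The routing step can be sketched as a small bucketing function. This version assumes that "confidence" means the agent's own confidence score and that anything below a cutoff counts as borderline; both the 0.8 cutoff and the type names are illustrative:

```typescript
type Bucket = 'confident_correct' | 'confident_wrong' | 'borderline'

interface ReviewedTrace {
  runId: string
  confidenceScore: number // agent-reported confidence in [0, 1] (assumed)
  label: 'correct' | 'incorrect' | 'borderline' // reviewer verdict
}

function bucketFor(trace: ReviewedTrace, highConfidence = 0.8): Bucket {
  // Reviewer uncertainty or low agent confidence both land in "borderline".
  if (trace.label === 'borderline' || trace.confidenceScore < highConfidence) {
    return 'borderline'
  }
  return trace.label === 'correct' ? 'confident_correct' : 'confident_wrong'
}

// Group run IDs by bucket so each bucket can be sampled into the eval dataset.
function groupTraces(traces: ReviewedTrace[]): Map<Bucket, string[]> {
  const groups = new Map<Bucket, string[]>()
  for (const t of traces) {
    const bucket = bucketFor(t)
    groups.set(bucket, [...(groups.get(bucket) ?? []), t.runId])
  }
  return groups
}
```

The `confident_wrong` bucket is the one to watch: those are runs where the agent was sure and the reviewer disagreed.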
Building an Eval Pipeline From Production Traces
The most effective eval pipeline I have built does not start with carefully constructed test cases. It starts with production traces. Real inputs, real agent behavior, real outputs reviewed by a human expert or a strong evaluator model. That dataset is ground truth in a way that synthetic benchmarks never are.
Here is the pipeline architecture. Production traces are sampled, filtered for interest (high latency, low confidence scores, or random sampling for baseline coverage), and routed to a review queue. Reviewers label outputs as correct, incorrect, or borderline. Labeled traces become evaluation cases. The eval suite runs against every deployment candidate. The pass rate becomes the gate.
// trace-to-eval-pipeline.ts: promote production traces to eval cases
interface ProductionTrace {
  runId: string
  input: string
  steps: Array<{ tool: string }> // only the field read below is typed here
  finalOutput: string
  confidenceScore: number
  timestamp: string
}

interface EvalCase {
  id: string
  input: string
  expectedOutputPattern: string // regex or semantic description
  expectedSteps: string[] // required tool calls in order
  sourceTrace: string // runId for lineage
  reviewer: string
  reviewedAt: string
}

async function promoteTraceToEvalCase(
  trace: ProductionTrace,
  review: { correct: boolean; expectedOutput: string; reviewer: string }
): Promise<EvalCase | null> {
  if (!review.correct) {
    // Wrong outputs are MORE valuable as eval cases: they define failure modes
    return {
      id: `eval-${trace.runId}`,
      input: trace.input,
      expectedOutputPattern: review.expectedOutput,
      expectedSteps: trace.steps.map(s => s.tool),
      sourceTrace: trace.runId,
      reviewer: review.reviewer,
      reviewedAt: new Date().toISOString(),
    }
  }
  // Sample correct outputs at 20% to maintain eval diversity
  return Math.random() < 0.2 ? {
    id: `eval-${trace.runId}`,
    input: trace.input,
    // Assumption: for a correct trace, its own output serves as the expected output
    expectedOutputPattern: trace.finalOutput,
    expectedSteps: trace.steps.map(s => s.tool),
    sourceTrace: trace.runId,
    reviewer: review.reviewer,
    reviewedAt: new Date().toISOString(),
  } : null
}
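The two ends of that pipeline, the interest filter that feeds the review queue and the pass-rate gate on deployment candidates, can be sketched as follows. The thresholds, field names, and the injectable `random` parameter are assumptions for illustration:

```typescript
interface TraceSummary {
  runId: string
  latencyMs: number
  confidenceScore: number
}

type ReviewReason = 'high_latency' | 'low_confidence' | 'baseline_sample'

interface SelectionLimits {
  maxLatencyMs: number // above this, always review
  minConfidence: number // below this, always review
  baselineRate: number // fraction of unremarkable traces sampled anyway
}

// Returns why a trace should be reviewed, or null to skip it.
function reviewReason(
  trace: TraceSummary,
  limits: SelectionLimits,
  random: () => number = Math.random
): ReviewReason | null {
  if (trace.latencyMs > limits.maxLatencyMs) return 'high_latency'
  if (trace.confidenceScore < limits.minConfidence) return 'low_confidence'
  return random() < limits.baselineRate ? 'baseline_sample' : null
}

// Deployment gate: the candidate ships only if enough eval cases pass.
function passesGate(caseResults: boolean[], minPassRate: number): boolean {
  if (caseResults.length === 0) return false // no evidence, no deploy
  const passed = caseResults.filter(Boolean).length
  return passed / caseResults.length >= minPassRate
}
```

Keeping the random baseline sample matters: without it the eval set only ever contains traces that already looked suspicious.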
The custom evaluator layer is where the real quality signal comes from. An LLM-as-judge evaluator, given a well-defined rubric, can score agent outputs for correctness, relevance, and absence of hallucination at a cost that makes continuous evaluation economically viable. The key is giving the evaluator enough context: the original input, the expected behavior, the actual output, and a structured rubric that converts a subjective quality judgment into a numeric score.
// custom-evaluator.ts: LLM-as-judge for production output quality
interface EvalRubric {
  name: string
  description: string
  scoringGuide: Record<number, string> // 1-5 scale with explicit definitions
}

// Assumed to be implemented elsewhere: sends the prompt to the judge model
// and returns its parsed JSON reply.
declare function callEvaluatorModel(
  prompt: string
): Promise<{ score: number; reasoning: string }>

const FACTUAL_ACCURACY_RUBRIC: EvalRubric = {
  name: 'factual_accuracy',
  description: 'Does the output contain only claims that are verifiable from the input context?',
  scoringGuide: {
    5: 'All claims are directly supported by input context. No hallucination.',
    4: 'All major claims supported. Minor unsupported details that do not affect correctness.',
    3: 'Core answer correct but includes 1-2 unsupported or unverifiable claims.',
    2: 'Mixed accuracy. Key claims are wrong or unverifiable.',
    1: 'Output is predominantly incorrect or hallucinates critical facts.',
  },
}

async function evaluateOutput(
  input: string,
  output: string,
  rubrics: EvalRubric[]
): Promise<Record<string, number>> {
  const scores: Record<string, number> = {}
  for (const rubric of rubrics) {
    // Template lines are left unindented so the prompt text has no stray leading spaces
    const prompt = `You are evaluating AI agent output quality.
Input: ${input}
Output: ${output}
Rubric: ${rubric.description}
${Object.entries(rubric.scoringGuide).map(([score, desc]) => `${score}: ${desc}`).join('\n')}
Respond with JSON: {"score": <1-5>, "reasoning": "<brief explanation>"}`
    const result = await callEvaluatorModel(prompt)
    scores[rubric.name] = result.score
  }
  return scores
}
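The evaluator above reads `result.score` directly, which assumes the judge model always returns well-formed JSON with a score in range. A defensive parsing step could look like the sketch below; `parseJudgeResponse` is a hypothetical helper, and returning `null` on bad output is one policy among several:

```typescript
interface JudgeVerdict {
  score: number
  reasoning: string
}

// Extract and validate the judge's JSON reply. Returns null when the reply
// cannot be trusted, so the caller can skip or retry instead of recording noise.
function parseJudgeResponse(raw: string): JudgeVerdict | null {
  // Models sometimes wrap JSON in prose; take the outermost braces.
  const start = raw.indexOf('{')
  const end = raw.lastIndexOf('}')
  if (start === -1 || end <= start) return null

  let parsed: unknown
  try {
    parsed = JSON.parse(raw.slice(start, end + 1))
  } catch {
    return null
  }
  if (typeof parsed !== 'object' || parsed === null) return null

  const { score, reasoning } = parsed as { score?: unknown; reasoning?: unknown }
  // Enforce the rubric's 1-5 integer scale.
  if (typeof score !== 'number' || !Number.isInteger(score) || score < 1 || score > 5) {
    return null
  }
  return { score, reasoning: typeof reasoning === 'string' ? reasoning : '' }
}
```

Counting how often this returns `null` is itself a useful signal: a rising rate usually means the judge prompt or the judge model changed.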
Once you have an eval pipeline running on production traces, the quality alert becomes straightforward: compare rolling average scores against a threshold and alert when you cross it.
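// quality-alerting.ts: semantic alerts on eval score degradation
interface QualityAlert {
  metric: string
  currentScore: number
  threshold: number
  windowHours: number
  samplesEvaluated: number
  alertAt: string
}

// Assumed minimum sample count per window; tune for your traffic
const MIN_SAMPLES = 10

async function checkQualityAlerts(
  evalScores: Array<{ metric: string; score: number; timestamp: string }>,
  thresholds: Record<string, number>
): Promise<QualityAlert[]> {
  const alerts: QualityAlert[] = []
  // ISO 8601 UTC timestamps compare correctly as strings
  const windowStart = new Date(Date.now() - 24 * 60 * 60 * 1000).toISOString()
  for (const [metric, threshold] of Object.entries(thresholds)) {
    const recentScores = evalScores.filter(
      s => s.metric === metric && s.timestamp >= windowStart
    )
    if (recentScores.length < MIN_SAMPLES) continue
    const avgScore = recentScores.reduce((sum, s) => sum + s.score, 0) / recentScores.length
    if (avgScore < threshold) {
      alerts.push({
        metric,
        currentScore: avgScore,
        threshold,
        windowHours: 24,
        samplesEvaluated: recentScores.length,
        alertAt: new Date().toISOString(),
      })
    }
  }
  return alerts
}

# Shadow agent discovery: list API usage not attributable to registered agents.
from collections import defaultdict
from datetime import datetime, timedelta, timezone

import httpx

# Placeholder IDs; replace with the agent IDs from your own registry.
REGISTERED_AGENTS: set[str] = {"content-research", "pricing-sync", "seo-executor"}


# The function name is a placeholder.
async def discover_shadow_agents(api_key: str, lookback_days: int) -> list[dict]:
    """
    Query LLM provider usage logs for API calls not attributable to
    registered agents. Requires usage metadata (user tag or custom header)
    to be set on all registered agent calls.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(days=lookback_days)
    async with httpx.AsyncClient() as client:
        # The endpoint path and response fields are shown for illustration;
        # confirm both against your provider's usage reporting API.
        resp = await client.get(
            "https://api.anthropic.com/v1/usage",
            headers={"x-api-key": api_key, "anthropic-version": "2023-06-01"},
            params={"start_time": cutoff.isoformat(), "limit": 1000},
        )
        resp.raise_for_status()
    usage_records = resp.json().get("data", [])

    shadow_candidates = []
    for record in usage_records:
        agent_tag = record.get("metadata", {}).get("agent_id", "UNTAGGED")
        if agent_tag not in REGISTERED_AGENTS:
            shadow_candidates.append({
                "agent_tag": agent_tag,
                "model": record.get("model"),
                "input_tokens": record.get("input_tokens") or 0,
                "output_tokens": record.get("output_tokens") or 0,
                "timestamp": record.get("created_at"),
            })

    # Group by agent_tag to see usage patterns
    by_agent: dict[str, list[dict]] = defaultdict(list)
    for r in shadow_candidates:
        by_agent[r["agent_tag"]].append(r)

    return [
        {
            "agent_tag": tag,
            "call_count": len(records),
            "total_tokens": sum(r["input_tokens"] + r["output_tokens"] for r in records),
            "first_seen": min(r["timestamp"] for r in records),
            "last_seen": max(r["timestamp"] for r in records),
        }
        for tag, records in by_agent.items()
    ]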
The discovery script is the easy part. The harder part is what you do with the results. A shadow agent that has been running for months, used by a business team for a real workflow, cannot be simply shut down without disruption. The practical response is a registration amnesty: give teams a window to register their shadow agents, bring them under the standard observability stack, and assign an owner. Make the bar for registration low enough that compliance is easy. The agents that do not get registered during the amnesty period are the ones you shut down — because if no one registered them, no one is responsible for them, and no one will respond when they start generating wrong outputs at scale.
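The amnesty rule reduces to a small decision function. The sketch below assumes a discovered agent counts as registered once it has both an owner and a registration timestamp; the type and function names are illustrative:

```typescript
interface DiscoveredAgent {
  agentTag: string
  owner?: string
  registeredAt?: string // ISO 8601 timestamp
}

type AmnestyStatus = 'registered' | 'pending' | 'shut_down'

// Registered agents stay. Unregistered agents get until the deadline,
// after which they are scheduled for shutdown.
function amnestyStatus(
  agent: DiscoveredAgent,
  deadlineIso: string,
  nowIso: string
): AmnestyStatus {
  if (agent.owner && agent.registeredAt) return 'registered'
  return Date.parse(nowIso) <= Date.parse(deadlineIso) ? 'pending' : 'shut_down'
}
```

Requiring an owner, not just a registration record, is the point of the exercise: the owner is who gets paged when the agent starts producing wrong outputs.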
The connection between shadow agents and the quality crisis is direct. Organizations reporting the highest rates of AI quality problems in production are the same organizations with the weakest agent registration practices. When you do not know an agent exists, you cannot evaluate it. When you cannot evaluate it, you find out it is wrong the same way that customer found out: after the damage is done, with a 200 response code in the logs and no trace of what actually happened.
The investment in observability infrastructure pays back unevenly but reliably. The first few months feel like instrumentation overhead. The first time your quality alert fires two hours after a provider model update and saves you from a day of wrong outputs reaching real users, the ROI calculation becomes obvious. I have had that experience once. It changed how I think about observability from an optional enhancement to a deployment prerequisite. You should not ship an agent to production that you cannot evaluate, and you should not evaluate an agent you cannot trace.
For more on building agents that are production-ready from day one, see the related posts on why 88% of agent pilots never reach production and the 3-layer agent harness pattern for keeping configuration complexity under control. The WOWHOW tools catalog includes utilities for structured agent logging that integrate with all four platforms covered here.
Originally published at wowhow.cloud