Chandraprakash Sharma

Posted on Jun 8 • Originally published at wisflux.com

How to Make Your AI Agent's Behavior Observable in Production

#ai #llmops #machinelearning #agents

Observability for AI agents — logs, traces, evaluation, and production monitoring.

Your agent returns a clean, confident, well-formatted answer. It's completely wrong. No exception is thrown. No alert fires. No status code goes red. The first person to find out is the customer who acted on it.

Now answer a few questions about that run. What exactly did the agent do, step by step? Which tool did it call, and what came back? How many tokens did it burn — and what did that cost you? How long did it take? Was the answer even correct?

"If you can't answer those in under a minute without grepping raw log files, you're not really running an AI agent. You're hoping one works."

AI agents have evolved beyond experimental prototypes. They handle customer inquiries, generate and run code, navigate the internet, interface with APIs, and orchestrate with other agents on complex assignments. They represent software — and like all software, they fail. The distinction: agent failures frequently go undetected. No error message points to a specific line. The agent might present a plausible response that happens to be factually incorrect. It might call the wrong tool repeatedly before timing out. It might steadily deplete your API budget over several days without triggering notices.

This represents the observability challenge for AI agents — a more difficult version of a challenge the tech industry has addressed over many years. This piece explores the failures that warrant concern, explains why conventional tracking fails to identify them, then addresses genuine observability for agents: what needs measurement, measurement approaches, and where to begin.

Six questions you should be able to answer

Before diving into concepts, a straightforward assessment. For any agent execution in your current setup, can you quickly respond to the following without examining raw logs?

What did the user request?
What did the agent do, step by step?
What did each tool call receive and return?
How many tokens did the run consume, and what did it cost?
How long did it take?
Did the output look correct?

If you can respond to all six, you possess a substantially stronger position than the majority of teams operating AI agents in live environments. If you cannot, the following material addresses how to reach that point — and why each missing answer carries greater danger than apparent.

When agents fail, they fail silently

General arguments about observability can be dismissed easily. Specific failure examples are harder to overlook. Here are five instances from actual production deployments.

Five ways an agent fails silently — infinite loop, tool misuse, hallucination cascade, silent budget drain, confident wrong answer — each one reporting success

The infinite loop. An agent receives a task to investigate a subject. It runs a search tool, receives findings, determines the findings lack depth, runs the search tool using a modified query, determines those findings also lack depth, and persists — without limitation. Without token-usage tracking and iteration limits, this continues until a timeout halts it or your API allowance depletes.

Tool misuse. An agent gains access to a database query tool. It builds a query with a minor parameter issue — perhaps providing text instead of a numeric value. The tool sends back an error. The agent, reasoning that retry is appropriate, builds another query with the identical error slightly modified. This repeats across many tries. Without tool-call recording, you see only that the agent encountered failure; you don't understand why.

Mid-chain hallucination cascade. An agent reasoning through a multi-step problem hallucinates information in step 2. All subsequent actions rely on that false assumption. The finished response is wrong, yet confidently presented. The hallucination remains undetectable without following the complete reasoning sequence — you cannot pinpoint step 2 as the failure location without observing step 2.

Silent budget drain. An agent launched Friday begins getting unexpected inputs that generate unusually lengthy outputs. Token consumption increases steadily through Saturday and Sunday. No alarm activates, because none was configured. Monday morning's spending analysis reveals a weekend cost triple the typical amount.

The confident wrong answer. An agent delivers a formatted, polished response to a user question. The response contains factual inaccuracy. No exception was raised. No threshold was breached. The issue emerges only when a person or downstream automation acts on the incorrect response. Without output assessment, no signal indicates failure.

Observe the pattern: in nearly all scenarios, the system indicated success. Technical operation was appropriate. The problem was semantic, financial, or operational — precisely what conventional monitoring was never intended to detect.

These represent the three categories every silent failure fits within, and each bypasses your monitoring for distinct reasons:

Failure class	What's actually broken	Examples above	What traditional monitoring sees
Semantic	The answer is wrong, incomplete, or misaligned with intent	Hallucination cascade · confident wrong answer	Nothing — no exception, no failed status code
Financial	The run completes, but costs far more than it should	Infinite loop · silent budget drain	A cost spike — noticed after the money is gone
Behavioral	The agent takes wrong or pointlessly repeated actions	Tool misuse	Success — the tool call eventually "returned"

Why your existing monitoring won't catch this

Your infrastructure likely includes monitoring already. You operate dashboards, monitor error frequencies, set latency notifications. So why doesn't that framework identify the failures outlined earlier?

Because conventional software adheres to deterministic, fixed code patterns. A client request arrives, travels through a route handler, reaches a database, formats data, and responds. Given matching input, the system follows matching steps. Tracking that path is uncomplicated: you add measurement to recognized functions and services, and the measurement reflects your expectations.

AI agents function differently. An agent doesn't follow a predetermined path — it reasons. Given a user question, it chooses what action to take next, which tool to activate, what to say, whether to coordinate with another agent. The sequence through the operation cannot be determined beforehand; it materializes from the model's decisions at execution time. That challenges multiple assumptions your mechanisms operate on.

Dimension	Traditional software	AI agents
Execution path	Deterministic — the same input takes the same path	Emerges at runtime from the model's reasoning
Core computation	Inspectable functions, call stacks, internal state	A black-box LLM call — only inputs and outputs are visible
Control & tool flow	Fixed and known in advance	Decided dynamically; the trace's shape varies per run
How it fails	Loudly — exceptions, non-200s, crashes	Silently — a confident, well-formed, wrong answer
Across components	Known service boundaries	Agents coordinating; unclear which one caused the bad output

Non-determinism. Identical prompts can yield different reasoning pathways on separate executions. An agent performing optimally last week might take entirely different reasoning steps today, resulting in different outcomes. Conventional measurement assumes reproducible paths — for agents, this typically doesn't hold.

Black-box LLM calls. The actual computation — the LLM call itself — remains fully opaque. You transmit a prompt and get back a response. There are no function identifiers, no stack traces, no accessible state. The model's thinking is not observable. Only the inputs given and outputs received can be tracked.

Dynamic tool use. Agents determine which tools to activate during execution, based on what the model reasons is fitting. A standard measurement trace has a consistent framework; an agent trace has a framework determined by the model's choices. Measuring a predetermined set of functions misses what an agent accomplishes if those functions are invoked unexpectedly, or skipped entirely.

Semantic failures. When conventional systems break down, it usually occurs visibly — an exception gets raised, a non-200 response code is produced, a service stops. Agents can malfunction imperceptibly. The agent completes properly from a technical view but produces an answer that is factually wrong, partial, or slightly misaligned with the user's request. No exception surfaces. No notification sounds. The issue only becomes apparent when a person detects the poor output — if they detect it at all.

Multi-agent complexity. When frameworks include multiple agents working in concert, the complication multiplies. Which agent in a sequence created the poor output? What information was transferred between them? What did each agent recognize when it acted? Answering these requires end-to-end measurement built for agent-based systems.

All share a fundamental source:

"AI agents produce outcomes through a reasoning process you cannot inspect."

In conventional systems, internal operation is always reconstructible — it is the code, operating deterministically on specified inputs. If something breaks, you examine the code, move through the trace, and locate the responsible line. An LLM lacks accessible internal operation. It takes a prompt and generates a response, and the thinking that joins them remains hidden. When an agent breaks, even severely, no trace displays the problem. Only inputs and outputs remain — and if the problem is semantic, even that is absent.

Thus the typical debugging method — something went wrong, read the code, identify the cause — does not function for agents. You cannot examine the model. You can merely monitor its behavior: the information it got, the choices it made, the tools it used, the information it generated. The main difficulty of agent observability is "maintaining a thorough log of that behavior at all degrees of detail, so that when things fail, the evidence is already captured."

So what is observability, really?

Take a step back for a useful description. Observability is a combination of two ideas — observe and ability — and that's nearly the entire concept:

"Observability is the ability to understand the internal state of a system from the outside: to observe what it is doing and, crucially, why."

Without it, problem-solving is speculation — you work from presumptions regarding inner operation instead of information. With it, you can discover the fundamental cause of a problem instead of theorizing.

In conventional software, observability depends on three types of material:

Logs are timestamped, moment-level records of events — a question ran, a person verified, a function failed. They're the most minute signal, and they're the mechanism for recreating precisely what happened, and when, following the fact.
Metrics are computed measurements across duration — handling speed, failure frequency, memory consumption, message volume. They're economical to maintain and suited for monitoring screens and automated notifications. They show that something is off, rapidly, even when they cannot show why.
Traces monitor a single interaction as it traverses a structure, connecting spans (individual responsibilities, like a database search or a web call) to indicate the complete progression from start to finish. They're essential for comprehending delays and problems across numerous services.

One distinction worth emphasizing: monitoring means watching established data — you specify a threshold ("notify me if failure frequency crosses 1%") and remain alert for crossing. Observability is wider: the characteristic that enables querying any aspect of a system's operation, containing those you didn't anticipate when designing it. Monitoring leverages the material observability supplies. You require both.

These principles apply to agents too. They're simply inadequate by themselves.

The Agent Observability Stack: four layers of behavior

Since you cannot directly observe the model's reasoning, you construct it from the exterior: all information it got, all choices it made visible through an answer, all tools it called, all information it generated. In sequence, these observations create a thorough log of the agent's reasoning path — the nearest equivalent to a trace possible for a variable system. You continue using logs, metrics, and traces; you broaden them to record what counts in an agent situation.

The clearest method to think about what to gather is as a framework of four stacked levels — from the basic LLM operation at the foundation to the entire interaction — with assessment spanning all of them.

The Agent Observability Stack — four nested layers: LLM Call and Tool Call inside a Run inside a Session, with Evaluation as a cross-cutting dimension

Layer 1 — the LLM call is the foundational piece of an agent's job. As a baseline, all LLM operations should be documented with the entire message submitted to the system, the complete answer given, the system and edition applied, the word amounts (submitted, created, reused), and the handling delay. Word amounts matter particularly — they're the main factor of spending, and uncontrolled word expansion is one of the frequent failure styles in live setups.

Layer 2 — the tool call is how agents affect the broader world — running code, accessing repositories, triggering services, reaching documents, writing information. All should be documented with the operation activated, the details supplied, the details obtained, and completion standing. Tool-operation documentation frequently holds the most useful diagnostic material, because it indicates what the agent genuinely carried out, not merely what it claimed.

Layer 3 — the run is the highest-level assignment: the full set of steps from getting a user question to delivering a finished response. A run record ties all the LLM operations and tool operations in a single objective, creating the entire reasoning chain understandable. Absent run-degree recording, you own scattered moments instead of a linked account.

Layer 4 — the session monitors multi-turn exchanges over duration, displaying how information grows and how the agent's behavior changes as the exchange continues. This bears most meaning for speaking agents, where the past exchange is something the system views each period.

The cross-cutting layer: evaluation

Standard observability responds to a single inquiry: did the operation work? Agent observability must respond to a further one: did it work accurately?

That's the assessment component, and it has no counterpart in conventional tracking. Assessment determines whether the agent's answers are precise, fitting, valuable, and matching intended operation. It accepts several methods: individual assessment of a fraction of answers, computerized assessment utilizing an LLM as an analyzer, evaluation versus known datasets of recognized accurate answers, or feedback signs like approvals/disapprovals.

The important modification: assessment is not a one-instance offline activity you perform preceding deployment. In live operation, it's a continuous observability measurement — as essential as failure frequency or speed.

How to build it, incrementally

Observability can appear daunting to incorporate into a structure already functioning in customer settings. The reflex is to hold off until problems become severe sufficient to justify the investment. Don't.

"The time to add observability is before you need it — because when you need it, you need it immediately."

The sequence below is deliberately graduated. All phases supply independent advantage, enabling you to halt at any juncture and maintain progress relative to your beginning point.

From blind to baseline in six steps: log every LLM call, trace the whole run, instrument tool calls, alert on cost and latency, evaluate outputs, adopt a platform

Step 1 — Log every LLM call. This is the essential requirement. All operations with an LLM — whether a fundamental generation or one cycle of a complex agent cycle — must produce a organized (JSON, not raw) documentation entry holding:

The complete message supplied to the system (beginning instruction, previous exchanges, and any discovered material — what the system truly got is frequently the single highest-value proof)
The complete answer obtained
The system designation and edition (e.g. claude-sonnet-4-6, gpt-4o)
Word amounts: submitted, obtained, and reused if relevant
Actual passing period from delivery to obtain
A date/period marker and a special operation identifier

Organized documentation permits following gear to narrow and join by particular fields — system, word quantity, operation designation, failure variety — at range. This measure by itself gives spending insight, a basic tracking ledger, and the base material for all following measures. If you accomplish only this, accomplish this.

Step 2 — Add run-level tracing. Separated operation records are scattered happenings; run-degree tracing joins them into a progression. Allocate a special designation at the beginning of all agent assignment and move it through all LLM operations and tool operations interested in managing that inquiry. Access the designation later and you watch the complete progression: what the agent was questioned, what it made the decision, what gear it triggered, what they supplied, and what it eventually produced. This phase is what moves troubleshooting from "something went wrong somewhere" to "here is precisely what happened, in order."

Step 3 — Instrument tool calls explicitly. Tool operations warrant independent documentation marks, apart from LLM documentation. For all invocations, preserve the operation designation, the details supplied (structured, not flat text), the detail gotten, completion standing plus mistake report if it stopped working, and the operation's personal handling period. This counts because numerous agent problems are operation problems — the system made sound determinations but the operation performed incorrectly, delivered unforeseen material, or stopped working quietly. Absent independent operation documentation, those appear equivalent to reasoning problems.

Step 4 — Set cost and latency alerts. Prior to your structure experiences genuine demand, specify what "typical" implies and notify on divergence:

A per-run token budget: if a single assignment consumes past X words, deliver a notification. This captures cycling patterns and runaway thinking prior to they get pricey.
A daily spend ceiling: an ultimate restriction that sends a notification — or a firm terminate — when surpassed.
A latency threshold: if p95 handling period for an assignment surpasses your arrangement, you'll want to understand.

Producing effective restrictions demands some info, which is why phases 1–3 arrive initially. Also rough starting amounts beat nothing; modify as you recognize your operation's ordinary patterns.

Step 5 — Start evaluating outputs, even informally. Computed information demonstrates whether the operation executed; assessment demonstrates whether it executed accurately. Begin easy: gather 5–10% of agent replies and request a individual to examine them. Identify responses that are wrong, partial, off-point, or in any way bad, and retain a ongoing listing. Following several cycles, styles show up — inquiry kinds that steadily generate inferior replies, situations the agent manages inadequately, message phrasings that perplex the system. After you hold plenty of classified fitting/unfit instances, machine-code: system-as-analyzer for typical fitting metrics, instruction-based assessments for recognized problem designs, a failure collection constructed from previous problems. Yet the casual individual assessment is your beginning location, and it detects items no machine operation would.

Step 6 — Adopt a dedicated observability platform. After foundational measuring is in position — LLM documentation, run tracing, operation-call documentation — the proportion and detail will go past hand-made screens and unprocessed documentation records. That's the point to incorporate a specialized measuring framework (view the scene below). No thing which you select, link it appropriately, not as an end step: make certain reference ids move appropriately, assemble the displays you'll actually look at, and establish notices via the framework instead of haphazard engines.

What "done" looks like. Bear in mind the six questions from the start of this posting? You've reached a sufficient baseline once you can respond to all of them — What did the user request? What did the agent do, step by step? What did each tool call receive and return? How many tokens did it cost? How long did it take? Did the output look correct? — for any assignment, inside moments, absent accessing unprocessed documentation records.

The tooling landscape

You need not create this independently. The group matured rapidly — what was a fraction of spare attempts in 2023 is currently a packed, heavily financed area. (A meaningful marker: January 2026, database platform ClickHouse obtained Langfuse, the unrestricted framework head, as portion of a $400M collection — measuring material is the system creation right now.) Four groupings deserve focus.

Open-source, self-hostable. The ordinary beginning if you require command of your material.

Tool	Best for	What you get
Langfuse	The strong default if you're not tied to a framework	MIT-licensed, framework-agnostic leader: tracing, prompt management with a playground, evaluation (LLM-as-judge, user feedback, custom metrics), one-command self-host. Fortune-500 adoption; ClickHouse-backed but committed to staying open.
Arize Phoenix	Teams where evaluation is a standing practice	Evaluation-first and free to self-host: 50+ research-backed metrics (faithfulness, relevance, toxicity, hallucination), drift detection, RAG-quality analysis. Commercial AX covers scale.
Comet Opik	Auto-capturing every agent step	Open-source tracing + evaluation that records prompt chains and tool calls automatically, with AI-assisted prompt optimization. A fast-rising newer entrant.
OpenLLMetry	The least lock-in	Pure OpenTelemetry instrumentation (by Traceloop) that pipes LLM and agent traces into whatever backend you already run — Datadog, Grafana, Honeycomb, New Relic.

Managed / commercial. Minimal operational demands, greater refinement, amount-based cost.

Tool	Best for	What you get
LangSmith	Teams on LangChain/LangGraph	Built by the LangChain team — the path of least resistance there. Renders every run as a visual graph of reasoning steps, tool calls, and multi-agent hand-offs.
Braintrust	The fastest production-to-fix loop	Evaluation-native: production failures become eval cases and CI gates block regressions before release.
Weights & Biases Weave	Teams already training on W&B	LLM tracing and evaluation that extends Weights & Biases.
Helicone	Fast cost/latency visibility	Proxy-based, minimal-code quick start; less depth than dedicated tracing.

Your existing APM. If your squad presently inhabits one, this is the smallest-work path — no added provider, and system indicators coexist with your structure measures.

Tool	Best for	What you get
Datadog LLM Observability	Teams already on Datadog	Auto-instruments OpenAI, Anthropic, Bedrock, LangChain, and Google's Agent Development Kit, with built-in hallucination evaluations and prompt-injection scanning.
New Relic / Grafana	Teams on those stacks	Comparable LLM/agent observability features.

The emerging standard — OpenTelemetry GenAI. OTEL presently specifies GenAI semantic conventions: standardized designation kinds for system operations, operation and structure spans, and MCP operation invocations — system, word amounts, operation invocations, and additional. They're still being designed (not yet marked permanent as of mid-2026), yet suppliers currently generate them (Datadog as OTel v1.37; Arize's OpenInference is getting nearer). Measure opposite these guideline and you maintain your material moveable no thing which base you choose.

Choosing, in one line: begin with one framework — Langfuse if you desire unrestricted and self-controlled, LangSmith if you apply LangChain, Datadog if you're currently there — and incorporate a committed assessment resource (Phoenix, Braintrust, or Opik) once assessment turns into a genuine activity.

Challenges to keep in mind

Observability for agents has its own challenges:

Storage cost. Communications and reactions are broad. Preserving all operations thoroughly, at high use, generates considerable storing demands. You'll want deliberate selections regarding what to preserve completely versus condense or break down.
Privacy and PII. Queries frequently hold private material, and reactions might duplicate or integrate it. Preserving all means retaining private material — which generates lawful duties. You require an explicit standard: concealment, elimination timeframes, entrance limitations.
High cardinality. All assignment creates special mention designation, operation details, and communication material. Numeric frameworks have difficulty with excessively various worth combinations and might turn inefficient or pricey. Pick metric dimensions carefully.
Instrumentation overhead. Measuring introduces handling time. Immediate preserving might decrease your agent's speeds — employ asynchronous preserving where feasible, and assess the price of your measuring.
Evaluating probabilistic outputs. There's uncommon a singular accurate response for an agent's generation. Assessment demands consideration, which is pricey at scale. System-as-analyzer aids yet has its private mistake ratio — confirm it opposite individual thought regularly.
A fast-moving ecosystem. Gear, expectations, and finest habits are changing rapidly — frameworks get acquired, and perhaps the OpenTelemetry GenAI expectation is yet firming up. What's typical currently might turn in a semester. Assemble on public requirements (OpenTelemetry) wherever feasible, and prevent deep connection to any provider's uncommon material structure.

Conclusion

AI agents introduce a fresh kind of software challenge: structures that are strong, changeable, and deeply opaque, fit of crashing in ways conventional tracking was never meant to find — quietly, meaningfully, and pricily.

Observability is how you regain management. Not by creating agents predetermined — that might defeat the aim — yet by making their activities seeable. Once you can perceive what an agent is carrying out, why it's carrying it out, and regardless of whether the outcome was sufficient, you can fix breakages, discover regressions, and strengthen the structure with sureness.

The foundations — documentation, computed info, traces — stay appropriate. For agents they're a foundation, not an endpoint. The endpoint is a structure where all reasoning measure is followable, all operation triggering is assessable, all generation is examined, and all issues notify an notification prior to a user must disclose it.

That degree of observability isn't created instantaneously. Yet it is created progressively, commencing with a sole organized documentation entrance for your subsequent system operation.

"For any agent running in production, the question was never whether you can afford to invest in observability. It's whether you can afford to keep flying blind."

Read the original on wisflux.com →

DEV Community