AI Agent Observability in 2026: OpenAI Agents SDK, LangSmith, and OpenTelemetry
If you are building production AI agents, "it runs on my laptop" is not enough.
You need to answer questions like:
- Which tool calls failed?
- Where did latency spike?
- Which handoff or guardrail caused the run to derail?
- How do you connect agent traces to the rest of your production telemetry?
A practical 2026 stack is:
- OpenAI Agents SDK for agent execution and built-in traces
- LangSmith for agent-native debugging, evaluation, and dashboards
- OpenTelemetry for vendor-neutral export into your wider observability stack
This post focuses on what is actually useful in production.
1) What the OpenAI Agents SDK gives you by default
The OpenAI Agents SDK ships with built-in tracing enabled by default.
That matters because agent failures are rarely a single API error. A bad run is usually a sequence:
- user input
- retrieval
- model generation
- tool call
- guardrail
- handoff
- retry
- final output
The SDK captures this workflow as traces composed of spans, one span per step.
According to the OpenAI Agents SDK tracing docs, the default instrumentation includes:
- the overall workflow around `Runner.run()` / `run_sync()` / `run_streamed()`
- agent spans
- generation spans
- function/tool call spans
- guardrail spans
- handoff spans
- audio transcription and speech spans when relevant
This is the right baseline because agent debugging requires step-level causality, not just final output logging.
Important production detail
For long-running workers and background jobs, the SDK documentation recommends calling `flush_traces()` when you need immediate export at the end of a unit of work.
That is important if you run agents inside:
- Celery workers
- background tasks
- queue consumers
- cron-style jobs
A minimal pattern looks like this:
```python
from agents import Runner, flush_traces, trace

def run_job(agent, prompt: str):
    try:
        with trace("background_job"):
            result = Runner.run_sync(agent, prompt)
            return result.final_output
    finally:
        flush_traces()
```
Without an explicit flush, traces may export in the background a few seconds later. That is acceptable for many apps, but not for every operational workflow.
2) Where LangSmith fits
OpenAI's built-in tracing is useful, but most teams also need:
- searchable traces across runs
- evaluation workflows
- dashboards and alerts
- user feedback logging
- framework-agnostic observability
This is where LangSmith fits.
Per its documentation, LangSmith now explicitly supports:
- OpenTelemetry-based tracing
- OpenAI Agents SDK tracing
- tracing for both LangChain and non-LangChain applications
That means you do not need to rewrite your stack around one framework to get observability.
A practical division of labor is:
- OpenAI Agents SDK = emits detailed agent workflow traces
- LangSmith = developer-facing debugging, evaluation, alerting, run inspection
- OpenTelemetry = standard transport layer into the rest of your telemetry system
This separation is valuable because it avoids lock-in while still giving you agent-native debugging.
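As one sketch of what the transport layer can look like: because LangSmith accepts OTLP, you can often point a standard OpenTelemetry exporter at it with nothing but environment variables. The endpoint path and header names below are assumptions for illustration; confirm them against the current LangSmith docs before relying on them.

```shell
# Point a standard OTLP exporter at LangSmith's OTel ingestion endpoint.
# Endpoint and header names are illustrative — check the LangSmith docs
# for the values that apply to your region and plan.
export OTEL_EXPORTER_OTLP_ENDPOINT="https://api.smith.langchain.com/otel"
export OTEL_EXPORTER_OTLP_HEADERS="x-api-key=<your-langsmith-api-key>,Langsmith-Project=<your-project>"
```

The point of this pattern is that nothing in your application code mentions LangSmith; swapping backends later is a config change, not a rewrite.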
3) Why OpenTelemetry matters
In production, agents should not be a special observability island.
Your infra team already monitors:
- APIs
- workers
- databases
- queues
- cost and latency trends
If agent telemetry cannot join that system, you create a blind spot.
OpenTelemetry solves this by giving you a vendor-neutral standard for traces, metrics, and logs.
The OpenTelemetry GenAI semantic conventions now cover:
- events
- metrics
- model spans
- agent spans
- provider-specific conventions including OpenAI
- related conventions for MCP
This is the key architectural point:
Your agent stack can be agent-native at development time and still be standards-based in production.
That is how you avoid rebuilding your monitoring stack every time the AI tooling layer changes.
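To make that concrete, here is a rough sketch of the kind of span attributes the GenAI semantic conventions define. The attribute names below (`gen_ai.operation.name`, `gen_ai.usage.input_tokens`, and so on) follow the published conventions, but those conventions are still evolving, so treat the exact keys as illustrative and verify them against the current spec.

```python
# Illustrative span attributes following the OpenTelemetry GenAI
# semantic conventions. Exact attribute names are still evolving;
# verify against the current spec before shipping.

def genai_span_attributes(model: str, input_tokens: int, output_tokens: int) -> dict:
    """Build a vendor-neutral attribute set for a model-generation span."""
    return {
        "gen_ai.operation.name": "chat",        # kind of GenAI operation
        "gen_ai.request.model": model,          # model the agent requested
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
```

Because these keys are standardized rather than vendor-specific, any OTel backend can group, chart, and alert on them the same way it already treats HTTP or database spans.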
4) Recommended architecture
For a small team shipping real agents, use this pattern:
Layer 1 — Agent runtime
Use the OpenAI Agents SDK to run workflows and capture fine-grained traces for:
- model generations
- tool invocations
- guardrails
- handoffs
- retries
Layer 2 — Agent debugging and evaluation
Use LangSmith for:
- per-run debugging
- dataset-based evaluation
- dashboards
- feedback collection
- trace search and comparison
Layer 3 — Platform observability
Use OpenTelemetry export so traces can join:
- API traces
- worker traces
- database spans
- alerting pipelines
- cost and latency dashboards
This gives you both:
- deep agent visibility
- system-wide operational visibility
5) What to measure first
Most teams over-measure prompt details and under-measure workflow reliability.
Start with:
Reliability
- tool call success rate
- handoff success rate
- guardrail trigger rate
- retry rate
- final task completion rate
Performance
- total run latency
- per-tool latency
- model latency by step
- queue wait time for async jobs
Cost
- cost per successful task
- cost per failed run
- token usage by workflow stage
Quality
- user feedback attached to trace IDs
- evaluation score by workflow version
- regression rate after prompt/tool changes
These metrics are more useful than generic "LLM quality" scores because they are tied to concrete operations.
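A toy sketch of how two of these metrics might be computed from span-level records. The record shape here is invented for illustration; in practice these fields would come from your trace backend's query API.

```python
from collections import defaultdict

# Hypothetical span records; real ones would come from your trace backend.
runs = [
    {"run_id": "r1", "stage": "retrieval",  "cost": 0.002, "success": True},
    {"run_id": "r1", "stage": "generation", "cost": 0.031, "success": True},
    {"run_id": "r2", "stage": "generation", "cost": 0.027, "success": False},
]

def cost_per_successful_task(spans: list) -> float:
    """Total spend divided by the number of distinct successful runs."""
    total = sum(s["cost"] for s in spans)
    ok = {s["run_id"] for s in spans if s["success"]}
    return total / len(ok) if ok else float("inf")

def cost_by_stage(spans: list) -> dict:
    """Aggregate spend per workflow stage to spot expensive steps."""
    out = defaultdict(float)
    for s in spans:
        out[s["stage"]] += s["cost"]
    return dict(out)
```

Note that `cost_per_successful_task` deliberately includes the cost of failed runs in the numerator: failed runs are real spend, and hiding them is how "cheap" workflows turn out to be expensive.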
6) Common production mistake
A common mistake is to log only the final prompt and final response.
That is not observability.
For agents, the real failure often happens earlier:
- retrieval returned the wrong context
- a tool timed out
- a guardrail blocked the route
- a handoff moved to the wrong specialist agent
- the run succeeded technically but violated a business policy
If you cannot see the intermediate spans, you cannot fix the system quickly.
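A framework-neutral sketch of the idea: wrap each intermediate step in its own span and record metadata about the tool's response, not just whether it raised. Everything here (`tool_span`, `search_tool`, the `SPANS` list) is hypothetical scaffolding for illustration; in a real system the Agents SDK, LangSmith, or an OTel SDK would record these spans.

```python
import time
from contextlib import contextmanager

# Minimal span recorder for illustration only; a real system would
# delegate this to the Agents SDK, LangSmith, or an OTel tracer.
SPANS: list = []

@contextmanager
def tool_span(name: str):
    """Record timing, status, and response metadata for one step."""
    span = {"name": name, "start": time.time(), "status": "ok", "meta": {}}
    try:
        yield span
    except Exception as exc:
        span["status"] = f"error: {exc}"
        raise
    finally:
        span["end"] = time.time()
        SPANS.append(span)

def search_tool(query: str) -> dict:
    # Hypothetical tool that "succeeds" while returning nothing useful.
    with tool_span("search_tool") as span:
        response = {"results": []}
        span["meta"]["result_count"] = len(response["results"])
        return response
```

A `result_count` of 0 is exactly the kind of silently empty response that never raises an exception but still derails the run downstream; capturing it as span metadata is what makes it findable later.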
7) Bottom line
In 2026, production agent observability should look like this:
- OpenAI Agents SDK for built-in step-level traces
- LangSmith for debugging, evaluation, and agent-centric monitoring
- OpenTelemetry for portable, system-wide observability
That combination is practical because it matches how real agent systems fail: not as one big error, but as a chain of small decisions, tool calls, and routing events.
If your agents call tools, hand off work, or run asynchronously, observability is not optional. It is part of the product.
Sources used
- OpenAI Agents SDK tracing docs
- LangSmith docs for OpenTelemetry and OpenAI Agents SDK tracing
- OpenTelemetry GenAI semantic conventions docs
Top comments (3)
This matches a frustration we kept running into — agents that log a confident final answer while the actual problem was a silent tool failure three steps back. The three-layer stack you describe is clean, but in practice we found the trickiest part is correlating traces across parallel agent branches. When two sub-agents are running concurrently and one poisons shared state, you need trace context propagated through every message boundary, not just function calls. We ended up adding a `span_context` field to every inter-agent payload so LangSmith could stitch the full DAG together automatically.

One question: how do you handle agents that invoke other agents asynchronously? The `flush_traces()` recommendation is great for background jobs, but when agent A spawns agent B and immediately returns, B's span often ends up as an orphan without explicit parent propagation. Did you find a pattern that worked reliably for that case?

Solid layering framework — the three-tier approach mirrors how we think about observability in production multi-agent systems. One thing worth adding to the conversation: agent observability needs to differ from service observability in a fundamental way. You need intent visibility, not just execution visibility.
With traditional services, a trace tells you what happened. With agents, you also need to understand why the agent took a given path at a decision fork. We've found that logging the reasoning trace (Claude's thinking steps, or LangGraph's node transitions) alongside the tool calls is the only way to distinguish "the agent chose wrong" from "the agent was given bad inputs." Without that, LangSmith gives you a beautiful trace of a bad decision with no way to know if it was model error or prompt error.
The OpenTelemetry piece becomes especially critical once you're running multiple agent instances concurrently — correlating spans across parallel runs without confusing their contexts is genuinely hard. Have you tried OTEL async context propagation with Claude's streaming responses? That's where things got messy for us.
The intermediate failure point is the one that bites teams hardest and is almost never in the documentation. You instrument inputs and outputs, your dashboards look clean, and then a tool call silently returns a malformed response that the agent "handles" by hallucinating around it — no exception, no logged error, just subtly wrong downstream output. The only way to catch that is exactly what you describe: capturing tool response metadata as its own span, not just logging the final prompt/completion pair.
One thing worth adding to the OTel layer: cost attribution at the span level rather than the session level. In a multi-agent pipeline, certain agents or tool calls dominate spend in ways that aren't visible in aggregate dashboards. We've found that adding `model.cost` and `tokens.input`/`tokens.output` as span attributes — and then grouping in your OTel backend by agent role — immediately reveals which part of your workflow is expensive versus which is actually complex.
The flush_traces() tip for background jobs is genuinely important. Lost traces in async workers is a silent problem that ruins your debugging fidelity right when you need it most.