AI Agent Observability in 2026: OpenAI Agents SDK, LangSmith, and OpenTelemetry

If you are building production AI agents, "it runs on my laptop" is not enough.

You need to answer questions like:

  • Which tool calls failed?
  • Where did latency spike?
  • Which handoff or guardrail caused the run to derail?
  • How do you connect agent traces to the rest of your production telemetry?

A practical 2026 stack is:

  1. OpenAI Agents SDK for agent execution and built-in traces
  2. LangSmith for agent-native debugging, evaluation, and dashboards
  3. OpenTelemetry for vendor-neutral export into your wider observability stack

This post focuses on what is actually useful in production.


1) What the OpenAI Agents SDK gives you by default

The OpenAI Agents SDK ships with built-in tracing enabled by default.

That matters because agent failures are rarely a single API error. A bad run is usually a sequence:

  • user input
  • retrieval
  • model generation
  • tool call
  • guardrail
  • handoff
  • retry
  • final output

The SDK captures this whole workflow as a trace composed of spans.

According to the OpenAI Agents SDK tracing docs, the default instrumentation includes:

  • the overall Runner.run() / run_sync() / run_streamed() workflow
  • agent spans
  • generation spans
  • function/tool call spans
  • guardrail spans
  • handoff spans
  • audio transcription and speech spans when relevant

This is the right baseline because agent debugging requires step-level causality, not just final output logging.
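
To make that concrete, here is a minimal sketch (the weather tool is a made-up stub). A single run of it produces the agent span, a generation span, and a function/tool span listed above, without any extra instrumentation:

from agents import Agent, Runner, function_tool


@function_tool
def get_weather(city: str) -> str:
    """Return a one-line weather summary for a city (stubbed for the example)."""
    return f"It is sunny in {city}."


agent = Agent(
    name="weather_assistant",
    instructions="Answer weather questions using the get_weather tool.",
    tools=[get_weather],
)

# One call yields a trace with agent, generation, and tool spans by default.
result = Runner.run_sync(agent, "What's the weather in Paris?")
print(result.final_output)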

Important production detail

For long-running workers and background jobs, the SDK documentation recommends calling flush_traces() when you need immediate export at the end of a unit of work.

That is important if you run agents inside:

  • Celery workers
  • background tasks
  • queue consumers
  • cron-style jobs

A minimal pattern looks like this:

from agents import Runner, flush_traces, trace


def run_job(agent, prompt: str):
    try:
        # Group the whole unit of work under one named trace.
        with trace("background_job"):
            result = Runner.run_sync(agent, prompt)
        return result.final_output
    finally:
        # Force export of any buffered spans before the worker moves on,
        # instead of waiting for the background exporter.
        flush_traces()

Without an explicit flush, traces may export in the background a few seconds later. That is acceptable for many apps, but not for every operational workflow.


2) Where LangSmith fits

OpenAI's built-in tracing is useful, but most teams also need:

  • searchable traces across runs
  • evaluation workflows
  • dashboards and alerts
  • user feedback logging
  • framework-agnostic observability

This is where LangSmith fits.

According to its documentation, LangSmith now explicitly supports:

  • OpenTelemetry-based tracing
  • OpenAI Agents SDK tracing
  • tracing for both LangChain and non-LangChain applications

That means you do not need to rewrite your stack around one framework to get observability.

A practical division of labor is:

  • OpenAI Agents SDK = emits detailed agent workflow traces
  • LangSmith = developer-facing debugging, evaluation, alerting, run inspection
  • OpenTelemetry = standard transport layer into the rest of your telemetry system

This separation is valuable because it avoids lock-in while still giving you agent-native debugging.
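
Wiring the first two layers together is a small amount of code. A minimal sketch, assuming LangSmith's documented OpenAIAgentsTracingProcessor integration and the usual LANGSMITH_* environment variables (check the current docs for exact package versions and import paths; the agent itself is a throwaway example):

import os

from agents import Agent, Runner, set_trace_processors
from langsmith.wrappers import OpenAIAgentsTracingProcessor

# LangSmith reads its configuration from the environment;
# LANGSMITH_API_KEY is normally set outside the code.
os.environ.setdefault("LANGSMITH_TRACING", "true")

# Replace the default trace processors so the Agents SDK's built-in traces
# flow to LangSmith (use add_trace_processor instead to keep both exporters).
set_trace_processors([OpenAIAgentsTracingProcessor()])

agent = Agent(name="support_triage", instructions="Route the ticket to the right queue.")
result = Runner.run_sync(agent, "My invoice is wrong and I was double charged.")
print(result.final_output)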


3) Why OpenTelemetry matters

In production, agents should not be a special observability island.

Your infra team already monitors:

  • APIs
  • workers
  • databases
  • queues
  • cost and latency trends

If agent telemetry cannot join that system, you create a blind spot.

OpenTelemetry solves this by giving you a vendor-neutral standard for traces, metrics, and logs.

The OpenTelemetry GenAI semantic conventions now cover:

  • events
  • metrics
  • model spans
  • agent spans
  • provider-specific conventions including OpenAI
  • related conventions for MCP
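
As a rough illustration of what those conventions look like in practice, here is a minimal sketch of a generation span using attribute names from the GenAI spec. The conventions are still evolving, so check the current spec before standardizing on exact names; the token counts are made-up example values:

from opentelemetry import trace

tracer = trace.get_tracer("genai.demo")

# Span and attribute names follow the OpenTelemetry GenAI semantic conventions.
with tracer.start_as_current_span("chat gpt-4.1") as span:
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.request.model", "gpt-4.1")
    span.set_attribute("gen_ai.usage.input_tokens", 812)
    span.set_attribute("gen_ai.usage.output_tokens", 154)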

This is the key architectural point:

Your agent stack can be agent-native at development time and still be standards-based in production.

That is how you avoid rebuilding your monitoring stack every time the AI tooling layer changes.


4) Recommended architecture

For a small team shipping real agents, use this pattern:

Layer 1 — Agent runtime

Use the OpenAI Agents SDK to run workflows and capture fine-grained traces for:

  • model generations
  • tool invocations
  • guardrails
  • handoffs
  • retries

Layer 2 — Agent debugging and evaluation

Use LangSmith for:

  • per-run debugging
  • dataset-based evaluation
  • dashboards
  • feedback collection
  • trace search and comparison

Layer 3 — Platform observability

Use OpenTelemetry export so traces can join:

  • API traces
  • worker traces
  • database spans
  • alerting pipelines
  • cost and latency dashboards

This gives you both:

  • deep agent visibility
  • system-wide operational visibility
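
For Layer 3, the wiring is the standard OpenTelemetry Python setup. A minimal sketch, assuming an OTLP-compatible collector listening on the default local HTTP port:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

# Send spans to whatever OTLP-compatible collector your platform team already runs.
provider = TracerProvider(resource=Resource.create({"service.name": "agent-worker"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces"))
)
trace.set_tracer_provider(provider)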

5) What to measure first

Most teams over-measure prompt details and under-measure workflow reliability.

Start with:

Reliability

  • tool call success rate
  • handoff success rate
  • guardrail trigger rate
  • retry rate
  • final task completion rate

Performance

  • total run latency
  • per-tool latency
  • model latency by step
  • queue wait time for async jobs

Cost

  • cost per successful task
  • cost per failed run
  • token usage by workflow stage

Quality

  • user feedback attached to trace IDs
  • evaluation score by workflow version
  • regression rate after prompt/tool changes

These metrics are more useful than generic "LLM quality" scores because they are tied to concrete operations.
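
Many of these can be emitted as plain OpenTelemetry metrics next to the spans. A minimal sketch, with hypothetical instrument names and label values:

from opentelemetry import metrics

meter = metrics.get_meter("agent.metrics")

# Hypothetical instrument names; align them with your platform's naming scheme.
tool_calls = meter.create_counter(
    "agent.tool_calls",
    description="Tool invocations, tagged by tool name and outcome",
)
run_duration = meter.create_histogram(
    "agent.run.duration",
    unit="s",
    description="End-to-end agent run latency",
)

# Emit from your run loop, e.g. after each tool call and at the end of each run.
tool_calls.add(1, {"tool": "search_flights", "status": "success"})
run_duration.record(4.2, {"workflow": "booking", "outcome": "completed"})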


6) Common production mistake

A common mistake is to log only the final prompt and final response.

That is not observability.

For agents, the real failure often happens earlier:

  • retrieval returned the wrong context
  • a tool timed out
  • a guardrail blocked the route
  • a handoff moved to the wrong specialist agent
  • the run succeeded technically but violated a business policy

If you cannot see the intermediate spans, you cannot fix the system quickly.
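
One way to close that gap is to give each tool call its own span with the failure recorded on it. A minimal sketch using plain OpenTelemetry, where search_backend and the attribute names are hypothetical placeholders:

from opentelemetry import trace
from opentelemetry.trace import StatusCode

tracer = trace.get_tracer("agent.tools")


def search_backend(query: str) -> dict:
    """Hypothetical stand-in for a real retrieval or search client."""
    return {"hits": [{"title": "example"}]}


def call_search_tool(query: str) -> dict:
    # Record the intermediate step as its own span, including its failure mode,
    # instead of only logging the final prompt/response pair.
    with tracer.start_as_current_span("execute_tool search") as span:
        span.set_attribute("tool.name", "search")
        try:
            result = search_backend(query)
            span.set_attribute("tool.result_count", len(result.get("hits", [])))
            return result
        except TimeoutError as exc:
            span.record_exception(exc)
            span.set_status(StatusCode.ERROR, "search backend timed out")
            raise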


7) Bottom line

In 2026, production agent observability should look like this:

  • OpenAI Agents SDK for built-in step-level traces
  • LangSmith for debugging, evaluation, and agent-centric monitoring
  • OpenTelemetry for portable, system-wide observability

That combination is practical because it matches how real agent systems fail: not as one big error, but as a chain of small decisions, tool calls, and routing events.

If your agents call tools, hand off work, or run asynchronously, observability is not optional. It is part of the product.


Sources used

  • OpenAI Agents SDK tracing docs
  • LangSmith docs for OpenTelemetry and OpenAI Agents SDK tracing
  • OpenTelemetry GenAI semantic conventions docs

Top comments (3)

Max Quimby

This matches a frustration we kept running into — agents that log a confident final answer while the actual problem was a silent tool failure three steps back. The three-layer stack you describe is clean, but in practice we found the trickiest part is correlating traces across parallel agent branches. When two sub-agents are running concurrently and one poisons shared state, you need trace context propagated through every message boundary, not just function calls. We ended up adding a span_context field to every inter-agent payload so LangSmith could stitch the full DAG together automatically.

One question: how do you handle agents that invoke other agents asynchronously? The flush_traces() recommendation is great for background jobs, but when agent A spawns agent B and immediately returns, B's span often ends up as an orphan without explicit parent propagation. Did you find a pattern that worked reliably for that case?

Max Quimby

Solid layering framework — the three-tier approach mirrors how we think about observability in production multi-agent systems. One thing worth adding to the conversation: agent observability needs to differ from service observability in a fundamental way. You need intent visibility, not just execution visibility.

With traditional services, a trace tells you what happened. With agents, you also need to understand why the agent took a given path at a decision fork. We've found that logging the reasoning trace (Claude's thinking steps, or LangGraph's node transitions) alongside the tool calls is the only way to distinguish "the agent chose wrong" from "the agent was given bad inputs." Without that, LangSmith gives you a beautiful trace of a bad decision with no way to know if it was model error or prompt error.

The OpenTelemetry piece becomes especially critical once you're running multiple agent instances concurrently — correlating spans across parallel runs without confusing their contexts is genuinely hard. Have you tried OTEL async context propagation with Claude's streaming responses? That's where things got messy for us.

Max Quimby

The intermediate failure point is the one that bites teams hardest and is almost never in the documentation. You instrument inputs and outputs, your dashboards look clean, and then a tool call silently returns a malformed response that the agent "handles" by hallucinating around it — no exception, no logged error, just subtly wrong downstream output. The only way to catch that is exactly what you describe: capturing tool response metadata as its own span, not just logging the final prompt/completion pair.

One thing worth adding to the OTel layer: cost attribution at the span level rather than the session level. In a multi-agent pipeline, certain agents or tool calls dominate spend in ways that aren't visible in aggregate dashboards. We've found that adding model.cost and tokens.input/tokens.output as span attributes — and then grouping in your OTel backend by agent role — immediately reveals which part of your workflow is expensive versus which is actually complex.

The flush_traces() tip for background jobs is genuinely important. Losing traces in async workers is a silent problem that ruins your debugging fidelity right when you need it most.