DEV Community

chunxiaoxx

AI Agent Observability in 2026: OpenAI Agents SDK, LangSmith, and OpenTelemetry

If you are building production AI agents, "it runs on my laptop" is not enough.

You need to answer questions like:

  • Which tool calls failed?
  • Where did latency spike?
  • Which handoff or guardrail caused the run to derail?
  • How do you connect agent traces to the rest of your production telemetry?

A practical 2026 stack is:

  1. OpenAI Agents SDK for agent execution and built-in traces
  2. LangSmith for agent-native debugging, evaluation, and dashboards
  3. OpenTelemetry for vendor-neutral export into your wider observability stack

This post focuses on what is actually useful in production.


1) What the OpenAI Agents SDK gives you by default

The OpenAI Agents SDK ships with built-in tracing enabled by default.

That matters because agent failures are rarely a single API error. A bad run is usually a sequence:

  • user input
  • retrieval
  • model generation
  • tool call
  • guardrail
  • handoff
  • retry
  • final output

The SDK records this whole workflow as traces made up of spans.

According to the OpenAI Agents SDK tracing docs, the default instrumentation includes:

  • the overall Runner.run() / run_sync() / run_streamed() workflow
  • agent spans
  • generation spans
  • function/tool call spans
  • guardrail spans
  • handoff spans
  • audio transcription and speech spans when relevant

This is the right baseline because agent debugging requires step-level causality, not just final output logging.
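To make the trace/span relationship concrete, here is a minimal stdlib-only sketch of the data model. The class and field names are illustrative, not the SDK's actual types; the point is the tree shape: one trace per run, with one span per step.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One step in a run: a generation, tool call, guardrail, or handoff."""
    name: str
    kind: str                      # e.g. "generation", "tool", "handoff"
    parent: "Span | None" = None
    children: list["Span"] = field(default_factory=list)

    def child(self, name: str, kind: str) -> "Span":
        s = Span(name, kind, parent=self)
        self.children.append(s)
        return s


@dataclass
class Trace:
    """The whole workflow: one root span per run invocation."""
    workflow_name: str
    root: Span = field(default_factory=lambda: Span("run", "workflow"))


# Model a run as the chain of steps listed above.
t = Trace("support_agent")
gen = t.root.child("triage", "generation")
tool = t.root.child("lookup_order", "tool")
handoff = t.root.child("to_billing_agent", "handoff")
```

With this shape, "which step failed?" becomes a tree walk rather than grepping one flat log line per run.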

Important production detail

For long-running workers and background jobs, the SDK documentation recommends calling flush_traces() when you need immediate export at the end of a unit of work.

That is important if you run agents inside:

  • Celery workers
  • background tasks
  • queue consumers
  • cron-style jobs

A minimal pattern looks like this:

from agents import Runner, flush_traces, trace


def run_job(agent, prompt: str):
    try:
        # Group the whole unit of work under one named trace.
        with trace("background_job"):
            result = Runner.run_sync(agent, prompt)
        return result.final_output
    finally:
        # Force immediate export before the worker moves on;
        # background export may otherwise lag by a few seconds.
        flush_traces()

Without an explicit flush, traces may export in the background a few seconds later. That is acceptable for many apps, but not for every operational workflow.


2) Where LangSmith fits

OpenAI's built-in tracing is useful, but most teams also need:

  • searchable traces across runs
  • evaluation workflows
  • dashboards and alerts
  • user feedback logging
  • framework-agnostic observability

This is where LangSmith fits.

LangSmith's documentation now explicitly covers:

  • OpenTelemetry-based tracing
  • OpenAI Agents SDK tracing
  • tracing for both LangChain and non-LangChain applications

That means you do not need to rewrite your stack around one framework to get observability.

A practical division of labor is:

  • OpenAI Agents SDK = emits detailed agent workflow traces
  • LangSmith = developer-facing debugging, evaluation, alerting, run inspection
  • OpenTelemetry = standard transport layer into the rest of your telemetry system

This separation is valuable because it avoids lock-in while still giving you agent-native debugging.
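As one concrete wiring example: LangSmith's docs describe accepting OTLP traffic directly, so a generic OpenTelemetry exporter can often be pointed at it with environment variables alone. The endpoint, header name, and project variable below are assumptions to verify against the current LangSmith documentation, not guaranteed values.

```shell
# Illustrative values; check the current LangSmith OTel docs before use.
export OTEL_EXPORTER_OTLP_ENDPOINT="https://api.smith.langchain.com/otel"
export OTEL_EXPORTER_OTLP_HEADERS="x-api-key=<your-langsmith-key>"
export LANGSMITH_PROJECT="agents-prod"
```

The appeal of this pattern is that the application code only speaks OpenTelemetry; swapping the backend is a config change, not a rewrite.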


3) Why OpenTelemetry matters

In production, agents should not be a special observability island.

Your infra team already monitors:

  • APIs
  • workers
  • databases
  • queues
  • cost and latency trends

If agent telemetry cannot join that system, you create a blind spot.

OpenTelemetry solves this by giving you a vendor-neutral standard for traces, metrics, and logs.

The OpenTelemetry GenAI semantic conventions now cover:

  • events
  • metrics
  • model spans
  • agent spans
  • provider-specific conventions including OpenAI
  • related conventions for MCP
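In practice, those conventions boil down to standardized attribute names on spans. Here is a sketch of what a generation span's attributes might look like under the GenAI naming style; the exact keys should be checked against the current semantic conventions spec, and the values are invented.

```python
# Illustrative span attributes in the OTel GenAI naming style.
# Verify exact attribute keys against the current semconv spec.
generation_span_attributes = {
    "gen_ai.operation.name": "chat",
    "gen_ai.request.model": "gpt-4.1",
    "gen_ai.usage.input_tokens": 812,
    "gen_ai.usage.output_tokens": 164,
}

# Because the keys are standardized, any OTel backend can aggregate them
# the same way it aggregates HTTP or DB span attributes.
total_tokens = (
    generation_span_attributes["gen_ai.usage.input_tokens"]
    + generation_span_attributes["gen_ai.usage.output_tokens"]
)
```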

This is the key architectural point:

Your agent stack can be agent-native at development time and still be standards-based in production.

That is how you avoid rebuilding your monitoring stack every time the AI tooling layer changes.


4) Recommended architecture

For a small team shipping real agents, use this pattern:

Layer 1 — Agent runtime

Use the OpenAI Agents SDK to run workflows and capture fine-grained traces for:

  • model generations
  • tool invocations
  • guardrails
  • handoffs
  • retries

Layer 2 — Agent debugging and evaluation

Use LangSmith for:

  • per-run debugging
  • dataset-based evaluation
  • dashboards
  • feedback collection
  • trace search and comparison

Layer 3 — Platform observability

Use OpenTelemetry export so traces can join:

  • API traces
  • worker traces
  • database spans
  • alerting pipelines
  • cost and latency dashboards

This gives you both:

  • deep agent visibility
  • system-wide operational visibility

5) What to measure first

Most teams over-measure prompt details and under-measure workflow reliability.

Start with:

Reliability

  • tool call success rate
  • handoff success rate
  • guardrail trigger rate
  • retry rate
  • final task completion rate

Performance

  • total run latency
  • per-tool latency
  • model latency by step
  • queue wait time for async jobs

Cost

  • cost per successful task
  • cost per failed run
  • token usage by workflow stage

Quality

  • user feedback attached to trace IDs
  • evaluation score by workflow version
  • regression rate after prompt/tool changes

These metrics are more useful than generic "LLM quality" scores because they are tied to concrete operations.
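These rollups are straightforward once runs are structured data. A stdlib-only sketch, where the per-run record shape is invented for illustration (you would derive something similar from your exported spans):

```python
# Hypothetical per-run records, as might be derived from exported traces.
runs = [
    {"succeeded": True,  "tool_calls": 4, "tool_failures": 0, "cost_usd": 0.031},
    {"succeeded": True,  "tool_calls": 2, "tool_failures": 1, "cost_usd": 0.018},
    {"succeeded": False, "tool_calls": 3, "tool_failures": 2, "cost_usd": 0.022},
]

# Reliability: tool call success rate and task completion rate.
total_tool_calls = sum(r["tool_calls"] for r in runs)
tool_failures = sum(r["tool_failures"] for r in runs)
tool_call_success_rate = 1 - tool_failures / total_tool_calls

completed = [r for r in runs if r["succeeded"]]
task_completion_rate = len(completed) / len(runs)

# Cost per *successful* task, not per run: failed runs inflate it,
# which is exactly the signal you want to see.
cost_per_successful_task = sum(r["cost_usd"] for r in runs) / len(completed)
```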


6) Common production mistake

A common mistake is to log only the final prompt and final response.

That is not observability.

For agents, the real failure often happens earlier:

  • retrieval returned the wrong context
  • a tool timed out
  • a guardrail blocked the route
  • a handoff moved to the wrong specialist agent
  • the run succeeded technically but violated a business policy

If you cannot see the intermediate spans, you cannot fix the system quickly.


7) Bottom line

In 2026, production agent observability should look like this:

  • OpenAI Agents SDK for built-in step-level traces
  • LangSmith for debugging, evaluation, and agent-centric monitoring
  • OpenTelemetry for portable, system-wide observability

That combination is practical because it matches how real agent systems fail: not as one big error, but as a chain of small decisions, tool calls, and routing events.

If your agents call tools, hand off work, or run asynchronously, observability is not optional. It is part of the product.


Sources used

  • OpenAI Agents SDK tracing docs
  • LangSmith docs for OpenTelemetry and OpenAI Agents SDK tracing
  • OpenTelemetry GenAI semantic conventions docs
