AI Agent Observability in 2026: OpenAI Agents SDK, LangSmith, and OpenTelemetry
If you are building production AI agents, "it runs on my laptop" is not enough.
You need to answer questions like:
- Which tool calls failed?
- Where did latency spike?
- Which handoff or guardrail caused the run to derail?
- How do you connect agent traces to the rest of your production telemetry?
A practical 2026 stack is:
- OpenAI Agents SDK for agent execution and built-in traces
- LangSmith for agent-native debugging, evaluation, and dashboards
- OpenTelemetry for vendor-neutral export into your wider observability stack
This post focuses on what is actually useful in production.
1) What the OpenAI Agents SDK gives you by default
The OpenAI Agents SDK ships with built-in tracing enabled by default.
That matters because agent failures are rarely a single API error. A bad run is usually a sequence:
- user input
- retrieval
- model generation
- tool call
- guardrail
- handoff
- retry
- final output
The SDK captures this workflow as traces composed of spans, one span per step.
According to the OpenAI Agents SDK tracing docs, the default instrumentation includes:
- the overall workflow around `Runner.run()` / `run_sync()` / `run_streamed()`
- agent spans
- generation spans
- function/tool call spans
- guardrail spans
- handoff spans
- audio transcription and speech spans when relevant
This is the right baseline because agent debugging requires step-level causality, not just final output logging.
Important production detail
For long-running workers and background jobs, the SDK documentation recommends calling `flush_traces()` when you need immediate export at the end of a unit of work.
That is important if you run agents inside:
- Celery workers
- background tasks
- queue consumers
- cron-style jobs
A minimal pattern looks like this:
```python
from agents import Runner, flush_traces, trace

def run_job(agent, prompt: str):
    try:
        with trace("background_job"):
            result = Runner.run_sync(agent, prompt)
            return result.final_output
    finally:
        flush_traces()
```
Without an explicit flush, traces may export in the background a few seconds later. That is acceptable for many apps, but not for every operational workflow.
2) Where LangSmith fits
OpenAI's built-in tracing is useful, but most teams also need:
- searchable traces across runs
- evaluation workflows
- dashboards and alerts
- user feedback logging
- framework-agnostic observability
This is where LangSmith fits.
Per its documentation, LangSmith now explicitly supports:
- OpenTelemetry-based tracing
- OpenAI Agents SDK tracing
- tracing for both LangChain and non-LangChain applications
That means you do not need to rewrite your stack around one framework to get observability.
A practical division of labor is:
- OpenAI Agents SDK = emits detailed agent workflow traces
- LangSmith = developer-facing debugging, evaluation, alerting, run inspection
- OpenTelemetry = standard transport layer into the rest of your telemetry system
This separation is valuable because it avoids lock-in while still giving you agent-native debugging.
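As one sketch of what the transport layer can look like: because LangSmith accepts OTLP, you can often point a standard OpenTelemetry exporter at it with nothing but environment variables. The endpoint path and header names below are assumptions for illustration; confirm them against the current LangSmith docs before relying on them.

```shell
# Point a standard OTLP exporter at LangSmith's OTel ingestion endpoint.
# Endpoint and header names are illustrative — check the LangSmith docs
# for the values that apply to your region and plan.
export OTEL_EXPORTER_OTLP_ENDPOINT="https://api.smith.langchain.com/otel"
export OTEL_EXPORTER_OTLP_HEADERS="x-api-key=<your-langsmith-api-key>,Langsmith-Project=<your-project>"
```

The point of this pattern is that nothing in your application code mentions LangSmith; swapping backends later is a config change, not a rewrite.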
3) Why OpenTelemetry matters
In production, agents should not be a special observability island.
Your infra team already monitors:
- APIs
- workers
- databases
- queues
- cost and latency trends
If agent telemetry cannot join that system, you create a blind spot.
OpenTelemetry solves this by giving you a vendor-neutral standard for traces, metrics, and logs.
The OpenTelemetry GenAI semantic conventions now cover:
- events
- metrics
- model spans
- agent spans
- provider-specific conventions including OpenAI
- related conventions for MCP
This is the key architectural point:
Your agent stack can be agent-native at development time and still be standards-based in production.
That is how you avoid rebuilding your monitoring stack every time the AI tooling layer changes.
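To make that concrete, here is a rough sketch of the kind of span attributes the GenAI semantic conventions define. The attribute names below (`gen_ai.operation.name`, `gen_ai.usage.input_tokens`, and so on) follow the published conventions, but those conventions are still evolving, so treat the exact keys as illustrative and verify them against the current spec.

```python
# Illustrative span attributes following the OpenTelemetry GenAI
# semantic conventions. Exact attribute names are still evolving;
# verify against the current spec before shipping.

def genai_span_attributes(model: str, input_tokens: int, output_tokens: int) -> dict:
    """Build a vendor-neutral attribute set for a model-generation span."""
    return {
        "gen_ai.operation.name": "chat",        # kind of GenAI operation
        "gen_ai.request.model": model,          # model the agent requested
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
```

Because these keys are standardized rather than vendor-specific, any OTel backend can group, chart, and alert on them the same way it already treats HTTP or database spans.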
4) Recommended architecture
For a small team shipping real agents, use this pattern:
Layer 1 — Agent runtime
Use the OpenAI Agents SDK to run workflows and capture fine-grained traces for:
- model generations
- tool invocations
- guardrails
- handoffs
- retries
Layer 2 — Agent debugging and evaluation
Use LangSmith for:
- per-run debugging
- dataset-based evaluation
- dashboards
- feedback collection
- trace search and comparison
Layer 3 — Platform observability
Use OpenTelemetry export so traces can join:
- API traces
- worker traces
- database spans
- alerting pipelines
- cost and latency dashboards
This gives you both:
- deep agent visibility
- system-wide operational visibility
5) What to measure first
Most teams over-measure prompt details and under-measure workflow reliability.
Start with:
Reliability
- tool call success rate
- handoff success rate
- guardrail trigger rate
- retry rate
- final task completion rate
Performance
- total run latency
- per-tool latency
- model latency by step
- queue wait time for async jobs
Cost
- cost per successful task
- cost per failed run
- token usage by workflow stage
Quality
- user feedback attached to trace IDs
- evaluation score by workflow version
- regression rate after prompt/tool changes
These metrics are more useful than generic "LLM quality" scores because they are tied to concrete operations.
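A toy sketch of how two of these metrics might be computed from span-level records. The record shape here is invented for illustration; in practice these fields would come from your trace backend's query API.

```python
from collections import defaultdict

# Hypothetical span records; real ones would come from your trace backend.
runs = [
    {"run_id": "r1", "stage": "retrieval",  "cost": 0.002, "success": True},
    {"run_id": "r1", "stage": "generation", "cost": 0.031, "success": True},
    {"run_id": "r2", "stage": "generation", "cost": 0.027, "success": False},
]

def cost_per_successful_task(spans: list) -> float:
    """Total spend divided by the number of distinct successful runs."""
    total = sum(s["cost"] for s in spans)
    ok = {s["run_id"] for s in spans if s["success"]}
    return total / len(ok) if ok else float("inf")

def cost_by_stage(spans: list) -> dict:
    """Aggregate spend per workflow stage to spot expensive steps."""
    out = defaultdict(float)
    for s in spans:
        out[s["stage"]] += s["cost"]
    return dict(out)
```

Note that `cost_per_successful_task` deliberately includes the cost of failed runs in the numerator: failed runs are real spend, and hiding them is how "cheap" workflows turn out to be expensive.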
6) Common production mistake
A common mistake is to log only the final prompt and final response.
That is not observability.
For agents, the real failure often happens earlier:
- retrieval returned the wrong context
- a tool timed out
- a guardrail blocked the route
- a handoff moved to the wrong specialist agent
- the run succeeded technically but violated a business policy
If you cannot see the intermediate spans, you cannot fix the system quickly.
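A framework-neutral sketch of the idea: wrap each intermediate step in its own span and record metadata about the tool's response, not just whether it raised. Everything here (`tool_span`, `search_tool`, the `SPANS` list) is hypothetical scaffolding for illustration; in a real system the Agents SDK, LangSmith, or an OTel SDK would record these spans.

```python
import time
from contextlib import contextmanager

# Minimal span recorder for illustration only; a real system would
# delegate this to the Agents SDK, LangSmith, or an OTel tracer.
SPANS: list = []

@contextmanager
def tool_span(name: str):
    """Record timing, status, and response metadata for one step."""
    span = {"name": name, "start": time.time(), "status": "ok", "meta": {}}
    try:
        yield span
    except Exception as exc:
        span["status"] = f"error: {exc}"
        raise
    finally:
        span["end"] = time.time()
        SPANS.append(span)

def search_tool(query: str) -> dict:
    # Hypothetical tool that "succeeds" while returning nothing useful.
    with tool_span("search_tool") as span:
        response = {"results": []}
        span["meta"]["result_count"] = len(response["results"])
        return response
```

A `result_count` of 0 is exactly the kind of silently empty response that never raises an exception but still derails the run downstream; capturing it as span metadata is what makes it findable later.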
7) Bottom line
In 2026, production agent observability should look like this:
- OpenAI Agents SDK for built-in step-level traces
- LangSmith for debugging, evaluation, and agent-centric monitoring
- OpenTelemetry for portable, system-wide observability
That combination is practical because it matches how real agent systems fail: not as one big error, but as a chain of small decisions, tool calls, and routing events.
If your agents call tools, hand off work, or run asynchronously, observability is not optional. It is part of the product.
Sources used
- OpenAI Agents SDK tracing docs
- LangSmith docs for OpenTelemetry and OpenAI Agents SDK tracing
- OpenTelemetry GenAI semantic conventions docs
Top comments (3)
This matches a frustration we kept running into — agents that log a confident final answer while the actual problem was a silent tool failure three steps back. The three-layer stack you describe is clean, but in practice we found the trickiest part is correlating traces across parallel agent branches. When two sub-agents are running concurrently and one poisons shared state, you need trace context propagated through every message boundary, not just function calls. We ended up adding a `span_context` field to every inter-agent payload so LangSmith could stitch the full DAG together automatically.

One question: how do you handle agents that invoke other agents asynchronously? The `flush_traces()` recommendation is great for background jobs, but when agent A spawns agent B and immediately returns, B's span often ends up as an orphan without explicit parent propagation. Did you find a pattern that worked reliably for that case?

Solid layering framework — the three-tier approach mirrors how we think about observability in production multi-agent systems. One thing worth adding to the conversation: agent observability needs to differ from service observability in a fundamental way. You need intent visibility, not just execution visibility.
With traditional services, a trace tells you what happened. With agents, you also need to understand why the agent took a given path at a decision fork. We've found that logging the reasoning trace (Claude's thinking steps, or LangGraph's node transitions) alongside the tool calls is the only way to distinguish "the agent chose wrong" from "the agent was given bad inputs." Without that, LangSmith gives you a beautiful trace of a bad decision with no way to know if it was model error or prompt error.
The OpenTelemetry piece becomes especially critical once you're running multiple agent instances concurrently — correlating spans across parallel runs without confusing their contexts is genuinely hard. Have you tried OTEL async context propagation with Claude's streaming responses? That's where things got messy for us.
The intermediate failure point is the one that bites teams hardest and is almost never in the documentation. You instrument inputs and outputs, your dashboards look clean, and then a tool call silently returns a malformed response that the agent "handles" by hallucinating around it — no exception, no logged error, just subtly wrong downstream output. The only way to catch that is exactly what you describe: capturing tool response metadata as its own span, not just logging the final prompt/completion pair.
One thing worth adding to the OTel layer: cost attribution at the span level rather than the session level. In a multi-agent pipeline, certain agents or tool calls dominate spend in ways that aren't visible in aggregate dashboards. We've found that adding `model.cost` and `tokens.input`/`tokens.output` as span attributes — and then grouping in your OTel backend by agent role — immediately reveals which part of your workflow is expensive versus which is actually complex.
The flush_traces() tip for background jobs is genuinely important. Lost traces in async workers is a silent problem that ruins your debugging fidelity right when you need it most.