AI Agent Observability in 2026: OpenAI Agents SDK, LangSmith, and OpenTelemetry
If you are building production AI agents, "it runs on my laptop" is not enough.
You need to answer questions like:
- Which tool calls failed?
- Where did latency spike?
- Which handoff or guardrail caused the run to derail?
- How do you connect agent traces to the rest of your production telemetry?
A practical 2026 stack is:
- OpenAI Agents SDK for agent execution and built-in traces
- LangSmith for agent-native debugging, evaluation, and dashboards
- OpenTelemetry for vendor-neutral export into your wider observability stack
This post focuses on what is actually useful in production.
1) What the OpenAI Agents SDK gives you by default
The OpenAI Agents SDK ships with built-in tracing enabled by default.
That matters because agent failures are rarely a single API error. A bad run is usually a sequence:
- user input
- retrieval
- model generation
- tool call
- guardrail
- handoff
- retry
- final output
The SDK records this entire workflow as a trace composed of spans.
According to the OpenAI Agents SDK tracing docs, the default instrumentation includes:
- the overall Runner.run() / run_sync() / run_streamed() workflow
- agent spans
- generation spans
- function/tool call spans
- guardrail spans
- handoff spans
- audio transcription and speech spans when relevant
This is the right baseline because agent debugging requires step-level causality, not just final output logging.
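To make "step-level causality" concrete, here is a minimal, self-contained sketch of a trace as a tree of spans. The names are hypothetical; the real SDK builds and exports this structure for you:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    """One step in an agent run: generation, tool call, guardrail, handoff."""
    name: str
    start_ms: float
    end_ms: float
    children: list["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms

def slowest_leaf(span: Span) -> Span:
    """Walk the tree and return the slowest step that has no children."""
    if not span.children:
        return span
    return max((slowest_leaf(c) for c in span.children),
               key=lambda s: s.duration_ms)

# A toy run: one workflow span with a generation and a tool call under it.
run = Span("workflow", 0, 950, children=[
    Span("generation", 0, 300),
    Span("tool:search", 300, 900),
])

print(slowest_leaf(run).name)  # tool:search
```

With only final-output logging, all you would see is a slow run; with the span tree, you can see that the tool call, not the model, dominated latency.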
Important production detail
For long-running workers and background jobs, the SDK documentation recommends calling flush_traces() when you need immediate export at the end of a unit of work.
That is important if you run agents inside:
- Celery workers
- background tasks
- queue consumers
- cron-style jobs
A minimal pattern looks like this:
```python
from agents import Runner, flush_traces, trace

def run_job(agent, prompt: str):
    try:
        with trace("background_job"):
            result = Runner.run_sync(agent, prompt)
            return result.final_output
    finally:
        flush_traces()
```
Without an explicit flush, traces may export in the background a few seconds later. That is acceptable for many apps, but not for every operational workflow.
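The failure mode is easy to reproduce with a toy buffered exporter. The class and method names below are hypothetical; they only mimic the background batching that real tracing SDKs do:

```python
class BufferedExporter:
    """Toy exporter that batches spans and only ships them on flush,
    mimicking the background batching of a real tracing SDK."""
    def __init__(self):
        self.buffer = []    # spans queued but not yet shipped
        self.exported = []  # spans actually delivered to the backend

    def record(self, span: str):
        self.buffer.append(span)

    def flush(self):
        self.exported.extend(self.buffer)
        self.buffer.clear()

exporter = BufferedExporter()

def run_job():
    try:
        exporter.record("background_job")
        exporter.record("tool:search")
    finally:
        # Without this, a worker process exiting now would drop both spans.
        exporter.flush()

run_job()
print(len(exporter.exported))  # 2
```

A short-lived worker that exits before the background export interval elapses loses exactly the traces you most need: the ones from the run that just failed.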
2) Where LangSmith fits
OpenAI's built-in tracing is useful, but most teams also need:
- searchable traces across runs
- evaluation workflows
- dashboards and alerts
- user feedback logging
- framework-agnostic observability
This is where LangSmith fits.
LangSmith's documentation now explicitly covers:
- OpenTelemetry-based tracing
- OpenAI Agents SDK tracing
- tracing for both LangChain and non-LangChain applications
That means you do not need to rewrite your stack around one framework to get observability.
A practical division of labor is:
- OpenAI Agents SDK = detailed agent workflow traces
- LangSmith = developer-facing debugging, evaluation, alerting, run inspection
- OpenTelemetry = standard transport layer into the rest of your telemetry system
This separation is valuable because it avoids lock-in while still giving you agent-native debugging.
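Wiring this up is mostly configuration. As a sketch, LangSmith can ingest traces over the standard OTLP environment variables; the endpoint URL and header name below are assumptions to verify against the current LangSmith docs before use:

```shell
# Standard, vendor-neutral OpenTelemetry exporter settings.
# Endpoint and header shown here are assumptions -- check the LangSmith docs.
export OTEL_EXPORTER_OTLP_ENDPOINT="https://api.smith.langchain.com/otel"
export OTEL_EXPORTER_OTLP_HEADERS="x-api-key=<your-langsmith-api-key>"
```

Because these are the generic OTLP variables, pointing the same traces at a different backend later is a configuration change, not a code change.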
3) Why OpenTelemetry matters
In production, agents should not be a special observability island.
Your infra team already monitors:
- APIs
- workers
- databases
- queues
- cost and latency trends
If agent telemetry cannot join that system, you create a blind spot.
OpenTelemetry solves this by giving you a vendor-neutral standard for traces, metrics, and logs.
The OpenTelemetry GenAI semantic conventions now cover:
- events
- metrics
- model spans
- agent spans
- provider-specific conventions including OpenAI
- related conventions for MCP
This is the key architectural point:
Your agent stack can be agent-native at development time and still be standards-based in production.
That is how you avoid rebuilding your monitoring stack every time the AI tooling layer changes.
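As a sketch of what "standards-based" means in practice, an agent step can be described with attributes in the style of the GenAI semantic conventions. The exact attribute names are still evolving, so treat the ones below as illustrative of the convention, not authoritative:

```python
def tool_call_attributes(tool_name: str, model: str, provider: str) -> dict:
    """Build a vendor-neutral attribute map in the style of the
    OpenTelemetry GenAI semantic conventions (names illustrative)."""
    return {
        "gen_ai.operation.name": "execute_tool",  # kind of GenAI step
        "gen_ai.tool.name": tool_name,            # which tool was invoked
        "gen_ai.request.model": model,            # model driving the agent
        "gen_ai.system": provider,                # provider, e.g. "openai"
    }

attrs = tool_call_attributes("search_docs", "gpt-4.1", "openai")
print(attrs["gen_ai.operation.name"])  # execute_tool
```

Any OpenTelemetry-aware backend can index, alert on, and correlate spans carrying attributes like these alongside your HTTP and database spans.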
4) Recommended architecture
For a small team shipping real agents, use this pattern:
Layer 1 — Agent runtime
Use the OpenAI Agents SDK to run workflows and capture fine-grained traces for:
- model generations
- tool invocations
- guardrails
- handoffs
- retries
Layer 2 — Agent debugging and evaluation
Use LangSmith for:
- per-run debugging
- dataset-based evaluation
- dashboards
- feedback collection
- trace search and comparison
Layer 3 — Platform observability
Use OpenTelemetry export so traces can join:
- API traces
- worker traces
- database spans
- alerting pipelines
- cost and latency dashboards
This gives you both:
- deep agent visibility
- system-wide operational visibility
5) What to measure first
Most teams over-measure prompt details and under-measure workflow reliability.
Start with:
Reliability
- tool call success rate
- handoff success rate
- guardrail trigger rate
- retry rate
- final task completion rate
Performance
- total run latency
- per-tool latency
- model latency by step
- queue wait time for async jobs
Cost
- cost per successful task
- cost per failed run
- token usage by workflow stage
Quality
- user feedback attached to trace IDs
- evaluation score by workflow version
- regression rate after prompt/tool changes
These metrics are more useful than generic "LLM quality" scores because they are tied to concrete operations.
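These rates fall out of trace data directly. Here is a self-contained sketch, assuming each run record carries its tool outcomes, cost, and completion status (the field names are hypothetical):

```python
# Toy run records as they might be aggregated from traces.
runs = [
    {"tools_ok": 3, "tools_failed": 0, "cost": 0.04, "completed": True},
    {"tools_ok": 1, "tools_failed": 1, "cost": 0.09, "completed": True},
    {"tools_ok": 0, "tools_failed": 2, "cost": 0.03, "completed": False},
]

# Reliability: share of tool calls that succeeded across all runs.
total_tools = sum(r["tools_ok"] + r["tools_failed"] for r in runs)
tool_success_rate = sum(r["tools_ok"] for r in runs) / total_tools

# Reliability: share of runs that completed the task.
completion_rate = sum(r["completed"] for r in runs) / len(runs)

# Cost: spend per successful task, not per API call.
successful = [r for r in runs if r["completed"]]
cost_per_success = sum(r["cost"] for r in successful) / len(successful)

print(round(tool_success_rate, 3))  # 4 successes out of 7 calls
print(round(cost_per_success, 3))   # 0.065
```

The point of computing cost per successful task rather than cost per request is that retries and failed runs still cost money; dividing by successes shows what a delivered outcome actually costs.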
6) Common production mistake
A common mistake is to log only the final prompt and final response.
That is not observability.
For agents, the real failure often happens earlier:
- retrieval returned the wrong context
- a tool timed out
- a guardrail blocked the route
- a handoff moved to the wrong specialist agent
- the run succeeded technically but violated a business policy
If you cannot see the intermediate spans, you cannot fix the system quickly.
7) Bottom line
In 2026, production agent observability should look like this:
- OpenAI Agents SDK for built-in step-level traces
- LangSmith for debugging, evaluation, and agent-centric monitoring
- OpenTelemetry for portable, system-wide observability
That combination is practical because it matches how real agent systems fail: not as one big error, but as a chain of small decisions, tool calls, and routing events.
If your agents call tools, hand off work, or run asynchronously, observability is not optional. It is part of the product.
Sources used
- OpenAI Agents SDK tracing docs
- LangSmith docs for OpenTelemetry and OpenAI Agents SDK tracing
- OpenTelemetry GenAI semantic conventions docs