klement Gunndu
Trace Your AI Agent With OpenTelemetry in Python

Your AI agent passed every test. Then a user asked it something slightly different, and it returned garbage.

You check the logs. They say "200 OK." The LLM responded. The tools ran. But somewhere between the prompt and the final output, the chain went wrong — and you have no idea where.

This is the observability gap that kills AI agents in production. Traditional logging tells you what happened. Tracing tells you where, how long, and in what order each step executed. For multi-step agents that call tools, chain prompts, and make decisions, tracing is the difference between debugging for 5 minutes and debugging for 5 hours.

OpenTelemetry is the industry standard for distributed tracing. As of March 2026, the Python SDK (v1.40.0) is production-stable with dedicated instrumentation libraries for LangChain, OpenAI, and other AI frameworks.

Here are 3 patterns to trace your AI agent — from zero-config auto-instrumentation to custom spans that capture exactly what you need.

Why Tracing Beats Logging for AI Agents

Standard logging captures individual events: "LLM called," "tool returned," "response sent." But AI agents are pipelines — a sequence of dependent steps where the output of one becomes the input of the next. When something goes wrong at step 4, the root cause is often at step 2.

Tracing captures the full execution tree. Each operation becomes a span with a start time, end time, parent-child relationship, and custom attributes. Spans nest inside each other, forming a trace that shows exactly how a single request flowed through your agent.

Three things tracing gives you that logging does not:

  1. Latency attribution — which step is slow? The LLM call? The tool execution? The prompt formatting?
  2. Causality — which upstream decision caused the downstream failure?
  3. Cost per request — by recording token counts on each LLM span, you see the exact cost of each user interaction.
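As a sketch of point 3: once each LLM span records token counts as attributes, per-request cost is plain arithmetic over the spans in a trace. The prices below are hypothetical placeholders, not real rates — substitute your model's actual pricing:

```python
# Hypothetical per-1K-token prices -- replace with your provider's real rates.
PRICES = {"gpt-4o-mini": {"prompt": 0.00015, "completion": 0.0006}}


def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Compute the dollar cost of one LLM call from the token counts
    recorded on its span."""
    p = PRICES[model]
    return (
        (prompt_tokens / 1000) * p["prompt"]
        + (completion_tokens / 1000) * p["completion"]
    )


# e.g. a span that recorded 1200 prompt tokens and 300 completion tokens
cost = request_cost("gpt-4o-mini", 1200, 300)
```

Summing this over every LLM span in a trace gives the exact cost of that user interaction.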

Pattern 1: Auto-Instrument LangChain With Zero Code Changes

The fastest way to add tracing to an existing LangChain agent is the opentelemetry-instrumentation-langchain package (v0.53.0, Python >=3.10). It wraps every LLM call, chain invocation, and tool execution in OpenTelemetry spans automatically.

Install:

pip install opentelemetry-sdk opentelemetry-instrumentation-langchain

Add two lines before your agent code runs:

from opentelemetry.instrumentation.langchain import LangchainInstrumentor

LangchainInstrumentor().instrument()

That's it. Every chain.invoke(), llm.predict(), and tool call now emits a span with:

  • Operation name (e.g., langchain.chain.invoke)
  • Duration in milliseconds
  • Input prompt and output completion (logged to span attributes by default)
  • Token counts for cost tracking

To see the traces locally during development, add a console exporter:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    SimpleSpanProcessor,
    ConsoleSpanExporter,
)
from opentelemetry.instrumentation.langchain import LangchainInstrumentor

# Set up tracing to console
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# Auto-instrument LangChain
LangchainInstrumentor().instrument()

# Your agent code — no changes needed
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_template(
    "Explain {concept} in one paragraph"
)
model = ChatOpenAI(model="gpt-4o-mini")
chain = prompt | model

result = chain.invoke({"concept": "distributed tracing"})
print(result.content)

Run this and you'll see JSON trace output in your terminal showing each span with its parent-child relationships, timing, and attributes.

Privacy Note

By default, the instrumentor logs prompts and completions to span attributes. In production, this might include sensitive user data. Disable it with:

export TRACELOOP_TRACE_CONTENT=false

This keeps timing and structure visible while stripping the actual content.

Pattern 2: Custom Spans for Agent Decision Points

Auto-instrumentation captures framework-level operations. But the most valuable debugging information lives in your application logic — the decisions your agent makes between LLM calls.

Use manual spans to trace these decision points:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
)

# Configure provider with batch processing for production
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("ai-agent")


def route_query(query: str) -> str:
    """Decide which tool to use based on query content."""
    with tracer.start_as_current_span("route_query") as span:
        span.set_attribute("query.text", query)
        span.set_attribute("query.length", len(query))

        if "search" in query.lower():
            tool = "web_search"
        elif "calculate" in query.lower():
            tool = "calculator"
        else:
            tool = "general_llm"

        span.set_attribute("routing.tool_selected", tool)
        span.set_attribute("routing.reason", f"keyword match: {tool}")
        return tool


def execute_tool(tool_name: str, query: str) -> str:
    """Execute the selected tool and trace the result."""
    with tracer.start_as_current_span("execute_tool") as span:
        span.set_attribute("tool.name", tool_name)

        # Simulate tool execution
        if tool_name == "web_search":
            result = f"Search results for: {query}"
        elif tool_name == "calculator":
            result = "42"
        else:
            result = f"LLM response to: {query}"

        span.set_attribute("tool.result_length", len(result))
        span.set_attribute("tool.success", True)
        return result


def run_agent(query: str) -> str:
    """Full agent pipeline with tracing."""
    with tracer.start_as_current_span("agent_pipeline") as span:
        span.set_attribute("agent.query", query)

        tool = route_query(query)
        result = execute_tool(tool, query)

        span.set_attribute("agent.tool_used", tool)
        span.set_attribute("agent.response_length", len(result))
        return result


response = run_agent("search for Python OpenTelemetry examples")
print(response)

The trace output shows a tree:

agent_pipeline (parent)
├── route_query (child — 2ms)
│   └── routing.tool_selected: web_search
└── execute_tool (child — 15ms)
    └── tool.name: web_search

This is where debugging becomes fast. When a user reports a wrong answer, you find the trace and see exactly which tool was selected, why, and what it returned. No guessing.

What Attributes to Record

Not every variable is worth tracing. Focus on the decision boundaries:

| Attribute | Why It Matters |
| --- | --- |
| `routing.tool_selected` | Which path the agent took |
| `routing.reason` | Why it picked that path |
| `tool.success` | Did the tool call succeed? |
| `tool.result_length` | Catch empty or truncated results |
| `llm.token_count` | Cost tracking per request |
| `llm.model` | Which model handled this step |
| `agent.retry_count` | Detect retry storms |

Pattern 3: Export to a Backend for Production Debugging

Console output works for development. In production, you need traces in a backend where you can search, filter, and set alerts.

OpenTelemetry exports to any OTLP-compatible backend: Jaeger, Grafana Tempo, Datadog, New Relic, or a self-hosted collector. The code change is one line — swap the exporter.

Export to an OTLP Collector

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.http.trace_exporter import (
    OTLPSpanExporter,
)

# Identify your service in the backend
resource = Resource(attributes={
    "service.name": "my-ai-agent",
    "service.version": "1.0.0",
    "deployment.environment": "production",
})

provider = TracerProvider(resource=resource)

# Point to your OTLP collector
otlp_exporter = OTLPSpanExporter(
    endpoint="http://localhost:4318/v1/traces"
)

provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)

Install the OTLP exporter:

pip install opentelemetry-exporter-otlp-proto-http

Run Jaeger Locally in 30 Seconds

Jaeger is an open-source tracing backend. Run it with Docker to see your traces in a web UI:

docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4318:4318 \
  jaegertracing/jaeger:2 \
  --set receivers.otlp.protocols.http.endpoint=0.0.0.0:4318

Open http://localhost:16686, select your service name, and search for traces. Each trace shows the full span tree with timing and attributes.

Combine Auto-Instrumentation + Custom Spans

The real power is combining Pattern 1 and Pattern 2. Auto-instrument LangChain for framework-level spans, then add custom spans for your application logic:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.http.trace_exporter import (
    OTLPSpanExporter,
)
from opentelemetry.instrumentation.langchain import LangchainInstrumentor

# Production setup
resource = Resource(attributes={"service.name": "my-ai-agent"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")
    )
)
trace.set_tracer_provider(provider)

# Auto-instrument LangChain
LangchainInstrumentor().instrument()

# Custom tracer for application logic
tracer = trace.get_tracer("agent-logic")


def process_user_request(user_input: str) -> str:
    with tracer.start_as_current_span("process_request") as span:
        span.set_attribute("user.input_length", len(user_input))

        # This LangChain call is auto-traced
        from langchain_openai import ChatOpenAI
        from langchain_core.prompts import ChatPromptTemplate

        prompt = ChatPromptTemplate.from_template(
            "Answer concisely: {question}"
        )
        model = ChatOpenAI(model="gpt-4o-mini")
        chain = prompt | model

        result = chain.invoke({"question": user_input})

        span.set_attribute("response.length", len(result.content))
        return result.content

The trace tree now shows both layers:

process_request (your custom span)
└── langchain.chain.invoke (auto-instrumented)
    ├── langchain.prompt.format
    └── langchain.llm.predict
        └── openai.chat.completions

One glance tells you: the request took 1.2s total, 50ms was routing, 1.1s was the LLM call, and the LLM returned 847 tokens.

3 Production Lessons

After running OpenTelemetry on multi-agent systems in production, three patterns consistently prevent outages:

1. Trace sampling is mandatory. Tracing every request in production generates gigabytes of data per hour. Use a sampling rate — start at 10% and adjust. The Python SDK supports head-based sampling out of the box; tail-based sampling is available through the OpenTelemetry Collector.

2. Set span limits. An agent that enters a retry loop can generate thousands of spans per request. Set max_events and max_attributes in your TracerProvider to prevent memory exhaustion:

from opentelemetry.sdk.trace import TracerProvider, SpanLimits

provider = TracerProvider(
    span_limits=SpanLimits(
        max_events=128,
        max_attributes=64,
    )
)

3. Add error status to failed spans. When a tool call fails, mark the span as an error so your backend can alert on it:

from opentelemetry.trace import StatusCode

with tracer.start_as_current_span("tool_call") as span:
    try:
        result = call_external_tool()
    except Exception as e:
        span.set_status(StatusCode.ERROR, str(e))
        span.record_exception(e)
        raise

What to Trace First

If you're adding OpenTelemetry to an existing agent, start with these 3 spans. They catch 80% of production issues:

  1. The full request — parent span wrapping the entire agent pipeline. Measures end-to-end latency.
  2. Each LLM call — auto-instrumentation handles this. Watch for latency spikes and token count anomalies.
  3. Tool selection and execution — custom spans on your routing logic. The #1 source of wrong answers is the agent picking the wrong tool.

Everything else — prompt formatting, memory retrieval, output parsing — can be added incrementally as you encounter specific debugging needs.


All code examples use OpenTelemetry Python SDK v1.40.0 and opentelemetry-instrumentation-langchain v0.53.0, tested as of March 2026.

Follow @klement_gunndu for more AI engineering content. We're building in public.
