
Pax

Posted on • Originally published at paxrel.com

AI Agent Observability: Tracing, Logging & Debugging in Production (2026 Guide)

Your AI agent works in development. It passes tests. You deploy it. Then a user reports: "It gave me a completely wrong answer." Now what?

Without observability, debugging an AI agent is like debugging a web app with no logs — impossible. You can't see which tools it called, what the LLM returned at each step, why it chose one path over another, or where the reasoning broke down.

This guide covers everything you need to make your AI agent observable: what to trace, how to structure logs, which tools to use, and how to build dashboards that actually help you debug production issues.

## Why Agent Observability Is Different

Traditional application monitoring tracks request/response pairs. AI agent observability needs to track **multi-step reasoning chains** where each step involves an LLM call, a tool invocation, or a decision point.


| Traditional App | AI Agent |
|---|---|
| Deterministic flow | Non-deterministic (LLM decides the path) |
| Fixed number of steps | Variable steps (1 to 50+) |
| Errors are clear | Errors can be subtle (correct format, wrong content) |
| Latency is predictable | Latency varies 10x based on reasoning path |
| Cost is fixed per request | Cost varies based on tokens consumed |
| One service call | Multiple LLM + tool calls per request |


You need three pillars of observability for agents: **traces** (the full execution path), **logs** (what happened at each step), and **metrics** (aggregate performance data).

## Pillar 1: Distributed Tracing for Agents

A trace captures the full lifecycle of a single agent request — every LLM call, tool invocation, and decision point.

### Trace Structure

A typical agent trace looks like this:
```
Request: "What were our sales last quarter?"
│
├── [Span] LLM Decision (420ms, 850 tokens)
│   └── Decision: Call tool "query_database"
│
├── [Span] Tool: query_database (180ms)
│   ├── Input: SELECT SUM(amount) FROM sales WHERE quarter='Q4-2025'
│   └── Output: {"total": 1247500}
│
├── [Span] LLM Decision (380ms, 620 tokens)
│   └── Decision: Call tool "format_currency"
│
├── [Span] Tool: format_currency (2ms)
│   └── Output: "$1,247,500"
│
└── [Span] LLM Response (290ms, 430 tokens)
    └── "Your sales last quarter were $1,247,500..."

Total: 1,272ms | 1,900 tokens | $0.008 | 5 spans
```
### Implementation with OpenTelemetry

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup
provider = TracerProvider()
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317"))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("ai-agent")

class TracedAgent:
    # Assumes self.llm, self.tools, and self.context are set up in __init__

    def run(self, user_input: str) -> str:
        with tracer.start_as_current_span("agent_request") as span:
            span.set_attribute("user.input", user_input[:200])
            span.set_attribute("agent.version", "1.2.0")

            steps = 0
            total_tokens = 0

            while True:
                steps += 1

                # Trace LLM decision
                with tracer.start_as_current_span("llm_decision") as llm_span:
                    decision = self.llm.decide(user_input, self.context)
                    llm_span.set_attribute("llm.model", "gpt-4o")
                    llm_span.set_attribute("llm.tokens.input", decision.input_tokens)
                    llm_span.set_attribute("llm.tokens.output", decision.output_tokens)
                    llm_span.set_attribute("llm.decision_type", decision.type)
                    total_tokens += decision.total_tokens

                if decision.type == "respond":
                    span.set_attribute("agent.steps", steps)
                    span.set_attribute("agent.total_tokens", total_tokens)
                    span.set_attribute("agent.cost_usd", total_tokens * 0.000003)
                    return decision.content

                # Trace tool execution
                with tracer.start_as_current_span("tool_call") as tool_span:
                    tool_span.set_attribute("tool.name", decision.tool)
                    tool_span.set_attribute("tool.input", str(decision.args)[:500])
                    result = self.tools.execute(decision.tool, decision.args)
                    tool_span.set_attribute("tool.output", str(result)[:500])
                    tool_span.set_attribute("tool.success", result.success)

                # Feed the tool result back so the next LLM decision can use it
                self.context.append({"tool": decision.tool, "result": result})
```
**Tip:** Truncate inputs and outputs in span attributes to prevent trace storage from exploding. 200-500 chars is usually enough for debugging. Store full payloads only when needed for replay.

## Pillar 2: Structured Logging

Traces show the flow. Logs capture the details. For AI agents, structured JSON logs are essential — you'll need to filter, aggregate, and search them programmatically.

### What to Log at Each Step


| Event | Required Fields | Optional Fields |
|---|---|---|
| Request received | request_id, user_id, input (truncated) | session_id, source |
| LLM call | model, tokens_in, tokens_out, latency_ms, decision | temperature, prompt_hash |
| Tool call | tool_name, input, output, success, latency_ms | retry_count, error_type |
| Guardrail triggered | guardrail_name, reason, action_taken | input_that_triggered, severity |
| Response sent | request_id, latency_total_ms, total_tokens, cost_usd | user_satisfaction |
| Error | error_type, error_message, step, stack_trace | recovery_action |


### Implementation

```python
import json
import logging
import time

class AgentLogger:
    def __init__(self):
        self.logger = logging.getLogger("agent")
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(message)s"))
        self.logger.addHandler(handler)
        self.logger.setLevel(logging.INFO)

    def _log(self, event: str, **kwargs):
        entry = {
            "timestamp": time.time(),
            "event": event,
            **kwargs
        }
        self.logger.info(json.dumps(entry))

    def request_start(self, request_id: str, user_input: str):
        self._log("request_start",
                  request_id=request_id,
                  input_preview=user_input[:200],
                  input_length=len(user_input))

    def llm_call(self, request_id: str, model: str, tokens_in: int,
                 tokens_out: int, latency_ms: int, decision: str):
        self._log("llm_call",
                  request_id=request_id,
                  model=model,
                  tokens_in=tokens_in,
                  tokens_out=tokens_out,
                  latency_ms=latency_ms,
                  decision_type=decision,
                  cost_usd=round((tokens_in * 0.000003 + tokens_out * 0.000015), 6))

    def tool_call(self, request_id: str, tool: str, success: bool,
                  latency_ms: int, error: str = None):
        self._log("tool_call",
                  request_id=request_id,
                  tool=tool,
                  success=success,
                  latency_ms=latency_ms,
                  error=error)

    def request_end(self, request_id: str, total_ms: int,
                    total_tokens: int, steps: int, cost_usd: float):
        self._log("request_end",
                  request_id=request_id,
                  total_ms=total_ms,
                  total_tokens=total_tokens,
                  steps=steps,
                  cost_usd=cost_usd)
```
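
Because each entry is one JSON object per line, you can aggregate the resulting log stream offline with a few lines of Python. A minimal sketch (the sample lines are hypothetical, shaped like the logger's output):

```python
import json

def tool_success_rates(lines):
    """Aggregate (successes, calls) per tool from JSONL log lines."""
    stats = {}
    for line in lines:
        entry = json.loads(line)
        if entry.get("event") != "tool_call":
            continue  # Only aggregate tool_call events
        ok, calls = stats.get(entry["tool"], (0, 0))
        stats[entry["tool"]] = (ok + int(entry["success"]), calls + 1)
    return stats

# Hypothetical sample lines
sample = [
    '{"event": "tool_call", "request_id": "r1", "tool": "query_database", "success": true, "latency_ms": 180}',
    '{"event": "tool_call", "request_id": "r1", "tool": "query_database", "success": false, "latency_ms": 95}',
    '{"event": "request_end", "request_id": "r1", "total_ms": 1272}',
]

print(tool_success_rates(sample))  # {'query_database': (1, 2)}
```

The same pattern works with `jq` or any log search backend once the logs land in a file or a pipeline.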
## Pillar 3: Metrics and Dashboards

Metrics give you the bird's-eye view. While traces help you debug individual requests, metrics tell you how your agent is performing overall.

### Essential Agent Metrics

| Metric | Type | Alert Threshold |
|---|---|---|
| Request latency (p50, p95, p99) | Histogram | p95 > 30s |
| Tokens per request | Histogram | p99 > 10,000 |
| Cost per request | Histogram | p99 > $0.50 |
| Steps per request | Histogram | Mean > 8 |
| Tool success rate | Counter | |
| LLM error rate | Counter | > 2% |
| Guardrail trigger rate | Counter | > 10% |
| Daily cost | Gauge | > budget * 0.8 |
| Requests per minute | Counter | Spike > 3x baseline |


### Prometheus Metrics Example

```python
from prometheus_client import Histogram, Counter, Gauge

# Latency
agent_latency = Histogram(
    "agent_request_duration_seconds",
    "Time to complete an agent request",
    buckets=[0.5, 1, 2, 5, 10, 20, 30, 60]
)

# Cost
agent_cost = Histogram(
    "agent_request_cost_usd",
    "Cost per agent request in USD",
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]
)

# Token usage
agent_tokens = Histogram(
    "agent_tokens_total",
    "Total tokens per request",
    ["model"],
    buckets=[100, 500, 1000, 2000, 5000, 10000]
)

# Tool calls
tool_calls = Counter(
    "agent_tool_calls_total",
    "Total tool calls",
    ["tool_name", "status"]  # status: success, failure, blocked
)

# Daily spend
daily_spend = Gauge(
    "agent_daily_spend_usd",
    "Running total spend for today"
)
```
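
Declaring the series is only half the job; each request still has to record into them. A sketch of the instrumentation side (the metrics are redefined here so the snippet runs standalone; in the app, reuse the objects declared above, and the `agent` interface is an assumption):

```python
import time
from prometheus_client import Histogram, Counter

# Standalone redefinitions; reuse the article's objects in real code
agent_latency = Histogram("agent_request_duration_seconds", "Request duration")
tool_calls = Counter("agent_tool_calls_total", "Tool calls", ["tool_name", "status"])

def run_with_metrics(agent, user_input: str):
    """Wrap an agent call so its duration is always recorded."""
    start = time.monotonic()
    try:
        return agent.run(user_input)
    finally:
        agent_latency.observe(time.monotonic() - start)

# Inside the agent loop, after each tool execution:
# tool_calls.labels(tool_name=decision.tool,
#                   status="success" if result.success else "failure").inc()
```

Expose the registry with `prometheus_client.start_http_server(port)` and point Prometheus at it.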
## Observability Tools Compared

You don't have to build everything from scratch. Here are the main tools for agent observability in 2026:

| Tool | Best For | Price | Key Feature |
|---|---|---|---|
| LangSmith | LangChain agents | Free tier + $39/mo | Full trace visualization, playground replay |
| Langfuse | Any framework (open source) | Free (self-host) / $59/mo | Open source, prompt management |
| Arize Phoenix | LLM evaluation | Free (open source) | Embedding visualization, eval workflows |
| OpenTelemetry + Jaeger | Custom agents | Free (open source) | Standard protocol, any backend |
| Helicone | LLM API proxy | Free tier + $20/mo | Zero-code setup, cost tracking |
| Braintrust | Evals + observability | Free tier + usage | Eval-first, CI integration |
| Datadog LLM Observability | Enterprise | Contact sales | Full APM integration, RBAC |

### Quick Setup: Langfuse (Open Source)

```python
# pip install langfuse
# Credentials are read from LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY,
# and LANGFUSE_HOST (https://cloud.langfuse.com or your self-hosted URL)

from langfuse.decorators import observe, langfuse_context

@observe(name="tool_call")
def traced_tool_call(tool: str, args: dict):
    # Nested @observe functions become child spans of the current trace
    langfuse_context.update_current_observation(metadata={"tool": tool})
    return execute_tool(tool, args)

@observe()  # Automatically traces this function
def run_agent(user_input: str) -> str:
    langfuse_context.update_current_observation(
        input=user_input,
        metadata={"version": "1.2.0"}
    )

    # LLM call — captured automatically when made through a wrapped client
    decision = call_llm(user_input)

    # Tool call — recorded as a child span via the decorator above
    result = traced_tool_call(decision.tool, decision.args)

    response = generate_response(result)
    langfuse_context.update_current_observation(output=response)
    return response
```
## Debugging Common Agent Failures

Here are the most common production issues and how observability helps you diagnose them:

### 1. Wrong Answer (Hallucination)

**Symptom:** Agent returns confident but incorrect information.

**Debug with traces:** Check the tool call outputs. Did the tool return correct data? If yes, the LLM misinterpreted it. If no, the tool query was wrong (often a hallucinated SQL query or wrong API parameter).
```python
# What to look for in your logs:
# 1. Tool output vs. final response — do they match?
# 2. Did the LLM call the right tool?
# 3. Were the tool arguments correct?

# Common fix: add an output verification step
verification = llm.generate(f"""
Given this tool output: {tool_result}
And this response draft: {agent_response}
Does the response accurately reflect the data? YES/NO + explanation
""")
```
### 2. Infinite Loop

**Symptom:** Request never completes, high token usage.

**Debug with traces:** Look at the steps count. If it's near your max, check the last 5 actions — you'll usually see a pattern: the agent tries tool A, gets an error, tries tool A again with slightly different args, gets the same error, and so on.
```python
# Prevention: track recent actions (e.g. a list of (tool, args) tuples
# appended each step) and flag repeated patterns
if len(actions) >= 10 and actions[-5:] == actions[-10:-5]:  # Same 5 actions repeated
    logger.warning("Loop detected: %s", actions[-5:])
```
### 3. Slow Responses

**Symptom:** Latency spikes to 30s+.

**Debug with traces:** Look at the span durations. Is one LLM call taking 15s (model congestion)? Is a tool call timing out? Is the agent taking too many steps?

The fix depends on what's slow:

- **LLM slow:** Route simple steps to a faster model
- **Tool slow:** Add timeouts, use cached results
- **Too many steps:** Improve the system prompt, add few-shot examples
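
For the slow-tool case, a timeout wrapper keeps one hung call from stalling the entire request. A minimal standard-library sketch (the fallback value is an assumption; real code might return a cached result instead):

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError

_executor = ThreadPoolExecutor(max_workers=8)

def call_tool_with_timeout(fn, args: dict, timeout_s: float = 5.0, fallback=None):
    """Run a tool call in a worker thread; give up after timeout_s seconds."""
    future = _executor.submit(fn, **args)
    try:
        return future.result(timeout=timeout_s)
    except TimeoutError:
        # Return a fallback instead of hanging the whole agent loop
        return fallback

# Example: a tool that sleeps past the timeout gets replaced by the fallback
def slow_tool(delay):
    time.sleep(delay)
    return "done"

print(call_tool_with_timeout(slow_tool, {"delay": 2}, timeout_s=0.1,
                             fallback="TIMEOUT"))  # TIMEOUT
```

Note the worker thread keeps running after the timeout; for tools with side effects, pair this with idempotent tool design.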


### 4. Unexpected Tool Usage

**Symptom:** Agent calls tools it shouldn't, or calls them with wrong arguments.

**Debug with traces:** Check the LLM decision span. What was in the context when it decided to call that tool? Usually it's a prompt injection, ambiguous user input, or a missing tool description.
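
One mitigation is to validate every decision against an explicit allowlist before execution, so a prompt-injected or hallucinated tool call gets blocked and logged rather than run. A minimal sketch (the tool names and required arguments are hypothetical):

```python
# Hypothetical registry: tool name -> set of required argument names
ALLOWED_TOOLS = {
    "query_database": {"query"},
    "format_currency": {"amount"},
}

def validate_tool_call(tool: str, args: dict):
    """Return (ok, reason); call this before executing any tool."""
    if tool not in ALLOWED_TOOLS:
        return False, f"unknown tool: {tool}"
    missing = ALLOWED_TOOLS[tool] - set(args)
    if missing:
        return False, f"missing arguments: {sorted(missing)}"
    return True, "ok"

print(validate_tool_call("delete_user", {"id": 1}))  # (False, 'unknown tool: delete_user')
print(validate_tool_call("format_currency", {"amount": 1247500}))  # (True, 'ok')
```

A failed check is exactly the kind of event to log as a guardrail trigger (see the logging table above).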

## Building an Agent Debug Dashboard

Here's what an effective agent dashboard should show:

### Overview Panel

- Requests per minute (last 24h graph)
- p50/p95 latency (last 24h graph)
- Error rate (target: < 2%)
- Daily cost (running total vs budget)

### Deep Dive Panel

- Slowest requests (clickable to trace view)
- Most expensive requests (token usage breakdown)
- Failed requests (error categorization)
- Guardrail triggers (which guardrails fire most)

### Tool Performance Panel

- Success rate per tool
- Average latency per tool
- Most called tools (shows agent behavior patterns)
- Tool errors by type

### Cost Panel

- Cost per request histogram
- Cost breakdown by model
- Daily/weekly/monthly spend trends
- Cost per user (identify expensive patterns)

## Advanced: Replay and Time Travel Debugging

The killer feature of good agent observability: the ability to replay a past request with identical context.
```python
class TraceRecorder:
    """Record everything needed to replay an agent execution."""

    def record(self, request_id: str):
        return {
            "request_id": request_id,
            "timestamp": time.time(),
            "user_input": self.user_input,
            "system_prompt": self.system_prompt,
            "tool_results": self.tool_results,  # Ordered list
            "llm_responses": self.llm_responses,  # Each step
            "context_at_each_step": self.contexts,
            "final_response": self.response,
        }

class TraceReplayer:
    """Replay a past request, optionally with modifications."""

    def replay(self, trace: dict, modifications: dict = None):
        """Replay with same inputs. Optionally change system prompt,
        tool behavior, or model to test alternatives."""

        config = {**trace}
        if modifications:
            config.update(modifications)

        # Re-run with original tool outputs (deterministic replay)
        # or with live tools (test if fix works)
        return self.agent.run(
            config["user_input"],
            system_prompt=config.get("system_prompt"),
            mock_tools=config.get("tool_results") if not modifications else None
        )
```
Replay debugging lets you answer "would my fix have prevented this bug?" without waiting for the same user input to happen again.
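
The deterministic-replay branch can be made concrete with a mock tool layer that serves recorded outputs in order, so prompt or model changes can be tested against frozen tool data. A self-contained sketch (the trace shape follows TraceRecorder above; names are illustrative):

```python
class RecordedTools:
    """Serve recorded tool outputs in order instead of calling live tools."""

    def __init__(self, recorded_results: list):
        self._results = iter(recorded_results)

    def execute(self, tool: str, args: dict):
        # Ignore tool/args and return the next recorded output —
        # this is what makes the replay deterministic
        return next(self._results)

# Usage: swap the agent's live tool layer for the recorded one
trace = {"user_input": "sales last quarter?",
         "tool_results": [{"total": 1247500}, "$1,247,500"]}
tools = RecordedTools(trace["tool_results"])
print(tools.execute("query_database", {}))  # {'total': 1247500}
```

Running out of recorded results (the agent takes a different path than the original) raises `StopIteration`, which is itself a useful signal that your change altered the agent's behavior.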

## Observability Anti-Patterns

### 1. Logging Everything

Full prompt/response logging for every request will blow up your storage costs and create a PII liability. Log metadata by default, full payloads only for sampled requests or errors.
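
When you do log content, a lightweight redaction pass before the payload reaches the log sink reduces the PII exposure. A minimal sketch (the patterns are illustrative, not a complete PII detector):

```python
import re

# Illustrative patterns only — a real deployment needs a proper PII scanner
_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),
    (re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), "<phone>"),
]

def redact(text: str) -> str:
    """Replace matched PII-like substrings with placeholders."""
    for pattern, placeholder in _PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(redact("Contact jane.doe@example.com or 555-123-4567"))
# Contact <email> or <phone>
```

Run `redact()` on `input_preview` and tool outputs before they hit the structured logger.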

### 2. No Sampling

At scale, trace 100% of errors but sample successful requests. 10% sampling for successful requests gives you enough data without the cost.
```python
import random

# Sampling strategy: always trace errors, expensive, and slow requests;
# sample a fraction of everything else
def should_trace_full(request) -> bool:
    if request.is_error:
        return True  # Always trace errors
    if request.cost_usd > 0.10:
        return True  # Always trace expensive requests
    if request.latency_ms > 10000:
        return True  # Always trace slow requests
    return random.random() < 0.10  # Sample ~10% of successful requests
```
Want to stay current on agent observability tools and practices? [AI Agents Weekly](/newsletter.html) covers production patterns, new tools, and real-world debugging stories 3x/week.



## Conclusion

Observability is the difference between "the agent seems to work" and "the agent provably works." Without traces, you're guessing. Without structured logs, you're grepping. Without metrics, you're reacting instead of preventing.

Start with the basics: structured logging with request IDs and cost tracking. Add tracing when you need to debug specific failures. Build dashboards when you have enough data to establish baselines. The investment pays off the first time you diagnose a production issue in minutes instead of hours.

Your agent's reliability is only as good as your ability to see what it's doing.

---

*Get our free [AI Agent Starter Kit](https://paxrel.com/ai-agent-starter-kit.html) — templates, checklists, and deployment guides for building production AI agents.*
