AI agent monitoring — also called LLM observability — is the practice of collecting, analysing, and acting on telemetry data generated by LLM calls and the autonomous agents built on top of them. Think of it as traditional APM, but purpose-built for AI workloads.
A modern AI agent is not a static API call. It's a dynamic, multi-step reasoning system that may:
- Plan and decompose subtasks autonomously
- Call external tools (web search, code execution, APIs)
- Retrieve documents via Retrieval-Augmented Generation (RAG)
- Spawn sub-agents for parallel task execution
- Loop and self-correct until a goal is satisfied
Every one of those steps is a potential point of failure, latency spike, or cost explosion. Just as DevOps engineers would never deploy a microservice without metrics, traces, and logs, MLOps and AI engineers need the same rigour for LLM-powered systems.
Why It Matters in Production
The jump from a prototype that "works on my machine" to a reliable production AI agent is enormous. Here's what routinely breaks without proper monitoring:
🔴 Runaway Token Costs
An unchecked agentic loop can consume millions of tokens before you notice. A single misbehaving agent session — stuck in a reasoning loop — can exhaust your entire daily token budget in minutes. Token-level telemetry gives you per-request cost visibility and the ability to set budget-based circuit breakers.
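A budget-based circuit breaker can be sketched in a few lines of plain Python (the class and method names here are hypothetical, not part of any SDK): track cumulative token spend per session and refuse further LLM calls once the budget is exhausted.

```python
class TokenBudgetBreaker:
    """Trips once a session's cumulative token spend exceeds its budget."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.spent = 0

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        """Record the usage reported by one completed LLM call."""
        self.spent += prompt_tokens + completion_tokens

    def allow_next_call(self) -> bool:
        """Check before every LLM call in the agent loop."""
        return self.spent < self.max_tokens


breaker = TokenBudgetBreaker(max_tokens=50_000)
breaker.record(prompt_tokens=40_000, completion_tokens=12_000)
# Budget exhausted: stop the loop, alert, or degrade gracefully
print(breaker.allow_next_call())
```

In production you would feed `record()` from the same token-usage telemetry your spans already carry, so the breaker and the dashboard agree on spend.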
🔴 Silent Latency Regressions
A new model version, a longer system prompt, or a change in retrieval strategy can quietly double your agent's response time. Without distributed latency traces, you discover this from frustrated users — not from a proactive alert.
🔴 Rate-Limit Cascade Failures
LLM API rate limits hit unpredictably under production load. A single rate-limit event can trigger aggressive retries across multiple parallel agent sessions, cascading into a full outage.
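One standard mitigation — sketched here in plain Python, not tied to any particular client library — is capped exponential backoff with full jitter, which spreads retries out instead of letting parallel sessions hammer the API in lockstep after a 429.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Delay before retry `attempt` (0-indexed).

    "Full jitter": a uniform random delay in [0, min(cap, base * 2**attempt)],
    so many rate-limited agent sessions do not retry at the same instant
    and cascade the failure.
    """
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Pair this with a global concurrency limit so a single rate-limit event in one session cannot multiply into synchronised retries across all of them.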
🔴 Degraded Output Quality
Hallucinations, refusals, and incoherent responses increase as context windows grow or prompts drift. Span-level metadata correlating prompt structure with output quality lets you catch these regressions systematically.
🔴 Multi-Step Reasoning Failures
In agentic pipelines, a failure deep in a reasoning chain is nearly impossible to attribute without distributed tracing. Did the agent fail because the web search tool returned bad data, because the LLM misinterpreted the tool output, or because the context window overflowed? Traces answer this.
🔴 Compliance & Audit Requirements
Enterprise deployments increasingly require complete audit logs of what the agent decided, why, what data it accessed, and what actions it took.
The Four Pillars of LLM Observability
1. Distributed Tracing
Every agent action — from receiving a user prompt to returning a final answer — is instrumented as a trace composed of spans. Each span captures a discrete unit of work: an LLM call, a tool invocation, a database retrieval, or a sub-agent call.
Tracing answers: "What happened, in what order, and how long did each step take?"
2. Metrics
Aggregated numerical data over time — token counts, latency percentiles (p50/p95/p99), error rates, throughput, and cost per request. Metrics are cheap to store and fast to query, making them ideal for real-time dashboards and threshold-based alerting.
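For illustration, latency percentiles can be computed from raw samples with a simple nearest-rank method (a hand-rolled sketch; in practice your metrics backend estimates these from histograms):

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample covering p% of the data."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


latencies_ms = [120, 180, 200, 250, 300, 450, 800, 950, 1200, 4000]
p50 = percentile(latencies_ms, 50)  # → 300
p95 = percentile(latencies_ms, 95)  # → 4000
```

Note how a single slow outlier dominates p95 while leaving p50 untouched — which is exactly why dashboards track both.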
3. Structured Logs
Rich, machine-readable event records attached to each agent action — prompt text, model parameters, completion content, tool call arguments, and exception stack traces. Unlike metrics, logs retain the full context needed for post-incident debugging.
4. Evaluations (Evals)
A layer unique to AI observability: automated or human-assisted scoring of agent outputs for correctness, safety, relevance, and faithfulness. Evals close the loop between operational telemetry and output quality.
💡 Pro Tip: For most teams starting out, distributed tracing delivers the highest immediate value. It reveals exactly where latency and failures originate across multi-step agent pipelines — something neither metrics nor logs alone can show.
Key Metrics to Track
| Metric | What It Tells You | Typical Alert Threshold |
|---|---|---|
| `llm.usage.prompt_tokens` | Input token consumption per request | > 80% of model context window |
| `llm.usage.completion_tokens` | Output token consumption per request | Sudden spike > 2× baseline |
| `llm.usage.total_tokens` | Combined cost proxy per call | Daily cost budget exceeded |
| `duration` (end-to-end) | User-perceived latency | p95 > 10s for interactive agents |
| `error.rate` | % of requests that fail or time out | > 1% over a 5-minute window |
| `tool_call.count` | Tool invocations per session | > 20 per session (loop indicator) |
| `agent.steps` | Depth of reasoning chain | > configured max steps |
| `llm.request.model` | Which model was invoked | Unexpected model fallback detected |
OpenTelemetry: The Standard for AI Observability
OpenTelemetry (OTel) is the open-source observability framework that has become the industry standard for instrumenting distributed systems. For AI agents, it provides a vendor-neutral way to emit traces, metrics, and logs from any LLM call to any compatible backend — OpenObserve, Prometheus, Jaeger, Grafana, Datadog, and more.
The ecosystem includes dedicated auto-instrumentation libraries for all major LLM providers:
- `opentelemetry-instrumentation-openai`
- `opentelemetry-instrumentation-anthropic`
- `opentelemetry-instrumentation-langchain`
- `opentelemetry-instrumentation-llama-index`
- `opentelemetry-instrumentation-cohere`
These libraries wrap LLM client calls and automatically attach semantic attributes — token counts, model name, temperature, max tokens, error details — as span attributes, with no manual instrumentation required.
How OTel Spans Map to Agent Steps
In an agentic pipeline, the OTel trace tree mirrors the agent's reasoning hierarchy:
```
[root trace] user-request
 ├── [span] planner-llm-call
 ├── [span] tool: web_search
 ├── [span] tool: code_executor
 └── [span] sub-agent: summariser-llm-call
```
This lets you instantly see which step was the bottleneck or failure point in any given agent run.
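As a toy illustration of that parent-child structure (plain Python, not the OTel API), a trace is just a tree of named spans that can be rendered the same way:

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    children: list["Span"] = field(default_factory=list)


def render(span: Span, depth: int = 0) -> list[str]:
    """One line per span, indented by its depth in the trace tree."""
    lines = [("    " * depth) + ("└── " if depth else "") + span.name]
    for child in span.children:
        lines += render(child, depth + 1)
    return lines


trace = Span("user-request", [
    Span("planner-llm-call"),
    Span("tool: web_search"),
    Span("tool: code_executor"),
    Span("sub-agent: summariser-llm-call"),
])
print("\n".join(render(trace)))
```

Real OTel spans additionally carry timestamps and attributes on every node, which is what turns this tree into a latency and failure map.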
Setting Up LLM Monitoring with OpenObserve
OpenObserve is an open-source observability platform with a native OTLP endpoint — purpose-built for high-volume telemetry at significantly lower cost and resource footprint than alternatives like the Elastic Stack.
Prerequisites
- Python 3.8+
- The `uv` package manager (or `pip`)
- An OpenObserve account — cloud or self-hosted
- Your OpenObserve organisation ID and Base64-encoded auth token
- API key for your LLM provider (OpenAI, Anthropic, etc.)
Step 1: Configure Your Environment
Create a .env file in your project root:
```bash
# OpenObserve instance URL
OPENOBSERVE_URL=https://api.openobserve.ai/

# Your OpenObserve organisation slug or ID
OPENOBSERVE_ORG=your_org_id

# Basic auth token — Base64-encoded "email:password"
OPENOBSERVE_AUTH_TOKEN="Basic <your_base64_token>"

# Enable or disable tracing (default: true)
OPENOBSERVE_ENABLED=true

# LLM provider keys
OPENAI_API_KEY="your-openai-key"
ANTHROPIC_API_KEY="your-anthropic-key"
```
Step 2: Install Dependencies
```bash
# Using uv (recommended)
uv pip install openobserve-telemetry-sdk \
    opentelemetry-instrumentation-openai \
    opentelemetry-instrumentation-anthropic \
    python-dotenv

# Or with pip
pip install openobserve-telemetry-sdk opentelemetry-instrumentation-openai python-dotenv
```
Step 3: Instrument Your Application
OpenAI
Add two lines before any LLM calls are made:
```python
from opentelemetry.instrumentation.openai import OpenAIInstrumentor
from openobserve import openobserve_init

# Instrument OpenAI and initialise the OpenObserve exporter
OpenAIInstrumentor().instrument()
openobserve_init()

from openai import OpenAI

client = OpenAI()

# Use the client exactly as normal — traces are captured automatically
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarise this document..."}],
)
print(response.choices[0].message.content)
```
Anthropic (Claude)
```python
from opentelemetry.instrumentation.anthropic import AnthropicInstrumentor
from openobserve import openobserve_init

AnthropicInstrumentor().instrument()
openobserve_init()

from anthropic import Anthropic

client = Anthropic()

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Analyse this data..."}],
)
print(response.content[0].text)
```
Every call is now captured as a trace span and exported to OpenObserve automatically.
Note: The `openobserve-telemetry-sdk` is an optional thin wrapper around the standard OTel Python SDK. If you already use OpenTelemetry, you can send telemetry directly to OpenObserve's OTLP endpoint without it.
Step 4: View Traces in OpenObserve
- Log in to your OpenObserve instance
- Navigate to Traces in the left sidebar
- Filter by service name, model name, or time range
- Click any span to inspect token counts, latency, parameters, and full request metadata
What Gets Captured in Each Trace Span
The OTel instrumentation libraries automatically attach the following attributes — no manual coding needed:
| OTel Attribute | Description | Example Value |
|---|---|---|
| `llm.request.model` | Model identifier | `gpt-4o` |
| `llm.usage.prompt_tokens` | Tokens in the prompt | 1,247 |
| `llm.usage.completion_tokens` | Tokens in the response | 312 |
| `llm.usage.total_tokens` | Combined token usage | 1,559 |
| `llm.request.temperature` | Sampling temperature | 0.7 |
| `llm.request.max_tokens` | Max response length | 2048 |
| `duration` | End-to-end request latency | 2,340 ms |
| `error` | Exception details on failure | `RateLimitError: 429` |
Adding Custom Span Attributes
```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("agent-task") as span:
    span.set_attribute("user.id", "usr_abc123")
    span.set_attribute("session.id", "sess_xyz789")
    span.set_attribute("agent.name", "research-agent")
    span.set_attribute("task.type", "document-summarisation")
    span.set_attribute("prompt.version", "v2.3.1")

    # Your LLM call here — child spans are created automatically
    response = client.chat.completions.create(...)
```
Unique Challenges in Agentic Systems
Non-Determinism
Unlike traditional software, the same input to an agent may produce different execution paths on different runs. Your monitoring must capture the full trace of each individual run, not just aggregated statistics.
Long-Horizon Context Windows
As agents maintain conversation history across multiple turns, context windows grow substantially. A single agent session can consume tens of thousands of tokens. Per-turn token tracking is essential.
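A minimal sketch of per-turn accounting (plain Python, hypothetical names) that both accumulates session spend and flags any turn whose prompt is approaching the context-window limit:

```python
class SessionTokenTracker:
    """Accumulates token usage per conversation turn for one agent session."""

    def __init__(self, context_window: int = 128_000, warn_ratio: float = 0.8):
        self.context_window = context_window
        self.warn_ratio = warn_ratio
        self.per_turn: list[int] = []

    def record_turn(self, prompt_tokens: int, completion_tokens: int) -> bool:
        """Record one turn; return True if the prompt is nearing the window."""
        self.per_turn.append(prompt_tokens + completion_tokens)
        return prompt_tokens >= self.warn_ratio * self.context_window

    @property
    def session_total(self) -> int:
        return sum(self.per_turn)
```

The warning signal is the natural trigger for context-compression strategies (summarising or truncating history) before the model starts silently dropping it.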
Nested and Parallel Tool Calls
Modern agents call multiple tools — often in parallel. Distributed tracing with proper parent-child span relationships is the only reliable way to reconstruct the true execution timeline.
Infinite Loop Detection
Agents can get stuck in reasoning loops, repeatedly calling the same tool without making progress. Monitor agent.steps and tool_call.count per session, combined with a max-step circuit breaker.
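That guard can be sketched in plain Python (names hypothetical): count total steps and repeated identical tool calls, and stop the loop when either limit is hit.

```python
class LoopGuard:
    """Circuit breaker for agent loops: caps total steps and repeated tool calls."""

    def __init__(self, max_steps: int = 25, max_repeats: int = 3):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.call_counts: dict[tuple[str, str], int] = {}

    def check(self, tool_name: str, arguments: str) -> bool:
        """Record one step; return False when the agent should be stopped."""
        self.steps += 1
        key = (tool_name, arguments)
        self.call_counts[key] = self.call_counts.get(key, 0) + 1
        if self.steps > self.max_steps:
            return False  # reasoning chain too deep
        if self.call_counts[key] > self.max_repeats:
            return False  # same tool called with identical arguments too often
        return True
```

Emitting `self.steps` and the repeat counts as span attributes turns the same data into the `agent.steps` and `tool_call.count` metrics discussed earlier.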
Multi-Agent Coordination
Orchestrator-worker architectures require trace context propagation across agent boundaries. OpenTelemetry's W3C TraceContext standard enables this:
```python
from opentelemetry.propagate import inject, extract
import requests

# Orchestrator: inject trace context into outgoing request headers
headers = {}
inject(headers)  # adds traceparent, tracestate headers
response = requests.post(
    "http://worker-agent/execute",
    json={"task": task_payload},
    headers=headers,
)

# Worker agent: extract and continue the trace
context = extract(incoming_request.headers)
with tracer.start_as_current_span("worker-task", context=context):
    # Appears as a child span in the orchestrator's trace
    ...
```
⚠️ Critical: Always propagate the W3C `traceparent` header when your orchestrator calls a worker agent. Without it, each agent's activity appears as a disconnected root trace — making end-to-end debugging nearly impossible.
Best Practices for AI Agent Monitoring
✅ Instrument Early, Not After the Fact
Add observability during development, not after incidents. Retrofitting into a complex agentic system leaves blind spots in the most critical execution paths.
✅ Separate Evaluation Metrics from Operational Metrics
Don't conflate system health (latency, error rate, tokens) with output quality (correctness, relevance, safety). Keep them in separate pipelines with separate alert policies.
✅ Sample Intelligently, Not Uniformly
Use head-based sampling for normal traffic (e.g., 10%), but configure tail-based sampling to capture 100% of failed or slow requests. Full fidelity where it matters most, without prohibitive storage costs.
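The tail-side policy amounts to a keep/drop decision taken after each trace completes — sketched here in plain Python for clarity, though in practice this usually runs in the OpenTelemetry Collector rather than in application code:

```python
import random


def keep_trace(has_error: bool, duration_ms: float,
               slow_threshold_ms: float = 5_000,
               baseline_rate: float = 0.10) -> bool:
    """Tail-based sampling decision: always keep failed or slow traces,
    keep a random fraction of everything else."""
    if has_error or duration_ms >= slow_threshold_ms:
        return True
    return random.random() < baseline_rate
```

The thresholds here (5s, 10%) are illustrative defaults; tune them against your own latency SLOs and storage budget.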
✅ Mask Sensitive Data Before Export
```python
from opentelemetry.sdk.trace import ReadableSpan, SpanProcessor


class SensitiveDataRedactor(SpanProcessor):
    SENSITIVE_ATTRS = ("llm.prompts", "llm.completions", "user.email")

    def on_end(self, span: ReadableSpan) -> None:
        # An ended span's attributes are read-only via the public API, so
        # replace the underlying mapping before the span reaches the exporter.
        # This relies on a private field — verify against your OTel SDK version.
        if span._attributes:
            span._attributes = {
                k: ("[REDACTED]" if k in self.SENSITIVE_ATTRS else v)
                for k, v in span._attributes.items()
            }
```
✅ Version Your Prompts
Treat prompt templates as software artefacts with version identifiers. Attach prompt.version: v2.3.1 as a span attribute to compare performance across prompt versions — just like canary deployments.
✅ Tag Every Trace with Business Context
Add user.id, session.id, agent.name, task.type, and feature.flag to every trace. These transform your observability data from an engineering artefact into a product intelligence asset.
✅ Build a Feedback Loop from Evals to Prompts
Connect your evaluation pipeline back to your prompt management system. When evaluations detect a quality regression, it should automatically trigger a prompt review workflow — the AI equivalent of failing a CI/CD pipeline on test failures.
Conclusion
As autonomous AI agents take on consequential tasks — writing and executing code, managing business workflows, interacting with customers at scale — the organisations that invest in proper observability will have a decisive operational advantage: faster debugging cycles, lower costs, better output quality, and the confidence to scale reliably.
OpenTelemetry + OpenObserve gives you a vendor-neutral, open-source foundation that scales from a solo developer's project to an enterprise deployment, without lock-in or prohibitive cost at scale.
You cannot improve what you cannot measure. For AI agents, observability is the measurement layer that makes continuous improvement possible.
Further Reading
- OpenObserve LLM Observability Documentation
- OpenObserve Python SDK
- OpenTelemetry Semantic Conventions for LLMs
- OpenTelemetry Python Auto-Instrumentation
Originally published on the OpenObserve blog by Simran Kumari.