Atlas Whoff

OpenTelemetry for AI Agents: Stop Guessing What Your Agent Did

AI agents fail in ways that logs don't capture. The agent called the right function, got a valid response, then produced the wrong output. By the time you notice, the trace is gone.

OpenTelemetry fixes this. Here's the full setup for a Claude-based agent.


The Problem With Console.log Debugging

A typical agent debugging session:

  1. User reports wrong output
  2. You add console.log at suspected failure points
  3. Reproduce the failure (if you can)
  4. Find the log line, add more logs around it
  5. Repeat

This works for synchronous code. For agents that run multi-step workflows, call tools in parallel, or execute asynchronously — it breaks down. You can't correlate log lines across steps without request IDs threaded through every call.

OpenTelemetry gives you distributed tracing: every step of agent execution is a span, spans are linked into a trace, and you can visualize the full execution tree.
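The span/trace model is simple enough to sketch without the SDK. Here is a toy tracer (plain TypeScript, no OTEL dependencies, all names made up) showing how `startActiveSpan`-style nesting produces a parent-linked tree; the real SDK does the same bookkeeping across async boundaries via context propagation, which this synchronous sketch omits:

```typescript
// Toy span model: each span records its parent, so nested calls form a tree.
// Illustration only — not the OTEL SDK.
interface ToySpan {
  name: string
  parent?: string
  children: ToySpan[]
}

class ToyTracer {
  private stack: ToySpan[] = []
  roots: ToySpan[] = []

  // Mirrors the shape of tracer.startActiveSpan(name, fn) for sync callbacks
  startActiveSpan<T>(name: string, fn: (span: ToySpan) => T): T {
    const parent = this.stack[this.stack.length - 1]
    const span: ToySpan = { name, parent: parent?.name, children: [] }
    if (parent) parent.children.push(span)
    else this.roots.push(span)
    this.stack.push(span)
    try {
      return fn(span)
    } finally {
      this.stack.pop()
    }
  }
}

const tracer = new ToyTracer()
tracer.startActiveSpan('agent.run', () => {
  tracer.startActiveSpan('agent.llm_call', () => {})
  tracer.startActiveSpan('agent.tool.search', () => {})
})
```

After this runs, `tracer.roots[0]` is the `agent.run` span with two children — the same tree shape you'll see rendered in Jaeger later.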


Setup: Jaeger + OTEL SDK

Run Jaeger locally:

docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

Install OTEL packages:

npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http @opentelemetry/api

Create the tracer setup. It has to load before any application code so the auto-instrumentations can patch modules like `http` before they're imported:

// instrumentation.ts
import { NodeSDK } from '@opentelemetry/sdk-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'
import { Resource } from '@opentelemetry/resources'
import { SEMRESATTRS_SERVICE_NAME } from '@opentelemetry/semantic-conventions'

const sdk = new NodeSDK({
  resource: new Resource({
    [SEMRESATTRS_SERVICE_NAME]: 'claude-agent',
  }),
  traceExporter: new OTLPTraceExporter({
    // This url is the full traces path, so the matching env var is the
    // TRACES-specific one (the generic OTEL_EXPORTER_OTLP_ENDPOINT expects a base URL)
    url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT ?? 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      // Node's global fetch is backed by undici; the fetch instrumentation is browser-only
      '@opentelemetry/instrumentation-undici': { enabled: true },
    }),
  ],
})

sdk.start()
process.on('SIGTERM', () => sdk.shutdown())
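One way to guarantee the load order (file and directory names here are assumptions — adjust to your build setup):

```shell
# Compile, then preload the tracer so it patches http/undici
# before any application module is evaluated.
npx tsc
node --require ./dist/instrumentation.js ./dist/main.js
```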

Instrumenting the Agent

// lib/agent/traced-agent.ts
import Anthropic from '@anthropic-ai/sdk'
import { trace, SpanStatusCode } from '@opentelemetry/api'

const tracer = trace.getTracer('claude-agent', '1.0.0')
const client = new Anthropic()

interface Tool {
  name: string
  description: string
  input_schema: object
  execute: (input: unknown) => Promise<unknown>
}

export async function runAgent(userMessage: string, tools: Tool[], sessionId: string) {
  return tracer.startActiveSpan('agent.run', async (rootSpan) => {
    rootSpan.setAttributes({
      'agent.session_id': sessionId,
      'agent.user_message': userMessage.slice(0, 200),
    })

    try {
      const messages: Anthropic.MessageParam[] = [{ role: 'user', content: userMessage }]
      let iteration = 0

      while (iteration < 10) {
        const response = await tracer.startActiveSpan('agent.llm_call', async (llmSpan) => {
          llmSpan.setAttributes({
            'llm.model': 'claude-sonnet-4-6',
            'llm.iteration': iteration,
            'llm.message_count': messages.length,
          })

          try {
            const result = await client.messages.create({
              model: 'claude-sonnet-4-6',
              max_tokens: 4096,
              tools: tools.map(t => ({
                name: t.name,
                description: t.description,
                input_schema: t.input_schema as Anthropic.Tool['input_schema'],
              })),
              messages,
            })

            llmSpan.setAttributes({
              'llm.input_tokens': result.usage.input_tokens,
              'llm.output_tokens': result.usage.output_tokens,
              'llm.stop_reason': result.stop_reason ?? '',
            })
            return result
          } catch (err) {
            // Without this, a failed API call leaks an unended span
            llmSpan.setStatus({ code: SpanStatusCode.ERROR, message: String(err) })
            llmSpan.recordException(err as Error)
            throw err
          } finally {
            llmSpan.end()
          }
        })

        if (response.stop_reason === 'end_turn') {
          const output = response.content
            .filter(b => b.type === 'text')
            .map(b => (b as Anthropic.TextBlock).text)
            .join('')
          rootSpan.setAttribute('agent.output', output.slice(0, 500))
          rootSpan.setStatus({ code: SpanStatusCode.OK })
          rootSpan.end()
          return output
        }

        const toolUses = response.content.filter(b => b.type === 'tool_use')
        messages.push({ role: 'assistant', content: response.content })

        const toolResults = await Promise.all(
          toolUses.map(async (block) => {
            const toolBlock = block as Anthropic.ToolUseBlock
            const tool = tools.find(t => t.name === toolBlock.name)

            return tracer.startActiveSpan(`agent.tool.${toolBlock.name}`, async (toolSpan) => {
              toolSpan.setAttributes({
                'tool.name': toolBlock.name,
                'tool.input': JSON.stringify(toolBlock.input).slice(0, 500),
              })

              try {
                // Guard instead of a non-null assertion: an unknown tool name
                // becomes a visible error span, not an opaque TypeError
                if (!tool) throw new Error(`Unknown tool: ${toolBlock.name}`)
                const result = await tool.execute(toolBlock.input)
                toolSpan.setStatus({ code: SpanStatusCode.OK })
                toolSpan.end()
                return {
                  type: 'tool_result' as const,
                  tool_use_id: toolBlock.id,
                  content: JSON.stringify(result),
                }
              } catch (err) {
                toolSpan.setStatus({ code: SpanStatusCode.ERROR, message: String(err) })
                toolSpan.recordException(err as Error)
                toolSpan.end()
                return {
                  type: 'tool_result' as const,
                  tool_use_id: toolBlock.id,
                  content: `Error: ${String(err)}`,
                  is_error: true,
                }
              }
            })
          })
        )

        messages.push({ role: 'user', content: toolResults })
        iteration++
      }

      throw new Error('Max iterations reached')
    } catch (err) {
      rootSpan.setStatus({ code: SpanStatusCode.ERROR, message: String(err) })
      rootSpan.recordException(err as Error)
      rootSpan.end()
      throw err
    }
  })
}
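Anything that satisfies the `Tool` interface plugs into this loop. A minimal sketch (the customer data is a made-up stand-in for a real lookup):

```typescript
// Minimal Tool implementation matching the interface used by runAgent.
interface Tool {
  name: string
  description: string
  input_schema: object
  execute: (input: unknown) => Promise<unknown>
}

// Stand-in data store — replace with a real query in practice
const customers: Record<string, { name: string; plan: string }> = {
  cust_abc: { name: 'Acme Co', plan: 'pro' },
}

const getCustomer: Tool = {
  name: 'get_customer',
  description: 'Look up a customer record by ID',
  input_schema: {
    type: 'object',
    properties: { customer_id: { type: 'string' } },
    required: ['customer_id'],
  },
  async execute(input) {
    const { customer_id } = input as { customer_id: string }
    const record = customers[customer_id]
    if (!record) throw new Error(`No customer: ${customer_id}`)
    return record
  },
}
```

Calling `runAgent(message, [getCustomer], sessionId)` would then emit an `agent.tool.get_customer` span for each invocation, with the thrown error surfacing as a red error span when the ID is unknown.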

What You See in Jaeger

After running a few agent calls, open http://localhost:16686. Select the claude-agent service and pick any trace. You'll see:

agent.run (340ms)
├── agent.llm_call [iteration=0] (210ms)
│   input_tokens=847, output_tokens=312
├── agent.tool.search_documents (45ms)
│   query="invoice #1234"
├── agent.tool.get_customer (23ms)
│   customer_id="cust_abc"
├── agent.llm_call [iteration=1] (180ms)
│   input_tokens=1204, output_tokens=89
└── [end_turn]

When a tool fails, the span turns red. When the LLM loops unexpectedly, you see the iteration count climb. Token costs per session are visible without any extra instrumentation.


Production Considerations

  1. Sample aggressively — trace 10% of traffic, 100% of errors
  2. Redact PII — the example above records raw user messages and tool inputs, which is fine locally; in production, hash or drop user content before it reaches span attributes
  3. Set span limits — truncate large attributes (the 500-char slices above) to prevent attribute size errors at the exporter
  4. Use baggage for session ID — propagate session_id through async boundaries with context.with()
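For item 2, one approach is to hash identifying values before they become attributes. A sketch (the helper name is mine, using Node's built-in crypto):

```typescript
import { createHash } from 'node:crypto'

// Replace raw user content with a stable, non-reversible token before
// attaching it to a span. Same input -> same token, so traces stay
// correlatable across sessions without storing PII.
function redact(value: string): string {
  return 'sha256:' + createHash('sha256').update(value).digest('hex').slice(0, 16)
}

// e.g. rootSpan.setAttribute('agent.user_id', redact(email))
```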

Full Observability Stack

OpenTelemetry traces + structured logs + Stripe event webhooks give you the complete picture of every agent session. This pattern is built into the Workflow Automator MCP — it adds tracing to any Claude agent running in the IDE.

  • Workflow Automator MCP — $15/mo — pre-built OTEL instrumentation for Claude agent loops
  • AI SaaS Starter Kit — $99 one-time — full production agent stack with tracing, auth, and billing

whoffagents.com
