DEV Community

Atlas Whoff

OpenTelemetry for AI Agents: Tracing Claude API Calls in Production

You're running Claude in production. Requests are slow, costs are spiking, some responses are garbage — and you have no idea why.

That's the problem OpenTelemetry solves. And it works with AI agents just as well as with HTTP services.

Here's how to add proper distributed tracing to a Claude-powered agent.

Why Tracing Matters for AI Agents

Traditional API monitoring breaks down for LLM applications because:

  1. Latency is multi-component: TTFT (time-to-first-token), generation speed, and post-processing are all distinct phases
  2. Cost attribution is opaque: Which prompt consumed 80% of your tokens last night?
  3. Errors are soft: A response that's factually wrong isn't a 500 — tracing is the only way to catch it

OpenTelemetry gives you spans, attributes, and traces that survive across async boundaries — exactly what you need for agents that make multiple LLM calls per request.

The Setup

Install the packages:

npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
npm install @opentelemetry/exporter-trace-otlp-http

Initialize the tracer (do this before anything else):

// tracing.ts — import this first
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
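One thing that init snippet omits: flushing on shutdown. The OTLP exporter batches spans, so a pod that gets SIGTERMed loses the last few seconds of traces unless you drain it. A sketch, assuming the `sdk` instance from above:

```typescript
// tracing.ts (continued) — assumes the `sdk` NodeSDK instance defined above.
// Flush buffered spans on SIGTERM so the tail of every deploy isn't lost.
process.on('SIGTERM', () => {
  sdk
    .shutdown()
    .then(() => console.log('OTel SDK shut down cleanly'))
    .catch((err) => console.error('Error shutting down OTel SDK', err))
    .finally(() => process.exit(0));
});
```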

Wrapping Claude Calls in Spans

The Anthropic SDK doesn't auto-instrument. You wrap it manually:

import { trace, SpanStatusCode, context } from '@opentelemetry/api';
import Anthropic from '@anthropic-ai/sdk';

const tracer = trace.getTracer('claude-agent', '1.0.0');
const client = new Anthropic();

export async function tracedCompletion(
  prompt: string,
  systemPrompt: string,
  model = 'claude-sonnet-4-6'
) {
  return tracer.startActiveSpan('claude.completion', async (span) => {
    span.setAttributes({
      'llm.model': model,
      'llm.prompt_length': prompt.length,
      'llm.system_prompt_length': systemPrompt.length,
      'llm.vendor': 'anthropic',
    });

    try {
      const startTime = Date.now();

      const response = await client.messages.create({
        model,
        max_tokens: 4096,
        system: systemPrompt,
        messages: [{ role: 'user', content: prompt }],
      });

      const latency = Date.now() - startTime;

      span.setAttributes({
        'llm.input_tokens': response.usage.input_tokens,
        'llm.output_tokens': response.usage.output_tokens,
        'llm.total_tokens': response.usage.input_tokens + response.usage.output_tokens,
        'llm.latency_ms': latency,
        'llm.stop_reason': response.stop_reason,
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return response;
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error instanceof Error ? error.message : 'Unknown error',
      });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}

Tracing Multi-Step Agent Flows

For agents with tool use or multi-turn conversations, nest your spans:

export async function runAgentTask(task: string) {
  return tracer.startActiveSpan('agent.task', async (taskSpan) => {
    taskSpan.setAttribute('agent.task', task);

    let step = 0;
    let continueLoop = true;

    while (continueLoop) {
      await tracer.startActiveSpan(`agent.step.${step}`, async (stepSpan) => {
        stepSpan.setAttribute('agent.step_number', step);

        try {
          const response = await tracedCompletion(
            buildPrompt(task, step),
            SYSTEM_PROMPT
          );

          const toolCalls = extractToolCalls(response);
          stepSpan.setAttribute('agent.tool_calls', toolCalls.length);

          for (const tool of toolCalls) {
            await tracer.startActiveSpan(`tool.${tool.name}`, async (toolSpan) => {
              toolSpan.setAttributes({
                'tool.name': tool.name,
                'tool.input_size': JSON.stringify(tool.input).length,
              });

              try {
                const result = await executeTool(tool);
                toolSpan.setAttribute('tool.success', result.success);
              } finally {
                // End the span even if the tool throws — otherwise it leaks
                toolSpan.end();
              }
            });
          }

          continueLoop = toolCalls.length > 0 && step < 10;
          step++;
        } finally {
          stepSpan.end();
        }
      });
    }

    taskSpan.setAttribute('agent.total_steps', step);
    taskSpan.end();
  });
}

Sending to Langfuse (Recommended for LLM Tracing)

Langfuse is purpose-built for LLM observability and has native OTLP ingestion:

OTEL_EXPORTER_OTLP_ENDPOINT=https://cloud.langfuse.com/api/public/otel
OTEL_EXPORTER_OTLP_HEADERS=Authorization=Basic%20<base64(pk:sk)>
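The header value is just your public and secret keys joined with a colon and base64-encoded (the space after `Basic` gets percent-encoded in the env var). A quick Node sketch — the `pk-lf-...` / `sk-lf-...` values here are placeholders, not real keys:

```typescript
// Build the OTLP Authorization header value from Langfuse project keys.
// The placeholder keys are illustrative — use your own project's keys.
function basicAuthHeader(publicKey: string, secretKey: string): string {
  const encoded = Buffer.from(`${publicKey}:${secretKey}`).toString('base64');
  return `Basic ${encoded}`;
}

console.log(basicAuthHeader('pk-lf-123', 'sk-lf-456'));
```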

This gives you:

  • Per-model cost breakdowns
  • Prompt/response logging with diff views
  • Latency percentiles per agent step
  • Score tracking (you can attach evals to traces)

Grafana Tempo and Honeycomb also work if you're already using those.

The Attributes That Actually Matter

After running this in production, these are the attributes worth instrumenting:

| Attribute | Why |
| --- | --- |
| `llm.input_tokens` | Cost is dominated by input — track it per-call |
| `llm.cache_read_input_tokens` | If you're using prompt caching, this tells you your hit rate |
| `llm.latency_ms` | P99 matters more than average for user-facing flows |
| `agent.task_type` | Lets you slice costs and latency by what the agent is doing |
| `user.id` | Attribution for per-user cost limits or billing |

Prompt Caching and Tracing

If you're using Anthropic's prompt caching (you should be — it cuts costs 80%+), the API returns cache hit data in the usage object:

span.setAttributes({
  'llm.input_tokens': response.usage.input_tokens,
  'llm.output_tokens': response.usage.output_tokens,
  'llm.cache_creation_input_tokens': response.usage.cache_creation_input_tokens ?? 0,
  'llm.cache_read_input_tokens': response.usage.cache_read_input_tokens ?? 0,
});

Track your cache hit ratio over time. If it drops below 70%, your session architecture is broken.
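The ratio itself is cache-read tokens over total input tokens. One wrinkle: `input_tokens` in the API response counts only *uncached* input, so the denominator has to sum all three fields. A minimal helper (field names match the usage object above):

```typescript
// Shape of the cache-related fields on the Anthropic usage object.
interface CacheUsage {
  input_tokens: number;
  cache_creation_input_tokens?: number;
  cache_read_input_tokens?: number;
}

// Fraction of input tokens served from cache. `input_tokens` excludes
// cached tokens, so total input = uncached + cache writes + cache reads.
function cacheHitRatio(usage: CacheUsage): number {
  const cached = usage.cache_read_input_tokens ?? 0;
  const total =
    usage.input_tokens + (usage.cache_creation_input_tokens ?? 0) + cached;
  return total === 0 ? 0 : cached / total;
}
```

Emit this as a span attribute (e.g. `llm.cache_hit_ratio`) and graph the average per deploy.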

Cost Calculation in Spans

Add a derived cost attribute so you can alert on expensive sessions:

const COSTS = {
  'claude-sonnet-4-6': { input: 3.0, output: 15.0, cache_read: 0.30 },
  'claude-opus-4-7': { input: 15.0, output: 75.0, cache_read: 1.50 },
};

function estimateCost(model: string, usage: Anthropic.Usage): number {
  const rates = COSTS[model as keyof typeof COSTS] ?? COSTS['claude-sonnet-4-6'];
  const inputCost = (usage.input_tokens / 1_000_000) * rates.input;
  const outputCost = (usage.output_tokens / 1_000_000) * rates.output;
  const cacheReadCost = ((usage.cache_read_input_tokens ?? 0) / 1_000_000) * rates.cache_read;
  return inputCost + outputCost + cacheReadCost;
}

// Then in your span:
span.setAttribute('llm.estimated_cost_usd', estimateCost(model, response.usage));

Now you can alert in Grafana when a single agent task costs more than $0.50.
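Alerting backends differ, but the same check is cheap to run in-process too. A sketch — `TaskCostTracker` is a hypothetical helper, not part of any SDK; the $0.50 default matches the alert threshold above:

```typescript
// Hypothetical in-process budget guard: accumulate per-call estimated
// costs for one agent task and flag when the task exceeds its budget.
class TaskCostTracker {
  private total = 0;

  constructor(private readonly limitUsd = 0.5) {}

  // Call once per LLM call with the estimated cost of that call.
  record(costUsd: number): void {
    this.total += costUsd;
  }

  get totalUsd(): number {
    return this.total;
  }

  get overBudget(): boolean {
    return this.total > this.limitUsd;
  }
}
```

You could check `overBudget` after each agent step and bail out (or downgrade to a cheaper model) instead of waiting for the dashboard alert.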

What This Unlocks

Once you have traces:

  1. Performance regression detection — new prompt version slowed P95 by 40%? You'll see it
  2. Cost attribution — "the research step costs 3x the writing step — should we use a cheaper model there?"
  3. Error correlation — "responses with empty tool outputs always follow cache miss patterns"
  4. User-level billing — if you're selling AI-powered features, you can charge per-token with accurate data

Building AI agents for production? The AI SaaS Starter Kit includes prompt caching, token tracking, and structured agent logging patterns baked in — skip the 2-week instrumentation ramp.
