Multi-agent workflows are incredible until they fail in production. When a planning agent delegates a task to a research agent, which then hits a rate limit, silently retries five times, and finally returns a hallucinated JSON object, debugging via console.log is impossible.
You don't need a shiny new "AI Observability" platform to fix this. You need distributed tracing.
By treating your agents like microservices and standardizing their outputs into an AgentEvent schema, you can pipe their execution states directly into standard OpenTelemetry (OTel). However, naive implementations often introduce massive security vulnerabilities (like logging raw PII) and application-crashing bugs (like circular JSON parsing).
Here is the audited, production-hardened pattern for instrumenting an agent swarm so you can actually see what your LLMs are doing without compromising your system.
## The Scenario: The Customer Research Swarm
Imagine a small B2B SaaS feature: a user enters a company domain, and a "Customer Research Swarm" generates a briefing.
This involves:

- **Planner Agent:** breaks the goal into steps.
- **Scraper Agent:** uses a headless browser tool to read the company website.
- **Summarizer Agent:** compiles the final report using user data.
If this takes 45 seconds and costs $0.12 in tokens, you need to know exactly where that time and money went.
## Why This Matters (The Audit Perspective)
If you simply dump the llm_response into your telemetry provider (Datadog, New Relic, etc.), you are creating a compliance nightmare. Prompts and tool arguments frequently contain user emails, internal database schemas, or API keys.
Furthermore, LLM tool-call arguments are deeply nested objects. A naive JSON.stringify(args) in your logging middleware will eventually hit a circular reference, throw a TypeError, and crash your Node.js process mid-execution. Your observability layer must be hardened to fail safely.
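To see why, here is a minimal reproduction of the crash; the tool-call shape is invented purely for illustration:

```typescript
// A tool-call args object that references itself, as nested
// agent/tool state often does (e.g., retry wrappers keeping a
// pointer back to the original call).
const args: any = { toolName: 'headless_browser', url: 'https://example.com' };
args.parentCall = { retries: 5, originalArgs: args }; // cycle

try {
  JSON.stringify(args); // throws on the circular reference
} catch (err) {
  console.log((err as Error).name); // TypeError
}
```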
## How It Works: The Standardized Agent Event
LLMs output unstructured text. OTel requires structured spans. The bridge between them is a strict event schema. We define five core states for any agentic workflow: `plan`, `model_call`, `tool_call`, `guardrail_hit`, and `error`.
Instead of raw logging, your orchestrator emits these standardized objects, which are then passed through a sanitization layer before being bound to an active OTel trace.
## The Code: Schema and Audited OTel Integration
Here is how you define this contract in TypeScript and translate it into safe OpenTelemetry spans.
- **The Event Schema.** Define the strict types for the events your agent runner will emit.

```typescript
// src/types/telemetry.ts

export type AgentEventType =
  | 'plan'          // Agent deciding what to do
  | 'model_call'    // Raw request to Claude/Gemini/OpenAI
  | 'tool_call'     // Agent invoking an external function
  | 'guardrail_hit' // A security or validation fence triggered
  | 'error';

export interface AgentEvent {
  eventId: string;
  traceId: string;              // Ties the entire user request together
  agentName: string;            // e.g., "ScraperAgent"
  type: AgentEventType;
  timestamp: number;
  payload: Record<string, any>; // The prompt, tool args, or error details
  metrics?: {
    promptTokens?: number;
    completionTokens?: number;
    latencyMs?: number;
  };
}
```
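For concreteness, a `tool_call` event from the Scraper Agent might look like the following sketch. All field values here are illustrative, and the types are inlined so the snippet runs standalone:

```typescript
// Types inlined from src/types/telemetry.ts so this runs standalone.
type AgentEventType = 'plan' | 'model_call' | 'tool_call' | 'guardrail_hit' | 'error';

interface AgentEvent {
  eventId: string;
  traceId: string;
  agentName: string;
  type: AgentEventType;
  timestamp: number;
  payload: Record<string, any>;
  metrics?: { promptTokens?: number; completionTokens?: number; latencyMs?: number };
}

// Hypothetical event: IDs, tool name, and metrics are made up for illustration.
const event: AgentEvent = {
  eventId: 'evt_01',
  traceId: 'trace_abc123',
  agentName: 'ScraperAgent',
  type: 'tool_call',
  timestamp: Date.now(),
  payload: { toolName: 'headless_browser', args: { url: 'https://example.com' } },
  metrics: { latencyMs: 1830 },
};

console.log(`${event.agentName}.${event.type}`); // ScraperAgent.tool_call
```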
- **The Hardened OTel Emitter.** This telemetry wrapper maps `AgentEvent` objects to OTel spans. Notice the `safeStringify` function: it is the critical audit fix that prevents process crashes and redacts sensitive keys before they ever leave your server.

```typescript
// src/telemetry/tracer.ts
import { trace, SpanStatusCode, context } from '@opentelemetry/api';
import { AgentEvent } from '../types/telemetry';

const tracer = trace.getTracer('agent-swarm-orchestrator');

/**
 * AUDIT FIX: Prevents "TypeError: Converting circular structure to JSON"
 * and redacts standard PII/secrets before sending to the APM.
 */
function safeStringify(obj: any): string {
  const cache = new Set();
  const stringified = JSON.stringify(obj, (key, value) => {
    if (typeof value === 'object' && value !== null) {
      if (cache.has(value)) return '[Circular]';
      cache.add(value);
    }
    // Basic redaction (expand this pattern based on your domain)
    if (/password|secret|api_key|email|token/i.test(key)) {
      return '[REDACTED]';
    }
    return value;
  });

  // Prevent APM payload rejection (e.g., Datadog's 64KB attribute limit)
  return stringified.length > 10000
    ? stringified.substring(0, 10000) + '...[TRUNCATED]'
    : stringified;
}

export function recordAgentEvent(event: AgentEvent) {
  // Grab the active async context so spans correctly nest as children
  const activeContext = context.active();

  tracer.startActiveSpan(
    `${event.agentName}.${event.type}`,
    {},
    activeContext,
    (span) => {
      // 1. Tag standard attributes
      span.setAttribute('agent.name', event.agentName);
      span.setAttribute('agent.event_type', event.type);

      // 2. Tag metrics (crucial for cost tracking)
      if (event.metrics) {
        if (event.metrics.promptTokens !== undefined) {
          span.setAttribute('llm.usage.prompt_tokens', event.metrics.promptTokens);
        }
        if (event.metrics.completionTokens !== undefined) {
          span.setAttribute('llm.usage.completion_tokens', event.metrics.completionTokens);
        }
        if (event.metrics.latencyMs !== undefined) {
          span.setAttribute('llm.latency_ms', event.metrics.latencyMs);
        }
      }

      // 3. Handle payloads safely
      if (event.type === 'tool_call') {
        span.setAttribute('tool.name', event.payload.toolName);
        span.setAttribute('tool.arguments', safeStringify(event.payload.args));
      }
      if (event.type === 'guardrail_hit') {
        span.setAttribute('guardrail.reason', event.payload.reason);
        span.addEvent('Guardrail Blocked Execution');
      }

      // 4. Handle errors
      if (event.type === 'error') {
        span.recordException(new Error(event.payload.errorMessage));
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: event.payload.errorMessage,
        });
      } else {
        span.setStatus({ code: SpanStatusCode.OK });
      }

      span.end();
    }
  );
}
```
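A quick spot-check of the redaction, circular-reference, and truncation behavior (the `safeStringify` body is repeated inline so the snippet runs standalone):

```typescript
// Inlined copy of safeStringify from src/telemetry/tracer.ts.
function safeStringify(obj: any): string {
  const cache = new Set();
  const stringified = JSON.stringify(obj, (key, value) => {
    if (typeof value === 'object' && value !== null) {
      if (cache.has(value)) return '[Circular]';
      cache.add(value);
    }
    if (/password|secret|api_key|email|token/i.test(key)) return '[REDACTED]';
    return value;
  });
  return stringified.length > 10000
    ? stringified.substring(0, 10000) + '...[TRUNCATED]'
    : stringified;
}

const payload: any = { api_key: 'sk-live-123', query: 'refund policy' };
payload.self = payload; // simulate a cyclic tool-call graph

console.log(safeStringify(payload));
// {"api_key":"[REDACTED]","query":"refund policy","self":"[Circular]"}
```

Instead of crashing the process or leaking the key, the emitter ships a sanitized string attribute to your APM.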
## Pitfalls and Gotchas
When instrumenting AI swarms with OTel, watch out for these operational and security traps:
- **Async Context Dropping:** In Node.js, OpenTelemetry relies on `AsyncLocalStorage` to maintain the `traceId` across asynchronous calls. If your agent uses custom event emitters, worker threads, or certain RxJS observables, the OTel context will silently drop, resulting in orphaned child spans. Always explicitly bind your callbacks with `context.bind(context.active(), callback)`.
- **Payload Size Limits:** Most OTel collectors will drop spans that exceed payload size limits (often ~64KB). Do not dump a 100,000-token RAG document context into a span attribute. Truncate it (as shown in the audited code) or log a pointer (like an S3 URI) instead.
- **High-Cardinality Nightmare:** Never use dynamic user input as the span name (e.g., `tracer.startActiveSpan("query: what is your refund policy")`). This explodes your metrics cardinality and spikes your APM bill. Keep span names static (e.g., `ScraperAgent.tool_call`) and put the dynamic query safely in the attributes.
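The context-dropping pitfall is easy to reproduce with Node's own `AsyncLocalStorage`, the mechanism OTel uses under the hood. In this sketch, the standard-library `AsyncResource.bind` plays the role of OTel's `context.bind`:

```typescript
import { AsyncLocalStorage, AsyncResource } from 'node:async_hooks';
import { EventEmitter } from 'node:events';

const store = new AsyncLocalStorage<{ traceId: string }>();
const emitter = new EventEmitter();
const results: string[] = [];

store.run({ traceId: 'abc-123' }, () => {
  // Unbound: runs in whatever context is active when emit() fires.
  emitter.on('task_done', () => {
    results.push(`unbound: ${store.getStore()?.traceId ?? 'LOST'}`);
  });
  // Bound: AsyncResource.bind pins the callback to the context that
  // was active at registration time, like OTel's context.bind().
  emitter.on('task_done', AsyncResource.bind(() => {
    results.push(`bound: ${store.getStore()?.traceId ?? 'LOST'}`);
  }));
});

// Emit from outside the run() scope, e.g. from a queue drain.
emitter.emit('task_done');
console.log(results); // [ 'unbound: LOST', 'bound: abc-123' ]
```

The unbound listener produces an orphaned span with no `traceId`; the bound one keeps the trace intact.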
## What to Try Next
Ready to stop guessing what your agents are doing? Try these next steps:

- **The "Cost per Feature" Dashboard:** Export these spans to Grafana or Datadog and query `sum(llm.usage.prompt_tokens)` grouped by `agent.name`. This immediately reveals which agent is burning your Anthropic/OpenAI budget.
- **Tail-Based Error Sampling:** If your swarm runs thousands of times a day, tracing every loop gets expensive. Configure your OTel Collector for tail-based sampling: drop 95% of the happy paths, but keep 100% of the traces where a `guardrail_hit` or `error` occurred.
- **Time-to-First-Token (TTFT) Spans:** Enhance the `model_call` event to record TTFT. If a multi-agent workflow feels sluggish to the end user, this metric tells you instantly whether the bottleneck is your Postgres database or the LLM's initial reasoning latency.
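A TTFT-enabled `model_call` wrapper can be sketched as follows; `streamModel` here is a made-up stub standing in for any streaming LLM client that yields tokens as an async iterable:

```typescript
// Stub standing in for a real streaming client SDK.
async function* streamModel(prompt: string): AsyncGenerator<string> {
  for (const token of ['Acme', ' Corp', ' briefing...']) {
    await new Promise((r) => setTimeout(r, 5)); // simulated network delay
    yield token;
  }
}

async function timedModelCall(prompt: string) {
  const start = performance.now();
  let ttftMs: number | undefined;
  let completion = '';

  for await (const token of streamModel(prompt)) {
    ttftMs ??= performance.now() - start; // stamp the first token only
    completion += token;
  }

  // Feed these into AgentEvent.metrics for the model_call span.
  return { completion, ttftMs, totalLatencyMs: performance.now() - start };
}

timedModelCall('Summarize acme.com').then(({ ttftMs, totalLatencyMs }) => {
  // TTFT is a fraction of total latency; a large gap between the two
  // points at slow generation rather than slow initial reasoning.
  console.log(ttftMs, totalLatencyMs);
});
```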