What is OpenTelemetry and Why It Matters Now
OpenTelemetry (OTel) is an open standard for achieving observability in distributed systems. It handles the three pillars — traces, metrics, and logs — through a unified API.
In microservice and AI-agent systems, it is hard to pinpoint where a slow request spends its time or which LLM call is driving up costs. OpenTelemetry provides the instrumentation to answer both questions.
Backend options:
- Jaeger: OSS distributed tracing (self-hosted)
- Grafana Tempo + Prometheus: Metrics + traces integration
- Datadog / Honeycomb: Managed services
- Signoz: OSS full-stack observability
Basic Setup: Instrumentation for Node.js
// src/instrumentation.ts - must run before the app starts
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-http";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { Resource } from "@opentelemetry/resources";

const resource = new Resource({
  "service.name": "my-ai-service",
  "service.version": process.env.npm_package_version ?? "0.0.0",
  "deployment.environment": process.env.NODE_ENV ?? "development",
});

export const sdk = new NodeSDK({
  resource,
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_ENDPOINT ?? "http://localhost:4318/v1/traces",
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter(),
    exportIntervalMillis: 15_000,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush any pending telemetry before the process exits
process.on("SIGTERM", () => sdk.shutdown().finally(() => process.exit(0)));
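For the auto-instrumentations to patch modules like http before the application touches them, the instrumentation file has to load first. One common way to guarantee this (the file paths here are illustrative; match them to your build output):

```shell
# Preload the compiled instrumentation module before the app entrypoint.
node --require ./dist/instrumentation.js ./dist/app.js
```

Importing the instrumentation module as the very first line of your entry file also works, as long as nothing else is imported before it.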
Custom Traces: Instrumenting LLM Calls
import { trace, SpanStatusCode, SpanKind } from "@opentelemetry/api";
import Anthropic from "@anthropic-ai/sdk";

const tracer = trace.getTracer("llm-service", "1.0.0");

async function tracedLLMCall(prompt: string, model = "claude-sonnet-4-5"): Promise<string> {
  return tracer.startActiveSpan("llm.call", {
    kind: SpanKind.CLIENT,
    attributes: {
      "llm.model": model,
      "llm.prompt_length": prompt.length,
      "llm.provider": "anthropic",
    },
  }, async (span) => {
    try {
      const client = new Anthropic();
      const startTime = Date.now();
      const response = await client.messages.create({
        model,
        max_tokens: 1024,
        messages: [{ role: "user", content: prompt }],
      });
      span.setAttributes({
        "llm.input_tokens": response.usage.input_tokens,
        "llm.output_tokens": response.usage.output_tokens,
        "llm.latency_ms": Date.now() - startTime,
      });
      span.setStatus({ code: SpanStatusCode.OK });
      // content is a union of block types; extract the first text block
      const textBlock = response.content.find(
        (block): block is Anthropic.TextBlock => block.type === "text",
      );
      return textBlock?.text ?? "";
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}
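Since the point of these attributes is cost visibility, a natural next step is deriving an estimated-cost attribute from the token counts. A minimal sketch; the helper name is hypothetical and the per-million-token prices are placeholders, not real rates, so substitute your provider's current pricing:

```typescript
// Hypothetical helper: convert token usage into an estimated USD cost,
// suitable for recording as e.g. span.setAttribute("llm.cost_usd", ...).
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

function estimateCostUsd(
  usage: TokenUsage,
  usdPerMTokInput: number,   // placeholder price per million input tokens
  usdPerMTokOutput: number,  // placeholder price per million output tokens
): number {
  return (
    (usage.inputTokens / 1_000_000) * usdPerMTokInput +
    (usage.outputTokens / 1_000_000) * usdPerMTokOutput
  );
}

const cost = estimateCostUsd({ inputTokens: 2_000, outputTokens: 500 }, 3, 15);
console.log(cost.toFixed(4)); // 0.0135
```

Recording cost as a span attribute lets the backend aggregate spend per endpoint, per user, or per model.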
Custom Metrics: Measuring Business KPIs
import { metrics } from "@opentelemetry/api";

const meter = metrics.getMeter("ai-service", "1.0.0");

const requestCounter = meter.createCounter("api.requests.total", {
  description: "Total number of API requests",
});

const latencyHistogram = meter.createHistogram("api.latency.ms", {
  description: "API request latency in milliseconds",
  unit: "ms",
  advice: {
    explicitBucketBoundaries: [10, 25, 50, 100, 250, 500, 1000, 2500, 5000],
  },
});

// Assumes `pool` is an existing database connection pool (e.g. pg.Pool)
const activeConnectionsGauge = meter.createObservableGauge("db.connections.active");
activeConnectionsGauge.addCallback((result) => {
  result.observe(pool.totalCount - pool.idleCount, { db: "primary" });
});
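To make the explicitBucketBoundaries advice concrete: each boundary is the inclusive upper bound of a bucket, and values beyond the last boundary land in an overflow bucket. A small sketch (not SDK code, purely illustrative) of where a recorded latency falls:

```typescript
// Sketch: which explicit bucket a recorded value lands in.
const boundaries = [10, 25, 50, 100, 250, 500, 1000, 2500, 5000];

function bucketUpperBound(latencyMs: number): number | "+Inf" {
  for (const bound of boundaries) {
    if (latencyMs <= bound) return bound; // upper bounds are inclusive
  }
  return "+Inf"; // overflow bucket beyond the last boundary
}

console.log(bucketUpperBound(42));   // 50
console.log(bucketUpperBound(9000)); // +Inf
```

Choosing boundaries that straddle your latency SLO (here, plenty of resolution between 10 ms and 5 s) is what makes the histogram useful for percentile queries later.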
Correlated Logs and Traces
import { trace } from "@opentelemetry/api";
import pino from "pino";

const logger = pino({
  mixin() {
    const span = trace.getActiveSpan();
    if (!span) return {};
    const ctx = span.spanContext();
    return { traceId: ctx.traceId, spanId: ctx.spanId };
  },
});

// Now logs and traces are correlated:
// search by trace ID in Jaeger to find the corresponding logs
logger.info({ userId: 123 }, "User logged in");
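The same traceId/spanId pair travels between services in the W3C traceparent HTTP header, which the auto-instrumentations propagate automatically. A small sketch of its format, using the example IDs from the W3C Trace Context spec:

```typescript
// W3C traceparent header: "version-traceId-spanId-flags", lowercase hex.
function parseTraceparent(
  header: string,
): { traceId: string; spanId: string; sampled: boolean } | null {
  const m = /^([\da-f]{2})-([\da-f]{32})-([\da-f]{16})-([\da-f]{2})$/.exec(header);
  if (!m) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { traceId: m[2], spanId: m[3], sampled: (parseInt(m[4], 16) & 1) === 1 };
}

const parsed = parseTraceparent(
  "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
);
console.log(parsed?.traceId); // 4bf92f3577b34da6a3ce929d0e0e4736
```

You normally never parse this by hand; the SDK's propagators do it. Knowing the format just makes the IDs you see in logs and in Jaeger less mysterious.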
Docker Compose: OTel Collector + Jaeger
version: "3.8"
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.96.0
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"
      - "4318:4318"

  jaeger:
    image: jaegertracing/all-in-one:1.55
    ports:
      - "16686:16686" # Jaeger UI

  prometheus:
    image: prom/prometheus:v2.50.0
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:10.3.0
    ports:
      - "3000:3000"
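The compose file mounts otel-collector-config.yaml but does not show it. A minimal sketch of what would fit this setup (assumed, not a tested config; verify the exporter names against the collector-contrib docs for your version, since recent releases route traces to Jaeger through the plain otlp exporter rather than a dedicated jaeger one, and Prometheus must be configured separately to scrape the collector's 8889 endpoint):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```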
Implementing OpenTelemetry makes "what's slow" and "what's eating costs" visible. Instrumenting LLM calls in particular is an investment that pays off directly in AI system optimization.
This article is from the Claude Code Complete Guide (7 chapters) on note.com.
myouga (@myougatheaxo) - VTuber axolotl. Sharing practical AI development tips.