The Problem with Observability Silos
You have logs in CloudWatch, metrics in Datadog, and traces in a different tool. Correlating a slow request across three systems is a manual nightmare.
OpenTelemetry (OTel) standardizes how you emit telemetry—then send it anywhere.
Core Concepts
- Traces: The journey of a request through your system
- Spans: Individual operations within a trace
- Metrics: Numerical measurements over time
- Logs: Timestamped events (OTel correlates these with traces)
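To make the correlation concrete, here's a toy model (plain TypeScript types, not the real OTel API) of how the signals link up: every span in a trace shares one trace ID, and a log record emitted inside a span carries that span's IDs so a backend can jump from the log line straight to the trace.

```typescript
// Toy model, not the real OTel types. Names are illustrative.
interface SpanRecord {
  traceId: string;       // shared by every span in one trace
  spanId: string;        // unique per operation
  parentSpanId?: string; // links spans into a tree
  name: string;
}

interface LogRecord {
  timestamp: number;
  body: string;
  traceId?: string; // when set, backends can correlate log and trace
  spanId?: string;
}

const root: SpanRecord = { traceId: "abc123", spanId: "s1", name: "GET /orders" };
const child: SpanRecord = {
  traceId: "abc123", spanId: "s2", parentSpanId: "s1", name: "SELECT orders",
};

// A log emitted while the child span is active carries both of its IDs:
const log: LogRecord = {
  timestamp: Date.now(),
  body: "order not found",
  traceId: child.traceId,
  spanId: child.spanId,
};
```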
Setup
npm install @opentelemetry/sdk-node \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-http \
@opentelemetry/exporter-metrics-otlp-http
// instrumentation.ts — load BEFORE everything else
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import { SEMRESATTRS_SERVICE_NAME } from '@opentelemetry/semantic-conventions';
const sdk = new NodeSDK({
resource: new Resource({
[SEMRESATTRS_SERVICE_NAME]: 'my-api',
}),
traceExporter: new OTLPTraceExporter({
url: `${process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://localhost:4318'}/v1/traces`, // env var is the base endpoint; append the signal path
}),
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: `${process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://localhost:4318'}/v1/metrics`, // same base endpoint, metrics signal path
}),
exportIntervalMillis: 30000,
}),
instrumentations: [getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-fs': { enabled: false }, // too noisy
})],
});
sdk.start();
process.on('SIGTERM', () => { sdk.shutdown().then(() => process.exit(0)); }); // flush pending telemetry before exit
// server.ts
import './instrumentation'; // side-effect import, listed first: ESM hoists imports, so a require() here would NOT run before them
import express from 'express';
// ...
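If you'd rather not rely on import order at all, Node can preload the SDK from the command line; the `dist/` paths below are assumptions about your build output:

```shell
# CommonJS build: preload the SDK before any app module is evaluated
node --require ./dist/instrumentation.js dist/server.js

# Pure ESM (Node 20+): use --import instead
node --import ./dist/instrumentation.js dist/server.js
```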
Auto-Instrumentation
getNodeAutoInstrumentations automatically instruments:
- HTTP/HTTPS: every incoming and outgoing request
- Express: middleware, routes
- Prisma/pg: database queries
- Redis: cache operations
- gRPC: service calls
Zero code changes needed for these.
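Each instrumentation still accepts its own config through the same options object, so "zero code changes" leaves room for tuning. A sketch: `ignoreIncomingRequestHook` is part of the HTTP instrumentation's config, and the `/healthz` path is just an illustration of dropping health-check noise.

```typescript
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const instrumentations = getNodeAutoInstrumentations({
  '@opentelemetry/instrumentation-fs': { enabled: false },
  '@opentelemetry/instrumentation-http': {
    // skip spans for load-balancer health checks (path is illustrative)
    ignoreIncomingRequestHook: (req) => req.url === '/healthz',
  },
});
```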
Custom Spans
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('my-service');
async function processOrder(orderId: string) {
return tracer.startActiveSpan('processOrder', async (span) => {
span.setAttributes({
'order.id': orderId,
'order.source': 'api',
});
try {
const order = await db.orders.findUnique({ where: { id: orderId } });
span.setAttributes({ 'order.total': order.total, 'order.items': order.items.length });
await tracer.startActiveSpan('validateInventory', async (childSpan) => {
try {
await checkInventory(order.items);
} finally {
childSpan.end(); // end even if the check throws, or the span never closes
}
});
await tracer.startActiveSpan('chargePayment', async (childSpan) => {
try {
await chargeStripe(order);
} finally {
childSpan.end();
}
});
span.setStatus({ code: SpanStatusCode.OK });
return order;
} catch (error) {
span.recordException(error as Error);
span.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message });
throw error;
} finally {
span.end();
}
});
}
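Why validateInventory and chargePayment end up as children of processOrder: startActiveSpan registers the new span in the active context, and any span started inside the callback uses whatever is active as its parent. A toy re-implementation of that mechanic (synchronous and simplified; the real SDK propagates context across async boundaries via AsyncLocalStorage):

```typescript
// Toy model of context-based span parenting. Not the real OTel SDK.
type ToySpan = { name: string; parent?: string };

const spans: ToySpan[] = [];
const activeStack: string[] = [];

function startActiveSpan<T>(name: string, fn: () => T): T {
  // New span's parent is whatever span is currently active
  spans.push({ name, parent: activeStack[activeStack.length - 1] });
  activeStack.push(name);
  try {
    return fn();
  } finally {
    activeStack.pop(); // restore the previous active span
  }
}

startActiveSpan("processOrder", () => {
  startActiveSpan("validateInventory", () => {});
  startActiveSpan("chargePayment", () => {});
});
// Both inner spans record "processOrder" as their parent.
```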
Custom Metrics
import { metrics } from '@opentelemetry/api';
const meter = metrics.getMeter('my-service');
// Counter: monotonically increasing
const requestCounter = meter.createCounter('http.requests.total', {
description: 'Total HTTP requests',
});
// Histogram: distribution of values
const requestDuration = meter.createHistogram('http.request.duration', {
description: 'HTTP request duration in ms',
unit: 'ms',
});
// Observable gauge: current value
const activeConnections = meter.createObservableGauge('db.connections.active', {
description: 'Active database connections',
});
activeConnections.addCallback((result) => {
result.observe(pool.totalCount - pool.idleCount);
});
// Middleware to record metrics
app.use((req, res, next) => {
const start = Date.now();
res.on('finish', () => {
// req.route is only populated after the router matches a handler,
// so record route-labelled metrics on 'finish', not at request start
const route = req.route?.path ?? 'unknown';
requestCounter.add(1, { method: req.method, route });
requestDuration.record(Date.now() - start, {
method: req.method,
status_code: res.statusCode,
route,
});
});
next();
});
Backends You Can Send To
# Jaeger (open source, self-hosted)
docker run -p 16686:16686 -p 4318:4318 jaegertracing/all-in-one
# Grafana Tempo + Loki + Prometheus (full stack)
# See grafana/otel-lgtm docker image
# Commercial: Datadog, Honeycomb, New Relic, Lightstep
# Just change the OTLP endpoint URL
OTEL_EXPORTER_OTLP_ENDPOINT=https://api.honeycomb.io
OTEL_EXPORTER_OTLP_HEADERS=x-honeycomb-team=YOUR_API_KEY
That's the point: instrument once, switch backends by changing an env var.
What You Get
After setup, for every request you automatically see:
- Full trace with all DB queries and their duration
- Which query is the bottleneck
- Downstream HTTP calls and their latency
- Error details with stack traces linked to traces
- P50/P95/P99 latency per endpoint
No more guessing where time is spent.
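Those P50/P95/P99 figures are derived from the duration histogram above. A simplified sketch of the computation (real backends aggregate from bucket counts rather than raw values; the numbers here are illustrative):

```typescript
// Nearest-rank percentile over raw values, for illustration only.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  // index of the p-th percentile in the sorted list
  const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.max(0, idx)];
}

const durations = [12, 15, 18, 22, 30, 45, 80, 120, 350, 900]; // ms, illustrative
const p50 = percentile(durations, 50); // 30
const p95 = percentile(durations, 95); // 900
const p99 = percentile(durations, 99); // 900
```

Note how one slow outlier dominates the tail percentiles while leaving the median untouched; that's exactly why dashboards show P95/P99 alongside P50.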
OpenTelemetry instrumentation pre-configured with Jaeger for local dev and OTLP for production: Whoff Agents AI SaaS Starter Kit.