Introduction
Modern backend systems are rarely a single process. A single user request might touch an API gateway, three microservices, a PostgreSQL database, a Redis cache, and an external payment provider — all in under 200 milliseconds. When something goes wrong (and it will), you need to know exactly where.
Traditional logging — console.log("request received") — doesn't cut it here. You need observability: the ability to ask arbitrary questions about your system's behavior from the outside, without modifying the code.
OpenTelemetry (OTel) is the open-source standard that gives you that power. It's vendor-neutral, CNCF-graduated, and has become the de facto way to instrument Node.js services. This guide walks you through everything: setting up the SDK, auto-instrumenting Express and database drivers, writing custom spans, collecting metrics, correlating logs with trace IDs, and shipping data to Jaeger or Grafana Tempo.
1. Why Observability Matters: The Three Pillars
Observability is built on three complementary signals:
Traces answer where did time go? A trace is a directed acyclic graph of spans, each representing a unit of work. A root span covers the entire HTTP request; child spans cover the database query, the Redis lookup, the downstream API call. Traces reveal latency hotspots and error propagation paths.
Metrics answer how is the system behaving right now? Request rates, error rates, p99 latency, queue depth, memory usage. Metrics are cheap to store and great for dashboards and alerting.
Logs answer what exactly happened? Structured log lines with timestamps and context. When correlated with a trace ID, a log line becomes surgically precise — you can jump straight from a metric alert to the exact trace to the exact log line that caused it.
Without all three signals connected, you're debugging in the dark.
2. OpenTelemetry vs. Commercial APM Tools
Before OTel, every APM vendor (Datadog, New Relic, Dynatrace) had a proprietary agent you'd install and be permanently coupled to. Switching vendors meant re-instrumenting your entire codebase.
OpenTelemetry changes that:
| | OpenTelemetry | Datadog / New Relic |
|---|---|---|
| Vendor lock-in | None — export to any backend | Locked to proprietary format |
| Cost | Free SDK; backend costs vary | $15-25/host/month minimum |
| Standard | CNCF graduated project | Proprietary |
| Backends | Jaeger, Tempo, OTLP, Zipkin, Prometheus | Vendor-specific |
| Auto-instrumentation | 100+ libraries | 100+ libraries |
OTel doesn't replace Datadog entirely — Datadog still has excellent UX and ML-based anomaly detection. But OTel lets you choose your backend, or even fan out to multiple backends simultaneously. Your instrumentation code is write-once.
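The fan-out happens in the OpenTelemetry Collector, not in your application. As a sketch (exporter names and endpoints here are illustrative; the `datadog` exporter ships in the collector-contrib distribution, and the `receivers`/`processors` sections are elided):

```yaml
# One trace pipeline, two backends at once
exporters:
  otlp/tempo:
    endpoint: tempo:4317
  datadog:
    api:
      key: ${DD_API_KEY}
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo, datadog]   # both receive every trace
```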
3. Setting Up the OTel SDK in Node.js/Express
Installation
npm install \
@opentelemetry/sdk-node \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-http \
@opentelemetry/exporter-prometheus \
@opentelemetry/sdk-metrics \
@opentelemetry/semantic-conventions
The Tracing Entrypoint
Create tracing.js and require it before anything else. This is critical — OTel patches modules at import time, so it must run first.
// tracing.js
'use strict';
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-http');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const resource = new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: 'order-service',
[SemanticResourceAttributes.SERVICE_VERSION]: '1.4.2',
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development',
});
const traceExporter = new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT || 'http://localhost:4318/v1/traces',
});
const metricExporter = new OTLPMetricExporter({
url: process.env.OTEL_EXPORTER_OTLP_METRICS_ENDPOINT || 'http://localhost:4318/v1/metrics',
});
const sdk = new NodeSDK({
resource,
traceExporter,
metricReader: new PeriodicExportingMetricReader({
exporter: metricExporter,
exportIntervalMillis: 15_000, // export every 15 seconds
}),
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-fs': { enabled: false }, // too noisy
'@opentelemetry/instrumentation-http': { enabled: true },
'@opentelemetry/instrumentation-express': { enabled: true },
'@opentelemetry/instrumentation-pg': { enabled: true },
'@opentelemetry/instrumentation-redis': { enabled: true },
}),
],
});
sdk.start();
process.on('SIGTERM', () => {
sdk.shutdown()
.then(() => console.log('OTel SDK shut down'))
.catch(err => console.error('Error shutting down OTel SDK', err))
.finally(() => process.exit(0));
});
Starting Your App
// package.json
{
"scripts": {
"start": "node -r ./tracing.js server.js"
}
}
Or using the --require flag directly:
node --require ./tracing.js server.js
The Express Server
// server.js
const express = require('express');
const { Pool } = require('pg');
const redis = require('redis');
const app = express();
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const redisClient = redis.createClient({ url: process.env.REDIS_URL });
redisClient.connect().catch((err) => console.error('Redis connection failed', err)); // connect() returns a promise
app.get('/orders/:id', async (req, res) => {
const { id } = req.params;
// Redis cache lookup
const cached = await redisClient.get(`order:${id}`);
if (cached) {
return res.json(JSON.parse(cached));
}
// Postgres query
const { rows } = await pool.query('SELECT * FROM orders WHERE id = $1', [id]);
if (!rows.length) return res.status(404).json({ error: 'not found' });
await redisClient.setEx(`order:${id}`, 300, JSON.stringify(rows[0]));
res.json(rows[0]);
});
app.listen(3000, () => console.log('Listening on :3000'));
With tracing.js loaded, every HTTP request, Postgres query, and Redis command is automatically captured as a span — zero additional code required.
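On a cache miss, the resulting trace for this route looks roughly like the following waterfall (span names and timings are illustrative of what the auto-instrumentations emit):

```
GET /orders/:id                             182ms
├─ redis-GET order:42                         3ms   (cache miss)
├─ pg.query SELECT * FROM orders WHERE id   141ms
└─ redis-SETEX order:42                       2ms
```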
4. Auto-Instrumentation for HTTP, DB, and Redis
getNodeAutoInstrumentations() bundles instrumentation for dozens of popular libraries out of the box (the broader OTel registry covers 100+). Here's what you get for free:
HTTP/Express: Every inbound request becomes a root span with attributes like http.method, http.route, http.status_code, http.url. Every outbound https.request() or axios/fetch call becomes a child span with the remote URL and status.
PostgreSQL (pg): Every pool.query() call becomes a span with db.system=postgresql, db.statement (the SQL), and db.name. You'll see the exact query text in your trace.
Redis: Every Redis command (GET, SET, HGET, etc.) becomes a span with db.system=redis and db.statement.
Other auto-instrumented libraries include: mysql2, mongodb, grpc-js, graphql, kafkajs, aws-sdk, ioredis, knex, typeorm, and many more.
Disabling Noisy Instrumentations
The filesystem instrumentation (@opentelemetry/instrumentation-fs) creates a span for every fs.readFile call, which is typically too noisy. Disable it explicitly as shown in the SDK setup above.
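Health checks are another common noise source. A sketch, assuming the `ignoreIncomingRequestHook` option exposed by `@opentelemetry/instrumentation-http` (check the README of your installed version, as instrumentation options have changed across releases):

```javascript
// Predicate to skip tracing for health-check and metrics endpoints.
// The path list is an example; adjust it to your own probe routes.
const IGNORED_PATHS = new Set(['/healthz', '/readyz', '/metrics']);

function shouldIgnoreRequest(req) {
  // Strip the query string before comparing against the ignore list
  const path = (req.url || '').split('?')[0];
  return IGNORED_PATHS.has(path);
}

module.exports = { shouldIgnoreRequest };
```

Wired in via `getNodeAutoInstrumentations({ '@opentelemetry/instrumentation-http': { ignoreIncomingRequestHook: shouldIgnoreRequest } })`.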
5. Custom Spans and Attributes
Auto-instrumentation is great, but business logic needs custom spans. Use the trace API to create them.
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('order-service', '1.4.2');
async function processPayment(orderId, amount, currency) {
// Create a custom span
return tracer.startActiveSpan('payment.process', async (span) => {
try {
// Add semantic attributes
span.setAttributes({
'order.id': orderId,
'payment.amount': amount,
'payment.currency': currency,
'payment.provider': 'stripe',
});
const result = await stripeClient.charges.create({
amount: Math.round(amount * 100), // Stripe expects integer cents; 99.99 * 100 is not exact in floating point
currency,
source: await getPaymentToken(orderId),
});
span.setAttributes({
'payment.charge_id': result.id,
'payment.status': result.status,
});
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (err) {
// Record the exception — this adds a span event with the stack trace
span.recordException(err);
span.setStatus({
code: SpanStatusCode.ERROR,
message: err.message,
});
throw err;
} finally {
span.end();
}
});
}
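The try/recordException/finally pattern above is easy to get wrong when repeated across many functions, so it is worth extracting into a small wrapper. This is a sketch, not part of the OTel API: `withSpan` and its `tracer` parameter are our own names, and the status codes mirror `SpanStatusCode` from `@opentelemetry/api` (OK = 1, ERROR = 2).

```javascript
// Status codes matching @opentelemetry/api's SpanStatusCode enum
const SpanStatusCode = { OK: 1, ERROR: 2 };

// Run `fn` inside a named active span, recording attributes, status,
// and exceptions, and always ending the span.
async function withSpan(tracer, name, attributes, fn) {
  return tracer.startActiveSpan(name, async (span) => {
    try {
      span.setAttributes(attributes);
      const result = await fn(span);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.recordException(err); // adds a span event with the stack trace
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}

module.exports = { withSpan, SpanStatusCode };
```

With this helper, `processPayment` shrinks to a call like `withSpan(tracer, 'payment.process', { 'order.id': orderId }, async (span) => { ... })`.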
Nested Spans
Child spans are automatically associated with the parent when created inside startActiveSpan:
async function fulfillOrder(orderId) {
return tracer.startActiveSpan('order.fulfill', async (parentSpan) => {
parentSpan.setAttribute('order.id', orderId);
// This span is automatically a child of order.fulfill
const payment = await processPayment(orderId, 99.99, 'usd');
// Another child span
await tracer.startActiveSpan('order.notify', async (notifySpan) => {
await sendConfirmationEmail(orderId);
notifySpan.end();
});
parentSpan.end();
return { orderId, payment };
});
}
Adding Span Events
Span events are timestamped annotations within a span — useful for checkpoints:
span.addEvent('cache.miss', { 'cache.key': `order:${id}` });
span.addEvent('db.query.start');
// ... query executes ...
span.addEvent('db.query.complete', { 'db.rows_returned': rows.length });
6. Metrics: Counters, Histograms, and Gauges
OpenTelemetry Metrics gives you the three instrument types you need for production dashboards.
// metrics.js
const { metrics } = require('@opentelemetry/api');
const meter = metrics.getMeter('order-service', '1.4.2');
// Counter: monotonically increasing (requests, errors, events)
const requestCounter = meter.createCounter('http.requests.total', {
description: 'Total number of HTTP requests',
});
// Histogram: distribution of values (latency, payload size)
const latencyHistogram = meter.createHistogram('http.request.duration_ms', {
description: 'HTTP request latency in milliseconds',
unit: 'ms',
advice: {
explicitBucketBoundaries: [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000],
},
});
// UpDownCounter: can go up or down (queue depth, active connections)
const activeConnections = meter.createUpDownCounter('db.connections.active', {
description: 'Active database connections',
});
// Observable Gauge: sampled on demand (CPU, memory — use callbacks)
const memoryGauge = meter.createObservableGauge('process.memory_mb', {
description: 'Process memory usage in MB',
});
memoryGauge.addCallback((observableResult) => {
const usage = process.memoryUsage();
observableResult.observe(usage.heapUsed / 1024 / 1024, { type: 'heap' });
observableResult.observe(usage.rss / 1024 / 1024, { type: 'rss' });
});
module.exports = { requestCounter, latencyHistogram, activeConnections };
Using Metrics in Middleware
// middleware/metrics.js
const { requestCounter, latencyHistogram } = require('../metrics');
function metricsMiddleware(req, res, next) {
const start = Date.now();
res.on('finish', () => {
const duration = Date.now() - start;
const labels = {
method: req.method,
route: req.route?.path || 'unknown',
status_code: String(res.statusCode),
};
requestCounter.add(1, labels);
latencyHistogram.record(duration, labels);
});
next();
}
module.exports = metricsMiddleware;
// server.js
app.use(require('./middleware/metrics'));
Now Prometheus (or any OTLP-compatible backend) receives per-route request counts and latency histograms, enabling p50/p95/p99 latency dashboards.
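With the histogram in Prometheus, a p99 panel is one query away. The exact metric name depends on your exporter's naming rules; this sketch assumes the OTLP-to-Prometheus pipeline exposes the histogram as `http_request_duration_ms_bucket`:

```
histogram_quantile(
  0.99,
  sum by (le, route) (rate(http_request_duration_ms_bucket[5m]))
)
```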
7. Correlating Logs with Trace IDs
The power of the three pillars comes from connecting them. When a log line carries the same traceId and spanId as the trace you're investigating, you can jump directly between them.
OpenTelemetry makes this trivial with the context API:
// logger.js — structured logger with automatic trace correlation
const { trace, context } = require('@opentelemetry/api');
function getTraceContext() {
const span = trace.getActiveSpan();
if (!span) return {};
const { traceId, spanId, traceFlags } = span.spanContext();
return {
traceId,
spanId,
traceSampled: (traceFlags & 0x01) === 1,
};
}
const logger = {
info(message, extra = {}) {
console.log(JSON.stringify({
level: 'info',
message,
timestamp: new Date().toISOString(),
service: 'order-service',
...getTraceContext(),
...extra,
}));
},
error(message, err, extra = {}) {
console.error(JSON.stringify({
level: 'error',
message,
timestamp: new Date().toISOString(),
service: 'order-service',
error: { name: err?.name, message: err?.message, stack: err?.stack },
...getTraceContext(),
...extra,
}));
},
};
module.exports = logger;
// In your route handler
const logger = require('./logger');
app.get('/orders/:id', async (req, res) => {
logger.info('Fetching order', { orderId: req.params.id });
try {
const order = await getOrder(req.params.id);
logger.info('Order retrieved', { orderId: req.params.id, status: order.status });
res.json(order);
} catch (err) {
logger.error('Failed to fetch order', err, { orderId: req.params.id });
res.status(500).json({ error: 'internal error' });
}
});
This produces logs like:
{
"level": "info",
"message": "Order retrieved",
"timestamp": "2026-03-22T14:23:01.882Z",
"service": "order-service",
"traceId": "3e8a1b2c4d5e6f7a8b9c0d1e2f3a4b5c",
"spanId": "a1b2c3d4e5f6a7b8",
"traceSampled": true,
"orderId": "ord_9182",
"status": "shipped"
}
In Grafana, you can click traceId in a log line and jump directly to the Tempo trace. Or pivot from a slow trace to its correlated logs. This is the three-pillar payoff.
8. Exporting to Jaeger (Local Dev) and OTLP (Production)
Local Development with Jaeger
Run Jaeger all-in-one with Docker Compose:
# docker-compose.yml
version: '3.8'
services:
jaeger:
image: jaegertracing/all-in-one:1.54
ports:
- "16686:16686" # Jaeger UI
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
environment:
- COLLECTOR_OTLP_ENABLED=true
app:
build: .
environment:
- OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://jaeger:4318/v1/traces
- NODE_ENV=development
depends_on:
- jaeger
Start it:
docker-compose up -d
# App runs on :3000, Jaeger UI at http://localhost:16686
Make a few requests to your Express app, then open http://localhost:16686. Select order-service from the dropdown. You'll see a waterfall of every request with its child spans — HTTP, Postgres, Redis — with exact durations.
Production: OTel Collector + OTLP
For production, route telemetry through the OpenTelemetry Collector. It handles batching, retries, filtering, and fan-out to multiple backends.
# otel-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 1s
send_batch_size: 1024
memory_limiter:
check_interval: 1s
limit_mib: 512
exporters:
otlp/tempo:
endpoint: tempo:4317
tls:
insecure: true
prometheusremotewrite:
endpoint: http://prometheus:9090/api/v1/write
logging:
loglevel: warn
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [otlp/tempo]
metrics:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [prometheusremotewrite]
Your Node.js service just points to the collector:
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
The collector handles everything downstream — you can swap backends without touching application code.
9. Sampling Strategies
Sampling is essential. In high-traffic production systems, recording every single trace is prohibitively expensive. The goal is to capture enough data to debug issues without burning storage and money.
Head-Based Sampling (SDK-Level)
Decide at the start of a request whether to record it:
const { TraceIdRatioBasedSampler, ParentBasedSampler } = require('@opentelemetry/sdk-trace-base');
// Sample 10% of traces, but always respect parent sampling decision
const sampler = new ParentBasedSampler({
root: new TraceIdRatioBasedSampler(0.1), // 10% sampling rate
});
const sdk = new NodeSDK({
sampler,
// ...rest of config
});
ParentBasedSampler is critical in microservices: if Service A decides to sample a trace, Service B keeps recording it even when B's own sampling ratio is lower. Without it, traces would be truncated at service boundaries.
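The decision logic can be sketched in a few lines. This is a simplification, not the SDK's actual implementation (the real ParentBasedSampler also distinguishes remote vs. local parents and span links):

```javascript
// Simplified parent-based sampling decision:
// a child span inherits its parent's sampled flag;
// only root spans (no parent) consult the ratio sampler.
function parentBasedDecision(parentSampled, rootDecision) {
  return parentSampled === undefined ? rootDecision() : parentSampled;
}

module.exports = { parentBasedDecision };
```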
Always-Sample Errors (Tail-Based via Collector)
The collector's tail_sampling processor makes sampling decisions after seeing the full trace, which lets you always keep error traces:
# otel-collector-config.yaml (tail sampling)
processors:
tail_sampling:
decision_wait: 10s
num_traces: 50000
expected_new_traces_per_sec: 1000
policies:
- name: errors-policy
type: status_code
status_code: { status_codes: [ERROR] }
- name: slow-traces-policy
type: latency
latency: { threshold_ms: 2000 }
- name: random-policy
type: probabilistic
probabilistic: { sampling_percentage: 5 }
This configuration:
- Always keeps traces with errors
- Always keeps traces slower than 2 seconds
- Randomly samples 5% of everything else
This is the gold standard for production sampling — you never miss an interesting trace.
10. Visualizing Traces in Grafana Tempo
Grafana Tempo is the open-source distributed tracing backend that integrates natively with Grafana dashboards. It's built for scale: storing traces in object storage (S3/GCS) at a fraction of the cost of Jaeger's Elasticsearch backend.
Full Stack with Docker Compose
# docker-compose.prod-local.yml
version: '3.8'
services:
tempo:
image: grafana/tempo:2.3.1
command: ["-config.file=/etc/tempo.yaml"]
volumes:
- ./tempo.yaml:/etc/tempo.yaml
- tempo-data:/var/tempo
ports:
- "4317:4317" # OTLP gRPC
- "3200:3200" # Tempo query API
prometheus:
image: prom/prometheus:v2.48.0
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
ports:
- "9090:9090"
grafana:
image: grafana/grafana:10.2.2
ports:
- "3001:3000"
environment:
- GF_AUTH_ANONYMOUS_ENABLED=true
- GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
volumes:
- ./grafana/provisioning:/etc/grafana/provisioning
volumes:
tempo-data:
# tempo.yaml
server:
http_listen_port: 3200
distributor:
receivers:
otlp:
protocols:
grpc:
http:
storage:
trace:
backend: local
local:
path: /var/tempo/traces
wal:
path: /var/tempo/wal
Grafana Data Sources
In Grafana's provisioning config, link Tempo and Prometheus:
# grafana/provisioning/datasources/datasources.yaml
apiVersion: 1
datasources:
- name: Tempo
type: tempo
url: http://tempo:3200
jsonData:
tracesToLogsV2:
datasourceUid: loki
serviceMap:
datasourceUid: prometheus
nodeGraph:
enabled: true
- name: Prometheus
type: prometheus
url: http://prometheus:9090
uid: prometheus
What You Get
With this stack running, Grafana gives you:
- Service Map — an auto-generated DAG of all your services and their dependencies, with error rate and latency for each edge
- Trace Waterfall — click any span to see attributes, events, and linked logs
- RED Dashboard — Rate, Errors, Duration for each service endpoint, derived automatically from trace data
- Metrics Correlation — jump from a Prometheus alert to the traces that fired during the anomaly window
The TraceQL query language (Grafana Tempo's trace query DSL) lets you filter traces programmatically:
{ resource.service.name = "order-service" && span.http.status_code >= 500 } | rate()
This returns a time series of error rates across all spans matching those conditions — a metric derived directly from trace data, without separate instrumentation.
Putting It All Together
Here's a production-ready tracing.js that handles environment-based configuration:
// tracing.js — production-ready
'use strict';
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-http');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');
const isProd = process.env.NODE_ENV === 'production';
const resource = new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: process.env.OTEL_SERVICE_NAME || 'my-service',
[SemanticResourceAttributes.SERVICE_VERSION]: process.env.npm_package_version || '0.0.0',
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development',
});
const sdk = new NodeSDK({
resource,
sampler: new ParentBasedSampler({
root: new TraceIdRatioBasedSampler(isProd ? 0.1 : 1.0), // 100% in dev, 10% in prod
}),
traceExporter: new OTLPTraceExporter(), // uses OTEL_EXPORTER_OTLP_ENDPOINT env var
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter(),
exportIntervalMillis: isProd ? 15_000 : 5_000,
}),
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-fs': { enabled: false },
}),
],
});
sdk.start();
console.log(`OTel SDK started [${process.env.NODE_ENV}]`);
process.on('SIGTERM', async () => {
await sdk.shutdown();
process.exit(0);
});
Set these environment variables in your deployment:
OTEL_SERVICE_NAME=order-service
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
NODE_ENV=production
Conclusion
OpenTelemetry in Node.js has reached production maturity. The auto-instrumentation layer handles the heavy lifting for HTTP, databases, and cache — giving you detailed traces from day one. Custom spans let you annotate your business logic with the context that matters. Metrics and correlated logs complete the observability picture.
The vendor-neutral design is the real win: instrument once, export anywhere. Start with Jaeger locally to get familiar with traces. Graduate to the OTel Collector + Grafana Tempo + Prometheus stack for production. When (or if) you need to add Datadog or New Relic for specific features, you add a collector exporter — your application code doesn't change.
The full stack outlined here costs nothing to run on your own infrastructure except compute. For most teams, that means replacing $1,000+/month APM bills with a self-hosted stack that gives you more control and equal visibility.
Start with node -r ./tracing.js server.js. You'll have your first traces in under five minutes.
Wilson Xu is a backend engineer specializing in distributed systems and developer tooling. He writes about Node.js, observability, and cloud-native infrastructure.