Introduction
Modern backend systems are rarely a single process. A single user request might touch an API gateway, three microservices, a PostgreSQL database, a Redis cache, and an external payment provider — all in under 200 milliseconds. When something goes wrong (and it will), you need to know exactly where.
Traditional logging — console.log("request received") — doesn't cut it here. You need observability: the ability to ask arbitrary questions about your system's behavior from the outside, without modifying the code.
OpenTelemetry (OTel) is the open-source standard that gives you that power. It's vendor-neutral, CNCF-graduated, and has become the de facto way to instrument Node.js services. This guide walks you through everything: setting up the SDK, auto-instrumenting Express and database drivers, writing custom spans, collecting metrics, correlating logs with trace IDs, and shipping data to Jaeger or Grafana Tempo.
1. Why Observability Matters: The Three Pillars
Observability is built on three complementary signals:
Traces answer where did time go? A trace is a directed acyclic graph of spans, each representing a unit of work. A root span covers the entire HTTP request; child spans cover the database query, the Redis lookup, the downstream API call. Traces reveal latency hotspots and error propagation paths.
Metrics answer how is the system behaving right now? Request rates, error rates, p99 latency, queue depth, memory usage. Metrics are cheap to store and great for dashboards and alerting.
Logs answer what exactly happened? Structured log lines with timestamps and context. When correlated with a trace ID, a log line becomes surgically precise — you can jump straight from a metric alert to the exact trace to the exact log line that caused it.
Without all three signals connected, you're debugging in the dark.
2. OpenTelemetry vs. Commercial APM Tools
Before OTel, every APM vendor (Datadog, New Relic, Dynatrace) had a proprietary agent you'd install and be permanently coupled to. Switching vendors meant re-instrumenting your entire codebase.
OpenTelemetry changes that:
| | OpenTelemetry | Datadog / New Relic |
|---|---|---|
| Vendor lock-in | None — export to any backend | Locked to proprietary format |
| Cost | Free SDK; backend costs vary | $15-25/host/month minimum |
| Standard | CNCF graduated project | Proprietary |
| Backends | Jaeger, Tempo, OTLP, Zipkin, Prometheus | Vendor-specific |
| Auto-instrumentation | 100+ libraries | 100+ libraries |
OTel doesn't replace Datadog entirely — Datadog still has excellent UX and ML-based anomaly detection. But OTel lets you choose your backend, or even fan out to multiple backends simultaneously. Your instrumentation code is write-once.
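The fan-out happens in the OpenTelemetry Collector, not in your application. As a sketch (exporter names and endpoints here are illustrative; the `datadog` exporter ships in the collector-contrib distribution, and the `receivers`/`processors` sections are elided):

```yaml
# One trace pipeline, two backends at once
exporters:
  otlp/tempo:
    endpoint: tempo:4317
  datadog:
    api:
      key: ${DD_API_KEY}
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo, datadog]   # both receive every trace
```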
3. Setting Up the OTel SDK in Node.js/Express
Installation
npm install \
@opentelemetry/sdk-node \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-http \
@opentelemetry/exporter-prometheus \
@opentelemetry/sdk-metrics \
@opentelemetry/semantic-conventions
The Tracing Entrypoint
Create tracing.js and require it before anything else. This is critical — OTel patches modules at import time, so it must run first.
// tracing.js
'use strict';
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-http');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const resource = new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: 'order-service',
[SemanticResourceAttributes.SERVICE_VERSION]: '1.4.2',
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development',
});
const traceExporter = new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT || 'http://localhost:4318/v1/traces',
});
const metricExporter = new OTLPMetricExporter({
url: process.env.OTEL_EXPORTER_OTLP_METRICS_ENDPOINT || 'http://localhost:4318/v1/metrics',
});
const sdk = new NodeSDK({
resource,
traceExporter,
metricReader: new PeriodicExportingMetricReader({
exporter: metricExporter,
exportIntervalMillis: 15_000, // export every 15 seconds
}),
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-fs': { enabled: false }, // too noisy
'@opentelemetry/instrumentation-http': { enabled: true },
'@opentelemetry/instrumentation-express': { enabled: true },
'@opentelemetry/instrumentation-pg': { enabled: true },
'@opentelemetry/instrumentation-redis': { enabled: true },
}),
],
});
sdk.start();
process.on('SIGTERM', () => {
sdk.shutdown()
.then(() => console.log('OTel SDK shut down'))
.catch(err => console.error('Error shutting down OTel SDK', err))
.finally(() => process.exit(0));
});
Starting Your App
// package.json
{
"scripts": {
"start": "node -r ./tracing.js server.js"
}
}
Or using the --require flag directly:
node --require ./tracing.js server.js
The Express Server
// server.js
const express = require('express');
const { Pool } = require('pg');
const redis = require('redis');
const app = express();
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const redisClient = redis.createClient({ url: process.env.REDIS_URL });
redisClient.connect().catch((err) => console.error('Redis connection failed', err)); // connect() returns a promise
app.get('/orders/:id', async (req, res) => {
const { id } = req.params;
// Redis cache lookup
const cached = await redisClient.get(`order:${id}`);
if (cached) {
return res.json(JSON.parse(cached));
}
// Postgres query
const { rows } = await pool.query('SELECT * FROM orders WHERE id = $1', [id]);
if (!rows.length) return res.status(404).json({ error: 'not found' });
await redisClient.setEx(`order:${id}`, 300, JSON.stringify(rows[0]));
res.json(rows[0]);
});
app.listen(3000, () => console.log('Listening on :3000'));
With tracing.js loaded, every HTTP request, Postgres query, and Redis command is automatically captured as a span — zero additional code required.
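On a cache miss, the resulting trace for this route looks roughly like the following waterfall (span names and timings are illustrative of what the auto-instrumentations emit):

```
GET /orders/:id                             182ms
├─ redis-GET order:42                         3ms   (cache miss)
├─ pg.query SELECT * FROM orders WHERE id   141ms
└─ redis-SETEX order:42                       2ms
```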
4. Auto-Instrumentation for HTTP, DB, and Redis
getNodeAutoInstrumentations() bundles instrumentation for dozens of popular libraries out of the box (the broader OTel registry covers 100+). Here's what you get for free:
HTTP/Express: Every inbound request becomes a root span with attributes like http.method, http.route, http.status_code, http.url. Every outbound https.request() or axios/fetch call becomes a child span with the remote URL and status.
PostgreSQL (pg): Every pool.query() call becomes a span with db.system=postgresql, db.statement (the SQL), and db.name. You'll see the exact query text in your trace.
Redis: Every Redis command (GET, SET, HGET, etc.) becomes a span with db.system=redis and db.statement.
Other auto-instrumented libraries include: mysql2, mongodb, grpc-js, graphql, kafkajs, aws-sdk, ioredis, knex, typeorm, and many more.
Disabling Noisy Instrumentations
The filesystem instrumentation (@opentelemetry/instrumentation-fs) creates a span for every fs.readFile call, which is typically too noisy. Disable it explicitly as shown in the SDK setup above.
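Health checks are another common noise source. A sketch, assuming the `ignoreIncomingRequestHook` option exposed by `@opentelemetry/instrumentation-http` (check the README of your installed version, as instrumentation options have changed across releases):

```javascript
// Predicate to skip tracing for health-check and metrics endpoints.
// The path list is an example; adjust it to your own probe routes.
const IGNORED_PATHS = new Set(['/healthz', '/readyz', '/metrics']);

function shouldIgnoreRequest(req) {
  // Strip the query string before comparing against the ignore list
  const path = (req.url || '').split('?')[0];
  return IGNORED_PATHS.has(path);
}

module.exports = { shouldIgnoreRequest };
```

Wired in via `getNodeAutoInstrumentations({ '@opentelemetry/instrumentation-http': { ignoreIncomingRequestHook: shouldIgnoreRequest } })`.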
5. Custom Spans and Attributes
Auto-instrumentation is great, but business logic needs custom spans. Use the trace API to create them.
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('order-service', '1.4.2');
async function processPayment(orderId, amount, currency) {
// Create a custom span
return tracer.startActiveSpan('payment.process', async (span) => {
try {
// Add semantic attributes
span.setAttributes({
'order.id': orderId,
'payment.amount': amount,
'payment.currency': currency,
'payment.provider': 'stripe',
});
const result = await stripeClient.charges.create({
amount: Math.round(amount * 100), // Stripe expects integer cents; 99.99 * 100 is not exact in floating point
currency,
source: await getPaymentToken(orderId),
});
span.setAttributes({
'payment.charge_id': result.id,
'payment.status': result.status,
});
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (err) {
// Record the exception — this adds a span event with the stack trace
span.recordException(err);
span.setStatus({
code: SpanStatusCode.ERROR,
message: err.message,
});
throw err;
} finally {
span.end();
}
});
}
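The try/recordException/finally pattern above is easy to get wrong when repeated across many functions, so it is worth extracting into a small wrapper. This is a sketch, not part of the OTel API: `withSpan` and its `tracer` parameter are our own names, and the status codes mirror `SpanStatusCode` from `@opentelemetry/api` (OK = 1, ERROR = 2).

```javascript
// Status codes matching @opentelemetry/api's SpanStatusCode enum
const SpanStatusCode = { OK: 1, ERROR: 2 };

// Run `fn` inside a named active span, recording attributes, status,
// and exceptions, and always ending the span.
async function withSpan(tracer, name, attributes, fn) {
  return tracer.startActiveSpan(name, async (span) => {
    try {
      span.setAttributes(attributes);
      const result = await fn(span);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.recordException(err); // adds a span event with the stack trace
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
      throw err;
    } finally {
      span.end();
    }
  });
}

module.exports = { withSpan, SpanStatusCode };
```

With this helper, `processPayment` shrinks to a call like `withSpan(tracer, 'payment.process', { 'order.id': orderId }, async (span) => { ... })`.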
Nested Spans
Child spans are automatically associated with the parent when created inside startActiveSpan:
async function fulfillOrder(orderId) {
return tracer.startActiveSpan('order.fulfill', async (parentSpan) => {
parentSpan.setAttribute('order.id', orderId);
// This span is automatically a child of order.fulfill
const payment = await processPayment(orderId, 99.99, 'usd');
// Another child span
await tracer.startActiveSpan('order.notify', async (notifySpan) => {
await sendConfirmationEmail(orderId);
notifySpan.end();
});
parentSpan.end();
return { orderId, payment };
});
}
Adding Span Events
Span events are timestamped annotations within a span — useful for checkpoints:
span.addEvent('cache.miss', { 'cache.key': `order:${id}` });
span.addEvent('db.query.start');
// ... query executes ...
span.addEvent('db.query.complete', { 'db.rows_returned': rows.length });
6. Metrics: Counters, Histograms, and Gauges
OpenTelemetry Metrics gives you the three instrument types you need for production dashboards.
// metrics.js
const { metrics } = require('@opentelemetry/api');
const meter = metrics.getMeter('order-service', '1.4.2');
// Counter: monotonically increasing (requests, errors, events)
const requestCounter = meter.createCounter('http.requests.total', {
description: 'Total number of HTTP requests',
});
// Histogram: distribution of values (latency, payload size)
const latencyHistogram = meter.createHistogram('http.request.duration_ms', {
description: 'HTTP request latency in milliseconds',
unit: 'ms',
advice: {
explicitBucketBoundaries: [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000],
},
});
// UpDownCounter: can go up or down (queue depth, active connections)
const activeConnections = meter.createUpDownCounter('db.connections.active', {
description: 'Active database connections',
});
// Observable Gauge: sampled on demand (CPU, memory — use callbacks)
const memoryGauge = meter.createObservableGauge('process.memory_mb', {
description: 'Process memory usage in MB',
});
memoryGauge.addCallback((observableResult) => {
const usage = process.memoryUsage();
observableResult.observe(usage.heapUsed / 1024 / 1024, { type: 'heap' });
observableResult.observe(usage.rss / 1024 / 1024, { type: 'rss' });
});
module.exports = { requestCounter, latencyHistogram, activeConnections };
Using Metrics in Middleware
// middleware/metrics.js
const { requestCounter, latencyHistogram } = require('../metrics');
function metricsMiddleware(req, res, next) {
const start = Date.now();
res.on('finish', () => {
const duration = Date.now() - start;
const labels = {
method: req.method,
route: req.route?.path || 'unknown',
status_code: String(res.statusCode),
};
requestCounter.add(1, labels);
latencyHistogram.record(duration, labels);
});
next();
}
module.exports = metricsMiddleware;
// server.js
app.use(require('./middleware/metrics'));
Now Prometheus (or any OTLP-compatible backend) receives per-route request counts and latency histograms, enabling p50/p95/p99 latency dashboards.
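With the histogram in Prometheus, a p99 panel is one query away. The exact metric name depends on your exporter's naming rules; this sketch assumes the OTLP-to-Prometheus pipeline exposes the histogram as `http_request_duration_ms_bucket`:

```
histogram_quantile(
  0.99,
  sum by (le, route) (rate(http_request_duration_ms_bucket[5m]))
)
```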
7. Correlating Logs with Trace IDs
The power of the three pillars comes from connecting them. When a log line carries the same traceId and spanId as the trace you're investigating, you can jump directly between them.
OpenTelemetry makes this trivial with the context API:
// logger.js — structured logger with automatic trace correlation
const { trace, context } = require('@opentelemetry/api');
function getTraceContext() {
const span = trace.getActiveSpan();
if (!span) return {};
const { traceId, spanId, traceFlags } = span.spanContext();
return {
traceId,
spanId,
traceSampled: (traceFlags & 0x01) === 1,
};
}
const logger = {
info(message, extra = {}) {
console.log(JSON.stringify({
level: 'info',
message,
timestamp: new Date().toISOString(),
service: 'order-service',
...getTraceContext(),
...extra,
}));
},
error(message, err, extra = {}) {
console.error(JSON.stringify({
level: 'error',
message,
timestamp: new Date().toISOString(),
service: 'order-service',
error: { name: err?.name, message: err?.message, stack: err?.stack },
...getTraceContext(),
...extra,
}));
},
};
module.exports = logger;
// In your route handler
const logger = require('./logger');
app.get('/orders/:id', async (req, res) => {
logger.info('Fetching order', { orderId: req.params.id });
try {
const order = await getOrder(req.params.id);
logger.info('Order retrieved', { orderId: req.params.id, status: order.status });
res.json(order);
} catch (err) {
logger.error('Failed to fetch order', err, { orderId: req.params.id });
res.status(500).json({ error: 'internal error' });
}
});
This produces logs like:
{
"level": "info",
"message": "Order retrieved",
"timestamp": "2026-03-22T14:23:01.882Z",
"service": "order-service",
"traceId": "3e8a1b2c4d5e6f7a8b9c0d1e2f3a4b5c",
"spanId": "a1b2c3d4e5f6a7b8",
"traceSampled": true,
"orderId": "ord_9182",
"status": "shipped"
}
In Grafana, you can click traceId in a log line and jump directly to the Tempo trace. Or pivot from a slow trace to its correlated logs. This is the three-pillar payoff.
8. Exporting to Jaeger (Local Dev) and OTLP (Production)
Local Development with Jaeger
Run Jaeger all-in-one with Docker Compose:
# docker-compose.yml
version: '3.8'
services:
jaeger:
image: jaegertracing/all-in-one:1.54
ports:
- "16686:16686" # Jaeger UI
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
environment:
- COLLECTOR_OTLP_ENABLED=true
app:
build: .
environment:
- OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://jaeger:4318/v1/traces
- NODE_ENV=development
depends_on:
- jaeger
Start it:
docker-compose up -d
# App runs on :3000, Jaeger UI at http://localhost:16686
Make a few requests to your Express app, then open http://localhost:16686. Select order-service from the dropdown. You'll see a waterfall of every request with its child spans — HTTP, Postgres, Redis — with exact durations.
Production: OTel Collector + OTLP
For production, route telemetry through the OpenTelemetry Collector. It handles batching, retries, filtering, and fan-out to multiple backends.
# otel-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 1s
send_batch_size: 1024
memory_limiter:
check_interval: 1s
limit_mib: 512
exporters:
otlp/tempo:
endpoint: tempo:4317
tls:
insecure: true
prometheusremotewrite:
endpoint: http://prometheus:9090/api/v1/write
logging:
loglevel: warn
service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [otlp/tempo]
metrics:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [prometheusremotewrite]
Your Node.js service just points to the collector:
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
The collector handles everything downstream — you can swap backends without touching application code.
9. Sampling Strategies
Sampling is essential. In high-traffic production systems, recording every single trace is prohibitively expensive. The goal is to capture enough data to debug issues without burning storage and money.
Head-Based Sampling (SDK-Level)
Decide at the start of a request whether to record it:
const { TraceIdRatioBasedSampler, ParentBasedSampler } = require('@opentelemetry/sdk-trace-base');
// Sample 10% of traces, but always respect parent sampling decision
const sampler = new ParentBasedSampler({
root: new TraceIdRatioBasedSampler(0.1), // 10% sampling rate
});
const sdk = new NodeSDK({
sampler,
// ...rest of config
});
ParentBasedSampler is critical in microservices: if Service A decides to sample a trace, Service B keeps recording it even when B's own sampling ratio is lower. Without it, traces would be truncated at service boundaries.
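The decision logic can be sketched in a few lines. This is a simplification, not the SDK's actual implementation (the real ParentBasedSampler also distinguishes remote vs. local parents and span links):

```javascript
// Simplified parent-based sampling decision:
// a child span inherits its parent's sampled flag;
// only root spans (no parent) consult the ratio sampler.
function parentBasedDecision(parentSampled, rootDecision) {
  return parentSampled === undefined ? rootDecision() : parentSampled;
}

module.exports = { parentBasedDecision };
```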
Always-Sample Errors (Tail-Based via Collector)
The collector's tail_sampling processor makes sampling decisions after seeing the full trace, which lets you always keep error traces:
# otel-collector-config.yaml (tail sampling)
processors:
tail_sampling:
decision_wait: 10s
num_traces: 50000
expected_new_traces_per_sec: 1000
policies:
- name: errors-policy
type: status_code
status_code: { status_codes: [ERROR] }
- name: slow-traces-policy
type: latency
latency: { threshold_ms: 2000 }
- name: random-policy
type: probabilistic
probabilistic: { sampling_percentage: 5 }
This configuration:
- Always keeps traces with errors
- Always keeps traces slower than 2 seconds
- Randomly samples 5% of everything else
This is the gold standard for production sampling — you never miss an interesting trace.
10. Visualizing Traces in Grafana Tempo
Grafana Tempo is the open-source distributed tracing backend that integrates natively with Grafana dashboards. It's built for scale: storing traces in object storage (S3/GCS) at a fraction of the cost of Jaeger's Elasticsearch backend.
Full Stack with Docker Compose
# docker-compose.prod-local.yml
version: '3.8'
services:
tempo:
image: grafana/tempo:2.3.1
command: ["-config.file=/etc/tempo.yaml"]
volumes:
- ./tempo.yaml:/etc/tempo.yaml
- tempo-data:/var/tempo
ports:
- "4317:4317" # OTLP gRPC
- "3200:3200" # Tempo query API
prometheus:
image: prom/prometheus:v2.48.0
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
ports:
- "9090:9090"
grafana:
image: grafana/grafana:10.2.2
ports:
- "3001:3000"
environment:
- GF_AUTH_ANONYMOUS_ENABLED=true
- GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
volumes:
- ./grafana/provisioning:/etc/grafana/provisioning
volumes:
tempo-data:
# tempo.yaml
server:
http_listen_port: 3200
distributor:
receivers:
otlp:
protocols:
grpc:
http:
storage:
trace:
backend: local
local:
path: /var/tempo/traces
wal:
path: /var/tempo/wal
Grafana Data Sources
In Grafana's provisioning config, link Tempo and Prometheus:
# grafana/provisioning/datasources/datasources.yaml
apiVersion: 1
datasources:
- name: Tempo
type: tempo
url: http://tempo:3200
jsonData:
tracesToLogsV2:
datasourceUid: loki
serviceMap:
datasourceUid: prometheus
nodeGraph:
enabled: true
- name: Prometheus
type: prometheus
url: http://prometheus:9090
uid: prometheus
What You Get
With this stack running, Grafana gives you:
- Service Map — an auto-generated DAG of all your services and their dependencies, with error rate and latency for each edge
- Trace Waterfall — click any span to see attributes, events, and linked logs
- RED Dashboard — Rate, Errors, Duration for each service endpoint, derived automatically from trace data
- Metrics Correlation — jump from a Prometheus alert to the traces that fired during the anomaly window
The TraceQL query language (Grafana Tempo's trace query DSL) lets you filter traces programmatically:
{ resource.service.name = "order-service" && span.http.status_code >= 500 } | rate()
This returns a time series of error rates across all spans matching those conditions — a metric derived directly from trace data, without separate instrumentation.
Putting It All Together
Here's a production-ready tracing.js that handles environment-based configuration:
// tracing.js — production-ready
'use strict';
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-http');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');
const isProd = process.env.NODE_ENV === 'production';
const resource = new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: process.env.OTEL_SERVICE_NAME || 'my-service',
[SemanticResourceAttributes.SERVICE_VERSION]: process.env.npm_package_version || '0.0.0',
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development',
});
const sdk = new NodeSDK({
resource,
sampler: new ParentBasedSampler({
root: new TraceIdRatioBasedSampler(isProd ? 0.1 : 1.0), // 100% in dev, 10% in prod
}),
traceExporter: new OTLPTraceExporter(), // uses OTEL_EXPORTER_OTLP_ENDPOINT env var
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter(),
exportIntervalMillis: isProd ? 15_000 : 5_000,
}),
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-fs': { enabled: false },
}),
],
});
sdk.start();
console.log(`OTel SDK started [${process.env.NODE_ENV}]`);
process.on('SIGTERM', async () => {
await sdk.shutdown();
process.exit(0);
});
Set these environment variables in your deployment:
OTEL_SERVICE_NAME=order-service
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
NODE_ENV=production
Conclusion
OpenTelemetry in Node.js has reached production maturity. The auto-instrumentation layer handles the heavy lifting for HTTP, databases, and cache — giving you detailed traces from day one. Custom spans let you annotate your business logic with the context that matters. Metrics and correlated logs complete the observability picture.
The vendor-neutral design is the real win: instrument once, export anywhere. Start with Jaeger locally to get familiar with traces. Graduate to the OTel Collector + Grafana Tempo + Prometheus stack for production. When (or if) you need to add Datadog or New Relic for specific features, you add a collector exporter — your application code doesn't change.
The full stack outlined here costs nothing to run on your own infrastructure except compute. For most teams, that means replacing $1,000+/month APM bills with a self-hosted stack that gives you more control and equal visibility.
Start with node -r ./tracing.js server.js. You'll have your first traces in under five minutes.
Wilson Xu is a backend engineer specializing in distributed systems and developer tooling. He writes about Node.js, observability, and cloud-native infrastructure.