DEV Community

Olawale Afuye


OpenTelemetry for Node.js Developers: A Practical Guide to Observability in Distributed Systems

TL;DR: Your app is running. Users are complaining. You have no idea why. This is what happens when you skip observability. OpenTelemetry (OTel) fixes that, and this guide shows you exactly how to implement it in Node.js.


The Problem No Dashboard Will Tell You About

You've deployed your microservices. Everything looks green on the surface. Then at 2am, Slack goes off: "Checkout is broken."

You open your logs. You see... something. You check your metrics. You see... a spike. But which service caused it? Was it the payment service timing out? Was the inventory service returning bad data? Was the API gateway dropping requests?

Without proper observability, you're debugging in the dark.

This is the exact problem OpenTelemetry (OTel) was designed to solve.


What Is Observability (And Why "Monitoring" Isn't Enough)

Monitoring tells you that something is wrong. Observability tells you why.

Formally: observability is the measure of how well you can understand the internal state of a system based on the data it produces as output.

In a distributed, microservice-based architecture, you can't step inside a service and watch it run. Observability is how you get that visibility from the outside in.

It relies on four data types, collectively known as M.E.L.T.:

| Pillar | What It Is | What It Tells You |
|---|---|---|
| Metrics | Numeric measurements at regular intervals | System performance trends over time (e.g. error rate, p99 latency) |
| Events | Discrete actions at a point in time | Business-level triggers (e.g. user purchase, payment initiated) |
| Logs | Granular, timestamped application output | Millisecond-by-millisecond reconstruction of events |
| Traces | A record of a request's full journey | Causal chains, bottlenecks, and cross-service latency |

Most teams start with logs and call it a day. That's like having a CCTV system with no timestamps and no audio. Traces and metrics are what turn raw logs into a story you can actually debug.

OTel's actual signal taxonomy: OpenTelemetry formalizes three primary signals — Traces, Metrics, and Logs. In OTel, Events are expressed as Span Events (attached to a trace span) or structured log records, not a standalone fourth signal. M.E.L.T. is a broader observability framework concept popularized by vendors like New Relic. It's a useful mental model, but don't conflate it with OTel's own data model.


A Brief History: Why OpenTelemetry Exists

Before 2019, the observability ecosystem was fragmented. Two major open-source projects competed for adoption:

  • OpenTracing — a vendor-neutral API for distributed tracing
  • OpenCensus — Google's solution for metrics and tracing

Both were good. Both were different. Neither was a standard.

In 2019, the two projects merged to form OpenTelemetry — a single, vendor-neutral, CNCF-backed framework for generating, collecting, and exporting telemetry data. Today it is the second most active CNCF project after Kubernetes.

The mandate is clear: write your instrumentation once, send it anywhere.


The Four Concepts You Must Understand Before Writing Code

1. Spans — The Atomic Unit of a Trace

A span represents a single unit of work: an HTTP call, a database query, a function execution. Every span has:

  • A name
  • A start and end timestamp (giving you latency)
  • A status (including whether it errored; errors show as red in Zipkin)
  • Optional attributes (metadata)

Spans are linked in parent-child relationships. When your Dashboard service calls your Movies service, the Dashboard's span becomes the parent, and the Movies service creates a child span. Together, they form a trace: the complete map of a single request.

[Dashboard Service] ──────────────────────────── 320ms
    └── [Movies Service API Call] ────────── 290ms  ← bottleneck
          └── [DB Query: find_all_movies] ── 270ms  ← root cause

This is exactly the kind of visualization Zipkin gives you — and why distributed tracing is invaluable.
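To see why the DB query is flagged as the root cause, it helps to subtract each span's child time from its own duration. The sketch below does that for the three spans in the diagram; the span objects and field names (`id`, `parentId`, `durationMs`) are made up for illustration and are not the OTel data model, and it assumes child spans do not overlap in time.

```javascript
// Hypothetical span records mirroring the diagram above (durations in ms).
const spans = [
  { id: 'a', parentId: null, name: 'Dashboard Service', durationMs: 320 },
  { id: 'b', parentId: 'a', name: 'Movies Service API Call', durationMs: 290 },
  { id: 'c', parentId: 'b', name: 'DB Query: find_all_movies', durationMs: 270 },
];

// Self time = a span's duration minus the time spent in its direct children.
function selfTimes(spanList) {
  const childTotal = new Map();
  for (const s of spanList) {
    if (s.parentId !== null) {
      childTotal.set(s.parentId, (childTotal.get(s.parentId) ?? 0) + s.durationMs);
    }
  }
  return spanList.map((s) => ({
    name: s.name,
    selfMs: s.durationMs - (childTotal.get(s.id) ?? 0),
  }));
}

// Sort so the span that burned the most time on its own comes first.
const ranked = selfTimes(spans).sort((x, y) => y.selfMs - x.selfMs);
console.log(ranked);
```

The two service spans spend only 30ms and 20ms on their own work; almost all of the 320ms sits in the query.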

2. Span Context and Correlation Context

For spans across services to be linked into one trace, metadata must travel with each request. This is handled by two mechanisms:

  • Span Context: carries the traceId, spanId, and traceFlags, the IDs that tell the backend these spans belong together. Mandatory.
  • Correlation Context (named Baggage in current OpenTelemetry): carries user-defined properties like customerId, dataRegion, or providerHostname. Optional, but powerful for business-level debugging.

The OTel HTTP auto-instrumentation propagates this context automatically via HTTP headers: by default, the W3C traceparent header carries the Span Context and the baggage header carries the Baggage entries.
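To make the Span Context less abstract, here is a minimal sketch of what travels in the W3C `traceparent` header and how it breaks down into traceId, spanId, and flags. You would not write this yourself in a real service (the instrumentation does it for you); `parseTraceparent` is a hypothetical helper, and the regex only covers the version `00` layout.

```javascript
// W3C Trace Context header layout: version-traceId-spanId-traceFlags (lowercase hex).
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header) {
  const match = TRACEPARENT_RE.exec(String(header).trim());
  if (!match) return null;
  const [, version, traceId, spanId, flags] = match;
  // Version ff and all-zero IDs are invalid per the spec.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}

// Example header value taken from the W3C Trace Context specification.
const ctx = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
console.log(ctx);
```

Every service that receives this header creates its spans under the same traceId, which is what lets the backend stitch them into one trace.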

3. Metrics vs. Traces, When to Use Which

| Concern | Use Traces | Use Metrics |
|---|---|---|
| "Why is this request slow?" | ✓ | |
| "What's our p99 latency over 30 days?" | | ✓ |
| "Which service is the bottleneck?" | ✓ | |
| "How many requests per second are we handling?" | | ✓ |

Traces answer why at the individual request level. Metrics answer what at aggregate scale over time.

4. The OpenTelemetry Collector — Your Telemetry Router

Without a Collector, your service sends data directly to a specific backend (Zipkin, Jaeger, etc.). Change the backend and you re-instrument every service.

The OTel Collector sits in between:

[Service A] ──┐
[Service B] ──┼──► (OTLP) ──► [OTel Collector] ──► [Zipkin]     (local/dev)
[Service C] ──┘                      │              [New Relic]  (production)
                                     └────────────► [Prometheus] (metrics)

Your services always speak OTLP (OpenTelemetry's native protocol) to the Collector. The Collector then speaks whatever protocol each backend requires. This is the key architectural point: your application code is completely decoupled from the backend's data format.

It has three stages:

  1. Receiver: accepts data in multiple formats (OTLP, Zipkin, Jaeger, Prometheus, FluentBit)
  2. Processor: filters, batches, or transforms data before export
  3. Exporter: forwards to one or more backends

Swap backends without touching a single line of application code. This is the right architecture for production.


Setting Up Distributed Tracing in Node.js

The recommended architecture in 2025+ is: Node.js → OTLP → OTel Collector → backend. Your application always exports via OTLP (OTel's native protocol), and the Collector handles routing to whatever backend you're using.

We'll use Zipkin as the local visualization backend here — it's an excellent learning tool for seeing traces. In production you'd swap the Collector's exporter to New Relic, Jaeger, or wherever you're sending data, without touching the application.

Step 1: Install Dependencies

npm install @opentelemetry/sdk-node \
            @opentelemetry/exporter-trace-otlp-http \
            @opentelemetry/auto-instrumentations-node

@opentelemetry/sdk-node is the modern entry point. It wires together the trace provider, resource detection, and auto-instrumentations in one place. The older pattern of manually constructing NodeTracerProvider and calling provider.register() still works but is more verbose and prone to misconfiguration.

Step 2: Create Your tracing.js File

Keep instrumentation code in its own file, completely separate from your application logic.

// tracing.js
'use strict';

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  serviceName: 'movies-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces', // OTel Collector OTLP/HTTP endpoint
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Graceful shutdown — flush spans before the process exits
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});

console.log('Tracing initialized');

NodeSDK handles resource attribute detection automatically (service name, runtime version, host info) and manages the lifecycle of the SDK cleanly. The SIGTERM handler ensures buffered spans are flushed rather than dropped when the process shuts down.

Step 3: Load It Before Your App Starts

Use the node -r flag to require the tracing file before your application code runs. This ensures that all subsequent modules are instrumented from the start.

node -r ./tracing.js index.js

Or in your package.json:

{
  "scripts": {
    "start": "node -r ./tracing.js index.js"
  }
}

Why -r? OTel works by monkey-patching Node.js core modules (like http). If your app loads before the instrumentation, those patches won't apply to already-loaded modules.
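The load-order problem is easier to see in miniature. The sketch below is only an analogy: `fakeHttp` is a hypothetical stand-in for a core module, and real OTel instrumentation hooks module loading rather than patching a plain object. The failure mode is the same, though: anything that grabbed a reference before the patch never goes through it.

```javascript
// A stand-in for a core module such as http (hypothetical, for illustration only).
const fakeHttp = { request: (url) => `GET ${url}` };

// "App code" that ran BEFORE instrumentation and kept a direct reference.
const earlyRequest = fakeHttp.request;

// "Instrumentation" wraps the export and counts calls,
// the way a tracing patch would record spans.
let recorded = 0;
const original = fakeHttp.request;
fakeHttp.request = function patchedRequest(...args) {
  recorded += 1;
  return original.apply(this, args);
};

earlyRequest('/movies');     // bypasses the patch: nothing recorded
fakeHttp.request('/movies'); // goes through the patch: recorded

console.log(recorded);
```

Only one of the two calls is recorded. Loading tracing.js first means nothing in your app gets the chance to hold that early reference.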

Step 4: Run the OTel Collector + Zipkin via Docker

# docker-compose.yml
services:
  zipkin:
    image: openzipkin/zipkin
    ports:
      - "9411:9411"

  otel-collector:
    image: otel/opentelemetry-collector-contrib
    command: ["--config=/etc/otel-config.yaml"]
    volumes:
      - ./otel-config.yaml:/etc/otel-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    depends_on:
      - zipkin

The Collector configuration that the Compose file mounts:

# otel-config.yaml (learning setup)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

processors:
  batch: {}

exporters:
  zipkin:
    endpoint: "http://zipkin:9411/api/v2/spans"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [zipkin]

Start the stack with docker-compose up, run your service, make some requests, and open http://localhost:9411. You'll see traces appear with parent-child span relationships, latency measurements per hop, and red error flags where things fail.


Setting Up Metrics Collection with Prometheus

Metrics are cheaper to store and better for trend analysis than traces. The standard framework for service-level metrics is RED: Rate (requests/second), Errors (error rate), Duration (latency). OTel supports all three through a Meter.

A counter covers Rate and Errors. For Duration, you need a histogram — not another counter. Histograms are what actually give you p95 and p99 latency, which are far more useful for SLOs than raw request counts.
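Here is a rough sketch of how a percentile falls out of histogram buckets. It is not how the OTel SDK or Prometheus implements it internally; the bucket bounds, the helper names, and the sample data are all assumptions for illustration, and the estimate uses simple linear interpolation inside the matching bucket.

```javascript
// Upper bounds (ms) of each bucket; one extra open-ended bucket catches the rest.
const bounds = [50, 100, 250, 500, 1000];

function createHistogram() {
  return { counts: new Array(bounds.length + 1).fill(0), total: 0 };
}

function record(hist, valueMs) {
  let i = bounds.findIndex((b) => valueMs <= b);
  if (i === -1) i = bounds.length; // overflow bucket
  hist.counts[i] += 1;
  hist.total += 1;
}

// Estimate quantile q (0..1) by interpolating inside the bucket that contains it.
function estimateQuantile(hist, q) {
  const rank = q * hist.total;
  let seen = 0;
  for (let i = 0; i < hist.counts.length; i++) {
    const count = hist.counts[i];
    if (count > 0 && seen + count >= rank) {
      // The overflow bucket has no upper bound, so report the last known bound.
      if (i === bounds.length) return bounds[bounds.length - 1];
      const lower = i === 0 ? 0 : bounds[i - 1];
      return lower + (bounds[i] - lower) * ((rank - seen) / count);
    }
    seen += count;
  }
  return NaN; // empty histogram
}

const hist = createHistogram();
// Made-up traffic: 90 fast requests and 10 slow ones.
for (let i = 0; i < 90; i++) record(hist, 40);
for (let i = 0; i < 10; i++) record(hist, 800);

const p50 = estimateQuantile(hist, 0.5);
const p99 = estimateQuantile(hist, 0.99);
console.log({ p50, p99 });
```

The median lands in the fastest bucket while p99 lands near the top of the 500 to 1000ms bucket, which is the tail a plain counter (or an average) would hide.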

Step 1: Install Dependencies

npm install @opentelemetry/sdk-metrics \
            @opentelemetry/exporter-prometheus

Step 2: Initialize Your Meter, Counter, and Histogram

// metrics.js
const { MeterProvider } = require('@opentelemetry/sdk-metrics');
const { PrometheusExporter } = require('@opentelemetry/exporter-prometheus');

const exporter = new PrometheusExporter({ port: 9464 }, () => {
  console.log('Prometheus scrape endpoint: http://localhost:9464/metrics');
});

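// Pass the reader at construction time; SDK 2.x no longer has addMetricReader().
// (On older 1.x releases, call meterProvider.addMetricReader(exporter) instead.)
const meterProvider = new MeterProvider({ readers: [exporter] });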

const meter = meterProvider.getMeter('movies-service');

const requestCounter = meter.createCounter('http_requests_total', {
  description: 'Total number of HTTP requests received',
});

// Histogram for latency — the correct instrument for p95/p99 analysis
const requestDuration = meter.createHistogram('http_request_duration_ms', {
  description: 'HTTP request latency in milliseconds',
  unit: 'ms',
});

module.exports = { requestCounter, requestDuration };

Step 3: Instrument Your Express Middleware

Note the res.on('finish', ...) pattern below. Reading res.statusCode directly in the middleware body gives you the default value (usually 200) before the response is actually sent — the finish event gives you the real status code and accurate duration.

// index.js
const express = require('express');
const { requestCounter, requestDuration } = require('./metrics');

const app = express();

app.use((req, res, next) => {
  const startTime = Date.now();

  res.on('finish', () => {
    const attributes = {
      method: req.method,
      // Prefer the matched route pattern; raw paths can explode label cardinality
      route: req.route?.path ?? req.path,
      status_code: res.statusCode,
    };
    requestCounter.add(1, attributes);
    requestDuration.record(Date.now() - startTime, attributes);
  });

  next();
});

Step 4: Configure Prometheus to Scrape Your App

# prometheus.yml
global:
  scrape_interval: 15s  # Balance between granularity and performance

scrape_configs:
  - job_name: 'movies-service'
    static_configs:
      - targets: ['host.docker.internal:9464']

Environment note: host.docker.internal resolves on Docker Desktop (macOS and Windows). On Linux, Docker does not add this hostname by default; use 172.17.0.1 (the default Docker bridge gateway) or the host's actual IP. In Kubernetes, replace this entirely with a proper service discovery config or a PodMonitor.

Why 15 seconds? Granular enough for accurate per-minute rate calculations and fast anomaly detection, without generating an excessive volume of data points.
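As a quick illustration of what a per-minute rate calculation looks like at this interval, the sketch below takes five scrapes of a counter and derives requests per second. The sample values are invented, and the helper ignores counter resets, which Prometheus's own rate() function handles for you.

```javascript
// Hypothetical scrapes of http_requests_total taken every 15 seconds.
const samples = [
  { t: 0, value: 1200 },
  { t: 15, value: 1290 },
  { t: 30, value: 1365 },
  { t: 45, value: 1470 },
  { t: 60, value: 1560 },
];

// Average per-second rate over the window: counter increase / elapsed seconds.
function ratePerSecond(points) {
  const first = points[0];
  const last = points[points.length - 1];
  return (last.value - first.value) / (last.t - first.t);
}

const rate = ratePerSecond(samples);
console.log(rate);
```

With a 15-second interval you get five data points per one-minute window, enough to smooth out a single odd scrape without waiting long to notice a spike.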


Routing Everything Through the OTel Collector to New Relic

Once you're ready for production, keep the same OTLP-first architecture: your services already speak OTLP to the Collector, so only the Collector's exporter changes.

otel-config.yaml (production)

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"

processors:
  batch: {}

exporters:
  otlp:
    endpoint: "https://otlp.nr-data.net:4317"
    headers:
      api-key: "${NEW_RELIC_LICENSE_KEY}"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]

docker-compose.yml (partial, production)

otel-collector:
  image: otel/opentelemetry-collector-contrib
  command: ["--config=/etc/otel-config.yaml"]
  volumes:
    - ./otel-config.yaml:/etc/otel-config.yaml
  ports:
    - "4317:4317"   # OTLP gRPC
    - "4318:4318"   # OTLP HTTP
  environment:
    - NEW_RELIC_LICENSE_KEY=${NEW_RELIC_LICENSE_KEY}

Nothing in your application code changes between the local Zipkin setup and this production config. That's the whole point of OTLP-first: your app doesn't know or care what's downstream of the Collector. In New Relic's Explorer view, you can visualize all services, their dependencies, and drill into latency across your full distributed system.


Sampling: The Production Topic Nobody Puts in Tutorials

In development, tracing every request is fine. In production, with thousands of requests per second, tracing everything is a fast way to generate a very expensive observability bill and a lot of noise.

Sampling is how you control what percentage of traces you actually record.

Head Sampling (Probabilistic)

The decision is made at the start of a trace, before any data is collected:

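const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');

const sdk = new NodeSDK({
  serviceName: 'movies-service',
  // Record 10% of new traces; spans with a parent follow the parent's decision
  sampler: new ParentBasedSampler({ root: new TraceIdRatioBasedSampler(0.1) }),
  traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
  instrumentations: [getNodeAutoInstrumentations()],
});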

Simple and low-overhead. The tradeoff: you may drop the exact traces you needed, because errors and slow requests have the same chance of being dropped as fast successful ones.
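One detail worth knowing: a ratio-based decision is derived from the trace ID itself, not from a fresh random roll per service. The sketch below is a simplified illustration of that idea, not the SDK's exact algorithm; `shouldSample` is a hypothetical helper that maps the last 8 hex characters of the trace ID onto a number line.

```javascript
// Simplified illustration of ratio-based head sampling; not the SDK's exact algorithm.
function shouldSample(traceId, ratio) {
  // Treat the last 8 hex characters of the trace ID as a number in [0, 2^32).
  const bucket = parseInt(traceId.slice(-8), 16);
  // Keep the trace if it falls in the lowest `ratio` share of that range.
  return bucket < Math.floor(ratio * 0x100000000);
}

const traceId = '4bf92f3577b34da6a3ce929d0e0e4736';
const keptAtTenPercent = shouldSample(traceId, 0.1);
const keptAtFivePercent = shouldSample(traceId, 0.05);
console.log({ keptAtTenPercent, keptAtFivePercent });
```

Because the input is the trace ID, any service using the same ratio reaches the same verdict for the same trace, so you avoid traces where half the spans were kept and half were dropped.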

Tail Sampling (Collector-side)

The decision is made at the Collector after seeing the full trace. This lets you express rules like: "always keep error traces, always keep traces over 1 second, sample everything else at 5%." This is the correct approach for production.

# In otel-config.yaml (requires otel/opentelemetry-collector-contrib).
# Also list tail_sampling under service.pipelines.traces.processors, or it never runs.
processors:
  tail_sampling:
    decision_wait: 10s   # Wait up to 10s for all spans in a trace to arrive
    policies:
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow-traces
        type: latency
        latency: { threshold_ms: 1000 }
      - name: sample-everything-else
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

decision_wait needs to be long enough that all spans from across your services have arrived before the Collector makes its keep/drop call. Tune this based on your slowest service-to-service call.
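To make the three policies above concrete, here is the same keep/drop logic written as a plain function over a finished trace. This is a sketch of the decision only, not the Collector's implementation: the span shape (`status`, `startMs`, `endMs`) is hypothetical, and `random` is injectable purely so the probabilistic branch can be tested.

```javascript
// Decide whether to keep a completed trace, mirroring the three policies above.
function keepTrace(spans, random = Math.random) {
  // Policy 1: always keep traces that contain an error.
  if (spans.some((s) => s.status === 'ERROR')) return true;

  // Policy 2: always keep traces whose end-to-end duration exceeds 1 second.
  const start = Math.min(...spans.map((s) => s.startMs));
  const end = Math.max(...spans.map((s) => s.endMs));
  if (end - start > 1000) return true;

  // Policy 3: keep 5% of everything else.
  return random() < 0.05;
}

const failed = [{ status: 'ERROR', startMs: 0, endMs: 20 }];
const slow = [
  { status: 'OK', startMs: 0, endMs: 1500 },
  { status: 'OK', startMs: 100, endMs: 1400 },
];
const boring = [{ status: 'OK', startMs: 0, endMs: 30 }];

console.log(keepTrace(failed), keepTrace(slow), keepTrace(boring, () => 0.5));
```

Notice that the function needs every span of the trace before it can answer, which is exactly why the Collector has to wait for decision_wait before deciding.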

The rule: never run OTel in high-traffic production without a sampling strategy. Head sampling is a quick win; tail sampling is the right architecture.


The Security Angle: What OWASP Says About Telemetry

This is the part most tutorials skip. Don't.

The OWASP Top 10 includes A09:2021 – Security Logging and Monitoring Failures precisely because bad observability is a security risk, not just an operational one.

Risks specific to telemetry:

1. Sensitive Data Leaking into Spans and Logs

Over-instrumentation is real. If you're logging request bodies, database queries, or user-facing errors without sanitization, you may be storing passwords, credit card numbers, session tokens, or PII directly inside your observability platform, which is almost never protected as strictly as your production database.

Mitigation: Use OTel Collector processors to scrub sensitive fields before export:

processors:
  attributes:
    actions:
      - key: http.request.body
        action: delete
      - key: db.statement
        action: hash
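The same drop-or-hash idea can also be applied inside the application, before an attribute is ever attached to a span, as a second line of defense. The sketch below is one possible shape for that: `redactAttributes`, the two key sets, and the choice of SHA-256 are all assumptions made for this example, not something the Collector or the SDK prescribes.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical policy: keys to drop entirely and keys to replace with a hash.
const DROP_KEYS = new Set(['http.request.body']);
const HASH_KEYS = new Set(['db.statement']);

function redactAttributes(attributes) {
  const safe = {};
  for (const [key, value] of Object.entries(attributes)) {
    if (DROP_KEYS.has(key)) continue; // never leaves the process
    safe[key] = HASH_KEYS.has(key)
      ? createHash('sha256').update(String(value)).digest('hex')
      : value;
  }
  return safe;
}

const redacted = redactAttributes({
  'http.method': 'POST',
  'http.request.body': '{"card":"4111111111111111"}',
  'db.statement': "SELECT * FROM users WHERE email = 'a@example.com'",
});
console.log(redacted);
```

Hashing keeps identical statements groupable in your backend without exposing the literal values inside them.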

2. Log Injection

If user-controlled input flows directly into span attributes or log messages without sanitization, attackers can inject crafted entries to manipulate your log analysis tools or hide their activity.

Mitigation: Never log raw user input. Sanitize and validate before adding to span attributes.
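A minimal sketch of that sanitization step, assuming the main risk is newline and control-character injection plus oversized values. `sanitizeForTelemetry` and the 256-character cap are hypothetical choices; adjust both to your own log format.

```javascript
// Replace control characters (including CR/LF) with spaces and cap the length
// before the value reaches a log line or span attribute.
function sanitizeForTelemetry(input, maxLength = 256) {
  return String(input)
    .replace(/[\u0000-\u001f\u007f]/g, ' ')
    .slice(0, maxLength);
}

// A username crafted to forge a second, fake log entry.
const attack = 'alice\n2024-01-01 INFO login succeeded for admin';
const cleaned = sanitizeForTelemetry(attack);
console.log(cleaned);
```

With the newline gone, the forged "login succeeded" text stays on the same line as the real entry instead of posing as a separate one.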

3. Insufficient Monitoring Leading to Undetected Breaches

If you're collecting telemetry but not alerting on it, you're warehousing evidence of your own compromise. Data alone isn't observability; active monitoring is.

Mitigation:

  • Configure alerts for repeated authentication failures, unusual spike patterns, and anomalous service-to-service traffic
  • Route security-relevant telemetry to a SIEM, not just a tracing backend
  • Periodically audit what your services are actually sending

4. Credential Exposure in Collector Configuration

Your OTel Collector config references API keys and ingest tokens. These must come from environment variables or secrets managers, never hardcoded in otel-config.yaml.

# ❌ Never do this
api-key: "NRAK-XXXXXXXXXXXXXXXXXXXX"

# ✅ Do this
api-key: "${NEW_RELIC_LICENSE_KEY}"

Frontend Telemetry: It Works There Too

OTel isn't just for the backend. The @opentelemetry/sdk-trace-web package brings the same tracing model to the browser, capturing document load times, XHR/fetch requests, and user interactions.

The critical win: trace propagation. When a user clicks a button in your browser app, the trace context is forwarded with the API call, linking the browser span to the backend span. You get a single trace that shows the full journey from UI click to database response.

Useful for Node.js BFFs, Next.js backends, and any system where front-to-back latency matters.


What to Implement First (A Practical Priority Order)

If you're starting from zero on an existing Node.js project, here's the sequence that gives you the fastest signal:

  1. Set up NodeSDK with OTLP + the Collector locally — route to Zipkin for visualization. Even in local dev, using the Collector means your application code never needs to change when you switch backends.
  2. Add a request duration histogram — a histogram on your most trafficked endpoint gives you p95/p99 latency immediately. A counter alone won't tell you what's slow.
  3. Add sampling before production — head sampling is a 5-minute change. Tail sampling via the Collector is the right long-term answer. Skip this and high traffic will surprise you.
  4. Audit your span attributes for PII — do this before you go to production. Retrofitting data redaction is painful.
  5. Configure alerts — at minimum, alert on error rate and p95 latency. If you only get paged when users tweet at you, you're already too late.

Summary

| Concept | Tool | What You Get |
|---|---|---|
| Instrumentation | NodeSDK + OTLP | Modern, vendor-neutral trace export |
| Local Tracing | OTel Collector → Zipkin | Request flow, latency, bottlenecks, errors |
| Metrics | Prometheus + Grafana | Rate/Error/Duration (RED), p95/p99 histograms |
| Sampling | Head or Tail (Collector) | Cost control without losing critical traces |
| Production Export | OTel Collector → New Relic | Full-stack service map and alerting |
| Security | OTel Processor + SIEM | PII redaction, breach detection |

OpenTelemetry is not a monitoring tool, it's a contract. A contract between your application and any tool that wants to understand it. Write that contract once, and you're free to change backends, scale services, or bring in new tooling without starting over.




Have you implemented OTel in a production Node.js system? What was the first thing traces revealed that you didn't expect? Drop it in the comments, genuinely curious.
