TL;DR: Your app is running. Users are complaining. You have no idea why. This is what happens when you skip observability. OpenTelemetry (OTel) fixes that, and this guide shows you exactly how to implement it in Node.js.
The Problem No Dashboard Will Tell You About
You've deployed your microservices. Everything looks green on the surface. Then at 2am, Slack goes off: "Checkout is broken."
You open your logs. You see... something. You check your metrics. You see... a spike. But which service caused it? Was it the payment service timing out? Was the inventory service returning bad data? Was the API gateway dropping requests?
Without proper observability, you're debugging in the dark.
This is the exact problem OpenTelemetry (OTel) was designed to solve.
What Is Observability (And Why "Monitoring" Isn't Enough)
Monitoring tells you that something is wrong. Observability tells you why.
Formally: observability is the measure of how well you can understand the internal state of a system based on the data it produces as output.
In a distributed, microservice-based architecture, you can't step inside a service and watch it run. Observability is how you get that visibility from the outside in.
It relies on four data types, collectively known as M.E.L.T.:
| Pillar | What It Is | What It Tells You |
|---|---|---|
| Metrics | Numeric measurements at regular intervals | System performance trends over time (e.g. error rate, p99 latency) |
| Events | Discrete actions at a point in time | Business-level triggers (e.g. user purchase, payment initiated) |
| Logs | Granular, timestamped application output | Millisecond-by-millisecond reconstruction of events |
| Traces | A record of a request's full journey | Causal chains, bottlenecks, and cross-service latency |
Most teams start with logs and call it a day. That's like having a CCTV system with no timestamps and no audio. Traces and metrics are what turn raw logs into a story you can actually debug.
OTel's actual signal taxonomy: OpenTelemetry formalizes three primary signals — Traces, Metrics, and Logs. In OTel, Events are expressed as Span Events (attached to a trace span) or structured log records, not a standalone fourth signal. M.E.L.T. is a broader observability framework concept popularized by vendors like New Relic. It's a useful mental model, but don't conflate it with OTel's own data model.
A Brief History: Why OpenTelemetry Exists
Before 2019, the observability ecosystem was fragmented. Two major open-source projects competed for adoption:
- OpenTracing — a vendor-neutral API for distributed tracing
- OpenCensus — Google's solution for metrics and tracing
Both were good. Both were different. Neither was a standard.
In 2019, the two projects merged to form OpenTelemetry — a single, vendor-neutral, CNCF-backed framework for generating, collecting, and exporting telemetry data. Today it is the second most active CNCF project after Kubernetes.
The mandate is clear: write your instrumentation once, send it anywhere.
The Four Concepts You Must Understand Before Writing Code
1. Spans — The Atomic Unit of a Trace
A span represents a single unit of work: an HTTP call, a database query, a function execution. Every span has:
- A name
- A start and end timestamp (giving you latency)
- Status (including whether it errored, shown as red in Zipkin)
- Optional attributes (metadata)
Spans are linked in parent-child relationships. When your Dashboard service calls your Movies service, the Dashboard's span becomes the parent, and the Movies service creates a child span. Together, they form a trace: the complete map of a single request.
[Dashboard Service] ──────────────────────────── 320ms
  └── [Movies Service API Call] ────────── 290ms  ← bottleneck
        └── [DB Query: find_all_movies] ── 270ms  ← root cause
This is exactly the kind of visualization Zipkin gives you — and why distributed tracing is invaluable.
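To make the parent-child idea concrete, here is a minimal sketch of how a trace viewer could locate the root cause in the trace above. The record shape (`spanId`, `parentSpanId`, `startMs`, `endMs`) and the start offsets are assumptions chosen for illustration, not the OTel or Zipkin wire format.

```javascript
// Hypothetical span records mirroring the example trace above.
const spans = [
  { spanId: 'a1', parentSpanId: null, name: 'Dashboard Service', startMs: 0, endMs: 320 },
  { spanId: 'b2', parentSpanId: 'a1', name: 'Movies Service API Call', startMs: 20, endMs: 310 },
  { spanId: 'c3', parentSpanId: 'b2', name: 'DB Query: find_all_movies', startMs: 30, endMs: 300 },
];

// Self time = a span's duration minus the time covered by its direct children.
// Assumes children do not overlap each other, which holds for this simple chain.
function selfTimes(allSpans) {
  return allSpans.map((span) => {
    const durationMs = span.endMs - span.startMs;
    const childMs = allSpans
      .filter((s) => s.parentSpanId === span.spanId)
      .reduce((sum, child) => sum + (child.endMs - child.startMs), 0);
    return { name: span.name, durationMs, selfMs: durationMs - childMs };
  });
}

// The span with the largest self time is where the request actually spent its time.
function slowestBySelfTime(allSpans) {
  return selfTimes(allSpans).reduce((worst, s) => (s.selfMs > worst.selfMs ? s : worst));
}

console.log(slowestBySelfTime(spans).name); // DB Query: find_all_movies
```

The Dashboard span is the longest overall, but almost all of its 320ms is spent waiting on its child. Subtracting child time is what separates "slow" from "waiting on something slow".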
2. Span Context and Correlation Context
For spans across services to be linked into one trace, metadata must travel with each request. This is handled by two mechanisms:
- Span Context: carries the traceId, spanId, and traceFlags, the IDs that tell the backend these spans belong together. Mandatory.
- Correlation Context (named Baggage in the current OTel specification): carries user-defined properties like customerId, dataRegion, or providerHostname. Optional, but powerful for business-level debugging.
The OTel HTTP auto-instrumentation plugin propagates this context automatically via HTTP headers.
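By default that header is the W3C Trace Context `traceparent` header. The sketch below parses and rebuilds one by hand so you can see what travels between services. In a real app the instrumentation does this for you; the function names and the example span ID here are illustrative, and the parser only handles the version `00` layout.

```javascript
// Parse a W3C Trace Context `traceparent` header:
//   version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" trace-flags (2 hex)
function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;
  const [, version, traceId, spanId, flags] = match;
  // Version "ff" and all-zero IDs are invalid.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Build the header a service would send on its own outgoing call.
// The trace ID is preserved; the span ID becomes the caller's current span.
function formatTraceparent({ traceId, spanId, sampled }) {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
const ctx = parseTraceparent(incoming);

// 'b7ad6b7169203331' stands in for the span ID this service generated for its own work.
const outgoing = formatTraceparent({ ...ctx, spanId: 'b7ad6b7169203331' });
console.log(outgoing);
```

The trace ID staying constant across hops is what lets the backend stitch spans from different services into one trace.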
3. Metrics vs. Traces: When to Use Which
| Concern | Use Traces | Use Metrics |
|---|---|---|
| "Why is this request slow?" | ✅ | ❌ |
| "What's our p99 latency over 30 days?" | ❌ | ✅ |
| "Which service is the bottleneck?" | ✅ | ❌ |
| "How many requests per second are we handling?" | ❌ | ✅ |
Traces answer why at the individual request level. Metrics answer what at aggregate scale over time.
4. The OpenTelemetry Collector — Your Telemetry Router
Without a Collector, your service sends data directly to a specific backend (Zipkin, Jaeger, etc.). Change the backend and you re-instrument every service.
The OTel Collector sits in between:
[Service A] ──┐
[Service B] ──┼──► (OTLP) ──► [OTel Collector] ──► [Zipkin]      (local/dev)
[Service C] ──┘                      │             [New Relic]   (production)
                                     └───────────► [Prometheus]  (metrics)
Your services always speak OTLP (OpenTelemetry's native protocol) to the Collector. The Collector then speaks whatever protocol each backend requires. This is the key architectural point: your application code is completely decoupled from the backend's data format.
It has three stages:
- Receiver — accepts data in multiple formats (Zipkin, Jaeger, Prometheus, FluentBit)
- Processor — filters, batches, or transforms data before export
- Exporter — forwards to one or more backends
Swap backends without touching a single line of application code. This is the right architecture for production.
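If the three stages feel abstract, this toy model may help. It is only a sketch of the data flow: the real Collector is a Go binary configured through YAML, and every name below (`createPipeline`, `batchSize`, the exporter callbacks) is made up for illustration.

```javascript
// Toy model of a Collector pipeline: receiver -> processor -> exporters.
function createPipeline({ batchSize, exporters }) {
  let buffer = [];

  function flush() {
    if (buffer.length === 0) return;
    const batch = buffer;
    buffer = [];
    // Fan out: every configured exporter receives the same batch.
    for (const exporter of exporters) exporter(batch);
  }

  return {
    // "Receiver": accepts one span from a service.
    receive(span) {
      buffer.push(span);
      // "Batch processor": hold spans until the batch is full.
      if (buffer.length >= batchSize) flush();
    },
    flush,
  };
}

const zipkinBatches = [];
const newRelicBatches = [];
const pipeline = createPipeline({
  batchSize: 2,
  exporters: [(batch) => zipkinBatches.push(batch), (batch) => newRelicBatches.push(batch)],
});

pipeline.receive({ name: 'GET /movies' });
pipeline.receive({ name: 'DB Query' });    // second span fills the batch and triggers an export
pipeline.receive({ name: 'GET /health' });
pipeline.flush();                          // flush the remainder, for example on shutdown

console.log(zipkinBatches.length, newRelicBatches.length);
```

Notice that the services calling `receive` never learn which exporters exist. Adding or removing a backend only changes the `exporters` array, which is the role the Collector config plays.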
Setting Up Distributed Tracing in Node.js
The recommended architecture in 2025+ is: Node.js → OTLP → OTel Collector → backend. Your application always exports via OTLP (OTel's native protocol), and the Collector handles routing to whatever backend you're using.
We'll use Zipkin as the local visualization backend here — it's an excellent learning tool for seeing traces. In production you'd swap the Collector's exporter to New Relic, Jaeger, or wherever you're sending data, without touching the application.
Step 1: Install Dependencies
npm install @opentelemetry/sdk-node \
@opentelemetry/exporter-trace-otlp-http \
@opentelemetry/auto-instrumentations-node
@opentelemetry/sdk-node is the modern entry point. It wires together the trace provider, resource detection, and auto-instrumentations in one place. The older pattern of manually constructing NodeTracerProvider and calling provider.register() still works but is more verbose and prone to misconfiguration.
Step 2: Create Your tracing.js File
Keep instrumentation code in its own file, completely separate from your application logic.
// tracing.js
'use strict';

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  serviceName: 'movies-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces', // OTel Collector OTLP/HTTP endpoint
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Graceful shutdown: flush spans before the process exits
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});

console.log('Tracing initialized');
NodeSDK handles resource attribute detection automatically (service name, runtime version, host info) and manages the lifecycle of the SDK cleanly. The SIGTERM handler ensures buffered spans are flushed rather than dropped when the process shuts down.
Step 3: Load It Before Your App Starts
Use the node -r flag to require the tracing file before your application code runs. This ensures that all subsequent modules are instrumented from the start.
node -r ./tracing.js index.js
Or in your package.json:
{
  "scripts": {
    "start": "node -r ./tracing.js index.js"
  }
}
Why -r? OTel works by monkey-patching Node.js core modules (like http). If your app loads before the instrumentation, those patches won't apply to already-loaded modules.
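Here is a small, self-contained demonstration of why load order matters. `fakeHttp` and `instrument` are stand-ins invented for this sketch; they are not OTel APIs, but the failure mode is the same one you hit when instrumentation loads too late.

```javascript
// Stand-in for a core module such as `http` (hypothetical, for illustration only).
const fakeHttp = {
  request(url) {
    return `response from ${url}`;
  },
};

// Stand-in for what an instrumentation library does: wrap the original function.
function instrument(target, recordedCalls) {
  const original = target.request;
  target.request = function patchedRequest(url) {
    recordedCalls.push(url); // a real instrumentation would start a span here
    return original.call(this, url);
  };
}

// Application code that loaded BEFORE instrumentation and kept a direct reference,
// the same way `const { request } = require('http')` would.
const earlyRequest = fakeHttp.request;

const recordedCalls = [];
instrument(fakeHttp, recordedCalls);

// Application code that loaded AFTER instrumentation sees the patched function.
const lateRequest = fakeHttp.request;

earlyRequest('http://inventory/items'); // bypasses the patch, so nothing is recorded
lateRequest('http://payments/charge');  // goes through the patch and is recorded

console.log(recordedCalls); // [ 'http://payments/charge' ]
```

Both calls succeed, which is what makes this bug quiet: nothing breaks, you just get traces with missing spans.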
Step 4: Run the OTel Collector + Zipkin via Docker
# docker-compose.yml
services:
  zipkin:
    image: openzipkin/zipkin
    ports:
      - "9411:9411"
  otel-collector:
    image: otel/opentelemetry-collector-contrib
    command: ["--config=/etc/otel-config.yaml"]
    volumes:
      - ./otel-config.yaml:/etc/otel-config.yaml
    ports:
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
    depends_on:
      - zipkin
# otel-config.yaml (learning setup)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"
processors:
  batch: {}
exporters:
  zipkin:
    endpoint: "http://zipkin:9411/api/v2/spans"
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [zipkin]
Start the stack with docker-compose up, run your service, make some requests, and open http://localhost:9411. You'll see traces appear with parent-child span relationships, latency measurements per hop, and red error flags where things fail.
Setting Up Metrics Collection with Prometheus
Metrics are cheaper to store and better for trend analysis than traces. The standard framework for service-level metrics is RED: Rate (requests/second), Errors (error rate), Duration (latency). OTel supports all three through a Meter.
A counter covers Rate and Errors. For Duration, you need a histogram — not another counter. Histograms are what actually give you p95 and p99 latency, which are far more useful for SLOs than raw request counts.
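To see why a histogram can answer percentile questions and a counter cannot, here is a minimal bucketed histogram written from scratch. It is a sketch of the idea, not the OTel SDK's implementation; the bucket bounds and method names are assumptions picked for the example.

```javascript
// Bucket upper bounds in milliseconds (hypothetical; choose bounds that match your SLOs).
const BOUNDS_MS = [50, 100, 250, 500, 1000, Infinity];

function createHistogram(bounds) {
  const counts = new Array(bounds.length).fill(0);
  let total = 0;
  return {
    record(valueMs) {
      // Increment the first bucket whose upper bound contains the value.
      const i = bounds.findIndex((upper) => valueMs <= upper);
      if (i === -1) return; // NaN or otherwise unbucketable value: ignore it
      counts[i] += 1;
      total += 1;
    },
    // Upper bound of the bucket that contains the q-th quantile.
    // This is a coarse estimate: precision is limited by the bucket widths.
    quantileUpperBound(q) {
      if (total === 0) return undefined;
      const rank = Math.max(1, Math.ceil(q * total));
      let seen = 0;
      for (let i = 0; i < bounds.length; i += 1) {
        seen += counts[i];
        if (seen >= rank) return bounds[i];
      }
      return bounds[bounds.length - 1];
    },
  };
}

const latency = createHistogram(BOUNDS_MS);
for (let i = 0; i < 95; i += 1) latency.record(40);  // 95 fast requests
for (let i = 0; i < 5; i += 1) latency.record(800);  // 5 slow requests

console.log(latency.quantileUpperBound(0.5));  // 50
console.log(latency.quantileUpperBound(0.99)); // 1000
```

The median says everything is fine while p99 shows a real problem. A request counter would have reported "100 requests" and nothing else.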
Step 1: Install Dependencies
npm install @opentelemetry/sdk-metrics \
@opentelemetry/exporter-prometheus
Step 2: Initialize Your Meter, Counter, and Histogram
// metrics.js
const { MeterProvider } = require('@opentelemetry/sdk-metrics');
const { PrometheusExporter } = require('@opentelemetry/exporter-prometheus');

// Starts an HTTP endpoint that Prometheus scrapes
const exporter = new PrometheusExporter({ port: 9464 }, () => {
  console.log('Prometheus scrape endpoint: http://localhost:9464/metrics');
});

// Pass the exporter as a reader at construction time. Older 1.x SDK releases used
// meterProvider.addMetricReader(exporter) instead; check which one your version supports.
const meterProvider = new MeterProvider({ readers: [exporter] });
const meter = meterProvider.getMeter('movies-service');

const requestCounter = meter.createCounter('http_requests_total', {
  description: 'Total number of HTTP requests received',
});

// Histogram for latency: the correct instrument for p95/p99 analysis
const requestDuration = meter.createHistogram('http_request_duration_ms', {
  description: 'HTTP request latency in milliseconds',
  unit: 'ms',
});

module.exports = { requestCounter, requestDuration };
Step 3: Instrument Your Express Middleware
Note the res.on('finish', ...) pattern below. Reading res.statusCode directly in the middleware body gives you the default value (usually 200) before the response is actually sent — the finish event gives you the real status code and accurate duration.
// index.js
const express = require('express');
const { requestCounter, requestDuration } = require('./metrics');

const app = express();

// Record request count and duration once each response has actually been sent
app.use((req, res, next) => {
  const startTime = Date.now();
  res.on('finish', () => {
    const attributes = {
      method: req.method,
      route: req.route?.path ?? req.path,
      status_code: res.statusCode,
    };
    requestCounter.add(1, attributes);
    requestDuration.record(Date.now() - startTime, attributes);
  });
  next();
});
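If you want to convince yourself that the finish event matters, this sketch reproduces the timing problem without Express. `createFakeResponse` is a made-up stand-in built on Node's `EventEmitter`; a real `res` object has far more on it, but the ordering is the same.

```javascript
const { EventEmitter } = require('node:events');

// Minimal stand-in for an HTTP response object (hypothetical).
function createFakeResponse() {
  const res = new EventEmitter();
  res.statusCode = 200; // the default until a handler changes it
  return res;
}

const observed = {};
const res = createFakeResponse();

// The middleware runs first: reading the status here is too early.
observed.inMiddlewareBody = res.statusCode;
res.on('finish', () => {
  observed.onFinish = res.statusCode;
});

// Later, the route handler fails and sets the real status before the response ends.
res.statusCode = 500;
res.emit('finish');

console.log(observed); // { inMiddlewareBody: 200, onFinish: 500 }
```

Recording in the middleware body would have counted this failed request as a 200, which quietly hides errors from your error-rate metric.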
Step 4: Configure Prometheus to Scrape Your App
# prometheus.yml
global:
  scrape_interval: 15s # Balance between granularity and performance
scrape_configs:
  - job_name: 'movies-service'
    static_configs:
      - targets: ['host.docker.internal:9464']
Environment note: host.docker.internal resolves on Docker Desktop (macOS and Windows). On Linux, Docker does not add this hostname by default; use 172.17.0.1 (the default Docker bridge gateway) or the host's actual IP. In Kubernetes, replace this entirely with a proper service discovery config or a PodMonitor.
Why 15 seconds? Granular enough for accurate per-minute rate calculations and fast anomaly detection, without generating an excessive volume of data points.
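To show how scrapes turn into a requests-per-second number, here is a simplified rate calculation over two consecutive samples of a counter. Prometheus's own `rate()` works over a whole time window and extrapolates, so treat this as an approximation of the idea; the sample shape and values are assumptions.

```javascript
// Per-second rate between two scrapes of a monotonically increasing counter.
function perSecondRate(previous, current) {
  const elapsedSeconds = (current.timestampMs - previous.timestampMs) / 1000;
  if (elapsedSeconds <= 0) return 0;
  // A counter that went down means the process restarted and the counter reset,
  // so the new value is all we know about the increase.
  const increase = current.value >= previous.value
    ? current.value - previous.value
    : current.value;
  return increase / elapsedSeconds;
}

// Hypothetical scrapes taken 15 seconds apart.
const scrapeA = { timestampMs: 0, value: 1200 };
const scrapeB = { timestampMs: 15000, value: 1500 };

console.log(perSecondRate(scrapeA, scrapeB)); // 20 (300 requests over 15 seconds)
```

A shorter interval gives you finer resolution at the cost of more stored samples, which is the tradeoff the 15-second default is balancing.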
Routing Everything Through the OTel Collector to New Relic
Once you're ready for production, use the same OTLP-first architecture: your services already speak OTLP to the Collector. Only the Collector's exporter changes.
# otel-config.yaml (production)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "0.0.0.0:4317"
      http:
        endpoint: "0.0.0.0:4318"
processors:
  batch: {}
exporters:
  otlp:
    endpoint: "https://otlp.nr-data.net:4317"
    headers:
      api-key: "${NEW_RELIC_LICENSE_KEY}"
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
# docker-compose.yml (partial, production)
otel-collector:
  image: otel/opentelemetry-collector-contrib
  command: ["--config=/etc/otel-config.yaml"]
  volumes:
    - ./otel-config.yaml:/etc/otel-config.yaml
  ports:
    - "4317:4317" # OTLP gRPC
    - "4318:4318" # OTLP HTTP
  environment:
    - NEW_RELIC_LICENSE_KEY=${NEW_RELIC_LICENSE_KEY}
Nothing in your application code changes between the local Zipkin setup and this production config. That's the whole point of OTLP-first: your app doesn't know or care what's downstream of the Collector. In New Relic's Explorer view, you can visualize all services, their dependencies, and drill into latency across your full distributed system.
Sampling: The Production Topic Nobody Puts in Tutorials
In development, tracing every request is fine. In production, with thousands of requests per second, tracing everything is a fast way to generate a very expensive observability bill and a lot of noise.
Sampling is how you control what percentage of traces you actually record.
Head Sampling (Probabilistic)
The decision is made at the start of a trace, before any data is collected:
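const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
// Note the class name: the export is TraceIdRatioBasedSampler
const { TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');

const sdk = new NodeSDK({
  serviceName: 'movies-service',
  sampler: new TraceIdRatioBasedSampler(0.1), // Record roughly 10% of all traces
  traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4318/v1/traces' }),
  instrumentations: [getNodeAutoInstrumentations()],
});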
Simple and low-overhead. The tradeoff: you may drop the exact traces you needed, because errors and slow requests have the same chance of being dropped as fast successful ones.
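It helps to see why ratio-based sampling keys off the trace ID instead of rolling a die per service. The sketch below is a simplified model, not the SDK's actual algorithm: it assumes the decision can be read from the first 8 hex characters of the trace ID, and `shouldSample` is a name invented here.

```javascript
// Simplified model of ratio-based head sampling.
// The trace ID is effectively random, so a slice of it works as a stable "dice roll".
function shouldSample(traceId, ratio) {
  const bucket = parseInt(traceId.slice(0, 8), 16); // 0 .. 2^32 - 1
  return bucket < ratio * 0x100000000;              // 0x100000000 === 2^32
}

const traceId = '4bf92f3577b34da6a3ce929d0e0e4736';

// Two different services evaluating the same trace reach the same verdict,
// so a trace is either recorded end to end or not at all.
const dashboardDecision = shouldSample(traceId, 0.1);
const moviesDecision = shouldSample(traceId, 0.1);

console.log(dashboardDecision, moviesDecision); // false false
console.log(shouldSample(traceId, 0.5));        // true
```

If each service flipped its own coin, you would end up with traces that have holes in them: the parent recorded, the child dropped, or the other way round.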
Tail Sampling (Collector-side)
The decision is made at the Collector after seeing the full trace. This lets you express rules like: "always keep error traces, always keep traces over 1 second, sample everything else at 5%." This is the correct approach for production.
# In otel-config.yaml (requires otel/opentelemetry-collector-contrib)
processors:
  tail_sampling:
    decision_wait: 10s # Wait up to 10s for all spans in a trace to arrive
    policies:
      - name: keep-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: keep-slow-traces
        type: latency
        latency: { threshold_ms: 1000 }
      - name: sample-everything-else
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
decision_wait needs to be long enough that all spans from across your services have arrived before the Collector makes its keep/drop call. Tune this based on your slowest service-to-service call.
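The three policies above boil down to a short decision function. This is a sketch of the logic only, written in JavaScript for readability; the Collector's processor is implemented in Go, and the trace shape, the `decide` name, and the injectable `random` parameter are assumptions for illustration.

```javascript
// Decide what to do with a finished trace, mirroring the three policies above.
// `random` is injectable so the probabilistic branch can be exercised deterministically.
function decide(trace, random = Math.random) {
  // Policy 1: always keep traces that contain an error.
  if (trace.spans.some((span) => span.status === 'ERROR')) return 'keep: error';

  // Policy 2: always keep traces slower than 1 second end to end.
  const startMs = Math.min(...trace.spans.map((span) => span.startMs));
  const endMs = Math.max(...trace.spans.map((span) => span.endMs));
  if (endMs - startMs > 1000) return 'keep: slow';

  // Policy 3: keep about 5% of everything else.
  return random() < 0.05 ? 'keep: sampled' : 'drop';
}

const failed = { spans: [{ status: 'ERROR', startMs: 0, endMs: 40 }] };
const slow = { spans: [{ status: 'OK', startMs: 0, endMs: 1500 }] };
const healthy = { spans: [{ status: 'OK', startMs: 0, endMs: 80 }] };

console.log(decide(failed));              // keep: error
console.log(decide(slow));                // keep: slow
console.log(decide(healthy, () => 0.5));  // drop
```

This only works because the full trace is available at decision time, which is exactly what `decision_wait` buys you and what head sampling cannot offer.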
The rule: never run OTel in high-traffic production without a sampling strategy. Head sampling is a quick win; tail sampling is the right architecture.
The Security Angle: What OWASP Says About Telemetry
This is the part most tutorials skip. Don't.
The OWASP Top 10 includes A09:2021 – Security Logging and Monitoring Failures precisely because bad observability is a security risk, not just an operational one.
Risks specific to telemetry:
1. Sensitive Data Leaking into Spans and Logs
Over-instrumentation is real. If you're logging request bodies, database queries, or user-facing errors without sanitization, you may be storing passwords, credit card numbers, session tokens, or PII directly inside your observability platform, which is almost never protected as strictly as your production database.
Mitigation: Use OTel Collector processors to scrub sensitive fields before export:
processors:
  attributes:
    actions:
      - key: http.request.body
        action: delete
      - key: db.statement
        action: hash
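You can apply the same delete-and-hash idea inside the application as a second line of defense, before attributes ever leave the process. The helper below is a hypothetical sketch, not an OTel API. It uses SHA-256 from Node's `crypto` module, which may differ from the hash the Collector processor applies.

```javascript
const { createHash } = require('node:crypto');

// Keys mirror the Collector config above; extend these sets for your own services.
const DELETE_KEYS = new Set(['http.request.body']);
const HASH_KEYS = new Set(['db.statement']);

// Returns a copy of the attributes with sensitive keys removed or hashed.
function scrubAttributes(attributes) {
  const clean = {};
  for (const [key, value] of Object.entries(attributes)) {
    if (DELETE_KEYS.has(key)) continue; // drop the value entirely
    clean[key] = HASH_KEYS.has(key)
      ? createHash('sha256').update(String(value)).digest('hex') // keep it correlatable, not readable
      : value;
  }
  return clean;
}

const scrubbed = scrubAttributes({
  'http.method': 'POST',
  'http.request.body': '{"cardNumber":"4111111111111111"}',
  'db.statement': "SELECT * FROM users WHERE email = 'a@example.com'",
});

console.log(Object.keys(scrubbed)); // [ 'http.method', 'db.statement' ]
```

Hashing rather than deleting `db.statement` keeps identical queries groupable in your backend without exposing their contents.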
2. Log Injection
If user-controlled input flows directly into span attributes or log messages without sanitization, attackers can inject crafted entries to manipulate your log analysis tools or hide their activity.
Mitigation: Never log raw user input. Sanitize and validate before adding to span attributes.
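A minimal sanitizer could look like the sketch below. It neutralizes control characters (including the CR/LF that make forged log lines possible) and caps the length. `sanitizeForTelemetry` and the 256-character limit are assumptions for illustration; real validation rules depend on what each field is supposed to contain.

```javascript
// Make user-controlled input safe to attach to a span attribute or log message.
function sanitizeForTelemetry(input, maxLength = 256) {
  return String(input)
    .replace(/[\u0000-\u001f\u007f]/g, ' ') // control characters, including CR and LF
    .slice(0, maxLength);                   // bound the size of what gets stored
}

// An attacker-supplied "username" that tries to forge a second log line.
const malicious = 'alice\n2024-01-01 INFO login succeeded for admin';
const safe = sanitizeForTelemetry(malicious);

console.log(safe); // alice 2024-01-01 INFO login succeeded for admin
```

The forged text is still present, but it now sits on the same line as the field it arrived in, so it can no longer pass as an independent log entry.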
3. Insufficient Monitoring Leading to Undetected Breaches
If you're collecting telemetry but not alerting on it, you're warehousing evidence of your own compromise. Data alone isn't observability; active monitoring is.
Mitigation:
- Configure alerts for repeated authentication failures, unusual spike patterns, and anomalous service-to-service traffic
- Route security-relevant telemetry to a SIEM, not just a tracing backend
- Periodically audit what your services are actually sending
4. Credential Exposure in Collector Configuration
Your OTel Collector config references API keys and ingest tokens. These must come from environment variables or secrets managers, never hardcoded in otel-config.yaml.
# ❌ Never do this
api-key: "NRAK-XXXXXXXXXXXXXXXXXXXX"
# ✅ Do this
api-key: "${NEW_RELIC_LICENSE_KEY}"
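The same rule applies to any Node.js code that builds exporter headers itself. Here is a small fail-fast sketch; `requireEnv` and `buildExporterHeaders` are hypothetical helpers, and the `env` parameter exists only so the example can run without touching your real environment.

```javascript
// Read a required secret from the environment and fail loudly if it is missing,
// instead of silently exporting with an empty key.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (value === undefined || value === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

function buildExporterHeaders(env = process.env) {
  return { 'api-key': requireEnv('NEW_RELIC_LICENSE_KEY', env) };
}

// Demo with an explicit fake environment; in real code you would call buildExporterHeaders().
const headers = buildExporterHeaders({ NEW_RELIC_LICENSE_KEY: 'placeholder-for-demo' });
console.log(Object.keys(headers)); // [ 'api-key' ]
```

Failing at startup is deliberate: a missing key discovered at boot is a deploy problem, while one discovered later is a gap in your telemetry.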
Frontend Telemetry: It Works There Too
OTel isn't just for the backend. The @opentelemetry/sdk-trace-web package brings the same tracing model to the browser, capturing document load times, XHR/fetch requests, and user interactions.
The critical win: trace propagation. When a user clicks a button in your browser app, the trace context is forwarded with the API call, linking the browser span to the backend span. You get a single trace that shows the full journey from UI click to database response.
Useful for Node.js BFFs, Next.js backends, and any system where front-to-back latency matters.
What to Implement First (A Practical Priority Order)
If you're starting from zero on an existing Node.js project, here's the sequence that gives you the fastest signal:
- Set up NodeSDK with OTLP + the Collector locally — route to Zipkin for visualization. Even in local dev, using the Collector means your application code never needs to change when you switch backends.
- Add a request duration histogram — a histogram on your most trafficked endpoint gives you p95/p99 latency immediately. A counter alone won't tell you what's slow.
- Add sampling before production — head sampling is a 5-minute change. Tail sampling via the Collector is the right long-term answer. Skip this and high traffic will surprise you.
- Audit your span attributes for PII — do this before you go to production. Retrofitting data redaction is painful.
- Configure alerts — at minimum, alert on error rate and p95 latency. If you only get paged when users tweet at you, you're already too late.
Summary
| Concept | Tool | What You Get |
|---|---|---|
| Instrumentation | NodeSDK + OTLP | Modern, vendor-neutral trace export |
| Local Tracing | OTel Collector → Zipkin | Request flow, latency, bottlenecks, errors |
| Metrics | Prometheus + Grafana | Rate/Error/Duration (RED), p95/p99 histograms |
| Sampling | Head or Tail (Collector) | Cost control without losing critical traces |
| Production Export | OTel Collector → New Relic | Full-stack service map and alerting |
| Security | OTel Processor + SIEM | PII redaction, breach detection |
OpenTelemetry is not a monitoring tool, it's a contract. A contract between your application and any tool that wants to understand it. Write that contract once, and you're free to change backends, scale services, or bring in new tooling without starting over.
Resources to Go Deeper
- OpenTelemetry JS Documentation
- OTel NodeSDK Configuration Reference
- Tail Sampling Processor Docs
- Prometheus Configuration Guide
- OTel Collector Contrib Distro
- OWASP A09 – Security Logging and Monitoring Failures
Have you implemented OTel in a production Node.js system? What was the first thing traces revealed that you didn't expect? Drop it in the comments, genuinely curious.