The Node.js Observability Stack in 2026: OpenTelemetry, Prometheus, and Distributed Tracing
You can't debug what you can't see. Every production incident that took longer than 30 minutes to resolve had the same root cause: missing observability. Not missing monitoring — that's a checkbox. Missing observability: the ability to ask arbitrary questions about your system's internal state from its external outputs.
The difference matters. Monitoring tells you that p99 latency spiked. Observability tells you why: that a specific user's request hit an upstream database with a cold connection pool, during a GC pause, on the one pod that never got the config rollout.
In 2026, the Node.js observability standard is OpenTelemetry. It is no longer optional, niche, or experimental. It is the baseline. Here is how to build the full stack.
The Three Pillars (and Why the Framing Is Wrong)
You've seen the diagram: Logs, Metrics, Traces. It's a useful model, but it hides the relationship that makes observability actually work: correlation.
Individual pillars are not observability. A trace without the correlated logs is half a story. Metrics without traces cannot identify which request caused the spike. Logs without trace IDs are expensive text files.
Real observability means every log entry, every metric data point, and every trace span share a correlation context. This is what OpenTelemetry provides by default — and what every homegrown solution almost always misses.
The Stack
| Layer | Tool | Role |
|---|---|---|
| Instrumentation | OpenTelemetry SDK | Auto-instrument HTTP, DB, Redis, queues |
| Traces | OTLP to Jaeger or Tempo | Distributed request tracing |
| Metrics | Prometheus + prom-client | RED metrics, counters, histograms |
| Logs | Pino + trace injection | Structured, correlated, trace-linked |
| Dashboards | Grafana | Unified view across all three pillars |
| Alerting | AlertManager | SLO-based error budget alerting |
Step 1: OpenTelemetry SDK Setup
Install the core packages:
```shell
npm install @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/exporter-metrics-otlp-http \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions
```
Create `src/instrumentation.ts` — this must load before any application code:
```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';

const sdk = new NodeSDK({
  resource: new Resource({
    'service.name': process.env.SERVICE_NAME ?? 'my-api',
    'service.version': process.env.SERVICE_VERSION ?? '0.0.0',
    'deployment.environment': process.env.NODE_ENV ?? 'development',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT ?? 'http://localhost:4318/v1/traces',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: process.env.OTEL_EXPORTER_OTLP_METRICS_ENDPOINT ?? 'http://localhost:4318/v1/metrics',
    }),
    exportIntervalMillis: 15_000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
      // fs instrumentation is extremely noisy; keep it off unless you need it
      '@opentelemetry/instrumentation-fs': { enabled: false },
    }),
  ],
});

sdk.start();

// Flush pending telemetry before the process exits
process.on('SIGTERM', async () => {
  await sdk.shutdown();
  process.exit(0);
});
```
Load it first in your entry point with `import './instrumentation'` before Express, Fastify, or any other framework code.
What you get for free: every incoming HTTP request becomes a root trace span, every outgoing call becomes a child span, every database query gets timing and SQL text, and async context propagates via `AsyncLocalStorage`.
Step 2: The RED Method with Custom Metrics
Auto-instrumentation covers infrastructure. Track business logic with custom metrics using the RED method:
- Rate: requests per second (Counter)
- Errors: error rate as percentage (Counter filtered by status_code)
- Duration: latency distribution (Histogram)
```typescript
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('my-api', '1.0.0');

const httpRequestsTotal = meter.createCounter('http_requests_total', {
  description: 'Total number of HTTP requests',
});

const requestDurationMs = meter.createHistogram('http_request_duration_ms', {
  description: 'HTTP request duration in milliseconds',
  unit: 'ms',
  advice: {
    explicitBucketBoundaries: [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000],
  },
});

// Assumes a pg `Pool` instance named `pool` is in scope
const dbConnections = meter.createObservableGauge('db_active_connections');
dbConnections.addCallback(result => result.observe(pool.totalCount));

export function recordRequest(method: string, route: string, statusCode: number, durationMs: number) {
  const labels = { method, route, status_code: String(statusCode) };
  httpRequestsTotal.add(1, labels);
  requestDurationMs.record(durationMs, labels);
}
```
Use explicit histogram bucket boundaries matching your SLO thresholds. If your SLO is p99 < 500ms, you need a bucket boundary at 500.
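Wiring `recordRequest` into a request pipeline does not require framework-specific middleware; a small timing wrapper around any handler is enough. A minimal sketch — the `Recorder` type and `withRedMetrics` name are illustrative, assuming a recorder with the same signature as `recordRequest` above:

```typescript
// Minimal timing wrapper: measures a handler's duration and reports it
// to a RED-style recorder (same signature as recordRequest above).
type Recorder = (method: string, route: string, statusCode: number, durationMs: number) => void;

export function withRedMetrics<T>(
  method: string,
  route: string,
  record: Recorder,
  handler: () => Promise<{ statusCode: number; body: T }>,
): Promise<{ statusCode: number; body: T }> {
  const start = performance.now();
  return handler().then(
    (res) => {
      record(method, route, res.statusCode, performance.now() - start);
      return res;
    },
    (err) => {
      // Count thrown errors as 500s so the error-rate metric still sees them.
      record(method, route, 500, performance.now() - start);
      throw err;
    },
  );
}
```

The same wrapper works for Express handlers, queue consumers, or cron jobs — anything async with a notion of success and failure.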
Step 3: Correlating Logs with Traces
Without trace IDs in your logs, you cannot jump from a metric spike to the actual failing requests. A thin proxy around your logger injects the active span context automatically:
```typescript
import pino from 'pino';
import { trace } from '@opentelemetry/api';

const base = pino({ level: process.env.LOG_LEVEL ?? 'info' });

// Wrap the logger so every log call picks up the active span's context
export const logger = new Proxy(base, {
  get(target: any, method: string) {
    if (!['info', 'warn', 'error', 'debug', 'fatal'].includes(method)) {
      return target[method];
    }
    return (...args: any[]) => {
      const span = trace.getActiveSpan();
      const ctx = span?.spanContext();
      if (ctx) {
        return target.child({ traceId: ctx.traceId, spanId: ctx.spanId })[method](...args);
      }
      return target[method](...args);
    };
  },
});
```
Every log entry now automatically carries the trace ID:
```json
{
  "level": "error",
  "time": 1711234567890,
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "00f067aa0ba902b7",
  "msg": "Payment processing failed",
  "userId": "usr_abc123"
}
```
In Grafana, click any trace span and jump to correlated logs filtered by that exact traceId. This transforms "p99 spiked" into "here are the exact 47 requests that caused it, with their full context."
Step 4: Prometheus Pull Metrics
Prefer pull-based metrics? Add prom-client:
```typescript
import { collectDefaultMetrics, Registry } from 'prom-client';

const register = new Registry();
collectDefaultMetrics({ register });

// `app` is your Express (or compatible) application instance
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
```
Default metrics included automatically:
- `process_cpu_seconds_total` — CPU usage
- `process_resident_memory_bytes` — resident memory
- `nodejs_eventloop_lag_seconds` — event loop lag, a critical health indicator
- `nodejs_gc_duration_seconds` — GC pause distribution
- `nodejs_heap_space_size_used_bytes` — V8 heap usage by space
Watch `nodejs_eventloop_lag_p99_seconds` closely. If it exceeds 0.1 (100 ms), you have CPU-intensive synchronous work on the main thread. Every millisecond of event loop blockage affects every concurrent request.
Step 5: Distributed Tracing Across Services
OpenTelemetry auto-instrumentation injects W3C traceparent headers on outgoing HTTP calls automatically. Service B receives the header and continues the same distributed trace — zero code changes needed.
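The `traceparent` header itself is a fixed four-field format defined by the W3C Trace Context spec: `version-traceId-spanId-flags`. When propagation misbehaves, a tiny parser helps you inspect captured headers by hand — a debugging sketch, not a substitute for the SDK's propagator:

```typescript
// Parse a W3C traceparent header:
// "00-<32 hex traceId>-<16 hex spanId>-<2 hex flags>"
interface TraceParent {
  version: string;
  traceId: string;
  spanId: string;
  sampled: boolean;
}

export function parseTraceparent(header: string): TraceParent | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  const [, version, traceId, spanId, flags] = m;
  // All-zero trace or span IDs are invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  // Bit 0 of the flags byte is the "sampled" flag.
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}
```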
For queue-based messaging:
```typescript
import { propagation, context } from '@opentelemetry/api';

// `mq` and `processMessage` stand in for your message-queue client and handler.

// Inject the active trace context into the message payload on publish
async function publishMessage(queue: string, payload: object) {
  const carrier: Record<string, string> = {};
  propagation.inject(context.active(), carrier);
  await mq.publish(queue, { ...payload, _otel: carrier });
}

// Restore that context on consume so spans chain to the publisher's trace
async function consumeMessage(msg: any) {
  const ctx = propagation.extract(context.active(), msg._otel ?? {});
  await context.with(ctx, () => processMessage(msg));
}
```
This chains consumer traces to publisher traces, giving end-to-end visibility across async message boundaries.
Step 6: SLO-Based Alerting
Threshold alerting fires too often on non-incidents and misses real ones. SLO-based alerting fires when error budget burns faster than sustainable:
```yaml
- alert: SLOErrorBudgetFastBurn
  expr: |
    (
      rate(http_requests_total{status_code=~"5.."}[1h])
      /
      rate(http_requests_total[1h])
    ) > (14 * 0.001)
    and
    (
      rate(http_requests_total{status_code=~"5.."}[5m])
      /
      rate(http_requests_total[5m])
    ) > (14 * 0.001)
  for: 2m
  annotations:
    summary: "Error budget burning 14x faster than sustainable"
    description: "At this burn rate, the monthly budget exhausts in approximately 2 days"
```
Requiring both the 1h and 5m windows prevents false positives from brief transient spikes. This single alert replaces a dozen threshold alerts and sharply reduces alert fatigue.
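The 14x multiplier is plain arithmetic: a 99.9% SLO leaves a 0.1% monthly error budget, and burning that budget 14x faster than sustainable exhausts it in 30 / 14 ≈ 2.1 days. Sketched out:

```typescript
// Burn-rate arithmetic for a 99.9% availability SLO.
const slo = 0.999;
const errorBudget = 1 - slo; // 0.1% of requests may fail per 30-day window

// How long until the budget is gone at a given burn rate?
function daysToExhaustion(burnRate: number, windowDays = 30): number {
  return windowDays / burnRate;
}

// The alert's threshold on the observed error ratio: 14 * 0.001 = 1.4%.
const fastBurnThreshold = 14 * errorBudget;

console.log(fastBurnThreshold);    // ~0.014
console.log(daysToExhaustion(14)); // ~2.14 days
```

Slower multi-window pairs (e.g. 6x over 6h/30m) are commonly added as a second, lower-urgency alert for gradual burns.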
Step 7: Grafana Dashboard Tiers
Tier 1 — Service Overview (is-it-on-fire view):
RPS with error rate overlay, p50/p95/p99 latency, error budget burn rate, active alerts.
Tier 2 — Service Deep Dive:
Rate/Error/Duration per endpoint, database query latency histogram, cache hit rate, event loop lag trend.
Tier 3 — Infrastructure:
Heap usage by V8 generation, CPU per instance, memory trends, DB connection pool utilization.
Link all tiers with drilldown navigation. Click a spike in Tier 1 to land in Tier 2 filtered to that time range. Click a trace span to jump to Jaeger with the exact distributed trace.
Step 8: The OTel Collector
Deploy the OpenTelemetry Collector between your services and backends. Do not connect directly to backends in production.
Why it matters:
- Backend flexibility: Switch from Jaeger to Tempo without touching application code
- Tail-based sampling: Keep 100% of error traces, sample 10% of successful ones — in the Collector, not the app
- Enrichment: Add Kubernetes metadata to every span automatically
- Resilience: Buffer telemetry if backends are down; your app never blocks on export
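Here is a minimal `collector-config.yaml` sketch covering those roles — the tail-sampling policy names, Jaeger endpoint, and Prometheus exporter port are illustrative assumptions to adapt to your backends:

```yaml
# collector-config.yaml — a minimal sketch; endpoints are illustrative.
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-all-errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: sample-success
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls: { insecure: true }
  prometheus:
    # Prometheus scrapes the Collector here; add this target to its scrape config.
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```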
```yaml
version: "3.8"
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    volumes:
      # The contrib image reads its config from /etc/otelcol-contrib/
      - ./collector-config.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4318:4318"
  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      COLLECTOR_OTLP_ENABLED: "true"
    ports:
      - "16686:16686"
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
```
Run `docker-compose up -d` to spin up the full local observability stack.
Production Observability Checklist
- [ ] OpenTelemetry SDK initialized before all application code
- [ ] SERVICE_NAME and SERVICE_VERSION set via environment variables
- [ ] Auto-instrumentation covers HTTP, DB, and cache layers
- [ ] Custom RED metrics on every service boundary
- [ ] Trace IDs injected into every log line
- [ ] /metrics endpoint or OTLP export configured
- [ ] OTel Collector deployed (not direct-to-backend)
- [ ] Tail-based sampling: 100% errors, sampled success
- [ ] SLO defined with fast-burn error budget alert
- [ ] Grafana dashboard covers Rate, Error, Duration
- [ ] Runbook linked from every alert rule
The Investment
Setting this up takes 4-6 hours the first time. That sounds like a lot — until you compare it to the 8-hour war room you will have without it when something breaks in production on a Friday.
Observability is the difference between operating a system and guessing about a system. In 2026, the tooling is mature, the standards are stable, and the cost of not instrumenting is too high to justify. Build it once. It pays dividends on every incident you never have to suffer through again.
Want a companion CLI that validates your observability setup before deployment? ci-check includes checks for missing health endpoints and absent OTEL_SERVICE_NAME configuration.
This article is part of the Node.js Production Engineering series on axiom-experiment.hashnode.dev.