DEV Community

Atlas Whoff

OpenTelemetry for Node.js: Distributed Tracing in Production Microservices


When a request spans 5 services and takes 800ms, you need to know which service is the problem.
OpenTelemetry gives you that visibility.

What OpenTelemetry Does

It instruments your code to emit traces, metrics, and logs in a vendor-neutral format.
You pick the backend (Jaeger, Grafana Tempo, Honeycomb, Datadog) separately.

Setup

npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http @opentelemetry/resources \
  @opentelemetry/semantic-conventions
// instrumentation.ts — must be loaded BEFORE your app
import { NodeSDK } from '@opentelemetry/sdk-node'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
import { Resource } from '@opentelemetry/resources'
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions'

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'my-api',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.npm_package_version,
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  instrumentations: [getNodeAutoInstrumentations()],
})

sdk.start()
process.on('SIGTERM', () => {
  // shutdown() flushes buffered spans; exit only once the flush resolves
  sdk.shutdown().then(() => process.exit(0))
})
// package.json
{
  "scripts": {
    "start": "node --require ./instrumentation.js dist/server.js"
  }
}
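Exporting every trace gets expensive in production. The NodeSDK also accepts a sampler option; a configuration sketch using a parent-based ratio sampler (the 10% ratio is an arbitrary example, tune it to your traffic):

```typescript
// instrumentation.ts — optional sampling for high-traffic services.
import { NodeSDK } from '@opentelemetry/sdk-node'
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-node'

const sdk = new NodeSDK({
  // ...resource, exporter, and instrumentations as above...
  // Sample 10% of root traces; ParentBasedSampler makes child spans
  // follow the parent's decision so traces are never half-recorded.
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
})
```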

Automatic Instrumentation

getNodeAutoInstrumentations() automatically traces:

  • HTTP requests (incoming and outgoing)
  • PostgreSQL queries (via pg)
  • Redis operations
  • gRPC calls
  • DNS lookups

No code changes needed for these.
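The same helper takes per-instrumentation options, so noisy sources can be switched off. A configuration sketch (fs and dns are common candidates to disable, but that choice is an assumption, not a requirement):

```typescript
// Auto-instrumentation config, keyed by instrumentation package name.
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'

const instrumentations = getNodeAutoInstrumentations({
  // fs tracing can add thousands of spans per request in some apps
  '@opentelemetry/instrumentation-fs': { enabled: false },
  '@opentelemetry/instrumentation-dns': { enabled: false },
})
```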

Manual Spans

Add custom spans for business logic:

import { trace, SpanStatusCode } from '@opentelemetry/api'

const tracer = trace.getTracer('my-service')

async function processOrder(orderId: string) {
  return tracer.startActiveSpan('process-order', async (span) => {
    span.setAttribute('order.id', orderId)

    try {
      const order = await db.order.findUniqueOrThrow({ where: { id: orderId } })
      span.setAttribute('order.total', order.total)
      span.setAttribute('order.items_count', order.items.length)

      await chargeCard(order)
      await fulfillOrder(order)

      span.setStatus({ code: SpanStatusCode.OK })
      return order
    } catch (error) {
      // catch variables are `unknown` in strict TS — narrow before using .message
      const err = error instanceof Error ? error : new Error(String(error))
      span.setStatus({ code: SpanStatusCode.ERROR, message: err.message })
      span.recordException(err)
      throw error
    } finally {
      span.end()
    }
  })
}

Context Propagation

Trace IDs must be passed between services:

import { propagation, context } from '@opentelemetry/api'

// Service A — inject trace context into outgoing request
async function callServiceB(data: unknown) {
  const headers: Record<string, string> = {}
  propagation.inject(context.active(), headers)

  return fetch('http://service-b/process', {
    method: 'POST',
    headers: { ...headers, 'Content-Type': 'application/json' },
    body: JSON.stringify(data),
  })
}

// Service B — extract trace context from incoming request
// (auto-instrumentations handles this automatically for HTTP)
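What actually travels between services is the W3C traceparent header: version-traceId-spanId-flags. A minimal sketch of that format with hypothetical helpers (these are not part of the OTel API — useful only for debugging propagation by hand):

```typescript
// W3C traceparent: 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>
interface TraceContext {
  traceId: string
  spanId: string
  sampled: boolean // bit 0 of the flags byte
}

function buildTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? '01' : '00'}`
}

function parseTraceparent(header: string): TraceContext | null {
  const m = /^[0-9a-f]{2}-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header)
  if (!m) return null
  return {
    traceId: m[1],
    spanId: m[2],
    sampled: (parseInt(m[3], 16) & 0x01) === 1,
  }
}
```

If a downstream service shows up as a new root trace instead of a child span, inspecting this header at both ends is usually the fastest way to find where propagation broke.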

Next.js Integration

// instrumentation.ts (Next.js 14+ built-in support)
export async function register() {
  if (process.env.NEXT_RUNTIME === 'nodejs') {
    await import('./instrumentation.node')
  }
}

What to Look For in Traces

Slow DB queries:

  • Sort spans by duration
  • Look for db.statement attributes with full SQL
  • N+1 queries appear as hundreds of identical short spans

External API bottlenecks:

  • HTTP client spans show exactly which external call is slow
  • Compare http.status_code across services

Error propagation:

  • Error spans bubble up — find the root cause, not just where it surfaced

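That N+1 signature — many identical short spans — is easy to check for offline as well. A sketch over exported span summaries (the SpanSummary shape and both thresholds are assumptions, not an OTel type):

```typescript
// Minimal span summary, as you might get from an exported trace dump.
interface SpanSummary {
  name: string
  durationMs: number
}

// Flag span names that repeat many times with short average duration
// within one trace — the classic N+1 query signature.
function findNPlusOne(spans: SpanSummary[], minCount = 20, maxAvgMs = 10): string[] {
  const stats = new Map<string, { count: number; totalMs: number }>()
  for (const s of spans) {
    const e = stats.get(s.name) ?? { count: 0, totalMs: 0 }
    e.count += 1
    e.totalMs += s.durationMs
    stats.set(s.name, e)
  }
  return Array.from(stats.entries())
    .filter(([, e]) => e.count >= minCount && e.totalMs / e.count <= maxAvgMs)
    .map(([name]) => name)
}
```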
Backend Options

| Backend | Best for | Cost |
| --- | --- | --- |
| Jaeger | Self-hosted, free | Infra only |
| Grafana Tempo | Integrated with Loki/Prometheus | Free tier |
| Honeycomb | Developer experience | Free to $200/mo |
| Datadog APM | Enterprise, full observability | Expensive |

For side projects: Grafana Cloud free tier (50GB traces/month).


