Why is my Node.js app slow? An OpenTelemetry debugging checklist

#node #opentelemetry #performance #devops

Node.js makes single-threaded asynchronous I/O cheap. It also makes a single bad pattern in one corner of the codebase capable of slowing the whole process. This is the production-debugging checklist I'd actually run, in the order I'd run it, with the OpenTelemetry instrumentation that lets you skip the guesswork.

1. The event loop is blocked

The single most common cause of "Node is slow." A CPU-heavy synchronous operation (a regex, a JSON.parse on a 50MB string, a crypto operation) holds the event loop and every other request waits. Symptoms: latency on unrelated endpoints spikes simultaneously, and the spike correlates with one particular endpoint's traffic.

OTel signal: nodejs.eventloop.delay.p90 and nodejs.eventloop.delay.p99 via the @opentelemetry/instrumentation-runtime-node package. Anything above 20ms is suspect; above 100ms is very likely the cause of your incident. Chart it against request latency. The correlation is usually obvious.
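
To get a feel for what this signal measures, here is a minimal local sketch using Node's built-in perf_hooks.monitorEventLoopDelay. The helper names blockFor and measureLoopDelay are hypothetical, and the 200ms busy-wait is only a stand-in for a heavy regex or JSON.parse; it is not how the OTel package is wired up.

```javascript
const { monitorEventLoopDelay } = require('node:perf_hooks');

// Hypothetical stand-in for CPU-heavy synchronous work (regex, JSON.parse, crypto).
function blockFor(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // Busy-wait: no timers, I/O callbacks, or other requests can run here.
  }
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Sample event loop delay around a synchronous `work` function.
async function measureLoopDelay(work) {
  const histogram = monitorEventLoopDelay({ resolution: 10 });
  histogram.enable();
  await sleep(50); // let the sampler record a few baseline ticks
  work();          // the loop is held for the duration of this call
  await sleep(50); // let the sampler observe the stall
  histogram.disable();
  // The histogram reports nanoseconds; convert to milliseconds.
  return { maxMs: histogram.max / 1e6, meanMs: histogram.mean / 1e6 };
}
```

Calling measureLoopDelay(() => blockFor(200)) should report a maximum delay in the region of the stall, while a no-op work function should stay close to the sampling resolution.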

2. Garbage collection is pausing the process

Long-lived references that should have been short-lived (caches without size limits, closures capturing request objects, listeners not removed) push the heap up. Eventually V8 runs a major GC and the process pauses for hundreds of milliseconds.

OTel signal: v8js.gc.duration (histogram) and v8js.memory.heap.used (gauge) from the runtime instrumentation. A GC duration p99 above 200ms with a growing heap-used line tells the whole story.
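
One common fix for the "cache without a size limit" case is to cap the cache and evict the least recently used entry. The sketch below is one way to do that with a plain Map; BoundedCache is a hypothetical name, and a production service would more likely reach for an existing LRU library.

```javascript
// Minimal LRU cache: a Map iterates in insertion order, so the first key
// is always the least recently used one.
class BoundedCache {
  constructor(maxEntries) {
    this.maxEntries = maxEntries;
    this.entries = new Map();
  }

  get(key) {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key);
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // Evict the oldest entry so the heap cannot grow without bound.
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey);
    }
  }

  get size() {
    return this.entries.size;
  }
}
```

With a cap in place, heap-used should plateau instead of climbing until the next major GC.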

3. A downstream call is the actual slow thing

"My app is slow" usually means "the response is slow." About half the time, your Node service is fine and is waiting on something else: a database query, a downstream microservice, a third-party API.

OTel signal: the trace waterfall. Open a slow trace. The long span is rarely your application code; it's almost always an outbound HTTP or DB span. @opentelemetry/instrumentation-http and the database-specific instrumentation packages produce these for free.

4. Database N+1 queries

An ORM (Sequelize, TypeORM, Prisma) issuing one query per item in a result set is a near-universal Node.js pattern. The endpoint that fetches "100 orders with their line items" issues 1 + 100 queries (one for the orders, then one per order for its line items), and more if each line item triggers further lookups, instead of one join.

OTel signal: count of database spans per trace. If a single trace has 50+ DB spans with the same span name, you have a query loop. The db.statement attribute (db.query.text in newer semantic conventions; hashed if sensitive) shows the repeated pattern.
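
The shape of the problem and of the fix can be sketched without a real ORM. Everything below is hypothetical: createFakeDb is an in-memory stand-in that only counts queries, and the method names do not correspond to Sequelize, TypeORM, or Prisma APIs.

```javascript
// Hypothetical in-memory "database" that counts how many queries it receives.
function createFakeDb() {
  const lineItems = [
    { orderId: 1, sku: 'A' },
    { orderId: 1, sku: 'B' },
    { orderId: 2, sku: 'C' },
    { orderId: 3, sku: 'D' },
  ];
  const db = {
    queryCount: 0,
    async findOrders() {
      db.queryCount += 1;
      return [{ id: 1 }, { id: 2 }, { id: 3 }];
    },
    async findItemsByOrderId(orderId) {
      db.queryCount += 1;
      return lineItems.filter((item) => item.orderId === orderId);
    },
    async findItemsByOrderIds(orderIds) {
      db.queryCount += 1;
      const wanted = new Set(orderIds);
      return lineItems.filter((item) => wanted.has(item.orderId));
    },
  };
  return db;
}

// N+1: one query for the orders, then one more per order.
async function loadOrdersNPlusOne(db) {
  const orders = await db.findOrders();
  for (const order of orders) {
    order.items = await db.findItemsByOrderId(order.id);
  }
  return orders;
}

// Batched: one query for the orders, one for all their items, grouped in memory.
async function loadOrdersBatched(db) {
  const orders = await db.findOrders();
  const items = await db.findItemsByOrderIds(orders.map((order) => order.id));
  const itemsByOrder = new Map();
  for (const item of items) {
    if (!itemsByOrder.has(item.orderId)) itemsByOrder.set(item.orderId, []);
    itemsByOrder.get(item.orderId).push(item);
  }
  for (const order of orders) {
    order.items = itemsByOrder.get(order.id) ?? [];
  }
  return orders;
}
```

For three orders the first version issues four queries and the second issues two; in a trace, that is the difference between a row of identical DB spans and a pair.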

5. Blocking sync I/O on the hot path

fs.readFileSync, crypto.pbkdf2Sync, JSON.parse on a huge body. Any synchronous operation in the request path holds the event loop. Often introduced by a contractor or a quick-fix PR that "worked locally."

OTel signal: CPU-time spans (with manual instrumentation) or just the event-loop-delay correlation from #1. A specific endpoint where event-loop delay spikes on every request is the smoking gun.
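
For the crypto case, the usual fix is to swap the Sync call for its callback or promise form, which Node runs on the libuv thread pool instead of the main thread. A minimal sketch follows; the function names and the iteration count, key length, and digest are illustrative assumptions, not recommendations.

```javascript
const crypto = require('node:crypto');
const { promisify } = require('node:util');

const pbkdf2Async = promisify(crypto.pbkdf2);

// Illustrative parameters only; choose real values from your security policy.
const ITERATIONS = 100_000;
const KEY_LENGTH = 32;
const DIGEST = 'sha256';

// Blocking version: the event loop is held until the key is derived.
function hashPasswordBlocking(password, salt) {
  return crypto.pbkdf2Sync(password, salt, ITERATIONS, KEY_LENGTH, DIGEST);
}

// Non-blocking version: derivation runs off the main thread, so other
// requests keep being served while this one waits.
async function hashPassword(password, salt) {
  return pbkdf2Async(password, salt, ITERATIONS, KEY_LENGTH, DIGEST);
}
```

Both functions produce the same derived key; only where the work runs changes, which is exactly what the event-loop-delay metric picks up.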

6. Connection pool exhaustion

The PostgreSQL client, the Redis client, the HTTP keepalive pool. Each has a max-connection setting that defaults to a small number. Under load, requests queue waiting for a connection, and the wait time looks like database latency from the outside.

OTel signal: custom up-down counter on pool size, or the difference between db.client.connections.usage and db.client.connections.max. The OTel database instrumentation libraries are starting to emit these natively; verify with your specific version.
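
If your client library does not expose these numbers, the values to track are simple: connections in use, the configured maximum, and callers waiting. The sketch below is a hypothetical slot-based pool (InstrumentedPool is not a real library class) that keeps to plain Node; in a real service you would report stats() through an OTel up-down counter or observable gauge.

```javascript
// Hypothetical pool that tracks the three numbers worth exporting.
class InstrumentedPool {
  constructor(max) {
    this.max = max;
    this.inUse = 0;
    this.waiters = [];
  }

  // Resolves once a slot is available.
  async acquire() {
    if (this.inUse < this.max) {
      this.inUse += 1;
      return;
    }
    // Pool is exhausted: queue until release() hands over a slot.
    await new Promise((resolve) => this.waiters.push(resolve));
  }

  release() {
    const next = this.waiters.shift();
    if (next) {
      next(); // hand the slot straight to the next waiter; inUse is unchanged
    } else {
      this.inUse -= 1;
    }
  }

  stats() {
    return { max: this.max, inUse: this.inUse, waiting: this.waiters.length };
  }
}
```

A non-zero waiting count while inUse equals max is the exhaustion pattern: the database may be fast, but requests are queueing before they ever reach it.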

7. Logging overhead

A logger emitting at debug level in production, writing to stdout that's then piped through a sidecar agent, can become non-trivial CPU work. Especially if the logger is doing JSON serialisation of large objects.

OTel signal: log rate metrics (count of log records per second) plus the runtime CPU usage. A sharp logging spike that correlates with latency is the giveaway. Setting the logger level back to info is usually the fix.
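
The serialisation cost is avoidable even when debug calls stay in the code, as long as the level check happens before the payload is built. Below is a minimal sketch of that idea with a hypothetical createLogger; established loggers offer their own equivalents, so treat this as an illustration of the pattern only.

```javascript
const LEVELS = { debug: 10, info: 20, warn: 30, error: 40 };

// Hypothetical logger: `buildFields` may be a function, which is only
// called when the record will actually be written.
function createLogger(level, write = (line) => process.stdout.write(line + '\n')) {
  const threshold = LEVELS[level];
  const logAt = (name) => (message, buildFields) => {
    if (LEVELS[name] < threshold) return false; // skipped before any serialisation
    const fields = typeof buildFields === 'function' ? buildFields() : buildFields;
    write(JSON.stringify({ level: name, message, ...fields }));
    return true;
  };
  return {
    debug: logAt('debug'),
    info: logAt('info'),
    warn: logAt('warn'),
    error: logAt('error'),
  };
}
```

With the level at info, a call such as logger.debug('cart state', () => ({ cart: hugeObject })) never invokes the callback, so the large object is never serialised.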

8. Async leaks and uncaught rejections

Promises that never resolve, async hooks that accumulate, unhandled promise rejections that log without crashing. Each one leaks resources over time. The process gets slower hour by hour and recovers only on restart.

OTel signal: nodejs.eventloop.utilization climbing over hours, paired with v8js.memory.heap.used climbing. If your service runs fine after a deploy and gets progressively slower over the next 12 hours, this is what you're looking at.
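
For the "promise that never resolves" case, one common guard is to put a deadline on the await so the caller fails loudly instead of hanging. The sketch below uses a hypothetical withTimeout helper. Note the limit of the technique: it stops the caller from waiting, but it does not cancel the underlying operation.

```javascript
// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => {
      reject(new Error(`${label} timed out after ${ms}ms`));
    }, ms);
  });
  // Whichever settles first wins; always clear the timer so it does not
  // linger as a leak of its own.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}
```

A call such as await withTimeout(fetchProfile(id), 2000, 'fetchProfile'), where fetchProfile is a placeholder for your own async function, turns a silent hang into an error that shows up in traces and logs.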

The minimum instrumentation to get all of this

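Six packages, one config file:

npm install @opentelemetry/sdk-node \
              @opentelemetry/auto-instrumentations-node \
              @opentelemetry/instrumentation-runtime-node \
              @opentelemetry/exporter-trace-otlp-http \
              @opentelemetry/exporter-metrics-otlp-http \
              @opentelemetry/sdk-metrics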

  // otel.js (required before your app)
  const { NodeSDK } = require('@opentelemetry/sdk-node');
  const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
  const { RuntimeNodeInstrumentation } = require('@opentelemetry/instrumentation-runtime-node');
  const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
  const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-http');
  const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');

  const sdk = new NodeSDK({
    traceExporter: new OTLPTraceExporter({
      url: 'https://your-otlp-endpoint/v1/traces',
      headers: { 'x-api-key': process.env.OTEL_API_KEY },
    }),
    metricReader: new PeriodicExportingMetricReader({
      exporter: new OTLPMetricExporter({
        url: 'https://your-otlp-endpoint/v1/metrics',
        headers: { 'x-api-key': process.env.OTEL_API_KEY },
      }),
    }),
    instrumentations: [
      getNodeAutoInstrumentations(),
      new RuntimeNodeInstrumentation(),
    ],
  });
  sdk.start();

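Then run node -r ./otel.js app.js. Auto-instrumentation covers HTTP server, HTTP client, the common DB clients, gRPC, and message queues. Runtime instrumentation covers event loop, GC, and heap. Most of the signals above then appear without further work; the exceptions are the ones already flagged as custom: CPU-time spans (#5), pool counters where your client library does not emit them (#6), and log-rate metrics (#7).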

A debugging order that usually works

If you don't know where to start, this order resolves most incidents in under fifteen minutes:

1. Open the trace waterfall for a slow request. If the long span is a downstream call, jump to that service. The problem is not Node.
2. If the long span is in your service, check nodejs.eventloop.delay.p99 in the same time window. Spiking? You're CPU-bound or blocking; identify the endpoint by correlation.
3. If the event loop is fine but the heap is growing, suspect GC pauses. Look at v8js.gc.duration p99.
4. If event loop and GC look fine, count DB spans per trace. 50+ on one endpoint means N+1.
5. If none of the above applies, suspect connection pool exhaustion, logging overhead, or an async leak. Each has a distinct signal pattern, described in #6 to #8.
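
The order above can be written down as a small decision function, which is sometimes handy as a runbook aid. This is only a sketch: the field names on the input object are hypothetical, and the thresholds (100ms event loop delay, 200ms GC p99, 50 DB spans) are the rules of thumb from this post, not universal limits.

```javascript
// Encode the triage order as a pure function over observed signals.
function triage(signals) {
  if (signals.longestSpanIsDownstream) {
    return 'downstream'; // the problem is not Node; follow the trace
  }
  if (signals.eventLoopDelayP99Ms > 100) {
    return 'event-loop-blocked'; // CPU-bound or sync I/O on the hot path
  }
  if (signals.heapGrowing && signals.gcDurationP99Ms > 200) {
    return 'gc-pauses';
  }
  if (signals.maxDbSpansPerTrace >= 50) {
    return 'n-plus-one';
  }
  return 'pool-logging-or-leak'; // check pool stats, log rate, and long-term trends
}
```

The value is in the ordering: each check is cheaper to confirm than the one after it, so the function returns at the first signal that fires.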