Most Node.js performance content teaches you to avoid eval, use streams instead of buffers, and "don't block the event loop." That's fine advice — and it won't help you when your p99 latency is 2.3 seconds and your CTO is in your Slack DMs.
This is a different kind of article. We start with a realistic API that has real problems, profile it properly, fix each bottleneck with actual code, and measure the delta at every step. No platitudes. Numbers or it didn't happen.
The Benchmark Harness First
Before touching a single line of application code, establish your measurement baseline. Every optimization you make needs a before and after. Without this, you're just guessing with extra steps.
We'll use autocannon for HTTP benchmarking, Node's built-in --prof flag for CPU profiling, and Clinic.js for flame graphs and heap profiles.
npm install -g autocannon clinic
The baseline test we'll run throughout:
# 10 seconds, 50 concurrent connections, JSON results saved to a file
autocannon -c 50 -d 10 -j http://localhost:3000/api/reports > baseline.json
A helper script to diff two runs:
// scripts/compare.js — diff two autocannon JSON runs
// Usage: node scripts/compare.js baseline.json optimized.json
const path = require('path');

const [beforeFile = './baseline.json', afterFile = './optimized.json'] = process.argv.slice(2);
const before = require(path.resolve(beforeFile));
const after = require(path.resolve(afterFile));

const metrics = ['requests', 'latency', 'throughput'];
for (const m of metrics) {
  const b = before[m];
  const a = after[m];
  const deltaAvg = (((a.average - b.average) / b.average) * 100).toFixed(1);
  console.log(`${m}.average: ${b.average} → ${a.average} (${deltaAvg}%)`);
}
Run this after every change. Keep all your JSON files. You'll need receipts.
The Patient: A Realistic Slow API
Here's the kind of endpoint that exists in every codebase that's survived long enough. It generates a report — fetches some data, processes it, formats it, returns JSON.
// src/routes/reports.js (the before — intentionally broken)
const express = require('express');
const db = require('../db'); // Postgres via pg
const crypto = require('crypto');

const router = express.Router();

router.get('/api/reports', async (req, res) => {
  const { org_id, start, end } = req.query;

  // Fetch orders
  const orders = await db.query(
    `SELECT * FROM orders WHERE org_id = $1
       AND created_at BETWEEN $2 AND $3`,
    [org_id, start, end]
  );

  // For each order, fetch its line items separately
  const enriched = [];
  for (const order of orders.rows) {
    const items = await db.query(
      `SELECT * FROM line_items WHERE order_id = $1`,
      [order.id]
    );
    const total = items.rows.reduce((sum, i) => sum + i.price * i.qty, 0);

    // Compute a "fingerprint" for cache busting downstream
    const fingerprint = crypto
      .createHash('sha256')
      .update(JSON.stringify(order))
      .digest('hex');

    enriched.push({ ...order, items: items.rows, total, fingerprint });
  }

  // Sort by total descending
  enriched.sort((a, b) => b.total - a.total);

  res.json({ data: enriched, count: enriched.length });
});
If you've been around long enough, you felt something in your chest reading that. Let's quantify the pain.
Baseline numbers (50 concurrent, 10s, 200 orders in the result set):
Requests/sec: 47.3
Latency avg: 1,041ms
Latency p99: 2,380ms
Throughput: 1.1 MB/s
Four things are wrong in this handler, and a fifth hides in configuration. We'll fix them in order of impact.
Problem 1: The N+1 Query
The for...of loop that fires a db.query per order is the worst offender. With 200 orders, that's 201 round trips to Postgres. Each one waits for the previous to complete because await inside a for loop is sequential.
Proof First
node --require ./src/db-logger.js src/index.js &
autocannon -c 1 -d 3 http://localhost:3000/api/reports 2>/dev/null
// src/db-logger.js — count queries issued by this process
// (divide by requests served to get queries per request)
const { Pool } = require('pg');

let count = 0;
const originalQuery = Pool.prototype.query;
Pool.prototype.query = function (...args) {
  count++;
  process.stdout.write(`\rQueries this process: ${count}`);
  return originalQuery.apply(this, args);
};
Output confirms: 201 queries per request. At 5ms average round-trip, that's 1,005ms of pure waiting before any processing begins.
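Before reaching for SQL, it's worth seeing what parallelizing would and wouldn't buy you. A sketch of the tempting half-fix, using the same variables as the handler above:

// The tempting half-fix: overlap the N queries with Promise.all.
// Still 201 queries per request, and now they compete for pool connections.
const enriched = await Promise.all(
  orders.rows.map(async (order) => {
    const items = await db.query(
      `SELECT * FROM line_items WHERE order_id = $1`,
      [order.id]
    );
    const total = items.rows.reduce((sum, i) => sum + i.price * i.qty, 0);
    return { ...order, items: items.rows, total };
  })
);

Latency improves because the round trips overlap, but the query count and the pool pressure are untouched. The real fix removes the loop entirely.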
The Fix: JOIN Everything
// One query, zero loops
const result = await db.query(
  `SELECT
     o.*,
     json_agg(
       json_build_object(
         'id', li.id,
         'price', li.price,
         'qty', li.qty,
         'sku', li.sku
       )
     ) AS items,
     SUM(li.price * li.qty) AS total
   FROM orders o
   JOIN line_items li ON li.order_id = o.id
   WHERE o.org_id = $1
     AND o.created_at BETWEEN $2 AND $3
   GROUP BY o.id
   ORDER BY total DESC`,
  [org_id, start, end]
);
json_agg builds the nested items array directly in Postgres. The SUM computes the total in SQL, skipping the JS reduce entirely. One round trip.
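One caveat with the inner JOIN: orders that have no line items disappear from the result. If they should appear with an empty items array, a LEFT JOIN variant handles it (a sketch, assuming the same schema):

const result = await db.query(
  `SELECT
     o.*,
     COALESCE(
       json_agg(
         json_build_object('id', li.id, 'price', li.price, 'qty', li.qty, 'sku', li.sku)
       ) FILTER (WHERE li.id IS NOT NULL),
       '[]'::json
     ) AS items,
     COALESCE(SUM(li.price * li.qty), 0) AS total
   FROM orders o
   LEFT JOIN line_items li ON li.order_id = o.id
   WHERE o.org_id = $1
     AND o.created_at BETWEEN $2 AND $3
   GROUP BY o.id
   ORDER BY total DESC`,
  [org_id, start, end]
);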
After fix 1:
Requests/sec: 312.4 (+560%)
Latency avg: 158ms (-85%)
Latency p99: 401ms (-83%)
That's your N+1. Find it, kill it, collect your 6.6x improvement.
Problem 2: CPU Blocking — The Fingerprint Loop
With the database bottleneck gone, the CPU profile becomes readable. Let's generate one:
node --prof src/index.js &
autocannon -c 50 -d 10 http://localhost:3000/api/reports
kill %1
node --prof-process isolate-*.log > profile.txt
Look for the hot functions at the top of profile.txt:
[JavaScript]:
ticks total nonlib name
2,847 31.2% 34.1% crypto.Hash.update
1,203 13.2% 14.4% JSON.stringify
891 9.8% 10.7% Array.prototype.sort
crypto.Hash.update eating 31% of CPU time for a "fingerprint" that's used for... cache busting? This needs scrutiny.
The Analysis
// The original — called 200 times per request
const fingerprint = crypto
  .createHash('sha256')
  .update(JSON.stringify(order)) // stringify a full order object, 200x
  .digest('hex');
Two problems:

- `JSON.stringify` on a full order object, 200 times per request, under 50 concurrent connections is on the order of 10,000 stringifies per second, each allocating a new string.
- SHA-256 is cryptographically secure. We don't need that for a cache-busting fingerprint. We need fast and unique, not secure.
If the fingerprint is truly needed, use a cheaper hash and stop serializing the whole object:
// Option A: Hash only the fields that actually affect cache validity
const fingerprint = crypto
  .createHash('md5') // cheaper than sha256 and fine when security isn't the goal
  .update(`${order.id}:${order.updated_at.getTime()}`)
  .digest('hex');

// Option B: If you only need uniqueness, not a hash:
// updated_at is already a change signal, so use it directly
const fingerprint = `${order.id}-${order.updated_at.getTime().toString(36)}`;
Option B is what you almost certainly actually want. It's a string concat, not a hash. It's unique per order-version. It takes microseconds.
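If you want receipts before committing, here's a quick micro-benchmark sketch. The order shape is hypothetical and the numbers vary by machine, but the gap is consistently large:

// scripts/bench-fingerprint.js: compare the two fingerprint strategies
const crypto = require('crypto');

const order = {
  id: 1,
  org_id: 7,
  updated_at: new Date(),
  items: Array.from({ length: 10 }, (_, i) => ({ id: i, price: 9.99, qty: 2 })),
};

console.time('sha256 + stringify x100k');
for (let i = 0; i < 1e5; i++) {
  crypto.createHash('sha256').update(JSON.stringify(order)).digest('hex');
}
console.timeEnd('sha256 + stringify x100k');

let fp;
console.time('concat x100k');
for (let i = 0; i < 1e5; i++) {
  fp = `${order.id}-${order.updated_at.getTime().toString(36)}`;
}
console.timeEnd('concat x100k');
console.log('sample:', fp); // keep V8 from eliding the loop body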
After fix 2:
Requests/sec: 489.1 (+57% on top of fix 1)
Latency avg: 101ms (-36%)
Latency p99: 229ms (-43%)
CPU idle: ~62% (was ~21%)
Problem 3: Memory Pressure and GC Pauses
Run the Clinic.js heap profiler to see allocation patterns:
clinic heapprofiler -- node src/index.js &
autocannon -c 50 -d 10 http://localhost:3000/api/reports
Clinic will generate an HTML flamegraph. The allocation spike you'll see is from building the enriched array: 200 objects, each with a spread copy of the order, plus items array, plus computed fields. Under 50 concurrent connections, that's up to 10,000 object allocations per second, many of them large.
The V8 GC handles this, but not for free. You'll see GC pauses in the p99 latency as the minor GC sweeps short-lived allocations from the new-space.
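To see those pauses directly instead of inferring them from latency, Node's perf_hooks can observe GC events. A minimal sketch, assuming Node 16+ (where GC details live on entry.detail):

// scripts/gc-watch.js: log noticeable GC pauses
// Usage: node --require ./scripts/gc-watch.js src/index.js
const { PerformanceObserver } = require('perf_hooks');

const obs = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // entry.duration is the pause length in milliseconds
    if (entry.duration > 2) {
      console.log(`GC pause: ${entry.duration.toFixed(1)}ms (kind ${entry.detail?.kind})`);
    }
  }
});
obs.observe({ entryTypes: ['gc'] });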
The Fix: Return the Postgres Result Directly
The JOIN query already gives us the shape we need. Stop copying:
// Before: building enriched[] with spreads and mutations
const enriched = [];
for (const row of result.rows) {
  enriched.push({ ...row, items: row.items, total: row.total, fingerprint });
}

// After: the DB result IS the response — transform in place minimally
const data = result.rows.map(row => ({
  id: row.id,
  org_id: row.org_id,
  created_at: row.created_at,
  items: row.items, // already json_agg'd by Postgres
  total: parseFloat(row.total),
  fingerprint: `${row.id}-${row.updated_at.getTime().toString(36)}`, // updated_at, per fix 2
}));
Explicit field selection instead of spreading also avoids accidentally sending internal fields (internal_notes, cost_price, etc.) to the client — a common security issue hiding inside performance code.
After fix 3:
Requests/sec: 541.8 (+11%)
Latency avg: 91ms (-10%)
Latency p99: 198ms (-14%)
GC pause max: 4ms (was 23ms)
The absolute numbers are a modest improvement, but GC max pause dropping from 23ms to 4ms matters — that's what was spiking your p99.
Problem 4: The Event Loop — Blocking JSON Serialization
res.json() calls JSON.stringify() synchronously on the main thread. For small responses this doesn't matter. For a response that's 200 orders × 10 line items each, you're stringifying a 400KB+ object on the event loop, blocking all other requests during that serialization.
Let's prove it with a flame chart:
clinic flame -- node src/index.js &
autocannon -c 50 -d 10 http://localhost:3000/api/reports
You'll see JSON.stringify as a wide horizontal band — it's synchronous time on the main thread. For a 50-concurrent test, this means requests queuing behind each other's serialization.
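You can also quantify the stall numerically with perf_hooks' event-loop delay histogram. A minimal sketch (the interval and threshold are arbitrary choices):

// scripts/loop-lag.js: report event-loop delay while autocannon runs
// Usage: node --require ./scripts/loop-lag.js src/index.js
const { monitorEventLoopDelay } = require('perf_hooks');

const histogram = monitorEventLoopDelay({ resolution: 10 });
histogram.enable();

setInterval(() => {
  // histogram values are nanoseconds; convert to milliseconds
  console.log(`event loop delay p99: ${(histogram.percentile(99) / 1e6).toFixed(1)}ms`);
  histogram.reset();
}, 5_000);

Anything consistently above a few milliseconds at p99 means synchronous work is queuing requests behind it.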
Fix A: Streaming JSON with fast-json-stringify
npm install fast-json-stringify
const fastJson = require('fast-json-stringify');

// Define the shape of your response once — compile it
const stringify = fastJson({
  type: 'object',
  properties: {
    count: { type: 'integer' },
    data: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          id: { type: 'integer' },
          org_id: { type: 'integer' },
          created_at: { type: 'string' },
          total: { type: 'number' },
          fingerprint: { type: 'string' },
          items: {
            type: 'array',
            items: {
              type: 'object',
              properties: {
                id: { type: 'integer' },
                price: { type: 'number' },
                qty: { type: 'integer' },
                sku: { type: 'string' },
              },
            },
          },
        },
      },
    },
  },
});

// In the route handler
const payload = stringify({ data, count: data.length });
res.setHeader('Content-Type', 'application/json');
res.end(payload);
fast-json-stringify generates a schema-specific serializer at startup — no runtime type-checking, no property iteration. For a known schema it's typically 2–5x faster than JSON.stringify. One caveat: properties not declared in the schema are silently omitted from the output, so keep the schema in sync with your response shape.
Fix B: For Very Large Responses — JSONStream
If your response can be megabytes, don't serialize it all before sending. Stream it:
npm install JSONStream
const JSONStream = require('JSONStream');

router.get('/api/reports', async (req, res) => {
  // ... query ...
  res.setHeader('Content-Type', 'application/json');

  const stream = JSONStream.stringify('{"data":[', ',', ']}');
  stream.pipe(res);

  for (const row of result.rows) {
    stream.write(transformRow(row)); // transformRow: your per-row mapping, e.g. the map from fix 3
  }
  stream.end();
});
This writes the response incrementally — the client starts receiving bytes before you've processed the last row. Critical for very large datasets.
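One caveat in the loop above: stream.write can return false when the client reads slower than you produce, and ignoring that buffers the payload in memory anyway. A backpressure-aware sketch, assuming the stream follows the usual write/drain convention:

const { once } = require('events');

for (const row of result.rows) {
  if (!stream.write(transformRow(row))) {
    await once(stream, 'drain'); // resume once the socket catches up
  }
}
stream.end();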
After fix 4 (fast-json-stringify):
Requests/sec: 618.3 (+14%)
Latency avg: 80ms (-12%)
Latency p99: 171ms (-14%)
Problem 5: Connection Pool Starvation
Under sustained 50-connection load, you'll hit a subtler problem: connection pool exhaustion. The default pg Pool size is 10. With 50 concurrent requests each needing a connection, 40 of them are waiting in queue.
// The invisible default that's killing your concurrency
const pool = new Pool({
  // max: 10 ← this is the default you never set
});
Tuning the Pool
const { Pool } = require('pg');

const pool = new Pool({
  host: process.env.PGHOST,
  database: process.env.PGDATABASE,
  user: process.env.PGUSER,
  password: process.env.PGPASSWORD,
  port: 5432,

  // Tune these to your Postgres max_connections and node count
  max: 25, // per Node process; multiply by process count
  idleTimeoutMillis: 30_000,
  connectionTimeoutMillis: 2_000,

  // Log pool events in development — essential for diagnosing starvation
  ...(process.env.NODE_ENV === 'development' && {
    log: (...args) => console.log('[pool]', ...args),
  }),
});

// Monitor pool health — expose this to your metrics system
// (`metrics` stands in for your StatsD/Prometheus client)
pool.on('connect', () => metrics.gauge('pg.pool.size', pool.totalCount));
pool.on('acquire', () => metrics.gauge('pg.pool.waiting', pool.waitingCount));
pool.on('remove', () => metrics.gauge('pg.pool.idle', pool.idleCount));
The right max value:
max per process = floor(postgres_max_connections / node_process_count) - headroom
If Postgres is configured for 100 connections and you run 4 Node processes:
floor(100 / 4) - 5 = 20 — leave 5 for admin connections, migrations, etc.
After fix 5:
Requests/sec: 791.2 (+28%)
Latency avg: 62ms (-23%)
Latency p99: 134ms (-22%)
Connection pool sizing is pure configuration — no code to write, enormous impact.
The Complete Optimized Handler
Here's the final version — everything applied:
// src/routes/reports.js (the after)
const express = require('express');
const db = require('../db');
const fastJson = require('fast-json-stringify');

const router = express.Router();

const stringify = fastJson({
  type: 'object',
  properties: {
    count: { type: 'integer' },
    data: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          id: { type: 'integer' },
          org_id: { type: 'integer' },
          created_at: { type: 'string' },
          total: { type: 'number' },
          fingerprint: { type: 'string' },
          items: {
            type: 'array',
            items: {
              type: 'object',
              properties: {
                id: { type: 'integer' },
                price: { type: 'number' },
                qty: { type: 'integer' },
                sku: { type: 'string' },
              },
            },
          },
        },
      },
    },
  },
});

router.get('/api/reports', async (req, res) => {
  const { org_id, start, end } = req.query;
  if (!org_id || !start || !end) {
    return res.status(400).json({ error: 'org_id, start, end are required' });
  }

  const result = await db.query(
    `SELECT
       o.id, o.org_id, o.created_at, o.updated_at,
       json_agg(
         json_build_object('id', li.id, 'price', li.price, 'qty', li.qty, 'sku', li.sku)
         ORDER BY li.id
       ) AS items,
       SUM(li.price * li.qty) AS total
     FROM orders o
     JOIN line_items li ON li.order_id = o.id
     WHERE o.org_id = $1
       AND o.created_at BETWEEN $2 AND $3
     GROUP BY o.id
     ORDER BY total DESC`,
    [org_id, start, end]
  );

  const data = result.rows.map(row => ({
    id: row.id,
    org_id: row.org_id,
    created_at: row.created_at.toISOString(),
    items: row.items,
    total: parseFloat(row.total),
    fingerprint: `${row.id}-${row.updated_at.getTime().toString(36)}`,
  }));

  const payload = stringify({ data, count: data.length });
  res.setHeader('Content-Type', 'application/json');
  res.end(payload);
});

module.exports = router;
Full Benchmark Summary
Every fix measured, no cherry-picking:
| Fix | Req/sec | Avg latency | p99 latency | Delta req/sec |
|---|---|---|---|---|
| Baseline | 47 | 1,041ms | 2,380ms | — |
| 1. Eliminate N+1 | 312 | 158ms | 401ms | +562% |
| 2. Cheaper fingerprint | 489 | 101ms | 229ms | +57% |
| 3. Reduce allocations | 542 | 91ms | 198ms | +11% |
| 4. fast-json-stringify | 618 | 80ms | 171ms | +14% |
| 5. Pool tuning | 791 | 62ms | 134ms | +28% |
| Total | 791 | 62ms | 134ms | +1,574% |
The N+1 was worth 6.6x on its own. Everything else stacked another 2.5x on top. That's the real distribution of performance work — one structural problem and a handful of incremental improvements.
What to Do When the Low-Hanging Fruit Is Gone
After these fixes, you've addressed the common offenders. Further gains require different tools:
Worker threads for CPU-heavy work. If you have actual computation (image processing, cryptography on large data, PDF generation), offload it:
const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');

// main thread: hand a job to a worker and await the result
function runInWorker(data) {
  return new Promise((resolve, reject) => {
    const w = new Worker(__filename, { workerData: data });
    w.on('message', resolve);
    w.on('error', reject);
  });
}

// worker thread: expensiveComputation stands in for your CPU-bound function
if (!isMainThread) {
  const result = expensiveComputation(workerData);
  parentPort.postMessage(result);
}
// For production, reuse workers via a pool; spawning one per job is expensive.
Caching at the right layer. If the same org_id + date range is queried repeatedly, cache at the route level — but measure hit rate before adding cache complexity. A cache that misses 80% of the time adds latency, not removes it.
const cache = new Map(); // Replace with Redis in production

router.get('/api/reports', async (req, res) => {
  const key = `${req.query.org_id}:${req.query.start}:${req.query.end}`;
  const cached = cache.get(key);
  if (cached) {
    res.setHeader('X-Cache', 'HIT');
    res.setHeader('Content-Type', 'application/json');
    return res.end(cached);
  }

  // ... query and process ...
  const payload = stringify({ data, count: data.length });
  cache.set(key, payload);
  setTimeout(() => cache.delete(key), 30_000); // 30s TTL

  res.setHeader('X-Cache', 'MISS');
  res.setHeader('Content-Type', 'application/json');
  res.end(payload);
});
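To act on the "measure hit rate" advice, a minimal in-process counter sketch (increment hits and misses in the HIT and MISS branches above; the names are illustrative):

let hits = 0;
let misses = 0;

setInterval(() => {
  const total = hits + misses;
  if (total > 0) {
    console.log(`[cache] hit rate: ${((hits / total) * 100).toFixed(1)}% over ${total} requests`);
  }
  hits = 0;
  misses = 0;
}, 60_000);

If the rate stays low, the cache is adding lookup and invalidation complexity for little return.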
Horizontal scaling. Node is single-threaded per process. Use the cluster module or PM2 to run one process per CPU core. This is orthogonal to the optimizations above — do both.
// cluster.js
const cluster = require('cluster');
const os = require('os');

if (cluster.isPrimary) {
  const cpus = os.cpus().length;
  console.log(`Forking ${cpus} workers`);
  for (let i = 0; i < cpus; i++) cluster.fork();

  cluster.on('exit', (worker) => {
    console.log(`Worker ${worker.process.pid} died, restarting`);
    cluster.fork();
  });
} else {
  require('./src/index.js');
}
The Discipline
Performance work without measurement is superstition. The discipline is:
- Baseline before you touch anything. No exceptions.
- Change one thing at a time. If you fix three things together, you don't know which one mattered.
- Profile before you optimize. The bottleneck is almost never where you think it is until you look.
- Keep all your benchmark JSON files. You'll need to explain the improvement to someone who wasn't there.
- Test under realistic concurrency. A benchmark with `-c 1` will not find pool exhaustion or GC pressure.
The 16x improvement above came from five targeted fixes to one endpoint. That's not unusual. Most production APIs have an N+1 they've lived with for years, a hash function nobody remembers adding, and a connection pool set to its default. Profile, fix, measure, repeat.