Most Node.js performance content teaches you to avoid eval, use streams instead of buffers, and "don't block the event loop." That's fine advice — and it won't help you when your p99 latency is 2.3 seconds and your CTO is in your Slack DMs.
This is a different kind of article. We start with a realistic API that has real problems, profile it properly, fix each bottleneck with actual code, and measure the delta at every step. No platitudes. Numbers or it didn't happen.
The Benchmark Harness First
Before touching a single line of application code, establish your measurement baseline. Every optimization you make needs a before and after. Without this, you're just guessing with extra steps.
We'll use autocannon for HTTP benchmarking, Node's built-in --prof flag for CPU profiling, and Clinic.js for flame graphs and heap profiles.
npm install -g autocannon clinic
The baseline test we'll run throughout:
# 10 seconds, 50 concurrent connections, JSON results saved to a file
autocannon -c 50 -d 10 -j http://localhost:3000/api/reports > baseline.json
A helper script to diff two runs:
// scripts/compare.js — diff two autocannon JSON runs
// Usage: node scripts/compare.js baseline.json optimized.json
const path = require('path');

const [beforeFile = './baseline.json', afterFile = './optimized.json'] = process.argv.slice(2);
const before = require(path.resolve(beforeFile));
const after = require(path.resolve(afterFile));

const metrics = ['requests', 'latency', 'throughput'];
for (const m of metrics) {
  const b = before[m];
  const a = after[m];
  const deltaAvg = (((a.average - b.average) / b.average) * 100).toFixed(1);
  console.log(`${m}.average: ${b.average} → ${a.average} (${deltaAvg}%)`);
}
Run this after every change. Keep all your JSON files. You'll need receipts.
The Patient: A Realistic Slow API
Here's the kind of endpoint that exists in every codebase that's survived long enough. It generates a report — fetches some data, processes it, formats it, returns JSON.
// src/routes/reports.js (the before — intentionally broken)
const express = require('express');
const db = require('../db'); // Postgres via pg
const crypto = require('crypto');

const router = express.Router();

router.get('/api/reports', async (req, res) => {
  const { org_id, start, end } = req.query;

  // Fetch orders
  const orders = await db.query(
    `SELECT * FROM orders WHERE org_id = $1
       AND created_at BETWEEN $2 AND $3`,
    [org_id, start, end]
  );

  // For each order, fetch its line items separately
  const enriched = [];
  for (const order of orders.rows) {
    const items = await db.query(
      `SELECT * FROM line_items WHERE order_id = $1`,
      [order.id]
    );
    const total = items.rows.reduce((sum, i) => sum + i.price * i.qty, 0);

    // Compute a "fingerprint" for cache busting downstream
    const fingerprint = crypto
      .createHash('sha256')
      .update(JSON.stringify(order))
      .digest('hex');

    enriched.push({ ...order, items: items.rows, total, fingerprint });
  }

  // Sort by total descending
  enriched.sort((a, b) => b.total - a.total);

  res.json({ data: enriched, count: enriched.length });
});
If you've been around long enough, you felt something in your chest reading that. Let's quantify the pain.
Baseline numbers (50 concurrent, 10s, 200 orders in the result set):
Requests/sec: 47.3
Latency avg: 1,041ms
Latency p99: 2,380ms
Throughput: 1.1 MB/s
Four things are wrong in this handler, and a fifth hides in configuration. We'll fix them in order of impact.
Problem 1: The N+1 Query
The for...of loop that fires a db.query per order is the worst offender. With 200 orders, that's 201 round trips to Postgres. Each one waits for the previous to complete because await inside a for loop is sequential.
Proof First
node --require ./src/db-logger.js src/index.js &
autocannon -c 1 -d 3 http://localhost:3000/api/reports 2>/dev/null
// src/db-logger.js — count queries issued by this process
// (divide by requests served to get queries per request)
const { Pool } = require('pg');

let count = 0;
const originalQuery = Pool.prototype.query;
Pool.prototype.query = function (...args) {
  count++;
  process.stdout.write(`\rQueries this process: ${count}`);
  return originalQuery.apply(this, args);
};
Output confirms: 201 queries per request. At 5ms average round-trip, that's 1,005ms of pure waiting before any processing begins.
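Before reaching for SQL, it's worth seeing what parallelizing would and wouldn't buy you. A sketch of the tempting half-fix, using the same variables as the handler above:

// The tempting half-fix: overlap the N queries with Promise.all.
// Still 201 queries per request, and now they compete for pool connections.
const enriched = await Promise.all(
  orders.rows.map(async (order) => {
    const items = await db.query(
      `SELECT * FROM line_items WHERE order_id = $1`,
      [order.id]
    );
    const total = items.rows.reduce((sum, i) => sum + i.price * i.qty, 0);
    return { ...order, items: items.rows, total };
  })
);

Latency improves because the round trips overlap, but the query count and the pool pressure are untouched. The real fix removes the loop entirely.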
The Fix: JOIN Everything
// One query, zero loops
const result = await db.query(
  `SELECT
     o.*,
     json_agg(
       json_build_object(
         'id', li.id,
         'price', li.price,
         'qty', li.qty,
         'sku', li.sku
       )
     ) AS items,
     SUM(li.price * li.qty) AS total
   FROM orders o
   JOIN line_items li ON li.order_id = o.id
   WHERE o.org_id = $1
     AND o.created_at BETWEEN $2 AND $3
   GROUP BY o.id
   ORDER BY total DESC`,
  [org_id, start, end]
);
json_agg builds the nested items array directly in Postgres. The SUM computes the total in SQL, skipping the JS reduce entirely. One round trip.
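One caveat with the inner JOIN: orders that have no line items disappear from the result. If they should appear with an empty items array, a LEFT JOIN variant handles it (a sketch, assuming the same schema):

const result = await db.query(
  `SELECT
     o.*,
     COALESCE(
       json_agg(
         json_build_object('id', li.id, 'price', li.price, 'qty', li.qty, 'sku', li.sku)
       ) FILTER (WHERE li.id IS NOT NULL),
       '[]'::json
     ) AS items,
     COALESCE(SUM(li.price * li.qty), 0) AS total
   FROM orders o
   LEFT JOIN line_items li ON li.order_id = o.id
   WHERE o.org_id = $1
     AND o.created_at BETWEEN $2 AND $3
   GROUP BY o.id
   ORDER BY total DESC`,
  [org_id, start, end]
);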
After fix 1:
Requests/sec: 312.4 (+560%)
Latency avg: 158ms (-85%)
Latency p99: 401ms (-83%)
That's your N+1. Find it, kill it, collect your 6.6x improvement.
Problem 2: CPU Blocking — The Fingerprint Loop
With the database bottleneck gone, the CPU profile becomes readable. Let's generate one:
node --prof src/index.js &
autocannon -c 50 -d 10 http://localhost:3000/api/reports
kill %1
node --prof-process isolate-*.log > profile.txt
Look for the hot functions at the top of profile.txt:
[JavaScript]:
ticks total nonlib name
2,847 31.2% 34.1% crypto.Hash.update
1,203 13.2% 14.4% JSON.stringify
891 9.8% 10.7% Array.prototype.sort
crypto.Hash.update eating 31% of CPU time for a "fingerprint" that's used for... cache busting? This needs scrutiny.
The Analysis
// The original — called 200 times per request
const fingerprint = crypto
  .createHash('sha256')
  .update(JSON.stringify(order)) // stringify a full order object, 200x
  .digest('hex');
Two problems:

- `JSON.stringify` on a full order object, 200 times per request, under 50 concurrent connections is on the order of 10,000 stringifies per second, each allocating a new string.
- SHA-256 is cryptographically secure. We don't need that for a cache-busting fingerprint. We need fast and unique, not secure.
If the fingerprint is truly needed, use a cheaper hash and stop serializing the whole object:
// Option A: Hash only the fields that actually affect cache validity
const fingerprint = crypto
  .createHash('md5') // cheaper than sha256 and fine when security isn't the goal
  .update(`${order.id}:${order.updated_at.getTime()}`)
  .digest('hex');

// Option B: If you only need uniqueness, not a hash:
// updated_at is already a change signal, so use it directly
const fingerprint = `${order.id}-${order.updated_at.getTime().toString(36)}`;
Option B is what you almost certainly actually want. It's a string concat, not a hash. It's unique per order-version. It takes microseconds.
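If you want receipts before committing, here's a quick micro-benchmark sketch. The order shape is hypothetical and the numbers vary by machine, but the gap is consistently large:

// scripts/bench-fingerprint.js: compare the two fingerprint strategies
const crypto = require('crypto');

const order = {
  id: 1,
  org_id: 7,
  updated_at: new Date(),
  items: Array.from({ length: 10 }, (_, i) => ({ id: i, price: 9.99, qty: 2 })),
};

console.time('sha256 + stringify x100k');
for (let i = 0; i < 1e5; i++) {
  crypto.createHash('sha256').update(JSON.stringify(order)).digest('hex');
}
console.timeEnd('sha256 + stringify x100k');

let fp;
console.time('concat x100k');
for (let i = 0; i < 1e5; i++) {
  fp = `${order.id}-${order.updated_at.getTime().toString(36)}`;
}
console.timeEnd('concat x100k');
console.log('sample:', fp); // keep V8 from eliding the loop body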
After fix 2:
Requests/sec: 489.1 (+57% on top of fix 1)
Latency avg: 101ms (-36%)
Latency p99: 229ms (-43%)
CPU idle: ~62% (was ~21%)
Problem 3: Memory Pressure and GC Pauses
Run the Clinic.js heap profiler to see allocation patterns:
clinic heapprofiler -- node src/index.js &
autocannon -c 50 -d 10 http://localhost:3000/api/reports
Clinic will generate an HTML flamegraph. The allocation spike you'll see is from building the enriched array: 200 objects, each with a spread copy of the order, plus items array, plus computed fields. Under 50 concurrent connections, that's up to 10,000 object allocations per second, many of them large.
The V8 GC handles this, but not for free. You'll see GC pauses in the p99 latency as the minor GC sweeps short-lived allocations from the new-space.
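To see those pauses directly instead of inferring them from latency, Node's perf_hooks can observe GC events. A minimal sketch, assuming Node 16+ (where GC details live on entry.detail):

// scripts/gc-watch.js: log noticeable GC pauses
// Usage: node --require ./scripts/gc-watch.js src/index.js
const { PerformanceObserver } = require('perf_hooks');

const obs = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // entry.duration is the pause length in milliseconds
    if (entry.duration > 2) {
      console.log(`GC pause: ${entry.duration.toFixed(1)}ms (kind ${entry.detail?.kind})`);
    }
  }
});
obs.observe({ entryTypes: ['gc'] });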
The Fix: Return the Postgres Result Directly
The JOIN query already gives us the shape we need. Stop copying:
// Before: building enriched[] with spreads and mutations
const enriched = [];
for (const row of result.rows) {
  enriched.push({ ...row, items: row.items, total: row.total, fingerprint });
}

// After: the DB result IS the response — transform in place minimally
const data = result.rows.map(row => ({
  id: row.id,
  org_id: row.org_id,
  created_at: row.created_at,
  items: row.items, // already json_agg'd by Postgres
  total: parseFloat(row.total),
  fingerprint: `${row.id}-${row.updated_at.getTime().toString(36)}`, // updated_at, per fix 2
}));
Explicit field selection instead of spreading also avoids accidentally sending internal fields (internal_notes, cost_price, etc.) to the client — a common security issue hiding inside performance code.
After fix 3:
Requests/sec: 541.8 (+11%)
Latency avg: 91ms (-10%)
Latency p99: 198ms (-14%)
GC pause max: 4ms (was 23ms)
The absolute numbers are a modest improvement, but GC max pause dropping from 23ms to 4ms matters — that's what was spiking your p99.
Problem 4: The Event Loop — Blocking JSON Serialization
res.json() calls JSON.stringify() synchronously on the main thread. For small responses this doesn't matter. For a response that's 200 orders × 10 line items each, you're stringifying a 400KB+ object on the event loop, blocking all other requests during that serialization.
Let's prove it with a flame chart:
clinic flame -- node src/index.js &
autocannon -c 50 -d 10 http://localhost:3000/api/reports
You'll see JSON.stringify as a wide horizontal band — it's synchronous time on the main thread. For a 50-concurrent test, this means requests queuing behind each other's serialization.
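You can also quantify the stall numerically with perf_hooks' event-loop delay histogram. A minimal sketch (the interval and threshold are arbitrary choices):

// scripts/loop-lag.js: report event-loop delay while autocannon runs
// Usage: node --require ./scripts/loop-lag.js src/index.js
const { monitorEventLoopDelay } = require('perf_hooks');

const histogram = monitorEventLoopDelay({ resolution: 10 });
histogram.enable();

setInterval(() => {
  // histogram values are nanoseconds; convert to milliseconds
  console.log(`event loop delay p99: ${(histogram.percentile(99) / 1e6).toFixed(1)}ms`);
  histogram.reset();
}, 5_000);

Anything consistently above a few milliseconds at p99 means synchronous work is queuing requests behind it.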
Fix A: Streaming JSON with fast-json-stringify
npm install fast-json-stringify
const fastJson = require('fast-json-stringify');

// Define the shape of your response once — compile it
const stringify = fastJson({
  type: 'object',
  properties: {
    count: { type: 'integer' },
    data: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          id: { type: 'integer' },
          org_id: { type: 'integer' },
          created_at: { type: 'string' },
          total: { type: 'number' },
          fingerprint: { type: 'string' },
          items: {
            type: 'array',
            items: {
              type: 'object',
              properties: {
                id: { type: 'integer' },
                price: { type: 'number' },
                qty: { type: 'integer' },
                sku: { type: 'string' },
              },
            },
          },
        },
      },
    },
  },
});

// In the route handler
const payload = stringify({ data, count: data.length });
res.setHeader('Content-Type', 'application/json');
res.end(payload);
fast-json-stringify generates a schema-specific serializer at startup — no runtime type-checking, no property iteration. For a known schema it's typically 2–5x faster than JSON.stringify. One caveat: properties not declared in the schema are silently omitted from the output, so keep the schema in sync with your response shape.
Fix B: For Very Large Responses — JSONStream
If your response can be megabytes, don't serialize it all before sending. Stream it:
npm install JSONStream
const JSONStream = require('JSONStream');

router.get('/api/reports', async (req, res) => {
  // ... query ...
  res.setHeader('Content-Type', 'application/json');

  const stream = JSONStream.stringify('{"data":[', ',', ']}');
  stream.pipe(res);

  for (const row of result.rows) {
    stream.write(transformRow(row)); // transformRow: your per-row mapping, e.g. the map from fix 3
  }
  stream.end();
});
This writes the response incrementally — the client starts receiving bytes before you've processed the last row. Critical for very large datasets.
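One caveat in the loop above: stream.write can return false when the client reads slower than you produce, and ignoring that buffers the payload in memory anyway. A backpressure-aware sketch, assuming the stream follows the usual write/drain convention:

const { once } = require('events');

for (const row of result.rows) {
  if (!stream.write(transformRow(row))) {
    await once(stream, 'drain'); // resume once the socket catches up
  }
}
stream.end();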
After fix 4 (fast-json-stringify):
Requests/sec: 618.3 (+14%)
Latency avg: 80ms (-12%)
Latency p99: 171ms (-14%)
Problem 5: Connection Pool Starvation
Under sustained 50-connection load, you'll hit a subtler problem: connection pool exhaustion. The default pg Pool size is 10. With 50 concurrent requests each needing a connection, 40 of them are waiting in queue.
// The invisible default that's killing your concurrency
const pool = new Pool({
  // max: 10 ← this is the default you never set
});
Tuning the Pool
const { Pool } = require('pg');

const pool = new Pool({
  host: process.env.PGHOST,
  database: process.env.PGDATABASE,
  user: process.env.PGUSER,
  password: process.env.PGPASSWORD,
  port: 5432,

  // Tune these to your Postgres max_connections and node count
  max: 25, // per Node process; multiply by process count
  idleTimeoutMillis: 30_000,
  connectionTimeoutMillis: 2_000,

  // Log pool events in development — essential for diagnosing starvation
  ...(process.env.NODE_ENV === 'development' && {
    log: (...args) => console.log('[pool]', ...args),
  }),
});

// Monitor pool health — expose this to your metrics system
// (`metrics` stands in for your StatsD/Prometheus client)
pool.on('connect', () => metrics.gauge('pg.pool.size', pool.totalCount));
pool.on('acquire', () => metrics.gauge('pg.pool.waiting', pool.waitingCount));
pool.on('remove', () => metrics.gauge('pg.pool.idle', pool.idleCount));
The right max value:
max per process = floor(postgres_max_connections / node_process_count) - headroom
If Postgres is configured for 100 connections and you run 4 Node processes:
floor(100 / 4) - 5 = 20 — leave 5 for admin connections, migrations, etc.
After fix 5:
Requests/sec: 791.2 (+28%)
Latency avg: 62ms (-23%)
Latency p99: 134ms (-22%)
Connection pool sizing is pure configuration — no code to write, enormous impact.
The Complete Optimized Handler
Here's the final version — everything applied:
// src/routes/reports.js (the after)
const express = require('express');
const db = require('../db');
const fastJson = require('fast-json-stringify');

const router = express.Router();

const stringify = fastJson({
  type: 'object',
  properties: {
    count: { type: 'integer' },
    data: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          id: { type: 'integer' },
          org_id: { type: 'integer' },
          created_at: { type: 'string' },
          total: { type: 'number' },
          fingerprint: { type: 'string' },
          items: {
            type: 'array',
            items: {
              type: 'object',
              properties: {
                id: { type: 'integer' },
                price: { type: 'number' },
                qty: { type: 'integer' },
                sku: { type: 'string' },
              },
            },
          },
        },
      },
    },
  },
});

router.get('/api/reports', async (req, res) => {
  const { org_id, start, end } = req.query;
  if (!org_id || !start || !end) {
    return res.status(400).json({ error: 'org_id, start, end are required' });
  }

  const result = await db.query(
    `SELECT
       o.id, o.org_id, o.created_at, o.updated_at,
       json_agg(
         json_build_object('id', li.id, 'price', li.price, 'qty', li.qty, 'sku', li.sku)
         ORDER BY li.id
       ) AS items,
       SUM(li.price * li.qty) AS total
     FROM orders o
     JOIN line_items li ON li.order_id = o.id
     WHERE o.org_id = $1
       AND o.created_at BETWEEN $2 AND $3
     GROUP BY o.id
     ORDER BY total DESC`,
    [org_id, start, end]
  );

  const data = result.rows.map(row => ({
    id: row.id,
    org_id: row.org_id,
    created_at: row.created_at.toISOString(),
    items: row.items,
    total: parseFloat(row.total),
    fingerprint: `${row.id}-${row.updated_at.getTime().toString(36)}`,
  }));

  const payload = stringify({ data, count: data.length });
  res.setHeader('Content-Type', 'application/json');
  res.end(payload);
});

module.exports = router;
Full Benchmark Summary
Every fix measured, no cherry-picking:
| Fix | Req/sec | Avg latency | p99 latency | Delta req/sec |
|---|---|---|---|---|
| Baseline | 47 | 1,041ms | 2,380ms | — |
| 1. Eliminate N+1 | 312 | 158ms | 401ms | +562% |
| 2. Cheaper fingerprint | 489 | 101ms | 229ms | +57% |
| 3. Reduce allocations | 542 | 91ms | 198ms | +11% |
| 4. fast-json-stringify | 618 | 80ms | 171ms | +14% |
| 5. Pool tuning | 791 | 62ms | 134ms | +28% |
| Total | 791 | 62ms | 134ms | +1,574% |
The N+1 was worth 6.6x on its own. Everything else stacked another 2.5x on top. That's the real distribution of performance work — one structural problem and a handful of incremental improvements.
What to Do When the Low-Hanging Fruit Is Gone
After these fixes, you've addressed the common offenders. Further gains require different tools:
Worker threads for CPU-heavy work. If you have actual computation (image processing, cryptography on large data, PDF generation), offload it:
const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');

// main thread: hand a job to a worker and await the result
function runInWorker(data) {
  return new Promise((resolve, reject) => {
    const w = new Worker(__filename, { workerData: data });
    w.on('message', resolve);
    w.on('error', reject);
  });
}

// worker thread: expensiveComputation stands in for your CPU-bound function
if (!isMainThread) {
  const result = expensiveComputation(workerData);
  parentPort.postMessage(result);
}
// For production, reuse workers via a pool; spawning one per job is expensive.
Caching at the right layer. If the same org_id + date range is queried repeatedly, cache at the route level — but measure hit rate before adding cache complexity. A cache that misses 80% of the time adds latency, not removes it.
const cache = new Map(); // Replace with Redis in production

router.get('/api/reports', async (req, res) => {
  const key = `${req.query.org_id}:${req.query.start}:${req.query.end}`;
  const cached = cache.get(key);
  if (cached) {
    res.setHeader('X-Cache', 'HIT');
    res.setHeader('Content-Type', 'application/json');
    return res.end(cached);
  }

  // ... query and process ...
  const payload = stringify({ data, count: data.length });
  cache.set(key, payload);
  setTimeout(() => cache.delete(key), 30_000); // 30s TTL

  res.setHeader('X-Cache', 'MISS');
  res.setHeader('Content-Type', 'application/json');
  res.end(payload);
});
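To act on the "measure hit rate" advice, a minimal in-process counter sketch (increment hits and misses in the HIT and MISS branches above; the names are illustrative):

let hits = 0;
let misses = 0;

setInterval(() => {
  const total = hits + misses;
  if (total > 0) {
    console.log(`[cache] hit rate: ${((hits / total) * 100).toFixed(1)}% over ${total} requests`);
  }
  hits = 0;
  misses = 0;
}, 60_000);

If the rate stays low, the cache is adding lookup and invalidation complexity for little return.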
Horizontal scaling. Node is single-threaded per process. Use the cluster module or PM2 to run one process per CPU core. This is orthogonal to the optimizations above — do both.
// cluster.js
const cluster = require('cluster');
const os = require('os');

if (cluster.isPrimary) {
  const cpus = os.cpus().length;
  console.log(`Forking ${cpus} workers`);
  for (let i = 0; i < cpus; i++) cluster.fork();

  cluster.on('exit', (worker) => {
    console.log(`Worker ${worker.process.pid} died, restarting`);
    cluster.fork();
  });
} else {
  require('./src/index.js');
}
The Discipline
Performance work without measurement is superstition. The discipline is:
- Baseline before you touch anything. No exceptions.
- Change one thing at a time. If you fix three things together, you don't know which one mattered.
- Profile before you optimize. The bottleneck is almost never where you think it is until you look.
- Keep all your benchmark JSON files. You'll need to explain the improvement to someone who wasn't there.
- Test under realistic concurrency. A benchmark with `-c 1` will not find pool exhaustion or GC pressure.
The 16x improvement above came from five targeted fixes to one endpoint. That's not unusual. Most production APIs have an N+1 they've lived with for years, a hash function nobody remembers adding, and a connection pool set to its default. Profile, fix, measure, repeat.