DEV Community: Emir

Database Performance Bottlenecks: N+1 Queries, Missing Indexes, and Connection Pools

Emir — Fri, 27 Mar 2026 02:56:37 +0000

"Software engineering is more than just centering a div." — howtocenterdiv.com

TLDR: You profiled Node.js, added Redis, switched runtimes. Still slow. The database was the problem the whole time. Fix N+1 queries and missing indexes before you touch anything else.

Same story every time. App is slow. Someone swaps the framework. Still slow. Someone adds Redis. Still slow. Finally somebody opens up the query plan and there it is — a sequential scan across 4 million rows because nobody put an index on user_id. Four seconds per request. Was sitting there the whole time.

Nobody ever looks at the database first though. It's always "let's try Fastify" or "let's add a cache" because those feel like real engineering work. An index is one line of SQL. You can't demo one line of SQL in standup. A framework migration gets you a PR, a benchmark chart, maybe even a blog post. But it won't fix a missing index, obviously.

And honestly? Most backend devs have never been taught to read a query plan. Bootcamps don't go there, tutorials barely mention EXPLAIN, and you end up figuring it all out at 2am during a production fire. After that you never forget it. But it shouldn't take a production incident to learn this stuff.

N+1 Queries

We all learn about N+1 early on. We all still write one at some point:

const orders = await db.query('SELECT * FROM orders LIMIT 50');

for (const order of orders) {
  order.user = await db.query(
    'SELECT * FROM users WHERE id = $1',
    [order.userId]
  );
}

51 round trips for 50 orders. PM says "can we do 500 per page?" and now it's 501. The page is slow because of a loop hitting the database on every iteration — nothing to do with Node being slow or not.

What's really annoying is that N+1 hides in dev. You've got 10 rows locally, it returns instantly, you move on. Deploys, data piles up, and six months from now somebody files a Jira ticket about the orders page taking 8 seconds. By then everyone forgot this code existed.

One JOIN:

SELECT orders.*, users.name, users.email
FROM orders
JOIN users ON orders.user_id = users.id
LIMIT 50;

ORMs make it worse because the code looks clean:

// Reads beautifully. Fires 51 queries.
const orders = await prisma.order.findMany({ take: 50 });
for (const order of orders) {
  const user = await prisma.user.findUnique({ where: { id: order.userId } });
}

// This does the same thing in 1 query.
const orders = await prisma.order.findMany({
  take: 50,
  include: { user: true },
});

Nothing about that first version looks wrong. Prisma, Sequelize, TypeORM, all of them will do this to you if you don't pay attention. The whole selling point of an ORM is that it hides SQL. But it hides the query count too, and that's the part that bites.

Turn on query logging. Prisma has log: ['query'] on the client, Sequelize takes logging: console.log. Go look at what gets generated. I've opened up codebases where one endpoint was doing 200+ queries because somebody nested includes three levels deep. Endpoint worked, was just painfully slow, and nobody thought to check the logs until users started complaining.

GraphQL makes it uglier because resolvers fire per field, so a single query can spawn dozens of DB calls without anyone realizing. DataLoader was built for exactly this — batches and deduplicates calls within one tick. Retrofitting it later is painful so do it early if you can.

EXPLAIN ANALYZE

Slow query? Don't touch the code yet. Run this first:

EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 42;

The output tells you how Postgres actually executed the query. Seq Scan is the bad one — it means Postgres read through the entire table, row by row, to find your data. On a table with 200 rows nobody cares. On a table with millions of rows that's where your response time went. Index Scan is what you want — Postgres found an index and skipped straight to the relevant rows.

-- Seq Scan: reading 4,999,999 rows to find 1. painful.
Seq Scan on orders (cost=0.00..89234.00 rows=1 width=120)
  Filter: (user_id = 42)
  Rows Removed by Filter: 4999999

-- One index fixes it
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- Now look at the difference
Index Scan using idx_orders_user_id on orders
  Index Cond: (user_id = 42)

4 seconds to 4 milliseconds on a production table with real data. Not theoretical.

There's a third one you might run into: Bitmap Heap Scan. Postgres builds a map of which disk pages probably contain matching rows and reads those instead of scanning everything. Not usually a problem, but if your query should return like 3 rows and you see a bitmap scan, dig into it. Index might not match well or Postgres has outdated info about the data in that table.

Also look at estimated rows versus actual rows in the output. When those are wildly different (Postgres expected 10, actually got 50,000) the planner is making decisions on wrong assumptions. Fix it by running ANALYZE tablename — different command from EXPLAIN ANALYZE, it just tells Postgres to rescan and update its internal statistics. I've seen terrible query plans go to perfectly fine after that one command.

Do this for every query before it ships. Not just the complicated ones — the boring three-line SELECTs are the ones that slip through review without anyone running EXPLAIN, and those tend to be the worst performers.

Indexes

Indexes speed up reads but slow down writes. Every INSERT, UPDATE, DELETE also has to update every index on that table. I've debugged write latency issues that turned out to be nothing more than too many indexes on one table — 15 of them, most added "just in case" and never used by any query.

Only index what you actually filter, join, or sort on. WHERE, JOIN, ORDER BY. If someone says "let's index this column in case we need it later" — push back. You probably won't need it, and writes just got slower for nothing.

One thing that confuses people: a composite index on (user_id, created_at) is a completely different thing from two separate indexes on each column. Not interchangeable even though it feels like they should be. Your query does WHERE user_id = 42 ORDER BY created_at? The composite covers both filter and sort in one pass. Separate indexes — Postgres picks one for the WHERE, grabs the rows, sorts them in memory separately.

Column order in composites matters a lot. (user_id, created_at) works for queries that filter on user_id alone. It does absolutely nothing for queries filtering only on created_at. Leftmost column has to be in the query. Works like a phone book — sorted by last name first, then first name within that. Easy to find all the Smiths. Impossible to efficiently find every John across all last names.

Partial indexes are another trick not enough people know about. If you've got a table where 95% of the rows are 'completed' and you're only ever querying for the 'pending' ones, you can tell Postgres to only index those:

CREATE INDEX idx_orders_pending ON orders(created_at)
WHERE status = 'pending';

Tiny index, covers exactly what you need, barely any write overhead.

Connection Pools

Postgres defaults to 100 max connections. Your app uses a pool — reusable connections instead of opening new ones per request. Most libraries default to 10 and nobody revisits that number after initial setup.

Real traffic breaks this immediately:

Pool size: 10
Requests per second: 300
Each request holds a connection for 50ms

Connections needed: 300 * 0.05 = 15
You have 10. Five requests queued at any moment.
Waits compound. p99 latency climbs.

People reference the formula (core_count * 2) + effective_spindle_count. Four-core SSD box gives roughly 9. But this is a starting point at best — you need to load test with realistic traffic to find what actually works.

One thing people get wrong here: cranking the pool up doesn't help. Setting it to 50 per instance can actually make performance worse. Postgres has to context-switch between connections and each one takes memory on the server side. The sweet spot is usually lower than you'd expect. PostgreSQL wiki has a good writeup on why.

Kubernetes makes it worse. 10 pods × 10 pool size = 100 connections just from your app. Already hitting the default limit. Then a rolling deploy starts, new pods open connections before old pods release theirs, and for a few seconds you're way over. Seen this crash databases.

PgBouncer fixes this. Sits between your app and Postgres, multiplexes connections at the protocol level. Your 10 pods each think they've got 10 connections but PgBouncer maps all of that to maybe 20 actual Postgres connections. Set it up before you actually need it — debugging connection limits at 3am while Googling "FATAL: too many clients already" is not a fun time.

The Slow Query Log

The single most underrated tool in Postgres and almost nobody has it turned on:

ALTER SYSTEM SET log_min_duration_statement = 1000;
SELECT pg_reload_conf();

That 1000 is milliseconds. Logs every query over one second. You can lower it to 500 or 200 later but start at a second so you don't drown in output.

Leave it on in production for a day then go read the file. You will find queries you completely forgot about. Some ancient cron job nobody's looked at since it was written, an admin panel endpoint that hits a full table scan twice a year when someone actually clicks the button.

Here's the thing about optimization — most teams do it by guessing. "This query has three JOINs so it's probably the slow one." Then you check and the three-JOIN query runs in 2ms because the indexes are right. Meanwhile some tiny SELECT with no WHERE index has been quietly running 10,000 times a day, each one doing a seq scan. The log doesn't guess. It just records what happened.

pg_stat_statements goes even further if you want it. Extension that tracks execution count, total time, average time for every distinct query. Basically gives you a leaderboard of which queries eat the most database time. Takes five minutes to turn on.

When Redis Actually Makes Sense

After queries are cleaned up. After indexes are right. After the slow query log stopped turning up problems.

Then sure, put Redis in front of it.

Good fit for data that changes rarely but gets read all the time — sessions, feature flags, config, product catalogs. Bad fit as a patch over a slow query. The slow query doesn't go away. It still fires on cache misses: traffic spikes, cold starts after deploys, TTL expiration at 3am when nobody's monitoring.

Check if you even need a separate service. Postgres materialized views are cached query results that live right in the database. Refresh them on a schedule or trigger. For dashboards and aggregated stats, they might be enough without adding Redis to your infrastructure.

If you do go with Redis, figure out invalidation before writing any code. "We'll handle cache invalidation later" always ends with users seeing stale data and filing bugs nobody can reproduce because the cache expired between the report and the investigation. TTL, write-through, or both — pick one upfront.

Fix the query first. Cache if the numbers justify it after.

Transactions

Transactions hold locks. Longer the transaction, longer the lock, bigger the pileup:

// Holds a lock while sendEmail does... whatever sendEmail does
await db.transaction(async (trx) => {
  await trx('orders').where({ id: orderId }).update({ status: 'confirmed' });
  await sendEmail(order.userEmail); // could take 2+ seconds
  await trx('inventory').where({ productId }).decrement('stock', 1);
});

Email has no business being inside a database transaction. SMTP is slow, sometimes very slow, and during that entire time you've got a lock held on the orders row while every other request that needs it waits.

Move it out:

await db.transaction(async (trx) => {
  await trx('orders').where({ id: orderId }).update({ status: 'confirmed' });
  await trx('inventory').where({ productId }).decrement('stock', 1);
});

await sendEmail(order.userEmail);

"But what if the email fails?" Putting it inside the transaction doesn't actually solve that. If sendEmail throws between the two DB calls, the whole transaction rolls back — now your order isn't confirmed AND the email didn't send. Worse than before. Better approach: commit the DB work, send the email separately, handle failures with a retry queue or dead letter table. Side effects and transactions don't mix.

Transactions should be short and contain only DB operations. HTTP calls, file I/O, email sends — none of that belongs in there.

SELECT * Considered Harmful (Sometimes)

Problem with SELECT * is that it pulls back everything on the row. Every column. Including that massive TEXT field full of HTML from the CMS feature someone bolted on two years ago that nobody remembers. You need id, name, and price for a product list — but you're also dragging 10KB of product description HTML across the wire for every single row.

-- Grabs everything including the giant description column
SELECT * FROM products WHERE category_id = 5;

-- Just what you need
SELECT id, name, price, thumbnail_url FROM products WHERE category_id = 5;

Non-issue on small tables. But wide tables returning hundreds of rows — the bandwidth waste shows up in response times. Postgres also has this thing called covering indexes where if every column in your SELECT is already in the index, it never has to touch the actual table. Really fast when it kicks in. SELECT * prevents that from ever happening because obviously no index has every column.

ORMs select everything by default. Prisma has select, Sequelize has attributes. Worth using for list endpoints where you don't need every field.

Before You Add Infrastructure

Before Redis. Before read replicas. Before anyone brings up sharding:

1. Run EXPLAIN ANALYZE on your slowest queries.
2. Hunt for N+1 patterns in your ORM code.
3. Add indexes on columns in WHERE, JOIN, ORDER BY.
4. Turn on the slow query log.
5. Check pool size against real traffic.
6. Nothing slow inside transactions.

Step 2 or 3 alone fixes most apps. Rest is cleanup.

New infrastructure looks great on an architecture diagram and ships fast. But if the actual problem is a missing index or a loop full of queries, none of it matters. A bad query behind three caches is still a bad query — now you just also have three caches to debug.

This also isn't a one-time thing. Queries that work at 100K rows collapse at 10M. An index you added for a feature that got killed last quarter is just slowing down writes now. Check the slow query log every few weeks, not just during outages. It's ongoing work.

The query plan doesn't lie.

Why Headless Browsers Get Detected: A Technical Breakdown

Emir — Fri, 13 Mar 2026 17:31:42 +0000

howtocenterdiv.com — "Software engineering is more than just centering a div."

Puppeteer rats itself out in at least 11 different ways the moment it starts up — and that's before it's even loaded a single page. Scraping tutorials almost never bring this up, then act shocked when the same script that ran perfectly on localhost gets hammered in production.

Here's what people get wrong: bot detection isn't a single if-statement checking one flag. It's a scoring system. Each signal you leak adds weight to a total, and once that total crosses a threshold, you're done — blocked, CAPTCHAed, or worst of all, quietly served garbage data so you don't even know it happened.

Layer	What It Checks	When
TLS Fingerprint (JA3)	Cipher suite order, extensions	TCP handshake — before HTTP
HTTP/2 Fingerprint	Frame settings, header order	First request
`navigator` properties	`webdriver`, `plugins`, `languages`	JS runtime
Canvas / WebGL	Rendering entropy, GPU string	JS runtime
Mouse & keyboard	Movement patterns, timing	Behavioral
IP reputation	ASN, datacenter range	DNS / IP layer

Most developers fixate on the navigator layer. They patch webdriver, maybe fake the user agent, and call it a day. They have no idea TLS fingerprinting has already clocked them before a single line of JavaScript ran.

Detection is cumulative and concurrent. Failing one check won't get you blocked. Getting blocked happens because a handful of small failures push the score over the threshold together. You can dodge navigator.webdriver perfectly and still get caught — because your JA3, canvas fingerprint, and plugin list aren't telling the same story.

Signal #1 — `navigator.webdriver`

console.log(navigator.webdriver);
// Headless → true   (instant detection)
// Real browser → undefined

The value itself isn't even the whole story. Detectors also inspect the property descriptor's configurability — that's a fingerprint of its own.

// Looks like a fix, still detectable
Object.defineProperty(navigator, 'webdriver', { get: () => false });

// What a detector actually sees
Object.getOwnPropertyDescriptor(navigator, 'webdriver');
// → { value: false, writable: true, configurable: true }
// In real Chrome, the property doesn't exist at all — just undefined

The classic mistake is patching properties after the page loads instead of injecting before. The clock doesn't wait.

Signal #2 — The `plugins` Array

navigator.plugins.length;
// Real Chrome → 3–7
// Headless    → 0  ← one-line detection

navigator.mimeTypes.length;
// Real Chrome → 2+
// Headless    → 0

Any halfway-decent detection script checks navigator.plugins.length === 0 and stops right there. But stuffing in fake plugins isn't a real fix either. The names, descriptions, and mime types inside each plugin object all have to be internally consistent — and they have to match the user-agent you claimed. If your UA says Chrome 120 on macOS but your plugin list looks like Chrome on Windows, that mismatch is itself a signal.

Signal #3 — Canvas Fingerprinting

const canvas = document.createElement('canvas');
const ctx = canvas.getContext('2d');
ctx.font = '11pt "Times New Roman"';
ctx.fillText('Cwm fjordbank', 2, 15);
const fingerprint = canvas.toDataURL();
// Real machine   → unique hash, varies by hardware
// Headless Chrome → identical hash, every single time

Headless Chrome produces pixel-perfect identical output for identical code, no matter what machine it's running on. There's no GPU variance. Detection systems maintain databases of known headless canvas hashes. Yours is already in there.

Signal #4 — WebGL Renderer String

const gl = document.createElement('canvas').getContext('webgl');
const info = gl.getExtension('WEBGL_debug_renderer_info');
gl.getParameter(info.UNMASKED_VENDOR_WEBGL);
// Real machine → "Intel Inc." / "NVIDIA Corporation"
// Headless     → "Google SwiftShader"  ← banned everywhere

SwiftShader is Google's software renderer, built for display-less environments. That string has been identified and blacklisted across detection systems everywhere. If SwiftShader shows up, you're flagged — doesn't matter what else you've cleaned up.

Signal #5 — TLS / JA3 Fingerprint

The first move in a TLS handshake is the client sending a ClientHello. Inside it: the list of cipher suites, extensions, and elliptic curves the client supports. The ordering of those items is dictated by the underlying TLS library — not the user-agent string you set.

JA3 = md5(SSLVersion, Ciphers, Extensions, EllipticCurves, ECPointFormats)

Client	TLS Library	JA3 Hash
Chrome 120 / macOS	BoringSSL	`cd08e31494f9531f560d64c695473da9`
Node.js 20 (axios/got)	OpenSSL	`b32309a26951912be7dba376398abc3b`
Python requests	Python ssl	`3b5074b1b5d032e5620f69f9f700ff0e`

You can set User-Agent: Chrome/120 all you want. The TLS handshake already announced Node.js before any JavaScript touched the page. There is no JS-layer fix for this one.

Signal #6 — HTTP/2 Fingerprint

Real Chrome and Node's http2 module send different SETTINGS frames:

Chrome 120:  HEADER_TABLE_SIZE=65536, ENABLE_PUSH=0, INITIAL_WINDOW_SIZE=6291456
Node.js:     HEADER_TABLE_SIZE=4096,  ENABLE_PUSH=1, INITIAL_WINDOW_SIZE=65535

This gets extracted at the load balancer level, well before any application logic sees the request.

Signal #7 — Behavioral Entropy

// Bot movement
mousemove: (100,200) → (400,200) → (400,500)  // perfect L-shapes, instant

// Human movement
mousemove: (100,200) → (138,213) → (201,228) → ...  // curved, variable speed

Mouse movement is just one piece. Detection systems also profile keystroke timing (real humans: 50–200ms between keystrokes), scroll behavior, and how long someone spends on a page before doing anything. A bot that clicks 80 milliseconds after page load is a bot.

Math.random() delays don't fix this. A straight-line mouse path with randomized timing is still a straight-line mouse path.

Entropy scores accumulate across full sessions, not just individual events. That's why some bots clear the first checkpoint and get flagged 30 seconds later — the score built up over time, not all at once.

The Mistake Matrix

Mistake	Why It Fails
Only patching `navigator.webdriver`	10+ signals still leak
Using `got`/`axios` with spoofed headers	JA3 still says Node.js
No `--disable-blink-features=AutomationControlled`	`window.chrome` exposes automation flag
Datacenter proxies (AWS/GCP/Azure)	ASN is blacklisted before fingerprint checks even run
`User-Agent` without matching `sec-ch-ua`	Header contradiction — caught immediately
`Math.random()` delays only	Timing variance isn't behavioral entropy

How the Score Actually Adds Up

All checks run simultaneously, scores stack:

IP reputation:       +0.1  (clean, residential)
JA3 mismatch:        +0.6  (Node.js TLS on Chrome UA)
navigator.webdriver: +0.0  (patched correctly)
Canvas hash:         +0.4  (known headless hash)
Plugin count:        +0.3  (empty plugins)
Mouse entropy:       +0.5  (straight-line movement)
─────────────────────────────────────────────
Total: 1.9  →  Block threshold: 1.5

You nailed navigator. Doesn't matter — TLS and canvas alone already pushed it over.

The `sec-ch-ua` Problem

Chrome 90+ attaches client hints to every request. A real Chrome session looks like:

sec-ch-ua: "Chromium";v="120", "Google Chrome";v="120", "Not-A.Brand";v="99"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "macOS"
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...

A Puppeteer session that sets a modern user-agent but sends no client hints looks like:

User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...
# sec-ch-ua: missing entirely

A Chrome 90+ user-agent without sec-ch-ua is a physical impossibility. Flagged on arrival. And just being present isn't enough — the brand token order and version numbers have to be consistent with the full UA string.

Tools vs. What They Actually Cover

Tool	Fixes	Doesn't Fix
`puppeteer-extra-plugin-stealth`	JS-layer signals	TLS/HTTP2 fingerprint
`rebrowser-puppeteer`	CDP leaks, runtime injection	TLS, behavioral
Go + CycleTLS	JA3 fingerprint	Behavioral, canvas
Real Chrome via CDP	TLS, canvas, GPU	Proxy/IP reputation

The only client that passes every layer by default is a real Chrome browser, running on consumer hardware, behind a residential IP.

Most guides don't even touch this. They treat detection as a checklist — knock off each item, done. But detection systems are probabilistic. They don't need certainty, just confidence. Fix 9 out of 10 signals and you can still get blocked if that last signal carries enough weight. JA3 mismatches typically score 0.5–0.7. One leak can be all it takes.