How Database Indexes Actually Work (and When They Backfire)

#database #sql #backend #webdev

TL;DR: Without an index, your database finds a row by reading every row (a full table scan). An index is a sorted structure that lets it jump straight to the row instead. But indexes are a trade, not free speed: they only help selective queries, and they slow down every write. Use EXPLAIN to see what your database is actually doing.

The setup

Your query was instant in development. In production, it crawls. Same code. The only thing that changed is the amount of data.

Nine times out of ten, this is why: without an index, the database has only one way to find your row, which is to read every row, one at a time, until it matches. That is a full table scan.

On 10,000 rows you don't notice. On 10,000,000, it is painful.

Why an index exists (first principles)

An index exists to avoid that work. Scanning the whole table doesn't scale: a query that filters on one column shouldn't have to read every row to find a handful.

So the database keeps a separate, sorted directory of one column's values, stored alongside the table, like the index at the back of a textbook. Instead of flipping through every page, you look up the term and jump straight to the page.

CREATE INDEX idx_users_email ON users(email);

Why it's fast

Because the directory is sorted, the database doesn't read it top to bottom either. It navigates straight toward the value: narrow the range, discard the half that can't contain it, repeat. That is the same halving idea as binary search, and the structure that makes it work on disk is a B-tree (the disk-friendly generalization, not literally a binary search over rows).

The payoff: finding one row among a million takes on the order of ~20 steps, not a million.

The part most engineers miss: it's a trade

Indexes are not free performance. They are a trade-off:

They only help when your query is selective, when it returns a small fraction of the table. WHERE id = ? (one row) flies. WHERE active = true (half the table) can be slower with the index than a plain scan, because there is no shortcut when you are returning most of the rows. A plan can literally say "using index" and still be slow; what matters is how many rows it touches.
Every index has a write cost. Inserts, updates, and deletes all have to keep every index current. The more indexes you have, the slower your writes.
They cost storage.

So the rule isn't "add indexes everywhere." It is: index the columns your queries actually filter, join, and sort on, and only where it is selective enough to help.

Stop guessing: read the plan

You never have to guess at any of this. Put EXPLAIN in front of your query and the database shows you its plan before it runs anything.

EXPLAIN SELECT * FROM users WHERE email = ?;

What you are looking for is "reads the whole table" turning into "uses the index":

Database	Slow (full scan)	Fast (uses index)
SQLite	`SCAN users`	`SEARCH users USING INDEX`
Postgres	`Seq Scan`	`Index Scan` / `Index Only Scan`
MySQL	`type: ALL`	`type: ref` / `range`

(In Postgres, EXPLAIN ANALYZE also shows the real row counts versus the planner's estimates, which is where a lot of "why didn't it use my index?" mysteries get solved.)

Get in the habit of reading the plan, and you will catch the slow query before it pegs your database and takes the app down.

Watch it visually

I made a short, visual breakdown of all of this, from the full table scan to the index, to why it is fast, to when it backfires, to reading EXPLAIN:

It is the first episode of The Leap, a series explaining the systems we build on, from databases and networking to memory and distributed systems, from first principles. Next up: the dark side of indexes, and when adding one is the wrong move.

What's the nastiest slow query you have had to debug in production?