How do you know you need a database index?

#rails #database #performance #interview

I got asked this in an interview years ago, and I've asked it from the other side of the table since. I like it because the lazy answer ("index everything") is wrong, and the real skill is knowing where to look before you touch anything. So here's the whole loop: how I spot a column that needs an index, how I prove the index actually helped, and what it costs me to add one.

Which columns actually need one

The candidates are columns you filter, sort, or join on. In SQL terms, anything in a WHERE, ORDER BY, JOIN, or GROUP BY on a table that's big or growing.

A few I always check first:

Foreign keys, like user_id and order_id. In Rails these don't get an index automatically when you add the reference unless you ask for one, and they get hammered by association lookups and joins. This is the missing index I find most often.

Lookup columns like email, slug, and token. These usually want a unique index anyway.

Filter and sort columns like status and created_at, the stuff your dashboards page over.

None of it matters at small scale. A sequential scan over 500 rows is instant. The pain shows up at a few million rows, when an endpoint that felt fine in development starts timing out in production.

How I measure whether it helped

Mostly EXPLAIN ANALYZE.

EXPLAIN ANALYZE
SELECT * FROM orders WHERE user_id = 42 AND status = 'paid';

Here's what I read in the output. A Seq Scan on a big table is the red flag: Postgres is reading every row to answer the query. After I add the index, I want to see that become an Index Scan or a Bitmap Index Scan.

I compare the actual time before and after, not just the plan shape. I also check estimated rows against actual rows. When those are far apart, the planner is running on stale statistics, and a quick ANALYZE sometimes fixes the query with no index at all.

The local plan only goes so far. To find what's worth fixing, I look at production: pg_stat_statements for the genuinely expensive queries, plus whatever APM is running (New Relic, Skylight, Datadog) and the query timings in the Rails log. It's easy to lovingly optimize a query that runs twice a day while ignoring the one that runs ten thousand times an hour.

Indexes aren't free

This is the part people skip. Every index costs disk space, and it slows down writes, because every INSERT, UPDATE, and DELETE has to keep the index current. On a write-heavy table that adds up fast.

So I don't index on a hunch. I index columns I can show are being queried, and I drop indexes nobody uses. An unused index is pure cost. It slows your writes and gives you nothing back.

The practical one: `Order.where(user_id: id, status: "paid")`

I'd add a composite index on [:user_id, :status]:

add_index :orders, [:user_id, :status]

Column order is the whole game here. user_id goes first because it's the selective, always-present equality filter. The B-tree narrows to one user's orders, then status filters within that small set.

There's a bonus: because user_id is the leftmost column, this same index also covers queries that filter on user_id alone, so you don't need a second index just for it.

Reversing it to [:status, :user_id] would be worse. status has only a handful of distinct values, so leading with it barely narrows the search before it gets to user_id.