DEV Community

Mohamed Idris
Mohamed Idris

Posted on

Part 11: Indexes and Performance

Part of the "SQL: Zero to Ninja" series, written for junior web devs.

Your app worked great with 200 users. Then it got popular, hit a million rows,
and one page started taking 8 whole seconds to load. Same query, same code,
nothing changed. So what happened? Your database was quietly reading every
single row, every single time. Let's fix that with indexes, and then squash the
sneaky N+1 bug that haunts almost every ORM app.

The idea in one line

An index is a sorted shortcut that lets the database jump straight to the
rows you want, instead of reading the whole table to find them.

The metaphor: a dictionary

Imagine you want the word "ninja" in a paper dictionary.

You do not start at page 1 and read every word until you hit "ninja". That
would take all day. Instead, you flip to the "N" section, then narrow down fast.
The dictionary is sorted, so you jump close and zoom in.

Now imagine a dictionary with the words in random order. To find "ninja" you
would have to read it cover to cover. Painful.

That is the whole story:

No index   -->  read the dictionary cover to cover   (full table scan)
With index -->  jump to "N", then zoom in            (sorted lookup)
Enter fullscreen mode Exit fullscreen mode

A table with no index is the random-order dictionary. An index puts things in
order so the database can flip right to the page.

Why a query gets slow

When you run this:

SELECT * FROM orders WHERE user_id = 42;
Enter fullscreen mode Exit fullscreen mode

If there is no index on user_id, the database does a full table scan. It
literally checks every row: "is this user 42? no. is this one? no..." all the
way down.

Row 1   user_id=7   nope
Row 2   user_id=3   nope
Row 3   user_id=42  yes! keep it
...
Row 9,999,999       (still reading)
Enter fullscreen mode Exit fullscreen mode

With 100 rows, that is instant. With 10 million rows, that is your 8-second page.

Adding an index

You create an index on the column you search by:

CREATE INDEX idx_orders_user_id ON orders (user_id);
Enter fullscreen mode Exit fullscreen mode

Now the database keeps a sorted list of user_id values, each pointing to where
that row lives. The mental model is a B-tree: a sorted structure that lets
the database binary-search (split in half, again and again) instead of reading
everything.

Looking for user_id = 42 in a sorted index:

        [ 50 ]
        /     \
    [ 25 ]   [ 75 ]
      |
    42 is here-ish, jump straight to it
Enter fullscreen mode Exit fullscreen mode

Each step throws away half the remaining rows. That is how it finds one row out
of millions in a blink.

When indexes help

Index the columns you actually look things up by:

  • Columns in a WHERE filter: WHERE user_id = 42
  • Columns in a JOIN ... ON: ON orders.user_id = users.id
  • Columns in ORDER BY: ORDER BY created_at

If you sort or filter or join on it a lot, it is a good index candidate.

The cost (indexes are not free)

An index is like keeping that sorted list updated. Every time you INSERT,
UPDATE, or DELETE, the database also has to fix the index. So:

  • Writes get a little slower.
  • Indexes take up disk space.

That is why you do not index every column. Index the ones you search on, and
leave the rest alone.

See it with EXPLAIN

You do not have to guess whether your query uses an index. Ask the database with
EXPLAIN (or EXPLAIN ANALYZE to actually run it and time it):

EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = 42;
Enter fullscreen mode Exit fullscreen mode

You are looking for one word:

Seq Scan on orders   -->  bad, it read the whole table
Index Scan on orders -->  good, it used the index
Enter fullscreen mode Exit fullscreen mode

"Seq Scan" (sequential scan) means cover-to-cover reading. "Index Scan" means it
jumped to the right page. If you added an index and still see a Seq Scan, your
query is hiding an index killer (more on that below).

The N+1 problem (read this twice)

This one bites almost every web dev who uses an ORM. Say you load 50 users, then
show each user's orders. The lazy ORM does this:

1 query:  SELECT * FROM users LIMIT 50;        -- get the users
then for EACH user:
  SELECT * FROM orders WHERE user_id = 1;      -- query #2
  SELECT * FROM orders WHERE user_id = 2;      -- query #3
  ...
  SELECT * FROM orders WHERE user_id = 50;     -- query #51
Enter fullscreen mode Exit fullscreen mode

That is 1 + 50 = 51 queries for one page. Each one is a round trip to the
database. This is the N+1 problem: 1 query to get the list, then N more, one per
item.

The fix: ask once

Grab all the orders in a single query instead of 50:

-- One batched query for everyone's orders
SELECT * FROM orders WHERE user_id IN (1, 2, 3, ..., 50);
Enter fullscreen mode Exit fullscreen mode

Or join them together:

SELECT users.name, orders.total
FROM users
LEFT JOIN orders ON orders.user_id = users.id;
Enter fullscreen mode Exit fullscreen mode

In ORM land this is called eager loading. You tell the ORM "load the orders
with the users" up front:

// Sequelize
User.findAll({ include: Order })

// Prisma
prisma.user.findMany({ include: { orders: true } })

// Rails / ActiveRecord
User.includes(:orders)

// Laravel / Eloquent
User::with('orders')->get()
Enter fullscreen mode Exit fullscreen mode

51 queries become 2. Your slow page is fast again.

Gotchas juniors hit

  1. Indexing a tiny table. A 50-row table is faster to just scan. An index there adds work for no win. Indexes shine on big tables.
  2. Indexing a low-variety column. A column like is_active that is only true or false barely helps. The index can only split the table in two, so the database often just scans anyway.
  3. Leading wildcard LIKE. WHERE name LIKE '%ohn' cannot use a normal index, because the index is sorted by the start of the word. It is like asking the dictionary for every word that ends in "ohn". Sorting by first letter does not help. WHERE name LIKE 'Joh%' is fine though.

Recap

  • No index means a full table scan: the database reads every row.
  • An index is a sorted shortcut (think B-tree) so it can jump straight in.
  • Index columns used in WHERE, JOIN ON, and ORDER BY.
  • Indexes cost write speed and disk, so do not index everything.
  • Use EXPLAIN to see "Index Scan" (good) vs "Seq Scan" (bad).
  • The N+1 problem is 1 + N queries. Fix it with a JOIN, a batched WHERE ... IN (...), or your ORM's eager loading.

Your turn

You have a query: SELECT * FROM users WHERE email = 'a@b.com'. Which column
should you index, and why? And if a page loads 30 products then runs one extra
query per product to get its category, what is that bug called and how do you fix
it? Explain both to a friend and you have got it.

Next up is Part 12: Transactions, where we make several statements happen
all-or-nothing, so you never end up with a paid order that has no items.

Top comments (1)

Collapse
 
edriso profile image
Mohamed Idris

Part 11 Practice: Indexes and Performance

Try these before peeking at the solutions. They mix "think about it" questions
with real SQL using our shared schema (users, orders, products,
order_items).


1. Pick the column to index

Your app runs this query all the time and it has gotten slow:

SELECT * FROM orders WHERE user_id = 42;
Enter fullscreen mode Exit fullscreen mode

Which column should you index, and write the statement.

Solution

CREATE INDEX idx_orders_user_id ON orders (user_id);
Enter fullscreen mode Exit fullscreen mode

Why: the column in the WHERE filter is what the database searches by. Indexing
user_id turns a full table scan into a fast sorted lookup.


2. Read the EXPLAIN

You run EXPLAIN on a query and see this line:

Seq Scan on orders  (cost=0.00..18450.00 rows=1)
Enter fullscreen mode Exit fullscreen mode

Did your query use an index? What does this tell you?

Solution

No. "Seq Scan" means a sequential scan, the database read the whole table
cover to cover. If you expected an index to be used, you are probably missing one
on the filtered column (or something is blocking it, like a leading-wildcard
LIKE). You want to see "Index Scan" instead.


3. Spot the N+1

A teammate's code does this on the dashboard:

const users = await db.query("SELECT * FROM users LIMIT 50");
for (const user of users) {
  user.orders = await db.query(
    "SELECT * FROM orders WHERE user_id = ?", [user.id]
  );
}
Enter fullscreen mode Exit fullscreen mode

How many queries does this run for 50 users? What is the bug called?

Solution

It runs 51 queries: 1 to load the users, then 1 per user (50 more). This is
the N+1 query problem. Each loop iteration is a separate round trip to the
database, which gets slow fast.


4. Fix the N+1 with SQL

Rewrite the work above so it grabs every user's orders in one query instead
of 50.

Solution

SELECT users.id, users.name, orders.id AS order_id, orders.total
FROM users
LEFT JOIN orders ON orders.user_id = users.id
WHERE users.id IN (/* the 50 user ids */);
Enter fullscreen mode Exit fullscreen mode

Why: one JOIN (or one WHERE user_id IN (...) batch) pulls all the orders at
once. 51 queries become 1 or 2. In an ORM this is eager loading, like
include: { orders: true }.


5. Should you index this?

You are tempted to add an index on orders.status, which is almost always one of
just two values: 'pending' or 'paid'. Good idea or not?

Solution

Usually not worth it. status is a low-variety column, so an index can only
split the table into a couple of big buckets and the database often just scans
anyway. Save indexes for columns with many different values that you filter,
join, or sort on.


6. Why is this search slow even with an index?

You added an index on products.name, but this query still does a full scan:

SELECT * FROM products WHERE name LIKE '%phone%';
Enter fullscreen mode Exit fullscreen mode

Why?

Solution

A normal index is sorted by the start of the value. A leading wildcard
('%phone%') asks for matches anywhere in the middle, so the sorted order cannot
help, just like asking a dictionary for every word containing "phone". A search
like LIKE 'phone%' (no leading %) can use the index. For real "contains"
search you would reach for a full-text index instead.


Nice work. If you can say which column to index and why, and you can smell an
N+1 from across the room, you are thinking like a backend dev now. See you in
Part 12.