Ashraf Shaik

Posted on Feb 3 • Edited on Feb 4

How B-Tree Indexes Work in SQL Databases: An Intuitive Guide

#webdev #database #sql #programming

Why Indexes Exist (The Real Problem)

Relational databases are optimized to store large volumes of data reliably, but they are not necessarily optimized to search them efficiently by default. When you ask a database to find a specific row without an index, it often performs a Full Table Scan, looking at every row one by one. This works at a small scale but becomes painfully slow as data grows.

Indexes exist to avoid this brute-force approach. Instead of checking every row, an index allows the database to quickly narrow down where the matching data resides.

B-Tree indexes are the default choice in most SQL databases (PostgreSQL, MySQL/InnoDB, SQL Server, Oracle) because they keep data sorted while supporting inserts and updates in a controlled, predictable way.

The intuition: even with 10 million rows, a B-Tree index typically needs only a handful of comparison steps to find a single value. Locating a row is a matter of a few targeted jumps rather than a scan through millions of entries. Writes do carry overhead—every insert must maintain this structure—but that cost grows slowly, allowing B-Trees to scale exceptionally well.

What Problem a B-Tree Solves

A naive sorted array allows for binary search, but inserts are expensive because data must be shifted to make room (O(n)). A linked list allows cheap inserts, but search is slow (O(n)).

A B-Tree is a data structure designed specifically for storage systems that:

Keeps data sorted for range queries and ordering
Allows logarithmic search (O(log n)) complexity
Minimizes disk I/O by being shallow and wide
Supports efficient writes through localized node splits

The key constraint: databases read data in pages, not individual rows.

Pages: How Databases Actually Store Data

Before talking about B-Trees, it helps to understand one fundamental idea: databases work with pages.

A page is a fixed-size block of data (commonly 8 KB). When the database needs anything—a row, an index entry, or a pointer—it reads the entire page containing it.

This applies to:

Table data (the actual rows)
Index data (B-Tree nodes)

The performance goal is simple: do as much useful work as possible per page read. B-Tree nodes are designed to fit neatly inside one page, allowing the database to make massive progress with every single disk hit.

This page-based model is why B-Trees look very different from textbook binary trees.

B-Tree Structure (High Level)

A B-Tree consists of three layers:

Root node: the starting point of every search
Internal nodes: the signposts used for navigation
Leaf nodes: the bottom layer where the actual data (or pointers) live

Properties

Sorted keys: keys inside a node are always kept in order
High fanout: each node contains many keys (often hundreds), not just two
Balanced depth: all leaf nodes are at the exact same depth, guaranteeing predictable performance regardless of which key you are seeking

Internal vs. Leaf Nodes

Internal nodes contain keys and pointers to child pages. They are used purely for navigation.
Leaf nodes contain keys and pointers to the actual rows.

Implementation details differ:

In MySQL (InnoDB), leaf nodes store the full row (clustered index).
In PostgreSQL, leaf nodes store row pointers (heap TIDs).

The link: leaf nodes are usually linked together (left-to-right), allowing the database to perform range scans without going back up to the root.

How a Lookup Works

Consider the query:

SELECT * FROM users WHERE email = 'a@b.com';

Start at the root page
Compare the key with the keys in the node to find the correct range
Follow the child pointer to the next page
Repeat until a leaf node is reached
Fetch the row pointer or the row itself

Each step is a page read, not a row-by-row operation.

Why B-Tree Height Stays Small

Because each node holds many keys, the tree grows wide, not tall.

Typical numbers:

Page size: 8 KB
Key + pointer: ~40 bytes
Keys per node (fanout): ~200

A tree with height 3 can support:

200³ ≈ 8,000,000 entries

This is why B-Tree lookups remain extremely fast even at massive scale.

Inserts and Node Splits

When you insert a new row, the database must update the B-Tree:

The key is placed in the correct leaf node
If the node has space, the insert is trivial
If the node is full, it splits into two nodes
The middle key is promoted to the parent node

Splits can propagate upward to the root, but they are rare in practice. This balancing act is the tax paid to keep the tree efficient for future reads.

Range Queries and Ordering

B-Trees keep keys sorted, which enables efficient range scans:

SELECT * FROM orders
WHERE created_at BETWEEN '2025-01-01' AND '2025-01-31';

After locating the first key ('2025-01-01') using the standard lookup, the database simply walks the linked leaf nodes sequentially until it hits the end of the range.

This is why:

B-Trees natively support ORDER BY
B-Trees support prefix scans (e.g. LIKE 'abc%')
Hash indexes cannot do either of these

Composite Indexes

When you create an index on multiple columns, such as (user_id, created_at), the B-Tree sorts by user_id first, and then by created_at for tied values.

Efficient for: user_id = ? or user_id = ? AND created_at > ?
Inefficient for: created_at = ? alone

Think of it like a phone book sorted by (Last Name, First Name). It’s easy to find all Smiths, but impossible to find everyone named John without reading the whole book.

Common Misconceptions

"Indexes speed up all queries": False. They speed up reads but add overhead to every INSERT, UPDATE, and DELETE.
"More indexes are always better": False. Every index increases write amplification and takes up disk space.
"B-Trees are only for equality": False. Their primary strength is handling ranges and sorted output.

When a B-Tree Is the Wrong Choice

Avoid B-Tree indexes when:

Low cardinality: If 90% of rows have is_active = true, the database will often prefer a full table scan because reading the index plus the table is more expensive.
High-frequency equality only: If you only ever do exact matches on unique keys, a hash index may be faster (though less flexible).
Append-only / time-series data: Consider BRIN (Block Range Indexes) for massive tables where data is naturally ordered by time.
Full-text search: Use GIN or specialized inverted indexes for searching words inside long text.