Stop guessing and start understanding how databases utilize indexes to design them more efficiently!
Translated by ChatGPT 3.5.
This article primarily discusses the most common B+ tree indexes.
Indexes act like a table of contents in a book, helping to locate desired content efficiently. Well-designed indexes can enhance search efficiency, while poorly designed ones can degrade it. Therefore, index design is crucial. To create efficient indexes, it’s essential to understand how databases utilize indexes to search for data.
B+ Tree
When I was learning about database indexes, every book and article jumped straight into B+ trees, making me want to give up before even starting. However, most of the time, we don’t need to delve deeply into these structures.
Here’s a simple B+ tree:
+----+----+
| 10 | 20 |
+----+----+
/ | \
+---+ +----+----+ +----+
| 5 | | 15 | 18 | | 25 |
+---+ +----+----+ +----+
/ | \ / | \ / | \
1 6 9 11 16 19 21 26 30
It may seem complex, but except for leaf nodes, other nodes exist to quickly locate leaf nodes. By focusing only on leaf nodes and ignoring other nodes, it becomes a simple ordered doubly-linked list:
+---+---+---+----+----+----+----+----+----+
| 1 | 6 | 9 | 11 | 16 | 19 | 21 | 26 | 30 |
+---+---+---+----+----+----+----+----+----+
Just remember these three characteristics of B+ trees:
- Leaf nodes are ordered.
- Quick navigation to leaf nodes is possible.
- Leaf nodes are connected, allowing fast traversal.
I’ll represent indexes using linked lists from now on. If you’re interested in how B+ trees maintain order, insertion, deletion, etc., you can explore it on your own.
How Databases Utilize Indexes
Let’s assume the database has selected what it considers an appropriate index. Here’s a simplified process of how it uses indexes to search for data:
- Utilizing the characteristics of the index data structure, the database swiftly locates the first index node that meets the conditions. This node’s corresponding row may not necessarily meet the query conditions because the index’s columns may not include all the columns required by the query.
- Starting from the node located in step 1, the database performs a unidirectional traversal, using the values of the index nodes for preliminary screening. Since B+ tree leaf nodes are ordered and connected, unidirectional traversal is swift. It retrieves the corresponding row data from the disk through nodes that pass the preliminary screening.
- For the rows filtered in step 2, further filtering may occur. This step isn’t always necessary. If the index’s columns include all the columns required by the query conditions, no further filtering is needed. This is the ideal scenario.
Now, let’s explore some common query scenarios and how databases utilize indexes to search for data, hoping to provide you with some insights.
Rapid Location
Querying on the primary key id
is the simplest case. When we query SELECT * FROM table WHERE id = 2
, the database can swiftly find the rows that meet the condition through the index. It's highly efficient with no extra steps.
index on (id)
|
v
+------++-----+-----+-----+-----+-----+
| id || 1 | 2 | 3 | 4 | 5 |
+------++-----+-----+-----+-----+-----+
✅
Unidirectional Traversal
When we want to find the three youngest people aged 20 or above SELECT * FROM table WHERE age >= 20 LIMIT 3
, the database quickly locates the first node meeting the condition via the index, then proceeds with unidirectional traversal to find the first three rows meeting the condition.
index on (age)
|
v ------------>
+-------++------+------+------+------+------+
| age || 18 | 20 | 21 | 22 | 23 |
+-------++------+------+------+------+------+
✅ ✅ ✅
Multi-column Index
Multi-column indexes are similar to single-column indexes, except leaf nodes of multi-column indexes contain values of multiple columns, also in order. The sorting rule of nodes in multi-column indexes is similar to string sorting, comparing from left to right, considering the comparison result of the first differing column. For instance, with a multi-column index (a, b, c)
, the node (1, 2, 3)
is less than (1, 2, 4)
but greater than (1, 1, 4)
. When querying SELECT * FROM table WHERE a = 1 AND b = 2 AND c = 3
, similar to single-column indexes, the database can quickly find the nodes meeting the conditions via the multi-column index.
index on (a, b, c)
|
v
+-----++-----+-----+-----+-----+-----+
| a || 1 | 1 | 1 | 1 | 2 |
+-----++-----+-----+-----+-----+-----+
| b || 1 | 1 | 2 | 2 | 5 |
+-----++-----+-----+-----+-----+-----+
| c || 1 | 3 | 2 | 3 | 4 |
+-----++-----+-----+-----+-----+-----+
✅
Queries fully utilizing (a, b, c)
indexes include:
SELECT * FROM table WHERE a = 1 AND b = 2 AND c = 3
SELECT * FROM table WHERE a = 1 AND b = 2
SELECT * FROM table WHERE a = 1
Having a (a, b, c)
index is equivalent to having both (a, b)
and (a)
indexes simultaneously.
The rule for databases to swiftly locate nodes meeting conditions using multi-column indexes is: match sequentially from left to right based on index columns; stop matching if a column doesn’t match or isn’t in the query conditions.
For example, querying SELECT * FROM table WHERE a = 1 AND c = 3
with (a, b, c)
index, since it matches a = 1
from left to right and stops due to b not in the query conditions, it's equivalent to using (a)
index in locating the first node meeting the conditions. The database must then traverse all rows with a = 1
for filtering.
index on (a, b, c)
|
v ---------------->
+-----++-----+-----+-----+-----+-----+
| a || 1 | 1 | 1 | 1 | 2 |
+-----++-----+-----+-----+-----+-----+
| b || 1 | 1 | 2 | 2 | 5 |
+-----++-----+-----+-----+-----+-----+
| c || 1 | 3 | 2 | 3 | 4 |
+-----++-----+-----+-----+-----+-----+
❌ ✅ ❌ ✅
💡 However, during unidirectional traversal and filtering,
(a, b, c)
index is faster than(a)
index because(a, b, c)
index nodes containc
values, eliminating the need to readc
values from disk. Reading from memory is much faster than random disk reads.
Designing a new multi-column index (a, c)
for this query can efficiently locate rows meeting conditions.
index on (a, c)
|
v ---->
+-----++-----+-----+-----+-----+-----+
| a || 1 | 1 | 1 | 1 | 2 |
+-----++-----+-----+-----+-----+-----+
| c || 1 | 2 | 3 | 3 | 4 |
+-----++-----+-----+-----+-----+-----+
✅ ✅
Order of Multi-column Indexes
The order of columns in multi-column indexes is crucial. For instance, in a grade table with id, class, score columns, having a multi-column index (score, class)
, when querying SELECT * FROM table WHERE class = 2 AND score >= 60
, the index seemingly helps the query. However, its assistance is limited. After swiftly locating the first node meeting the condition, the database needs to traverse all nodes with scores greater than 60 to find all rows meeting the condition. This may include many nodes not meeting the condition.
index on (score, class)
|
v ------------------->
+---------++------+------+------+------+------+------+
| score || 45 | 52 | 60 | 64 | 75 | 95 |
+---------++------+------+------+------+------+------+
| class || 1 | 2 | 2 | 1 | 1 | 2 |
+---------++------+------+------+------+------+------+
✅ ❌ ❌ ✅
However, using (class, score)
index efficiently locates all rows meeting the condition. There are no nodes not meeting the condition during unidirectional traversal.
index on (class, score)
|
v ----->
+---------++------+------+------+------+------+------+
| class || 1 | 1 | 1 | 2 | 2 | 2 |
+---------++------+------+------+------+------+------+
| score || 45 | 64 | 75 | 52 | 60 | 95 |
+---------++------+------+------+------+------+------+
✅ ✅
It’s not only about designing indexes that databases can adopt but also designing indexes that efficiently utilize actual query requirements.
Utilizing Ordered Indexes
Since indexes themselves are ordered, you can design indexes tailored to query requirements, eliminating the need for additional sorting. For example, in the grade table scenario, querying SELECT * FROM table WHERE class = 2 ORDER BY score ASC
uses an index (class, score)
. The unidirectional traversal through index nodes is ordered, eliminating the need for additional result sorting. If the ordered nature of indexes isn't utilized, even if you need to query a single row, the database has to retrieve all rows meeting the condition and then sort them, which is highly inefficient.
index on (class, score)
|
v ------------>
+---------++------+------+------+------+------+------+
| class || 1 | 1 | 1 | 2 | 2 | 2 |
+---------++------+------+------+------+------+------+
| score || 45 | 64 | 75 | 52 | 60 | 95 |
+---------++------+------+------+------+------+------+
✅ ✅ ✅
As leaf nodes of indexes form bidirectional linked lists, SELECT * FROM table WHERE class = 2 ORDER BY score ASC
can be efficiently queried.
index on (class, score)
|
<----------- v
+---------++------+------+------+------+------+------+
| class || 1 | 1 | 1 | 2 | 2 | 2 |
+---------++------+------+------+------+------+------+
| score || 45 | 64 | 75 | 52 | 60 | 95 |
+---------++------+------+------+------+------+------+
✅ ✅ ✅
Conclusion
- When designing indexes, strive to arrange adjacent leaf nodes for query result indexes to reduce the number of non-matching nodes during unidirectional traversal.
- Include columns from filtering conditions in indexes to minimize disk reads during filtering.
- If sorting is required, utilize the ordered nature of indexes to avoid additional sorting operations.
Future explorations could include how database query optimizers select indexes, utilizing indexes in join
, group by
, subqueries
, etc., and common pitfalls with indexes.
Top comments (0)