DEV Community

Ujwal Kumar


The What and The How of Indexes in Databases

What is an Index?

An index is a data structure associated with a table that speeds up the retrieval of rows. By creating an index on a column or a set of columns, the database engine can locate relevant rows more efficiently. This is particularly useful for large tables or queries that frequently filter or sort data.
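As a minimal sketch of the idea, the snippet below uses Python's built-in sqlite3 module to create an index and ask the engine how it plans to run a query before and after. The table, column and index names (users, age, idx_users_age) are assumptions made up for illustration, and other database engines report their plans differently.

```python
import sqlite3

# Hypothetical table and index names, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users (name, age) VALUES (?, ?)",
    [(f"user{i}", 18 + i % 50) for i in range(1000)],
)


def query_plan(sql):
    """Return SQLite's plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


query = "SELECT name FROM users WHERE age = 30"

plan_before = query_plan(query)  # no index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_users_age ON users (age)")
plan_after = query_plan(query)   # the planner can now search the index

print(plan_before)
print(plan_after)
```

The exact wording of the plan varies between SQLite versions, but the second plan should mention idx_users_age while the first should not.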

Types of Indexes:

On a broad level, there are two categories of indexes:

  • Clustered Index: A clustered index defines the physical order of the rows in a table: the table's data is sorted and stored according to the indexed column(s). In many database systems, the primary key constraint creates a clustered index on the primary key column(s) by default, so the primary key both uniquely identifies each row and determines the physical order of the rows. In most database systems, a table can have only one clustered index, because its rows can be physically ordered in only one way.

  • Non-Clustered Index: A non-clustered index is a separate data structure (typically a B+ tree) that stores index keys along with pointers to the corresponding rows. It does not affect the physical order of the rows on disk. Unlike clustered indexes, which are limited to one per table, a table can have multiple non-clustered indexes. While non-clustered indexes improve query performance, they require additional storage space for the index keys and pointers.
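To make the "keys plus pointers" idea concrete, here is a rough sketch in Python. It is not how a real engine implements an index: a sorted list with binary search stands in for the B+ tree, list positions stand in for disk addresses, and the sample rows are made up.

```python
import bisect

# Rows stored in arrival order, standing in for the table on disk.
rows = [
    {"id": 1, "name": "Asha", "age": 16},
    {"id": 2, "name": "Ben", "age": 12},
    {"id": 3, "name": "Chen", "age": 19},
    {"id": 4, "name": "Dina", "age": 11},
    {"id": 5, "name": "Eli", "age": 13},
    {"id": 6, "name": "Fay", "age": 12},
]

# The "index": (key, row position) pairs kept sorted by key.
# The rows themselves are never reordered.
age_index = sorted((row["age"], pos) for pos, row in enumerate(rows))
age_keys = [key for key, _ in age_index]


def find_by_age(age):
    """Binary-search the index, then follow the pointers back to the rows."""
    lo = bisect.bisect_left(age_keys, age)
    hi = bisect.bisect_right(age_keys, age)
    return [rows[pos] for _, pos in age_index[lo:hi]]


matches = find_by_age(12)
print([row["name"] for row in matches])
```

Because the index is a separate structure, we could build a second one on another column (say, name) without touching the rows, which is why a table can have several non-clustered indexes.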

Critical Considerations

Although indexes are a powerful tool to optimize query performance on tables which are read-heavy and are updated less frequently, there are a few things that should be kept in mind while creating indexes on a table. Here are some key considerations:

  • Creating indexes on columns that are frequently used in WHERE, JOIN, ORDER BY and GROUP BY clauses can improve the performance of your queries.
  • Too many indexes lead to overhead in storage and maintenance, because every index on a table has to be updated whenever its rows change. The number of columns in an index definition directly affects the performance of insert, update and delete operations.
  • For tables with frequent data updates, index definitions should include as few columns as possible.
  • We should define the clustered index on as few columns as possible. Ideally, the clustered index should be defined on a unique column that does not contain null values.
  • The more repetitive the data in an indexed column (that is, the fewer distinct values it has), the less the index helps, because each lookup still matches a large share of the rows.
  • For queries that involve multiple columns in the WHERE clause or JOIN conditions, consider creating composite indexes (indexes on multiple columns). Composite indexes can improve query performance by allowing the database to quickly filter or match rows based on combinations of column values.
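  • We should be careful about the order of the columns in composite indexes. An index can generally be used efficiently only when the query filters on its leading column(s). As a rule of thumb, among the columns our queries filter on, the column with the most distinct values should come first in the composite index.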

By keeping these considerations in mind, you can create indexes that effectively improve query performance, optimize resource utilization and enhance overall database efficiency.
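The point about column order in composite indexes can be checked with a small experiment. The sketch below again uses Python's sqlite3 module; the table layout and the index name idx_users_country_city are assumptions for illustration, and the behaviour shown is SQLite's, so other engines may plan these queries differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, "
    "city TEXT, country TEXT)"
)
# Hypothetical composite index: country is the leading column, city the second.
conn.execute("CREATE INDEX idx_users_country_city ON users (country, city)")


def query_plan(sql):
    """Return SQLite's plan description for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# Filters on both indexed columns: the index can be searched.
plan_both = query_plan(
    "SELECT * FROM users WHERE country = 'IN' AND city = 'Pune'"
)
# Filters on the leading column only: the index can still be searched.
plan_leading = query_plan("SELECT * FROM users WHERE country = 'IN'")
# Filters on the second column only: the planner falls back to a table scan.
plan_trailing = query_plan("SELECT * FROM users WHERE city = 'Pune'")

for plan in (plan_both, plan_leading, plan_trailing):
    print(plan)
```

If most of our queries filtered by city alone, an index on (city, country) would be the better fit, so the right order depends on the queries as much as on the data.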

How does an Index help to improve query performance?

To understand how indexes help to improve query performance, let us take a look at an example table.

Table Name: Users

Field Name | Storage (in bytes)
Id         | 4
Name       | 8
Age        | 4
City       | 8
Country    | 8

Each record in this table takes 4+8+4+8+8 = 32 bytes of storage. Let us assume the table contains 100 records. Data is stored on disk in blocks, and the database engine reads data from disk one block at a time. Each block can hold a fixed amount of data. If we assume a block stores 128 bytes, then each block contains 128/32 = 4 records, and the whole table occupies 100/4 = 25 blocks.

Let us also assume, purely to keep the arithmetic simple, that reading a block from disk takes around 1 second. If we want to fetch users by age without an index, the matching records could be anywhere in the table, so in the worst case we have to read all 25 blocks, which takes around 25 seconds. (Real-world databases usually read far less data than this, because they organize data with structures such as B+ trees for performance optimization.)

If we have an index on the age column, the index is sorted on that column. For representational purposes, we can consider it to look something like the table below.

Age | Address on Disk
11  | 4
12  | 2
13  | 5
16  | 1
19  | 11

The index has 100 entries with two columns each: the age and the address of the record on disk. Age takes 4 bytes, and let us assume the address also takes 4 bytes, so the total size of the index for 100 records is 100 × 8 = 800 bytes. Storing it requires 800/128 = 6.25, rounded up to 7 blocks. To fetch a single record, the database engine reads at most 7 index blocks plus 1 data block (assuming all records matching that age reside in the same block). Under our assumptions this takes around 8 seconds, so the improvement factor for the query is 25/8 = 3.125. Because the index is sorted, the engine does not actually need to read every index block; a binary search or tree traversal touches even fewer, so 8 seconds is a pessimistic estimate. Even for a small table the improvement is significant, and it grows as the table gets larger.
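The arithmetic of this example can be reproduced with a few lines of Python. All constants come from the assumptions above (32-byte records, 128-byte blocks, 1 second per block read); the final binary-search estimate is an extra illustration that goes slightly beyond the worked numbers.

```python
import math

RECORD_BYTES = 4 + 8 + 4 + 8 + 8   # Id, Name, Age, City, Country
BLOCK_BYTES = 128
NUM_RECORDS = 100
SECONDS_PER_BLOCK = 1               # deliberately unrealistic, for easy arithmetic

# Full table scan: in the worst case every data block is read.
records_per_block = BLOCK_BYTES // RECORD_BYTES
table_blocks = math.ceil(NUM_RECORDS / records_per_block)
scan_seconds = table_blocks * SECONDS_PER_BLOCK

# Index lookup: read every index block, then one data block.
INDEX_ENTRY_BYTES = 4 + 4           # age + assumed 4-byte disk address
index_blocks = math.ceil(NUM_RECORDS * INDEX_ENTRY_BYTES / BLOCK_BYTES)
indexed_seconds = (index_blocks + 1) * SECONDS_PER_BLOCK

speedup = scan_seconds / indexed_seconds

# Extra illustration: binary search over the sorted index blocks needs at most
# floor(log2(n)) + 1 block reads, which equals n.bit_length() for n >= 1.
binary_search_reads = index_blocks.bit_length()
binary_search_seconds = (binary_search_reads + 1) * SECONDS_PER_BLOCK

print(table_blocks, index_blocks, indexed_seconds, speedup)
print(binary_search_reads, binary_search_seconds)
```

Changing NUM_RECORDS to a much larger value shows how the gap widens: the scan cost grows linearly with the table, while the index stays a quarter of the table's size and the binary-search cost grows only logarithmically.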

Conclusion

Indexing is a very useful technique for reducing search time in database queries. Its main purpose is to improve data retrieval performance. Indexes need to be applied thoughtfully, and in-depth knowledge of the queries that run on the tables is an important prerequisite. An index can also have negative effects, such as slower inserts, updates and deletes and higher disk usage, which should be weighed before creating an index on a table.
