Nikita Kutsokon
Databases: Indexes. Part 2

In the previous section, we explored the fundamental concepts of database indexes and their role in enhancing query performance. Now, we will delve deeper into advanced indexing techniques that can further optimize your database. As data grows and queries become more complex, understanding how to fine-tune and select the right indexing strategy becomes crucial. In this section, we’ll cover different types of indexes, such as clustered, non-clustered, and composite indexes, and explore when and how to use them effectively. We will also discuss common indexing pitfalls and best practices to ensure your database remains efficient as it scales.


Types of Indexes

1. Single-Column Indexes
This is the most basic type of index. It is created on a single column of a table.

Let’s say we have a users table with columns id, name, and age. If we frequently search for users by their name, we could create an index on the name column:

CREATE INDEX idx_name ON users(name);

This index will speed up searches like:

SELECT * FROM users WHERE name = 'Alice';

Without the index, the database would have to scan every row, which is slow for large tables. But with the index, it can directly look up rows where the name is 'Alice'.
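To make this concrete, here is a small runnable sketch using Python's built-in sqlite3 module as a stand-in for any SQL engine (table and index names match the example above). EXPLAIN QUERY PLAN is SQLite's rough equivalent of PostgreSQL's EXPLAIN:

```python
import sqlite3

# In-memory SQLite database standing in for any SQL engine
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users (name, age) VALUES (?, ?)",
                 [("Alice", 30), ("Bob", 25), ("Carol", 41)])

# Without an index, the planner has to read every row
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE name = 'Alice'").fetchone()[-1]
print(plan_before)   # e.g. "SCAN users"

conn.execute("CREATE INDEX idx_name ON users(name)")

# With the index, the same query becomes a direct lookup
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE name = 'Alice'").fetchone()[-1]
print(plan_after)    # e.g. "SEARCH users USING INDEX idx_name (name=?)"
```

The exact plan wording varies between SQLite versions, but the shift from a scan to an index search is the same effect PostgreSQL or MySQL would show.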

2. Composite Indexes
A composite index is an index that involves more than one column. It’s useful when you frequently query the database using multiple columns together in the WHERE clause or as part of JOIN conditions.

Imagine we have a users table with the following columns: first_name, last_name, age. If we often search for users by both first_name and last_name together, we can create a composite index like this:

CREATE INDEX idx_first_name_last_name ON users(first_name, last_name);

Now, if we run a query like:

SELECT * FROM users WHERE first_name = 'John' AND last_name = 'Doe';

The database will use the composite index to quickly find rows where both first_name is 'John' and last_name is 'Doe', without scanning the entire table.

⚠️ Important Note:

A composite index is like a combined shortcut that helps the database find information faster when searching across multiple columns at once. The order of columns in the index matters. Because the index above was created with first_name first and last_name second, it works best when you search using both columns, or by first_name alone. If you search only by last_name, the index won't help, because its entries are sorted by the first column first. In short, the database can use the index for any leftmost prefix of its columns, but not for the trailing columns alone.
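The leftmost-prefix rule can be observed directly. Below is a hedged sketch with Python's sqlite3 (index name idx_first_last is illustrative); the planner uses the composite index only when the leading column appears in the filter:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (first_name TEXT, last_name TEXT, age INTEGER)")
conn.execute("CREATE INDEX idx_first_last ON users(first_name, last_name)")

def plan_for(query):
    # Ask the planner how it would execute the query
    return conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[-1]

# Filtering on the leftmost column (alone or with others) can use the index
both_cols  = plan_for("SELECT * FROM users WHERE first_name = 'John' AND last_name = 'Doe'")
first_only = plan_for("SELECT * FROM users WHERE first_name = 'John'")

# Filtering on the trailing column alone falls back to a table scan
last_only  = plan_for("SELECT * FROM users WHERE last_name = 'Doe'")

print(both_cols)   # uses idx_first_last
print(first_only)  # still uses idx_first_last
print(last_only)   # a plain scan -- the index is not usable here
```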

3. Unique Indexes
A unique index is an index that ensures the uniqueness of the values in the indexed column. It is automatically created when you define a column with a UNIQUE constraint.

Suppose we have a users table with an email column, and we want to ensure no two users have the same email. We can create a unique index like this:

CREATE UNIQUE INDEX idx_email ON users(email);

This ensures that every email in the email column is unique. If someone tries to insert a duplicate email, the database will reject the insertion.
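A quick sketch of that rejection in action, again using sqlite3 as a stand-in (the duplicate insert raises an integrity error and the table is left unchanged):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE UNIQUE INDEX idx_email ON users(email)")

conn.execute("INSERT INTO users (email) VALUES ('alice@example.com')")  # succeeds

try:
    # A second row with the same email violates the unique index
    conn.execute("INSERT INTO users (email) VALUES ('alice@example.com')")
except sqlite3.IntegrityError as err:
    print("insert rejected:", err)

count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(count)  # still 1 row
```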

4. Full-Text Indexes
A full-text index is specialized for text searching. It allows the database to efficiently search for words or phrases within large text columns, often used in fields like articles, descriptions, and comments.

If you have a posts table with a content column, you can create a full-text index for efficient searching:

CREATE FULLTEXT INDEX idx_content ON posts(content);

Then, you can perform a search like:

SELECT * FROM posts WHERE MATCH(content) AGAINST ('keyword');

This is much faster than a basic search because the full-text index is designed specifically to handle such queries efficiently.
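The MATCH ... AGAINST syntax above is MySQL's. As a rough, runnable equivalent, SQLite exposes full-text search through FTS5 virtual tables (included in most standard builds of Python's sqlite3); the idea is the same, a dedicated inverted index over the text:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# In SQLite, the FTS5 virtual table itself acts as the full-text index
conn.execute("CREATE VIRTUAL TABLE posts USING fts5(content)")
conn.executemany("INSERT INTO posts (content) VALUES (?)", [
    ("Full-text indexes make keyword searches fast",),
    ("This row talks about something else entirely",),
])

# MATCH consults the inverted index instead of scanning every document
rows = conn.execute("SELECT content FROM posts WHERE posts MATCH 'keyword'").fetchall()
print(rows)  # only the post containing the word "keyword"
```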

5. Clustered Index

A clustered index determines the physical order of data in the table. In many database systems (for example, SQL Server and MySQL's InnoDB engine), a clustered index is automatically created when you define a primary key. There can be only one clustered index per table, because the data rows can be physically sorted in only one order.

CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(100),
    age INT
);

The database will create a clustered index on the id column by default, and the data in the table will be physically stored in order of id. This makes range queries very efficient, such as:

SELECT * FROM users WHERE id BETWEEN 1 AND 10;

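(PostgreSQL, notably, does not maintain a clustered index automatically; its CLUSTER command is a one-time physical reorder.) As a sketch of the idea, SQLite rowid tables behave much like a clustered index: rows are physically keyed on the INTEGER PRIMARY KEY, so a range lookup on id never touches a full scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# In SQLite an INTEGER PRIMARY KEY is the rowid, so the table's B-tree
# is physically keyed on id -- the closest analogue to a clustered index
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                 [(i, f"user_{i}", 20 + i % 50) for i in range(1, 101)])

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM users WHERE id BETWEEN 1 AND 10").fetchone()[-1]
print(plan)  # a SEARCH on the primary key, not a full scan

rows = conn.execute("SELECT id FROM users WHERE id BETWEEN 1 AND 10").fetchall()
print(len(rows))  # 10
```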
6. Non-Clustered Index

A non-clustered index is a separate structure from the data table. It contains a copy of the indexed columns and a pointer to the actual rows in the table. Unlike clustered indexes, non-clustered indexes don't change the physical order of the data.

For a table with a non-clustered index:

CREATE INDEX idx_name ON users(name);

This index will help quickly look up rows based on the name column, but the data in the table remains unordered based on name.


Table Scans

When querying a database, the way the system retrieves data from tables significantly affects performance. Databases use different scanning methods to fetch records, depending on whether indexes are available and how the query is structured. Understanding these table scans helps in optimizing queries and designing efficient indexes. Let's look at some of them:

1. Sequential Table Scan
A Sequential Table Scan occurs when the database reads every row in a table to find matching records. This happens when:

  • No index exists on the column being queried.
  • The query optimizer determines that scanning the entire table is more efficient than using an index.

Suppose we have a users table:

SELECT * FROM users WHERE age = 25;

If age is not indexed, the database will check each row, making this slow for large tables.

2. Index Scan

An Index Scan occurs when the database reads the entire index instead of scanning the table directly. This is more efficient than a full table scan, but it can still lead to high I/O costs for large datasets because, after scanning the index, the database must fetch the matching rows from the table itself.

If age is indexed, the query:

SELECT * FROM users WHERE age = 25;

Will first scan the index for matching values before retrieving actual records. This is faster than a sequential scan but can still be expensive if many rows match.

3. Bitmap Index Scan
A Bitmap Index Scan works differently from a traditional index scan because it is optimized for columns with low cardinality (few unique values). Instead of storing direct row pointers, it uses a bitmap (a series of bits) to represent which rows contain a particular value.

Let's say we have a users table with a status column that can have only three values: active, inactive, and banned. Since there are only three possible values, this column is a great candidate for a bitmap index.

How It Works

Imagine we have a users table with 10 million rows and a status column, which can have only three values:

  • active
  • inactive
  • banned

If we don't use an index, searching for all active users would mean checking each row one by one, which is slow. Instead, the database creates a separate bitmap for each unique value (active, inactive, banned). Each bitmap is simply a long sequence of 1s and 0s, where:

  • 1 means the row has that value.
  • 0 means the row does not have that value.

(Image: bitmap indexing illustration)

Now, if we run:

SELECT * FROM users WHERE status = 'active';

Instead of scanning the whole table, the database retrieves the bitmap for 'active' (shown here for just four rows):

1 0 1 0

This tells the database exactly which rows contain active (1st, 3rd). The system skips all rows where the bit is 0, making the lookup very fast. Finally, it fetches only those rows from the actual table.

Why Is This Efficient?

  • Bitmaps are small → They use less space compared to traditional indexes.
  • Fast Filtering → The database can quickly determine matching rows using bitwise operations (AND, OR).
  • Great for Multiple Conditions → If we add another condition, like:
SELECT * FROM users WHERE status = 'active' AND age = 30;

The database just combines bitmaps for status = 'active' and age = 30 using a bitwise AND operation, avoiding unnecessary scans.
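The combination step above can be sketched in plain Python, with each bitmap modeled as an integer whose bit i is set when row i matches (the table mirrors the example):

```python
# Each bitmap is a Python int: bit i set means row i matches the condition
rows = [
    {"status": "active",   "age": 30},
    {"status": "inactive", "age": 30},
    {"status": "active",   "age": 25},
    {"status": "active",   "age": 30},
]

def bitmap(pred):
    # Build one bitmap per condition: set bit i when row i satisfies pred
    bits = 0
    for i, row in enumerate(rows):
        if pred(row):
            bits |= 1 << i
    return bits

active = bitmap(lambda r: r["status"] == "active")
age_30 = bitmap(lambda r: r["age"] == 30)

# A single bitwise AND combines both conditions without rescanning the table
combined = active & age_30
matching = [i for i in range(len(rows)) if combined >> i & 1]
print(matching)  # rows satisfying both conditions: [0, 3]
```

Real databases do exactly this, just on compressed bitmaps covering millions of rows, which is why combining several low-cardinality filters is so cheap.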

When to Use a Bitmap Index
✔ Best for columns with few unique values (status, gender, category).
❌ Not efficient for high-cardinality columns (username, email).


Understanding the EXPLAIN Query

The EXPLAIN command in PostgreSQL returns the query execution plan, which shows how the database will execute your query. This helps you analyze and optimize query performance. Below, we'll break down each part of the SQL and the EXPLAIN output. To see how it works, let's create a customer table:

CREATE TABLE customer (
    id SERIAL PRIMARY KEY,
    first_name VARCHAR(50) NOT NULL,
    last_name VARCHAR(50) NOT NULL,
    status VARCHAR(10) CHECK (status IN ('active', 'inactive', 'banned')) NOT NULL,
    age INT CHECK (age >= 0)
);

This creates a customer table with columns for id, first_name, last_name, status, and age. To experiment with this table we need some data in it, so let's insert random rows with the following query:

INSERT INTO customer (first_name, last_name, status, age)
SELECT 
    LEFT(md5(random()::text), 10),
    LEFT(md5(random()::text), 10),
    (ARRAY['active', 'inactive', 'banned'])[floor(random() * 3 + 1)], 
    floor(random() * 80 + 18)
FROM generate_series(1, 10);

Now we have a table with the following data:

(Image: customer table overview)

Let's see how the EXPLAIN query works in action:

EXPLAIN SELECT * FROM customer;

(Image: EXPLAIN query output)

This command shows the execution plan for selecting all rows from the customer table. It will likely use a Seq Scan (sequential scan) because no index is defined yet and the entire table must be read. Let's extend EXPLAIN with the ANALYZE option. EXPLAIN ANALYZE executes the query and returns both the execution plan and the actual runtime statistics:

EXPLAIN ANALYZE SELECT * FROM customer WHERE status = 'banned';

(Image: EXPLAIN ANALYZE output)

Let's break down the EXPLAIN ANALYZE output for the query:

  • Seq Scan on customer
    The query starts with a sequential scan (Seq Scan) on the customer table. This means PostgreSQL is checking each row of the table to find the records that match the condition (status = 'banned'). Since no index is set up for the status column, PostgreSQL has no shortcut to directly find the rows it needs. Instead, it has to look through the entire table, one row at a time.

  • cost=0.00..13.25 rows=1 width=282
    The estimated cost of this query ranges from 0.00 (the startup cost, i.e. the work needed before the first row can be returned) to 13.25 (the total cost to return all rows). The startup cost is zero because a sequential scan can emit rows as soon as it starts reading. The total cost is higher because PostgreSQL must read every row to apply the filter (status = 'banned'). Note that cost values are in arbitrary planner units, not milliseconds. The planner also estimates that about 1 row will match the condition.

  • rows=1
    PostgreSQL's initial estimation of the number of rows that will match the condition (status = 'banned'). This is an estimate based on table statistics.

  • width=282
    The average size (in bytes) of the rows in the customer table.

  • actual time=0.147..0.151
    The actual execution times: the first matching row was produced 0.147 ms after the scan started, and the scan finished at 0.151 ms. Since the table is tiny, PostgreSQL scanned it and found the matching rows almost instantly, with a total time of only 0.151 ms.

  • rows=5 loops=1
    Finally, the query executed in a single loop (loops=1). This means PostgreSQL only had to scan the table once to find the matching rows. Even though it estimated that only 1 row would be returned, it actually found 5 rows that had the status 'banned', which is why 5 rows were returned. This reflects the actual results of the query.


To illustrate how adding an index can change the table scan type, let's walk through the process with the customer table:

We start by inserting a huge amount of random data into the customer table:

INSERT INTO customer (first_name, last_name, status, age)
SELECT 
    LEFT(md5(random()::text), 10),
    LEFT(md5(random()::text), 10),
    (ARRAY['active', 'inactive', 'banned'])[floor(random() * 3 + 1)], 
    floor(random() * 80 + 18)
FROM generate_series(1, 1000000);

(Image: more data inserted into the customer table)

Now, let's check the query plan for selecting rows based on the first_name column:

EXPLAIN ANALYZE SELECT * FROM customer WHERE first_name = '698ffdfe4d';

The database decides to use a Sequential Scan (Seq Scan), meaning it scans all rows in the table to find the matching entry. Additionally, parallel workers are used to improve the scan efficiency when dealing with larger datasets:

(Image: sequential scan query plan)

Now, we add an index on the first_name column to improve search performance:

CREATE INDEX idx_customer_first_name ON customer(first_name);

Once the index is created, let's run the same query again to see the changes in the execution plan:

EXPLAIN ANALYZE SELECT * FROM customer WHERE first_name = '698ffdfe4d';

As we can see, after adding the index, the database now uses an Index Scan rather than a Sequential Scan. This significantly reduces the query execution time and cost.

(Image: index scan query plan)

🎉 The Index Scan using idx_customer_first_name is approximately 263 times faster than the Sequential Scan (0.233 ms vs 61.311 ms). This demonstrates the significant performance boost indexing provides, especially when filtering by a specific column in large datasets. By avoiding a full table scan, the index dramatically reduces query execution time and improves efficiency.
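You can reproduce the effect on a smaller scale with Python's sqlite3. Exact timings will differ from the PostgreSQL numbers above, but the index lookup should win by orders of magnitude:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, first_name TEXT)")
conn.executemany("INSERT INTO customer (first_name) VALUES (?)",
                 ((f"name_{i}",) for i in range(200_000)))

query = "SELECT * FROM customer WHERE first_name = 'name_123456'"

start = time.perf_counter()
conn.execute(query).fetchall()
seq_time = time.perf_counter() - start     # sequential scan over 200k rows

conn.execute("CREATE INDEX idx_customer_first_name ON customer(first_name)")

start = time.perf_counter()
conn.execute(query).fetchall()
idx_time = time.perf_counter() - start     # direct index lookup

print(f"seq scan: {seq_time * 1000:.3f} ms, index scan: {idx_time * 1000:.3f} ms")
```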


How to Choose the Right Index

Choosing the right index for a table is critical to optimizing query performance. While indexes can significantly speed up read operations, they come with overhead on write operations (like INSERT, UPDATE, DELETE) because the indexes need to be maintained. Therefore, it’s essential to strike a balance between the speed of queries and the cost of maintaining indexes. Here are some guidelines to help you choose the best index for your scenario:

Understand Your Queries
Before creating an index, analyze which queries run most often. Focus on columns in WHERE, JOIN, and ORDER BY clauses, as indexing them can speed up search and sorting operations.

Choose High-Selectivity Columns
Indexes work best on columns with many unique values, like email or user_id. Columns with few values (status or gender) may not benefit much from a regular index, but bitmap indexes can be useful.

Single vs Composite Indexes
A single-column index is best for filtering by one column, while a composite index is better for queries using multiple columns together. Remember, in composite indexes, the order of columns matters—queries should use the left-most column first.

Balance Indexes with Write Performance
Indexes speed up searches but slow down writes (INSERT, UPDATE, DELETE) because they need to be updated too. Avoid unnecessary indexes and regularly check for unused ones to maintain fast performance.

Consider Table Size
For small tables, indexes might not make a big difference. But for large tables, they are essential to avoid slow full-table scans. Sparse data (many NULL values) can also reduce index efficiency.

Test and Optimize
Use tools like EXPLAIN to check how your queries use indexes. Try different index types and compare performance to find the best option. Regularly update your indexing strategy as your data and queries evolve.

