Mustafa ERBAY

Posted on May 29 • Originally published at mustafaerbay.com.tr

Database Index Selection: Why Basic Approaches Fall Short?

#postgres #database #index #performance

Introduction: The Unseen Costs of Indexes

When we talk about database performance, indexes are usually the first thing that comes to mind. When a query runs slowly, the first place we look is often for missing or incorrect indexes. We generally know what B-tree, GIN, and BRIN index types do and when to use them. We even have those famous graphs from PostgreSQL documentation in our minds. But in the real world, especially in large and complex systems, how much do we question why these basic index choices often fall short, or even sometimes degrade performance?

In this post, I'll explain with concrete examples from my own experiences why index selections cannot be made by just looking at table and query structures, and how organizational workflows, data distribution, and even hardware can influence these decisions. From the "late shipment report" problem I encountered in a manufacturing ERP to index optimizations in my own financial calculators, we will focus on moments where we pushed the limits of basic approaches.

B-Tree Index: The Savior for Every Situation?

The default and most frequently used index type in PostgreSQL is undoubtedly B-tree. It is generally very successful in speeding up queries using operators like equality (=), greater than (>), less than (<), and BETWEEN. It even works for prefix searches like LIKE 'prefix%'. I remember adding a B-tree index to almost every table while working on a manufacturing ERP for over 5 years.

However, B-trees also have their limits. Especially for searches like LIKE '%suffix' or LIKE '%substring%', due to the structure of B-tree, it can't do much beyond a full table scan. When we encounter such queries, the first solution that comes to mind is either using FTS (Full-Text Search) for more complex search algorithms or moving towards more advanced index structures.

For example, in a client project, we were trying to filter product movements in operator screens in real-time. We were querying by date range and product code, and these queries were quite fast with B-tree indexes. However, operators sometimes wanted to search by entering part of the product description. A search like LIKE '%screen%' caused serious performance issues in tables with millions of rows. Initially, we tried GIN indexes using the pg_trgm extension, but this slowed down table writes. Finally, we made the search need more structural by moving to a different data model.

ℹ️ Limitations of B-Tree Index

B-tree indexes, with their ordered data structure, speed up many common queries. However, their performance can degrade as search patterns become more complex or when data distribution is very uneven. They are particularly insufficient for full-text or complex string matching.

GIN Index: The Powerhouse for Text Searches?

When working with Full-Text Search (FTS) or text-heavy data, GIN (Generalized Inverted Index) indexes come into play. They are used to search for specific words or patterns in data of JSONB, array, or text types. GIN indexes can be a lifesaver in analyzing product descriptions or reviews on an e-commerce site.

In a client project, we were storing product features as JSONB. We needed to query the existence of a specific feature ("color": "blue") or multiple features ("color": "blue" AND "size": "XL") within this JSONB field. GIN indexes were a perfect fit for such queries. We created this index with the command CREATE INDEX idx_products_features ON products USING GIN (features);, and our queries went from seconds to milliseconds.

However, GIN indexes also come with their own costs. GIN indexes occupy much more disk space than B-trees and, importantly, slow down table data insertion (INSERT) or update (UPDATE) operations. This is because the index needs to be updated with every data change. In a project with my own financial calculators, using GIN indexes while processing constantly updated financial data had slowed down write performance so much that we considered moving the data to a separate time-series database.

# Example GIN index creation command
CREATE INDEX idx_articles_content ON articles USING GIN (to_tsvector('turkish', content));

# JSONB query with GIN index
SELECT * FROM products WHERE features @> '{"renk": "mavi", "boyut": "XL"}';

⚠️ Considerations for GIN Indexes

While GIN indexes excel in complex data structures and text searches, they come with significant costs in terms of disk space and write performance. They should be used with caution in systems with heavy write operations.

BRIN Index: An Alternative for Large, Ordered Data

BRIN (Block Range Index) indexes are designed as an alternative to B-trees for large tables and ordered datasets. BRIN indexes use the physical order of the table on disk to determine if data falls within a certain range. Since they only store one entry per data block, they are much smaller than B-trees.

In a data warehouse project, we had a time-series table with millions of records. Data was typically added to this table in chronological order. When querying using the event_timestamp column, using a B-tree index both greatly increased the index size and didn't provide the expected performance for some queries. This is precisely where BRIN indexes came into play.

CREATE INDEX idx_timeseries_event_time ON timeseries_data USING BRIN (event_timestamp);

With this index, when the query specified a time range, PostgreSQL only had to scan the data blocks corresponding to that range, rather than reading all millions of records. The biggest advantage of BRIN indexes is that when data is added in order or has a specific natural order, they can offer similar or better performance with a much smaller footprint than B-trees.

However, BRIN indexes also have a critical prerequisite: the data must be physically ordered on disk. If your data is frequently updated, deleted, or randomly inserted, the benefits of BRIN indexes quickly disappear. I once tried a BRIN index on a stock movement table in a manufacturing company's ERP system. The data was ordered when added, but later stock corrections and returns caused the order to be disrupted, rendering the BRIN index useless.

💡 Advantages and Conditions of BRIN Indexes

BRIN indexes are an excellent option for large and ordered datasets. They save disk space and are effective for range queries. However, as they rely on the physical order of data on disk, maintaining data order is critical.

Overlooked Factors in Index Selection

Typically, when selecting indexes, we focus on query patterns, data types, and the basic characteristics of the index type. However, in a production environment, things are much more complex.

Data Distribution and Cardinality

The cardinality of a column (the number of unique values) plays a critical role in index selection. A B-tree index on a column with low cardinality (e.g., columns with only a few distinct values like gender or status codes) often doesn't perform better than a full table scan. This is because the index will point to rows representing a large portion of the table. In such cases, it's crucial to carefully examine the EXPLAIN ANALYZE output.

At one point, a client's order status table had a status column with only 3 distinct values: 'pending', 'processing', 'completed'. We had created a B-tree index on this column. However, the query WHERE status = 'completed' was slow because it scanned 70% of the table. In this situation, optimizing the query or managing the status in a different data structure might have been a more appropriate approach than using an index.

# B-tree index on a low-cardinality column being insufficient
EXPLAIN ANALYZE SELECT * FROM orders WHERE status = 'completed';
-- We would expect to see a large 'Seq Scan' or 'Bitmap Heap Scan' in the analysis output.

Write vs. Read Balance

Indexes improve read performance but degrade write performance. Every index must be updated during a data change. If your table experiences very frequent data writes (e.g., logging or real-time transaction records), updating multiple indexes for each added piece of data can create a significant performance bottleneck.

In the backend of my own mobile application, I was anonymously logging user activities. Initially, I had created B-tree indexes on columns like date, user ID, and activity type. When millions of log rows were added daily, write performance dropped so much that the application started to slow down. Eventually, I realized that most log queries were just searching by time and switched to a BRIN index solely on event_timestamp, removing the other indexes. This change increased write performance by over 300%.

Index Maintenance and Cost

Indexes don't just take up space; they also require maintenance. In PostgreSQL, the VACUUM operation is important for reclaiming free space left by deleted or updated rows and optimizing indexes. Operations like VACUUM FULL are more aggressive but can cause significant access issues by locking the table.

In a manufacturing ERP system, we weren't regularly checking the pg_stat_user_indexes table. Over time, the indexes had become so bloated that we started experiencing disk space issues. By looking at the idx_scan and last_vacuum/last_autovacuum columns in pg_stat_user_indexes, we identified which indexes were unused or hadn't been VACUUMed for a long time. Deleting unused indexes and optimizing VACUUM settings helped us reduce disk usage by 20%.

Advanced Indexing Approaches

There are also more advanced methods we can resort to when basic index types are insufficient.

Partial Indexes

Partial indexes allow you to create an index on only a specific subset of the table. This reduces the index size and improves write performance. For example, if you frequently query only records with a specific status, you can create a partial index for that status.

In a client project, we rarely queried cancelled orders. The order table had millions of rows, and queries with the condition status = 'cancelled' were slow. However, cancelled orders constituted only 1% of the table. In this case, instead of adding an index to the entire table, we created a partial index just for cancelled orders:

CREATE INDEX idx_orders_cancelled ON orders (order_id) WHERE status = 'cancelled';

This index was much smaller, containing only the order_ids of cancelled orders, and it sped up relevant queries.

Expression Indexes

Expression indexes allow you to create an index on the results of functions or expressions performed on columns, rather than on the columns themselves. The to_tsvector expression I mentioned earlier is an example. Or you can use the lower() function for case-insensitive comparisons.

For instance, if you have a username column in a user table and frequently perform queries like WHERE lower(username) = 'admin', creating an expression index on lower(username) will speed up these queries.

CREATE INDEX idx_users_lower_username ON users (lower(username));
SELECT * FROM users WHERE lower(username) = 'admin';

Covering Indexes (with `INCLUDE` in PostgreSQL)

With the INCLUDE keyword in PostgreSQL 11 and later versions, it's possible to create covering indexes. This allows the query to be completed using only the index, without needing to access the main table. This can significantly improve query performance.

In a financial reporting tool, I needed to retrieve transaction details for a specific account and date range. We had a B-tree index on both the account ID and the date. However, the query also retrieved the transaction description. In this case, I created a covering index by adding the transaction description to the INCLUDE part of the index, which included the order by and where conditions:

CREATE INDEX idx_transactions_account_date ON transactions (account_id, transaction_date) INCLUDE (description);

This way, queries that needed the account_id, transaction_date, and description columns could run solely from the index, without touching the main table at all.

🔥 Considerations for Covering Indexes

Covering indexes can significantly improve query performance but also increase index size. Since each included column in INCLUDE increases the index size, it's important to only add columns that are truly needed. Otherwise, the index itself can become a performance bottleneck.

Conclusion: Indexes Are a Tool, But Not a Solution on Their Own

Database indexes are one of the cornerstones of performance optimization. However, making the right decision by just looking at basic types like B-tree, GIN, or BRIN, or even just analyzing query plans, is often not possible. Factors like data distribution, write/read balance, index costs, and advanced indexing strategies must also be considered.

We must remember that indexes are only one part of the solution in complex systems. Sometimes, the best index is no index at all. Or a better data model, better query writing, or even choosing a different database technology can yield much more effective results than index optimization. One of the biggest mistakes I've seen in my career is over-reliance on indexes while neglecting the underlying data model or query logic.

As I mentioned in my previous [related: database performance analysis] posts, learning to read EXPLAIN ANALYZE output is the first step, but being able to see the system as a whole and manage trade-offs correctly is essential.

DEV Community

Database Index Selection: Why Basic Approaches Fall Short?

Introduction: The Unseen Costs of Indexes

B-Tree Index: The Savior for Every Situation?

GIN Index: The Powerhouse for Text Searches?

BRIN Index: An Alternative for Large, Ordered Data

Overlooked Factors in Index Selection

Data Distribution and Cardinality

Write vs. Read Balance

Index Maintenance and Cost

Advanced Indexing Approaches

Partial Indexes

Expression Indexes

Covering Indexes (with `INCLUDE` in PostgreSQL)

Conclusion: Indexes Are a Tool, But Not a Solution on Their Own

Top comments (0)

Introduction: The Unseen Costs of Indexes

B-Tree Index: The Savior for Every Situation?

GIN Index: The Powerhouse for Text Searches?

BRIN Index: An Alternative for Large, Ordered Data

Overlooked Factors in Index Selection

Data Distribution and Cardinality

Write vs. Read Balance

Index Maintenance and Cost

Advanced Indexing Approaches

Partial Indexes

Expression Indexes

Covering Indexes (with INCLUDE in PostgreSQL)

Conclusion: Indexes Are a Tool, But Not a Solution on Their Own

Covering Indexes (with `INCLUDE` in PostgreSQL)