Mustafa ERBAY

Posted on Jun 6 • Originally published at mustafaerbay.com.tr

How to Choose Database Indexes in 3 Steps

#postgres #database #performance

Database index selection, in my nearly 20 years of experience, has often been a critical topic that is either overlooked or misunderstood. Many developers operate with the mindset of "just add an index, it will speed up," but this often leads to bigger problems. Just last month, I personally witnessed a simple SELECT query in a production ERP drop from 15 seconds to 200 milliseconds; all I did was implement the correct index strategy.

In this post, I will discuss database index selection using a practical, step-by-step approach based on my experiences. My goal is to show you not just how to add an index, but why and when to add one, which index type is most suitable for you, and how to monitor performance in the long run. This process is more than just a technical decision; it's an engineering discipline that directly impacts the overall health and sustainability of a system.

The Core Purpose of Index Selection and Misconceptions

Indexes are special lookup tables that allow the database to access data in specific columns more quickly. They work much like a book's table of contents or index: they direct you to the relevant section directly instead of having to scan every page. When I encounter database performance issues, indexes are usually one of the first places I look. However, there's a common mistake here: the idea that adding an index to every column is a good idea.

In a production database, adding an index to the last_update_date column in a 10-million-row table can improve query performance. However, adding an index to a status column (e.g., 'active', 'passive', 'deleted') in the same table, which only has a few distinct values, often doesn't provide the expected benefit and can even slow down write operations. This is because indexes also occupy disk space and need to be updated with every INSERT, UPDATE, and DELETE operation. This additional cost can outweigh the benefit to queries. For example, when I added 10 different indexes to a transaction table, write operations per second dropped by 30%. Therefore, it's crucial to perform a cost-benefit analysis before adding an index.

⚠️ Adding an Index to Every Column Is Not a Solution

Indexes increase the database's write load and require additional storage space. Unnecessary indexes, especially on frequently updated columns or columns with very low cardinality, can negatively impact performance.

In my experience, index selection is not just about speeding up SELECT queries. It also directly affects the efficiency of JOIN operations and the cost of sorting and grouping operations like ORDER BY and GROUP BY. A correct index strategy reduces the system's overall resource consumption (CPU, I/O), allowing it to serve more users and operate more stably. Therefore, it's essential to think of indexes not just as a speed-up tool, but as a component of the system architecture.

Step 1: Understanding Your Queries and Data Access Patterns

Before starting index selection, you need to identify the slowest and most critical queries in your system. For this, examining database logs in a production environment and using tools like pg_stat_statements (for PostgreSQL) provides invaluable information. I typically install pg_stat_statements and target the most frequently executed, longest-running, or most I/O-consuming queries first.

In a client project, I noticed that daily reporting queries sometimes exceeded 30 seconds. What initially appeared to be a complex JOIN chain was, at its root, the lack of an appropriate index on the transaction_date column in a main table. The query was retrieving transactions within a specific date range, but the database had to scan the entire table. To understand your queries, it's not enough to just look at the SQL text; you also need to see how the data is accessed and processed.

Using EXPLAIN ANALYZE for Query Analysis

The EXPLAIN ANALYZE command shows in detail how a query is executed, which planner follows which step, and how much time and resources are consumed. This forms the basis of your index strategy.

EXPLAIN ANALYZE
SELECT id, product_name, quantity, transaction_date
FROM public.orders
WHERE transaction_date BETWEEN '2026-01-01' AND '2026-01-31'
AND customer_id = 12345
ORDER BY transaction_date DESC;

When I run the query above, if I see Seq Scan (sequential scan) in the output, it usually indicates a missing index or incorrect index usage. If I see Index Scan or Bitmap Index Scan, it means an index is being used. However, sometimes a query can still be slow even if an index is used; in such cases, I look at costly steps like Recheck Cond or Filter. For example, if an index is only on transaction_date and not on customer_id, the query might still be slow even if it uses the index.

💡 Understanding EXPLAIN ANALYZE Output

Pay attention to values like Rows Removed by Filter and Heap Fetches in the EXPLAIN ANALYZE output. These indicate how effective the index is or any additional costs. You can identify bottlenecks by comparing cost and actual time values.

The Impact of Data Distribution on Index Decisions

In index selection, the distribution (cardinality) of data in columns is vital. Columns with high cardinality (e.g., user_id, email, UUID) benefit much more from indexes because each value is unique or repeats very few times. Columns with low cardinality (e.g., gender, is_active, status) make it difficult for an index to be more efficient than scanning the entire table.

In my experience, instead of using an index on low-cardinality columns like status, the database often prefers to perform a Seq Scan. This is because scanning the index and then returning to the main table (random I/O) can be more costly than sequentially scanning the entire table (sequential I/O). Especially in large tables, sequential read speeds are much higher. I experienced this situation with the order_status column in a production ERP; orders with 'pending' status constituted 90% of the table, and the index added to this column worsened performance.

Step 2: Choosing the Right Index Type and Strategy

After understanding your queries and data distribution, the next step is to choose the correct index type. Modern databases like PostgreSQL offer various index types for different needs. Choosing the wrong index type can lead to the index not being used at all or performing much worse than expected.

When I developed a search engine-like feature for a side product, I found that standard B-tree indexes were ineffective for queries like LIKE '%keyword%'. In this case, B-trees are optimized only for prefix searches. For full_text_search features, GIN indexes have been a lifesaver. This is a good example of how important it is to choose the index type according to the nature of the query.

The Power and Limitations of B-tree Indexes

B-tree indexes are the default and most commonly used index type in PostgreSQL. They provide excellent performance for equality (=), range (<, >, <=, >=), IN operators, and sorting (ORDER BY). Primary keys (PRIMARY KEY) and unique constraints (UNIQUE KEY) typically use B-tree indexes.

CREATE INDEX idx_orders_customer_date
ON public.orders (customer_id, transaction_date DESC);

Composite B-tree indexes like the one above are also very powerful. If your query uses both customer_id and transaction_date in WHERE or ORDER BY clauses, this index can improve performance by covering both columns. However, the order of this index is important: customer_id should typically be first as it has higher cardinality. If your query only searches by transaction_date, this composite index might not be fully utilized. About 2 years ago, when I defined such an index in the wrong order for the shopping cart application of an e-commerce site, the cart history loading time unnecessarily reached 5 seconds. When I changed the order, it dropped to 500ms.

Special Use Cases for GIN and BRIN Indexes

GIN (Generalized Inverted Index) indexes are specifically designed for JSONB data, array types, and full-text searches. B-tree indexes are insufficient for situations like searching for specific keys within JSONB or checking for the existence of a particular element in an array.

CREATE INDEX idx_products_tags_gin
ON public.products USING GIN (tags); -- tags can be an array or JSONB

This index type saved me, especially when I stored product features as JSONB in a production ERP and needed to perform dynamic filtering on these features. Speeding up JSONB queries with a B-tree index is almost impossible because B-trees are not suitable for searching within complex data structures.

BRIN (Block Range Index) indexes, on the other hand, are very compact indexes designed for very large tables. They are particularly effective in cases where there is a natural ordering (e.g., a continuously increasing ID or timestamp like transaction_date) and the data is physically stored on disk according to this order. BRIN stores minimum and maximum values for each block range and only scans the relevant blocks.

CREATE INDEX idx_logs_timestamp_brin
ON public.public_logs USING BRIN (log_timestamp);

BRIN indexes were very useful when storing billions of rows of log data in an analytics platform. Indexing all of it would have been disastrous in terms of both disk cost and write performance. With BRIN, searching for logs within a specific time range is much faster than scanning the entire table and takes up much less space than a B-tree. However, if the data is not physically ordered, BRIN's effectiveness decreases.

Step 3: Monitoring Performance and Optimizing Indexes

Index selection is not a one-time task. As your application evolves, data volume increases, and query patterns change, you need to review your indexes. I usually check my indexes after every major release or when I observe a significant performance drop. This is an important part of an observability strategy.

A few months ago, I noticed that the table logging user actions in the backend of a financial calculator application I built for my own site suddenly slowed down. The reason was that a newly added index, serving a new report, conflicted with another existing index, and the database planner chose the wrong index. Thanks to my monitoring tools, I was able to quickly detect and intervene in this situation.

Metrics for Monitoring Index Usage and Health

PostgreSQL provides several statistical views to understand how indexes are used and how effective they are:

pg_stat_user_indexes: Shows how many times each index has been used, the number of rows scanned, and how many "heap fetches" were performed by the index. Indexes with idx_scan column equal to 0 are usually unnecessary.
pg_stat_user_tables: Shows the index scan rates and sequential scan rates of tables.
pg_class and pg_indexes: To check the size and definition of indexes.

-- To see the most used indexes
SELECT
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM
    pg_stat_user_indexes
ORDER BY
    idx_scan DESC;

-- To find unused or rarely used indexes
SELECT
    relname AS table_name,
    indexrelname AS index_name,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    idx_scan
FROM
    pg_stat_user_indexes
WHERE
    idx_scan = 0
ORDER BY
    pg_relation_size(indexrelid) DESC;

The queries above give me a "health report" of indexes in a production environment. Indexes with an idx_scan value of 0 and occupying a large amount of space are potential candidates for deletion. However, before deleting, it's necessary to check if this index is used for specific reports or rarely run but critical queries.

Identifying and Cleaning Up Unnecessary Indexes

Unnecessary indexes are not just a waste of disk space; they also incur additional CPU and I/O costs for write operations. I periodically check pg_stat_user_indexes to identify indexes with low or zero idx_scan values.

ℹ️ Index Bloat and VACUUM

In PostgreSQL, indexes, like tables, can experience "bloat." This means that old versions of deleted or updated rows can continue to occupy space in the index. This bloat can be cleaned up with VACUUM FULL or REINDEX commands, but these operations must be performed carefully as they usually require locks. I generally prefer tools like pg_repack.

In a production ERP of a manufacturing company, I found more than 10 unused indexes that had accumulated over the years. Cleaning up these indexes not only freed up 50 GB of disk space but also provided a 5% performance increase in intensive write operations (especially end-of-day batch processes). These small optimizations can accumulate over time in large-scale systems to make significant differences.

A Real-Life Example: Latency Issue in a Production ERP

In a production ERP, there was a dashboard on operator screens showing the real-time status of production orders. The "pending orders" list on this dashboard inexplicably slowed down at the start of the morning shift. Sometimes it took 10-15 seconds, delaying operators from starting their work. The first complaint came on April 28th; no changes had been made to the system.

The symptoms were simple: the query listing pending production orders for a specific shift was slow. The query basically filtered the production_orders table based on shift_id, status = 'pending', and a planned_start_date range. When I looked at the EXPLAIN ANALYZE output, I saw a Seq Scan on the production_orders table. Previously, there were separate indexes on shift_id and planned_start_date, but not on the status column.

The root of the problem was the low cardinality of the status column. 80% of the table was in 'completed' status, and only 10% was 'pending'. The database planner had decided that an index scan based on 'pending' status would be more costly than scanning the entire table. However, when used in conjunction with shift_id and planned_start_date, the status filter became very selective.

My solution was to create a composite B-tree index on (shift_id, planned_start_date, status).

CREATE INDEX idx_production_orders_shift_date_status
ON public.production_orders (shift_id, planned_start_date, status);

After adding this index, I ran the query again with EXPLAIN ANALYZE. I now saw an Index Scan, and the query time instantly dropped from 15 seconds to 250 milliseconds. The lesson I learned was this: a low-cardinality column, while not useful as a standalone index, can be very beneficial as the last column of a composite index combined with higher-cardinality columns. This allowed the planner to both find the correct range and perform the final filtering via the index. Understanding these kinds of trade-offs requires not just technical knowledge, but also experience.

Conclusion

Database index selection has never been a rote process for me. It has always been like a detective story, tracing queries, reading EXPLAIN ANALYZE outputs, understanding data distribution, and weighing the cost-benefit balance of the correct index type. This three-step process – understanding queries, choosing the right index type, and monitoring performance – is key to securing not only the present but also the future of your systems.

Remember, every new index comes with a cost, and every removed index carries a risk. When making these decisions, we must always consider the overall health of the system, the cost of write operations, and storage requirements. In my experience, teams that adopt this approach build more resilient and performant systems over time. In my next post, perhaps I'll discuss the role of `observability` and `monitoring` tools in this process.

DEV Community