Recently, we received a complaint about slow-running order tracking screens on a production ERP system. The first thought on everyone's mind was "missing indexes," and suggestions like "let's add an index to this column" were flying around. However, my experience has shown that indexes aren't always a magic bullet; in fact, when used incorrectly, they can make the system even worse. Database indexes often require an engineering trade-off: you increase read speed while also increasing write cost and disk space.
In this post, I'll explain when indexes are truly necessary, when they create an unnecessary burden, and the strategy I follow in a production environment. Adding an index for every query often creates more problems than it solves, as I've seen countless times. My goal is to offer a more pragmatic perspective on this topic.
The Fundamental Purpose and Invisible Cost of Indexes
Indexes are special data structures that allow a database to find data in specific columns more quickly. Much like the index at the end of a book, they help you find the information you're looking for without having to read the entire book from start to finish. In PostgreSQL, the most common index types we encounter are B-tree, GIN, and BRIN. B-tree is excellent for equality (=), range (>, <), and sorting (ORDER BY) operations. GIN (Generalized Inverted Index) is used for full-text searches on complex data types like arrays or JSONB, while BRIN (Block Range Index) works on sequential or nearly sequential data in large tables.
So, why don't we add indexes to everything? This is where the cost aspect comes into play. Every index takes up disk space. When a new row is added to a table (INSERT), an existing row is updated (UPDATE), or a row is deleted (DELETE), the database must update not only the main table data but also every index associated with that table. This increases the cost of write operations. In a table with many indexes, even a small UPDATE can significantly increase CPU and I/O load.
⚠️ The Hidden Cost: VACUUM
In PostgreSQL, indexes also bring additional overhead for VACUUM operations, especially after UPDATE and DELETE operations. When a row is updated, its old version remains on disk, and the index continues to point to the old version. VACUUM cleans up these dead rows and ensures the indexes are also updated. Too many indexes can cause VACUUM operations to take longer and thus consume more system resources. In one customer project, due to 12 indexes on a table with high write volume, VACUUM operations stretched for hours, causing
autovacuumprocesses to constantly fall behind.
For example, let's assume we have a products table with 10 million rows. Adding a GIN index solely for full-text search on the name and description columns can increase the table's size by 30-40%. Furthermore, every INSERT or UPDATE operation must also update this GIN index. In the simple scenario below, you can see the difference in INSERT performance between a table without indexes and one with indexes:
-- Table without index
CREATE TABLE products_no_index (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
description TEXT,
price NUMERIC(10, 2),
created_at TIMESTAMP DEFAULT NOW()
);
-- Table with index
CREATE TABLE products_with_index (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
description TEXT,
price NUMERIC(10, 2),
created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_products_name_description ON products_with_index USING GIN (to_tsvector('english', name || ' ' || description));
CREATE INDEX idx_products_price ON products_with_index (price);
-- Performance test (Should be done with more data and repetitions in a real scenario)
-- Inserting 100,000 records into the products_no_index table
INSERT INTO products_no_index (name, description, price)
SELECT
'Product ' || generate_series(1, 100000),
'Description for product ' || generate_series(1, 100000),
(random() * 1000)::NUMERIC(10, 2);
-- Approximate time: 1.5 - 2 seconds
-- Inserting 100,000 records into the products_with_index table
INSERT INTO products_with_index (name, description, price)
SELECT
'Product ' || generate_series(1, 100000),
'Description for product ' || generate_series(1, 100000),
(random() * 1000)::NUMERIC(10, 2);
-- Approximate time: 4 - 6 seconds (Longer due to the additional cost of the GIN index)
As you can see, just two indexes can triple or quadruple the INSERT time. Under large datasets and heavy write loads, this difference can cripple the system's overall response time. For me, simple tests like these have always been a starting point before making index decisions.
Index Strategies Based on Query Types
Not every query requires an index; in fact, using an index for some can degrade performance. The key is to understand the type of operation the query performs and develop an appropriate index strategy.
-
Point Lookups and Range Scans:
- For queries like
WHERE id = 123orWHERE email = 'user@example.com', B-tree indexes are indispensable. Primary keys are automatically B-tree indexed. - Range queries like
WHERE created_at BETWEEN '2023-01-01' AND '2023-12-31'also benefit from B-tree indexes. - In an ERP system, searching for orders by customer number or querying stock by product code falls into this category.
- For queries like
-
Sorting (ORDER BY) and Grouping (GROUP BY):
- Queries containing
ORDER BY column_namecan avoid afilesort(i.e., sorting the result set in memory or on disk) if there's a B-tree index on the relevant column. This provides significant performance gains, especially with large datasets. -
GROUP BYclauses can similarly benefit from indexes if the grouping columns are indexed. - These indexes are lifesavers when listing the top-selling products of the last 30 days in descending order of price on a reporting screen.
- Queries containing
-
LIKE Operator and Full-Text Searches:
- For prefix searches like
WHERE name LIKE 'Mustafa%', B-tree indexes can be used. - However, for searches that start from the middle or end, like
WHERE name LIKE '%Mustafa%', B-tree indexes are generally not useful. In such cases, PostgreSQL is forced to perform a sequential scan. - For full-text search, GIN or GiST indexes are used with the
to_tsvectorandto_tsqueryfunctions. These offer much more complex and flexible search capabilities. - In my own side project, I used GIN indexes to allow users to quickly find the content they were looking for.
- For prefix searches like
The EXPLAIN ANALYZE output below shows how a simple query is sped up with an index.
-- Let's observe the effect of the index on the products_with_index table
-- First, let's run a query without an index (hypothetically, if there were no index)
EXPLAIN ANALYZE SELECT * FROM products_no_index WHERE price > 990;
QUERY PLAN
------------------------------------------------------------------------------------------
Seq Scan on products_no_index (cost=0.00..1845.00 rows=1000 width=70) (actual time=0.021..19.648 rows=1000 loops=1)
Filter: (price > '990'::numeric)
Rows Removed by Filter: 99000
Planning Time: 0.075 ms
Execution Time: 19.670 ms
(5 rows)
-- Now let's run the same query on the indexed table
EXPLAIN ANALYZE SELECT * FROM products_with_index WHERE price > 990;
QUERY PLAN
------------------------------------------------------------------------------------------
Index Scan using idx_products_price on products_with_index (cost=0.29..36.31 rows=1000 width=70) (actual time=0.038..0.399 rows=1000 loops=1)
Index Cond: (price > '990'::numeric)
Planning Time: 0.082 ms
Execution Time: 0.419 ms
(4 rows)
As you can see, thanks to the idx_products_price index, the query time has dropped from 19.670 ms to 0.419 ms. This is more than a 40x improvement. However, this depends on the distribution of data in the price column and the selectivity of the query. If the query returned more than 80% of the table, the optimizer would likely choose a sequential scan over using an index, because reading the index and then fetching the data from the main table could become more expensive than simply reading the entire table. In such situations, in my experience, trusting the database optimizer's decision is usually best. [Related: PostgreSQL Optimizer Insights]
Over-indexing and Hidden Performance Losses
Adding an index to every column in a table is a trap that beginners often fall into. The logic of "the more indexes, the faster the query" sounds good in theory, but in practice, it usually has the opposite effect. I encountered a similar situation while developing the reporting infrastructure for a bank's internal platform. Developers had added indexes to every possible filter column to ensure each report ran quickly. The result? A simple INSERT operation on the transactions table jumped from 300 ms to 2 seconds.
Some hidden performance losses caused by over-indexing include:
- Disk Space Bloat: Every index occupies disk space. Numerous indexes, especially on large tables, can mean terabytes of additional disk space. This also extends backup times and increases disk costs.
- Slower Write Operations: As mentioned earlier, every INSERT, UPDATE, and DELETE operation must also update all relevant indexes. As the number of indexes increases, this cost grows exponentially.
- VACUUM Overhead: Due to PostgreSQL's MVCC (Multi-Version Concurrency Control) architecture, UPDATE and DELETE operations leave behind "dead tuples." These dead tuples and the index entries pointing to them are cleaned up by
VACUUM. Too many indexes can cause VACUUM operations to take longer and consume more system resources. - Optimizer Decision Time: The PostgreSQL query optimizer must evaluate many possible index combinations to find the best execution plan for a query. As the number of indexes increases, this decision-making time also grows. For complex queries, this additional overhead can become significant.
- Incorrect Index Selection: Sometimes, the database optimizer may choose the wrong index. This situation can occur, especially with skewed data distributions or when statistics are not up-to-date. As the number of indexes increases, the probability of an incorrect selection also increases.
ℹ️ Index-Only Scan vs. Sequential Scan
In PostgreSQL, you might sometimes see an
Index-Only Scaninstead of anIndex Scan. AnIndex-Only Scanmeans the query can return results by reading only from the index, without needing to access the main table. This is a very fast operation. However, if the query requests a column not present in the index, or if the database cannot determine from thevisibility mapthat all rows are already visible, it must access the main table. If a large portion of the table needs to be read, the optimizer might prefer a directSequential Scanover anIndex-Only Scan, as going to the main table for each row after reading from the index could be more expensive than reading the entire table. Understanding these subtle distinctions is crucial for making index decisions.
In a manufacturing firm's ERP, the stock_movements table had 7 different indexes. A simple stock entry operation (INSERT) took an average of 800ms. When I examined the pg_stat_user_indexes table, I found that 4 of these indexes were almost never used. After removing the unused indexes, the same INSERT operation dropped to 250ms. This is a concrete example of how even a simple index cleanup can impact performance.
The Role of Data Distribution in Index Selection
The effectiveness of indexes largely depends on the distribution of data (cardinality). Cardinality is the number of unique values in a column.
- High Cardinality: The higher the number of unique values in a column (e.g., email addresses, Turkish ID numbers, GUIDs), the more effective an index on that column will be. This is because the index can direct the query to a small number of rows directly. A query like
WHERE email = '...'is very fast because it operates on a column with high cardinality. - Low Cardinality: The lower the number of unique values in a column (e.g., a
statuscolumn: 'active', 'passive', 'deleted'), the less effective the index will be. If a query likeWHERE status = 'active'returns 80% of the table, the database will likely prefer a sequential scan over using an index. This is because reading the index and then going to the main table for each row might become more expensive than reading the entire table.
Let's explain this with an example:
-- Let's add a 'category' column to our products table
ALTER TABLE products_with_index ADD COLUMN category TEXT;
UPDATE products_with_index SET category = 'Electronics' WHERE id % 3 = 0;
UPDATE products_with_index SET category = 'Books' WHERE id % 3 = 1;
UPDATE products_with_index SET category = 'Home Goods' WHERE id % 3 = 2;
-- Now let's add an index to the 'category' column
CREATE INDEX idx_products_category ON products_with_index (category);
-- A query with low selectivity
EXPLAIN ANALYZE SELECT * FROM products_with_index WHERE category = 'Electronics';
-- Output: Likely a Sequential Scan or Bitmap Heap Scan (uses the index but still reads most of the table)
In this scenario, the category column has only 3 distinct values. Products in the 'Electronics' category constitute about one-third of the table. PostgreSQL's optimizer might find it more efficient to perform a sequential scan rather than using an index for such a large subset. In my observation, an index is useful if its selectivity returns less than 5-10% of the table; otherwise, the cost-benefit balance is disrupted.
Partial Indexes: A Smart Solution
In some cases, you might want to perform fast searches only on a specific subset of data within a column. This is where partial indexes come into play. Partial indexes create index entries only for rows that satisfy a WHERE condition. This saves disk space and reduces write costs.
💡 Partial Index Example
Assume in an e-commerce site's orders table, most orders have
status = 'completed'. However, we usually search for orders withstatus = 'pending'orstatus = 'failed'quickly. In this case, instead of adding an index to the entirestatuscolumn, we can create an index only for pending or failed orders:CREATE INDEX idx_orders_pending_status ON orders (order_id) WHERE status = 'pending'; CREATE INDEX idx_orders_failed_status ON orders (order_id) WHERE status = 'failed';These indexes create small and fast indexes only for the relevant orders. This uses less disk space and does not affect the performance of general write operations to the
orderstable. In one of my side projects, I used this method to quickly fetch tasks withstatus = 'scheduled'running in the background and saw performance improvements of up to 70%.
Index Management and Monitoring in Production Environments
Simply creating your indexes and forgetting about them is an invitation to problems in a production environment. Continuous monitoring and maintenance are crucial parts of an index strategy.
-
Identifying Unused Indexes: PostgreSQL tracks which indexes are being used through the
pg_stat_user_indexesview. Columns likeidx_scan,idx_tup_read, andidx_tup_fetchshow how many times an index has been used and how many tuples it has read. If an index shows zero usage for a long period, it's likely unnecessary and should be considered for removal.
SELECT relname AS table_name, indexrelname AS index_name, idx_scan, pg_size_pretty(pg_relation_size(indexrelid)) AS index_size, 'DROP INDEX ' || indexrelname || ';' AS drop_command FROM pg_stat_user_indexes WHERE idx_scan = 0 AND schemaname = 'public' -- Default schema ORDER BY pg_relation_size(indexrelid) DESC;I frequently use this query in production. In one instance, on a large e-commerce site, we identified and removed an unnecessary index of 20 GB using this method. This not only saved disk space but also sped up write operations on the relevant table.
Query Analysis with
pg_stat_statements: Thepg_stat_statementsmodule is invaluable for understanding which queries are running slowly and on which columns indexes are needed. This module shows the slowest queries, call counts, and average execution times. After identifying slow queries, we can examine their query plans withEXPLAIN ANALYZEto detect missing indexes or incorrect index usage.Index Rebuilding (REINDEX CONCURRENTLY): Indexes can become fragmented over time and their performance can degrade. This is common, especially after heavy UPDATE and DELETE operations.
REINDEX CONCURRENTLYcan be safely used in production environments because it does not lock the table while rebuilding the index. However, remember that this operation requires double the disk space and can take a long time.Automatic Index Recommendations: Tools like
hypopgcan create hypothetical indexes by analyzing the current query load and simulate their impact on query performance withEXPLAIN. Such tools can serve as a starting point for new indexes, but they always require manual verification and testing. In my experience, while these tools offer a list of "potential indexes," I always make the final decision, as tools cannot fully evaluate trade-offs.
"Why Was No Index Used?" Crises and Their Solutions
Sometimes, even after adding an index to a column, you might find that the query is still slow and EXPLAIN ANALYZE shows that the index is not being used. This situation usually stems from some insidious reasons:
Data Type Mismatches (Implicit Casts): If the data type of the indexed column does not match the data type of the value being compared in the query, PostgreSQL performs an
implicit cast(coercion). This cast operation prevents the index from being used. For example, if you add an index to aVARCHARcolumn and query withWHERE my_column = 123, the database will try to convert theVARCHARvalue to anINTEGER, and the index will not be used. The correct solution is to match the type of the compared value to that of the indexed column:WHERE my_column = '123'. This was a common issue I encountered with date columns in a customer project; comparing adatewith an index on atimestampcolumn killed the index.-
Function Calls on Indexed Columns: Queries like
WHERE function(indexed_column) = 'value'prevent the index from being used. This is because the index was created on the original values ofindexed_column, not on the values returned by the function.
-- Incorrect: Index will not be used SELECT * FROM users WHERE lower(email) = 'mustafa@example.com'; -- Correct: Index can be used (if a functional index exists or if you use the email column directly) -- A functional index can be created: CREATE INDEX idx_users_lower_email ON users (lower(email)); -- Or by changing the query: SELECT * FROM users WHERE email = 'mustafa@example.com';If you frequently search on
lower(email), creating a functional index is a solution. Optimizer's Incorrect Estimation (Stale Statistics): The PostgreSQL optimizer uses table and index statistics to create query plans. If the data distribution has changed significantly and the statistics are not up-to-date (i.e., the
ANALYZEoperation has not run for a long time), the optimizer might choose an incorrect plan and avoid using an index. Running the commandANALYZE TABLE table_name;usually resolves this issue.Very Small Tables: For very small tables (generally under a few thousand rows), the database might find it faster to perform a sequential scan than to use an index. The cost of reading the index and then going to the main table might be higher than reading the entire table from start to finish. This is an "it is what it is" situation; obsessing over indexes for small tables is usually unnecessary.
Complex Conditions and ORM Traps: Sometimes ORMs (Object-Relational Mappers) can generate complex and inefficient queries. Conditions combined with the
ORoperator might not be able to use multiple indexes simultaneously and can lead to asequential scan. In such cases, it might be necessary to optimize the query manually or break it down usingUNION ALL. The N+1 query problem, in particular, is a trap that ORMs frequently fall into, and while it doesn't directly affect the query plan, it results in a large number of small, expensive queries being executed. I discussed this in detail in my previous post [Related: ORM Traps and Solutions].
🔥 Caution:
EXPLAIN ANALYZEResultsSeeing
Bitmap Heap ScaninEXPLAIN ANALYZEoutput indicates that an index was used, but the main table still had to be accessed to retrieve the data.
Top comments (0)