Database Indexes: Necessary for Every Query?

#database #postgres #performance #indexing

Recently, we received a complaint about slow-running order tracking screens on a production ERP system. The first thought on everyone's mind was "missing indexes," and suggestions like "let's add an index to this column" were flying around. However, my experience has shown that indexes aren't always a magic bullet; in fact, when used incorrectly, they can make the system even worse. Database indexes often require an engineering trade-off: you increase read speed while also increasing write cost and disk space.

In this post, I'll explain when indexes are truly necessary, when they create an unnecessary burden, and the strategy I follow in a production environment. Adding an index for every query often creates more problems than it solves, as I've seen countless times. My goal is to offer a more pragmatic perspective on this topic.

The Fundamental Purpose and Invisible Cost of Indexes

Indexes are special data structures that allow a database to find data in specific columns more quickly. Much like the index at the end of a book, they help you find the information you're looking for without having to read the entire book from start to finish. In PostgreSQL, the most common index types we encounter are B-tree, GIN, and BRIN. B-tree is excellent for equality (=), range (>, <), and sorting (ORDER BY) operations. GIN (Generalized Inverted Index) is used for full-text searches on complex data types like arrays or JSONB, while BRIN (Block Range Index) works on sequential or nearly sequential data in large tables.

So, why don't we add indexes to everything? This is where the cost aspect comes into play. Every index takes up disk space. When a new row is added to a table (INSERT), an existing row is updated (UPDATE), or a row is deleted (DELETE), the database must update not only the main table data but also every index associated with that table. This increases the cost of write operations. In a table with many indexes, even a small UPDATE can significantly increase CPU and I/O load.

⚠️ The Hidden Cost: VACUUM

In PostgreSQL, indexes also bring additional overhead for VACUUM operations, especially after UPDATE and DELETE operations. When a row is updated, its old version remains on disk, and the index continues to point to the old version. VACUUM cleans up these dead rows and ensures the indexes are also updated. Too many indexes can cause VACUUM operations to take longer and thus consume more system resources. In one customer project, due to 12 indexes on a table with high write volume, VACUUM operations stretched for hours, causing autovacuum processes to constantly fall behind.

For example, let's assume we have a products table with 10 million rows. Adding a GIN index solely for full-text search on the name and description columns can increase the table's size by 30-40%. Furthermore, every INSERT or UPDATE operation must also update this GIN index. In the simple scenario below, you can see the difference in INSERT performance between a table without indexes and one with indexes:

-- Table without index
CREATE TABLE products_no_index (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    description TEXT,
    price NUMERIC(10, 2),
    created_at TIMESTAMP DEFAULT NOW()
);

-- Table with index
CREATE TABLE products_with_index (
    id SERIAL PRIMARY KEY,
    name TEXT NOT NULL,
    description TEXT,
    price NUMERIC(10, 2),
    created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX idx_products_name_description ON products_with_index USING GIN (to_tsvector('english', name || ' ' || description));
CREATE INDEX idx_products_price ON products_with_index (price);

-- Performance test (Should be done with more data and repetitions in a real scenario)
-- Inserting 100,000 records into the products_no_index table
INSERT INTO products_no_index (name, description, price)
SELECT
    'Product ' || generate_series(1, 100000),
    'Description for product ' || generate_series(1, 100000),
    (random() * 1000)::NUMERIC(10, 2);
-- Approximate time: 1.5 - 2 seconds

-- Inserting 100,000 records into the products_with_index table
INSERT INTO products_with_index (name, description, price)
SELECT
    'Product ' || generate_series(1, 100000),
    'Description for product ' || generate_series(1, 100000),
    (random() * 1000)::NUMERIC(10, 2);
-- Approximate time: 4 - 6 seconds (Longer due to the additional cost of the GIN index)

As you can see, just two indexes can triple or quadruple the INSERT time. Under large datasets and heavy write loads, this difference can cripple the system's overall response time. For me, simple tests like these have always been a starting point before making index decisions.

Index Strategies Based on Query Types

Not every query requires an index; in fact, using an index for some can degrade performance. The key is to understand the type of operation the query performs and develop an appropriate index strategy.

Point Lookups and Range Scans:
- For queries like WHERE id = 123 or WHERE email = 'user@example.com', B-tree indexes are indispensable. Primary keys are automatically B-tree indexed.
- Range queries like WHERE created_at BETWEEN '2023-01-01' AND '2023-12-31' also benefit from B-tree indexes.
- In an ERP system, searching for orders by customer number or querying stock by product code falls into this category.
Sorting (ORDER BY) and Grouping (GROUP BY):
- Queries containing ORDER BY column_name can avoid a filesort (i.e., sorting the result set in memory or on disk) if there's a B-tree index on the relevant column. This provides significant performance gains, especially with large datasets.
- GROUP BY clauses can similarly benefit from indexes if the grouping columns are indexed.
- These indexes are lifesavers when listing the top-selling products of the last 30 days in descending order of price on a reporting screen.
LIKE Operator and Full-Text Searches:
- For prefix searches like WHERE name LIKE 'Mustafa%', B-tree indexes can be used.
- However, for searches that start from the middle or end, like WHERE name LIKE '%Mustafa%', B-tree indexes are generally not useful. In such cases, PostgreSQL is forced to perform a sequential scan.
- For full-text search, GIN or GiST indexes are used with the to_tsvector and to_tsquery functions. These offer much more complex and flexible search capabilities.
- In my own side project, I used GIN indexes to allow users to quickly find the content they were looking for.

The EXPLAIN ANALYZE output below shows how a simple query is sped up with an index.

-- Let's observe the effect of the index on the products_with_index table
-- First, let's run a query without an index (hypothetically, if there were no index)
EXPLAIN ANALYZE SELECT * FROM products_no_index WHERE price > 990;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Seq Scan on products_no_index  (cost=0.00..1845.00 rows=1000 width=70) (actual time=0.021..19.648 rows=1000 loops=1)
   Filter: (price > '990'::numeric)
   Rows Removed by Filter: 99000
 Planning Time: 0.075 ms
 Execution Time: 19.670 ms
(5 rows)

-- Now let's run the same query on the indexed table
EXPLAIN ANALYZE SELECT * FROM products_with_index WHERE price > 990;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Index Scan using idx_products_price on products_with_index  (cost=0.29..36.31 rows=1000 width=70) (actual time=0.038..0.399 rows=1000 loops=1)
   Index Cond: (price > '990'::numeric)
 Planning Time: 0.082 ms
 Execution Time: 0.419 ms
(4 rows)

As you can see, thanks to the idx_products_price index, the query time has dropped from 19.670 ms to 0.419 ms. This is more than a 40x improvement. However, this depends on the distribution of data in the price column and the selectivity of the query. If the query returned more than 80% of the table, the optimizer would likely choose a sequential scan over using an index, because reading the index and then fetching the data from the main table could become more expensive than simply reading the entire table. In such situations, in my experience, trusting the database optimizer's decision is usually best. [Related: PostgreSQL Optimizer Insights]

Over-indexing and Hidden Performance Losses

Adding an index to every column in a table is a trap that beginners often fall into. The logic of "the more indexes, the faster the query" sounds good in theory, but in practice, it usually has the opposite effect. I encountered a similar situation while developing the reporting infrastructure for a bank's internal platform. Developers had added indexes to every possible filter column to ensure each report ran quickly. The result? A simple INSERT operation on the transactions table jumped from 300 ms to 2 seconds.

Some hidden performance losses caused by over-indexing include:

Disk Space Bloat: Every index occupies disk space. Numerous indexes, especially on large tables, can mean terabytes of additional disk space. This also extends backup times and increases disk costs.
Slower Write Operations: As mentioned earlier, every INSERT, UPDATE, and DELETE operation must also update all relevant indexes. As the number of indexes increases, this cost grows exponentially.
VACUUM Overhead: Due to PostgreSQL's MVCC (Multi-Version Concurrency Control) architecture, UPDATE and DELETE operations leave behind "dead tuples." These dead tuples and the index entries pointing to them are cleaned up by VACUUM. Too many indexes can cause VACUUM operations to take longer and consume more system resources.
Optimizer Decision Time: The PostgreSQL query optimizer must evaluate many possible index combinations to find the best execution plan for a query. As the number of indexes increases, this decision-making time also grows. For complex queries, this additional overhead can become significant.
Incorrect Index Selection: Sometimes, the database optimizer may choose the wrong index. This situation can occur, especially with skewed data distributions or when statistics are not up-to-date. As the number of indexes increases, the probability of an incorrect selection also increases.

ℹ️ Index-Only Scan vs. Sequential Scan

In PostgreSQL, you might sometimes see an Index-Only Scan instead of an Index Scan. An Index-Only Scan means the query can return results by reading only from the index, without needing to access the main table. This is a very fast operation. However, if the query requests a column not present in the index, or if the database cannot determine from the visibility map that all rows are already visible, it must access the main table. If a large portion of the table needs to be read, the optimizer might prefer a direct Sequential Scan over an Index-Only Scan, as going to the main table for each row after reading from the index could be more expensive than reading the entire table. Understanding these subtle distinctions is crucial for making index decisions.

In a manufacturing firm's ERP, the stock_movements table had 7 different indexes. A simple stock entry operation (INSERT) took an average of 800ms. When I examined the pg_stat_user_indexes table, I found that 4 of these indexes were almost never used. After removing the unused indexes, the same INSERT operation dropped to 250ms. This is a concrete example of how even a simple index cleanup can impact performance.

The Role of Data Distribution in Index Selection

The effectiveness of indexes largely depends on the distribution of data (cardinality). Cardinality is the number of unique values in a column.

High Cardinality: The higher the number of unique values in a column (e.g., email addresses, Turkish ID numbers, GUIDs), the more effective an index on that column will be. This is because the index can direct the query to a small number of rows directly. A query like WHERE email = '...' is very fast because it operates on a column with high cardinality.
Low Cardinality: The lower the number of unique values in a column (e.g., a status column: 'active', 'passive', 'deleted'), the less effective the index will be. If a query like WHERE status = 'active' returns 80% of the table, the database will likely prefer a sequential scan over using an index. This is because reading the index and then going to the main table for each row might become more expensive than reading the entire table.

Let's explain this with an example:

-- Let's add a 'category' column to our products table
ALTER TABLE products_with_index ADD COLUMN category TEXT;
UPDATE products_with_index SET category = 'Electronics' WHERE id % 3 = 0;
UPDATE products_with_index SET category = 'Books' WHERE id % 3 = 1;
UPDATE products_with_index SET category = 'Home Goods' WHERE id % 3 = 2;

-- Now let's add an index to the 'category' column
CREATE INDEX idx_products_category ON products_with_index (category);

-- A query with low selectivity
EXPLAIN ANALYZE SELECT * FROM products_with_index WHERE category = 'Electronics';
-- Output: Likely a Sequential Scan or Bitmap Heap Scan (uses the index but still reads most of the table)

In this scenario, the category column has only 3 distinct values. Products in the 'Electronics' category constitute about one-third of the table. PostgreSQL's optimizer might find it more efficient to perform a sequential scan rather than using an index for such a large subset. In my observation, an index is useful if its selectivity returns less than 5-10% of the table; otherwise, the cost-benefit balance is disrupted.

Partial Indexes: A Smart Solution

In some cases, you might want to perform fast searches only on a specific subset of data within a column. This is where partial indexes come into play. Partial indexes create index entries only for rows that satisfy a WHERE condition. This saves disk space and reduces write costs.

💡 Partial Index Example

Assume in an e-commerce site's orders table, most orders have status = 'completed'. However, we usually search for orders with status = 'pending' or status = 'failed' quickly. In this case, instead of adding an index to the entire status column, we can create an index only for pending or failed orders:
CREATE INDEX idx_orders_pending_status ON orders (order_id) WHERE status = 'pending';
CREATE INDEX idx_orders_failed_status ON orders (order_id) WHERE status = 'failed';
These indexes create small and fast indexes only for the relevant orders. This uses less disk space and does not affect the performance of general write operations to the orders table. In one of my side projects, I used this method to quickly fetch tasks with status = 'scheduled' running in the background and saw performance improvements of up to 70%.

Index Management and Monitoring in Production Environments

Simply creating your indexes and forgetting about them is an invitation to problems in a production environment. Continuous monitoring and maintenance are crucial parts of an index strategy.

Identifying Unused Indexes: PostgreSQL tracks which indexes are being used through the pg_stat_user_indexes view. Columns like idx_scan, idx_tup_read, and idx_tup_fetch show how many times an index has been used and how many tuples it has read. If an index shows zero usage for a long period, it's likely unnecessary and should be considered for removal.
```
SELECT
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    'DROP INDEX ' || indexrelname || ';' AS drop_command
FROM
    pg_stat_user_indexes
WHERE
    idx_scan = 0
    AND schemaname = 'public' -- Default schema
ORDER BY
    pg_relation_size(indexrelid) DESC;
```
I frequently use this query in production. In one instance, on a large e-commerce site, we identified and removed an unnecessary index of 20 GB using this method. This not only saved disk space but also sped up write operations on the relevant table.
Query Analysis with pg_stat_statements: The pg_stat_statements module is invaluable for understanding which queries are running slowly and on which columns indexes are needed. This module shows the slowest queries, call counts, and average execution times. After identifying slow queries, we can examine their query plans with EXPLAIN ANALYZE to detect missing indexes or incorrect index usage.
Index Rebuilding (REINDEX CONCURRENTLY): Indexes can become fragmented over time and their performance can degrade. This is common, especially after heavy UPDATE and DELETE operations. REINDEX CONCURRENTLY can be safely used in production environments because it does not lock the table while rebuilding the index. However, remember that this operation requires double the disk space and can take a long time.
Automatic Index Recommendations: Tools like hypopg can create hypothetical indexes by analyzing the current query load and simulate their impact on query performance with EXPLAIN. Such tools can serve as a starting point for new indexes, but they always require manual verification and testing. In my experience, while these tools offer a list of "potential indexes," I always make the final decision, as tools cannot fully evaluate trade-offs.

"Why Was No Index Used?" Crises and Their Solutions

Sometimes, even after adding an index to a column, you might find that the query is still slow and EXPLAIN ANALYZE shows that the index is not being used. This situation usually stems from some insidious reasons:

Data Type Mismatches (Implicit Casts): If the data type of the indexed column does not match the data type of the value being compared in the query, PostgreSQL performs an implicit cast (coercion). This cast operation prevents the index from being used. For example, if you add an index to a VARCHAR column and query with WHERE my_column = 123, the database will try to convert the VARCHAR value to an INTEGER, and the index will not be used. The correct solution is to match the type of the compared value to that of the indexed column: WHERE my_column = '123'. This was a common issue I encountered with date columns in a customer project; comparing a date with an index on a timestamp column killed the index.

Function Calls on Indexed Columns: Queries like WHERE function(indexed_column) = 'value' prevent the index from being used. This is because the index was created on the original values of indexed_column, not on the values returned by the function.

-- Incorrect: Index will not be used
SELECT * FROM users WHERE lower(email) = 'mustafa@example.com';

-- Correct: Index can be used (if a functional index exists or if you use the email column directly)
-- A functional index can be created: CREATE INDEX idx_users_lower_email ON users (lower(email));
-- Or by changing the query: SELECT * FROM users WHERE email = 'mustafa@example.com';

If you frequently search on lower(email), creating a functional index is a solution.

Optimizer's Incorrect Estimation (Stale Statistics): The PostgreSQL optimizer uses table and index statistics to create query plans. If the data distribution has changed significantly and the statistics are not up-to-date (i.e., the ANALYZE operation has not run for a long time), the optimizer might choose an incorrect plan and avoid using an index. Running the command ANALYZE TABLE table_name; usually resolves this issue.
Very Small Tables: For very small tables (generally under a few thousand rows), the database might find it faster to perform a sequential scan than to use an index. The cost of reading the index and then going to the main table might be higher than reading the entire table from start to finish. This is an "it is what it is" situation; obsessing over indexes for small tables is usually unnecessary.
Complex Conditions and ORM Traps: Sometimes ORMs (Object-Relational Mappers) can generate complex and inefficient queries. Conditions combined with the OR operator might not be able to use multiple indexes simultaneously and can lead to a sequential scan. In such cases, it might be necessary to optimize the query manually or break it down using UNION ALL. The N+1 query problem, in particular, is a trap that ORMs frequently fall into, and while it doesn't directly affect the query plan, it results in a large number of small, expensive queries being executed. I discussed this in detail in my previous post [Related: ORM Traps and Solutions].

🔥 Caution: EXPLAIN ANALYZE Results

Seeing Bitmap Heap Scan in EXPLAIN ANALYZE output indicates that an index was used, but the main table still had to be accessed to retrieve the data.