Garvit Gupta

Posted on • Originally published at Medium

Learnings from a slow query analysis in PostgreSQL

Recently, I encountered a slow PostgreSQL query at work. The query was taking a long time and consuming a lot of I/O, even though it was relatively straightforward, and the reason for its slowness was not immediately apparent. After some analysis, we were able to optimise the query and reduce its worst-case runtime by a factor of ~1000. In this article, I will share my learnings.

For this article, all queries were run on PostgreSQL 14 on my laptop. Query times may differ on different hardware, but the relative latency improvements from the optimisations below will be similar.
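
If you want to reproduce the numbers, a quick sanity check of the server version may help (any reasonably recent PostgreSQL should show similar relative behaviour):

-- Confirm which PostgreSQL version is running
SELECT version();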

Query to be optimised:

It is a simple query with a WHERE clause on two columns that selects a single third column as output:

SELECT id FROM test WHERE col_1 = 13 AND col_2 = 41

Let’s seed some data first:

The actual production table contained around 150 million records. For our analysis, 50 million rows will be sufficient:

-- Create table
CREATE TABLE test (
 id SERIAL PRIMARY KEY NOT NULL,
 col_1 BIGINT NOT NULL,
 col_2 BIGINT NOT NULL
);

-- Randomly insert 50 million rows
INSERT INTO test (col_1, col_2) 
SELECT random() * 50, random() * 50 FROM generate_series(1, 50000000);

-- Check data distribution 
-- We will use col_1 = 13 and col_2 = 41 in our queries
SELECT col_1, col_2, COUNT(*) AS cnt FROM test
GROUP BY col_1, col_2
ORDER BY cnt DESC;

Output:
col_1  col_2   cnt
13     41      20524
22     1       20508
25     14      20412
41     2       20384
34     45      20378
6      8       20377

All the query plans below exclude the network time taken by data to travel to and from the database server.

Let’s Start Analysing and Optimising the Query:

Let’s first run the query as-is and look at its query plan.
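
All the plans shown in this article were captured with EXPLAIN (ANALYZE, BUFFERS): ANALYZE actually executes the query and reports real timings, and BUFFERS adds counts of pages read:

-- Capture the actual execution plan along with timing and buffer statistics
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM test WHERE col_1 = 13 AND col_2 = 41;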

[Image: EXPLAIN (ANALYZE, BUFFERS) output showing a Sequential Scan taking ~6 seconds]

This query plan contains the “actual steps” taken by the database engine while executing the query.

There is a lot of information in the above query plan, but we will focus on only a few things:

  1. The total query execution time is ~6 seconds (which is huge).
  2. The query returns 20,107 rows in total.
  3. To match the condition ((col_1 = 13) AND (col_2 = 41)), the strategy used by the database engine is a “Sequential Scan”.
    • In a sequential scan, the engine goes through every row of the table one by one and discards the rows that do not match the query condition. That means every row must be visited at least once.
  4. All 50 million rows were scanned, but more than 49 million of them were discarded. This is the major inefficiency visible in the plan.

We need to somehow reduce the number of rows scanned during query execution. The general solution is to create an index on the columns used in the WHERE clause. So let’s create an index on col_1 and col_2 and re-analyse the query:

CREATE INDEX CONCURRENTLY test_col_1_col_2_idx ON test(col_1, col_2);

Re-analyse the query:

[Image: EXPLAIN (ANALYZE, BUFFERS) output showing an Index Scan taking ~1 second]

  1. Now the engine decided to do an “Index Scan” using the index that we created in the previous step.
  2. The query time reduced significantly, but it is still huge (~1 sec) for the amount of data we are working with.
  3. After the index scan, a large number of reads is happening (19,899), which results in the high query time. Reasons for the high number of reads:
    • We created the index on col_1 and col_2, but we need id in the query response.
    • Using the index, the engine was able to find the location of the rows to be accessed without scanning the whole table. But to actually return the id, each row had to be accessed using the address stored in the index (because the index only contains information about col_1 and col_2, not about id).
    • That’s why the number of reads is almost the same as the number of rows returned by the query (the sketch after this list demonstrates this).
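
To confirm that the extra reads come from fetching id, try selecting only the indexed columns. With the (col_1, col_2) index in place, the engine can then answer the query from the index alone (assuming the visibility map is reasonably up to date; more on that below):

-- Selecting only indexed columns allows an Index Only Scan,
-- avoiding the per-row heap accesses needed to fetch id
EXPLAIN (ANALYZE, BUFFERS)
SELECT col_1, col_2 FROM test WHERE col_1 = 13 AND col_2 = 41;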

Our query, however, needs id. In order to avoid the large number of reads, we need to somehow avoid reading every filtered row from the table.

Postgres allows keeping additional information with every entry of an index using the INCLUDE clause. These columns are not indexed: a WHERE clause on these columns will not use the index.
Reference docs: see the CREATE INDEX page in the PostgreSQL documentation.

Let’s try including the id column in the index and see if it improves the performance:

-- Drop the previously created index
DROP INDEX test_col_1_col_2_idx;

-- Create a new index that includes id
CREATE INDEX CONCURRENTLY test_col_1_col_2_idx_2 ON test(col_1, col_2) INCLUDE (id);

Rerun the query:

[Image: EXPLAIN (ANALYZE, BUFFERS) output showing an Index Only Scan taking ~7 ms]

  1. After including id in the index, the query time reduced drastically, to just ~7 ms.
  2. The number of reads also reduced from ~19K to just ~110.
  3. The engine switched from an “Index Scan” to an “Index Only Scan”, meaning that scanning the index alone was sufficient to execute the complete query; there was no need to access the table rows at all (see the note on heap fetches below).
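
One caveat from the index-only scan docs (linked in the additional reads below): an “Index Only Scan” may still fetch heap pages when the visibility map is stale, which shows up as “Heap Fetches” in the plan. On a freshly loaded table like ours, a manual VACUUM typically brings that number down to (or near) zero:

-- VACUUM updates the visibility map, letting index-only scans skip heap pages
VACUUM test;

-- Re-check the plan; "Heap Fetches" should now be at or near 0
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM test WHERE col_1 = 13 AND col_2 = 41;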

A few points to note:

  • Avoid including too much extra data in the index, as it will bloat the index and increase its size significantly (a quick way to measure this is shown after this list).
  • The columns added via the INCLUDE clause are not indexed, so a WHERE condition on these columns will not use the index.
  • This approach is helpful only if the query returns a large number of rows. If the query returns only a few rows, say 1 or 2, there won’t be any significant performance gain from this approach.
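
To quantify the bloat point above: pg_relation_size reports the on-disk size of an index, and pg_size_pretty makes the number readable. A quick comparison sketch (the plain two-column index was dropped earlier, so it is recreated here just for the comparison):

-- On-disk size of the covering index
SELECT pg_size_pretty(pg_relation_size('test_col_1_col_2_idx_2'));

-- For comparison, recreate the plain index and check its size
CREATE INDEX CONCURRENTLY test_col_1_col_2_idx ON test(col_1, col_2);
SELECT pg_size_pretty(pg_relation_size('test_col_1_col_2_idx'));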

Conclusion:

When writing a SELECT query with a WHERE clause, index design depends not just on the columns in the WHERE clause but also on the columns that the query selects as its final output.

Additional reads that helped me in the analysis:

  1. A good explanation of how to read a query plan: https://www.postgresql.org/docs/current/using-explain.html#USING-EXPLAIN-ANALYZE.
  2. How to read the buffers output: https://www.postgresql.org/docs/current/sql-explain.html#:~:text=Include information on buffer,defaults to FALSE.
  3. Actual time vs. cost in the output of EXPLAIN ANALYZE, and why they should not be compared: https://www.postgresql.org/docs/current/using-explain.html#USING-EXPLAIN-ANALYZE:~:text=Note that the,unusual in practice.
  4. Data transmission costs are not included in the output of EXPLAIN ANALYZE: https://www.postgresql.org/docs/current/using-explain.html#USING-EXPLAIN-ANALYZE:~:text=First%2C since no output rows are delivered to the client%2C network transmission costs and I/O conversion costs are not included.
  5. Index-only scans: https://www.postgresql.org/docs/current/indexes-index-only-scans.html.
  6. Docs for the pg_buffercache extension (helps in seeing the contents of the shared buffer cache): https://www.postgresql.org/docs/current/pgbuffercache.html.
  7. Vacuum: https://www.postgresql.org/docs/current/sql-vacuum.html.
  8. Auto-vacuum: https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM.
  9. Visibility map: https://www.postgresql.org/docs/current/storage-vm.html.
  10. How an index-only scan uses the visibility map: https://www.postgresql.org/docs/current/indexes-index-only-scans.html#:~:text=But there is,all the time.
