Introduction & Data Setup
Before we dive in, it is vital to understand the basic elementary blocks of PostgreSQL query plans. This has been covered in a separate blog post, and I highly encourage readers to go through first -
There are several node types in PostgreSQL query plans,
- Scans.
- Joins.
- Sort.
- Aggregates etc.,
In this article, we will dive deep into the Scan
node type, which is probably the most fundamental piece. Let's use the same fake_data
table that was used in the query plan blog. For ease of understanding, we are going to disable parallel scans. We will enable this a bit later and also understand parallel scans in a separate section.
SET max_parallel_workers_per_gather = 0;
Scans
There will be other node types in the examples below, but the focus will be only on the scan nodes and highlighted throughout the examples below.
Sequential Scans
Let's search for an id
with value 1000
.
This is the simplest of them all; if there is no index or a lesser number of rows, then the Planner resorts to scanning all of the rows present. You would/should never encounter Sequential Scans in a typical production system since they are very slow and get slower as the data increases. There are exceptions when the table is too small where an index is pointless and Sequential Scans are fast enough that you don't have to worry.
Index Scans
Let's create a simple BTree index on the id
column to speed up the above query.
CREATE INDEX id_idx ON fake_data USING BTREE(id)
After index creation, the Planner now uses the index to perform what is called an Index Scan.
As you can see, it is significantly faster (around 3500x) than a sequential scan. We should create appropriate indexes to speed up queries based on our access/query patterns. Here is an article that goes in-depth on what kind of indexes to create for different access patterns:
Slow Queries? 10X Query Performance with a Database Index
Derek Xiao for Arctype ・ Jan 26 ・ 5 min read
Index Only Scans
Index Only scans are very similar to index scans except that they scan only the indexes and do not touch the table data. This is possible only if the query contains the indexed column in both the SELECT
and WHERE
clauses. A slight tweak to the above query demonstrates this,
One thing to note is that PostgreSQL might sometimes choose to do Index Scan in place of Index Only Scan if the table is not vacuumed well, i.e., the Planner will realize that there could be some mismatch between the indexes and the actual table data. This can happen if some large delete/insert in/from the table and the indexes have not been updated. It is good practice to time the auto vacuum correctly and vacuum if there is any large operation like a table load. Index Only Scans can be very fast in terms of I/O and time since indexes are practically cached in shared_buffers
.
Bitmap Heap Scan & Bitmap Index Scan
Let's compare on a different column name
which is text content.
CREATE INDEX name_str_idx on fake_data USING BTREE(name)
EXPLAIN ANALYZE
SELECT
*
FROM
fake_data
WHERE
fake_data.name = 'David Bennett'
The Planner decided to go with Bitmap heap Scan even though there was a BTree index on the name
column. The reason is that index scans cause random I/O if there is no ordering to the rows (name
is text content). This is costly in rotational hard drives. To solve this, the Planner takes a two-stage approach. The Bitmap Index Scan constructs a data structure called a bitmap from the index present and represents the approximate location of pages in the disk and feeds it to the parent node, i.e., Bitmap Heap Scan, which then uses that to fetch the pages/rows.
There is a discussion in the PostgreSQL mailing list, which goes into depth on its workings, but on a high level, a Bitmap Heap Scan always works in tandem with Bitmap Index Scan to speed things up. For example, suppose there are a large number of matching rows. In that case, the Planner decides to do Bitmap Heap Scan + Bitmap Index Scan over a traditional Index Scan, in which the executor iterates the Bitmap instead of the Index itself for faster results.
Bitmap And & Bitmap Or
The Bitmap data structure is not only beneficial for single matches but can also be combined if there are two indexes.
Bitmap And is also similar,
Keep in mind that in certain situations Bitmap And
or Bitmap Or
won't work, and we will have to create composite indexes. But in many situations, the Planner can combine two individual indexes very efficiently.
Parallel Scans
Sequential Scans are the slowest of all the plans we have seen so far because there is nothing to optimize there. The Planner goes over the data sequentially and tries to find the result. PostgreSQL optimizes this further by adding parallelism in the queries. Let's simulate this by removing the index for the id
column and running the query. If you are using the same session as before, restart your database client for max_parallel_workers_per_gather
setting to reset to default.
The default workers configuration is 2
and hence two workers have been used to run the query. Workers are processes behind the scenes, which enables the executor to run the scans in parallel. Going deep into how parallelization works is perhaps a topic of another blog. Still, the general recommendation is to keep the workers equal to how many cores are there in the CPU for the most efficient results.
Conclusion
Hopefully, by now, you have a good idea of scans and why PostgreSQL uses specific types of scans in different places. Understanding these scan types will help users answer questions like,
- Why is not PostgreSQL using my index in a direct
WHERE
comparison? - Why is PostgreSQL not able to combine two indexes using Bitmap Scan?
- How to optimize our queries for Index Only Scan?
One thing to understand is that there are certain kinds of queries, such as Count
, Avg
and other aggregate queries, which always result in Sequential Scan because they have to scan the entirety of the table anyway for the result. Stay tuned for more articles on PostgreSQL node types.
Top comments (0)