1.5.2 Scan Nodes: SeqScan, IndexScan, BitmapScan

#postgres #database #internals #executor

A scan node is a leaf of the executor tree. Having no children, it reads rows straight from disk. That single phrase, "reads straight from disk," actually hides three different methods. To pull rows that pass the same condition from the same table, PG has three separate scan nodes: SeqScan, IndexScan, and BitmapScan. The job is identical, so why three nodes? One reason: depending on what fraction of the rows pass the condition, a different way of reading the heap pages turns out to be cheaper.

All three run on the same skeleton. The procedure a scan node follows to produce one row is to fetch a raw tuple through the access method, check the qual (the filter expression, like a WHERE condition), and if it passes, project out the needed columns and return it. The difference between the three nodes lies in the first step: where, and in what order, the next raw tuple comes from. SeqScan sweeps the whole table, IndexScan probes only the spots the index points to, and BitmapScan gathers those spots all at once and then reads them in an organized order.

Sequential scan: read every table page from first to last

The simplest method is the sequential scan. Without going through an index, it reads the table's heap pages in their physical storage order, from the first page to the last. From each page it pulls rows one by one, checks the qual, and sends up only those that pass. When a query like SELECT * FROM orders WHERE amount > 100 matches a large portion of the table, most pages have to be read anyway, so this method is fastest.

A sequential scan is fast for two reasons. First, there is no side work of searching an index. Second, reading pages in storage order makes it sequential I/O for the disk. Disks and the OS apply read-ahead, prefetching the next pages while reading consecutive blocks, which is far more efficient than hunting down scattered locations one at a time. This is also the basis on which the planner assigns different costs to sequential access and random access, and that cost model is covered in 1.4.

Inside PG, the per-row supply for a sequential scan is handled by SeqNext. It just calls the access method that sweeps the heap in order and hands back the next tuple, and in that same step it also judges whether the tuple it read is visible (visibility) under the current transaction's snapshot. The specific rules this visibility judgment follows are covered in Chapter 3 on transactions and MVCC. Here it is enough to know that this judgment happens together with every row the scan sends up.

Index scan: take an address and probe the heap one row at a time

When only a tiny fraction of the rows pass the condition, the story changes. Sweeping a ten-million-row table to find a single row, as in SELECT * FROM orders WHERE id = 42, is wasteful. This is where the index scan comes in.

To understand the index scan, you first need the idea of a TID. A TID (tuple identifier) is the physical address of a row inside the heap. It is a (page number, offset within the page) pair pointing to which page and which slot the row sits in, and the index's leaf holds exactly this TID alongside the key value. When you follow the index down to find a key, you get a TID saying "the row with this key lives at this heap address."

So one row of an index scan consists of two actions. First it pulls the next matching TID from the index, then it goes to the heap address that TID points to and reads the actual row. The row it reads is checked for snapshot visibility right there. Because one index key can point to several versions of a row (under MVCC, modifying a row can leave the old and new versions coexisting), the index handing back a TID does not guarantee that row is currently visible. So the visibility check has to happen at the heap, not at the index. The index scan repeats this "one TID from the index, one row from the heap, visibility check" bundle for as many matching keys as there are.

A weakness hides in this. The TIDs the index hands back follow the index key order, not the physical storage order in the heap. So the path of probing the heap jumps around between pages, becoming random access. With one or two matches it touches only a page or two and there is no problem, but as matches grow into the thousands, it jumps to scattered pages thousands of times, and may even revisit the same page repeatedly. If a page holds several matching rows that come out far apart in index order, that page is read again each time. Because random access costs more per page than sequential access, the advantage of the index scan fades quickly as matches grow. This is where "reading through an index is always faster" breaks down. If the planner's estimate of few matching rows misses and a large portion of the table actually matches, the index scan ends up slower than reading everything sequentially from the start, because it jumps across scattered pages without end.

In exchange, the index scan has an advantage the sequential scan lacks: it returns rows in index key order. If a query has ORDER BY id and happens to scan via the id index, the rows the index scan sends up are already sorted, so there is no need for a separate Sort node on top. This is why a query like SELECT * FROM orders ORDER BY id LIMIT 10 is fast. Instead of gathering everything to sort it, you just pull the first ten rows in index order and you are done.

A variant of the index scan is the index-only scan. If every column the query needs is contained in the index, the answer can be built by reading only the index tuples, without going to the heap. Visibility still has to be confirmed, but instead of reading the heap directly, it consults the Visibility Map (an auxiliary structure that marks, per page, whether all tuples on that page are visible to everyone). If the page is marked all-visible, it skips the heap visit; otherwise it goes to the heap for just that row.

Bitmap scan: collect all addresses, then read pages once in physical order

There is a middle range where the matching rows are too many for an index scan and too few for a sequential scan. Consider finding a single customer's few hundred orders, as in SELECT * FROM orders WHERE customer_id = 7. A few hundred is too much for a sequential scan that reads all ten million rows, yet probing scattered pages a few hundred times with an index scan, revisiting the same page over and over, is also inefficient. The bitmap scan is for this range.

The idea behind the bitmap scan is to turn the index scan's random access into something close to sequential. The key is to not read rows one by one as you follow the index, but to first gather all the matching TIDs, then sort those addresses into the heap's page order and read them. That way you read front to back in one direction instead of jumping between pages, and even if a page holds several matching rows, that page is read exactly once.

Two nodes split this work. The lower BitmapIndexScan sweeps the index and gathers all matching TIDs into a single bitmap. The bitmap is a data structure that marks "there is a match at this slot of this page" with a set bit. Unlike the other scan nodes that send their results up one row at a time, this node sweeps the index to the end in one shot and hands the completed bitmap up whole. The upper BitmapHeapScan takes that bitmap and visits the heap pages its set bits point to in physical order, emitting rows one by one.

This structure carries one constraint: the size of the bitmap. Recording the exact slot (offset within the page) of every matching row makes the representation precise, but when matches reach into the millions, that bitmap eats too much memory. So when the memory for the bitmap (work_mem) runs short, PG drops the precision by one level. What was marked per row becomes marked per page, recording only "there is a match somewhere on this page." The first, precise method is called exact, the second, coarse one is called lossy.

A lossy page comes with a price. Since you only know "there is a match on this page," after reading the page you have to check again whether the rows on it actually pass the condition. This re-check is called recheck. Rows from an exact page need no recheck because the bitmap pinned the exact slot, but rows from a lossy page have the original condition applied once more to filter them. The Heap Blocks: exact=N lossy=M in EXPLAIN output shows the counts of these two kinds of pages. A large lossy number signals that the bitmap did not fit in memory and gave up precision, and raising work_mem brings that many pages back to exact.

Another advantage of the bitmap approach is that it can combine several indexes. If WHERE customer_id = 7 AND status = 'paid' has a customer_id index and a status index each, two BitmapIndexScans can each build a bitmap, and those two can be overlapped with a bitwise AND to keep only the TIDs that satisfy both conditions. Unlike an index scan that narrows by one index and filters the rest at the heap, this merges both conditions at the bitmap stage before going to the heap. For an OR condition, the two bitmaps are unioned.

The figure above compares how the three scans visit the heap when matching rows are scattered across pages 2, 7, and 9 of the same table. The sequential scan reads every page from 1 to 10 in order, regardless of matches. The index scan touches only the matching pages, but following index key order it jumps to 7, 2, 9, and revisits 7 and 2 when matches come out apart (the numbers in the circles are the visit order). The bitmap scan sorts the same matching pages into the physical order 2, 7, 9 and reads each exactly once. The jagged path of the index scan straightened into a one-directional line in the bitmap scan is the heart of it.

What this means in practice

When an index scan is slow, the culprit is usually a missed estimate and random access. If in EXPLAIN ANALYZE the index scan's estimated rows and actual rows diverge sharply with the actual far higher, the planner chose the index scan expecting few matches while many actually matched, and it is jumping across pages without end. Stale statistics behind a missed estimate are a common cause, so run ANALYZE again first to check whether the estimate matches reality. When the estimate and the actual diverge widely and the matching rows are far more than expected, you can set enable_indexscan = off to force the planner to leave the index scan out of its choices, so you can weigh the cost of a BitmapScan or a SeqScan directly. If the estimate keeps missing even after ANALYZE, that column's data distribution is not being captured well enough by the default statistics target (default_statistics_target, default 100), and the root fix is to raise that column's statistics target with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS.