JoongHyuk Shin

Posted on Jun 4 • Edited on Jun 21

1.4.3 Cost Model: The Index Scan Formula

#postgres #database #internals #planner

In 1.4.2, the cost of a sequential scan had two parts: the disk cost of reading every page in the table, and the CPU cost of processing every row read. An index scan reads only the pages it matches, so when the result row count is low it generally reads less and costs less than a sequential scan. The key word is "generally." Even with few result rows, an index scan can lose to a sequential scan. Whether it wins depends not only on how many pages it reads, but also on how those pages are laid out on disk. This section is about how an index scan's cost is computed, and when it loses to a sequential scan.

The difference between the two starts with what each one reads over the same table. A sequential scan reads every page of the table in order; an index scan uses the index to read only the pages that match.

Going by the picture alone, an index scan looks unconditionally cheaper, since it reads less. But there are two traps. First, "only the matched pages" is not always a single page. The picture shows id = 5, which finds one row, so one page; but when many rows match, that many pages get read. Second, the cost of reading one of those pages is higher than in a sequential scan. Because the pages sit at scattered locations the scan jumps to directly, a random access, it costs up to 4 times what a sequential access (reading neighboring pages back to back) does.

So it comes down to weighing "the savings from reading fewer pages" against "the penalty of each page costing more," and which way that scale tips is what this section is about.

An index scan's cost is the sum of three costs

Fetching one row through an index scan takes two disk accesses. First it descends the index structure to find where in the table the matching row lives, then it reads the table page at that location to pull out the actual row. Each of these two accesses has a cost, and adding the CPU cost of processing the fetched row makes three costs in total.

The second access (reading the table page) is needed because the query asks for columns the index does not hold. If a query needs only columns that live in the index (for example, SELECT id ... WHERE id > 5 with an index on id), PostgreSQL can skip the table and read only the index, an index-only scan. The conditions under which that is possible (the visibility map) are covered in Chapter 2. This section deals with the ordinary index scan that must also read the table.

The function that computes index scan cost is cost_index. For the simplest case, with no join involved and no parallel scan (a mode where several worker processes split a table's pages and read them concurrently), the structure unfolds like this:

total_cost = index_access_cost      (① the cost of reading the index)
           + heap_fetch_cost         (② the disk cost of pulling rows from the table)
           + cpu_cost                (③ the CPU cost of processing fetched rows)

Each of the three, line by line:

① Index access cost: the I/O of reading the index pages while descending the structure, plus the CPU cost of examining the index entries there. This value is computed differently for each index type (btree, hash, GiST, and so on), so PostgreSQL leaves it to a per-access-method (AM) callback. We look at this cost separately at the end of the section.
② Heap fetch cost: the disk I/O of reading the table page at the location the index reported ("this row is on table page 47"). Here "heap" is PostgreSQL's term for the body where a table's rows actually live. It has nothing to do with the heap data structure from heapsort or priority queues; it means something closer to "the unsorted area where the table is stored" (as opposed to the index, a separate structure). This cost is the star of the section. Of the three, it is the term that varies most with circumstances, so whether an index scan beats a sequential scan is mostly decided here.
③ CPU cost: the per-row processing cost for each row fetched from the heap. On top of the base cpu_tuple_cost, it adds the cost of any condition the index could not filter and the table must recheck. For example, with WHERE id > 5 AND name LIKE '%kim%' and an index only on id, the index filters id > 5 but cannot judge name LIKE '%kim%' on its own. So name has to be rechecked on each row fetched from the heap, and that recheck cost lands here. This value is multiplied by the number of rows fetched. It is the same structure as the sequential scan's CPU cost.

Seeing which stage of the actual flow each of the three costs attaches to:

① and ③ are similar in character to the sequential scan's cost terms. The real difference is in ②, so we start there.

Heap fetch cost is decided by correlation

Once the index reports a matching row's location, PostgreSQL must read the table page at that location. The key question is "how scattered are those pages on disk?" Reading neighbors in order costs seq_page_cost (1.0) per page; jumping around scattered locations costs random_page_cost (4.0). For the same number of pages, the cost can differ by 4 times depending on how they are laid out.

This "how well the order the index points to matches the order the rows are actually stored on disk" is called correlation. First, note that the table (heap) is not itself sorted. Rows are generally piled onto disk in the order they were inserted. So correlation is not about "is the table sorted" but about "how well, by accident or by design, the insertion order lines up with the index column's order." Two extremes make it clear.

Perfect correlation: insertion order matches the index column's order. For example, a table where an auto-incrementing id was inserted in order has small-id rows on early pages and large-id rows on later pages. Following the id index, the table pages come out in order: page 1, page 2, page 3. Only the first page is a random access; the rest are all sequential.
Zero correlation: insertion order is unrelated to the index column's order. A table where id values were inserted in random order, or where updates and deletes have scattered the rows, looks like this. Following the id index, the table pages come out scattered, like page 47, page 3, page 91, so every page read is a random access.

cost_index computes the cost at each of these two extremes, then settles on a value between them according to the actual correlation.

min_IO_cost (perfect correlation)  = random_page_cost + (pages_fetched - 1) × seq_page_cost
max_IO_cost (zero correlation)     = pages_fetched × random_page_cost

csquared    = correlation²
heap_fetch_cost = max_IO_cost + csquared × (min_IO_cost - max_IO_cost)

Reading the formula: when correlation is 0, csquared is 0 too, so max_IO_cost (all random) is used as is; when correlation is 1, csquared is 1, so it becomes min_IO_cost (nearly sequential). The value in between is set by the square of the correlation. The higher the correlation, the closer it gets to min_IO_cost. The square is used to match the observation that even a slight drop in correlation produces quite a few page jumps in practice. A comment in the PostgreSQL source even leaves a question mark over whether this squared scheme is appropriate.

One thing about the zero-correlation pages_fetched (the number of table pages to read): it is not simply "the number of result rows." When rows are randomly scattered across many pages, even fetching a few rows touches many pages. PostgreSQL estimates this page count with a formula proposed by Mackert and Lohman in 1989. In short, it takes the total page count of the table, the number of rows to fetch, and the number of pages that can fit in the memory cache (effective_cache_size), and produces "the actual number of disk page fetches, accounting for cache effects." The more rows fetched, the closer it converges to the table's page count.

Same selectivity, different cost

To see how this computation decides actual plan selection, here is a concrete example. Suppose an accounts table has 10,000 rows stored across 100 pages on disk, with a unique index on the id column.

First, a single-row lookup WHERE id = 5. The selectivity (the fraction of all rows that the WHERE clause keeps) is 1/10000, and the result is 1 row. Finding one row through the index means the table page holding it is just 1 page. With only one page, correlation is irrelevant: it is one random access either way.

② heap_fetch_cost ≈ 4.0          (1 page × random_page_cost)
① index_access_cost ≈ 4          (a few index leaf pages + tree descent)
③ cpu_cost ≈ 0.01                (process 1 row)
total ≈ 8

The same query run as a sequential scan is total ≈ 225 (read all 100 pages and check the id of all 10,000 rows). The index scan is about 8. For a single-row lookup with an extremely low result count, the index scan is overwhelmingly cheaper. This is half the answer to the question posed earlier: when the result count is low, the index scan's heap fetch touches only a few pages, so it is cheaper than the sequential scan.

But there is another half. Consider a range query WHERE id BETWEEN 1 AND 2000. Selectivity is 0.2, and the result is 2,000 rows. This time correlation decides the cost.

When the table was loaded in id order so that correlation is near 1, the 2,000 rows the index points to are clustered on the first 20 pages of the table. min_IO_cost applies: only the first page is random, the other 19 are sequential.

② heap_fetch_cost ≈ 4.0 + 19 × 1.0 = 23      (correlation ≈ 1)
③ cpu_cost ≈ 0.01 × 2000 = 20
total ≈ 23 + 20 + (index cost) ≈ 50 or so

For the same query, when the table was loaded in random order so that correlation is near 0, those 2,000 rows are scattered across all 100 pages. Fetching 2,000 rows at random effectively touches all 100 pages of the table, each a random access.

② heap_fetch_cost ≈ 100 × 4.0 = 400          (correlation ≈ 0)
③ cpu_cost ≈ 0.01 × 2000 = 20
total ≈ 400 + 20 + (index cost) ≈ 420 or more

Run as a sequential scan, this range query costs about 250 (BETWEEN has two conditions, so each row gets one more comparison, costing slightly more CPU than the single-row lookup's 225). Compared against this 250, the same range query, depending on correlation, either wins big over the sequential scan (about 50) or loses big to it (about 420). Selectivity is 0.2 in both cases, yet the result is the opposite. This is the other half of the question. Even with good selectivity, bad correlation can make an index scan lose to a sequential scan, because reading the whole table by random access is more expensive than sweeping it once sequentially.

The planner shows the result of this decision through EXPLAIN. In the second, bad-correlation case, the planner compares the index scan candidate (about 420) against the sequential scan candidate (about 250) and picks the cheaper sequential scan, printing this:

Seq Scan on accounts  (cost=0.00..250.00 rows=2000 width=...)
  Filter: ((id >= 1) AND (id <= 2000))

Had correlation been good, the same query would resolve to an index scan and an Index Scan node would appear. The point is that the planner does not use the index just because it exists; it is the result of a cost comparison that factors in correlation.

Look at a real workload and this pattern shows up often. A log-style table that accumulates in time order has naturally high correlation on its timestamp column, so time-range queries resolve nicely to index scans. A table where random updates and inserts are mixed in has low correlation, and you see the planner drop to a sequential scan even for a range query of the same selectivity. The root of the common puzzle "I added an index but it isn't being used" is usually right here.

Reading the index is not free either

So far we have focused on ② (heap fetch). Lastly, ①, the cost of reading the index itself. As mentioned, this computation is handled by a per-type callback, and most of them, btree included, go through the shared function genericcostestimate.

Index access cost also splits into page I/O and CPU.

Index page I/O: what the index mostly reads are leaf pages. A leaf page is the bottom layer of the index tree, where the actual key values and the location info ("the row for this key is on table page N") are stored. The more keys match, the more leaf pages are read, and that page count is multiplied by random_page_cost. The root and middle-layer pages, being at the top of the tree, are few and used so often that they are almost always sitting in the memory cache, so the cost calculation ignores them.
Index CPU: for each index entry visited, it adds cpu_index_tuple_cost (0.005) and the qual operator cost (cpu_operator_cost, 0.0025) and multiplies.

Where does the correlation value that drove ②'s heap fetch cost come from? That value too is filled in by the per-AM callback. genericcostestimate here is a shared computation routine not tied to any particular index type. Each AM's callback (btree, hash, GiST, and so on) calls this shared routine to compute the common part of the index cost, then adds its own type-specific values on top if needed. This shared routine leaves correlation at 0 to begin with. Being a generic computation that does not look at column statistics, it has no way to know the real correlation, so it sets the worst case (all random) as the default. Btree overwrites that with the real value: it reads the actual correlation that ANALYZE recorded in pg_statistic (the catalog holding per-column statistics) and passes it along (as is for a single-column index, multiplied by 0.75 to stay conservative for a multi-column one). So ②'s correlation adjustment really works only for indexes like btree that fill in the real value. An index that leaves correlation at 0 always has its heap fetch priced at max_IO_cost (all random).

The bitmap scan's cost

The index scan we have seen so far takes one row location (TID) from the index, goes straight to that table page to fetch the row, and moves on to the next location. There is another way that uses the same index but reads the table differently: the bitmap scan. A bitmap scan does not read each matching location one by one as it goes; it gathers all of them first, then sorts those locations by table page number and reads each page once, front to back. It takes the same index as input, but the order in which it reads the table differs, so it has its own cost model (cost_bitmap_heap_scan). Three things differ from the index scan we have seen so far.

First, it is less sensitive to correlation. An index scan's heap fetch cost moved between random and sequential by the square of correlation. A bitmap scan sorts the locations by page number before reading, so no matter what order the rows happen to be loaded in the table (good or bad correlation), it reads in one direction as the page number rises. So the bitmap's per-page cost is set not by correlation but by "what fraction of the table's pages are being read." When the pages read make up a large fraction of the table, the per-page cost converges toward seq_page_cost; when only a small fraction is read, it stays closer to random_page_cost. Even in a situation where correlation near 0 would push an index scan up to max_IO_cost (all random), the bitmap scan sorts first, so its cost does not climb that high.

Second, it does not re-read the same page. An index scan re-read a page each time, even when it held several matching rows, if those rows came out far apart in index key order. A bitmap scan gathers and sorts the locations first, so however many matches a page holds, it is read exactly once. As a result, even as the matching row count grows, the number of pages read does not exceed the table's total page count. This contrasts with an index scan, whose cost inflated quickly through repeated visits as matches grew.

Third, the startup cost before the first row is high. An index scan can fetch the first row as soon as it receives one location from the index. A bitmap scan must gather all matching locations first in order to sort them, so it only starts reading the first table page after sweeping the index to the end once. That puts the entire cost of sweeping the index onto startup, so a bitmap scan's startup cost is higher than an index scan's. This is why bitmap loses ground on queries like LIMIT that only need the first few rows quickly.

Putting the three together shows the cases where the bitmap scan is favorable: the middle range where the matching rows are too many to handle with an index scan (random jumps and repeated visits blow up) yet too few to warrant a sequential scan that reads the whole table. In this range, the benefit of narrowing locations through the index is still alive, plus the benefit of the sort turning random access into something close to sequential reads. The planner computes the cost of both the index scan and the bitmap scan over the same index, and when the matching row count falls into this middle range, it picks the bitmap.

What this means in practice

First, use pg_stats to check whether a missing index scan is correlation's fault. If you suspect the correlation problem shown above, the correlation column of the pg_stats view lets you read that column's physical sortedness directly. Close to 1 favors an index scan; close to 0 hurts on range queries.

Second, you can revive an index scan for a frequently range-queried column by adjusting load order. Because correlation is determined by physical load order. The same data and the same index can have different correlation depending on the order they are written to disk. A table that accumulates in time order has naturally high correlation on its timestamp. If you want to raise the correlation of a frequently range-queried column, the CLUSTER command physically reorders the table into that index's order. Note that CLUSTER only sorts once, and later updates and inserts scatter it again, so it has to be re-run periodically.

Third, tuning random_page_cost on SSDs acts directly on index scans. Since the heap fetch's max_IO_cost is multiplied by random_page_cost, lowering this value from 4.0 to 1.1 sharply drops the cost of uncorrelated index scans. Random access on an SSD is not as expensive as on an HDD, so leaving the default at 4.0 makes bad-correlation index scans look unfairly expensive and the planner often falls back to sequential scans. This is why lowering the value to revive index scan opportunities is a common tuning move in SSD-based production.