JoongHyuk Shin

Posted on Jun 3 • Edited on Jun 21

1.4.2 Cost Model: Sequential Scan Formula

#postgres #database #internals #planner

In 1.4.1, we saw the planner build multiple candidate paths over the same table and pick the one with the lowest cost to crystallize into a plan. Each candidate is represented by the Path data structure, and each Path carries a single number called total_cost. The planner compares candidates against this number. Where does this number come from? As the 1.4 chapter intro touched on, PostgreSQL has a cost model. It converts physical cost estimates (the cost of reading one disk page, the cost of processing one tuple, and so on) into arbitrary unit numbers, then sums these units per candidate path to produce total_cost. The rest of chapter 1.4 unpacks how this cost model computes costs for each kind of candidate. We start with the simplest one: a sequential scan, which reads a table from its first page to its last in order, examining every row.

Two units carry the cost throughout this section, and they are worth pinning down first: the page and the tuple.

A page is the fixed-size block PostgreSQL uses to store and read a table on disk. The default size is 8KB. The key point is that disk I/O happens in units of pages, not individual rows. Even if you need just one row, PostgreSQL reads the entire page that holds it into memory. That is why, in the cost model, disk cost is counted as "how many pages do we read", not "how many rows".

A tuple is one row as stored inside a page. "row" and "tuple" point at nearly the same thing from different angles: a row is the logical row, the thing a query returns or a WHERE clause filters, while a tuple is the physical form of that row as it sits inside a disk page, PostgreSQL's term for it. The cost model bothers to distinguish them because disk cost is counted in pages while CPU cost is counted in the number of tuples processed. The internal layout of pages and tuples (how tuples are arranged within a page, and so on) is covered in detail in chapter 2; here we only carry forward that the unit of disk reads is the page and the unit of CPU processing is the tuple.

Cost is a ratio, not an absolute value

PostgreSQL's cost has no units. It's not seconds, not milliseconds. If someone seeing EXPLAIN's cost=0.00..205.00 for the first time asks "does this mean 0.2 seconds?", that intuition misses. The number itself is in arbitrary units, and meaning comes only from ratios between candidates.

The reference point is seq_page_cost = 1.0. The cost of reading one sequential page is fixed at 1, and every other cost constant is defined as a ratio to this. Current PG 18 defaults are:

GUC	Default	Meaning
`seq_page_cost`	1.0	Cost of fetching one sequential page. Reference unit
`random_page_cost`	4.0	Cost of fetching one non-sequential (random) page
`cpu_tuple_cost`	0.01	CPU cost of the executor processing one tuple
`cpu_index_tuple_cost`	0.005	CPU cost of processing one index tuple
`cpu_operator_cost`	0.0025	CPU cost of one operator or function call
`parallel_tuple_cost`	0.1	Cost of a parallel worker passing one tuple to the leader
`parallel_setup_cost`	1000.0	Cost of setting up shared memory at the start of a parallel scan

Two things stand out from this table. First, reading one page costs 100 times more than processing one tuple on CPU, at default values. Second, a random page costs 4 times a sequential page.

What's the difference between "sequential" and "random"? Sequential page reading goes through page 1, page 2, page 3 in order, picking up adjacent pages one after another. Random page reading jumps around: page 1, then page 47, then page 12, where positions are scattered. On an HDD, the moving head takes time to seek to a different location, so random reads are much slower. On an SSD, with no moving head, the gap between the two narrows.

Another fact is baked into the ratio. The OS kernel reads ahead: when it fetches a page from disk, it guesses that the next pages will be needed soon and pulls them into memory in advance. This is called read-ahead, or prefetch. Sequential reading benefits from this guess because the next page really is the next page. Random reading can't benefit because there's no way to know which page comes next. The PG default ratio seq : random = 1 : 4 reflects an environment of HDD plus kernel read-ahead.

Why ratios instead of absolute values? Pin down absolute values, and environmental dependency gets baked into the model. Measure millisecond values on HDD, and SSD readings become inaccurate. A small table that fits entirely in memory is yet another case. Arbitrary-unit ratios keep the model environment-neutral, and operators adjust the ratios to match their environment. In SSD environments, dropping random_page_cost = 1.1 to nudge the planner toward index scans is a common tuning, and that's possible precisely because the model uses ratios.

In other engine codebases, I've seen attempts to express cost in actual millisecond units. The result is accurate but every constant has to be re-measured whenever the environment changes. PostgreSQL's arbitrary-unit ratio model gives up that precision in exchange for tunability and environment independence.

The formula is a sum of two axes

The function that computes sequential scan cost is cost_seqscan. We start from the simplest case: no join is involved (so no variable from another table is bound to the path; this case is called a non-parameterized path, and parameterized paths get full treatment in 1.4.5 on joins), and no parallel scan either (a mode where multiple worker processes split up reading the table's pages concurrently). The formula for this simple case reads:

disk_run_cost = seq_page_cost × pages
cpu_run_cost  = (cpu_tuple_cost + qual.per_tuple) × tuples
              + pathtarget.per_tuple × rows

startup_cost  = qual.startup + pathtarget.startup
total_cost    = startup_cost + cpu_run_cost + disk_run_cost

These two values, startup_cost and total_cost, are the two numbers EXPLAIN prints on each node as cost=0.00..225.00. The first number is startup_cost, the cost paid before the first row is emitted, and the second is total_cost, the cost through the last row. A sequential scan can start emitting rows the moment it reads the first page, so its startup is near zero, and it typically begins with cost=0.00...

Walking through the variables:

pages, tuples: the estimated number of pages this table occupies on disk and the estimated number of rows contained in them. Where these come from is the next section.
qual: the cost of evaluating the WHERE clause. PG separates qual cost into two parts. startup is the one-time setup cost incurred once. per_tuple is the cost of re-evaluating qual for each tuple. A simple comparison qual like id = 5 usually has only a small per_tuple value, with startup essentially zero.
pathtarget: the cost of evaluating the SELECT clause expressions. For SELECT a, b * 2 FROM ..., this means the cost of evaluating a and b * 2. Like qual, it splits into startup and per_tuple. With only simple column references it's near zero. Complex function calls drive per_tuple up.
rows: the estimated row count after applying WHERE. This value gets multiplied with pathtarget. Tuples filtered out by WHERE don't need SELECT expression evaluation.

Drawing the two cost axes:

Why split into two axes? Because the two costs consume physically different resources. Reading one page is disk I/O. Processing one tuple is CPU. The same row estimate carries different cost weights depending on whether the bottleneck is disk (large pages) or CPU (large tuples, for example a narrow table packed densely with tuples per page). Putting these in separate terms lets a single formula stay accurate across both environments.

Zooming in on cpu_run_cost, the two terms charge cost differently. The main term (cpu_tuple_cost + qual.per_tuple) is paid once per tuple across all tuples, regardless of survival. Every tuple has to be inspected to evaluate qual, which is how survival gets decided in the first place. The second term (pathtarget.per_tuple × rows) is paid only on surviving rows: SELECT expressions need to run only for tuples that passed WHERE. Even within CPU cost, there's a split between "scales with all tuples" and "scales with output row count".

This makes it obvious why tuples and pages are the two most important inputs to this formula. PG estimates table size before computing cost.

Good selectivity doesn't reduce disk cost

Looking at the formula again, an interesting asymmetry shows up. disk_run_cost multiplies pages. The main term of cpu_run_cost multiplies tuples. The output row count appears only in the second term of cpu_run_cost (pathtarget.per_tuple × rows). It doesn't appear in disk_run_cost or the main term.

What's the "output row count" here? It's the estimated number of rows that survive WHERE filtering. The fraction of rows that survive is called selectivity. Where selectivity itself comes from is covered in detail in 1.4.8. Selectivity of 1/10000 means one row out of ten thousand survives.

This asymmetry becomes clear with a concrete example. Suppose an accounts table with 10,000 rows in 100 pages, queried with WHERE id = 5. If id is unique, selectivity is 1/10000 and the result row count is 1. Even so, the sequential scan cost gets computed as:

disk_run_cost = 1.0 × 100 = 100
cpu_run_cost  ≈ (0.01 + 0.0025) × 10000  +  0 × 1  ≈ 125
total_cost    ≈ 225

Breaking down the two terms of cpu_run_cost:

(0.01 + 0.0025) × 10000: the main term. Scales with all tuples. The 0.0025 is the cost of evaluating the = operator from id = 5 (cpu_operator_cost).
0 × 1: the pathtarget term (pathtarget.per_tuple × rows). With only simple SELECT and one surviving row, this contribution is negligible.

Even though only 1 row survives, disk still pays the full 100. CPU also pays 125 because the main term scales with 10,000 tuples. EXPLAIN output shows the same pattern:

Seq Scan on accounts  (cost=0.00..225.00 rows=1 width=...)
  Filter: (id = 5)

rows=1 but cost=...225.00: this is the asymmetry made visible.

What this means in practice: a sequential scan is priced as "read every page even if only one row survives". WHERE doesn't reduce disk cost. The main CPU term scales with all tuples regardless of survival. The only part that shrinks with result row count is the second term (pathtarget.per_tuple × rows), and with simple column references that part is near zero, so its impact on total cost is negligible. Better selectivity doesn't make a meaningful dent in sequential scan cost.

How does this asymmetry shape the planner's decisions? Compare with an index scan candidate over the same table. An index scan follows the index first to locate only the pages with matching rows, so its disk cost scales with the number of pages it visits. When matching rows are sparsely scattered, only a few pages are touched. So in general, when selectivity is good (few output rows), index scan disk cost gets much smaller than sequential scan disk cost. Without this asymmetry, the planner would always favor sequential scan as cheaper, missing index scan opportunities on large tables.

Where do `pages` and `tuples` come from?

Where these two formula inputs come from determines the model's reliability. PG fills them in over two stages before computing cost.

The first stage is physical estimation. When the planner registers a table as a work target, in get_relation_info, it calls estimate_rel_size, which fills baserel->pages and baserel->tuples. Two data sources go into filling these:

pg_class.relpages, pg_class.reltuples. pg_class is the system catalog holding metadata for every table. The last ANALYZE (the stats-gathering command) or VACUUM (which reclaims dead tuples and refreshes stats alongside) leaves these values behind.
RelationGetNumberOfBlocks(): the current page count on disk, checked directly.

PG doesn't use just one or the other; it combines them. pages is simple. RelationGetNumberOfBlocks() returns the current page count on disk, used as-is. tuples is trickier. If rows have been added since the last stats, using pg_class.reltuples (the tuple count at stats time) directly would underestimate the truth. So PG takes the per-page tuple density from the stats snapshot (pg_class.reltuples / pg_class.relpages), multiplies it by current pages, and gets tuples. The assumption is that density stays roughly similar across pages, so a proportional growth in page count implies proportional growth in tuple count. For a brand-new table that has never been ANALYZE'd, PG estimates tuple width from attribute datatypes and derives density from that.

The second stage is logical estimation. Once physical estimation has filled pages and tuples, the planner calls set_baserel_size_estimates from set_plain_rel_size. That function uses clauselist_selectivity to compute the WHERE clause selectivity, then writes baserel->rows = tuples × selectivity. How selectivity itself is computed, based on column statistics in the pg_statistic catalog (histograms, most-common-values, n_distinct, and the like), is left for 1.4.8 to unpack.

The point: pages and tuples are estimates grounded in pg_class stats. If ANALYZE gets skipped and stats go stale, density gets wrong, and the whole cost comparison goes off. The classic pattern where the same SQL produces a good plan one day and a bad plan the next has its root here.

Parallel and disabled_nodes: two adjustments on top of the same formula

cost_seqscan applies two more adjustments on top of the formula above.

The first is the parallel scan adjustment. When multiple worker processes split up reading a table's pages (path->parallel_workers > 0), cpu_run_cost gets divided by the number of workers. CPU work is distributed across workers, so the burden each worker shoulders shrinks accordingly. disk_run_cost stays the same. The kernel's read-ahead already speeds up the disk side, so adding workers won't reduce I/O further; that assumption is baked in. The result row count is also re-pinned to the count one worker processes. With this adjustment, the parallel sequential scan candidate ends up with a different cost than the non-parallel one over the same table, so the two enter the comparison pool separately.

The second is the disabled node adjustment. As 1.4.1 covered, when an operator tries to turn off sequential scans with enable_seqscan = off, PG handles it by pinning disabled_nodes = 1 on the Path node. The dominance comparison rule looks at which side has fewer disabled_nodes before comparing cost. Older PG versions handled this differently. They added a large constant to the cost so that disabled nodes' costs became astronomical and lost out in comparison. But that inflated cost distorted comparisons among candidates elsewhere in the plan tree, so PG switched to using a separate count field. After computing the cost formula, cost_seqscan writes disabled_nodes = 0 if enable_seqscan is on, or disabled_nodes = 1 if it's off.

In practice

Run ANALYZE explicitly once right after bulk loads or deletes. Autovacuum, the background facility that refreshes statistics automatically, exists, but it fires only after hitting a threshold. If you need to run a heavy query right after loading a fresh table or right after a bulk INSERT/DELETE, stats can go stale during that gap and throw off cost estimates, so running ANALYZE explicitly once before the query is the safe move.

DEV Community

1.4.2 Cost Model: Sequential Scan Formula

Cost is a ratio, not an absolute value

The formula is a sum of two axes

Good selectivity doesn't reduce disk cost

Where do `pages` and `tuples` come from?

Parallel and disabled_nodes: two adjustments on top of the same formula

In practice

Top comments (0)

Cost is a ratio, not an absolute value

The formula is a sum of two axes

Good selectivity doesn't reduce disk cost

Where do pages and tuples come from?

Parallel and disabled_nodes: two adjustments on top of the same formula

In practice

Where do `pages` and `tuples` come from?