In 1.4.1, we saw the planner build multiple candidate paths over the same table and pick the one with the lowest cost to crystallize into a plan. Each candidate is represented by the Path data structure, and each Path carries a single number called total_cost. The planner compares candidates against this number. Where does this number come from? As the 1.4 chapter intro touched on, PostgreSQL has a cost model. It converts physical cost estimates (the cost of reading one disk page, the cost of processing one tuple, and so on) into arbitrary unit numbers, then sums these units per candidate path to produce total_cost. The rest of chapter 1.4 unpacks how this cost model computes costs for each kind of candidate. We start with the simplest one: a sequential scan, which reads a table from its first page to its last in order, examining every row.
Cost is a ratio, not an absolute value
PostgreSQL's cost has no units. It's not seconds, not milliseconds. If someone seeing EXPLAIN's cost=0.00..205.00 for the first time asks "does this mean 0.2 seconds?", that intuition misses. The number itself is in arbitrary units, and meaning comes only from ratios between candidates.
The reference point is seq_page_cost = 1.0. The cost of reading one sequential page is fixed at 1, and every other cost constant is defined as a ratio to this. Current PG 18 defaults are:
| GUC | Default | Meaning |
|---|---|---|
seq_page_cost |
1.0 | Cost of fetching one sequential page. Reference unit |
random_page_cost |
4.0 | Cost of fetching one non-sequential (random) page |
cpu_tuple_cost |
0.01 | CPU cost of the executor processing one tuple |
cpu_index_tuple_cost |
0.005 | CPU cost of processing one index tuple |
cpu_operator_cost |
0.0025 | CPU cost of one operator or function call |
parallel_tuple_cost |
0.1 | Cost of a parallel worker passing one tuple to the leader |
parallel_setup_cost |
1000.0 | Cost of setting up shared memory at the start of a parallel scan |
Two things stand out from this table. First, reading one page costs 100 times more than processing one tuple on CPU, at default values. Second, a random page costs 4 times a sequential page.
What's the difference between "sequential" and "random"? Sequential page reading goes through page 1, page 2, page 3 in order, picking up adjacent pages one after another. Random page reading jumps around: page 1, then page 47, then page 12, where positions are scattered. On an HDD, the moving head takes time to seek to a different location, so random reads are much slower. On an SSD, with no moving head, the gap between the two narrows.
Another fact is baked into the ratio. The OS kernel reads ahead: when it fetches a page from disk, it guesses that the next pages will be needed soon and pulls them into memory in advance. This is called read-ahead, or prefetch. Sequential reading benefits from this guess because the next page really is the next page. Random reading can't benefit because there's no way to know which page comes next. The PG default ratio seq : random = 1 : 4 reflects an environment of HDD plus kernel read-ahead.
Why ratios instead of absolute values? Pin down absolute values, and environmental dependency gets baked into the model. Measure millisecond values on HDD, and SSD readings become inaccurate. A small table that fits entirely in memory is yet another case. Arbitrary-unit ratios keep the model environment-neutral, and operators adjust the ratios to match their environment. In SSD environments, dropping random_page_cost = 1.1 to nudge the planner toward index scans is a common tuning, and that's possible precisely because the model uses ratios.
In other engine codebases, I've seen attempts to express cost in actual millisecond units. The result is accurate but every constant has to be re-measured whenever the environment changes. PostgreSQL's arbitrary-unit ratio model gives up that precision in exchange for tunability and environment independence.
The formula is a sum of two axes
The function that computes sequential scan cost is cost_seqscan. We start from the simplest case: no join is involved (so no variable from another table is bound to the path; this case is called a non-parameterized path, and parameterized paths get full treatment in 1.4.5 on joins), and no parallel scan either (a mode where multiple worker processes split up reading the table's pages concurrently). The formula for this simple case reads:
disk_run_cost = seq_page_cost × pages
cpu_run_cost = (cpu_tuple_cost + qual.per_tuple) × tuples
+ pathtarget.per_tuple × rows
startup_cost = qual.startup + pathtarget.startup
total_cost = startup_cost + cpu_run_cost + disk_run_cost
Walking through the variables:
-
pages,tuples: the estimated number of pages this table occupies on disk and the estimated number of rows contained in them. Where these come from is the next section. -
qual: the cost of evaluating the WHERE clause. PG separates qual cost into two parts.startupis the one-time setup cost incurred once.per_tupleis the cost of re-evaluating qual for each tuple. A simple comparison qual likeid = 5usually has only a smallper_tuplevalue, withstartupessentially zero. -
pathtarget: the cost of evaluating the SELECT clause expressions. ForSELECT a, b * 2 FROM ..., this means the cost of evaluatingaandb * 2. Like qual, it splits intostartupandper_tuple. With only simple column references it's near zero. Complex function calls driveper_tupleup. -
rows: the estimated row count after applying WHERE. This value gets multiplied withpathtarget. Tuples filtered out by WHERE don't need SELECT expression evaluation.
Drawing the two cost axes:
total_cost
=
┌──────────── disk_run_cost ────────────┬─────────────── cpu_run_cost ───────────────┐
│ seq_page_cost × pages │ (cpu_tuple_cost + qual.per_tuple) × tuples │
│ │ + pathtarget.per_tuple × rows │
└───────────────────────────────────────┴────────────────────────────────────────────┘
+ startup_cost
(usually near zero)
Why split into two axes? Because the two costs consume physically different resources. Reading one page is disk I/O. Processing one tuple is CPU. The same row estimate carries different cost weights depending on whether the bottleneck is disk (large pages) or CPU (large tuples, for example a narrow table packed densely with tuples per page). Putting these in separate terms lets a single formula stay accurate across both environments.
Zooming in on cpu_run_cost, the two terms charge cost differently. The main term (cpu_tuple_cost + qual.per_tuple) is paid once per tuple across all tuples, regardless of survival. Every tuple has to be inspected to evaluate qual, which is how survival gets decided in the first place. The second term (pathtarget.per_tuple × rows) is paid only on surviving rows: SELECT expressions need to run only for tuples that passed WHERE. Even within CPU cost, there's a split between "scales with all tuples" and "scales with output row count".
This makes it obvious why tuples and pages are the two most important inputs to this formula. PG estimates table size before computing cost.
Good selectivity doesn't reduce disk cost
Looking at the formula again, an interesting asymmetry shows up. disk_run_cost multiplies pages. The main term of cpu_run_cost multiplies tuples. The output row count appears only in the second term of cpu_run_cost (pathtarget.per_tuple × rows). It doesn't appear in disk_run_cost or the main term.
What's the "output row count" here? It's the estimated number of rows that survive WHERE filtering. The fraction of rows that survive is called selectivity. Where selectivity itself comes from is covered in detail in 1.4.6. Selectivity of 1/10000 means one row out of ten thousand survives.
This asymmetry becomes clear with a concrete example. Suppose an accounts table with 10,000 rows in 100 pages, queried with WHERE id = 5. If id is unique, selectivity is 1/10000 and the result row count is 1. Even so, the sequential scan cost gets computed as:
disk_run_cost = 1.0 × 100 = 100
cpu_run_cost ≈ (0.01 + 0.0025) × 10000 + 0 × 1 ≈ 125
total_cost ≈ 225
Breaking down the two terms of cpu_run_cost:
-
(0.01 + 0.0025) × 10000: the main term. Scales with all tuples. The0.0025is the cost of evaluating the=operator fromid = 5(cpu_operator_cost). -
0 × 1: the pathtarget term (pathtarget.per_tuple × rows). With only simple SELECT and one surviving row, this contribution is negligible.
Even though only 1 row survives, disk still pays the full 100. CPU also pays 125 because the main term scales with 10,000 tuples. EXPLAIN output shows the same pattern:
Seq Scan on accounts (cost=0.00..225.00 rows=1 width=...)
Filter: (id = 5)
rows=1 but cost=...225.00: this is the asymmetry made visible.
What this means in practice: a sequential scan is priced as "read every page even if only one row survives". WHERE doesn't reduce disk cost. The main CPU term scales with all tuples regardless of survival. The only part that shrinks with result row count is the second term (pathtarget.per_tuple × rows), and with simple column references that part is near zero, so its impact on total cost is negligible. Better selectivity doesn't make a meaningful dent in sequential scan cost.
How does this asymmetry shape the planner's decisions? Compare with an index scan candidate over the same table. An index scan follows the index first to locate only the pages with matching rows, so its disk cost scales with the number of pages it visits. When matching rows are sparsely scattered, only a few pages are touched. So when selectivity is good (few output rows), index scan disk cost gets much smaller than sequential scan disk cost. Without this asymmetry, the planner would always favor sequential scan as cheaper, missing index scan opportunities on large tables.
Where do pages and tuples come from?
Where these two formula inputs come from determines the model's reliability. PG fills them in over two stages before computing cost.
The first stage is physical estimation. When the planner registers a table as a work target, in get_relation_info, it calls estimate_rel_size, which fills baserel->pages and baserel->tuples. Two data sources go into filling these:
-
pg_class.relpages,pg_class.reltuples.pg_classis the system catalog holding metadata for every table. The lastANALYZE(the stats-gathering command) orVACUUM(which reclaims dead tuples and refreshes stats alongside) leaves these values behind. -
RelationGetNumberOfBlocks(): the current page count on disk, checked directly.
PG doesn't use just one or the other; it combines them. pages is simple. RelationGetNumberOfBlocks() returns the current page count on disk, used as-is. tuples is trickier. If rows have been added since the last stats, using pg_class.reltuples (the tuple count at stats time) directly would underestimate the truth. So PG takes the per-page tuple density from the stats snapshot (pg_class.reltuples / pg_class.relpages), multiplies it by current pages, and gets tuples. The assumption is that density stays roughly similar across pages, so a proportional growth in page count implies proportional growth in tuple count. For a brand-new table that has never been ANALYZE'd, PG estimates tuple width from attribute datatypes and derives density from that.
The second stage is logical estimation. Once physical estimation has filled pages and tuples, the planner calls set_baserel_size_estimates from set_plain_rel_size. That function uses clauselist_selectivity to compute the WHERE clause selectivity, then writes baserel->rows = tuples × selectivity. How selectivity itself is computed, based on column statistics in the pg_statistic catalog (histograms, most-common-values, n_distinct, and the like), is left for 1.4.6 to unpack.
The point: pages and tuples are estimates grounded in pg_class stats. If ANALYZE gets skipped and stats go stale, density gets wrong, and the whole cost comparison goes off. The classic pattern where the same SQL produces a good plan one day and a bad plan the next has its root here.
Parallel and disabled_nodes: two adjustments on top of the same formula
cost_seqscan applies two more adjustments on top of the formula above.
The first is the parallel scan adjustment. When multiple worker processes split up reading a table's pages (path->parallel_workers > 0), cpu_run_cost gets divided by the number of workers. CPU work is distributed across workers, so the burden each worker shoulders shrinks accordingly. disk_run_cost stays the same. The kernel's read-ahead already speeds up the disk side, so adding workers won't reduce I/O further; that assumption is baked in. The result row count is also re-pinned to the count one worker processes. With this adjustment, the parallel sequential scan candidate ends up with a different cost than the non-parallel one over the same table, so the two enter the comparison pool separately.
The second is the disabled node adjustment. As 1.4.1 covered, when an operator tries to turn off sequential scans with enable_seqscan = off, PG handles it by pinning disabled_nodes = 1 on the Path node. The dominance comparison rule looks at which side has fewer disabled_nodes before comparing cost. Older PG versions handled this differently. They added a large constant to the cost so that disabled nodes' costs became astronomical and lost out in comparison. But that inflated cost distorted comparisons among candidates elsewhere in the plan tree, so PG switched to using a separate count field. After computing the cost formula, cost_seqscan writes disabled_nodes = 0 if enable_seqscan is on, or disabled_nodes = 1 if it's off.
What this means in practice
First, don't attach units to the EXPLAIN cost number. Reading cost=0.00..205.00 as "this query takes 205ms" is wrong. The unit is arbitrary, and the number means something only in comparison with other candidates within the same EXPLAIN output. If candidate A has cost 200 and candidate B has cost 800, that tells you the planner picked A. It says nothing about absolute runtime. For actual execution time, look at actual time= in EXPLAIN ANALYZE.
Second, skipping ANALYZE breaks the cost model. pages and tuples are derived from pg_class stats, and those two variables directly determine disk_run_cost and cpu_run_cost. Stale stats lead to wrong density, which leads to unstable plans for the same SQL. That's why a fresh table needs ANALYZE after data load, and why ANALYZE should be re-run after bulk INSERT or DELETE. Autovacuum (the background PG facility that runs stats updates and dead tuple reclamation automatically) handles this most of the time but fires only after hitting a threshold, so before running large queries shortly after a load, running ANALYZE explicitly once is the safe move.
Third, on SSD, leaving random_page_cost at 4.0 makes index scans look unfairly expensive. The default reflects HDD-era assumptions where random reads cost 4x sequential reads. On SSDs, random cost is nearly identical to or only slightly more than sequential. Operators commonly tune random_page_cost = 1.1 to nudge the planner toward index scans. PG docs also recommend narrowing the gap between the two values for cache-resident workloads.
Fourth, the cost weight differs between narrow large tables and wide small tables. disk_run_cost scales with pages, while cpu_run_cost scales with tuples. A narrow table packed with many tuples per page has a high tuples / pages ratio, so the CPU share dominates. A table with wide rows fitting few tuples per page has the disk share dominate. The same row count can lead the planner to different candidate choices. As a monitoring signal, looking at pg_class.relpages and reltuples together, and being aware of their ratio, gives intuition for cost variability.
Top comments (0)