1.4.7 Parallel Cost: When to Go Parallel and How Many Workers

#postgres #database #internals #planner

The planner weighs how to read a table, how to join, and how to aggregate by cost, and picks the cheapest candidate. Parallel query is one more decision layered on top of those candidates: whether one process handles a given scan or join alone, or several processes split the work, and if they split it, among how many. This decision is also made by the planner, not the executor. The planner costs a parallel candidate separately, puts it on the same scale as the serial one, and keeps whichever is cheaper.

So understanding parallelism comes down to three numbers. When does the planner consider parallel at all. If it does, how many workers does it assign. And how much does the cost drop once the work is split. These three numbers map directly onto the questions you hit in practice: "this table is huge, why isn't it going parallel," and "I added cores, why didn't it get that much faster." How the worker processes actually divide the pages and cooperate is the executor's job, covered in 1.5. Here we look at the numbers the planner uses to decide whether to go parallel in the first place.

Small tables are never even candidates

Parallelism is not free. Launching worker processes, setting up shared memory, and having the leader gather their results all carry a fixed cost. When a table is small, that fixed cost outweighs the gain from splitting the work, so reading it alone is actually faster. That is why the planner only considers parallelism once a table crosses a certain size.

That threshold is min_parallel_table_scan_size. The default is 8MB, which works out to 1024 pages (PG 18). The planner estimates how many pages a scan will read, and if that estimate falls below this value, it does not build a parallel path at all. This is usually the first reason you stare at an EXPLAIN plan and never see a Parallel Seq Scan. Index scans have their own threshold, min_parallel_index_scan_size (default 512KB, 64 pages). When a scan reads both the heap and an index, each threshold is checked on its own.

Write queries drop out of parallel consideration regardless of size. The SELECT inside an INSERT ... SELECT or an UPDATE never goes parallel, no matter how heavy that SELECT is. The parallel portion has to be strictly read-only, and the reason for that restriction is tangled up with how workers share state, which 1.5 covers. Here we just note the fact: a SELECT that comes with a write is not a candidate.

Worker count is set by table size

Once the planner decides to go parallel, how many workers does it launch? PostgreSQL sets this from the logarithm of the table size, with base 3 specifically. It starts at one worker and adds one more each time the page count reaches three times the threshold. Since the threshold itself starts at min_parallel_table_scan_size (8MB), the steps come out like this (PG 18 defaults).

8MB or more: 1 worker
24MB or more: 2 workers
72MB or more: 3 workers
216MB or more: 4 workers

Each time the table triples in size, the worker count goes up by one. It does not double as the size doubles; it grows much more slowly. A 1GB table does not get dozens of workers. The marginal gain from adding workers shrinks on larger tables, so the formula is conservative from the start.

There is one more cap on top of this: max_parallel_workers_per_gather, which defaults to 2. Even if the log formula yields 4, this cap of 2 trims the worker count to 2. So with no special configuration, scanning a large table normally gets at most two workers, and together with the leader, which pitches in directly, three processes split the same scan. To use more cores, you have to raise this value. Setting it to 0 turns parallelism off entirely.

Workers are not a per-query resource. The total number of parallel workers the whole server can run at once is bounded by another cap, max_parallel_workers (default 8), so when several queries go parallel at the same time, they share this pool. If the pool runs short, the planner may have planned for 2 workers but only 1 actually launches, or none do.

Parallelism divides only the CPU cost

Once you have decided to add workers and settled on a count, how is the parallel path's cost computed? The heart of it fits in one sentence. Parallelism divides only the CPU cost by the worker count; it does not divide the disk cost.

A scan cost breaks into two chunks: the cost of reading pages off disk, and the CPU cost of evaluating the WHERE conditions and processing each row that comes back. When going parallel, PostgreSQL divides only the CPU cost by the worker count. The disk cost stays at its full value.

There are two reasons the disk cost is not divided. First, no matter how many workers you attach, the total number of pages that must be read does not shrink. If a table is 10,000 pages, all 10,000 still have to come off disk no matter who reads them. Parallelism just lets several workers split those pages among themselves; it does not reduce the page count.

So would splitting the reads make the disk deliver pages faster? Counter to intuition, it does not. The disk is a single device the workers share, and there is a ceiling on how fast that device hands out pages. On top of that, when one process reads pages in order, the OS prefetches the next ones (read-ahead), so even a single reader already gets close to that device ceiling. Adding readers therefore does not make the disk deliver data N times faster. If anything, workers jumping between different regions can look like scattered access to the disk and cost a little.

PostgreSQL's cost model takes the conservative view here: it treats the disk cost as having nothing to amortize across workers and leaves it untouched. On fast SSDs parallel reads can give some benefit, but the model assumes that benefit is zero so as not to overvalue parallelism. So what parallelism actually cuts is the CPU side alone.

That said, the value the CPU cost is divided by is not exactly the worker count. The leader does not just launch workers and sit idle; it runs the same scan itself and produces rows, so the number of actual hands at work is the worker count plus the leader's share. The planner estimates that share as "the leader spends 30% of its time servicing each worker." So the divisor that actually scales the CPU cost comes out like this.

1 worker: 1 + (1 − 0.3) = 1.7
2 workers: 2 + (1 − 0.6) = 2.4
3 workers: 3 + (1 − 0.9) = 3.1
4 workers: 4 + 0 = 4.0

As workers pile up, the leader spends more time collecting their results and doing the post-processing on top, and less time on the scan itself. So the leader's share shrinks as workers grow, and by 4 workers it is treated as zero. This leader participation is toggled by parallel_leader_participation, which defaults to on.

So for parallelism to cut cost meaningfully, the CPU chunk of the scan has to be large. A scan that evaluates a heavy WHERE condition on every row, or runs a large aggregate, has a big CPU chunk, so splitting it pays off. A scan that just reads a lot of pages with light per-row work has most of its cost on the disk side, so attaching workers cuts little.

The fixed cost Gather adds

A parallel path does not only have costs that shrink; it also has costs that get added. The Gather node that collects the workers' results piles on two costs. One is the setup cost of launching workers and preparing shared memory; the other is the communication cost of handing each row a worker produces over to the leader.

The setup cost is parallel_setup_cost, which defaults to 1000. Recall that the cost of reading one page in earlier chapters was around 1, which makes 1000 a hefty fixed cost, on par with reading 1000 pages. The communication cost is parallel_tuple_cost, defaulting to 0.1 per row.

This setup cost is the decisive reason parallelism loses out on small queries. The fixed 1000 is laid down the same whether the table is small or large, but the CPU cost that gets cut is tiny when the table is small. On a small table the added 1000 outweighs the amount saved, so the parallel path ends up costing more than the serial one. The planner compares their total costs and keeps the cheaper one, so on a small query, even if a parallel candidate is built, it falls behind in the cost comparison. If the size threshold is the first gate, this cost comparison is the second gate above it.

When result order has to be preserved, GatherMerge is used instead of Gather. This happens when an ORDER BY means each worker's already-sorted output has to be merged without breaking the order. That merge carries the burden of continually comparing the front rows from each worker, so the planner charges GatherMerge 5% more for communication than Gather, plus the comparison cost of the merge sort itself. The price of keeping order is reflected directly in the cost.

Comparing serial and parallel cost in numbers

Let us put the rules so far onto actual numbers. A sequential scan cost is the sum of a disk term and a CPU term. The disk term multiplies the cost of reading one sequential page (seq_page_cost, default 1.0) by the page count, and the CPU term multiplies the cost of processing one tuple by the tuple count.

Take an accounts table with 1 million rows in 10,000 pages (about 80MB), scanned with WHERE balance > 9000. Since it is a simple comparison qual, the CPU cost per tuple is cpu_tuple_cost (0.01) plus the comparison cost cpu_operator_cost (0.0025), for 0.0125. The cost of the serial sequential scan comes out as:

disk_run_cost = 1.0 × 10000       = 10000
cpu_run_cost  = 0.0125 × 1000000  = 12500
total         ≈ 22500

Now attach 2 workers. 80MB clears the size threshold (8MB), so parallelism becomes a candidate; the log formula yields 3 workers but the default cap of 2 trims it. The divisor for 2 workers is 2.4. The disk term stays put and only the CPU term is divided by 2.4. On top of that, the Gather that collects the results adds setup of 1000 and a per-row communication cost. If 1000 rows pass the WHERE, the communication cost is parallel_tuple_cost 0.1 × 1000 = 100.

disk_run_cost = 1.0 × 10000       = 10000   (unchanged)
cpu_run_cost  = 12500 ÷ 2.4       ≈ 5208    (divided by workers)
Gather setup  = 1000
Gather comm   = 0.1 × 1000         = 100
total         ≈ 16308

22500 dropped to 16308, so the planner picks parallel. The roughly 6000 saved comes almost entirely from the CPU term being cut from 12500 to 5208. The disk term of 10000 holds equally on both sides, and the setup of 1000 is added only on the parallel side.

The numbers flip the other way when a table barely clears the size threshold. With small page and tuple counts, the CPU term is small too, so splitting it among workers only cuts a few hundred, while the setup of 1000 is still laid down in full. The parallel total then ends up larger, so the path clears the size gate but falls behind in the cost comparison. This is the setup cost acting as that second gate, expressed in numbers.

What this means in practice

First, when a large table won't go parallel, check the size threshold and the worker caps first. The check order is threefold. First, min_parallel_table_scan_size (default 8MB). If the table is smaller than this, the planner never even raises parallelism as a candidate. Next, max_parallel_workers_per_gather (default 2). If this is 0, parallelism is fully off, and to use more cores on an analytical query you raise it per session. Last, the overall pool, max_parallel_workers (default 8). In a high-concurrency environment where this pool runs short, the query runs with fewer workers than planned. When Workers Planned (the number the planner intended to launch) and Workers Launched (the number that actually started) differ in EXPLAIN ANALYZE, this pool shortage is exactly what happened.

Second, doubling the workers does not double the speed. Parallelism does not speed up in direct proportion to core count. As workers grow, the leader's share of collecting results and post-processing grows, so the effect of each additional worker tapers off, and on top of that there is the fixed setup cost. The planner's cost model itself reflects this diminishing return by shaving the leader's contribution as workers grow. So do not expect raising max_parallel_workers_per_gather to speed things up by that same ratio; it is better to vary the worker count and actually measure, finding the point where the effect flattens out.