1.5.5 Parallel Query: How Worker Processes Cooperate

#postgres #database #internals #executor

The executor we have seen so far is a single flow: one process pulls the plan tree from the top. An aggregate or join over a large table forces that one process to read every page from start to finish, so even when several CPU cores sit idle, only one of them works. Parallel query hands that work to several processes. But PostgreSQL's parallelism is not threads inside one process. It is separate processes cooperating. Since processes do not share memory, three structural challenges follow: distributing pages, sharing state, and forbidding writes.

Gather: Parallelism Happens in Only One Stretch of the Plan Tree

A parallel query does not run the whole plan tree in parallel. Somewhere in the tree sits a node called Gather, and only the subtree below that Gather is split across processes. Everything above Gather runs in a single process, as usual.

Gather does two things. It launches the worker processes, and it merges the rows those workers produce together with its own rows into a single output stream for its parent. The process that launches the workers is called the leader. The leader does not just sit and collect: it also runs the subtree below it and produces rows itself. So with two workers, three participants run the same plan, counting the leader.

This only works under one condition. The subtree below Gather must produce no duplicate rows even when the leader and the workers each run it independently. If three participants run the same subtree with no coordination, every row comes out three times and the result triples, which is wrong. The node that solves this is a 'parallel-aware' node. A parallel-aware scan node coordinates among the processes so that each one reads a different part of the table even though they run the same node. This is the node EXPLAIN shows as Parallel Seq Scan. A plain Seq Scan has one process read the whole table; a Parallel Seq Scan has several processes running the same node split the pages among themselves.

How Workers Split Pages Without Overlap

So three workers read the same table. How do they avoid reading the same page twice? The answer is surprisingly simple. There is one counter in shared memory, and they advance it atomically to divide the pages. (Atomic here means that even when several processes touch the same value at the same instant, the operation completes as one indivisible step, with no one slipping in halfway.)

A table lays out its pages in order, from page 0 to page N. The shared state of a parallel sequential scan holds a counter called phs_nallocated, which says how many pages have been handed out to workers so far. When a worker runs out of pages to read, it does a fetch-and-add on this counter. Fetch-and-add is an atomic operation that reads the counter's current value and adds to it in the same step, so even if several workers call it at the same moment, each one receives a different value. One worker gets "starting at 100," and the next automatically gets the range after it. Coordinating through this single counter means the workers never have to take locks and talk to each other.

There is one more piece to how pages get handed out. A worker does not take pages one at a time. It takes a run of consecutive pages at once, and that run is called a chunk. If a worker took only page 100, then knocked on the counter again and got page 250, its reads would scatter all over the table. From the OS's point of view the disk accesses would look sparse, and read-ahead (the OS prefetching the next pages from disk before they are asked for) would not kick in, because each backend is a separate process and the OS cannot recognize them as one sequential stream. So PG hands a worker a whole consecutive chunk, and only after the worker finishes that chunk does it knock on the counter again for the next one. Since a chunk is contiguous, reads within it are sequential and read-ahead stays alive.

Chunk size is not fixed; it is derived from the table size. PG splits the table into roughly 1024 to 2048 pieces and uses that as the base chunk size, capped at 8192 pages so it never grows too large. Then, as the scan nears the end, when only about 64 chunks remain, it halves the chunk size. This final touch matters. With large chunks, one worker can pick up the last big chunk late and keep reading alone for a long time while the others sit idle with nothing to do. Splitting the tail into smaller chunks spreads the remaining work evenly across the workers, so they all finish at about the same time.

How Workers See the Same Data as the Leader

Splitting the pages is settled. But there is a deeper problem. A worker is a separate process, so it starts fresh, inheriting none of the in-memory state the leader holds. The worker does not know how far the transaction has progressed, which snapshot it is viewing data through (a snapshot is the point in time a transaction sees its data from, and it determines which rows are visible and which are not), or what the GUC settings are. Left like this, a worker could see different data than the leader. For instance, inside the same transaction, a worker could take a different snapshot, so a row invisible to the leader becomes visible to the worker.

The snapshot, transaction IDs, and row visibility (which rows are visible to this transaction right now) that show up here are all covered in detail in Chapter 3, Transactions and MVCC.

PG solves this by copying state. Before launching workers, the leader allocates a piece of dynamic shared memory (a region of memory that several processes look into together, abbreviated DSM) and serializes into it the state a worker needs to behave just like the leader. As soon as a worker starts, it attaches to this DSM and restores that state as its own. The key pieces copied are:

The transaction snapshot and the active snapshot. This is the most direct mechanism for making a worker see data from the same point in time as the leader.
The current transaction's XID and the list of in-progress XIDs. This makes a tuple's visibility check return the same answer in the worker and the leader.
The combo CID mappings. This is the information needed to handle consistently the cases where a visibility decision gets complicated, such as a transaction that created a row and then deleted it.
All GUC values. So a worker runs the plan with the same work_mem and the same settings as the leader.
The authenticated user ID, the database, and the security context. The worker connects to the same database with the same privileges.

Once the copy is done, a worker runs the plan on top of the same transaction, the same snapshot, and the same settings as the leader. So no matter who reads which page, the visibility decisions match.

How do the result rows a worker produces get back to the leader? Through a queue set up inside the DSM. Each worker has its own shared-memory queue (a tuple queue): the worker writes the rows it produces into that queue, and the leader cycles through the queues pulling rows out. From the leader's side, it gathers its own rows and the rows pulled from the workers' queues into a single stream and sends them up.

These queues carry more than results; they carry errors too. When an error happens in a worker, the error message is placed on a dedicated queue and the worker signals the leader. At its next interrupt check, the leader reads that message and re-throws it as if it had raised the error itself. So an error that fired inside a worker still appears to the user as a single, ordinary query error.

Why Parallel Queries Are Read-Only

Parallel query carries one big restriction. While running below Gather, it cannot write to the database and cannot run DDL. The parallel stretch is strictly read-only. This is not an unfinished implementation or a temporary limit; it follows directly from the state-copying structure above.

At the heart of it is the combo CID. When a transaction deletes or updates, within the same transaction, a row it just created, the ordinary XID alone is not enough to decide that row's visibility, so an extra mapping called the combo CID is built. In parallel mode the leader copies its combo CID mapping to the workers at startup. The catch is that this copy happens once, at the start. If a worker were to write data mid-execution, a new combo CID would be created, and there would be no way to tell the leader or the other workers about it. The processes inside the same transaction would then diverge, each holding different visibility information. There is no way to prevent that divergence, so the choice was to forbid writes outright.

PG enforces this read-only rule in code as well. Before building a parallel context, it enters parallel mode and arms checks that block unsafe operations, then disarms them when the parallel work ends. While those checks are armed, attempting a dangerous operation such as a data write or a permanent GUC change fails with an error.

Because of this, the SELECT inside a data-modifying statement like INSERT ... SELECT or UPDATE gets no parallelism, no matter how heavy that SELECT is. The whole statement is a writing transaction.

Not every write is like that, though. CREATE TABLE AS, SELECT INTO, and CREATE MATERIALIZED VIEW are allowed to go parallel. The trick is that the part that runs in parallel is only the read. CREATE TABLE AS ... SELECT has the workers split the reading of its SELECT, while the leader alone loads the gathered results into the new table. The workers still only read, so the read-only rule is not broken. On top of that, the table the results go into is being created at that very moment, so no worker can see or touch it, and it collides with no one's visibility. That is why PG allows parallelism for these three commands.

Gather vs GatherMerge: Dropping Order or Keeping It

There are two kinds of nodes that collect workers' results: Gather and GatherMerge. The difference is how they handle result order.

Gather does not care about order. Whichever worker delivers a row first, that row goes out first, so the workers' results come out interleaved as they arrive. For a query that needs no ordering, such as a plain aggregate or a scan with only a filter, this is the lightest option.

GatherMerge preserves order. It is used when each worker delivers sorted results. Since each worker's output is sorted within itself, the leader merges while preserving the overall sort order: it compares only the front rows of each worker's queue and emits the smallest. To do this comparison efficiently it uses a heap internally (a data structure that quickly yields the smallest among many values). GatherMerge is chosen for an ORDER BY parallel query where each worker sorts separately and the results have to be combined without breaking that sort.

What This Means in Practice

For a job that precomputes a heavy aggregate, pulling the result into a new table and swapping it in by name preserves the parallel gain, rather than loading directly into an existing summary table. An INSERT INTO or UPDATE is a writing transaction and loses parallelism even for the SELECT inside it, whereas CREATE TABLE AS lets the workers split the reading part among themselves.