1.5.4 Aggregation: HashAgg, GroupAgg, Sort-Based

#postgres #database #internals #executor

A scan node reads rows from a table one at a time, and a join node pairs up rows arriving from its two children. Both can take one input row and emit one result row right away. Aggregation breaks that pattern. In SELECT region, COUNT(*) FROM sales GROUP BY region, producing the single row for region = 'Seoul' requires counting every row that belongs to 'Seoul' first. A group's answer is only settled after all of that group's rows have been seen.

But the executor runs on a pull model that drags results up one row at a time. The top node asks for "the next row," that request travels down the tree, and a row climbs back up one level at a time. The "you have to see everything before you get an answer" nature of aggregation collides, at first glance, with "pull one row at a time." PostgreSQL resolves this in two ways. One is GroupAggregate, which sorts the input by the grouping key so that rows of the same group come out one after another. The other is HashAggregate, which skips sorting and accumulates per-group state in a hash table. The GroupAggregate and HashAggregate nodes you see in EXPLAIN for a GROUP BY query are exactly these two.

Aggregation builds its answer by accumulating state

Before looking at how the two methods differ, we need to see how a single group's answer is built, because both methods share this accumulation engine.

Each aggregate function carries one running state value, the transition value. It starts by setting that state to an initial condition, and every time a row in the group arrives, a transition function updates the state. Once the group's last row is processed, a final function extracts the final result from that state. A comment in the PostgreSQL source sums this up in three lines.

transvalue = initcond
foreach input_tuple do
    transvalue = transfunc(transvalue, input_value)
result = finalfunc(transvalue)

Take SUM(amount). The initial condition is 0, and the transition function is "add this row's amount to the running sum." As rows come in as 10, 20, 30, the state grows 0 → 10 → 30 → 60. SUM has no separate final function because the accumulated value is the answer. AVG, by contrast, accumulates both a sum and a count, and at the end its final function computes "sum divided by count" to produce the average. Whether or not a final function exists varies by aggregate, but the flow of carrying a state, updating it per row, and finishing at the end is the same for every aggregate.

One thing becomes clear here: aggregation has to push every row of a group through the transition function before that group's state is complete. So the key question becomes "how do you gather all the rows of the same group into one state value, missing none?" GroupAggregate and HashAggregate only differ in how they gather; the way they accumulate is identical.

The figure below shows, side by side, how the two methods handle the same input. The grouping key has three values a, b, c, and the number attached to each row is the value to SUM. Adding the numbers within the same key gives that group's result (a is 10 + 20 = 30).

GroupAggregate: catch the point where a group ends in sorted input

GroupAggregate starts from the assumption that the input is already sorted by the grouping key. When it is sorted, rows of the same group come out one after another without a break. If you group by region, all the 'Busan' rows come out, then the 'Seoul' rows follow, then the 'Incheon' rows.

Because they are grouped together this way, you can tell immediately where a group ends. Compare the grouping key of the previous row with the one just received: if they are equal, it is the same group, so add it to the state through the transition function; if they differ, that spot is a group boundary. A changed key means you have seen the previous group's last row, so at that moment you run the previous group's final function, complete one result row, and send it up to the parent. Then you reset the new group's state to the initial condition and start accumulating the new group from the row you just received.

This behavior meshes cleanly with the pull model. GroupAggregate does not gather the entire input and wait. Each time a group ends, it emits that group's result right away. When the parent asks for the first row, it pulls input only as far as the first group's end, returns one result row, and stops. If there is a LIMIT 3 above it, it produces only three groups and never touches the rest of the input. What it holds in memory is just the state value of the one group it is currently processing, so even with millions of groups, only one sits in memory at a time.

The catch is that the "input must be sorted" assumption is not free. If the input arrives already sorted from a grouping-key index, it splits groups with no extra cost; otherwise a Sort node is placed below GroupAggregate to sort the input first, and aggregation only starts after that. In EXPLAIN, whether a Sort hangs under GroupAggregate, or a sorted Index Scan feeds in directly, tells you whether this sorting cost was paid.

HashAggregate: gather groups in a hash table without sorting

HashAggregate drops the sorting assumption and uses a hash table instead. Feeding the grouping key into a hash function yields a fixed slot per key, and the same key always goes to the same slot. So no matter what order the input comes in, you hash a row's key, find that key's slot, and apply the transition function to the state sitting there. A 'Seoul' row, wherever it is scattered across the input, all gathers into the one 'Seoul' slot.

Not needing sorting is HashAggregate's strength. Since there is no need to line up the input in advance, it saves the entire sorting cost in situations where there is no suitable index and a sort would otherwise be required. It shines especially when the number of groups is small. With only a handful of regions, the hash table holds only a few slots, and you can read a large input quickly while just accumulating into those few slots.

In return, a constraint that GroupAggregate did not have appears. HashAggregate cannot emit a single result row until it has pulled in all of the input and completed the hash table. Until the last row arrives, no group can be sure "this is all of this group." So in the build phase it exhausts the input to the end and completes the state of every group, and only then does it scan the table and pull out group results one by one to send up. Unlike GroupAggregate, which emits a result every time a group ends, HashAggregate cannot emit even the first row before the input is fully read. Even with a LIMIT 3 above it, the build phase must read the entire input, so it cannot stop as early as the sort-based method.

When memory runs out: HashAggregate's disk spill

HashAggregate's other weakness is that the hash table takes up memory. When groups are few, only a handful of slots are needed, so it is light; but when the grouping key is nearly unique and groups balloon into the millions, the hash table presses on memory. The memory limit available here is work_mem multiplied by hash_mem_multiplier. work_mem is the amount of memory a single operation like sorting or hashing can use before spilling to disk, and hash_mem_multiplier is a factor that raises that limit several-fold for hash-based operations only. Since its default is 2.0, the hash table can use up to twice work_mem by default.

When this limit is exceeded, HashAggregate enters spill mode. Groups that already have a slot in the table keep accumulating in memory, while a row that would need to create a new group does not claim another slot in memory and is instead written aside to disk. When writing to disk, rows are split across several partitions by the hash value of the key. After reading through the input once and completing the in-memory groups, the partitions stacked on disk are pulled back in one at a time and aggregated the same way. If a partition is too large to fit in memory again, that partition is split into finer pieces.

Concretely, it looks like this. Suppose work_mem is 4MB and you run SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id over 10 million orders. If there are 2 million customers, the hash table needs 2 million slots, and the moment these slots exceed the memory limit (4MB times hash_mem_multiplier) during accumulation, it switches to spill mode. The customer slots that had entered the table up to that point keep adding their sums in memory, while rows of new customers that did not yet have a slot are pushed to disk partitions. After reading through all 10 million orders once, the sums of the customers remaining in memory are completed and emitted as results, and the customers pushed to disk are pulled back partition by partition and finished. Whether a given customer stays in memory or is pushed to disk is just a matter of whether its slot was created before or after the limit was crossed.

This disk split resembles the way HashJoin, seen earlier, splits into batches when the hash table overflows memory. Both follow the same principle of spilling to disk once the hash table exceeds the work_mem limit; they only differ in purpose, with the join matching pairs across both sides and the aggregate accumulating per-group state.

EXPLAIN ANALYZE shows in numbers whether this spill happened. A Batches value greater than 1 on the HashAggregate node means it was split and processed on disk, Memory Usage reports the amount of memory used, and Disk Usage reports how much was written to disk. A Batches of 1 signals that the hash table fit entirely in memory.

Disk spill is a feature introduced in PostgreSQL 13. Thanks to it, HashAggregate can be attempted even when groups are so numerous that the hash table does not fit entirely in memory.