1.4.6 Aggregate Cost: Choosing HashAgg vs GroupAgg

#postgres #database #internals #planner

A query like SELECT region, COUNT(*) FROM sales GROUP BY region folds many rows together, collapsing each group into a single value. This folding of many rows into one is aggregation, and COUNT, SUM, AVG are the familiar examples. PostgreSQL handles aggregation in the execution plan with one of two nodes: HashAggregate or GroupAggregate. Both do the same job, grouped aggregation, but they go about it differently, and the planner picks one or the other to nail into the plan tree for a given query. When you see HashAggregate in one EXPLAIN and GroupAggregate in another, that is the result of this choice. What decides it? How the two nodes actually run (how they accumulate state, fill the hash table, and split to disk when memory runs out) is covered in Chapter 1.5. This section looks at the step before that: how the planner compares their costs and picks one.

The two strategies have the same total; they differ at startup

The function that computes aggregate cost is cost_agg. Look inside it and one surprising fact shows up: the sort-based strategy (GroupAggregate) and the hash-based one (HashAggregate) have the same total cost. A source comment states it plainly.

in this cost model, AGG_SORTED and AGG_HASHED have
exactly the same total CPU cost, but AGG_SORTED has lower startup cost.

That makes sense. Both strategies pass every input row through the aggregate function exactly once and emit one result row per group. The total amount of work is the same, so the total cost is the same. So what is the planner comparing? The answer is the rest of that comment: startup cost.

As we saw earlier, in EXPLAIN's cost=A..B, the first number A is startup (the cost paid before the first result row appears) and the second number B is total (the cost through the last row). What earlier sections called "cost" and used for comparison was mostly the total, the second number. The two aggregate strategies diverge at the first number, startup.

GroupAggregate works on sorted input, spots where one group ends, and emits that group's result right away. The first row comes out the instant the first group closes, so the cost to the first result is small. In the cost model, GroupAggregate's startup inherits the input's startup cost almost unchanged. HashAggregate, by contrast, cannot declare any group complete until it has pulled in the entire input and finished building the hash table, so it cannot emit a single row before then. The cost model therefore loads HashAggregate's startup with the cost of reading the input to the end, that is, the input's full total cost.

This difference shows up in real EXPLAIN numbers. Suppose we group a 100,000-row table by one column, and that column happens to have an index, so sorted input arrives for free. Running the same aggregate query both ways gives these costs (about 1,000 groups).

GroupAggregate  (cost=0.29..4138.17 rows=1001)
  ->  Index Only Scan ...  (cost=0.29..3628.16 rows=100000)
HashAggregate   (cost=1943.00..1953.01 rows=1001)
  ->  Seq Scan ...         (cost=0.00..1443.00 rows=100000)

Look at the first number, startup. GroupAggregate is 0.29, near zero; HashAggregate is 1943. The index hands over the sort for free, so GroupAggregate can emit a result the moment the first group closes, and its cost to the first result is near zero. HashAggregate has to pull in all 100,000 rows and finish the hash table before the first row appears, so the cost of reading the input to the end, 1943, becomes its startup outright.

So where do we see that the totals are equal? Take each node's total and subtract the total of the input scan below it, leaving just what the aggregate node added. For GroupAggregate, 4138.17 - 3628.16 = 510.01; for HashAggregate, 1953.01 - 1443.00 = 510.01. The cost the aggregation itself adds is identical down to the penny. The total numbers (4138 and 1953) drift apart not because aggregation costs more in one case, but because GroupAggregate chose a more expensive index scan (3628) over the sequential scan (1443) to get sorted input. Strip away the input and the aggregate work is the same; what differs is when that work is paid, which is startup.

This difference actually decides the choice when there is a LIMIT on top, or a parent node that needs only some of the results. For a query like GROUP BY ... LIMIT 10 that needs only the first few rows, a low-startup GroupAggregate can build just ten groups and stop without processing the whole input. HashAggregate cannot use this advantage: whatever the LIMIT, its build phase must read the input to the end. Even when the totals are equal, the lower-startup side becomes the cheaper candidate in situations like this.

GroupAggregate's startup is low only when the sort is free

But GroupAggregate's low startup hides a precondition: "sorted input." To find group boundaries by comparing adjacent rows, rows of the same group must arrive back to back, which means the input has to be sorted by the grouping key.

When the input already arrives sorted, this precondition is met for free. If the grouping key has an index and the read goes through it, for instance, the sorted order follows at no extra cost. In that case GroupAggregate's startup really is low.

The problem is when there is no sorted input. Then the planner places a Sort node under GroupAggregate to sort the input first. Sorting means reading the whole input and lining it up, so its entire cost goes into startup: the first group cannot be emitted until the sort finishes, and the sort cannot finish until the whole input is read. The moment a fresh sort is required, GroupAggregate's startup advantage vanishes. It becomes just like HashAggregate in having to read the input to the end before the first row, with the extra work of sorting on top.

Take the same query as the sorted case above, but now without an index, so it must sort first. The costs change like this.

GroupAggregate  (cost=9747.82..10507.83 rows=1001)
  ->  Sort       (cost=9747.82..9997.82 rows=100000)
        ->  Seq Scan ...  (cost=0.00..1443.00 rows=100000)
HashAggregate   (cost=1943.00..1953.01 rows=1001)

GroupAggregate's startup jumped from 0.29 to 9747. That is the Sort's cost 9747.82 carried up wholesale into startup. The sort raises not just the startup but the total (10507) too, putting it more than five times above HashAggregate's total (1953). Unlike the sorted case where the two strategies' aggregate costs matched, the gap opens here because GroupAggregate had to buy a sort it did not have before and take on its full cost. For this query the planner picks HashAggregate without hesitation.

So the first condition that decides the choice is "is the input already sorted by the grouping key?" If it is, GroupAggregate gets its low startup without paying for a sort, which is favorable; if a fresh sort is needed, HashAggregate becomes attractive by exactly that cost.

HashAggregate's risk is the number of groups

HashAggregate skips the sort but takes on a different cost. It has to hold per-group accumulators in a hash table, and as the number of groups grows, that table presses on memory.

The memory a hash table may use is work_mem multiplied by hash_mem_multiplier. work_mem is how much memory one operation like a sort or a hash can use before spilling to disk, and hash_mem_multiplier is a factor that raises that limit by a multiple for hash-based operations only (default 2.0). The cost model estimates the number of groups and gauges whether the hash table fits within this limit. If it fits, it finishes in memory with no spill; if it overflows, a spill cost for processing in disk-split chunks is added.

The key input here is the group-count estimate. The planner estimates how many groups will come out from the grouping key's statistics (how that estimate is made is covered in Section 1.4.8). For a GROUP BY region with only a few regions, the group count is small, the hash table is light, and HashAggregate comes out cheap. For a SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id with millions of customers, the group count rivals the input size, the hash table overflows the memory limit, and a disk spill cost attaches.

When a spill happens, that much cost is added to HashAggregate's total, because the overflowed rows have to be written to disk and read back later, creating disk I/O. That I/O cost is proportional to the amount moved to and from disk, and the cost model charges twice what a sort would for the same amount of disk. Even moving the same amount, HashAggregate's disk access pattern is considered less efficient than a sort's on real hardware. The background is a difference in how the two operations use disk. HashAggregate's spill scatters the overflowed rows across several partition files by the hash of the grouping key, then reads them back partition by partition; a sort, by contrast, reads its sorted runs back in order and merges them. Going back and forth across scattered files tends to be slower on disk than reading in order, so the same page count is charged more heavily for HashAggregate. That said, why it is exactly twice is not a value derived from theory but an empirical fudge factor, and the source comment offers no precise rationale, calling it only a "generic penalty." In the end, for a query where many groups make a spill likely, this doubled disk cost lifts HashAggregate's total and tips the planner's scale toward the sort-based side.

So the second condition is "how many groups are there?" Few groups favor HashAggregate by saving the sort; a group count rivaling the input attaches a spill cost and makes the sort-based side safer. Since the group-count estimate drives this judgment, a wrong estimate throws off the choice along with it.

The planner builds both and picks by cost

The planner does not fix one of the two in advance and then cost it. At the point where grouping is handled, it builds a sort-based path and a hash-based path separately (add_paths_to_grouping_rel). The sort-based one is a path with GroupAggregate placed on sorted input (or on top of a Sort), and the hash-based one is a HashAggregate path. Once both are registered as candidates, the standard cost comparison from Section 1.4.1 picks the cheaper one. The startup difference, sort cost, and spill cost seen so far all enter this comparison as numbers and decide the outcome.

But before costs are even compared, candidate eligibility can already differ. If an aggregate function is not in a form a hash can handle, HashAggregate never makes it onto the candidate list at all. An aggregate written with WITHIN GROUP (ORDER BY ...), like percentile_cont for a median (an ordered-set aggregate), needs the values within a group sorted to produce an answer, but a hash table only gathers a group's values together without sorting them. Such aggregates leave the sort-based strategy as the only candidate.

Turning enable_hashagg off removes HashAggregate from the candidates. The interesting part is that, as we will see in Chapter 1.5, even after HashAggregate gained the ability to spill to disk, the planner treats this switch not as a soft condition ("use it only when it fits in memory") but as a hard off-switch. In memory or spilling to disk, off means off.

What this means in practice

First, when the aggregate strategy looks wrong, suspect statistics before indexes. The planner picks between HashAggregate and GroupAggregate on a single group-count estimate. If that estimate is far off, say it expects few groups when there are really millions, it may pick HashAggregate only to watch the hash table overflow memory and spill to disk. If an aggregate query suddenly slows down, compare the estimated row count (rows=) in EXPLAIN against the actual count from EXPLAIN ANALYZE, and if they diverge sharply, run ANALYZE to refresh the statistics so the planner can see the group count correctly and reconsider the strategy.

Second, one index on a frequently grouped column widens the choice of aggregate strategy. GroupAggregate's low startup holds only when the input arrives sorted by the grouping key. With an index on that column, the planner can get sorted input without a fresh sort and hold GroupAggregate as a cheap candidate. That sort is also reused by an ORDER BY above, so a query like GROUP BY region ORDER BY region finishes grouping and ordering in one go off a single index sort. Knowing which columns you frequently group and order by lets you bake into the design that an index on that column acts directly on aggregate cost.

Third, work_mem affects both which strategy the aggregate picks and how fast it finishes. The cost model charges HashAggregate's spill cost based on whether the hash table fits in work_mem (times hash_mem_multiplier). When this value is small, a spill cost is charged heavily even at the same group count, and the planner shies away from HashAggregate; when it is large enough, the planner assumes it finishes in memory without a spill and picks HashAggregate. If a heavy aggregate query is slow from spilling to disk, you can raise work_mem for that session. But work_mem is memory taken per connection, and per operation within a plan, so raising it globally by a lot is risky when many connections are active. Raising it per session for heavy aggregate queries only is the safe approach.