GROUP BY and window functions look declarative — the query says what it wants, and PostgreSQL figures out how to compute it. In practice the planner has strong opinions about how: whether to hash or sort, whether to parallelise, whether to spill memory to disk, whether a matching index changes the plan entirely. Learn to read what the planner picked and why, and aggregate-heavy queries become one of the easiest categories to tune.
This article is the fifth in the Complete Guide to PostgreSQL SQL Query Analysis & Optimization series. Every EXPLAIN block below is captured from a real run on the series' Neon Postgres 17.8 database (500,000-row sim_bp_orders and friends).
The two aggregate strategies
For GROUP BY, the planner chooses primarily between two implementations — plus some parallel and distinct variants layered on top.
HashAggregate builds a hash table keyed by the group-by columns; each incoming row probes the hash and either creates a new entry or updates an existing one's running aggregate state. Fast when the hash table fits in work_mem. Doesn't care about input order.
GroupAggregate requires input already sorted on the group-by columns. Each group's rows arrive contiguously, so the aggregate can emit a result row and clear its state between groups — constant memory regardless of group count. Picked when the input is already sorted (typically because the group-by matches an index order) or when the planner thinks the hash table won't fit.
The distinguishing signal in EXPLAIN is the node type itself: HashAggregate vs GroupAggregate. When you see Sort → GroupAggregate and no matching index, the planner has decided a sort + streaming aggregate is cheaper than trying to hash. In parallel plans you'll often see a composite shape — Partial HashAggregate inside each worker, topped by Finalize GroupAggregate on the leader — which is a parallel partial-aggregation pattern rather than "just a HashAggregate."
Here's that exact shape, from the classic dashboard query "how many orders in each status?":
SELECT status, count(*), avg(total_amount_cents)
FROM sim_bp_orders
GROUP BY status;
Finalize GroupAggregate (cost=8334.96..8336.27 rows=5 width=49)
(actual time=148.938..151.912 rows=5 loops=1)
Group Key: sim_bp_orders.status
Buffers: shared hit=3705
-> Gather Merge (actual time=148.924..151.895 rows=15 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (actual time=140.390..140.391 rows=5 loops=3)
Sort Key: sim_bp_orders.status
Sort Method: quicksort Memory: 25kB
-> Partial HashAggregate
(actual time=140.366..140.367 rows=5 loops=3)
Group Key: sim_bp_orders.status
Batches: 1 Memory Usage: 24kB
-> Parallel Seq Scan on sim_bp_orders
(actual time=0.006..32.097 rows=166667 loops=3)
Execution Time: 151.973 ms
152 ms. This is parallel partial aggregation: each parallel worker (plus the leader, making three process loops) computes a partial HashAggregate over its slice of the table (rows=166667 loops=3 ≈ 500k total), produces its five-row partial result, sorts those by status, and feeds them up to Gather Merge. The leader then finalises with Finalize GroupAggregate — combining the three sets of partial states into five final rows. Partial aggregation is the reason aggregate queries scale so well with parallel workers: only the partial group states (5 rows per worker here, 15 rows total) cross the worker-to-leader boundary, no matter how big the input was.
The Batches: 1 Memory Usage: 24kB on the Partial HashAggregate means the hash table fit in work_mem and didn't spill. Five groups with running sum/count state fit easily in 24 kB.
The aggregate spill — and how to diagnose it
Things get interesting when the number of groups grows. A HashAggregate spill on a 117k-group count looks like:
HashAggregate (actual time=347.354..392.138 rows=117060 loops=1)
Group Key: u.email
Planned Partitions: 4 Batches: 5 Memory Usage: 8241kB Disk Usage: 6920kB
The Disk Usage: 6920kB and Batches > 1 are the spill signals. PostgreSQL 13+ handles this gracefully — the executor detects that not all groups fit in memory, writes partial state to per-partition spill files, and processes them in additional passes — but the extra I/O is not free. On our database it cost roughly 40% of the query's total time.
Two fixes for HashAggregate spills:
- Raise `work_mem` per-session so the hash fits in memory. Set it per-role (`ALTER ROLE analytics SET work_mem = '64MB'`) rather than cluster-wide, because `work_mem` is allocated per sort/hash node per connection, so a cluster-wide raise multiplies by concurrency.
- Switch to Sort + GroupAggregate, which is cheaper than a spilling HashAggregate when the group-by column is indexed. Force it with `SET enable_hashagg = off;` as a diagnostic; if the Sort + GroupAggregate plan is faster, the underlying issue is "too many groups for the current `work_mem`." Usually the right answer is to raise `work_mem` for the session anyway, since Sort also uses `work_mem`.
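As a sketch, the diagnostic loop for a spilling HashAggregate looks like this. The `64MB` value is an assumption to scale to your group count, and the elided `SELECT` stands in for whatever query produced the spill:

```sql
-- Step 1: try more memory for the hash (assumed value; size to your group count)
SET work_mem = '64MB';
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;   -- re-run the spilling query here

-- Step 2: compare against the sorted strategy
SET enable_hashagg = off;
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;   -- same query, now as Sort + GroupAggregate

-- Never leave planner toggles set outside a diagnostic session
RESET enable_hashagg;
RESET work_mem;
```

Both `SET`s are session-local, so closing the connection undoes everything even if you forget the `RESET`s.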
The MyDBA analyzer rule temp_blocks_written fires when a node's Temp Written Blocks exceeds 100. That field is populated from JSON-format EXPLAIN output — MyDBA's visualiser runs the rules over JSON plans, not the text format pasted here — so the rule fires automatically on both HashAggregate spills and Sort spills when captured through the native integration.
Sort spills — the external merge
When a sort doesn't fit in work_mem, PostgreSQL falls back to an external merge sort: write sorted runs to disk, then merge them. You see this as Sort Method: external merge with a Disk: size in the sort node:
SELECT status,
percentile_cont(0.5) WITHIN GROUP (ORDER BY total_amount_cents) AS median,
percentile_cont(0.95) WITHIN GROUP (ORDER BY total_amount_cents) AS p95
FROM sim_bp_orders
GROUP BY status;
Percentiles are expensive because the implementation needs an ordered sample per group. PostgreSQL's percentile_cont evaluates as an ordered-set aggregate, which requires sorting the input per group:
GroupAggregate (actual time=202.589..358.067 rows=5 loops=1)
Group Key: status
Buffers: shared hit=3697, temp read=2707 written=2494
-> Sort (actual time=146.514..202.485 rows=500000 loops=1)
Sort Key: status
Sort Method: external merge Disk: 12048kB
Buffers: shared hit=3689, temp read=1506 written=1512
-> Seq Scan on sim_bp_orders (actual time=0.007..49.886 rows=500000 loops=1)
Execution Time: 358.230 ms
358 ms. The Sort spilled 12 MB of temp files. The GroupAggregate node above it shows its own temp read=2707 written=2494 — that's the ordered-set aggregate's internal tuplestore materialising per-group sorted input for the percentile computation, not a generic "every aggregate spills" phenomenon. Ordered-set aggregates like percentile_cont, percentile_disc, and mode() all force per-group materialisation; a simple count() or avg() on the same plan wouldn't produce that second temp-I/O figure. The MyDBA rule sort_on_disk fires on any Sort with Sort Space Type = Disk, which this plan has.
The right fix depends on the workload. For a one-off analytical report, raising work_mem to ~40 MB for that session turns the external merge into an in-memory quicksort. For a dashboard that runs this every minute, you want a materialised view:
CREATE MATERIALIZED VIEW order_amount_percentiles AS
SELECT status,
percentile_cont(0.5) WITHIN GROUP (ORDER BY total_amount_cents) AS median,
percentile_cont(0.95) WITHIN GROUP (ORDER BY total_amount_cents) AS p95
FROM sim_bp_orders
GROUP BY status;
-- Refresh on whatever schedule fits your freshness requirement:
REFRESH MATERIALIZED VIEW CONCURRENTLY order_amount_percentiles;
REFRESH MATERIALIZED VIEW CONCURRENTLY requires a unique index on the view, lets readers keep querying the existing contents while the refresh runs, and applies the new data atomically. The dashboard then queries the view instead of re-running the percentile calculation, and the 358 ms query becomes a 0.5 ms single-row scan.
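One detail worth spelling out: the unique index that CONCURRENTLY depends on. For this view the group key is status, so a single-column unique index suffices (the index name is illustrative):

```sql
-- CONCURRENTLY refuses to run without a unique index covering all rows of the view
CREATE UNIQUE INDEX order_amount_percentiles_status_idx
    ON order_amount_percentiles (status);
```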
Window functions
A window function produces an output row for every input row, but with access to a frame of related rows. The syntax:
agg_func(...) OVER (
PARTITION BY col1, col2 -- split input into independent groups
ORDER BY col3, col4 -- order within each partition
ROWS BETWEEN ... AND ... -- or RANGE BETWEEN, or GROUPS BETWEEN
)
The planner implements window functions via a WindowAgg node that consumes an input ordered appropriately and emits one output row per input. If the input isn't already ordered, the planner inserts a Sort before the WindowAgg — which is often where the cost lives.
Consider a common pattern: "the most recent order per user." The idiomatic window-function version ranks each user's orders and keeps rank 1:
SELECT user_id, order_id, created_at
FROM (
SELECT user_id, order_id, created_at,
row_number() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
FROM sim_bp_orders
) t
WHERE rn = 1
LIMIT 100;
The PostgreSQL 15+ optimisation for this is the WindowAgg Run Condition — the planner notices that WHERE rn = 1 can be pushed into the WindowAgg, so it can stop computing row numbers for each partition as soon as rn > 1:
Limit (actual time=0.093..0.525 rows=100 loops=1)
Buffers: shared hit=305
-> WindowAgg (actual time=0.092..0.518 rows=100 loops=1)
Run Condition: (row_number() OVER (?) <= 1)
-> Incremental Sort
Sort Key: user_id, created_at DESC
Presorted Key: user_id
Full-sort Groups: 9 Sort Method: quicksort
-> Index Scan using idx_sim_bp_orders_user_id on sim_bp_orders
(actual time=0.014..0.339 rows=302 loops=1)
Execution Time: 0.543 ms
0.54 ms. Two optimisations are visible:
- `Run Condition: (row_number() OVER (?) <= 1)` — the WindowAgg stops producing rows for a partition once `rn` exceeds 1, so only the first row per user is computed. This lets the plan short-circuit once `LIMIT 100` is satisfied after only 302 input rows (not the full 500k).
- `Incremental Sort` with `Presorted Key: user_id` — the input arrives already sorted by `user_id` (from `idx_sim_bp_orders_user_id`), and the WindowAgg needs it sorted by `(user_id, created_at DESC)`. An Incremental Sort only sorts within each `user_id` group rather than globally, which costs drastically less memory and allows pipelined execution.
Even so, a LATERAL join with LIMIT 1 inside is often simpler and at least as fast for "top-N per group" with small N.
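A sketch of that rewrite, assuming a `sim_bp_users` table with an `id` primary key (the users table name is an assumption; any row source of user ids works):

```sql
SELECT u.id AS user_id, latest.order_id, latest.created_at
FROM sim_bp_users u
CROSS JOIN LATERAL (
    SELECT o.order_id, o.created_at
    FROM sim_bp_orders o
    WHERE o.user_id = u.id
    ORDER BY o.created_at DESC
    LIMIT 1              -- top-1 per user; fast when (user_id, created_at) is indexed
) latest
LIMIT 100;
```

Each user's subquery is an index lookup plus `LIMIT 1`, so the plan never materialises more than one order per user.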
Frame specifications
Most window functions run under an implicit frame clause whose defaults trip people up. The rules:
- No `ORDER BY` clause → the frame defaults to `RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING` — the whole partition. This is what you want for `sum()` or `avg()` over an entire partition.
- `ORDER BY` clause present → the frame defaults to `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW` — the running total up to this row. This is what you want for running sums, but easy to get wrong.
- Ranking functions (`row_number()`, `rank()`, `dense_rank()`) — the frame is irrelevant because the function's result depends only on the ordering.
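The second rule is easy to demonstrate on the orders table: a per-user running total needs nothing beyond `ORDER BY`, because the default frame already ends at the current row:

```sql
-- Default frame with ORDER BY is RANGE UNBOUNDED PRECEDING .. CURRENT ROW,
-- i.e. a running total within each partition.
SELECT order_id, created_at,
       sum(total_amount_cents) OVER (
           PARTITION BY user_id
           ORDER BY created_at
       ) AS running_spend
FROM sim_bp_orders;
```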
A common mistake: computing a "running average over the last 7 rows" and getting a running average over all preceding rows because the frame clause was omitted. The fix is explicit:
avg(value) OVER (
ORDER BY timestamp
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
)
ROWS BETWEEN N PRECEDING AND CURRENT ROW is a physical window of N+1 rows. RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW is a logical window based on the ORDER BY value — useful when timestamps aren't evenly spaced. GROUPS BETWEEN N PRECEDING AND CURRENT ROW (PostgreSQL 11+) treats rows that tie on the ORDER BY value as a single "group" and counts those.
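Side by side, against an assumed metrics table (`ts timestamptz`, `value numeric` — both hypothetical names), the three frame types read like this:

```sql
SELECT ts,
       avg(value) OVER (ORDER BY ts
           ROWS   BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_last_7_rows,
       avg(value) OVER (ORDER BY ts
           RANGE  BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW) AS avg_last_7_days,
       avg(value) OVER (ORDER BY ts
           GROUPS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_last_7_groups
FROM metrics;  -- hypothetical table
```

The three columns agree only when rows arrive exactly one per day with no ties; the moment data is irregular, they diverge.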
LAG, LEAD, and first/last value
The navigation functions — lag(x, n), lead(x, n), first_value(x), last_value(x) — let you reference rows offset from the current one. Classic use: detect state transitions.
SELECT order_id, status,
lag(status) OVER (PARTITION BY user_id ORDER BY created_at) AS prev_status,
created_at
FROM sim_bp_orders
WHERE user_id = 42;
Each row gets the status of the user's previous order. The query can then be wrapped in a subquery or CTE to find "orders where the status changed":
WITH ordered AS (
SELECT order_id, status, created_at,
lag(status) OVER (PARTITION BY user_id ORDER BY created_at) AS prev_status
FROM sim_bp_orders
)
SELECT * FROM ordered WHERE prev_status IS DISTINCT FROM status;
Two performance notes. First, last_value() with a default frame is surprising — because the default frame ends at the current row, last_value() returns the current row's value, not the partition's last. To actually get the partition's last value, specify ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. Second, LAG and LEAD compile to very cheap operations (just a pointer to the previous/next row in the window), while first_value/last_value with an explicit full-partition frame can force materialisation.
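A minimal sketch of the last_value() fix on the orders table:

```sql
SELECT order_id, status,
       -- Without the explicit frame this would just echo the current row's status,
       -- because the default frame ends at CURRENT ROW.
       last_value(status) OVER (
           PARTITION BY user_id
           ORDER BY created_at
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS final_status
FROM sim_bp_orders;
```

An often-cheaper alternative is `first_value()` with the ordering reversed (`ORDER BY created_at DESC`), which works under the default frame.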
Aggregate-related indexes
An index on the GROUP BY columns is a straightforward win when it exists: the planner can use GroupAggregate over an index scan and skip the hash build entirely. The index has to cover the group key in exactly the right order — a composite index on (status, created_at) serves GROUP BY status, but a (created_at, status) doesn't.
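For the status roll-up earlier, the supporting index would be (index names are illustrative):

```sql
-- Leading column must match the GROUP BY key
CREATE INDEX idx_sim_bp_orders_status ON sim_bp_orders (status);

-- A composite also works, as long as status leads:
CREATE INDEX idx_sim_bp_orders_status_created
    ON sim_bp_orders (status, created_at);
```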
For queries that frequently aggregate a narrow window of a big table (WHERE created_at > ... GROUP BY user_id), a partial index or a materialised view of the aggregate result is usually the right answer: re-aggregating millions of rows on every request is a cost no planner optimisation can recover. Precomputation is the most robust performance tactic for aggregates.
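As a sketch, a partial index for that narrow-window aggregate might look like this. The cutoff date is an assumption, and the planner only uses the index when the query's WHERE clause implies the index predicate:

```sql
CREATE INDEX idx_sim_bp_orders_recent_user
    ON sim_bp_orders (user_id)
    WHERE created_at > DATE '2025-01-01';  -- assumed cutoff; must be implied by the query
```

A rolling window needs the index rebuilt periodically with a new cutoff, which is why the materialised-view route often wins for dashboards.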
Quick diagnostic checklist
When an aggregate query is slow:
- Is the aggregate node a `HashAggregate` with `Batches > 1` or `Disk Usage > 0`? The hash table spilled. Raise `work_mem` for the session, or create a supporting index to enable GroupAggregate instead.
- Is there a `Sort` above a `GroupAggregate` with `Sort Method: external merge`? The sort spilled. Same fix: more `work_mem`, or an index that provides pre-sorted input.
- Is there a `WindowAgg` over a `Sort` that processes all input before the `LIMIT`? Check whether a `Run Condition` is possible (PG 15+) or whether the query can be rewritten as `LATERAL` + `LIMIT N`.
- Is the aggregate running every time the dashboard loads? Move it behind a materialised view refreshed on a schedule. This is usually the biggest win of all.
- Does the MyDBA analyzer flag `sort_on_disk`, `hash_batches_spill`, or `temp_blocks_written`? These are the three rules that specifically target aggregate-related spills; if any fire, follow the inline suggestion.
Next steps
Aggregates interact closely with the shape of your WHERE clauses — a filter that narrows the input set before aggregation is almost always cheaper than aggregating and then filtering. The next article, WHERE Clause Optimisation, covers sargability and composite-index ordering in detail, with an eye toward getting predicates to apply as early in the plan as possible.
#postgres #performance #database #sql
Full article with the complete series: https://mydba.dev/blog/postgres-aggregate-window-tuning