Philip McClarence

PostgreSQL Aggregate and Window Function Tuning

GROUP BY and window functions look declarative — the query says what it wants, and PostgreSQL figures out how to compute it. In practice the planner has strong opinions about how: whether to hash or sort, whether to parallelise, whether to spill memory to disk, whether a matching index changes the plan entirely. Learn to read what the planner picked and why, and aggregate-heavy queries become one of the easiest categories to tune.

This article is the fifth in the Complete Guide to PostgreSQL SQL Query Analysis & Optimization series. Every EXPLAIN block below is captured from a real run on the series' Neon Postgres 17.8 database (500,000-row sim_bp_orders and friends).

The two aggregate strategies

For GROUP BY, the planner chooses primarily between two implementations — plus some parallel and distinct variants layered on top.

HashAggregate builds a hash table keyed by the group-by columns; each incoming row probes the hash and either creates a new entry or updates an existing one's running aggregate state. Fast when the hash table fits in work_mem. Doesn't care about input order.

GroupAggregate requires input already sorted on the group-by columns. Each group's rows arrive contiguously, so the aggregate can emit a result row and clear its state between groups — constant memory regardless of group count. Picked when the input is already sorted (typically because the group-by matches an index order) or when the planner thinks the hash table won't fit.

The distinguishing signal in EXPLAIN is the node type itself: HashAggregate vs GroupAggregate. When you see Sort → GroupAggregate and no matching index, the planner has decided a sort + streaming aggregate is cheaper than trying to hash. In parallel plans you'll often see a composite shape — Partial HashAggregate inside each worker, topped by Finalize GroupAggregate on the leader — which is a parallel partial-aggregation pattern rather than "just a HashAggregate."

Here's that exact shape, from the classic dashboard query "how many orders in each status?":

SELECT status, count(*), avg(total_amount_cents)
FROM sim_bp_orders
GROUP BY status;
Finalize GroupAggregate  (cost=8334.96..8336.27 rows=5 width=49)
    (actual time=148.938..151.912 rows=5 loops=1)
  Group Key: sim_bp_orders.status
  Buffers: shared hit=3705
  ->  Gather Merge  (actual time=148.924..151.895 rows=15 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Sort  (actual time=140.390..140.391 rows=5 loops=3)
              Sort Key: sim_bp_orders.status
              Sort Method: quicksort  Memory: 25kB
              ->  Partial HashAggregate
                    (actual time=140.366..140.367 rows=5 loops=3)
                    Group Key: sim_bp_orders.status
                    Batches: 1  Memory Usage: 24kB
                    ->  Parallel Seq Scan on sim_bp_orders
                          (actual time=0.006..32.097 rows=166667 loops=3)
 Execution Time: 151.973 ms

152 ms. This is parallel partial aggregation: each parallel worker (plus the leader, making three process loops) computes a partial HashAggregate over its slice of the table (rows=166667 loops=3 ≈ 500k total), produces its five-row partial result, sorts those by status, and feeds them up to Gather Merge. The leader then finalises with Finalize GroupAggregate — combining the three sets of partial states into five final rows. Partial aggregation is the reason aggregate queries scale so well with parallel workers: only the partial group states (5 rows per worker here, 15 rows total) cross the worker-to-leader boundary, no matter how big the input was.

The Batches: 1  Memory Usage: 24kB on the Partial HashAggregate means the hash table fit in work_mem and didn't spill. Five groups with running sum/count state fit easily in 24 kB.

The aggregate spill — and how to diagnose it

Things get interesting when the number of groups grows. A HashAggregate spill on a 117k-group count looks like:

HashAggregate  (actual time=347.354..392.138 rows=117060 loops=1)
  Group Key: u.email
  Planned Partitions: 4  Batches: 5  Memory Usage: 8241kB  Disk Usage: 6920kB

The Disk Usage: 6920kB and Batches > 1 are the spill signals. PostgreSQL 13+ handles this gracefully — the executor detects that not all groups fit in memory, writes partial state to per-partition spill files, and processes them in additional passes — but the extra I/O is not free. On our database it cost roughly 40% of the query's total time.

Two fixes for HashAggregate spills:

  1. Raise work_mem per-session so the hash fits in memory. Set per-role (ALTER ROLE analytics SET work_mem = '64MB') rather than cluster-wide, because work_mem is allocated per sort/hash node per connection and a cluster-wide raise multiplies by concurrency.
  2. Sort + GroupAggregate is cheaper than a spilling HashAggregate when the group-by column is indexed. Force it with SET enable_hashagg = off; as a diagnostic, and if the Sort + GroupAggregate plan is faster, the underlying issue is "too many groups for current work_mem." Usually the right answer is to raise work_mem for the session anyway, since Sort also uses work_mem. Both fixes are sketched below.
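
A minimal diagnostic session, sketched against a stand-in query (GROUP BY user_id produces enough groups to stress the hash table; the 64MB value is illustrative, not tuned):

-- Session-scoped: both settings revert when the connection closes
SET work_mem = '64MB';                  -- fix 1: give the hash table room
EXPLAIN (ANALYZE, BUFFERS)
SELECT user_id, count(*), sum(total_amount_cents)
FROM sim_bp_orders
GROUP BY user_id;

SET enable_hashagg = off;               -- fix 2 (diagnostic only): force Sort + GroupAggregate
EXPLAIN (ANALYZE, BUFFERS)
SELECT user_id, count(*), sum(total_amount_cents)
FROM sim_bp_orders
GROUP BY user_id;

RESET enable_hashagg;
RESET work_mem;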

The MyDBA analyzer rule temp_blocks_written fires when a node's Temp Written Blocks exceeds 100. That field is populated from JSON-format EXPLAIN output — MyDBA's visualiser runs the rules over JSON plans, not the text format pasted here — so the rule fires automatically on both HashAggregate spills and Sort spills when captured through the native integration.

Sort spills — the external merge

When a sort doesn't fit in work_mem, PostgreSQL falls back to an external merge sort: write sorted runs to disk, then merge them. You see this as Sort Method: external merge with a Disk: size in the sort node:

SELECT status,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY total_amount_cents) AS median,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY total_amount_cents) AS p95
FROM sim_bp_orders
GROUP BY status;

Percentiles are expensive because the implementation needs the full ordered input per group. PostgreSQL's percentile_cont evaluates as an ordered-set aggregate, which requires sorting the input per group:

GroupAggregate  (actual time=202.589..358.067 rows=5 loops=1)
  Group Key: status
  Buffers: shared hit=3697, temp read=2707 written=2494
  ->  Sort  (actual time=146.514..202.485 rows=500000 loops=1)
        Sort Key: status
        Sort Method: external merge  Disk: 12048kB
        Buffers: shared hit=3689, temp read=1506 written=1512
        ->  Seq Scan on sim_bp_orders  (actual time=0.007..49.886 rows=500000 loops=1)
 Execution Time: 358.230 ms

358 ms. The Sort spilled 12 MB of temp files. The GroupAggregate node above it shows its own temp read=2707 written=2494 — that's the ordered-set aggregate's internal tuplestore materialising per-group sorted input for the percentile computation, not a generic "every aggregate spills" phenomenon. Ordered-set aggregates like percentile_cont, percentile_disc, and mode() all force per-group materialisation; a simple count() or avg() on the same plan wouldn't produce that second temp-I/O figure. The MyDBA rule sort_on_disk fires on any Sort with Sort Space Type = Disk, which this plan has.

The right fix depends on the workload. For a one-off analytical report, raising work_mem to ~40 MB for that session turns the external merge into an in-memory quicksort. For a dashboard that runs this every minute, you want a materialised view:

CREATE MATERIALIZED VIEW order_amount_percentiles AS
SELECT status,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY total_amount_cents) AS median,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY total_amount_cents) AS p95
FROM sim_bp_orders
GROUP BY status;

-- Refresh on whatever schedule fits your freshness requirement:
REFRESH MATERIALIZED VIEW CONCURRENTLY order_amount_percentiles;

REFRESH MATERIALIZED VIEW CONCURRENTLY requires a unique index on the view and updates the contents without locking out readers: concurrent SELECTs keep seeing the old data until the refresh commits. The dashboard then queries the view instead of re-running the percentile calculation, and the 358 ms query becomes a 0.5 ms single-row scan.
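
Satisfying the unique-index requirement is trivial here, since status is the grouping key:

-- CONCURRENTLY needs a unique index that covers all rows of the view
CREATE UNIQUE INDEX ON order_amount_percentiles (status);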

Window functions

A window function produces an output row for every input row, but with access to a frame of related rows. The syntax:

agg_func(...) OVER (
    PARTITION BY col1, col2     -- split input into independent groups
    ORDER BY col3, col4          -- order within each partition
    ROWS BETWEEN ... AND ...     -- or RANGE BETWEEN, or GROUPS BETWEEN
)

The planner implements window functions via a WindowAgg node that consumes an input ordered appropriately and emits one output row per input. If the input isn't already ordered, the planner inserts a Sort before the WindowAgg — which is often where the cost lives.

Consider a common pattern: "the most recent order per user." The long-standing way to write it is row_number() in a subquery with an outer filter on rn = 1:

SELECT user_id, order_id, created_at
FROM (
    SELECT user_id, order_id, created_at,
           row_number() OVER (PARTITION BY user_id ORDER BY created_at DESC) AS rn
    FROM sim_bp_orders
) t
WHERE rn = 1
LIMIT 100;

The PostgreSQL 15+ optimisation for this is the WindowAgg Run Condition — the planner notices that WHERE rn = 1 can be pushed into the WindowAgg, so it can stop computing row numbers for each partition as soon as rn > 1:

Limit  (actual time=0.093..0.525 rows=100 loops=1)
  Buffers: shared hit=305
  ->  WindowAgg  (actual time=0.092..0.518 rows=100 loops=1)
        Run Condition: (row_number() OVER (?) <= 1)
        ->  Incremental Sort
              Sort Key: user_id, created_at DESC
              Presorted Key: user_id
              Full-sort Groups: 9  Sort Method: quicksort
              ->  Index Scan using idx_sim_bp_orders_user_id on sim_bp_orders
                    (actual time=0.014..0.339 rows=302 loops=1)
 Execution Time: 0.543 ms

0.54 ms. Two optimisations are visible:

  1. Run Condition: (row_number() OVER (?) <= 1) — the WindowAgg stops producing rows for a partition once rn exceeds 1, so only the first row per user is computed. This lets the plan short-circuit once LIMIT 100 is satisfied after only 302 input rows (not the full 500k).
  2. Incremental Sort with Presorted Key: user_id — the input arrives already sorted by user_id (from idx_sim_bp_orders_user_id), and the WindowAgg needs it sorted by (user_id, created_at DESC). An Incremental Sort only sorts within each user_id group rather than globally, which costs drastically less memory and allows pipelined execution.

Even so, a LATERAL join with LIMIT 1 inside is often simpler and at least as fast for "top-N per group" with small N.
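
A sketch of that rewrite, assuming the users table is named sim_bp_users with an id primary key (neither name appears in the plans above):

SELECT u.id AS user_id, o.order_id, o.created_at
FROM sim_bp_users u
CROSS JOIN LATERAL (
    SELECT order_id, created_at
    FROM sim_bp_orders
    WHERE user_id = u.id
    ORDER BY created_at DESC
    LIMIT 1                             -- most recent order; raise for top-N
) o
LIMIT 100;

With an index on (user_id, created_at DESC) each inner lookup stops after a single row; with only idx_sim_bp_orders_user_id it still has to sort that user's orders first.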

Frame specifications

Every window function call has a frame, and the implicit defaults trip people up. The rules:

  • No ORDER BY clause → the frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING — the whole partition. This is what you want for sum() or avg() over an entire partition.
  • ORDER BY clause present → the frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW — the running total up to this row. This is what you want for running sums (see the example after this list), but easy to get wrong.
  • Ranking functions (row_number(), rank(), dense_rank()) — the frame is irrelevant because the function's result only depends on the ordering.
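
As an illustration of the second default, a per-user running total needs nothing beyond the ORDER BY inside the OVER clause:

SELECT order_id, user_id, created_at,
       sum(total_amount_cents) OVER (
           PARTITION BY user_id
           ORDER BY created_at          -- default frame: UNBOUNDED PRECEDING .. CURRENT ROW
       ) AS running_total_cents
FROM sim_bp_orders;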

A common mistake: computing a "running average over the last 7 rows" and getting a running average over all preceding rows because the frame clause was omitted. The fix is explicit:

avg(value) OVER (
    ORDER BY timestamp
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
)

ROWS BETWEEN N PRECEDING AND CURRENT ROW is a physical window of N+1 rows. RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW is a logical window based on the ORDER BY value (it requires exactly one ORDER BY column) and is useful when timestamps aren't evenly spaced. GROUPS BETWEEN N PRECEDING AND CURRENT ROW (PostgreSQL 11+) counts peer groups — sets of tied rows — rather than individual rows.
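
The logical 7-day version of the same moving average, reusing the hypothetical value and timestamp columns from the snippet above:

avg(value) OVER (
    ORDER BY timestamp
    RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW
)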

LAG, LEAD, and first/last value

The navigation functions — lag(x, n), lead(x, n), first_value(x), last_value(x) — let you reference rows offset from the current one. Classic use: detect state transitions.

SELECT order_id, status,
       lag(status) OVER (PARTITION BY user_id ORDER BY created_at) AS prev_status,
       created_at
FROM sim_bp_orders
WHERE user_id = 42;

Each row gets the status of the user's previous order. The query can then be wrapped in a subquery or CTE to find "orders where the status changed":

WITH ordered AS (
    SELECT order_id, status, created_at,
           lag(status) OVER (PARTITION BY user_id ORDER BY created_at) AS prev_status
    FROM sim_bp_orders
)
SELECT * FROM ordered WHERE prev_status IS DISTINCT FROM status;

Two performance notes. First, last_value() with a default frame is surprising — because the default frame ends at the current row, last_value() returns the current row's value, not the partition's last. To actually get the partition's last value, specify ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. Second, LAG and LEAD compile to very cheap operations (just a pointer to the previous/next row in the window), while first_value/last_value with an explicit full-partition frame can force materialisation.
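
A sketch of the corrected frame, pulling each user's latest order status onto every row of that user's history:

last_value(status) OVER (
    PARTITION BY user_id
    ORDER BY created_at
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS latest_status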

Aggregate-related indexes

An index on the GROUP BY columns is a straightforward win when it exists: the planner can use GroupAggregate over an index scan and skip the hash build entirely. The index has to cover the group key in exactly the right order — a composite index on (status, created_at) serves GROUP BY status, but a (created_at, status) doesn't.
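
A sketch of the supporting index (the name is illustrative; on a low-cardinality column like status the planner may still prefer the parallel seq-scan plan shown earlier, so verify with EXPLAIN):

CREATE INDEX idx_sim_bp_orders_status ON sim_bp_orders (status);

EXPLAIN (ANALYZE, BUFFERS)
SELECT status, count(*) FROM sim_bp_orders GROUP BY status;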

For queries that frequently aggregate a narrow window of a big table (WHERE created_at > ... GROUP BY user_id), a partial index or a materialised view of the aggregate result is usually the right answer: no planner optimisation makes re-aggregating millions of rows on every request cheap. Precomputation is the most robust performance tactic for aggregates.
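
A partial-index sketch with a hypothetical cutoff date; the query's WHERE clause has to provably imply the index predicate for the planner to use it:

-- Hypothetical cutoff: the dashboard only aggregates recent orders
CREATE INDEX idx_sim_bp_orders_recent_user
    ON sim_bp_orders (user_id)
    WHERE created_at > DATE '2025-01-01';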

Quick diagnostic checklist

When an aggregate query is slow:

  1. Is the aggregate node a HashAggregate with Batches > 1 or Disk Usage > 0? The hash table spilled. Raise work_mem for the session, or create a supporting index to enable GroupAggregate instead.
  2. Is there a Sort above a GroupAggregate with Sort Method: external merge? The sort spilled. Same fix: more work_mem, or an index that provides pre-sorted input.
  3. Is there a WindowAgg over a Sort that processes all input before the LIMIT? Check if a Run Condition is possible (PG 15+) or if the problem can be rewritten as LATERAL + LIMIT N.
  4. Is the aggregate running every time the dashboard loads? Move it behind a materialised view refreshed on schedule. This is usually the biggest win of all.
  5. Does the MyDBA analyzer flag sort_on_disk, hash_batches_spill, or temp_blocks_written? These are the three rules that specifically target aggregate-related spills; if any fire, follow the suggestion inline.

Next steps

Aggregates interact closely with the shape of your WHERE clauses — a filter that narrows the input set before aggregation is almost always cheaper than aggregating and then filtering. The next article, WHERE Clause Optimisation, covers sargability and composite-index ordering in detail, with an eye toward getting predicates to apply as early in the plan as possible.


#postgres #performance #database #sql

Full article with the complete series: https://mydba.dev/blog/postgres-aggregate-window-tuning
